|
02-14-2004, 08:27 AM | #1 |
Orange Mole
Join Date: Feb 2004
Posts: 47
|
pdftotext with phpdig does not work
hello board,
PhpDig works great for HTML and PHP files, but PDF files don't work. I tried on several of our machines (Debian/Red Hat, PHP 4.2.2/4.2.3). pdftotext works fine from bash. When it is called from PhpDig, only one or two files are opened, and only partial content ends up in the temp and text files. Any ideas, anyone? tomas |
02-14-2004, 01:20 PM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. In the config file set the following, and make sure that the directories leading to pdftotext, as well as the pdftotext binary itself, have 755 permissions.
PHP Code:
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
02-14-2004, 01:44 PM | #3 |
Orange Mole
Join Date: Feb 2004
Posts: 47
|
hello charter,
Thanks for the quick response. I checked all of those points, but all the files are still empty. If I set define('PHPDIG_PDF_EXTENSION',''); I can see the temp files, and they are empty too. |
02-14-2004, 01:49 PM | #4 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. What version of PhpDig are you using?
02-14-2004, 01:57 PM | #5 |
Orange Mole
Join Date: Feb 2004
Posts: 47
|
1.80
And the files aren't empty: they contain only a single page break. I tried a lot of different PDFs and a lot of settings in define('PHPDIG_OPTION_PDF',''); (-q, -nopgbrk, empty), but nothing works. |
02-14-2004, 02:27 PM | #6 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. A similar problem with PHP 4.2.2 is described in this post. Not sure if it is related to your problem. What do you get onscreen when you add the code in this post?
02-14-2004, 02:48 PM | #7 |
Orange Mole
Join Date: Feb 2004
Posts: 47
|
3:http://192.168.1.240/mysite/pdf/02.pdf
(time : 00:00:21)
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /var/www/html/mysite/phpdig/pdftotext/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Is result test http an array: 1
What is result test http status: PDF |
02-14-2004, 03:10 PM | #8 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. That all looks fine. In robot_functions is the following line:
PHP Code:
PHP Code:
02-14-2004, 03:16 PM | #9 |
Orange Mole
Join Date: Feb 2004
Posts: 47
|
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /var/www/html/mysite/phpdig/pdftotext/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Result contains: Array ( )
Return value is: 0
Is result test http an array: 1
What is result test http status: PDF |
02-14-2004, 03:21 PM | #10 |
Orange Mole
Join Date: Feb 2004
Posts: 47
|
charter, by the way:
how can I return a little favour for your friendly work here and for the PhpDig project? |
02-14-2004, 03:52 PM | #11 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. The following means that the exec command is succeeding:
Return value is: 0
However, the following means that the output from the exec command has no content:
Result contains: Array ( )
pdftotext version 1.01 lists the following bug:
Some PDF files contain fonts whose encodings have been mangled beyond recognition. There is no way (short of OCR) to extract text from these files.
As you are able to run pdftotext from bash, I don't think this is the problem. I would say that PHP is having a problem trying to exec pdftotext from the script. Perhaps try upgrading to the latest stable version of PHP, or try a different converter.
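For readers following along, the two debug values discussed here correspond to the second and third arguments of PHP's exec(). The sketch below uses a hypothetical helper (run_and_report is not a PhpDig function) to show how a command can finish with status 0 and still leave the output array empty, which is the combination reported above.

```php
<?php
// Hypothetical helper, not part of PhpDig: run a shell command and
// return its exit status together with the captured stdout lines.
function run_and_report($command)
{
    $lines = array();   // exec() appends one element per line of stdout
    $status = -1;       // overwritten by exec() with the command's exit status
    exec($command, $lines, $status);

    return array(
        'status' => $status,
        'lines'  => $lines,
        'empty'  => count($lines) === 0,  // true when the result is "Array ( )"
    );
}
```

Called with something like escapeshellarg($pdftotext) . ' ' . escapeshellarg($pdf) . ' -' (both variables are placeholders; the trailing - asks pdftotext to write to stdout), checking status and empty separately would distinguish "the converter failed" from "the converter ran but produced no text".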
02-15-2004, 07:59 AM | #12 |
Orange Mole
Join Date: Feb 2004
Posts: 47
|
hello charter,
OK, I tried it on another server (Fedora Core 1, PHP 4.3.3), and grabbing PDF files now works fine. The conclusion: PDF indexing with PHP 4.2.2/4.2.3 does not work! Thanks a lot, tomas |
02-25-2004, 08:20 AM | #13 | |
Orange Mole
Join Date: Sep 2003
Posts: 40
|
Quote:
PHP 4.2.2 incorrectly handles binary files when the function file($remote_url) is used. That function is used in robot_functions.php during indexing. I posted a patch here |
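The patch itself is not reproduced in this thread. Purely as an illustration of the general idea, one binary-safe alternative to file() is to open the source in 'rb' mode and read it in chunks. The helper name read_binary is hypothetical, and this sketch is not alivin's patch.

```php
<?php
// Hypothetical helper, NOT the patch referred to above: read a file or URL
// as raw bytes instead of splitting it into lines with file().
function read_binary($source)
{
    $handle = fopen($source, 'rb');   // 'b' keeps the bytes untranslated
    if ($handle === false) {
        return false;                 // source could not be opened
    }

    $data = '';
    while (!feof($handle)) {
        $chunk = fread($handle, 8192); // read up to 8 KB at a time
        if ($chunk === false) {
            break;                     // stop on a read error
        }
        $data .= $chunk;
    }
    fclose($handle);

    return $data;
}
```

The returned string could then be written to a temp file for the converter to process, without depending on how file() treats line endings or null bytes.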
|
02-25-2004, 01:13 PM | #14 |
Orange Mole
Join Date: Feb 2004
Posts: 47
|
hello alivin,
Great job :-) PDF digging now works fine even with PHP 4.2.x. In my opinion the file() function also has a bug in PHP 4.3.x: when digging larger PDFs, php.ini had to be overridden with ini_set('memory_limit', '64M'); with your workaround there are no more memory problems. Thanks again for posting back to this thread. Maybe these little ideas are helpful for you:
http://www.phpdig.net/showthread.php?s=&threadid=500
http://www.phpdig.net/showthread.php...=2338#post2338
Kind regards from monaco di bavaria, tomas |
02-25-2004, 02:45 PM | #15 |
Orange Mole
Join Date: Feb 2004
Posts: 47
|
hi alivin,
The memory issue does not change, even with your workaround: I had tested with the wrong setting in php.ini. So if anybody has a problem spidering large PDFs, especially ones with large vector graphics in them, override php.ini this way: in spider.php, add this as the first line: ini_set('memory_limit', '64M'); Anyway, your bugfix works great :-) Regards, tomas Last edited by tomas; 02-25-2004 at 03:27 PM. |
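A minimal sketch of the override described above, with both arguments quoted as strings (a bare memory_limit would be treated as an undefined constant). The 64M value is the one from this thread; the check on the return value is an addition for illustration only.

```php
<?php
// Set the memory limit for this script run only; php.ini is not modified.
// ini_set() returns the previous value on success and false on failure.
$previous = ini_set('memory_limit', '64M');

if ($previous === false) {
    // Assumption: some hosts forbid changing this setting at runtime.
    echo "Could not change memory_limit\n";
}
```

Because the setting only lasts for the current request, it has to run before the spider starts loading large files, which is why the post suggests placing it at the very top of spider.php.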