|
02-16-2004, 10:26 AM | #1 |
Green Mole
Join Date: Feb 2004
Posts: 17
|
indexing pdf
I installed the pstotext binary, but PDF files are not being indexed: no green checkmark appears when indexing the site. Does the Ghostscript binary need to be installed on the server? And do I need to upgrade PHP (I currently use version 4.2.2)? I have read about some problems with newer PHP versions and the indexing of HTML tags. Is this problem solved in version 1.8?
|
02-16-2004, 12:32 PM | #2 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
Check this thread. I believe it will answer your question.
|
02-16-2004, 02:37 PM | #3 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. From http://research.compaq.com/SRC/virtu.../pstotext.html ...
pstotext is a program that works with Ghostscript (version 3.33 or later) to extract plain text from PostScript and PDF files (you should have Ghostscript 3.51 or later for PDF). PHP version 4.2.2/3 seems to have an issue with running exec on pdftotext, as in this thread, but I am not sure whether pstotext would have the same problem. The PHP strip_tags function was replaced with a regular expression in version 1.6.3. Version 1.8.0 should not index HTML tags.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
02-20-2004, 07:20 AM | #4 |
Green Mole
Join Date: Feb 2004
Posts: 17
|
I have fixed the path to GS, set the permissions and modified the config.php to this:
// if set to true is_executable used - set to '0' if is_executable is undefined
define('USE_IS_EXECUTABLE_COMMAND',true); //use is_executable for external binaries
// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','/usr/local/bin/catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_INDEX_MSEXCEL',true);
define('PHPDIG_PARSE_MSEXCEL','/usr/local/bin/xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');
//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION','');
and still no PDF files are indexed. What am I doing wrong? |
02-20-2004, 06:09 PM | #5 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Try doing as in this thread. What onscreen output do you get?
|
02-21-2004, 12:27 PM | #6 |
Green Mole
Join Date: Feb 2004
Posts: 17
|
I get the following output:
SITE : http://www.professioneel-handhaven.nl/
Paths to exclude : - @NONE@
Is result test http an array: 1
What is result test http status: HTML
Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1
HTML <--- Status
1:http://www.professioneel-handhaven.nl/Bibliotheek/ (time : 00:00:05)
+ + levels 1...
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1
PDF <--- Status
Result contains: Array ( )
Return value is: 0
2:http://www.professioneel-handhaven.n...et_oordeel.pdf (time : 00:00:16)
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1
PDF <--- Status
Result contains: Array ( )
Return value is: 0
3:http://www.professioneel-handhaven.n..._Verbeterd.pdf (time : 00:00:21)
No link in temporary table
The PDF files are still not being indexed. |
02-23-2004, 02:10 PM | #7 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
>> ...i use version 4.2.2. now...
Hi. The same issue and echo results are in this thread. If I remember correctly, there have been three cases of 4.2.2 not working and one case of 4.2.3 not working. Upgrading to a later version of PHP solved the problems.
|
02-24-2004, 04:10 AM | #8 |
Green Mole
Join Date: Feb 2004
Posts: 17
|
Hello Charter, thank you for your help so far, but the problem still exists... I upgraded PHP to 4.3.4 and installed the pdftotext binary. Unfortunately, there is still no green checkmark for the indexed PDF files... Here is the on-screen output; I hope for new tips.
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/www.professioneel-handhaven.nl/www/Zoeken/xpdf/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
PDF <--- Status
Result contains: Array ( )
Return value is: 1
3:http://www.professioneel-handhaven.n...et_oordeel.pdf (time : 00:00:20) |
02-24-2004, 01:32 PM | #9 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. What happens when you run pdftotext from shell on a PDF file?
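One way to run that check is sketched below: a small shell function that runs a converter on a file and reports its exit status, how many bytes of text it wrote to standard output, and anything it printed to standard error. The function name `try_converter` and the example paths are illustrative and not part of PhpDig. The sketch assumes the converter writes its text to standard output when called as shown (for xpdf's pdftotext, that means passing `-` as the output file).

```shell
#!/bin/sh
# try_converter COMMAND [ARGS...]
# Hypothetical helper for manual testing; not part of PhpDig.
# Runs the command exactly as given, captures stdout and stderr in
# temporary files, then prints the exit status, the number of bytes
# written to stdout, and any stderr output.
try_converter() {
    out_file=$(mktemp) || return 2
    err_file=$(mktemp) || { rm -f "$out_file"; return 2; }

    status=0
    "$@" >"$out_file" 2>"$err_file" || status=$?

    bytes=$(wc -c <"$out_file" | tr -d ' ')
    echo "exit status: $status"
    echo "stdout bytes: $bytes"
    if [ -s "$err_file" ]; then
        echo "stderr:"
        cat "$err_file"
    fi

    rm -f "$out_file" "$err_file"
    return "$status"
}

# Example call (both paths are placeholders):
#   try_converter /usr/local/bin/pdftotext /tmp/sample.pdf -
```

A zero exit status with zero bytes of output would be consistent with the empty "Result contains: Array ( )" seen in the echo results above, while a non-zero status or a message on standard error points at the binary itself rather than at PhpDig.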
|
02-25-2004, 03:42 AM | #10 |
Green Mole
Join Date: Feb 2004
Posts: 17
|
When running pdftotext from shell there was first a problem with the glibc library. We decided to recompile from the xpdf source in /usr/local/bin and now pdf-indexing works fine! The settings in config.php are as follows:
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftotext');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','.txt');
Running pstotext from the shell gives a Ghostscript error (exit code 1) and will definitely not work on our server. pdftotext is a good alternative. Thanks again to all members for the assistance! |
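For anyone who runs into a similar library problem, a quick shell check can show whether a binary's shared libraries resolve before PhpDig ever calls it. The sketch below assumes a Linux system where `ldd` is available; the function name `missing_libs` is illustrative and not part of PhpDig or xpdf.

```shell
#!/bin/sh
# missing_libs BINARY
# Hypothetical helper; assumes ldd is available (typical on Linux).
# Prints every line of ldd output that contains "not found", which
# covers both missing libraries and unsatisfied library versions.
# Returns 0 if nothing is reported missing, 1 if something is,
# and 2 if the path is not an executable file.
missing_libs() {
    bin=$1
    if [ ! -x "$bin" ]; then
        echo "not executable: $bin" >&2
        return 2
    fi

    # "|| true" keeps the assignment from failing when grep matches nothing.
    missing=$(ldd "$bin" 2>/dev/null | grep 'not found' || true)
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing"
        return 1
    fi
    return 0
}

# Example call (path taken from the config above):
#   missing_libs /usr/local/bin/pdftotext
```

An empty result only says that `ldd` reported nothing missing; running the binary on a real PDF from the shell remains the decisive test.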