PhpDig.net

02-16-2004, 10:26 AM   #1
Hoek
Green Mole
 
Join Date: Feb 2004
Posts: 17
indexing pdf

I installed the pstotext binary, but PDF files are not being indexed: no green checkmark appears for them when indexing the site. Does the Ghostscript binary need to be installed on the server? And do I need to upgrade PHP (I use version 4.2.2 now)? I have read about problems with newer PHP versions indexing HTML tags. Is this problem solved in version 1.8?

02-16-2004, 12:32 PM   #2
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
Check this thread. I believe it will answer your question.

02-16-2004, 02:37 PM   #3
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. From http://research.compaq.com/SRC/virtu.../pstotext.html ...

pstotext is a program that works with Ghostscript (version 3.33 or later) to extract plain text from PostScript and PDF files (you should have Ghostscript 3.51 or later for PDF).

PHP versions 4.2.2 and 4.2.3 seem to have an issue running pdftotext through exec, as in this thread, but I am not sure whether pstotext would have the same problem.

The PHP strip_tags function was replaced with a regular expression in version 1.6.3. Version 1.8.0 should not index HTML tags.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.

02-20-2004, 07:20 AM   #4
Hoek
Green Mole
 
Join Date: Feb 2004
Posts: 17
I have fixed the path to Ghostscript, set the permissions, and modified config.php to this:

// if set to true is_executable used - set to '0' if is_executable is undefined
define('USE_IS_EXECUTABLE_COMMAND',true); //use is_executable for external binaries

// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','/usr/local/bin/catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');
define('PHPDIG_OPTION_PDF','');

define('PHPDIG_INDEX_MSEXCEL',true);
define('PHPDIG_PARSE_MSEXCEL','/usr/local/bin/xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');

//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION','');

and still no PDF files are indexed. What am I doing wrong?

02-20-2004, 06:09 PM   #5
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Try doing as in this thread. What onscreen output do you get?
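
The debug code from the linked thread is not reproduced here. As a rough stand-in, the sketch below prints the same kind of checks that show up in the onscreen output later in this thread. The function name `describe_external_binary` is hypothetical and not part of PhpDig, the path is only an example, and the code assumes a current PHP version rather than the 4.x releases discussed here.

```php
<?php
// Hypothetical helper, not part of PhpDig: reports whether a configured
// external binary exists and whether PHP considers it executable.
function describe_external_binary(string $path): array
{
    return [
        'path'       => $path,
        'exists'     => file_exists($path),
        // is_executable() is false when the file is missing or lacks the execute bit.
        'executable' => is_executable($path),
    ];
}

// Assumed path for illustration; substitute the value of PHPDIG_PARSE_PDF.
$report = describe_external_binary('/usr/local/bin/pstotext');
echo 'Parse the pdf is set to: ' . $report['path'] . "\n";
echo 'Does parse pdf exist: ' . var_export($report['exists'], true) . "\n";
echo 'Is parse pdf executable: ' . var_export($report['executable'], true) . "\n";
```

Both checks passing only shows that the file is in place with the right permissions. It does not show that the binary actually runs, which is why the command output and return value also matter.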

02-21-2004, 12:27 PM   #6
Hoek
Green Mole
 
Join Date: Feb 2004
Posts: 17
I get the following output:

SITE : http://www.professioneel-handhaven.nl/
Paths to exclude :
- @NONE@


Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1

HTML <--- Status
1:http://www.professioneel-handhaven.nl/Bibliotheek/
(time : 00:00:05)
+ +
levels 1...


Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1

PDF <--- Status


Result contains: Array ( )
Return value is: 0

2:http://www.professioneel-handhaven.n...et_oordeel.pdf
(time : 00:00:16)



Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1

PDF <--- Status


Result contains: Array ( )
Return value is: 0

3:http://www.professioneel-handhaven.n..._Verbeterd.pdf
(time : 00:00:21)

No link in temporary table

The PDF files are still not being indexed.

02-23-2004, 02:10 PM   #7
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
>> ...i use version 4.2.2. now...

Hi. The same issue and echo results are in this thread. If I remember correctly, there have been three cases of 4.2.2 not working and one case of 4.2.3 not working. Upgrading to a later version of PHP solved the problems.
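
A quick way to see whether a server falls into the reported range is to compare the running PHP version against a threshold. The sketch below is only an illustration: the function name is hypothetical, and the 4.3.0 cutoff is an assumption drawn from the reports above (4.2.2 and 4.2.3 failing), not a documented boundary.

```php
<?php
// Hypothetical check, not part of PhpDig. The 4.3.0 threshold is an assumption
// based only on the reports in this thread (4.2.2 and 4.2.3 not working).
function php_version_looks_affected(string $version): bool
{
    return version_compare($version, '4.3.0', '<');
}

if (php_version_looks_affected(PHP_VERSION)) {
    echo 'PHP ' . PHP_VERSION . " is in the range reported as problematic.\n";
} else {
    echo 'PHP ' . PHP_VERSION . " is newer than the versions reported as problematic.\n";
}
```

A newer version only rules out this one cause. The external binary still has to run correctly on its own.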

02-24-2004, 04:10 AM   #8
Hoek
Green Mole
 
Join Date: Feb 2004
Posts: 17
Hello Charter, thank you for your help so far, but the problem still exists. I upgraded PHP to 4.3.4 and installed the pdftotext binary. Unfortunately, there is still no green checkmark for the PDF files. Here is the onscreen output; I hope you have further tips.

Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/www.professioneel-handhaven.nl/www/Zoeken/xpdf/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

PDF <--- Status


Result contains: Array ( )
Return value is: 1

3:http://www.professioneel-handhaven.n...et_oordeel.pdf
(time : 00:00:20)

02-24-2004, 01:32 PM   #9
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. What happens when you run pdftotext from shell on a PDF file?
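
If shell access is not available, a similar test can be approximated from PHP. The output quoted above shows an empty result array alongside a non-zero return value, which is what exec() gives when a command fails and writes its message to stderr only. The sketch below redirects stderr into the captured output so that any error message becomes visible. The helper name and both paths are hypothetical, and the code assumes a Unix-like shell and a current PHP version.

```php
<?php
// Hypothetical helper, not part of PhpDig: runs a command and merges stderr
// into the captured output so that error messages are not silently dropped.
function run_and_capture(string $binary, array $args = []): array
{
    $command = escapeshellarg($binary);
    foreach ($args as $arg) {
        $command .= ' ' . escapeshellarg($arg);
    }
    $output = [];
    $exitCode = 0;
    // "2>&1" sends stderr to stdout; without it, exec() collects stdout only.
    exec($command . ' 2>&1', $output, $exitCode);
    return ['command' => $command, 'output' => $output, 'exit_code' => $exitCode];
}

// Assumed paths for illustration; "-" asks pdftotext to write the text to stdout.
$result = run_and_capture('/usr/local/bin/pdftotext', ['/tmp/sample.pdf', '-']);
print_r($result['output']);
echo 'Return value is: ' . $result['exit_code'] . "\n";
```

A missing shared library or a wrong path should then appear as text in the output array instead of an empty array.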

02-25-2004, 03:42 AM   #10
Hoek
Green Mole
 
Join Date: Feb 2004
Posts: 17
Running pdftotext from the shell first revealed a problem with the glibc library. We decided to recompile pdftotext from the xpdf source into /usr/local/bin, and now PDF indexing works fine! The settings in config.php are as follows:

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftotext');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','.txt');
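
The '.txt' extension setting matches the config.php comment quoted earlier in the thread: use '.txt' if the external binary writes to filename.txt. As far as I know, pdftotext does this when it is given only an input file and no output name. The sketch below predicts that output name. The function name is hypothetical and not part of PhpDig, and the naming rule is written from memory of pdftotext's documented default, so treat it as an assumption.

```php
<?php
// Hypothetical helper, not part of PhpDig: predicts where the extracted text
// ends up when pdftotext is called with an input file only.
// Assumption: a trailing ".pdf" is replaced by ".txt"; otherwise ".txt" is appended.
function default_text_output_path(string $inputPath): string
{
    $base = preg_replace('/\.pdf$/i', '', $inputPath);
    return $base . '.txt';
}

echo default_text_output_path('/tmp/report.pdf') . "\n";  // /tmp/report.txt
echo default_text_output_path('/tmp/phpdig_tmp') . "\n";  // /tmp/phpdig_tmp.txt
```

With an empty extension setting, the indexer would presumably expect the text on STDOUT, which pdftotext only provides when "-" is passed as the output file.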

Running pstotext from the shell gives a Ghostscript error (exit code 1), so it will definitely not work on our server. pdftotext is a good alternative.

Thanks again to all members for the assistance!
All times are GMT -8.