PhpDig.net

08-06-2004, 08:25 AM   #1
rom
Green Mole
 
Join Date: Jan 2004
Posts: 25
Can't index pdf using pdftotext

My server is running php 4.3.8 on a linux system, and I am trying to search pdfs using the pdftotext external binary.

I am able to get phpdig to search html files. Pdftotext converts pdfs and places a txt file in the same directory, when run from the command line, but I haven't been able to configure phpdig to index a linked pdf file on my website.

I have followed all the instructions in the thread "External Binaries Problem Checklist", and have inserted the recommended echo statements in spider.php and robot_functions.php. The output when reindexing is shown below.

Thanks very much for any assistance.

SITE : http://www.goeco.com/
Exclude paths :
- cgi-bin/


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable:
1:http://www.goeco.com/index2.html
(time : 00:00:05)
+ +
level 1...


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable:
2:http://www.goeco.com/fr_band.html
(time : 00:00:15)

(The same output as above appears for various other linked pages, until we get to level 3.)

level 3...


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable:
15:http://www.goeco.com/profile.pdf
(time : 00:01:33)

No link in temporary table
links found : 15
http://www.goeco.com/index2.html
http://www.goeco.com/fr_band.html
http://www.goeco.com/home.html
http://www.goeco.com/contact.html
http://www.goeco.com/sustainability.html
http://www.goeco.com/response.html
http://www.goeco.com/training.html
http://www.goeco.com/sites.html
http://www.goeco.com/wastes.html
http://www.goeco.com/impacts.html
http://www.goeco.com/audits.html
http://www.goeco.com/ems.html
http://www.goeco.com/services.html
http://www.goeco.com/vision.html
http://www.goeco.com/profile.pdf
Optimizing tables...
Indexing complete ! [Back] to admin interface.

08-06-2004, 09:47 AM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. The "Is parse pdf executable" check is coming up false (blank), so check that pdftotext is set to 755 permissions.
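
For anyone following along, here is a minimal sketch of why that line prints blank and what 755 changes. It assumes a Unix-like server; the temp file is only a hypothetical stand-in for the pdftotext binary. In PHP, echoing false prints an empty string and echoing true prints 1, which is why the log shows nothing after "Is parse pdf executable:".

```php
<?php
// Hypothetical stand-in for the pdftotext binary: an empty temp file.
$binary = tempnam(sys_get_temp_dir(), 'bin');

chmod($binary, 0644);          // rw-r--r--: readable but not executable
clearstatcache();              // is_executable() results are cached
$before = is_executable($binary);

chmod($binary, 0755);          // rwxr-xr-x, the permission suggested above
clearstatcache();
$after = is_executable($binary);

// var_export() shows false explicitly instead of an empty string.
echo 'Is executable at 644: ', var_export($before, true), "\n";
echo 'Is executable at 755: ', var_export($after, true), "\n";

unlink($binary);
```

Moving or re-uploading a file can reset its mode, so it is worth rechecking after any move.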
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.

08-06-2004, 11:25 AM   #3
rom
Hi Charter,

Thanks very much for your quick reply. I had set the permissions correctly, but then moved the file to a new directory, so somehow it was changed to the wrong settings. It is now 755, and this is the output from the echos.

...similar to what was there before except as shown below...

level 3...


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /home5/goeco/HTML/pdftotext -cork ../admin/temp/78556292.tmp
Result contains: Array ( )
Return value is: 1

15:http://www.goeco.com/profile.pdf
(time : 00:01:35)

No link in temporary table
links found : 15
http://www.goeco.com/index2.html
http://www.goeco.com/fr_band.html
http://www.goeco.com/home.html
http://www.goeco.com/contact.html
http://www.goeco.com/sustainability.html
http://www.goeco.com/response.html
http://www.goeco.com/training.html
http://www.goeco.com/sites.html
http://www.goeco.com/wastes.html
http://www.goeco.com/impacts.html
http://www.goeco.com/audits.html
http://www.goeco.com/ems.html
http://www.goeco.com/services.html
http://www.goeco.com/vision.html
http://www.goeco.com/profile.pdf
Optimizing tables...

08-06-2004, 11:30 AM   #4
Charter
Hi. Now the command:
Code:
/home5/goeco/HTML/pdftotext -cork ../admin/temp/78556292.tmp
is failing so find:
PHP Code:
$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2;
and replace with:
PHP Code:
$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2.' 2>&1';
and see what error it shows on reindex.
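
A minimal sketch of what the 2>&1 redirect does with exec(), assuming a Unix-like shell. The command here is a hypothetical stand-in that fails the way pdftotext did (one line on stderr, exit status 1); it is not PhpDig's own code.

```php
<?php
// Hypothetical failing command: writes one line to stderr and exits with 1.
$command = "sh -c 'echo \"Error: demo failure\" >&2; exit 1'";

// Without the redirect, exec() only collects stdout, so the array stays empty.
$plain = array();
exec($command, $plain, $plainStatus);

// With 2>&1 appended, stderr is merged into stdout and lands in the array.
$merged = array();
exec($command.' 2>&1', $merged, $mergedStatus);

echo "Without redirect, return value $plainStatus: ";
print_r($plain);
echo "With redirect, return value $mergedStatus: ";
print_r($merged);
```

This is why the earlier log showed an empty Array ( ) alongside a return value of 1: the error text was going to stderr, which exec() does not collect on its own.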

08-06-2004, 04:06 PM   #5
rom
Hi Charter,

Thanks again for responding so quickly. Here is the latest error message. Was I supposed to have created a cork file somewhere?

level 3...


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home5/goeco/HTML/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /home5/goeco/HTML/pdftotext -cork ../admin/temp/54831932.tmp 2>&1
Result contains: Array ( [0] => Error: Couldn't open file '-cork' )
Return value is: 1

15:http://www.goeco.com/profile.pdf
(time : 00:01:34)

No link in temporary table

08-06-2004, 04:11 PM   #6
Charter
Hi. The -cork flag is an option that doesn't seem to be available in your pdftotext, so it is being treated as a filename. Just set the following in the config file:
PHP Code:
define('PHPDIG_OPTION_PDF',''); // two single quotes, no space between 
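
A rough sketch of how the command string comes out once the option is empty. The constant names and paths are taken from the posts above; the surrounding script is only an illustration, not the actual PhpDig source.

```php
<?php
// Values mirror the log output earlier in this thread.
define('PHPDIG_PARSE_PDF', '/home5/goeco/HTML/pdftotext');
define('PHPDIG_OPTION_PDF', '');   // empty: no extra flag is passed
$tempfile2 = '../admin/temp/54831932.tmp';

// Same concatenation as in the find/replace suggested earlier.
$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2.' 2>&1';

echo "Command is: $command\n";
```

The empty option leaves two spaces between the binary and the temp file, which the shell ignores.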

08-06-2004, 04:26 PM   #7
rom
Hi Charter,

Thanks. It is working now!

Have a good weekend.

Rom

08-07-2004, 01:09 PM   #8
rom
I'm working on another website now, and have not been able to get phpdig to index the pdfs on this one either. I have followed all your previous directions, and as an example have received the echos shown below.

Thanks very much for your assistance.

Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to:
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
38:http://www.cgxenergy.ca/regionalOverview.html
(time : 00:03:33)

08-07-2004, 01:24 PM   #9
rom
Never mind.

Just realized that define('PHPDIG_INDEX_PDF',true); was still set to false.

08-07-2004, 03:50 PM   #10
rom
Still stuck, unfortunately. The HTML pages seem OK, but indexing PDFs has given several error messages. After the last one, spidering appears to stop without going through the other 100 or so links.

Thanks again.

Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
100:http://www.cgxenergy.ca/affiliated.html
(time : 00:09:18)



Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /home/cgxenerg/HTML/investors/pdftotext ../admin/temp/21314312.tmp 2>&1
Result contains: Array ( )
Return value is: 0

101:http://www.cgxenergy.ca/investors/MB...esMar25_04.pdf
(time : 00:09:24)


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /home/cgxenerg/HTML/investors/pdftotext ../admin/temp/24175282.tmp 2>&1
Result contains: Array ( [0] => Error: Copying of text from this document is not allowed. )
Return value is: 3

102:http://www.cgxenergy.ca/investors/OctagonMar08_04.pdf
(time : 00:09:29)


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable

08-07-2004, 04:03 PM   #11
Charter
>> Error: Copying of text from this document is not allowed.

Hi. PhpDig using pdftotext cannot index the PDF if the PDF is set to not allow it.

08-07-2004, 07:19 PM   #12
rom
Hi Charter,

Will phpdig still be able to index the other PDFs? Only some gave the copying error.

Is the "copying of text" a security setting on the PDF?

Is the "copying of text" error the reason that the spidering is dying part way through?

Thanks,

Rom

08-07-2004, 07:26 PM   #13
Charter
Hi. PhpDig can index almost any PDF that allows it, save for PDFs that take so much memory as to cause the script to barf due to lack of memory.

Whoever writes the PDF can set whether copying of text from the PDF is allowed. I'm not sure about the dying issue after trying to index an index-protected PDF.

How many times do you find PhpDig trying to index an index-protected PDF before it dies?

08-07-2004, 08:50 PM   #14
rom
I receive two "copying of text" errors, then it gets to this point:

Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /home/cgxenerg/HTML/investors/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

and stops.

You mention a memory issue. The largest PDF is 4.6 MB, so maybe we should just delete anything larger than 1 MB from the site, if that would help.

With the following message, has this PDF been indexed without a problem? I'm not sure what the return value means or whether the array should have something in it.

Command is: /home/cgxenerg/HTML/investors/pdftotext ../admin/temp/21314312.tmp 2>&1
Result contains: Array ( )
Return value is: 0

101:http://www.cgxenergy.ca/investors/MB...esMar25_04.pdf
(time : 00:09:24)

Thanks again. You've been a huge help. I've been tearing my hair out on this one.

08-08-2004, 06:02 PM   #15
Charter
Hi. Contrary to possible intuition, with the exec command, a return value of zero is a success. The result array is meant to contain the output from the command, but in the previous post it looks as though the command executed successfully and the array is still empty. Perhaps check your error logs, and as to a possible memory issue, maybe this thread might help.
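
As a rough guide to the return values seen in this thread, here is a small sketch. The meanings are the exit codes listed in the xpdf pdftotext man page, so treat them as an assumption for other builds; describePdftotextStatus() is a hypothetical helper, not part of PhpDig.

```php
<?php
// Map a pdftotext exit status to a short description.
// Codes follow the xpdf man page; other builds may differ.
function describePdftotextStatus($status)
{
    $meanings = array(
        0  => 'success',
        1  => 'error opening the PDF file',
        2  => 'error opening the output file',
        3  => 'PDF permissions error (e.g. copying of text not allowed)',
        99 => 'other error',
    );
    return isset($meanings[$status]) ? $meanings[$status] : 'unknown status';
}

// The three return values that appeared in the logs above.
foreach (array(0, 1, 3) as $status) {
    echo "Return value $status: ", describePdftotextStatus($status), "\n";
}
```

An empty array with a return value of 0 is not necessarily a problem: as noted in the first post, pdftotext places the converted text in a txt file next to the input rather than printing it, so there may simply be nothing on stdout for exec() to collect.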