03-30-2004, 11:29 AM | #1 |
Green Mole
Join Date: Mar 2004
Posts: 3
|
not indexing with pdftotext
I am having problems getting phpdig to index PDF files. pdftotext is installed and works fine from the command line.

I have read several of the other posts and have tried the error reporting code suggested. It seems that my PDF file is not recognised as a PDF; instead it is recognised and indexed as HTML. If I look in the MySQL spider table I can see the beginning of the raw PDF file, stripped only of whatever appears inside <>, e.g.

%PDF-1.2 %Çì¢ 7 0 obj <</Length 8 0 R/Filter /FlateDecode>> stream xœ3Ð3T0

becomes

%PDF-1.2 %Çì¢ 7 0 obj > stream xœ3Ð3T0

This is for the simple hello world test file that comes with pdftotext. I have included a sample output of the spider below:

HTTP/1.1 200 OK
Date: Tue, 30 Mar 2004 20:29:36 GMT
Server: Apache/1.3.27 (Unix) (Red-Hat/Linux) mod_ssl/2.8.12 OpenSSL/0.9.6 PHP/4.3.4 mod_perl/1.27
Last-Modified: Tue, 30 Mar 2004 19:10:13 GMT
ETag: "34ac212-395-4069c615"
Accept-Ranges: bytes
Content-Length: 917
Connection: close
Content-Type: application/pdf

Is result test http an array: 1
What is result test http status: HTML
Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
13:http://www.tist.org/tist/docs/welcom...est/hello1.pdf

Can you please suggest what I need to do to get the spider to recognise PDF files as PDF files rather than HTML? I am using phpdig 1.8, xpdf 3.00, and PHP 4.3.4.

Thanks for your help,
david

Last edited by davideyre; 03-30-2004 at 12:09 PM. |
03-30-2004, 01:19 PM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. It looks like you put the following code in the robot_functions.php file. That code is meant only for cases where the server does not return a content type, which is generally not the case (your server does send Content-Type: application/pdf), so just take the code out of the robot_functions.php file.
PHP Code:
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
03-30-2004, 01:55 PM | #3 |
Green Mole
Join Date: Mar 2004
Posts: 3
|
thanks - it is working well now
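
For readers who land on this thread with the same symptom, the logic behind the fix can be sketched roughly as follows: classify the document from the Content-Type header the server sends, and guess only when that header is missing. This is a minimal illustration in Python, not phpDig's actual code (phpDig is written in PHP, and the snippet discussed in reply #2 is not reproduced here). The function name `classify_response` and the labels other than HTML are assumptions made up for this example; the sample headers are the ones from the first post.

```python
# Hypothetical sketch: decide how to treat a fetched document based on
# the Content-Type header. Not taken from phpDig's source.

SAMPLE_HEADERS = """HTTP/1.1 200 OK
Date: Tue, 30 Mar 2004 20:29:36 GMT
Accept-Ranges: bytes
Content-Length: 917
Connection: close
Content-Type: application/pdf"""


def classify_response(raw_headers: str) -> str:
    """Return 'PDF', 'HTML', 'PLAINTEXT', 'OTHER' or 'UNKNOWN'."""
    content_type = None
    for line in raw_headers.splitlines():
        # Header lines look like "Name: value"; the status line has no colon.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "content-type":
            # Drop parameters such as "; charset=iso-8859-1".
            content_type = value.split(";")[0].strip().lower()
            break

    if content_type is None:
        # Only in this branch would a fallback guess be appropriate.
        return "UNKNOWN"
    if content_type == "application/pdf":
        return "PDF"
    if content_type in ("text/html", "application/xhtml+xml"):
        return "HTML"
    if content_type == "text/plain":
        return "PLAINTEXT"
    return "OTHER"


if __name__ == "__main__":
    print(classify_response(SAMPLE_HEADERS))  # prints PDF
```

The point of the sketch is the order of the checks: a fallback that forces a type regardless of the header would label a PDF as HTML, which matches the "status: HTML" lines in the debug output above even though the server reported application/pdf.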