PhpDig.net

03-30-2004, 11:29 AM   #1
davideyre
Green Mole
 
Join Date: Mar 2004
Posts: 3
not indexing with pdftotext

I am having problems getting PhpDig to index PDF files. pdftotext is installed and works fine from the command line.

I have read several of the other posts and have tried the error reporting code suggested. It seems that my PDF file is not recognised as a PDF; instead it gets recognised and indexed as HTML. If I look in the MySQL spider table I can see the beginning of the raw PDF file, stripped only of anything that appears inside <>, e.g.


%PDF-1.2
%Ç쏢
7 0 obj
<</Length 8 0 R/Filter /FlateDecode>>
stream
xœ3Ð3T0

becomes....

%PDF-1.2 %Ç쏢 7 0 obj > stream xœ3Ð3T0


This is for the simple hello world test file that comes with pdftotext.

I have included a sample of the spider output below:
HTTP/1.1 200 OK
Date: Tue, 30 Mar 2004 20:29:36 GMT
Server: Apache/1.3.27 (Unix) (Red-Hat/Linux) mod_ssl/2.8.12 OpenSSL/0.9.6 PHP/4.3.4 mod_perl/1.27
Last-Modified: Tue, 30 Mar 2004 19:10:13 GMT
ETag: "34ac212-395-4069c615"
Accept-Ranges: bytes
Content-Length: 917
Connection: close
Content-Type: application/pdf



Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

13:http://www.tist.org/tist/docs/welcom...est/hello1.pdf


Can you please suggest what I need to do to get the spider to recognise PDF files as PDF rather than HTML? I am using PhpDig 1.8, Xpdf 3.00, and PHP 4.3.4.

thanks for your help, david

Last edited by davideyre; 03-30-2004 at 12:09 PM.
03-30-2004, 01:19 PM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. It looks like you added the following code to the robot_functions.php file. That code is meant only for the case where no content type is returned, which is rare. Your server does return one (Content-Type: application/pdf in your spider output), so just take the code out of the robot_functions.php file.
PHP Code:
elseif (!eregi("Content-Type: *([a-z]+)/([a-z.-]+)",$answer,$regs)) {
    $status = 'HTML'; // no content-type so set to html
}
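
For illustration only, here is a minimal sketch of the idea behind that check: use the Content-Type header when the server sends one, and fall back to HTML only when it is missing. The function name guess_status and the status strings other than 'HTML' are assumptions made for this example, not PhpDig's actual code. It uses preg_match rather than eregi, since the ereg functions were removed in later PHP versions.

```php
<?php
// Hypothetical helper, not part of PhpDig: derive a document status
// from the raw HTTP response headers.
function guess_status($headers)
{
    // Case-insensitive, per-line match of "Content-Type: type/subtype".
    if (!preg_match('~^Content-Type:\s*([a-z]+)/([a-z0-9.+-]+)~mi', $headers, $m)) {
        return 'HTML'; // no Content-Type header at all, so fall back to HTML
    }

    $mime = strtolower($m[1] . '/' . $m[2]);
    switch ($mime) {
        case 'application/pdf':
            return 'PDF';       // assumed status name
        case 'text/html':
            return 'HTML';
        case 'text/plain':
            return 'PLAINTEXT'; // assumed status name
        default:
            return 'NOFILE';    // assumed status name for unsupported types
    }
}

// Headers shaped like the ones in the spider output above.
$headers = "HTTP/1.1 200 OK\r\nContent-Length: 917\r\nContent-Type: application/pdf\r\n";
echo guess_status($headers), "\n";
```

With headers like the ones in the first post, the fallback branch is never reached, so a PDF is only labelled HTML if the fallback runs unconditionally or in the wrong place.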

03-30-2004, 01:55 PM   #3
davideyre
Green Mole
 
Join Date: Mar 2004
Posts: 3
Thanks - it is working well now.