|
12-07-2003, 11:05 AM | #1 | |
Green Mole
Join Date: Dec 2003
Posts: 11
|
Quote:
I have added the define above as per the previous problem, but the contents of the PDF are still not indexed. I'm using "pdftohtml" with a wrapper that removes all HTML formatting, giving PDF -> TEXT (syntax: pdf2txt file.pdf, which writes plain text to STDOUT). In the database there is no hint of the contents of the PDF file, so they are not indexed; only the filename itself is stored, which is not really what we want here. Any help would be appreciated. Leland |
|
12-07-2003, 11:12 AM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. If the output goes to STDOUT, then set define('PHPDIG_PDF_EXTENSION','');
The extension .txt in define('PHPDIG_PDF_EXTENSION','.txt'); is only needed if the output goes to file with a .txt extension.
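As a rough illustration of the rule above (not PhpDig's actual source), the two cases can be sketched in Python. The function name `collect_text`, and the convention that a file-writing converter produces `<input><extension>`, are assumptions made for this sketch only.

```python
import subprocess
from pathlib import Path


def collect_text(command, input_path, extension=""):
    """Run an external converter on input_path and return the text it produced.

    command   -- argv list for the converter, e.g. ["/usr/local/bin/pdf2txt"]
    extension -- "" if the converter prints to STDOUT; otherwise the suffix of
                 the file it is assumed to write next to the input
    """
    result = subprocess.run(
        [*command, str(input_path)],
        capture_output=True,
        text=True,
        check=True,  # raise if the converter exits with a non-zero status
    )
    if extension == "":
        # STDOUT case: the converted text is whatever the command printed.
        return result.stdout
    # File case (assumption): the converter wrote <input_path><extension>.
    return Path(str(input_path) + extension).read_text()
```

With an empty extension the text comes straight from the command's output; with '.txt' it is read back from the file the converter is assumed to have written.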
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
12-07-2003, 11:18 AM | #3 | |
Green Mole
Join Date: Dec 2003
Posts: 11
|
Quote:
Am I missing something here?
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdf2txt');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','');
The actual PDF file is linked off of another page, and looking at the server logs I do see the crawler retrieving the pdf document in the first place... just that it's still not indexed at all.
taranta.discpro.org - - [07/Dec/2003:19:14:40 +0000] "HEAD /pdftest/InstrumentPilot39.pdf HTTP/1.1" 200 0 "-" "PhpDig/1.6.2 (PHP; MySql)"
taranta.discpro.org - - [07/Dec/2003:19:14:40 +0000] "GET /pdftest/InstrumentPilot39.pdf HTTP/1.0" 200 1262188 "-" "PHP/4.2.2"
Leland |
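To confirm from an access log which requests actually hit the PDF, entries like the two above can be parsed mechanically. This is a minimal sketch, assuming the log uses Apache's combined format as those lines appear to; `LOG_PATTERN` and `pdf_requests` are names invented here.

```python
import re

# Combined log format: host ident user [time] "request" status size "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def pdf_requests(lines):
    """Return (method, path, status, agent) for every request whose path ends in .pdf."""
    hits = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group("path").lower().endswith(".pdf"):
            hits.append((
                match.group("method"),
                match.group("path"),
                int(match.group("status")),
                match.group("agent"),
            ))
    return hits
```

Applied to the two log lines in this post, it reports a HEAD and a GET for the PDF, both with status 200. That suggests retrieval itself worked and the problem lies somewhere after the download.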
|
12-07-2003, 11:41 AM | #4 | |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Quote:
|
12-07-2003, 12:26 PM | #5 | |
Green Mole
Join Date: Dec 2003
Posts: 11
|
Quote:
(Please note that the latest version listed on freshmeat/sourceforge is 1.6.2; you might want to update that when you get a chance.) Will try 1.6.5 and let you know how it goes. Leland |
|
12-07-2003, 12:39 PM | #6 |
Green Mole
Join Date: Dec 2003
Posts: 11
|
hmm..
Version 1.6.5 generates a 404 error when submitting a search on the search page:
laptop1.discpro.org - - [07/Dec/2003:20:36:39 +0000] "GET /phpdig/index.php?template_demo=.%2Ftemplates%2Fphpdig.html&site=0&path=&result_page=index.php&query_string=transition&limite=10&option=start HTTP/1.1" 404 1146 "http://www.discpro.org/phpdig/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" |
12-07-2003, 12:41 PM | #7 |
Green Mole
Join Date: Dec 2003
Posts: 11
|
disregard that... brain fart
Just tried it with the new version; the PDF content is still not indexed. L. |
12-07-2003, 01:02 PM | #8 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. All I can find is a Windows version of pdf2txt. Are you using xpdf? Unless renamed, it should be pdftotext that comes with xpdf.
If you have shell access and are allowed to locate, just type locate pdftotext to find the path. The freshmeat listing was updated yesterday.
12-07-2003, 01:09 PM | #9 |
Green Mole
Join Date: Dec 2003
Posts: 11
|
No, I'm actually using pdftohtml, since pdftotext and xpdf don't support encrypted PDFs. The only utility I have available for this is pdftohtml, which produces HTML output (to STDOUT). I therefore use a wrapper around it and call the binary from the wrapper. The wrapper removes the HTML tags, leaving only the plain text... just a simple 4-line perl script:
#!/usr/bin/perl
$filename = shift;
$output = `/usr/local/bin/pdftohtml -i -stdout -noframes $filename`;
$output =~ s/<.*>//g;
print $output;
As a result, to get the text out of the PDF, simply run "pdf2txt myfile.pdf" at the command line, and it outputs the text to STDOUT.
Noted on the freshmeat site... guess I should have waited a day before downloading it then.
I really need to sort out this PDF indexing issue though... it's annoying, and I really need the search engine to be able to search the contents of a PDF file. There are several other spiders/search engines available in PHP, BUT none of them can do indexing and searching as comprehensively as PhpDig.
Leland |
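One detail of the wrapper may be worth a second look, although it is not the cause of the problem discussed in this thread: the pattern in `s/<.*>//g` is greedy, so on a line containing more than one tag it removes everything from the first `<` to the last `>`, text included. The sketch below reproduces that behaviour in Python and compares it with a per-tag pattern. The sample string is made up for illustration, and whether pdftohtml really emits several tags per line is an assumption.

```python
import re


def strip_tags_greedy(html):
    # Same shape as the wrapper's s/<.*>//g: '.' stops at newlines, but '*'
    # is greedy, so the match runs from the first '<' to the last '>' on a line.
    return re.sub(r"<.*>", "", html)


def strip_tags(html):
    # Match one tag at a time: '<', then anything that is not '>', then '>'.
    return re.sub(r"<[^>]*>", "", html)


# Hypothetical converter output with two tags on the first line.
sample = "<b>Engine Management</b> 1<br>\nIntelligence Reports 2\n"

print(repr(strip_tags_greedy(sample)))
print(repr(strip_tags(sample)))
```

On this sample the greedy version deletes the whole first line, while the per-tag version keeps "Engine Management 1". The equivalent change in the Perl wrapper would be `s/<[^>]*>//g`.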
12-07-2003, 01:46 PM | #10 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Just "pdf2txt myfile.pdf" with no .cgi or .pl extension? How does it know to treat it as a perl program?
Try using pdftohtml in define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftohtml'); because I'm thinking PhpDig should clean the results of tags.
12-07-2003, 01:52 PM | #11 | |
Green Mole
Join Date: Dec 2003
Posts: 11
|
Quote:
For example, if you do it from the command line itself:
leland@taranta:~/public_html/pdftest> /usr/local/bin/pdf2txt InstrumentPilot39.pdf
Engine Management 1 Intelligence Reports 2 Bashing the Beam 6 European Flight Planning 8 Dew Point Review 10 PPL/IR Europe Web Site 12 14 Bert Maes and I attended the engine efficiency and many others. It was very <snip>
---
Having said that, I've just added a little hook in the wrapper to detect if the wrapper has even been called, but it looks like the spider isn't even attempting to use it. Despite the settings in config.php:
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdf2txt');
define('PHPDIG_OPTION_PDF','');
The externals are called with "exec", are they not? If they are, then it should at least fall into the trap, but it looks as if it's not even getting that far.
L. |
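The hook itself is not shown in the thread. One hypothetical way to build such a trap is to append a line to a log file on every invocation. It is sketched here in Python rather than Perl, and `record_call` and the log path are invented names.

```python
import os
import time


def record_call(log_path, argv):
    """Append one line describing this invocation to log_path and return it."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    # The uid shows which account ran the wrapper (e.g. the web server user).
    # os.getuid() is available on Unix-like systems only.
    line = f"{stamp} uid={os.getuid()} args={list(argv)!r}\n"
    with open(log_path, "a") as log:
        log.write(line)
    return line
```

Calling something like `record_call("/tmp/pdf2txt-calls.log", sys.argv)` as the first statement of a wrapper would leave a trace each time it runs, provided the log location is writable by the account the spider runs under. An empty log after a spider run would support the suspicion that the external command is never reached.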
|
12-07-2003, 02:03 PM | #12 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Yes, exec is being used. I just tried your perl program on my OS (Linux/Apache), but #!/usr/bin/perl does not force the execution of perl programs there. I'll play around some more with this.
In any case, try using the following:
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftohtml');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION',''); // as it's STDOUT
The above is used to make a temp file, which is then passed to an index function. In the index function, the temp file should be cleaned of tags.
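Whether an extensionless script can be run directly usually comes down to its execute permission and its `#!` line. A small check along those lines might look like the following. It is a sketch under the assumption of a Unix-like system, and `wrapper_problems` is a name invented for this example.

```python
import os


def wrapper_problems(path):
    """Return a list of reasons why path might fail to run as a command."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]

    problems = []
    if not os.access(path, os.X_OK):
        problems.append("file is not executable by the current user")

    with open(path, "rb") as script:
        first_line = script.readline().strip()

    if not first_line.startswith(b"#!"):
        problems.append("no #! line, so the system cannot pick an interpreter")
    else:
        parts = first_line[2:].split()
        interpreter = parts[0].decode() if parts else ""
        if not (interpreter and os.access(interpreter, os.X_OK)):
            problems.append(f"interpreter {interpreter!r} is missing or not executable")
    return problems
```

The result depends on which account runs the check, so it is most informative when run as the user the spider runs under rather than from a personal shell.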
12-07-2003, 02:11 PM | #13 | ||
Green Mole
Join Date: Dec 2003
Posts: 11
|
Quote:
Have to, of course, make sure that the perl interpreter is in the right place.
Quote:
Leland |
||
12-07-2003, 02:16 PM | #14 |
Green Mole
Join Date: Dec 2003
Posts: 11
|
Just looking at the output when running the spider:
SITE : http://www.discpro.org/
Exclude paths : - @NONE@
1:http://www.discpro.org/ (time : 00:00:01)
+ + + + + + + + +
level 1...
2:http://www.discpro.org/pdftest/InstrumentPilot39.pdf (time : 00:00:02)
3:http://www.discpro.org/?mode=pgpkey (time : 00:00:02)
<etc>
#3 has the checkmark next to it; #2 doesn't. Am I to presume that it only indexed the file and not the contents of the file? It also seemed to do it a little TOO quickly, since it takes at least a few seconds just to convert the PDF to HTML or text. That tells me it's not even executing the external binary call.
Leland |
12-07-2003, 02:20 PM | #15 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Try a search on some words in the PDF file. In the search results is there a result for the PDF file? Also, I am on chat right now if you'd rather chat through this.