Quote:
I have added the define above as per the previous problem, but the contents of the PDF are still not indexed. I'm using "pdftohtml" with a wrapper which removes all HTML formatting, resulting in PDF -> TEXT (syntax: pdf2txt file.pdf, which writes plain text to STDOUT). Of course, in the database there is no hint of the contents of the PDF file, so they are not indexed... just the filename itself (which is not really what we want here). Any help would be appreciated. :bang: Leland |
Hi. If the output goes to STDOUT, then set define('PHPDIG_PDF_EXTENSION','');
The extension .txt in define('PHPDIG_PDF_EXTENSION','.txt'); is only needed if the output goes to a file with a .txt extension. |
Quote:
Am I missing something here?
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdf2txt');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','');
The actual PDF file is linked off of another page, and looking at the server logs I do see the crawler retrieving the PDF document in the first place... just that it's still not indexed at all.
taranta.discpro.org - - [07/Dec/2003:19:14:40 +0000] "HEAD /pdftest/InstrumentPilot39.pdf HTTP/1.1" 200 0 "-" "PhpDig/1.6.2 (PHP; MySql)"
taranta.discpro.org - - [07/Dec/2003:19:14:40 +0000] "GET /pdftest/InstrumentPilot39.pdf HTTP/1.0" 200 1262188 "-" "PHP/4.2.2"
Leland |
Quote:
(Please note that the latest version stated on freshmeat/sourceforge is 1.6.2.. might want to update that when you get a chance.) Will try 1.6.5 and let you know how it goes :) Leland |
hmm..
Version 1.6.5 generates an error 404 when submitting a search on the search page.
laptop1.discpro.org - - [07/Dec/2003:20:36:39 +0000] "GET /phpdig/index.php?template_demo=.%2Ftemplates%2Fphpdig.html&site=0&path=&result_page=index.php&query_string=transition&limite=10&option=start HTTP/1.1" 404 1146 "http://www.discpro.org/phpdig/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" |
disregard that... brain fart
Just tried it with the new version, the PDF content is still not indexed :( :bang: L. |
Hi. All I can find is a Windows version of pdf2txt. Are you using xpdf? Unless renamed, it should be pdftotext that comes with xpdf.
If you have shell access and are allowed to locate, just type locate pdftotext to find the path. The freshmeat listing was updated yesterday. :) |
No.. I'm actually using pdftohtml, since pdftotext and xpdf don't support encrypted PDFs. The only utility that I have available to do this is pdftohtml, which writes its output as HTML (to STDOUT). I therefore use a wrapper around it and call the binary from the wrapper. The wrapper removes the HTML tags, leaving only the plain text... just a simple 4-line perl script:
#!/usr/bin/perl
$filename = shift;
$output = `/usr/local/bin/pdftohtml -i -stdout -noframes $filename`;
$output =~ s/<.*>//g;
print $output;
As a result, to get the text out of the PDF, simply run "pdf2txt myfile.pdf" at the command line, and it outputs the text to STDOUT.
Noted on the freshmeat site... guess I should have waited a day before downloading it then ;)
Really need to sort out this PDF indexing issue though... it's annoying, and I really need the search engine to be able to search based on the contents of a PDF file... there are several other spiders/search engines available in PHP, BUT none of them can do as comprehensive indexing and searching as phpdig... :( Leland |
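A side note on the wrapper above: `s/<.*>//g` is greedy, so on any line holding more than one tag it deletes everything from the first `<` to the last `>`, including the text in between. That would not stop PhpDig from calling the wrapper, but it may drop words from the text that gets indexed. A minimal shell sketch of a tag stripper that stops at the first `>` instead (the function name `strip_tags` is made up for illustration, and tags that span several lines are not handled):

```shell
#!/bin/sh
# Hypothetical helper, not part of PhpDig or pdftohtml.
# Removes each HTML tag separately: "<", then any run of characters
# that are not ">", then the closing ">".
strip_tags() {
    sed -e 's/<[^>]*>//g'
}

# Example: text between two tags on the same line is kept.
# Prints: Engine Management
printf '<b>Engine</b> <i>Management</i>\n' | strip_tags
```

In a wrapper this would sit after the converter call, for example `/usr/local/bin/pdftohtml -i -stdout -noframes "$1" | strip_tags`, reusing the options from the perl script.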
Hi. Just "pdf2txt myfile.pdf" with no .cgi or .pl extension? How does it know to treat it as a perl program?
Try using pdftohtml in define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftohtml'); because I'm thinking PhpDig should clean the results of tags. |
Quote:
For example, if you do it from the command line itself:
leland@taranta:~/public_html/pdftest> /usr/local/bin/pdf2txt InstrumentPilot39.pdf
Engine Management 1 Intelligence Reports 2 Bashing the Beam 6 European Flight Planning 8 Dew Point Review 10 PPL/IR Europe Web Site 12 14 Bert Maes and I attended the engine efficiency and many others. It was very <snip>
---
Having said that, I've just added a little hook in the wrapper to detect whether the wrapper has even been called, but it looks like the spider isn't even attempting to use it, despite the settings in config.php:
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdf2txt');
define('PHPDIG_OPTION_PDF','');
The externals are called with "exec", are they not? If they are, then it should at least fall into the trap, but it looks as if it's not even getting that far. L. |
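For reference, the kind of hook described above can be as small as one line appended to a log file each time the wrapper starts. A sketch in shell, assuming a log path that the web server user can write to (`LOG_FILE` and `log_call` are names made up here, not PhpDig settings):

```shell
#!/bin/sh
# Hypothetical trace hook for a converter wrapper.
# LOG_FILE is an assumed location; it must be writable by the user
# the web server (and therefore the spider) runs as.
LOG_FILE=${LOG_FILE:-/tmp/pdf2txt-calls.log}

# Append a timestamp and the arguments the wrapper received.
log_call() {
    printf '%s called with: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$LOG_FILE"
}

# At the top of the wrapper, before the conversion runs:
log_call "$@"
```

If the log stays empty after a spider run, the wrapper was never started. If it only fills up when the wrapper is run by hand, the permissions or environment of the web server user are worth a look.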
Hi. Yes, exec is being used. I just tried your perl program on my OS (Linux/Apache), but #!/usr/bin/perl does not force the execution of perl programs. I'll play around some more with this.
In any case, try using the following:
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftohtml');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION',''); // as it's STDOUT
The above is used to make a temp file, which is then passed to an index function. In the index function, the temp file should be cleaned of tags. |
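On the question of how the system knows to run the wrapper as perl: on Linux, a script started by its path needs the execute bit set and a `#!` first line naming an interpreter that exists. A rough shell check for both, assuming nothing about PhpDig itself (`check_wrapper` is a made-up helper name):

```shell
#!/bin/sh
# Hypothetical pre-flight check for a script used as an external converter.
# Returns 0 and prints "ok: ..." if the script looks runnable, else 1.
check_wrapper() {
    wrapper=$1
    [ -f "$wrapper" ] || { echo "missing: $wrapper"; return 1; }
    [ -x "$wrapper" ] || { echo "not executable: $wrapper"; return 1; }

    # The first line must start with "#!" for the kernel to pick an interpreter.
    first_line=$(head -n 1 "$wrapper")
    case $first_line in
        '#!'*) ;;
        *) echo "no shebang line: $wrapper"; return 1 ;;
    esac

    # Drop "#!" plus optional spaces, then anything after the first word.
    interpreter=$(printf '%s\n' "$first_line" |
        sed -e 's/^#![[:space:]]*//' -e 's/[[:space:]].*$//')
    [ -x "$interpreter" ] || { echo "interpreter not found: $interpreter"; return 1; }

    echo "ok: $wrapper runs via $interpreter"
}

# Usage (path taken from the config above):
# check_wrapper /usr/local/bin/pdf2txt
```

The result depends on who runs the check, so it says most when run as the same user the web server uses.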
Quote:
Have to, of course, make sure that the perl interpreter is in the right place ;) Quote:
Leland |
Just looking at the output when running the spider:
SITE : http://www.discpro.org/
Exclude paths : - @NONE@
1:http://www.discpro.org/ (time : 00:00:01)
+ + + + + + + + +
level 1...
2:http://www.discpro.org/pdftest/InstrumentPilot39.pdf (time : 00:00:02)
3:http://www.discpro.org/?mode=pgpkey (time : 00:00:02)
<etc>
#3 has the checkmark next to it.. #2 doesn't. Am I to presume that it only indexed the file and not the contents of the file? (It also seemed to do it a little TOO quickly, since it takes at least a few seconds even to convert it from PDF to HTML or text. That tells me that it's not even executing the external binary call.) Leland |
Hi. Try a search on some words in the PDF file. In the search results is there a result for the PDF file? Also, I am on chat right now if you'd rather chat through this.