PDF indexing

lelandv · 12-07-2003, 11:05 AM

Quote:

Originally posted by Charter
Hi. Delete anything in the temp directory, and then try setting the following in the config file:

define('PHPDIG_PDF_EXTENSION','.txt');

Hi. I have a similar problem to the other poster. The difference here is that the debug test does successfully detect that it's a PDF file; it creates the temporary file and then promptly deletes it again.

I have added the define above as per the previous problem, but the contents of the PDF are still not indexed. I'm using "pdftohtml" with a wrapper that removes all HTML formatting, giving PDF -> TEXT (syntax: pdf2txt file.pdf, which writes plain text to STDOUT).

Of course in the database, there is no hint of the contents of the PDF file, thus not indexed... just the filename itself (which is not really what we want here.)

Any help would be appreciated.

Leland

Charter · 12-07-2003, 11:12 AM

Hi. If the output goes to STDOUT, then set define('PHPDIG_PDF_EXTENSION','');

The extension .txt in define('PHPDIG_PDF_EXTENSION','.txt'); is only needed if the output goes to a file with a .txt extension.
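
To make the STDOUT-versus-file distinction concrete, here is a minimal Python sketch of the two modes. It is an illustration only, not PhpDig's actual code: the helper name `read_converter_output` and its parameters are hypothetical, and the only assumption is that an empty extension means "capture STDOUT" while a non-empty one means "read the file the converter wrote next to the source file". The stand-in converters are built from the Python interpreter so that the sketch runs without pdftotext or pdftohtml installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def read_converter_output(command, source, extension=""):
    """Run a converter on `source` and return the text it produced.

    Hypothetical helper: an empty `extension` means the converter prints to
    STDOUT; a non-empty one means it writes `source + extension` itself.
    """
    result = subprocess.run(
        list(command) + [str(source)],
        capture_output=True, text=True, check=True,
    )
    if extension == "":
        return result.stdout
    return Path(str(source) + extension).read_text()


if __name__ == "__main__":
    # Stand-in converters (assumptions, not real PDF tools).
    to_stdout = [sys.executable, "-c", "print('text on stdout')"]
    to_file = [sys.executable, "-c",
               "import sys; open(sys.argv[1] + '.txt', 'w').write('text in file')"]
    with tempfile.TemporaryDirectory() as tmp:
        pdf = Path(tmp) / "sample.pdf"
        pdf.write_bytes(b"%PDF-1.4")
        print(read_converter_output(to_stdout, pdf, ""))
        print(read_converter_output(to_file, pdf, ".txt"))
```

If the extension setting does not match what the converter really does, the reader ends up looking in the wrong place: an empty STDOUT in one case, a missing file in the other.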

lelandv · 12-07-2003, 11:18 AM

Quote:

Originally posted by Charter
Hi. If the output goes to STDOUT, then set define('PHPDIG_PDF_EXTENSION','');

The extension .txt in define('PHPDIG_PDF_EXTENSION','.txt'); is only needed if the output goes to a file with a .txt extension.

Hiya.. I've done this, but the PDF file is still not indexed.. just the filename

Am I missing something here?

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdf2txt');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','');

The actual PDF file is linked from another page, and the server logs do show the crawler retrieving the PDF document... it's just still not indexed at all.

taranta.discpro.org - - [07/Dec/2003:19:14:40 +0000] "HEAD /pdftest/InstrumentPilot39.pdf HTTP/1.1" 200 0 "-" "PhpDig/1.6.2 (PHP; MySql)"
taranta.discpro.org - - [07/Dec/2003:19:14:40 +0000] "GET /pdftest/InstrumentPilot39.pdf HTTP/1.0" 200 1262188 "-" "PHP/4.2.2"

Leland

Charter · 12-07-2003, 11:41 AM

Quote:

Originally posted by lelandv
taranta.discpro.org - - [07/Dec/2003:19:14:40 +0000] "HEAD /pdftest/InstrumentPilot39.pdf HTTP/1.1" 200 0 "-" "PhpDig/1.6.2 (PHP; MySql)"
taranta.discpro.org - - [07/Dec/2003:19:14:40 +0000] "GET /pdftest/InstrumentPilot39.pdf HTTP/1.0" 200 1262188 "-" "PHP/4.2.2"

Hi. Please update to PhpDig 1.6.5 and try it. If you added the l_time column to the logs table already, then no database changes need to be made in the update. For the files, just reconfigure the connect and config files, and FTP over the PHP files, except for install.php unless you want that file online. BTW, what OS are you running?

lelandv · 12-07-2003, 12:26 PM

Quote:

Originally posted by Charter
Hi. Please update to PhpDig 1.6.5 and try it. If you added the l_time column to the logs table already, then no database changes need to be made in the update. For the files, just reconfigure the connect and config files, and FTP over the PHP files, except for install.php unless you want that file online. BTW, what OS are you running?

Debian Linux for the OS with Apache for the server.

(Please note that the latest version stated on freshmeat/sourceforge is 1.6.2.. might want to update that when you get a chance.)

Will try 1.6.5 and let you know how it goes

Leland

lelandv · 12-07-2003, 12:39 PM

hmm..

Version 1.6.5 generates a 404 error when submitting a search from the search page.

laptop1.discpro.org - - [07/Dec/2003:20:36:39 +0000] "GET /phpdig/index.php?template_demo=.%2Ftemplates%2Fphpdig.html&site=0&path=&result_page=index.php&query_string=transition&limite=10&option=start HTTP/1.1" 404 1146 "http://www.discpro.org/phpdig/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

lelandv · 12-07-2003, 12:41 PM

disregard that... brain fart

Just tried it with the new version, the PDF content is still not indexed

L.

Charter · 12-07-2003, 01:02 PM

Hi. All I can find is a Windows version of pdf2txt. Are you using xpdf? Unless renamed, it should be pdftotext that comes with xpdf.

If you have shell access and are allowed to run locate, just type locate pdftotext to find the path.

The freshmeat listing was updated yesterday.

lelandv · 12-07-2003, 01:09 PM

No.. I'm actually using pdftohtml, since pdftotext and xpdf don't support encrypted PDFs. The only utility I have available for this is pdftohtml, which writes HTML to STDOUT. I therefore call the binary from a wrapper that removes the HTML tags, leaving only the plain text... just a simple 4-line perl script:

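#!/usr/bin/perl

# PDF path is the first command-line argument
$filename = shift;
# Convert to HTML on STDOUT (-i: ignore images, -noframes: no frameset)
$output = `/usr/local/bin/pdftohtml -i -stdout -noframes $filename`;

# Strip tags; [^>]* keeps each match inside one tag (a greedy .* would also eat text between tags)
$output =~ s/<[^>]*>//g;
print $output;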

As a result, to get the text out of the pdf, simply "pdf2txt myfile.pdf" at the command line, and it outputs the text to STDOUT.
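
One thing worth checking in any tag-stripping wrapper of this kind is whether the pattern is greedy. The Python sketch below compares a greedy pattern (`<.*>`) with a per-tag pattern (`<[^>]*>`). The sample line is made up for illustration and is not real pdftohtml output; the function names are hypothetical.

```python
import re


def strip_tags_greedy(html):
    # '.' cannot cross a newline, but within one line the match runs from
    # the first '<' to the last '>', taking any text in between with it.
    return re.sub(r"<.*>", "", html)


def strip_tags(html):
    # [^>]* stops at the first '>', so each tag is removed on its own.
    return re.sub(r"<[^>]*>", "", html)


if __name__ == "__main__":
    line = "<b>Engine Management</b> 1<br>"  # hypothetical sample line
    print(repr(strip_tags_greedy(line)))
    print(repr(strip_tags(line)))
```

A line that only ends in a tag survives the greedy pattern, which is why such a wrapper can look fine on most of a document and still silently drop lines that start and end with a tag.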

Noted on the freshmeat site... guess I should have waited a day before downloading it then

Really need to sort out this PDF indexing issue though... it's annoying, and I really need the search engine to be able to search the contents of a PDF file. There are several other spiders/search engines available in PHP, BUT none of them can do indexing and searching as comprehensively as PhpDig...

Leland

Charter · 12-07-2003, 01:46 PM

Hi. Just "pdf2txt myfile.pdf" with no .cgi or .pl extension? How does it know to treat it as a perl program?

Try using pdftohtml in define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftohtml'); because I'm thinking PhpDig should clean the results of tags.

lelandv · 12-07-2003, 01:52 PM

Quote:

Originally posted by Charter
Hi. Just "pdf2txt myfile.pdf" with no .cgi or .pl extension? How does it know to treat it as a perl program?

Try using pdftohtml in define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftohtml'); because I'm thinking PhpDig should clean the results of tags.

The permissions on the wrapper are 0755 (executable), and the first line contains #!/usr/bin/perl, which tells the system to run it with perl.
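
Both conditions mentioned here (the execute bit and a shebang pointing at an interpreter that actually exists) can be checked mechanically. The Python sketch below is one way to do that; `check_wrapper` is a hypothetical helper, and the result only reflects the user running the check, which may differ from the user the web server runs PHP as.

```python
import os
import tempfile


def check_wrapper(path):
    """Return (is_executable, interpreter, interpreter_exists) for a script."""
    is_executable = os.access(path, os.X_OK)
    with open(path, "rb") as handle:
        first_line = handle.readline().decode("utf-8", "replace").strip()
    if not first_line.startswith("#!"):
        return is_executable, None, False
    parts = first_line[2:].split()
    interpreter = parts[0] if parts else None
    exists = interpreter is not None and os.path.isfile(interpreter)
    return is_executable, interpreter, exists


if __name__ == "__main__":
    # Demo on a throwaway script; the real wrapper path would go here instead.
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "pdf2txt")
        with open(script, "w") as handle:
            handle.write("#!/usr/bin/perl\nprint \"hello\\n\";\n")
        os.chmod(script, 0o755)
        print(check_wrapper(script))
```

If the interpreter path in the first line does not exist on the machine, the script cannot be started directly even though its permissions look right.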

Running it from the command line itself works, for example:

leland@taranta:~/public_html/pdftest> /usr/local/bin/pdf2txt InstrumentPilot39.pdf

Engine Management
1
Intelligence Reports
2
Bashing the Beam
6
European Flight Planning
8
Dew Point Review
10
PPL/IR Europe Web Site
12
14
Bert Maes and I attended the engine efficiency and many others. It was very

<snip>

---
Having said that, I've just added a little hook in the wrapper to detect if the wrapper has even been called, but it looks like the spider isn't even attempting to use it.

Despite the settings in config.php:
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdf2txt');
define('PHPDIG_OPTION_PDF','');

The externals are called with "exec", are they not? If they are, then it should at least fall into the trap, but it looks as if it's not even getting that far.
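
The hook itself is not shown in the thread. As a rough idea of what such a trap can look like, here is a Python sketch of a wrapper that records every invocation before handing off to the converter. The log path and the names `record_call` and `main` are assumptions made for illustration; the converter command line is the one quoted earlier in the thread. `os.getuid` is POSIX-only, which matches the Debian setup described here.

```python
import datetime
import os
import subprocess
import sys

# Hypothetical log location; it must be writable by the web server user.
LOG_PATH = "/tmp/pdf2txt-calls.log"
CONVERTER = ["/usr/local/bin/pdftohtml", "-i", "-stdout", "-noframes"]


def record_call(args, log_path=LOG_PATH):
    """Append one line noting when, under which uid, and with what arguments we ran."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a") as log:
        log.write(f"{stamp} uid={os.getuid()} args={args!r}\n")


def main(argv):
    record_call(argv[1:])  # the trap: proves the caller reached the wrapper
    if len(argv) < 2:
        return 2
    return subprocess.call(CONVERTER + [argv[1]])


# Only act when invoked as a script with a file argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

An empty log after a spider run means the wrapper was never started at all; a log line with an unexpected uid or argument points at permissions or at the path being passed in.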

L.

Charter · 12-07-2003, 02:03 PM

Hi. Yes, exec is being used. I just tried your perl program on my OS (Linux/Apache), but there #!/usr/bin/perl does not force the program to run under perl. I'll play around some more with this.

In any case, try using the following:

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftohtml');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION',''); // as it's STDOUT

The above is used to make a temp file, which is then passed to an index function. In the index function, the temp file should be cleaned of tags.

lelandv · 12-07-2003, 02:11 PM

Quote:

Originally posted by Charter
Hi. Yes, exec is being used. I just tried your perl program on my OS (Linux/Apache), but there #!/usr/bin/perl does not force the program to run under perl. I'll play around some more with this.

You have to, of course, make sure that the perl interpreter is in the right place.

Quote:

In any case, try using the following:

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftohtml');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION',''); // as it's STDOUT

The above is used to make a temp file, which is then passed to an index function. In the index function, the temp file should be cleaned of tags.

Did this as you suggested... still no index of the file contents, just the filename. It's as if it's not even bothering to look inside the file if it's a .PDF.

Leland

lelandv · 12-07-2003, 02:16 PM

Just looking at the output when running the spider:

SITE : http://www.discpro.org/
Exclude paths :
- @NONE@
1:http://www.discpro.org/
(time : 00:00:01)
+ + + + + + + + +
level 1...
2:http://www.discpro.org/pdftest/InstrumentPilot39.pdf
(time : 00:00:02)

3:http://www.discpro.org/?mode=pgpkey
(time : 00:00:02)

<etc>

#3 has the checkmark next to it.. #2 doesn't.
Am I to presume that it only indexed the file and not the contents of the file?

(It also seemed to finish a little TOO quickly, since converting the PDF to HTML or text takes at least a few seconds. That tells me it's not even executing the external binary.)

Leland

Charter · 12-07-2003, 02:20 PM

Hi. Try a search on some words in the PDF file. In the search results is there a result for the PDF file? Also, I am on chat right now if you'd rather chat through this.
