PDF indexing

aryan · 11-26-2003, 07:45 AM

Hi,

I have a problem indexing pdf files.
I have tried both the pstotext and pdftotext binaries.

Both binaries work fine on the command line.

Using pdftotext like this in the config file:
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pdftotext');
define('PHPDIG_OPTION_PDF','-q');

All PDF files are saved as txt files in phpdig/Admin/temp (shouldn't PhpDig delete these automatically?) but still no text from the PDFs can be found in the search form.

What can be the cause of this?

I started with 1.6.2 and now I'm using 1.6.4, but both have the same problem. This is on Mac OS X 10.2.6.

/Aryan

Charter · 11-26-2003, 07:54 AM

Hi. Delete anything in the temp directory, and then try setting the following in the config file:

define('PHPDIG_PDF_EXTENSION','.txt');
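
Put together with the settings from the first post, the PDF part of the config file would look roughly like the sketch below. The path is the one aryan reported and may differ on other systems; the extension is written with its leading period.

```php
<?php
// Sketch of the PDF-related settings discussed in this thread (PhpDig 1.6.x).
define('PHPDIG_INDEX_PDF', true);
define('PHPDIG_PARSE_PDF', '/usr/local/bin/pdftotext'); // path from the first post; yours may differ
define('PHPDIG_OPTION_PDF', '-q');
define('PHPDIG_PDF_EXTENSION', '.txt'); // leading period included
```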

aryan · 11-26-2003, 08:21 AM

Thank you for your fast reply!

Yes, they were .txt files. I deleted them and changed the config file as you suggested.

I still seem to have the same problem though.

The files in temp are for example named: 14377172.tmp.txt

/Aryan

Charter · 11-26-2003, 08:38 AM

Hi. That looks right. Hmm, is '/usr/local/bin/pdftotext' the correct path to the pdftotext binary?

aryan · 11-26-2003, 09:36 AM

Yes, that is the correct path. I thought those txt files were created by pdftotext and that they were proof that pdftotext works?

Could it have something to do with character encoding? The tmp.txt files are ISO-8859-1 encoded and Mac OS uses the MacRoman character set.

I haven't solved correct encoding in the HTML either, but there it basically works.

/Aryan

Charter · 11-26-2003, 10:27 AM

Oh, silly me.

It looks like the chunk of code below from robot_functions.php may be failing for you. If you echo out $retval (0 is success, anything else is failure), what does it give? Also, just to be sure, is define('PHPDIG_PDF_EXTENSION','.txt'); set with .txt (including the period)?

PHP Code:

    if ($usetool) { // does this
        rename($tempfile1,$tempfile2); // does this
        exec($command,$result,$retval); // does this
        unlink($tempfile2); // does this
        if (!$retval) { // hmm
            // the replacement if š is for unbreaking spaces
            // returned by catdoc parsing msword files
            // and '0xAD' "tiret quadratin" returned by pstotext
            // in iso-8859-1
            // Adjust with your encoding and/or your tools
            if ((is_array($result)) && (count($result) > 0)) {
                $f_handler = fopen($tempfile1,'wb');
                fwrite($f_handler,str_replace('š',' ',str_replace(chr(0xad),'-',implode(' ',$result))));
                fclose($f_handler);
            }
        }
        else {
            return array('tempfile'=>0,'tempfilesize'=>0);
        }
    }
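
For anyone who wants to check the exit status without echoing into the admin page, a minimal standalone sketch follows. It assumes a modern PHP (7 or later) and a hypothetical helper name, run_converter; it is not PhpDig code. It shows that the third argument of exec() receives the command's exit status, where 0 means success.

```php
<?php
// Hypothetical wrapper, not part of PhpDig: run an external converter
// and record a non-zero exit status in the PHP error log.
function run_converter(string $command): array
{
    $output = [];
    $status = 0;
    exec($command, $output, $status); // $status receives the exit status
    if ($status !== 0) {
        error_log("converter failed with exit status $status: $command");
    }
    return ['status' => $status, 'lines' => $output];
}

// Example call with a harmless command standing in for pdftotext.
$result = run_converter('echo hello');
```

Writing to the error log rather than to the page keeps the crawler output readable, and it also makes it obvious when the edited copy of the file is not the one being executed, because nothing shows up in the log.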

aryan · 11-26-2003, 02:07 PM

Sorry for the delay, I couldn't test right away. I had forgotten the period in '.txt'. Now that I have the period the temp dir is not full of files after an attempt to index anymore.

But indexing of PDFs ends after the fifth PDF when I try to index a directory with 130 PDFs. I get 7 files: one index (html), the index of the parent directory (html) and 5 PDFs indexed. The directory "text_content" contains 7 txt files (1.txt, 2.txt etc.); the first 6 are readable but the last file, "7.txt", is full of unreadable junk. Only the first line is readable and then it continues with "° ¢£§! ¢£ ©¢ §£ ¶ ¶ ¶ "3 # $ &' •• ß ® ®¶ ©¢ ¢£ % @3 @ 3F )01)12A021B2" etc.

I didn't succeed with the debugging line, I tried:

PHP Code:

    if ($usetool) {
        rename($tempfile1,$tempfile2);
        exec($command,$result,$retval);
        unlink($tempfile2);
        echo "[h1]$retval[H1]";
        if (!$retval) {
            // the replacement if ö is for unbreaking spaces
            // returned by catdoc parsing msword files
            // and '0xAD' "tiret quadratin" returned by pstotext
            // in iso-8859-1
            // Adjust with your encoding and/or your tools
            if ((is_array($result)) && (count($result) > 0)) {
                $f_handler = fopen($tempfile1,'wb');
                fwrite($f_handler,str_replace('ö',' ',str_replace(chr(0xad),'-',implode(' ',$result))));
                fclose($f_handler);
            }
        }
        else {
            return array('tempfile'=>0,'tempfilesize'=>0);
        }
    }

but I never saw a "0" or a "1". Does that mean they are 0 all the time?

/thanks Aryan

Charter · 11-26-2003, 02:50 PM

Hi. Is there a link from your pages to each of the 130 PDF files? PhpDig won't find them on its own unless there are links to them. As you can now find search results from some PDF files, it looks like $retval contains success. The following is from the man page of pdftotext version 1.0.1. Perhaps that is why the seventh file is messed up.

BUGS
Some PDF files contain fonts whose encodings have been mangled
beyond recognition. There is no way (short of OCR) to extract
text from these files.
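
Since pdftotext can return junk like this while still reporting success, one option is to check the extracted text before it reaches the index. The sketch below is only a rough heuristic under stated assumptions: looks_garbled is a hypothetical helper (not part of PhpDig or pdftotext), the text is assumed to be UTF-8, and the 0.5 threshold is an arbitrary starting point.

```php
<?php
// Hypothetical helper, not part of PhpDig: guess whether extracted text is junk
// by measuring how many whitespace-separated tokens look like words.
// Assumes $text is UTF-8; the default threshold is an arbitrary starting point.
function looks_garbled(string $text, float $minWordRatio = 0.5): bool
{
    // Split on whitespace and drop empty tokens.
    $tokens = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($tokens === false || count($tokens) === 0) {
        return true; // nothing usable was extracted
    }
    $wordLike = 0;
    foreach ($tokens as $token) {
        // A "word" here is two or more letters, optionally followed by one punctuation mark.
        if (preg_match('/^\p{L}{2,}[.,;:!?]?$/u', $token) === 1) {
            $wordLike++;
        }
    }
    return ($wordLike / count($tokens)) < $minWordRatio;
}
```

A file flagged this way could be skipped or logged for manual review; ISO-8859-1 output would need converting to UTF-8 first, otherwise the check treats it as garbled.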

aryan · 11-26-2003, 03:10 PM

Hi,

It is a normal Apache file listing that I use as the URL to index (1 level), so all PDFs are linked.

You're totally right about the bug in pdftotext: when I tried the same PDF file on the command line I got the same junk text.

Still, why does it stop indexing after this file?

/Aryan

Charter · 11-26-2003, 03:37 PM

Hmm, good question. I'm not sure of the inner workings of the pdftotext bug. Maybe when the bug is encountered it stops processes that were trying to run it. What happens when you remove that file from the listing? How far does PhpDig get then?

aryan · 11-27-2003, 01:44 AM

Thank you very much, Charter, for all the help and support!!

Deleting the suspected file did help, and all PDFs in the directory are eventually indexed. But when I deleted all text files and databases and tried it again with the suspected PDF, it worked as well! Now I'm a bit confused; maybe I just didn't give it enough time to index?

As for the debugging test

PHP Code:

    echo "[h1]$retval[H1]";

Stupid me tried it in the wrong copy of robot_functions.php (!), sorry.

Now all seems fine but I stumble on another issue. If I do a search on a common keyword I always get only 5 hits!

I also have a problem with the high ASCII characters for example å, ä, ö that are common in Swedish. Searching for words containing those characters doesn't work. I think I have to dig into the htmlentities function because that doesn't work well with MacRoman character set, or is there anything else that I should look into?

/Aryan

Charter · 11-27-2003, 07:51 AM

Hi. Maybe it's a random bug? Anyway, glad it's working now. For the number of results, just change define('NUMBER_OF_RESULTS_PER_SITE',5); in the config file. For the å, ä, ö characters, are you using iso-8859-1? If you search on a word with å, ä, ö in it, PhpDig should use a, a, o when it does its internal search. Can you post a word that doesn't work and a link to your search?
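
To illustrate the a, a, o folding mentioned above, here is a minimal sketch. It is not PhpDig's own routine: fold_swedish is a hypothetical helper, and it assumes the word is ISO-8859-1 encoded, one byte per letter. Text in MacRoman or UTF-8 would not match these byte values, which is one way a search on such words can fail.

```php
<?php
// Hypothetical helper, not PhpDig's own routine: fold Swedish letters
// to plain ASCII. Assumes $word is ISO-8859-1 encoded (one byte per letter).
function fold_swedish(string $word): string
{
    // å ä ö Å Ä Ö as ISO-8859-1 bytes, mapped byte for byte.
    return strtr($word, "\xE5\xE4\xF6\xC5\xC4\xD6", "aaoAAO");
}

echo fold_swedish("sm\xE5 gr\xF6na \xE4pplen"), "\n"; // prints "sma grona applen"
```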
