01-06-2004, 06:39 AM | #1 | |
Green Mole
Join Date: Dec 2003
Posts: 26
|
pdf indexing with pstotext
Hi,
I'm running Apache 1.3.28 with PHP 4.3.4RC1 and PhpDig 1.6.4 (hmm, I should upgrade...). Here is my problem: I have a lot of PDFs and I want them indexed. I've installed pstotext, which works correctly (pstotext "nameofthefile.pdf" prints the contents of the PDF file to STDOUT), and I've changed the PhpDig config file to use it. Quote:
OK...? When I refresh my site in the PhpDig admin, the PDF files are found and seem to be indexed, but when I search for a name that appears in the PDF text, I get no results. So where could the problem be? |
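For anyone repeating the STDOUT check described above without reading the console by eye, here is a minimal shell sketch. The function name extracts_text is hypothetical (it is not part of PhpDig or pstotext), and it assumes pstotext is on the PATH unless another extractor is passed as the second argument.

```shell
# Hypothetical helper (not part of PhpDig): succeeds if the extractor
# prints any non-whitespace text for the given file on STDOUT.
extracts_text() {
    file=$1
    tool=${2:-pstotext}   # assumption: pstotext is on the PATH
    # The exit status is that of grep: 0 only if some text came out.
    "$tool" "$file" 2>/dev/null | grep -q '[^[:space:]]'
}
```

It could be used as: extracts_text nameofthefile.pdf && echo "text reached STDOUT". Note that this only shows what happens for the user running the shell; the web server user may behave differently.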
|
01-06-2004, 11:02 AM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Are you using Windows? If so, set define('USE_IS_EXECUTABLE_COMMAND','0');
Also, are you indexing a page that links to the PDFs or trying to index the PDFs directly?
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
01-06-2004, 01:24 PM | #3 |
Green Mole
Join Date: Dec 2003
Posts: 26
|
I'm running Linux (Mandrake 9.1), but I've reinstalled Apache and PHP from source.
I'm indexing PDFs that are linked from some articles. An example: http://umvf.cochin.univ-paris5.fr/ar...id_article=295 |
01-06-2004, 03:23 PM | #4 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. From that PDF document, I get the following:
Code:
mysql> select keywords.* from engine,keywords where engine.key_id = keywords.key_id and engine.spider_id = xxxxx;
+--------+------------+-----------+
| key_id | twoletters | keyword   |
+--------+------------+-----------+
| xxxxx  | 19         | 1995      |
| xxxxx  | 50         | 500       |
| xxxxx  | 30         | 300       |
| xxxxx  | 80         | 80-100    |
| xxxxx  | in         | infection |
| xxxxx  | na         | nantes    |
| xxxxx  | na         | nancy     |
+--------+------------+-----------+
7 rows in set (0.01 sec)
01-07-2004, 02:46 AM | #5 |
Green Mole
Join Date: Dec 2003
Posts: 26
|
Hmm...?
What am I supposed to search for? Excuse me, but I'm not quite sure. I tried:
SELECT * FROM `keywords` WHERE keyword = '1995';
SELECT * FROM `keywords` WHERE keyword = '500';
....
SELECT * FROM `keywords` WHERE keyword = 'nancy';
but I got no results for some of them, and the words that were found may come from other articles. I also searched for "carayon", who is an author of this PDF, and his name is not found, neither in the MySQL database nor in the search, of course. Sorry, but I really don't know anything about the encoding used for PDF files... I've updated to version 1.6.5, but that made no difference for this problem. |
01-07-2004, 05:22 AM | #6 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Try saving the PDF at http://www.phpdig.net/demo/avare.pdf to your site, link to it from a simple HTML file like the one below, and then crawl that HTML file with search depth one. Now, when you search for Elise, do you see any results?
Code:
<html>
<body>
<a href="http://umvf.cochin.univ-paris5.fr/avare.pdf">test</a>
</body>
</html>
01-07-2004, 05:54 AM | #7 | |
Green Mole
Join Date: Dec 2003
Posts: 26
|
OK, I've put avare.pdf and an HTML page on my site.
I crawled this: Quote:
But when I search for "harpagon", for example... no results. Hmm, is it bad, doc? |
|
01-07-2004, 06:04 AM | #8 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. The avare.pdf file should be good. When you go into the text_content directory, and from shell type
grep -i harpagon * do you see anything?
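The same check can be wrapped so that it lists which indexed files mention a word. This is only a sketch: files_mentioning is a hypothetical name, and the default directory assumes the command is run from the directory that contains text_content.

```shell
# Hypothetical helper: list the files under a directory that contain a
# word, ignoring case (same idea as "grep -i harpagon *").
files_mentioning() {
    word=$1
    dir=${2:-text_content}   # assumption: run from the PhpDig install directory
    # -r recurse, -l print file names only, -i ignore case
    grep -rli -- "$word" "$dir" 2>/dev/null
}
```

For example, files_mentioning harpagon prints one file name per match and prints nothing if the PDF text never reached text_content.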
01-07-2004, 06:53 AM | #9 |
Green Mole
Join Date: Dec 2003
Posts: 26
|
No output from that command.
No harpagon in text_content... |
01-07-2004, 07:29 AM | #10 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Okay, it looks like pstotext is not successfully executing from exec($command,$result,$retval); in the robot_functions.php file. From shell type locate pstotext to check the path. If /usr/local/bin/pstotext is the correct path and the output goes to STDOUT, the configuration you posted looks correct. Right after exec($command,$result,$retval); try adding the following and then reindex the avare2.html:
PHP Code:
01-07-2004, 08:22 AM | #11 |
Green Mole
Join Date: Dec 2003
Posts: 26
|
Hmmm...
I've verified the path to pstotext, and /usr/local/bin/pstotext is right. "The output goes to STDOUT"...? The output of the pstotext command goes directly to the console; is that OK? I now have this code in my robot_functions.php: PHP Code:
Is this OK with the code you gave? I've tried deleting and re-indexing the avare HTML and PDF. I can't see the result of echo $command . "<br>"; and there is still no "harpagon" in text_content or in the search results. Argh... Last edited by zevince; 01-07-2004 at 08:28 AM. |
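Since the path question comes up here, a small shell sketch of that check may help. is_runnable is a hypothetical helper, not something PhpDig provides. It only tests the user who runs it, whereas PHP's exec() runs as the web server user, so the result is most telling when run as that user.

```shell
# Hypothetical helper: succeeds only if the given path is a regular file
# that the current user is allowed to execute.
is_runnable() {
    [ -f "$1" ] && [ -x "$1" ]
}
```

For example: is_runnable /usr/local/bin/pstotext || echo "not executable for $(id -un)".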
01-07-2004, 08:52 AM | #12 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Yes, that is correct. It looks like $usetool remains set to false so the contents of the if statement are not getting executed. In robot_functions.php add the following and delete and reindex avare2.html. What does it output?
PHP Code:
01-08-2004, 02:56 AM | #13 | |
Green Mole
Join Date: Dec 2003
Posts: 26
|
Here is the output:
Quote:
OK, I've tried to trace back through the code in the function phpdigTestUrl, where you set $status. I've verified that the response to the browser is "application/pdf" and that the encoding is iso-8859-1, as I thought, but I don't really understand where the problem is. It seems to stay in HTML mode only and never tries to crawl the PDF? Last edited by zevince; 01-08-2004 at 06:47 AM. |
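The content type can also be checked from the shell instead of the browser. The sketch below assumes the crawler picks its handling from the Content-Type header it receives, which is an assumption here and not confirmed from the PhpDig source. content_type is a hypothetical helper that reads raw response headers on standard input.

```shell
# Hypothetical helper: read HTTP response headers on stdin and print the
# media type from the Content-Type header, lowercased, without parameters
# such as "; charset=...".
content_type() {
    tr -d '\r' |
        awk -F': *' 'tolower($1) == "content-type" {
            split($2, parts, ";")
            print tolower(parts[1])
            exit
        }'
}
```

With curl installed, something like curl -sI "$url" | content_type would show what the server announces for the PDF link. Note that -I sends a HEAD request, which a server may answer differently from the GET a crawler sends.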
|
01-08-2004, 06:56 AM | #14 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. When you go to http://umvf.cochin.univ-paris5.fr/avare2.pdf does your browser open the PDF in the browser window or does your browser prompt you to download the file?
01-08-2004, 07:02 AM | #15 |
Green Mole
Join Date: Dec 2003
Posts: 26
|
It prompts for download in IE, but I think that's due to my Acrobat settings...
But what difference does it make for the bot? |