07-22-2005, 10:10 AM | #1 |
Green Mole
Join Date: Jul 2005
Posts: 14
|
PDF indexing blocked
Hi,
I'm trying to index a PDF file that I know for sure exists, e.g. http://..../foo.pdf. The console prints this:
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: c:/bin/pdftotext.exe
Does parse pdf exist:
And then it stays blocked. What is happening? Thanks in advance.
Last edited by pascalp; 07-22-2005 at 10:13 AM. |
07-22-2005, 10:19 AM | #2 |
Green Mole
Join Date: Jul 2005
Posts: 14
|
Sorry, the output is actually this:
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: c:/bin/pdftotext.exe
Does parse pdf exist: 1 |
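A note on reading this output: the lines look like the result of echoing PHP booleans, and PHP prints true as 1 and false as an empty string. If that is the case here, the blank "Does parse pdf exist:" in the first post would simply mean the check returned false at that moment. A minimal sketch of such checks follows; describe_binary and bool_label are hypothetical helpers for illustration, not PhpDig functions.

```php
<?php
// Hypothetical helpers for illustration; not part of PhpDig.

// Report whether a path exists and whether PHP considers it executable.
function describe_binary(string $path): array
{
    return [
        'exists'     => file_exists($path),
        'executable' => is_executable($path),
    ];
}

// Print booleans explicitly, because echoing false prints an empty string.
function bool_label(bool $value): string
{
    return $value ? 'yes' : 'no';
}

// The path from the thread; adjust it to your own install.
$info = describe_binary('c:/bin/pdftotext.exe');
echo 'Does parse pdf exist: ' . bool_label($info['exists']) . "\n";
echo 'Is parse pdf executable: ' . bool_label($info['executable']) . "\n";
```

Printing "yes" or "no" instead of the raw boolean makes a failed check visible rather than leaving the line looking unfinished.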
07-23-2005, 12:57 PM | #3 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
In robot_functions.php, find the appropriate $command variable:
Code:
// it can have _PDF or _MSWORD or _MSEXCEL depending on binary
$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2;
and append ' 2>&1' to it, so that any error output from the binary is captured as well:
Code:
// it can have _PDF or _MSWORD or _MSEXCEL depending on binary
$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2.' 2>&1';
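To see what the redirect changes, here is a small standalone sketch. It assumes exec() is enabled and that the script runs from the command line; run_and_capture is a hypothetical helper, not something PhpDig provides. The child process writes only to stderr, which exec() would not collect without ' 2>&1'.

```php
<?php
// Hypothetical helper, not part of PhpDig: run a command and collect
// everything it prints, with stderr merged into stdout by ' 2>&1'.
function run_and_capture(string $command): array
{
    $output = [];
    $status = 0;
    exec($command . ' 2>&1', $output, $status);
    return ['status' => $status, 'output' => $output];
}

// Demo: a child PHP process that writes only to stderr and exits with 3.
$child  = "file_put_contents('php://stderr', 'oops'); exit(3);";
$result = run_and_capture(escapeshellarg(PHP_BINARY) . ' -r ' . escapeshellarg($child));

echo 'exit status: ' . $result['status'] . "\n";
echo implode("\n", $result['output']) . "\n";
```

The same helper could be pointed at the pdftotext command line to check its exit status and messages outside the spider.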
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
07-24-2005, 04:31 AM | #4 |
Green Mole
Join Date: Jul 2005
Posts: 14
|
I had already changed that line but nothing more is displayed...
|
07-24-2005, 05:00 AM | #5 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Is it just one PDF that won't index, or all of them? If just one, how large is the file? Maybe you are running out of memory. Try uncommenting error_reporting(E_ALL); in the config file, and see whether a memory error occurs when you reindex the PDF file.
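As a rough sketch of that check, the snippet below turns on full error reporting and prints the memory settings PHP is running with. It assumes you can run a small script under the same PHP configuration as the spider; memory_summary is a hypothetical helper, not a PhpDig function.

```php
<?php
// Report every notice, warning and error, and show them in the output.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical helper: summarize the memory settings relevant to a large parse.
function memory_summary(): array
{
    return [
        'memory_limit' => ini_get('memory_limit'),     // e.g. "128M", or "-1" for unlimited
        'peak_bytes'   => memory_get_peak_usage(true), // peak memory allocated so far
    ];
}

$summary = memory_summary();
echo 'memory_limit: ' . $summary['memory_limit'] . "\n";
echo 'peak usage:   ' . $summary['peak_bytes'] . " bytes\n";
```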
|
07-24-2005, 06:18 AM | #6 |
Green Mole
Join Date: Jul 2005
Posts: 14
|
No PDF file will index.
The PDF file I tried is 796 KB, but I also tried another one that is 350 KB and it won't index either. I'll try the error_reporting... |
07-24-2005, 06:23 AM | #7 |
Green Mole
Join Date: Jul 2005
Posts: 14
|
I tried the "error_reporting" setting... nothing more is displayed.
Besides, why isn't the line "Is parse pdf executable:" displayed? |
07-25-2005, 09:09 AM | #8 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
If you are only getting the following to print out, then check the code edits again to see if echo "Is parse pdf executable: " . is_executable(PHPDIG_PARSE_PDF) . "<br>"; is in there:
Quote:
|
07-25-2005, 12:47 PM | #9 |
Green Mole
Join Date: Jul 2005
Posts: 14
|
I finally got the PDF indexing working...
BUT it only indexes a PDF when I give the full URL to that PDF. I have a page, http://www.ville-magny-les-hameaux.f...ain_public.htm, which contains 2 simple PDF links. The spider doesn't find any PDF link in it, even though there are at least two obvious ones. I use "no" with the depth and links parameters both set to 20. Any ideas? |
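One way to confirm that the page really exposes plain PDF links, independently of PhpDig, is to extract them from the HTML yourself. This is only a rough sketch: find_pdf_links is a hypothetical helper, the regular expression handles simple anchor tags only, and it says nothing about how PhpDig's own link detection works.

```php
<?php
// Hypothetical helper, not part of PhpDig: list the href values of <a> tags
// whose path ends in .pdf (case-insensitive). Simple anchors only.
function find_pdf_links(string $html): array
{
    preg_match_all('/<a\b[^>]*\bhref\s*=\s*["\']?([^"\'\s>]+)/i', $html, $matches);

    $pdfs = [];
    foreach ($matches[1] as $href) {
        $path = parse_url($href, PHP_URL_PATH);
        if (is_string($path)
            && strtolower(pathinfo($path, PATHINFO_EXTENSION)) === 'pdf') {
            $pdfs[] = $href;
        }
    }
    return array_values(array_unique($pdfs));
}

// Sample markup for illustration; in practice, pass in the fetched page source.
$html = '<a href="ae.pdf">A</a> <a href=\'dc5.PDF\'>B</a> <a href="ae.doc">C</a>';
print_r(find_pdf_links($html));
```

If the links show up here but not in the spider output, the page markup is probably not the cause.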
07-25-2005, 01:48 PM | #10 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
When you use the code in this post, does any error message print onscreen?
|
07-25-2005, 11:43 PM | #11 |
Green Mole
Join Date: Jul 2005
Posts: 14
|
No error message indeed...
|
07-29-2005, 09:51 AM | #12 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Tried a test on your site with search depth one and links per four, and got the below output. Try using...
Code:
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','c:/bin/pdftotext.exe');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','.txt');

Spidering in progress... [Stop spider]
SITE : http://www.ville-magny-les-hameaux.fr/
Exclude paths :
- library
- moteur
- Pics
- plan_site
- x_element_base
- a_mieux_connaitre/jpg
- a_mieux_connaitre/geo/jpg
- a_mieux_connaitre/histoire/jpg
- a_mieux_connaitre/magny_chiffres/jpg
- a_mieux_connaitre/patrimoine/jpg
- a_mieux_connaitre/vie_municipale/jpg
- actualite/jpg
- b_vie_pratique/jpg
- b_vie_pratique/se_deplacer/jpg
- b_vie_pratique/serv_public/jpg
- c_vie_eco/jpg
- d_vie_cult_sport/jpg
- e_vie_associative/jpg
Wait...
1:http://www.ville-magny-les-hameaux.fr/actualite/com_public/main_public.htm (time : 00:00:10)
+ + + + + + +
level 1...
Wait...
2:http://www.ville-magny-les-hameaux.fr/actualite/com_public/ae.pdf (time : 00:00:36)
+ + +
Wait...
3:http://www.ville-magny-les-hameaux.fr/actualite/com_public/ae.doc (time : 00:00:52)
Wait...
4:http://www.ville-magny-les-hameaux.fr/actualite/com_public/dc5.doc (time : 00:01:12)
Wait...
5:http://www.ville-magny-les-hameaux.fr/actualite/com_public/dc5.pdf (time : 00:01:28)
level 2...
links found : 5
http://www.ville-magny-les-hameaux.fr/actualite/com_public/main_public.htm
http://www.ville-magny-les-hameaux.fr/actualite/com_public/ae.pdf
http://www.ville-magny-les-hameaux.fr/actualite/com_public/ae.doc
http://www.ville-magny-les-hameaux.fr/actualite/com_public/dc5.doc
http://www.ville-magny-les-hameaux.fr/actualite/com_public/dc5.pdf
Optimizing tables...
Indexing complete !
[Back] to admin interface.
|
07-31-2005, 03:13 PM | #13 |
Green Mole
Join Date: Jul 2005
Posts: 14
|
I already use the code you give here.
The result is that "the page has been recently indexed", so it doesn't index it again. As you saw, there are 2 PDFs in it. My spider found no PDF. It only found dc5.pdf because I gave it that URL explicitly... Any ideas? |
07-31-2005, 03:26 PM | #14 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Do you have shell or command line access? If so, then "touch" the PDFs to give them a new file date. Otherwise, if you are making your PDFs, resave and FTP a new version over, so the file appears updated. See if this will let you reindex the PDFs. Of course, if the PDFs haven't changed content, no reindex is necessary, and PhpDig does look for a "Last-Modified" date. One other thing is that you should be able to delete a page/document from the PhpDig admin panel, so if you want to reindex without touching the file, try a delete and then a reindex, both from the admin panel.
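Without shell access, PHP's own touch() can do the same job. The sketch below assumes you can run a script on the server that hosts the PDFs, and that the web server derives "Last-Modified" from the file's modification time, which is common for static files but is an assumption here. It uses a temporary file rather than a real PDF.

```php
<?php
// Create a throwaway file and pretend it was last modified an hour ago.
$file = tempnam(sys_get_temp_dir(), 'pdf');
touch($file, time() - 3600);
clearstatcache();
$before = filemtime($file);

// touch() without a timestamp sets the modification time to "now".
touch($file);
clearstatcache();
$after = filemtime($file);

echo 'before: ' . gmdate('D, d M Y H:i:s', $before) . " GMT\n";
echo 'after:  ' . gmdate('D, d M Y H:i:s', $after) . " GMT\n";

// Clean up the temporary file.
unlink($file);
```

On a real install you would call touch() on the PDF's path instead of the temporary file.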
|
08-11-2005, 02:36 AM | #15 |
Green Mole
Join Date: Jul 2005
Posts: 14
|
I deleted the page from the PhpDig admin panel and tried to reindex... it indexes the HTML file itself but doesn't index the PDF links in it. As I said earlier, when I index the PDF URL directly, it works...
For information, there is no timeout problem. |