10-13-2005, 10:22 AM | #1 |
Green Mole
Join Date: Oct 2005
Posts: 5
|
PDF indexing Problem (pdftotext)
Have an intranet site with linked PDFs in several directories under a directory called policies. Can't get phpdig to index the PDFs.
Went down the checklist and everything checks out. Not sure where to go from here. Here is the output from the echo statements. Thanks
-----------
SITE : http://192.168.13.80/
Exclude paths :
- @NONE@
Is result test http an array: 1
What is result test http status: HTML
Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
1:http://192.168.13.80/policies/
(time : 00:00:05)
+ level 1...
Is result test http an array: 1
What is result test http status: HTML
Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Duplicate of an existing document
2:http://192.168.13.80/policies/index.php
(time : 00:00:16)
No link in temporary table
links found : 2
http://192.168.13.80/policies/
http://192.168.13.80/policies/index.php
Optimizing tables...
Indexing complete ! |
10-14-2005, 09:03 AM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
The output looks okay. In the config file, if you set LIMIT_TO_DIRECTORY to false then does it index the PDF files?
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
10-17-2005, 03:25 AM | #3 |
Green Mole
Join Date: Oct 2005
Posts: 5
|
Sorry for the delay in responding.
That helped by indexing a few of them, but it did not index all of them. Any other thoughts? |
10-19-2005, 07:49 AM | #4 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Try setting PHPDIG_IN_DOMAIN to true, LIMIT_TO_DIRECTORY to false, both in the config file, and then from the admin panel, use a large search depth, set links per to zero, and choose the no option. You can increase search depth beyond twenty by editing SPIDER_MAX_LIMIT in the config file. Once indexing completes, you should see an 'indexing complete' message onscreen. If, when indexing PDFs, the process seems to die in the middle, it might be a memory issue like in this thread.
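As a rough sketch of the config changes described above, assuming the config file sets these options with PHP define() calls (the value 40 is only an example, not a recommended setting):

```php
<?php
// Sketch of the three settings named above, assuming define()-style config.
// Only the constant names come from the post; the depth value is an example.
define('PHPDIG_IN_DOMAIN', true);    // set to true as suggested
define('LIMIT_TO_DIRECTORY', false); // set to false as suggested
define('SPIDER_MAX_LIMIT', 40);      // example value, to allow a search depth beyond 20
```

If the constants are already defined further up in the config file, edit those existing lines rather than adding new ones, since a constant cannot be defined twice.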
10-20-2005, 04:11 AM | #5 |
Green Mole
Join Date: Oct 2005
Posts: 5
|
Made those changes - back to square one. The indexing finishes but skips all the PDFs.
Before I made the last suggested config changes, the site was locked after indexing a few of the PDFs and I had to stop the spider in the admin panel. |
10-20-2005, 07:05 AM | #6 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Try double checking that PHPDIG_IN_DOMAIN is set to true and LIMIT_TO_DIRECTORY is set to false. The latter should already be false from post two, but maybe it got switched back to true.
Also, in robot_functions.php find:
Code:
$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2;
and replace with:
Code:
$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2.' 2>&1';
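The point of appending 2>&1 is that the shell then sends error messages to standard output, so anything pdftotext complains about should show up in the captured result instead of being lost. A minimal sketch of the idea, assuming the command is run with PHP's exec(); the helper name run_with_stderr is made up for illustration and is not part of PhpDig:

```php
<?php
// Hypothetical helper, not PhpDig code: run a shell command and capture
// stdout and stderr together by appending the redirect 2>&1.
function run_with_stderr(string $command): array
{
    $output = [];
    $status = 0;
    exec($command . ' 2>&1', $output, $status);
    return ['output' => $output, 'status' => $status];
}

// Demo: a command that writes only to stderr still ends up in 'output'.
$result = run_with_stderr('ls /nonexistent-path-for-demo');
echo "Result contains: ";
print_r($result['output']);
echo "Return value is: {$result['status']}\n";
```

An empty result array with a return value of 0 would suggest the external program ran without reporting an error.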
10-20-2005, 07:44 AM | #7 |
Green Mole
Join Date: Oct 2005
Posts: 5
|
Here is the end of what printed on the screen:
---------------------------------------
is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Command is: /usr/local/bin/pdftotext ../admin/temp/38625112.tmp 2>&1
Result contains: Array ( )
Return value is: 0
90:http://192.168.13.80/services/leadership.pdf
(time : 00:07:59)
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
-------------------------------------------------
Nothing else displays beneath, but the top of the screen indicates spidering is still in progress. The admin panel says "locked".
It did seem to find a few more PDFs, but certainly not all.
On the admin panel, I ask it to use a subdir of "policies". Under policies are 12 subdirs containing 1-10 PDFs each. All the PDFs are linked from the pages. From what printed on the screen, it did not go down under policies despite using a search depth of 40. Thanks |
10-20-2005, 08:02 AM | #8 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
What is the filesize of the PDF file that appears after leadership.pdf in the list? It seems like there is a big PDF in there somewhere that is using up all your PHP memory, which in turn kills PhpDig so it stops indexing and remains locked. Look at the filesizes and unlink the big ones. PDFs of two or three MBs are probably okay, but it depends on your PHP memory.
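If it helps to find the big ones quickly, here is a rough sketch that lists PDFs above a size threshold and prints the current PHP memory limit. The function name find_large_files and the 3 MB threshold in the usage note are assumptions for illustration only; this is not part of PhpDig.

```php
<?php
// Hypothetical helper, not PhpDig code: return files with the given extension
// under $dir (searched recursively) that are larger than $maxBytes,
// keyed by path, biggest first.
function find_large_files(string $dir, int $maxBytes, string $ext = 'pdf'): array
{
    $large = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile()
            && strtolower($file->getExtension()) === $ext
            && $file->getSize() > $maxBytes) {
            $large[$file->getPathname()] = $file->getSize();
        }
    }
    arsort($large); // sort by size, largest first, keeping the paths as keys
    return $large;
}

// Show how much memory PHP scripts are allowed to use on this machine.
echo 'memory_limit: ' . ini_get('memory_limit') . "\n";
```

Calling it on your document root with a threshold such as 3 * 1024 * 1024 would return the PDFs over roughly 3 MB, which are the candidates to unlink first.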
10-20-2005, 08:39 AM | #9 |
Green Mole
Join Date: Oct 2005
Posts: 5
|
In that directory the next PDF is 6.3 MB.
So I'll have to see what I can do about that file. |
10-20-2005, 11:14 AM | #10 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Untested, but if you want to try and index part of the big PDFs...
In robot_functions.php find:
Code:
while (!feof($fp)) {
    $file_content[] = fread($fp,8192);
}
and replace with:
Code:
// stop after 125 reads of up to 8192 bytes each
$oh_stop_me = 0;
while (!feof($fp) && $oh_stop_me < 125) {
    $file_content[] = fread($fp,8192);
    $oh_stop_me++;
}
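To make the effect of the counter concrete: 125 reads of at most 8192 bytes each means no more than 1,024,000 bytes (a little under 1 MB) get read per file. Below is a standalone sketch of the same capped loop, wrapped in a made-up function read_capped so it can be tried outside PhpDig; the parameter names are illustrative only.

```php
<?php
// Hypothetical helper, not PhpDig code: read at most $maxChunks chunks of
// $chunkSize bytes from an open file handle and return them as one string.
function read_capped($fp, int $maxChunks = 125, int $chunkSize = 8192): string
{
    $chunks = [];
    $count = 0;
    while (!feof($fp) && $count < $maxChunks) {
        $chunks[] = fread($fp, $chunkSize);
        $count++;
    }
    return implode('', $chunks);
}

// Upper bound on bytes read with the default values.
echo 125 * 8192, "\n"; // prints 1024000
```

Raising or lowering the 125 changes how much of each file is read, so it can be tuned to whatever your PHP memory limit tolerates.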