|
10-26-2005, 07:19 AM | #1 |
Green Mole
Join Date: Jul 2005
Posts: 7
|
PDF and CATDOC indexing
Having loads of fun with this today
The only way I can get a .doc file to index is to spider the document directly; it isn't found through internal links. I have a page (see link) with a test PDF and a test .doc included in the body of the text. Neither is found or indexed by the spider, though the page itself is listed.
http://www.sccyp.org.uk/webpages/about_ourhistory.php
If I give the spider the full URL of the .doc, I get the following:
Is result test http an array: 1
What is result test http status: MSWORD
Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the msword is set to: /usr/bin/catdoc
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Command is: /usr/bin/catdoc -s 8859-1 ../admin/temp/66842272.tmp
Result contains: Array (
[0] => Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec non leo
[1] => nec enim sollicitudin sodales. Morbi sem sem, mattis vitae, imperdiet
[2] => in, malesuada non, arcu. Vestibulum condimentum porttitor tellus. Sed
[3] => ultricies. Sed volutpat molestie sem. Quisque quis nisl. Sed mi metus,
[4] => dictum at, elementum quis, ultricies ac, nunc. Aliquam eget arcu.
[5] => Vivamus felis sem, feugiat id, volutpat ac, lobortis vel, felis. Cum
[6] => sociis natoque penatibus et magnis dis parturient montes, nascetur
[7] => ridiculus mus. Quisque
[8] => grav????????????????????????????????????????????????????????????????????
[9] => ????????????????????????
[10] =>
)
Return value is: 0
4:http://www.sccyp.org.uk/testdoc.doc
Why won't it spider from the internal link?
For the PDF, the spider appears to see the file if I add it to the spider automatically, but it produces no output:
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the msword is set to: /usr/bin/catdoc
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Command is: /usr/bin/pdftotext ../admin/temp/95318772.tmp 2>&1
Result contains: Array ( )
Return value is: 0
1:http://www.sccyp.org.uk/pdftest.pdf
(time : 00:00:06)
No link in temporary table
Again, why won't it spider from the internal link? And when it does, what happens to the content?
I have been through most of the threads on the board relating to this and can't find an answer.
Any help gratefully received.
Chris |
10-26-2005, 10:47 PM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
In the config file set LIMIT_TO_DIRECTORY to false, PHPDIG_IN_DOMAIN to true, and PHPDIG_PDF_EXTENSION to .txt (with the period) and then from the admin panel, use a large search depth, set links per to zero, and select the no option.
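As a rough sketch, the three config-file settings above might look like the lines below. This assumes they are set with define(), the same way as the PHPDIG_* constants quoted later in this thread; the constant names and values are taken from this post, but the exact form in your own config file may differ, so edit the existing lines rather than adding duplicates.

```php
<?php
// Sketch only: names and values come from the advice above;
// the define() form is an assumption based on the other PHPDIG_* lines in this thread.
define('LIMIT_TO_DIRECTORY', false);     // set to false, as advised
define('PHPDIG_IN_DOMAIN', true);        // set to true, as advised
define('PHPDIG_PDF_EXTENSION', '.txt');  // include the leading period
```

The remaining steps (search depth, links per, and the no option) are chosen in the admin panel when starting the spider, not in the config file.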
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
10-27-2005, 09:29 AM | #3 |
Green Mole
Join Date: Jul 2005
Posts: 7
|
Thanks for responding
I made the suggested changes. For the PDF I still get the following:
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Command is: /usr/bin/pdftotext ../admin/temp/52635362.tmp 2>&1
Result contains: Array ( )
Return value is: 0
I looked for the temp file and there was nothing there. Should there be, or are these removed automatically?
Cheers
Chris |
10-27-2005, 09:52 AM | #4 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
For the PDFs, also set PHPDIG_OPTION_PDF to '' (two single quotes, no space between) in the config file.
|
10-27-2005, 09:55 AM | #5 |
Green Mole
Join Date: Jul 2005
Posts: 7
|
This is what I have already
define('PHPDIG_INDEX_PDF',true); // set to true
define('PHPDIG_PARSE_PDF','/usr/bin/pdftotext'); // assuming linux
define('PHPDIG_OPTION_PDF',''); // two single quotes, no space in between
|
10-27-2005, 10:48 AM | #6 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
So you have the following; what version of PhpDig are you using?
Code:
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/bin/pdftotext');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','.txt');
|
10-28-2005, 04:55 AM | #7 |
Green Mole
Join Date: Jul 2005
Posts: 7
|
I'm using 1.8.7
|
11-01-2005, 02:50 PM | #8 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hmm, what happens if you save <removed - looks like you got it to work> PDF to your server and try to index it?
|