PDF and CATDOC indexing

chrisdgreen · 10-26-2005, 08:19 AM

Having loads of fun with this today

The only way I can get a .doc file to index is to spider the doc directly; the spider doesn't find it from internal links.
I have a page (see link) with a test PDF and a test doc linked in the body of the text. Neither of these is found or indexed by the spider, though the page itself is listed.
http://www.sccyp.org.uk/webpages/about_ourhistory.php

If I give the spider the full URL of the doc, I get the following:

Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the msword is set to: /usr/bin/catdoc
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/bin/catdoc -s 8859-1 ../admin/temp/66842272.tmp
Result contains: Array ( [0] => Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec non leo [1] => nec enim sollicitudin sodales. Morbi sem sem, mattis vitae, imperdiet [2] => in, malesuada non, arcu. Vestibulum condimentum porttitor tellus. Sed [3] => ultricies. Sed volutpat molestie sem. Quisque quis nisl. Sed mi metus, [4] => dictum at, elementum quis, ultricies ac, nunc. Aliquam eget arcu. [5] => Vivamus felis sem, feugiat id, volutpat ac, lobortis vel, felis. Cum [6] => sociis natoque penatibus et magnis dis parturient montes, nascetur [7] => ridiculus mus. Quisque [8] => grav???????????????????????????????????????????????????????????????????? [9] => ???????????????????????? [10] => )
Return value is: 0

4:http://www.sccyp.org.uk/testdoc.doc

Why won't it spider from the internal link?

For the PDF, it appears to see the file if I add it to the spider automatically, but it produces no output:

Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the msword is set to: /usr/bin/catdoc
Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/bin/pdftotext ../admin/temp/95318772.tmp 2>&1
Result contains: Array ( )
Return value is: 0

1:http://www.sccyp.org.uk/pdftest.pdf
(time : 00:00:06)
No link in temporary table

Again, why won't it spider from the internal link? And when it does pick up the file, what happens to the content?

I have been through most of the threads on the board relating to this and can't find an answer.

Any help gratefully received

Chris

Charter · 10-26-2005, 11:47 PM

In the config file set LIMIT_TO_DIRECTORY to false, PHPDIG_IN_DOMAIN to true, and PHPDIG_PDF_EXTENSION to .txt (with the period) and then from the admin panel, use a large search depth, set links per to zero, and select the no option.
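
A minimal sketch of the config-file part of that advice, assuming the three constants are set with define() in the same way as the PHPDIG_* lines quoted later in this thread. The comments are one reading of what each setting is for, not PhpDig documentation, and the admin-panel settings are not covered here.

```php
<?php
// Sketch only: values are taken from the post above; the comments are assumptions.
define('LIMIT_TO_DIRECTORY', false);     // do not confine the spider to the starting (sub)directory
define('PHPDIG_IN_DOMAIN', true);        // set to true, as suggested in the post above
define('PHPDIG_PDF_EXTENSION', '.txt');  // include the leading period
```

In this thread the test files sit at the site root (/testdoc.doc, /pdftest.pdf) while the linking page is under /webpages/, so a directory limit would plausibly explain why those links were skipped.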

chrisdgreen · 10-27-2005, 10:29 AM

Thanks for responding.
I made the suggested changes, but for the PDF I still get the following:

Parse the pdf is set to: /usr/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/bin/pdftotext ../admin/temp/52635362.tmp 2>&1
Result contains: Array ( )
Return value is: 0

I looked for the temp file and there was nothing there. Should there be, or are these removed automatically?

Cheers

Chris

Charter · 10-27-2005, 10:52 AM

For the PDFs, also set PHPDIG_OPTION_PDF to '' (two single quotes, no space between) in the config file.

chrisdgreen · 10-27-2005, 10:55 AM

This is what I have already:

define('PHPDIG_INDEX_PDF',true); // set to true
define('PHPDIG_PARSE_PDF','/usr/bin/pdftotext'); // assuming linux
define('PHPDIG_OPTION_PDF',''); // two single quotes, no space in between

Charter · 10-27-2005, 11:48 AM

So you have the following; what version of PhpDig are you using?

Code:

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/bin/pdftotext');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','.txt');

chrisdgreen · 10-28-2005, 05:55 AM

I'm using 1.8.7

Charter · 11-01-2005, 03:50 PM

Hmm, what happens if you save <removed - looks like you got it to work> PDF to your server and try to index it?
