|
04-07-2004, 07:39 AM | #1 |
Green Mole
Join Date: Apr 2004
Posts: 8
|
catdoc and xls2csv not indexing
Can anyone help me? I have been trying to get Word documents and Excel files to index. I am using Apache on a Windows XP system. It works for text files only. This is how my config settings look:

define('USE_IS_EXECUTABLE_COMMAND','0'); //use is_executable for external binaries
// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','C:\catdoc\catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');
define('PHPDIG_OPTION_PDF','-cork');
define('PHPDIG_INDEX_MSEXCEL',true);
define('PHPDIG_PARSE_MSEXCEL','C:\catdoc\xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');
//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION','');

I have tried the xls2csv and catdoc programs through the MS-DOS interface and they work fine. When I try to submit a URI with a .doc or .xls file, this is what I get:

SITE : http://localhost/
Exclude paths :
- @NONE@
No link in temporary table
--------------------------------------------------------------------------------
links found : 0
...Was recently indexed
Optimizing tables...
Indexing complete !

Any advice much appreciated.
-Rich |
04-08-2004, 03:56 AM | #2 |
Green Mole
Join Date: Jan 2004
Posts: 1
|
I have the same problem
My configuration: PhpDig 1.8.0, EasyPHP 1.7, Windows XP.

Config file:

define('USE_IS_EXECUTABLE_COMMAND','0'); //use is_executable for external binaries
// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','C:\\Ghostgum\\pstotext\\catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','C:\\Ghostgum\\pstotext\\pstotxt3');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_INDEX_MSEXCEL',true);
define('PHPDIG_PARSE_MSEXCEL','C:\\Ghostgum\\pstotext\\xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');
//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');

I cannot index PDF, .doc, or .xls files, only HTML or text files. Catdoc, Pstotxt, and xls2csv all work in a Windows shell. I've read nearly all the external binaries topics without finding a clue. If you have even the beginning of an idea, please share it; I'm desperate. Thanks.
Axel |
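The two configs above spell their Windows paths differently, one with single backslashes and one with doubled backslashes. As a side check, here is a minimal sketch (the variable names are illustrative only, not part of PhpDig) showing that PHP treats both spellings as the same string inside single quotes, so that difference alone is unlikely to explain the failure:

```php
<?php
// Inside single quotes PHP only treats \\ and \' as escape sequences,
// so both spellings below produce the same path string.
$single  = 'C:\catdoc\catdoc';
$doubled = 'C:\\catdoc\\catdoc';

var_dump($single === $doubled); // bool(true)
echo $doubled, "\n";            // C:\catdoc\catdoc
```

Double-quoted strings behave differently (for example, "\t" becomes a tab), so single quotes are the safer choice for paths in the config.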
04-08-2004, 06:55 AM | #3 |
Green Mole
Join Date: Apr 2004
Posts: 8
|
what should be done.. a cry for help
I see people posting who claim they can get catdoc, xls2csv, and pdftotext to work on Windows systems. I have read through all the posts on this topic and still cannot get PhpDig to see these documents; the programs work only from the DOS prompt. All I have changed is the config, which I showed before. Are there other things that need to be altered, perhaps in the spider.php file?
Can anyone direct me toward some additional online documentation or show me what they have done to solve this problem? Help, somebody! |
04-09-2004, 08:18 AM | #4 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. What version of PHP? Perhaps try this thread.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
04-09-2004, 09:12 AM | #5 |
Green Mole
Join Date: Apr 2004
Posts: 8
|
PHP 4.3.3
Thank you for responding. I'm using PHP 4.3.3 and I tried the suggestion you listed. I'm still getting:
No link in temporary table
links found : 0 |
04-09-2004, 10:35 AM | #6 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Just posted this thread.
|
04-11-2004, 01:05 PM | #7 |
Green Mole
Join Date: Apr 2004
Posts: 8
|
Hi, so I followed all of your instructions and inserted those echo statements. It doesn't seem to work yet, but I will keep at it. What does everyone think? All the programs (catdoc, pdftotext, and xls2csv) work from the command line. This is the output that I receive:

Spidering in progress...
SITE : http://localhost/
Exclude paths :
- @NONE@

Is result test http an array: 1
What is result test http status: HTML
Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
1:http://localhost/grants/ (time : 00:00:05)
+ + + + + +
level 1...

Is result test http an array: 1
What is result test http status: MSEXCEL
Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
2:http://localhost/grants/test.xls (time : 00:00:15)

Is result test http an array: 1
What is result test http status: PLAINTEXT
Is result test an array: 1
What is result test status: PLAINTEXT
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
3:http://localhost/grants/Solutions.txt (time : 00:00:21)

Is result test http an array: 1
What is result test http status: MSWORD
Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
4:http://localhost/grants/Outline%20of...ation2.doc.doc (time : 00:00:26)

Is result test http an array: 1
What is result test http status: MSWORD
Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
5:http://localhost/grants/MidtermSolutions.doc.doc (time : 00:00:31)

Is result test http an array: 1
What is result test http status: MSWORD
Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
6:http://localhost/grants/Debate.doc (time : 00:00:36)

Is result test http an array: 1
What is result test http status: HTML
Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
7:http://localhost/ (time : 00:00:41)

No link in temporary table
links found : 7
http://localhost/grants/
http://localhost/grants/test.xls
http://localhost/grants/Solutions.txt
http://localhost/grants/Outline of my last presentation2.doc.doc
http://localhost/grants/MidtermSolutions.doc.doc
http://localhost/grants/Debate.doc
http://localhost/
Optimizing tables...
Indexing complete !

Any advice, please send it my way!
-Rich |
04-11-2004, 01:49 PM | #8 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. For the external binaries, there aren't any PDF files in your post, just Word and Excel files, and it looks like three Word documents and one Excel file were indexed. What happens when you try a search on a word in one of those DOC/XLS files?
Also, in this thread, I've added a comment and some extra code to echo more stuff. The comment shows where to change _PDF to either _MSWORD or _MSEXCEL in the posted code in order to echo stuff specific for those binaries.
|
04-11-2004, 04:06 PM | #9 |
Green Mole
Join Date: Apr 2004
Posts: 8
|
Indexing problem
My apologies, Charter, for replying to your previous thread.
I have made some progress: I can get PhpDig to see and identify my files and read the titles, but not read their contents (no green check mark). My config:

define('USE_IS_EXECUTABLE_COMMAND','0');
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','C:\\catdoc\\catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','C:\\pdftotext\\pdftext');
define('PHPDIG_OPTION_PDF','-cork');
define('PHPDIG_INDEX_MSEXCEL',true);
define('PHPDIG_PARSE_MSEXCEL','C:\\catdoc\\xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');
define('PHPDIG_PDF_EXTENSION','.txt');

Here is the output when I try to index an Excel, a Word, and a PDF file:

Is result test http an array: 1
What is result test http status: MSEXCEL
Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the msword is set to: 1
Parse the msword is set to: C:\catdoc\catdoc
Does parse msword exist:
Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv
Does parse msexcel exist:
3:http://localhost/testfiles/Book1.xls (time : 00:00:21)

Is result test http an array: 1
What is result test http status: MSWORD
Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the msword is set to: 1
Parse the msword is set to: C:\catdoc\catdoc
Does parse msword exist:
Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv
Does parse msexcel exist:
4:http://localhost/testfiles/GFP.doc (time : 00:00:26)

Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the msword is set to: 1
Parse the msword is set to: C:\catdoc\catdoc
Does parse msword exist:
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv
Does parse msexcel exist:
5:http://localhost/testfiles/GeneChips.pdf (time : 00:00:31)

No link in temporary table

If anyone can tell me what I am missing, please drop me a reply.
-Rich |
04-11-2004, 04:27 PM | #10 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. All of the following are coming up blank, likely meaning false, so the external binary isn't applied to the file.
Does parse pdf exist:
Does parse msword exist:
Does parse msexcel exist:

Try the following script, and keep changing the $filename variable until you get a 'file exists' for each binary, and use those paths. If the paths that you are using are actually correct, the blank results may be coming from cache, so running the script below will also clear that.

PHP Code:
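The posted script is not preserved in this copy of the thread. A minimal sketch of what such a check could look like, assuming it relies on PHP's `file_exists()` and `clearstatcache()`; the helper name `binary_exists` and the candidate list are hypothetical, with paths taken from the configs in this thread:

```php
<?php
// Hypothetical helper: reports whether PHP can see a file at $filename.
function binary_exists($filename)
{
    // PHP caches the results of file_exists() and related calls;
    // clearing the cache makes sure a stale answer is not reused.
    clearstatcache();
    return file_exists($filename);
}

// Candidate paths to try, with and without the .exe extension.
$candidates = array(
    'C:\\catdoc\\catdoc',
    'C:\\catdoc\\catdoc.exe',
);

foreach ($candidates as $filename) {
    echo $filename, ': ', binary_exists($filename) ? 'file exists' : 'file does not exist', "\n";
}
```

Whichever spelling reports 'file exists' is the one to put in the matching PHPDIG_PARSE_* setting.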
|
04-11-2004, 05:25 PM | #11 |
Green Mole
Join Date: Apr 2004
Posts: 8
|
Reading, indexing, not searchable yet
OK. Now it seems to be reading things in. I changed the locations in the config to:

define('PHPDIG_PARSE_MSWORD','C:\\catdoc\\catdoc.exe');
define('PHPDIG_PARSE_PDF','C:\\pdftotext\\pdftext.exe');
define('PHPDIG_PARSE_MSEXCEL','C:\\catdoc\\xls2csv.exe');

It seems to be reading them in and indexing, but I still don't get the green check. This is some of the output for the Excel and PDF files:

Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\pdftotext\pdftext.exe
Does parse pdf exist: 1
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the msword is set to: 1
Parse the msword is set to: C:\catdoc\catdoc.exe
Does parse msword exist: 1
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv.exe
Does parse msexcel exist: 1
Command is: C:\pdftotext\pdftext.exe -cork ../admin/temp/44266892.tmp
Result contains: Array ( )
Return value is: 1
3:http://localhost/testfiles/GeneChips.pdf (time : 00:00:21)

Is result test http an array: 1
What is result test http status: MSEXCEL
Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\pdftotext\pdftext.exe
Does parse pdf exist: 1
Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the msword is set to: 1
Parse the msword is set to: C:\catdoc\catdoc.exe
Does parse msword exist: 1
Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv.exe
Does parse msexcel exist: 1
Command is: C:\catdoc\xls2csv.exe ../admin/temp/72672312.tmp
Result contains: Array ( )
Return value is: 1
4:http://localhost/testfiles/Book1.xls (time : 00:00:26)

No link in temporary table |
04-11-2004, 06:15 PM | #12 |
Green Mole
Join Date: Apr 2004
Posts: 8
|
no indexing or searching
I just noticed you posted two responses. Thank you for getting back to me, and for the script. That worked great. The files seem to be getting read:

Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv.exe
Does parse msexcel exist: 1
Command is: C:\catdoc\xls2csv.exe ../admin/temp/72672312.tmp
Result contains: Array ( )
Return value is: 1

The parse check is working, and it looks like the programs are being run and the files are sent somewhere (../admin/temp/72672312.tmp). When I index, I still do not get the green check mark, and when I search for terms in the documents I get nothing. It seems like I'm so close to getting it to run. What else could I be missing?
-Rich |
04-11-2004, 08:27 PM | #13 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Try removing the .exe extension from the paths and check this page, and also search this page for IUSR and see if that fixes it.
|
04-12-2004, 09:40 AM | #14 |
Green Mole
Join Date: Apr 2004
Posts: 8
|
changing apache permissions in windows
Thanks for the info. So is the problem Apache, or the catdoc and pdftotext programs? In either case, I changed the config (got rid of the .exe) and made sure the folders (catdoc/ and pdftotext/) were shared with read/write permissions. It is still not indexing. The links you sent, Charter, were very helpful, thank you. I am still having a bit of a problem: when I go into services.msc to change my permissions, I don't see Apache listed. I am using EasyPHP. Does anyone know what service EasyPHP is listed as in Windows 2000 or XP? Again, thank you for reading; any advice is welcome.
-Rich |
04-13-2004, 08:33 PM | #15 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Command is: C:\catdoc\xls2csv.exe ../admin/temp/72672312.tmp
Result contains: Array ( )
Return value is: 1

Hi. The above output means that, when PhpDig tried to do the exec, an error occurred. I'm not familiar with EasyPHP, but perhaps the user comments on this page might help.
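To see why an exec call fails, it can help to run the same command from a small standalone script and capture its error output. The following is only a sketch, not PhpDig's own code: the helper name `run_external` is hypothetical, and `echo hello` is a placeholder for the "Command is:" line from the spider output.

```php
<?php
// Hypothetical helper: runs $command and returns its output lines and exit status.
function run_external($command)
{
    $output = array();
    $status = -1;
    // "2>&1" folds error messages into the captured output, so a failing
    // binary can say why it failed instead of leaving an empty array.
    exec($command . ' 2>&1', $output, $status);
    return array($output, $status);
}

// Placeholder command; swap in the command echoed by the spider.
list($lines, $status) = run_external('echo hello');
echo 'Return value is: ', $status, "\n";
print_r($lines);
```

By convention a return value of 0 means success, and a non-zero value with an empty result array, as in the output above, suggests the binary did not run as expected under the web server's account.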
|