PhpDig.net

Old 04-07-2004, 07:39 AM   #1
greener_02445
Green Mole
 
Join Date: Apr 2004
Posts: 8
catdoc and xls2csv not indexing

Can anyone help me? I have been trying to get Word documents and Excel files to index. I am using Apache on a Windows XP system. Indexing works for text files only. This is how my config settings look:


define('USE_IS_EXECUTABLE_COMMAND','0'); //use is_executable for external binaries
// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','C:\catdoc\catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');
define('PHPDIG_OPTION_PDF','-cork');
define('PHPDIG_INDEX_MSEXCEL',true);
define('PHPDIG_PARSE_MSEXCEL','C:\catdoc\xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');

//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION','');

I have tried the xls2csv and catdoc programs from the MS-DOS prompt and they work fine. When I try to submit a URI with a .doc or .xls file, this is what I get:

SITE : http://localhost/
Exclude paths :
- @NONE@
No link in temporary table

--------------------------------------------------------------------------------

links found : 0
...Was recently indexed
Optimizing tables...
Indexing complete !

Any advice is much appreciated.

-Rich
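
A side note on the backslashes in the paths above, since both 'C:\catdoc\catdoc' and 'C:\\catdoc\\catdoc' styles appear in this thread: in single-quoted PHP strings the two spellings produce the same string, so the quoting alone is unlikely to be the cause here. A minimal sketch to confirm that (the C:\temp\new path is a made-up example used only to show the double-quote pitfall):

```php
<?php
// In single quotes a backslash is literal unless it precedes another
// backslash or a single quote, so both spellings give the same path.
$single  = 'C:\catdoc\catdoc';
$doubled = 'C:\\catdoc\\catdoc';
var_dump($single === $doubled);

// In double quotes, \t and \n ARE interpreted (tab, newline), so this
// hypothetical path would be corrupted unless the backslashes are doubled.
$risky = "C:\temp\new";
$safe  = "C:\\temp\\new";
var_dump($risky === $safe);
```

The first comparison is true and the second is false, which is why doubling backslashes is the safer habit if a path ever ends up in double quotes.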
Old 04-08-2004, 03:56 AM   #2
maza
Green Mole
 
Join Date: Jan 2004
Posts: 1
I have the same problem

My configuration :

PhpDig 1.8.0, EasyPHP 1.7, Windows XP

Config File :

define('USE_IS_EXECUTABLE_COMMAND','0'); //use is_executable for external binaries

// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','C:\\Ghostgum\\pstotext\\catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','C:\\Ghostgum\\pstotext\\pstotxt3');
define('PHPDIG_OPTION_PDF','');

define('PHPDIG_INDEX_MSEXCEL',true);
define('PHPDIG_PARSE_MSEXCEL','C:\\Ghostgum\\pstotext\\xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');

//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');

I cannot index PDF, .doc, or .xls files, only HTML or text files. catdoc, pstotxt and xls2csv all work from a Windows shell.

I've read nearly all the external binaries topics without finding a clue.

If you have even the beginning of an idea, please share it. I'm desperate.

Thanks. Axel
Old 04-08-2004, 06:55 AM   #3
greener_02445
Green Mole
 
Join Date: Apr 2004
Posts: 8
what should be done.. a cry for help

I see people posting that they can get catdoc, xls2csv and pdftotext to work on Windows systems. I have read through all the posts on this topic and still cannot get PhpDig to see these documents; the programs only work from DOS. All I have changed is the config, which I showed before. Are there other things that need to be altered, in the spider.php file perhaps?
Can anyone direct me toward some additional online documentation, or show me what they have done to solve this problem?
Help, help, somebody!
Old 04-09-2004, 08:18 AM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. What version of PHP? Perhaps try this thread.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 04-09-2004, 09:12 AM   #5
greener_02445
Green Mole
 
Join Date: Apr 2004
Posts: 8
PHP 4.3.3

Thank you for responding. I'm using PHP 4.3.3 and I tried the suggestion you listed, but I'm still getting:

No link in temporary table
links found : 0
Old 04-09-2004, 10:35 AM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Just posted this thread.
Old 04-11-2004, 01:05 PM   #7
greener_02445
Green Mole
 
Join Date: Apr 2004
Posts: 8
Hi, so I followed all of your instructions and inserted those echo statements. It doesn't work yet, but I will keep trying. What does everyone think? All the programs (catdoc, pdftotext and xls2csv) work from the command line.

This is the output that I receive:

Spidering in progress...

SITE : http://localhost/
Exclude paths :
- @NONE@

Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
1:http://localhost/grants/
(time : 00:00:05)
+ + + + + +
level 1...

Is result test http an array: 1
What is result test http status: MSEXCEL

Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
2:http://localhost/grants/test.xls
(time : 00:00:15)

Is result test http an array: 1
What is result test http status: PLAINTEXT

Is result test an array: 1
What is result test status: PLAINTEXT
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
3:http://localhost/grants/Solutions.txt
(time : 00:00:21)

Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
4:http://localhost/grants/Outline%20of...ation2.doc.doc
(time : 00:00:26)


Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
5:http://localhost/grants/MidtermSolutions.doc.doc
(time : 00:00:31)



Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
6:http://localhost/grants/Debate.doc
(time : 00:00:36)



Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
7:http://localhost/
(time : 00:00:41)

No link in temporary table

links found : 7
http://localhost/grants/
http://localhost/grants/test.xls
http://localhost/grants/Solutions.txt
http://localhost/grants/Outline of my last presentation2.doc.doc
http://localhost/grants/MidtermSolutions.doc.doc
http://localhost/grants/Debate.doc
http://localhost/
Optimizing tables...
Indexing complete !

Any advice please send it my way!
-Rich
Old 04-11-2004, 01:49 PM   #8
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. For the external binaries, there aren't any PDF files in your post, just Word and Excel files, and it looks like three Word documents and one Excel file were indexed. What happens when you try a search on a word in one of those DOC/XLS files?

Also, in this thread, I've added a comment and some extra code to echo more stuff. The comment shows where to change _PDF to either _MSWORD or _MSEXCEL in the posted code in order to echo stuff specific for those binaries.
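
For readers without access to the linked thread, the kind of per-binary echo described here can be sketched roughly as follows. This is not the code from that thread or from PhpDig itself; describe_binary() is a hypothetical helper, and only the constant names and the wording of the output lines are taken from the posts in this thread.

```php
<?php
// Hypothetical helper, not part of PhpDig: builds the diagnostic lines for
// one binary type ('PDF', 'MSWORD' or 'MSEXCEL') from the config constants.
function describe_binary($type)
{
    $index = 'PHPDIG_INDEX_' . $type;
    $parse = 'PHPDIG_PARSE_' . $type;
    $label = strtolower($type);

    // Fall back to "off" / empty path when the constant is not defined.
    $indexOn = defined($index) ? constant($index) : false;
    $path    = defined($parse) ? constant($parse) : '';

    clearstatcache(); // avoid a stale cached result from an earlier check
    $exists = ($path !== '' && file_exists($path));

    return array(
        "Index the $label is set to: " . ($indexOn ? '1' : ''),
        "Parse the $label is set to: $path",
        "Does parse $label exist: " . ($exists ? '1' : ''),
    );
}

// Print the three diagnostic lines for every external binary type.
foreach (array('PDF', 'MSWORD', 'MSEXCEL') as $type) {
    echo implode("\n", describe_binary($type)), "\n\n";
}
```

Looping over the type names replaces the manual step of editing _PDF to _MSWORD or _MSEXCEL, so all three binaries are reported in one run.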
Old 04-11-2004, 04:06 PM   #9
greener_02445
Green Mole
 
Join Date: Apr 2004
Posts: 8
Indexing problem

My apologies, Charter, for replying to your previous thread.
I have made some progress: I can get PhpDig to see and identify my files and read their titles, but not read their contents (no green check mark).
My config:
define('USE_IS_EXECUTABLE_COMMAND','0');
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','C:\\catdoc\\catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','C:\\pdftotext\\pdftext');
define('PHPDIG_OPTION_PDF','-cork');
define('PHPDIG_INDEX_MSEXCEL',true);
define('PHPDIG_PARSE_MSEXCEL','C:\\catdoc\\xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');
define('PHPDIG_PDF_EXTENSION','.txt');

Here is the output when I try to index an Excel, Word, and PDF file:

Is result test http an array: 1
What is result test http status: MSEXCEL

Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:

Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the msword is set to: 1
Parse the msword is set to: C:\catdoc\catdoc
Does parse msword exist:

Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv
Does parse msexcel exist:
3:http://localhost/testfiles/Book1.xls
(time : 00:00:21)

Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the msword is set to: 1
Parse the msword is set to: C:\catdoc\catdoc
Does parse msword exist:

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv
Does parse msexcel exist:
4:http://localhost/testfiles/GFP.doc
(time : 00:00:26)

Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the msword is set to: 1
Parse the msword is set to: C:\catdoc\catdoc
Does parse msword exist:

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv
Does parse msexcel exist:
5:http://localhost/testfiles/GeneChips.pdf
(time : 00:00:31)
No link in temporary table

If anyone can tell me what I am missing, please drop me a reply.
-Rich
Old 04-11-2004, 04:27 PM   #10
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. All of the following are coming up blank, likely meaning false, so the external binary isn't applied to the file.

Does parse pdf exist:
Does parse msword exist:
Does parse msexcel exist:

Try the following script, and keep changing the $filename variable until you get a 'file exists' for each binary, and use those paths. If the paths that you are using are actually correct, the blank results may be coming from cache, so running the script below will also clear that.
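PHP Code:
<?php
// Path to test; change it until "file exists" is printed.
$filename = "C:\\catdoc\\catdoc";
// Clear PHP's cached file status so the check reflects the current state.
clearstatcache();
if (file_exists($filename)) {
    echo "file exists";
} else {
    echo "try again";
}
?>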
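
Rather than editing $filename by hand for each attempt, the same check can be run over several candidates at once. The following is only a sketch: first_existing_path() is a hypothetical helper, the base paths are the ones quoted in this thread, and the '.exe' suffix is just one guess worth trying on Windows.

```php
<?php
// Hypothetical helper: returns the first candidate path that exists, or null.
function first_existing_path($base, $extensions)
{
    clearstatcache(); // drop cached file status before testing
    foreach ($extensions as $ext) {
        $candidate = $base . $ext;
        if (file_exists($candidate)) {
            return $candidate;
        }
    }
    return null;
}

// Paths taken from this thread; adjust to the actual install locations.
$bases = array(
    'C:\\catdoc\\catdoc',
    'C:\\catdoc\\xls2csv',
    'C:\\pdftotext\\pdftext',
);
foreach ($bases as $base) {
    $found = first_existing_path($base, array('', '.exe'));
    echo $base, ' => ', ($found === null ? 'try again' : "file exists: $found"), "\n";
}
```

Whichever spelling is reported as existing is the one to copy into the PHPDIG_PARSE_* settings.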
Old 04-11-2004, 05:25 PM   #11
greener_02445
Green Mole
 
Join Date: Apr 2004
Posts: 8
Reading, indexing, not searchable yet

OK. Now it seems to be reading things in.
I changed the locations in the config to:

define('PHPDIG_PARSE_MSWORD','C:\\catdoc\\catdoc.exe');
define('PHPDIG_PARSE_PDF','C:\\pdftotext\\pdftext.exe');
define('PHPDIG_PARSE_MSEXCEL','C:\\catdoc\\xls2csv.exe');

It seems to be reading them in and indexing, but I still don't get the green check. This is some of the output for the Excel and PDF files:

Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\pdftotext\pdftext.exe
Does parse pdf exist: 1

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the msword is set to: 1
Parse the msword is set to: C:\catdoc\catdoc.exe
Does parse msword exist: 1

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv.exe
Does parse msexcel exist: 1

Command is: C:\pdftotext\pdftext.exe -cork ../admin/temp/44266892.tmp
Result contains: Array ( )
Return value is: 1

3:http://localhost/testfiles/GeneChips.pdf
(time : 00:00:21)

Is result test http an array: 1
What is result test http status: MSEXCEL

Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\pdftotext\pdftext.exe
Does parse pdf exist: 1

Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the msword is set to: 1
Parse the msword is set to: C:\catdoc\catdoc.exe
Does parse msword exist: 1

Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv.exe
Does parse msexcel exist: 1

Command is: C:\catdoc\xls2csv.exe ../admin/temp/72672312.tmp
Result contains: Array ( )
Return value is: 1

4:http://localhost/testfiles/Book1.xls
(time : 00:00:26)

No link in temporary table
Old 04-11-2004, 06:15 PM   #12
greener_02445
Green Mole
 
Join Date: Apr 2004
Posts: 8
no indexing or searching

I just noticed you posted two responses. Thank you for getting back to me, and for the script. That worked great. The files seem to be getting read:
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv.exe
Does parse msexcel exist: 1

Command is: C:\catdoc\xls2csv.exe ../admin/temp/72672312.tmp
Result contains: Array ( )
Return value is: 1
Parse is working, and it looks like the programs are being run and the files are sent somewhere (../admin/temp/72672312.tmp).
When I index, I still do not get the green check mark, and when I search for terms in the documents I get nothing. It seems like I'm so close to getting it to run. What else could I be missing?
-Rich
Old 04-11-2004, 08:27 PM   #13
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Try removing the .exe extension from the paths and check this page, and also search this page for IUSR and see if that fixes it.
Old 04-12-2004, 09:40 AM   #14
greener_02445
Green Mole
 
Join Date: Apr 2004
Posts: 8
changing apache permissions in windows

Thanks for the info. So is the problem Apache, or the catdoc and pdftotext programs? In either case, I changed the config (got rid of the .exe) and made sure the folders (catdoc/ and pdftotext/) were shared with read/write permissions. It is still not indexing. The links you sent, Charter, were very helpful, thank you. I am still having a bit of a problem: when I go into services.msc to change my permissions, I don't see Apache listed. I am using EasyPHP. Does anyone know what service EasyPHP is listed as in Windows 2000 or XP? Again, thank you for reading, and please send any advice this way.

-Rich
Old 04-13-2004, 08:33 PM   #15
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Command is: C:\catdoc\xls2csv.exe ../admin/temp/72672312.tmp
Result contains: Array ( )
Return value is: 1

Hi. The above output means that, when PhpDig tried to do the exec, an error occurred. I'm not familiar with EasyPHP, but perhaps the user comments on this page might help.
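
To see why the exec fails, it may help to rerun the same command outside PhpDig and capture its error text. The sketch below is not PhpDig code: run_and_report() is a hypothetical wrapper around PHP's exec(), and it assumes the shell understands the "2>&1" redirection, which merges error messages into the captured output.

```php
<?php
// Hypothetical wrapper around exec(): runs a command and returns its output
// lines and exit status. "2>&1" asks the shell to merge stderr into stdout,
// so an error message is captured instead of an empty array.
function run_and_report($command)
{
    $output = array();
    $status = 0;
    exec($command . ' 2>&1', $output, $status);
    return array('output' => $output, 'status' => $status);
}

// Command string copied from the debug output above. By convention a status
// of 0 means success and a non-zero status means the command failed.
$result = run_and_report('C:\\catdoc\\xls2csv.exe ../admin/temp/72672312.tmp');
echo 'Return value is: ', $result['status'], "\n";
print_r($result['output']);
```

If the status is non-zero, the captured lines should contain whatever message the shell or the binary printed, which usually narrows the problem down to a wrong path, a missing temp file, or a permissions issue.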