08-17-2006, 07:39 AM | #1 |
Green Mole
Join Date: Aug 2006
Posts: 1
spider documents without extensions
I have some problems with correct MIME type detection on our Linux server. The documents are PDF and Word (.doc) files, uploaded through a form and saved without a file extension. Normally PhpDig should read the header and spider the file with the correct external binary. The files are named like 22_upload, 23_upload ...
I'm using catdoc and pstotext with PhpDig version 1.8.5. The binary installation should be correct, because
catdoc /path to file/file
and
pstotext -cork /path to file/file
return the text content, and
file -i /path to file/file
shows the MIME type: application/pdf or application/msword.
The spider is running, but the files in text_content (*.txt) and the first_words column in the database contain the binary code of the files, not the text content.
I'm using
# php -f /path/spider.php http://path/documents/ >> /var/log/phpdig.log
So it seems that robot_functions.php does not recognise the MIME type of the documents and does not know which external binary to use. As a result, binary code is written into the database.
Thanks for any suggestions,
Joe
Last edited by jguert; 08-17-2006 at 07:43 AM.
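To narrow down whether the problem is in the detection step, the type of each extensionless file can be checked from its leading bytes, independently of PhpDig. The following is only a minimal Python sketch and not PhpDig's own logic. The names sniff_mime, extractor_command and EXTRACTORS are hypothetical. It assumes every OLE2 container is a Word document, which is not true in general because legacy Excel and PowerPoint files share the same signature. It only builds the command line for catdoc or pstotext -cork and does not run it.

```python
# Well-known leading-byte signatures ("magic numbers").
PDF_MAGIC = b"%PDF-"
# OLE2 compound file header, used by legacy .doc files.
# Assumption: any OLE2 file is treated as Word here.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical mapping from MIME type to the extractor command prefix,
# using the two commands mentioned in the post.
EXTRACTORS = {
    "application/pdf": ["pstotext", "-cork"],
    "application/msword": ["catdoc"],
}


def sniff_mime(head: bytes) -> str:
    """Guess a MIME type from the first bytes of a file."""
    if head.startswith(PDF_MAGIC):
        return "application/pdf"
    if head.startswith(OLE2_MAGIC):
        return "application/msword"
    return "application/octet-stream"


def extractor_command(path: str):
    """Return the extractor command for path, or None if the type is unknown."""
    with open(path, "rb") as fh:
        head = fh.read(8)  # 8 bytes cover both signatures
    prefix = EXTRACTORS.get(sniff_mime(head))
    if prefix is None:
        return None
    return prefix + [path]
```

Calling extractor_command on a file such as 22_upload returns the argument list that would be run, or None when neither signature matches. If this picks the right tool for every upload while the index still contains binary data, the fault is more likely in how the spider decides the type than in the files themselves.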