04-28-2004, 09:00 AM | #1 |
Green Mole
Join Date: Apr 2004
Posts: 4
|
For me it only indexes the title of the PDF file, the time of indexing, and the size of the PDF file (in the keywords table of the database), but none of the PDF's content ends up in the database.
It is strange, because when I index a site with PDF files the PDF seems to be indexed. See below:
Is result test http an array: 1
What is result test http status: PDF
Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1
Command is: /usr/bin/pstotext -cork ../admin/temp/13874292.tmp
Result contains: Array (
[0] => Hébergement
[1] => Facture
[2] => partners -- 5 Sq de tuile_ 78000 Versailles -- Tél. / Fax : 0666666666 -- Email : contact@partners.com
[3] => SARL au capital de 3000# -- Siret545454445RCS Versailles -- APE 222Z -- Web : www.partners.com
[4] =>
[5] => FACTURE
[6] => partners CLIENT
[7] => 5 Sq de tuile Adzd MAdzNdzAS
[8] => 78000 Versailles
[9] => Tél./fax. : 01 3226222626
[10] => Prestation : Hébergement
[11] => Facture du: 01/04/2004 au 31/06/2004
[12] => N° de Facture: 12122/66
[13] => Article Objet Quantité
[14] => /
[15] => Slots
[16] => Prix
[17] => unitaire /
[18] => Trimestre
[19] => Montant TVA
[20] => Hébergement Serveur
[21] => Total HT 122.36
[22] => Total TVA 23.61
[23] => Total TTC 122.00
[24] => A payer 122.00 EUROS
[25] => Mode de paiement : A réception de facture
[26] =>
)
Return value is: 0
5:http://monsiteweb.fr/pdf/01123SOC2004013.PDF (temps : 00:01:49)
Pas de liens dans la table temporaire |
04-28-2004, 09:59 AM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. That all looks like it's working. What do you get when you run the following query?
Code:
select first_words from spider where file like '%.pdf%';
Code:
select keyword from keywords where keyword like '%word%';
If you have define('CONTENT_TEXT',0); set in the config file, then when searching on a keyword, only $text from list($title,$text) = explode("\n",$first_words); will be shown, regardless of the keyword.
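To make the list($title,$text) = explode("\n",$first_words); step concrete, here is a minimal sketch in Python (used only for illustration; phpDig itself is PHP). The function name split_first_words and the sample string are hypothetical, and the sketch assumes first_words holds the title on its first line and the text on its second.

```python
def split_first_words(first_words):
    """Mimic list($title, $text) = explode("\n", $first_words)."""
    parts = first_words.split("\n")
    title = parts[0]
    # PHP's list() only takes the first two pieces; later ones are ignored.
    # A missing second piece becomes "" here (PHP would leave $text null).
    text = parts[1] if len(parts) > 1 else ""
    return title, text


# Hypothetical sample value, not taken from a real spider table.
sample = "Invoice 12122/66\nHosting invoice for April to June 2004"
title, text = split_first_words(sample)
print("title:", title)
print("text: ", text)
```

If the first_words query returns nothing for the PDF, there is no title or text to split, which would fit the symptom of an empty result for that file.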
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
04-28-2004, 10:29 AM | #3 |
Green Mole
Join Date: Apr 2004
Posts: 4
|
Hi,
When I run the following query:
select first_words from spider where file like '%.pdf%';
I get nothing.
When I run the following query:
select keyword from keywords where keyword like '%word%';
I get:
key_id=61365 twoletters=en keyword=entrymainbodyfirstwords
But I think this is because my index.html contains the word {{entrymainbodyfirstwords 25}}, which was indexed as keyword=entrymainbodyfirstwords, so it has nothing to do with the PDF (I think).
Here is the configuration in config.php:
define('SNIPPET_DISPLAY_LENGTH',150);
define('DISPLAY_SNIPPETS',true);
define('DISPLAY_SNIPPETS_NUM',4);
define('DISPLAY_SUMMARY',true);
define('TEXT_CONTENT_PATH','text_content/');
define('CONTENT_TEXT',1);
Quote:
Index of /pdf Name Last modified Size Description Parent Directory 28-Apr-2004 16:35 - 01123SOC2004013.PDF 28-Apr-2004 18:30 69k pdf.html 28-Apr-2004 18:18 1k test.doc 28-Apr-2004 17:24 19k zyz.xls 28-Apr-2004 17:24 14k Apache1.3.29 - ProXad [Apr 1 2004 16:04:22] Server at monsiteweb.fr Port 80 Index of /pdf Index of /pdf Index of /pdf Quote:
Thanks for your great work and your quick answer.
Paul
Last edited by killer27; 04-28-2004 at 10:50 AM. |
04-28-2004, 10:56 AM | #4 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. When you try the following query, change word to some word that could only be in the PDF file:
Code:
select keyword from keywords where keyword like '%word%';
Index of /pdf Name Last modified Size Description Parent Directory 28-Apr-2004 16:35 - 01123SOC2004013.PDF 28-Apr-2004 18:30 69k pdf.html 28-Apr-2004 18:18 1k test.doc 28-Apr-2004 17:24 19k zyz.xls 28-Apr-2004 17:24 14k Apache1.3.29 - ProXad [Apr 1 2004 16:04:22] Server at monsiteweb.fr Port 80 Index of /pdf Index of /pdf Index of /pdf
That seems like a directory listing rather than the actual PDF file.
The $result array contains the following:
Result contains: Array (
[0] => Hébergement
[1] => Facture
[2] => partners -- 5 Sq de tuile_ 78000 Versailles -- Tél. / Fax : 0666666666 -- Email : contact@partners.com
[3] => SARL au capital de 3000# -- Siret545454445RCS Versailles -- APE 222Z -- Web : www.partners.com
[4] =>
[5] => FACTURE
[6] => partners CLIENT
[7] => 5 Sq de tuile Adzd MAdzNdzAS
[8] => 78000 Versailles
[9] => Tél./fax. : 01 3226222626
[10] => Prestation : Hébergement
[11] => Facture du: 01/04/2004 au 31/06/2004
[12] => N° de Facture: 12122/66
[13] => Article Objet Quantité
[14] => /
[15] => Slots
[16] => Prix
[17] => unitaire /
[18] => Trimestre
[19] => Montant TVA
[20] => Hébergement Serveur
[21] => Total HT 122.36
[22] => Total TVA 23.61
[23] => Total TTC 122.00
[24] => A payer 122.00 EUROS
[25] => Mode de paiement : A réception de facture
[26] =>
)
And with $retval being zero, the following code should make a temp file containing the contents of the $result array:
PHP Code:
Code:
select file,first_words from spider where file like '%01123SOC2004013%';
04-29-2004, 03:08 PM | #5 |
Green Mole
Join Date: Apr 2004
Posts: 4
|
Hi,
When I execute the two queries:
Quote:
Quote:
I also opened all the text files in the admin/temp directory and saw the content of the PDF file in 95593951.tmp:
Quote:
All my permissions are good; I am able to index doc and xls files.
I have PHP 4.2.2, but I have installed this patch: http://www.phpdig.net/showthread.php?threadid=570 and checked everything described in this thread: http://www.phpdig.net/showthread.php?s=&threadid=799 (I also added the code included in that thread).
I am attaching my config.php, spider.php and robot_functions.php in a zip file; maybe it can help you to help me.
Thanks a lot.
Paul
Last edited by killer27; 04-29-2004 at 03:18 PM. |
05-01-2004, 07:13 AM | #6 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Change define('PHPDIG_PDF_EXTENSION','.txt'); to define('PHPDIG_PDF_EXTENSION',''); in the config file (two single quotes, no space between).
The '.txt' is for when an external PDF binary writes its output to a TXT file, as pdftotext does; pstotext, like catdoc, writes to STDOUT, so no '.txt' is needed.
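The difference between a converter that writes a file and one that prints to STDOUT can be sketched as follows. This is a hypothetical model in Python (for illustration only; phpDig is PHP), not phpDig's actual spider code: the function read_converted_text and its rule "non-empty extension means read temp file plus extension, empty means use the captured STDOUT" are assumptions based on the explanation above.

```python
import os
import tempfile


def read_converted_text(temp_path, stdout_lines, extension):
    """Return the text the indexer would see for one converted file.

    Hypothetical model: a non-empty extension means "the converter wrote
    temp_path + extension"; an empty one means "use the captured STDOUT".
    """
    if extension:
        output_path = temp_path + extension
        if not os.path.exists(output_path):
            return ""  # nothing to index: the expected file was never written
        with open(output_path, encoding="utf-8") as handle:
            return handle.read()
    return "\n".join(stdout_lines)


# A converter that prints to STDOUT leaves no "<temp>.txt" file behind.
with tempfile.TemporaryDirectory() as workdir:
    temp_path = os.path.join(workdir, "13874292.tmp")
    captured = ["FACTURE", "Total TTC 122.00"]
    print(repr(read_converted_text(temp_path, captured, ".txt")))  # ''
    print(repr(read_converted_text(temp_path, captured, "")))
```

Under this model, a '.txt' setting combined with a STDOUT-only tool yields empty content even though the command itself succeeds, which matches the symptom reported earlier in the thread.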
05-03-2004, 08:31 AM | #7 |
Green Mole
Join Date: Apr 2004
Posts: 4
|
Thanks a lot, now it works fine.
If someone has the same issue and is using PHP 4.2.2 on Red Hat Linux 7.3, I can share my files.
Only two more issues:
First: large PDF files (around 5 MB) cannot be indexed, while small ones (around 200 KB or 500 KB) work.
Second: when I index doc files, the spider transforms é, à, è into special characters, like é=é or être=ètre. Do you have an explanation for this?
Thanks again and again...
Paul |
05-12-2004, 02:28 PM | #8 |
Green Mole
Join Date: May 2004
Location: France
Posts: 8
|
Hi !
For é=é, it looks like the é is being encoded as UTF-8 (as Google does) and then displayed as if it were a single-byte encoding such as ISO-8859-1. |