PhpDig.net

Old 04-28-2004, 09:00 AM   #1
killer27
Green Mole
 
Join Date: Apr 2004
Posts: 4
For me, PhpDig only indexes the title of the PDF file, the time of indexing, and the size of the PDF file. These show up in the keywords table, but none of the PDF's content is in the database.
This is strange, because when I index a site with PDF files the spider seems to process them, as the output below shows:


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/bin/pstotext -cork ../admin/temp/13874292.tmp
Result contains: Array ( [0] => Hébergement [1] => Facture [2] => partners -- 5 Sq de tuile_ 78000 Versailles -- Tél. / Fax : 0666666666 -- Email : contact@partners.com [3] => SARL au capital de 3000# -- Siret545454445RCS Versailles -- APE 222Z -- Web : www.partners.com [4] => [5] => FACTURE [6] => partners CLIENT [7] => 5 Sq de tuile Adzd MAdzNdzAS [8] => 78000 Versailles [9] => Tél./fax. : 01 3226222626 [10] => Prestation : Hébergement [11] => Facture du: 01/04/2004 au 31/06/2004 [12] => N° de Facture: 12122/66 [13] => Article Objet Quantité [14] => / [15] => Slots [16] => Prix [17] => unitaire / [18] => Trimestre [19] => Montant TVA [20] => Hébergement Serveur [21] => Total HT 122.36 [22] => Total TVA 23.61 [23] => Total TTC 122.00 [24] => A payer 122.00 EUROS [25] => Mode de paiement : A réception de facture [26] => )
Return value is: 0

5:http://monsiteweb.fr/pdf/01123SOC2004013.PDF
(time: 00:01:49)

No links in the temporary table
Old 04-28-2004, 09:59 AM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. That all looks like it's working. What do you get when you run the following query?
Code:
select first_words from spider where file like '%.pdf%';
Also, look in the keywords table for words from the PDF file:
Code:
select keyword from keywords where keyword like '%word%';
If you have both define('CONTENT_TEXT',1); and define('DISPLAY_SNIPPETS',true); set in the config file, then there should be a text file in the text_content directory with the PDF content.

If you have define('CONTENT_TEXT',0); set in the config file, then when searching on a keyword just $text from list($title,$text) = explode("\n",$first_words); will be shown regardless of keyword.
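
To illustrate that last point, here is a minimal sketch of how the split behaves. The sample $first_words value is made up for illustration; in PhpDig the value would come from the first_words column of the spider table.

```php
<?php
// Hypothetical sample value: the title on the first line, the leading text on the second.
$first_words = "Invoice 2004\nHosting invoice for the second quarter";

// Split on the newline: the first piece is the title, the second is the text.
list($title, $text) = explode("\n", $first_words);

// With CONTENT_TEXT set to 0, this $text is what a result would display,
// whatever keyword was searched for.
echo $title, "\n";
echo $text, "\n";
```

Because $text is fixed at indexing time, it does not change with the search term, unlike the snippets built from the files in text_content.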
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 04-28-2004, 10:29 AM   #3
killer27
Green Mole
 
Join Date: Apr 2004
Posts: 4
Hi,

When I run the following query:

select first_words from spider where file like '%.pdf%';

I get nothing.

When I run the following query:
select keyword from keywords where keyword like '%word%';

I get:
key_id=61365
twoletters=en
keyword=entrymainbodyfirstwords


But I think this is because my index.html contains the token {{entrymainbodyfirstwords 25}}, which was indexed as keyword=entrymainbodyfirstwords,
so it has nothing to do with the PDF (I think).


Here is the configuration in config.php:


define('SNIPPET_DISPLAY_LENGTH',150);
define('DISPLAY_SNIPPETS',true);
define('DISPLAY_SNIPPETS_NUM',4);
define('DISPLAY_SUMMARY',true);


define('TEXT_CONTENT_PATH','text_content/');
define('CONTENT_TEXT',1);



Quote:
then there should be a text file in the text_content directory with the PDF content.
Yes, I have text files, but the text file for the PDF only shows:

Index of /pdf Name Last modified Size Description Parent Directory
28-Apr-2004 16:35 - 01123SOC2004013.PDF 28-Apr-2004 18:30 69k pdf.html
28-Apr-2004 18:18 1k test.doc 28-Apr-2004 17:24 19k zyz.xls 28-Apr-2004
17:24 14k Apache1.3.29 - ProXad [Apr 1 2004 16:04:22] Server at
monsiteweb.fr Port 80 Index of /pdf Index of /pdf Index of /pdf


Quote:
then when searching on a keyword just $text from list($title,$text) = explode("\n",$first_words); will be shown regardless of keyword.
I don't understand the last part of your message.

Thanks for your great work and your quick answer.

Paul

Last edited by killer27; 04-28-2004 at 10:50 AM.
Old 04-28-2004, 10:56 AM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. When you try the following query, change word to some word that could only be in the PDF file:
Code:
select keyword from keywords where keyword like '%word%';
The file in the text_content directory contains the following:

Index of /pdf Name Last modified Size Description Parent Directory
28-Apr-2004 16:35 - 01123SOC2004013.PDF 28-Apr-2004 18:30 69k pdf.html
28-Apr-2004 18:18 1k test.doc 28-Apr-2004 17:24 19k zyz.xls 28-Apr-2004
17:24 14k Apache1.3.29 - ProXad [Apr 1 2004 16:04:22] Server at
monsiteweb.fr Port 80 Index of /pdf Index of /pdf Index of /pdf

That looks like a directory listing rather than the content of the actual PDF file. The $result array, however, contains the following:

Result contains: Array ( [0] => Hébergement [1] => Facture [2] => partners -- 5 Sq de tuile_ 78000 Versailles -- Tél. / Fax : 0666666666 -- Email : contact@partners.com [3] => SARL au capital de 3000# -- Siret545454445RCS Versailles -- APE 222Z -- Web : www.partners.com [4] => [5] => FACTURE [6] => partners CLIENT [7] => 5 Sq de tuile Adzd MAdzNdzAS [8] => 78000 Versailles [9] => Tél./fax. : 01 3226222626 [10] => Prestation : Hébergement [11] => Facture du: 01/04/2004 au 31/06/2004 [12] => N° de Facture: 12122/66 [13] => Article Objet Quantité [14] => / [15] => Slots [16] => Prix [17] => unitaire / [18] => Trimestre [19] => Montant TVA [20] => Hébergement Serveur [21] => Total HT 122.36 [22] => Total TVA 23.61 [23] => Total TTC 122.00 [24] => A payer 122.00 EUROS [25] => Mode de paiement : A réception de facture [26] => )

Since $retval is zero, the following code should write the contents of the $result array to a temp file:
PHP Code:
if (!$retval) {
     // the replacement of 'š' is for non-breaking spaces
     // returned by catdoc parsing msword files
     // and '0xAD' "tiret quadratin" returned by pstotext
     // in iso-8859-1
     // Adjust with your encoding and/or your tools
     if ((is_array($result)) && (count($result) > 0)) {
        $f_handler = fopen($tempfile1,'wb');
        fwrite($f_handler,str_replace('š',' ',str_replace(chr(0xad),'-',implode(' ',$result))));
        fclose($f_handler);
     }
}
else {
     return array('tempfile'=>0,'tempfilesize'=>0);
}

Also, what do you get with the following query:
Code:
select file,first_words from spider where file like '%01123SOC2004013%';
And are the admin/temp and text_content directories set to 777 permissions?
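
A quick way to check that last point from PHP is sketched below. The helper name and the assumption that the script runs from the PhpDig root are for illustration only; is_dir() and is_writable() are standard PHP functions. The check reports what the user PHP runs as can actually do, which is what matters to the spider, rather than whether the mode is exactly 777.

```php
<?php
// Hypothetical helper: for each directory, report whether it exists
// and is writable by the user PHP runs as.
function check_writable_dirs($dirs)
{
    $report = array();
    foreach ($dirs as $dir) {
        $report[$dir] = is_dir($dir) && is_writable($dir);
    }
    return $report;
}

// Directory names as used in this thread, relative to the PhpDig install (an assumption).
foreach (check_writable_dirs(array('admin/temp', 'text_content')) as $dir => $ok) {
    echo $dir, ': ', ($ok ? 'writable' : 'missing or not writable'), "\n";
}
```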
Old 04-29-2004, 03:08 PM   #5
killer27
Green Mole
 
Join Date: Apr 2004
Posts: 4
Hi,

When I execute the two queries:
Quote:
select keyword from keywords where keyword like '%Facture%';
and
Quote:
select file,first_words from spider where file like '%01123SOC2004013%';
I get no results from MySQL, so I am sure the PDF is not being indexed.

I also opened all the text files in the admin/temp directory and saw the content of the PDF file in 95593951.tmp:
Quote:
Hébergement Facture partners -- 9 Sq de Bgdgg - 79699 paris -- Tél. / Fax : 565995465559 -- Email : contact@-partners.com SARL au capital de 3000# -- Siret +6++++RCS Versailles -- APE 698Z -- Web : www.partners.com FACTURE partners CLIENT 9 Sq ghffg Antoine gdgd 75995 paris Tél./fax. : 065965659959 Prestation : Hébergement Facture du: 01/04/2004 au 31/06/2004 N° de Facture: 0899999 Article Objet Quantité / Slots Prix unitaire / Trimestre Montant TVA Hébergement Serveur Vietcong 6+6+5488484 Total HT 120.39 Total TVA 23.61 Total TTC 144.00 A payer 144.00 EUROS Mode de paiement : A réception de facture
When I open the files in the text_content directory, none of them contains the PDF content.

All my permissions are correct; I am able to index doc and xls files.

I have PHP 4.2.2, but I have installed this patch:

http://www.phpdig.net/showthread.php?threadid=570

and checked everything described in this thread:
http://www.phpdig.net/showthread.php?s=&threadid=799
(I also added the code included in that thread.)


I have attached my config.php, spider.php and robot_functions.php in a zip file; maybe that will help you help me.

Thanks a lot.
Paul
Attached Files
File Type: zip files.zip (23.7 KB, 12 views)

Last edited by killer27; 04-29-2004 at 03:18 PM.
Old 05-01-2004, 07:13 AM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Change define('PHPDIG_PDF_EXTENSION','.txt'); to define('PHPDIG_PDF_EXTENSION',''); in the config file (two single quotes, no space between).

The '.txt' is for when an external PDF binary writes its output to a TXT file, as pdftotext does. However, pstotext (like catdoc) writes to STDOUT, so no '.txt' is needed.
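
The effect of that setting can be sketched as follows. This is not PhpDig's actual code: the function name is hypothetical, and the sketch only assumes that the configured extension is appended to the temp file name when the spider looks for the converted text.

```php
<?php
// Hypothetical illustration: where would the converted text be looked for?
function converted_text_path($tempfile, $extension)
{
    // A converter that writes its own output file needs the extension appended.
    // A converter that prints to STDOUT creates no such file, so the captured
    // output lives in the temp file itself and nothing should be appended.
    return $tempfile . $extension;
}

echo converted_text_path('13874292.tmp', '.txt'), "\n"; // 13874292.tmp.txt
echo converted_text_path('13874292.tmp', ''), "\n";     // 13874292.tmp
```

Under that assumption, a '.txt' setting combined with a STDOUT converter points at a file that never gets created, which would match the symptom in this thread: text in the .tmp file, nothing in the index.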
Old 05-03-2004, 08:31 AM   #7
killer27
Green Mole
 
Join Date: Apr 2004
Posts: 4
Cool

Thanks a lot, now it works fine.

If someone has the same issue and is using PHP 4.2.2 on Red Hat Linux 7.3, I can share my files.

Only two more issues:

First: large PDF files (around 5 MB) cannot be indexed, while small ones (around 200 KB or 500 KB) work.

Second: when I index doc files, the spider turns é, à, è into special characters, like é=é or être=ètre. Do you have an explanation for this?

Thanks again and again...

Paul
Old 05-12-2004, 02:28 PM   #8
Pulsar-san
Green Mole
 
Join Date: May 2004
Location: France
Posts: 8
Hi!

As for the é=é:
it looks like the é is being converted to UTF-8 (as on Google).
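
That effect can be reproduced in a few lines. The sketch below assumes the iconv extension is available. It takes the UTF-8 bytes of é and deliberately misreads them as ISO-8859-1, which produces the two-character sequence typically seen in such cases, and then shows the proper conversion.

```php
<?php
// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
$utf8 = "\xC3\xA9";

// Misreading those bytes as ISO-8859-1 turns each byte into its own character
// (0xC3 is "Ã", 0xA9 is "©"); re-encoded as UTF-8 for display, that is four bytes.
$garbled = iconv('ISO-8859-1', 'UTF-8', $utf8);

// If the target really is ISO-8859-1, convert the UTF-8 text instead of misreading it.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);

echo bin2hex($garbled), "\n"; // c383c2a9
echo bin2hex($latin1), "\n";  // e9
```

So the fix is usually to make the converter's output encoding and the encoding the spider expects agree, or to convert explicitly between the two.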