|
07-11-2004, 10:12 AM | #1 |
Green Mole
Join Date: Jul 2004
Location: Paris
Posts: 5
|
Numbers everywhere...
Hi,
I'm having a problem indexing a website with phpdig. The indexing itself runs without errors; the problem is the text stored in the txt files of the text_content directory. All the text files contain alphanumeric tokens (e.g. 19b) scattered almost everywhere. After the first indexing run there are only a few, but on re-indexing these alphanumeric "bugs" start to invade the whole text, especially at the beginning.
Here's an example after the 3rd indexing run: "b3 46 19b 198 19b ee 119 66 6e 10 Le livre du Mois 15 2 1c Miró, un feu dans les ruines 1a 1d5 Sans doute êtes vous déjà nombreux à avoir vu ou à revoir la très importante exposition consacrée"
This text is what is shown on the results page, so it is really annoying. It's as if some ereg_replace/eregi call didn't do its job properly. If somebody can tell me what's wrong, I'll be grateful. Thx. |
07-11-2004, 10:22 AM | #2 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
I believe what you need to do is replace this statement in config.php:
PHP Code:
PHP Code:
BTW, welcome to the forum! |
07-11-2004, 11:23 AM | #3 |
Green Mole
Join Date: Jul 2004
Location: Paris
Posts: 5
|
Thank you for your welcome and reply.
When I saw your reply I felt like "damn, I'm so dumb..." But no, I'm not... Setting the language parameter to "fr" didn't change anything. I made a fresh installation of phpdig (with a new database) to make sure nothing was left over from the "old" indexing run. Thx anyway.
btw, my conf.: Apache server, PHP 4.2, Windows 2k
Server conf.: Apache, PHP 4.2.1, Sun OS
... and same problem. Last edited by Nad; 07-11-2004 at 11:31 AM. |
07-11-2004, 11:35 AM | #4 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
Actually, you may be onto something with starting afresh. I had a test database for phpdig that seemed to mess things up for me when I upgraded to phpdig 1.8.1. I was still pointing to that, and didn't realize that's what the problem was when I was getting some screwy search results.
Let us know if you still have problems. We'll be glad to help. |
07-11-2004, 02:21 PM | #5 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Are you getting external binary output like in this thread, or are you getting character encoding output like in this thread? The results that have these number/letter combos, are they coming from pages that have a different encoding than that set in the config file?
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
07-11-2004, 11:59 PM | #6 |
Green Mole
Join Date: Jul 2004
Location: Paris
Posts: 5
|
Hi again,
Charter, the text stored in the txt files comes from "simple" HTML or PHP pages. Nothing from doc or pdf files. The encoding used is charset=iso-8859-1, the same as the one set in the phpdig config file. After the first indexing run, these numbers/letters appear like a "flag" in the txt files; sometimes you can find 6-7 files beginning with the same combo (e.g. e7e, or e5f, or 980 ...). As shown in the example text in my first post, these numbers/letters are placed everywhere in the text. Thx. |
07-12-2004, 04:06 AM | #7 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. What website was it that these number/letter combos came from?
07-13-2004, 10:15 AM | #8 |
Green Mole
Join Date: Jul 2004
Location: Paris
Posts: 5
|
Sorry, Network problems for 2 days...
Charter, the websites are all ones that I made (so I must be the guilty one here... ;-) ). They can be on a Unix server or Windows, local or not, and the problem is still the same. Tell me if you need a URL in any case; I'll try to give you one (not a local one, of course). Thx |
07-13-2004, 07:57 PM | #9 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Looks like it might be related to this report. Search that webpage for "a57" (without quotes) and read from there. Also, this may be of interest too.
If these number/letter combos are in fact chunk encoding size markers, then they may be on their own lines so try the following to remove them. In robot_functions.php, in the phpdigGetUrl function, find: PHP Code:
PHP Code:
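The idea described above, dropping lines that consist only of a chunk-size marker, can be sketched roughly as follows. This is only an illustration of the approach, not the patch posted in this thread: `strip_chunk_size_lines` is a hypothetical helper that is not part of phpdig, it assumes the fetched page is available as an array of lines, and it is written for a current PHP version rather than PHP 4.2.

```php
<?php
// Hypothetical helper, not part of phpdig: drops lines that look like
// HTTP chunk-size markers, i.e. lines made of hex digits and nothing else.
function strip_chunk_size_lines(array $lines): array
{
    $kept = array();
    foreach ($lines as $line) {
        // 1 to 8 hex digits followed only by whitespace / line ending.
        if (preg_match('/^[0-9a-fA-F]{1,8}\s*$/', $line)) {
            continue; // assumed to be a chunk-size line, skip it
        }
        $kept[] = $line;
    }
    return $kept;
}
```

Note the trade-off: a genuine content line that happens to contain only hex digits (for example "10" or "cafe" alone on a line) would be dropped as well, which is why this is a quick workaround rather than a real decoder.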
07-14-2004, 12:41 AM | #10 |
Green Mole
Join Date: Jul 2004
Location: Paris
Posts: 5
|
Hello Charter,
I inserted your code and tried spidering one website, and for now it is working just fine! It's great! Thanks a lot! I did not fully understand the chunk encoding stuff. I'm French, so it will take me a bit longer to read and understand all the information you gave me. It seems to be a problem between the web pages I've made and how the server returns them... Anyway, thank you again! |
07-14-2004, 01:43 AM | #11 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. See how "10 Le livre du Mois" and "1c Miró, un feu dans les ruines" were getting indexed...
Speaking of better, I suppose a routine could be written to loop and parse and convert between hex and dec and find string positions and all that, but there probably won't be any header fields in the trailer, so just avoiding the hexadecimals should do as a quick patch.
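The two quoted strings fit the chunk-size reading: hexadecimal 10 is 16, the byte length of "Le livre du Mois", and hexadecimal 1c is 28, the byte length of "Miró, un feu dans les ruines" in a single-byte charset such as iso-8859-1. The fuller routine mentioned above (loop, parse the hex size, jump ahead by that many bytes) might look something like the sketch below. `decode_chunked_body` is a hypothetical name, not a phpdig function; the sketch assumes the raw body is already in one string with the HTTP headers stripped, it ignores any trailer headers, and it targets a current PHP version.

```php
<?php
// Hypothetical helper, not part of phpdig: decodes an HTTP/1.1 body that
// was sent with "Transfer-Encoding: chunked".
function decode_chunked_body(string $body): string
{
    $decoded = '';
    $pos = 0;
    $len = strlen($body);
    while ($pos < $len) {
        // Each chunk starts with a size line: hex digits, optional ";ext", CRLF.
        $eol = strpos($body, "\r\n", $pos);
        if ($eol === false) {
            break; // malformed input: size line is not terminated
        }
        $sizeLine = substr($body, $pos, $eol - $pos);
        // Ignore any chunk extension after ";".
        $hex = trim(explode(';', $sizeLine, 2)[0]);
        if ($hex === '' || !ctype_xdigit($hex)) {
            break; // not a chunk-size line; stop rather than guess
        }
        $size = hexdec($hex);
        if ($size === 0) {
            break; // last chunk; trailer headers after it are ignored
        }
        $decoded .= substr($body, $eol + 2, $size);
        // Skip the chunk data and the CRLF that follows it.
        $pos = $eol + 2 + $size + 2;
    }
    return $decoded;
}
```

For example, a body of `"10\r\nLe livre du Mois\r\n0\r\n\r\n"` passed to this function yields the text without the leading size marker. Unlike the line-filter workaround, this does not depend on the markers sitting on their own lines, but it does require the undecoded byte stream rather than text that has already been split or cleaned.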