|
11-25-2003, 12:09 PM | #1 |
Green Mole
Join Date: Sep 2003
Posts: 5
|
Duplicate Documents Problem...
For some reason, when I run the spider it is kicking back duplicate documents that are not in fact duplicates.
It indexes this:
Code:
mambo104/index.php?option=com_weblinks&Itemid=4
Code:
mambo104/index.php?option=com_weblinks&Itemid=1&catid=2 |
11-25-2003, 01:19 PM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. It is the robot_functions.php file that determines whether a page is a duplicate, specifically the phpdigTestDouble function.
In this function, it is the following query that determines a duplicate: PHP Code:
PHP Code:
Making the page title dynamic, depending on the query string, or changing CHUNK_SIZE in the config file would be a couple of ways to avoid the duplicates.
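To illustrate why a larger chunk or a varying title might help, here is a minimal Python sketch (not PhpDig code). It assumes, purely for illustration, that a duplicate check compares the page title plus the first `chunk_size` characters of the page text; the actual query in phpdigTestDouble may use different fields. The names `fingerprint` and `is_duplicate` are hypothetical.

```python
def fingerprint(title, text, chunk_size):
    """Hypothetical fingerprint: the title plus the first chunk_size characters."""
    return (title, text[:chunk_size])


def is_duplicate(page_a, page_b, chunk_size):
    """Treat two (title, text) pages as duplicates when their fingerprints match."""
    return fingerprint(*page_a, chunk_size) == fingerprint(*page_b, chunk_size)


# Two pages sharing a title and a long common header, differing only later on.
header = "Mambo site header and menu " * 10
page_1 = ("Mambo", header + "Web links, category list")
page_2 = ("Mambo", header + "Web links, category 2 entries")

# A small chunk only covers the shared header, so the pages look identical.
small_chunk = is_duplicate(page_1, page_2, chunk_size=100)
# A larger chunk reaches the text that differs.
large_chunk = is_duplicate(page_1, page_2, chunk_size=1000)
# A title that varies per page also breaks the match, even with a small chunk.
dynamic_title = is_duplicate(page_1, ("Mambo - Web Links", page_2[1]), chunk_size=100)
```

Under these assumptions, only the small-chunk comparison with identical titles reports a duplicate.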
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
11-25-2003, 01:49 PM | #3 |
Green Mole
Join Date: Sep 2003
Posts: 5
|
Hmmm. I'll have to look into tempfilesize. Could there be some type of bug in there?
document 1 = 3.22KB
document 2 = 3.4KB
I'm thinking the temp file size should be the same as the actual file size, no? And if so, I would think the different file sizes would prevent them from being tagged as dupes.
Thanks for your other suggestion regarding titles. Unfortunately, I am building this as a plug-in component for Mambo OS, and their titles are not dynamic out of the box. So I need to come up with a better solution that works with the stock install of Mambo.
Any more info would be appreciated... maybe I can modify the function so that it bases duplicates on the actual URL. |
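The idea of basing duplicates on the actual URL could look roughly like the Python sketch below. It is only an illustration of the approach, not a patch for phpdigTestDouble: the helpers `url_key` and `is_new_url` are hypothetical, and the choice to sort query parameters is an assumption about what should count as "the same URL".

```python
from urllib.parse import urlsplit, parse_qsl, urlencode


def url_key(url):
    """Build a comparison key from the path plus the sorted query parameters."""
    parts = urlsplit(url)
    # Sort parameters so that a=1&b=2 and b=2&a=1 yield the same key.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return parts.path + ("?" + query if query else "")


seen = set()


def is_new_url(url):
    """Return True the first time a URL key is seen, False afterwards."""
    key = url_key(url)
    if key in seen:
        return False
    seen.add(key)
    return True


# The two URLs from this thread have different query strings, so both are new.
first = is_new_url("mambo104/index.php?option=com_weblinks&Itemid=4")
second = is_new_url("mambo104/index.php?option=com_weblinks&Itemid=1&catid=2")
# The first URL with its parameters reordered maps to an already-seen key.
repeat = is_new_url("mambo104/index.php?Itemid=4&option=com_weblinks")
```

The trade-off is that a URL-only check no longer catches the same content served under two different URLs, which is what a content-based check is meant to detect.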
11-25-2003, 02:16 PM | #4 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. The $tempfilesize variable is created in the phpdigTempFile function in the robot_functions.php file and is set to the filesize of the temporary file. Do those two pages still show as duplicates if you increase the CHUNK_SIZE or add some amount of random text to the end of one of the pages?
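The effect of adding text on a temporary file's size can be checked in isolation with a small Python sketch like the one below. It is not PhpDig code, and `temp_file_size` is a hypothetical helper; whether PhpDig writes the raw page or processed text to its temporary file is not stated in this thread, so the size may legitimately differ from the size of the original page.

```python
import os
import tempfile


def temp_file_size(content):
    """Write content to a temporary file and return its size in bytes."""
    # newline="" avoids platform newline translation changing the byte count.
    with tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", newline="", delete=False
    ) as handle:
        handle.write(content)
        path = handle.name
    try:
        return os.path.getsize(path)
    finally:
        os.remove(path)


page = "Web links for category 2"
padding = " padding text"

size_before = temp_file_size(page)
size_after = temp_file_size(page + padding)
# For this ASCII text the size grows by exactly the number of characters added.
growth = size_after - size_before
```

If a size like this is one of the values being compared, appending text to one page should be enough to stop the two pages from matching.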
|