|
03-31-2004, 12:32 AM | #1 |
Green Mole
Join Date: Jan 2004
Location: Italy
Posts: 11
|
Bug when spidering subdomains
Hi charter!
I have found the following bug:

1) I spider a site (example: http://www.jobnetwork.it/foodir/foo.htm) that contains a link to a subdomain. The link must consist of the hostname only. In the example you will find a link to http://piemonte.jobnetwork.it

2) The spider finds the link and, since I have define('PHPDIG_IN_DOMAIN',true); set, adds it to the tempspider and sites tables. It adds the site correctly, but in tempspider it adds the path of the current page. In this case it adds http://piemonte.jobnetwork.it/foodir/

3) I did some tweaking and found this in robot_functions.php, in the phpdigExplore function:

PHP Code:
    if (substr($regs[8],0,1) == "/") {
        $links[$index] = phpdigRewriteUrl($regs[8]);
    }
    else {
        $links[$index] = phpdigRewriteUrl($path.$regs[8]);
    }

The "else" branch is executed when the link has no path or filename, so if I link to http://subdomain.jobnetwork.it the current path is added to the link!

My solution is the following:

PHP Code:
    if (substr($regs[8],0,1) == "/") {
        $links[$index] = phpdigRewriteUrl($regs[8]);
    }
    elseif ($regs[5] == "" or $url == 'http://'.$regs[5].'/') {
        // we are in the same host or the host information is not provided
        $links[$index] = phpdigRewriteUrl($path.$regs[8]);
    }
    elseif ($regs[5] != "" && $url != 'http://'.$regs[5].'/') {
        // host information is provided but we are not in the same host
        $links[$index] = phpdigRewriteUrl($regs[8]);
    }

Charter, what do you think? I don't know whether the solution is good, or whether it leaves the other kinds of links unaffected.

Regards
Simone Capra
capra__nospam__@erweb.it
E.R.WEB - s.r.l.
http://www.erweb.it |
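The behaviour reported in the post above can be reproduced outside PhpDig with a small standalone sketch. This is only an illustration, not PhpDig code: the function name resolveLink is hypothetical, it uses PHP's built-in parse_url instead of PhpDig's $regs array, and it assumes plain http links. It also simplifies the proposed condition by treating every link that carries a host as absolute, so the current directory is only prepended to truly relative links.

```php
<?php
// Hypothetical helper (not part of PhpDig): resolve a link $href
// found on the page $pageUrl into a full http URL.
function resolveLink(string $pageUrl, string $href): string
{
    // parse_url() returns false on seriously malformed input; fall back to [].
    $page = parse_url($pageUrl) ?: [];
    $link = parse_url($href) ?: [];

    $pageHost = $page['host'] ?? '';
    $pagePath = $page['path'] ?? '/';
    // Directory of the current page, e.g. "/foodir/" for "/foodir/foo.htm".
    $pageDir = substr($pagePath, 0, strrpos($pagePath, '/') + 1);

    $linkHost = $link['host'] ?? '';
    $linkPath = $link['path'] ?? '';

    if ($linkHost !== '') {
        // Link with its own host: never inherit the current directory.
        // A bare hostname such as http://piemonte.jobnetwork.it maps to "/".
        return 'http://' . $linkHost . ($linkPath === '' ? '/' : $linkPath);
    }
    if (substr($linkPath, 0, 1) === '/') {
        // Root-relative link on the current host.
        return 'http://' . $pageHost . $linkPath;
    }
    // Relative link: prepend the directory of the current page.
    return 'http://' . $pageHost . $pageDir . $linkPath;
}
```

With the page http://www.jobnetwork.it/foodir/foo.htm, the bare subdomain link http://piemonte.jobnetwork.it resolves to http://piemonte.jobnetwork.it/ rather than http://piemonte.jobnetwork.it/foodir/, while a relative link such as bar.htm still resolves to http://www.jobnetwork.it/foodir/bar.htm.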
04-01-2004, 12:48 AM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Good eye! Yes, I see the problem when a link like http://sub.domain.com is encountered without an ending slash. Untested, but an alternative solution might be the following:
PHP Code:
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
|
|