|
05-18-2005, 02:11 AM | #1 |
Green Mole
Join Date: May 2005
Posts: 3
|
spidering in domain 2 problems
Hi All,
Spidering a domain using PHPDIG_IN_DOMAIN, I have noticed 2 problems (in phpdig-1.8.7):
1) Spidering a domain (in my case a large university domain) from the main institutional site results in some sites not being recognised as being in the domain. For instance, if the search starts at www.uct.ac.za, then web.uct.ac.za is recognised as part of the domain while www.ched.uct.ac.za is not (i.e. to check domains it seems to strip the first part of the host name rather than check the end of the domain).
2) When it encounters a new site, the site is recorded in the temp file as at / rather than at the page linked, so sites that are not searchable from the root folder don't get indexed.
I'll have a look in the code and see what I can find...
David
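To illustrate problem 1, here is a minimal sketch, assuming the check drops the leading label of each host name and compares what is left. It is a reconstruction of the behaviour described above, not the phpdig source, and the function name `stripFirstLabelCompare` is hypothetical.

```php
<?php
// Hypothetical reconstruction of the behaviour described in the post
// (not phpdig code): drop the first label of each host name and
// compare the remainders for equality.
function stripFirstLabelCompare(string $hostA, string $hostB): bool
{
    // explode with a limit of 2 splits off only the first label
    $restA = explode('.', $hostA, 2)[1] ?? '';
    $restB = explode('.', $hostB, 2)[1] ?? '';
    return $restA !== '' && strcasecmp($restA, $restB) === 0;
}

// "uct.ac.za" vs "uct.ac.za": treated as the same domain
var_dump(stripFirstLabelCompare('www.uct.ac.za', 'web.uct.ac.za'));      // bool(true)
// "uct.ac.za" vs "ched.uct.ac.za": treated as a different domain
var_dump(stripFirstLabelCompare('www.uct.ac.za', 'www.ched.uct.ac.za')); // bool(false)
```

Under that assumption, any host one level deeper than the starting site is rejected, which matches what happens with www.ched.uct.ac.za.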
05-18-2005, 03:18 AM | #2 |
Green Mole
Join Date: May 2005
Posts: 3
|
patch
Here is an updated phpdigCompareDomains that seems to fix the problem (I don't know if it breaks anything else!):
//=================================================
//Find if an url is same domain than another
function phpdigCompareDomains($url1,$url2)
{
    $url1 = parse_url($url1);
    $url2 = parse_url($url2);
    print $url1['host']."\n";
    print $url2['host']."\n";
    if (isset($url1['host']) && isset($url2['host'])
        && eregi('^([a-z0-9_-]+)\.(.+)',$url1['host'],$from_url)
        && eregi('^([a-z0-9_-]+)\.(.+)',$url2['host'],$to_url)
        && (strpos($url2['host'],$from_url[2])!==false
            && (strpos($url2['host'],$from_url[2])+strlen($from_url[2])==strlen($url2['host'])))) {
        return true;
    }
    else {
        return false;
        // be careful setting this to true as indexing
        // could take a very, VeRy, VERY looooong time
        // return true;
    }
}
05-18-2005, 05:01 AM | #3 |
Green Mole
Join Date: May 2005
Posts: 3
|
oops
I got the terms back to front :-) Corrected version:
//=================================================
//Find if an url is same domain than another
function phpdigCompareDomains($url1,$url2)
{
    $url1 = parse_url($url1);
    $url2 = parse_url($url2);
    print $url1['host']."\n";
    print $url2['host']."\n";
    if (isset($url1['host']) && isset($url2['host'])
        && eregi('^([a-z0-9_-]+)\.(.+)',$url1['host'],$from_url)
        && eregi('^([a-z0-9_-]+)\.(.+)',$url2['host'],$to_url)
        && (strpos($url1['host'],$to_url[2])!==false
            && (strpos($url1['host'],$to_url[2])+strlen($to_url[2])==strlen($url1['host'])))) {
        return true;
    }
    else {
        return false;
        // be careful setting this to true as indexing
        // could take a very, VeRy, VERY looooong time
        // return true;
    }
}
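For anyone adapting this patch: the `strpos` test looks at the first occurrence of the domain string and does not require it to start on a label boundary, so a host such as www.notuct.ac.za would appear to pass when compared against www.uct.ac.za. The following is a minimal sketch of a stricter suffix check, assuming the shared domain (here uct.ac.za) is known up front. `hostIsInDomain` is a hypothetical helper and not part of phpdig.

```php
<?php
// Hypothetical helper, not part of phpdig: returns true when $host is
// $domain itself or a subdomain of it (suffix match on a label boundary).
function hostIsInDomain(string $host, string $domain): bool
{
    $host   = strtolower($host);
    $domain = strtolower(ltrim($domain, '.'));
    if ($domain === '') {
        return false;
    }
    if ($host === $domain) {
        return true;
    }
    // Require a leading dot so "notuct.ac.za" does not match "uct.ac.za"
    $suffix = '.' . $domain;
    return strlen($host) > strlen($suffix)
        && substr($host, -strlen($suffix)) === $suffix;
}

var_dump(hostIsInDomain('www.ched.uct.ac.za', 'uct.ac.za')); // bool(true)
var_dump(hostIsInDomain('web.uct.ac.za', 'uct.ac.za'));      // bool(true)
var_dump(hostIsInDomain('notuct.ac.za', 'uct.ac.za'));       // bool(false)
```

The sketch compares a host against a fixed domain rather than against another URL, so it would need a way to derive that domain from the starting site before it could replace the comparison in the patch.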