PhpDig.net

Old 05-18-2005, 02:11 AM   #1
dhorwitz
Green Mole
 
Join Date: May 2005
Posts: 3
spidering in domain 2 problems

Hi All,

While spidering a domain with PHPDIG_IN_DOMAIN, I have noticed 2 problems (in phpdig-1.8.7):

1) Spidering a domain (in my case a large university domain) from the main institutional site results in some sites not being recognised as part of the domain. For instance, if the spider starts at www.uct.ac.za, then web.uct.ac.za is recognised as part of the domain while www.ched.uct.ac.za is not (i.e. to compare domains it seems to strip the first part of the host name rather than check the end of the domain).

2) When it encounters a new site, the site is recorded in the temp file as / rather than as the page that was linked. So sites whose pages cannot be reached from the root folder don't get indexed.
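
To illustrate problem 1, here is a minimal sketch of the kind of comparison described above. It is not the PhpDig source: the function names are hypothetical, and it assumes the check simply drops the first label of each host name and compares what is left.

```php
<?php
// Hypothetical helper (not PhpDig code): drop the first label of a host name,
// e.g. www.uct.ac.za -> uct.ac.za.
function stripFirstLabel(string $host): string {
    $pos = strpos($host, '.');
    return $pos === false ? $host : substr($host, $pos + 1);
}

// Assumed behaviour: two hosts are "same domain" only when the remainders
// after stripping the first label are identical.
function sameDomainByStripping(string $hostA, string $hostB): bool {
    return stripFirstLabel($hostA) === stripFirstLabel($hostB);
}

var_dump(sameDomainByStripping('www.uct.ac.za', 'web.uct.ac.za'));      // bool(true)
var_dump(sameDomainByStripping('www.uct.ac.za', 'www.ched.uct.ac.za')); // bool(false)
```

Under that assumption, www.ched.uct.ac.za reduces to ched.uct.ac.za, which is not equal to uct.ac.za, so a host one level deeper is rejected even though it sits inside the same domain.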

I'll have a look in the code and see what I can find...

David
Old 05-18-2005, 03:18 AM   #2
dhorwitz
patch

Here is an updated phpdigCompareDomains that seems to fix the problem (I don't know if it breaks anything else!).


//=================================================
//Find if an url is same domain than another
function phpdigCompareDomains($url1,$url2) {
    $url1 = parse_url($url1);
    $url2 = parse_url($url2);
    print $url1['host']."\n";
    print $url2['host']."\n";

    if (isset($url1['host']) && isset($url2['host'])
        && eregi('^([a-z0-9_-]+)\.(.+)',$url1['host'],$from_url)
        && eregi('^([a-z0-9_-]+)\.(.+)',$url2['host'],$to_url)
        && strpos($url2['host'],$from_url[2])!==false
        && (strpos($url2['host'],$from_url[2])+strlen($from_url[2])==strlen($url2['host']))) {
        return true;
    }
    else {
        return false;
        // be careful setting this to true as indexing
        // could take a very, VeRy, VERY looooong time
        // return true;
    }
}
Old 05-18-2005, 05:01 AM   #3
dhorwitz
oops

I got the terms back to front :-) Here is the corrected version:

//=================================================
//Find if an url is same domain than another
function phpdigCompareDomains($url1,$url2) {
    $url1 = parse_url($url1);
    $url2 = parse_url($url2);
    // debug output
    print $url1['host']."\n";
    print $url2['host']."\n";

    // $to_url[2] is the host of $url2 without its first label;
    // the URLs match when the host of $url1 ends with that string
    if (isset($url1['host']) && isset($url2['host'])
        && eregi('^([a-z0-9_-]+)\.(.+)',$url1['host'],$from_url)
        && eregi('^([a-z0-9_-]+)\.(.+)',$url2['host'],$to_url)
        && strpos($url1['host'],$to_url[2])!==false
        && (strpos($url1['host'],$to_url[2])+strlen($to_url[2])==strlen($url1['host']))) {
        return true;
    }
    else {
        return false;
        // be careful setting this to true as indexing
        // could take a very, VeRy, VERY looooong time
        // return true;
    }
}
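
Two caveats on the patch above, with a possible variation. The strpos test accepts any host that merely ends with the same characters, so www.notuct.ac.za would pass as part of uct.ac.za. Also, eregi was removed in PHP 7.0, so the function only runs on older PHP versions. The sketch below is not PhpDig code: hostInDomain is a hypothetical name, it takes host names rather than full URLs, and it assumes the start host carries a leading label such as "www" that can be dropped.

```php
<?php
// Hypothetical helper (not part of PhpDig): true when $linkHost is the
// start site's parent domain or any subdomain of it.
function hostInDomain(string $linkHost, string $startHost): bool {
    $linkHost  = strtolower($linkHost);
    $startHost = strtolower($startHost);

    // Drop the first label of the start host, e.g. www.uct.ac.za -> uct.ac.za.
    // Assumption: the start host has a leading label such as "www".
    $pos    = strpos($startHost, '.');
    $domain = $pos === false ? $startHost : substr($startHost, $pos + 1);

    if ($linkHost === $domain) {
        return true;
    }

    // Require a dot before the suffix so notuct.ac.za does not match uct.ac.za.
    $suffix = '.' . $domain;
    return strlen($linkHost) > strlen($suffix)
        && substr($linkHost, -strlen($suffix)) === $suffix;
}

var_dump(hostInDomain('www.ched.uct.ac.za', 'www.uct.ac.za')); // bool(true)
var_dump(hostInDomain('www.notuct.ac.za', 'www.uct.ac.za'));   // bool(false)
```

If the start host has no leading label (for example uct.ac.za itself), stripping the first label would leave ac.za and match far too much, so a real version would need to handle that case separately.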