PhpDig.net

Old 07-08-2004, 07:19 PM   #1
bloodjelly
Purple Mole
 
Join Date: Dec 2003
Posts: 106
Spidering sub-directories as the root

I'm interested in getting the spider function, not just the search function, to treat subdirectories of URLs as the root.

For example, someone might want to spider http://www.geocities.com/website as its own site, without scanning the true root (www.geocities.com).

So far I changed this bit of code in robot_functions.php:
PHP Code:
$url = $pu['scheme']."://".$pu['host']."/";
to this:
PHP Code:
    $url = $pu['scheme']."://".$pu['host'];
    if (isset($pu['path'])) {
        $url .= $pu['path']."/";
    }
    else {
        $url .= "/";
    }
and this:
PHP Code:
$subpu = phpdigRewriteUrl($pu['path']."?".$pu['query']);
to this:
PHP Code:
    if (isset($pu['path'])) {
        $subpu = phpdigRewriteUrl("?".$pu['query']);
    }
    else {
        $subpu = phpdigRewriteUrl($pu['path']."?".$pu['query']);
    }
which made the end directory store correctly in the table, but I get a 0 links found message. Has anyone tried to do this yet? I'm not sure if I'm on the right track. Thanks.
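
For reference, the intent described above (keep the directory as part of the stored root) can be sketched as a standalone function. This is only an illustration, not a patch to robot_functions.php: the name baseUrlWithDirectory is hypothetical, and treating a last path segment that contains a dot as a file name is an assumption.

```php
<?php
// Hypothetical helper, not part of PhpDig: build the "root" URL for a site,
// keeping the directory path instead of truncating to scheme://host/.
function baseUrlWithDirectory(string $url): string
{
    $pu = parse_url($url);
    if ($pu === false || !isset($pu['scheme'], $pu['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }

    $base = $pu['scheme'] . '://' . $pu['host'];
    $path = $pu['path'] ?? '/';

    // Split off the last path segment.
    $slash = strrpos($path, '/');
    $last  = ($slash === false) ? $path : substr($path, $slash + 1);

    // Assumption: a last segment containing a dot is a file (main.html),
    // so only the directory part before it is kept.
    if ($last !== '' && strpos($last, '.') !== false) {
        $path = ($slash === false) ? '/' : substr($path, 0, $slash + 1);
    }

    // Always return exactly one trailing slash.
    return $base . rtrim($path, '/') . '/';
}
```

Unlike appending "/" to $pu['path'] directly, this sketch also copes with URLs that already end in a slash or that point at a file inside the directory.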
Old 07-10-2004, 02:01 PM   #2
caco3
Green Mole
 
Join Date: Jul 2004
Location: Illnau, Switzerland, Europe
Posts: 9
hello bloodjelly

I had the same problem, and I solved it by adding this code:

PHP Code:
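/// Modification 2004 by George Ruinelli //////////////////////////
if ($link['url'] == "http://www.domain.ch/") {
  // The "_" prefix shifts a match at the start of the string to position 1,
  // so strpos() returns false only when "subdir/" is really absent.
  $pos1 = strpos("_".$link['path'], "subdir/");
  $pos2 = strpos("_".$link['file'], "subdir/");
  //if($pos!=1 AND $pos!=2){
  if ($pos1 == false AND $pos2 == false) { // text not found
    $link['ok'] = 0;
  }
}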

in the file robot_functions.php at the end of the function phpdigDetectDir but before
PHP Code:
if (!$link['ok'] && isset($status)) {
    $link['status'] = $status['status'];
    $link['host'] =   $status['host'];
    $link['path'] =   $status['path'];
    $link['cookies'] = $status['cookies'];

My code prevents PhpDig from adding a link that isn't in this subdir to its list.
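
A stricter variant of the same filter idea, as a rough sketch: it accepts a link only when its path starts with the subdirectory, whereas the strpos() check above matches "subdir/" anywhere in the path. The function name isInsideSubdir is hypothetical and the code is independent of PhpDig's $link array.

```php
<?php
// Hypothetical helper, not part of PhpDig: true when $path lies inside $subdir.
function isInsideSubdir(string $path, string $subdir): bool
{
    // Normalise the directory to "/subdir/" so that "/subdirectory/x"
    // does not count as being inside "/subdir".
    $trimmed = trim($subdir, '/');
    $prefix  = ($trimmed === '') ? '/' : '/' . $trimmed . '/';

    // Make sure the path being tested starts with a single slash.
    $path = '/' . ltrim($path, '/');

    // strncmp() returns 0 when the first strlen($prefix) bytes are equal.
    return strncmp($path, $prefix, strlen($prefix)) === 0;
}
```

The comparison uses === 0 on purpose: in PHP a loose comparison such as == false also treats position 0 as "not found", which is why the code above needed the "_" prefix.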

Last edited by caco3; 07-10-2004 at 02:14 PM.
Old 07-12-2004, 01:39 PM   #3
bloodjelly
Purple Mole
 
Join Date: Dec 2003
Posts: 106
Thanks for the help, caco, but what I need is a mod that adds links to the database exactly as entered, either with a subdirectory or not. In other words, if I wanted to spider "http://www.mysite.com/directory" as a root, I could do it, and if I wanted to spider "http://www.mysite.com" as a root I could do that too.
Old 07-12-2004, 06:27 PM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Perhaps upgrade to PhpDig version 1.8.2...
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 07-12-2004, 06:39 PM   #5
bloodjelly
Purple Mole
 
Join Date: Dec 2003
Posts: 106
You are awesome.
Old 07-14-2004, 09:04 PM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
FYI: version 1.8.3 released to allow for the 'limit to directory' option to be consistent across other control panel options, among other changes.
Old 07-15-2004, 07:54 PM   #7
bloodjelly
Purple Mole
 
Join Date: Dec 2003
Posts: 106
Hi charter -

I'm not sure if I'm using the limit to directory feature correctly (I have it set to "true") but when I enter a website (www.geocities.com/psychology_x/main.html for example) it spiders correctly, but the listing in the "sites" table is only for geocities. Is there a way to make each separate directory treated as its own site? Or am I missing something? Thanks.
Old 07-15-2004, 09:05 PM   #8
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. The issue is that foo.com/bar/ is not a separate domain from foo.com/ but rather a subdirectory of that domain. Spidering can now be limited to subdirectories, but the domain is still the domain. The bar.foo.com/ subdomain, on the other hand, can point to the foo.com/bar/ subdirectory, but it is a third-level domain and can also be treated as a separate site on a separate server. The database storage scheme is domain based, which is why subdirectories are not stored separately but subdomains are.
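
To illustrate the distinction, here is a small sketch of what keying sites by domain means. It is not PhpDig's actual storage code; the function name siteKey is hypothetical, and it only shows that a host-based key cannot tell foo.com/ and foo.com/bar/ apart, while bar.foo.com/ gets its own key.

```php
<?php
// Hypothetical illustration of domain-based keying, not PhpDig's real code.
function siteKey(string $url): string
{
    // parse_url() with PHP_URL_HOST returns only the host component.
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        throw new InvalidArgumentException("No host in URL: $url");
    }
    // Host names are case-insensitive, so normalise to lower case.
    return strtolower($host);
}

// The subdirectory shares a key with its domain; the subdomain does not.
$sameSite      = siteKey('http://foo.com/') === siteKey('http://foo.com/bar/');
$differentSite = siteKey('http://foo.com/') !== siteKey('http://bar.foo.com/');
```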
Old 07-15-2004, 09:12 PM   #9
bloodjelly
Purple Mole
 
Join Date: Dec 2003
Posts: 106
Got it. Thanks for the explanation.