PhpDig.net

druesome 10-19-2003 08:40 AM

Choosy about domains?
 
Hi, for the last few days I've been spidering without a single hitch, until today. The last website I tried to spider has a .ph domain, and I wonder if that could be the reason it could not be spidered. If you could try it out for me, the URL is http://www.birdwatch.ph

And lastly, I also noticed that when I spider a site hosted under Geocities, the site_url becomes www.geocities.com without the folder where the site really is (e.g. www.geocities.com/mysite). Is there a way around this? It may seem like a weird request, but I really need it to be this way because I'm working on a hack that will benefit from it. Thanks in advance!

Charter 10-19-2003 11:31 AM

Hi. What message did you get when you tried to crawl birdwatch.ph? Does setting PHPDIG_DEFAULT_INDEX to false in the config file have any effect?

druesome 10-19-2003 09:26 PM

I already tried that yesterday, but it didn't work. When I try to spider the site, it times out and it looks as if nothing has happened. When I refresh the admin page, the URL has been added to the list, but no page has been crawled.

Any ideas about my other question? Thanks.

bloodjelly 04-19-2004 05:47 PM

I'm actually curious about druesome's second question as well, and found this thread while searching for the answer, but there's no answer yet. Why does phpDig strip the folder name from a site's URL when it stores it? I just spidered http://gino.go-gaia.com/forum and it worked well, sticking to that directory, but in the admin panel the link has the forum directory removed. Sorry if this is an easy question, but can I make phpDig leave the format of the URL I spidered alone, so that if I spider http://gino.go-gaia.com/forum then that URL will be in the sites table? Thanks.

Charter 04-20-2004 12:49 PM

Hi. As for birdwatch.ph, what do you get onscreen when you uncomment //print $answer."<br>\n"; in the robot_functions.php file?

As for the admin index page, it shows only the site, domain, or subdomain, as the case may be. This is based on parse_url (see the code below). To view the directories/branches for a specific (sub)domain, just click the site and then click the update button.
PHP Code:

<?php

$link = "http://foo.domain.com/dir1/dir2/dir3/file.php?a=b&c=d#anchor";
print_r(parse_url($link));

/* start output
Array
(
    [scheme] => http
    [host] => foo.domain.com
    [path] => /dir1/dir2/dir3/file.php
    [query] => a=b&c=d
    [fragment] => anchor
)
end output */

// foo.domain.com gets stored as http://foo.domain.com/

?>
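
To see why a Geocities folder drops out of site_url, the same parse_url call can be applied to one of the URLs from this thread. This is only a minimal sketch, assuming PHP 7 or later; split_site_and_path is a hypothetical helper made up for illustration, not a phpDig function, and phpDig's own code may handle this differently.

```php
<?php
// Hypothetical helper (not part of phpDig): split an absolute URL into
// a site root (scheme + host) and the remaining directory path.
function split_site_and_path(string $url): array
{
    $parts = parse_url($url);
    // parse_url returns false on badly malformed input, and leaves out
    // 'scheme'/'host' when the URL is not absolute.
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }

    $site = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $site .= ':' . $parts['port'];
    }
    $site .= '/';

    // The folder lives in 'path', which is why it is absent from the site root.
    $path = ltrim($parts['path'] ?? '', '/');

    return ['site' => $site, 'path' => $path];
}

print_r(split_site_and_path('http://www.geocities.com/mysite/'));
// [site] => http://www.geocities.com/
// [path] => mysite/
?>
```

The folder is not lost; it simply ends up in the path component rather than in the host.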


bloodjelly 04-20-2004 04:52 PM

What if I wanted to store the directory information exactly as entered in the spider script in the spider's "sites" table? Or am I missing something?

Charter 04-20-2004 05:28 PM

Hi. To get a feel for how it works, look through the tables and see how the domain is stored in the sites table and path/file info is stored in the spider/tempspider/excludes tables, and then search the robot_functions.php file for the parse_url function.
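
If the domain and the path/file info are stored separately as described, the full URL can be put back together when it is needed. The following is a rough sketch only, assuming PHP 7 or later: full_url is a hypothetical helper, the three arguments stand in for whatever the tables actually hold, and index.php is an assumed file name used purely for illustration.

```php
<?php
// Hypothetical helper (not part of phpDig): rebuild a full URL from a
// site root and a separately stored path and file name.
function full_url(string $siteUrl, string $path, string $file): string
{
    // Normalize slashes so the pieces join with exactly one separator.
    $url = rtrim($siteUrl, '/') . '/';
    $path = trim($path, '/');
    if ($path !== '') {
        $url .= $path . '/';
    }
    return $url . ltrim($file, '/');
}

echo full_url('http://gino.go-gaia.com/', 'forum/', 'index.php'), "\n";
// prints http://gino.go-gaia.com/forum/index.php
?>
```

A hack that needs the directory alongside the domain could join the pieces this way instead of changing how the site itself is stored.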

bloodjelly 04-30-2004 12:39 AM

Thanks Charter - my host lost all MySQL for about a week (no explanation why) so I haven't been able to try this, but I will ASAP. Thanks for pointing me in the right direction.


All times are GMT -8.
