PhpDig.net

11-16-2006, 08:52 AM   #1
CentaurAtlas
Green Mole
 
Join Date: Nov 2006
Location: Florida
Posts: 11
wrapper/multiple spiders...

Since the forum won't let me reply in this thread, I'm posting here instead... ;-)
http://www.phpdig.net/forum/showthread.php?t=662

(I know that is an old thread, but I think people are still playing with the wrapper.php mod).

To answer some of the questions:

1. You should also comment out these two lines at the beginning of wrapper.php:
//show_source("wrapper.php");
//exit;

2. You put wrapper.php in the admin directory.

3. Log in to a shell (e.g. via ssh or telnet) and cd to that directory:
cd /path-to-phpdig/admin/


4. Then you should run:
php -f wrapper.php

5. From the command line, this command will show you what is running:

screen -list

6. The current wrapper code runs at most 6 spiders/threads at once:
$threads = 6;

Obviously you can increase it, but I don't know how much improvement you'd get.

I've only been playing with this for a couple of days, but all in all, the whole phpdig package is pretty cool. The two things that I am looking at improving are:
1. Number of spiders. With multiple spiders the speed increases, but it has to be reliable. So far wrapper.php seems to be doing OK in the reliability department. Where people reported that it re-queues sites, that could be fixed by having it check the last-spidered date before re-spidering. I'm still looking into whether I see that issue myself...
2. Modularization. The main code isn't very encapsulated, which makes it harder to understand and modify without lots of side effects. ;-)
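
The date check mentioned in item 1 could look something like the sketch below. This is not PhpDig code: the function name needsRespider, its parameters, and the 7-day interval are all made up for illustration, and the real fix would have to read the last-spidered date from wherever PhpDig stores it.

```php
<?php
// Hypothetical helper, not part of PhpDig: decide whether a URL is due
// for re-spidering, given the Unix timestamp of its last spidering.
function needsRespider($lastSpidered, $now, $minAgeSeconds)
{
    // Never spidered before: always queue it.
    if ($lastSpidered === null) {
        return true;
    }
    // Only re-queue once the minimum age has passed.
    return ($now - $lastSpidered) >= $minAgeSeconds;
}

// Usage: skip anything spidered within the last 7 days (assumed interval).
$sevenDays = 7 * 24 * 60 * 60;
$lastSpidered = time() - 3600; // pretend it was spidered an hour ago
if (!needsRespider($lastSpidered, time(), $sevenDays)) {
    echo "skip, spidered recently\n";
}
```

A wrapper could run a check like this before handing a site to a spider, so that a site finished by one spider is not immediately picked up again by another.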

It is great to have this as an option, and hopefully it will keep getting improved. ;-)

Last edited by CentaurAtlas; 11-16-2006 at 08:55 AM.

11-18-2006, 05:07 PM   #2
CentaurAtlas
Multiple spiders

I've been looking at the multiple spider issue. For me, the point of running multiple spiders is to spider more pages faster (obviously, I think).

After looking at the performance of the software, the slowness isn't in waiting for the pages, but in processing them.

The time goes into the code that processes the URLs found on each page. Perhaps everyone knew this, but it is useful to show where the bottleneck is.

In particular, if you time it you will see that the
foreach($urls as $lien) { ...
}

loop takes the majority of the time: around 50-60 seconds per page.
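
For anyone who wants to measure this themselves, here is a rough sketch of how a loop like that can be timed with microtime. The timeCall helper and the placeholder work inside the loop are made up for illustration; the real loop body in PhpDig is not shown here.

```php
<?php
// Hypothetical timing helper: run a callable, print how long it took,
// and return array(result, elapsed seconds).
function timeCall($label, $fn)
{
    $start = microtime(true);
    $result = call_user_func($fn);
    $elapsed = microtime(true) - $start;
    printf("%s took %.3f s\n", $label, $elapsed);
    return array($result, $elapsed);
}

// Stand-in for the per-URL work done inside the foreach loop.
$urls = array('http://example.com/a', 'http://example.com/b');
foreach ($urls as $lien) {
    list($length, $seconds) = timeCall($lien, function () use ($lien) {
        return strlen($lien); // placeholder for the real processing
    });
}
```

Wrapping individual calls inside the loop the same way narrows down which one is eating the time.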

In addition to (or instead of) throwing 50 or 60 spiders at the problem, improving the performance of this section would help indexing performance greatly.

More in a bit!

11-18-2006, 05:25 PM   #3
CentaurAtlas
p.s. the loop I was referring to is in the spider.php file, in case that wasn't clear.

11-19-2006, 09:10 AM   #4
CentaurAtlas
Here is some more information. The culprit is in the phpdigDetectDir routine inside robot_functions.php. For some pages it runs for about 20-60 seconds when it calls phpdigTestUrl, which in turn calls fgets.

The slow part is the fgets call. It is really strange for it to be that slow on a fast connection (I tried 3 different fast connections: two 100 Mbit connections and one cable modem).

So, it is something with the network call after all. I wrote some code in a different language that will grab a couple of pages per second.

Now I'm wondering whether something is throttling back the connection for a robot, because it should not be that slow.

11-19-2006, 10:24 AM   #5
CentaurAtlas
I'm seeing this in 1.8.8 (which I'm using because mb_eregi isn't working for me), but the code there looks the same in 1.8.9rc1.

The odd thing is that the delay does not occur in phpdigGetUrl; in my experiments, pages are fetched very quickly there.

I am wondering if it is a time-out issue in here for some reason.

I don't see it with all sites, but only with some (e.g. www.dmoz.com)

Calling stream_set_timeout with a 5-second value before that fgets call seems to be helping for the moment.
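
To show the idea in isolation, here is a minimal sketch of a read with a timeout. It is not the actual phpdigTestUrl code: readLineWithTimeout is a made-up helper, and the demo uses a local socket pair (Unix-like systems only) instead of a real HTTP connection so it runs without a network. In the spider, the stream would presumably be the socket that phpdigTestUrl opens to the remote server.

```php
<?php
// Hypothetical helper: read one line from a socket stream, giving up
// after $seconds. Returns array(line or false, timed-out flag).
function readLineWithTimeout($stream, $seconds)
{
    stream_set_timeout($stream, $seconds);
    $line = fgets($stream, 4096);
    // fgets returns false on a timeout; the metadata says why it stopped.
    $meta = stream_get_meta_data($stream);
    return array($line, $meta['timed_out']);
}

// Demo on a local socket pair so no network access is needed.
$pair = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);
fwrite($pair[0], "HTTP/1.1 200 OK\r\n");

// Data is waiting, so this returns the status line right away.
list($line, $timedOut) = readLineWithTimeout($pair[1], 5);

// Nothing more is written, so this read gives up after about 1 second.
list($line2, $timedOut2) = readLineWithTimeout($pair[1], 1);

fclose($pair[0]);
fclose($pair[1]);
```

Checking the timed_out flag after the read lets the caller tell a slow or stalled server apart from a normal end of data, and move on to the next URL instead of waiting out the default timeout.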

11-19-2006, 01:07 PM   #6
CentaurAtlas
OK, it does the same thing in 1.8.9rc1. I downgraded 1.8.9rc1 to use the eregi calls from 1.8.8, so it is a kind of hybrid between 1.8.8 and 1.8.9rc1.

The odd thing is that this is the only place where I can see fgets using such a huge amount of time.