Old 02-14-2004, 04:15 AM   #1
obottek
Green Mole
 
Join Date: Sep 2003
Posts: 15
Forking when spidering

I spider a lot of sites with phpdig, which works pretty well. But sometimes it takes a really long time, especially when a 404 occurs (see my remark on that in the bugs/problems forum).

EDIT: Threads merged. See next post in this thread.

Also, of course, the spider is limited by the speed of the sites it crawls. So if a webserver is configured to cap the throughput per connection, you can end up waiting ages for each response, over and over again for every page.

An idea would be to fork the spider process. Instead of one spider process working through all items one after the other, it would be great to run multiple spider processes, each of which picks the next available site/page/document and spiders it. This would dramatically increase the speed when spidering multiple sites; see the rough sketch below.
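Just to illustrate what I mean, here is a minimal sketch of how such forking could look, assuming PHP's pcntl extension on the command line. get_next_pending_site() and spider_site() are made-up placeholders standing in for phpdig's own queue and spider logic, not actual phpdig functions:

<?php
// Hypothetical sketch, not phpdig code: fork a few worker processes that
// each pull the next pending site from a shared queue and spider it.
// Requires the pcntl extension (CLI only).

$workers = 4;                  // number of parallel spider processes
$pids = array();

for ($i = 0; $i < $workers; $i++) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die("fork failed\n");
    } elseif ($pid == 0) {
        // child: keep spidering until the queue is empty
        while (($site = get_next_pending_site()) !== false) {
            spider_site($site);   // placeholder for the real spider logic
        }
        exit(0);
    }
    $pids[] = $pid;              // parent: remember the child PID
}

// parent: wait for all workers to finish
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}
?>

Each worker would probably also need to open its own database connection, since forked children inherit the parent's MySQL link and sharing a single connection between processes will break.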

I don't know what happens when I start a second spider process on the command line, so maybe this is already possible. Any ideas or details on that?

Greetings,
Olaf