06-30-2004, 10:37 AM   #1
jdc32
Green Mole
 
Join Date: Jun 2004
Posts: 11
Automatic spider

hi there,

i want to automate adding new links to the search engine. has anyone else played with this idea?

setup:
i created a table in which every new link gets stored (a lot of links). i call this table my linkspool.

then i have a cron job running every 3 minutes which checks whether a new job (link) is in the spool table. if there is one, the script locks the link and spiders it. after spidering, the script deletes the link from the spool. finished!
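
roughly like this -- just a sketch, the linkspool table and its columns are my own names, not part of phpdig:

<?php
// grab one unlocked link from the spool, spider it, then remove it.
// table/column names (linkspool, id, url, locked, locked_at) are my own.
mysql_connect('localhost', 'user', 'pass');  // your db credentials
mysql_select_db('phpdig');

// lock the table so two cron runs can't grab the same link
mysql_query("LOCK TABLES linkspool WRITE");
$res = mysql_query("SELECT id, url FROM linkspool WHERE locked = 0 LIMIT 1");
$row = mysql_fetch_assoc($res);
if ($row) {
    mysql_query("UPDATE linkspool SET locked = 1, locked_at = NOW()
                 WHERE id = " . (int)$row['id']);
}
mysql_query("UNLOCK TABLES");

if ($row) {
    // hand the url to the phpdig spider, then delete the spool entry
    system('php -f spider.php ' . escapeshellarg($row['url']));
    mysql_query("DELETE FROM linkspool WHERE id = " . (int)$row['id']);
}
?>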

but i have two problems!!!!

first:
if a spider run lasts longer than 3 minutes, the next cron run takes the next link from the spool and starts another spider... that's okay... the script checks how many spiders are already running, and if there are more than 5, it exits and waits until a slot is free.
but this doesn't really work well. how can i check with php how many php spider processes are open??????????

second:
with the cron, the spider machine runs and runs and runs..... but if a spider job gets stuck for any reason, it blocks a slot.
how can i kill via php a spider php pid that is older than 20 minutes, and how do i then kick the link out of the search engine db?

sorry for my bad english

jdc
06-30-2004, 11:48 AM   #2
bloodjelly
Purple Mole
 
Join Date: Dec 2003
Posts: 106
Hi jdc -

If you have a main script (the one that looks at the linkspool and runs spider processes), keeping track of the number of spiders is easy. Just increment a counter every time a spider is called, and when your counter variable reaches 5, you can sleep the script for a period of time and then check again.

To kill the process, check out this thread: http://www.phpdig.net/showthread.php...&highlight=PID

But instead of using a CRON job, you could use exec() or system() commands through PHP.
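
Or, instead of a bare counter, you can just count the running spider processes with ps -- something like this sketch (get_next_link_from_spool() is a placeholder for your spool-table query, and the paths are assumptions):

<?php
// main script: pull links from the spool, keep at most 5 spiders running.
function count_running_spiders() {
    // count live "php -f spider.php" processes
    // (grep -v grep keeps the grep itself out of the count)
    exec("ps -ef | grep 'php -f spider.php' | grep -v grep", $out);
    return count($out);
}

function get_next_link_from_spool() {
    // placeholder: replace with your spool-table query from the first post;
    // return false when the spool is empty
    return false;
}

while ($url = get_next_link_from_spool()) {
    while (count_running_spiders() >= 5) {
        sleep(30);  // all 5 slots busy -- wait and check again
    }
    // launch the spider in the background so this loop can keep going
    exec('php -f spider.php ' . escapeshellarg($url) . ' > /dev/null 2>&1 &');
}
?>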
07-01-2004, 01:08 AM   #3
jdc32
Green Mole
 
Join Date: Jun 2004
Posts: 11
okay, that's cool,
but with the cron i can kill the spider, only the link the spider was working on is still locked in the db. i need a search and destroy session

after killing the spider, how can i pass a parameter (e.g. the site_id) to another script that deletes all db entries for this link?
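
i'm imagining something like this -- cleanup.php is my own invention, and the phpdig table names below are guesses that would need checking against the real schema:

<?php
// cleanup.php -- run as: php -f cleanup.php <site_id>
// removes the locked spool entry and whatever the dead spider
// already wrote for this site. table names are guesses!
if ($argc < 2) die("usage: php -f cleanup.php <site_id>\n");
$site_id = (int)$argv[1];

mysql_connect('localhost', 'user', 'pass');  // your db credentials
mysql_select_db('phpdig');

// unlock/remove the job in my own spool table
mysql_query("DELETE FROM linkspool WHERE site_id = $site_id");

// drop the half-finished index rows (adjust to your phpdig schema)
mysql_query("DELETE FROM spider WHERE site_id = $site_id");
mysql_query("DELETE FROM sites WHERE site_id = $site_id");
?>

the killer script would then just call exec('php -f cleanup.php ' . $site_id); after the kill.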

thx
07-01-2004, 01:17 AM   #4
jdc32
Green Mole
 
Join Date: Jun 2004
Posts: 11
hmmm,.... after thinking about it, this cron is not really good for my problem:

10 * * * * ps -ef | grep 'php -f spider.php' | awk '{print $2}' | xargs kill -9


i start a new spider every 3 minutes and the kill should come after 10 minutes,... so there is more than 1 spider running at a time... this cron kills all my spiders,.. that's no good.

can i kill via shell only the php spiders that have been running for more than 10 minutes?
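
maybe something like this could work instead -- a sketch that parses the etime column of ps; the cleanup call at the end is the hypothetical script from my last post. (btw, 10 * * * * fires once per hour at minute 10; */10 * * * * would be every 10 minutes.)

<?php
// kill-old-spiders.php -- kills spider.php processes running longer
// than 10 minutes. run this from cron instead of the blanket kill above.
$max_secs = 600;

exec("ps -eo pid,etime,args", $lines);
foreach ($lines as $line) {
    if (strpos($line, 'spider.php') === false) continue;
    if (!preg_match('/^\s*(\d+)\s+(\S+)/', $line, $m)) continue;
    $pid = (int)$m[1];

    // etime looks like [[dd-]hh:]mm:ss -- fold it into seconds.
    // the dd- part gets over-weighted here, but anything days old
    // is long past 10 minutes anyway.
    $secs = 0;
    foreach (preg_split('/[-:]/', $m[2]) as $part) {
        $secs = $secs * 60 + (int)$part;
    }

    if ($secs > $max_secs) {
        exec("kill -9 $pid");
        // then call the cleanup script for the link this pid was on,
        // e.g. exec('php -f cleanup.php ' . $site_id); -- you'd need a
        // pid -> site_id mapping, e.g. store the pid in the spool row
        // when you start the spider.
    }
}
?>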