Hi all,
I have finished a little script for the spider process.
It spiders a site, forking as it goes. It is experimental code only.
So if someone wants to improve it... let me know!
The base code (the part that processes the URL) is taken from charter's spider.php. My code only wraps that base code.
Currently the program forks up to 20 times while spidering, and it needs a site to already be present in the db.
So: no records in the tempspider table, no spidering...
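The "fork up to 20 times" behaviour could look something like the following minimal sketch. This is NOT the actual code from spider_fork.php, just an illustration of capping the number of worker children with pcntl; the MAX_CHILDREN constant and the URL list are my own invention, and the real per-URL work (charter's spider.php logic) is left as a comment.

```php
<?php
// Hypothetical sketch: spawn at most MAX_CHILDREN workers at once,
// each handling one URL, and reap them with pcntl_wait().
define('MAX_CHILDREN', 20); // the post says it forks up to 20 times

$urls = ['http://www.example.com/a', 'http://www.example.com/b']; // placeholder queue
$children = [];

foreach ($urls as $url) {
    if (count($children) >= MAX_CHILDREN) {
        // At the cap: wait for one worker to finish before forking again.
        $pid = pcntl_wait($status);
        unset($children[$pid]);
    }
    $pid = pcntl_fork();
    if ($pid === 0) {
        // Child process: spider one URL, then exit so the parent can reap it.
        // ... here would go the base code taken from spider.php ...
        exit(0);
    }
    $children[$pid] = $url; // parent remembers which child handles which URL
}

// Reap any workers still running before the parent exits.
while (count($children) > 0) {
    $pid = pcntl_wait($status);
    unset($children[$pid]);
}
echo "done\n";
```

Note this needs the pcntl extension, which is only available to the PHP CLI, so it fits the `php -f` invocation shown below.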
It doesn't implement the same logic as spider.php, for several reasons:
1: it doesn't lock the site being spidered.
2: it uses one MySQL connection in the parent process and separate connections in the children (see the comments) to avoid the "Lost connection" problem.
3: it has a very basic clean-up routine called every 30 minutes; this is needed because duplicate inserts into the temporary table are still possible. I have no idea how to avoid this... suggestions are very much appreciated!
4: NOT ALL of the fantastic features that make phpdig what it is are implemented (no excludes, no robots.txt, no levels, no this, no that)... it is very far from being a good program!
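On point 2: the reason each child needs its own connection is that a MySQL handle opened before pcntl_fork() is shared by parent and children, and as soon as one process closes it (or exits), the others get "Lost connection to MySQL server". A minimal sketch of the parent/child split, assuming the mysqli extension and placeholder credentials and database name:

```php
<?php
// Sketch only: hostname, credentials and the 'phpdig' database name
// are placeholders, not taken from the actual script.
$parentDb = new mysqli('localhost', 'user', 'pass', 'phpdig'); // used by the parent ONLY

$pid = pcntl_fork();
if ($pid === 0) {
    // Child: never touch $parentDb. Open a private connection instead,
    // so closing it here cannot invalidate the parent's handle.
    $childDb = new mysqli('localhost', 'user', 'pass', 'phpdig');
    // ... run the spidering queries on $childDb ...
    $childDb->close();
    exit(0);
}

pcntl_waitpid($pid, $status);
$parentDb->query('SELECT 1'); // parent's handle is still valid here
```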
With this script I've indexed up to 1000 documents in 2 hours. It is very fast, and very dangerous too: it could fill up your disk
very quickly.
USE IT WITH CARE!!! IT LOOPS......
Place the file in your admin directory and rename it to spider_fork.php.
Create a new column in the tempspider table called HASH, of type timestamp or varchar(250).
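Assuming the stock phpdig tempspider table, the column can be added like this (the varchar(250) variant is shown; use TIMESTAMP instead if you prefer):

```sql
-- Add the HASH column spider_fork.php expects to the tempspider table.
ALTER TABLE tempspider ADD COLUMN hash VARCHAR(250);
```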
To run it, first set up the records in the database, then type from a shell:
php -f spider_fork.php
Enjoy and let me know what you think
Bye
Simone Capra
capra__nospam__@erweb.it
E.R.WEB - s.r.l.
http://www.erweb.it