03-26-2004, 07:50 AM | #1 |
Green Mole
Join Date: Jan 2004
Location: Italy
Posts: 11
|
Forking when spidering
Hi all,
I have completed a little script for the spider process. The code spiders a site, forking as it goes. It is only experimental code, so if someone wants to improve it... let me know! The base code (the part that processes the URL) is taken from charter's spider.php; my code only wraps that base code. The program forks up to 20 times while spidering, and it needs a site to already be present in the db. So: no records in the tempspider table, no spidering.
It doesn't implement the same logic as spider.php, for several reasons:
1. It doesn't lock the site that is being spidered.
2. It uses one connection to MySQL in the parent process and separate ones in the children (see the comments) to avoid the "LOST CONNECTION" problem.
3. It has a very basic clean-up routine, called every 30 minutes; this is because duplicate inserts into the temporary table are still possible. I have no idea how to avoid this... suggestions are very much appreciated!
4. NOT ALL the fantastic features that make phpdig what it is are implemented (no excludes, no robots.txt, no levels, no this, no that)... it is very far from being a good program!
With this script I've indexed up to 1000 documents in 2 hours. It is very fast, and very dangerous too: it could fill up your disk very quickly. USE IT WITH CARE!!! IT LOOPS......
To install it:
- Place the file in your admin directory and rename it to spider_fork.php.
- Create a new column in the tempspider table called HASH, of type timestamp or varchar(250).
- Set up the records in the database, then run from the shell: php -f spider_fork.php
Enjoy and let me know what you think.
Bye
Simone Capra
capra__nospam__@erweb.it
E.R.WEB - s.r.l.
http://www.erweb.it |
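A minimal sketch of the fork-per-URL structure described above, to make it concrete - this is not the actual spider_fork.php. The index_page() helper is a hypothetical stand-in for the part of charter's spider.php that processes a URL, and the connection details and queries are placeholder assumptions:

<?php
// Prerequisite from the post, as SQL:
//   ALTER TABLE tempspider ADD hash VARCHAR(250);

// Hypothetical stand-in for the base code taken from spider.php: fetch the
// page, index it, and queue newly found links back into tempspider.
function index_page($db, $url)
{
    @file_get_contents($url); // real work omitted to keep the sketch short
}

$max_children = 20; // the script forks up to 20 times
$children = 0;

// The parent keeps one MySQL connection to itself; every child opens its
// own connection instead, which avoids the "LOST CONNECTION" problem that
// appears when an exiting child tears down an inherited connection.
$parent_db = mysqli_connect('localhost', 'user', 'pass', 'phpdig');

while (true) {
    // No records in the tempspider table => no spidering.
    $res = mysqli_query($parent_db, 'SELECT id, url FROM tempspider LIMIT 1');
    $row = $res ? mysqli_fetch_assoc($res) : null;
    if (!$row) {
        break; // a real version would first wait for children still queueing links
    }
    mysqli_query($parent_db, 'DELETE FROM tempspider WHERE id=' . (int) $row['id']);

    if ($children >= $max_children) {
        pcntl_wait($status); // reap one finished child before forking again
        $children--;
    }

    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    }
    if ($pid === 0) {
        // Child process: open a fresh connection, never reuse the parent's.
        $child_db = mysqli_connect('localhost', 'user', 'pass', 'phpdig');
        index_page($child_db, $row['url']);
        mysqli_close($child_db);
        exit(0);
    }
    $children++; // parent keeps looping
}

while ($children-- > 0) {
    pcntl_wait($status); // collect the remaining children
}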
03-26-2004, 08:38 PM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. I've downloaded the code and will take a closer look when I get a chance. However, I've also removed the attachment, and here's why.
>> NOT ALL the fantastic features that make phpdig what it is are implemented (no excludes, no robots.txt, no levels, no this, no that)... it is very far from being a good program!
Users who run PhpDig on their own sites would likely account for personal preference and take care not to exceed their bandwidth allotment or server resources. Care should also be taken not to disregard someone else's preferences or adversely affect someone else's machine or pocketbook.
Also, the now-removed attachment may get PhpDig placed on bad-bot lists, especially because the user agent in that code points to the PhpDig.net robot information page, which says PhpDig should obey a robots.txt file.
This isn't a personal slam or anything like that. It's just a note to let PhpDig users know that, in order to keep PhpDig a benevolent and viable open source project, PhpDig and modifications thereto need to be as "Net Friendly" as they can be.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
03-27-2004, 02:59 AM | #3 |
Green Mole
Join Date: Jan 2004
Location: Italy
Posts: 11
|
Well charter!
I totally agree with you. Perhaps you might share it with the ones who are really interested in forking and in developing this technology. I'm running it, and it is very dangerous: it downloads everything and does not use bandwidth in a "clever" manner.
>USE IT WITH CARE!!! IT LOOPS......
could be a problem for end users who don't really know what it means. Actually the code loops on every site. But I think it is a good start to show that IT COULD BE A WAY to speed up the spider and use all your resources (I'm thinking of a company environment: my resources are what I spent to buy them :-)). Anyway: I was looking for an efficient way to use ALL the bandwidth and ALL the processor. :-)
Personally, I think I owe something to phpdig, and I really don't take it as a "personal slam": the GPL says to share modifications, and what is done with the modifications is up to the project owner. Well, this is my personal mod; it's very dangerous, but it works, and it is not so complicated. So, people out there who want to fork when spidering: you can get my code by asking charter.
I FORGOT: php must be compiled with --enable-pcntl for forking to function. If I could, I would make all this suitable for end users, but time is what I don't have, and not everybody can recompile php to have the pcntl functions. Sharing the code with you is what I wanted, and that's what I got.
Regards and let me know something!!!!
Simone Capra
capra__nospam__@erweb.it
E.R.WEB - s.r.l.
http://www.erweb.it |
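Since pcntl_fork() only exists when PHP is built with pcntl, a guard at the top of the script can fail early instead of crashing mid-run. A minimal sketch, with placeholder error messages:

<?php
// pcntl is a CLI-only extension, which matches running the script with
// "php -f spider_fork.php"; refuse to run anywhere else.
if (PHP_SAPI !== 'cli') {
    die("spider_fork.php must be run from the command line\n");
}
// Builds without --enable-pcntl simply don't define pcntl_fork().
if (!function_exists('pcntl_fork')) {
    die("this PHP was built without --enable-pcntl; forking is unavailable\n");
}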
09-27-2004, 01:36 PM | #4 |
Green Mole
Join Date: Sep 2004
Posts: 5
|
I need this script...
I run phpDig on my internal network to search documents within our organisation. I could really do with a way of speeding up the spider, as it currently takes over 30 hours to index all of the available pages.
Could you post this change please? |
09-28-2004, 05:30 AM | #5 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
The file should not be redistributed for reasons already mentioned.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
09-29-2004, 01:53 AM | #6 |
Green Mole
Join Date: Sep 2004
Posts: 5
|
Sorry Charter, but that simply isn't good enough. This is an open source project, yet you are censoring other people's work because it doesn't fit in with your ideals, regardless of how it may actually benefit some of us.
From where I'm standing, this runs contrary to the whole point of open source, and I'm very disappointed, not to mention put out, by your stance. Last edited by rockyourbody; 09-29-2004 at 01:56 AM. Reason: Spelling |
09-29-2004, 08:18 AM | #7 | |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
Quote:
Originally Posted by rockyourbody
This is an open source project, yet you are censoring other people's work because it doesn't fit in with your ideals...
Any misuse of phpdig - and consuming mass quantities of bandwidth on someone else's server would clearly be a misuse of this software - would reflect badly on phpdig. Do you really want to go there? You don't have to care about phpdig's reputation, but Charter does. If there's a chance that reputation could be tarnished, then I think it would be a dangerous thing to allow this mod to be redistributed. Just my $0.02. |
|
09-29-2004, 09:21 AM | #8 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Yes, vinyl-junkie, it's what you state in your post, plus, right from the GNU GPL FAQs, the GPL does not require anyone to distribute the modifications they make...
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
09-29-2004, 10:22 AM | #9 |
Green Mole
Join Date: Sep 2004
Posts: 5
|
Well, I'm unhappy.
|
09-30-2004, 06:00 AM | #10 |
Green Mole
Join Date: Jan 2004
Location: Italy
Posts: 11
|
Well, since the mod is mine...
I choose not to distribute it, basically to respect charter's work. The mod is very powerful - it is not looping anymore - but it still has a lot of problems, like having to read robots.txt each time it fetches a page in order to respect the standard.
So, charter has her own copy; if she likes, she can do whatever she wants with it.
Regards and many thanks to charter,
Simone Capra |
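To give an idea of what reading robots.txt on each fetch involves, here is a deliberately naive sketch; robots_allows() is an illustration, not code from the mod, and a real spider would cache the file per host and match its own user agent rather than only "*":

<?php
// Fetch http://host/robots.txt and test whether the URL's path falls under
// a "Disallow:" prefix in the "User-agent: *" section.
function robots_allows($url)
{
    $parts = parse_url($url);
    $robots = @file_get_contents($parts['scheme'] . '://' . $parts['host'] . '/robots.txt');
    if ($robots === false) {
        return true; // no robots.txt means nothing is disallowed
    }
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $applies = false; // are we inside a "User-agent: *" section?
    foreach (preg_split('/\r?\n/', $robots) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if (preg_match('/^User-agent:\s*(.+)$/i', $line, $m)) {
            $applies = (trim($m[1]) === '*');
        } elseif ($applies && preg_match('/^Disallow:\s*(\S+)/i', $line, $m)) {
            if (strpos($path, $m[1]) === 0) {
                return false; // path is under a disallowed prefix
            }
        }
    }
    return true;
}

// Usage inside the spider loop: skip URLs the host disallows.
// if (!robots_allows($url)) { continue; }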
10-01-2004, 12:56 AM | #11 |
Green Mole
Join Date: Sep 2004
Posts: 5
|
Such a waste. Killed by the GPL nazis.
|
10-12-2004, 10:40 PM | #12 |
Green Mole
Join Date: Oct 2004
Posts: 4
|
So is there no way to fix the code to respect robots.txt?
I'm just wondering. |
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
Forking | jmitchell | How-to Forum | 2 | 01-18-2005 09:58 AM |
Forking when spidering | obottek | Mod Requests | 5 | 03-13-2004 12:38 PM |