PhpDig.net

Old 12-28-2003, 01:50 PM   #1
i_am_cam
Green Mole
 
Join Date: Dec 2003
Posts: 6
Spider Problem

Hey, I'm having a problem trying to spider my site. I've read through the other forum topics that match my symptoms, but while those mostly blame safe_mode being on, my host has it off.

Basically, when I run spider.php it starts indexing as I'd expect but then hangs after about 10 seconds, with IE displaying 'Done' in the status bar. The admin index page says the site is locked because the spider is still running, but nothing else is added.

I downloaded the latest version of PhpDig (1.6.5, I believe) and am on PHP version 4.2.3 (to see my PHP settings, if it helps, look here).

The work the spider manages before it hangs is what I'd expect: I can search the pages it has indexed and am pleased with the results. I just need the whole site done!

The search results page can be found here.

Thanks for any help

--Edit--

I've just added this screenshot of what happens while running spider.php, in case this is of any help:

Spider.php Screenshot

--2nd Edit--

And I thought I'd add my robots.txt file too:

User-agent: PhpDig
Disallow: /forum
Disallow: /phpMyAdmin
Disallow: /sql
Disallow: /templates
Disallow: /templates_c
Allow: /forum/index.php

User-agent: *
Disallow: /

Last edited by i_am_cam; 12-28-2003 at 02:07 PM.
Old 12-28-2003, 03:27 PM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. PhpDig is restrictive when it parses a robots.txt file. Try applying the code in this thread and then set the robots.txt file as so:
Code:
User-agent: PhpDig
Disallow:

User-agent: *
Disallow: /
After a crawl, you can delete/exclude directories from the admin panel. Also, does the hang always happen, and what entries are in the tempspider table?
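
For comparison, here is a minimal sketch of how a generic robots.txt parser reads the two files in this thread. It uses Python's standard `urllib.robotparser`, not PhpDig's own parser, so it only illustrates the rules themselves and says nothing definite about PhpDig's behaviour. The `example.com` host and the `allowed` helper are placeholders for illustration. Python's parser applies the first matching rule, so under it the trailing `Allow: /forum/index.php` in the original file never takes effect.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the first post.
ORIGINAL = """\
User-agent: PhpDig
Disallow: /forum
Disallow: /phpMyAdmin
Disallow: /sql
Disallow: /templates
Disallow: /templates_c
Allow: /forum/index.php

User-agent: *
Disallow: /
"""

# The simplified robots.txt suggested in the reply.
SUGGESTED = """\
User-agent: PhpDig
Disallow:

User-agent: *
Disallow: /
"""


def allowed(robots_txt, agent, path):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # example.com is a placeholder host; only the path matters for the rules.
    return parser.can_fetch(agent, "http://example.com" + path)


if __name__ == "__main__":
    for name, text in (("original", ORIGINAL), ("suggested", SUGGESTED)):
        for path in ("/archive.php", "/forum/index.php", "/sql/dump.php"):
            print(name, path, allowed(text, "PhpDig", path))
```

With the suggested file, everything is open to the `PhpDig` agent and closed to every other agent, which leaves directory exclusion to the admin panel after the crawl.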
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 12-28-2003, 03:39 PM   #3
i_am_cam
Quote:
Originally posted by Charter
Hi. PhpDig is restrictive when it parses a robots.txt file. Try applying the code in this thread and then set the robots.txt file as so:
Code:
User-agent: PhpDig
Disallow:

User-agent: *
Disallow: /
After a crawl, you can delete/exclude directories from the admin panel. Also, does the hang always happen, and what entries are in the tempspider table?
Hi, firstly thanks for the speedy reply!

I've changed the code as suggested in the thread you linked to and modified the robots.txt file as you said, and am getting the same problem each time .. namely that spider.php freezes during the indexing process and locks the site while not indexing any further. I should also mention I have tried completely removing the robots.txt file with no success.

As for the tempspider table, here are the phpMyAdmin dumps in CSV and XML.

Last edited by i_am_cam; 12-28-2003 at 03:42 PM.
Old 12-28-2003, 03:56 PM   #4
Charter
Hi. Adding css to the FORBIDDEN_EXTENSIONS in the config file should prevent errors from appearing in the tempspider table. Anyway, this seems like a timeout issue. What is the time limit in httpd.conf?
Old 12-28-2003, 04:05 PM   #5
i_am_cam
OK, I've changed the config line to

Quote:
define('FORBIDDEN_EXTENSIONS','\.(gz|z|tar|css|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$');
to try and eliminate the .css files being indexed and causing errors.
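
As a rough check of what that pattern covers, the sketch below runs the same regular expression over a few URLs. It is written in Python rather than PHP, and the thread does not show how PhpDig applies the constant (which function, or whether matching is case-insensitive), so treat it as an illustration of the pattern only. The `is_forbidden` helper and the `example.com` URLs are made up for the example.

```python
import re

# Pattern copied from the config line quoted above (with css added).
FORBIDDEN_EXTENSIONS = (
    r'\.(gz|z|tar|css|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$'
)


def is_forbidden(url):
    """Return True if the URL ends in one of the listed extensions."""
    return re.search(FORBIDDEN_EXTENSIONS, url) is not None


if __name__ == "__main__":
    samples = (
        "http://example.com/style.css",
        "http://example.com/backup.tar.gz",
        "http://example.com/archive.php?start=21&sort=&mode=&size=",
        "http://example.com/style.css?v=2",
    )
    for url in samples:
        print(url, is_forbidden(url))
```

Because the pattern is anchored with `$`, a stylesheet URL that carries a query string, such as `style.css?v=2`, does not match it.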

As for httpd.conf, I don't have access to this file on my host :/

My max_execution_time is set to 50000, if this helps.

Last edited by i_am_cam; 12-28-2003 at 04:23 PM.
Old 12-28-2003, 05:45 PM   #6
Charter
Hi. What happens if you try to crawl using a different browser?
Old 12-29-2003, 03:04 AM   #7
i_am_cam
Quote:
Originally posted by Charter
Hi. What happens if you try to crawl using a different browser?
Okay, I've just tried crawling in Mozilla 1.5 and Firebird 0.7 (originally I was using IE) and the end result while running spider.php is the same: it freezes and locks the site, but doesn't actually index any further.
Old 12-29-2003, 06:47 AM   #8
Charter
Hi. I'm thinking that this is a timeout issue with your PHP being in CGI mode. The max_execution_time says 50000 but it seems like the timeout is 30 seconds. What errors, if any, are showing in your PHP error log?
Old 12-29-2003, 07:27 AM   #9
i_am_cam
I'm afraid that log_errors in php.ini is set to Off by my host, and I don't have access to php.ini in order to change this.
Old 12-29-2003, 08:04 AM   #10
Charter
Hi. I crawled your site using PHP with Server API as Apache and as CGI. Both were successful and below is the output using PHP in CGI mode. Maybe your host can shed some light on this issue as I'm not sure of the problem.

Spidering in progress...

--------------------------------------------------------------------------------
SITE : http://liveinglasgow.com/
Exclude paths :
- @NONE@
1:http://liveinglasgow.com/archive.php
(time : 00:00:07)
+ + + + + + +
level 1...
2:http://liveinglasgow.com/privacy.php
(time : 00:00:14)
+
3:http://liveinglasgow.com/archive.php?start=21&sort=&mode=&size=
(time : 00:00:19)
+ + + + + +
4:http://liveinglasgow.com/archive.php?start=&sort=date&mode=desc&size=
(time : 00:00:25)

5:http://liveinglasgow.com/archive.php?start=&sort=date&mode=asc&size=
(time : 00:00:29)

6:http://liveinglasgow.com/archive.php?start=&sort=title&mode=desc&size=
(time : 00:00:33)

7:http://liveinglasgow.com/archive.php?start=&sort=title&mode=asc&size=
(time : 00:00:38)

8:http://liveinglasgow.com/index.php
(time : 00:00:42)

level 2...
9:http://liveinglasgow.com/archive.php?start=41&sort=&mode=&size=
(time : 00:00:46)
+ + + +
10:http://liveinglasgow.com/archive.php?start=1&sort=&mode=&size=
(time : 00:00:52)
+ + + +
11:http://liveinglasgow.com/archive.php?start=21&sort=date&mode=desc&size=
(time : 00:00:58)

12:http://liveinglasgow.com/archive.php?start=21&sort=date&mode=asc&size=
(time : 00:01:04)

13:http://liveinglasgow.com/archive.php?start=21&sort=title&mode=desc&size=
(time : 00:01:10)

14:http://liveinglasgow.com/archive.php?start=21&sort=title&mode=asc&size=
(time : 00:01:15)

15:http://liveinglasgow.com/
(time : 00:01:17)

level 3...
16:http://liveinglasgow.com/archive.php?start=41&sort=title&mode=desc&size=
(time : 00:01:20)

17:http://liveinglasgow.com/archive.php?start=41&sort=title&mode=asc&size=
(time : 00:01:24)

18:http://liveinglasgow.com/archive.php?start=41&sort=date&mode=asc&size=
(time : 00:01:28)

19:http://liveinglasgow.com/archive.php?start=41&sort=date&mode=desc&size=
(time : 00:01:32)

20:http://liveinglasgow.com/archive.php?start=1&sort=title&mode=asc&size=
(time : 00:01:35)

21:http://liveinglasgow.com/archive.php?start=1&sort=title&mode=desc&size=
(time : 00:01:39)

22:http://liveinglasgow.com/archive.php?start=1&sort=date&mode=asc&size=
(time : 00:01:44)

23:http://liveinglasgow.com/archive.php?start=1&sort=date&mode=desc&size=
(time : 00:01:48)

No link in temporary table

--------------------------------------------------------------------------------

links found : 23
http://liveinglasgow.com/archive.php
http://liveinglasgow.com/privacy.php
http://liveinglasgow.com/archive.php?start=21&sort=&mode=&size=
http://liveinglasgow.com/archive.php?start=&sort=date&mode=desc&size=
http://liveinglasgow.com/archive.php?start=&sort=date&mode=asc&size=
http://liveinglasgow.com/archive.php?start=&sort=title&mode=desc&size=
http://liveinglasgow.com/archive.php?start=&sort=title&mode=asc&size=
http://liveinglasgow.com/index.php
http://liveinglasgow.com/archive.php?start=41&sort=&mode=&size=
http://liveinglasgow.com/archive.php?start=1&sort=&mode=&size=
http://liveinglasgow.com/archive.php?start=21&sort=date&mode=desc&size=
http://liveinglasgow.com/archive.php?start=21&sort=date&mode=asc&size=
http://liveinglasgow.com/archive.php?start=21&sort=title&mode=desc&size=
http://liveinglasgow.com/archive.php?start=21&sort=title&mode=asc&size=
http://liveinglasgow.com/
http://liveinglasgow.com/archive.php?start=41&sort=title&mode=desc&size=
http://liveinglasgow.com/archive.php?start=41&sort=title&mode=asc&size=
http://liveinglasgow.com/archive.php?start=41&sort=date&mode=asc&size=
http://liveinglasgow.com/archive.php?start=41&sort=date&mode=desc&size=
http://liveinglasgow.com/archive.php?start=1&sort=title&mode=asc&size=
http://liveinglasgow.com/archive.php?start=1&sort=title&mode=desc&size=
http://liveinglasgow.com/archive.php?start=1&sort=date&mode=asc&size=
http://liveinglasgow.com/archive.php?start=1&sort=date&mode=desc&size=
Optimizing tables...
Indexing complete !
--------------------------------------------------------------------------------
[Back] to admin interface.
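
To relate this output to the suspected timeout, the sketch below parses the `(time : hh:mm:ss)` markers from the first eight entries above and counts how many pages finish within a given limit. The 30-second limit is only the guess made earlier in the thread and 10 seconds is the hang time reported in the first post. These timings come from this successful run, so they may differ on the original host. The helper names are illustrative.

```python
import re

# First eight entries copied from the spider output above.
SPIDER_LOG = """\
1:http://liveinglasgow.com/archive.php
(time : 00:00:07)
2:http://liveinglasgow.com/privacy.php
(time : 00:00:14)
3:http://liveinglasgow.com/archive.php?start=21&sort=&mode=&size=
(time : 00:00:19)
4:http://liveinglasgow.com/archive.php?start=&sort=date&mode=desc&size=
(time : 00:00:25)
5:http://liveinglasgow.com/archive.php?start=&sort=date&mode=asc&size=
(time : 00:00:29)
6:http://liveinglasgow.com/archive.php?start=&sort=title&mode=desc&size=
(time : 00:00:33)
7:http://liveinglasgow.com/archive.php?start=&sort=title&mode=asc&size=
(time : 00:00:38)
8:http://liveinglasgow.com/index.php
(time : 00:00:42)
"""

TIME_RE = re.compile(r"\(time : (\d+):(\d+):(\d+)\)")


def elapsed_seconds(log):
    """Return the cumulative elapsed time, in seconds, logged after each page."""
    return [int(h) * 3600 + int(m) * 60 + int(s) for h, m, s in TIME_RE.findall(log)]


def pages_within(log, limit_seconds):
    """Count the pages whose logged time is at or under the limit."""
    return sum(1 for t in elapsed_seconds(log) if t <= limit_seconds)


if __name__ == "__main__":
    for limit in (10, 30):
        print(f"limit {limit}s: {pages_within(SPIDER_LOG, limit)} page(s) indexed")
```

On these timings, a hard 30-second cutoff would stop the crawl after five pages, and a 10-second one after the first.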
Old 12-29-2003, 08:07 AM   #11
i_am_cam
Okay, I'll email my host and start chatting to them about this, to see if they can help at all. Just out of curiosity: if I had SSH access to a shell on my host and could run this spidering script there, do you think that would work? Or would the same constraints apply as when executing it via my browser?

Thanks a lot for the help Charter, believe me it's very much appreciated

Cam
Old 12-29-2003, 08:45 AM   #12
Charter
Hi. My understanding is that program execution via SSH bypasses the web application server, but the program execution would still be subject to the PHP configuration itself.
All times are GMT -8.

