03-10-2004, 01:56 PM | #1 |
Orange Mole
Join Date: Feb 2004
Posts: 47
|
https phpdig strange failure
Digging a website with some https forms produces a strange failure:
Some forms on this website force SSL this way: PHP Code:
Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource in /search/admin/spider.php on line 485
Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource in /search/admin/spider.php on line 503
About 5 Apache processes, consuming about 80% of memory, remain for hours and spidering fails. Our workaround was: PHP Code:
Any ideas, anyone?
tomas
Last edited by tomas; 03-10-2004 at 02:19 PM. |
03-10-2004, 03:35 PM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource in /search/admin/spider.php on line 485
Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource in /search/admin/spider.php on line 503
Hi. When you echo the queries right before lines 485 and 503, what do you get?
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
03-10-2004, 03:47 PM | #3 |
Orange Mole
Join Date: Feb 2004
Posts: 47
|
Hello Charter,
there are no echoes before that. Watching the spider, I can see the script dig the links from the homepage and print the +++ as it finds links. Then, at the first link with an auto-redirect to SSL, the spider crashes with these messages. Tested on Debian/Fedora/Red Hat Enterprise with PHP 4.2.3/4.3.3/4.3.4 and Apache 1.3/2, as both CGI and module, so I'm sure it's not a platform-specific error. tomas |
03-10-2004, 03:53 PM | #4 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. If you stick the echo statements into the code, what does it echo for the two queries?
PHP Code:
03-10-2004, 03:56 PM | #5 |
Orange Mole
Join Date: Feb 2004
Posts: 47
|
These echoes are taken from the standard log output;
no extra echoes were inserted. t. |
03-10-2004, 03:59 PM | #6 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Right, but echo the queries and see what prints onscreen. The error "Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource" means that the query failed, so the query call returned false rather than a result resource. If you echo the two $query variables, then you might see how to fix it.
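As an illustration of the advice above, here is a minimal sketch of reporting a failed query together with its SQL text before any row count is attempted. The function name `describeQueryResult` and the table name `some_table` are hypothetical and not part of PhpDig; the helper takes the result and error text as plain arguments, so it does not depend on any particular database extension.

```php
<?php
// Hypothetical helper (not part of PhpDig): describe a query outcome so
// that a failed query is reported with its SQL text instead of being
// passed on to a row-count call.
function describeQueryResult($query, $result, $error = '')
{
    // A failed query yields false instead of a result resource; counting
    // rows on false is what triggers the "not a valid MySQL result
    // resource" warning.
    if ($result === false) {
        return 'FAILED: ' . $query . ' (' . $error . ')';
    }
    return 'OK: ' . $query;
}

// Example with a deliberately truncated query and a simulated failure.
echo describeQueryResult('SELECT id FROM some_table WHERE', false, 'syntax error'), "\n";
```

In the spider itself, the second and third arguments would come from whatever the query call and the error call return right before lines 485 and 503.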
03-10-2004, 04:06 PM | #7 |
Orange Mole
Join Date: Feb 2004
Posts: 47
|
OK, I will try it tomorrow,
because, as mentioned in my previous post, we patched our code on the live server of this website, so I have to set all this up again on one of the test machines and start running the spider. But for now it's 2 hours past midnight and I'll leave the office for today. Puuuh. By the way, I sent you an email :-) tomas |
03-10-2004, 04:19 PM | #8 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Okay, but one thing to check is to see if the following code is still in the spider.php file, right before the "//is this link already in temp table" comment:
PHP Code:
03-12-2004, 04:11 PM | #9 |
Orange Mole
Join Date: Feb 2004
Posts: 47
|
Hi Charter,
sorry for the delay. I tried inserting the echoes, but they caught nothing: all the SQL looks good. At the last link right before the first of the 'ssl-ones', however, the echoing stops and the server starts getting hot (about 80% mem and CPU, constant). Funny thing, and I have no idea. Do you? :-) tomas |
03-12-2004, 05:53 PM | #10 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. I think I know what's going on. I don't believe PhpDig was originally written to crawl https, so the header redirect sends http to https, PhpDig does its thing and https goes back to http, then the redirect sends it back to https, et cetera. I think it's a loop.
If you search for http in the spider.php and robot_functions.php files, and maybe other files, you'll see that some code needs to change in order to account for https links. I'm not going to post a bunch of trial and error code now, but will instead work on it for inclusion in another release. In the meantime, try sending the PhpDig robot a 403 when https is encountered based on user agent or something. Not tested, but if you want to crawl https one at a time via command line, the following might be okay: In spider.php change: PHP Code:
PHP Code:
Code:
prompt> php -f spider.php https://www.domain.com
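The suspected http/https ping-pong can be illustrated with a small simulation. This is only a sketch under stated assumptions, not PhpDig code: `followRedirects` is a hypothetical function, and the redirect table stands in for what the server and the crawler would do between them. It stops as soon as a URL repeats or a hop limit is reached, instead of spinning for hours.

```php
<?php
// Hypothetical sketch (not PhpDig code): follow redirects through a lookup
// table and stop when a URL repeats or a hop limit is reached.
function followRedirects(array $redirects, $url, $maxHops = 5)
{
    $seen = array();
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        if (isset($seen[$url])) {
            // Same URL visited twice: this is a redirect loop.
            return array('status' => 'loop', 'url' => $url);
        }
        $seen[$url] = true;
        if (!isset($redirects[$url])) {
            // No further redirect: this is the final URL.
            return array('status' => 'ok', 'url' => $url);
        }
        $url = $redirects[$url];
    }
    return array('status' => 'too_many_hops', 'url' => $url);
}

// Simulated behaviour: the site forces https, while the crawler rewrites
// https back to http, as suspected in this thread.
$loop = array(
    'http://www.domain.com/form'  => 'https://www.domain.com/form',
    'https://www.domain.com/form' => 'http://www.domain.com/form',
);
$result = followRedirects($loop, 'http://www.domain.com/form');
echo $result['status'], "\n";
```

A guard of this kind would not make https crawling work, but it would turn a runaway process into a reportable condition.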
09-21-2004, 08:36 PM | #11 |
Green Mole
Join Date: Sep 2004
Posts: 2
|
HTTP vs HTTPS indexing
Hi
This seems to be the appropriate thread for a follow up on https indexing. Here's where I am at and what I have done trying to get phpdig to function on my server.
Config: Redhat based e-smith server with Apache
Symptoms: test mode is
php -f spider.php https://my.server.com
results in
4094: old priority 0, new priority 18
Spidering in progress...
-----------------------------
SITE : https://my.server.com/
Exclude paths :
- @NONE@
No link in temporary table
links found : 0
...Was recently indexed
Optimizing tables...
Indexing complete !
/var/log/httpd/error_log shows group of 4 errors for each spider attempt
[Wed Sep 22 09:32:12 2004] [error] [client 192.168.85.33] File does not exist: /web/server/root/html/robots.txt
[Wed Sep 22 09:32:12 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD / HTTPS/1.1
[Wed Sep 22 09:32:12 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD / HTTPS/1.1
[Wed Sep 22 09:32:12 2004] [error] [client 192.168.85.33] File does not exist: /web/server/root/html/robots.txt
From the rest of this thread I started looking at how it works. In my server the site is forced to https and is a virtual host structure within apache. The first issue that I can see is that the robots.txt file is being looked for within the root of the server web sites rather than the location of the virtual site and why is it looking at it in a file structure rather than as a url ? The second is the erroneous characters ?
Note that if I vary the url to use an http:// access point for the site as a sub-url of the server primary site, then phpdig does a full index. So phpdig and php are configured and will work for sites other than https.
Issue #1
Line 869 in robot_functions line is forcing the search for robots.txt back to an http which, in our case, defaults to a different site on the web server.
$site = eregi_replace("^https","http",$site);
Taking out line 869 with // provides a different set of error log messages
[Wed Sep 22 09:47:55 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD /robots.txt HTTPS/1.1
[Wed Sep 22 09:47:55 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD / HTTPS/1.1
[Wed Sep 22 09:47:55 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD / HTTPS/1.1
[Wed Sep 22 09:47:55 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD /robots.txt HTTPS/1.1
Which would indicate that the site line should be left out as all four errors are now consistent with Issue #2. A side issue is the question over why robots.txt would be called twice ?
Issue #2
The fact that the request for robots.txt is also given as having erroneous characters I looked first at the line 272 of robot_functions.php - function phpdigTestUrl()
In this function the END_OF_LINE_MARKER is added to every HEAD request
Changed the value for END_OF_LINE_MARKER from \r\n to \n
define("END_OF_LINE_MARKER","\n");
Just in case this was the problem, it wasn't and showed no effect on the error messages. Left it as \n for linux.
Next I checked the HEAD and GET constructs and confirmed that they do not accept HTTPS/1.1
Line 347 "HEAD $path $http_scheme/1.1".END_OF_LINE_MARKER
line 621 "GET $path $http_scheme/1.1".END_OF_LINE_MARKER
changed to
Line 347 "HEAD $path HTTP/1.1".END_OF_LINE_MARKER
line 621 "GET $path HTTP/1.1".END_OF_LINE_MARKER
seems that it always wants HTTP/1.1 regardless of HTTPS or HTTP. I checked W3C for any reference to HTTP vs HTTPS and HTTPS does not exist. From my quick read it seems that RFC2818 HTTP over TLS is related and that HTTP is HTTP whereas HTTPS is really HTTP over TLS instead of TCP. But I'll stop there 'cos I don't really need to go that deep.
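The observation above, that the protocol token stays HTTP/1.1 and only the transport changes for https, can be sketched as follows. This is a hypothetical helper, not PhpDig code: the name `buildHeadRequest` and the user agent `ExampleBot/0.1` are assumptions made for illustration. It only builds the socket target and the request text and opens no connection. Two details are worth noting: `fsockopen()` accepts an `ssl://` prefix when PHP is built with OpenSSL support, and HTTP specifies CRLF line endings on every platform, so the sketch uses `\r\n`. The `Host` header is what name-based virtual hosts rely on to pick the right site.

```php
<?php
// Hypothetical helper (not PhpDig code): derive the socket target and the
// request text for a HEAD request from a URL, without connecting.
function buildHeadRequest($url, $userAgent = 'ExampleBot/0.1')
{
    $parts  = parse_url($url);
    $scheme = isset($parts['scheme']) ? strtolower($parts['scheme']) : 'http';
    $host   = $parts['host'];
    $path   = isset($parts['path']) ? $parts['path'] : '/';
    $secure = ($scheme === 'https');

    // Include the port in the Host header only when the URL names one.
    $hostHeader = isset($parts['port']) ? $host . ':' . $parts['port'] : $host;

    return array(
        // Target string for fsockopen(); "ssl://" selects TLS.
        'target'  => ($secure ? 'ssl://' : '') . $host,
        'port'    => isset($parts['port']) ? $parts['port'] : ($secure ? 443 : 80),
        // The protocol token is HTTP/1.1 for both schemes; lines end in CRLF.
        'request' => "HEAD $path HTTP/1.1\r\n"
                   . "Host: $hostHeader\r\n"
                   . "User-Agent: $userAgent\r\n"
                   . "Connection: close\r\n\r\n",
    );
}

$req = buildHeadRequest('https://my.server.com/robots.txt');
echo $req['target'], ':', $req['port'], "\n", $req['request'];
```

The returned target and port would be passed to `fsockopen()`, and the request string written to the resulting socket.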
That fixed the erroneous characters issue, but the error logs are now showing the wrong robots.txt file is being looked for again, is this because of this change or because I didn't really fix it earlier?
[Wed Sep 22 11:09:27 2004] [error] [client 192.168.85.33] File does not exist: /web/server/root/html/robots.txt
[Wed Sep 22 11:09:27 2004] [error] [client 192.168.85.33] File does not exist: /web/server/root/html/robots.txt
from the access log it shows
www.my.server 192.168.85.33 - - [22/Sep/2004:11:09:27 +1000] "HEAD /robots.txt HTTP/1.1" 404 0 "-" "PhpDig/1.8.3 (+http://www.phpdig.net/robot.php)"
www.my.server 192.168.85.33 - - [22/Sep/2004:11:09:27 +1000] "HEAD / HTTP/1.1" 200 0 "-" "PhpDig/1.8.3 (+http://www.phpdig.net/robot.php)"
www.my.server 192.168.85.33 - - [22/Sep/2004:11:09:27 +1000] "HEAD / HTTP/1.1" 200 0 "-" "PhpDig/1.8.3 (+http://www.phpdig.net/robot.php)"
www.my.server 192.168.85.33 - - [22/Sep/2004:11:09:27 +1000] "HEAD /robots.txt HTTP/1.1" 404 0 "-" "PhpDig/1.8.3 (+http://www.phpdig.net/robot.php)"
Which would indicate that the script is still searching at the root rather than the virtual host url. As a quick test I used a browser to look at the robots.txt file and checked the access log and that works fine. So it has to be related to how phpdig is calling the robots.txt under https. Another point is that the host name is not shown as the virtualhost name so something is getting lost in translation.
I am now stuck, is this a phpdig issue, or an apache issue, or something really obscure. I have two sites that I need to index that are locked into being only https access. The http sites on the same server are fine. The https site now only gets a single link indexed and that is the topmost self link to the domain. Yet doing the same site via http for a depth of 3/3 gets 10 entries.
cheers
Tony
http://www.marblebay.com.au |
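One thing that could be ruled out here is the robots.txt address itself losing its scheme or host along the way. A minimal sketch, assuming a hypothetical helper named `robotsTxtUrl` (not a PhpDig function), of building that address from the site URL while keeping the scheme, host and port exactly as given, rather than rewriting https to http:

```php
<?php
// Hypothetical helper (not PhpDig code): build the robots.txt URL for a
// site while keeping the scheme, host and port the site was given with.
function robotsTxtUrl($siteUrl)
{
    $parts  = parse_url($siteUrl);
    $scheme = isset($parts['scheme']) ? strtolower($parts['scheme']) : 'http';
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';

    // robots.txt always lives at the root of the host, whatever the page path.
    return $scheme . '://' . $parts['host'] . $port . '/robots.txt';
}

echo robotsTxtUrl('https://my.server.com/some/page.html'), "\n";
```

Comparing the host in such a URL with the host name recorded in the access log would show whether the request reaches the intended virtual host.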
12-04-2004, 06:17 PM | #12 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
>> I am now stuck, is this a phpdig issue, or an apache issue, or something really obscure.
tomas, thowden, you still around?