PhpDig.net

Old 03-10-2004, 02:56 PM   #1
tomas
Orange Mole
 
Join Date: Feb 2004
Posts: 47
Question https phpdig strange failure

digging a website with some https-forms produces strange failure:

some forms on this website force ssl this way:
PHP Code:
// force ssl: requests arriving on port 80 are redirected to https
if ($_SERVER["SERVER_PORT"] == "80") {
    $ssl_redirect = "https://foo.com" . $_SERVER["SCRIPT_NAME"];
    header("Location: $ssl_redirect");
    exit;
}

this leads to:

<b>Warning</b>: mysql_num_rows(): supplied argument is not a valid MySQL result resource in <b>/search/admin/spider.php</b> on line <b>485</b><br />
<br />
<b>Warning</b>: mysql_num_rows(): supplied argument is not a valid MySQL result resource in <b>/search/admin/spider.php</b> on line <b>503</b><br />

about 5 apache processes, consuming about 80% memory remain for hours and spidering fails.

our workaround was:
PHP Code:
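// only redirect clients whose user agent contains "mozilla", so the spider is left alone
if (($_SERVER["SERVER_PORT"] == "80") and (strstr(strtolower($_SERVER["HTTP_USER_AGENT"]), "mozilla") != "")) {
    $ssl_redirect = "https://foo.com" . $_SERVER["SCRIPT_NAME"];
    header("Location: $ssl_redirect");
    exit;
}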

but on websites where you have no access to the scripts, the bug will make it impossible to spider the site.
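A minimal sketch of the same workaround as a small, testable function, assuming PHP 7 or later. The name `shouldRedirectToSsl` is hypothetical; the idea is only that a client whose user agent does not contain "mozilla" (such as the PhpDig robot) is never redirected.

```php
<?php
// Hypothetical helper, not part of PhpDig or of the original site code.
// Returns true only for plain-HTTP requests from clients that identify
// themselves as Mozilla-compatible browsers.
function shouldRedirectToSsl(string $serverPort, string $userAgent): bool
{
    $isPlainHttp = ($serverPort === "80");
    $looksLikeBrowser = (stripos($userAgent, "mozilla") !== false);
    return $isPlainHttp && $looksLikeBrowser;
}

// Usage inside a form script. The isset() guard means nothing happens
// when the file is run from the command line.
if (isset($_SERVER["SERVER_PORT"], $_SERVER["SCRIPT_NAME"])
    && shouldRedirectToSsl((string) $_SERVER["SERVER_PORT"], $_SERVER["HTTP_USER_AGENT"] ?? "")) {
    header("Location: https://foo.com" . $_SERVER["SCRIPT_NAME"]);
    exit;
}
```

This only helps on sites where the scripts can be edited, which is exactly the limitation described above.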

any ideas - anyone
tomas

Last edited by tomas; 03-10-2004 at 03:19 PM.
Old 03-10-2004, 04:35 PM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539

<b>Warning</b>: mysql_num_rows(): supplied argument is not a valid MySQL result resource in <b>/search/admin/spider.php</b> on line <b>485</b><br />
<br />
<b>Warning</b>: mysql_num_rows(): supplied argument is not a valid MySQL result resource in <b>/search/admin/spider.php</b> on line <b>503</b><br />

Hi. When you echo the queries right before lines 485 and 503 what do you get?
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 03-10-2004, 04:47 PM   #3
tomas
Orange Mole
 
Join Date: Feb 2004
Posts: 47
hallo charter,

there are no echos before -
watching the spider, i can see the script dig the links from the homepage and print the +++ while finding links - then at the first link with an auto-redirect to ssl the spider crashes with these messages.

tested on debian/fedora/redhat-enterprise with php 4.2.3/4.3.3/4.3.4 apache 1.3/2 cgi and mod

so i'm sure it's not a platform specific error

tomas
Old 03-10-2004, 04:53 PM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. If you stick the echo statements into the code, what does it echo for the two queries?
PHP Code:
echo $query . " : query1"; // right before line 485
echo $query . " : query2"; // right before line 503
Old 03-10-2004, 04:56 PM   #5
tomas
Orange Mole
 
Join Date: Feb 2004
Posts: 47
these echos are taken from standard log output -
no extra echos were inserted.

t.
Old 03-10-2004, 04:59 PM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Right, but echo the queries and see what prints onscreen. The error "warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource" means that the query failed, so there is no result set to count. If you echo the two $query variables, then you might see how to fix it.
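A rough sketch of that debugging step, assuming PHP 7 or later. `describeQueryResult` is a hypothetical helper, not a PhpDig function: it takes whatever the query call returned (false on failure) and reports the SQL text alongside the database error, so a broken query shows up before anything tries to count rows.

```php
<?php
// Hypothetical debugging helper, not part of PhpDig.
// $result is the return value of the query call: a result on success,
// or false when the database rejected the SQL.
function describeQueryResult($result, string $query, string $dbError = ""): string
{
    if ($result === false) {
        return "FAILED: " . $query . ($dbError !== "" ? " [" . $dbError . "]" : "");
    }
    return "OK: " . $query;
}

// With the old mysql extension the call would look roughly like this
// ($id_connect is an assumed name for the connection variable):
//   echo describeQueryResult(mysql_query($query, $id_connect), $query, mysql_error());

// Stand-alone demonstration with a made-up query and error text:
echo describeQueryResult(false, "SELECT 1 FROM t WHERE path = 'a'b'", "syntax error"), "\n";
```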
Old 03-10-2004, 05:06 PM   #7
tomas
Orange Mole
 
Join Date: Feb 2004
Posts: 47
ok i will try it tomorrow,

because, as in my previous post, we already patched our code on the live server of this website - so i have to set all this up again
on one of the test-machines and start running the spider.

but for now it's 2 hours past midnight and i'll leave the office
for today - puuuh.

by the way - i sent you an email :-)

tomas
Old 03-10-2004, 05:19 PM   #8
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Okay, but one thing to check is to see if the following code is still in the spider.php file, right before the "//is this link already in temp table" comment:
PHP Code:
// escape quotes in the path and file parts before they go into the SQL
if (!get_magic_quotes_runtime()) {
    $lien['path'] = addslashes($lien['path']);
    $lien['file'] = addslashes($lien['file']);
}

As both queries are nearly the same, my initial guess would be that $lien['path'] and/or $lien['file'] are no longer being escaped and so the queries are breaking and mysql_num_rows has nothing to 'num' on.
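To illustrate the guess, here is a small sketch (PHP 7 or later assumed) of how one unescaped quote breaks a query string. The table and column names are made up for the example and are not PhpDig's schema; `buildLookupQuery` is a hypothetical name.

```php
<?php
// Illustrative only: "links", "path" and "file" are invented names.
function buildLookupQuery(string $path, string $file, bool $escape): string
{
    if ($escape) {
        $path = addslashes($path);
        $file = addslashes($file);
    }
    return "SELECT id FROM links WHERE path = '$path' AND file = '$file'";
}

$path = "docs/o'reilly/";
$file = "index.php";

// Unescaped: the quote inside the path ends the SQL string literal early.
echo buildLookupQuery($path, $file, false), "\n";
// Escaped: the quote is preceded by a backslash and stays inside the literal.
echo buildLookupQuery($path, $file, true), "\n";
```

In current PHP code, prepared statements would normally replace manual escaping, but addslashes() is what the spider.php snippet above relies on.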
Old 03-12-2004, 05:11 PM   #9
tomas
Orange Mole
 
Join Date: Feb 2004
Posts: 47
hi charter,

sorry for my delay -
tried inserting the echos - but they show nothing that catches the bug.
all the sql looks good - but at the last link right before the first of the 'ssl-ones' the echoing stops - and the server starts getting hot (about 80% mem and cpu constant)

funny thing, and i have no idea why -
do you have one? :-)

tomas
Old 03-12-2004, 06:53 PM   #10
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. I think I know what's going on. I don't believe PhpDig was originally written to crawl https. The header redirect sends http to https, PhpDig does its thing and https goes back to http, and then the redirect sends it back to https, etcetera, so I think it's a loop.
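A minimal sketch of how such a loop could be detected, assuming PHP 7 or later. This is not PhpDig code: `followRedirects` is a hypothetical function, and the `$redirects` array stands in for the real HTTP requests a crawler would make to read each Location header.

```php
<?php
// Hypothetical sketch. $redirects maps a URL to the URL it redirects to;
// a real crawler would obtain this from the Location response header.
function followRedirects(string $url, array $redirects, int $maxHops = 5): array
{
    $seen = [];
    while (isset($redirects[$url])) {
        // Abort when a URL comes around again or the hop limit is reached.
        if (isset($seen[$url]) || count($seen) >= $maxHops) {
            return ["status" => "aborted", "url" => $url];
        }
        $seen[$url] = true;
        $url = $redirects[$url];
    }
    return ["status" => "ok", "url" => $url];
}

// The suspected situation: the site sends http to https,
// and the crawler turns https back into http.
$loop = [
    "http://foo.com/form.php"  => "https://foo.com/form.php",
    "https://foo.com/form.php" => "http://foo.com/form.php",
];
print_r(followRedirects("http://foo.com/form.php", $loop));
```

Without a visited set or hop limit of this kind, the two URLs would keep bouncing, which would fit the processes that hang for hours.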

If you search for http in the spider.php and robot_functions.php files, and maybe other files, you'll see that some code needs to change in order to account for https links. I'm not going to post a bunch of trial and error code now, but will instead work on it for inclusion in another release.

In the meantime, try sending the PhpDig robot a 403 when https is encountered based on user agent or something. Not tested, but if you want to crawl https one at a time via command line, the following might be okay:

In spider.php change:
PHP Code:
if (ereg('^http://',$argv[1])) { 
to the following:
PHP Code:
if (ereg('^http[s]?://',$argv[1])) { 
and then use the following:
Code:
prompt> php -f spider.php https://www.domain.com
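For anyone reading this on a newer PHP: ereg() was deprecated in PHP 5.3 and removed in PHP 7.0, so the check above needs a preg_match() equivalent there. A small sketch, with `isHttpOrHttpsUrl` as a hypothetical wrapper name:

```php
<?php
// Assumed PCRE equivalent of ereg('^http[s]?://', $argv[1]) for PHP 7+.
// '#' is used as the delimiter so the slashes need no escaping.
function isHttpOrHttpsUrl(string $arg): bool
{
    return preg_match('#^https?://#', $arg) === 1;
}

var_dump(isHttpOrHttpsUrl("https://www.domain.com"));
var_dump(isHttpOrHttpsUrl("http://www.domain.com"));
var_dump(isHttpOrHttpsUrl("ftp://www.domain.com"));
```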
Old 09-21-2004, 09:36 PM   #11
thowden
Green Mole
 
Join Date: Sep 2004
Posts: 2
HTTP vs HTTPS indexing

Hi

This seems to be the appropriate thread for a follow up on https indexing. Here's where I am at and what I have done trying to get phpdig to function on my server.

Config: Red Hat based e-smith server with Apache

Symptoms:
test mode is
php -f spider.php https://my.server.com

results in

4094: old priority 0, new priority 18
Spidering in progress...
-----------------------------
SITE : https://my.server.com/
Exclude paths :
- @NONE@
No link in temporary table
links found : 0
...Was recently indexed
Optimizing tables...
Indexing complete !

/var/log/httpd/error_log shows group of 4 errors for each spider attempt

[Wed Sep 22 09:32:12 2004] [error] [client 192.168.85.33] File does not exist: /web/server/root/html/robots.txt
[Wed Sep 22 09:32:12 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD / HTTPS/1.1
[Wed Sep 22 09:32:12 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD / HTTPS/1.1
[Wed Sep 22 09:32:12 2004] [error] [client 192.168.85.33] File does not exist: /web/server/root/html/robots.txt


From the rest of this thread I started looking at how it works.

In my server the site is forced to https and is a virtual host structure within apache.

The first issue that I can see is that the robots.txt file is being looked for within the root of the server web sites rather than the location of the virtual site and why is it looking at it in a file structure rather than as a url ?
The second is the erroneous characters ?

Note that if I vary the url to use an http:// access point for the site as a sub-url of the server primary site, then phpdig does a full index. So phpdig and php are configured and will work for sites other than https.

Issue #1

Line 869 in robot_functions.php forces the search for robots.txt back to http, which, in our case, defaults to a different site on the web server.

$site = eregi_replace("^https","http",$site);

Taking out line 869 with // provides a different set of error log messages

[Wed Sep 22 09:47:55 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD /robots.txt HTTPS/1.1
[Wed Sep 22 09:47:55 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD / HTTPS/1.1
[Wed Sep 22 09:47:55 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD / HTTPS/1.1
[Wed Sep 22 09:47:55 2004] [error] [client 192.168.85.33] request failed: erroneous characters after protocol string: HEAD /robots.txt HTTPS/1.1

Which would indicate that the site line should be left out as all four errors are now consistent with Issue #2. A side issue is the question over why robots.txt would be called twice ?
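A sketch of what keeping the scheme would look like, assuming PHP 7.1 or later. `robotsTxtUrl` is a hypothetical helper, not a PhpDig function: it builds the robots.txt location from the scheme, host and port of the site URL instead of rewriting https to http.

```php
<?php
// Hypothetical helper: derive the robots.txt URL from the site URL while
// preserving its scheme and port. Returns null for anything that is not
// an absolute http or https URL.
function robotsTxtUrl(string $site): ?string
{
    $parts = parse_url($site);
    if ($parts === false || !isset($parts["scheme"], $parts["host"])) {
        return null;
    }
    $scheme = strtolower($parts["scheme"]);
    if ($scheme !== "http" && $scheme !== "https") {
        return null;
    }
    $port = isset($parts["port"]) ? ":" . $parts["port"] : "";
    return $scheme . "://" . $parts["host"] . $port . "/robots.txt";
}

echo robotsTxtUrl("https://my.server.com/"), "\n";
```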

Issue #2

Since the request for robots.txt is also reported as having erroneous characters, I looked first at line 272 of robot_functions.php, function phpdigTestUrl().

In this function the END_OF_LINE_MARKER is added to every HEAD request

Changed the value for END_OF_LINE_MARKER from \r\n to \n

define("END_OF_LINE_MARKER","\n");

I tried this just in case it was the problem. It wasn't: the change had no effect on the error messages. Left it as \n for linux.

Next I checked the HEAD and GET constructs and confirmed that they do not accept HTTPS/1.1

Line 347 "HEAD $path $http_scheme/1.1".END_OF_LINE_MARKER

line 621 "GET $path $http_scheme/1.1".END_OF_LINE_MARKER

changed to

Line 347 "HEAD $path HTTP/1.1".END_OF_LINE_MARKER

line 621 "GET $path HTTP/1.1".END_OF_LINE_MARKER

seems that it always wants HTTP/1.1 regardless of HTTPS or HTTP.

I checked W3C for any reference to HTTP vs HTTPS, and an "HTTPS/1.1" protocol version does not exist. From my quick read it seems that RFC 2818 (HTTP over TLS) is the relevant document: HTTP is HTTP, whereas HTTPS is really HTTP sent over a TLS connection instead of directly over TCP, so the request line stays HTTP/1.1. But I'll stop there 'cos I don't really need to go that deep.
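To make the distinction concrete, here is a sketch (PHP 7 or later assumed) in which the scheme selects only the transport and port, while the request line always says HTTP/1.1. `buildHeadRequest` is a hypothetical function, not PhpDig's code. Two details go beyond the changes described above and are assumptions on my part: the sketch uses \r\n line endings, which the HTTP specification requires regardless of operating system, and it sends a Host header, which HTTP/1.1 requires and which name-based virtual hosts use to pick the site.

```php
<?php
// Hypothetical sketch: the URL scheme chooses the transport (ssl:// or plain)
// and the default port; the request line itself is always "HTTP/1.1".
function buildHeadRequest(string $url): array
{
    $parts   = parse_url($url);
    $scheme  = strtolower($parts["scheme"] ?? "http");
    $host    = $parts["host"] ?? "";
    $isHttps = ($scheme === "https");
    $port    = $parts["port"] ?? ($isHttps ? 443 : 80);
    $path    = $parts["path"] ?? "/";
    if (isset($parts["query"])) {
        $path .= "?" . $parts["query"];
    }
    // Include the port in the Host header only when the URL names one.
    $hostHeader = $host . (isset($parts["port"]) ? ":" . $parts["port"] : "");

    $request = "HEAD $path HTTP/1.1\r\n"
             . "Host: $hostHeader\r\n"
             . "Connection: close\r\n\r\n";

    return [
        "target"  => ($isHttps ? "ssl://" : "") . $host,
        "port"    => $port,
        "request" => $request,
    ];
}

$head = buildHeadRequest("https://my.server.com/robots.txt");
// Sending it needs network access and PHP's OpenSSL extension:
//   $fp = fsockopen($head["target"], $head["port"], $errno, $errstr, 10);
//   fwrite($fp, $head["request"]);
echo $head["target"], ":", $head["port"], "\n", $head["request"];
```

Whether a missing Host header or a plain-TCP connection is what sends the robots.txt request to the wrong site on this server is not confirmed anywhere in this thread; the sketch only shows where those two choices are made.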

Changing the request lines to HTTP/1.1 fixed the erroneous characters issue, but the error logs now show the wrong robots.txt file being looked for again. Is this because of this change, or because I didn't really fix it earlier?

[Wed Sep 22 11:09:27 2004] [error] [client 192.168.85.33] File does not exist: /web/server/root/html/robots.txt
[Wed Sep 22 11:09:27 2004] [error] [client 192.168.85.33] File does not exist: /web/server/root/html/robots.txt

from the access log it shows

www.my.server 192.168.85.33 - - [22/Sep/2004:11:09:27 +1000] "HEAD /robots.txt HTTP/1.1" 404 0 "-" "PhpDig/1.8.3 (+http://www.phpdig.net/robot.php)"
www.my.server 192.168.85.33 - - [22/Sep/2004:11:09:27 +1000] "HEAD / HTTP/1.1" 200 0 "-" "PhpDig/1.8.3 (+http://www.phpdig.net/robot.php)"
www.my.server 192.168.85.33 - - [22/Sep/2004:11:09:27 +1000] "HEAD / HTTP/1.1" 200 0 "-" "PhpDig/1.8.3 (+http://www.phpdig.net/robot.php)"
www.my.server 192.168.85.33 - - [22/Sep/2004:11:09:27 +1000] "HEAD /robots.txt HTTP/1.1" 404 0 "-" "PhpDig/1.8.3 (+http://www.phpdig.net/robot.php)"

Which would indicate that the script is still searching at the root rather than the virtual host url. As a quick test I used a browser to look at the robots.txt file and checked the access log and that works fine. So it has to be related to how phpdig is calling the robots.txt under https. Another point is that the host name is not shown as the virtualhost name so something is getting lost in translation.

I am now stuck, is this a phpdig issue, or an apache issue, or something really obscure.

I have two sites that I need to index that are locked into being only https access. The http sites on the same server are fine.

The https site now only gets a single link indexed and that is the topmost self link to the domain. Yet doing the same site via http for a depth of 3/3 gets 10 entries.

cheers
Tony

http://www.marblebay.com.au
Old 12-04-2004, 07:17 PM   #12
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
>> I am now stuck, is this a phpdig issue, or an apache issue, or something really obscure.

tomas, thowden, you still around?
Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright © 2001 - 2005, ThinkDing LLC. All Rights Reserved.