PhpDig.net

Old 11-01-2004, 10:26 AM   #1
vital
Green Mole
 
Join Date: Jul 2004
Posts: 2
Fix for slow spidering in PhpDig 1.8.x

Some people using PhpDig reported a severe performance hit when moving from version 1.6.x to 1.8.x.
I was one of them. It could take more than 20 seconds for me to index a single page with version 1.8.x. PhpDig 1.6.x indexed the same page in less than a second.
This happened only under win32.

Finally I found the cause of it. Look at the following code:
PHP Code:
function phpdigGetUrl($url,$cookies=array()) {
/* cut */
$fp = @fsockopen($host,$port);
/* cut */
//complete get
$request =
"GET $path $http_scheme/1.1".END_OF_LINE_MARKER
."Host: $host$sport".END_OF_LINE_MARKER
.$cookiesSendString
.$auth_string
."Accept: */*".END_OF_LINE_MARKER
."Accept-Charset: ".PHPDIG_ENCODING.END_OF_LINE_MARKER
."Accept-Encoding: identity".END_OF_LINE_MARKER
."User-Agent: PhpDig/".PHPDIG_VERSION." (+http://www.phpdig.net/robot.php)".END_OF_LINE_MARKER.END_OF_LINE_MARKER;

fputs($fp,$request);

//get return page
/* cut */
while (!$stop && !feof($fp)) {
    $flag_to_stop_loop++;
    $answer = fgets($fp,8192);
    /* cut */
    if ($flag_to_stop_loop == 10000) { break; }
    /* cut */
}
/* cut */

The spider opens a connection to the site using fsockopen(), sends a GET request, and then reads the response from the socket with fgets(), up to 8K at a time. The problem is in the HTTP request header itself.

As stated in the HTTP/1.1 specification (RFC 2616, section 14.10): http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
Quote:
HTTP/1.1 applications that do not support persistent connections MUST include the "close" connection option in every message.
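In our case the request is sent as HTTP/1.1 without that option, so the server keeps the connection open after it has sent the page and feof() does not return TRUE. In the worst case the "while" loop iterates 10000 times, and that is where the delay comes from.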

To fix this, search for the following line in the robot_functions.php file:
."User-Agent: PhpDig/".PHPDIG_VERSION." (+http://www.phpdig.net/robot.php)".END_OF_LINE_MARKER.END_OF_LINE_MARKER;
There will be a total of 3 matches (lines 354, 460 and 628).

Insert this line before each of them:
."Connection: close".END_OF_LINE_MARKER
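
For illustration, here is a minimal sketch of what the request looks like once the header is in place. The function name buildGetRequest() and its parameters are hypothetical and not part of PhpDig, and END_OF_LINE_MARKER is assumed to be "\r\n". The real code builds the string inline inside phpdigGetUrl() and also sends the cookie, auth and Accept-Charset headers, which are left out here.

```php
<?php
// Hypothetical helper, not part of PhpDig: builds a GET request similar to
// the one in phpdigGetUrl(), with the "Connection: close" header added.
// Assumption: PhpDig defines END_OF_LINE_MARKER as "\r\n".
if (!defined('END_OF_LINE_MARKER')) {
    define('END_OF_LINE_MARKER', "\r\n");
}

function buildGetRequest($host, $path, $version = '1.8.x')
{
    return "GET $path HTTP/1.1" . END_OF_LINE_MARKER
        . "Host: $host" . END_OF_LINE_MARKER
        . "Accept: */*" . END_OF_LINE_MARKER
        . "Accept-Encoding: identity" . END_OF_LINE_MARKER
        // Asks the server to close the socket after the response,
        // so feof($fp) can become TRUE once the page has been read.
        . "Connection: close" . END_OF_LINE_MARKER
        . "User-Agent: PhpDig/$version (+http://www.phpdig.net/robot.php)"
        // A blank line ends the header block.
        . END_OF_LINE_MARKER . END_OF_LINE_MARKER;
}

// Example host and path are placeholders.
$request = buildGetRequest('www.example.com', '/index.html');
echo $request;
```

The only thing that matters is that "Connection: close" goes somewhere before the blank line that ends the headers. Its position relative to the other headers should not make a difference.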

Hope this will help.

Last edited by vital; 11-01-2004 at 10:54 AM.
Old 11-02-2004, 12:20 AM   #2
manfred
Orange Mole
 
Join Date: Nov 2003
Posts: 42
This correction will make a big difference in spidering speed. I suggest that everybody (and Charter) implement this immediately! Thank you very much.

-m-
Old 11-03-2004, 11:48 AM   #3
funsutton
Green Mole
 
Join Date: Oct 2004
Posts: 5
Wow, I think that fix really helped me. I applied it and it looks good!

Great work!

-Brian Sutton
http://www.piedmontswingdance.org/search
Old 11-06-2004, 11:33 AM   #4
AllKnightAccess
Green Mole
 
Join Date: Sep 2004
Posts: 25
I tried it, but I am not seeing any faster results. According to my log file, pages at my site are still being indexed at around 60 - 90 seconds per page.