Carl Mikkelsen
04-15-2004, 06:52 PM
I have been deploying phpdig as a test on our intranet. Aside from encountering a PHP segmentation violation when parsing a cookie, the largest problem I've had is with the processing of robots.txt.
Our intranet is almost entirely dynamic content -- much of what I want to index is delivered by TWiki, a collaboration tool distributed from www.twiki.org. In robot_functions.php, each URL encountered is tested to determine whether it should be indexed. The logic for this test issues an HTTP HEAD request, I think to determine the content type. This HEAD request is issued without regard to the robots.txt file.
If the content type is appropriate, the exclusions defined in robots.txt are then tested.
Unfortunately, the HEAD request causes the content for the page to be generated, and in some cases that generation can be VERY lengthy. To all appearances, the indexing stops: the HTTP server grinds to a halt computing page content that is never used.
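To make the ordering concrete, here is roughly what the test amounts to. This is a paraphrase, not phpdig's actual code; the helper names are hypothetical stand-ins:

    <?php
    // Hypothetical stand-in: the real code issues an HTTP HEAD request.
    // The server still generates the full page just to answer it.
    function phpdig_head_content_type($url) {
        return 'text/html'; // placeholder result
    }

    // Hypothetical stand-in: the real code parses robots.txt for the host.
    function robots_txt_allows($url) {
        return true; // placeholder result
    }

    // Current ordering, paraphrased: HEAD first, robots.txt second.
    function should_index($url) {
        $type = phpdig_head_content_type($url);   // page is generated here
        if (strpos($type, 'text/html') === false) {
            return false;                         // wrong content type
        }
        return robots_txt_allows($url);           // exclusions tested too late
    }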
The "fix" I'm testing is to move the check of robots.txt to the beginning of the function. Iff the file is not excluded, then the content type can be tested as before.
If this is causing trouble for anyone else, and if the developers concur, I could post a patch.