01-08-2004, 06:34 PM | #1 |
Green Mole
Join Date: Jan 2004
Posts: 10
|
Indexing cookie/session authenticated pages
Hello,
We have installed PhpDig 1.6.5 and are having trouble indexing authenticated pages on our site (more than half of our pages use cookie-based authentication). The indexer only reaches the publicly accessible pages.
We've looked through the code in spider.php and robot_functions.php and found many references to cookie-related functions (such as phpDigMakeCookies()), but haven't been able to enable them. Is there any documentation for this? Or could someone describe the steps for supplying cookie information?
Some background info: users are authenticated on our site using only a username/password combination submitted through a login form on the pages. No pages are .htaccess protected.
Thanks! |
01-09-2004, 04:51 AM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. The functions send HEAD and GET requests. You can see an example in this thread.
Basically, the HEAD requests check status and the GET requests grab content. Nothing in these functions can be switched on to let PhpDig crawl authenticated pages. One thing you might try is adding a check for PhpDig to the authenticated pages: if the request comes from PhpDig, show the content that normally requires authentication; otherwise, require the user to authenticate. If the authenticated pages use PHP, you may find the list of reserved variables here useful.
|
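The check suggested in the post above can be sketched as follows. This is a minimal illustration, not PhpDig code: the function names `isPhpDigRequest()` and `mayViewProtectedPage()`, the session key `user_id`, and the login URL are all hypothetical, and it assumes the spider's User-Agent header contains the string "PhpDig". It uses PHP 7 syntax.

```php
<?php
// Hypothetical helper: true when the request's User-Agent mentions PhpDig.
function isPhpDigRequest(array $server): bool
{
    $agent = $server['HTTP_USER_AGENT'] ?? '';
    // stripos() returns 0 for a match at the start of the string,
    // so the result is compared strictly against false.
    return stripos($agent, 'PhpDig') !== false;
}

// Hypothetical gate: logged-in users and the spider may see the page.
function mayViewProtectedPage(array $server, array $session): bool
{
    return isset($session['user_id']) || isPhpDigRequest($server);
}

// Usage in a protected page (sketch only):
// session_start();
// if (!mayViewProtectedPage($_SERVER, $_SESSION)) {
//     header('Location: /login.php'); // hypothetical login URL
//     exit;
// }
```

Passing `$_SERVER` and `$_SESSION` in as arguments keeps the check easy to test without a live request.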
01-09-2004, 02:17 PM | #3 |
Green Mole
Join Date: Jan 2004
Posts: 10
|
Thanks for the tip Charter, your idea works!
Checking for the user agent is the way we have chosen to go. |
06-26-2004, 01:33 AM | #4 |
Green Mole
Join Date: Jun 2004
Posts: 22
|
Hi Charter,
I am having the same problem. I want to test for the PhpDig spider in the USER_AGENT header, but I have no idea what the agent string will be. Any ideas? |
06-26-2004, 09:44 AM | #5 | |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
Quote:
|
|
06-27-2004, 08:24 AM | #6 |
Green Mole
Join Date: Jun 2004
Posts: 22
|
Thanks Pat - do you think that this code will return true then:
if (strpos($_SERVER["HTTP_USER_AGENT"], "PhpDig") !== false) {
    // set session information
}
|
06-27-2004, 08:56 AM | #7 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
Yes, something like that will work but you can simplify it a bit, like so:
PHP Code:
|
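As an aside on `strpos()`-based checks like the one discussed here: `strpos()` returns the integer 0 when the match sits at the very start of the string, and a loose `!= FALSE` comparison treats that 0 as "not found". The spider's User-Agent appears to begin with "PhpDig/" (see the `ini_set('user_agent', ...)` line quoted later in this thread), so a strict comparison is the safer choice. The sample agent string below, including the version number, is an assumption for illustration.

```php
<?php
// Assumed sample value; the real string depends on PHPDIG_VERSION.
$agent = 'PhpDig/1.6.5 (+http://www.phpdig.net/robot.php)';

$pos = strpos($agent, 'PhpDig');  // int 0: the match is at the first character

$looseCheck  = ($pos != false);   // false, because 0 == false under loose comparison
$strictCheck = ($pos !== false);  // true, because int 0 is not identical to bool false

var_dump($looseCheck, $strictCheck);
```

With the loose form, a page would silently treat the spider as an ordinary visitor, which looks exactly like "the user agent test is not working".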
06-27-2004, 09:41 AM | #8 |
Green Mole
Join Date: Jun 2004
Posts: 22
|
Thanks for the welcome. Testing for the user agent doesn't seem to be working (although it is hard to tell, as I don't currently have access to the logs).
Is there any other way that I could tell if a script is being called by PhpDig?
Also, when the spider does a crawl, it seems to dismiss dynamically generated pages that differ only in their ids. For example, ?page=articleView&articleId=250 is reported as "Duplicate of an existing document". There are several thousand of these articles and it looks like none of them are being indexed. |
06-27-2004, 10:18 AM | #9 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
There are several threads here in the forum about problems similar to yours, like this one, for example. If that one doesn't provide some clues, just search the forum for any thread with the word "numbers" in it, and you'll probably find something that will help.
|
08-17-2004, 03:23 PM | #10 |
Green Mole
Join Date: Aug 2004
Posts: 2
|
Using user agent as an authentication method
I'm currently facing the same problem (indexing password protected files that don't use .htaccess protection) and found this thread very helpful.
I would like to point out that forging a user agent header is very easy, especially with browsers such as Opera. If you are going to use the user agent as an authentication method, you should edit spider.php and set the user agent to something else, then test for that. Look for the following lines in admin/spider.php and change the string to something a little harder to guess:
// set the User-Agent for the file() function
@ini_set('user_agent','PhpDig/'.PHPDIG_VERSION.' (+http://www.phpdig.net/robot.php)');
|
08-18-2004, 10:57 AM | #11 |
Green Mole
Join Date: Aug 2004
Posts: 2
|
A follow-up to my earlier post: to change the spider's user agent, edit admin/robot_functions.php, not the spider.php file I mentioned earlier.
Another good idea would be to have your site's authentication mechanism check that the spider's IP address is what you expect it to be, just in case. - Ben |
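The two suggestions in these posts (a harder-to-guess User-Agent and an IP address check) could be combined roughly as below. This is a sketch under assumptions, not part of PhpDig: the token `MySiteIndexer-7f3a9c` and the addresses are placeholders, the function name `isTrustedSpider()` is hypothetical, and it presumes the `user_agent` string PhpDig sends has already been changed to include the token. It needs PHP 7 or later.

```php
<?php
// Placeholders, not PhpDig defaults: choose your own values.
const SPIDER_AGENT_TOKEN = 'MySiteIndexer-7f3a9c';
const SPIDER_ALLOWED_IPS = ['192.0.2.10', '127.0.0.1'];

// Hypothetical check: the request must carry the private token
// AND come from one of the expected addresses.
function isTrustedSpider(array $server): bool
{
    $agent = $server['HTTP_USER_AGENT'] ?? '';
    $ip    = $server['REMOTE_ADDR'] ?? '';

    $agentOk = strpos($agent, SPIDER_AGENT_TOKEN) !== false;
    $ipOk    = in_array($ip, SPIDER_ALLOWED_IPS, true);

    return $agentOk && $ipOk;
}
```

Note that `REMOTE_ADDR` is the address of whatever host connects to the web server, so behind a proxy or load balancer it may not be the spider's own address.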