PhpDig.net

Old 01-08-2004, 06:34 PM   #1
tester
Green Mole
 
Join Date: Jan 2004
Posts: 10
Indexing cookie/session authenticated pages

Hello,

We have installed PhpDig 1.6.5 and are having trouble indexing authenticated pages on our site (more than half the pages on our site use cookie-based authentication). The indexer only reaches the publicly accessible pages.

We've looked through the code for spider.php and robot_functions.php and found many references to cookie-related functions (such as phpDigMakeCookies()), but haven't been able to enable them.

Is there any documentation for this? Or could someone describe the steps for supplying cookie information?

Some background info: Users are authenticated on our site using only a username/password combination provided through a login form on the pages. No pages are .htaccess protected.

Thanks!
Old 01-09-2004, 04:51 AM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. The functions send HEAD and GET requests. You can see an example in this thread.

Basically the HEAD requests check status and the GET requests grab content. There is nothing in these functions to be turned on so PhpDig can crawl authenticated pages.

One thing you might try is adding a check for PhpDig to the authenticated pages: if the request comes from PhpDig, show the content that normally requires authentication; otherwise, require the user to authenticate.

If the authenticated pages are using PHP, you may find the list of reserved variables here useful.
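
A minimal sketch of the check described above, assuming the pages are written in PHP and that the spider's User-Agent header contains the string "PhpDig". The function name isPhpDigRequest, the $_SESSION['authenticated'] flag, and the /login.php URL are hypothetical and not part of PhpDig:

```php
<?php
// Hypothetical helper: returns true when the request's User-Agent
// header contains "PhpDig". The server array is passed in as a
// parameter so the function can be tested without a real request.
function isPhpDigRequest(array $server): bool
{
    $agent = isset($server['HTTP_USER_AGENT']) ? (string) $server['HTTP_USER_AGENT'] : '';
    // strpos() returns 0 for a match at the start of the string,
    // so compare against false with the strict !== operator.
    return strpos($agent, 'PhpDig') !== false;
}

// Assumed usage at the top of a protected page:
// if (!isPhpDigRequest($_SERVER) && empty($_SESSION['authenticated'])) {
//     header('Location: /login.php'); // hypothetical login URL
//     exit;
// }
```

The commented-out usage only sends visitors to the login form when the request is neither from the spider nor from an already authenticated session.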
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 01-09-2004, 02:17 PM   #3
tester
Green Mole
 
Join Date: Jan 2004
Posts: 10
Thanks for the tip Charter, your idea works!

Checking for the user agent is the way we have chosen to go.
Old 06-26-2004, 01:33 AM   #4
bforsyth
Green Mole
 
Join Date: Jun 2004
Posts: 22
Hi Charter,

I am having the same problem. I want to test for the PhpDig spider in the USER_AGENT header, but I have no idea what the agent string will be. Any ideas?
Old 06-26-2004, 09:44 AM   #5
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
Quote:
Originally posted by bforsyth
Hi Charter,

I am having the same problem. I want to test for the PhpDig spider in the USER_AGENT header, but I have no idea what the agent string will be. Any ideas?
The user agent name is PhpDig.
Old 06-27-2004, 08:24 AM   #6
bforsyth
Green Mole
 
Join Date: Jun 2004
Posts: 22
Thanks Pat. Do you think that this code will return true then:

// strpos() returns 0 for a match at the start of the string, so use !== false
if (strpos($_SERVER["HTTP_USER_AGENT"], "PhpDig") !== false) {
    // set session information
}
Old 06-27-2004, 08:56 AM   #7
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
Yes, something like that will work, but you can simplify it a bit, like so:
PHP Code:
// Exact match: true only if the header is exactly 'PhpDig', with no version suffix
if ($_SERVER["HTTP_USER_AGENT"] == 'PhpDig') {
    // set session information
}

BTW, I forgot to welcome you to the forum. Thanks for joining us!
Old 06-27-2004, 09:41 AM   #8
bforsyth
Green Mole
 
Join Date: Jun 2004
Posts: 22
Thanks for the welcome. Testing for the user agent doesn't seem to be working, although it is hard to tell as I don't currently have access to the logs.

Is there any other way that I could tell if a script is being called by PhpDig?

Also, when the spider does a crawl, it seems to dismiss dynamically generated pages that differ only in their IDs:

e.g. ?page=articleView&articleId=250
which gives the message "Duplicate of an existing document".

There are several thousand of these articles and it looks like none of them are being indexed.
Old 06-27-2004, 10:18 AM   #9
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
There are several threads here in the forum about problems similar to yours, like this one, for example. If that one doesn't provide some clues, just search the forum for any thread with the word "numbers" in it, and you'll probably find something that will help.
Old 08-17-2004, 03:23 PM   #10
ben
Green Mole
 
Join Date: Aug 2004
Posts: 2
Using user agent as an authentication method

I'm currently facing the same problem (indexing password protected files that don't use .htaccess protection) and found this thread very helpful.

I would like to point out that forging a user agent header is very easy, especially with browsers such as Opera. If you are going to use the user agent as an authentication method, you should edit spider.php, set the user agent to something else, and then test for that.

Look for the following lines in admin/spider.php and change the string to something a little harder to guess:

// set the User-Agent for the file() function
@ini_set('user_agent','PhpDig/'.PHPDIG_VERSION.' (+http://www.phpdig.net/robot.php)');
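
As a sketch of the test side of this change, assuming the spider's user agent has been replaced with a private string. The value MySiteIndexer-7f3a9c and the function name isPrivateSpider are made up for illustration; the constant must hold whatever string was actually configured for the spider:

```php
<?php
// Hypothetical private user agent string. It must match the value
// that was set for the spider's user agent.
const PRIVATE_SPIDER_AGENT = 'MySiteIndexer-7f3a9c';

// Returns true only when the User-Agent header equals the private
// string exactly; partial matches are rejected.
function isPrivateSpider(array $server): bool
{
    $agent = isset($server['HTTP_USER_AGENT']) ? (string) $server['HTTP_USER_AGENT'] : '';
    // hash_equals() (PHP 5.6+) compares the two strings in constant time.
    return hash_equals(PRIVATE_SPIDER_AGENT, $agent);
}
```

An exact comparison is used here on purpose: with a private string there is no version suffix to allow for, and a substring test would accept any header that merely contains the string.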
Old 08-18-2004, 10:57 AM   #11
ben
Green Mole
 
Join Date: Aug 2004
Posts: 2
A follow-up to my earlier post: to change the spider's user agent, edit the file admin/robot_functions.php, not the spider.php file I mentioned earlier.

Another good idea would be to have a test in your site's authentication mechanism to check that the IP address of the spider is what you expect it to be, just in case.

- Ben
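
A minimal sketch of the IP address test suggested above, assuming PHP and assuming the spider runs on the same machine as the site (hence 127.0.0.1). The names SPIDER_IPS and isSpiderAddress are hypothetical; replace the list with the address the spider really connects from:

```php
<?php
// Hypothetical allow-list of addresses the spider is run from.
// 127.0.0.1 assumes the spider runs on the web server itself.
const SPIDER_IPS = ['127.0.0.1'];

// Returns true when the connecting address is in the allow-list.
function isSpiderAddress(array $server, array $allowedIps = SPIDER_IPS): bool
{
    $ip = isset($server['REMOTE_ADDR']) ? (string) $server['REMOTE_ADDR'] : '';
    // Strict comparison, so only an exact string match passes.
    return in_array($ip, $allowedIps, true);
}
```

This check would be combined with the user agent test, so that a request is treated as the spider only when both pass. If the site sits behind a proxy, REMOTE_ADDR holds the proxy's address, and the list would need to reflect that.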
Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright © 2001 - 2005, ThinkDing LLC. All Rights Reserved.