PhpDig.net

Old 04-23-2004, 03:44 AM   #1
Andreas_Wien
Green Mole
 
Join Date: Apr 2004
Posts: 4
includes & excludes

After a quick setup and easy integration, I am having difficulty spidering the site http://444.docs4you.at correctly.

1.) The path portalnode/ is excluded in the database AND in robots.txt. Nevertheless it is found somehow and spidered over and over again.

2.) OTOH, links on the page are generally not followed. The behavior differs every time; in the worst case 2 pages are spidered and indexed, nothing else, and phpdig hangs spidering portalnode/.

Maybe my understanding of the <!-- phpdigExclude --> / include tags is wrong. Can I assume that the parser reads the page top down, switching indexing off and on again each time it sees a line with a phpdig tag?
And should each and every link on the page be spidered regardless of the include/exclude tags?

Any help would be greatly appreciated!

Best Regards, Andreas
Attached Files
File Type: zip screenshotlog.zip (21.0 KB, 11 views)
Old 04-23-2004, 04:26 AM   #2
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
Welcome to the forums, Andreas. We're glad you could join us!

Check out this thread for a solution to your problem.
Old 04-23-2004, 04:44 AM   #3
Andreas_Wien
Green Mole
 
Join Date: Apr 2004
Posts: 4
Hi Pat, thank you for the reply!

My case is in fact much easier, since the /portalnode/ path should be excluded altogether from ANY robot indexing. Thus my robots.txt looks like:

User-agent: *
Disallow: /portalnode/

Nevertheless, phpdig hangs in exactly that directory.

Did I miss something?
Andreas
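
A robots.txt rule like the one above matches on a literal path prefix, so it is worth checking which URLs it actually covers. The sketch below uses Python's standard `urllib.robotparser`, not phpDig's own parser. The second URL, with the directory spelled `portal.node`, is an assumption taken from the spelling that shows up later in this thread; which spelling the live site uses is not confirmed here.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content quoted in the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /portalnode/
"""


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Directory spelled exactly as in the Disallow rule: blocked.
allowed_plain = is_allowed(ROBOTS_TXT, "http://444.docs4you.at/portalnode/uups.php")

# Hypothetical alternative spelling with a dot: the rule does not cover it.
allowed_dotted = is_allowed(ROBOTS_TXT, "http://444.docs4you.at/portal.node/uups.php")

print(allowed_plain, allowed_dotted)
```

If the real directory name differs from the one in the Disallow line by even one character, a compliant robot is free to fetch it.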
Old 04-23-2004, 05:29 AM   #4
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
It seems like I read somewhere in the forums that if you list the complete path for your starting file, e.g.
http://444.docs4you.at/portalnode/index.php
or whatever the file is called, indexing will work for you.

You most likely have indexing locked right now, so you'll have to unlock it. Search the forum for how to do that if you don't know how. Also, make sure your LIMIT_DAYS parameter in config.php is set in a way that will let you re-spider your site now.

Good luck, and let us know how it goes.
Old 04-23-2004, 05:31 AM   #5
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
Oops! That's what I get for being in a hurry. I'm getting ready to leave for work right now. You wanted to exclude that directory, not include it.

Are you trying to index the whole site, and it's hanging? I'm not sure what is happening here.
Old 04-24-2004, 02:56 AM   #6
Andreas_Wien
Green Mole
 
Join Date: Apr 2004
Posts: 4
Yep, the whole site. I don't understand a few points here:

- how it gets a link to /portalnode/uups.php
- why it keeps indexing that file against all the exclude rules
- why it ignores the rest of the site
- why it hangs

Any idea?
Andreas
Old 04-24-2004, 07:31 AM   #7
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
I downloaded your zip file and took a look at your screen shot (nothing amiss there as far as I can see) and your spidering log. Just before the link that was being spidered multiple times is this link:

Quote:
http://444.docs4you.at/Content.Node/Veranstaltungen/index.php?S=navi
When I tried to bring up that page in my browser, it seemed to fail. I don't speak German so I don't have a clue what the whole page says, but there is definitely an error in the first line:
Quote:
Warning: Cannot modify header information - headers already sent by (output started at /Node/node/portal.node/uups.php:2) in /Node/node/portal.node/uups.php on line 9
Maybe fixing that will get your problem cleared up.
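
One way to catch pages like this before a spider run is to scan the returned HTML for PHP warning text. The following is only a sketch: the function name `find_php_warnings` and the regular expression are illustrative assumptions, not part of phpDig, and the pattern covers only the common "... in FILE on line N" message shape.

```python
import re

# Matches PHP diagnostics of the form "<Level>: <message> in <file> on line <n>".
PHP_DIAGNOSTIC = re.compile(
    r"(?P<level>Warning|Notice|Fatal error|Parse error): "
    r"(?P<message>.+?) in (?P<file>\S+) on line (?P<line>\d+)"
)


def find_php_warnings(body: str) -> list[tuple[str, str, int]]:
    """Return (level, file, line) for each PHP diagnostic found in a page body."""
    return [
        (m.group("level"), m.group("file"), int(m.group("line")))
        for m in PHP_DIAGNOSTIC.finditer(body)
    ]


# The warning quoted above, as it appeared at the top of the page.
sample = (
    "Warning: Cannot modify header information - headers already sent by "
    "(output started at /Node/node/portal.node/uups.php:2) "
    "in /Node/node/portal.node/uups.php on line 9"
)
print(find_php_warnings(sample))
```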
Old 04-24-2004, 07:45 AM   #8
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
I meant to include this in my last post and forgot. I am somewhat concerned about that 404 error that you got first thing in your spider log.

This doesn't have anything to do with phpDig per se, but I found some free link checker software that you might want to use on your site. Just a word of caution though. It consumes bandwidth on your site about like phpDig does, so you wouldn't want to run it every single day. The free version is called REL Checker Lite, and you can download it here.
Old 04-24-2004, 01:33 PM   #9
Andreas_Wien
Green Mole
 
Join Date: Apr 2004
Posts: 4
Yes, right, sorry about that. The site is in some respects not finished yet; nevertheless it should be searchable already.

I hope that doesn't affect phpdig in any way. I'm not troubled if phpdig doesn't index a page that doesn't exist. It seems difficult enough to get the existing pages indexed ;-) !

Some additional info about that site:
Every page has several modes of appearance, controlled by the S-parameter. I intended to hide these apparent duplicate pages from phpdig by dynamically adding a line:
<meta name="robots" content="noindex,nofollow,none">
if and only if an S-parameter is passed to the page. So only the simple pages (without any S-parameter) should be indexed; they carry the line:
<meta name="robots" content="index,follow">
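
The rule just described can be sketched as a small function. This is written in Python purely for illustration (the site itself runs PHP), and the function name `robots_meta` is hypothetical; only the two meta lines come from the post.

```python
from urllib.parse import parse_qs, urlsplit

NOINDEX_META = '<meta name="robots" content="noindex,nofollow,none">'
INDEX_META = '<meta name="robots" content="index,follow">'


def robots_meta(url: str) -> str:
    """Pick the robots meta line for a page URL based on the S query parameter."""
    # keep_blank_values=True so that "?S=" still counts as "S was passed".
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return NOINDEX_META if "S" in query else INDEX_META


print(robots_meta("http://444.docs4you.at/Content.Node/Veranstaltungen/index.php?S=navi"))
print(robots_meta("http://444.docs4you.at/Content.Node/Veranstaltungen/index.php"))
```

Note that a spider still has to fetch a page to read its meta tag, so this hides the S variants from the index but does not stop them from being requested.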

And even if phpdig hangs in one branch, why doesn't it finish spidering the other branches of the site? And why does it change its behavior (number of pages successfully indexed) every time I dig the site?

Still confused ... are my assumptions in the initial posting correct?

And the main point is: portal.node/ is EXCLUDED in the DB and in robots.txt, and the URL of uups.php lies on that path. Which precautions do I have to take on such pages in order to have phpdig spider the rest of the site that is not explicitly excluded?

Greets from Vienna, Andreas

Last edited by Andreas_Wien; 04-24-2004 at 01:42 PM.
Old 04-25-2004, 09:41 AM   #10
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
OK, here's my take on what you're saying, and it's not based on my knowledge of the phpDig code itself. Rather, it's based on what I see in the spider logs. My assumptions may or may not be correct.

phpDig obeys robots.txt - that much we know - but it still has to visit the page to find out if there is a robots exclusion, assuming that it didn't find that already in the robots.txt file. If a page has some kind of problem, like the one I pointed out, that could cause phpDig to go into some kind of loop. Exactly how or why that happens, I wouldn't know.

I hope you understand where I'm coming from with this. What I'm saying is basically this: If phpDig has to visit a page, there had better not be any errors in it. If there are, they could throw phpDig into a tailspin and cause it not to spider everything you think it should.

My suggestion would be to either fix the page or use the include/exclude comments in the page(s) that link to the problem document, so that phpDig will not attempt to spider it.
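
For reference, the top-down reading of the include/exclude comments that Andreas asks about in the first post can be modelled as a simple line-by-line toggle. This is a sketch of that mental model only, not phpDig's actual code: the `<!-- phpdigInclude -->` marker name and the requirement that each comment sits alone on its line are assumptions here, so check the phpDig documentation for the real rules.

```python
EXCLUDE = "<!-- phpdigExclude -->"
INCLUDE = "<!-- phpdigInclude -->"  # assumed counterpart of the exclude marker


def indexable_lines(html: str) -> list[str]:
    """Return the lines a top-down toggle parser would keep for indexing."""
    indexing = True  # pages start in the "indexed" state
    kept = []
    for line in html.splitlines():
        marker = line.strip()
        if marker == EXCLUDE:
            indexing = False  # switch off until the next include marker
        elif marker == INCLUDE:
            indexing = True
        elif indexing:
            kept.append(line)
    return kept


page = "\n".join([
    "<p>Intro</p>",
    "<!-- phpdigExclude -->",
    '<a href="/portalnode/uups.php">skip me</a>',
    "<!-- phpdigInclude -->",
    "<p>Body</p>",
])
print(indexable_lines(page))
```

Under this model, a link wrapped in an exclude/include pair never reaches the spider, which is the effect the suggestion above relies on.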
All times are GMT -8.