04-23-2004, 03:44 AM | #1 |
Green Mole
Join Date: Apr 2004
Posts: 4
|
includes & excludes
After a quick setup and easy integration, I am having difficulties spidering the page http://444.docs4you.at correctly.
1.) The path portalnode/ is excluded in the database AND in robots.txt; nevertheless it is found somehow and spidered over and over again. 2.) OTOH, links on the page are in general not followed. The behavior is different every time; in the worst case two pages are spidered and indexed, nothing else, and phpDig hangs spidering portalnode/. Maybe my understanding of the <!-- phpdigExclude --> / <!-- phpdigInclude --> tags is wrong. Can I assume that the parser reads the page top down, switching indexing off and on again each time it encounters a phpdig tag? And that, regardless of include/exclude tags, each and every link on the page should still be followed? Any help would be greatly appreciated! Best Regards, Andreas |
04-23-2004, 04:26 AM | #2 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
Welcome to the forums, Andreas. We're glad you could join us!
Check out this thread for a solution to your problem. |
04-23-2004, 04:44 AM | #3 |
Green Mole
Join Date: Apr 2004
Posts: 4
|
Hi Pat, thank you for the reply!
My case is in fact much simpler, since the /portalnode/ path should be excluded altogether from ANY robot indexing. Thus my robots.txt looks like:

User-agent: *
Disallow: /portalnode/

Nevertheless, phpDig hangs in exactly that directory. Did I miss something? Andreas |
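Just to rule out a syntax problem in the rule itself, Python's standard robotparser (an independent implementation, not what phpDig uses) agrees that the directory is blocked:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /portalnode/",
])

# The excluded directory is blocked for every user agent...
print(rp.can_fetch("*", "http://444.docs4you.at/portalnode/uups.php"))  # False
# ...while the rest of the site stays fetchable.
print(rp.can_fetch("*", "http://444.docs4you.at/index.php"))            # True
```

So the file itself looks fine; the question is why phpDig still ends up in there.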
04-23-2004, 05:29 AM | #4 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
Seems like I read somewhere in the forums that if you list the complete path to your starting file, e.g.
http://444.docs4you.at/portalnode/index.php (or whatever the file is called), indexing will work for you. You most likely have indexing locked right now, so you'll have to unlock it; search the forum for how to do that if you don't know how. Also, make sure the LIMIT_DAYS parameter in config.php is set in a way that will let you re-spider your site now. Good luck, and let us know how it goes. |
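For reference, the setting I mean is the LIMIT_DAYS constant in config.php. Setting it to 0 should let you re-spider pages right away (reading 0 as "no minimum age" is my interpretation, so double-check it against the comments in your phpDig version):

```php
// config.php: minimum age, in days, before an already-indexed page
// is spidered again. 0 removes the waiting period so the site can
// be dug again immediately.
define('LIMIT_DAYS', 0);
```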
04-23-2004, 05:31 AM | #5 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
Oops! That's what I get for being in a hurry. I'm getting ready to leave for work right now. You wanted to exclude that directory, not include it.
Are you trying to index the whole site, and it's hanging? I'm not sure what is happening here. |
04-24-2004, 02:56 AM | #6 |
Green Mole
Join Date: Apr 2004
Posts: 4
|
Yep, the whole site. I don't understand a few points here:
- how it gets a link to /portalnode/uups.php
- why it keeps indexing that file against all the exclude rules
- why it ignores the rest of the site
- why it hangs
Any idea? Andreas |
04-24-2004, 07:31 AM | #7 | ||
Purple Mole
Join Date: Jan 2004
Posts: 694
|
I downloaded your zip file and took a look at your screen shot (nothing amiss there as far as I can see) and your spidering log. Just before the link that was being spidered multiple times is this link:
Quote:
|
04-24-2004, 07:45 AM | #8 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
I meant to include this in my last post and forgot: I am somewhat concerned about that 404 error right at the start of your spider log.
This doesn't have anything to do with phpDig per se, but I found some free link-checker software that you might want to use on your site. Just a word of caution, though: it consumes bandwidth on your site about like phpDig does, so you wouldn't want to run it every single day. The free version is called REL Checker Lite, and you can download it here. |
04-24-2004, 01:33 PM | #9 |
Green Mole
Join Date: Apr 2004
Posts: 4
|
Ja, right, sorry about that. The site is in some respects not finished yet; nevertheless it should already be searchable.
I hope that doesn't affect phpDig in any way. I'm not troubled if phpDig doesn't index a page that doesn't exist; it seems difficult enough to get the existing pages indexed ;-) !

Some additional info about the site: every page has several modes of appearance, controlled by the S parameter. I intended to hide these apparently duplicate pages from phpDig by dynamically adding the line

<meta name="robots" content="noindex,nofollow,none">

if and only if an S parameter is passed to the page. So only the plain pages (without any S parameter) should be indexed; they carry the line

<meta name="robots" content="index,follow">

And even if phpDig hangs in one branch, why doesn't it finish spidering the other branches of the site? And why does its behavior (the number of pages successfully indexed) change every time I dig the site? Still confused ... are my assumptions in the initial posting correct?

And the main point is: portalnode/ is EXCLUDED in the DB and in robots.txt, and the URL of uups.php lies on that path. Which precautions do I have to take on such pages so that phpDig spiders the rest of the site that is not explicitly excluded?

Greets from Vienna, Andreas

Last edited by Andreas_Wien; 04-24-2004 at 01:42 PM. |
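The rule I implemented boils down to this (sketched in Python rather than the PHP the site actually runs, and robots_meta is just an illustrative name, not a real function on my site):

```python
def robots_meta(params: dict) -> str:
    # Any request carrying an S parameter is one of the duplicate
    # "modes of appearance" and should be hidden from robots.
    if "S" in params:
        return '<meta name="robots" content="noindex,nofollow,none">'
    # The plain page, without an S parameter, is the one to index.
    return '<meta name="robots" content="index,follow">'

print(robots_meta({"S": "2"}))
print(robots_meta({}))
```

Is that scheme compatible with how phpDig decides what to follow?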
04-25-2004, 09:41 AM | #10 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
OK, here's my take on what you're saying, and it's not based on my knowledge of the phpDig code itself. Rather, it's based on what I see in the spider logs, so my assumptions may or may not be correct.
phpDig obeys robots.txt (that much we know), but it still has to visit a page to find out whether it contains a robots exclusion, assuming it didn't already find one in the robots.txt file. If a page has some kind of problem, like the one I pointed out, that could cause phpDig to go into some kind of loop; exactly how or why that happens, I wouldn't know. I hope you understand where I'm coming from with this.

What I'm saying is basically this: if phpDig has to visit a page, there had better not be any errors in it. If there are, it could throw phpDig into a tailspin and cause it not to spider everything you think it should. My suggestion would be to either fix the page or use the include/exclude comments in the page(s) that link to the problem document, so that phpDig will not attempt to spider it. |
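For example, in the page that contains the link to the problem document, you could fence the link off with the exclude comments (using the marker names from this thread; uups.php stands in for whatever the actual file is):

```html
<!-- phpdigExclude -->
<a href="/portalnode/uups.php">link phpDig should not follow</a>
<!-- phpdigInclude -->
```

That way phpDig never has to visit the broken page at all.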
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
Url part after & is ignored on spider (1.8.9 RC1 and earliers) | obottek | Bug Tracker | 1 | 08-24-2006 05:52 AM |
excludes, includes and other tables | Macedonian | How-to Forum | 0 | 01-22-2005 08:05 PM |
Parse error with includes | cherie | Troubleshooting | 1 | 12-14-2004 01:34 PM |
Apache includes in a template | sgreen | Mod Submissions | 0 | 06-17-2004 04:44 PM |
Excludes not working although properly set | jbafaure | Troubleshooting | 8 | 05-02-2004 09:46 AM |