04-23-2004, 03:44 AM | #1 |
Green Mole
Join Date: Apr 2004
Posts: 4
|
includes & excludes
After a quick setup and easy integration, I am having difficulties spidering the page http://444.docs4you.at correctly.
1.) The path portalnode/ is excluded in the database AND in robots.txt; nevertheless it is found somehow and spidered over and over again. 2.) OTOH, links on the page are in general not followed. The behavior is different every time; in the worst case two pages are spidered and indexed, nothing else, and phpDig hangs spidering portalnode/. Maybe my understanding of the <!-- phpdigExclude --> / <!-- phpdigInclude --> tags is wrong. Can I assume that the parser reads the page top down, switching indexing off and on again each time it encounters a phpdig tag? And that, regardless of include/exclude tags, each and every link on the page should still be followed? Any help would be greatly appreciated! Best Regards, Andreas |
04-23-2004, 04:26 AM | #2 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
Welcome to the forums, Andreas. We're glad you could join us!
Check out this thread for a solution to your problem. |
04-23-2004, 04:44 AM | #3 |
Green Mole
Join Date: Apr 2004
Posts: 4
|
Hi Pat, thank you for the reply!
My case is in fact much simpler, since the /portalnode/ path should be excluded altogether from ANY robot indexing. Thus my robots.txt looks like:

User-agent: *
Disallow: /portalnode/

Nevertheless, phpDig hangs in exactly that directory. Did I miss something? Andreas |
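Just to rule out a syntax problem in the rule itself, Python's standard robotparser (an independent implementation, not what phpDig uses) agrees that the directory is blocked:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /portalnode/",
])

# The excluded directory is blocked for every user agent...
print(rp.can_fetch("*", "http://444.docs4you.at/portalnode/uups.php"))  # False
# ...while the rest of the site stays fetchable.
print(rp.can_fetch("*", "http://444.docs4you.at/index.php"))            # True
```

So the file itself looks fine; the question is why phpDig still ends up in there.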
04-23-2004, 05:29 AM | #4 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
Seems like I read somewhere in the forums that if you list the complete path to your starting file, e.g.
http://444.docs4you.at/portalnode/index.php (or whatever the file is called), indexing will work for you. You most likely have indexing locked right now, so you'll have to unlock it; search the forum for how to do that if you don't know how. Also, make sure the LIMIT_DAYS parameter in config.php is set in a way that will let you re-spider your site now. Good luck, and let us know how it goes. |
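For reference, the setting I mean is the LIMIT_DAYS constant in config.php. Setting it to 0 should let you re-spider pages right away (reading 0 as "no minimum age" is my interpretation, so double-check it against the comments in your phpDig version):

```php
// config.php: minimum age, in days, before an already-indexed page
// is spidered again. 0 removes the waiting period so the site can
// be dug again immediately.
define('LIMIT_DAYS', 0);
```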
04-23-2004, 05:31 AM | #5 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
Oops! That's what I get for being in a hurry. I'm getting ready to leave for work right now. You wanted to exclude that directory, not include it.
Are you trying to index the whole site, and it's hanging? I'm not sure what is happening here. |
04-24-2004, 02:56 AM | #6 |
Green Mole
Join Date: Apr 2004
Posts: 4
|
Yep, the whole site. I don't understand a few points here:
- how it gets a link to /portalnode/uups.php
- why it keeps indexing that file against all the exclude rules
- why it ignores the rest of the site
- why it hangs
Any idea? Andreas |
04-24-2004, 07:31 AM | #7 | ||
Purple Mole
Join Date: Jan 2004
Posts: 694
|
I downloaded your zip file and took a look at your screen shot (nothing amiss there as far as I can see) and your spidering log. Just before the link that was being spidered multiple times is this link:
Quote:
|
04-24-2004, 07:45 AM | #8 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
I meant to include this in my last post and forgot: I am somewhat concerned about that 404 error right at the start of your spider log.
This doesn't have anything to do with phpDig per se, but I found some free link-checker software that you might want to use on your site. Just a word of caution, though: it consumes bandwidth on your site about like phpDig does, so you wouldn't want to run it every single day. The free version is called REL Checker Lite, and you can download it here. |
04-24-2004, 01:33 PM | #9 |
Green Mole
Join Date: Apr 2004
Posts: 4
|
Ja, right, sorry about that. The site is in some respects not finished yet; nevertheless it should already be searchable.
I hope that doesn't affect phpDig in any way. I'm not troubled if phpDig doesn't index a page that doesn't exist; it seems difficult enough to get the existing pages indexed ;-) !

Some additional info about the site: every page has several modes of appearance, controlled by the S parameter. I intended to hide these apparently duplicate pages from phpDig by dynamically adding the line

<meta name="robots" content="noindex,nofollow,none">

if and only if an S parameter is passed to the page. So only the plain pages (without any S parameter) should be indexed; they carry the line

<meta name="robots" content="index,follow">

And even if phpDig hangs in one branch, why doesn't it finish spidering the other branches of the site? And why does its behavior (the number of pages successfully indexed) change every time I dig the site? Still confused ... are my assumptions in the initial posting correct?

And the main point is: portalnode/ is EXCLUDED in the DB and in robots.txt, and the URL of uups.php lies on that path. Which precautions do I have to take on such pages so that phpDig spiders the rest of the site that is not explicitly excluded?

Greets from Vienna, Andreas

Last edited by Andreas_Wien; 04-24-2004 at 01:42 PM. |
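The rule I implemented boils down to this (sketched in Python rather than the PHP the site actually runs, and robots_meta is just an illustrative name, not a real function on my site):

```python
def robots_meta(params: dict) -> str:
    # Any request carrying an S parameter is one of the duplicate
    # "modes of appearance" and should be hidden from robots.
    if "S" in params:
        return '<meta name="robots" content="noindex,nofollow,none">'
    # The plain page, without an S parameter, is the one to index.
    return '<meta name="robots" content="index,follow">'

print(robots_meta({"S": "2"}))
print(robots_meta({}))
```

Is that scheme compatible with how phpDig decides what to follow?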
04-25-2004, 09:41 AM | #10 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
OK, here's my take on what you're saying, and it's not based on my knowledge of the phpDig code itself. Rather, it's based on what I see in the spider logs, so my assumptions may or may not be correct.
phpDig obeys robots.txt (that much we know), but it still has to visit a page to find out whether it contains a robots exclusion, assuming it didn't already find one in the robots.txt file. If a page has some kind of problem, like the one I pointed out, that could cause phpDig to go into some kind of loop; exactly how or why that happens, I wouldn't know. I hope you understand where I'm coming from with this.

What I'm saying is basically this: if phpDig has to visit a page, there had better not be any errors in it. If there are, it could throw phpDig into a tailspin and cause it not to spider everything you think it should. My suggestion would be to either fix the page or use the include/exclude comments in the page(s) that link to the problem document, so that phpDig will not attempt to spider it. |
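For example, in the page that contains the link to the problem document, you could fence the link off with the exclude comments (using the marker names from this thread; uups.php stands in for whatever the actual file is):

```html
<!-- phpdigExclude -->
<a href="/portalnode/uups.php">link phpDig should not follow</a>
<!-- phpdigInclude -->
```

That way phpDig never has to visit the broken page at all.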
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
Url part after & is ignored on spider (1.8.9 RC1 and earliers) | obottek | Bug Tracker | 1 | 08-24-2006 05:52 AM |
excludes, includes and other tables | Macedonian | How-to Forum | 0 | 01-22-2005 08:05 PM |
Parse error with includes | cherie | Troubleshooting | 1 | 12-14-2004 01:34 PM |
Apache includes in a template | sgreen | Mod Submissions | 0 | 06-17-2004 04:44 PM |
Excludes not working although properly set | jbafaure | Troubleshooting | 8 | 05-02-2004 09:46 AM |