PhpDig.net

Old 07-29-2004, 07:49 PM   #1
jinkas
Green Mole
 
Join Date: Jul 2004
Posts: 8
PhpDig "clipping" links while spidering

OK, I consider myself moderately skilled at PHP, but this is something I just don't understand. As PhpDig spiders my site, it looks for links that are clipped versions of links that are already there. (This additional processing really slows the script down.) I have attached the results from the most recent spidering so that you can all see and maybe help. Unfortunately, this is still a test site, and for security reasons it is open only to employees of the place where I work until we solve some authorization issues (in other words, you can't look at the code to see why this is happening). However, I can assure you that the links PhpDig is trying to follow appear neither in the source code nor in the generated HTML (the entire site is dynamic).

Anyway, on with the problem...

In the txt file (all references are to the txt file), the first error of this kind shows up in the first two 404 errors after spidered page #3. http://uuu.cae.wisc.edu/si does not exist, but it is a part of uuu.cae.wisc.edu/site, which is the entryway into the rest of the site. Similar errors appear in the last two 404 errors of spidered page #3 (should be /wikiutils/), the first two 404s of page #5 (again, should be /site/), the first two 404s of page #7 (should be /site/public/), the 404 of page #11 (should be .php), the 404s in the middle of page #15 (should be /help/h****uts/, not /help/han), and in many, many other places. In fact, in the final results more than 50 clipped links were "found." (It is a Wiki-based system, and every page that doesn't exist gives you a dynamically generated error page offering to help you create the requested page as a new page.)
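One way to confirm that a 404 really is a clipped link, rather than some unrelated broken link, is to check whether it is a proper prefix of a URL that was spidered successfully. The following is a minimal sketch in Python, not PhpDig code. The function name `find_clipped`, the sample lists, and the `title=Main` page are assumptions made up for illustration; the real lists would have to be pulled out of the spider log by hand or by a separate script.

```python
def find_clipped(known_good, not_found):
    """Map each 404 URL to the valid URLs it is a proper prefix of."""
    clipped = {}
    for bad in not_found:
        matches = [good for good in known_good
                   if good.startswith(bad) and good != bad]
        if matches:
            clipped[bad] = matches
    return clipped


# Hypothetical sample data modeled on the URLs described in the post.
known_good = [
    "http://uuu.cae.wisc.edu/site/",
    "http://uuu.cae.wisc.edu/site/public/index.php?title=Main",
]
not_found = [
    "http://uuu.cae.wisc.edu/si",
    "http://uuu.cae.wisc.edu/site/pub",
    "http://uuu.cae.wisc.edu/missing.html",
]

result = find_clipped(known_good, not_found)
for bad, goods in result.items():
    print(bad, "looks like a clipped form of", goods)
```

A 404 that is not a prefix of any good URL (such as the made-up `missing.html` above) is left out of the result, so ordinary broken links are not misreported as clipping.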

I know that I've been a little verbose, but the final site will contain 8000+ pages and I would like to be able to squash this error. I just can't figure it out! Could someone please help me? Thank you so much!

-jinkas

P.S. - I cut a chunk out of the middle of the file to make it the right size for uploading. You can see at the end that the clipping seems to happen with greater and greater frequency (every 404 from at least page 201 to 297 is caused by this link clipping).

P.P.S. - It doesn't seem like the link clipping causes PhpDig to skip real links; all the real links seem to be spidered. It just makes it go much slower.

Last edited by jinkas; 07-29-2004 at 07:52 PM.
Old 07-29-2004, 07:53 PM   #2
jinkas
Green Mole
 
Join Date: Jul 2004
Posts: 8
Attachment

Here's the attachment...I accidentally deleted it from the original post

(Well, OK, I deleted it on purpose, not knowing that I couldn't reupload it.)
Attached Files
File Type: txt phpdig_errors.txt (97.8 KB, 27 views)
Old 07-30-2004, 12:12 PM   #3
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Check this thread, although a bit dated, and edit the code indicated so that PhpDig finds the links that match your regular expression. Perhaps take out the space and parentheses, as those tend to form links from JavaScript even though they aren't real links.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 07-30-2004, 01:07 PM   #4
jinkas
Green Mole
 
Join Date: Jul 2004
Posts: 8
Thanks. Sorry I was so verbose. I'll give this a try at work on Monday and let you know how it works out.

-jinkas
Old 08-02-2004, 03:53 AM   #5
jinkas
Green Mole
 
Join Date: Jul 2004
Posts: 8
OK, I was never very good with regular expressions... I've tried adding my link format, but I just can't seem to get it. Could I get a hand?

Links on my site are of the form:
http://host/site/section/index.php?title=filename

-Host is uuu.cae.wisc.edu for now, but will shortly be changing to www.cae.wisc.edu
-Section can be any number of things, right now the only options are "public" and "admin" (will grow to an indefinite number)
-Filename can be any number of things, right now there are ~100 test files on the site (will grow to 8000+)
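A minimal sketch of a pattern for this link format, written in Python purely for illustration. It is not the expression PhpDig uses, and PHP's eregi takes POSIX syntax, so the named groups here would need adapting. The character sets allowed for the section and the title are assumptions; in particular, the title is assumed to contain no spaces or parentheses.

```python
import re

LINK_RE = re.compile(
    r"^http://(?:uuu|www)\.cae\.wisc\.edu"        # current or future host
    r"/site/(?P<section>[A-Za-z0-9_-]+)"          # e.g. public, admin (assumed charset)
    r"/index\.php\?title=(?P<title>[^&#\s()]+)$"  # page name (assumed: no spaces or parentheses)
)


def parse_link(url):
    """Return (section, title) for a well-formed site link, else None."""
    m = LINK_RE.match(url)
    return (m.group("section"), m.group("title")) if m else None


# Hypothetical page names, used only to exercise the pattern.
print(parse_link("http://uuu.cae.wisc.edu/site/public/index.php?title=Home"))
print(parse_link("http://uuu.cae.wisc.edu/si"))
```

Because the pattern is anchored at both ends, a clipped URL such as http://uuu.cae.wisc.edu/si does not match, while a full link on either host does.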

Thanks for your help, guys! I really appreciate this!

-jinkas

P.S. - I don't even know if changing those eregi calls will work, since PhpDig isn't skipping over any links. It finds all the links on a page, but it also finds approximately 2-6 nonexistent links, because it clips some number of characters off the end of some links.
Old 08-02-2004, 04:04 AM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Personally, I don't think the latest 1.8.3 version of PhpDig is 'clipping' links. PhpDig can try to follow non-existent links, but this is likely due to JavaScript. Some people want spaces and parentheses allowed in their links, so JavaScript can then come into play. An earlier 1.8.3 version didn't quite handle chunked encoding, so links in that release did get mangled. Perhaps the issue you are experiencing has something to do with this earlier version of 1.8.3, but to be sure, take a read through this thread.
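To illustrate why undecoded chunked transfer encoding can produce clipped links: in a chunked HTTP/1.1 response, each chunk is preceded by a hexadecimal size line, and a chunk boundary can fall in the middle of a URL. The sketch below is in Python and is not PhpDig's code; the helper names and the sample HTML are assumptions made up for the example, and chunk extensions and trailers are ignored.

```python
def encode_chunked(parts):
    """Build a chunked body from a list of byte strings (for the demo only)."""
    out = b""
    for part in parts:
        out += b"%x\r\n" % len(part) + part + b"\r\n"
    return out + b"0\r\n\r\n"


def decode_chunked(raw):
    """Decode an HTTP/1.1 chunked body (ignores extensions and trailers)."""
    body = bytearray()
    pos = 0
    while True:
        line_end = raw.index(b"\r\n", pos)
        size = int(raw[pos:line_end].split(b";", 1)[0].strip(), 16)
        if size == 0:
            break
        start = line_end + 2
        body += raw[start:start + size]
        pos = start + size + 2  # skip the CRLF that ends the chunk data
    return bytes(body)


# Hypothetical response: the chunk boundary falls inside the link.
raw = encode_chunked([b'<a href="/site/p', b'ublic/index.php">Home</a>'])
decoded = decode_chunked(raw)

print(b"/site/public/" in raw)      # the raw body has framing inside the URL
print(b"/site/public/" in decoded)  # the decoded body has the whole link
```

A spider that parses the raw body would see the link end at `/site/p`, which is the kind of truncation described in this thread; decoding the chunks first restores the full URL.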
All times are GMT -8.