|
06-14-2004, 11:27 PM | #1 |
Green Mole
Join Date: Jun 2004
Posts: 3
|
spidering problem
I'm installing phpdig and I really like it... but I know I'm doing something wrong here. When I go to the admin interface and enter the URL I want to spider, it only indexes the index page and goes no further. I am using 5 or 6 for the search depth.
First: it only seems to be spidering the index. It doesn't index the pages that are linked from the index.
Second: I wonder whether the problem is that my site is itself a subdirectory: http://depts.washington.edu/vei/
Third: My site has several dynamically generated PHP pages (a course catalog). The spider is definitely not indexing these. These pages are about three links down from the index. Almost all the pages on my site are at least partially dynamic. They all have a dynamic nav bar which lives in a different directory.
Thanks for any help you can give. Nathan
__________________
Nathan |
06-15-2004, 12:55 AM | #2 |
Purple Mole
Join Date: Dec 2003
Posts: 106
|
Hi Nathan - welcome to the site.
If you search around the forums here (troubleshooting particularly), you will find many answers to this question. Many of them are titled "no links in temporary table", "links found: 0" and similar. Once you've read all of the previous answers, if it still doesn't work, we'll be happy to tackle the new problem. Good luck.
__________________
Foundmyself.com artist community, art galleries |
06-15-2004, 11:24 PM | #3 |
Green Mole
Join Date: Jun 2004
Posts: 3
|
Hi, thank you for the prompt response. I'm sorry to keep bugging you about something which I know has been answered millions of times on here, but I just can't seem to find a response which helps me. I spent about 10 hours fiddling with this and reading posts yesterday and today. So, I'll tell you what I've done.
First of all, I'm running:
    mysql 4.0.15
    PHP version 4.2.0
    AIX 4.3.3
The site I'm working on is: http://depts.washington.edu/vei/
One of the first things that I tried was creating a robots.txt file. That didn't work, and I found another post suggesting to erase the robots.txt file, so I did that tonight. I also replaced this line
    $user_agent = $regs[1];
with
    if ($regs[1] == "*") { $user_agent = "'$regs[1]'"; } else { $user_agent = $regs[1]; }
as another post suggested. I also tried indexing http://www.php.net as a test, and it didn't work. I also found lots of other posts talking about this same problem, suggesting changes to the php.ini file, but none of those applied, as I already had the correct settings in php.ini. So... I'm sure there's some simple answer that I'm just not getting, so I appreciate the help.
When I tried to index php.net, this is the result I got:
    SITE : http://www.php.net/
    Exclude paths : - '*' - @NONE@
    HTTP/1.1 200 OK
    Date: Wed, 16 Jun 2004 07:15:59 GMT
    Server: Apache/1.3.26 (Unix) mod_gzip/1.3.26.1a PHP/4.3.3-dev
    X-Powered-By: PHP/4.3.3-dev
    Last-Modified: Wed, 16 Jun 2004 07:11:02 GMT
    Content-language: en
    Set-Cookie: COUNTRY=USA%2C140.142.16.139; expires=Wed, 23-Jun-04 07:15:59 GMT; path=/; domain=.php.net
    Connection: close
    Content-Type: text/html;charset=ISO-8859-1
    1:http://www.php.net/\1/ (time : 00:00:31)
    No link in temporary table
    --------------------------------------------------------------------------------
    links found : 1
    http://www.php.net/\1/
    Optimizing tables...
    Indexing complete !
When I try to index my site (http://depts.washington.edu/vei/index.php), I get the same result:
    SITE : http://depts.washington.edu/
    Exclude paths : - '*' - @NONE@
    HTTP/1.1 200 OK
    Date: Wed, 16 Jun 2004 07:23:26 GMT
    Server: Apache/1.3.29 (Unix) mod_pubcookie/a5/1.77.2.4 mod_uwa/2.2 Resin/2.1.8 mod_fastcgi/2.2.12 mod_ssl/2.8.16 OpenSSL/0.9.7a
    X-Powered-By: PHP/4.2.0
    Content-Type: text/html
    1:http://depts.washington.edu/vei/ (time : 00:00:09)
    No link in temporary table
    --------------------------------------------------------------------------------
    links found : 1
    http://depts.washington.edu/vei/
    Optimizing tables...
    Indexing complete !
Thanks again. Nathan
__________________
Nathan |
06-17-2004, 03:25 PM | #4 |
Green Mole
Join Date: Jun 2004
Posts: 3
|
Hi... I figured my problem out, and it wasn't something that I saw in the newsgroup, so I'll leave it here in the hopes that someday it will help someone else.
I figured this out by just plain old debugging the code, so there may be a more direct way to do this, but this is what worked for me. What was happening was that the function phpdigExplore was not returning the URLs contained within my page. The reason was that I have magic_quotes = On, so the quotes in the fetched page were backslash-escaped and the eregi call was failing to match the links.
So before the line:
    while (eregi("(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([[a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*([:%/?=&;\\,._a-zA-Z0-9\|+-]*))(#[.a-zA-Z0-9-]*)?[\'\" ]?",$eval,$regs)) {
I added this line of code:
    $eval = stripslashes($eval);
Hope this helps someone. Nathan |
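For anyone who wants to see the effect in isolation, here is a minimal sketch of why escaped quotes break link extraction and why stripslashes() repairs it. It is not phpDig code, and several things in it are assumptions made for illustration: extractLinks() and the sample HTML are hypothetical, the pattern is a heavily simplified stand-in for the one in phpdigExplore, preg_match_all() replaces eregi() (which no longer exists in PHP 7 and later), and addslashes() imitates what magic quotes did to the fetched page, since magic quotes themselves are gone from current PHP.

```php
<?php
// Hypothetical helper, NOT part of phpDig: collects href targets using a
// deliberately simplified pattern (optional quote, then the URL characters).
function extractLinks(string $html): array
{
    preg_match_all('/href\s*=\s*[\'"]?([^\'" >]+)/i', $html, $matches);
    return $matches[1];
}

// Made-up sample page content.
$html = '<a href="catalog.php?id=3">Catalog</a>';

// addslashes() stands in for magic quotes: every " becomes \"
$escaped = addslashes($html);

// With the backslash sitting between = and ", the optional quote matches
// nothing and the capture stops at the quote, so only "\" is "found".
$broken = extractLinks($escaped);

// Undoing the escaping first restores the real link.
$fixed = extractLinks(stripslashes($escaped));

print_r($broken); // a single element containing one backslash
print_r($fixed);  // a single element: catalog.php?id=3
```

The bogus backslash in the first result looks a lot like the stray `\` in the `http://www.php.net/\1/` line of the spider output above, though that connection is a guess from the log rather than something traced through phpDig itself.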