Long post, possible issues, and maybe the data to solve some?

Paul D. Buck · 02-02-2005, 07:40 AM

Ok, I seem to have phpDig installed and operational. But here are some of my observations based upon the last couple of days' experience.

=============
The documentation is weak. About 80% of it is dedicated to installation, with some on configuration, but there are only a few paragraphs on how to operate the software. Looking through the forums, there are a lot of questions that seem to be repeated, and they are all related to "How do I ...". Putting more of this into the documentation would be a great step forward.

Most tellingly, people are only going to be installing it once, updating it on occasion, but operating it every day ...

=============
Much of the configuration information is explained only to the extent that a person already familiar with your tool will understand it.

Picking a section, just for example:
// regexp forbidden extensions - return sometimes text/html mime-type !!!
define('FORBIDDEN_EXTENSIONS','\.(rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$');

The BANNED constant means to ban external links in index, meaning that those links do not show up as keys in search results. The FORBIDDEN_EXTENSIONS constant means to ban certain links from being indexed. Don't let the name fool you. A regex can be set in the FORBIDDEN_EXTENSIONS constant to ban various types of links from even being indexed. Again, BANNED is to ban keys from search results, and FORBIDDEN_EXTENSIONS is to ban the index of links.

This explanation does not tell me how to change the setting or, in the case of FORBIDDEN_EXTENSIONS, how to build what appears at first glance to be a regular expression. If it is one, then the way that it works should be explained.
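For readers with the same question, here is a minimal sketch of how a pattern like the one quoted above behaves when read as a regular expression. It is written in Python purely for illustration. The case-insensitive matching and the `add_extension` helper are assumptions of this sketch; the thread does not show how phpDig itself applies the constant.

```python
import re

# Pattern copied from the FORBIDDEN_EXTENSIONS line quoted above.
FORBIDDEN_EXTENSIONS = (
    r"\.(rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|"
    r"arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$"
)


def is_forbidden(url: str, pattern: str = FORBIDDEN_EXTENSIONS) -> bool:
    """Return True when the URL ends in one of the listed extensions.

    "\\." matches a literal dot, "(a|b|c)" matches any one alternative,
    "r[0-9]+" matches split-archive parts such as .r01, and "$" anchors
    the match to the end of the URL.
    Assumption: matching is case-insensitive.
    """
    return re.search(pattern, url, re.IGNORECASE) is not None


def add_extension(pattern: str, ext: str) -> str:
    """Hypothetical helper: append one more extension to the alternation."""
    return pattern.replace("bz2)$", "bz2|" + re.escape(ext) + ")$")


if __name__ == "__main__":
    print(is_forbidden("http://example.com/files/setup.exe"))
    print(is_forbidden("http://example.com/index.php"))
    with_pdf = add_extension(FORBIDDEN_EXTENSIONS, "pdf")
    print(is_forbidden("http://example.com/manual.pdf", with_pdf))
```

Under these assumptions, changing the setting amounts to adding or removing alternatives between the parentheses, separated by `|`.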

================
In paragraph 6.2 you mention that you can use the "No way" to remove a site, but then don't explain that it can be/must be added back with a rescan and this is the way that you do that. I don't recall when I accidentally clicked one if I got a "Are you Sure?" or not.

==============
Also in 6.3 "Clean common words - deletes words that appear in the common_words.txt file." does not have advice on how and when this file can be/should be updated, and if you do update it what words should be added in ...

+++++++++++++++++++++++
Unanswered questions.

These that follow are a combination of guess work, database snooping, and conjecture about the engine, how it works and what is going on, and why the spider may not be indexing my site correctly. Most of this is presented as questions I had hoped would have been answered in the documentation ...

Site description:
Paul's Web Site
Top level consists of one "Index" page
there are "../" and "../../" levels below with almost all content residing in the "../../" level with some additional indexes in the "../" levels

The site consists of about 250 individual pages by file name along with, now, 3 PHP tools in their own structures (phpMyFAQ, phpMyAdmin, and phpDig) in addition to the base content.

The internal link structure is a huge tangle of back-and-forth links. Depending on the link tester, I have 2,000 to 19,000 links; the difference is between unique links and link references. I have on the order of 150 broken links for material that is being added. Most pages validate to W3C "Strict", with about 75-100 failing validation at this time.

every page links to the top of the site ...

===========
I see "exclude Paths" when I start the spider. How do I set that? and what does it do?

===========
The system starts to spider and then freezes. I am using Firefox, and I get anywhere from 10-20 pages parsed before the freeze. Letting it sit for long periods does not seem to help.

===========
Does it hurt anything if I just stop the spider in the middle of things? are my pages still correctly processed if I do that?

===========
How can I tell if one of those time out errors occurs?

===========
I see a list of words that I can "purge" should I add more words?
Does adding words make my database less useful for searches?

===========
How do I get my site reindexed after I have made the first pass?

===========
Should I run the delete processes before restarting the indexing?

===========
You mention a parameter that will prevent "early" reindexing, what value is it set to and how can I change it?

===========
Most of the time the spider does not complete. Is that a bad thing? I only see it finish a spider and then listing the pages completed occasionally.

===========
On the update page you say depth trumps links, does that mean values 0, 0 will do my entire site?

===========
I did a spider crawl on my site, then I repeated it from the index page, I indexed some pages elsewhere in the site. Then I restarted from the top and some of the pages that it told me it had successfully indexed were re-indexed.

Since the processes do not stop cleanly I am confused about whether or not my site is correctly indexed. I stop the spider cleanly, but this still does not make sense to me.

============
My site is very recursive, yet I understood the indexing function was supposed to have a delay factor, which I assumed was: define('LIMIT_DAYS',0); this implies that I can reindex at will. Yet, there are times when the pages are happily reindexed, and other times when they are not. I cannot figure out the pattern ...

Over time, my update form grows to contain the additional pages, and the search function does seem to find the correct pages, but the behavior does not seem to match the documentation.

I would assume that a 0, 0, no setting would prevent exiting the page to enter another, but this does not seem to be the case.

===========
what does "No link in temporary table" mean?
Is this a good thing? Or a bad thing?

if good "No links in the temporary table, Yea!" if bad, we need troubleshooting tips

============
There does not seem to be a clear explanation of how I can do a reindex with the tool only reindexing the pages that need it based on the MD5 values. Of course, I am assuming that you made these numbers for this purpose.

===========
What does: "Duplicate of an existing document" Mean?
doing 3 pages at a time with "yes" 0 0

1:http://boinc-doc.net/site-boinc/boin...oject-list.php
(time : 00:00:06)
2:http://boinc-doc.net/site-boinc/oman...p-menu-gen.php
(time : 00:00:12)
3:http://boinc-doc.net/site-boinc/oman...-menu-help.php
(time : 00:00:19)
4:http://boinc-doc.net/site-boinc/oman...-menu-file.php
(time : 00:00:25)
5:http://boinc-doc.net/index.php
(time : 00:00:32)
6:http://boinc-doc.net/site-boinc/oman-app/app-intro.php
(time : 00:00:38)
Duplicate of an existing document
7:http://boinc-doc.net/site-boinc/oman...pp-install.php
(time : 00:00:44)
8:http://boinc-doc.net/site-boinc/oman-app/app-icons.php
(time : 00:00:50)
No link in temporary table

====================
settings 0, 0, yes both runs
Pass with 3 new pages:

Spidering in progress... [Stop spider]
SITE : http://boinc-doc.net/
Exclude paths :
- @NONE@
1:http://boinc-doc.net/site-boinc/oman...-menu-help.php
(time : 00:00:07)
2:http://boinc-doc.net/site-boinc/oman...p-menu-gen.php
(time : 00:00:13)
3:http://boinc-doc.net/site-boinc/oman...pp-install.php
(time : 00:00:19)
4:http://boinc-doc.net/site-boinc/oman-app/app-icons.php
(time : 00:00:26)
5:http://boinc-doc.net/index.php
(time : 00:00:32)
6:http://boinc-doc.net/site-boinc/oman-app/app-intro.php
(time : 00:00:38)
7:http://boinc-doc.net/site-boinc/boin...oject-list.php
(time : 00:00:43)
8:http://boinc-doc.net/site-boinc/oman...saver-cpdn.php
(time : 00:00:49)
9:http://boinc-doc.net/site-boinc/oman...-cpdn-full.php
(time : 00:00:55)
10:http://boinc-doc.net/site-boinc/oman-app/app-over.php
(time : 00:01:02)
11:http://boinc-doc.net/site-boinc/oman...-menu-file.php
(time : 00:01:07)
Duplicate of an existing document
12:http://boinc-doc.net/site-boinc/oman...pp-msg-gen.php
(time : 00:01:13)
13:http://boinc-doc.net/site-boinc/oman...u-settings.php
(time : 00:01:19)
14:http://boinc-doc.net/site-boinc/oman...menu-popup.php
(time : 00:01:25)
No link in temporary table
links found : 14
http://boinc-doc.net/site-boinc/oman...-menu-help.php
http://boinc-doc.net/site-boinc/oman...p-menu-gen.php
http://boinc-doc.net/site-boinc/oman...pp-install.php
http://boinc-doc.net/site-boinc/oman-app/app-icons.php
http://boinc-doc.net/index.php
http://boinc-doc.net/site-boinc/oman-app/app-intro.php
http://boinc-doc.net/site-boinc/boin...oject-list.php
http://boinc-doc.net/site-boinc/oman...saver-cpdn.php
http://boinc-doc.net/site-boinc/oman...-cpdn-full.php
http://boinc-doc.net/site-boinc/oman-app/app-over.php
http://boinc-doc.net/site-boinc/oman...-menu-file.php
http://boinc-doc.net/site-boinc/oman...pp-msg-gen.php
http://boinc-doc.net/site-boinc/oman...u-settings.php
http://boinc-doc.net/site-boinc/oman...menu-popup.php
Optimizing tables...
Indexing complete ! [Back]

------------
Next pass with 3 new pages:

Spidering in progress... [Stop spider]
SITE : http://boinc-doc.net/
Exclude paths :
- @NONE@
1:http://boinc-doc.net/site-boinc/oman...u-settings.php
(time : 00:00:07)
2:http://boinc-doc.net/site-boinc/oman...pp-msg-gen.php
(time : 00:00:12)
3:http://boinc-doc.net/site-boinc/oman-app/app-icons.php
(time : 00:00:18)
4:http://boinc-doc.net/site-boinc/oman...p-menu-gen.php
(time : 00:00:24)
5:http://boinc-doc.net/site-boinc/oman...-menu-help.php
(time : 00:00:30)
6:http://boinc-doc.net/site-boinc/oman...-menu-file.php
(time : 00:00:36)
7:http://boinc-doc.net/index.php
(time : 00:00:42)
Duplicate of an existing document
8:http://boinc-doc.net/site-boinc/oman-app/app-intro.php
(time : 00:00:47)
9:http://boinc-doc.net/site-boinc/oman...pp-install.php
(time : 00:00:54)
10:http://boinc-doc.net/site-boinc/boin...oject-list.php
(time : 00:01:00)
11:http://boinc-doc.net/site-boinc/oman...r-old-seti.php
(time : 00:01:07)
12:http://boinc-doc.net/site-boinc/oman...-saver-lhc.php
(time : 00:01:13)
13:http://boinc-doc.net/site-boinc/oman...r-lhc-full.php
(time : 00:01:18)
14:http://boinc-doc.net/site-boinc/oman...menu-popup.php
(time : 00:01:24)
15:http://boinc-doc.net/site-boinc/oman...saver-cpdn.php
(time : 00:01:30)
Duplicate of an existing document
16:http://boinc-doc.net/site-boinc/oman...-cpdn-full.php
(time : 00:01:36)
17:http://boinc-doc.net/site-boinc/oman-app/app-over.php
(time : 00:01:42)
No link in temporary table
links found : 17
http://boinc-doc.net/site-boinc/oman...u-settings.php
http://boinc-doc.net/site-boinc/oman...pp-msg-gen.php
http://boinc-doc.net/site-boinc/oman-app/app-icons.php
http://boinc-doc.net/site-boinc/oman...p-menu-gen.php
http://boinc-doc.net/site-boinc/oman...-menu-help.php
http://boinc-doc.net/site-boinc/oman...-menu-file.php
http://boinc-doc.net/index.php
http://boinc-doc.net/site-boinc/oman-app/app-intro.php
http://boinc-doc.net/site-boinc/oman...pp-install.php
http://boinc-doc.net/site-boinc/boin...oject-list.php
http://boinc-doc.net/site-boinc/oman...r-old-seti.php
http://boinc-doc.net/site-boinc/oman...-saver-lhc.php
http://boinc-doc.net/site-boinc/oman...r-lhc-full.php
http://boinc-doc.net/site-boinc/oman...menu-popup.php
http://boinc-doc.net/site-boinc/oman...saver-cpdn.php
http://boinc-doc.net/site-boinc/oman...-cpdn-full.php
http://boinc-doc.net/site-boinc/oman-app/app-over.php
Optimizing tables...
Indexing complete ! [Back]

================
Removing the "-" char as suggested for allowing pop-up to be searched on seems to work, but the display of the "text" in the search dialog still has the "-" missing, so, the database and the source pages have "Pop-Up" and searches for Pop-Up work, but the quoted material contains "pop up" ...

================
Trying to find the error 403 led me to change the code slightly, and this is the output:

url: HEAD /site-boinc/boinc-projects/ HTTP/1.1 Host: boinc-doc.net Cookie: PHPSESSID=6963c6e4d54cc72fc71505cd795854be Cookie: PageCount=0 Cookie: ProjectAbbr=predictor Accept: */* Accept-Charset: iso-8859-1 Accept-Encoding: identity Connection: close User-Agent: PhpDig/1.8.7 (+http://www.phpdig.net/robot.php)

HTTP/1.1 403 - http://boinc-doc.net/site-boinc/boinc-projects//
See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html for explanation.
url: HEAD /site-boinc/boinc-projects/ HTTP/1.1 Host: boinc-doc.net Cookie: PHPSESSID=6963c6e4d54cc72fc71505cd795854be Cookie: PageCount=0 Cookie: ProjectAbbr=predictor Accept: */* Accept-Charset: iso-8859-1 Accept-Encoding: identity Connection: close User-Agent: PhpDig/1.8.7 (+http://www.phpdig.net/robot.php)

Code change was in robot_functions.php:

elseif ($regs[1] == 403) {
echo "url: " . $request . "\r\n <br />\r\n";
print "<br>\n".$answer." - ".$url."<br>\nSee http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html for explanation.<br>\n";

It seems to me that the spider is trying to locate the cookie(?).

Paul D. Buck · 02-02-2005, 07:41 AM

=================
Could this "dynamic" behavior I have been seeing be because the pages I have are highly dynamic and constantly change parts of the content with some factors controlled by a randomly generated value?

Is the "spider" crawler identifiable? In other words, when it asks for a page, can I detect that it is the spider and not a normal query?

File name + title is unique identifier, on my site in almost all cases ....

Example: page file name is "account-data.php"

with a page title of:
"Account Data" Page - ClimatePrediction.net - Web Site Owner's Manual
"Account Data" Page - Einstein@Home - Web Site Owner's Manual
"Account Data" Page - LHC@Home - Web Site Owner's Manual
"Account Data" Page - Predictor@Home - Web Site Owner's Manual
"Account Data" Page - SETI@Home - Web Site Owner's Manual

Would changing the Primary key to be filename + file title be a better index. I know that in most cases on simpler sites the numbers of entries would be smaller as there would be no true difference in the pages indexed. In my case, I would be getting 4-6 times the number of pages, but NOW with this I would be able to track the pages that are missing/need indexing, and would not be reindexing the same pages in error.

This would obviously mean changes to the "Modify page" also.

Oh, one other positive thing would be that the MD5 values could be used as they are intended to see if the page is truly different.

=================
I don't know the significance of this, but monitoring the tempspider table showed that most of my "freezes" occurred when the spider had made 30 entries in the table and about 18 were flagged as "1" (indexed). I know that the table can grow beyond this because I once saw it up to 168 pages ... also frozen. I let the spider continue to run with no changes observed. Of course, if you are not doing immediate commits then more work could have been pending, but after 30 minutes it seemed to be time to quit.

Re-running the analysis may or may not have frozen at the same point/page.

Charter · 02-02-2005, 03:35 PM

Q: I see "exclude Paths" when I start the spider. How do I set that? and what does it do?

A: Use a robots.txt file or exclude content from the admin panel. It excludes content from index.

Q: How can I tell if one of those time out errors occurs?

A: Check that safe_mode is off, review your server error logs, or ask your host if the process is killed.

Q: Does it hurt anything if I just stop the spider in the middle of things? are my pages still correctly processed if I do that?

A: No, not if you use the stop spider link. Yes, for documents already indexed.

Q: I see a list of words that I can "purge" should I add more words? Does adding words make my database less useful for searches?

A: Add more words if you want. It depends on the words you add.

Q: How do I get my site reindexed after I have made the first pass?

A: Use the admin panel text box or spider from shell.

Q: Should I run the delete processes before restarting the indexing?

A: If you want, but it is not necessary.

Q: You mention a parameter that will prevent "early" reindexing, what value is it set to and how can I change it?

A: It is set to zero. Look for define('LIMIT_DAYS',0); in the config file, or set revisit-after META tags.

Q: On the update page you say depth trumps links, does that mean values 0, 0 will do my entire site?

A: No, set LIMIT_TO_DIRECTORY to false, choose no, set a large search depth, set links per to zero.

Q: what does "No link in temporary table" mean? Is this a good thing? Or a bad thing?

A: The tempspider table is empty. It is good.

Q: What does: "Duplicate of an existing document" Mean?

A: The document looks like an already indexed document.

Q: Could this "dynamic" behavior I have been seeing be because the pages I have are highly dynamic and constantly change parts of the content with some factors controlled by a randomly generated value?

A: No, the values in the update sites table are currently being used.

Q: Is the "spider" crawler identifiable? In other words, when it asks for a page, can I detect that it is the spider and not a normal query?

A: Yes, review your server access logs when indexing your site.

Q: Would changing the Primary key to be filename + file title be a better index.

A: No, primary keys must be unique.

Paul D. Buck · 02-03-2005, 08:17 AM

Quote:

Originally Posted by Charter

Q: I see "exclude Paths" when I start the spider. How do I set that? and what does it do?

A: Use a robots.txt file or exclude content from the admin panel. It excludes content from index.

I am looking at the v1.8.7 admin panel and there is no option to exclude from the site. I only have one site listed, and listed as locked now. Which is a new issue, what the heck is locked, how did it get locked and how do I unlock it.

Hmm, found a way to unlock it ...

The reason that I posted all of these questions, by the way, is that if these are answered and then inserted into the documentation as part of the operational section, it would save new people from having to try to figure this stuff out on their own.

Quote:

Originally Posted by Charter

Q: How can I tell if one of those time out errors occurs?

A: Check that safe_mode is off, review your server error logs, or ask your host if the process is killed.

Safe mode is off. I am not sure I can do either one of the others. But I will look into it

Quote:

Originally Posted by Charter

Q: Does it hurt anything if I just stop the spider in the middle of things? are my pages still correctly processed if I do that?

A: No, not if you use the stop spider link. Yes, for documents already indexed.

I do use the stop spider link and it makes several passes before it says it is done.

And over time I do *seem* to get an indexed site.

Quote:

Originally Posted by Charter

Q: I see a list of words that I can "purge" should I add more words? Does adding words make my database less useful for searches?

A: Add more words if you want. It depends on the words you add.

Quote:

Originally Posted by Charter

Q: How do I get my site reindexed after I have made the first pass?

A: Use the admin panel text box or spider from shell.

If I select the top end and do a re-index, is it supposed to locate the pages that have been updated and index those? I re-arranged a folder, dropping some pages and adding others, but it does not look like it is really finding the new pages. Well, I will play with it some more ...

Quote:

Originally Posted by Charter

Q: Should I run the delete processes before restarting the indexing?

A: If you want, but it is not necessary.

Ok, I ran them anyway.

It would be nice to have a fuller explanation of what each process does and what its intent is ...

Quote:

Originally Posted by Charter

Q: You mention a parameter that will prevent "early" reindexing, what value is it set to and how can I change it?

A: It is set to zero. Look for define('LIMIT_DAYS',0); in the config file, or set revisit-after META tags.

So, if limit days is 0, I can reindex at will?
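As a way of pinning down what is being asked here, the following is a minimal sketch of how a revisit-delay setting of this kind would typically behave. It is an illustration only, not phpDig's code: the `may_reindex` function and its arguments are hypothetical, and only the `LIMIT_DAYS` name and its value of 0 come from this thread.

```python
from datetime import datetime, timedelta

LIMIT_DAYS = 0  # value quoted from the config file in this thread


def may_reindex(last_indexed: datetime, now: datetime,
                limit_days: int = LIMIT_DAYS) -> bool:
    """Hypothetical check: allow a reindex once limit_days have elapsed.

    With limit_days == 0 the elapsed time is always enough, so under
    this reading a page could be reindexed at any time.
    """
    return now - last_indexed >= timedelta(days=limit_days)


if __name__ == "__main__":
    indexed = datetime(2005, 2, 1, 8, 0)
    print(may_reindex(indexed, datetime(2005, 2, 1, 8, 5)))
    print(may_reindex(indexed, datetime(2005, 2, 4, 8, 0), limit_days=7))
```

If the real check works this way, a value of 0 never blocks a reindex, so pages being skipped on some runs would have to come from something else, such as the duplicate detection discussed elsewhere in the thread.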

Quote:

Originally Posted by Charter

Q: On the update page you say depth trumps links, does that mean values 0, 0 will do my entire site?

A: No, set LIMIT_TO_DIRECTORY to false, choose no, set a large search depth, set links per to zero.

Ok, I will try this, but I think I already did and it did not work, well I will try again.

Quote:

Originally Posted by Charter

Q: what does "No link in temporary table" mean? Is this a good thing? Or a bad thing?

A: The tempspider table is empty. It is good.

This is one of those that SHOULD be in your document. I saw while mousing about here that this is a common question.

Quote:

Q: What does: "Duplicate of an existing document" Mean?

A: The document looks like an already indexed document.

Ok, so it is more of a warning that it is passing over the document.

Quote:

Originally Posted by Charter

Q: Could this "dynamic" behavior I have been seeing be because the pages I have are highly dynamic and constantly change parts of the content with some factors controlled by a randomly generated value?

A: No, the values in the update sites table are currently being used.

I think we talked past each other on this one and the last one. I will pass till then.

Quote:

Originally Posted by Charter

Q: Is the "spider" crawler identifiable? In other words, when it asks for a page, can I detect that it is the spider and not a normal query?

A: Yes, review your server access logs when indexing your site.

I was thinking about catching it at the page when it is called, maybe to have the page make itself more static but that will miss content.
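Catching the spider at the page is possible in principle, because the request dump earlier in the thread shows it sending `User-Agent: PhpDig/1.8.7 (+http://www.phpdig.net/robot.php)`. A minimal sketch of such a check follows, in Python for illustration. The `is_phpdig_request` function and the plain dictionary of headers are assumptions of this sketch, and a User-Agent header can be changed or spoofed, so this is a hint rather than proof.

```python
def is_phpdig_request(headers: dict) -> bool:
    """Return True when the User-Agent header starts with "PhpDig/".

    Header names are compared case-insensitively, since HTTP header
    names are not case-sensitive.
    """
    for name, value in headers.items():
        if name.lower() == "user-agent":
            return value.strip().startswith("PhpDig/")
    return False


if __name__ == "__main__":
    spider = {
        "Host": "boinc-doc.net",
        "User-Agent": "PhpDig/1.8.7 (+http://www.phpdig.net/robot.php)",
    }
    browser = {"Host": "boinc-doc.net", "user-agent": "Mozilla/5.0"}
    print(is_phpdig_request(spider))
    print(is_phpdig_request(browser))
```

A page could use a check like this to serve a more stable rendering to the spider, with the trade-off already noted above that some content would then never be indexed.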

Quote:

Originally Posted by Charter

Q: Would changing the Primary key to be filename + file title be a better index.

A: No, primary keys must be unique.

Yes, my point exactly. I have a page, say "profile-edit.php", this page is called by the viewer in any one of 6 different renderings. The page has content that is driven by the Project that the person is interested in. So, you go to that page and ask for a page titled: "Profile Edit - SETI@Home ..." as profile-edit.php; whereas I ask for "Profile Edit - LHC@Home ..." as profile-edit.php ... we both pull the same page, but get different content based on the project we wish to see.

This means that the one FILE name is NOT unique in the way that it is viewed. You cannot fully index the page as only one version, because some of the content is never delivered in your view and other content is never delivered in mine.

With the concatenated key of page FILE NAME and PAGE TITLE, my pages are now uniquely identifiable in the tables.

++++++++++++++
New questions
Q. Would adding the project as a passed parameter make the page table entries based on that identifier?

Q. I have tried settings to force a single page only look so I could feed in a list of pages so I could just drop the site and then feed in the list of pages. But I have not been able to get that to work reliably.

Q. If I have a static page, how do I run phpDig so that it only indexes pages that have been detected as changed? Or is the MD5 signature stored for some other purpose?

Charter · 02-03-2005, 12:59 PM

Q. Would adding the project as a passed parameter make the page table entries based on that identifier?

A: The query string is stored as part of the filename except for any pieces removed by PHPDIG_SESSID_VAR.
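To make that answer concrete, here is a minimal sketch of the idea of keeping the query string in the stored name while dropping session identifiers. It is Python for illustration and not phpDig's code. The `page_key` function, the `project` parameter name, and the use of `PHPSESSID` as the session variable are assumptions; the thread does not state what PHPDIG_SESSID_VAR is set to.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumption: PHPSESSID (seen in the cookie dump earlier in the thread)
# stands in for whatever PHPDIG_SESSID_VAR names.
SESSION_VARS = {"PHPSESSID"}


def page_key(url: str) -> str:
    """Build a key from the path plus the query string, minus session vars."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in SESSION_VARS
    ]
    query = urlencode(kept)
    return parts.path + ("?" + query if query else "")


if __name__ == "__main__":
    print(page_key("http://example.com/profile-edit.php?project=seti&PHPSESSID=abc"))
    print(page_key("http://example.com/profile-edit.php?project=lhc"))
```

Under these assumptions, passing the project as a parameter gives each rendering of the same file its own entry, while the session ID does not create spurious extra entries.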

Q. I have tried settings to force a single page only look so I could feed in a list of pages so I could just drop the site and then feed in the list of pages. But I have not been able to get that to work reliably.

A: Set LIMIT_TO_DIRECTORY to false, choose no, set search depth to one or more, and set links per to zero.

Q. If I have a static page, how do I run phpDig so that it only indexes pages that have been detected as changed? Or is the MD5 signature stored for some other purpose?

A: Use the code here to create a list of recently changed files. The $md5 variable is for detecting duplicates.
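The duplicate-detection use of an MD5 value can be sketched as follows. This is a Python illustration of the general technique, not phpDig's implementation. The `find_duplicates` function and the sample page bodies are hypothetical, and what text phpDig actually hashes is not stated in this thread.

```python
import hashlib


def digest(text: str) -> str:
    """Return the MD5 hex digest of the page text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def find_duplicates(pages: dict) -> dict:
    """Map each duplicate URL to the first URL seen with the same digest."""
    seen = {}        # digest -> first URL that produced it
    duplicates = {}  # duplicate URL -> original URL
    for url, body in pages.items():
        key = digest(body)
        if key in seen:
            duplicates[url] = seen[key]
        else:
            seen[key] = url
    return duplicates


if __name__ == "__main__":
    pages = {
        "/index.php": "Welcome",
        "/": "Welcome",
        "/app-intro.php": "Introduction",
    }
    print(find_duplicates(pages))
```

One consequence for highly dynamic pages: if any part of the hashed text changes between requests, the digest changes too, so the same page may be flagged as a duplicate on one pass and treated as new on the next.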

02-02-2005, 07:40 AM	#1
Paul D. Buck Green Mole Join Date: Jan 2005 Location: Sacramento Posts: 8	Long post, possible issues, and maybe the data to solve some? Ok, I seem to have phpDig installed and operational. But here are some of my observations based upon the last couple of days experience. ============= The documentation is weak. About 80% of it is dedicated to the installation and some on configuration, but there is only a few paragraphs on how to operate the software. Looking through the forums there are a lot of questions that seem to be repeated and they are all related to "How do I ...". Putting more of this into the documentation would be a great step forward. Most tellingly, people are only going to to be installing it once, updating it on occasion, but operating it every day ... ============= Much of the configuration information is explained to the extent that a person already familiar with your tool will understand the explanation. Picking a section, just for example: // regexp forbidden extensions - return sometimes text/html mime-type !!! define('FORBIDDEN_EXTENSIONS','\.(rm\|ico\|cab\|swf\|css\|gz\|z\|tar\|zip\|tgz\|msi\| arj\|zoo\|rar\|r[0-9]+\|exe\|bin\|pkg\|rpm\|deb\|bz2)$'); The BANNED constant means to ban external links in index, meaning that those links do not show up as keys in search results. The FORBIDDEN_EXTENSIONS constant means to ban certain links from being indexed. Don't let the name fool you. A regex can be set in the FORBIDDEN_EXTENSIONS constant to ban various types of links from even being indexed. Again, BANNED is to ban keys from search results, and FORBIDDEN_EXTENSIONS is to ban the index of links. This explanation does not tell me how to change this setting, or, in the first one how to create what appears at first glance as a regular expression. If this is true, then the way that it works should be explained. 
================ In paragraph 6.2 you mention that you can use the "No way" to remove a site, but then don't explain that it can be/must be added back with a rescan and this is the way that you do that. I don't recall when I accidentally clicked one if I got a "Are you Sure?" or not. ============== Also in 6.3 "Clean common words - deletes words that appear in the common_words.txt file." does not have advice on how and when this file can be/should be updated, and if you do update it what words should be added in ... +++++++++++++++++++++++ Unanswered questions. These that follow are a combination of guess work, database snooping, and conjecture about the engine, how it works and what is going on, and why the spider may not be indexing my site correctly. Most of this is presented as questions I had hoped would have been answered in the documentation ... Site description: Paul's Web Site Top level consists of one "Index" page there are "../" and "../../" levels below with almost all content residing in the "../../" level with some additional indexes in the "../" levels The site consists of about 250 individual pages by file name along with, now, 3 PHP tools in their own structures, pgpMyFAQ, phpMyAdmin, and phpDig in addition to the base content. The internal link structure is a huge tangle of back and forth links, depending on the link tester I have 2,000 to 19,000 links, the difference between unique links and link references. I have on the order of 150 broken links for material that is being added. Most pages validate to W3C "Strict" with about 75-100 that are failing validation at this time. every page links to the top of the site ... =========== I see "exclude Paths" when I start the spider. How do I set that? and what does it do? =========== The system starts to spider and then freezes, I am using FireFox, but I get anywhere from 10-20 pages parsed before I get the freeze. Letting it sit for long periods does not seem to work. 
=========== Does it hurt anything if I just stop the spider in the middle of things? are my pages still correctly processed if I do that? =========== How can I tell if one of those time out errors occurs? =========== I see a list of words that I can "purge" should I add more words? Does adding words make my database less useful for searches? =========== How do I get my site reindexed after I have made the first pass? =========== Should I run the delete processes before I restarting the indexing? =========== You mention a parameter that will prevent "early" reindexing, what value is it set to and how can I change it? =========== Most of the time the spider does not complete. Is that a bad thing? I only see it finish a spider and then listing the pages completed occasionally. =========== On the update page you say depth trumps links, does that mean values 0, 0 will do my entire site? =========== I did a spider crawl on my site, then I repeated it from the index page, I indexed some pages elsewhere in the site. Then I restarted from the top and some of the pages that it told me it had successfully indexed were re-indexed. Since the processes do not stop cleanly I am confused about whether or not my site is correctly indexed. I stop the spider cleanly, but this still does not make sense to me. ============ My site is very recursive, yet I understood the indexing function was supposed to have a delay factor, which I assumed was: define('LIMIT_DAYS',0); this implies that I can reindex at will. Yet, there are times when the pages are happily reindexed, and other times when they are not. I cannot figure out the pattern ... Over time, my update form grows to contain the additional pages, and the search function does seem to find the correct pages, but the behavior does not seem to match the documentation. I would assume that a 0, 0, no setting would prevent exiting the page to enter another, but this does not seem to be the case. 
===========
What does "No link in temporary table" mean? Is this a good thing or a bad thing? If it is good, the message could say so ("No links in the temporary table, Yea!"); if it is bad, we need troubleshooting tips.

============
There does not seem to be a clear explanation of how I can do a reindex where the tool only reindexes the pages that need it, based on the MD5 values. Of course, I am assuming that you made these numbers for this purpose.

===========
What does "Duplicate of an existing document" mean?

Doing 3 pages at a time with "yes", 0, 0:

1:http://boinc-doc.net/site-boinc/boin...oject-list.php (time : 00:00:06)
2:http://boinc-doc.net/site-boinc/oman...p-menu-gen.php (time : 00:00:12)
3:http://boinc-doc.net/site-boinc/oman...-menu-help.php (time : 00:00:19)
4:http://boinc-doc.net/site-boinc/oman...-menu-file.php (time : 00:00:25)
5:http://boinc-doc.net/index.php (time : 00:00:32)
6:http://boinc-doc.net/site-boinc/oman-app/app-intro.php (time : 00:00:38)
Duplicate of an existing document
7:http://boinc-doc.net/site-boinc/oman...pp-install.php (time : 00:00:44)
8:http://boinc-doc.net/site-boinc/oman-app/app-icons.php (time : 00:00:50)
No link in temporary table

====================
Settings 0, 0, yes for both runs.

Pass with 3 new pages:

Spidering in progress...
[Stop spider]
SITE : http://boinc-doc.net/
Exclude paths :
- @NONE@
1:http://boinc-doc.net/site-boinc/oman...-menu-help.php (time : 00:00:07)
2:http://boinc-doc.net/site-boinc/oman...p-menu-gen.php (time : 00:00:13)
3:http://boinc-doc.net/site-boinc/oman...pp-install.php (time : 00:00:19)
4:http://boinc-doc.net/site-boinc/oman-app/app-icons.php (time : 00:00:26)
5:http://boinc-doc.net/index.php (time : 00:00:32)
6:http://boinc-doc.net/site-boinc/oman-app/app-intro.php (time : 00:00:38)
7:http://boinc-doc.net/site-boinc/boin...oject-list.php (time : 00:00:43)
8:http://boinc-doc.net/site-boinc/oman...saver-cpdn.php (time : 00:00:49)
9:http://boinc-doc.net/site-boinc/oman...-cpdn-full.php (time : 00:00:55)
10:http://boinc-doc.net/site-boinc/oman-app/app-over.php (time : 00:01:02)
11:http://boinc-doc.net/site-boinc/oman...-menu-file.php (time : 00:01:07)
Duplicate of an existing document
12:http://boinc-doc.net/site-boinc/oman...pp-msg-gen.php (time : 00:01:13)
13:http://boinc-doc.net/site-boinc/oman...u-settings.php (time : 00:01:19)
14:http://boinc-doc.net/site-boinc/oman...menu-popup.php (time : 00:01:25)
No link in temporary table
links found : 14
http://boinc-doc.net/site-boinc/oman...-menu-help.php
http://boinc-doc.net/site-boinc/oman...p-menu-gen.php
http://boinc-doc.net/site-boinc/oman...pp-install.php
http://boinc-doc.net/site-boinc/oman-app/app-icons.php
http://boinc-doc.net/index.php
http://boinc-doc.net/site-boinc/oman-app/app-intro.php
http://boinc-doc.net/site-boinc/boin...oject-list.php
http://boinc-doc.net/site-boinc/oman...saver-cpdn.php
http://boinc-doc.net/site-boinc/oman...-cpdn-full.php
http://boinc-doc.net/site-boinc/oman-app/app-over.php
http://boinc-doc.net/site-boinc/oman...-menu-file.php
http://boinc-doc.net/site-boinc/oman...pp-msg-gen.php
http://boinc-doc.net/site-boinc/oman...u-settings.php
http://boinc-doc.net/site-boinc/oman...menu-popup.php
Optimizing tables...
Indexing complete !
[Back]

------------
Next pass with 3 new pages:

Spidering in progress...
[Stop spider]
SITE : http://boinc-doc.net/
Exclude paths :
- @NONE@
1:http://boinc-doc.net/site-boinc/oman...u-settings.php (time : 00:00:07)
2:http://boinc-doc.net/site-boinc/oman...pp-msg-gen.php (time : 00:00:12)
3:http://boinc-doc.net/site-boinc/oman-app/app-icons.php (time : 00:00:18)
4:http://boinc-doc.net/site-boinc/oman...p-menu-gen.php (time : 00:00:24)
5:http://boinc-doc.net/site-boinc/oman...-menu-help.php (time : 00:00:30)
6:http://boinc-doc.net/site-boinc/oman...-menu-file.php (time : 00:00:36)
7:http://boinc-doc.net/index.php (time : 00:00:42)
Duplicate of an existing document
8:http://boinc-doc.net/site-boinc/oman-app/app-intro.php (time : 00:00:47)
9:http://boinc-doc.net/site-boinc/oman...pp-install.php (time : 00:00:54)
10:http://boinc-doc.net/site-boinc/boin...oject-list.php (time : 00:01:00)
11:http://boinc-doc.net/site-boinc/oman...r-old-seti.php (time : 00:01:07)
12:http://boinc-doc.net/site-boinc/oman...-saver-lhc.php (time : 00:01:13)
13:http://boinc-doc.net/site-boinc/oman...r-lhc-full.php (time : 00:01:18)
14:http://boinc-doc.net/site-boinc/oman...menu-popup.php (time : 00:01:24)
15:http://boinc-doc.net/site-boinc/oman...saver-cpdn.php (time : 00:01:30)
Duplicate of an existing document
16:http://boinc-doc.net/site-boinc/oman...-cpdn-full.php (time : 00:01:36)
17:http://boinc-doc.net/site-boinc/oman-app/app-over.php (time : 00:01:42)
No link in temporary table
links found : 17
http://boinc-doc.net/site-boinc/oman...u-settings.php
http://boinc-doc.net/site-boinc/oman...pp-msg-gen.php
http://boinc-doc.net/site-boinc/oman-app/app-icons.php
http://boinc-doc.net/site-boinc/oman...p-menu-gen.php
http://boinc-doc.net/site-boinc/oman...-menu-help.php
http://boinc-doc.net/site-boinc/oman...-menu-file.php
http://boinc-doc.net/index.php
http://boinc-doc.net/site-boinc/oman-app/app-intro.php
http://boinc-doc.net/site-boinc/oman...pp-install.php
http://boinc-doc.net/site-boinc/boin...oject-list.php
http://boinc-doc.net/site-boinc/oman...r-old-seti.php
http://boinc-doc.net/site-boinc/oman...-saver-lhc.php
http://boinc-doc.net/site-boinc/oman...r-lhc-full.php
http://boinc-doc.net/site-boinc/oman...menu-popup.php
http://boinc-doc.net/site-boinc/oman...saver-cpdn.php
http://boinc-doc.net/site-boinc/oman...-cpdn-full.php
http://boinc-doc.net/site-boinc/oman-app/app-over.php
Optimizing tables...
Indexing complete !
[Back]

================
Removing the "-" char, as suggested for allowing pop-up to be searched on, seems to work, but the display of the "text" in the search dialog still has the "-" missing. So the database and the source pages have "Pop-Up", and searches for Pop-Up work, but the quoted material contains "pop up" ...

================
Trying to track down the 403 error led me to change the code slightly, and this is the output:

url: HEAD /site-boinc/boinc-projects/ HTTP/1.1
Host: boinc-doc.net
Cookie: PHPSESSID=6963c6e4d54cc72fc71505cd795854be
Cookie: PageCount=0
Cookie: ProjectAbbr=predictor
Accept: /
Accept-Charset: iso-8859-1
Accept-Encoding: identity
Connection: close
User-Agent: PhpDig/1.8.7 (+http://www.phpdig.net/robot.php)

HTTP/1.1 403 - http://boinc-doc.net/site-boinc/boinc-projects//
See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html for explanation.

url: HEAD /site-boinc/boinc-projects/ HTTP/1.1
Host: boinc-doc.net
Cookie: PHPSESSID=6963c6e4d54cc72fc71505cd795854be
Cookie: PageCount=0
Cookie: ProjectAbbr=predictor
Accept: /
Accept-Charset: iso-8859-1
Accept-Encoding: identity
Connection: close
User-Agent: PhpDig/1.8.7 (+http://www.phpdig.net/robot.php)

The code change was in robot_functions.php:

elseif ($regs[1] == 403) {
    echo "url: " . $request . "\r\n <br />\r\n";
    print "<br>\n".$answer." - ".$url."<br>\nSee http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html for explanation.<br>\n";

To me this suggests that the spider is trying to locate the cookie(?).
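
One way to see exactly which pages differ between two spider passes is to diff the numbered progress lines. The following is only a rough sketch in Python, run outside phpDig; the function names are made up for illustration, and the sample text is a handful of progress lines copied from the passes quoted in this post, not the full output.

```python
import re

# Matches spider progress lines such as
# "5:http://boinc-doc.net/index.php (time : 00:00:32)".
# Status lines ("Duplicate of an existing document", "No link in
# temporary table") do not match and are skipped.
PROGRESS_LINE = re.compile(r"^\s*(\d+):(\S+)\s+\(time : (\d{2}:\d{2}:\d{2})\)\s*$")


def parse_pass(output):
    """Return the URLs from the numbered progress lines, in order."""
    urls = []
    for line in output.splitlines():
        match = PROGRESS_LINE.match(line)
        if match:
            urls.append(match.group(2))
    return urls


def compare_passes(first, second):
    """Report which URLs appear in only one of two spider passes."""
    first_urls = set(parse_pass(first))
    second_urls = set(parse_pass(second))
    return {
        "only_in_first": sorted(first_urls - second_urls),
        "only_in_second": sorted(second_urls - first_urls),
    }


# A few lines excerpted from two of the passes quoted in this post.
first_pass = """\
5:http://boinc-doc.net/index.php (time : 00:00:32)
6:http://boinc-doc.net/site-boinc/oman-app/app-intro.php (time : 00:00:38)
Duplicate of an existing document
8:http://boinc-doc.net/site-boinc/oman-app/app-icons.php (time : 00:00:50)
No link in temporary table
"""
second_pass = """\
3:http://boinc-doc.net/site-boinc/oman-app/app-icons.php (time : 00:00:18)
7:http://boinc-doc.net/index.php (time : 00:00:42)
8:http://boinc-doc.net/site-boinc/oman-app/app-intro.php (time : 00:00:47)
17:http://boinc-doc.net/site-boinc/oman-app/app-over.php (time : 00:01:42)
"""

result = compare_passes(first_pass, second_pass)
print(result["only_in_second"])
```

Pasting the complete output of each pass into the two strings would list every page that one pass reached and the other did not.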

02-02-2005, 07:41 AM	#2
Paul D. Buck · Green Mole · Join Date: Jan 2005 · Location: Sacramento · Posts: 8

continued

=================
Could this "dynamic" behavior I have been seeing be because the pages I have are highly dynamic and constantly change parts of the content, with some factors controlled by a randomly generated value?

Is the "spider" crawler identifiable? In other words, when it asks for a page, can I detect that it is the spider and not a normal query?

File name + title is a unique identifier on my site in almost all cases. Example: the page file name is "account-data.php", with page titles of:

"Account Data" Page - ClimatePrediction.net - Web Site Owner's Manual
"Account Data" Page - Einstein@Home - Web Site Owner's Manual
"Account Data" Page - LHC@Home - Web Site Owner's Manual
"Account Data" Page - Predictor@Home - Web Site Owner's Manual
"Account Data" Page - SETI@Home - Web Site Owner's Manual

Would changing the primary key to be filename + file title give a better index? I know that in most cases on simpler sites the number of entries would be smaller, as there would be no true difference in the pages indexed. In my case, I would be getting 4-6 times the number of pages, but with this I would be able to track the pages that are missing or need indexing, and would not be reindexing the same pages in error. This would obviously mean changes to the "Modify page" also. Oh, one other positive thing would be that the MD5 values could be used as they are intended, to see if the page is truly different.

=================
I don't know the significance of this, but monitoring the tempspider table showed that most of my "freezes" occurred when the spider had made 30 entries in the table and about 18 were flagged as "1" (indexed). I know that the table can grow beyond this because I once saw it up to 168 pages ... also frozen.

I let the spider continue to run with no changes observed. Of course, if you are not doing immediate commits then more work could have been pending, but after 30 minutes it seemed to be time to quit. Re-running the analysis may or may not have frozen at the same point/page.
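
To make the filename + title idea above concrete, here is a minimal sketch in Python. It is not phpDig's schema or code: the in-memory dictionary, the function names, and the page bodies are all hypothetical, and it only illustrates keying pages on (file name, title) and comparing a stored MD5 digest to decide whether a page needs reindexing.

```python
import hashlib


def page_key(filename, title):
    """Hypothetical composite key: file name plus page title."""
    return (filename, title)


def needs_reindex(index, filename, title, content):
    """Return True when the page is new or its content digest changed.

    `index` maps composite keys to the MD5 hex digest recorded at the
    previous crawl; it is updated in place.
    """
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    key = page_key(filename, title)
    changed = index.get(key) != digest
    index[key] = digest
    return changed


index = {}
seti = '"Account Data" Page - SETI@Home - Web Site Owner\'s Manual'
lhc = '"Account Data" Page - LHC@Home - Web Site Owner\'s Manual'

# Same file name, different titles: tracked as two separate entries.
first_seti = needs_reindex(index, "account-data.php", seti, "SETI body")
first_lhc = needs_reindex(index, "account-data.php", lhc, "LHC body")
# Unchanged content on a later crawl: no reindex needed.
repeat_seti = needs_reindex(index, "account-data.php", seti, "SETI body")
# Changed content: reindex.
edited_lhc = needs_reindex(index, "account-data.php", lhc, "LHC body, edited")

print(first_seti, first_lhc, repeat_seti, edited_lhc)  # True True False True
```

One caveat with any digest comparison like this: a page whose body contains a randomly generated element produces a different digest on every fetch, so it would always look changed.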

02-02-2005, 03:35 PM	#3
Charter · Head Mole · Join Date: May 2003 · Posts: 2,539

Q: I see "Exclude paths" when I start the spider. How do I set that, and what does it do?
A: Use a robots.txt file or exclude content from the admin panel. It excludes content from the index.

Q: How can I tell if one of those time out errors occurs?
A: Check that safe_mode is off, review your server error logs, or ask your host if the process is killed.

Q: Does it hurt anything if I just stop the spider in the middle of things? Are my pages still correctly processed if I do that?
A: No, not if you use the stop spider link. Yes, for documents already indexed.

Q: I see a list of words that I can "purge". Should I add more words? Does adding words make my database less useful for searches?
A: Add more words if you want. It depends on the words you add.

Q: How do I get my site reindexed after I have made the first pass?
A: Use the admin panel text box or spider from shell.

Q: Should I run the delete processes before restarting the indexing?
A: If you want, but it is not necessary.

Q: You mention a parameter that will prevent "early" reindexing. What value is it set to, and how can I change it?
A: It is set to zero. Look for define('LIMIT_DAYS',0); in the config file, or set revisit-after META tags.

Q: On the update page you say depth trumps links. Does that mean values 0, 0 will do my entire site?
A: No, set LIMIT_TO_DIRECTORY to false, choose no, set a large search depth, set links per to zero.

Q: What does "No link in temporary table" mean? Is this a good thing or a bad thing?
A: The tempspider table is empty. It is good.

Q: What does "Duplicate of an existing document" mean?
A: The document looks like an already indexed document.

Q: Could this "dynamic" behavior I have been seeing be because the pages I have are highly dynamic and constantly change parts of the content, with some factors controlled by a randomly generated value?
A: No, the values in the update sites table are currently being used.

Q: Is the "spider" crawler identifiable? In other words, when it asks for a page, can I detect that it is the spider and not a normal query?
A: Yes, review your server access logs when indexing your site.

Q: Would changing the primary key to be filename + file title be a better index?
A: No, primary keys must be unique.

__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
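
On identifying the spider: the request dump earlier in the thread shows it sending "User-Agent: PhpDig/1.8.7 (+http://www.phpdig.net/robot.php)", so one way to pick its requests out of an access log is to match on that header. The sketch below is in Python and rests on assumptions: the log is taken to be in the Apache "combined" format, the function names are invented here, and the two sample log lines (addresses, timestamps, sizes) are made up for illustration only.

```python
import re

# Assumption: Apache "combined" log format, where each line ends with
# "request" status bytes "referer" "user-agent".
COMBINED_TAIL = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$'
)


def is_phpdig(user_agent):
    """True if the User-Agent looks like the phpDig spider's."""
    return user_agent.startswith("PhpDig/")


def spider_requests(log_lines):
    """Return (request line, status code) for each spider request."""
    hits = []
    for line in log_lines:
        match = COMBINED_TAIL.search(line)
        if match and is_phpdig(match.group("agent")):
            hits.append((match.group("request"), int(match.group("status"))))
    return hits


# Made-up sample lines; only the User-Agent and the request path of the
# first line come from the thread.
sample_log = [
    '192.0.2.10 - - [02/Feb/2005:07:12:01 -0800] '
    '"HEAD /site-boinc/boinc-projects/ HTTP/1.1" 403 - "-" '
    '"PhpDig/1.8.7 (+http://www.phpdig.net/robot.php)"',
    '192.0.2.20 - - [02/Feb/2005:07:12:05 -0800] '
    '"GET /index.php HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

for request, status in spider_requests(sample_log):
    print(status, request)
```

The same prefix test could in principle be applied to the User-Agent header inside a page script, with the usual caveat that any client can send any User-Agent string.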

02-03-2005, 12:59 PM	#5
Charter · Head Mole · Join Date: May 2003 · Posts: 2,539

Q: Would adding the project as a passed parameter make the page table entries based on that identifier?
A: The query string is stored as part of the filename except for any pieces removed by PHPDIG_SESSID_VAR.

Q: I have tried settings to force a single-page-only look, so that I could just drop the site and then feed in a list of pages. But I have not been able to get that to work reliably.
A: Set LIMIT_TO_DIRECTORY to false, choose no, set search depth to one or more, and set links per to zero.

Q: If I have a static page, how do I run phpDig so that it only indexes pages that have been detected as changed? Or is the MD5 signature stored for some other purpose?
A: Use the code here to create a list of recently changed files. The $md5 variable is for detecting duplicates.
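
The code referred to in the last answer is not reproduced in this thread. As a stand-in, here is a rough Python sketch of the general idea, building a list of recently changed files from modification times. The function names, the ".php" suffix default, and the base URL handling are assumptions for illustration and are not taken from phpDig.

```python
import os
import time


def recently_changed(root, days, suffix=".php", now=None):
    """List files under `root` modified within the last `days` days.

    Paths are returned relative to `root`, sorted. `now` can be passed
    in (seconds since the epoch) to make the cutoff reproducible.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) >= cutoff:
                changed.append(os.path.relpath(path, root))
    return sorted(changed)


def to_urls(base_url, relative_paths):
    """Turn relative file paths into URLs under `base_url`."""
    base = base_url.rstrip("/")
    return [base + "/" + path.replace(os.sep, "/") for path in relative_paths]
```

For example, `to_urls("http://boinc-doc.net/", recently_changed("/path/to/docroot", 7))` would give the URLs of files touched in the last week, where the document root path is a placeholder. This only helps for static files: for pages generated on the fly, the file's modification time says nothing about whether the served content changed.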
