Hmm, we're not there yet.
The sites I'm crawling aren't mine, so I can't put robots.txt files on them.
Isn't there a function somewhere that says
'if the directory of the page you are thinking about indexing is a parent directory of the page you started at, leave it alone (or not, depending on a config variable)'?
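
To make the rule concrete, here is a rough sketch in Python of the kind of check I mean. The names `should_index`, `start_directory` and `stay_below_start` are made up for illustration and are not taken from any crawler. It assumes links have already been resolved to absolute URLs, and treating a different host as out of bounds is my own assumption.

```python
import posixpath
from urllib.parse import urlsplit


def start_directory(url):
    """Return the directory part of a URL path, always ending in '/'."""
    path = urlsplit(url).path or "/"
    # A path ending in '/' is already a directory; otherwise drop the file name.
    if not path.endswith("/"):
        path = posixpath.dirname(path)
    if not path.endswith("/"):
        path += "/"
    return path


def should_index(candidate_url, start_url, stay_below_start=True):
    """Hypothetical check: index only pages at or below the start directory.

    stay_below_start plays the role of the config variable; when it is
    False, the directory restriction is switched off entirely.
    """
    if not stay_below_start:
        return True
    candidate = urlsplit(candidate_url)
    start = urlsplit(start_url)
    # Assumption: a different host is never "below" the start page.
    if candidate.netloc != start.netloc:
        return False
    candidate_path = candidate.path or "/"
    # Pages in a parent (or sibling) directory do not share this prefix.
    return candidate_path.startswith(start_directory(start_url))
```

With a start page of `http://example.com/docs/guide/index.html`, this would accept `http://example.com/docs/guide/ch1.html` and reject `http://example.com/docs/index.html`, unless the flag is turned off.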
Thanks again,
Ciaran