PhpDig.net

Old 12-30-2004, 07:32 AM   #1
the_hut2
Green Mole
 
Join Date: Dec 2004
Posts: 3
Exclude filenames with certain attributes?

All installed and spidering nicely. However, the directory in which the content resides contains a mail archive of 8000 messages, each of which has its own .html file. It also contains an index for every 15 messages, and each index has its own .html file as well.

If the files are in the form:

http://www.mydomain.com/message001.html
http://www.mydomain.com/message002.html
http://www.mydomain.com/message003.html
etc

and

http://www.mydomain.com/index001.html
http://www.mydomain.com/index002.html

Is it possible to restrict the spider so that any file whose filename contains the characters "index" is ignored or, alternatively, so that it only spiders files whose names start with the characters "message"?

I had hoped that robots.txt would be the answer, but you cannot use wildcards to specify exclusions. Listing the path of every file beginning with "index" is not an option (there are over 150 of them), and in any case the content is updated every day.

Any help MUCH appreciated
Old 12-30-2004, 01:29 PM   #2
rAdoN
Green Mole
 
Join Date: Oct 2004
Posts: 27
It's set in the config. See:

http://www.phpdig.net/forum/showthread.php?t=1659
__________________
rAdoN was here
Old 12-30-2004, 04:00 PM   #3
the_hut2
Green Mole
 
Join Date: Dec 2004
Posts: 3
Thanks for your help, which is much appreciated, but I am still failing. Here is what I did:

I commented out the old FORBIDDEN_EXTENSIONS line and replaced it with

PHP Code:
define('FORBIDDEN_EXTENSIONS','(*index*|guestbook|\.(html|cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$)');
The difference between this and the line in your post in this thread: http://www.phpdig.net/forum/showthread.php?t=1659 is that I added *index* before the \ and html after it. I also deleted the space.

Then, in the robot_functions.php file, I changed the three instances of regs[5] around the !eregi(BANNED,$regs[5]) piece to regs[2].

However, the spider still indexes files such as index123.html.

Where am I going wrong?

Last edited by the_hut2; 12-30-2004 at 04:11 PM.
Old 12-30-2004, 04:22 PM   #4
rAdoN
Green Mole
 
Join Date: Oct 2004
Posts: 27
Like this:

PHP Code:
define('FORBIDDEN_EXTENSIONS','(index[0-9]+\.html|guestbook|\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$)'); 
Remove any space in the pattern.

Remove cgi|php|asp|pl| if you want to index links that end with those extensions.

In the admin, go to update, delete the links you don't want, run the cleans, edit FORBIDDEN_EXTENSIONS, then index again.
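
To check a pattern before re-spidering, something like the sketch below may help. It is only an illustration: is_forbidden() is a hypothetical helper, not a PhpDig function, and it uses preg_match() with the i flag as a stand-in for eregi(), which is no longer available in PHP 7 and later. The sample URLs are the ones from the first post.

```php
<?php
// Pattern copied from the define() above.
$forbidden = '(index[0-9]+\.html|guestbook|\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$)';

// Hypothetical helper (not part of PhpDig): true if the URL matches the pattern.
function is_forbidden(string $url, string $pattern): bool
{
    // '#' is used as the delimiter because the pattern does not contain it.
    // The i flag makes the match case-insensitive, as eregi() was.
    return preg_match('#' . $pattern . '#i', $url) === 1;
}

$urls = [
    'http://www.mydomain.com/message001.html',
    'http://www.mydomain.com/index001.html',
    'http://www.mydomain.com/index123.html',
];

foreach ($urls as $url) {
    echo $url, ' => ', is_forbidden($url, $forbidden) ? 'skipped' : 'indexed', PHP_EOL;
}
```

With this pattern the message file comes out as indexed and the two index files as skipped. Two things about the earlier attempt: in a regular expression * is a quantifier ("repeat the previous item"), not a filename wildcard, so *index* does not mean "contains index", and a plain index already matches anywhere in the link. Also, adding html to the extension list would exclude every .html file, including the messageNNN.html ones.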
Old 12-30-2004, 04:37 PM   #5
rAdoN
Green Mole
 
Join Date: Oct 2004
Posts: 27
Quote:
Originally Posted by the_hut2
Then, in the robot_functions.php file, I changed the three instances of regs[5] around the !eregi(BANNED,$regs[5]) piece to regs[2].
P.S. Not all three. Change only !eregi(BANNED,$regs[5]) to !eregi(BANNED,$regs[2]), and only if you want the words in BANNED checked against the whole link instead of just the domain.
Old 12-31-2004, 02:17 AM   #6
the_hut2
Green Mole
 
Join Date: Dec 2004
Posts: 3
Great! That works!

Thanks for your help.
All times are GMT -8.


Copyright © 2001 - 2005, ThinkDing LLC. All Rights Reserved.