Ban features
When indexing a huge list of sites, I may add the whole list without looking through it, simply because it is so large.
In the config:
- AutoBan certain domains, e.g. freeservers, Geocities and others.
- AutoBan universal directory names: links, banners, affiliates, forums, blog, webrings, etc. (not just for a certain site, but for all that are crawled). I understand that this is in PhpDig already, but I have yet to get it working.
- AutoBan certain domains by extension: .biz .info .this .that (if you run an English-only site this would be very helpful to weed out sites like .ru and others).
This would be a great feature to reduce the size of a bloated MySQL database. |
Set the words you want to ban in the BANNED constant in the config file.
|
If I want to use a . would it have to be escaped, as in about\.com, or can I just use about.com?
|
Basically a . matches any character whereas a \. matches a period.
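
To make the difference concrete, here is a minimal sketch. It is written in Python rather than PhpDig's PHP, and Python's `re` engine is not the same as the POSIX engine behind `eregi`, but `.` and `\.` behave the same way in both. The host names are made up for illustration; only `about.com` comes from the question above.

```python
import re

# Unescaped dot: "." matches ANY single character.
loose = re.compile(r"about.com", re.IGNORECASE)
# Escaped dot: "\." matches only a literal period.
strict = re.compile(r"about\.com", re.IGNORECASE)

# Hypothetical host names, for illustration only.
hosts = ["about.com", "aboutXcom.net", "about-com.org"]

loose_hits = [h for h in hosts if loose.search(h)]
strict_hits = [h for h in hosts if strict.search(h)]

print("unescaped:", loose_hits)   # all three hosts match
print("escaped:  ", strict_hits)  # only about.com matches
```

So `about.com` without the backslash still bans about.com, but it may also ban hosts you did not intend.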
|
Thanks Charter,
Your help greatly reduced the MySQL database by cutting out all the crud pages that I should not have allowed to be crawled. I also found an excellent tutorial at this place. It kept me from asking you 10 more questions about things I could look up but just wasn't sure where to look. php.net wasn't as helpful as it usually is. |
ok it kept me from asking you 9 more questions maybe......
http://www.horse-riding.net/cgi-bin/guestbook/book.cgi?url=anything&mode=show&refresh=yes
I added guestbook to BANNED so it would be ignored (it was a no-go for some reason). I also added cgi to FORBIDDEN_EXTENSIONS to have the cgi filetype ignored (it was a no-go also). This is after a fresh crawl with nothing else done. Code:
// regular expression to ban useless external links in index
The code box above made the geocities and tar file entries look weird with a space, but it's right in the config. Maybe it's my browser playing tricks on me. |
The word guestbook isn't part of the actual domain name. If you want to ban keywords based on the entire link, try $regs[2] instead of $regs[5] in the !eregi(BANNED,$regs[5]) piece in the robot_functions.php file. Also, the link doesn't end in cgi, so look up what $ means in that regexp tutorial you found and remove the $ from the FORBIDDEN_EXTENSIONS constant in the config.php file if you want.
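
Both points can be checked against the guestbook URL from the previous post. This is only a sketch in Python, not PhpDig's code. The patterns are hypothetical stand-ins for whatever is really in BANNED and FORBIDDEN_EXTENSIONS, and treating "host only" versus "whole link" as the difference between $regs[5] and $regs[2] is an assumption based on the description above.

```python
import re
from urllib.parse import urlsplit

url = ("http://www.horse-riding.net/cgi-bin/guestbook/book.cgi"
       "?url=anything&mode=show&refresh=yes")

# Hypothetical BANNED-style keyword pattern.
banned = re.compile(r"guestbook", re.IGNORECASE)

host = urlsplit(url).hostname            # host part only
host_match = bool(banned.search(host))   # keyword is not in the host
url_match = bool(banned.search(url))     # keyword is in the full link

# Hypothetical extension patterns: "$" anchors the match to the end of the string.
anchored = re.compile(r"\.cgi$", re.IGNORECASE)
unanchored = re.compile(r"\.cgi", re.IGNORECASE)
tighter = re.compile(r"\.cgi(\?|$)", re.IGNORECASE)  # .cgi at the end or before a query string

anchored_match = bool(anchored.search(url))      # link ends in "refresh=yes", not ".cgi"
unanchored_match = bool(unanchored.search(url))
tighter_match = bool(tighter.search(url))

print(host, host_match, url_match)
print(anchored_match, unanchored_match, tighter_match)
```

Simply dropping the `$` works, but it would also catch `.cgi` anywhere in a link. The `(\?|$)` variant is one possible way to keep it tied to the file extension.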
|
I tried $regs[2] as the replacement as suggested and it resulted in no change to the indexing. I completely understand that you shouldn't have to answer questions about alterations to your script. This is something I really need to work, if at all possible. Thank you for taking the time to answer my questions.
I've been reading a lot of PHP documentation online and I'm still left scratching my head. Is there possibly something else that needs to be changed? |
It needs the functionality to ban any URL that contains a directory/path name you wish to ban. This thread was started in, and intended for, the Mod Request part of the forums. This is a Mod Request, since PhpDig does not have this function.
I am still left with no solution and in the wrong part of the forums. Could you possibly move this back to its original placement? |
config does already - not mod request
!eregi(BANNED,$regs[2]) work for ban keywords
learn regex - use FORBIDDEN_EXTENSIONS
PHP Code:
|
Thanks rAdoN,
I'm sure that will help later when I get to filenames and types.
I wanted to ban certain PATHS, as in /links/ /guestbook/ /forum/ /cgi-bin/ /webring/ /affiliates/
There are a lot of common paths websites use that could be ignored, which would greatly reduce the size of the MySQL database. I would think that many people would find this a useful addition to PhpDig.
Your help was about FILENAMES and FILETYPES. That wasn't the question. A path is everything up to a file, not including the file. Path: the way to get from here to there. Not the destination. |
this already exist - not addition
PHP Code:
why you not want .cgi .php .asp .pl - dynamic pages
FORBIDDEN_EXTENSIONS can be more than extensions
config is for you to config - make regex you want |
www.domain.com/links/file.ext
Note that the part of the URL in bold (the /links/ directory) is the part I need banned. Not a file or an extension. A DIRECTORY leading to a file, also called the path. I need the directory called LINKS as a banned directory: if the spider gets to that directory, it doesn't follow it any further. I honestly do appreciate your help. An answer of any kind is better than no answer at all. |
awk - you not understand - make the FORBIDDEN_EXTENSIONS regex you want :bang:
PHP Code:
ps - what is not clear |
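
The code boxes in the posts above did not survive, so as a rough illustration of "make the regex you want", here is one way a directory-banning pattern could look. It is a Python sketch, not PhpDig code: `BANNED_DIRS`, `banned_path` and `is_banned` are hypothetical names, and the directory list is taken from the paths requested earlier in the thread. The same alternation could in principle be placed in a PhpDig config constant, but that is an assumption that has not been tested here.

```python
import re
from urllib.parse import urlsplit

# Hypothetical list of directory names to ban (from the posts above).
BANNED_DIRS = ["links", "guestbook", "forum", "cgi-bin", "webring", "affiliates"]

# Match a banned name only as a whole path segment:
# "/links/" matches, "/linksys/" does not.
banned_path = re.compile(
    r"/(" + "|".join(re.escape(d) for d in BANNED_DIRS) + r")/",
    re.IGNORECASE,
)

def is_banned(url: str) -> bool:
    """Return True if the URL's path passes through a banned directory."""
    path = urlsplit(url).path  # the query string is deliberately left out
    return bool(banned_path.search(path))

for candidate in [
    "http://www.domain.com/links/file.ext",
    "http://www.domain.com/linksys/router.html",
    "http://www.domain.com/cgi-bin/guestbook/book.cgi?url=anything",
    "http://www.domain.com/page.html?next=/links/",
]:
    print(candidate, is_banned(candidate))
```

The slashes on both sides are what turn a keyword into a directory name. A link that ends in the bare directory without a trailing slash (for example `/links`) would slip through this particular pattern.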
Quote:
Code:
User-agent: Phpdig |