PhpDig.net


Slider 12-19-2004 09:24 AM

Ban features
 
When indexing a huge list of sites, I may add the whole list without looking through it, simply because it is so large.

In the config:
- Auto-ban certain domains, e.g. freeservers, Geocities, and others.

- Auto-ban universal directory names: links, banners, affiliates, forums, blog, webrings, etc. (not just for a certain site, but for all sites that are crawled). I understand that this is in PhpDig already, but I have yet to get it working.

- Auto-ban certain domains by extension: .biz .info .this .that (if you run an English-only site, this would be very helpful for weeding out sites like .ru and others).

This would be a great feature for reducing the size of a bloated MySQL database.

Charter 12-19-2004 09:31 AM

Set the words you want to ban in the BANNED constant in the config file.

Slider 12-19-2004 11:51 AM

If I want to use a . would it have to be escaped, as in about\.com, or can I just use about.com?

Charter 12-19-2004 12:36 PM

Basically a . matches any character whereas a \. matches a period.
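
To make the difference concrete, here is a small sketch. It uses preg_match() with a case-insensitive flag as a stand-in for the eregi() call PhpDig relied on at the time (an assumption made so the snippet runs on current PHP), and the host names are made up for illustration.

```php
<?php
// Illustration only: preg_match() stands in for eregi(); the hosts are hypothetical.
$unescaped = '~about.com~i';   // "." matches any single character
$escaped   = '~about\.com~i';  // "\." matches a literal period only

$hosts = ['about.com', 'aboutxcom.net', 'roundabout-com.org'];
foreach ($hosts as $host) {
    printf(
        "%-20s unescaped=%d escaped=%d\n",
        $host,
        preg_match($unescaped, $host),
        preg_match($escaped, $host)
    );
}
```

Only the first host matches the escaped pattern, while all three match the unescaped one, so the unescaped form would ban more than intended.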

Slider 12-19-2004 01:26 PM

Thanks Charter

Your help greatly reduced the size of the MySQL database by cutting out all the crud pages that I should not have allowed to be crawled.
I also found an excellent tutorial at this place
It kept me from asking you 10 more questions about things I could look up but just wasn't sure where to look for. php.net wasn't as helpful as it usually is.

Slider 12-19-2004 01:45 PM

ok it kept me from asking you 9 more questions maybe......

http://www.horse-riding.net/cgi-bin/guestbook/book.cgi?url=anything&mode=show&refresh=yes

I added guestbook to BANNED to have it ignored (it was a no go for some reason).
I also edited FORBIDDEN_EXTENSIONS to have the cgi filetype ignored (it was a no go also).

This is after a fresh crawl with nothing else done.

Code:

// regular expression to ban useless external links in index
define('BANNED','^ad\.|banner|banners|doubleclick|links|forum|guestbook|geocities|8m|directory|affiliate|groups|');

// regexp forbidden extensions - return sometimes text/html mime-type !!!
define('FORBIDDEN_EXTENSIONS','\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$');

Any ideas?

The code box above made the geocities and tar entries look weird with a space, but it's right in the config. Maybe it's my browser playing tricks on me.

Charter 12-19-2004 06:05 PM

The word guestbook isn't part of the actual domain name. If you want to ban keywords based on the entire link, try $regs[2] instead of $regs[5] in the !eregi(BANNED,$regs[5]) piece in the robot_functions.php file. Also, the link doesn't end in cgi, so look up what $ means in that regexp tutorial you found and remove the $ from the FORBIDDEN_EXTENSIONS constant in the config.php file if you want.
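
As a rough sketch of both points, the snippet below runs the guestbook link from the earlier post through a few patterns. It assumes preg_match() in place of eregi(), shortens the extension list for readability, and uses parse_url() only to imitate a host-only check; none of this is PhpDig's actual code.

```php
<?php
// Sketch only: preg_match() replaces eregi(); extension list shortened for readability.
$url = 'http://www.horse-riding.net/cgi-bin/guestbook/book.cgi?url=anything&mode=show&refresh=yes';

$anchored   = '~\.(cgi|php|asp|pl)$~i'; // extension must be the very end of the string
$unanchored = '~\.(cgi|php|asp|pl)~i';  // extension may appear anywhere in the string

var_dump(preg_match($anchored, $url));   // int(0): the URL ends in "refresh=yes"
var_dump(preg_match($unanchored, $url)); // int(1): ".cgi" is found before the "?"

// "guestbook" is in the path, not the host, so a host-only check misses it.
$host = parse_url($url, PHP_URL_HOST);
var_dump(preg_match('~guestbook~i', $host)); // int(0)
var_dump(preg_match('~guestbook~i', $url));  // int(1)
```

One trade-off to keep in mind: without the $, the pattern can also match in the middle of a link, for example the ".pl" in a host such as www.plants.com.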

Slider 12-20-2004 01:57 PM

I tried $regs[2] as the replacement, as suggested, and it made no change to the indexing. I completely understand that you shouldn't have to answer questions about alterations to your scripting. This is something I really need to get working if at all possible. Thank you for taking the time to answer my questions.

I've been reading a lot of PHP documentation online and I'm still left scratching my head. Is there possibly something else that needs to be changed?

Slider 12-21-2004 01:43 PM

It needs the functionality to ban the url that contains a directory/path name you wish to ban. This thread was started and intended to be in the Mod Request part of the forums. This is a Mod Request since PhpDig does not have this function.

I am still left with no solution and in the wrong part of the forums.
Could you possibly move this back to its original placement?

rAdoN 12-30-2004 01:04 PM

the config does this already - not a mod request

!eregi(BANNED,$regs[2]) works for banning keywords

learn regex - use FORBIDDEN_EXTENSIONS

PHP Code:

// no cgi
define('FORBIDDEN_EXTENSIONS','\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)');
// no guestbook
define('FORBIDDEN_EXTENSIONS','(guestbook|\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$)'); 

pick one - take out the space - delete the guestbook links from admin update - index - not hard

Slider 12-30-2004 01:49 PM

Thanks rAdoN,
I'm sure that will help later when I get to filenames and types.
I wanted to ban certain PATHS, such as:
/links/
/guestbook/
/forum/
/cgi-bin/
/webring/
/affiliates/

There are a lot of common paths websites use that could be ignored, which would greatly reduce the size of the MySQL database.
I would think that many people would find this a useful addition to PhpDig.


Your help covered FILENAMES and FILETYPES. That wasn't the question.
A path would be everything up to a file and not including the file.
Path: the way to get from here to there, not the destination.

rAdoN 12-30-2004 02:15 PM

this already exists - not an addition

PHP Code:

// no links with guestbook forum cgi-bin webring affiliates
// no links ending with .cgi .php .asp .pl .rm .ico .cab ...
define('FORBIDDEN_EXTENSIONS','(guestbook|forum|cgi-bin|webring|affiliates|\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$)'); 

remove the space

why do you not want .cgi .php .asp .pl - those are dynamic pages

FORBIDDEN_EXTENSIONS can be more than extensions

config is for you to configure - make the regex you want

Slider 12-30-2004 03:45 PM

www.domain.com/links/file.ext

Note that the /links/ part of the URL is the part I need banned.
Not a file or an extension. A DIRECTORY leading to a file, also called the path.

I need the directory called LINKS as a banned directory.
If the spider gets to that directory, it shouldn't follow it any further.

I honestly do appreciate your help. An answer of any kind is better than no answer at all.

rAdoN 12-30-2004 04:02 PM

ack - you don't understand - make the FORBIDDEN_EXTENSIONS regex you want :bang:

PHP Code:

// no links with /guestbook/ /forum/ /cgi-bin/ /webring/ /affiliates/
// no links ending with .cgi .php .asp .pl .rm .ico .cab ...
define('FORBIDDEN_EXTENSIONS','(/guestbook/|/forum/|/cgi-bin/|/webring/|/affiliates/|\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$)'); 

you need to go to admin update - delete the links you don't want - run the cleans - edit FORBIDDEN_EXTENSIONS - relax - index after that

ps - what is not clear?
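
A quick way to try a pattern like the one above before re-indexing is to run a few sample links through it. The sketch below is only an approximation: it uses preg_match() instead of eregi(), assumes the constant is tested against the whole link, shortens the extension list, and adds /links/ (the directory asked about earlier, which is not in the define above). The sample URLs are hypothetical.

```php
<?php
// Sketch only: preg_match() replaces eregi(); "/links/" is added here because it was
// the directory asked about, and the extension list is shortened for readability.
$forbidden = '~(/links/|/guestbook/|/forum/|/cgi-bin/|/webring/|/affiliates/|\.(cgi|zip|exe)$)~i';

$urls = [
    'http://www.domain.com/links/file.ext',   // banned directory
    'http://www.domain.com/about/links.html', // "links" is a file name here, not a directory
    'http://www.domain.com/files/setup.exe',  // banned extension
    'http://www.domain.com/index.html',       // nothing banned
];

foreach ($urls as $url) {
    echo (preg_match($forbidden, $url) ? 'skip   ' : 'index  ') . $url . "\n";
}
```

Wrapping each directory name in slashes is what keeps a file such as links.html from being skipped along with the /links/ directory.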

vinyl-junkie 12-30-2004 06:07 PM

Quote:

Originally Posted by Slider
www.domain.com/links/file.ext

Note that the /links/ part of the URL is the part I need banned.
Not a file or an extension. A DIRECTORY leading to a file, also called the path.

I need the directory called LINKS as a banned directory.
If the spider gets to that directory, it shouldn't follow it any further.

I honestly do appreciate your help. An answer of any kind is better than no answer at all.

Just use your robots.txt file to do that, like so:
Code:

User-agent: Phpdig
Disallow: /links/



All times are GMT -8.