PhpDig.net


Slider 12-19-2004 09:24 AM

Ban features
 
When indexing a huge list of sites, I may add the whole list without looking through it, simply because it is so large.

In the config:
- AutoBan certain Domains. EX: freeservers Geocities and others

- AutoBan universal Directory names: links, banners, affiliates, forums, blog, webrings, etc. (not just for a certain site, but for all that are crawled). I understand that this is in PhpDig already, but I have yet to get it working.

- AutoBan certain domains by extension: .biz .info .this .that (if you run an English-only site this would be very helpful to weed out sites like .ru and others)

This would be a great feature to reduce the size of a bloated MySQL database.

Charter 12-19-2004 09:31 AM

Set the words you want to ban in the BANNED constant in the config file.

Slider 12-19-2004 11:51 AM

if I want to use a . would it have to be escaped as in about\.com or can I just use about.com ?

Charter 12-19-2004 12:36 PM

Basically a . matches any character whereas a \. matches a period.
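
A minimal sketch of the difference Charter describes. Assumptions: the host names are made up for illustration, and preg_match() with the /i flag stands in for eregi(), which was removed in PHP 7.0.

```php
<?php
// Hypothetical host names, for illustration only.
$hosts = ['about.com', 'aboutXcom.net'];

$unescaped = '/about.com/i';   // "." matches any single character
$escaped   = '/about\.com/i';  // "\." matches only a literal period

$results = [];
foreach ($hosts as $host) {
    $results[$host] = [
        'unescaped' => preg_match($unescaped, $host) === 1,
        'escaped'   => preg_match($escaped, $host) === 1,
    ];
}

var_export($results);
```

Both patterns match about.com, but only the unescaped one also matches aboutXcom.net, so escaping the period is the safer choice when a literal dot is meant.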

Slider 12-19-2004 01:26 PM

Thanks Charter

Your help greatly reduced the MySQL database by cutting out all the crud pages that I should not have allowed to be crawled.
I also found an excellent tutorial at this place
It kept me from asking you 10 more questions about things I could look up but just wasn't sure where to look. php.net wasn't as helpful as it usually is.

Slider 12-19-2004 01:45 PM

ok it kept me from asking you 9 more questions maybe......

http://www.horse-riding.net/cgi-bin/guestbook/book.cgi?url=anything&mode=show&refresh=yes

I added guestbook to BANNED so it would be ignored (it was a no go for some reason).
I added cgi to FORBIDDEN_EXTENSIONS so that filetype would be ignored (it was a no go also).

This is after a fresh crawl with nothing else done.

Code:

// regular expression to ban useless external links in index
define('BANNED','^ad\.|banner|banners|doubleclick|links|forum|guestbook|geocities|8m|directory|affiliate|groups|');

// regexp forbidden extensions - return sometimes text/html mime-type !!!
define('FORBIDDEN_EXTENSIONS','\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$');

Any ideas?

The code box above made geocities and tar look weird, with a space in them, but it's right in the config. Maybe it's my browser playing tricks on me.

Charter 12-19-2004 06:05 PM

The word guestbook isn't part of the actual domain name. If you want to ban keywords based on the entire link, try $regs[2] instead of $regs[5] in the !eregi(BANNED,$regs[5]) piece in the robot_functions.php file. Also, the link doesn't end in cgi, so look up what $ means in that regexp tutorial you found, and remove the $ from the FORBIDDEN_EXTENSIONS constant in the config.php file if you want.
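
A small sketch of the $ anchor point, using the file part of the guestbook link quoted earlier. Assumptions: the alternation is shortened to two extensions, preg_match() stands in for eregi() (removed in PHP 7.0), and the third, query-aware pattern is an illustration that does not come from this thread.

```php
<?php
// File part of the guestbook link, query string included.
$file = 'book.cgi?url=anything&mode=show&refresh=yes';

$anchored   = '/\.(cgi|pl)$/i';        // extension must be the very end of the string
$unanchored = '/\.(cgi|pl)/i';         // extension may appear anywhere
$queryAware = '/\.(cgi|pl)(\?|$)/i';   // extension followed by "?" or the end (illustrative)

$anchoredHit   = preg_match($anchored, $file) === 1;
$unanchoredHit = preg_match($unanchored, $file) === 1;
$queryAwareHit = preg_match($queryAware, $file) === 1;
$plainHit      = preg_match($anchored, 'book.cgi') === 1;

var_export([$anchoredHit, $unanchoredHit, $queryAwareHit, $plainHit]);
```

The anchored pattern only matches when nothing follows the extension, which is why a .cgi link carrying a query string slips past it. Dropping the $ catches it, at the cost of also matching .cgi or .pl anywhere else in the string.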

Slider 12-20-2004 01:57 PM

I tried $regs[2] as the replacement as suggested, and it resulted in no change to the indexing. I completely understand that you shouldn't have to answer questions about alterations to your scripting. This is something I really need to get working if at all possible. Thank you for taking the time to answer my questions.

I've been reading a lot of PHP documentation online and I'm still left scratching my head. Is there possibly something else that needs to be changed?

Slider 12-21-2004 01:43 PM

It needs the functionality to ban the url that contains a directory/path name you wish to ban. This thread was started and intended to be in the Mod Request part of the forums. This is a Mod Request since PhpDig does not have this function.

I am still left with no solution and in the wrong part of the forums.
Could you possibly move this back to its original placement?

rAdoN 12-30-2004 01:04 PM

config does already - not mod request

!eregi(BANNED,$regs[2]) work for ban keywords

learn regex - use FORBIDDEN_EXTENSIONS

PHP Code:

// no cgi
define('FORBIDDEN_EXTENSIONS','\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)');
// no guestbook
define('FORBIDDEN_EXTENSIONS','(guestbook|\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$)'); 

pick one - take out space - delete guestbook links from admin update - index - not hard

Slider 12-30-2004 01:49 PM

Thanks rAdoN,
I'm sure that will help later when I get to filenames and types.
I wanted to ban certain PATHS as in
/links/
/guestbook/
/forum/
/cgi-bin/
/webring/
/affiliates/

There are a lot of common paths websites use that could be ignored and greatly reduce the size of the MySql database.
I would think that many people would find this a useful addition to PhpDig


Your help covered FILENAMES and FILETYPES. That wasn't the question.
A path is everything up to a file, not including the file.
Path: the way to get from here to there, not the destination.

rAdoN 12-30-2004 02:15 PM

this already exist - not addition

PHP Code:

// no links with guestbook forum cgi-bin webring affiliates
// no links ending with .cgi .php .asp .pl .rm .ico .cab ...
define('FORBIDDEN_EXTENSIONS','(guestbook|forum|cgi-bin|webring|affiliates|\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$)'); 

remove space

why you not want .cgi .php .asp .pl - dynamic pages

FORBIDDEN_EXTENSIONS can be more than extensions

config is for you to config - make regex you want

Slider 12-30-2004 03:45 PM

www.domain.com/links/file.ext

The /links/ part of the URL (shown in bold in the original post) is the part I need banned.
Not a file or an extension. A DIRECTORY to a file. Also called the path

I need the directory called LINKS as a banned directory.
If it gets to that directory it doesn't follow it to spider any further.

I honestly do appreciate your help. An answer of any kind is better than no answer at all.

rAdoN 12-30-2004 04:02 PM

awk - you not understand - make the FORBIDDEN_EXTENSIONS regex you want :bang:

PHP Code:

// no links with /guestbook/ /forum/ /cgi-bin/ /webring/ /affiliates/
// no links ending with .cgi .php .asp .pl .rm .ico .cab ...
define('FORBIDDEN_EXTENSIONS','(/guestbook/|/forum/|/cgi-bin/|/webring/|/affiliates/|\.(cgi|php|asp|pl|rm|ico|cab|swf|css|gz|z|tar|zip|tgz|msi|arj|zoo|rar|r[0-9]+|exe|bin|pkg|rpm|deb|bz2)$)'); 

you need to go admin update - delete links you no want - run the cleans - edit FORBIDDEN_EXTENSIONS - relax - index after that

ps - what is not clear
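
A rough sketch of why the slashes in /links/ or /forum/ matter when a keyword is meant as a directory name. Assumptions: the URLs are hypothetical, the alternation is shortened to two entries, and preg_match() stands in for eregi() (removed in PHP 7.0).

```php
<?php
// Hypothetical URLs, for illustration only.
$urls = [
    'http://www.domain.com/links/file.ext',
    'http://www.domain.com/hyperlinks.html',
    'http://www.domain.com/forum/index.html',
];

// "#" is used as the delimiter so the slashes need no escaping.
$bare     = '#(links|forum)#i';       // keyword anywhere in the URL
$segments = '#(/links/|/forum/)#i';   // keyword only as a whole directory name

$bareHits = [];
$segmentHits = [];
foreach ($urls as $url) {
    if (preg_match($bare, $url) === 1) {
        $bareHits[] = $url;
    }
    if (preg_match($segments, $url) === 1) {
        $segmentHits[] = $url;
    }
}

var_export(['bare' => $bareHits, 'segments' => $segmentHits]);
```

The bare keywords also catch hyperlinks.html, while the slash-wrapped form only catches URLs that pass through a directory with exactly that name.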

vinyl-junkie 12-30-2004 06:07 PM

Quote:

Originally Posted by Slider
www.domain.com/links/file.ext

The /links/ part of the URL (shown in bold in the original post) is the part I need banned.
Not a file or an extension. A DIRECTORY to a file. Also called the path

I need the directory called LINKS as a banned directory.
If it gets to that directory it doesn't follow it to spider any further.

I honestly do appreciate your help. An answer of any kind is better than no answer at all.

Just use your robots.txt file to do that, like so:
Code:

User-agent: Phpdig
Disallow: /links/


jmitchell 12-30-2004 06:26 PM

what if you are indexing other sites?

rAdoN 12-30-2004 06:30 PM

use admin update - "Click on the noway sign to exclude from future indexings" - "Click on the cross to delete the branch" - "Click on the cross to delete a document" - that delete for links indexed not wanted - use FORBIDDEN_EXTENSIONS to prevent for sites - run the cleans - index

ps - no listen :bang:

Slider 12-31-2004 06:07 AM

Hello rAdoN,

I apologize for being such a pain. :)
You really know your stuff and I will never doubt what I hear from you again.
Thank you so much for being here. Maybe I can return the favor in some way in the future.

Slider 12-31-2004 01:49 PM

I added this line to the config:
Code:

define('FORBIDDEN_PATH','(guestbook|forum|cgi-bin|webring|affiliates|links|webrings|banners)');

I added this code to spider.php (the addition, shown in bold red in the original post, is the final && !eregi(FORBIDDEN_PATH,$temp_path) condition)
Code:

//test content-type of this page if not excluded
                          $result_test_http = '';
                          if (!phpdigReadRobots($exclude,$temp_path) && !eregi(FORBIDDEN_EXTENSIONS,$temp_file) && !eregi(FORBIDDEN_PATH,$temp_path)) {
                                $result_test_http = phpdigTestUrl($url_indexing,'date',$cookies);
                          }

I tried the code you gave, and even tried variations of it, and was never able to get it to ignore a path or directory. This code should be added to the next PhpDig version. It's a necessity if you want a little more control over the content that is being indexed and want to reduce the size of the MySQL database.

rAdoN 01-01-2005 12:56 PM

hoorah - instead use book.cgi you make mod - good for path - i mod your mod :smoke:
PHP Code:

//test content-type of this page if not excluded
$result_test_http = '';
if (!phpdigReadRobots($exclude,$temp_path.$temp_file) && !eregi(FORBIDDEN_EXTENSIONS,$temp_path.$temp_file)) {
     $result_test_http = phpdigTestUrl($url_indexing,'date',$cookies);
}



Slider 01-01-2005 04:12 PM

I'm not familiar with the book.cgi you were talking about.
The new code you posted would have made it work for both the path and the filename. Congrats!

Thank you very much
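
For readers on current PHP, a hedged sketch of the idea behind rAdoN's change: test one pattern against the path and file joined together. Assumptions: FORBIDDEN_PATTERN and isForbidden() are hypothetical names, not part of PhpDig; the pattern is a shortened PCRE stand-in for FORBIDDEN_EXTENSIONS; the path is assumed to carry leading and trailing slashes, which may not match what spider.php actually passes in $temp_path; and preg_match() replaces eregi(), which was removed in PHP 7.0.

```php
<?php
// Shortened, PCRE-style stand-in for the FORBIDDEN_EXTENSIONS constant.
const FORBIDDEN_PATTERN = '#(/guestbook/|/forum/|/cgi-bin/|/links/|\.(zip|exe|gz)$)#i';

// Hypothetical helper: returns true when the link should be skipped.
// Joining path and file lets one regex cover directories and extensions.
function isForbidden(string $path, string $file): bool
{
    return preg_match(FORBIDDEN_PATTERN, $path . $file) === 1;
}

$checks = [
    'guestbook' => isForbidden('/cgi-bin/guestbook/', 'book.cgi?url=anything'),
    'download'  => isForbidden('/downloads/', 'tool.zip'),
    'plain'     => isForbidden('/docs/', 'index.html'),
    'lookalike' => isForbidden('/docs/', 'links.html'),
];

var_export($checks);
```

Testing only the file name would miss the guestbook link, since /guestbook/ and /cgi-bin/ appear in the path rather than the file.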


All times are GMT -8.