10-11-2003, 09:21 AM | #1
Orange Mole
Join Date: Sep 2003
Posts: 40
New feature proposal: targeted indexing
JyGius and I are working on a new feature. I hope to get some feedback from other developers before we start.

We want to re-index a big website very often, and we want to introduce a trick to dramatically reduce crawling. It's not possible to use the modification date of files to select the modified ones, because there are lots of dynamic pages. For example, I can have news generated with an ID:

news.php?nid=10001
news.php?nid=10002
news.php?nid=10003
news.php?nid=10004
news.php?nid=10005
.....
news.php?nid=20000

but only the last 4 have been modified since the last visit. How can the crawler know that?

Our idea is to add a directive in robots.txt containing the URL of a text file that lists the modified/created pages with their timestamps. For example:

1056987466 news.php?nid=20001
1056987853 news.php?nid=20002
1056988465 news.php?nid=20003
1056995765 news.php?nid=20004

PhpDig would read that directive, load the text file, parse it, and dig only the pages modified after the last visit, without following links. The text file must be created and maintained by the web site software, so obviously this applies to portals that are totally database driven. If robots.txt doesn't contain that directive, PhpDig can crawl the site as usual.

If you have any ideas, please post them here.

Alivin70
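To make the proposal concrete, here is a minimal sketch of the crawler-side logic, assuming a directive named "Modified-pages" in robots.txt. The directive name, the function name and the variables are illustrative assumptions only, not part of PhpDig's existing code.

<?php
// Sketch only: "Modified-pages" directive and all names below are assumptions.

// Return the URLs modified since $last_visit (Unix timestamp), or false
// if the site does not publish a modified-pages list.
function get_modified_pages($site_url, $last_visit)
{
    // 1. Read robots.txt and look for the proposed directive, e.g.:
    //    Modified-pages: http://www.example.com/modified.txt
    $robots = @file_get_contents($site_url . '/robots.txt');
    if ($robots === false) {
        return false;
    }
    if (!preg_match('/^Modified-pages:\s*(\S+)/mi', $robots, $matches)) {
        return false; // no directive: crawl the site as usual
    }

    // 2. Load the list file: one "timestamp URL" pair per line.
    $lines = @file($matches[1]);
    if ($lines === false) {
        return false;
    }

    // 3. Keep only the pages modified after the last visit.
    $pages = array();
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line == '') {
            continue;
        }
        list($timestamp, $url) = preg_split('/\s+/', $line, 2);
        if ((int)$timestamp > $last_visit) {
            $pages[] = $url; // relative URL, resolved against the site base
        }
    }
    return $pages;
}

// Example: re-index only what changed since the last crawl of this site.
$changed = get_modified_pages('http://www.example.com', 1056990000);
if ($changed === false) {
    // no list available: fall back to a normal full crawl
} else {
    foreach ($changed as $url) {
        // index $url without following its links
    }
}
?>

On the site side (again only an assumption about how a database-driven portal might maintain the file), the script that saves or edits a news item could simply append a line:

<?php
// Hypothetical example: append "timestamp URL" when a news item is saved.
$fp = fopen('/path/to/modified.txt', 'a');
fwrite($fp, time() . ' news.php?nid=' . $nid . "\n");
fclose($fp);
?>

PhpDig would only need to remember the timestamp of its last visit per site; if the directive or the list file is missing, it falls back to the usual full crawl, exactly as described above.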