|
10-10-2004, 03:53 PM | #1 |
Green Mole
Join Date: Oct 2004
Location: Western Massachusetts
Posts: 6
|
Full Link Exploration with Selective Content Indexing
Howdy Folks,
I just installed PhpDig today and am impressed with what I've seen so far. I want to use PhpDig to index specialized game development blogs. I am only interested in indexing the blog articles themselves and wish to ignore all other content on the blog website. You can view an example blog (mine) at this location: http://www.gametableonline.com/blogs/wizwar/index.php

I need the spider to explore all documents on a website, but only index documents with a URL that contains "article.php". While I can modify my blogs, I cannot modify the blog software GTO uses, and even if I could, I'd have to modify several installations since every GTO project has a blog.

I can identify whether a URL is an actual blog article because it will contain the pattern "article.php?story=<story id>". The only way I can get links to the available blogs is by extracting links from the index.php document (which paginates). So, in order to get JUST article links, I need to follow any URLs that contain index.php to extract the links, and I need to index documents whose URLs contain the pattern "article.php".

I've managed to modify the phpdigRewriteUrl function to return -1 (ignore, discard?) for URLs that don't contain article.php or index.php:
Code:
if (!eregi("article.php|index.php", $eval)) { return -1; }

Unfortunately, the index.php document returns a brief summary for each available blog in addition to a direct link. When I search for anything, index.php will usually have a higher result score because each index.php page has summaries of 10 blog articles per page. So, before I get any results linking directly to blog articles that contain my keyword, I usually get several links to index.php documents.

Given how the PhpDig system works, what do you think is the best way for me to modify the system for selective indexing? Thanks for your time. Michael McIntosh Last edited by Xavian; 10-10-2004 at 03:59 PM. |
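For readers on current PHP, where eregi() is no longer available (it was deprecated in PHP 5.3 and removed in PHP 7.0), the same URL test might be sketched with preg_match(). This is only an illustration of the check described in the post above: the helper name shouldCrawlUrl is hypothetical and not part of PhpDig, and the way it would be wired into phpdigRewriteUrl is an assumption based on the snippet in the post.

```php
<?php
// Hypothetical helper (not a PhpDig function): decides whether the spider
// should keep a URL at all, mirroring the eregi() check from the post.
function shouldCrawlUrl(string $url): bool
{
    // The dot is escaped so it matches a literal "." only; the "i" flag
    // keeps the case-insensitive behaviour that eregi() provided.
    return preg_match('/(article|index)\.php/i', $url) === 1;
}

// Assumed usage inside the URL-rewriting hook, following the post:
// if (!shouldCrawlUrl($eval)) { return -1; }
```

Note that this only controls which URLs are kept; index.php pages are still indexed, which is the ranking problem the post goes on to describe.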
10-10-2004, 08:00 PM | #2 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
Welcome to the forum, Xavian. We're glad to have you here.
Looks like you've gone a long way already toward solving your problem. I'm not sure just how well this will fit into what you're trying to do, but you can exclude certain text from being indexed (see this thread). If you don't want to see summaries or page descriptions in the search results, make sure you have the following values in config.php:
Code:
define('DISPLAY_SNIPPETS',false);
define('DISPLAY_SUMMARY',false);
Hope this helps. |
10-10-2004, 08:32 PM | #3 | |||
Green Mole
Join Date: Oct 2004
Location: Western Massachusetts
Posts: 6
|
Thanks for the welcome
Quote:
I suppose that, as a severe hack, I could run a filter on a document after it has been fetched by the spider so that an exclude tag is embedded immediately after the <body> tag of the document. I'd really prefer a more graceful method if possible. Quote:
Quote:
Leave it to me to have a "Square Peg, Round Hole" problem. ;P I better go get a hammer... ;P I work with industrial-strength search engine solutions by day, but heaven knows I cannot afford the licensing required to use them for my small personal projects. Something like PhpDig is really nice and I am impressed with the quality. A lot of other projects seem to have very little documentation, but you guys even have forums. Woot! -Michael |
|||
10-10-2004, 08:57 PM | #4 |
Purple Mole
Join Date: Jan 2004
Posts: 694
|
Since you don't have any control over the server, I'm afraid your only option is more custom code. Wish I could be of more help.
|
10-10-2004, 09:34 PM | #5 |
Green Mole
Join Date: Oct 2004
Location: Western Massachusetts
Posts: 6
|
Thanks for all your help anyways, vinyl-junkie. I suspected I'd have to roll my own, I'm just trying to avoid re-inventing the wheel if there is an easy way to introduce this functionality.
Last edited by Xavian; 10-10-2004 at 09:49 PM. |
10-11-2004, 09:08 AM | #6 |
Green Mole
Join Date: Oct 2004
Location: Western Massachusetts
Posts: 6
|
Article Comment Tags...
Aha! I think I see a way it's possible, and I'm curious if you guys have any suggestions on where I should look to implement it...
I've examined the HTML source code generated by the blog software and I can see that it uses the comments <!-- ARTICLE START --> and <!-- ARTICLE END --> to mark where the article begins and ends in the HTML code. So, a revised filter would explore all documents on the blog website, but only index text contained between the <!-- ARTICLE START --> and <!-- ARTICLE END --> comments. I'll look at the spider mechanism for text exclusion and see if I can kludge my own.

Another alternative is for me to add results filtering. The template engine code seems complicated, but if I can intercept the query results list before it is rendered, I could iterate over the list and remove certain documents based upon URL so that only URLs containing "article.php" are output in the search results. That seems the easiest solution, actually.

Do you guys have any diagrams of how the system works, what the various tables are used for and stuff like that? I'd be happy to submit my mods if I can get this to work. -Michael |
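The two alternatives in the post above (indexing only the text between the article comments, or filtering the result list by URL) could be sketched roughly as follows. Both function names are hypothetical and not part of PhpDig, and the shape of a search result (an array with a 'url' key) is an assumption made purely for illustration; PhpDig's actual result structure may differ.

```php
<?php
// Hypothetical helpers for illustration only; neither is a PhpDig function.

// Returns only the markup between the blog's article comments, or an
// empty string when the page has no article section (for example an
// index.php listing page).
function extractArticleHtml(string $html): string
{
    // "s" lets "." span newlines, "i" ignores case, and ".*?" stops at
    // the first ARTICLE END comment instead of the last one.
    $pattern = '/<!--\s*ARTICLE START\s*-->(.*?)<!--\s*ARTICLE END\s*-->/is';
    if (preg_match($pattern, $html, $matches) === 1) {
        return trim($matches[1]);
    }
    return '';
}

// Keeps only results whose URL points at an article page.
// Assumption: each result is an array that has a 'url' key.
function keepArticleResults(array $results): array
{
    $filtered = array_filter($results, function (array $result): bool {
        return stripos($result['url'], 'article.php') !== false;
    });
    // Re-index so the list has consecutive keys again.
    return array_values($filtered);
}
```

The first helper would apply before indexing, so listing pages contribute no text; the second leaves the index untouched and only hides non-article hits at display time.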
10-11-2004, 09:48 AM | #7 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
An idea to try...
In robot_functions.php find: PHP Code:
PHP Code:
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
10-11-2004, 12:03 PM | #8 |
Green Mole
Join Date: Oct 2004
Location: Western Massachusetts
Posts: 6
|
That's an awesome idea! I'm glad I checked here first! I'll try that when I get a chance this evening... Thanks!
Last edited by Xavian; 10-11-2004 at 12:06 PM. |
10-11-2004, 08:37 PM | #9 |
Green Mole
Join Date: Oct 2004
Location: Western Massachusetts
Posts: 6
|
It's... It's... ALIVE!
Thanks to your great pointers, I got the spider and the engine working the way I needed. The results are great. You can check it out at:
http://michael.nervestaple.com/gto/blogsearch/ Some good example keywords would be "wiz-war" or "game"... I will be indexing more blogs tomorrow, and later this week I'll post what I modified to show how I did it. I ended up having to modify spider.php as well, right before the spider calls the phpdigIndexFile function. For now, I gotta catch some Zzzzzs... -Michael |