Full Link Exploration with Selective Content Indexing

Xavian · 10-10-2004, 03:53 PM

Howdy Folks,

I just installed PhpDig today and am impressed with what I've seen so far.

I want to use PhpDig to index specialized game development blogs. I am only interested in indexing the blog articles themselves and wish to ignore all other content on the blog website. You can view an example blog (mine) at this location: http://www.gametableonline.com/blogs/wizwar/index.php

I need the spider to explore all documents on a website, but only index documents with a URL that contains "article.php". While I can modify my blogs, I cannot modify the blog software GTO uses and even if I could, I'd have to modify several installations since every GTO project has a blog.

I can identify whether a URL is an actual blog article because it will contain the pattern "article.php?story=<story id>". The only way I can get links to the available blogs is by extracting links from the index.php document (which paginates). So, in order to get JUST article links I need to look at any URLs containing index.php to extract the links, and I need to index documents that contain the pattern "article.php".

I've managed to modify the phpdigRewriteUrl function to return -1 (ignore, discard?) for URLs that don't contain article.php or index.php:

Code:

if (!eregi("article.php|index.php", $eval)) {  
   return -1;
}

The filter itself works very well. Using this method the spider only indexes URLs containing index.php or article.php. Due to the dynamic nature of the blog software, though, the search results aren't very helpful.
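A note for anyone trying the filter above on a current PHP install: eregi() was removed in PHP 7, so preg_match() would be needed instead. Below is a minimal sketch of the same test. Both helper names (gto_keep_url and gto_is_article) are made up for illustration and are not PhpDig functions; the second one separates pages worth indexing from pages that are only followed for links.

```php
<?php
// Hypothetical helper, not part of PhpDig: should the spider keep this URL?
// True for index.php (source of links) and article.php (actual content).
function gto_keep_url(string $url): bool
{
    // The dots are escaped so they match a literal "."; the "i" flag
    // mirrors eregi()'s case-insensitive matching.
    return preg_match('/(article|index)\.php/i', $url) === 1;
}

// Hypothetical helper: true only for pages whose text should be indexed.
function gto_is_article(string $url): bool
{
    return preg_match('/article\.php\?story=/i', $url) === 1;
}
```

Inside phpdigRewriteUrl the check would then presumably read `if (!gto_keep_url($eval)) { return -1; }`, with $eval being the same variable used in the original snippet.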

Unfortunately, the index.php document returns a brief summary for each available blog in addition to a direct link. When I search for anything, index.php will usually have a higher result score because each index.php page has summaries of 10 blog articles per page. So, usually before I get any results directly to blog articles that contain my keyword, I get several links to index.php documents.

Given how the PhpDig system works, what do you think is the best way for me to modify the system for selective indexing?

Thanks for your time.

Michael McIntosh

vinyl-junkie · 10-10-2004, 08:00 PM

Welcome to the forum, Xavian. We're glad to have you here.

Looks like you've gone a long way already toward solving your problem. I'm not sure just how well this will fit into what you're trying to do, but you can exclude certain text from being indexed (see this thread).

If you don't want to see summaries or page descriptions in the search results, make sure you have the following values in config.php:

Code:

define('DISPLAY_SNIPPETS',false);
define('DISPLAY_SUMMARY',false);

Also, and again I don't know how your site is structured, if you have items that you don't want indexed that are restricted to a specific directory, have a look at this thread.

Hope this helps.

Xavian · 10-10-2004, 08:32 PM

Thanks for the welcome

Quote:

Looks like you've gone a long way already toward solving your problem. I'm not sure just how well this will fit into what you're trying to do, but you can exclude certain text from being indexed (see this thread).

I've already found those tags and unfortunately, I do not have direct control over the content. I have access to *my* blog, but I am trying to index all game development blogs associated with the gametableonline.com website. My blog is one of something like 10 other blogs. We discuss relevant topics that come up during our projects, and sometimes a problem one developer runs into is one another developer has already discussed. The individual blog sites are searchable, but it involves going to each and every blog and searching for the keyword you want.

I suppose, as a severe hack, I could run a filter on a document after it has been fetched by the spider so that an exclude tag is embedded immediately after the <body> tag of a document. I really want a more graceful method of doing it if possible.

Quote:

If you don't want to see summaries or page descriptions in the search results, make sure you have the following values in config.php:

Ah, I do want to see the summaries, the problem is that the search relevancy algorithm (term frequency inverse document frequency based?) returns results I'd like to filter out. I love the summary of each doc and am impressed at the keyword highlighting.

Unfortunately, since the index.php pages contain summaries of full documents, they get a really large relevancy boost because most document summaries consist of the really important keywords. I need some way to extract links from any index.php docs and ignore the text content of those index.php docs from the spider side, since I normally cannot modify the websites themselves. I want the spider to traverse as many links as possible, but drop all text content except that from URL paths containing "article.php".

Quote:

Also, and again I don't know how your site is structured, if you have items that you don't want indexed that are restricted to a specific directory, have a look at this thread.

I'm familiar with robots.txt, but I lack that level of access to the websites in question. On top of that, it won't really help me with the problem I am having. The only programmatic method I have for extracting articles from the blogs is by getting URLs generated from the index.php script. I could manually go to the websites and add each and every article to my list by hand, but that's ultimately unworkable for me and is what I'm trying to avoid.

Leave it to me to have a "Square Peg, Round Hole" problem. ;P I better go get a hammer... ;P

I work with industrial-strength search engine solutions by day, but heaven knows I cannot afford the licensing required to use them for my small personal projects. Something like PhpDig is really nice and I am impressed with the quality. A lot of other projects seem to have very little documentation, but you guys even have forums.

Woot!

-Michael

vinyl-junkie · 10-10-2004, 08:57 PM

Since you don't have any control over the server, I'm afraid your only option is more custom code. Wish I could be of more help.

Xavian · 10-10-2004, 09:34 PM

Thanks for all your help anyways, vinyl-junkie.

I suspected I'd have to roll my own, I'm just trying to avoid re-inventing the wheel if there is an easy way to introduce this functionality.

Xavian · 10-11-2004, 09:08 AM

Aha! I think I see a way it's possible and I'm curious if you guys have any suggestions on where I should look to implement it...

I've examined the HTML source code generated by the blog software and I can see that it uses the comments <!-- ARTICLE START --> and <!-- ARTICLE END --> to mark where the article begins and ends in the HTML code.

So, a revised filter will explore all documents on the blog website, but only index text contained between the <!-- ARTICLE START --> and <!-- ARTICLE END --> tags.

I'll look at the spider mechanism for text exclusion and see if I can kludge my own.
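The marker idea above could be sketched roughly like this. The function name gto_extract_article is hypothetical, and where it would be called inside PhpDig's spider is an open question; the sketch only shows pulling out the text between the two comments from a fetched page.

```php
<?php
// Hypothetical helper, not part of PhpDig. Returns only the markup found
// between the blog software's article markers, or '' if none are present.
function gto_extract_article(string $html): string
{
    // The "s" flag lets "." match newlines, since articles span many lines;
    // ".*?" is non-greedy so each START pairs with the nearest END.
    $pattern = '/<!--\s*ARTICLE START\s*-->(.*?)<!--\s*ARTICLE END\s*-->/s';

    $count = preg_match_all($pattern, $html, $matches);
    if (!$count) {
        // No markers (or a regex error): nothing to index from this page.
        return '';
    }

    // Join every marked section in case a page carries more than one.
    return trim(implode("\n", $matches[1]));
}
```

A page without the markers yields an empty string, so the spider could still follow its links while contributing no text to the index.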

Another alternative is for me to add results filtering. The template engine code seems complicated, but if I can intercept the query results list before they are rendered, I could iterate the list and remove certain documents based upon URL so that only urls containing "article.php" are output in the search results. That seems the easiest solution actually.
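A rough sketch of that results-filtering alternative, assuming the result list is an array of rows that each carry the document URL under a 'url' key. That shape is a guess for illustration only; PhpDig's real result structure may differ, and gto_filter_results is a hypothetical name.

```php
<?php
// Hypothetical helper. $results is ASSUMED to be a list of rows with a
// 'url' key; this is not confirmed against PhpDig's actual result format.
function gto_filter_results(array $results): array
{
    $kept = array_filter($results, function (array $row): bool {
        // Keep only rows whose URL points at an article page.
        return isset($row['url'])
            && stripos($row['url'], 'article.php') !== false;
    });

    // Re-index so the list has consecutive keys again for the templates.
    return array_values($kept);
}
```

The trade-off is that index.php pages still sit in the index and still count toward any result totals computed before the filter runs.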

Do you guys have any diagrams of how the system works, what the various tables are used for and stuff like that? I'd be happy to submit my mods if I can get this to work.

-Michael

Charter · 10-11-2004, 09:48 AM

An idea to try...

In robot_functions.php find:

PHP Code:

foreach ($file_content as $num => $line) {
    if (trim($line)) {
        if ($content_type == 'HTML' && trim($line) == PHPDIG_EXCLUDE_COMMENT) {
            $exclude = true;
        }
        else if (trim($line) == PHPDIG_INCLUDE_COMMENT) {
            $exclude = false;
            continue;
        }

and replace with:

PHP Code:

if ($file && eregi("index.php",$file)) {
    // tags must be on their own lines
    $the_exclude_comment = "<html>";
    $the_include_comment = "</html>";
}
else {
    $the_exclude_comment = PHPDIG_EXCLUDE_COMMENT;
    $the_include_comment = PHPDIG_INCLUDE_COMMENT;
}

foreach ($file_content as $num => $line) {
    if (trim($line)) {
        if ($content_type == 'HTML' && trim($line) == $the_exclude_comment) {
            $exclude = true;
        }
        else if (trim($line) == $the_include_comment) {
            $exclude = false;
            continue;
        }
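To see what the toggle in the snippet above does to a page, here is a stand-alone sketch of the same idea. The function name gto_strip_excluded is hypothetical, and the sketch leaves out the $content_type check, so it is a simplification rather than PhpDig's actual loop. The markers are passed in as parameters, so for index.php pages they would be "<html>" and "</html>" as suggested above.

```php
<?php
// Hypothetical stand-alone version of the exclude/include toggle.
// Lines between $excludeMarker and $includeMarker are dropped. Both
// markers must sit on their own lines, as in the original loop.
function gto_strip_excluded(array $lines, string $excludeMarker, string $includeMarker): array
{
    $exclude = false;
    $kept = [];

    foreach ($lines as $line) {
        $trimmed = trim($line);
        if ($trimmed === '') {
            continue; // skip blank lines
        }
        if ($trimmed === $excludeMarker) {
            $exclude = true;  // start dropping, marker line included
        } elseif ($trimmed === $includeMarker) {
            $exclude = false; // resume keeping after this marker
            continue;
        }
        if (!$exclude) {
            $kept[] = $trimmed;
        }
    }

    return $kept;
}
```

With "<html>" and "</html>" as the markers, everything in an index.php page is dropped from the text to be indexed, provided both tags really do appear alone on a line in the fetched source.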

Xavian · 10-11-2004, 12:03 PM

That's an awesome idea! I'm glad I checked here first! I'll try that when I get a chance this evening... Thanks!

Xavian · 10-11-2004, 08:37 PM

Thanks to your great pointers I got the spider and the engine working the way I needed it. The results are great. You can check it out at:

http://michael.nervestaple.com/gto/blogsearch/

Some good example keywords would be "wiz-war" or "game"...

I will be indexing more blogs tomorrow, and later on this week I'll post what I modified to show how I did it. I ended up having to modify spider.php as well, right before the spider calls the phpdigIndexFile function.

For now, I gotta catch some Zzzzzs...

-Michael

