04-19-2004, 10:10 PM | #1 |
Orange Mole
Join Date: Mar 2004
Posts: 48
|
Reduce duplicates in keywords table through more intelligent indexing
When words are indexed, punctuation such as , . : ; ’ ’s and ? should be dropped from the end of the word. In addition, words separated with / and – should be indexed as separate words rather than as one word. This will reduce the number of duplicates in the keywords table in the database and allow the spider to match the words being indexed against the common words list much more accurately. Depending upon the type of search the user employs, search results will be more accurate as well.
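To make the proposal concrete, here is a minimal sketch of the normalization described above. It assumes modern PHP with the PCRE functions, and `normalize_words` is a hypothetical helper written for illustration, not a PhpDig function. It splits on whitespace, / and hyphen or en dash, then drops trailing punctuation and a trailing possessive ’s. Case folding is left to the indexer.

```php
<?php
// Hypothetical helper (not part of PhpDig): turn raw text into index-ready words.
function normalize_words(string $text): array
{
    // Treat whitespace, "/", the en dash (U+2013) and "-" as word separators.
    $parts = preg_split('/[\s\/\x{2013}-]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $words = [];
    foreach ($parts as $part) {
        // 1. Drop trailing punctuation, including straight and curly apostrophes.
        $word = preg_replace('/[,.:;?!\'\x{2018}\x{2019}]+$/u', '', $part);
        // 2. Drop a trailing possessive such as 's or ’s.
        $word = preg_replace('/[\'\x{2018}\x{2019}]s$/u', '', $word);
        if ($word !== '') {
            $words[] = $word;
        }
    }
    return $words;
}

// Example: every variant of "following" collapses to the same keyword.
print_r(normalize_words("following, following: and/or bell's bells' bright-pink"));
```

Splitting on every hyphen is a blunt rule. A real indexer might want to keep the joined form as well so that exact-phrase searches still find it.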
Currently, when words are indexed, any punctuation that follows a word without a space in between is treated as part of the word. As a result, the keywords table in the database fills up with duplicates that are just variations of the same word. Examples: following following, following: following; following. following?

Other duplicates are created for other reasons. Words separated with a / to indicate an option, such as and/or and boy/girl, are indexed as a single word. Words that end with a ’ also create duplicates. Example: bells bells’ Words that include an apostrophe cause duplicates too. Example: bell bell’s Unfortunately, treating words that differ only by an s on the end as the same word could lead to indexing errors, so a certain number of duplicates will always exist.

Words separated with a – also create duplicates. Examples: Blackberry like / blackberry-like, bright pink / bright-pink

It would also be helpful if regular expressions were supported in the common_words.txt file. This would let you, for example, allow phone numbers and dates but no other numbers, or exclude all numbers. There is no need to index numbers that give dimensions, mathematical equations, or chart info. They just bog down the keywords table with useless data and slow down searches.

The result should be a cleaner keywords table and faster, more accurate search results.
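The regular-expression idea for common_words.txt could look something like the sketch below. Everything in it is an assumption for illustration: PhpDig has no such feature as far as this thread shows, and `is_excluded`, the `$allow` and `$exclude` pattern lists, and the phone and date formats are invented here. The check is meant to run on a raw token before any splitting on / or -.

```php
<?php
// Hypothetical filter (not a PhpDig feature): decide whether a token is skipped.
// $commonWords is keyed by word; $allow and $exclude hold PCRE patterns.
function is_excluded(string $word, array $commonWords, array $allow, array $exclude): bool
{
    if (isset($commonWords[$word])) {
        return true;                      // plain common word
    }
    foreach ($allow as $pattern) {
        if (preg_match($pattern, $word) === 1) {
            return false;                 // explicitly allowed, e.g. a date
        }
    }
    foreach ($exclude as $pattern) {
        if (preg_match($pattern, $word) === 1) {
            return true;                  // matches an exclusion pattern
        }
    }
    return false;
}

// Assumed sample data: keep US-style phone numbers and dates, drop other numbers.
$commonWords = ['the' => 1, 'and' => 1];
$allow   = ['#^\d{3}-\d{3}-\d{4}$#', '#^\d{1,2}/\d{1,2}/\d{4}$#'];
$exclude = ['#^\d[\d.,]*$#'];

var_dump(is_excluded('04/19/2004', $commonWords, $allow, $exclude)); // kept
var_dump(is_excluded('3.14', $commonWords, $allow, $exclude));       // skipped
```

Checking the allow list before the exclude list is what makes "allow phone numbers and dates but no other numbers" possible. With an empty allow list, the same exclude pattern drops all numbers.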
04-20-2004, 09:06 AM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. The punctuation is on there for exact matches, but perhaps this exact match is too exact.
To relax the exact match and drop a lot, but not all, of the punctuation from the end of a word, do the following. In phpdig_functions.php find:
Code:
$text = ereg_replace('[^'.$phpdig_words_chars[$encoding].' \\'._~@#$:&%/;,=-]+',' ',$text);
Code:
$text = ereg_replace('(['.$phpdig_words_chars[$encoding].'])[\\'._~@#$:&%/;,=-]+($|[[:space:]]$|[[:space:]]['.$phpdig_words_chars[$encoding].'])','\1\2',$text);
Code:
if (eregi($what_query_chars,$query_to_parse)) { $query_to_parse = eregi_replace($what_query_chars," ",$query_to_parse); }
Code:
$query_to_parse = ereg_replace('(['.$phpdig_words_chars[PHPDIG_ENCODING].'])[\\'.\_~@#$:&\%/;,=-]+($|[[:space:]]$|[[:space:]]['.$phpdig_words_chars[PHPDIG_ENCODING].'])','\1\2',$query_to_parse);
Code:
if ($option == "exact") { // there are two instances of this
Code:
$reg_strings = str_replace('@#@',' ',phpdigPregQuotes(str_replace('\\\','',implode('@#@', $query_for_phrase_array))));
Code:
$reg_strings = str_replace('@#@','.* ',phpdigPregQuotes(str_replace('\\\','',implode('@#@', $query_for_phrase_array))));
Code:
if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^['.$phpdig_words_chars[PHPDIG_ENCODING].'#$]',$key))
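For anyone reading this on a newer PHP: the ereg family used above was removed in PHP 7.0, so the same idea has to be written with the preg functions. The following is a rough sketch of the trailing-punctuation rule only, not a drop-in patch. `$wordChars` is an assumed stand-in for `$phpdig_words_chars[...]`, and `strip_trailing_punctuation` is a hypothetical name. It uses a lookahead instead of the second capture group, so the character after the punctuation is not consumed.

```php
<?php
// Assumption: a stand-in for PhpDig's per-encoding word character list.
$wordChars = '[:alnum:]';

// Hypothetical helper: remove a run of punctuation that directly follows a word
// character when the run sits at the end of the text or before whitespace.
function strip_trailing_punctuation(string $text, string $wordChars): string
{
    // Same punctuation set as the posted pattern, escaped for use with "~".
    $punct = preg_quote('\'._~@#$:&%/;,=-', '~');
    $pattern = '~([' . $wordChars . '])[' . $punct . ']+'
             . '(?=$|\s$|\s[' . $wordChars . '])~';
    return preg_replace($pattern, '$1', $text);
}

echo strip_trailing_punctuation("following, following: following; and/or end.", $wordChars);
```

As in the posted pattern, punctuation inside a word (the / in and/or) is left alone. Only punctuation at the end of a word is dropped.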