Indexing "<word>-<word>"?
I haven't found this in the docs or the FAQs (or anywhere else for that matter) so I'm asking here.
How do I get PHPDig to index two (or more) words joined by a hyphen as one search item (as opposed to two search items)? For example: the web page contains "foo-bar". After indexing, I can search for "foo", "bar", "foo bar" but NOT "foo-bar". I'd like to be able to search for "foo-bar" as well. Suggestions? |
Here's what I've found so far:
According to the docs, dashes (and many other special characters) are allowed in indexes and searches since v1.8. Yet, in phpdig_functions.php there is a function called phpdigEpureText() that seems to be removing the special characters that the docs say are allowed. Ho, ho! There is also an entry in search_function.php that removes various characters from the search functionality! If you also remove the dash from $what_query_chars in this file and reindex, you can now search for words with dashes in them! At least it worked for me. |
The $what_query_chars variable negates a class of characters; the same goes for the phpdigEpureText function:
Code:
$what_query_chars = "[^".$phpdig_words_chars[PHPDIG_ENCODING]." \'.\_~@#$:&\%/;,=-]+";
Try searching on t-shirts in the online demo. When PhpDig finds a word containing a dash in the chunk it's trying to process, it will try to highlight it. Also, try running the following query, and then search on some of the resultant words:
Code:
# add your table prefix if needed
Note that when processing search requests, PhpDig displays the DISPLAY_SNIPPETS_NUM number of snippets, so if you are searching on several words, as soon as PhpDig hits DISPLAY_SNIPPETS_NUM, it quits looking for things to highlight. Also, if you set DISPLAY_SNIPPETS to false and DISPLAY_SUMMARY to true, PhpDig will not consider DISPLAY_SNIPPETS_NUM and just display the first words of a page, highlighting only if the search words are within the first words of a page. |
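To make the negated class above concrete, here is a minimal sketch, not PhpDig's actual code. It assumes a simplified word-character set (`a-z0-9`) in place of `$phpdig_words_chars[PHPDIG_ENCODING]`, uses `preg_replace` instead of the ereg family, and the helper name `sketchFilterQuery` is hypothetical.

```php
<?php
// Hypothetical stand-in for $phpdig_words_chars[PHPDIG_ENCODING].
$words_chars = 'a-z0-9';

// Negated class: one or more characters that are NOT allowed in a query.
// A dash placed last inside the class is a literal dash.
$with_dash    = '/[^' . $words_chars . ' \'._~@#$:&%\/;,=-]+/i';
$without_dash = '/[^' . $words_chars . ' \'._~@#$:&%\/;,=]+/i';

// Hypothetical helper: replace every run of disallowed characters with a space.
function sketchFilterQuery(string $pattern, string $query): string
{
    return trim(preg_replace($pattern, ' ', $query));
}

echo sketchFilterQuery($with_dash, 'foo-bar'), "\n";    // foo-bar
echo sketchFilterQuery($without_dash, 'foo-bar'), "\n"; // foo bar
```

With the dash listed in the class, "foo-bar" passes through as one term. Without it, the dash counts as a disallowed character and the query becomes the two terms "foo" and "bar".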
Quote:
When you removed the dash from the class of characters, you essentially replaced the dash in a word with a space, so if you search on foo-bar, PhpDig will then search on foo and/or bar, not the whole word foo-bar.
Okay, that explains why the results highlight "foo bar" and not "foo-bar". There is no "foo-bar" in the database tables. So how do I get phpDig to index "foo-bar"? I'm running phpDig 1.8.7. |
PhpDig v.1.8.7 should index foo-bar as a word, assuming that the dash is a literal dash and the word foo-bar isn't caught up in some JavaScript. Also, if the hyphenated word is longer than MAX_WORDS_SIZE, then it won't get inserted into the database table as a keyword. Try making a demo page with some hyphenated words, and after you index it, see if you can search and find the hyphenated words.
|
Quote:
Go to http://www.linuxnj.com/search/search.php and search for "omni-kuff". No go. "omni","kuff" and "omni kuff" will work fine. I'm going to start whittling the page in question to see if it's something in the page... |
Okay, thanks, I see what you mean. I was probably testing using a modified version by mistake. Anyway, in PhpDig v.1.8.7, find the phpdigCleanHtml function in robot_functions.php, look for the following line, and try removing the dash in the character class.
Code:
$text = eregi_replace("[*{}()\"\r\n\t-]+"," ",$text); |
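As a rough illustration of why that trailing dash matters, the sketch below applies the same character class with and without it. It is not PhpDig's code: `preg_replace` is used because `eregi_replace` was removed in PHP 7.0, and both function names are hypothetical.

```php
<?php
// Class as quoted in the thread: the trailing dash makes "-" one of the
// characters that get replaced by a space during HTML cleaning.
function sketchCleanWithDash(string $text): string
{
    return preg_replace('/[*{}()"\r\n\t-]+/', ' ', $text);
}

// Same class with the dash removed, so hyphenated words survive cleaning.
function sketchCleanWithoutDash(string $text): string
{
    return preg_replace('/[*{}()"\r\n\t]+/', ' ', $text);
}

$sample = 'omni-kuff 0-26-110318-0';
echo sketchCleanWithDash($sample), "\n";    // omni kuff 0 26 110318 0
echo sketchCleanWithoutDash($sample), "\n"; // omni-kuff 0-26-110318-0
```

Because this cleaning happens before words are stored, pages indexed with the old class have to be reindexed before a search on the hyphenated form can match.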
Perfect!
Thanks loads! Now let's see if I fixed the cron problem and everybody will be happy! :-) |
Hello
Same problem, but not resolved. I also have 1.8.7. Try a search for "0-26-110318-0" (an ISBN number): http://www.john-howe.com/search/search.php? template_demo=phpdig.html&result_page=search.php... The indexed page: http://www.john-howe.com/portfolio/g...hp?image_id=76 The ISBN number is under the picture. I can find it, but it isn't displayed with the hyphens... How can I correct the displayed results? Regards, Dom PS: I dropped the DB and reindexed the site to be sure, but that doesn't seem to have anything to do with the hyphen issue... |
|
Hello,
I'm confused too... In my robot_functions.php, around line 147, I have:
Code:
function phpdigCleanHtml($text) {
and around line 138 in search_function.php:
Code:
$what_query_chars = "[^".$phpdig_words_chars[PHPDIG_ENCODING]." \'.\_~@#$:&\%/;,=]+"; // epure chars \'._~@#$:&%/;,=-
What I can't understand is that when I look at the temp file in the "text_content" folder, the number is written without "-":
Code:
...SIBLEY HarperCollinsPublishers
I can't understand it... Many thanks for your help and time. Regards, Dominique |
This part is correct, no "-" after the "\t".
|
Hello,
Thanks... but I still have the problem. I replaced search_function.php with the original 1.8.7 file and kept robot_functions.php without the "-". I deleted and reindexed the page again, and in my temp file I again get the ISBN code without "-":
Code:
...SIBLEY HarperCollinsPublishers
http://www.john-howe.com/portfolio/g...hp?image_id=76
Sorry, but I'm really confused... Dom |
Spidering in progress... [Stop spider]
SITE : http://www.john-howe.com/
Exclude paths :
- ads/
- cgi-bin/
- fataneh/gallery/admin/
- flash/
- forum/
- guestbook/
- linkchecker/
- links/
- links/admin/
- mailinglist/
- news/pm/
- portfolio/gallery/admin/
- search/
- stuff/gallery/admin/
- webmail/
1:http://www.john-howe.com/portfolio/gallery/details.php?image_id=76
(time : 00:00:13)
No link in temporary table
links found : 1
http://www.john-howe.com/portfolio/gallery/details.php?image_id=76
Optimizing tables...
Indexing complete !
[Back] to admin interface.
Results 1-1, 1 total, on "ISBN" (0.05 seconds)
1. [100.00 %] :// John Howe :: Illustrator [ Portfolio ] / From Hobbiton to Mordor / Gandalf Before the Walls of Minas Tirith
limit to http://www.john-howe.com/, this path : portfolio/gallery/
...994 The Map of Tolkien's Middle-Earth Brian SIBLEY HarperCollinsPublishers ISBN - 0-26-110318-0 September 2, 1994 R****m House Audio: The Two Towers -...
Results 1-1, 1 total, on "0-26-110318-0" (0.02 seconds)
1. [100.00 %] :// John Howe :: Illustrator [ Portfolio ] / From Hobbiton to Mordor / Gandalf Before the Walls of Minas Tirith
limit to http://www.john-howe.com/, this path : portfolio/gallery/
... Map of Tolkien's Middle-Earth Brian SIBLEY HarperCollinsPublishers ISBN - 0-26-110318-0 September 2, 1994 R****m House Audio: The Two Towers - CD fro...
The only thing changed was:
Code:
//replace foo characters by space
Code:
//replace foo characters by space |
:bang: I dropped the database, the folder, everything, and made a fresh install with only the change in robot_functions.php and... nothing.
Still the damn same! Your version is 1.8.8 rc1, no? My version is 1.8.7, so maybe that's the difference... I don't know. Am I the only one with this problem on my version? I can't upgrade to 1.8.8 rc1 because of my host's DB version... Is it a bug in 1.8.7? Regards, Dom PS: I'm really sorry to bother you with this. |