11-23-2004, 08:36 AM | #1 |
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
Using a dictionary to spider pages
In the thread http://www.phpdig.net/forum/showthread.php?t=355 about spidering multi-byte encodings, we found that the only way to tokenize text in a language that has no word separators is to use a dictionary.
Such a dictionary would contain every word of the language, so it would be a very large file. Each word of the original text must be extracted using the dictionary so that it can be stored as a keyword. I tried to develop such a function, but I'm afraid it's not fast enough. In the example below I used an English text with the spaces removed. PHP Code:
Can anyone help me or give me advice on how to speed up or improve the function?
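For readers following along, here is a minimal sketch of one common way to approach this kind of tokenizing: greedy longest-match lookup. It is not the function from this post. The name `tokenize_with_dictionary`, its parameters, and the tiny sample dictionary are all assumptions made for illustration. The sketch assumes UTF-8 input and a dictionary stored as array keys, so each lookup is a hash test with `isset()` rather than a scan through a word list.

```php
<?php
// Hypothetical helper for illustration only, not the original poster's code.
// $dictionary:    array whose KEYS are the known words (e.g. built with array_flip()).
// $maxWordLength: length, in characters, of the longest word in the dictionary.
function tokenize_with_dictionary($text, array $dictionary, $maxWordLength)
{
    // Split the UTF-8 string into single characters once, up front.
    $chars  = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $length = count($chars);
    $tokens = array();
    $pos    = 0;

    while ($pos < $length) {
        $matched = false;
        // Never look further than the longest dictionary word or the end of the text.
        $limit = min($maxWordLength, $length - $pos);

        // Try the longest candidate first, then progressively shorter ones.
        for ($size = $limit; $size >= 1; $size--) {
            $candidate = implode('', array_slice($chars, $pos, $size));
            if (isset($dictionary[$candidate])) {
                $tokens[] = $candidate;
                $pos += $size;
                $matched = true;
                break;
            }
        }

        if (!$matched) {
            // No dictionary word starts here: skip one character and continue.
            $pos++;
        }
    }

    return $tokens;
}

// Usage with a made-up five-word dictionary (the longest word has 5 characters).
$sampleDictionary = array_flip(array('the', 'quick', 'brown', 'fox', 'he'));
print_r(tokenize_with_dictionary('thequickbrownfox', $sampleDictionary, 5));
// Tokens found: the, quick, brown, fox
```

Capping the candidate length at the longest dictionary word keeps the inner loop short no matter how long the text is. Greedy matching can still pick the wrong split when a long word hides a better sequence of shorter ones, so treat this as a starting point rather than a complete segmenter.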
__________________
Uchû Senshi Edomondo http://www.leijiverse.com http://shonen-kokoro.fr.st http://tsukanomanoharu.fr.st