Using a dictionary to spider pages

Edomondo · 11-23-2004, 08:36 AM

In post http://www.phpdig.net/forum/showthread.php?t=355 about spidering multi-byte encodings, we found out that the only way to tokenize text in a language that doesn't have word separators is to use a dictionary.

Such a dictionary would contain every word of the language, so it would be a very large file. Each word in the original text must be matched against the dictionary and extracted so it can be stored as a keyword.

I tried to develop such a function, but I'm afraid it's not fast enough. In the example below I used an English text with the spaces removed.

PHP Code:

$text = "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope.";

// Sample dictionary (entries stored in lowercase); the real one would hold every word of the language.
$dico = array();
$dico[1] = "nasa";
$dico[2] = "announced";
$dico[3] = "yesterday";
$dico[4] = "cancelling";
$dico[5] = "space";
$dico[6] = "shuttle";
$dico[7] = "servicing";
$dico[8] = "missions";
$dico[9] = "hubble";
$dico[10] = "telescope";
$dico[11] = "other";
$dico[12] = "words";
$dico[13] = "here";

// Lowercase the text once and cache its length instead of recomputing both on every pass.
$lower = strtolower($text);
$length = strlen($lower);

// At each position in the text, look for a dictionary word that starts there.
for ($j = 0; $j < $length; $j++)
    {
    foreach ($dico as $word)
        {
        if (substr($lower, $j, strlen($word)) == $word)
            {
            echo $word." ";
            break;
            }
        }
    }
The words are displayed with a space between them.

Can anyone help me or give me advice on how to speed up or improve the function?
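
One possible direction, offered only as a sketch: the loop above compares every dictionary word at every position, so the cost grows with the size of the dictionary. If the words are stored as array keys instead, each lookup becomes a single isset() check, and trying the longest candidate first at each position (greedy longest match) lets the scan jump past a word once it is found. The function name tokenize_longest_match and the variables $dict and $maxLen are made up for this illustration, and the code assumes single-byte text and a lowercase dictionary.

```php
<?php
// Hypothetical helper: greedy longest-match tokenizer.
// $dict has the dictionary words as KEYS, so isset() is a hash lookup
// rather than a scan over the whole word list.
function tokenize_longest_match($text, $dict, $maxLen)
{
    $text = strtolower($text);
    $len = strlen($text);
    $found = array();
    $pos = 0;

    while ($pos < $len) {
        $matched = false;
        // Try the longest possible candidate first, then shrink it.
        for ($l = min($maxLen, $len - $pos); $l >= 1; $l--) {
            $candidate = substr($text, $pos, $l);
            if (isset($dict[$candidate])) {
                $found[] = $candidate;
                $pos += $l;       // jump past the matched word
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            $pos++;               // no known word starts here; skip one byte
        }
    }
    return $found;
}

$words = array("nasa", "announced", "yesterday", "cancelling", "space",
               "shuttle", "servicing", "missions", "hubble", "telescope");

// array_flip() turns the words into keys (word => index).
$dict = array_flip($words);
// The longest word bounds how many candidates are tried per position.
$maxLen = max(array_map('strlen', $words));

$text = "NASAannouncedyesterdayitiscancellingallspaceshuttleservicingmissionstotheHubbleSpaceTelescope.";
$tokens = tokenize_longest_match($text, $dict, $maxLen);
echo implode(" ", $tokens), "\n";
// prints: nasa announced yesterday cancelling space shuttle servicing missions hubble space telescope
```

With this layout the work per position depends on the length of the longest word, not on the number of words in the dictionary. Greedy matching can still pick the wrong split when one word is a prefix of another, and strlen() and substr() count bytes, so a multi-byte encoding would presumably need the mbstring equivalents (mb_strlen(), mb_substr()) with the right encoding set.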

