PhpDig.net

Old 02-22-2005, 07:19 AM   #1
Edomondo
Orange Mole
 
Join Date: Jan 2004
Location: In outer space
Posts: 37
Auto language guesser

First, don't expect too much from this post: I've only done half of the job (the easiest half).

Now that PhpDig can spider multiple encodings, it can also spider multilingual sites, and you will probably want to distinguish the language of each page.

There are several tools for guessing languages and, luckily, a few of them are free!
I came across Languid, a statistical language identifier by Maciej Ceglowski (http://languid.cantbedone.org/).
It's a great tool that can identify 72 languages with high accuracy.
It was originally written in Perl (the source can be found at http://search.cpan.org/~mceglows/).

I wrote a small function that guesses the language of a text, based on Languid's XML API.

Now I need help integrating it into PhpDig. OK, it will slow down the spidering process a little, but it would be worth it. We would need to add this function to robot_functions.php and create a new field in the spider table to store the language ID.
Or maybe someone could write a PHP port of Maciej's original Perl script.

Anyone ready to give me a hand with this?
Attached Files
File Type: txt guess_language.php.txt (5.5 KB, 37 views)
Old 02-28-2005, 05:09 AM   #2
Edomondo
Second step: assign languages to your indexed pages.
Download both files attached here.
Upload them to your PhpDig admin directory.

Then add a language column to the spider table in MySQL (add your table prefix if necessary):

Code:
ALTER TABLE `spider` ADD `language` CHAR(2) NOT NULL;
Log in to the admin and open find_language.php in your browser. It will go through your pages, trying to guess the language of each page that doesn't have a language set yet. It uses the text stored in text_context, so this feature only works if text storage is activated in includes/config.php:

PHP Code:
define('CONTENT_TEXT',1); 
(if CONTENT_TEXT is set to 0, then change it to 1 and respider your sites)
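
The selection step described above (picking only the pages that have no language yet) could be sketched roughly as follows. This is not the attached find_language.php: the function name untagged_pages_query is made up for illustration, and it assumes that an unset language is stored as an empty string, which is what MySQL normally puts in existing rows when a CHAR(2) NOT NULL column is added without a default.

```php
<?php
// Hypothetical helper, not part of PhpDig or the attached files.
// Builds the query for pages whose language has not been set yet.
function untagged_pages_query($prefix = '')
{
    // The table prefix goes straight into the SQL, so only allow safe characters.
    if (!preg_match('/^[A-Za-z0-9_]*$/', $prefix)) {
        throw new InvalidArgumentException('Invalid table prefix');
    }
    // Assumption: an unset language is stored as an empty string.
    return "SELECT * FROM `{$prefix}spider` WHERE `language` = ''";
}

echo untagged_pages_query('phpdig_'), "\n";
```

The 'phpdig_' prefix here is only an example; use whatever prefix your installation has, or none.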

This will take a while and, unfortunately, the results are not always accurate, so you may want to set languages manually instead.
First, open set_language.php in a text editor and list in the $lang_to_set array only the languages you will index. Example:

PHP Code:
$lang_to_set = array("en", "ja", "fr"); // English, Japanese & French 
Upload the file by FTP in ASCII mode to [PHPDIG_DIR]/admin and open it in your browser.
You can set a language for a whole site or just for some subdirectories.
Each link is listed with its language value, so you can check that everything is OK.

Please keep in mind that I am far from being a strong scripter. Many people on this forum could have written code that is a thousand times simpler and neater.

Don’t hesitate to post bug reports, improvements...

Next step: build the pull-down menu for selecting a language and change search_functions.php to support this feature.
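
As a starting point for that menu, here is a minimal sketch that builds a select box from the same kind of array as $lang_to_set. None of this is PhpDig code: the function name language_select, the form field name "language" and the $labels array are all assumptions, and hooking the chosen value into search_functions.php is still to be done.

```php
<?php
// Hypothetical sketch, not PhpDig code: builds a <select> for the search form.
function language_select(array $lang_to_set, array $labels, $selected = '')
{
    $html = '<select name="language">' . "\n";
    // An empty value means "do not filter by language".
    $html .= '  <option value="">All languages</option>' . "\n";
    foreach ($lang_to_set as $code) {
        // Fall back to the two-letter code when no label is given.
        $label = isset($labels[$code]) ? $labels[$code] : $code;
        $attr = ($code === $selected) ? ' selected="selected"' : '';
        $html .= sprintf(
            '  <option value="%s"%s>%s</option>' . "\n",
            htmlspecialchars($code),
            $attr,
            htmlspecialchars($label)
        );
    }
    return $html . '</select>';
}

$lang_to_set = array("en", "ja", "fr");
$labels = array("en" => "English", "ja" => "Japanese", "fr" => "French");
echo language_select($lang_to_set, $labels, "ja"), "\n";
```

Reusing the two-letter codes as option values keeps the form in line with the CHAR(2) language column, so the submitted value can be compared to it directly.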
Attached Files
File Type: txt find_language.php.txt (7.2 KB, 25 views)
File Type: txt set_language.php.txt (5.3 KB, 23 views)
All times are GMT -8.