01-05-2004, 09:28 AM | #1 |
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
|
Japanese encoding : charset=shift_jis
Hi there! PhpDig is just great!
I'm considering improving the system so that it can spider and display different encodings at the same time, among them the Japanese encodings. I've read carefully the three other topics dedicated to encoding issues. I've thought of four main points:
- There are no spaces in Japanese, so an algorithm can't tell two different words apart, and it seems impossible to store keywords in the DB. But I can think of a way to work around it. There are three different types of characters in Japanese: Hiragana (26 signs), Katakana (26 signs) and Kanji (more than 50,000 signs). It is possible to tell them apart by referring to the codes of these characters. E.g. Katakana "re" (レ) is encoded with ¼; if the code of the second character of the encoding is between x and y, then the character is a Katakana, and the same for the others. But some words contain different types of characters at the same time, like サボる (katakana + hiragana, meaning "to skip school"), キャンプ場 (katakana + kanji, meaning "campsite") or 寝る (kanji + hiragana, meaning "to sleep").
- There are different Japanese encodings: Shift_JIS, ISO-2022-JP and EUC-JP. So how to crawl pages with different encodings?
- ア is the same as ア, イ as イ, ウ as ウ, カ as カ... Apart from these signs (about 50), no other matches can be made (unlike "â" matching "a").
Sounds pretty hard, but it is not. Any idea on how to do it? Can anyone give me a hand with this?
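The byte-range idea can be sketched like this, assuming EUC-JP encoded text. Note that in EUC-JP it is the first byte of a two-byte character that selects the JIS row, so hiragana sit in row 0xA4 and katakana in row 0xA5; the exact ranges below are my illustration, not from the post.

```php
<?php
// Sketch (assumption, not PhpDig code): classify a two-byte EUC-JP
// character by its FIRST byte, which selects the JIS X 0208 row.
function jp_char_type($byte1) {
    $b = ord($byte1);
    if ($b == 0xA4) return "hiragana";            // row 4: hiragana
    if ($b == 0xA5) return "katakana";            // row 5: katakana
    if ($b >= 0xB0 && $b <= 0xF4) return "kanji"; // rows 16+: kanji
    return "other";
}

// "れ" (hiragana re) is 0xA4 0xEC in EUC-JP; "レ" (katakana re) is 0xA5 0xEC.
print jp_char_type("\xA4") . "\n";  // hiragana
print jp_char_type("\xA5") . "\n";  // katakana
```

Mixed-type words like 寝る would still need the word as a whole to be kept together; this only labels individual characters.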
__________________
Uchû Senshi Edomondo http://www.leijiverse.com http://shonen-kokoro.fr.st http://tsukanomanoharu.fr.st Last edited by Edomondo; 01-05-2004 at 10:16 AM. |
01-06-2004, 05:33 AM | #2 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Assuming Shift_JIS, ISO-2022-JP and EUC-JP are different encodings for the same set of characters, a utility could be written, or may already be available, to convert both ISO-2022-JP and Shift_JIS to EUC-JP. The utility could be invoked based on the charset attribute of the meta tag, converting as necessary for storage in MySQL with charset ujis and using multi-byte string functions where needed. This method could also be used to convert from a number of encodings to UTF-8.
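A minimal sketch of that idea, assuming PHP's mbstring extension is available; the regex and function name here are illustrative, not PhpDig code:

```php
<?php
// Sketch (assumption): read the charset from an HTML meta tag and
// normalize the page to EUC-JP with mbstring before indexing.
function normalize_to_eucjp($html) {
    $charset = 'EUC-JP';  // default when no charset attribute is found
    if (preg_match('/charset=([a-zA-Z0-9_-]+)/i', $html, $m)) {
        $charset = strtoupper($m[1]);
    }
    if ($charset != 'EUC-JP') {
        $html = mb_convert_encoding($html, 'EUC-JP', $charset);
    }
    return $html;
}

$page = '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">';
$fixed = normalize_to_eucjp($page);
```

For ASCII-only input the conversion is a no-op, since the three encodings agree on the ASCII range.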
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
01-06-2004, 09:00 AM | #3 |
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
|
You're right, they are all different encodings for the same character set. The most common is probably Shift_JIS.
I suppose that developing such a utility wouldn't be a problem for me, but I'll be looking for something similar on the net first. But how to deal with the space issue? PhpDig won't be able to index words. I can't think of any way around this. Will it have to index each phrase separately as a single word?
01-06-2004, 11:10 AM | #4 | |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. In robot_functions.php there is $separators = " "; and this is what breaks keywords on spaces. You could add other characters to $separators, but I am not familiar enough with Japanese to suggest appropriate separators.
One other thing from php.net: Quote:
01-06-2004, 01:38 PM | #5 |
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
|
Hi! I found exactly what I needed!!
http://www.spencernetwork.org/jcode-LE/ There are functions to convert from/to EUC-JP, Shift_JIS & ISO-2022-JP (JIS), and others to convert between full-width & half-width characters. I haven't found time to test it yet. The documentation is in Japanese, but I can make a translation if anyone is interested. I've thought about separators in Japanese. There are half- and full-width spaces, half- and full-width periods, half- and full-width commas, half- and full-width apostrophes, ... Can I add several separators in the same string? I guess I must enter unencoded characters, e.g. @ for a full-width space, correct? Or do I need to separate each separator with a sign (comma, ...)? BTW, thanks for your help Charter. It's greatly appreciated.
01-06-2004, 01:59 PM | #6 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. For example, say the "word" was "big/long/phrase and then some" and you wanted to break this up. You could set $separators = " /"; so that keywords would be made on spaces and slashes. More info on strtok can be found here. Also, I'd be interested in the translation.
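A runnable sketch of that loop; the variable names follow the PhpDig convention mentioned above, but this is not PhpDig's exact code:

```php
<?php
// Sketch: tokenize on both spaces and slashes with strtok(),
// using a PhpDig-style $separators string.
$separators = " /";
$text = "big/long/phrase and then some";

$tokens = array();
$tok = strtok($text, $separators);
while ($tok !== false) {   // strict test so a token of "0" is not dropped
    $tokens[] = $tok;
    $tok = strtok($separators);
}
print implode(",", $tokens);  // big,long,phrase,and,then,some
```

Each character in $separators acts as its own delimiter, which matters later in this thread.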
01-07-2004, 07:41 AM | #7 |
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
|
I've done a translation of the Jcode-LE readme.txt file. It might not always be clear, as neither English nor Japanese is my mother tongue :-( I've also indicated where there should have been Japanese characters (replaced by *** due to the txt format).
It would be great if future versions of PhpDig accepted several encodings in both indexing and the interface. Thanks, strtok() is clearer to me now.
01-08-2004, 07:51 AM | #8 |
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
|
I did some testing.
Jcode works great for converting from one Japanese encoding to another. It is definitely what I was looking for! strtok() can use a separator pattern made of more than one character. But: PHP Code:
PHP Code:
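Both code blocks were lost from this post. A plausible sketch of the behaviour being tested (an assumed reconstruction, not the original code): strtok() treats its separator argument as a set of single characters, not as one multi-character delimiter.

```php
<?php
// Sketch (assumed reconstruction): passing "/*" to strtok() splits on
// "/" and on "*" separately, not on the two-character sequence "/*".
$text = "foo/*bar*/baz";
$tokens = array();
$tok = strtok($text, "/*");
while ($tok !== false) {
    $tokens[] = $tok;
    $tok = strtok("/*");
}
print implode(",", $tokens);  // foo,bar,baz — "/*" was not kept whole
```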
Now I'm going to need help configuring $phpdig_words_chars and $phpdig_string_subst correctly.
Last edited by Edomondo; 01-08-2004 at 07:56 AM. |
01-08-2004, 08:27 AM | #9 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. There is a fix here, thanks to janalwin. It avoids the problem where $tok evaluates to false when there is a zero in the text. I'm not sure why you are adding the second while loop. With while ($tok !== FALSE) all of the tokens will print, but with while ($tok) printing stops after "string".
PHP Code:
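The code block was lost from this post; the fix being described might look like this (an assumed reconstruction):

```php
<?php
// Sketch (assumed reconstruction): with a loose truthiness test,
// a token of "0" evaluates to false in PHP and ends the loop early.
$text = "some string 0 more words";

// Loose test: stops as soon as strtok() returns "0".
$loose = array();
$tok = strtok($text, " ");
while ($tok) {
    $loose[] = $tok;
    $tok = strtok(" ");
}

// Strict test: only a real FALSE (no tokens left) ends the loop.
$strict = array();
$tok = strtok($text, " ");
while ($tok !== false) {
    $strict[] = $tok;
    $tok = strtok(" ");
}

print implode(",", $loose) . "\n";   // some,string
print implode(",", $strict) . "\n";  // some,string,0,more,words
```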
01-08-2004, 09:33 AM | #10 |
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
|
Damn! I'm afraid you're right. "/*" is used to tokenize the string just like "*/", "/" & "*"... So it's not working as I'd like it to.
It should accept as separators:
- a single character (space, comma, period...)
- a multi-byte character code (where @ is space, A is comma, B is period...)
I've run out of ideas on this issue.
01-08-2004, 01:04 PM | #11 |
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
|
OK, I found how to replace the punctuation.
In the previous example, using /* to tokenize the string can be achieved this way: PHP Code:
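The code block was lost from this post; it probably did something along these lines (an assumed reconstruction, not the original post's code): replace each multi-character separator with a space first, then tokenize on spaces.

```php
<?php
// Sketch (assumed reconstruction): handle "/*" as one unit by
// replacing it with a space before tokenizing.
$text = "foo/*bar/*baz";
$text = str_replace("/*", " ", $text);

$tokens = array();
$tok = strtok($text, " ");
while ($tok !== false) {
    $tokens[] = $tok;
    $tok = strtok(" ");
}
print implode(",", $tokens);  // foo,bar,baz
```

The same substitution step would work for two-byte Japanese punctuation codes, which is what makes the approach useful here.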
Now, how can I configure $phpdig_string_subst['EUC-JP'] and $phpdig_string_chars['EUC-JP']? It's still a bit confusing to me. Every character composing a multi-byte character will go in $phpdig_string_subst, right? e.g. : ‚ÆÄ*l‹CÌ_é–Ÿ‰æÅ... And $phpdig_string_chars['EUC-JP'] = '[:alnum:]'; seems correct as all characters will be converted to half-width EUC-JP characters during indexing.
01-08-2004, 04:39 PM | #12 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. I haven't done any benchmarks so I'm not sure, but for a lot of processing the following might be faster:
PHP Code:
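The snippet was lost from this post. Given the "might be faster" remark, one plausible reading is a single-pass strtr() with an array of replacement pairs, which avoids repeated str_replace() calls over the same text; the pairs below are illustrative only:

```php
<?php
// Sketch (assumption): strtr() with an array applies many
// replacements in one pass over the string.
$pairs = array(
    "/*" => " ",
    "*/" => " ",
);
$text = "foo/*bar*/baz";
print strtr($text, $pairs);  // foo bar baz
```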
This seems backwards when reading the instructions in the config.php file, but with encodings that don't have Latin counterparts, it's the way I figured out to make PhpDig version 1.6.5 work with other languages. Is your MySQL charset ujis? You can find some MySQL charsets and their descriptions here.
01-09-2004, 06:59 AM | #13 |
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
|
Actually, the search engine I'm aiming at will support several languages and encodings. Japanese characters will be processed like non-multi-byte characters. The decoding to Japanese characters will be done at the end, in the browser. Storage in the DB and in plain TXT files will contain non-encoded characters.
So I prefer using a common charset in MySQL, not a specific one for Japanese. I've set a list of all the possible separator (non encoded) in Japanese. For Shift_Jis encoding, there will be: @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ € ‚ ƒ " … * ‡ ˆ ‰ * ‹ Œ Ž ' ' " " o - - ˜ ™ š › œ ž Ÿ ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª " ¸ ¹ º " ¼ ½ ¾ ¿ È É Ê Ë Ì Í Î Ú Û Ü Ý Þ ß * á â ã ä å æ ç è ð ñ ò ó ô õ ö ÷ ü What would be the fastest way to achieve this? $phpdig_string_subst for Shift_Jis would look like: PHP Code:
Building a correct $phpdig_words_chars wouldn't be a problem either. I'll post a try soon for both Shift_JIS and EUC-JP.
Last edited by Edomondo; 01-09-2004 at 07:03 AM. |
01-09-2004, 07:35 AM | #14 | |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Quote:
The $phpdig_string_subst['Shift_Jis'] variable posted isn't correct. There is no need to include all the Latin letters in the variable. Setting $phpdig_string_subst['EUC-JP'] = 'Q:Q,q:q'; is all that is necessary if no "transformation" between characters is needed. If you are looking to incorporate multiple encodings, you might consider UTF-8 instead.
01-09-2004, 01:41 PM | #15 |
Orange Mole
Join Date: Jan 2004
Location: In outer space
Posts: 37
|
Hi. I meant that each Japanese character will be treated as a pair of single-byte characters.
I can't use UTF-8 because Jcode-LE only copes with EUC-JP, Shift_JIS and ISO-2022-JP (JIS). The indexed pages can only be encoded in one of those encodings; they will all be converted to EUC-JP in this project. The content of indexed pages will have to:
- be converted to the reference encoding of the site (EUC-JP in this case) using Jcode-LE;
- get the punctuation signs replaced by spaces with strtr or str_replace.
Is that correct? Will it be enough to make it work? As parts of phrases (rather than words) will be indexed, the search will be performed on parts of words. This is the list of separators for EUC-JP: ¢£ ¢¤ ¢¥ ¢¦ ¢§ ¢¨ ¢© ¢ª ¢« ¡¦ ¢_ ¢® ¢º ¢» ¢¼ ¢½ ¢¾ ¢¿ ¢À ¢Á ¢Ê ¢Ë ¢Ì ¢Í ¢Î ¢Ï ¢Ð ¢Ü ¢Ý ¢Þ ¢ß ¢* ¢á ¢â ¢ã ¢ä ¢å ¡¢ ¢æ ¡£ ¢ç ¡¤ ¢è ¡¥ ¢é ¢ê ¡§ ¡¨ ¡© ¡ª ¡« ¡¬ ¡_ ¡® ¢ò ¡¯ ¢ó ¡° ¢ô ¡± ¢õ ¡² ¢ö ¡³ ¢÷ ¡´ ¢ø ¡µ ¢ù ¡¶ ¡· ¡¸ ¡¹ ¡º ¢þ ¡» ¡¼ ¡½ ¡¾ ¡¿ ¡À ¡Á ¡Â ¡Ã ¡Ä ¡Å ¡Æ ¡Ç ¡È ¡É ¡Ê ¡Ë ¡Ì ¡Í ¡Î ¡Ï ¡Ð ¡Ñ ¡Ò ¡Ó ¡Ô ¡Õ ¡Ö ¡× ¡Ø ¡Ù ¡Ú ¡Û ¡Ü ¡Ý ¡Þ ¡ß ¡¦ ¡* ¡á ¡â ¡ã ¡ä ¡å ¡æ ¡ç ¡è ¡é ¡ê ¡ë ¡ì ¡* ¡î ¡ï ¡ð ¡ñ ¡ò ¡ó ¡ô ¡õ ¡ö ¡÷ ¡ø ¡ù ¡ú ¡û ¡ü ¡ý ¡þ ¢¡ ¡¢ ¢¢ ¡£ ¢£ ¡¤ ¢¤ ¡¥ ¢¥ ¢¦ ¡§ ¢§ ¡¨ ¢¨ ¡© ¢© ¡ª ¢ª ¡« ¢« ¡¬ ¢¬ ¡_ ¢_ ¡® ¢® ¡¯ ¡° ¡± ¡² ¡³ ¡´ ¡µ ¡¶ ¡· ¡¸ ¡¹ ¡º ¢º ¡» ¢» ¡¼ ¢¼ ¡½ ¢½ ¡¾ ¢¾ ¡¿ ¢¿ ¡À ¢À ¡Á ¢Á ¡Â ¡Ã ¡Ä ¡Å ¡Æ ¡Ç ¡È ¡É ¡Ê ¢Ê ¡Ë ¢Ë ¡Ì ¢Ì ¡Í ¢Í ¡Î ¢Î ¡Ï ¢Ï ¡Ð ¢Ð ¡Ñ ¡Ò ¡Ó ¡Ô ¡Õ ¡Ö ¡× ¡Ø ¡Ù ¡Ú ¡Û ¡Ü ¢Ü ¡Ý ¢Ý ¡Þ ¢Þ ¡ß ¢ß ¡* ¢* ¡á ¢á ¡â ¢â ¡ã ¢ã ¡ä ¢ä ¡å ¢å ¡æ ¢æ ¡ç ¢ç ¡è ¢è ¡é ¢é ¡ê ¢ê ¡ë ¡ì ¡* ¡î ¡ï ¡ð ¡ñ ¡ò ¢ò ¡ó ¢ó ¡ô ¢ô ¡õ ¢õ ¡ö ¢ö ¡÷ ¢÷ ¡ø ¢ø ¡ù ¢ù ¡ú ¡û ¡ü ¡ý ¡þ ¢þ ¢¡ I also set up $phpdig_words_chars for EUC-JP and Shift_JIS: PHP Code:
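The settings block was lost from this post. A hypothetical sketch of what $phpdig_words_chars entries for these encodings could look like, expressed as raw byte ranges the way a character class would use them; the exact ranges are my assumption, not the poster's values:

```php
<?php
// Sketch (assumption, not the original post's settings): character-class
// fragments covering ASCII alphanumerics plus the multi-byte lead/trail
// byte ranges of each encoding.
$phpdig_words_chars = array();
$phpdig_words_chars['EUC-JP']    = '[:alnum:]\xa1-\xfe';
$phpdig_words_chars['Shift_JIS'] = '[:alnum:]\x81-\x9f\xe0-\xfc';
```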