Old 01-05-2004, 09:28 AM   #1
Edomondo
Japanese encoding: charset=shift_jis

Hi there! PhpDig is just great!

I'm considering improving the system so that it can spider and display pages in different encodings at the same time, including the Japanese encodings.

I've carefully read the three other topics dedicated to encoding issues.

I've thought of a few main points:
- There are no spaces in Japanese, so an algorithm can't tell two different words apart, and it seems impossible to store keywords in the DB. But I can think of a way to work around it (see the sketch after this list). There are 3 different types of characters in Japanese:
Hiragana (46 signs)
Katakana (46 signs)
Kanji (more than 50,000 signs)
It is possible to tell them apart by referring to the codes of these characters. E.g.: the katakana "re" (レ) is encoded as a two-byte sequence; if the code of the second byte of the encoding is between x and y, then the character is a katakana. Same for the others.
But some words contain different types of characters at the same time, like サボる (katakana + hiragana, means "not to attend school"), キャンプ場 (katakana + kanji, means "campsite"), 寝る (kanji + hiragana, means "to sleep").
- There are different Japanese encodings: Shift_JIS, ISO-2022-JP and EUC-JP. So how can pages with different encodings be crawled?
- Half-width ｱ is the same as full-width ア, ｲ as イ, ｳ as ウ, ｶ as カ... Apart from these signs (about 50), no other matches can be made (unlike, say, "à" or "â" being treated like "a").
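As a rough sketch of the byte-range idea from the first point (the EUC-JP byte ranges below are my own assumptions, not anything taken from PhpDig):

PHP Code:
<?php
// Rough classifier for a single two-byte EUC-JP character.
// The byte ranges are approximate assumptions for the JIS X 0208 rows.
function classify_eucjp_char($char)
{
    $first = ord($char[0]);    // first byte of the pair
    if ($first == 0xA4) {
        return 'hiragana';     // row 4
    } elseif ($first == 0xA5) {
        return 'katakana';     // row 5
    } elseif ($first >= 0xB0) {
        return 'kanji';        // rows 16 and up (approximate)
    }
    return 'other';            // punctuation, full-width Latin, etc.
}
?>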

Sounds pretty hard, but it is not.

Any ideas on how to do it? Can anyone give me a hand with this?

Old 01-06-2004, 05:33 AM   #2
Charter
Hi. Assuming Shift_JIS, ISO-2022-JP and EUC-JP are different encodings for the same set of characters, then a utility could be written, or may already be available, to convert both ISO-2022-JP and Shift_JIS to EUC-JP. The utility could be invoked based on the charset attribute of the meta tag, converting as necessary for storage in MySQL with charset ujis and utilizing multi-byte string functions where needed. This method could also be used to convert from a number of encodings to UTF-8.
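As an illustration only, here is a sketch of such a conversion step using PHP's mbstring functions; the helper name to_eucjp is hypothetical and not part of PhpDig:

PHP Code:
<?php
// Convert a fetched page to EUC-JP, based on the charset taken from
// its meta tag; fall back to mbstring's detection when none is found.
function to_eucjp($html, $meta_charset = '')
{
    $from = ($meta_charset !== '') ? $meta_charset : 'auto';
    return mb_convert_encoding($html, 'EUC-JP', $from);
}

// e.g. $page = to_eucjp($raw_html, 'SJIS');
?>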
Old 01-06-2004, 09:00 AM   #3
Edomondo
You're right, they are all different encodings for the same character set. The most common is probably Shift_JIS.
I suppose that developing such a utility wouldn't be a problem for me, but I'll be looking for something similar on the net.

But how to deal with the space issue?
PhpDig won't be able to index words. I can't think of any way to work around this.
Will it have to index each phrase separately as a single word?
Old 01-06-2004, 11:10 AM   #4
Charter
Hi. In robot_functions.php there is $separators = " "; and this is what breaks keywords on spaces. You could add other characters to $separators, but I am not familiar enough with Japanese to suggest appropriate separators.
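Purely to show where such a change would go (the added characters are an example, and multi-byte separators come with a caveat that shows up later in this thread):

PHP Code:
// In robot_functions.php: break keywords on more than just the space.
// "/" and "." are example additions only; note that strtok treats each
// byte of this string as a separate separator.
$separators = " /.";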

One other thing from php.net:

Quote:
Character encodings that work with PHP:
ISO-8859-*, EUC-JP, UTF-8

Character encodings that do NOT work with PHP:
JIS, SJIS
Old 01-06-2004, 01:38 PM   #5
Edomondo
Hi! I found exactly what I needed!!
http://www.spencernetwork.org/jcode-LE/
There are functions to convert between EUC-JP, Shift_JIS and ISO-2022-JP (JIS), and others to convert between full-width and half-width characters.
I haven't had time to test it yet. The documentation is in Japanese, but I can make a translation if anyone is interested.

I've thought about separators in Japanese. There are half- and full-width spaces, half- and full-width periods, half- and full-width commas, half- and full-width apostrophes, ...
Can I add several separators in the same string? I guess I must enter the raw (unencoded) characters, e.g. @ for the full-width space, correct? Or do I need to separate each separator with a sign (a comma...)?

BTW, thanks for your help Charter. It's greatly appreciated.
Old 01-06-2004, 01:59 PM   #6
Charter
Hi. For example, say the "word" was "big/long/phrase and then some" and you wanted to break this up. You could set $separators = " /"; so that keywords would be made on spaces and slashes. More info on strtok can be found here. Also, I'd be interested in the translation.
Old 01-07-2004, 07:41 AM   #7
Edomondo
I've done a translation of the Jcode-LE readme.txt file. It might not always be clear, as neither English nor Japanese is my mother tongue :-( I've also indicated where there should have been Japanese characters (replaced by *** due to the txt format).

It would be great if future versions of PhpDig accepted several encodings, in both indexing and the interface.

Thanks, strtok() is clearer to me now.
Attached Files
File Type: txt readme_en.txt (5.8 KB, 25 views)
Old 01-08-2004, 07:51 AM   #8
Edomondo
I did some testing.
Jcode works great for converting from one Japanese encoding to another. It is definitely what I was looking for!

strtok() seems to be able to use a separator pattern made of more than one character.

But:

PHP Code:
<?php
$string = "This/*is/*an/*example/*string";
$separator = "/*";
/* break the string wherever a separator character is found */
$tok = strtok($string, $separator);
while ($tok) {
    echo "Word=$tok<br />";
    $tok = strtok($separator);
}
?>
Must be replaced by:

PHP Code:
<?php
$string = "This/*is/*an/*example/*string";
$separator = "/*";

$tok = strtok($string, $separator);
while ($tok !== FALSE)
{
    $toks[] = $tok;
    $tok = strtok($separator);
}

while (list($k, $v) = each($toks))
{
    echo "Word=$v<br />";
}
?>
So, it might be possible to use multi-byte characters to separate words. Am I right?

Now I'm going to need help configuring correctly $phpdig_words_chars and $phpdig_string_subst.

Old 01-08-2004, 08:27 AM   #9
Charter
Hi. There is a fix here, thanks to janalwin. It avoids the problem where $tok evaluates to false when a zero appears in the text. I'm not sure why you are adding the second while loop. With while ($tok !== FALSE) all of the $tok values will print, but with while ($tok) printing stops after "string".
PHP Code:
$string "This/*is/*an/*example/*string/0/and/some*more*text";
$separator "/*";
$tok strtok($string$separator);
while (
$tok !== FALSE) { // try with while ($tok) to compare
   
echo "Word=$tok<br />";
   
$tok strtok($separator);

Note how $separator makes strtok tokenize on / and on * individually: it breaks whenever any one of those characters is found, not on the two-character sequence /*. With this string the output may look like it only breaks on /*, but that is not the case.
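For what it's worth, if the two-character sequence /* really had to be treated as one unit, preg_split (unlike strtok) takes its pattern as a whole; a quick sketch, not something PhpDig itself does:

PHP Code:
<?php
// Split only on the literal two-character sequence "/*".
$string = "This/*is/*an/*example/*string/0/and/some*more*text";
$words  = preg_split('~/\*~', $string);
foreach ($words as $word) {
    echo "Word=$word<br />";
}
?>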
Old 01-08-2004, 09:33 AM   #10
Edomondo
Damn! I'm afraid you're right. "/*" is used to tokenize the string just like "*/", "/" and "*"... So it's not working as I'd like it to.

It should use as separators:
- a single character (space, comma, period...)
- a multi-byte character code (in Shift_JIS, @ is the visible part of the full-width space, A of the full-width comma, B of the full-width period...)

I've run short of ideas on this issue.
Old 01-08-2004, 01:04 PM   #11
Edomondo
OK, I found out how to replace the punctuation.
In the previous example, tokenizing the string on /* can be achieved this way:

PHP Code:
<?php
$string = "This/*is/*an/*example/*0/*string.";
$separator = " ";

$replace_separator = array("/*" => $separator,
                           "."  => $separator);

$string = trim(strtr($string, $replace_separator));

$tok = strtok($string, $separator);
while ($tok !== FALSE) { // try with while ($tok) to compare
    echo "Word=$tok<br />";
    $tok = strtok($separator);
}
?>
I guess I'll also have to give MAX_WORDS_SIZE the highest value possible.

Now, how can I configure $phpdig_string_subst['EUC-JP'] and $phpdig_words_chars['EUC-JP']? It's still a bit confusing to me.

Each byte that makes up a multi-byte character will go in $phpdig_string_subst, right?
e.g.: ‚ÆÄ*l‹CÌ_é–Ÿ‰æÅ...

And $phpdig_words_chars['EUC-JP'] = '[:alnum:]'; seems correct, as all characters will be converted to half-width EUC-JP characters during indexing.
Old 01-08-2004, 04:39 PM   #12
Charter
Hi. I haven't done any benchmarks so I'm not sure, but for a lot of processing the following might be faster:
PHP Code:
$separator " ";
$string "This/*is/*an/*example/*0/*string.";
$string str_replace("/*"," ",$string);
$tok strtok($string$separator);
while (
$tok !== FALSE) {
   echo 
"Word=$tok<br />";
   
$tok strtok($separator);

As for the $phpdig_string_subst and $phpdig_words_chars variables, $phpdig_string_subst['EUC-JP'] = 'Q:Q,q:q'; and $phpdig_words_chars['EUC-JP'] = '[:alnum:]ÆÄ*l‹CÌ_é–Ÿ‰æÅ...';

This seems backwards when reading the instructions in the config.php file, but with encodings that don't have Latin counterparts, it's the way I figured to make PhpDig version 1.6.5 work with other languages.

Is your MySQL charset ujis? You can find some MySQL charsets and their descriptions here.
Old 01-09-2004, 06:59 AM   #13
Edomondo
Actually, the search engine I'm aiming at will support different languages and encodings. Japanese characters will be processed like non-multi-byte characters; the decoding to Japanese characters will only happen at the end, in the browser. Storage in the DB and plain TXT files will contain the raw, undecoded characters.
So I prefer using a common charset in MySQL, not one specific to Japanese.

I've made a list of all the possible separators (raw, undecoded) in Japanese.
For the Shift_JIS encoding, they will be:
@
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
[
\
]
^
_
`
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
{
|
}
~

€

‚
ƒ
"
…
*
‡
ˆ
‰
*
‹
Œ

Ž


'
'
"
"
o
-
-
˜
™
š
›
œ

ž
Ÿ

¡
¢
£
¤
¥
¦
§
¨
©
ª
"

¸
¹
º
"
¼
½
¾
¿
È
É
Ê
Ë
Ì
Í
Î
Ú
Û
Ü
Ý
Þ
ß
*
á
â
ã
ä
å
æ
ç
è
ð
ñ
ò
ó
ô
õ
ö
÷
ü

What would be the fastest way to achieve this?

$phpdig_string_subst for Shift_Jis would look like:

PHP Code:
$phpdig_string_subst['Shift_Jis'] = 'A:A,a:a,B:B,b:b,C:C,c:c,D:D,d:d,E:E,e:e,F:F,f:f,G:G,g:g,H:H,h:h,I:I,i:i,J:J,j:j,K:K,k:k,L:L,l:l,M:M,m:m,N:N,n:n,O:O,o:o,P:P,p:p,Q:Q,q:q,R:R,r:r,S:S,s:s,T:T,t:t,U:U,u:u,V:V,v:v,W:W,w:w,X:X,x:x,Y:Y,y:y,Z:Z,z:z'
Is that correct?

Building a correct $phpdig_words_chars wouldn't be a problem either. I'll post an attempt soon for both Shift_JIS and EUC-JP.

Old 01-09-2004, 07:35 AM   #14
Charter
Quote:
Japanese characters will be processed like non multi-byte characters. The decoding to Japanese characters will be done at the end in the browser.
Hi. I'm not sure what you mean. Are you planning on storing HTML entities instead?

The $phpdig_string_subst['Shift_Jis'] variable posted isn't correct. There is no need to include all the Latin letters in the variable. Setting $phpdig_string_subst['EUC-JP'] = 'Q:Q,q:q'; is all that is necessary if no "transformation" between characters is needed.

If you are looking to incorporate multiple encodings, you might consider UTF-8 instead.
Old 01-09-2004, 01:41 PM   #15
Edomondo
Hi. I meant that each Japanese character will be treated as a pair of single-byte characters.

I can't use UTF-8 because Jcode-LE only copes with EUC-JP, Shift_JIS and ISO-2022-JP (JIS). The indexed pages can only be encoded in one of those encodings, and they will all be converted to EUC-JP in this project.

The content of indexed pages will have to:
- be converted to the reference encoding of the site (EUC-JP in this case) using Jcode-LE;
- have the punctuation signs replaced by spaces with strtr or str_replace (see the sketch below).
Is that correct? Will it be enough to make it work?
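As a minimal sketch of those two steps chained together, with mb_convert_encoding standing in for the Jcode-LE conversion (whose own function names aren't shown here) and $raw_page standing for a fetched Shift_JIS page:

PHP Code:
<?php
// Step 1 (stand-in): convert the fetched page to EUC-JP.
// In the real setup this would be done with Jcode-LE instead.
$text = mb_convert_encoding($raw_page, 'EUC-JP', 'SJIS');

// Step 2: replace punctuation with plain spaces so the existing
// space-based tokenizing can split the text. The punctuation must
// be expressed in EUC-JP too, so convert it from this snippet's UTF-8.
$punct = array("、", "。", "　"); // ideographic comma, full stop and space
foreach ($punct as $p) {
    $text = str_replace(mb_convert_encoding($p, 'EUC-JP', 'UTF-8'), " ", $text);
}

// Tokenize on plain spaces, as elsewhere in this thread.
$tok = strtok($text, " ");
while ($tok !== FALSE) {
    echo "Word=$tok<br />";
    $tok = strtok(" ");
}
?>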

Since parts of phrases (rather than whole words) will be indexed, searches will be performed on parts of words.

This is the list of separators for EUC-JP:
¢£
¢¤
¢¥
¢¦
¢§
¢¨
¢©
¢ª
¢«
¡¦
¢_
¢®
¢º
¢»
¢¼
¢½
¢¾
¢¿
¢À
¢Á
¢Ê
¢Ë
¢Ì
¢Í
¢Î
¢Ï
¢Ð
¢Ü
¢Ý
¢Þ
¢ß
¢*
¢á
¢â
¢ã
¢ä
¢å
¡¢
¢æ
¡£
¢ç
¡¤
¢è
¡¥
¢é
¢ê
¡§
¡¨
¡©
¡ª
¡«
¡¬
¡_
¡®
¢ò
¡¯
¢ó
¡°
¢ô
¡±
¢õ
¡²
¢ö
¡³
¢÷
¡´
¢ø
¡µ
¢ù
¡¶
¡·
¡¸
¡¹
¡º
¢þ
¡»
¡¼
¡½
¡¾
¡¿
¡À
¡Á
¡Â
¡Ã
¡Ä
¡Å
¡Æ
¡Ç
¡È
¡É
¡Ê
¡Ë
¡Ì
¡Í
¡Î
¡Ï
¡Ð
¡Ñ
¡Ò
¡Ó
¡Ô
¡Õ
¡Ö
¡×
¡Ø
¡Ù
¡Ú
¡Û
¡Ü
¡Ý
¡Þ
¡ß
¡¦
¡*
¡á
¡â
¡ã
¡ä
¡å
¡æ
¡ç
¡è
¡é
¡ê
¡ë
¡ì
¡*
¡î
¡ï
¡ð
¡ñ
¡ò
¡ó
¡ô
¡õ
¡ö
¡÷
¡ø
¡ù
¡ú
¡û
¡ü
¡ý
¡þ
¢¡
¡¢
¢¢
¡£
¢£
¡¤
¢¤
¡¥
¢¥
¢¦
¡§
¢§
¡¨
¢¨
¡©
¢©
¡ª
¢ª
¡«
¢«
¡¬
¢¬
¡_
¢_
¡®
¢®
¡¯
¡°
¡±
¡²
¡³
¡´
¡µ
¡¶
¡·
¡¸
¡¹
¡º
¢º
¡»
¢»
¡¼
¢¼
¡½
¢½
¡¾
¢¾
¡¿
¢¿
¡À
¢À
¡Á
¢Á
¡Â
¡Ã
¡Ä
¡Å
¡Æ
¡Ç
¡È
¡É
¡Ê
¢Ê
¡Ë
¢Ë
¡Ì
¢Ì
¡Í
¢Í
¡Î
¢Î
¡Ï
¢Ï
¡Ð
¢Ð
¡Ñ
¡Ò
¡Ó
¡Ô
¡Õ
¡Ö
¡×
¡Ø
¡Ù
¡Ú
¡Û
¡Ü
¢Ü
¡Ý
¢Ý
¡Þ
¢Þ
¡ß
¢ß
¡*
¢*
¡á
¢á
¡â
¢â
¡ã
¢ã
¡ä
¢ä
¡å
¢å
¡æ
¢æ
¡ç
¢ç
¡è
¢è
¡é
¢é
¡ê
¢ê
¡ë
¡ì
¡*
¡î
¡ï
¡ð
¡ñ
¡ò
¢ò
¡ó
¢ó
¡ô
¢ô
¡õ
¢õ
¡ö
¢ö
¡÷
¢÷
¡ø
¢ø
¡ù
¢ù
¡ú
¡û
¡ü
¡ý
¡þ
¢þ
¢¡

I also set up $phpdig_words_chars for EUC-JP and Shift_JIS:

PHP Code:
$phpdig_words_chars['EUC-JP'] = '[:alnum:]@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúûüý';
$phpdig_words_chars['Shift_JIS'] = '[:alnum:]@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúû'
Does it seem OK?