Old 01-05-2004, 09:28 AM   #1
Edomondo
Japanese encoding: charset=shift_jis

Hi there! PhpDig is just great!

I'm considering improving the system so that it can spider and display pages in different encodings at the same time, including the Japanese encodings.

I've carefully read the three other topics dedicated to encoding issues.

I've thought of a few main points:
- There are no spaces in Japanese, so an algorithm can't tell two different words apart, and it seems impossible to store keywords in the DB. But I can think of a way to work around it (see the sketch after this list). There are 3 different types of characters in Japanese:
Hiragana (46 signs)
Katakana (46 signs)
Kanji (more than 50,000 signs)
It is possible to tell them apart by referring to the codes of these characters. E.g.: the katakana "re" (レ) is encoded as a two-byte sequence; if the code of the second byte of the encoding is between x and y, then the character is a katakana. Same for the others.
But some words contain different types of characters at the same time, like サボる (katakana + hiragana, means "not to attend school"), キャンプ場 (katakana + kanji, means "campsite"), 寝る (kanji + hiragana, means "to sleep").
- There are different Japanese encodings: Shift_JIS, ISO-2022-JP and EUC-JP. So how can pages with different encodings be crawled?
- Half-width ｱ is the same as full-width ア, ｲ as イ, ｳ as ウ, ｶ as カ... Apart from these signs (about 50), no other matches can be made (unlike, say, "à" or "â" being treated like "a").
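As a rough sketch of the byte-range idea from the first point (the EUC-JP byte ranges below are my own assumptions, not anything taken from PhpDig):

PHP Code:
<?php
// Rough classifier for a single two-byte EUC-JP character.
// The byte ranges are approximate assumptions for the JIS X 0208 rows.
function classify_eucjp_char($char)
{
    $first = ord($char[0]);    // first byte of the pair
    if ($first == 0xA4) {
        return 'hiragana';     // row 4
    } elseif ($first == 0xA5) {
        return 'katakana';     // row 5
    } elseif ($first >= 0xB0) {
        return 'kanji';        // rows 16 and up (approximate)
    }
    return 'other';            // punctuation, full-width Latin, etc.
}
?>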

Sounds pretty hard, but it is not.

Any ideas on how to do it? Can anyone give me a hand with this?

Old 01-06-2004, 05:33 AM   #2
Charter
Hi. Assuming Shift_JIS, ISO-2022-JP and EUC-JP are different encodings for the same set of characters, then a utility could be written, or may already be available, to convert both ISO-2022-JP and Shift_JIS to EUC-JP. The utility could be invoked based on the charset attribute of the meta tag, converting as necessary for storage in MySQL with charset ujis and utilizing multi-byte string functions where needed. This method could also be used to convert from a number of encodings to UTF-8.
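As an illustration only, here is a sketch of such a conversion step using PHP's mbstring functions; the helper name to_eucjp is hypothetical and not part of PhpDig:

PHP Code:
<?php
// Convert a fetched page to EUC-JP, based on the charset taken from
// its meta tag; fall back to mbstring's detection when none is found.
function to_eucjp($html, $meta_charset = '')
{
    $from = ($meta_charset !== '') ? $meta_charset : 'auto';
    return mb_convert_encoding($html, 'EUC-JP', $from);
}

// e.g. $page = to_eucjp($raw_html, 'SJIS');
?>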
Old 01-06-2004, 09:00 AM   #3
Edomondo
You're right, they are all different encodings for the same character set. The most common is probably Shift_JIS.
I suppose that developing such a utility wouldn't be a problem for me, but I'll be looking for something similar on the net.

But how to deal with the space issue?
PhpDig won't be able to index words. I can't think of any way to work around this.
Will it have to index each phrase separately as a single word?
Old 01-06-2004, 11:10 AM   #4
Charter
Hi. In robot_functions.php there is $separators = " "; and this is what breaks keywords on spaces. You could add other characters to $separators, but I am not familiar enough with Japanese to suggest appropriate separators.
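Purely to show where such a change would go (the added characters are an example, and multi-byte separators come with a caveat that shows up later in this thread):

PHP Code:
// In robot_functions.php: break keywords on more than just the space.
// "/" and "." are example additions only; note that strtok treats each
// byte of this string as a separate separator.
$separators = " /.";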

One other thing from php.net:

Quote:
Character encodings that work with PHP:
ISO-8859-*, EUC-JP, UTF-8

Character encodings that do NOT work with PHP:
JIS, SJIS
Old 01-06-2004, 01:38 PM   #5
Edomondo
Hi! I found exactly what I needed!!
http://www.spencernetwork.org/jcode-LE/
There are functions to convert between EUC-JP, Shift_JIS and ISO-2022-JP (JIS), and others to convert between full-width and half-width characters.
I haven't had time to test it yet. The documentation is in Japanese, but I can make a translation if anyone is interested.

I've thought about separators in Japanese. There are half- and full-width spaces, half- and full-width periods, half- and full-width commas, half- and full-width apostrophes, ...
Can I add several separators in the same string? I guess I must enter the raw (unencoded) characters, e.g. @ for the full-width space, correct? Or do I need to separate each separator with a sign (a comma...)?

BTW, thanks for your help Charter. It's greatly appreciated.
Old 01-06-2004, 01:59 PM   #6
Charter
Hi. For example, say the "word" was "big/long/phrase and then some" and you wanted to break this up. You could set $separators = " /"; so that keywords would be made on spaces and slashes. More info on strtok can be found here. Also, I'd be interested in the translation.
Old 01-07-2004, 07:41 AM   #7
Edomondo
I've done a translation of the Jcode-LE readme.txt file. It might not always be clear, as neither English nor Japanese is my mother tongue :-( I've also indicated where there should have been Japanese characters (replaced by *** due to the txt format).

It would be great if future versions of PhpDig accepted several encodings, in both indexing and the interface.

Thanks, strtok() is clearer to me now.
Attached Files
File Type: txt readme_en.txt (5.8 KB, 25 views)
Old 01-08-2004, 07:51 AM   #8
Edomondo
I did some testing.
Jcode works great for converting from one Japanese encoding to another. It is definitely what I was looking for!

strtok() seems to be able to use a separator pattern made of more than one character.

But:

PHP Code:
<?php
$string = "This/*is/*an/*example/*string";
$separator = "/*";
/* break the string wherever a separator character is found */
$tok = strtok($string, $separator);
while ($tok) {
    echo "Word=$tok<br />";
    $tok = strtok($separator);
}
?>
Must be replaced by:

PHP Code:
<?php
$string = "This/*is/*an/*example/*string";
$separator = "/*";

$tok = strtok($string, $separator);
while ($tok !== FALSE)
{
    $toks[] = $tok;
    $tok = strtok($separator);
}

while (list($k, $v) = each($toks))
{
    echo "Word=$v<br />";
}
?>
So, it might be possible to use multi-byte characters to separate words. Am I right?

Now I'm going to need help configuring correctly $phpdig_words_chars and $phpdig_string_subst.

Old 01-08-2004, 08:27 AM   #9
Charter
Hi. There is a fix here, thanks to janalwin. It avoids the problem where $tok evaluates to false when a zero appears in the text. I'm not sure why you are adding the second while loop. With while ($tok !== FALSE) all of the $tok values will print, but with while ($tok) printing stops after "string".
PHP Code:
$string "This/*is/*an/*example/*string/0/and/some*more*text";
$separator "/*";
$tok strtok($string$separator);
while (
$tok !== FALSE) { // try with while ($tok) to compare
   
echo "Word=$tok<br />";
   
$tok strtok($separator);

Note how $separator makes strtok tokenize on / and on * individually: it breaks whenever any one of those characters is found, not on the two-character sequence /*. With this string the output may look like it only breaks on /*, but that is not the case.
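For what it's worth, if the two-character sequence /* really had to be treated as one unit, preg_split (unlike strtok) takes its pattern as a whole; a quick sketch, not something PhpDig itself does:

PHP Code:
<?php
// Split only on the literal two-character sequence "/*".
$string = "This/*is/*an/*example/*string/0/and/some*more*text";
$words  = preg_split('~/\*~', $string);
foreach ($words as $word) {
    echo "Word=$word<br />";
}
?>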
Old 01-08-2004, 09:33 AM   #10
Edomondo
Damn! I'm afraid you're right. "/*" is used to tokenize the string just like "*/", "/" and "*"... So it's not working as I'd like it to.

It should use as separators:
- a single character (space, comma, period...)
- a multi-byte character code (in Shift_JIS, @ is the visible part of the full-width space, A of the full-width comma, B of the full-width period...)

I've run short of ideas on this issue.
Old 01-08-2004, 01:04 PM   #11
Edomondo
OK, I found out how to replace the punctuation.
In the previous example, tokenizing the string on /* can be achieved this way:

PHP Code:
<?php
$string = "This/*is/*an/*example/*0/*string.";
$separator = " ";

$replace_separator = array("/*" => $separator,
                           "."  => $separator);

$string = trim(strtr($string, $replace_separator));

$tok = strtok($string, $separator);
while ($tok !== FALSE) { // try with while ($tok) to compare
    echo "Word=$tok<br />";
    $tok = strtok($separator);
}
?>
I guess I'll also have to give MAX_WORDS_SIZE the highest value possible.

Now, how can I configure $phpdig_string_subst['EUC-JP'] and $phpdig_words_chars['EUC-JP']? It's still a bit confusing to me.

Each byte that makes up a multi-byte character will go in $phpdig_string_subst, right?
e.g.: ‚ÆÄ*l‹CÌ_é–Ÿ‰æÅ...

And $phpdig_words_chars['EUC-JP'] = '[:alnum:]'; seems correct, as all characters will be converted to half-width EUC-JP characters during indexing.
Old 01-08-2004, 04:39 PM   #12
Charter
Hi. I haven't done any benchmarks so I'm not sure, but for a lot of processing the following might be faster:
PHP Code:
$separator " ";
$string "This/*is/*an/*example/*0/*string.";
$string str_replace("/*"," ",$string);
$tok strtok($string$separator);
while (
$tok !== FALSE) {
   echo 
"Word=$tok<br />";
   
$tok strtok($separator);

As for the $phpdig_string_subst and $phpdig_words_chars variables, $phpdig_string_subst['EUC-JP'] = 'Q:Q,q:q'; and $phpdig_words_chars['EUC-JP'] = '[:alnum:]ÆÄ*l‹CÌ_é–Ÿ‰æÅ...';

This seems backwards when reading the instructions in the config.php file, but with encodings that don't have Latin counterparts, it's the way I figured to make PhpDig version 1.6.5 work with other languages.

Is your MySQL charset ujis? You can find some MySQL charsets and their descriptions here.
Old 01-09-2004, 06:59 AM   #13
Edomondo
Actually, the search engine I'm aiming at will support different languages and encodings. Japanese characters will be processed like non-multi-byte characters; the decoding to Japanese characters will only happen at the end, in the browser. Storage in the DB and plain TXT files will contain the raw, undecoded characters.
So I prefer using a common charset in MySQL, not one specific to Japanese.

I've made a list of all the possible separators (raw, undecoded) in Japanese.
For the Shift_JIS encoding, they will be:
@
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
[
\
]
^
_
`
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
{
|
}
~

€

‚
ƒ
"
…
*
‡
ˆ
‰
*
‹
Œ

Ž


'
'
"
"
o
-
-
˜
™
š
›
œ

ž
Ÿ

¡
¢
£
¤
¥
¦
§
¨
©
ª
"

¸
¹
º
"
¼
½
¾
¿
È
É
Ê
Ë
Ì
Í
Î
Ú
Û
Ü
Ý
Þ
ß
*
á
â
ã
ä
å
æ
ç
è
ð
ñ
ò
ó
ô
õ
ö
÷
ü

What would be the fastest way to achieve this?

$phpdig_string_subst for Shift_Jis would look like:

PHP Code:
$phpdig_string_subst['Shift_Jis'] = 'A:A,a:a,B:B,b:b,C:C,c:c,D:D,d:d,E:E,e:e,F:F,f:f,G:G,g:g,H:H,h:h,I:I,i:i,J:J,j:j,K:K,k:k,L:L,l:l,M:M,m:m,N:N,n:n,O:O,o:o,P:P,p:p,Q:Q,q:q,R:R,r:r,S:S,s:s,T:T,t:t,U:U,u:u,V:V,v:v,W:W,w:w,X:X,x:x,Y:Y,y:y,Z:Z,z:z'
Is that correct?

Building a correct $phpdig_words_chars wouldn't be a problem either. I'll post an attempt soon for both Shift_JIS and EUC-JP.

Old 01-09-2004, 07:35 AM   #14
Charter
Quote:
Japanese characters will be processed like non multi-byte characters. The decoding to Japanese characters will be done at the end in the browser.
Hi. I'm not sure what you mean. Are you planning on storing HTML entities instead?

The $phpdig_string_subst['Shift_Jis'] variable posted isn't correct. There is no need to include all the Latin letters in the variable. Setting $phpdig_string_subst['EUC-JP'] = 'Q:Q,q:q'; is all that is necessary if no "transformation" between characters is needed.

If you are looking to incorporate multiple encodings, you might consider UTF-8 instead.
Old 01-09-2004, 01:41 PM   #15
Edomondo
Hi. I meant that each Japanese character will be treated as a pair of single-byte characters.

I can't use UTF-8 because Jcode-LE only copes with EUC-JP, Shift_JIS and ISO-2022-JP (JIS). The indexed pages can only be encoded in one of those encodings, and they will all be converted to EUC-JP in this project.

The content of indexed pages will have to:
- be converted to the reference encoding of the site (EUC-JP in this case) using Jcode-LE;
- have the punctuation signs replaced by spaces with strtr or str_replace (see the sketch below).
Is that correct? Will it be enough to make it work?
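As a minimal sketch of those two steps chained together, with mb_convert_encoding standing in for the Jcode-LE conversion (whose own function names aren't shown here) and $raw_page standing for a fetched Shift_JIS page:

PHP Code:
<?php
// Step 1 (stand-in): convert the fetched page to EUC-JP.
// In the real setup this would be done with Jcode-LE instead.
$text = mb_convert_encoding($raw_page, 'EUC-JP', 'SJIS');

// Step 2: replace punctuation with plain spaces so the existing
// space-based tokenizing can split the text. The punctuation must
// be expressed in EUC-JP too, so convert it from this snippet's UTF-8.
$punct = array("、", "。", "　"); // ideographic comma, full stop and space
foreach ($punct as $p) {
    $text = str_replace(mb_convert_encoding($p, 'EUC-JP', 'UTF-8'), " ", $text);
}

// Tokenize on plain spaces, as elsewhere in this thread.
$tok = strtok($text, " ");
while ($tok !== FALSE) {
    echo "Word=$tok<br />";
    $tok = strtok(" ");
}
?>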

Since parts of phrases (rather than whole words) will be indexed, searches will be performed on parts of words.

This is the list of separators for EUC-JP:
¢£
¢¤
¢¥
¢¦
¢§
¢¨
¢©
¢ª
¢«
¡¦
¢_
¢®
¢º
¢»
¢¼
¢½
¢¾
¢¿
¢À
¢Á
¢Ê
¢Ë
¢Ì
¢Í
¢Î
¢Ï
¢Ð
¢Ü
¢Ý
¢Þ
¢ß
¢*
¢á
¢â
¢ã
¢ä
¢å
¡¢
¢æ
¡£
¢ç
¡¤
¢è
¡¥
¢é
¢ê
¡§
¡¨
¡©
¡ª
¡«
¡¬
¡_
¡®
¢ò
¡¯
¢ó
¡°
¢ô
¡±
¢õ
¡²
¢ö
¡³
¢÷
¡´
¢ø
¡µ
¢ù
¡¶
¡·
¡¸
¡¹
¡º
¢þ
¡»
¡¼
¡½
¡¾
¡¿
¡À
¡Á
¡Â
¡Ã
¡Ä
¡Å
¡Æ
¡Ç
¡È
¡É
¡Ê
¡Ë
¡Ì
¡Í
¡Î
¡Ï
¡Ð
¡Ñ
¡Ò
¡Ó
¡Ô
¡Õ
¡Ö
¡×
¡Ø
¡Ù
¡Ú
¡Û
¡Ü
¡Ý
¡Þ
¡ß
¡¦
¡*
¡á
¡â
¡ã
¡ä
¡å
¡æ
¡ç
¡è
¡é
¡ê
¡ë
¡ì
¡*
¡î
¡ï
¡ð
¡ñ
¡ò
¡ó
¡ô
¡õ
¡ö
¡÷
¡ø
¡ù
¡ú
¡û
¡ü
¡ý
¡þ
¢¡
¡¢
¢¢
¡£
¢£
¡¤
¢¤
¡¥
¢¥
¢¦
¡§
¢§
¡¨
¢¨
¡©
¢©
¡ª
¢ª
¡«
¢«
¡¬
¢¬
¡_
¢_
¡®
¢®
¡¯
¡°
¡±
¡²
¡³
¡´
¡µ
¡¶
¡·
¡¸
¡¹
¡º
¢º
¡»
¢»
¡¼
¢¼
¡½
¢½
¡¾
¢¾
¡¿
¢¿
¡À
¢À
¡Á
¢Á
¡Â
¡Ã
¡Ä
¡Å
¡Æ
¡Ç
¡È
¡É
¡Ê
¢Ê
¡Ë
¢Ë
¡Ì
¢Ì
¡Í
¢Í
¡Î
¢Î
¡Ï
¢Ï
¡Ð
¢Ð
¡Ñ
¡Ò
¡Ó
¡Ô
¡Õ
¡Ö
¡×
¡Ø
¡Ù
¡Ú
¡Û
¡Ü
¢Ü
¡Ý
¢Ý
¡Þ
¢Þ
¡ß
¢ß
¡*
¢*
¡á
¢á
¡â
¢â
¡ã
¢ã
¡ä
¢ä
¡å
¢å
¡æ
¢æ
¡ç
¢ç
¡è
¢è
¡é
¢é
¡ê
¢ê
¡ë
¡ì
¡*
¡î
¡ï
¡ð
¡ñ
¡ò
¢ò
¡ó
¢ó
¡ô
¢ô
¡õ
¢õ
¡ö
¢ö
¡÷
¢÷
¡ø
¢ø
¡ù
¢ù
¡ú
¡û
¡ü
¡ý
¡þ
¢þ
¢¡

I also set up $phpdig_words_chars for EUC-JP and Shift_JIS:

PHP Code:
$phpdig_words_chars['EUC-JP'] = '[:alnum:]@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúûüý';
$phpdig_words_chars['Shift_JIS'] = '[:alnum:]@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…*‡ˆ‰*‹ŒŽ‘’“”•–—˜™š›œžŸ_¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß*áâãäåæçèéêëì*îïðñòóôõö÷øùúû'
Does it seem OK?