Old 01-09-2004, 05:59 AM   #13
Edomondo
Orange Mole
 
Join Date: Jan 2004
Location: In outer space
Posts: 37
Actually, the search engine I'm aiming for will support different languages and encodings. Japanese text will be processed byte by byte, as if it were made of single-byte characters. Decoding into Japanese characters will only happen at the end, in the browser. The DB and the plain TXT files will store the raw, undecoded bytes.
So I'd rather use a common charset in MySQL than one specific to Japanese.
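
To make the byte-level view concrete, here is a minimal sketch of how a Shift_JIS string looks to PHP's byte-oriented string functions. The sample word and the variable names are illustrative assumptions, not phpDig code. It also shows one thing to watch for: a trail byte can have the same value as an ASCII character.

```php
<?php
// "日本語" in Shift_JIS, written as byte escapes so the encoding of this
// source file does not matter: 日 = 93 FA, 本 = 96 7B, 語 = 8C EA.
$sjis = "\x93\xFA\x96\x7B\x8C\xEA";

// Byte-oriented functions see six single-byte "characters", not three.
$byteCount = strlen($sjis);
$hexDump   = bin2hex($sjis);

// The trail byte of 本 is 0x7B, the same value as ASCII "{", so a plain
// byte search reports a match inside the Japanese text.
$bracePos = strpos($sjis, '{');

echo $byteCount, " bytes: ", $hexDump, "\n";
echo "'{' found at byte offset ", $bracePos, "\n";
```

Storing and retrieving such a string is safe as long as nothing in between re-encodes it, but any step that searches or splits on single bytes has to know which bytes are trail bytes.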

I've put together a list of all the possible separators (non-encoded) in Japanese.
For the Shift_Jis encoding, they are:
@
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
[
\
]
^
_
`
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
{
|
}
~

€

‚
ƒ
„
…
†
‡
ˆ
‰
Š
‹
Œ

Ž


'
'
"
"
o
-
-
˜
™
š
›
œ

ž
Ÿ

¡
¢
£
¤
¥
¦
§
¨
©
ª
«

¸
¹
º
»
¼
½
¾
¿
È
É
Ê
Ë
Ì
Í
Î
Ú
Û
Ü
Ý
Þ
ß
*
á
â
ã
ä
å
æ
ç
è
ð
ñ
ò
ó
ô
õ
ö
÷
ü

What would be the fastest way to achieve this?
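
One possible approach is a single pass over the bytes that keeps lead and trail bytes paired. The sketch below rests on an assumption that is not stated in the post: the characters in the list above are read as the trail bytes of two-byte Shift_JIS symbols whose lead byte is 0x81, and the byte ranges are read off that list. All function names are hypothetical and none of this is phpDig code.

```php
<?php
// Lead bytes of two-byte Shift_JIS characters.
function sjis_is_lead_byte(int $b): bool {
    return ($b >= 0x81 && $b <= 0x9F) || ($b >= 0xE0 && $b <= 0xFC);
}

// Assumed separator trail bytes (valid only after lead byte 0x81),
// taken from the ranges visible in the list above.
function sjis_is_separator_trail(int $t): bool {
    return ($t >= 0x40 && $t <= 0x7E) || ($t >= 0x80 && $t <= 0xAC)
        || ($t >= 0xB8 && $t <= 0xBF) || ($t >= 0xC8 && $t <= 0xCE)
        || ($t >= 0xDA && $t <= 0xE8) || ($t >= 0xF0 && $t <= 0xF7)
        || $t === 0xFC;
}

// Split raw Shift_JIS bytes on the assumed 0x81xx separators.
// Single-byte characters are simply kept as part of the current word.
function sjis_split_words(string $text): array {
    $words = [];
    $current = '';
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($text[$i]);
        if (sjis_is_lead_byte($b) && $i + 1 < $len) {
            $t = ord($text[$i + 1]);
            $i++; // consume the trail byte together with its lead byte
            if ($b === 0x81 && sjis_is_separator_trail($t)) {
                if ($current !== '') {
                    $words[] = $current;
                    $current = '';
                }
            } else {
                $current .= chr($b) . chr($t);
            }
        } else {
            $current .= chr($b);
        }
    }
    if ($current !== '') {
        $words[] = $current;
    }
    return $words;
}

// 日本 + ideographic comma (81 41) + 語
$parts = sjis_split_words("\x93\xFA\x96\x7B\x81\x41\x8C\xEA");
echo implode(' | ', array_map('bin2hex', $parts)), "\n";
```

Because each trail byte is consumed together with its lead byte, the 0x7B inside 本 (96 7B) is never mistaken for a separator, even though 0x7B is in the separator trail range.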

$phpdig_string_subst for Shift_Jis would look like:

PHP Code:
$phpdig_string_subst['Shift_Jis'] = 'A:A,a:a,B:B,b:b,C:C,c:c,D:D,d:d,E:E,e:e,F:F,f:f,G:G,g:g,H:H,h:h,I:I,i:i,J:J,j:j,K:K,k:k,L:L,l:l,M:M,m:m,N:N,n:n,O:O,o:o,P:P,p:p,Q:Q,q:q,R:R,r:r,S:S,s:s,T:T,t:t,U:U,u:u,V:V,v:v,W:W,w:w,X:X,x:x,Y:Y,y:y,Z:Z,z:z';
Is that correct?
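
As a way to check a string like this, the sketch below builds the same identity list in a loop and turns it into a `strtr()` map. It assumes each comma-separated entry has the form `target:sources`, meaning every byte after the colon is replaced by the one before it. Both helper functions are hypothetical and are not phpDig's own parsing code.

```php
<?php
// Build 'A:A,a:a,B:B,b:b,...,Z:Z,z:z' instead of typing it by hand.
function build_identity_subst(): string {
    $pairs = [];
    foreach (range('A', 'Z') as $upper) {
        $lower = strtolower($upper);
        $pairs[] = "$upper:$upper";
        $pairs[] = "$lower:$lower";
    }
    return implode(',', $pairs);
}

// Assumed format: "target:sources"; each source byte maps to the target.
function subst_to_map(string $subst): array {
    $map = [];
    foreach (explode(',', $subst) as $pair) {
        [$to, $from] = explode(':', $pair, 2);
        foreach (str_split($from) as $byte) {
            $map[$byte] = $to;
        }
    }
    return $map;
}

$subst = build_identity_subst();
$map   = subst_to_map($subst);

// With an identity map, strtr() leaves every byte untouched.
echo strtr("Hello \x93\xFA\x96\x7B", $map) === "Hello \x93\xFA\x96\x7B"
    ? "unchanged\n"
    : "changed\n";
```

Under that assumption, the identity list performs no substitution at all, so the raw Shift_JIS bytes pass through as they are.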

Building a correct $phpdig_words_chars shouldn't be a problem either. I'll post an attempt soon for both Shift_Jis and EUC-JP.

Last edited by Edomondo; 01-09-2004 at 06:03 AM.