Actually, the search engine I'm aiming for will support different languages and encodings. Japanese characters will be processed as if they were single-byte characters. The decoding into Japanese characters will happen at the end, in the browser. The DB and the plain TXT files will store the raw, non-decoded characters.
So I prefer using a common charset in MySQL, not a specific one for Japanese.
I've compiled a list of all the possible separators (non-encoded) in Japanese.
For Shift_Jis encoding, there will be:
@
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
[
\
]
^
_
`
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
{
|
}
~
€
‚
ƒ
"
…
*
‡
ˆ
‰
*
‹
Œ
Ž
'
'
"
"
o
-
-
˜
™
š
›
œ
ž
Ÿ
¡
¢
£
¤
¥
¦
§
¨
©
ª
"
¸
¹
º
"
¼
½
¾
¿
È
É
Ê
Ë
Ì
Í
Î
Ú
Û
Ü
Ý
Þ
ß
*
á
â
ã
ä
å
æ
ç
è
ð
ñ
ò
ó
ô
õ
ö
÷
ü
What would be the fastest way to achieve this?
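One possible approach, as a rough sketch only: the bytes listed above look like the second bytes of Shift_JIS symbols whose first byte is 0x81 (that first byte has no printable form in a Western charset, so it would not show up in the list). If that assumption holds, the separators cannot be matched one byte at a time, because second bytes such as 0x41 are also plain ASCII letters. The string has to be walked pair by pair instead. The function name `sjis_split_words` is hypothetical and not part of PhpDig.

```php
<?php
// Hypothetical helper (not part of PhpDig): split a raw Shift_JIS byte string
// into words. Assumption: every double-byte character whose lead byte is 0x81
// is a separator, and ASCII whitespace separates words as well.
function sjis_split_words(string $bytes): array
{
    $words = [];
    $current = '';
    $len = strlen($bytes);

    for ($i = 0; $i < $len; $i++) {
        $b = ord($bytes[$i]);

        // Lead bytes of double-byte Shift_JIS characters.
        $isLead = ($b >= 0x81 && $b <= 0x9F) || ($b >= 0xE0 && $b <= 0xFC);

        if ($isLead && $i + 1 < $len) {
            $pair = $bytes[$i] . $bytes[$i + 1];
            $i++; // the trail byte belongs to this character, skip it

            if ($b === 0x81) {
                // Symbol row: close the current word instead of storing the pair.
                if ($current !== '') {
                    $words[] = $current;
                    $current = '';
                }
            } else {
                $current .= $pair;
            }
        } elseif ($b === 0x20 || $b === 0x09 || $b === 0x0A || $b === 0x0D) {
            // ASCII space, tab, LF, CR.
            if ($current !== '') {
                $words[] = $current;
                $current = '';
            }
        } else {
            // ASCII or single-byte half-width katakana (0xA1-0xDF).
            $current .= $bytes[$i];
        }
    }

    if ($current !== '') {
        $words[] = $current;
    }
    return $words;
}
```

Note that treating the whole 0x81 row as separators also splits on characters such as the long-vowel mark (0x815B) and the iteration mark (0x8158), which normally belong inside words, so a real separator list would probably need to exclude them.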
$phpdig_string_subst for Shift_Jis would look like:
PHP Code:
$phpdig_string_subst['Shift_Jis'] = 'A:A,a:a,B:B,b:b,C:C,c:c,D:D,d:d,E:E,e:e,F:F,f:f,G:G,g:g,H:H,h:h,I:I,i:i,J:J,j:j,K:K,k:k,L:L,l:l,M:M,m:m,N:N,n:n,O:O,o:o,P:P,p:p,Q:Q,q:q,R:R,r:r,S:S,s:s,T:T,t:t,U:U,u:u,V:V,v:v,W:W,w:w,X:X,x:x,Y:Y,y:y,Z:Z,z:z';
Is that correct?
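To check what such a string would do, here is a minimal sketch that turns it into a byte-for-byte replacement map. It assumes each comma-separated entry reads "target:sources", meaning every byte after the colon is replaced by the byte before it. That reading of the format is my assumption, and `build_subst_map` is a hypothetical helper, not PhpDig's own code.

```php
<?php
// Hypothetical parser (not PhpDig's code): assumes each comma-separated entry
// has the form "target:sources", where every byte listed in "sources" is to be
// replaced by the single byte "target".
function build_subst_map(string $spec): array
{
    $map = [];
    foreach (explode(',', $spec) as $entry) {
        $parts = explode(':', $entry, 2);
        if (count($parts) !== 2 || $parts[0] === '') {
            continue; // skip malformed entries
        }
        [$target, $sources] = $parts;
        $n = strlen($sources);
        for ($i = 0; $i < $n; $i++) {
            $map[$sources[$i]] = $target;
        }
    }
    return $map;
}

// With an identity string such as 'A:A,a:a,...' every byte maps to itself,
// so applying the map leaves the text unchanged.
$map = build_subst_map('A:A,a:a,B:B,b:b');
$out = strtr('aBc', $map);
```

Under that assumption the identity string is effectively a no-op, which seems to be the intent here: a map that actually changed bytes would be applied byte by byte and could corrupt the second byte of a double-byte Shift_JIS character.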
Building a correct $phpdig_words_chars wouldn't be a problem either. I'll post an attempt soon for both Shift_Jis and EUC-JP.