iso-8859-7


mkst
10-08-2003, 06:43 AM
Hello there!
I would like to know how I can change the iso to iso-8859-7 (greek). I read the documentation but could not understand how to set the

$phpdig_string_subst['iso-8859-7'] and

$phpdig_words_chars['iso-8859-7'] values.

Any help please??

Rolandks
10-08-2003, 01:51 PM
You must define ALL of the characters in this string:
$phpdig_string_subst['iso-8859-7'] ='......here is iso-8859-7 chr ...........'

see:
http://www.softlab.ntua.gr/~sivann/xgrk/iso8859-7.html

and set:
define('PHPDIG_ENCODING','iso-8859-7');

Perhaps you can find the string on ONE line with Google?

-Roland-

mkst
10-09-2003, 02:19 AM
Thanks for your reply Rolandks! :)

OK, I think I got it...
What about the:

$phpdig_words_chars['iso-8859-2'] = '[:alnum:]ðþß';

What is it used for? Will I have to change it?

Regards,
Mike

Charter
10-09-2003, 05:56 PM
Hi. The $phpdig_words_chars['iso-8859-2'] = '[:alnum:]ðþß'; is for non-accented 'lowercase' letters such as the German ß (pronounced 'ess set' if I remember correctly) for example. Sort of think of it like this: anything that doesn't go in $phpdig_string_subst['iso-8859-2'] might go in $phpdig_words_chars['iso-8859-2']. If you will, once you get your 'iso-8859-7' set, please post it in the Mod Submissions (http://www.phpdig.net/forumdisplay.php?forumid=24) forum in case others might want to use it. Thanks. :)

mkst
11-26-2003, 09:04 AM
Unfortunately I cannot make it work. :cry: :cry:

I have used something like:

$phpdig_string_subst['iso-8859-7'] = 'Á:¢,Å:¸,Ç:¹,É:ºÚ,Ï:¼,Õ:¾,Ù:¿,Ü:á,å:Ý,ç:Þ,é:ßúÀ,ï:ü,õ:ýûà,ù:þ';

I have changed the encoding to: define ('PHPDIG_ENCODING','iso-8859-7');

I think that the problem is with the $phpdig_words_chars['iso-8859-1']='[:alnum:]ðþß' string. What letters do I put within the [::] characters and what letters after them?

The script searches some of the English pages that I have on the site, but does not search any Greek pages. The 'keywords' table only contains English words.

I would really need some help!
ps. I am using the 1.6.2 version.

Charter
11-26-2003, 11:06 AM
Hi. I found the below ASCII representation of iso-8859-7 at http://www.gar.no/home/mats/8859-7.htm.

80-9F: unassigned
// note A0 is a space
A0-BF: _¡¢£¤¥¦§¨©ª«¬_®¯°±²³´µ¶·¸¹º»¼½¾¿
C0-DF: ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
E0-FF: àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

When making $phpdig_string_subst['iso-8859-7'], it's like making a key value set. For example, if the Latin A is like the Greek Ά (hex B6) then the $phpdig_string_subst['iso-8859-7'] variable would start like the following:

$phpdig_string_subst['iso-8859-7'] = 'A:¶';

If Greek uses Á (hex C1) also like the Latin A, then $phpdig_string_subst['iso-8859-7'] would start like the following:

$phpdig_string_subst['iso-8859-7'] = 'A:¶Á';

The same type of thing goes for Latin a. If Greek uses ά (hex DC) and á (hex E1) like the Latin a, then $phpdig_string_subst['iso-8859-7'] would start like the following:

$phpdig_string_subst['iso-8859-7'] = 'A:¶Á,a:Üá';

The $phpdig_string_subst['iso-8859-7'] variable is for all accented or diacritic characters (basically all accented characters and those characters that do not copy paste into ASCII as the characters themselves but rather copy paste as ASCII representations of the characters).

The $phpdig_words_chars['iso-8859-7'] variable is for lowercase non-accented characters (basically those lowercase non-accented characters that copy paste into ASCII as the characters themselves). An example of this would be Greek µ, so it could be added to $phpdig_words_chars['iso-8859-7'] like so:

$phpdig_words_chars['iso-8859-7'] = '[:alnum:]ðþßµ';

Note that it is possible to have an ASCII representation of a character be in $phpdig_string_subst['iso-8859-7'] and also have the ASCII character itself be in $phpdig_words_chars['iso-8859-7'].
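
For anyone who finds the key:value description abstract, here is a minimal sketch of how such a string could be read and applied. The helper names sketch_build_subst_map and sketch_fold are made up for this illustration and this is not PhpDig's own code, whose parsing may well differ. The bytes are written as hex escapes so the example does not depend on the encoding of the file it is saved in.

```php
<?php
// Hypothetical helpers for illustration only; PhpDig's real parsing may differ.

// Turn 'A:<bytes>,a:<bytes>' into a byte => Latin letter lookup table.
function sketch_build_subst_map(string $spec): array
{
    $map = [];
    foreach (explode(',', $spec) as $pair) {
        $parts = explode(':', $pair, 2);
        if (count($parts) !== 2 || $parts[0] === '' || $parts[1] === '') {
            continue; // skip malformed or empty entries
        }
        // Every single byte after the colon is replaced by the key before it.
        foreach (str_split($parts[1]) as $byte) {
            $map[$byte] = $parts[0];
        }
    }
    return $map;
}

// Apply the table to a string of ISO-8859-7 bytes.
function sketch_fold(string $text, array $map): string
{
    return strtr($text, $map);
}

// 'A:¶Á,a:Üá' from the post above, written as hex: B6 C1 and DC E1.
$map = sketch_build_subst_map("A:\xB6\xC1,a:\xDC\xE1");
echo sketch_fold("\xC1\xE1\xDC", $map), "\n"; // prints "Aaa"
```

Read this way, each key is the Latin letter that gets stored, and each byte after the colon is a character that should be treated as that letter.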

mkst
11-27-2003, 06:21 AM
Thanks for your reply Charter!
...but I am still confused!! :confused: :confused: :confused: :confused:

Originally posted by Charter
For example, if the Latin A is like the Greek Ά (hex B6) then the $phpdig_string_subst['iso-8859-7'] variable would start like the following:

$phpdig_string_subst['iso-8859-7'] = 'A:¶';

If Greek uses Á (hex C1) also like the Latin A, then $phpdig_string_subst['iso-8859-7'] would start like the following:

$phpdig_string_subst['iso-8859-7'] = 'A:¶Á';

.......
The $phpdig_words_chars['iso-8859-7'] variable is for lowercase non-accented characters (basically those lowercase non-accented characters that copy paste into ASCII as the characters themselves). An example of this would be Greek µ.....


What exactly do you mean by 'is like'? I know that the Latin capital A looks like the Greek capital Á, but this is not the case for the lowercase letters or some other capital letters.

And what exactly do you mean by '(basically those lowercase non-accented characters that copy paste into ASCII as the characters themselves)' ?

I have tried something like this:

$phpdig_string_subst['iso-8859-7'] = 'A:¶A,a:Üá,E:Å¸,e:åÝ,H:Ç¹,h:çÞ,I:ÉºÚ,i:éßúÀ,O:Ï¼,o:ïü,Y:Õ¾Û,y:õýûà,L:Ë,l:ë,N:Í,n:í,V:Ù,v:ùþ,M:Ì,m:ì,P:Ð,p:ð,X:×,x:÷,K:Ê,k:ê,B:Â,b:â,C:Ø,c:ø,G:Ã,g:ã,D:Ä,d:ä,Z:Æ,z:æ,U:È,u:è,K:Ê,k:ê,J:Î,j:î,R:Ñ,r:ñ,S:Ó,s:óò,T:Ô,t:ô,F:Ö,f:ö';

and

$phpdig_words_chars['iso-8859-7'] = '[:alnum:]ðþßìòñôèóäöãîêëæ÷øâ*ð';


I have also tried different variations of the above but still could not make it work correctly.

The engine indexes the site all right but only recognizes and prints results for part of the keyword.
Also, the 'keywords' table contains words with Latin letters only. I guess this is all right?

Thank you for your time Charter, and I hope I am not too much trouble :angel:

Charter
11-27-2003, 08:10 AM
Hi. I'll use a German word as an example of what I mean by the 'is like' phrase. The German word Gästebuch means Guestbook. The ä in Gästebuch 'is like' the Latin a. Such characters like ä are stored as their Latin counterparts in the database for searching. When you copy paste a character into a text editor, it will either show up as the character or some ASCII equivalent of the character. The characters that show up as the actual character are the ones that go in $phpdig_words_chars['iso-8859-7'] but no accented characters should go in $phpdig_words_chars['iso-8859-7']. All accented or diacritic characters should go in $phpdig_string_subst['iso-8859-7'].

mkst
11-28-2003, 06:11 AM
Thank you for your reply Charter. It seems that I managed to create the right $phpdig_string_subst and $phpdig_words_chars.

However, I still have one problem regarding words that start with a capital letter. I can only find a word if it starts with certain capital letters; otherwise I get zero matches. The search works OK for lowercase words.

Do you have any idea why this is happening?

Regards,
Mike

Charter
11-28-2003, 06:17 AM
Hi. What are $phpdig_string_subst['iso-8859-7'] and $phpdig_words_chars['iso-8859-7'] currently set to? What capital letters are not working? Maybe there is a mismatched key value type pairing.

mkst
11-28-2003, 06:25 AM
$phpdig_string_subst['iso-8859-7'] = 'A:Á¶,a:Üá,B:Â,b:â,G:Ã,g:ã,D:Ä,d:ä,E:Å¸,e:åÝ,Z:Æ,z:æ,H:Ç¹,h:çÞ,U:È,u:è,I:ÉºÚ,i:éßúÀ,K:Ê,k:ê,L:Ë,l:ë,M:Ì,m:ì,N:Í,n:í,J:Î,j:î,O:Ï¼,o:ïü,P:Ð,p:ð,R:Ñ,r:ñ,S:Ó,s:óò,T:Ô,t:ô,Y:Õ¾Û,y:õýûà,F:Ö,f:ö,X:×,x:÷,C:Ø,c:ø,V:Ù,v:ùþ';

and

$phpdig_words_chars['iso-8859-7'] = '[:alnum:]áâãäåæçèéêëìíîïðñóôõö÷øù';


I have double-checked for typing errors and don't think that this is the case.
Words starting with Á, ¶, Ð, Ì have no problem.

Charter
11-28-2003, 06:51 AM
Hi. Of áâãäåæçèéêëìíîïðñóôõö÷øù the only ones that should be in the $phpdig_words_chars['iso-8859-7'] variable are æçðø like so:

$phpdig_words_chars['iso-8859-7'] = '[:alnum:]æçðø';

These áâãäåèéêëìíîïñóôõö÷ù are accented/diacritic characters and need to be matched up to their Latin counterparts in $phpdig_string_subst['iso-8859-7'].

mkst
11-28-2003, 07:39 AM
Thanks Charter but there is no improvement. :(
It is now worse than before....

Charter
11-28-2003, 10:37 AM
Hi. I am not very familiar with the Greek alphabet beyond mathematical usage. Below is what I came up with assuming that Latin A is like Greek Alpha, Latin a is like Greek alpha, and so forth. I make no claims of correctness. ;)

$phpdig_string_subst['iso-8859-7'] = 'A:¶Á,a:Üá,B:Â,G:Ã,g:ã,D:Ä,d:ä,E:¸Å,e:Ýå,Z:Æ,z:æ,I:ºÉÚ,i:Àßéú,K:Ê,k:ê,L:Ë,l:ë,M:Ì,N:Í,n:í,X:Î,x:î,O:¼Ï,o:ïü,P:Ð,p:ð,R:Ñ,r:ñ,S:Ó,s:òó,T:Ô,t:ô,Y:¾ÕÛ,y:àõûý';

$phpdig_words_chars['iso-8859-7'] = '[:alnum:]ßµ';

I was not sure what to do with the following characters: Eta, eta, Theta, theta, Phi, phi, Chi, chi, Psi, psi, Omega, omega.

I also made the following assumptions: Latin G is like Greek Gamma, Latin g is like Greek gamma, Latin R is like Greek Rho, Latin r is like Greek rho, Latin Y is like Greek Upsilon, Latin y is like Greek upsilon.

As I am not very familiar with the Greek language, this is the best that I can offer. :(

mitsoskitsos
12-24-2003, 02:41 AM
Hi.
I am also trying to index Greek pages with the 8859-7 encoding and I have some problems.
I think that the origin of the problem is that Greek characters are converted to Latin and then put in the keywords table.
Why is it necessary to convert the Greek characters to Latin?
I think that the engine would work much better and more accurately without this conversion.
Is there a hack that I could apply so Greek characters won't be converted to Latin?

Charter
12-24-2003, 12:48 PM
Hi. PhpDig takes content and writes it to text files. Certain characters display as themselves in ASCII but other characters show up as their ISO counterparts. If you look at this (http://www.columbia.edu/kermit/greek.html) page, you can see how, for example, Delta displays as Δ in the browser but when you view Delta in the HTML source it is the Ä character.
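
To make the Delta example concrete, here is a small sketch showing that it is one and the same byte, only interpreted under two different character sets. It assumes the iconv extension is available and is an illustration only, not something PhpDig itself does.

```php
<?php
// Illustration only: the same byte decoded under two different charsets.
// Assumes the iconv extension is available.
$byte = "\xC4"; // decimal 196

$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte); // Latin capital A with diaeresis
$asGreek  = iconv('ISO-8859-7', 'UTF-8', $byte); // Greek capital Delta

printf("0x%02X as ISO-8859-1: %s\n", ord($byte), $asLatin1);
printf("0x%02X as ISO-8859-7: %s\n", ord($byte), $asGreek);
```

Nothing in the stored byte changes between the two lines of output; only the charset used to display it does.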

mitsoskitsos
12-26-2003, 01:51 AM
Hi Charter. Thanks for your reply but I still don't understand it (I don't have much experience...)

Correct me if I say something wrong... The Delta character has decimal value 196, so the phpdig spider reads 196 from the 8859-7 encoded HTML page, then writes it as 196 in the text file, and then converts it to the corresponding Latin (below 127) character according to the phpdig_string_subst table. I cannot understand why you do not put the 196 character in the MySQL table...

I want to index only iso-8859-7 pages, so I am not interested in other encodings. In the text_content directory I can read the txt files perfectly, but when the Greek to Latin conversion takes place something goes wrong. I have tried many combinations of the phpdig_string_subst and phpdig_words_chars variables but the result isn't good.

So I concluded that the only solution is to bypass the Greek to Latin conversion. Can you help me with this? (I cannot easily find this conversion part in the phpdig code.)

Charter
12-26-2003, 05:40 AM
Hi. Can you attach one of the text files from the text_content directory?

mitsoskitsos
12-27-2003, 09:10 AM
http://beta.topweb.gr/25.txt

It's from a test page

Charter
12-28-2003, 03:40 AM
Hi. Thanks. When PhpDig spiders an ISO-8859-7 page, it sees characters like the following:

english test spider åëëçíéêÜ ôåóô áñÜ÷íç áâãäåæçèéêëìíïðñóôõö÷øù áâã

Because PhpDig currently supports only ISO-8859-1 and ISO-8859-2, it does not know how to convert the above ASCII characters to the following characters that get displayed in the browser:

english test spider ελληνικά τεστ αράχνη αβγδεζηθικλμνοπρστυφχψω αβγ

The $phpdig_string_subst and $phpdig_words_chars variables are available to set up another ISO-8859 but only if the language can be mapped one-to-one with Latin counterparts.

Of course, this one-to-one mapping cannot be done with a variety of languages and so PhpDig does not convert those languages correctly.

Just as a test, if you are using PhpDig on ISO-8859-7 pages only, set the following in the config.php file and then do a crawl:

define('PHPDIG_ENCODING','iso-8859-7');
// give functions something trivial to do
$phpdig_string_subst['iso-8859-7'] = 'A:A,a:a';
// remove word wrapping in the below line
$phpdig_words_chars['iso-8859-7'] = '[:alnum:]µ¶¸¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜ Þßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ';

With ISO-8859-7 set, the browser should pass the search query into PhpDig as (extended) ASCII characters. The other thing to check is to see how Client characterset and Server characterset are set.

This can be done via shell. Just go to the MySQL prompt and type status and MySQL will output the info. What are your Client characterset and Server characterset set to?

If you are not able to check the setting of Client characterset and Server characterset, then take a look at the new table entries via phpMyAdmin after doing a crawl with the above changes. Are the words and characters stored as (extended) ASCII? Also, how are the new search results?

mitsoskitsos
12-28-2003, 12:57 PM
Hi.

I made the test and here's what happens:

I crawled the same test page and only the 3 English words were put in the keywords table.
If I change phpdig_string_subst and map Greek to Latin characters, then I get all words in the keywords table, but they are (as they should be) Latin.

I think that all problems would be solved if we managed to put extended ASCII chars into the MySQL table.

The client and server characterset are both greek (I compiled mysql with-charset=greek).

Charter
12-28-2003, 06:25 PM
Hi. Try the following. Keep the following changes in the config.php file:

define('PHPDIG_ENCODING','iso-8859-7');
// give functions something trivial to do
$phpdig_string_subst['iso-8859-7'] = 'A:A,a:a';
// remove word wrapping in the below line
$phpdig_words_chars['iso-8859-7'] = '[:alnum:]µ¶¸¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜ Þßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ';

In addition, in the robot_functions.php file is a phpdigIndexFile function.

In the phpdigIndexFile function change:

global $common_words,$relative_script_path,$s_yes,$s_no,$br;

to the following:

global $phpdig_words_chars,$common_words,$relative_script_path,$s_yes,$s_no,$br;

Also, in the phpdigIndexFile function change:

if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^[0-9a-zßðþ]',$key))

to the following:

if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^['.$phpdig_words_chars[PHPDIG_ENCODING].']',$key))

Remember to remove any "word" wrapping in the above code.

Now when you do a crawl do you see (extended) ASCII or Greek characters in the table?
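
A note for anyone trying this on a current PHP: ereg() was removed in PHP 7.0, so the condition above would need preg_match() instead. The following is a minimal sketch of an equivalent check. The function name sketch_is_indexable, its parameters, and the default length limits are assumptions made for this illustration, not PhpDig's code or settings. It also assumes the default C locale, under which [:alnum:] covers ASCII letters and digits only.

```php
<?php
// Hypothetical stand-in for the condition inside phpdigIndexFile.
// $minLen and $maxLen play the role of SMALL_WORDS_SIZE and MAX_WORDS_SIZE;
// the default values here are assumptions, not PhpDig's settings.
function sketch_is_indexable(string $key, string $wordChars, array $commonWords = [], int $minLen = 2, int $maxLen = 30): bool
{
    $len = strlen($key);
    if ($len <= $minLen || $len > $maxLen || isset($commonWords[$key])) {
        return false;
    }
    // No /u modifier: the pattern and the subject are treated as raw bytes.
    return preg_match('/^[' . $wordChars . ']/', $key) === 1;
}

// Lowercase Greek alpha..omega occupy bytes E1..F9 in ISO-8859-7.
$greekLower = implode('', array_map('chr', range(0xE1, 0xF9)));
$word = "\xF4\xE5\xF3\xF4"; // tau epsilon sigma tau, as in the test page

var_dump(sketch_is_indexable($word, '[:alnum:]'));               // ASCII-only class
var_dump(sketch_is_indexable($word, '[:alnum:]' . $greekLower)); // class extended with Greek bytes
```

The point of the change above is the same as in this sketch: the first character of a word is tested against the configured character class rather than a hard-coded Latin one, so words that begin with a Greek byte are no longer thrown away.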

mitsoskitsos
12-29-2003, 01:14 PM
Hi Charter.

Did the changes and all works fine!

I see Greek characters in the table and the search works perfectly.
I will make more tests and I'll let you know if there's a problem.

Many thanks!

Charter
12-29-2003, 05:03 PM
Hi. Great news! What version of MySQL are you running?

The below ISO-8859-7 tables are from this (http://www.columbia.edu/kermit/greek.html) page and show the Greek and extended ASCII characters.

I omitted some of the extended ASCII characters from the $phpdig_words_chars variable based on my limited knowledge of Greek. You should check the $phpdig_words_chars variable and add or remove extended ASCII characters in the variable based on your knowledge. The space isn't necessary to add, and of course don't add "bad" characters.

Please do let me know the results of your tests. I'd like to make PhpDig portable for more languages than ISO-8859-1 and ISO-8859-2. Also, what's the link to your search? I'd like to take a look.

[Extended ASCII Character]
ISO 8859-7 Latin / Greek Alphabet
char dec col/row oct hex description
[_] 160 10/00 240 A0 No-break space
[¡] 161 10/01 241 A1 Left single quotation mark
[¢] 162 10/02 242 A2 right single quotation mark
[£] 163 10/03 243 A3 Pound sign
[¤] 164 10/04 244 A4 (UNUSED)
[¥] 165 10/05 245 A5 (UNUSED)
[¦] 166 10/06 246 A6 Broken bar
[§] 167 10/07 247 A7 Paragraph sign
[¨] 168 10/08 250 A8 Diaeresis (Dialytika)
[©] 169 10/09 251 A9 Copyright sign
[ª] 170 10/10 252 AA (UNUSED)
[«] 171 10/11 253 AB Left angle quotation
[¬] 172 10/12 254 AC Not sign
[_] 173 10/13 255 AD Soft hyphen
[®] 174 10/14 256 AE (UNUSED)
[¯] 175 10/15 257 AF Horizontal bar (Parenthetiki pavla)
[°] 176 11/00 260 B0 Degree sign
[±] 177 11/01 261 B1 Plus-minus sign
[²] 178 11/02 262 B2 Superscript two
[³] 179 11/03 263 B3 Superscript three
[´] 180 11/04 264 B4 Accent (tonos)
[µ] 181 11/05 265 B5 Diaeresis and accent (Dialytika and Tonos)
[¶] 182 11/06 266 B6 Alpha with accent
[·] 183 11/07 267 B7 Middle dot (Ano Teleia)
[¸] 184 11/08 270 B8 Epsilon with accent
[¹] 185 11/09 271 B9 Eta with accent
[º] 186 11/10 272 BA Iota with accent
[»] 187 11/11 273 BB Right angle quotation
[¼] 188 11/12 274 BC Omicron with accent
[½] 189 11/13 275 BD One half
[¾] 190 11/14 276 BE Upsilon with accent
[¿] 191 11/15 277 BF Omega with accent
[À] 192 12/00 300 C0 iota with diaeresis and accent
[Á] 193 12/01 301 C1 Alpha
[Â] 194 12/02 302 C2 Beta
[Ã] 195 12/03 303 C3 Gamma
[Ä] 196 12/04 304 C4 Delta
[Å] 197 12/05 305 C5 Epsilon
[Æ] 198 12/06 306 C6 Zeta
[Ç] 199 12/07 307 C7 Eta
[È] 200 12/08 310 C8 Theta
[É] 201 12/09 311 C9 Iota
[Ê] 202 12/10 312 CA Kappa
[Ë] 203 12/11 313 CB Lamda
[Ì] 204 12/12 314 CC Mu
[Í] 205 12/13 315 CD Nu
[Î] 206 12/14 316 CE Ksi
[Ï] 207 12/15 317 CF Omicron
[Ð] 208 13/00 320 D0 Pi
[Ñ] 209 13/01 321 D1 Rho
[Ò] 210 13/02 322 D2 (UNUSED)
[Ó] 211 13/03 323 D3 Sigma
[Ô] 212 13/04 324 D4 Tau
[Õ] 213 13/05 325 D5 Upsilon
[Ö] 214 13/06 326 D6 Phi
[×] 215 13/07 327 D7 Khi
[Ø] 216 13/08 330 D8 Psi
[Ù] 217 13/09 331 D9 Omega
[Ú] 218 13/10 332 DA Iota with diaeresis
[Û] 219 13/11 333 DB Upsilon with diaeresis
[Ü] 220 13/12 334 DC alpha with accent
[Ý] 221 13/13 335 DD epsilon with accent
[Þ] 222 13/14 336 DE eta with accent
[ß] 223 13/15 337 DF iota with accent
[à] 224 14/00 340 E0 upsilon with diaeresis and accent
[á] 225 14/01 341 E1 alpha
[â] 226 14/02 342 E2 beta
[ã] 227 14/03 343 E3 gamma
[ä] 228 14/04 344 E4 delta
[å] 229 14/05 345 E5 epsilon
[æ] 230 14/06 346 E6 zeta
[ç] 231 14/07 347 E7 eta
[è] 232 14/08 350 E8 theta
[é] 233 14/09 351 E9 iota
[ê] 234 14/10 352 EA kappa
[ë] 235 14/11 353 EB lamda
[ì] 236 14/12 354 EC mu
[í] 237 14/13 355 ED nu
[î] 238 14/14 356 EE ksi
[ï] 239 14/15 357 EF omicron
[ð] 240 15/00 360 F0 pi
[ñ] 241 15/01 361 F1 rho
[ò] 242 15/02 362 F2 terminal sigma
[ó] 243 15/03 363 F3 sigma
[ô] 244 15/04 364 F4 tau
[õ] 245 15/05 365 F5 upsilon
[ö] 246 15/06 366 F6 phi
[÷] 247 15/07 367 F7 khi
[ø] 248 15/08 370 F8 psi
[ù] 249 15/09 371 F9 omega
[ú] 250 15/10 372 FA iota with diaeresis
[û] 251 15/11 373 FB upsilon with diaeresis
[ü] 252 15/12 374 FC omicron with accent
[ý] 253 15/13 375 FD upsilon with accent
[þ] 254 15/14 376 FE omega with accent
[ÿ] 255 15/15 377 FF (UNUSED)


[Greek Character]
ISO 8859-7 Latin / Greek Alphabet
char dec col/row oct hex description
[ ] 160 10/00 240 A0 No-break space
[ʽ] 161 10/01 241 A1 Left single quotation mark
[ʼ] 162 10/02 242 A2 right single quotation mark
[£] 163 10/03 243 A3 Pound sign
[] 164 10/04 244 A4 (UNUSED)
[] 165 10/05 245 A5 (UNUSED)
[¦] 166 10/06 246 A6 Broken bar
[§] 167 10/07 247 A7 Paragraph sign
[¨] 168 10/08 250 A8 Diaeresis (Dialytika)
[©] 169 10/09 251 A9 Copyright sign
[] 170 10/10 252 AA (UNUSED)
[«] 171 10/11 253 AB Left angle quotation
[¬] 172 10/12 254 AC Not sign
[_] 173 10/13 255 AD Soft hyphen
[] 174 10/14 256 AE (UNUSED)
[―] 175 10/15 257 AF Horizontal bar (Parenthetiki pavla)
[°] 176 11/00 260 B0 Degree sign
[±] 177 11/01 261 B1 Plus-minus sign
[²] 178 11/02 262 B2 Superscript two
[³] 179 11/03 263 B3 Superscript three
[΄] 180 11/04 264 B4 Accent (tonos)
[΅] 181 11/05 265 B5 Diaeresis and accent (Dialytika and Tonos)
[Ά] 182 11/06 266 B6 Alpha with accent
[·] 183 11/07 267 B7 Middle dot (Ano Teleia)
[Έ] 184 11/08 270 B8 Epsilon with accent
[Ή] 185 11/09 271 B9 Eta with accent
[Ί] 186 11/10 272 BA Iota with accent
[»] 187 11/11 273 BB Right angle quotation
[Ό] 188 11/12 274 BC Omicron with accent
[½] 189 11/13 275 BD One half
[Ύ] 190 11/14 276 BE Upsilon with accent
[Ώ] 191 11/15 277 BF Omega with accent
[ΐ] 192 12/00 300 C0 iota with diaeresis and accent
[Α] 193 12/01 301 C1 Alpha
[Β] 194 12/02 302 C2 Beta
[Γ] 195 12/03 303 C3 Gamma
[Δ] 196 12/04 304 C4 Delta
[Ε] 197 12/05 305 C5 Epsilon
[Ζ] 198 12/06 306 C6 Zeta
[Η] 199 12/07 307 C7 Eta
[Θ] 200 12/08 310 C8 Theta
[Ι] 201 12/09 311 C9 Iota
[Κ] 202 12/10 312 CA Kappa
[Λ] 203 12/11 313 CB Lamda
[Μ] 204 12/12 314 CC Mu
[Ν] 205 12/13 315 CD Nu
[Ξ] 206 12/14 316 CE Ksi
[Ο] 207 12/15 317 CF Omicron
[Π] 208 13/00 320 D0 Pi
[Ρ] 209 13/01 321 D1 Rho
[] 210 13/02 322 D2 (UNUSED)
[Σ] 211 13/03 323 D3 Sigma
[Τ] 212 13/04 324 D4 Tau
[Υ] 213 13/05 325 D5 Upsilon
[Φ] 214 13/06 326 D6 Phi
[Χ] 215 13/07 327 D7 Khi
[Ψ] 216 13/08 330 D8 Psi
[Ω] 217 13/09 331 D9 Omega
[Ϊ] 218 13/10 332 DA Iota with diaeresis
[Ϋ] 219 13/11 333 DB Upsilon with diaeresis
[ά] 220 13/12 334 DC alpha with accent
[έ] 221 13/13 335 DD epsilon with accent
[ή] 222 13/14 336 DE eta with accent
[ί] 223 13/15 337 DF iota with accent
[ΰ] 224 14/00 340 E0 upsilon with diaeresis and accent
[α] 225 14/01 341 E1 alpha
[β] 226 14/02 342 E2 beta
[γ] 227 14/03 343 E3 gamma
[δ] 228 14/04 344 E4 delta
[ε] 229 14/05 345 E5 epsilon
[ζ] 230 14/06 346 E6 zeta
[η] 231 14/07 347 E7 eta
[θ] 232 14/08 350 E8 theta
[ι] 233 14/09 351 E9 iota
[κ] 234 14/10 352 EA kappa
[λ] 235 14/11 353 EB lamda
[μ] 236 14/12 354 EC mu
[ν] 237 14/13 355 ED nu
[ξ] 238 14/14 356 EE ksi
[ο] 239 14/15 357 EF omicron
[π] 240 15/00 360 F0 pi
[ρ] 241 15/01 361 F1 rho
[ς] 242 15/02 362 F2 terminal sigma
[σ] 243 15/03 363 F3 sigma
[τ] 244 15/04 364 F4 tau
[υ] 245 15/05 365 F5 upsilon
[φ] 246 15/06 366 F6 phi
[χ] 247 15/07 367 F7 khi
[ψ] 248 15/08 370 F8 psi
[ω] 249 15/09 371 F9 omega
[ϊ] 250 15/10 372 FA iota with diaeresis
[ϋ] 251 15/11 373 FB upsilon with diaeresis
[ό] 252 15/12 374 FC omicron with accent
[ύ] 253 15/13 375 FD upsilon with accent
[ώ] 254 15/14 376 FE omega with accent
[] 255 15/15 377 FF (UNUSED)

mitsoskitsos
12-29-2003, 11:09 PM
Hi.

I am running MySQL 4.0.17 but I think it will also work under 3.23.xx. I'll check it and let you know.

I used the following phpdig_words_chars variable
$phpdig_words_chars['iso-8859-7'] = '[:alnum:]ÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÓÔÕÖ×ØÙ¢¸¹º¼¾¿ÚÛáâãä æçèéêëìíîïðñóôõö÷øùÜÝÞßüýþúûÀà';

The link is http://find.pin.gr/search.php

Charter
12-29-2003, 11:39 PM
Hi. Just tried it. Sweet. Also, you might want to change:

$phpdig_string_subst['iso-8859-7'] = 'A:A,a:a';

to the following:

$phpdig_string_subst['iso-8859-7'] = 'Q:Q,q:q';

as Q is less used than A.

mkst
12-30-2003, 04:42 AM
Hello!

I tried this as well, however something is wrong again... I tried to index a site that contained Greek words and some English words.
The keywords table contains all the Greek (with Greek characters) and English words. The search works perfectly when I search for an English word; however, when I search for a Greek word (which exists in the keywords table) I get no results.

When I removed the following line (101) from search_function.php it seemed to work fine for both Greek and English. Any ideas?

if (eregi("[^[:alnum:]^ +^-]+",$query_to_parse)) { $query_to_parse = eregi_replace("[^[:alnum:]^ ]+"," ",$query_to_parse); }

The $query_to_parse variable is always set to an empty string when I search for Greek.

ps. The client and server charsets are set to latin1
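
A possible explanation for the empty query, sketched with preg_replace (eregi_replace does not exist on current PHP): under the default C locale [:alnum:] matches only ASCII letters and digits, so every Greek byte counts as a character to strip. The function sketch_clean_query and its $extraChars parameter are made up for this illustration. Widening the class is only one conceivable way around it, not necessarily what PhpDig does.

```php
<?php
// Illustration of the stripping step, rewritten with preg_replace because
// eregi_replace is not available on current PHP. Not PhpDig's actual code.
function sketch_clean_query(string $query, string $extraChars = ''): string
{
    // Replace every run of bytes outside the allowed class with one space.
    $cleaned = preg_replace('/[^[:alnum:] ' . $extraChars . ']+/', ' ', $query);
    return trim($cleaned);
}

$greekQuery = "\xF4\xE5\xF3\xF4"; // tau epsilon sigma tau in ISO-8859-7
$greekLower = implode('', array_map('chr', range(0xE1, 0xF9)));

var_dump(sketch_clean_query("test " . $greekQuery));              // Greek bytes are stripped
var_dump(sketch_clean_query("test " . $greekQuery, $greekLower)); // Greek bytes survive
```

With a query that is Greek only, the first form leaves nothing behind after trimming, which matches the empty $query_to_parse described above.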

Charter
12-30-2003, 04:44 AM
Hi. Update to PhpDig 1.6.5. ;)

Charter
12-30-2003, 04:51 AM
mkst, what version of MySQL are you running?

mkst
12-30-2003, 04:55 AM
Thanks!
I am running ver. 4.0.15-standard

mkst
12-30-2003, 06:22 AM
Upgrading seems to work.
Thanks! :)

The engine can find both Greek and English words.
However, there are two more issues.
1. Example: The search for "äåëößíéá" and "äåëÖßíéá" (changing the case of a single letter) gives exactly the same results; however, the keyword does not appear in the description. The description shows only the first sentence of the HTML file.
2. There are no results displayed if I forget to place the accent.

I solved the second problem by replacing the
$phpdig_string_subst['iso-8859-7'] = 'A:A,a:a';
with
$phpdig_string_subst['iso-8859-7'] = 'A:A,a:a,é:ßú,á:Ü,å:Ý,ç:Þ,ï:ü,õ:ýû,ù:þ';
(vowel without accent:vowel with accent,vowel with diaeresis)

Is this all right? I noticed that there are fewer entries in the keywords table (were 1232, now 1204).
Any ideas about 1.?

Thanks!
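
For reference, the accent folding in that string can be sketched as a plain byte-for-byte translation. This is an illustration with hex escapes and a made-up function name, sketch_strip_accents, and it is not how PhpDig applies the setting internally.

```php
<?php
// Illustration only: the accent folding described above, as a byte-for-byte
// translation. $from holds accented/diaeresis vowels, $to the plain vowels.
$from = "\xDF\xFA\xDC\xDD\xDE\xFC\xFD\xFB\xFE";
$to   = "\xE9\xE9\xE1\xE5\xE7\xEF\xF5\xF5\xF9";

function sketch_strip_accents(string $word, string $from, string $to): string
{
    return strtr($word, $from, $to);
}

$accented = "\xE5\xEB\xEB\xE7\xED\xE9\xEA\xDC"; // "ellinika" with the accented final alpha
$plain    = "\xE5\xEB\xEB\xE7\xED\xE9\xEA\xE1"; // the same word typed without the accent

var_dump(sketch_strip_accents($accented, $from, $to) === $plain); // both spellings fold to one form
```

If the same folding also runs at index time, accented and unaccented spellings of a word would end up as a single keyword, which might be why the keywords table shrank from 1232 to 1204 entries. That is a guess and has not been checked against the PhpDig code.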

Charter
12-30-2003, 06:48 AM
Hi. Are you sure your MySQL charset is latin1? As for number one, do you have define('DISPLAY_SNIPPETS',true); in the config file?

If so then try increasing define('SNIPPET_DISPLAY_LENGTH',150); and/or define('DISPLAY_SNIPPETS_NUM',4); in the config file.

Also, in search_function.php find the following:

while($num_extracts < DISPLAY_SNIPPETS_NUM && $extract_content = fgets($f_handler,1024)) {
if(eregi($reg_strings,$extract_content)) {
$match_this_spot = str_replace('<','&lt;',str_replace('>','&gt;',trim($match_this_spot)));

There is a typo (not related to number one):

$match_this_spot = str_replace('<','&lt;',str_replace('>','&gt;',trim($match_this_spot)));

should be:

$match_this_spot = str_replace('<','&lt;',str_replace('>','&gt;',trim($extract_content)));

Now for number one, decrease 1024 in fgets($f_handler,1024) in the above code to something like 200.