|
12-24-2003, 12:48 PM | #16 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. PhpDig takes content and writes it to text files. Characters in the ASCII range display as themselves, but characters above 127 show up as their ISO-8859-1 counterparts. If you look at this page, you can see how, for example, Delta displays as Δ in the browser, but when you view Delta in the HTML source it is the Ä character.
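To make the Δ versus Ä point concrete, here is a minimal sketch in Python (not PhpDig code). It only assumes that the page byte is 196 and that the two viewers assume ISO-8859-7 and ISO-8859-1 respectively; the variable names are made up for illustration.

```python
# Byte 196 (0xC4) is the same byte in both cases; only the interpretation differs.
raw = bytes([196])

as_greek = raw.decode("iso8859_7")   # a viewer that assumes ISO-8859-7
as_latin1 = raw.decode("latin-1")    # a viewer that assumes ISO-8859-1

print(as_greek, as_latin1)
```

Nothing is converted in either case. The stored byte never changes, which is why the text file can look correct in one tool and wrong in another.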
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension. |
12-26-2003, 01:51 AM | #17 |
Green Mole
Join Date: Dec 2003
Posts: 7
|
Hi Charter. Thanks for your reply, but I still don't understand it (I don't have much experience...).
Correct me if I say something wrong... The Delta character has decimal value 196, so the PhpDig spider reads 196 from the 8859-7 encoded HTML page, writes it as 196 in the text file, and then converts it to the corresponding Latin (below 127) character according to the phpdig_string_subst table. I cannot understand why you do not put the 196 character into the MySQL table... I want to index only ISO-8859-7 pages, so I am not interested in other encodings. In the text_content directory I can read the txt files perfectly, but when the Greek-to-Latin conversion takes place something goes wrong. I have tried many combinations of the phpdig_string_subst and phpdig_words_chars variables, but the results aren't good. So I concluded that the only solution is to bypass the Greek-to-Latin conversion. Can you help me with this? (I cannot easily find this conversion part in the PhpDig code.) |
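The conversion step described above can be sketched roughly as follows. This is Python, not PhpDig's PHP, and the `GREEK`/`LATIN` mapping is a hypothetical one-to-one table invented for illustration; the real contents of phpdig_string_subst may differ.

```python
# Hypothetical one-to-one table; PhpDig's real phpdig_string_subst may differ.
GREEK = "αβγδεζηθικλμνξοπρστυφχψω"
LATIN = "abgdezhqiklmnxoprstufcyw"

# Each Greek letter is a single byte in ISO-8859-7, so a byte-for-byte table works.
SUBST = bytes.maketrans(GREEK.encode("iso8859_7"), LATIN.encode("ascii"))

def to_latin(page_bytes: bytes) -> bytes:
    """Replace each mapped Greek byte with its Latin stand-in; leave other bytes alone."""
    return page_bytes.translate(SUBST)

page = "english τεστ".encode("iso8859_7")
print(to_latin(page))  # b'english test'
```

Bytes that are not in the table, such as 196 (capital Delta) in this sketch, pass through unchanged, so an incomplete table leaves a mix of Latin and high bytes in the output.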
12-26-2003, 05:40 AM | #18 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Can you attach one of the text files from the text_content directory?
12-27-2003, 09:10 AM | #19 |
Green Mole
Join Date: Dec 2003
Posts: 7
|
|
12-28-2003, 03:40 AM | #20 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Thanks. When PhpDig spiders an ISO-8859-7 page, it sees characters like the following:
Code:
english test spider åëëçíéêÜ ôåóô áñÜ÷íç áâãäåæçèéêëìíïðñóôõö÷øù áâã
Code:
english test spider ελληνικά τεστ αράχνη αβγδεζηθικλμνοπρστυφχψω αβγ
Of course, this one-to-one mapping cannot be done with a variety of languages and so PhpDig does not convert those languages correctly. Just as a test, if you are using PhpDig on ISO-8859-7 pages only, set the following in the config.php file and then do a crawl:
PHP Code:
This can be done via shell. Just go to the MySQL prompt and type status and MySQL will output the info. What are your Client characterset and Server characterset set to? If you are not able to check the setting of Client characterset and Server characterset, then take a look at the new table entries via phpMyAdmin after doing a crawl with the above changes. Are the words and characters stored as (extended) ASCII? Also, how are the new search results?
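As a rough way to judge what you see in the table, the following Python sketch shows how the same stored bytes look under the two character sets. It assumes the viewer decodes the bytes as either ISO-8859-1 or ISO-8859-7; it does not query MySQL, and the variable names are illustrative only.

```python
word = "ελληνικά"
stored = word.encode("iso8859_7")           # the bytes the spider reads and writes

seen_as_latin1 = stored.decode("latin-1")   # what a latin1 viewer displays
seen_as_greek = stored.decode("iso8859_7")  # what a greek viewer displays

print(stored.hex(), seen_as_latin1, seen_as_greek)
```

If the table shows the first form, the bytes may still be correct and only the display character set is off.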
12-28-2003, 12:57 PM | #21 |
Green Mole
Join Date: Dec 2003
Posts: 7
|
Hi.
I made the test and here's what happens: I crawled the same test page and only the 3 English words were put in the keywords table. If I change phpdig_string_subst to map Greek to Latin characters, then I get all the words in the keywords table, but they are (as they should be) Latin. I think all the problems would be solved if we managed to put extended ASCII chars into the MySQL table. The client and server character sets are both greek (I compiled MySQL with --with-charset=greek). |
12-28-2003, 06:25 PM | #22 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Try the following. Keep the following changes in the config.php file:
PHP Code:
In the phpdigIndexFile function change:
PHP Code:
PHP Code:
PHP Code:
PHP Code:
Now when you do a crawl do you see (extended) ASCII or Greek characters in the table?
12-29-2003, 01:14 PM | #23 |
Green Mole
Join Date: Dec 2003
Posts: 7
|
Hi Charter.
Did the changes and all works fine! I see Greek characters in the table and the search works perfectly. I will make more tests and I'll let you know if there's a problem. Many thanks! |
12-29-2003, 05:03 PM | #24 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Great news! What version of MySQL are you running?
The below ISO-8859-7 tables are from this page and show the Greek and extended ASCII characters. I omitted some of the extended ASCII characters from the $phpdig_words_chars variable based on my limited knowledge of Greek. You should check the $phpdig_words_chars variable and add or remove extended ASCII characters in the variable based on your knowledge. The space isn't necessary to add, and of course don't add "bad" characters. Please do let me know the results of your tests. I'd like to make PhpDig portable for more languages than ISO-8859-1 and ISO-8859-2. Also, what's the link to your search? I'd like to take a look.
Code:
[Extended ASCII Character] ISO 8859-7 Latin / Greek Alphabet
char  dec  col/row  oct  hex  description
[_]   160  10/00    240  A0   No-break space
[¡]   161  10/01    241  A1   Left single quotation mark
[¢]   162  10/02    242  A2   right single quotation mark
[£]   163  10/03    243  A3   Pound sign
[¤]   164  10/04    244  A4   (UNUSED)
[¥]   165  10/05    245  A5   (UNUSED)
[¦]   166  10/06    246  A6   Broken bar
[§]   167  10/07    247  A7   Paragraph sign
[¨]   168  10/08    250  A8   Diaeresis (Dialytika)
[©]   169  10/09    251  A9   Copyright sign
[ª]   170  10/10    252  AA   (UNUSED)
[«]   171  10/11    253  AB   Left angle quotation
[¬]   172  10/12    254  AC   Not sign
[_]   173  10/13    255  AD   Soft hyphen
[®]   174  10/14    256  AE   (UNUSED)
[¯]   175  10/15    257  AF   Horizontal bar (Parenthetiki pavla)
[°]   176  11/00    260  B0   Degree sign
[±]   177  11/01    261  B1   Plus-minus sign
[²]   178  11/02    262  B2   Superscript two
[³]   179  11/03    263  B3   Superscript three
[´]   180  11/04    264  B4   Accent (tonos)
[µ]   181  11/05    265  B5   Diaeresis and accent (Dialytika and Tonos)
[¶]   182  11/06    266  B6   Alpha with accent
[·]   183  11/07    267  B7   Middle dot (Ano Teleia)
[¸]   184  11/08    270  B8   Epsilon with accent
[¹]   185  11/09    271  B9   Eta with accent
[º]   186  11/10    272  BA   Iota with accent
[»]   187  11/11    273  BB   Right angle quotation
[¼]   188  11/12    274  BC   Omicron with accent
[½]   189  11/13    275  BD   One half
[¾]   190  11/14    276  BE   Upsilon with accent
[¿]   191  11/15    277  BF   Omega with accent
[À]   192  12/00    300  C0   iota with diaeresis and accent
[Á]   193  12/01    301  C1   Alpha
[Â]   194  12/02    302  C2   Beta
[Ã]   195  12/03    303  C3   Gamma
[Ä]   196  12/04    304  C4   Delta
[Å]   197  12/05    305  C5   Epsilon
[Æ]   198  12/06    306  C6   Zeta
[Ç]   199  12/07    307  C7   Eta
[È]   200  12/08    310  C8   Theta
[É]   201  12/09    311  C9   Iota
[Ê]   202  12/10    312  CA   Kappa
[Ë]   203  12/11    313  CB   Lamda
[Ì]   204  12/12    314  CC   Mu
[Í]   205  12/13    315  CD   Nu
[Î]   206  12/14    316  CE   Ksi
[Ï]   207  12/15    317  CF   Omicron
[Ð]   208  13/00    320  D0   Pi
[Ñ]   209  13/01    321  D1   Rho
[Ò]   210  13/02    322  D2   (UNUSED)
[Ó]   211  13/03    323  D3   Sigma
[Ô]   212  13/04    324  D4   Tau
[Õ]   213  13/05    325  D5   Upsilon
[Ö]   214  13/06    326  D6   Phi
[×]   215  13/07    327  D7   Khi
[Ø]   216  13/08    330  D8   Psi
[Ù]   217  13/09    331  D9   Omega
[Ú]   218  13/10    332  DA   Iota with diaeresis
[Û]   219  13/11    333  DB   Upsilon with diaeresis
[Ü]   220  13/12    334  DC   alpha with accent
[Ý]   221  13/13    335  DD   epsilon with accent
[Þ]   222  13/14    336  DE   eta with accent
[ß]   223  13/15    337  DF   iota with accent
[à]   224  14/00    340  E0   upsilon with diaeresis and accent
[á]   225  14/01    341  E1   alpha
[â]   226  14/02    342  E2   beta
[ã]   227  14/03    343  E3   gamma
[ä]   228  14/04    344  E4   delta
[å]   229  14/05    345  E5   epsilon
[æ]   230  14/06    346  E6   zeta
[ç]   231  14/07    347  E7   eta
[è]   232  14/08    350  E8   theta
[é]   233  14/09    351  E9   iota
[ê]   234  14/10    352  EA   kappa
[ë]   235  14/11    353  EB   lamda
[ì]   236  14/12    354  EC   mu
[í]   237  14/13    355  ED   nu
[î]   238  14/14    356  EE   ksi
[ï]   239  14/15    357  EF   omicron
[ð]   240  15/00    360  F0   pi
[ñ]   241  15/01    361  F1   rho
[ò]   242  15/02    362  F2   terminal sigma
[ó]   243  15/03    363  F3   sigma
[ô]   244  15/04    364  F4   tau
[õ]   245  15/05    365  F5   upsilon
[ö]   246  15/06    366  F6   phi
[÷]   247  15/07    367  F7   khi
[ø]   248  15/08    370  F8   psi
[ù]   249  15/09    371  F9   omega
[ú]   250  15/10    372  FA   iota with diaeresis
[û]   251  15/11    373  FB   upsilon with diaeresis
[ü]   252  15/12    374  FC   omicron with accent
[ý]   253  15/13    375  FD   upsilon with accent
[þ]   254  15/14    376  FE   omega with accent
[ÿ]   255  15/15    377  FF   (UNUSED)
Code:
[Greek Character] ISO 8859-7 Latin / Greek Alphabet
char  dec  col/row  oct  hex  description
[ ]   160  10/00    240  A0   No-break space
[ʽ]   161  10/01    241  A1   Left single quotation mark
[ʼ]   162  10/02    242  A2   right single quotation mark
[£]   163  10/03    243  A3   Pound sign
[]    164  10/04    244  A4   (UNUSED)
[]    165  10/05    245  A5   (UNUSED)
[¦]   166  10/06    246  A6   Broken bar
[§]   167  10/07    247  A7   Paragraph sign
[¨]   168  10/08    250  A8   Diaeresis (Dialytika)
[©]   169  10/09    251  A9   Copyright sign
[]    170  10/10    252  AA   (UNUSED)
[«]   171  10/11    253  AB   Left angle quotation
[¬]   172  10/12    254  AC   Not sign
[_]   173  10/13    255  AD   Soft hyphen
[]    174  10/14    256  AE   (UNUSED)
[―]   175  10/15    257  AF   Horizontal bar (Parenthetiki pavla)
[°]   176  11/00    260  B0   Degree sign
[±]   177  11/01    261  B1   Plus-minus sign
[²]   178  11/02    262  B2   Superscript two
[³]   179  11/03    263  B3   Superscript three
[΄]   180  11/04    264  B4   Accent (tonos)
[΅]   181  11/05    265  B5   Diaeresis and accent (Dialytika and Tonos)
[Ά]   182  11/06    266  B6   Alpha with accent
[·]   183  11/07    267  B7   Middle dot (Ano Teleia)
[Έ]   184  11/08    270  B8   Epsilon with accent
[Ή]   185  11/09    271  B9   Eta with accent
[Ί]   186  11/10    272  BA   Iota with accent
[»]   187  11/11    273  BB   Right angle quotation
[Ό]   188  11/12    274  BC   Omicron with accent
[½]   189  11/13    275  BD   One half
[Ύ]   190  11/14    276  BE   Upsilon with accent
[Ώ]   191  11/15    277  BF   Omega with accent
[ΐ]   192  12/00    300  C0   iota with diaeresis and accent
[Α]   193  12/01    301  C1   Alpha
[Β]   194  12/02    302  C2   Beta
[Γ]   195  12/03    303  C3   Gamma
[Δ]   196  12/04    304  C4   Delta
[Ε]   197  12/05    305  C5   Epsilon
[Ζ]   198  12/06    306  C6   Zeta
[Η]   199  12/07    307  C7   Eta
[Θ]   200  12/08    310  C8   Theta
[Ι]   201  12/09    311  C9   Iota
[Κ]   202  12/10    312  CA   Kappa
[Λ]   203  12/11    313  CB   Lamda
[Μ]   204  12/12    314  CC   Mu
[Ν]   205  12/13    315  CD   Nu
[Ξ]   206  12/14    316  CE   Ksi
[Ο]   207  12/15    317  CF   Omicron
[Π]   208  13/00    320  D0   Pi
[Ρ]   209  13/01    321  D1   Rho
[]    210  13/02    322  D2   (UNUSED)
[Σ]   211  13/03    323  D3   Sigma
[Τ]   212  13/04    324  D4   Tau
[Υ]   213  13/05    325  D5   Upsilon
[Φ]   214  13/06    326  D6   Phi
[Χ]   215  13/07    327  D7   Khi
[Ψ]   216  13/08    330  D8   Psi
[Ω]   217  13/09    331  D9   Omega
[Ϊ]   218  13/10    332  DA   Iota with diaeresis
[Ϋ]   219  13/11    333  DB   Upsilon with diaeresis
[ά]   220  13/12    334  DC   alpha with accent
[έ]   221  13/13    335  DD   epsilon with accent
[ή]   222  13/14    336  DE   eta with accent
[ί]   223  13/15    337  DF   iota with accent
[ΰ]   224  14/00    340  E0   upsilon with diaeresis and accent
[α]   225  14/01    341  E1   alpha
[β]   226  14/02    342  E2   beta
[γ]   227  14/03    343  E3   gamma
[δ]   228  14/04    344  E4   delta
[ε]   229  14/05    345  E5   epsilon
[ζ]   230  14/06    346  E6   zeta
[η]   231  14/07    347  E7   eta
[θ]   232  14/08    350  E8   theta
[ι]   233  14/09    351  E9   iota
[κ]   234  14/10    352  EA   kappa
[λ]   235  14/11    353  EB   lamda
[μ]   236  14/12    354  EC   mu
[ν]   237  14/13    355  ED   nu
[ξ]   238  14/14    356  EE   ksi
[ο]   239  14/15    357  EF   omicron
[π]   240  15/00    360  F0   pi
[ρ]   241  15/01    361  F1   rho
[ς]   242  15/02    362  F2   terminal sigma
[σ]   243  15/03    363  F3   sigma
[τ]   244  15/04    364  F4   tau
[υ]   245  15/05    365  F5   upsilon
[φ]   246  15/06    366  F6   phi
[χ]   247  15/07    367  F7   khi
[ψ]   248  15/08    370  F8   psi
[ω]   249  15/09    371  F9   omega
[ϊ]   250  15/10    372  FA   iota with diaeresis
[ϋ]   251  15/11    373  FB   upsilon with diaeresis
[ό]   252  15/12    374  FC   omicron with accent
[ύ]   253  15/13    375  FD   upsilon with accent
[ώ]   254  15/14    376  FE   omega with accent
[]    255  15/15    377  FF   (UNUSED)
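If you want to regenerate a comparison like the tables above, a minimal Python sketch follows. The function name `iso8859_7_rows` is made up for this example, and the descriptions come from Python's Unicode database rather than from the page the tables were copied from. Python's ISO-8859-7 codec may follow a later revision of the standard, so a few positions listed as (UNUSED) above can come out assigned.

```python
import unicodedata

def iso8859_7_rows(first=0xA0, last=0xFF):
    """Return (dec, col/row, oct, hex, latin1 char, greek char, name) per byte."""
    rows = []
    for code in range(first, last + 1):
        raw = bytes([code])
        latin1 = raw.decode("latin-1")
        # errors="ignore" yields "" for bytes the codec leaves unassigned.
        greek = raw.decode("iso8859_7", errors="ignore")
        name = unicodedata.name(greek, "(no name)") if greek else "(UNUSED)"
        rows.append((code, f"{code >> 4:02d}/{code & 15:02d}",
                     f"{code:o}", f"{code:02X}", latin1, greek, name))
    return rows

# Print the rows for capital Alpha through Delta.
for row in iso8859_7_rows(0xC1, 0xC4):
    print(*row)
```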
12-29-2003, 11:09 PM | #25 |
Green Mole
Join Date: Dec 2003
Posts: 7
|
Hi.
I am running MySQL 4.0.17, but I think it will also work under 3.23.xx. I'll check it and let you know. I used the following phpdig_words_chars variable:
$phpdig_words_chars['iso-8859-7'] = '[:alnum:]ÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÓÔÕÖ×ØÙ¢¸¹º¼¾¿ÚÛáâãä æçèéêëì*îïðñóôõö÷øùÜÝÞßüýþúûÀ*';
The link is http://find.pin.gr/search.php |
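For anyone checking which bytes belong in such a variable, here is a rough Python sketch of the same idea. It is not PhpDig code: it assumes the variable acts as the body of a character class (judging by its [:alnum:] prefix), and `word_bytes` and `split_words` are hypothetical helpers. It collects every high byte that decodes to a letter under ISO-8859-7 and uses the result to split a page into words.

```python
import re

def word_bytes():
    """Collect the high bytes that decode to letters under ISO-8859-7."""
    keep = bytearray()
    for code in range(0xA0, 0x100):
        ch = bytes([code]).decode("iso8859_7", errors="ignore")
        if ch.isalpha():          # "" (unassigned byte) is not alphabetic
            keep.append(code)
    return bytes(keep)

# ASCII letters and digits plus the Greek letter bytes form one character class.
WORD = re.compile(b"[0-9A-Za-z" + re.escape(word_bytes()) + b"]+")

def split_words(page_bytes):
    """Return the words of an ISO-8859-7 page as decoded strings."""
    return [w.decode("iso8859_7") for w in WORD.findall(page_bytes)]

print(split_words("english test, ελληνικά τεστ!".encode("iso8859_7")))
```

Comparing the output of `word_bytes()` against a hand-typed list is one way to spot a missing or stray character.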
12-29-2003, 11:39 PM | #26 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Just tried it. Sweet. Also, you might want to change:
PHP Code:
PHP Code:
12-30-2003, 04:42 AM | #27 |
Green Mole
Join Date: Oct 2003
Posts: 11
|
Hello!
I tried this as well, but something is wrong again... I tried to index a site that contained Greek words and some English words. The keywords table contains all the Greek (with Greek characters) and English words. The search works perfectly when I search for an English word, but when I search for a Greek word (which exists in the keywords table) I get no results. When I removed the following line (101) from search_function.php it seemed to work fine for both Greek and English. Any ideas?
PHP Code:
ps. The client and server charsets are set to latin1 |
12-30-2003, 04:44 AM | #28 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
Hi. Update to PhpDig 1.6.5.
12-30-2003, 04:51 AM | #29 |
Head Mole
Join Date: May 2003
Posts: 2,539
|
mkst, what version of MySQL are you running?
12-30-2003, 04:55 AM | #30 |
Green Mole
Join Date: Oct 2003
Posts: 11
|
Thanks!
I am running ver. 4.0.15-standard |