PhpDig.net > PhpDig Forums > How-to Forum

Old 12-24-2003, 12:48 PM   #16
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. PhpDig takes content and writes it to text files. Certain characters display as themselves in ASCII but other characters show up as their ISO counterparts. If you look at this page, you can see how, for example, Delta displays as Δ in the browser but when you view Delta in the HTML source it is the Ä character.
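To make the Delta example concrete, here is a minimal sketch (not PhpDig code) of how one and the same byte reads under two encodings. It assumes PHP's iconv extension is available; the variable names are illustrative only.

```php
<?php
// A single byte with decimal value 196 (hex C4), as stored in an ISO-8859-7 page.
$byte = "\xC4";

// Interpreted as ISO-8859-7, the byte is the Greek capital letter Delta.
$asGreek = iconv('ISO-8859-7', 'UTF-8', $byte);

// Interpreted as ISO-8859-1, the very same byte is A with diaeresis.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);

echo ord($byte), ' -> ', $asGreek, ' / ', $asLatin1, "\n";
```

The stored byte never changes; only the encoding used to interpret it decides which glyph is shown.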
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 12-26-2003, 01:51 AM   #17
mitsoskitsos
Green Mole
 
Join Date: Dec 2003
Posts: 7
Hi Charter. Thanks for your reply but I still don't understand it (I don't have much experience...)

Correct me if I say something wrong... The Delta character has decimal value 196, so the PhpDig spider reads 196 from the ISO-8859-7 encoded HTML page, writes it as 196 in the text file, and then converts it to the corresponding Latin (below 127) character according to the phpdig_string_subst table. I cannot understand why you do not put the 196 character in the MySQL table...

I want to index only ISO-8859-7 pages, so I am not interested in other encodings. In the text_content directory I can read the txt files perfectly, but when the Greek to Latin conversion takes place something goes wrong. I have tried many combinations of the phpdig_string_subst and phpdig_words_chars variables but the result isn't good.

So I concluded that the only solution is to bypass the Greek to Latin conversion. Can you help me with this? (I cannot easily find this conversion part in the PhpDig code)
Old 12-26-2003, 05:40 AM   #18
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Can you attach one of the text files from the text_content directory?
Old 12-27-2003, 09:10 AM   #19
mitsoskitsos
Green Mole
 
Join Date: Dec 2003
Posts: 7
http://beta.topweb.gr/25.txt

It's from a test page
Old 12-28-2003, 03:40 AM   #20
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Thanks. When PhpDig spiders an ISO-8859-7 page, it sees characters like the following:
Code:
english test spider åëëçíéêÜ ôåóô áñÜ÷íç áâãäåæçèéêëìíïðñóôõö÷øù áâã
Because PhpDig currently supports only ISO-8859-1 and ISO-8859-2, it does not know how to convert the above extended ASCII characters to the following characters that get displayed in the browser:
Code:
english test spider ελληνικά τεστ αράχνη αβγδεζηθικλμνοπρστυφχψω αβγ
The $phpdig_string_subst and $phpdig_words_chars variables are available to set up another ISO-8859 encoding, but only if the language can be mapped one-to-one with Latin counterparts.

Of course, this one-to-one mapping cannot be done with a variety of languages and so PhpDig does not convert those languages correctly.
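As a rough illustration of what a one-to-one mapping means here, the sketch below folds a few ISO-8859-7 bytes to Latin letters. It is not PhpDig's own code: the helper name buildSubstMap is hypothetical, and the 'target:sources' format is only inferred from the 'A:A,a:a' value used in this thread.

```php
<?php
// Hypothetical helper: turn a 'target:sources' list into a byte-to-byte map.
// Assumed format: each byte listed after the colon is replaced by the target.
function buildSubstMap(string $spec): array
{
    $map = [];
    foreach (explode(',', $spec) as $pair) {
        [$target, $sources] = explode(':', $pair, 2);
        foreach (str_split($sources) as $sourceByte) {
            $map[$sourceByte] = $target;
        }
    }
    return $map;
}

// Illustrative ISO-8859-7 bytes: alpha (0xE1) and alpha with accent (0xDC)
// fold to 'a'; epsilon (0xE5) and epsilon with accent (0xDD) fold to 'e'.
$spec = "a:\xE1\xDC,e:\xE5\xDD";
$map  = buildSubstMap($spec);

// strtr() with an array replaces every key it finds with the mapped value.
echo strtr("\xE1\xE5\xDC", $map), "\n"; // prints "aea"
```

A scheme like this only works when every source character has a single sensible Latin stand-in, which is the limitation described above.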

Just as a test, if you are using PhpDig on ISO-8859-7 pages only, set the following in the config.php file and then do a crawl:
PHP Code:
define('PHPDIG_ENCODING','iso-8859-7');
// give functions something trivial to do
$phpdig_string_subst['iso-8859-7'] = 'A:A,a:a';
// remove word wrapping in the below line
$phpdig_words_chars['iso-8859-7'] = '[:alnum:]µ¶¸¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ';
With ISO-8859-7 set, the browser should pass the search query into PhpDig as (extended) ASCII characters. The other thing to check is to see how Client characterset and Server characterset are set.

This can be done via shell. Just go to the MySQL prompt and type status and MySQL will output the info. What are your Client characterset and Server characterset set to?

If you are not able to check the setting of Client characterset and Server characterset, then take a look at the new table entries via phpMyAdmin after doing a crawl with the above changes. Are the words and characters stored as (extended) ASCII? Also, how are the new search results?
Old 12-28-2003, 12:57 PM   #21
mitsoskitsos
Green Mole
 
Join Date: Dec 2003
Posts: 7
Hi.

I made the test and here's what happens:

I crawled the same test page and only the 3 english words were put in the keywords table.
If I change the phpdig_string_subst and map greek to latin characters, then I get all words in the keywords table, but they are (as they should be) latin.

I think that all problems would be solved if we managed to put extended ASCII chars into the MySQL table.

The client and server characterset are both greek (I compiled MySQL --with-charset=greek).
Old 12-28-2003, 06:25 PM   #22
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Try the following. Keep the following changes in the config.php file:
PHP Code:
define('PHPDIG_ENCODING','iso-8859-7'); 
// give functions something trivial to do 
$phpdig_string_subst['iso-8859-7'] = 'A:A,a:a';
// remove word wrapping in the below line
$phpdig_words_chars['iso-8859-7'] = '[:alnum:]µ¶¸¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ';
In addition, in the robot_functions.php file is a phpdigIndexFile function.

In the phpdigIndexFile function change:
PHP Code:
global $common_words,$relative_script_path,$s_yes,$s_no,$br;
to the following:
PHP Code:
global $phpdig_words_chars,$common_words,$relative_script_path,$s_yes,$s_no,$br;
Also, in the phpdigIndexFile function change:
PHP Code:
        if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^[0-9a-zßðþ]',$key)) 
to the following:
PHP Code:
        if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^['.$phpdig_words_chars[PHPDIG_ENCODING].']',$key)) 
Remember to remove any "word" wrapping in the above code.

Now when you do a crawl do you see (extended) ASCII or Greek characters in the table?
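The effect of the changed condition can be sketched as follows. This is an illustration, not the PhpDig source: it uses preg_match as a stand-in because the ereg functions were removed in PHP 7, and the shortened $wordsChars value is an assumption made for readability.

```php
<?php
// "ελληνικά" as ISO-8859-7 bytes, the form the spider sees.
$key = "\xE5\xEB\xEB\xE7\xED\xE9\xEA\xDC";

// Original test: the first byte must be a digit, a-z, or one of the bytes
// 0xDF, 0xF0, 0xFE (shown as ß, ð, þ when read as ISO-8859-1).
$original = '/^[0-9a-z\xDF\xF0\xFE]/';

// Modified test: the first byte must be one of the configured word characters.
// Illustrative short value: alphanumerics plus the byte range 0xC1-0xFE.
$wordsChars = '[:alnum:]\xC1-\xFE';
$modified   = '/^[' . $wordsChars . ']/';

var_dump(preg_match($original, $key) === 1); // bool(false)
var_dump(preg_match($modified, $key) === 1); // bool(true)
```

With the original class, a word whose first byte is 0xE5 fails the test and is never stored, which is consistent with only the English words reaching the keywords table.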
Old 12-29-2003, 01:14 PM   #23
mitsoskitsos
Green Mole
 
Join Date: Dec 2003
Posts: 7
Hi Charter.

Did the changes and all works fine!

I see Greek characters in the table and the search works perfectly.
I will make more tests and I'll let you know if there's a problem.

Many thanks!
Old 12-29-2003, 05:03 PM   #24
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Great news! What version of MySQL are you running?

The below ISO-8859-7 tables are from this page and show the Greek and extended ASCII characters.

I omitted some of the extended ASCII characters from the $phpdig_words_chars variable based on my limited knowledge of Greek. You should check the $phpdig_words_chars variable and add or remove extended ASCII characters in the variable based on your knowledge. The space isn't necessary to add, and of course don't add "bad" characters.

Please do let me know the results of your tests. I'd like to make PhpDig portable for more languages than ISO-8859-1 and ISO-8859-2. Also, what's the link to your search? I'd like to take a look.
Code:
[Extended ASCII Character]
ISO 8859-7 Latin / Greek Alphabet
char dec col/row oct hex  description
[_]  160  10/00  240  A0  No-break space
[¡]  161  10/01  241  A1  Left single quotation mark
[¢]  162  10/02  242  A2  right single quotation mark
[£]  163  10/03  243  A3  Pound sign
[¤]  164  10/04  244  A4  (UNUSED)
[¥]  165  10/05  245  A5  (UNUSED)
[¦]  166  10/06  246  A6  Broken bar
[§]  167  10/07  247  A7  Paragraph sign
[¨]  168  10/08  250  A8  Diaeresis (Dialytika)
[©]  169  10/09  251  A9  Copyright sign
[ª]  170  10/10  252  AA  (UNUSED)
[«]  171  10/11  253  AB  Left angle quotation
[¬]  172  10/12  254  AC  Not sign
[_]  173  10/13  255  AD  Soft hyphen
[®]  174  10/14  256  AE  (UNUSED)
[¯]  175  10/15  257  AF  Horizontal bar (Parenthetiki pavla)
[°]  176  11/00  260  B0  Degree sign
[±]  177  11/01  261  B1  Plus-minus sign
[²]  178  11/02  262  B2  Superscript two
[³]  179  11/03  263  B3  Superscript three
[´]  180  11/04  264  B4  Accent (tonos)
[µ]  181  11/05  265  B5  Diaeresis and accent (Dialytika and Tonos)
[¶]  182  11/06  266  B6  Alpha with accent
[·]  183  11/07  267  B7  Middle dot (Ano Teleia)
[¸]  184  11/08  270  B8  Epsilon with accent
[¹]  185  11/09  271  B9  Eta with accent
[º]  186  11/10  272  BA  Iota with accent
[»]  187  11/11  273  BB  Right angle quotation
[¼]  188  11/12  274  BC  Omicron with accent
[½]  189  11/13  275  BD  One half
[¾]  190  11/14  276  BE  Upsilon with accent
[¿]  191  11/15  277  BF  Omega with accent
[À]  192  12/00  300  C0  iota with diaeresis and accent
[Á]  193  12/01  301  C1  Alpha
[Â]  194  12/02  302  C2  Beta
[Ã]  195  12/03  303  C3  Gamma
[Ä]  196  12/04  304  C4  Delta
[Å]  197  12/05  305  C5  Epsilon
[Æ]  198  12/06  306  C6  Zeta
[Ç]  199  12/07  307  C7  Eta
[È]  200  12/08  310  C8  Theta
[É]  201  12/09  311  C9  Iota
[Ê]  202  12/10  312  CA  Kappa
[Ë]  203  12/11  313  CB  Lamda
[Ì]  204  12/12  314  CC  Mu
[Í]  205  12/13  315  CD  Nu
[Î]  206  12/14  316  CE  Ksi
[Ï]  207  12/15  317  CF  Omicron
[Ð]  208  13/00  320  D0  Pi
[Ñ]  209  13/01  321  D1  Rho
[Ò]  210  13/02  322  D2  (UNUSED)
[Ó]  211  13/03  323  D3  Sigma
[Ô]  212  13/04  324  D4  Tau
[Õ]  213  13/05  325  D5  Upsilon
[Ö]  214  13/06  326  D6  Phi
[×]  215  13/07  327  D7  Khi
[Ø]  216  13/08  330  D8  Psi
[Ù]  217  13/09  331  D9  Omega
[Ú]  218  13/10  332  DA  Iota with diaeresis
[Û]  219  13/11  333  DB  Upsilon with diaeresis
[Ü]  220  13/12  334  DC  alpha with accent
[Ý]  221  13/13  335  DD  epsilon with accent
[Þ]  222  13/14  336  DE  eta with accent
[ß]  223  13/15  337  DF  iota with accent
[à]  224  14/00  340  E0  upsilon with diaeresis and accent
[á]  225  14/01  341  E1  alpha
[â]  226  14/02  342  E2  beta
[ã]  227  14/03  343  E3  gamma
[ä]  228  14/04  344  E4  delta
[å]  229  14/05  345  E5  epsilon
[æ]  230  14/06  346  E6  zeta
[ç]  231  14/07  347  E7  eta
[è]  232  14/08  350  E8  theta
[é]  233  14/09  351  E9  iota
[ê]  234  14/10  352  EA  kappa
[ë]  235  14/11  353  EB  lamda
[ì]  236  14/12  354  EC  mu
[í]  237  14/13  355  ED  nu
[î]  238  14/14  356  EE  ksi
[ï]  239  14/15  357  EF  omicron
[ð]  240  15/00  360  F0  pi
[ñ]  241  15/01  361  F1  rho
[ò]  242  15/02  362  F2  terminal sigma
[ó]  243  15/03  363  F3  sigma
[ô]  244  15/04  364  F4  tau
[õ]  245  15/05  365  F5  upsilon
[ö]  246  15/06  366  F6  phi
[÷]  247  15/07  367  F7  khi
[ø]  248  15/08  370  F8  psi
[ù]  249  15/09  371  F9  omega
[ú]  250  15/10  372  FA  iota with diaeresis
[û]  251  15/11  373  FB  upsilon with diaeresis
[ü]  252  15/12  374  FC  omicron with accent
[ý]  253  15/13  375  FD  upsilon with accent
[þ]  254  15/14  376  FE  omega with accent
[ÿ]  255  15/15  377  FF  (UNUSED)
Code:
[Greek Character]
ISO 8859-7 Latin / Greek Alphabet
char dec col/row oct hex  description
[ ]  160  10/00  240  A0  No-break space
[ʽ]  161  10/01  241  A1  Left single quotation mark
[ʼ]  162  10/02  242  A2  right single quotation mark
[£]  163  10/03  243  A3  Pound sign
[]  164  10/04  244  A4  (UNUSED)
[]  165  10/05  245  A5  (UNUSED)
[¦]  166  10/06  246  A6  Broken bar
[§]  167  10/07  247  A7  Paragraph sign
[¨]  168  10/08  250  A8  Diaeresis (Dialytika)
[©]  169  10/09  251  A9  Copyright sign
[]  170  10/10  252  AA  (UNUSED)
[«]  171  10/11  253  AB  Left angle quotation
[¬]  172  10/12  254  AC  Not sign
[_]  173  10/13  255  AD  Soft hyphen
[]  174  10/14  256  AE  (UNUSED)
[―]  175  10/15  257  AF  Horizontal bar (Parenthetiki pavla)
[°]  176  11/00  260  B0  Degree sign
[±]  177  11/01  261  B1  Plus-minus sign
[²]  178  11/02  262  B2  Superscript two
[³]  179  11/03  263  B3  Superscript three
[΄]  180  11/04  264  B4  Accent (tonos)
[΅]  181  11/05  265  B5  Diaeresis and accent (Dialytika and Tonos)
[Ά]  182  11/06  266  B6  Alpha with accent
[·]  183  11/07  267  B7  Middle dot (Ano Teleia)
[Έ]  184  11/08  270  B8  Epsilon with accent
[Ή]  185  11/09  271  B9  Eta with accent
[Ί]  186  11/10  272  BA  Iota with accent
[»]  187  11/11  273  BB  Right angle quotation
[Ό]  188  11/12  274  BC  Omicron with accent
[½]  189  11/13  275  BD  One half
[Ύ]  190  11/14  276  BE  Upsilon with accent
[Ώ]  191  11/15  277  BF  Omega with accent
[ΐ]  192  12/00  300  C0  iota with diaeresis and accent
[Α]  193  12/01  301  C1  Alpha
[Β]  194  12/02  302  C2  Beta
[Γ]  195  12/03  303  C3  Gamma
[Δ]  196  12/04  304  C4  Delta
[Ε]  197  12/05  305  C5  Epsilon
[Ζ]  198  12/06  306  C6  Zeta
[Η]  199  12/07  307  C7  Eta
[Θ]  200  12/08  310  C8  Theta
[Ι]  201  12/09  311  C9  Iota
[Κ]  202  12/10  312  CA  Kappa
[Λ]  203  12/11  313  CB  Lamda
[Μ]  204  12/12  314  CC  Mu
[Ν]  205  12/13  315  CD  Nu
[Ξ]  206  12/14  316  CE  Ksi
[Ο]  207  12/15  317  CF  Omicron
[Π]  208  13/00  320  D0  Pi
[Ρ]  209  13/01  321  D1  Rho
[]  210  13/02  322  D2  (UNUSED)
[Σ]  211  13/03  323  D3  Sigma
[Τ]  212  13/04  324  D4  Tau
[Υ]  213  13/05  325  D5  Upsilon
[Φ]  214  13/06  326  D6  Phi
[Χ]  215  13/07  327  D7  Khi
[Ψ]  216  13/08  330  D8  Psi
[Ω]  217  13/09  331  D9  Omega
[Ϊ]  218  13/10  332  DA  Iota with diaeresis
[Ϋ]  219  13/11  333  DB  Upsilon with diaeresis
[ά]  220  13/12  334  DC  alpha with accent
[έ]  221  13/13  335  DD  epsilon with accent
[ή]  222  13/14  336  DE  eta with accent
[ί]  223  13/15  337  DF  iota with accent
[ΰ]  224  14/00  340  E0  upsilon with diaeresis and accent
[α]  225  14/01  341  E1  alpha
[β]  226  14/02  342  E2  beta
[γ]  227  14/03  343  E3  gamma
[δ]  228  14/04  344  E4  delta
[ε]  229  14/05  345  E5  epsilon
[ζ]  230  14/06  346  E6  zeta
[η]  231  14/07  347  E7  eta
[θ]  232  14/08  350  E8  theta
[ι]  233  14/09  351  E9  iota
[κ]  234  14/10  352  EA  kappa
[λ]  235  14/11  353  EB  lamda
[μ]  236  14/12  354  EC  mu
[ν]  237  14/13  355  ED  nu
[ξ]  238  14/14  356  EE  ksi
[ο]  239  14/15  357  EF  omicron
[π]  240  15/00  360  F0  pi
[ρ]  241  15/01  361  F1  rho
[ς]  242  15/02  362  F2  terminal sigma
[σ]  243  15/03  363  F3  sigma
[τ]  244  15/04  364  F4  tau
[υ]  245  15/05  365  F5  upsilon
[φ]  246  15/06  366  F6  phi
[χ]  247  15/07  367  F7  khi
[ψ]  248  15/08  370  F8  psi
[ω]  249  15/09  371  F9  omega
[ϊ]  250  15/10  372  FA  iota with diaeresis
[ϋ]  251  15/11  373  FB  upsilon with diaeresis
[ό]  252  15/12  374  FC  omicron with accent
[ύ]  253  15/13  375  FD  upsilon with accent
[ώ]  254  15/14  376  FE  omega with accent
[]  255  15/15  377  FF  (UNUSED)
Old 12-29-2003, 11:09 PM   #25
mitsoskitsos
Green Mole
 
Join Date: Dec 2003
Posts: 7
Hi.

I am running MySQL 4.0.17 but I think it will also work under 3.23.xx. I'll check it and let you know.

I used the following phpdig_words_chars variable:
$phpdig_words_chars['iso-8859-7'] = '[:alnum:]ÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÓÔÕÖ×ØÙ¢¸¹º¼¾¿ÚÛáâãä æçèéêëìíîïðñóôõö÷øùÜÝÞßüýþúûÀà';

The link is http://find.pin.gr/search.php
Old 12-29-2003, 11:39 PM   #26
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Just tried it. Sweet. Also, you might want to change:
PHP Code:
$phpdig_string_subst['iso-8859-7'] = 'A:A,a:a';
to the following:
PHP Code:
$phpdig_string_subst['iso-8859-7'] = 'Q:Q,q:q';
as Q is less used than A.
Old 12-30-2003, 04:42 AM   #27
mkst
Green Mole
 
Join Date: Oct 2003
Posts: 11
Hello!

I tried this as well, but something is still wrong. I tried to index a site that contained Greek words and some English words.
The keywords table contains all the Greek words (with Greek characters) and English words. The search works perfectly when I search for an English word, but when I search for a Greek word (which exists in the keywords table) I get no results.

When I removed the following line (101) from search_function.php, it seemed to work fine for both Greek and English. Any ideas?

PHP Code:
if (eregi("[^[:alnum:]^ +^-]+",$query_to_parse)) { $query_to_parse = eregi_replace("[^[:alnum:]^ ]+"," ",$query_to_parse); }
The $query_to_parse variable is always set to an empty string when I search for Greek.

ps. The client and server charsets are set to latin1
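A likely explanation for the empty query can be sketched like this. It is an approximation of the quoted line, not the PhpDig source: preg_replace stands in for eregi_replace (removed in PHP 7), the helper name stripQuery and the trim() call are illustrative additions, and it assumes the default C locale, in which [:alnum:] covers only ASCII letters and digits.

```php
<?php
// "ελληνικά" as ISO-8859-7 bytes, as the browser would submit it.
$greekQuery   = "\xE5\xEB\xEB\xE7\xED\xE9\xEA\xDC";
$englishQuery = 'spider';

// Hypothetical helper with the same shape as the quoted replacement: every run
// of bytes that is neither alphanumeric nor a space becomes a single space.
function stripQuery(string $query): string
{
    return trim(preg_replace('/[^[:alnum:] ]+/', ' ', $query));
}

var_dump(stripQuery($englishQuery)); // string(6) "spider"
var_dump(stripQuery($greekQuery));   // string(0) ""
```

Under that assumption every Greek byte counts as non-alphanumeric, so the whole query is replaced and nothing is left to search for.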
Old 12-30-2003, 04:44 AM   #28
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Update to PhpDig 1.6.5.
Old 12-30-2003, 04:51 AM   #29
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
mkst, what version of MySQL are you running?
Old 12-30-2003, 04:55 AM   #30
mkst
Green Mole
 
Join Date: Oct 2003
Posts: 11
Thanks!
I am running ver. 4.0.15-standard
Copyright © 2001 - 2005, ThinkDing LLC. All Rights Reserved.