PhpDig.net


jvalej 12-08-2003 12:15 AM

windows-1251 encoding
 
Hello all!

I would like to configure PhpDig so that it can search pages in the windows-1251 (Cyrillic) encoding.

I have read two similar threads on encoding in this forum:

ISO-8859-5
ISO-8859-7

The ISO-8859-7 thread is quite extensive, but to tell the truth, I still have little idea how to add windows-1251 support to PhpDig... :bang:

I have found a couple of pages on windows-1251 on these sites:

http://www.sensi.org/~alec/locale/other/win1251.html
http://www.cs.susu.ac.ru/RS6000/tbcp1251.html

Could anyone please help me work out which characters to add, and how, to:

$phpdig_string_subst['windows-1251']

and

$phpdig_words_chars['windows-1251']


Thank you very much!!!

Charter 12-08-2003 02:48 PM

Windows-1251 characters (note A0 is a non-breaking space and AD is a soft hyphen; both are shown as _):

A0-AF _ Ў ў Ј ¤ Ґ ¦ § Ё © Є « ¬ _ ® Ї
B0-BF ° ± І і ґ µ ¶ · ё № є » ј Ѕ ѕ ї
C0-CF А Б В Г Д Е Ж З И Й К Л М Н О П
D0-DF Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
E0-EF а б в г д е ж з и й к л м н о п
F0-FF р с т у ф х ц ч ш щ ъ ы ь э ю я

The same bytes read as ISO-8859-1 (Latin-1) characters (note A0 is a space):

A0-AF _ ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ _ ® ¯
B0-BF ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
C0-CF À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
D0-DF Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
E0-EF à á â ã ä å æ ç è é ê ë ì í î ï
F0-FF ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Then complete this table (I had to 'code' it to keep spacing):
Code:

Latin  Cyrillic  Hex  Latin-1
-----  --------  ---  -------
A      А         C0   À
B      Б         C1   Á
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
a      а         E0   à
b      б         E1   á
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z

I'm not very familiar with Cyrillic so I'm not sure if the entries in the above table are correct Latin to Cyrillic mappings. There might be characters that don't map, and I'm not sure what to do with those characters.
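The Hex, Cyrillic, and Latin-1 columns can be filled in mechanically, since each row is just one byte read in two encodings. The following is a minimal sketch, not part of PhpDig: describe_byte() is a hypothetical helper, and it assumes PHP 7 or later with the iconv extension available. The Latin transliteration column still has to be chosen by hand.

```php
<?php
// Hypothetical helper (not part of PhpDig): describe one byte as it reads
// under windows-1251 and under ISO-8859-1, with both results as UTF-8.
function describe_byte(int $code): array
{
    $byte = chr($code);
    return [
        'hex'      => strtoupper(dechex($code)),
        'cyrillic' => iconv('CP1251', 'UTF-8', $byte),
        'latin1'   => iconv('ISO-8859-1', 'UTF-8', $byte),
    ];
}

// Print the C0-FF range, one byte per line: hex, Cyrillic, Latin-1.
for ($code = 0xC0; $code <= 0xFF; $code++) {
    $row = describe_byte($code);
    printf("%s  %s  %s\n", $row['hex'], $row['cyrillic'], $row['latin1']);
}
```

The output is UTF-8, so it should be viewed in a UTF-8 terminal or saved to a UTF-8 file.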

Charter 12-30-2003 04:32 AM

Hi. Here's a new approach/workaround for use with PhpDig 1.6.5.

In the config.php file set the following:
PHP Code:

define('PHPDIG_ENCODING','windows-1251');

// give functions something trivial to do
$phpdig_string_subst['windows-1251'] = 'Q:Q,q:q';

// remove word wrapping in the below line
$phpdig_words_chars['windows-1251'] = '[:alnum:]ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ';

In addition, in the robot_functions.php file is a phpdigIndexFile function.

In the phpdigIndexFile function replace:
PHP Code:

global $common_words,$relative_script_path,$s_yes,$s_no,$br;

with the following:
PHP Code:

global $phpdig_words_chars,$common_words,$relative_script_path,$s_yes,$s_no,$br;

Also, in the phpdigIndexFile function replace:
PHP Code:

        if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^[0-9a-zßðþ]',$key)) 

with the following:
PHP Code:

        if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^['.$phpdig_words_chars[PHPDIG_ENCODING].']',$key)) 

Remember to remove any "word" wrapping in the above code and use PhpDig 1.6.5 if not used already.

Also you will need to index from scratch for the changes to take effect.

Please let me know how this method works for you.
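The ereg() call in the replacement above only tests whether a key starts with an allowed word character. ereg() was removed in PHP 7.0, so on a newer PHP the same first-byte test could be sketched with preg_match(). This is an illustration under assumptions, not PhpDig code: starts_with_word_char() is a hypothetical name, and the class is written as the byte range C0-FF, which covers the same 64 bytes that the character list above spells out.

```php
<?php
// Hypothetical stand-in for the ereg() test: true when the first byte of
// $key is ASCII alphanumeric or a windows-1251 letter in the range C0-FF.
// No /u modifier is used, so the pattern works on raw single bytes.
function starts_with_word_char(string $key): bool
{
    return preg_match('/^[[:alnum:]\xC0-\xFF]/', $key) === 1;
}

// "\xF2\xE5\xF1\xF2" is a four-letter Cyrillic word in windows-1251 bytes.
var_dump(starts_with_word_char("\xF2\xE5\xF1\xF2"));
// A key starting with the guillemet byte AB is rejected.
var_dump(starts_with_word_char("\xABabc"));
```

Writing the bytes as hex escapes keeps the check independent of the encoding the PHP source file is saved in.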

jvalej 01-02-2004 07:18 AM

Happy New Year! :)

Thank you very much for your help! The windows-1251 solution that you posted works!

But now I have two other questions.

First, if I search for a three-letter word (search link below), it is not highlighted in the results:

http://oasiswithin.net/search/search...rt&lim_start=0

My "config.php" contains the following setting:

define('SMALL_WORDS_SIZE',2);


Second, if I search for the word:

http://oasiswithin.net/search/search...rt&lim_start=0

result number 2 also contains words from the navigation menu, although that section should be excluded from the results, as I have the following settings in "config.php":

define('PHPDIG_EXCLUDE_COMMENT','<!-- *************************************');
define('PHPDIG_INCLUDE_COMMENT','************************************** -->');

The content that the search engine should consider is located between these 2 comment lines. Or maybe I have misunderstood?

I should also mention that the navigation section of the site is not contained in the content file being indexed; it is pulled into that file via a PHP include call, from an external .HTML file. Maybe it has something to do with this?


Thank you! :)

Charter 01-02-2004 08:51 AM

Hi. For the first part, it looks like the highlighting is failing to match because of letter case: the upper-case Cyrillic letters are not being mapped to their lower-case forms. For this, try the following:

In the config.php file, replace:
PHP Code:

// give functions something trivial to do
$phpdig_string_subst['windows-1251'] = 'Q:Q,q:q';

// remove word wrapping in the below line
$phpdig_words_chars['windows-1251'] = '[:alnum:]ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ';

with the following:
PHP Code:

// remove word wrapping in the below line
$phpdig_string_subst['windows-1251'] = 'À:à,Á:á,Â:â,Ã:ã,Ä:ä,Å:å,Æ:æ,Ç:ç,È:è,É:é,Ê:ê,Ë:ë,Ì:ì,Í:í,Î:î,Ï:ï,Ð:ð,Ñ:ñ,Ò:ò,Ó:ó,Ô:ô,Õ:õ,Ö:ö,×:÷,Ø:ø,Ù:ù,Ú:ú,Û:û,Ü:ü,Ý:ý,Þ:þ,ß:ÿ';

// remove word wrapping in the below line
$phpdig_words_chars['windows-1251'] = '[:alnum:]ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ';
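Each pair in the substitution string maps an upper-case windows-1251 byte (C0-DF) to the lower-case byte 0x20 above it (E0-FF). As a rough illustration of what such a 'from:to,from:to' string expresses, here is a minimal sketch. Both functions are hypothetical, and how PhpDig itself parses and applies the string may differ.

```php
<?php
// Hypothetical helper: build the 32 upper:lower pairs for bytes C0-DF.
function build_subst_string(): string
{
    $pairs = [];
    for ($code = 0xC0; $code <= 0xDF; $code++) {
        $pairs[] = chr($code) . ':' . chr($code + 0x20);
    }
    return implode(',', $pairs);
}

// Hypothetical helper: split a 'from:to,from:to' string into a map and
// apply it to $text byte by byte with strtr().
function apply_subst(string $subst, string $text): string
{
    $map = [];
    foreach (explode(',', $subst) as $pair) {
        [$from, $to] = explode(':', $pair, 2);
        $map[$from] = $to;
    }
    return strtr($text, $map);
}

// Lower-case an upper-case windows-1251 word and show the result as hex.
echo bin2hex(apply_subst(build_subst_string(), "\xD2\xC5\xD1\xD2")), "\n";
```

Bytes outside C0-DF, including ASCII text, pass through unchanged.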

For the second part, with the definitions given, the PhpDig exclude/include comments are used as shown below. Each comment must be on its own line:
PHP Code:

define('PHPDIG_EXCLUDE_COMMENT','<!-- *************************************');
define('PHPDIG_INCLUDE_COMMENT','************************************** -->'); 

Code:

<!-- *************************************
the content
to exclude
goes here
************************************** -->

If you look at the HTML source for result two, the PhpDig exclude/include comments do not surround the navigation menu. The comments need to be placed around the menu, and then the site reindexed.
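To make the "own line" requirement concrete, here is a minimal sketch of line-based exclusion. strip_excluded() is a hypothetical function written for illustration and is not PhpDig's implementation. It drops everything from a line equal to the exclude marker up to and including the line equal to the include marker.

```php
<?php
// Hypothetical illustration: remove the lines between an exclude marker
// line and the next include marker line, markers included. A marker only
// counts when it is the whole (trimmed) line.
function strip_excluded(string $html, string $exclude, string $include): string
{
    $kept = [];
    $skipping = false;
    foreach (explode("\n", $html) as $line) {
        $trimmed = trim($line);
        if ($trimmed === $exclude) {
            $skipping = true;
            continue;
        }
        if ($trimmed === $include) {
            $skipping = false;
            continue;
        }
        if (!$skipping) {
            $kept[] = $line;
        }
    }
    return implode("\n", $kept);
}

$page = "intro\n<!-- ***\nmenu item\n*** -->\nbody text";
echo strip_excluded($page, '<!-- ***', '*** -->'), "\n";
```

With this approach a marker that shares a line with other markup is never recognized, which is one way the navigation menu could slip through.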


All times are GMT -8.