|
09-21-2006, 03:17 PM | #1 |
Green Mole
Join Date: Sep 2006
Posts: 4
|
UTF-8 Question
If my HTML pages are UTF-8, what are the consequences of using PhpDig? It seems to work OK. Am I missing something? Also, is there a plan to support UTF-8 in an upcoming release? It seems to me that this is crucial, as UTF-8 is quite common now.
Thanks, any help would be appreciated. |
09-22-2006, 03:11 PM | #2 |
Purple Mole
Join Date: Aug 2004
Location: North Island New Zealand
Posts: 170
|
UTF-8 has the following properties:
- UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
- All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
- The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD, and it indicates how many bytes follow for this character.
- All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
- All possible 2^31 UCS codes can be encoded.
- UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit BMP characters are only up to three bytes long.
- The sorting order of big-endian UCS-4 byte strings is preserved.
- The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
==============================================
In addition to all that, UTF-8 was introduced to provide an ASCII backwards-compatible multi-byte encoding. The definitions of UTF-8 in UCS and Unicode originally differed slightly: in UCS, UTF-8 sequences up to 6 bytes long were possible, representing characters up to U-7FFFFFFF, while in Unicode only sequences up to 4 bytes long are defined, representing characters up to U-0010FFFF. (The difference was in essence the same as between UCS-4 and UTF-32.) |
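For anyone who finds the byte ranges above easier to follow with concrete values, here is a minimal Python sketch. The helper names `utf8_hex` and `classify` are made up for illustration, and the ranges in `classify` are simply the ones quoted in the post (current UTF-8, as defined in RFC 3629, is stricter and stops at 4-byte sequences).

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a string as space-separated hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


def classify(byte: int) -> str:
    """Classify one byte using the ranges quoted in the post above."""
    if byte <= 0x7F:
        return "ascii"          # plain 7-bit ASCII, encoded as itself
    if byte <= 0xBF:
        return "continuation"   # 0x80-0xBF: second or later byte of a sequence
    if byte <= 0xFD:
        return "lead"           # 0xC0-0xFD: first byte of a multibyte sequence
    return "unused"             # 0xFE and 0xFF never appear in UTF-8


# Hypothetical sample characters: ASCII, Latin accent, euro sign, non-BMP symbol.
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]

for ch in samples:
    encoded = ch.encode("utf-8")
    kinds = [classify(b) for b in encoded]
    print(f"U+{ord(ch):04X}: {utf8_hex(ch)} ({len(encoded)} bytes) {kinds}")

# Byte-wise sorting of the encoded strings matches code point order.
by_code_point = sorted(samples)
by_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))
```

The ASCII letter stays one byte, the accented letter takes two, the euro sign (a BMP character) takes three, and the character outside the BMP takes four.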
09-22-2006, 03:20 PM | #3 |
Green Mole
Join Date: Sep 2006
Posts: 4
|
I am sorry, but that info is a little over my head. Mainly I just wanted answers to my specific questions. Maybe I should revise them slightly. What are the consequences of using PhpDig with UTF-8 files? And is there a plan to support UTF-8 in an upcoming release?
|
09-22-2006, 06:03 PM | #4 |
Purple Mole
Join Date: Aug 2004
Location: North Island New Zealand
Posts: 170
|
The only problem you are likely to run into is that, in some search results, a few characters (accented letters, for example) may be displayed as odd letters.
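Odd characters of this kind are typical of UTF-8 bytes being read as a single-byte encoding. A minimal Python sketch of the effect, assuming the text is treated as ISO-8859-1 somewhere along the way (an assumption for illustration, not a confirmed detail of how PhpDig handles pages):

```python
# A word with one accented letter, written with an escape so the
# example does not depend on the encoding of this file.
text = "caf\u00e9"

# The page stores the word as UTF-8: the accented letter becomes two bytes.
utf8_bytes = text.encode("utf-8")

# Reading those bytes as ISO-8859-1 turns each byte into its own character,
# so the single accented letter shows up as two unrelated characters.
misread = utf8_bytes.decode("iso-8859-1")

# The damage is reversible as long as the bytes were not altered.
recovered = misread.encode("iso-8859-1").decode("utf-8")

print(len(text), len(misread))
print(recovered == text)
```

Plain ASCII words pass through unchanged, which would explain why most results look fine and only words with accents or other non-ASCII characters come out wrong.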
|
09-22-2006, 06:10 PM | #5 |
Green Mole
Join Date: Sep 2006
Posts: 4
|
Thank you so much for your help.
|