Thursday, October 18, 2007

4D v11 SQL Unicode Compatibility

These are notes from the "4D v11 SQL Unicode Compatibility" talk given by Keisuke Miyako of 4D Japan at 4D Summit 2007.

[Editorial note: Miyako's linguistic abilities were incredible. He switched effortlessly between Japanese, Chinese, Arabic, English and Swedish examples, and the depth of his knowledge of a very complicated subject was impressive.

The post has been updated to reflect Miyako's comments on the NUG on 10/23 and 10/24.]


Not everything in v11 is Unicode compatible. The database, variables, etc. are compatible. File paths (if the OS is not Unicode compatible) and the method and structure editors are not.

You can turn Unicode off using SET DATABASE PARAMETER, but what you're actually doing is turning a conversion mode on, and there's a performance hit.

Unicode is not necessarily a two-byte character system. Code points up to 127 (the ASCII range) take one byte in UTF-8, and some characters take four bytes.

There is one Basic Multilingual Plane (65k characters, 2 bytes each in UTF-16) and there are 16 supplementary planes (1M+ characters, 4 bytes each). The supplementary planes are used for historical characters, characters used by ethnic minorities, and so on (they're characters that are rarely used).
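To make the byte sizes concrete, here is a minimal sketch in Python (used only for illustration; it is not 4D code, and the helper name `encoded_sizes` is made up for this example). It reports how many bytes a single code point occupies in UTF-8 and in UTF-16.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for a one-code-point string."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


if __name__ == "__main__":
    samples = [
        ("ASCII letter a", "a"),
        ("BMP ideograph U+65E5", "\u65e5"),
        ("supplementary ideograph U+2000B", "\U0002000B"),
    ]
    for label, ch in samples:
        print(label, encoded_sizes(ch))
```

The ASCII letter takes one byte in UTF-8 and two in UTF-16, the BMP ideograph takes three and two, and the supplementary-plane ideograph takes four bytes in both.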

There are "noncharacters" reserved for internal software use, "combining characters" like diacritical marks, and "surrogate pairs" (two 16-bit values that together represent a single supplementary-plane character in UTF-16, so neither half can stand on its own).

In other words, there are character shapes and then modifiers. So diacritical marks are actually expressed after the basic character shape and modify the basic character. U+0061 (a) + U+0301 = á.
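The base-plus-modifier idea can be inspected with Python's standard `unicodedata` module. This is only an illustrative sketch (the helper name `describe` is hypothetical, and none of this is 4D code): it lists each code point in a string along with its Unicode name and its combining class, where a non-zero class marks a combining character.

```python
import unicodedata


def describe(text):
    """List (code point, Unicode name, combining class) for each code point."""
    return [
        (f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.combining(c))
        for c in text
    ]


if __name__ == "__main__":
    # The letter a followed by the combining acute accent: two code points.
    for entry in describe("a\u0301"):
        print(entry)
```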

Some characters, like the Cyrillic 'a' and the Latin 'a', look the same, but because they have different meanings and usages they get different code points. The exception is the CJK Unified Ideographs: Chinese, Japanese and Korean use similar characters, so they share code points. But simplified and traditional Chinese characters are distinct: they have similar meanings and pronunciations, but are written very differently. So in a way a difference in font style resulted in different code points, which violates one of the principles of Unicode but is practical.
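A quick way to see that look-alike letters are separate code points is to ask for their names. The following Python sketch is illustrative only (the helper `code_point_name` is made up for this example):

```python
import unicodedata


def code_point_name(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    latin_a = "a"          # U+0061
    cyrillic_a = "\u0430"  # looks the same in most fonts
    print(code_point_name(latin_a))
    print(code_point_name(cyrillic_a))
    print("equal:", latin_a == cyrillic_a)
```

The two strings compare as unequal even though they usually render identically.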

Korean has 38 "parts" that make up over 10,000 characters. Hence all of their characters involve combining characters.
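As an illustration of how a Korean syllable breaks down into its parts, the Python sketch below decomposes one syllable with the standard `unicodedata` module. It is not 4D code and the helper name `jamo_parts` is hypothetical. Note that Unicode also encodes the composed syllables as single code points, which is the form used as input here.

```python
import unicodedata


def jamo_parts(syllable):
    """Decompose a Hangul syllable and return the names of its parts."""
    decomposed = unicodedata.normalize("NFD", syllable)
    return [unicodedata.name(c) for c in decomposed]


if __name__ == "__main__":
    syllable = "\ud55c"  # a single precomposed Hangul syllable
    for name in jamo_parts(syllable):
        print(name)
    # Recomposing the parts gives back the original single code point.
    parts = unicodedata.normalize("NFD", syllable)
    print(unicodedata.normalize("NFC", parts) == syllable)
```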

Combining characters mean a single character can be made up of 4, 6, or more bytes.

Then there can be precomposed versions of characters that could also be rendered with combining characters. This means you have to do "normalization" where different renderings of the same character are treated as the same character.

The Length function in 4D v11 gives the number of code points, not the number of characters!
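The gap between code points and characters is easy to reproduce outside 4D. In this Python sketch (illustrative only; both helper names are hypothetical), `len` counts code points, while the second function gives a rough count of visible characters by skipping combining marks. That second count is a simplification: proper segmentation into user-perceived characters follows more detailed Unicode rules than this.

```python
import unicodedata


def count_code_points(text):
    """Python's len() counts code points, not user-perceived characters."""
    return len(text)


def count_base_characters(text):
    """Rough approximation: count code points that are not combining marks."""
    return sum(1 for c in text if unicodedata.combining(c) == 0)


if __name__ == "__main__":
    text = "a\u0301"  # displays as one accented letter
    print(count_code_points(text), count_base_characters(text))
```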

One way to do normalization is to decompose combined characters into their component parts. Another way does that and then composes them into combined characters (saves storage space). And a last way evaluates them based on "compatible" characters.
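In the Unicode standard these three approaches are called NFD (decompose), NFC (decompose, then compose) and the compatibility forms NFKD/NFKC. The Python sketch below shows them side by side using the standard `unicodedata` module; the helper name `forms` is made up, and this says nothing about how 4D implements normalization internally.

```python
import unicodedata


def forms(text):
    """Return the text under three Unicode normalization forms."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFD", "NFC", "NFKC")
    }


if __name__ == "__main__":
    precomposed = "\u00e1"  # a with acute as one code point
    ligature = "\ufb01"     # the single-code-point "fi" ligature
    print([len(v) for v in forms(precomposed).values()])
    print(forms(ligature)["NFKC"])
```

The precomposed letter becomes two code points under NFD and stays one under NFC, while only the compatibility form turns the ligature into the two plain letters "fi".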

If you compare characters in 4D it does it based on compatible characters. The upshot is that accented characters will still evaluate as equal to their non-accented cousins. So a = á.
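A comparison that ignores accents can be approximated by decomposing the text and dropping the combining marks before comparing. The Python sketch below does only that; it is an assumption-laden stand-in, not 4D's actual comparison algorithm, and the function names are hypothetical.

```python
import unicodedata


def strip_marks(text):
    """Decompose the text and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)


def loosely_equal(left, right):
    """Compare two strings while ignoring diacritical marks."""
    return strip_marks(left) == strip_marks(right)


if __name__ == "__main__":
    print(loosely_equal("a", "\u00e1"))  # plain a versus a with acute
    print(loosely_equal("a", "b"))
```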

4D can still do sorting (obviously), but it's now based on the language rules defined in Unicode. So for example v and w are not the same in English, but they are the same in Swedish.

4D 2004 supported 6,798 Japanese characters; v11 supports 11,233 characters. This opens more doors for 4D applications in settings that need all 11,233 characters (e.g. government forms/applications).

In mainland China 4D 2004 was ranked a 'C' for character support. v11 is ranked 'A+'. This also opens more doors for 4D applications.

Some of the text commands have been modified. One of the most important changes is the optional * added to Position(), which makes it behave the way it used to. (Try searching for char(0) in a string to see what I mean.) You'll only see the problem if you're searching for something which is not defined as a character in Unicode.

The pasteboard (aka clipboard) commands let you get the legacy version of the text or the Unicode version of the text.

CONVERT FROM TEXT and CONVERT TO TEXT are new commands; they're similar to TEXT TO BLOB. (I didn't quite get what was going on there.)

When using PROCESS HTML TAGS you want to use BLOBs, since text is treated as UTF-16, which may not be appropriate (say, if you want UTF-8).
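The general idea behind converting text to a BLOB with a named character set, and the reason UTF-16 versus UTF-8 matters, can be shown by analogy in Python. This is not the 4D API: `to_blob` and `from_blob` are hypothetical helper names wrapping Python's own `str.encode` and `bytes.decode`.

```python
def to_blob(text, charset):
    """Encode text into raw bytes under the named character set."""
    return text.encode(charset)


def from_blob(blob, charset):
    """Decode raw bytes back into text under the named character set."""
    return blob.decode(charset)


if __name__ == "__main__":
    text = "\u00e9"  # e with acute
    print(to_blob(text, "utf-8"))      # two bytes
    print(to_blob(text, "utf-16-le"))  # two different bytes
    # Decoding with the same character set restores the text.
    print(from_blob(to_blob(text, "utf-8"), "utf-8") == text)
```

The same text produces different bytes in each encoding, so whoever receives the bytes has to be told (or has to assume) the right one.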

If you have multiple languages in the same field you'll be OK if it's English plus one language - select the non-English language for the conversion and it won't affect the English portion. The English only uses lower ASCII, so it will convert correctly. The problem is if you have two languages that use upper ASCII.
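The reason the English portion survives while "upper ASCII" text depends on the chosen language can be seen with two legacy encodings. The Python sketch below is only an illustration (the helper name `decode_legacy` and the sample bytes are made up): bytes below 128 decode identically, while a byte above 127 decodes to different characters depending on the encoding assumed.

```python
def decode_legacy(raw, charset):
    """Decode legacy single-byte text under an assumed character set."""
    return raw.decode(charset)


if __name__ == "__main__":
    raw = b"caf\xe9"  # three ASCII bytes plus one byte above 127
    as_windows = decode_legacy(raw, "cp1252")
    as_mac = decode_legacy(raw, "mac_roman")
    # The ASCII part matches; the last character does not.
    print(as_windows[:3] == as_mac[:3], as_windows[3] == as_mac[3])
```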

If you have problem text, export it before conversion and then import to v11 and specify the format and you'll be fine.

An Alpha 20 field or variable can hold 20 code points, not 20 characters. 4D can handle combining characters but uses precomposed characters by default to avoid expanding the text and going beyond the length of the field during conversion.

If 4D needs to display a code point that's outside of your specified font, it will automatically change to a font that can display the code point.

Update from Miyako's 4D NUG posting:
The 4D language does not support Unicode beyond 0xFFFF but we can (and should) use regular expressions to match such characters.
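[Illustration added to these notes, not part of Miyako's posting: a minimal sketch of matching characters beyond 0xFFFF with a regular expression. It is written in Python rather than 4D, and the names `SUPPLEMENTARY`, `find_supplementary` and `utf16_units` are made up for the example. It also shows why such characters are awkward for a 16-bit language: each one needs two UTF-16 units.]

```python
import re

# Character class covering every code point above the Basic Multilingual Plane.
SUPPLEMENTARY = re.compile("[\U00010000-\U0010FFFF]")


def find_supplementary(text):
    """Return all characters in text whose code point is above 0xFFFF."""
    return SUPPLEMENTARY.findall(text)


def utf16_units(text):
    """Count the 16-bit units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    text = "a\U0002000Bb"
    matches = find_supplementary(text)
    print(len(matches), [utf16_units(m) for m in matches])
```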

4D v11 SQL on Mac OS X is apparently Unicode-safe (I think that's the formal description) in that at least the data is not lost though the application may not be fully aware of what's going on.

Mind you, the debugger is not Unicode compatible so the preview pane does not visualize the real values.

The last time I tested on Windows Vista, Surrogate characters were Unicode-safe subject to font settings, such as the newly introduced MEIRYO being used.
