Wednesday, March 16, 2005

Character (Java 2 Platform SE 5.0)

Interesting documentation. The newest version of Java has had to rethink what a "character" means.

The original Java dream of a "char" type that could represent every possible character in a single unit, strings whose length in char primitives is always equal to their length in notional Unicode characters, and other such good stuff, was naive and doomed to failure.
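A minimal sketch of the mismatch, using only java.lang methods added in Java 5. The class and helper names are made up for illustration. It builds a string holding one supplementary character (U+1D11E, the musical G clef) and compares its length in char units with its length in code points:

```java
// Hypothetical demo class; only the java.lang calls are real API (Java 5+).
class CodePointDemo {
    // Length measured in 16-bit char units, which is what String.length() reports.
    static int charUnits(String s) {
        return s.length();
    }

    // Length measured in Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF does not fit in a single 16-bit char.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(charUnits(clef));   // prints 2 (a surrogate pair)
        System.out.println(codePoints(clef));  // prints 1
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // prints true
    }
}
```

For plain ASCII or any other text inside the 16-bit range, the two counts still agree, which is why the old assumption looks fine right up until it doesn't.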

Java's char is now only a slightly better abstraction for the idea of "character" than the primordial C/C++ char. That type is so encumbered with the baggage of being defined as a "byte", the unit in which the sizeof everything else is measured, that it is almost totally unrelated to human-readable text in any encoding other than ASCII. The difference between C's situation and Java's is now one of degree rather than of kind.

The issues with ASCII-vs-Unicode, "wchar", etc., that have plagued every other programming language have caught up with Java in version 5; it's just that Java got as far as UTF-16 before hitting them, while others have been dealing with UTF-8 for years.

You'd think they could have seen, ten years ago, that the number of code points would eventually (soon!) grow past the point where language/processor designers would be willing to follow with a primitive type. A new language being defined today could avoid the problem by always using 32-bit chars, the same way Java tried to with its 16-bit char. But even to a Java person, that seems wasteful. And what happens if Unicode grows to need 64 bits?
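For what it's worth, Java 5 does offer a 32-bit view of text on request: the new methods hand code points back as plain int values. A rough sketch of walking a string that way, with a made-up class and method name, assuming you are willing to pay for the wider representation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; codePointAt and charCount are real Java 5 API.
class CodePointWalk {
    // Expands a UTF-16 string into one int per Unicode code point.
    static List<Integer> toCodePoints(String s) {
        List<Integer> out = new ArrayList<Integer>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // reads one or two chars
            out.add(cp);
            i += Character.charCount(cp);   // advance by 1 or 2 char units
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1D11E));
        // Three chars in the string, but only two code points come out.
        System.out.println(s.length() + " chars, " + toCodePoints(s).size() + " code points");
    }
}
```

So the 32-bit option exists, but as an int-based API bolted on beside char rather than as the character type itself.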
