Everything2

UTF-16

created by tres equis

(idea) by spitzak (3.6 y) Sat Dec 01 2001 at 0:11:29

Unicode encoded as two bytes per character. The obvious way to do this is to put the bottom 16 bits into the two bytes (high byte first so sorting order is preserved), and this is called UCS-2. When people realized that (due to Chinese, mostly) more than 65,536 characters were needed, they came up with this bastard encoding, rather than using UTF-8, which is a sensible encoding. Microsoft uses this encoding in their stuff, sigh.

UTF-16 can encode Unicode up to 0x10ffff. All codes up to 0xffff that are not in the range 0xd800-0xdfff are encoded as two bytes: high byte first, low byte second (in the big-endian form).

The "characters" 0xd800-0xdfff are called "surrogate characters" and must appear in pairs. These are combined in a complex way to produce the characters in the range 0x10000 through 0x10ffff. They also defeat the only plausible advantage of UTF-16, which is that all the characters are the same size!
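To make the "complex way" concrete, here is a minimal Python sketch of the encoding direction: one code point in, one or two 16-bit units out, then serialized high byte first. The function names `utf16_code_units` and `utf16_be_bytes` are made up for this illustration and are not part of any library.

```python
def utf16_code_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("a lone surrogate cannot be encoded")
    if code_point < 0x10000:
        # Everything up to 0xffff is stored directly as a single unit.
        return [code_point]
    # Above 0xffff: subtract 0x10000, then split the 20-bit result
    # into two 10-bit halves carried by a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)
    low = 0xDC00 | (offset & 0x3FF)
    return [high, low]


def utf16_be_bytes(code_point):
    """Serialize the code units high byte first (big-endian)."""
    out = bytearray()
    for unit in utf16_code_units(code_point):
        out += unit.to_bytes(2, "big")
    return bytes(out)
```

For example, `utf16_code_units(0x41)` yields a single unit, while `utf16_code_units(0x1F600)` yields the pair 0xD83D, 0xDE00, which is exactly the variable width complained about above.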

Don't use this, it is just proof that the standards people have their heads up their asses. Use UTF-8 instead.


(thing) by tongpoo (1 mon) Thu Mar 27 2003 at 23:07:29

This Unicode Transformation Format serializes each Unicode value as two bytes or, for values above U+FFFF, four bytes (a surrogate pair). UTF-16 data can be in either little-endian or big-endian format. An initial byte sequence called the byte order mark (BOM) may be placed at the start of the data to signal which byte order is in use. The BOM is U+FEFF ZERO WIDTH NO-BREAK SPACE (so it displays as nothing), and its byte sequence depends on the byte order: FE FF in big-endian UTF-16 and FF FE in little-endian UTF-16. To prevent ambiguity, U+FFFE is never assigned to a character, so a byte-swapped BOM cannot be mistaken for real text.
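As a rough illustration of how a reader might use the BOM, the sketch below looks at the first two bytes of a buffer: FE FF means big-endian and FF FE means little-endian. The function name `detect_utf16_byte_order` is hypothetical, and the sketch assumes the data is already known to be UTF-16 (it makes no attempt to tell UTF-16 apart from other encodings).

```python
import codecs


def detect_utf16_byte_order(data):
    """Return "big", "little", or None when no UTF-16 BOM is present.

    Assumes `data` is UTF-16; other encodings are not considered.
    """
    if data[:2] == codecs.BOM_UTF16_BE:   # bytes FE FF
        return "big"
    if data[:2] == codecs.BOM_UTF16_LE:   # bytes FF FE
        return "little"
    return None


def decode_utf16(data):
    """Decode UTF-16 bytes, skipping the BOM when there is one."""
    order = detect_utf16_byte_order(data)
    if order == "little":
        return data[2:].decode("utf-16-le")
    if order == "big":
        return data[2:].decode("utf-16-be")
    # No BOM: this sketch falls back to big-endian.
    return data.decode("utf-16-be")
```

Falling back to big-endian when the BOM is missing is a choice made for this sketch; real-world data without a BOM is often little-endian, so a production reader may need out-of-band information.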

The Unicode codespace is allocated into several areas, one being the Surrogate Area, which consists of 1,024 high surrogates (U+D800 - U+DBFF) and 1,024 low surrogates (U+DC00 - U+DFFF). A high surrogate, followed by a low surrogate, forms a surrogate pair that represents a single Unicode scalar value. About one million surrogate pairs are possible (1,024 x 1,024 = 1,048,576), and their values can be derived from this formula:
65536 + ((highSurrogate & 1023) << 10) + (lowSurrogate & 1023)
In plain English, it takes the last ten binary digits from each surrogate, concatenates them, and adds 2^16 (65,536) to that number. As of Version 3.0, none of the surrogate pairs have been assigned.
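The formula above can be tried out directly. The following is a small Python sketch of it, with range checks added; the function name `combine_surrogates` is invented for this example.

```python
def combine_surrogates(high_surrogate, low_surrogate):
    """Turn a surrogate pair into the scalar value it represents."""
    if not 0xD800 <= high_surrogate <= 0xDBFF:
        raise ValueError("first unit is not a high surrogate")
    if not 0xDC00 <= low_surrogate <= 0xDFFF:
        raise ValueError("second unit is not a low surrogate")
    # Keep the low ten bits of each surrogate, join them into a
    # 20-bit number, and add 65536 (2^16).
    return 65536 + ((high_surrogate & 1023) << 10) + (low_surrogate & 1023)
```

The lowest pair, D800 DC00, gives U+10000 and the highest pair, DBFF DFFF, gives U+10FFFF, which matches the range that surrogate pairs are meant to cover.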

UTF-16 on average can save about a byte per character over UTF-8 when encoding East Asian text.
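A quick way to check the size claim is to encode the same text both ways and count bytes. This Python sketch uses a made-up helper, `encoded_sizes`, and an arbitrary Japanese sample string chosen only for illustration; results for other text will differ, especially when ASCII is mixed in.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, no BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


# Eight characters, all above U+07FF and below U+10000, so each one
# takes three bytes in UTF-8 and two bytes in UTF-16.
sample = "日本語のテキスト"
utf8_size, utf16_size = encoded_sizes(sample)
print(utf8_size, utf16_size)   # 24 16
```

The saving reverses for ASCII-heavy text: `encoded_sizes("hello")` gives 5 bytes in UTF-8 against 10 in UTF-16.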

Sources (PDF and PowerPoint files):
  • "The Unicode Standard, Version 3.0" Section 2.3, Encoding Forms.
    http://www.Unicode.org/book/ch02.pdf

  • "The Unicode Standard, Version 3.0" Section 3.7, Surrogates.
    http://www.Unicode.org/book/ch03.pdf

  • "The Unicode Standard, Version 3.0" Section 5.4, Handling Surrogate Pairs.
    http://www.Unicode.org/book/ch05.pdf

  • "Surrogate Support in Microsoft Products."
    http://www.Unicode.org/iuc/iuc18/papers/a8.ppt
