A Byte Order Mark (BOM) is a signature at the beginning of a Unicode data stream that may be used by a higher protocol. The signature can indicate whether a data stream is Unicode encoded and, if so, which Unicode Transformation Format (UTF) and byte order are used.
The BOM is U+FEFF ZERO WIDTH NO-BREAK SPACE (ZWNBSP), which is represented by a different byte sequence in each UTF:
Byte Sequence   Encoding Form
00 00 FE FF     UTF-32, big-endian
FF FE 00 00     UTF-32, little-endian
FE FF           UTF-16, big-endian
FF FE           UTF-16, little-endian
EF BB BF        UTF-8
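The table can be turned into a simple signature check. The following is a minimal sketch, not a definitive detector: the names `BOM_TABLE` and `detect_bom` are hypothetical, and the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Longest signatures first: the UTF-32 little-endian BOM (FF FE 00 00)
# begins with the UTF-16 little-endian BOM (FF FE), so it must be tested first.
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "UTF-32, big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32, little-endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16, big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16, little-endian"),
]


def detect_bom(data: bytes):
    """Return (encoding form, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Note that FF FE 00 00 is inherently ambiguous: it could also be a UTF-16 little-endian BOM followed by U+0000. This sketch assumes the UTF-32 reading, which is why the order of the entries matters.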
If an application does not expect a BOM, it may misinterpret the BOM in various ways. Below are some examples:
Value    Your Browser   Description
U+BBEF   믯             a Hangul syllable
U+EFBB                  a private-use character
U+FEFF                  zero width no-break space
U+FFFE                  a noncharacter (never assigned)
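These values follow directly from reading the first two BOM bytes as a single 16-bit code unit in the wrong encoding or byte order. A small sketch of that arithmetic (the helper name `as_code_unit` is an assumption made for illustration):

```python
bom_utf8 = bytes([0xEF, 0xBB, 0xBF])
bom_utf16_be = bytes([0xFE, 0xFF])


def as_code_unit(two_bytes: bytes, byteorder: str) -> str:
    """Interpret two bytes as one 16-bit code unit in the given byte order."""
    value = int.from_bytes(two_bytes, byteorder)
    return f"U+{value:04X}"


misreadings = {
    # First two bytes of the UTF-8 BOM read as UTF-16
    "UTF-8 BOM as UTF-16 little-endian": as_code_unit(bom_utf8[:2], "little"),
    "UTF-8 BOM as UTF-16 big-endian": as_code_unit(bom_utf8[:2], "big"),
    # UTF-16 big-endian BOM kept as a character, or read with swapped bytes
    "UTF-16 BE BOM as UTF-16 big-endian": as_code_unit(bom_utf16_be, "big"),
    "UTF-16 BE BOM as UTF-16 little-endian": as_code_unit(bom_utf16_be, "little"),
}

for description, code_point in misreadings.items():
    print(f"{description}: {code_point}")
```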
To encode a ZWNBSP as the first character in a data stream that also uses a BOM, simply start the stream with U+FEFF U+FEFF. The first U+FEFF is consumed as the BOM and the second is kept as a character.
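As an illustration, assuming a BOM-aware decoder such as Python's `utf-16` codec, a stream that starts with two U+FEFF code units decodes to text whose first character is still U+FEFF. The variable names are illustrative only.

```python
import codecs

ZWNBSP = "\ufeff"
text = ZWNBSP + "hi"

# Explicit BOM (FE FF), followed by the text encoded without any extra BOM.
# The resulting stream begins FE FF FE FF, that is, U+FEFF U+FEFF.
stream = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The "utf-16" codec uses the first U+FEFF to pick the byte order and
# drops it; the second U+FEFF survives as an ordinary character.
decoded = stream.decode("utf-16")
```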
While most (if not all) modern Microsoft applications use a BOM, not all software does. For example, the Java 1.4 SE API treats ZWNBSP like an ordinary character: its Reader classes do not attempt to determine an InputStream's encoding by looking for a BOM signature.
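The same distinction can be seen in other environments. The sketch below uses Python rather than Java, so it does not demonstrate the Java API itself: Python's plain `utf-8` codec does not look for a signature and keeps the BOM as a leading character, while the `utf-8-sig` codec recognizes and removes it.

```python
# A UTF-8 stream that starts with the BOM signature EF BB BF
data = b"\xef\xbb\xbf" + "hi".encode("utf-8")

# Decoder that ignores signatures: U+FEFF stays in the text
naive = data.decode("utf-8")

# Signature-aware decoder: the leading BOM is stripped
aware = data.decode("utf-8-sig")

print(repr(naive), repr(aware))
```

Software that reads BOM-prefixed input with a signature-unaware decoder therefore has to skip a leading U+FEFF itself.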
Reference(s):
FAQ - UTF and BOM
http://www.unicode.org/unicode/faq/utf_bom.html