.TH unicode 3erl "stdlib 3.14" "Ericsson AB" "Erlang Module Definition"
.SH NAME
unicode \- Functions for converting Unicode characters.
.SH DESCRIPTION
.LP
This module contains functions for converting between different character representations\&. It converts between ISO Latin-1 characters and Unicode characters, but it can also convert between different Unicode encodings (like UTF-8, UTF-16, and UTF-32)\&.
.LP
The default Unicode encoding in Erlang is in binaries UTF-8, which is also the format in which built-in functions and libraries in OTP expect to find binary Unicode data\&. In lists, Unicode data is encoded as integers, each integer representing one character and encoded simply as the Unicode code point for the character\&.
.LP
Other Unicode encodings than integers representing code points or UTF-8 in binaries are referred to as "external encodings"\&. The ISO Latin-1 encoding is in binaries and lists referred to as latin1-encoding\&.
.LP
It is recommended to only use external encodings for communication with external entities where this is required\&. When working inside the Erlang/OTP environment, it is recommended to keep binaries in UTF-8 when representing Unicode characters\&. ISO Latin-1 encoding is supported both for backward compatibility and for communication with external entities not supporting Unicode character sets\&.
.LP
Programs should always operate on a normalized form and compare canonical-equivalent Unicode characters as equal\&. All characters should thus be normalized to one form once on the system borders\&. One of the following functions can convert characters to their normalized forms \fIcharacters_to_nfc_list/1\fR\&, \fIcharacters_to_nfc_binary/1\fR\&, \fIcharacters_to_nfd_list/1\fR\& or \fIcharacters_to_nfd_binary/1\fR\&\&. For general text \fIcharacters_to_nfc_list/1\fR\& or \fIcharacters_to_nfc_binary/1\fR\& is preferred, and for identifiers one of the compatibility normalization functions, such as \fIcharacters_to_nfkc_list/1\fR\&, is preferred for security reasons\&. The normalization functions where introduced in OTP 20\&. Additional information on normalization can be found in the Unicode FAQ\&.
.SH DATA TYPES
.nf

\fBencoding()\fR\& = 
.br
    latin1 | unicode | utf8 | utf16 |
.br
    {utf16, endian()} |
.br
    utf32 |
.br
    {utf32, endian()}
.br
.fi
.nf

\fBendian()\fR\& = big | little
.br
.fi
.nf

\fBunicode_binary()\fR\& = binary()
.br
.fi
.RS
.LP
A \fIbinary()\fR\& with characters encoded in the UTF-8 coding standard\&.
.RE

.nf

\fBchardata()\fR\& = charlist() | unicode_binary()
.br
.fi
.nf

\fBcharlist()\fR\& = 
.br
    maybe_improper_list(char() | unicode_binary() | charlist(),
.br
                        unicode_binary() | [])
.br
.fi
.nf

\fBexternal_unicode_binary()\fR\& = binary()
.br
.fi
.RS
.LP
A \fIbinary()\fR\& with characters coded in a user-specified Unicode encoding other than UTF-8 (that is, UTF-16 or UTF-32)\&.
.RE

.nf

\fBexternal_chardata()\fR\& = 
.br
    external_charlist() | external_unicode_binary()
.br
.fi
.nf

\fBexternal_charlist()\fR\& = 
.br
    maybe_improper_list(char() |
.br
                        external_unicode_binary() |
.br
                        external_charlist(),
.br
                        external_unicode_binary() | [])
.br
.fi
.nf

\fBlatin1_binary()\fR\& = binary()
.br
.fi
.RS
.LP
A \fIbinary()\fR\& with characters coded in ISO Latin-1\&.
.RE

.nf

\fBlatin1_char()\fR\& = byte()
.br
.fi
.RS
.LP
An \fIinteger()\fR\& representing a valid ISO Latin-1 character (0-255)\&.
.RE

.nf

\fBlatin1_chardata()\fR\& = latin1_charlist() | latin1_binary()
.br
.fi
.RS
.LP
Same as \fIiodata()\fR\&\&.
.RE

.nf

\fBlatin1_charlist()\fR\& = 
.br
    maybe_improper_list(latin1_char() |
.br
                        latin1_binary() |
.br
                        latin1_charlist(),
.br
                        latin1_binary() | [])
.br
.fi
.RS
.LP
Same as \fIiolist()\fR\&\&.
.RE

.SH EXPORTS
.LP
.nf

.B
bom_to_encoding(Bin) -> {Encoding, Length}
.br
.fi
.br
.RS
.LP
Types:

.RS 3
Bin = binary()
.br
.RS 2
 A \fIbinary()\fR\& such that \fIbyte_size(Bin) >= 4\fR\&\&. 
.RE
Encoding = 
.br
    latin1 | utf8 | {utf16, endian()} | {utf32, endian()}
.br
Length = integer() >= 0
.br
.nf
\fBendian()\fR\& = big | little
.fi
.br
.RE
.RE
.RS
.LP
Checks for a UTF Byte Order Mark (BOM) in the beginning of a binary\&. If the supplied binary \fIBin\fR\& begins with a valid BOM for either UTF-8, UTF-16, or UTF-32, the function returns the encoding identified along with the BOM length in bytes\&.
.LP
If no BOM is found, the function returns \fI{latin1,0}\fR\&\&.
.RE

.LP
.nf

.B
characters_to_binary(Data) -> Result
.br
.fi
.br
.RS
.LP
Types:

.RS 3
Data = latin1_chardata() | chardata() | external_chardata()
.br
Result = 
.br
    binary() |
.br
    {error, binary(), RestData} |
.br
    {incomplete, binary(), binary()}
.br
RestData = latin1_chardata() | chardata() | external_chardata()
.br
.RE
.RE
.RS
.LP
Same as \fIcharacters_to_binary(Data, unicode, unicode)\fR\&\&.
.RE

.LP
.nf

.B
characters_to_binary(Data, InEncoding) -> Result
.br
.fi
.br
.RS
.LP
Types:

.RS 3
Data = latin1_chardata() | chardata() | external_chardata()
.br
InEncoding = encoding()
.br
Result = 
.br
    binary() |
.br
    {error, binary(), RestData} |
.br
    {incomplete, binary(), binary()}
.br
RestData = latin1_chardata() | chardata() | external_chardata()
.br
.RE
.RE
.RS
.LP
Same as \fIcharacters_to_binary(Data, InEncoding, unicode)\fR\&\&.
.RE

.LP
.nf

.B
characters_to_binary(Data, InEncoding, OutEncoding) -> Result
.br
.fi
.br
.RS
.LP
Types:

.RS 3
Data = latin1_chardata() | chardata() | external_chardata()
.br
InEncoding = OutEncoding = encoding()
.br
Result = 
.br
    binary() |
.br
    {error, binary(), RestData} |
.br
    {incomplete, binary(), binary()}
.br
RestData = latin1_chardata() | chardata() | external_chardata()
.br
.RE
.RE
.RS
.LP
Behaves as \fIcharacters_to_list/2\fR\&, but produces a binary instead of a Unicode list\&.
.LP
\fIInEncoding\fR\& defines how input is to be interpreted if binaries are present in \fIData\fR\&
.LP
\fIOutEncoding\fR\& defines in what format output is to be generated\&.
.LP
Options:
.RS 2
.TP 2
.B
\fIunicode\fR\&:
An alias for \fIutf8\fR\&, as this is the preferred encoding for Unicode characters in binaries\&.
.TP 2
.B
\fIutf16\fR\&:
An alias for \fI{utf16,big}\fR\&\&.
.TP 2
.B
\fIutf32\fR\&:
An alias for \fI{utf32,big}\fR\&\&.
.RE
.LP
The atoms \fIbig\fR\& and \fIlittle\fR\& denote big- or little-endian encoding\&.
.LP
Errors and exceptions occur as in \fIcharacters_to_list/2\fR\&, but the second element in tuple \fIerror\fR\& or \fIincomplete\fR\& is a \fIbinary()\fR\& and not a \fIlist()\fR\&\&.
.RE

.LP
.nf

.B
characters_to_list(Data) -> Result
.br
.fi
.br
.RS
.LP
Types:

.RS 3
Data = latin1_chardata() | chardata() | external_chardata()
.br
Result = 
.br
    list() |
.br
    {error, list(), RestData} |
.br
    {incomplete, list(), binary()}
.br
RestData = latin1_chardata() | chardata() | external_chardata()
.br
.RE
.RE
.RS
.LP
Same as \fIcharacters_to_list(Data, unicode)\fR\&\&.
.RE

.LP
.nf

.B
characters_to_list(Data, InEncoding) -> Result
.br
.fi
.br
.RS
.LP
Types:

.RS 3
Data = latin1_chardata() | chardata() | external_chardata()
.br
InEncoding = encoding()
.br
Result = 
.br
    list() |
.br
    {error, list(), RestData} |
.br
    {incomplete, list(), binary()}
.br
RestData = latin1_chardata() | chardata() | external_chardata()
.br
.RE
.RE
.RS
.LP
Converts a possibly deep list of integers and binaries into a list of integers representing Unicode characters\&. The binaries in the input can have characters encoded as one of the following:
.RS 2
.TP 2
*
ISO Latin-1 (0-255, one character per byte)\&. Here, case parameter \fIInEncoding\fR\& is to be specified as \fIlatin1\fR\&\&.
.LP
.TP 2
*
One of the UTF-encodings, which is specified as parameter \fIInEncoding\fR\&\&.
.LP
.RE

.LP
Note that integers in the list always represent code points regardless of \fIInEncoding\fR\& passed\&. If \fIInEncoding latin1\fR\& is passed, only code points < 256 are allowed; otherwise, all valid unicode code points are allowed\&.
.LP
If \fIInEncoding\fR\& is \fIlatin1\fR\&, parameter \fIData\fR\& corresponds to the \fIiodata()\fR\& type, but for \fIunicode\fR\&, parameter \fIData\fR\& can contain integers > 255 (Unicode characters beyond the ISO Latin-1 range), which makes it invalid as \fIiodata()\fR\&\&.
.LP
The purpose of the function is mainly to convert combinations of Unicode characters into a pure Unicode string in list representation for further processing\&. For writing the data to an external entity, the reverse function \fIcharacters_to_binary/3\fR\& comes in handy\&.
.LP
Option \fIunicode\fR\& is an alias for \fIutf8\fR\&, as this is the preferred encoding for Unicode characters in binaries\&. \fIutf16\fR\& is an alias for \fI{utf16,big}\fR\& and \fIutf32\fR\& is an alias for \fI{utf32,big}\fR\&\&. The atoms \fIbig\fR\& and \fIlittle\fR\& denote big- or little-endian encoding\&.
.LP
If the data cannot be converted, either because of illegal Unicode/ISO Latin-1 characters in the list, or because of invalid UTF encoding in any binaries, an error tuple is returned\&. The error tuple contains the tag \fIerror\fR\&, a list representing the characters that could be converted before the error occurred and a representation of the characters including and after the offending integer/bytes\&. The last part is mostly for debugging, as it still constitutes a possibly deep or mixed list, or both, not necessarily of the same depth as the original data\&. The error occurs when traversing the list and whatever is left to decode is returned "as is"\&.
.LP
However, if the input \fIData\fR\& is a pure binary, the third part of the error tuple is guaranteed to be a binary as well\&.
.LP
Errors occur for the following reasons:
.RS 2
.TP 2
*
Integers out of range\&.
.RS 2
.LP
If \fIInEncoding\fR\& is \fIlatin1\fR\&, an error occurs whenever an integer > 255 is found in the lists\&.
.RE

.RS 2
.LP
If \fIInEncoding\fR\& is of a Unicode type, an error occurs whenever either of the following is found:
.RE

.RS 2
.TP 2
*
An integer > 16#10FFFF (the maximum Unicode character)
.LP
.TP 2
*
An integer in the range 16#D800 to 16#DFFF (invalid range reserved for UTF-16 surrogate pairs)
.LP
.RE

.LP
.TP 2
*
Incorrect UTF encoding\&.
.RS 2
.LP
If \fIInEncoding\fR\& is one of the UTF types, the bytes in any binaries must be valid in that encoding\&.
.RE

.RS 2
.LP
Errors can occur for various reasons, including the following:
.RE

.RS 2
.TP 2
*
"Pure" decoding errors (like the upper bits of the bytes being wrong)\&.
.LP
.TP 2
*
The bytes are decoded to a too large number\&.
.LP
.TP 2
*
The bytes are decoded to a code point in the invalid Unicode range\&.
.LP
.TP 2
*
Encoding is "overlong", meaning that a number should have been encoded in fewer bytes\&.
.LP
.RE

.RS 2
.LP
The case of a truncated UTF is handled specially, see the paragraph about incomplete binaries below\&.
.RE

.RS 2
.LP
If \fIInEncoding\fR\& is \fIlatin1\fR\&, binaries are always valid as long as they contain whole bytes, as each byte falls into the valid ISO Latin-1 range\&.
.RE

.LP
.RE

.LP
A special type of error is when no actual invalid integers or bytes are found, but a trailing \fIbinary()\fR\& consists of too few bytes to decode the last character\&. This error can occur if bytes are read from a file in chunks or if binaries in other ways are split on non-UTF character boundaries\&. An \fIincomplete\fR\& tuple is then returned instead of the \fIerror\fR\& tuple\&. It consists of the same parts as the \fIerror\fR\& tuple, but the tag is \fIincomplete\fR\& instead of \fIerror\fR\& and the last element is always guaranteed to be a binary consisting of the first part of a (so far) valid UTF character\&.
.LP
If one UTF character is split over two consecutive binaries in the \fIData\fR\&, the conversion succeeds\&. This means that a character can be decoded from a range of binaries as long as the whole range is specified as input without errors occurring\&.
.LP
\fIExample:\fR\&
.LP
.nf

decode_data(Data) ->
   case unicode:characters_to_list(Data,unicode) of
      {incomplete,Encoded, Rest} ->
            More = get_some_more_data(),
            Encoded ++ decode_data([Rest, More]);
      {error,Encoded,Rest} ->
            handle_error(Encoded,Rest);
      List ->
            List
   end.
.fi
.LP
However, bit strings that are not whole bytes are not allowed, so a UTF character must be split along 8-bit boundaries to ever be decoded\&.
.LP
A \fIbadarg\fR\& exception is thrown for the following cases:
.RS 2
.TP 2
*
Any parameters are of the wrong type\&.
.LP
.TP 2
*
The list structure is invalid (a number as tail)\&.
.LP
.TP 2
*
The binaries do not contain whole bytes (bit strings)\&.
.LP
.RE

.RE

.LP
.nf

.B
characters_to_nfc_list(CD :: chardata()) ->
.B
                          [char()] | {error, [char()], chardata()}
.br
.fi
.br
.RS
.LP
Converts a possibly deep list of characters and binaries into a Normalized Form of canonical equivalent Composed characters according to the Unicode standard\&.
.LP
Any binaries in the input must be encoded with utf8 encoding\&.
.LP
The result is a list of characters\&.
.LP
.nf

3> unicode:characters_to_nfc_list([<<"abc..a">>,[778],$a,[776],$o,[776]]).
"abc..åäö"

.fi
.RE

.LP
.nf

.B
characters_to_nfc_binary(CD :: chardata()) ->
.B
                            unicode_binary() |
.B
                            {error, unicode_binary(), chardata()}
.br
.fi
.br
.RS
.LP
Converts a possibly deep list of characters and binaries into a Normalized Form of canonical equivalent Composed characters according to the Unicode standard\&.
.LP
Any binaries in the input must be encoded with utf8 encoding\&.
.LP
The result is an utf8 encoded binary\&.
.LP
.nf

4> unicode:characters_to_nfc_binary([<<"abc..a">>,[778],$a,[776],$o,[776]]).
<<"abc..åäö"/utf8>>

.fi
.RE

.LP
.nf

.B
characters_to_nfd_list(CD :: chardata()) ->
.B
                          [char()] | {error, [char()], chardata()}
.br
.fi
.br
.RS
.LP
Converts a possibly deep list of characters and binaries into a Normalized Form of canonical equivalent Decomposed characters according to the Unicode standard\&.
.LP
Any binaries in the input must be encoded with utf8 encoding\&.
.LP
The result is a list of characters\&.
.LP
.nf

1> unicode:characters_to_nfd_list("abc..åäö").
[97,98,99,46,46,97,778,97,776,111,776]

.fi
.RE

.LP
.nf

.B
characters_to_nfd_binary(CD :: chardata()) ->
.B
                            unicode_binary() |
.B
                            {error, unicode_binary(), chardata()}
.br
.fi
.br
.RS
.LP
Converts a possibly deep list of characters and binaries into a Normalized Form of canonical equivalent Decomposed characters according to the Unicode standard\&.
.LP
Any binaries in the input must be encoded with utf8 encoding\&.
.LP
The result is an utf8 encoded binary\&.
.LP
.nf

2> unicode:characters_to_nfd_binary("abc..åäö").
<<97,98,99,46,46,97,204,138,97,204,136,111,204,136>>

.fi
.RE

.LP
.nf

.B
characters_to_nfkc_list(CD :: chardata()) ->
.B
                           [char()] |
.B
                           {error, [char()], chardata()}
.br
.fi
.br
.RS
.LP
Converts a possibly deep list of characters and binaries into a Normalized Form of compatibly equivalent Composed characters according to the Unicode standard\&.
.LP
Any binaries in the input must be encoded with utf8 encoding\&.
.LP
The result is a list of characters\&.
.LP
.nf

3> unicode:characters_to_nfkc_list([<<"abc..a">>,[778],$a,[776],$o,[776],[65299,65298]]).
"abc..åäö32"

.fi
.RE

.LP
.nf

.B
characters_to_nfkc_binary(CD :: chardata()) ->
.B
                             unicode_binary() |
.B
                             {error, unicode_binary(), chardata()}
.br
.fi
.br
.RS
.LP
Converts a possibly deep list of characters and binaries into a Normalized Form of compatibly equivalent Composed characters according to the Unicode standard\&.
.LP
Any binaries in the input must be encoded with utf8 encoding\&.
.LP
The result is an utf8 encoded binary\&.
.LP
.nf

4> unicode:characters_to_nfkc_binary([<<"abc..a">>,[778],$a,[776],$o,[776],[65299,65298]]).
<<"abc..åäö32"/utf8>>

.fi
.RE

.LP
.nf

.B
characters_to_nfkd_list(CD :: chardata()) ->
.B
                           [char()] |
.B
                           {error, [char()], chardata()}
.br
.fi
.br
.RS
.LP
Converts a possibly deep list of characters and binaries into a Normalized Form of compatibly equivalent Decomposed characters according to the Unicode standard\&.
.LP
Any binaries in the input must be encoded with utf8 encoding\&.
.LP
The result is a list of characters\&.
.LP
.nf

1> unicode:characters_to_nfkd_list(["abc..åäö",[65299,65298]]).
[97,98,99,46,46,97,778,97,776,111,776,51,50]

.fi
.RE

.LP
.nf

.B
characters_to_nfkd_binary(CD :: chardata()) ->
.B
                             unicode_binary() |
.B
                             {error, unicode_binary(), chardata()}
.br
.fi
.br
.RS
.LP
Converts a possibly deep list of characters and binaries into a Normalized Form of compatibly equivalent Decomposed characters according to the Unicode standard\&.
.LP
Any binaries in the input must be encoded with utf8 encoding\&.
.LP
The result is an utf8 encoded binary\&.
.LP
.nf

2> unicode:characters_to_nfkd_binary(["abc..åäö",[65299,65298]]).
<<97,98,99,46,46,97,204,138,97,204,136,111,204,136,51,50>>

.fi
.RE

.LP
.nf

.B
encoding_to_bom(InEncoding) -> Bin
.br
.fi
.br
.RS
.LP
Types:

.RS 3
Bin = binary()
.br
.RS 2
 A \fIbinary()\fR\& such that \fIbyte_size(Bin) >= 4\fR\&\&. 
.RE
InEncoding = encoding()
.br
.RE
.RE
.RS
.LP
Creates a UTF Byte Order Mark (BOM) as a binary from the supplied \fIInEncoding\fR\&\&. The BOM is, if supported at all, expected to be placed first in UTF encoded files or messages\&.
.LP
The function returns \fI<<>>\fR\& for \fIlatin1\fR\& encoding, as there is no BOM for ISO Latin-1\&.
.LP
Notice that the BOM for UTF-8 is seldom used, and it is really not a \fIbyte order\fR\& mark\&. There are obviously no byte order issues with UTF-8, so the BOM is only there to differentiate UTF-8 encoding from other UTF formats\&.
.RE