The Unicode Blog: UCA

Tuesday, June 20, 2017

Announcing The Unicode® Standard, Version 10.0

Version 10.0 of the Unicode Standard is now available. For the first time, both the core specification and the data files are available on the same date. Version 10.0 adds 8,518 characters, for a total of 136,690 characters. These additions include four new scripts, for a total of 139 scripts, as well as 56 new emoji characters.

The new scripts and characters in Version 10.0 add support for lesser-used languages and unique written requirements worldwide, including:

Masaram Gondi, used to write Gondi in Central and Southeast India
Nüshu, used by women in China to write poetry and other discourses until the late twentieth century
Soyombo and Zanabazar Square, used in historic Buddhist texts to write Sanskrit, Tibetan, and Mongolian
Syriac letters used for writing Suriyani Malayalam, also known as Garshuni and as Syriac Malayalam
Gujarati signs used for the transliteration of the Arabic script into Gujarati by Ismaili Khoja communities
A set of 285 Hentaigana characters used in Japan (historic variants of Hiragana characters)
CJK Extension F (7,473 Han characters)

Among important symbol additions are:

Bitcoin sign
A set of Typicon marks and symbols
56 emoji characters, including mage, fairy, vampire, coconut, broccoli, and sandwich

For the full list of emoji characters, see emoji additions for Unicode 10.0, and Emoji Counts. For a detailed description of support for emoji characters by the Unicode Standard, see UTS #51, Unicode Emoji.

Three other important Unicode specifications have been updated for Version 10.0:

UTS #10, Unicode Collation Algorithm — sorting Unicode text
UTS #39, Unicode Security Mechanisms — reducing Unicode spoofing
UTS #46, Unicode IDNA Compatibility Processing — compatible processing of non-ASCII URLs

Unicode 10.0 includes a number of changes. Some of the Unicode Standard Annexes have modifications for Unicode 10.0, often in coordination with changes to character properties. In particular, there are changes to UAX #14, Unicode Line Breaking Algorithm, UAX #29, Unicode Text Segmentation, and UAX #31, Unicode Identifier and Pattern Syntax. In addition, UAX #50, Unicode Vertical Text Layout, has been newly incorporated as a part of the standard.

The Unicode Standard is the foundation for all modern software and communications around the world, including all modern operating systems, browsers, laptops, and smart phones—plus the Internet and Web (URLs, HTML, XML, CSS, JSON, etc.). The Unicode Standard, its associated standards, and data form the foundation for CLDR and ICU releases.

Adopt-a-Character

All 8,518 additional characters, including 239 new emoji, are now available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages.

About the Unicode Consortium

The Unicode Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard and related globalization standards.

The membership of the consortium represents a broad spectrum of corporations and organizations, many in the computer and information processing industry. Members include: Adobe, Apple, EmojiXpress, Facebook, Google, Government of Bangladesh, Government of India, Huawei, IBM, Microsoft, Monotype Imaging, Netflix, Sultanate of Oman MARA, Oracle, Rajya Marathi Vikas Sanstha, SAP, Symantec, Tamil Virtual University, The University of California (Berkeley), plus well over a hundred Associate, Liaison, and Individual members. For a complete member list go to http://www.unicode.org/consortium/members.html.

Monday, December 10, 2012

Unicode Collation Proposed Update

The Unicode Collation Algorithm (UCA) data is being modified to make all digits with the same numeric value sort the same, whether they are European (ASCII), Arabic, Devanagari, or others. In addition, the format of the main data table has changed to omit the (unused) 4th level weight, and some data tables are moved to the Unicode CLDR project.

These and other changes are in the new proposed update: see PRI 235. For the exact list of modifications, see Modifications.
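
The digit change described above can be illustrated with a small sketch. This is not the UCA and does not use its data tables; it is a toy sort key, assuming only Python's standard `unicodedata` module, that replaces each decimal digit with its numeric value so that digits from different scripts compare equal. The function name `digit_folded_key` and the sample strings are made up for illustration.

```python
import unicodedata

def digit_folded_key(text):
    """Toy sort key (hypothetical helper, not the UCA): every decimal
    digit is replaced by its numeric value, so digits from different
    scripts with the same value compare equal."""
    key = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None if ch is not a decimal digit
        if value is not None:
            key.append((0, value))      # digits sort by numeric value
        else:
            key.append((1, ord(ch)))    # everything else by code point
    return tuple(key)

samples = [
    "item 3",        # ASCII digit three
    "item \u0969",   # Devanagari digit three
    "item \u0663",   # Arabic-Indic digit three
    "item 2",        # ASCII digit two
    "item \u096A",   # Devanagari digit four
]

ordered = sorted(samples, key=digit_folded_key)
for s in ordered:
    print(s)
```

With this key, the three strings ending in a digit three compare as equal and keep their input order, while the strings ending in two and four sort before and after them. A real collation implementation would take its weights from the UCA data or a CLDR tailoring rather than from code points.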

Tuesday, June 26, 2012

Proposed updates for Unicode Collation and IDNA

The proposed update of UTS #10, Unicode Collation Algorithm (UCA), modifies the specification for certain edge cases (overlapping contractions) and tightens the requirements for well-formed collation element tables. The detailed descriptions of parametric tailoring options have been removed; the text now refers to the corresponding section in LDML, which adds new explanations and definitions. There are a number of other improvements, including additional examples and some rearrangement of text. See PRI #223.

The data has been updated for the Unicode 6.2 beta review, and the associated CollationAuxiliary.txt file in CollationAuxiliary.zip now includes a description of the implicit fractional weight generation and the context syntax. For more details, see Modifications.

There is also a proposed update of UTS #46, Unicode IDNA Compatibility Processing. The data has been updated for the Unicode 6.2 beta review, with minor changes to the text. See PRI #224.
