draft-freytag-troublesome-characters-01

IETF                                                          A. Freytag
Internet-Draft                                               ASMUS, Inc.
Intended status: Standards Track                              J. Klensin
Expires: January 1, 2018
                                                             A. Sullivan
                                                            Oracle Corp.
                                                           June 30, 2017


Those Troublesome Characters: A Registry of Unicode Code Points Needing
         Special Consideration When Used in Network Identifiers
                draft-freytag-troublesome-characters-01

Abstract

   Unicode's design goal is to be the universal character set for all
   applications.  The goal entails the inclusion of very large numbers
   of characters.  It is also focused on written language; special
   provisions have always been needed for identifiers.  The sheer size
   of the repertoire increases the possibility of accidental or
   intentional use of characters that can cause confusion among users,
   particularly where linguistic context is ambiguous, unavailable, or
   impossible to determine.  A registry of code points that can
   sometimes be especially problematic may be useful to guide system
   administrators in setting parameters for allowable code points in an
   identifier system, and to aid applications in creating security aids
   for users.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 1, 2018.







Freytag, et al.          Expires January 1, 2018                [Page 1]


Internet-Draft           Troublesome Characters                June 2017


Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Unicode code points and identifiers . . . . . . . . . . . . .   2
   2.  Background and Conventions  . . . . . . . . . . . . . . . . .   4
   3.  Techniques already in place . . . . . . . . . . . . . . . . .   4
   4.  A registry of code points . . . . . . . . . . . . . . . . . .   6
     4.1.  Discussion  . . . . . . . . . . . . . . . . . . . . . . .   6
     4.2.  Registry initial contents . . . . . . . . . . . . . . . .   7
       4.2.1.  Code Point Table  . . . . . . . . . . . . . . . . . .   7
       4.2.2.  References for Registry . . . . . . . . . . . . . . .  31
   5.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  33
   6.  Security Considerations . . . . . . . . . . . . . . . . . . .  33
   7.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  33
     7.1.  Normative References  . . . . . . . . . . . . . . . . . .  33
     7.2.  Informative References  . . . . . . . . . . . . . . . . .  34
   Appendix A.  Additional Background  . . . . . . . . . . . . . . .  35
     A.1.  The Theory of Inclusion . . . . . . . . . . . . . . . . .  35
     A.2.  The Difference Between Theory and Practice  . . . . . . .  36
       A.2.1.  Confusability . . . . . . . . . . . . . . . . . . . .  36
       A.2.2.  Not everything can be solved  . . . . . . . . . . . .  38
   Appendix B.  Examples . . . . . . . . . . . . . . . . . . . . . .  38
   Appendix C.  Discussion Venue . . . . . . . . . . . . . . . . . .  40
   Appendix D.  Change History . . . . . . . . . . . . . . . . . . .  41
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  41

1.  Unicode code points and identifiers

   Unicode [Unicode] is a coded character set that aims to support every
   writing system.  Writing systems evolve over time and are sometimes
   influenced by one another.  As a result, Unicode encodes many
   characters that, to a reader, appear to be the same thing but that
   are encoded differently from one another.  This sort of difference is
   usually not important in written texts, because competent readers and
   writers of a language are able to compensate for the selection of the
   "wrong" character when reading or writing.  Finally, the goal of
   supporting every writing system also implies that Unicode is designed
   to properly represent text in written languages, so special
   provisions are needed for identifiers.

   Identifiers that are used in a network or, especially, an Internet
   context present several special problems because of the above feature
   of Unicode:

   1.  In many (perhaps most) uses of identifiers, it is either
       practically difficult or impossible to ascertain the correct
       language context in which the identifier is being or will be
       used.  In the case of an internationalized domain name, for
       instance, each label could in principle represent a new locus of
       control, because there could be a delegation there.  A new locus
       of control means that the administrator of the resulting zone
       could speak, read, or intend a different language context than
       the one from the parent.  Moreover, at least some domains (such
       as the root) have an Internet-wide context and therefore do not
       really have a language context as such.  In any case, the
       language context is simply not available as part of a DNS lookup,
       so there is no way to make the DNS sensitive to this sort of
       issue.  Even in the case of email local-parts, where a sender is
       likely to know at least one of the languages of the receiver, the
       language context that was in use at the time the identifier was
       created is often unknown.

   2.  Identifiers on the network are in general exact-match systems,
       because an ambiguous identifier is problematic.  Sometimes, but
       not always, there are facilities for aliasing such that multiple
       identifiers can be put together as a single identity; the DNS,
       for example, does not have such an aliasing capability, because
       in the DNS all aliases are one-way pointers.  Aliasing techniques
       are in any case just an extension of the exact-match approach,
       and do not work the way a competent human reader does when
       interpolating the "right" character upon seeing the "wrong" one.

   3.  Because there are many characters that may appear to be the same
       (or even, that are defined in such a way that they are all but
       guaranteed to be rendered by the same glyphs), it is fairly easy
       to create an identifier either by accident or on purpose that is
       likely to be confused with some other identifier even by
       competent readers and writers of a language.

   4.  For some scripts the repertoire of shapes is shared, so that
       there are cases of two strings in which all the code points in
       one script in the first string, and all the code points in
       another script in the second string, are respectively confusable
       with one another.  In that case, the strings cannot be
       distinguished by a reader, and the whole string is confusable.

   5.  For some scripts, neither users nor rendering systems expect to
       encounter code points in arbitrary sequence.  Most code points
       normally occur only in specific locations within a syllable.  If
       random labels were permitted, some would not display as expected
       (including having some features misplaced or not displayed) while
       others would present recognition problems to users experienced
       with the script.  Some devices may also not support arbitrary
       input.

   Beyond these issues, human perception is easily tricked, so that
   entirely unrelated character sequences can become confusable -- for
   example "rn" being confused with "m".  Humans read strings, not
   characters, and they will mostly see what they expect to see.  Some
   additional discussion of the background can be found in Appendix A.
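
   The exact-match problem can be illustrated with a short Python
   sketch.  It compares two spellings taken from the code point table
   in Section 4.2.1 (U+00F8, and U+006F followed by U+0337).  The
   helper name code_points is an assumption made for this example; only
   the standard unicodedata module is used.

```python
import unicodedata
from typing import List


def code_points(s: str) -> List[str]:
    """Return the code points of s in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in s]


# LATIN SMALL LETTER O followed by COMBINING SHORT SOLIDUS OVERLAY
overlay = "o\u0337"
# LATIN SMALL LETTER O WITH STROKE
precomposed = "\u00F8"

# An exact-match system sees two different identifiers.
same_raw = overlay == precomposed

# NFC does not unify them: U+00F8 has no canonical decomposition.
same_nfc = (unicodedata.normalize("NFC", overlay)
            == unicodedata.normalize("NFC", precomposed))

print(code_points(overlay), code_points(precomposed), same_raw, same_nfc)
```

   Neither the raw comparison nor the normalized comparison treats the
   two spellings as one identifier, although most fonts render them
   alike.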

2.  Background and Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   A reader needs to be familiar with Unicode [Unicode], IDNA2008
   [RFC5890] [RFC5891] [RFC5892] [RFC5893] [RFC5894], PRECIS (at least
   the framework, [RFC7564]), and conventions for discussion of
   internationalization in the IETF (see [RFC6365]).

3.  Techniques already in place

   In the IDNA mechanism for including Unicode code points [RFC5892], a
   code point is only included when it meets the needs of
   internationalizing domain names as explained in the IDNA framework
   [RFC5894].  For identifiers other than those specified by IDNA, the
   PRECIS framework [RFC7564] generalizes the same basic technique.  In
   both cases, the overall approach is to assume that all characters are
   excluded, and then to include characters according to properties
   derived from the Unicode character properties.  This general strategy
   reduces the enormous size of the Unicode repertoire somewhat,
   excluding some characters that are necessarily unsuited for use in
   identifiers.
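
   The default-deny approach can be sketched in Python as follows.  The
   set ALLOWED_CATEGORIES and both function names are assumptions made
   for illustration.  The actual IDNA2008 and PRECIS derivations rely
   on more properties and on exception lists, not just on the
   General_Category values shown here.

```python
import unicodedata

# Assumption: a deliberately simplified stand-in for a derived property.
# Everything is excluded unless its General_Category is listed here.
ALLOWED_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}


def is_included(ch: str) -> bool:
    """Include a code point only if its category is explicitly allowed."""
    return unicodedata.category(ch) in ALLOWED_CATEGORIES


def label_is_permitted(label: str) -> bool:
    """A label passes only if it is non-empty and every code point is included."""
    return len(label) > 0 and all(is_included(ch) for ch in label)


print(label_is_permitted("example"))   # lowercase letters only
print(is_included("\u2019"))           # RIGHT SINGLE QUOTATION MARK
print(is_included("\u02BC"))           # MODIFIER LETTER APOSTROPHE
```

   A filter of this kind rejects U+2019 as punctuation but admits
   U+02BC and U+0337, both of which are listed in Section 4.2.1.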

   The mechanism of inclusion by derived property, while helpful, is
   insufficient to guarantee every included character is safe for use in
   identifiers.  Some characters' properties lead them to be included
   even though they are not obviously good candidates.  In other cases,
   individual characters are good for inclusion, but are problematic in
   combination.  Finally, there are cases where characters (or sequences
   of characters) are not problematic by themselves, or if used in a
   mutually exclusive manner in the same identifier, but become
   problematic when their choice represents the only difference between
   otherwise identical identifiers.  For some examples, see Appendix B.

   Operators of systems that create identifiers (whether through a
   registry or through a peer-to-peer identifier negotiation system)
   need to make policies for characters they will permit.  Operators of
   registries, for instance, can help by adopting good registration
   policies: "Users will benefit if registries only permit characters
   from scripts that are well-understood by the registry or its
   advisers." [RFC5894]  The difficulty for many operators, however, is
   that they do not have the writing system expertise to claim any
   character is "well-understood", and they do not really have the time
   to develop that expertise.  Such operators should in fact not use or
   register such characters.  Unfortunately, in many cases the operators
   are stewards of systems where the user population demands identifiers
   useful to them in their local languages.  In other cases, operators
   may proceed without a proper understanding owing to financial or
   market share incentives.  The risk for Internet identifiers in such
   cases is obviously that ill-understood and potentially exploitable
   gaps in registration policies will open.  To help mitigate such
   issues, a registry of Unicode code points that present special issues
   for network identifiers can help guide protocol and operating
   decisions about whether to permit a given code point or sequence of
   code points.  This will not completely protect against poor
   registration or use, but it may provide operational guidance
   necessary for people who are responsible for creating policies.

   Note that the registry defined herein does not address any of the
   issues created by whole-string confusables where each of the
   identifiers is of a different script.  A common workaround, limiting
   a registry to identifiers of only a single script, would mitigate
   this issue.
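
   A single-script restriction might be sketched as below.  Python's
   standard library does not expose the Unicode Script property, so
   this example guesses a script from the first word of the character
   name.  That heuristic, and the names rough_script and
   label_in_script, are assumptions for illustration only; a real
   implementation would consult the Unicode Scripts data.

```python
import unicodedata


def rough_script(ch: str) -> str:
    """Guess a script label from the character name (crude heuristic)."""
    name = unicodedata.name(ch, "")
    if not name:
        return "UNKNOWN"
    first_word = name.split(" ", 1)[0]
    if first_word in {"DIGIT", "HYPHEN-MINUS"}:
        return "COMMON"
    if first_word == "COMBINING":
        return "INHERITED"
    return first_word


def label_in_script(label: str, script: str) -> bool:
    """True if every code point belongs to script, or is common/inherited."""
    accepted = {script, "COMMON", "INHERITED"}
    return len(label) > 0 and all(rough_script(ch) in accepted for ch in label)


latin_label = "coco"
cyrillic_label = "\u0441\u043e\u0441\u043e"   # looks like "coco" in most fonts

print(label_in_script(latin_label, "LATIN"))
print(label_in_script(cyrillic_label, "LATIN"))
```

   A registry that accepts only Latin labels would refuse the Cyrillic
   look-alike outright, so the two strings could never coexist in it.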

   For some of the code points (or code point sequences) listed that
   present issues for identifiers, it may be most expeditious to simply
   not include them, even though they are valid according to the
   protocol.  Sometimes, one of a pair of identical code points (or code
   point sequences) may be deemed preferable over the other for
   practical reasons.

   In the case of registries, it is not always necessary or desirable to
   exclude characters.  Sometimes, it is merely necessary to ensure that
   for two otherwise identical identifiers, only one of a set of
   mutually exclusive characters (or sequences of characters) is used,
   while preventing the later registration of the label containing
   the other one in order to avoid ambiguity.  This way the operator
   does not need to make a choice.  In certain cases, where both of
   these identifiers mean the same thing, an operator may decide to
   allow both labels to be registered simultaneously, but only to the
   same entity.
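
   The blocking behavior described above could be sketched as follows.
   The two variant sets are taken from rows of the code point table in
   Section 4.2.1; the names VARIANT_SETS, blocking_key, LabelRegistry,
   and the allow_same_entity parameter are hypothetical and exist only
   for this example.

```python
from typing import Dict

# Assumption: two mutually exclusive pairs from the code point table.
VARIANT_SETS = [
    {"\u00F8", "o\u0337"},   # o with stroke / o + short solidus overlay
    {"\u019A", "l\u0335"},   # l with bar / l + short stroke overlay
]


def build_variant_map(groups):
    """Map every member of a variant set to one shared representative."""
    mapping = {}
    for group in groups:
        representative = min(group)   # any stable choice works as an index
        for member in group:
            mapping[member] = representative
    return mapping


VARIANT_MAP = build_variant_map(VARIANT_SETS)
MAX_VARIANT_LEN = max(len(seq) for seq in VARIANT_MAP)


def blocking_key(label: str) -> str:
    """Rewrite a label so that all mutually exclusive spellings collide."""
    out = []
    i = 0
    while i < len(label):
        # Try the longest candidate sequence first.
        for size in range(min(MAX_VARIANT_LEN, len(label) - i), 0, -1):
            chunk = label[i:i + size]
            if chunk in VARIANT_MAP:
                out.append(VARIANT_MAP[chunk])
                i += size
                break
        else:
            out.append(label[i])
            i += 1
    return "".join(out)


class LabelRegistry:
    def __init__(self) -> None:
        # blocking key -> {registered label: registrant}
        self._labels_by_key: Dict[str, Dict[str, str]] = {}

    def register(self, label: str, registrant: str,
                 allow_same_entity: bool = False) -> bool:
        """Accept the first spelling; block or restrict later variants."""
        taken = self._labels_by_key.setdefault(blocking_key(label), {})
        if not taken:
            taken[label] = registrant
            return True
        if label in taken:
            return False
        same_owner = all(owner == registrant for owner in taken.values())
        if allow_same_entity and same_owner:
            taken[label] = registrant
            return True
        return False
```

   With this design the operator never chooses a preferred spelling:
   whichever variant is registered first blocks the other, unless the
   policy allows the same registrant to hold both.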

   In every case, the registry here defined includes code points that
   require special attention when they are to be used in identifiers.
   An administrator who does not have the time or inclination to develop
   the requisite understanding would be well-advised simply not to
   permit these code points at all.

4.  A registry of code points

4.1.  Discussion

   The registry contains three fields.  The first field, called "Code
   Point(s)", is a code point or sequence of code points.  The second,
   "Cross Reference", contains zero or more cross references to related
   code points.  The third, called "Explanation", is a free form text
   field that briefly describes the issue.  The explanation field also
   contains one or more references to documents defining the code point
   and the reason why it presents an issue.  These references may be to
   documents external to the registry, so long as the reference is
   stable.
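
   One possible in-memory representation of the three fields is
   sketched below.  The class RegistryEntry and the helpers
   parse_sequence and make_entry are assumptions for illustration, not
   part of the registry definition.  The input strings follow the
   hexadecimal notation used in the code point table, with a comma
   separating multiple cross references.

```python
from dataclasses import dataclass
from typing import Tuple

Sequence = Tuple[int, ...]


@dataclass(frozen=True)
class RegistryEntry:
    code_points: Sequence                   # "Code Point(s)" field
    cross_references: Tuple[Sequence, ...]  # "Cross Reference" field
    explanation: str                        # "Explanation" field


def parse_sequence(text: str) -> Sequence:
    """Parse space-separated hexadecimal code points, e.g. '006F 0337'."""
    return tuple(int(token, 16) for token in text.split())


def make_entry(code_points: str, cross_refs: str,
               explanation: str) -> RegistryEntry:
    """Build an entry; cross_refs may be empty or comma-separated."""
    refs = tuple(parse_sequence(ref)
                 for ref in cross_refs.split(",") if ref.strip())
    return RegistryEntry(parse_sequence(code_points), refs, explanation)


entry = make_entry(
    "006F 0337", "00F8",
    "Identical: Usually indistinguishable from LETTER O WITH STROKE")
print(entry)
```

   Storing sequences as tuples of integers keeps single code points and
   multi-code-point sequences in one uniform shape.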

   The registry is not intended as an alternative to normal operational
   policies that are used for protocols under normal administrative
   scope.  For instance, zone operators that support IDNA are expected
   to create policies governing the code points that they will permit
   (see [RFC5894] and [I-D.rfc5891bis]).  The registry herein defined is
   intended to highlight particularly troublesome code points or code
   point sequences for the benefit of administrators creating such
   policies.  It is also intended to highlight characters that may
   create identifier ambiguities and thereby create security
   vulnerabilities.

   If a character appears in the registry, that does not automatically
   mean that it is a bad candidate for use in identifiers generally.
   Absent a well-defined and verifiable policy, however, such a code
   point or sequence might well be treated with suspicion by users and
   by tools.
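
   A tool that treats registered code points with suspicion might scan
   a label as in the following sketch.  REGISTRY_EXCERPT is a
   hypothetical in-memory excerpt holding five rows of the table in
   Section 4.2.1, and find_troublesome is an illustrative name.  The
   sketch only reports matches; what to do with them is a matter of
   local policy.

```python
from typing import List, Tuple

# Assumption: sequence -> category label, copied from five table rows.
REGISTRY_EXCERPT = {
    "\u0307": "Restricted Context",
    "o\u0337": "Identical",
    "\u00F8": "Identical",
    "\u02BC": "Not Recommended",
    "\u0337": "Not Recommended",
}


def find_troublesome(label: str) -> List[Tuple[int, str, str]]:
    """Return (offset, sequence, category) for each registry match."""
    findings = []
    for sequence, category in REGISTRY_EXCERPT.items():
        start = label.find(sequence)
        while start != -1:
            findings.append((start, sequence, category))
            start = label.find(sequence, start + 1)
    return sorted(findings)


for offset, sequence, category in find_troublesome("so\u0337ren"):
    hexes = " ".join(f"{ord(ch):04X}" for ch in sequence)
    print(offset, hexes, category)
```

   A label may match more than one row, as here, where both the
   sequence 006F 0337 and the single code point 0337 are reported.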

   The registry is updated by Expert Review.  It ought to contain only
   code points that are significant in identifiers and that need special
   policies (including policies of exclusion).  Only code points that
   are eligible for use in identifiers (i.e. that are not DISALLOWED)
   ought to be included.  Code points that are CONTEXTJ or CONTEXTO
   ought to only be included if concerns are identified that are not
   mitigated by the existing IDNA context rules.

4.2.  Registry initial contents

4.2.1.  Code Point Table

   +----------+-----------+--------------------------------------------+
   | Code     | Cross     | Explanation                                |
   | Point or | Reference |                                            |
   | Sequence |           |                                            |
   +----------+-----------+--------------------------------------------+
   | 0307     |           | Restricted Context: By definition, LATIN   |
   |          |           | SMALL LETTER I plus combining DOT ABOVE    |
   |          |           | renders exactly the same as LATIN SMALL    |
   |          |           | LETTER I by itself and does so in practice |
   |          |           | for any good font. The same is true for    |
   |          |           | all Unicode characters with the            |
   |          |           | soft_dotted property; they lose their dot  |
   |          |           | if followed by a combining mark. DOT ABOVE |
   |          |           | should be excluded, or restricted to       |
   |          |           | contexts where it does not follow a        |
   |          |           | soft_dotted letter. [115]                  |
   | 006C     | 019A      | Identical: Usually indistinguishable from  |
   | 0335     |           | LETTER L WITH BAR                          |
   | 006F     | 00F8      | Identical: Usually indistinguishable from  |
   | 0337     |           | LETTER O WITH STROKE                       |
   | 00F8     | 006F 0337 | Identical: Usually indistinguishable in    |
   |          |           | appearance from LETTER O plus combining    |
   |          |           | SHORT SOLIDUS OVERLAY                      |
   | 02A6     | 0074 0073 | Identical: Looks like LETTER T plus LETTER |
   |          |           | S, except for slight kerning               |
   | 0074     | 02A6      | Identical: Looks like TS DIGRAPH, except   |
   | 0073     |           | for lack of kerning                        |
   | 019A     | 006C 0335 | Identical: Usually indistinguishable from  |
   |          |           | LETTER L plus combining SHORT STROKE       |
   |          |           | OVERLAY                                    |
   | 01C0     |           | Not Recommended: Indistinguishable from a  |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 01C1     |           | Not Recommended: Indistinguishable from a  |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 01C2     |           | Not Recommended: Indistinguishable from a  |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 01C3     |           | Not Recommended: Indistinguishable from a  |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 01DD     | 0259      | Identical: Identical in appearance to      |
   |          |           | U+0259 [150]                               |
   | 0259     | 01DD      | Identical: Identical in appearance to      |
   |          |           | U+01DD [150]                               |
   | 02B9     |           | Not Recommended: Indistinguishable from a  |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02BA     |           | Not Recommended: Indistinguishable from a  |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02BC     |           | Not Recommended: Indistinguishable from a  |
   |          |           | punctuation character (U+2019), which is   |
   |          |           | not PVALID [6912]                          |
   | 02BD     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02BE     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02BF     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02C0     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02C1     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02C6     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02C7     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02C8     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02C9     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02CA     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02CB     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |



Freytag, et al.          Expires January 1, 2018                [Page 8]


Internet-Draft           Troublesome Characters                June 2017


   | 02CC     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02CD     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02CE     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02CF     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02D0     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02D1     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02EC     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 02EE     |           | Not Recommended: Indistinguishable from    |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | 0321     |           | Not Recommended: Not intended for forming  |
   |          |           | combined orthographic letters              |
   | 0322     |           | Not Recommended: Not intended for forming  |
   |          |           | combined orthographic letters              |
   | 0334     |           | Not Recommended: Not intended for forming  |
   |          |           | combined orthographic letters              |
   | 0335     |           | Not Recommended: Not intended for forming  |
   |          |           | combined orthographic letters              |
   | 0336     |           | Not Recommended: Not intended for forming  |
   |          |           | combined orthographic letters              |
   | 0337     |           | Not Recommended: Not intended for forming  |
   |          |           | combined orthographic letters              |
   | 0338     |           | Not Recommended: Not intended for forming  |
   |          |           | combined orthographic letters              |
   | 0633     | 069A      | Identical: Identical in appearance to      |
   | 065C     |           | U+069A [300]                               |
   | 06EC     |           |                                            |
   | 06A1     | 0641      | Identical: Identical in appearance to      |
   | 065C     | 065C,     | U+06A3 and to U+0641 U+065C [300]          |
   | 06EC     | 06A3      |                                            |
   | 0633     | 06FA      | Identical: Identical in appearance to      |
   | 06DB     |           | U+06FA [300]                               |
   | 065C     |           |                                            |
   | 0635     | 0636      | Identical: Identical in appearance to      |
   | 065C     | 065C,     | U+06FB and to U+0636 U+065C [300]          |
   | 06EC     | 06FB      |                                            |
   | 0639     | 06FC,     | Identical: Identical in appearance to      |
   | 065C     | 063A 065C | U+06FC and U+063A U+065C [300]             |
   | 06EC     |           |                                            |
   | 06BA     | 0646      | Identical: Identical in appearance to      |
   | 065C     | 065C,     | U+06B9 and to U+0646 U+065C [300]          |
   | 06EC     | 06B9      |                                            |
   | 06CF     | 0648 06EC | Identical: Identical in appearance to      |
   |          |           | U+0648 U+06EC [300]                        |
   | 063A     | 0639 06EC | Identical: Identical in appearance to      |
   |          |           | U+0639 U+06EC [300]                        |
   | 0636     | 0635 06EC | Identical: Identical in appearance to      |
   |          |           | U+0635 U+06EC [300]                        |
   | 062E     | 062D 06EC | Identical: Identical in appearance to      |
   |          |           | U+062D U+06EC [300]                        |
   | 06BF     | 0686 06EC | Identical: Identical in appearance to      |
   |          |           | U+0686 U+06EC [300]                        |
   | 0630     | 062F 06EC | Identical: Identical in appearance to      |
   |          |           | U+062F U+06EC [300]                        |
   | 0632     | 0631 06EC | Identical: Identical in appearance to      |
   |          |           | U+0631 U+06EC [300]                        |
   | 06B6     | 0644 06EC | Identical: Identical in appearance to      |
   |          |           | U+0644 U+06EC [300]                        |
   | 06AC     | 0643 06EC | Identical: Identical in appearance to      |
   |          |           | U+0643 U+06EC [300]                        |
   | 06BB     | 066E      | Identical: Identical in appearance to      |
   |          | 0615,     | U+06BA U+0615 and to U+0679 or U+066E      |
   |          | 06BA      | U+0615 when assuming initial or medial     |
   |          | 0615,     | form [300]                                 |
   |          | 0679      |                                            |
   | 0679     | 06BB,     | Identical: Identical in appearance to      |
   |          | 066E      | U+066E U+0615 and to U+06BB or U+06BA      |
   |          | 0615,     | U+0615 when assuming initial or medial     |
   |          | 06BA 0615 | form [300]                                 |
   | 06FF     | 06BE      | Identical: Identical in appearance to      |
   |          | 065B,     | U+06BE U+065B and to U+0647 U+065B [300]   |
   |          | 0647 065B |                                            |
   | 06C7     | 0648      | Identical: Identical in appearance to      |
   |          | 064F,     | U+0648 U+064F and to U+0648 U+0619 [300]   |
   |          | 0648 0619 |                                            |
   | 063D     | 06CC 065B | Identical: Identical in appearance to      |
   |          |           | U+06CC U+065B [300]                        |
   | 0648     | 06CF      | Identical: Identical in appearance to      |
   | 06EC     |           | U+06CF [300]                               |
   | 0639     | 063A      | Identical: Identical in appearance to      |
   | 06EC     |           | U+063A [300]                               |
   | 0635     | 0636      | Identical: Identical in appearance to      |
   | 06EC     |           | U+0636 [300]                               |
   | 062D     | 062E      | Identical: Identical in appearance to      |
   | 06EC     |           | U+062E [300]                               |
   | 0686     | 06BF      | Identical: Identical in appearance to      |
   | 06EC     |           | U+06BF [300]                               |
   | 062F     | 0630      | Identical: Identical in appearance to      |
   | 06EC     |           | U+0630 [300]                               |
   | 0631     | 0632      | Identical: Identical in appearance to      |
   | 06EC     |           | U+0632 [300]                               |
   | 0644     | 06B6      | Identical: Identical in appearance to      |
   | 06EC     |           | U+06B6 [300]                               |
   | 066F     | 0641,     | Identical: Identical in appearance to      |
   | 06EC     | 06A1      | U+06A7 and to U+0641 or U+06A1 U+06EC when |
   |          | 06EC,     | assuming initial or medial form [300]      |
   |          | 06A7      |                                            |
   | 06A1     | 0641,     | Identical: Identical in appearance to      |
   | 06EC     | 06A7,     | U+0641 and to U+06A7 or U+066F U+06EC when |
   |          | 066F 06EC | assuming initial or medial form [300]      |
   | 06BA     | 0646      | Identical: Identical in appearance to      |
   | 06EC     |           | U+0646 [300]                               |
   | 0643     | 06AC      | Identical: Identical in appearance to      |
   | 06EC     |           | U+06AC [300]                               |
   | 06BA     | 0679,     | Identical: Identical in appearance to      |
   | 0615     | 06BB,     | U+06BB and to U+0679 or U+066E U+0615 when |
   |          | 066E 0615 | assuming initial or medial form [300]      |
   | 066E     | 0679,     | Identical: Identical in appearance to      |
   | 0615     | 06BA      | U+0679 and to U+06BB or U+06BA U+0615 when |
   |          | 0615,     | assuming initial or medial form [300]      |
   |          | 06BB      |                                            |
   | 06CC     | 063D      | Identical: Identical in appearance to      |
   | 065B     |           | U+063D [300]                               |
   | 0648     | 0648      | Identical: Identical in appearance to      |
   | 064F     | 0619,     | U+0648 U+0619 and to U+06C7 [300]          |
   |          | 06C7      |                                            |
   | 0648     | 0648      | Identical: Identical in appearance to      |
   | 0619     | 064F,     | U+0648 U+064F and to U+06C7 [300]          |
   |          | 06C7      |                                            |
   | 0615     |           | Not Recommended: Part of homoglyph         |
   |          |           | sequence(s) not covered by normalization.  |
   |          |           | [300]                                      |
   | 0626     | 0649      | Identical: Identical in appearance to YEH  |
   |          | 0654,     | plus combining HAMZA ABOVE and U+06CC or   |
   |          | 064A      | U+064A plus combining HAMZA ABOVE [300]    |
   |          | 0654,     |                                            |
   |          | 06CC 0654 |                                            |
   | 0628     | 08A1      | Identical: Identical in appearance to      |
   | 0654     |           | U+08A1 [IAB]                               |
   | 0629     | 06C3      | Identical: Identical in appearance to      |
   |          |           | U+06C3 when assuming final form [300]      |
   | 062D     | 0772      | Identical: Identical in appearance to HAH  |
   | 0615     |           | with SMALL TAH ABOVE [300]                 |
   | 062D     | 0681      | Identical: Identical in appearance to      |
   | 0654     |           | U+0681 [300]                               |
   | 062F     | 0688      | Identical: Identical in appearance to      |
   | 0615     |           | U+0688 [300]                               |
   | 062F     | 06EE      | Identical: Identical in appearance to      |
   | 065B     |           | U+06EE [300]                               |
   | 0631     | 0691      | Identical: Identical in appearance to      |
   | 0615     |           | U+0691 [300]                               |
   | 0631     | 076C      | Identical: Identical in appearance to      |
   | 0654     |           | U+076C [300]                               |
   | 0631     | 06EF      | Identical: Identical in appearance to      |
   | 065B     |           | U+06EF [300]                               |
   | 0641     | 066F      | Identical: Identical in appearance to      |
   |          | 06EC,     | U+06A1 U+06EC and to U+06A7 or U+066F      |
   |          | 06A1      | U+06EC when assuming initial or medial     |
   |          | 06EC,     | form [300]                                 |
   |          | 06A7      |                                            |
   | 0643     | 06A9      | Identical: Identical in appearance to      |
   |          |           | U+06A9 KEHEH when assuming initial form    |
   |          |           | [300]                                      |
   | 0644     | 06B5      | Identical: Identical in appearance to      |
   | 065A     |           | U+06B5 [300]                               |
   | 0646     | 06BA      | Identical: Identical in appearance to      |
   |          | 06EC,     | U+06BA U+06EC and to U+06BA when assuming  |
   |          | 06BA      | initial or medial form [300]               |
   | 0646     | 0768      | Identical: Identical in appearance to      |
   | 0615     |           | U+0768 [300]                               |
   | 0646     | 0769      | Identical: Identical in appearance to      |
   | 065A     |           | U+0769 [300]                               |
   | 0647     | 06BE,     | Identical: Identical in appearance to AE   |
   |          | 06C1,     | when assuming final or isolated form;      |
   |          | 06D5      | Identical in appearance to U+XXX when      |
   |          |           | assuming initial or medial form; identical |
   |          |           | in appearance to U+XXX when assuming       |
   |          |           | isolated form [300]                        |
   | 0647     | 06C0,     | Identical: Identical in appearance to      |
   | 0654     | 06C2      | U+06C2 and U+06C0 [300]                    |
   | 0647     | 06BE      | Identical: Identical in appearance to      |
   | 065B     | 065B,     | U+06FF and to U+06BE plus combining        |
   |          | 06FF      | INVERTED SMALL V ABOVE [300]               |
   | 0648     | 06C6      | Identical: Identical in appearance to      |
   | 065A     |           | U+06C6                                     |
   | 0648     | 06C9      | Identical: Identical in appearance to      |
   | 065B     |           | U+06C9                                     |
   | 0648     | 06C8      | Identical: Identical in appearance to YU   |
   | 0670     |           | U+06C8                                     |
   | 0649     | 06CC,     | Restricted Context: Not intended to be     |
   |          | 064A      | used with HAMZA ABOVE, use U+0626 instead, |
   |          |           | identical in appearance to U+064A when     |
   |          |           | assuming initial or medial form [99] [115] |
   |          |           | [300]                                      |
   | 0649     | 0626,     | Not Recommended: This sequence not to be   |
   | 0654     | 06CC 0654 | used; Identical in appearance in initial   |
   |          |           | position to HIGH HAMZA YEH $$$, as it      |
   |          |           | would be identical in appearance to U+0626 |
   |          |           | [99] [115] [300]                           |
   | 06CC     | 0649      | Identical: Identical in appearance in one  |
   | 0654     | 0654,     | or more positions to U+0626 [99] [300]     |
   |          | 0626      |                                            |
   | 0649     | 06CC      | Identical: Identical in appearance to      |
   | 065A     | 065A,     | U+06CE and to U+06CC plus combining SMALL  |
   |          | 06CE      | V ABOVE [300]                              |
   | 064A     | 06CC,     | Identical: Identical in appearance to      |
   |          | 0649      | U+06CC when assuming final or isolated     |
   |          |           | form [300]                                 |
   | 064A     | 0626,     | Identical: U+064A is supposed to lose its  |
   | 0654     | 08A8      | dots when combined with HAMZA ABOVE, which |
   |          |           | would make the sequence U+064A U+0654      |
   |          |           | identical in appearance to U+0626. In some |
   |          |           | fonts, the dots are retained, and the      |
   |          |           | sequence is then identical in appearance   |
   |          |           | with U+08A8 [99] [300]                     |
   | 064B     |           | Not Recommended: Not to be used in zone    |
   |          |           | files for the Arabic language, per RFC     |
   |          |           | 5564 [5564]                                |
   | 064C     |           | Not Recommended: Not to be used in zone    |
   |          |           | files for the Arabic language, per RFC     |
   |          |           | 5564 [5564]                                |
   | 064D     |           | Not Recommended: Not to be used in zone    |
   |          |           | files for the Arabic language, per RFC     |
   |          |           | 5564 [5564]                                |
   | 064E     |           | Not Recommended: Not to be used in zone    |
   |          |           | files for the Arabic language, per RFC     |
   |          |           | 5564 [5564]                                |
   | 064F     |           | Not Recommended: Not to be used in zone    |
   |          |           | files for the Arabic language, per RFC     |
   |          |           | 5564. Also: Part of homoglyph sequence(s)  |
   |          |           | not covered by normalization. [300] [5564] |
   | 0650     |           | Not Recommended: Not to be used in zone    |
   |          |           | files for the Arabic language, per RFC     |
   |          |           | 5564 [5564]                                |
   | 0651     |           | Not Recommended: Not to be used in zone    |
   |          |           | files for the Arabic language, per RFC     |
   |          |           | 5564 [5564]                                |
   | 0652     |           | Not Recommended: Not to be used in zone    |
   |          |           | files for the Arabic language, per RFC     |
   |          |           | 5564 [5564]                                |
   | 0654     |           | Not Recommended: Part of homoglyph         |
   |          |           | sequence(s) not covered by normalization.  |
   |          |           | [300]                                      |
   | 065A     |           | Not Recommended: Part of homoglyph         |
   |          |           | sequence(s) not covered by normalization.  |
   |          |           | [300]                                      |
   | 065B     |           | Not Recommended: Part of homoglyph         |
   |          |           | sequence(s) not covered by normalization.  |
   |          |           | [300]                                      |
   | 065C     |           | Not Recommended: Part of homoglyph         |
   |          |           | sequence(s) not covered by normalization.  |
   |          |           | [300]                                      |
   | 0660     | 06F0      | Identical: Identical in appearance and     |
   |          |           | meaning to EXTENDED ARABIC-INDIC DIGIT     |
   |          |           | ZERO [110]                                 |
   | 0661     | 06F1      | Identical: Identical in appearance and     |
   |          |           | meaning to EXTENDED ARABIC-INDIC DIGIT ONE |
   |          |           | [110]                                      |
   | 0662     | 06F2      | Identical: Identical in appearance and     |
   |          |           | meaning to EXTENDED ARABIC-INDIC DIGIT TWO |
   |          |           | [110]                                      |
   | 0663     | 06F3      | Identical: Identical in appearance and     |
   |          |           | meaning to EXTENDED ARABIC-INDIC DIGIT     |
   |          |           | THREE [110]                                |
   | 0667     | 06F7      | Identical: Usually identical in appearance |
   |          |           | and meaning to EXTENDED ARABIC-INDIC DIGIT |
   |          |           | SEVEN [110]                                |
   | 0668     | 06F8      | Identical: Identical in appearance and     |
   |          |           | meaning to EXTENDED ARABIC-INDIC DIGIT     |
   |          |           | EIGHT [110]                                |
   | 0669     | 06F9      | Identical: Identical in appearance and     |
   |          |           | meaning to EXTENDED ARABIC-INDIC DIGIT     |
   |          |           | NINE [110]                                 |
   | 0673     |           | Other issue: Deprecated; if required, use  |
   |          |           | sequence U+0627 U+065F instead             |
   | 0681     | 062D 0654 | Identical: Identical in appearance to HAH  |
   |          |           | plus combining HAMZA ABOVE [300]           |
   | 0688     | 062F 0615 | Identical: Identical in appearance to DAL  |
   |          |           | plus combining SMALL HIGH TAH              |
   | 068A     | 068B,     | Identical: Identical in appearance to      |
   | 0615     | 0688 065C | U+068B and U+0688 U+065C [300]             |
   | 068B     | 0688      | Identical: Identical in appearance to DAL  |
   |          | 065C,     | WITH DOT BELOW plus combining SMALL HIGH   |
   |          | 068A 0615 | TAH [300]                                  |
   | 0691     | 0631 0615 | Identical: Identical in appearance to REH  |
   |          |           | plus combining SMALL HIGH TAH              |
   | 069A     | 0633 065C | Identical: Identical in appearance to      |
   |          | 06EC      | combining sequence with two combining      |
   |          |           | marks [300]                                |
   | 06B9     | 0646      | Identical: Identical in appearance to      |
   |          | 065C,     | U+0646 U+065C and combining sequence with  |
   |          | 06BA 065C | two combining marks [300]                  |
   |          | 06EC      |                                            |
   | 06A9     | 0643      | Identical: Identical in appearance to      |
   |          |           | U+0643 KAF when assuming initial form      |
   |          |           | [300]                                      |
   | 06BE     | 0647,     | Identical: Identical in appearance to      |
   |          | 06C1,     | U+0647 when assuming initial or medial     |
   |          | 06D5      | form and to U+06D5 when assuming final     |
   |          |           | form [300]                                 |
   | 06CC     | 064A,     | Identical: Identical in appearance to      |
   |          | 0649      | U+064A when assuming initial or medial     |
   |          |           | form and to U+0649 when assuming final or  |
   |          |           | isolated form [300]                        |
   | 067B     | 06D0      | Identical: Identical in appearance to      |
   |          |           | U+06D0 when assuming initial form [300]    |
   | 0670     |           | Not Recommended: Part of homoglyph         |
   |          |           | sequence(s) not covered by normalization.  |
   |          |           | [300]                                      |
   | 067E     | 06BD,     | Identical: Identical in appearance to      |
   |          | 06BA 06DB | U+06BD or U+06BA U+06DB when assuming      |
   |          |           | initial or medial form [300]               |
   | 06A4     | 06A8,     | Identical: Identical in appearance to      |
   |          | 06A1      | U+06A8 and U+066F U+06DB when assuming     |
   |          | 06DB,     | initial or medial form and to U+06A1       |
   |          | 066F 06DB | U+06DB [300]                               |
   | 06A7     | 0641,     | Identical: Identical in appearance to      |
   |          | 066F      | U+066F U+06EC and to U+0641 or U+06A1      |
   |          | 06EC,     | U+06EC when assuming initial or medial     |
   |          | 06A1 06EC | form [300]                                 |
   | 06A8     | 06A4,     | Identical: Identical in appearance to      |
   |          | 06A1      | U+06A4 and to U+06A1 U+06DB when assuming  |
   |          | 06DB,     | initial or medial form and to U+066F       |
   |          | 066F 06DB | U+06DB [300]                               |
   | 06BA     | 0646      | Identical: Identical in appearance to      |
   |          |           | U+0646 when assuming initial or medial     |
   |          |           | form [300]                                 |
   | 06B5     | 0644 065A | Identical: Identical in appearance to      |
   |          |           | U+0644 with SMALL V ABOVE [300]            |
   | 06C0     | 0647      | Identical: Identical in appearance to      |
   |          | 0654,     | U+06C2 when assuming final form and to     |
   |          | 06C2      | U+0647 with HAMZA ABOVE [300]              |
   | 06C1     | 0647,     | Identical: Identical in appearance to      |
   |          | 06BE,     | U+0647 and U+06D5 when assuming isolated   |
   |          | 06D5      | form [300]                                 |
   | 06C2     | 0647      | Identical: Identical in appearance to      |
   |          | 0654,     | U+06C0 when assuming final form and to     |
   |          | 06C0      | U+0647 with HAMZA ABOVE [300]              |
   | 06C3     | 0629      | Identical: Identical in appearance to      |
   |          |           | U+0629 when assuming final form [300]      |
   | 06C6     | 0648 065A | Identical: Identical in appearance to WAW  |
   |          |           | plus combining SMALL V ABOVE [300]         |
   | 06C8     | 0648 0670 | Identical: Identical in appearance to WAW  |
   |          |           | plus combining SUPERSCRIPT ALEF U+0648     |
   |          |           | U+0670                                     |
   | 066E     | 0756      | Identical: Identical in appearance to BEH  |
   | 065A     |           | WITH SMALL V                               |
   | 0697     | 0771      | Identical: Identical in appearance to REH  |
   | 0615     |           | with SMALL TAH AND TWO DOTS                |
   | 06C9     | 0648 065B | Identical: Identical in appearance to WAW  |
   |          |           | plus combining INVERTED SMALL V ABOVE      |
   | 06CE     | 0649      | Identical: Identical in appearance to YEH  |
   |          | 065A,     | and ALEF MAKSURA, each plus combining      |
   |          | 06CC 065A | SMALL V ABOVE [300]                        |
   | 06CC     | 06CE,     | Identical: Identical in appearance to      |
   | 065A     | 0649 065A | U+06CE, and to ALEF MAKSURA plus combining |
   |          |           | SMALL V ABOVE [300]                        |
   | 06D0     | 067B      | Identical: Identical in appearance to      |
   |          |           | U+067B when assuming initial form [300]    |
   | 06D5     | 0647,     | Identical: Identical in appearance to      |
   |          | 06C1,     | U+0647 HEH when assuming final or isolated |
   |          | 06BE      | form, and to U+06C1 when assuming          |
   |          |           | isolated form [300]                        |
   | 06D6     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06D7     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06D8     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06D9     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06DA     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06DB     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. Part of homoglyph         |
   |          |           | sequence(s) not covered by normalization.  |
   |          |           | [115] [300]                                |
   | 06DC     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06DF     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06E0     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06E1     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06E2     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06E3     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06E4     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06E5     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06E6     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06E7     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06E8     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06EA     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06EB     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06EC     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. Part of homoglyph         |
   |          |           | sequence(s) not covered by normalization.  |
   |          |           | [115] [300]                                |
   | 06ED     |           | Not Recommended: Specialized use; Quranic  |
   |          |           | marks not used in writing contemporary     |
   |          |           | Arabic script based languages; hard to     |
   |          |           | distinguish at small sizes. Not suitable   |
   |          |           | for identifiers. [115] [300]               |
   | 06EE     | 062F 065B | Identical: Identical in appearance to DAL  |
   |          |           | plus combining INVERTED SMALL V ABOVE      |
   | 06EF     | 0631 065B | Identical: Identical in appearance to REH  |
   |          |           | plus combining INVERTED SMALL V ABOVE      |
   | 06F0     | 0660      | Identical: Identical in appearance and     |
   |          |           | meaning to ARABIC-INDIC DIGIT ZERO [110]   |
   | 06F1     | 0661      | Identical: Identical in appearance and     |
   |          |           | meaning to ARABIC-INDIC DIGIT ONE [110]    |
   | 06F2     | 0662      | Identical: Identical in appearance and     |
   |          |           | meaning to ARABIC-INDIC DIGIT TWO [110]    |
   | 06F3     | 0663      | Identical: Identical in appearance and     |
   |          |           | meaning to ARABIC-INDIC DIGIT THREE [110]  |
   | 06F7     | 0667      | Identical: Usually identical in appearance |
   |          |           | and meaning to ARABIC-INDIC DIGIT SEVEN    |
   |          |           | [110]                                      |
   | 06F8     | 0668      | Identical: Identical in appearance and     |
   |          |           | meaning to ARABIC-INDIC DIGIT EIGHT [110]  |
   | 06F9     | 0669      | Identical: Identical in appearance and     |
   |          |           | meaning to ARABIC-INDIC DIGIT NINE [110]   |
   | 06FA     | 0633 06DB | Identical: Identical in appearance to      |
   |          | 065C      | combining sequence with two combining      |
   |          |           | marks [300]                                |
   | 06FD     |           | Not Recommended: Does not have the         |
   |          |           | XID_CONTINUE property; not considered      |
   |          |           | suitable for identifiers by Unicode [120]  |
   | 06FE     |           | Not Recommended: Does not have the         |
   |          |           | XID_CONTINUE property; not considered      |
   |          |           | suitable for identifiers by Unicode [120]  |
   | 06BE     | 06FF,     | Identical: Identical in appearance to      |
   | 065B     | 0647 065B | U+06FF and U+0647 U+065B [300]             |
   | 0756     | 066E 065A | Identical: Identical in appearance to      |
   |          |           | DOTLESS BEH plus SMALL V ABOVE [300]       |
   | 0762     | 06A9 06EC | Identical: Identical in appearance to      |
   |          |           | U+06A9 with DOT ABOVE [300]                |
   | 06A9     | 0762      | Identical: Identical in appearance to      |
   | 06EC     |           | U+0762 [300]                               |
   | 0765     | 0645 06EC | Identical: Identical in appearance to      |
   |          |           | U+0645 with DOT ABOVE [300]                |
   | 0645     | 0765      | Identical: Identical in appearance to      |
   | 06EC     |           | U+0765 [300]                               |
   | 0768     | 0646 0615 | Identical: Identical in appearance to      |
   |          |           | U+0646 plus SMALL HIGH TAH [300]           |
   | 0769     | 0646 065A | Identical: Identical in appearance to      |
   |          |           | U+0646 with SMALL V ABOVE [300]            |
   | 0771     | 0697 0615 | Identical: Identical in appearance to REH  |
   |          |           | WITH TWO DOTS ABOVE plus SMALL TAH ABOVE   |
   |          |           | [300]                                      |
   | 0772     | 062D 0615 | Identical: Identical in appearance to HAH  |
   |          |           | plus SMALL TAH ABOVE [300]                 |
   | 076C     | 0631 0654 | Identical: Identical in appearance to REH  |
   |          |           | plus combining HAMZAH ABOVE                |
   | 08A1     | 0628 0654 | Identical: Used for Fulfulde, Identical in |
   |          |           | appearance to BEH plus combining HAMZAH    |
   |          |           | ABOVE                                      |
   | 063F     | 06CC      | Identical: Identical in appearance to      |
   |          | 06DB,     | U+06CC U+06DB [300]                        |
   |          | 0649 06DB |                                            |
   | 0634     | 0633 06DB | Identical: Identical in appearance to      |
   |          |           | U+0633 U+06DB [300]                        |
   | 069C     | 069B 06DB | Identical: Identical in appearance to      |
   |          |           | U+069B U+06DB [300]                        |
   | 062B     | 066E 06DB | Identical: Identical in appearance to      |
   |          |           | U+066E U+06DB [300]                        |
   | 0685     | 062D 06DB | Identical: Identical in appearance to      |
   |          |           | U+062D U+06DB [300]                        |
   | 0698     | 0631 06DB | Identical: Identical in appearance to      |
   |          |           | U+0631 U+06DB [300]                        |
   | 068E     | 062F 06DB | Identical: Identical in appearance to      |
   |          |           | U+062F U+06DB [300]                        |
   | 06A0     | 0639 06DB | Identical: Identical in appearance to      |
   |          |           | U+0639 U+06DB [300]                        |
   | 06AD     | 0643 06DB | Identical: Identical in appearance to      |
   |          |           | U+0643 U+06DB [300]                        |
   | 06B4     | 06AF 06DB | Identical: Identical in appearance to      |
   |          |           | U+06AF U+06DB [300]                        |
   | 06B7     | 0644 06DB | Identical: Identical in appearance to      |
   |          |           | U+0644 U+06DB [300]                        |
   | 06BD     | 067E,     | Identical: Identical in appearance to      |
   |          | 06BA 06DB | U+06BA U+06DB and to U+067E when assuming  |
   |          |           | initial or medial form [300]               |
   | 0763     | 06A9 06DB | Identical: Identical in appearance to      |
   |          |           | U+06A9 U+06DB [300]                        |
   | 0628     | 066E 065C | Identical: Identical in appearance to      |
   |          |           | U+066E U+065C [300]                        |
   | 068A     | 062F 065C | Identical: Identical in appearance to      |
   |          |           | U+062F U+065C [300]                        |
   | 0694     | 0631 065C | Identical: Identical in appearance to      |
   |          |           | U+0631 U+065C [300]                        |
   | 06A3     | 0641      | Identical: Identical in appearance to      |
   |          | 065C,     | U+0641 U+065C [300]                        |
   |          | 06A1 065C |                                            |
   |          | 06EC      |                                            |
   | 06FC     | 0639 065C | Identical: Identical in appearance to      |
   |          | 06EC,     | U+063A U+065C and to U+0639 U+065C U+06EC  |
   |          | 063A 065C | [300]                                      |
   | 06FB     | 0635 065C | Identical: Identical in appearance to      |
   |          | 06EC,     | U+0636 U+065C and to U+0635 U+065C U+06EC  |
   |          | 0636 065C | [300]                                      |
   | 0751     | 062B 065C | Identical: Identical in appearance to      |
   |          |           | U+062B U+065C [300]                        |
   | 0766     | 0645 065C | Identical: Identical in appearance to      |
   |          |           | U+0645 U+065C [300]                        |
   | 0649     | 063F,     | Identical: Identical in appearance to      |
   | 06DB     | 06CC 06DB | U+063F [300]                               |
   | 06CC     | 063F,     | Identical: Identical in appearance to      |
   | 06DB     | 0649 06DB | U+063F [300]                               |
   | 0633     | 0634      | Identical: Identical in appearance to      |
   | 06DB     |           | U+0634 [300]                               |
   | 069B     | 069C      | Identical: Identical in appearance to      |
   | 06DB     |           | U+069C [300]                               |
   | 066E     | 062B      | Identical: Identical in appearance to      |
   | 06DB     |           | U+062B [300]                               |
   | 062D     | 0685      | Identical: Identical in appearance to      |
   | 06DB     |           | U+0685 [300]                               |
   | 0631     | 0698      | Identical: Identical in appearance to      |
   | 06DB     |           | U+0698 [300]                               |
   | 062F     | 068E      | Identical: Identical in appearance to      |
   | 06DB     |           | U+068E [300]                               |
   | 0639     | 06A0      | Identical: Identical in appearance to      |
   | 06DB     |           | U+06A0 [300]                               |
   | 06A1     | 06A4,     | Identical: Identical in appearance to      |
   | 06DB     | 06A8,     | U+06A4 and U+06A8 and U+066F U+06DB when   |
   |          | 066F 06DB | assuming... [300]                          |
   | 066F     | 06A8,     | Identical: Identical in appearance to      |
   | 06DB     | 06A4,     | U+06A8 and to ... [300]                    |
   |          | 06A1 06DB |                                            |
   | 0643     | 06AD      | Identical: Identical in appearance to      |
   | 06DB     |           | U+06AD [300]                               |
   | 06AF     | 06B4      | Identical: Identical in appearance to      |
   | 06DB     |           | U+06B4 [300]                               |
   | 0644     | 06B7      | Identical: Identical in appearance to      |
   | 06DB     |           | U+06B7 [300]                               |
   | 06BA     | 067E,     | Identical: Identical in appearance to      |
   | 06DB     | 06BD      | U+06BD and to U+067E when assuming initial |
   |          |           | or medial form [300]                       |
   | 06A9     | 0763      | Identical: Identical in appearance to      |
   | 06DB     |           | U+0763 [300]                               |
   | 066E     | 0628      | Identical: Identical in appearance to      |
   | 065C     |           | U+0628 [300]                               |
   | 062F     | 068A      | Identical: Identical in appearance to      |
   | 065C     |           | U+068A [300]                               |
   | 0688     | 068A      | Identical: Identical in appearance to      |
   | 065C     | 0615,     | U+068B [300]                               |
   |          | 068B      |                                            |
   | 0631     | 0694      | Identical: Identical in appearance to      |
   | 065C     |           | U+0694 [300]                               |
   | 0641     | 06A1 065C | Identical: Identical in appearance to      |
   | 065C     | 06EC,     | U+06A3 and to U+06A1 U+065C U+06EC [300]   |
   |          | 06A3      |                                            |
   | 0646     | 06BA 065C | Identical: Identical in appearance to      |
   | 065C     | 06EC,     | U+06B9 and to a sequence with two          |
   |          | 06B9      | combining marks [300]                      |
   | 063A     | 0639 065C | Identical: Identical in appearance to      |
   | 065C     | 06EC,     | U+06FC and to U+0639 U+065C U+06EC [300]   |
   |          | 06FC      |                                            |
   | 0636     | 0635 065C | Identical: Identical in appearance to      |
   | 065C     | 06EC,     | U+06FB and to U+0635 U+065C U+06EC [300]   |
   |          | 06FB      |                                            |
   | 062B     | 0751      | Identical: Identical in appearance to      |
   | 065C     |           | U+0751 [300]                               |
   | 0645     | 0766      | Identical: Identical in appearance to      |
   | 065C     |           | U+0766 [300]                               |
   | 08A8     | 064A 0654 | Identical: Identical in appearance to      |
   |          |           | U+064A U+0654 [99]                         |
   | 08A9     | 064A 06EC | Identical: Identical in appearance to      |
   |          |           | U+064A U+06EC [99]                         |
   | 064A     | 08A9      | Identical: Identical in appearance to      |
   | 06EC     |           | U+08A9 [99]                                |
   | 098C     | 09E1      | Identical: Identical in appearance to      |
   | 09E2     |           | VOCALIC LL                                 |
   | 09E1     | 098C 09E2 | Identical: Used for Sanskrit, Identical in |
   |          |           | appearance to LETTER VOCALIC L plus SIGN   |
   |          |           | VOCALIC L                                  |
   | 0B95     | 0BE7      | Identical: Identical in appearance to      |
   |          |           | TAMIL DIGIT ONE                            |
   | 0BE7     | 0B95      | Identical: Identical in appearance to      |
   |          |           | TAMIL KA [110]                             |
   | 0D4C     | 0D57      | Not Recommended: Obsolete, preferred       |
   |          |           | alternative is U+0D57 [120] [115]          |
   | 0D57     | 0D4C      | Identical: This code point preferred over  |
   |          |           | U+0D4C, which is obsolete [120]            |
   | 0E3A     |           | Other issue: Renders unreliably, or not at |
   |          |           | all, if adjacent to any Thai vowel below.  |
   |          |           | This may be prevented by a context rule    |
   | 0E41     |           | Other issue: Digraph of U+0E40 SARA E      |
   |          |           | U+0E40 SARA E. Normally handled by         |
   |          |           | disallowing the sequence via a context     |
   |          |           | rule                                       |
   | 0E40     |           | Restricted Context: Restrict more than     |
   |          |           | one SARA E from occurring together, as     |
   |          |           | pairs are indistinguishable from U+0E41    |
   |          |           | SARA AE. This restriction is normally      |
   |          |           | implemented more generally, disallowing    |
   |          |           | any pair of leading vowels                 |
   | 0E45     |           | Restricted Context: Only occurs after two  |
   |          |           | special Thai vowels, U+0E24 RU and U+0E26  |
   |          |           | LU. Is also potentially confused with      |
   |          |           | U+0E32 SARA AA. Both issues can be         |
   |          |           | addressed by defining a context rule.      |
   |          |           | Alternatively the context may be spelled   |
   |          |           | out by enumerating the two sequences and   |
   |          |           | excluding U+0E45 if occurring by itself.   |
   | 0E4E     |           | Not Recommended: Rarely used in modern     |
   |          |           | Thai; it is more commonly replaced with    |
   |          |           | U+0E3A (PHINTHU). Excluding it avoids      |
   |          |           | issues with confusing it with another      |
   |          |           | diacritic U+0E4C (THANTHAKHAT). Both are   |
   |          |           | rendered atop a syllable and hard to       |
   |          |           | distinguish at small sizes.                |
   | 0F18     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | 0F19     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | 0F35     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | 0F37     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | 0F3E     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | 0F3F     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | 0F7A     | 0F7B      | Identical: Identical in appearance to      |
   | 0F7A     |           | VOWEL SIGN EE [120] [115]                  |
   | 0F7B     | 0F7A 0F7A | Identical: Identical in appearance to a    |
   |          |           | sequence of two VOWEL SIGN E [120] [115]   |
   | 0F7C     | 0F7D      | Identical: Identical in appearance to      |
   | 0F7C     |           | VOWEL SIGN OO [120] [115]                  |
   | 0F7D     | 0F7C 0F7C | Identical: Identical in appearance to a    |
   |          |           | sequence of two VOWEL SIGN O [120] [115]   |
   | 0FC6     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [115] [120]                 |
   | 101D     | 1040      | Identical: Letter U+101D is identical to   |
   |          |           | digit U+1040 [100] [150]                   |
   | 1040     | 101D      | Identical: Digit U+1040 is identical to    |
   |          |           | letter U+101D [110] [150]                  |
   | 1200     | 1210,     | Interchangeable: U+1200, U+1210 and U+1280 |
   |          | 1280      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1201     | 1211,     | Interchangeable: U+1201, U+1211 and U+1281 |
   |          | 1281      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1202     | 1212,     | Interchangeable: U+1202, U+1212 and U+1282 |
   |          | 1282      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1203     | 1213,     | Interchangeable: U+1203, U+1213 and U+1283 |
   |          | 1283      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1204     | 1214,     | Interchangeable: U+1204, U+1214 and U+1284 |
   |          | 1284      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1205     | 1215,     | Interchangeable: U+1205, U+1215 and U+1285 |
   |          | 1285      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1206     | 1216,     | Interchangeable: U+1206, U+1216 and U+1286 |
   |          | 1286      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1210     | 1200,     | Interchangeable: U+1200, U+1210 and U+1280 |
   |          | 1280      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1211     | 1201,     | Interchangeable: U+1201, U+1211 and U+1281 |
   |          | 1281      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1212     | 1202,     | Interchangeable: U+1202, U+1212 and U+1282 |
   |          | 1282      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1213     | 1203,     | Interchangeable: U+1203, U+1213 and U+1283 |
   |          | 1283      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1214     | 1204,     | Interchangeable: U+1204, U+1214 and U+1284 |
   |          | 1284      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1215     | 1205,     | Interchangeable: U+1205, U+1215 and U+1285 |
   |          | 1285      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1216     | 1206,     | Interchangeable: U+1206, U+1216 and U+1286 |
   |          | 1286      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1217     | 1288      | Interchangeable: U+1217 and U+1288 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1220     | 1230      | Interchangeable: U+1220 and U+1230 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1221     | 1231      | Interchangeable: U+1221 and U+1231 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1222     | 1232      | Interchangeable: U+1222 and U+1232 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1223     | 1233      | Interchangeable: U+1223 and U+1233 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1224     | 1234      | Interchangeable: U+1224 and U+1234 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1225     | 1235      | Interchangeable: U+1225 and U+1235 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1226     | 1236      | Interchangeable: U+1226 and U+1236 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1227     | 1237      | Interchangeable: U+1227 and U+1237 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1230     | 1220      | Interchangeable: U+1230 and U+1220 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1231     | 1221      | Interchangeable: U+1231 and U+1221 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1232     | 1222      | Interchangeable: U+1232 and U+1222 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1233     | 1223      | Interchangeable: U+1233 and U+1223 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1234     | 1224      | Interchangeable: U+1234 and U+1224 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1235     | 1225      | Interchangeable: U+1235 and U+1225 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1236     | 1226      | Interchangeable: U+1236 and U+1226 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1237     | 1227      | Interchangeable: U+1237 and U+1227 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1280     | 1200,     | Interchangeable: U+1200, U+1210 and U+1280 |
   |          | 1210      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1281     | 1201,     | Interchangeable: U+1201, U+1211 and U+1281 |
   |          | 1211      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1282     | 1202,     | Interchangeable: U+1202, U+1212 and U+1282 |
   |          | 1212      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1283     | 1203,     | Interchangeable: U+1203, U+1213 and U+1283 |
   |          | 1213      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1284     | 1204,     | Interchangeable: U+1204, U+1214 and U+1284 |
   |          | 1214      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1285     | 1205,     | Interchangeable: U+1205, U+1215 and U+1285 |
   |          | 1215      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1286     | 1206,     | Interchangeable: U+1206, U+1216 and U+1286 |
   |          | 1216      | are used interchangeably in Amharic [100]  |
   |          |           | [202]                                      |
   | 1288     | 1217      | Interchangeable: U+1288 and U+1217 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 12A0     | 12A3,     | Interchangeable: U+12A0, U+12A3, U+12D0    |
   |          | 12D0,     | and U+12D3 are used interchangeably in     |
   |          | 12D3      | Amharic [100] [202]                        |
   | 12A1     | 12D1      | Interchangeable: U+12A1 and U+12D1 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 12A2     | 12D2      | Interchangeable: U+12A2 and U+12D2 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 12A3     | 12A0,     | Interchangeable: U+12A0, U+12A3, U+12D0    |
   |          | 12D0,     | and U+12D3 are used interchangeably in     |
   |          | 12D3      | Amharic [100] [202]                        |
   | 12A4     | 12D4      | Interchangeable: U+12A4 and U+12D4 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 12A5     | 12D5      | Interchangeable: U+12A5 and U+12D5 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 12A6     | 12D6      | Interchangeable: U+12A6 and U+12D6 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 12AE     | 12B0      | Interchangeable: U+12AE and U+12B0 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 12B0     | 12AE      | Interchangeable: U+12B0 and U+12AE are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 12D0     | 12A0,     | Interchangeable: U+12A0, U+12A3, U+12D0    |
   |          | 12A3,     | and U+12D3 are used interchangeably in     |
   |          | 12D3      | Amharic [100] [202]                        |
   | 12D1     | 12A1      | Interchangeable: U+12D1 and U+12A1 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 12D2     | 12A2      | Interchangeable: U+12D2 and U+12A2 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 12D3     | 12A0,     | Interchangeable: U+12A0, U+12A3, U+12D0    |
   |          | 12A3,     | and U+12D3 are used interchangeably in     |
   |          | 12D0      | Amharic [100] [202]                        |
   | 12D4     | 12A4      | Interchangeable: U+12D4 and U+12A4 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 12D5     | 12A5      | Interchangeable: U+12D5 and U+12A5 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 12D6     | 12A6      | Interchangeable: U+12D6 and U+12A6 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1338     | 1340      | Interchangeable: U+1338 and U+1340 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1339     | 1341      | Interchangeable: U+1339 and U+1341 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 133A     | 1342      | Interchangeable: U+133A and U+1342 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 133B     | 1343      | Interchangeable: U+133B and U+1343 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 133C     | 1344      | Interchangeable: U+133C and U+1344 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 133D     | 1345      | Interchangeable: U+133D and U+1345 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 133E     | 1346      | Interchangeable: U+133E and U+1346 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1340     | 1338      | Interchangeable: U+1340 and U+1338 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1341     | 1339      | Interchangeable: U+1341 and U+1339 are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1342     | 133A      | Interchangeable: U+1342 and U+133A are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1343     | 133B      | Interchangeable: U+1343 and U+133B are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1344     | 133C      | Interchangeable: U+1344 and U+133C are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1345     | 133D      | Interchangeable: U+1345 and U+133D are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 1346     | 133E      | Interchangeable: U+1346 and U+133E are     |
   |          |           | used interchangeably in Amharic [100]      |
   |          |           | [202]                                      |
   | 17A2     | 17A3      | Other issue: Preferred for deprecated      |
   |          |           | U+17A3 [120] [150]                         |
   | 17A3     | 17A2      | Not Recommended: Deprecated in Unicode,    |
   |          |           | preferred is U+17A2 [120] [115] [150]      |
   | 17A4     |           | Not Recommended: Deprecated in Unicode     |
   |          |           | [120] [115]                                |
   | 17A7     | 17A8      | Other issue: This sequence preferred over  |
   | 17CA     |           | U+17A8, which is obsolete [120]            |
   | 17A8     | 17A7 17CA | Not Recommended: Obsolete, sequence        |
   |          |           | U+17A7 U+17CA preferred [120] [115]        |
   | 17D2     | 17D2 178F | Identical: When preceded by U+17D2, U+178A |
   | 178A     |           | and U+178F are indistinguishable [204]     |
   | 17D2     | 17D2 178A | Identical: When preceded by U+17D2, U+178A |
   | 178F     |           | and U+178F are indistinguishable [204]     |
   | 1835     | 1855      | Identical: U+1835 is identical to U+1855   |
   |          |           | [115] [150]                                |
   | 1855     | 1835      | Identical: U+1855 is identical to U+1835   |
   |          |           | [115] [150]                                |
   | 199E     | 19D0      | Identical: Letter U+199E is identical to   |
   |          |           | digit U+19D0 [115] [150]                   |
   | 19D0     | 199E      | Identical: Digit U+19D0 is identical to    |
   |          |           | letter U+199E [115] [150]                  |
   | 19B1     | 19D1      | Identical: Letter U+19B1 is identical to   |
   |          |           | digit U+19D1 [150]                         |
   | 19D1     | 19B1      | Identical: Digit U+19D1 is identical to    |
   |          |           | letter U+19B1 [115] [150]                  |
   | 1B0D     | 1B52      | Identical: Letter U+1B0D is identical to   |
   |          |           | digit U+1B52 [115] [150]                   |
   | 1B11     | 1B53      | Identical: Letter U+1B11 is identical to   |
   |          |           | digit U+1B53 [115] [150]                   |
   | 1B28     | 1B58      | Identical: Letter U+1B28 is identical to   |
   |          |           | digit U+1B58 [115] [150]                   |
   | 1B52     | 1B0D      | Identical: Digit U+1B52 is identical to    |
   |          |           | letter U+1B0D [115] [150]                  |
   | 1B53     | 1B11      | Identical: Digit U+1B53 is identical to    |
   |          |           | letter U+1B11 [115] [150]                  |
   | 1B58     | 1B28      | Identical: Digit U+1B58 is identical to    |
   |          |           | letter U+1B28 [115] [150]                  |
   | 1C82     |           | Not Recommended: Cyrillic NARROW O is a    |
   |          |           | code point for specialist use, and common  |
   |          |           | users do not expect to encounter it. It    |
   |          |           | resembles digit ZERO and can be used to    |
   |          |           | create an apparent contrast to the letter  |
   |          |           | O in a label [115]                         |
   | 214E     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | 2184     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | 2E2F     |           | Not Recommended: Does not have the         |
   |          |           | XID_CONTINUE property; not considered      |
   |          |           | suitable for identifiers by Unicode [120]  |
   | 3006     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | 302A     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | 302B     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | 302C     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | 302D     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | 303C     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | 3078     | 30D8      | Identical: Indistinguishable from U+30D8   |
   | 3079     | 30D9      | Identical: Indistinguishable from U+30D9   |
   | 307A     | 30DA      | Identical: Indistinguishable from U+30DA   |
   | 30AB     | 529B      | Identical: Not always distinct from U+529B |
   | 30AA     | 624D      | Identical: Not always distinct from U+624D |
   | 30ED     | 53E3      | Identical: Not always distinct from U+53E3 |
   | 30CF     | 516B      | Identical: Not always distinct from U+516B |
   | 30C8     | 535C      | Identical: Not always distinct from U+535C |
   | 30CB     | 4E8C      | Identical: Not always distinct from U+4E8C |
   | 30A8     | 5DE5      | Identical: Not always distinct from U+5DE5 |
   | 30D8     | 3078      | Identical: Indistinguishable from U+3078   |
   | 30D9     | 3079      | Identical: Indistinguishable from U+3079   |
   | 30DA     | 307A      | Identical: Indistinguishable from U+307A   |
   | 529B     | 30AB      | Identical: Not always distinct from U+30AB |
   | 624D     | 30AA      | Identical: Not always distinct from U+30AA |
   | 53E3     | 30ED      | Identical: Not always distinct from U+30ED |
   | 516B     | 30CF      | Identical: Not always distinct from U+30CF |
   | 535C     | 30C8      | Identical: Not always distinct from U+30C8 |
   | 4E8C     | 30CB      | Identical: Not always distinct from U+30CB |
   | 5DE5     | 30A8      | Identical: Not always distinct from U+30A8 |
   | 30FC     | 4E00      | Identical: Indistinguishable from U+4E00   |
   | 4CA4     |           | Not Recommended: Incorrectly unified       |
   |          |           | ideograph; Encoding is unstable [120]      |
   | 4E00     | 30FC      | Identical: Indistinguishable from U+30FC   |
   | 30FD     | 4E36      | Identical: A single stroke shape;          |
   |          |           | Indistinguishable from U+4E36              |
   | 4E36     | 30FD      | Identical: A single stroke shape;          |
   |          |           | Indistinguishable from U+30FD              |
   | A717     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | A718     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | A719     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | A71A     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | A71B     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | A71C     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | A71D     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | A71E     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | A71F     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | A78C     |           | Not Recommended: Indistinguishable from a  |
   |          |           | punctuation character that is not PVALID   |
   |          |           | [120]                                      |
   | A9CF     |           | Not Recommended: Formally has the letter   |
   |          |           | property, but functions more like a symbol |
   |          |           | or punctuation [120]                       |
   | FE20     |           | Not Recommended: Specialized combining     |
   |          |           | mark, problematic for identifiers [120]    |
   | FE21     |           | Not Recommended: Specialized combining     |
   |          |           | mark, problematic for identifiers [120]    |
   | FE22     |           | Not Recommended: Specialized combining     |
   |          |           | mark, problematic for identifiers [120]    |
   | FE23     |           | Not Recommended: Specialized combining     |
   |          |           | mark, problematic for identifiers [120]    |
   | FE24     |           | Not Recommended: Specialized combining     |
   |          |           | mark, problematic for identifiers [120]    |
   | FE25     |           | Not Recommended: Specialized combining     |
   |          |           | mark, problematic for identifiers [120]    |
   | FE26     |           | Not Recommended: Specialized combining     |
   |          |           | mark, problematic for identifiers [120]    |
   | 101FD    |           | Not Recommended: Specialized combining     |
   |          |           | mark, problematic for identifiers [120]    |
   | 10486    | 104A0     | Identical: Identical in appearance to      |
   |          |           | U+104A0 OSMANYA DIGIT ZERO [115] [150]     |
   | 104A0    | 10486     | Identical: Identical in appearance to      |
   |          |           | U+10486 OSMANYA LETTER DEEL [115] [150]    |
   +----------+-----------+--------------------------------------------+

        Table 1: Registry of Unicode Code Points Requiring Special
                   Consideration in Network Identifiers
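
   As a non-normative illustration, rows of Table 1 could be held in a
   simple lookup structure and used to flag code points or code point
   sequences in a candidate label.  The Python sketch below copies a
   handful of rows from the table.  The names REGISTRY_SAMPLE and
   find_registry_hits, and the data layout, are assumptions made for
   this example; they are not part of the registry.

```python
# A few rows copied from Table 1. The dictionary layout is an assumption
# of this sketch. Keys are tuples because some rows cover sequences.
REGISTRY_SAMPLE = {
    (0x17A3,): ("Not Recommended", (0x17A2,)),
    (0x17A4,): ("Not Recommended", None),
    (0x17A8,): ("Not Recommended", (0x17A7, 0x17CA)),
    (0x1835,): ("Identical", (0x1855,)),
    (0x1855,): ("Identical", (0x1835,)),
    (0x17D2, 0x178A): ("Identical", (0x17D2, 0x178F)),
    (0x17D2, 0x178F): ("Identical", (0x17D2, 0x178A)),
}


def find_registry_hits(label, registry=REGISTRY_SAMPLE):
    """Return (offset, sequence, category, related) for each match."""
    cps = tuple(ord(ch) for ch in label)
    longest = max(len(seq) for seq in registry)
    hits = []
    for start in range(len(cps)):
        for length in range(1, longest + 1):
            seq = cps[start:start + length]
            # Skip truncated slices at the end of the label.
            if len(seq) == length and seq in registry:
                category, related = registry[seq]
                hits.append((start, seq, category, related))
    return hits
```

   A policy engine or user agent might treat any hit as a reason to
   reject a label, to apply blocking, or to warn the user; which action
   is appropriate is a matter of local policy.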

4.2.2.  References for Registry

   [99]  The Unicode Consortium, "The Unicode Standard", (latest
      version) http://www.unicode.org/versions/latest (Multiple, or
      latest version)

   [100]  Integration Panel, "Maximal Starting Repertoire (MSR-2)",
      April 2015, https://www.icann.org/en/system/files/files/msr-2-
      overview-14apr15-en.pdf (Code points included in MSR-2 as
      potentially appropriate for the root zone)

   [110]  The Unicode Consortium, "Derived Numeric Type", (latest
      version) http://www.unicode.org/Public/UCD/latest/ucd/extracted/
      DerivedNumericType.txt (Code points from modern use scripts,
      excluded from MSR-2 solely because they are defined as digits in
      the Unicode Character Database)

   [115]  Integration Panel, "Maximal Starting Repertoire (MSR-2)",
      April 2015, https://www.icann.org/en/system/files/files/msr-2-
      overview-14apr15-en.pdf (Code points excluded from MSR-2 as
      inappropriate for the root zone)

   [120]  Integration Panel, "Maximal Starting Repertoire (MSR-2)",
      April 2015, https://www.icann.org/en/system/files/files/msr-2-
      overview-14apr15-en.pdf (Code points considered problematic by
      MSR-2)

   [150]  The Unicode Consortium, "Intentional.txt", Version 10.0.0,
      http://www.unicode.org/Public/security/10.0.0/intentional.txt
      (Code points considered identical by intention)

   [201]  TF-AIDN, "Proposal for Arabic Script Root Zone LGR", 18
      November 2015, https://www.icann.org/en/system/files/files/arabic-
      lgr-proposal-18nov15-en.pdf

   [202]  Ethiopic Generation Panel, "Proposal for Ethiopic Script Root
      Zone LGR", May 17, 2017,
      https://www.icann.org/en/system/files/files/proposal-ethiopic-lgr-
      17may17-en.pdf

   [204]  Khmer Generation Panel, "Proposal for Khmer Script Root Zone
      Label Generation Rules (LGR)", August 15, 2016,
      https://www.icann.org/en/system/files/files/proposal-khmer-lgr-
      15aug16-en.pdf

   [206]  Thai Generation Panel, "Proposal for the Thai Script Root Zone
      LGR", May 25, 2017, https://www.icann.org/en/system/files/files/
      proposal-thai-lgr-25may17-en.pdf

   [300]  Internationalized Domain Names Variant Issues Project: Arabic
      Case Study Team Issues Report, ICANN, October 7, 2011,
      https://archive.icann.org/en/topics/new-gtlds/arabic-vip-issues-
      report-07oct11-en.pdf (In-script variants)

   [5564]  RFC 5564 (Code points to be excluded from repertoires for the
      Arabic language)

   [6912]  RFC 6912 (Code points considered problematic)

   [IAB]  IAB, "IAB Statement on Identifiers and Unicode 7.0.0",
      February 2015, https://www.iab.org/documents/correspondence-
      reports-documents/2015-2/iab-statement-on-identifiers-and-unicode-
      7-0-0/

5.  IANA Considerations

   The IANA Services Operator is hereby requested to create the Registry
   of Unicode Code Points for Special Consideration in Network
   Identifiers, and to populate it with the values in Section 4.2.  The
   registry is to be updated by Expert Review.

   This registry has no formal protocol status with respect to IDNA or
   PRECIS.  It is a registry intended to be used by those creating
   registration or lookup policies, in order to inform the development
   of such policies.

6.  Security Considerations

   The registry established by this document is intended to help
   operators of identifier systems in deciding what to permit in
   identifiers.  It may also be useful for user agents that attempt to
   provide warnings to users about suspicious or inadvisable
   identifiers.  Operators that fail to make policies addressing the
   contents of the registry may permit the creation of identifiers that
   are misleading or that may be used in attacks on the network or
   users.

   The registry is not a magic solution to all identifier ambiguity, and
   even refusing to permit registration of, or lookup of, every code
   point in the registry cannot ensure that misleading or confusing
   identifiers will never be created.

7.  References

7.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <http://www.rfc-editor.org/info/rfc2119>.

   [RFC5890]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Definitions and Document Framework",
              RFC 5890, DOI 10.17487/RFC5890, August 2010,
              <http://www.rfc-editor.org/info/rfc5890>.

   [RFC5891]  Klensin, J., "Internationalized Domain Names in
              Applications (IDNA): Protocol", RFC 5891,
              DOI 10.17487/RFC5891, August 2010,
              <http://www.rfc-editor.org/info/rfc5891>.

   [RFC5892]  Faltstrom, P., Ed., "The Unicode Code Points and
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5892, DOI 10.17487/RFC5892, August 2010,
              <http://www.rfc-editor.org/info/rfc5892>.

   [RFC5893]  Alvestrand, H., Ed. and C. Karp, "Right-to-Left Scripts
              for Internationalized Domain Names for Applications
              (IDNA)", RFC 5893, DOI 10.17487/RFC5893, August 2010,
              <http://www.rfc-editor.org/info/rfc5893>.

   [RFC5894]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Background, Explanation, and
              Rationale", RFC 5894, DOI 10.17487/RFC5894, August 2010,
              <http://www.rfc-editor.org/info/rfc5894>.

   [RFC7564]  Saint-Andre, P. and M. Blanchet, "PRECIS Framework:
              Preparation, Enforcement, and Comparison of
              Internationalized Strings in Application Protocols",
              RFC 7564, DOI 10.17487/RFC7564, May 2015,
              <http://www.rfc-editor.org/info/rfc7564>.

   [UAX44]    The Unicode Consortium, "Unicode Standard Annex #44,
              Unicode Character Database",
              <http://www.unicode.org/reports/tr44/>.

              This references the most recently published version of
              the description of the Unicode Character Database.

   [UCD]      The Unicode Consortium, "Unicode Character Database",
              <http://www.unicode.org/Public/UCD/latest/ucd/>.

              This references the most recently published version of
              the data files for the Unicode Character Database.

   [Unicode]  The Unicode Consortium, "The Unicode Standard, Latest
              Version", <http://www.unicode.org/versions/latest/>.

              This references the most recently published version.

7.2.  Informative References

   [I-D.klensin-idna-5892upd-unicode70]
              Klensin, J. and P. Faeltstroem, "IDNA Update for Unicode
              7.0.0", draft-klensin-idna-5892upd-unicode70-04 (work in
              progress), March 2015.

   [I-D.rfc5891bis]
              Klensin, J., "Internationalized Domain Names in
              Applications (IDNA): Registry Restrictions and
              Recommendations", March 2017,
              <https://datatracker.ietf.org/doc/draft-klensin-idna-
              rfc5891bis/>.

   [RFC6365]  Hoffman, P. and J. Klensin, "Terminology Used in
              Internationalization in the IETF", BCP 166, RFC 6365,
              DOI 10.17487/RFC6365, September 2011,
              <http://www.rfc-editor.org/info/rfc6365>.

Appendix A.  Additional Background

A.1.  The Theory of Inclusion

   The mechanism that the IETF has come to prefer for
   internationalization of identifiers may be called "inclusion-based
   identifier internationalization", or "inclusion" for short.  Under
   inclusion, the characters that are permissible in identifiers for a
   protocol are selected from the set of all Unicode characters.  One
   starts with an empty set of characters, and then gradually adds
   characters to the set, usually based on Unicode properties (see
   below, and also Section 3).

   Inclusion depends in part on assumptions the IETF made when the
   strategy was adopted and developed; some of those assumptions were
   about the relationships between different characters and the
   likelihood that similar such relationships would get added to future
   versions of Unicode.  Those assumptions turn out not to have been
   true in every case.  Code points at issue are among those to be
   listed in the registry defined here.  (See Section 4.2.)

   The intent of Unicode is to encode all known writing systems into a
   single coded character set.  One consequence of that goal is that
   Unicode encodes an enormous number of characters.  Another is that
   the work of Unicode does not end until every writing system is
   encoded; even after that, it needs to continue to track any changes
   in those writing systems.

   Unicode encodes abstract characters, not glyphs.  Because of the way
   Unicode was built up over time, there are sometimes multiple ways to
   encode the same abstract character.  For example, an e with an acute
   accent may be written by combining U+0065 LATIN SMALL LETTER E and
   U+0301 COMBINING ACUTE ACCENT, or it may be written U+00E9 LATIN
   SMALL LETTER E WITH ACUTE.  If Unicode encodes an abstract character
   in more than one way, then for most purposes the different encodings
   should all be treated as though they're the same character.  This
   "canonical equivalence" between encodings of the same abstract
   characters is explicitly called out by Unicode.  A lack of a defined
   canonical equivalence is tantamount to an assertion by Unicode that
   the two encodings do not represent the same abstract character, even
   if both happen to result in the same appearance.
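
   The e-acute example can be checked mechanically.  The following
   Python sketch uses the standard library's unicodedata module and
   treats two strings as canonically equivalent when their NFD forms
   match; the helper name canonically_equivalent is introduced here for
   illustration only.

```python
import unicodedata

decomposed = "\u0065\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE


def canonically_equivalent(a, b):
    """True when the two strings have identical NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The code point sequences differ, but Unicode defines them as equivalent.
differ_as_code_points = decomposed != precomposed
equivalent = canonically_equivalent(decomposed, precomposed)
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)
```

   The results depend on the version of the Unicode Character Database
   bundled with the Python interpreter in use.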

   Every encoded character in Unicode (more precisely, every code point)
   is associated with a set of properties.  The properties define what
   script a code point is in, whether it is a letter or a number or
   punctuation and so forth, its direction when written, to what other
   code point or code point sequence it is canonically equivalent, and
   many other properties.  These properties are important to the
   inclusion mechanism.  They are defined in the Unicode Character
   Database [UCD] [UAX44].
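
   A few of these properties can be inspected from the Python standard
   library, as sketched below.  The function name describe is an
   assumption of this example.  The unicodedata module does not expose
   the Script property, so a real implementation of inclusion would
   need the UCD data files or another library for that.

```python
import unicodedata


def describe(ch):
    """Collect some UCD properties that the standard library exposes."""
    return {
        "code_point": "U+%04X" % ord(ch),
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        # Empty string when the code point has no decomposition mapping.
        "decomposition": unicodedata.decomposition(ch),
    }


e_acute = describe("\u00e9")   # LATIN SMALL LETTER E WITH ACUTE
beh = describe("\u0628")       # ARABIC LETTER BEH
```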

   Inclusion depends on the assumption that such strings as will be used
   in identifiers will not have any ambiguous matching to other strings.
   In practice, this means that input strings to the protocol are
   expected to be in Normalization Form C.  This way, any alternative
   sequences of code points for the same characters will be normalized
   to a single form.  If all the characters in the string are also
   included for the protocol's candidate identifiers, then the string is
   eligible to be an identifier under the protocol.
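
   The eligibility test described above can be sketched as follows.
   This is a simplified illustration, not the IDNA or PRECIS algorithm:
   the function is_eligible and the tiny repertoire SAMPLE_INCLUDED are
   hypothetical, and real protocols derive the included set from
   Unicode properties and add further rules.

```python
import unicodedata

# Hypothetical repertoire for this sketch: a-z plus U+00E9.
SAMPLE_INCLUDED = set(range(0x61, 0x7B)) | {0x00E9}


def is_eligible(candidate, included):
    """Require NFC input and membership of every code point."""
    if unicodedata.normalize("NFC", candidate) != candidate:
        return False   # not in Normalization Form C
    return all(ord(ch) in included for ch in candidate)
```

   For example, "caf" followed by U+00E9 passes, while "cafe" followed
   by U+0301 is rejected because it is not already in NFC.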

A.2.  The Difference Between Theory and Practice

   In principle, under inclusion, identifiers should be unambiguous.  It
   has always been recognized, however, that for humans some ambiguity
   is inevitable, because of the vagaries of writing systems and of
   human perception.

   Normalization Form C ("NFC") removes the ambiguities based on dual or
   multiple encoding for the same abstract character.  However,
   characters are not the same as their glyphs.  This means that it is
   possible for certain abstract characters to share a glyph.  We can
   call such abstract characters "homoglyphs".  While this looks at
   first like something that should be handled (or should have been
   handled) by normalization (NFC or something else), there are
   important differences; the situation is in some sense an extreme case
   of a spectrum of ambiguity discussed in the following section.

A.2.1.  Confusability

   While Unicode deals in abstract characters and inclusion works on
   Unicode code points, users interact with strings as actually
   rendered: sequences of glyphs.  There are characters that, depending
   on font, sometimes look quite similar to one another (such as "l" and
   "1"); any character that is like this is often called "visually
   similar".  More difficult are characters that, in any normal
   rendering, always look the same as one another.  The shared history
   of Cyrillic, Greek, and Latin scripts, for example, means that there
   are characters in each script that function similarly and that are
   usually indistinguishable from one another, though they are not the
   same abstract character.  These are examples of "homoglyphs."  Any
   character that can be confused for another one can be called
   confusable, and confusability can be thought of as a spectrum with
   "visually similar" at one end, and "homoglyphs" at the other.  (We
   use the term "homoglyph" strictly: code points that normally use the
   same glyph when rendered.)
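
   Normalization does not help with homoglyphs, because no equivalence
   is defined between them.  The sketch below checks one well-known
   Latin/Cyrillic pair against all four normalization forms; the helper
   name unified_by_normalization is an assumption of this example.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def unified_by_normalization(a, b):
    """True if any normalization form maps a and b to the same string."""
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in FORMS
    )


latin_a = "\u0061"      # LATIN SMALL LETTER A
cyrillic_a = "\u0430"   # CYRILLIC SMALL LETTER A
```

   The two letters usually share a glyph, yet they remain distinct
   under every normalization form, whereas the two encodings of e with
   acute accent are unified.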

   Most of the time, there is some characteristic that can help to
   mitigate confusion.  Mitigation may be as simple as using a font
   designed to distinguish among different characters.  For homoglyphs,
   a large number of cases (but not all of them) turn out to be in
   different scripts.  As a result, it is usually a good idea to adopt
   the operational convention that identifiers for a protocol should
   always be in a single script.  This strategy has limits.  First,
   identifiers are not always under the operational control of a single
   authority (such as in the case of DNS, where the system is under
   distributed control so that different parts of the hierarchy can have
   different operational rules).  Moreover, sometimes the repertoire
   used in operation allows multiple scripts that create whole-string
   confusables -- strings made up entirely of homoglyphs of another
   string in a different script (such as can be found between Cyrillic
   and Latin, for example).  In such cases, mitigation must turn to
   other means of preventing the registration of mutually confusable
   strings, for example by ensuring that the registration of one of them
   (whichever comes first) blocks the later registration of the other.
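
   One way to picture such blocking is to index registrations by a key
   in which known homoglyphs are folded together.  The Python sketch
   below is purely illustrative: the table HOMOGLYPH_TO_LATIN holds
   only five Cyrillic/Latin pairs chosen for the example, and the names
   blocking_key and Registry are hypothetical.  A real deployment would
   use reviewed data and would also have to handle cases this sketch
   ignores.

```python
# Hypothetical, deliberately tiny homoglyph table (Cyrillic -> Latin).
HOMOGLYPH_TO_LATIN = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}


def blocking_key(label):
    """Fold listed homoglyphs together so confusable labels collide."""
    return "".join(HOMOGLYPH_TO_LATIN.get(ch, ch) for ch in label)


class Registry:
    """First registration wins; later confusable labels are blocked."""

    def __init__(self):
        self._by_key = {}

    def register(self, label):
        key = blocking_key(label)
        if key in self._by_key:
            return False   # blocked by an earlier registration
        self._by_key[key] = label
        return True
```

   With this sketch, registering the Latin label "ace" causes a later
   attempt to register the all-Cyrillic look-alike to be refused, and
   vice versa if the Cyrillic label is registered first.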

   Also, operators should only ever use the smallest repertoire of code
   points possible for their environment.  So, for example, if there is
   a code point that is sometimes used but is perhaps a little obscure,
   it is better to leave it out and gain some experience with other
   cases first.  In particular, code points used only in a language with
   which the administrator is not familiar should probably be excluded.
   The same applies to code points used in specialized contexts, such as
   those only found in historic or sacred documents, or only used for
   phonetic transcription or poetry.  In the case of IDNA, some client
   programs restrict display of U-labels to top-level domains known to
   have policies about single-script labels.

   None of these policies or conventions, other than ensuring mutual
   exclusion, will do anything to help with code points in the same
   script that are strict homoglyphs of each other (see Appendix B for
   some example cases).

   Finally, there are some writing systems where characters do not
   normally occur in arbitrary locations in the context of each
   syllable.  Neither users nor rendering systems for such scripts are
   adept at handling arbitrary sequences of such characters.  While some
   latitude beyond strict spelling rules may be accommodated, policies
   that enforce a minimal set of structural rules are required to ensure
   that users can recognize identifiers and that systems can render them
   predictably.

A.2.2.  Not everything can be solved

   As noted in Section 1, it is not possible to solve all the problems
   with identifier systems, particularly when human factors are taken
   into account.

Appendix B.  Examples

   There are a number of cases that illustrate the combining sequence or
   digraph issue:

   U+08A1 vs \u'0628'\u'0654'  This case is ARABIC LETTER BEH WITH HAMZA
      ABOVE, which was detected during expert review and caused the IETF
      to notice the issue.  The issue existed before this, but had gone
      unnoticed.  For detailed discussion of this case and some of the
      following ones, see [I-D.klensin-idna-5892upd-unicode70].

   U+0681 vs \u'062D'\u'0654'  This case is ARABIC LETTER HAH WITH HAMZA
      ABOVE, which (like U+08A1) does not have a canonical equivalent.
      In both cases, the places where hamza above is used are
      specialized enough that the combining marks can be excluded in
      some cases (for example, the root zone under IDNA).

   U+0623 vs \u'0627'\u'0654'  This case is ARABIC LETTER ALEF WITH
      HAMZA ABOVE.  Unlike the previous two cases, it does have a
      canonical equivalence with the combining sequence.  In the past,
      the IETF misunderstood the reasons for the difference between this
      pair and the previous two cases.

   U+09E1 vs \u'098C'\u'09E2'  This case is BENGALI LETTER VOCALIC LL.
      This is an example in Bengali script of a case without a canonical
      equivalence to the combining sequence.  Per Unicode, the single
      code point should be used to represent vowel letters in text, and
      the sequence of code points should not be used.  But it is not a
      simple matter of disallowing the combining vowel mark in cases
      like this; where the combination does not exist and the use of the
      sequence is already established, Unicode is unlikely to encode the
      combination.

   U+019A vs \u'006C'\u'0335'  This case is LATIN SMALL LETTER L WITH
      BAR.  In at least some fonts, there is a detectable difference
      with the combining sequence, but only if one compares them side-
      by-side.  Unlike a separable diacritic, there are no hard and fast
      rules for placement of overlays.  A bar may cross at different
      heights for different glyph shapes or may cross different parts of
      the glyph.  For this reason, there is no canonical equivalence
      defined between the sequence and the composite.  Unicode has a
      principle of encoding barred letters of specific shape as single
      code point composites when needed for any writing system.

   U+00F8 vs \u'006F'\u'0337'  This is LATIN SMALL LETTER O WITH STROKE.
      The effect is similar to the previous case.  Unicode has a
      principle of encoding stroked letters as composites when needed
      for any writing system.

   U+02A6 vs \u'0074'\u'0073'  This is LATIN SMALL LETTER TS DIGRAPH,
      which is not canonically equivalent to the letters t and s.  The
      intent appears to be that the digraph shows the two shapes as
      kerned, but the difference may be slight out of context.

   U+01C9 vs \u'006C'\u'006A'  Unlike the TS digraph, the LJ digraph has
      a relevant compatibility decomposition, so it fails the relevant
      stability rules under inclusion and is therefore DISALLOWED in
      IDNA2008.  This illustrates the way that consistencies that might
      be natural to some users of a script are not necessarily found in
      it, possibly because of uses by another writing system.

   U+06C8 vs \u'0648'\u'0670'  ARABIC LETTER YU is an example where the
      normally-rendered character looks just like a combining sequence,
      but the two are named differently.  In other words, this is an
      example where the simple fact of the Unicode name would have
      concealed the apparent relationship from the casual observer.

   U+0069 vs \u'0069'\u'0307'  LATIN SMALL LETTER I followed by
      COMBINING DOT ABOVE, by definition, renders exactly the same as
      LATIN SMALL LETTER I by itself and does so in practice for any
      good font.  The same would be true if "i" were replaced with any
      of the other Soft_Dotted characters defined in Unicode.  The
      character sequence \u'0069'\u'0307' (followed by no other
      combining mark) is reportedly rather common on the Internet.
      Because base character and stand-alone code point are the same in
      this case, and the code points affected have the Soft_Dotted
      property already, this could be mitigated separately via a
      context rule affecting U+0307.
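
   The normalization behavior of the pairs listed above can be examined
   with the following Python sketch.  The names CASES and relation are
   introduced for illustration only, and the outcome depends on the
   version of the Unicode Character Database shipped with the
   interpreter in use.

```python
import unicodedata

# (single code point, sequence) pairs taken from the list above.
CASES = [
    ("\u08a1", "\u0628\u0654"),  # ARABIC LETTER BEH WITH HAMZA ABOVE
    ("\u0681", "\u062d\u0654"),  # ARABIC LETTER HAH WITH HAMZA ABOVE
    ("\u0623", "\u0627\u0654"),  # ARABIC LETTER ALEF WITH HAMZA ABOVE
    ("\u09e1", "\u098c\u09e2"),  # BENGALI LETTER VOCALIC LL
    ("\u019a", "\u006c\u0335"),  # LATIN SMALL LETTER L WITH BAR
    ("\u00f8", "\u006f\u0337"),  # LATIN SMALL LETTER O WITH STROKE
    ("\u02a6", "\u0074\u0073"),  # LATIN SMALL LETTER TS DIGRAPH
    ("\u01c9", "\u006c\u006a"),  # LATIN SMALL LETTER LJ
    ("\u06c8", "\u0648\u0670"),  # ARABIC LETTER YU
    ("\u0069", "\u0069\u0307"),  # LATIN SMALL LETTER I
]


def relation(single, sequence):
    """Classify how normalization relates a code point to a sequence."""
    nfc = unicodedata.normalize
    if nfc("NFC", single) == nfc("NFC", sequence):
        return "canonical"
    if nfc("NFKC", single) == nfc("NFKC", sequence):
        return "compatibility"
    return "none"


results = [relation(single, seq) for single, seq in CASES]
```

   Only the ALEF WITH HAMZA ABOVE pair is reported as canonically
   equivalent and only the LJ digraph as compatibility equivalent; for
   the remaining pairs normalization offers no help, even though the
   rendered forms may be indistinguishable.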

   Other cases that demonstrate that the issue does not lie exclusively
   or primarily with combining sequences:

   U+0B95 vs U+0BE7  The TAMIL LETTER KA and TAMIL DIGIT ONE are always
      indistinguishable, but needed to be encoded separately because one
      is a letter and the other is a digit.
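
      The property difference that forced the separate encoding is
      visible in the General_Category values, as this small sketch
      shows (variable names are illustrative):

```python
import unicodedata

TAMIL_KA = "\u0B95"
TAMIL_ONE = "\u0BE7"

ka_category = unicodedata.category(TAMIL_KA)    # "Lo": other letter
one_category = unicodedata.category(TAMIL_ONE)  # "Nd": decimal digit
one_value = unicodedata.decimal(TAMIL_ONE)      # 1

print(ka_category, one_category, one_value)
```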

   Arabic-Indic Digits vs. Extended Arabic-Indic Digits  Seven digits of
      these two sequences have entirely identical shapes.  This case is
      an example of something dealt with in inclusion that nevertheless
      can lead to confusions that are not fully mitigated.  IDNA, for
      example, contains context rules restricting the digits to one set
      or another; but such rules apply only to a single label, not to an
      entire name.  Moreover, IDNA provides no way of distinguishing
      between two labels that both conform to the context rule, but
      where each contains a different member of one of the seven
      identical shape pairs.
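
      The limitation can be made concrete with a sketch of a per-label
      check.  The functions below are illustrative and are not the
      text of the IDNA context rules; the example name and labels are
      made up, and digit one is assumed to be among the identically
      shaped pairs.

```python
import unicodedata

ARABIC_INDIC = frozenset(chr(c) for c in range(0x0660, 0x066A))
EXTENDED_ARABIC_INDIC = frozenset(chr(c) for c in range(0x06F0, 0x06FA))


def mixes_digit_sets(text: str) -> bool:
    """True if text contains digits from both sets."""
    chars = set(text)
    return bool(chars & ARABIC_INDIC) and bool(chars & EXTENDED_ARABIC_INDIC)


def passes_per_label_rule(name: str) -> bool:
    """Apply the no-mixing check to each dot-separated label."""
    return not any(mixes_digit_sets(label) for label in name.split("."))


# Each label uses a single digit set, so the per-label rule passes,
# although the name as a whole mixes the two sets.
name = "\u0661\u0662.\u06F1\u06F2"
per_label_ok = passes_per_label_rule(name)  # True
whole_name_mixed = mixes_digit_sets(name)   # True

# Two distinct one-digit labels that both pass the rule and carry
# the same numeric value.
label_a, label_b = "\u0661", "\u06F1"
same_value = unicodedata.decimal(label_a) == unicodedata.decimal(label_b)

print(per_label_ok, whole_name_mixed, label_a != label_b, same_value)
```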

   U+53E3 vs U+56D7  These are two Han characters (roughly rectangular)
      that are different when laid side by side; but they may be
      difficult to distinguish out of context or in very small print.

   U+01DD vs U+0259  The two code points share the same (lower case)
      forms, but are encoded differently due to different uppercase
      forms.  The fact that they uppercase differently is taken as
      evidence that they are not the same abstract character, despite
      the superficial evidence of their shared shape.  The more common
      cases, where the uppercase forms are identical, may be of less
      concern, given that IDNA2008 is limited to lower case.
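
      The differing case mappings can be observed directly, for
      example with the following sketch (variable names are
      illustrative):

```python
import unicodedata

TURNED_E = "\u01DD"  # LATIN SMALL LETTER TURNED E
SCHWA = "\u0259"     # LATIN SMALL LETTER SCHWA

turned_e_upper = TURNED_E.upper()
schwa_upper = SCHWA.upper()

# Same lowercase shape, but the uppercase partners are different
# code points with different names.
print(unicodedata.name(turned_e_upper))
print(unicodedata.name(schwa_upper))
print(turned_e_upper != schwa_upper)  # True
```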

   Cross-script homoglyphs usually do not involve combining sequences,
   but can be mitigated by rules requiring strings to be in a single
   script.

      LATIN SMALL LETTER OPEN E is one of a handful of examples of
      characters borrowed from another script, in this case GREEK SMALL
      LETTER EPSILON.

      LATIN SMALL LETTER E and CYRILLIC SMALL LETTER IE are historically
      related: both derive from uppercase forms of the GREEK CAPITAL
      LETTER EPSILON.  There are a number of such pairs -- enough to
      make many whole strings that look the same in both scripts (but
      usually spell nonsense in one of them).  An example would be
      "pax".
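
   A single-script rule of the kind mentioned above can be sketched as
   follows.  This is a rough illustration under a loud assumption:
   Python's unicodedata module does not expose the Script property, so
   the first word of the character name is used as a crude stand-in.
   A real implementation would use the Script data from the Unicode
   Character Database.  The function names are hypothetical.

```python
import unicodedata


def rough_script(ch: str) -> str:
    """First word of the character name; a crude proxy for Script.

    Raises ValueError for code points that have no name.
    """
    return unicodedata.name(ch).split()[0]


def is_single_script(label: str) -> bool:
    """True if every character maps to the same rough script."""
    return len({rough_script(ch) for ch in label}) == 1


LATIN_PAX = "pax"
CYRILLIC_PAX = "\u0440\u0430\u0445"  # Cyrillic er, a, ha
MIXED_PAX = "p\u0430x"               # Latin p, Cyrillic a, Latin x

print(is_single_script(LATIN_PAX))     # True
print(is_single_script(CYRILLIC_PAX))  # True
print(is_single_script(MIXED_PAX))     # False
print(LATIN_PAX == CYRILLIC_PAX)       # False
```

   The rule rejects the mixed string, but both whole-script strings
   pass it, which is the residual whole-script confusability that the
   "pax" example describes.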

Appendix C.  Discussion Venue

   Note to RFC Editor: this section should be removed prior to
   publication as an RFC.

   This Internet-Draft may be discussed on the IAB Internationalization
   public list: i18n-discuss@iab.org.

Appendix D.  Change History

   Note to RFC Editor: this section should be removed prior to
   publication as an RFC.

   00:

      *  Initial version

   01:

      *  Add background and examples from the LUCID Problem Statement

      *  Add a paragraph about motivation to explain the difference
         between this registry and administrative policy more generally

      *  Expand and clarify a number of earlier points of discussion

      *  Attempt to make clear that this registry does not update any
         protocols

      *  Move some formerly-appendix material to the body

      *  Expand the initial registry.

Authors' Addresses

   Asmus Freytag
   ASMUS, Inc.

   Email: asmus@unicode.org


   John C Klensin
   1770 Massachusetts Ave, Ste 322
   Cambridge, MA  02140
   U.S.A.

   Email: john-ietf@jck.com


   Andrew Sullivan
   Oracle Corp.
   100 Milverton Drive
   Mississauga, ON  L5R 4H1
   Canada

   Email: andrew.s.sullivan@oracle.com