REGISTRATION INFORMATION

   Tag             256 (stringref-namespace)
   Data Item       multiple
   Semantics       mark value as having string references
   Reference       http://cbor.schmorp.de/stringref
   Contact         Marc A. Lehmann <cbor@schmorp.de>

   Tag             25 (stringref)
   Data Item       unsigned integer
   Semantics       reference the nth previously seen string
   Reference       http://cbor.schmorp.de/stringref
   Contact         Marc A. Lehmann <cbor@schmorp.de>

RATIONALE

These two tags can be used to efficiently share constant strings, which are common in many data structures. They are an optimisation only, similar to LZSS compresssion, for the string major types in CBOR.

Background: many data structures or protocols contain repeated string constants (often called atoms or quarks). For example, this is a real-world (JSON-formatted) data structure that contains many repeated strings used as map keys:

  [
     {
       "name" : "Cocktail",
       "count" : 417,
       "rank" : 4
     },
     {
       "rank" : 4,
       "count" : 312,
       "name" : "Bath"
     },
     {
       "count" : 691,
       "name" : "Food",
       "rank" : 4
     }
  ]

This can be encoded nicely with CBOR, but each occurrence of "name", "count" and "rank" will be encoded separately. In highly repetitive data structures (the above is taken from a game save file) these can take up considerable space.

This scheme can be used to reduce this overhead with a simple scheme that is easily implementable.

DESCRIPTION

Stringref consists of two tags, stringref-namespace (value 256), which marks a value as containing string references, and stringref (value 25), which references a string previously encoded in the value.

The stringref-namespace tag is used to define a namespace for the string reference ids. stringref tags are only valid inside CBOR values marked with stringref-namespace.

The purpose of stringref-namespace is three-fold:

1. It signals the decoder that it needs to keep a copy of every string that is being decoded while decoding the tagged value (and conversely, that there is no need for this extra overhead without this tag).
2. It can be used to confine string references to certain subtrees, which might increase encoding efficiency by being able to use smaller references.
3. CBOR objects encoded with this scheme can be blindly embedded in existing CBOR streams, as the indices are relative to the outer namespace tag, and don't change when embedded.

Within a value tagged with stringref-namespace, every string that is encoded with a definite length and has a minimum length is assigned an implicit index, starting from zero.

The minimum length required to get an index depends on the index value that would be assigned to it, as stringref references should be shorter than the string they reference, and there is no point in wasting indices on strings that should not be referenced anyway.

The lengths are as follows:

   string index              minimum string length in octets
   0 .. 23                   3
   24 .. 255                 4
   256 .. 65535              5
   65536 .. 4294967295       7
   4294967296 ..             11

The minimum string length is simply the length of the stringref tag (2 octets), plus the minimum size the resulting index takes up, using major type 0 encoding.

Each time a string is to be encoded by the encoder, it can choose to encode it as a normal string, or attempt to encode it as a stringref.

If it attempts to encode a stringref, it needs to check whether the string has been assigned an index already. If yes, it can emit a stringref tag with the index.

In all other cases the encoder must check whether it has to assign an index value to it by checking against the minimum length required for the next index value to be assigned, and assigning the next index value if the string has minimum length or is longer.

A typical encoder might want to look up every string (using the major type and the octets) it encodes in some kind of internal map that stores previously-assigned indices. If found, it will emit a stringref. If not, it attempts to assign an index to the string and stores that in its map.

Decoders might choose not to look up previous indices for a string, for example, because it only wants to compress strings used as map keys, or because it intrinsically knows that the string is only used once. This is fine, but each time the encoder emits a string, it MUST assign an index if the minimum length is reached or exceeded, even if it doesn't store that index for later use.

Byte strings and text strings share the same index namespace, but are distinct, even if their contents are bit-identical. Indefinite length strings and the chunks they consist of are never assigned an index.

A typical encoder would emit the stringref-namespace tag once to tag the top-level value in a CBOR stream, but nothing keeps an encoder from nesting stringref-namespace tags. Nesting might be done for convenience, to improve coding efficiency, to embed values or for any other reason. Similarly, an encoder can sort map keys for greater coding efficiency, but is not required to.

A decoder follows the same process of assigning indices. A possible implementation could be outlined like this:

When a decoder supports stringref, then, upon encountering a stringref-namespace tag, it should initialise an array and start decoding the tagged value. The array stores, for each index, the major type (byte or text) and the string value. Implementations are free to use their internal format, of course - languages that do not distinguish between text and byte strings can just store their internal string type in the array.

Since stringref-namespace tags can be nested, the decoder needs to save and restore the outer array before starting and after ending the decoding of the tagged value. In a recursive implementation this could simply look like this:

   global stringref_array; // global per encoder
   local outer_stringref_array;

   outer_stringref_array = stringref_array;
   stringref_array = allocate_array ();

   value = decode_cbor_value ();

   deallocate_array (stringref_array);
   stringref_array = outer_stringref_array;

   return value;

When a stringref tag is encountered, it MUST contain an unsigned integer (major type 0). The integer MUST be less than the number of entries in the array. The decoder than looks up the string in the array, given the index, and copies it to the result (string values should not be shared unless the language normally does so - if strings are mutable, a copy should be decoded).

When a definite-length text or byte string is encountered, it is decoded normally. Afterwards, the decoder checks whether it needs to assign an index to the string, by comparing the string length against the minimum length required by the index. If the string is shorter, nothing happens, otherwise the next free index is used and the string (and the information whether it is a byte or text string) is recorded at the end of the array.

In languages with dynamic arrays, this could be accomplished by using the array length as the next index to be assigned, and pushing the string onto the end of the array when it is long enough.

IMPLEMENTATION NOTE

The semantics of stringref tags require the decoder to be aware and the encoder to be under control of the sequence in which data items are encoded into the CBOR stream. This means these tags cannot be implemented on top of every generic CBOR encoder/decoder (which might reorder entries in a map); they typically need to be integrated into their works.

DESIGN RATIONALE

The stringref tag was chosen to be short, without requiring standards action. The namespace tag is rare, so doesn't benefit from a short encoding as much.

Implicit tagging/counting was chosen to support stream encoders. Having to tag strings first requires either multiple passes over the data (which might not be available, ruling out some encoders) or tagging more strings than needed (wasting space). Explicit tagging also isn't necessarily better even under optimal conditions, as the explicit tags waste space.

Stream decoders are affected less by implicit tagging than encoders.

The namespace tag was introduced for two reasons: first to allow embedding of CBOR strings into other CBOR strings, secondly for decoding efficiency - the decoder only has to expect stringref tags inside namespaces and therefore doesn't have to maintain extra state outside of them.

EXAMPLES

The array-of-maps from the rationale example would normally compress to a CBOR text of 83 bytes. Using this extension where possible, this reduces to 74 bytes:

   d9 0100                      # tag(256)
      83                        # array(3)
         a3                     # map(3)
            44                  # bytes(4)
               72616e6b         # "rank"
            04                  # unsigned(4)
            45                  # bytes(5)
               636f756e74       # "count"
            19 01a1             # unsigned(417)
            44                  # bytes(4)
               6e616d65         # "name"
            48                  # bytes(8)
               436f636b7461696c # "Cocktail"
         a3                     # map(3)
            d8 19               # tag(25)
               02               # unsigned(2)
            44                  # bytes(4)
               42617468         # "Bath"
            d8 19               # tag(25)
               01               # unsigned(1)
            19 0138             # unsigned(312)
            d8 19               # tag(25)
               00               # unsigned(0)
            04                  # unsigned(4)
         a3                     # map(3)
            d8 19               # tag(25)
               02               # unsigned(2)
            44                  # bytes(4)
               466f6f64         # "Food"
            d8 19               # tag(25)
               01               # unsigned(1)
            19 02b3             # unsigned(691)
            d8 19               # tag(25)
               00               # unsigned(0)
            04                  # unsigned(4)

The following JSON array illustrates the effect of the index on the minimum string length:

   [  "1", "222", "333",   "4", "555", "666", "777", "888", "999",
    "aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii",
    "jjj", "kkk", "lll", "mmm", "nnn", "ooo", "ppp", "qqq", "rrr",
    "333",
    "ssss",
    "qqq", "rrr", "ssss"]

The strings "1", "4" and "rrr" are too short to get an index assigned. All others that are not encoded with a stringref do (this assumes that JSON strings are encoded as CBOR byte strings):

   d9 0100           # tag(256)
      98 20          # array(32)
         41          # bytes(1)
            31       # "1"
         43          # bytes(3)
            323232   # "222"
         43          # bytes(3)
            333333   # "333"
         41          # bytes(1)
            34       # "4"
         43          # bytes(3)
            353535   # "555"
         43          # bytes(3)
            363636   # "666"
         43          # bytes(3)
            373737   # "777"
         43          # bytes(3)
            383838   # "888"
         43          # bytes(3)
            393939   # "999"
         43          # bytes(3)
            616161   # "aaa"
         43          # bytes(3)
            626262   # "bbb"
         43          # bytes(3)
            636363   # "ccc"
         43          # bytes(3)
            646464   # "ddd"
         43          # bytes(3)
            656565   # "eee"
         43          # bytes(3)
            666666   # "fff"
         43          # bytes(3)
            676767   # "ggg"
         43          # bytes(3)
            686868   # "hhh"
         43          # bytes(3)
            696969   # "iii"
         43          # bytes(3)
            6a6a6a   # "jjj"
         43          # bytes(3)
            6b6b6b   # "kkk"
         43          # bytes(3)
            6c6c6c   # "lll"
         43          # bytes(3)
            6d6d6d   # "mmm"
         43          # bytes(3)
            6e6e6e   # "nnn"
         43          # bytes(3)
            6f6f6f   # "ooo"
         43          # bytes(3)
            707070   # "ppp"
         43          # bytes(3)
            717171   # "qqq"
         43          # bytes(3)
            727272   # "rrr"
         d8 19       # tag(25)
            01       # unsigned(1)
         44          # bytes(4)
            73737373 # "ssss"
         d8 19       # tag(25)
            17       # unsigned(23)
         43          # bytes(3)
            727272   # "rrr"
         d8 19       # tag(25)
            18 18    # unsigned(24)

This example shows three stringref-namespace tags, two of which are nested inside another:

   256(["aaa", 25(0), 256(["bbb", "aaa", 25(1)]), 256(["ccc", 25(0)]), 25(0)])

   d9 0100               # tag(256)
      85                 # array(5)
         63              # text(3)
            616161       # "aaa"
         d8 19           # tag(25)
            00           # unsigned(0)
         d9 0100         # tag(256)
            83           # array(3)
               63        # text(3)
                  626262 # "bbb"
               63        # text(3)
                  616161 # "aaa"
               d8 19     # tag(25)
                  01     # unsigned(1)
         d9 0100         # tag(256)
            82           # array(2)
               63        # text(3)
                  636363 # "ccc"
               d8 19     # tag(25)
                  00     # unsigned(0)
         d8 19           # tag(25)
            00           # unsigned(0)

The decoded data structure might look like this:

  ["aaa","aaa",["bbb","aaa","aaa"],["ccc","ccc"],"aaa"]

IMPLEMENTATIONS

This section lists known implementations of this extension (drop me a mail if you want to be listed here).

* [Perl] CBOR::XS (reference implementation)
* [C++] jsoncons