Q (number format): Difference between revisions

From Wikipedia, the free encyclopedia
m Fix dead link for www.superkits.net
 
(96 intermediate revisions by 64 users not shown)
Line 1: Line 1:
{{short description|Number format for specifying provision}}
{{refimprove|date=May 2015}}
'''Q''' is a [[fixed-point arithmetic|fixed point]] number format where the number of [[Fraction (mathematics)|fractional]] [[bit]]s (and optionally the number of [[integer]] bits) is specified. For example, a Q15 number has 15 fractional bits; a Q1.14 number has 1 integer bit and 14 fractional bits. Q format is often used in hardware that does not have a floating-point unit and in applications that require [[fixed-point arithmetic|constant resolution]].


{{use dmy dates|date=May 2019|cs1-dates=y}}{{use list-defined references|date=January 2022}}
==Characteristics==
Q format numbers are notionally fixed point numbers, that is, they are stored and operated upon as regular binary signed integers, thus allowing standard integer hardware/[[Arithmetic logic unit|ALU]] to perform [[rational number]] calculations. The number of integer bits, fractional bits and the underlying word size are to be chosen by the programmer on an application-specific basis — the programmer's choices of the foregoing will depend on the range and resolution needed for the numbers.


{{distinguish|Q code|Wikidata#Main parts}}
Some DSP architectures offer native support for common formats, such as Q1.15. In this case, the processor can support arithmetic in one step, offering saturation (for addition and subtraction) and renormalization (for multiplication) in a single instruction. Most standard CPUs do not. If the architecture does not directly support the particular fixed point format chosen, the programmer will need to handle saturation and renormalization explicitly with bounds checking and bit shifting.


The '''Q notation''' is a way to specify the parameters of a binary [[fixed-point arithmetic|fixed point]] number format. For example, in Q notation, the number format denoted by <code>Q8.8</code> means that the fixed point numbers in this format have 8 bits for the integer part and 8 bits for the fraction part.
There are two conflicting notations for fixed point. Both notations are written as Q''m''.''n'', where:
*Q designates that the number is in the Q format notation — the [[Texas Instruments]] representation for signed fixed-point numbers (the "Q" being reminiscent of the standard symbol for the set of [[rational number]]s).
*''m.'' (optional, assumed to be zero or one) is the number of bits set aside to designate the two's complement integer portion of the number, exclusive or inclusive of the sign bit (therefore if m is not specified it is taken as zero or one).
*''n'' is the number of bits used to designate the fractional portion of the number, i.e. the number of bits to the right of the binary point. (If n = 0, the Q numbers are integers — the degenerate case).


A number of [[Fixed-point_arithmetic#Notations|other notations]] have been used for the same purpose.
One convention includes the sign bit in the value of ''m'',<ref name=arm>[http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0066d/CHDFAAEI.html ARM® Developer Suite AXD and armsd Debuggers Guide Version 1.2 . Home > AXD > AXD Facilities > Data formatting > Q-format . 4.7.9. Q-format]</ref> and the other convention does not. The choice of convention can be determined by summing ''m+n''. If the value is equal to the register size, then the sign bit is included in the value of ''m''. If it is one less than the register size, the sign bit is not included in the value of ''m''.


==Definition==
In addition, the letter U can be prefixed to the Q to indicate an unsigned value, such as UQ1.15, indicating values from 0.0 to +1.99997.
===Texas Instruments version===
The Q notation, as defined by [[Texas Instruments]],<ref name="TI_2003"/> consists of the letter {{mono|Q}} followed by a pair of numbers ''m''{{mono|.}}''n'', where ''m'' is the number of bits used for the integer part of the value, and ''n'' is the number of fraction bits.


By default, the notation describes ''signed'' binary fixed point format, with the unscaled integer being stored in [[two's complement]] format, used in most binary processors. The first bit always gives the sign of the value(1 = negative, 0 = non-negative), and it is ''not'' counted in the ''m'' parameter. Thus, the total number ''w'' of bits used is 1 + ''m'' + ''n''.
Signed Q values are stored in [[two's complement]] format, just like signed integer values on most processors. In two's complement, the sign bit is extended to the register size.


For example, the specification {{mono|Q3.12}} describes a signed binary fixed-point number with a ''w'' = 16 bits in total, comprising the sign bit, three bits for the integer part, and 12 bits that are the fraction. That is, a 16-bit signed (two's complement) integer, that is implicitly multiplied by the scaling factor 2<sup>−12</sup>
For a given Q''m''.''n'' format, using an ''m''+''n''+1 bit signed integer container with ''n'' fractional bits:
* its range is <math>[ - (2^m) , 2^m -2^{-n}]</math>
* its resolution is <math>2^{-n}</math>


In particular, when ''n'' is zero, the numbers are just integers. If ''m'' is zero, all bits except the sign bit are fraction bits; then the range of the stored number is from −1.0 (inclusive) to +1.0 (exclusive).
For a given UQ''m''.''n'' format, using an ''m''+''n'' bit unsigned integer container with ''n'' fractional bits:
* its range is <math>[ 0 , 2^m -2^{-n}]</math>
* its resolution is <math>2^{-n}</math>


The ''m'' and the dot may be omitted, in which case they are inferred from the size of the variable or register where the value is stored. Thus, {{mono|Q12}} means a signed integer with any number of bits, that is implicitly multiplied by 2<sup>−12</sup>.
For example, a Q14.1 format number:
* requires 14+1+1 = 16 bits
* its range is [-2<sup>14</sup>, 2<sup>14</sup> - 2<sup>−1</sup>] = [-16384.0, +16383.5] = [0x8000, 0x8001 … 0xFFFF, 0x0000, 0x0001 … 0x7FFE, 0x7FFF]
* its resolution is 2<sup>−1</sup> = 0.5


The letter {{mono|U}} can be prefixed to the {{mono|Q}} to denote an ''unsigned'' binary fixed-point format. For example, {{mono|UQ1.15}} describes values represented as unsigned 16-bit integers with an implicit scaling factor of 2<sup>−15</sup>, which range from 0.0 to (2<sup>16</sup>&minus;1)/2<sup>15</sup> = +1.999969482421875.
Unlike [[floating point]] numbers, the resolution of Q numbers will remain constant over the entire range.


==Conversion==
===ARM version===
A variant of the Q notation has been in use by [[ARM architecture family|ARM]]. In this variant, the ''m'' number includes the sign bit. For example, a 16-bit signed integer would be denoted <code>Q15.0</code> in the TI variant, but <code>Q16.0</code> in the ARM variant.<ref name="ARM_2001"/><ref name="ARM_2006"/>


===Float to Q===
==Characteristics==
The resolution (difference between successive values) of a Q''m''.''n'' or UQ''m''.''n'' format is always 2<sup>−''n''</sup>. The range of representable values depends on the notation used:
To convert a number from [[IEEE 754|floating point]] to Q''m''.''n'' format:
{| class="wikitable"
# Multiply the floating point number by 2<sup>''n''</sup>
|+Range of representable values in Q notation
# Round to the nearest integer
!Notation

!Texas Instruments Notation
===Q to float===
!ARM Notation
To convert a number from Q''m''.''n'' format to floating point:
|-
# Convert the number to floating point as if it were an integer, in other words remove the binary point
|Signed Q''m''.''n''
# Multiply by 2<sup>−''n''</sup>
|−2<sup>''m''</sup> to +2<sup>''m''</sup> − 2<sup>−''n''</sup>
|−2<sup>''m''−1</sup> to +2<sup>''m''−1</sup> − 2<sup>−''n''</sup>
|-
|Unsigned UQ''m''.''n''
|0 to 2<sup>''m''</sup> − 2<sup>−''n''</sup>
|0 to 2<sup>''m''</sup> − 2<sup>−''n''</sup>
|}
For example, a Q15.1 format number requires 15+1 = 16 bits, has resolution 2<sup>−1</sup> = 0.5, and the representable values range from &minus;2<sup>14</sup> = &minus;16384.0 to +2<sup>14</sup> &minus; 2<sup>−1</sup> = +16383.5. In hexadecimal, the negative values range from 0x8000 to 0xFFFF followed by the non-negative ones from 0x0000 to 0x7FFF.


==Math operations==
==Math operations==
Q numbers are a ratio of two integers: the numerator is kept in storage, the denominator is equal to 2<sup>''n''</sup>.
Q numbers are a ratio of two integers: the numerator is kept in storage, the denominator <math>d</math> is equal to 2<sup>''n''</sup>.


Consider the following example:
Consider the following example:


* The Q8 denominator equals 2<sup>8</sup> = 256
* The Q8 denominator equals 2<sup>8</sup> = 256

* 1.5 equals 384/256
* 1.5 equals 384/256

* 384 is stored, 256 is inferred because it is a Q8 number.
* 384 is stored, 256 is inferred because it is a Q8 number.


If the Q number's base is to be maintained (''n'' remains constant) the Q number math operations must keep the denominator constant. The following formulas show math operations on the general Q numbers <math>N_1</math> and <math>N_2</math>.
If the Q number's base is to be maintained (''n'' remains constant) the Q number math operations must keep the denominator <math>d</math> constant. The following formulas show math operations on the general Q numbers <math>N_1</math> and <math>N_2</math>. (If we consider the example as mentioned above, <math>N_1</math> is 384 and <math>d</math> is 256.)


<math>\begin{align}
<math>\begin{align}
Line 65: Line 62:
\end{align}</math>
\end{align}</math>


Because the denominator is a power of two the multiplication can be implemented as an [[arithmetic shift]] to the left and the division as an arithmetic shift to the right; on many processors shifts are faster than multiplication and division.
Because the denominator is a power of two, the multiplication can be implemented as an [[arithmetic shift]] to the left and the division as an arithmetic shift to the right; on many processors shifts are faster than multiplication and division.


To maintain accuracy the intermediate multiplication and division results must be double precision and care must be taken in [[rounding]] the intermediate result before converting back to the desired Q number.
To maintain accuracy, the intermediate multiplication and division results must be double precision and care must be taken in [[rounding]] the intermediate result before converting back to the desired Q number.


Using [[C (programming language)|C]] the operations are (note that here, Q refers to the fractional part's number of bits) :
Using [[C (programming language)|C]] the operations are (note that here, Q refers to the fractional part's number of bits) :


===Addition===
===Addition===
<source lang="c">
<syntaxhighlight lang="c">
short q_add(short a, short b)
int16_t q_add(int16_t a, int16_t b)
{
{
short result;
return a + b;

result = a + b;

return result;
}
}
</syntaxhighlight>
</source>
With saturation
With saturation
<source lang="c">
<syntaxhighlight lang="c">
short q_add_sat(short a, short b)
int16_t q_add_sat(int16_t a, int16_t b)
{
{
short result;
int16_t result;
long tmp;
int32_t tmp;


tmp = (long)a + (long)b;
tmp = (int32_t)a + (int32_t)b;
if (tmp > 0x7FFF)
if (tmp > 0x7FFF)
tmp = 0x7FFF;
tmp = 0x7FFF;
if (tmp < -1 * 0x7FFF)
if (tmp < -1 * 0x8000)
tmp = -1 * 0x7FFF;
tmp = -1 * 0x8000;
result = (short)tmp;
result = (int16_t)tmp;


return result;
return result;
}
}
</syntaxhighlight>
</source>
Unlike floating point ±Inf, saturated results are not sticky and will unsaturate on adding a negative value to a positive saturated value (0x7FFF) and vice versa in that implementation shown. In assembly language, the Signed Overflow flag can be used to avoid the typecasts needed for that C implementation.


===Subtraction===
===Subtraction===
<source lang="c">
<syntaxhighlight lang="c">
short q_sub(short a, short b)
int16_t q_sub(int16_t a, int16_t b)
{
{
short result;
return a - b;

result = a - b;

return result;
}
}
</syntaxhighlight>
</source>


===Multiplication===
===Multiplication===
<source lang="c">
<syntaxhighlight lang="c">
// precomputed value:
// precomputed value:
#define K (1 << (Q - 1))
#define K (1 << (Q - 1))
// saturate to range of short
// saturate to range of int16_t
short sat16(long x)
int16_t sat16(int32_t x)
{
{
if (x > 0x7FFF) return 0x7FFF;
if (x > 0x7FFF) return 0x7FFF;
else if (x < 0x8000) return 0x8000;
else if (x < -0x8000) return -0x8000;
else return (short)x;
else return (int16_t)x;
}
}


short q_mul(short a, short b)
int16_t q_mul(int16_t a, int16_t b)
{
{
short result;
int16_t result;
long temp;
int32_t temp;


temp = (long)a * (long)b; // result type is operand's type
temp = (int32_t)a * (int32_t)b; // result type is operand's type
// Rounding; mid values are rounded up
// Rounding; mid values are rounded up
temp += K;
temp += K;
Line 138: Line 128:
return result;
return result;
}
}
</syntaxhighlight>
</source>


===Division===
===Division===
<source lang="c">
<syntaxhighlight lang="c">
short q_div(short a, short b)
int16_t q_div(int16_t a, int16_t b)
{
{
/* pre-multiply by the base (Upscale to Q16 so that the result will be in Q8 format) */
short result;
long temp;
int32_t temp = (int32_t)a << Q;
/* Rounding: mid values are rounded up (down for negative values). */

/* OR compare most significant bits i.e. if (((temp >> 31) & 1) == ((b >> 15) & 1)) */
// pre-multiply by the base (Upscale to Q16 so that the result will be in Q8 format)
temp = (long)a << Q;
if ((temp >= 0 && b >= 0) || (temp < 0 && b < 0)) {
temp += b / 2; /* OR shift 1 bit i.e. temp += (b >> 1); */
// Rounding: mid values are rounded up (down for negative values).
} else {
if ((temp >= 0 && b >= 0) || (temp < 0 && b < 0))
temp += b / 2;
temp -= b / 2; /* OR shift 1 bit i.e. temp -= (b >> 1); */
else
}
temp -= b / 2;
return (int16_t)(temp / b);
result = (short)(temp / b);

return result;
}
}
</syntaxhighlight>
</source>


==See also==
==See also==
{{Portal|Computer Science}}
* [[Binary scaling]]
* [[Fixed-point arithmetic]]
* [[Fixed-point arithmetic]]
* [[Floating-point arithmetic]]
* [[Floating-point arithmetic]]


==References==
==References==
{{Reflist}}
{{Reflist|refs=
<ref name="ARM_2001">{{cite web |title=ARM Developer Suite AXD and armsd Debuggers Guide |version=1.2 |publisher=[[ARM Limited]] |at=Chapter 4.7.9. AXD > AXD Facilities > Data formatting > Q-format |date=2001 |orig-date=1999 |id=ARM DUI 0066D |url=http://infocenter.arm.com/help/topic/com.arm.doc.dui0066d/CHDFAAEI.html?resultof=%22%51%2d%66%6f%72%6d%61%74%22%20%22%71%2d%66%6f%72%6d%61%74%22%20 |url-status=live |archive-url=https://archive.today/20171104110547/http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0066d/CHDFAAEI.html |archive-date=2017-11-04}}</ref>
<ref name="ARM_2006">{{cite book |title=RealView Development Suite AXD and armsd Debuggers Guide |version=3.0 |publisher=[[ARM Limited]] |chapter=Chapter 4.7.9. AXD > AXD Facilities > Data formatting > Q-format |date=2006 |orig-date=1999 |id=ARM DUI 0066G |pages=4–24 |url=http://infocenter.arm.com/help/topic/com.arm.doc.dui0066g/DUI0066.pdf |url-status=live |archive-url=https://web.archive.org/web/20171104105632/http://infocenter.arm.com/help/topic/com.arm.doc.dui0066g/DUI0066.pdf |archive-date=2017-11-04}}</ref>
<ref name="TI_2003">{{cite book |title=TMS320C64x DSP Library Programmer's Reference |chapter=Appendix A.2 |date=October 2003 |id=SPRU565 |publisher=[[Texas Instruments Incorporated]] |publication-place=Dallas, Texas, USA |url=http://focus.ti.com/lit/ug/spru565b/spru565b.pdf |access-date=2022-12-22 |url-status=live |archive-url=https://web.archive.org/web/20221222210046/https://www.ti.com/lit/ug/spru565b/spru565b.pdf |archive-date=2022-12-22}} (150 pages)</ref>
}}


==External links==
==Further reading==
* [http://www.superkits.net/whitepapers/Fixed%20Point%20Representation%20&%20Fractional%20Math.pdf Fixed Point Representation And Fractional Math] (Note: the accuracy of the article is in dispute; see discussion.)
* {{cite web |title=Fixed Point Representation & Fractional Math |author-first=Erick L. |author-last=Oberstar |date=2007-08-30 |orig-date=2004 |publisher=Oberstar Consulting |version=1.2 |url=http://www.superkits.net/whitepapers/Fixed%20Point%20Representation%20&%20Fractional%20Math.pdf |access-date=2017-11-04 |url-status=dead |archive-url=https://web.archive.org/web/20171104111827/http://www.superkits.net/whitepapers/Fixed%20Point%20Representation%20%26%20Fractional%20Math.pdf |archive-date=2017-11-04}} (Note: the accuracy of the article is in dispute; see discussion.)


==External links==
* [https://github.com/mgarcia01752/Q-Number-Format Q-Number-Format Java Implementation]
* {{cite web |url=https://github.com/mgarcia01752/Q-Number-Format |title=Q-Number-Format Java Implementation |website=[[GitHub]] |access-date=2017-11-04 |url-status=live |archive-url=https://web.archive.org/web/20171104112216/https://github.com/mgarcia01752/Q-Number-Format |archive-date=2017-11-04}}
* {{cite web |url=https://chummersone.github.io/qformat.html |title=Q-format Converter |access-date=2021-06-25 |url-status=live |archive-url=https://web.archive.org/web/20210625213105/https://chummersone.github.io/qformat.html |archive-date=2021-06-25}}
* {{cite web |url=https://github.com/howerj/q |title=Q Library (C implementation) |website=[[GitHub]] |access-date=2024-03-05 }}


[[Category:Computer arithmetic]]
[[Category:Computer arithmetic]]

Latest revision as of 11:21, 15 April 2024

The Q notation is a way to specify the parameters of a binary fixed point number format. For example, in Q notation, the number format denoted by Q8.8 means that the fixed point numbers in this format have 8 bits for the integer part and 8 bits for the fraction part.

A number of other notations have been used for the same purpose.

Definition

Texas Instruments version

The Q notation, as defined by Texas Instruments,[1] consists of the letter Q followed by a pair of numbers m.n, where m is the number of bits used for the integer part of the value, and n is the number of fraction bits.

By default, the notation describes a signed binary fixed point format, with the unscaled integer stored in two's complement format, as used in most binary processors. The first bit always gives the sign of the value (1 = negative, 0 = non-negative), and it is not counted in the m parameter. Thus, the total number w of bits used is 1 + m + n.

For example, the specification Q3.12 describes a signed binary fixed-point number with w = 16 bits in total, comprising the sign bit, three bits for the integer part, and 12 bits for the fraction. That is, it is a 16-bit signed (two's complement) integer that is implicitly multiplied by the scaling factor 2^−12.

In particular, when n is zero, the numbers are just integers. If m is zero, all bits except the sign bit are fraction bits; then the range of the stored number is from −1.0 (inclusive) to +1.0 (exclusive).

The m and the dot may be omitted, in which case they are inferred from the size of the variable or register where the value is stored. Thus, Q12 means a signed integer with any number of bits that is implicitly multiplied by 2^−12.

The letter U can be prefixed to the Q to denote an unsigned binary fixed-point format. For example, UQ1.15 describes values represented as unsigned 16-bit integers with an implicit scaling factor of 2^−15, which range from 0.0 to (2^16 − 1)/2^15 = +1.999969482421875.
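
The implicit scaling described above can be illustrated with a minimal sketch. The function names `q3_12_to_double` and `uq1_15_to_double` are hypothetical and chosen only for this illustration; the sketch assumes the TI notation, in which Q3.12 fits in a 16-bit signed integer.

```c
#include <stdint.h>

// Q3.12 (TI notation): 1 sign bit + 3 integer bits + 12 fraction bits in an int16_t.
// The stored integer is implicitly scaled by 2^-12 = 1/4096.
double q3_12_to_double(int16_t raw)
{
    return (double)raw / 4096.0;
}

// UQ1.15: unsigned, 1 integer bit + 15 fraction bits in a uint16_t,
// implicitly scaled by 2^-15 = 1/32768.
double uq1_15_to_double(uint16_t raw)
{
    return (double)raw / 32768.0;
}
```

With these definitions, the raw Q3.12 value 4096 stands for 1.0, and the largest UQ1.15 raw value 0xFFFF stands for 65535/32768.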

ARM version

A variant of the Q notation has been in use by ARM. In this variant, the m number includes the sign bit. For example, a 16-bit signed integer would be denoted Q15.0 in the TI variant, but Q16.0 in the ARM variant.[2][3]

Characteristics

The resolution (difference between successive values) of a Qm.n or UQm.n format is always 2^(−n). The range of representable values depends on the notation used:

Range of representable values in Q notation

Notation       | Texas Instruments notation | ARM notation
Signed Qm.n    | −2^m to +2^m − 2^(−n)      | −2^(m−1) to +2^(m−1) − 2^(−n)
Unsigned UQm.n | 0 to 2^m − 2^(−n)          | 0 to 2^m − 2^(−n)

For example, a Q15.1 format number (in the ARM notation, where m includes the sign bit) requires 15+1 = 16 bits, has resolution 2^−1 = 0.5, and the representable values range from −2^14 = −16384.0 to +2^14 − 2^−1 = +16383.5. In hexadecimal, the negative values range from 0x8000 to 0xFFFF, followed by the non-negative ones from 0x0000 to 0x7FFF.
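
The ranges for signed formats in the two notations can be computed with a short sketch. The type `q_range` and the functions `two_to_the`, `q_range_ti` and `q_range_arm` are hypothetical names introduced for this illustration only; they follow the range expressions given for signed Qm.n in each notation.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    double min;         // smallest representable value
    double max;         // largest representable value
    double resolution;  // step between successive values, 2^-n
} q_range;

// 2^e as a double for 0 <= e <= 62, without needing <math.h>.
static double two_to_the(int e)
{
    assert(e >= 0 && e <= 62);
    return (double)((int64_t)1 << e);
}

// Signed Qm.n, TI convention: the sign bit is not counted in m.
q_range q_range_ti(int m, int n)
{
    q_range r;
    r.resolution = 1.0 / two_to_the(n);
    r.min = -two_to_the(m);
    r.max = two_to_the(m) - r.resolution;
    return r;
}

// Signed Qm.n, ARM convention: m includes the sign bit, so m must be >= 1.
q_range q_range_arm(int m, int n)
{
    q_range r;
    r.resolution = 1.0 / two_to_the(n);
    r.min = -two_to_the(m - 1);
    r.max = two_to_the(m - 1) - r.resolution;
    return r;
}
```

Under these assumptions, `q_range_arm(15, 1)` and `q_range_ti(14, 1)` describe the same 16-bit format, with range −16384.0 to +16383.5 and resolution 0.5.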

Math operations

Q numbers are a ratio of two integers: the numerator is kept in storage, the denominator d is equal to 2^n.

Consider the following example:

  • The Q8 denominator equals 2^8 = 256
  • 1.5 equals 384/256
  • 384 is stored, 256 is inferred because it is a Q8 number.
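
The relationship between a real value and its stored numerator can be sketched as a pair of conversion functions. This is a minimal illustration, assuming Q8 in a 16-bit container; `double_to_q` and `q_to_double` are hypothetical names, and no range checking is done, so the caller must make sure the scaled value fits in an int16_t.

```c
#include <stdint.h>

#define Q 8  // assumption: 8 fractional bits, as in the Q8 example above

// Float to Q: multiply by 2^Q, then round to the nearest integer
// (halves are rounded away from zero). No range check is done here.
int16_t double_to_q(double x)
{
    double scaled = x * (double)(1 << Q);
    return (int16_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

// Q to float: treat the stored value as an integer and multiply by 2^-Q.
double q_to_double(int16_t q)
{
    return (double)q / (double)(1 << Q);
}
```

For the example above, `double_to_q(1.5)` yields the stored numerator 384, and `q_to_double(384)` recovers 1.5.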

If the Q number's base is to be maintained (n remains constant), the Q number math operations must keep the denominator d constant. The following formulas show math operations on the general Q numbers N_1 and N_2. (In the example above, N_1 is 384 and d is 256.)
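
Writing each Q number as a stored numerator over the fixed denominator d, the four operations can be worked out as below. This is a derivation from the definition N/d given above, offered as a sketch rather than as the article's original typesetting; in each line the numerator on the right-hand side is the integer that would be stored.

```latex
\begin{align}
\frac{N_1}{d} + \frac{N_2}{d} &= \frac{N_1 + N_2}{d} \\
\frac{N_1}{d} - \frac{N_2}{d} &= \frac{N_1 - N_2}{d} \\
\frac{N_1}{d} \times \frac{N_2}{d} &= \frac{(N_1 \times N_2) / d}{d} \\
\frac{N_1}{d} \div \frac{N_2}{d} &= \frac{(N_1 \times d) / N_2}{d}
\end{align}
```

Addition and subtraction act directly on the stored numerators, while multiplication needs one division by d and division needs one multiplication by d to keep the result in the same format.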

Because the denominator is a power of two, the multiplication can be implemented as an arithmetic shift to the left and the division as an arithmetic shift to the right; on many processors shifts are faster than multiplication and division.

To maintain accuracy, the intermediate multiplication and division results must be double precision and care must be taken in rounding the intermediate result before converting back to the desired Q number.

In C, the operations can be written as follows (here, Q denotes the number of fractional bits):

Addition

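#include <stdint.h>

// Plain addition: both operands must use the same Q format.
// Overflow is not detected; see the saturating version for that.
int16_t q_add(int16_t a, int16_t b)
{
    return a + b;
}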

With saturation

int16_t q_add_sat(int16_t a, int16_t b)
{
    int16_t result;
    int32_t tmp;

    tmp = (int32_t)a + (int32_t)b;
    if (tmp > 0x7FFF)
        tmp = 0x7FFF;
    if (tmp < -1 * 0x8000)
        tmp = -1 * 0x8000;
    result = (int16_t)tmp;

    return result;
}

Unlike floating point ±Inf, saturated results are not sticky: in the implementation shown, adding a negative value to a positive saturated value (0x7FFF) brings the result back out of saturation, and vice versa. In assembly language, the signed overflow flag can be used to avoid the typecasts needed in the C implementation.
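
The non-sticky behavior can be seen in a small sketch. `q_add_sat` is restated in compact form so that the block is self-contained, and `saturate_then_step_back` is a hypothetical helper added only for this illustration.

```c
#include <stdint.h>

// Saturating 16-bit addition, restated from the implementation above.
int16_t q_add_sat(int16_t a, int16_t b)
{
    int32_t tmp = (int32_t)a + (int32_t)b;
    if (tmp > INT16_MAX) tmp = INT16_MAX;
    if (tmp < INT16_MIN) tmp = INT16_MIN;
    return (int16_t)tmp;
}

// Hypothetical helper: saturate at the top of the range, then add -1.
// The result leaves saturation instead of staying pinned at 0x7FFF.
int16_t saturate_then_step_back(void)
{
    int16_t pinned = q_add_sat(INT16_MAX, 1);  // clamps to 0x7FFF
    return q_add_sat(pinned, -1);              // yields 0x7FFE
}
```

A floating point +Inf would stay +Inf after adding −1; here the information that an overflow occurred is lost after the next operation, so code that needs to know about it has to track it separately.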

Subtraction

int16_t q_sub(int16_t a, int16_t b)
{
    return a - b;
}
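
A saturating variant of subtraction can be written along the same lines as the saturating addition. The name `q_sub_sat` is an assumption made for this sketch; it is not part of the code shown above.

```c
#include <stdint.h>

// Saturating 16-bit subtraction: widen to 32 bits, subtract,
// then clamp the result to the int16_t range.
int16_t q_sub_sat(int16_t a, int16_t b)
{
    int32_t tmp = (int32_t)a - (int32_t)b;
    if (tmp > INT16_MAX) tmp = INT16_MAX;
    if (tmp < INT16_MIN) tmp = INT16_MIN;
    return (int16_t)tmp;
}
```

As with addition, both operands must be in the same Q format, and the result is in that format as well.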

Multiplication

// Q (the number of fractional bits) must be defined before this point.
// precomputed value:
#define K   (1 << (Q - 1))

// saturate to range of int16_t
int16_t sat16(int32_t x)
{
    if (x > 0x7FFF) return 0x7FFF;
    else if (x < -0x8000) return -0x8000;
    else return (int16_t)x;
}

int16_t q_mul(int16_t a, int16_t b)
{
    int16_t result;
    int32_t temp;

    temp = (int32_t)a * (int32_t)b; // result type is operand's type
    // Rounding; mid values are rounded up
    temp += K;
    // Correct by dividing by base and saturate result
    result = sat16(temp >> Q);

    return result;
}
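
As a worked check of the multiplication, the routine can be restated with a concrete format. The sketch below assumes Q8 (the code above leaves Q undefined) and assumes that right-shifting a negative value is an arithmetic shift, which C leaves implementation-defined.

```c
#include <stdint.h>

#define Q 8                 // assumption: Q8, i.e. 8 fractional bits
#define K (1 << (Q - 1))    // half of the denominator, used for rounding

// Clamp a 32-bit intermediate to the int16_t range.
int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

int16_t q_mul(int16_t a, int16_t b)
{
    int32_t temp = (int32_t)a * (int32_t)b;  // product carries 2*Q fractional bits
    temp += K;                               // rounding: mid values are rounded up
    // Note: >> on a negative value is implementation-defined in C;
    // an arithmetic shift is assumed here.
    return sat16(temp >> Q);
}
```

With these assumptions, 1.5 × 2.0 is `q_mul(384, 512)`: the product 196608 plus K = 128 is 196736, and shifting right by 8 gives 768, which is 3.0 in Q8. Multiplying the two largest values overflows the format and is clamped to 0x7FFF.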

Division

int16_t q_div(int16_t a, int16_t b)
{
    /* pre-multiply by the base (Upscale to Q16 so that the result will be in Q8 format) */
    int32_t temp = (int32_t)a << Q;
    /* Rounding: mid values are rounded up (down for negative values). */
    /* OR compare most significant bits i.e. if (((temp >> 31) & 1) == ((b >> 15) & 1)) */
    if ((temp >= 0 && b >= 0) || (temp < 0 && b < 0)) {   
        temp += b / 2;    /* OR shift 1 bit i.e. temp += (b >> 1); */
    } else {
        temp -= b / 2;    /* OR shift 1 bit i.e. temp -= (b >> 1); */
    }
    return (int16_t)(temp / b);
}
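
The division above does not guard against a zero divisor or against a quotient that does not fit in 16 bits, and it left-shifts a possibly negative value. A more defensive variant might look like the sketch below. The name `q_div_sat`, the choice of Q8, and the policy of saturating on division by zero are all assumptions made for this illustration.

```c
#include <stdint.h>

#define Q 8  // assumption: Q8, i.e. 8 fractional bits

int16_t q_div_sat(int16_t a, int16_t b)
{
    // Assumed policy: a zero divisor saturates according to the sign of a.
    if (b == 0)
        return (a >= 0) ? INT16_MAX : INT16_MIN;

    // Pre-multiply by the base; multiplying avoids shifting a negative value.
    int32_t temp = (int32_t)a * (1 << Q);

    // Round to nearest: bias by half the divisor, away from zero
    // with respect to the sign of the quotient.
    if ((temp >= 0) == (b > 0))
        temp += b / 2;
    else
        temp -= b / 2;

    int32_t quotient = temp / b;

    // Clamp to the int16_t range instead of truncating on overflow.
    if (quotient > INT16_MAX) return INT16_MAX;
    if (quotient < INT16_MIN) return INT16_MIN;
    return (int16_t)quotient;
}
```

For example, 3.0 / 2.0 in Q8 is `q_div_sat(768, 512)`, which gives 384, that is 1.5. Whether saturation is the right response to a zero divisor depends on the application; signaling an error may be preferable.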

See also

References

  1. ^ "Appendix A.2". TMS320C64x DSP Library Programmer's Reference (PDF). Dallas, Texas, USA: Texas Instruments Incorporated. October 2003. SPRU565. Archived (PDF) from the original on 2022-12-22. Retrieved 2022-12-22. (150 pages)
  2. ^ "ARM Developer Suite AXD and armsd Debuggers Guide". 1.2. ARM Limited. 2001 [1999]. Chapter 4.7.9. AXD > AXD Facilities > Data formatting > Q-format. ARM DUI 0066D. Archived from the original on 2017-11-04.
  3. ^ "Chapter 4.7.9. AXD > AXD Facilities > Data formatting > Q-format". RealView Development Suite AXD and armsd Debuggers Guide (PDF). 3.0. ARM Limited. 2006 [1999]. pp. 4–24. ARM DUI 0066G. Archived (PDF) from the original on 2017-11-04.

Further reading

External links