Error introduced by division:
Let the ideal quotient be errorless
Let the value computed by integer division differ from the ideal quotient by an associated error
Error introduced by multiplication:
Error propagation by multiplication:
Error propagated by division:
Error Metric:
The earlier examples illustrate error generation and propagation when operands are reordered
Keep the decision to go big or go small in mind at each step of the implementation in the next topic.
We will formally introduce an additional scale at each step and per-operand.
Add 0011.1110 and 0001.1000:
Store 0011.1110 as 00111110
Store 0001.1000 as 00011000
      00011000
    + 00111110
    = 01010110
Interpret 01010110 as 0101.0110
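The same addition can be sketched in Python by treating each Q4.4 bit pattern as a plain integer. This is only an illustration; the helper name `show_q4_4` is hypothetical and exists purely to print the result with its implied binary point.

```python
# Treat the 8-bit patterns as plain integers; the binary point is implied.
a = 0b00111110  # 0011.1110 -> 3.875
b = 0b00011000  # 0001.1000 -> 1.5

total = a + b   # ordinary integer addition


def show_q4_4(raw):
    """Format an 8-bit integer as a Q4.4 bit string (hypothetical helper)."""
    bits = format(raw & 0xFF, "08b")
    return bits[:4] + "." + bits[4:]


print(show_q4_4(total))  # 0101.0110
print(total / 16)        # 5.375
```

Dividing by 16 at the end is the only place where the implied scale is used; the addition itself never sees the binary point.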
QM.N notation: M+N bits, with M bits as the whole part and N bits as the fractional part.
1101.0000 * 16 = 11010000, S=16, Q4.4
01.011000 * 64 = 01011000, S=64, Q2.6
$C = A + B$ is computed as $C \cdot S = A \cdot S + B \cdot S$, so that the addition is an integer addition
Interpret the result by dividing the output by S to obtain the answer C
Using a power of two for S is efficient, though not required
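As a rough sketch of the scaling step, the two bit patterns listed above can be reproduced with a pair of hypothetical helpers, `to_fixed` and `from_fixed`, which are not part of the original material. The sketch assumes the patterns are read as unsigned values and that both operands must share one scale before they are added.

```python
def to_fixed(x, frac_bits):
    """Scale x by S = 2**frac_bits and round to an integer (hypothetical helper)."""
    return round(x * (1 << frac_bits))


def from_fixed(raw, frac_bits):
    """Divide by S = 2**frac_bits to interpret a stored integer (hypothetical helper)."""
    return raw / (1 << frac_bits)


a_q4_4 = to_fixed(13.0, 4)    # 1101.0000 * 16, read as unsigned
b_q2_6 = to_fixed(1.375, 6)   # 01.011000 * 64
print(format(a_q4_4, "08b"), format(b_q2_6, "08b"))  # 11010000 01011000

# Bring b to the scale of a (S = 16) before adding. Dropping two
# fractional bits happens to be exact for this particular value.
b_q4_4 = b_q2_6 >> 2
c = a_q4_4 + b_q4_4           # integer addition at the shared scale
print(from_fixed(c, 4))       # 14.375
```

Aligning to the coarser scale keeps the result in 8 bits but can discard fractional bits; aligning to the finer scale keeps them but needs a wider integer.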
$C = A - B$ is computed as $C \cdot S = A \cdot S - B \cdot S$
Divide the result by S to interpret the answer C
Example: check whether the sum of 0.17 meters and 0.24 meters is greater than 0.9 meters

    int Y_S100 = (.9 * 100);       // 0.9 m at scale 100
    int A_S100 = (0.17 * 100);     // 0.17 m at scale 100
    int B_S100 = (0.24 * 100);     // 0.24 m at scale 100
    int C_S100 = A_S100 + B_S100;  // integer addition
    int flag = C_S100 > Y_S100;    // integer subtraction/comparison

For readability, the examples here use powers of 10 for scaling, but powers of 2 are usually more appropriate for efficiency and for predictability of errors and error propagation.
$C = A \times B$ is computed as $C \cdot S = \frac{(A \cdot S)(B \cdot S)}{S}$
Unfortunately, the intermediate result $(A \cdot S)(B \cdot S)$ requires more storage than the scaled result
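The storage cost of the intermediate product can be seen in a small sketch. The values 3.875 and 1.5 and the scale S = 16 are assumptions chosen for illustration, and Python integers are used only because they make the bit widths easy to inspect.

```python
S = 16                    # scale for Q4.4 (assumed for this sketch)
a = round(3.875 * S)      # 62, fits in 8 bits
b = round(1.5 * S)        # 24, fits in 8 bits

product = a * b           # the scale of the product is S * S = 256
print(product, product.bit_length())  # 1488 11

c = product // S          # divide once by S to return to scale S
print(c, c.bit_length())  # 93 7
print(c / S)              # 5.8125
```

Two 8-bit operands produce an intermediate that needs up to 16 bits, which is why fixed-width implementations usually hold the product in a wider type before scaling it back.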
Example: check whether the area of a rectangle with sides of length 0.17 meters and 0.24 meters is greater than the area of a square with sides of length 0.2 meters

    int AREASQ_S10000 = (.2 * 100) * (.2 * 100);  // square area at scale 100*100
    int AREASQ_S100 = (.2 * .2 * 100);            // square area at scale 100
    int A_S100 = (0.17 * 100);                    // side A at scale 100
    int B_S100 = (0.24 * 100);                    // side B at scale 100
    int ARECT_S10000 = A_S100 * B_S100;           // integer multiplication
    int flag0 = ARECT_S10000 > AREASQ_S10000;     // integer subtraction/comparison
    int ARECT_S100 = ARECT_S10000 / 100;          // convert to scale 100
    int flag1 = ARECT_S100 > AREASQ_S100;         // integer subtraction/comparison
$A/B$ could be computed as $\frac{A \cdot S}{B \cdot S}$
The scales cancel, which is fine if you only want an integer answer
You would need to multiply by S to obtain a scaled result for further math
For precision, it is better to prescale one of the operands (for example, the numerator) by S before dividing
For rounding, remember to apply a numerator pre-bias of half the denominator, $\frac{B \cdot S}{2}$, when performing integer division
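The three variants of scaled division can be compared side by side in a short sketch. The operand values and the helper name `div_round` are assumptions for illustration, and the helper is only meant for non-negative operands.

```python
S = 16                  # common scale (assumed Q4.4)
a = round(3.875 * S)    # 62
b = round(1.25 * S)     # 20


def div_round(num, den, scale):
    """Prescale the numerator and add half the denominator before dividing.

    Hypothetical helper; valid for non-negative num and positive den.
    """
    return (num * scale + den // 2) // den


print(a // b)                 # 3: the scales cancel, integer answer only
print((a * S) // b / S)       # 3.0625: prescaled numerator, truncated
print(div_round(a, b, S) / S) # 3.125: prescaled with pre-division bias
print(3.875 / 1.25)           # 3.1 for reference
```

The biased result lands on the representable value closest to 3.1, while the truncated one is a full step lower.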
Example: circumference to radius
Note that Python 3 provides an explicit integer division operator: //

    from math import floor, pi

    tau = pi * 2
    C = tau + 2.8  # some value for circumference
    # (note that adding was intentional here just to produce an interesting number, 9.0831853072)
    C8 = floor(C * 8)              # C at scale 8
    C256 = floor(C * 8 * 8)        # C at scale 8*8 = 64, despite the name
    tau8 = floor(tau * 8)          # tau at scale 8
    halftau8 = floor(tau / 2 * 8)  # precomputed bias: half the divisor at scale 8

    print("C = %20.10f" % C)
    print("C/tau = %20.10f" % (C / tau))
    print("C8/tau8 = %20.10f" % (C8 // tau8))
    print(" with pre-div bias : %20.10f" % ((C8 + halftau8) // tau8))
    print(" with prescale : %20.10f" % ((C8 * 8) // tau8 / 8))
    print(" with prescale and pre-div bias: %20.10f" % ((C8 * 8 + halftau8) // tau8 / 8))
    print("C256/tau8 = %20.10f" % (C256 // tau8 / 8))
    print(" with pre-div bias : %20.10f" % ((C256 + halftau8) // tau8 / 8))
    print("---")

C = 9.0831853072
C/tau = 1.4456338407
C8/tau8 = 1.0000000000
with pre-div bias : 1.0000000000
with prescale : 1.3750000000
with prescale and pre-div bias: 1.5000000000
C256/tau8 = 1.3750000000
with pre-div bias : 1.5000000000
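Reducing precision involves division, typically achieved using a right shift
Therefore, follow the pre-biasing rule for division
When removing less significant bits, add a bias of 1/2 the weight of the lowest surviving bit position before shifting to achieve rounding. Take the sign of the value into account: an arithmetic right shift rounds toward negative infinity, so the same positive bias rounds both positive and negative values to nearest, whereas a division that truncates toward zero needs the bias subtracted for negative values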
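Examples converting Q4.4 values to Q6.2:
Positive value: 2.125, 0010.0010, stored as 00100010
Without bias: 00100010 >> 2 = 00001000 (2.0), lost ending bits 10
With bias:
      00100010
    + 00000010
    = 00100100
00100100 >> 2 = 00001001 (2.25)
Negative value: -1.0625, 1110.1111, stored as 1110 1111
Without bias: 1110 1111 >> 2 = 1111 1011 (-1.25), lost ending bits 11
With bias (added, not subtracted, because the arithmetic right shift rounds toward negative infinity):
      1110 1111
    + 0000 0010
    = 1111 0001
1111 0001 >> 2 = 1111 1100 (-1.0)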
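The Q4.4 to Q6.2 conversion can be sketched in Python, whose `>>` on negative integers behaves as an arithmetic shift. The helper names `reduce_precision` and `bits8` are hypothetical. The sketch reads 0010.0010 as 2.125 and 1110.1111 as -1.0625, and assumes the same positive half-weight bias is added for both signs, which gives round-half-up under an arithmetic shift.

```python
def reduce_precision(raw, drop_bits):
    """Drop `drop_bits` low bits, rounding half up (hypothetical helper).

    Relies on >> being an arithmetic shift, as it is for Python integers.
    """
    bias = 1 << (drop_bits - 1)  # half the weight of the lowest surviving bit
    return (raw + bias) >> drop_bits


def bits8(value):
    """Show the low 8 bits of a possibly negative integer (hypothetical helper)."""
    return format(value & 0xFF, "08b")


pos = 0b00100010   # 0010.0010 in Q4.4 -> 2.125
neg = -17          # 1110.1111 in Q4.4 -> -1.0625

print(bits8(pos >> 2), bits8(reduce_precision(pos, 2)))  # 00001000 00001001
print(bits8(neg >> 2), bits8(reduce_precision(neg, 2)))  # 11111011 11111100
print(reduce_precision(pos, 2) / 4, reduce_precision(neg, 2) / 4)  # 2.25 -1.0
```

If the reduction were done with a division that truncates toward zero instead of a shift, the bias would have to be subtracted for negative inputs to get a comparable result.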
The code examples in the block below are copied and provided under the Creative Commons Attribution-ShareAlike License: https://creativecommons.org/licenses/by-sa/3.0/
Wikipedia contributors. (2021, November 24). Q (number format). In Wikipedia, The Free Encyclopedia. Retrieved 18:10, November 29, 2021, from https://en.wikipedia.org/w/index.php?title=Q_(number_format)&oldid=1056933643
    int16_t q_add_sat(int16_t a, int16_t b)
    {
        int16_t result;
        int32_t tmp;

        tmp = (int32_t)a + (int32_t)b;
        if (tmp > 0x7FFF)
            tmp = 0x7FFF;
        if (tmp < -1 * 0x8000)
            tmp = -1 * 0x8000;
        result = (int16_t)tmp;

        return result;
    }

    // precomputed value:
    #define K (1 << (Q - 1))

    // saturate to range of int16_t
    int16_t sat16(int32_t x)
    {
        if (x > 0x7FFF) return 0x7FFF;
        else if (x < -0x8000) return -0x8000;
        else return (int16_t)x;
    }

    int16_t q_mul(int16_t a, int16_t b)
    {
        int16_t result;
        int32_t temp;

        temp = (int32_t)a * (int32_t)b; // result type is operand's type
        // Rounding; mid values are rounded up
        temp += K;
        // Correct by dividing by base and saturate result
        result = sat16(temp >> Q);

        return result;
    }

    int16_t q_div(int16_t a, int16_t b)
    {
        /* pre-multiply by the base (Upscale to Q16 so that the result will be in Q8 format) */
        int32_t temp = (int32_t)a << Q;
        /* Rounding: mid values are rounded up (down for negative values). */
        /* OR compare most significant bits i.e. if (((temp >> 31) & 1) == ((b >> 15) & 1)) */
        if ((temp >= 0 && b >= 0) || (temp < 0 && b < 0)) {
            temp += b / 2;    /* OR shift 1 bit i.e. temp += (b >> 1); */
        } else {
            temp -= b / 2;    /* OR shift 1 bit i.e. temp -= (b >> 1); */
        }
        return (int16_t)(temp / b);
    }
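To experiment with the saturation and rounding behaviour without a C toolchain, the add and multiply routines can be modelled in Python. This is a sketch and not part of the quoted Wikipedia code: the function names mirror the C ones, Q = 8 is an assumption taken from the comment in `q_div`, and Python's unbounded integers stand in for the `int32_t` intermediate.

```python
INT16_MAX = 0x7FFF
INT16_MIN = -0x8000
Q = 8  # number of fractional bits (assumed for this sketch)


def sat16(x):
    """Clamp x to the int16_t range."""
    return max(INT16_MIN, min(INT16_MAX, x))


def q_add_sat(a, b):
    """Saturating addition of two values at the same scale."""
    return sat16(a + b)


def q_mul(a, b, q=Q):
    """Multiply, add the rounding constant K = 1 << (q - 1), rescale, saturate."""
    k = 1 << (q - 1)
    return sat16((a * b + k) >> q)


print(q_add_sat(30000, 10000))    # 32767, clamped at the upper bound
print(q_add_sat(-30000, -10000))  # -32768, clamped at the lower bound
print(q_mul(384, 640) / (1 << Q)) # 3.75, from 1.5 * 2.5
```

The division routine is left out on purpose: C integer division truncates toward zero while Python's `//` rounds toward negative infinity, so a faithful model would need extra care for negative operands.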