Lecture 16 – Signed Integers and Arithmetic Part III

Ryan Robucci

Previous Discussions

Numerator Bias before casting and division
Effect of order of operations

Modeling Error Propagation

Error introduced by division:

Let $q$ be the ideal, errorless quotient, $q=i/x$.
Let the value computed by integer division, $i\underset{\rm int}{\boxed{/}}x$, be $\hat{q}$ and the associated error be $\Delta q$, such that
- $\hat{q}= q + \Delta q$
- Relative Error: $\frac{\Delta q}{q}$, the magnitude of the error relative to the ideal value; it depends on the input operands and the precision of the computation
  - Often relative error is more pertinent than absolute error
- Example 5/2
  - $5\underbrace{/}_{int}2$ provides estimate $\hat{q}=2$
    $q_{\rm ideal}=\underbrace{2}_{\hat{q}}-\underbrace{(-0.5)}_{\Delta q}=2.5$
  - Relative Error: $\Delta q/q=-0.5/2.5=-0.2$ (i.e. -20%); measured against the computed value $\hat{q}$ instead, it is $-0.5/2=-0.25$ (i.e. -25%)
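
The 5/2 example can be checked numerically. The following is a minimal sketch, not part of the original notes; the helper name `division_error` is hypothetical, and Python's `//` is used in place of truncating integer division, which is only equivalent because the operands are non-negative.

```python
def division_error(i, x):
    """Return (q_hat, delta_q, relative_error) for the integer division i // x.

    Assumes i >= 0 and x > 0, where Python's floor division matches
    truncating integer division.
    """
    q = i / x              # ideal, errorless quotient
    q_hat = i // x         # computed integer quotient
    delta_q = q_hat - q    # error, defined so that q_hat = q + delta_q
    return q_hat, delta_q, delta_q / q

q_hat, delta_q, rel = division_error(5, 2)
rel_to_computed = delta_q / q_hat   # error measured against q_hat instead of q

print(q_hat, delta_q, rel, rel_to_computed)   # 2 -0.5 -0.2 -0.25
```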

Error introduced by multiplication:

Let $m$ be the ideal, errorless product, $m= i \times x$.
Let the value computed by integer multiplication, $i\underset{\rm int}{\boxed{*}}x$, be $\hat{m}$ and the associated error be $\Delta m$, such that
$\hat{m}= m + \Delta m$
- error could be caused by overflow or saturation
- error may be zero if no overflow can occur

Error propagation by multiplication:

A following multiplication operation, $\times y$, multiplies the prior error:
$\hat{q} \times y = (q + \Delta q)\times y = \underbrace{q \times y}_{\rm ideal} + \underbrace{\Delta q \times y}_{\rm multiplied\ error}$
- Example 5/2*7
  - 1st-step
    - $5\underbrace{/}_{int}2=2=\hat{q}$
    - note $5.0/2.0 = 5\underbrace{/}_{int}2-{\texttt {error}}=\underbrace{2}_{\hat{q}}-\underbrace{(-0.5)}_{\Delta q}$
  - 2nd-step $q\times7=2\times7 - (-0.5\times 7)=\underbrace{14}_{\hat{q}\times 7}-\underbrace{(-3.5)}_{error}$
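
The effect of ordering on the 5/2*7 example can be sketched as follows. This is an illustrative comparison only; the variable names are hypothetical, and the multiply-first variant assumes the intermediate product does not overflow.

```python
i, x, y = 5, 2, 7

ideal = i / x * y                # 17.5, errorless reference
divide_first = (i // x) * y      # division error of -0.5 is multiplied by 7
multiply_first = (i * y) // x    # only the final division introduces error

print(divide_first, divide_first - ideal)      # 14 -3.5
print(multiply_first, multiply_first - ideal)  # 17 -0.5
```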

Error propagated by division:

A division, $/ y$, scales the prior error:
- $\hat{m} / y = (m + \Delta m) / y = \underbrace{m / y}_{\rm ideal} + \underbrace{\Delta m / y}_{\rm scaled\ error}$
- Though division scales down the absolute error, which may appear favorable, it does not reduce the relative error; the relative error propagates unchanged
  - relative error= $\frac{\overbrace{\Delta m / y}^{\rm scaled\ error}}{\underbrace{m / y}_{\rm ideal}}=\frac{\Delta m }{m }$
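
A small numeric check of the statement above, using made-up values (`m`, `delta_m`, and `y` are hypothetical) and an exact division so that no new error is introduced by the division itself:

```python
m, delta_m, y = 1000.0, 8.0, 4.0   # hypothetical ideal product, its error, and divisor

m_hat = m + delta_m                      # computed product carrying the error
abs_err_before = m_hat - m               # 8.0
abs_err_after = m_hat / y - m / y        # 2.0, the absolute error is scaled by 1/y
rel_err_before = abs_err_before / m      # 0.008
rel_err_after = abs_err_after / (m / y)  # 0.008, the relative error is unchanged

print(abs_err_before, abs_err_after, rel_err_before, rel_err_after)
```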

Error Metric:

May measure and model error propagation using
- the worst-case error
  - same penalty for rare error events as common events
- average magnitude of error
  - rare events of large error don't matter as much
- root-mean-square error (a.k.a. RMS error; equal to the standard deviation $\sigma$ when the mean error is zero)
  - like previous, but larger errors are penalized proportionally more than small errors
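
The three metrics can be compared on a concrete error population. The sketch below assumes a hypothetical divisor of 8 and numerators 0 through 63, and measures the truncation error of each integer division.

```python
from math import sqrt

x = 8  # hypothetical divisor
# Truncation error (computed minus ideal) for each numerator 0..63
errors = [(i // x) - (i / x) for i in range(64)]

worst_case = max(abs(e) for e in errors)
mean_abs = sum(abs(e) for e in errors) / len(errors)
rms = sqrt(sum(e * e for e in errors) / len(errors))

print(worst_case, mean_abs, round(rms, 4))  # 0.875 0.4375 0.5229
```

The RMS value sits above the average magnitude because the larger errors are weighted more heavily.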

Earlier examples illustrated error generation and propagation when operands are reordered.
Keep the decision to go big or go small in mind at each step of implementation in the next topic.
We will formally introduce an additional scale at each step and per operand.

Fixed-point arithmetic

Floating Point Math and Fixed-Point Math

If no floating point unit (FPU) is available, you can use a software floating point library, though this will likely be slow.
Another option is fixed-point math. You can write or use a library, or just do it inline as needed.

Fixed-point arithmetic

Problem: Want to use integer operations to represent adding 0011.1110 and 0001.1000
Solution (by example with explanation to follow):
- Store 0011.1110 as 00111110
- Store 0001.1000 as 00011000
- Add 00011000 + 00111110 = 01010110
- Interpret 01010110 as 0101.0110
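
The same steps written out with integer operations, as a minimal sketch (the variable names are illustrative):

```python
a = 0b00111110   # 0011.1110 stored as an integer (value * 16)
b = 0b00011000   # 0001.1000 stored as an integer (value * 16)

c = a + b                      # plain integer addition
print(format(c, "08b"))        # 01010110
print(a / 16, b / 16, c / 16)  # 3.875 1.5 5.375
```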

QM.N notation

Can use common QM.N notation: M+N bits, with M bits as whole part and N bits as fractional part.
To design the computation, first determine the number of bits to use for the whole and fraction parts based on the range and precision needed. This determines the scale factor required to convert the fractional value to a whole number.
QM.N corresponds to $S = 2^N$
Example
1101.0000 * 16 = 11010000 (S=16, Q4.4)
01.011000 * 64 = 01011000 (S=64, Q2.6)
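
Conversion to and from a QM.N representation can be sketched with two helpers. The names `to_fixed` and `from_fixed` are hypothetical, and no range checking against M is performed here.

```python
def to_fixed(value, n):
    """Scale a real value by S = 2**n and round to the nearest integer.

    Python's round() sends exact ties to the even integer.
    """
    return round(value * (1 << n))

def from_fixed(stored, n):
    """Interpret a stored integer as a real value with n fractional bits."""
    return stored / (1 << n)

print(format(to_fixed(1.375, 6), "08b"))  # 01011000  (Q2.6, S=64)
print(from_fixed(0b01011000, 6))          # 1.375
```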

Addition and Subtraction with scale S (operands with same scale)

$A+B$ computed as $A\times S \underbrace{+}_{\mathclap{\text{integer addition}}} B\times S=C\times S$ , so that the addition is an integer addition
Interpret the result by dividing the output by S to obtain the answer C
Using a power of two for S is efficient, though not required
$A-B$ computed as $A\times S-B\times S=C\times S$
Divide the result by S to interpret the answer C
Example: Check if the sum of 0.17 meters and 0.24 meters is greater than 0.9 meters

int Y_S100 = (.9 * 100);
int A_S100 = (0.17 * 100);
int B_S100 = (0.24 * 100);
int C_S100 = A_S100+B_S100; //integer addition
int flag = C_S100 > Y_S100; //integer subtraction/comparison

For readability, this example uses powers of 10 for scaling, but powers of 2 are usually more appropriate for efficiency and for predictability of errors and error propagation.
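
For comparison, the same check with a power-of-two scale might look like the sketch below. The choice of 8 fractional bits is an assumption made for illustration; note that 0.17, 0.24, and 0.9 are no longer represented exactly.

```python
N = 8            # assumed number of fractional bits
S = 1 << N       # scale factor 256

Y = round(0.9 * S)    # 230
A = round(0.17 * S)   # 44
B = round(0.24 * S)   # 61
C = A + B             # 105, still at scale S (integer addition)
flag = C > Y          # integer comparison

print(C, C / S, flag)  # 105 0.41015625 False
```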

Multiplication with scale S

$A*B$ computed as $(A \times S)\underbrace{\times}_{\mathclap{\text{integer multiplication}}} (B \times S)=C \times S \times S$
- Divide result by $S^2$ for final interpretation
- Or to maintain S-Scale representation for C for further computations, divide result by S
Unfortunately, the intermediate result $C \times S \times S$ requires more storage than the scaled result $C \times S$

Example: Check if the area of a rectangle with sides of length 0.17 meters and 0.24 meters is greater than the area of a square with sides of length 0.2 meters

int AREASQ_S10000 = (.2 * 100) * (.2 * 100);
int AREASQ_S100 = (.2 * .2 * 100);

int A_S100 = (0.17 * 100);
int B_S100 = (0.24 * 100);

int ARECT_S10000 = A_S100*B_S100; //integer multiplication
int flag0 = ARECT_S10000 > AREASQ_S10000; //integer subtraction/comparison

int ARECT_S100 = ARECT_S10000/100; //convert to scale 100
int flag1 = ARECT_S100   > AREASQ_S100; //integer subtraction/comparison

As before, powers of 10 are used here for readability; powers of 2 are usually more appropriate for efficiency and for predictability of errors and error propagation.
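
A power-of-two version of the area comparison, as a sketch under the assumption of 8 fractional bits. It shows the same effect as the example above: the comparison at scale $S^2$ distinguishes the two areas, while the comparison after rescaling to S does not.

```python
N = 8                        # assumed number of fractional bits
S = 1 << N                   # scale factor 256

A = round(0.17 * S)          # 44
B = round(0.24 * S)          # 61
SIDE = round(0.2 * S)        # 51

rect_S2 = A * B              # 2684 at scale S*S
sq_S2 = SIDE * SIDE          # 2601 at scale S*S
flag0 = rect_S2 > sq_S2      # True: compared at full precision

half = 1 << (N - 1)              # rounding bias applied before dropping N bits
rect_S = (rect_S2 + half) >> N   # 10 at scale S
sq_S = (sq_S2 + half) >> N       # 10 at scale S
flag1 = rect_S > sq_S            # False: the difference was lost in rescaling

print(flag0, flag1)  # True False
```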

Division with scale S:

A/B could be computed as $(A \times S)\underbrace{/}_{\mathclap{\text{integer division}}}(B \times S)=C$
The scales cancel, which is fine if you only want an integer answer
Would need to multiply by S to obtain the scaled result $C\times S$ for further math
- …but this is less accurate since the lower bits have already been lost
For precision, it is better to prescale one of the operands: $((A\times S)\times S)/(B\times S)=C\times S$
- Unfortunately, the intermediate term $((A \times S)\times S)$ requires more storage and a larger computation
For rounding, remember to apply a numerator pre-bias of ${\rm sign}({\rm numerator}) \times 1/2 \times |{\rm denominator}|$ when performing integer division

Example:
Circumference to Radius
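Note that Python 3 provides an explicit integer division operator: //. It rounds toward $-\infty$ (floor), which matches truncating integer division when both operands are positive, as they are here.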

from math import pi, floor

tau = pi * 2
#tau

C=tau+2.8 # some value for circumference 
          #(note that adding was intentional here just to produce an interesting number 9.0831853072)

#C

C8=floor(C*8)
#C*8,C8,C8/8

C256=floor(C*8*8) # scale is 8*8 = 64 despite the variable name
#C*8*8,C256,C256/(8*8)

tau8=floor(tau * 8)
#tau8,tau8/8

halftau8=floor(tau/2 * 8) # precomputed bias
#halftau8,halftau8/8

print("C                               = %20.10f"%C) 
print("C/tau                           = %20.10f"%((C)/(tau))) 

print("C8/tau8                         = %20.10f"%((C8)//(tau8)))

print("  with pre-div bias             : %20.10f"%((C8+halftau8)//(tau8)))

print("  with prescale                 : %20.10f"%((C8*8)//(tau8)/8))
print("  with prescale and pre-div bias: %20.10f"%((C8*8+halftau8)//(tau8)/8))
print("C256/tau8                       = %20.10f"%((C256)//(tau8)/8))
print("  with pre-div bias             : %20.10f"%((C256+halftau8)//(tau8)/8))
print("---")

C                               =         9.0831853072
C/tau                           =         1.4456338407
C8/tau8                         =         1.0000000000
  with pre-div bias             :         1.0000000000
  with prescale                 :         1.3750000000
  with prescale and pre-div bias:         1.5000000000
C256/tau8                       =         1.3750000000
  with pre-div bias             :         1.5000000000
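
The example above only exercises positive operands. A sketch that also handles negative values is given below; `div_round` and `fixed_div` are hypothetical helper names. Because Python's `//` floors rather than truncates, the sign-dependent pre-bias is implemented here by dividing the magnitudes and restoring the sign afterwards, which gives the same result as biasing the numerator by sign(numerator) × 1/2 × |denominator| before a truncating division.

```python
def div_round(num, den):
    """Integer division rounded to nearest, with ties rounded away from zero."""
    sign = -1 if (num < 0) != (den < 0) else 1
    n, d = abs(num), abs(den)
    return sign * ((n + d // 2) // d)   # bias of |den|/2 applied to the magnitude

def fixed_div(a_s, b_s, n):
    """Divide two values held at scale 2**n; the result is also at scale 2**n."""
    return div_round(a_s << n, b_s)     # prescale the numerator by S = 2**n

print(fixed_div(72, 50, 3) / 8)    # 1.5   (C8 = 72 and tau8 = 50 from the example)
print(fixed_div(-72, 50, 3) / 8)   # -1.5
```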

Bias before reducing precision

Reducing precision involves division, typically achieved using a right shift
Therefore follow the pre-biasing rule for division
When removing less significant bits, apply a bias of 1/2 the weight of the surviving LSB position before discarding them to achieve rounding. The sign of the bias depends on how the discarding operation rounds:
- a truncating division by $2^n$ (rounds toward zero) needs the bias applied according to the sign of the value: add (1 << (n - 1)) for positive values, subtract it for negative values
- an arithmetic right shift by n (rounds toward $-\infty$) needs (1 << (n - 1)) added regardless of sign
Examples converting Q4.4 to Q6.2:
- Will need to lose the bits in the 1/8 and 1/16 positions, with a new LSB position of 1/4
- The bias is therefore 1/8th: added for positive numbers; for negative numbers, subtracted before a truncating division or added before an arithmetic right shift
$4 \frac{1}{8}$ , 4.125, 0100.0010, stored as 01000010
- Without prebias for correct rounding:
  - right shift by 2 results in 00010000, lost ending bits 10
  - interpreted as 4.0
- With prebias for correct rounding:
  - $4 \frac{1}{8} + \frac{1}{8} = 4 \frac{1}{4}$
    - 01000010 + 00000010 = 01000100
  - right shift by 2: 00010001
  - interpreted as 4.25 (4.125 rounded to the nearest $\frac{1}{4}$ )
$-1\frac{1}{16}$ , 1110.1111, stored as 1110 1111
- Without prebias:
  - (Arithmetic) right shift by 2 results in 1111 1011, lost ending bits 11
  - interpreted as $-1 \frac{1}{4}$
- With prebias (an arithmetic right shift rounds toward $-\infty$, so the $\frac{1}{8}$ is added even though the value is negative):
  - $-1 \frac{1}{16} + \frac{1}{8} = -\frac{15}{16}$
    - 1110 1111 + 0000 0010 = 1111 0001
  - right shift by 2: 1111 1100
  - interpreted as -1 ( $-1\frac{1}{16}$ rounded to the nearest $\frac{1}{4}$ )
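
Both Q4.4 to Q6.2 conversions can be reproduced with a rounding shift. This is a minimal sketch; `round_shift` is a hypothetical name. Python's `>>` on integers is an arithmetic shift that floors, so the half-LSB bias is added for both signs; a truncating division would need the sign-dependent bias instead.

```python
def round_shift(value, n):
    """Drop the n low bits with rounding (exact ties round toward +infinity)."""
    return (value + (1 << (n - 1))) >> n

print(round_shift(0b01000010, 2) / 4)  # 4.25   (4.125 in Q4.4 to Q6.2)
print(round_shift(-17, 2) / 4)         # -1.0   (-1 1/16 in Q4.4 to Q6.2)
print((-17 >> 2) / 4)                  # -1.25  (no bias)
```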

Example Fixed-Point Arith Library Code

Q (number format) - Wikipedia
- reviewed code in-class, provided below

Code examples in the below block are copied and provided under the Creative Commons Attribution-ShareAlike License: https://creativecommons.org/licenses/by-sa/3.0/

Wikipedia contributors. (2021, November 24). Q (number format). In Wikipedia, The Free Encyclopedia. Retrieved 18:10, November 29, 2021, from https://en.wikipedia.org/w/index.php?title=Q_(number_format)&oldid=1056933643

#include <stdint.h>

int16_t q_add_sat(int16_t a, int16_t b)
{
    int16_t result;
    int32_t tmp;

    tmp = (int32_t)a + (int32_t)b;
    if (tmp > 0x7FFF)
        tmp = 0x7FFF;
    if (tmp < -1 * 0x8000)
        tmp = -1 * 0x8000;
    result = (int16_t)tmp;

    return result;
}

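// Q is the number of fractional bits; it must be defined before use
// (the q_div comment below suggests Q = 8, but no value is given here)
// precomputed value:
#define K   (1 << (Q - 1))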
 
// saturate to range of int16_t
int16_t sat16(int32_t x)
{
    if (x > 0x7FFF) return 0x7FFF;
    else if (x < -0x8000) return -0x8000;
    else return (int16_t)x;
}

int16_t q_mul(int16_t a, int16_t b)
{
    int16_t result;
    int32_t temp;

    temp = (int32_t)a * (int32_t)b; // result type is operand's type
    // Rounding; mid values are rounded up
    temp += K;
    // Correct by dividing by base and saturate result
    result = sat16(temp >> Q);

    return result;
}

int16_t q_div(int16_t a, int16_t b)
{
    /* pre-multiply by the base (Upscale to Q16 so that the result will be in Q8 format) */
    int32_t temp = (int32_t)a << Q;
    /* Rounding: mid values are rounded up (down for negative values). */
    /* OR compare most significant bits i.e. if (((temp >> 31) & 1) == ((b >> 15) & 1)) */
    if ((temp >= 0 && b >= 0) || (temp < 0 && b < 0)) {   
        temp += b / 2;    /* OR shift 1 bit i.e. temp += (b >> 1); */
    } else {
        temp -= b / 2;    /* OR shift 1 bit i.e. temp -= (b >> 1); */
    }
    return (int16_t)(temp / b);
}
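
To experiment with the behavior of `sat16` and `q_mul` without a C toolchain, a rough Python model can be used. This is a sketch, not a port: `Q = 8` is an assumption taken from the "Q8 format" comment in `q_div`, and Python integers are unbounded, so the `int32_t` intermediate is not modeled.

```python
Q = 8                     # assumed number of fractional bits
K = 1 << (Q - 1)          # rounding constant, as in the C macro

def sat16(x):
    """Clamp x to the int16_t range [-0x8000, 0x7FFF]."""
    return max(-0x8000, min(0x7FFF, x))

def q_mul(a, b):
    """Model of q_mul: multiply, add rounding bias, rescale by 2**Q, saturate."""
    return sat16((a * b + K) >> Q)

print(q_mul(0x0180, 0x0200))   # 768    (1.5 * 2.0 = 3.0 at scale 256)
print(q_mul(0x7FFF, 0x7FFF))   # 32767  (saturated)
```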
