Lecture – Integers Representations and Arithmetic

Ryan Robucci

Table of Contents

References

Signed and Unsigned Integers

Unsigned Interpretation:
i=0n1(2ixi)=\displaystyle\sum_{i=0}^{n-1}(2^i x_i)=
2n1xn1+i=0n2(2ixi)\displaystyle \textcolor{red}{2^{n-1} x_{n-1} } + \sum_{i=0}^{n-2}(2^i x_i )

Signed Interpretation (MSB used for representing Negative Numbers)

2n1xn1+i=0n2(2ixi)\displaystyle \textcolor{red}{-2^{n-1}x_{n-1} } + \sum_{i=0}^{n-2}(2^i x_i )

Negative Numbers in Two’s Complement (2NX2^N-X)

Two’s Complement Wheel

Helps visualize

  • bit interpretation
  • modular arithmetic
    • e.g. (X+Y)%24(X+Y)\%2^4
  • overflow
image src:https://stackoverflow.com/questions/55145028/binary-ones-complement-in-python-3, though origin uncertain

Conversion Arithmetic

Conversion for 4-bit Data

Bits
x3x2x1x0x_3x_2x_1x_0
Unsigned relation Signed
1111 15 +1616\xtofrom[+16]{-16} 1=2415\textcolor{red}{-1}=2^4-15
1110 14 +1616\xtofrom[+16]{-16} 2=2414\textcolor{red}{-2}=2^4-14
1101 13 +1616\xtofrom[+16]{-16} 3=2413\textcolor{red}{-3}=2^4-13
1100 12 +1616\xtofrom[+16]{-16} 4=2412\textcolor{red}{-4}=2^4-12
1011 11 +1616\xtofrom[+16]{-16} 5=2411\textcolor{red}{-5}=2^4-11
1010 10 +1616\xtofrom[+16]{-16} 6=2410\textcolor{red}{-6}=2^4-10
1001 9 +1616\xtofrom[+16]{-16} 7=249\textcolor{red}{-7}=2^4-9
1000 8 +1616\xtofrom[+16]{-16} 8=248\textcolor{red}{-8}=2^4-8
0111 7\textcolor{blue}{7} = +7\textcolor{blue}{+7}
0110 6\textcolor{blue}{6} = +6\textcolor{blue}{+6}
0101 5\textcolor{blue}{5} = +5\textcolor{blue}{+5}
0100 4\textcolor{blue}{4} = +4\textcolor{blue}{+4}
0011 3\textcolor{blue}{3} = +3\textcolor{blue}{+3}
0010 2\textcolor{blue}{2} = +2\textcolor{blue}{+2}
0001 1\textcolor{blue}{1} = +1\textcolor{blue}{+1}
0000 0\textcolor{blue}{0} = +0\textcolor{blue}{+0}

Visual Example:
Sine Wave transmitted using 5-bit data

Complement and Increment

Ones Complement

Sign-Magnitude

Long/short unsigned/signed Conversion and Casting Rules

Review Lecture Notes on C Type Casting

Signed/Unsigned Pitfall Examples

. . .
unsigned int bottleStock = 10; //**unsigned**
...
if (bottleStock >= 0) { //always TRUE!!!
  bottleStock = bottleStock-1;
}
. . .
unsigned int bottleStock = 10; //**unsigned**

int request=21;

if ((bottleStock - request) >= 0) { //always TRUE!!!
  bottleStock = bottleStock-request;
}

Review: Sign Extension

Run-time Overflow Detection for Addition

Takeaway

Run-time Overflow Detection for Subtraction

Takeaway

Review Points on Bit Length

Rounding Errors float to int

Takeway

To achieve common rounding when casting from float to int

Affecting Conversion Rounding Error with Numerator Bias

Assume unsigned int A,B;

Example:

Takeway

For positive integers, bias the numerator with half of the denominator to approximate division following by rounding

First go big or first go small: choices for order of operations affects result

Takeway

Enrichment: Impacts of Limited Precision in DSP Filters

Eliminating Overflow or Saturation with Input Prescaling

Increasing Effective Computation Precision with Input Prescaling

Log Scale

Avoid Unnecessary Operations

Saturation

Enrichment Topic: Soft Max

Instantaneous value mapping and affect on sinusoidal time series:

Enrichment Topic: Modeling Quantization / Round-off Noise

Enrichment Topic: Limit Cycles

Scaling by Powers of Two using Shift

Takeaway

Software Multiplication using Smaller Hardware Multipliers

Takeaway

Multiplication by Low-Density Constants

Takeaway

Rounded Integer Division

round(float(A)/float(B))=(A+B/2)/B|\rm round(float(A)/float(B))| = (|A|+|B/2|)/|B|

Takeaway

Rounded Division example with even denominator

 floating-point division: 7.0/float4=1.75 rounded result: round(7.0/float4)=2.0 cast result to int: int(7.0/float4)=1 bias float before cast: int((7.0/float4)0.5)=int((1.75)0.5)=int(2.25)=2 integer biasing: ((7+(4/2))/int4)=fix((7.0+(2))/float4)=fix(2.25)=2\begin{alignedat}{5} &\text { floating-point division: } && \rm && 7.0 \underset{float}/ -4 && &&= -1.75 \\ &\text { rounded result: } && \rm round(&& 7.0 \underset{float}/ -4 && ) &&= -2.0 \\ &\text { cast result to int: } && \rm int( && 7.0 \underset{float}/ -4 && ) &&= -1 \\ &\text { bias float before cast: } && \rm int(( && 7.0 \underset{float}/ -4 && ) \red{- 0.5}) &&= \\ & && \rm int(( && \blue{-1.75} &&) \red{- 0.5}) &&= \\ & && \rm int( && && \blue{-2.25}) &&= -2 \\ &\text { integer biasing: } && \rm (( && 7\red{+(4/2)}) \underset{\rm int}/ -4 && ) &&= \\ & && \rm fix(( && 7.0\red{+(2)}) \underset{\rm float}/ -4 && ) &&= \\ & && \rm fix( && \blue{-2.25} && ) &&= -2 \\ \end{alignedat}

Rounded Division example with odd denominator

 floating-point division: 8.0/float3=2.666... rounded result: round(8.0/float3)=3.0 cast result to int: int(8.0/float3)=2 bias float before cast: int((8.0/float3)0.5)=int(2.666...0.5)=int(3.16...)=3 integer biasing: ((8+(3/2))/int3)=fix((8.0+(1))/float3)=fix(3.0)=3\begin{alignedat}{5} &\text { floating-point division: } && \rm && 8.0 \underset{float}/ -3 && &&= -2.666... \\ &\text { rounded result: } && \rm round(&& 8.0 \underset{float}/ -3 && ) &&= -3.0 \\ &\text { cast result to int: } && \rm int( && 8.0 \underset{float}/ -3 && ) &&= -2 \\ &\text { bias float before cast: } && \rm int(( && 8.0 \underset{float}/ -3 && ) \red{- 0.5}) &&= \\ & && \rm int( && \blue{-2.666...} && \red{- 0.5} ) &&= \\ & && \rm int( && && \blue{-3.16...}) &&= -3 \\ &\text { integer biasing: } && \rm (( && 8\red{+(3/2)}) \underset{\rm int}/ -3 && ) &&= \\ & && \rm fix(( && 8.0\red{+(1)}) \underset{\rm float}/ -3 && ) &&= \\ & && \rm fix( && \blue{-3.0} && ) &&= -3 \\ \end{alignedat}

Floor vs Fix vs Round for Singed Values

Positive-Biased Value Before Fix

Negative-Biased Value Before Fix

Sign-Dependant-Bias before Fix

To Properly Mimic Signed-Integer Divided by Power of Two with Shift, a Sign-dependant Pre-Shift Bias is Required

Takeaway

Divide xx by 2k2^k is defined to be ROUND TOWARDS ZERO — mimicking this detail with arithmetic right-shift requires pre-biasing xx by adding 2k12^k-1: ( x + ((1<<k)-1) ) << k

Arbitrary Precision Arithmetic

Error Propagation

Error introduced by division:

Error propagation by multiplication:

Error introduced by multiplication:

Error propagated by division:

Error Metric:

Earlier examples shown to illustrate error generation and propagation with reordering of operatands
Keep the decision to go big or go small in mind as in each step of implementation in the next topic.
We will formally introduce an additional scale at each step and per operand.

Fixed-point arithmetic

Floating Point Math and Fixed-Point Math

Fixed-point arithmetic

QM.N notation

Addition and Subtraction with scale S (operands with same scale)

Multiplication with scale S

Division with scale S:

Bias before reducing precision

Example Fixed-Point Arith Library Code

Code examples in the below block are copied and provided under the Creative Commons Attribution-ShareAlike License: https://creativecommons.org/licenses/by-sa/3.0/

Wikipedia contributors. (2021, November 24). Q (number format). In Wikipedia, The Free Encyclopedia. Retrieved 18:10, November 29, 2021, from https://en.wikipedia.org/w/index.php?title=Q_(number_format)&oldid=1056933643

int16_t q_add_sat(int16_t a, int16_t b)
{
    int16_t result;
    int32_t tmp;

    tmp = (int32_t)a + (int32_t)b;
    if (tmp > 0x7FFF)
        tmp = 0x7FFF;
    if (tmp < -1 * 0x8000)
        tmp = -1 * 0x8000;
    result = (int16_t)tmp;

    return result;
}

// precomputed value:
#define K   (1 << (Q - 1))
 
// saturate to range of int16_t
int16_t sat16(int32_t x)
{
	if (x > 0x7FFF) return 0x7FFF;
	else if (x < -0x8000) return -0x8000;
	else return (int16_t)x;
}

int16_t q_mul(int16_t a, int16_t b)
{
    int16_t result;
    int32_t temp;

    temp = (int32_t)a * (int32_t)b; // result type is operand's type
    // Rounding; mid values are rounded up
    temp += K;
    // Correct by dividing by base and saturate result
    result = sat16(temp >> Q);

    return result;
}

int16_t q_div(int16_t a, int16_t b)
{
    /* pre-multiply by the base (Upscale to Q16 so that the result will be in Q8 format) */
    int32_t temp = (int32_t)a << Q;
    /* Rounding: mid values are rounded up (down for negative values). */
    /* OR compare most significant bits i.e. if (((temp >> 31) & 1) == ((b >> 15) & 1)) */
    if ((temp >= 0 && b >= 0) || (temp < 0 && b < 0)) {   
        temp += b / 2;    /* OR shift 1 bit i.e. temp += (b >> 1); */
    } else {
        temp -= b / 2;    /* OR shift 1 bit i.e. temp -= (b >> 1); */
    }
    return (int16_t)(temp / b);
}

Corrections