Lecture – Integers Representations and Arithmetic

Ryan Robucci

Lecture – Integers Representations and Arithmetic

References

$^\dagger$ A few extended topics images from CD Rom for Mitra, Sanjit Kumar, and Yonghong Kuo. Digital signal processing: a computer-based approach. Vol. 2. New York: McGraw-Hill, 2006.

Signed and Unsigned Integers

Consider $n$ ordered bits representing a value $X$

Unsigned Interpretation:
$\displaystyle\sum_{i=0}^{n-1}(2^i x_i)=$
$\displaystyle \textcolor{red}{2^{n-1} x_{n-1} } + \sum_{i=0}^{n-2}(2^i x_i )$

Signed Interpretation (MSB used for representing Negative Numbers)

$\displaystyle \textcolor{red}{-2^{n-1}x_{n-1} } + \sum_{i=0}^{n-2}(2^i x_i )$

Two’s complement uses the msb to weight/contribute a large negative number instead of a large positive value
Ex: 1001 is -8 + 1=-7
4-bit signed range is 0+4+2+1 =7 to -8+0+0+0 =-8
4-bit unsigned Range is 8+4+2+1=15 to 0+0+0+0 =0
If MSB, $x_{N-1}$ , is 0 the interpretation is the same
- unsigned: $X <= 2^{N-1}-1$
- signed: X positive , $X <= 2^{N-1}-1$
If MSB, $x^N-1$ , is 1, the interpretation is different
- unsigned: $X >= 2^{N-1}$
- signed: $X$ negative, $X >= -1 \times 2^{N-1}$
When the MSB is 1, the adjustment from one interpretation to the other is an offset of 2 times the most significant weight, $2 \times 2^{N-1}$ , which is $2^{N}$ .
- Subtract $2^N$ to go from unsigned to signed
- Add $2^N$ to go from signed to unsigned
  Ex: Receive bits from a UART 10000010
  as unsigned this appears to be 130, but detecting that it is >=128 and then
  subtracting 256 gives results in -126

Negative Numbers in Two’s Complement ( $2^N-X$ )

It is also useful to think about negative twos-complement values complement in this way:
- for a N-bit X, if the msb is 1 the value is the following difference: $2^{N}$ - $unsigned(X)
- for a 16-bit value X, if the msb is 1, the value is the distance from $2^{16}$ to the value from unsigned interpretation of X
Ex:
- 16’b1111_1111_1111_1101 is therefore easily seen to be -3: since X+2+1 overflows to $2^{16}$
- 16’b1111_1110_1111_1101 is therefore easily seen to be -259: since X+256+2+1 overflows to $2^{16}$

Two’s Complement Wheel

Helps visualize

bit interpretation
modular arithmetic
- e.g. $(X+Y)\%2^4$
overflow

image src:https://stackoverflow.com/questions/55145028/binary-ones-complement-in-python-3, though origin uncertain

adding (and subtracting) signed and unsigned numbers is no different at the bit/hardware level, represented as modular arithmetic

Conversion Arithmetic

What number is this: 1011? What data is stored in these bits: 0000 1000 0000 0001?
- The answer depends on the interpretation
- If neither the hardware nor software provides the required interpretation natively, you must facilitate interpretation by converting the data into a native format.
The following assumes word sizes are sufficient for the conversion arithmetic, e.g. working with 12-bit words in a 16-bit architecture
Interpreting unsigned as signed:
- To reinterpret $\textcolor{red}{(1)} 2^{N-1}$ as $\textcolor{red}{-(1)}2^{N-1}$ , subtract $2^N$
- If reading bits from a data stream, the subtraction by be necessary in software
- Example: a 12-bit two’s complement ADC value is read from a 16-bit I/O port as 12’b0000_1000_0000_0001. Using a default interpretation in C, printf("%d",X) displays the result as 2049. What is the actual value? Ans: -2047
Interpreting signed as unsigned:
- To reinterpret $\textcolor{red}{-(1)} 2^{N-1}$ as $(1) 2^{N-1}$ , add $2^N$
- Ex: if we print -2047 as unsigned, we get 2049

Conversion for 4-bit Data

Bits $x_3x_2x_1x_0$	Unsigned	relation	Signed
`1111`	15	$\xtofrom[+16]{-16}$	$\textcolor{red}{-1}=2^4-15$
`1110`	14	$\xtofrom[+16]{-16}$	$\textcolor{red}{-2}=2^4-14$
`1101`	13	$\xtofrom[+16]{-16}$	$\textcolor{red}{-3}=2^4-13$
`1100`	12	$\xtofrom[+16]{-16}$	$\textcolor{red}{-4}=2^4-12$
`1011`	11	$\xtofrom[+16]{-16}$	$\textcolor{red}{-5}=2^4-11$
`1010`	10	$\xtofrom[+16]{-16}$	$\textcolor{red}{-6}=2^4-10$
`1001`	9	$\xtofrom[+16]{-16}$	$\textcolor{red}{-7}=2^4-9$
`1000`	8	$\xtofrom[+16]{-16}$	$\textcolor{red}{-8}=2^4-8$
`0111`	$\textcolor{blue}{7}$	=	$\textcolor{blue}{+7}$
`0110`	$\textcolor{blue}{6}$	=	$\textcolor{blue}{+6}$
`0101`	$\textcolor{blue}{5}$	=	$\textcolor{blue}{+5}$
`0100`	$\textcolor{blue}{4}$	=	$\textcolor{blue}{+4}$
`0011`	$\textcolor{blue}{3}$	=	$\textcolor{blue}{+3}$
`0010`	$\textcolor{blue}{2}$	=	$\textcolor{blue}{+2}$
`0001`	$\textcolor{blue}{1}$	=	$\textcolor{blue}{+1}$
`0000`	$\textcolor{blue}{0}$	=	$\textcolor{blue}{+0}$

Visual Example:
Sine Wave transmitted using 5-bit data

Complement and Increment

Implementation of $2^N-X$ with operations defined on N-bit values
- $2^N$ has no N-bit representation
A two-step process can represent negation, without requiring extending the number of bits for the numerical symbols.
Steps:
1. Bit-Wise Complement
2. Increment
Let X and Y represent bits in hardware expressed as an unsigned interpretation, though they may store a two’s complement representation
First, recall that addition and subtraction with signed and unsigned numbers are exactly the same in hardware: (let $\underset{N}\boxplus$ $\underset{N}\boxminus$ be modular addition and subtraction)
- $(X \underset{N}\boxplus X) Y$ is $(X+Y) \% 2^N$ where $X$ and $Y$ are the vector of bits undergoing modular addition
Negating in two’s complement:
- -X: ( $2^N \underset{N}\boxminus X) X$ ) is $(2^N-X) \% 2^N = -X$
$2^N$ cannot be represented with N bits, and we might like to represent operations using symbols in the same domain, that of N-bit numbers.
- $2^N-X$ $2^{N} - X$ can be represented in two operations
  - Let ${\mathcal H}$ satisfy $2^N = {\mathcal H}+1$
    so that ${\mathcal H}$ is "all ones" ${\mathcal H} = (2^N-1) = {\underbrace{{111....11_2}}_{N\rm\,bits}}$
  - Then define negation $\underset{N}\boxminus X=(0) \underset{N}\boxminus X$ by the following function $\underset{N}\boxminus X: ({\mathcal H}\underset{N}\boxplus 1) \underset{N}\boxminus X = ({\mathcal H} \underset{N}\boxminus X) \underset{N}\boxplus 1$
  - Which when computed with operations equivalent to with modular arithmetic is $(({\mathcal H}-X) \% 2^N + 1) \% 2^N = -X$
Ex: 4-bit negation of 3
- $x \rightarrow -x$ is $x-2^4$ , where the magnitude of the negative number is $2^4-x$ ,
  but $2^4$ has no 4-bit representation
- $2^4 = 2^4 - 1 + 1 = 1111_2 + 0001_2$
- $3=0011_2$
- -3: $\begin{aligned} 2^4 - 3= &1111_2 & + 0001_2 &\red{- 0011_2}\\ =& 1111_2 & \red{ - 0011_2} &+ 0001_2\\ =& & \red{1100_2} &+ 0001_2\\ =& & \red{1101_2} \\ \end{aligned}$

Ones Complement

Ones complement representation represents negative numbers by just bit-complement
Ones complement is inverting bits:
- $\underbrace{\mathcal H}_{\mathclap{``all\ ones"}} \underset{N}\boxminus X$ , can be arithmetically described as $\underbrace{(2^N-1)}_{\mathclap{``all\ ones"}} \underset{N}\boxminus X$
- in modular arithmetic is
  $((2^N-1)-X )\% 2^N = -1-X$
Therefore, to generate a twos complement negative from a ones complement negation, just add 1
$(2^N-1)\underset{N}\boxminus X \underset{N}\boxplus1 = 2^N – X$
which as modular arithmetic is
$((2^N-1)-X +1)\%2^N = (2^N – X) \%2^N = -X$

Sign-Magnitude

For Sign-Magnitude representation, bit N-1 is the sign and the remaining bits are the magnitude
$x_{N-1}x_{N-2}x_{N-3}$
both ones complement and sign-magnitude suffer from 0 have two representations

data streams with typically small values on either side of zero involve less bit flipping between which can be valuable for power or digital compression applied along the bit columns

+4: 00000100

+3: 00000011

+2: 00000010

+1: 00000001

+0: 00000000

-1: 10000001

-2: 10000010

-3: 10000011

-4: 10000100

+4: 00000100

+3: 00000011

+2: 00000010

+1: 00000001

+0: 00000000

-1: 11111111

-2: 11111110

-3: 11111101

-4: 11111100

Note in the sequence here that several column never change

+4: 00000100
+0: 00000000
+3: 00000011
+3: 00000011
+3: 00000011
+3: 00000011
-3: 10000011
+0: 00000000
+1: 00000001
+0: 00000000
+1: 00000001
-3: 10000011
+1: 00000001
+0: 00000000
-1: 10000001
-2: 10000010
-3: 10000011
-3: 10000011
+3: 00000011
+2: 00000010
+3: 00000011
+2: 00000010
+1: 00000001
-4: 10000100
+0: 00000000
+1: 00000001
+0: 00000000
+1: 00000001
+1: 00000001
+0: 00000000
+1: 00000001
+0: 00000000
-1: 10000001
-2: 10000010
-3: 10000011
-3: 10000011
-3: 10000011
-4: 10000100
-4: 10000100

Long/short unsigned/signed Conversion and Casting Rules

Review Lecture Notes on C Type Casting

Signed/Unsigned Pitfall Examples

. . .
unsigned int bottleStock = 10; //**unsigned**
...
if (bottleStock >= 0) { //always TRUE!!!
  bottleStock = bottleStock-1;
}

. . .
unsigned int bottleStock = 10; //**unsigned**

int request=21;

if ((bottleStock - request) >= 0) { //always TRUE!!!
  bottleStock = bottleStock-request;
}

Review: Sign Extension

Review
- Sign Extension involves repeating msb
- Truncation is safe as long as alls truncated bit are the same as the surviving msb

Run-time Overflow Detection for Addition

Two’s complement addition of two numbers where the longest is N-bit can require up to N+1 bits for the result
Two’s complement addition can only overflow if the signs of the operands are the same
- overflow a+b
- positive overflow (>127) is blue
- negative overflow (<-128) is red
Overflow check: input sign bits are same and do not match result sign bit

y=a-b;
overflow=( (a>=0) && (b>=0) && (y<0) ) | ( (a<0) && (b<0) && (y>=0) );
May exclude input operand being zero
- overflow=( (a>0) && (b>0) && (y<0) ) | ( (a<0) && (b<0) && (y>=0) );
Using #include <limits.h>
- overflow=((a>0) && (b>(INT_MAX-a))) | ((a<0) && (b<(INT_MIN+a)));
Note, typically you can use an available overflow flag (e.g. V), but for software-only manipulation, or when using a language that doesn’t support access to hardware flags, or if the flags are not preserved you must resort to software techniques
- If possible convert data to native words (e.g. perform sign extension) to use hardware flags

Takeaway

Addition overflow possible only if input signs are same, in which case is indicated by the result is a different sign after the operation is performed
If native word-size is used, may be able to use hardware overflow flag
- Otherwise, might sign-extention first or otherwise need to provide software to check appropriate bits
A pre-overflow operation check on the input operands is give by example for SIGNED INT: ((a>0) && (b>(INT_MAX-a))) | ((a<0) && (b<(INT_MIN+a)))

Run-time Overflow Detection for Subtraction

Two’s complement subtraction can only under/overflow if the signs of the operands are different, otherwise the magnitude of the result must be smaller than the maximum magnitude of the two operands.
- a-b, overflow is colored
Overflow check: sign bits of the input operands are different and resulting sign bit does not match the first operand
- Otherwise stated: sign bit of the second operand and the result are the same and not equal to that of the first
- Otherwise stated: invert the sign bit of the second operand and use the bit test from addition overflow or negate (two’s complement) second operand and perform addition overflow test
- overflow=((a>=0) && (b<0) && (y<0)) | ((a<0) && (b>=0) && (y>=0));
- overflow=((a>0) && (b<0) && (y<0)) | ((a<0) && (b>0) && (y>=0));
As for addition, might use hardware overflow flag

Takeaway

Subtraction Overflow Test: either
- Invert the sign bit of the second operand and use the bit test from addition overflow OR
- negate (two’s complement) second operand and perform addition with addition overflow test

Review Points on Bit Length

Twos-complement addition of two N-bit numbers can require up to N+1 bits to store a full result including the overflow
- If one operand is shorter, take N to be the maximum of the two lengths
Two’s complement addition can only overflow if signs of operands are the same (likewise for subtraction the signs must be different)
Result of N-bit addition with overflow is dropping of MSBits’s: A+B = (A+B) mod (2^N)
For multiplication, multiplying two N-bit numbers requires up to 2N bits to store the operand. Multiplying a N-bit with a M-bit requires up to N+M bits.

Rounding Errors float to int

In C, Conversion from float to int is defined as truncation of fractional digits, effectively rounding towards zero
- $5.7 \rightarrow 5$
- $-5.7 \rightarrow -5$
Nearest Integer Rounding using Floating Point Biasing:
- Add +/- .5 before truncation depending on sign
- Positive Numbers:
  - $5.7+.5 = 6.2 \rightarrow 6$
  - $5.4+.5 = 5.9 \rightarrow 5$
  - $-5.7+.5 = -5.2 \rightarrow -5$ doesn’t work
- For negative numbers, need to subtract
  - $-5.7-.5 = -6.2 \rightarrow -6$
  - $-5.4-.5 = -5.9 \rightarrow -5$

Takeway

To achieve common rounding when casting from float to int

For positive floats add 0.5 before casting
For negative floats subtract 0.5 before casting

Affecting Conversion Rounding Error with Numerator Bias

Assume unsigned int A,B;

A/B produces floor(A/B) rather than round(A/B)
So (A+B/2)/B is how we get rounding with integer-only arithmetic
If B is odd, we need to choose to round B/2 up or down depending on application

Example:

Want to set serial baud rate using clock divider
i.e. BAUD_RATE=CLK_RATE/CLK_DIV
Option 1:
```
#define CLK_DIV (CLK_RATE/BAUD_RATE)
```
Option 2: BETTER CHOICE WITH INTEGER BIASING
```
#define CLK_DIV \
(CLK_RATE+(BAUD_RATE/2))/BAUD_RATE
```
(Note that \ is just the line continuation for c macros)

Takeway

For positive integers, bias the numerator with half of the denominator to approximate division following by rounding

First go big or first go small: choices for order of operations affects result

In general, if intermediate result will fit in the allowed integer word size, apply integer multiplications before integer divisions to avoid loss of precision (e.g. go big first unless overflow, in which case first go small )
Need to consider loss of MSB or LSB digits with intermediate terms and make choices base on application or expected ranges for input values
Example 1:
int i = 600, x = 54, y = 23;
- want i/x*y : true answer is 255.555…
  i/x*y result is 253 : good
  i*y/x result is 255 : better
  (i*y+x/2)/x result is 256 : best
Example 2:
- Assume
  - an architecture with 16-bit int,
  - unsigned int c = 7000, x=10,y=2;
  - want c*x/y which is truly 35000
- c*=x; overflows c since (c*x)>65535 resulting in x = 4465
- c/=y; get 2232 here from dividing 4465 by 2
- c/=y; c*=x; result is 3500*10=35000

Takeway

In general, if intermediate result will fit in the allowed integer word size, apply integer multiplications before integer divisions to avoid loss of precision

Enrichment: Impacts of Limited Precision in DSP Filters

Limited-precision representation of coefficients limits realizable system parameters
Images from Digital Signal Processing with Student CD ROM 4th Edition by Sanjit Mitra

Digital Signal Processing, A Computer-Based Approach Sanjit K. Mitra McGraw Hill

Digital Signal Processing, A Computer-Based Approach Sanjit K. Mitra McGraw Hill

Digital Signal Processing, A Computer-Based Approach Sanjit K. Mitra McGraw Hill
Direct-form II: very poor for high-pass below pass filters requiring pole near the real axis

Digital Signal Processing, A Computer-Based Approach Sanjit K. Mitra McGraw Hill

Digital Signal Processing, A Computer-Based Approach Sanjit K. Mitra McGraw Hill
Manipulation of the order of operations, or alternative numerical formulations can yield different possible system realizations Here is a equivalent system if infinite precision is used, but yields different results if parameters and calculations are of limited precision

Digital Signal Processing, A Computer-Based Approach Sanjit K. Mitra McGraw Hill

Digital Signal Processing, A Computer-Based Approach Sanjit K. Mitra McGraw Hill

Eliminating Overflow or Saturation with Input Prescaling

Overflow may occur at the output and at many points within a system
May be able to avoid internal overflow by pre-scaling (division) and post-scaling (multiplication)
If overflow is possible at the output of the accumulators and cannot be stored into the delay storage registers…
- solution is to pre-downscale inputs or include downscaling at internal points within the system (e.g. divide/right-shift x by a necessary factor), but at the cost of lesser effective precision in the result
The transfer function can be computed from the input to the output as well as any intermediate signal.
If the input is bounded, and the system is a bounded-input bounded-output, then multiplying by the gain of the transfer function to any point by the input magnitude reveals the range required at any point
To limit range (# bits) the input can be pre-scaled to, and if required the output can be post-scaled
- many times working with the scaled result downstream (no post-scale correction) is fine or better

Increasing Effective Computation Precision with Input Prescaling

In previous graph, can apply multiplication in pre-scaling and division for post-scaling if orginal-scale result is required (many times it isn’t)
By performing internal computations with larger values, numerical loss of accuracy at each step is mitigated
Ex:
- Code:
  - (a/b) > (c/5)
- Given
  - a=21
  - b=4
  - c=26
- Integer result is same
- Prescale:
  - 1000*(a/b) > 1000*(c/5)
- New C code:
  - ( (1000*a) / b) > ( (1000* c)/5)
- Result
  - ( (21000) / 4) > ( (26000)/5)
  - 5250 > 5200

Log Scale

Log Scales maintain relative accuracy and represent a useful domain change
- Compress large values
  - log10(100)=2,log10(1000)=3
- Expand small values
  - log2(1/2)=-1,log2(1/4)=-2
  - Common to use “log-likihood” in probability when there are many possibilities each with fractional probability
- Multiplication becomes addition, division becomes subtraction
Other monotonic functions can be used appropriate to store data with its square-root or its square

Avoid Unnecessary Operations

You might improve computational throughput and accuracy by avoiding unnecessary operations.
Example: Distance Comparison
- $\sqrt{a_x^2+a_y^2} >= \sqrt{b_x^2+b_y^2}$ $a_{x}^{2} + a_{y}^{2} >= b_{x}^{2} + b_{y}^{2}$
  - sqrt(pow(a_x,2)+pow(a_y,2)) > sqrt(pow(b_x,2)+pow(b_y,2))
- Note that sqrt monotonic function, therefore for this function f(x)
  - $(x>=y) \rightarrow f(x)>=f(y)$
- Simplify:
  - pow(a_x,2)+pow(a_y,2) > pow(b_x,2)+pow(b_y,2)

Saturation

Sometimes overflow is usual or practically unavoidable, and handing with detection followed by replacing result with limit is more appropriate than modulo arithmetic where small overflow results in a very large discrepancy
Detect overflow condition – override output to avoid LARGE roll-over error from modulus arithmetic
- If result exceeds $2^{N-2}-1$ replace with $2^{N-2}-1$
- If result exceeds $-2^{N-2}$ replace with $2^{N-2}$
- Schematic depiction (of required software)
Example:
- heat controller where heating element unsigned 8-bit input value 250 should increase by 10
- should detect and set result to 255 (error of -5), rather than compute 5 (error of -250)
Balanced Range to mitigate accumulated error
- many signal processing algorithms and control systems employ summation of values in a series
  - accumulation of error is important consideration
- physical processes can also be affected by accumulated error
  - i.e. accumulation effect is in physical domain
- Using limits $2^{N-2}-1$ $2^{N - 2} - 1$ and $-(2^{N-1})$ $- (2^{N - 1})$ under some conditions may cause bias (non-zero average error) , might be better to limit to $2^{m-2}-1$ $2^{m - 2} - 1$ and $-(2^{N-1}-1)$ $- (2^{N - 1} - 1)$
  - this creates balanced error accumulation

Enrichment Topic: Soft Max

Choosing Smoother Limiting Functions will smooth the estimated wave by “compressing” past a soft limit. The saturation error is smoothed and high-frequency distortion is mitigated.
A linearized approach can be implemented cheaply using a scale for signal components exceeding a soft limit (may involve scale by const or scale by power of two)
Smoother sigmoidal functions mimicking analog circuits can be used as well

Instantaneous value mapping and affect on sinusoidal time series:

Enrichment Topic: Modeling Quantization / Round-off Noise

When error is small compared to a signal, quantization error is model as noise
in fact it is nearly ideal to model the error as additive white noise when the error very small compared to the signal (high SNR) because the error appears to be uncorrelated with the input whereas the signal can appear to be related to the signal otherwise (see provided graphic)
A computation stage can be numerically modeled using an ideal calculation followed by additive error
Limited-precision effects are modeled as quantization error and analyzed as if noise is inserted into the system at the quantization/round-off step
Good SNR (signal to noise ratio) demands that the signal’s actual range in an application be large compared to the round-off error

Enrichment Topic: Limit Cycles

Limit Cycles are a special case of quantization (limited precision) error propagation of concern for systems with iterative feedback for which errors seem to propagated indefinitely over iterations
Feedback in computations, e.g. DSP IIR systems, can exhibit limit cycles, which are indefinite oscillatory responses from either round-off or overflow errors. This is possible from the feedback of the Limited-precision error, and repetitive rounding.
Errors may appear as constant outputs or +/- oscillations, even after the input becomes 0.

Digital Signal Processing, A Computer-Based Approach Sanjit K. Mitra McGraw Hill

Digital Signal Processing, A Computer-Based Approach Sanjit K. Mitra McGraw Hill

Digital Signal Processing, A Computer-Based Approach Sanjit K. Mitra McGraw Hill
In these images $\mathcal Q$ refers to a location where precision must be limited, e.g. limit precision (#bits).
In DSP, FIR filters do not exhibit this since they have no feedback (no memory)
Techniques to avoid unwanted limited cycle behavior or bound it usually involve increasing bit length of intermediate calculations, using increased bit length use feedback of error and feedback, introducing random quantization to mitigate perpetuating effects
Reference for Additional Examples:
- Discrete-Time Signal Processing (3rd Edition) (Prentice-Hall Signal Processing Series) 3rd Edition by Alan V. Oppenheim (Author), Ronald W. Schafer
  - Ex 6.15 steps through an example of this shows round-off error where the output oscillates
  - Ex 6.16 steps through an example of this shows overflow error- based oscillation which can be more sever

Scaling by Powers of Two using Shift

Multiplying by a positive integer power of two may be performed with the logical left-shift shift operator
Multiplying by $2^k$ $2^{k}$ is left shift by $k$ $k$ :
- a<<k
- Typically synthesizers require k to be a constant;
k bits on left are discarded
Detecting Overflow: no overflow if all discarded bits are the same and the new msb matches the truncated bits

Takeaway

Multiplying $a$ power of two, $2^k$ , can be performed with a left shift by $k$ , a<<k

Software Multiplication using Smaller Hardware Multipliers

8-bit input multipliers can be used to compute 16-bit input multiplication,

Many smaller partial products are computed and weighted or added into appropriately weighted output word
mathematically shown: Y = {AH,AL} x {BH,BL} = AHBH<<16+AHBL<<8+ALBH<<8+ALBL.

Here {AH,AL} is used to denote bit concatenation (not a C operator)

Compute 16-bit results:

$AL \times BL$
$AL \times BH$
$AH \times BL$
$AH \times BH$

Shift, Extend, Add
$\boxed{AH}\boxed{AL} \times \boxed{BH}\boxed{BL}=$

	${\rm byte3}$	${\rm byte2}$	${\rm byte1}$	${\rm byte0}$
	sign ext		$AL \times BL$
+	sign ext	$AH \times BL$		$0$
+	sign ext	$AL \times BH$		$0$
+	$AH \times BH$		$0$	$0$
=	$Y_{\rm byte3}$	$Y_{\rm byte2}$	$Y_{\rm byte1}$	$Y_{\rm byte0}$

See Application note AVR201: Using the AVR Hardware Multiplier: http://ww1.microchip.com/downloads/en/Appnotes/Atmel-1631-Using-the-AVR-Hardware-Multiplier_ApplicationNote_AVR201.pdf
Can generalize to larger word-sizes, maintaining separate words of varying weights
Can use Karatsuba Algorithm for fast multiplication of large numbers (outside the scope of this course)

Takeaway

Long multiplications can be mapped to hardware-supported multiplications by breaking the TWO input operands into supported input word-sizes, performing all associated pair-wise multiplications between the TWO inputs (like FIOL method ), and then adding the results with appropriate shifting to implement wights and sign extensions

Multiplication by Low-Density Constants

Multiplication by low-density constants (few non-zero bits) can be inexpensive:
$u\times24 = u \times (16+ 8) = u<<4 + u<<3$
Using example from previous slide: if the second operand is a constant, the synthesizer reduces the multiplication to shift and one adder:

Takeaway

Multiplying by low-density constants implemented with a few shift-add operations

Rounded Integer Division

Rounded-result division of integers A,B may be accomplished by adding a bias matching the sign of the result, such as by adding an offset(bias) to A that is half the magnitude of the divisor B (integer division half) and matches the sign of the numerator

$|\rm round(float(A)/float(B))| = (|A|+|B/2|)/|B|$

Why?: Because integer division is round towards 0, but if $|A\%B|$ is at least half of $|B|$ then we need to round away from 0, which can accomplished by effectively adding 0.5 to the magnitude of the result before rounding
A+(|B|*sign(A))/2 where sign(A) is 1 or -1 according to the sign of A
Exercise: make code to divide a integer (signed), S, by 256 with a rounded result
result=((S>=0)?(S+128):(S-128))/256;
Exercise: make code to divide a integer (signed), S, by 5 with a rounded result
result=((S>=0)?(S+2):(S-1))/5;
A hardware (RTL to netlist) synthesizer may only support division by powers of two and possibly only division by constants
float to int conversion is floating-point fix() operator (subtract fractional part) with type conversion discarding fractional part

Takeaway

int division x/y is same as floating point division followed by int conversion ROUND TOWARDS ZERO,
- Does not achieve nearest int rounding result, requires a sign-dependant bias such as the following:
  - $\rm (x+\textcolor{red}{sign(x)\times |y/2|})/y$ or
  - $\rm \Big(x+\textcolor{red}{\big((sign(x) \kern{3pt} !\kern{-3pt}= sign(y))?-1:1\big) \times y/2}\Big)/y$

Examples and plots follow

Rounded Division example with even denominator

$\begin{alignedat}{5} &\text { floating-point division: } && \rm && 7.0 \underset{float}/ -4 && &&= -1.75 \\ &\text { rounded result: } && \rm round(&& 7.0 \underset{float}/ -4 && ) &&= -2.0 \\ &\text { cast result to int: } && \rm int( && 7.0 \underset{float}/ -4 && ) &&= -1 \\ &\text { bias float before cast: } && \rm int(( && 7.0 \underset{float}/ -4 && ) \red{- 0.5}) &&= \\ & && \rm int(( && \blue{-1.75} &&) \red{- 0.5}) &&= \\ & && \rm int( && && \blue{-2.25}) &&= -2 \\ &\text { integer biasing: } && \rm (( && 7\red{+(4/2)}) \underset{\rm int}/ -4 && ) &&= \\ & && \rm fix(( && 7.0\red{+(2)}) \underset{\rm float}/ -4 && ) &&= \\ & && \rm fix( && \blue{-2.25} && ) &&= -2 \\ \end{alignedat}$

Rounded Division example with odd denominator

$\begin{alignedat}{5} &\text { floating-point division: } && \rm && 8.0 \underset{float}/ -3 && &&= -2.666... \\ &\text { rounded result: } && \rm round(&& 8.0 \underset{float}/ -3 && ) &&= -3.0 \\ &\text { cast result to int: } && \rm int( && 8.0 \underset{float}/ -3 && ) &&= -2 \\ &\text { bias float before cast: } && \rm int(( && 8.0 \underset{float}/ -3 && ) \red{- 0.5}) &&= \\ & && \rm int( && \blue{-2.666...} && \red{- 0.5} ) &&= \\ & && \rm int( && && \blue{-3.16...}) &&= -3 \\ &\text { integer biasing: } && \rm (( && 8\red{+(3/2)}) \underset{\rm int}/ -3 && ) &&= \\ & && \rm fix(( && 8.0\red{+(1)}) \underset{\rm float}/ -3 && ) &&= \\ & && \rm fix( && \blue{-3.0} && ) &&= -3 \\ \end{alignedat}$

Floor vs Fix vs Round for Singed Values

Positive-Biased Value Before Fix

Negative-Biased Value Before Fix

Sign-Dependant-Bias before Fix

To Properly Mimic Signed-Integer Divided by Power of Two with Shift, a Sign-dependant Pre-Shift Bias is Required

Divide x by $2^k$ is almost the same as an arithmetic right shift
discard k bits on the right and replicate sign bit k times on the left. Must use the arithmetic shift to perform sign extension; x>>>k; same as {k{x[msbindex]},x[msbindex:k]}
However, “integer division” is defined by truncation of the fractional bits of the result, also known as “round towards zero” To mimic this behavior more is needed:
- For a positive x, integer division corresponds to floor(x/n)
  - ex: 5/2 = 2
  - If x is positive you can just use logical shifting: x>>k;
- For a negative x, integer division corresponds to ceil(x/n)
  - ex: -5/2= -2
  - Note that arithmetic right-shift does not mimic this behavior
    $\begin{aligned} 5: & 0101_2 \\ ~5: & 1010_2 \\ -5: & 1011_2 \\ & 1011_2>>>1=1101_2 : \red{-3\not =-2} \end{aligned}$
If x is negative, we want ceil(x/n) which may be computed by applying a bias that is 1 lsb less than the divisor … this is just enough bias that with any fractional result the value becomes ‘‘bigger’’ than the next ‘‘bigger’’ negative integer so that:
- ${\rm floor} ((x+(2^k-1))/(2^k))$
- (x +((1<<k)-1)) >> k
- Example: (-5 + (2-1))/2=(-5+1)/2=-4/2=-2
  - (-5 +((1<<1)-1)) >> 1
  - (-5 + 1 ) >> 1

Takeaway

Divide $x$ by $2^k$ is defined to be ROUND TOWARDS ZERO — mimicking this detail with arithmetic right-shift requires pre-biasing $x$ by adding $2^k-1$ : ( x + ((1<<k)-1) ) << k

Arbitrary Precision Arithmetic

Embedded systems can implement larger word-length operations, which is a reasonable alternative to using a more complex processor if needs are not at timing-critical points in time.
For 16 bit calculations, an 8-bit architecture may support double-register arithmetic (e.g. use two registers to hold output of a 8X8 multiplication)
For even longer numbers results can be calculated a piece at a time and overflow bits (add/sum) or overflow registers (multiply) can be used to compute larger results. The built-in C variable types are usually automatically handled by the compiler. If even longer types are needed, find an arbitrary precision arithmetic software library.
So, don’t say that you need a 128-bit processor to perform 128-bit arithmetic.
- Its just that implementing 128-bit arithmetic with a 8-bit processor via could be slow.
- Take for example performing two 128-bit operations a second when using an 8-bit processor.
  - That shouldn’t be a problem (even at only 8 MHz) and is not cause to change to a different hardware platform.

Error Propagation

Error introduced by division:

Let $q$ be ideal, errorless $q=i/x$
Let computed value from integer division, $i\underset{\rm int}{\boxed{/}}x$ , be $\hat{q}$ and the associated error be $\Delta q$ , such that
- $\hat{q}= q + \Delta q$ ,
- relative magnitude of error of computed to ideal is related to input operands and precision of computation

Error propagation by multiplication:

A following multiplication operation $\times y$ multiplies the error up to that point:
$\hat{q} \times y = (q + \Delta q)\times y = \underbrace{q \times y}_{\rm ideal} + \underbrace{\Delta q \times y}_{\rm multiplied\ error}$

Error introduced by multiplication:

Let $m$ be ideal, errorless $m= i \times x$
Let computed value from integer multiplication be, $i\underset{\rm int}{\boxed{*}}x$ $i int * x$ , be $\hat{m}$ $\overset{m}{^}$ and the associated error be $\Delta m$ $Δ m$ , such that
$\hat{m}= m + \Delta m$ $\overset{m}{^} = m + Δ m$
- error could be caused by overflow->saturation
- error could be zero if no overflow can occur

Error propagated by division:

A division, $/ y$ $/ y$ , following scales the error up to that point:
- $\hat{m} / y = (m + \Delta m) / y = \underbrace{m / y}_{\rm ideal} + \underbrace{\Delta m / y}_{\rm scaled\ error}$

Error Metric:

May measure and model error propagation using
- the worse case error
  - same penalty for rare error events as common events
- average magnitude of error
  - rare events of large error don’t matter as much
- root-mean-square error ( a.k.a RMS error, or standard deviation $\sigma$ $σ$ )
  - like previous, but larger errors are penalized proportionally more than small errors

Earlier examples shown to illustrate error generation and propagation with reordering of operatands
Keep the decision to go big or go small in mind as in each step of implementation in the next topic.
We will formally introduce an additional scale at each step and per operand.

Fixed-point arithmetic

Floating Point Math and Fixed-Point Math

If no floating point unit (FPU) is available, you can find a floating point software library. This will likely be slow.
Another option is fixed-point math. You can write or use a library or just do it as needed inline….

Fixed-point arithmetic

Problem: Want to use integer operations to represent adding 0011.1110 and 0001.1000
Solution (by example with explanation to follow):
- Store 0011.1110 as 00111110
- Store 0001.1000 as 00011000
- Add 00011000 + 00111110 = 01010110
- Interpret 01010110 as 0101.0110

QM.N notation

Can use common QM.N notation: M+N bits, with M bits as whole part and N bits as fractional part.
To design the computation, must first determine number of bits to use for whole and fraction parts depending on range and precision needed. This determines the scale factors required to convert the faction to a whole number
QM.N corresponds to $S = 2^N$
Example
1101.0000 * 16 = 11010000 S=16, Q4.4
01.011000 * 64 = 01011000 S=64, Q2.6

Addition and Subtraction with scale S (operands with same scale)

$A+B$ computed as $A\times S \underbrace{+}_{\mathclap{\text{integer addition}}} B\times S=C\times S$ , so that the addition is an integer addition
Interpret result by dividing output by S to obtain answer C
using a power of two for S is efficient, though not required
$A-B$ computed as $A\times S-B\times S=C\times S$
Divide result by S to interpret answer C
Example: Check if addition of 0.17 meters and 0.24 meters is greater than .9 meters

int Y_S100 = (.9 * 100);
int A_S100 = (0.17 * 100);
int B_S100 = (0.24 * 100);
int C_S100 = A_S100+B_S100; //integer addition
int flag = C_S100 > Y_S100; //integer subtraction/comparison

To be human understandable, an example presented with powers of 10 for scaling, but powers of 2 are usually appropriate for efficiency and predictability of errors and error propagation.

Multiplication with scale S

$A*B$ computed as $(A \times S)\underbrace{\times}_{\mathclap{\text{integer multiplication}}} (B \times S)=C \times S \times S$
- Divide result by $S^2$ for final interpretation
- Or to maintain S-Scale representation for C for further computations, divide result by S
Unfortunately, the intermediate result $C \times S \times S$ required more storage than the scaled result $C \times S$

Example: Check if area of rectangle with sides of length 0.17 meters and 0.24 meters is greater than the area of a square with sides of length .2 meters

int AREASQ_S10000 = (.2 * 100) * (.2 * 100);
int AREASQ_S100 = (.2 * .2 * 100);

int A_S100 = (0.17 * 100);
int B_S100 = (0.24 * 100);

int ARECT_S10000 = A_S100*B_S100; //integer multiplication
int flag0 = ARECT_S10000 > AREASQ_S10000; //integer subtraction/comparison

int ARECT_S100 = ARECT_S10000/100; //convert to scale 100
int flag1 = ARECT_S100   > AREASQ_S100; //integer subtraction/comparison

To be human understandable, an example presented with powers of 10 for scaling, but powers of 2 are usually appropriate for efficiency and predictability of errors and error propagation.

Division with scale S:

A/B could be computed as $(A \times S)\underbrace{/}_{\mathclap{\text{integer division}}}(B \times S)=C$
Scales cancel. Which is fine if you only wanted an integer answer
Would need to multi by S to obtain scaled result $C\times S$ for further math
- …but this is less accurate since the lower bits have already been lost
For precision, better to prescale one of the operands $((A\times S)\times S)/(B\times S)=C\times S$
- Unfortunately, the intermediate term $((A \times S)\times S)$ requires more storage and a larger computation
For rounding, remember to apply numerator pre bias of $\rm sign(\rm numerator) \times 1/2 \times |\rm denominator|$ when performing integer division

Example:
Circumference to Radius
Note that // is integer division in Python3

from math import *
import numpy as np
import pyforest

tau = pi * 2
#tau

C=tau+2.8 # some value for circumference 
          #(note that adding was intentional here just to produce an interesting number 9.0831853072)

#C

C8=floor(C*8)
#C*8,C8,C8/8

C256=floor(C*8*8)
#C*8*8,C256,C256/(8*8)

tau8=floor(tau * 8)
#tau8,tau8/8

halftau8=floor(tau/2 * 8) //precomputed bias
#halftau8,halftau8/8

print("C                               = %20.10f"%C) 
print("C/tau                           = %20.10f"%((C)/(tau))) 

print("C8/tau8                         = %20.10f"%((C8)//(tau8)))

print("  with pre-div bias             : %20.10f"%((C8+halftau8)//(tau8)))

print("  with prescale                 : %20.10f"%((C8*8)//(tau8)/8))
print("  with prescale and pre-div bias: %20.10f"%((C8*8+halftau8)//(tau8)/8))
print("C256/tau8                       = %20.10f"%((C256)//(tau8)/8))
print("  with pre-div bias             : %20.10f"%((C256+halftau8)//(tau8)/8))
print("---")

C                               =         9.0831853072
C/tau                           =         1.4456338407
C8/tau8                         =         1.0000000000
  with pre-div bias             :         1.0000000000
  with prescale                 :         1.3750000000
  with prescale and pre-div bias:         1.5000000000
C256/tau8                       =         1.3750000000
  with pre-div bias             :         1.5000000000

Bias before reducing precision

Reducing precision will involve division, typically achived using right-shift
Therefore follow prebiasing rule for division
When removing lesser significant bits, perform biasing of 1/2 weight of surviving position according to the sign of the value before shifting to achieve rounding
- if dividing by $2^n$ , before right shifting by n, add or subtract (1 << (n - 1))
Examples converting positive value Q4.4 to Q6.2:
- Will need to loss bits in 1/8 and 1/16 position, with new lsb position of 1/4
- For positive numbers need to add 1/8th before truncation
- For negative numbers need to subtract 1/8th before truncation
$4 \frac{1}{8}$ , 4.125, 0010.0010, stored as 00100010
- Without prebias for correct rounding:
  - right shift by 2 results in 00001000 ,lost ending bits 10
  - interpreted as 4.0
- With prebias for correct rounding:
  - $4 \frac{1}{8} + \frac{1}{8} = 4 \frac{1}{4}$ $4 \frac{1}{8} + \frac{1}{8} = 4 \frac{1}{4}$
    - 00100010 + 00000010 = 00100100
  - right shift by 2: 00001001
  - interpreted as 4.25 (4.125 rounded to the nearst $\frac{1}{4}$ )
$-1\frac{1}{16}$ , 1110.1111 stored as 1110 1111
- Without prebias:
  - (Arithmetic) right shift by 2 results in 11111011 ,lost ending bits 11
  - interpreted as $-1 \frac{1}{4}$
- With prebias:
  - $1 \frac{1}{16} + \frac{1}{8}$ $1 \frac{1}{16} + \frac{1}{8}$
    - 1110 1111 + 1111 1110 = 1111 0001
  - right shift by 2: 1111 1100
  - interpreted as -1 ( $-1\frac{1}{16}$ rounded to the nearst $\frac{1}{4}$ )

Example Fixed-Point Arith Library Code

Q (number format) - Wikipedia
- reviewed code in-class, provided below

Code examples in the below block are copied and provided under the Creative Commons Attribution-ShareAlike License: https://creativecommons.org/licenses/by-sa/3.0/

Wikipedia contributors. (2021, November 24). Q (number format). In Wikipedia, The Free Encyclopedia. Retrieved 18:10, November 29, 2021, from https://en.wikipedia.org/w/index.php?title=Q_(number_format)&oldid=1056933643

int16_t q_add_sat(int16_t a, int16_t b)
{
    int16_t result;
    int32_t tmp;

    tmp = (int32_t)a + (int32_t)b;
    if (tmp > 0x7FFF)
        tmp = 0x7FFF;
    if (tmp < -1 * 0x8000)
        tmp = -1 * 0x8000;
    result = (int16_t)tmp;

    return result;
}

// precomputed value:
#define K   (1 << (Q - 1))
 
// saturate to range of int16_t
int16_t sat16(int32_t x)
{
	if (x > 0x7FFF) return 0x7FFF;
	else if (x < -0x8000) return -0x8000;
	else return (int16_t)x;
}

int16_t q_mul(int16_t a, int16_t b)
{
    int16_t result;
    int32_t temp;

    temp = (int32_t)a * (int32_t)b; // result type is operand's type
    // Rounding; mid values are rounded up
    temp += K;
    // Correct by dividing by base and saturate result
    result = sat16(temp >> Q);

    return result;
}

int16_t q_div(int16_t a, int16_t b)
{
    /* pre-multiply by the base (Upscale to Q16 so that the result will be in Q8 format) */
    int32_t temp = (int32_t)a << Q;
    /* Rounding: mid values are rounded up (down for negative values). */
    /* OR compare most significant bits i.e. if (((temp >> 31) & 1) == ((b >> 15) & 1)) */
    if ((temp >= 0 && b >= 0) || (temp < 0 && b < 0)) {   
        temp += b / 2;    /* OR shift 1 bit i.e. temp += (b >> 1); */
    } else {
        temp -= b / 2;    /* OR shift 1 bit i.e. temp -= (b >> 1); */
    }
    return (int16_t)(temp / b);
}

Corrections

2022-08: error propagation: comared to ideal -> error of computed to ideal

Lecture – Integers Representations and Arithmetic

Table of Contents

References

Signed and Unsigned Integers

Negative Numbers in Two’s Complement (2N−X2^N-X2N−X)

Two’s Complement Wheel

Conversion Arithmetic

Complement and Increment

Ones Complement

Sign-Magnitude

Long/short unsigned/signed Conversion and Casting Rules

Signed/Unsigned Pitfall Examples

Review: Sign Extension

Run-time Overflow Detection for Addition

Run-time Overflow Detection for Subtraction

Review Points on Bit Length

Rounding Errors float to int

Affecting Conversion Rounding Error with Numerator Bias

First go big or first go small: choices for order of operations affects result

Enrichment: Impacts of Limited Precision in DSP Filters

Eliminating Overflow or Saturation with Input Prescaling

Increasing Effective Computation Precision with Input Prescaling

Log Scale

Avoid Unnecessary Operations

Saturation

Enrichment Topic: Soft Max

Enrichment Topic: Modeling Quantization / Round-off Noise

Enrichment Topic: Limit Cycles

Scaling by Powers of Two using Shift

Software Multiplication using Smaller Hardware Multipliers

Multiplication by Low-Density Constants

Rounded Integer Division

Rounded Division example with even denominator

Rounded Division example with odd denominator

Floor vs Fix vs Round for Singed Values

Positive-Biased Value Before Fix

Negative-Biased Value Before Fix

Sign-Dependant-Bias before Fix

To Properly Mimic Signed-Integer Divided by Power of Two with Shift, a Sign-dependant Pre-Shift Bias is Required

Arbitrary Precision Arithmetic

Error Propagation

Fixed-point arithmetic

Floating Point Math and Fixed-Point Math

Fixed-point arithmetic

QM.N notation

Addition and Subtraction with scale S (operands with same scale)

Multiplication with scale S

Division with scale S:

Bias before reducing precision

Example Fixed-Point Arith Library Code

Corrections

Negative Numbers in Two’s Complement ( $2^N-X$ )