Lecture 16 – Signed Integers and Arithmetic Part I

Ryan Robucci

References

$^\dagger$ A few extended topics images from CD Rom for Mitra, Sanjit Kumar, and Yonghong Kuo. Digital signal processing: a computer-based approach. Vol. 2. New York: McGraw-Hill, 2006.

Signed and Unsigned Integers

Consider $n$ ordered bits representing a value $X$

Unsigned Interpretation:
$\displaystyle\sum_{i=0}^{n-1}(2^i x_i)=$
$\displaystyle \textcolor{red}{2^{n-1} x_{n-1} } + \sum_{i=0}^{n-2}(2^i x_i )$

Signed Interpretation (MSB used for representing Negative Numbers)

$\displaystyle \textcolor{red}{-2^{n-1}x_{n-1} } + \sum_{i=0}^{n-2}(2^i x_i )$

Two's complement uses the msb to weight/contribute a large negative number instead of a large positive value
Ex: 1001 is -8 + 1=-7
4-bit signed range is 0+4+2+1 =7 to -8+0+0+0 =-8
4-bit unsigned Range is 8+4+2+1=15 to 0+0+0+0 =0
If MSB, $x_{N-1}$ , is 0 the interpretation is the same
- unsigned: $X \le 2^{N-1}-1$
- signed: $X$ non-negative, $X \le 2^{N-1}-1$
If MSB, $x_{N-1}$ , is 1, the interpretation is different
- unsigned: $X \ge 2^{N-1}$
- signed: $X$ negative, $X \ge -1 \times 2^{N-1}$
When the MSB is 1, the adjustment from one interpretation to the other is an offset of 2 times the most significant weight, $2 \times 2^{N-1}$ , which is $2^{N}$ .
- Subtract $2^N$ to go from unsigned to signed
- Add $2^N$ to go from signed to unsigned
  Ex: Receive bits from a UART 10000010
  as unsigned this appears to be 130, but detecting that it is >=128 and then
  subtracting 256 gives -126
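
The UART example can be sketched in C as follows. The helper name `byte_to_signed` is hypothetical, chosen only for illustration; it applies the "subtract $2^N$ when the MSB is set" rule for N=8, doing the arithmetic in a wider `int` so the subtraction cannot overflow.

```c
#include <stdint.h>

/* Reinterpret an 8-bit unsigned reading as 8-bit two's complement.
   Hypothetical helper, not part of the lecture material. */
int byte_to_signed(uint8_t raw)
{
    int value = raw;      /* 0..255, held in a wider type */
    if (value >= 128) {   /* MSB set: subtract 2^8 */
        value -= 256;
    }
    return value;
}
```

Calling `byte_to_signed(0x82)` corresponds to the received bits 10000010.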

Negative Numbers in Two's Complement ( $2^N-X$ )

It is also useful to think about negative two's complement values in this way:
- for an N-bit X, if the msb is 1, the magnitude of the negative value is the difference $2^{N} - unsigned(X)$
- for a 16-bit bit-vector X, if the msb is 1, the magnitude is the distance from the unsigned interpretation of X up to $2^{16}$
Ex:
- 16'b1111_1111_1111_1101 = -3 ; X+2+1 overflows to $2^{16}$
- 16'b1111_1110_1111_1101 = -259 ; X+256+2+1 overflows to $2^{16}$
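
As a rough check of the two 16-bit examples, the distance to $2^{16}$ can be computed directly. This is a minimal sketch; `neg_magnitude16` is a hypothetical name, and the result is only meaningful when the msb of the input is 1.

```c
#include <stdint.h>

/* Magnitude of a negative 16-bit two's complement pattern:
   the distance from its unsigned reading up to 2^16.
   Hypothetical helper; intended for inputs with the MSB set. */
uint32_t neg_magnitude16(uint16_t x)
{
    return (UINT32_C(1) << 16) - (uint32_t)x;
}
```

Here `0xFFFD` and `0xFEFD` are the hexadecimal forms of the two bit patterns listed in the example.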

Two's Complement Wheel

Helps visualize:
- bit interpretation
- modular arithmetic, e.g. $(X+Y)\%2^4$
- overflow

image src:https://stackoverflow.com/questions/55145028/binary-ones-complement-in-python-3, though origin uncertain

Adding (and subtracting) signed and unsigned numbers is no different at the bit/hardware level; both are modular arithmetic on the same bit patterns
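
A small sketch of this idea for the 4-bit wheel: one modular adder operates on raw bit patterns, and only the final reading differs. The function names `add4` and `as_signed4` are assumptions made for this illustration.

```c
#include <stdint.h>

/* 4-bit modular addition on raw bit patterns: (x + y) % 2^4. */
uint8_t add4(uint8_t x, uint8_t y)
{
    return (uint8_t)((x + y) & 0x0F);
}

/* Read a 4-bit pattern as two's complement (-8..7). */
int as_signed4(uint8_t bits)
{
    bits &= 0x0F;
    return (bits & 0x08) ? (int)bits - 16 : (int)bits;
}
```

For example, `add4(0x9, 0x2)` yields the pattern 1011, which reads as 11 unsigned (9+2) and as -5 signed (-7+2).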

Conversion Arithmetic

The following assumes word sizes are sufficient for the conversion arithmetic, e.g. working with 12-bit words in a 16-bit architecture
Interpreting unsigned as signed:
- To reinterpret $\textcolor{red}{(1)} 2^{N-1}$ as $\textcolor{red}{-(1)}2^{N-1}$ , subtract $2^N$
- If reading bits from a data stream, the subtraction may be necessary in software
- Example: a 12-bit two's complement ADC value is read from a 16-bit I/O port as 16'b0000_1000_0000_0001. Using a default interpretation in C, printf("%d",X) displays the result as 2049. What is the actual value? Ans: 2049 - 4096 = -2047
Interpreting signed as unsigned:
- To reinterpret $\textcolor{red}{-(1)} 2^{N-1}$ as $(1) 2^{N-1}$ , add $2^N$
- Ex: if we print -2047 as unsigned, we get 2049
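
The ADC example might be handled in software along these lines. This is a sketch under the assumption that the upper 4 bits of the port carry no data; the names `adc12_to_int` and `int_to_adc12` are hypothetical.

```c
#include <stdint.h>

/* Interpret the low 12 bits of a 16-bit port read as two's complement.
   Hypothetical helper; assumes the upper 4 bits carry no data. */
int adc12_to_int(uint16_t port)
{
    int value = port & 0x0FFF;   /* keep the 12 data bits: 0..4095 */
    if (value >= 2048) {         /* bit 11 set: subtract 2^12 */
        value -= 4096;
    }
    return value;
}

/* Reverse direction: add 2^12 to a negative value to get the unsigned reading.
   Assumes -2048 <= value <= 2047. */
unsigned int_to_adc12(int value)
{
    return (unsigned)(value < 0 ? value + 4096 : value);
}
```

Because the word size (16 or more bits in an `int`) exceeds the 12-bit data width, the add and subtract of $2^{12}$ cannot overflow.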

Conversion for 4-bit Data

Bits $x_3x_2x_1x_0$	Unsigned	relation	Signed
`1111`	15	$\xtofrom[+16]{-16}$	$\textcolor{red}{-1}=2^4-15$
`1110`	14	$\xtofrom[+16]{-16}$	$\textcolor{red}{-2}=2^4-14$
`1101`	13	$\xtofrom[+16]{-16}$	$\textcolor{red}{-3}=2^4-13$
`1100`	12	$\xtofrom[+16]{-16}$	$\textcolor{red}{-4}=2^4-12$
`1011`	11	$\xtofrom[+16]{-16}$	$\textcolor{red}{-5}=2^4-11$
`1010`	10	$\xtofrom[+16]{-16}$	$\textcolor{red}{-6}=2^4-10$
`1001`	9	$\xtofrom[+16]{-16}$	$\textcolor{red}{-7}=2^4-9$
`1000`	8	$\xtofrom[+16]{-16}$	$\textcolor{red}{-8}=2^4-8$
`0111`	$\textcolor{blue}{7}$	=	$\textcolor{blue}{+7}$
`0110`	$\textcolor{blue}{6}$	=	$\textcolor{blue}{+6}$
`0101`	$\textcolor{blue}{5}$	=	$\textcolor{blue}{+5}$
`0100`	$\textcolor{blue}{4}$	=	$\textcolor{blue}{+4}$
`0011`	$\textcolor{blue}{3}$	=	$\textcolor{blue}{+3}$
`0010`	$\textcolor{blue}{2}$	=	$\textcolor{blue}{+2}$
`0001`	$\textcolor{blue}{1}$	=	$\textcolor{blue}{+1}$
`0000`	$\textcolor{blue}{0}$	=	$\textcolor{blue}{+0}$

Visual Example: Sine Wave transmitted using 5-bit data:

Two's Complement as Complement-and-Increment

A two-step process can represent negation, without requiring extending the number of bits for the numerical symbols.
Steps:
1. Bit-Wise Complement
2. Increment
Let X and Y represent bits in hardware expressed as an unsigned interpretation, though they may store a two's complement representation
First, recall that addition and subtraction with signed and unsigned numbers are exactly the same in hardware: (let $\underset{N}\boxplus$ $\underset{N}\boxminus$ be modular addition and subtraction)
- $X \boxplus Y$ is $(X+Y) \% 2^N$ where $X$ and $Y$ are the vector of bits undergoing modular addition
Negating in two’s complement:
- -X: ( $2^N \boxminus X$ ) is $(2^N-X) \% 2^N = -X$
$2^N$ cannot be represented with N bits, and we might like to represent operations using symbols in the same domain, that of N-bit numbers.
- $2^{N} - X$ can be represented in two operations
  - Let ${\mathcal H}$ satisfy $2^N = {\mathcal H}+1$
    so that ${\mathcal H} = (2^N-1) = {\underbrace{{111....11_2}}_{N\rm\,bits}}$
  - Then define negation by the following function $\underset{N}\boxminus X: ({\mathcal H}\underset{N}\boxplus 1) \underset{N}\boxminus X = ({\mathcal H} \underset{N}\boxminus X) \underset{N}\boxplus 1$
  - Which in modular arithmetic is $(({\mathcal H}-X) \% 2^N + 1) \% 2^N = -X$
Ex: 4-bit negation of 3
- $-x$ is represented by the bit pattern whose unsigned value is $2^4-x$ (read as signed, that pattern is $(2^4-x)-2^4=-x$ ),
  but $2^4$ has no 4-bit representation
- $2^4 = 2^4 - 1 + 1 = 1111_2 + 0001_2$
- $3=0011_2$
- -3: $\begin{aligned} 2^4 - 3= &1111_2 & + 0001_2 &\red{- 0011_2}\\ =& 1111_2 & \red{ - 0011_2} &+ 0001_2\\ =& & \red{1100_2} &+ 0001_2\\ =& & \red{1101_2} \\ \end{aligned}$
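
The two steps can be sketched in C for a general width. The helper `negate_nbit` is a hypothetical name; it assumes 1 <= n <= 31 and masks every intermediate so that no value outside the N-bit domain is needed.

```c
#include <stdint.h>

/* Negate an N-bit pattern (1 <= n <= 31) by complement-and-increment.
   Hypothetical helper, written to mirror the derivation with H = 2^N - 1. */
uint32_t negate_nbit(uint32_t x, unsigned n)
{
    uint32_t all_ones = (UINT32_C(1) << n) - 1;        /* H = 2^N - 1 */
    uint32_t complement = all_ones - (x & all_ones);   /* step 1: H - X flips every bit */
    return (complement + 1) & all_ones;                /* step 2: increment, modulo 2^N */
}
```

With n=4 and x=3 this follows the same path as the worked example: 0011 is complemented to 1100 and then incremented to 1101. Note that the most negative pattern (1000 for n=4) negates to itself, since +8 is not representable in 4 bits.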

Ones Complement

Ones complement representation represents negative numbers by just bit-complement
Ones complement is inverting bits:
- $\underbrace{\mathcal H}_{\mathclap{``all\ ones"}} \underset{N}\boxminus X$ , can be arithmetically described as $\underbrace{(2^N-1)}_{\mathclap{``all\ ones"}} \underset{N}\boxminus X$
- in modular arithmetic is
  $((2^N-1)-X )\% 2^N = -1-X$
Therefore, to generate a two's complement negative from a ones complement negation, just add 1
$(2^N-1)\underset{N}\boxminus X \underset{N}\boxplus1 = 2^N - X$
which as modular arithmetic is
$((2^N-1)-X +1)\%2^N = (2^N - X) \%2^N = -X$

Sign-Magnitude

For Sign-Magnitude representation, bit N-1 is the sign and the remaining bits are the magnitude
$x_{N-1}x_{N-2}x_{N-3}\ldots x_0$
Both ones complement and sign-magnitude suffer from zero having two representations (+0 and -0)

With sign-magnitude, data streams with typically small values on either side of zero involve less bit flipping between successive samples, which can be valuable for power or for digital compression applied along the bit columns

Value	Sign-magnitude	Two's complement
+4	00000100	00000100
+3	00000011	00000011
+2	00000010	00000010
+1	00000001	00000001
+0	00000000	00000000
-1	10000001	11111111
-2	10000010	11111110
-3	10000011	11111101
-4	10000100	11111100

Note in the sign-magnitude sequence below that several bit columns (here bits 6 through 3) never change

+4: 00000100
+0: 00000000
+3: 00000011
+3: 00000011
+3: 00000011
+3: 00000011
-3: 10000011
+0: 00000000
+1: 00000001
+0: 00000000
+1: 00000001
-3: 10000011
+1: 00000001
+0: 00000000
-1: 10000001
-2: 10000010
-3: 10000011
-3: 10000011
+3: 00000011
+2: 00000010
+3: 00000011
+2: 00000010
+1: 00000001
-4: 10000100
+0: 00000000
+1: 00000001
+0: 00000000
+1: 00000001
+1: 00000001
+0: 00000000
+1: 00000001
+0: 00000000
-1: 10000001
-2: 10000010
-3: 10000011
-3: 10000011
-3: 10000011
-4: 10000100
-4: 10000100
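
To put a number on the bit-flipping claim, one could count differing bits between successive samples under each encoding. The following is a minimal sketch with hypothetical helper names, limited to 8-bit values.

```c
#include <stdint.h>
#include <stdlib.h>

/* Encode a small integer (-127..127) as 8-bit sign-magnitude. */
uint8_t to_sign_magnitude(int v)
{
    return (uint8_t)((v < 0 ? 0x80 : 0x00) | (abs(v) & 0x7F));
}

/* Encode a small integer (-128..127) as 8-bit two's complement.
   Conversion to an unsigned type is defined as reduction modulo 2^8. */
uint8_t to_twos_complement(int v)
{
    return (uint8_t)v;
}

/* Count the bits that differ between two 8-bit patterns. */
int bit_flips(uint8_t a, uint8_t b)
{
    int count = 0;
    for (uint8_t diff = a ^ b; diff != 0; diff >>= 1) {
        count += diff & 1;
    }
    return count;
}
```

A step from +0 to -1 flips 2 bits in sign-magnitude (00000000 to 10000001) but all 8 bits in two's complement (00000000 to 11111111). Summing `bit_flips` over consecutive samples of a stream gives a simple activity estimate for each encoding.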

Truncation and Extension for Integers (in C)

In C, when an operator receives a mix of signed and unsigned operands of the same rank, the signed operand is first converted to unsigned. The conversion does no range checking or clamping (e.g. -1 $\rightarrow$ a large positive number).
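
A short sketch of what that conversion does to -1. The function name `mixed_add_zero` is hypothetical; adding an unsigned zero is used only to trigger the conversion of the signed operand.

```c
#include <limits.h>

/* Show what a signed value becomes when an operator mixes int with unsigned int.
   Hypothetical helper: the "+ zero" forces the usual arithmetic conversions. */
unsigned int mixed_add_zero(int s)
{
    unsigned int zero = 0u;
    return s + zero;   /* s is converted to unsigned int before the add */
}

/* Returns 1 when the converted value equals the largest unsigned int. */
int becomes_uint_max(int s)
{
    return mixed_add_zero(s) == UINT_MAX;
}
```

For `s = -1` the converted value is `UINT_MAX`, which is why comparisons such as "-1 < 1u" can give surprising results.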

Signed/Unsigned Casting may be followed by length adjustment.

Extension/truncation of signed value:

Truncation

Assignment to a shorter type is always just bit truncation
- no Python-like smart rounding such as unsigned(-1) → 0
(Left) Truncation Errors:
- For unsigned, truncation is not a problem as long as all the truncated bits are 0
  - ex: 0000101 (5) , can truncate at most 4 bits (leaving 101)
- For signed, truncation is no problem as long as all the bits truncated are the same AND they match the surviving msb
  - ex: 1111100 (-4) , can truncate at most 4 bits
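
The two truncation conditions can be expressed as checks on the bit pattern. This is a sketch with hypothetical function names; `keep` is the number of low bits that survive and `width` is the original width of the signed pattern.

```c
#include <stdint.h>

/* Unsigned: safe when every dropped bit is 0. Assumes 1 <= keep <= 31. */
int unsigned_truncation_safe(uint32_t x, unsigned keep)
{
    return (x >> keep) == 0;
}

/* Signed: safe when every dropped bit equals the surviving MSB (bit keep-1).
   x holds a width-bit two's complement pattern; 1 <= keep < width <= 32. */
int signed_truncation_safe(uint32_t x, unsigned width, unsigned keep)
{
    uint32_t width_mask = (width == 32) ? UINT32_MAX
                                        : ((UINT32_C(1) << width) - 1);
    /* Bits (keep-1)..(width-1): the surviving MSB plus all dropped bits. */
    uint32_t top = (x & width_mask) >> (keep - 1);
    uint32_t top_all_ones = width_mask >> (keep - 1);
    return top == 0 || top == top_all_ones;
}
```

For the 7-bit pattern 1111100 (-4), `signed_truncation_safe(0x7C, 7, 3)` reports safe (4 bits dropped, leaving 100), while keeping only 2 bits is reported unsafe.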

Signed and Unsigned Integral Extension

Assignment to a longer type is done with either zero or sign extension depending on the source type:

Unsigned types use zero extension
Signed types use sign extension

Long/short unsigned/signed Conversion Rules

(C/C++) Combinations of (dest type) = (source type) to consider

$\begin{align*} \rm LHS &= RHS \\ \text{( unsigned long)} &= \text{(unsigned long)} \\ \text{( unsigned long)} &= \text{( signed long)} \\ \text{( signed long)} &= \text{(unsigned long)} \\ \text{( signed long)} &= \text{( signed long)} \\ \\ \text{(unsigned short)} &= \text{(unsigned long)}\\ \text{(unsigned short)} &= \text{( signed long)}\\ \text{( signed short)} &= \text{(unsigned long)}\\ \text{( signed short)} &= \text{( signed long)}\\ \\ \text{( unsigned long)} &= \text{(unsigned short)}\\ \text{( unsigned long)} &= \text{( signed short)}\\ \text{( signed long)} &= \text{(unsigned short)}\\ \text{( signed long)} &= \text{( signed short)}\\ \\ \text{(unsigned short)} &= \text{(unsigned short)}\\ \text{(unsigned short)} &= \text{( signed short)}\\ \text{( signed short)} &= \text{(unsigned short)}\\ \text{( signed short)} &= \text{( signed short)} \end{align*}$

The C rules are simple and used in Verilog:
1. extend or truncate source
- Going from longer to shorter, upper bits are truncated
- Going from shorter to longer, zero or sign extension is done depending on source type being unsigned or signed respectively
2. bit copy
- To/From signed/unsigned is just bit copying, no other smart manipulation/conversion/rounding is done

Long/short unsigned/signed Verilog Conversion Rule Demos

Verilog

module conversion_demo;

wire [7:0] 
  u8x = 8'b11111111;              
wire signed [7:0] 
  s8x = 8'b11111111;   
wire [15:0]       
  u16x = 16'b1111_1111_1111_1111;    
wire signed [15:0] 
  s16x = 16'b1111_1111_1111_1111;

wire        [15:0] u16y_ux = u8x;   
wire        [15:0] u16y_sx = s8x;
wire signed [15:0] s16y_ux = u8x;   
wire signed [15:0] s16y_sx = s8x;   

wire        [7:0] u8y_ux = u16x;    
wire        [7:0] u8y_sx = s16x;
wire signed [7:0] s8y_ux = u16x;    
wire signed [7:0] s8y_sx = s16x;

initial begin
#0;
$display("u16y_ux:%16b,%7d",u16y_ux,u16y_ux);
//u16y_ux: 0000000011111111,     255
$display("u16y_sx:%16b,%7d",u16y_sx,u16y_sx);
//u16y_sx: 1111111111111111,   65535
$display("s16y_ux:%16b,%7d",s16y_ux,s16y_ux);
//s16y_ux: 0000000011111111,     255
$display("s16y_sx:%16b,%7d",s16y_sx,s16y_sx);
//s16y_sx: 1111111111111111,      -1

$display(" u8y_ux: %16b, %7d",u8y_ux,u8y_ux);
//u8y_ux:         11111111,      255
$display(" u8y_sx: %16b, %7d",u8y_sx,u8y_sx);
//u8y_sx:         11111111,      255
$display(" s8y_ux: %16b, %7d",s8y_ux,s8y_ux);
//s8y_ux:         11111111,       -1
$display(" s8y_sx: %16b, %7d",s8y_sx,s8y_sx);
//s8y_sx:         11111111,       -1
end
endmodule

C/C++

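#include <stdio.h>

int main(void)
{
    /* Sources with all bits set. Initializing a signed variable from an
       out-of-range constant is implementation-defined; typical
       two's complement compilers store -1. */
    unsigned short u16x = 0xFFFF;
    signed short   s16x = 0xFFFF;
    unsigned int   u32x = 0xFFFFFFFF;
    signed int     s32x = 0xFFFFFFFF;

    /* Shorter to longer: zero or sign extension chosen by the source type */
    unsigned int u32y_u16x = u16x;
    unsigned int u32y_s16x = s16x;
    signed int   s32y_u16x = u16x;
    signed int   s32y_s16x = s16x;

    /* Longer to shorter: upper bits are dropped */
    unsigned short u16y_u32x = u32x;
    unsigned short u16y_s32x = s32x;
    signed short   s16y_u32x = u32x;
    signed short   s16y_s32x = s32x;

    /* %zu is the matching format for the size_t returned by sizeof */
    printf("sizeof(short) :%zu\n", sizeof(short));
    printf("sizeof(int)   :%zu\n", sizeof(int));
    printf("---\n");
    printf("u32y_u16x: %10u\n", u32y_u16x);
    printf("u32y_s16x: %10u\n", u32y_s16x);
    printf("s32y_u16x: %10d\n", s32y_u16x);
    printf("s32y_s16x: %10d\n", s32y_s16x);
    printf("u16y_u32x: %10u\n", u16y_u32x);
    printf("u16y_s32x: %10u\n", u16y_s32x);
    printf("s16y_u32x: %10d\n", s16y_u32x);
    printf("s16y_s32x: %10d\n", s16y_s32x);
    return 0;
}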

sizeof(short) :2
sizeof(int)   :4
---
u32y_u16x:      65535
u32y_s16x: 4294967295
s32y_u16x:      65535
s32y_s16x:         -1
u16y_u32x:      65535
u16y_s32x:      65535
s16y_u32x:         -1
s16y_s32x:         -1

Review Points on Bit Length

Twos-complement addition of two N-bit numbers can require up to N+1 bits to store a full result including the overflow
- If one operand is shorter, take N to be the maximum of the two lengths
Two’s complement addition can only overflow if signs of operands are the same (likewise for subtraction the signs must be different)
Result of N-bit addition with overflow is dropping of the upper bits (those above bit N-1): A+B = (A+B) mod (2^N)
For multiplication, multiplying two N-bit numbers requires up to 2N bits to store the product. Multiplying an N-bit number by an M-bit number requires up to N+M bits.
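
The bit-length rules can be illustrated with 8-bit operands widened before the operation. This is a sketch only; `add8_full` and `mul8_full` are hypothetical names, and `int16_t` is used for the sum simply because it is the next standard width above the 9 bits required.

```c
#include <stdint.h>

/* Full sum of two 8-bit signed values: needs up to 9 bits. */
int16_t add8_full(int8_t a, int8_t b)
{
    return (int16_t)((int16_t)a + (int16_t)b);
}

/* Full product of two 8-bit signed values: needs up to 16 bits. */
int16_t mul8_full(int8_t a, int8_t b)
{
    return (int16_t)((int16_t)a * (int16_t)b);
}
```

The extreme case for the product is (-128) x (-128) = 16384, which needs all 16 bits because the largest 15-bit signed value is 16383.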

Not All Adder Structures are Alike

Knowing the options for implementation of addition on a given platform in the context of algorithm implementation is important. There are tradeoffs for size and speed. An overview of adders is provided in a later lecture.

Verilog: Sign Extension along with Addition

Review
- Sign Extension (extension of signed values) involves repeating the msb
- Truncation is safe as long as all truncated bits are the same as the surviving msb
Verilog Extension/Truncation might be performed implicitly or explicitly

wire signed [7:0] x,y;
wire signed [15:0] s1,s2;

assign s1 = {{8{x[7]}},x} + 
            {{8{y[7]}},y}; // explicit sign
                           // extension

assign s2 = x + y; // implicit sign extension

Run-time Overflow Detection for Addition

Two's complement ADDITION of two numbers where the longest is N-bit can require up to N+1 bits for the result
Two's complement ADDITION can only overflow if the signs of the operands are the same
- overflow a+b
- positive overflow (ideal result>127) is blue
- negative overflow (ideal result<-128) is red
Overflow check using result:
- input sign bits are same and do not match result sign bit
Overflow check using inputs-only:
- In C: overflow=((a>0) && (b>(INT_MAX-a))) | ((a<0) && (b<(INT_MIN-a)));
Overflow Run-Time Check Examples:

Verilog overflow=(a[N-1]==b[N-1]) && (a[N-1]!=y[N-1]);

Example

wire signed [7:0] x,y,s;
wire flagOverflow;

assign s = x+y; //context determined 
                //    8-bit addition
//overflow case is when the sign of the 
//  input operands are the same and
//  sign of result does not match
assign flagOverflow = (x[7] == y[7]) && 
                      (y[7] != s[7]);

C (for reference only)

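y=a+b;
overflow=( (a>=0) && (b>=0) && (y<0) ) | ( (a<0) && (b<0) && (y>=0) );
- Note: may exclude input operands being zero
  - overflow=( (a>0) && (b>0) && (y<0) ) | ( (a<0) && (b<0) && (y>=0));
- Note: signed overflow is undefined behavior in standard C, so the result-based check relies on wraparound that the language does not guarantee
Using #include <limits.h>
- overflow=((a>0) && (b>(INT_MAX-a))) | ((a<0) && (b<(INT_MIN-a)));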
Note: typically you can use an available hardware overflow flag (e.g. V). For software-only manipulation, when using a language that doesn't support access to hardware flags, or when the flags are not preserved, you must resort to software techniques.
If possible, convert data to native words (e.g. perform sign extension) to use hardware flags
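
A software-only version of both addition checks might look like the following sketch. The function names are hypothetical. The inputs-only check never performs the overflowing addition; for a<0 it compares against `INT_MIN - a`, which stays in range because subtracting a negative `a` moves the bound upward. The sign-bit check works on unsigned 8-bit patterns, where wraparound is well defined.

```c
#include <limits.h>
#include <stdint.h>

/* Inputs-only check: would a + b overflow an int? */
int add_would_overflow(int a, int b)
{
    return ((a > 0) && (b > INT_MAX - a)) ||
           ((a < 0) && (b < INT_MIN - a));
}

/* Result-based check on 8-bit patterns, mirroring the sign-bit test:
   input sign bits are the same and do not match the result sign bit. */
int add8_sign_check(uint8_t a, uint8_t b)
{
    uint8_t y = (uint8_t)(a + b);            /* modulo-2^8 sum */
    unsigned sa = a >> 7, sb = b >> 7, sy = y >> 7;
    return (sa == sb) && (sa != sy);
}
```

For instance, `add8_sign_check(0x7F, 0x01)` flags 127+1, while `add8_sign_check(0xC0, 0xC0)` does not flag -64 + -64 = -128.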

Takeaway

Addition overflow is possible only if the input signs are the same, in which case it is indicated by the result having a different sign after the operation is performed
If native word-size is used, may be able to use hardware overflow flag
- Otherwise, might sign-extend first or otherwise need to provide software to check appropriate bits
A pre-operation overflow check on the input operands is given by example for SIGNED INT: ((a>0) && (b>(INT_MAX-a))) | ((a<0) && (b<(INT_MIN-a)))

Run-time Overflow Detection for Subtraction

Two's complement subtraction can only under/overflow if the signs of the operands are different, otherwise the magnitude of the result must be smaller than the maximum magnitude of the two operands.
- a-b, overflow is colored
Overflow check: sign bits of the input operands are different and resulting sign bit does not match the first operand
- Otherwise stated: sign bit of the second operand and the result are the same and not equal to that of the first
- Otherwise stated: invert the sign bit of the second operand and perform the bit test for addition overflow, or negate (two's complement) the second operand and perform the addition overflow test on the operands
Verilog
- overflow=(a[N-1]!=b[N-1]) && (a[N-1]!=y[N-1]);
C
- overflow=((a>=0) && (b<0) && (y<0)) | ((a<0) && (b>=0) && (y>=0));
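- overflow=((a>=0) && (b<0) && (y<0)) | ((a<0) && (b> 0) && (y>=0));
  - Note: b==0 can be excluded, but a==0 cannot, since 0-b overflows when b is the most negative value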
- Like addition, might use hardware overflow flag if available
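
The subtraction checks can be sketched the same way as for addition. The names below are hypothetical. The inputs-only form is an assumption modeled on the addition pre-check (it is not given in the lecture), and the sign-bit form follows the Verilog expression above using unsigned 8-bit patterns.

```c
#include <limits.h>
#include <stdint.h>

/* Inputs-only check: would a - b overflow an int?
   The bounds INT_MAX + b (b < 0) and INT_MIN + b (b > 0) stay in range. */
int sub_would_overflow(int a, int b)
{
    return ((b < 0) && (a > INT_MAX + b)) ||
           ((b > 0) && (a < INT_MIN + b));
}

/* Result-based check on 8-bit patterns: input sign bits differ and the
   result sign bit does not match the first operand. */
int sub8_sign_check(uint8_t a, uint8_t b)
{
    uint8_t y = (uint8_t)(a - b);            /* modulo-2^8 difference */
    unsigned sa = a >> 7, sb = b >> 7, sy = y >> 7;
    return (sa != sb) && (sa != sy);
}
```

The case `sub8_sign_check(0x00, 0x80)` corresponds to 0 - (-128), which overflows even though the first operand is zero.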