Consider n ordered bits representing a value X
Unsigned Interpretation:
$$X = \sum_{i=0}^{n-1} x_i 2^i = x_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} x_i 2^i$$
Signed Interpretation (MSB used for representing Negative Numbers):
$$X = -x_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} x_i 2^i$$
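If MSB, $x_{n-1}$, is 0, the signed and unsigned interpretations give the same value.
If MSB, $x_{n-1}$, is 1, the signed value is the unsigned value minus $2^n$.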
| Bits $x_3x_2x_1x_0$ | Unsigned | | Signed |
|---|---|---|---|
| 0000 | 0 | = | +0 |
| 0001 | 1 | = | +1 |
| 0010 | 2 | = | +2 |
| 0011 | 3 | = | +3 |
| 0100 | 4 | = | +4 |
| 0101 | 5 | = | +5 |
| 0110 | 6 | = | +6 |
| 0111 | 7 | = | +7 |
| 1000 | 8 | → −16 | −8 = 8 − $2^4$ |
| 1001 | 9 | ← +16 | −7 = 9 − $2^4$ |
| 1010 | 10 | | −6 = 10 − $2^4$ |
| 1011 | 11 | | −5 = 11 − $2^4$ |
| 1100 | 12 | | −4 = 12 − $2^4$ |
| 1101 | 13 | | −3 = 13 − $2^4$ |
| 1110 | 14 | | −2 = 14 − $2^4$ |
| 1111 | 15 | | −1 = 15 − $2^4$ |
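The two interpretations in the table can be sketched in C as follows. This is a minimal illustration, not part of the original slides; the function names `unsigned_value` and `signed_value` are hypothetical, and the sketch assumes a bit width between 1 and 30 so every intermediate fits in a 32-bit integer.

```c
#include <stdint.h>

/* Unsigned interpretation: the sum of x_i * 2^i over the low n bits.
 * Assumes 1 <= n <= 30. */
static uint32_t unsigned_value(uint32_t bits, unsigned n)
{
    return bits & ((UINT32_C(1) << n) - 1);
}

/* Signed interpretation: same bits, but the MSB carries weight -2^(n-1). */
static int32_t signed_value(uint32_t bits, unsigned n)
{
    uint32_t u = unsigned_value(bits, n);
    uint32_t msb = (u >> (n - 1)) & 1u;
    /* If the MSB is set, the signed value is the unsigned value minus 2^n. */
    return msb ? (int32_t)u - (int32_t)(UINT32_C(1) << n) : (int32_t)u;
}
```

For example, `signed_value(0x8, 4)` corresponds to the table row 1000, and `signed_value(0xF, 4)` to the row 1111.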
X+Y is (X+Y) % 2^N
2^N − X is (2^N − X) % 2^N = −X
(C/C++) Combinations of (dest type) = (source type) to consider
( unsigned long) = (unsigned long)
( unsigned long) = ( signed long)
( signed long) = (unsigned long)
( signed long) = ( signed long)
(unsigned short) = (unsigned long)
(unsigned short) = ( signed long)
( signed short) = (unsigned long)
( signed short) = ( signed long)
( unsigned long) = (unsigned short)
( unsigned long) = ( signed short)
( signed long) = (unsigned short)
( signed long) = ( signed short)
(unsigned short) = (unsigned short)
(unsigned short) = ( signed short)
( signed short) = (unsigned short)
( signed short) = ( signed short)
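A few of these combinations can be sketched in C with fixed-width types standing in for short and long (8 and 16 bits here, an assumption made to match the Verilog demo that follows). The helper names are hypothetical. Narrowing to a signed destination is left out because its result is implementation-defined in C99/C11.

```c
#include <stdint.h>

/* Widening from a signed source sign-extends, whatever the destination type. */
static uint16_t widen_s8_to_u16(int8_t x)  { return (uint16_t)x; }
static int16_t  widen_s8_to_s16(int8_t x)  { return (int16_t)x; }

/* Widening from an unsigned source zero-extends. */
static uint16_t widen_u8_to_u16(uint8_t x) { return (uint16_t)x; }
static int16_t  widen_u8_to_s16(uint8_t x) { return (int16_t)x; }

/* Narrowing to an unsigned destination keeps the low bits (value mod 2^8). */
static uint8_t narrow_u16_to_u8(uint16_t x) { return (uint8_t)x; }
static uint8_t narrow_s16_to_u8(int16_t x)  { return (uint8_t)x; }
```

The type of the source, not the destination, decides whether the upper bits are filled with zeros or with copies of the sign bit.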
module conversion_demo;
  wire        [7:0]  u8x  = 8'b11111111;
  wire signed [7:0]  s8x  = 8'b11111111;
  wire        [15:0] u16x = 16'b1111_1111_1111_1111;
  wire signed [15:0] s16x = 16'b1111_1111_1111_1111;

  wire        [15:0] u16y_ux = u8x;
  wire        [15:0] u16y_sx = s8x;
  wire signed [15:0] s16y_ux = u8x;
  wire signed [15:0] s16y_sx = s8x;
  wire        [7:0]  u8y_ux  = u16x;
  wire        [7:0]  u8y_sx  = s16x;
  wire signed [7:0]  s8y_ux  = u16x;
  wire signed [7:0]  s8y_sx  = s16x;

  initial begin
    #0;
    $display("u16y_ux: %16b, %7d",u16y_ux,u16y_ux); //u16y_ux: 0000000011111111, 255
    $display("u16y_sx: %16b, %7d",u16y_sx,u16y_sx); //u16y_sx: 1111111111111111, 65535
    $display("s16y_ux: %16b, %7d",s16y_ux,s16y_ux); //s16y_ux: 0000000011111111, 255
    $display("s16y_sx: %16b, %7d",s16y_sx,s16y_sx); //s16y_sx: 1111111111111111, -1

    $display(" u8y_ux: %16b, %7d",u8y_ux,u8y_ux);   //u8y_ux: 11111111, 255
    $display(" u8y_sx: %16b, %7d",u8y_sx,u8y_sx);   //u8y_sx: 11111111, 255
    $display(" s8y_ux: %16b, %7d",s8y_ux,s8y_ux);   //s8y_ux: 11111111, -1
    $display(" s8y_sx: %16b, %7d",s8y_sx,s8y_sx);   //s8y_sx: 11111111, -1
  end
endmodule
Discussed in Next Lecture
Sign Extension with Operations using Signed Variables
wire signed [11:0] x,y;
wire signed [12:0] s1,s2;
assign s1 = {x[11],x} + {y[11],y}; //explicit sign extension to 13 bits
assign s2 = x + y;                 //implicit sign extension

wire signed [7:0] x,y,s;
wire flagOverflow;
assign s = x+y; //context determined 8-bit addition
//overflow case is when the signs of the input operands are the same
//and the sign of the result does not match
assign flagOverflow = (x[7] == y[7]) && (y[7] != s[7]);
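The same overflow test can be sketched in C. The function name `add8_overflows` is hypothetical, and the sketch works on the raw 8-bit patterns with unsigned arithmetic so that the wraparound is well defined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the 8-bit two's-complement sum x + y overflows. */
static bool add8_overflows(int8_t x, int8_t y)
{
    /* Work on the raw bit patterns; the cast to uint8_t wraps modulo 2^8. */
    uint8_t ux = (uint8_t)x;
    uint8_t uy = (uint8_t)y;
    uint8_t us = (uint8_t)(ux + uy);

    unsigned sx = ux >> 7, sy = uy >> 7, ss = us >> 7;
    /* Same test as the Verilog flag: operand signs match, result sign differs. */
    return (sx == sy) && (sy != ss);
}
```

Operands of opposite sign can never overflow, which is why the test only fires when both sign bits agree.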
Assume int A,B;
Example:
BAUD_RATE=CLK_RATE/CLK_DIV
#define CLK_DIV (CLK_RATE/BAUD_RATE)
#define CLK_DIV ((CLK_RATE+(BAUD_RATE/2))/BAUD_RATE)
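The difference between the truncating and the rounded divider can be checked with a small C sketch. The clock and baud values below are hypothetical, chosen only so that the two definitions disagree; the macro names `CLK_DIV_TRUNC` and `CLK_DIV_ROUND` and the helper `actual_baud` are likewise illustrative.

```c
/* Hypothetical rates chosen so that truncation and rounding differ. */
#define CLK_RATE   1000000L
#define BAUD_RATE  115200L

/* Truncating divider: 1000000/115200 = 8.68... becomes 8. */
#define CLK_DIV_TRUNC  (CLK_RATE / BAUD_RATE)

/* Rounded divider: adding half the divisor first gives 9. */
#define CLK_DIV_ROUND  ((CLK_RATE + (BAUD_RATE / 2)) / BAUD_RATE)

/* Baud rate actually produced by a given divider. */
static long actual_baud(long clk_div)
{
    return CLK_RATE / clk_div;
}
```

With these values the truncated divider yields 125000 baud and the rounded divider 111111 baud, so the rounded divider lands closer to the 115200 target.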
In general, if intermediate result will fit in the allowed integer word size, apply integer multiplications before integer divisions to avoid loss of precision
Need to consider loss of MSB or LSB digits in intermediate terms and make choices based on the application or the expected ranges of input values
Example 1: int i = 600, x = 54, y = 23;
i/x*y
gives 253 : good
i*y/x
gives 255 : better
(i*y+x/2)/x
gives 256 : best
Example 2: unsigned int c = 7000, x = 10, y = 2; (16-bit unsigned int)
c*x/y
which is truly 35000
c*=x;
overflows c since (c*x)>65535
resulting in c = 4464
c/=y;
get 2232
here
c/=y; c*=x;
gives 35000
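Both examples can be reproduced with a short C sketch. It uses i = 600 for Example 1, which is the value that reproduces the quoted results 253, 255 and 256, and it uses `uint16_t` to stand in for a 16-bit unsigned int in Example 2. All function names are hypothetical.

```c
#include <stdint.h>

/* Example 1 with i = 600, x = 54, y = 23 (exact answer 255.56). */
static int divide_first(int i, int x, int y)   { return i / x * y; }
static int multiply_first(int i, int x, int y) { return i * y / x; }
static int multiply_round(int i, int x, int y) { return (i * y + x / 2) / x; }

/* Example 2 on a 16-bit unsigned type: multiplying first wraps modulo 65536. */
static uint16_t mul_then_div16(uint16_t c, uint16_t x, uint16_t y)
{
    c = (uint16_t)((uint32_t)c * x);   /* 70000 wraps to 4464 */
    c = (uint16_t)(c / y);
    return c;
}

/* Dividing first keeps every intermediate inside 16 bits. */
static uint16_t div_then_mul16(uint16_t c, uint16_t x, uint16_t y)
{
    c = (uint16_t)(c / y);
    c = (uint16_t)((uint32_t)c * x);
    return c;
}
```

The two examples pull in opposite directions: multiplying first preserves precision when the product fits, while dividing first is the safer order when the product would overflow the word size.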
Limited-precision representation of coefficients limits realizable system parameters
Images from Digital Signal Processing with Student CD ROM 4th Edition by Sanjit Mitra
Direct-form II is very poor for high-pass and low-pass filters requiring poles near the real axis
Manipulation of the order of operations can yield different possible system realizations. Here is an equivalent system if infinite precision is used, but it yields different results if parameters and calculations are of limited precision
When the error is small compared to the signal, quantization error is modeled as noise. In fact, modeling the error as white noise is nearly ideal when the error is very small compared to the signal (high SNR)
Techniques to avoid unwanted limit-cycle behaviour usually involve increasing the bit length of intermediate calculations, feeding back the quantization error, or introducing random quantization to keep the effect from perpetuating
Ref for Additional Examples:
Alan V. Oppenheim and Ronald W. Schafer, Discrete-Time Signal Processing, 3rd Edition, Prentice-Hall Signal Processing Series
Verilog 2001 provides signed reg and wire vectors
Casting to and from signed may be implicit or may be explicit by using
$unsigned()
Ex: reg_u=$unsigned(reg_s);
$signed()
Ex: reg_s=$signed(reg_u);
Implicit or explicit casting is always a dumb conversion (same as C): it never changes the bits, only the interpretation of the bits for subsequent operations by the compiler/synthesizer (e.g. -1 is not rounded to 0 upon conversion to unsigned, it is just reinterpreted as the largest unsigned value)
In this example we see that addition obeys modular arithmetic, with a result ur8y=0
ur8a = 128; ur8b = 128; ur8y= ur8a+ur8b;
In this example the addition is an expression paired with an assignment, so the context-determined operand length of the addition is the larger of the operand lengths and the length of the assigned variable, here 9 bits. Because the operands are unsigned, each is zero-extended to 9 bits before the addition, and the result is ur9y=256.
ur8a = 128; ur8b = 128; ur9y= ur8a+ur8b; //9-bit addition
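The effect of the destination width can be mimicked in C. This is only an analogy: C promotes both operands to int before adding and truncates on conversion to the destination, whereas Verilog sizes the addition from its context. The function names are hypothetical.

```c
#include <stdint.h>

/* 8-bit destination: the sum is reduced modulo 2^8. */
static uint8_t add_into_8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Wider destination: the carry out of bit 7 is kept. */
static uint16_t add_into_16(uint8_t a, uint8_t b)
{
    return (uint16_t)(a + b);
}
```

With both inputs equal to 128, the narrow version returns 0 and the wide version returns 256, matching the two Verilog cases above.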
{a}
for which the result is the length of a
{2'b00,b,a}
for which the result length is 2+length(b)+length(a)
ur8a = 128; ur8b = 128;
ur16y = {ur8a+ur8b}; //8 bit addition
ur16z = ur8a+ur8b;   //16 bit addition
5.4.1 Rules for expression bit lengths
The rules governing the expression bit lengths have been formulated so that most practical situations have a natural solution.
The number of bits of an expression (known as the size of the expression) shall be determined by the operands involved in the expression and the context in which the expression is given.
A self-determined expression is one where the bit length of the expression is solely determined by the expression itself—for example, an expression representing a delay value.
A context-determined expression is one where the bit length of the expression is determined by the bit length of the expression and by the fact that it is part of another expression. For example, the bit size of the right-hand expression of an assignment depends on itself and the size of the left-hand side.
Table 5-22 shows how the form of an expression shall determine the bit lengths of the results of the expression. In Table 5-22, i, j, and k represent expressions of an operand, and L(i) represents the bit length of the operand represented by i.
Multiplication may be performed without losing any overflow bits by assigning the result to something wide enough to hold it.
At UMBC, go to http://ieeexplore.ieee.org/
Search for IEEE Std 1364-2005
You'll find
Addition in the context of an assignment will cause extension of the operands to the size of the result (the synthesizer may later remove useless hardware). However, the extension is performed according to the type of the addition, which is determined by the operands. Therefore, sign extension is only performed if BOTH operands are signed, regardless of the type of the assignment target.
module test();
  reg signed [7:0]  s;
  reg        [7:0]  u;
  reg signed [7:0]  neg_two;
  reg        [15:0] x1,x2;
  reg signed [15:0] y1,y2;

  initial begin
    neg_two = -2;
    s = 1;
    u = 1;
    x1 = u + neg_two;
    x2 = s + neg_two;
    y1 = u + neg_two;
    y2 = s + neg_two;
    $display("%b",x1);
    $display("%b",x2);
    $display("%b",y1);
    $display("%b",y2);
  end
endmodule
Result:
0000000011111111
1111111111111111
0000000011111111
1111111111111111
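The four results can be modeled in C by making the two kinds of extension explicit. This is a sketch of the Verilog rule, not of C's own promotion rules, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Unsigned context: the 8-bit pattern is zero-extended to 16 bits. */
static uint16_t zero_extend8(int8_t v) { return (uint16_t)(uint8_t)v; }

/* Signed context: the sign bit is replicated into the upper byte. */
static uint16_t sign_extend8(int8_t v) { return (uint16_t)(int16_t)v; }

/* Models x1 = u + neg_two: one unsigned operand makes the whole sum unsigned. */
static uint16_t add_mixed(uint8_t u, int8_t s)
{
    return (uint16_t)(u + zero_extend8(s));
}

/* Models x2 = s + neg_two: both operands signed, so both are sign-extended. */
static uint16_t add_signed(int8_t a, int8_t b)
{
    return (uint16_t)(sign_extend8(a) + sign_extend8(b));
}
```

`add_mixed(1, -2)` gives the pattern 0000000011111111 and `add_signed(1, -2)` gives 1111111111111111, independent of whether the destination is declared signed.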
. . .
reg [3:0] bottleStock = 10; //**unsigned**
always @ (posedge clk, negedge rst_)
  if (rst_==0) bottleStock<=10;
  else if (bottleStock >= 0) //always TRUE!!!
    bottleStock <= bottleStock-1;

. . .
input wire [2:0] remove;
reg signed [3:0] remainingStock = 10; //**signed**
always @ (posedge clk, negedge rst_)
  if (rst_==0) remainingStock<=10;
  else if ((remainingStock-remove) >= 0) //always TRUE!!! remove is unsigned, so the expression is unsigned
    remainingStock <= remainingStock-remove;
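C has the same pitfall when a signed and an unsigned operand are mixed. A minimal sketch follows, with hypothetical function names; the fix shown is one possible approach, not the one the slides prescribe.

```c
#include <stdbool.h>

/* Pitfall: int minus unsigned is computed as unsigned, so ">= 0" always holds. */
static bool can_remove_buggy(int remaining, unsigned remove)
{
    return (remaining - remove) >= 0;
}

/* One possible fix: compare the operands before subtracting. */
static bool can_remove_fixed(int remaining, unsigned remove)
{
    return remaining >= 0 && (unsigned)remaining >= remove;
}
```

Many compilers warn about the always-true comparison in the first function, which is a useful hint that the expression type is not what was intended.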
a<<k
In general to hold the result you need M+N bits where M and N are the length of the operands
wire [N-1:0] a;
wire [M-1:0] b;
wire [M+N-1:0] y;
assign y = a*b;
Multiplication in the context of the assignment will cause extension of the operands to the size of the result (the synthesizer may later remove useless hardware). However, the type of extension is determined by the types of the input operands (sign extension is only performed if BOTH operands are signed)
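The M+N rule can be illustrated in C with 8-bit operands and a 16-bit result. The function names are hypothetical.

```c
#include <stdint.h>

/* An 8-bit by 8-bit product always fits in 8 + 8 = 16 bits. */
static uint16_t mul8x8(uint8_t a, uint8_t b)
{
    return (uint16_t)((uint16_t)a * b);
}

/* Keeping only 8 bits of the product discards the upper byte. */
static uint8_t mul8x8_truncated(uint8_t a, uint8_t b)
{
    return (uint8_t)(a * b);
}
```

The largest possible product, 255 * 255 = 65025, is still below 2^16, while the truncated version keeps only the low byte.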
AND
gates as 1-bit mult)
1011 (M bits) x 10010 (N bits):
Multiplication Overflow Check Ideas:
Multiplication by constants with only few non-zero bits can be inexpensive:
u*24 = u*(16+8) = (u<<4) + (u<<3)
This concept is important for computer engineers to have in their tool belt.
Using example from previous slide: if the second operand is a constant, the synthesizer reduces the multiplication to shift and one adder:
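The shift-and-add form of the multiplication by 24 can be written in C as a quick check. The function name `times24` is hypothetical.

```c
#include <stdint.h>

/* u * 24 using two shifts and one add, since 24 = 16 + 8. */
static uint32_t times24(uint32_t u)
{
    /* Parentheses are required: '+' binds tighter than '<<' in C. */
    return (u << 4) + (u << 3);
}
```

The same precedence caveat applies in Verilog, so the shifts should be parenthesized there as well.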
|round(float(A)/float(B))| = (|A|+|B/2|)/|B|
A+(|B|*sign(A))/2
where sign(A) is 1 or -1 according to the sign of A
result = (S>=0) ? ((S+128)/256) : ((S-128)/256);
result = (S>=0) ? ((S+2)/5) : ((S-2)/5);
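The rounding rule above generalizes to a small C helper. This is a sketch under stated assumptions: the divisor is nonzero, the adjusted numerator does not overflow int, and exact halves are rounded away from zero. The name `round_div` is hypothetical.

```c
/* Round-to-nearest integer division, with halves rounded away from zero.
 * Assumes b != 0 and that a plus |b|/2 does not overflow int. */
static int round_div(int a, int b)
{
    int half = (b < 0 ? -b : b) / 2;   /* |b| / 2 */
    int sign = (a < 0) ? -1 : 1;       /* sign(a) */
    return (a + sign * half) / b;      /* C division truncates toward zero */
}
```

For a divisor of 256 this reduces to the (S+128)/256 and (S-128)/256 pair, and for a divisor of 5 to (S+2)/5 and (S-2)/5.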
fix() operator (subtract fractional part), with type conversion discarding the fractional part
x>>>k;
same as {{k{x[msbindex]}},x[msbindex:k]}
This is
floor(x/n), ex: 5/2 = 2
ceil(x/n), ex: -5/2 = -2
x>>k;
floor((x-(2^k-1))/(2^k))
In Verilog: (x - ((1<<k)-1)) >>> k
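The difference between discarding the fractional part and taking the floor can be sketched in C. Right-shifting a negative signed value is implementation-defined in C, so this sketch uses division rather than a shift; the function names are hypothetical.

```c
/* C integer division discards the fractional part (rounds toward zero). */
static int trunc_div(int x, int n)
{
    return x / n;
}

/* Floor division: rounds toward negative infinity. Assumes n != 0. */
static int floor_div(int x, int n)
{
    int q = x / n;
    /* Step down by one when there is a remainder and the signs differ. */
    if ((x % n != 0) && ((x < 0) != (n < 0)))
        q -= 1;
    return q;
}
```

The two agree for non-negative quotients and differ by one for negative quotients with a remainder, as in -5/2.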
Want to use integer operations to represent adding 0011.1110 and 0001.1000
Solution:
Represent 0011.1110 as 00111110
Represent 0001.1000 as 00011000
Add as integers:
  00011000
+ 00111110
= 01010110
Interpret 01010110 as 0101.0110
Use QM.N format: M+N bits, with M bits as whole part and N bits as fractional part
Up to you to determine the number of bits to use for the whole and fraction parts, depending on the range and precision needed. This determines the scale factor required to convert the fraction to a whole number
1101.0000 * 16 = 11010000, S=16, Q4.4
01.011000 * 64 = 01011000, S=64, Q2.6
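The scaling step can be sketched in C as a pair of conversion helpers. The names `to_fixed` and `from_fixed` are hypothetical, and the sketch assumes the scaled value fits in 32 bits and that rounding to nearest is acceptable.

```c
#include <stdint.h>

/* Convert a real value to QM.N by scaling with S = 2^N and rounding to nearest.
 * Assumes the scaled value fits in 32 bits. */
static int32_t to_fixed(double value, unsigned frac_bits)
{
    double scaled = value * (double)(INT32_C(1) << frac_bits);
    return (int32_t)(scaled >= 0 ? scaled + 0.5 : scaled - 0.5);
}

/* Convert back by dividing out the scale factor. */
static double from_fixed(int32_t q, unsigned frac_bits)
{
    return (double)q / (double)(INT32_C(1) << frac_bits);
}
```

With 4 fractional bits, 13.0 maps to the pattern 11010000; with 6 fractional bits, 1.375 maps to 01011000, matching the two lines above.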
A+B computed as A*S+B*S=C*S
A-B computed as A*S-B*S=C*S
A*B computed as (A*S)*(B*S)=C*S^2
C*S*S requires more storage than the scaled result C*S
A/B could be computed as (A*S)/(B*S)=C
or, to keep the scale factor, as ((A*S)*S)/(B*S)=C*S
((A*S)*S) requires more storage and a larger computation

wire signed [7:0] x,y;
wire flagEq;
assign flagEq = (x==y); //implements eight xnor2 followed by and8
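The scaled-integer rules for addition, multiplication and division can be sketched in C for unsigned Q4.4 values (S = 16). The function names are hypothetical, and the sketch assumes every result fits in 8 bits and that the divisor is nonzero.

```c
#include <stdint.h>

#define FRAC_BITS 4   /* Q4.4: scale factor S = 2^4 = 16 */

/* (A*S) + (B*S) = (A+B)*S: no rescaling needed. */
static uint8_t q44_add(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* (A*S)*(B*S) = C*S^2: form the wide product, then shift out one factor of S. */
static uint8_t q44_mul(uint8_t a, uint8_t b)
{
    uint16_t wide = (uint16_t)((uint16_t)a * b);
    return (uint8_t)(wide >> FRAC_BITS);
}

/* ((A*S)*S)/(B*S) = C*S: pre-scale the dividend so the quotient keeps S. */
static uint8_t q44_div(uint8_t a, uint8_t b)
{
    uint16_t wide = (uint16_t)((uint16_t)a << FRAC_BITS);
    return (uint8_t)(wide / b);
}
```

Using 0011.1110 (0x3E) and 0001.1000 (0x18): the sum is 0x56 (0101.0110), the product is 0x5D (0101.1101, exactly 5.8125), and the quotient is 0x29 (0010.1001, 2.5625, a truncation of 2.5833). The 16-bit intermediates are the extra storage that the multiplication and division rules call for.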