Unsigned Interpretation:
Signed Interpretation (MSB used for representing Negative Numbers)
Two's complement uses the MSB to contribute a large negative weight instead of a large positive one
Ex: 1001 is -8 + 1=-7
4-bit signed range is 0+4+2+1 = 7 down to -8+0+0+0 = -8
4-bit unsigned range is 8+4+2+1 = 15 down to 0+0+0+0 = 0
If the MSB is 0, the interpretation is the same
If the MSB is 1, the interpretation is different
When the MSB is 1, the adjustment from one interpretation to the other is an offset of 2 times the most significant weight, 2^(N-1), which is 2^N
Ex: 10000010 is 130 unsigned and -126 signed; the difference is 256 = 2*128 = 2^8
It is also useful to think about negative two's-complement values in this way: when the MSB is 1, the signed value equals $unsigned(X) - 2^N
Ex: $unsigned(10000010) = 130, so the signed value is 130 - 256 = -126
This helps visualize the conversions between the two interpretations
The following assumes word sizes are sufficient for the conversion arithmetic, e.g. working with 12-bit words in a 16-bit architecture
Interpreting unsigned as signed:
printf("%d",X)
displays the result as 2049. What is the actual 12-bit signed value? Ans: 2049 - 4096 = -2047
Interpreting signed as unsigned: reverse the offset, adding 2^12 = 4096 to a negative value
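A minimal C sketch of this reinterpretation (the 12-bit width comes from the note above; the variable names and the use of uint16_t/int16_t storage are assumptions for illustration):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint16_t X = 2049;                         /* 12-bit pattern 1000_0000_0001, bit 11 set */
    int16_t signed_val = (int16_t)(X - 4096);  /* subtract 2^12 when the 12-bit MSB is 1 */
    printf("unsigned: %u  signed: %d\n", X, signed_val); /* unsigned: 2049  signed: -2047 */
    return 0;
}
```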
Conversion for 4-bit Data

Bits | Unsigned | Relation | Signed
---|---|---|---
1111 | 15 | -16 | -1
1110 | 14 | -16 | -2
1101 | 13 | -16 | -3
1100 | 12 | -16 | -4
1011 | 11 | -16 | -5
1010 | 10 | -16 | -6
1001 | 9 | -16 | -7
1000 | 8 | -16 | -8
0111 | 7 | = | 7
0110 | 6 | = | 6
0101 | 5 | = | 5
0100 | 4 | = | 4
0011 | 3 | = | 3
0010 | 2 | = | 2
0001 | 1 | = | 1
0000 | 0 | = | 0
Visual Example Sine Wave transmitted using 5-bit data:
A two-step process can perform negation without requiring the number of bits used for the numerical symbols to be extended.
Steps:
Let X and Y represent bits in hardware expressed under an unsigned interpretation, though they may store a two's-complement representation
First, recall that addition and subtraction with signed and unsigned numbers are exactly the same in hardware (addition and subtraction here are modular, i.e. mod 2^N)
Negating in two's complement: -X = 2^N - X (mod 2^N)
2^N itself cannot be represented with N bits, and we would like to represent the operation using symbols in the same domain, that of N-bit numbers
Ex: 4-bit negation of 3: 16 - 3 = 13 = 1101
One's complement represents negative numbers by bit-complement alone
One's complement (inverting the bits) computes ~X = (2^N - 1) - X
Therefore, to generate a two's-complement negative from a one's-complement negation, just add 1,
which as modular arithmetic is -X = ~X + 1 (mod 2^N)
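As a small sanity check, a C sketch of the two-step negation (masking to 4 bits to mimic an N=4 word; the variable names are mine):

```c
#include <stdio.h>

int main(void) {
    unsigned x = 3;
    unsigned neg = (~x + 1u) & 0xFu;  /* invert bits, add 1, keep 4 bits: 16 - 3 = 13 = 1101 */
    printf("%u -> %u\n", x, neg);     /* 3 -> 13 */
    return 0;
}
```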
For Sign-Magnitude representation, bit N-1 is the sign and the remaining bits are the magnitude
Both one's complement and sign-magnitude suffer from having two representations of 0
Data streams with typically small values on either side of zero involve less bit flipping between samples, which can be valuable for power or for digital compression applied along the bit columns
Sign-magnitude:
+4: 00000100
+3: 00000011
+2: 00000010
+1: 00000001
+0: 00000000
-1: 10000001
-2: 10000010
-3: 10000011
-4: 10000100
Two's complement:
+4: 00000100
+3: 00000011
+2: 00000010
+1: 00000001
+0: 00000000
-1: 11111111
-2: 11111110
-3: 11111101
-4: 11111100
Note in the sequence below that several bit columns never change
+4: 00000100
+0: 00000000
+3: 00000011
+3: 00000011
+3: 00000011
+3: 00000011
-3: 10000011
+0: 00000000
+1: 00000001
+0: 00000000
+1: 00000001
-3: 10000011
+1: 00000001
+0: 00000000
-1: 10000001
-2: 10000010
-3: 10000011
-3: 10000011
+3: 00000011
+2: 00000010
+3: 00000011
+2: 00000010
+1: 00000001
-4: 10000100
+0: 00000000
+1: 00000001
+0: 00000000
+1: 00000001
+1: 00000001
+0: 00000000
+1: 00000001
+0: 00000000
-1: 10000001
-2: 10000010
-3: 10000011
-3: 10000011
-3: 10000011
-4: 10000100
-4: 10000100
(C/C++) Combinations of (dest type) = (source type) to consider
module conversion_demo;
  wire        [7:0]  u8x  = 8'b11111111;
  wire signed [7:0]  s8x  = 8'b11111111;
  wire        [15:0] u16x = 16'b1111_1111_1111_1111;
  wire signed [15:0] s16x = 16'b1111_1111_1111_1111;

  wire        [15:0] u16y_ux = u8x;
  wire        [15:0] u16y_sx = s8x;
  wire signed [15:0] s16y_ux = u8x;
  wire signed [15:0] s16y_sx = s8x;
  wire        [7:0]  u8y_ux  = u16x;
  wire        [7:0]  u8y_sx  = s16x;
  wire signed [7:0]  s8y_ux  = u16x;
  wire signed [7:0]  s8y_sx  = s16x;

  initial begin
    #0;
    $display("u16y_ux:%16b,%7d",u16y_ux,u16y_ux); //u16y_ux: 0000000011111111,     255
    $display("u16y_sx:%16b,%7d",u16y_sx,u16y_sx); //u16y_sx: 1111111111111111,   65535
    $display("s16y_ux:%16b,%7d",s16y_ux,s16y_ux); //s16y_ux: 0000000011111111,     255
    $display("s16y_sx:%16b,%7d",s16y_sx,s16y_sx); //s16y_sx: 1111111111111111,      -1
    $display(" u8y_ux: %16b, %7d",u8y_ux,u8y_ux); // u8y_ux: 11111111,     255
    $display(" u8y_sx: %16b, %7d",u8y_sx,u8y_sx); // u8y_sx: 11111111,     255
    $display(" s8y_ux: %16b, %7d",s8y_ux,s8y_ux); // s8y_ux: 11111111,      -1
    $display(" s8y_sx: %16b, %7d",s8y_sx,s8y_sx); // s8y_sx: 11111111,      -1
  end
endmodule
Knowing the options for implementation of addition in the context of algorithms is important. Options are overviewed in later lectures.
wire signed [7:0]  x, y;
wire signed [15:0] s1, s2;
assign s1 = {{8{x[7]}}, x} + {{8{y[7]}}, y}; // explicit sign extension
assign s2 = x + y;                           // implicit sign extension
Two’s complement addition of two numbers where the longest is N-bit can require up to N+1 bits for the result
Two’s complement addition can only overflow if the signs of the operands are the same
Overflow check: input sign bits are same and do not match result sign bit
For addition, y=a+b;
overflow=( (a>=0) && (b>=0) && (y<0) ) | ( (a<0) && (b<0) && (y>=0) );
The check may exclude zero-valued input operands (zero can never cause overflow):
overflow=( (a>0) && (b>0) && (y<0) ) | ( (a<0) && (b<0) && (y>=0) );
Using #include <limits.h>
overflow=((a>0) && (b>(INT_MAX-a))) | ((a<0) && (b<(INT_MIN-a)));
Note: typically you can use an available hardware overflow flag (e.g. V), but for software-only manipulation, when using a language that doesn't expose hardware flags, or when the flags are not preserved, you must resort to software techniques
Takeaway
((a>0) && (b>(INT_MAX-a))) | ((a<0) && (b<(INT_MIN-a)))
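A minimal sketch of the limits-based pre-check as a helper function (the function name is mine):

```c
#include <limits.h>
#include <stdbool.h>

/* Would a + b overflow a signed int?  Checked without performing the
   (undefined-behavior) overflowing addition itself. */
static bool add_would_overflow(int a, int b) {
    return ((a > 0) && (b > INT_MAX - a)) ||
           ((a < 0) && (b < INT_MIN - a));
}
```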
For subtraction, y=a-b:
overflow=((a>=0) && (b<0) && (y<0)) | ((a<0) && (b>=0) && (y>=0));
overflow=((a>0) && (b<0) && (y<0)) | ((a<0) && (b>0) && (y>=0));
Takeaway
Addition: overflow=(a[N-1]==b[N-1]) && (a[N-1]!=y[N-1]);
wire signed [7:0] x, y, s;
wire flagOverflow;
assign s = x + y;  // context-determined 8-bit addition
// The overflow case is when the signs of the input operands are the same
// and the sign of the result does not match
assign flagOverflow = (x[7] == y[7]) && (y[7] != s[7]);
Subtraction: overflow=(a[N-1]!=b[N-1]) && (a[N-1]!=y[N-1]);
In C, conversion from float to int is defined as truncation of the fractional digits, effectively rounding towards zero
Nearest Integer Rounding using Floating Point Biasing:
Takeaway
To achieve conventional rounding when casting from float to int, add 0.5 to the magnitude (add 0.5 for non-negative values, subtract 0.5 for negative values) before the cast truncates
For the integer case, assume unsigned int A,B; the rounded quotient of A/B is (A + B/2)/B
Example:
BAUD_RATE=CLK_RATE/CLK_DIV
#define CLK_DIV (CLK_RATE/BAUD_RATE)
#define CLK_DIV \
(CLK_RATE+(BAUD_RATE/2))/BAUD_RATE
(Note that \ is just the line-continuation character for C macros)
Takeaway
For positive integers, bias the numerator with half of the denominator to approximate division followed by rounding
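As a concrete (assumed) pairing of values, a 16 MHz clock and 9600 baud show the difference between the truncated and biased divisors:

```c
#include <stdio.h>

#define CLK_RATE  16000000u   /* assumption for illustration */
#define BAUD_RATE 9600u       /* assumption for illustration */

#define CLK_DIV_TRUNC   (CLK_RATE / BAUD_RATE)                     /* 1666 (true value 1666.67) */
#define CLK_DIV_ROUNDED ((CLK_RATE + (BAUD_RATE / 2)) / BAUD_RATE) /* 1667, nearer the true value */

int main(void) {
    printf("truncated: %u  rounded: %u\n", CLK_DIV_TRUNC, CLK_DIV_ROUNDED);
    return 0;
}
```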
In general, if the intermediate result will fit in the allowed integer word size, apply integer multiplications before integer divisions to avoid loss of precision (i.e. go big first, unless that would overflow, in which case first go small)
Need to consider loss of MSB or LSB digits in intermediate terms and make choices based on the application or the expected ranges of input values
Example 1:
int i = 600, x = 54, y = 23;
i/x*y : result is 253 (good)
i*y/x : result is 255 (better)
(i*y+x/2)/x : result is 256 (best; the true value is about 255.6)
Example 2:
unsigned int c = 7000, x=10,y=2;
c*x/y, which is truly 35000
c*=x; overflows c, since (c*x) > 65535 in 16-bit unsigned arithmetic, resulting in c = 4464
c/=y; then gives 2232 from dividing 4464 by 2
Reversing the order, c/=y; then c*=x;
result is 3500*10 = 35000
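A runnable sketch of Example 2, assuming the 16-bit unsigned arithmetic the example implies (uint16_t and the variable names are mine):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint16_t c1 = 7000, c2 = 7000, x = 10, y = 2;

    c1 = (uint16_t)(c1 * x);   /* 70000 mod 65536 = 4464: overflow ("went big" too soon) */
    c1 = c1 / y;               /* 2232: wrong */

    c2 = c2 / y;               /* 3500: "go small" first */
    c2 = (uint16_t)(c2 * x);   /* 35000: correct */

    printf("multiply-first: %u  divide-first: %u\n", c1, c2);
    return 0;
}
```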
Takeaway
Limited-precision representation of coefficients limits realizable system parameters
Images from Digital Signal Processing with Student CD ROM 4th Edition by Sanjit Mitra
Direct-form II: can be very poor for high-pass or low-pass filters requiring poles near the real axis
Manipulation of the order of operations, or alternative numerical formulations, can yield different possible system realizations. Shown here is a system that is equivalent if infinite precision is used, but that yields different results when parameters and calculations have limited precision
Overflow may occur at the output and at many points within a system
May be able to avoid internal overflow by pre-scaling (division) and post-scaling (multiplication)
If overflow is possible at the output of the accumulators and cannot be stored into the delay storage registers…
The transfer function can be computed from the input to the output as well as any intermediate signal.
If the input is bounded and the system is bounded-input bounded-output (BIBO), then multiplying the input magnitude by the gain of the transfer function to any point reveals the range required at that point
To limit the range (# bits), the input can be pre-scaled, and if required the output can be post-scaled
In the previous graph, multiplication can be applied for pre-scaling and division for post-scaling if the original-scale result is required (many times it isn't)
By performing internal computations with larger values, numerical loss of accuracy at each step is mitigated
An analogy would be performing math with millimeters rather than meters
Ex: evaluate (a/b) > (c/5) with a=21, b=4, c=26 (true values: 5.25 > 5.2)
With integer division, both sides truncate to 5 and the comparison fails
Scaling after the division, 1000*(a/b) > 1000*(c/5), does not help; apply the scale before dividing:
( (1000*a) / b) > ( (1000*c)/5)
( (21000) / 4) > ( (26000)/5)
5250 > 5200
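A short sketch of the same comparison in C, using the values above:

```c
#include <stdio.h>

int main(void) {
    int a = 21, b = 4, c = 26;

    int naive     = (a / b) > (c / 5);                /* 5 > 5       -> 0 (wrong)   */
    int prescaled = (1000 * a) / b > (1000 * c) / 5;  /* 5250 > 5200 -> 1 (correct) */

    printf("naive: %d  prescaled: %d\n", naive, prescaled);
    return 0;
}
```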
sqrt(pow(a_x,2)+pow(a_y,2)) > sqrt(pow(b_x,2)+pow(b_y,2))
can be evaluated without the square roots, since both sides are non-negative and sqrt is monotonic:
pow(a_x,2)+pow(a_y,2) > pow(b_x,2)+pow(b_y,2)
Sometimes overflow is expected or practically unavoidable, and handling it with detection followed by replacing the result with the limit (saturation) is more appropriate than modulo arithmetic, where a small overflow produces a very large discrepancy
Detect overflow condition – override output to avoid LARGE roll-over error from modulus arithmetic
Example:
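The example itself is not reproduced here; as a hedged sketch of the detect-and-override (saturation) idea, with a helper name of my own choosing:

```c
#include <stdint.h>

/* Saturating 16-bit addition: detect overflow and clamp to the type limits
   instead of letting modular arithmetic wrap around. */
static int16_t add_sat16(int16_t a, int16_t b) {
    int32_t wide = (int32_t)a + (int32_t)b;  /* cannot overflow in 32 bits */
    if (wide > INT16_MAX) return INT16_MAX;
    if (wide < INT16_MIN) return INT16_MIN;
    return (int16_t)wide;
}
```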
Balanced Range to mitigate accumulated error
Instantaneous value mapping and its effect on a sinusoidal time series:
When the error is small compared to the signal, quantization error is modeled as noise
In fact, it is nearly ideal to model the error as additive white noise when the error is very small compared to the signal (high SNR), because the error then appears uncorrelated with the input; otherwise the error can appear correlated with the signal (see provided graphic)
A computation stage can be numerically modeled using an ideal calculation followed by additive error
Limited-precision effects are modeled as quantization error and analyzed as if noise is inserted into the system at the quantization/round-off step
Good SNR (signal to noise ratio) demands that the signal’s actual range in an application be large compared to the round-off error
Limit cycles are a special case of quantization (limited-precision) error propagation, of concern for systems with iterative feedback, in which errors seem to propagate indefinitely over iterations
Feedback in computations, e.g. DSP IIR systems, can exhibit limit cycles: indefinite oscillatory responses caused by either round-off or overflow errors. This is possible because the limited-precision error is fed back and repeatedly rounded.
Errors may appear as constant outputs or +/- oscillations, even after the input becomes 0.
In these images, the quantizer symbol refers to a location where precision must be limited, i.e. the word length (# bits) is reduced.
In DSP, FIR filters do not exhibit this since they have no feedback
Techniques to avoid or bound unwanted limit-cycle behavior usually involve increasing the bit length of intermediate calculations, feeding back the quantization error (error feedback), or introducing random dithering at the quantizer to mitigate the perpetuating effects
Reference for Additional Examples:
Verilog 2001 provides signed reg and wire vectors
Casting to and from signed may be implicit or may be explicit by using
$unsigned()
Ex: reg_u=$unsigned(reg_s);
$signed()
Ex: reg_s=$signed(reg_u);
A self-determined expression is one in which the length of the result is determined by the length of the operands or in some cases the expression has a predetermined length for the result.
However, addition and other operation expressions may act as a context-determined expression in which the bit length is determined by the context of the expression that contains it, such as with an addition coupled with an assignment.
In this example we see that addition obeys modular arithmetic, with a result ur8y=0:
ur8a = 128;
ur8b = 128;
ur8y = ur8a+ur8b;
In this example the addition is an expression paired with an assignment, so the length of the assigned variable sets the context-determined operand length of the addition to the largest involved length, 9 bits. Using zero extension in this case, the addition operands are each extended to 9 bits before the addition.
ur8a = 128;
ur8b = 128;
ur9y = ur8a+ur8b; //9-bit addition
Some operators always form a self-determined expression: they have a well-defined bit length that is independent of the context in which they are used and is derived directly from the input operand(s) (the result may still be extended or truncated as needed). These operators may also force their operands to obey their own self-determined bit lengths.
The concatenation operator is one such self-determined expression: its bit length is well defined as the sum of the lengths of its operands, and in turn its operands are forced to use their self-determined lengths.
{a}, for which the result is the length of a
{2'b00,b,a}, for which the result length is 2 + length(b) + length(a)
The use of a single operand to { } can force self-determination for expressions like addition. Recall from the context-determined example above that ur9y = 256; here, in contrast:
ur8a  = 128;
ur8b  = 128;
ur16y = {ur8a+ur8b}; //8-bit addition:  ur16y is 0
ur16z = ur8a+ur8b;   //16-bit addition: ur16z is 256
5.4.1 Rules for expression bit lengths
The rules governing the expression bit lengths have been formulated so that most practical situations have a natural solution.
The number of bits of an expression (known as the size of the expression) shall be determined by the operands involved in the expression and the context in which the expression is given.
A self-determined expression is one where the bit length of the expression is solely determined by the expression itself—for example, an expression representing a delay value.
A context-determined expression is one where the bit length of the expression is determined by the bit length of the expression and by the fact that it is part of another expression. For example, the bit size of the right-hand expression of an assignment depends on itself and the size of the left-hand side.
Table 5-22 shows how the form of an expression shall determine the bit lengths of the results of the expression. In Table 5-22, i, j, and k represent expressions of an operand, and L(i) represents the bit length of the operand represented by i.
Multiplication may be performed without losing any overflow bits by assigning the result to something wide enough to hold it.
At UMBC, or offsite using the UMBC single-sign-on option, go to http://ieeexplore.ieee.org/
Search for IEEE Std 1364-2005
You’ll find
IEEE Standard for Verilog Hardware Description Language IEEE Std 1364-2005 (Revision of IEEE Std 1364-2001)
You will also find the older 1995 standard and the SystemVerilog standard
For System Verilog: IEEE Std 1800-2012 (Revision of IEEE Std 1800-2009)
https://ieeexplore.ieee.org/document/8299595
Addition in the context of an assignment will cause extension of the operands to the size of the result (the synthesizer may later remove useless hardware). However, the extension is performed according to the type of the addition, which is determined by the operands: sign extension is performed only if BOTH operands are signed, regardless of the assignment.
module test();
  reg signed [7:0]  s;
  reg        [7:0]  u;
  reg signed [7:0]  neg_two;
  reg        [15:0] x1, x2;
  reg signed [15:0] y1, y2;

  initial begin
    neg_two = -2;
    s = 1;
    u = 1;
    x1 = u + neg_two;
    x2 = s + neg_two;
    y1 = u + neg_two;
    y2 = s + neg_two;
    $display("%b", x1);
    $display("%b", x2);
    $display("%b", y1);
    $display("%b", y2);
  end
endmodule
Result:
0000000011111111
1111111111111111
0000000011111111
1111111111111111
. . .
reg [3:0] bottleStock = 10;           //**unsigned**
always @ (posedge clk, negedge rst_)
  if (rst_==0)
    bottleStock <= 10;
  else if (bottleStock >= 0)          //always TRUE!!!
    bottleStock <= bottleStock-1;
. . .
input wire [2:0] remove;
reg signed [3:0] remainingStock = 10; //**signed**
always @ (posedge clk, negedge rst_)
  if (rst_==0)
    remainingStock <= 10;
  else if ((remainingStock-remove) >= 0) //always TRUE!!! (remove is unsigned, so the
                                         // expression is evaluated as unsigned)
    remainingStock <= remainingStock-remove;
Multiplication by a power of two, 2^k, is just a left shift: a<<k
Takeaway
a<<k, which in hardware is just rewiring
In general, to hold the result of a multiplication you need M+N bits, where M and N are the lengths of the operands
wire [N-1:0]   a;
wire [M-1:0]   b;
wire [M+N-1:0] y;
assign y = a * b;
The multiplication in the context of the assignment will cause extension of the operands to the size of the result (the synthesizer may later remove useless hardware). However, the extension type is determined by the input operands (sign extension is performed only if BOTH operands are signed).
Multiplication of two variables can be expensive, with a size on the order of MxN (full 1-bit adders, with AND gates serving as 1-bit multipliers)
Delay is proportional to M+N
1011 (M bits) x 10010 (Nbits):
8-bit input multipliers can be used to compute a 16-bit input multiplication via shift, extend, and add:
      AL*BL                                 (low  x low)
+     AH*BL, shifted left 8, sign extended  (high x low)
+     AL*BH, shifted left 8, sign extended  (low  x high)
+     AH*BH, shifted left 16                (high x high)
= 32-bit product
See Application note AVR201: Using the AVR Hardware Multiplier: http://ww1.microchip.com/downloads/en/Appnotes/Atmel-1631-Using-the-AVR-Hardware-Multiplier_ApplicationNote_AVR201.pdf
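A minimal C sketch of the same shift/extend/add decomposition (unsigned case only for brevity; the AVR201 note also covers the signed partial products):

```c
#include <stdio.h>
#include <stdint.h>

/* 16x16 -> 32 unsigned multiply built only from 8x8 -> 16 partial products. */
static uint32_t mul16_from_8x8(uint16_t a, uint16_t b) {
    uint8_t al = (uint8_t)a, ah = (uint8_t)(a >> 8);
    uint8_t bl = (uint8_t)b, bh = (uint8_t)(b >> 8);

    uint16_t ll = (uint16_t)(al * bl);   /* weight 2^0  */
    uint16_t lh = (uint16_t)(al * bh);   /* weight 2^8  */
    uint16_t hl = (uint16_t)(ah * bl);   /* weight 2^8  */
    uint16_t hh = (uint16_t)(ah * bh);   /* weight 2^16 */

    return (uint32_t)ll
         + ((uint32_t)lh << 8)
         + ((uint32_t)hl << 8)
         + ((uint32_t)hh << 16);
}

int main(void) {
    uint16_t a = 51234, b = 60001;
    printf("%lu %lu\n",
           (unsigned long)mul16_from_8x8(a, b),
           (unsigned long)((uint32_t)a * b));  /* both print the same product */
    return 0;
}
```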
Can generalize to larger word-sizes, maintaining separate words of varying weights
Can use Karatsuba Algorithm for fast multiplication of large numbers (outside the scope of this course)
Takeaway
Why? Because integer division rounds towards 0, but if the remainder is at least half of the divisor then we need to round away from 0, which can be accomplished by effectively adding 0.5 to the magnitude of the result before truncation
Bias the numerator as A + sign(A)*(|B|/2) before dividing by B, where sign(A) is 1 or -1 according to the sign of A
Exercise: make code to divide a integer (signed), S, by 256 with a rounded result
result=((S>=0)?(S+128):(S-128))/256;
Exercise: make code to divide a integer (signed), S, by 5 with a rounded result
result=((S>=0)?(S+2):(S-2))/5;
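Generalizing the two exercises into a helper (the name is mine; it assumes a positive divisor and that the biased numerator does not overflow):

```c
/* Rounded division of a signed value s by a positive divisor d:
   bias the numerator by half the divisor, matching the sign of s,
   before the truncating division. */
static int div_round(int s, int d) {
    return (s >= 0) ? (s + d / 2) / d
                    : (s - d / 2) / d;
}
```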
A hardware (RTL to netlist) synthesizer may only support division by powers of two and possibly only division by constants
Float-to-int conversion acts like the floating-point fix() operation (drop the fractional part): the type conversion discards the fractional digits
Takeaway
The context here is that you are given C code for an algorithm that has a signed-integer divide by a power of two
Dividing x by 2^k is almost the same as an arithmetic right shift:
discard k bits on the right and replicate the sign bit k times on the left. The arithmetic shift must be used so that sign extension is performed: x>>>k;
same as {{k{x[msbindex]}}, x[msbindex:k]}
However, “integer division” is defined by truncation of the fractional bits of the result, also known as “round towards zero”. To mimic this behavior, more is needed:
floor(x/2^k):
x>>k;
ceil(x/2^k):
If x is negative, we want ceil(x/2^k), which may be computed by applying a bias that is 1 LSB less than the divisor. This is just enough bias that any nonzero fractional result pushes the value up to the next larger (closer to zero) integer:
(x +((1<<k)-1)) >> k
(-5 +((1<<1)-1)) >> 1
(-5 + 1 ) >> 1 = -4 >> 1 = -2, which matches -5/2 in C (round towards zero)
Takeaway
Division by 2^k in C is defined to ROUND TOWARDS ZERO; mimicking this detail with an arithmetic right shift requires pre-biasing negative values by adding (2^k - 1): ( x + ((1<<k)-1) ) >> k
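A sketch of the shift-based replacement (names are mine; it assumes >> on a negative signed int is an arithmetic shift, which is implementation-defined in C but true of common compilers):

```c
#include <stdio.h>

/* Mimic C's round-toward-zero division by 2^k using only shifts. */
static int div_pow2_trunc(int x, int k) {
    int bias = (x < 0) ? ((1 << k) - 1) : 0;   /* pre-bias negative values */
    return (x + bias) >> k;
}

int main(void) {
    printf("%d %d\n", div_pow2_trunc(-5, 1),  -5 / 2);   /* -2 -2 */
    printf("%d %d\n", div_pow2_trunc(-17, 2), -17 / 4);  /* -4 -4 */
    return 0;
}
```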
Embedded systems can implement larger word-length operations in software, which is a reasonable alternative to using a more complex processor if the need does not occur at timing-critical points
For 16-bit calculations, an 8-bit architecture may support double-register arithmetic (e.g. use two registers to hold the output of an 8x8 multiplication)
For even longer numbers, results can be calculated a piece at a time, and carry/overflow bits (add/subtract) or overflow registers (multiply) can be used to compute larger results. The built-in C variable types are usually handled automatically by the compiler. If even longer types are needed, find an arbitrary-precision arithmetic software library.
So, don’t say that you need a 128-bit processor to perform 128-bit arithmetic.
Error introduced by division:
Let Q be the ideal, errorless result
Let the computed value from integer division be Q', and the associated error be e, such that Q' = Q + e
Error propagation by multiplication:
Error introduced by multiplication:
Error propagated by division:
Error Metric:
The earlier examples were shown to illustrate error generation and propagation with reordering of operands
Keep the decision to go big or go small in mind at each step of implementation in the next topic.
We will formally introduce an additional scale at each step and per operand.
Add 0011.1110 (3.875) and 0001.1000 (1.5):
store 0011.1110 as 00111110
store 0001.1000 as 00011000
  00011000
+ 00111110
= 01010110
interpret 01010110 as 0101.0110 (5.375)
QM.N notation: M+N bits, with M bits as the whole part and N bits as the fractional part.
1101.0000 * 16 = 11010000, S=16, Q4.4
01.011000 * 64 = 01011000, S=64, Q2.6
Addition: C*S = A*S + B*S, computed as an integer addition
Interpret the result by dividing the output by S to obtain the answer C
Using a power of two for S is efficient, though not required
Subtraction works the same way: C*S = A*S - B*S, computed as an integer subtraction
Divide the result by S to interpret the answer C
Example: Check if addition of 0.17 meters and 0.24 meters is greater than .9 meters
int Y_S100 = (.9 * 100);
int A_S100 = (0.17 * 100);
int B_S100 = (0.24 * 100);
int C_S100 = A_S100+B_S100;  //integer addition
int flag = C_S100 > Y_S100;  //integer subtraction/comparison
To be human-understandable, the example is presented with powers of 10 for scaling, but powers of 2 are usually appropriate for efficiency and for predictability of errors and error propagation.
Multiplication: C*S*S = (A*S) * (B*S), computed as an integer multiplication
Unfortunately, the intermediate result (at scale S*S) requires more storage than the scaled result
Example: Check if area of rectangle with sides of length 0.17 meters and 0.24 meters is greater than the area of a square with sides of length .2 meters
int AREASQ_S10000 = (.2 * 100) * (.2 * 100);
int AREASQ_S100 = (.2 * .2 * 100);
int A_S100 = (0.17 * 100);
int B_S100 = (0.24 * 100);
int ARECT_S10000 = A_S100*B_S100;          //integer multiplication
int flag0 = ARECT_S10000 > AREASQ_S10000;  //integer subtraction/comparison
int ARECT_S100 = ARECT_S10000/100;         //convert to scale 100
int flag1 = ARECT_S100 > AREASQ_S100;      //integer subtraction/comparison
Division: A/B could be computed as (A*S)/(B*S)
The scales cancel, which is fine if you only wanted an integer answer
You would need to multiply by S to obtain a scaled result for further math
For precision, it is better to prescale the numerator, e.g. (A*S*S)/(B*S), which yields C*S directly
For rounding, remember to apply a numerator pre-bias of half the divisor when performing the integer division
Example: circumference to radius, r = C/tau
(Note that // is integer (floor) division in Python 3)
from math import *
import numpy as np
import pyforest

tau = pi * 2   # tau
C = tau + 2.8  # some value for circumference
# (note that adding was intentional here just to produce an interesting number 9.0831853072)
#C
C8 = floor(C*8)
#C*8, C8, C8/8
C256 = floor(C*8*8)
#C*8*8, C256, C256/(8*8)
tau8 = floor(tau * 8)
#tau8, tau8/8
halftau8 = floor(tau/2 * 8)  # precomputed bias
#halftau8, halftau8/8

print("C = %20.10f"%C)
print("C/tau = %20.10f"%((C)/(tau)))
print("C8/tau8 = %20.10f"%((C8)//(tau8)))
print(" with pre-div bias : %20.10f"%((C8+halftau8)//(tau8)))
print(" with prescale : %20.10f"%((C8*8)//(tau8)/8))
print(" with prescale and pre-div bias: %20.10f"%((C8*8+halftau8)//(tau8)/8))
print("C256/tau8 = %20.10f"%((C256)//(tau8)/8))
print(" with pre-div bias : %20.10f"%((C256+halftau8)//(tau8)/8))
print("---")
C = 9.0831853072
C/tau = 1.4456338407
C8/tau8 = 1.0000000000
with pre-div bias : 1.0000000000
with prescale : 1.3750000000
with prescale and pre-div bias: 1.5000000000
C256/tau8 = 1.3750000000
with pre-div bias : 1.5000000000
Reducing precision will involve division, typically achieved using a right shift
Therefore, follow the pre-biasing rule for division
When removing less-significant bits, add a bias of 1/2 the weight of the surviving LSB position (sign-adjusted if exact halves should round away from zero) before shifting, to achieve rounding
Examples converting from Q4.4 to Q6.2:
Positive value: 2.125 = 0010.0010 (Q4.4), stored as 00100010
Truncation gives 00001000 (2.0), losing the ending bits 10
With rounding, first add half the weight of the surviving LSB:
  00100010
+ 00000010
= 00100100
then shift right by 2: 00001001 (2.25)
Negative value: -1.0625 = 1110.1111 (Q4.4), stored as 11101111
Truncation (arithmetic shift) gives 11111011 (-1.25), losing the ending bits 11
With rounding, add the same bias before shifting:
  11101111
+ 00000010
= 11110001
then shift right by 2: 11111100 (-1.0)
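A small C sketch of this Q4.4-to-Q6.2 rounding (the function name is mine; it applies the constant +half-LSB bias used in the worked results above and assumes >> of a negative int sign-extends):

```c
#include <stdio.h>
#include <stdint.h>

/* Drop 2 fraction bits (Q4.4 -> Q6.2), rounding rather than truncating. */
static int8_t q44_to_q62(int8_t x) {
    return (int8_t)((x + 2) >> 2);   /* +2 = half the weight of the surviving LSB */
}

int main(void) {
    int8_t a = 0x22;   /* 0010.0010 = 2.125   */
    int8_t b = -17;    /* 1110.1111 = -1.0625 */
    printf("%d %d\n", q44_to_q62(a), q44_to_q62(b));  /* 9 (=2.25)  -4 (=-1.0) */
    return 0;
}
```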
Code examples in the below block are copied and provided under the Creative Commons Attribution-ShareAlike License: https://creativecommons.org/licenses/by-sa/3.0/
Wikipedia contributors. (2021, November 24). Q (number format). In Wikipedia, The Free Encyclopedia. Retrieved 18:10, November 29, 2021, from https://en.wikipedia.org/w/index.php?title=Q_(number_format)&oldid=1056933643
#include <stdint.h>

/* Q is the number of fractional bits of the format (e.g. Q8 here). */
#define Q 8

int16_t q_add_sat(int16_t a, int16_t b)
{
    int16_t result;
    int32_t tmp;

    tmp = (int32_t)a + (int32_t)b;
    if (tmp > 0x7FFF)
        tmp = 0x7FFF;
    if (tmp < -1 * 0x8000)
        tmp = -1 * 0x8000;
    result = (int16_t)tmp;
    return result;
}

// precomputed value:
#define K   (1 << (Q - 1))

// saturate to range of int16_t
int16_t sat16(int32_t x)
{
    if (x > 0x7FFF) return 0x7FFF;
    else if (x < -0x8000) return -0x8000;
    else return (int16_t)x;
}

int16_t q_mul(int16_t a, int16_t b)
{
    int16_t result;
    int32_t temp;

    temp = (int32_t)a * (int32_t)b; // result type is operand's type
    // Rounding; mid values are rounded up
    temp += K;
    // Correct by dividing by base and saturate result
    result = sat16(temp >> Q);

    return result;
}

int16_t q_div(int16_t a, int16_t b)
{
    /* pre-multiply by the base (Upscale to Q16 so that the result will be in Q8 format) */
    int32_t temp = (int32_t)a << Q;
    /* Rounding: mid values are rounded up (down for negative values). */
    /* OR compare most significant bits i.e. if (((temp >> 31) & 1) == ((b >> 15) & 1)) */
    if ((temp >= 0 && b >= 0) || (temp < 0 && b < 0)) {
        temp += b / 2;    /* OR shift 1 bit i.e. temp += (b >> 1); */
    } else {
        temp -= b / 2;    /* OR shift 1 bit i.e. temp -= (b >> 1); */
    }
    return (int16_t)(temp / b);
}
wire signed [7:0] x, y;
wire flagEq;
assign flagEq = (x == y);

The example may be implemented by eight xnor2 gates followed by an and8
Can also use subtraction and check for zero and sign bits
Comparison chain (just x>y) representative circuit:
explanatory pseudo code:
if x[msb]&~y[msb]                //if x=1???????, y=0???????
    flag_x_gt_y=True;            // return True
else if not ~x[msb]&y[msb]       //elif not x=0???????, y=1???????
    if x[msb-1]&~y[msb-1]        // check remaining bits
        flag_x_gt_y=True;
    else if not ~x[msb-1]&y[msb-1]
        if x[msb-2]&~y[msb-2]
            flag_x_gt_y=True;
        else if not ~x[msb-2]&y[msb-2]
            . . .
                if x[0]&~y[0]
                    flag_x_gt_y=True;
return False
4-Bit “Full” Magnitude Comparator Sharing Intermediate Terms
Ref: http://www.ti.com/lit/ds/symlink/sn74ls85.pdf