† A few extended-topics images are from the CD-ROM accompanying Mitra, Sanjit Kumar, and Yonghong Kuo. Digital Signal Processing: A Computer-Based Approach. Vol. 2. New York: McGraw-Hill, 2006.
Rounding Errors: float to int
In C, conversion from float to int is defined as truncation of the fractional digits, effectively rounding towards zero
5.7→5
−5.7→−5
Nearest Integer Rounding using Floating Point Biasing:
Add ±0.5 before truncation, depending on sign
Positive Numbers:
5.7+.5=6.2→6
5.4+.5=5.9→5
−5.7+.5=−5.2→−5 doesn’t work
For negative numbers, need to subtract
−5.7−.5=−6.2→−6
−5.4−.5=−5.9→−5
Takeaway
To achieve common rounding when casting from float to int
For positive floats add 0.5 before casting
For negative floats subtract 0.5 before casting
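Below is a minimal C sketch of this takeaway (the function name is illustrative):

#include <stdio.h>

/* Nearest-integer rounding of a float via biasing before the truncating cast */
int round_to_int(float f)
{
    return (int)(f >= 0.0f ? f + 0.5f : f - 0.5f);
}

int main(void)
{
    printf("%d %d %d\n",
           round_to_int(5.7f),    /* 6  */
           round_to_int(5.4f),    /* 5  */
           round_to_int(-5.7f));  /* -6 */
    return 0;
}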
Affecting Conversion Rounding Error with Numerator Bias
Assume unsigned int A,B;
A/B produces floor(A/B) rather than round(A/B)
So (A+B/2)/B is how we get rounding with integer-only arithmetic
If B is odd, we need to choose whether to round B/2 up or down, depending on the application
Example:
Want to set serial baud rate using clock divider
i.e. BAUD_RATE=CLK_RATE/CLK_DIV
Option 1:
C:
#define CLK_DIV (CLK_RATE/BAUD_RATE)
Verilog:
parameter CLK_DIV = CLK_RATE/BAUD_RATE;
Option 2: BETTER CHOICE WITH INTEGER BIASING
#define CLK_DIV \
  ((CLK_RATE + (BAUD_RATE/2)) / BAUD_RATE)
(Note that \ is just the line continuation for C macros)
Verilog:
parameter CLK_DIV = (CLK_RATE + (BAUD_RATE/2)) / BAUD_RATE;
For positive integers, bias the numerator with half of the denominator to approximate division followed by rounding
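A minimal C sketch of numerator biasing (the values are illustrative):

#include <stdio.h>

int main(void)
{
    unsigned A = 7, B = 2;
    printf("%u\n", A / B);            /* 3 : floor(3.5)          */
    printf("%u\n", (A + B/2) / B);    /* 4 : round(3.5) via bias */
    return 0;
}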
First go big or first go small: choices for order of operations affect result
In general, if the intermediate result will fit in the allowed integer word size, apply integer multiplications before integer divisions to avoid loss of precision (i.e. go big first, unless overflow would occur, in which case go small first)
Need to consider loss of MSB or LSB digits in intermediate terms and make choices based on the application or expected ranges of input values
Example 1: int i = 600, x = 54, y = 23;
want i/x*y : true answer is 255.555…
i/x*y result is 253 : good
i*y/x result is 255 : better
(i*y + x/2)/x result is 256 : best
Example 2:
Assume
an architecture with 16-bit int,
unsigned int c = 7000, x=10,y=2;
want c*x/y which is truly 35000
c *= x; overflows c since (c*x) > 65535, resulting in c = 70000 mod 65536 = 4464
c /= y; get 2232 here from dividing 4464 by 2
Instead: c /= y; then c *= x; result is 3500*10 = 35000
Takeaway
In general, if intermediate result will fit in the allowed integer word size, apply integer multiplications before integer divisions to avoid loss of precision
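A minimal C sketch of Example 2, using uint16_t to model the assumed 16-bit int architecture:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t c = 7000, x = 10, y = 2;

    /* Multiply first: the cast models the 16-bit wrap a native 16-bit int
       would exhibit; c*x = 70000 wraps to 4464, and 4464/2 = 2232 (wrong) */
    uint16_t bad = (uint16_t)(c * x) / y;

    /* Divide first: 7000/2 = 3500, then 3500*10 = 35000 (fits; correct) */
    uint16_t good = (uint16_t)(c / y * x);

    printf("bad=%u good=%u\n", (unsigned)bad, (unsigned)good);
    return 0;
}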
Impacts of Limited Precision in DSP Filters
Limited-precision representation of coefficients limits realizable system parameters
[Figures from Mitra, Digital Signal Processing: A Computer-Based Approach (with Student CD-ROM), 4th ed., McGraw-Hill]
Direct-form II: very poor for high-pass or low-pass filters requiring poles near the real axis
Manipulation of the order of operations, or alternative numerical formulations, can yield different possible system realizations. Shown here is an equivalent system if infinite precision is used, but it yields different results if parameters and calculations are of limited precision.
Eliminating Overflow or Saturation with Input Prescaling
Overflow may occur at the output and at many points within a system
May be able to avoid internal overflow by pre-scaling (division) and post-scaling (multiplication)
If overflow is possible at the output of the accumulators and the result cannot be stored into the delay storage registers…
the solution is to pre-downscale the inputs or include downscaling at internal points within the system (e.g. divide/right-shift x by a necessary factor), at the cost of less effective precision in the result
The transfer function can be computed from the input to the output as well as any intermediate signal.
If the input is bounded and the system is bounded-input bounded-output, then multiplying the input magnitude by the gain of the transfer function to any point reveals the range required at that point
To limit range (# bits) the input can be pre-scaled, and if required the output can be post-scaled
many times working with the scaled result downstream (no post-scale correction) is acceptable or preferred
Increasing Effective Computation Precision with Input Prescaling
In the previous graph, one can apply multiplication in pre-scaling and division in post-scaling if the original-scale result is required (many times it isn't)
By performing internal computations with larger values, numerical loss of accuracy at each step is mitigated
An analogy would be performing math with millimeters rather than meters
Ex:
Code:
(a/b) > (c/5)
Given
a=21
b=4
c=26
Integer results are the same: 21/4 = 5 and 26/5 = 5, so the comparison is false (even though 5.25 > 5.2)
Prescale, conceptually:
1000×(a/b)>1000×(c/5)
New C code:
( (1000*a) / b) > ( (1000* c)/5)
Result
( (21000) / 4) > ( (26000)/5)
5250 > 5200
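The same comparison as a minimal runnable C sketch:

#include <stdio.h>

int main(void)
{
    int a = 21, b = 4, c = 26;

    /* Unscaled: 21/4 = 5 and 26/5 = 5, so the comparison is false */
    printf("%d\n", (a / b) > (c / 5));

    /* Prescaled by 1000: 5250 > 5200 is true, matching the
       real-valued comparison 5.25 > 5.2 */
    printf("%d\n", ((1000 * a) / b) > ((1000 * c) / 5));
    return 0;
}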
Log Scale
Log Scales maintain relative accuracy and represent a useful domain change
Compress large values
log10(100)=2,log10(1000)=3
Expand small values
log2(1/2)=-1,log2(1/4)=-2
Common to use "log-likelihood" in probability computations when there are many possibilities, each with a small fractional probability and an infinitesimal joint probability
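A minimal C sketch of why log-likelihoods are used (the probability value and count are illustrative):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double p = 1e-5;   /* each of many independent event probabilities */
    int    n = 100;

    double joint = 1.0, log_joint = 0.0;
    for (int i = 0; i < n; i++) {
        joint     *= p;          /* eventually underflows to exactly 0  */
        log_joint += log10(p);   /* stays representable: -5 per factor  */
    }
    printf("direct product: %g\n", joint);      /* 0 (underflow)     */
    printf("log10 of joint: %g\n", log_joint);  /* -500, i.e. 1e-500 */
    return 0;
}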
Implementing Saturation to Avoid Large Modular-Arithmetic Error
Sometimes overflow is likely or practically unavoidable, and handling it with detection followed by replacing the result with the limit is more appropriate than modulo arithmetic, where a small overflow results in a very large discrepancy
Detect overflow condition – override output to avoid LARGE roll-over error from modulus arithmetic
If the result exceeds 2^(N−1)−1, replace it with 2^(N−1)−1
If the result is less than −2^(N−1), replace it with −2^(N−1)
Schematic depiction (of required software or hardware)
Example:
a heat controller where the heating element's unsigned 8-bit input value of 250 should increase by 10
should detect and set the result to 255 (error of −5), rather than compute the wrapped result 4 (error of −256)
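A minimal C sketch of saturating 8-bit addition for this example (the function name is illustrative):

#include <stdint.h>
#include <stdio.h>

/* Detect overflow by computing in a wider type, then clamp to the limit */
uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    uint16_t sum = (uint16_t)a + b;
    return (sum > 255) ? 255 : (uint8_t)sum;
}

int main(void)
{
    printf("%u\n", sat_add_u8(250, 10));  /* 255, not the wrapped 4 */
    return 0;
}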
Balanced Range to mitigate accumulated error
many signal processing algorithms and control systems employ summation of values in a series
accumulation of error is important consideration
physical processes can also be affected by accumulated error
i.e. accumulation effect is in physical domain
Using limits 2^(N−1)−1 and −2^(N−1) under some conditions may cause bias (non-zero average error); it might be better to limit to 2^(N−1)−1 and −(2^(N−1)−1)
this creates balanced error accumulation
Soft Limiting
Choosing smoother limiting functions will smooth the estimated wave by "compressing" values past a soft limit. The saturation error is smoothed, and high-frequency distortion is mitigated.
A linearized approach can be implemented cheaply using a scale factor for signal components exceeding a soft limit (scale by a constant or by a power of two)
Ex: 0 < α < 1, such as α = 0.25, so that the upper range can be compressed using a right shift by two
TH=128
if (A > TH) A = TH + (A - TH)*ALPHA;
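A minimal C sketch of this linearized soft limit with ALPHA = 0.25 implemented as a shift (the function name and the symmetric lower side are illustrative additions):

#include <stdio.h>

#define TH 128   /* soft threshold */

int soft_limit(int a)
{
    if (a >  TH) a =  TH + ((a - TH) >> 2);    /* compress by ALPHA = 1/4 */
    if (a < -TH) a = -TH - ((-a - TH) >> 2);   /* symmetric lower side    */
    return a;
}

int main(void)
{
    printf("%d %d %d\n", soft_limit(100), soft_limit(200), soft_limit(-200));
    /* 100 146 -146 */
    return 0;
}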
Smoother sigmoidal functions mimicking analog circuits can be used as well
Instantaneous value mapping and its effect on a sinusoidal time series:
Quantization / Round-off Noise
When the error is small compared to the signal, quantization error is modeled as noise
In fact, it is nearly ideal to model the error as additive white noise when the error is very small compared to the signal (high SNR), because the error then appears to be uncorrelated with the input, whereas otherwise the error can appear correlated with the signal (see the provided graphic)
A computation stage can be numerically modeled using an ideal calculation followed by additive error
Limited-precision effects are modeled as quantization error and analyzed as if noise is inserted into the system at the quantization/round-off step
Good SNR (signal to noise ratio) demands that the signal's actual range in an application be large compared to the round-off error
Limit Cycles
Limit cycles are a special case of quantization (limited-precision) error propagation, of concern for systems with iterative feedback, in which errors seem to propagate indefinitely over iterations
Feedback in computations, e.g. DSP IIR systems, can exhibit limit cycles: indefinite oscillatory responses caused by either round-off or overflow errors. This is possible because the limited-precision error is fed back and repeatedly rounded.
Errors may appear as constant outputs or +/- oscillations, even after the input becomes 0.
[Figures from Mitra, Digital Signal Processing: A Computer-Based Approach, McGraw-Hill]
In these images, Q refers to a location where precision must be limited, e.g. quantized to a limited number of bits
In DSP, FIR filters do not exhibit this since they have no feedback (no recursion)
Techniques to avoid or bound unwanted limit-cycle behavior usually involve increasing the bit length of intermediate calculations, feeding back the quantization error (error feedback), or introducing random quantization (dither) to mitigate perpetuating effects
Reference for Additional Examples:
Oppenheim, Alan V., and Ronald W. Schafer. Discrete-Time Signal Processing, 3rd ed. (Prentice-Hall Signal Processing Series)
Ex. 6.15 steps through an example of round-off error where the output oscillates
Ex. 6.16 steps through an example of overflow-error-based oscillation, which can be more severe
Verilog: Working with signed and unsigned reg and wire
Verilog 2001 provides signed reg and wire vectors
Casting to and from signed may be implicit, or may be explicit using:
$unsigned() Ex: reg_u = $unsigned(reg_s);
$signed() Ex: reg_s = $signed(reg_u);
these do not affect the bits, they affect subsequent operations
Verilog: Signed/Unsigned Casting
Implicit or explicit casting is always a dumb conversion (same as C): it never changes the bits, just the interpretation of the bits for subsequent operations by the compiler/synthesizer (e.g. −1 is not rounded to 0 upon conversion to unsigned, just reinterpreted as the largest unsigned value)
If a mix of signed and unsigned operands is provided to an operator, the operands are first cast to unsigned (like C)
Signed/Unsigned Casting may be followed by length adjustment.
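Since the notes say this matches C, here is a minimal C analogy of the "dumb" reinterpretation:

#include <stdio.h>

int main(void)
{
    signed char   s = -1;               /* bits 11111111            */
    unsigned char u = (unsigned char)s; /* same bits, reinterpreted */
    printf("%d %u\n", s, (unsigned)u);  /* -1 255                   */
    return 0;
}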
Truncation and Extension
Truncation and Extension may be implied by context in Verilog.
Some review of truncation and extension for arithmetic operators is provided herein
Truncation
Assignment to a shorter type is always just bit truncation
no Python-like smart rounding such as unsigned(-1) → 0
Error Checking:
For unsigned, truncation is not a problem as long as all the truncated bits are 0
ex: 0000101 (5), can truncate at most 4 bits (the leading zeros)
For signed, truncation is not a problem as long as all the bits truncated are the same AND they match the surviving msb
ex: 1111100 (-4) , can truncate at most 4 bits
Signed and Unsigned Extension
Assignment to a longer type is done with either zero or sign extension depending on the type:
Unsigned types use zero extension
Signed types use sign extension
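A minimal C analogy: the widening conversions below correspond, at the bit level, to zero extension for unsigned types and sign extension for signed types:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int8_t  s = -4;    /* bits 11111100 */
    uint8_t u = 0xFC;  /* same bits     */

    int16_t  ws = s;   /* sign extension: 0xFFFC = -4  */
    uint16_t wu = u;   /* zero extension: 0x00FC = 252 */
    printf("%d %u\n", ws, (unsigned)wu);
    return 0;
}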
Verilog Rules for expression bit lengths
A self-determined expression is one in which the length of the result is determined by the lengths of the operands, or, in some cases, the expression has a predetermined result length.
Ex: as a self-determined expression, the result of an addition has a length equal to the maximum length of the two operands.
Ex: a comparison is always a self-determined expression with a 1-bit result.
However, addition and other operations may act as context-determined expressions, in which the bit length is determined by the context of the containing expression, such as an addition coupled with an assignment.
Verilog Context-Determined Addition
In this example, we see that the addition obeys modular arithmetic, with result ur8y = 0:
ur8a = 128;
ur8b = 128;
ur8y = ur8a + ur8b;
In the next example, the addition is an expression paired with an assignment, so the assigned variable's length sets the context-determined operand length of the addition, here 9 bits. Using zero extension in this case, the addition operands are each extended to 9 bits before the addition.
Self-Determined Expressions and Self-Determined Operands
Some operators always represent a self-determined expression: they have a well-defined bit length that is independent of the context in which they are used and is derived directly from the input operand(s) (the result may still be extended or truncated as needed). These operators may also force their operands to obey their own self-determined bit lengths.
The concatenation operator is one such example: it is a self-determined expression whose bit length is well-defined as the sum of the lengths of its operands, and in turn its operands are forced to use their self-determined expression lengths.
Single-operand : e.g. {a} for which the result is the length of a
Multiple operands: e.g. {2'b00,b,a} for which the result length is 2+length(b)+length(a)
The use of a single operand to { } can force a self-determination for expressions like addition:
In this next example, the self-determined length of the addition inside { } is 8 bits, which yields 0 for that summation; the 8-bit result of the concatenation operator is always unsigned and is thus zero-extended for the assignment, so ur16y = 0.
The context-determined addition is performed at 16 bits, so ur16z = 256.
ur8a = 128;
ur8b = 128;
ur16y = {ur8a + ur8b}; // 8-bit addition: ur16y = 0
ur16z = ur8a + ur8b;   // 16-bit addition: ur16z = 256
Rules for expression bit lengths from IEEE Standards
5.4.1 Rules for expression bit lengths
The rules governing the expression bit lengths have been formulated so that most practical situations have a natural solution.
The number of bits of an expression (known as the size of the expression) shall be determined by the operands involved in the expression and the context in which the expression is given.
A self-determined expression is one where the bit length of the expression is solely determined by the expression itself—for example, an expression representing a delay value.
A context-determined expression is one where the bit length of the expression is determined by the bit length of the expression and by the fact that it is part of another expression. For example, the bit size of the right-hand expression of an assignment depends on itself and the size of the left-hand side.
Table 5-22 shows how the form of an expression shall determine the bit lengths of the results of the expression. In Table 5-22, i, j, and k represent expressions of an operand, and L(i) represents the bit length of the operand represented by i.
Multiplication may be performed without losing any overflow bits by assigning the result to something wide enough to hold it.
Reference from IEEE Standards
IEEE Standard for Verilog Hardware Description Language IEEE Std 1364-2005
Obtaining the IEEE Verilog Specification
At UMBC (or, if offsite, use the UMBC single-sign-on option), go to http://ieeexplore.ieee.org/
Search for IEEE Std 1364-2005
You'll find
IEEE Standard for Verilog Hardware Description Language IEEE Std 1364-2005 (Revision of IEEE Std 1364-2001)
as well as the older 1995 standard and the SystemVerilog standard
Addition in the context of an assignment will cause extension of the operands to the size of the result (the synthesizer may later remove useless hardware). However, the extension is performed according to the type of the addition, which is determined by the operands. Therefore, sign extension is performed only if BOTH operands are signed, regardless of the assignment.
initial begin
  neg_two = -2;
  s = 1; u = 1;
  x1 = u + neg_two; x2 = s + neg_two;
  y1 = u + neg_two; y2 = s + neg_two;
  $display("%b", x1);
  $display("%b", x2);
  $display("%b", y1);
  $display("%b", y2);
end
endmodule
Multiplication in the context of an assignment will likewise cause extension of the operands to the size of the result (the synthesizer may later remove useless hardware). However, the extension type is determined by the input operands (sign extension is performed only if BOTH operands are signed).
Multiplication of two variables can be expensive, with a size on the order of M×N (full 1-bit adders, and AND gates as 1-bit multipliers)
Delay is proportional to M+N
1011 (M bits) × 10010 (N bits):
Multiplication Overflow Check
can compute full result and check truncation
can compute partial products (bitwise AND, shift) and check for overflow that would occur when shifting the partial products, and check after each addition
can compute the full result, then divide by one input and compare to the other input
Software Multiplication using Smaller Hardware Multipliers
8-bit input multipliers can be used to compute a 16-bit input multiplication
Many smaller partial products are computed and weighted or added into appropriately weighted output word
mathematically shown: Y = {AH,AL} × {BH,BL} = (AH×BH)<<16 + (AH×BL)<<8 + (AL×BH)<<8 + (AL×BL)
Here {AH,AL} is used to denote bit concatenation (not a C operator)
Can generalize to larger word-sizes, maintaining separate words of varying weights
Can use Karatsuba Algorithm for fast multiplication of large numbers
Takeaway
Long multiplications can be mapped to hardware-supported multiplications by breaking the TWO input operands into supported input word sizes, performing all associated pair-wise multiplications between the TWO inputs (like the FOIL method), and then adding the results with appropriate shifting to implement weights and sign extensions
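A minimal C sketch of this takeaway: a 16×16 → 32-bit unsigned multiply built from 8×8 → 16-bit partial products (the function name is illustrative):

#include <stdint.h>
#include <stdio.h>

uint32_t mul16_from_mul8(uint16_t a, uint16_t b)
{
    uint16_t ah = a >> 8, al = a & 0xFF;
    uint16_t bh = b >> 8, bl = b & 0xFF;

    /* four 8x8 partial products, each fits in 16 bits */
    uint32_t hh = (uint32_t)(ah * bh);
    uint32_t hl = (uint32_t)(ah * bl);
    uint32_t lh = (uint32_t)(al * bh);
    uint32_t ll = (uint32_t)(al * bl);

    /* weight and add: Y = hh<<16 + (hl + lh)<<8 + ll */
    return (hh << 16) + ((hl + lh) << 8) + ll;
}

int main(void)
{
    uint16_t a = 51234, b = 60321;
    printf("%u %u\n", mul16_from_mul8(a, b), (uint32_t)a * b);  /* equal */
    return 0;
}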
Example: 8-bit multiplication using 4-bit multipliers
To avoid the larger multiplier in ((AH+AL)∗(BH+BL) − AH∗BH − AL∗BL) << 8,
start with −((AH−AL)∗(BH−BL) − AH∗BH − AL∗BL) << 8
(AH−AL)∗(BH−BL) may be negative, requiring an extra bit of output, so:
Compute the sign and absolute value of (AH−AL) as well as (BH−BL)
Compute |(AH−AL)×(BH−BL)| using the standard-size multiplication and use the recorded signs to add or subtract this term from the sum
EMULATING LARGE MULTIPLIERS
An FPGA may have banks of “hard” multipliers (e.g. 8×8 multipliers, sometimes in what are called DSP slices) to avoid using large portions of the programmable fabric. Ex: 8-bit multipliers can be used for a 16-bit multiply, mathematically shown: Y = {AH,AL} × {BH,BL} = (AH×BH)<<16 + (AH×BL)<<8 + (AL×BH)<<8 + (AL×BL). The synthesizer will recognize the size of the multiplication and construct the mapping to available multipliers for you.
Karatsuba Algorithm may be used
Multiplication by Low-Density Constants
Multiplication by low-density constants (few non-zero bits) can be inexpensive: u × 24 = u × (16 + 8) = (u<<4) + (u<<3)
Using example from previous slide: if the second operand is a constant, the synthesizer reduces the multiplication to shift and one adder:
Takeaway
Multiplying by low-density constants can be implemented with a few shift-and-add operations
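A minimal C sketch of the u × 24 example above (the function name is illustrative):

#include <stdio.h>

unsigned mul24(unsigned u)
{
    return (u << 4) + (u << 3);   /* u*16 + u*8 = u*24 */
}

int main(void)
{
    printf("%u\n", mul24(7));     /* 168 */
    return 0;
}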
Rounded Integer Division
Rounded-result division of integers A, B may be accomplished by adding a bias that matches the sign of the result: add an offset (bias) to A that is half the magnitude of the divisor B (by integer division) and matches the sign of the numerator
|round(float(A)/float(B))| = (|A| + |B|/2) / |B|
Why? Because integer division rounds towards 0, but if |A%B| is at least half of |B|, then we need to round away from 0. This can be accomplished by effectively adding 0.5 to the magnitude of the result before truncation: compute (A + (|B|×sign(A))/2) / B, where sign(A) is 1 or −1 according to the sign of A
Exercise: write code to divide a signed integer S by 256 with a rounded result:
result = ((S >= 0) ? (S + 128) : (S - 128)) / 256;
Exercise: write code to divide a signed integer S by 5 with a rounded result:
result = ((S >= 0) ? (S + 2) : (S - 2)) / 5;
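A minimal C sketch generalizing these exercises (assumes a positive divisor; the function name is illustrative):

#include <stdio.h>

/* Rounded signed division by positive d, using a sign-dependent bias */
int div_round(int s, int d)
{
    int bias = d / 2;   /* half the divisor magnitude, by integer division */
    return (s >= 0) ? (s + bias) / d : (s - bias) / d;
}

int main(void)
{
    printf("%d %d\n", div_round(-8, 5), div_round(1000, 256));  /* -2 4 */
    return 0;
}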
A hardware (RTL to netlist) synthesizer may only support division by powers of two and possibly only division by constants
float-to-int conversion is the floating-point fix() operation (discard the fractional part) combined with the type conversion
Takeaway
Integer division x/y is the same as floating-point division followed by int conversion: ROUND TOWARDS ZERO
This does not achieve a nearest-integer rounded result; that requires a sign-dependent bias such as the following:
(x + sign(x) × |y/2|) / y, or
(x + ((sign(x) != sign(y)) ? −1 : 1) × (y/2)) / y
Examples and plots follow
Rounded Division example with even denominator
floating-point division: 7.0/−4 = −1.75
rounded result: round(7.0/−4) = round(−1.75) = −2.0
cast result to int: int(7.0/−4) = int(−1.75) = −1
bias float before cast: fix(7.0/−4 − 0.5) = fix(−2.25) = −2
integer biasing: (7 + (4/2)) / −4 = int(−2.25) = −2
Rounded Division example with odd denominator
floating-point division: 8.0/−3 = −2.666...
rounded result: round(8.0/−3) = round(−2.666...) = −3.0
cast result to int: int(8.0/−3) = int(−2.666...) = −2
bias float before cast: fix(8.0/−3 − 0.5) = fix(−3.166...) = −3
integer biasing: (8 + (3/2)) / −3 = 9 / −3 = −3
Floor vs Fix vs Round for Signed Values
(Plots: positive-biased value before fix; negative-biased value before fix; sign-dependent bias before fix)
To Properly Mimic Signed-Integer Division by a Power of Two with a Shift, a Sign-Dependent Pre-Shift Bias is Required
The context here is that you are given C code for an algorithm that has a signed-integer divide by a power of two
You want to implement the algorithm in hardware using a shift to achieve matching results
Dividing x by 2^k is almost the same as an arithmetic right shift:
discard k bits on the right and replicate the sign bit k times on the left. Must use the arithmetic shift to perform sign extension: x >>> k; same as { {k{x[msbindex]}}, x[msbindex:k] }
However, “integer division” is defined by truncation of the fractional bits of the result, also known as “round towards zero”. To mimic this behavior, more is needed:
For a positive x, integer division corresponds to floor(x/2^k)
ex: 5/2 = 2
If x is positive you can just use logical shifting: x>>k;
For a negative x, integer division corresponds to ceil(x/2^k)
ex: -5/2= -2
Note that arithmetic right-shift does not mimic this behavior:
5 = 0101₂ : 0101₂ >>> 1 = 0010₂ = 2, matching 5/2 = 2
−5 = 1011₂ : 1011₂ >>> 1 = 1101₂ = −3, but −5/2 = −2
If x is negative, we want ceil(x/2^k), which may be computed by applying a bias that is 1 LSB less than the divisor… this is just enough bias that, for any fractional result, the value becomes “bigger” than the next “bigger” negative integer, so that:
floor((x + (2^k − 1)) / 2^k)
(x + ((1<<k)-1)) >> k (note that here >> represents a signed arithmetic shift)
Example: (−5 + (2−1))/2 = (−5 + 1)/2 = −4/2 = −2
(−5 + ((1<<1)−1)) >> 1
= (−5 + 1) >> 1 = −4 >> 1 = −2
Takeaway
Division of x by 2^k is defined to ROUND TOWARDS ZERO --- mimicking this detail with an arithmetic right shift requires pre-biasing a negative x by adding 2^k − 1: ( x + ((1<<k)-1) ) >> k
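A minimal C sketch of this takeaway (note that right-shifting a negative signed value is implementation-defined in C, but is an arithmetic shift on common compilers):

#include <stdio.h>

/* Mimic truncating division by 2^k using an arithmetic right shift */
int div_pow2(int x, int k)
{
    if (x < 0)
        x += (1 << k) - 1;   /* bias so the shift rounds toward zero */
    return x >> k;
}

int main(void)
{
    printf("%d %d\n", div_pow2(-5, 1), -5 / 2);  /* both -2 */
    printf("%d %d\n", div_pow2( 5, 1),  5 / 2);  /* both  2 */
    return 0;
}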
Very Long and Arbitrary Precision Arithmetic Using Software
Arbitrary precision arithmetic libraries are software libraries that support arithmetic on word-lengths of arbitrary precision, outside the native length provided by hardware.
Embedded systems can implement larger word-length operations in software, which is a reasonable alternative to using a more complex processor if the arbitrary word-length computation needs are not timing-critical
For 16-bit calculations, an 8-bit architecture may support double-register arithmetic (e.g. use two registers to hold the output of an 8×8 multiplication)
For even longer numbers results can be calculated a piece at a time and overflow bits (add/sum) or overflow registers (multiply) can be used to compute larger results. The built-in C variable types are usually automatically handled by the compiler. If even longer types are needed, find an arbitrary precision arithmetic software library.
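A minimal C sketch of piece-at-a-time arithmetic: a 32-bit addition built from 8-bit pieces with explicit carry propagation, as a compiler for a small processor might arrange it (the function name is illustrative):

#include <stdint.h>
#include <stdio.h>

void add32_as_bytes(const uint8_t a[4], const uint8_t b[4], uint8_t y[4])
{
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {          /* little-endian byte order  */
        unsigned s = (unsigned)a[i] + b[i] + carry;
        y[i]  = (uint8_t)s;                /* low 8 bits of this piece  */
        carry = s >> 8;                    /* overflow feeds next byte  */
    }
}

int main(void)
{
    uint8_t a[4] = {0xFF, 0xFF, 0x00, 0x00};  /* 65535 */
    uint8_t b[4] = {0x01, 0x00, 0x00, 0x00};  /* 1     */
    uint8_t y[4];
    add32_as_bytes(a, b, y);
    printf("%02X %02X %02X %02X\n", y[3], y[2], y[1], y[0]);  /* 00 01 00 00 */
    return 0;
}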
So, don't say that you need a 128-bit processor to perform 128-bit arithmetic.
It's just that implementing 128-bit arithmetic with an 8-bit processor via software could be slow.
Take for example performing two 128-bit operations a second when using an 8-bit processor.
That shouldn't be a problem (even at only 8 MHz) and is not cause to change to a different hardware platform.
Fixed-Point Arithmetic
A later topic will be fixed-point arithmetic, which allows computing with representations of fractional values using only integer arithmetic units
Additional Operations
Equality
Magnitude Comparison
Equality Comparison
wire signed [7:0] x, y;
wire flagEq;
assign flagEq = (x == y);
The example may be implemented by eight 2-input XNOR gates followed by an 8-input AND
Magnitude Comparator
Can also use subtraction and check for zero and sign bits
if x[msb] & ~y[msb]                      // x = 1???????, y = 0???????
    flag_x_gt_y = True                   // return True
else if not (~x[msb] & y[msb])           // unless x = 0???????, y = 1???????
    if x[msb-1] & ~y[msb-1]              // check remaining bits
        flag_x_gt_y = True
    else if not (~x[msb-1] & y[msb-1])
        if x[msb-2] & ~y[msb-2]
            flag_x_gt_y = True
        else if not (~x[msb-2] & y[msb-2])
            ...
            if x[0] & ~y[0]
                flag_x_gt_y = True
return False
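A minimal C sketch of this bit-by-bit magnitude comparison for unsigned 8-bit operands (the function name is illustrative):

#include <stdint.h>
#include <stdio.h>

int x_gt_y(uint8_t x, uint8_t y)
{
    for (int i = 7; i >= 0; i--) {        /* scan from the MSB down       */
        int xb = (x >> i) & 1, yb = (y >> i) & 1;
        if (xb && !yb) return 1;          /* first differing bit decides  */
        if (!xb && yb) return 0;
    }
    return 0;                             /* equal: not greater           */
}

int main(void)
{
    printf("%d %d %d\n", x_gt_y(5, 3), x_gt_y(3, 5), x_gt_y(7, 7));  /* 1 0 0 */
    return 0;
}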