which, after specification of the bits, represents a minimal expression as a summation of weights,
e.g. Q = 11 3/8 = 11.375 = 8 + 2 + 1 + 1/4 + 1/8 = W3+W1+W0+W−2+W−3
Consider an integer division of positive operands (Q = N/D):
N=Q⋅D+R where 0≤R<D,
The goal is to find which terms Wi would be included in the minimal expression of Q
e.g. N=(W3+W1+W0+W−2+W−3)D+R
An iterative process is possible, starting from the MSB.
Initialize R = N to start the process, and in each step make a comparison between 2^i⋅D and R
Starting with i = n−m−1, check if 2^(n−m−1)⋅D ≤ R
if so, subtract: set q_(n−m−1) = 1 and subtract 2^(n−m−1)⋅D from R
otherwise, set q_(n−m−1) = 0 and leave R as-is
i = i−1: if 2^(n−m−2)⋅D ≤ R, subtract: set q_(n−m−2) = 1 and subtract 2^(n−m−2)⋅D from R,
otherwise set q_(n−m−2) = 0
i = i−1: if 2^(n−m−3)⋅D ≤ R ...
After the last bit we are left with a remainder R<D and our result Q
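The MSB-first process above can be sketched in Python (a sketch only — the function name and integer-only framing are assumptions; hardware implements the comparison differently, as discussed next):

```python
def restoring_divide(N, D):
    """Restoring division sketch for non-negative integers (assumes D > 0).
    Weights 2^i * D are tried from the MSB down; subtract only when it fits."""
    R, Q = N, 0
    # highest weight worth trying: above this index, 2^i * D always exceeds N
    for i in range(max(N.bit_length() - D.bit_length(), 0), -1, -1):
        if (D << i) <= R:          # comparison: 2^i * D <= R ?
            R -= D << i            # conditional subtraction -> q_i = 1
            Q |= 1 << i            # record weight W_i = 2^i
        # otherwise q_i = 0 and R is left as-is
    return Q, R                    # invariant: N == Q*D + R, 0 <= R < D
```

For example, `restoring_divide(233, 5)` returns `(46, 3)`, matching the worked integer example later in these slides.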
Example Solve Q=10.125/3
Try to subtract 3×Wi starting with the largest weight without overshooting (negative remainder) to choose which weights to use.
Record which weights were used
Sum weights at the end
Each step entails a comparison and conditional subtraction
the comparison itself is implemented by the subtraction R − 2^i⋅D and checking whether the result is positive.
Alternatively, we can always subtract from R first: R − 2^i⋅D, and if the result is negative we undo the subtraction
This undo process can be written mathematically as R = R + 2^i⋅D, but it is called restoring (a.k.a. UNDO) and is implemented by preserving a saved copy of R before the subtraction and a mux selecting between the subtraction result and the original value
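One such save-and-mux step might look like this in Python (a sketch; the function name is mine, and the `if` models the mux select):

```python
def restoring_step(R, D, i):
    """One 'restoring' step: always subtract 2^i * D, keep a saved copy,
    and let a mux pick the saved value back if the result went negative."""
    saved = R                 # copy of R preserved before subtraction
    R = R - (D << i)          # unconditional trial subtraction
    if R < 0:
        return saved, 0       # restore (UNDO): mux selects saved R, q_i = 0
    return R, 1               # keep the difference, q_i = 1
```

E.g. `restoring_step(13, 5, 1)` keeps the difference and yields `(3, 1)`, while `restoring_step(3, 5, 0)` overshoots and restores to `(3, 0)`.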
Depictions of implementations
Depiction of "restoring" (white box) vs practical implementation
Depiction of the processing-stage-dependent shift (variable shift), as well as its implementation as a 1-bit shift (shift register, constant shift-by-1), is on the left.
The conditional subtraction is shown on the right for both.
Some liberty has been taken with the algorithms presented in these slides; please refer to the referenced text or the many available research papers to work out the final details of a hardware implementation
Choice of (Add) OR (Don't Add a.k.a restore) → Choice of (Add) OR (Subtract)
Modify the previous division process so that the iterative step always adds to or subtracts from the remainder (previously we conditionally subtracted), recording a value qi at each step representing +1 or −1 times a weight 2^i. The working remainder may become negative in a given step, but it is left as-is (not restored).
At the end, sum the additive terms (wherever qi represented +1) and the subtractive terms (wherever qi represented −1) with their respective weights (Wi) to form the final answer. Note the remainder after the last step may be left negative in this process, requiring an adjustment of the last bit of precision accordingly.
Ex: Solve Q = 10.125/3, but this time all weights must be used, either subtracting or adding.
Algorithm for Iterative Division of Integers (Non-Restoring)
Q=N/D ;assuming n=0, N>D
Final step (3) corrects for a negative remainder (subtracted a little too much)
Initialize r=N
for i = m−1 downto 0:
    qi = +1 if r ≥ 0 (produces subtraction), qi = −1 if r < 0 (produces addition)
    r = r − qi⋅(D×2^i)
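The loop above, including the final correction step (3), might be sketched in Python as follows (function name mine; assumes positive N and D with N < D⋅2^m):

```python
def nonrestoring_divide(N, D, m):
    """Non-restoring division sketch: always add or subtract the weight
    2^i * D, recording q_i = +1 or -1; correct at the end if r < 0."""
    r, digits = N, []
    for i in range(m - 1, -1, -1):
        qi = 1 if r >= 0 else -1       # r >= 0 -> subtract, r < 0 -> add
        r -= qi * (D << i)
        digits.append((qi, i))
    Q = sum(qi << i for qi, i in digits)   # combine the +/-1 weighted digits
    if r < 0:                              # step (3): subtracted a little too much
        Q -= 1
        r += D
    return Q, r
```

For example, `nonrestoring_divide(233, 5, 8)` first reaches Q = 47 with r = −2, then corrects to `(46, 3)` — the same trace as the worked example later in these slides.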
Alternative: scaling the remainder rather than using progressively smaller weights
In the previous processes, the working remainder was reduced in each step. In fact, after processing each bit position, the corresponding bit in the remainder is 0 (as are all the "previous" bits to the left). Thus, we can scale r by two (a shift) before the comparison without any loss of information.
Correspondingly, instead of the comparison 2^i⋅D > r, compare the fixed 2^m⋅D > r, since r has been scaled up by 2^(m−i) by stage i
it turns out that the result is the same as the Non-Restoring division scheme already discussed, except that the working remainder after the final step must be scaled for final interpretation: R = r⋅2^−m (proof: page 40, Koren)
Iterative Division of Integers (Non-Restoring, Scaling Remainder)
Q=N/D ;assuming n=0, N>D
Note that the term D×2^m is a fixed shift of D, not recomputed every iteration
7:N: 233, D*2^m:1280 Decision:sub
6:N: -814, D*2^m:1280 Decision:add
5:N: -348, D*2^m:1280 Decision:add
4:N: 584, D*2^m:1280 Decision:sub
3:N: -112, D*2^m:1280 Decision:add
2:N: 1056, D*2^m:1280 Decision:sub
1:N: 832, D*2^m:1280 Decision:sub
0:N: 384, D*2^m:1280 Decision:sub
Bit Indexes used for subtraction: [7, 4, 2, 1, 0]
Bit Indexes used for addition: [6, 5, 3]
Q:47
Q*D:235
R:-2
Remainder Correction:
Q:46
Q*D:230
R:3
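The trace above can be reproduced with a short Python sketch (names are mine): r is doubled each step, the sign of r selects add or sub against the fixed term D⋅2^m, and the final remainder is rescaled by 2^−m before correction.

```python
def nonrestoring_scaled(N, D, m):
    """Non-restoring division, scaling the remainder: r doubles each step
    and its sign selects subtraction or addition of the fixed D * 2^m."""
    r, subs, adds = N, [], []
    for i in range(m - 1, -1, -1):
        if r >= 0:
            subs.append(i)             # decision: sub
            r = 2 * r - (D << m)
        else:
            adds.append(i)             # decision: add
            r = 2 * r + (D << m)
    Q = sum(1 << i for i in subs) - sum(1 << i for i in adds)
    R = r >> m                         # rescale: R = r * 2^-m
    if R < 0:                          # remainder correction
        Q -= 1
        R += D
    return Q, R, subs, adds
```

With N = 233, D = 5, m = 8 this reproduces the example exactly: subtraction at bit indexes [7, 4, 2, 1, 0], addition at [6, 5, 3], Q = 47 and R = −2 before correction, Q = 46 and R = 3 after.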
Extension of Non-Restoring Division to Signed Operands
Fortunately, the extension of non-restoring division to handle two's-complement signed values is straightforward and only involves modifying the iteration: qi = +1 if sign(2r_(i−1)) = sign(D), qi = −1 if sign(2r_(i−1)) ≠ sign(D)
Division by Convergence
Problem Statement: Find Q, where Q = N/D, given N and D ...but converge faster than 1 bit per iteration
Alternative problem statement towards solution:
find a factor R such that D×R=1, then find Q using Q=N×R
N/D = (N⋅R)/(D⋅R) = Q/1
R will be computed over many stages by finding many factors Ri and R is taken to be R=R1⋅R2⋅R3⋅...
D is driven toward 1 by applying the factors successively: (...(((D⋅R1)⋅R2)⋅R3)⋅...) → 1
Solution:
Start with a D0 that is a "normalized fraction" (binary 0.1xxxxxx...), found by scaling D by a power of 2 as needed so that 1/2 ≤ D0 < 1
D0 = D×S; S = 2^(−s),
Remember to inversely scale the final result
To prescale D, shift right by some s until all bits to the left of the binary point are 0.
Ex: 101.11→.10111 using shift by s=3
To postscale, right-shift R, or simply right-shift the result Q, by s.
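The prescaling step can be sketched with Python floats (function name mine; hardware would count leading bit positions rather than loop):

```python
def prescale(D):
    """Scale D by a power of two into [0.5, 1): returns (D0, s)
    with D0 = D * 2**-s.  A float sketch of the normalization step."""
    D0, s = D, 0
    while D0 >= 1.0:      # too big: shift right (divide by 2)
        D0 /= 2.0
        s += 1
    while D0 < 0.5:       # too small: shift left (multiply by 2)
        D0 *= 2.0
        s -= 1
    return D0, s
```

The slide's example: 101.11 (binary, = 5.75) prescales to .10111 (= 0.71875) with s = 3; the final quotient must then be shifted right by the same s.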
Iterative step, i = 1...m: multiply D_(i−1) by a value Ri ∈ (1, 1.5] to bring it closer to 1 without reaching or exceeding 1. The selection of Ri is presented using an intermediate variable yi that allows us to demonstrate some properties later:
In each iteration, represent D by a value y such that D = 1 − y. Therefore, y represents the distance from 1, noting 0 < y ≤ 1/2 in every iteration. The factor is then Ri = 1 + y = 2 − D_(i−1), i.e., the two's complement of D_(i−1).
The error bound is cut in half in the first iteration, by a further factor of 4 in the next, a further 16 in the next, etc...
Double the number of bits each iteration
D0=.1xxxxx....,
D1=.11xxxx....
D2=.1111xx....,
D3=.11111111xx..
BETTER THAN 1-BIT PER ITERATION OF "LONG" DIVISION!
With each iteration, Di→.111111111...
Therefore, the iterations are performed until the desired accuracy is achieved, and/or until, because of limited precision, no further accuracy is obtained from additional iterations
Algorithm
Each step is two multiplications and a two's complement operation
let D = 1 + y; y = D − 1
xi = (1 − y)(1 + y^2)(1 + y^4)⋅⋅⋅(1 + y^(2^(i−1)))
This is the series from earlier; therefore lim (i→∞) xi = 1/(1 + y) = 1/D
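The whole scheme (prescale, iterate Ri = 2 − D_(i−1), postscale) can be sketched in Python floats; the function name and default iteration count are assumptions, and real hardware would use fixed-point multipliers.

```python
def divide_by_convergence(N, D, iters=6):
    """Division-by-convergence sketch: drive D toward 1 by repeated
    factors R_i = 2 - D_{i-1}, applying the same factors to N.
    Assumes D > 0; floats stand in for fixed-point hardware."""
    s = 0
    while D >= 1.0:           # prescale D into [0.5, 1) ...
        D /= 2.0
        s += 1
    while D < 0.5:
        D *= 2.0
        s -= 1                # ... remembering the shift s
    for _ in range(iters):
        R = 2.0 - D           # two's-complement step: R_i = 1 + y
        D *= R                # D_i = D_{i-1} * R_i  ->  1
        N *= R                # N_i = N_{i-1} * R_i  ->  N / D0
    return N / 2.0 ** s       # postscale the result by s
```

Each pass through the loop is exactly the cost noted above: two multiplications and a two's-complement operation, with the number of correct bits doubling per iteration.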
Lookup Table for Initial Guess
"first approximation" for iterative methods can be pre-calculated from a table to reduce required # iterations
Ex: Div by Reciprocation
Given j bits to index the table, a table of 2^j entries (initial approximations) should be provided at the desired accuracy of n bits
the optimal choice is the function 1/x evaluated at the center of each interval (proof omitted).
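Such a table might be built as follows (a sketch with assumed names): 2^j entries covering [0.5, 1), each holding 1/x at the midpoint of its sub-interval, indexed by the top j fraction bits of the normalized divisor.

```python
def reciprocal_table(j):
    """Build a 2^j-entry table of initial reciprocal guesses for D0 in
    [0.5, 1): entry k holds 1/x at the midpoint of the k-th sub-interval."""
    step = 0.5 / (1 << j)                # width of each sub-interval
    return [1.0 / (0.5 + (k + 0.5) * step) for k in range(1 << j)]

def initial_guess(D0, table):
    """Look up the sub-interval of [0.5, 1) containing D0; the index is
    equivalent to the top j fraction bits after the leading 1."""
    j = len(table).bit_length() - 1      # table size is 2^j
    k = int((D0 - 0.5) * (1 << (j + 1)))   # (D0 - 0.5) / step
    return table[min(k, (1 << j) - 1)]
```

A j = 4 table, for instance, starts every subsequent Newton-Raphson or convergence iteration within a few percent of 1/D0, trimming several iterations.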
Square Root
Will not be discussed this semester.
Using Newton-Raphson iteration is one reasonable approach.
Iteration Implementation
Multi-Cycle
The computation for iterative methods can be distributed in time
e.g. one iteration per clock cycle
can support convergence testing and a variable number of iterations
If the iterations are minimal, one could attempt to use a fast clock in the iterative processor
requires multi-clock domain design to interface the surrounding system
If combinatorial path delay allows, implementing multiple iterations in cascaded combinatorial logic minimizes the number of clock cycles required
Loop Unrolling
Iterative methods with a finite/static number of iterations (or at least a manageable upper bound) can be implemented in combinatorial logic
hardware for each iteration can be tailored to the parameters of a specific iteration
ex: multiplying by z=z×i in iteration i
a multi-cycle iterative processor might use a multiplier to support multiplying by the variable i
an unrolled hardware iteration only needs to support multiplying by a constant
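The z = z×i example can be contrasted in a Python sketch (function names are mine): the rolled loop needs a general multiplier for the variable i, while each unrolled stage multiplies by a constant.

```python
def rolled(z, n):
    """Multi-cycle view: one shared datapath whose multiplier must
    handle the *variable* i each cycle."""
    for i in range(1, n + 1):
        z = z * i              # general-purpose multiplier needed
    return z

def unrolled_4(z):
    """Unrolled view: one tailored stage per iteration, constants only."""
    z = z * 1                  # stage 1: trivial (wire-through)
    z = z * 2                  # stage 2: constant -> shift left by 1
    z = z * 3                  # stage 3: shift-and-add
    z = z * 4                  # stage 4: shift left by 2
    return z
```

Both compute the same result; the unrolled form trades area for the chance to replace each multiplier with shifts and adds.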
Pipelining
Iterative methods lend themselves well to pipelining (esp. for a static number of iterations), allowing throughput matched to the pace of surrounding system
might require more hardware/area (and perhaps power)
pipeline stages may perform one OR MORE iterations
greater combinatorial path delay from cascading the combinatorial logic of two or more iterations
if the clock rate can be maintained, performing more than one iteration in a given stage minimizes the number of stages and hence latency
pipelining (like combinatorial unrolling) allows hardware stages to be tailored to a given iteration (e.g. multiplying by a constant versus a variable)
Lookup Tables for Initial Guess
reduce run-time cycles with precomputed iterations
implement difficult-to-compute initial guess values for globally complex functions, then use methods like Newton-Raphson iteration to refine locally to a precise solution