which, after specification of the bits, represents a minimal expression as a summation of weights,
e.g. Q = 11 3/8 = 11.375 = 8 + 2 + 1 + 1/4 + 1/8 = W3+W1+W0+W−2+W−3
Consider an integer division of positive operands (Q = N/D):
N=Q⋅D+R where 0≤R<D,
The goal is to find which terms Wi would be included in the minimal expression of Q
e.g. N=(W3+W1+W0+W−2+W−3)D+R
An iterative process is possible, starting from the MSB.
Initialize R = N to start the process, and in each step make a comparison between 2^i⋅D and R
Starting with i = n−m−1, check if 2^(n−m−1)⋅D ≤ R
if so, subtract: set q_(n−m−1) = 1 and subtract 2^(n−m−1)⋅D from R
otherwise, set q_(n−m−1) = 0 and leave R as-is
i = i−1: if 2^(n−m−2)⋅D ≤ R, subtract: set q_(n−m−2) = 1 and subtract 2^(n−m−2)⋅D from R,
otherwise set q_(n−m−2) = 0
i = i−1: if 2^(n−m−3)⋅D ≤ R ...
After the last bit we are left with a remainder R<D and our result Q
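The MSB-first process above can be sketched in Python (a sketch only — the function name and integer-only framing are assumptions; hardware implements the comparison differently, as discussed next):

```python
def restoring_divide(N, D):
    """Restoring division sketch for non-negative integers (assumes D > 0).
    Weights 2^i * D are tried from the MSB down; subtract only when it fits."""
    R, Q = N, 0
    # highest weight worth trying: above this index, 2^i * D always exceeds N
    for i in range(max(N.bit_length() - D.bit_length(), 0), -1, -1):
        if (D << i) <= R:          # comparison: 2^i * D <= R ?
            R -= D << i            # conditional subtraction -> q_i = 1
            Q |= 1 << i            # record weight W_i = 2^i
        # otherwise q_i = 0 and R is left as-is
    return Q, R                    # invariant: N == Q*D + R, 0 <= R < D
```

For example, `restoring_divide(233, 5)` returns `(46, 3)`, matching the worked integer example later in these slides.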
Example Solve Q=10.125/3
Try to subtract 3×Wi starting with the largest weight without overshooting (negative remainder) to choose which weights to use.
Record which weights were used
Sum weights at the end
Each step entails a comparison and conditional subtraction
the comparison itself is implemented by the subtraction R − 2^i⋅D and checking whether the result is positive.
Alternatively, we can always subtract from R first: R − 2^i⋅D, and if the result is negative we undo the subtraction
This undo process can be written mathematically as R = R + 2^i⋅D, but it is called restoring (a.k.a. UNDO) and is implemented by preserving a saved copy of R before the subtraction and a mux selecting between the subtraction result and the original value
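One such save-and-mux step might look like this in Python (a sketch; the function name is mine, and the `if` models the mux select):

```python
def restoring_step(R, D, i):
    """One 'restoring' step: always subtract 2^i * D, keep a saved copy,
    and let a mux pick the saved value back if the result went negative."""
    saved = R                 # copy of R preserved before subtraction
    R = R - (D << i)          # unconditional trial subtraction
    if R < 0:
        return saved, 0       # restore (UNDO): mux selects saved R, q_i = 0
    return R, 1               # keep the difference, q_i = 1
```

E.g. `restoring_step(13, 5, 1)` keeps the difference and yields `(3, 1)`, while `restoring_step(3, 5, 0)` overshoots and restores to `(3, 0)`.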
Depictions of implementations
Depiction of "restoring" (white box) vs practical implementation
Depiction of the processing-stage-dependent shift (variable shift), as well as its implementation as a 1-bit shift (shift register, constant shift-by-1), is on the left.
The conditional subtraction is shown on the right for both.
Some liberty has been taken with the algorithms presented in these slides; please refer to the referenced text or the many available research papers to work out the final details of a hardware implementation
Choice of (Add) OR (Don't Add a.k.a restore) → Choice of (Add) OR (Subtract)
Modify the previous division process so that the iterative step always adds to or subtracts from the remainder (previously we conditionally subtracted), recording a value qi at each step representing +1 or −1 times a weight 2^i. The working remainder may become negative in a given step, but it is left as-is (not restored).
At the end, sum the additive terms (wherever qi represented +1) and the subtractive terms (wherever qi represented −1) with their respective weights (Wi) to form the final answer. Note the remainder after the last step may be left negative in this process, requiring an adjustment of the last bit of precision accordingly.
Ex: Solve Q = 10.125/3, but this time all weights must be used, either subtracting or adding.
Algorithm for Iterative Division of Integers (Non-Restoring)
Q=N/D ;assuming n=0, N>D
Final step (3) corrects for a negative remainder (subtracted a little too much)
Initialize r=N
for i = m−1 downto 0:
    qi = +1 if r ≥ 0 (produces subtraction), qi = −1 if r < 0 (produces addition)
    r = r − qi⋅(D×2^i)
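The loop above, including the final correction step (3), might be sketched in Python as follows (function name mine; assumes positive N and D with N < D⋅2^m):

```python
def nonrestoring_divide(N, D, m):
    """Non-restoring division sketch: always add or subtract the weight
    2^i * D, recording q_i = +1 or -1; correct at the end if r < 0."""
    r, digits = N, []
    for i in range(m - 1, -1, -1):
        qi = 1 if r >= 0 else -1       # r >= 0 -> subtract, r < 0 -> add
        r -= qi * (D << i)
        digits.append((qi, i))
    Q = sum(qi << i for qi, i in digits)   # combine the +/-1 weighted digits
    if r < 0:                              # step (3): subtracted a little too much
        Q -= 1
        r += D
    return Q, r
```

For example, `nonrestoring_divide(233, 5, 8)` first reaches Q = 47 with r = −2, then corrects to `(46, 3)` — the same trace as the worked example later in these slides.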
Alternative: scaling the remainder rather than using progressively smaller weights
In the previous processes, the working remainder was reduced in each step. In fact, after processing each bit position, the corresponding bit in the remainder is 0 (as are all the "previous" bits to the left). Thus, we can scale r by two (a shift) before the comparison without any loss of information.
Correspondingly, instead of the comparison 2^i⋅D > r, compare the fixed 2^m⋅D > r, since r has been scaled up by 2^(m−i) by stage i
it turns out that the result is the same as the Non-Restoring division scheme already discussed, except that the working remainder after the final step must be scaled for final interpretation: R = r⋅2^−m (proof: page 40, Koren)
Iterative Division of Integers (Non-Restoring, Scaling Remainder)
Q=N/D ;assuming n=0, N>D
Note that the term D×2^m is a fixed shift of D, not recomputed every iteration
7:N: 233, D*2^m:1280 Decision:sub
6:N: -814, D*2^m:1280 Decision:add
5:N: -348, D*2^m:1280 Decision:add
4:N: 584, D*2^m:1280 Decision:sub
3:N: -112, D*2^m:1280 Decision:add
2:N: 1056, D*2^m:1280 Decision:sub
1:N: 832, D*2^m:1280 Decision:sub
0:N: 384, D*2^m:1280 Decision:sub
Bit Indexes used for subtraction: [7, 4, 2, 1, 0]
Bit Indexes used for addition: [6, 5, 3]
Q:47
Q*D:235
R:-2
Remainder Correction:
Q:46
Q*D:230
R:3
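The trace above can be reproduced with a short Python sketch (names are mine): r is doubled each step, the sign of r selects add or sub against the fixed term D⋅2^m, and the final remainder is rescaled by 2^−m before correction.

```python
def nonrestoring_scaled(N, D, m):
    """Non-restoring division, scaling the remainder: r doubles each step
    and its sign selects subtraction or addition of the fixed D * 2^m."""
    r, subs, adds = N, [], []
    for i in range(m - 1, -1, -1):
        if r >= 0:
            subs.append(i)             # decision: sub
            r = 2 * r - (D << m)
        else:
            adds.append(i)             # decision: add
            r = 2 * r + (D << m)
    Q = sum(1 << i for i in subs) - sum(1 << i for i in adds)
    R = r >> m                         # rescale: R = r * 2^-m
    if R < 0:                          # remainder correction
        Q -= 1
        R += D
    return Q, R, subs, adds
```

With N = 233, D = 5, m = 8 this reproduces the example exactly: subtraction at bit indexes [7, 4, 2, 1, 0], addition at [6, 5, 3], Q = 47 and R = −2 before correction, Q = 46 and R = 3 after.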
Extension of Non-Restoring Division to Signed Operands
Fortunately, the extension of non-restoring division to handle two's-complement signed values is straightforward and only involves modifying the iteration: qi = +1 if sign(2r_(i−1)) = sign(D), qi = −1 if sign(2r_(i−1)) ≠ sign(D)
Division by Convergence
Problem Statement: Find Q, where Q = N/D, given N and D ...but converge faster than 1 bit per iteration
Alternative problem statement towards solution:
find a factor R such that D×R=1, then find Q using Q=N×R
N/D = (N⋅R)/(D⋅R) = Q/1
R will be computed over many stages by finding many factors Ri and R is taken to be R=R1⋅R2⋅R3⋅...
D is driven toward 1 by applying the factors successively: (...(((D⋅R1)⋅R2)⋅R3)⋅...) → 1
Solution:
Start with a D0 that is a "normalized fraction" (binary 0.1xxxxxx...), found by scaling D by a power of 2 as needed so that 1/2 ≤ D0 < 1
D0 = D×S; S = 2^(−s),
Remember to inversely scale the final result
To prescale D, shift right by some s until all bits to the left of the binary point are 0.
Ex: 101.11→.10111 using shift by s=3
To postscale, right-shift R, or simply right-shift the result Q, by s.
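The prescaling step can be sketched with Python floats (function name mine; hardware would count leading bit positions rather than loop):

```python
def prescale(D):
    """Scale D by a power of two into [0.5, 1): returns (D0, s)
    with D0 = D * 2**-s.  A float sketch of the normalization step."""
    D0, s = D, 0
    while D0 >= 1.0:      # too big: shift right (divide by 2)
        D0 /= 2.0
        s += 1
    while D0 < 0.5:       # too small: shift left (multiply by 2)
        D0 *= 2.0
        s -= 1
    return D0, s
```

The slide's example: 101.11 (binary, = 5.75) prescales to .10111 (= 0.71875) with s = 3; the final quotient must then be shifted right by the same s.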
Iterative step, i = 1...m: multiply D_(i−1) by a value Ri ∈ (1, 1.5] to bring it closer to 1 without reaching or exceeding 1. The selection of Ri is presented using an intermediate variable yi that allows us to demonstrate some properties later:
In each iteration, represent D by a value y such that D = 1 − y. Therefore, y represents the distance from 1, noting 0 < y ≤ 1/2 in every iteration. The factor is then Ri = 1 + y = 2 − D_(i−1), i.e., the two's complement of D_(i−1).
The error bound is cut in half in the first iteration, by a further factor of 4 in the next, a further 16 in the next, etc...
Double the number of bits each iteration
D0=.1xxxxx....,
D1=.11xxxx....
D2=.1111xx....,
D3=.11111111xx..
BETTER THAN 1-BIT PER ITERATION OF "LONG" DIVISION!
With each iteration, Di→.111111111...
Therefore, the iterations are performed until the desired accuracy is achieved, and/or until, because of limited precision, no further accuracy is obtained from additional iterations
Algorithm
Each step is two multiplications and a two's complement operation
let D = 1 + y; y = D − 1
xi = (1 − y)(1 + y^2)(1 + y^4)⋅⋅⋅(1 + y^(2^(i−1)))
This is the series from earlier; therefore lim (i→∞) xi = 1/(1 + y) = 1/D
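The whole scheme (prescale, iterate Ri = 2 − D_(i−1), postscale) can be sketched in Python floats; the function name and default iteration count are assumptions, and real hardware would use fixed-point multipliers.

```python
def divide_by_convergence(N, D, iters=6):
    """Division-by-convergence sketch: drive D toward 1 by repeated
    factors R_i = 2 - D_{i-1}, applying the same factors to N.
    Assumes D > 0; floats stand in for fixed-point hardware."""
    s = 0
    while D >= 1.0:           # prescale D into [0.5, 1) ...
        D /= 2.0
        s += 1
    while D < 0.5:
        D *= 2.0
        s -= 1                # ... remembering the shift s
    for _ in range(iters):
        R = 2.0 - D           # two's-complement step: R_i = 1 + y
        D *= R                # D_i = D_{i-1} * R_i  ->  1
        N *= R                # N_i = N_{i-1} * R_i  ->  N / D0
    return N / 2.0 ** s       # postscale the result by s
```

Each pass through the loop is exactly the cost noted above: two multiplications and a two's-complement operation, with the number of correct bits doubling per iteration.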
Lookup Table for Initial Guess
"first approximation" for iterative methods can be pre-calculated from a table to reduce required # iterations
Ex: Div by Reciprocation
Given j bits to index the table, a table of 2^j entries (initial approximations) should be provided at the desired accuracy of n bits
the optimal choice is the function 1/x evaluated at the center of each interval (proof omitted).
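Such a table might be built as follows (a sketch with assumed names): 2^j entries covering [0.5, 1), each holding 1/x at the midpoint of its sub-interval, indexed by the top j fraction bits of the normalized divisor.

```python
def reciprocal_table(j):
    """Build a 2^j-entry table of initial reciprocal guesses for D0 in
    [0.5, 1): entry k holds 1/x at the midpoint of the k-th sub-interval."""
    step = 0.5 / (1 << j)                # width of each sub-interval
    return [1.0 / (0.5 + (k + 0.5) * step) for k in range(1 << j)]

def initial_guess(D0, table):
    """Look up the sub-interval of [0.5, 1) containing D0; the index is
    equivalent to the top j fraction bits after the leading 1."""
    j = len(table).bit_length() - 1      # table size is 2^j
    k = int((D0 - 0.5) * (1 << (j + 1)))   # (D0 - 0.5) / step
    return table[min(k, (1 << j) - 1)]
```

A j = 4 table, for instance, starts every subsequent Newton-Raphson or convergence iteration within a few percent of 1/D0, trimming several iterations.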
Square Root
Will not be discussed this semester.
Using Newton-Raphson iteration is one reasonable approach.
Iteration Implementation
Multi-Cycle
The computation for iterative methods can be distributed in time
e.g. one iteration per clock cycle
can support convergence testing and a variable number of iterations
If the iterations are minimal, one could attempt to use a fast clock in the iterative processor
requires multi-clock domain design to interface the surrounding system
If combinatorial path delay allows, implementing multiple iterations in cascaded combinatorial logic minimizes the number of clock cycles required
Loop Unrolling
Iterative methods with a finite/static number of iterations (or at least a manageable upper bound) can be implemented in combinatorial logic
hardware for each iteration can be tailored to the parameters of a specific iteration
ex: multiplying by z=z×i in iteration i
a multi-cycle iterative processor might use a multiplier to support multiplying by the variable i
an unrolled hardware iteration only needs to support multiplying by a constant
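The z = z×i example can be contrasted in a Python sketch (function names are mine): the rolled loop needs a general multiplier for the variable i, while each unrolled stage multiplies by a constant.

```python
def rolled(z, n):
    """Multi-cycle view: one shared datapath whose multiplier must
    handle the *variable* i each cycle."""
    for i in range(1, n + 1):
        z = z * i              # general-purpose multiplier needed
    return z

def unrolled_4(z):
    """Unrolled view: one tailored stage per iteration, constants only."""
    z = z * 1                  # stage 1: trivial (wire-through)
    z = z * 2                  # stage 2: constant -> shift left by 1
    z = z * 3                  # stage 3: shift-and-add
    z = z * 4                  # stage 4: shift left by 2
    return z
```

Both compute the same result; the unrolled form trades area for the chance to replace each multiplier with shifts and adds.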
Pipelining
Iterative methods lend themselves well to pipelining (esp. for a static number of iterations), allowing throughput matched to the pace of surrounding system
might require more hardware/area (and perhaps power)
pipeline stages may perform one OR MORE iterations
greater combinatorial path delay from cascading the combinatorial logic of two or more iterations
if the clock rate can be maintained, performing more than one iteration in a given stage minimizes the number of stages and hence latency
pipelining (like combinatorial unrolling) allows hardware stages to be tailored to a given iteration (e.g. multiplying by a constant versus a variable)
Lookup Tables for Initial Guess
reduce run-time cycles with precomputed iterations
implement difficult-to-compute initial guess values for globally complex functions, then use methods like Newton-Raphson iteration to refine locally to a precise solution