Lecture 17 – Datapath

Ryan Robucci

Table of Contents

References

Datapath Blocks

Boole’s expansion theorem

Reformulating Computation Blocks guided by Boole’s expansion theorem

1-bit adder

N-bit adder: Ripple Carry Adder

Carry Lookahead

Adder Topologies

Carry-Save Addition

3:2 Compressors Example

4:2 Compressors

Chain of 4:2 compressors

Array and Tree CSA for Many Operands

Bit-Sliced Layout

Serialized Addition (without and with CSA)

Serialized Multiplication: Serial Shift-Add multiplier

Array Multiplier

Karatsuba Algorithm (covered earlier)

Shifters

Bit Slicing

RAM/ROM

Sync/Async Reads

Memory Inference Capabilities

Dual-Port RAM

Modes for how synchronous writes affect reads from the same address in the same cycle

Burst Modes

DMA

DMA Modes

FPGA: Use of Inbuilt Adders

FPGA: Multipliers

Shift Using a Multiplier

References

$^\dagger$ Digital Integrated Circuits,
2nd edition, Jan M. Rabaey, Anantha
Chandrakasan and Borivoje Nikolic (slides captures self-noted)
$^{\dagger\dagger}$ CMOS VLSI Design 4th Edition http://pages.hmc.edu/harris/cmosvlsi/4e/index.html (slides captures self-noted)
$^{\dagger\dagger\dagger}$ Asim J. Al-Khalili Distinguished Emeritus Professor, P. Eng. http://users.encs.concordia.ca/~asim/COEN_6501/6501_Lecture_Notes/ http://users.encs.concordia.ca/~asim/
other sources provided inline

Datapath Blocks

Arithmetic unit
- Adders
- Multipliers
- Shifters
- comparators
- etc
Memory
- RAM
- ROM
- Buffers
- Shift registers

Control
- Finite state machine (PLA, random logic.)
- Counters
Interconnect
- Switches
- Arbiters
- Bus

Boole’s expansion theorem

Technique to balance delays or redistribute logic
May “expand” functions with respect to an input, into two factors including concurrent/parallel computation of two results:
- a result based on the input being true
- a result based on the input being false
if a is a result from a long combinational path, this allows us to prepare two answers and multiplex between them when a is available
if expanding w.r.t. $\rm N$ variables, a $2^{\rm N}$-input mux is required
speed is improved if computing results in parallel
notice that some partial products / results may be in common between the multiplexed terms and could be shared
https://en.wikipedia.org/wiki/Boole's_expansion_theorem
$f(a,b,c,...) = a \land f(1,b,c,...) \lor \lnot a \land f(0,b,c,...)$
$f(a,b,c,...) = a \land f_{a=1}(b,c,...) \lor \lnot a \land f_{a=0}(b,c,...)$
- $f_{a=1}(b,c,...)$ and $f_{a=0}(b,c,...)$ are the positive and negative Shannon cofactors of $f$ with respect to $a$
( $\lor$ denotes OR; $\land$ denotes AND; $\lnot$ denotes NOT; $\oplus$ denotes XOR)
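As a quick check of the expansion, the following minimal Python sketch evaluates an example function both directly and through its two cofactors followed by a 2-input mux. The example function (a 3-input majority) and the names `f`, `shannon_expand`, and `expansion_holds` are illustrative assumptions, not part of the lecture.

```python
from itertools import product

def f(a, b, c):
    """Example function chosen for illustration: majority of three bits."""
    return (a & b) | (a & c) | (b & c)

def shannon_expand(func, a, b, c):
    """Evaluate func through Boole's expansion with respect to a."""
    f_a1 = func(1, b, c)  # positive cofactor, can be prepared before a arrives
    f_a0 = func(0, b, c)  # negative cofactor, prepared in parallel
    return f_a1 if a else f_a0  # 2-input mux selected by the late input a

def expansion_holds(func):
    """Exhaustively compare direct evaluation against the expanded form."""
    return all(func(a, b, c) == shannon_expand(func, a, b, c)
               for a, b, c in product((0, 1), repeat=3))
```

For the majority function the two cofactors reduce to `b | c` and `b & c`, so the only logic left after `a` arrives is the mux.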

Reformulating Computation Blocks guided by Boole’s expansion theorem

when speed is primary concern, focus on critical path(s)

generalized lesson from Boole’s expansion theorem: when it shortens the critical path, compute partial results outside the critical path that would otherwise add computation to it (i.e. ask what work can be done while waiting, and whether that work can also reduce the waiting)

this can involve preparing two answers and waiting on final deciding factor
- may or may not involve additional hardware, with benefit of minimizing the critical path

it may involve just reorganizing operations

Example 1:
c+y+z where propagation delay shows that inputs arrive in the order z,y,c
- if c is the last input operand to be computed, then first compute y+z rather than c+y
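The effect of the ordering in Example 1 can be sketched with a toy timing model. The arrival times, the unit adder delay, and the helper name `finish_time` are assumptions made only for this illustration.

```python
def finish_time(arrival, order, adder_delay=1.0):
    """Completion time of a chain of 2-input adders consuming inputs in the given order."""
    first, *rest = order
    t = arrival[first]
    for name in rest:
        # each adder starts once its later operand is ready, then adds its own delay
        t = max(t, arrival[name]) + adder_delay
    return t

# Assumed arrival times: z first, then y, and c much later
arrival = {"z": 0.0, "y": 1.0, "c": 5.0}

late_input_last = finish_time(arrival, ["y", "z", "c"])   # (y+z)+c
late_input_first = finish_time(arrival, ["c", "y", "z"])  # (c+y)+z
```

With these assumed numbers, `(y+z)+c` finishes one adder delay after `c` arrives, while `(c+y)+z` needs two.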

Example 2:

$y = \begin{cases} a^2 & {\rm if}\ (a\times b \gt c\times d) \\ b^2 & {\rm otherwise} \end{cases}$

Option 1: Decide which input to provide square unit

if $(a\times b \gt c\times d)$ $x=a$ else $x=b$
$y = x^2$

    +------+
a ->| cPNK |
    | MULT |--+
b ->|      |  |    +----+
    +------+  +--->|cGRE|
                   | GT |---+
    +------+  +--->|    |   |
c ->| cPNK |  |    +----+   |
    | MULT |--+             |
d ->|      |                |
    +------+          +-----+
                      |
                      |
                      |
                      |
                      v sel
                   +-------+    +------+
                a->|D1   M | x  | cPNK |
                   |cYEL U |--->| MULT |-->y
                b->|D0   X |    |      |
                   +-------+    +------+


Option 2: Compute both squares and choose at output

${\rm asq}=a^2$
${\rm bsq}=b^2$
if $(a\times b \gt c \times d)$ $y={\rm asq}$ else $y={\rm bsq}$

    +------+
a ->| cPNK |
    | MULT |--+
b ->|      |  |    +----+
    +------+  +--->|cGRE|
                   | GT |---+
    +------+  +--->|    |   |
c ->| cPNK |  |    +----+   |
    | MULT |--+             |
d ->|      |                |
    +------+          +-----+
                      |
                      |
    +------+          |
a ->| cPNK |          |
    | MULT |--+       v sel
a ->|      |  |    +-------+
    +------+  +--->|D1   M |
                   |cYEL U |---> y 
    +------+  +--->|D0   X |
b ->| cPNK |  |    +-------+
    | MULT |--+
b ->|      |  
    +------+
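The two options compute the same value but place different amounts of logic after the comparison. The sketch below checks functional equivalence and compares critical paths under assumed block delays; `T_MULT`, `T_GT`, and `T_MUX` are made-up numbers used only to illustrate the trade-off.

```python
def option1(a, b, c, d):
    """Select the operand first, then square it with a single multiplier."""
    x = a if a * b > c * d else b
    return x * x

def option2(a, b, c, d):
    """Square both candidates in parallel with the comparison, then select."""
    asq, bsq = a * a, b * b
    return asq if a * b > c * d else bsq

# Assumed delays in arbitrary units (not taken from the lecture)
T_MULT, T_GT, T_MUX = 4, 2, 1

# Option 1: multiply -> compare -> mux -> multiply, all in series
delay_option1 = T_MULT + T_GT + T_MUX + T_MULT
# Option 2: the squarers run alongside multiply -> compare; only the mux follows
delay_option2 = max(T_MULT + T_GT, T_MULT) + T_MUX
```

Option 2 spends one extra multiplier to remove a multiplier delay from the critical path.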


1-bit adder

Half adder: 2 inputs (no carry-in), generates Sum and Carry
Full adder: 3 inputs (including carry-in), generates Sum and Carry
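A minimal bit-level sketch of both cells follows. Building the full adder from two half adders and an OR gate is one common construction, assumed here for illustration.

```python
def half_adder(a, b):
    """Return (sum, carry) for two 1-bit inputs."""
    return a ^ b, a & b

def full_adder(a, b, cin):
    """Return (sum, carry) for three 1-bit inputs, built from two half adders."""
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, cin)
    return s, c1 | c2  # at most one of c1, c2 can be 1
```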

N-bit adder: Ripple Carry Adder

For every cell, both outputs depend on the previous carry result
$s_i= a_i \oplus b_i \oplus \red{c_i}$ (odd parity)
$c_{i+1}= a_i b_i \lor a_i \red{c_i} \lor b_i \red{c_i}$
The critical path is through the intermediate carry chain
Delay is proportional to the per-cell carry delay and grows linearly with the number of bits
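The per-cell equations can be exercised with a short bit-serial model. The function name `ripple_carry_add` and the integer-packed operands are assumptions for this sketch; the loop order mirrors the carry rippling from bit 0 upward.

```python
def ripple_carry_add(a, b, n, c0=0):
    """Add two n-bit integers one full-adder cell at a time.

    Returns (n-bit sum, carry out).
    """
    carry = c0
    total = 0
    for i in range(n):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        s_i = ai ^ bi ^ carry                            # sum of cell i
        carry = (ai & bi) | (ai & carry) | (bi & carry)  # carry into cell i+1
        total |= s_i << i
    return total, carry
```

Each iteration needs the carry from the previous one, which is why the delay grows linearly with `n`.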

Carry Lookahead

Want to allow some computation before the chain of previous cells complete
In long adders, want to break the carry chain, which is the critical path
In a carry-lookahead logic block, each carry output depends on the block carry input and is independent of the previous stage's carry output
We do this by attempting to compute all carry results in parallel as much as possible
- Prepare whatever computation can be done without waiting and have a minimal-delay path waiting on the required result
$c_{i+1}= x_i y_i \lor x_i c_i \lor y_i c_i = x_i y_i \lor (x_i \oplus y_i) c_i$
- A (high) Carry output is either generated locally ( $x_i y_i$ ), or propagated if $(x_i \oplus y_i)$

In parallel at each bit position, compute
- $G_i = x_i y_i$
- $P_i = (x_i \oplus y_i)$
Then
- $c_{i+1} = x_i y_i \lor (x_i \oplus y_i) c_i = G_i \lor P_i c_i$
- $s_i = P_i \oplus c_i$
The chain part of the critical path is replaced by two-level AND-OR logic

$\begin{array}{ccrrrrrr} c_0 = & & & & & c_0\\ c_1 = & & & & G_0 \lor & P_0c_0\\ c_2 = & & &G_1 \lor &P_1 G_0 \lor &P_1 P_0 c_0\\ c_3 = & &G_2 \lor &P_2 G_1 \lor &P_2 P_1 G_0 \lor &P_2 P_1 P_0 c_0\\ \red{c_4} = & G_3 \lor &P_3 G_2 \lor &P_3 P_2 G_1 \lor &P_3 P_2 P_1 G_0 \lor &P_3 P_2 P_1 P_0 c_0\\ \end{array}$

Final parallel layer to produce sum $s_i = P_i \oplus c_i$
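The flattened carry expressions can be checked with a small model that computes every carry directly from the $G_i$, $P_i$, and $c_0$ terms, without reusing any earlier carry. The function name `cla_add` and the default 4-bit block width are assumptions for this sketch.

```python
def cla_add(x, y, n=4, c0=0):
    """Carry-lookahead addition of two n-bit integers.

    Each carry is a sum of products of G, P, and c0 only,
    so no carry depends on another carry.
    Returns (n-bit sum, carry out).
    """
    G = [((x >> i) & (y >> i)) & 1 for i in range(n)]  # generate
    P = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]  # propagate

    carries = [c0]
    for i in range(n):
        # term P_i ... P_0 c_0
        c = c0
        for j in range(i + 1):
            c &= P[j]
        # terms P_i ... P_{k+1} G_k for k = 0..i
        for k in range(i + 1):
            t = G[k]
            for j in range(k + 1, i + 1):
                t &= P[j]
            c |= t
        carries.append(c)  # this is c_{i+1}

    s = sum((P[i] ^ carries[i]) << i for i in range(n))  # final sum layer
    return s, carries[n]
```

In hardware all the product terms are evaluated in parallel; the nested loops here only enumerate them.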

Enrichment

VLSI Carry Logic: Manchester Carry Chains

Carry-ripple combined with carry-lookahead adders and more advanced schemes, including hierarchical carry-lookahead and reverse lookahead, are possible
Specialized fast and compact carry chain logic called Manchester Carry Chains can be used in VLSI implementations

http://bwrcs.eecs.berkeley.edu/Classes/icdesign/ee141_s03/Lectures/Lecture18-AddMult.pdf

Trees for Adders

Alternatively, P and G may be calculated in a tree:

http://www.aoki.ecei.tohoku.ac.jp/arith/mg/algorithm.html#fsa_pfx

The logarithmic lookahead can be done as follows to compute the generate, propagate, and carry signals for an N-bit adder:

http://pages.hmc.edu/harris/class/ha1/lect12.pdf
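One way to picture the logarithmic lookahead is a parallel-prefix combination of (G, P) pairs. The sketch below uses a Kogge-Stone-style arrangement, which is an assumed choice for illustration and not necessarily the tree shown in the referenced slides; the name `prefix_adder` is hypothetical and the carry-in is fixed at 0.

```python
def prefix_adder(x, y, n):
    """Parallel-prefix addition of two n-bit integers with carry-in 0.

    Returns (n-bit sum, carry out, number of prefix levels used).
    """
    g = [((x >> i) & (y >> i)) & 1 for i in range(n)]  # bit generate
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]  # bit propagate

    G, P = g[:], p[:]
    dist, levels = 1, 0
    while dist < n:
        new_G, new_P = G[:], P[:]
        for i in range(dist, n):
            # combine the span ending at i with the span ending at i - dist
            new_G[i] = G[i] | (P[i] & G[i - dist])
            new_P[i] = P[i] & P[i - dist]
        G, P = new_G, new_P
        dist *= 2
        levels += 1

    carries = [0] + G  # G[i] now covers bits 0..i, so it equals c_{i+1}
    s = sum((p[i] ^ carries[i]) << i for i in range(n))
    return s, carries[n], levels
```

The number of combining levels grows as $\log_2 N$ rather than $N$, at the cost of more wiring and more combining cells.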

Adder Topologies

Ripple-carry adder
Carry-lookahead adder
Carry-save adders

Other Adder Topologies
Reference: https://en.wikipedia.org/wiki/Adder_(electronics)

Carry-select adder (can be fastest for some word-sizes and technologies, power&size can be large)
Carry-skip adder (smaller than carry select)
Brent–Kung adder
Kogge–Stone adder
Ling adder

Enrichment

Carry Select Adders

Another concept is to pre-compute two results at each bit slice based on a carry, and only wait for the carry.
This is employed by conditional sum adders and carry-select adders. Like Boole’s Expansion Theorem
note the increased size of partial adders on the left that balance critical paths
- for large wordlength adders the fanout of the carry output to the next stage becomes limiting
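The carry-select idea can be sketched by computing each block's result for both possible carry-ins and letting the incoming carry drive only a mux. The block sizes, the function name `carry_select_add`, and the use of Python's `+` inside each block are assumptions for this illustration.

```python
def carry_select_add(a, b, block_sizes=(2, 3, 4), c0=0):
    """Carry-select addition; block_sizes lists block widths from LSB to MSB.

    Returns (sum over all blocks, carry out).
    """
    carry, result, shift = c0, 0, 0
    for width in block_sizes:
        mask = (1 << width) - 1
        a_blk = (a >> shift) & mask
        b_blk = (b >> shift) & mask
        sum_if_0 = a_blk + b_blk      # prepared assuming carry-in = 0
        sum_if_1 = a_blk + b_blk + 1  # prepared assuming carry-in = 1
        chosen = sum_if_1 if carry else sum_if_0  # mux driven by the real carry
        result |= (chosen & mask) << shift
        carry = chosen >> width
        shift += width
    return result, carry
```

The default widths grow toward the most significant end, matching the note about larger partial adders balancing the critical paths.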

Original Figure: Quanticles Wikipedia
https://en.wikipedia.org/wiki/File:Carry-select-adder-variable-size.png

Carry Skip Adders

http://www.aoki.ecei.tohoku.ac.jp/arith/mg/algorithm.html#fsa_cska

Though not quite as fast as look-ahead, uses less area/power
Fixed-block-size carry-skip adder:

http://www.aoki.ecei.tohoku.ac.jp/arith/mg/algorithm.html#fsa_cska
Variable-block-size carry-skip adder:

http://www.aoki.ecei.tohoku.ac.jp/arith/mg/algorithm.html#fsa_cska

Carry-Save Addition

Suitable for multi-operand addition (more than 2)
Computes many additions with partial results in expanded/redundant form
Requires final (fast) adder-of-choice to compute the end result
CSAs are also referred to as 3:2 compressors
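A word-level sketch of a 3:2 compressor follows: three operands go in, and a sum word and a carry word come out with no carry propagation between bit positions. The function name `csa` and the use of unbounded Python integers are assumptions for this illustration.

```python
def csa(a, b, c):
    """3:2 compressor on whole words: returns (sum_word, carry_word).

    Each bit position is an independent full adder; the carry word
    is shifted left by one because it has twice the weight.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word

def add3(a, b, c):
    """Three-operand addition: one CSA followed by the final adder."""
    s, cy = csa(a, b, c)
    return s + cy  # the final (fast) adder-of-choice
```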

3:2 Compressors Example

CSA of various bit-widths

4:2 Compressors

CSA with more operands, each n-bit
The structure on the left can add 4 operands, but
The structure on the right is called a 4:2 compressor
Result requires final adder to combine partial results, x+y

Chain of 4:2 compressors

Excerpt from: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/manual/stx_cookbook.pdf
note critical path is two stages
6:3 compressor noted is described in document…it “fits” better with the programmable logic units

Array and Tree CSA for Many Operands

CSA can be organized in a linear array or as trees
Note how CSAs allow for somewhat regular and distributable layouts / floorplans
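For many operands, the reduction can be sketched as repeated layers of 3:2 compressors until only two words remain. The grouping strategy below (take operands three at a time, pass leftovers to the next layer) is one simple assumed arrangement; the names `csa` and `csa_tree_sum` are hypothetical.

```python
def csa(a, b, c):
    """3:2 compressor on whole words: returns (sum_word, carry_word)."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def csa_tree_sum(operands):
    """Reduce a list of non-negative integers with layers of CSAs.

    Returns (total, number of CSA layers used).
    """
    ops = list(operands)
    layers = 0
    while len(ops) > 2:
        groups = len(ops) // 3
        nxt = []
        for g in range(groups):
            nxt.extend(csa(ops[3 * g], ops[3 * g + 1], ops[3 * g + 2]))
        nxt.extend(ops[3 * groups:])  # leftovers pass through to the next layer
        ops = nxt
        layers += 1
    return sum(ops), layers  # sum() stands in for the single final adder
```

Nine operands reduce to two words in four layers with this grouping; a linear array would instead need one CSA stage per added operand.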

Bit-Sliced Layout

Serialized Addition (without and with CSA)

The Array of CSA may be serialized to save on space
Traditional adder shown on the left
On the Right a serialized CSA is shown
- Serialized sum computed with CSA, maintained in “redundant form”
- Additional Fast Adder required to assemble final result

Serialized Multiplication: Serial Shift-Add multiplier

$Z=X \times Y$ implemented with a single adder
Shifted versions of X conditionally added over time based on bits in Y
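A minimal software model of the serial shift-add scheme is shown below; each loop iteration stands for one clock cycle using the single adder. The function name and the register-free formulation are assumptions for this sketch.

```python
def shift_add_multiply(x, y, n):
    """Multiply x by an n-bit multiplier y, one multiplier bit per step."""
    z = 0
    for _ in range(n):
        if y & 1:   # current LSB of Y decides whether X is added
            z += x  # the single adder, reused every cycle
        x <<= 1     # next shifted version of X
        y >>= 1     # consume one bit of Y
    return z
```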

the version on the right is optimized, covered as time allows
note that in the original version, at each timestep one bit of each of the X and Y working registers is discarded while the number of valid bits required in the working register Z increases. Furthermore, only some N bits of the result change at a time. This suggests the ability to combine registers: http://users.utcluj.ro/~baruch/book_ssce/SSCE-Shift-Mult.pdf#page=3

Array Multiplier

2D matrix of adders to multiply N-bit and M-bit operands
size proportional to $N \times M$
delay proportional to $N+M$

Image $^\dagger$

Other Multiplier Architectures

Slide source See References $^{\dagger\dagger\dagger}$

Slide $\dagger$ :

Karatsuba Algorithm (covered earlier)

Recall that the Karatsuba Algorithm can be used to extend usage of hardware multipliers to larger multiplication wordlengths
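As a reminder of the idea, one level of Karatsuba builds a 2n-bit product from three narrower multiplications instead of four. In the sketch below, `mult` stands in for the available hardware multiplier; the names `karatsuba_step` and `mult` are assumptions for illustration.

```python
def karatsuba_step(x, y, n, mult):
    """Multiply two 2n-bit unsigned integers using three calls to mult.

    mult(a, b) models a narrower hardware multiplier. Note that the
    middle product takes (n+1)-bit operands because of the pre-additions.
    """
    mask = (1 << n) - 1
    x1, x0 = x >> n, x & mask
    y1, y0 = y >> n, y & mask
    z2 = mult(x1, y1)                       # high halves
    z0 = mult(x0, y0)                       # low halves
    z1 = mult(x1 + x0, y1 + y0) - z2 - z0   # cross terms from one multiply
    return (z2 << (2 * n)) + (z1 << n) + z0
```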

Shifters

If variable shift and/or variable rotate is required, compact structures like a Barrel Shifter may be used in a compute architecture.
VLSI implementations have an advantage over FPGA since compact pass-gate structures can be built

Figure created by Cmglee (wikipedia) and provided under CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/) (https://upload.wikimedia.org/wikipedia/commons/2/2b/Crossbar_barrel_shifter.svg)

Variations and details are provided in the reference below and covered as time allows; the cascade of coarse and fine shifts is notable because it motivates the bit-sliced layouts common in ALU structures

Note the use of bit sliced “lanes” which work well with other bit-sliced layouts
- in-lane routes in black
- cross-lane routes are colored
- the cascade of shifters minimizes the cross-lane routing (shift of 8 achieved)

Key points
- Cascade of Coarse and Fine Shift
- Barrel Shifter for VLSI
Various Shift Types can be implemented by changing (dynamically or statically) connections to core shifter
- Logical Shift
- Arithmetic Shift
- Rotate
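The cascade and the three shift types can be sketched together: each stage shifts by a fixed power of two when its select bit is set, and only the fill bits differ between logical, arithmetic, and rotate. The function name `shift_right`, the right-shift direction, and the 8-bit default width are assumptions for this illustration.

```python
def shift_right(value, amount, width=8, mode="logical"):
    """Right shift by a cascade of power-of-two stages (1, 2, 4, ...).

    mode is "logical", "arithmetic", or "rotate"; only the bits fed
    into the vacated top positions change between modes.
    """
    mask = (1 << width) - 1
    value &= mask
    sign = value >> (width - 1)
    stage = 1
    while stage < width:
        if amount & stage:  # one bit of the shift amount selects this stage
            kept = value >> stage
            if mode == "rotate":
                fill = (value << (width - stage)) & mask  # wrap low bits to the top
            elif mode == "arithmetic":
                fill = (mask << (width - stage)) & mask if sign else 0
            else:
                fill = 0
            value = kept | fill
        stage <<= 1
    return value
```

For example, with the 8-bit value `0b10010110` and a shift of 3, the logical result is `0b00010010`, the arithmetic result is `0b11110010`, and the rotate result is `0b11010010`.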

Enrichment Shifters VLSI

Bit Slicing

A Bit slice is the processing/routing “lane” with all the bit-processing elements, regularizing layout and allowing for compact routing
VLSI: bit pitch is determined by the widest element
- other smaller layouts are intentionally spaced so that they match the bit pitch
FPGA: regular (uniform) design can result in easy regular routing
many datapath elements, including processing, multiplexers, and storage, can be organized this way

RAM/ROM

In addition to registers, there are three primary choices for implementation of large memories:

Distributed Register Memory Slice/Logic Blocks have a number of inbuilt slice registers (slice is a Xilinx term)
- fast
- allows collocating memory and computation
- can reduce routing
- can serve as a local buffer for block RAM and external memory
- DISADVANTAGE: limited number available
Distributed RAM
- LUT (normally used for logic) or any other memory within a configurable cell used as a distributed RAM
  - LUTRAM: LUT used as RAM
- fast
- allows collocating memory and computation
- can reduce routing
- can serve as a local buffer for block RAM and external memory
- DISADVANTAGE: consumes resources otherwise used for logic implementation
Block RAM
- High-Density Dedicated RAM
- less flexible
- limited access, e.g. a dual read port allows reading only two values at a time
- may require large routing and/or repeatedly copying contents to/from distributed RAM for many types of computations
- may have registered and non-registered (e.g. for large combinatorial LUT) options
External RAM
- large capacity SDRAM (synchronous DRAM) can be used off-chip
- large memory applications require this
- route and multiplex, cache into local memory
- usually a synthesized or HARD memory controller for interfacing with memory is available on an FPGA platform

Depiction of a small region of an FPGA with Block RAM and Slices with LUTs:

Example Report Showing LUT used as Memory, and Block Ram utilization:

...

1. Slice Logic
--------------

+----------------------------+-------+-------+------------+-----------+-------+
|          Site Type         |  Used | Fixed | Prohibited | Available | Util% |
+----------------------------+-------+-------+------------+-----------+-------+
| Slice LUTs*                | 18777 |     0 |          0 |     20800 | 90.27 |
|   LUT as Logic             | 18629 |     0 |          0 |     20800 | 89.56 |
|   LUT as Memory            |   148 |     0 |          0 |      9600 |  1.54 |
|     LUT as Distributed RAM |   148 |     0 |            |           |       |
|     LUT as Shift Register  |     0 |     0 |            |           |       |
| Slice Registers            | 17050 |     0 |          0 |     41600 | 40.99 |
|   Register as Flip Flop    | 16916 |     0 |          0 |     41600 | 40.66 |
|   Register as Latch        |   134 |     0 |          0 |     41600 |  0.32 |
| F7 Muxes                   |   671 |     0 |          0 |     16300 |  4.12 |
| F8 Muxes                   |    30 |     0 |          0 |      8150 |  0.37 |
+----------------------------+-------+-------+------------+-----------+-------+

...
 
2. Memory
---------
+-------------------+------+-------+------------+-----------+-------+
|     Site Type     | Used | Fixed | Prohibited | Available | Util% |
+-------------------+------+-------+------------+-----------+-------+
| Block RAM Tile    | 36.5 |     0 |          0 |        50 | 73.00 |
|   RAMB36/FIFO*    |   36 |     0 |          0 |        50 | 72.00 |
|     RAMB36E1 only |   36 |       |            |           |       |
|   RAMB18          |    1 |     0 |          0 |       100 |  1.00 |
|     RAMB18E1 only |    1 |       |            |           |       |
+-------------------+------+-------+------------+-----------+-------+

...

Block Ram:

Global view highlighting used cells:

Enlarged view of the same, with individual BMEM cells outlined next to the areas of programmable fabric:

Sync/Async Reads

on Xilinx, distributed RAM reads are asynchronous
- although, either
  output registers
  or
  input address/data/control registers
  can be added to a design to achieve synchronous behavior
Distributed RAM versus Dedicated Block RAM:

Action   Distributed RAM   Dedicated Block RAM
Write    Synchronous       Synchronous
Read     Asynchronous      Synchronous

this means that LUTRAM can be used like combinatorial logic
- can be cascaded with other combinational logic and other LUTRAMs to perform operations in one cycle
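The difference in read timing can be sketched with two toy behavioral models. The class names `LutRam` and `BlockRam`, the `clock` method standing for one rising clock edge, and the read-first behavior of the block RAM model are assumptions for this illustration, not vendor primitives.

```python
class LutRam:
    """Distributed RAM model: synchronous write, asynchronous read."""
    def __init__(self, depth):
        self.mem = [0] * depth

    def read(self, addr):
        return self.mem[addr]  # combinational: valid without a clock edge

    def clock(self, addr, din, we):
        if we:
            self.mem[addr] = din


class BlockRam:
    """Block RAM model: synchronous write and synchronous (registered) read."""
    def __init__(self, depth):
        self.mem = [0] * depth
        self.dout = 0

    def clock(self, addr, din=0, we=False):
        self.dout = self.mem[addr]  # output updates only on the clock edge
        if we:
            self.mem[addr] = din


def lut_chain(first, second, addr):
    """Two LUTRAM lookups cascaded combinationally within one cycle."""
    return second.read(first.read(addr))
```

With the block RAM model, the same two-step lookup would take two clock edges because each read result appears only after `clock` is called.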

Memory Inference Capabilities

(xilinx:)
Memory inference capabilities include the following:
https://docs.xilinx.com/r/en-US/ug901-vivado-synthesis/Choosing-Between-Distributed-RAM-and-Dedicated-Block-RAM

Support for any size and data width. Vivado synthesis maps the memory description to one or several RAM primitives
Single-port, simple-dual port, true dual port
Up to two write ports
Multiple read ports

Provided that only one write port is described, Vivado synthesis can identify RAM descriptions with two or more read ports that access the RAM contents at addresses different from the write address.

Write enable
RAM enable (block RAM)
Data output reset (block RAM)
Optional output register (block RAM)
Byte write enable (block RAM)
- ability to mask writes, updating only some bytes of a register
Each RAM port can be controlled by its distinct clock, port enable, write enable, and data output reset
Initial contents specification
Vivado synthesis can use parity bits as regular data bits to accommodate the described data widths

Dual-Port RAM

(https://docs.xilinx.com/r/en-US/ug958-vivado-sysgen-ref/Dual-Port-RAM)

Modes for how synchronous writes affect reads from the same address in the same cycle

(based on Xilinx documentation)

Read-first (old data read)
- when a read and a write occur at the same address, old content is read before new content is loaded
Write-first (new data read)
- data written is immediately available in the same cycle for read
- also known as read-through
No-change
- active data write prevents read data output updates
- must be followed by explicit read operation in a following cycle to see the result
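The three modes can be compared with a small behavioral model of a single synchronous port. The class name `SyncRamPort`, the mode strings, and the helper `demo` are assumptions for this sketch and do not correspond to any vendor API.

```python
class SyncRamPort:
    """One synchronous RAM port with a selectable write mode."""
    def __init__(self, depth, mode="read_first"):
        self.mem = [0] * depth
        self.mode = mode
        self.dout = 0

    def clock(self, addr, din=0, we=False):
        """Model one clock edge; returns the registered read output."""
        old = self.mem[addr]
        if we:
            self.mem[addr] = din
            if self.mode == "read_first":
                self.dout = old   # old content appears on the output
            elif self.mode == "write_first":
                self.dout = din   # newly written data appears on the output
            # "no_change": output register keeps its previous value
        else:
            self.dout = old
        return self.dout


def demo(mode):
    """Write 7, read another address, overwrite with 9, then read back."""
    ram = SyncRamPort(8, mode)
    ram.clock(3, 7, we=True)
    ram.clock(0)                              # output now holds mem[0] = 0
    during_write = ram.clock(3, 9, we=True)   # read and write hit address 3
    after_write = ram.clock(3)                # explicit read in a later cycle
    return during_write, after_write
```

In this model `demo("read_first")` gives `(7, 9)`, `demo("write_first")` gives `(9, 9)`, and `demo("no_change")` gives `(0, 9)`.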

Burst Modes

typically the more synchronization/handshaking required the slower the system
avoid unnecessary repeated handshaking/address timing delay by requesting a block of data to transfer
typically a starting address is specified, followed by a
- fixed block transfer of data or
- a variable block-size transfer, which requires a control line (e.g. can hold write_en) or pre-communicated size
avoids idle bus for long periods stalling other processes/devices, particularly the processor itself if DMA is used

DMA

Shared buses suffer from congestion that blocks progress
Furthermore, transferring data between modules on the bus burdens the processor
- Common to include Direct Memory Access to support direct memory transfer

Processor requests a direct transfer between devices on bus
DMA manages the transfer, freeing the processor
Direct transfer can be faster
Useful for transfers of large blocks of memory

DMA Modes

Burst / Blocking Modes
- processor requests block transfer, DMA holds bus until full transfer is complete
- can block software processes on the processor
Cycle Stealing
- some efficiency of burst, minimized handshaking (e.g. one starting address)
- after every transaction, DMA yields to processor, perhaps for one cycle, to allow it to resume bus control
- interleaves processor and DMA use of bus
- slightly longer full transfer
Transparent
- DMA transfer is low priority, burst-like but
- lets processor have bus whenever requested for as long as needed

FPGA: Use of Inbuilt Adders

FPGA: Multipliers

Can be implemented with

LUTs (like typical synthesis, more configurable)

Large Memory Blocks (block RAM) act like large multiplier lookup tables, though locations limited

  address {a,b}             
    |            
    v            
  +---+--------+ 
  | d |        |  
  | e | stored |          
  | c | values |  
  | o |        | 
  | d |        | 
  | e |        | 
  | r |        | 
  |   |        | 
  +---+--------+ 
          |       
          v
        result
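The lookup-table approach in the diagram can be sketched as a precomputed table addressed by the concatenation {a,b}. The function names and the list-based ROM are assumptions for this illustration.

```python
def build_multiplier_rom(n):
    """Precompute all products of two n-bit operands.

    The address is the concatenation {a, b}, so the table has
    2**(2*n) locations, each holding a 2n-bit product.
    """
    return [a * b for a in range(1 << n) for b in range(1 << n)]

def rom_multiply(rom, a, b, n):
    """Multiply by a single memory read at address {a, b}."""
    return rom[(a << n) | b]
```

The table size grows as $2^{2n}$, which is why this approach is limited to small operand widths.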


Dedicated DSP blocks (efficient but limited resource at fixed locations which can cause routing complexity and bottlenecks)
combination of the above

AppNote:Implementing Multipliers in FPGA Devices, Altera

Shift Using a Multiplier

A practical option for shifts in FPGAs is to implement shift using a multiplier:
For a shift n, multiply A by a power of two: $A \times 2^n$
While this may seem counter-intuitive for ASICs, FPGAs have in-built multipliers which are there whether you use them or not
A variable shift is implemented by applying the 1-hot encoding of the shift amount to one of the multiplier operands
https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/manual/stx_cookbook.pdf pg. 2-11
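The shift-by-multiply idea can be sketched in a few lines: the shift amount is decoded to a 1-hot value, which is $2^n$, and fed to the multiplier. The function names and the 8-bit default width are assumptions for this illustration.

```python
def one_hot(n, width):
    """Decode a shift amount into a 1-hot operand (the value 2**n)."""
    if not 0 <= n < width:
        raise ValueError("shift amount out of range")
    return 1 << n

def shift_left_via_multiply(a, n, width=8):
    """Left shift of a width-bit value implemented as a multiplication."""
    mask = (1 << width) - 1
    product = (a & mask) * one_hot(n, width)  # the multiplier does the shift
    return product & mask                     # keep the low width bits
```

The upper half of the full product holds the bits shifted out, which can be reused to build rotates or right shifts.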