Lecture 17 – Datapath

Ryan Robucci

Table of Contents

Lecture 17 – Datapath

Carry-Save Addition

RAM/ROM

DMA

References

Datapath Blocks

  • Arithmetic unit
    • Adders
    • Multipliers
    • Shifters
    • Comparators
    • etc.
  • Memory
    • RAM
    • ROM
    • Buffers
    • Shift registers
  • Control
    • Finite state machine (PLA, random logic.)
    • Counters
  • Interconnect
    • Switches
    • Arbiters
    • Bus

Boole’s expansion theorem

Reformulating Computation Blocks guided by Boole’s expansion theorem

1-bit adder

N-bit adder: Ripple Carry Adder

Carry Lookahead

$$
\begin{array}{lrrrrr}
c_0 = & & & & & c_0\\
c_1 = & & & & G_0 \lor & P_0 c_0\\
c_2 = & & & G_1 \lor & P_1 G_0 \lor & P_1 P_0 c_0\\
c_3 = & & G_2 \lor & P_2 G_1 \lor & P_2 P_1 G_0 \lor & P_2 P_1 P_0 c_0\\
c_4 = & G_3 \lor & P_3 G_2 \lor & P_3 P_2 G_1 \lor & P_3 P_2 P_1 G_0 \lor & P_3 P_2 P_1 P_0 c_0\\
\end{array}
$$
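
As a quick check on the lookahead equations, the following Python sketch evaluates each carry directly from the expanded generate/propagate terms and uses the carries to form the sum. It assumes $G_i = a_i b_i$ and $P_i = a_i \oplus b_i$; the function names (`cla_carries`, `cla_add`, `to_bits`) and the LSB-first bit-list representation are illustrative choices, not part of the lecture.

```python
def to_bits(x, n):
    """Return the n low-order bits of x, least significant bit first."""
    return [(x >> k) & 1 for k in range(n)]


def cla_carries(a_bits, b_bits, c0=0):
    """Return [c_0, ..., c_N] computed from the expanded lookahead terms."""
    n = len(a_bits)
    g = [a & b for a, b in zip(a_bits, b_bits)]  # generate: G_i = a_i AND b_i
    p = [a ^ b for a, b in zip(a_bits, b_bits)]  # propagate: P_i = a_i XOR b_i
    carries = [c0]
    for i in range(1, n + 1):
        # c_i = G_{i-1} | P_{i-1} G_{i-2} | ... | P_{i-1} ... P_0 c_0
        c = 0
        prefix = 1  # running product P_{i-1} ... P_{j+1}
        for j in range(i - 1, -1, -1):
            c |= prefix & g[j]
            prefix &= p[j]
        c |= prefix & c0
        carries.append(c)
    return carries


def cla_add(a, b, n=4, c0=0):
    """Add two n-bit values; returns (n-bit sum, carry out)."""
    a_bits, b_bits = to_bits(a, n), to_bits(b, n)
    c = cla_carries(a_bits, b_bits, c0)
    s_bits = [a_bits[i] ^ b_bits[i] ^ c[i] for i in range(n)]
    return sum(bit << i for i, bit in enumerate(s_bits)), c[n]
```

In hardware every $c_i$ is a two-level AND-OR function of the inputs, so the inner loop here stands in for one wide gate rather than a sequential chain.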

Enrichment

VLSI Carry Logic: Manchester Carry Chains

Trees for Adders

Alternatively, P and G may be calculated in a tree:

http://www.aoki.ecei.tohoku.ac.jp/arith/mg/algorithm.html#fsa_pfx

The logarithmic lookahead can be done as follows to compute the generate, propagate, and carry signals for an N-bit adder:

http://pages.hmc.edu/harris/class/ha1/lect12.pdf
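
The tree computation of group generate and propagate signals can be sketched in Python as below. This is one possible prefix arrangement (Kogge-Stone style, where every bit position combines with the position `d` places below it and `d` doubles each stage); the slides may show a different tree. The names `prefix_carries` and `prefix_add` are hypothetical.

```python
def prefix_carries(a, b, n, c0=0):
    """Compute carries c_0..c_n with a log2(n)-stage parallel-prefix network."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]  # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # bit propagate
    G, P = g[:], p[:]  # group signals, initially spanning a single bit
    d = 1
    while d < n:
        # All positions update "in parallel" from the previous stage's values.
        G_next, P_next = G[:], P[:]
        for i in range(d, n):
            G_next[i] = G[i] | (P[i] & G[i - d])
            P_next[i] = P[i] & P[i - d]
        G, P = G_next, P_next
        d *= 2
    # After the last stage, G[i] and P[i] span bits i down to 0.
    carries = [c0] + [G[i] | (P[i] & c0) for i in range(n)]
    return carries, p


def prefix_add(a, b, n, c0=0):
    """Add two n-bit values; returns (n-bit sum, carry out)."""
    carries, p = prefix_carries(a, b, n, c0)
    s = sum((p[i] ^ carries[i]) << i for i in range(n))
    return s, carries[n]
```

The number of stages grows as $\lceil \log_2 N \rceil$, which is the source of the logarithmic delay, at the cost of more wiring than a ripple-carry chain.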

Adder Topologies

Other Adder Topologies
Reference: https://en.wikipedia.org/wiki/Adder_(electronics)

Enrichment

Carry Select Adders

Original figure: Quanticles, Wikipedia
https://en.wikipedia.org/wiki/File:Carry-select-adder-variable-size.png

related reference: http://www.aoki.ecei.tohoku.ac.jp/arith/mg/algorithm.html

Carry Skip Adders

http://www.aoki.ecei.tohoku.ac.jp/arith/mg/algorithm.html#fsa_cska

Carry-Save Addition

3:2 Compressors Example
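
A minimal Python sketch of a 3:2 compressor, assuming unsigned operands and word-wide bitwise operations standing in for a row of independent full adders. The function names `compress_3_2` and `csa_sum` are illustrative, and the reduction order in `csa_sum` is just one simple choice, not a specific array or tree from the lecture.

```python
def compress_3_2(x, y, z):
    """Reduce three operands to a sum word and a carry word with no carry chain."""
    s = x ^ y ^ z                               # per-bit sum of each full adder
    c = ((x & y) | (x & z) | (y & z)) << 1      # per-bit carry, weighted by 2
    return s, c


def csa_sum(operands):
    """Sum non-negative integers using 3:2 compressors, then one ordinary add."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = compress_3_2(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, c]
    return sum(ops)  # the single carry-propagate addition at the end
```

The invariant is that `x + y + z == s + c` at every step, so carry propagation is deferred until only two words remain.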

4:2 Compressors

Chain of 4:2 compressors

Array and Tree CSA for Many Operands

Bit-Sliced Layout

Serialized Addition (without and with CSA)

Serialized Multiplication: Serial Shift-Add multiplier



Array Multiplier

  • 2D matrix of adders to multiply N-bit and M-bit operands
  • size proportional to $N \times M$
  • delay proportional to $N+M$

Image source: see References $^{\dagger}$
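
The array structure can be mimicked in Python by forming one row of AND terms per multiplier bit and accumulating the shifted rows. This is only a behavioral sketch under the assumption of unsigned operands; `array_multiply` is a hypothetical name, and the `cells` counter is included just to show the $N \times M$ size scaling.

```python
def array_multiply(a, b, n, m):
    """Multiply n-bit a by m-bit b; returns (product, number of array cells)."""
    assert 0 <= a < (1 << n) and 0 <= b < (1 << m)
    acc = 0
    cells = 0
    for j in range(m):                  # one row per multiplier bit b_j
        b_j = (b >> j) & 1
        row = 0
        for i in range(n):              # one AND/adder cell per multiplicand bit a_i
            a_i = (a >> i) & 1
            row |= (a_i & b_j) << (i + j)
            cells += 1
        acc += row                      # a row of adders in the hardware array
    return acc, cells
```

The sketch does not model timing; the $N+M$ delay comes from the longest path through the hardware array, which this code does not represent.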

Other Multiplier Architectures

Slide source: see References $^{\dagger\dagger\dagger}$

Slide source: see References $^{\dagger}$

Karatsuba Algorithm (covered earlier)

Shifters

Figure created by Cmglee (Wikipedia) and provided under CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/) (https://upload.wikimedia.org/wikipedia/commons/2/2b/Crossbar_barrel_shifter.svg)

  • Note the use of bit-sliced “lanes”, which work well with other bit-sliced layouts
    • in-lane routes are black
    • cross-lane routes are colored
    • the cascade of shifters minimizes the cross-lane routing (shift of 8 achieved)
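
A cascade of shifters can be modeled as a series of stages, each of which either passes the word through or shifts it by a fixed power of two, selected by one bit of the shift amount. The Python sketch below assumes a rotate-left operation and a power-of-two word width; the name `barrel_rotate_left` and the 16-bit default are illustrative assumptions.

```python
def barrel_rotate_left(x, amount, width=16):
    """Rotate x left by amount using log2(width) fixed-distance stages."""
    mask = (1 << width) - 1
    x &= mask
    stage_shift = 1
    while stage_shift < width:
        if amount & stage_shift:        # this stage's select bit
            x = ((x << stage_shift) | (x >> (width - stage_shift))) & mask
        stage_shift <<= 1               # stages rotate by 1, 2, 4, 8, ...
    return x
```

Each stage only needs routes that span its own fixed distance, which is why a cascade needs far fewer cross-lane wires than a full crossbar.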

Enrichment Shifters VLSI

Bit Slicing



RAM/ROM

Besides the slice registers of item 1, there are three primary choices for implementing large memories (items 2 through 4):

  1. Distributed Register Memory

    • slice/logic blocks have a number of built-in slice registers ("slice" is a Xilinx term)
    • fast
    • allows collocating memory and computation
    • can reduce routing
    • can serve as a local buffer for block RAM and external memory
    • DISADVANTAGE: limited number available
  2. Distributed RAM

    • LUT (normally used for logic) or any other memory within a configurable cell used as a distributed RAM
      • LUTRAM: LUT used as RAM
    • fast
    • allows collocating memory and computation
    • can reduce routing
    • can serve as a local buffer for block RAM and external memory
    • DISADVANTAGE: consumes resources otherwise used for logic implementation
  3. Block RAM

    • High-Density Dedicated RAM
    • less flexible
    • limited access, e.g. a dual read port allows reading only two values at a time
    • may require extensive routing and/or repeatedly copying contents to/from distributed RAM for many types of computations
    • may have registered and non-registered (e.g. for large combinatorial LUT) options
  4. External RAM

    • large capacity SDRAM (synchronous DRAM) can be used off-chip
    • large memory applications require this
    • route and multiplex, cache into local memory
    • usually a synthesized or HARD memory controller for interfacing with memory is available on an FPGA platform

Depiction of a small region of an FPGA with Block RAM and Slices with LUTs:

Example report showing LUTs used as memory and Block RAM utilization:

...

1. Slice Logic
--------------

+----------------------------+-------+-------+------------+-----------+-------+
|          Site Type         |  Used | Fixed | Prohibited | Available | Util% |
+----------------------------+-------+-------+------------+-----------+-------+
| Slice LUTs*                | 18777 |     0 |          0 |     20800 | 90.27 |
|   LUT as Logic             | 18629 |     0 |          0 |     20800 | 89.56 |
|   LUT as Memory            |   148 |     0 |          0 |      9600 |  1.54 |
|     LUT as Distributed RAM |   148 |     0 |            |           |       |
|     LUT as Shift Register  |     0 |     0 |            |           |       |
| Slice Registers            | 17050 |     0 |          0 |     41600 | 40.99 |
|   Register as Flip Flop    | 16916 |     0 |          0 |     41600 | 40.66 |
|   Register as Latch        |   134 |     0 |          0 |     41600 |  0.32 |
| F7 Muxes                   |   671 |     0 |          0 |     16300 |  4.12 |
| F8 Muxes                   |    30 |     0 |          0 |      8150 |  0.37 |
+----------------------------+-------+-------+------------+-----------+-------+

...
 
2. Memory
---------
+-------------------+------+-------+------------+-----------+-------+
|     Site Type     | Used | Fixed | Prohibited | Available | Util% |
+-------------------+------+-------+------------+-----------+-------+
| Block RAM Tile    | 36.5 |     0 |          0 |        50 | 73.00 |
|   RAMB36/FIFO*    |   36 |     0 |          0 |        50 | 72.00 |
|     RAMB36E1 only |   36 |       |            |           |       |
|   RAMB18          |    1 |     0 |          0 |       100 |  1.00 |
|     RAMB18E1 only |    1 |       |            |           |       |
+-------------------+------+-------+------------+-----------+-------+

...




Block RAM:

Global view highlighting used cells:

Enlarged view of the same, with individual BMEM cells outlined next to the areas of programmable fabric:

Sync/Async Reads

Memory Inference Capabilities

(Xilinx: https://docs.xilinx.com/r/en-US/ug901-vivado-synthesis/Choosing-Between-Distributed-RAM-and-Dedicated-Block-RAM)
Memory inference capabilities include the following:

Provided that only one write port is described, Vivado synthesis can identify RAM descriptions with two or more read ports that access the RAM contents at addresses different from the write address.

Dual-Port RAM


(https://docs.xilinx.com/r/en-US/ug958-vivado-sysgen-ref/Dual-Port-RAM)

Modes for how synchronous writes affect reads from the same address in the same cycle

(based on Xilinx documentation)

  1. Read-first (old data read)
    • when a read and a write occur at the same address, old content is read before new content is loaded
  2. Write-first (new data read)
    • data written is immediately available for read in the same cycle
    • also known as read-through
  3. No-change
    • an active data write prevents the read data output from updating
    • must be followed by an explicit read operation in a following cycle to see the result
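
The three modes can be compared with a small behavioral model. The Python sketch below assumes a single-port memory with a registered read output; the class name `SyncRam`, its `clock` method, and the mode strings are hypothetical and are not Xilinx primitives or attributes.

```python
class SyncRam:
    """Single-port synchronous RAM model with a selectable read-during-write mode."""

    def __init__(self, depth, mode="read_first"):
        assert mode in ("read_first", "write_first", "no_change")
        self.mem = [0] * depth
        self.mode = mode
        self.dout = 0  # registered read-data output

    def clock(self, addr, we=False, din=0):
        """Model one rising clock edge; returns dout as seen after the edge."""
        if we:
            old = self.mem[addr]
            self.mem[addr] = din
            if self.mode == "read_first":
                self.dout = old        # old content appears on the output
            elif self.mode == "write_first":
                self.dout = din        # new content appears on the output
            # no_change: dout keeps its previous value during a write
        else:
            self.dout = self.mem[addr]
        return self.dout
```

For example, with location 1 holding 7 and a write of 9 to that address, the output after the write edge is 7 in read-first mode, 9 in write-first mode, and whatever was last read in no-change mode; a following read of address 1 returns 9 in all three.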

Burst Modes

DMA

DMA Modes

FPGA: Use of Inbuilt Adders

FPGA: Multipliers

Shift Using a Multiplier
