Required textbook for the course: Patrick R. Schaumont, A Practical Introduction to Hardware/Software Codesign, 2nd Edition, Springer (ISBN: 978-1461437369)
Traditional Definition: Hardware/Software Codesign is the design of cooperating hardware components and software components in a single design effort.
The line between hardware and software is fuzzy (FPGAs, Specialized Processors, etc.)
Definition: Hardware/Software Codesign is the design of an application in terms of fixed and flexible components
Schematic:
HDL:
```systemverilog
module counter(rst, clk, q);
  input  rst;
  input  clk;
  output logic [7:0] q;

  logic [7:0] c;

  // combinational: output is the register contents plus one
  always_comb begin
    q = c + 1;
  end

  // sequential: register updates on the rising clock edge
  always_ff @(posedge clk) begin: blk1
    c <= rst ? 0 : q;
  end
endmodule
```
Timing Diagram:
Truth Tables: Combinational and Sequential
Combinational
| a | b | y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
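The table above is the exclusive-OR function, y = a XOR b. As a minimal check (a hypothetical C snippet, not from the notes), enumerating the inputs reproduces the table:

```c
#include <stdio.h>

/* Enumerate all input combinations of the combinational truth table;
   y = a ^ b (XOR) matches every row above. */
int main(void) {
    printf(" a | b | y\n");
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            printf(" %d | %d | %d\n", a, b, a ^ b);
    return 0;
}
```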
Sequential
Implicit State: The state variable is the output
State Transition Table: Explicit Internal State Register
| A | B | Current State | Next State | Output |
|---|---|---------------|------------|--------|
| 0 | 0 | S1 | S1 | 0 |
| 0 | 0 | S2 | S1 | 1 |
| 0 | 1 | S1 | S2 | 1 |
| 0 | 1 | S2 | S2 | 0 |
| ... | ... | ... | ... | ... |
| 1 | 1 | S1 | S2 | 1 |
| 1 | 1 | S2 | S1 | 0 |
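A hedged C sketch (names hypothetical, not from the textbook) of this state transition table as software: the explicit state register becomes a variable, and each call models one clock cycle. The (A, B) = (1, 0) rows are elided in the table above, so they are left unhandled here.

```c
#include <stdio.h>

typedef enum { S1, S2 } state_t;

/* One clock cycle of the machine: reads inputs a and b, updates the
   explicit state register, and returns the output per the table. */
static int fsm_step(state_t *state, int a, int b) {
    state_t next = *state;
    int out = 0;
    if (a == 0 && b == 0) { next = S1; out = (*state == S2); }
    else if (a == 0 && b == 1) { next = S2; out = (*state == S1); }
    else if (a == 1 && b == 1) { next = (*state == S1) ? S2 : S1;
                                 out  = (*state == S1); }
    /* (a, b) = (1, 0): elided in the table excerpt, left unspecified */
    *state = next;
    return out;
}

int main(void) {
    state_t s = S1;
    printf("%d\n", fsm_step(&s, 0, 1));  /* S1 --(0,1)--> S2, output 1 */
    printf("%d\n", fsm_step(&s, 0, 0));  /* S2 --(0,0)--> S1, output 1 */
    return 0;
}
```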
The course textbook uses single-clock synchronous digital circuits built from word-level combinational logic and flip-flops
These are built from components like registers, adders, and multiplexers
Cycle-based hardware modeling is often called register-transfer-level (RTL) modeling because the behavior of a circuit can be thought of as a sequence of transfers between registers, with logical and arithmetic operations performed on the signals during transfers
Basically, you describe what the contents of every register should be after every clock cycle based on the contents of registers before the clock transition
This description lends itself to implementation as synchronous logic through automated hardware synthesis
```systemverilog
always_comb begin
  q = c + 1;
end

always_ff @(posedge clk) begin: blk1
  c <= rst ? 0 : q;
end
```
RTL does not describe all behaviors of digital logic, e.g., asynchronous logic, dynamic logic, multiphase clocked hardware, and hardware with latches.
It also does not, by default, model sub-clock-cycle timing such as gate and wire delays, and thus cannot capture physical-hardware problems like race conditions.
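To make the cycle-based view concrete, here is a hedged C sketch (not from the textbook) that simulates the RTL counter above one clock cycle at a time: first evaluate the combinational logic from the current register contents, then perform the register transfer at the "clock edge".

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t c = 0;                        /* the register */
    for (int cycle = 0; cycle < 5; cycle++) {
        int rst = (cycle == 0);           /* assert reset in the first cycle */
        uint8_t q = (uint8_t)(c + 1);     /* combinational: q = c + 1 */
        printf("cycle %d: q = %u\n", cycle, q);
        c = rst ? 0 : q;                  /* register transfer at posedge clk */
    }
    return 0;
}
```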
The most basic software style is the single-threaded sequential program.
This model subsumes a processor with some number of registers, caches, RAM, ROM, arithmetic and other combinational modules, control logic, etc., implementing an instruction set. Instructions are executed sequentially to move and process data.
It does not include all programming paradigms, such as multi-threading (which sometimes creates an illusion of parallelism), object-oriented software, and functional programming.
Assembly and C have a strong correlation to the operations happening within the hardware (and give some sense of hardware utilization) while even higher-level languages tend to abstract the hardware further away.
The function of code written in a sequential style is inferred by assuming that one line of code executes at a time, in a deterministic order.
Concurrency: concurrent code describes a function as a set of independent operations, where the order of execution among many of them is not defined.
Parallelism (or parallel execution) refers to simultaneous execution, in which multiple operations are happening at the same instant of time
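To illustrate the distinction, a hedged C sketch (assuming OpenMP is available; compile with -fopenmp): the loop body consists of independent operations, so the concurrent version permits any execution order, and on multiple cores the iterations may also run in parallel.

```c
#include <stdio.h>
#define N 8

int main(void) {
    int y[N];
    /* Concurrent: iterations are independent, so any interleaving is
       valid; with multiple cores they may also execute in parallel. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = i * i;
    for (int i = 0; i < N; i++)
        printf("%d ", y[i]);
    printf("\n");
    return 0;
}
```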
FPGAs contain reconfigurable hardware that must be configured according to a given netlist description. The configuration is called software, but developing for FPGAs feels much more like hardware development.
Digital Signal Processors (DSPs) are much like general-purpose processors, but with specialized hardware instructions. Taking advantage of these means knowing where the specialized hardware beats plain software and structuring the program specifically to use it (which may require assembly).
Development of and with Application-Specific Instruction-Set Processors (ASIPs) goes one step further: development requires implementing your own custom hardware instructions and writing code that uses them.
Using Cell processors (as in the PlayStation 3) well requires understanding the hardware architecture of the cells and how they communicate and interact.
In general, a good knowledge of hardware and software is useful for developing specialized processing platforms.
Performance: the graph shows bits/cycle for hardware and software implementations of a cryptography algorithm in an embedded application. Hardware can do more per cycle, trading the flexibility of a general-purpose processor for application-specific performance.
Energy efficiency: useful work done per unit of energy. The data shown are for performing AES; implementations vary by orders of magnitude.
Programmable Platform: collection of programmable components
Application Mapping refers to writing software for that platform and, if needed, customizing the hardware configuration.
Platform Programming is the task of mapping application software onto the platform hardware: the “compiling” or “synthesizing” that turns the software representation into a form that can be downloaded onto the platform. Often this is simply a C compiler or an HDL synthesizer.
RISC: software is C; hardware is a general purpose processor
FPGA: software is HDL (unless a soft-processor is used then C is also used)
DSP: software is C and custom assembly; hardware is tailored for class of applications
ASIP: software is C as well as a hardware description of the processor extensions
ASIC: assuming HDL, the “application” and “platform” are one and the same
C Code: Generally reusable and portable
```c
sum = 0;
for (i = 0; i < N; i++) {
    sum += m[i] * n[i];
}
```
TI C64x DSP Assembly
Specific pairs of statements can execute in parallel
Not very general or portable
```asm
        LDDW    .D2T2   *B_n++, B_reg1:B_reg0
||      LDDW    .D1T1   *A_m++, A_reg1:A_reg0    ; runs in parallel with previous
        DOTP2   .M2X    A_reg0, B_reg0, B_prod
||      DOTP2   .M1X    A_reg1, B_reg1, A_prod   ; runs in parallel with previous
        SPKERNEL 4, 0
        ADD     .L2     B_sum, B_prod, B_sum
||      ADD     .L1     A_sum, A_prod, A_sum     ; runs in parallel with previous
```
Hardware design and software design differ along several dimensions:
- Design paradigm
- Resource cost
- Flexibility
- Parallelism
- Modeling vs. implementation
- Reuse
Here is the schedule (figure omitted): the vertical axis is time, and the horizontal axis is the processor number.
Unused parallel processors can be assigned new work while others continue processing partial results.
Computations can be organized so that partial results are shared instead of duplicated; consider whether recomputation or communication is cheaper.
Example: many different sums are computed, sharing partial results (see the sketch below):
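As a hedged illustration (example data invented), computing all prefix sums s[k] = m[0] + ... + m[k]: recomputing each sum from scratch duplicates work (O(N^2) additions), while sharing the previous partial result needs only O(N) additions.

```c
#include <stdio.h>
#define N 8

int main(void) {
    int m[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int naive[N], shared[N];

    /* Recompute each sum from scratch: duplicated work */
    for (int k = 0; k < N; k++) {
        naive[k] = 0;
        for (int i = 0; i <= k; i++)
            naive[k] += m[i];
    }

    /* Share partial results: each sum reuses the previous one */
    shared[0] = m[0];
    for (int k = 1; k < N; k++)
        shared[k] = shared[k - 1] + m[k];

    for (int k = 0; k < N; k++)
        printf("s[%d] = %d (= %d)\n", k, shared[k], naive[k]);
    return 0;
}
```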
The maximal speedup is determined by the level of possible parallelism, but the complexity is in identifying what must be sequential vs. what can be parallelized, and how to parallelize. Algorithms may need to be reworked to leverage parallel computation.
A description can be found here: https://en.wikipedia.org/wiki/Amdahl's_law
Partition the latency $T_{\text{old}}$ of a sequential algorithm (where latency is the sum of the latencies of the parts) into two portions: a fraction $(1-p)$ that must remain sequential and a fraction $p$ that can be spread across $N$ parallel processors:

$$T_{\text{new}} = (1-p)\,T_{\text{old}} + \frac{p\,T_{\text{old}}}{N}$$

The ratio of the old latency to the new latency can be called the speedup, $S$:

$$S = \frac{T_{\text{old}}}{T_{\text{new}}} = \frac{1}{(1-p) + p/N}$$

Therefore, speedup is bounded by

$$S \le \frac{1}{1-p}$$
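A hedged worked example (numbers chosen for illustration): with $p = 0.9$ and $N = 8$,

$$S = \frac{1}{0.1 + 0.9/8} \approx 4.7,$$

and even with unlimited processors the speedup cannot exceed $1/(1-p) = 10$.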
Must find opportunities for parallelism: a key challenge is understanding what must be sequential and what can be parallelized, and what additional overhead might be incurred from communication to support parallelism
In traditional computer architecture, computations are divided into instructions.
Computer architecture's iron law of processor performance:

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$
The CISC approach as compared to RISC: CISC lowers instructions/program (each complex instruction does more work) at the cost of cycles/instruction, while RISC accepts more, simpler instructions that each complete in fewer cycles and permit a faster clock.
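A hedged numeric illustration (all numbers invented): suppose the CISC version of a program executes $10^6$ instructions at 3 cycles/instruction ($3 \times 10^6$ cycles), while the RISC version needs $1.5 \times 10^6$ simpler instructions at 1.2 cycles/instruction ($1.8 \times 10^6$ cycles). At the same clock period the RISC version is faster, and its simpler logic may also allow a shorter cycle time, multiplying the advantage per the iron law.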
Custom Synchronous Digital Circuit Design:
a. Circuit decomposition tends to break a computation into groups/subgraphs of combinational logic, with groups communicating with each other through registers at designated, discrete times
b. May trade off the time required for each cycle against how much work can be performed in each combinational group per cycle
c. Can optimize the system time required to perform a computation by identifying and reducing the worst-case delay among all parallel circuit paths, with the intent of reducing the minimum clock period (which is bounded below by the critical-path delay)
d. Balance delays: may reorganize combinational logic within a group to reduce delay in the critical path at the expense of delay in other paths, but the worst-case path is the focus
e. Low-level circuit optimization: may resize gates, use different circuit devices, custom-design layouts for logic blocks, or use a different logic style (e.g., not just the fully-complementary implementation known as static CMOS) (refer to a VLSI course)
f. Can reorganize computation in time by introducing and moving registers
g. Can introduce additional hardware to parallelize work (e.g., unfolding; see the sketch after this list)
h. Iterative algorithms provide ways to break a complex computation into smaller units of work that can be performed with a faster clock, avoiding system-limiting critical paths and/or using less hardware
i. May want non-uniform clocks/register updates
j. Consider use of reprogrammable circuits and software
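A hedged C analogy for item g (unfolding), not from the textbook: processing two elements per iteration exposes independent operations that duplicated hardware units (or the parallel VLIW slots in the TI C64x listing earlier) could perform simultaneously.

```c
#include <stdio.h>
#define N 8

int main(void) {
    int m[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int n[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    int sum0 = 0, sum1 = 0;

    /* Unfolded by a factor of 2: the two multiply-accumulates in each
       iteration are independent, so parallel hardware can do both at once. */
    for (int i = 0; i < N; i += 2) {
        sum0 += m[i]     * n[i];
        sum1 += m[i + 1] * n[i + 1];
    }
    printf("dot product = %d\n", sum0 + sum1);
    return 0;
}
```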