Required textbook for the course: Patrick R. Schaumont, A Practical Introduction to Hardware/Software Codesign, 2nd Edition, Springer (ISBN: 978-1461437369)
Traditional Definition: Hardware/Software Codesign is the design of cooperating hardware components and software components in a single design effort.
The line between hardware and software is fuzzy (FPGAs, Specialized Processors, etc.)
Definition: Hardware/Software Codesign is the design of an application in terms of fixed and flexible components
Schematic:
HDL:
```systemverilog
module counter(rst, clk, q);
  input  rst;
  input  clk;
  output logic [7:0] q;

  logic [7:0] c;

  // combinational: output is the register contents plus one
  always_comb begin
    q = c + 1;
  end

  // sequential: register updates on the rising clock edge
  always_ff @(posedge clk) begin: blk1
    c <= rst ? 0 : q;
  end
endmodule
```
Timing Diagram:
Truth Tables: Combinational and Sequential
Combinational
| a | b | y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
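The table above is the exclusive-OR function, y = a XOR b. As a minimal check (a hypothetical C snippet, not from the notes), enumerating the inputs reproduces the table:

```c
#include <stdio.h>

/* Enumerate all input combinations of the combinational truth table;
   y = a ^ b (XOR) matches every row above. */
int main(void) {
    printf(" a | b | y\n");
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            printf(" %d | %d | %d\n", a, b, a ^ b);
    return 0;
}
```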
Sequential
Implicit State: The state variable is the output
State Transition Table: Explicit Internal State Register
| A | B | Current State | Next State | Output |
|---|---|---------------|------------|--------|
| 0 | 0 | S1 | S1 | 0 |
| 0 | 0 | S2 | S1 | 1 |
| 0 | 1 | S1 | S2 | 1 |
| 0 | 1 | S2 | S2 | 0 |
| ... | ... | ... | ... | ... |
| 1 | 1 | S1 | S2 | 1 |
| 1 | 1 | S2 | S1 | 0 |
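A hedged C sketch (names hypothetical, not from the textbook) of this state transition table as software: the explicit state register becomes a variable, and each call models one clock cycle. The (A, B) = (1, 0) rows are elided in the table above, so they are left unhandled here.

```c
#include <stdio.h>

typedef enum { S1, S2 } state_t;

/* One clock cycle of the machine: reads inputs a and b, updates the
   explicit state register, and returns the output per the table. */
static int fsm_step(state_t *state, int a, int b) {
    state_t next = *state;
    int out = 0;
    if (a == 0 && b == 0) { next = S1; out = (*state == S2); }
    else if (a == 0 && b == 1) { next = S2; out = (*state == S1); }
    else if (a == 1 && b == 1) { next = (*state == S1) ? S2 : S1;
                                 out  = (*state == S1); }
    /* (a, b) = (1, 0): elided in the table excerpt, left unspecified */
    *state = next;
    return out;
}

int main(void) {
    state_t s = S1;
    printf("%d\n", fsm_step(&s, 0, 1));  /* S1 --(0,1)--> S2, output 1 */
    printf("%d\n", fsm_step(&s, 0, 0));  /* S2 --(0,0)--> S1, output 1 */
    return 0;
}
```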
The course textbook uses single-clock synchronous digital circuits built from word-level combinational logic and flip-flops
These are built from components like registers, adders, and multiplexers
Cycle-based hardware modeling is often called register-transfer-level (RTL) modeling because the behavior of a circuit can be thought of as a sequence of transfers between registers, with logical and arithmetic operations performed on the signals during transfers
Basically, you describe what the contents of every register should be after every clock cycle based on the contents of registers before the clock transition
This description lends itself to implementation as synchronous logic through automated hardware synthesis
```systemverilog
always_comb begin
  q = c + 1;
end

always_ff @(posedge clk) begin: blk1
  c <= rst ? 0 : q;
end
```
RTL does not describe all behaviors of digital logic, e.g., asynchronous logic, dynamic logic, multiphase clocked hardware, and hardware with latches.
It also does not, by default, model sub-clock-cycle timing such as gate and wire delays, and thus cannot capture physical-hardware problems like race conditions.
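To make the cycle-based view concrete, here is a hedged C sketch (not from the textbook) that simulates the RTL counter above one clock cycle at a time: first evaluate the combinational logic from the current register contents, then perform the register transfer at the "clock edge".

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t c = 0;                        /* the register */
    for (int cycle = 0; cycle < 5; cycle++) {
        int rst = (cycle == 0);           /* assert reset in the first cycle */
        uint8_t q = (uint8_t)(c + 1);     /* combinational: q = c + 1 */
        printf("cycle %d: q = %u\n", cycle, q);
        c = rst ? 0 : q;                  /* register transfer at posedge clk */
    }
    return 0;
}
```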
The most basic software style is the single-threaded sequential program.
This model subsumes a processor with some number of registers, caches, RAM, ROM, arithmetic and other combinational modules, control logic, etc., implementing an instruction set. Instructions are executed sequentially to move and process data.
It does not include all programming paradigms, such as multi-threading (which sometimes creates an illusion of parallelism), object-oriented software, and functional programming.
Assembly and C have a strong correlation to the operations happening within the hardware (and give some sense of hardware utilization) while even higher-level languages tend to abstract the hardware further away.
The function of code written in a sequential style is inferred by assuming that one line of code executes at a time, in a deterministic order.
Concurrency: concurrent code describes a function as a set of independent operations, where the order of execution among many of them is not defined.
Parallelism (or parallel execution) refers to simultaneous execution, in which multiple operations are happening at the same instant of time
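To illustrate the distinction, a hedged C sketch (assuming OpenMP is available; compile with -fopenmp): the loop body consists of independent operations, so the concurrent version permits any execution order, and on multiple cores the iterations may also run in parallel.

```c
#include <stdio.h>
#define N 8

int main(void) {
    int y[N];
    /* Concurrent: iterations are independent, so any interleaving is
       valid; with multiple cores they may also execute in parallel. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = i * i;
    for (int i = 0; i < N; i++)
        printf("%d ", y[i]);
    printf("\n");
    return 0;
}
```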
FPGAs contain reconfigurable hardware that must be configured according to a given netlist description. The configuration is called software, but developing for FPGAs feels much more like hardware development.
Digital Signal Processors (DSPs) are much like general-purpose processors, but with specialized hardware instructions. Taking advantage of these means knowing where the specialized hardware beats plain software and structuring the program specifically to use it (which may require assembly).
Development of and with Application-Specific Instruction-Set Processors (ASIPs) goes one step further: development requires implementing your own custom hardware instructions and writing code that uses them.
Using Cell processors (as in the PlayStation 3) well requires understanding the hardware architecture of the cells and how they communicate and interact.
In general, a good knowledge of hardware and software is useful for developing specialized processing platforms.
Performance: the graph shows bits/cycle for hardware and software implementations of a cryptography algorithm in an embedded application. Hardware can do more per cycle, trading the flexibility of a general-purpose processor for application-specific performance.
Energy efficiency: useful work done per unit of energy. The data shown are for performing AES; implementations vary by orders of magnitude.
Programmable Platform: collection of programmable components
Application Mapping refers to writing software for that platform and, if needed, customizing the hardware configuration.
Platform Programming is the task of mapping application software onto the platform hardware: the “compiling” or “synthesizing” that turns the software representation into a form that can be downloaded onto the platform. Often this is simply a C compiler or an HDL synthesizer.
RISC: software is C; hardware is a general purpose processor
FPGA: software is HDL (unless a soft-processor is used then C is also used)
DSP: software is C and custom assembly; hardware is tailored for class of applications
ASIP: software is C as well as a hardware description of the processor extensions
ASIC: assuming HDL, the “application” and “platform” are one and the same
C Code: Generally reusable and portable
```c
sum = 0;
for (i = 0; i < N; i++) {
    sum += m[i] * n[i];
}
```
TI C64x DSP Assembly
Specific pairs of statements can execute in parallel
Not very general or portable
```asm
        LDDW    .D2T2   *B_n++, B_reg1:B_reg0
||      LDDW    .D1T1   *A_m++, A_reg1:A_reg0    ; runs in parallel with previous
        DOTP2   .M2X    A_reg0, B_reg0, B_prod
||      DOTP2   .M1X    A_reg1, B_reg1, A_prod   ; runs in parallel with previous
        SPKERNEL 4, 0
        ADD     .L2     B_sum, B_prod, B_sum
||      ADD     .L1     A_sum, A_prod, A_sum     ; runs in parallel with previous
```
Hardware design and software design differ along several dimensions:
- Design paradigm
- Resource cost
- Flexibility
- Parallelism
- Modeling vs. implementation
- Reuse
Here is the schedule (figure omitted): the vertical axis is time, and the horizontal axis is the processor number.
Unused parallel processors can be assigned new work while others continue processing partial results.
Computations can be organized so that partial results are shared instead of duplicated; consider whether recomputation or communication is cheaper.
Example: many different sums are computed, sharing partial results (see the sketch below):
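As a hedged illustration (example data invented), computing all prefix sums s[k] = m[0] + ... + m[k]: recomputing each sum from scratch duplicates work (O(N^2) additions), while sharing the previous partial result needs only O(N) additions.

```c
#include <stdio.h>
#define N 8

int main(void) {
    int m[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int naive[N], shared[N];

    /* Recompute each sum from scratch: duplicated work */
    for (int k = 0; k < N; k++) {
        naive[k] = 0;
        for (int i = 0; i <= k; i++)
            naive[k] += m[i];
    }

    /* Share partial results: each sum reuses the previous one */
    shared[0] = m[0];
    for (int k = 1; k < N; k++)
        shared[k] = shared[k - 1] + m[k];

    for (int k = 0; k < N; k++)
        printf("s[%d] = %d (= %d)\n", k, shared[k], naive[k]);
    return 0;
}
```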
The maximal speedup is determined by the level of possible parallelism, but the complexity is in identifying what must be sequential vs. what can be parallelized, and how to parallelize. Algorithms may need to be reworked to leverage parallel computation.
A description can be found here: https://en.wikipedia.org/wiki/Amdahl's_law
Partition the latency $T_{\text{old}}$ of a sequential algorithm (where latency is the sum of the latencies of the parts) into two portions: a fraction $(1-p)$ that must remain sequential and a fraction $p$ that can be spread across $N$ parallel processors:

$$T_{\text{new}} = (1-p)\,T_{\text{old}} + \frac{p\,T_{\text{old}}}{N}$$

The ratio of the old latency to the new latency can be called the speedup, $S$:

$$S = \frac{T_{\text{old}}}{T_{\text{new}}} = \frac{1}{(1-p) + p/N}$$

Therefore, speedup is bounded by

$$S \le \frac{1}{1-p}$$
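A hedged worked example (numbers chosen for illustration): with $p = 0.9$ and $N = 8$,

$$S = \frac{1}{0.1 + 0.9/8} \approx 4.7,$$

and even with unlimited processors the speedup cannot exceed $1/(1-p) = 10$.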
Must find opportunities for parallelism: a key challenge is understanding what must be sequential and what can be parallelized, and what additional overhead might be incurred from communication to support parallelism
In traditional computer architecture, computations are divided into instructions.
Computer architecture's iron law of processor performance:

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$
The CISC approach as compared to RISC: CISC lowers instructions/program (each complex instruction does more work) at the cost of cycles/instruction, while RISC accepts more, simpler instructions that each complete in fewer cycles and permit a faster clock.
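A hedged numeric illustration (all numbers invented): suppose the CISC version of a program executes $10^6$ instructions at 3 cycles/instruction ($3 \times 10^6$ cycles), while the RISC version needs $1.5 \times 10^6$ simpler instructions at 1.2 cycles/instruction ($1.8 \times 10^6$ cycles). At the same clock period the RISC version is faster, and its simpler logic may also allow a shorter cycle time, multiplying the advantage per the iron law.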
Custom Synchronous Digital Circuit Design:
a. Circuit decomposition tends to break a computation into groups/subgraphs of combinational logic, with groups communicating with each other through registers at designated, discrete times
b. May trade off the time required for each cycle against how much work can be performed in each combinational group per cycle
c. Can optimize the system time required to perform a computation by identifying and reducing the worst-case delay among all parallel circuit paths, with the intent of reducing the minimum clock period (which is bounded below by the critical-path delay)
d. Balance delays: may reorganize combinational logic within a group to reduce delay in the critical path at the expense of delay in other paths, but the worst-case path is the focus
e. Low-level circuit optimization: may resize gates, use different circuit devices, custom-design layouts for logic blocks, or use a different logic style (e.g., not just the fully-complementary implementation known as static CMOS) (refer to a VLSI course)
f. Can reorganize computation in time by introducing and moving registers
g. Can introduce additional hardware to parallelize work (e.g., unfolding; see the sketch after this list)
h. Iterative algorithms provide ways to break a complex computation into smaller units of work that can be performed with a faster clock, avoiding system-limiting critical paths and/or using less hardware
i. May want non-uniform clocks/register updates
j. Consider use of reprogrammable circuits and software
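A hedged C analogy for item g (unfolding), not from the textbook: processing two elements per iteration exposes independent operations that duplicated hardware units (or the parallel VLIW slots in the TI C64x listing earlier) could perform simultaneously.

```c
#include <stdio.h>
#define N 8

int main(void) {
    int m[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int n[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    int sum0 = 0, sum1 = 0;

    /* Unfolded by a factor of 2: the two multiply-accumulates in each
       iteration are independent, so parallel hardware can do both at once. */
    for (int i = 0; i < N; i += 2) {
        sum0 += m[i]     * n[i];
        sum1 += m[i + 1] * n[i + 1];
    }
    printf("dot product = %d\n", sum0 + sum1);
    return 0;
}
```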