Physical Design Flow and Timing Analysis

Ryan Robucci

1. References

https://eclipse.umbc.edu/robucci/cmpeRSD/Lectures/Lecture11__Metastability/
[ꭝ] Illustration from The Design Warrior's Guide to FPGAs by Clive Maxfield, Elsevier
[ꭝꭝ] Modeling, Synthesis and Rapid Prototyping with the Verilog HDL, Michael D. Ciletti
Basic Timing Analysis Concepts: https://eclipse.umbc.edu/robucci/cmpe415/attachments/Lecture14.pdf#page=10 Page 10-31

2. Objectives

Physical Design Flow
Understand key terminology of timing analysis, specifically Static Timing Analysis (STA) and Dynamic Timing Analysis
Understand Static Timing Analysis (STA)
Understand Dynamic Timing Analysis as a Complement to STA
Terms
- User Constraints (e.g. user constraints file, directives, project settings)
- Source Register; Destination Register
- Startpoint; Endpoint
- False Path
- Multi-cycle Path
- Path Delay
- Setup Time Slack
- Hold Time Slack
- Clock Path
- Clock Network
- Asynchronous Path
- Clock Domain
- Unrelated Clock Domain, and Related Clock Domains
- Clock Jitter and Uncertainty
- Metastability
- Clock Skew
- Signal Skew
- Pulse Width
- Contamination delay
- Async Reset and Control
- SDF
- ECO
Event-Based vs Cycle-Based Simulation

3. Electronic Design Tools

CAD: Computer Aided Design
CAE: Computer Aided Engineering
EDA: Electronic Design Automation
Historically,
- CAE referred to front-end tools like design capture and simulation
- CAD referred to back-end tools like layout, place, and route
- EDA was accepted as the merger of all topics

4. ASIC vs FPGA Design

http://www.xilinx.com/company/gettingstarted/fpgavsasic.htm

FPGAs have a shorter design process and shorter design cycles
When problems arise, design iterations are required that can be very time-consuming.
FPGAs have a significant advantage when design updates are required based on in-system testing
In-field updates allow in-system test and re-programming; insertion of new debug circuitry and its later removal; and feature upgrades

5. Automated Physical Design

A globally optimal solution going from RTL to a placed-and-routed physical design is difficult to find
- the process is broken into several steps towards a (hopefully) good solution
Heuristics are used in each step that generally lead to a good outcome
Synthesis converts a high-level behavioral description into a representation of a digital circuit
Mapping, Packing, and Place and Route follow

6. Synthesis and Mapping

Synthesis converts a high-level behavioral description into a representation of a digital circuit, such as a netlist
A netlist is a structural representation of a design, with instances of simple cells from a library
Mapping is the process of associating entities such as gate-level functions in the gate-level netlist with the physical LUT-level functions available on the FPGA

Illustration: [ꭝ] Maxfield

7. Packing

Packing is the grouping of LUTs and registers into CLBs (configurable logic blocks)


8. Place and Route

Place and Route is the process of placing CLBs and finding a routing configuration that makes the required interconnections


9. Timing Verification vs Logical Verification

Timing Verification refers to analyzing a design with regard to timing requirements.
This usually refers specifically to sub-cycle (within a clock cycle) timing analysis
- Cycle-timing verification, which looks at the cycles in which activity occurs, is distinct
Timing Analysis is very distinct from functional/behavioral/logical verification
Logical Verification:
- see if sum = {(a&b) | (a&c) | (b&c), a ^ b ^ c} is a valid implementation of sum(a,b,c) = a+b+c
- typically checked using many test input vectors from the set 000,001,…111 and comparing the output vectors produced from a high-level model and Verilog simulation
- special symbolic analysis tools exist that can check for logic equivalence efficiently for some designs
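
The exhaustive comparison described above can be sketched in Python. This is only an illustration of the idea, not a replacement for Verilog simulation: the function names `gate_level_sum` and `reference_sum` are hypothetical, and the gate-level expression is a transcription of the carry/sum logic in the bullet above, treated as a three-input full adder.

```python
from itertools import product

def gate_level_sum(a, b, c):
    """Gate-level candidate: returns (carry, sum_bit), i.e. the {carry, sum} concatenation."""
    carry = (a & b) | (a & c) | (b & c)
    sum_bit = a ^ b ^ c
    return carry, sum_bit

def reference_sum(a, b, c):
    """High-level model: arithmetic sum split into a 2-bit result."""
    total = a + b + c
    return (total >> 1) & 1, total & 1

def verify_exhaustively():
    """Compare both models on all 8 input vectors 000..111; return the mismatching vectors."""
    return [
        (a, b, c)
        for a, b, c in product((0, 1), repeat=3)
        if gate_level_sum(a, b, c) != reference_sum(a, b, c)
    ]

print(verify_exhaustively())  # [] means the two models agree on every vector
```

Exhaustive checking is only practical for small input counts, which is why symbolic equivalence tools are used for larger designs.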

10. Timing Analysis and Post-Place-and-Route simulation

The product of Place-and-Route is a fully routed physical design from which a timing analysis tool can extract timing information and check for any timing violations (setup, hold, etc.) associated with any of the internal registers.
These delay estimates are more accurate than the load-based estimates that would be used before place and route
Clock networks, switch configuration for routing, and signal buffers are known at this stage
A new netlist can be generated that includes accurate delays in a Standard Delay Format (SDF) file associated with the post-place-and-route netlist (delays can't be pushed directly back to the original description, as lots of stuff has moved around or changed)

11. FPGA vs ASIC Physical Design Tools

FPGAs vs ASICs: FPGAs are regular structures and represent a constrained design space for analysis tools (and synthesis tools).
Many designs will emerge from one underlying physical hardware design
This underlying hardware can be heavily characterized in the fabricated IC. This allows tools to perform accurate general design analysis for anyone using the same underlying hardware. In ASICs, each design can be very different or have unique differences, meaning it is more difficult for tools to accurately predict timing for every possible design. In ASIC design, SPICE-level simulation is sometimes used.

12. Static Timing Analysis

Static timing analysis refers to using delays extracted from the physical implementation to analyze timing directly rather than through simulation
- Delays are extracted from the placed-and-routed design
Static timing analysis does not involve driving inputs into the system and analyzing resulting waveforms
Static Timing Analysis is often fast and may be part of an automation tool's optimization process to test and evaluate design option trade-offs
Pre-place-and-route analysis uses estimated delays and can drive synthesis (timing-driven synthesis, e.g. a tool's Timing-Driven Synthesis logic option)
Typically pessimistic delay assumptions are made to arrive at a worst-case model – a data-driven simulation may reveal what delays are actually relevant in a design
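
To make the "analyze the structure, not waveforms" idea concrete, here is a minimal sketch of how a worst-case arrival time can be computed from extracted delays without any input vectors. The netlist, node names, and delay values are made up for illustration; real STA tools also handle rise/fall delays, clock paths, and user constraints.

```python
from graphlib import TopologicalSorter

def worst_case_arrival(arcs, launch_times):
    """Latest arrival time (ps) at every node of a combinational DAG.

    arcs: {(src, dst): delay_ps}; launch_times: {startpoint: time_ps}.
    """
    preds = {}
    for (src, dst), delay in arcs.items():
        preds.setdefault(src, [])
        preds.setdefault(dst, []).append((src, delay))
    order = TopologicalSorter(
        {node: [src for src, _ in fanin] for node, fanin in preds.items()}
    ).static_order()
    arrival = {}
    for node in order:  # every predecessor is visited before its successors
        fanin = preds[node]
        if fanin:
            arrival[node] = max(arrival[src] + delay for src, delay in fanin)
        else:
            arrival[node] = launch_times.get(node, 0)
    return arrival

# Hypothetical netlist: two launching flip-flops, two gates, one capturing flip-flop.
arcs = {
    ("ff1.q", "g1"): 120, ("ff2.q", "g1"): 100,
    ("g1", "g2"): 200, ("ff2.q", "g2"): 90,
    ("g2", "ff3.d"): 50,
}
arrival = worst_case_arrival(arcs, {"ff1.q": 300, "ff2.q": 300})
print(arrival["ff3.d"])  # 670: ff1.q -> g1 -> g2 -> ff3.d is the longest path
```

Taking the maximum at every node is the pessimistic, worst-case assumption: the path found may never actually be exercised by real data.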

13. Clock Period Constraint

For older Xilinx Tools, a PERIOD constraint should be supplied for every clock

14. Clock Domain Defined by Seq. Elements

Sequential Elements Include
- Flip Flops
- Latches
- Clocked Distributed and Block RAM/ROM
- FIFOs
- I/O Hardware with Clock Input (e.g. I/O SerDes)
- Hardware blocks with Clock Input (e.g. Xilinx MULT18X18)
The combinational paths between sequential elements in the same clock domain are constrained and must be analyzed

15. Clock Network

A clock network is a dedicated network of circuitry and clock buffers for balancing delay
Typically an H-tree is used to balance delay:

https://en.wikipedia.org/wiki/H_tree © Wikipedia user:Loopake, CC BY-SA 4.0
FPGA: the clock network has many global and local clock networks/regions and support components (https://www.allaboutcircuits.com/technical-articles/clock-management-clock-resources-of-fpgas/)
- Different I/O banks may be connected to different clock drivers
For ASICs, custom trees are inserted in a process step called "Clock-Tree Insertion"
- Insertion Delay refers to new timing after Clock Tree Insertion

(Clock Buffer Depiction Reference: S. A. Tawfik and V. Kursun, "Buffer Insertion and Sizing in Clock Distribution Networks with Gradual Transition Time Relaxation for Reduced Power Consumption," 2007 14th IEEE International Conference on Electronics, Circuits and Systems, Marrakech, Morocco, 2007, pp. 845-848, doi: 10.1109/ICECS.2007.4511123.)

16. Timing requirements for Registers

Registers have voltage requirements as well as specific timing requirements for their inputs
Setup and Hold
If violated, may get
- Old Data
- New Data
- Metastable Output
  - May not conform to valid voltage specifications in time (late settling)
  - May change in middle of clock period (multiple output transitions triggered by the clk)
  - Possible Output transitions when expected is 0->1:
representation in digital timing diagrams: a metastable output may be represented / depicted by a period of invalid/uncertain state

17. Clock Jitter / Uncertainty

clock jitter is continuously randomized uncertainty in sampling time of data
manufacturing and voltage changes can cause fixed changes, contributing to uncertainty
data change event times also are uncertain
voltage noise and temporal noise are related
… -> voltage noise -> temporal noise -> voltage noise -> …
this noise induces uncertainty in the timing of generated signals

18. Setup and Hold Times


Setup and Hold times define a window around a clock edge during which data inputs to a register should not transition.
Setup Time defines the time before a clock edge by which a signal must settle. A violation occurs when a path delay is too large. (It so happens that negative setup times are common)
Hold Time defines the time after a clock edge during which a signal must not begin a transition. A violation occurs when a path delay is too small.
Setup and Hold Time Slack quantify how much "time to spare" there is before an error occurs. A negative slack indicates a violation.

19. Clock Skew

Skew refers to the difference in the arrival time of a change at two different points for a signal that is logically the same. It can be caused by transmission line delays or buffers
Signal skew refers to logic data signals whereas clock skew refers to the clock signals

Clocks are buffered through a clock routing network. Clock signals are delayed with respect to the original clock.
Paths are defined starting at a source register and terminating at a destination register
Path delay: $T_{\rm PD}=T_{\rm clk-to-q} + T_{\rm comb.\ path\ delay}$
Clock Skew is the difference in arrival time of clock edges at destination registers and source registers


20. Positive Clock Skew

Setup and hold time windows are defined with respect to the destination register clock edge
If the destination clock is more delayed than the source clock, it represents positive clock skew with regard to that path. This gives *more time for a path to settle and thus avoid a setup time violation*. It unfortunately delays the cutoff time for holding a data signal.

21. Negative Clock Skew

If the destination clock is less delayed than the source clock, it represents negative clock skew with respect to that path. This gives less time for a path to settle and advances the cutoff time for when a signal must hold.

22. Setup Timing Slack in Critical Path

Static Timing Analysis is a structural analysis based on previous characterizations. A key parameter from the analysis is the setup timing slack
- $T_{\rm CLK\,TO\,Q}$: Delay from the clock edge to the update of the sequential element's output
- $T_{\rm CPD}$: Delay through combinatorial gates and routing
- $T_{\rm PD}$: Path Delay, $T_{\rm CLK\,TO\,Q} +T_{\rm CPD}$

Critical path timing requirement: $T_{\rm clk} + T_{\rm skew} > T_{\rm PD} + T_{\rm setup}$
- $T_{\rm skew}$: Time from when the clk edge occurs at a source flip-flop to when the edge occurs at a destination (e.g. clock delay 1 - clock delay 2)

As defined here, positive clock skew with respect to a critical path increases setup slack, while negative skew reduces it.

23. Setup Time Analysis (slack)

For each path there should be some time slack. A positive setup time slack value refers to how much extra delay could be added in the data path or how much faster a clock rate could be.

$T_{suSlack}$: positive slack indicates the timing requirement is met for a defined clock period, while negative slack means it is not
$T_{suSlack} = T_{clk} - (T_{\rm PD} + T_{\rm setup}) + T_{\rm skew} - ({\rm Jitter\ or\ Uncertainty})$
All possible paths must be analyzed. The paths with the longest delay are important, but the analysis should be a combination of path delay and clock skew and clock and path delay uncertainty, not just path delay.
- A short delay path could still be a problem because of race conditions
Setup Time Analysis = Data Path Delay including source clock-to-q delay + Destination Synchronous Element Setup Time - Clock Path Skew
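
The setup slack expression can be evaluated directly. The sketch below uses integer picoseconds and made-up delay values; the function name `setup_slack` and its parameter names are hypothetical, and skew follows the sign convention used in these notes (destination clock later than source clock is positive).

```python
def setup_slack(t_clk, t_clk_to_q, t_cpd, t_setup, t_skew=0, t_uncertainty=0):
    """T_suSlack = T_clk - (T_PD + T_setup) + T_skew - uncertainty, all in ps."""
    t_pd = t_clk_to_q + t_cpd  # path delay = clock-to-q + combinational delay
    return t_clk - (t_pd + t_setup) + t_skew - t_uncertainty

# Same path (T_PD = 8700 ps) analyzed against a 10 ns clock with different skew.
print(setup_slack(10000, 500, 8200, 300, t_skew=200, t_uncertainty=100))   # 1100
print(setup_slack(10000, 500, 8200, 300, t_skew=-200, t_uncertainty=100))  # 700
# A longer combinational path fails even with zero skew.
print(setup_slack(10000, 500, 9500, 300, t_skew=0, t_uncertainty=100))     # -400
```

The first two calls show why path delay alone does not identify the worst path: the same data path loses 400 ps of slack when the skew changes sign.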


24. Hold Time Analysis and Slack

Hold Time Analysis Avoids Race Conditions
- $T_{\rm slack(hold)} = T_{\rm PD} - T_{\rm hold} - T_{\rm skew}$
Most problematic are "short" combinatorial paths and high clock skew (e.g. logically back-to-back registers that are far from each other on the clock network)
Can fix by slowing data path (adding several slow buffers in series)
In VLSI, tend to route clock in opposite direction of data whenever creating shift register chains.
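
A small numeric sketch of the hold check and the buffer-insertion fix, assuming integer picoseconds and made-up values. The names `hold_slack` and `buffers_to_fix` are hypothetical, and the path delay passed in should be the shortest (minimum) delay of the path.

```python
def hold_slack(t_pd_min, t_hold, t_skew):
    """T_slack(hold) = T_PD - T_hold - T_skew, all in ps; negative means a violation."""
    return t_pd_min - t_hold - t_skew

def buffers_to_fix(slack, buffer_delay):
    """Smallest number of series buffers that brings a negative hold slack to >= 0."""
    if slack >= 0:
        return 0
    return -(slack // buffer_delay)  # ceiling of |slack| / buffer_delay

# Back-to-back registers (short data path) with large positive clock skew.
slack = hold_slack(t_pd_min=200, t_hold=100, t_skew=300)
n = buffers_to_fix(slack, buffer_delay=80)
print(slack, n)                                  # -200 3
print(hold_slack(200 + n * 80, 100, 300))        # 40
```

Unlike setup slack, the clock period does not appear here, so a hold violation cannot be fixed by slowing the clock.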

25. Contamination Delay for Hold Time Analysis (and async. removal time)

Contamination Delay is more conservative (safer) than Clock to Q
For a more conservative estimate, the clock-to-Q term in $T_{\rm PD}$ can be replaced with a smaller value called the Contamination Delay, a measure of how soon the output of a FF may become unusable (e.g. no longer a valid 0). The path delay $T_{\rm PD} = T_{\rm CLK\,TO\,Q} + T_{\rm CPD}$ becomes $T_{\rm PD} = T_{\rm CD} + T_{\rm CPD}$

where $T_{CD}$ is the time after the clock that the output may change as opposed to when it is likely to have settled
can also be used for asynchronous control removal time (later in today's discussion)

26. Unconstrained Paths and CDC Clock Domain Crossing

A clock domain is defined by sequential elements with the same control clock
- Combinational logic on paths between registers in the same clock domain is part of that domain
- Paths within the domain may be analyzed for timing and ensured to behave as predicted from cycle to cycle
A path from a register in one clock domain to a register in another carries what is known as a clock-domain-crossing (CDC) signal
- unless the clock domains are "related clock domains", such as one clock being derived from the other with an integer frequency multiple, it may not be possible to guarantee that such paths never have transitions that violate setup/hold time constraints
What else? Input Offsets and Output requirements (Assume CLKA and CLKB are independently constrained – Unrelated Clock Domains) and look at

By default, the input and output paths are not analyzed, regardless of whether there is one clock domain or several

Can add a specification for time after a clock edge allowed to reach output pad
Can add a specification for the transition window for input to analyze setup and hold time

Cross domain paths are paths originating from the output of a sequential element in one clock domain and ending at the input of sequential elements in another clock domain

By default, if multiple clock domains are present, cross-domain paths are not constrained and not analyzed.

In design partitioning, partition the design according to the signal types:
- constrained within a given clock domain
- CDC, per pair of clocks
- otherwise unconstrained: input, output

27. Clock Domain Crossing (CDC) Signals

Clock Domain Crossing arises from the need to communicate between multiple clock domains
So far, you have learned synchronous design in one clock domain
Having multiple clock domains means Clock Domain Crossing must be considered
Typically, a timing analysis setup should identify such signals and either provide more information on how to analyze them or simply waive violations

28. Asynchronous Inputs

Not all signals arise from a clock domain at all, such as physical, real-world inputs
A problem arises at a fanout point, where multiple interpretations are performed in parallel. If the voltage is not a valid high or valid low (e.g. invalid during sampling), inconsistent interpretations can result.

30. Multicycle Paths

Can override the ONE-PERIOD-DELAY-CONSTRAINT on some paths, such as to provide a series of multipliers two clock cycles instead of one to complete
In this example the assertion of the enable signals (en) determines the actual constraints. An STA tool may not have an understanding/information about the design to infer this on its own, which is why the override is needed
Consult documentation for how to specify individual paths (such as specifying source and destination register) and describe adjustments to the timing constraints uniquely for those
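
As a rough worked example (the symbol $N$ and all numbers are illustrative assumptions, not taken from a specific tool), allowing $N$ clock cycles for a path relaxes the setup requirement by replacing $T_{clk}$ with $N\,T_{clk}$ in the setup slack expression, here with jitter and uncertainty omitted for brevity:

$$T_{suSlack} = N\,T_{clk} - (T_{\rm PD} + T_{\rm setup}) + T_{\rm skew}$$

With $T_{clk} = 10$ ns, $T_{\rm PD} = 16.5$ ns, $T_{\rm setup} = 0.3$ ns and $T_{\rm skew} = 0$, the default single-cycle constraint ($N=1$) gives $10 - 16.8 = -6.8$ ns (a violation), while a two-cycle constraint ($N=2$) gives $20 - 16.8 = +3.2$ ns. Hold checks on multicycle paths usually need their own adjustment, so the tool documentation should be checked for how the hold relationship moves.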

31. False Paths

Can exclude paths where timing based on a single cycle is not important (false paths)
In the following example, assume endianMode is a signal set once upon processor power-up initialization.
```
...
S3: D <=   (endianMode?{AH,AL}:{AL,AH})
         + (endianMode?{BH,BL}:{BL,BH});
```
- The transition on endianMode is therefore not required for timing analysis, but it would make sense to request two STA runs: one for the case where endianMode is 1 and one for when it is 0
Where the design accounts for timing concerns, such as explicit designs to handle Clock Domain Crossing and Synchronizers, standard analysis and warnings are not meaningful
```
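// Two-flop synchronizer: the path into Q1 is expected to violate setup/hold
always @ (posedge clk) begin
  Q1<=in;   // samples the asynchronous input; may go metastable
  Q2<=Q1;   // gives Q1 a clock period to settle before use
end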
```
Or if the architecture doesn't require that a path meet timing requirements.
- In the following example, based on the design constraint the highlighted path would never be exercised
Or logically irrelevant path:
- Example:

Other Examples: https://www.edn.com/design/integrated-circuit-design/4433229/Basics-of-multi-cycle---false-paths

32. Asynchronous Signals (e.g. Async Clear)

Note that asynchronous signals are not easily analyzed for timing and are sensitive to glitches.

Wherever sensitivity to glitches exists, using registered output logic to generate the control signal is a good idea; otherwise ensure a hazard-free logic path:

33. Dynamic Timing Analysis

Whereas static timing analysis processes timing equations based on the circuit structure, dynamic timing analysis uses the results of a temporal simulation
The generated/saved waveforms (i.e. transient waveforms) are analyzed for timing violations and glitches
A complexity is that suitable input sequences (input vectors) must be created to test characteristics of the circuit at internal points.

34. Dynamic Timing Analysis using Event-Driven Simulation

Dynamic Timing Analysis typically uses an event-driven simulation, just like functional simulation; i.e., dynamic timing analysis involves a gate-level simulation with timing information
Event-driven simulation exploits the fact that most signals are quiescent at any given point in time.
- In event-driven simulation, computational effort is not expended on quiescent signals, i.e., their values are not recomputed at each time step
- Rather, the simulator waits for an event to occur, i.e., for a signal to undergo a change in value, and ONLY the values of the affected signals are recomputed.
- The time step size is decided dynamically ("on the fly"), based on the next event in the future-event queue
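
The event-queue mechanism can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular Verilog simulator is implemented: every gate evaluation is scheduled with a simple transport-style delay, and the circuit (`y = a & ~a`, which is logically always 0) and its delays are made up to show a glitch.

```python
import heapq

def simulate(gates, initial, stimuli):
    """gates: {out: (func, [inputs], delay)}; stimuli: [(time, signal, value)].

    Returns the list of (time, signal, value) changes that actually occurred.
    """
    values = dict(initial)
    fanout = {}
    for out, (_, inputs, _) in gates.items():
        for sig in inputs:
            fanout.setdefault(sig, []).append(out)
    queue = list(stimuli)
    heapq.heapify(queue)
    log = []
    while queue:
        time, sig, val = heapq.heappop(queue)  # jump straight to the next event
        if values.get(sig) == val:
            continue                           # no change, so nothing to recompute
        values[sig] = val
        log.append((time, sig, val))
        for out in fanout.get(sig, []):        # only gates driven by the changed signal
            func, inputs, delay = gates[out]
            new_val = func(*(values[i] for i in inputs))
            heapq.heappush(queue, (time + delay, out, new_val))
    return log

gates = {
    "n1": (lambda a: 1 - a, ["a"], 2),           # inverter, delay 2
    "y": (lambda a, n: a & n, ["a", "n1"], 3),   # AND gate, delay 3
}
log = simulate(gates, {"a": 0, "n1": 1, "y": 0}, [(10, "a", 1)])
print(log)  # [(10, 'a', 1), (12, 'n1', 0), (13, 'y', 1), (15, 'y', 0)]
```

The output `y` pulses high from time 13 to 15 even though it is logically constant, which is the kind of glitch a timing-annotated simulation can expose and a purely functional one cannot.
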
Verilog tools include several methods for annotating timing, including the use of specify blocks

[ꭝꭝ] Figure 2.8 Michael D. Ciletti
Timing data augmentation using sidecar files, e.g. Standard Delay Format (SDF) files
Delay parameters (e.g. #) may also be used to model timing delays, described as Inertial Delay or Transport Delay
Verilog also supports automatic timing check tasks to flag violations during simulation. These can be embedded alongside sequential logic
```
$hold (reference_event, data_event, limit[,notifier]) ;
```
Assertion violations are detected during simulation and reported
SystemVerilog Assertions (SVA) provide a specialized syntax for describing timing relationships, which can be inlined with other SystemVerilog RTL code and maintained (updated) along with it
Online Analysis
- checks are activated and deactivated as-needed when satisfied or violated
Offline Analysis (post-simulation)
- saves waveforms in format such as VCD and analysis is performed afterward
- this is not the typical approach since disk IO is time-consuming (saving and reading all waveforms required for analysis)

35. Async Signal Timing Checks

recovery time: minimum time between the de-assertion edge of an asynchronous control and the next trigger edge of the clock; the control must be stable (de-asserted) for at least this long before the clock edge

removal time: minimum time after a clock trigger edge that an asserted asynchronous control signal must remain asserted and stable before it is de-asserted
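
The two checks can be sketched as a window test around each active clock edge. This is an illustrative model with made-up times in ps; the function name `check_async_deassert` is hypothetical and it looks only at a single de-assertion event.

```python
def check_async_deassert(t_deassert, clock_edges, t_recovery, t_removal):
    """Return a list of (check_name, clock_edge_time) violations for one de-assertion."""
    violations = []
    for edge in clock_edges:
        # Recovery: de-assertion came before the edge, but too close to it.
        if 0 < edge - t_deassert < t_recovery:
            violations.append(("recovery", edge))
        # Removal: de-assertion came at or after the edge, but too soon after it.
        if 0 <= t_deassert - edge < t_removal:
            violations.append(("removal", edge))
    return violations

edges = [0, 1000, 2000]
print(check_async_deassert(900, edges, t_recovery=150, t_removal=100))   # [('recovery', 1000)]
print(check_async_deassert(1050, edges, t_recovery=150, t_removal=100))  # [('removal', 1000)]
print(check_async_deassert(500, edges, t_recovery=150, t_removal=100))   # []
```

Together the two limits form a forbidden window around the clock edge for the de-assertion, analogous to the setup/hold window for data.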

reference: https://www.intel.com/content/www/us/en/docs/programmable/683539/20-4/recovery-and-removal-checks.html

36. Dynamic vs Static Timing Analysis; Functional vs Timing Verification

Functional simulation and verification refers to verifying logical descriptions, including Boolean expressions and register transfers
Timing verification refers to making sure that all timing requirements, external and internal (setup, hold, etc..) are satisfied
Dynamic Timing Analysis refers to analysis of timing by simulating the system with inputs and examining the resulting waveforms.
Static timing analysis uses no data and does not use a simulation; it just analyzes the structure
A timing analysis tool stores delays in a separate Standard Delay Format (SDF) file alongside the post-place-and-route netlist

37. Overview Static vs Dynamic Timing Analysis

Dynamic timing analysis and functional simulation each require well-selected input sequences. These are called test vectors. Static timing analysis uses no test vectors.
Dynamic timing analysis is sometimes combined with functional simulation, while static timing analysis cannot be
Though Dynamic Timing and Functional Analysis use an event-driven simulation which is much faster than SPICE-level "analog" circuit simulation, Static Timing Analysis is much faster as it only analyzes structure and can even be performed during place and route iterations.
Dynamic Timing Analysis can easily report glitches on asynchronous set/reset
Dynamic Timing Analysis will "correctly overlook" false paths if test vectors are valid, since the false path will not exhibit time violations in the actual circuit
Multiple Clock Domains and Asynchronous Reset/Clear
- Static timing analysis is more straightforward with one clock domain, though it can be extended to handle multiple clock domains. Dynamic timing analysis can automatically handle multiple clock domains as well as assess timing of asynchronous reset/clear

38. FPGA vendors role in EDA

On ASIC side, tools are expensive
FPGA companies, focused on selling FPGAs, provided tools cheaply, which drove use and sales of their FPGAs.
Tools were produced by third parties

FPGA vendors continue to innovate in tools to promote adoption of FPGAs

39. Additional Materials

Verilog Specify blocks
- https://peterfab.com/ref/verilog/verilog_renerta/mobile/source/vrg00052.htm
Full example:
- http://www.xilinx.pe.kr/_hdl/2/RESOURCES/www.ee.ed.ac.uk/~gerard/Teach/Verilog/me5cds/me95rh.html

40. SDF Timing Annotation

SDF is a text file format for capturing timing information about cells or a design, and information about how to perform timing checks
It may accompany a synthesis cell library
Example (from http://www.eda.org/sdf/sdf_3.0.pdf):

(DELAYFILE
(SDFVERSION "3.0")
(DESIGN "BIGCHIP")
(DATE "March 12, 1995 09:46")
(VENDOR "Southwestern ASIC")
(PROGRAM "Fast program")
(VERSION "1.2a")
(DIVIDER /)
(VOLTAGE 5.5:5.0:4.5)
(PROCESS "best:nom:worst")
(TEMPERATURE -40:25:125)
(TIMESCALE 100 ps)
(CELL
(CELLTYPE "BIGCHIP")
(INSTANCE top)
(DELAY
(ABSOLUTE
(INTERCONNECT mck b/c/clk (.6:.7:.9))
(INTERCONNECT d[0] b/c/d (.4:.5:.6))
)
)
)
(CELL
(CELLTYPE "AND2")
(INSTANCE top/b/d)
(DELAY
(ABSOLUTE
(IOPATH a y (1.5:2.5:3.4) (2.5:3.6:4.7))
(IOPATH b y (1.4:2.3:3.2) (2.3:3.4:4.3))
)
)
)
(CELL
(CELLTYPE "DFF")
(INSTANCE top/b/c)
(DELAY
(ABSOLUTE
(IOPATH (posedge clk) q (2:3:4) (5:6:7))
(PORT clr (2:3:4) (5:6:7))
)
)
(TIMINGCHECK
(SETUPHOLD d (posedge clk) (3:4:5) (-1:-1:-1))
(WIDTH clk (4.4:7.5:11.3))
)
)
(CELL
. . .
)
)
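
Each delay value in the example is a `min:typ:max` triple expressed in units of the `TIMESCALE`. The following sketch shows how a single triple might be read and converted to picoseconds; it is not a full SDF parser, and the helper names `parse_triple` and `delay_in_ps` are hypothetical. `Decimal` is used so the scaled values are exact.

```python
from decimal import Decimal

def parse_triple(text):
    """Parse an SDF '(min:typ:max)' triple; empty fields become None."""
    fields = text.strip().strip("()").split(":")
    if len(fields) != 3:
        raise ValueError(f"expected min:typ:max, got {text!r}")
    return tuple(Decimal(f.strip()) if f.strip() else None for f in fields)

def delay_in_ps(text, corner="max", timescale_ps=100):
    """Pick one corner of a triple and scale it by the file's TIMESCALE (100 ps here)."""
    index = {"min": 0, "typ": 1, "max": 2}[corner]
    value = parse_triple(text)[index]
    return None if value is None else value * timescale_ps

print(parse_triple("(1.5:2.5:3.4)"))      # (Decimal('1.5'), Decimal('2.5'), Decimal('3.4'))
print(delay_in_ps("(.6:.7:.9)"))          # 90.0 ps: max interconnect delay of mck -> b/c/clk
print(delay_in_ps("(-1:-1:-1)", "min"))   # -100 ps: the negative hold limit of the DFF
```

A timing tool would pick the corner that is pessimistic for the check at hand, for example maximum delays for setup analysis and minimum delays for hold analysis.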

Netlist: a simple structural representation of a circuit
Gate-Level Netlist: a circuit reduced to the simplest gates/cells available
Timing Annotator: part of a tool that matches and associates cell timing information from SDF or specify blocks to parts of the design to set the stage for timing analysis of the overall design
Back Annotation: the result of a timing analysis tool can be more complete or accurate timing information based on additional analysis of the resulting physical circuit. This can be captured in an SDF file
Interconnect delays: allow association of a unique delay with each port of each instance, representing not part of the original netlist or any instance but instead the extra delay associated with the interconnect path to a given input from its source, e.g.
- identify instance2.portb with signal propagation delay 20 ps
- identify instance2.porta with signal propagation delay 25 ps
- identify instance3.portq with signal propagation delay 30 ps

41. Engineering Change Order (ECO)

In engineering business, an ECO is a formal request to change a design specification, often specifying specific part of a design to change
In EDA tools, ECO refers to abilities to track, manage, analyze/predict impact (e.g. on timing), and perform changes on some part of a design at some step in the design flow, often without having to go "back to the beginning"
- For an EDA tool flow, this could involve logic insertion into a synthesized (post-implementation) netlist without re-synthesizing the entire design, or even after place and route
- Tools may support incremental operations like incremental place-and-route to merge updates into existing design
  - step 1: rip out (delete) old circuitry
  - step 2: insert new circuitry
- Supports what-if analysis that is an order of magnitude faster
  - check insertion of an inverter, flip-flop, or extra mux and I/O

42. Design Rule Checking DRC

Checks at various stages whether the physical or logic design

is valid
- e.g. standard signal does not have multiple drivers
- warnings like:
  - "net has only one connection"
  - "output port is undriven"
conforms to additional constraints provided
- e.g.
  - out1p and out2n should be physically assigned to a pair of pins supporting low-voltage 1-V differential signaling
  - clk input is routed physically through an appropriate input clock pin with connection to global clock buffers rather than through a general-purpose I/O pin
waived DRC violations (waivers):
- user may waive specific errors by providing additional information to tools, avoiding distraction from important issues

43. Cycle-Based Simulators

Event-driven simulation may involve multiple reevaluations of circuit components as inputs are propagated with delays
Cycle-based simulators attempt to update registers once per cycle, computing the function of the combinational logic once per cycle

Cycle-based simulators can be 5-10 times faster
Cycle-based simulation is suitable for design with well-defined clocks (e.g. not async circuits)
- Latches, combinatorial loops, clock-gating are more complex
example tool https://www.veripool.org/verilator/
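
A minimal sketch of the cycle-based idea, assuming a single clock and ignoring all sub-cycle delays: the combinational next-state function is evaluated exactly once per cycle and all registers update together. The function names and the two-stage shift register are illustrative only and do not describe how any particular tool is implemented.

```python
def run_cycles(next_state, state, inputs_per_cycle):
    """Evaluate the combinational logic once per clock cycle, then update all registers."""
    trace = [state]
    for inputs in inputs_per_cycle:
        state = next_state(state, inputs)  # one evaluation per cycle, no event queue
        trace.append(state)
    return trace

def shift_register(state, d_in):
    """Two back-to-back registers: q1 <= d_in; q2 <= q1 (both use pre-edge values)."""
    q1, q2 = state
    return (d_in, q1)

print(run_cycles(shift_register, (0, 0), [1, 0, 1]))
# [(0, 0), (1, 0), (0, 1), (1, 0)]
```

Because nothing between clock edges is modeled, glitches and setup/hold violations are invisible to this style of simulation; timing must be verified separately, for example with STA.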

44. Two-State vs Four-State Simulation

Two-state simulation uses one bit per value rather than the two bits required to represent each of 0, 1, x, z
Less memory and faster operations (simple checks on bits rather than considering all combinations of signal states)
Allows better mapping of storage and operations to the native simulation platform
code must not make use of x and z
use of x as a literal in code may be mapped to a random choice of 1 or 0 during run-time evaluation
- so, rather than seeing an x propagated in sim, would see a randomized value that may be different in each execution
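
The difference can be illustrated with a four-state AND next to a two-state mapping of `x`. This is a simplified model for illustration: the function names are hypothetical, and the random replacement of `x` mirrors the behavior described above rather than any specific simulator's documented policy.

```python
import random

def and4(a, b):
    """Four-state AND over '0', '1', 'x', 'z' (a z input is treated like x)."""
    if a == "0" or b == "0":
        return "0"  # a controlling 0 wins even against an unknown
    if a == "1" and b == "1":
        return "1"
    return "x"      # otherwise the unknown propagates

def to_two_state(value, rng):
    """Two-state mapping: an x or z literal becomes a random 0 or 1."""
    if value in ("0", "1"):
        return int(value)
    return rng.randint(0, 1)

print(and4("0", "x"), and4("1", "x"), and4("1", "1"))  # 0 x 1
rng = random.Random()
print(1 & to_two_state("x", rng))  # 0 or 1, possibly different on each run
```

In the four-state model the unknown stays visible as `x`; in the two-state model it silently becomes a concrete value, so uninitialized-state bugs may only show up on some runs.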

example of cycle-based 2-state simulator https://www.veripool.org/verilator/