Lecture 06 – Transformations

Ryan Robucci



Transformations

Multirate Expansion

Multirate Expansion

[Figure: multirate SDF graph IN -> A -> T2 -> T1 -> B -> OUT. Edge labels: IN->A 1, 1; A->T2 3; T1->B 2; B->OUT 1, 1.]

Based on book figure, with colored tokens †

Initial Graph from Book

[Figure: expanded graph. Edges: IN0->A0, IN1->A1; A0->T1 (a), A0->T2 (b), A0->B1 (c); A1->B1 (d), A1->B2 (e), A1->B2 (f); T1->B0, T2->B0; B0->OUT0, B1->OUT1, B2->OUT2.]

Correction Based on Initial Tokens

[Figure: corrected expanded graph. Edges: IN0->A0, IN1->A1; A0->B1 (a), A0->B1 (b), A0->B2 (c); A1->B2 (d), A1->T0, A1->T1; T0->B0 (e), T1->B0 (f); B0->OUT0, B1->OUT1, B2->OUT2.]
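The token routing in the corrected graph can be reproduced with a short sketch. It assumes, reading the first figure, that A produces 3 tokens per firing, B consumes 2, and the edge holds 2 initial tokens; `expand_edge` and its parameter names are hypothetical and only illustrate the bookkeeping.

```python
from math import gcd

def expand_edge(produce, consume, initial_tokens):
    """Single-rate expansion of one SDF edge A -> B over one iteration.

    Returns (reps_a, reps_b, links). Each link is
    (producer_firing, consumer_firing, delay), where delay is the number
    of iterations the token waits before it is consumed.
    """
    tokens = produce * consume // gcd(produce, consume)  # tokens per iteration
    reps_a, reps_b = tokens // produce, tokens // consume
    links = []
    for k in range(tokens):
        position = k + initial_tokens          # FIFO slot behind the initial tokens
        src = k // produce                     # firing of A that produced token k
        dst = (position // consume) % reps_b   # firing of B that consumes it
        delay = position // tokens             # 1 means "consumed next iteration"
        links.append((src, dst, delay))
    return reps_a, reps_b, links

# Assumed rates: A produces 3, B consumes 2, two initial tokens on the edge.
reps_a, reps_b, links = expand_edge(produce=3, consume=2, initial_tokens=2)
for label, (src, dst, delay) in zip("abcdef", links):
    print(f"{label}: A{src} -> B{dst} (delay {delay})")
```

Under these assumptions, tokens a and b go to B1, c and d go to B2, and e and f wrap around to B0 in the next iteration, which is the role the tokens T0 and T1 play in the corrected graph.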

Retiming

2.5.2 Retiming †

†Schaumont

The initial formulation of the retiming problem as described by Leiserson and Saxe is as follows. Given a directed graph $G := (V, E)$ whose vertices represent logic gates or combinational delay elements in a circuit, assume there is a directed edge $e := (u, v)$ between two elements that are connected directly or through one or more registers. Let the "weight" of each edge $w(e)$ be the number of registers present along edge $e$ in the initial circuit. Let $d(v)$ be the propagation delay through vertex $v$. The goal in retiming is to compute an integer "lag" value $r(v)$ for each vertex such that the retimed weight $w_r(e) := w(e) + r(v) - r(u)$ of every edge is non-negative. There is a proof that this preserves the output functionality. [C. E. Leiserson and J. B. Saxe, "Retiming Synchronous Circuitry," Algorithmica, Vol. 6, No. 1, pp. 5-35, 1991.]
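As a minimal sketch of this formulation, the function below applies a lag assignment to every edge and rejects it if any retimed weight would be negative. The three-vertex loop and its weights are hypothetical and are not taken from any figure; the dictionary representation also assumes at most one edge per vertex pair.

```python
def retime(edges, lag):
    """Apply w_r(e) = w(e) + r(v) - r(u) to every edge e = (u, v).

    edges: dict mapping (u, v) -> number of registers w(e)
    lag:   dict mapping vertex -> integer lag r(v)
    Raises ValueError if the lag assignment is not a legal retiming.
    """
    retimed = {}
    for (u, v), w in edges.items():
        w_r = w + lag[v] - lag[u]
        if w_r < 0:
            raise ValueError(f"edge {u}->{v} would carry {w_r} registers")
        retimed[(u, v)] = w_r
    return retimed

# Hypothetical loop A -> B -> C -> A with two registers on the edge C -> A.
edges = {("A", "B"): 0, ("B", "C"): 0, ("C", "A"): 2}

# r(A) = -1 moves one register from A's input edge to A's output edge.
retimed = retime(edges, {"A": -1, "B": 0, "C": 0})
print(retimed)
```

A lag of -1 on a vertex corresponds to that vertex "firing once": each incoming edge gives up a register and each outgoing edge gains one.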

†Schaumont

Figure 2.21 illustrates retiming using an example. The top data flow graph, Fig. 2.21a, illustrates the initial system. This graph has an iteration bound of 8. However, the actual data output period of Fig. 2.21a is 16 time units, because actors A, B, and C need to execute as a sequence. If we imagine actor A to fire once, then it will consume the tokens (delays) at its inputs and produce an output token. The resulting graph is shown in Fig. 2.21b. This time, the data output period has reduced to 11 time units. The reason is that actor A and the chain of actors B and C can each operate in parallel. The graph of Fig. 2.21b is functionally identical to the graph of Fig. 2.21a: it will produce the same stream of output samples when given the same stream of input samples. Finally, Fig. 2.21c shows the result of moving the delay across actor B to obtain yet another equivalent marking. This implementation is faster than the previous one: as a matter of fact, this implementation achieves the iteration bound of 8 time units per sample. No faster implementation exists for the given graph and the given set of actors.

Shifting the delay on the edge BC further would result in a delay on the outputs of actor C: one on the output queue, and one in the feedback loop. This final transformation illustrates an important property of retiming: it is not possible to increase the number of delays in a loop by means of retiming.
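The loop property follows from the retiming formula: around any cycle the lag terms cancel, so the total register count is unchanged. The sketch below checks this on a hypothetical loop A -> B -> C -> A with an extra output edge C -> OUT; the starting weights are assumptions chosen to mirror the narrative, not values read from Fig. 2.21.

```python
def retime(edges, lag):
    """Retimed weight of each edge (u, v): w + r(v) - r(u)."""
    return {(u, v): w + lag[v] - lag[u] for (u, v), w in edges.items()}

def loop_delays(edges, loop):
    """Total number of delays on the cycle visiting the vertices in `loop`."""
    return sum(edges[(loop[i], loop[(i + 1) % len(loop)])]
               for i in range(len(loop)))

# Hypothetical marking: one delay on A->B, one on C->A, none on the output.
edges = {("A", "B"): 1, ("B", "C"): 0, ("C", "A"): 1, ("C", "OUT"): 0}
loop = ["A", "B", "C"]

# Move the delay across B, then across C (lag -1 on the vertex being crossed).
across_b = retime(edges, {"A": 0, "B": -1, "C": 0, "OUT": 0})
across_c = retime(across_b, {"A": 0, "B": 0, "C": -1, "OUT": 0})

for name, g in [("start", edges), ("across B", across_b), ("across C", across_c)]:
    print(name, g, "loop delays:", loop_delays(g, loop))
```

After crossing C, a delay appears on both outputs of C (the feedback edge and the output edge), while the loop total stays the same.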

Pipelining

Pipelining:

Pipelining with SDF Graphs

by Example Figure 2.22

†Schaumont

Question asked in class: What was the purpose of the second pipeline insertion versus only one here? For example, if C were annotated with 3, would the second pipeline stage be more justifiable?

Pipelining to Meet Timing Requirements

Critical Path Timing Requirement:
$\rm{T_{CLK-to-Q} + T_{PD} + T_{setup} < T_{clk}}$
Q: What to do if timing requirement is not satisfied?
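The requirement can be checked numerically before deciding what to change. The sketch below computes the slack of a register-to-register path; all delay values are hypothetical nanosecond figures chosen for illustration, and splitting the logic into two equal halves is an idealization.

```python
def timing_slack(t_clk_to_q, t_pd, t_setup, t_clk):
    """Slack = T_clk - (T_CLK-to-Q + T_PD + T_setup); timing is met if > 0."""
    return t_clk - (t_clk_to_q + t_pd + t_setup)

# Hypothetical numbers (ns): 9 ns of combinational logic, 8 ns clock period.
single_stage = timing_slack(t_clk_to_q=0.5, t_pd=9.0, t_setup=0.5, t_clk=8.0)

# Assumed ideal split of the logic into two 4.5 ns pipeline stages.
two_stages = timing_slack(t_clk_to_q=0.5, t_pd=4.5, t_setup=0.5, t_clk=8.0)

print("single stage slack:", single_stage)
print("per-stage slack after pipelining:", two_stages)
```

A negative slack means the path violates the requirement; shortening the combinational delay per stage, as pipelining does, is one way to restore positive slack at the cost of added latency.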

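module calc(q, a, b, c, d, clk);
  output [31:0] q;
  input  [31:0] a, b, c, d; // operand width assumed to match q
  input clk;
  reg [31:0] q;
  always @(posedge clk) begin: bx
    reg [31:0] tmp1;
    reg [31:0] tmp2;
    tmp1 = a*b;       // blocking: combinational intermediate
    tmp2 = c*d;       // assumes the second product is meant to use c and d
    q <= tmp1*tmp2;   // all three multiplies sit in one clock period
  end
endmodule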

Pipeline

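module calc(q, a, b, c, d, clk);
  output [31:0] q;
  input  [31:0] a, b, c, d; // operand width assumed to match q
  input clk;
  reg [31:0] q;
  always @(posedge clk) begin: bx
    reg [31:0] tmp1;
    reg [31:0] tmp2;
    tmp1 <= a*b;      // pipeline register
    tmp2 <= c*d;      // pipeline register; assumes the second product uses c and d
    q <= tmp1*tmp2;   // uses the tmp values registered on the previous edge
  end
endmodule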
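A cycle-level Python model can make the effect of the nonblocking assignments concrete. This is a sketch only: it assumes 32-bit truncation and that the second product is c*d, and it uses `None` to stand for the unknown register value before the pipeline fills.

```python
MASK = 0xFFFFFFFF  # model 32-bit registers

def calc_unpipelined(samples):
    """Value latched into q at each clock edge when tmp1/tmp2 are combinational."""
    out = []
    for a, b, c, d in samples:
        tmp1 = (a * b) & MASK
        tmp2 = (c * d) & MASK
        out.append((tmp1 * tmp2) & MASK)
    return out

def calc_pipelined(samples):
    """Value latched into q at each clock edge when tmp1/tmp2 are registers."""
    tmp1 = tmp2 = None  # unknown before the first clock edge
    out = []
    for a, b, c, d in samples:
        # Nonblocking semantics: q sees the old tmp1/tmp2, then they update.
        q_next = None if tmp1 is None else (tmp1 * tmp2) & MASK
        tmp1, tmp2 = (a * b) & MASK, (c * d) & MASK
        out.append(q_next)
    return out

samples = [(1, 2, 3, 4), (2, 3, 4, 5), (3, 4, 5, 6)]
unpipelined = calc_unpipelined(samples)
pipelined = calc_pipelined(samples)
print(unpipelined)
print(pipelined)
```

The pipelined version produces the same sequence one clock later: the critical path now holds a single multiply per stage, and the price is one extra cycle of latency.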

Reducing Critical Path of a computation with Pipelining

Example:
Reducing the critical path in a moving average calculation:
$\rm{out}[n]=c_2 \times x[n-2] + c_1 \times x[n-1] + c_0 \times x[n]$
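As a rough sketch of what pipelining does to this calculation, the model below compares a direct evaluation with a version that registers each product before the adders. The register placement is an assumption made for illustration and is not necessarily the one shown in the book figures.

```python
def fir_direct(x, c):
    """out[n] = c2*x[n-2] + c1*x[n-1] + c0*x[n], with x[n] = 0 for n < 0."""
    c0, c1, c2 = c
    x1 = x2 = 0                      # delay line holding x[n-1], x[n-2]
    out = []
    for xn in x:
        out.append(c2 * x2 + c1 * x1 + c0 * xn)
        x2, x1 = x1, xn
    return out

def fir_pipelined(x, c):
    """Same filter with a register after each multiplier (one cycle later)."""
    c0, c1, c2 = c
    x1 = x2 = 0
    p0 = p1 = p2 = 0                 # pipeline registers holding the products
    out = []
    for xn in x:
        out.append(p2 + p1 + p0)     # adders only see registered products
        p0, p1, p2 = c0 * xn, c1 * x1, c2 * x2
        x2, x1 = x1, xn
    return out

x = [1, 2, 3, 4]
c = (1, 10, 100)                     # hypothetical coefficients c0, c1, c2
direct = fir_direct(x, c)
piped = fir_pipelined(x, c)
print(direct)
print(piped)
```

In the direct version the longest combinational path runs through a multiplier and two adders; with the product registers it is the longer of one multiplier or the two-adder chain, and the output stream is delayed by one sample.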

Fig. 2.24 †

Fig. 2.25 †

Fig. 2.26 †

Unfolding (parallelization)


https://en.wikipedia.org/wiki/Unfolding_(DSP_implementation)
Image Source: https://en.wikipedia.org/wiki/File:DSP_Folding_example.pdf
Author Jackyknight https://creativecommons.org/licenses/by-sa/3.0/deed.en

Unfolding Recipe

Sources: Wikipedia and http://www.eecs.yorku.ca/course_archive/2006-07/F/4210/Ch5_unfolding.pdf

To perform $J$-level unfolding:
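The standard recipe, as commonly stated in the cited sources, is: for each node U draw J copies U_0 .. U_{J-1}, and for each edge U -> V with w delays draw J edges U_i -> V_{(i+w) mod J}, each with floor((i+w)/J) delays. The sketch below implements that rule; the function name, data layout, and two-node example graph are hypothetical.

```python
def unfold(nodes, edges, J):
    """J-level unfolding of a data flow graph.

    nodes: list of node names
    edges: list of (u, v, w) with w delays on the edge u -> v
    Returns (unfolded_nodes, unfolded_edges) using names like "A0", "A1".
    """
    new_nodes = [f"{n}{i}" for n in nodes for i in range(J)]
    new_edges = []
    for u, v, w in edges:
        for i in range(J):
            # Copy i of u feeds copy (i + w) mod J of v with floor((i + w)/J) delays.
            new_edges.append((f"{u}{i}", f"{v}{(i + w) % J}", (i + w) // J))
    return new_nodes, new_edges

# Hypothetical loop: A -> B with no delay, B -> A with 3 delays, unfolded by 2.
nodes = ["A", "B"]
edges = [("A", "B", 0), ("B", "A", 3)]
unfolded_nodes, unfolded_edges = unfold(nodes, edges, J=2)
for e in unfolded_edges:
    print(e)
```

One useful check is that the J copies of an edge together carry exactly w delays, so unfolding preserves the total delay count of the original edge.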

Simple Unfolding Example


https://en.wikipedia.org/wiki/Unfolding_(DSP_implementation)
Image Source: https://en.wikipedia.org/wiki/File:Unfolding_algorithm_description.pdf
Author Jackyknight https://creativecommons.org/licenses/by-sa/3.0/deed.en
 

Example 2

†Schaumont

Unfolding Example 3


Image Source: https://en.wikipedia.org/wiki/File:Example_of_unfolding.pdf
Author Jackyknight https://creativecommons.org/licenses/by-sa/3.0/deed.en

Unfolding Critical Path and Retiming

Unfolding for Low Power

[Figure: one fast f(x) unit vs. two slow f(x) units]
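A back-of-the-envelope sketch of why two slow units can beat one fast unit uses the usual switching-power estimate P = C V^2 f. Every number below is hypothetical, and the sketch assumes the slow units can run at a reduced supply voltage because each has twice as long to finish; it ignores the extra capacitance of routing data to two units.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Switching power estimate P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency

C_UNIT = 1e-9    # hypothetical switched capacitance per f(x) unit, farads
F_FAST = 100e6   # hypothetical clock of the single fast unit, Hz
V_FAST = 1.0     # hypothetical supply for the fast unit, volts
V_SLOW = 0.7     # assumed reduced supply that still meets the relaxed timing

fast = dynamic_power(C_UNIT, V_FAST, F_FAST)
# Two units, each clocked at half the rate, deliver the same throughput.
slow_pair = 2 * dynamic_power(C_UNIT, V_SLOW, F_FAST / 2)
ratio = slow_pair / fast
print(f"power ratio (two slow / one fast): {ratio:.2f}")
```

Under these assumptions the frequency halving and the unit doubling cancel, so the saving comes entirely from the squared voltage term.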

Bit-level Parallel Processing


Image Source: https://en.wikipedia.org/wiki/File:Bit-level_unfolding.pdf
Author Jackyknight https://creativecommons.org/licenses/by-sa/3.0/deed.en