Pipelining for FPGAs

Edouard Tavinor
6 min readMar 30, 2017

I thought I’d write an explanation of pipelining for FPGAs and why it’s so important.

First of all, what is pipelining? An example. Let’s say you have two channels of 64 bit numbers coming into your component and you want to add them up and output them. Let’s assume you’re going to trust the synthesis tools and write minimal code. So you write this in VHDL code:

entity my_adder is
Port (
clk : in std_logic;
data1 : in std_logic_vector(63 downto 0);
data2 : in std_logic_vector(63 downto 0);
in_enabled : in std_logic;
data_out : out std_logic_vector(64 downto 0);
out_enabled : out std_logic);
end entity;
architecture foo of my_adder is
begin
process(clk)
begin
if (rising_edge(clk)) then
data_out <= std_logic_vector(unsigned("0" & data1) + unsigned("0" + data2)); -- result one bit bigger to account for overflows
out_enabled <= in_enabled;
end process;
end architecture;

and you get the right answer. The problem is, the synthesis software promises to do this in one clock cycle. That means, that if you want to have a clock cycle of 100 MHz for your design, then the synthesis software will say “this isn’t possible. If you expect me to do that addition in one clock cycle then you’re going to have to scale the frequency back a lot”.

So what can you do? With addition, you can break the calculation up into lots of little chunks and do these chunks simultaneously. You can work out the addition in 16 bit chunks and then combine these chunks to get the result.

You can work out these 4 16 bit chunks simultaneously. However, you haven’t yet got the correct answer, because one of the chunksums (or even all of them) could overflow, and we don’t know if a chunksum will overflow, until we’ve checked if the chunksum before it overflows. So let’s say we use the first clock cycle to add up these 4 16 bit chunks. At the end of the first clock cycle, the bottom 16 bits of the answer will be correct. Then in the second clock cycle, we can, if necessary add one to the chunksum the second from the end. Now we know the bottom 32 bits of the answer. This will take place while the 4 16 bit chunks are being added for the inputs which arrived one clock cycle later. Then in the next clock cycle we can do the same for the chunk second from the beginning, and finally in the fourth clock cycle we’ll be able to output the correct result.

This is the basic idea of a pipeline. Data arrives on the input channels every clock cycle. By dividing our program into four parts, every part can be running at the same time, but on different data. The data that arrived on the first clock cycle progresses down the pipeline. One step behind it is the data from the following clock cycle. One step infront of it is data from the previous clock cycle.

To do this, we need to have enough space reserved so that all of the data needed can exist without being overwritten before it has been outputted.

So let’s declare an array of 4 records. Each record should be large enough to contain all the state we need. We’ll call that our pipeline. All the state we have for an addition is the 2 inputs, 4 chunksums, space for the data_enabled signal, and space to assemble the result. This results in a structure as follows:

constant NUM_CHUNKS : integer := 4;
constant PIPELINE_LENGTH : integer := 4;
type chunksums_type is array(NUM_CHUNKS-1 downto 0) of unsigned(16 downto 0);
type state_type is record
data1 : std_logic_vector(63 downto 0);
data2 : std_logic_vector(63 downto 0);
chunksums : chunksums_type;
result : unsigned(64 downto 0);
data_enabled : std_logic;
end record state_type;
type pipeline_type is array(PIPELINE_LENGTH-1 downto 0) of state_type;
signal pipeline : pipeline_type;

Now, for each clock cycle, we perform an operation on each element of the pipeline and then store the result one slot up in the pipeline.

Well, that’s a general method that would certainly work. However, it would take up a lot of space. We can reduce the storage necessary by noticing that after the chunksums are calculated, we don’t need to store data1 and data2, so instead, when new data arrives, we’ll store the chunksums directly.

Another thing to note is that we don’t actually need all the chunksums all the time. The first chunksum can be placed directly in the result, so we don’t need to store it anywhere else. Also, the second chunksum can be discarded after the second clock cycle, because it’s already been inserted into the result and we won’t be using it again. We only need space to store it for one clock cycle. Likewise the third chunksum can be discarded after the third clock cycle. The fourth chunksum can be discarded after the fourth clock cycle, because in this clock cycle it will be inserted into the result.

This results in the following architecture:

architecture foo of my_adder is
type pipeline_type is array(3 downto 0) of unsigned(64 downto 0);
signal pipeline : pipeline_type := (others => (others => '0'));
signal part10, part20, part21, part30, part31, part32 : unsigned(16 downto 0) := (others => '0');
signal out_en_buffer : std_logic_vector(3 downto 0) := (others => '0');
begin
process(clk)
begin
if (rising_edge(clk)) then
out_enabled <= out_en_buffer(3);
data_out <= pipeline(3);
end if;
end process;
process(clk)
begin
if (rising_edge(clk)) then
part10 <= unsigned("0" & data1(31 downto 16)) + unsigned("0" & data2(31 downto 16));
part20 <= unsigned("0" & data1(47 downto 32)) + unsigned("0" & data2(47 downto 32));
part30 <= unsigned("0" & data1(63 downto 48)) + unsigned("0" & data2(63 downto 48));
part21 <= part20;
part31 <= part20;
part32 <= part31;
pipeline(0)(16 downto 0) <= unsigned("0" & data1(15 downto 0)) + unsigned("0" & data2(15 downto 0));
pipeline(1)(15 downto 0) <= pipeline(0)(15 downto 0);
pipeline(1)(32 downto 16) <= pipeline(0)(32 downto 16) + part10;
pipeline(2)(31 downto 0) <= pipeline(1)(31 downto 0);
pipeline(2)(48 downto 32) <= pipeline(1)(48 downto 32) + part21;
pipeline(3)(47 downto 0) <= pipeline(2)(47 downto 0);
pipeline(3)(64 downto 48) <= pipeline(2)(64 downto 48) + part32;
out_en_buffer <= out_en_buffer(2 downto 0) & in_enabled;
end if;
end process;
end architecture;

Now, no single addition of more than 17 bits needs to be done in any clock cycle. You have to wait 4 clock cycles to get the result of the first addition, but then the results come every clock cycle.

I hope this will come in useful for somebody :)

In case anybody’s interested, I calculated the timing for the two different designs by adding a clock constraint, generating a timing report under Synthesis and then looking at the unconstrained paths. Here are the worst timing paths for the two different designs (I reduced the sizes of data_in to 32 bits, because my FPGA doesn’t have enough input and output pins for 64 bit inputs):

Without pipelining
With pipelining: much shorter :)

And here’s the worst internal path from one clock cycle to the next for the pipelined design (the design without pipelining has no internal paths from one clock cycle to the next). This shows that we could have run the internal paths at almost 500 MHz :)

Worst internal path from one clock to the next.

One thing that you could do to make this even more performant is add a buffer for data1. If you do this, the constraints on the input signals are relaxed, because they only have to be stable for a much shorter time.

--

--