1 Introduction

In the framework of the research project Green Flash [1], we developed the work presented in this paper, aimed at efficiently implementing the Matrix-Vector Multiplication (MVM) on FPGA technology. As discussed in [2,3,4,5], in Adaptive Optics (AO) the effect of atmospheric turbulence is compensated using movable mirrors in the telescope, which are actuated according to a given real-time control algorithm. The dominating part of such an algorithm, see the technical annex of the project [1], is the execution of two MVMs, namely s*k = M vk and wk = R spolk.

In this paper we illustrate how we used QuickPlay [6], a High-Level Synthesis (HLS) design flow, to efficiently implement the MVM on FPGA. As one of the Level-2 BLAS functions [7], MVM is the basis for many algebraic computations and is fundamental in many application domains. We regard the presented work as a template for the methodology to be adopted when using HLS.

We start by describing the problem to be solved, together with the constraints that the challenges impose on the architecture to be implemented. Next, we present the formulation of the solution, explaining how parallelism should be exploited to obtain an efficient implementation. The implementation we propose uses four levels of parallelism: as the MVM is a collection of many independent scalar products, we introduce pipeline and spatial parallelism both at the coarse level (parallelization among different scalar products) and at the fine level (parallelization within the computation of one scalar product). A performance model is derived to quantify the performance achievable through the proposed implementation: this phase is crucial to validate the output of the HLS. When using HLS, it is essential to determine beforehand what can be achieved and to check, after synthesis, that the results produced by the automated process comply with expectations: without this modeling phase, we could only rely on comparisons with other implementations to (indirectly) evaluate the implementation produced by HLS. In this paper, the emphasis is mainly on evaluating the quality of the implementation derived from the HLS flow, not on assessing the superiority of one technology over another: discussing FPGA vs GPU is not the aim of this paper. For this reason, we put much effort into modeling the theoretically achievable performance, to have an absolute criterion against which to evaluate the quality of the FPGA implementation: the closer the measured performance is to the theoretical forecast, the better.

The document concludes with the presentation of the results, in terms of performance achieved in actual runs (GFlop/s) and resources used (LUTs, memory blocks, DSPs).

2 Related Work

Due to its relevance in many domains, the implementation of the MVM has been widely investigated, and in particular its efficient implementation on FPGA technology. In [8] the authors present a comparison of the implementation of the MVM (the gaxpy operation) on FPGA, GPU and CPU. They describe the FPGA implementation, organizing the internal dual-ported memories as V row banks which store the rows of the matrix; each of these banks is composed of B banks which store, in an interleaved way, the rows mapped into the row bank; thanks to this organization, at each clock cycle V × B elements can be read from and written to the memory. These elements can feed Q ≤ V pipelined modules, each computing a B-size scalar product. The work is further improved in [9], where the management of large external memory is added. In [10, 11] the FPGA implementation of the BLAS operations is discussed, with a special focus on the implementation of the reduction circuit needed for the accumulation involved in each BLAS operation. The authors in [12] report the FPGA implementation of the MVM and of the matrix-matrix product with a detailed analysis of the error propagation in the accumulation phase. Considering that the MVM problem is I/O bound and that there is no benefit in increasing the parallelism beyond the I/O saturation point, the authors propose to use some logic to implement group-alignment based floating-point summation [13], which increases the numerical accuracy of the computation. The FPGA implementation of the BLAS is reported in [14]. In this work, while relying on the OpenCL framework [15] for the actual FPGA implementation, the authors give a detailed performance model to drive the selection of the parameters determining the tradeoff between speed and resource usage. Using the selected parameters, code generators produce the OpenCL description of the optimized BLAS routine. The reader interested in the implementation of the MVM on GPU technology can refer to [16], which presents an analysis of the MVM implementation on GPU, together with a detailed performance model.

3 Problem Definition

The MVM is the basic operation of the Wavefront Reconstruction control algorithm; its usage is well known in the Adaptive Optics community, dates back to the late ’80s [17] and has since been improved many times [2]. In our implementation, using single-precision floating-point arithmetic, we have to multiply the two matrices M[Nmeans, Nrec] and R[Nrec, Nmeans] with the vectors vk[Nrec] and sk[Nmeans], with Nmeans = 9232 and Nrec = 6316.

Due to their size, M and R are stored in external memory. M and R do not change for quite a long time and must be multiplied many times by the vectors vk and sk; processing step (k + 1) can start only when the kth step has finished.

Once the bandwidth BW to access the external memory is fixed, an upper bound on the speed of the computation is determined. To perform the MVM, the matrix must be read from memory; referring to a generic matrix M[n, m] and indicating with D the floating-point data size expressed in bytes (D = 4 in single precision, D = 8 in double precision), the matrix size is Ms = nmD [Bytes] and the time to read the matrix from external memory is

$$ t_R = \frac{nmD}{BW} . $$
(1)

As the number of operations performed in the MVM is nops = 2 nm and the overall computing time cannot be smaller than tR, the computing speed SC cannot be larger than nops/tR i.e.,

$$ S_C \le \frac{n_{ops}}{t_R} = \frac{2nm}{nmD/BW} = \frac{2\,BW}{D} . $$
(2)

Using single-precision floating-point, D = 4, the speed can never be greater than half of the available memory BW.
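As a concrete check (using the board parameters introduced in Sect. 4: four HMC banks with a peak BW of 17 GB/s each, i.e., BW = 68 GB/s in aggregate), the bound of Eq. (2) evaluates to

$$ S_C \le \frac{2 \cdot 68\;[\mathrm{GB/s}]}{4\;[\mathrm{B}]} = 34\;[\mathrm{GFlop/s}] . $$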

In the following sections, we will analyze how the MVM should be implemented to be as close as possible to the previous limit.

4 Guidelines for Implementation: Exploiting Coarse-Grained Parallelism

The MVM b = M × a (M[n, m], a[m], b[n]) is the collection of n independent scalar products between m-sized vectors i.e.,

$$ b_i = \mathbf{m}_i \cdot \mathbf{a}, \quad i = 0,1,\ldots,n-1; \quad b_i \in \mathbb{R}, \; \mathbf{m}_i \in \mathbb{R}^m, \; \mathbf{a} \in \mathbb{R}^m . $$
(3)

Let us implement, in an optimized way, a kernel SP which performs a certain number of scalar products between one vector a and several vectors read from the external memory; if we have p external memory banks, we can partition M into p equal parts Mq, each containing n/p different matrix lines mq,i with q = 0, 1, …, p − 1 and i = 0, 1, …, n/p − 1 (each line is an m-sized vector), storing each Mq into a different memory bank. We instantiate p replicas of the SP scalar product kernel and distribute a copy of the a vector, to be read only once, to all the SP kernels. Each SP kernel computes a portion bq of the result vector b. The final vector is obtained by properly merging (i.e., concatenating) all the bq sub-vectors.

The degree of parallelism p is selected to make the BW requirement (nearly) equal to the BW available toward the external memory (BWExtMem); let us indicate with BWreq the memory bandwidth requested by one SP kernel (BWreq will be quantified in the following).

The memory bandwidth required by the p SP kernels is p × BWreq and must be large enough to saturate BWExtMem i.e., BWExtMem ≈ p × BWreq which gives

$$ p \approx BW_{ExtMem} / BW_{req} . $$
(4)
Fig. 1. Coarse-grained parallel architecture to implement the MVM

In the following, when giving numerical examples, we use the parameters characterizing the μXComp board, developed by MicroGate and equipped with an Intel ARRIA 10 GX1150 FPGA [18]. Referring to the previous example and to the four Hybrid Memory Cube (HMC) banks present in the μXComp board (each HMC bank has a peak BW of 17 GB/s), BWExtMem = 68 GB/s. In our implementation of the SP kernel BWreq = 19.2 GB/s, so the degree of parallelism that can be efficiently supported is given by Eq. (4), which yields p ≈ 4. Therefore, four SP kernels can be instantiated, each one accessing a different HMC bank.

5 The Scalar Product: Basic Pipelined Implementation

As a consequence of the discussion in the previous section, we recognize the scalar product as our coarse-grained unit of parallelism. The scalar product can be implemented with one pipelined MADD (one multiplier and one adder) which iteratively computes the recurrence

$$ s_{i+1} = a_i \times b_i + s_i \quad i = 0, \ldots, m-1, \; \text{with } s_0 = 0, \; a_i \in \mathbf{a}, \; b_i \in \mathbf{b} . $$

The computation of the next MADD operation is dependent on the completion of the previous operation, so a new MADD cannot start until the previous has finished, thus waiting for the latency L of the MADD.

To avoid paying this penalty, we can exploit the commutativity and associativity of the ADD operation (let us neglect the effects of the limited precision). Under the commutative and associative hypothesis for the ADD and assuming m to be an integer multiple of L, we can rewrite the scalar product as in the following

$$ s = \sum\nolimits_{i=0}^{m-1} \left( a_i \cdot b_i \right) = \sum\nolimits_{i=0}^{L-1} \left( \sum\nolimits_{j=0}^{\frac{m}{L}-1} a_{jL+i} \cdot b_{jL+i} \right) $$
(5)

where

  • vectors a and b have been partitioned into L sub-vectors ai and bi,

  • L partial scalar products are computed (expression in brackets) and finally

  • the result is derived by summing the L partial scalar products (external sum).

In the previous formulation, each partial scalar product has to be updated every L clock cycles; during its processing (requiring L cycles), the other L − 1 partial scalar products are processed, each one being at a different stage of the pipeline. Only the final (i.e., the external) sum requires the accumulation of values whose dependence cannot be completely hidden, thus imposing some pipeline penalty.
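To make the scheme concrete, the following minimal C sketch (ours, purely illustrative and not the QuickPlay code) keeps L independent partial sums so that consecutive MADD operations never depend on each other's result:

#include <stddef.h>

/* Scalar product with L independent partial accumulators: s[i] is
 * updated only every L iterations, so a pipelined MADD with latency L
 * never has to wait for its own previous result. m is assumed to be a
 * multiple of L, as in the text. */
float scalar_product(const float *a, const float *b, size_t m, size_t L)
{
    float s[L];                                   /* L partial scalar products */
    for (size_t i = 0; i < L; i++) s[i] = 0.0f;

    for (size_t j = 0; j < m / L; j++)            /* outer sum of Eq. (5)      */
        for (size_t i = 0; i < L; i++)            /* L in-flight accumulations */
            s[i] += a[j * L + i] * b[j * L + i];

    float sum = 0.0f;                             /* final (external) sum of   */
    for (size_t i = 0; i < L; i++) sum += s[i];   /* the L partial products    */
    return sum;
}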

Following the previous approach, we can compute the scalar product in Nclk clock cycles, as follows

$$ N_{clk} = (m-1) + L + O\!\left( L_A \cdot \log_2(L) \right) $$
(6)

where (m − 1) + L are the cycles needed to compute the m MADD operations and O(LA·log2(L)) are the cycles needed to perform the final sum of the L partial scalar products (LA being the latency of the pipelined add operator) using L/2 adders; if m ≫ L, Nclk ≈ m. In our case m ≫ L, so we compute the 2m operations required by the scalar product in Nclk ≈ m clock cycles, thus sustaining 2 FP operations per cycle. The sustained speed of the computation is SC = 2fck = 300 MFlop/s for fck = 150 MHz.

As seen in the previous section, to sustain the computation speed SC we must have a BW toward the memory which is at least twice the numerical value of SC (Eq. (2)). In this case, the memory BW required by the kernel would be BWreq = 2 · SC = 2 · 300 = 600 MB/s. Referring to the aggregate BW of the HMC memory we are using (68 GB/s), to saturate the memory BW we should put p = 68/0.6 ≈ 113 kernels in parallel, which would require 113 ports toward the external memory: this huge number of ports is not realistic, so we have to find a way to increase the computational speed of the kernel which performs the basic scalar product, so that the BWreq of a single kernel uses a significant portion of the available memory BW.

6 The Scalar Product: Exploiting Spatial Parallelism

To increase the computational speed and the BWreq of the kernel which computes the scalar product, we could further partition each of the L sub-vectors into P sub-vectors so that, at each cycle, we can start computing P independent partial scalar products.

Let us rewrite Eq. (5) as follows

$$ s = \sum\nolimits_{i=0}^{m-1} \left( a_i \cdot b_i \right) = \sum\nolimits_{i=0}^{L-1} \sum\nolimits_{j=0}^{P-1} \left( \sum\nolimits_{k=0}^{\frac{m}{LP}-1} a_{iP+j+kLP} \cdot b_{iP+j+kLP} \right) $$
(7)

where vectors a and b have been partitioned into LP sub-vectors, each with m/(LP) elements; the generic sub-vector vij is defined as

$$ \mathbf{v}_{ij} = \left\{ v_{iP+j+kLP} \;\middle|\; k = 0,1,\ldots,\frac{m}{LP}-1 \right\} \quad i = 0,1,\ldots,L-1 \quad j = 0,1,\ldots,P-1 . $$

Once a and b have been partitioned into the LP sub-vectors aij and bij, we compute the LP partial scalar products sij (the expression in brackets in (7)), then we sum all the LP partial values to obtain the final result.

Using P MADDs, if we can read 2P floating-point values per cycle, the number of cycles to determine the LP partial scalar products is given by

$$ N_{comp} = \left( \frac{m}{P} - 1 \right) + L . $$
(8)

In fact, after L clock cycles, the first P MADD results are produced; the remaining (m − P) MADD results are produced in the following (m − P)/P cycles, as P new results are produced at every cycle.

Once the N = LP sij values have been generated, they must be summed together to obtain the final scalar product.

As already discussed, we can use N/2 adders to perform the sum of N numbers in ⌈log2(N)⌉·LA clock cycles. If we use PA < N adders, within each layer we can distribute the sums among the PA adders. It is easy to verify that the number of cycles to compute the sum of N = LP numbers using PA pipelined adders is given by

$$ NCycles_{sum}(P_A) = \sum\nolimits_{i=1}^{\left\lceil \log_2(N) \right\rceil} \left( \left\lceil \frac{N}{2^i}\frac{1}{P_A} \right\rceil + L_A \right) \approx \frac{N}{P_A} + \left\lceil \log_2(N) \right\rceil L_A . $$
(9)

The number of cycles NCyclesSP necessary to compute the scalar product of two vectors of size m using P pipelined MADD modules, with latency L, and PA pipelined adders, with latency LA, is given by

$$ NCycles_{SP} = N_{comp} + NCycles_{sum}(P_A) . $$
(10)

From (8), (9) and (10) we get

$$ NCycles_{SP} \approx \frac{m}{P} + L + \frac{LP}{P_A} + \left\lceil \log_2(LP) \right\rceil L_A . $$
(11)

From the previous expression, we can compute the sustained speed of the computation (expressed in operations/cycle) as

$$ SustainedSpeed = \frac{2m}{\frac{m}{P} + L + \frac{LP}{P_A} + \left\lceil \log_2(LP) \right\rceil L_A} . $$
(12)

In the previous equation, L and LA are fixed by the technology (for instance, with the current version of QuickCompiler and for the ARRIA 10 FPGA, L = 8 and LA = 3), m is fixed by the problem, while P and PA are the parameters of the architecture that must be determined to maximize the sustained speed.

P must satisfy the following requirements:

  • it must be a power of 2, i.e., P = 2^k, because it determines the width of the internal memory used by the SP kernel (the width of the memory must be a power of 2),

  • it must be large enough to nearly saturate the memory BW.

In our example, fck = 150 MHz and the BW to one bank of the HMC memory is 17 GB/s. Thus, the width W to saturate the BW is given by

$$ W \cdot f_{ck} = BW \;[\mathrm{Byte/s}] \;\Rightarrow\; W = BW / f_{ck} \;[\mathrm{Byte}] $$

which gives W = 17000/150 ≈ 113 [Byte]. As W has to be a power of 2, we can set W = 128 [Byte] (the power of 2 closest to 113), thus fixing the MADD parallelism to P = 32 (the 32 MADDs must read 64 floats per cycle: 32 floats come from the buffer memory connected to the HMC, storing a row of the matrix M, and 32 floats come from the buffer memory connected to the input stream, storing the vector a, read only once at the very beginning).

When P = 32 and m = 8K elements, the number of cycles necessary to compute the LP partial products sij is (ref. to Eq. (8))

$$ NCycles_{comp} = \left( m/P \right) - 1 + L = \left( 8192/32 \right) - 1 + 8 = 263 . $$

If we set PA = 4 (adder parallelism), the number of cycles to sum all the partial results is (ref. to Eq. (9))

$$ NCycles_{sum}(P_A) \approx \frac{LP}{P_A} + \left\lceil \log_2(LP) \right\rceil L_A = \frac{8 \cdot 32}{4} + 8 \cdot 3 = 88 . $$

With the previous values, Eq. (12) gives a sustained speed of 2·8192/(263 + 88) ≈ 46.7 operations/cycle; as fck = 150 MHz, this figure corresponds to

$$ 46.7\;[\mathrm{ops/cycle}] \times 150\;[\mathrm{MHz}] = 7.0\;[\mathrm{GFlop/s}] . $$

7 MVM: Coarse-Grained Pipelining

In the operation b = M × a, the result vector b can be computed through the following loop

[figure a: loop computing the result vector b, one scalar product per row of M]

whose body can be decomposed into three basic operations:

[figure b: the three basic operations of the loop body — load of a matrix row, computation of the partial scalar products, sum of the partial results]

The loop can be repeated in different kernels when the matrix M is partitioned into p submatrices, as depicted in Fig. 1.

Regarding the time complexity (expressed in number of clock cycles), we can write the following relations

  • moving 4m bytes from the external memory, accessible through a port with W = 4P bytes, to the internal multi-ported memory requires the number of cycles

    $$ N_{mem} = \frac{m}{P} + L_m $$
    (13)

    as the internal memory can accept 4P bytes/cycle; Lm is the latency to access the external memory; if W·fck = BWreq > BWExtMem, the actual number of cycles will be larger than Nmem because the required bandwidth W·fck is larger than the available memory bandwidth;

  • the number of cycles required to compute the LP partial scalar products is given by Eq. (8);

  • the sum of LP values using PA floating-point adders (with latency LA) requires the number of clock cycles NSum(PA) given by Eq. (9).

As the iterations of the loop are independent, the loop can be pipelined, at a coarse grain, with three pipeline stages:

  • load vector mi,

  • compute the LP partial scalar products sij,

  • sum the LP sij.

The duration of each stage of this “macro-pipeline” is given by

$$ N_{PipeStage} = \max\left( N_{mem},\, N_{comp},\, N_{SUM}(P_A) \right) . $$
(14)

Since the loop is fully pipelined, n + 2 “macro-pipeline” stages are required to process n matrix lines and compute n scalar products. The number of cycles necessary to compute the whole MVM, using p equal SP kernels, is given by

$$ N_{Total} = \left( \frac{n}{p} + 2 \right) N_{PipeStage} . $$
(15)

The sustained speed (operations/cycle) is given by the ratio

$$ S = \frac{N_{operations}}{N_{Total}} = \frac{2nm}{\left( \frac{n}{p} + 2 \right) N_{PipeStage}} . $$
(16)

Let’s consider the case characterized by the following parameters:

  • m = n = 8192 (m: size of the vector, n: number of scalar products to be computed)

  • LA = 3, L = 8 and Lm = 200 cycles (latencies of FP adder, MADD and HMC)

  • P = 32 (spatial parallelism, i.e., number of MADD operations performed in parallel)

  • PA = 1 (1 adder is used to sum the LP partial scalar products)

  • p = 2 (kernel parallelism, i.e., number of equal kernels, each one performing the scalar product)

The previous values, when inserted in the expressions derived above, give:

  • Nmem = m/P + Lm = 456,

  • Ncomp = m/P + LMADD = 264,

  • $$ N_{SUM}(P_A) \approx \frac{LP}{P_A} + \left\lceil \log_2(LP) \right\rceil L_A = 280 . $$

So NPipeStage = 456 and the sustained speed, when fck = 150 MHz, is

$$ S = \frac{N_{operations}}{N_{Total}} = \frac{2nm}{\left( \frac{n}{p} + 2 \right) N_{PipeStage}} = 71.82 \left[ \frac{\mathrm{ops}}{\mathrm{cycle}} \right] = 10.8 \left[ \frac{\mathrm{GFlop}}{\mathrm{s}} \right] . $$
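For convenience (this is our own check, not part of the design), the following small C program evaluates the model of Eqs. (8), (9) and (13)–(16) with the parameters above and reproduces the figures just quoted:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Problem and architecture parameters of Sect. 7 */
    double n = 8192, m = 8192;        /* matrix size                       */
    double P = 32, PA = 1, p = 2;     /* MADD, adder and kernel parallelism */
    double L = 8, LA = 3, Lm = 200;   /* latencies: MADD, adder, HMC        */
    double fck = 150e6;               /* clock frequency [Hz]               */

    double Nmem  = m / P + Lm;                              /* Eq. (13)     */
    double Ncomp = m / P + L;                               /* as in the text */
    double Nsum  = (L * P) / PA + ceil(log2(L * P)) * LA;   /* Eq. (9)      */
    double Nstage = fmax(Nmem, fmax(Ncomp, Nsum));          /* Eq. (14)     */
    double Ntotal = (n / p + 2.0) * Nstage;                 /* Eq. (15)     */
    double ops_per_cycle = 2.0 * n * m / Ntotal;            /* Eq. (16)     */

    printf("Nmem=%.0f Ncomp=%.0f Nsum=%.0f -> NPipeStage=%.0f\n",
           Nmem, Ncomp, Nsum, Nstage);
    printf("Sustained speed: %.2f ops/cycle = %.2f GFlop/s\n",
           ops_per_cycle, ops_per_cycle * fck / 1e9);
    return 0;
}

Compiled with a standard C compiler (linking the math library), it prints Nmem = 456, Ncomp = 264, Nsum = 280, NPipeStage = 456 and a sustained speed of about 71.8 ops/cycle, i.e., about 10.8 GFlop/s.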

It is worth underlining that, when we ran on the μXComp board the test developed using the previous values, we measured an overall speed of 10.6 [GFlop/s], in close agreement with the performance foreseen by the model (see Table 1, reported in the section related to performance).

8 FPGA Implementation of the MVM Through the QuickPlay HLS

In this section, we analyze the actual FPGA implementation of the MVM algorithm, based on the considerations illustrated in the previous sections.

To achieve the FPGA implementation, we use the Accelize HLS framework (QuickPlay with its embedded QuickCompiler HLS engine [6], formerly produced by Accelize and shortly to be released as open-source software).

We refer to the architecture depicted in Fig. 1; in the following Fig. 2, we report the QuickPlay schematic representing that architecture in the case of p = 4 SP kernels.

Fig. 2. Top level of the design with p = 4 SP kernels, as shown in the QuickPlay VisualEditor

In the previous design, we can recognize the four VectorMatrixProduct kernels, each performing n/4 scalar products; they are connected to four different HMC memory banks. The first kernel, mySplit, is used to divide the input data coming from the input port into:

  a) the configuration part (8 bytes sent to the config_in – config_out chain to distribute the Id of the computing kernels) and

  b) the data part (the values of the matrix M to be stored in memory and the vector a to be multiplied with the matrix), which is sent to the 4 computing kernels through a streamCopy kernel.

The last kernel, BuildResultVector, concatenates the results produced by the four VectorMatrixProduct kernels, generating the result vector.

8.1 The Scalar Product

As seen in Sect. 6, the basic step to compute the scalar product between the lth row of the matrix (ml) and the input vector a is the following:

compute the LP values

$$ s_{ij} = \mathbf{m}_{l:ij} \cdot \mathbf{a}_{ij} = \sum\limits_{k=0}^{\frac{m}{LP}-1} \left( m_{l:iP+j+kLP} \cdot a_{iP+j+kLP} \right) \quad \begin{array}{l} i = 0,1,\ldots,L-1 \\ j = 0,1,\ldots,P-1 \end{array} $$

which requires the computation of LP partial scalar products. The basic operation to implement these scalar products is the vector multiply-and-add pipelined function, which takes as input P pairs of single-precision floating-point variables and produces P floating-point values (in our implementation P = 32), performing the computation

$$ c_i \mathrel{+}= a_i \times b_i ; \quad i = 1,2,\ldots,P . $$

The sketch of the QuickPlay C code to implement the vector pipelined MADD is the following

[figure c: sketch of the QuickPlay C code implementing the vector pipelined MADD]
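Since the listing is not reproduced in this extraction, the following hedged reconstruction shows what such a function might look like (function and variable names are ours, and the exact placement of the /*#qp pipeline*/ directive is an assumption):

#define P 32   /* spatial parallelism: 32 MADD lanes */

/* Vector MADD: c[i] += a[i] * b[i] for the P lanes. The #qp pipeline
 * directive asks QuickCompiler to synthesize a fully pipelined datapath
 * (32 fp multipliers + 32 fp adders, i.e., 2P = 64 flop per cycle). */
void MADD(const float a[P], const float b[P], float c[P])
{
    /*#qp pipeline*/
    for (int i = 0; i < P; i++)
        c[i] += a[i] * b[i];
}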

Thanks to the /*#qp pipeline*/ directive, the previous function is synthesized as a pipelined function which performs 2P = 64 floating-point operations per cycle (P additions and P multiplications).

From the synthesis reports of QuickCompiler we know that the previous function requires 7 cycles to produce its output, so LMADD = 7 cycles; we use L = 8 to include the cycle needed to read the data from memory. The MADD is implemented through the instantiation of 32 fp adders and 32 fp multipliers.

The MADD() function is used to compute the LP scalar products sij, i = 0, …, L − 1 and j = 0, 1, …, P − 1, through the following code:

[figure d: loop computing the LP partial scalar products sij]
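Again, the original listing is not reproduced here; the following simplified sketch (ours) conveys its structure. As described below, the real code uses the 256 scalar variables si_j and explicit reads of 64 scalars per iteration rather than arrays; the unroll directive is quoted from the text:

/* Compute the LP = 256 partial scalar products s_ij (L = 8, P = 32) of one
 * matrix row b[] with the vector a[]; both are FastMemory arrays. The element
 * index iP + j + kLP follows Eq. (7). */
void ComputePartials(const float a[8192], const float b[8192], float s[8][32])
{
    for (int i = 0; i < 8; i++)                   /* clear the partials        */
        for (int j = 0; j < 32; j++)
            s[i][j] = 0.0f;

    /*#qp unroll 32*/
    for (int k = 0; k < 8192 / (8 * 32); k++)     /* 32 iterations             */
        for (int i = 0; i < 8; i++)               /* L pipeline slots          */
            for (int j = 0; j < 32; j++)          /* P parallel MADD lanes     */
                s[i][j] += a[i * 32 + j + k * 256] * b[i * 32 + j + k * 256];
}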

In our example the vector size is m = 8192 and L × P = 256. The loop is executed m/(LP) = 8192/256 = 32 times, so the directive /*#qp unroll 32*/ unrolls the loop completely.

The scalar variables a1, …, b32 are read from the FastMemory a[] and the FastMemory b[] in one clock cycle.

8.2 FastMemory

The FastMemory modules are the memories used by QuickCompiler to map internal arrays. They are implemented on embedded RAM and are described by the tuple

$$ \mathrm{FastMemory} = \langle \mathrm{W,\, G,\, N,\, DType,\, Size} \rangle $$

  • W is the width of the wide “external” port;

  • G is the number of independent groups, each group being formed by N ports; usually G = 2 (as the embedded RAM modules are dual-ported);

  • N is the number of typed ports in each of the G groups; each port presents a datum of size DType;

  • DType is the size (in bytes) of the data type stored in the FastMemory (in QuickCompiler, each array is stored in a different FastMemory);

  • Size is the size of the memory, expressed in Bytes.

FastMemory has G × N + 1 ports.

The large external port, whose width is W = N · DType, is used to transfer data to/from streams or to/from external memories through the qpReadStream(), qpWriteStream() and memcpy() functions. The read/write bandwidth through this port is BW = W · fck [Byte/s]; typical values are W = 128 [B], fck = 150 MHz and BW = 19.2 GB/s. The latency to access external memories depends on the available memory controller; the HMC controller in the μXComp board is characterized by a latency Lm = 200 cycles.

The G × N “internal” ports, whose size is DType, are accessed by the kernel. The internal BW, between the FastMemory and the computing kernel, is G times the BW of the external port. The latency to read a datum from the FastMemory to the kernel is one cycle, while writing a datum from the kernel to the FastMemory is accomplished in the same cycle.

Since W ≥ DType, each group of ports allows accessing N = W/DType elements of an array in the same clock cycle. As the memory is organized in words of W bytes, when the first port of a group is used it selects the memory word being accessed and allows the other ports of its group to access the other array elements of that word.

The FastMemories a[m] and b[m] have been declared with the directive /*#qp ports 2 32*/, which specifies that the array, composed of m = 8K float elements, is stored in a memory which has G = 2 groups of N = 32 ports accessible in parallel, every port being four bytes wide (as they are of float data type). Both the a and b FastMemories are characterized by the tuple

$$ \langle \mathrm{W} = 128,\; \mathrm{G} = 2,\; \mathrm{N} = 32,\; \mathrm{DType} = 4,\; \mathrm{Size} = 32768 \rangle . $$

This means that up to 64 floats can be read/written in parallel in one clock cycle.
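For illustration, the declarations might look as follows (the directive string is quoted from the text; its placement next to the declaration is an assumption):

/* m = 8192 floats each; QuickCompiler maps every array onto its own
 * FastMemory with G = 2 groups of N = 32 four-byte ports, so up to
 * 64 floats can be accessed per clock cycle. */
float a[8192]; /*#qp ports 2 32*/
float b[8192]; /*#qp ports 2 32*/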

In one iteration of the loop, the LP sij values are updated; values sij are mapped onto the variables si_j (i = 0, .., 7 and j = 0, .., 31).

The previous loop code is scheduled by the QuickCompiler HLS engine as described in Sect. 6, with the performance given by Eq. (8).

Looking at the QuickCompiler timing report, we see that the execution of the module implementing the previous code requires 264 clock cycles, in perfect agreement with the formula derived from the analysis, Ncomp = m/P + LMADD.

After having computed the LP sij values, we must sum them together to obtain the result i.e., we must implement the expression

$$ b_l = \sum\limits_{i=0}^{L-1} \sum\limits_{j=0}^{P-1} s_{ij} . $$

The previous formula is computed very simply through the following (not pipelined) function,

[figure g: (not pipelined) function summing the LP partial results]
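A hedged sketch of such a function follows (written here over a two-dimensional array for brevity; the actual code operates on the individual si_j scalar variables):

/* Sum the LP = 256 partial scalar products into one element b_l of the
 * result vector; QuickCompiler schedules this reduction on one fp adder. */
float SumPartials(float s[8][32])
{
    float bl = 0.0f;
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 32; j++)
            bl += s[i][j];
    return bl;
}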

which is scheduled by QuickCompiler on one fp adder and requires 263 clock cycles to execute, slightly better than the simplified model presented in Sect. 6, Eq. (9), which foresaw 280 clock cycles (our simplified model neglects the possibility of starting the computation of a new layer of sums in the adder tree before the previous layer has terminated).

Putting things together, the scalar product of two vectors a and b, each containing m = 8192 floating-point values and stored in two dedicated FastMemory modules (each module having G = 2 groups of N = 32 ports, 4 bytes wide), requires 264 + 263 = 527 clock cycles. This figure corresponds to (nearly) 31 [flop/cycle] which, for a clock frequency fck = 150 [MHz], gives the sustained speed S = 31 · 150 = 4650 [MFlop/s].

8.3 MVM with a Coarse-Grained Pipeline

We use the scalar product module just described as the basic block to perform the MVM; the pseudo-code for the MVM is the following:

[figure e: pseudo-code of the MVM loop built on the scalar product module]
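The listing is not reproduced here; the following sketch (ours, reusing the ComputePartials and SumPartials sketches above; the qpStream_t type name, the hmc_bank pointer and the stream-function signatures are illustrative only) captures the structure of the non-pipelined loop:

/* Non-pipelined MVM: the vector a is read once from the input stream,
 * then each row of M goes through load, compute and sum back to back. */
void MatrixVectorProduct(qpStream_t *in, qpStream_t *out,
                         const float *hmc_bank, int n, int m)
{
    static float a[8192], mi[8192], b[8192], s[8][32];

    qpReadStream(in, a, m * sizeof(float));                      /* a, once */
    for (int i = 0; i < n; i++) {
        memcpy(mi, &hmc_bank[(size_t)i * m], m * sizeof(float)); /* N_mem   */
        ComputePartials(mi, a, s);                               /* N_comp  */
        b[i] = SumPartials(s);                                   /* N_sum   */
    }
    qpWriteStream(out, b, n * sizeof(float));                    /* results */
}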

While the loading of the a vector is negligible, as it is performed only once before starting the actual computation loop, the load of the mi vector is relevant because it lasts Nmem = Lmem + m·D/W cycles, where

  • Lmem the latency to access the external memory (in our case Lmem ≈ 200)

  • W the width of the “external” port of the FastMemory (W = 128 [Byte])

In our case (n = m = 8192, W = 128, LMADD = 8, LA = 3, PA = 1)

  • Nmem = 456

  • Ncomp = 264

  • Nsum = 263

and the global number of cycles necessary to compute b = M × a is given by

$$ N_{seq} = n \cdot \left( N_{mem} + N_{comp} + N_{sum} \right) . $$
(17)

As the number of floating-point operations to compute the MVM is Nflop = 2 nm, the speed expressed in number of operations per cycle is given by

$$ S_{ops/cycle} = \frac{2nm}{{n\left( {N_{mem} + N_{comp} + N_{sum} } \right)}} \approx 16.7 \left[ {\frac{ops}{cycle}} \right]. $$

Considering that each iteration of the computing loop is independent of the others, it is natural to adopt a pipelined scheme that overlaps the three operations (Fig. 3):

Fig. 3. Gantt for the pipelined execution of the MVM

The computation, arranged according to the previous scheduling, can be executed in Npipe cycles

$$ N_{pipe} = n \cdot N_{mem} + N_{comp} + N_{sum} . $$
(18)

The speed-up of the pipelined implementation, with respect to the non-pipelined implementation, is given by

$$ S = \frac{{N_{seq} }}{{N_{pipe} }} = \frac{{8.1 \times 10^{6} }}{{3.7 \times 10^{6} }} = 2.2 $$

from which we can derive the expected speed for the pipelined implementation, in ops/cycle, through the following expression

$$ S_{ops/cycle} \left( {pipe} \right) = S_{ops/cycle} \left( {seq} \right) *S = 16.7 *2.16 = 36.01 \left[ {\frac{ops}{cycle}} \right]. $$

The speed, in flop/s, is obtained as in the following

$$ S_{flop/s} = S_{ops/cycle} \cdot f_{ck} = 36.01 \cdot 150 = 5411 \left[ \mathrm{MFlop/s} \right] . $$

The scheduling described in Fig. 3 is enforced by the QuickPlay HLS when compiling the following code:

[figure f: QuickPlay C code of the coarse-grained pipelined MVM, with preamble, steady-state loop and postamble]
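As before, the actual listing is not reproduced; the following self-contained C sketch (ours, with toy sizes, n a multiple of 3 and illustrative helper functions) reproduces only the structure the text describes below — three row buffers and the preamble / steady-state / postamble organization:

#include <string.h>

enum { N = 12, M = 256, LP = 32 };            /* toy sizes, N multiple of 3 */

static float Mtx[N][M], a[M], b[N];           /* matrix, input and result   */

static void  load_row(float *dst, int i) { memcpy(dst, Mtx[i], sizeof Mtx[i]); }
static void  compute(const float *row, float s[LP]) {     /* partial products */
    for (int k = 0; k < LP; k++) s[k] = 0.0f;
    for (int k = 0; k < M;  k++) s[k % LP] += row[k] * a[k];
}
static float sum_p(const float s[LP]) {                    /* reduce partials  */
    float r = 0.0f;
    for (int k = 0; k < LP; k++) r += s[k];
    return r;
}
static void  write_out(int i, float r) { b[i] = r; }       /* emit one result  */

void mvm_pipelined(void)
{
    float mr[3][M], s[3][LP], r0, r1, r2;

    /* preamble: fill the pipeline (3 loads, 2 computes, 1 sum) */
    load_row(mr[0], 0);
    load_row(mr[1], 1); compute(mr[0], s[0]);
    load_row(mr[2], 2); compute(mr[1], s[1]); r0 = sum_p(s[0]);

    /* steady state: 3 loads, 3 computes, 3 sums, 3 writes per iteration */
    for (int i = 3; i < N; i += 3) {
        load_row(mr[0], i);     compute(mr[2], s[2]); r1 = sum_p(s[1]); write_out(i - 3, r0);
        load_row(mr[1], i + 1); compute(mr[0], s[0]); r2 = sum_p(s[2]); write_out(i - 2, r1);
        load_row(mr[2], i + 2); compute(mr[1], s[1]); r0 = sum_p(s[0]); write_out(i - 1, r2);
    }

    /* postamble: drain the pipeline (1 compute, 2 sums, 3 writes) */
    compute(mr[2], s[2]); r1 = sum_p(s[1]); write_out(N - 3, r0);
                          r2 = sum_p(s[2]); write_out(N - 2, r1);
                                            write_out(N - 1, r2);
}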

In the previous code we can recognize three sections:

  • the preamble, which fills the pipeline; in this section we find the read (once and for all) of the a vector, the read of the first three rows of M, two computations of the partial scalar results and one sum operation;

  • the loop, which implements the steady state of the pipelined behavior; in this section we find three reads of rows of M, three computations of partial scalar results, three sum operations and the write to the output of three results, i.e., the manual unroll of the complete processing of three rows of M;

  • the postamble, which empties the pipeline (no more matrix rows are read) and finishes the processing of the last three rows. It is the dual of the preamble: we have no reads, one computation of the partial scalar products, two sum operations and three writes of the results.

To ensure the parallel execution of the different functions accessing the same array, we used three different buffers to store the rows of matrix M.

The QuickPlay project which instantiates all four available HMC memory modules, each connected to one MatrixVectorProduct compute kernel, is reported in Fig. 2.

Both the input and output ports have been mapped onto a PCIe interface.

The PCIe IP, the HMC controller IP, the clock and reset generator IP, as well as the copy IP and the FIFO IP, are all part of the QuickPlay distribution and are instantiated by the tool either transparently (clock & reset generator, FIFO) or based on the configuration derived from the Visual Editor. The computing kernels are generated by QuickCompiler, the HLS engine of QuickPlay.

9 Performance Results

To show the performance achieved, in terms of both speed and resource usage, we report, for the different designs developed (with 1, 2, 3 and 4 kernels, each performing the MVM on a portion of the matrix M):

  • the sustained speed [GFlop/s] measured on actual runs on the MicroGate board (equipped with one ARRIA 10 FPGA and 4 HMC memory banks),

  • the resources used (ALM - Adaptive Logic Modules, M20K memory blocks).

Table 1. Results when implementing the MVM with 1, 2, 3 and 4 SP kernels

The design shows nearly linear scaling of the computational performance.

To understand how resources are used, we report, for the largest design using four equal MatrixVectorProduct kernels, the percentage of the resources (ALM, M20K) used to implement:

  • the PCIe interfacing IP: ALM 2.7%, M20K 0.7%

  • the MatrixVectorProduct kernels: ALM 5.3%, M20K 8.1% each kernel

  • the HMC memory controllers: ALM 7.0%, M20K 5.7% each controller

  • the other auxiliary modules (reset and clock generators, FIFOs, mySplit and BuildResultVector modules, …): ALM 18.6%, M20K 22.9%

When the FPGA board was configured with the design using four SP kernels, the power consumption of the board was 40 W, resulting in an energy efficiency of 0.53 GFlop/s/W.

Even though we think that comparison with other implementations is a weak way to evaluate an implementation, we report an alternative realization of the MVM to verify that the proposed solution is aligned with what the current technology allows.

The work presented in [14] reports the implementation of several BLAS routines, including the MVM. The performance of this routine is reported for a 1024 × 1024 matrix stored within the internal RAM, thus not requiring any communication with the DDR banks; with the vectorization width set to 64 (i.e., performing 64 multiply operations in parallel), a computing speed greater than 20 GFlop/s is reported (both in single and in double precision). While giving an idea of the performance achievable by the hardware in the FPGA, such a figure would require a significantly large I/O BW to be sustained for larger matrices (such as the 8K × 8K matrices used in our case): the proper buffering and macro-pipelining of the computation needed to sustain the traffic with the DDR memory is not addressed in [14], as this is not the core of the FBLAS implementation.

10 Future Developments

Looking at the Gantt reported in Fig. 3, we see that the transfer of one line of matrix M from memory lasts longer (456 cycles) than the computation of the partial scalar products (264 cycles) and the final sum (263 cycles). This happens because the QuickPlay HLS does not support outstanding read operations, which would allow overlapping different memory transfers. If we could use outstanding memory reads, the latency of the next transfer could be overlapped with the actual data transfer of the current one, as shown in Fig. 4.

In Fig. 4, we decomposed the time to transfer data from the HMC to the kernel into the latency L(mi) and the actual data transfer. It is easy to verify that the number of cycles needed to perform the computation shown in Fig. 4 is given by

Fig. 4. Pipelined implementation with support for outstanding read operations

$$ N_{pipe} = n \cdot \max\left( L(m),\, N_{mem},\, N_{comp},\, N_{sum} \right) + L(m) + N_{mem} + N_{comp} + N_{sum} . $$

In our case (n = m = 8192) the values are L(m) = 200, Nmem = 256, Ncomp = 264 and Nsum = 263, which yield

$$ N_{pipe} = 264\,n + 983 . $$

This value corresponds to

$$ S_{ops/cycle} \left( {pipe} \right) = \frac{2nm}{264n + 983} \approx 62\left[ {\frac{ops}{cycle}} \right] $$

i.e., 9.3 GFlop/s when fck = 150 MHz, very close to the limit imposed by the memory BW.

11 Conclusions

We have described the activities performed to implement the MVM on FPGA through the QuickPlay HLS flow.

We started by formalizing the problem, describing how parallelism is a key factor in achieving the expected performance, and we showed how parallelism can be introduced at four different levels:

  • (spatial) parallelization over the different rows of the matrix, computing in parallel the scalar products between the input vector and p different rows of the matrix M

  • parallelization (pipelining) of the basic scalar product, achieved thanks to the introduction of L different independent partial scalar products to break the data dependence characterizing the classical accumulation scheme (L is the latency of the basic Multiply and Add pipelined operation)

  • parallelization derived from the iteration of the previous decomposition, dividing each of the L sub-vectors into P smaller sub-vectors, thus performing in parallel P pipelined partial scalar products

  • coarse-grained pipelining, overlapping different phases of successive scalar products when multiplying the input vector by different rows of the matrix M (read from external memory, computation of the partial scalar products, sum of the partial results).

Some models to compute the expected performance of the implemented algorithm have been presented and discussed. We found good agreement between the forecast and the actual performance. This agreement demonstrates the good quality of the hardware generated by the HLS engine.