Notes on SPMD architecture IV

Data transferences.

Gentlemen, let’s go row!
Robert A. Heinlein



This post describes the different types of Data transferences required for the Architecture model.



--------------------

As stated in the previous post "Requirements", the types of transferences required by the Architecture model are Data distribution, Data redistribution and Data collection.


Irrespective of the type of transference, no data loss can be tolerated. Whatever the input and output data units are, they are transferred in blocks, and data blocks are stored in memory buffers. As soon as a source node has a block ready to send and the corresponding destination node has an input buffer ready to receive it, the transfer can be performed. Both end-points must be ready before the transfer starts, which establishes a flow control mechanism at data block level.
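As a minimal sketch of this rule (the names are illustrative only, not part of the architecture model):

        /* Block-level flow control: a transfer may only start when the source
           holds a complete block and the destination has a free input buffer,
           so no data block can ever be lost. */
        int transfer_can_start(int src_block_ready, int dst_input_buffer_free)
        {
            return src_block_ready && dst_input_buffer_free;
        }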


To allow the input and the output of a node to work independently, both must have their own memory buffers. In the following figures, output buffers are shown in blue and input buffers in green. In addition, the buffers are identified by numbers: the number indicates the order in which the data block stored in the buffer arrived at the input node, block 1 being the first and block N the last. Numberless buffers are empty buffers.


In order to preserve at the output the order of the input data, consistency between data distribution and data collection has to be maintained. To achieve this, the input node distributes the blocks to the processing nodes on a rotatory basis and the output node collects the blocks in the same way. This means that, for the data distribution, the sequence of destination nodes is [np1, np2, ..., npN, np1, ...] and, analogously, the sequence of source nodes for the data collection is the same. So the identifier of a processing node indicates its position in the processing nodes’ set when receiving data blocks from the input node and when transmitting them to the output node.
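A minimal sketch of this rotatory rule (the helper name is hypothetical): the processing node associated with the k-th block arriving at the input node, and with the k-th block collected by the output node, can be computed as

        /* Round-robin selection: block 1 goes to np1, block 2 to np2, ...,
           block N to npN, block N+1 to np1 again, and so on. The output node
           uses the same rule to pick the source of the k-th collected block. */
        int processing_node_for_block(int k, int N)
        {
            return ((k - 1) % N) + 1;
        }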


The next figure depicts the distribution and collection of data blocks, showing the internals of the simple topology of Figure 2-1 in terms of the nodes’ input and output memory buffers. The situation considered is the following:
  • The input node has distributed, in order, the data blocks 1 to N to the processing nodes 1 to N. Block 1 was the first to arrive at the input node and block N the last.
  • Processing node 1 has just finished transferring the results of processing data block 1 to the output node. It has already freed its input buffer but has not yet freed its output buffer.
  • The transfer of data block N+1 is ready to start, as the block is ready in the input node and an input memory buffer is available in processing node 1.


We will consider that the data distribution transfers are started by the input node and, symmetrically, the data collection transfers are started by the output node.

Figure 4-1

As previously stated in the post "Requirements" of this series, the Data redistribution capability is required at least to the extent of covering the distributed matrix transposition needed for 2-D processing on distributed memory.

In this case, for the sake of simplicity, we will assume that the input, output and intermediate matrices all have the same dimensions, mxn, and that m/N = p and n/N = q, where m and n are the number of rows and columns of the matrices, N is the number of processing nodes, and p and q are integers.
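In other words, both dimensions are exact multiples of the number of processing nodes. A small worked case (the numbers are illustrative only):

        m = N\,p, \qquad n = N\,q, \qquad p,\, q \in \mathbb{N}
        \text{e.g. } N = 4,\ m = 8,\ n = 12 \;\Rightarrow\; p = 2,\ q = 3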


The input matrix has to be distributed among the processing nodes: a data block containing a pxn submatrix is transferred from the input node to each of them. Each processing node performs 1-D processing over the rows of its submatrix (horizontal processing). At some point, a transposition of the distributed matrix (Corner Turn) is performed; as a result, an nxm matrix is distributed among the processing nodes, each of which stores a qxm submatrix. The processing nodes then perform 1-D processing over the rows of their submatrices (vertical processing) and, after another Corner Turn, an mxn matrix is again distributed among the processing nodes. This four-step sequence is repeated as many times as the algorithm requires. Finally, a data block containing a pxn submatrix is transferred from each processing node to the output node.
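A sketch of this per-node cycle, written as C-style pseudocode in the spirit of the piece of code shown later in this post (the steps are placeholders, not part of the model):

        /* Outline of the 2-D processing cycle on one processing node. */
        void process_2d(int iterations)
        {
            /* receive a pxn submatrix from the input node (data distribution) */
            for (int it = 0; it < iterations; it++)
            {
                /* 1-D processing over the p rows of length n (horizontal processing) */
                /* Corner Turn: the node now holds a qxm submatrix                    */
                /* 1-D processing over the q rows of length m (vertical processing)   */
                /* Corner Turn: the node holds a pxn submatrix again                  */
            }
            /* send the pxn submatrix of results to the output node (data collection) */
        }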


Referring to the simple algorithm depicted in Figure 1-2 of the post "Introduction" of this series, note that the matrix transpositions that will become Corner Turns have to take place outside the sections.


The following figure illustrates the distributed matrix transposition operation, showing the contents of the processing nodes’ input and output buffers. The matrix A, an mxn matrix split into N pxn sub-matrices, is stored in the output buffers, and its transpose, the matrix B, an nxm matrix split into N qxm sub-matrices, is stored in the input buffers.

Note that all the processing nodes have to be interconnected and that the source node set and destination node set are the same, but the source and destination buffers are different.

Figure 4-2

On the other hand, the chosen data distribution and collection mechanisms ensure not only that the order of the input data blocks is maintained at the output, but also the consistent construction of the distributed matrix.


The situation depicted in the previous figure can be described in algebraic terms as shown in the following figure. As can be seen, the matrices A and B are distributed among the output and input buffers, respectively, of the processing nodes. The output buffers hold the submatrices A1, A2, ..., AN and the input buffers hold the submatrices B1, B2, ..., BN. The submatrix Ai consists of the submatrices Ai1, Ai2, ..., AiN, and the submatrix Bi consists of the submatrices AT1i, AT2i, ..., ATNi. Note that the subscripts of the submatrix Aij indicate not only its position in the matrix A, but also the source and destination nodes.
Figure 4-3
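In LaTeX notation, the block structure of Figure 4-3 can be sketched as follows, using the index convention described above:

        A = \begin{pmatrix} A_1 \\ \vdots \\ A_N \end{pmatrix},\qquad
        A_i = \begin{pmatrix} A_{i1} & \cdots & A_{iN} \end{pmatrix},\qquad
        B = A^T = \begin{pmatrix} B_1 \\ \vdots \\ B_N \end{pmatrix},\qquad
        B_i = \begin{pmatrix} A_{1i}^T & \cdots & A_{Ni}^T \end{pmatrix}

        \text{with } A_i \text{ of size } p \times n,\ A_{ij} \text{ of size } p \times q,\ \text{and } B_i \text{ of size } q \times m.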

The distributed matrix transposition is performed by every source node npi (i=1,..,N) transferring the submatrix ATij to every destination node npj (j=1,..,N). Alternatively, it can be done by transferring the submatrix Aij and performing the transposition in the destination node. The hardware characteristics of the nodes will determine in each case whether one option is better than the other. Since both options are analogous, for the sake of brevity we will consider only the first one. Note that every source node transfers exactly one submatrix to every destination node.

An arbitration mechanism among the source nodes, based on the rule “first come, first served”, is used in the first place. Conflicts arising from simultaneous requests, if any, are then resolved using a sequential rotatory mechanism (np1, np2, ..., npN, np1, ...).

Once selected, every source node npi (i=1,..,N) performs data distribution transfers, started by the source node, to all the destination nodes np1, np2, ..., npN. The transfers begin as soon as an output buffer is ready, but the input buffers only become completely filled once the transfers of the last source node have been performed.
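A sketch of this step, as seen from one source node once the arbitration mechanism has granted it access (C-style pseudocode, following the index convention of the figures):

        /* Corner Turn transfers performed by source node np_i after it has been
           selected by the arbitration mechanism: exactly one submatrix is sent
           to each destination node. */
        void corner_turn_source(int i, int N)
        {
            for (int j = 1; j <= N; j++)
            {
                /* transfer AT_ij (or A_ij, transposing at the destination) from
                   the output buffer of np_i to an input buffer of np_j */
            }
        }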


Note that when the workload of the processing nodes is balanced, the output buffers will be ready to be transmitted in the order np1, np2, ..., npN, as this is the sequence used by the Distribution Channel to transfer the data blocks from the input node. On the other hand, the possible waste of time due to an unbalanced processing load among the processing nodes is covered by the arbitration mechanism based on the rule “first come, first served”. In this way, the transfer times and the times at which the output buffers become ready are overlapped.

In other words, the effect observed from an external point of view is that the different processing nodes begin the Corner Turn at different moments in time (desynchronized) but finish it at more or less the same time (synchronized).


To relieve the processor of communication-related tasks, the memory addressing capabilities of the bridge device have to include the extraction of the submatrix from the appropriate place in the memory buffer. Making use of this capability, it is also possible to relieve the processor of the local transposition of the submatrix Aij.

The following piece of pseudocode describes the extraction of the Aij submatrix from the memory buffer of the source node. The nomenclature corresponds to that used in the two previous figures; pb is a pointer to the memory buffer.

        for (int u = 0; u < p; u++)          /* rows of the pxq submatrix Aij    */
        {
            for (int v = 0; v < q; v++)      /* columns of the pxq submatrix Aij */
            {
                /* send data in location (pb + (j-1)*q + u*n + v) to the destination node */
            }
        }


So far, we have considered it irrelevant whether what is being transferred is a matrix or a vector. Nevertheless, taking into account that what is being transferred is a matrix, the bridge device can also spare the processor the local transposition of the submatrix Aij. To do so, in the source node, the bridge device has to read the elements of the submatrix by columns from the output buffer before sending them consecutively to the destination node. This corresponds to exchanging the loops in the piece of code displayed above.
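A sketch of the resulting loop order follows: the address expression is the same as in the previous piece of pseudocode, with the loops exchanged so that the elements of Aij are read by columns and therefore arrive at the destination already transposed.

        for (int v = 0; v < q; v++)          /* columns of Aij = rows of AT_ij    */
        {
            for (int u = 0; u < p; u++)      /* rows of Aij = columns of AT_ij    */
            {
                /* send data in location (pb + (j-1)*q + u*n + v) to the destination node */
            }
        }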










In the writing of this article, Mike Oldfield ("Tubular Bells" Part I, Live at the BBC 1973) has collaborated in an involuntary but decisive way.

---------------------
1. Picture: DSC_5832.JPG
http://remopedregalejo.blogspot.com.es/2010/09/regata-de-traineras-rias-baixas-galicia_9112.htm

2. I want to thank Carol G. for her revision of this text.