Data transfers.
"Gentlemen, let's go row!" (Robert A. Heinlein)
This post describes the different types of data transfers required for the Architecture model.
--------------------
As stated in the previous post "Requirements", the different types of transfers required for the Architecture model are the Data distribution, Data redistribution and Data collection transfers.
Irrespective of the type of transfer, it has to be guaranteed that no data is lost. Whatever the input and output data units are, they will be transferred in blocks. Data blocks are stored in memory buffers. As soon as a source node has a block ready to send and the corresponding destination node has an input buffer ready to receive it, the transfer can be performed. Both end-points must be ready before the transfer starts. This establishes a flow control mechanism at the data block level.
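As an informal illustration of this handshake, a minimal sketch in C could look as follows; the buffer_t type and the transfer_block function are assumptions invented for the example, not part of the Architecture model.

#include <stdbool.h>

/* Hypothetical block-level flow control: the transfer is performed only
   when both end-points are ready (illustrative sketch). */
typedef struct
{
    void *data;   /* memory buffer holding one data block */
    bool  ready;  /* source: block ready to send; destination: buffer free */
} buffer_t;

bool transfer_block(buffer_t *src, buffer_t *dst)
{
    if (!src->ready || !dst->ready)
        return false;       /* an end-point is not ready: wait, no data loss */
    /* ... move the data block from src->data to dst->data ... */
    src->ready = false;     /* the output buffer becomes free */
    dst->ready = false;     /* the input buffer now holds a block to process */
    return true;
}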
To allow the input and the output of a node to work independently, both must have their own memory buffers. In the following figures, output buffers are shown in blue and input buffers in green. In addition, the buffers are identified by numbers, which indicate the order in which the data blocks stored in them arrived at the input node: block 1 was the first and block N the last. Numberless buffers are empty buffers.
In order to preserve the order of the input data at the output, consistency between data distribution and data collection has to be maintained. To achieve this, the input node distributes the blocks to the processing nodes on a round-robin basis and the output node collects the blocks in the same way. This means that, for the data distribution, the sequence of destination nodes will be [np1, np2, ..., npN, np1, ...]. Analogously, for the data collection, the sequence of source nodes will be the same. So, the identifier of a processing node indicates its order within the processing node set, both when receiving data blocks from the input node and when transmitting them to the output node.
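For illustration only, this round-robin selection amounts to a simple modular calculation; node_for_block is a name invented for this sketch.

/* Processing node (1-based) that handles block k (k = 1, 2, ...)
   in the sequence [np1, np2, ..., npN, np1, ...]. */
int node_for_block(int k, int N)
{
    return ((k - 1) % N) + 1;
}

For example, with N = 3, blocks 1, 2, 3 and 4 are handled by nodes 1, 2, 3 and 1, respectively.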
The next figure depicts the distribution and collection of the data blocks, showing the internals of the simple topology depicted in Figure 2-1 in terms of the nodes' input and output memory buffers. The situation considered is the following:
- The input node, in an orderly way, has distributed the data blocks 1 to N to the processing nodes 1 to N. Block 1 was the first to arrive at the input node and block N the last one.
- Processing node 1 has just finished transferring the results of processing data block 1 to the output node. Processing node 1 has already freed its input buffer but has not yet freed its output buffer.
- The transfer of data block N+1 is ready to start, as the block is ready in the input node and an input memory buffer is available in processing node 1.
We will consider that the data distribution transfers are started by the input node and, symmetrically, the data collection transfers are started by the output node.
Figure 4-1
As previously stated in the post "Requirements" of this series, the Data redistribution capability is necessary at least to the extent of covering the distributed matrix transposition in the case of 2-D processing on distributed memory.
In this case, for the sake of simplicity, we will assume that the input, output and intermediate matrices have the same dimensions, mxn; and that m/N = p and n/N = q, where m and n are the number of rows and columns of the matrices, respectively, N is the number of processing nodes, and p and q are integers.
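For example, with m = n = 1024 and N = 4 processing nodes, p = q = 256, so each node holds a 256x1024 submatrix both before and after the transposition.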
The input matrix has to be distributed among the processing nodes. A data block containing a pxn submatrix is transferred from the input node to each one of them. Each processing node performs 1-D processing over the rows of its submatrix (horizontal processing). At some point, a transposition of the distributed matrix (Corner Turn) is performed. As a result, an nxm matrix is distributed among the processing nodes, each of which now stores a qxm submatrix. Then, each processing node performs 1-D processing over the rows of its submatrix (vertical processing) and, after another Corner Turn, an mxn matrix is again distributed among the processing nodes. This four-step sequence is repeated as many times as the algorithm needs. Finally, a data block containing a pxn submatrix is transferred from each processing node to the output node.
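As an illustrative sketch of this sequence from the point of view of a single processing node (all the function names are placeholders invented for the example, not part of the model):

/* Placeholders for the per-node operations described above. */
static void receive_submatrix_from_input_node(void) { /* p x n data block in */ }
static void process_rows(void)                      { /* 1-D processing over the rows */ }
static void corner_turn(void)                       { /* distributed transposition */ }
static void send_submatrix_to_output_node(void)     { /* p x n data block out */ }

void processing_node_main(int iterations)
{
    receive_submatrix_from_input_node();
    for (int it = 0; it < iterations; it++)
    {
        process_rows();    /* horizontal processing on the p x n submatrix */
        corner_turn();     /* the node now holds a q x m submatrix */
        process_rows();    /* vertical processing */
        corner_turn();     /* back to a p x n submatrix */
    }
    send_submatrix_to_output_node();
}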
Referring to the simple algorithm depicted in Figure 1-2 of the post "Introduction" of this series, note that the matrix transpositions that will become Corner Turns have to take place outside the sections.
The following figure illustrates the distributed matrix transposition operation, showing the contents of the processing nodes' input and output buffers. The matrix A, an mxn matrix split into N pxn sub-matrices, is stored in the output buffers, and its transpose, the matrix B, an nxm matrix split into N qxm sub-matrices, is stored in the input buffers.
Note that all the processing nodes have to be interconnected and that the source node set and the destination node set are the same, but the source and destination buffers are different.
On the other hand, the chosen data distribution and collection mechanisms ensure not only that the order of the input data blocks is maintained at the output, but also the consistent construction of the distributed matrix.
Figure 4-2
The situation depicted in the previous figure can be described in algebraic terms as shown in the following figure. As can be seen, the matrices A and B are distributed among the output and input buffers, respectively, of the processing nodes. The output buffers hold the submatrices A1, A2, ..., AN and the input buffers hold the submatrices B1, B2, ..., BN. The submatrix Ai consists of the submatrices Ai1, Ai2, ..., AiN and the submatrix Bi of the submatrices AT1i, AT2i, ..., ATNi. Note that the subscripts of the submatrix Aij indicate not only its position in the matrix A, but also the source and destination nodes.
Figure 4-3 |
The distributed matrix transposition is performed by every source node npi (i=1,...,N) transferring the submatrix ATij to every destination node npj (j=1,...,N). Alternatively, it can be done by transferring the submatrix Aij and performing the transposition in the destination node. The hardware characteristics of the nodes will determine in each case whether one option is better than the other. Both options are analogous; for the sake of brevity, we will choose the first one. Note that every source node transfers only one submatrix to every destination node.
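Schematically, and keeping the nomenclature of the previous figures, each source node would do the following; send_transposed_submatrix is a placeholder invented for this sketch, standing in for the actual transfer operation.

/* Placeholder: transfer (Aij)T from node npi to node npj. */
static void send_transposed_submatrix(int i, int j) { (void)i; (void)j; }

/* Corner Turn as seen from the source node npi: exactly one submatrix
   is transferred to each one of the N destination nodes. */
void corner_turn_from_source(int i, int N)
{
    for (int j = 1; j <= N; j++)
        send_transposed_submatrix(i, j);
}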
An arbitration mechanism among source nodes based on the rule "first come, first served" is used in the first instance. Then, conflicts due to coincidence in time, if any, are resolved using a sequential round-robin mechanism (np1, np2, ..., npN, np1).
Once selected, every source node npi (i=1,...,N) performs data distribution transfers, started by the source node, to all the destination nodes np1, np2, ..., npN. The transfers begin as soon as an output buffer is ready, but the input buffers only become completely filled once the transfers of the last source node have been performed.
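One possible way to picture the tie-break rule is the following sketch; the ready array and the next_source function are invented for the illustration and are not part of the model.

#include <stdbool.h>

/* Illustrative tie-break: among the source nodes that became ready at the
   same time, pick the next one in the round-robin order np1, np2, ..., npN. */
int next_source(const bool ready[], int N, int last) /* last: 0-based previous pick */
{
    for (int k = 1; k <= N; k++)
    {
        int i = (last + k) % N;
        if (ready[i])
            return i + 1;   /* 1-based identifier of the selected node */
    }
    return 0;               /* no source node is ready */
}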
Note that when the workload of the processing nodes is balanced, the output buffers will be ready for transmission in the order np1, np2, ..., npN, as this is the sequence used by the Distribution Channel to transfer the data blocks from the input node. On the other hand, the possible waste of time due to an unbalanced processing load among the processing nodes is covered by the arbitration mechanism based on the rule "first come, first served". In this way, the transfer times and the output buffer readiness times are overlapped.
In other words, the effect observed from an external point of view is that the different processing nodes begin the Corner Turn at different moments in time (desynchronized) but finish it, more or less, at the same time (synchronized).
To relieve the processor of communication-related tasks, the memory addressing capabilities of the bridge device have to include the extraction of the submatrix from the appropriate place in the memory buffer. Making use of this capability, it is also possible to relieve the processor of the local transposition of the submatrix Aij.
The following piece of pseudocode describes the extraction of the Aij submatrix from the memory buffer of the source node. The nomenclature corresponds to that used in the two previous figures; pb is a pointer to the memory buffer.
int u, v;

/* The buffer holds the p x n submatrix Ai in row-major order;
   the p x q submatrix Aij occupies the columns (j-1)*q to j*q - 1. */
for (u = 0; u < p; u++)
{
    for (v = 0; v < q; v++)
    {
        /* send data in location (pb + (j-1)*q + u*n + v) to the destination node */
    }
}
So far, we have considered it irrelevant whether what is being transferred is a matrix or a vector. Nevertheless, taking into account that what is being transferred is a matrix, the bridge device can also spare the processor from performing the local transposition of the submatrix Aij. To do so, in the source node, the bridge device has to read the elements of the submatrix by columns from the output buffer before sending them consecutively to the destination node. This corresponds to exchanging the loops in the piece of code displayed above.
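Under the same nomenclature and assumptions as the previous snippet, the exchanged loops would read the submatrix by columns:

int u, v;

/* Reading Aij by columns, the destination node receives the elements
   of (Aij)T consecutively, with no local transposition by the processor. */
for (v = 0; v < q; v++)
{
    for (u = 0; u < p; u++)
    {
        /* send data in location (pb + (j-1)*q + u*n + v) to the destination node */
    }
}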
In the writing of this article, Mike Oldfield ("Tubular Bells", Part I, Live at the BBC, 1973) has collaborated in an involuntary but decisive way.
---------------------
1. Picture: DSC_5832.JPG
http://remopedregalejo.blogspot.com.es/2010/09/regata-de-traineras-rias-baixas-galicia_9112.htm
2. I want to thank Carol G. for her revision of this text.