To preserve at the output the order of the input data, consistency
between data distribution and data collection has to be maintained.
To achieve this, the input node distributes the blocks to the processing
nodes on a rotatory basis, and the output node collects the blocks
in the same way. This means that, for the data distribution, the
sequence of destination nodes will be [np1, np2, ..., npN, np1, ...].
Analogously, for the data collection, the sequence of source nodes will
be the same. Thus, the identifier of a processing node indicates its
order within the processing nodes' set when receiving data blocks from
the input node and when transmitting them to the output node.
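As a simple illustration of the rotatory scheme, the processing node that handles the k-th data block (counting from zero) could be computed as sketched below; the helper function is hypothetical and serves only to make the rule explicit:

/* Hypothetical helper: round-robin selection of the processing node for
   the k-th data block (k = 0, 1, 2, ...), with nodes numbered np1..npN.
   The output node applies the same rule to select the source of the k-th block. */
int round_robin_node(int k, int N)
{
    return (k % N) + 1;   /* k = 0 -> np1, k = 1 -> np2, ..., k = N -> np1, ... */
}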
The next figure
shows the distribution and collection of the data blocks, illustrating the
internals of the simple topology depicted in Figure 2-1 in terms of
the nodes' input and output memory buffers. The situation considered
is the following:
We will consider
that the data distribution transfers are started by the input node
and, symmetrically, the data collection transfers are started by the
output node.
In this case, for
the sake of simplicity, we will assume that the input, output and
intermediate matrices have the same dimensions, mxn, and that m/N = p
and n/N = q, where m and n are the number of rows and columns of the
matrices, respectively, N is the number of processing nodes, and p and q
are integers.
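For instance (a purely illustrative choice of values), with m = 1024, n = 512 and N = 8 processing nodes, we would have p = 128 and q = 64: each processing node would hold a 128x512 submatrix during horizontal processing and a 64x1024 submatrix during vertical processing.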
The input matrix
has to be distributed among the processing nodes: a data block
containing a pxn submatrix is transferred from the input node to each
one of them. Each processing node performs 1-D processing over the
rows of its submatrix (horizontal processing). At some point, a
transposition of the distributed matrix (Corner Turn) is performed.
As a result, an nxm matrix is distributed among the processing nodes,
with each of them storing a qxm submatrix. Then, the
processing nodes perform 1-D processing over the rows of their
submatrices (vertical processing) and, after another Corner Turn, an
mxn matrix is again distributed among the processing nodes. This
four-step sequence is repeated as many times as the algorithm requires.
Finally, a data block containing a pxn submatrix is transferred from each
processing node to the output node.
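As an outline only, and assuming hypothetical routines (receive_block, process_rows, corner_turn, send_block) that stand for the actual processing and communication primitives, the sequence seen from one processing node could be sketched like this:

/* Hypothetical outline of the per-node processing sequence.
   buf_h holds a p x n submatrix, buf_v holds a q x m submatrix. */
extern void receive_block(float *buf, int rows, int cols);  /* from the input node       */
extern void process_rows(float *buf, int rows, int cols);   /* 1-D processing over rows  */
extern void corner_turn(float *src, float *dst);            /* distributed transposition */
extern void send_block(float *buf, int rows, int cols);     /* to the output node        */

void process_node(float *buf_h, float *buf_v, int p, int q, int m, int n, int iterations)
{
    receive_block(buf_h, p, n);        /* p x n submatrix from the input node */
    for (int it = 0; it < iterations; it++)
    {
        process_rows(buf_h, p, n);     /* horizontal processing               */
        corner_turn(buf_h, buf_v);     /* Corner Turn: now holds q x m        */
        process_rows(buf_v, q, m);     /* vertical processing                 */
        corner_turn(buf_v, buf_h);     /* Corner Turn: back to p x n          */
    }
    send_block(buf_h, p, n);           /* p x n submatrix to the output node  */
}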
Referring
to the simple algorithm depicted in Figure 1-2 of the post "
Introduction" of this series, note that the matrix
transpositions that will become Corner Turns have to take place
outside the sections.
The following
figure illustrates the distributed matrix transposition operation,
showing the contents of the processing nodes' input and output
buffers. The matrix A, an mxn matrix split into N pxn sub-matrices, is stored in the
output buffers, and its transpose, the matrix B, an nxm matrix split into N
qxm sub-matrices, is stored in the input buffers.
Note that all the
processing nodes have to be interconnected and that the source node
set and destination node set are the same, but the source and
destination buffers are different.
Figure 4-2
On the other
hand, the chosen data distribution and collection mechanism ensures
not only that the order of the input data blocks is maintained at the
output, but also the consistent construction of the distributed
matrix.
The situation
depicted in the previous figure can be described in algebraic terms as shown in the
following figure. As can be seen, the matrices A and B are distributed among
the output and input buffers, respectively, of the processing nodes. The output buffers hold the submatrices A1, A2, ..., AN and the input buffers hold the submatrices B1, B2, ..., BN. The
submatrix Ai consists of the submatrices Ai1, Ai2, ..., AiN, and the submatrix Bi consists of the submatrices AT1i, AT2i, ..., ATNi. Note that
the subscripts of the submatrix Aij indicate not only its position in
the matrix A, but also the source and destination nodes.
Figure 4-3
The distributed matrix transposition is performed by every source node
npi (i=1,..,N)
by transferring the submatrix
ATij to every destination node
npj
(j=1,..,N). Alternatively, it can be done by transferring the submatrix
Aij
and performing the transposition in the destination node. The hardware
characteristics of the nodes will determine in each case whether one option
is better than the other. Since both options are analogous, for the sake of
brevity we will choose the first one. Note that every source node
transfers only one submatrix to every destination node.
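As an outline, and assuming a hypothetical transfer_submatrix_T routine that stands for the actual bridge-device transfer, the exchange seen from source node npi could be sketched as follows:

/* Hypothetical placeholder for the bridge-device transfer of ATij
   from source node npi to destination node npj */
extern void transfer_submatrix_T(int i, int j);

/* All-to-all exchange as seen from source node npi:
   exactly one submatrix is transferred to every destination node */
void corner_turn_from_source(int i, int N)
{
    for (int j = 1; j <= N; j++)
    {
        transfer_submatrix_T(i, j);
    }
}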
An arbitration mechanism among the source nodes based on the rule "first come, first served" is applied first. Then, conflicts due to requests coinciding in time, if any, are resolved using a sequential rotatory mechanism (np1, np2, ..., npN, np1, ...).
Once selected, every source node
npi (i=1,..,N) performs data distribution transfers, started by the source node, to all the destination nodes
np1, np2, ..., npN. The transfers begin as soon as an
output buffer is ready, but the input buffers will only become completely filled as
the transfers of the last source node are performed.
Note that when the workload in the processing nodes is balanced, the
output buffers will be ready for transmission in the order
np1, np2, ..., npN,
as this is the sequence used by the Distribution Channel to transfer
the data blocks from the input node. On the other hand, the
possible waste of time due to an unbalanced processing load among the
processing nodes is covered by the arbitration mechanism based on
the rule "first come, first served". In this way, the
transfer times and the output buffer readiness times are
overlapped.
In other words, the effect observed from an external point of view is that the different processing nodes begin the Corner Turn at different moments in time (desynchronized) but finish it, more or less, at the same time (synchronized).
To release the
processor from communications-related tasks, the memory addressing
capabilities of the bridge device have to include the extraction of the submatrix from the appropriate place of the memory
buffer. Making use of this capability, it is also possible to release
the processor from the local transposition of the submatrix Aij.
The following piece of pseudocode describes the extraction of the
submatrix Aij from the memory buffer of the source node. The
nomenclature corresponds to that used in the two previous figures;
pb is a pointer to the memory buffer.
for (u = 0; u < p; u++)          /* for each of the p rows of Aij    */
{
    for (v = 0; v < q; v++)      /* for each of the q columns of Aij */
    {
        /* send data in location (pb + (j-1)*q + u*n + v) to the destination node */
    }
}
So
far, we have considered it irrelevant whether what is being transferred is
a matrix or a vector. Nevertheless, taking into account that what is
being transferred is a matrix, the bridge device can also spare the
processor from performing the local transposition of the submatrix
Aij.
For that, in the source node, the bridge device has to read the
elements of the submatrix by columns from the output buffer before
sending them consecutively to the destination node. This corresponds to
exchanging the loops in the piece of code displayed above.
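Following the same notation and the same assumptions about the buffer layout, the exchanged loops could be sketched as follows:

/* Exchanged loops: reading the elements of Aij by columns, so the
   destination node receives the data already laid out as ATij */
for (v = 0; v < q; v++)
{
    for (u = 0; u < p; u++)
    {
        /* send data in location (pb + (j-1)*q + u*n + v) to the destination node */
    }
}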
In the writing of
this article, Mike Oldfield ("Tubular Bells" Part I, Live at the BBC 1973) collaborated in an
involuntary but decisive way.