Local and Distributed memory processing under a steady workload
Row, row, row your boat---
This post describes the operation of the Architecture model for Local and Distributed memory processing when the nodes work under a steady workload condition. The cases in which Local memory
processing is possible and those in which Distributed memory processing is
necessary are addressed in the Appendix “Local and distributed memory processing”. The conditions under which the nodes’ workload can be
considered steady are addressed in the Appendix “Nodes’ workload over time”.
--------------------
According to the Appendix “Local and distributed memory processing” of this series, Local memory
processing is possible when the node has enough memory to perform the
required processing. If the node does not have enough memory or
Throughput, Distributed memory processing can be performed,
using the memory of the set of processing nodes as a whole.
In the case of Local
memory processing, whether the input and output data units are
vectors or matrices is irrelevant; only their size matters. The
same is not true for 2-D Distributed memory processing,
as was shown in the previous post “Data transferences”.
Local memory processing
The next figure shows the processing performed by
the nodes over time for an MP machine with four processing nodes. Eight
input data units are processed and eight output data units are
produced. Each data unit consists of one data block. The data blocks
are identified by numbers; data block 1 was the first
to arrive. The Latency of the input and output nodes has been taken to be far
smaller than the Latency of a single processing node.
Figure 6-1
In the figure above, note that each
processing node works independently of the other processing nodes
from the beginning to the end of the algorithm execution.
Moreover, the processing nodes take the maximum Latency that fulfills
the no-data-loss requirement. This means that, when the input node
finishes the processing of data blocks 5-8, the processing nodes
must have finished the processing of data blocks 1-4. And when
the processing nodes finish the processing of data blocks 5-8, the
output node must have finished the processing of data blocks 1-4.
Note that if the time necessary to send
a data unit from one node to another is negligible compared to the
time necessary to process that data unit, a single output buffer
in the input and processing nodes, and a single input buffer in the
output and processing nodes, are enough to avoid losing data.
When the Latency of the input node is
similar to the Latency of the processing nodes, the processing nodes
do not work in parallel but serially, as shown in the
next figure. The same applies to the Latency of the output
node.
Figure 6-2
For the MP machine, maximum Throughput and minimum Latency
occur when the input and output nodes do not perform any processing.
So, in order to take advantage of the processing power of the MP
machine, the Latency of the input and output nodes has to be kept as
small as possible compared to the Latency of the processing nodes.
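The contrast between Figures 6-1 and 6-2 can be summarized in one expression: the steady-state Throughput of the input → processing → output chain is limited by its slowest stage, where the processing stage consists of N nodes working in parallel. A minimal sketch (the names t_in, t_proc and t_out are assumptions for illustration):

```python
# Steady-state Throughput of the input -> processing -> output chain
# (sketch). Assumed names: t_in, t_proc and t_out are the Latencies
# per data unit of the input node, one processing node and the output
# node, respectively.

def chain_throughput(t_in: float, t_proc: float, t_out: float,
                     n_proc: int) -> float:
    """Each stage forwards data units at 1/latency; the n_proc
    processing nodes together sustain n_proc / t_proc. The chain is
    limited by its slowest stage."""
    return min(1.0 / t_in, n_proc / t_proc, 1.0 / t_out)

# Figure 6-1 regime: input/output Latency far smaller than processing.
print(chain_throughput(t_in=0.1, t_proc=4.0, t_out=0.1, n_proc=4))

# Figure 6-2 regime: input Latency similar to one processing node's.
print(chain_throughput(t_in=4.0, t_proc=4.0, t_out=0.1, n_proc=4))
```

In the second case the input node, not the processing nodes, sets the Throughput, which is why the processing nodes end up working serially.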
Distributed memory processing
The next figure shows the processing performed by
the nodes over time for an MP machine with four processing nodes.
Twelve input data units are processed and twelve output data units
are produced. Each data unit consists of four data blocks. The data
blocks are identified by numbers, data block 1 being the first
to arrive. The Latency of the input and output nodes has been taken to be
comparable to the Latency of a single processing node. The 2-D
processing consists of Rows processing (horizontal processing) +
Corner Turn + Rows processing (vertical processing) + Corner Turn +
Rows processing (horizontal processing). Processing is depicted in
light gray and Corner Turns in dark gray.
Figure 6-3
The processing nodes process an input
data unit while the following one is arriving from the input node and
the previous output data unit is being sent to the output node. For
instance, referring to the previous figure, while the processing
nodes process data blocks 5-8 using input and output buffers
no. 2, data blocks 9-12 are being transferred from the input node
to input buffers no. 1, and data blocks 1-4 are being
transferred from output buffers no. 1 to the output node.
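This double-buffering scheme keeps transfers and processing overlapped: once the pipeline is full, each data unit costs only the larger of the processing time and the transfer time, plus one fill and one drain step at the ends. A minimal sketch of the timing (the names t_transfer and t_proc, and the time values, are assumptions for illustration):

```python
# Double-buffered pipeline timing for Distributed memory processing
# (sketch). Assumed names: t_transfer is the time to move one data
# unit in or out of the processing nodes, t_proc the time to process
# one data unit (including Corner Turns).

def total_time(n_units: int, t_transfer: float, t_proc: float) -> float:
    """With two input and two output buffers, receiving unit k+1 and
    sending unit k-1 overlap the processing of unit k. Total time is
    one input fill, n_units overlapped steps, and one output drain."""
    steady_step = max(t_proc, t_transfer)
    return t_transfer + n_units * steady_step + t_transfer

# In the spirit of Figure 6-3: twelve data units, transfer time
# comparable to processing time.
print(total_time(n_units=12, t_transfer=1.0, t_proc=1.0))  # -> 14.0
```

Without the second pair of buffers the transfer and processing phases could not overlap, and each data unit would cost t_proc plus both transfers.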
Note that, differently from the Local
memory processing case, when the time necessary to send a data block
from one node to another is negligible compared to the time necessary
to process that data block, a single output buffer in the input
node and a single input buffer in the output node are enough to avoid
losing data.
If just one input buffer and one output
buffer were used in the processing nodes, the processing nodes
would only have the time corresponding to an input data
block, instead of the time corresponding to an input data unit, to
avoid losing data.
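With a single buffer, a processing node must free that buffer before the next data block arrives, so its processing budget shrinks from one data unit to one data block. A minimal sketch of the difference, using the four blocks per data unit of Figure 6-3 (the names t_block and blocks_per_unit are assumptions for illustration):

```python
# Processing-time budget of one processing node (sketch).
# Assumed names: t_block is the arrival period of one data block,
# blocks_per_unit the number of data blocks per data unit.

def processing_budget(t_block: float, blocks_per_unit: int,
                      double_buffered: bool) -> float:
    """With double buffering a node may take the whole data-unit
    period to process; with a single buffer it must free the buffer
    before the next data block arrives."""
    if double_buffered:
        return t_block * blocks_per_unit
    return t_block

print(processing_budget(t_block=1.0, blocks_per_unit=4,
                        double_buffered=True))   # -> 4.0
print(processing_budget(t_block=1.0, blocks_per_unit=4,
                        double_buffered=False))  # -> 1.0
```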
In the writing of this article, Camel (Music inspired by the Snow Goose) collaborated
in an involuntary but decisive way.
1. Picture: Based on DSC_6343.JPG
http://remopedregalejo.blogspot.com.es/2010/09/regata-de-traineras-rias-baixas-galicia_3737.html
2. I want to thank Carol G. for her revision of this text.