Local and Distributed memory processing under a steady workload
Row, row, row your boat---
This post describes the operation of the Architecture model for Local and Distributed memory processing when the nodes work under a steady workload condition. The cases in which Local memory
processing is possible and those in which Distributed memory processing is
necessary are addressed in the Appendix “Local and distributed memory processing”. The conditions under which the nodes’ workload can be
considered steady are addressed in the Appendix “Nodes’ workload over time”.
--------------------
According to the Appendix “Local and distributed memory processing” of this series, Local memory
processing is possible when the node has enough memory to perform the
required processing. If the node does not have enough memory or
Throughput, Distributed memory processing can be performed,
using the memory of the set of processing nodes as a whole.
In the case of Local
memory processing, whether the input and output data units are
vectors or matrices is irrelevant; only their size matters. The
same is not true for 2-D Distributed memory processing,
as was shown in the previous post “Data transferences”.
Local memory processing
The next figure shows the processing performed by
the nodes over time for an MP machine with four processing nodes. Eight
input data units are processed and eight output data units are
produced. Each data unit consists of one data block. The data blocks
are identified by numbers; data block 1 was the first
to arrive. The Latency of the input and output nodes has been taken to be far
smaller than the Latency of a single processing node.
Figure 6-1
In the figure above, note that each
processing node works independently of the other processing nodes
from the beginning to the end of the algorithm execution.
Moreover, the processing nodes take the maximum Latency that fulfills
the no-data-loss requirement. This means that, when the input node
finishes the processing of data blocks 5-8, the processing nodes
must have finished the processing of data blocks 1-4. And when
the processing nodes finish the processing of data blocks 5-8, the
output node must have finished the processing of data blocks 1-4.
Note that if the time necessary to send
a data unit from one node to another is negligible compared to the
time necessary to process that data unit, a single output buffer
in the input and processing nodes, and a single input buffer in the
output and processing nodes, are enough to avoid losing data.
When the Latency of the input node is
similar to the Latency of the processing nodes, the processing nodes
do not work in parallel but serially, as shown in the
next figure. The same applies to the Latency of the output
node.
Figure 6-2
For the MP machine, maximum Throughput and minimum Latency
occur when the input and output nodes do not perform any processing.
So, in order to take advantage of the processing power of the MP
machine, the Latency of the input and output nodes has to be kept as
small as possible compared to the Latency of the processing nodes.
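The contrast between Figures 6-1 and 6-2 can be summarized in one expression: the steady-state Throughput of the input → processing → output chain is limited by its slowest stage, where the processing stage consists of N nodes working in parallel. A minimal sketch (the names t_in, t_proc and t_out are assumptions for illustration):

```python
# Steady-state Throughput of the input -> processing -> output chain
# (sketch). Assumed names: t_in, t_proc and t_out are the Latencies
# per data unit of the input node, one processing node and the output
# node, respectively.

def chain_throughput(t_in: float, t_proc: float, t_out: float,
                     n_proc: int) -> float:
    """Each stage forwards data units at 1/latency; the n_proc
    processing nodes together sustain n_proc / t_proc. The chain is
    limited by its slowest stage."""
    return min(1.0 / t_in, n_proc / t_proc, 1.0 / t_out)

# Figure 6-1 regime: input/output Latency far smaller than processing.
print(chain_throughput(t_in=0.1, t_proc=4.0, t_out=0.1, n_proc=4))

# Figure 6-2 regime: input Latency similar to one processing node's.
print(chain_throughput(t_in=4.0, t_proc=4.0, t_out=0.1, n_proc=4))
```

In the second case the input node, not the processing nodes, sets the Throughput, which is why the processing nodes end up working serially.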
Distributed memory processing
The next figure shows the processing performed by
the nodes over time for an MP machine with four processing nodes.
Twelve input data units are processed and twelve output data units
are produced. Each data unit consists of four data blocks. The data
blocks are identified by numbers, data block 1 being the first
to arrive. The Latency of the input and output nodes has been taken to be
comparable to the Latency of a single processing node. The 2-D
processing consists of Rows processing (horizontal processing) +
Corner Turn + Rows processing (vertical processing) + Corner Turn +
Rows processing (horizontal processing). Processing is depicted in
light gray and Corner Turns in dark gray.
Figure 6-3
The processing nodes process an input
data unit while the following one is arriving from the input node and
the previous output data unit is being sent to the output node. For
instance, referring to the previous figure, while the processing
nodes process data blocks 5-8 using input and output buffers
no. 2, data blocks 9-12 are being transferred from the input node
to input buffers no. 1, and data blocks 1-4 are being
transferred from output buffers no. 1 to the output node.
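This double-buffering scheme keeps transfers and processing overlapped: once the pipeline is full, each data unit costs only the larger of the processing time and the transfer time, plus one fill and one drain step at the ends. A minimal sketch of the timing (the names t_transfer and t_proc, and the time values, are assumptions for illustration):

```python
# Double-buffered pipeline timing for Distributed memory processing
# (sketch). Assumed names: t_transfer is the time to move one data
# unit in or out of the processing nodes, t_proc the time to process
# one data unit (including Corner Turns).

def total_time(n_units: int, t_transfer: float, t_proc: float) -> float:
    """With two input and two output buffers, receiving unit k+1 and
    sending unit k-1 overlap the processing of unit k. Total time is
    one input fill, n_units overlapped steps, and one output drain."""
    steady_step = max(t_proc, t_transfer)
    return t_transfer + n_units * steady_step + t_transfer

# In the spirit of Figure 6-3: twelve data units, transfer time
# comparable to processing time.
print(total_time(n_units=12, t_transfer=1.0, t_proc=1.0))  # -> 14.0
```

Without the second pair of buffers the transfer and processing phases could not overlap, and each data unit would cost t_proc plus both transfers.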
Note that, differently from the Local
memory processing case, when the time necessary to send a data block
from one node to another is negligible compared to the time necessary
to process that data block, a single output buffer in the input
node and a single input buffer in the output node are enough to avoid
losing data.
If just one input buffer and one output
buffer were used in the processing nodes, the processing nodes
would only have the time corresponding to an input data
block, instead of the time corresponding to an input data unit, to
avoid losing data.
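With a single buffer, a processing node must free that buffer before the next data block arrives, so its processing budget shrinks from one data unit to one data block. A minimal sketch of the difference, using the four blocks per data unit of Figure 6-3 (the names t_block and blocks_per_unit are assumptions for illustration):

```python
# Processing-time budget of one processing node (sketch).
# Assumed names: t_block is the arrival period of one data block,
# blocks_per_unit the number of data blocks per data unit.

def processing_budget(t_block: float, blocks_per_unit: int,
                      double_buffered: bool) -> float:
    """With double buffering a node may take the whole data-unit
    period to process; with a single buffer it must free the buffer
    before the next data block arrives."""
    if double_buffered:
        return t_block * blocks_per_unit
    return t_block

print(processing_budget(t_block=1.0, blocks_per_unit=4,
                        double_buffered=True))   # -> 4.0
print(processing_budget(t_block=1.0, blocks_per_unit=4,
                        double_buffered=False))  # -> 1.0
```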
In the writing of this article, Camel (Music inspired by the Snow Goose) collaborated
in an involuntary but decisive way.
1. Picture: Based on DSC_6343.JPG
http://remopedregalejo.blogspot.com.es/2010/09/regata-de-traineras-rias-baixas-galicia_3737.html
2. I want to thank Carol G. for her revision of this text.