To preserve at the output the order of the input data, consistency
between data distribution and data collection has to be maintained.
To achieve this, the input node distributes the blocks to the processing
nodes on a rotatory basis, and the output node collects the blocks
in the same way. This means that, for the data distribution, the
sequence of destination nodes will be [np1, np2, ..., npN, np1, ...].
Analogously, for the data collection, the sequence of source nodes will
be the same. Thus, the identifier of a processing node indicates its
order within the processing nodes' set when receiving data blocks from
the input node and when transmitting them to the output node.
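As a simple illustration of the rotatory scheme, the processing node that handles the k-th data block (counting from zero) could be computed as sketched below; the helper function is hypothetical and serves only to make the rule explicit:

/* Hypothetical helper: round-robin selection of the processing node for
   the k-th data block (k = 0, 1, 2, ...), with nodes numbered np1..npN.
   The output node applies the same rule to select the source of the k-th block. */
int round_robin_node(int k, int N)
{
    return (k % N) + 1;   /* k = 0 -> np1, k = 1 -> np2, ..., k = N -> np1, ... */
}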
The next figure
shows the distribution and collection of the data blocks, illustrating the
internals of the simple topology depicted in Figure 2-1 in terms of
the nodes' input and output memory buffers. The situation considered
is the following:
We will consider
that the data distribution transfers are started by the input node
and, symmetrically, the data collection transfers are started by the
output node.
In this case, for
the sake of simplicity, we will assume that the input, output and
intermediate matrices have the same dimensions, mxn, and that m/N = p
and n/N = q, where m and n are the number of rows and columns of the
matrices, respectively, N is the number of processing nodes, and p and q
are integers.
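For instance (a purely illustrative choice of values), with m = 1024, n = 512 and N = 8 processing nodes, we would have p = 128 and q = 64: each processing node would hold a 128x512 submatrix during horizontal processing and a 64x1024 submatrix during vertical processing.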
The input matrix
has to be distributed among the processing nodes: a data block
containing a pxn submatrix is transferred from the input node to each
one of them. Each processing node performs 1-D processing over the
rows of its submatrix (horizontal processing). At some point, a
transposition of the distributed matrix (Corner Turn) is performed.
As a result, an nxm matrix is distributed among the processing nodes,
with each of them storing a qxm submatrix. Then, the
processing nodes perform 1-D processing over the rows of their
submatrices (vertical processing) and, after another Corner Turn, an
mxn matrix is again distributed among the processing nodes. This
four-step sequence is repeated as many times as the algorithm requires.
Finally, a data block containing a pxn submatrix is transferred from each
processing node to the output node.
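As an outline only, and assuming hypothetical routines (receive_block, process_rows, corner_turn, send_block) that stand for the actual processing and communication primitives, the sequence seen from one processing node could be sketched like this:

/* Hypothetical outline of the per-node processing sequence.
   buf_h holds a p x n submatrix, buf_v holds a q x m submatrix. */
extern void receive_block(float *buf, int rows, int cols);  /* from the input node       */
extern void process_rows(float *buf, int rows, int cols);   /* 1-D processing over rows  */
extern void corner_turn(float *src, float *dst);            /* distributed transposition */
extern void send_block(float *buf, int rows, int cols);     /* to the output node        */

void process_node(float *buf_h, float *buf_v, int p, int q, int m, int n, int iterations)
{
    receive_block(buf_h, p, n);        /* p x n submatrix from the input node */
    for (int it = 0; it < iterations; it++)
    {
        process_rows(buf_h, p, n);     /* horizontal processing               */
        corner_turn(buf_h, buf_v);     /* Corner Turn: now holds q x m        */
        process_rows(buf_v, q, m);     /* vertical processing                 */
        corner_turn(buf_v, buf_h);     /* Corner Turn: back to p x n          */
    }
    send_block(buf_h, p, n);           /* p x n submatrix to the output node  */
}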
Referring
to the simple algorithm depicted in Figure 1-2 of the post "
Introduction" of this series, note that the matrix
transpositions that will become Corner Turns have to take place
outside the sections.
The following
figure illustrates the distributed matrix transposition operation,
showing the contents of the processing nodes' input and output
buffers. The matrix A, an mxn matrix split into N pxn sub-matrices, is stored in the
output buffers, and its transpose, the matrix B, an nxm matrix split into N
qxm sub-matrices, is stored in the input buffers.
Note that all the
processing nodes have to be interconnected and that the source node
set and destination node set are the same, but the source and
destination buffers are different.
Figure 4-2
On the other
hand, the chosen data distribution and collection mechanism ensures
not only that the order of the input data blocks is maintained at the
output, but also the consistent construction of the distributed
matrix.
The situation
depicted in the previous figure can be described in algebraic terms as shown in the
following figure. As can be seen, the matrices A and B are distributed among
the output and input buffers, respectively, of the processing nodes. The output buffers hold the submatrices A1, A2, ..., AN and the input buffers hold the submatrices B1, B2, ..., BN. The
submatrix Ai consists of the submatrices Ai1, Ai2, ..., AiN, and the submatrix Bi consists of the submatrices AT1i, AT2i, ..., ATNi. Note that
the subscripts of the submatrix Aij indicate not only its position in
the matrix A, but also the source and destination nodes.
Figure 4-3
The distributed matrix transposition is performed by every source node
npi (i=1,..,N)
by transferring the submatrix
ATij to every destination node
npj
(j=1,..,N). Alternatively, it can be done by transferring the submatrix
Aij
and performing the transposition in the destination node. The hardware
characteristics of the nodes will determine in each case whether one option
is better than the other. Since both options are analogous, for the sake of
brevity we will choose the first one. Note that every source node
transfers only one submatrix to every destination node.
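As an outline, and assuming a hypothetical transfer_submatrix_T routine that stands for the actual bridge-device transfer, the exchange seen from source node npi could be sketched as follows:

/* Hypothetical placeholder for the bridge-device transfer of ATij
   from source node npi to destination node npj */
extern void transfer_submatrix_T(int i, int j);

/* All-to-all exchange as seen from source node npi:
   exactly one submatrix is transferred to every destination node */
void corner_turn_from_source(int i, int N)
{
    for (int j = 1; j <= N; j++)
    {
        transfer_submatrix_T(i, j);
    }
}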
An arbitration mechanism among the source nodes based on the rule "first come, first served" is applied first. Then, conflicts due to requests coinciding in time, if any, are resolved using a sequential rotatory mechanism (np1, np2, ..., npN, np1, ...).
Once selected, every source node
npi (i=1,..,N) performs data distribution transfers, started by the source node, to all the destination nodes
np1, np2, ..., npN. The transfers begin as soon as an
output buffer is ready, but the input buffers will only become completely filled as
the transfers of the last source node are performed.
Note that when the workload in the processing nodes is balanced, the
output buffers will be ready for transmission in the order
np1, np2, ..., npN,
as this is the sequence used by the Distribution Channel to transfer
the data blocks from the input node. On the other hand, the
possible waste of time due to an unbalanced processing load among the
processing nodes is covered by the arbitration mechanism based on
the rule "first come, first served". In this way, the
transfer times and the output buffer readiness times are
overlapped.
In other words, the effect observed from an external point of view is that the different processing nodes begin the Corner Turn at different moments in time (desynchronized) but finish it, more or less, at the same time (synchronized).
To release the
processor from communications-related tasks, the memory addressing
capabilities of the bridge device have to include the extraction of the submatrix from the appropriate place of the memory
buffer. Making use of this capability, it is also possible to release
the processor from the local transposition of the submatrix Aij.
The following piece of pseudocode describes the extraction of the
submatrix Aij from the memory buffer of the source node. The
nomenclature corresponds to that used in the two previous figures;
pb is a pointer to the memory buffer.
for (u = 0; u < p; u++)          /* for each of the p rows of Aij    */
{
    for (v = 0; v < q; v++)      /* for each of the q columns of Aij */
    {
        /* send data in location (pb + (j-1)*q + u*n + v) to the destination node */
    }
}
So
far, we have considered it irrelevant whether what is being transferred is
a matrix or a vector. Nevertheless, taking into account that what is
being transferred is a matrix, the bridge device can also spare the
processor from performing the local transposition of the submatrix
Aij.
For that, in the source node, the bridge device has to read the
elements of the submatrix by columns from the output buffer before
sending them consecutively to the destination node. This corresponds to
exchanging the loops in the piece of code displayed above.
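Following the same notation and the same assumptions about the buffer layout, the exchanged loops could be sketched as follows:

/* Exchanged loops: reading the elements of Aij by columns, so the
   destination node receives the data already laid out as ATij */
for (v = 0; v < q; v++)
{
    for (u = 0; u < p; u++)
    {
        /* send data in location (pb + (j-1)*q + u*n + v) to the destination node */
    }
}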
In the writing of
this article, Mike Oldfield ("Tubular Bells" Part I, Live at the BBC 1973) collaborated in an
involuntary but decisive way.