Notes for a multiprocessor skeleton V


Nodes’ programming model.

 

It's hardware that makes a machine fast,
it's software that makes a fast machine slow
Craig Bruce


This post describes the programming model for the input, output and processing nodes.


In terms of the nodes’ functionality, as shown throughout the “Notes on SPMD architecture” series, the SPMD architecture consists of three different types of nodes. Accordingly, at most three different software components are necessary: those corresponding to the input, the output, and the processing nodes.


As a general principle, we will consider that the nodes’ programming model must allow the implementations of the algorithm and the Skeleton to be independent of each other.


In this post, as in others of this series, when, for the sake of clarity, it has been considered necessary to explain something related to the programming language, it has been done using a C-like pseudocode.
--------------------



We will consider that the operation states for the input, output, and processing nodes are the same. The following state machine (Figure 5-1) displays these operation states and the transitions between them.
Figure 5-1


The Init, Run and Exit states correspond to the normal operation of the node. The Error state has been included to provide the machine with a controlled state when errors considered non-recoverable by the error management occur.


We will assume that all the nodes implement this state machine, although the functionality of the states varies depending on the type of node.
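As a reference, the node state machine can be sketched in C as follows. This is a minimal sketch under assumptions of our own –the state names and the main loop are illustrative, not part of the Skeleton’s definition–:

typedef enum { STATE_INIT, STATE_RUN, STATE_EXIT, STATE_ERROR } state_t;

int main(void)
{
    state_t state = STATE_INIT;
    int running = 1;

    while (running) {
        switch (state) {
            case STATE_INIT:
                /* Get resources; on success go to Run, on error go to Error */
                state = STATE_RUN;
                break;
            case STATE_RUN:
                /* Execute the algorithm; on terminate signal go to Exit, on error go to Error */
                state = STATE_EXIT;
                break;
            case STATE_EXIT:
                /* Free resources; then leave the loop and exit the code */
                running = 0;
                break;
            case STATE_ERROR:
                /* Do nothing or almost nothing; ask or wait for help */
                running = 0;
                break;
        }
    }
    return 0;
}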


Next, the functionality of the states is described for the processing node. The C-like functions used were described in the previous post “Channels”. Error management is not addressed.


Two different versions of the Run state are shown. The first corresponds to the cases of 1-D processing and 2-D processing on local memory; the second, to 2-D processing on distributed memory. In the first case, a Redistribution channel is not necessary –since there are no Corner Turns– but in the second case, it is.


The pseudocode of the input and output nodes can be easily derived from the pseudocode presented for the processing node in the first case mentioned above; an illustrative derivation for the input node is sketched right after that Run state.


State Init
/* Get resources; when finished go to State Run */
/* On error, go to Error State */

createChannel(inputDistributionChannel);
createChannel(outputCollectionChannel);


createChannel(inputRedistributionChannel);
createChannel(outputRedistributionChannel);
...

State Exit
/* Free resources; when finished, exit the code */
/* On error, go to Error State */
destroyChannel(inputDistributionChannel);
destroyChannel(outputCollectionChannel);


destroyChannel(inputRedistributionChannel);
destroyChannel(outputRedistributionChannel);

State Error
/* Do nothing or almost nothing. Ask or wait for help */


Next, the Run state corresponding to the cases of 1-D processing and 2-D processing on local memory is described. In the presented pseudocode, note that for the input node, the inputChannel is the channel corresponding to the input link, and for the output node, the outputChannel is the channel corresponding to the output link.



State Run
/* 1-D processing and 2-D processing on local memory*/
/* Execute the algorithm */
/* On signal of terminate, go to Exit State */
/* On error, go to Error State */

/* Get a data buffer received through the input distributionChannel */
get(inputBuffer, inputDistributionChannel);

/* Get an empty buffer from the output collectionChannel to store the results of the process */
get(outputBuffer, outputCollectionChannel);

/* Process the received buffer */
algorithm();

/* Return an empty buffer to the input distributionChannel */
put(inputBuffer, inputDistributionChannel);

/* Send a full buffer through the output collectionChannel */
put(outputBuffer, outputCollectionChannel);
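
As an illustration of the derivation mentioned above, a possible Run state for the input node follows. It is only a sketch: inputChannel is the channel corresponding to the input link, and outputDistributionChannel is an assumed name for the input node’s endpoint of the Distribution channel.

State Run (input node)
/* Get a data buffer received through the input link */
get(inputBuffer, inputChannel);

/* Get an empty buffer from the output distributionChannel */
get(outputBuffer, outputDistributionChannel);

/* Process the received buffer (lightweight section of the algorithm, if any) */
algorithm();

/* Return an empty buffer to the input link */
put(inputBuffer, inputChannel);

/* Send a full buffer through the output distributionChannel */
put(outputBuffer, outputDistributionChannel);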


Next, the Run state corresponding to the case of 2-D processing on distributed memory is described. The pseudocode shown corresponds to an n-stage algorithm with n-1 Corner Turns.


State Run

/* 2-D processing on distributed memory */
/* Execute the algorithm */
/* On signal of terminate, go to Exit State */
/* On error, go to Error State */

switch (stage)
{
    case 0:
        /* Get a data buffer received through the input distributionChannel */
        get(inputBuffer, inputDistributionChannel);

        /* Get an empty buffer from the output redistributionChannel to store the results of the process */
        get(outputBuffer, outputRedistributionChannel);

        /* Process the received buffer applying the first stage (stage 0) of the algorithm */
        algorithmStage(0);

        /* Return an empty buffer to the input distributionChannel */
        put(inputBuffer, inputDistributionChannel);

        /* Send a full buffer through the output redistributionChannel */
        put(outputBuffer, outputRedistributionChannel);
        break;

    case 1:
        get(inputBuffer, inputRedistributionChannel);
        get(outputBuffer, outputRedistributionChannel);

        /* Process the received buffer applying the second stage (stage 1) of the algorithm */
        algorithmStage(1);

        put(inputBuffer, inputRedistributionChannel);
        put(outputBuffer, outputRedistributionChannel);
        break;

    …

    case n-1:
        get(inputBuffer, inputRedistributionChannel);
        get(outputBuffer, outputCollectionChannel);

        /* Process the received buffer applying the last stage (stage n-1) of the algorithm */
        algorithmStage(n-1);

        put(inputBuffer, inputRedistributionChannel);
        put(outputBuffer, outputCollectionChannel);
        break;

    default:
        break;
}


Take into account that, in the case of 2-D processing on distributed memory, Corner Turns have to be performed between stages of the algorithm. Consequently, in the intermediate stages (stages 1 to n-2), the Redistribution channel is both the input and the output channel. However, in the first stage, the input and output channels are the Distribution and the Redistribution channels, respectively, and in the last stage, the input and output channels are the Redistribution and the Collection channels, respectively.
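
The channel selection per stage can also be expressed compactly. The following is an illustrative sketch –channel_t is an assumed type, and the channel names are those of the pseudocode above–:

/* Input and output channels used by a processing node at each stage */
channel_t *stageInputChannel(int stage)
{
    return (stage == 0) ? inputDistributionChannel
                        : inputRedistributionChannel;
}

channel_t *stageOutputChannel(int stage, int n)
{
    return (stage == n - 1) ? outputCollectionChannel
                            : outputRedistributionChannel;
}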










In the writing of this article, Franco Battiato (No Time No Space. Mondi lontanissimi, 1985) has collaborated in an involuntary but decisive way.

Franco Battiato - No Time No Space. Mondi lontanissimi, 1985











--------------------

1. Picture: white-background-cubes-architecture-2432313 by icame | pixabay. Link to the source.



Notes for a multiprocessor skeleton IV

Channels.

Notes for a multiprocessor skeleton IV

Il progresso... Sempre tardi arriva! 1

Alfredo (Cinema Paradiso)


This post provides a high-level description of the different types of Channels required for the Skeleton. These Channels support the Data transferences required by the Architecture model described in the “Notes on SPMD architecture” series; those transferences are described in the post "Data transferences" of that series.


The Channels were previously introduced in the post “Introduction“ of this series. The simple SPMD topology with Data redistribution capability presented there in Figure 1-1 is included below for consistency (see Figure 4-1).

--------------------


The Skeleton has to support the transfers needed by the application. These transfers are performed through channels. The channels are software objects managed by the nodes’ software. They support unidirectional transferences of data blocks from the source node/s to the destination node/s. As mentioned in "Introduction", the channel’s library is either supplied, recommended or supported by the MP machine provider.


Processors, bridge devices (bridges) and memories of the source and destination nodes are involved in the transfer of data. For the transference to be performed in the most convenient way in the different cases addressed in “Data transferences”, the management of the transference may vary in several aspects. These aspects are implemented by means of parameters and are described below.


As a software object, the channel encapsulates the code necessary to perform the transference, offering an interface similar to that of a First-In-First-Out (FIFO) queue of buffers. A FIFO is managed by means of two operations: one to enqueue (insert) a buffer (put(buffer)) and the other to dequeue (extract) a buffer (get(buffer)).
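
As an illustration, the interface suggested by this description could be declared in C as follows. This is a minimal sketch with assumed names; it is not the interface of any real channels’ library.

/* Opaque software objects (assumed types) */
typedef struct channel channel_t;    /* a channel endpoint  */
typedef struct buffer  buffer_t;     /* a data block buffer */

/* Lifetime management (used in “Nodes’ programming model”) */
void createChannel(channel_t *channel);
void destroyChannel(channel_t *channel);

/* FIFO-like interface: insert and extract a buffer */
void put(buffer_t *buffer, channel_t *channel);    /* enqueue */
void get(buffer_t *buffer, channel_t *channel);    /* dequeue; in real C code the
                                                      buffer would be returned or
                                                      passed by reference */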


On the other hand, a channel is defined in the nodes it connects. The source endpoint/s of the channel is/are defined in the source node/s, and the destination endpoint/s in the destination node/s.


Let us consider that the channels internally manage the pools of empty and full buffers used for the data transfer. As stated throughout the “Notes on SPMD architecture” series, each pool must have a minimum number of buffers. Pools of buffers are required at both endpoints, source and destination. These pools are organized as FIFOs, hereinafter referred to as the e-queue (empty buffers queue) and the f-queue (full buffers queue).


Note that both the e-queue and the f-queue can be considered parts of a single pipe. Initially, the pipe consists only of empty buffers. Over time, buffers are first filled and then emptied, but the total number of buffers remains constant. As stated in "Requirements", that number has to be calculated to get the maximum Throughput from the machine and to guarantee no-data-loss. Anyway, for the sake of clarity, we will continue referring to the e-queue and the f-queue as separate FIFOs.


The total number of buffers, as well as their size, are parameters that characterize a channel.

At the source endpoint, the processor uses the e-queue to get buffers to be filled with a data block, and the f-queue to pass the data blocks to be transmitted to the bridge. At the destination endpoint, the processor uses the e-queue to provide the bridge with memory buffers that will be used in the reception of data blocks, and the f-queue to get the received data blocks from the bridge.


As stated in “Requirements”, the channels support a flow control mechanism at data block level in order to guarantee no-data-loss. This means that a transference starts only when the f-queue in the source node and the e-queue in the destination node are not empty.


Depending on the endpoint that starts the transference, we will consider two types of channels: push and pull. In push channels, the data transfer is started by the source node; in pull channels, by the destination node.


The operations put(buffer) and get(buffer) work in different ways depending on whether the channel is push or pull and on the endpoint where they are executed.

In a push channel,
  • In the source node, get(buffer) is used to dequeue a buffer from the e-queue and put(buffer) to queue a buffer in the f-queue.
  • In the destination node, get(buffer) is used to dequeue a buffer from the f-queue and put(buffer) to queue a buffer in the e-queue.
In a pull channel,
  • In the destination node, get(buffer) is used to dequeue a buffer from the f-queue and put(buffer) to queue a buffer in the e-queue.
  • In the source node, get(buffer) is used to dequeue a buffer from the e-queue and put(buffer) to queue a buffer in the f-queue.
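
Note that, as a consequence, the calling sequence is the same at both endpoints; what changes is the internal queue that each operation touches. An illustrative sketch, where fill() and consume() stand for hypothetical application code:

/* Source endpoint */
get(buffer, channel);    /* dequeue an empty buffer from the e-queue       */
fill(buffer);            /* produce a data block                           */
put(buffer, channel);    /* enqueue it in the f-queue for transmission     */

/* Destination endpoint */
get(buffer, channel);    /* dequeue a received data block from the f-queue */
consume(buffer);         /* process the data block                         */
put(buffer, channel);    /* return the empty buffer to the e-queue         */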


At the source endpoint of the channel, a transfer is blocking when the software, after starting the transfer, waits for the data block transmission to be completed before continuing. At the destination endpoint of the channel, a transfer is blocking when the software waits for the data block reception to be completed before continuing.


The choice between the blocking and non-blocking transference options for the channels depends on whether the machine is expected to operate under a steady or non-steady workload (see Appendix “Nodes’ workload over time” of the “Notes on SPMD architecture” series). Certainly, under a steady workload, non-blocking operation does not contribute any advantage and, conversely, can make debugging more difficult. However, under a non-steady workload, non-blocking operation can help to clear the f-queues to the minimum level as fast as possible.

In “Data transferences”, the following types of data transfers among nodes were identified:
  • Data distribution, which are the transfers from the input node to the processing nodes.
  • Data collection, which are the transfers from the processing nodes to the output node.
  • Data redistribution, which are the transfers among the processing nodes.

Next, different types of channels are defined for each type of transference.


In Figure 4-1, the Data distribution channel is represented by the straight arrows that connect the input node and the processing nodes, the Data collection channel by the straight arrows that connect the processing nodes and the output node, and the Data redistribution channel by the gray curved arrow that connects the output and the input of the processing nodes (note that the source nodes’ set and the destination nodes’ set are the same).
Figure 4-1


So, the Data distribution channel consists of one source endpoint and many (N) destination endpoints; the Data collection channel, of many (N) source endpoints and one destination endpoint; and the Data redistribution channel, of many (N) source endpoints and many (N) destination endpoints.


The definition of the source and destination endpoints includes the identification of the source and destination nodes, respectively. This means that, depending on the type of channel, the following information must be specified:
  • Data distribution channel, the source node and the destination nodes’ set
  • Data collection channel, the source nodes’ set and the destination node
  • Data redistribution channel, the source nodes’ set and the destination nodes’ set

A nodes’ set definition includes both the number of nodes that comprise it and the identification of each of those nodes. The order in which the nodes are defined determines the sequence for the rotatory arbitration mechanisms.


In the Data distribution channel, a rotatory arbitration mechanism decides to which destination node the input node transfers the current data block. Similarly, in the Data collection channel, a rotatory arbitration mechanism decides which source node will transfer the current data block to the output node.
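
Such a rotatory arbitration reduces to a round-robin traversal of the nodes’ set in its definition order. A minimal sketch, assuming the set is represented just by its size N:

/* Round-robin selection over a nodes' set of N nodes */
static int next = 0;    /* index of the next node in the defined order */

int selectNode(int N)
{
    int selected = next;
    next = (next + 1) % N;    /* advance on a rotatory basis */
    return selected;          /* index into the nodes' set   */
}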


We will consider that the Data distribution channel is a push channel and the Data collection channel a pull channel.


The Data redistribution channel supports the transposition of the distributed matrix, taking into account its dimensions and the number of processing nodes. As will be seen below, this channel is built using Data distribution channels.


The following figure illustrates the distributed matrix transposition as described in “Data transferences”. Processing node npi (i=1,…,N) holds –before the Corner Turn– the submatrices Ai1,Ai2,..,AiN, and after the Corner Turn the submatrices ATi1,ATi2,...,ATiN.

Figure 4-2

The distributed matrix transposition consists of two operations: the submatrices transposition and the submatrices transferences.


As said in “Data transferences”, the submatrices transposition can be performed either in the source nodes or in the destination nodes. In addition, this transposition can be performed by the processor or by the bridge.


Regarding the submatrices transferences, every source node npi (i=1,…,N) transfers every submatrix to the corresponding destination node using a push Data distribution channel. This Data distribution channel makes use of a two-step arbitration mechanism. In the first step, the rule “first come, first served” is applied. Potential conflicts are solved in the second step using the sequential rotatory mechanism described above for the input node.


Accordingly, note that the transferences will begin as soon as an output buffer is ready, but the destination buffers will not be complete until the transfers from the last source node have been performed.


So, whether the buffer has to be managed as a vector or as a matrix –and, in the latter case, the dimensions of the whole distributed matrix– are also parameters that characterize a channel.
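
Gathering the parameters mentioned throughout this post, a channel configuration could be sketched as follows. The structure and its field names are assumptions made for illustration only.

/* Illustrative channel configuration parameters */
typedef struct {
    int    type;              /* Distribution, Collection or Redistribution   */
    int    push;              /* 1: push channel, 0: pull channel             */
    int    blocking;          /* blocking (1) or non-blocking (0) transfers   */
    int    nBuffers;          /* total number of buffers                      */
    size_t bufferSize;        /* size of each buffer                          */
    int    nSources;          /* source nodes' set: number of nodes...        */
    int   *sourceIds;         /* ...and their ids, in arbitration order       */
    int    nDestinations;     /* destination nodes' set: number of nodes...   */
    int   *destinationIds;    /* ...and their ids, in arbitration order       */
    int    isMatrix;          /* buffer managed as a vector (0) or matrix (1) */
    int    rows, cols;        /* dimensions of the whole distributed matrix   */
} channelConfig_t;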









In the writing of this article, Stevie Nicks (Edge Of Seventeen, White Winged Dove Concert) has collaborated in an involuntary but decisive way.

Stevie Nicks (Edge Of Seventeen, White Winged Dove Concert)



--------------------
1. Progress, always late!

2. Picture: splashing-165192_1280.jpg. PublicDomainPictures | Pixabay. Link to the source.



Notes for a multiprocessor skeleton III

Description.

Notes for a multiprocessor skeleton III

La semplicità è l'ultima forma della sofisticazione1

Leonardo da Vinci


This post describes the Skeleton in terms of structure, processors’ network configuration, and operation.


When, for the sake of clarity, it has been considered necessary to explain something related in some way to the programming language, it has been done in terms of the C language.


--------------------



As was stated in post “Introduction” of this series, the Skeleton is the component of the software of Single Program Multiple Data (SPMD) machines –dedicated to signal and/or image processing applications– that constitutes the logical communication’s infrastructure that connects the processors. According to this, the main functions of the Skeleton are to configure and support the communications of the processors’ network.


The communications of the processors’ network consist of the transferences described in the post “Data transferences“ of the “Notes on SPMD architecture” series. These transferences are supported by means of channels.


The configuration of the communications of the processors’ network consists of the definition of the network topology and the configuration of the channels.



The software corresponding to the channels’ creation, configuration, use, and destruction is part of the Skeleton. These operations are addressed in this series first in the post “Channels” and then in the post “Nodes’ programming model”. The configuration of the communications of the processors’ network is addressed below, and the set of channels’ configuration parameters in the aforementioned post “Channels”.



As was stated in the post “Requirements” of this series, the processors’ network is considered to be static. That means that the network topology is defined only once, when the Skeleton’s code starts to run. Therefore, once the Skeleton code is running, it has to be re-started in order to be re-configured.


Consequently, one of the first things the Skeleton’s code does at start-up is to read the Skeleton’s configuration file (Configuration file). This file contains all the parameters necessary to configure the processors’ network. Although additional parameters might be necessary for a specific machine and/or a specific application, within this work we will only address those parameters corresponding to the channels.


We will consider that the Configuration file resides in some kind of non-volatile storage media (hard disk or memory) accessible by all the nodes, as mentioned in the post “Requirements”.



On the other hand, in the same post it was stated that the SPMD machine application software consists of three executable files: those corresponding to the input, output and processing nodes. Let us name these three files inputNode, outputNode, and processingNode, and consider that they are built from the object files inputNode.o, outputNode.o and processingNode.o. These object files are linked with the Skeleton object files, as well as with the necessary libraries, in order to generate the corresponding executable files. In turn, these object files correspond to the source files inputNode.c, outputNode.c and processingNode.c.


At some point during the machine start-up, those executable files, together with the Configuration file, are allocated to the nodes and made to run. It is considered that the allocation mechanism is provided by the platform, either by the operating system (OS) or by some library. After the Skeleton reads the Configuration file in each node, the nodes’ software becomes parametrized and ready to run.



Note that information about the nodes that constitute the network –at least the number of nodes and the function to be performed– also has to be provided to the allocation process.



Therefore, to support the configuration of the communications of the processors’ network, the Skeleton supports the definition of the variables that parametrize the channels used in the nodes’ code. When the Configuration file is read, values are assigned to those variables. The definition of such variables is carried out in header file/s included in the source files inputNode.c, outputNode.c and processingNode.c, and the reading of the Configuration file is performed by a Skeleton function called from the code of the nodes.
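
For illustration, such a header file and the reading of the Configuration file could look as follows. The file and symbol names are assumptions, and channelConfig_t refers to the illustrative structure sketched in “Channels”.

/* skeletonConfig.h -- variables that parametrize the channels */
extern int             nProcessingNodes;    /* N */
extern channelConfig_t distributionConfig;
extern channelConfig_t collectionConfig;
extern channelConfig_t redistributionConfig;

/* Skeleton function that reads the Configuration file and assigns
   values to the variables above */
int readConfigurationFile(const char *path);

/* inputNode.c, outputNode.c, processingNode.c */
#include "skeletonConfig.h"
...
readConfigurationFile("skeleton.cfg");    /* called at node start-up */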










In the writing of this article, Stanley Clarke, Marcus Miller and Victor Wooten (SMV Concert 2009 - Jazz à Vienne, France) have collaborated in an involuntary but decisive way.
Stanley Clarke, Marcus Miller and Victor Wooten (SMV Concert 2009 - Jazz à Vienne, France)



---------------------

1. Simplicity is the ultimate sophistication.

2. Picture: Based on “Magic cube – cube puzzle play” by Domenic Blair | Pixabay. Link to the source.

Notes for a multiprocessor skeleton II

Requirements.

Notes for a multiprocessor skeleton II

You don't need a framework. You need a painting, not a frame.
Klaus Kinski


This post states the requirements for the Skeleton based on the requirements established for the Architecture model in the post “Conclusion” of the series “Notes on SPMD architecture”.


--------------------

As the Skeleton has to support the Architecture model, it also has to support, like the latter, 1-D and 2-D processing, both in the cases of using Local memory and Distributed memory, as well as in conditions of steady and non-steady nodes’ workload over time.

Therefore, according to the conclusions of the aforementioned post with respect to the Architecture model, the requirements for the Skeleton can be stated as follows:

  • For the Skeleton, it must be possible to define the number of processing nodes.

  • For each node, it must be possible to define the necessary transfers. The different types of transfers are the Data distribution, the Data collection, and the Data redistribution.

  • The Data distribution transfer is started by the source node and the Data collection transfer by the destination node. Both of them have to support flow control at data block level. Also, both of them perform the transfers on a rotatory basis.

  • The Data redistribution transfer is carried out by means of multiple Data distribution transfers. Each node of the source processing nodes’ set performs Data distribution transfers to every node of the destination processing nodes’ set. The node of the source processing nodes’ set to perform the transfer is selected following, in the first instance, the rule “first come, first served”; possible conflicts of coincidence in time are solved, in the second instance, on a sequential rotatory basis.

  • For each transfer, it must be possible to define the necessary number of buffers for both input and output. That number has to be calculated to get the maximum Throughput from the machine and guarantee no-data-loss (see the post “Conclusion” of the “Notes on SPMD architecture” series).
  • The processors’ network is static. That means that the architecture is defined only once, when the code starts to run. Therefore, once the Skeleton code is running, it has to be re-started in order to re-define its configuration.



Additionally, we will also consider that:
  • The SPMD machine application’s software consists of three executable files. Those files are the ones corresponding to the input, output and processing nodes.
  • The SPMD machine has some kind of non-volatile storage media (hard disk or memory) accessible by all the nodes. Its purpose is, among others, to hold the configuration of the machine.








In the writing of this article, King Crimson (Starless, Radical action to unseat the hold of monkey mind) has collaborated in an involuntary but decisive way.
King Crimson - Starless
--------------------

1. Picture: 2904641-digital-art-minimalism-simple-cube-red-white-3d___mixed-wallpapers.jpg. Link to the source.



Notes for a multiprocessor skeleton I

Introduction.

Notes for a multiprocessor skeleton I

Work together, help each other and communicate
Mauricio Pellegrino


The object of this work is to describe the multiprocessor skeleton (the Skeleton) as a component of the software of Single Program Multiple Data (SPMD) machines dedicated to signal and/or image processing applications.

The Skeleton is the logical communication’s infrastructure that connects the processors. This work first addresses the requirements. Next, it addresses the channels –the software objects that support the transferences–, the configuration, and the nodes' programming model. Finally, some points of the presented model are discussed.

This series "Notes for a multiprocessor skeleton" takes as its starting point the series "Notes on SPMD architecture” series. In particular, the Architecture model defined there has been taken as reference for this work.


Within this text, the terms computer and machine are used interchangeably.
--------------------


In order to state the concept of Skeleton, we will make use of the Architecture model described throughout the "Notes on SPMD Architecture" series. A summary of that description is included below.

SPMD machines are parallel processing computers that operate using an identical copy of the program in each processor, and in which each processor acts on different chunks of data. The following figure shows the topology of the Architecture model.
Figure 1-1

This work is centered on SPMD machines dedicated to signal and image processing applications. So, hereinafter, we will use the term “algorithm” instead of “program”.

What the previous figure depicts is, in terms of nodes:
  • A set of N processing nodes (npi, i=1,..,N). Each node runs a “copy” of the algorithm over different blocks of input data.
  • One input node (ni), which manages the input link and distributes the input data to the processing nodes.
  • One output node (no), which collects the processing results and manages the output link.


And in terms of data transfers:
  • Data distribution, which are the transfers from the input node to the processing nodes.
  • Data collection, which are the transfers from the processing nodes to the output node.
  • Data redistribution, which are the transfers among the processing nodes.

Data distribution and collection capabilities are required by the topology itself. Data redistribution functionality is necessary at least to the extent of covering the distributed matrix transposition in the case of 2-D processing on distributed memory (see Appendix “1-D and 2-D Processing implementation” of the “Notes on SPMD architecture” series). In the previous figure, the Data redistribution capability is represented by the curved arrow that connects the outputs of the processing nodes’ set to its inputs.


It may sometimes be necessary for the input node and/or the output node to process the data with some lightweight section of the algorithm (see Figure 1-2 in the post “Introduction” of the “Notes on SPMD architecture” series). In any case, we will continue using the term “algorithm” for the code that runs on the processing nodes, since it will be the heaviest-weight section.


We will consider that the hardware of the MP machine is not linked to the application –so the hardware is reusable for different applications– but rather that it is the software that customizes the machine for each application.


From a physical point of view, an MP computer consists of processor boards connected by a high-speed bus plus input and output interfaces. From a logical point of view, it consists of nodes connected by channels. Nodes support processing, and channels support communications.

As said above, the Skeleton is the logical communication’s infrastructure that connects the nodes. In the previous figure, it is represented by the thick black arrows and the curved gray arrow.


From the perspective of the software implementation, the MP machine application code can be partitioned into two different layers: the Skeleton code layer and the algorithm code layer. The Skeleton layer is supported by the channel’s library and the algorithm layer by the mathematical functions’ library, as is shown in the next figure. The operating system, as well as the channel and mathematical functions’ libraries, is either supplied, recommended or supported by the machine provider.
Multiprocessor machine software layers diagram
Figure 1-2

This approach enables code re-usability and independent code development. Code re-usability is possible because different applications based on different algorithms can make use of the same Skeleton implementation. Independent code development is also possible because Skeleton and algorithm code can be developed independently.



In the writing of this article, Paco de Lucía, John McLaughlin and Al Di Meola (Mediterranean sun dance, Pavarotti & Friends for Ward Child) have collaborated in an involuntary but decisive way.
Mediterranean sun dance. Paco de Lucia, John McLaughlin and Al Di Meola
--------------------

1. Picture: matrix_effect_by_en3rgy16-d4hktcb.png. Link to the source.

2. I want to thank Theresa Curtis for her revision of this text.

Notes on SPMD architecture A III

Nodes’ workload over time.

We may have all come in different ships, but we’re in the same boat now
Martin Luther King Jr



The object of this post is to define the different nodes’ workload conditions addressed in this series.



--------------------


Even though the code that runs in the processing nodes is the same, the Latency and Throughput of the processing nodes are not always constant over time, nor are they the same for all of them. This occurs, for instance, when the algorithm is implemented in such a way that the processing to be performed on the data depends on the value of the data itself. In cases like this, on the one hand, the workload of the nodes (workload) varies over time, and on the other, an imbalance of the workload among the nodes (unbalanced workload) takes place.


By steady workload, we mean that all the processing nodes have the same workload. In addition, the input, processing, and output nodes can have different workloads, but all of them are constant over time.


In real life, the nodes' workload is not constant. Usually, the nodes have more things to do than the processing; for instance, running the operating system. Anyway, once the minimum number of buffers has been calculated under the condition of steady workload, we will continue considering that the nodes work under a steady workload as long as that number of buffers is enough for the machine to work properly.


A non-steady workload over time may affect the number of buffers per node required for ensuring no-data-loss. An unbalanced workload among nodes may affect Data redistribution transferences. Both cases are considered later.










In the writing of this article, Chico Buarque & Elis Regina (Noite dos mascarados) have collaborated in an involuntary but decisive way.
Chico Buarque & Elis Regina - Noite dos mascarados



---------------------
1. Picture: http://remo.diariovasco.com/2009/_images/concurso2.jpg
2. I want to thank Carol G. for her revision of this text.

Notes on SPMD architecture A II

Local and Distributed memory processing.

I‘ll stick at building ships
George Steinbrenner


The object of this post is to describe the relationship between the resources of the nodes and the local and distributed memory processing.




--------------------


We will call the data object on which the algorithm works the algorithm input data unit (input data unit). This data unit cannot be split; it has to be processed as a whole. The algorithm’s result of processing an input data unit is an algorithm output data unit (output data unit). The data unit characteristics include the shape (linear or rectangular), size (number of elements) and data type (integer, floating-point, etc.).


Latency and Throughput are parameters that characterize the performance of a processing machine or node and were introduced in “Introduction”. In terms of data units, node Latency stands for the time delay between an input data unit and the corresponding output data unit –this is the time necessary to process the input data unit and to produce the output data unit (processing time)– and node Throughput stands for the number of data units that the node is able to process per time unit (input and output Throughputs can be different).

Hereinafter, we will refer to the memory and Throughput of the node as the resources of the node.


If the total resources of the set of processing nodes are not enough to perform the required processing in the required time, and the architecture is scalable, the number of processing nodes can be increased until these resources are enough. In real life, the increase in the number of processing nodes is limited by hardware constraints.


Regarding the resources of the node, we will consider two cases. In the first case, the node has enough memory but not enough Throughput to perform the required processing, so the processing is performed on the nodes’ local memory (local memory processing). In the second case, the node does not have enough memory or Throughput, so the processing is performed on the distributed memory of the processing nodes’ set (distributed memory processing).






In the writing of this article, Florence + The Machine (You've Got The Love, Live at the Rivolli Ballroom) have collaborated in an involuntary but decisive way.
Florence + The Machine - You've Got The Love (Live at the Rivolli Ballroom)



---------------------
1. Picture: https://www.naiz.eus/media/asset_publics/resources/000/032/786/news_landscape/zumaia.jpg?1377808057

2. I want to thank Carol G. for her revision of this text.

Notes on SPMD architecture A I


1-D and 2-D processing implementation.

 
If the art of ship-building were in the wood, ships would exist by Nature
Aristotle




The object of this post is to summarize some background information in 1-D and 2-D processing implementation.


--------------------






Both signal and image processing involve 1-D and 2-D mathematical computing (1-D and 2-D processing). 1-D processing is performed over 1-D data objects (vectors), and 2-D processing over 2-D data objects (matrices).


As is well known, a vector A consisting of n elements is said to have a size (or length) of n, and its elements are denoted by Ai (i=1,…,n). A matrix A consisting of m rows and n columns is said to have a size of mxn (abbreviated: an mxn matrix), and its elements are denoted by Aij (i=1,…,m; j=1,…,n), where i is the row number and j the column number. Moreover, matrix transposition is an operation that consists of exchanging rows and columns. So, if AT is the matrix resulting from transposing the matrix A, then Aij = ATji.


On the other hand, in the C programming language (among others), an array consists of a set of elements stored in consecutive memory locations. A vector is implemented as an array of elements, a matrix as an array of rows, and a row as an array of elements. This is the implementation model of vectors and matrices that will be used within this work.


According to the above, and with p being a pointer to the first element of the array, the aforementioned elements Ai, Aij, and ATji can be addressed as p+(i-1), p+(j-1)+(i-1)*n and p+(i-1)+(j-1)*m, respectively.
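
The following small C program checks these addressing rules for a 2x3 matrix; it is only an illustrative verification of the formulas above.

#include <stdio.h>

int main(void)
{
    enum { m = 2, n = 3 };
    int A [m * n];    /* the m x n matrix A             */
    int AT[n * m];    /* its transpose, an n x m matrix */
    int *p = A, *q = AT;
    int i, j;

    /* Fill A and AT so that Aij = ATji = 10*i + j */
    for (i = 1; i <= m; i++)
        for (j = 1; j <= n; j++) {
            *(p + (j - 1) + (i - 1) * n) = 10 * i + j;    /* Aij  */
            *(q + (i - 1) + (j - 1) * m) = 10 * i + j;    /* ATji */
        }

    /* A21 and AT12 must hold the same value (21) */
    printf("A21 = %d, AT12 = %d\n",
           *(p + (1 - 1) + (2 - 1) * n),     /* Aij  with i=2, j=1 */
           *(q + (2 - 1) + (1 - 1) * m));    /* ATji with i=2, j=1 */
    return 0;
}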


2-D processing consists of two 1-D processings, one horizontal and the other vertical (over the rows and the columns of the matrix, respectively). Due to the fact that consecutive accesses to consecutive memory locations are faster than consecutive accesses to non-consecutive memory locations, processing rows is faster than processing columns. For that reason, 2-D processing is implemented as a sequence consisting of horizontal processing + matrix transposition + horizontal processing.
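
A naive sketch of this sequence in C follows; processRow() stands for any 1-D (horizontal) processing and is given a placeholder body.

/* Placeholder 1-D processing applied to one row */
void processRow(float *row, int len)
{
    int k;
    for (k = 0; k < len; k++)
        row[k] *= 2.0f;
}

/* at (n x m) = transpose of a (m x n); both are row-major arrays */
void transpose(const float *a, float *at, int m, int n)
{
    int i, j;
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            at[j * m + i] = a[i * n + j];
}

/* 2-D processing: horizontal processing + transposition + horizontal processing */
void process2D(float *a, float *at, int m, int n)
{
    int i, j;
    for (i = 0; i < m; i++)          /* horizontal processing          */
        processRow(a + i * n, n);
    transpose(a, at, m, n);          /* the matrix transposition       */
    for (j = 0; j < n; j++)          /* vertical processing, performed */
        processRow(at + j * m, m);   /* over the rows of the transpose */
}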






In the writing of this article, Roxette (I remember you) have collaborated in an involuntary but decisive way.
Roxette - I remember you




---------------------

1. Picture: http://www.efdeportes.com/efd196/el-remo-en-educacion-fisica-en-cantabria-trainera-02.jpg

2. I want to thank Carol G. for her revision of this text.


Notes on SPMD architecture VIII

Conclusion.

Crew is life, everything else is just details
Anonymous
 

This post reviews and rewrites the initial requirements for the Architecture model stated in the post "Requirements" in the light of the description and characterization of the model carried out throughout the series.


--------------------




As was stated as an initial requirement in the aforementioned post, three different types of transfers among nodes have to be supported: the Data distribution, Data collection and Data redistribution transferences. The first two are necessary in the case of working on Local memory, and all three when performing 2-D processing on Distributed memory.

Additionally, the initial no-data-loss requirement stated in the aforementioned post has led to a lower-level requirement concerning every Data transference, which consists of supporting flow control at data block level.



On the other hand, as has been shown throughout this series, a minimum number of memory buffers for the data transferences is necessary in order to allow the nodes to work at the maximum Throughput as well as to avoid data-loss.


With respect to achieving maximum Throughput, this number may vary depending on the processing (1-D or 2-D) and on the workload of the nodes over time (steady or non-steady).

Certainly, as was shown in the post "Local and Distributed memory processing in steady workload" of this series, a different number of buffers is necessary for the nodes to achieve maximum Throughput when performing 1-D or 2-D processing.

 
Regarding no-data-loss, as was shown in the previous post "Operation in non-steady workload", memory buffers allow the node to temporarily avoid losing input data when the processor cannot process input data units at the rate required by the input device; similarly, they allow the node to temporarily avoid losing output data when the output device cannot send output data units at the rate required by the processor.



In order to adapt the number of processing nodes to the necessities –in terms of memory and/or Throughput– of the processing, the Architecture model has to be scalable.



According to all the above, the requirements for the Architecture model can, finally, be stated as follows:

  • For the Architecture model, it must be possible to define the number of processing nodes.

  • For each node, it must be possible to define the necessary transfers. The different types of transfers are the Data distribution, the Data collection, and the Data redistribution transfers. All of them have to support flow control at data block level.

  • For each transfer, it must be possible to define the necessary number of buffers for both input and output.













In the writing of this article, Heroes del Silencio (Mar adentro, April 1992 Munich (Germany)) have collaborated in an involuntary but decisive way.

Heroes del Silencio (Mar adentro, April 1992 Munich (Germany))



---------------------

1. Picture: http://www.eldiario.es/norte/cantabria/cultura/Traineras-liturgia-deporte_12_507569240.html

2. I want to thank Carol G. for her revision of this text.