Parallel and Distributed Algorithms (c) 2004-03-16 Fabian M. Suchanek http://www.mpi-inf.mpg.de/~suchanek/personal/texts/summaries/pda.txt This is a summary of the course "Parallel and Distributed Algorithms" held by Dr. Uli Meyer in the WS 2003 at Saarland University. The structure of this summary is as follows: * Prerequisites * Parallel Algorithms * Time and Work * Problems and Solutions * Basic Problems * Sorting and Merging * ANSV * Lists * Trees * Networks * Bus-Networks Parts of this summary have been enriched by help of the nice script provided by Prof. Jop Sibeyn for his lecture on Parallel Algrithms at the University of Halle. See http://www.informatik.uni-halle.de/~jopsi/dpar03/notes_full.shtml By reading the following text, you accept that the author does not accept any responsibility for the correctness or completeness of this text. If you have any corrections or remarks (especially concerning th statements marked with a "(?)"), please mail me at f.m.suchanek@zweb.de (remove the letter 'z') . This is the only way to make the publication of this summary useful for me, too. ====================================================================== Prerequisites ====================================================================== Set: An unordered collection of different elements. S = {s1,...sn} S = { a | p(x) } is the set of all those x for which p(x) holds eps of a set: One of its elements. Notation: eps({x}) := x Size, cardinality of a set: The number of its elements. Notation: |{s1,s2,...sn}| = n Intersection of two sets: The set of those elements which occur in both of them. Notation: X /\ Y Union of two sets: The set of all of their elements. Notation: X \/ Y Difference of two sets: The set of those elements which occur in the first one, but not in the second one. Notation: X \ Y Subset of a set S: A set, the elements of which are all contained in S. Notation: S' =< S Superset of a set S: A set, which contains all elements of S. Notation: S' >= S Tupel: An ordered collection of elements. Notation: t=(t[1],t[2],...t[n]) Example: t=(1,'A',abc) is a tupel Index of an element t in a tuple T: The number of elements preceding t in T. Length, Dimension of a tupel: The number of its elements. Pair: A tupel of dimension 2. Primitive type: A set of finite size. Type: A primitive type or a tuple of types together with a tuple of names. Notation for tuple types: (name1 : type1, name2 : type2, ... ) Example: {1,2,3} is the type of the first three natural numbers (car : Integer, cdr : List) is an integer list type Sequence: A tuple, the elements of which all belong to the same type. Cartesian Product of two sets A and B: The set of all possible pairs, the first element of which is in A and the second element of which is in B. Notation: A x B = C Example: {1,2} x {a,b,c} = { (1,a), (2,a), (1,b), (2,b), (1,c), (2,c) } n-th Power of a set S: The set S^n := S x S x S x ... S where 'S' occurs n times. Relation over two sets A and B: A subset of the cartesian product of A and B. Notation: * (a,b) is an element of the relation R * r(a,b) * a r b * (a,b) is in r Example: The relation ">" is the set of all pairs of real numbers, in which the first number is greater than the second. It holds: ">" = { (1,0), (27.3,12.5), (17,-12), ... } One writes: >(3,2) or 3 > 2 or (3,2) is in ">". Irreflexive relation: A relation R such that not R(a,a) for all a. Symmetric relation: A relation R such that if R(a,b) then also R(b,a). Domain of a relation over A and B: A. Range of a relation over A and B: B. Function: A relation f over two sets A and B, such that for every a in A there is either none or exactly one element b in B with a f b. Notation for f: f:A->B Notation for "a f b": f(a)=b. Result of a function f:A->B for an argument x: The element of B such that f(x)=b. Injective function: A function f:A->B, which maps all arguments to different results. Operation: A function, the domain of which is the cartesian product of two (simple) sets. Notation: a f b := f(a,b) Associative operation: An operation f, for which a f (b f c) = (a f b) f c Commutative operation: An operation f, for which a f b = b f a Real: The set of real numbers. For a detailed definition and associated operations, cf "Analysis" (analysis.txt, German) Integer, Natural: The set of natural numbers. For a detailed definition and associated operations, cf "Analysis" (analysis.txt, German) Order of a function f:Natural->Real: A function g:Natural->Real, such that there is x0 in R and c in R, such that f(x)x0. Complexity, Class of a function g:Natural->Real: A set of functions having the same order as g. Notation for "g:Natural->Real is an element of O(f)": g in O(f) Algorithm: A sequence of instructions. Memory cell: A unit, which can store a value. There are usually infinitely many memory cells. They are numbered. PU, Processing unit: A unit, which can execute instructions. The PUs are usually numbered 0..p. PRAM, Parallel Random Access Machine: A model of a computer, which can execute more than one algorithm at a time. It consists of * a set of p PUs * an unlimited sequence of memory cells // In this summary, the PUs are numbered zero-based, which // simplifies their addressing Address of a memory cell: Its index. Datastructure of type t in a PRAM: * If t is a primitive type, then one memory cell in a PRAM, which can store a value of t. * If t is a tuple of types t[i], then one memory cell, which stores the address of a sequence of memory cells, each of which is a datastructure of type t[i] // This is the model used in Java and FAST Array of length k and type t: The type ([0] : t, [1] : t, ... [k-1] : t). This type is called "array[k] of t". The elements of a corresponding datastructure are referred to as follows: A[i] is the element at position i A[i..j] is the sub-array starting at i and ending at j A[i#n] is the sub-array starting at i of length n A.length is k Often, the word "array" also refers to a datastructure of type array. Partner array of an array datastructure A: An array datastructure of the same length as A. Matrix of dimension m*n and type t: An array of length m and type array of length n of type t. This type is called "array[m][n] of t". Square matrix: A matrix X with X.length==X[1].length. Product of two matrices A und B: The matrix C with C[i][j] = SUM k=1..A[1].length A[i][k]*B[k][j] The product is only defined if A[1].length==B.length. n-th Power of a square matrix A: The matrix A^1 := A A^n := A^(n-1)*A Variable of type t: A name for a datastructure of type t. If a variable occurs in an algorithm, it is to be read as the content of the datastructure. If v is the variable and t is (name[1] : type[1], ...), then the i-th entry of the datastructure is referred to as v.name[i]. Declaring a variable: Introducing a new memory cell or area with a name. Notation: VAR [ : ] [ := ] Procedure: An algorithm with a name and input variables. The input variables have types and are available in the algorithm. All variables declared in the procedure disppear with their memory cells after the procedure has been executed and they may not be accessed from outside. The procedure may "return a result", which must be a value of an indicated type. Since this value has to be stored in a memory cell, returning a value involves a memory-write-process (this will be important later). Notation: FUNCTION ( : , ...) : or PROCEDURE ( : , ...) Notation for input-variables: A, B : t means A : t, B : t A : array[n] of t means VAR n := A.length B : array[n] of t with n introduced means ASSERT B.length==n B : array[>n] of t means ASSERT B.length>=n A : array[n=2^k] of t means VAR n := A.length ASSERT n is a power of two A : array[n=k^2] of t means VAR n := A.length ASSERT n is a square number Procedure call: An instruction, consisting of the name of a procedure followed by values or variables ("parameters"). If this instruction is executed, each parameter, which is a value, is put into a new memory cell. Furthermore, one memory cell is needed to memorize the location of the call in the algorithm. Then, the instructions of the procedure are executed, where the input variables in the procedure refer to the memory cells denoted by the parameters of the call. After having executed the procedure, control returns to the instruction after the call. The value of the call is the result of the procedure, if there is any. In order for a call to be valid, the number and types of the call parameters must match with the input variables of the procedure. Since multiple procedure calls may be excuted in parallel, each call allocates enough memory cells to store the parameters, the call position, the variables declared inside the procedure and all of this for all calls that may be executed by the procedure (!). Notation: (,...) // This description implements a call by reference, other // interpretations are possible. // Note that memory allocation and calculation of required memory // is no problem, since we have unlimited memory and may produce // memory garbage as we like. Recursive procedure: A procedure, which calls itself. Bit: The primitive type {0,1}. Boolean: The primitive type {true,false}. Often, bits are used for booleans. Arithmetic if: The function '?': Boolean x A x A -> A for a set A, which is defined as true ? a : b := a false ? a: b := b Logical or: The function mapping two bits a and b to (a+b) mod 2. Stack of type t: An array of type t with a long length. Initially, the first memory cell contains the value 0. The following operations work on a stack: PROCEDURE push(S : stack, x : t) S[0] := S[0] + 1 S[S[0]] := x FUNCTION pop(S : stack) : t S[0] := S[0] - 1 RETURN S[S[0]+1] FUNCTION top(S : stack) : t RETURN S[S[0]] FUNCTION empty(S : stack) : boolean RETURN S[0]==0 Time of an algorithm for an input size n: The order of the function which calculates the time needed to execute the algorithm, in dependence upon the input size n. One says "The algorithm is in time O(g)". Space of an algorithm for an input size n: The order of the function which calculates the number of cells needed to execute the algorithm, in dependence upon the input size n. One says "The algorithm is in space O(g)". Set function: An associative commutative operation applied to a set. For such an operation f and a set X={x[1],x[2],...x[n]}, one defines f X := x[1] f x[2] f ... f x[n] Usually, the operation symbol is enlarged or upcased. Examples: SUM, MIN, MAX, OR Prefix-XXX of a sequence S: A sequence S' of the same length such that S'[i] = XXX {S[j] | j= is <3,3,2,2> ceil: The function ceil:Real->Integer, so that ceil(x) is the smallest integer, which is not smaller than x. floor: The function floor:Real->Integer, so that floor(x) is the greatest integer, which is not greater than x. Binary representation of an integer i: The sequence binrep(i) of length ceil(log2(i)) with binrep(i)[j] := i div 2^(ceil(log2(i))-j-1) mod 2. Decimal of a binary representation b: The number decimal(b) := SUM {b[i]*2^(length(b)-i-1) | i=0..length(b)-1} iff: if and only if. Lemma, Theorem: A fact. Graph, Linked set: A pair of a set S and an irreflexive relation Edge < S x S, which indicates which element is "linked" to which other elements. It is assumed that S={0,...n-1}. Node of a graph (S,R): An element of S. Edge of a graph (S,R): An element of R. Inverse edge of an edge (s,s'): The edge (s',s). Link of a graph (S,R): A pair (s,s'), such that R(s,s') or R(s',s). Bidirected, Undirected graph: A graph with a symmetric edge relation. Undirected graph of a graph (S,R): The graph (S,R') with R R'(a,b) && R'(b,a). Parent of a node s in a graph (S,R): An element f with R(f,s). Ancestor of a node s in a graph: A parent of s or an ancestor of one of its parents. Child of a node s in a graph (S,R): An element c with R(s,c). Descendant of a node s in a graph: A child of s or a descandant of one of its children. Leaf of a graph: A node without children. Baby descendant of of a node s in a graph: A descendant which is a leaf. Adjacent node, Linked node of a node s: A parent or a child of s. Incoming edge of node s: An edge (s',s). Outgoing edge of node s: An edge (s,s'). Indegree of a node: The number of its incoming edges. Outdegree of a node: The number of its outgoing edges. Degree of a node: The sum of its indegree and outdegree. Path in a graph (V,E): A sequence of nodes , such that v[i] in V and (v[i],v[i+1]) in E. Length of a path: The number of its nodes minus 1. Distance of two nodes in a graph: The length of the shortest path between them. Chain in a graph: A sequence of nodes such that each node is linked to its successor or the successor is linked to the node. Cycle in a graph: A path, the start node and the end node of which are equal. Weighed graph: A graph with a function, which maps each edge to a number, a so-called "weight". Complete graph: A graph (S, S x S). Tree: A graph where each node has exactly one parent, except for one special node, which does not have any parent. List: A graph where each node has exactly one child and on parent except for the last node, which has no children and the first one which has no parent. Undirected tree: An undirected graph of a tree. Root of a tree: The node without a parent. Level of a node in a tree: The number of its ancestors. Level in a tree: A maximal set of nodes with the same level. ====================================================================== Parallel algorithms ====================================================================== Phase: An sub-algorithm, which is to be executed by multiple PUs simultaneously. The following instruction is used: FOR in PARDO [ BY PUs

[,

]] where is a variable of a type t, is a subset of t and

is a set of PU numbers with || =< |

|. If ==

(as it is usually the case), the "BY"-clause may be omitted. The are executed in parallel by each of the PUs of

. In each of these parallel excutions, has a different value, taken from . If |S|>|P| and |S|=<|P|*2, then the additional ",

" indicates that the loop is to be executed twice |P| times. If a variable is declared within a phase, this implicitly creates one variable for each parallel process. // I suppose this is the general form of the PARDO used in the // lecture Parallel algorithm: An algorithm containing phases. The parts of the algorithm outside phases are executed by the first PU available for the algorithm. Sequential algorithm: An algorithm, which is not parallel. RAM, Random Access Machine: A PRAM with one PU. Read: The process of one PU reading the value of one memory cell. Write: The process of one PU writing the value of one memory cell. Access: A read or a write. Exclusive Access: The phenomenon that no two PUs access the same memory cell in one phase. Concurrent Access: The phenomenon that more than one PU may access one memory cell in one phase. Not all PRAMs may allow concurrent access. EREW-PRAM: A PRAM which * can only do exclusive read * can only do exclusive write CREW-PRAM: A PRAM which * can do concurrent read * can only do exclusive write CRCW-PRAM: A PRAM which * can do concurrent read * can do concurrent write Arbitrary concurrent write PRAM: A CRCW-PRAM, where it cannot be said deterministically, which value a memory cell will contain after different PUs have written different values to it in one phase. Common concurrent write PRAM: A CRCW-PRAM, which only allows concurrent writing if the same value is written. Priority concurrent write PRAM: A CRCW-PRAM, in which PUs with a lower number are considered more powerful. If a concurrent write is done, the value of the PU with a lower number is put into the cell. Machine model: A set of all PRAMs, which share the same support for concurrent access. n-is-a-Assumption: The assumption, that the length of an array has a certain property (such as being even, being a power of 2 or being a square number). This can always be achieved by prolonging the array by some irrelevant entries: zeroes, the values EMPTY or INF, depending on the problem. Distinctness-assumption of a sequence S: The assumption that there are no two elements of the sequence, which are equal. On a PRAM, this can easily be achieved by creating a partner array for S, containing the indices 0..S.length-1. Then, we understand each element of S as a pair of the element of S and the corresponding element of the partner array. Thus, all elements are distinct, without their relative order being changed. After having treated the elements, we forget about the partner array. Example: S =<7,8,3,6,2,2,3,9> S' =<0,1,2,3,4,5,6,7> new S =<(7,0),(8,1),(3,2),(6,3),(2,4),(2,5),(3,6),(9,7)> Parallel recursive procedure: A procedure, which executes two parallel calls to itself. Each of the calls is assigned a memory area, which can hold * the parameters, if they are values * the variables declared in the procedure * the location of the call in the algorithm * all of this for all possible calls, which the procedure might execute, including all parallel calls to itself A new notation is used: (,...,P/2) // (,...,P/2) stands for FOR i in {1,2} PARDO BY PUs {P[0],P[P.length/2]} IF i==1 THEN (( | | work | |____________ | | | work | +-------------> +-------------> time time Speeding up a parallel algorithm: Redesigning it so that it needs more PUs, but less running time. It is always useful to speed up an algorithm, because one can simulate an algorithm requiring many PUs on a PRAM with less PUs. If one had a slow algorithm with few PUs, then there would be idle PUs on a PRAM with many PUs without a gain in time. Optimal-work-algorithm: A parallel algorithm, the number of operations of which equals the time order of the sequential algorithm for the problem. This means that if the optimal work algorithm is executed by a simulated PRAM, it takes the same time as the sequential algorithm. But unlike the sequential algorithm, the optimal work algorithm can also be executed on a PRAM. A PRAM with n PUs can execute it n times faster than the sequential algorithm. Compared to other parallel algorithms, the optimal work algorithm is the fastest if simulated on a PRAM with a restricted number of PUs. Optimal speed-up of a parallel algorithm: The redesigning of the algorithm so that it becomes an optimal work algorithm. Good algorithm: An optimal-work-algorithm with a small time complexity. The smaller the time complexity is, the larger the number of PUs is. If a parallel algorithm requires more PUs than provided by a certain PRAM, the other PUs can always be simulated. But if a parallel algorithm requires less PUs than provided by a PRAM, the superfluous PUs would be idle. Thus, a high number of required PUs and a small time complexity guarantees the highest flexibility regarding the choice of the executing PRAM. Reduction-strategy: The following strategy for solving a problem, if there is an efficient algorithm, which requires too many PUs: Make the problem somehow smaller, until it is small enough to be solved by the efficient algorithm. Then undo the steps needed to make the problem smaller, while at the same time maintaining the correctness of the solution. xxxxxxx | xxxxxxx ^ xxxxx | make smaller xxxxx | make big again xxx V xxx | | ^ '----------------------------' solve by efficient algo Recursive-Division-Strategy: The following strategy for solving a problem, if there is an efficient algorithm, which requires too many PUs: Be N(n) the number of PUs required to solve the problem of size n efficiently. Divide the problem into k sub-problems, such that N(k)=n. Solve each of these sub-problems recursively. This yields k results. Now we can apply the efficient algorithm on these k results, since we have n=N(k) PUs. This strategy does not work for all problems and it may require certain pre-processing and post-processing of the data. xxxxx xxxxxx xxxxxx <-- sub-problems, solve recursively \_____|_____/ <-- unite solutions with the efficient V algorithm Sequential-division-strategy: The following strategy for solving a problem, if there is an efficient algorithm, which requires too many PUs: Be N(n) the number of PUs required to solve the problem of size n efficiently. Divide the problem into k sub-problems, such that N(k)=n. Solve each of these sub-problems sequentially by one PU. This yields k results. Now we can apply the efficient algorithm on these k results, since we have n=N(k) PUs. This strategy does not work for all problems and it may require certain pre-processing and post-processing of the data. This only works, if the sequential algorithm is fast enough and if the problem can be divided into two equal steps. xxxxxx xxxxxx xxxxxxxx <- sub-problems, solve sequentially \_______|______/ <- unite using efficient algorithm V Accelerated Cascading: The combination of fast, but work-intensive algorithms with slow but work-efficient algorithms in order to achieve all in all a good time and a good work. Algo-Sequence Theorem: The fact that a sequence of parallel algorithms A[i] with times t[i] and work q[i] can be executed in time O(SUM {t[i]}) with a work of SUM {q[i]}. This is not trivial, because some algorithms may require many PUs, which would be idle for another algorithm, so that the overall work increases. Proof for the theorem: Let t := SUM {t[i]} and q := SUM {q[i]}. We chose p = ceil(q/t) PUs. We slow down the algorithms, which need more than p PUs. The time of the algorithms with less PUs remains the same. The time of the algorithms which are slowed down increases, but at the overall time doubles at most: Suppose all algorithms are slowed down (worst case). Then the total time will be SUM {ceil(p[i]/p)*t[i]} =< SUM {(p[i]/p + 1)*t[i]} = SUM {p[i]*t[i]} / p + SUM {t[i]} = SUM {q[i]} / p + t = q / p + t =< t + t = 2*t Thus, the time class is still O(t). Reduction of problem A to problem B: An algorithm, which translates an A-problem to a B-problem and the solution of a B-problem to a solution of an A-problem. If this reduction can be done in neglectable time (i.e. time O(1)), then this proves: * If B can be solved, then A can be solved * B cannot be solved faster than A (because as soon as we have a fast solution for B, we also have a fast solution for A) ====================================================================== Problems and solutions ====================================================================== ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Basic Problems ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Sum-Problem: The problem of, given n numbers, calculating their sum. This problem can be solved in time O(n) on a RAM: FUNCTION sumSeq(A : array[n] of Integer) : Integer VAR result := 0 FOR i:=0 TO n-1 DO result := result + A[i] RETURN result The problem can also be solved in time O(log(n)) on a EREW-PRAM with n PUs: FUNCTION sumNL(A : array[n=2^k] of Integer, P:array[>n] of PU) : Integer FOR t := 1 TO log2(n) DO FOR i in {1,...n} PARDO BY PUs P IF i/2^t is a natural number THEN A[i] := A[i] + A[i-2^(t-1)] RETURN A[n-1] In this algorithm, each PU looks at one memory cell. A kind of tree is built upon the memory cells: A1 A2 A3 A4 A5 A6 A7 A8 <- array A \ / \ / \ / \ / <- first phase \ / \ / \ / \ / \ / \ / <- second phase \_____ ____/ \/ <- third phase Each of the levels of this tree corresponds to one phase t. In the first phase, each PU at an even position collects the value of its left neighbour. In the second phase, every PU at a position divisible by 4 collects the (already accumulated) value of its left neighbor 2 positions away. This way, the PUs reduce the whole array to one value exponentially fast. The running time is O(log(n)). It is not possible to do it faster: The information of each memory cell has to be gathered. At a maximum, n/2 PUs can gather the information of other PUs. After this phase, the information is necessarily still distributed among n/2 PUs. These need again to gather their information. In one phase, they can achieve an information concentration to n/2/2 PUs -- not more. This continues until all information has been gathered in one PU. If T is the number of phases, it holds: n/2^T=1 <=> T = log2(n) => the problem is in O(log(n)). The problem can also be solved in time O(log(n)) with optimal work on a PRAM with n/log(n) PUs by the sequential-division strategy: FUNCTION sum(A : array[n] of Integer, P : array[>n/log(n)] of PU) : Integer VAR tmpresult : array[n/log(n)] of Integer FOR i in {0,...n/log(n)} PARDO BY PUs P tmpresult[i] := sumSeq(A[i*log(n) # log(n)]) RETURN sumNL(tmpresult,P) Example: With sumNL, if we have 100000 numbers to add up, this takes time O(5) on a PRAM with 100000 PUs. If we only had 100 PUs and simulated the rest, this would take time O(5000). With the optimal work algorithm, it would take time O(5) on a PRAM with 20000 PUs. If we simulate this algorithm with 100 PUS, we are in time O(1000). Or-Problem: The problem of, given a set of n bits, calculating the logical or of them. This problem can be solved in time O(1) on a CRCW-PRAM with n PUs. Algorithm: FUNCTION or(A : array[n] of Bit, S : array[>n] of PU) : Bit VAR result := 0 FOR i in {0,...n-1} PARDO BY PUs S // If my cell is 1, then the result is 1 IF A[i]==1 THEN result := 1; RETURN result In this algorithm, each PU looks at a memory cell. If it sees a 1, it sets the cell "dest" to 1. Thus, it suffices if the array contains one 1. If the problem is to be solved on a EREW-PRAM, then this can be done similarly to the Sum-Problem in time O(log(n)). Similarly to the argumentation for the Sum-problem, it can be proven that the Or-problem cannot be solved faster than O(log(n)) on an EREW-PRAM. Prefix-Sum-Problem: The problem of, given a sequence X of n numbers, calculating the sums x[0], x[0]+x[1], x[0]+x[1]+x[2],... . This problem can be solved in time O(n) by a RAM: PROCEDURE prefixSumSeq(X,result : array[n] of Integer) result[0]:=X[0] FOR i:=1 TO n-1 DO result[i] := result[i-1]+X[i] This problem can be solved in time O(log(n)) with n PUs on a CREW-PRAM: PROCEDURE prefixSumCR(X,result : array[n=2^k] of Integer, P : array[>n] of PU) // Trivial case: Just return the first element IF n==1 THEN result[0]:=X[0] RETURN // Divide the problem in two halves, solve each of them prefixSumCR(X[0..n/2-1],result[0..n/2-1],P/2) // prefixSumCR(X[n/2..n-1],result[n/2..n-1],P/2) // Adjust result of the second half by adding the // sum of the first FOR i in {n/2,..n-1} PARDO BY PUs P result[i] := result[i] + result[n/2-1] This algorithm divides the problem in two halves and calls itself recursively on these two halves. Then, the second half is adjusted by adding the total sum of the first half to each array entry. This algorithm requires a CREW, because all PUs read the cell result[n/2-1] in the last line of the procedure. If the problem is to be solved on a EREW, the following algorithm can be used, which also requires n PUs and O(log(n)) time: PROCEDURE prefixSumL(X,result : array[n] of Integer, P : array[>n] of PU) // Trivial case IF n==1 THEN result[0]:=X[0] RETURN // Get two temporary arrays VAR X' : array[n/2] of Integer VAR result' : array[n/2] of Integer // Pre-calculate sums of pairs of X-elements to X' FOR i in {0,..n/2-1} PARDO by PUs P X'[i]:=X[i*2]+X[i*2+1] // Solve problem on the temporary array prefixSum(X',result',P) // Explode temporary results to real results FOR i in {0,..n/2-1} PARDO by PUs P result[i*2+1]:=result'[i] FOR i in {0,..n/2-1} PARDO by PUs P result[i*2]:=result[i*2-1]+X[2*i] The problem can also be solved optimally in time log(n) by n/log(n) PUs on a EREW-PRAM by help of the sequential-division strategy: PROCEDURE prefixSum(A,result : array[n] of Integer, P : array[>n/log(n)] of PU) VAR C,C' : array[n/log(n)] of Integer FOR i in {0,..n/log(n)-1} PARDO BY PUs P prefixSumSeq(result[i*log(n) # log(n)], C[i*log(n) # log(n)]) C[i]:=result[(i+1)*log(n)-1] prefixSumL(C,C',P) FOR i in {1,..n/log(n)-1} PARDO BY PUs P FOR j:=0 TO log(n) DO result[i*log(n)+j] := result[i*log(n)+j] + C'[i-1] We divide the problem into n/log(n) groups of log(n) elements. We solve the prefix-sum-problem on each of them sequentially (time O(log(n))), storing the results in the corresponding cells of C. Now, we solve the prefix-sum-problem on the last values of these group prefix-sums by the standard algorithm (time O(log(n/log(n)))) and have the results placed in C'. Last, we update each group by adding the corresponding value of C'. This updating involves one PU per group, running over all the log(n) elements of the group. Thus, we have a prefix-sum-Algorithm with n/log(n) PUs and time O(log(n)). A = |--log(n)--| |--log(n)--| |--log(n)--| ... |--log(n)--| result = prefixsum prefixsum prefixsum prefixsum C = x1 x2 x3 xm Prefix-XXX-Algorithm: The prefix-sum-algorithm, used with the associative function XXX. The prefix-sum-Algorithm works with every associative operation, not just the sum. Start-Bit-Problem: The problem of, given a sequence of n Bits of the form S=<0,0,...0,1,1,...1>, finding the position of the first 1-Bit. The problem can be solved with n PUs in time O(1) on a EREW-PRAM by the following algorithm: PROCEDURE startBit(X : array[n] of Bit, P : array[>n] of PU) : Integer IF X[0]==1 THEN RETURN 0 FOR i in {1,...n-1} PARDO BY PUs P IF X[i]==1 && X[i-1]==0 THEN RETURN i RETURN n In this algorithm, each PU looks at one array cell. If its own cell is set, whereas its left neighbor is not, then the PU knows it must be the first one. Sorted sequence of a sequence S of natural numbers: A sequence, which contains all numbers of S and in which each elements is smaller than its successor. First-Bit-Problem: The problem of, given a sequence of n bits, determining the index of the first bit set to 1. This problem can be solved by a CRCW-PRAM with n PUs in time O(1): PROCEDURE _firstBit(A : array[n] of Bit, P : array[n^2] of PU) : Integer VAR M : array[n][n] of Bit FOR (i,j) in {(x,y) | 0=n] of PU) : Integer // One C[i] for each group of sqrt(n) elements VAR C : array[sqrt(n)] of Integer // calculate the logical or for each group FOR i in {0,..sqrt(n)-1} PARDO BY PUs P C[i] := or(A[i*sqrt(n) # sqrt(n)], P[i*sqrt(n) # sqrt(n)]) VAR firstGroup := _firstBit(C[i],P) IF firstGroup==-1 THEN RETURN n RETURN _firstBit(A[firstGroup*sqrt(n) # sqrt(n)],P) The first procedure finds the first 1-bit in an array A, if the number of PUs is the square of the number of bits. This is done by having the PUs evaluate a function on each pair of bits and store the results in a matrix. The function is designed so that the row with a 1 is the row of the first bit set. Its index is returned. The main procedure divides the problem into sub-groups of sqrt(n) elements. Then, a logical or is performed on these groups, thus outing the group which must contain the first 1-bit. This group is then examined by the helper procedure. Min-Problem: The problem of, given a set of n numbers, calculating the smallest among them. It can be solved in time O(1) with n^2 PUs on a CRCW-PRAM by the following algorithm: FUNCTION min(A : array[n] of Integer, P : array[>n^2] of PU) : Integer // Each PU will have a cell in matrix B VAR B : matrix[n][n] of Integer; // Each PU compares 2 numbers, sets B[i][j] to the result FOR (i,j) in {(x,y) | 0= A[j] THEN B[i][j]:=1; ELSE B[i][j]:=0; // Do a logical or on each of the lines of the matrix B FOR i in {0,...n-1} PARDO BY PUs P IF or(B[i][0..n-1], P[i*n # n])==0 THEN RETURN A[i] In this algorithm, the PUs are arranged in a matrix. A PU at position i,j compares the two numbers X[i] and X[j]. If X[i] is greater than X[j], it sets B[i][j] to 1. If one line does not contain a 1, then this means that the corresponding number is not greater than any of the other numbers, it must be the minimum. In order to find this line, a logical or is done on each of the lines. The problem can also be solved in time O(1) by n PUs on a CRCW-PRAM, if the range of the numbers is known to be 0...m: FUNCTION minCW(X : array[n] of Integer, m : Integer, P : array[>n] of PUs) : Integer ASSERT m is an upper bound for the values of X ASSERT m=. By , the smallest non-empty bucket can be determined. Within the numbers of this bucket, the procedure is repeated. The problem can also be solved by n PUs on a CRCW-PRAM in time O(log(log(n))) using the recursive-division-strategy: FUNCTION fastMinCW(X : array[n] of Integer, P : array[>n] of PUs) : Integer IF n<4 THEN // Trivially compute the minimum RETURN min(X) // Subdivide into floor(sqrt(n)) groups VAR k := floor(sqrt(n)) // Each has a size ceil(sqrt(n))+1 VAR s := ceil(sqrt(n))+1 VAR results : array[k] of Integer // Solve each of the groups recursively FOR i in {0,..k} PARDO BY PUs P result[i] := fastMinCW(X[i*s # s], P[i*s # s]) // Return the (constant time) minimum of the results RETURN min(result,m,P) This algorithm divides the problem into sqrt(n) groups. Each of the groups is solved recursively. Then, there are enough PUs to use the constant-time-algorithm on the results. The groups have a size of ceil(sqrt(n))+1 elements, because (ceil(sqrt(n))+1)*floor(sqrt(n))>n. The time needed by this algorithm is time(n) = (n<4) ? c : time(ceil(sqrt(n))+1) + c for some constant c. For n>16, it holds that ceil(sqrt(n))+1 =< sqrt(n) + 2 =< 2*sqrt(n) =< n^(1/4)*n^(1/2) = n^(3/4) Thus, the n passed down in the recursive equation of time(n) is taken to the power of 3/4 each time. How often can n be taken to the power of 3/4, before the trivial size 4 is reached? n^(3/4)^(3/4)^... = 4 --------k------- <=> n^((3/4)^k) = 4 <=> log(4)/log(n) = (3/4)^k <=> log(log(4)/log(n)) / log(3/4) = k <=> (log(log(4)) - log(log(n)) ) / log(3/4) = k <=> (log(log(n)) - log(log(4)) ) / log(4/3) = k <=> (log(log(n)) - constant) / constant = k Thus, the number of recursive calls is in O(log(log(n)). This is a very common proof-pattern, showing that an exponential reduction in problem size results in a double-log running time. Root-Problem: The problem of, given a natural number n, calculating n to the power of a/b, where a and b are integers. This problem can be solved in time O(1) by a CREW-PRAM with n PUs: FUNCTION pow(base, exp : Integer) : Integer // Calculates base^exp IF exp == 0 THEN RETURN 1 ELSE RETURN base*pow(base, exp-1) PROCEDURE root(n,a,b : integer, result : array[2] of Integer, P : array[>n] of PU) // Calculates ceil(n^(a/b)) and floor(n^(a/b)), stores // these values in result[0] and result[1] VAR na = pow(n,a) FOR i in {1,..,n} PARDO BY PUs P VAR ib=pow(i,b) IF ib < na && pow((i+1),b) > na THEN result[0] := i result[1] := i+1 RETURN IF ib == na THEN result[0] := i result[1] := i RETURN In this algorithm, each PU looks at a possible candidate for n^(a/b). If the candidate i fulfills the equation n^a==i^b, then i must be the right one. Multiple-Prefix-Problem: The problem of, given a sequence of sequences of numbers, calculating the prefix-sums for each of them. This problem can be solved on a EREW-PRAM with n PUs in tim O(log(n)), where n is the total number of numbers. In order to do so, we assume that all sequences are contained one after the other in one big array X. Additionally, we have a partner array B, which contains a 1 whenever the corresponding element in X starts a new sequence. Then, we define a new operator, which takes two pairs, each consisting of an element of X and an element of B: (x1,b1) msum (x2,b2) := (b2=1) ? (x2,1) : (x1+x2,b1) Within a sequence (B[1]=1, B[i]=0 for i>1), msum behaves like the ordinary sum. When a border to a new sequence is spanned, msum ignores the element from the old sequence and returns the first element of the new sequence. The operator msum is associative: (x1,b1) msum ((x2,b2) msum (x3,b3)) = (x1,b1) msum ((b3=1) ? (x3,1) : (x2+x3,b2)) = (b3=1) ? (x3,1) : ( (b2=1) ? (x2+x3,1) : (x1+x2+x3,b1) ) ((x1,b1) msum (x2,b2)) msum (x3,b3) = ((b2=1) ? (x2,1) : (x1+x2,b1)) msum (x3,b3) = (b3=1) ? (x3,1) : ( (b2=1) ? (x2,1) : (x1+x2+x3,b1) ) Now, we can apply the standard prefix-sum-algorithm with msum and the prefix-sums will be computed for each sequence. PROCEDURE multiplePrefix(A, B, result : array of Integer, P : array of PU) // Solves prefixSum on all sub-sequencs of A prefixMSUM((A,B),result,P) Segmented-Broadcast-Problem: The problem of, given a sequence X of n cells, some of which are empty, spreading the values of the non-empty cells to the empty cells to the right. This problem can be solved in time O(log(n)) on a EREW-PRAM. It can be reduced to the Multiple-Prefix-Problem. We need a partner array B, which contains a 1 for every non-empty cell of X. Thus, each sequence of one non-empty cells and the following empty cells is regarded as a sub-sequence of X. The prefix-sums within each of these sub-sequences will fill up the empty cells with the value of the non-empty cell: PROCEDURE segmentedBroadcast(A, B, result : array[n] of Integer, P : array[>n] of PU) FOR i in {0,..n-1} PARDO BY PUs P IF B[i]==0 THEN A[i]:=0 multiplePrefix(A,B,result,P) Prefix-Min-Problem: The problem of, given a sequence S of n numbers, calculating min({S[0]}), min({S[0], S[1]}), min({S[0], S[1], S[2]}), ... . Since min is an associative operation, this problem can be solved in time O(log(n)) with n PUs on a EREW-PRAM by a modified prefixSum- Algorithm. It can also be solved in time O(1) by n^3 PUs on a CRCW-PRAM by the following algorithm: PROCEDURE prefixMinCW(A, result : array[n] of Integer, P : array[>n^3] of PU) FOR i in {0,...n-1} PARDO BY PUs P result[i] := minCW(A[0..i], P[i*n^2 # n^2]) In this algorithm, n^2 PUs calculate the minimum of A[0..n-1]. Other (n-1)^2 PUs calculate the minimum of A[0..n-2]. The next (n-2)^2 PUs calculate the minimum of A[0..n-3] and so on. This algorithm does not make use of all of the n^3 PUs. The problem can also be solved on a CRCW-PRAM with n PUs in time O(log(log(n))) by the recursive-divsion- strategy: We divide the problem into n^(1/3) sub-parts of size n^(2/3). Each of these sub-problems is solved recursively. Now, n PUs can solve the prefix problem on the n^(1/3) total minima of the sub-problems. Last, the local results of the sub-problems have to be adjusted on a one-by-one basis: PROCEDURE fastPrefixMin(A, result : array[n] of Integer, P : array[>n] of PU) // Trivial case IF n =< 8 THEN result := trivialPrefixMin(A) RETURN // Subdivide the array into n^(1/3) arrays of length // n^(2/3) each. Compute the prefix minima for each of these // parts, store them in the corresponding result-cells VAR n13 = floor(n^(1/3)); VAR n23 = ceil(n/n13); FOR i in {0,...n13} PARDO BY PUs P prefixmin2(A[i*n23 # n23], result[i*n23 # n23], P[i*n23 # n23]) // Now copy the last minimum of each of the sub-problems // to an array VAR tmp : array [n13] of Integer; FOR i in {0,..n13} PARDO BY PUs P[0..n13] tmp[i] := result[i*n23]; // Calculate the prefix minima of these last minima VAR tmpresult : array [n13] of Integer; prefixmin(tmp,tmpresult,P); // Look at each number in the result. If a number is larger // than the preceding minimum, cut it. FOR i in {n23,..n} PARDO BY PUs P[n23..n-1] IF result[i]>tmpresult[i div n23] THEN dest[i]:=tmpresult[i div n23]; This algorithm is in time O(log(log(n))). Proof: If n=<16, then the algorithm calculates: * n13=2 * n23=8 * each of the 2 subproblems of size 8 in constant time * the prefixmin of 2 elements in constant time * the array update in constant time * --> everything in constant time. For n>16, recursion is taken into account, leading to: time(n) = (n=<16) ? c : time(ceil(n/floor(n^(1/3)))+c where c is some constant. For n>16, we have: ceil(n/floor(n^(1/3)) < n/floor(n^(1/3)) + 1 < n/(n^(1/3)-1) + 1 < n/(n^(1/3)-n^(1/3)/2) + 1 < n/n^(1/3)*2 + 1 < n^(2/3)*2 + 1 < n^(2/3)*2 + n^(2/3)/6 < n^(2/3)*(2+1/6) < n^(2/3)*n^(2/7) = n^(20/21) Thus, the problem-size decreases with each by taking the previous problem size to the power of 20/21. Similarly to the proof for fastMinCW, this entails the recursion depth to be in O(log(log(n)). An alternative proof is the following: Assume n to be a cube of a natural number. This can always be achieved by adding zeroes to the sequence. Then, n13=n^(1/3) and n23=n^(2/3). Then the problem size decreases in each recursion step by taking it to the power of 2/3. As it has been shown for fastMinCW, such a reduction results in a running time of O(log(log(n)). Matrix power problem: The problem of, given a square matrix A with A.length==n and a natural number k, calculating A^k. This problem can be solved on a CREW-PRAM with n^3/log(n) PUs in time O(log(k)*log(n)) by the following algorithm: PROCEDURE matrixProduct(A,B,result: array[n][n] of Integer, P : array[>n^3/log(n)] of PU) VAR M : array[n][n][n] of Integer FOR (x,y) in {(x,y) | 0=n^3/log(n)] of PU) IF k==1 THEN RETURN A VAR Ak2 : Matrix[n][n] matrixPower(A, Ak2, floor(k/2), P) IF k mod 2 == 0 THEN matrixProduct(Ak2, Ak2, result, P) ELSE VAR tmp : array[n][n] of Integer matrixProduct(Ak2, Ak2, tmp, P) matrixProduct(tmp, A, result, P) This algorithm first defines how to compute the product of two matrices: The products of all couples of entries of both matrices are calculated and stored in a cube M. This can be done in time O(log(n)) if every PU takes care of log(n) of the n^3 couples. Then, the lines of the cube are summed up to form the result matrix. There are n^2 lines of n elements to be summed up, which can be done in time O(log(n)) and n^3/log(n) PUs. The main procedure relies on the fact that A^k = A^(k/2)*A^(k/2) for even k = A^floor(k/2)*A^floor(k/2)*A for odd k Since k is divided by two in each recursion and each recursion takes time log(n), the overall time is O(log(n)*log(k)). ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Sorting and merging ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Sorting-problem: The problem of, given a sequence of n natural numbers, finding their sorted sequence. This problem can be solved by a RAM in time O(m+n), if m is an upper bound for the sequence elements: PROCEDURE bucketSort(S, result : array[n] of Integer, max : Integer) ASSERT all numbers of S are positive ASSERT max is an upper bound of the elements of A // Buckets count occurences of numbers bucket : array[max] of Integer :=<0,0,...0> FOR i := 0 TO n-1 DO bucket[S[i]] := bucket[S[i]] + 1 // Collect buckets to form the ordered sequence resultIndex := 0 FOR bucketIndex := 0 TO max-1 DO WHILE bucket[bucketIndex] > 0 DO result[resultIndex] := bucketIndex bucket[bucketIndex] := bucket[bucketIndex] - 1 RETURN result This algorithm has one bucket for each number which may appear in S. It the goes through all the numbers of the set and throws them into the corresponding buckets (by increasing a counter). Then, it goes through all the buckets and constructs the sorted sequence. This problem can also be solved by n^2 PUs on a CREW-PRAM in time O(log(n)) by the following algorithm: PROCEDURE sortCW(A, result : array[n] of Integer, P : array[>n^2] of PU) ASSERT All elements of A are distinct // Each PU will have a cell in matrix B VAR B : matrix[n][n] of Integer; // Each number has a cell in array P, storing the final // position in the sorted sequence VAR P : array [0,..n-1] of Integer; // Each PU compares 2 numbers, sets B[i][j] to the result FOR (i,j) in {(x,y) | 0= A[j] THEN B[i][j]:=1; ELSE B[i][j]:=0; // Sum up the number of smaller elements FOR i in {0,..n-1} PARDO BY PUs P P[i] := sum(B[i][0..n-1],P[i*n # n]) // Distribute the numbers according to their position FOR i in {0,..n-1} PARDO BY PUs P result[P[i]]:=i RETURN result Similar to the min-Algorithm, each PU looks at a pair of numbers. Each line in the matrix B has a 1 for each number, which is smaller than the number of the line. If one sums up each line, one gets the final position of the number in the sorted sequence. The problem can also be solved by n^2 PUs on a EREW-PRAM in time O(log(n)): PROCEDURE sort(A, result : array[n] of Integer, P : array[>n^2] of PU) ASSERT All elements of A are different // Arrays for duplicating the data horiz : array[n][n+1] of Integer vert : array[n+1][n] of Integer segB : array[n+1] of Integer := <1,0,0,...> // Distribute the data in horiz & vert FOR i in {0,..n-1} PARDO BY PUs P horiz[i][0]:=i segmentedBroadcast(horiz[i][0..n],segB,P[i*n # n]) vert[0][i]:=i segmentedBroadcast(vert[0..n][i],segB,P[i*n # n]) First, we duplicate the data by segmented broadcasting into two arrays. Then, the sortCW can run on these data without producing concurrent reads. The problem can also be solved by n/(m+log(n)) PUs in time O(m+log(n)) on a EREW-PRAM, where m is an upper bound: PROCEDURE sortOpt(A, result : array of Integer, m : Integer, P : array[>n/(m+log(n))] of PU) ASSERT All elements of A are different ASSERT m is an upper bound of the elements of A ASSERT All elements of A are positive // Divide array into groups of size m' VAR m' := m+log(n) // Each group has a PU VAR p := ceil(n/m') // Do a bucket sort on each group bucket : array[p][m] of Integer := <0,0,...0> FOR i in {0,..P.length-1} PARDO BY PUs P FOR j := 0 TO m'-1 DO bucket[i][A[j]] := bucket[i][A[j]] + 1 // Sum up the buckets for each number VAR sumBucket : array[m] of Integer FOR i in {0..m} PARDO BY PUs P sumBucket[i] := sum(bucket[0..p-1][i], virtualPUs) // Calculate the prefix sum VAR prefixBucket : array[m] of Integer prefixSumSeq(sumBucket,prefixBucket) // Collect buckets to form the ordered sequence FOR i := 0 TO m-1 DO result[prefixSum[i]] := i This algorithm does a bucket sort on each group of m'=m+log(n) elements. Then, it adds up the entries of all buckets for the number 1, the entries of all buckets for the number 2 etc. This gives the total number of occurences of each number. Creating all these sums would normally require m*n/m' PUs and time O(log(n/m')). But the overall work is O(m*n/m')n/(m+log(n))] of PU) // Sorts the elements of A to result VAR n := A.length ASSERT P.length>=n/(m^eps+log(n)) ASSERT All elements of A are different ASSERT m is the maximum of A+1 ASSERT All elements of A are positive FOR digit := 0 TO 1/eps DO // Call a modified , which sorts on // x div (m^eps)^digit mod m^eps sortOpt_(A, m^eps, result, P) // Now copy the result back to A FOR i in {0,...n/(m^eps+log(n))-1} PARDO BY PUs P FOR j:=0 TO (m^eps+log(n)) DO A[i*(m^eps+log(n))+j] := result[(m^eps+log(n))+j] This algorithm expresses the numbers to be sorted by a basis of m^eps. If eps is for example 0.5, then all numbers x are split up into two digits a and b like x = a * m^0.5 + b Then, a kind of radix-sort is done: We do a bucket sort on each of the "digits", starting fom the rightmost one. These bucket-sorts run much faster than a normal bucket sort, because each of them only has m^eps buckets instead of m. We need 1/eps of these bucket sorts. // For details of normal radix-sort, cf "Algorithms and // Datastructures" (ads.txt) Rank of a number in a sorted sequence: The number of elements in the sequence, which are smaller than the given number. Rank-Problem: The problem of, given a sorted sequence of n numbers, determining the rank of another given number. The problem can be solved in time O(log(n)) on a RAM: FUNCTION binSearch(X : array[n] of Integer, k : Integer) : Integer // Trivial cases IF n==0 THEN RETURN 0 IF n==1 THEN RETURN k>X[0]?1:0 // Peek element in the middle, recurse on interesting half IF X[n/2]>=k THEN RETURN binSearch(X[0..n/2],k) ELSE RETURN binSearch(X[n/2+1..n-1],k)+n/2 This algorithm always looks at the element in the middle of X. Thus, it can determine, which half of X should be analyzed next, dividing the problem size by 2 each time. The problem can also be solved with n PUs in time O(1) on a CREW-PRAM by the following algorithm: FUNCTION rankN(X : array[n] of Integer, k : Integer, P : array[>n] of PU) : Integer IF X[0]>=k THEN RETURN 0 FOR i in {1,...n-1} PARDO BY PUs P IF X[i]>=k && X[i-1]X[0]?1:0 // Partial results for each processor VAR r : array[p] of Boolean // Peek at positions i*n/p FOR i in {1,..p} PARDO BY PUs P r[i] := (X[i*ceil(n/p)-1] >= k) // Check out which group is most interesting VAR where := startBit(r,P) RETURN where*ceil(n/p)+ rank(X[where*ceil(n/p) # ceil(n/p)],P) This algorithm first checks the positions 0,n/p,2*n/p... of the array to find the area where the true rank must be hidden. Then, it calls itself recursively to analyze this area of interest, until it is trivial. Each time, the problem size is divided by p and thus, the running time is in principle in O(log(n)/log(p)). Since n is not always dividable by p, the group size has to be rounded up, resulting in a running time of O(log(n+1)/log(p+1)). This algorithm is speeded up optimally for any constant number of PUs, because log(n+1)/log(constant#PU+1) * constant#PU <=> log(n+1) * constant in O(log(n)) which is the running time of the sequential binary search algorithm. This is also the best possible running time: In the first step, any algorithm can at most look at p cells. This divides the numbers into any segments of size n/p. These segments need again to be inspected and again, only p cells can be analyzed at a time. It takes time O(log(n)/log(p)) until the p processors have cut down the interesting area (where the rank might be hiding) to p. Merging-Problem: The problem of, given two sorted sequences, creating a sorted sequence, which contains all of their elements. This problem can be solved on a RAM in time O(n+m), where n and m are the lengths of the two sequences: PROCEDURE mergeSeq(X,Y,result : array[k] of Integer) ASSERT X and Y are ordered VAR m := 0 VAR n := 1 FOR i := 0 TO result.length DO IF X[m]m+n] of PU) ASSERT X and Y are ordered, all elements are distinct FOR i in {0,..m+n-1} PARDO BY PUs P IF i < m THEN result[i+binSearch(Y,X[i])] := X[i] ELSE result[i-m+binSearch(X,Y[i-m])] := Y[i-m] In this algorithm, each element of a sequence determines its rank in the opposite sequence. Then, it can compute its rank in the result sequence as the sum of its own rank and its rank in the opposite sequence. The problem can also be solved on a CREW-PRAM with 2*m*n PUs in time O(1), if the call to "binSearch" is replaced by "rankN": PROCEDURE mergeN*M(X : array[n] of Integer, Y : array[m] of Integer result : array[n+m] of Integer, P : array[>m*n] of PU) ASSERT X and Y are ordered, all elements are distinct FOR i in {0,..m+n-1} PARDO BY PUs P IF i < m THEN result[i+rankN(Y,X[i],P[i*n # m])] := X[i] ELSE result[i-m+rankN(X,Y[i-m], P[m*n+(i-m)*n # n])] := Y[i-m] The problem can also be solved with 2*n PUs on a CREW-PRAM in time O(log(log(n))). It is assumed that the length of both arrays is n. PROCEDURE merge2*N(X : array[n=k^2] of Integer, Y : array[n] of Integer result : array[2*n] of Integer, P : array[>2*n] of PU) ASSERT X and Y are ordered, all elements are distinct // Build groups of sqrt(n) in X and in Y // have each group end find its rank in the opposite // sequence rankOfXinY : array[sqrt(n)] of Integer rankOfYinX : array[sqrt(n)] of Integer FOR i in {0,..sqrt(n)-1} PARDO rankOfXinY[i] := rank(Y,X[i*sqrt(n)]) rankOfYinX[i] := rank(X,Y[i*sqrt(n)]) FOR i in {1,..sqrt(n)-1} PARDO BY PUs P merge2*N(X[startInX..endInX-1], Y[startInY..endInY-1], result[startinX+startinY...], P[i*sqrt(n)#sqrt(n)]) This algorithm has the elements at positions sqrt(n), 2*sqrt(n), 3*sqrt(n),... in X determine their rank in the sequence Y. Similarly, the elements at positions sqrt(n), 2*sqrt(n), 3*sqrt(n),... in Y determine their rank in the sequence X. The parallelogram engendered by one rank pointer from X and one from Y (or two from X or two from Y) defines a new sub-problem. Then the algorithm is called recursively for these subproblems. The solution is the concatenation of all the sub-solutions. Since the problem size is taken to a power of 1/2 each time, the running time is O(log(log(n)). The problem can also be solved with optimal work in time O(log(log(n))) with O(n/log(log(n))) PUs by the sequential division strategy, if only n/log(log(n)) elements determine their rank and the groups are merged sequentially. The work is then O(n). ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ANSV ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Right dominator of an index i in a sequence S: The element of S, which * has an index greater than i * is greater than S[i] * has the smallest index (i.e. it is the next number from i) Example: S = 7 6 1 5 2 8 3 4 index: 0 1 2 3 4 5 6 7 The right dominator of position 3 is 8 ANSV-Problem, All Nearest Smaller Values: The problem of, given a sequence of n integers, determining for each of them the next right value, which is smaller, and the next left value, which is smaller. Here, the problem is simplified as follows: Calculate for each number its right dominator. By inversing the sequence order and negating the elements, the general ANSV-problem can be reduced to the simplified version. Example: Sequence: 4 2 3 1 4 RightDom: - 3 4 4 - Visualization: | .~~>| The dominator of 3 is 4 |_|_|_._| Sequence: 4 2 3 1 4 The problem can be solved in time O(n) on a RAM: PROCEDURE ansvSeq(A, result : array[n] of Integer) VAR s : Stack of (value:Integer,position:Integer) push(s,(+oo,-1)) FOR i := 0 TO n-1 DO result[i] := -1 IF top(s).value > A[i] THEN push(s,(A[i],i)) ELSE WHILE top(s).value < A[i] DO result[pop(s).position] := A[i] This algorithm runs through downward sub-sequences and pushes their elements. As soon as a step upwards is encountered, the right dominators of all smaller elements met so far are set. Since every element is pushed and popped only once, the running time is O(n). The problem can also be solved on a CREW-PRAM with n PUs in time O(log(n)): PROCEDURE ansvN(A, rightDom : array[n=2^k] of Integer, P : array[>n] of PU) ASSERT all elements of A are distinct // Calculate a binary tree of maxima maxima : array[log(n)][n] of Integer FOR i in {0,..n-1} PARDO BY PUs P maxima[0][i] := A[i] FOR level := 1 TO log(n)-1 DO FOR i in {0,..n/level^2-1} PARDO BY PUs P maxima[level][i] := max(maxima[level-1][i], maxima[level-1][i+1]) // Set each PU on one leaf, have it find its dominator FOR i in {0,..n-1} PARDO BY PUs P VAR pos := i VAR level := 0 // Go up in the tree until we meet a node from the // left, which has a greater value on the right WHILE levelmaxima[level][pos+1]) DO level := level + 1 pos := pos / 2 // No right dominator found IF level==log(n) THEN rightDom[i] := -1 ELSE // Now go down, keep left, but do not visit nodes // smaller than A[i] pos := pos + 1 WHILE level != 0 DO level := level - 1 pos := pos * 2 IF maxima[level][pos]). Example: 8 level log(8)=3 6 8 level 2 6 5 7 8 level 1 A = <3,6,1,5,2,7,4,8> level 0 Now, each PU looks at one element A[i] (for example 6) and performs the following steps: It goes up in the tree, until it hits a node from the left, which has a greater right child. Here, this would be on level 3. Then, it goes down the tree again, keeping itself as left as possible, but never visiting a node smaller than A[i]. Here, it would go down, passing 8,7 and again 7. Thus 7 is the right dominator of the 6. The problem can also be solved with optimal work in time O(log(n)) on a CREW-PRAM with O(n/log(n)) PUs by the sequential-division strategy: PROCEDURE ansv(A, rightDom : array[n] of Integer, P : array[>n/log(n)] of PU) // Calculates the right dominators of A to rightDom ASSERT all elements of A are distinct ASSERT n is a power of two We subdivide the problem into n/log(n) groups, each of size log(n), and solve these problems sequentially | | | |||| ||| | | ||.||..||| ||||||.|.| ||||||..|| <-- n/log(n) sub-problems --log(n)-- --log(n)-- --log(n)-- FOR i in {0,..n/log(n)} PARDO BY PUs P ansvSeq(A[i*log(n) # log(n)], rightDom[i*log(n) # log(n)],P[i]) Now there may still be elements in the groups, the right dominator of which is in another group. For the solution, we make each sub-problem a triangle around its maximum. While doing so, we also collect the maximum of each hill. || || ..|.||.|||.|.. ~~~~~> ..|||||||||||.. maxima : array[n/log(n)] of Integer maximaPos : array[n/log(n)] of Integer FOR i in {0,..n/log(n)} PARDO BY PUs P VAR j := 0 VAR mval := A[i*log(n)] VAR max := max(A[i*log(n) # log(n)]) // This maximum can be calculated sequentially // Align left hill side WHILE mval < max && A[i*log(n)+j]=mval THEN mval := A[i*log(n)+j] maxima[i]:=mval maximaPos[i]:=i*log(n)+j j := log(n)-1 mval := A[i*log(n)+log(n)-1] // Align right hill side WHILE mval < max && A[i*log(n)+j]=mval THEN mval := A[i*log(n)+j] Now we solve the ANSV-problem on the group maxima .~~~~~~> /\ / \ / \ / \ /\ / \ maximaRightDomPos : array[n/log(n)] of Integer maximaLeftDomPos : array[n/log(n)] of Integer ansvN(maxima,maximaRightDomPos,P) ansvN(reverse(maxima),reverse(maximaLeftDomPos),P) /\_________ <- PU C will have to take care of the area / \_______/\ between the two lines. C can easily find / \ /\ / \ where the upper line hits the hill of A. A B C But it's B's task to tell C where the lower line hits the hill of A In general: If my (B's) left dominator is also the left dominator of my right dominator (C), then I binSearch my own maximum in my left dominator's hill (A) and tell my right dominator (C) about this position. Similarly, I binSearch my own maximum in my right dominator's hill (C) and tell my right dominator (C) the resulting rank. leftLBoundPos : array[n/log(n)] of Integer := <-1,-1,...> rightLBoundPos : array[n/log(n)] of Integer := <-1,-1,...> FOR i in {0,..n/log(n)-1} PARDO BY PUs P IF maximaLeftDomPos[maximaRightDomPos[i]] == maximaLeftDomPos[i] THEN leftLBoundPos[maximaRightDomPos[i]] := binSearch(maxima[i],A[maximaPos[maximaLeftDomPos[i]].. (maximaLeftDomPos[i]+1)*log(n)]) rightLBoundPos[maximaRightDomPos[i]] := binSearch(maxima[i],A[maximaRightDomPos[i]*log(n).. maximaPos[maximaRightDomPos[i]]]) Now, each PU merges the area it is responsible for. In the above example, C merges the part of C's hill between the lines with the part of A's hill between the lines. FOR i in {0,..n/log(n)-1} PARDO BY PUs P IF leftLBoundPos[i] == -1 THEN leftLBoundPos[i] := (maximaLeftDomPos[i]+1)*log(n) IF rightLBoundPos[i] == -1 THEN rightLBoundPos[i] := i*log(n) VAR merged : array[2*log(n)] of Integer mergeSeq(A[maximaPos[maximaLeftDomPos[i]].. leftLBoundPos[i]], A[rightLBoundPos[i].. maximaPos[i]], merged) Having merged both sides, we can determine the right dominator of an element x on the left hill side by inspecting the position x got in the merged sequence FOR j := maximaPos[leftDomPos[i]] TO (leftDomPos[i])*log(n) DO IF rightDom[j] == -1 THEN rightDom[j] := binSearch(A[j],merged) + rightLBoundPos[i] All this also has to be done for situations where hill C is higher than hill A. Thus, the whole process starting from the ABC-picture also has to be done vice versa. This algorithm requires O(log(n)) time and n/log(n) PUs. Hence, it is work optimal. The ANSV-problem can also be solved on a CRCW-PRAM in time O(log(log(n))^2) with n/log(log(n))^2 PUs, if the problem is split into sqrt(n) sub-problems, which are solved recursively. Then there are log(log(n)) recursive calls, each of which runs time O(log(log(n))). The recursion is stopped at a sub-problem size of log(log(n)). These sub-problems are then solved sequentially. The ANSV-problem can also be solved in time O(1) on a CRCW-PRAM with n^2 PUs: PROCEDURE ansvN2(A, rightDom : array[n=k^2] of Integer, P : array[>n^2] of PU) ASSERT all elements of A are distinct VAR m : array[n][n] of Bit FOR (i,j) in {(0,0),(0,1),...} PARDO BY PUs P m[i][j] := A[i]j?1:0 FOR i in {0,...n-1} PARDO rightDom[i] := firstbit(m[i][0..n-1],P[i*n # n]) This algorithm arranges the PUs in a matrix. Each looks at a pair of two numbers of A and sets m[i][j] to 1, if A[i]>A[j]. Thereby, only the upper right triangle of the matrix is filled. Then, the first bit in a line i identifies the position of the right dominator of i. The ANSV-problem cannot be solved faster than the Merging-problem, because merging can be reduced to ANSV: PROCEDURE mergeByANSV(X,Y, result : array[n] of Integer, P : array[>2*n] of PU) ASSERT X and Y are sorted, all elements distinct VAR seq : array[2*n] of Integer := VAR rankXinY : array[2*n] of Integer ansv(seq,rankXinY,P) seq := VAR rankYinX : array[2*n] of Integer ansv(seq,rankYinX,P) FOR i in {0,..n-1} PARDO BY PUs P result[i+rankXinY[n-i-1]-n]:=X[i] result[i+rankYinX[n-1-1]-n]:=Y[i] This algorithm arranges the two sorted sequence face-to-face || || ||||...|||| -X Y by reversing X. Then the right dominator of an element of X determines its rank in Y. By knowing the rank in X and the rank in Y, we can compute the rank in the sorted sequence. The ANSV-problem cannot be solved faster than the OR-problem, because each OR-problem can be reduced to an ANSV-problem: FUNCTION orByANSV(X : array[n] of Bit, P : array[>n+1] of PU) : Bit VAR copy : array[n+1] of Integer FOR i in {0,..n-1} PARDO BY PUs P copy[i]:=X[i] copy[n]:=2 VAR rightDom : array[n+1] of Integer ansv(copy,rightDom,P) RETURN rightDom[0]==n+1?1:0 An additional element with value 2 is added to the right of the sequence. Then, the right dominator of the 0-element is determind by the help of ANSV. If the right dominator is 2, then the result of OR is 0, else the result is 1. Parenthesis-Problem: The problem of, given an array of characters, determining which opening parenthesis matches with which closing paranthesis. The problem can be solved on a CRCW-PRAM in O(log(n)) time using n/log(n) PUs: PROCEDURE parMatch(S : array[n] of Character, result : array[n] of Integer, P : array[>n/log(n)] of PU) // Array A encodes parenthesis to numbers VAR A:array[n] of Integer FOR i in {0,..n/log(n)} PARDO BY PUs P FOR j:=1 to log(n) DO pos:=i*log(n)+j IF S[pos]=='(' THEN A[pos]:=1 ELSE IF S[pos]==')' THEN A[pos]:=-1 ELSE A[pos]:=0 // Array B contains prefix sums of A VAR B:array[n] of Integer prefixSum(A,B,P) // Run a modified ANSV, which finds the nearest smaller // right neigbor ansv_(A,result,P) FOR i in {0,..n/log(n)} PARDO BY PUs P FOR j:=1 to log(n) DO pos:=i*log(n)+j IF S[pos]=='(' THEN result[result[pos]]:=pos We translate each '(' to 1 and each ')' to -1. This can be done in time O(log(n)), if each of the n/log(n) PUs takes care of a group of log(n) elements. Now, we have the optimal prefix-sum-algorithm run on the parenthesis string. After this, each '(' has a number corresponding to its depth and each ')' has a number corresponding to its depth-1. Now, we have the ANSV-algorithm run on this string in order to find the nearest smallest right neighbor. Thus, each '(' is matched with its ')'. Again, we have the n/log(n) PUs run over all elements of the string: Each '(' redirects the pointer of its corresponding ')' to itself. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Lists ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Color of a linked set S: A function color:S->Natural, such that no two linked elements of S get the same value of color. lsd, least significant differing bit of two integers a and b: The index of the rightmost bit in the binary representation of a, which differs from the binary representation of b. Array-form of a list: A pair of an array L and an array Succ, such that * L contains all elements of the list in any order * Succ[i]==j iff the element L[i] is followed by the element L[j] in the list List-coloring-problem: The problem of, given a list in form of an array, coloring the nodes simultaneously with a small number of colors. The problem is that the list cannot be traversed (takes too long) and that each element must be given its color in one step only by help of the Succ-array. This problem can be solved in time O(1) on a CREW-PRAM with n PUs, where n is length of the list: PROCEDURE listColor(succ,result : array[n] of Integer, P : array[n] of PU) FOR i in {0,...n-1} PARDO result[i] := lsd(i,succ[i])*2 + i div 2^lsd(i,succ[i]) mod 2 Example: i = 4 = 0100b succ[i] = 12 = 1100b ^-------- lsd=3 v------- the bit of 4 causing the difference color for 4: 110b = 6 ^^-------- binary representation of the lsd Now, two adjacent elements have different colors: Suppose node i has the same lsd with its successor as this successor has with its successor, i.e. lsd(i,S(i))=lsd(S(i),S(S(i))) =: L. Then the L-th bit of i and succ[i] differ. Since this bit is involved in the calculation of the color of i and the color of succ[i], these colors will also differ. Since the range of lsd is 0..log2(n), there are at most O(log(n)) colors. This algorithm can be iterated: New colors can be calculated from the old colors. But we will get stuck at colors of 3 bits: The lsd of two 3-bit-numbers is in range 0..2. Thus, c is in range 0..5 and takes again 3 bits. The number of colors is 6 then. List-Ranking-Problem: The problem of, given a list in form of an array (L,Succ), determining for each element of L its distance to the end of the list. In the following, the array Succ is called A and the distance of each L[i] is stored in an array cell d[i]. The problem can be solved on a EREW-PRAM with n PUs in time O(log(n)) by the following algorithm: PROCEDURE listRankN(A,d : array[n] of Integer, P : array[>n] of PU) finished : array[n] of Boolean FOR i in {0...n-1} PARDO BY PUs P d[i]:=(i==A[i])?0:1 finished[i]:=(i==A[i]) FOR phase:=0 TO log2(n)+1 PARDO IF !finished[i] THEN IF finished[A[i]] THEN finished[i]:=true d[i]:=d[i]+d[A[i]] ELSE A[i]:=A[A[i]] d[i]:=d[i]*2 In this algorithm, each PU looks at its own memory cell. In each phase, the PUs reach out for their neighbor. Since their neighbor also reaches out for the neighbor, an exponential covering of the list results. The PUs increase their counter d[i] to count the width of their reach-out. When they find a finished element, they calculate their own distance and are finished as well. The problem can also be solved on a EREW-PRAM with n/log(log(n)) PUs in time log(n) by the reduction-strategy: PROCEDURE listRankL(A, d : array[n] of Integer, P : array[>n/log(log(n))] of PU) ASSERT d == <0,0,...> in the first call // Do we have enough PUs for listRankN? IF P.length>=n THEN // Call a modified , which skips the // initialization of d RETURN listRankN_(A,d,P) // Equip each element with a color CONST K := log(n) color : {0,..n-1} -> {0,..K} := LAMBDA i.(lsd(i,A[i])*2 + i div 2^lsd(i,A[i]) mod 2) // Find those elements of A, which are a local maximum // concerning their color VAR locmax : stack of (Integer, Integer) := <> FOR i in {0,..n/log(n)-1} PARDO BY PUs P FOR j:=0 TO log(n)-1 DO VAR pos:=i*log(n)+j // Is my successor a local maximum? IF color(A[pos])>color(pos) && color(A[pos])>color(A[A[pos]]) THEN // Memorize maximum push(locmax,(pos,A[pos])) // Increase the distance of pos d[pos] := d[pos] + 1 // Unlink my successor tmp := A[A[pos]] A[A[pos]] := EMPTY A[pos] := tmp // Now eliminate the empty cells // This can be done by copying A // Now, having simplified the problem, call recursively listRankLL(A,d,P) // Pop up d and A again by copying them // Undo the removal of the elements FOR (pred,max) in {s | s in locmax} PARDO BY PUs P A[max] := A[pred] A[pred]:= max d[max]:=d[pred] + 1 This algorithm first "colors" the elements of the list quite randomly. Then, the local maxima of these colors are found. Thus, the elements picked will have a distance of at least 2 and at most 2*K-1. These elements are now memorized and unlinked from the list. Then, the algorithm calls itself recursively. After that, the unlinked elements are restored. The problem can also be solved with optimal work on a CRCW-PRAM with n/log(n) PUs in time O(log(n)): PROCEDURE listRank(A, d : array[n] of Integer, P : array[>n/log(n)] of PU) // Find predecessor for each element VAR pred : array[n] of Integer FOR i in {0,..n/log(n)-1} PARDO BY PUs P FOR j:=0 TO log(n)-1 DO VAR pos:=i*log(n)+j pred[A[pos]] := pos d[pos]:=0 // Work sequentially in groups of log(n), try to // bridge out elements // Each group has a pointer VAR pointer : array[n/log(n)] of Integer // And each group has a dice VAR dice : array[n/log(n)] of Integer FOR i in {0,..n/log(n)-1} PARDO BY PUs P pointer[i]:=i*log(n) WHILE pointer[i] < (i+1)*log(n) DO // Check whether the successor is not the current // element of its group VAR sucGroup := A[pointer[i]] div log(n) IF pointer[sucGroup] != A[pointer[i]] THEN // Check whether the predecessor of A[pointer[i]] // is not the current element of its group VAR predGroup := pred[pointer[i]] div log(n) IF pointer[predGroup] != pred[pointer[i]] THEN // We can savely bridge out A[pointer[i]] // Increase the distance of pos d[pointer[i]] := d[pointer[i]] + 1 // Unlink me pred[A[pointer[i]]] := pred[pointer[i]] A[pred[pointer[i]]] := A[A[pointer[i]]] pointer[i] := pointer[i] + 1 CONTINUE WHILE // Else we have a problem, throw a dice dice[i] := random({0,1}) IF dice[i]==1 and dice[sucGroup]==0 THEN // We won and may unlink our element d[pointer[i]] := d[pointer[i]] + 1 pred[A[pointer[i]]] := pred[pointer[i]] A[pred[pointer[i]]] := A[A[pointer[i]]] pointer[i] := pointer[i] + 1 // Now call recursively and reinsert bridged elements In this algorithm, the list is divided into groups of size log(n), so that each PU can run through one group. Each PU tries to bridge out the current element: It bends the predecessor to point to the successor. This may only work if predecessor and successor are not current elements of their respective groups. If they are, a dice is thrown, which determines which group may bridge out its element. Thus, the PUs advance at different speed. The process is over when every PU reached the end of its group. Then a recursive call is done and the bridged out elements are reinserted again. One can show by the laws of probability (Chernoff bound) that the overall time is still O(log(n)). The problem can also be solved deterministically with n/log(n) PUs on a CRCW-PRAM in time O(log(n)). The algorithm works similarly to the randomized version above, but handles occupied predecessors and occupied successors differently: Such a sequence of occupied predecessor, current element and occupied successor forms a sub- list. Such a list is colored with O(log(log(n))) colors and local maxima of the color-values are found. The successors of each local maximum are bridegd out. The list-ranking-problem cannot be solved faster than O(log(n)) on a CREW-PRAM, because the OR-problem can be reduced to list-ranking: FUNCTION ORbyListRank(A : array[n] of Bit, P : array[n] of PU) : Bit VAR Succ : array[n] of Integer // Create a list in form of an array FOR i in {0,...n} PARDO BY PUs P IF A[i]==0 THEN Succ[i] := i+1 ELSE Succ[i] := n-1 VAR d : array[n] of Integer listRank(Succ, d, P) IF d[0]==n THEN RETURN 0 ELSE RETURN 1 This algorithm creates a list out of the bit array. Each 0 has as successor its natural successor in the array, each 1 has as successor the last element (the list is not a real list but a tree, but this does not matter). Then, the list ranking problem is solved. If the distance of the first element to the list end is n, then there was no 1 in the array, the result is 0. If list-ranking could be done faster than O(log(n)) on a CREW-PRAM, then OR could be solved faster than O(log(n)). This is provably impossible. List-unwringling: The conversion of a list in form of an array to an array, which contains all list elements in the right order. If n is the number of elements, this can be done with optimal work on a CRCW-PRAM with n/log(n) PUs in time O(log(n)): PROCEDURE unwringle(L, Succ, result : array[n] of Integer, P : array[>n/log(n)] of PU) VAR d : array[n] of Integer listRank(Succ,d,P) FOR i in {0,..n/log(n)-1} PARDO BY PUs P FOR j:=i*n/log(n) TO (i+1)*n/log(n)-1 DO result[n-d[j]-1]:=L[j] As a result, algorithms which work on arrays can also work on lists. Label-problem: The problem of, given a sorted array A= of integers and a label array L= with L[i] in {0,...m=f(n)}, finding for each label the minimum element of A with this label. The problem can be solved in time O(m) with n/m PUs. If m in O(log(n)), a CREW-PRAM suffices, if m in O(log(log(n))), a CRCW-PRAM is needed. The following algrithm distinguishes these cases: PROCEDURE label(A : array[n] of Integer, L : array[n] of {0,...m-1}, result : array[m] of Integer, P : array[>n/m] of PU) ASSERT A is sorted // Divide A into n/m sub-problems of size m // Each sub-problem has a column in matrix M // Each possible label 0...m has a row in M VAR M : array[m][n/m] of Integer VAR Mfilled : array[m][n/m] of Bit FOR i in {0,...n/m-1} PARDO BY PUs P FOR j:=0 TO m-1 DO Mfilled[j][i]:=0 // Each PU runs through its sub-problem FOR i in {0,...n/m-1} PARDO BY PUs P FOR j:=0 TO m-1 DO VAR pos:=i*m+j // Check out whether the A[pos] could be the // smallest element for label L[pos] in this group IF Mfilled[L[pos]][i] == 0 THEN M[L[pos]][i] := A[pos] Mfilled[L[pos]][i] := 1 // Now, the first column of M should be the result, but // there may still be empty cells! These need to be // filled with the value of the next non-empty cell in // the same row to the right VAR M' : array[m][n/m] of Integer // If m is in O(log(n)), use this code ASSERT We have a CREW-PRAM // Call a modified segmented broadcast, which broadcasts // values from the right to the left and works on matrices segmentedBroadcast_(M, Mfilled, M', P) // Result is first column, filled up by segmented BC result := M[][0] // If m is in O(log(log(n))), use this code ASSERT We have a CRCW-PRAM // Find first bit in each line FOR line:=0 TO m DO result[line] := firstBit(M[line][],P) We split the problem into n/m sub-problems of equal size. We use a matrix M[m][n/m], which has one column for each sub-problem. Each PU now solves one sub-problem in A sequentially: It runs through its part of A and whenever it hits a label l[i] for the first time, it sets the corresponding entry M[l[i]][j] to the value a[i]. Now, the first column of the matrix is intended to be the result: For each label L, it contains the first corresponding a[i]. The problem is that the first group may not have seen an element for each label, some cells in the first column of M may be empty. They need to be filled with the next non-empty value from the next column in M. In the case of m in O(log(n)), this can be done on a CREW-PRAM by segmented broadcast. In the case of m in O(log(log(n))), this can be done on a CRCW-PRAM by finding the first non-empty cell of each line. Translation of a CRCW-PRAM-Algrithm to a EREW-Algorithm: The following algorithm, which has to be plugged into each phase of the CRCW-PRAM- Algorithm. Assume the CRCW-PRAM-Algorithm complies t phases and p PUs and uses uses m memory cells. In each phase, the PUs register their read- and write-requests. Then, these requests are processed centrally: ReadCells, ReadPUs, ReadPUs', ReadCells' : array[p] of Integer PROCEDURE wantsToRead(pu : Integer, cell : Integer) // Called by each PU, which wants to read a cell ReadCells[pu] := cell ReadPUs[pu]:= pu PROCEDURE getElements(P : array of PU) // Called after all PUs have registered their read requests // by wantsToRead // Call a modified sortEps, which sorts ReadCells and moves // the corresponding elements of ReadPUs as well sortEps_((ReadCells,ReadPUs),(ReadPUs',ReadCells'),m,P) // Now ReadCells' is sorted. Check out where a sequence of // read requests for the same cell appears newSeq : array[p] of Integer FOR i in {0,..p-1} PARDO BY PUs P IF i==0 || i>1 && ReadCells'[i-1]!=ReadCells'[i] THEN newSeq[i] := 1 ELSE newSeq[i] := 0 // Each PU starting a new sequence reads the data from memory FOR i in {0,..p-1} PARDO BY PUs P IF newSeq[i]==1 THEN ReadCells'[i] := memory[ReadCells'[i]] // Now we distribute the data along ReadCells'. Thus, // ReadCells' will contain the contents of the cells instead // of their numbers segmentedBroadcast(ReadCells',newSeq,P) // Now undo the sorting of ReadCells' by sorting by ReadPUs sortEps_((ReadCells',ReadPUs'),(ReadPUs,ReadCells),m,P) // Now ReadCells[i] contains the value which PU i wanted // to read FUNCTION read(pu : Integer) : Integer // Called by each PU after has been done RETURN ReadCells[pu] WriteCells, WriteValues, WriteCells', WriteValues' : array[p] of Integer PROCEDURE wantsToWrite(pu, cell, value : Integer) // Called by a PU which wants to write WriteCells[pu] := cell WriteValues[pu] := value PROCEDURE doWrites(P : array of PU) // Called after all PUs have registered their write requests // by // Call a modified sortEps, which sorts WriteCells and // moves the corresponding elements of WriteValues as well sortEps_((WriteCells,WriteValues), (WriteCells',WriteValues',m,P) // Now WriteCells' is sorted. Check out where a sequence of // write requests to the same cell appears. Of the // corresponding values to be written, simply pick the first // and write it to the cell. FOR i in {0,..p-1} PARDO BY PUs P IF i==0 || i>1 && WriteCells'[i-1]!=WriteCells'[i] THEN memory[i] := WriteValues'[i] This algorithm is plugged in after each phase. If the CRCW-Algorithm needs time O(t), then the EREW-algorithm needs time O((m^eps+log(p))*t), where eps is the constant of sortEps. If eps is chosen appropriately, then m^eps+log(p) reduces to a constant factor (?). Thus, the algorithm can be tuned to time O(t) and work O(p*t). ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Trees ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Tree-coloring-problem: The problem of, given a tree, finding a coloring. A tree can always be colored with two colors by alternating the colors for each level: color(v)=level(v) mod 2. Tree-Depth-Problem: The problem of, given a tree, finding for each node its level. This problem can be solved similarly to the List-ranking-problem. Edge successor function in an undirected graph: A function, which returns for each edge another edge of the graph. Singly linked list of type t: The type (car : t, cdr : SinglyLinkedList of t). NIL is a constant of this type. A list can be stored as a singly linked list, the last cdr of which is NIL. Adjacency list of a graph: An array of type singly linked list of type node. The list at position i stores the children of node i. Advanced edge: The type (start:Node,end:Node,pair:Edge,weight:Natural). One stores with each edge * its start node * its end node * its weight and * a pointer to its inverse edge ("pair-pointer") Advanced adjacency list of a tree: An array of type singly linked list of type advanced edge. The last element of each list does not point to NIL, but to the first element of the list ("wrap pointer"). One assumes that the edges are numbered. This is possible, as the datastructure will occupy a sequence of memory cells. One has the pair-pointer of an advanced edge point to the list element of the inverse edge, not to the edge itself. Euler circuit of an undirected tree: A cycle, which contains each edge once. That is: It contains for each pair of linked nodes both edges. An Euler circuit can be given by an edge successor function. Suppose all linked nodes of a node are numbered, i.e. linkedNode(s,i) is the i-th node of the nodes linked to s. Then the successor function can be defined as S() := S describes a depth-first-search in the tree. A node v is entered the first time via a link (u,v). All its child nodes are treated and then v is left via (v,u) and entered no more. If the tree is given by the advanced adjacancy list representation, S can be calculated in constant time. As a result, an Euler circuit in a tree of n nodes can be found in time O(log(n)) on a CREW-PRAM with n PUs as follows: PROCEDURE euler(tree: AdvancedAdjacencyList, result : array[n*2-1] of AdvancedEdge, P : array[>n] of PU) VAR way : array[2*n-1] of AdvancedEdge VAR list : array[2*n-1] of Node FOR e in Edges of tree PARDO BY PUs P, P list[e]:=e.pair.cdr way[e]:=e unwringle(way,list,result,P) This algorithm can be optimized by the sequential division strategy to run with n/log(n) PUs. // This algorithm makes explicit what I think is behind the abstract // presentation in the lecture Generic-Euler-Solution: The following algorithm template, which runs in time O(log(n)) with n/log(n) PUs for trees with n nodes: PROCEDURE genericEuler(tree : AdvancedAdjacencyList, result : array[n] of Node,..., P : array[>n] of PU) ASSERT tree is undirected VAR ec : array[2*n-1] of AdvancedEdge euler(tree, ec, P) VAR weight, prefixweight : array[2*n-1] of Integer FOR i in {0,...2*n-3} PARDO BY PUs P, P weight[i] := ec[i].weight prefixSum(weight, prefixweight, P) FOR i in {0,...2*n-3} PARDO BY PUs P, P ec[i].weight:=prefixweight[i] This algorithm gets a tree, finds an euler circuit, assigns weights to the edges of the circuit, calculates the prefix sums of these weights and replaces the edge weights by the prefix sums. Some code may be introduced in between. The general idea is that the euler circuit in combination with the prefix sums simulates a depth-first search with a counter. With each edge passed, the counter is incremented by the corresponding edge weight. As each edge is traversed exactly once, each edge maintains its specific edge weight sum. Tree-Root-problem: The problem of, given an undirected tree (S,R) and a root node, determining the relation R'. // Set edge weights to 1 FOR e in Edges of tree PARDO BY P, P e.weight:=1 // Create doubleec := ec + ec VAR doubleec : array[4*n-2] of AdvancedEdge VAR rootpos : Integer FOR i in {0,...2*n-2} PARDO BY P, P doubleec[i]:=ec[i] doubleec[2*n-2+i]:=ec[i] IF ec[i].node == root THEN rootpos := i // Now, doubleec[rootpos # n*2-1] is an euler circuit // starting at the root. Copy it to ec FOR i in {0,...2*n-2} PARDO BY P, P ec[i] := doubleec[rootpos+i] // Find for each node the latest outgoing edge FOR e in Edges of tree PARDO BY P, P IF e.weight > e.pair.car.weight THEN result[e.start] := e.end ELSE result[e.end] := e.start This algorithm first circulates the euler circuit so that it starts at the root. As each edge weight is 1, an edge is heavier if it is passed later. Since a node is entered, worked on and then left and entered no more, the edge used for leaving can be seen as part of the tree. Regarding an edge and its inverse edge , the one with the higher value will be the one used for leaving. If e.g. has the higher weight than , then the successor of u is set to v. The result is an array, which contains for each node its parent. Example: Edge weights: Prefix edge weights <---1----- <---12----- u v u v ----1----> ----99----> => v is the parent of u Binary tree: A tree in which all nodes have either 0, 1 or 2 children. One talks of the "left" and the "right" child of a node. If a node has just one child, this is the "left child". Complete binary tree: A binary tree, in which each node has either 2 or 0 children and where the number of nodes is a power of 2 minus 1. Postordering of a binary tree (S,R): A function postorder:S->{1,...|S|}, such that postorder(leftchild(u)) // Add inverse edges, produce an undirected tree inverse : Stack of AdvancedEdge FOR e in Edges of tree PARDO BY PUs P // Edges parent to child get weight 0 e.weight:=0 // Edges child to parent get weight 1 tree[e.end]:=(car:(start:e.end,end:e.start, pair:e,weight:1),cdr:tree[e.end]) push(inverse,tree[e.end]) // Run through all inverse tree edges, give edge // weight as postorder number to start node FOR e in inverse PARDO BY PUs P result[e.start] := e.weight result[n-1]:=n This algorithm first converts the tree to an undirected tree. While doing so, it gives edge weight 0 to tree edges and edge weight 1 to inverse tree edges. Then an Euler circuit is found and the prefix sums of the edge weights are calculated. Now the prefix edge weight of an inverse tree edge leaving v is the postorder number of v. Example: Edge weights: Prefix edge weights parent parent | ^ | ^ 0 | 17 | | 1 | 99 V | V | child child => postorder(child)=99 Level-problem, BFS-depth-problem: The problem of, given a tree, determining for each node its level. This problem can be solved on a CREW-PRAM with n PUs in time O(1) by the genericEuler-procedure: // Add inverse edges, produce an undirected tree inverse : Stack of AdvancedEdge FOR e in Edges of tree PARDO BY PUs P // Edges parent to child get weight 1 e.weight:=1 // Edges child to parent get weight -1 tree[e.end]:=(car:(start:e.end,end:e.start, pair:e,weight:-1),cdr:tree[e.end]) push(inverse,tree[e.end]) // Run through all inverse tree edges, give edge // weight as bfs-depth to start node FOR e in inverse PARDO BY PUs P result[e.start] := e.weight result[0]:=0 This algorithm increases the counter for each tree edge and decreases the counter for each inverse tree edge. Thus, the counter reflects the level of a node or its distance from the root. Example: Edge weights: Prefix edge weights parent parent | ^ | ^ 1 | 3 | | -1 | 2 V | V | child child => bfsdepth(child)=3 Least significant 1-bit of an integer i: The index of the rightmost 1 in the binary representation, noted ls1(i). Inordering of a binary tree (S,R): A function inorder:S->{1,...|S|}, such that inorder(leftchild(u))) and the right child is rightchild(v) = decimal() where b is the binary representation of v. The i-th level is the set of nodes where the last (#levels-i-1) bits are 0. Example: 8 4 12 2 6 10 14 1 3 5 7 9 11 13 15 Left descendant of a node v in a binary tree: The left child of v or one of its descendants. In an inorder tree, the binary representation of the left descendants of v is of the form binrep(leftdesc) = , where b is the binary representation of v and x[j] are arbitrary bits, which are not all 0. Right descendant of a node v in a binary tree: The right child of v or one of its descandants. In an inorder tree, the binary representation of the right descendants of v is of the form binrep(leftdesc) = where b is the binary representation of v and x[j] are arbitrary bits, which are not all 0. Left baby of a tree node s: The leftmost baby of node s. In an inorder tree, the left baby of a node v is leftbaby(v) = decimal() where b is the binary representation of v. First appearance of a node v in a path p: The index i such that p[i]==v and p[j]!=v for all jn] of PU) : Integer IF ec[0].start==node THEN RETURN 0 FOR i in {1,...2*n-1} PARDO BY PUs P, P IF ec[i].start==node AND bfsdepth[ec[i-1].start] and

will be left away. Last appearance of a node v in a path p: The index i such that p[i]==v and p[j]!=v for all j>i. The last appearance of a node in an euler circuit can be determined by help of the bfs-depth: i is the first appearance of v if p[i]==v and p[i+1] has a smaller bfs-depth than p[i]. Thus, the first appearance can be found in time O(1) on a CREW-PRAM with n PUs, where n is the number of nodes. Unrelated nodes of a tree: Two nodes, none of which is an ancestor of the other. LCA, lowest common ancestor of two tree nodes c1 and c2: The node p=lca(c1,c2), such that p is an ancestor of c1 and c2 and that no child of p is also an ancestor of c1 and c2. If the tree is an inorder tree, the lca can be calculated as follows: lca(u,v) = decimal() where b[1]...b[i-1] are the first equal bits of the binary representations of inorder(u) and inorder(v). Interval-minimum-problem: The problem of, given a sequence of integers and two indices, determining the index of the smallest value in the sequence between the two given indices. This problem requires a precomputation of an inorder tree with as many leaves as there are elements in the sequence. With each node of this tree, one stores the prefix minima and the suffix minima of all baby descendants. This tree can be calculated in time O(log(n)) by n PUs on a CREW-PRAM as follows: PROCEDURE intervalTree(s : array[n] of Integer, result : array[n][n] of (prefminInd,sufminInd:Integer) P : array[>n] of PU) // Set all leaves to the sequence elements FOR i in {0,...n} PARDO BY PUs P result[2*i+1][0] := i // Walk up in the tree VAR #levels := log2(n) FOR lev := log2(n) TO 0 DO FOR node in level(lev) PARDO combinePref(result[leftchild(node)][].prefminInd, result[rightchild(node)][].prefminInd, result[node][].prefminInd,2^(#levels-lev-1),P[...]) combineSuf(result[leftchild(node)][].sufminInd, result[rightchild(node)][].prefminInd, result[node][].sufminInd,2^(#levels-lev-1),P[...]) This algorithm walks up in the tree and combines the prefix- and suffix- minima of the two children of a node to a prefix- and suffix-minima. The function combinePref combines the prefix minimum arrays of the two children of by copying the left prefix array and copying the right prefix array. But a value from the right prefix array is only copied if it is smaller than the rightmost value of the left prefix array. PROCEDURE combinePref(s : array[n] of Integer, left, right:array[n] of Integer, len : Integer, result : array[n] of Integer, P : array[len] of PU) FOR i in {0,...len/2-1} PARDO BY PUs P result[i]:=left[i] result[len+i]:=s[right[i]]n] of PU) ASSERT contains each linked node pair once // Add inverse edges VAR allEdges : array[n*2] of AdvancedEdge FOR i in {0,...n-1} PARDO BY PUs P allEdges[i].start:=edges[i].start allEdges[i].end:=edges[i].end allEdges[i].inverse:=allEdges[n+i] allEdges[n+i].start:=edges[i].end allEdges[n+i].end:=edges[i].start allEdges[n+i].inverse:=allEdges[i] // Call a modified sort, which sorts the pointers to // the edges by their start node, making them // list entries at the same time and dropping the // end node sort_(allEdges, edgesSorted, P) // Sequence start sets result pointer FOR i in {0,...2*n-1} PARDO BY PUs P, P IF i==0 || edgesSorted[i].car.start != edgesSorted[i-1].car.start THEN // Set pointer in result array result[edgesSorted[i].car.start] := i // Sequence end sets circular pointer FOR i in {0,...2*n-1} PARDO BY PUs P, P IF i==2*n-1 || edgesSorted[i].car.start != edgesSorted[i+1].car.start THEN edgesSorted[i].cdr:=result[edgesSorted[i].car.start] ELSE edgesSorted[i].cdr:= edgesSorted[i+1] This algorithm first creates the appropriate inverse edges for all edges. While doing so, it interconnects them by pointers. Then, it sorts the edges by their start node. Thereby, the array is transformed to an array of (dangling) list entries. Now, each start of a sub-sequence of edges with an equal start node sets the pointer of the result adjacency array to itself. Similarly, each end of such a sub-sequence sets its cdr-pointer to the list start. The result is an adjacency array, the singly linked lists of which are all within the long array . Event with high probability: An event of size n, which happens in 1-n^-c*100% of the situations where it could happen, where c is some constant. ====================================================================== Networks ====================================================================== Models of computation for parellel algorithms: * Models with a shared memory * PRAM * synchronous (i.e. CRCW (?)) * asynchronous (i.e. EREW (?)) * Networks Package: A tuple of information of globally fixed size. With a package, one stores the index of a PU where the package should go to. Package storage: A sequence of memeory cells, which can store one package. Network PU: A PU, which has an internal memory. Such a PU may be linked to other PUs. Queue: An array of packages, maintained by a network PU in its internal memory. One is interested in keeping this queue small, in the best case bounded by a constant. Input storage: A package storage maintained in a network PU. Output storage: A package storage maintained in a network PU. Network: A graph, the nodes of which are network PUs. For an edge (p1,p2) in a network, the PU p1 has an output storage for this edge and the PU p2 has an input storage for this edge. But networks are by default undirected, so that the PUs have input storages and output storages for all linked PUs. Step in a network: The transmission of the packages in all output storages to the input storages of the corresponding linked network PUs. A step is preceded by a period of computation. Bandwidth of a network (P,L < P x P): A function bw:L->Real, which equips each link with a positive value. Bandwidth-network: A network with a bandwidth. For an edge of bandwidth k, each of the two linked PUs has k input storages and k output storages. Degree of a network: The maximum among the degrees of all PUs of the network. The higher the degre of a network, the easier is communication in it, but the more costly is its implementation. Hence one prefers a small degre. Diameter of a network: The maximum among the distances of all pairs of PUs. The smaller the diameter, the faster the communication. Hence one prefers a small diametre. Put-through, Bisection width of a network: The minimum number of links that must be removed in order to decompose a network with n PUs into two networks with at most ceil(n/2) PUs. A high bisection width entails an easy communication and is thus desirable. Switch: A node in an undirected graph of degree 4 with two possible states. The edges are labelled W(est), E(ast), N(orth) and S(outh). In one state (WENS), West is connected to East and North is connected to South. In the other state (WNES), West is connected to North and East is connected to South. Switch network: An undirected graph, the nodes of which are either PUs or switches. This switch network allows for the dynamic connection of PUs by the switches, altering the switch relation. Crossbar-switch: A switch network, which allows connecting each PU with each other PU at the same time, although each PU has degree 2. Each PU can send to at most one PU and receive from at most one PU, but the PU to which it is sending may be different from the PU from which it is receiving. The network looks as follows: ,-----------------------------, | ,-----------------, | | | +----------, | | | | '-PU0-----S11--S21- ... -Sn1 Sij are switches | | | | | PUi are PUs | '---PU1-----S12--S22- ... -Sn2 | ... '-----PUn-----S1n--S2n- ... -Snn Hamiltonian cycle in a graph: A cycle, which passes each node exactly once and the start node twice. Hamiltonian graph: A graph which has a Hamiltonian cycle. Cartesian product of graphs (V1,E1) and (V2,E2): The graph (V1, E1) x (V2, E2) := (V, E) with V = {(u,v)| u in V1 and v in V2} = V1 x V2 E = {((u,v),(u',v)) | (u,u') in E1, v in V2} \/ {((u,v),(u,v')) | (v,v') in E2, u in V1} It holds: |V| = |V1| * |V2| |E| = |E1| * |V2| + |E2| * |V1| (?) n-th Power of a graph G: G^1 := G for n=1 G^n := G^(n-1) * G else n-dimensional grid: The product of n lists. Grid: A 2-dimensional grid. That is: A matrix-like graph. Iff at least one side of the grid has an even number of node, then the grid is hamiltonian. One talks of the "rows" and "columns" of a grid and says that the rows are above each other and the columns are beneath each other. n-dimensional Hypercube: An undirected graph in form of a list to the power of n. That is: For n=0, we have one single node (?) For n=1, we have the nodes linearly one after the other. For n=2, we have the nodes arranged in a matrix For n=3, we have a cube-like graph For n>3, we have some n-dimensional cube-like structure. simple n-dimensional Hypercube: An n-dimensional hypercube from a list of 2 elements. Notation: H[n] Each simple hypercube is hamiltonian. Proof: For H[0], we have a trivial hamiltonian cycle. Now suppose we have a hamiltonian cycle for H[n-1]. Then we also have a hamiltonian cycle for H[n], because H[n] = H[n-1] * H H[n] are two hypercubes A and B of the form H[n-1], all corresponding nodes of which are cross-connected. Now we take the hamiltonian cycles in A and B and break them up in the same place. Then we connect the cycle start in A with the cycle start in B and the cycle end in A with the cycle end in B and get a new hamiltonian cycle. Alternative proof: Label the nodes with binary numbers. When two simple (n-1)-dimensional hypercubes A and B are glued together to form H[n], then label the nodes as follows: Keep the original labelling in A and B, but precede all labels in A by a '0' and all labels in B by a '1'. Iff two nodes are connected, then their labels differ in one bit. A hamiltonian cycle can be given as follows: Be S[n-1] the sequence of labels of the hamiltonian cycle in H[n-1] without the cut and with the original numbering. Then the sequence of labels of the hamiltonian cycle in H[n] is 0-S[n-1] + 1-reverse(S[n-1]) + zeroLabel where "0-" means sticking a '0' to the front of each label of the sequence. The cycle in hamiltonian, because each labl differs from its predecessor by one bit. So each node is connected to its predecessor and all nodes are touched once. This labelling implements the same strategy as described above. Square: A 2-dimensional hypercube. Cube: A 3-dimensional hypercube. Linear array: A network in form of a list. Be n the number of PUs. Diameter: n-1 Degree: 2 Bisection width: 1 Circular array, ring: A linear array, where the two PUs with degree 1 are linked by an additional link. On rings most problems can be solved twice as fast as on linear arrays. Diameter: floor(n/2) Degree: 2 Bisection width: 2 Mesh: A network in form of a grid. That is: There are n*m PUs, numbered p[i][j] with i in {0,...n-1} and j in {0,...m-1}. Each PU p[i][j] is connected to p[i-1][j], p[i+1][j], p[i][j-1] and p[i][j+1], if these PUs exits. Meshes are easy to construct and to scale. Diameter: n+m-2 (high, bad) Degree: 4 (constant, good) Bisection width: min(n, m) (small, bad) Network indexing: A mapping of network PUs to natural numbers. We always assume some kind of indexing, so that the relations of "successor" and "predecessor" are defined on PUs. Row-Major: Indexing the PUs of a n*n mesh so that the PU in line i and column j gets the number i*n+j. ------------------------> 0 1 2 3 ... n-1 ------------------------> n n+1 n+2 n+3 ... 2*n-1 ... Column-major: Indexing the PUs of a n*n mesh so that the PU in line i and column j gets the number i+j*n. | 0 | n . | 1 | n+1 . | 2 | n+2 . | . | . | . | . | . | . V n-1 V 2*n-1 Row-Snake: Indexing the PUs of a n*n mesh so that the PU in line i and column j gets the number * i*n+j if i is even * i*n+n-j if i is odd ----------------------------------------> -----, 0 1 2 ... n-3 n-2 n-1 | ,---- <---------------------------------------- <----' | 2*n-1 2*n-2 2*n-3 ... n+2 n+1 n '---> ----------------------------------------> 2*n 2*n+1 2*n+2 ... This indexing can be useful for tasks like sorting: Every PU is closely connected to its predecessor and successor. Torus: A network, which is the product of a circular array of n PUs and a circular array of m PUs. That is: It is a mesh with m*n PUs, where the PUs "at the edges" with degree 3 or 2 are connected to the corresponding PU on the other side. On tori most problems can be solved twice as fast as on meshes. Diameter: floor(n/2) + floor(m/2) (high, bad) Degree: 4 (constant, good) Bisection width: 2 * min(n,m) (small, bad) Tree network: A network in form of a perfect binary tree of k levels. PUs: 2^k-1 Diameter: 2 * k (small, good) Degree: 3 (constant, good) Bisection width: 1 (small, bad) Fat tree: A bandwidth tree network with k levels, where the links between level i and level i + 1 have a bandwidth of 2^(k-i-2). Mesh of trees: A network, consisting of n*m PUs, arranged like in a mesh, but without any direct links between them. Each row p[i][...] has a tree network above it with the PUs of the row as leaves. Similarly, each column p[...][j] also has a tree network above it with the PUs of the column as leaves. If the underlying mesh has n*n PUs, then the total number of PUs is 2 * n * (2 * n - 1) - n^2 = 3 * n^2 - 2 * n. k-dimensional simple Hypercube network: A network in form of a simple hypercube. That is: There are n = 2^k PUs arranged in a k-dimensional grid. Each PU is connected to its k neighbors in the grid. Most problems can be solved optimally on hypercube networks. Diameter: k Degree: k = log n (not constant, bad) Bisection width: n / 2 Ring splitting: Replacing a node of a network with n edges by a circular array of n PUs. Each of these PUs is connected to its predecessor and its successor in the array and additionally takes one of the n outgoing edges. By ring splitting, the degree of a network can be reduced to 3. The length of paths is prolonged, but all in all for any path only by n. Cube-connected cycle, CCC: A k-dimensional simple hypercube network, where ring splitting has been done. PUs: k*2^k Diameter: floor(5*k/2)-2 for k>=3 Degree: 3 Bisection width: 2^k / 2 Embedding of a network N1 in a network N2: An injective function phi from the PUs of N1 to the PUs of N2, such that for every link from p1 to p2 in N1, there is also a path from phi(p1) to phi(p2) in N2. Adjacent nodes of a graph: Nodes with a link between them. Necessary path of an embedding phi: A shortest path in the result network, which connects two nodes phi(p1) and phi(p2), where p1 and p2 were adjacent in the original network. Dilation of an embedding: The maximum among the lengths of the necessary paths. Congestion of an embedding: The maximum number of necessary paths running over any link in the target network. Expansion of an embedding: The number of nodes of the original network, divided by the number of nodes of the target network. Network simulation: Embedding a network N1, which is required for a specific algorithm, on a network N2, which is at hand. Any step of N1 can be simulated on N2 with at most dilation * congestion steps, because each information exchange from a PU p1 to an adjacent PU p2 in N1 may have to travel a path of length and may have to wait times before it can be transferred in N2. Thus, the work on N2 exceeds the work on N1 by at most a factor dilation * congestion * expansion. As long as all these numbers are constants, an optimal algorithm for N1 automatically gives an optimal algorithm for N2. Thanks to simulation, it is not necessary to redesign an algorithm written for a specific network type, because it can be executed on another network type. Example: N1 (what we want) 1-2-3-4-5 '-------' N2 (what we have) a-b-c-d-e Mapping: 1->a, 2->c, 3->e, 4->d, 5->b Result: 1-5-2-4-3 This mapping is done as follows: The new array has the two outmost elements of the original array, then the two but-outmost elements of the original array and finally the middle element of the original array. This example has Dilation: 2 Congestion: 2 Expansion: 1 All factors are constant, hence any problem solvable on a circular array can also be solved on a linear array in the same time class. The example can be geralized to map tori to meshes. Hence, problems solvable on tori can also be solved on meshs. The embedding of a linear array is nothing else than the construction of a hamiltonian path in the destination network. Such an embedding provides a way for indexing the PUs in the destination network such that PUs with consecutive indices are adjacent. The embedding of a circular array corresponds to finding a hamiltonian cycle in the destination network. Routing: The problem of, given a network with packages stored in certain PUs, transferring the packages to certain other PUs by steps. Routing graph: A directed graph, the nodes of which are PUs and the edges of which correspond to a package with its source and destination PU. All-to-All-Routing: Routing from each PU different packages to all other PUs. The routing graph is complete. Permutation routing: Routing from each PU one package to another PU. The routing graph is a circular list. Permutation routing should be possible in a time proportional to the network diametre. Gathering: Routing from all PUs different package to one target PU. The routing graph is a tree with one root and all other PUs being leaves. Broadcasting: Routing from one PU one package to all other PUs. The routing graph is a tree with one root and all other PUs being leaves, but with all edges reversed. Gossipping: Routing from each PU one package to all other PUs. The routing graph is complete, but each PU has only one package, which it sends to all PUs. Hot-Potatoe-Network: A network, in which the PUs have no queue. As a result, they cannot store packages and will send the packages back if they can't handle them at the moment. Separate-phase Greedy Routing: Permutation-Routing in a 2-dimensional mesh by the following scheme: 1st phase: All packages are simultaneously routed in their row until they found their column 2nd phase: All packages are simultaneously routed in their column until they found their destination. This routing takes a long time and it requires large queues, but it is easy to analyze. Glued-phase Greedy Routing: Permutation-Routing in a 2-dimensional mesh by the following scheme: All packages are simultaneously routed in their row until they found their column. Those which found their column are immediately routed in this coulumn, although other packages might still be seeking their row. This approach is faster than separate-phase greedy routing, but it may happen also here that the queue size cannot be bounded by a constant: Consider a n*n-mesh, in which the middle column is the destination column for the packages. The PU at the bottom of this line in the but-last row will be called the distributor. Now n/3 PUs left of the ditributor, n/3 PUs right of the ditributor and n/3 PUs in the line below the distributor want to send packages to the destination column. Then all these packages have to pass the distributor. With glued-phase greedy routing, this PU will have to have a queue size of 2*n/3, which is in O(n). Farthest-first: The principle that the package which has a farther way to go should be treated first by a PU. Greedy-Plus Routing: Permutation-Routing in a 2-dimensional mesh by the following scheme: Each package looks whether it has more rows to go or more columns to go. If it has more columns to go, it goes one column, else it goes one row. Thus, it reaches its target PU by going diagonally. But even for this scheme, routing problems can in n*n-meshes be constructed, in which some PUs need a queue of size n/2. Comparator box: A network PU in a directed network, which has two integer input storages x1 and x2 and two integer output storages min and max. It takes values x1 and x2 and stores the values min(x1,x2) and max(x1,x2) in min and max, respectively. A comparator box can perform additional computation and also keep packages in its internal queue. Comparator network, Comparison-based sorting network: A directed network, the PUs of which are comparator boxes. k-distribution: A comparator network, in which each PU initially maintains k packages in its queue. Network sorting: The routing of packages in a linear array or mesh, such that after the process, each PU maintains packages which are smaller than than the packages maintained by the successor PU. The relation of "smaller" needs to be defined on the packages. Usually, sorting is done by a comparator network: In the beginning, each PU maintains a number of packages in its queue. Then it always sends the smallest package to the preceding PU and the largest package to the succeeding PU. In the end, each PU maintains packages in its queue, which are smaller than the packages maintained by the successor PU. Monotonically increasing function: A function f:Natural->Natural, such that x1j h(i,j) := the number of packages that start in a PU j'>j and go to a PU i'0 ? g(i,j)+j-i-1 : 0 This is the number of steps it is going to take to rout the g-packages one after the other through the bottleneck i--j. H(i,j) := h(i,j)>0 ? h(i,j)+j-i-1 : 0 This is the number of steps it is going to take to rout the h-packages one after the other through the bottleneck i--j. M(D) := max({G(i,j), H(i,j) | i < j}) The maximum number of all G's and all H's is a lower bound for the network sorting problem given this k-ditribution. Farthest-first sorting lemma: The fact that a comparator network can sort a k-distribution D by the farthest-first strategy in M(D) steps. Sketch for a proof: Consider a PU j and all packages at PUs i z0 < n*i - n <=> n*k + z0 < n*i - n + n*k <=> z0+n*(k-i) < n*(k-1) That is: The number of 0-packages, which row i may at most have, is smaller than the number of smallest packages, which will be removed from this row. Thus, all the 0-packages will be removed and only 1-packages remain, the row is clean. * z0 = n*(i-1) Here, no similary argument applies. Hence, this row may remain dirty. * z0 > n*(i-1) <=> n*k - z1 > n*i - n <=> > n*k > z1+(i-1)*n Hence, the number of 1-packages, which row i may at most have, is smaller than the number of largest packages, which will be removed from this row. Thus, all 1-packages will be removed and only 0-packages remain, the row is clean. Since the second case may only occur in one row, only one row is dirty after the process. Mesh-sorting: Network sorting in a n*n-mesh. This can be done as follows * if n=2, then Example: 5--8 <- 4 PUs, each with a package | | in its queue 1--2 * each package is copied to it horizontal neighbor 5,8 -- 5,8 <- each PU with 2 packages | | in its queue 1,2 -- 1,2 * the left PUs keep the smaller package, the right PUs keep the larger one 5 -- 8 | | 1 -- 2 * the same is done vertically * if n>2 then // For purposes of anaysis, we will rely on the // 0-1-principle and look only at packages of value 1 and 0. 1. Execute this algorithm recursively for each quarter of the mesh. Sort the quarters in the following way: <<< | >>> <-- left upper quarter: increasing ----+---- right upper quarter: decreasing <<< | >>> Each quarter has at most 1 dirty row. 2. Divide the mesh into 4 vertical strips of breadth n/4. Rout the packages in the left strip to the PUs in the left middle strip. Rout the packages in the right strip to the PUs in the right middle strip. All routings keep the relative order of the packages in the lines: n/4 n/4 n/4 n/4 | | | | | | ~~+~> | <~+~~ | Each PU in the middle strip has 2 | ~~+~> | <~+~~ | packages in its queue | ~~+~> | <~+~~ | Steps: n/4 3. Sort the two middle strips, column by column, package by package, increasing from top to bottom: | | ^ ^ | ^ ^ | | | | ^ ^ | ^ ^ | | | | ^ ^ | ^ ^ | | | | ^ ^ | ^ ^ | | | | ^ ^ | ^ ^ | | | | ^ ^ | ^ ^ | | Now, all 0-packages have been moved up and all 1-packages have been moved down. Since before, each quarter had at most one dirty row, there will now be at most 2 dirty rows. Steps: n 4. Now, we apply a modified dirty row elimination to reduce the two dirty rows to one dirty row. 4.1. Each PU has 2 packages. It copies the larger one to the PU above and the smaller one to the PU below. Each PU in the first row of the middle strips gets a dummy package, which is infinitely small. Similarly, each PU in the last row of the middle strips gets a package which is infinitely large. Now each PU in the middle strip has 4 packages. Steps: 1 4.2. Sort the rows in the two middle strips, building an increasing and decreasing structure: | |<<<<|>>>>| | | |<<<<|>>>>| | | |<<<<|>>>>| | Steps: 3/8*n 4.3. In each row, delete the n/2 largest and the n/2 smallest packages. Each PU in the middle strip now has 2 packages. Steps: 1 4.4. Spread the packages over the whole rows. Each PU has 1 package. | < < | < < | > > | > > | | < < | < < | > > | > > | | < < | < < | > > | > > | Steps: O(3/8*n) Thus, this algorithm needs (2+3/8)*n steps without taking into account recursion. The smallest recursion takes (2+3/8)*4 steps, the next takes (2+3/8)*8 steps etc., but the overall number is (2+3/8)*(4+8+16+...n) = (2+3/8)*(4+8+16+...2^k) assume n=2^k < (2+3/8)*(1+2+4+8+..2^k) < (2+3/8)*2^(k+1) = (2+3/8)*2*n = (4+3/4)*n Thus, the algorithm runs with less than 5*n steps. Kundal-Algorithm: The following network permutation routing algorithm for a n*n mesh: Choose a sub-mesh size s 1 0 1 ****\ P \ - **P temporary PU #2 catches the 1 \ - \ P 2 2 2 ^-^-+ P +-P temporary PU #2 sends its id + P The overall time of this algorithm is O(1). Ranking in a bus-network: The following algorithm for ranking a number in a sequence of n numbers in a reconfigurable mesh of size n*n. The input-PUs get the sequence of numbers as packages. The first PU gets the to-be-ranked number x as a package. Example: 3 7 5 S S S P S S S P S S S P The first PU distributes x in two steps to all switches. The switches memorize this number. 3 7 5 Example: x=4 4 4 4 P 4 4 4 P 4 4 4 P The input-PUs send their numbers to all switches below them. 3 7 5 3 7 5 P 3 7 5 P 3 7 5 P Each switch compares x to the newly send number. * If the new number is smaller, it does a diagonal connect * If the new number is greater-equal, it does a horizontal connect 3 7 5 \ - \ P \ - \ P \ - \ P A diagonal count is done, the result is the rank of x. sends 1 --> 3 7 5 \ - \ P \ - \ P temporary PU #1 catches the 1 \ - \ P --> result is 1 Optimized ranking in a bus-network: The following algorithm for ranking a number in a sequence of n numbers in a reconfigurable mesh with n input-PUs and sqrt(n) temporary PUs: Everything is done as in the normal ranking, except for the diagonal count. The switches at the bottom line establish each a bus-connection to the top-line switch and the input-PU in the same column (how is this done?) The switch at the upper left corner sends a '1' on its bus. This '1' will go to the right and down in the mesh. If it hits the bottom line, it is beamed to the first line. The temporary PU receiving the '1' memorizes this. Each input-PU, in whose column a beam appeared, knows this. It prepares a '1'-package. All other input-PUs prepare a '0'-package. A 1-count is done. Each input-PU knows the number of beams now. The temporary PU, which received the '1' sends its id-number via the big bus to all input-PUs. The result is the number of beams, multiplied by sqrt(n) plus the id-number of the temporary PU receiving the '1'. Routing in a bus-network: The following algorithm for routing n packages in a n*n reconfigurable mesh: The input PUs get the packages, together with the id of their destination PUs. Example: (b,2) (c,3) (a,1) S S S P S S S P S S S P The input PUs send the number of the destination PU to the switches in their column. (b,2) (c,3) (a,1) 2 3 1 P 2 3 1 P 2 3 1 P In each column, each switch decides * If it is at the line with the number it just received, it does a bottom-block-connect * else, it does a cross-connect. (b,2) (c,3) (a,1) + + -'- P -'- + + P + -'- + P Each input-PU sends its package. (b,2) (c,3) (a,1) + + -'- a -'- + + b + -'- + c Each switch does a cross-connect, the switches on the diagonal do a bottom-block-connect. The temporary PUs send their packages back. a b c -'- + + a + -'- + b + + -'- c Sorting in a bus-network: The following algorithm for sorting n numbers by n reconfigurable meshes, each with n*n switches Example for n=3: P P P P P P P P P 3 layers of reconfigurable S S S P S S S P S S S P meshes S S S P S S S P S S S P S S S P S S S P S S S P All input-PUs at the same position are connected by a bus Algorithm: The input-PUs of the first layer receive the numbers as packages. 3 7 5 P P P P P P S S S P S S S P S S S P S S S P S S S P S S S P S S S P S S S P S S S P Each of these input-PUs sends its package to all other input-PUs in the other layers at the same position. 3 7 5 3 7 5 3 7 5 S S S P S S S P S S S P S S S P S S S P S S S P S S S P S S S P S S S P In the i-th layer, a ranking is done for the i-th number. 1 1 1 3 3 3 2 2 2 3 3 3 7 7 7 5 5 5 P 3 3 3 7 7 7 5 5 5 P 3 3 3 7 7 7 5 5 5 P In each layer, the i-th input-PU sends its result to the i-th input PUs in all layers. 1 3 2 1 3 2 1 3 2 3 3 3 7 7 7 5 5 5 P 3 3 3 7 7 7 5 5 5 P 3 3 3 7 7 7 5 5 5 P In the first layer, the i-th PU still knows the i-th number. It now also knows where that i-th number should go. A routing is done in the first layer. The overall time is O(1).