Parallel Matrix Multiply
In the lecture I merely said write a parallel matrix multiplication program: C = A * B. Here I have added detail to make it easier for you.
The program should be able to accept any size square matrix and
number of processors, although I will accept a fixed number of
Some comments on implementation and
1. Only one process (e.g., a process with rank = 0 acting as the master) should read the data/matrix (or assign values to matrix A and B) and then it should automatically share this among the worker processes.
2. You can divide the work among processes in a number of
ways. Some suggested ways are: (a) treat computations of all elements
of each row of the resulting
matrix C as one work unit, (b) treat computations of all elements of
each column of the resulting
matrix C as one work unit, (c) computations of each element of
the resulting matrix C as one work unit, (d) create work units based on
the number of processes and the size of matrix and then assign them
different worker process.
to do assign the work units to worker processes such that the load/work distribution among
them is balanced. The master process should collect the results from
worker processes and strores in matrix C.
4. Testing and Evaluation:
(a) Take the size of square matrix as an input to the program by reading from the keyboard at runtime.
(b) Test your program to make sure that it works in parallel!
(c) When the size of matrix (N) is larger than 5, it is NOT be feasible for you to input elements of matrix A and B from the keyboard. In such cases, simply assign some random values to elements matrix A and B.
(d) Run your program for three different problem sizes: matrix size: N = 100, 200, 300, 400, 500 on different combinations of processes: 4, 8, 10, and 12 processes.
(e) Note the processing time by making use of appropriate MPI calls and draw a graph for varying values of matrix size with "Number of CPUs/nodes on X column" and "the time taken " on Y column.
That is, use randomly generated matricies in a
range of sizes and produce a
report on the time taken for a range of different sized matricies in a
parallel programmin environment using MPI.
A better way of expressing the problem is: Write a function "square_dgemm", similar to the LAPACK "dgemm" routine but limited to square matrices, to compute the product of two matrices using MPI to distribute the task onto a number of processors. Note, only one processor should read the data/matrix and then the program should automatically share this among the processors.
There is a harness routine matmul.c in http://www.hpc.unimelb.edu.au/cs/assignment2/ which provides checks on correctness and scalability. There are also serial basic and dlocked matrix multiiply sub-routines in the directory which may give assistance with methods of dealing with the odd bits left when the number of processors does not divide the matrix size.