Question 1: Matrix Multiplication
Problem Statement:
We have to take n matrices as input and compute multiplication of the given matrices and return the product matrix given $ 1 < n < 5 $. Each matric have a maximum dimension of $ 1000 * 1000 $.
Optimisation Approaches:
The baseline algorithm of matrix multiplication is almost taking $38$ secs on abacus for 5 matrices of each of dimensions $1000 1000 $ . The various optimisation approaches I made and their corresponding run time values are provided for a matrix of size $ 1000 1000 $.
1. Changing Loop order:
The order of looping was changed to $ i,k,j$ from $i,j,k$ was made to enhance the utility of cache spatial locality. This change brought the run time to $~30 secs$.
2. Divide and Conquer Algorithm:
Next I implemented a cache oblivious divide and conquer algorithm, to reduce the time furthur. However with. a coarsening constant of 1, the function overhead being significantly high, the runtime almost went up to $~50$ secs. On some trial and error, the best coarsening constant was found to be $32$ which gave a runtime of almost $~12$ secs. The divide and conquer algorithm used is given as follows:
Inputs: matrices A of size n × m, B of size m × p.
Base case: if $max(n, m, p)$ is below some threshold, use an unrolled version of the iterative algorithm.
Recursive cases:
If $max(n, m, p) = n$, split A horizontally:
${\displaystyle C={\begin{pmatrix}A{1}\A{2}\end{pmatrix}}{B}={\begin{pmatrix}A{1}B\A{2}B\end{pmatrix}}}$
$ C={\begin{pmatrix}A{1}\A{2}\end{pmatrix}}{B}={\begin{pmatrix}A{1}B\A{2}B\end{pmatrix}}$
Else, if $max(n, m, p) = p$, split B vertically:
${\displaystyle C=A{\begin{pmatrix}B{1}&B{2}\end{pmatrix}}={\begin{pmatrix}AB{1}&AB{2}\end{pmatrix}}}C=A{\begin{pmatrix}B{1}&B{2}\end{pmatrix}}={\begin{pmatrix}AB{1}&AB{2}\end{pmatrix}}$
Otherwise, $max(n, m, p) = m$. Split A vertically and B horizontally:
${\displaystyle C={\begin{pmatrix}A{1}&A{2}\end{pmatrix}}{\begin{pmatrix}B{1}\B{2}\end{pmatrix}}=A{1}B{1}+A{2}B{2}}$
$ C={\begin{pmatrix}A{1}&A{2}\end{pmatrix}}{\begin{pmatrix}B{1}\B{2}\end{pmatrix}}=A{1}B{1}+A{2}B{2} $
3. Using a 2D Array instead of a 1D Array:
Inititally the Divide and Conquer algorithm was implemented by using 1D Arrays of size $(row size * column size)$. However the pointer access referencing over the array was causing a significant overhead. On using a 2D array, accessing rows became much faster, causin the runtime to decrease to ~7 secs.
4. Compiler level Optimisation
I read about the various compiler level optimisations. I used keywords like register to store the variables in register for faster access. Also, restrict to reduce some of the compiler checks. Also I used a functional macro for the max function. All these brought down the runtime to $~6$ secs.
Question 2: Floyd Warshall Algorithm
Problem Statement:
We have to calculate the multiple source shortest path using the Floyd Warshall Algorithm. The complexity of the algorithm is $O(n^3)$.
Optimisation Approaches
The baseline code took $~60 secs$ for the testcase t29
. The baseline loop order cannot be changed since k has to be the outermost loop because of dependencies.
The various optimisation techniques used are as follows:
1. Using a 2D Array instead of a 1D Array:
Inititally the algorithm was implemented by using 1D Arrays of size $(row size * column size)$. However the pointer access referencing over the array was causing a significant overhead. On using a 2D array, accessing rows became much faster, causin the runtime to decrease to ~25 secs.
2. Compiler level Optimisation
I read about the various compiler level optimisations. I used keywords like register to store the variables in register for faster access. Also, restrict to reduce some of the compiler checks. Also, I used pointer reference instead of array reference. All these decreased the runtime to almost ~20 secs.