For a personal project, I’ve been digging into C++ concurrency primitives. I’d previously finished reading OSTEP and found thread synchronization really weird, but very interesting.

To learn more about how to use std::thread, I decided to do a performance comparison of threaded vs. unthreaded matmult. Matmult (matrix multiplication) is a very big deal nowadays and is an obviously parallelizable task: since the elements of the resultant matrix don’t depend on each other, a thread can be dispatched to compute the dot product for each element.
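A minimal sketch of the one-thread-per-element idea (the `Matrix` alias and function names here are my own, not necessarily what my repo uses):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Compute one element of C = A * B: the dot product of row i of A
// with column j of B.
void dot(const Matrix& A, const Matrix& B, Matrix& C,
         std::size_t i, std::size_t j) {
    double sum = 0.0;
    for (std::size_t k = 0; k < B.size(); ++k)
        sum += A[i][k] * B[k][j];
    C[i][j] = sum;
}

// Naive dispatch: one thread per output element, so an n x p result
// spawns n * p threads.
Matrix matmul_threaded(const Matrix& A, const Matrix& B) {
    std::size_t n = A.size(), p = B[0].size();
    Matrix C(n, std::vector<double>(p, 0.0));
    std::vector<std::thread> threads;
    threads.reserve(n * p);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < p; ++j)
            threads.emplace_back(dot, std::cref(A), std::cref(B),
                                 std::ref(C), i, j);
    for (auto& t : threads) t.join();
    return C;
}
```

No locks are needed since each thread writes to a distinct `C[i][j]` — which is exactly why this looked so appealingly parallel to me.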

That’s really what makes ML models feasible: GPUs/TPUs can now massively parallelize their matmults for training/inference. As one of my ECE professors recently put it, the AI boom was fueled by hardware innovations that let the Scaling Law play out and actually give us good results with deep neural networks.

quick theory overview

– how to multiply 2 matrices??
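The short answer: for an n×m matrix A and an m×p matrix B, each element of the product is C[i][j] = Σₖ A[i][k] · B[k][j]. The textbook triple-loop version looks like this (a sketch, using a `Matrix` alias of my own choosing):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Textbook O(n^3) matmult: C[i][j] = sum over k of A[i][k] * B[k][j].
// Assumes A is n x m and B is m x p.
Matrix matmul(const Matrix& A, const Matrix& B) {
    std::size_t n = A.size(), m = B.size(), p = B[0].size();
    Matrix C(n, std::vector<double>(p, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < p; ++j)
            for (std::size_t k = 0; k < m; ++k)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}
```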

my thought process going in

trolling

“That’s weird,” I remarked as I saw the multi-threaded implementation often taking as much as 10 times longer than the unthreaded version.
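In hindsight, a big part of the problem is that spawning a thread isn’t free. Here’s a rough micro-benchmark of my own (not the post’s actual code) that times just the spawn-and-join overhead of N threads that do nothing at all:

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Time (in microseconds) to spawn and join N no-op threads.
// This is pure overhead that the unthreaded matmult never pays.
long long spawn_join_noop_threads(int N) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    threads.reserve(N);
    for (int i = 0; i < N; ++i)
        threads.emplace_back([] {});  // the thread does no useful work
    for (auto& t : threads) t.join();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start)
        .count();
}
```

Thread creation typically costs on the order of tens of microseconds, while a single dot product for a modest matrix is only a few hundred floating-point operations — so with one thread per element, the overhead can easily swamp the useful work.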

conclusions

Turns out, threading is not as OP as I thought. Naively applying optimizations like threading doesn’t necessarily translate to real performance benefits. In hindsight, I probably should’ve seen this one coming, since this was an often-stated theme of OSTEP and software dev articles I’ve read. “Premature optimization is the root of all evil” or something.

A good summary of the possible performance problems with naive multithreading can be found in this StackOverflow post.
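One common fix (not necessarily what my repo ended up doing) is to spawn only about as many threads as there are cores, each handling a contiguous band of rows, so the thread-creation cost is paid a handful of times rather than once per element:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Chunked threading: roughly one thread per hardware core, each
// computing a contiguous range of rows of C = A * B.
Matrix matmul_chunked(const Matrix& A, const Matrix& B) {
    std::size_t n = A.size(), m = B.size(), p = B[0].size();
    Matrix C(n, std::vector<double>(p, 0.0));
    unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; ++t) {
        // Split rows [0, n) into nthreads nearly equal bands.
        std::size_t lo = n * t / nthreads;
        std::size_t hi = n * (t + 1) / nthreads;
        threads.emplace_back([&, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i)
                for (std::size_t j = 0; j < p; ++j)
                    for (std::size_t k = 0; k < m; ++k)
                        C[i][j] += A[i][k] * B[k][j];
        });
    }
    for (auto& t : threads) t.join();
    return C;
}
```

Each thread still writes to disjoint rows of `C`, so no synchronization is needed; the difference is purely in how much work each thread gets per spawn.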

In reality, no one’s hand-coding implementations of matmult or whatnot nowadays. “Just use Python with a JIT, or Numba, or something,” say very pragmatic ML researchers. And the naive triple for-loop matmult algorithm isn’t even the most efficient (in terms of Big O) algorithm around. With some mysterious linear algebra, you can get down to O(n^2.373) instead of O(n^3).

My code for the threaded matmult comparisons can be found here as well as some other concurrency practice programs I was playing with.