OpenCL Impressions
With the release of OS X 10.6, Apple included libraries and device driver support for OpenCL, a platform for general-purpose GPU computing. This allows developers to run parallelizable code on graphics cards. I decided to give it a spin, since a lot of the machine learning work I do could benefit from parallel processing: while modern CPUs can give us around 16 parallel threads of execution, modern GPUs can give us hundreds of threads by design.
I started off by trying a simple matrix multiplication program, which in theory can be easily broken into many threads. If you recall from linear algebra class, the multiplication of two matrices results in a third matrix where the element at row i and column j is calculated by taking the dot product of row i in the first matrix and column j in the second matrix. Since each element can be determined independently of the rest, the work is a perfect candidate for splitting across multiple threads. My benchmark was a familiar task: multiplying a random 200000x120 matrix by a random 120x120 matrix. I've had to heavily optimize this task on a CPU, and on a multi-core CPU the heavily optimized version takes around a second.
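To make the "each element is independent" point concrete, here is a minimal pure-Python sketch. It is not the code I benchmarked; the function names (`dot_row_col`, `matmul_serial`, `matmul_by_rows`, `random_matrix`) are made up for illustration, and the matrices are much smaller than the real ones. It computes the product once serially and once with every output row handed to a thread pool as a separate task. CPython threads won't actually make this faster; the point is only that the tasks never need to talk to each other.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def dot_row_col(a, b, i, j):
    """Element (i, j) of a*b: dot product of row i of a and column j of b."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))


def matmul_serial(a, b):
    """Plain nested-loop product, one element at a time."""
    return [[dot_row_col(a, b, i, j) for j in range(len(b[0]))]
            for i in range(len(a))]


def matmul_by_rows(a, b, workers=4):
    """Same product, but each output row is an independent task."""
    n_cols = len(b[0])

    def one_row(i):
        # Reads only row i of a and all of b; writes nothing shared.
        return [dot_row_col(a, b, i, j) for j in range(n_cols)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in row order, whatever order they finish in.
        return list(pool.map(one_row, range(len(a))))


def random_matrix(rows, cols, seed):
    """Hypothetical helper: a rows x cols matrix of floats in [0, 1)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    a = random_matrix(200, 12, seed=1)   # scaled-down stand-in for 200000x120
    b = random_matrix(12, 12, seed=2)    # scaled-down stand-in for 120x120
    assert matmul_serial(a, b) == matmul_by_rows(a, b)
    print("serial and per-row results match")
```

On a GPU the same idea is pushed further: instead of one task per row, there is roughly one lightweight thread per output element (or per small tile of elements).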
First off, I found the lack of documentation and sample programs a little frustrating. OpenCL is so new that there isn't much out there. I started off with a Python project called PyOpenCL, written by Andreas Klöckner, who has done a great job. It's still really early software, so it took a while to get up and running, and Andreas was nice enough to fix a bug upstream that was blocking me. I used the matrix multiplication OpenCL code from nVidia's shared memory example and ran it on a MacBook Air with an nVidia 9400M GPU. With 32-bit floats, the GPU code was consistently about 9x slower than the CPU.
I eventually came across a bunch of OpenCL example programs hidden in nVidia's CUDA SDK. (CUDA is a similar language to OpenCL, but works only on nVidia cards.) One of them is a sample matrix multiplication C++ program that will multiply matrices of any size, so I tried that. It took about 8 seconds to multiply my favorite matrices, much worse than I've gotten on a CPU. To make sure I wasn't doing something wrong, I then tried multiplying the matrices using CUBLAS, a heavily optimized matrix library that runs CUDA code on the GPU. That took 5 seconds.
Turns out that instead of matrix multiplication being many times faster on GPUs, it can be many times slower. The reason is that matrix multiplication is memory bound. It takes longer to transfer the matrix tiles to the graphics card via the PCI Express bus than it does for the CPU to access the memory and do the multiplication and addition operations on many fewer threads. I verified this by using smaller ints instead of floats and got a significant speed-up. The next thing I want to try is running these tests on a better GPU. The 9400M in the MacBook Air has a max memory bandwidth of 17.1 GB/s, while nVidia's latest and greatest, the GTX 285, claims 159 GB/s. The 9400M also doesn't even have its own local memory to cache tiles... it has to share system memory. At most, though, I think I'm looking at a 2-3x speed-up multiplying my matrices, which is a lot less than I expected.
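For a feel of the numbers involved, here is a back-of-envelope sketch of how much data the multiplication touches and how the two quoted bandwidth figures compare. It is only a lower bound under loud assumptions: each matrix is streamed exactly once at the quoted peak rate, 1 GB means 10^9 bytes, and the 2-byte element size is my stand-in for "smaller ints". A real tiled kernel re-reads tiles many times, and any PCI Express transfer runs at its own rate, which is not modeled here. The function names are hypothetical.

```python
def total_bytes(m, k, n, bytes_per_element):
    """Bytes in an m x k input, a k x n input, and the m x n output."""
    return (m * k + k * n + m * n) * bytes_per_element


def min_stream_seconds(m, k, n, bytes_per_element, bandwidth_gb_per_s):
    """Lower bound: time to move every matrix once at peak bandwidth.

    Assumes 1 GB = 1e9 bytes and that no element is read twice.
    """
    return total_bytes(m, k, n, bytes_per_element) / (bandwidth_gb_per_s * 1e9)


if __name__ == "__main__":
    M, K, N = 200000, 120, 120          # matrix sizes from the post
    for name, gb_per_s in [("9400M", 17.1), ("GTX 285", 159.0)]:
        for width in (4, 2):            # 32-bit floats vs. assumed 16-bit ints
            secs = min_stream_seconds(M, K, N, width, gb_per_s)
            print(f"{name}, {width}-byte elements: {secs * 1000:.2f} ms minimum")
```

Two things fall straight out of the arithmetic: halving the element size halves the traffic, and the gap between the two cards' peak bandwidth is a bit over 9x. How much of that gap shows up in practice depends on everything this sketch leaves out.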
So long story short, it took a long time to do something simple with OpenCL, and the GPU strengths aren't quite there yet to make it worthwhile for my problem. No doubt there are computations out there better performed on a GPU, but I don't really believe the marketing hype that says GPGPU is going to make everyday applications run faster (yet, at least).
If you're interested in reading more about the shortcomings of the GPU, I recommend Tim Sweeney's slides. I read them a few months ago and re-read them after this experiment, and it's interesting that his ultimate conclusion is the same as mine: GPGPU programming is too hard and limited, and the non-unified memory architecture is a big problem. Maybe I will put my money on the CPU... or on AMD/ATi, since the synergies there might give us a nice hybrid architecture.
