OpenCL Impressions

written by matt, on Sep 19, 2009 6:45:00 AM.

With the release of OS X 10.6, Apple included libraries and device driver support for OpenCL, a platform for general-purpose GPU computing. This allows developers to run parallelizable code on graphics cards. I decided to give it a spin since a lot of the machine learning work I do could benefit from parallel processing, and while modern CPUs can give us around 16 parallel threads of execution, modern GPUs can give us hundreds of threads by design.

I started off by trying a simple matrix multiplication program, which in theory can be easily broken into many threads. If you recall from linear algebra class, the multiplication of two matrices results in a third matrix where the element at row i and column j is calculated by taking the dot product of row i in the first matrix and column j in the second matrix. Since each element can be determined independently of the rest, the work splits naturally across multiple threads. A familiar task for me is to multiply a random 200000x120 matrix by a random 120x120 matrix. I've had to heavily optimize this task on a CPU, and on a multi-core CPU the heavily optimized version takes around a second.
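To make the "each element is independent" point concrete, here is a minimal pure-Python sketch. It is not the OpenCL kernel used in this post; the function names `matmul_element` and `matmul` are made up for illustration, and the tiny matrices are just sample data.

```python
def matmul_element(a, b, i, j):
    """Dot product of row i of a and column j of b."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    rows, cols = len(a), len(b[0])
    # Each (i, j) pair depends only on one row of a and one column of b,
    # so every output element is an independent unit of work. On a GPU
    # each one could in principle be handed to its own thread.
    return [[matmul_element(a, b, i, j) for j in range(cols)]
            for i in range(rows)]


a = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # 2x3
b = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]               # 3x2
c = matmul(a, b)               # 2x2 result
```

The nested list comprehension runs sequentially here, but nothing in `matmul_element` reads or writes another output element, which is the property a parallel version relies on.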

First off, I found the lack of documentation and sample programs a little frustrating. OpenCL is so new that there isn't much out there. I started off with a Python project called PyOpenCL, written by Andreas Klöckner, who has done a great job. It's still really early software, so it took a while to get up and running. Andreas was nice enough to fix a bug upstream that was blocking me. I used the matrix multiplication OpenCL code in nVidia's shared memory example, ran it on a MacBook Air with an nVidia 9400M GPU, and the GPU code was consistently about 9x slower with 32-bit floats.

I eventually came across a bunch of OpenCL example programs hidden in nVidia's CUDA SDK. (CUDA is a similar language to OpenCL, but works only on nVidia cards.) There is a sample matrix multiplication C++ program that will multiply matrices of any size, so I tried that. It took about 8s to multiply my favorite matrices, much worse than I've gotten on a CPU. To make sure I wasn't doing something wrong, I then tried multiplying the matrices using CUBLAS, a heavily optimized matrix library that runs CUDA code on the GPU. 5 seconds there.

Turns out that instead of matrix multiplication being many times faster on GPUs, it can be many times slower. The reason is that matrix multiplication is memory bound. It takes longer to transfer the matrix tiles over to the graphics card via the PCI Express bus than it does for the CPU to access the memory and do the multiplication and addition operations on many fewer threads. I verified this by using 16-bit ints instead of 32-bit floats and got a significant speed-up. Next thing I want to try is running these tests on a better GPU. The 9400M on the MacBook Air has a max memory bandwidth of 17.1 GB/s, while nVidia's latest and greatest, the GTX 285, claims 159 GB/s. The 9400M also doesn't even have its own local memory to cache tiles... it has to share system memory. At most though, I think I'm looking at a 2-3x speed-up multiplying my matrices, which is a lot less than I expected.
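One rough way to reason about whether a multiplication is memory bound is its arithmetic intensity: floating-point operations per byte moved. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes the idealized case where every input and output element is moved exactly once, the helper name `arithmetic_intensity` is hypothetical, and the 4096x4096 square case is an arbitrary comparison size that was not part of these tests.

```python
def arithmetic_intensity(m, k, n, bytes_per_element=4):
    """FLOPs per byte for an (m x k) times (k x n) product.

    Assumes each element of both inputs and of the output is moved
    exactly once, which is an idealized best case.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# The shape used in this post, with 32-bit floats.
tall = arithmetic_intensity(200_000, 120, 120)

# Same shape with 16-bit elements: half the bytes for the same work.
tall_16bit = arithmetic_intensity(200_000, 120, 120, bytes_per_element=2)

# A large square product for comparison (size is an assumption).
square = arithmetic_intensity(4096, 4096, 4096)
```

Under these assumptions the 200000x120 by 120x120 product comes out at roughly 30 FLOPs per byte, halving the element size doubles that figure, and the square case works out to n/6, or about 683. A kernel that re-reads inputs from global memory instead of reusing tiles would do worse than this best case.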

So long story short, it took a long time to do something simple with OpenCL, and the GPU strengths aren't quite there yet to make it worthwhile for my problem. No doubt there are computations out there better performed on a GPU, but I don't really believe the marketing hype that says GP-GPU is going to make everyday applications run faster (yet, at least).

If you're interested in reading more about shortcomings of the GPU, I recommend reading Tim Sweeney's slides. I read them a few months ago, re-read them after this experiment, and it's interesting that his ultimate conclusion is the same as mine: GPGPU programming is too hard and limited, and the non-unified memory architecture is a big problem. Maybe I will put my money on the CPU... or on AMD/ATi since the synergies there might give us a nice hybrid architecture.

  • mattgattis
    Thanks for the advice David. I was referring to __local in OpenCL, aka shared memory, which is on-chip on many graphics cards but off-chip on the 9400m. And I meant caching as in keeping copies of tiles copied from the host's RAM on the GPU's RAM, not caching as in the CPU's L2 cache. Thanks for the tip on not making the useless copy on the 9400m. I'll give that a shot.
  • jho
    For those of you who are curious, I've done some in-depth matrix multiply benchmarking on OpenCL: http://sites.google.com/site/jhosite/csc5551/research-project. The results are pretty impressive, especially when I introduced a vector optimization that wasn't in nvidia's reference code.
  • jho
    Holmes - correct, with a large enough square matrix you will see 70x the performance vs. a sequential matrix multiply on the CPU. Run the same OpenCL matrix multiply on the CPU vs. the GPU, the GPU will demolish the CPU.
  • Holmes Futrell
    Matrix multiplication is only bandwidth bound when the inner dimensions are small -- which is what's going on in your testing. Try out square matrices instead!
  • mattgattis
    Also, Gavin - I was multiplying those matrices by a random matrix of the same size transposed.
  • mattgattis
    Gavin - the input matrix was 160,000x16 using PyOpenCL, since I was using nvidia's reference code that conveniently uses the same number of columns as the tile size to make the code simple. The oclMatrixMul code in the CUDA SDK let me do arbitrary size matrices and tile size, so I tried a bunch of different combinations, but mainly 200,000x120 using a tile size of 20. With both, I also coded a simple matrix multiplication to run on the host for comparison. In Python, I used numpy which is really heavily optimized. In the C version, I coded my own really simple one which took about 3s compared to numpy's 1s. I did try different matrix sizes and got similar results. I think no matter how big your matrix is, you still have the same ratio of copying data to computing dot products, but I'm not positive the math works out that way. The one thing that confirmed it was memory bandwidth was varying the data types of the entries in the matrix. There was a speedup that put the GPU on par with the CPU when the GPU was using 16 bit ints. And to Gavin - I didn't realize it was only 16k per processor. That must've been why the code crashed when I tried using larger tile sizes. Not sure why the shared mem didn't help then. I was careful to time only the kernel call, not the compilation. The PyOpenCL code doesn't give you granularity for excluding the kernel allocation... would that be happening in the compile call or the run call? I still want to test all of this on a better GPU to verify my theory of the task being memory bound.
  • David
    Shared memory is only 16k per multiprocessor, so it really isn't much. The 9400M has 2 multiprocessors. At any rate, I think if CL_DEVICE_LOCAL_MEM_TYPE is CL_LOCAL then it's located on-chip, vs. CL_GLOBAL. I haven't tried the different listings for matrix multiplication, but make sure that you're not counting the compilation of the kernel in the time taken or any allocations needed by the kernel; the latter especially is delayed until the first execution and will count towards execution time if you're not careful (see also http://forums.nvidia.com/index.php?showtopic=99844)
  • Gavin
    When comparing all these different approaches, was the input matrix the same size in each case? IOW, was the algorithm the only factor changed? When running your OpenCL test, how did you divide up the data into blocks for the work queue? It would be very interesting to run the exact same test, varying only whether it was running on the CPU or GPU when creating the OpenCL context. Since you postulate that memory bandwidth was a potential limiting factor, running the same test with increasing matrix sizes should show a knee in the curve where you hit this limit. Now I want to sit down and run some more tests myself...
  • mattgattis
    Hey David - I guess I just assume it's off-chip since none of the specifications say that the 9400m has its own memory banks or how much memory it contains. Not sure physically where it would store the shared memory, unless it's just a really tiny amount they put on the chip equivalent to register space? Also I compared the OpenCL code in nVidia's getting started guide for multiplying matrices in shared memory. They have three different listings, one that doesn't use shared memory and two that do use shared memory. They claim quite a big performance boost by using the shared memory, but when I tried all 3 on the 9400m, I got roughly the same performance.
  • David
    I'm curious, where is it mentioned that the 9400m has off-chip shared memory? All of NVIDIA's programming/optimization guides pretty much tell you to assume it's on-chip and as fast as registers if you avoid bank conflicts.
  • David
    OpenCL muddied up the terms, but if you mean what's called shared memory in CUDA (__local in OpenCL), the 9400M does have that on-chip. It's thread-local memory that's shared with global memory, and is off-chip and thus shared with normal RAM on the Air. Also, global and thread-local memory aren't cached on any NVIDIA GPUs anyway. If caching would help (or you can't do coalesced accesses), you either have to stage it into shared memory or access it via a texture. One thing that would help for your situation is using mapped pinned memory to eliminate the memcpy that's useless when VRAM is shared with normal RAM (CU_DEVICE_ATTRIBUTE_INTEGRATED in CUDA, I forget what in OpenCL) - see oclBandwidthTest to see how it's done and look for CUDA2.2PinnedMemoryAPIs.pdf for information.
  • jimmyconway
    I saw this on reddit. Just read it. Interesting. I thought it would have been many times faster. How many average programs really require complex mathematical functions? I am still having a ball with writing server-client socket applications, maybe a GUI soon.
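
The timing advice in the comments (don't count kernel compilation, and watch for allocations that are delayed until the first execution) can be sketched generically in Python. This is not PyOpenCL code: `time_steady_state` and `fake_kernel` are hypothetical names, and `fake_kernel` is only a stand-in for a real kernel launch. The idea is simply to run the workload untimed first so one-off costs land outside the measurement.

```python
import time


def time_steady_state(fn, warmup=1, repeats=5):
    """Return the best wall-clock time of fn() over `repeats` timed runs,
    after `warmup` untimed runs that absorb one-off setup costs."""
    for _ in range(warmup):
        fn()  # untimed: first-run costs such as lazy allocation land here
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


calls = []


def fake_kernel():
    # Hypothetical stand-in for "launch the kernel and wait for it".
    calls.append(1)
    return sum(range(10_000))


elapsed = time_steady_state(fake_kernel, warmup=2, repeats=3)
```

With a real GPU kernel, `fn` would also need to block until the kernel has actually finished, since launches are typically asynchronous; otherwise the timer only measures how long it takes to queue the work.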