Implementing a code generator for fast matrix multiplication in OpenCL on the GPU


News
07-11-12, 05:30 AM
Abstract:

This paper presents results from an implementation of a code generator for fast general matrix multiply (GEMM) kernels. Given a set of parameters, the code generator produces the corresponding GEMM kernel written in OpenCL. The generated kernels are optimized for high performance on AMD GPUs. Access latency to GPU global memory is the main obstacle to high performance. This study shows that storing matrix data in a block-major layout increases the performance and stability of GEMM kernels. On the Tahiti GPU (Radeon HD 7970), our DGEMM (double-precision GEMM) and SGEMM (single-precision GEMM) kernels achieve up to 848 GFlop/s (90% of peak) and 2646 GFlop/s (70% of peak), respectively.

(K. Matsumoto, N. Nakasato, S. G. Sedukhin: "Implementing a code generator for fast matrix multiplication in OpenCL on the GPU", accepted for Special Session: Auto-Tuning for Multicore and GPU (ATMG), IEEE 6th International Symposium on Embedded Multicore SoCs (MCSoC-12), Sep. 2012. [PDF (ftp://ftp.u-aizu.ac.jp/pub/u-aizu/doc/Tech-Report/2012/2012-002.pdf)])
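
For readers unfamiliar with the term, a block-major layout stores each tile of the matrix contiguously in memory, so a kernel working on one tile reads one dense run of addresses instead of strided rows. The abstract does not say how the authors order the blocks or the elements inside them, so the following Python is only a rough sketch of one common variant (blocks in row order, row-major inside each block). The function names and the block sizes bm and bn are hypothetical and are not taken from the paper.

```python
def to_block_major(a, rows, cols, bm, bn):
    """Repack a row-major matrix (flat list) into a block-major flat list.

    Assumption for this sketch: blocks of size bm x bn are stored one after
    another in row order, and elements inside a block are row-major.
    Dimensions must be multiples of the block size (no padding here).
    """
    if rows % bm or cols % bn:
        raise ValueError("matrix dimensions must be multiples of the block size")
    out = []
    for bi in range(0, rows, bm):            # first row of the current block
        for bj in range(0, cols, bn):        # first column of the current block
            for i in range(bi, bi + bm):     # each row inside the block
                start = i * cols + bj
                out.extend(a[start:start + bn])
    return out


def block_major_index(i, j, cols, bm, bn):
    """Flat index of element (i, j) in the layout built by to_block_major."""
    blocks_per_row = cols // bn
    block_id = (i // bm) * blocks_per_row + (j // bn)
    return block_id * (bm * bn) + (i % bm) * bn + (j % bn)


if __name__ == "__main__":
    rows, cols, bm, bn = 4, 4, 2, 2
    a = list(range(rows * cols))             # 0..15 in row-major order
    print(to_block_major(a, rows, cols, bm, bn))
```

With 2 x 2 blocks, the four elements of the top-left tile (0, 1, 4, 5) end up next to each other, which is the property that is meant to help with global memory access.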

More... (http://feedproxy.google.com/~r/gpgpuorg/~3/a8tLMTD33SI/code-generator-for-fast-matrix-multiplication-in-opencl)