Performance Evaluation and Optimization on GPU

Abstract:

With hundreds of cores, a GPU provides higher peak performance than its CPU counterpart. However, taking full advantage of that computing power is a major challenge. To understand the performance bottlenecks of applications on many-core GPUs and to optimize parallel programs for GPU architectures, we propose a performance evaluation model based on the memory wall and use it to classify applications as AbM (Application bound-in Memory) or AbC (Application bound-in Computing). Furthermore, we optimize kernels characterized by low memory bandwidth, including matrix multiplication and FFT (Fast Fourier Transform), by employing the texture cache on an NVIDIA GTX280 using CUDA (Compute Unified Device Architecture). Experimental results show that the texture cache helps AbM applications that have better data locality, so using the GPU memory hierarchy efficiently is critical for performance improvement.
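The abstract does not give the formulas behind the proposed model, so the following is only a minimal sketch of one common way to draw the AbM/AbC line: compare a kernel's arithmetic intensity (FLOPs per byte of memory traffic) with the device's ratio of peak compute to peak bandwidth. The `Device` class, the function names, and the device figures are assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Device:
    peak_gflops: float     # peak arithmetic throughput in GFLOP/s (assumed value)
    peak_bandwidth: float  # peak memory bandwidth in GB/s (assumed value)

    @property
    def balance(self) -> float:
        """FLOPs the device can execute per byte it can fetch from memory."""
        return self.peak_gflops / self.peak_bandwidth


def classify(flops: float, bytes_moved: float, device: Device) -> str:
    """Return 'AbM' if memory traffic limits the kernel, otherwise 'AbC'."""
    intensity = flops / bytes_moved  # FLOPs per byte of memory traffic
    return "AbM" if intensity < device.balance else "AbC"


def attainable_gflops(flops: float, bytes_moved: float, device: Device) -> float:
    """Upper bound on throughput: capped by compute or by bandwidth."""
    intensity = flops / bytes_moved
    return min(device.peak_gflops, intensity * device.peak_bandwidth)


# Illustrative figures only, roughly those quoted for a GTX280-class card.
gpu = Device(peak_gflops=933.0, peak_bandwidth=141.7)

n = 1024
flops = 2.0 * n**3  # one multiply and one add per inner-loop step

# Hypothetical case 1: no on-chip reuse, so every multiply-add
# loads two 4-byte operands from off-chip memory.
naive_bytes = 8.0 * n**3

# Hypothetical case 2: perfect reuse, so each of the three
# n x n float matrices crosses the memory bus exactly once.
ideal_bytes = 3 * 4.0 * n**2

print(classify(flops, naive_bytes, gpu))  # AbM
print(classify(flops, ideal_bytes, gpu))  # AbC
```

Under these assumptions the same matrix multiplication moves from AbM to AbC purely through better data reuse, which is consistent with the abstract's point that caching helps memory-bound kernels with good locality.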

Info:

Periodical:

Advanced Materials Research (Volumes 219-220)

Pages:

1445-1449

Online since:

March 2011

Copyright:

© 2011 Trans Tech Publications Ltd. All Rights Reserved
