GPUs, with hundreds of cores, provide higher peak performance than their CPU counterparts. However, taking full advantage of their computing power remains a major challenge. To understand the performance bottlenecks of applications on many-core GPUs and to guide the optimization of parallel programs on GPU architectures, we propose a performance evaluation model based on the memory wall and use it to classify applications into AbM (Application bound-in Memory) and AbC (Application bound-in Computing). Furthermore, we optimize kernels characterized by low memory bandwidth, including matrix multiplication and FFT (Fast Fourier Transform), by employing the texture cache on an NVIDIA GTX280 using CUDA (Compute Unified Device Architecture). Experimental results show that the texture cache benefits AbM applications with good data locality, so efficient use of the GPU memory hierarchy is critical for performance improvement.
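As a concrete illustration (a sketch of ours, not code from the paper), a matrix-multiplication kernel can route its input reads through the texture cache using the legacy CUDA texture-reference API that the GTX280 generation supports; the names `texA`, `texB`, and `matMulTex` below are hypothetical.

```cuda
#include <cuda_runtime.h>

// Texture references bound (on the host) to the two input matrices.
texture<float, 2, cudaReadModeElementType> texA;
texture<float, 2, cudaReadModeElementType> texB;

// Each thread computes one element of C = A * B for square width x width
// matrices; elements of A and B are fetched through the texture cache
// via tex2D(), so reads with 2D locality hit the cache instead of DRAM.
__global__ void matMulTex(float *C, int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; ++k) {
            // +0.5f addresses the center of each texel with
            // unnormalized coordinates and point filtering.
            sum += tex2D(texA, k + 0.5f, row + 0.5f) *
                   tex2D(texB, col + 0.5f, k + 0.5f);
        }
        C[row * width + col] = sum;
    }
}
```

On the host side, A and B would be copied into `cudaArray`s (via `cudaMallocArray` and `cudaMemcpy2DToArray`) and attached with `cudaBindTextureToArray(texA, ...)` before launching the kernel; only C needs to live in ordinary global memory because texture memory is read-only from the kernel's perspective.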