Reduce matrix rows with CUDA
Windows 7, NVIDIA GeForce 425M. I wrote a simple CUDA code which calculates the row sums of a matrix. The matrix has a uni-dimensional representation (pointer to a float).
The serial version of the code is below (it has 2 loops, as expected):
void serial_rowSum(float* m, float* output, int nrow, int ncol) {
    float sum;
    for (int i = 0; i < nrow; i++) {
        sum = 0;
        for (int j = 0; j < ncol; j++)
            sum += m[i*ncol + j];
        output[i] = sum;
    }
}

Inside the CUDA code, I call a kernel function sweeping the matrix rows. Below is the kernel call snippet:
dim3 threadsPerBlock((unsigned int) nThreadsPerBlock); // has to be a multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow / (float) nThreadsPerBlock));

kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);

and the kernel function which performs the parallel sum of the rows (it still has 1 loop):
__global__ void kernel_rowSum(float *m, float *s, int nrow, int ncol) {
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowIdx < nrow) {
        float sum = 0;
        for (int k = 0; k < ncol; k++)
            sum += m[rowIdx*ncol + k];
        s[rowIdx] = sum;
    }
}

So far so good. The serial and parallel (CUDA) results are equal.
The whole point is that the CUDA version takes about twice the time of the serial one to compute, even if I change the nThreadsPerBlock parameter: I tested nThreadsPerBlock from 32 to 1024 (the maximum number of threads per block allowed for my card).
IMO, the matrix dimension is big enough to justify parallelization: 90,000 x 1,000.
Below, I report the time elapsed for the serial and parallel versions using different values of nThreadsPerBlock. The time is reported in msec, averaged over 100 samples:
Matrix: nrow = 90000 x ncol = 1000

Serial: average time elapsed per sample in msec (100 samples): 289.18.

CUDA (32 threadsPerBlock): average time elapsed per sample in msec (100 samples): 497.11.

CUDA (1024 threadsPerBlock): average time elapsed per sample in msec (100 samples): 699.66.
Just in case, the version with 32/1024 nThreadsPerBlock is the fastest/slowest one.
I understand that there is some kind of overhead when copying from host to device and the other way around, but maybe the slowness is because I am not implementing the fastest code.
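For reference, this is roughly how the kernel time could be isolated from the transfer time with CUDA events (a minimal sketch; h_m and h_output are just illustrative names for the host buffers):

cudaEvent_t start, stop;
float kernel_ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// host -> device copy (not included in the kernel timing)
cudaMemcpy(d_m, h_m, nrow * ncol * sizeof(float), cudaMemcpyHostToDevice);

cudaEventRecord(start);
kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&kernel_ms, start, stop);   // kernel time only, in msec

// device -> host copy of the results
cudaMemcpy(h_output, d_output, nrow * sizeof(float), cudaMemcpyDeviceToHost);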
Since I am far from being a CUDA expert:
Am I coding the fastest version for this task? How could I improve my code? Can I get rid of the loop in the kernel function?
Any thoughts appreciated.
EDIT 1
Although I describe a standard rowSum, I am interested in the AND/OR operation of rows which have {0,1} values, like rowAND/rowOR. That said, it doesn't allow me to exploit the cuBLAS "multiply by a column vector of 1's" trick, as suggested by some commentators.
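For example, a rowAND variant of the kernel above could be sketched as follows (with {0,1} values the AND of a row can be accumulated as a product; kernel_rowAnd is just an illustrative name):

__global__ void kernel_rowAnd(float *m, float *s, int nrow, int ncol) {
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowIdx < nrow) {
        float acc = 1;                        // identity element of AND
        for (int k = 0; k < ncol; k++)
            acc = acc * m[rowIdx*ncol + k];   // AND of {0,1} values as a product
        s[rowIdx] = acc;
    }
}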
EDIT 2
As suggested by other users and here endorsed:
Forget about trying to write your own functions, use the Thrust library instead and the magic comes.
Since you mentioned you need a general reduction algorithm other than sum only, I will try to give 3 approaches here. The kernel approach may have the highest performance. The Thrust approach is the easiest to implement. The cuBLAS approach works only with sum but has good performance.
Kernel approach
Here's a doc introducing how to optimize a standard parallel reduction. A standard reduction can be divided into 2 stages:
- multiple thread blocks each reduce one part of the data;
- one thread block reduces the result of stage 1 into the final 1 element.
For your multi-reduction (reducing the rows of a matrix) problem, only stage 1 is enough. The idea is to reduce 1 row per thread block. For further considerations like multi-row per thread block or 1 row per multiple thread blocks, you can refer to the paper provided by @Novak. This may improve the performance further, especially for matrices with a bad shape. A sketch of the one-row-per-block idea is shown below.
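A minimal, untuned sketch of stage 1 with one thread block per row and a standard shared-memory tree reduction (BLOCK_SIZE is an assumed power-of-two block size; replace the += with AND/OR for your case):

#define BLOCK_SIZE 256   // assumed power of two

__global__ void kernel_rowReduce(const float *m, float *s, int ncol) {
    __shared__ float sdata[BLOCK_SIZE];
    int row = blockIdx.x;          // one row per thread block
    int tid = threadIdx.x;

    // each thread accumulates a strided slice of the row
    float acc = 0.0f;
    for (int k = tid; k < ncol; k += BLOCK_SIZE)
        acc += m[row * ncol + k];
    sdata[tid] = acc;
    __syncthreads();

    // standard tree reduction in shared memory
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        s[row] = sdata[0];
}

// launch with one block per row:
// kernel_rowReduce<<<nrow, BLOCK_SIZE>>>(d_m, d_output, ncol);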
Thrust approach
A general multi-reduction can be done with thrust::reduce_by_key in a few minutes. You can find some discussion here: Determining the least element and its position in each matrix column with CUDA Thrust.
However, thrust::reduce_by_key does not assume each row has the same length, so you will get a performance penalty. Another post, How to normalize matrix columns in CUDA with max performance?, gives a profiling comparison between the thrust::reduce_by_key and the cuBLAS approach on the sum of rows. It may give you a basic understanding of the performance.
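A minimal sketch of this approach (the row key of element i is generated on the fly as i / ncol; rowsum_thrust is just an illustrative wrapper name):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>

using namespace thrust::placeholders;

// d_m: row-major nrow x ncol matrix on the device, d_sums: nrow outputs
void rowsum_thrust(const thrust::device_vector<float>& d_m,
                   thrust::device_vector<float>& d_sums,
                   int nrow, int ncol)
{
    thrust::reduce_by_key(
        thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1 / ncol),
        thrust::make_transform_iterator(thrust::counting_iterator<int>(nrow * ncol), _1 / ncol),
        d_m.begin(),
        thrust::make_discard_iterator(),  // the generated row keys are not needed
        d_sums.begin());                  // reduction op defaults to thrust::plus
}

For your rowAND/rowOR case, the overload of thrust::reduce_by_key that takes a binary predicate and a binary reduction functor as two extra arguments lets you swap the default plus for your own operator.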
cuBLAS approach
The sum of the rows/cols of a matrix A can be seen as a matrix-vector multiplication where the elements of the vector are all ones. It can be represented by the following MATLAB code:

y = A * ones(size(A,2),1);

where y is the sum of the rows of A.
The cuBLAS library provides a high performance matrix-vector multiplication function cublas<t>gemv() for this operation.
The timing result shows that this routine is only 10~50% slower than simply reading all the elements of A once, which can be seen as the theoretical upper limit of the performance for this operation.
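A minimal sketch of the cublasSgemv call for the row sums (d_ones is an assumed device vector of ncol ones; since cuBLAS is column-major, the row-major matrix is passed as an ncol x nrow matrix and transposed inside gemv):

#include <cublas_v2.h>

// d_m: row-major nrow x ncol matrix on the device,
// d_ones: ncol ones on the device, d_output: nrow row sums
void rowsum_cublas(cublasHandle_t handle, const float *d_m,
                   const float *d_ones, float *d_output,
                   int nrow, int ncol)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_T,
                ncol, nrow,           // dimensions of d_m as cuBLAS sees it
                &alpha, d_m, ncol,    // leading dimension = ncol
                d_ones, 1,
                &beta, d_output, 1);
}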