c - Reduce matrix rows with CUDA


Windows 7, NVIDIA GeForce 425M.

I wrote a simple CUDA code which calculates the row sums of a matrix. The matrix has a uni-dimensional representation (a pointer to a float).

The serial version of the code is below (it has 2 loops, as expected):

void serial_rowSum(float *m, float *output, int nrow, int ncol) {
    float sum;
    for (int i = 0 ; i < nrow ; i++) {
        sum = 0;
        for (int j = 0 ; j < ncol ; j++)
            sum += m[i*ncol+j];
        output[i] = sum;
    }
}

Inside the CUDA code, I call a kernel function sweeping the matrix by rows. Below is the kernel call snippet:

dim3 threadsPerBlock((unsigned int) nThreadsPerBlock); // has to be a multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow / (float) nThreadsPerBlock));

kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);

And the kernel function which performs the parallel sum of the rows (it still has 1 loop):

__global__ void kernel_rowSum(float *m, float *s, int nrow, int ncol) {
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowIdx < nrow) {
        float sum = 0;
        for (int k = 0 ; k < ncol ; k++)
            sum += m[rowIdx*ncol+k];
        s[rowIdx] = sum;
    }
}

So far so good. The serial and parallel (CUDA) results are equal.

The whole point is that the CUDA version takes about twice the time of the serial one to compute, even if I change the nThreadsPerBlock parameter: I tested nThreadsPerBlock from 32 to 1024 (the maximum number of threads per block allowed for my card).

IMO, the matrix dimension is big enough to justify parallelization: 90,000 x 1,000.

Below, I report the time elapsed for the serial and parallel versions using different values of nThreadsPerBlock. Times are reported in msec, averaged over 100 samples:

Matrix: nrow = 90,000 x ncol = 1,000

Serial: average time elapsed per sample in msec (100 samples): 289.18.

CUDA (32 ThreadsPerBlock): average time elapsed per sample in msec (100 samples): 497.11.

CUDA (1024 ThreadsPerBlock): average time elapsed per sample in msec (100 samples): 699.66.

Just in case, the version with 32 nThreadsPerBlock is the fastest one, and 1024 the slowest.

I understand that there is some kind of overhead when copying from host to device and the other way around, but maybe the slowness is because I am not implementing the fastest code.
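Just to illustrate (a sketch of my own, reusing the launch names from above), the kernel can be timed separately from the memory copies with CUDA events, which tells whether the slowness comes from transfer overhead or from the kernel itself:

cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&ms, start, stop); // kernel time only, in msec;
                                        // time the cudaMemcpy calls the same way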

Since I am far from being a CUDA expert:

Am I coding the fastest version of this task? How could I improve my code? Can I get rid of the loop in the kernel function?

Any thoughts appreciated.

EDIT 1

Although I describe a standard rowSum, I am interested in the AND/OR operation of rows which have {0;1} values, like rowAND/rowOR. That said, it doesn't allow me to exploit the cuBLAS "multiply by the 1's column vector" trick, as suggested by some commentators.
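For instance, on {0;1} values the row-wise AND is just a min-reduction and the OR a max-reduction, so a hypothetical rowAND variant of the loop kernel above only swaps the accumulator and the operator:

__global__ void kernel_rowAnd(const float *m, float *s, int nrow, int ncol) {
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowIdx < nrow) {
        float acc = 1.0f;                        // identity element for AND
        for (int k = 0 ; k < ncol ; k++)
            acc = fminf(acc, m[rowIdx*ncol+k]);  // AND == min on {0;1}
        s[rowIdx] = acc;
    }
}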

EDIT 2

As suggested by other users and here endorsed:

Forget about trying to write your own functions, use the Thrust library instead and the magic comes.

Since you mentioned you need a general reduction algorithm other than sum only, I will try to give 3 approaches here. The kernel approach may have the highest performance. The Thrust approach is easiest to implement. The cuBLAS approach works with sum only, but it has good performance.

Kernel approach

Here's a doc introducing how to optimize a standard parallel reduction. A standard reduction can be divided into 2 stages:

  1. multiple thread blocks each reduce one part of the data;
  2. one thread block reduces the result of stage 1 into the final single element.

For the multi-reduction (reducing the rows of a matrix) problem, stage 1 alone is enough. The idea is to reduce one row per thread block; a minimal sketch is below. For further considerations like multi-row per thread block or one row per multiple thread blocks, you can refer to the paper provided by @Novak. It may improve the performance further, especially for matrices of bad shape.
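A minimal sketch of this idea, with an assumed block size of 256 and a kernel name (kernel_rowReduce) that is not from the original post: each block owns one row, its threads stride across that row, and a shared-memory tree combines the per-thread partial sums.

#define BLOCK 256   // threads per block; must be a power of two here

__global__ void kernel_rowReduce(const float *m, float *s, int nrow, int ncol) {
    __shared__ float partial[BLOCK];
    int row = blockIdx.x;       // one thread block per row
    int tid = threadIdx.x;

    // each thread accumulates a strided slice of the row
    float sum = 0;
    for (int k = tid ; k < ncol ; k += blockDim.x)
        sum += m[row*ncol+k];
    partial[tid] = sum;
    __syncthreads();

    // shared-memory tree reduction of the partial sums
    for (int stride = blockDim.x / 2 ; stride > 0 ; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        s[row] = partial[0];
}

// launched with one block per row:
// kernel_rowReduce<<<nrow, BLOCK>>>(d_m, d_output, nrow, ncol);

Replacing += with fminf/fmaxf turns the same sketch into the rowAND/rowOR case from EDIT 1.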

Thrust approach

A general multi-reduction can be done with thrust::reduce_by_key in a few minutes. You can find some discussion here: Determining the least element and its position in each matrix column with CUDA Thrust.
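A sketch of that approach, assuming row-major storage and data already in device_vectors (the names thrust_rowSum and row_index are illustrative): the key of each element is its row index (linear index / ncol), so reduce_by_key collapses each row into one value.

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>

// maps a linear element index to its row index (row-major storage)
struct row_index {
    int ncol;
    row_index(int ncol) : ncol(ncol) {}
    __host__ __device__ int operator()(int i) const { return i / ncol; }
};

void thrust_rowSum(const thrust::device_vector<float> &m,
                   thrust::device_vector<float> &output,
                   int nrow, int ncol) {
    thrust::reduce_by_key(
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), row_index(ncol)),
        thrust::make_transform_iterator(thrust::make_counting_iterator(nrow*ncol), row_index(ncol)),
        m.begin(),
        thrust::make_discard_iterator(),  // the row indices themselves are not needed
        output.begin());
}

For the rowAND/rowOR case, the overload of thrust::reduce_by_key taking a binary op (e.g. thrust::minimum<float>()) can be used instead of the default sum.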

However, thrust::reduce_by_key does not assume that each row has the same length, so there is a performance penalty. The post How to normalize matrix columns in CUDA with max performance? gives a profiling comparison between the thrust::reduce_by_key and the cuBLAS approach on sum of rows. It may give you a basic understanding of the performance.

cuBLAS approach

The sum of rows/cols of a matrix A can be seen as a matrix-vector multiplication where the elements of the vector are all ones. It can be represented by the following MATLAB code:

y = A * ones(size(A,2),1);

where y is the sum of the rows of A.

The cuBLAS library provides a high performance matrix-vector multiplication function cublas<t>gemv() for this operation.
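A sketch of that call, assuming a created cublasHandle_t and a device vector d_ones pre-filled with ncol ones (the wrapper name cublas_rowSum is illustrative): since cuBLAS is column-major, the row-major nrow x ncol array is treated as its transpose, and CUBLAS_OP_T recovers y = A * ones.

#include <cublas_v2.h>

void cublas_rowSum(cublasHandle_t handle, const float *d_m,
                   const float *d_ones, float *d_output,
                   int nrow, int ncol) {
    const float alpha = 1.0f, beta = 0.0f;
    // the column-major view of the row-major array is ncol x nrow, lda = ncol;
    // transposing that view gives back A, so y = 1.0 * A * ones + 0.0 * y
    cublasSgemv(handle, CUBLAS_OP_T,
                ncol, nrow,
                &alpha, d_m, ncol,
                d_ones, 1,
                &beta, d_output, 1);
}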

The timing result shows that this routine is only 10~50% slower than simply reading all the elements of A once, which can be seen as the theoretical upper limit of the performance of this operation.

