Reduce matrix rows with CUDA
Windows 7, NVIDIA GeForce 425M.
I wrote a simple CUDA code which calculates the row sums of a matrix. The matrix has a uni-dimensional representation (a pointer to float).
The serial version of the code is below (it has 2 loops, as expected):
void serial_rowsum(float *m, float *output, int nrow, int ncol) {
    float sum;
    for (int i = 0; i < nrow; i++) {
        sum = 0;
        for (int j = 0; j < ncol; j++)
            sum += m[i*ncol + j];
        output[i] = sum;
    }
}
Inside the CUDA code, I call the kernel function sweeping the matrix by rows. Below, the kernel call snippet:
dim3 threadsperblock((unsigned int) nthreadsperblock); // has to be a multiple of 32
dim3 blockspergrid((unsigned int) ceil(nrow / (float) nthreadsperblock));
kernel_rowsum<<<blockspergrid, threadsperblock>>>(d_m, d_output, nrow, ncol);
and the kernel function which performs the parallel sum of the rows (it still has 1 loop):
__global__ void kernel_rowsum(float *m, float *s, int nrow, int ncol) {
    int rowidx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowidx < nrow) {
        float sum = 0;
        for (int k = 0; k < ncol; k++)
            sum += m[rowidx*ncol + k];
        s[rowidx] = sum;
    }
}
So far so good. The serial and parallel (CUDA) results are equal.
The whole point is that the CUDA version takes almost twice the time of the serial one to compute, even if I change the nthreadsperblock parameter: I tested nthreadsperblock from 32 up to 1024 (the maximum number of threads per block allowed for my card). IMO, the matrix dimension is big enough to justify parallelization: 90,000 x 1,000.
Below, I report the time elapsed for the serial and parallel versions using different values of nthreadsperblock. The time is reported in msec, averaged over 100 samples:
Matrix: nrow = 90,000 x ncol = 1,000
Serial: average time elapsed per sample in msec (100 samples): 289.18
CUDA (32 threadsperblock): average time elapsed per sample in msec (100 samples): 497.11
CUDA (1024 threadsperblock): average time elapsed per sample in msec (100 samples): 699.66
Just in case, the version with 32/1024 nthreadsperblock is the fastest/slowest one.
I understand that there is some kind of overhead when copying from host to device and the other way around, but maybe the slowness is because I am not implementing the fastest code.
Since I am far from being a CUDA expert:
Am I coding the fastest version for this task? How could I improve my code? Can I get rid of the loop in the kernel function?
Any thoughts appreciated.
EDIT 1
Although I describe a standard rowsum, I am interested in the AND/OR operation over rows which have {0;1} values, i.e. rowAND/rowOR. That said, this doesn't allow me to exploit the cuBLAS "multiply by the 1's column vector" trick, as suggested by some commentators.
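For concreteness, a rowAND could be obtained by swapping the accumulator in the kernel above. A minimal sketch, assuming the matrix holds only 0.0f/1.0f values (the kernel name is my own; rowOR would start from 0.0f and use an OR instead):

__global__ void kernel_rowand(float *m, float *s, int nrow, int ncol) {
    int rowidx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowidx < nrow) {
        float acc = 1.0f;                               // identity element of AND
        for (int k = 0; k < ncol; k++)
            acc = (acc != 0.0f && m[rowidx*ncol + k] != 0.0f) ? 1.0f : 0.0f;
        s[rowidx] = acc;
    }
}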
EDIT 2
As suggested by other users and here endorsed:
Forget about trying to write your own functions, use the Thrust library instead and the magic comes.
Since you mentioned you need a general reduction algorithm other than sum only, I will try to give 3 approaches here. The kernel approach may have the highest performance. The Thrust approach is the easiest to implement. The cuBLAS approach works only with sum but still has good performance.
Kernel approach
Here's a doc introducing how to optimize a standard parallel reduction. A standard reduction can be divided into 2 stages:
- multiple thread blocks each reduce one part of the data;
- one thread block reduces the results of stage 1 into the final 1 element.
For your multi-reduction (reducing the rows of a matrix) problem, only stage 1 is enough. The idea is to reduce 1 row per thread block. For further considerations such as multiple rows per thread block or 1 row per multiple thread blocks, you can refer to the paper provided by @Novak. This may improve the performance further, especially for matrices of bad shape.
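A minimal sketch of the one-row-per-block idea, assuming a power-of-two block size and a shared-memory tile of blockDim.x floats (the kernel name and the launch configuration in the trailing comment are only illustrative):

__global__ void kernel_rowsum_block(const float *m, float *s, int nrow, int ncol) {
    extern __shared__ float sdata[];
    int row = blockIdx.x;                               // one thread block per row
    if (row >= nrow) return;

    // Each thread accumulates a strided partial sum of its row.
    float partial = 0.0f;
    for (int k = threadIdx.x; k < ncol; k += blockDim.x)
        partial += m[row * ncol + k];
    sdata[threadIdx.x] = partial;
    __syncthreads();

    // Standard shared-memory tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        s[row] = sdata[0];
}

// Possible launch: one block per row, e.g.
// kernel_rowsum_block<<<nrow, 256, 256 * sizeof(float)>>>(d_m, d_output, nrow, ncol);

For the rowAND/rowOR case, the two += accumulations would be replaced by the corresponding logical operation.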
Thrust approach
General multi-reduction can be done with thrust::reduce_by_key in a few minutes. You can find some discussion here: Determining the least element and its position in each matrix column with CUDA Thrust.
However thrust::reduce_by_key does not assume that each row has the same length, so you will pay a performance penalty. The post How to normalize matrix columns in CUDA with max performance? gives a profiling comparison between thrust::reduce_by_key and the cuBLAS approach on the sum of rows. It may give you a basic understanding of the performance.
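A sketch of how this could look for the row-sum case (the functor and function names are my own; the row keys are generated on the fly from a counting iterator, so no extra key storage is needed):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>

// Maps a linear element index to its row index, so that elements of the
// same row share the same key.
struct row_index {
    int ncol;
    __host__ __device__ row_index(int ncol) : ncol(ncol) {}
    __host__ __device__ int operator()(int i) const { return i / ncol; }
};

// output must be sized to nrow by the caller.
void thrust_rowsum(const thrust::device_vector<float> &m,
                   thrust::device_vector<float> &output,
                   int nrow, int ncol) {
    thrust::reduce_by_key(
        thrust::make_transform_iterator(thrust::counting_iterator<int>(0), row_index(ncol)),
        thrust::make_transform_iterator(thrust::counting_iterator<int>(nrow * ncol), row_index(ncol)),
        m.begin(),
        thrust::make_discard_iterator(),   // row keys are not needed in the output
        output.begin());
}

For the rowAND/rowOR case, the reduction operator could be swapped by using the overload of thrust::reduce_by_key that also takes a key comparison predicate and a binary reduction operator, e.g. thrust::equal_to<int>() together with thrust::logical_and<float>() or thrust::logical_or<float>().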
cuBLAS approach
The sum of the rows/columns of a matrix A can be seen as a matrix-vector multiplication where the elements of the vector are all ones. It can be represented by the following MATLAB code:
y = A * ones(size(A,2),1);
where y is the sum of the rows of A.
The cuBLAS library provides a high performance matrix-vector multiplication function cublas<t>gemv() for this operation.
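A minimal sketch for the row-major layout used in the question (the helper name and the explicit ones-vector allocation are my own; since cuBLAS stores matrices in column-major order, the same buffer is read as a ncol x nrow matrix and transposed by gemv; error checking is omitted):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Computes output[i] = sum_j m[i*ncol + j] via y = A * ones using cublasSgemv.
void cublas_rowsum(cublasHandle_t handle, const float *d_m, float *d_output,
                   int nrow, int ncol) {
    // Vector of ones on the device.
    std::vector<float> h_ones(ncol, 1.0f);
    float *d_ones;
    cudaMalloc(&d_ones, ncol * sizeof(float));
    cudaMemcpy(d_ones, h_ones.data(), ncol * sizeof(float), cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    // The row-major nrow x ncol matrix is passed as a column-major ncol x nrow
    // matrix (lda = ncol) and transposed, so gemv yields the row sums.
    cublasSgemv(handle, CUBLAS_OP_T, ncol, nrow, &alpha,
                d_m, ncol, d_ones, 1, &beta, d_output, 1);

    cudaFree(d_ones);
}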
The timing results show that this routine is only 10~50% slower than simply reading all the elements of A once, which can be seen as the theoretical upper limit of the performance of this operation.