C++ SSE and aligned array of ints and vector of ints
Thanks to all of you, I have used SSE to speed up the computation of one function of my scientific app. The function is written in C++ and uses SSE instructions to compare huge vectors of ints.
The final version of the optimized SSE function is:
int getbestdiffssse(int nodeid, const vector<int> &goalnodeidtemp) {
    int positionnodeid = 2 * nodeid * nof;
    int mynewindex = 2 * nof;
    int result[4] __attribute__((aligned(16))) = {0};
    __m128i vresult = _mm_set1_epi32(0);
    __m128i v1, v2, vmax;

    for (int k = 0; k < mynewindex; k += 4) {
        // unaligned loads of four ints from each input
        v1 = _mm_loadu_si128((__m128i *) &distances[positionnodeid + k]);
        v2 = _mm_loadu_si128((__m128i *) &goalnodeidtemp[k]);
        // xor followed by subtract with a -1 mask negates the masked lanes
        v1 = _mm_xor_si128(v1, vke);
        v2 = _mm_xor_si128(v2, vko);
        v1 = _mm_sub_epi32(v1, vke);
        v2 = _mm_sub_epi32(v2, vko);
        vmax = _mm_add_epi32(v1, v2);
        vresult = _mm_max_epi32(vresult, vmax);
    }

    // horizontal max over the four lanes
    _mm_store_si128((__m128i *) result, vresult);
    return max(max(max(result[0], result[1]), result[2]), result[3]);
}
where
const __m128i vke = _mm_set_epi32(0, -1, 0, -1);
const __m128i vko = _mm_set_epi32(-1, 0, -1, 0);
and
int *distances;
distances = new int[size];
where size is huge (18M x 64).
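For readers following the intrinsics: with the vke and vko masks shown above, the xor/subtract pair appears to negate the even lanes of v1 and the odd lanes of v2, so each loop step takes the maximum of goal - distance at even indices and distance - goal at odd indices. A minimal scalar sketch under that reading follows. The name bestDiffScalar and its parameters are hypothetical, and dist is assumed to point at distances + positionnodeid.

```cpp
#include <algorithm>  // std::max
#include <cassert>

// Hypothetical scalar reference for the SSE loop (not the author's code).
// dist is assumed to point at &distances[positionnodeid], goal at &goalnodeidtemp[0],
// and n corresponds to mynewindex (a multiple of 4 in the SSE version).
inline int bestDiffScalar(const int *dist, const int *goal, int n) {
    assert(n >= 0);
    int best = 0;  // the SSE accumulator also starts at zero
    for (int i = 0; i < n; ++i) {
        // even index: dist is negated; odd index: goal is negated
        int diff = (i % 2 == 0) ? goal[i] - dist[i] : dist[i] - goal[i];
        best = std::max(best, diff);
    }
    return best;
}
```

A reference like this can be handy for checking that an aligned-load rewrite still returns the same value as the original loop.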
My naive question is: do you believe I would get better speed if a) the array distances is aligned, b) the vector goalnodeidtemp is aligned, or both? And c) how do I do that?
I've seen some posts about memalign or align_malloc, but I have not understood how to use them for a dynamic array or a vector. Or, since we are talking about ints, is alignment not an issue? Keep in mind that I am using Ubuntu 12.04 and gcc, so a solution for the Visual Studio compiler is not an option.
Added questions: first of all, is the following code enough to align the dynamic array (keep in mind that the definition and the initialization have to be kept separate)?
int *distances __attribute__((aligned(16)));
distances = new int[size];
Second, in order to align the vector goalnodeidtemp, do I need to write the entire code for a custom vector allocator? Is there a simpler alternative?
I need your help. Thanks in advance.
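There are several things that can improve performance a bit:
- Take __m128i v1, v2, vmax; out of the loop, although the compiler probably does this already.
- Make sure distances is aligned.
- Instead of using std::vector, align the data yourself and pass a pointer. Then you can use _mm_load_si128 instead of _mm_loadu_si128.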
If distances and goalnodeidtemp are both aligned, you can use raw pointers. For example:
__m128i *v1 = (__m128i *) &distances[positionnodeid + k];
__m128i *v2 = (__m128i *) &goalnodeidtemp[k];
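As a rough illustration of the aligned-load pattern, here is a minimal sketch that stays within SSE2. It sums two buffers instead of reproducing the original max computation, and the names isAligned16 and sumAligned are hypothetical. It assumes both pointers sit on 16-byte boundaries and that n is a multiple of 4.

```cpp
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_add_epi32, _mm_store_si128
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool isAligned16(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical example: sums all ints of two aligned buffers.
// Assumes a and b are 16-byte aligned and n is a multiple of 4.
inline int sumAligned(const int *a, const int *b, int n) {
    assert(isAligned16(a) && isAligned16(b) && n % 4 == 0);
    __m128i acc = _mm_setzero_si128();
    for (int k = 0; k < n; k += 4) {
        // aligned loads: these fault if the address is not 16-byte aligned
        __m128i v1 = _mm_load_si128(reinterpret_cast<const __m128i *>(a + k));
        __m128i v2 = _mm_load_si128(reinterpret_cast<const __m128i *>(b + k));
        acc = _mm_add_epi32(acc, _mm_add_epi32(v1, v2));
    }
    int lanes[4] __attribute__((aligned(16)));  // GCC/Clang attribute, as in the question
    _mm_store_si128(reinterpret_cast<__m128i *>(lanes), acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

The assert is only a debugging aid. In the question's loop the same check would apply to &distances[positionnodeid + k], not just to the start of the array.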
For any further optimizations, I would need to see the assembly code.
Do you believe I would get better speed if a) the array distances is aligned, b) the vector goalnodeidtemp is aligned, or both?
Yes, you will get a small performance boost. Nothing spectacular, but if every cycle counts, it may be noticeable.
How do I do that?
To have goalnodeidtemp aligned, you have to use a special allocator for std::vector (see the example here for how to do it).
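For illustration, a custom allocator does not have to be long. The following is a minimal sketch, assuming a C++11 (or later) standard library and a POSIX system that provides posix_memalign. The names Aligned16Allocator and AlignedIntVector are hypothetical, not taken from the linked example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>        // std::bad_alloc
#include <stdlib.h>   // posix_memalign, free (POSIX)
#include <vector>

// Hypothetical minimal allocator: every block it returns is 16-byte aligned.
template <typename T>
struct Aligned16Allocator {
    typedef T value_type;

    Aligned16Allocator() {}
    template <typename U>
    Aligned16Allocator(const Aligned16Allocator<U> &) {}

    T *allocate(std::size_t n) {
        void *p = NULL;
        if (posix_memalign(&p, 16, n * sizeof(T)) != 0)
            throw std::bad_alloc();
        assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
        return static_cast<T *>(p);
    }

    // posix_memalign memory is released with free()
    void deallocate(T *p, std::size_t) { free(p); }
};

// The allocator is stateless, so all instances compare equal.
template <typename T, typename U>
bool operator==(const Aligned16Allocator<T> &, const Aligned16Allocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const Aligned16Allocator<T> &, const Aligned16Allocator<U> &) { return false; }

// Hypothetical alias for a vector of ints with aligned storage.
typedef std::vector<int, Aligned16Allocator<int> > AlignedIntVector;
```

Note that AlignedIntVector is a different type from vector<int>, so a function taking const vector<int> & would need its parameter type changed to accept it.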
To align distances, you have to be a bit careful. See here for how to allocate aligned memory.
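On Linux with gcc, one possible route is posix_memalign. Putting __attribute__((aligned(16))) on the pointer declaration only aligns the pointer variable itself, not the heap block that new int[size] returns. The sketch below is a minimal example under those assumptions, and the helper name allocAlignedInts is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Hypothetical helper: allocates count ints on a 16-byte boundary.
// Returns NULL on failure. Release the block with free(), not delete[].
inline int *allocAlignedInts(std::size_t count) {
    void *p = NULL;
    if (posix_memalign(&p, 16, count * sizeof(int)) != 0)
        return NULL;
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
    return static_cast<int *>(p);
}
```

Usage would look like int *distances = allocAlignedInts(size); followed later by free(distances);. An aligned base pointer is only part of the story: &distances[positionnodeid + k] stays 16-byte aligned only when positionnodeid is a multiple of 4 ints, which is presumably part of the care mentioned above.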