C++ SSE and aligned array of ints and vector of ints


Thanks to all of you, I have used SSE to speed up the computation of one function of my scientific app in C++. The function uses SSE instructions to compare huge vectors of ints.

The final version of the optimized SSE function is:

int getbestdiffssse(int nodeid, const vector<int> &goalnodeidtemp) {
    int positionnodeid = 2 * nodeid * nof;   // start of this node's block in distances
    int mynewindex = 2 * nof;                // number of ints to compare
    int result[4] __attribute__((aligned(16))) = {0};

    __m128i vresult = _mm_set1_epi32(0);     // running per-lane maximum
    __m128i v1, v2, vmax;

    for (int k = 0; k < mynewindex; k += 4) {
        // unaligned loads: 4 ints from each input
        v1 = _mm_loadu_si128((const __m128i *) &distances[positionnodeid + k]);
        v2 = _mm_loadu_si128((const __m128i *) &goalnodeidtemp[k]);
        // (x ^ mask) - mask negates the lanes where mask is -1
        v1 = _mm_xor_si128(v1, vke);
        v2 = _mm_xor_si128(v2, vko);
        v1 = _mm_sub_epi32(v1, vke);
        v2 = _mm_sub_epi32(v2, vko);
        vmax = _mm_add_epi32(v1, v2);
        vresult = _mm_max_epi32(vresult, vmax);
    }
    // horizontal maximum over the 4 lanes
    _mm_store_si128((__m128i *) result, vresult);
    return max(max(max(result[0], result[1]), result[2]), result[3]);
}

where

const __m128i vke = _mm_set_epi32(0, -1, 0, -1);
const __m128i vko = _mm_set_epi32(-1, 0, -1, 0);

and

int *distances;
distances = new int[size];

where size is huge (18M x 64).
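For reference, here is a scalar sketch of what the loop above appears to compute, which can help when checking an aligned rewrite against the original. The masks implement a conditional negation: a lane whose mask is -1 becomes (x ^ -1) - (-1) = -x, and a lane whose mask is 0 is unchanged. The name best_diff_scalar and its pointer-plus-length signature are hypothetical, and the sketch assumes the differences do not overflow int.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical scalar reference for the SSE loop.
// d points at the node's block inside distances, g at the goal vector's data,
// n is the number of ints compared (2 * nof in the question).
int best_diff_scalar(const int *d, const int *g, std::size_t n) {
    int best = 0;  // the SSE version also starts its running maximum at 0
    for (std::size_t k = 0; k < n; ++k) {
        // vke negates d in even lanes, vko negates g in odd lanes,
        // so the summed lane is g - d for even k and d - g for odd k.
        int term = (k % 2 == 0) ? (g[k] - d[k]) : (d[k] - g[k]);
        best = std::max(best, term);
    }
    return best;
}
```

Comparing this against the intrinsic version on a few random inputs is a cheap sanity check before and after changing the loads.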

My naive question is: do you believe I could get better speed if a) the array distances is aligned, b) the vector goalnodeidtemp is aligned, or both? And c) how do I do that?

I've seen some posts about memalign or align_malloc, but I have not understood how to use them for a dynamic array or a vector. Or, since we are talking about ints, is alignment not an issue? Keep in mind that I am using Ubuntu 12.04 and gcc, so a solution for the Visual Studio compiler is not an option.

Added questions: first of all, is the following code enough to align the dynamic array (keep in mind that the definition and the initialization have to be kept separate)?

int *distances __attribute__((aligned(16)));
distances = new int[size];

Second, in order to align the vector goalnodeidtemp, do I need to write the entire code for a custom vector allocator? Is there a simpler alternative?

I need your help. Thanks in advance.

There are several things you can do to improve performance a bit:

  • Take __m128i v1, v2, vmax; out of the loop, although this is probably done by the compiler anyway.
  • Make sure distances is aligned.
  • Instead of using std::vector, align the data yourself and pass a pointer. Then use _mm_load_si128 instead of _mm_loadu_si128.

If distances and goalnodeidtemp are both aligned, you can use raw pointers, like this:

__m128i *v1 = (__m128i *) &distances[positionnodeid + k];
__m128i *v2 = (__m128i *) &goalnodeidtemp[k];

For any further optimizations, I would need to see the assembly code.


Do you believe I could get better speed if both a) the array distances is aligned and b) the vector goalnodeidtemp is aligned?

Yes, you will get a small performance boost. Nothing spectacular, but if every cycle counts, it may be noticeable.

How do I do that?

To have goalnodeidtemp aligned, you have to use a special allocator for std::vector (see for example here for how to do it).
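As a rough illustration of such an allocator, here is a minimal sketch that assumes a C++11 standard library with allocator_traits support and a POSIX system providing posix_memalign. The names Aligned16Allocator, AlignedIntVector and is_aligned16 are made up for this example; on an older compiler the full C++03 allocator interface (rebind, construct, destroy and so on) would also be needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>   // posix_memalign, free
#include <vector>

// Hypothetical minimal C++11 allocator that hands out 16-byte aligned storage.
template <typename T>
struct Aligned16Allocator {
    typedef T value_type;

    Aligned16Allocator() noexcept {}
    template <typename U>
    Aligned16Allocator(const Aligned16Allocator<U> &) noexcept {}

    T *allocate(std::size_t n) {
        void *p = nullptr;
        // posix_memalign returns 0 on success and a nonzero error code on failure.
        if (posix_memalign(&p, 16, n * sizeof(T)) != 0)
            throw std::bad_alloc();
        return static_cast<T *>(p);
    }

    void deallocate(T *p, std::size_t) noexcept { free(p); }
};

// The allocator is stateless, so all instances compare equal.
template <typename T, typename U>
bool operator==(const Aligned16Allocator<T> &, const Aligned16Allocator<U> &) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const Aligned16Allocator<T> &, const Aligned16Allocator<U> &) noexcept { return false; }

// A vector of ints whose data() pointer is 16-byte aligned.
typedef std::vector<int, Aligned16Allocator<int> > AlignedIntVector;

// Small check helper: true if p sits on a 16-byte boundary.
bool is_aligned16(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

Note that AlignedIntVector is a different type from std::vector<int>, so the function signature would have to change to take a const AlignedIntVector & (or a raw const int * obtained from data()).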

To align distances, you have to be a bit careful. See here for how to allocate aligned memory.
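One thing to be careful about: __attribute__((aligned(16))) on the declaration int *distances aligns the pointer variable itself, not the block that new int[size] returns. On Linux with gcc, one option is posix_memalign, sketched below. The helper names alloc_aligned_ints, free_aligned_ints and is_aligned16 are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>   // posix_memalign, free

// Hypothetical helper: allocate `count` ints starting on a 16-byte boundary.
// Returns nullptr if the allocation fails.
int *alloc_aligned_ints(std::size_t count) {
    void *p = nullptr;
    if (posix_memalign(&p, 16, count * sizeof(int)) != 0)
        return nullptr;
    return static_cast<int *>(p);
}

// Memory obtained from posix_memalign is released with free, not delete[].
void free_aligned_ints(int *p) { free(p); }

// Small check helper: true if p sits on a 16-byte boundary.
bool is_aligned16(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

With this, distances = alloc_aligned_ints(size); replaces the new[] call. The aligned load _mm_load_si128 also requires every loaded address to be 16-byte aligned, so positionnodeid + k has to be a multiple of 4 ints. That holds for positionnodeid = 2 * nodeid * nof when nof is even, which is an assumption about the application rather than something stated in the question.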

