This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

vectorizing accumulate() when possible ?

From: Bernard Hugueney <bernard at bernard-hugueney dot org>
To: libstdc++ at gcc dot gnu dot org
Date: Mon, 21 Sep 2009 14:38:59 +0200
Subject: vectorizing accumulate() when possible ?

Hi,
While trying to achieve maximum speed for something as trivial as
computing sums of large quantities of numbers, I was wondering whether
it would be possible to vectorize std::accumulate when it makes sense.

Current implementation is completly generic, but I believe that more
often thant not (at least when performance matters), std::accumulate
will be called with primitive numeric data type, on contiguous memory
and with standard arithmetic + or * operators. In this configuration,
a vector of intermediate results could be used to enable the use of
vector (i.e SSE) instructions.

I came up with a small exemple program(*) trying to lift the
vectorizing power of the compiler for operator+ as a proof of
concept.People with better understanding of the vectorizing process
would surely improve it (see comments in code).
I realize that such an implementation unfortunately violates the
std::accumulate() (imvho over-)specifications. However, I believe that
the parallel mode already does violates thoses rules so maybe such
implementation could find its way into parallel mode (or when
-ffastmath is used) ? (Btw, I'd be very interested by informations on
trying to tackle the interactions between vectorizing and
parallelizing such loops with OpenMP even if parallel mode would only
be useful on cores *not* sharing L2 cache for such memory I/O
intensive loops.)

Do you think this kind of specializations could have a place in the
libstdc++ implementation ?

Best regards,

Bernard

(*)
// g++ -o vect_acc vect_acc.cxx -lstdc++ -std=c++0x
-ftree-vectorizer-verbose=99 -O4 -march=native

#include <stdlib.h> //atol to avoid dependance on boost::lexical_cast<>

#include <iterator>
#include <iostream>
#include <numeric>

template<int vect_size, typename In, typename T> T acc(In b, In e, T init){
? int const vect_elts(vect_size/sizeof(T));
? std::size_t n(e-b), r(n%vect_elts);
? T vect[vect_elts];

? // should be unrolled a compile-time
? for(std::size_t i(0); i!=vect_elts; ++i){ vect[i]= 0.;}
? // should also be unrolled at compile time, should also handle
alignment issues so the r values
? // should be taken from beginning and end of sequence accordingly
? for(std::size_t i(0); i!=r; ++i, ++b){ init+= *b; }

? for(; b!=e; b+= vect_elts){
? //should be vectorized
??? for(std::size_t i(0); i!= vect_elts; ++i)
????? { vect[i] += b[i]; }
? }

? // should be unrolled a compile-time
? for(std::size_t i(0); i!=vect_elts; ++i){ init+=vect[i];}

? return init;
}

int main(int argc, char* argv[]){
? typedef int data_t;
? long n(atol(argv[1])), loops(argc > 2 ? atol(argv[2]):1);
? data_t const* const d= new data_t[n];
? volatile data_t res;// volatile in case optimizer would want to be
to agressive
? if(argc > 3){
??? for(long i(0); i!= loops; ++i)
????? { res= acc<16*8>(d, d+n, 0.); }
? }else{
??? for(long i(0); i!= loops; ++i)
????? { res= std::accumulate(d, d+n, 0.); }
? }

? std::cout<< res <<std::endl;
? return 0;
}

Follow-Ups:
- Re: vectorizing accumulate() when possible ?
  - From: Christopher Jefferson

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]