typedef struct A { int a, b; } A;

void *f(A *restrict p)
{
  A *q = __builtin_malloc(1024 * sizeof(A));
  for (int i = 0; i < 1024; ++i) {
#ifdef HELP
    q[i] = p[i];
#else
    q[i].a = p[i].a;
    q[i].b = p[i].b;
#endif
  }
  return q;
}

At -O3, with HELP, we get the expected memcpy. Without it, the loop is only vectorized.
Confirmed. Loop distribution's pattern recognition only handles stride-1 accesses and single loads/stores. With my ongoing work on vectorizer refactoring it might be possible to re-use its DR (data reference) group analysis and thus work on DR groups here. Or we may want to teach this pattern to the vectorizer itself (eh...). Or we may want to un-"SRA" such patterns, generating aggregate copies.
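To make the stride remark concrete: in the field-wise loop, the `.a` stores alone touch every other int, and so do the `.b` stores, so neither access is contiguous on its own even though the two together cover each struct completely. The following is only a sketch to illustrate that the two loop forms write the same bytes for this struct (which has no padding on common ABIs). The function names `copy_fields`, `copy_aggregate` and `copies_agree` are made up for illustration and are not part of GCC or the testcase.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct A { int a, b; } A;

/* Field-wise copy, as in the testcase without HELP: two strided
   loads and two strided stores per iteration. */
static void copy_fields(A *restrict dst, const A *restrict src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i].a = src[i].a;
        dst[i].b = src[i].b;
    }
}

/* Aggregate copy, as in the testcase with HELP: one contiguous
   load and store of sizeof(A) bytes per iteration. */
static void copy_aggregate(A *restrict dst, const A *restrict src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* Hypothetical helper: returns 1 if both loops produce byte-identical
   output for the first n elements (n <= 64). */
static int copies_agree(size_t n)
{
    A src[64], x[64], y[64];
    assert(n <= 64);
    for (size_t i = 0; i < n; ++i) {
        src[i].a = (int)i;
        src[i].b = -(int)i;
    }
    memset(x, 0, sizeof x);
    memset(y, 0, sizeof y);
    copy_fields(x, src, n);
    copy_aggregate(y, src, n);
    return memcmp(x, y, n * sizeof(A)) == 0;
}
```

Since the results are identical, recognizing the pair of field accesses as one group covering the whole element is what would let the field-wise loop be treated like the aggregate one.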
(In reply to Richard Biener from comment #1)
> Or we may want to un-"SRA" such patterns, generating aggregate copies.

I notice that store-merging does not merge these stores, I didn't check why. SLP can do it for long but not for int (no vector of 2 ints) with -fdisable-tree-vect.

(anyway that's too late for ldist, the DR / vectorizer approach sounds better, just mentioning this as another possible missed optimization)

The testcase is a simplified version of boost::container::flat_map<int,int>. The most important missing transformation is memmove, but it was easier to report memcpy and I kind of expect that they may all be fixed together.
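For reference, the memmove case mentioned above might look roughly like the following when opening a gap for an insertion in a sorted array of pairs. This is a hypothetical sketch, not the actual boost::container code; `shift_up_fields`, `shift_up_memmove` and `shifts_agree` are illustrative names, and the caller is assumed to have room for n + 1 elements.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct A { int a, b; } A;

/* Shift elements [pos, n) up by one slot, field by field.
   Iterating from the end keeps the overlapping copy correct. */
static void shift_up_fields(A *v, size_t n, size_t pos)
{
    assert(pos <= n);
    for (size_t i = n; i > pos; --i) {
        v[i].a = v[i - 1].a;
        v[i].b = v[i - 1].b;
    }
}

/* The same shift expressed as the single memmove one would hope
   the compiler could recognize. */
static void shift_up_memmove(A *v, size_t n, size_t pos)
{
    assert(pos <= n);
    memmove(v + pos + 1, v + pos, (n - pos) * sizeof(A));
}

/* Hypothetical helper: returns 1 if both shifts leave the whole
   buffer byte-identical (n <= 32, pos <= n). */
static int shifts_agree(size_t n, size_t pos)
{
    A x[33], y[33];
    assert(n <= 32 && pos <= n);
    for (size_t i = 0; i < 33; ++i) {
        x[i].a = (int)i;
        x[i].b = 100 + (int)i;
        y[i] = x[i];
    }
    shift_up_fields(x, n, pos);
    shift_up_memmove(y, n, pos);
    return memcmp(x, y, sizeof x) == 0;
}
```

Source and destination overlap here, so memcpy would not be a valid replacement; only memmove is.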
(In reply to Marc Glisse from comment #2)
> (In reply to Richard Biener from comment #1)
> > Or we may want to un-"SRA" such patterns, generating aggregate copies.
>
> I notice that store-merging does not merge these stores, I didn't check why.
> SLP can do it for long but not for int (no vector of 2 ints) with
> -fdisable-tree-vect.
>
> (anyway that's too late for ldist, the DR / vectorizer approach sounds
> better, just mentioning this as another possible missed optimization)

Yes, merging and SRA are conflicting with each other here, and it's difficult to get a model deciding when to do what. With DR improvement, we can identify and connect two or more builtin partitions in ldist.