Both LLVM and icc recognize initialization and copy loop and synthesize calls to memcpy and memset. memmove call can also be synthesized when src/target may overlap. Option needs to provided to disable such optimization in signal handlers. I consider this as optimization for benchmarking ;) For instance, the prime number finder program sieve.c is one of the benchmarks in LLVM. Both LLVM and icc beats gcc in this one because of the missing optimization. #ifndef T #define T int #endif T arr[1000]; void foo(int n) { int i; for (i = 0; i < n; i++) { arr[i] = 0; } } void foo2(int n, T* p) { int i; for (i = 0; i < n; i++) { *p++ = 0; } } #ifndef T #define T int #endif T arr[1000]; T arr2[1000]; void foo(int n) { int i; for (i = 0; i < n; i++) { arr[i] = arr2[i]; } } // sieve.c /* -*- mode: c -*- * $Id: sieve.c 36673 2007-05-03 16:55:46Z laurov $ * http://www.bagley.org/~doug/shootout/ */ #include <stdio.h> #include <stdlib.h> int main(int argc, char *argv[]) { #ifdef SMALL_PROBLEM_SIZE #define LENGTH 17000 #else #define LENGTH 170000 #endif int NUM = ((argc == 2) ? atoi(argv[1]) : LENGTH); static char flags[8192 + 1]; long i, k; int count = 0; while (NUM--) { count = 0; for (i=2; i <= 8192; i++) { flags[i] = 1; } for (i=2; i <= 8192; i++) { if (flags[i]) { /* remove all multiples of prime: i */ for (k=i+i; k <= 8192; k+=i) { flags[k] = 0; } count++; } } } printf("Count: %d\n", count); return(0); }
I thought there is already one part of graphite.
I should mention I made one patch before based on the vectorizer code which did detection of at least memset; it was while I was an intern at Apple. I posted it and there was some review. And a few years back there was a paper at the GCC summit about it and expanding it to memcpy. I don't know what happened to that code though.
(In reply to comment #2) > I should mention I made one patch before based on the vectorizer code which did > detection of at least memset; it was while I was an intern at Apple. I posted > it and there was some review. And a few years back there was a paper at the > GCC summit about it and expanding it to memcpy. I don't know what happened to > that code though. Some simple analysis using scev to identify loads and stores with linear address should be good enough. LLVM's version also tries to merge smaller memsets into a larger one.
The memset part can be handled by -ftree-loop-distribution (not enabled at -O3)
(In reply to comment #4) > LLVM's version also tries to merge smaller memsets into a larger one. That is filed as PR 49872. And the original issue is related to PR 30442.
Note that -ftree-loop-distribution for some weird reason only handles clearing, but not initialization by arbitrary constants that are convertible to memset (i.e. all bytes have the same constant stored, such as 0x0a0a0a0a etc.).
clearing recognition is enabled by default at -O2+ with -ftree-loop-distribute-patterns. I agree more general memset handling would be nice, as well as memcpy and memmove recognition.
Author: rguenth Date: Tue Jun 5 09:24:43 2012 New Revision: 188232 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188232 Log: 2012-06-05 Richard Guenther <rguenther@suse.de> PR tree-optimization/53081 * tree-loop-distribution.c (generate_memset_builtin): Handle all kinds of byte-sized stores. (classify_partition): Likewise. (tree_loop_distribution): Adjust seed statements used for !flag_tree_loop_distribution. * gcc.dg/tree-ssa/ldist-19.c: New testcase. * gcc.c-torture/execute/builtins/builtins.exp: Always pass -fno-tree-loop-distribute-patterns. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-19.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.c-torture/execute/builtins/builtins.exp trunk/gcc/tree-loop-distribution.c
Author: rguenth Date: Wed Jun 6 09:45:27 2012 New Revision: 188261 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188261 Log: 2012-06-06 Richard Guenther <rguenther@suse.de> PR tree-optimization/53081 * tree-data-ref.h (adjacent_store_dr_p): Rename to ... (adjacent_dr_p): ... this and make it work for reads, too. * tree-loop-distribution.c (enum partition_kind): Add PKIND_MEMCPY. (struct partition_s): Change main_stmt to main_dr, add secondary_dr member. (build_size_arg_loc): Change to date data-reference and not gimplify here. (build_addr_arg_loc): New function split out from ... (generate_memset_builtin): ... here. Use it and simplify. (generate_memcpy_builtin): New function. (generate_code_for_partition): Adjust. (classify_partition): Streamline pattern detection. Detect memcpy. (ldist_gen): Adjust. (tree_loop_distribution): Adjust seed statements for memcpy recognition. * gcc.dg/tree-ssa/ldist-20.c: New testcase. * gcc.dg/tree-ssa/loop-19.c: Add -fno-tree-loop-distribute-patterns. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-20.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-19.c trunk/gcc/tree-data-ref.h trunk/gcc/tree-loop-distribution.c
Fixed.
It doesn't quite seem to work for this simple Fortran testcase yet SUBROUTINE S(a,N) INTEGER :: N,a(N) a=1 END SUBROUTINE S (works for memset to 0)
On Wed, 6 Jun 2012, Joost.VandeVondele at mat dot ethz.ch wrote: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53081 > > --- Comment #12 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2012-06-06 11:32:08 UTC --- > It doesn't quite seem to work for this simple Fortran testcase yet > > SUBROUTINE S(a,N) > INTEGER :: N,a(N) > a=1 > END SUBROUTINE S > > (works for memset to 0) Well, you can't transform this to a memset ;)
(In reply to comment #13) > Well, you can't transform this to a memset ;) blush things work as advertised for correct testcases... thanks!
Fix causes http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53588
Also cause of http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726