Bug 33103 - Redundant multiplications for memset
Redundant multiplications for memset
Status: NEW
Product: gcc
Classification: Unclassified
Component: middle-end
4.3.0
: P3 enhancement
: ---
Assigned To: Not yet assigned to anyone
: missed-optimization
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2007-08-18 05:54 UTC by Guillaume Melquiond
Modified: 2012-06-06 10:11 UTC (History)
2 users (show)

See Also:
Host:
Target: x86_64-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-06-06 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Guillaume Melquiond 2007-08-18 05:54:59 UTC
This report was prompted by a mail on the lkml which was suggesting to hand-craft memset: http://lkml.org/lkml/2007/8/17/309 . So I wondered if the code generated for __builtin_memset was any good, and could be used instead of hand-crafted code. I tested with (Debian) GCC 3.4.6, 4.1.3, 4.2.1, and also with a snapshot of GCC 4.3. All the results are similar, so I will only show them for GCC 4.2 on x86-64. Compilation was done with -O3.

First, the __builtin_memset code:

  void fill1(char *s, int a)
  {
    __builtin_memset(s, a, 15);
  }

GCC generates:

   0:   40 0f b6 c6             movzbl %sil,%eax
   4:   48 ba 01 01 01 01 01    mov    $0x101010101010101,%rdx
   b:   01 01 01 
   e:   40 0f b6 ce             movzbl %sil,%ecx
  12:   48 0f af c2             imul   %rdx,%rax
  16:   40 88 77 0e             mov    %sil,0xe(%rdi)
  1a:   48 89 07                mov    %rax,(%rdi)
  1d:   40 0f b6 c6             movzbl %sil,%eax
  21:   69 c0 01 01 01 01       imul   $0x1010101,%eax,%eax
  27:   89 47 08                mov    %eax,0x8(%rdi)
  2a:   89 c8                   mov    %ecx,%eax
  2c:   c1 e0 08                shl    $0x8,%eax
  2f:   01 c8                   add    %ecx,%eax
  31:   66 89 47 0c             mov    %ax,0xc(%rdi)
  35:   c3                      retq   

Notice that GCC first computes %sil * (01)^8 and puts it into %rax, then it computes %sil * (01)^4 and puts it into %eax (where it already was, due to the previous multiplication), then it computes %sil * (01)^2 and puts it into %ax (where it already was, again).

Second, some code where multiplication results are reused:

  void fill2(char *s, int a)
  {
    unsigned long long int v = (unsigned char)a * 0x0101010101010101ull;
    *(unsigned long long int *)s = v;
    *(unsigned *)(s + 8) = v;
    *(unsigned short *)(s + 12) = v;
    *(s + 15) = v;
  }

GCC generates:

   0:   40 0f b6 f6             movzbl %sil,%esi
   4:   48 b8 01 01 01 01 01    mov    $0x101010101010101,%rax
   b:   01 01 01 
   e:   48 0f af f0             imul   %rax,%rsi
  12:   48 89 37                mov    %rsi,(%rdi)
  15:   89 77 08                mov    %esi,0x8(%rdi)
  18:   66 89 77 0c             mov    %si,0xc(%rdi)
  1c:   40 88 77 0f             mov    %sil,0xf(%rdi)
  20:   c3                      retq   

The function is 21 bytes smaller (-40%), it does not require two additional registers (c and d), and it will not be slower.

The same issue arises on x86_32. The hand-written code (with 32bit integers this time) is 14 bytes smaller for memset(,,15).
Comment 1 Andrew Pinski 2007-08-18 09:02:35 UTC
This is a middle-end issue.  What is happening is that it is splitting it up early on with the size.  Now what the trunk is doing is better than what was done for 4.0.x.  
Comment 2 Richard Biener 2012-06-06 10:11:41 UTC
Re-confirmed.