Bug 33103

Summary:          Redundant multiplications for memset
Product:          gcc
Component:        middle-end
Status:           RESOLVED DUPLICATE
Severity:         enhancement
Priority:         P3
Keywords:         missed-optimization
Reporter:         Guillaume Melquiond <guillaume.melquiond>
Assignee:         Not yet assigned to anyone <unassigned>
CC:               gcc-bugs, rguenth
Version:          4.3.0
Target Milestone: 12.0
Target:           x86_64-linux-gnu
Known to work:    12.0
Known to fail:    5.1.0, 10.1.0
Last reconfirmed: 2012-06-06 00:00:00

Description Guillaume Melquiond 2007-08-18 05:54:59 UTC
This report was prompted by a mail on the LKML suggesting that memset be hand-crafted: http://lkml.org/lkml/2007/8/17/309 . So I wondered whether the code generated for __builtin_memset was any good and could be used instead of hand-crafted code. I tested with (Debian) GCC 3.4.6, 4.1.3, 4.2.1, and with a snapshot of GCC 4.3. All the results are similar, so I will only show them for GCC 4.2 on x86-64. Compilation was done with -O3.

First, the __builtin_memset code:

  void fill1(char *s, int a)
  {
    __builtin_memset(s, a, 15);
  }

GCC generates:

   0:   40 0f b6 c6             movzbl %sil,%eax
   4:   48 ba 01 01 01 01 01    mov    $0x101010101010101,%rdx
   b:   01 01 01 
   e:   40 0f b6 ce             movzbl %sil,%ecx
  12:   48 0f af c2             imul   %rdx,%rax
  16:   40 88 77 0e             mov    %sil,0xe(%rdi)
  1a:   48 89 07                mov    %rax,(%rdi)
  1d:   40 0f b6 c6             movzbl %sil,%eax
  21:   69 c0 01 01 01 01       imul   $0x1010101,%eax,%eax
  27:   89 47 08                mov    %eax,0x8(%rdi)
  2a:   89 c8                   mov    %ecx,%eax
  2c:   c1 e0 08                shl    $0x8,%eax
  2f:   01 c8                   add    %ecx,%eax
  31:   66 89 47 0c             mov    %ax,0xc(%rdi)
  35:   c3                      retq   

Notice that GCC first computes %sil * 0x0101010101010101 and puts it into %rax, then it computes %sil * 0x01010101 and puts it into %eax (where that same value already was, as the low half of the previous multiplication), then it computes %sil * 0x0101 and puts it into %ax (where it already was, again).

Second, some code where multiplication results are reused:

  void fill2(char *s, int a)
  {
    unsigned long long int v = (unsigned char)a * 0x0101010101010101ull;
    *(unsigned long long int *)s = v;
    *(unsigned *)(s + 8) = v;
    *(unsigned short *)(s + 12) = v;
    *(s + 15) = v;
  }

GCC generates:

   0:   40 0f b6 f6             movzbl %sil,%esi
   4:   48 b8 01 01 01 01 01    mov    $0x101010101010101,%rax
   b:   01 01 01 
   e:   48 0f af f0             imul   %rax,%rsi
  12:   48 89 37                mov    %rsi,(%rdi)
  15:   89 77 08                mov    %esi,0x8(%rdi)
  18:   66 89 77 0c             mov    %si,0xc(%rdi)
  1c:   40 88 77 0f             mov    %sil,0xf(%rdi)
  20:   c3                      retq   

The function is 21 bytes smaller (-40%), it does not require the two additional registers (%rcx and %rdx), and it will not be slower.

The same issue arises on x86_32. The hand-written code (with 32-bit integers this time) is 14 bytes smaller for memset(,,15).
Comment 1 Andrew Pinski 2007-08-18 09:02:35 UTC
This is a middle-end issue.  What is happening is that the memset is split up early on, based on its size.  What the trunk does now is already better than what 4.0.x did.
Comment 2 Richard Biener 2012-06-06 10:11:41 UTC
Re-confirmed.
Comment 3 Andrew Pinski 2021-08-22 00:06:55 UTC
Fixed on the trunk by r12-285.

This is what we get now:
fill1(char*, int):
        movabsq $72340172838076673, %rax
        movzbl  %sil, %esi
        imulq   %rax, %rsi
        movq    %rsi, (%rdi)
        movq    %rsi, 7(%rdi)
        ret
Comment 4 H.J. Lu 2021-08-22 01:26:59 UTC
Dup.

*** This bug has been marked as a duplicate of bug 90773 ***