Bug 87104 - missed &, == optimization makes Emacs ~0.4% slower on x86-64
Summary: missed &, == optimization makes Emacs ~0.4% slower on x86-64
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.1.1
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2018-08-25 21:10 UTC by Paul Eggert
Modified: 2024-08-31 08:14 UTC
3 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-08-27 00:00:00


Attachments
In this file, f and g are equivalent, but f is not optimized as well as g (91 bytes, text/plain)
2018-08-25 21:10 UTC, Paul Eggert
Details
output when compiling fg.c with "gcc -O2 -S" on x86-64 (275 bytes, text/plain)
2018-08-25 21:11 UTC, Paul Eggert
Details
output when compiling fg.c with "gcc -O2 -S" on x86 (276 bytes, text/plain)
2018-08-25 23:52 UTC, Paul Eggert
Details
patch to illustrate the issue (1.09 KB, patch)
2018-08-27 07:18 UTC, pipcet
Details | Diff
canonicalize to (A&B) == C (1.01 KB, patch)
2018-08-27 09:46 UTC, pipcet
Details | Diff
WIP patch (1.94 KB, patch)
2018-08-28 16:01 UTC, pipcet
Details | Diff
WIP patch (2.93 KB, patch)
2018-09-18 13:59 UTC, pipcet
Details | Diff

Description Paul Eggert 2018-08-25 21:10:21 UTC
Created attachment 44595 [details]
In this file, f and g are equivalent, but f is not optimized as well as g

GNU Emacs uses the low-order three bits of its words to represent object type tags, and therefore contains a lot of code that (at the low level) looks like this:

   if ((x & 7) == 6)

when checking whether an object has type tag 6, say. On x86-64 gcc -O2 generates insns like this:

   movq %rax, %rbx
   andl $7, %ebx
   cmpq $6, %rbx

whereas the following would be smaller and faster:

   leal -6(%rax), %ebx
   testl $7, %ebx

Doing this to a test version of Emacs made it 0.4% faster overall, on my standard benchmark of Emacs byte-compiling all its Lisp files.

I'm attaching a source file fg.c that illustrates the problem. It defines a function f that uses the Emacs idiom and is poorly optimized compared to the equivalent function g. I also plan to attach the assembly language output fg.s, which shows that f has one more instruction than g.

I plan to tune Emacs so that it cajoles GCC into the better insns; however, GCC should optimize this code better, for the sake of programs other than Emacs.
Comment 1 Paul Eggert 2018-08-25 21:11:12 UTC
Created attachment 44596 [details]
output when compiling fg.c with "gcc -O2 -S" on x86-64
Comment 2 Andrew Pinski 2018-08-25 21:12:18 UTC
This seems like a target issue ...
Comment 3 Paul Eggert 2018-08-25 23:50:45 UTC
(In reply to Andrew Pinski from comment #2)
> This seems like a target issue ...

Although the code generated is target-dependent, the performance problem is not limited to x86-64. x86 has the same problem, and I suspect other platforms do too. I'll attach the x86 assembler code, where 'f' has one more insn than 'g' does even though the two functions have identical behavior.
Comment 4 Paul Eggert 2018-08-25 23:52:07 UTC
Created attachment 44597 [details]
output when compiling fg.c with "gcc -O2 -S" on x86
Comment 5 Andrew Pinski 2018-08-26 00:14:22 UTC
(In reply to eggert from comment #3)
> (In reply to Andrew Pinski from comment #2)
> > This seems like a target issue ...
> 
> Although the code generated is target-dependent, the performance problem is
> not limited to x86-64. x86 has the same problem, and I suspect other
> platforms do too. I'll attach the x86 assembler code, where 'f' has one more
> insn than 'g' does even though the two functions have identical behavior.

x86 and x86_64 use the same back-end, so yes, it does seem target-specific.

AARCH64 produces:
f:
        and     x2, x0, 7
        and     w1, w0, 1
        cmp     x2, 6
        csel    w0, w0, w1, ne
        ret
        .size   f, .-f
        .align  2
        .p2align 3,,7
        .global g
        .type   g, %function
g:
        sub     x2, x0, #6
        and     w1, w0, 1
        tst     x2, 7
        csel    w0, w0, w1, ne
        ret
        .size   g, .-g

Both are similar in terms of performance.
Comment 6 Richard Biener 2018-08-27 07:15:27 UTC
So on GIMPLE the following are not canonicalized:

  <bb 2> [local count: 1073741825]:
  _1 = i_4(D) & 7;
  _8 = (int) i_4(D);
  if (_1 == 6)
    goto <bb 3>; [20.97%]
  else
    goto <bb 4>; [79.03%]

vs.

  <bb 2> [local count: 1073741825]:
  _1 = i_5(D) + 18446744073709551610;
  _2 = _1 & 7;
  _9 = (int) i_5(D);
  if (_2 == 0)
    goto <bb 3>; [34.00%]
  else
    goto <bb 4>; [66.00%]

where I'd call the former better.  Thus for some unknown constraint
on @1, @2 and @3

(simplify
 (eq (convert? (bit_and (plus @0 INTEGER_CST@3) @2)) @1)
 (eq (convert (bit_and @0 @2)) { ... }))
Comment 7 pipcet 2018-08-27 07:17:16 UTC
(In reply to Andrew Pinski from comment #5)
> x86 and x86_64 use the same back-end, so yes, it does seem target-specific.

I think it's not a target issue; we really want to be generating the same code for these equivalent functions, and the code on AArch64 is still different.

I'm also not sure what we do for ((X ^ A) & B) == C.

I'm attaching a patch that results, at least, in g being compiled to "jmp f", so the code is now recognized as equivalent. It seems to produce the right code for Paul's test case, too.
Comment 8 pipcet 2018-08-27 07:18:29 UTC
Created attachment 44605 [details]
patch to illustrate the issue
Comment 9 pipcet 2018-08-27 09:41:26 UTC
(In reply to Richard Biener from comment #6)
> So on GIMPLE the following are not canonicalized:
> 
>   <bb 2> [local count: 1073741825]:
>   _1 = i_4(D) & 7;
>   _8 = (int) i_4(D);
>   if (_1 == 6)
>     goto <bb 3>; [20.97%]
>   else
>     goto <bb 4>; [79.03%]
> 
> vs.
> 
>   <bb 2> [local count: 1073741825]:
>   _1 = i_5(D) + 18446744073709551610;
>   _2 = _1 & 7;
>   _9 = (int) i_5(D);
>   if (_2 == 0)
>     goto <bb 3>; [34.00%]
>   else
>     goto <bb 4>; [66.00%]
> 
> where I'd call the former better.  Thus for some unknown constraint
> on @1, @2 and @3
> 
> (simplify
>  (eq (convert? (bit_and (plus @0 INTEGER_CST@3) @2)) @1)
>  (eq (convert (bit_and @0 @2)) { ... }))


I suggest that the constraint be that @2 is of the form
000...000111...111000...000 and @3 is of the form
???...??????...???000...000 and @1 is zero.

So your plan is to canonicalize to (X & MASK) == VALUE first, then do something target-dependent to emit ((X - VALUE) & MASK) == 0 instead? How would the target realize that? I tried adding a peephole2 rule, but that's apparently too late and doesn't match the insn sequences in Paul's test.
Comment 10 pipcet 2018-08-27 09:46:01 UTC
Created attachment 44606 [details]
canonicalize to (A&B) == C

This canonicalizes to the worse code on x86_64.
Comment 11 pipcet 2018-08-28 16:01:25 UTC
Created attachment 44617 [details]
WIP patch

I'm having partial success with this patch, which does two things:

1. Canonicalize to the easier-to-read (X & A) == B form
2. Emit optimized lea-test sequence for (X & A) == B

(1) is straightforward; we need a new routine to test that an integer
masks a contiguous range of bit positions.

(2) is straightforward except for there being three cases: 32-bit,
64-bit, and mixed. Emacs uses the mixed case, where the lower 3 bits
of a 64-bit value are tested.

The strange thing is that code like

int h17(long int i)
{
  if ((i & 12) == 12)
    return 1;
  return 0;
}

does not work.  I see this intermediate RTL:

 (insn 7 6 8 2 (set (reg:CCZ 17 flags)
         (compare:CCZ (and:DI (not:DI (reg/v:DI 86 [ i ]))
                 (const_int 12 [0xc]))
             (const_int 0 [0]))) "h17.c":4 15 {*cmpdi_1}
      (expr_list:REG_DEAD (reg:DI 88)

Surely we should be dealing with a canonical form instead?  Who's
generating this non-canonical expression, and why?

Is it legal to do something like
  {
    operands[2] = negate_rtx (operands[2]);
  }
in a define_split? That would avoid the need to generate complicated
RTL exprs from C.

Anyway, this all needs more work and much more testing, but it's a start.
Comment 12 pipcet 2018-08-30 14:02:28 UTC
(In reply to pipcet from comment #11)
>  (insn 7 6 8 2 (set (reg:CCZ 17 flags)
>          (compare:CCZ (and:DI (not:DI (reg/v:DI 86 [ i ]))
>                  (const_int 12 [0xc]))
>              (const_int 0 [0]))) "h17.c":4 15 {*cmpdi_1}
>       (expr_list:REG_DEAD (reg:DI 88)
> 
> Surely we should be dealing with a canonical form instead?  Who's
> generating this non-canonical expression, and why?

simplify-rtx.c, it turns out, because it "canonicalizes" (x & y) == y to (~x & y) == 0. I think that's strange, but we can work around it.

I'm testing these three approaches:
1. canonicalize to (x-y) & z = 0
2. don't canonicalize, but add a define_insn_and_split
3. original gcc

head-to-head. I'm compiling trunk Emacs with Paul's patch reverted, then running "perf stat ./src/temacs --batch" in a loop and producing a histogram of the cycles needed. It seems (1) and (2) beat (3) quite significantly (1.1%), while (1) very narrowly beats (2) (< 0.1%). Both figures are medians, but the curves look simply shifted a little, so I'm prepared to say it's a consistent effect.

The code looks good, and the slight difference between (1) and (2) makes sense, because (2) generates:

	leal	-5(%rdi), %esi
	movq	%rdi, %rax
	andl	$7, %esi
	je	.L129
	ret
	.p2align 4,,10
	.p2align 3
.L129:
	movslq	suspicious_object_index(%rip), %rsi
	movl	$0, %ecx

while (1) realizes %rsi is zero at this point and skips the movl. (Looking at this code, though, I do not understand why movl $0 is used rather than the standard xorl, so maybe that is another optimization opportunity.)

So I think the performance difference is really significant for Emacs; my plan is to test all three versions on other programs, make sure the code works for C bitfields, and then submit it for inclusion. Is that okay?
Comment 13 Tom Tromey 2018-08-31 14:18:18 UTC
(In reply to pipcet from comment #12)

> So I think the performance difference is really significant for Emacs; my
> plan is to test all three versions on other programs, make sure the code
> works for C bitfields, and then submit it for inclusion. Is that okay?

Just as a process comment, you're not too likely to get an answer to this
sort of question; instead just go ahead and send the patch and see what
the responses are.

Thanks for looking into this.
Comment 14 pipcet 2018-09-18 13:59:18 UTC
Created attachment 44719 [details]
WIP patch

Okay, I've run into a few issues:
1. temacs run time changes unpredictably based on the configuration data, because of find_string_data_in_pure.
2. My CPU macro-fuses "cmp" or "test" with a following conditional branch, but not "and" with one. So we were optimizing a three-insn, two-uop sequence into a two-insn, two-uop sequence, and I was not seeing any performance improvement.
3. The code size changes sometimes cause branches to be mispredicted much more often for no apparent reason.

I've worked around (1) and (2) by disabling find_string_data_in_pure() and making the peephole rule that turned "test" into "and" conditional on the CPU type. Now I'm seeing a consistent performance improvement (as well as fewer instructions, fewer uops, and more fused branches) for both Perl and Emacs.
Comment 15 Andrew Pinski 2023-06-25 18:57:27 UTC
Note that in the original testcase we miss a "VRP/nonzero bits" optimization, so I filed PR 110405 for that. It does not change the other transformation.