Bug 111774 - boringssl performance gap between clang and gcc for x25519 operations
Status: WAITING
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 13.2.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2023-10-11 19:33 UTC by cpu
Modified: 2024-12-08 01:08 UTC
CC List: 1 user

See Also:
Host:
Target: x86_64-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2023-10-11 00:00:00


Description cpu 2023-10-11 19:33:23 UTC
Hi folks,

I've been bisecting a performance regression for x25519 cryptographic operations with BoringSSL (https://boringssl.googlesource.com/boringssl) that causes builds with gcc (tested w/ 13.2.0) to perform significantly worse than builds with clang (tested w/ clang 11.1.0).

I've identified that the regression was introduced by this commit: https://github.com/google/boringssl/commit/d605df5b6f8462c1f3005da82d718ec067f46b70


Building the project with gcc prior to this commit (Linux 6.1.55, gcc 13.2.0, 12th Gen Intel Core i7-1280P) shows the following numbers in the boringssl performance tests:

Did 90900 Ed25519 key generation operations in 1006408us (90321.2 ops/sec)
Did 94000 Ed25519 signing operations in 1002192us (93794.4 ops/sec)
Did 33000 Ed25519 verify operations in 1029750us (32046.6 ops/sec)
Did 103000 Curve25519 base-point multiplication operations in 1005442us (102442.5 ops/sec)
Did 39000 Curve25519 arbitrary point multiplication operations in 1010017us (38613.2 ops/sec)

Building the project with gcc at the identified regression commit produces worse numbers for the same benchmarks:

Did 33744 Ed25519 key generation operations in 1006475us (33526.9 ops/sec)
Did 34000 Ed25519 signing operations in 1011973us (33597.7 ops/sec)
Did 32000 Ed25519 verify operations in 1032193us (31002.0 ops/sec)
Did 36000 Curve25519 base-point multiplication operations in 1021745us (35233.8 ops/sec)
Did 39000 Curve25519 arbitrary point multiplication operations in 1020887us (38202.1 ops/sec)

Running the same tests prior to the problematic commit but using clang 11.1.0 produces these numbers:

Did 80132 Ed25519 key generation operations in 1004593us (79765.6 ops/sec)
Did 81000 Ed25519 signing operations in 1003061us (80752.8 ops/sec)
Did 28000 Ed25519 verify operations in 1010878us (27698.7 ops/sec)
Did 87000 Curve25519 base-point multiplication operations in 1005378us (86534.6 ops/sec)
Did 38000 Curve25519 arbitrary point multiplication operations in 1004032us (37847.4 ops/sec)

And doing the same with the problematic commit and clang 11.1.0 shows:

Did 83739 Ed25519 key generation operations in 1007756us (83094.5 ops/sec)
Did 88000 Ed25519 signing operations in 1010131us (87117.4 ops/sec)
Did 31000 Ed25519 verify operations in 1013649us (30582.6 ops/sec)
Did 94000 Curve25519 base-point multiplication operations in 1008822us (93178.0 ops/sec)
Did 39000 Curve25519 arbitrary point multiplication operations in 1020461us (38218.0 ops/sec)

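The reported numbers show that the clang build is essentially unaffected by the problematic commit (it is, if anything, slightly faster afterwards), whereas the GCC build is roughly 2.7x to 2.9x slower for Ed25519 key generation, Ed25519 signing, and Curve25519 base-point multiplication. Ed25519 verify and Curve25519 arbitrary point multiplication are largely unchanged. This suggests that something specific to GCC is causing the slowdown.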
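
For anyone who wants to check the size of the GCC slowdown, here is a minimal sketch that parses the `bssl speed` lines quoted above and prints the before/after ratio per benchmark. The regex and the helper names (`parse_speed`, `slowdown`) are my own for illustration and are not part of BoringSSL; the regex assumes the output format stays exactly as pasted in this report.

```python
import re

# Matches one line of `bssl speed` output as quoted in this report, e.g.
# "Did 90900 Ed25519 key generation operations in 1006408us (90321.2 ops/sec)"
LINE_RE = re.compile(r"Did (\d+) (.+?) operations in (\d+)us \(([\d.]+) ops/sec\)")

GCC_BEFORE = """\
Did 90900 Ed25519 key generation operations in 1006408us (90321.2 ops/sec)
Did 94000 Ed25519 signing operations in 1002192us (93794.4 ops/sec)
Did 33000 Ed25519 verify operations in 1029750us (32046.6 ops/sec)
Did 103000 Curve25519 base-point multiplication operations in 1005442us (102442.5 ops/sec)
Did 39000 Curve25519 arbitrary point multiplication operations in 1010017us (38613.2 ops/sec)
"""

GCC_AFTER = """\
Did 33744 Ed25519 key generation operations in 1006475us (33526.9 ops/sec)
Did 34000 Ed25519 signing operations in 1011973us (33597.7 ops/sec)
Did 32000 Ed25519 verify operations in 1032193us (31002.0 ops/sec)
Did 36000 Curve25519 base-point multiplication operations in 1021745us (35233.8 ops/sec)
Did 39000 Curve25519 arbitrary point multiplication operations in 1020887us (38202.1 ops/sec)
"""


def parse_speed(text):
    """Return {benchmark name: ops/sec}, recomputed from count and elapsed time."""
    results = {}
    for match in LINE_RE.finditer(text):
        count = int(match.group(1))
        name = match.group(2)
        micros = int(match.group(3))
        results[name] = count / (micros / 1_000_000)
    return results


def slowdown(before, after):
    """Ratio of old to new throughput; values above 1.0 mean the new build is slower."""
    return {name: before[name] / after[name] for name in before if name in after}


ratios = slowdown(parse_speed(GCC_BEFORE), parse_speed(GCC_AFTER))
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Swapping in the clang output for `GCC_BEFORE` and `GCC_AFTER` gives the corresponding ratios for clang.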

I'm not confident in my ability to dissect the underlying cause, but suspect that GCC's handling of the new precomputed table representation is not as efficient as it could be relative to clang. I'm hopeful that with clear reproduction steps someone more familiar would be able to make progress.

I've already opened a bug with the BoringSSL project: https://bugs.chromium.org/p/boringssl/issues/detail?id=655 


Here are the reproduction steps:

1. Check out https://github.com/google/boringssl/commit/d605df5b6f8462c1f3005da82d718ec067f46b70
2. Configure and build the project **with GCC**:
```
CFLAGS="-Wno-error=stringop-overflow" CC= CXX= cmake -DCMAKE_BUILD_TYPE=Release -B build-release-gcc
<snipped>
make -C build-release-gcc
<snipped>
```
3. Run the `bssl speed` tool, filtering for `25519`:
```
build-release-gcc/tool/bssl speed -filter 25519
```
4. Observe slower results.
5. Check out https://github.com/google/boringssl/commit/4a0393fcf37d7dbd090a5bb2293601a9ec7605da - the parent commit to d605df5b6f8462c1f3005da82d718ec067f46b70
6. Repeat the process described above.
7. Observe faster results.

The same process can be undertaken with clang by replacing the `cmake` and `make` steps with:

CC=clang CXX=clang++ cmake -DCMAKE_BUILD_TYPE=Release -B build-release-clang
make -C build-release-clang
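
To make the four-way comparison (two commits, two compilers) easier to repeat, here is a small sketch that only prints the command sequence for each combination; it does not execute anything. The `git checkout` line and the helper name `repro_commands` are my own additions, and `CC=gcc CXX=g++` is an assumption, since the GCC command above leaves both variables empty.

```python
import shlex

REGRESSION = "d605df5b6f8462c1f3005da82d718ec067f46b70"
PARENT = "4a0393fcf37d7dbd090a5bb2293601a9ec7605da"

# Per-compiler environment. The report leaves CC/CXX empty for the GCC build,
# so "gcc"/"g++" here are an assumption; the clang values are from the report.
COMPILERS = {
    "gcc": {"CC": "gcc", "CXX": "g++", "CFLAGS": "-Wno-error=stringop-overflow"},
    "clang": {"CC": "clang", "CXX": "clang++"},
}


def repro_commands(commit, compiler):
    """Return the shell commands for one commit/compiler combination."""
    env = COMPILERS[compiler]
    build_dir = f"build-release-{compiler}"
    env_prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return [
        f"git checkout {commit}",
        f"{env_prefix} cmake -DCMAKE_BUILD_TYPE=Release -B {build_dir}",
        f"make -C {build_dir}",
        f"{build_dir}/tool/bssl speed -filter 25519",
    ]


# Dry run: print every combination, parent commit first.
for compiler in COMPILERS:
    for commit in (PARENT, REGRESSION):
        print("\n".join(repro_commands(commit, compiler)))
        print()
```

The printed lines are meant to be pasted into a shell from the root of a BoringSSL checkout.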

Thank you!
Comment 1 Andrew Pinski 2023-10-11 19:51:54 UTC
Hmm:
#if defined(__clang__) // materialize for vectorization, 6% speedup
  __asm__("" : "+m" (t_bytes) : /*no inputs*/);
#endif


What target is this for? What processor too?
Comment 2 Andrew Pinski 2023-10-11 19:52:38 UTC
(In reply to Andrew Pinski from comment #1)
> Hmm:
> #if defined(__clang__) // materialize for vectorization, 6% speedup
>   __asm__("" : "+m" (t_bytes) : /*no inputs*/);
> #endif
> 
> 
> What target is this for? What processor too?

What happens if you enable the above for GCC too?
Comment 3 cpu 2023-10-11 20:02:19 UTC
> What happens if you enable the above for GCC too?

That appears to have helped, but not closed the gap:

```
Did 39600 Ed25519 key generation operations in 1001716us (39532.2 ops/sec)
Did 41000 Ed25519 signing operations in 1006641us (40729.5 ops/sec)
Did 32000 Ed25519 verify operations in 1020079us (31370.1 ops/sec)
Did 43000 Curve25519 base-point multiplication operations in 1023075us (42030.2 ops/sec)
Did 39000 Curve25519 arbitrary point multiplication operations in 1008147us (38684.8 ops/sec)
```
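
To put "helped, but not closed the gap" into numbers, here is a rough sketch that computes what fraction of the lost GCC throughput comes back once the `__asm__` line is enabled. The ops/sec values are copied from the output in this report; the helper name `gap_recovered` is made up for illustration. Only the three benchmarks that regressed are included, because the other two barely moved and the ratio would mostly be noise.

```python
# ops/sec for the GCC builds, copied from the benchmark output in this report.
gcc_parent = {
    "Ed25519 key generation": 90321.2,
    "Ed25519 signing": 93794.4,
    "Curve25519 base-point multiplication": 102442.5,
}
gcc_regressed = {
    "Ed25519 key generation": 33526.9,
    "Ed25519 signing": 33597.7,
    "Curve25519 base-point multiplication": 35233.8,
}
# Regression commit with the __asm__ line enabled for GCC as well.
gcc_with_asm = {
    "Ed25519 key generation": 39532.2,
    "Ed25519 signing": 40729.5,
    "Curve25519 base-point multiplication": 42030.2,
}


def gap_recovered(parent, regressed, patched):
    """Fraction of the lost throughput that the patched build wins back."""
    return (patched - regressed) / (parent - regressed)


recovered = {
    name: gap_recovered(gcc_parent[name], gcc_regressed[name], gcc_with_asm[name])
    for name in gcc_parent
}
for name, fraction in recovered.items():
    print(f"{name}: {fraction:.1%} of the gap recovered")
```

By this measure the workaround recovers only about a tenth of the lost throughput in each of the three benchmarks, so most of the regression remains.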