Hi folks,

I've been bisecting a performance regression for x25519 cryptographic operations with BoringSSL (https://boringssl.googlesource.com/boringssl) that causes builds with gcc (tested w/ 13.2.0) to perform significantly worse than builds with clang (tested w/ clang 11.1.0).

I've identified that the regression was introduced in this commit: https://github.com/google/boringssl/commit/d605df5b6f8462c1f3005da82d718ec067f46b70

Building the project with gcc prior to this commit (Linux 6.1.55, gcc 13.2.0, 12th Gen Intel Core i7-1280P) shows the following numbers in the boringssl performance tests:

Did 90900 Ed25519 key generation operations in 1006408us (90321.2 ops/sec)
Did 94000 Ed25519 signing operations in 1002192us (93794.4 ops/sec)
Did 33000 Ed25519 verify operations in 1029750us (32046.6 ops/sec)
Did 103000 Curve25519 base-point multiplication operations in 1005442us (102442.5 ops/sec)
Did 39000 Curve25519 arbitrary point multiplication operations in 1010017us (38613.2 ops/sec)

Building the project with gcc at the identified regression commit produces worse numbers for the same benchmarks:

Did 33744 Ed25519 key generation operations in 1006475us (33526.9 ops/sec)
Did 34000 Ed25519 signing operations in 1011973us (33597.7 ops/sec)
Did 32000 Ed25519 verify operations in 1032193us (31002.0 ops/sec)
Did 36000 Curve25519 base-point multiplication operations in 1021745us (35233.8 ops/sec)
Did 39000 Curve25519 arbitrary point multiplication operations in 1020887us (38202.1 ops/sec)

Running the same tests prior to the problematic commit but using clang 11.1.0 produces these numbers:

Did 80132 Ed25519 key generation operations in 1004593us (79765.6 ops/sec)
Did 81000 Ed25519 signing operations in 1003061us (80752.8 ops/sec)
Did 28000 Ed25519 verify operations in 1010878us (27698.7 ops/sec)
Did 87000 Curve25519 base-point multiplication operations in 1005378us (86534.6 ops/sec)
Did 38000 Curve25519 arbitrary point multiplication operations in 1004032us (37847.4 ops/sec)

And doing the same with the problematic commit and clang 11.1.0 shows:

Did 83739 Ed25519 key generation operations in 1007756us (83094.5 ops/sec)
Did 88000 Ed25519 signing operations in 1010131us (87117.4 ops/sec)
Did 31000 Ed25519 verify operations in 1013649us (30582.6 ops/sec)
Did 94000 Curve25519 base-point multiplication operations in 1008822us (93178.0 ops/sec)
Did 39000 Curve25519 arbitrary point multiplication operations in 1020461us (38218.0 ops/sec)

You can see from the reported numbers that the clang results are broadly similar before and after the problematic commit, while the GCC build is much slower after it, suggesting something specific to GCC is causing the slowdown.

I'm not confident in my ability to dissect the underlying cause, but suspect that GCC's handling of the new precomputed table representation is not as efficient as it could be relative to clang. I'm hopeful that with clear reproduction steps someone more familiar would be able to make progress.

I've already opened a bug with the BoringSSL project: https://bugs.chromium.org/p/boringssl/issues/detail?id=655

Here are the reproduction steps:

1. Check out https://github.com/google/boringssl/commit/d605df5b6f8462c1f3005da82d718ec067f46b70

2. Configure and build the project **with GCC**:

```
CFLAGS="-Wno-error=stringop-overflow" CC= CXX= cmake -DCMAKE_BUILD_TYPE=Release -B build-release-gcc
<snipped>
make -C build-release-gcc
<snipped>
```

3. Run the `bssl speed` tool, filtering for `25519`:

```
build-release-gcc/tool/bssl speed -filter 25519
```

4. Observe slower results.

5. Check out https://github.com/google/boringssl/commit/4a0393fcf37d7dbd090a5bb2293601a9ec7605da - the parent commit to d605df5b6f8462c1f3005da82d718ec067f46b70

6. Repeat the process described above.

7. Observe faster results.

The same process can be undertaken with clang by substituting the `cmake` step with:

CC=clang CXX=clang++ cmake -DCMAKE_BUILD_TYPE=Release -B build-release-clang
make -C build-release-clang

Thank you!
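
To make the before/after comparison easier to read, here is a minimal sketch that parses `bssl speed` output lines of the form quoted in this report and prints the after/before throughput ratio for each benchmark. The helper names (`parse_speed`, `ratios`) are hypothetical and not part of BoringSSL; the sample data is copied from the GCC 13.2.0 results in the report.

```python
import re

# Matches lines such as:
#   Did 103000 Curve25519 base-point multiplication operations in 1005442us (102442.5 ops/sec)
LINE_RE = re.compile(r"Did \d+ (.+?) operations in \d+us \(([\d.]+) ops/sec\)")


def parse_speed(text):
    """Return {benchmark name: ops/sec} for every matching line in text."""
    return {name: float(rate) for name, rate in LINE_RE.findall(text)}


def ratios(before, after):
    """Return {benchmark name: after/before throughput} for shared benchmarks."""
    return {name: after[name] / before[name] for name in before if name in after}


# GCC 13.2.0 at the parent commit (numbers copied from the report).
GCC_BEFORE = """\
Did 90900 Ed25519 key generation operations in 1006408us (90321.2 ops/sec)
Did 94000 Ed25519 signing operations in 1002192us (93794.4 ops/sec)
Did 33000 Ed25519 verify operations in 1029750us (32046.6 ops/sec)
Did 103000 Curve25519 base-point multiplication operations in 1005442us (102442.5 ops/sec)
Did 39000 Curve25519 arbitrary point multiplication operations in 1010017us (38613.2 ops/sec)
"""

# GCC 13.2.0 at the regression commit (numbers copied from the report).
GCC_AFTER = """\
Did 33744 Ed25519 key generation operations in 1006475us (33526.9 ops/sec)
Did 34000 Ed25519 signing operations in 1011973us (33597.7 ops/sec)
Did 32000 Ed25519 verify operations in 1032193us (31002.0 ops/sec)
Did 36000 Curve25519 base-point multiplication operations in 1021745us (35233.8 ops/sec)
Did 39000 Curve25519 arbitrary point multiplication operations in 1020887us (38202.1 ops/sec)
"""

for name, ratio in ratios(parse_speed(GCC_BEFORE), parse_speed(GCC_AFTER)).items():
    print(f"{name}: {ratio:.2f}x of the pre-regression throughput")
```

On the GCC numbers above, key generation, signing and base-point multiplication come out at roughly 0.34x to 0.37x of their earlier throughput, while verify and arbitrary point multiplication stay within a few percent. Feeding the clang results through the same helpers gives the corresponding clang ratios.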
Hmm:

#if defined(__clang__) // materialize for vectorization, 6% speedup
__asm__("" : "+m" (t_bytes) : /*no inputs*/);
#endif

What target is this for? What processor too?
(In reply to Andrew Pinski from comment #1)
> Hmm:
> #if defined(__clang__) // materialize for vectorization, 6% speedup
> __asm__("" : "+m" (t_bytes) : /*no inputs*/);
> #endif
>
>
> What target is this for? What processor too?

What happens if you enable the above for GCC too?
> What happens if you enable the above for GCC too?

That appears to have helped, but not closed the gap:

```
Did 39600 Ed25519 key generation operations in 1001716us (39532.2 ops/sec)
Did 41000 Ed25519 signing operations in 1006641us (40729.5 ops/sec)
Did 32000 Ed25519 verify operations in 1020079us (31370.1 ops/sec)
Did 43000 Curve25519 base-point multiplication operations in 1023075us (42030.2 ops/sec)
Did 39000 Curve25519 arbitrary point multiplication operations in 1008147us (38684.8 ops/sec)
```
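
For a rough sense of "helped, but not closed the gap", the sketch below estimates what fraction of the lost throughput the `__asm__` change wins back. It assumes the gap is measured against the GCC numbers from the parent commit, and `gap_recovered` is a hypothetical helper. The figures are the ops/sec values reported in this thread, so run-to-run noise applies.

```python
def gap_recovered(baseline, regressed, patched):
    """Fraction of the lost throughput that the patched build wins back.

    baseline:  ops/sec before the regression
    regressed: ops/sec at the regression commit
    patched:   ops/sec at the regression commit with the change applied
    """
    lost = baseline - regressed
    if lost <= 0:
        raise ValueError("no regression to recover from")
    return (patched - regressed) / lost


# (baseline, regressed, patched) ops/sec for the GCC builds, copied from the thread.
RESULTS = {
    "Ed25519 key generation": (90321.2, 33526.9, 39532.2),
    "Ed25519 signing": (93794.4, 33597.7, 40729.5),
    "Curve25519 base-point multiplication": (102442.5, 35233.8, 42030.2),
}

for name, (baseline, regressed, patched) in RESULTS.items():
    share = gap_recovered(baseline, regressed, patched)
    print(f"{name}: {share:.1%} of the gap recovered")
```

With these inputs the change recovers on the order of a tenth of the lost throughput for the three affected benchmarks, which matches the observation that the gap remains mostly open.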