Bug 111612 - GCC twice as slow as Clang for minisweep (SPEC HPC 2021)
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 14.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: spec
 
Reported: 2023-09-27 12:50 UTC by Tobias Burnus
Modified: 2023-09-30 08:04 UTC

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Tobias Burnus 2023-09-27 12:50:35 UTC
This discussion came up at this year's GNU Tools Cauldron during the OpenMP/OpenACC/offloading talk, i.e.
https://gcc.gnu.org/wiki/cauldron2023#cauldron2023talks.openacc_openmp_offloading_and_gcc

In that talk, running with MPI on 8 ranks gave the following run times
(--define model=mpi --ranks 8):

3855 s (~1.071 h) - Nvidia HPC SDK 23.5 (May 2023)
4076 s (~1.132 h) - LLVM 17 (pre) commit 34cf263e6 (2023-08-07)
4900 s (~1.361 h), or up to 6624 s (~1.840 h) - GCC og13 commit b003e6511 (2023-07-19)

* * *

I just tried it myself as follows, using the non-SPEC-ified version from GitHub
and a modified input taken from the how-to-run README. I have not checked whether there are any gotchas, but it should otherwise be identical to the SPEC version, and it is built without OpenMP, MPI or similar.

Namely:

  git clone https://github.com/wdj/minisweep.git
  cmake -DCMAKE_C_FLAGS=-O2 -DCMAKE_C_COMPILER=/usr/bin/clang-14 ../..

And likewise for GCC mainline, also with -O2.

Then running:

  time ./sweep --ncell_x 4 --ncell_y 8 --ncell_z 32

GCC mainline:
Normsq result: 2.82234163e+12  diff: 0.000e+00  PASS  time: 7.817  GF/s: 0.315
real    0m8,124s / user    0m7,943s / sys     0m0,180s

Clang/LLVM-14:
Normsq result: 2.82234163e+12  diff: 0.000e+00  PASS  time: 3.036  GF/s: 0.812
real    0m3,223s / user    0m3,085s / sys     0m0,137s


Using -O3 -flto, I get: 2.070s (GCC) vs. 1.053s (Clang/LLVM)
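
For reference, the slowdown factors implied by the timings above can be worked out with a small shell helper. This is only a sketch: `ratio` is a hypothetical function written for this illustration and is not part of minisweep. `LC_ALL=C` is set so that awk parses the dot-decimal inputs the same way under any locale (the `time` output above uses comma decimals).

```shell
# Hypothetical helper (not part of minisweep): print how many times
# slower the first timing is than the second, rounded to two decimals.
ratio() {
  LC_ALL=C awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

ratio 7.817 3.036   # -O2, time reported by sweep: GCC vs. Clang -> 2.57
ratio 2.070 1.053   # -O3 -flto: GCC vs. Clang                   -> 1.97
```

So GCC is about 2.6x slower at -O2 and just under 2x slower at -O3 -flto for this input.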
Comment 1 Chung-Lin Tang 2023-09-28 07:48:13 UTC
To clarify, the numbers here are using mainline, and not devel/omp/gcc-13 with -fopenmp-target=acc, right?
Comment 2 Tobias Burnus 2023-09-28 11:52:00 UTC
> To clarify, the numbers here are using mainline,
> and not devel/omp/gcc-13 with -fopenmp-target=acc, right?

The presentation numbers, i.e. everything quoted before the "* * *", are with OG13.
But I only quoted the results for MPI without any OpenMP/OpenACC/-fopenmp-target=acc.
If you are interested in those, see the presentation.

* * *

However, this PR is about plain single-threaded host execution. Everything below the "* * *" was run with GCC mainline (GCC 14 as of yesterday) and clang 14.0.0-1ubuntu1.1 on an Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz system.

[Note: Unlike the SPEC HPC 2021 version, the GitHub version supports neither OpenACC nor OpenMP offloading. It does, however, support MPI and host-side OpenMP (parallel for / parallel / task / taskgroup), but as built here (see the "cmake" line), it uses neither.]

Hence: this PR is really only about missed single-thread and vectorization optimizations.