Bug 84402 - [meta] GCC build system: parallelism bottleneck
Summary: [meta] GCC build system: parallelism bottleneck
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: bootstrap
Version: unknown
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: build, meta-bug
Depends on: 42980 45928 52509 78288 87832 107804 109310 109898 111600 111619 112422 116146 116188 1948 29442 54179 109051 113575
Blocks:
Reported: 2018-02-15 09:45 UTC by Martin Liška
Modified: 2024-08-02 02:45 UTC (History)
CC List: 13 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-07-18 00:00:00


Attachments
make all-host -j8 on 8 core Haswell machine (16.73 KB, image/svg+xml)
2018-02-15 09:46 UTC, Martin Liška
make all-host -j128 on 128 core EPYC machine (14.37 KB, image/svg+xml)
2018-02-15 09:47 UTC, Martin Liška
make (for configure --disable-bootstrap) -j128 on 128 core EPYC machine (19.67 KB, image/svg+xml)
2018-02-15 09:47 UTC, Martin Liška
wall time report: make (for configure --disable-bootstrap) on Haswell machine (system compiler -O2 -g) (8.14 KB, text/plain)
2018-02-15 09:48 UTC, Martin Liška
wall time report: bootstrap stage1 on Haswell machine (5.13 KB, text/plain)
2018-02-15 09:49 UTC, Martin Liška
wall time report: bootstrap stage2 on Haswell machine (7.49 KB, text/plain)
2018-02-15 09:49 UTC, Martin Liška
wall time report: bootstrap stage3 on Haswell machine (9.24 KB, text/plain)
2018-02-15 09:49 UTC, Martin Liška
Parallel build of make all-host on 8 core Haswell machine (19.34 KB, image/svg+xml)
2018-02-15 13:17 UTC, Martin Liška
Parallel build of make all-host on 8 core Haswell machine (19.37 KB, image/svg+xml)
2018-02-15 18:27 UTC, Martin Liška
Parallel build of make all-host on 8 core Haswell machine (23.04 KB, image/svg+xml)
2018-02-16 09:13 UTC, Martin Liška
Parallel build of make all-host on 128 core EPYC machine (20.19 KB, image/svg+xml)
2018-02-16 09:14 UTC, Martin Liška
-ftime-report for most time consuming files on Haswell machine (13.72 KB, text/plain)
2018-02-21 08:45 UTC, Martin Liška
-ftime-report for most time consuming files on Haswell machine (17.30 KB, text/plain)
2018-02-21 14:00 UTC, Martin Liška
Parallel build of make all-host on 128 core EPYC machine (log file) (11.50 KB, text/plain)
2018-02-23 09:01 UTC, Martin Liška
make -j 64 all-gcc, with --disable-bootstrap, on 64-cores. Blue means dependency to gimple-match. (91.20 KB, application/pdf)
2019-02-07 14:02 UTC, Giuliano Belinassi
CPU utilization of make all-host on recent AMD server (30.38 KB, image/svg+xml)
2022-11-30 08:13 UTC, Martin Liška
make all-host on Ryzen 9 (39.75 KB, image/svg+xml)
2022-12-01 10:01 UTC, Martin Liška
make all-host on Ryzen 9 with LTO partial linking (36.17 KB, image/svg+xml)
2022-12-01 10:03 UTC, Martin Liška
Partial linking path (1.18 KB, patch)
2022-12-01 10:07 UTC, Martin Liška

Description Martin Liška 2018-02-15 09:45:05 UTC
As discussed yesterday on IRC, the current GCC build has various issues that prevent it from being fully parallelizable on machines with a high number of CPUs.

I hacked make to record a timestamp when a target is triggered and when it finishes:
https://github.com/marxin/make/tree/timestamp

Then I built GCC with -j1 and used the following parser to generate reports:
https://github.com/marxin/script-misc/blob/master/parse-make-log.py

I prepared various reports that I'm going to add as attachments.
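
For readers who want to reproduce this kind of wall time report without the linked tools, here is a minimal sketch of the idea. The log format ("<timestamp> start|end <target>") is an assumption made up for illustration; it is not the output format of the patched make above, and wall_times() is a hypothetical helper, not the linked parse-make-log.py. The default threshold of 0.5s mirrors the minimum threshold mentioned later in this bug.

```python
import re
from collections import defaultdict

# ASSUMED log format (hypothetical, not the real output of the patched make):
#   <unix-timestamp> start <target>
#   <unix-timestamp> end <target>
LINE_RE = re.compile(r"^(?P<ts>\d+(?:\.\d+)?) (?P<event>start|end) (?P<target>\S+)$")


def wall_times(lines, threshold=0.5):
    """Return [(target, seconds)] for targets at or above threshold, slowest first."""
    started = {}
    durations = defaultdict(float)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # ordinary make output, not a timestamp record
        ts, target = float(m["ts"]), m["target"]
        if m["event"] == "start":
            started[target] = ts
        elif target in started:
            durations[target] += ts - started.pop(target)
    report = [(t, d) for t, d in durations.items() if d >= threshold]
    return sorted(report, key=lambda item: item[1], reverse=True)


# Made-up sample data; the timings are not measurements.
sample = [
    "100.0 start libcpp/directives.o",
    "100.2 end libcpp/directives.o",
    "100.2 start gcc/gimple-match.o",
    "g++ -c gimple-match.c",
    "160.2 end gcc/gimple-match.o",
    "160.2 start gcc/insn-emit.o",
    "175.7 end gcc/insn-emit.o",
]
for target, seconds in wall_times(sample):
    print(f"{seconds:8.1f}s  {target}")
```

With a -j1 build the intervals never overlap, so the per-target durations add up to the total build time, which is what makes a sorted list like this useful for spotting the most expensive targets.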
Comment 1 Martin Liška 2018-02-15 09:46:20 UTC
Created attachment 43420 [details]
make all-host -j8 on 8 core Haswell machine
Comment 2 Martin Liška 2018-02-15 09:47:05 UTC
Created attachment 43421 [details]
make all-host -j128 on 128 core EPYC machine
Comment 3 Martin Liška 2018-02-15 09:47:36 UTC
Created attachment 43422 [details]
make (for configure --disable-bootstrap) -j128 on 128 core EPYC machine
Comment 4 Martin Liška 2018-02-15 09:48:56 UTC
Created attachment 43423 [details]
wall time report: make (for configure --disable-bootstrap) on Haswell machine (system compiler -O2 -g)
Comment 5 Martin Liška 2018-02-15 09:49:19 UTC
Created attachment 43424 [details]
wall time report: bootstrap stage1 on Haswell machine
Comment 6 Martin Liška 2018-02-15 09:49:30 UTC
Created attachment 43425 [details]
wall time report: bootstrap stage2 on Haswell machine
Comment 7 Martin Liška 2018-02-15 09:49:49 UTC
Created attachment 43426 [details]
wall time report: bootstrap stage3 on Haswell machine
Comment 8 Martin Liška 2018-02-15 10:08:54 UTC
I forgot to note that the wall time reports only list targets that took at least 0.5s.
Comment 9 Martin Liška 2018-02-15 13:17:26 UTC
Created attachment 43428 [details]
Parallel build of make all-host on 8 core Haswell machine
Comment 10 Martin Liška 2018-02-15 18:27:52 UTC
Created attachment 43432 [details]
Parallel build of make all-host on 8 core Haswell machine
Comment 11 Martin Liška 2018-02-15 18:40:40 UTC
(In reply to Martin Liška from comment #10)
> Created attachment 43432 [details]
> Parallel build of make all-host on 8 core Haswell machine

This was generated with a slightly modified make that can run fully in parallel:
https://github.com/marxin/make/tree/timestamp-v2

The output is then parsed and a 'stacked' graph is generated:
https://github.com/marxin/script-misc/blob/master/parse-make-log-parallel.py
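
The core of such a 'stacked' view can be sketched as a sweep over job start and end times that counts how many jobs run at each moment. This is only an illustration under assumptions: the (start, end) interval input and both function names are hypothetical and are not taken from parse-make-log-parallel.py.

```python
def concurrency_profile(intervals):
    """Return [(time, running_jobs)] at every time where a job starts or ends."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a job starts
        events.append((end, -1))    # a job finishes
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so a finishing job frees its slot before the next one starts.
    events.sort()
    profile = []
    running = 0
    for time, delta in events:
        running += delta
        if profile and profile[-1][0] == time:
            profile[-1] = (time, running)  # merge events at the same time
        else:
            profile.append((time, running))
    return profile


def average_parallelism(profile):
    """Average number of running jobs between the first and last event."""
    busy = 0
    for (t0, running), (t1, _) in zip(profile, profile[1:]):
        busy += running * (t1 - t0)
    return busy / (profile[-1][0] - profile[0][0])


# Made-up (start, end) times in seconds for four jobs.
jobs = [(0, 4), (0, 10), (1, 3), (4, 6)]
profile = concurrency_profile(jobs)
print(profile)
print(average_parallelism(profile))
```

A long tail where the count stays at 1 while the machine has many idle cores is the kind of bottleneck the attached graphs are meant to expose.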
Comment 12 Martin Liška 2018-02-16 09:13:42 UTC
Created attachment 43439 [details]
Parallel build of make all-host on 8 core Haswell machine
Comment 13 Martin Liška 2018-02-16 09:14:24 UTC
Created attachment 43440 [details]
Parallel build of make all-host on 128 core EPYC machine
Comment 14 Martin Liška 2018-02-21 08:45:49 UTC
Created attachment 43478 [details]
-ftime-report for most time consuming files on Haswell machine
Comment 15 Segher Boessenkool 2018-02-21 11:58:10 UTC
This is a -O0 build?  That's what that time report shows afaics.
Comment 16 Martin Liška 2018-02-21 14:00:56 UTC
Created attachment 43482 [details]
-ftime-report for most time consuming files on Haswell machine

Regenerated properly with -O2, which was missing in the previous version.
Comment 17 Tom Tromey 2018-02-22 14:46:13 UTC
The results in comment #13 seem to be missing some compilations --
I would have expected to see more files from libcpp in there.
As it is I only see directives.o and line-map.o.
Comment 18 Martin Liška 2018-02-23 09:01:40 UTC
Created attachment 43492 [details]
Parallel build of make all-host on 128 core EPYC machine (log file)
Comment 19 Martin Liška 2018-02-23 09:02:29 UTC
(In reply to Tom Tromey from comment #17)
> The results in comment #13 seem to be missing some compilations --
> I would have expected to see more files from libcpp in there.
> As it is I only see directives.o and line-map.o.

There was a minimum threshold of 0.5s; please take a look at the log file in:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402#c18
Comment 20 Martin Liška 2018-04-04 12:48:29 UTC
For the libsanitizer/*/*_interceptors I made a quick patch:
https://github.com/marxin/gcc/commit/5ce658230db567474997fa411f23ac78366487ce
which basically splits asan_interceptors.cc and sanitizer_common_interceptors.inc and moves the implementation of the string functions to a separate compile unit.
This shrinks the compile time of asan_interceptors.cc from 38s to 34s when it is built with a stage1 compiler that has checking enabled.

I believe splitting the interceptors into a couple of logical sub-files will make it very fast. I can imagine splitting them into components like string, stdio, time, process, thread, math, ...
List of interceptors grepped from sanitizer_common_interceptors.inc:

INTERCEPTOR(SIZE_T, strlen, const char *s) {
INTERCEPTOR(SIZE_T, strnlen, const char *s, SIZE_T maxlen) {
INTERCEPTOR(char*, strndup, const char *s, uptr size) {
INTERCEPTOR(char*, __strndup, const char *s, uptr size) {
INTERCEPTOR(char*, textdomain, const char *domainname) {
INTERCEPTOR(int, strcmp, const char *s1, const char *s2) {
INTERCEPTOR(int, strncmp, const char *s1, const char *s2, uptr size) {
INTERCEPTOR(int, strcasecmp, const char *s1, const char *s2) {
INTERCEPTOR(int, strncasecmp, const char *s1, const char *s2, SIZE_T size) {
INTERCEPTOR(char*, strstr, const char *s1, const char *s2) {
INTERCEPTOR(char*, strcasestr, const char *s1, const char *s2) {
INTERCEPTOR(char*, strtok, char *str, const char *delimiters) {
INTERCEPTOR(void*, memmem, const void *s1, SIZE_T len1, const void *s2,
INTERCEPTOR(char*, strchr, const char *s, int c) {
INTERCEPTOR(char*, strchrnul, const char *s, int c) {
INTERCEPTOR(char*, strrchr, const char *s, int c) {
INTERCEPTOR(SIZE_T, strspn, const char *s1, const char *s2) {
INTERCEPTOR(SIZE_T, strcspn, const char *s1, const char *s2) {
INTERCEPTOR(char *, strpbrk, const char *s1, const char *s2) {
INTERCEPTOR(void *, memset, void *dst, int v, uptr size) {
INTERCEPTOR(void *, memmove, void *dst, const void *src, uptr size) {
INTERCEPTOR(void *, memcpy, void *dst, const void *src, uptr size) {
INTERCEPTOR(int, memcmp, const void *a1, const void *a2, uptr size) {
INTERCEPTOR(void*, memchr, const void *s, int c, SIZE_T n) {
INTERCEPTOR(void*, memrchr, const void *s, int c, SIZE_T n) {
INTERCEPTOR(double, frexp, double x, int *exp) {
INTERCEPTOR(float, frexpf, float x, int *exp) {
INTERCEPTOR(long double, frexpl, long double x, int *exp) {
INTERCEPTOR(SSIZE_T, read, int fd, void *ptr, SIZE_T count) {
INTERCEPTOR(SIZE_T, fread, void *ptr, SIZE_T size, SIZE_T nmemb, void *file) {
INTERCEPTOR(SSIZE_T, pread, int fd, void *ptr, SIZE_T count, OFF_T offset) {
INTERCEPTOR(SSIZE_T, pread64, int fd, void *ptr, SIZE_T count, OFF64_T offset) {
INTERCEPTOR_WITH_SUFFIX(SSIZE_T, readv, int fd, __sanitizer_iovec *iov,
INTERCEPTOR(SSIZE_T, preadv, int fd, __sanitizer_iovec *iov, int iovcnt,
INTERCEPTOR(SSIZE_T, preadv64, int fd, __sanitizer_iovec *iov, int iovcnt,
INTERCEPTOR(SSIZE_T, write, int fd, void *ptr, SIZE_T count) {
INTERCEPTOR(SIZE_T, fwrite, const void *p, uptr size, uptr nmemb, void *file) {
INTERCEPTOR(SSIZE_T, pwrite, int fd, void *ptr, SIZE_T count, OFF_T offset) {
INTERCEPTOR(SSIZE_T, pwrite64, int fd, void *ptr, OFF64_T count,
INTERCEPTOR_WITH_SUFFIX(SSIZE_T, writev, int fd, __sanitizer_iovec *iov,
INTERCEPTOR(SSIZE_T, pwritev, int fd, __sanitizer_iovec *iov, int iovcnt,
INTERCEPTOR(SSIZE_T, pwritev64, int fd, __sanitizer_iovec *iov, int iovcnt,
INTERCEPTOR(int, prctl, int option, unsigned long arg2,
INTERCEPTOR(unsigned long, time, unsigned long *t) {
INTERCEPTOR(__sanitizer_tm *, localtime, unsigned long *timep) {
INTERCEPTOR(__sanitizer_tm *, localtime_r, unsigned long *timep, void *result) {
INTERCEPTOR(__sanitizer_tm *, gmtime, unsigned long *timep) {
INTERCEPTOR(__sanitizer_tm *, gmtime_r, unsigned long *timep, void *result) {
INTERCEPTOR(char *, ctime, unsigned long *timep) {
INTERCEPTOR(char *, ctime_r, unsigned long *timep, char *result) {
INTERCEPTOR(char *, asctime, __sanitizer_tm *tm) {
INTERCEPTOR(char *, asctime_r, __sanitizer_tm *tm, char *result) {
INTERCEPTOR(long, mktime, __sanitizer_tm *tm) {
INTERCEPTOR(char *, strptime, char *s, char *format, __sanitizer_tm *tm) {
INTERCEPTOR(int, vscanf, const char *format, va_list ap)
INTERCEPTOR(int, vsscanf, const char *str, const char *format, va_list ap)
INTERCEPTOR(int, vfscanf, void *stream, const char *format, va_list ap)
INTERCEPTOR(int, __isoc99_vscanf, const char *format, va_list ap)
INTERCEPTOR(int, __isoc99_vsscanf, const char *str, const char *format,
INTERCEPTOR(int, __isoc99_vfscanf, void *stream, const char *format, va_list ap)
INTERCEPTOR(int, scanf, const char *format, ...)
INTERCEPTOR(int, fscanf, void *stream, const char *format, ...)
INTERCEPTOR(int, sscanf, const char *str, const char *format, ...)
INTERCEPTOR(int, __isoc99_scanf, const char *format, ...)
INTERCEPTOR(int, __isoc99_fscanf, void *stream, const char *format, ...)
INTERCEPTOR(int, __isoc99_sscanf, const char *str, const char *format, ...)
INTERCEPTOR(int, vprintf, const char *format, va_list ap)
INTERCEPTOR(int, vfprintf, __sanitizer_FILE *stream, const char *format,
INTERCEPTOR(int, vsnprintf, char *str, SIZE_T size, const char *format,
INTERCEPTOR(int, vsnprintf_l, char *str, SIZE_T size, void *loc,
INTERCEPTOR(int, snprintf_l, char *str, SIZE_T size, void *loc,
INTERCEPTOR(int, vsprintf, char *str, const char *format, va_list ap)
INTERCEPTOR(int, vasprintf, char **strp, const char *format, va_list ap)
INTERCEPTOR(int, __isoc99_vprintf, const char *format, va_list ap)
INTERCEPTOR(int, __isoc99_vfprintf, __sanitizer_FILE *stream,
INTERCEPTOR(int, __isoc99_vsnprintf, char *str, SIZE_T size, const char *format,
INTERCEPTOR(int, __isoc99_vsprintf, char *str, const char *format,
INTERCEPTOR(int, printf, const char *format, ...)
INTERCEPTOR(int, fprintf, __sanitizer_FILE *stream, const char *format, ...)
INTERCEPTOR(int, sprintf, char *str, const char *format, ...) // NOLINT
INTERCEPTOR(int, snprintf, char *str, SIZE_T size, const char *format, ...)
INTERCEPTOR(int, asprintf, char **strp, const char *format, ...)
INTERCEPTOR(int, __isoc99_printf, const char *format, ...)
INTERCEPTOR(int, __isoc99_fprintf, __sanitizer_FILE *stream, const char *format,
INTERCEPTOR(int, __isoc99_sprintf, char *str, const char *format, ...)
INTERCEPTOR(int, __isoc99_snprintf, char *str, SIZE_T size,
INTERCEPTOR(int, ioctl, int d, unsigned long request, ...) {
INTERCEPTOR(__sanitizer_passwd *, getpwnam, const char *name) {
INTERCEPTOR(__sanitizer_passwd *, getpwuid, u32 uid) {
INTERCEPTOR(__sanitizer_group *, getgrnam, const char *name) {
INTERCEPTOR(__sanitizer_group *, getgrgid, u32 gid) {
INTERCEPTOR(int, getpwnam_r, const char *name, __sanitizer_passwd *pwd,
INTERCEPTOR(int, getpwuid_r, u32 uid, __sanitizer_passwd *pwd, char *buf,
INTERCEPTOR(int, getgrnam_r, const char *name, __sanitizer_group *grp,
INTERCEPTOR(int, getgrgid_r, u32 gid, __sanitizer_group *grp, char *buf,
INTERCEPTOR(__sanitizer_passwd *, getpwent, int dummy) {
INTERCEPTOR(__sanitizer_group *, getgrent, int dummy) {
INTERCEPTOR(__sanitizer_passwd *, fgetpwent, void *fp) {
INTERCEPTOR(__sanitizer_group *, fgetgrent, void *fp) {
INTERCEPTOR(int, getpwent_r, __sanitizer_passwd *pwbuf, char *buf,
INTERCEPTOR(int, fgetpwent_r, void *fp, __sanitizer_passwd *pwbuf, char *buf,
INTERCEPTOR(int, getgrent_r, __sanitizer_group *pwbuf, char *buf, SIZE_T buflen,
INTERCEPTOR(int, fgetgrent_r, void *fp, __sanitizer_group *pwbuf, char *buf,
INTERCEPTOR(void, setpwent, int dummy) {
INTERCEPTOR(void, endpwent, int dummy) {
INTERCEPTOR(void, setgrent, int dummy) {
INTERCEPTOR(void, endgrent, int dummy) {
INTERCEPTOR(int, clock_getres, u32 clk_id, void *tp) {
INTERCEPTOR(int, clock_gettime, u32 clk_id, void *tp) {
INTERCEPTOR(int, clock_settime, u32 clk_id, const void *tp) {
INTERCEPTOR(int, getitimer, int which, void *curr_value) {
INTERCEPTOR(int, setitimer, int which, const void *new_value, void *old_value) {
INTERCEPTOR(int, glob, const char *pattern, int flags,
INTERCEPTOR(int, glob64, const char *pattern, int flags,
INTERCEPTOR_WITH_SUFFIX(int, wait, int *status) {
INTERCEPTOR_WITH_SUFFIX(int, waitid, int idtype, long long id, void *infop,
INTERCEPTOR_WITH_SUFFIX(int, waitid, int idtype, int id, void *infop,
INTERCEPTOR_WITH_SUFFIX(int, waitpid, int pid, int *status, int options) {
INTERCEPTOR(int, wait3, int *status, int options, void *rusage) {
INTERCEPTOR(int, __wait4, int pid, int *status, int options, void *rusage) {
INTERCEPTOR(int, wait4, int pid, int *status, int options, void *rusage) {
INTERCEPTOR(char *, inet_ntop, int af, const void *src, char *dst, u32 size) {
INTERCEPTOR(int, inet_pton, int af, const char *src, void *dst) {
INTERCEPTOR(int, inet_aton, const char *cp, void *dst) {
INTERCEPTOR(int, pthread_getschedparam, uptr thread, int *policy, int *param) {
INTERCEPTOR(int, getaddrinfo, char *node, char *service,
INTERCEPTOR(int, getnameinfo, void *sockaddr, unsigned salen, char *host,
INTERCEPTOR(int, getsockname, int sock_fd, void *addr, int *addrlen) {
INTERCEPTOR(struct __sanitizer_hostent *, gethostbyname, char *name) {
INTERCEPTOR(struct __sanitizer_hostent *, gethostbyaddr, void *addr, int len,
INTERCEPTOR(struct __sanitizer_hostent *, gethostent, int fake) {
INTERCEPTOR(struct __sanitizer_hostent *, gethostbyname2, char *name, int af) {
INTERCEPTOR(int, gethostbyname_r, char *name, struct __sanitizer_hostent *ret,
INTERCEPTOR(int, gethostent_r, struct __sanitizer_hostent *ret, char *buf,
INTERCEPTOR(int, gethostbyaddr_r, void *addr, int len, int type,
INTERCEPTOR(int, gethostbyname2_r, char *name, int af,
INTERCEPTOR(int, getsockopt, int sockfd, int level, int optname, void *optval,
INTERCEPTOR(int, accept, int fd, void *addr, unsigned *addrlen) {
INTERCEPTOR(int, accept4, int fd, void *addr, unsigned *addrlen, int f) {
INTERCEPTOR(double, modf, double x, double *iptr) {
INTERCEPTOR(float, modff, float x, float *iptr) {
INTERCEPTOR(long double, modfl, long double x, long double *iptr) {
INTERCEPTOR(SSIZE_T, recvmsg, int fd, struct __sanitizer_msghdr *msg,
INTERCEPTOR(SSIZE_T, sendmsg, int fd, struct __sanitizer_msghdr *msg,
INTERCEPTOR(int, getpeername, int sockfd, void *addr, unsigned *addrlen) {
INTERCEPTOR(int, sysinfo, void *info) {
INTERCEPTOR(__sanitizer_dirent *, opendir, const char *path) {
INTERCEPTOR(__sanitizer_dirent *, readdir, void *dirp) {
INTERCEPTOR(int, readdir_r, void *dirp, __sanitizer_dirent *entry,
INTERCEPTOR(__sanitizer_dirent64 *, readdir64, void *dirp) {
INTERCEPTOR(int, readdir64_r, void *dirp, __sanitizer_dirent64 *entry,
INTERCEPTOR(uptr, ptrace, int request, int pid, void *addr, void *data) {
INTERCEPTOR(char *, setlocale, int category, char *locale) {
INTERCEPTOR(char *, getcwd, char *buf, SIZE_T size) {
INTERCEPTOR(char *, get_current_dir_name, int fake) {
INTERCEPTOR(INTMAX_T, strtoimax, const char *nptr, char **endptr, int base) {
INTERCEPTOR(INTMAX_T, strtoumax, const char *nptr, char **endptr, int base) {
INTERCEPTOR(SIZE_T, mbstowcs, wchar_t *dest, const char *src, SIZE_T len) {
INTERCEPTOR(SIZE_T, mbsrtowcs, wchar_t *dest, const char **src, SIZE_T len,
INTERCEPTOR(SIZE_T, mbsnrtowcs, wchar_t *dest, const char **src, SIZE_T nms,
INTERCEPTOR(SIZE_T, wcstombs, char *dest, const wchar_t *src, SIZE_T len) {
INTERCEPTOR(SIZE_T, wcsrtombs, char *dest, const wchar_t **src, SIZE_T len,
INTERCEPTOR(SIZE_T, wcsnrtombs, char *dest, const wchar_t **src, SIZE_T nms,
INTERCEPTOR(SIZE_T, wcrtomb, char *dest, wchar_t src, void *ps) {
INTERCEPTOR(int, tcgetattr, int fd, void *termios_p) {
INTERCEPTOR(char *, realpath, const char *path, char *resolved_path) {
INTERCEPTOR(char *, canonicalize_file_name, const char *path) {
INTERCEPTOR(SIZE_T, confstr, int name, char *buf, SIZE_T len) {
INTERCEPTOR(int, sched_getaffinity, int pid, SIZE_T cpusetsize, void *mask) {
INTERCEPTOR(int, sched_getparam, int pid, void *param) {
INTERCEPTOR(char *, strerror, int errnum) {
INTERCEPTOR(int, strerror_r, int errnum, char *buf, SIZE_T buflen) {
INTERCEPTOR(char *, strerror_r, int errnum, char *buf, SIZE_T buflen) {
INTERCEPTOR(int, __xpg_strerror_r, int errnum, char *buf, SIZE_T buflen) {
INTERCEPTOR(int, scandir, char *dirp, __sanitizer_dirent ***namelist,
INTERCEPTOR(int, scandir64, char *dirp, __sanitizer_dirent64 ***namelist,
INTERCEPTOR(int, getgroups, int size, u32 *lst) {
INTERCEPTOR(int, poll, __sanitizer_pollfd *fds, __sanitizer_nfds_t nfds,
INTERCEPTOR(int, ppoll, __sanitizer_pollfd *fds, __sanitizer_nfds_t nfds,
INTERCEPTOR(int, wordexp, char *s, __sanitizer_wordexp_t *p, int flags) {
INTERCEPTOR(int, sigwait, __sanitizer_sigset_t *set, int *sig) {
INTERCEPTOR(int, sigwaitinfo, __sanitizer_sigset_t *set, void *info) {
INTERCEPTOR(int, sigtimedwait, __sanitizer_sigset_t *set, void *info,
INTERCEPTOR(int, sigemptyset, __sanitizer_sigset_t *set) {
INTERCEPTOR(int, sigfillset, __sanitizer_sigset_t *set) {
INTERCEPTOR(int, sigpending, __sanitizer_sigset_t *set) {
INTERCEPTOR(int, sigprocmask, int how, __sanitizer_sigset_t *set,
INTERCEPTOR(int, backtrace, void **buffer, int size) {
INTERCEPTOR(char **, backtrace_symbols, void **buffer, int size) {
INTERCEPTOR(void, _exit, int status) {
INTERCEPTOR(int, pthread_mutex_lock, void *m) {
INTERCEPTOR(int, pthread_mutex_unlock, void *m) {
INTERCEPTOR(__sanitizer_mntent *, getmntent, void *fp) {
INTERCEPTOR(__sanitizer_mntent *, getmntent_r, void *fp,
INTERCEPTOR(int, statfs, char *path, void *buf) {
INTERCEPTOR(int, fstatfs, int fd, void *buf) {
INTERCEPTOR(int, statfs64, char *path, void *buf) {
INTERCEPTOR(int, fstatfs64, int fd, void *buf) {
INTERCEPTOR(int, statvfs, char *path, void *buf) {
INTERCEPTOR(int, fstatvfs, int fd, void *buf) {
INTERCEPTOR(int, statvfs64, char *path, void *buf) {
INTERCEPTOR(int, fstatvfs64, int fd, void *buf) {
INTERCEPTOR(int, initgroups, char *user, u32 group) {
INTERCEPTOR(char *, ether_ntoa, __sanitizer_ether_addr *addr) {
INTERCEPTOR(__sanitizer_ether_addr *, ether_aton, char *buf) {
INTERCEPTOR(int, ether_ntohost, char *hostname, __sanitizer_ether_addr *addr) {
INTERCEPTOR(int, ether_hostton, char *hostname, __sanitizer_ether_addr *addr) {
INTERCEPTOR(int, ether_line, char *line, __sanitizer_ether_addr *addr,
INTERCEPTOR(char *, ether_ntoa_r, __sanitizer_ether_addr *addr, char *buf) {
INTERCEPTOR(__sanitizer_ether_addr *, ether_aton_r, char *buf,
INTERCEPTOR(int, shmctl, int shmid, int cmd, void *buf) {
INTERCEPTOR(int, random_r, void *buf, u32 *result) {
INTERCEPTOR_PTHREAD_ATTR_GET(detachstate, sizeof(int))
INTERCEPTOR_PTHREAD_ATTR_GET(guardsize, sizeof(SIZE_T))
INTERCEPTOR_PTHREAD_ATTR_GET(schedparam, struct_sched_param_sz)
INTERCEPTOR_PTHREAD_ATTR_GET(schedpolicy, sizeof(int))
INTERCEPTOR_PTHREAD_ATTR_GET(scope, sizeof(int))
INTERCEPTOR_PTHREAD_ATTR_GET(stacksize, sizeof(SIZE_T))
INTERCEPTOR(int, pthread_attr_getstack, void *attr, void **addr, SIZE_T *size) {
INTERCEPTOR_PTHREAD_ATTR_GET(inheritsched, sizeof(int))
INTERCEPTOR(int, pthread_attr_getaffinity_np, void *attr, SIZE_T cpusetsize,
INTERCEPTOR_PTHREAD_MUTEXATTR_GET(pshared, sizeof(int))
INTERCEPTOR_PTHREAD_MUTEXATTR_GET(type, sizeof(int))
INTERCEPTOR_PTHREAD_MUTEXATTR_GET(protocol, sizeof(int))
INTERCEPTOR_PTHREAD_MUTEXATTR_GET(prioceiling, sizeof(int))
INTERCEPTOR_PTHREAD_MUTEXATTR_GET(robust, sizeof(int))
INTERCEPTOR_PTHREAD_MUTEXATTR_GET(robust_np, sizeof(int))
INTERCEPTOR_PTHREAD_RWLOCKATTR_GET(pshared, sizeof(int))
INTERCEPTOR_PTHREAD_RWLOCKATTR_GET(kind_np, sizeof(int))
INTERCEPTOR_PTHREAD_CONDATTR_GET(pshared, sizeof(int))
INTERCEPTOR_PTHREAD_CONDATTR_GET(clock, sizeof(int))
INTERCEPTOR_PTHREAD_BARRIERATTR_GET(pshared, sizeof(int)) // !mac !android
INTERCEPTOR(char *, tmpnam, char *s) {
INTERCEPTOR(char *, tmpnam_r, char *s) {
INTERCEPTOR(int, ttyname_r, int fd, char *name, SIZE_T namesize) {
INTERCEPTOR(char *, tempnam, char *dir, char *pfx) {
INTERCEPTOR(int, pthread_setname_np, uptr thread, const char *name) {
INTERCEPTOR(void, sincos, double x, double *sin, double *cos) {
INTERCEPTOR(void, sincosf, float x, float *sin, float *cos) {
INTERCEPTOR(void, sincosl, long double x, long double *sin, long double *cos) {
INTERCEPTOR(double, remquo, double x, double y, int *quo) {
INTERCEPTOR(float, remquof, float x, float y, int *quo) {
INTERCEPTOR(long double, remquol, long double x, long double y, int *quo) {
INTERCEPTOR(double, lgamma, double x) {
INTERCEPTOR(float, lgammaf, float x) {
INTERCEPTOR(long double, lgammal, long double x) {
INTERCEPTOR(double, lgamma_r, double x, int *signp) {
INTERCEPTOR(float, lgammaf_r, float x, int *signp) {
INTERCEPTOR(long double, lgammal_r, long double x, int *signp) {
INTERCEPTOR(int, drand48_r, void *buffer, double *result) {
INTERCEPTOR(int, lrand48_r, void *buffer, long *result) {
INTERCEPTOR(int, rand_r, unsigned *seedp) {
INTERCEPTOR(SSIZE_T, getline, char **lineptr, SIZE_T *n, void *stream) {
INTERCEPTOR(SSIZE_T, __getdelim, char **lineptr, SIZE_T *n, int delim,
INTERCEPTOR(SSIZE_T, getdelim, char **lineptr, SIZE_T *n, int delim,
INTERCEPTOR(SIZE_T, iconv, void *cd, char **inbuf, SIZE_T *inbytesleft,
INTERCEPTOR(__sanitizer_clock_t, times, void *tms) {
INTERCEPTOR(void *, __tls_get_addr, void *arg) {
INTERCEPTOR(uptr, __tls_get_addr_internal, void *arg) {
INTERCEPTOR(SSIZE_T, listxattr, const char *path, char *list, SIZE_T size) {
INTERCEPTOR(SSIZE_T, llistxattr, const char *path, char *list, SIZE_T size) {
INTERCEPTOR(SSIZE_T, flistxattr, int fd, char *list, SIZE_T size) {
INTERCEPTOR(SSIZE_T, getxattr, const char *path, const char *name, char *value,
INTERCEPTOR(SSIZE_T, lgetxattr, const char *path, const char *name, char *value,
INTERCEPTOR(SSIZE_T, fgetxattr, int fd, const char *name, char *value,
INTERCEPTOR(int, getresuid, void *ruid, void *euid, void *suid) {
INTERCEPTOR(int, getresgid, void *rgid, void *egid, void *sgid) {
INTERCEPTOR(int, getifaddrs, __sanitizer_ifaddrs **ifap) {
INTERCEPTOR(char *, if_indextoname, unsigned int ifindex, char* ifname) {
INTERCEPTOR(unsigned int, if_nametoindex, const char* ifname) {
INTERCEPTOR(int, capget, void *hdrp, void *datap) {
INTERCEPTOR(int, capset, void *hdrp, const void *datap) {
INTERCEPTOR(void *, __aeabi_memmove, void *to, const void *from, uptr size) {
INTERCEPTOR(void *, __aeabi_memmove4, void *to, const void *from, uptr size) {
INTERCEPTOR(void *, __aeabi_memmove8, void *to, const void *from, uptr size) {
INTERCEPTOR(void *, __aeabi_memcpy, void *to, const void *from, uptr size) {
INTERCEPTOR(void *, __aeabi_memcpy4, void *to, const void *from, uptr size) {
INTERCEPTOR(void *, __aeabi_memcpy8, void *to, const void *from, uptr size) {
INTERCEPTOR(void *, __aeabi_memset, void *block, uptr size, int c) {
INTERCEPTOR(void *, __aeabi_memset4, void *block, uptr size, int c) {
INTERCEPTOR(void *, __aeabi_memset8, void *block, uptr size, int c) {
INTERCEPTOR(void *, __aeabi_memclr, void *block, uptr size) {
INTERCEPTOR(void *, __aeabi_memclr4, void *block, uptr size) {
INTERCEPTOR(void *, __aeabi_memclr8, void *block, uptr size) {
INTERCEPTOR(void *, __bzero, void *block, uptr size) {
INTERCEPTOR(int, ftime, __sanitizer_timeb *tp) {
INTERCEPTOR(void, xdrmem_create, __sanitizer_XDR *xdrs, uptr addr,
INTERCEPTOR(void, xdrstdio_create, __sanitizer_XDR *xdrs, void *file, int op) {
INTERCEPTOR(int, xdr_bytes, __sanitizer_XDR *xdrs, char **p, unsigned *sizep,
INTERCEPTOR(int, xdr_string, __sanitizer_XDR *xdrs, char **p,
INTERCEPTOR(void *, tsearch, void *key, void **rootp,
INTERCEPTOR(int, __uflow, __sanitizer_FILE *fp) {
INTERCEPTOR(int, __underflow, __sanitizer_FILE *fp) {
INTERCEPTOR(int, __overflow, __sanitizer_FILE *fp, int ch) {
INTERCEPTOR(int, __wuflow, __sanitizer_FILE *fp) {
INTERCEPTOR(int, __wunderflow, __sanitizer_FILE *fp) {
INTERCEPTOR(int, __woverflow, __sanitizer_FILE *fp, int ch) {
INTERCEPTOR(__sanitizer_FILE *, fopen, const char *path, const char *mode) {
INTERCEPTOR(__sanitizer_FILE *, fdopen, int fd, const char *mode) {
INTERCEPTOR(__sanitizer_FILE *, freopen, const char *path, const char *mode,
INTERCEPTOR(__sanitizer_FILE *, fopen64, const char *path, const char *mode) {
INTERCEPTOR(__sanitizer_FILE *, freopen64, const char *path, const char *mode,
INTERCEPTOR(__sanitizer_FILE *, open_memstream, char **ptr, SIZE_T *sizeloc) {
INTERCEPTOR(__sanitizer_FILE *, open_wmemstream, wchar_t **ptr,
INTERCEPTOR(__sanitizer_FILE *, fmemopen, void *buf, SIZE_T size,
INTERCEPTOR(int, _obstack_begin_1, __sanitizer_obstack *obstack, int sz,
INTERCEPTOR(int, _obstack_begin, __sanitizer_obstack *obstack, int sz,
INTERCEPTOR(void, _obstack_newchunk, __sanitizer_obstack *obstack, int length) {
INTERCEPTOR(int, fflush, __sanitizer_FILE *fp) {
INTERCEPTOR(int, fclose, __sanitizer_FILE *fp) {
INTERCEPTOR(void*, dlopen, const char *filename, int flag) {
INTERCEPTOR(int, dlclose, void *handle) {
INTERCEPTOR(char *, getpass, const char *prompt) {
INTERCEPTOR(int, timerfd_settime, int fd, int flags, void *new_value,
INTERCEPTOR(int, timerfd_gettime, int fd, void *curr_value) {
INTERCEPTOR(int, mlock, const void *addr, uptr len) {
INTERCEPTOR(int, munlock, const void *addr, uptr len) {
INTERCEPTOR(int, mlockall, int flags) {
INTERCEPTOR(int, munlockall, void) {
INTERCEPTOR(__sanitizer_FILE *, fopencookie, void *cookie, const char *mode,
INTERCEPTOR(int, sem_init, __sanitizer_sem_t *s, int pshared, unsigned value) {
INTERCEPTOR(int, sem_destroy, __sanitizer_sem_t *s) {
INTERCEPTOR(int, sem_wait, __sanitizer_sem_t *s) {
INTERCEPTOR(int, sem_trywait, __sanitizer_sem_t *s) {
INTERCEPTOR(int, sem_timedwait, __sanitizer_sem_t *s, void *abstime) {
INTERCEPTOR(int, sem_post, __sanitizer_sem_t *s) {
INTERCEPTOR(int, sem_getvalue, __sanitizer_sem_t *s, int *sval) {
INTERCEPTOR(int, pthread_setcancelstate, int state, int *oldstate) {
INTERCEPTOR(int, pthread_setcanceltype, int type, int *oldtype) {
INTERCEPTOR(int, mincore, void *addr, uptr length, unsigned char *vec) {
INTERCEPTOR(SSIZE_T, process_vm_readv, int pid, __sanitizer_iovec *local_iov,
INTERCEPTOR(SSIZE_T, process_vm_writev, int pid, __sanitizer_iovec *local_iov,
INTERCEPTOR(char *, ctermid, char *s) {
INTERCEPTOR(char *, ctermid_r, char *s) {
INTERCEPTOR(SSIZE_T, recv, int fd, void *buf, SIZE_T len, int flags) {
INTERCEPTOR(SSIZE_T, recvfrom, int fd, void *buf, SIZE_T len, int flags,
INTERCEPTOR(SSIZE_T, send, int fd, void *buf, SIZE_T len, int flags) {
INTERCEPTOR(SSIZE_T, sendto, int fd, void *buf, SIZE_T len, int flags,
INTERCEPTOR(int, eventfd_read, int fd, u64 *value) {
INTERCEPTOR(int, eventfd_write, int fd, u64 value) {
INTERCEPTOR(int, stat, const char *path, void *buf) {
INTERCEPTOR(int, __xstat, int version, const char *path, void *buf) {
INTERCEPTOR(int, __xstat64, int version, const char *path, void *buf) {
INTERCEPTOR(int, __lxstat, int version, const char *path, void *buf) {
INTERCEPTOR(int, __lxstat64, int version, const char *path, void *buf) {
INTERCEPTOR(void *, getutent, int dummy) {
INTERCEPTOR(void *, getutid, void *ut) {
INTERCEPTOR(void *, getutline, void *ut) {
INTERCEPTOR(void *, getutxent, int dummy) {
INTERCEPTOR(void *, getutxid, void *ut) {
INTERCEPTOR(void *, getutxline, void *ut) {
INTERCEPTOR(int, getloadavg, double *loadavg, int nelem) {
INTERCEPTOR(int, mcheck, void (*abortfunc)(int mstatus)) {
INTERCEPTOR(int, mcheck_pedantic, void (*abortfunc)(int mstatus)) {
INTERCEPTOR(int, mprobe, void *ptr) {
INTERCEPTOR(SIZE_T, wcslen, const wchar_t *s) {
INTERCEPTOR(SIZE_T, wcsnlen, const wchar_t *s, SIZE_T n) {
INTERCEPTOR(wchar_t *, wcscat, wchar_t *dst, const wchar_t *src) {
INTERCEPTOR(wchar_t *, wcsncat, wchar_t *dst, const wchar_t *src, SIZE_T n) {
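
To get a rough idea of how the grepped list above would fall into such components, the names can be bucketed with a few patterns. The sketch below is only an illustration: the component names follow the suggestion in this comment, but the matching rules are crude guesses made up here and are not part of the linked patch.

```python
import re
from collections import defaultdict

# Extracts the function name from INTERCEPTOR(ret, name, ...) and
# INTERCEPTOR_WITH_SUFFIX(ret, name, ...) lines. Macro-generated getters
# such as INTERCEPTOR_PTHREAD_ATTR_GET(...) do not match and are skipped.
NAME_RE = re.compile(r"^INTERCEPTOR(?:_WITH_SUFFIX)?\([^,]+,\s*(\w+)")

# HYPOTHETICAL grouping rules, checked in order; the first match wins.
RULES = [
    ("stdio", re.compile(r"printf|scanf|^f(open|close|flush|read|write)")),
    ("time", re.compile(r"time|^clock_")),
    ("thread", re.compile(r"^pthread_|^sem_")),
    ("math", re.compile(r"^(frexp|modf|sincos|remquo|lgamma)")),
    ("string", re.compile(r"^(str|mem|wcs)")),
]


def classify(name):
    """Return the first component whose pattern matches, else 'other'."""
    for component, pattern in RULES:
        if pattern.search(name):
            return component
    return "other"


def group_interceptors(lines):
    """Map component name to the list of interceptor names assigned to it."""
    groups = defaultdict(list)
    for line in lines:
        m = NAME_RE.match(line)
        if not m:
            continue
        name = m.group(1)
        groups[classify(name)].append(name)
    return dict(groups)


# A few lines copied from the list above.
sample = [
    "INTERCEPTOR(SIZE_T, strlen, const char *s) {",
    "INTERCEPTOR(int, vsnprintf, char *str, SIZE_T size, const char *format,",
    "INTERCEPTOR(char *, strptime, char *s, char *format, __sanitizer_tm *tm) {",
    "INTERCEPTOR(long double, frexpl, long double x, int *exp) {",
    "INTERCEPTOR_WITH_SUFFIX(int, wait, int *status) {",
    "INTERCEPTOR(int, pthread_mutex_lock, void *m) {",
    "INTERCEPTOR_PTHREAD_ATTR_GET(detachstate, sizeof(int))",
]
for component, names in sorted(group_interceptors(sample).items()):
    print(component, names)
```

Counting the entries per bucket on the full list would show whether the proposed sub-files end up roughly balanced, which matters if the goal is to compile them in parallel.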
Comment 21 rguenther@suse.de 2018-04-04 13:12:37 UTC
On Wed, 4 Apr 2018, marxin at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402
> 
> --- Comment #20 from Martin Liška <marxin at gcc dot gnu.org> ---
> For the libsanitizer/*/*_interceptors I make a quick patch:
> https://github.com/marxin/gcc/commit/5ce658230db567474997fa411f23ac78366487ce
> which basically splits asan_interceptors.cc and
> sanitizer_common_interceptors.inc and moves implementation of string functions
> to a separate compile unit.
> This shrinks time from 38->34s for asan_interceptors.cc being built with
> enabled checking stage1 compiler.
> 
> I believe splitting the interceptors to couple of logical sub-files will make
> it very fast. List of interceptors grepped from
> sanitizer_common_interceptors.inc:
> I can imagine splitting that to components like string, stdio, time, process,
> thread, math,..

The question is of course _why_ it is this slow.  It's not that this
is 10000s of functions or very large ones...
Comment 22 Martin Liška 2018-04-04 13:17:40 UTC
(In reply to rguenther@suse.de from comment #21)
> On Wed, 4 Apr 2018, marxin at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402
> > 
> > --- Comment #20 from Martin Liška <marxin at gcc dot gnu.org> ---
> > For the libsanitizer/*/*_interceptors I make a quick patch:
> > https://github.com/marxin/gcc/commit/5ce658230db567474997fa411f23ac78366487ce
> > which basically splits asan_interceptors.cc and
> > sanitizer_common_interceptors.inc and moves implementation of string functions
> > to a separate compile unit.
> > This shrinks time from 38->34s for asan_interceptors.cc being built with
> > enabled checking stage1 compiler.
> > 
> > I believe splitting the interceptors to couple of logical sub-files will make
> > it very fast. List of interceptors grepped from
> > sanitizer_common_interceptors.inc:
> > I can imagine splitting that to components like string, stdio, time, process,
> > thread, math,..
> 
> The question is of course _why_ it is this slow.  It's not that this
> is 10000s of functions or very large ones...

It's analyzed here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78288
Comment 23 Martin Liška 2018-04-26 07:51:25 UTC
I can easily split insn-emit.c. Once we know which way the split should be done, I can prepare a patch for that.
Comment 24 Eric Gallager 2018-09-07 03:23:59 UTC
(In reply to Martin Liška from comment #23)
> I can easily split insn-emit.c. Once we know which way the split should be
> done, I can prepare a patch for that.

Confirmed, please do this!
Comment 25 Martin Liška 2018-09-10 11:49:11 UTC
Let me assign it.
Comment 26 Giuliano Belinassi 2019-02-07 14:02:16 UTC
Created attachment 45630 [details]
make -j 64 all-gcc, with --disable-bootstrap, on 64-cores. Blue means dependency to gimple-match.

Since gimple-match.c takes so long to compile, I was wondering if it might be possible to reorder the compilation so we can push its compilation early in the dependency graph.

I did the following steps: 
 1) 'configure --disable-bootstrap'
 2) 'make -j 64 all-gcc'
 3) 'make clean'. 
 4) 'make gimple-match.o' using a wrapper[1] that I created to log all files required by gimple-match, and plotted the attached graphic. Here, blue means dependency and the largest bar is the 'gimple-match.c' itself.

I used a 64-core AMD Opteron 6376 in the process.

Any ideas?

[1] https://github.com/giulianobelinassi/gcc-timer-analysis
Comment 27 Martin Liška 2019-02-07 14:28:39 UTC
> Since gimple-match.c takes so long to compile, I was wondering if it might
> be possible to reorder the compilation so we can push its compilation early
> in the dependency graph.

No, the proper fix would be to split the generated files and compile them in parallel. Similarly for all the insn-*.c generated files. That would be the proper fix.

Anyway, I like the graph you made :)
Comment 28 Segher Boessenkool 2019-02-07 15:10:05 UTC
But what version of GCC is this graph, with what exact configuration?
Comment 29 Giuliano Belinassi 2019-02-07 16:17:18 UTC
> No, the proper fix would be to split the generated files and compile them in parallel. Similarly for all the insn-*.c generated files. That would be the proper fix.

Indeed. However, I am working on parallelizing the compilation with threads. This may lead to a solution, but may not be the best for this scenario.

> Anyway, I like the graph you made :)

Thank you.

> But what version of GCC is this graph, with what exact configuration?

* This is the gcc that I used to build: *

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/8/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 8.2.0-14' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-8 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 8.2.0 (Debian 8.2.0-14) 

* The gcc that I built: *

Using built-in specs.
COLLECT_GCC=./xgcc
Target: x86_64-pc-linux-gnu
Configured with: /home/giulianob/gcc_svn/trunk//configure --disable-checking --disable-bootstrap
Thread model: posix
gcc version 9.0.1 20190205 (experimental) (GCC)
Comment 30 Martin Liška 2019-05-07 12:09:48 UTC
A possible solution can be usage of '-flinker-output=nolto-rel -r' for huge files.
Comment 31 Eric Gallager 2019-11-07 05:38:10 UTC
I think this came up at Cauldron, but I forget what exactly people said about it...
Comment 32 Giuliano Belinassi 2019-11-07 13:58:28 UTC
(In reply to Eric Gallager from comment #31)
> I think this came up at Cauldron, but I forget what exactly people said
> about it...

Actually this PR comes before Cauldron 2019. One way to fix this issue is to make the match.pd parser output several smaller gimple-match.c files and add these to the Makefile, then repeat this procedure for other big files.

Another solution is to parallelize GCC internals and make GCC communicate with Make somehow so that when a CPU is idle, it starts compiling some files in parallel.
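
A minimal sketch of the first approach, assuming the generator knows roughly how large each emitted function is. It assigns functions to a fixed number of output files with a greedy largest-first heuristic so that the files end up similar in size. The function names and line counts are hypothetical, and this is not a description of how genmatch actually works.

```python
import heapq

def partition_functions(sizes, parts):
    """Greedy largest-first: put each function into the currently lightest shard."""
    heap = [(0, i) for i in range(parts)]  # (total size so far, shard index)
    heapq.heapify(heap)
    shards = [[] for _ in range(parts)]
    # Sort by descending size, then by name so the result is deterministic.
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (total + size, idx))
    return shards

# Hypothetical function names and line counts, for illustration only.
example_sizes = {
    "gimple_simplify_1": 900,
    "gimple_simplify_2": 700,
    "gimple_simplify_3": 400,
    "gimple_simplify_4": 300,
    "gimple_simplify_5": 200,
}
shards = partition_functions(example_sizes, 2)
for i, names in enumerate(shards, start=1):
    print(f"gimple-match-{i}.c:", names)
```

Each shard would then be a separate Makefile target, so make can compile them concurrently.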
Comment 33 Jakub Jelinek 2020-05-07 11:56:20 UTC
GCC 10.1 has been released.
Comment 34 Eric Gallager 2020-05-07 22:59:15 UTC
(In reply to Giuliano Belinassi from comment #32)
> (In reply to Eric Gallager from comment #31)
> > I think this came up at Cauldron, but I forget what exactly people said
> > about it...
> 
> Actually this PR comes before Cauldron 2019. 

By "came up" I meant simply that it was mentioned, not that that was where it originated...
Comment 35 jojo 2020-07-09 09:44:27 UTC
(In reply to Martin Liška from comment #30)
> A possible solution can be usage of '-flinker-output=nolto-rel -r' for huge
> files.

Is it useful for splitting huge files?
Comment 36 Martin Liška 2020-07-09 10:04:41 UTC
(In reply to jojo from comment #35)
> (In reply to Martin Liška from comment #30)
> > A possible solution can be usage of '-flinker-output=nolto-rel -r' for huge
> > files.
> 
> Is it useful for splitting huge files?

Here's an experiment I did:

$ time g++ -O2 /tmp/gimple-match.ii -c

real    0m35.790s
user    0m35.490s
sys    0m0.268s

$ time g++ -O2 /tmp/gimple-match.ii -c -flto

real    0m8.138s
user    0m7.915s
sys    0m0.202s

$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o  -r -o gimple-match2.o

real    0m9.087s
user    1m56.028s
sys    0m3.292s

$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o  -r -o gimple-match2.o --param lto-partitions=8

real    0m7.350s
user    0m48.548s
sys    0m0.976s

$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o  -r -o gimple-match2.o --param lto-partitions=4

real    0m9.847s
user    0m30.462s
sys    0m0.392s

so for N==4 we get to 8+10s = 18s (compared to the original 36s). And total user time is 30+8, which is comparable
to the original 36s.
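
The arithmetic can be redone for all three partial-link runs with a short script. The timings are copied from the runs above; the "default" key is just a label for the run without an explicit --param lto-partitions, and the assumption is that the -flto compile and the partial link run one after the other.

```python
def seconds(minutes, secs):
    return minutes * 60 + secs

# Measurements copied from the runs above (real = wall clock, user = CPU time).
baseline_real = seconds(0, 35.790)
baseline_user = seconds(0, 35.490)
lto_compile_real = seconds(0, 8.138)
lto_compile_user = seconds(0, 7.915)
partial_link = {
    "default": (seconds(0, 9.087), seconds(1, 56.028)),
    8: (seconds(0, 7.350), seconds(0, 48.548)),
    4: (seconds(0, 9.847), seconds(0, 30.462)),
}

def summarize(partitions):
    """Return (wall, cpu) for the -flto compile followed by the partial link."""
    link_real, link_user = partial_link[partitions]
    return lto_compile_real + link_real, lto_compile_user + link_user

for key in partial_link:
    wall, cpu = summarize(key)
    print(f"lto-partitions={key}: wall {wall:.1f}s "
          f"({baseline_real / wall:.2f}x faster), "
          f"cpu {cpu:.1f}s ({cpu / baseline_user:.2f}x baseline)")
```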
Comment 37 Richard Biener 2020-07-09 11:40:45 UTC
(In reply to Martin Liška from comment #36)
> (In reply to jojo from comment #35)
> > (In reply to Martin Liška from comment #30)
> > > A possible solution can be usage of '-flinker-output=nolto-rel -r' for huge
> > > files.
> > 
> > Is it useful for splitting huge files?
> 
> Here's an experiment I did:
> 
> $ time g++ -O2 /tmp/gimple-match.ii -c
> 
> real    0m35.790s
> user    0m35.490s
> sys    0m0.268s
> 
> $ time g++ -O2 /tmp/gimple-match.ii -c -flto
> 
> real    0m8.138s
> user    0m7.915s
> sys    0m0.202s
> 
> $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o  -r -o
> gimple-match2.o
> 
> real    0m9.087s
> user    1m56.028s
> sys    0m3.292s
> 
> $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o  -r -o
> gimple-match2.o --param lto-partitions=8
> 
> real    0m7.350s
> user    0m48.548s
> sys    0m0.976s
> 
> $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o  -r -o
> gimple-match2.o --param lto-partitions=4
> 
> real    0m9.847s
> user    0m30.462s
> sys    0m0.392s
> 
> so for N==4 we get to 8+10s = 18s (compared to the original 36s). And total
> user time is 30+8, which is comparable
> to the original 36s.

The GSoC parallelism project this year is supposed to replicate this
in a cheaper way and also develop some magic to automatically trigger
it when it seems profitable.
Comment 38 jojo 2020-07-13 05:51:49 UTC
(In reply to Martin Liška from comment #36)
> (In reply to jojo from comment #35)
> > (In reply to Martin Liška from comment #30)
> > > A possible solution can be usage of '-flinker-output=nolto-rel -r' for huge
> > > files.
> > 
> > Is it useful for splitting huge files?
> 
> Here's an experiment I did:
> 
> $ time g++ -O2 /tmp/gimple-match.ii -c
> 
> real    0m35.790s
> user    0m35.490s
> sys    0m0.268s
> 
> $ time g++ -O2 /tmp/gimple-match.ii -c -flto
> 
> real    0m8.138s
> user    0m7.915s
> sys    0m0.202s
> 
> $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o  -r -o
> gimple-match2.o
> 
> real    0m9.087s
> user    1m56.028s
> sys    0m3.292s
> 
> $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o  -r -o
> gimple-match2.o --param lto-partitions=8
> 
> real    0m7.350s
> user    0m48.548s
> sys    0m0.976s
> 
> $ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o  -r -o
> gimple-match2.o --param lto-partitions=4
> 
> real    0m9.847s
> user    0m30.462s
> sys    0m0.392s
> 
> so for N==4 we get to 8+10s = 18s (compared to the original 36s). And total
> user time is 30+8, which is comparable
> to the original 36s.

It looks like only a small cost reduction for a huge file such as insn-emit.c...
I want to use a shell tool like 'csplit' to split it and compile the pieces in parallel.
Comment 39 Richard Biener 2020-07-23 06:51:46 UTC
GCC 10.2 is released, adjusting target milestone.
Comment 40 Richard Biener 2021-04-08 12:02:28 UTC
GCC 10.3 is being released, retargeting bugs to GCC 10.4.
Comment 41 Andrew Pinski 2021-07-19 06:19:46 UTC
Latest discussion of this can also be found at:
https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571555.html
Comment 42 Eric Gallager 2021-10-09 12:58:00 UTC
Is this just about parallelism bottlenecks for the main build target (e.g. just `make` or `make all`), or does it apply to other Makefile targets, too? (e.g. the testsuite via `make check`, or docs via `make pdf` or something)
Comment 43 Martin Liška 2021-10-11 08:01:32 UTC
(In reply to Eric Gallager from comment #42)
> Is this just about parallelism bottlenecks for the main build target (e.g.
> just `make` or `make all`), or does it apply to other Makefile targets, too?
> (e.g. the testsuite via `make check`, or docs via `make pdf` or something)

Well, it was intended to cover only the main build, which pdf can be seen as part of. On the other hand, `make check` should belong to a different PR if you have troubles with it.
Comment 44 Eric Gallager 2021-10-11 18:10:59 UTC
(In reply to Martin Liška from comment #43)
> (In reply to Eric Gallager from comment #42)
> > Is this just about parallelism bottlenecks for the main build target (e.g.
> > just `make` or `make all`), or does it apply to other Makefile targets, too?
> > (e.g. the testsuite via `make check`, or docs via `make pdf` or something)
> 
> Well, it was intended to cover only the main build, which pdf can be seen as
> part of.

I usually have to run `make pdf` as a separate build target, though, as it doesn't get run as part of the main build for me... and the bottleneck there, for the pdf target, is in libstdc++ for me...
Comment 45 Eric Gallager 2021-11-01 04:56:03 UTC
(In reply to Martin Liška from comment #0)
> [...]
> Then I built GCC with -j1 and used following parser to generate reports:
> https://github.com/marxin/script-misc/blob/master/parse-make-log.py

The new URL for that script is now this, btw: https://github.com/marxin/script-misc/blob/master/legacy/parse-make-log.py
Comment 46 Sam James 2022-05-29 03:44:21 UTC
Even partially making the build less recursive would likely help a fair bit. 

The classic text on this is https://accu.org/journals/overload/14/71/miller_2004/. This doesn't mean that splitting up files is futile, but when watching a build, much of the time, make doesn't even get to traverse into each of the directories, because it doesn't know if it's able to. It can safely be done in stages.

Using includes would let you get a lot of the current state wrt split directories. Could even just have a certain number of toplevel directories but non-recursive within them.
Comment 47 Segher Boessenkool 2022-06-02 22:05:09 UTC
(In reply to Sam James from comment #46)
> Even partially making the build less recursive would likely help a fair bit.

It will help a bit, sure, but not nearly as much as you perhaps hope for.
 
There are quite a few "synchronisation" points where nothing after it can be
done until everything before it has been done.  Partly this is just because
we have a three-stage bootstrap, but also there are some generator programs
that everything else depends on (on its output that is), and those are real
chokepoints.

Also, recursive make is a scourge of humanity, for sure, but fixing this has
to be done in autoxxxx first and foremost.
Comment 48 Martin Liška 2022-11-30 08:13:38 UTC
Created attachment 53989 [details]
CPU utilization of make all-host on recent AMD server

The situation with a recent AMD server is really bad! Having 192 cores, the average CPU utilization of `make all-host` is 6% !
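
To put that figure in perspective, a quick back-of-the-envelope calculation, using only the two numbers quoted above:

```python
def busy_core_equivalent(cores, utilization):
    """Average number of fully busy cores implied by a utilization figure."""
    return cores * utilization

cores = 192          # from the comment above
utilization = 0.06   # 6% average, from the comment above
busy = busy_core_equivalent(cores, utilization)
print(f"{busy:.1f} of {cores} cores busy on average; {cores - busy:.1f} idle")
```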
Comment 49 Martin Liška 2022-11-30 08:23:04 UTC
One more observation I made, apparently we're trying to sort (in Makefile.in) OBJS with the biggest at the very beginning:

# Language-independent object files.
# We put the *-match.o and insn-*.o files first so that a parallel make
# will build them sooner, because they are large and otherwise tend to be
# the last objects to finish building.
OBJS = \
        gimple-match.o \
        generic-match.o \
        insn-attrtab.o \
        insn-automata.o \

That's fine, plus we introduce dependency for all objects to depend on generated_files:

# In order for parallel make to really start compiling the expensive
# objects from $(OBJS) as early as possible, build all their
# prerequisites strictly before all objects.
$(ALL_HOST_OBJS) : | $(generated_files)

Using that, we should see gimple-match.o being spawned very soon, but it's not the case. Imagine you have already built all-host and let's see what happens:

$ rm -f gimple-match.o ; rm -f tree*.o && make -j4 --debug=b libbackend.a 2>&1 | less
...
   File 'gimple-match.o' does not exist.
             Prerequisite 'cs-bconfig.h' is newer than target 'bconfig.h'.
            Must remake target 'bconfig.h'.
             Prerequisite 'cstamp-h' is newer than target 'auto-host.h'.
            Must remake target 'auto-host.h'.
                     Prerequisite 's-options' is newer than target 'optionlist'.
                    Must remake target 'optionlist'.
                 Prerequisite 's-gtyp-input' is newer than target 'gtyp-input.list'.
                Must remake target 'gtyp-input.list'.
                     Prerequisite 's-bversion' is newer than target 'bversion.h'.
                    Must remake target 'bversion.h'.
     Prerequisite 'cs-config.h' is newer than target 'config.h'.
    Must remake target 'config.h'.
...
   File 'tree-vrp.o' does not exist.
   File 'tree.o' does not exist.
     Prerequisite 's-i386-bt' is newer than target 'i386-builtin-types.inc'.
    Must remake target 'i386-builtin-types.inc'.
   File 'gimple-match.o' does not exist.
             Prerequisite 's-modes-h' is newer than target 'insn-modes.h'.
            Must remake target 'insn-modes.h'.
             Prerequisite 's-modes-inline-h' is newer than target 'insn-modes-inline.h'.
            Must remake target 'insn-modes-inline.h'.
                     Prerequisite 's-version' is newer than target 'version.h'.
                    Must remake target 'version.h'.
                 Prerequisite 's-options-h' is newer than target 'options.h'.
                Must remake target 'options.h'.
             Prerequisite 's-genrtl-h' is newer than target 'genrtl.h'.
            Must remake target 'genrtl.h'.
             Prerequisite 's-modes-m' is newer than target 'min-insn-modes.cc'.
            Must remake target 'min-insn-modes.cc'.
...
   File 'gimple-match.o' does not exist.
             Prerequisite 's-gtype' is newer than target 'gtype-desc.h'.
            Must remake target 'gtype-desc.h'.
             Prerequisite 's-constants' is newer than target 'insn-constants.h'.
            Must remake target 'insn-constants.h'.
...
  Must remake target 'tree-affine.o'.
g++  -fno-PIE -c   -g     -DIN_GCC     -fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -Woverloaded-virtual -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -fno-common  -DHAVE_CONFIG_H -I. -I. -I/home/marxin/Programming/gcc/gcc -I/home/marxin/Programming/gcc/gcc/. -I/home/marxin/Programming/gcc/gcc/../include -I/home/marxin/Programming/gcc/gcc/../libcpp/include -I/home/marxin/Programming/gcc/gcc/../libcody  -I/home/marxin/Programming/gcc/gcc/../libdecnumber -I/home/marxin/Programming/gcc/gcc/../libdecnumber/bid -I../libdecnumber -I/home/marxin/Programming/gcc/gcc/../libbacktrace   -o tree-affine.o -MT tree-affine.o -MMD -MP -MF ./.deps/tree-affine.TPo /home/marxin/Programming/gcc/gcc/tree-affine.cc
   File 'tree-call-cdce.o' does not exist.
  Must remake target 'tree-call-cdce.o'.
g++  -fno-PIE -c   -g     -DIN_GCC     -fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -Woverloaded-virtual -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -fno-common  -DHAVE_CONFIG_H -I. -I. -I/home/marxin/Programming/gcc/gcc -I/home/marxin/Programming/gcc/gcc/. -I/home/marxin/Programming/gcc/gcc/../include -I/home/marxin/Programming/gcc/gcc/../libcpp/include -I/home/marxin/Programming/gcc/gcc/../libcody  -I/home/marxin/Programming/gcc/gcc/../libdecnumber -I/home/marxin/Programming/gcc/gcc/../libdecnumber/bid -I../libdecnumber -I/home/marxin/Programming/gcc/gcc/../libbacktrace   -o tree-call-cdce.o -MT tree-call-cdce.o -MMD -MP -MF ./.deps/tree-call-cdce.TPo /home/marxin/Programming/gcc/gcc/tree-call-cdce.cc
   File 'tree-cfg.o' does not exist.
  Must remake target 'tree-cfg.o'.
g++  -fno-PIE -c   -g     -DIN_GCC     -fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -Woverloaded-virtual -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -fno-common  -DHAVE_CONFIG_H -I. -I. -I/home/marxin/Programming/gcc/gcc -I/home/marxin/Programming/gcc/gcc/. -I/home/marxin/Programming/gcc/gcc/../include -I/home/marxin/Programming/gcc/gcc/../libcpp/include -I/home/marxin/Programming/gcc/gcc/../libcody  -I/home/marxin/Programming/gcc/gcc/../libdecnumber -I/home/marxin/Programming/gcc/gcc/../libdecnumber/bid -I../libdecnumber -I/home/marxin/Programming/gcc/gcc/../libbacktrace   -o tree-cfg.o -MT tree-cfg.o -MMD -MP -MF ./.deps/tree-cfg.TPo /home/marxin/Programming/gcc/gcc/tree-cfg.cc
   File 'tree-cfgcleanup.o' does not exist.
  Must remake target 'tree-cfgcleanup.o'.

So gimple-match.o has complex dependencies that are somehow under investigation, and that's why it doesn't start early :/

It's likely related to various '	$(STAMP) $name' we use, if I consider one:

gtyp-input.list: s-gtyp-input ; @true
s-gtyp-input: Makefile
	@: $(call write_entries_to_file,$(GTFILES),tmp-gi.list)
	$(SHELL) $(srcdir)/../move-if-change tmp-gi.list gtyp-input.list
	$(STAMP) s-gtyp-input

Here we touch 's-gtyp-input' after gtyp-input.list is created, and thus gtyp-input.list always needs
to be remade because its dependency s-gtyp-input is newer. Similarly for many other rules:

gimple-match.cc: s-match gimple-match-head.cc ; @true

s-match: build/genmatch$(build_exeext) $(srcdir)/match.pd cfn-operators.pd
	$(RUN_GEN) build/genmatch$(build_exeext) --gimple $(srcdir)/match.pd \
	    > tmp-gimple-match.cc
	$(RUN_GEN) build/genmatch$(build_exeext) --generic $(srcdir)/match.pd \
	    > tmp-generic-match.cc
	$(SHELL) $(srcdir)/../move-if-change tmp-gimple-match.cc \
	    					gimple-match.cc
	$(SHELL) $(srcdir)/../move-if-change tmp-generic-match.cc \
	    					generic-match.cc
	$(STAMP) s-match

Here it's even more complicated. I think s-match should be updated only if generic-match.cc is touched;
otherwise, we again end up with an s-match that is newer than gimple-match.cc.

Can any GNU make expert please judge here? Starting e.g. gimple-match.cc early would really help
to speed up the build process.
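
The timestamp relation described for the stamp rules can be reproduced outside of make. The sketch below imitates move-if-change and $(STAMP) on temporary files with explicitly set modification times. It only demonstrates why make reports the stamp as newer than the generated file; it says nothing about how much that costs in scheduling. The helper names are made up for the illustration.

```python
import os
import tempfile

def read(path):
    with open(path) as f:
        return f.read()

def write(path, text, mtime):
    """Create a file and force its modification time."""
    with open(path, "w") as f:
        f.write(text)
    os.utime(path, (mtime, mtime))

def move_if_change(tmp, dest, mtime):
    """Mimic move-if-change: keep dest and its timestamp if contents are identical."""
    if os.path.exists(dest) and read(dest) == read(tmp):
        os.remove(tmp)
        return False
    os.replace(tmp, dest)
    os.utime(dest, (mtime, mtime))
    return True

def needs_remake(target, prerequisite):
    """Timestamp rule: remake if the target is missing or older than the prerequisite."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(prerequisite) > os.path.getmtime(target)

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "gtyp-input.list")
    stamp = os.path.join(d, "s-gtyp-input")
    tmp = os.path.join(d, "tmp-gi.list")
    # First run at t=1000: the list is created, then the stamp is touched.
    write(tmp, "files\n", 1000)
    move_if_change(tmp, target, 1000)
    write(stamp, "", 1001)
    # The rule runs again at t=2000 and produces identical contents.
    write(tmp, "files\n", 2000)
    changed = move_if_change(tmp, target, 2000)
    write(stamp, "", 2001)
    stale = needs_remake(target, stamp)
    print("contents changed:", changed)
    print("stamp newer than target:", stale)
```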
Comment 50 Richard Biener 2022-11-30 08:25:59 UTC
(In reply to Martin Liška from comment #48)
> Created attachment 53989 [details]
> CPU utilization of make all-host on recent AMD server
> 
> The situation with a recent AMD server is really bad! Having 192 cores, the
> average CPU utilization of `make all-host` is 6% !

Just do more builds in parallel!  There's just 903 .o files in gcc/ and
libbackend.a just has 490 of them.  It's not surprising the few larger
files stretch out the compile-time here.  Try LTOing libbackend.a?
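
A small sketch of why a few large files dominate once there are enough cores: no schedule can finish sooner than the longest single job, nor sooner than a perfectly balanced split of the total work. The durations below are hypothetical, not measurements.

```python
def makespan_lower_bound(durations, cores):
    """Best possible wall time: limited by the longest job and by total work / cores."""
    return max(max(durations), sum(durations) / cores)

# Hypothetical durations in seconds: 900 small objects plus three large generated ones.
durations = [2.0] * 900 + [66.0, 40.0, 30.0]
for cores in (8, 32, 192):
    print(cores, "cores:", makespan_lower_bound(durations, cores), "s at best")
```

With these made-up numbers, adding cores beyond a few dozen no longer helps, because the bound is set by the single longest compile.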
Comment 51 Richard Biener 2022-11-30 08:27:13 UTC
(In reply to Martin Liška from comment #49)

[...]

> Can any GNU make expert please judge here? Starting e.g. gimple-match.cc
> early would really help
> to speed up the build process.

this has come up in the past and there's no reliable way to order things
(just use make -j on such machines and overcommit?)
Comment 52 Richard Biener 2022-11-30 08:38:59 UTC
(In reply to Richard Biener from comment #51)
> (In reply to Martin Liška from comment #49)
> 
> [...]
> 
> > Can any GNU make expert please judge here? Starting e.g. gimple-match.cc
> > early would really help
> > to speed up the build process.
> 
> this has come up in the past and there's no reliable way to order things
> (just use make -j on such machines and overcommit?)

Doesn't make a difference to overall time so early starting isn't the issue
it seems.
Comment 53 Martin Liška 2022-11-30 09:10:08 UTC
(In reply to Richard Biener from comment #50)
> (In reply to Martin Liška from comment #48)
> > Created attachment 53989 [details]
> > CPU utilization of make all-host on recent AMD server
> > 
> > The situation with a recent AMD server is really bad! Having 192 cores, the
> > average CPU utilization of `make all-host` is 6% !
> 
> Just do more builds in parallel!

No! I'm speaking about faster edit-build-debug cycles and also about faster builds of gcc packages.

> There's just 903 .o files in gcc/ and
> libbackend.a just has 490 of them.  It's not surprising the few larger
> files stretch out the compile-time here.

Well, gimple-match.o takes ~66s on my new AMD Ryzen 9 5950X CPU :/

> Try LTOing libbackend.a?

Yep, that's our parallel for free approach and I would welcome that, however:


during IPA pass: inline
In member function ‘quick_push’,
    inlined from ‘make_forwarders_with_degenerate_phis’ at /home/marxin/Programming/gcc/gcc/tree-ssa-dce.cc:1848:6:
/home/marxin/Programming/gcc/gcc/vec.h:1958:28: internal compiler error: Segmentation fault
 1958 |   return m_vec->quick_push (obj);
      |                            ^
0x102f987 internal_error(char const*, ...)
	???:0
0x117935b cgraph_node::get_untransformed_body()
	???:0
0x123f6e9 optimize_inline_calls(tree_node*)
	???:0
0x123e4d2 inline_transform(cgraph_node*)
	???:0
0x123da5f execute_all_ipa_transforms(bool)
	???:0
0x15ebe1b cgraph_node::expand()
	???:0
0x15e2f6d symbol_table::compile()
	???:0
0x15d0368 lto_main()
	???:0

I'll isolate that and hope we can add a configure option for LTOed libbackend.a.
Comment 54 Martin Liška 2022-12-01 09:43:16 UTC
> Try LTOing libbackend.a?

So this option is not feasible either: we're paying too high a price for parallel WPA of the LTO, and the resulting time on 32 cores is even slower :/
Comment 55 Martin Liška 2022-12-01 10:01:49 UTC
Created attachment 53995 [details]
make all-host on Ryzen 9
Comment 56 Martin Liška 2022-12-01 10:03:27 UTC
Created attachment 53996 [details]
make all-host on Ryzen 9 with LTO partial linking

Using partial linking for the following 4 objects (gimple-match.o generic-match.o insn-recog.o insn-emit.o), I can speed up the build of all-host by almost 30s (from 145 to 115 seconds).
Comment 57 Martin Liška 2022-12-01 10:07:06 UTC
Created attachment 53997 [details]
Partial linking path
Comment 58 Andrew Carlotti 2023-03-27 14:55:07 UTC
Since November 2021, there's been a significant regression in the compile time for gimple-match.cc during a bootstrap build (+100% in Stage 2, +73% in Stage 3), with this regression accounting for over 20% of the current total bootstrap time on some aarch64 machines.

Most of the change in compile time is due to the following 6 commits (of which one is a performance improvement, and one only regressed the Stage 2 build):

7df89377a7ae3906255e38a79be8e5d962c3a0df 24th November 2021
Enhance optimize_atomic_bit_test_and to handle truncation. (Hongtao Liu)
Stage 2: +27%
Stage 3: +33%

9a53101caadae1b5c8d791d247b05268ee4f7f92 16th May 2022
Add MIN/MAX folding from fold_cond_expr_with_comparison to match.pd (Richard Biener)
Stage 2: +15%
Stage 3: +15%

409978d58dafa689c5b3f85013e2786526160f2c 9th August 2022
tree-optimization/106514 - add --param max-jump-thread-paths (Richard Biener)
Stage 2: -7%
Stage 3: -10%

011d0a033ab370ea38b06b813ac62be8dde0801b 18th August 2022
Make path_range_query standalone and add reset_path. (Aldy Hernandez)
Stage 2: +5%
Stage 3: +0%

4d9db4bdd458a4b526f59e4bc5bbd549d3861cea 12th December 2022
middle-end: simplify complex if expressions where comparisons are inverse of one another. (Tamar Christina)
Stage 2: +10%
Stage 3: +9%

733a1b777f16cd397b43a242d9c31761f66d3da8 13th January 2023
sched-deps: do not schedule pseudos across calls [PR108117] (Alexander Monakov)
Stage 2: +14%
Stage 3: +9%
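
As a sanity check on how much of the total these six commits explain, the per-commit figures can be combined. This assumes each percentage is relative to the compile time just before that commit, so the changes compound multiplicatively. That assumption is not stated above and may not match how the numbers were measured.

```python
# Per-commit changes in percent, in the order listed above: (Stage 2, Stage 3).
changes = [
    (27, 33),
    (15, 15),
    (-7, -10),
    (5, 0),
    (10, 9),
    (14, 9),
]

def compound(percentages):
    """Multiply up relative changes given in percent."""
    factor = 1.0
    for pct in percentages:
        factor *= 1 + pct / 100
    return factor

stage2 = compound(c[0] for c in changes)
stage3 = compound(c[1] for c in changes)
print(f"Stage 2: +{(stage2 - 1) * 100:.0f}% combined (reported total: +100%)")
print(f"Stage 3: +{(stage3 - 1) * 100:.0f}% combined (reported total: +73%)")
```

Under that assumption the six commits come to somewhat less than the reported totals, which fits the "most of the change" wording.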
Comment 59 Martin Liška 2023-03-28 03:01:18 UTC
(In reply to Andrew Carlotti from comment #58)
> Since November 2021, there's been a significant regression in the compile
> time for gimple-match.cc during a bootstrap build (+100% in Stage 2, +73% in
> Stage 3), with this regression accounting for over 20% of the current total
> bootstrap time on some aarch64 machines.

Thanks for the interesting numbers! Yeah, it's very unfortunate :/

> 
> Most of the change in compile time is due to the following 6 commits (of
> which one is a performance improvement, and one only regressed the Stage 2
> build):
> 
> 7df89377a7ae3906255e38a79be8e5d962c3a0df 24th November 2021
> Enhance optimize_atomic_bit_test_and to handle truncation. (Hongtao Liu)
> Stage 2: +27%
> Stage 3: +33%

This one is btw. a known issue PR108129.
Comment 60 Richard Biener 2023-03-28 08:30:41 UTC
(In reply to Martin Liška from comment #59)
> (In reply to Andrew Carlotti from comment #58)
> > Since November 2021, there's been a significant regression in the compile
> > time for gimple-match.cc during a bootstrap build (+100% in Stage 2, +73% in
> > Stage 3), with this regression accounting for over 20% of the current total
> > bootstrap time on some aarch64 machines.
> 
> Thanks for the interesting numbers! Yeah, it's very unfortunate :/
> 
> > 
> > Most of the change in compile time is due to the following 6 commits (of
> > which one is a performance improvement, and one only regressed the Stage 2
> > build):
> > 
> > 7df89377a7ae3906255e38a79be8e5d962c3a0df 24th November 2021
> > Enhance optimize_atomic_bit_test_and to handle truncation. (Hongtao Liu)
> > Stage 2: +27%
> > Stage 3: +33%
> 
> This one is btw. a known issue PR108129.

But the revision only slightly changes the patterns so I'm very curious
how it arrived at 30% slowdown.

Note these (match ..) patterns that are not used from inside match.pd itself
(and do not use other (match ..)) would be perfect candidates to emit
to separate files.  Either by explicit syntax or magically where the former
would be easier to cater for in the Makefile.

The "trivial" improvement of course would be to special-case
iterator uses also for (match ...) like we do for (simplify ...) where
we can delay substitution.
Comment 61 Alexander Monakov 2023-03-28 08:45:00 UTC
(In reply to Richard Biener from comment #60)
> > This one is btw. a known issue PR108129.
> 
> But the revision only slightly changes the patterns so I'm very curious
> how it arrived at 30% slowdown.

It adds an extra 'convert2?' to 'nop_atomic_bit_test_and_p' matchers, and since match.pd expansion works by emitting match subtrees twice for each '?' component, that gives an extra 2x factor to the already bad combinatorial explosion going on in those patterns.
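
The doubling can be illustrated with a small sketch. It assumes, as described above, that every '?'-marked operator is expanded into one variant where it is present and one where it is absent. The pattern shapes are hypothetical and much simpler than the real matchers in match.pd.

```python
from itertools import product

def expand_optionals(ops):
    """Expand a sequence of operators where a trailing '?' marks an optional one.

    Each optional operator is either present or absent, so n optional
    operators yield 2**n concrete variants.
    """
    choices = []
    for op in ops:
        if op.endswith("?"):
            choices.append((op[:-1], None))  # present, or skipped
        else:
            choices.append((op,))
    return [tuple(o for o in combo if o is not None)
            for combo in product(*choices)]

# Hypothetical pattern shapes, for illustration only.
before = ["convert?", "bit_and", "convert1?"]
after = ["convert?", "bit_and", "convert1?", "convert2?"]
print(len(expand_optionals(before)), "variants before")
print(len(expand_optionals(after)), "variants after adding one more optional")
```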

We really need to rework match-and-simplify emission in a smarter way. I've looked at that in January once, but there's a few things I'd need help understanding, such as...

> The "trivial" improvement of course would be to special-case
> iterator uses also for (match ...) like we do for (simplify ...) where
> we can delay substitution.

... this. Is there a short explanation what's 'delayed substitution' in this context?
Comment 62 Jakub Jelinek 2023-03-28 08:54:12 UTC
Looking at gimple-match.cc, the case CFN_BUILT_IN_ATOMIC_FETCH_OR_{1,2,4,8,16}: etc. blocks are identical there, except for the numbers in next_after_fail* label numbers.
So, could we perhaps expand everything the way we do and just when emitting a switch
hash the subtree of the cases to be emitted and if the hashes are equal also compare
and if the subtrees are the same (== would result in the same text being emitted into
the output except for the label numbers) emit multiple cases with the same block?
Admittedly I haven't looked yet at the data structures genmatch.cc uses before emitting
the source, so don't know whether it is feasible.
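
A rough sketch of that idea, operating on already-emitted text rather than on the data structures genmatch.cc uses (which may look quite different). Bodies are hashed after the numbered labels are normalized, then compared in full before being merged. The bodies and the CFN_OTHER label are made up.

```python
import hashlib
import re
from collections import OrderedDict

LABEL_RE = re.compile(r"next_after_fail\d+")

def normalize(body):
    """Drop label numbers so bodies differing only in them compare equal."""
    return LABEL_RE.sub("next_after_fail", body)

def merge_identical_cases(cases):
    """Group case labels whose normalized bodies are identical.

    cases: list of (label, body). Returns a list of (labels, body) groups.
    """
    buckets = OrderedDict()
    for label, body in cases:
        norm = normalize(body)
        digest = hashlib.sha256(norm.encode()).hexdigest()
        bucket = buckets.setdefault(digest, [])
        # Equal hashes are only a hint; compare the text to rule out collisions.
        for entry in bucket:
            if entry["norm"] == norm:
                entry["labels"].append(label)
                break
        else:
            bucket.append({"norm": norm, "labels": [label], "body": body})
    return [(e["labels"], e["body"]) for b in buckets.values() for e in b]

# Made-up bodies for illustration only.
cases = [
    ("CFN_BUILT_IN_ATOMIC_FETCH_OR_1", "if (!ok) goto next_after_fail101;"),
    ("CFN_BUILT_IN_ATOMIC_FETCH_OR_2", "if (!ok) goto next_after_fail102;"),
    ("CFN_OTHER", "return false;"),
]
merged = merge_identical_cases(cases)
for labels, body in merged:
    for label in labels:
        print(f"case {label}:")
    print(f"  {{ {body} }}")
```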
Comment 63 rguenther@suse.de 2023-03-28 09:05:22 UTC
On Tue, 28 Mar 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84402
> 
> Alexander Monakov <amonakov at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |amonakov at gcc dot gnu.org
> 
> --- Comment #61 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #60)
> > > This one is btw. a known issue PR108129.
> > 
> > But the revision only slightly changes the patterns so I'm very curious
> > how it arrived at 30% slowdown.
> 
> It adds an extra 'convert2?' to 'nop_atomic_bit_test_and_p' matchers, and since
> match.pd expansion works by emitting match subtrees twice for each '?'
> component, that gives an extra 2x factor to the already bad combinatorial
> explosion going on in those patterns.
> 
> We really need to rework match-and-simplify emission in a smarter way. I've
> looked at that in January once, but there's a few things I'd need help
> understanding, such as...
> 
> > The "trivial" improvement of course would be to special-case
> > iterator uses also for (match ...) like we do for (simplify ...) where
> > we can delay substitution.
> 
> ... this. Is there a short explanation what's 'delayed substitution' in this
> context?

'delayed substitution' works for (simplify (...)) by not expanding the
substitution for each (for ..) iterator but instead passing it as
variable to a split out common function.

For (match (...)) the "substitution" part is trivial so there's no
point doing that.  But instead we can look to apply something similar
to the "matching" part.  When we have

(for X (A B ...)
 (simplify
  (op (X (op2 ...) ...))
  ...

we get for the matching of 'X' (if it's not at the toplevel)

 switch (...)
 {
 case A:
  {
   .. match the rest ..
  }
 case B:
  {
   .. match the rest ..
  }
...

but we can instead emit (maybe only in a subset of cases?)

 switch (...)
 {
 case A:
 case B:
 case ...:
  {
   .. match the rest ..
  }

in theory we support things like

(for X (plus IFN_POW)
 (...

as both operators are binary, so those are cases we cannot handle this way.

Basically we'd keep the user-defined operator in the AST and adjust
code-generation to deal with that.

I will try to do that.
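
A rough sketch of that code-generation decision, under stated assumptions: the substitutes of a (for ...) iterator are modelled as names tagged with a kind, and a single block is shared only when all substitutes are of the same kind. `op_kind`, `substitute`, `can_share_block` and `emit_cases` are hypothetical names for illustration, not what genmatch.cc uses.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of a (for X (A B ...)) substitute; not the real
// genmatch.cc data structures.
enum class op_kind { tree_code, internal_fn };

struct substitute
{
  std::string name;
  op_kind kind;
};

// Sharing one block is only attempted when every substitute is the same
// kind of operator, so (for X (plus IFN_POW)) is rejected.
bool
can_share_block (const std::vector<substitute> &subs)
{
  if (subs.empty ())
    return false;
  for (const substitute &s : subs)
    if (s.kind != subs.front ().kind)
      return false;
  return true;
}

// Emit the case arms: stacked labels over a single body when possible,
// otherwise one copy of the body per substitute (the existing expansion).
std::string
emit_cases (const std::vector<substitute> &subs, const std::string &body)
{
  std::string out;
  const bool shared = can_share_block (subs);
  for (const substitute &s : subs)
    {
      out += "case " + s.name + ":\n";
      if (!shared)
        out += "  { " + body + " }\n";
    }
  if (shared)
    out += "  { " + body + " }\n";
  return out;
}
```

With two tree codes the body is emitted once; with a tree code mixed with an internal function the sketch falls back to one body per substitute.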
Comment 64 GCC Commits 2023-03-28 11:31:20 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:75cda3be0232f745cda4e177d514f6900390af0b

commit r13-6902-g75cda3be0232f745cda4e177d514f6900390af0b
Author: Richard Biener <rguenther@suse.de>
Date:   Tue Mar 28 12:42:14 2023 +0200

    bootstrap/84402 - improve (match ...) code generation
    
    The following avoids duplicating matching code for (match ...)
    in match.pd when possible.  That's more easily possible for
    (match ...) than simplify because we do not need to handle
    common matches (those would be diagnosed only during compiling)
    nor is the result able to inspect the active operator.
    
    Specifically this reduces the size of the generated matches for
    the atomic ops as noted in PR108129.
    
    gimple-match.cc shrinks from 245k lines to 209k lines with this patch.
    
            PR bootstrap/84402
            PR tree-optimization/108129
            * genmatch.cc (lower_for): For (match ...) delay
            substituting into the match operator if possible.
            (dt_operand::gen_gimple_expr): For user_id look at the
            first substitute for determining how to access operands.
            (dt_operand::gen_generic_expr): Likewise.
            (dt_node::gen_kids): Properly sort user_ids according
            to their substitutes.
            (dt_node::gen_kids_1): Code-generate user_id matching.
Comment 65 GCC Commits 2023-05-05 12:47:49 UTC
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:580cda3c2799b1f8323af770e52f1eb0fa204718

commit r14-496-g580cda3c2799b1f8323af770e52f1eb0fa204718
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Fri May 5 13:35:17 2023 +0100

    match.pd: don't emit label if not needed
    
    This is a small QoL codegen improvement for match.pd to not emit labels when
    they are not needed.  The codegen is nice and there is a small (but consistent)
    improvement in compile time.
    
    gcc/ChangeLog:
    
            PR bootstrap/84402
            * genmatch.cc (dt_simplify::gen_1): Only emit labels if used.
Comment 66 GCC Commits 2023-05-05 12:47:54 UTC
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:e487fcc0f7466ea663a0fea52076337bebd42b8b

commit r14-497-ge487fcc0f7466ea663a0fea52076337bebd42b8b
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Fri May 5 13:36:01 2023 +0100

    match.pd: Remove commented out line pragmas unless -vv is used.
    
    genmatch currently outputs commented out line directives that have no effect
    but the compiler still has to parse only to discard.
    
    They are however handy when debugging genmatch output.  As such this moves them
    behind the -vv flag.
    
    gcc/ChangeLog:
    
            PR bootstrap/84402
            * genmatch.cc (output_line_directive): Only emit commented directive
            when -vv.
Comment 67 GCC Commits 2023-05-05 12:47:59 UTC
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:c0ce29bc1ce329001b6c02bb3d34bcbb086e1b72

commit r14-498-gc0ce29bc1ce329001b6c02bb3d34bcbb086e1b72
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Fri May 5 13:36:43 2023 +0100

    match.pd: CSE the dump output check.
    
    This is a small improvement in QoL codegen for match.pd to save time not
    re-evaluating the condition for printing debug information in every function.
    
    There is a small but consistent runtime and compile time win here.  The runtime
    win comes from not having to do the condition over again, and on Arm platforms
    we now use the new test-and-branch support for booleans to only have a single
    instruction here.
    
    gcc/ChangeLog:
    
            PR bootstrap/84402
            * genmatch.cc (decision_tree::gen, write_predicate): Generate new
            debug_dump var.
            (dt_simplify::gen_1): Use it.
Comment 68 GCC Commits 2023-05-05 12:48:04 UTC
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:27fcf994c5515e1bbf2ff03d28fd2fa927c7e7b5

commit r14-499-g27fcf994c5515e1bbf2ff03d28fd2fa927c7e7b5
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Fri May 5 13:37:49 2023 +0100

    genmatch: split shared code to gimple-match-exports.cc
    
    In preparation for automatically splitting match.pd files I split the
    non-static helper functions that are shared between the match.pd functions off
    to another file.
    
    This file can be compiled in parallel and also allows us to later avoid
    duplicate symbols errors.
    
    gcc/ChangeLog:
    
            PR bootstrap/84402
            * Makefile.in (OBJS): Add gimple-match-exports.o.
            * genmatch.cc (decision_tree::gen): Export gimple_gimplify helpers.
            * gimple-match-head.cc (gimple_simplify, gimple_resimplify1,
            gimple_resimplify2, gimple_resimplify3, gimple_resimplify4,
            gimple_resimplify5, constant_for_folding, convert_conditional_op,
            maybe_resimplify_conditional_op, gimple_match_op::resimplify,
            maybe_build_generic_op, build_call_internal, maybe_push_res_to_seq,
            do_valueize, try_conditional_simplification, gimple_extract,
            gimple_extract_op, canonicalize_code, commutative_binary_op_p,
            commutative_ternary_op_p, first_commutative_argument,
            associative_binary_op_p, directly_supported_p,
            get_conditional_internal_fn): Moved to gimple-match-exports.cc
            * gimple-match-exports.cc: New file.
Comment 69 GCC Commits 2023-05-05 12:48:09 UTC
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:703417a030b3d80f55ba1402adc3f1692d3631e5

commit r14-500-g703417a030b3d80f55ba1402adc3f1692d3631e5
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Fri May 5 13:38:50 2023 +0100

    match.pd: automatically partition *-match.cc files.
    
    Following on from Richi's RFC[1] this is another attempt to split up match.pd
    into multiple gimple-match and generic-match files.  This version is fully
    automated and requires no human intervention.
    
    First things first, some perf numbers.  The following shows the effect of the
    patch on my desktop doing parallel compilation of gimple-match:
    
    +--------+------------------+--------+------------------+
    | splits | rel. improvement | splits | rel. improvement |
    +--------+------------------+--------+------------------+
    |      1 | 0.00%            |     33 | 91.03%           |
    |      2 | 71.77%           |     34 | 84.02%           |
    |      3 | 100.71%          |     35 | 83.42%           |
    |      4 | 143.08%          |     36 | 78.80%           |
    |      5 | 176.18%          |     37 | 74.06%           |
    |      6 | 174.40%          |     38 | 55.76%           |
    |      7 | 176.62%          |     39 | 66.90%           |
    |      8 | 168.35%          |     40 | 18.25%           |
    |      9 | 189.80%          |     41 | 16.55%           |
    |     10 | 171.77%          |     42 | 47.02%           |
    |     11 | 152.82%          |     43 | 15.29%           |
    |     12 | 112.20%          |     44 | 21.63%           |
    |     13 | 158.57%          |     45 | 41.53%           |
    |     14 | 158.57%          |     46 | 21.98%           |
    |     15 | 152.07%          |     47 | -42.74%          |
    |     16 | 151.70%          |     48 | -32.62%          |
    |     17 | 131.52%          |     49 | 11.81%           |
    |     18 | 133.11%          |     50 | 34.07%           |
    |     19 | 137.33%          |     51 | 2.71%            |
    |     20 | 103.83%          |     52 | -22.23%          |
    |     21 | 132.47%          |     53 | 32.30%           |
    |     22 | 116.52%          |     54 | 21.45%           |
    |     23 | 112.73%          |     55 | 40.02%           |
    |     24 | 111.94%          |     56 | 42.83%           |
    |     25 | 112.73%          |     57 | -9.98%           |
    |     26 | 104.07%          |     58 | 18.01%           |
    |     27 | 113.27%          |     59 | -4.91%           |
    |     28 | 96.77%           |     60 | 22.94%           |
    |     29 | 93.42%           |     61 | -3.73%           |
    |     30 | 87.67%           |     62 | -27.43%          |
    |     31 | 89.54%           |     63 | -1.05%           |
    |     32 | 84.42%           |     64 | -5.44%           |
    +--------+------------------+--------+------------------+
    
    As can be seen there seems to be a point of diminishing returns in doing splits.
    This comes from the fact that these match files consume a sizeable amount of
    headers.  At a certain point the parsing overhead of the headers dominates and
    you start losing the gains.
    
    As such from this I've made the default 10 splits per file to allow for some
    room for growth in the future without needing changes to the split amount.
    Since 5-10 show roughly the same gains it means we can afford to double the
    file sizes before we need to up the split amount.  This can be controlled
    by the configure parameter --with-matchpd-partitions=.
    
    At 10 splits the sizes of the files are:
    
     1.2M gimple-match-1.cc
     490K gimple-match-2.cc
     459K gimple-match-3.cc
     462K gimple-match-4.cc
     466K gimple-match-5.cc
     690K gimple-match-6.cc
     517K gimple-match-7.cc
     693K gimple-match-8.cc
    1011K gimple-match-9.cc
     490K gimple-match-10.cc
     210K gimple-match-auto.h
    
    The reason gimple-match-1.cc is so large is because it got allocated a very
    large function: gimple_simplify_NE_EXPR.
    
    Because of these sporadically large functions the allocation to a split happens
    based on the amount of data already written to a split instead of just a simple
    round robin allocation (though the patch supports that too).  This means that
    once gimple_simplify_NE_EXPR is allocated to gimple-match-1.cc nothing uses it
    again until the rest of the files catch up.
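
The difference between the two allocation strategies can be sketched as follows. This is only an illustration under assumptions: function sizes are given up front as byte counts, and `assign_by_size` and `assign_round_robin` are hypothetical helpers rather than the code in genmatch.cc.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Size-based allocation: each function goes to the split that has had the
// least data written to it so far (ties go to the lowest-numbered split).
std::vector<std::size_t>
assign_by_size (const std::vector<std::size_t> &func_sizes,
                std::size_t num_splits)
{
  std::vector<std::size_t> assignment;
  if (num_splits == 0)
    return assignment;
  std::vector<std::size_t> written (num_splits, 0);
  for (std::size_t size : func_sizes)
    {
      auto smallest = std::min_element (written.begin (), written.end ());
      std::size_t target = static_cast<std::size_t> (smallest - written.begin ());
      assignment.push_back (target);
      written[target] += size;
    }
  return assignment;
}

// Round-robin allocation, for comparison: sizes are ignored entirely.
std::vector<std::size_t>
assign_round_robin (const std::vector<std::size_t> &func_sizes,
                    std::size_t num_splits)
{
  std::vector<std::size_t> assignment;
  if (num_splits == 0)
    return assignment;
  for (std::size_t i = 0; i < func_sizes.size (); ++i)
    assignment.push_back (i % num_splits);
  return assignment;
}
```

With one very large function followed by several small ones, the size-based variant leaves the first split alone until the others catch up, whereas round robin keeps adding to it.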
    
    To support this split a new header file *-match-auto.h is generated to allow
    the individual files to compile separately.
    
    Lastly for the auto generated files I use pragmas to silence the unused
    predicate warnings instead of the previous Makefile way because I couldn't find
    a way to set them without knowing the number of split files beforehand.
    
    Finally with this change, bootstrap time has dropped 8 minutes on AArch64.
    
    [1] https://gcc.gnu.org/legacy-ml/gcc-patches/2018-04/msg01125.html
    
    gcc/ChangeLog:
    
            PR bootstrap/84402
            * genmatch.cc (emit_func, SIZED_BASED_CHUNKS, get_out_file): New.
            (decision_tree::gen): Accept list of files instead of single and update
            to write function definition to header and main file.
            (write_predicate): Likewise.
            (write_header): Emit pragmas and new includes.
            (main): Create file buffers and cleanup.
            (showUsage, write_header_includes): New.
Comment 70 GCC Commits 2023-05-05 12:48:14 UTC
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:0a85544e1aaeca41133ecfc438cda913dbc0f122

commit r14-501-g0a85544e1aaeca41133ecfc438cda913dbc0f122
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Fri May 5 13:42:17 2023 +0100

    match.pd: Use splits in makefile and make configurable.
    
    This updates the build system to split up match.pd files into chunks of 10.
    This also introduces a new flag --with-matchpd-partitions which can be used to
    change the number of partitions.
    
    For the analysis of why 10 please look at the previous patch in the series.
    
    gcc/ChangeLog:
    
            PR bootstrap/84402
            * Makefile.in (NUM_MATCH_SPLITS, MATCH_SPLITS_SEQ,
            GIMPLE_MATCH_PD_SEQ_SRC, GIMPLE_MATCH_PD_SEQ_O,
            GENERIC_MATCH_PD_SEQ_SRC, GENERIC_MATCH_PD_SEQ_O): New.
            (OBJS, MOSTLYCLEANFILES, .PRECIOUS): Use them.
            (s-match): Split into s-generic-match and s-gimple-match.
            * configure.ac (with-matchpd-partitions,
            DEFAULT_MATCHPD_PARTITIONS): New.
            * configure: Regenerate.
Comment 71 GCC Commits 2023-10-31 12:35:27 UTC
The master branch has been updated by Robin Dapp <rdapp@gcc.gnu.org>:

https://gcc.gnu.org/g:184378027e92f51e02d3649e0ca523f487fd2810

commit r14-5034-g184378027e92f51e02d3649e0ca523f487fd2810
Author: Robin Dapp <rdapp@ventanamicro.com>
Date:   Thu Oct 12 11:23:26 2023 +0200

    genemit: Split insn-emit.cc into several partitions.
    
    On riscv insn-emit.cc has grown to over 1.2 million lines of code and
    compiling it takes considerable time.
    Therefore, this patch adjusts genemit to create several partitions
    (insn-emit-1.cc to insn-emit-n.cc).  The available patterns are
    written to the given files in a sequential fashion.
    
    Similar to match.pd a configure option --with-insnemit-partitions=num
    is introduced that makes the number of partitions configurable.
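
One way to picture the sequential split is below. This is a sketch under assumptions: the total pattern count is known in advance (the role `count_patterns` appears to play) and each file receives an equal-sized consecutive run. `file_for_pattern` is a hypothetical helper, and genemit's actual arithmetic may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Map the index of a pattern (0-based, in the order patterns are read) to
// the 0-based index of the output file that receives it.  Patterns are
// handed out in consecutive runs of ceil (total / num_files).
std::size_t
file_for_pattern (std::size_t index, std::size_t total, std::size_t num_files)
{
  if (num_files == 0)
    return 0;
  // Round up so that num_files runs always cover all patterns; never let
  // the run length drop to zero when there are no patterns at all.
  std::size_t per_file
    = std::max<std::size_t> (1, (total + num_files - 1) / num_files);
  // Clamp so an out-of-range index still maps to a valid file.
  return std::min (index / per_file, num_files - 1);
}
```

Unlike the size-based allocation used for match.pd, this keeps patterns in source order, at the cost of uneven file sizes when pattern sizes vary; with rounding up, the last files may also stay empty.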
    
    gcc/ChangeLog:
    
            PR bootstrap/84402
            PR target/111600
    
            * Makefile.in: Handle split insn-emit.cc.
            * configure: Regenerate.
            * configure.ac: Add --with-insnemit-partitions.
            * genemit.cc (output_peephole2_scratches): Print to file instead
            of stdout.
            (print_code): Ditto.
            (gen_rtx_scratch): Ditto.
            (gen_exp): Ditto.
            (gen_emit_seq): Ditto.
            (emit_c_code): Ditto.
            (gen_insn): Ditto.
            (gen_expand): Ditto.
            (gen_split): Ditto.
            (output_add_clobbers): Ditto.
            (output_added_clobbers_hard_reg_p): Ditto.
            (print_overload_arguments): Ditto.
            (print_overload_test): Ditto.
            (handle_overloaded_code_for): Ditto.
            (handle_overloaded_gen): Ditto.
            (print_header): New function.
            (handle_arg): New function.
            (main): Split output into 10 files.
            * gensupport.cc (count_patterns): New function.
            * gensupport.h (count_patterns): Define.
            * read-md.cc (md_reader::print_md_ptr_loc): Add file argument.
            * read-md.h (class md_reader): Change definition.