While preparing GCC builds for a new Solaris 11.4/x86 GCC CFarm system, I re-ran into an issue with D programs looping inside a Solaris kernel zone (a VM), while the same binaries work fine on bare metal. I've now managed to root-cause the issue.

When bootstrapping e.g. GCC 13 with GCC 9.5.0 or 11.4.0, configuring libphobos in stage 1 loops. This can be reproduced with

$ cat conftest.d
module object;
extern(C) int main() { return 0; }
$ d21 conftest.d

which loops indefinitely. d21 shows the following stacktrace:

Thread 2 received signal SIGINT, Interrupt.
[Switching to Thread 1 (LWP 1)]
_D4core5cpuid12getCpuInfo0BFNbNiNeZv () at gcc-11.4.0/libphobos/libdruntime/core/cpuid.d:669
669             if (b!=0) {
1: x/i $pc
=> 0x442b311 <_D4core5cpuid12getCpuInfo0BFNbNiNeZv+33>: test   %ebx,%ebx
(gdb) bt
#0  _D4core5cpuid12getCpuInfo0BFNbNiNeZv () at gcc-11.4.0/libphobos/libdruntime/core/cpuid.d:669
#1  0x000000000442b7e3 in _D4core5cpuid8cpuidX86FNbNiNeZv () at gcc-11.4.0/libphobos/libdruntime/core/cpuid.d:953
#2  0x000000000442bd75 in _D4core5cpuid18_sharedStaticCtor1FNbNiNeZv () at gcc-11.4.0/libphobos/libdruntime/core/cpuid.d:1073
#3  0x000000000441a421 in runModuleFuncs (this=0x0, modules=...) at gcc-11.4.0/libphobos/libdruntime/rt/minfo.d:858
#4  _D2rt5minfo11ModuleGroup8runCtorsMFZv (this=...) at gcc-11.4.0/libphobos/libdruntime/rt/minfo.d:728
#5  0x000000000441b5bd in __foreachbody1 (this=<optimized out>, sg=...) at gcc-11.4.0/libphobos/libdruntime/rt/minfo.d:796
#6  0x000000000440ffd2 in _D3gcc8sections3elf3DSO7opApplyFMDFKSQBjQBiQBcQBbZiZi (dg=...)
    at gcc-11.4.0/libphobos/libdruntime/gcc/sections/elf.d:106
#7  0x000000000441a61e in rt_moduleCtor () at gcc-11.4.0/libphobos/libdruntime/rt/minfo.d:793
#8  0x000000000440f880 in rt_init () at gcc-11.4.0/libphobos/libdruntime/rt/dmain2.d:129
#9  0x00000000022f4c16 in d_init_options (decoded_options=0x47e1f00) at gcc-13.2.0/gcc/d/d-lang.cc:290
#10 0x0000000002ac5fbc in toplev::main (this=0x7fffbffff97a, argc=2, argv=0x7fffbffff9b8) at gcc-13.2.0/gcc/toplev.cc:2240
#11 0x0000000004301c46 in main (argc=2, argv=0x7fffbffff9b8) at gcc-13.2.0/gcc/main.cc:39

Running getCpuInfo0B side-by-side in the kernel zone and on bare metal shows:

                kernel zone     bare metal
level 0   a     0               1
          b     1               2
level 1   a     0               5
          b     1               28
level 2   a     0               0
          b     1               0

and so on for each higher level. So inside a kernel zone, b stays at 1 for every level, a!=0 || b!=0 remains true, and the loop never terminates.

If I'm reading the spec (Intel® 64 and IA-32 Architectures Software Developer’s Manual, Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4, Order Number: 325462-081US, September 2023, Vol. 2A, 3-225, p.821) correctly, this is a bug in the kernel zone software:

  A sub-leaf returning an invalid domain always returns 0 in EAX and EBX.

OTOH I don't see why getCpuInfo0B needs to loop here since it's only interested in levels 0 and 1 anyway.

This affects all DMD-based versions of GDC, while the previous C++-based versions are fine.
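For illustration only, the termination problem can be modelled without real hardware. The sketch below is C rather than the druntime D code, and it does not execute cpuid: `cpuid_fn`, `kernel_zone_cpuid`, `bare_metal_cpuid`, `count_levels` and the `max_iter` cap are hypothetical names introduced here, and the stub return values are simply the ones from the table above.

```c
#include <stdint.h>

/* EAX/EBX as reported by CPUID leaf 0x0B for one sub-leaf (level). */
struct regs { uint32_t a, b; };

/* Hypothetical hook standing in for the real cpuid instruction. */
typedef struct regs (*cpuid_fn)(uint32_t level);

/* Stub: values seen inside the kernel zone (a=0, b=1 at every level). */
static struct regs kernel_zone_cpuid(uint32_t level)
{
    (void)level;
    return (struct regs){ 0, 1 };
}

/* Stub: values seen on bare metal; level 2 and above report 0/0. */
static struct regs bare_metal_cpuid(uint32_t level)
{
    switch (level) {
    case 0:  return (struct regs){ 1, 2 };
    case 1:  return (struct regs){ 5, 28 };
    default: return (struct regs){ 0, 0 };
    }
}

/*
 * Uses the same exit condition as the original do/while loop, plus an
 * artificial cap so that the model itself cannot hang.
 * Returns the number of sub-leaves queried, or -1 if the cap was hit.
 */
static int count_levels(cpuid_fn cpuid, int max_iter)
{
    uint32_t level = 0;
    struct regs r;
    int n = 0;

    do {
        if (n == max_iter)
            return -1;          /* would have kept looping */
        r = cpuid(level);
        ++level;
        ++n;
    } while (r.a != 0 || r.b != 0);

    return n;
}
```

With these stubs, `count_levels(bare_metal_cpuid, 1000)` stops after querying three sub-leaves, while `count_levels(kernel_zone_cpuid, 1000)` only returns because of the cap, which mirrors the hang seen in d21.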
(In reply to Rainer Orth from comment #0)
> This affects all DMD-based versions of GDC, while the previous C++-based
> versions are fine.

The compiler is fine, but if I understand right, all programs built by the C++-based version would still observe the same infinite loop.
> --- Comment #1 from ibuclaw at gcc dot gnu.org ---
> (In reply to Rainer Orth from comment #0)
>> This affects all DMD-based versions of GDC, while the previous C++-based
>> versions are fine.
> The compiler is fine, but if I understand right, all programs built by the
> C++-based version would still observe the same infinite loop.

Just the opposite: both D-based d21 and every D program somehow using getCpuInfo0B would experience the loop. I believe I originally experienced that in early (GCC 8 or 9) versions when testing libphobos in a Solaris 11.3 kernel zone.
Based on what I see here, this patch to core.cpuid should be sufficient to fix the loop without introducing any change in existing behaviour.

---
--- a/druntime/src/core/cpuid.d
+++ b/druntime/src/core/cpuid.d
@@ -666,10 +666,12 @@ void getAMDcacheinfo()
 // to determine number of processors.
 void getCpuInfo0B()
 {
-    int level=0;
     int threadsPerCore;
     uint a, b, c, d;
-    do {
+    // I'm not sure about this. The docs state that there
+    // are 2 hyperthreads per core if HT is factory enabled.
+    for (int level = 0; level < 2; level++)
+    {
         version (GNU_OR_LDC) asm pure nothrow @nogc {
             "cpuid" : "=a" (a), "=b" (b), "=c" (c), "=d" (d) : "a" (0x0B), "c" (level);
         } else asm pure nothrow @nogc {
@@ -681,19 +683,20 @@ void getCpuInfo0B()
             mov c, ECX;
             mov d, EDX;
         }
-        if (b!=0) {
-           // I'm not sure about this. The docs state that there
-           // are 2 hyperthreads per core if HT is factory enabled.
-            if (level==0)
+        if (b != 0)
+        {
+            if (level == 0)
                 threadsPerCore = b & 0xFFFF;
-            else if (level==1) {
+            else if (level == 1)
+            {
                 cpuFeatures.maxThreads = b & 0xFFFF;
                 cpuFeatures.maxCores = cpuFeatures.maxThreads / threadsPerCore;
             }
-
         }
-        ++level;
-    } while (a!=0 || b!=0);
+        // Got "invalid domain" returned from cpuid
+        if (a == 0 && b == 0)
+            break;
+    }
 }
 
 void cpuidX86()
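As a rough check of the bounded loop's behaviour, here is a small C model of it. This is a sketch under assumptions, not the druntime code: it replaces the cpuid instruction with stub functions fed from the values reported earlier in this thread, and `cpuid_fn`, `struct topology`, `probe_topology` and the two stubs are hypothetical names. The zero check before the division is an addition of this sketch and is not part of the patch.

```c
#include <stdint.h>

/* EAX/EBX as reported by CPUID leaf 0x0B for one sub-leaf (level). */
struct regs { uint32_t a, b; };

/* Hypothetical hook standing in for the real cpuid instruction. */
typedef struct regs (*cpuid_fn)(uint32_t level);

/* Mirrors the three values getCpuInfo0B derives. */
struct topology { uint32_t threads_per_core, max_threads, max_cores; };

/* Stub: kernel zone never reports an invalid domain (a=0, b=1). */
static struct regs kernel_zone_cpuid(uint32_t level)
{
    (void)level;
    return (struct regs){ 0, 1 };
}

/* Stub: bare metal values; level 2 and above report 0/0. */
static struct regs bare_metal_cpuid(uint32_t level)
{
    switch (level) {
    case 0:  return (struct regs){ 1, 2 };
    case 1:  return (struct regs){ 5, 28 };
    default: return (struct regs){ 0, 0 };
    }
}

/* Bounded variant: at most two queries, whatever the hook returns. */
static struct topology probe_topology(cpuid_fn cpuid)
{
    struct topology t = { 0, 0, 0 };

    for (uint32_t level = 0; level < 2; level++) {
        struct regs r = cpuid(level);

        if (r.b != 0) {
            if (level == 0) {
                t.threads_per_core = r.b & 0xFFFF;
            } else if (level == 1) {
                t.max_threads = r.b & 0xFFFF;
                /* Zero guard added for this sketch only. */
                if (t.threads_per_core != 0)
                    t.max_cores = t.max_threads / t.threads_per_core;
            }
        }
        /* "Invalid domain": stop early, as the patch does. */
        if (r.a == 0 && r.b == 0)
            break;
    }
    return t;
}
```

With the bare metal stub this yields 2 threads per core, 28 threads and 14 cores; with the kernel zone stub it yields 1, 1 and 1 and, more importantly, returns after two queries instead of spinning.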
> --- Comment #3 from ibuclaw at gcc dot gnu.org ---
> Based on what I see here, this patch to core.cpuid should be sufficient to fix
> loop and not introduce any change in existing behaviour.

I've now bootstrapped a patched gcc 13.2.0 both inside the kernel zone (amd64-pc-solaris2.11) and out, and indeed the build completes and there are no differences in testsuite results between that run and an equivalent one on bare metal.

One caveat, though: I originally used an unpatched gcc 11.4.0 as build compiler. However, that doesn't work because the stage 1 d21 is linked with the build compiler's libgphobos, thus loops when used. Applying the patch to the gcc 11.4.0 sources and rebuilding fixed that.

Thanks a lot.
Upstream PR https://github.com/dlang/dmd/pull/15778
> --- Comment #5 from ibuclaw at gcc dot gnu.org ---
> Upstream PR https://github.com/dlang/dmd/pull/15778

Excellent, thanks a lot for the blindingly fast fix. I'll file a bug with Oracle about this, too.
Patch ready to apply to releases/gcc-13, and backports to gcc-12 and gcc-11. Mainline will get this in the next upstream merge.
The releases/gcc-13 branch has been updated by Iain Buclaw <ibuclaw@gcc.gnu.org>:

https://gcc.gnu.org/g:0b25c1295d4e84af681f4b1f4af2ad37cd270da3

commit r13-8008-g0b25c1295d4e84af681f4b1f4af2ad37cd270da3
Author: Iain Buclaw <ibuclaw@gdcproject.org>
Date:   Tue Nov 7 14:04:07 2023 +0100

    libphobos: Fix regression d21 loops in getCpuInfo0B in Solaris/x86 kernel zone

    This function assumes that cpuid would return "invalid domain" when a
    sub-leaf index greater than what's supported is requested. This turned
    out not to always be the case when running on some virtual machines.

    As the loop only does anything for levels 0 and 1, make that a hard
    limit for number of times the loop is ran.

            PR d/112408

    libphobos/ChangeLog:

            * libdruntime/core/cpuid.d (getCpuInfo0B): Limit number of times
            loop runs.
The releases/gcc-12 branch has been updated by Iain Buclaw <ibuclaw@gcc.gnu.org>:

https://gcc.gnu.org/g:8a880d895a468a44fd3e268dc548e64aebe8f5d4

commit r12-9963-g8a880d895a468a44fd3e268dc548e64aebe8f5d4
Author: Iain Buclaw <ibuclaw@gdcproject.org>
Date:   Tue Nov 7 14:04:07 2023 +0100

    libphobos: Fix regression d21 loops in getCpuInfo0B in Solaris/x86 kernel zone

    This function assumes that cpuid would return "invalid domain" when a
    sub-leaf index greater than what's supported is requested. This turned
    out not to always be the case when running on some virtual machines.

    As the loop only does anything for levels 0 and 1, make that a hard
    limit for number of times the loop is ran.

            PR d/112408

    libphobos/ChangeLog:

            * libdruntime/core/cpuid.d (getCpuInfo0B): Limit number of times
            loop runs.

    (cherry picked from commit 0b25c1295d4e84af681f4b1f4af2ad37cd270da3)
The releases/gcc-11 branch has been updated by Iain Buclaw <ibuclaw@gcc.gnu.org>:

https://gcc.gnu.org/g:47d833394a09068ba0607a57aa149dfe3dc11e8b

commit r11-11092-g47d833394a09068ba0607a57aa149dfe3dc11e8b
Author: Iain Buclaw <ibuclaw@gdcproject.org>
Date:   Tue Nov 7 14:04:07 2023 +0100

    libphobos: Fix regression d21 loops in getCpuInfo0B in Solaris/x86 kernel zone

    This function assumes that cpuid would return "invalid domain" when a
    sub-leaf index greater than what's supported is requested. This turned
    out not to always be the case when running on some virtual machines.

    As the loop only does anything for levels 0 and 1, make that a hard
    limit for number of times the loop is ran.

            PR d/112408

    libphobos/ChangeLog:

            * libdruntime/core/cpuid.d (getCpuInfo0B): Limit number of times
            loop runs.

    (cherry picked from commit 0b25c1295d4e84af681f4b1f4af2ad37cd270da3)
Mainline got this in r14-5678.