Reported at http://gcc.gnu.org/ml/fortran/2011-12/msg00148.html For a given fortran source file with module(s), repeated compilations may not create always the same module file. The URL above provides an example. It seems that the symbols are not output in the same order. In the example provided, there are two module variants, the difference between them being that one single symbol is at a different place in the file ( as reported at http://gcc.gnu.org/ml/fortran/2011-12/msg00163.html ). Possible causes: - unitialized memory, though valgrind reported nothing for me. - tree rebalancing. Module I/O manages structures to be serialized using a global tree ordered (when writing) by structures' addresses. Could an unrepeatable order just mean an unrepeatable address?.
Also reported here: http://gcc.gnu.org/ml/gcc/2012-10/msg00075.html this is the source of recompilation cascades sometimes seen in CP2K as well. I'm wondering if a very naive hack like sorting .mod content (like in cat old.mod 1 | sort -s > new.mod) could not paper over this problem sufficiently well to make it irrelevant in reality.
I'm not in a position to test, but write_generic in module.c makes no effort to traverse the tree in a left to right order, i.e. the line write_generic ((gfc_symtree *)st->right); should be moved to the bottom of the function.
Created attachment 28372 [details] Candidate patch Here's a patch which fixes one case of module contents not being written in a defined order. I cannot test this right now because of bootstrap failures. There may be other places in the module writing code, where structures aren't traversed in a defined order. 2012-10-06 Tobias Schlüter <tobi@gcc.gnu.org> PR fortran/51727 * module.c (write_generic): Traverse tree in left-to-right order.
(In reply to comment #3) > > 2012-10-06 Tobias Schlüter <tobi@gcc.gnu.org> > > PR fortran/51727 > * module.c (write_generic): Traverse tree in left-to-right order. If tested that this patch fixes the problem for omp_lib.mod, so would likely also fix recompilation cascades.. However, testing it on CP2K I'm finding that compilation fails with this patch (and passes without), so something seems wrong. The difference between the generated modules is rather large.
Created attachment 28373 [details] bad module
Created attachment 28374 [details] good module
The main difference between 'good' and 'bad' seems to be the 'header' lines bad: () (('arch_topology' 'machine_architecture_types' 2)) () good: () (('arch_topology' 'machine_architecture_types' 2) ('ma_mp_type' 'machine_architecture_types' 3) ('ma_process' 'machine_architecture_types' 4) ('machine_output' 'machine_architecture_types' 5) ('thread_inf' 'machine_architecture_types' 6)) ()
(In reply to comment #3) > Created attachment 28372 [details] > Candidate patch actually... looking at the patch, don't you need to deal with the if statements that return ?
(In reply to comment #1) > Also reported here: > > http://gcc.gnu.org/ml/gcc/2012-10/msg00075.html To add a little more detail, here is one diff seen between two .mod files: $ diff omp_lib.mod.first omp_lib.mod 2c2 < MD5:39fa3d71e0f92c936ff97e590e8d74cf -- If you edit this, you'll get what you deserve. --- > MD5:893e632666abe58013fd14806b18696f -- If you edit this, you'll get what you deserve. 174,175d173 < 20 'lock' '' '' 19 ((VARIABLE INOUT UNKNOWN-PROC UNKNOWN UNKNOWN 0 0 < DUMMY) (INTEGER 4 0 0 0 INTEGER ()) 0 0 () () 0 () () () 0 0) 177a176,177 > 20 'lock' '' '' 19 ((VARIABLE INOUT UNKNOWN-PROC UNKNOWN UNKNOWN 0 0 > DUMMY) (INTEGER 4 0 0 0 INTEGER ()) 0 0 () () 0 () () () 0 0) This is an ordering difference in the output of write_symbol1, which traverses a pointer tree whose compare function compares pointers with greaterthan and lessthan. The tree is augmented during traversal, so there are repeated calls to write_symbol1 along with a 'written' marker at each node. Ordering by integer 'tag' assigned to each pointer may be problematic because new tags are assigned during traversals, meaning that the sequence in which tags arrive is itself partially a function of the pointer ordering. > I'm wondering if a very naive hack like sorting .mod content (like in cat > old.mod 1 | sort -s > new.mod) could not paper over this problem sufficiently > well to make it irrelevant in reality. Perhaps, but it would have to be less naive than this in practice. A .mod file is structured into sections that a simple sort could fragment. It also contains an MD5 checksum of its contents, but I'm not sure how important this is. If it's not critical, removing it might be an acceptable price to pay for simpler code elsewhere.
(In reply to comment #8) > (In reply to comment #3) > > Created attachment 28372 [details] > > Candidate patch > > actually... looking at the patch, don't you need to deal with the if statements > that return ? Yes, you're right. I'm now in a position to actually test, and I'll try a modified patch. (and I somehow forgot to add myself to CC, so I only now see your comments, sorry about that)
Created attachment 28410 [details] Patch to output generic interfaces in defined order Here's a correct version of the previous patch. Maybe it's useful, but as Simon correctly points out, the real issue appears to be due to the way the pointer_info tree is built.
I have a patch which is currently running the testsuite.
Created attachment 28424 [details] Patch for testing This patch should make output deterministic. Joost, can you please try it out? I would also be interested in the compile-time impact on module-heavy code. Because the pointer_info tree can change during writing, I sometimes have to sort the tree several times, and I have no idea if this can become important in files with large modules. Actually, now that I wrote this, I can already think of a trivial speedup, which I will implement before the final submission. Nevertheless, I would still be interested if the change disappears in the noise.
Created attachment 28425 [details] Patch for testing Ok, the testsuite doesn't take so much time, and the modified test passed immediately. Here is the patch, sped up by only including the relevant symbols in the map. NB even with this patch, the symbols will not necessarily appear in numeric order in the module file, because dependencies may only become apparent during the writing of the symbols. Nevertheless, the order is now deterministic.
I'm sorry that I'm spamming your inboxes, but I only now read the comment in front of write_symbol1, and it says something that I was wondering about all this time, so I want to at least have this recorded. The comment says: """ The act of writing a symbol can modify the pointer_info tree, so we cease traversal if we find a symbol to write. """ This has not been the case since 2007, even though it appears to make a lot of sense: if symbols are added to the balanced binary tree during symbol writing, it can happen that the layout of the tree changes. I don't think elements can be removed from the tree during symbol writing, so the code wouldn't fail (as for all we know it didn't), but it would traverse the tree in an unpredictable order. With my patch, the situation certainly doesn't become worse, but it would be preferable to have a two-step mechanism, where first all necessary symbols are identified, and then written instead of the current intermingled process. This would also allow writing the symbols in strictly numerical order. I don't know how much surgery this would require.
(In reply to comment #15) > ... > This has not been the case since 2007, even though it appears to make a lot of > sense... I noticed the same thing while investigating this. write_symbol1 visits every node and writes all unwritten ones, marking them written. Because writing a node may add new nodes, write_symbol1's caller iterates it until no more writes occur. Before reporting the issue I tried several speculative solutions, with no real success. I experimented with the two-step mechanism you mention, but could not convince myself that it would solve the problem. A full tree built before really writing any node still seemed to leave something ordered by (unpredictable) pointer and with integer "tags" that are partially a function of the pointers they represent. I'm not quite seeing how your patch fixes this; it orders writes by tag *within* write_symbol1 calls, but not *across* them. It is hard to reproduce this problem at all. I see it in only one of multiple gcc configurations. It is also hard to evaluate possible solutions -- adding a failed "non-solution" can change the relationship between pointers enough to make the symptoms disappear while not solving the actual problem.
I'm on vacation until Mon, Oct. 15. For compiler related questions, please email c-compiler-team@google.com. If you need to contact a manager, please email lp-mgrs@google.com. Thanks, Ollie
(In reply to comment #14) > Created attachment 28425 [details] > Patch for testing thanks... now repeated CP2K compiles give identical '.mod's, and of course also omp_lib.mod is fixed. This will very much improve the user experience for those working on large code bases. since you're using C++, I guess a backport to older branches is out of question....
Simon, I don't think the 'integer's are functions of the pointers once you process the symbols in a defined order. The non-determinism was caused by traversing the pointer_info tree based on the non-deterministic pointers. With the patch, the tree is copied to a map sorted by the integers. In the first iteration of write_symbol1 these are given by the traversal of the namespace, which is deterministic. Since we now process these symbols in a deterministic order, the integers of the symbols added during each iteration are also deterministic. I might be missing something, of course. Joost, it wouldn't be much work to implement the same thing using GFC_BBT instead of a std::map. It would require a typedef and one would have to cleanup the tree manually, but if fixing old versions of gcc is valuable, I can do it. I might have to do this anyway, as I don't know if gfortran maintainers like C++isms (even though in this particular case it's probably non-controversial).
(In reply to comment #19) > ...Since we now process these symbols in a deterministic order, > the integers of the symbols added during each iteration are also deterministic. Thanks for the note and the fix. I guess my worry is that the order tags are assigned to symbols is a function of pointer ordering. I see the issue with what look like anonymous symbols, but can't convince myself it couldn't occur with something less anonymous. Suppose two anonymous symbols 'X' differentiated only by their addresses 101 and 102. After tagging there are two X's, call them X1 and X2, that will produce two lines that are identical apart from tag. Ordering by tag makes the order the same even if you reverse their addresses. However, two named symbols 'A' and 'B' at 101 and 102 will become A1 and B2 and produce two lines that differ in both tag and other fields, written in tag order as A first and B second. If you reverse their addresses the tags are assigned differently, A2 and B1. Here B has the smaller tag and so is written before A. At this stage though, I'm out of my depth on fortran, so it may well be that I'm the one missing something.
Hm, I don't know about anonymous symbols. If they exist and end up in modules (which I honestly don't know), I would hope that they would obtain their number from some counter, which should still be deterministic as it would be incremented as code and symbols are processed (which is hopefully all deterministic). My idea would be: submit the patch (maybe a non-C++ version as explained above), if the bug ever re-appears we'll need a more invasive solution.
Created attachment 28440 [details] patch that doesn't use c++ Here's a patch that works essentially the same way, but doesn't use C++. It's about 50 lines longer than the other patch and produces identical module files in the small examples. It's not yet ready for submission, as I haven't yet added all the necessary comments, also maybe some variables could be named better, but I would do this if you think it's worthy fixing this on the old branches as well. I would then submit the C++ patch for the trunk and the C version for the older branches.
(In reply to comment #22) > Created attachment 28440 [details] > patch that doesn't use c++ I've tested the patch with (an older version of) the 4.7 branch, and it works fine for CP2K.
(In reply to comment #23) > I've tested the patch with (an older version of) the 4.7 branch, and it works > fine for CP2K. it doesn't apply cleanly to 4.6, so no testing there unfortunately.
Just to provide some additional numbers on how important this patch is for practical development (and of course to +1 on backports).... for a 'typical code change' on a CP2K tree (add an unused local variable to a subroutine) the speedup due to avoided recompilation (on a 32 core server) can be obtained from the following compile timings (repeatable for various tries): 4.6(unpatched) real 1m45.064s 4.7(patched) real 0m14.958s I really think this is a pretty substantial bug fix of an existing feature.
The patch has been posted some time ago, with an OK for trunk.. http://gcc.gnu.org/ml/fortran/2012-10/msg00067.html Maybe it is a good time to commit before the next stage starts ?
There were concerns about error handling in out-of-memory conditions which I addressed in a separate patch <http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01470.html>. This patch got no reply so far. May plan was to submit the non-C++ patch if the patch for C++ error handling is not approved during stage 1.
Author: tobi Date: Thu Nov 8 15:46:07 2012 New Revision: 193329 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193329 Log: PR fortran/51727 * module.c (sorted_pointer_info): New. (gfc_get_sorted_pointer_info): New. (free_sorted_pointer_info_tree): New. (compare_sorted_pointer_info): New. (find_symbols_to_write): New. (write_symbol1_recursion): New. (write_symbol1): Collect symbols that need writing, output in order. (write_generic): Traverse tree in order. Modified: trunk/gcc/fortran/ChangeLog trunk/gcc/fortran/module.c
I committed the C-only version of the patch as the issues mentioned in comment #27 couldn't be addressed before stage3.
(In reply to comment #29) > I committed the C-only version of the patch as the issues mentioned in comment > #27 couldn't be addressed before stage3. Thanks Tobi! I have been using your C-only patch for a couple of weeks now for the 4.7 branch, and it is greatly improving our edit/compile-cycles. For one of my students, it yields an effective 10x speedup in building CP2K after a typical code change, greatly facilitating the programming project he is on. I would suggest that after a couple of weeks on trunk, this should be reconsidered again for backporting to 4.7.
(In reply to comment #30) > (In reply to comment #29) > > I committed the C-only version of the patch as the issues mentioned in comment > > #27 couldn't be addressed before stage3. > > Thanks Tobi! > > I have been using your C-only patch for a couple of weeks now for the 4.7 > branch, and it is greatly improving our edit/compile-cycles. For one of my > students, it yields an effective 10x speedup in building CP2K after a typical > code change, greatly facilitating the programming project he is on. I would > suggest that after a couple of weeks on trunk, this should be reconsidered > again for backporting to 4.7. Good to hear that. If you can use the additional free time to walk over to my brother's office, then please say 'Hi' to him. Otherwise the faculty meeting will have to do :-) As for the backport, I think the patch is absolutely risk-free, and it should have been approved for 4.7 even though it doesn't fulfill the formal requirements. I think it's worth another try once it's not only saved ethz.ch a lot of electricity but also survived for some time on the 4.8 branch. Please ping the patch in a few weeks so it's not forgotten. I also put a note into my calendar.
> If you can use the additional free time to walk over to my > brother's office, then please say 'Hi' to him. Otherwise the faculty meeting > will have to do :-) Let's call it a small world... I will meet him next week.
(In reply to comment #31) > As for the backport, I think the patch is absolutely risk-free, and it should > have been approved for 4.7 even though it doesn't fulfill the formal > requirements. Please ping the patch in a few weeks so it's not forgotten. Ping....