This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: RFA [PATCH] * lock-and-run.sh: Check for process existence rather than timeout.

From: Jason Merrill <jason at redhat dot com>
To: Alexandre Oliva <oliva at gnu dot org>
Cc: gcc-patches List <gcc-patches at gcc dot gnu dot org>
Date: Mon, 21 Oct 2019 12:33:15 -0400
Subject: Re: RFA [PATCH] * lock-and-run.sh: Check for process existence rather than timeout.
References: <20191014213545.24627-1-jason@redhat.com> <oro8ydgyxv.fsf@lxoliva.fsfla.org> <CADzB+2mRrBrZTtvjUHxpvfGumwjjjJAcGooq1gRKaAvkanDegA@mail.gmail.com>

On 10/21/19 11:35 AM, Jason Merrill wrote:

On Sat, Oct 19, 2019 at 12:08 AM Alexandre Oliva <oliva@gnu.org<mailto:oliva@gnu.org>> wrote:


    Hello, Jason,

    On Oct 14, 2019, Jason Merrill <jason@redhat.com
    <mailto:jason@redhat.com>> wrote:

     > Alex, you had a lot of helpful comments when I first wrote this,
    any thoughts
     > on this revision?

    I think the check of the pid file could be made slightly simpler and
    cheaper if we created it using:

        echo $$ > $lockdir/pidT && mv $lockdir/pidT $lockdir/pid

    instead of

     > +touch $lockdir/$$



     > +     pid="`(cd $lockdir; echo *)`"

    The ""s are implicit in a shell assignment, though there really
    shouldn't be more than one PID-named file in the dir.  With the change
    suggested above, this would become

             pid=`cat $lockdir/pid 2>/dev/null`


Done.

    There's a slight possibility of hitting this right between the creation
    of the dir and the creation of the pid file, thus the 2>/dev/null.

     > +     if ps "$pid" >/dev/null; then

    could be tested with much lower overhead:

             if test -z "$pid" || kill -0 $pid ; then


Done

    though it might make sense to have a different test and error message
    for the case of the absent pid file.


Done.

    We might also wish to use different lock-breaking logic for that case,
    too, e.g. checking that the timestamp of the dir didn't change by
    comparing `ls -ld $lockdir` with what we got 30 seconds before.  If it
    changed or the output is now empty, we just lost the race again.

    It's unlikely that the dir would remain unchanged for several seconds
    without the pid file, so if we get the same timestamp after 30 seconds,
    it's likely that something went wrong with the lock holder, though it's
    not impossible to imagine a scenario in which the lock program that just
    succeeded in creating the dir got stopped (^Z) or killed-9 just before
    creating the PID file.

This patch uses stamps in the lock directory so that they areautomatically cleared when the lock changes hands.


    Even then, maybe breaking the lock is not such a great idea in
    general...

    Though mkdir is an operation that forces a synchronization, reading a
    file without a filesystem lock isn't.  The rename alleviates that a bit,
    but it's not entirely unreasonable for an attempt to read the file to
    cache the absence of the file and not notice a creation shortly
    afterward.  This would be a lot more of an issue in case of contention
    for the lock between different clients of a shared networked filesystem,
    though we might imagine highly parallel systems to eventually hit such
    issues as well.

    But just the possibility of contention across a shared network
    filesystem would give me pause, out of realizing that checking for a
    local process with the same PID would do no good.  And then, kill -0
    would not succeed if the lock holder was started by a different user,
    unlike ps.

    What if we printed an error message suggesting the command to clean up,
    and then errored out, instead of retrying forever or breaking the lock
    and proceeding?  Several programs that rely on lock files (git and svn
    come to mind) seem to be taking such an approach these days, presumably
    because of all the difficulties in automating the double-checking in all
    potential scenarios.

It seems to me that the downside of stealing the lock (greater resourcecontention) is less than the downside of erroring out (build fails,user has to intervene manually, possibly many hours later when they notice).


How's this?


...once more, with less HTML.

commit 157d9c29a21ab0a5d854f53ca51ca9a7986a4e64
Author: Jason Merrill <jason@redhat.com>
Date:   Sat Sep 14 14:02:20 2019 -0400

            * lock-and-run.sh: Check for process existence rather than timeout.
    
    Matthias Klose noted that on less powerful targets, a link might take more
    than 5 minutes; he mentions a figure of 3 hours for an LTO link.  So this
    patch changes the timeout to a check for whether the locking process still
    exists.  If the lock exists in an erroneous state (no pid file or can't
    signal the pid) for 30 sec, steal it.

diff --git a/gcc/lock-and-run.sh b/gcc/lock-and-run.sh
index 3a6a84c253a..c1ae16bbb01 100644
--- a/gcc/lock-and-run.sh
+++ b/gcc/lock-and-run.sh
@@ -1,33 +1,43 @@
 #! /bin/sh
 # Shell-based mutex using mkdir.
 
+self=`basename $0`
 lockdir="$1" prog="$2"; shift 2 || exit 1
 
 # Remember when we started trying to acquire the lock.
 count=0
-touch lock-stamp.$$
 
-trap 'rm -r "$lockdir" lock-stamp.$$' 0
+err () {
+    if test -f $lockdir/lock-$1.$$; then
+	echo "$self: *** (PID $$) removing stale $lockdir" >&2
+	rm -rf $lockdir
+    else
+	touch $lockdir/lock-$1.$$
+    fi
+}
 
 until mkdir "$lockdir" 2>/dev/null; do
     # Say something periodically so the user knows what's up.
     if [ `expr $count % 30` = 0 ]; then
-	# Reset if the lock has been renewed.
-	if [ -n "`find \"$lockdir\" -newer lock-stamp.$$`" ]; then
-	    touch lock-stamp.$$
-	    count=1
-	# Steal the lock after 5 minutes.
-	elif [ $count = 300 ]; then
-	    echo removing stale $lockdir >&2
-	    rm -r "$lockdir"
+	# Check for valid lock.
+	if pid=`cat $lockdir/pid 2>/dev/null` && kill -0 $pid 2>/dev/null; then
+	    echo "$self: (PID $$) waiting $count sec to acquire $lockdir from PID $pid" >&2
+	elif test -z "$pid"; then
+	    echo "$self: (PID $$) cannot read $lockdir/pid" >&2
+	    err nopid
 	else
-	    echo waiting to acquire $lockdir >&2
+	    echo "$self: (PID $$) cannot signal $lockdir owner PID $pid" >&2
+	    err dead
 	fi
     fi
     sleep 1
     count=`expr $count + 1`
 done
 
+trap 'rm -rf "$lockdir"' 0
+echo $$ > $lockdir/pidT && mv $lockdir/pidT $lockdir/pid
+echo "$self: (PID $$) acquired $lockdir after $count seconds" >&2
+
 echo $prog "$@"
 $prog "$@"

References:
- RFA [PATCH] * lock-and-run.sh: Check for process existence rather than timeout.
  - From: Jason Merrill
- Re: RFA [PATCH] * lock-and-run.sh: Check for process existence rather than timeout.
  - From: Alexandre Oliva

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]