diff mbox series

Remove timing related checks of time/tst-cpuclock1

Message ID 20201023130706.2279203-1-stli@linux.ibm.com
State New
Headers show
Series Remove timing related checks of time/tst-cpuclock1 | expand

Commit Message

Stefan Liebler Oct. 23, 2020, 1:07 p.m. UTC
Starting with the commit 04deeaa9ea74b0679dfc9d9155a37b6425f19a9f
"Fix time/tst-cpuclock1 intermitent failures" (2020-07-11),
this test fails quite often on s390x/s390 with one/multiple of those:
"before - after" / "nanosleep time" / "dead - after" ourside reasonable range.

On a zVM/kvm guest the CPUs are shared between multiple guests.
And even on the lpar (kvm host) the CPUs are usually shared between multiple lpars.
The defined CPUs for a lpar/zVM-system could also have lower weights compared
to other lpars which let the steal time further grow.

Usually I build (-j$(nproc)) and test (PARALLELMFLAGS="-j$(nproc)") glibc multiple
times, e.g. with different GCCs, on various lpars or zVM guests at the same time.
During this time, I've run the test for 13500 times and obvserved the following fails:
~600x "before - after"
~60x "nanosleep time"
~70x "dead - after"

I've also observed a lot of "before - after" fails on a intel kvm-guest while
building/testing glibc on it.

The mentioned commit has tighten the limits of valid tv_nsec ranges:
"before - after" (expected: 500000000):
- 100000000 ... 600000000
+ 450000000 ... 550000000

"nanosleep time" (expected: 100000000):
- 100000000 ... 200000000
+ 090000000 ... 120000000

"dead - after" (expected: 100000000):
-           ... 200000000
+ 090000000 ... 120000000

The test itself forks a child process which chew_cpu (user- and kernel-space).
The parent process sleeps with nanosleep(0.5s) and measures the child_clock time:

Comments

Carlos O'Donell Oct. 23, 2020, 1:21 p.m. UTC | #1
On 10/23/20 9:07 AM, Stefan Liebler via Libc-alpha wrote:
> Starting with the commit 04deeaa9ea74b0679dfc9d9155a37b6425f19a9f
> "Fix time/tst-cpuclock1 intermitent failures" (2020-07-11),
> this test fails quite often on s390x/s390 with one/multiple of those:
> "before - after" / "nanosleep time" / "dead - after" ourside reasonable range.
> 
> On a zVM/kvm guest the CPUs are shared between multiple guests.
> And even on the lpar (kvm host) the CPUs are usually shared between multiple lpars.
> The defined CPUs for a lpar/zVM-system could also have lower weights compared
> to other lpars which let the steal time further grow.
> 
> Usually I build (-j$(nproc)) and test (PARALLELMFLAGS="-j$(nproc)") glibc multiple
> times, e.g. with different GCCs, on various lpars or zVM guests at the same time.
> During this time, I've run the test for 13500 times and obvserved the following fails:
> ~600x "before - after"
> ~60x "nanosleep time"
> ~70x "dead - after"

I agree that this is a common failure on s390x at Red Hat also during high load
times on our lpars.
 
> I've also observed a lot of "before - after" fails on a intel kvm-guest while
> building/testing glibc on it.
> 
> The mentioned commit has tighten the limits of valid tv_nsec ranges:

Correct, because we have no data on how *bad* the timing can be.

> "before - after" (expected: 500000000):
> - 100000000 ... 600000000
> + 450000000 ... 550000000
> 
> "nanosleep time" (expected: 100000000):
> - 100000000 ... 200000000
> + 090000000 ... 120000000
> 
> "dead - after" (expected: 100000000):
> -           ... 200000000
> + 090000000 ... 120000000
> 
> The test itself forks a child process which chew_cpu (user- and kernel-space).
> The parent process sleeps with nanosleep(0.5s) and measures the child_clock time:
> diff = after - before
> With much workload on the machine, the child won't make much progess
> and it can fall much beyond the minimum limit.

Correct.

I don't think this test is giving us any value.
 
> Afterwards the parent process sleeps with clock_nanosleep (child_clock, 0.1s):
> diff = afterns - after
> The test currently also allows 0.9 * 0.1s which would be an error.
> Depending on the workload, the maximum limit can exceed the 1.2 * 0.1s.
> 
> For "dead - after", the parent process kills the child process and waits long
> enough to let the child finish dying. Then it gets the time of the child:
> diff = dead - after
> Note that diff also contains the time for the previous clock_nanosleep.
> Thus you'll often see both fails at the same time.

Agreed.

> After discussion on the mailing list, we've decided to keep the functional
> checks for the clock* functions and remove the timing related checks as those
> are prone to false positives.

Agreed.

> ---
>  time/tst-cpuclock1.c | 65 ++++----------------------------------------
>  1 file changed, 6 insertions(+), 59 deletions(-)

Removes 3 support_timespec_check_in_range() checks, and one clock_gettime
failure check for the after time (which we on longer care about).

OK for master.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>

> 
> diff --git a/time/tst-cpuclock1.c b/time/tst-cpuclock1.c
> index 1ac611a92b..f40b590111 100644
> --- a/time/tst-cpuclock1.c
> +++ b/time/tst-cpuclock1.c
> @@ -26,7 +26,6 @@
>  #include <signal.h>
>  #include <stdint.h>
>  #include <sys/wait.h>
> -#include <support/timespec.h>

OK. Don't need the macros any more.

>  
>  /* This function is intended to rack up both user and system time.  */
>  static void
> @@ -125,7 +124,7 @@ do_test (void)
>  	  child, (unsigned long int) child_clock,
>  	  (uintmax_t) res.tv_sec, (uintmax_t) res.tv_nsec);
>  
> -  struct timespec before, after;
> +  struct timespec before;

OK.

>    if (clock_gettime (child_clock, &before) < 0)
>      {
>        printf ("clock_gettime on live PID %d clock %lx => %s\n",
> @@ -137,38 +136,7 @@ do_test (void)
>    printf ("live PID %d before sleep => %ju.%.9ju\n",
>  	  child, (uintmax_t) before.tv_sec, (uintmax_t) before.tv_nsec);
>  
> -  struct timespec sleeptime = { .tv_nsec = 500000000 };
> -  if (nanosleep (&sleeptime, NULL) != 0)
> -    {
> -      perror ("nanosleep");
> -      result = 1;
> -      goto done;
> -    }
> -
> -  if (clock_gettime (child_clock, &after) < 0)
> -    {
> -      printf ("clock_gettime on live PID %d clock %lx => %s\n",
> -	      child, (unsigned long int) child_clock, strerror (errno));
> -      result = 1;

OK. Remove check 1/4. Not needed.

> -      goto done;
> -    }
> -  /* Should be close to 0.5.  */
> -  printf ("live PID %d after sleep => %ju.%.9ju\n",
> -	  child, (uintmax_t) after.tv_sec, (uintmax_t) after.tv_nsec);
> -
> -  /* The bound values are empirically defined by testing this code over high cpu
> -     usage and different nice values. Of all the values we keep the 90th
> -     percentile of values and use those values for our testing allowed range.  */
> -  struct timespec diff = timespec_sub (support_timespec_normalize (after),
> -				       support_timespec_normalize (before));
> -  if (!support_timespec_check_in_range (sleeptime, diff, .9,  1.1))
> -    {
> -      printf ("before - after %ju.%.9ju outside reasonable range\n",
> -	      (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec);
> -      result = 1;
> -    }

OK. Remove check 2/4.

> -
> -  sleeptime.tv_nsec = 100000000;
> +  struct timespec sleeptime = { .tv_nsec = 100000000 };

OK.

>    e = clock_nanosleep (child_clock, 0, &sleeptime, NULL);
>    if (e == EINVAL || e == ENOTSUP || e == ENOSYS)
>      {
> @@ -191,18 +159,9 @@ do_test (void)
>  	}
>        else
>  	{
> -        /* The bound values are empirically defined by testing this code over
> -           high cpu usage and different nice values. Of all the values we keep
> -           the 90th percentile of values and use those values for our testing
> -           allowed range.  */
> -	  diff = timespec_sub (support_timespec_normalize (afterns),
> -			       support_timespec_normalize (after));
> -	  if (!support_timespec_check_in_range (sleeptime, diff, .9, 1.2))
> -	    {
> -	      printf ("nanosleep time %ju.%.9ju outside reasonable range\n",
> -		      (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec);
> -	      result = 1;

OK. Remove check 3/4.

> -	    }
> +	  printf ("live PID %d after sleep => %ju.%.9ju\n",
> +		  child, (uintmax_t) afterns.tv_sec,
> +		  (uintmax_t) afterns.tv_nsec);

OK. Informative.

>  	}
>      }
>  
> @@ -231,21 +190,9 @@ do_test (void)
>        result = 1;
>        goto done;
>      }
> -  /* Should be close to 0.6.  */
> +  /* Should be close to 0.1.  */
>    printf ("dead PID %d => %ju.%.9ju\n",
>  	  child, (uintmax_t) dead.tv_sec, (uintmax_t) dead.tv_nsec);

OK. Informative.

> -  /* The bound values are empirically defined by testing this code over high cpu
> -     usage and different nice values. Of all the values we keep the 90th
> -     percentile of values and use those values for our testing allowed range.  */
> -  diff = timespec_sub (support_timespec_normalize (dead),
> -		       support_timespec_normalize (after));
> -  sleeptime.tv_nsec = 100000000;
> -  if (!support_timespec_check_in_range (sleeptime, diff, .9, 1.2))
> -    {
> -      printf ("dead - after %ju.%.9ju outside reasonable range\n",
> -	      (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec);
> -      result = 1;
> -    }

OK. Remove check 4/4.

>  
>    /* Now reap the child and verify that its clock is no longer valid.  */
>    {
>
Stefan Liebler Oct. 26, 2020, 9:55 a.m. UTC | #2
On 10/23/20 3:21 PM, Carlos O'Donell wrote:
> On 10/23/20 9:07 AM, Stefan Liebler via Libc-alpha wrote:
>> Starting with the commit 04deeaa9ea74b0679dfc9d9155a37b6425f19a9f
>> "Fix time/tst-cpuclock1 intermitent failures" (2020-07-11),
>> this test fails quite often on s390x/s390 with one/multiple of those:
>> "before - after" / "nanosleep time" / "dead - after" ourside reasonable range.
>>
>> On a zVM/kvm guest the CPUs are shared between multiple guests.
>> And even on the lpar (kvm host) the CPUs are usually shared between multiple lpars.
>> The defined CPUs for a lpar/zVM-system could also have lower weights compared
>> to other lpars which let the steal time further grow.
>>
>> Usually I build (-j$(nproc)) and test (PARALLELMFLAGS="-j$(nproc)") glibc multiple
>> times, e.g. with different GCCs, on various lpars or zVM guests at the same time.
>> During this time, I've run the test for 13500 times and obvserved the following fails:
>> ~600x "before - after"
>> ~60x "nanosleep time"
>> ~70x "dead - after"
> 
> I agree that this is a common failure on s390x at Red Hat also during high load
> times on our lpars.
>  
>> I've also observed a lot of "before - after" fails on a intel kvm-guest while
>> building/testing glibc on it.
>>
>> The mentioned commit has tighten the limits of valid tv_nsec ranges:
> 
> Correct, because we have no data on how *bad* the timing can be.
> 
>> "before - after" (expected: 500000000):
>> - 100000000 ... 600000000
>> + 450000000 ... 550000000
>>
>> "nanosleep time" (expected: 100000000):
>> - 100000000 ... 200000000
>> + 090000000 ... 120000000
>>
>> "dead - after" (expected: 100000000):
>> -           ... 200000000
>> + 090000000 ... 120000000
>>
>> The test itself forks a child process which chew_cpu (user- and kernel-space).
>> The parent process sleeps with nanosleep(0.5s) and measures the child_clock time:
>> diff = after - before
>> With much workload on the machine, the child won't make much progess
>> and it can fall much beyond the minimum limit.
> 
> Correct.
> 
> I don't think this test is giving us any value.
>  
>> Afterwards the parent process sleeps with clock_nanosleep (child_clock, 0.1s):
>> diff = afterns - after
>> The test currently also allows 0.9 * 0.1s which would be an error.
>> Depending on the workload, the maximum limit can exceed the 1.2 * 0.1s.
>>
>> For "dead - after", the parent process kills the child process and waits long
>> enough to let the child finish dying. Then it gets the time of the child:
>> diff = dead - after
>> Note that diff also contains the time for the previous clock_nanosleep.
>> Thus you'll often see both fails at the same time.
> 
> Agreed.
> 
>> After discussion on the mailing list, we've decided to keep the functional
>> checks for the clock* functions and remove the timing related checks as those
>> are prone to false positives.
> 
> Agreed.
> 
>> ---
>>  time/tst-cpuclock1.c | 65 ++++----------------------------------------
>>  1 file changed, 6 insertions(+), 59 deletions(-)
> 
> Removes 3 support_timespec_check_in_range() checks, and one clock_gettime
> failure check for the after time (which we on longer care about).
> 
> OK for master.
> 
> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
> 
>>
>> diff --git a/time/tst-cpuclock1.c b/time/tst-cpuclock1.c
>> index 1ac611a92b..f40b590111 100644
>> --- a/time/tst-cpuclock1.c
>> +++ b/time/tst-cpuclock1.c
>> @@ -26,7 +26,6 @@
>>  #include <signal.h>
>>  #include <stdint.h>
>>  #include <sys/wait.h>
>> -#include <support/timespec.h>
> 
> OK. Don't need the macros any more.
> 
>>  
>>  /* This function is intended to rack up both user and system time.  */
>>  static void
>> @@ -125,7 +124,7 @@ do_test (void)
>>  	  child, (unsigned long int) child_clock,
>>  	  (uintmax_t) res.tv_sec, (uintmax_t) res.tv_nsec);
>>  
>> -  struct timespec before, after;
>> +  struct timespec before;
> 
> OK.
> 
>>    if (clock_gettime (child_clock, &before) < 0)
>>      {
>>        printf ("clock_gettime on live PID %d clock %lx => %s\n",
>> @@ -137,38 +136,7 @@ do_test (void)
>>    printf ("live PID %d before sleep => %ju.%.9ju\n",
>>  	  child, (uintmax_t) before.tv_sec, (uintmax_t) before.tv_nsec);
>>  
>> -  struct timespec sleeptime = { .tv_nsec = 500000000 };
>> -  if (nanosleep (&sleeptime, NULL) != 0)
>> -    {
>> -      perror ("nanosleep");
>> -      result = 1;
>> -      goto done;
>> -    }
>> -
>> -  if (clock_gettime (child_clock, &after) < 0)
>> -    {
>> -      printf ("clock_gettime on live PID %d clock %lx => %s\n",
>> -	      child, (unsigned long int) child_clock, strerror (errno));
>> -      result = 1;
> 
> OK. Remove check 1/4. Not needed.
> 
>> -      goto done;
>> -    }
>> -  /* Should be close to 0.5.  */
>> -  printf ("live PID %d after sleep => %ju.%.9ju\n",
>> -	  child, (uintmax_t) after.tv_sec, (uintmax_t) after.tv_nsec);
>> -
>> -  /* The bound values are empirically defined by testing this code over high cpu
>> -     usage and different nice values. Of all the values we keep the 90th
>> -     percentile of values and use those values for our testing allowed range.  */
>> -  struct timespec diff = timespec_sub (support_timespec_normalize (after),
>> -				       support_timespec_normalize (before));
>> -  if (!support_timespec_check_in_range (sleeptime, diff, .9,  1.1))
>> -    {
>> -      printf ("before - after %ju.%.9ju outside reasonable range\n",
>> -	      (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec);
>> -      result = 1;
>> -    }
> 
> OK. Remove check 2/4.
> 
>> -
>> -  sleeptime.tv_nsec = 100000000;
>> +  struct timespec sleeptime = { .tv_nsec = 100000000 };
> 
> OK.
> 
>>    e = clock_nanosleep (child_clock, 0, &sleeptime, NULL);
>>    if (e == EINVAL || e == ENOTSUP || e == ENOSYS)
>>      {
>> @@ -191,18 +159,9 @@ do_test (void)
>>  	}
>>        else
>>  	{
>> -        /* The bound values are empirically defined by testing this code over
>> -           high cpu usage and different nice values. Of all the values we keep
>> -           the 90th percentile of values and use those values for our testing
>> -           allowed range.  */
>> -	  diff = timespec_sub (support_timespec_normalize (afterns),
>> -			       support_timespec_normalize (after));
>> -	  if (!support_timespec_check_in_range (sleeptime, diff, .9, 1.2))
>> -	    {
>> -	      printf ("nanosleep time %ju.%.9ju outside reasonable range\n",
>> -		      (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec);
>> -	      result = 1;
> 
> OK. Remove check 3/4.
> 
>> -	    }
>> +	  printf ("live PID %d after sleep => %ju.%.9ju\n",
>> +		  child, (uintmax_t) afterns.tv_sec,
>> +		  (uintmax_t) afterns.tv_nsec);
> 
> OK. Informative.
> 
>>  	}
>>      }
>>  
>> @@ -231,21 +190,9 @@ do_test (void)
>>        result = 1;
>>        goto done;
>>      }
>> -  /* Should be close to 0.6.  */
>> +  /* Should be close to 0.1.  */
>>    printf ("dead PID %d => %ju.%.9ju\n",
>>  	  child, (uintmax_t) dead.tv_sec, (uintmax_t) dead.tv_nsec);
> 
> OK. Informative.
> 
>> -  /* The bound values are empirically defined by testing this code over high cpu
>> -     usage and different nice values. Of all the values we keep the 90th
>> -     percentile of values and use those values for our testing allowed range.  */
>> -  diff = timespec_sub (support_timespec_normalize (dead),
>> -		       support_timespec_normalize (after));
>> -  sleeptime.tv_nsec = 100000000;
>> -  if (!support_timespec_check_in_range (sleeptime, diff, .9, 1.2))
>> -    {
>> -      printf ("dead - after %ju.%.9ju outside reasonable range\n",
>> -	      (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec);
>> -      result = 1;
>> -    }
> 
> OK. Remove check 4/4.
> 
>>  
>>    /* Now reap the child and verify that its clock is no longer valid.  */
>>    {
>>
> 
> 


Committed.
Thanks,
Stefan
diff mbox series

Patch

diff = after - before
With much workload on the machine, the child won't make much progess
and it can fall much beyond the minimum limit.

Afterwards the parent process sleeps with clock_nanosleep (child_clock, 0.1s):
diff = afterns - after
The test currently also allows 0.9 * 0.1s which would be an error.
Depending on the workload, the maximum limit can exceed the 1.2 * 0.1s.

For "dead - after", the parent process kills the child process and waits long
enough to let the child finish dying. Then it gets the time of the child:
diff = dead - after
Note that diff also contains the time for the previous clock_nanosleep.
Thus you'll often see both fails at the same time.

After discussion on the mailing list, we've decided to keep the functional
checks for the clock* functions and remove the timing related checks as those
are prone to false positives.
---
 time/tst-cpuclock1.c | 65 ++++----------------------------------------
 1 file changed, 6 insertions(+), 59 deletions(-)

diff --git a/time/tst-cpuclock1.c b/time/tst-cpuclock1.c
index 1ac611a92b..f40b590111 100644
--- a/time/tst-cpuclock1.c
+++ b/time/tst-cpuclock1.c
@@ -26,7 +26,6 @@ 
 #include <signal.h>
 #include <stdint.h>
 #include <sys/wait.h>
-#include <support/timespec.h>
 
 /* This function is intended to rack up both user and system time.  */
 static void
@@ -125,7 +124,7 @@  do_test (void)
 	  child, (unsigned long int) child_clock,
 	  (uintmax_t) res.tv_sec, (uintmax_t) res.tv_nsec);
 
-  struct timespec before, after;
+  struct timespec before;
   if (clock_gettime (child_clock, &before) < 0)
     {
       printf ("clock_gettime on live PID %d clock %lx => %s\n",
@@ -137,38 +136,7 @@  do_test (void)
   printf ("live PID %d before sleep => %ju.%.9ju\n",
 	  child, (uintmax_t) before.tv_sec, (uintmax_t) before.tv_nsec);
 
-  struct timespec sleeptime = { .tv_nsec = 500000000 };
-  if (nanosleep (&sleeptime, NULL) != 0)
-    {
-      perror ("nanosleep");
-      result = 1;
-      goto done;
-    }
-
-  if (clock_gettime (child_clock, &after) < 0)
-    {
-      printf ("clock_gettime on live PID %d clock %lx => %s\n",
-	      child, (unsigned long int) child_clock, strerror (errno));
-      result = 1;
-      goto done;
-    }
-  /* Should be close to 0.5.  */
-  printf ("live PID %d after sleep => %ju.%.9ju\n",
-	  child, (uintmax_t) after.tv_sec, (uintmax_t) after.tv_nsec);
-
-  /* The bound values are empirically defined by testing this code over high cpu
-     usage and different nice values. Of all the values we keep the 90th
-     percentile of values and use those values for our testing allowed range.  */
-  struct timespec diff = timespec_sub (support_timespec_normalize (after),
-				       support_timespec_normalize (before));
-  if (!support_timespec_check_in_range (sleeptime, diff, .9,  1.1))
-    {
-      printf ("before - after %ju.%.9ju outside reasonable range\n",
-	      (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec);
-      result = 1;
-    }
-
-  sleeptime.tv_nsec = 100000000;
+  struct timespec sleeptime = { .tv_nsec = 100000000 };
   e = clock_nanosleep (child_clock, 0, &sleeptime, NULL);
   if (e == EINVAL || e == ENOTSUP || e == ENOSYS)
     {
@@ -191,18 +159,9 @@  do_test (void)
 	}
       else
 	{
-        /* The bound values are empirically defined by testing this code over
-           high cpu usage and different nice values. Of all the values we keep
-           the 90th percentile of values and use those values for our testing
-           allowed range.  */
-	  diff = timespec_sub (support_timespec_normalize (afterns),
-			       support_timespec_normalize (after));
-	  if (!support_timespec_check_in_range (sleeptime, diff, .9, 1.2))
-	    {
-	      printf ("nanosleep time %ju.%.9ju outside reasonable range\n",
-		      (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec);
-	      result = 1;
-	    }
+	  printf ("live PID %d after sleep => %ju.%.9ju\n",
+		  child, (uintmax_t) afterns.tv_sec,
+		  (uintmax_t) afterns.tv_nsec);
 	}
     }
 
@@ -231,21 +190,9 @@  do_test (void)
       result = 1;
       goto done;
     }
-  /* Should be close to 0.6.  */
+  /* Should be close to 0.1.  */
   printf ("dead PID %d => %ju.%.9ju\n",
 	  child, (uintmax_t) dead.tv_sec, (uintmax_t) dead.tv_nsec);
-  /* The bound values are empirically defined by testing this code over high cpu
-     usage and different nice values. Of all the values we keep the 90th
-     percentile of values and use those values for our testing allowed range.  */
-  diff = timespec_sub (support_timespec_normalize (dead),
-		       support_timespec_normalize (after));
-  sleeptime.tv_nsec = 100000000;
-  if (!support_timespec_check_in_range (sleeptime, diff, .9, 1.2))
-    {
-      printf ("dead - after %ju.%.9ju outside reasonable range\n",
-	      (uintmax_t) diff.tv_sec, (uintmax_t) diff.tv_nsec);
-      result = 1;
-    }
 
   /* Now reap the child and verify that its clock is no longer valid.  */
   {