diff mbox series

fzsync: skip test when avaliable CPUs less than 2

Message ID 20201125101633.30154-1-liwang@redhat.com
State Superseded
Headers show
Series fzsync: skip test when avaliable CPUs less than 2 | expand

Commit Message

Li Wang Nov. 25, 2020, 10:16 a.m. UTC
It makes no sense to run parallel thread to simulate race conditions on
system with CPU number less than two, especially for kvm guest, it does
not have any chance to get real parallel running and probably encounter
failure as below:

=== 100% reproducible on a 1cpu guest ===

cmdline="af_alg07"
contacts=""
analysis=exit
<<<test_output>>>
tst_test.c:1248: TINFO: Timeout per run is 0h 05m 00s
../../../include/tst_fuzzy_sync.h:507: TINFO: Minimum sampling period ended
../../../include/tst_fuzzy_sync.h:330: TINFO: loop = 1024, delay_bias = 0
../../../include/tst_fuzzy_sync.h:318: TINFO: start_a - start_b: { avg = -137522ns, avg_dev = 854248ns, dev_ratio = 6.21 }
../../../include/tst_fuzzy_sync.h:318: TINFO: end_a - start_a  : { avg =  1915ns, avg_dev =   535ns, dev_ratio = 0.28 }
../../../include/tst_fuzzy_sync.h:318: TINFO: end_b - start_b  : { avg =  1885ns, avg_dev =    42ns, dev_ratio = 0.02 }
../../../include/tst_fuzzy_sync.h:318: TINFO: end_a - end_b    : { avg = -137492ns, avg_dev = 854818ns, dev_ratio = 6.22 }
../../../include/tst_fuzzy_sync.h:318: TINFO: spins            : { avg = 554786  , avg_dev =  7355  , dev_ratio = 0.01 }
../../../include/tst_fuzzy_sync.h:636: TINFO: Exceeded execution time, requesting exit
af_alg07.c:96: TFAIL: fchownat() failed to fail, kernel may be vulnerable

Signed-off-by: Li Wang <liwang@redhat.com>
CC: Richard Palethorpe <rpalethorpe@suse.de>
---
 include/tst_fuzzy_sync.h | 3 +++
 1 file changed, 3 insertions(+)

Comments

Richard Palethorpe Nov. 25, 2020, 11:22 a.m. UTC | #1
Hello Li,

Li Wang <liwang@redhat.com> writes:

> It makes no sense to run parallel thread to simulate race conditions on
> system with CPU number less than two, especially for kvm guest, it does
> not have any chance to get real parallel running and probably encounter
> failure as below:

Most of the tests using FuzzySync do not need true parallism. We were
able to reproduce a number of race conditions on a single vCPU. Infact
it may actually benefit some races because one thread has to pause to
allow the other to run, perhaps creating a huge race window.

>
> === 100% reproducible on a 1cpu guest ===
>
> cmdline="af_alg07"
> contacts=""
> analysis=exit
> <<<test_output>>>
> tst_test.c:1248: TINFO: Timeout per run is 0h 05m 00s
> ../../../include/tst_fuzzy_sync.h:507: TINFO: Minimum sampling period ended
> ../../../include/tst_fuzzy_sync.h:330: TINFO: loop = 1024, delay_bias = 0
> ../../../include/tst_fuzzy_sync.h:318: TINFO: start_a - start_b: { avg = -137522ns, avg_dev = 854248ns, dev_ratio = 6.21 }
> ../../../include/tst_fuzzy_sync.h:318: TINFO: end_a - start_a  : { avg =  1915ns, avg_dev =   535ns, dev_ratio = 0.28 }
> ../../../include/tst_fuzzy_sync.h:318: TINFO: end_b - start_b  : { avg =  1885ns, avg_dev =    42ns, dev_ratio = 0.02 }
> ../../../include/tst_fuzzy_sync.h:318: TINFO: end_a - end_b    : { avg = -137492ns, avg_dev = 854818ns, dev_ratio = 6.22 }
> ../../../include/tst_fuzzy_sync.h:318: TINFO: spins            : { avg = 554786  , avg_dev =  7355  , dev_ratio = 0.01 }
> ../../../include/tst_fuzzy_sync.h:636: TINFO: Exceeded execution time, requesting exit
> af_alg07.c:96: TFAIL: fchownat() failed to fail, kernel may be vulnerable
>
> Signed-off-by: Li Wang <liwang@redhat.com>
> CC: Richard Palethorpe <rpalethorpe@suse.de>
> ---
>  include/tst_fuzzy_sync.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/include/tst_fuzzy_sync.h b/include/tst_fuzzy_sync.h
> index 4141f5c64..2e864b312 100644
> --- a/include/tst_fuzzy_sync.h
> +++ b/include/tst_fuzzy_sync.h
> @@ -281,6 +281,9 @@ static void tst_init_stat(struct tst_fzsync_stat *s)
>  static void tst_fzsync_pair_reset(struct tst_fzsync_pair *pair,
>  				  void *(*run_b)(void *))
>  {
> +	if (get_nprocs() < 2)
> +		tst_brk(TCONF, "Fuzzy Sync requires at least two CPUs available");
> +
>  	tst_fzsync_pair_cleanup(pair);
>  
>  	tst_init_stat(&pair->diff_ss);

Perhaps this test would pass with more loops and a big enough delay
range, but this is also wasting time on a single vCPU. I'm not sure
whether we should filter this test at the LTP level; it may trigger the
bug on some single CPU configs.

Why not print a warning instead of refusing to run?
Cyril Hrubis Nov. 25, 2020, 11:54 a.m. UTC | #2
Hi!
> Perhaps this test would pass with more loops and a big enough delay
> range, but this is also wasting time on a single vCPU. I'm not sure
> whether we should filter this test at the LTP level; it may trigger the
> bug on some single CPU configs.
> 
> Why not print a warning instead of refusing to run?

That's not a solution either, warning would end up in a test results as
well.

I guess that we can add something as .min_cpus to the tst_test structure
and set it for this test?
Martin Doucha Nov. 25, 2020, 11:56 a.m. UTC | #3
On 25. 11. 20 12:22, Richard Palethorpe wrote:
> Hello Li,
> 
> Li Wang <liwang@redhat.com> writes:
> 
>> It makes no sense to run parallel thread to simulate race conditions on
>> system with CPU number less than two, especially for kvm guest, it does
>> not have any chance to get real parallel running and probably encounter
>> failure as below:
> 
> Most of the tests using FuzzySync do not need true parallism. We were
> able to reproduce a number of race conditions on a single vCPU. Infact
> it may actually benefit some races because one thread has to pause to
> allow the other to run, perhaps creating a huge race window.
> 
>>
>> === 100% reproducible on a 1cpu guest ===
>>
>> cmdline="af_alg07"
>> contacts=""
>> analysis=exit
>> <<<test_output>>>
>> tst_test.c:1248: TINFO: Timeout per run is 0h 05m 00s
>> ../../../include/tst_fuzzy_sync.h:507: TINFO: Minimum sampling period ended
>> ../../../include/tst_fuzzy_sync.h:330: TINFO: loop = 1024, delay_bias = 0
>> ../../../include/tst_fuzzy_sync.h:318: TINFO: start_a - start_b: { avg = -137522ns, avg_dev = 854248ns, dev_ratio = 6.21 }
>> ../../../include/tst_fuzzy_sync.h:318: TINFO: end_a - start_a  : { avg =  1915ns, avg_dev =   535ns, dev_ratio = 0.28 }
>> ../../../include/tst_fuzzy_sync.h:318: TINFO: end_b - start_b  : { avg =  1885ns, avg_dev =    42ns, dev_ratio = 0.02 }
>> ../../../include/tst_fuzzy_sync.h:318: TINFO: end_a - end_b    : { avg = -137492ns, avg_dev = 854818ns, dev_ratio = 6.22 }
>> ../../../include/tst_fuzzy_sync.h:318: TINFO: spins            : { avg = 554786  , avg_dev =  7355  , dev_ratio = 0.01 }
>> ../../../include/tst_fuzzy_sync.h:636: TINFO: Exceeded execution time, requesting exit
>> af_alg07.c:96: TFAIL: fchownat() failed to fail, kernel may be vulnerable
>>
>> Signed-off-by: Li Wang <liwang@redhat.com>
>> CC: Richard Palethorpe <rpalethorpe@suse.de>
>> ---
>>  include/tst_fuzzy_sync.h | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/include/tst_fuzzy_sync.h b/include/tst_fuzzy_sync.h
>> index 4141f5c64..2e864b312 100644
>> --- a/include/tst_fuzzy_sync.h
>> +++ b/include/tst_fuzzy_sync.h
>> @@ -281,6 +281,9 @@ static void tst_init_stat(struct tst_fzsync_stat *s)
>>  static void tst_fzsync_pair_reset(struct tst_fzsync_pair *pair,
>>  				  void *(*run_b)(void *))
>>  {
>> +	if (get_nprocs() < 2)
>> +		tst_brk(TCONF, "Fuzzy Sync requires at least two CPUs available");
>> +
>>  	tst_fzsync_pair_cleanup(pair);
>>  
>>  	tst_init_stat(&pair->diff_ss);
> 
> Perhaps this test would pass with more loops and a big enough delay
> range, but this is also wasting time on a single vCPU. I'm not sure
> whether we should filter this test at the LTP level; it may trigger the
> bug on some single CPU configs.

No, af_alg07 requires 2 CPUs, otherwise it'll report false positives.
The test will pass only if fchownat() hits a half-closed socket and
returns error. But IIRC the half-closed socket will be destroyed during
reschedule which means there's no race window to hit anymore. But it
would be better to put the TCONF condition into the test itself.
Martin Doucha Nov. 25, 2020, 11:57 a.m. UTC | #4
On 25. 11. 20 12:54, Cyril Hrubis wrote:
> Hi!
>> Perhaps this test would pass with more loops and a big enough delay
>> range, but this is also wasting time on a single vCPU. I'm not sure
>> whether we should filter this test at the LTP level; it may trigger the
>> bug on some single CPU configs.
>>
>> Why not print a warning instead of refusing to run?
> 
> That's not a solution either, warning would end up in a test results as
> well.
> 
> I guess that we can add something as .min_cpus to the tst_test structure
> and set it for this test?

+1 for .min_cpus from me.
Li Wang Nov. 25, 2020, 12:50 p.m. UTC | #5
On Wed, Nov 25, 2020 at 7:56 PM Martin Doucha <mdoucha@suse.cz> wrote:

> On 25. 11. 20 12:22, Richard Palethorpe wrote:
> > Hello Li,
> >
> > Li Wang <liwang@redhat.com> writes:
> >
> >> It makes no sense to run parallel thread to simulate race conditions on
> >> system with CPU number less than two, especially for kvm guest, it does
> >> not have any chance to get real parallel running and probably encounter
> >> failure as below:
> >
> > Most of the tests using FuzzySync do not need true parallism. We were
> > able to reproduce a number of race conditions on a single vCPU. Infact
> > it may actually benefit some races because one thread has to pause to
> > allow the other to run, perhaps creating a huge race window.

>
> >>
> >> === 100% reproducible on a 1cpu guest ===
> >>
> >> cmdline="af_alg07"
> >> contacts=""
> >> analysis=exit
> >> <<<test_output>>>
> >> tst_test.c:1248: TINFO: Timeout per run is 0h 05m 00s
> >> ../../../include/tst_fuzzy_sync.h:507: TINFO: Minimum sampling period
> ended
> >> ../../../include/tst_fuzzy_sync.h:330: TINFO: loop = 1024, delay_bias =
> 0
> >> ../../../include/tst_fuzzy_sync.h:318: TINFO: start_a - start_b: { avg
> = -137522ns, avg_dev = 854248ns, dev_ratio = 6.21 }
> >> ../../../include/tst_fuzzy_sync.h:318: TINFO: end_a - start_a  : { avg
> =  1915ns, avg_dev =   535ns, dev_ratio = 0.28 }
> >> ../../../include/tst_fuzzy_sync.h:318: TINFO: end_b - start_b  : { avg
> =  1885ns, avg_dev =    42ns, dev_ratio = 0.02 }
> >> ../../../include/tst_fuzzy_sync.h:318: TINFO: end_a - end_b    : { avg
> = -137492ns, avg_dev = 854818ns, dev_ratio = 6.22 }
> >> ../../../include/tst_fuzzy_sync.h:318: TINFO: spins            : { avg
> = 554786  , avg_dev =  7355  , dev_ratio = 0.01 }
> >> ../../../include/tst_fuzzy_sync.h:636: TINFO: Exceeded execution time,
> requesting exit
> >> af_alg07.c:96: TFAIL: fchownat() failed to fail, kernel may be
> vulnerable
> >>
> >> Signed-off-by: Li Wang <liwang@redhat.com>
> >> CC: Richard Palethorpe <rpalethorpe@suse.de>
> >> ---
> >>  include/tst_fuzzy_sync.h | 3 +++
> >>  1 file changed, 3 insertions(+)
> >>
> >> diff --git a/include/tst_fuzzy_sync.h b/include/tst_fuzzy_sync.h
> >> index 4141f5c64..2e864b312 100644
> >> --- a/include/tst_fuzzy_sync.h
> >> +++ b/include/tst_fuzzy_sync.h
> >> @@ -281,6 +281,9 @@ static void tst_init_stat(struct tst_fzsync_stat *s)
> >>  static void tst_fzsync_pair_reset(struct tst_fzsync_pair *pair,
> >>                                void *(*run_b)(void *))
> >>  {
> >> +    if (get_nprocs() < 2)
> >> +            tst_brk(TCONF, "Fuzzy Sync requires at least two CPUs
> available");
> >> +
> >>      tst_fzsync_pair_cleanup(pair);
> >>
> >>      tst_init_stat(&pair->diff_ss);
> >
> > Perhaps this test would pass with more loops and a big enough delay
> > range, but this is also wasting time on a single vCPU. I'm not sure
> > whether we should filter this test at the LTP level; it may trigger the
> > bug on some single CPU configs.
>
> No, af_alg07 requires 2 CPUs, otherwise it'll report false positives.
> The test will pass only if fchownat() hits a half-closed socket and
> returns error. But IIRC the half-closed socket will be destroyed during
> reschedule which means there's no race window to hit anymore. But it
> would be better to put the TCONF condition into the test itself.
>

+1
Correct, I stand by Martin's point.

And we can avoid adding this patch to FuzzySync lib, but for af_alg07 2cpus
is required.
(maybe go with Cyril's suggest to add .min_cpus)
Cyril Hrubis Nov. 25, 2020, 1:13 p.m. UTC | #6
Hi!
> And we can avoid adding this patch to FuzzySync lib, but for af_alg07 2cpus
> is required.
> (maybe go with Cyril's suggest to add .min_cpus)

I would go for that and we can make use of it in getcwd04.c as well.
Richard Palethorpe Nov. 25, 2020, 1:23 p.m. UTC | #7
Hello,

Martin Doucha <mdoucha@suse.cz> writes:

> On 25. 11. 20 12:22, Richard Palethorpe wrote:
>> Hello Li,
>> 
>> Li Wang <liwang@redhat.com> writes:
>> 
>>> It makes no sense to run parallel thread to simulate race conditions on
>>> system with CPU number less than two, especially for kvm guest, it does
>>> not have any chance to get real parallel running and probably encounter
>>> failure as below:
>> 
>> Most of the tests using FuzzySync do not need true parallism. We were
>> able to reproduce a number of race conditions on a single vCPU. Infact
>> it may actually benefit some races because one thread has to pause to
>> allow the other to run, perhaps creating a huge race window.
>> 
>>>
>>> === 100% reproducible on a 1cpu guest ===
>>>
>>> cmdline="af_alg07"
>>> contacts=""
>>> analysis=exit
>>> <<<test_output>>>
>>> tst_test.c:1248: TINFO: Timeout per run is 0h 05m 00s
>>> ../../../include/tst_fuzzy_sync.h:507: TINFO: Minimum sampling period ended
>>> ../../../include/tst_fuzzy_sync.h:330: TINFO: loop = 1024, delay_bias = 0
>>> ../../../include/tst_fuzzy_sync.h:318: TINFO: start_a - start_b: { avg = -137522ns, avg_dev = 854248ns, dev_ratio = 6.21 }
>>> ../../../include/tst_fuzzy_sync.h:318: TINFO: end_a - start_a  : { avg =  1915ns, avg_dev =   535ns, dev_ratio = 0.28 }
>>> ../../../include/tst_fuzzy_sync.h:318: TINFO: end_b - start_b  : { avg =  1885ns, avg_dev =    42ns, dev_ratio = 0.02 }
>>> ../../../include/tst_fuzzy_sync.h:318: TINFO: end_a - end_b    : { avg = -137492ns, avg_dev = 854818ns, dev_ratio = 6.22 }
>>> ../../../include/tst_fuzzy_sync.h:318: TINFO: spins            : { avg = 554786  , avg_dev =  7355  , dev_ratio = 0.01 }
>>> ../../../include/tst_fuzzy_sync.h:636: TINFO: Exceeded execution time, requesting exit
>>> af_alg07.c:96: TFAIL: fchownat() failed to fail, kernel may be vulnerable
>>>
>>> Signed-off-by: Li Wang <liwang@redhat.com>
>>> CC: Richard Palethorpe <rpalethorpe@suse.de>
>>> ---
>>>  include/tst_fuzzy_sync.h | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/include/tst_fuzzy_sync.h b/include/tst_fuzzy_sync.h
>>> index 4141f5c64..2e864b312 100644
>>> --- a/include/tst_fuzzy_sync.h
>>> +++ b/include/tst_fuzzy_sync.h
>>> @@ -281,6 +281,9 @@ static void tst_init_stat(struct tst_fzsync_stat *s)
>>>  static void tst_fzsync_pair_reset(struct tst_fzsync_pair *pair,
>>>  				  void *(*run_b)(void *))
>>>  {
>>> +	if (get_nprocs() < 2)
>>> +		tst_brk(TCONF, "Fuzzy Sync requires at least two CPUs available");
>>> +
>>>  	tst_fzsync_pair_cleanup(pair);
>>>  
>>>  	tst_init_stat(&pair->diff_ss);
>> 
>> Perhaps this test would pass with more loops and a big enough delay
>> range, but this is also wasting time on a single vCPU. I'm not sure
>> whether we should filter this test at the LTP level; it may trigger the
>> bug on some single CPU configs.
>
> No, af_alg07 requires 2 CPUs, otherwise it'll report false positives.
> The test will pass only if fchownat() hits a half-closed socket and
> returns error. But IIRC the half-closed socket will be destroyed during
> reschedule which means there's no race window to hit anymore. But it
> would be better to put the TCONF condition into the test itself.

Interesting, I wonder if this is also true for the real-time kernel with
the threads set to RT priority?
Joerg Vehlow Nov. 30, 2020, 7:53 a.m. UTC | #8
Hi,
>> No, af_alg07 requires 2 CPUs, otherwise it'll report false positives.
>> The test will pass only if fchownat() hits a half-closed socket and
>> returns error. But IIRC the half-closed socket will be destroyed during
>> reschedule which means there's no race window to hit anymore. But it
>> would be better to put the TCONF condition into the test itself.
> Interesting, I wonder if this is also true for the real-time kernel with
> the threads set to RT priority?
It looks like the test can fail even with more than one cpu. I've seen 
this sporadic failure on different hardware with more than two cores, at 
least on intel denverton (x86_64) and renesas r-car (aarch64) systems. 
Both with kernel 4.19 with the fix included, on the denverton system the 
rt parches were included and on the r-car not. The test passes most of 
the time, but sometimes fails with the message Li posted.

It also seems to fail sporadically on other systems as well: 
https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1892860

Additionally I tested on qemu-x86 with 4.19 with and without rt patches. 
The test succeeds even with only one virtualized cpu. So either Martin's 
assumption is wrong or it holds only for newer kernel versions?

Jörg
Li Wang Nov. 30, 2020, 8:14 a.m. UTC | #9
Hi Joerg,

On Mon, Nov 30, 2020 at 3:53 PM Joerg Vehlow <lkml@jv-coder.de> wrote:

> Hi,
> >> No, af_alg07 requires 2 CPUs, otherwise it'll report false positives.
> >> The test will pass only if fchownat() hits a half-closed socket and
> >> returns error. But IIRC the half-closed socket will be destroyed during
> >> reschedule which means there's no race window to hit anymore. But it
> >> would be better to put the TCONF condition into the test itself.
> > Interesting, I wonder if this is also true for the real-time kernel with
> > the threads set to RT priority?
> It looks like the test can fail even with more than one cpu. I've seen
> this sporadic failure on different hardware with more than two cores, at
> least on intel denverton (x86_64) and renesas r-car (aarch64) systems.
> Both with kernel 4.19 with the fix included, on the denverton system the
> rt parches were included and on the r-car not. The test passes most of
> the time, but sometimes fails with the message Li posted.
>
> It also seems to fail sporadically on other systems as well:
> https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1892860
>
> Additionally I tested on qemu-x86 with 4.19 with and without rt patches.
> The test succeeds even with only one virtualized cpu. So either Martin's
> assumption is wrong or it holds only for newer kernel versions?
>

No, Mertin is not wrong, and you are also right.

They are totally two different issues of af_alg07, the test on 1CPU
should be fixed with TCONF. But the fail with aarch64 is more like a
hardware issue, Chunyu has a drafted patch to add init delay value for
such a system.

Can you try this on your aarm64 platform?
-----------------------------
fzsync can't get a random delay range on hpe-moonshot systems, so run with
delay=0 during all the tests. This is probably the hardware issue such as
cache line design so can't get a stable state during the execution of the
critical
section. Provide an experience delay value on hpe-moonshot to make it hit
the race window immediately without exceeding samples.

---
 testcases/kernel/crypto/af_alg07.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/testcases/kernel/crypto/af_alg07.c
b/testcases/kernel/crypto/af_alg07.c
index 6ad86f4f3..24f5b8088 100644
--- a/testcases/kernel/crypto/af_alg07.c
+++ b/testcases/kernel/crypto/af_alg07.c
@@ -47,6 +47,7 @@ static void setup(void)
  fd = SAFE_OPEN("tmpfile", O_RDWR | O_CREAT, 0644);

  tst_fzsync_pair_init(&fzsync_pair);
+ fzsync_pair.delay_bias = 700;
 }

 static void *thread_run(void *arg)
Joerg Vehlow Nov. 30, 2020, 8:39 a.m. UTC | #10
Hi Li,

On 11/30/2020 9:14 AM, Li Wang wrote:
> Hi Joerg,
>
> On Mon, Nov 30, 2020 at 3:53 PM Joerg Vehlow <lkml@jv-coder.de 
> <mailto:lkml@jv-coder.de>> wrote:
>
>     Hi,
>     >> No, af_alg07 requires 2 CPUs, otherwise it'll report false
>     positives.
>     >> The test will pass only if fchownat() hits a half-closed socket and
>     >> returns error. But IIRC the half-closed socket will be
>     destroyed during
>     >> reschedule which means there's no race window to hit anymore.
>     But it
>     >> would be better to put the TCONF condition into the test itself.
>     > Interesting, I wonder if this is also true for the real-time
>     kernel with
>     > the threads set to RT priority?
>     It looks like the test can fail even with more than one cpu. I've
>     seen
>     this sporadic failure on different hardware with more than two
>     cores, at
>     least on intel denverton (x86_64) and renesas r-car (aarch64)
>     systems.
>     Both with kernel 4.19 with the fix included, on the denverton
>     system the
>     rt parches were included and on the r-car not. The test passes
>     most of
>     the time, but sometimes fails with the message Li posted.
>
>     It also seems to fail sporadically on other systems as well:
>     https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1892860
>     <https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1892860>
>
>     Additionally I tested on qemu-x86 with 4.19 with and without rt
>     patches.
>     The test succeeds even with only one virtualized cpu. So either
>     Martin's
>     assumption is wrong or it holds only for newer kernel versions?
>
>
> No, Mertin is not wrong, and you are also right.
>
> They are totally two different issues of af_alg07, the test on 1CPU
> should be fixed with TCONF. But the fail with aarch64 is more like a
> hardware issue, Chunyu has a drafted patch to add init delay value for
> such a system.
I think you misunderstood something. I see random fails with "TFAIL: 
fchownat() failed to fail, kernel may be vulnerable" on both x86_64 and 
aarch64 with more than one cpu core (4 for x86_64 and 2 or 4 for aarch64).

I see no error ("TPASS: fchownat() failed successfully: ENOENT (2)") on 
single core qemu-x86. This is why I think Martin's assumption may be 
wrong. If it was right, it should never succeed on a single core system 
right?

Jörg
Richard Palethorpe Nov. 30, 2020, 9:01 a.m. UTC | #11
Hello,

Li Wang <liwang@redhat.com> writes:

> Hi Joerg,
>
> On Mon, Nov 30, 2020 at 3:53 PM Joerg Vehlow <lkml@jv-coder.de> wrote:
>
>> Hi,
>> >> No, af_alg07 requires 2 CPUs, otherwise it'll report false positives.
>> >> The test will pass only if fchownat() hits a half-closed socket and
>> >> returns error. But IIRC the half-closed socket will be destroyed during
>> >> reschedule which means there's no race window to hit anymore. But it
>> >> would be better to put the TCONF condition into the test itself.
>> > Interesting, I wonder if this is also true for the real-time kernel with
>> > the threads set to RT priority?
>> It looks like the test can fail even with more than one cpu. I've seen
>> this sporadic failure on different hardware with more than two cores, at
>> least on intel denverton (x86_64) and renesas r-car (aarch64) systems.
>> Both with kernel 4.19 with the fix included, on the denverton system the
>> rt parches were included and on the r-car not. The test passes most of
>> the time, but sometimes fails with the message Li posted.
>>
>> It also seems to fail sporadically on other systems as well:
>> https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1892860
>>
>> Additionally I tested on qemu-x86 with 4.19 with and without rt patches.
>> The test succeeds even with only one virtualized cpu. So either Martin's
>> assumption is wrong or it holds only for newer kernel versions?
>>
>
> No, Mertin is not wrong, and you are also right.
>
> They are totally two different issues of af_alg07, the test on 1CPU
> should be fixed with TCONF. But the fail with aarch64 is more like a
> hardware issue, Chunyu has a drafted patch to add init delay value for
> such a system.
>
> Can you try this on your aarm64 platform?
> -----------------------------
> fzsync can't get a random delay range on hpe-moonshot systems, so run with
> delay=0 during all the tests. This is probably the hardware issue such as
> cache line design so can't get a stable state during the execution of the
> critical
> section. Provide an experience delay value on hpe-moonshot to make it hit
> the race window immediately without exceeding samples.
>
> ---
>  testcases/kernel/crypto/af_alg07.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/testcases/kernel/crypto/af_alg07.c
> b/testcases/kernel/crypto/af_alg07.c
> index 6ad86f4f3..24f5b8088 100644
> --- a/testcases/kernel/crypto/af_alg07.c
> +++ b/testcases/kernel/crypto/af_alg07.c
> @@ -47,6 +47,7 @@ static void setup(void)
>   fd = SAFE_OPEN("tmpfile", O_RDWR | O_CREAT, 0644);
>
>   tst_fzsync_pair_init(&fzsync_pair);
> + fzsync_pair.delay_bias = 700;

I hope there is some way to set this dynamically. Similar to
CVE-2016-7117.

If we know that we should get some particular error we could modify the
bias until the error happens.

>  }
>
>  static void *thread_run(void *arg)
> -- 
> 2.19.1
Li Wang Nov. 30, 2020, 9:03 a.m. UTC | #12
Hi Joerg,

On Mon, Nov 30, 2020 at 4:39 PM Joerg Vehlow <lkml@jv-coder.de> wrote:

> Hi Li,
>
> On 11/30/2020 9:14 AM, Li Wang wrote:
> > Hi Joerg,
> >
> > On Mon, Nov 30, 2020 at 3:53 PM Joerg Vehlow <lkml@jv-coder.de
> > <mailto:lkml@jv-coder.de>> wrote:
> >
> >     Hi,
> >     >> No, af_alg07 requires 2 CPUs, otherwise it'll report false
> >     positives.
> >     >> The test will pass only if fchownat() hits a half-closed socket
> and
> >     >> returns error. But IIRC the half-closed socket will be
> >     destroyed during
> >     >> reschedule which means there's no race window to hit anymore.
> >     But it
> >     >> would be better to put the TCONF condition into the test itself.
> >     > Interesting, I wonder if this is also true for the real-time
> >     kernel with
> >     > the threads set to RT priority?
> >     It looks like the test can fail even with more than one cpu. I've
> >     seen
> >     this sporadic failure on different hardware with more than two
> >     cores, at
> >     least on intel denverton (x86_64) and renesas r-car (aarch64)
> >     systems.
> >     Both with kernel 4.19 with the fix included, on the denverton
> >     system the
> >     rt parches were included and on the r-car not. The test passes
> >     most of
> >     the time, but sometimes fails with the message Li posted.
> >
> >     It also seems to fail sporadically on other systems as well:
> >     https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1892860
> >     <https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1892860>
> >
> >     Additionally I tested on qemu-x86 with 4.19 with and without rt
> >     patches.
> >     The test succeeds even with only one virtualized cpu. So either
> >     Martin's
> >     assumption is wrong or it holds only for newer kernel versions?
> >
> >
> > No, Mertin is not wrong, and you are also right.
> >
> > They are totally two different issues of af_alg07, the test on 1CPU
> > should be fixed with TCONF. But the fail with aarch64 is more like a
> > hardware issue, Chunyu has a drafted patch to add init delay value for
> > such a system.
> I think you misunderstood something. I see random fails with "TFAIL:
> fchownat() failed to fail, kernel may be vulnerable" on both x86_64 and
> aarch64 with more than one cpu core (4 for x86_64 and 2 or 4 for aarch64).
>

Well, seems I was somewhat arbitrary on this problem a moment ago.

Probably I was missing the 4cores fails on x86_64 you mentioned, we just
observed that FAIL on 1CPU x86_64 and hpe_moonshot(aarch64) so far.
The tentative conclusion of our debugging result:

  1. FAIL with 1CPU KVM x86_64 is false positives
  2. FAIL with hpe_moonshot aarch64 is caused by cache line design


>
> I see no error ("TPASS: fchownat() failed successfully: ENOENT (2)") on
> single core qemu-x86. This is why I think Martin's assumption may be
> wrong. If it was right, it should never succeed on a single core system
> right?
>

Hmm, it's hard to say never, it is also possible to create a race window on
a single-core system.
Anyway, we need to do more investigation.
Martin Doucha Nov. 30, 2020, 2:16 p.m. UTC | #13
On 30. 11. 20 10:01, Richard Palethorpe wrote:
> Hello,
> 
> Li Wang <liwang@redhat.com> writes:
> 
>> Hi Joerg,
>>
>> On Mon, Nov 30, 2020 at 3:53 PM Joerg Vehlow <lkml@jv-coder.de> wrote:
>>
>>> Hi,
>>>>> No, af_alg07 requires 2 CPUs, otherwise it'll report false positives.
>>>>> The test will pass only if fchownat() hits a half-closed socket and
>>>>> returns error. But IIRC the half-closed socket will be destroyed during
>>>>> reschedule which means there's no race window to hit anymore. But it
>>>>> would be better to put the TCONF condition into the test itself.
>>>> Interesting, I wonder if this is also true for the real-time kernel with
>>>> the threads set to RT priority?
>>> It looks like the test can fail even with more than one cpu. I've seen
>>> this sporadic failure on different hardware with more than two cores, at
>>> least on intel denverton (x86_64) and renesas r-car (aarch64) systems.
>>> Both with kernel 4.19 with the fix included, on the denverton system the
>>> rt parches were included and on the r-car not. The test passes most of
>>> the time, but sometimes fails with the message Li posted.
>>>
>>> It also seems to fail sporadically on other systems as well:
>>> https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1892860
>>>
>>> Additionally I tested on qemu-x86 with 4.19 with and without rt patches.
>>> The test succeeds even with only one virtualized cpu. So either Martin's
>>> assumption is wrong or it holds only for newer kernel versions?
>>>
>>
>> No, Mertin is not wrong, and you are also right.
>>
>> They are totally two different issues of af_alg07, the test on 1CPU
>> should be fixed with TCONF. But the fail with aarch64 is more like a
>> hardware issue, Chunyu has a drafted patch to add init delay value for
>> such a system.
>>
>> Can you try this on your aarm64 platform?
>> -----------------------------
>> fzsync can't get a random delay range on hpe-moonshot systems, so run with
>> delay=0 during all the tests. This is probably the hardware issue such as
>> cache line design so can't get a stable state during the execution of the
>> critical
>> section. Provide an experience delay value on hpe-moonshot to make it hit
>> the race window immediately without exceeding samples.
>>
>> ---
>>  testcases/kernel/crypto/af_alg07.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/testcases/kernel/crypto/af_alg07.c
>> b/testcases/kernel/crypto/af_alg07.c
>> index 6ad86f4f3..24f5b8088 100644
>> --- a/testcases/kernel/crypto/af_alg07.c
>> +++ b/testcases/kernel/crypto/af_alg07.c
>> @@ -47,6 +47,7 @@ static void setup(void)
>>   fd = SAFE_OPEN("tmpfile", O_RDWR | O_CREAT, 0644);
>>
>>   tst_fzsync_pair_init(&fzsync_pair);
>> + fzsync_pair.delay_bias = 700;
> 
> I hope there is some way to set this dynamically. Similar to
> CVE-2016-7117.
> 
> If we know that we should get some particular error we could modify the
> bias until the error happens.

There are three possible outcomes of the race:
1) fchownat() returns 0 => fchownat() was called too early or the kernel
is vulnerable, you can adjust bias here
2) fchwonat() fails with ENOENT => kernel is fixed, print TPASS and exit
3) fchownat() fails with EBADF => fchownat() was called too late, you
can adjust bias here

IIRC I didn't play with bias in this test because on x86_64 it passes
almost instantly on a fixed kernel. Feel free to add dynamic bias
adjustment for ARM.
diff mbox series

Patch

diff --git a/include/tst_fuzzy_sync.h b/include/tst_fuzzy_sync.h
index 4141f5c64..2e864b312 100644
--- a/include/tst_fuzzy_sync.h
+++ b/include/tst_fuzzy_sync.h
@@ -281,6 +281,9 @@  static void tst_init_stat(struct tst_fzsync_stat *s)
 static void tst_fzsync_pair_reset(struct tst_fzsync_pair *pair,
 				  void *(*run_b)(void *))
 {
+	if (get_nprocs() < 2)
+		tst_brk(TCONF, "Fuzzy Sync requires at least two CPUs available");
+
 	tst_fzsync_pair_cleanup(pair);
 
 	tst_init_stat(&pair->diff_ss);