memcg_lib/memcg_process: Better synchronization of signal USR1

Message ID 20191106073621.58738-1-lkml@jv-coder.de
State New

Commit Message

Joerg Vehlow Nov. 6, 2019, 7:36 a.m. UTC
From: Joerg Vehlow <joerg.vehlow@aox-tech.de>

memcg_process is triggered by memcg_lib to do allocations and
deallocations. This work was done in the signal handler.
In some cases memcg_lib could send multiple signals
(e.g. in the warmup function). This led to signals getting
lost and to failed tests.

The patch moves the allocation and deallocation to the main
function, triggered by a flag set by the signal handler.
Additionally TST_CHECKPOINT_WAKE/TST_CHECKPOINT_WAIT is
used to make memcg_lib wait until memcg_process is done
allocating/deallocating.

Signed-off-by: Joerg Vehlow <joerg.vehlow@aox-tech.de>
---
 .../controllers/memcg/functional/memcg_lib.sh |  1 +
 .../memcg/functional/memcg_process.c          | 44 +++++++++++--------
 2 files changed, 27 insertions(+), 18 deletions(-)

Comments

Joerg Vehlow Nov. 6, 2019, 8:33 a.m. UTC | #1
> diff --git a/testcases/kernel/controllers/memcg/functional/memcg_lib.sh b/testcases/kernel/controllers/memcg/functional/memcg_lib.sh
> index aadaae4d2..7440e1eee 100755
> --- a/testcases/kernel/controllers/memcg/functional/memcg_lib.sh
> +++ b/testcases/kernel/controllers/memcg/functional/memcg_lib.sh
> @@ -124,6 +124,7 @@ signal_memcg_process()
>   	local usage_start=$(cat ${path}memory.usage_in_bytes)
>   
>   	kill -s USR1 $pid 2> /dev/null
> +	TST_CHECKPOINT_WAIT 1
Actually this does not work like this, because some of the
tests trigger the OOM killer, and TST_CHECKPOINT_WAIT calls
tst_checkpoint via ROD. Is it ok to directly call

tst_checkpoint wait 10000 "1"

and ignore the result here?

BTW: Is there no such thing like TST_CHECKPOINT in the new
shell test library?
Petr Vorel Nov. 21, 2019, 6:34 p.m. UTC | #2
Hi Joerg,

> > diff --git a/testcases/kernel/controllers/memcg/functional/memcg_lib.sh b/testcases/kernel/controllers/memcg/functional/memcg_lib.sh
> > index aadaae4d2..7440e1eee 100755
> > --- a/testcases/kernel/controllers/memcg/functional/memcg_lib.sh
> > +++ b/testcases/kernel/controllers/memcg/functional/memcg_lib.sh
> > @@ -124,6 +124,7 @@ signal_memcg_process()
> >   	local usage_start=$(cat ${path}memory.usage_in_bytes)
> >   	kill -s USR1 $pid 2> /dev/null
> > +	TST_CHECKPOINT_WAIT 1
> Actually this does not work like this, because some of the
> tests trigger the oom killer and TEST_CHECKPOINT_WAIT calling
> tst_checkpoint uses ROD. Is it ok to directly call

> tst_checkpoint wait 10000 "1"

> and ignore the result here?

> BTW: Is there no such thing like TST_CHECKPOINT in the new
> shell test library?
No, there is no support for TST_CHECKPOINT in shell.
To be honest I have no idea how to implement it.
It could be done by polling some file content in a loop and
sleeping in the meantime (inefficient), but synchronization between the
C and shell APIs is IMHO not possible.

Kind regards,
Petr
Cyril Hrubis Nov. 25, 2019, 1:14 p.m. UTC | #3
Hi!
> No, there is no support for TST_CHECKPOINT in shell.

Well there is actually. There is a tst_checkpoint helper that calls
tst_reinit() to map the shared page to be used with futexes and then
just calls the corresponding wake or wait function.

And yes it has been used in the cgroup testcases to synchronize C
helper that allocates memory and shell test that executes it.
Cyril Hrubis Nov. 25, 2019, 1:29 p.m. UTC | #4
Hi!
> Actually this does not work like this, because some of the
> tests trigger the oom killer and TEST_CHECKPOINT_WAIT calling
> tst_checkpoint uses ROD. Is it ok to directly call
> 
> tst_checkpoint wait 10000 "1"
> 
> and ignore the result here?

Wouldn't that delay the test for too long?

The default timeout for checkpoints is probably too big.

This problem is quite tricky to get right I guess. Maybe we can watch
/proc/[pid]/statm for an increase in data + stack memory.

> BTW: Is there no such thing like TST_CHECKPOINT in the new
> shell test library?

It does not seem to be there, but these shell functions are just
wrappers that check the tst_checkpoint return value, which would be
fairly easy to add.
Joerg Vehlow Nov. 25, 2019, 1:48 p.m. UTC | #5
Hi Cyril,

> Hi!
>> Actually this does not work like this, because some of the
>> tests trigger the oom killer and TEST_CHECKPOINT_WAIT calling
>> tst_checkpoint uses ROD. Is it ok to directly call
>>
>> tst_checkpoint wait 10000 "1"
>>
>> and ignore the result here?
> Wouldn't that delay the test for too long?
>
> The default timeout for checkpoints is probably too big.
>
> This problem is quite tricky to get right I guess. Maybe we can watch
> /proc/[pid]/statm for increase data + stack memory.
The timeout is specified on the command line (the 10000) in ms.
We run the test with timeout=1000 now and it works fine. It is simpler
than thinking about any other synchronization technique. The additional
wait adds less than 30 for all tests that use memcg_process.
>> BTW: Is there no such thing like TST_CHECKPOINT in the new
>> shell test library?
> It does not seem to be there, but these shell functions are just
> wrappers that do check the tst_checkpoint return value, which would be
> fairly easy to add.
I just wondered whether I had missed it.

Jörg
Petr Vorel Nov. 25, 2019, 2:28 p.m. UTC | #6
Hi,

> > No, there is no support for TST_CHECKPOINT in shell.

> Well there is actually. There is a tst_checkpoint helper that calls
> tst_reinit() to map the shared page to be used with futexes and then
> just calls the corresponding wake or wait function.
Great.
Joerg, sorry for being wrong, thanks Cyril for correction.

> And yes it has been used in the cgroup testcases to synchronize C
> helper that allocates memory and shell test that executes it.

Kind regards,
Petr
Cyril Hrubis Nov. 25, 2019, 3:32 p.m. UTC | #7
Hi!
> >> Actually this does not work like this, because some of the
> >> tests trigger the oom killer and TEST_CHECKPOINT_WAIT calling
> >> tst_checkpoint uses ROD. Is it ok to directly call
> >>
> >> tst_checkpoint wait 10000 "1"
> >>
> >> and ignore the result here?
> > Wouldn't that delay the test for too long?
> >
> > The default timeout for checkpoints is probably too big.
> >
> > This problem is quite tricky to get right I guess. Maybe we can watch
> > /proc/[pid]/statm for increase data + stack memory.
> The timeout is specified on the command line (the 10000) in ms.

Ah, sorry I was blind.

> We run the test with timeout=1000 now and it works fine. It is simpler 
> than thinking about any
> other synchronization technique. The additonal wait adds less than 30 
> for all tests, that use memcg_process.

30 what? Seconds? That is unfortunately not acceptable.

Actually having a closer look at the code there is a loop that checks
every 100ms if:

1) the process is still alive
2) if there was increase in usage_in_bytes in the corresponding cgroup

So what is wrong with the original code?
Joerg Vehlow Nov. 26, 2019, 5:08 a.m. UTC | #8
>> We run the test with timeout=1000 now and it works fine. It is simpler
>> than thinking about any
>> other synchronization technique. The additonal wait adds less than 30
>> for all tests, that use memcg_process.
> 30 what? seconds? That is unfortunatelly not acceptable.
Yes, 30 seconds. Why shouldn't that be acceptable? It is nothing compared
to the runtime of other tests.
>
> Actually having a closer look at the code there is a loop that checks
> every 100ms if:
>
> 1) the process is still alive
> 2) if there was increase in usage_in_bytes in the corresponding cgroup
>
> So what is wrong with the original code?
Please reread the description of my initial post. The problem is the
signal race, not the check. The checkpoint system prevents the race.
There is no way around solid synchronization.
Cyril Hrubis Nov. 26, 2019, 12:10 p.m. UTC | #9
Hi!
> >> We run the test with timeout=1000 now and it works fine. It is simpler
> >> than thinking about any
> >> other synchronization technique. The additonal wait adds less than 30
> >> for all tests, that use memcg_process.
> > 30 what? seconds? That is unfortunatelly not acceptable.
> Yes 30 seconds. Why shouldn't that be not acceptable? It is nothing compared
> to the runtime of other tests.

I have written a blog post that partly applies to this case, see:

https://people.kernel.org/metan/why-sleep-is-almost-never-acceptable-in-tests

> > Actually having a closer look at the code there is a loop that checks
> > every 100ms if:
> >
> > 1) the process is still alive
> > 2) if there was increase in usage_in_bytes in the corresponding cgroup
> >
> > So what is wrong with the original code?
> Please reread the description of my initial post. The problem is the 
> signal race
> not the check. The checkpoint system prevents the race. There is no way 
> around
> a solid synchronization.

So the problem is that sometimes the program has not finished handling
the first signal and we are sending another, right?

I guess that the proper solution would be avoiding the signals in the
first place. I guess that we can establish two-way communication with
fifos, which would also mean that we would get notified as soon as the
child dies as well.
Joerg Vehlow Nov. 26, 2019, 12:39 p.m. UTC | #10
>> Yes 30 seconds. Why shouldn't that be not acceptable? It is nothing compared
>> to the runtime of other tests.
> I have written a blog post that partly applies to this case, see:
>
> https://people.kernel.org/metan/why-sleep-is-almost-never-acceptable-in-tests
I know where you are coming from and it is basically the same as my own
opinion. The difference is: when I look at LTP I see a runtime of more
than 6 hours, and looking at the controller tests alone it is more than
4 hours. This puts 30 seconds into a very different perspective than
looking at only the syscall tests (in the test run I looked at, those
take around 13 minutes). That is why I don't care about 30 seconds in
this case.

>
> So the problem is that sometimes the program has not finished handling
> the first signal and we are sending another, right?
>
> I guess that the proper solution would be avoding the signals in the
> first place. I guess that we can estabilish two-way communication with
> fifos, which would also mean that we would get notified as fast as the
> child dies as well.
Correct. Using fifos is probably a viable solution, but it would require
library work, because otherwise the overhead is way too big.
Another thing I can think of is extending tst_checkpoint wait to also
watch a process and stop waiting if that process dies. This would be
the simplest way to get good synchronization and get rid of the sleep.
Joerg Vehlow Nov. 27, 2019, 7:41 a.m. UTC | #11
Hi,
>>
>> So the problem is that sometimes the program has not finished handling
>> the first signal and we are sending another, right?
>>
>> I guess that the proper solution would be avoding the signals in the
>> first place. I guess that we can estabilish two-way communication with
>> fifos, which would also mean that we would get notified as fast as the
>> child dies as well.
> Correct. Using fifos is probably a viable solution, but it would 
> require library work,
> because otherwise the overhead is way too big.
> Another thing I can think of is extending tst_checkpoint wait to also 
> watch a process
> and stop waiting, if that process dies. This would be the simplest way 
> to get good
> synchronization and get rid of the sleep.
>
When thinking about this yesterday, I had another idea to fix it without
much work: for the tests that expect the process to terminate, just don't
call tst_checkpoint at all. We know it will fail there. Would this be
acceptable for you?

Jörg
Cyril Hrubis Dec. 3, 2019, 3:12 p.m. UTC | #12
Hi!
> > I have written a blog post that partly applies to this case, see:
> >
> > https://people.kernel.org/metan/why-sleep-is-almost-never-acceptable-in-tests
> I know where you are coming from and it is basically the same as my own 
> opinion.
> The difference is: When I look at ltp I see a runtime of more than 6 
> hours, looking at the
> controller test alone it is more than 4 hours. This puts 30 seconds into 
> a very differenet
> perspective than looking at only syscall tests. (In the testrun I looked 
> at it is around 13 minutes).
> That is why I don't care about 30 seconds in this case.

The controllers test run takes 25 minutes on our servers; it will probably
be reduced to 15 minutes in two or three years with the next upgrade. The
main point is that hardware tends to get faster and faster, but any sleep
in the tests will not scale and ends up being a problem sooner or later.
It also greatly depends on which HW you are running the tests on.

> > So the problem is that sometimes the program has not finished handling
> > the first signal and we are sending another, right?
> >
> > I guess that the proper solution would be avoding the signals in the
> > first place. I guess that we can estabilish two-way communication with
> > fifos, which would also mean that we would get notified as fast as the
> > child dies as well.
> Correct. Using fifos is probably a viable solution, but it would require 
> library work,
> because otherwise the overhead is way too big.
> Another thing I can think of is extending tst_checkpoint wait to also 
> watch a process
> and stop waiting, if that process dies. This would be the simplest way 
> to get good
> synchronization and get rid of the sleep.

I'm not sure if we can implement this without introducing another race
condition. The only way to wake up a futex from sleep before it times
out in a race-free way is sending a signal. In this case we should see
EINTR. But that would mean that the process waking up the futex has to
be a child of the process, unless we reparent that process, but all of
that would be too tricky I guess.

If we decide to wake the futex regularly to check if the process is
alive, we can miss the wake. Well, the library tries hard and loops over
the wake syscall for a while, but this could still fail on very slow
devices under load. And if the timing is unfortunate we may miss more
than one wake signal, which would lead to a timeout. Timing problems like
that can easily arise on VMs with a single CPU on an overbooked host.

Patch

diff --git a/testcases/kernel/controllers/memcg/functional/memcg_lib.sh b/testcases/kernel/controllers/memcg/functional/memcg_lib.sh
index aadaae4d2..7440e1eee 100755
--- a/testcases/kernel/controllers/memcg/functional/memcg_lib.sh
+++ b/testcases/kernel/controllers/memcg/functional/memcg_lib.sh
@@ -124,6 +124,7 @@  signal_memcg_process()
 	local usage_start=$(cat ${path}memory.usage_in_bytes)
 
 	kill -s USR1 $pid 2> /dev/null
+	TST_CHECKPOINT_WAIT 1
 
 	if [ -z "$size" ]; then
 		return
diff --git a/testcases/kernel/controllers/memcg/functional/memcg_process.c b/testcases/kernel/controllers/memcg/functional/memcg_process.c
index 0e671f566..8e846879d 100644
--- a/testcases/kernel/controllers/memcg/functional/memcg_process.c
+++ b/testcases/kernel/controllers/memcg/functional/memcg_process.c
@@ -24,7 +24,8 @@ 
 static int fd;
 
 static volatile int flag_exit;
-static volatile int flag_allocated;
+static volatile int flag_do_work;
+static int flag_allocated;
 
 static int opt_mmap_anon;
 static int opt_mmap_file;
@@ -257,22 +258,7 @@  static void sigint_handler(int __attribute__ ((unused)) signo)
  */
 static void sigusr_handler(int __attribute__ ((unused)) signo)
 {
-	if (opt_mmap_anon)
-		mmap_anon();
-
-	if (opt_mmap_file)
-		mmap_file();
-
-	if (opt_mmap_lock1)
-		mmap_lock1();
-
-	if (opt_mmap_lock2)
-		mmap_lock2();
-
-	if (opt_shm)
-		shm();
-
-	flag_allocated = !flag_allocated;
+	flag_do_work++;
 }
 
 int main(int argc, char *argv[])
@@ -302,8 +288,30 @@  int main(int argc, char *argv[])
 
 	TST_CHECKPOINT_WAKE(0);
 
-	while (!flag_exit)
+	while (!flag_exit) {
+		if (flag_do_work) {
+			flag_do_work--;
+			if (opt_mmap_anon)
+				mmap_anon();
+
+			if (opt_mmap_file)
+				mmap_file();
+
+			if (opt_mmap_lock1)
+				mmap_lock1();
+
+			if (opt_mmap_lock2)
+				mmap_lock2();
+
+			if (opt_shm)
+				shm();
+
+			flag_allocated = !flag_allocated;
+
+			TST_CHECKPOINT_WAKE(1);
+		}
 		sleep(1);
+	}
 
 	close(fd);