diff mbox

please fix FUSION (Was: [v3.13][v3.14][Regression]kthread:makekthread_create()killable)

Message ID 201403221525.FJC69759.MSLOtHOJOVFFQF@I-love.SAKURA.ne.jp
State New
Headers show

Commit Message

Tetsuo Handa March 22, 2014, 6:25 a.m. UTC
Oleg Nesterov wrote:
> Tetsuo, what do you think?

I don't like blocking SIGKILL while doing operations that depend on memory
allocation by other processes. If the OOM killer is triggered and it chose
the process blocking SIGKILL in mptsas_init() (I know it unlikely happens),
it generates the OOM killer deadlock.

My preference is to fix kthread_create() to handle SIGKILL gracefully.
kthread_create() did not return upon SIGKILL before commit 786235ee.
Since commit 786235ee, there is imbalance that "kmalloc(GFP_KERNEL) in
kthread_create_on_node() ignores SIGKILL unless TIF_MEMDIE is set" but
"wait_for_completion_killable() in kthread_create_on_node() does not ignore
SIGKILL even if TIF_MEMDIE is not set".

Many kernel operations (e.g. mutex_lock() wait_event() wait_for_completion())
ignore SIGKILL and the current users depend on SIGKILL being ignored. Thus,
commit 786235ee sounds like a kernel API breakage.
----------
>From 731f1f6dec7efaa132f751c432079b9b1fdb78d2 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 22 Mar 2014 15:16:50 +0900
Subject: [PATCH] kthread: Handle SIGKILL gracefully in kthread_create().

Commit 786235ee "kthread: make kthread_create() killable" changed to
leave kthread_create() as soon as receiving SIGKILL. But this change
caused boot failures because systemd-udevd receives SIGKILL due to timeout
while loading SCSI controller drivers using finit_module() [1].

Therefore, this patch changes kthread_create() from "wait forever in
killable state" to "wait for 10 seconds in unkillable state, check for
the OOM killer every second".

This patch also changes the return value of timeout case from -ENOMEM
to -EINTR because -ENOMEM could make sense for only TIF_MEMDIE case.

  [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 kernel/kthread.c |   37 +++++++++++++++++++++----------------
 1 files changed, 21 insertions(+), 16 deletions(-)

Comments

Oleg Nesterov March 22, 2014, 7:25 p.m. UTC | #1
On 03/22, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > Tetsuo, what do you think?
>
> I don't like blocking SIGKILL while doing operations that depend on memory
> allocation by other processes. If the OOM killer is triggered and it chose
> the process blocking SIGKILL in mptsas_init() (I know it unlikely happens),
> it generates the OOM killer deadlock.

It seems that we do not understand each other.

I do not like this hack too. And it is even wrong, you can't really block
SIGKILL. And it is unnecessary in a sense that (I think) it is fine that
module_init() reacts to SIGKILL and aborts, just the fact it crashes the
kernel in the error paths is not fine.

The driver should be fixed anyway. As for timeout, either userspace/systemd
should be changed to not send SIGKILL after 30 secs, or (better) the driver
should be changed to not waste 30 secs.

The hack I sent can only serve as a short term solution, it should be
reverted once we have something better. And, otoh, this hack only affects
the problematic driver which should be fixed in any case.

Do you see my point? I can be wrong, but I think that you constantly
misunderstand the intent.

> My preference is to fix kthread_create() to handle SIGKILL gracefully.

And now I do not understand you too. I do not understand why should we
"fix" kthread_create().

> Many kernel operations (e.g. mutex_lock() wait_event() wait_for_completion())
> ignore SIGKILL and the current users depend on SIGKILL being ignored. Thus,
> commit 786235ee sounds like a kernel API breakage.

Personally I do not really think so, but OK. In this case lets revert
786235ee.

> Commit 786235ee "kthread: make kthread_create() killable" changed to
> leave kthread_create() as soon as receiving SIGKILL. But this change
> caused boot failures because systemd-udevd receives SIGKILL due to timeout
> while loading SCSI controller drivers using finit_module() [1].

And I still think that 786235ee simply uncovered the problems in this
driver. Perhaps we should change kthread_create() for some reason, but
(imho) not because we need to help the buggy code.

> Therefore, this patch changes kthread_create() from "wait forever in
> killable state" to "wait for 10 seconds in unkillable state, check for
> the OOM killer every second".

Personally I dislike this change. In fact I think it is ugly. But this
is only my opinion.

If you convince someone to take this patch I won't argue.


> This patch also changes the return value of timeout case from -ENOMEM
> to -EINTR because -ENOMEM could make sense for only TIF_MEMDIE case.
>
>   [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705
>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  kernel/kthread.c |   37 +++++++++++++++++++++----------------
>  1 files changed, 21 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index b5ae3ee..6a25a9f 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -269,6 +269,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
>  					   const char namefmt[],
>  					   ...)
>  {
> +	int i = 0;
>  	DECLARE_COMPLETION_ONSTACK(done);
>  	struct task_struct *task;
>  	struct kthread_create_info *create = kmalloc(sizeof(*create),
> @@ -287,24 +288,28 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
>
>  	wake_up_process(kthreadd_task);
>  	/*
> -	 * Wait for completion in killable state, for I might be chosen by
> -	 * the OOM killer while kthreadd is trying to allocate memory for
> -	 * new kernel thread.
> +	 * Wait for completion with 10 seconds timeout. Unless the kthreadd is
> +	 * blocked for direct memory reclaim path, the kthreadd will be able to
> +	 * complete the request within 10 seconds. Also, check every second if
> +	 * I was chosen by the OOM killer in order to avoid the OOM killer
> +	 * deadlock.
>  	 */
> -	if (unlikely(wait_for_completion_killable(&done))) {
> -		/*
> -		 * If I was SIGKILLed before kthreadd (or new kernel thread)
> -		 * calls complete(), leave the cleanup of this structure to
> -		 * that thread.
> -		 */
> -		if (xchg(&create->done, NULL))
> -			return ERR_PTR(-ENOMEM);
> -		/*
> -		 * kthreadd (or new kernel thread) will call complete()
> -		 * shortly.
> -		 */
> -		wait_for_completion(&done);
> +	do {
> +		if (likely(wait_for_completion_timeout(&done, HZ)))
> +			goto ready;
> +	} while (i++ < 10 && !test_thread_flag(TIF_MEMDIE));
> +	/*
> +	 * The kthreadd was unable to complete the request within 10 seconds
> +	 * (or I was chosen by the OOM killer). Thus, give up and leave the
> +	 * cleanup of this structure to the kthreadd (or new kernel thread).
> +	 */
> +	if (xchg(&create->done, NULL)) {
> +		WARN(1, "Gave up waiting for kthreadd.\n");
> +		return ERR_PTR(-EINTR);
>  	}
> +	/* kthreadd (or new kernel thread) will call complete() shortly. */
> +	wait_for_completion(&done);
> +ready:
>  	task = create->result;
>  	if (!IS_ERR(task)) {
>  		static const struct sched_param param = { .sched_priority = 0 };
> --
> 1.7.1
Thomas Gleixner March 22, 2014, 9:25 p.m. UTC | #2
On Sat, 22 Mar 2014, Oleg Nesterov wrote:
> On 03/22, Tetsuo Handa wrote:
> > Many kernel operations (e.g. mutex_lock() wait_event() wait_for_completion())
> > ignore SIGKILL and the current users depend on SIGKILL being ignored. Thus,
> > commit 786235ee sounds like a kernel API breakage.
> 
> Personally I do not really think so, but OK. In this case lets revert
> 786235ee.

We have explicitely:

    mutex_lock, mutex_lock_killable and mutex_lock_interruptible.

Ditto for wait_event and wait_for_completion.

So the existance of the uninterruptible versions does not make an
argument for the kthread_create() case.
 
> > Commit 786235ee "kthread: make kthread_create() killable" changed to
> > leave kthread_create() as soon as receiving SIGKILL. But this change
> > caused boot failures because systemd-udevd receives SIGKILL due to timeout
> > while loading SCSI controller drivers using finit_module() [1].
> 
> And I still think that 786235ee simply uncovered the problems in this
> driver. Perhaps we should change kthread_create() for some reason, but
> (imho) not because we need to help the buggy code.

Right.
 
> > Therefore, this patch changes kthread_create() from "wait forever in
> > killable state" to "wait for 10 seconds in unkillable state, check for
> > the OOM killer every second".
> 
> Personally I dislike this change. In fact I think it is ugly. But this
> is only my opinion.

It's not only ugly, it's activly wrong. It's as wrong as 786235ee
itself. And 786235ee needs to be reverted and the revert must go into
3.13.stable as well. I'll send a revert request in separate mail.

Thanks,

	tglx
Thomas Gleixner March 22, 2014, 10:01 p.m. UTC | #3
On Sat, 22 Mar 2014, Thomas Gleixner wrote:
> On Sat, 22 Mar 2014, Oleg Nesterov wrote:
> > Personally I dislike this change. In fact I think it is ugly. But this
> > is only my opinion.
> 
> It's not only ugly, it's activly wrong. It's as wrong as 786235ee
> itself. And 786235ee needs to be reverted and the revert must go into
> 3.13.stable as well. I'll send a revert request in separate mail.

Sorry. I misread the combo of both patches. 786235ee is correct. We
definitely don't want to revert it. 

But I still think, that changing this to butt ugly heuristics with an
arbitrary chosen timeout is not the proper solution to "fix" a driver
problem and to work around a systemd policy which mandates that
modprobe must return in 30 seconds.

But then systemd/udev mutters:

   "You migh be able to work around the timeout with udev rules and
    OPTIONS+="event_timeout=120", but that code was maybe never used
    or tested, so it might not work correctly." [1]

AFAICT from the ubuntu bug system [2] nobody bothered even to try that.

And if the udev/systemd event_timeout option is broken it's way better
to fix that one instead of hacking random heuristics into the kernel.

Thanks,

	tglx

[1] http://lists.freedesktop.org/archives/systemd-devel/2014-March/018007.html
[2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705
Tetsuo Handa March 22, 2014, 11:50 p.m. UTC | #4
Oleg Nesterov wrote:
> On 03/22, Tetsuo Handa wrote:
> >
> > Oleg Nesterov wrote:
> > > Tetsuo, what do you think?
> >
> > I don't like blocking SIGKILL while doing operations that depend on memory
> > allocation by other processes. If the OOM killer is triggered and it chose
> > the process blocking SIGKILL in mptsas_init() (I know it unlikely happens),
> > it generates the OOM killer deadlock.
> 
> It seems that we do not understand each other.
> 

I'm agreeing with you regarding long term solution. I think that I and you
do not understand each other regarding which approach should be used for short
term solution.

> I do not like this hack too. And it is even wrong, you can't really block
> SIGKILL. And it is unnecessary in a sense that (I think) it is fine that
> module_init() reacts to SIGKILL and aborts, just the fact it crashes the
> kernel in the error paths is not fine.

I expect that kernel code reacts to SIGKILL and aborts, but current code
(not only fusion but any code) is not ready for reacting to SIGKILL due to
use of uninterruptible versions of lock/wait etc. operators.

> The driver should be fixed anyway. As for timeout, either userspace/systemd
> should be changed to not send SIGKILL after 30 secs, or (better) the driver
> should be changed to not waste 30 secs.

I'm not asserting that we should not fix the driver and the userspace.
I agree that we should fix the driver to respond SIGKILL properly and
fix the userspace not to send SIGKILL on hard coded timeout.

> 
> The hack I sent can only serve as a short term solution, it should be
> reverted once we have something better. And, otoh, this hack only affects
> the problematic driver which should be fixed in any case.
> 
> Do you see my point? I can be wrong, but I think that you constantly
> misunderstand the intent.
> 
> > My preference is to fix kthread_create() to handle SIGKILL gracefully.
> 
> And now I do not understand you too. I do not understand why should we
> "fix" kthread_create().

I can see your point. But as for kernel for Ubuntu 14.04 LTS (which the date
of kernel freeze comes shortly), fixing kthread_create() is the safer choice,
for we haven't proved that the fusion is the only kernel code which is
disturbed by commit 786235ee.

After we confirmed that there is no more kernel code which is disturbed by
commit 786235ee, we can revert the "kthread: Handle SIGKILL gracefully in
kthread_create()." patch.

> 
> > Many kernel operations (e.g. mutex_lock() wait_event() wait_for_completion())
> > ignore SIGKILL and the current users depend on SIGKILL being ignored. Thus,
> > commit 786235ee sounds like a kernel API breakage.
> 
> Personally I do not really think so, but OK. In this case lets revert
> 786235ee.
> 
> > Commit 786235ee "kthread: make kthread_create() killable" changed to
> > leave kthread_create() as soon as receiving SIGKILL. But this change
> > caused boot failures because systemd-udevd receives SIGKILL due to timeout
> > while loading SCSI controller drivers using finit_module() [1].
> 
> And I still think that 786235ee simply uncovered the problems in this
> driver. Perhaps we should change kthread_create() for some reason, but
> (imho) not because we need to help the buggy code.
> 

I don't mean to help the buggy code. The "kthread: Handle SIGKILL gracefully
in kthread_create()." patch (or revert commit 786235ee) is short term solution
(especially for distributions which the date of kernel freeze is approaching).

> > Therefore, this patch changes kthread_create() from "wait forever in
> > killable state" to "wait for 10 seconds in unkillable state, check for
> > the OOM killer every second".
> 
> Personally I dislike this change. In fact I think it is ugly. But this
> is only my opinion.
> 
> If you convince someone to take this patch I won't argue.
Tetsuo Handa March 22, 2014, 11:57 p.m. UTC | #5
Thomas Gleixner wrote:
> But then systemd/udev mutters:
> 
>    "You migh be able to work around the timeout with udev rules and
>     OPTIONS+="event_timeout=120", but that code was maybe never used
>     or tested, so it might not work correctly." [1]
> 
> AFAICT from the ubuntu bug system [2] nobody bothered even to try that.
> 
> And if the udev/systemd event_timeout option is broken it's way better
> to fix that one instead of hacking random heuristics into the kernel.

I haven't tried the event_timeout= option but I think it will not work.
The timeout is hard coded as shown below and there will be no chance for taking
the event_timeout= option into account.

---------- systemd-204/src/udev/udevd.c start ----------
(...snipped...)
                        /* check for hanging events */
                        udev_list_node_foreach(loop, &worker_list) {
                                struct worker *worker = node_to_worker(loop);

                                if (worker->state != WORKER_RUNNING)
                                        continue;

                                if ((now(CLOCK_MONOTONIC) - worker->event_start_usec) > 30 * 1000 * 1000) {
                                        log_error("worker [%u] %s timeout; kill it\n", worker->pid,
                                            worker->event ? worker->event->devpath : "<idle>");
                                        kill(worker->pid, SIGKILL);
                                        worker->state = WORKER_KILLED;
                                        /* drop reference taken for state 'running' */
                                        worker_unref(worker);
                                        if (worker->event) {
                                                log_error("seq %llu '%s' killed\n",
                                                          udev_device_get_seqnum(worker->event->dev), worker->event->devpath);
                                                worker->event->exitcode = -64;
                                                event_queue_delete(worker->event, true);
                                                worker->event = NULL;
                                        }
                                }
                        }
(...snipped...)
---------- systemd-204/src/udev/udevd.c end ----------
Thomas Gleixner March 23, 2014, 8:04 a.m. UTC | #6
On Sun, 23 Mar 2014, Tetsuo Handa wrote:

> Thomas Gleixner wrote:
> > But then systemd/udev mutters:
> > 
> >    "You migh be able to work around the timeout with udev rules and
> >     OPTIONS+="event_timeout=120", but that code was maybe never used
> >     or tested, so it might not work correctly." [1]
> > 
> > AFAICT from the ubuntu bug system [2] nobody bothered even to try that.
> > 
> > And if the udev/systemd event_timeout option is broken it's way better
> > to fix that one instead of hacking random heuristics into the kernel.
> 
> I haven't tried the event_timeout= option but I think it will not work.
> The timeout is hard coded as shown below and there will be no chance for taking
> the event_timeout= option into account.
> 
> ---------- systemd-204/src/udev/udevd.c start ----------
> (...snipped...)
>                         /* check for hanging events */
>                         udev_list_node_foreach(loop, &worker_list) {
>                                 struct worker *worker = node_to_worker(loop);
> 
>                                 if (worker->state != WORKER_RUNNING)
>                                         continue;
> 
>                                 if ((now(CLOCK_MONOTONIC) - worker->event_start_usec) > 30 * 1000 * 1000) {

And because systemd has an immutable hardcoded random timeout we add
another hardcoded random timeout into kthread_create() to work around
that.

How broken is that?

And it seems other people have solved it:

    http://www.redhat.com/archives/lvm-devel/2013-September/msg00036.html

Thanks,

	tglx
James Bottomley March 23, 2014, 2:19 p.m. UTC | #7
On Sun, 2014-03-23 at 09:04 +0100, Thomas Gleixner wrote:
> On Sun, 23 Mar 2014, Tetsuo Handa wrote:
> 
> > Thomas Gleixner wrote:
> > > But then systemd/udev mutters:
> > > 
> > >    "You migh be able to work around the timeout with udev rules and
> > >     OPTIONS+="event_timeout=120", but that code was maybe never used
> > >     or tested, so it might not work correctly." [1]
> > > 
> > > AFAICT from the ubuntu bug system [2] nobody bothered even to try that.
> > > 
> > > And if the udev/systemd event_timeout option is broken it's way better
> > > to fix that one instead of hacking random heuristics into the kernel.
> > 
> > I haven't tried the event_timeout= option but I think it will not work.
> > The timeout is hard coded as shown below and there will be no chance for taking
> > the event_timeout= option into account.
> > 
> > ---------- systemd-204/src/udev/udevd.c start ----------
> > (...snipped...)
> >                         /* check for hanging events */
> >                         udev_list_node_foreach(loop, &worker_list) {
> >                                 struct worker *worker = node_to_worker(loop);
> > 
> >                                 if (worker->state != WORKER_RUNNING)
> >                                         continue;
> > 
> >                                 if ((now(CLOCK_MONOTONIC) - worker->event_start_usec) > 30 * 1000 * 1000) {
> 
> And because systemd has an immutable hardcoded random timeout we add
> another hardcoded random timeout into kthread_create() to work around
> that.
> 
> How broken is that?
> 
> And it seems other people have solved it:
> 
>     http://www.redhat.com/archives/lvm-devel/2013-September/msg00036.html

I agree with Thomas.  A hardcoded timeout is a systemd bug.  However,
could I get confirmation, while you can use this bug to do it, that the
patch back in this thread actually fixes the crash when scsi_alloc_host
fails, that's the serious SCSI bug, in my view?

Thanks,

James
Thomas Gleixner March 23, 2014, 2:28 p.m. UTC | #8
On Sun, 23 Mar 2014, James Bottomley wrote:
> On Sun, 2014-03-23 at 09:04 +0100, Thomas Gleixner wrote:
> > On Sun, 23 Mar 2014, Tetsuo Handa wrote:
> > 
> > > Thomas Gleixner wrote:
> > > > But then systemd/udev mutters:
> > > > 
> > > >    "You migh be able to work around the timeout with udev rules and
> > > >     OPTIONS+="event_timeout=120", but that code was maybe never used
> > > >     or tested, so it might not work correctly." [1]
> > > > 
> > > > AFAICT from the ubuntu bug system [2] nobody bothered even to try that.
> > > > 
> > > > And if the udev/systemd event_timeout option is broken it's way better
> > > > to fix that one instead of hacking random heuristics into the kernel.
> > > 
> > > I haven't tried the event_timeout= option but I think it will not work.
> > > The timeout is hard coded as shown below and there will be no chance for taking
> > > the event_timeout= option into account.
> > > 
> > > ---------- systemd-204/src/udev/udevd.c start ----------
> > > (...snipped...)
> > >                         /* check for hanging events */
> > >                         udev_list_node_foreach(loop, &worker_list) {
> > >                                 struct worker *worker = node_to_worker(loop);
> > > 
> > >                                 if (worker->state != WORKER_RUNNING)
> > >                                         continue;
> > > 
> > >                                 if ((now(CLOCK_MONOTONIC) - worker->event_start_usec) > 30 * 1000 * 1000) {
> > 
> > And because systemd has an immutable hardcoded random timeout we add
> > another hardcoded random timeout into kthread_create() to work around
> > that.
> > 
> > How broken is that?
> > 
> > And it seems other people have solved it:
> > 
> >     http://www.redhat.com/archives/lvm-devel/2013-September/msg00036.html
> 
> I agree with Thomas.  A hardcoded timeout is a systemd bug.  However,
> could I get confirmation, while you can use this bug to do it, that the
> patch back in this thread actually fixes the crash when scsi_alloc_host
> fails, that's the serious SCSI bug, in my view?

Which patch, the one to kthread_create() or the one to SCSI?

Thanks,

	tglx
James Bottomley March 23, 2014, 2:29 p.m. UTC | #9
On Sun, 2014-03-23 at 15:28 +0100, Thomas Gleixner wrote:
> On Sun, 23 Mar 2014, James Bottomley wrote:
> > On Sun, 2014-03-23 at 09:04 +0100, Thomas Gleixner wrote:
> > > On Sun, 23 Mar 2014, Tetsuo Handa wrote:
> > > 
> > > > Thomas Gleixner wrote:
> > > > > But then systemd/udev mutters:
> > > > > 
> > > > >    "You migh be able to work around the timeout with udev rules and
> > > > >     OPTIONS+="event_timeout=120", but that code was maybe never used
> > > > >     or tested, so it might not work correctly." [1]
> > > > > 
> > > > > AFAICT from the ubuntu bug system [2] nobody bothered even to try that.
> > > > > 
> > > > > And if the udev/systemd event_timeout option is broken it's way better
> > > > > to fix that one instead of hacking random heuristics into the kernel.
> > > > 
> > > > I haven't tried the event_timeout= option but I think it will not work.
> > > > The timeout is hard coded as shown below and there will be no chance for taking
> > > > the event_timeout= option into account.
> > > > 
> > > > ---------- systemd-204/src/udev/udevd.c start ----------
> > > > (...snipped...)
> > > >                         /* check for hanging events */
> > > >                         udev_list_node_foreach(loop, &worker_list) {
> > > >                                 struct worker *worker = node_to_worker(loop);
> > > > 
> > > >                                 if (worker->state != WORKER_RUNNING)
> > > >                                         continue;
> > > > 
> > > >                                 if ((now(CLOCK_MONOTONIC) - worker->event_start_usec) > 30 * 1000 * 1000) {
> > > 
> > > And because systemd has an immutable hardcoded random timeout we add
> > > another hardcoded random timeout into kthread_create() to work around
> > > that.
> > > 
> > > How broken is that?
> > > 
> > > And it seems other people have solved it:
> > > 
> > >     http://www.redhat.com/archives/lvm-devel/2013-September/msg00036.html
> > 
> > I agree with Thomas.  A hardcoded timeout is a systemd bug.  However,
> > could I get confirmation, while you can use this bug to do it, that the
> > patch back in this thread actually fixes the crash when scsi_alloc_host
> > fails, that's the serious SCSI bug, in my view?
> 
> Which patch, the one to kthread_create() or the one to SCSI?

The one to SCSI ... I'm only really interested in the oops when
scsi_host_alloc fails.

James
diff mbox

Patch

diff --git a/kernel/kthread.c b/kernel/kthread.c
index b5ae3ee..6a25a9f 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -269,6 +269,7 @@  struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 					   const char namefmt[],
 					   ...)
 {
+	int i = 0;
 	DECLARE_COMPLETION_ONSTACK(done);
 	struct task_struct *task;
 	struct kthread_create_info *create = kmalloc(sizeof(*create),
@@ -287,24 +288,28 @@  struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 
 	wake_up_process(kthreadd_task);
 	/*
-	 * Wait for completion in killable state, for I might be chosen by
-	 * the OOM killer while kthreadd is trying to allocate memory for
-	 * new kernel thread.
+	 * Wait for completion with 10 seconds timeout. Unless the kthreadd is
+	 * blocked for direct memory reclaim path, the kthreadd will be able to
+	 * complete the request within 10 seconds. Also, check every second if
+	 * I was chosen by the OOM killer in order to avoid the OOM killer
+	 * deadlock.
 	 */
-	if (unlikely(wait_for_completion_killable(&done))) {
-		/*
-		 * If I was SIGKILLed before kthreadd (or new kernel thread)
-		 * calls complete(), leave the cleanup of this structure to
-		 * that thread.
-		 */
-		if (xchg(&create->done, NULL))
-			return ERR_PTR(-ENOMEM);
-		/*
-		 * kthreadd (or new kernel thread) will call complete()
-		 * shortly.
-		 */
-		wait_for_completion(&done);
+	do {
+		if (likely(wait_for_completion_timeout(&done, HZ)))
+			goto ready;
+	} while (i++ < 10 && !test_thread_flag(TIF_MEMDIE));
+	/*
+	 * The kthreadd was unable to complete the request within 10 seconds
+	 * (or I was chosen by the OOM killer). Thus, give up and leave the
+	 * cleanup of this structure to the kthreadd (or new kernel thread).
+	 */
+	if (xchg(&create->done, NULL)) {
+		WARN(1, "Gave up waiting for kthreadd.\n");
+		return ERR_PTR(-EINTR);
 	}
+	/* kthreadd (or new kernel thread) will call complete() shortly. */
+	wait_for_completion(&done);
+ready:
 	task = create->result;
 	if (!IS_ERR(task)) {
 		static const struct sched_param param = { .sched_priority = 0 };