
[RFC,v2,17/18] livepatch: change to a per-task consistency model

Message ID d60b2b9ce6911bb10dbbf244df58df31c7a8af66.1461875890.git.jpoimboe@redhat.com (mailing list archive)
State Not Applicable

Commit Message

Josh Poimboeuf April 28, 2016, 8:44 p.m. UTC
Change livepatch to use a basic per-task consistency model.  This is the
foundation which will eventually enable us to patch those ~10% of
security patches which change function or data semantics.  This is the
biggest remaining piece needed to make livepatch more generally useful.

This code stems from the design proposal made by Vojtech [1] in November
2014.  It's a hybrid of kGraft and kpatch: it uses kGraft's per-task
consistency and syscall barrier switching combined with kpatch's stack
trace switching.  There are also a number of fallback options which make
it quite flexible.

Patches are applied on a per-task basis, when the task is deemed safe to
switch over.  When a patch is enabled, livepatch enters into a
transition state where tasks are converging to the patched state.
Usually the transition can complete in a few seconds.  The same
sequence occurs when a patch is disabled, except the tasks converge from
the patched state to the unpatched state.

An interrupt handler inherits the patched state of the task it
interrupts.  The same is true for forked tasks: the child inherits the
patched state of the parent.

Livepatch uses several complementary approaches to determine when it's
safe to patch tasks:

1. The first and most effective approach is stack checking of sleeping
   tasks.  If no affected functions are on the stack of a given task,
   the task is patched.  In most cases this will patch most or all of
   the tasks on the first try.  Otherwise it'll keep trying
   periodically.  This option is only available if the architecture has
   reliable stacks (CONFIG_RELIABLE_STACKTRACE and
   CONFIG_STACK_VALIDATION).

2. The second approach, if needed, is kernel exit switching.  A
   task is switched when it returns to user space from a system call, a
   user space IRQ, or a signal.  It's useful in the following cases:

   a) Patching I/O-bound user tasks which are sleeping on an affected
      function.  In this case you have to send SIGSTOP and SIGCONT to
      force them to exit the kernel and be patched.
   b) Patching CPU-bound user tasks.  If the task is highly CPU-bound
      then it will get patched the next time it gets interrupted by an
      IRQ.
   c) Applying patches for architectures which don't yet have
      CONFIG_RELIABLE_STACKTRACE.  In this case you'll have to signal
      most of the tasks on the system.  However this isn't a complete
      solution, because there's currently no way to patch kthreads
      without CONFIG_RELIABLE_STACKTRACE.

   Note: since idle "swapper" tasks don't ever exit the kernel, they
   instead have a klp_patch_task() call in the idle loop which allows
   them to be patched before the CPU enters the idle state.

3. A third approach (not yet implemented) is planned for the case where
   a kthread is sleeping on an affected function.  In that case we could
   kick the kthread with a signal and then try to patch the task from
   the to-be-patched function's livepatch ftrace handler when it
   re-enters the function.  This will require
   CONFIG_RELIABLE_STACKTRACE.

All the above approaches may be skipped by setting the 'immediate' flag
in the 'klp_patch' struct, which will patch all tasks immediately.  This
can be useful if the patch doesn't change any function or data
semantics.  Note that, even with this flag set, it's possible that some
tasks may still be running with an old version of the function, until
that function returns.

There's also an 'immediate' flag in the 'klp_func' struct which allows
you to specify that certain functions in the patch can be applied
without per-task consistency.  This might be useful if you want to patch
a common function like schedule(), and the function change doesn't need
consistency but the rest of the patch does.

For architectures which don't have CONFIG_RELIABLE_STACKTRACE, there
are two options:

a) the user can set the patch->immediate flag which causes all tasks to
   be patched immediately.  This option should be used with care, only
   when the patch doesn't change any function or data semantics; or

b) use the kernel exit switching approach (this is the default).
   Note that patching will never complete, because there's currently no
   way to patch kthreads without CONFIG_RELIABLE_STACKTRACE.

The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
is in transition.  Only a single patch (the topmost patch on the stack)
can be in transition at a given time.  A patch can remain in transition
indefinitely, if any of the tasks are stuck in the initial patch state.

A transition can be reversed and effectively canceled by writing the
opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
the transition is in progress.  Then all the tasks will attempt to
converge back to the original patch state.

[1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 Documentation/ABI/testing/sysfs-kernel-livepatch |   8 +
 Documentation/livepatch/livepatch.txt            | 132 ++++++-
 include/linux/init_task.h                        |   9 +
 include/linux/livepatch.h                        |  34 +-
 include/linux/sched.h                            |   3 +
 kernel/fork.c                                    |   3 +
 kernel/livepatch/Makefile                        |   2 +-
 kernel/livepatch/core.c                          |  98 +++--
 kernel/livepatch/patch.c                         |  43 +-
 kernel/livepatch/patch.h                         |   1 +
 kernel/livepatch/transition.c                    | 474 +++++++++++++++++++++++
 kernel/livepatch/transition.h                    |  14 +
 kernel/sched/idle.c                              |   4 +
 13 files changed, 781 insertions(+), 44 deletions(-)
 create mode 100644 kernel/livepatch/transition.c
 create mode 100644 kernel/livepatch/transition.h

Comments

Petr Mladek May 4, 2016, 8:42 a.m. UTC | #1
On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> Change livepatch to use a basic per-task consistency model.  This is the
> foundation which will eventually enable us to patch those ~10% of
> security patches which change function or data semantics.  This is the
> biggest remaining piece needed to make livepatch more generally useful.

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -76,6 +76,7 @@
>  #include <linux/compiler.h>
>  #include <linux/sysctl.h>
>  #include <linux/kcov.h>
> +#include <linux/livepatch.h>
>  
>  #include <asm/pgtable.h>
>  #include <asm/pgalloc.h>
> @@ -1586,6 +1587,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  		p->parent_exec_id = current->self_exec_id;
>  	}
>  
> +	klp_copy_process(p);

I have doubts here. We copy the state from the parent here. It means
that the new process might still need to be converted. But at the same
time, print_context_stack_reliable() returns zero without printing
any stack trace when the TIF_FORK flag is set. It means that a freshly
forked task might get converted immediately. It seems that both
operations are always done when copy_process() is called, but
they contradict each other.

I guess that print_context_stack_reliable() should either return
-EINVAL when TIF_FORK is set, or it should try to print the
stack of the newly forked task.

Or do I miss something, please?

> +
>  	spin_lock(&current->sighand->siglock);
>  
>  	/*

[...]

> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> new file mode 100644
> index 0000000..92819bb
> --- /dev/null
> +++ b/kernel/livepatch/transition.c
> +/*
> + * This function can be called in the middle of an existing transition to
> + * reverse the direction of the target patch state.  This can be done to
> + * effectively cancel an existing enable or disable operation if there are any
> + * tasks which are stuck in the initial patch state.
> + */
> +void klp_reverse_transition(void)
> +{
> +	struct klp_patch *patch = klp_transition_patch;
> +
> +	klp_target_state = !klp_target_state;
> +
> +	/*
> +	 * Ensure that if another CPU goes through the syscall barrier, sees
> +	 * the TIF_PATCH_PENDING writes in klp_start_transition(), and calls
> +	 * klp_patch_task(), it also sees the above write to the target state.
> +	 * Otherwise it can put the task in the wrong universe.
> +	 */
> +	smp_wmb();
> +
> +	klp_start_transition();
> +	klp_try_complete_transition();

It is a bit strange that we keep the work scheduled. It might be
better to use

       mod_delayed_work(system_wq, &klp_work, 0);

Which triggers more ideas from the nitpicking department:

I would move the work definition from core.c to transition.c because
it is closely related to klp_try_complete_transition();

While at it, I would make it clearer that the work is related
to the transition. Also I would call queue_delayed_work() directly
instead of adding the klp_schedule_work() wrapper. The delay
might be defined using a constant, e.g.

#define KLP_TRANSITION_DELAY round_jiffies_relative(HZ)

queue_delayed_work(system_wq, &klp_transition_work, KLP_TRANSITION_DELAY);

Finally, the following is always called right after
klp_start_transition(), so I would call it from there.

	if (!klp_try_complete_transition())
		klp_schedule_work();


> +
> +	patch->enabled = !patch->enabled;
> +}
> +

It is really great work! I am checking this patch from left, right, top,
and even bottom and all seems to work well together.

Best Regards,
Petr
Petr Mladek May 4, 2016, 12:39 p.m. UTC | #2
On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> Change livepatch to use a basic per-task consistency model.  This is the
> foundation which will eventually enable us to patch those ~10% of
> security patches which change function or data semantics.  This is the
> biggest remaining piece needed to make livepatch more generally useful.

I spent a lot of time checking the memory barriers. It seems that
they are basically correct.  Let me use my own words to show how
I understand it. I hope that it will help others with the review.

> diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
> index 782fbb5..b3b8639 100644
> --- a/kernel/livepatch/patch.c
> +++ b/kernel/livepatch/patch.c
> @@ -29,6 +29,7 @@
>  #include <linux/bug.h>
>  #include <linux/printk.h>
>  #include "patch.h"
> +#include "transition.h"
>  
>  static LIST_HEAD(klp_ops);
>  
> @@ -58,11 +59,42 @@ static void notrace klp_ftrace_handler(unsigned long ip,
>  	ops = container_of(fops, struct klp_ops, fops);
>  
>  	rcu_read_lock();
> +
>  	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
>  				      stack_node);
> -	if (WARN_ON_ONCE(!func))
> +
> +	if (!func)
>  		goto unlock;
>  
> +	/*
> +	 * See the comment for the 2nd smp_wmb() in klp_init_transition() for
> +	 * an explanation of why this read barrier is needed.
> +	 */
> +	smp_rmb();

I would prefer to be more explicit, e.g.

	/*
	 * Read the right func->transition when the struct appeared on top of
	 * func_stack.  See klp_init_transition and klp_patch_func().
	 */

Note that this barrier is not really needed when the patch is being
disabled, see below.

> +
> +	if (unlikely(func->transition)) {
> +
> +		/*
> +		 * See the comment for the 1st smp_wmb() in
> +		 * klp_init_transition() for an explanation of why this read
> +		 * barrier is needed.
> +		 */
> +		smp_rmb();

Similar here:

		/*
		 * Read the right initial state when func->transition was
		 * enabled, see klp_init_transition().
		 *
		 * Note that the task must never be migrated to the target
		 * state when being inside this ftrace handler.
		 */

We might want to move the second paragraph on top of the function.
It is a basic and important fact. It actually explains why the first
read barrier is not needed when the patch is being disabled.

There are some more details below. I started to check and comment the
barriers from klp_init_transition().


> +		if (current->patch_state == KLP_UNPATCHED) {
> +			/*
> +			 * Use the previously patched version of the function.
> +			 * If no previous patches exist, use the original
> +			 * function.
> +			 */
> +			func = list_entry_rcu(func->stack_node.next,
> +					      struct klp_func, stack_node);
> +
> +			if (&func->stack_node == &ops->func_stack)
> +				goto unlock;
> +		}
> +	}
> +
>  	klp_arch_set_pc(regs, (unsigned long)func->new_func);
>  unlock:
>  	rcu_read_unlock();
> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> new file mode 100644
> index 0000000..92819bb
> --- /dev/null
> +++ b/kernel/livepatch/transition.c
> +/*
> + * klp_patch_task() - change the patched state of a task
> + * @task:	The task to change
> + *
> + * Switches the patched state of the task to the set of functions in the target
> + * patch state.
> + */
> +void klp_patch_task(struct task_struct *task)
> +{
> +	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
> +
> +	/*
> +	 * The corresponding write barriers are in klp_init_transition() and
> +	 * klp_reverse_transition().  See the comments there for an explanation.
> +	 */
> +	smp_rmb();

I would prefer to be more explicit, e.g.

	/*
	 * Read the correct klp_target_state when TIF_PATCH_PENDING was set
	 * and this function was called.  See klp_init_transition() and
	 * klp_reverse_transition().
	 */
> +
> +	task->patch_state = klp_target_state;
> +}

The function name confused me a few times when klp_target_state
was KLP_UNPATCHED. I suggest renaming it to klp_update_task()
or klp_transit_task().

> +/*
> + * Initialize the global target patch state and all tasks to the initial patch
> + * state, and initialize all function transition states to true in preparation
> + * for patching or unpatching.
> + */
> +void klp_init_transition(struct klp_patch *patch, int state)
> +{
> +	struct task_struct *g, *task;
> +	unsigned int cpu;
> +	struct klp_object *obj;
> +	struct klp_func *func;
> +	int initial_state = !state;
> +
> +	klp_transition_patch = patch;
> +
> +	/*
> +	 * If the patch can be applied or reverted immediately, skip the
> +	 * per-task transitions.
> +	 */
> +	if (patch->immediate)
> +		return;
> +
> +	/*
> +	 * Initialize all tasks to the initial patch state to prepare them for
> +	 * switching to the target state.
> +	 */
> +	read_lock(&tasklist_lock);
> +	for_each_process_thread(g, task)
> +		task->patch_state = initial_state;
> +	read_unlock(&tasklist_lock);
> +
> +	/*
> +	 * Ditto for the idle "swapper" tasks.
> +	 */
> +	get_online_cpus();
> +	for_each_online_cpu(cpu)
> +		idle_task(cpu)->patch_state = initial_state;
> +	put_online_cpus();
> +
> +	/*
> +	 * Ensure klp_ftrace_handler() sees the task->patch_state updates
> +	 * before the func->transition updates.  Otherwise it could read an
> +	 * out-of-date task state and pick the wrong function.
> +	 */
> +	smp_wmb();

This barrier is needed when the patch is being disabled. In this case,
the ftrace handler is already in use and the related struct klp_func
are on top of func_stack. The purpose is well described above.

It is not needed when the patch is being enabled because it is
not visible to the ftrace handler at the moment. The barrier below
is enough.


> +	/*
> +	 * Set the func transition states so klp_ftrace_handler() will know to
> +	 * switch to the transition logic.
> +	 *
> +	 * When patching, the funcs aren't yet in the func_stack and will be
> +	 * made visible to the ftrace handler shortly by the calls to
> +	 * klp_patch_object().
> +	 *
> +	 * When unpatching, the funcs are already in the func_stack and so are
> +	 * already visible to the ftrace handler.
> +	 */
> +	klp_for_each_object(patch, obj)
> +		klp_for_each_func(obj, func)
> +			func->transition = true;
> +
> +	/*
> +	 * Set the global target patch state which tasks will switch to.  This
> +	 * has no effect until the TIF_PATCH_PENDING flags get set later.
> +	 */
> +	klp_target_state = state;
> +
> +	/*
> +	 * For the enable path, ensure klp_ftrace_handler() will see the
> +	 * func->transition updates before the funcs become visible to the
> +	 * handler.  Otherwise the handler may wrongly pick the new func before
> +	 * the task switches to the patched state.

In other words, it makes sure that the ftrace handler will see
the updated func->transition before the ftrace handler is registered
and/or before the struct klp_func is listed in func_stack.


> +	 * For the disable path, the funcs are already visible to the handler.
> +	 * But we still need to ensure the ftrace handler will see the
> +	 * func->transition updates before the tasks start switching to the
> +	 * unpatched state.  Otherwise the handler can miss a task patch state
> +	 * change which would result in it wrongly picking the new function.

If this is true, it would mean that the task might be switched while it
is in the middle of klp_ftrace_handler(). That would make reading
task->patch_state racy against the modification by
klp_patch_task().

Note that before we call klp_patch_task(), the task stays in the
previous state. We are disabling the patch, so the previous state
is that the patch is enabled. It means that the task always uses
the new function before klp_patch_task() is called, so it does not
matter whether it sees func->transition updated or not. In both
cases, it will use the new function.

Fortunately, task->patch_state can be set to KLP_UNPATCHED only
when the task is sleeping or in some other safe location, e.g.
when returning to user space. In both cases, the barrier is not
needed here.

In other words, this barrier is not needed to synchronize func_stack
and func->transition when the patch is being disabled.


> +	 * This barrier also ensures that if another CPU goes through the
> +	 * syscall barrier, sees the TIF_PATCH_PENDING writes in
> +	 * klp_start_transition(), and calls klp_patch_task(), it also sees the
> +	 * above write to the target state.  Otherwise it can put the task in
> +	 * the wrong universe.
> +	 */

In other words, it makes sure that klp_patch_task() will assign the
right patch_state. Note that klp_patch_task() cannot be called
before we set TIF_PATCH_PENDING in klp_start_transition().

> +	smp_wmb();
> +}
> +
> +/*
> + * Start the transition to the specified target patch state so tasks can begin
> + * switching to it.
> + */
> +void klp_start_transition(void)
> +{
> +	struct task_struct *g, *task;
> +	unsigned int cpu;
> +
> +	pr_notice("'%s': %s...\n", klp_transition_patch->mod->name,
> +		  klp_target_state == KLP_PATCHED ? "patching" : "unpatching");
> +
> +	/*
> +	 * If the patch can be applied or reverted immediately, skip the
> +	 * per-task transitions.
> +	 */
> +	if (klp_transition_patch->immediate)
> +		return;
> +
> +	/*
> +	 * Mark all normal tasks as needing a patch state update.  As they pass
> +	 * through the syscall barrier they'll switch over to the target state
> +	 * (unless we switch them in klp_try_complete_transition() first).
> +	 */
> +	read_lock(&tasklist_lock);
> +	for_each_process_thread(g, task)
> +		set_tsk_thread_flag(task, TIF_PATCH_PENDING);

A bad intuition might suggest that we do not need to set this flag
when klp_start_transition() is called from klp_reverse_transition()
and the task already is in the right state.

But I think that we actually must set TIF_PATCH_PENDING even in this
case to avoid a possible race. We do not know whether klp_patch_task()
is still running at the moment with the previous klp_target_state.


> +	read_unlock(&tasklist_lock);
> +
> +	/*
> +	 * Ditto for the idle "swapper" tasks, though they never cross the
> +	 * syscall barrier.  Instead they switch over in cpu_idle_loop().
> +	 */
> +	get_online_cpus();
> +	for_each_online_cpu(cpu)
> +		set_tsk_thread_flag(idle_task(cpu), TIF_PATCH_PENDING);
> +	put_online_cpus();
> +}
> +
> +/*
> + * This function can be called in the middle of an existing transition to
> + * reverse the direction of the target patch state.  This can be done to
> + * effectively cancel an existing enable or disable operation if there are any
> + * tasks which are stuck in the initial patch state.
> + */
> +void klp_reverse_transition(void)
> +{
> +	struct klp_patch *patch = klp_transition_patch;
> +
> +	klp_target_state = !klp_target_state;
> +
> +	/*
> +	 * Ensure that if another CPU goes through the syscall barrier, sees
> +	 * the TIF_PATCH_PENDING writes in klp_start_transition(), and calls
> +	 * klp_patch_task(), it also sees the above write to the target state.
> +	 * Otherwise it can put the task in the wrong universe.
> +	 */
> +	smp_wmb();

Yup, it is the same reason as for the 2nd barrier in klp_init_transition()
regarding klp_target_state and klp_patch_task() that is triggered by
TIF_PATCH_PENDING.

> +
> +	klp_start_transition();
> +	klp_try_complete_transition();
> +
> +	patch->enabled = !patch->enabled;
> +}
> +

Best Regards,
Petr
Peter Zijlstra May 4, 2016, 1:53 p.m. UTC | #3
On Wed, May 04, 2016 at 02:39:40PM +0200, Petr Mladek wrote:
> > +	 * This barrier also ensures that if another CPU goes through the
> > +	 * syscall barrier, sees the TIF_PATCH_PENDING writes in
> > +	 * klp_start_transition(), and calls klp_patch_task(), it also sees the
> > +	 * above write to the target state.  Otherwise it can put the task in
> > +	 * the wrong universe.
> > +	 */
> 
> In other words, it makes sure that klp_patch_task() will assign the
> right patch_state. Note that klp_patch_task() cannot be called
> before we set TIF_PATCH_PENDING in klp_start_transition().
> 
> > +	smp_wmb();
> > +}

So I've not read the patch; but ending a function with an smp_wmb()
feels wrong.

A wmb orders two stores, and I feel both stores should be well visible
in the same function.
Petr Mladek May 4, 2016, 2:12 p.m. UTC | #4
On Wed 2016-05-04 14:39:40, Petr Mladek wrote:
> 		 *
> 		 * Note that the task must never be migrated to the target
> 		 * state when being inside this ftrace handler.
> 		 */
> 
> We might want to move the second paragraph on top of the function.
> It is a basic and important fact. It actually explains why the first
> read barrier is not needed when the patch is being disabled.

I wrote the statement partly intuitively. I think that it is really
somewhat important. And I have slight doubts whether we are on the safe side.

First, why is it important that the task->patch_state is not switched
when being inside the ftrace handler?

If we are inside the handler, we are kind-of inside the called
function. And the basic idea of this consistency model is that
we must not switch a task when it is inside a patched function.
This is normally decided by the stack.

The handler is a bit special because it is called right before the
function. If it was the only patched function on the stack, it would
not matter if we choose the new or old code. Both decisions would
be safe for the moment.

The fun starts when the function calls another patched function.
The other patched function must be called consistently with
the first one. If the first function was from the patch,
the other must be from the patch as well and vice versa.

This is why we must not switch task->patch_state dangerously
when being inside the ftrace handler.

Now I am not sure if this condition is fulfilled. The ftrace handler
is called as the very first instruction of the function. Doesn't
it break the stack validity? Could we sleep inside the ftrace
handler? Will the patched function be detected on the stack?

Or is my brain already too far in the fantasy world?


Best regards,
Petr
Petr Mladek May 4, 2016, 2:48 p.m. UTC | #5
On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> Change livepatch to use a basic per-task consistency model.  This is the
> foundation which will eventually enable us to patch those ~10% of
> security patches which change function or data semantics.  This is the
> biggest remaining piece needed to make livepatch more generally useful.
> 
> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> new file mode 100644
> index 0000000..92819bb
> --- /dev/null
> +++ b/kernel/livepatch/transition.c
> +/*
> + * klp_patch_task() - change the patched state of a task
> + * @task:	The task to change
> + *
> + * Switches the patched state of the task to the set of functions in the target
> + * patch state.
> + */
> +void klp_patch_task(struct task_struct *task)
> +{
> +	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
> +
> +	/*
> +	 * The corresponding write barriers are in klp_init_transition() and
> +	 * klp_reverse_transition().  See the comments there for an explanation.
> +	 */
> +	smp_rmb();
> +
> +	task->patch_state = klp_target_state;
> +}
> +
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index bd12c6c..60d633f 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -9,6 +9,7 @@
>  #include <linux/mm.h>
>  #include <linux/stackprotector.h>
>  #include <linux/suspend.h>
> +#include <linux/livepatch.h>
>  
>  #include <asm/tlb.h>
>  
> @@ -266,6 +267,9 @@ static void cpu_idle_loop(void)
>  
>  		sched_ttwu_pending();
>  		schedule_preempt_disabled();
> +
> +		if (unlikely(klp_patch_pending(current)))
> +			klp_patch_task(current);
>  	}

Some more ideas from the world of crazy races. I was shaking my head
over whether this was safe or not.

The problem might be if the task gets rescheduled between the check
for the pending work or inside the klp_patch_task() function.
This will get even more important when we use this construct
in more locations, e.g. in some kthreads.

If the task is sleeping in these strange locations, it might assign
strange values at strange times.

I think that it is safe only because it is called with the
'current' parameter and from safe locations. It means that
the result is always safe and consistent. Also, we could assign
an outdated value only when sleeping between reading klp_target_state
and storing task->patch_state. But if anyone modified
klp_target_state at this point, they also set TIF_PATCH_PENDING,
so the change will not get lost.

I think that we should document that klp_patch_task() must be
called only from a safe location within the affected task.

I even suggest avoiding misuse by removing the struct task_struct *
parameter. It should always be called with current.

Best Regards,
Petr
Jiri Kosina May 4, 2016, 2:56 p.m. UTC | #6
On Wed, 4 May 2016, Petr Mladek wrote:

> > +
> > +		if (unlikely(klp_patch_pending(current)))
> > +			klp_patch_task(current);
> >  	}
> 
> Some more ideas from the world of crazy races. I was shaking my head
> if this was safe or not.
> 
> The problem might be if the task gets rescheduled between the check 
> for the pending work 

The code in question is running with preemption disabled.

> or inside the klp_patch_task() function. 

We must make sure that this function doesn't go to sleep. It's only used 
to clear the task_struct flag anyway.
Josh Poimboeuf May 4, 2016, 3:51 p.m. UTC | #7
On Wed, May 04, 2016 at 10:42:23AM +0200, Petr Mladek wrote:
> On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > Change livepatch to use a basic per-task consistency model.  This is the
> > foundation which will eventually enable us to patch those ~10% of
> > security patches which change function or data semantics.  This is the
> > biggest remaining piece needed to make livepatch more generally useful.
> 
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -76,6 +76,7 @@
> >  #include <linux/compiler.h>
> >  #include <linux/sysctl.h>
> >  #include <linux/kcov.h>
> > +#include <linux/livepatch.h>
> >  
> >  #include <asm/pgtable.h>
> >  #include <asm/pgalloc.h>
> > @@ -1586,6 +1587,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> >  		p->parent_exec_id = current->self_exec_id;
> >  	}
> >  
> > +	klp_copy_process(p);
> 
> I have doubts here. We copy the state from the parent here. It means
> that the new process might still need to be converted. But at the same
> time, print_context_stack_reliable() returns zero without printing
> any stack trace when the TIF_FORK flag is set. It means that a freshly
> forked task might get converted immediately. It seems that both
> operations are always done when copy_process() is called, but
> they contradict each other.
> 
> I guess that print_context_stack_reliable() should either return
> -EINVAL when TIF_FORK is set, or it should try to print the
> stack of the newly forked task.
> 
> Or do I miss something, please?

Ok, I admit it's confusing.

A newly forked task doesn't *have* a stack (other than the pt_regs frame
it needs for the return to user space), which is why
print_context_stack_reliable() returns success with an empty array of
addresses.

For a little background, see the second switch_to() macro in
arch/x86/include/asm/switch_to.h.  When a newly forked task runs for the
first time, it returns from __switch_to() with no stack.  It then jumps
straight to ret_from_fork in entry_64.S, calls a few C functions, and
eventually returns to user space.  So, assuming we aren't patching entry
code or the switch_to() macro in __schedule(), it should be safe to
patch the task before it does all that.

With the current code, if an unpatched task gets forked, the child will
also be unpatched.  In theory, we could go ahead and patch the child
then.  In fact, that's what I did in v1.9.

But in v1.9 discussions it was pointed out that someday maybe the
ret_from_fork stuff will get cleaned up and instead the child stack will
be copied from the parent.  In that case the child should inherit its
parent's patched state.  So we decided to make it more future-proof by
having the child inherit the parent's patched state.

So, having said all that, I'm really not sure what the best approach is
for print_context_stack_reliable().  Right now I'm thinking I'll change
it back to return -EINVAL for a newly forked task, so it'll be more
future-proof: better to have a false positive than a false negative.
Either way it will probably need to be changed again if the
ret_from_fork code gets cleaned up.

> > +
> >  	spin_lock(&current->sighand->siglock);
> >  
> >  	/*
> 
> [...]
> 
> > diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> > new file mode 100644
> > index 0000000..92819bb
> > --- /dev/null
> > +++ b/kernel/livepatch/transition.c
> > +/*
> > + * This function can be called in the middle of an existing transition to
> > + * reverse the direction of the target patch state.  This can be done to
> > + * effectively cancel an existing enable or disable operation if there are any
> > + * tasks which are stuck in the initial patch state.
> > + */
> > +void klp_reverse_transition(void)
> > +{
> > +	struct klp_patch *patch = klp_transition_patch;
> > +
> > +	klp_target_state = !klp_target_state;
> > +
> > +	/*
> > +	 * Ensure that if another CPU goes through the syscall barrier, sees
> > +	 * the TIF_PATCH_PENDING writes in klp_start_transition(), and calls
> > +	 * klp_patch_task(), it also sees the above write to the target state.
> > +	 * Otherwise it can put the task in the wrong universe.
> > +	 */
> > +	smp_wmb();
> > +
> > +	klp_start_transition();
> > +	klp_try_complete_transition();
> 
> It is a bit strange that we keep the work scheduled. It might be
> better to use
> 
>        mod_delayed_work(system_wq, &klp_work, 0);

True, I think that would be better.

> Which triggers more ideas from the nitpicking department:
> 
> I would move the work definition from core.c to transition.c because
> it is closely related to klp_try_complete_transition();

That could be good, but there's a slight problem: klp_work_fn() requires
klp_mutex, which is static to core.c.  It's kind of nice to keep the use
of the mutex in core.c only.

> While at it, I would make it clearer that the work is related
> to the transition.

How would you recommend doing that?  How about:

- rename "klp_work" -> "klp_transition_work"
- rename "klp_work_fn" -> "klp_transition_work_fn" 

?

> Also I would call queue_delayed_work() directly
> instead of adding the klp_schedule_work() wrapper. The delay
> might be defined using a constant, e.g.
> 
> #define KLP_TRANSITION_DELAY round_jiffies_relative(HZ)
> 
> queue_delayed_work(system_wq, &klp_transition_work, KLP_TRANSITION_DELAY);

Sure.

> Finally, the following is always called right after
> klp_start_transition(), so I would call it from there.
> 
> 	if (!klp_try_complete_transition())
> 		klp_schedule_work();

Except for when it's called by klp_reverse_transition().  And it really
depends on whether we want to allow transition.c to use the mutex.  I
don't have a strong opinion either way, I may need to think about it
some more.

> > +
> > +	patch->enabled = !patch->enabled;
> > +}
> > +
> 
> It is really great work! I am checking this patch from left, right, top,
> and even bottom and all seems to work well together.

Great!  Thanks a lot for the thorough review!
Josh Poimboeuf May 4, 2016, 4:51 p.m. UTC | #8
On Wed, May 04, 2016 at 03:53:29PM +0200, Peter Zijlstra wrote:
> On Wed, May 04, 2016 at 02:39:40PM +0200, Petr Mladek wrote:
> > > +	 * This barrier also ensures that if another CPU goes through the
> > > +	 * syscall barrier, sees the TIF_PATCH_PENDING writes in
> > > +	 * klp_start_transition(), and calls klp_patch_task(), it also sees the
> > > +	 * above write to the target state.  Otherwise it can put the task in
> > > +	 * the wrong universe.

(oops, missed a "universe" -> "patch state" rename)

> > > +	 */
> > 
> > In other words, it makes sure that klp_patch_task() will assign the
> > right patch_state. Note that klp_patch_task() cannot be called
> > before we set TIF_PATCH_PENDING in klp_start_transition().
> > 
> > > +	smp_wmb();
> > > +}
> 
> So I've not read the patch; but ending a function with an smp_wmb()
> feels wrong.
> 
> A wmb orders two stores, and I feel both stores should be well visible
> in the same function.

Yeah, I would agree with that.  And also, it's probably a red flag that
the barrier needs *three* paragraphs to describe the various cases it's
needed for.

However, there are some complications:

1) The stores are in separate functions (which is generally a good
   thing as it greatly helps the readability of the code).

2) Which stores are being ordered depends on whether the function is
   called in the enable path or the disable path.

3) Either way it actually orders *two* separate pairs of stores.

Anyway I'm thinking I should move that barrier out of
klp_init_transition() and into its callers.  The stores will still be in
separate functions but at least there will be better visibility of where
the stores are occurring, and the comments can be a little more focused.
Josh Poimboeuf May 4, 2016, 5:02 p.m. UTC | #9
On Wed, May 04, 2016 at 02:39:40PM +0200, Petr Mladek wrote:
> On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > Change livepatch to use a basic per-task consistency model.  This is the
> > foundation which will eventually enable us to patch those ~10% of
> > security patches which change function or data semantics.  This is the
> > biggest remaining piece needed to make livepatch more generally useful.
> 
> I spent a lot of time checking the memory barriers. It seems that
> they are basically correct.  Let me use my own words to show how
> I understand it. I hope that it will help others with the review.

[...snip a ton of useful comments...]

Thanks, this will help a lot!  I'll try to incorporate your barrier
comments into the code.

I also agree that klp_patch_task() is poorly named.  I was trying to
make it clear to external callers that "hey, the task is getting patched
now!", but it's internally inconsistent with livepatch code because we
make a distinction between patching and unpatching.

Maybe I'll do:

  klp_update_task_patch_state()
Josh Poimboeuf May 4, 2016, 5:25 p.m. UTC | #10
On Wed, May 04, 2016 at 04:12:05PM +0200, Petr Mladek wrote:
> On Wed 2016-05-04 14:39:40, Petr Mladek wrote:
> > 		 *
> > 		 * Note that the task must never be migrated to the target
> > 		 * state when being inside this ftrace handler.
> > 		 */
> > 
> > We might want to move the second paragraph on top of the function.
> > It is a basic and important fact. It actually explains why the first
> > read barrier is not needed when the patch is being disabled.
> 
> I wrote the statement partly intuitively. I think that it really is
> somehow important. And I slightly doubt whether we are on the safe side.
> 
> First, why is it important that the task->patch_state is not switched
> when being inside the ftrace handler?
> 
> If we are inside the handler, we are kind-of inside the called
> function. And the basic idea of this consistency model is that
> we must not switch a task when it is inside a patched function.
> This is normally decided by the stack.
> 
> The handler is a bit special because it is called right before the
> function. If it was the only patched function on the stack, it would
> not matter if we choose the new or old code. Both decisions would
> be safe for the moment.
> 
> The fun starts when the function calls another patched function.
> The other patched function must be called consistently with
> the first one. If the first function was from the patch,
> the other must be from the patch as well and vice versa.
> 
> This is why we must not switch task->patch_state dangerously
> when being inside the ftrace handler.
> 
> Now I am not sure if this condition is fulfilled. The ftrace handler
> is called as the very first instruction of the function. Doesn't
> it break the stack validity? Could we sleep inside the ftrace
> handler? Will the patched function be detected on the stack?
> 
> Or is my brain already too far in the fantasy world?

I think this isn't a possibility.

In today's code base, this can't happen because task patch states are
only switched when sleeping or when exiting the kernel.  The ftrace
handler doesn't sleep directly.

If it were preempted, it couldn't be switched there either because we
consider preempted stacks to be unreliable.

In theory, a DWARF stack trace of a preempted task *could* be reliable.
But then the DWARF unwinder should be smart enough to see that the
original function called the ftrace handler.  Right?  So the stack would
be reliable, but then livepatch would see the original function on the
stack and wouldn't switch the task.

Does that make sense?
Josh Poimboeuf May 4, 2016, 5:57 p.m. UTC | #11
On Wed, May 04, 2016 at 04:48:54PM +0200, Petr Mladek wrote:
> On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > Change livepatch to use a basic per-task consistency model.  This is the
> > foundation which will eventually enable us to patch those ~10% of
> > security patches which change function or data semantics.  This is the
> > biggest remaining piece needed to make livepatch more generally useful.
> > 
> > diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> > new file mode 100644
> > index 0000000..92819bb
> > --- /dev/null
> > +++ b/kernel/livepatch/transition.c
> > +/*
> > + * klp_patch_task() - change the patched state of a task
> > + * @task:	The task to change
> > + *
> > + * Switches the patched state of the task to the set of functions in the target
> > + * patch state.
> > + */
> > +void klp_patch_task(struct task_struct *task)
> > +{
> > +	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
> > +
> > +	/*
> > +	 * The corresponding write barriers are in klp_init_transition() and
> > +	 * klp_reverse_transition().  See the comments there for an explanation.
> > +	 */
> > +	smp_rmb();
> > +
> > +	task->patch_state = klp_target_state;
> > +}
> > +
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index bd12c6c..60d633f 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -9,6 +9,7 @@
> >  #include <linux/mm.h>
> >  #include <linux/stackprotector.h>
> >  #include <linux/suspend.h>
> > +#include <linux/livepatch.h>
> >  
> >  #include <asm/tlb.h>
> >  
> > @@ -266,6 +267,9 @@ static void cpu_idle_loop(void)
> >  
> >  		sched_ttwu_pending();
> >  		schedule_preempt_disabled();
> > +
> > +		if (unlikely(klp_patch_pending(current)))
> > +			klp_patch_task(current);
> >  	}
> 
> Some more ideas from the world of crazy races. I was shaking my head
> over whether this was safe or not.
> 
> The problem might be if the task gets rescheduled between the check
> for the pending flag and the klp_patch_task() call, or inside the
> klp_patch_task() function itself. This will get even more important
> when we use this construct in more locations, e.g. in some kthreads.
> 
> If the task is sleeping in these strange locations, it might assign
> strange values at strange times.
> 
> I think that it is safe only because it is called with the
> 'current' parameter and in safe locations. It means that
> the result is always safe and consistent. Also we could assign
> an outdated value only when sleeping between reading klp_target_state
> and storing task->patch_state. But if anyone modified
> klp_target_state at this point, they also set TIF_PATCH_PENDING,
> so the change will not get lost.
> 
> I think that we should document that klp_patch_task() must be
> called only from a safe location from within the affected task.
> 
> I even suggest avoiding misuse by removing the struct task_struct *
> parameter. It should always be called with current.

Would the race involve two tasks trying to call klp_patch_task() for the
same task at the same time?  If so I don't think that would be a problem
since they would both write the same value for task->patch_state.

(Sorry if I'm being slow, I think I've managed to reach my quota of hard
thinking for the day and I don't exactly follow what the race would be.)
Miroslav Benes May 5, 2016, 9:41 a.m. UTC | #12
On Wed, 4 May 2016, Josh Poimboeuf wrote:

> On Wed, May 04, 2016 at 10:42:23AM +0200, Petr Mladek wrote:
> > On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > > Change livepatch to use a basic per-task consistency model.  This is the
> > > foundation which will eventually enable us to patch those ~10% of
> > > security patches which change function or data semantics.  This is the
> > > biggest remaining piece needed to make livepatch more generally useful.
> > 
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -76,6 +76,7 @@
> > >  #include <linux/compiler.h>
> > >  #include <linux/sysctl.h>
> > >  #include <linux/kcov.h>
> > > +#include <linux/livepatch.h>
> > >  
> > >  #include <asm/pgtable.h>
> > >  #include <asm/pgalloc.h>
> > > @@ -1586,6 +1587,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> > >  		p->parent_exec_id = current->self_exec_id;
> > >  	}
> > >  
> > > +	klp_copy_process(p);
> > 
> > I have doubts here. We copy the state from the parent here. It means
> > that the new process might still need to be converted. But at the same
> > time print_context_stack_reliable() returns zero without printing
> > any stack trace when the TIF_FORK flag is set. It means that a freshly
> > forked task might get converted immediately. It seems that both
> > operations are always done when copy_process() is called. But
> > they are contradicting each other.
> > 
> > I guess that print_context_stack_reliable() should either return
> > -EINVAL when TIF_FORK is set, or it should try to print the
> > stack of the newly forked task.
> > 
> > Or do I miss something, please?
> 
> Ok, I admit it's confusing.
> 
> A newly forked task doesn't *have* a stack (other than the pt_regs frame
> it needs for the return to user space), which is why
> print_context_stack_reliable() returns success with an empty array of
> addresses.
> 
> For a little background, see the second switch_to() macro in
> arch/x86/include/asm/switch_to.h.  When a newly forked task runs for the
> first time, it returns from __switch_to() with no stack.  It then jumps
> straight to ret_from_fork in entry_64.S, calls a few C functions, and
> eventually returns to user space.  So, assuming we aren't patching entry
> code or the switch_to() macro in __schedule(), it should be safe to
> patch the task before it does all that.
> 
> With the current code, if an unpatched task gets forked, the child will
> also be unpatched.  In theory, we could go ahead and patch the child
> then.  In fact, that's what I did in v1.9.
> 
> But in v1.9 discussions it was pointed out that someday maybe the
> ret_from_fork stuff will get cleaned up and instead the child stack will
> be copied from the parent.  In that case the child should inherit its
> parent's patched state.  So we decided to make it more future-proof by
> having the child inherit the parent's patched state.
> 
> So, having said all that, I'm really not sure what the best approach is
> for print_context_stack_reliable().  Right now I'm thinking I'll change
> it back to return -EINVAL for a newly forked task, so it'll be more
> future-proof: better to have a false positive than a false negative.
> Either way it will probably need to be changed again if the
> ret_from_fork code gets cleaned up.

I'd be for returning -EINVAL. It is a safe play for now.
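As a side note, the parent-to-child inheritance described above can be modeled minimally. The struct and the KLP_* values are simplified stand-ins, not the kernel definitions; the point is only that the child starts in whatever state the parent is in.

```c
#include <assert.h>
#include <stdbool.h>

enum klp_state { KLP_UNPATCHED, KLP_PATCHED };

/* Simplified stand-in for the relevant task_struct fields. */
struct task {
	enum klp_state patch_state;
	bool patch_pending;          /* stands in for TIF_PATCH_PENDING */
};

/* Child inherits the parent's patch state, mirroring klp_copy_process(). */
static void klp_copy_process(struct task *child, const struct task *parent)
{
	child->patch_state = parent->patch_state;
	child->patch_pending = parent->patch_pending;
}
```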

[...]
 
> > Finally, the following is always called right after
> > klp_start_transition(), so I would call it from there.
> > 
> > 	if (!klp_try_complete_transition())
> > 		klp_schedule_work();

On the other hand it is quite nice to see the sequence

init
start
try_complete

there. Just my 2 cents.
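The start/try_complete workflow can be sketched as a toy userspace model. All names here are simplified stand-ins mirroring the patch under review, and the per-task migration that the kernel does at safe points (syscall exit, idle loop) is collapsed into one helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { KLP_UNPATCHED, KLP_PATCHED };

struct task { int patch_state; bool pending; };

static int klp_target_state;

/* Mark every task as needing migration (~ set TIF_PATCH_PENDING). */
static void klp_start_transition(struct task *tasks, size_t n, int target)
{
	klp_target_state = target;
	for (size_t i = 0; i < n; i++)
		tasks[i].pending = true;
}

/* Called when a task reaches a safe point (syscall exit, idle loop, ...). */
static void klp_update_task(struct task *t)
{
	t->pending = false;
	t->patch_state = klp_target_state;
}

/* True once every task has converged; otherwise the work is rescheduled. */
static bool klp_try_complete_transition(const struct task *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (tasks[i].pending)
			return false;
	return true;
}
```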

> Except for when it's called by klp_reverse_transition().  And it really
> depends on whether we want to allow transition.c to use the mutex.  I
> don't have a strong opinion either way, I may need to think about it
> some more.

Miroslav
Petr Mladek May 5, 2016, 10:21 a.m. UTC | #13
On Wed 2016-05-04 12:02:36, Josh Poimboeuf wrote:
> On Wed, May 04, 2016 at 02:39:40PM +0200, Petr Mladek wrote:
> > On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > > Change livepatch to use a basic per-task consistency model.  This is the
> > > foundation which will eventually enable us to patch those ~10% of
> > > security patches which change function or data semantics.  This is the
> > > biggest remaining piece needed to make livepatch more generally useful.
> > 
> > I spent a lot of time checking the memory barriers. It seems that
> > they are basically correct.  Let me use my own words to show how
> > I understand it. I hope that it will help others with the review.
> 
> [...snip a ton of useful comments...]
> 
> Thanks, this will help a lot!  I'll try to incorporate your barrier
> comments into the code.

Thanks a lot.

> I also agree that klp_patch_task() is poorly named.  I was trying to
> make it clear to external callers that "hey, the task is getting patched
> now!", but it's internally inconsistent with livepatch code because we
> make a distinction between patching and unpatching.
> 
> Maybe I'll do:
> 
>   klp_update_task_patch_state()

I like it. It is long, but it describes the purpose well.

Livepatch is using many state variables:

  + global:                klp_transition_patch, klp_target_state
  + per task specific:     TIF_PATCH_PENDING, patch_state
  + per each new function: transition, patched
  + per old function:      func_stack
  + per object:            patched, loaded
  + per patch:             enabled

The dependencies between them and the workflow are important for
creating a mental picture of livepatching. Good names
help with it.

Best Regards,
Petr
Petr Mladek May 5, 2016, 11:21 a.m. UTC | #14
On Wed 2016-05-04 12:25:17, Josh Poimboeuf wrote:
> On Wed, May 04, 2016 at 04:12:05PM +0200, Petr Mladek wrote:
> > On Wed 2016-05-04 14:39:40, Petr Mladek wrote:
> > > 		 *
> > > 		 * Note that the task must never be migrated to the target
> > > 		 * state when being inside this ftrace handler.
> > > 		 */
> > > 
> > > We might want to move the second paragraph on top of the function.
> > > It is a basic and important fact. It actually explains why the first
> > > read barrier is not needed when the patch is being disabled.
> > 
> > I wrote the statement partly intuitively. I think that it really is
> > somehow important. And I slightly doubt whether we are on the safe side.
> > 
> > First, why is it important that the task->patch_state is not switched
> > when being inside the ftrace handler?
> > 
> > If we are inside the handler, we are kind-of inside the called
> > function. And the basic idea of this consistency model is that
> > we must not switch a task when it is inside a patched function.
> > This is normally decided by the stack.
> > 
> > The handler is a bit special because it is called right before the
> > function. If it was the only patched function on the stack, it would
> > not matter if we choose the new or old code. Both decisions would
> > be safe for the moment.
> > 
> > The fun starts when the function calls another patched function.
> > The other patched function must be called consistently with
> > the first one. If the first function was from the patch,
> > the other must be from the patch as well and vice versa.
> > 
> > This is why we must not switch task->patch_state dangerously
> > when being inside the ftrace handler.
> > 
> > Now I am not sure if this condition is fulfilled. The ftrace handler
> > is called as the very first instruction of the function. Doesn't
> > it break the stack validity? Could we sleep inside the ftrace
> > handler? Will the patched function be detected on the stack?
> > 
> > Or is my brain already too far in the fantasy world?
> 
> I think this isn't a possibility.
> 
> In today's code base, this can't happen because task patch states are
> only switched when sleeping or when exiting the kernel.  The ftrace
> handler doesn't sleep directly.
> 
> If it were preempted, it couldn't be switched there either because we
> consider preempted stacks to be unreliable.

This was the missing piece.

> In theory, a DWARF stack trace of a preempted task *could* be reliable.
> But then the DWARF unwinder should be smart enough to see that the
> original function called the ftrace handler.  Right?  So the stack would
> be reliable, but then livepatch would see the original function on the
> stack and wouldn't switch the task.
> 
> Does that make sense?

Yup. I think that we are on the safe side. Thanks for the explanation.

Best Regards,
Petr
Petr Mladek May 5, 2016, 11:57 a.m. UTC | #15
On Wed 2016-05-04 12:57:00, Josh Poimboeuf wrote:
> On Wed, May 04, 2016 at 04:48:54PM +0200, Petr Mladek wrote:
> > On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > > Change livepatch to use a basic per-task consistency model.  This is the
> > > foundation which will eventually enable us to patch those ~10% of
> > > security patches which change function or data semantics.  This is the
> > > biggest remaining piece needed to make livepatch more generally useful.
> > > 
> > > diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> > > new file mode 100644
> > > index 0000000..92819bb
> > > --- /dev/null
> > > +++ b/kernel/livepatch/transition.c
> > > +/*
> > > + * klp_patch_task() - change the patched state of a task
> > > + * @task:	The task to change
> > > + *
> > > + * Switches the patched state of the task to the set of functions in the target
> > > + * patch state.
> > > + */
> > > +void klp_patch_task(struct task_struct *task)
> > > +{
> > > +	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
> > > +
> > > +	/*
> > > +	 * The corresponding write barriers are in klp_init_transition() and
> > > +	 * klp_reverse_transition().  See the comments there for an explanation.
> > > +	 */
> > > +	smp_rmb();
> > > +
> > > +	task->patch_state = klp_target_state;
> > > +}
> > > +
> > > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > > index bd12c6c..60d633f 100644
> > > --- a/kernel/sched/idle.c
> > > +++ b/kernel/sched/idle.c
> > > @@ -9,6 +9,7 @@
> > >  #include <linux/mm.h>
> > >  #include <linux/stackprotector.h>
> > >  #include <linux/suspend.h>
> > > +#include <linux/livepatch.h>
> > >  
> > >  #include <asm/tlb.h>
> > >  
> > > @@ -266,6 +267,9 @@ static void cpu_idle_loop(void)
> > >  
> > >  		sched_ttwu_pending();
> > >  		schedule_preempt_disabled();
> > > +
> > > +		if (unlikely(klp_patch_pending(current)))
> > > +			klp_patch_task(current);
> > >  	}
> > 
> > Some more ideas from the world of crazy races. I was shaking my head
> > over whether this was safe or not.
> > 
> > The problem might be if the task gets rescheduled between the check
> > for the pending flag and the klp_patch_task() call, or inside the
> > klp_patch_task() function itself. This will get even more important
> > when we use this construct in more locations, e.g. in some kthreads.
> > 
> > If the task is sleeping in these strange locations, it might assign
> > strange values at strange times.
> > 
> > I think that it is safe only because it is called with the
> > 'current' parameter and in safe locations. It means that
> > the result is always safe and consistent. Also we could assign
> > an outdated value only when sleeping between reading klp_target_state
> > and storing task->patch_state. But if anyone modified
> > klp_target_state at this point, they also set TIF_PATCH_PENDING,
> > so the change will not get lost.
> > 
> > I think that we should document that klp_patch_task() must be
> > called only from a safe location from within the affected task.
> > 
> > I even suggest avoiding misuse by removing the struct task_struct *
> > parameter. It should always be called with current.
> 
> Would the race involve two tasks trying to call klp_patch_task() for the
> same task at the same time?  If so I don't think that would be a problem
> since they would both write the same value for task->patch_state.

I had missed that the two commands are called with preemption
disabled. So, I had the following crazy scenario in mind:


CPU0				CPU1

klp_enable_patch()

  klp_target_state = KLP_PATCHED;

  for_each_task()
     set TIF_PATCH_PENDING

				# task 123

				if (klp_patch_pending(current)
				  klp_patch_task(current)

                                    clear TIF_PATCH_PENDING

				    smp_rmb();

				    # switch to assembly of
				    # klp_patch_task()

				    mov klp_target_state, %r12

				    # interrupt and schedule
				    # another task


  klp_reverse_transition();

    klp_target_state = KLP_UNPATCHED;

    klp_try_complete_transition()

      task = 123;
      if (task->patch_state == klp_target_state)
         return 0;

    => task 123 is in target state and does
    not block conversion

  klp_complete_transition()


  # disable previous patch on the stack
  klp_disable_patch();

    klp_target_state = KLP_UNPATCHED;
  
  
				    # task 123 gets scheduled again
				    lea %r12, task->patch_state

				    => it happily stores an outdated
				    state


This is why the two functions should get called with preemption
disabled. We should at least document it. I imagine that we will
later use them in other contexts as well, and nobody will remember
this crazy scenario.

Well, even disabled preemption does not help. The process on
CPU1 might also be interrupted by an NMI that does some long
printk in it.

IMHO, the only safe approach is to call klp_patch_task()
only for "current" in a safe place. Then this race is harmless.
The switch happens in a safe place, so it does not matter
into which state the process is switched.

In other words, the task state may be updated only

   + by the task itself in a safe place
   + by another task when the updated one is sleeping in a safe place

This should be well documented and the API should help to avoid
misuse.
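The "no task argument" API argued for here can be sketched in a tiny userspace model. All names are hypothetical (klp_update_patch_state() is the rename floated earlier in this thread, not the merged API), and a pointer variable stands in for "current": because the helper takes no argument, it can only ever act on the running task, which by construction is at a safe place when it calls the helper.

```c
#include <assert.h>
#include <stdbool.h>

enum { KLP_UNPATCHED, KLP_PATCHED };

struct task { int patch_state; bool pending; };

static int klp_target_state = KLP_PATCHED;
static struct task self = { KLP_UNPATCHED, true };
static struct task *current_task = &self;    /* stand-in for "current" */

/* No task parameter: the helper can only migrate the running task. */
static void klp_update_patch_state(void)
{
	current_task->pending = false;       /* ~ clear TIF_PATCH_PENDING */
	current_task->patch_state = klp_target_state;
}
```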

Best Regards,
Petr
Petr Mladek May 5, 2016, 1:06 p.m. UTC | #16
On Wed 2016-05-04 10:51:21, Josh Poimboeuf wrote:
> On Wed, May 04, 2016 at 10:42:23AM +0200, Petr Mladek wrote:
> > On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > > Change livepatch to use a basic per-task consistency model.  This is the
> > > foundation which will eventually enable us to patch those ~10% of
> > > security patches which change function or data semantics.  This is the
> > > biggest remaining piece needed to make livepatch more generally useful.
> > 
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -76,6 +76,7 @@
> > >  #include <linux/compiler.h>
> > >  #include <linux/sysctl.h>
> > >  #include <linux/kcov.h>
> > > +#include <linux/livepatch.h>
> > >  
> > >  #include <asm/pgtable.h>
> > >  #include <asm/pgalloc.h>
> > > @@ -1586,6 +1587,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> > >  		p->parent_exec_id = current->self_exec_id;
> > >  	}
> > >  
> > > +	klp_copy_process(p);
> > 
> > I have doubts here. We copy the state from the parent here. It means
> > that the new process might still need to be converted. But at the same
> > time print_context_stack_reliable() returns zero without printing
> > any stack trace when the TIF_FORK flag is set. It means that a freshly
> > forked task might get converted immediately. It seems that both
> > operations are always done when copy_process() is called. But
> > they are contradicting each other.
> > 
> > I guess that print_context_stack_reliable() should either return
> > -EINVAL when TIF_FORK is set, or it should try to print the
> > stack of the newly forked task.
> > 
> > Or do I miss something, please?
> 
> Ok, I admit it's confusing.
> 
> A newly forked task doesn't *have* a stack (other than the pt_regs frame
> it needs for the return to user space), which is why
> print_context_stack_reliable() returns success with an empty array of
> addresses.
> 
> For a little background, see the second switch_to() macro in
> arch/x86/include/asm/switch_to.h.  When a newly forked task runs for the
> first time, it returns from __switch_to() with no stack.  It then jumps
> straight to ret_from_fork in entry_64.S, calls a few C functions, and
> eventually returns to user space.  So, assuming we aren't patching entry
> code or the switch_to() macro in __schedule(), it should be safe to
> patch the task before it does all that.

This is a great explanation. Thanks for it.

> So, having said all that, I'm really not sure what the best approach is
> for print_context_stack_reliable().  Right now I'm thinking I'll change
> it back to return -EINVAL for a newly forked task, so it'll be more
> future-proof: better to have a false positive than a false negative.
> Either way it will probably need to be changed again if the
> ret_from_fork code gets cleaned up.

I would prefer the -EINVAL. It might save some hair when anyone
is working on patching the switch_to stuff. Also it is not that
big a loss because most tasks will get migrated on the return to
userspace.

It might help a bit with newly forked kthreads. But there should
be a safer location where the new kthreads can get migrated,
e.g. right before the main function gets called.


> > > diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> > > new file mode 100644
> > > index 0000000..92819bb
> > > --- /dev/null
> > > +++ b/kernel/livepatch/transition.c
> > > +/*
> > > + * This function can be called in the middle of an existing transition to
> > > + * reverse the direction of the target patch state.  This can be done to
> > > + * effectively cancel an existing enable or disable operation if there are any
> > > + * tasks which are stuck in the initial patch state.
> > > + */
> > > +void klp_reverse_transition(void)
> > > +{
> > > +	struct klp_patch *patch = klp_transition_patch;
> > > +
> > > +	klp_target_state = !klp_target_state;
> > > +
> > > +	/*
> > > +	 * Ensure that if another CPU goes through the syscall barrier, sees
> > > +	 * the TIF_PATCH_PENDING writes in klp_start_transition(), and calls
> > > +	 * klp_patch_task(), it also sees the above write to the target state.
> > > +	 * Otherwise it can put the task in the wrong universe.
> > > +	 */
> > > +	smp_wmb();
> > > +
> > > +	klp_start_transition();
> > > +	klp_try_complete_transition();
> > 
> > It is a bit strange that we keep the work scheduled. It might be
> > better to use
> > 
> >        mod_delayed_work(system_wq, &klp_work, 0);
> 
> True, I think that would be better.
> 
> > Which triggers more ideas from the nitpicking department:
> > 
> > I would move the work definition from core.c to transition.c because
> > it is closely related to klp_try_complete_transition();
> 
> That could be good, but there's a slight problem: klp_work_fn() requires
> klp_mutex, which is static to core.c.  It's kind of nice to keep the use
> of the mutex in core.c only.

I see and am surprised that we take the lock only in core.c ;-)

I do not have a strong opinion then. Just a small one. The lock also
guards operations from the other .c files. I think that it is only a matter
of time before we will need to access it there. But the work is clearly
transition-related. Anyway, this is real nitpicking; I am sorry for it.

> > When on it. I would make it more clear that the work is related
> > to transition.
> 
> How would you recommend doing that?  How about:
> 
> - rename "klp_work" -> "klp_transition_work"
> - rename "klp_work_fn" -> "klp_transition_work_fn" 

Yup, sounds better.

> > Also I would call queue_delayed_work() directly
> > instead of adding the klp_schedule_work() wrapper. The delay
> > might be defined using a constant, e.g.
> > 
> > #define KLP_TRANSITION_DELAY round_jiffies_relative(HZ)
> > 
> > queue_delayed_work(system_wq, &klp_transition_work, KLP_TRANSITION_DELAY);
> 
> Sure.
> 
> > Finally, the following is always called right after
> > klp_start_transition(), so I would call it from there.
> > 
> > 	if (!klp_try_complete_transition())
> > 		klp_schedule_work();
> 
> Except for when it's called by klp_reverse_transition().  And it really
> depends on whether we want to allow transition.c to use the mutex.  I
> don't have a strong opinion either way, I may need to think about it
> some more.

Ah, I had in mind that it could be replaced by

	mod_delayed_work(system_wq, &klp_transition_work, 0);

so that we would never call klp_try_complete_transition()
directly. Then it could be the same in all situations. But
it might look strange and be inefficient when actually starting
the transition. So, maybe forget about it.

Best Regards,
Petr
Petr Mladek May 6, 2016, 11:33 a.m. UTC | #17
On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
> index 782fbb5..b3b8639 100644
> --- a/kernel/livepatch/patch.c
> +++ b/kernel/livepatch/patch.c
> @@ -29,6 +29,7 @@
>  #include <linux/bug.h>
>  #include <linux/printk.h>
>  #include "patch.h"
> +#include "transition.h"
>  
>  static LIST_HEAD(klp_ops);
>  
> @@ -58,11 +59,42 @@ static void notrace klp_ftrace_handler(unsigned long ip,
>  	ops = container_of(fops, struct klp_ops, fops);
>  
>  	rcu_read_lock();
> +
>  	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
>  				      stack_node);
> -	if (WARN_ON_ONCE(!func))
> +
> +	if (!func)
>  		goto unlock;
>  
> +	/*
> +	 * See the comment for the 2nd smp_wmb() in klp_init_transition() for
> +	 * an explanation of why this read barrier is needed.
> +	 */
> +	smp_rmb();
> +
> +	if (unlikely(func->transition)) {
> +
> +		/*
> +		 * See the comment for the 1st smp_wmb() in
> +		 * klp_init_transition() for an explanation of why this read
> +		 * barrier is needed.
> +		 */
> +		smp_rmb();

I would add here:

		WARN_ON_ONCE(current->patch_state == KLP_UNDEFINED);

We do not know in which context this is called, so the printk's are
not ideal. But it will get triggered only if there is a bug in
the livepatch implementation. It should happen at random locations
and rather early after a bug is introduced.

Anyway, better to die and catch the bug than to let the system run
in an undefined state and produce cryptic errors later on.


> +		if (current->patch_state == KLP_UNPATCHED) {
> +			/*
> +			 * Use the previously patched version of the function.
> +			 * If no previous patches exist, use the original
> +			 * function.
> +			 */
> +			func = list_entry_rcu(func->stack_node.next,
> +					      struct klp_func, stack_node);
> +
> +			if (&func->stack_node == &ops->func_stack)
> +				goto unlock;
> +		}
> +	}

I am staring into the code for too long now. I need to step back for a
while. I'll do another look when you send the next version. Anyway,
you did a great work. I speak mainly for the livepatch part and
I like it.

Best Regards,
Petr
Josh Poimboeuf May 6, 2016, 12:38 p.m. UTC | #18
On Thu, May 05, 2016 at 01:57:01PM +0200, Petr Mladek wrote:
> I have missed that the two commands are called with preemption
> disabled. So, I had the following crazy scenario in mind:
> 
> 
> CPU0				CPU1
> 
> klp_enable_patch()
> 
>   klp_target_state = KLP_PATCHED;
> 
>   for_each_task()
>      set TIF_PENDING_PATCH
> 
> 				# task 123
> 
> 				if (klp_patch_pending(current)
> 				  klp_patch_task(current)
> 
>                                     clear TIF_PENDING_PATCH
> 
> 				    smp_rmb();
> 
> 				    # switch to assembly of
> 				    # klp_patch_task()
> 
> 				    mov klp_target_state, %r12
> 
> 				    # interrupt and schedule
> 				    # another task
> 
> 
>   klp_reverse_transition();
> 
>     klp_target_state = KLP_UNPATCHED;
> 
>     klp_try_complete_transition()
> 
>       task = 123;
>       if (task->patch_state == klp_target_state;
>          return 0;
> 
>     => task 123 is in target state and does
>     not block conversion
> 
>   klp_complete_transition()
> 
> 
>   # disable previous patch on the stack
>   klp_disable_patch();
> 
>     klp_target_state = KLP_UNPATCHED;
>   
>   
> 				    # task 123 gets scheduled again
> 				    lea %r12, task->patch_state
> 
> 				    => it happily stores an outdated
> 				    state
> 

Thanks for the clear explanation, this helps a lot.

> This is why the two functions should get called with preemption
> disabled. We should document it at least. I imagine that we will
> use them later also in another context and nobody will remember
> this crazy scenario.
> 
> Well, even disabled preemption does not help. The process on
> CPU1 might be also interrupted by an NMI and do some long
> printk in it.
> 
> IMHO, the only safe approach is to call klp_patch_task()
> only for "current" in a safe place. Then this race is harmless.
> The switch happens in a safe place, so it does not matter
> into which state the process is switched.

I'm not sure about this solution.  When klp_complete_transition() is
called, we need all tasks to be patched, for good.  We don't want any of
them to randomly switch to the wrong state at some later time in the
middle of a future patch operation.  How would changing klp_patch_task()
to only use "current" prevent that?

> In other words, the task state might be updated only
> 
>    + by the task itself in a safe place
>    + by another task when the updated one is sleeping in a safe place
> 
> This should be well documented and the API should help to avoid
> misuse.

I think we could fix it to be safe for future callers who might not have
preemption disabled with a couple of changes to klp_patch_task():
disabling preemption and testing/clearing the TIF_PATCH_PENDING flag
before changing the patch state:

  void klp_patch_task(struct task_struct *task)
  {
  	preempt_disable();
  
  	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
  		task->patch_state = READ_ONCE(klp_target_state);
  
  	preempt_enable();
  }

We would also need a synchronize_sched() after the patching is complete,
either at the end of klp_try_complete_transition() or in
klp_complete_transition().  That would make sure that all existing calls
to klp_patch_task() are done.
Josh Poimboeuf May 6, 2016, 12:44 p.m. UTC | #19
On Fri, May 06, 2016 at 01:33:01PM +0200, Petr Mladek wrote:
> On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
> > index 782fbb5..b3b8639 100644
> > --- a/kernel/livepatch/patch.c
> > +++ b/kernel/livepatch/patch.c
> > @@ -29,6 +29,7 @@
> >  #include <linux/bug.h>
> >  #include <linux/printk.h>
> >  #include "patch.h"
> > +#include "transition.h"
> >  
> >  static LIST_HEAD(klp_ops);
> >  
> > @@ -58,11 +59,42 @@ static void notrace klp_ftrace_handler(unsigned long ip,
> >  	ops = container_of(fops, struct klp_ops, fops);
> >  
> >  	rcu_read_lock();
> > +
> >  	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
> >  				      stack_node);
> > -	if (WARN_ON_ONCE(!func))
> > +
> > +	if (!func)
> >  		goto unlock;
> >  
> > +	/*
> > +	 * See the comment for the 2nd smp_wmb() in klp_init_transition() for
> > +	 * an explanation of why this read barrier is needed.
> > +	 */
> > +	smp_rmb();
> > +
> > +	if (unlikely(func->transition)) {
> > +
> > +		/*
> > +		 * See the comment for the 1st smp_wmb() in
> > +		 * klp_init_transition() for an explanation of why this read
> > +		 * barrier is needed.
> > +		 */
> > +		smp_rmb();
> 
> I would add here:
> 
> 		WARN_ON_ONCE(current->patch_state == KLP_UNDEFINED);
> 
> We do not know in which context this is called, so the printk()s are
> not ideal. But it will get triggered only if there is a bug in
> the livepatch implementation. It should trip at random locations
> and rather early after a bug is introduced.
> 
> Anyway, it is better to die and catch the bug than to let the system
> run in an undefined state and produce cryptic errors later on.

Ok, makes sense.

> > +		if (current->patch_state == KLP_UNPATCHED) {
> > +			/*
> > +			 * Use the previously patched version of the function.
> > +			 * If no previous patches exist, use the original
> > +			 * function.
> > +			 */
> > +			func = list_entry_rcu(func->stack_node.next,
> > +					      struct klp_func, stack_node);
> > +
> > +			if (&func->stack_node == &ops->func_stack)
> > +				goto unlock;
> > +		}
> > +	}
> 
> I am staring into the code for too long now. I need to step back for a
> while. I'll do another look when you send the next version. Anyway,
> you did a great work. I speak mainly for the livepatch part and
> I like it.

Thanks for the helpful reviews!  I'll be on vacation again next week so
I get a break too :-)
Miroslav Benes May 9, 2016, 9:41 a.m. UTC | #20
[...]

> +static int klp_target_state;

[...]

> +void klp_init_transition(struct klp_patch *patch, int state)
> +{
> +	struct task_struct *g, *task;
> +	unsigned int cpu;
> +	struct klp_object *obj;
> +	struct klp_func *func;
> +	int initial_state = !state;
> +
> +	klp_transition_patch = patch;
> +
> +	/*
> +	 * If the patch can be applied or reverted immediately, skip the
> +	 * per-task transitions.
> +	 */
> +	if (patch->immediate)
> +		return;
> +
> +	/*
> +	 * Initialize all tasks to the initial patch state to prepare them for
> +	 * switching to the target state.
> +	 */
> +	read_lock(&tasklist_lock);
> +	for_each_process_thread(g, task)
> +		task->patch_state = initial_state;
> +	read_unlock(&tasklist_lock);
> +
> +	/*
> +	 * Ditto for the idle "swapper" tasks.
> +	 */
> +	get_online_cpus();
> +	for_each_online_cpu(cpu)
> +		idle_task(cpu)->patch_state = initial_state;
> +	put_online_cpus();
> +
> +	/*
> +	 * Ensure klp_ftrace_handler() sees the task->patch_state updates
> +	 * before the func->transition updates.  Otherwise it could read an
> +	 * out-of-date task state and pick the wrong function.
> +	 */
> +	smp_wmb();
> +
> +	/*
> +	 * Set the func transition states so klp_ftrace_handler() will know to
> +	 * switch to the transition logic.
> +	 *
> +	 * When patching, the funcs aren't yet in the func_stack and will be
> +	 * made visible to the ftrace handler shortly by the calls to
> +	 * klp_patch_object().
> +	 *
> +	 * When unpatching, the funcs are already in the func_stack and so are
> +	 * already visible to the ftrace handler.
> +	 */
> +	klp_for_each_object(patch, obj)
> +		klp_for_each_func(obj, func)
> +			func->transition = true;
> +
> +	/*
> +	 * Set the global target patch state which tasks will switch to.  This
> +	 * has no effect until the TIF_PATCH_PENDING flags get set later.
> +	 */
> +	klp_target_state = state;

I am afraid there is a problem for (patch->immediate == true) patches. 
klp_target_state is not set for those and the comment is not entirely 
true, because klp_target_state has an effect in several places.

[...]

> +void klp_start_transition(void)
> +{
> +	struct task_struct *g, *task;
> +	unsigned int cpu;
> +
> +	pr_notice("'%s': %s...\n", klp_transition_patch->mod->name,
> +		  klp_target_state == KLP_PATCHED ? "patching" : "unpatching");

Here...

> +
> +	/*
> +	 * If the patch can be applied or reverted immediately, skip the
> +	 * per-task transitions.
> +	 */
> +	if (klp_transition_patch->immediate)
> +		return;
> +

[...]

> +bool klp_try_complete_transition(void)
> +{
> +	unsigned int cpu;
> +	struct task_struct *g, *task;
> +	bool complete = true;
> +
> +	/*
> +	 * If the patch can be applied or reverted immediately, skip the
> +	 * per-task transitions.
> +	 */
> +	if (klp_transition_patch->immediate)
> +		goto success;
> +
> +	/*
> +	 * Try to switch the tasks to the target patch state by walking their
> +	 * stacks and looking for any to-be-patched or to-be-unpatched
> +	 * functions.  If such functions are found on a stack, or if the stack
> +	 * is deemed unreliable, the task can't be switched yet.
> +	 *
> +	 * Usually this will transition most (or all) of the tasks on a system
> +	 * unless the patch includes changes to a very common function.
> +	 */
> +	read_lock(&tasklist_lock);
> +	for_each_process_thread(g, task)
> +		if (!klp_try_switch_task(task))
> +			complete = false;
> +	read_unlock(&tasklist_lock);
> +
> +	/*
> +	 * Ditto for the idle "swapper" tasks.
> +	 */
> +	get_online_cpus();
> +	for_each_online_cpu(cpu)
> +		if (!klp_try_switch_task(idle_task(cpu)))
> +			complete = false;
> +	put_online_cpus();
> +
> +	/*
> +	 * Some tasks weren't able to be switched over.  Try again later and/or
> +	 * wait for other methods like syscall barrier switching.
> +	 */
> +	if (!complete)
> +		return false;
> +
> +success:
> +	/*
> +	 * When unpatching, all tasks have transitioned to KLP_UNPATCHED so we
> +	 * can now remove the new functions from the func_stack.
> +	 */
> +	if (klp_target_state == KLP_UNPATCHED) {

Here (this is the most important one I think).

> +		klp_unpatch_objects(klp_transition_patch);
> +
> +		/*
> +		 * Don't allow any existing instances of ftrace handlers to
> +		 * access any obsolete funcs before we reset the func
> +		 * transition states to false.  Otherwise the handler may see
> +		 * the deleted "new" func, see that it's not in transition, and
> +		 * wrongly pick the new version of the function.
> +		 */
> +		synchronize_rcu();
> +	}
> +
> +	pr_notice("'%s': %s complete\n", klp_transition_patch->mod->name,
> +		  klp_target_state == KLP_PATCHED ? "patching" : "unpatching");

Here

> +
> +	/* we're done, now cleanup the data structures */
> +	klp_complete_transition();
> +
> +	return true;
> +}
> +
> +/*
> + * This function can be called in the middle of an existing transition to
> + * reverse the direction of the target patch state.  This can be done to
> + * effectively cancel an existing enable or disable operation if there are any
> + * tasks which are stuck in the initial patch state.
> + */
> +void klp_reverse_transition(void)
> +{
> +	struct klp_patch *patch = klp_transition_patch;
> +
> +	klp_target_state = !klp_target_state;

And probably here.

All other references look safe.

I guess we need to set klp_target_state even for immediate patches. Should 
we also initialize it with KLP_UNDEFINED and set it to KLP_UNDEFINED in 
klp_complete_transition()?

Miroslav
Petr Mladek May 9, 2016, 12:23 p.m. UTC | #21
On Fri 2016-05-06 07:38:55, Josh Poimboeuf wrote:
> On Thu, May 05, 2016 at 01:57:01PM +0200, Petr Mladek wrote:
> > I have missed that the two commands are called with preemption
> > disabled. So, I had the following crazy scenario in mind:
> > 
> > 
> > CPU0				CPU1
> > 
> > klp_enable_patch()
> > 
> >   klp_target_state = KLP_PATCHED;
> > 
> >   for_each_task()
> >      set TIF_PENDING_PATCH
> > 
> > 				# task 123
> > 
> > 				if (klp_patch_pending(current)
> > 				  klp_patch_task(current)
> > 
> >                                     clear TIF_PENDING_PATCH
> > 
> > 				    smp_rmb();
> > 
> > 				    # switch to assembly of
> > 				    # klp_patch_task()
> > 
> > 				    mov klp_target_state, %r12
> > 
> > 				    # interrupt and schedule
> > 				    # another task
> > 
> > 
> >   klp_reverse_transition();
> > 
> >     klp_target_state = KLP_UNPATCHED;
> > 
> >     klp_try_complete_transition()
> > 
> >       task = 123;
> >       if (task->patch_state == klp_target_state;
> >          return 0;
> > 
> >     => task 123 is in target state and does
> >     not block conversion
> > 
> >   klp_complete_transition()
> > 
> > 
> >   # disable previous patch on the stack
> >   klp_disable_patch();
> > 
> >     klp_target_state = KLP_UNPATCHED;
> >   
> >   
> > 				    # task 123 gets scheduled again
> > 				    lea %r12, task->patch_state
> > 
> > 				    => it happily stores an outdated
> > 				    state
> > 
> 
> Thanks for the clear explanation, this helps a lot.
> 
> > This is why the two functions should get called with preemption
> > disabled. We should document it at least. I imagine that we will
> > use them later also in another context and nobody will remember
> > this crazy scenario.
> > 
> > Well, even disabled preemption does not help. The process on
> > CPU1 might be also interrupted by an NMI and do some long
> > printk in it.
> > 
> > IMHO, the only safe approach is to call klp_patch_task()
> > only for "current" in a safe place. Then this race is harmless.
> > The switch happens in a safe place, so it does not matter
> > into which state the process is switched.
> 
> I'm not sure about this solution.  When klp_complete_transition() is
> called, we need all tasks to be patched, for good.  We don't want any of
> them to randomly switch to the wrong state at some later time in the
> middle of a future patch operation.  How would changing klp_patch_task()
> to only use "current" prevent that?

You are right that it is a pity, but it really should be safe because
the race is not entirely random.

If the race happens and assigns an outdated value, there are two
situations:

1. It is assigned when there is no transition in progress.
   Then it is OK because it will be ignored by the ftrace handler.
   The right state will be set before the next transition starts.

2. It is assigned when some other transition is in progress.
   Then it is OK as long as the function is called for "current".
   The "wrong" state will be used consistently. The task will switch
   to the right state at another safe place.


> > In other words, the task state might be updated only
> > 
> >    + by the task itself in a safe place
> >    + by another task when the updated one is sleeping in a safe place
> > 
> > This should be well documented and the API should help to avoid
> > misuse.
> 
> I think we could fix it to be safe for future callers who might not have
> preemption disabled with a couple of changes to klp_patch_task():
> disabling preemption and testing/clearing the TIF_PATCH_PENDING flag
> before changing the patch state:
> 
>   void klp_patch_task(struct task_struct *task)
>   {
>   	preempt_disable();
>   
>   	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
>   		task->patch_state = READ_ONCE(klp_target_state);
>   
>   	preempt_enable();
>   }

It reduces the race window a bit, but it is still there. For example,
an NMI still might add a huge delay between reading klp_target_state
and assigning task->patch_state.

What about the following?

/*
 * This function might assign an outdated value if the transition
 * is reverted and finalized in parallel. But it is safe. If the value
 * is assigned outside of a transition, it is ignored and the next
 * transition will set the right one. Or if it gets assigned
 * inside another transition, it will repeat the cycle and
 * set the right state.
 */
void klp_update_current_patch_state()
{
	while (test_and_clear_tsk_thread_flag(current, TIF_PATCH_PENDING))
		current->patch_state = READ_ONCE(klp_target_state);
}

Note that the disabled preemption helped only partially,
so I think that it was not really needed.

Hmm, it means that the task->patch_state might be either
KLP_PATCHED or KLP_UNPATCHED outside a transition. I wonder
if the tristate really brings any advantage.


Alternatively, we might synchronize the operation with klp_mutex.
The function is called in a slow path and in a safe context.
Well, it might cause contention on the lock when many CPUs are
trying to update their tasks.

Best Regards,
Petr
Miroslav Benes May 9, 2016, 3:42 p.m. UTC | #22
On Wed, 4 May 2016, Josh Poimboeuf wrote:

> On Wed, May 04, 2016 at 04:12:05PM +0200, Petr Mladek wrote:
> > On Wed 2016-05-04 14:39:40, Petr Mladek wrote:
> > > 		 *
> > > 		 * Note that the task must never be migrated to the target
> > > 		 * state when being inside this ftrace handler.
> > > 		 */
> > > 
> > > We might want to move the second paragraph on top of the function.
> > > It is a basic and important fact. It actually explains why the first
> > > read barrier is not needed when the patch is being disabled.
> > 
> > I wrote the statement partly intuitively. I think that it is really
> > somehow important. And I have slight doubts whether we are on the safe side.
> > 
> > First, why is it important that the task->patch_state is not switched
> > when being inside the ftrace handler?
> > 
> > If we are inside the handler, we are kind-of inside the called
> > function. And the basic idea of this consistency model is that
> > we must not switch a task when it is inside a patched function.
> > This is normally decided by the stack.
> > 
> > The handler is a bit special because it is called right before the
> > function. If it was the only patched function on the stack, it would
> > not matter if we choose the new or old code. Both decisions would
> > be safe for the moment.
> > 
> > The fun starts when the function calls another patched function.
> > The other patched function must be called consistently with
> > the first one. If the first function was from the patch,
> > the other must be from the patch as well and vice versa.
> > 
> > This is why we must not switch task->patch_state dangerously
> > when being inside the ftrace handler.
> > 
> > Now I am not sure if this condition is fulfilled. The ftrace handler
> > is called as the very first instruction of the function. Does not
> > it break the stack validity? Could we sleep inside the ftrace
> > handler? Will the patched function be detected on the stack?
> > 
> > Or is my brain already too far in the fantasy world?
> 
> I think this isn't a possibility.
> 
> In today's code base, this can't happen because task patch states are
> only switched when sleeping or when exiting the kernel.  The ftrace
> handler doesn't sleep directly.
> 
> If it were preempted, it couldn't be switched there either because we
> consider preempted stacks to be unreliable.

And IIRC ftrace handlers cannot sleep and are called with preemption
disabled as of now. The code is a bit obscure, but see
__ftrace_ops_list_func for example. This is the "main" ftrace handler that
calls all the registered ones in case FTRACE_OPS_FL_DYNAMIC is set (which
is always true for handlers coming from modules) and CONFIG_PREEMPT is
on. If it is off and there is only one handler registered for a function,
a dynamic trampoline is used. See commit 12cce594fa8f ("ftrace/x86: Allow
!CONFIG_PREEMPT dynamic ops to use allocated trampolines"). I think
Steven had a plan to implement dynamic trampolines even for the
CONFIG_PREEMPT case but he still hasn't done it. It should use the
RCU_TASKS infrastructure.

The reason for all the mess is that ftrace needs to be sure that no task 
is in the handler when the handler/trampoline is freed.

So we should be safe for now even from this side.

Miroslav
Miroslav Benes May 10, 2016, 11:39 a.m. UTC | #23
On Thu, 28 Apr 2016, Josh Poimboeuf wrote:

> Change livepatch to use a basic per-task consistency model.  This is the
> foundation which will eventually enable us to patch those ~10% of
> security patches which change function or data semantics.  This is the
> biggest remaining piece needed to make livepatch more generally useful.
> 
> This code stems from the design proposal made by Vojtech [1] in November
> 2014.  It's a hybrid of kGraft and kpatch: it uses kGraft's per-task
> consistency and syscall barrier switching combined with kpatch's stack
> trace switching.  There are also a number of fallback options which make
> it quite flexible.
> 
> Patches are applied on a per-task basis, when the task is deemed safe to
> switch over.  When a patch is enabled, livepatch enters into a
> transition state where tasks are converging to the patched state.
> Usually this transition state can complete in a few seconds.  The same
> sequence occurs when a patch is disabled, except the tasks converge from
> the patched state to the unpatched state.
> 
> An interrupt handler inherits the patched state of the task it
> interrupts.  The same is true for forked tasks: the child inherits the
> patched state of the parent.
> 
> Livepatch uses several complementary approaches to determine when it's
> safe to patch tasks:
> 
> 1. The first and most effective approach is stack checking of sleeping
>    tasks.  If no affected functions are on the stack of a given task,
>    the task is patched.  In most cases this will patch most or all of
>    the tasks on the first try.  Otherwise it'll keep trying
>    periodically.  This option is only available if the architecture has
>    reliable stacks (CONFIG_RELIABLE_STACKTRACE and
>    CONFIG_STACK_VALIDATION).
> 
> 2. The second approach, if needed, is kernel exit switching.  A
>    task is switched when it returns to user space from a system call, a
>    user space IRQ, or a signal.  It's useful in the following cases:
> 
>    a) Patching I/O-bound user tasks which are sleeping on an affected
>       function.  In this case you have to send SIGSTOP and SIGCONT to
>       force it to exit the kernel and be patched.
>    b) Patching CPU-bound user tasks.  If the task is highly CPU-bound
>       then it will get patched the next time it gets interrupted by an
>       IRQ.
>    c) Applying patches for architectures which don't yet have
>       CONFIG_RELIABLE_STACKTRACE.  In this case you'll have to signal
>       most of the tasks on the system.  However this isn't a complete
>       solution, because there's currently no way to patch kthreads
>       without CONFIG_RELIABLE_STACKTRACE.
> 
>    Note: since idle "swapper" tasks don't ever exit the kernel, they
>    instead have a kpatch_patch_task() call in the idle loop which allows

s/kpatch_patch_task()/klp_patch_task()/

[...]

> --- a/Documentation/livepatch/livepatch.txt
> +++ b/Documentation/livepatch/livepatch.txt
> @@ -72,7 +72,8 @@ example, they add a NULL pointer or a boundary check, fix a race by adding
>  a missing memory barrier, or add some locking around a critical section.
>  Most of these changes are self contained and the function presents itself
>  the same way to the rest of the system. In this case, the functions might
> -be updated independently one by one.
> +be updated independently one by one.  (This can be done by setting the
> +'immediate' flag in the klp_patch struct.)
>  
>  But there are more complex fixes. For example, a patch might change
>  ordering of locking in multiple functions at the same time. Or a patch
> @@ -86,20 +87,103 @@ or no data are stored in the modified structures at the moment.
>  The theory about how to apply functions a safe way is rather complex.
>  The aim is to define a so-called consistency model. It attempts to define
>  conditions when the new implementation could be used so that the system
> -stays consistent. The theory is not yet finished. See the discussion at
> -http://thread.gmane.org/gmane.linux.kernel/1823033/focus=1828189
> -
> -The current consistency model is very simple. It guarantees that either
> -the old or the new function is called. But various functions get redirected
> -one by one without any synchronization.
> -
> -In other words, the current implementation _never_ modifies the behavior
> -in the middle of the call. It is because it does _not_ rewrite the entire
> -function in the memory. Instead, the function gets redirected at the
> -very beginning. But this redirection is used immediately even when
> -some other functions from the same patch have not been redirected yet.
> -
> -See also the section "Limitations" below.
> +stays consistent.
> +
> +Livepatch has a consistency model which is a hybrid of kGraft and
> +kpatch:  it uses kGraft's per-task consistency and syscall barrier
> +switching combined with kpatch's stack trace switching.  There are also
> +a number of fallback options which make it quite flexible.
> +
> +Patches are applied on a per-task basis, when the task is deemed safe to
> +switch over.  When a patch is enabled, livepatch enters into a
> +transition state where tasks are converging to the patched state.
> +Usually this transition state can complete in a few seconds.  The same
> +sequence occurs when a patch is disabled, except the tasks converge from
> +the patched state to the unpatched state.
> +
> +An interrupt handler inherits the patched state of the task it
> +interrupts.  The same is true for forked tasks: the child inherits the
> +patched state of the parent.
> +
> +Livepatch uses several complementary approaches to determine when it's
> +safe to patch tasks:
> +
> +1. The first and most effective approach is stack checking of sleeping
> +   tasks.  If no affected functions are on the stack of a given task,
> +   the task is patched.  In most cases this will patch most or all of
> +   the tasks on the first try.  Otherwise it'll keep trying
> +   periodically.  This option is only available if the architecture has
> +   reliable stacks (CONFIG_RELIABLE_STACKTRACE and
> +   CONFIG_STACK_VALIDATION).
> +
> +2. The second approach, if needed, is kernel exit switching.  A
> +   task is switched when it returns to user space from a system call, a
> +   user space IRQ, or a signal.  It's useful in the following cases:
> +
> +   a) Patching I/O-bound user tasks which are sleeping on an affected
> +      function.  In this case you have to send SIGSTOP and SIGCONT to
> +      force it to exit the kernel and be patched.
> +   b) Patching CPU-bound user tasks.  If the task is highly CPU-bound
> +      then it will get patched the next time it gets interrupted by an
> +      IRQ.
> +   c) Applying patches for architectures which don't yet have
> +      CONFIG_RELIABLE_STACKTRACE.  In this case you'll have to signal
> +      most of the tasks on the system.  However this isn't a complete
> +      solution, because there's currently no way to patch kthreads
> +      without CONFIG_RELIABLE_STACKTRACE.
> +
> +   Note: since idle "swapper" tasks don't ever exit the kernel, they
> +   instead have a kpatch_patch_task() call in the idle loop which allows

s/kpatch_patch_task()/klp_patch_task()/

Otherwise all the code that touches livepatch looks good to me, apart from
the things mentioned in the other emails.

Miroslav
Josh Poimboeuf May 16, 2016, 5:27 p.m. UTC | #24
On Mon, May 09, 2016 at 11:41:37AM +0200, Miroslav Benes wrote:
> > +void klp_init_transition(struct klp_patch *patch, int state)
> > +{
> > +	struct task_struct *g, *task;
> > +	unsigned int cpu;
> > +	struct klp_object *obj;
> > +	struct klp_func *func;
> > +	int initial_state = !state;
> > +
> > +	klp_transition_patch = patch;
> > +
> > +	/*
> > +	 * If the patch can be applied or reverted immediately, skip the
> > +	 * per-task transitions.
> > +	 */
> > +	if (patch->immediate)
> > +		return;
> > +
> > +	/*
> > +	 * Initialize all tasks to the initial patch state to prepare them for
> > +	 * switching to the target state.
> > +	 */
> > +	read_lock(&tasklist_lock);
> > +	for_each_process_thread(g, task)
> > +		task->patch_state = initial_state;
> > +	read_unlock(&tasklist_lock);
> > +
> > +	/*
> > +	 * Ditto for the idle "swapper" tasks.
> > +	 */
> > +	get_online_cpus();
> > +	for_each_online_cpu(cpu)
> > +		idle_task(cpu)->patch_state = initial_state;
> > +	put_online_cpus();
> > +
> > +	/*
> > +	 * Ensure klp_ftrace_handler() sees the task->patch_state updates
> > +	 * before the func->transition updates.  Otherwise it could read an
> > +	 * out-of-date task state and pick the wrong function.
> > +	 */
> > +	smp_wmb();
> > +
> > +	/*
> > +	 * Set the func transition states so klp_ftrace_handler() will know to
> > +	 * switch to the transition logic.
> > +	 *
> > +	 * When patching, the funcs aren't yet in the func_stack and will be
> > +	 * made visible to the ftrace handler shortly by the calls to
> > +	 * klp_patch_object().
> > +	 *
> > +	 * When unpatching, the funcs are already in the func_stack and so are
> > +	 * already visible to the ftrace handler.
> > +	 */
> > +	klp_for_each_object(patch, obj)
> > +		klp_for_each_func(obj, func)
> > +			func->transition = true;
> > +
> > +	/*
> > +	 * Set the global target patch state which tasks will switch to.  This
> > +	 * has no effect until the TIF_PATCH_PENDING flags get set later.
> > +	 */
> > +	klp_target_state = state;
> 
> I am afraid there is a problem for (patch->immediate == true) patches. 
> klp_target_state is not set for those and the comment is not entirely 
> true, because klp_target_state has an effect in several places.

Ah, you're right.  I moved this assignment here for v2.  It was
originally done before the patch->immediate check.  If I remember
correctly, I moved it closer to the barrier for better readability (but
I created a bug in the process).

> I guess we need to set klp_target_state even for immediate patches. Should 
> we also initialize it with KLP_UNDEFINED and set it to KLP_UNDEFINED in 
> klp_complete_transition()?

Yes, to both.
Josh Poimboeuf May 16, 2016, 6:12 p.m. UTC | #25
On Mon, May 09, 2016 at 02:23:03PM +0200, Petr Mladek wrote:
> On Fri 2016-05-06 07:38:55, Josh Poimboeuf wrote:
> > On Thu, May 05, 2016 at 01:57:01PM +0200, Petr Mladek wrote:
> > > I have missed that the two commands are called with preemption
> > > disabled. So, I had the following crazy scenario in mind:
> > > 
> > > 
> > > CPU0				CPU1
> > > 
> > > klp_enable_patch()
> > > 
> > >   klp_target_state = KLP_PATCHED;
> > > 
> > >   for_each_task()
> > >      set TIF_PENDING_PATCH
> > > 
> > > 				# task 123
> > > 
> > > 				if (klp_patch_pending(current)
> > > 				  klp_patch_task(current)
> > > 
> > >                                     clear TIF_PENDING_PATCH
> > > 
> > > 				    smp_rmb();
> > > 
> > > 				    # switch to assembly of
> > > 				    # klp_patch_task()
> > > 
> > > 				    mov klp_target_state, %r12
> > > 
> > > 				    # interrupt and schedule
> > > 				    # another task
> > > 
> > > 
> > >   klp_reverse_transition();
> > > 
> > >     klp_target_state = KLP_UNPATCHED;
> > > 
> > >     klp_try_complete_transition()
> > > 
> > >       task = 123;
> > >       if (task->patch_state == klp_target_state;
> > >          return 0;
> > > 
> > >     => task 123 is in target state and does
> > >     not block conversion
> > > 
> > >   klp_complete_transition()
> > > 
> > > 
> > >   # disable previous patch on the stack
> > >   klp_disable_patch();
> > > 
> > >     klp_target_state = KLP_UNPATCHED;
> > >   
> > >   
> > > 				    # task 123 gets scheduled again
> > > 				    lea %r12, task->patch_state
> > > 
> > > 				    => it happily stores an outdated
> > > 				    state
> > > 
> > 
> > Thanks for the clear explanation, this helps a lot.
> > 
> > > This is why the two functions should get called with preemption
> > > disabled. We should document it at least. I imagine that we will
> > > use them later also in another context and nobody will remember
> > > this crazy scenario.
> > > 
> > > Well, even disabled preemption does not help. The process on
> > > CPU1 might be also interrupted by an NMI and do some long
> > > printk in it.
> > > 
> > > IMHO, the only safe approach is to call klp_patch_task()
> > > only for "current" in a safe place. Then this race is harmless.
> > > The switch happens in a safe place, so that it does not matter
> > > into which state the process is switched.
> > 
> > I'm not sure about this solution.  When klp_complete_transition() is
> > called, we need all tasks to be patched, for good.  We don't want any of
> > them to randomly switch to the wrong state at some later time in the
> > middle of a future patch operation.  How would changing klp_patch_task()
> > to only use "current" prevent that?
> 
> You are right that it is a pity but it really should be safe because
> it is not entirely random.
> 
> If the race happens and assigns an outdated value, there are two
> situations:
> 
> 1. It is assigned when there is no transition in progress.
>    Then it is OK because it will be ignored by the ftrace handler.
>    The right state will be set before the next transition starts.
> 
> 2. It is assigned when some other transition is in progress.
>    Then it is OK as long as the function is called from "current".
>    The "wrong" state will be used consistently. It will switch
>    to the right state on another safe state.

Maybe it would be safe, though I'm not entirely convinced.  Regardless I
think we should avoid these situations entirely because they create
windows for future bugs and races.

> > > By other words, the task state might be updated only
> > > 
> > >    + by the task itself on a safe place
> > >    + by another task when the updated one is sleeping in a safe place
> > > 
> > > This should be well documented and the API should help to avoid
> > > a misuse.
> > 
> > I think we could fix it to be safe for future callers who might not have
> > preemption disabled with a couple of changes to klp_patch_task():
> > disabling preemption and testing/clearing the TIF_PATCH_PENDING flag
> > before changing the patch state:
> > 
> >   void klp_patch_task(struct task_struct *task)
> >   {
> >   	preempt_disable();
> >   
> >   	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
> >   		task->patch_state = READ_ONCE(klp_target_state);
> >   
> >   	preempt_enable();
> >   }
> 
> It reduces the race window a bit but it is still there. For example,
> NMI still might add a huge delay between reading klp_target_state
> and assigning task->patch state.

Maybe you missed this paragraph from my last email:

| We would also need a synchronize_sched() after the patching is complete,
| either at the end of klp_try_complete_transition() or in
| klp_complete_transition().  That would make sure that all existing calls
| to klp_patch_task() are done.

So a huge NMI delay wouldn't be a problem here.  The call to
synchronize_sched() in klp_complete_transition() would sleep until the
NMI handler returns and the critical section of klp_patch_task()
finishes.  So once a patch is complete, we know that it's really
complete.
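The ordering Josh describes here (test-and-clear the flag, read the target state inside one critical section, then wait out all in-flight critical sections before declaring the transition complete) can be mimicked in a small user-space model. This is purely illustrative: `ToyTransition` and its methods are hypothetical stand-ins, a plain lock replaces both preempt_disable() and synchronize_sched(), and none of this is kernel code.

```python
import threading

class ToyTransition:
    """User-space analogy of the klp_patch_task() ordering discussed
    above.  A single lock stands in for preempt_disable() on the
    per-task side and for synchronize_sched() on the completion side."""

    def __init__(self):
        self.lock = threading.Lock()
        self.target_state = 0      # analogue of klp_target_state
        self.pending = {}          # analogue of per-task TIF_PATCH_PENDING
        self.patch_state = {}      # analogue of task->patch_state

    def start(self, tasks, target):
        with self.lock:
            self.target_state = target
            for t in tasks:
                self.pending[t] = True

    def patch_task(self, task):
        # Test-and-clear the pending flag *before* reading the target
        # state, all inside one critical section, so a reversed and
        # finalized transition can never store an outdated value later.
        with self.lock:
            if self.pending.pop(task, False):
                self.patch_state[task] = self.target_state

    def complete(self):
        # Acquiring the lock cannot succeed until every in-flight
        # patch_task() critical section has finished -- the analogue of
        # synchronize_sched() waiting out preempt-disabled regions.
        with self.lock:
            return not self.pending
```

With this shape a second patch_task() call on the same task is a no-op because the flag was already cleared, which is the property the test_and_clear_tsk_thread_flag() version above relies on.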

> What about the following?
> 
> /*
>  * This function might assign an outdated value if the transaction
>  * is reverted and finalized in parallel. But it is safe. If the value
>  * is assigned outside of a transaction, it is ignored and the next
>  * transaction will set the right one. Or if it gets assigned
>  * inside another transaction, it will repeat the cycle and
>  * set the right state.
>  */
> void klp_update_current_patch_state()
> {
> 	while (test_and_clear_tsk_thread_flag(current, TIF_PATCH_PENDING))
> 		current->patch_state = READ_ONCE(klp_target_state);
> }

I'm not sure how this would work.  How would the thread flag get set
again after it's been cleared?

Also I really don't like the idea of randomly updating a task's patch
state after the transition has been completed.

> Note that the disabled preemption helped only partially,
> so I think that it was not really needed.
> 
> Hmm, it means that the task->patch_state  might be either
> KLP_PATCHED or KLP_UNPATCHED outside a transition. I wonder
> if the tristate really brings some advantages.
> 
> 
> Alternatively, we might synchronize the operation with klp_mutex.
> The function is called in a slow path and in a safe context.
> Well, it might cause contention on the lock when many CPUs are
> trying to update their tasks.

I don't think a mutex would work because at least the ftrace handler
(and maybe more) can't sleep.  Maybe a spinlock could work but I think
that would be overkill.
Jessica Yu May 17, 2016, 10:53 p.m. UTC | #26
+++ Josh Poimboeuf [28/04/16 15:44 -0500]:

[snip]

>diff --git a/Documentation/livepatch/livepatch.txt b/Documentation/livepatch/livepatch.txt
>index 6c43f6e..bee86d0 100644
>--- a/Documentation/livepatch/livepatch.txt
>+++ b/Documentation/livepatch/livepatch.txt
>@@ -72,7 +72,8 @@ example, they add a NULL pointer or a boundary check, fix a race by adding
> a missing memory barrier, or add some locking around a critical section.
> Most of these changes are self contained and the function presents itself
> the same way to the rest of the system. In this case, the functions might
>-be updated independently one by one.
>+be updated independently one by one.  (This can be done by setting the
>+'immediate' flag in the klp_patch struct.)
>
> But there are more complex fixes. For example, a patch might change
> ordering of locking in multiple functions at the same time. Or a patch
>@@ -86,20 +87,103 @@ or no data are stored in the modified structures at the moment.
> The theory about how to apply functions a safe way is rather complex.
> The aim is to define a so-called consistency model. It attempts to define
> conditions when the new implementation could be used so that the system
>-stays consistent. The theory is not yet finished. See the discussion at
>-http://thread.gmane.org/gmane.linux.kernel/1823033/focus=1828189
>-
>-The current consistency model is very simple. It guarantees that either
>-the old or the new function is called. But various functions get redirected
>-one by one without any synchronization.
>-
>-In other words, the current implementation _never_ modifies the behavior
>-in the middle of the call. It is because it does _not_ rewrite the entire
>-function in the memory. Instead, the function gets redirected at the
>-very beginning. But this redirection is used immediately even when
>-some other functions from the same patch have not been redirected yet.
>-
>-See also the section "Limitations" below.
>+stays consistent.
>+
>+Livepatch has a consistency model which is a hybrid of kGraft and
>+kpatch:  it uses kGraft's per-task consistency and syscall barrier
>+switching combined with kpatch's stack trace switching.  There are also
>+a number of fallback options which make it quite flexible.
>+
>+Patches are applied on a per-task basis, when the task is deemed safe to
>+switch over.  When a patch is enabled, livepatch enters into a
>+transition state where tasks are converging to the patched state.
>+Usually this transition state can complete in a few seconds.  The same
>+sequence occurs when a patch is disabled, except the tasks converge from
>+the patched state to the unpatched state.
>+
>+An interrupt handler inherits the patched state of the task it
>+interrupts.  The same is true for forked tasks: the child inherits the
>+patched state of the parent.
>+
>+Livepatch uses several complementary approaches to determine when it's
>+safe to patch tasks:
>+
>+1. The first and most effective approach is stack checking of sleeping
>+   tasks.  If no affected functions are on the stack of a given task,
>+   the task is patched.  In most cases this will patch most or all of
>+   the tasks on the first try.  Otherwise it'll keep trying
>+   periodically.  This option is only available if the architecture has
>+   reliable stacks (CONFIG_RELIABLE_STACKTRACE and
>+   CONFIG_STACK_VALIDATION).
>+
>+2. The second approach, if needed, is kernel exit switching.  A
>+   task is switched when it returns to user space from a system call, a
>+   user space IRQ, or a signal.  It's useful in the following cases:
>+
>+   a) Patching I/O-bound user tasks which are sleeping on an affected
>+      function.  In this case you have to send SIGSTOP and SIGCONT to
>+      force it to exit the kernel and be patched.

See below -

>+   b) Patching CPU-bound user tasks.  If the task is highly CPU-bound
>+      then it will get patched the next time it gets interrupted by an
>+      IRQ.
>+   c) Applying patches for architectures which don't yet have
>+      CONFIG_RELIABLE_STACKTRACE.  In this case you'll have to signal
>+      most of the tasks on the system.  However this isn't a complete
>+      solution, because there's currently no way to patch kthreads
>+      without CONFIG_RELIABLE_STACKTRACE.
>+
>+   Note: since idle "swapper" tasks don't ever exit the kernel, they
>+   instead have a kpatch_patch_task() call in the idle loop which allows
>+   them to be patched before the CPU enters the idle state.
>+
>+3. A third approach (not yet implemented) is planned for the case where
>+   a kthread is sleeping on an affected function.  In that case we could
>+   kick the kthread with a signal and then try to patch the task from
>+   the to-be-patched function's livepatch ftrace handler when it
>+   re-enters the function.  This will require
>+   CONFIG_RELIABLE_STACKTRACE.
>+
>+All the above approaches may be skipped by setting the 'immediate' flag
>+in the 'klp_patch' struct, which will patch all tasks immediately.  This
>+can be useful if the patch doesn't change any function or data
>+semantics.  Note that, even with this flag set, it's possible that some
>+tasks may still be running with an old version of the function, until
>+that function returns.
>+
>+There's also an 'immediate' flag in the 'klp_func' struct which allows
>+you to specify that certain functions in the patch can be applied
>+without per-task consistency.  This might be useful if you want to patch
>+a common function like schedule(), and the function change doesn't need
>+consistency but the rest of the patch does.
>+
>+For architectures which don't have CONFIG_RELIABLE_STACKTRACE, there
>+are two options:
>+
>+a) the user can set the patch->immediate flag which causes all tasks to
>+   be patched immediately.  This option should be used with care, only
>+   when the patch doesn't change any function or data semantics; or
>+
>+b) use the kernel exit switching approach (this is the default).
>+   Note the patching will never complete because there's currently no
>+   way to patch kthreads without CONFIG_RELIABLE_STACKTRACE.
>+
>+The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
>+is in transition.  Only a single patch (the topmost patch on the stack)
>+can be in transition at a given time.  A patch can remain in transition
>+indefinitely, if any of the tasks are stuck in the initial patch state.
>+
>+A transition can be reversed and effectively canceled by writing the
>+opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
>+the transition is in progress.  Then all the tasks will attempt to
>+converge back to the original patch state.
>+
>+There's also a /proc/<pid>/patch_state file which can be used to
>+determine which tasks are blocking completion of a patching operation.
>+If a patch is in transition, this file shows 0 to indicate the task is
>+unpatched and 1 to indicate it's patched.  Otherwise, if no patch is in
>+transition, it shows -1. Any tasks which are blocking the transition
>+can be signaled with SIGSTOP and SIGCONT to force them to change their
>+patched state.

What about tasks sleeping on affected functions in uninterruptible
sleep (possibly indefinitely)? Since all signals are ignored, we
wouldn't be able to patch those tasks in this way, right? Would that
be an unsupported case? Might be useful to mention this in the
documentation somewhere.

Jessica
Jiri Kosina May 18, 2016, 8:16 a.m. UTC | #27
On Tue, 17 May 2016, Jessica Yu wrote:

> What about tasks sleeping on affected functions in uninterruptible sleep 
> (possibly indefinitely)? Since all signals are ignored, we wouldn't be 
> able to patch those tasks in this way, right? Would that be an 
> unsupported case?

I don't think there is any better way out of this situation than 
documenting that the convergence of patching could in such cases 
take quite a lot of time (well, we can pro-actively try to detect this 
situation before the patching actually starts, and warn about the possible 
consequences).

But let's face it, this should be pretty uncommon, because (a) it's not 
realistic for the wait times to be really indefinite (b) the task is 
likely to be in TASK_KILLABLE rather than just plain TASK_UNINTERRUPTIBLE.
Petr Mladek May 18, 2016, 1:12 p.m. UTC | #28
On Mon 2016-05-16 13:12:50, Josh Poimboeuf wrote:
> On Mon, May 09, 2016 at 02:23:03PM +0200, Petr Mladek wrote:
> > On Fri 2016-05-06 07:38:55, Josh Poimboeuf wrote:
> > > On Thu, May 05, 2016 at 01:57:01PM +0200, Petr Mladek wrote:
> > > > I have missed that the two commands are called with preemption
> > > > disabled. So, I had the following crazy scenario in mind:
> > > > 
> > > > 
> > > > CPU0				CPU1
> > > > 
> > > > klp_enable_patch()
> > > > 
> > > >   klp_target_state = KLP_PATCHED;
> > > > 
> > > >   for_each_task()
> > > >      set TIF_PENDING_PATCH
> > > > 
> > > > 				# task 123
> > > > 
> > > > 				if (klp_patch_pending(current))
> > > > 				  klp_patch_task(current)
> > > > 
> > > >                                     clear TIF_PENDING_PATCH
> > > > 
> > > > 				    smp_rmb();
> > > > 
> > > > 				    # switch to assembly of
> > > > 				    # klp_patch_task()
> > > > 
> > > > 				    mov klp_target_state, %r12
> > > > 
> > > > 				    # interrupt and schedule
> > > > 				    # another task
> > > > 
> > > > 
> > > >   klp_reverse_transition();
> > > > 
> > > >     klp_target_state = KLP_UNPATCHED;
> > > > 
> > > >     klp_try_to_complete_transition()
> > > > 
> > > >       task = 123;
> > > >       if (task->patch_state == klp_target_state)
> > > >          return 0;
> > > > 
> > > >     => task 123 is in target state and does
> > > >     not block conversion
> > > > 
> > > >   klp_complete_transition()
> > > > 
> > > > 
> > > >   # disable previous patch on the stack
> > > >   klp_disable_patch();
> > > > 
> > > >     klp_target_state = KLP_UNPATCHED;
> > > >   
> > > >   
> > > > 				    # task 123 gets scheduled again
> > > > 				    lea %r12, task->patch_state
> > > > 
> > > > 				    => it happily stores an outdated
> > > > 				    state
> > > > 
> > > 
> > > Thanks for the clear explanation, this helps a lot.
> > > 
> > > > This is why the two functions should get called with preemption
> > > > disabled. We should document it at least. I imagine that we will
> > > > use them later also in another context and nobody will remember
> > > > this crazy scenario.
> > > > 
> > > > Well, even disabled preemption does not help. The process on
> > > > CPU1 might be also interrupted by an NMI and do some long
> > > > printk in it.
> > > > 
> > > > IMHO, the only safe approach is to call klp_patch_task()
> > > > only for "current" in a safe place. Then this race is harmless.
> > > > The switch happens in a safe place, so that it does not matter
> > > > into which state the process is switched.
> > > 
> > > I'm not sure about this solution.  When klp_complete_transition() is
> > > called, we need all tasks to be patched, for good.  We don't want any of
> > > them to randomly switch to the wrong state at some later time in the
> > > middle of a future patch operation.  How would changing klp_patch_task()
> > > to only use "current" prevent that?
> > 
> > You are right that it is a pity but it really should be safe because
> > it is not entirely random.
> > 
> > If the race happens and assigns an outdated value, there are two
> > situations:
> > 
> > 1. It is assigned when there is no transition in progress.
> >    Then it is OK because it will be ignored by the ftrace handler.
> >    The right state will be set before the next transition starts.
> > 
> > 2. It is assigned when some other transition is in progress.
> >    Then it is OK as long as the function is called from "current".
> >    The "wrong" state will be used consistently. It will switch
> >    to the right state on another safe state.
> 
> Maybe it would be safe, though I'm not entirely convinced.  Regardless I
> think we should avoid these situations entirely because they create
> windows for future bugs and races.

Yup, I would prefer a cleaner solution as well.

> > > > By other words, the task state might be updated only
> > > > 
> > > >    + by the task itself on a safe place
> > > >    + by another task when the updated one is sleeping in a safe place
> > > > 
> > > > This should be well documented and the API should help to avoid
> > > > a misuse.
> > > 
> > > I think we could fix it to be safe for future callers who might not have
> > > preemption disabled with a couple of changes to klp_patch_task():
> > > disabling preemption and testing/clearing the TIF_PATCH_PENDING flag
> > > before changing the patch state:
> > > 
> > >   void klp_patch_task(struct task_struct *task)
> > >   {
> > >   	preempt_disable();
> > >   
> > >   	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
> > >   		task->patch_state = READ_ONCE(klp_target_state);
> > >   
> > >   	preempt_enable();
> > >   }
> > 
> > It reduces the race window a bit but it is still there. For example,
> > NMI still might add a huge delay between reading klp_target_state
> > and assigning task->patch state.
> 
> Maybe you missed this paragraph from my last email:
>
> | We would also need a synchronize_sched() after the patching is complete,
> | either at the end of klp_try_complete_transition() or in
> | klp_complete_transition().  That would make sure that all existing calls
> | to klp_patch_task() are done.
> 
> So a huge NMI delay wouldn't be a problem here.  The call to
> synchronize_sched() in klp_complete_transition() would sleep until the
> NMI handler returns and the critical section of klp_patch_task()
> finishes.  So once a patch is complete, we know that it's really
> complete.

Yes, synchronize_sched() will help with the preemption disabled. I did
not shake my head enough last time.


> > What about the following?
> > 
> > /*
> >  * This function might assign an outdated value if the transaction
> >  * is reverted and finalized in parallel. But it is safe. If the value
> >  * is assigned outside of a transaction, it is ignored and the next
> >  * transaction will set the right one. Or if it gets assigned
> >  * inside another transaction, it will repeat the cycle and
> >  * set the right state.
> >  */
> > void klp_update_current_patch_state()
> > {
> > 	while (test_and_clear_tsk_thread_flag(current, TIF_PATCH_PENDING))
> > 		current->patch_state = READ_ONCE(klp_target_state);
> > }
> 
> I'm not sure how this would work.  How would the thread flag get set
> again after it's been cleared?

See the race described in the previous mail. The problem is when the
target_state and the TIF flag get set after reading klp_target_state
into a register and before storing the value into current->patch_state.

We do not need this if we use synchronize_sched() and fix up
current->patch_state then.

> Also I really don't like the idea of randomly updating a task's patch
> state after the transition has been completed.
> 
> > Note that the disabled preemption helped only partially,
> > so I think that it was not really needed.
> > 
> > Hmm, it means that the task->patch_state  might be either
> > KLP_PATCHED or KLP_UNPATCHED outside a transition. I wonder
> > if the tristate really brings some advantages.
> > 
> > 
> > Alternatively, we might synchronize the operation with klp_mutex.
> > The function is called in a slow path and in a safe context.
> > Well, it might cause contention on the lock when many CPUs are
> > trying to update their tasks.
> 
> I don't think a mutex would work because at least the ftrace handler
> (and maybe more) can't sleep.  Maybe a spinlock could work but I think
> that would be overkill.

Sure, I had a spinlock in mind.

Best Regards,
Petr
Josh Poimboeuf May 18, 2016, 4:51 p.m. UTC | #29
On Wed, May 18, 2016 at 10:16:22AM +0200, Jiri Kosina wrote:
> On Tue, 17 May 2016, Jessica Yu wrote:
> 
> > What about tasks sleeping on affected functions in uninterruptible sleep 
> > (possibly indefinitely)? Since all signals are ignored, we wouldn't be 
> > able to patch those tasks in this way, right? Would that be an 
> > unsupported case?
> 
> I don't think there is any better way out of this situation than 
> documenting that the convergence of patching could in such cases 
> take quite a lot of time (well, we can pro-actively try to detect this 
> situation before the patching actually starts, and warn about the possible 
> consequences).
> 
> But let's face it, this should be pretty uncommon, because (a) it's not 
> realistic for the wait times to be really indefinite (b) the task is 
> likely to be in TASK_KILLABLE rather than just plain TASK_UNINTERRUPTIBLE.

Yeah, I think this situation -- a task sleeping on an affected function
in uninterruptible state for a long period of time -- would be
exceedingly rare and not something we need to worry about for now.
Jiri Kosina May 18, 2016, 8:22 p.m. UTC | #30
On Wed, 18 May 2016, Josh Poimboeuf wrote:

> Yeah, I think this situation -- a task sleeping on an affected function 
> in uninterruptible state for a long period of time -- would be 
> exceedingly rare and not something we need to worry about for now.

Plus in case task'd be in TASK_UNINTERRUPTIBLE for more than 120s, hung 
task detector would trigger anyway.
David Laight May 23, 2016, 9:42 a.m. UTC | #31
From: Jiri Kosina
> Sent: 18 May 2016 21:23
> On Wed, 18 May 2016, Josh Poimboeuf wrote:
> 
> > Yeah, I think this situation -- a task sleeping on an affected function
> > in uninterruptible state for a long period of time -- would be
> > exceedingly rare and not something we need to worry about for now.
> 
> Plus in case task'd be in TASK_UNINTERRUPTIBLE for more than 120s, hung
> task detector would trigger anyway.


Related, please can we have a flag for the sleep and/or process so that
an uninterruptible sleep doesn't trigger the 'hung task' detector
and also stops the process counting towards the 'load average'.

In particular some kernel threads are not signalable, and do not
want to be woken by signals (they exit on a specific request).

	David
Jiri Kosina May 23, 2016, 6:44 p.m. UTC | #32
On Mon, 23 May 2016, David Laight wrote:

> Related, please can we have a flag for the sleep and/or process so that
> an uninterruptible sleep doesn't trigger the 'hung task' detector

TASK_KILLABLE

> and also stops the process counting towards the 'load average'.

TASK_NOLOAD
David Laight May 24, 2016, 3:06 p.m. UTC | #33
From: Jiri Kosina 
> Sent: 23 May 2016 19:45
> > Related, please can we have a flag for the sleep and/or process so that
> > an uninterruptible sleep doesn't trigger the 'hung task' detector
> 
> TASK_KILLABLE

Not sure that does what I want.
It appears to allow some 'kill' actions to wake the process.
I'm not sure I've looked at the 'hung task' code since 2007.

> > and also stops the process counting towards the 'load average'.
> 
> TASK_NOLOAD

Ah, that was added in May 2015.
Not surprising I didn't know about it.

I'll leave the code doing:
  set_current_state(signal_pending(current) ? TASK_UNINTERRUPTIBLE : TASK_INTERRUPTIBLE);
for a while longer.

	David
Jiri Kosina May 24, 2016, 10:45 p.m. UTC | #34
On Tue, 24 May 2016, David Laight wrote:

> > > Related, please can we have a flag for the sleep and/or process so that
> > > an uninterruptible sleep doesn't trigger the 'hung task' detector
> > 
> > TASK_KILLABLE
> 
> Not sure that does what I want.
> It appears to allow some 'kill' actions to wake the process.
> I'm sure I've looked at the 'hung task' code since 2007.

The trick is the 

	if (t->state == TASK_UNINTERRUPTIBLE)

test in check_hung_uninterruptible_tasks(). That makes sure that 
TASK_KILLABLE tasks (e.g. waiting on NFS I/O, but not limited only to 
that; feel free to set it wherever you need) are skipped. Please note 
that TASK_KILLABLE is actually a _mask_ that includes TASK_UNINTERRUPTIBLE 
as well; therefore the '==' test skips such tasks.
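The '==' versus '&' distinction Jiri describes can be checked with a few lines. The bit values below are illustrative only; the real constants are defined in include/linux/sched.h and have changed across kernel versions.

```python
# Illustrative state bits -- not the real kernel values.
TASK_UNINTERRUPTIBLE = 0x0002
TASK_WAKEKILL        = 0x0080
TASK_KILLABLE        = TASK_WAKEKILL | TASK_UNINTERRUPTIBLE

def hung_task_would_check(state):
    # check_hung_uninterruptible_tasks() compares the whole state word
    # with '==', so any extra bit (e.g. TASK_WAKEKILL) excludes the task
    # from hung-task detection even though TASK_UNINTERRUPTIBLE is set.
    return state == TASK_UNINTERRUPTIBLE
```

So a TASK_KILLABLE sleeper still includes the TASK_UNINTERRUPTIBLE bit, but the equality test keeps it out of the hung-task scan.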
Petr Mladek June 6, 2016, 1:54 p.m. UTC | #35
On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> Change livepatch to use a basic per-task consistency model.  This is the
> foundation which will eventually enable us to patch those ~10% of
> security patches which change function or data semantics.  This is the
> biggest remaining piece needed to make livepatch more generally useful.

> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> new file mode 100644
> index 0000000..92819bb
> --- /dev/null
> +++ b/kernel/livepatch/transition.c
> +/*
> + * Try to safely switch a task to the target patch state.  If it's currently
> + * running, or it's sleeping on a to-be-patched or to-be-unpatched function, or
> + * if the stack is unreliable, return false.
> + */
> +static bool klp_try_switch_task(struct task_struct *task)
> +{
> +	struct rq *rq;
> +	unsigned long flags;

This should be of type "struct rq_flags". Otherwise, I get compilation
warnings:

kernel/livepatch/transition.c: In function ‘klp_try_switch_task’:
kernel/livepatch/transition.c:349:2: warning: passing argument 2 of ‘task_rq_lock’ from incompatible pointer type [enabled by default]
  rq = task_rq_lock(task, &flags);
  ^
In file included from kernel/livepatch/transition.c:24:0:
kernel/livepatch/../sched/sched.h:1468:12: note: expected ‘struct rq_flags *’ but argument is of type ‘long unsigned int *’
 struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
            ^
kernel/livepatch/transition.c:367:2: warning: passing argument 3 of ‘task_rq_unlock’ from incompatible pointer type [enabled by default]
  task_rq_unlock(rq, task, &flags);
  ^
In file included from kernel/livepatch/transition.c:24:0:
kernel/livepatch/../sched/sched.h:1480:1: note: expected ‘struct rq_flags *’ but argument is of type ‘long unsigned int *’
 task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)


And even runtime warnings from lockdep:

[  212.847548] WARNING: CPU: 1 PID: 3847 at kernel/locking/lockdep.c:3532 lock_release+0x431/0x480
[  212.847549] releasing a pinned lock
[  212.847550] Modules linked in: livepatch_sample(E+)
[  212.847555] CPU: 1 PID: 3847 Comm: modprobe Tainted: G            E K 4.7.0-rc1-next-20160602-4-default+ #336
[  212.847556] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  212.847558]  0000000000000000 ffff880139823aa0 ffffffff814388dc ffff880139823af0
[  212.847562]  0000000000000000 ffff880139823ae0 ffffffff8106fad1 00000dcc82b11390
[  212.847565]  ffff88013fc978d8 ffffffff810eea1e ffff8800ba0ed6d0 0000000000000003
[  212.847569] Call Trace:
[  212.847572]  [<ffffffff814388dc>] dump_stack+0x85/0xc9
[  212.847575]  [<ffffffff8106fad1>] __warn+0xd1/0xf0
[  212.847578]  [<ffffffff810eea1e>] ? klp_try_switch_task.part.3+0x5e/0x2b0
[  212.847580]  [<ffffffff8106fb3f>] warn_slowpath_fmt+0x4f/0x60
[  212.847582]  [<ffffffff810cc151>] lock_release+0x431/0x480
[  212.847585]  [<ffffffff8101e258>] ? dump_trace+0x118/0x310
[  212.847588]  [<ffffffff8195d07c>] ? entry_SYSCALL_64_fastpath+0x1f/0xbd
[  212.847590]  [<ffffffff8195c8bf>] _raw_spin_unlock+0x1f/0x30
[  212.847600]  [<ffffffff810eea1e>] klp_try_switch_task.part.3+0x5e/0x2b0
[  212.847603]  [<ffffffff810ef0e4>] klp_try_complete_transition+0x84/0x190
[  212.847605]  [<ffffffff810ed370>] __klp_enable_patch+0xb0/0x130
[  212.847607]  [<ffffffff810ed445>] klp_enable_patch+0x55/0x80
[  212.847610]  [<ffffffffa0000030>] ? livepatch_cmdline_proc_show+0x30/0x30 [livepatch_sample]
[  212.847613]  [<ffffffffa0000061>] livepatch_init+0x31/0x70 [livepatch_sample]
[  212.847615]  [<ffffffffa0000030>] ? livepatch_cmdline_proc_show+0x30/0x30 [livepatch_sample]
[  212.847617]  [<ffffffff8100041d>] do_one_initcall+0x3d/0x160
[  212.847629]  [<ffffffff81196c9b>] ? do_init_module+0x27/0x1e4
[  212.847632]  [<ffffffff810e7172>] ? rcu_read_lock_sched_held+0x62/0x70
[  212.847634]  [<ffffffff811fdea2>] ? kmem_cache_alloc_trace+0x282/0x340
[  212.847636]  [<ffffffff81196cd4>] do_init_module+0x60/0x1e4
[  212.847638]  [<ffffffff81111fd2>] load_module+0x1482/0x1d40
[  212.847640]  [<ffffffff8110ea10>] ? __symbol_put+0x40/0x40
[  212.847643]  [<ffffffff81112aa9>] SYSC_finit_module+0xa9/0xd0
[  212.847645]  [<ffffffff81112aee>] SyS_finit_module+0xe/0x10
[  212.847647]  [<ffffffff8195d07c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[  212.847649] ---[ end trace e4e9f09d45443049 ]---


> +	int ret;
> +	bool success = false;
> +
> +	/* check if this task has already switched over */
> +	if (task->patch_state == klp_target_state)
> +		return true;
> +
> +	/*
> +	 * For arches which don't have reliable stack traces, we have to rely
> +	 * on other methods (e.g., switching tasks at the syscall barrier).
> +	 */
> +	if (!IS_ENABLED(CONFIG_RELIABLE_STACKTRACE))
> +		return false;
> +
> +	/*
> +	 * Now try to check the stack for any to-be-patched or to-be-unpatched
> +	 * functions.  If all goes well, switch the task to the target patch
> +	 * state.
> +	 */
> +	rq = task_rq_lock(task, &flags);
> +
> +	if (task_running(rq, task) && task != current) {
> +		pr_debug("%s: pid %d (%s) is running\n", __func__, task->pid,
> +			 task->comm);

Also I thought about using printk_deferred() inside the rq lock, but
it is not strictly needed. Besides, we use only pr_debug() here, which
is a NOP when not enabled.

Best Regards,
Petr
Josh Poimboeuf June 6, 2016, 2:29 p.m. UTC | #36
On Mon, Jun 06, 2016 at 03:54:41PM +0200, Petr Mladek wrote:
> On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> > Change livepatch to use a basic per-task consistency model.  This is the
> > foundation which will eventually enable us to patch those ~10% of
> > security patches which change function or data semantics.  This is the
> > biggest remaining piece needed to make livepatch more generally useful.
> 
> > diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> > new file mode 100644
> > index 0000000..92819bb
> > --- /dev/null
> > +++ b/kernel/livepatch/transition.c
> > +/*
> > + * Try to safely switch a task to the target patch state.  If it's currently
> > + * running, or it's sleeping on a to-be-patched or to-be-unpatched function, or
> > + * if the stack is unreliable, return false.
> > + */
> > +static bool klp_try_switch_task(struct task_struct *task)
> > +{
> > +	struct rq *rq;
> > +	unsigned long flags;
> 
> This should be of type "struct rq_flags". Otherwise, I get compilation
> warnings:
> 
> kernel/livepatch/transition.c: In function ‘klp_try_switch_task’:
> kernel/livepatch/transition.c:349:2: warning: passing argument 2 of ‘task_rq_lock’ from incompatible pointer type [enabled by default]
>   rq = task_rq_lock(task, &flags);
>   ^
> In file included from kernel/livepatch/transition.c:24:0:
> kernel/livepatch/../sched/sched.h:1468:12: note: expected ‘struct rq_flags *’ but argument is of type ‘long unsigned int *’
>  struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
>             ^
> kernel/livepatch/transition.c:367:2: warning: passing argument 3 of ‘task_rq_unlock’ from incompatible pointer type [enabled by default]
>   task_rq_unlock(rq, task, &flags);
>   ^
> In file included from kernel/livepatch/transition.c:24:0:
> kernel/livepatch/../sched/sched.h:1480:1: note: expected ‘struct rq_flags *’ but argument is of type ‘long unsigned int *’
>  task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
> 
> 
> And even runtime warnings from lockdep:
> 
> [  212.847548] WARNING: CPU: 1 PID: 3847 at kernel/locking/lockdep.c:3532 lock_release+0x431/0x480
> [  212.847549] releasing a pinned lock
> [  212.847550] Modules linked in: livepatch_sample(E+)
> [  212.847555] CPU: 1 PID: 3847 Comm: modprobe Tainted: G            E K 4.7.0-rc1-next-20160602-4-default+ #336
> [  212.847556] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> [  212.847558]  0000000000000000 ffff880139823aa0 ffffffff814388dc ffff880139823af0
> [  212.847562]  0000000000000000 ffff880139823ae0 ffffffff8106fad1 00000dcc82b11390
> [  212.847565]  ffff88013fc978d8 ffffffff810eea1e ffff8800ba0ed6d0 0000000000000003
> [  212.847569] Call Trace:
> [  212.847572]  [<ffffffff814388dc>] dump_stack+0x85/0xc9
> [  212.847575]  [<ffffffff8106fad1>] __warn+0xd1/0xf0
> [  212.847578]  [<ffffffff810eea1e>] ? klp_try_switch_task.part.3+0x5e/0x2b0
> [  212.847580]  [<ffffffff8106fb3f>] warn_slowpath_fmt+0x4f/0x60
> [  212.847582]  [<ffffffff810cc151>] lock_release+0x431/0x480
> [  212.847585]  [<ffffffff8101e258>] ? dump_trace+0x118/0x310
> [  212.847588]  [<ffffffff8195d07c>] ? entry_SYSCALL_64_fastpath+0x1f/0xbd
> [  212.847590]  [<ffffffff8195c8bf>] _raw_spin_unlock+0x1f/0x30
> [  212.847600]  [<ffffffff810eea1e>] klp_try_switch_task.part.3+0x5e/0x2b0
> [  212.847603]  [<ffffffff810ef0e4>] klp_try_complete_transition+0x84/0x190
> [  212.847605]  [<ffffffff810ed370>] __klp_enable_patch+0xb0/0x130
> [  212.847607]  [<ffffffff810ed445>] klp_enable_patch+0x55/0x80
> [  212.847610]  [<ffffffffa0000030>] ? livepatch_cmdline_proc_show+0x30/0x30 [livepatch_sample]
> [  212.847613]  [<ffffffffa0000061>] livepatch_init+0x31/0x70 [livepatch_sample]
> [  212.847615]  [<ffffffffa0000030>] ? livepatch_cmdline_proc_show+0x30/0x30 [livepatch_sample]
> [  212.847617]  [<ffffffff8100041d>] do_one_initcall+0x3d/0x160
> [  212.847629]  [<ffffffff81196c9b>] ? do_init_module+0x27/0x1e4
> [  212.847632]  [<ffffffff810e7172>] ? rcu_read_lock_sched_held+0x62/0x70
> [  212.847634]  [<ffffffff811fdea2>] ? kmem_cache_alloc_trace+0x282/0x340
> [  212.847636]  [<ffffffff81196cd4>] do_init_module+0x60/0x1e4
> [  212.847638]  [<ffffffff81111fd2>] load_module+0x1482/0x1d40
> [  212.847640]  [<ffffffff8110ea10>] ? __symbol_put+0x40/0x40
> [  212.847643]  [<ffffffff81112aa9>] SYSC_finit_module+0xa9/0xd0
> [  212.847645]  [<ffffffff81112aee>] SyS_finit_module+0xe/0x10
> [  212.847647]  [<ffffffff8195d07c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
> [  212.847649] ---[ end trace e4e9f09d45443049 ]---

Thanks, I also saw this when rebasing onto a newer linux-next.

> > +	int ret;
> > +	bool success = false;
> > +
> > +	/* check if this task has already switched over */
> > +	if (task->patch_state == klp_target_state)
> > +		return true;
> > +
> > +	/*
> > +	 * For arches which don't have reliable stack traces, we have to rely
> > +	 * on other methods (e.g., switching tasks at the syscall barrier).
> > +	 */
> > +	if (!IS_ENABLED(CONFIG_RELIABLE_STACKTRACE))
> > +		return false;
> > +
> > +	/*
> > +	 * Now try to check the stack for any to-be-patched or to-be-unpatched
> > +	 * functions.  If all goes well, switch the task to the target patch
> > +	 * state.
> > +	 */
> > +	rq = task_rq_lock(task, &flags);
> > +
> > +	if (task_running(rq, task) && task != current) {
> > +		pr_debug("%s: pid %d (%s) is running\n", __func__, task->pid,
> > +			 task->comm);
> 
> I also thought about using printk_deferred() inside the rq_lock,
> but it is not strictly needed. We only use pr_debug() here, which
> is a NOP when not enabled.

Good catch.  It's probably best to avoid it anyway.  klp_check_stack()
also has some pr_debug() calls.  I may restructure the code a bit to
release the lock before doing any of the pr_debug() calls.

Patch

diff --git a/Documentation/ABI/testing/sysfs-kernel-livepatch b/Documentation/ABI/testing/sysfs-kernel-livepatch
index da87f43..24ca6df 100644
--- a/Documentation/ABI/testing/sysfs-kernel-livepatch
+++ b/Documentation/ABI/testing/sysfs-kernel-livepatch
@@ -25,6 +25,14 @@  Description:
 		code is currently applied.  Writing 0 will disable the patch
 		while writing 1 will re-enable the patch.
 
+What:		/sys/kernel/livepatch/<patch>/transition
+Date:		May 2016
+KernelVersion:	4.7.0
+Contact:	live-patching@vger.kernel.org
+Description:
+		An attribute which indicates whether the patch is currently in
+		transition.
+
 What:		/sys/kernel/livepatch/<patch>/<object>
 Date:		Nov 2014
 KernelVersion:	3.19.0
diff --git a/Documentation/livepatch/livepatch.txt b/Documentation/livepatch/livepatch.txt
index 6c43f6e..bee86d0 100644
--- a/Documentation/livepatch/livepatch.txt
+++ b/Documentation/livepatch/livepatch.txt
@@ -72,7 +72,8 @@  example, they add a NULL pointer or a boundary check, fix a race by adding
 a missing memory barrier, or add some locking around a critical section.
 Most of these changes are self contained and the function presents itself
 the same way to the rest of the system. In this case, the functions might
-be updated independently one by one.
+be updated independently one by one.  (This can be done by setting the
+'immediate' flag in the klp_patch struct.)
 
 But there are more complex fixes. For example, a patch might change
 ordering of locking in multiple functions at the same time. Or a patch
@@ -86,20 +87,103 @@  or no data are stored in the modified structures at the moment.
 The theory about how to apply functions a safe way is rather complex.
 The aim is to define a so-called consistency model. It attempts to define
 conditions when the new implementation could be used so that the system
-stays consistent. The theory is not yet finished. See the discussion at
-http://thread.gmane.org/gmane.linux.kernel/1823033/focus=1828189
-
-The current consistency model is very simple. It guarantees that either
-the old or the new function is called. But various functions get redirected
-one by one without any synchronization.
-
-In other words, the current implementation _never_ modifies the behavior
-in the middle of the call. It is because it does _not_ rewrite the entire
-function in the memory. Instead, the function gets redirected at the
-very beginning. But this redirection is used immediately even when
-some other functions from the same patch have not been redirected yet.
-
-See also the section "Limitations" below.
+stays consistent.
+
+Livepatch has a consistency model which is a hybrid of kGraft and
+kpatch:  it uses kGraft's per-task consistency and syscall barrier
+switching combined with kpatch's stack trace switching.  There are also
+a number of fallback options which make it quite flexible.
+
+Patches are applied on a per-task basis, when the task is deemed safe to
+switch over.  When a patch is enabled, livepatch enters into a
+transition state where tasks are converging to the patched state.
+Usually this transition state can complete in a few seconds.  The same
+sequence occurs when a patch is disabled, except the tasks converge from
+the patched state to the unpatched state.
+
+An interrupt handler inherits the patched state of the task it
+interrupts.  The same is true for forked tasks: the child inherits the
+patched state of the parent.
+
+Livepatch uses several complementary approaches to determine when it's
+safe to patch tasks:
+
+1. The first and most effective approach is stack checking of sleeping
+   tasks.  If no affected functions are on the stack of a given task,
+   the task is patched.  In most cases this will patch most or all of
+   the tasks on the first try.  Otherwise it'll keep trying
+   periodically.  This option is only available if the architecture has
+   reliable stacks (CONFIG_RELIABLE_STACKTRACE and
+   CONFIG_STACK_VALIDATION).
+
+2. The second approach, if needed, is kernel exit switching.  A
+   task is switched when it returns to user space from a system call, a
+   user space IRQ, or a signal.  It's useful in the following cases:
+
+   a) Patching I/O-bound user tasks which are sleeping on an affected
+      function.  In this case you have to send SIGSTOP and SIGCONT to
+      force it to exit the kernel and be patched.
+   b) Patching CPU-bound user tasks.  If the task is highly CPU-bound
+      then it will get patched the next time it gets interrupted by an
+      IRQ.
+   c) Applying patches for architectures which don't yet have
+      CONFIG_RELIABLE_STACKTRACE.  In this case you'll have to signal
+      most of the tasks on the system.  However this isn't a complete
+      solution, because there's currently no way to patch kthreads
+      without CONFIG_RELIABLE_STACKTRACE.
+
+   Note: since idle "swapper" tasks don't ever exit the kernel, they
+   instead have a klp_patch_task() call in the idle loop which allows
+   them to be patched before the CPU enters the idle state.
+
+3. A third approach (not yet implemented) is planned for the case where
+   a kthread is sleeping on an affected function.  In that case we could
+   kick the kthread with a signal and then try to patch the task from
+   the to-be-patched function's livepatch ftrace handler when it
+   re-enters the function.  This will require
+   CONFIG_RELIABLE_STACKTRACE.
+
+All the above approaches may be skipped by setting the 'immediate' flag
+in the 'klp_patch' struct, which will patch all tasks immediately.  This
+can be useful if the patch doesn't change any function or data
+semantics.  Note that, even with this flag set, it's possible that some
+tasks may still be running with an old version of the function, until
+that function returns.
+
+There's also an 'immediate' flag in the 'klp_func' struct which allows
+you to specify that certain functions in the patch can be applied
+without per-task consistency.  This might be useful if you want to patch
+a common function like schedule(), and the function change doesn't need
+consistency but the rest of the patch does.
+
+For architectures which don't have CONFIG_RELIABLE_STACKTRACE, there
+are two options:
+
+a) the user can set the patch->immediate flag which causes all tasks to
+   be patched immediately.  This option should be used with care, only
+   when the patch doesn't change any function or data semantics; or
+
+b) use the kernel exit switching approach (this is the default).
+   Note that the patching will never complete, because there's currently
+   no way to patch kthreads without CONFIG_RELIABLE_STACKTRACE.
+
+The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
+is in transition.  Only a single patch (the topmost patch on the stack)
+can be in transition at a given time.  A patch can remain in transition
+indefinitely, if any of the tasks are stuck in the initial patch state.
+
+A transition can be reversed and effectively canceled by writing the
+opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
+the transition is in progress.  Then all the tasks will attempt to
+converge back to the original patch state.
+
+There's also a /proc/<pid>/patch_state file which can be used to
+determine which tasks are blocking completion of a patching operation.
+If a patch is in transition, this file shows 0 to indicate the task is
+unpatched and 1 to indicate it's patched.  Otherwise, if no patch is in
+transition, it shows -1.  Any tasks which are blocking the transition
+can be signaled with SIGSTOP and SIGCONT to force them to change their
+patched state.
 
 
 4. Livepatch module
@@ -239,9 +323,15 @@  Registered patches might be enabled either by calling klp_enable_patch() or
 by writing '1' to /sys/kernel/livepatch/<name>/enabled. The system will
 start using the new implementation of the patched functions at this stage.
 
-In particular, if an original function is patched for the first time, a
-function specific struct klp_ops is created and an universal ftrace handler
-is registered.
+When a patch is enabled, livepatch enters into a transition state where
+tasks are converging to the patched state.  This is indicated by a value
+of '1' in /sys/kernel/livepatch/<name>/transition.  Once all tasks have
+been patched, the 'transition' value changes to '0'.  For more
+information about this process, see the "Consistency model" section.
+
+If an original function is patched for the first time, a function
+specific struct klp_ops is created and a universal ftrace handler is
+registered.
 
 Functions might be patched multiple times. The ftrace handler is registered
 only once for the given function. Further patches just add an entry to the
@@ -261,6 +351,12 @@  by writing '0' to /sys/kernel/livepatch/<name>/enabled. At this stage
 either the code from the previously enabled patch or even the original
 code gets used.
 
+When a patch is disabled, livepatch enters into a transition state where
+tasks are converging to the unpatched state.  This is indicated by a
+value of '1' in /sys/kernel/livepatch/<name>/transition.  Once all tasks
+have been unpatched, the 'transition' value changes to '0'.  For more
+information about this process, see the "Consistency model" section.
+
 Here all the functions (struct klp_func) associated with the to-be-disabled
 patch are removed from the corresponding struct klp_ops. The ftrace handler
 is unregistered and the struct klp_ops is freed when the func_stack list
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index f2cb8d4..12199ef 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -14,6 +14,7 @@ 
 #include <linux/rbtree.h>
 #include <net/net_namespace.h>
 #include <linux/sched/rt.h>
+#include <linux/livepatch.h>
 
 #ifdef CONFIG_SMP
 # define INIT_PUSHABLE_TASKS(tsk)					\
@@ -183,6 +184,13 @@  extern struct task_group root_task_group;
 # define INIT_KASAN(tsk)
 #endif
 
+#ifdef CONFIG_LIVEPATCH
+#define INIT_LIVEPATCH(tsk)						\
+	.patch_state = KLP_UNDEFINED,
+#else
+#define INIT_LIVEPATCH(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -260,6 +268,7 @@  extern struct task_group root_task_group;
 	INIT_VTIME(tsk)							\
 	INIT_NUMA_BALANCING(tsk)					\
 	INIT_KASAN(tsk)							\
+	INIT_LIVEPATCH(tsk)						\
 }
 
 
diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index c38c694..6ec50ff 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -28,18 +28,40 @@ 
 
 #include <asm/livepatch.h>
 
+/* task patch states */
+#define KLP_UNDEFINED	-1
+#define KLP_UNPATCHED	0
+#define KLP_PATCHED	1
+
 /**
  * struct klp_func - function structure for live patching
  * @old_name:	name of the function to be patched
  * @new_func:	pointer to the patched function code
  * @old_sympos: a hint indicating which symbol position the old function
  *		can be found (optional)
+ * @immediate:  patch the func immediately, bypassing backtrace safety checks
  * @old_addr:	the address of the function being patched
  * @kobj:	kobject for sysfs resources
  * @stack_node:	list node for klp_ops func_stack list
  * @old_size:	size of the old function
  * @new_size:	size of the new function
  * @patched:	the func has been added to the klp_ops list
+ * @transition:	the func is currently being applied or reverted
+ *
+ * The patched and transition variables define the func's patching state.  When
+ * patching, a func is always in one of the following states:
+ *
+ *   patched=0 transition=0: unpatched
+ *   patched=0 transition=1: unpatched, temporary starting state
+ *   patched=1 transition=1: patched, may be visible to some tasks
+ *   patched=1 transition=0: patched, visible to all tasks
+ *
+ * And when unpatching, it goes in the reverse order:
+ *
+ *   patched=1 transition=0: patched, visible to all tasks
+ *   patched=1 transition=1: patched, may be visible to some tasks
+ *   patched=0 transition=1: unpatched, temporary ending state
+ *   patched=0 transition=0: unpatched
  */
 struct klp_func {
 	/* external */
@@ -53,6 +75,7 @@  struct klp_func {
 	 * in kallsyms for the given object is used.
 	 */
 	unsigned long old_sympos;
+	bool immediate;
 
 	/* internal */
 	unsigned long old_addr;
@@ -60,6 +83,7 @@  struct klp_func {
 	struct list_head stack_node;
 	unsigned long old_size, new_size;
 	bool patched;
+	bool transition;
 };
 
 /**
@@ -86,6 +110,7 @@  struct klp_object {
  * struct klp_patch - patch structure for live patching
  * @mod:	reference to the live patch module
  * @objs:	object entries for kernel objects to be patched
+ * @immediate:  patch all funcs immediately, bypassing safety mechanisms
  * @list:	list node for global list of registered patches
  * @kobj:	kobject for sysfs resources
  * @enabled:	the patch is enabled (but operation may be incomplete)
@@ -94,6 +119,7 @@  struct klp_patch {
 	/* external */
 	struct module *mod;
 	struct klp_object *objs;
+	bool immediate;
 
 	/* internal */
 	struct list_head list;
@@ -116,15 +142,21 @@  int klp_disable_patch(struct klp_patch *);
 int klp_module_coming(struct module *mod);
 void klp_module_going(struct module *mod);
 
-static inline bool klp_patch_pending(struct task_struct *task) { return false; }
+void klp_copy_process(struct task_struct *child);
 void klp_patch_task(struct task_struct *task);
 
+static inline bool klp_patch_pending(struct task_struct *task)
+{
+	return test_tsk_thread_flag(task, TIF_PATCH_PENDING);
+}
+
 #else /* !CONFIG_LIVEPATCH */
 
 static inline int klp_module_coming(struct module *mod) { return 0; }
 static inline void klp_module_going(struct module *mod) {}
 static inline bool klp_patch_pending(struct task_struct *task) { return false; }
 static inline void klp_patch_task(struct task_struct *task) {}
+static inline void klp_copy_process(struct task_struct *child) {}
 
 #endif /* CONFIG_LIVEPATCH */
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fb364a0..7fc8b49 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1860,6 +1860,9 @@  struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_LIVEPATCH
+	int patch_state;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index d2fe04a..a12e3b0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -76,6 +76,7 @@ 
 #include <linux/compiler.h>
 #include <linux/sysctl.h>
 #include <linux/kcov.h>
+#include <linux/livepatch.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1586,6 +1587,8 @@  static struct task_struct *copy_process(unsigned long clone_flags,
 		p->parent_exec_id = current->self_exec_id;
 	}
 
+	klp_copy_process(p);
+
 	spin_lock(&current->sighand->siglock);
 
 	/*
diff --git a/kernel/livepatch/Makefile b/kernel/livepatch/Makefile
index e136dad..2b8bdb1 100644
--- a/kernel/livepatch/Makefile
+++ b/kernel/livepatch/Makefile
@@ -1,3 +1,3 @@ 
 obj-$(CONFIG_LIVEPATCH) += livepatch.o
 
-livepatch-objs := core.o patch.o
+livepatch-objs := core.o patch.o transition.o
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index aa3dbdf..0be352f 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -31,12 +31,15 @@ 
 #include <linux/moduleloader.h>
 #include <asm/cacheflush.h>
 #include "patch.h"
+#include "transition.h"
 
 /*
- * The klp_mutex protects the global lists and state transitions of any
- * structure reachable from them.  References to any structure must be obtained
- * under mutex protection (except in klp_ftrace_handler(), which uses RCU to
- * ensure it gets consistent data).
+ * klp_mutex is a coarse lock which serializes access to klp data.  All
+ * accesses to klp-related variables and structures must have mutex protection,
+ * except within the following functions which carefully avoid the need for it:
+ *
+ * - klp_ftrace_handler()
+ * - klp_patch_task()
  */
 static DEFINE_MUTEX(klp_mutex);
 
@@ -44,8 +47,28 @@  static LIST_HEAD(klp_patches);
 
 static struct kobject *klp_root_kobj;
 
-/* TODO: temporary stub */
-void klp_patch_task(struct task_struct *task) {}
+static void klp_work_fn(struct work_struct *work);
+static DECLARE_DELAYED_WORK(klp_work, klp_work_fn);
+
+static void klp_schedule_work(void)
+{
+	schedule_delayed_work(&klp_work, round_jiffies_relative(HZ));
+}
+
+/*
+ * This work can be performed periodically to finish patching or unpatching any
+ * "straggler" tasks which failed to transition in klp_enable_patch().
+ */
+static void klp_work_fn(struct work_struct *work)
+{
+	mutex_lock(&klp_mutex);
+
+	if (klp_transition_patch)
+		if (!klp_try_complete_transition())
+			klp_schedule_work();
+
+	mutex_unlock(&klp_mutex);
+}
 
 static bool klp_is_module(struct klp_object *obj)
 {
@@ -85,7 +108,6 @@  static void klp_find_object_module(struct klp_object *obj)
 	mutex_unlock(&module_mutex);
 }
 
-/* klp_mutex must be held by caller */
 static bool klp_is_patch_registered(struct klp_patch *patch)
 {
 	struct klp_patch *mypatch;
@@ -283,19 +305,18 @@  static int klp_write_object_relocations(struct module *pmod,
 
 static int __klp_disable_patch(struct klp_patch *patch)
 {
-	struct klp_object *obj;
+	if (klp_transition_patch)
+		return -EBUSY;
 
 	/* enforce stacking: only the last enabled patch can be disabled */
 	if (!list_is_last(&patch->list, &klp_patches) &&
 	    list_next_entry(patch, list)->enabled)
 		return -EBUSY;
 
-	pr_notice("disabling patch '%s'\n", patch->mod->name);
-
-	klp_for_each_object(patch, obj) {
-		if (obj->patched)
-			klp_unpatch_object(obj);
-	}
+	klp_init_transition(patch, KLP_UNPATCHED);
+	klp_start_transition();
+	if (!klp_try_complete_transition())
+		klp_schedule_work();
 
 	patch->enabled = false;
 
@@ -339,6 +360,9 @@  static int __klp_enable_patch(struct klp_patch *patch)
 	struct klp_object *obj;
 	int ret;
 
+	if (klp_transition_patch)
+		return -EBUSY;
+
 	if (WARN_ON(patch->enabled))
 		return -EINVAL;
 
@@ -350,24 +374,32 @@  static int __klp_enable_patch(struct klp_patch *patch)
 	pr_notice_once("tainting kernel with TAINT_LIVEPATCH\n");
 	add_taint(TAINT_LIVEPATCH, LOCKDEP_STILL_OK);
 
-	pr_notice("enabling patch '%s'\n", patch->mod->name);
+	klp_init_transition(patch, KLP_PATCHED);
 
 	klp_for_each_object(patch, obj) {
 		if (!klp_is_object_loaded(obj))
 			continue;
 
 		ret = klp_patch_object(obj);
-		if (ret)
-			goto unregister;
+		if (ret) {
+			pr_warn("failed to enable patch '%s'\n",
+				patch->mod->name);
+
+			klp_unpatch_objects(patch);
+			klp_complete_transition();
+
+			return ret;
+		}
 	}
 
+	klp_start_transition();
+
+	if (!klp_try_complete_transition())
+		klp_schedule_work();
+
 	patch->enabled = true;
 
 	return 0;
-
-unregister:
-	WARN_ON(__klp_disable_patch(patch));
-	return ret;
 }
 
 /**
@@ -404,6 +436,7 @@  EXPORT_SYMBOL_GPL(klp_enable_patch);
  * /sys/kernel/livepatch
  * /sys/kernel/livepatch/<patch>
  * /sys/kernel/livepatch/<patch>/enabled
+ * /sys/kernel/livepatch/<patch>/transition
  * /sys/kernel/livepatch/<patch>/<object>
  * /sys/kernel/livepatch/<patch>/<object>/<function,sympos>
  */
@@ -432,7 +465,9 @@  static ssize_t enabled_store(struct kobject *kobj, struct kobj_attribute *attr,
 		goto err;
 	}
 
-	if (val) {
+	if (patch == klp_transition_patch) {
+		klp_reverse_transition();
+	} else if (val) {
 		ret = __klp_enable_patch(patch);
 		if (ret)
 			goto err;
@@ -460,9 +495,21 @@  static ssize_t enabled_show(struct kobject *kobj,
 	return snprintf(buf, PAGE_SIZE-1, "%d\n", patch->enabled);
 }
 
+static ssize_t transition_show(struct kobject *kobj,
+			       struct kobj_attribute *attr, char *buf)
+{
+	struct klp_patch *patch;
+
+	patch = container_of(kobj, struct klp_patch, kobj);
+	return snprintf(buf, PAGE_SIZE-1, "%d\n",
+			patch == klp_transition_patch);
+}
+
 static struct kobj_attribute enabled_kobj_attr = __ATTR_RW(enabled);
+static struct kobj_attribute transition_kobj_attr = __ATTR_RO(transition);
 static struct attribute *klp_patch_attrs[] = {
 	&enabled_kobj_attr.attr,
+	&transition_kobj_attr.attr,
 	NULL
 };
 
@@ -549,6 +596,7 @@  static int klp_init_func(struct klp_object *obj, struct klp_func *func)
 {
 	INIT_LIST_HEAD(&func->stack_node);
 	func->patched = false;
+	func->transition = false;
 
 	/* The format for the sysfs directory is <function,sympos> where sympos
 	 * is the nth occurrence of this symbol in kallsyms for the patched
@@ -781,7 +829,11 @@  int klp_module_coming(struct module *mod)
 				goto err;
 			}
 
-			if (!patch->enabled)
+			/*
+			 * Only patch the module if the patch is enabled or is
+			 * in transition.
+			 */
+			if (!patch->enabled && patch != klp_transition_patch)
 				break;
 
 			pr_notice("applying patch '%s' to loading module '%s'\n",
diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
index 782fbb5..b3b8639 100644
--- a/kernel/livepatch/patch.c
+++ b/kernel/livepatch/patch.c
@@ -29,6 +29,7 @@ 
 #include <linux/bug.h>
 #include <linux/printk.h>
 #include "patch.h"
+#include "transition.h"
 
 static LIST_HEAD(klp_ops);
 
@@ -58,11 +59,42 @@  static void notrace klp_ftrace_handler(unsigned long ip,
 	ops = container_of(fops, struct klp_ops, fops);
 
 	rcu_read_lock();
+
 	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
 				      stack_node);
-	if (WARN_ON_ONCE(!func))
+
+	if (!func)
 		goto unlock;
 
+	/*
+	 * See the comment for the 2nd smp_wmb() in klp_init_transition() for
+	 * an explanation of why this read barrier is needed.
+	 */
+	smp_rmb();
+
+	if (unlikely(func->transition)) {
+
+		/*
+		 * See the comment for the 1st smp_wmb() in
+		 * klp_init_transition() for an explanation of why this read
+		 * barrier is needed.
+		 */
+		smp_rmb();
+
+		if (current->patch_state == KLP_UNPATCHED) {
+			/*
+			 * Use the previously patched version of the function.
+			 * If no previous patches exist, use the original
+			 * function.
+			 */
+			func = list_entry_rcu(func->stack_node.next,
+					      struct klp_func, stack_node);
+
+			if (&func->stack_node == &ops->func_stack)
+				goto unlock;
+		}
+	}
+
 	klp_arch_set_pc(regs, (unsigned long)func->new_func);
 unlock:
 	rcu_read_unlock();
@@ -211,3 +243,12 @@  int klp_patch_object(struct klp_object *obj)
 
 	return 0;
 }
+
+void klp_unpatch_objects(struct klp_patch *patch)
+{
+	struct klp_object *obj;
+
+	klp_for_each_object(patch, obj)
+		if (obj->patched)
+			klp_unpatch_object(obj);
+}
diff --git a/kernel/livepatch/patch.h b/kernel/livepatch/patch.h
index 2d0cce0..0db2271 100644
--- a/kernel/livepatch/patch.h
+++ b/kernel/livepatch/patch.h
@@ -28,5 +28,6 @@  struct klp_ops *klp_find_ops(unsigned long old_addr);
 
 int klp_patch_object(struct klp_object *obj);
 void klp_unpatch_object(struct klp_object *obj);
+void klp_unpatch_objects(struct klp_patch *patch);
 
 #endif /* _LIVEPATCH_PATCH_H */
diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
new file mode 100644
index 0000000..92819bb
--- /dev/null
+++ b/kernel/livepatch/transition.c
@@ -0,0 +1,474 @@ 
+/*
+ * transition.c - Kernel Live Patching transition functions
+ *
+ * Copyright (C) 2015 Josh Poimboeuf <jpoimboe@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/cpu.h>
+#include <linux/stacktrace.h>
+#include "../sched/sched.h"
+
+#include "patch.h"
+#include "transition.h"
+
+#define MAX_STACK_ENTRIES 100
+
+struct klp_patch *klp_transition_patch;
+
+static int klp_target_state;
+
+/* called from copy_process() during fork */
+void klp_copy_process(struct task_struct *child)
+{
+	child->patch_state = current->patch_state;
+
+	/* TIF_PATCH_PENDING gets copied in setup_thread_stack() */
+}
+
+/*
+ * klp_patch_task() - change the patched state of a task
+ * @task:	The task to change
+ *
+ * Switches the patched state of the task to the set of functions in the target
+ * patch state.
+ */
+void klp_patch_task(struct task_struct *task)
+{
+	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
+
+	/*
+	 * The corresponding write barriers are in klp_init_transition() and
+	 * klp_reverse_transition().  See the comments there for an explanation.
+	 */
+	smp_rmb();
+
+	task->patch_state = klp_target_state;
+}
+
+/*
+ * Initialize the global target patch state and all tasks to the initial patch
+ * state, and initialize all function transition states to true in preparation
+ * for patching or unpatching.
+ */
+void klp_init_transition(struct klp_patch *patch, int state)
+{
+	struct task_struct *g, *task;
+	unsigned int cpu;
+	struct klp_object *obj;
+	struct klp_func *func;
+	int initial_state = !state;
+
+	klp_transition_patch = patch;
+
+	/*
+	 * If the patch can be applied or reverted immediately, skip the
+	 * per-task transitions.
+	 */
+	if (patch->immediate)
+		return;
+
+	/*
+	 * Initialize all tasks to the initial patch state to prepare them for
+	 * switching to the target state.
+	 */
+	read_lock(&tasklist_lock);
+	for_each_process_thread(g, task)
+		task->patch_state = initial_state;
+	read_unlock(&tasklist_lock);
+
+	/*
+	 * Ditto for the idle "swapper" tasks.
+	 */
+	get_online_cpus();
+	for_each_online_cpu(cpu)
+		idle_task(cpu)->patch_state = initial_state;
+	put_online_cpus();
+
+	/*
+	 * Ensure klp_ftrace_handler() sees the task->patch_state updates
+	 * before the func->transition updates.  Otherwise it could read an
+	 * out-of-date task state and pick the wrong function.
+	 */
+	smp_wmb();
+
+	/*
+	 * Set the func transition states so klp_ftrace_handler() will know to
+	 * switch to the transition logic.
+	 *
+	 * When patching, the funcs aren't yet in the func_stack and will be
+	 * made visible to the ftrace handler shortly by the calls to
+	 * klp_patch_object().
+	 *
+	 * When unpatching, the funcs are already in the func_stack and so are
+	 * already visible to the ftrace handler.
+	 */
+	klp_for_each_object(patch, obj)
+		klp_for_each_func(obj, func)
+			func->transition = true;
+
+	/*
+	 * Set the global target patch state which tasks will switch to.  This
+	 * has no effect until the TIF_PATCH_PENDING flags get set later.
+	 */
+	klp_target_state = state;
+
+	/*
+	 * For the enable path, ensure klp_ftrace_handler() will see the
+	 * func->transition updates before the funcs become visible to the
+	 * handler.  Otherwise the handler may wrongly pick the new func before
+	 * the task switches to the patched state.
+	 *
+	 * For the disable path, the funcs are already visible to the handler.
+	 * But we still need to ensure the ftrace handler will see the
+	 * func->transition updates before the tasks start switching to the
+	 * unpatched state.  Otherwise the handler can miss a task patch state
+	 * change which would result in it wrongly picking the new function.
+	 *
+	 * This barrier also ensures that if another CPU goes through the
+	 * syscall barrier, sees the TIF_PATCH_PENDING writes in
+	 * klp_start_transition(), and calls klp_patch_task(), it also sees the
+	 * above write to the target state.  Otherwise it can put the task in
+	 * the wrong patch state.
+	 */
+	smp_wmb();
+}
+
+/*
+ * Start the transition to the specified target patch state so tasks can begin
+ * switching to it.
+ */
+void klp_start_transition(void)
+{
+	struct task_struct *g, *task;
+	unsigned int cpu;
+
+	pr_notice("'%s': %s...\n", klp_transition_patch->mod->name,
+		  klp_target_state == KLP_PATCHED ? "patching" : "unpatching");
+
+	/*
+	 * If the patch can be applied or reverted immediately, skip the
+	 * per-task transitions.
+	 */
+	if (klp_transition_patch->immediate)
+		return;
+
+	/*
+	 * Mark all normal tasks as needing a patch state update.  As they pass
+	 * through the syscall barrier they'll switch over to the target state
+	 * (unless we switch them in klp_try_complete_transition() first).
+	 */
+	read_lock(&tasklist_lock);
+	for_each_process_thread(g, task)
+		set_tsk_thread_flag(task, TIF_PATCH_PENDING);
+	read_unlock(&tasklist_lock);
+
+	/*
+	 * Ditto for the idle "swapper" tasks, though they never cross the
+	 * syscall barrier.  Instead they switch over in cpu_idle_loop().
+	 */
+	get_online_cpus();
+	for_each_online_cpu(cpu)
+		set_tsk_thread_flag(idle_task(cpu), TIF_PATCH_PENDING);
+	put_online_cpus();
+}
+
+/*
+ * The transition to the target patch state is complete.  Clean up the data
+ * structures.
+ */
+void klp_complete_transition(void)
+{
+	struct klp_object *obj;
+	struct klp_func *func;
+	struct task_struct *g, *task;
+	unsigned int cpu;
+
+	if (klp_transition_patch->immediate)
+		goto done;
+
+	klp_for_each_object(klp_transition_patch, obj)
+		klp_for_each_func(obj, func)
+			func->transition = false;
+
+	read_lock(&tasklist_lock);
+	for_each_process_thread(g, task) {
+		clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
+		task->patch_state = KLP_UNDEFINED;
+	}
+	read_unlock(&tasklist_lock);
+
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		task = idle_task(cpu);
+		clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
+		task->patch_state = KLP_UNDEFINED;
+	}
+	put_online_cpus();
+
+done:
+	klp_transition_patch = NULL;
+}
+
+/*
+ * Determine whether the given stack trace includes any references to a
+ * to-be-patched or to-be-unpatched function.
+ */
+static int klp_check_stack_func(struct klp_func *func,
+				struct stack_trace *trace)
+{
+	unsigned long func_addr, func_size, address;
+	struct klp_ops *ops;
+	int i;
+
+	if (func->immediate)
+		return 0;
+
+	for (i = 0; i < trace->nr_entries; i++) {
+		address = trace->entries[i];
+
+		if (klp_target_state == KLP_UNPATCHED) {
+			/*
+			 * Check for the to-be-unpatched function
+			 * (the func itself).
+			 */
+			func_addr = (unsigned long)func->new_func;
+			func_size = func->new_size;
+		} else {
+			/*
+			 * Check for the to-be-patched function
+			 * (the previous func).
+			 */
+			ops = klp_find_ops(func->old_addr);
+
+			if (list_is_singular(&ops->func_stack)) {
+				/* original function */
+				func_addr = func->old_addr;
+				func_size = func->old_size;
+			} else {
+				/* previously patched function */
+				struct klp_func *prev;
+
+				prev = list_next_entry(func, stack_node);
+				func_addr = (unsigned long)prev->new_func;
+				func_size = prev->new_size;
+			}
+		}
+
+		if (address >= func_addr && address < func_addr + func_size)
+			return -EAGAIN;
+	}
+
+	return 0;
+}
+
+/*
+ * Determine whether it's safe to transition the task to the target patch state
+ * by looking for any to-be-patched or to-be-unpatched functions on its stack.
+ */
+static int klp_check_stack(struct task_struct *task)
+{
+	static unsigned long entries[MAX_STACK_ENTRIES];
+	struct stack_trace trace;
+	struct klp_object *obj;
+	struct klp_func *func;
+	int ret;
+
+	trace.skip = 0;
+	trace.nr_entries = 0;
+	trace.max_entries = MAX_STACK_ENTRIES;
+	trace.entries = entries;
+	ret = save_stack_trace_tsk_reliable(task, &trace);
+	WARN_ON_ONCE(ret == -ENOSYS);
+	if (ret) {
+		pr_debug("%s: pid %d (%s) has an unreliable stack\n",
+			 __func__, task->pid, task->comm);
+		return ret;
+	}
+
+	klp_for_each_object(klp_transition_patch, obj) {
+		if (!obj->patched)
+			continue;
+		klp_for_each_func(obj, func) {
+			ret = klp_check_stack_func(func, &trace);
+			if (ret) {
+				pr_debug("%s: pid %d (%s) is sleeping on function %s\n",
+					 __func__, task->pid, task->comm,
+					 func->old_name);
+				return ret;
+			}
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Try to safely switch a task to the target patch state.  If it's currently
+ * running, or it's sleeping on a to-be-patched or to-be-unpatched function, or
+ * if the stack is unreliable, return false.
+ */
+static bool klp_try_switch_task(struct task_struct *task)
+{
+	struct rq *rq;
+	unsigned long flags;
+	int ret;
+	bool success = false;
+
+	/* check if this task has already switched over */
+	if (task->patch_state == klp_target_state)
+		return true;
+
+	/*
+	 * For arches which don't have reliable stack traces, we have to rely
+	 * on other methods (e.g., switching tasks at the syscall barrier).
+	 */
+	if (!IS_ENABLED(CONFIG_RELIABLE_STACKTRACE))
+		return false;
+
+	/*
+	 * Now try to check the stack for any to-be-patched or to-be-unpatched
+	 * functions.  If all goes well, switch the task to the target patch
+	 * state.
+	 */
+	rq = task_rq_lock(task, &flags);
+
+	if (task_running(rq, task) && task != current) {
+		pr_debug("%s: pid %d (%s) is running\n", __func__, task->pid,
+			 task->comm);
+		goto done;
+	}
+
+	ret = klp_check_stack(task);
+	if (ret)
+		goto done;
+
+	success = true;
+
+	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
+	task->patch_state = klp_target_state;
+
+done:
+	task_rq_unlock(rq, task, &flags);
+	return success;
+}
+
+/*
+ * Try to switch all remaining tasks to the target patch state by walking the
+ * stacks of sleeping tasks and looking for any to-be-patched or
+ * to-be-unpatched functions.  If such functions are found, the task can't be
+ * switched yet.
+ *
+ * If any tasks are still stuck in the initial patch state, schedule a retry.
+ */
+bool klp_try_complete_transition(void)
+{
+	unsigned int cpu;
+	struct task_struct *g, *task;
+	bool complete = true;
+
+	/*
+	 * If the patch can be applied or reverted immediately, skip the
+	 * per-task transitions.
+	 */
+	if (klp_transition_patch->immediate)
+		goto success;
+
+	/*
+	 * Try to switch the tasks to the target patch state by walking their
+	 * stacks and looking for any to-be-patched or to-be-unpatched
+	 * functions.  If such functions are found on a stack, or if the stack
+	 * is deemed unreliable, the task can't be switched yet.
+	 *
+	 * Usually this will transition most (or all) of the tasks on a system
+	 * unless the patch includes changes to a very common function.
+	 */
+	read_lock(&tasklist_lock);
+	for_each_process_thread(g, task)
+		if (!klp_try_switch_task(task))
+			complete = false;
+	read_unlock(&tasklist_lock);
+
+	/*
+	 * Ditto for the idle "swapper" tasks.
+	 */
+	get_online_cpus();
+	for_each_online_cpu(cpu)
+		if (!klp_try_switch_task(idle_task(cpu)))
+			complete = false;
+	put_online_cpus();
+
+	/*
+	 * Some tasks couldn't be switched over.  Try again later and/or wait
+	 * for other methods like syscall barrier switching.
+	 */
+	if (!complete)
+		return false;
+
+success:
+	/*
+	 * When unpatching, all tasks have transitioned to KLP_UNPATCHED so we
+	 * can now remove the new functions from the func_stack.
+	 */
+	if (klp_target_state == KLP_UNPATCHED) {
+		klp_unpatch_objects(klp_transition_patch);
+
+		/*
+		 * Don't allow any existing instances of ftrace handlers to
+		 * access any obsolete funcs before we reset the func
+		 * transition states to false.  Otherwise the handler may see
+		 * the deleted "new" func, see that it's not in transition, and
+		 * wrongly pick the new version of the function.
+		 */
+		synchronize_rcu();
+	}
+
+	pr_notice("'%s': %s complete\n", klp_transition_patch->mod->name,
+		  klp_target_state == KLP_PATCHED ? "patching" : "unpatching");
+
+	/* we're done, now clean up the data structures */
+	klp_complete_transition();
+
+	return true;
+}
+
+/*
+ * This function can be called in the middle of an existing transition to
+ * reverse the direction of the target patch state.  This effectively cancels
+ * an in-progress enable or disable operation when some tasks are stuck in
+ * the initial patch state.
+ */
+void klp_reverse_transition(void)
+{
+	struct klp_patch *patch = klp_transition_patch;
+
+	klp_target_state = !klp_target_state;
+
+	/*
+	 * Ensure that if another CPU goes through the syscall barrier, sees
+	 * the TIF_PATCH_PENDING writes in klp_start_transition(), and calls
+	 * klp_patch_task(), it also sees the above write to the target state.
+	 * Otherwise it can put the task in the wrong patch state.
+	 */
+	smp_wmb();
+
+	klp_start_transition();
+	klp_try_complete_transition();
+
+	patch->enabled = !patch->enabled;
+}
+
diff --git a/kernel/livepatch/transition.h b/kernel/livepatch/transition.h
new file mode 100644
index 0000000..5191b96
--- /dev/null
+++ b/kernel/livepatch/transition.h
@@ -0,0 +1,14 @@ 
+#ifndef _LIVEPATCH_TRANSITION_H
+#define _LIVEPATCH_TRANSITION_H
+
+#include <linux/livepatch.h>
+
+extern struct klp_patch *klp_transition_patch;
+
+void klp_init_transition(struct klp_patch *patch, int state);
+void klp_start_transition(void);
+void klp_reverse_transition(void);
+bool klp_try_complete_transition(void);
+void klp_complete_transition(void);
+
+#endif /* _LIVEPATCH_TRANSITION_H */
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index bd12c6c..60d633f 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -9,6 +9,7 @@ 
 #include <linux/mm.h>
 #include <linux/stackprotector.h>
 #include <linux/suspend.h>
+#include <linux/livepatch.h>
 
 #include <asm/tlb.h>
 
@@ -266,6 +267,9 @@  static void cpu_idle_loop(void)
 
 		sched_ttwu_pending();
 		schedule_preempt_disabled();
+
+		if (unlikely(klp_patch_pending(current)))
+			klp_patch_task(current);
 	}
 }