arc_usr_cmpxchg and preemption

Message ID 1521045375.11552.27.camel@synopsys.com
State New
Headers show
Series
  • arc_usr_cmpxchg and preemption
Related show

Commit Message

Alexey Brodkin March 14, 2018, 4:36 p.m.
Hi Vineet,

While debugging a segfault of user-space app on system without atomic ops
(I mean LLOCK/SCOND) I understood the root-cause is in implementation
of kernel's __NR_arc_usr_cmpxchg syscall which is supposed to emulate mentioned
atomic ops for user-space.

So here's a problem.

1. User-space app [via libc] triggers __NR_arc_usr_cmpxchg syscall,
   we enter arc_usr_cmpxchg()which basically does:
   ---------------------------->8-------------------------------
     preempt_disable();
     __get_user(uval, uaddr);
     __put_user(new, uaddr);
     preempt_enable();
   ---------------------------->8-------------------------------

2. Most of the time everything is fine because __get_user()/__put_user()
   for ARC is just LD/ST.

3. Rarely user's variable is situated in not yet allocated page.
   Here I mean copy-on-write case, when a page has read-only flag in TLB.
   In that case __get_user() succeeds but __put_user() causes Privilege
   Violation exception and we enter do_page_fault() where new page allocation
   with proper access bits is supposed to happen... but that never happens
   because with preempt_disable() we set in_atomic() which set
   faulthandler_disabled() and so we exit early from page fault handler
   effectively with nothing done, i.e. user's variable is left unchanged
   which in its turn causes very strange problems later down the line because
   we don't notify user-space app about failed data modification.

The simplest fix is to not mess with preemption:
   ---------------------------->8-------------------------------

But I'm not really sure how safe is that.
Especially if we think about PREEMPT_RT for example.

Any thoughts?

-Alexey

P.S. In our emulation of unaligned access we don't seem to disable preemption so
it might be a good idea to "align" syscall implementation is both cases.

Comments

Vineet Gupta March 14, 2018, 4:58 p.m. | #1
+CC linux-arch, Peter for any preemption insights !

On 03/14/2018 09:36 AM, Alexey Brodkin wrote:
> Hi Vineet,
> 
> While debugging a segfault of user-space app on system without atomic ops
> (I mean LLOCK/SCOND) I understood the root-cause is in implementation
> of kernel's __NR_arc_usr_cmpxchg syscall which is supposed to emulate mentioned
> atomic ops for user-space.
> 
> So here's a problem.
> 
> 1. User-space app [via libc] triggers __NR_arc_usr_cmpxchg syscall,
>     we enter arc_usr_cmpxchg()which basically does:
>     ---------------------------->8-------------------------------
>       preempt_disable();
>       __get_user(uval, uaddr);
>       __put_user(new, uaddr);
>       preempt_enable();
>     ---------------------------->8-------------------------------
> 
> 2. Most of the time everything is fine because __get_user()/__put_user()
>     for ARC is just LD/ST.
> 
> 3. Rarely user's variable is situated in not yet allocated page.
>     Here I mean copy-on-write case, when a page has read-only flag in TLB.
>     In that case __get_user() succeeds but __put_user() causes Privilege
>     Violation exception and we enter do_page_fault() where new page allocation
>     with proper access bits is supposed to happen... but that never happens
>     because with preempt_disable() we set in_atomic() which set
>     faulthandler_disabled() and so we exit early from page fault handler
>     effectively with nothing done, i.e. user's variable is left unchanged
>     which in its turn causes very strange problems later down the line because
>     we don't notify user-space app about failed data modification.

Interesting problem ! But what is special here, I would think syscalls in general 
could hit this.

> 
> The simplest fix is to not mess with preemption:
>     ---------------------------->8-------------------------------
> diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
> index 5ac3b547453f..d1713d8d3981 100644
> --- a/arch/arc/kernel/process.c
> +++ b/arch/arc/kernel/process.c
> @@ -63,8 +63,6 @@ SYSCALL_DEFINE3(arc_usr_cmpxchg, int *, uaddr, int, expected, int, new)
>          if (!access_ok(VERIFY_WRITE, uaddr, sizeof(int)))
>                  return -EFAULT;
>   
> -       preempt_disable();
> -
>          if (__get_user(uval, uaddr))
>                  goto done;

...
...

>   done:
> -       preempt_enable();
> -
>          return uval;
>   }
>     ---------------------------->8-------------------------------
> 
> But I'm not really sure how safe is that.

Well it is broken wrt the semantics the syscall is supposed to provide. Preemption 
disabling is what prevents a concurrent thread from coming in and modifying the 
same location (Imagine a variable which is being cmpxchg concurrently by 2 threads).

One approach is to do it the MIPS way, emulate the llsc flag - set it under 
preemption disabled section and clear it in switch_to

see arch/mips/kernel/syscall.c
Peter Zijlstra March 14, 2018, 5:53 p.m. | #2
On Wed, Mar 14, 2018 at 09:58:19AM -0700, Vineet Gupta wrote:

> Well it is broken wrt the semantics the syscall is supposed to provide.
> Preemption disabling is what prevents a concurrent thread from coming in and
> modifying the same location (Imagine a variable which is being cmpxchg
> concurrently by 2 threads).
> 
> One approach is to do it the MIPS way, emulate the llsc flag - set it under
> preemption disabled section and clear it in switch_to

*shudder*... just catch the -EFAULT, force the write fault and retry.

Something like:

int sys_cmpxchg(u32 __user *user_ptr, u32 old, u32 new)
{
	u32 val;
	int ret;

again:
	ret = 0;

	preempt_disable();
	val = get_user(user_ptr);
	if (val == old)
		ret = put_user(new, user_ptr);
	preempt_enable();

	if (ret == -EFAULT) {
		struct page *page;
		ret = get_user_pages_fast((unsigned long)user_ptr, 1, 1, &page);
		if (ret < 0)
			return ret;
		put_page(page);
		goto again;
	}

	return ret;
}
Peter Zijlstra March 14, 2018, 6:20 p.m. | #3
On Wed, Mar 14, 2018 at 06:53:52PM +0100, Peter Zijlstra wrote:
> On Wed, Mar 14, 2018 at 09:58:19AM -0700, Vineet Gupta wrote:
> 
> > Well it is broken wrt the semantics the syscall is supposed to provide.
> > Preemption disabling is what prevents a concurrent thread from coming in and
> > modifying the same location (Imagine a variable which is being cmpxchg
> > concurrently by 2 threads).
> > 
> > One approach is to do it the MIPS way, emulate the llsc flag - set it under
> > preemption disabled section and clear it in switch_to
> 
> *shudder*... just catch the -EFAULT, force the write fault and retry.
> 
> Something like:
> 
> int sys_cmpxchg(u32 __user *user_ptr, u32 old, u32 new)
> {
> 	u32 val;
> 	int ret;

Also very important:

	if ((unsigned long)user_ptr & 0x3)
		return -EINVAL;

You must disallow unaligned atomics, otherwise the below can cross a
page-boundary (and unaligned atomics are insane in any case).

> again:
> 	ret = 0;
> 
> 	preempt_disable();
> 	val = get_user(user_ptr);
> 	if (val == old)
> 		ret = put_user(new, user_ptr);
> 	preempt_enable();
> 
> 	if (ret == -EFAULT) {
> 		struct page *page;
> 		ret = get_user_pages_fast((unsigned long)user_ptr, 1, 1, &page);
> 		if (ret < 0)
> 			return ret;
> 		put_page(page);
> 		goto again;
> 	}
> 
> 	return ret;
> }
Alexey Brodkin March 14, 2018, 8:38 p.m. | #4
Hi Peter, Vineet,

On Wed, 2018-03-14 at 18:53 +0100, Peter Zijlstra wrote:
> On Wed, Mar 14, 2018 at 09:58:19AM -0700, Vineet Gupta wrote:
> 
> > Well it is broken wrt the semantics the syscall is supposed to provide.
> > Preemption disabling is what prevents a concurrent thread from coming in and
> > modifying the same location (Imagine a variable which is being cmpxchg
> > concurrently by 2 threads).
> > 
> > One approach is to do it the MIPS way, emulate the llsc flag - set it under
> > preemption disabled section and clear it in switch_to
> 
> *shudder*... just catch the -EFAULT, force the write fault and retry.
> 
> Something like:
> 
> int sys_cmpxchg(u32 __user *user_ptr, u32 old, u32 new)
> {
> 	u32 val;
> 	int ret;
> 
> again:
> 	ret = 0;
> 
> 	preempt_disable();
> 	val = get_user(user_ptr);
> 	if (val == old)
> 		ret = put_user(new, user_ptr);
> 	preempt_enable();
> 
> 	if (ret == -EFAULT) {
> 		struct page *page;
> 		ret = get_user_pages_fast((unsigned long)user_ptr, 1, 1, &page);
> 		if (ret < 0)
> 			return ret;
> 		put_page(page);
> 		goto again;

I guess this jump we need to do only once, right?
If for whatever reason get_user_pages_fast() fails we return immediately
and if it succeeds there's no reason for put_user() to not succeed as
required page is supposed to be prepared for write.

Otherwise if something goes way too bad we may end-up in an infinite loop
which we'd better prevent.

> 	}
> 
> 	return ret;
> }

@Vineet, are you OK with proposed implementation?

-Alexey
Vineet Gupta March 14, 2018, 8:55 p.m. | #5
On 03/14/2018 01:38 PM, Alexey Brodkin wrote:
> @Vineet, are you OK with proposed implementation?

I couldn't agree any more !

-Vineet
Peter Zijlstra March 15, 2018, 8:18 a.m. | #6
On Wed, Mar 14, 2018 at 08:38:53PM +0000, Alexey Brodkin wrote:
> > int sys_cmpxchg(u32 __user *user_ptr, u32 old, u32 new)
> > {
> > 	u32 val;
> > 	int ret;
> > 
> > again:
> > 	ret = 0;
> > 
> > 	preempt_disable();
> > 	val = get_user(user_ptr);
> > 	if (val == old)
> > 		ret = put_user(new, user_ptr);
> > 	preempt_enable();
> > 
> > 	if (ret == -EFAULT) {
> > 		struct page *page;
> > 		ret = get_user_pages_fast((unsigned long)user_ptr, 1, 1, &page);
> > 		if (ret < 0)
> > 			return ret;
> > 		put_page(page);
> > 		goto again;
> 
> I guess this jump we need to do only once, right?

Typically, yes. It is theoretically possible for the page to get
paged-out right after we do put_page() and before we do get/put_user(),
But if that happens the machine likely has bigger problems than having
to do this loop again.

FWIW, look at kernel/futex.c for working examples of this pattern, the
above was written purely from memory and could contain a fail or two ;-)

Also, it might make sense to stuff this implementation in some lib/ file
somewhere and make all platforms that need it use the same code, afaict
there really isn't anything platform specific to it.
Alexey Brodkin March 15, 2018, 9:12 a.m. | #7
Hi Peter,

On Thu, 2018-03-15 at 09:18 +0100, Peter Zijlstra wrote:
> On Wed, Mar 14, 2018 at 08:38:53PM +0000, Alexey Brodkin wrote:
> > > int sys_cmpxchg(u32 __user *user_ptr, u32 old, u32 new)
> > > {
> > > 	u32 val;
> > > 	int ret;
> > > 
> > > again:
> > > 	ret = 0;
> > > 
> > > 	preempt_disable();
> > > 	val = get_user(user_ptr);
> > > 	if (val == old)
> > > 		ret = put_user(new, user_ptr);
> > > 	preempt_enable();
> > > 
> > > 	if (ret == -EFAULT) {
> > > 		struct page *page;
> > > 		ret = get_user_pages_fast((unsigned long)user_ptr, 1, 1, &page);
> > > 		if (ret < 0)
> > > 			return ret;
> > > 		put_page(page);
> > > 		goto again;
> > 
> > I guess this jump we need to do only once, right?
> 
> Typically, yes. It is theoretically possible for the page to get
> paged-out right after we do put_page() and before we do get/put_user(),
> But if that happens the machine likely has bigger problems than having
> to do this loop again.
> 
> FWIW, look at kernel/futex.c for working examples of this pattern, the
> above was written purely from memory and could contain a fail or two ;-)

Thanks for the pointer!

> Also, it might make sense to stuff this implementation in some lib/ file
> somewhere and make all platforms that need it use the same code, afaict
> there really isn't anything platform specific to it.

Not clear which part do you mean here.
Are you talking about entire cmpxchg syscall implementation?

Do you think there're many users of that quite an inefficient
[compared to proper HW version] atomic exchange?

-Alexey
Peter Zijlstra March 15, 2018, 11:28 a.m. | #8
On Thu, Mar 15, 2018 at 09:12:09AM +0000, Alexey Brodkin wrote:
> On Thu, 2018-03-15 at 09:18 +0100, Peter Zijlstra wrote:

> > Also, it might make sense to stuff this implementation in some lib/ file
> > somewhere and make all platforms that need it use the same code, afaict
> > there really isn't anything platform specific to it.
> 
> Not clear which part do you mean here.
> Are you talking about entire cmpxchg syscall implementation?

Yep.

> Do you think there're many users of that quite an inefficient
> [compared to proper HW version] atomic exchange?

I think there's a bunch of architectures that are in the same boat.
m68k, arm, mips was mentioned. Sure, the moment an arch has hardware
support you don't need the syscall anymore.

I was just thinking it would be good to have a common implementation (if
possible) rather than 4-5 different copies of basically the same thing.
Alexey Brodkin March 15, 2018, 7:03 p.m. | #9
Hi Peter,

On Thu, 2018-03-15 at 12:28 +0100, Peter Zijlstra wrote:
> On Thu, Mar 15, 2018 at 09:12:09AM +0000, Alexey Brodkin wrote:
> > On Thu, 2018-03-15 at 09:18 +0100, Peter Zijlstra wrote:
> > > Also, it might make sense to stuff this implementation in some lib/ file
> > > somewhere and make all platforms that need it use the same code, afaict
> > > there really isn't anything platform specific to it.
> > 
> > Not clear which part do you mean here.
> > Are you talking about entire cmpxchg syscall implementation?
> 
> Yep.

Hm... new generic syscall doing something sane people are not supposed to do?
Let's see who's going to express his/her excitement about that :)

But even introduction of that new syscall is obviously not enough
as we'll need to fix-up libc for affected arches accordingly...

> > Do you think there're many users of that quite an inefficient
> > [compared to proper HW version] atomic exchange?
> 
> I think there's a bunch of architectures that are in the same boat.
> m68k, arm, mips was mentioned. Sure, the moment an arch has hardware
> support you don't need the syscall anymore.

Here's a brief analysis:
ARM:  Looks like they got rid of that stuff in v4.4, see
      commit db695c0509d6 ("ARM: remove user cmpxchg syscall").

M68K: That's even uglier implementation which is really asking for
      a facelift, look at sys_atomic_cmpxchg_32() here:
      https://elixir.bootlin.com/linux/latest/source/arch/m68k/kernel/sys_m68k.c#L461

MIPS: They do it via special sysmips syscall which among other things
      might handle MIPS_ATOMIC_SET with mips_atomic_set()

I don't immediately see if there're others but really I'm not sure if it even worth trying to
clean-up all that since efforts might be spent pointlessly.

> I was just thinking it would be good to have a common implementation (if
> possible) rather than 4-5 different copies of basically the same thing.

From above I would conclude that only M68K might benefit from new library
implementation. BTW M68K uses a bit different ABI compared to ARC for that syscall so
it will be really atomic_cmpxchg_32() libfunction called from those syscalls,
but now I think that's exactly what you meant initially, correct?

-Alexey
Peter Zijlstra March 16, 2018, 7:55 a.m. | #10
On Thu, Mar 15, 2018 at 07:03:32PM +0000, Alexey Brodkin wrote:
> > I think there's a bunch of architectures that are in the same boat.
> > m68k, arm, mips was mentioned. Sure, the moment an arch has hardware
> > support you don't need the syscall anymore.
> 
> Here's a brief analysis:
> ARM:  Looks like they got rid of that stuff in v4.4, see
>       commit db695c0509d6 ("ARM: remove user cmpxchg syscall").

Oh shiny, that's why I couldn't find it. I had distinct memories of them
having one though.

> M68K: That's even uglier implementation which is really asking for
>       a facelift, look at sys_atomic_cmpxchg_32() here:
>       https://elixir.bootlin.com/linux/latest/source/arch/m68k/kernel/sys_m68k.c#L461

Yes, I found that code, it's something special allright.

> MIPS: They do it via special sysmips syscall which among other things
>       might handle MIPS_ATOMIC_SET with mips_atomic_set()
> 
> I don't immediately see if there're others but really I'm not sure if it even worth trying to
> clean-up all that since efforts might be spent pointlessly.
> 
> > I was just thinking it would be good to have a common implementation (if
> > possible) rather than 4-5 different copies of basically the same thing.
> 
> From above I would conclude that only M68K might benefit from new
> library implementation. BTW M68K uses a bit different ABI compared to
> ARC for that syscall so it will be really atomic_cmpxchg_32()
> libfunction called from those syscalls, but now I think that's exactly
> what you meant initially, correct?

Right. In any case, I won't insist, but if it's very little effort, it
might well be worth getting rid of that m68k magic.
Alexey Brodkin March 16, 2018, 5:33 p.m. | #11
Hi Peter, Vineet,

On Wed, 2018-03-14 at 18:53 +0100, Peter Zijlstra wrote:
> On Wed, Mar 14, 2018 at 09:58:19AM -0700, Vineet Gupta wrote:
> 
> > Well it is broken wrt the semantics the syscall is supposed to provide.
> > Preemption disabling is what prevents a concurrent thread from coming in and
> > modifying the same location (Imagine a variable which is being cmpxchg
> > concurrently by 2 threads).
> > 
> > One approach is to do it the MIPS way, emulate the llsc flag - set it under
> > preemption disabled section and clear it in switch_to
> 
> *shudder*... just catch the -EFAULT, force the write fault and retry.

More I look at this initially quite simple thing more it looks like
a can of worms...

> Something like:
> 
> int sys_cmpxchg(u32 __user *user_ptr, u32 old, u32 new)
> {

That functions is supposed to return old value stored in memory.
At least that's how it is used in case of ARC and M68K.

Remember there's already libc that relies on that established API
and we cannot just change it... even though it might be a good idea.
For example return "errno" and pass old value via pointer in an argument.
But now I guess it's better to use what we have now.

> 	u32 val;
> 	int ret;
> 
> again:
> 	ret = 0;
> 
> 	preempt_disable();
> 	val = get_user(user_ptr);

What if get_user() fails?
In Peter's implementation we will return 0, in Vineet's
we will return -EFAULT... and who knows what kind of unexpected behavior happens
further down the line in user-space... so I think it would be safer to kill
the process then.

And that's my take:
-------------------------->8------------------------
int sys_cmpxchg(u32 __user *user_ptr, u32 old, u32 new)
{
	u32 val;
	int ret;

again:
	ret = 0;

	preempt_disable();

	ret = get_user(val, user_ptr);
	if(ret == -EFAULT) {
		struct page *page;

		preempt_enable();
		ret = get_user_pages_fast((unsigned long)user_ptr, 1, 1, &page);
		if (ret < 0) {
			force_sig(SIGSEGV, current);
			return ret;
		}

		put_page(page);
		goto again;
	}

	if (val == old)
		ret = put_user(new, user_ptr);

	preempt_enable();

	if (ret == -EFAULT) {
		struct page *page;

		ret = get_user_pages_fast((unsigned long)user_ptr, 1, 1, &page);
		if (ret < 0) {
			force_sig(SIGSEGV, current);
			return ret;
		}

		put_page(page);
		goto again;
	}

	return ret;
}
-------------------------->8------------------------

-Alexey
Vineet Gupta March 16, 2018, 5:54 p.m. | #12
On 03/16/2018 10:33 AM, Alexey Brodkin wrote:
> Hi Peter, Vineet,
>
> On Wed, 2018-03-14 at 18:53 +0100, Peter Zijlstra wrote:
>> On Wed, Mar 14, 2018 at 09:58:19AM -0700, Vineet Gupta wrote:
>>
>>> Well it is broken wrt the semantics the syscall is supposed to provide.
>>> Preemption disabling is what prevents a concurrent thread from coming in and
>>> modifying the same location (Imagine a variable which is being cmpxchg
>>> concurrently by 2 threads).
>>>
>>> One approach is to do it the MIPS way, emulate the llsc flag - set it under
>>> preemption disabled section and clear it in switch_to
>> *shudder*... just catch the -EFAULT, force the write fault and retry.
> More I look at this initially quite simple thing more it looks like
> a can of worms...
>

I'd say just bite the bullet, write the patch and we can refine it there !

-Vineet
Peter Zijlstra March 16, 2018, 5:58 p.m. | #13
On Fri, Mar 16, 2018 at 10:54:52AM -0700, Vineet Gupta wrote:
> I'd say just bite the bullet, write the patch and we can refine it there !

Just be glad its not futex.c proper ;-) I'll try and have a look later..
Max Filippov March 16, 2018, 6:12 p.m. | #14
> On Thu, Mar 15, 2018 at 12:03 PM, Alexey Brodkin <Alexey.Brodkin@synopsys.com> wrote:
> Here's a brief analysis:
> ARM:  Looks like they got rid of that stuff in v4.4, see
>       commit db695c0509d6 ("ARM: remove user cmpxchg syscall").
>
> M68K: That's even uglier implementation which is really asking for
>       a facelift, look at sys_atomic_cmpxchg_32() here:
>       https://elixir.bootlin.com/linux/latest/source/arch/m68k/kernel/sys_m68k.c#L461
>
> MIPS: They do it via special sysmips syscall which among other things
>       might handle MIPS_ATOMIC_SET with mips_atomic_set()
>
> I don't immediately see if there're others but really I'm not sure if it even worth trying to
> clean-up all that since efforts might be spent pointlessly.

xtensa is another one. We used to have a buggy implementation in
arch/xtensa/kernel/entry.S:fast_syscall_xtensa which we still keep
disabled by default, just in case somebody wanted backwards
compatibility. I don't think it's worth fixing.

Patch

diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
index 5ac3b547453f..d1713d8d3981 100644
--- a/arch/arc/kernel/process.c
+++ b/arch/arc/kernel/process.c
@@ -63,8 +63,6 @@  SYSCALL_DEFINE3(arc_usr_cmpxchg, int *, uaddr, int, expected, int, new)
        if (!access_ok(VERIFY_WRITE, uaddr, sizeof(int)))
                return -EFAULT;
 
-       preempt_disable();
-
        if (__get_user(uval, uaddr))
                goto done;
 
@@ -74,8 +72,6 @@  SYSCALL_DEFINE3(arc_usr_cmpxchg, int *, uaddr, int, expected, int, new)
        }
 
 done:
-       preempt_enable();
-
        return uval;
 }
   ---------------------------->8-------------------------------