diff mbox

[RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young

Message ID 1311049762.25044.392.camel@pasglop (mailing list archive)
State Not Applicable
Headers show

Commit Message

Benjamin Herrenschmidt July 19, 2011, 4:29 a.m. UTC
The futex code currently attempts to write to user memory within
a pagefault disabled section, and if that fails, tries to fix it
up using get_user_pages().

This doesn't work on archs where the dirty and young bits are
maintained by software, since they will gate access permission
in the TLB, and will not be updated by gup().

In addition, there's an expectation on some archs that a
spurious write fault triggers a local TLB flush, and that is
missing from the picture as well.

I decided that adding those "features" to gup() would be too much
for this already too complex function, and instead added a new
simpler fixup_user_fault() which is essentially a wrapper around
handle_mm_fault() which the futex code can call.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

Shan, can you test this ? It might not fix the problem since I'm
starting to have the nasty feeling that you are hitting what is
somewhat a subtly different issue or my previous patch should
have worked (but then I might have done a stupid mistake as well)
but let us know anyway.

Cheers,
Ben.

Comments

Haishan Bai July 19, 2011, 4:55 a.m. UTC | #1
On 07/19/2011 12:29 PM, Benjamin Herrenschmidt wrote:
> The futex code currently attempts to write to user memory within
> a pagefault disabled section, and if that fails, tries to fix it
> up using get_user_pages().
>
> This doesn't work on archs where the dirty and young bits are
> maintained by software, since they will gate access permission
> in the TLB, and will not be updated by gup().
>
> In addition, there's an expectation on some archs that a
> spurious write fault triggers a local TLB flush, and that is
> missing from the picture as well.
>
> I decided that adding those "features" to gup() would be too much
> for this already too complex function, and instead added a new
> simpler fixup_user_fault() which is essentially a wrapper around
> handle_mm_fault() which the futex code can call.
>
> Signed-off-by: Benjamin Herrenschmidt<benh@kernel.crashing.org>
> ---
>
> Shan, can you test this ? It might not fix the problem since I'm
> starting to have the nasty feeling that you are hitting what is
> somewhat a subtly different issue or my previous patch should
> have worked (but then I might have done a stupid mistake as well)
> but let us know anyway.
>

Ok, I will test the patch, I think this should work, because
it's similar to my first posted patch, the difference is that
I tried to do it in the futex_atomic_cmpxchg_inatomic() in
the ppc specific path, lower level than yours as in
fault_in_user_writable :-)

Anyway, I will notify you on the test result.

Thanks
Shan Hai

> Cheers,
> Ben.
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9670f71..1036614 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -985,6 +985,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>   int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>   			struct page **pages);
>   struct page *get_dump_page(unsigned long addr);
> +extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> +			    unsigned long address, unsigned int fault_flags);
>
>   extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
>   extern void do_invalidatepage(struct page *page, unsigned long offset);
> diff --git a/kernel/futex.c b/kernel/futex.c
> index fe28dc2..7a0a4ed 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
>   	int ret;
>
>   	down_read(&mm->mmap_sem);
> -	ret = get_user_pages(current, mm, (unsigned long)uaddr,
> -			     1, 1, 0, NULL, NULL);
> +	ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
> +			       FAULT_FLAG_WRITE);
>   	up_read(&mm->mmap_sem);
>
>   	return ret<  0 ? ret : 0;
> diff --git a/mm/memory.c b/mm/memory.c
> index 40b7531..b967fb0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1815,7 +1815,64 @@ next_page:
>   }
>   EXPORT_SYMBOL(__get_user_pages);
>
> -/**
> +/*
> + * fixup_user_fault() - manually resolve a user page  fault
> + * @tsk:	the task_struct to use for page fault accounting, or
> + *		NULL if faults are not to be recorded.
> + * @mm:		mm_struct of target mm
> + * @address:	user address
> + * @fault_flags:flags to pass down to handle_mm_fault()
> + *
> + * This is meant to be called in the specific scenario where for
> + * locking reasons we try to access user memory in atomic context
> + * (within a pagefault_disable() section), this returns -EFAULT,
> + * and we want to resolve the user fault before trying again.
> + *
> + * Typically this is meant to be used by the futex code.
> + *
> + * The main difference with get_user_pages() is that this function
> + * will unconditionally call handle_mm_fault() which will in turn
> + * perform all the necessary SW fixup of the dirty and young bits
> + * in the PTE, while handle_mm_fault() only guarantees to update
> + * these in the struct page.
> + *
> + * This is important for some architectures where those bits also
> + * gate the access permission to the page because their are
> + * maintained in software. On such architecture, gup() will not
> + * be enough to make a subsequent access succeed.
> + *
> + * This should be called with the mm_sem held for read.
> + */
> +int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> +		     unsigned long address, unsigned int fault_flags)
> +{
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	vma = find_extend_vma(mm, address);
> +	if (!vma || address<  vma->vm_start)
> +		return -EFAULT;
> +	
> +	ret = handle_mm_fault(mm, vma, address, fault_flags);
> +	if (ret&  VM_FAULT_ERROR) {
> +		if (ret&  VM_FAULT_OOM)
> +			return -ENOMEM;
> +		if (ret&  (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
> +			return -EHWPOISON;
> +		if (ret&  VM_FAULT_SIGBUS)
> +			return -EFAULT;
> +		BUG();
> +	}
> +	if (tsk) {
> +		if (ret&  VM_FAULT_MAJOR)
> +			tsk->maj_flt++;
> +		else
> +			tsk->min_flt++;
> +	}
> +	return 0;
> +}
> +
> +/*
>    * get_user_pages() - pin user pages in memory
>    * @tsk:	the task_struct to use for page fault accounting, or
>    *		NULL if faults are not to be recorded.
>
>
Haishan Bai July 19, 2011, 5:17 a.m. UTC | #2
On 07/19/2011 12:29 PM, Benjamin Herrenschmidt wrote:
> The futex code currently attempts to write to user memory within
> a pagefault disabled section, and if that fails, tries to fix it
> up using get_user_pages().
>
> This doesn't work on archs where the dirty and young bits are
> maintained by software, since they will gate access permission
> in the TLB, and will not be updated by gup().
>
> In addition, there's an expectation on some archs that a
> spurious write fault triggers a local TLB flush, and that is
> missing from the picture as well.
>
> I decided that adding those "features" to gup() would be too much
> for this already too complex function, and instead added a new
> simpler fixup_user_fault() which is essentially a wrapper around
> handle_mm_fault() which the futex code can call.
>
> Signed-off-by: Benjamin Herrenschmidt<benh@kernel.crashing.org>
> ---
>
> Shan, can you test this ? It might not fix the problem since I'm
> starting to have the nasty feeling that you are hitting what is
> somewhat a subtly different issue or my previous patch should
> have worked (but then I might have done a stupid mistake as well)
> but let us know anyway.
>

The patch works, but I have certain confusions,
- Do we want to handle_mm_fault on each futex_lock_pi
     even though in most cases there is no write permission
     fixup's needed?
- How about let the archs do their own write permission
     fixup as what I did in my original
     "[PATCH 1/1] Fixup write permission of TLB on powerpc e500 core"?
     (I will fix the stupid errors in my original patch if the concept 
is acceptable)
     in this way we could decrease the overhead of handle_mm_fault
     in the path which does not need write permission fixup.

Thanks
Shan Hai
> Cheers,
> Ben.
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9670f71..1036614 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -985,6 +985,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>   int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>   			struct page **pages);
>   struct page *get_dump_page(unsigned long addr);
> +extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> +			    unsigned long address, unsigned int fault_flags);
>
>   extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
>   extern void do_invalidatepage(struct page *page, unsigned long offset);
> diff --git a/kernel/futex.c b/kernel/futex.c
> index fe28dc2..7a0a4ed 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
>   	int ret;
>
>   	down_read(&mm->mmap_sem);
> -	ret = get_user_pages(current, mm, (unsigned long)uaddr,
> -			     1, 1, 0, NULL, NULL);
> +	ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
> +			       FAULT_FLAG_WRITE);
>   	up_read(&mm->mmap_sem);
>
>   	return ret<  0 ? ret : 0;
> diff --git a/mm/memory.c b/mm/memory.c
> index 40b7531..b967fb0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1815,7 +1815,64 @@ next_page:
>   }
>   EXPORT_SYMBOL(__get_user_pages);
>
> -/**
> +/*
> + * fixup_user_fault() - manually resolve a user page  fault
> + * @tsk:	the task_struct to use for page fault accounting, or
> + *		NULL if faults are not to be recorded.
> + * @mm:		mm_struct of target mm
> + * @address:	user address
> + * @fault_flags:flags to pass down to handle_mm_fault()
> + *
> + * This is meant to be called in the specific scenario where for
> + * locking reasons we try to access user memory in atomic context
> + * (within a pagefault_disable() section), this returns -EFAULT,
> + * and we want to resolve the user fault before trying again.
> + *
> + * Typically this is meant to be used by the futex code.
> + *
> + * The main difference with get_user_pages() is that this function
> + * will unconditionally call handle_mm_fault() which will in turn
> + * perform all the necessary SW fixup of the dirty and young bits
> + * in the PTE, while handle_mm_fault() only guarantees to update
> + * these in the struct page.
> + *
> + * This is important for some architectures where those bits also
> + * gate the access permission to the page because their are
> + * maintained in software. On such architecture, gup() will not
> + * be enough to make a subsequent access succeed.
> + *
> + * This should be called with the mm_sem held for read.
> + */
> +int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> +		     unsigned long address, unsigned int fault_flags)
> +{
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	vma = find_extend_vma(mm, address);
> +	if (!vma || address<  vma->vm_start)
> +		return -EFAULT;
> +	
> +	ret = handle_mm_fault(mm, vma, address, fault_flags);
> +	if (ret&  VM_FAULT_ERROR) {
> +		if (ret&  VM_FAULT_OOM)
> +			return -ENOMEM;
> +		if (ret&  (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
> +			return -EHWPOISON;
> +		if (ret&  VM_FAULT_SIGBUS)
> +			return -EFAULT;
> +		BUG();
> +	}
> +	if (tsk) {
> +		if (ret&  VM_FAULT_MAJOR)
> +			tsk->maj_flt++;
> +		else
> +			tsk->min_flt++;
> +	}
> +	return 0;
> +}
> +
> +/*
>    * get_user_pages() - pin user pages in memory
>    * @tsk:	the task_struct to use for page fault accounting, or
>    *		NULL if faults are not to be recorded.
>
>
Benjamin Herrenschmidt July 19, 2011, 5:24 a.m. UTC | #3
On Tue, 2011-07-19 at 13:17 +0800, Shan Hai wrote:

> The patch works, but I have certain confusions,
> - Do we want to handle_mm_fault on each futex_lock_pi
>      even though in most cases there is no write permission
>      fixup's needed?

Don't we only ever call this when futex_atomic_op_inuser() failed ?
Which means a fixup -is- needed .... The fast path is still there.

> - How about let the archs do their own write permission
>      fixup as what I did in my original

Why ? This is generic and will fix all archs at once with generic code
which is a significant improvement in my book and a lot more
maintainable :-)

>      "[PATCH 1/1] Fixup write permission of TLB on powerpc e500 core"?
>      (I will fix the stupid errors in my original patch if the concept 
> is acceptable)
>      in this way we could decrease the overhead of handle_mm_fault
>      in the path which does not need write permission fixup.

Which overhead ? gup does handle_mm_fault() as well if needed.

What I do is I replace what is arguably an abuse of gup() in the case
where a fixup -is- needed with a dedicated function designed to perform
the said fixup ... and do it properly which gup() didn't :-)

Cheers,
Ben.

> Thanks
> Shan Hai
> > Cheers,
> > Ben.
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 9670f71..1036614 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -985,6 +985,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> >   int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> >   			struct page **pages);
> >   struct page *get_dump_page(unsigned long addr);
> > +extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> > +			    unsigned long address, unsigned int fault_flags);
> >
> >   extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
> >   extern void do_invalidatepage(struct page *page, unsigned long offset);
> > diff --git a/kernel/futex.c b/kernel/futex.c
> > index fe28dc2..7a0a4ed 100644
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
> >   	int ret;
> >
> >   	down_read(&mm->mmap_sem);
> > -	ret = get_user_pages(current, mm, (unsigned long)uaddr,
> > -			     1, 1, 0, NULL, NULL);
> > +	ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
> > +			       FAULT_FLAG_WRITE);
> >   	up_read(&mm->mmap_sem);
> >
> >   	return ret<  0 ? ret : 0;
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 40b7531..b967fb0 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1815,7 +1815,64 @@ next_page:
> >   }
> >   EXPORT_SYMBOL(__get_user_pages);
> >
> > -/**
> > +/*
> > + * fixup_user_fault() - manually resolve a user page  fault
> > + * @tsk:	the task_struct to use for page fault accounting, or
> > + *		NULL if faults are not to be recorded.
> > + * @mm:		mm_struct of target mm
> > + * @address:	user address
> > + * @fault_flags:flags to pass down to handle_mm_fault()
> > + *
> > + * This is meant to be called in the specific scenario where for
> > + * locking reasons we try to access user memory in atomic context
> > + * (within a pagefault_disable() section), this returns -EFAULT,
> > + * and we want to resolve the user fault before trying again.
> > + *
> > + * Typically this is meant to be used by the futex code.
> > + *
> > + * The main difference with get_user_pages() is that this function
> > + * will unconditionally call handle_mm_fault() which will in turn
> > + * perform all the necessary SW fixup of the dirty and young bits
> > + * in the PTE, while handle_mm_fault() only guarantees to update
> > + * these in the struct page.
> > + *
> > + * This is important for some architectures where those bits also
> > + * gate the access permission to the page because their are
> > + * maintained in software. On such architecture, gup() will not
> > + * be enough to make a subsequent access succeed.
> > + *
> > + * This should be called with the mm_sem held for read.
> > + */
> > +int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> > +		     unsigned long address, unsigned int fault_flags)
> > +{
> > +	struct vm_area_struct *vma;
> > +	int ret;
> > +
> > +	vma = find_extend_vma(mm, address);
> > +	if (!vma || address<  vma->vm_start)
> > +		return -EFAULT;
> > +	
> > +	ret = handle_mm_fault(mm, vma, address, fault_flags);
> > +	if (ret&  VM_FAULT_ERROR) {
> > +		if (ret&  VM_FAULT_OOM)
> > +			return -ENOMEM;
> > +		if (ret&  (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
> > +			return -EHWPOISON;
> > +		if (ret&  VM_FAULT_SIGBUS)
> > +			return -EFAULT;
> > +		BUG();
> > +	}
> > +	if (tsk) {
> > +		if (ret&  VM_FAULT_MAJOR)
> > +			tsk->maj_flt++;
> > +		else
> > +			tsk->min_flt++;
> > +	}
> > +	return 0;
> > +}
> > +
> > +/*
> >    * get_user_pages() - pin user pages in memory
> >    * @tsk:	the task_struct to use for page fault accounting, or
> >    *		NULL if faults are not to be recorded.
> >
> >
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
Haishan Bai July 19, 2011, 5:38 a.m. UTC | #4
On 07/19/2011 01:24 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2011-07-19 at 13:17 +0800, Shan Hai wrote:
>
>> The patch works, but I have certain confusions,
>> - Do we want to handle_mm_fault on each futex_lock_pi
>>       even though in most cases there is no write permission
>>       fixup's needed?
> Don't we only ever call this when futex_atomic_op_inuser() failed ?
> Which means a fixup -is- needed .... The fast path is still there.
>

What you said is another path, that is futex_wake_op(),
but what about futex_lock_pi in which my test case failed?
your patch will call handle_mm_fault on every futex contention
in the futex_lock_pi path.

futex_lock_pi()
     ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0);
         case -EFAULT:
                         goto uaddr_faulted;

     ...
uaddr_faulted:
     ret = fault_in_user_writeable(uaddr);


>> - How about let the archs do their own write permission
>>       fixup as what I did in my original
> Why ? This is generic and will fix all archs at once with generic code
> which is a significant improvement in my book and a lot more
> maintainable :-)
>

If the overhead in the futex_lock_pi  path is not considerable yes fix it up
generally is nice :-)

>>       "[PATCH 1/1] Fixup write permission of TLB on powerpc e500 core"?
>>       (I will fix the stupid errors in my original patch if the concept
>> is acceptable)
>>       in this way we could decrease the overhead of handle_mm_fault
>>       in the path which does not need write permission fixup.
> Which overhead ? gup does handle_mm_fault() as well if needed.

it does it *if needed*, and this requirement is rare in my opinion.


Thanks
Shan Hai

> What I do is I replace what is arguably an abuse of gup() in the case
> where a fixup -is- needed with a dedicated function designed to perform
> the said fixup ... and do it properly which gup() didn't :-)
>
> Cheers,
> Ben.
>
>> Thanks
>> Shan Hai
>>> Cheers,
>>> Ben.
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 9670f71..1036614 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -985,6 +985,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>>>    int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>>>    			struct page **pages);
>>>    struct page *get_dump_page(unsigned long addr);
>>> +extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>>> +			    unsigned long address, unsigned int fault_flags);
>>>
>>>    extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
>>>    extern void do_invalidatepage(struct page *page, unsigned long offset);
>>> diff --git a/kernel/futex.c b/kernel/futex.c
>>> index fe28dc2..7a0a4ed 100644
>>> --- a/kernel/futex.c
>>> +++ b/kernel/futex.c
>>> @@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
>>>    	int ret;
>>>
>>>    	down_read(&mm->mmap_sem);
>>> -	ret = get_user_pages(current, mm, (unsigned long)uaddr,
>>> -			     1, 1, 0, NULL, NULL);
>>> +	ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
>>> +			       FAULT_FLAG_WRITE);
>>>    	up_read(&mm->mmap_sem);
>>>
>>>    	return ret<   0 ? ret : 0;
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 40b7531..b967fb0 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -1815,7 +1815,64 @@ next_page:
>>>    }
>>>    EXPORT_SYMBOL(__get_user_pages);
>>>
>>> -/**
>>> +/*
>>> + * fixup_user_fault() - manually resolve a user page  fault
>>> + * @tsk:	the task_struct to use for page fault accounting, or
>>> + *		NULL if faults are not to be recorded.
>>> + * @mm:		mm_struct of target mm
>>> + * @address:	user address
>>> + * @fault_flags:flags to pass down to handle_mm_fault()
>>> + *
>>> + * This is meant to be called in the specific scenario where for
>>> + * locking reasons we try to access user memory in atomic context
>>> + * (within a pagefault_disable() section), this returns -EFAULT,
>>> + * and we want to resolve the user fault before trying again.
>>> + *
>>> + * Typically this is meant to be used by the futex code.
>>> + *
>>> + * The main difference with get_user_pages() is that this function
>>> + * will unconditionally call handle_mm_fault() which will in turn
>>> + * perform all the necessary SW fixup of the dirty and young bits
>>> + * in the PTE, while handle_mm_fault() only guarantees to update
>>> + * these in the struct page.
>>> + *
>>> + * This is important for some architectures where those bits also
>>> + * gate the access permission to the page because their are
>>> + * maintained in software. On such architecture, gup() will not
>>> + * be enough to make a subsequent access succeed.
>>> + *
>>> + * This should be called with the mm_sem held for read.
>>> + */
>>> +int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>>> +		     unsigned long address, unsigned int fault_flags)
>>> +{
>>> +	struct vm_area_struct *vma;
>>> +	int ret;
>>> +
>>> +	vma = find_extend_vma(mm, address);
>>> +	if (!vma || address<   vma->vm_start)
>>> +		return -EFAULT;
>>> +	
>>> +	ret = handle_mm_fault(mm, vma, address, fault_flags);
>>> +	if (ret&   VM_FAULT_ERROR) {
>>> +		if (ret&   VM_FAULT_OOM)
>>> +			return -ENOMEM;
>>> +		if (ret&   (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
>>> +			return -EHWPOISON;
>>> +		if (ret&   VM_FAULT_SIGBUS)
>>> +			return -EFAULT;
>>> +		BUG();
>>> +	}
>>> +	if (tsk) {
>>> +		if (ret&   VM_FAULT_MAJOR)
>>> +			tsk->maj_flt++;
>>> +		else
>>> +			tsk->min_flt++;
>>> +	}
>>> +	return 0;
>>> +}
>>> +
>>> +/*
>>>     * get_user_pages() - pin user pages in memory
>>>     * @tsk:	the task_struct to use for page fault accounting, or
>>>     *		NULL if faults are not to be recorded.
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>
Benjamin Herrenschmidt July 19, 2011, 7:46 a.m. UTC | #5
On Tue, 2011-07-19 at 13:38 +0800, Shan Hai wrote:

> What you said is another path, that is futex_wake_op(),
> but what about futex_lock_pi in which my test case failed?
> your patch will call handle_mm_fault on every futex contention
> in the futex_lock_pi path.
> 
> futex_lock_pi()
>      ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0);
>          case -EFAULT:
>                          goto uaddr_faulted;
> 
>      ...
> uaddr_faulted:
>      ret = fault_in_user_writeable(uaddr);

Euh ... and how do we get to uaddr_faulted ? You may have missed the
return statement right before it :-)

From what I can tell we only get there as a result of -EFAULT from
futex_lock_pi_atomic() which is exactly the case we are trying to
cover. 

 .../...
> >>       "[PATCH 1/1] Fixup write permission of TLB on powerpc e500 core"?
> >>       (I will fix the stupid errors in my original patch if the concept
> >> is acceptable)
> >>       in this way we could decrease the overhead of handle_mm_fault
> >>       in the path which does not need write permission fixup.
> > Which overhead ? gup does handle_mm_fault() as well if needed.
> 
> it does it *if needed*, and this requirement is rare in my opinion.

And how does gup figure out of it's needed ? By walking down the page
tables in follow_page... what does handle_mm_fault do ? walk down the
page tables...

The main (if not the only) relevant difference here, is going to be the
spurious fault TLB invaliate for writes ... which is a nop on x86....
and needed in all the cases we care about (and if it's not needed, then
it's up to the arch to nop it out, we should probably do it on powerpc
too ...  but that's un unrelated discussion).

Cheers,
Ben.

> Thanks
> Shan Hai
> 
> > What I do is I replace what is arguably an abuse of gup() in the case
> > where a fixup -is- needed with a dedicated function designed to perform
> > the said fixup ... and do it properly which gup() didn't :-)
> >
> > Cheers,
> > Ben.
> >
> >> Thanks
> >> Shan Hai
> >>> Cheers,
> >>> Ben.
> >>>
> >>> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >>> index 9670f71..1036614 100644
> >>> --- a/include/linux/mm.h
> >>> +++ b/include/linux/mm.h
> >>> @@ -985,6 +985,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> >>>    int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> >>>    			struct page **pages);
> >>>    struct page *get_dump_page(unsigned long addr);
> >>> +extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> >>> +			    unsigned long address, unsigned int fault_flags);
> >>>
> >>>    extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
> >>>    extern void do_invalidatepage(struct page *page, unsigned long offset);
> >>> diff --git a/kernel/futex.c b/kernel/futex.c
> >>> index fe28dc2..7a0a4ed 100644
> >>> --- a/kernel/futex.c
> >>> +++ b/kernel/futex.c
> >>> @@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
> >>>    	int ret;
> >>>
> >>>    	down_read(&mm->mmap_sem);
> >>> -	ret = get_user_pages(current, mm, (unsigned long)uaddr,
> >>> -			     1, 1, 0, NULL, NULL);
> >>> +	ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
> >>> +			       FAULT_FLAG_WRITE);
> >>>    	up_read(&mm->mmap_sem);
> >>>
> >>>    	return ret<   0 ? ret : 0;
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index 40b7531..b967fb0 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -1815,7 +1815,64 @@ next_page:
> >>>    }
> >>>    EXPORT_SYMBOL(__get_user_pages);
> >>>
> >>> -/**
> >>> +/*
> >>> + * fixup_user_fault() - manually resolve a user page  fault
> >>> + * @tsk:	the task_struct to use for page fault accounting, or
> >>> + *		NULL if faults are not to be recorded.
> >>> + * @mm:		mm_struct of target mm
> >>> + * @address:	user address
> >>> + * @fault_flags:flags to pass down to handle_mm_fault()
> >>> + *
> >>> + * This is meant to be called in the specific scenario where for
> >>> + * locking reasons we try to access user memory in atomic context
> >>> + * (within a pagefault_disable() section), this returns -EFAULT,
> >>> + * and we want to resolve the user fault before trying again.
> >>> + *
> >>> + * Typically this is meant to be used by the futex code.
> >>> + *
> >>> + * The main difference with get_user_pages() is that this function
> >>> + * will unconditionally call handle_mm_fault() which will in turn
> >>> + * perform all the necessary SW fixup of the dirty and young bits
> >>> + * in the PTE, while handle_mm_fault() only guarantees to update
> >>> + * these in the struct page.
> >>> + *
> >>> + * This is important for some architectures where those bits also
> >>> + * gate the access permission to the page because their are
> >>> + * maintained in software. On such architecture, gup() will not
> >>> + * be enough to make a subsequent access succeed.
> >>> + *
> >>> + * This should be called with the mm_sem held for read.
> >>> + */
> >>> +int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> >>> +		     unsigned long address, unsigned int fault_flags)
> >>> +{
> >>> +	struct vm_area_struct *vma;
> >>> +	int ret;
> >>> +
> >>> +	vma = find_extend_vma(mm, address);
> >>> +	if (!vma || address<   vma->vm_start)
> >>> +		return -EFAULT;
> >>> +	
> >>> +	ret = handle_mm_fault(mm, vma, address, fault_flags);
> >>> +	if (ret&   VM_FAULT_ERROR) {
> >>> +		if (ret&   VM_FAULT_OOM)
> >>> +			return -ENOMEM;
> >>> +		if (ret&   (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
> >>> +			return -EHWPOISON;
> >>> +		if (ret&   VM_FAULT_SIGBUS)
> >>> +			return -EFAULT;
> >>> +		BUG();
> >>> +	}
> >>> +	if (tsk) {
> >>> +		if (ret&   VM_FAULT_MAJOR)
> >>> +			tsk->maj_flt++;
> >>> +		else
> >>> +			tsk->min_flt++;
> >>> +	}
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +/*
> >>>     * get_user_pages() - pin user pages in memory
> >>>     * @tsk:	the task_struct to use for page fault accounting, or
> >>>     *		NULL if faults are not to be recorded.
> >>>
> >>>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> Please read the FAQ at  http://www.tux.org/lkml/
> >
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
Haishan Bai July 19, 2011, 8:24 a.m. UTC | #6
On 07/19/2011 03:46 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2011-07-19 at 13:38 +0800, Shan Hai wrote:
>
>> What you said is another path, that is futex_wake_op(),
>> but what about futex_lock_pi in which my test case failed?
>> your patch will call handle_mm_fault on every futex contention
>> in the futex_lock_pi path.
>>
>> futex_lock_pi()
>>       ret = futex_lock_pi_atomic(uaddr, hb,&q.key,&q.pi_state, current, 0);
>>           case -EFAULT:
>>                           goto uaddr_faulted;
>>
>>       ...
>> uaddr_faulted:
>>       ret = fault_in_user_writeable(uaddr);
> Euh ... and how do we get to uaddr_faulted ? You may have missed the
> return statement right before it :-)
>
>  From what I can tell we only get there as a result of -EFAULT from
> futex_lock_pi_atomic() which is exactly the case we are trying to
> cover.
>

Got it, if the fault_in_user_writeable() is designed to catch the
exact same write permission fault problem we discuss here, so
your patch fixed that very nicely, we should fixup it by directly
calling handle_mm_fault like what you did because we are for sure
to know what just happened(permission violation), its not necessary
to check what's happened by calling gup-->follow_page, and
further the follow_page failed to report the fault :-)

Thanks
Shan Hai

>   .../...
>>>>        "[PATCH 1/1] Fixup write permission of TLB on powerpc e500 core"?
>>>>        (I will fix the stupid errors in my original patch if the concept
>>>> is acceptable)
>>>>        in this way we could decrease the overhead of handle_mm_fault
>>>>        in the path which does not need write permission fixup.
>>> Which overhead ? gup does handle_mm_fault() as well if needed.
>> it does it *if needed*, and this requirement is rare in my opinion.
> And how does gup figure out of it's needed ? By walking down the page
> tables in follow_page... what does handle_mm_fault do ? walk down the
> page tables...
>
> The main (if not the only) relevant difference here, is going to be the
> spurious fault TLB invaliate for writes ... which is a nop on x86....
> and needed in all the cases we care about (and if it's not needed, then
> it's up to the arch to nop it out, we should probably do it on powerpc
> too ...  but that's un unrelated discussion).
>
> Cheers,
> Ben.
>
>> Thanks
>> Shan Hai
>>
>>> What I do is I replace what is arguably an abuse of gup() in the case
>>> where a fixup -is- needed with a dedicated function designed to perform
>>> the said fixup ... and do it properly which gup() didn't :-)
>>>
>>> Cheers,
>>> Ben.
>>>
>>>> Thanks
>>>> Shan Hai
>>>>> Cheers,
>>>>> Ben.
>>>>>
>>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>>> index 9670f71..1036614 100644
>>>>> --- a/include/linux/mm.h
>>>>> +++ b/include/linux/mm.h
>>>>> @@ -985,6 +985,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>>>>>     int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>>>>>     			struct page **pages);
>>>>>     struct page *get_dump_page(unsigned long addr);
>>>>> +extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>>>>> +			    unsigned long address, unsigned int fault_flags);
>>>>>
>>>>>     extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
>>>>>     extern void do_invalidatepage(struct page *page, unsigned long offset);
>>>>> diff --git a/kernel/futex.c b/kernel/futex.c
>>>>> index fe28dc2..7a0a4ed 100644
>>>>> --- a/kernel/futex.c
>>>>> +++ b/kernel/futex.c
>>>>> @@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
>>>>>     	int ret;
>>>>>
>>>>>     	down_read(&mm->mmap_sem);
>>>>> -	ret = get_user_pages(current, mm, (unsigned long)uaddr,
>>>>> -			     1, 1, 0, NULL, NULL);
>>>>> +	ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
>>>>> +			       FAULT_FLAG_WRITE);
>>>>>     	up_read(&mm->mmap_sem);
>>>>>
>>>>>     	return ret<    0 ? ret : 0;
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 40b7531..b967fb0 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -1815,7 +1815,64 @@ next_page:
>>>>>     }
>>>>>     EXPORT_SYMBOL(__get_user_pages);
>>>>>
>>>>> -/**
>>>>> +/*
>>>>> + * fixup_user_fault() - manually resolve a user page  fault
>>>>> + * @tsk:	the task_struct to use for page fault accounting, or
>>>>> + *		NULL if faults are not to be recorded.
>>>>> + * @mm:		mm_struct of target mm
>>>>> + * @address:	user address
>>>>> + * @fault_flags:flags to pass down to handle_mm_fault()
>>>>> + *
>>>>> + * This is meant to be called in the specific scenario where for
>>>>> + * locking reasons we try to access user memory in atomic context
>>>>> + * (within a pagefault_disable() section), this returns -EFAULT,
>>>>> + * and we want to resolve the user fault before trying again.
>>>>> + *
>>>>> + * Typically this is meant to be used by the futex code.
>>>>> + *
>>>>> + * The main difference with get_user_pages() is that this function
>>>>> + * will unconditionally call handle_mm_fault() which will in turn
>>>>> + * perform all the necessary SW fixup of the dirty and young bits
>>>>> + * in the PTE, while handle_mm_fault() only guarantees to update
>>>>> + * these in the struct page.
>>>>> + *
>>>>> + * This is important for some architectures where those bits also
>>>>> + * gate the access permission to the page because their are
>>>>> + * maintained in software. On such architecture, gup() will not
>>>>> + * be enough to make a subsequent access succeed.
>>>>> + *
>>>>> + * This should be called with the mm_sem held for read.
>>>>> + */
>>>>> +int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>>>>> +		     unsigned long address, unsigned int fault_flags)
>>>>> +{
>>>>> +	struct vm_area_struct *vma;
>>>>> +	int ret;
>>>>> +
>>>>> +	vma = find_extend_vma(mm, address);
>>>>> +	if (!vma || address<    vma->vm_start)
>>>>> +		return -EFAULT;
>>>>> +	
>>>>> +	ret = handle_mm_fault(mm, vma, address, fault_flags);
>>>>> +	if (ret&    VM_FAULT_ERROR) {
>>>>> +		if (ret&    VM_FAULT_OOM)
>>>>> +			return -ENOMEM;
>>>>> +		if (ret&    (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
>>>>> +			return -EHWPOISON;
>>>>> +		if (ret&    VM_FAULT_SIGBUS)
>>>>> +			return -EFAULT;
>>>>> +		BUG();
>>>>> +	}
>>>>> +	if (tsk) {
>>>>> +		if (ret&    VM_FAULT_MAJOR)
>>>>> +			tsk->maj_flt++;
>>>>> +		else
>>>>> +			tsk->min_flt++;
>>>>> +	}
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>>      * get_user_pages() - pin user pages in memory
>>>>>      * @tsk:	the task_struct to use for page fault accounting, or
>>>>>      *		NULL if faults are not to be recorded.
>>>>>
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> Please read the FAQ at  http://www.tux.org/lkml/
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>
David Laight July 19, 2011, 8:26 a.m. UTC | #7
> Got it, if the fault_in_user_writeable() is designed to catch the
> exact same write permission fault problem we discuss here, so
> your patch fixed that very nicely, we should fixup it by directly
> calling handle_mm_fault like what you did because we are for sure
> to know what just happened(permission violation), its not necessary
> to check what's happened by calling gup-->follow_page, and
> further the follow_page failed to report the fault :-)

One thought I've had - and I don't know enough about the data
area in use to know if it is a problem - is what happens if
a different cpu faults on the same user page and has already
marked it 'valid' between the fault happening and the fault
handler looking at the page tables to find out why.
If any of the memory areas are shared, it might be that the
PTE (etc) might already show the page a writable by the
time the fault handler is looking at them - this might confuse it!

	David
Benjamin Herrenschmidt July 19, 2011, 8:45 a.m. UTC | #8
On Tue, 2011-07-19 at 09:26 +0100, David Laight wrote:
> > Got it, if the fault_in_user_writeable() is designed to catch the
> > exact same write permission fault problem we discuss here, so
> > your patch fixed that very nicely, we should fixup it by directly
> > calling handle_mm_fault like what you did because we are for sure
> > to know what just happened(permission violation), its not necessary
> > to check what's happened by calling gup-->follow_page, and
> > further the follow_page failed to report the fault :-)
> 
> One thought I've had - and I don't know enough about the data
> area in use to know if it is a problem - is what happens if
> a different cpu faults on the same user page and has already
> marked it 'valid' between the fault happening and the fault
> handler looking at the page tables to find out why.
> If any of the memory areas are shared, it might be that the
> PTE (etc) might already show the page a writable by the
> time the fault handler is looking at them - this might confuse it!

The same way handle_mm_fault() deals with two CPUs faulting on the same
page at the same time :-)

All the necessary locking is in there, handle_mm_fault() and friends
will walk the page tables, take the PTE lock, will notice it's already
been all fixed up (well that it doesn't need to do a page fault at
least), will then call ptep_set_access_flags() which will itself notice
there's nothing to do ... etc

So all you'll hit is the spurious fault TLB invalidate in the write
case, which is necessary on some archs (well, we think it is tho I don't
know which archs really :-)

Cheers,
Ben.
Haishan Bai July 19, 2011, 8:45 a.m. UTC | #9
On 07/19/2011 04:26 PM, David Laight wrote:
>
>> Got it, if the fault_in_user_writeable() is designed to catch the
>> exact same write permission fault problem we discuss here, so
>> your patch fixed that very nicely, we should fixup it by directly
>> calling handle_mm_fault like what you did because we are for sure
>> to know what just happened(permission violation), its not necessary
>> to check what's happened by calling gup-->follow_page, and
>> further the follow_page failed to report the fault :-)
> One thought I've had - and I don't know enough about the data
> area in use to know if it is a problem - is what happens if
> a different cpu faults on the same user page and has already
> marked it 'valid' between the fault happening and the fault
> handler looking at the page tables to find out why.
> If any of the memory areas are shared, it might be that the
> PTE (etc) might already show the page a writable by the
> time the fault handler is looking at them - this might confuse it!
>

There is no problem at all if you mean *valid* by page present
and writable, because when the fault_in_user_writeable()
is called, the pte to the shared page was already setup by
demand paging and pte.present and pte.write was set, and the
reason why the fault was taken is that because of violation of
permission on present and writable user page occurred on sw
dirty/young tracking architectures.

Thanks
Shan Hai

> 	David
>
>
Peter Zijlstra July 19, 2011, 11:10 a.m. UTC | #10
On Tue, 2011-07-19 at 14:29 +1000, Benjamin Herrenschmidt wrote:
> The futex code currently attempts to write to user memory within
> a pagefault disabled section, and if that fails, tries to fix it
> up using get_user_pages().
> 
> This doesn't work on archs where the dirty and young bits are
> maintained by software, since they will gate access permission
> in the TLB, and will not be updated by gup().
> 
> In addition, there's an expectation on some archs that a
> spurious write fault triggers a local TLB flush, and that is
> missing from the picture as well.
> 
> I decided that adding those "features" to gup() would be too much
> for this already too complex function, and instead added a new
> simpler fixup_user_fault() which is essentially a wrapper around
> handle_mm_fault() which the futex code can call.
> 
> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> 

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Darren Hart July 20, 2011, 2:39 p.m. UTC | #11
Obviously no objection from the futex side of things, looks good. Couple
nits on the function comment:

On 07/18/2011 09:29 PM, Benjamin Herrenschmidt wrote:
...
> -/**
> +/*
> + * fixup_user_fault() - manually resolve a user page  fault

s/  fault/ fault/

> + * @tsk:	the task_struct to use for page fault accounting, or
> + *		NULL if faults are not to be recorded.
> + * @mm:		mm_struct of target mm
> + * @address:	user address
> + * @fault_flags:flags to pass down to handle_mm_fault()
> + *
> + * This is meant to be called in the specific scenario where for
> + * locking reasons we try to access user memory in atomic context
> + * (within a pagefault_disable() section), this returns -EFAULT,
> + * and we want to resolve the user fault before trying again.
> + *
> + * Typically this is meant to be used by the futex code.
> + *
> + * The main difference with get_user_pages() is that this function
> + * will unconditionally call handle_mm_fault() which will in turn
> + * perform all the necessary SW fixup of the dirty and young bits
> + * in the PTE, while handle_mm_fault() only guarantees to update
> + * these in the struct page.
> + *
> + * This is important for some architectures where those bits also
> + * gate the access permission to the page because their are

s/their/they/

Thanks,
Andrew Morton July 21, 2011, 10:36 p.m. UTC | #12
On Tue, 19 Jul 2011 14:29:22 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> The futex code currently attempts to write to user memory within
> a pagefault disabled section, and if that fails, tries to fix it
> up using get_user_pages().
> 
> This doesn't work on archs where the dirty and young bits are
> maintained by software, since they will gate access permission
> in the TLB, and will not be updated by gup().
> 
> In addition, there's an expectation on some archs that a
> spurious write fault triggers a local TLB flush, and that is
> missing from the picture as well.
> 
> I decided that adding those "features" to gup() would be too much
> for this already too complex function, and instead added a new
> simpler fixup_user_fault() which is essentially a wrapper around
> handle_mm_fault() which the futex code can call.
> 
> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> ---
> 
> Shan, can you test this ? It might not fix the problem

um, what problem.  There's no description here of the user-visible
effects of the bug hence it's hard to work out what kernel version(s)
should receive this patch.

What kernel version(s) should receive this patch?

> since I'm
> starting to have the nasty feeling that you are hitting what is
> somewhat a subtly different issue or my previous patch should
> have worked (but then I might have done a stupid mistake as well)
> but let us know anyway.

I assume that Shan reported the secret problem so I added the
reported-by to the changelog.
Benjamin Herrenschmidt July 21, 2011, 10:52 p.m. UTC | #13
On Thu, 2011-07-21 at 15:36 -0700, Andrew Morton wrote:
> On Tue, 19 Jul 2011 14:29:22 +1000
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> > The futex code currently attempts to write to user memory within
> > a pagefault disabled section, and if that fails, tries to fix it
> > up using get_user_pages().
> > 
> > This doesn't work on archs where the dirty and young bits are
> > maintained by software, since they will gate access permission
> > in the TLB, and will not be updated by gup().
> > 
> > In addition, there's an expectation on some archs that a
> > spurious write fault triggers a local TLB flush, and that is
> > missing from the picture as well.
> > 
> > I decided that adding those "features" to gup() would be too much
> > for this already too complex function, and instead added a new
> > simpler fixup_user_fault() which is essentially a wrapper around
> > handle_mm_fault() which the futex code can call.
> > 
> > Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > ---
> > 
> > Shan, can you test this ? It might not fix the problem
> 
> um, what problem.  There's no description here of the user-visible
> effects of the bug hence it's hard to work out what kernel version(s)
> should receive this patch.

Shan could give you an actual example (it was in the previous thread),
but basically, livelock as the kernel keeps trying and trying the
in_atomic op and never resolves it.
 
> What kernel version(s) should receive this patch?

I haven't dug. Probably anything it applies on as far as we did that
trick of atomic + gup() for futex.

> > since I'm
> > starting to have the nasty feeling that you are hitting what is
> > somewhat a subtly different issue or my previous patch should
> > have worked (but then I might have done a stupid mistake as well)
> > but let us know anyway.
> 
> I assume that Shan reported the secret problem so I added the
> reported-by to the changelog.

He did :-) Shan, care to provide a rough explanation of what you
observed ?

Also Russell confirmed that ARM should be affected as well.

Cheers,
Ben.
Benjamin Herrenschmidt July 21, 2011, 10:57 p.m. UTC | #14
On Fri, 2011-07-22 at 08:52 +1000, Benjamin Herrenschmidt wrote:

> > um, what problem.  There's no description here of the user-visible
> > effects of the bug hence it's hard to work out what kernel version(s)
> > should receive this patch.
> 
> Shan could give you an actual example (it was in the previous thread),
> but basically, livelock as the kernel keeps trying and trying the
> in_atomic op and never resolves it.
>  
> > What kernel version(s) should receive this patch?
> 
> I haven't dug. Probably anything it applies on as far as we did that
> trick of atomic + gup() for futex.

Oops, I just realize I didn't document the problem at all in the
changelog .. sorry. I meant to say:

On archs who use SW tracking of dirty & young, a page without dirty is
effectively mapped read-only and a page without young unaccessible in
the PTE.

Additionally, some architectures might lazily flush the TLB when
relaxing write protection (by doing only a local flush), and expect a
fault to invalidate the stale entry if it's still present on another
processor.

The futex code assumes that if the "in_atomic()" access -EFAULT's, it
can "fix it up" by causing get_user_pages() which would then be
equivalent to taking the fault.

However that isn't the case. get_user_pages() will not call
handle_mm_fault() in the case where the PTE seems to have the right
permissions, regardless of the dirty and young state. It will eventually
update those bits ... in the struct page, but not in the PTE. 

Additionally, it will not handle the lazy TLB flushing that can be
required by some architectures in the fault case.

Basically, gup is the wrong interface for the job. The patch provides a
more appropriate one which boils down to just calling handle_mm_fault()
since what we are trying to do is simulate a real page fault.

Cheers,
Ben.

> > > since I'm
> > > starting to have the nasty feeling that you are hitting what is
> > > somewhat a subtly different issue or my previous patch should
> > > have worked (but then I might have done a stupid mistake as well)
> > > but let us know anyway.
> > 
> > I assume that Shan reported the secret problem so I added the
> > reported-by to the changelog.
> 
> He did :-) Shan, care to provide a rough explanation of what you
> observed ?
> 
> Also Russell confirmed that ARM should be affected as well.
> 
> Cheers,
> Ben.
Andrew Morton July 21, 2011, 10:59 p.m. UTC | #15
On Fri, 22 Jul 2011 08:52:06 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Thu, 2011-07-21 at 15:36 -0700, Andrew Morton wrote:
> > On Tue, 19 Jul 2011 14:29:22 +1000
> > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > 
> > > The futex code currently attempts to write to user memory within
> > > a pagefault disabled section, and if that fails, tries to fix it
> > > up using get_user_pages().
> > > 
> > > This doesn't work on archs where the dirty and young bits are
> > > maintained by software, since they will gate access permission
> > > in the TLB, and will not be updated by gup().
> > > 
> > > In addition, there's an expectation on some archs that a
> > > spurious write fault triggers a local TLB flush, and that is
> > > missing from the picture as well.
> > > 
> > > I decided that adding those "features" to gup() would be too much
> > > for this already too complex function, and instead added a new
> > > simpler fixup_user_fault() which is essentially a wrapper around
> > > handle_mm_fault() which the futex code can call.
> > > 
> > > Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > > ---
> > > 
> > > Shan, can you test this ? It might not fix the problem
> > 
> > um, what problem.  There's no description here of the user-visible
> > effects of the bug hence it's hard to work out what kernel version(s)
> > should receive this patch.
> 
> Shan could give you an actual example (it was in the previous thread),
> but basically, livelock as the kernel keeps trying and trying the
> in_atomic op and never resolves it.
>  
> > What kernel version(s) should receive this patch?
> 
> I haven't dug. Probably anything it applies on as far as we did that
> trick of atomic + gup() for futex.

You're not understanding me.

I need a good reason to merge this into 3.0.

The -stable maintainers need even better reasons to merge this into
earlier kernels.

Please provide those reasons!

(Documentation/stable_kernel_rules.txt, 4th bullet)

(And it's not just me and -stable maintainers.  Distro maintainers will
also look at this patch and wonder whether they should merge it)
Benjamin Herrenschmidt July 22, 2011, 1:40 a.m. UTC | #16
> You're not understanding me.
> 
> I need a good reason to merge this into 3.0.
> 
> The -stable maintainers need even better reasons to merge this into
> earlier kernels.
> 
> Please provide those reasons!
> 
> (Documentation/stable_kernel_rules.txt, 4th bullet)
> 
> (And it's not just me and -stable maintainers.  Distro maintainers will
> also look at this patch and wonder whether they should merge it)

Well, as an arch maintainer, I can get stable maintainers to merge
anything I CC to stable :-)

Now, the good reason should have been rather obvious... it's a user
exploitable kernel lockup.

Cheers,
Ben.
Haishan Bai July 22, 2011, 1:54 a.m. UTC | #17
On 07/22/2011 06:59 AM, Andrew Morton wrote:
> On Fri, 22 Jul 2011 08:52:06 +1000
> Benjamin Herrenschmidt<benh@kernel.crashing.org>  wrote:
>
>> On Thu, 2011-07-21 at 15:36 -0700, Andrew Morton wrote:
>>> On Tue, 19 Jul 2011 14:29:22 +1000
>>> Benjamin Herrenschmidt<benh@kernel.crashing.org>  wrote:
>>>
>>>> The futex code currently attempts to write to user memory within
>>>> a pagefault disabled section, and if that fails, tries to fix it
>>>> up using get_user_pages().
>>>>
>>>> This doesn't work on archs where the dirty and young bits are
>>>> maintained by software, since they will gate access permission
>>>> in the TLB, and will not be updated by gup().
>>>>
>>>> In addition, there's an expectation on some archs that a
>>>> spurious write fault triggers a local TLB flush, and that is
>>>> missing from the picture as well.
>>>>
>>>> I decided that adding those "features" to gup() would be too much
>>>> for this already too complex function, and instead added a new
>>>> simpler fixup_user_fault() which is essentially a wrapper around
>>>> handle_mm_fault() which the futex code can call.
>>>>
>>>> Signed-off-by: Benjamin Herrenschmidt<benh@kernel.crashing.org>
>>>> ---
>>>>
>>>> Shan, can you test this ? It might not fix the problem
>>> um, what problem.  There's no description here of the user-visible
>>> effects of the bug hence it's hard to work out what kernel version(s)
>>> should receive this patch.
>> Shan could give you an actual example (it was in the previous thread),
>> but basically, livelock as the kernel keeps trying and trying the
>> in_atomic op and never resolves it.
>>
>>> What kernel version(s) should receive this patch?
>> I haven't dug. Probably anything it applies on as far as we did that
>> trick of atomic + gup() for futex.
> You're not understanding me.
>
> I need a good reason to merge this into 3.0.
>
> The -stable maintainers need even better reasons to merge this into
> earlier kernels.
>
> Please provide those reasons!
>

Summary:
- Encountered a 100% CPU system usage problem on pthread_mutex allocated 
in a
     shared memory region, and the problem occurs only on setting 
PRIORITY_INHERITANCE
     to the pthread_mutex.
- ftrace result reveals that an infinite loop in the futex_lock_pi 
caused high CPU usage.
- The powerpc e500 was affected but the x86 was not.
     I have not tested on other archs so I am not sure whether the other 
archs are attacked
     by the problem.
- Tested it on 2.6.34 and 3.0-rc7, both are affected, earlier versions 
might be affected.

Please refer the threads "[PATCH 0/1] Fixup write permission of TLB on 
powerpc e500 core"
and "[PATCH 1/1] Fixup write permission of TLB on powerpc e500 core" for 
the whole story.
Provided the test case code in the [PATH 0/1].

Thanks
Shan Hai

> (Documentation/stable_kernel_rules.txt, 4th bullet)
>
> (And it's not just me and -stable maintainers.  Distro maintainers will
> also look at this patch and wonder whether they should merge it)
Mike Frysinger July 27, 2011, 6:50 a.m. UTC | #18
On Mon, Jul 18, 2011 at 21:29, Benjamin Herrenschmidt wrote:
> The futex code currently attempts to write to user memory within
> a pagefault disabled section, and if that fails, tries to fix it
> up using get_user_pages().
>
> This doesn't work on archs where the dirty and young bits are
> maintained by software, since they will gate access permission
> in the TLB, and will not be updated by gup().
>
> In addition, there's an expectation on some archs that a
> spurious write fault triggers a local TLB flush, and that is
> missing from the picture as well.
>
> I decided that adding those "features" to gup() would be too much
> for this already too complex function, and instead added a new
> simpler fixup_user_fault() which is essentially a wrapper around
> handle_mm_fault() which the futex code can call.

unfortunately, this breaks all nommu ports.  you added
fixup_user_fault() to mm/memory.c only which is not used by nommu
logic.
-mike
Benjamin Herrenschmidt July 27, 2011, 7:58 a.m. UTC | #19
On Tue, 2011-07-26 at 23:50 -0700, Mike Frysinger wrote:
> On Mon, Jul 18, 2011 at 21:29, Benjamin Herrenschmidt wrote:
> > The futex code currently attempts to write to user memory within
> > a pagefault disabled section, and if that fails, tries to fix it
> > up using get_user_pages().
> >
> > This doesn't work on archs where the dirty and young bits are
> > maintained by software, since they will gate access permission
> > in the TLB, and will not be updated by gup().
> >
> > In addition, there's an expectation on some archs that a
> > spurious write fault triggers a local TLB flush, and that is
> > missing from the picture as well.
> >
> > I decided that adding those "features" to gup() would be too much
> > for this already too complex function, and instead added a new
> > simpler fixup_user_fault() which is essentially a wrapper around
> > handle_mm_fault() which the futex code can call.
> 
> unfortunately, this breaks all nommu ports.  you added
> fixup_user_fault() to mm/memory.c only which is not used by nommu

Argh. Andrew, do you want to send a fix ? I won't be able to do that
tonight, I have to go.

What should nommu do anyways ? it's not like there's much it can do
right ? It should never even hit the fault path to start with ...

Cheers,
Ben.
diff mbox

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..1036614 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -985,6 +985,8 @@  int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 struct page *get_dump_page(unsigned long addr);
+extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+			    unsigned long address, unsigned int fault_flags);
 
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
 extern void do_invalidatepage(struct page *page, unsigned long offset);
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..7a0a4ed 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -355,8 +355,8 @@  static int fault_in_user_writeable(u32 __user *uaddr)
 	int ret;
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, (unsigned long)uaddr,
-			     1, 1, 0, NULL, NULL);
+	ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
+			       FAULT_FLAG_WRITE);
 	up_read(&mm->mmap_sem);
 
 	return ret < 0 ? ret : 0;
diff --git a/mm/memory.c b/mm/memory.c
index 40b7531..b967fb0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1815,7 +1815,64 @@  next_page:
 }
 EXPORT_SYMBOL(__get_user_pages);
 
-/**
+/*
+ * fixup_user_fault() - manually resolve a user page  fault
+ * @tsk:	the task_struct to use for page fault accounting, or
+ *		NULL if faults are not to be recorded.
+ * @mm:		mm_struct of target mm
+ * @address:	user address
+ * @fault_flags:flags to pass down to handle_mm_fault()
+ *
+ * This is meant to be called in the specific scenario where for
+ * locking reasons we try to access user memory in atomic context
+ * (within a pagefault_disable() section), this returns -EFAULT,
+ * and we want to resolve the user fault before trying again.
+ *
+ * Typically this is meant to be used by the futex code.
+ *
+ * The main difference with get_user_pages() is that this function
+ * will unconditionally call handle_mm_fault() which will in turn
+ * perform all the necessary SW fixup of the dirty and young bits
+ * in the PTE, while handle_mm_fault() only guarantees to update
+ * these in the struct page.
+ *
+ * This is important for some architectures where those bits also
+ * gate the access permission to the page because their are
+ * maintained in software. On such architecture, gup() will not
+ * be enough to make a subsequent access succeed.
+ *
+ * This should be called with the mm_sem held for read.
+ */
+int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+		     unsigned long address, unsigned int fault_flags)
+{
+	struct vm_area_struct *vma;
+	int ret;
+
+	vma = find_extend_vma(mm, address);
+	if (!vma || address < vma->vm_start)
+		return -EFAULT;
+	
+	ret = handle_mm_fault(mm, vma, address, fault_flags);
+	if (ret & VM_FAULT_ERROR) {
+		if (ret & VM_FAULT_OOM)
+			return -ENOMEM;
+		if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
+			return -EHWPOISON;
+		if (ret & VM_FAULT_SIGBUS)
+			return -EFAULT;
+		BUG();
+	}
+	if (tsk) {
+		if (ret & VM_FAULT_MAJOR)
+			tsk->maj_flt++;
+		else
+			tsk->min_flt++;
+	}
+	return 0;
+}
+
+/*
  * get_user_pages() - pin user pages in memory
  * @tsk:	the task_struct to use for page fault accounting, or
  *		NULL if faults are not to be recorded.