
[V4,0/2] mm: FAULT_AROUND_ORDER patchset performance data for powerpc

Message ID 20140520102738.7F096E009B@blue.fi.intel.com (mailing list archive)
State Not Applicable

Commit Message

Kirill A. Shutemov May 20, 2014, 10:27 a.m. UTC
Rusty Russell wrote:
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> > Andrew Morton wrote:
> >> On Mon, 19 May 2014 16:23:07 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> >> 
> >> > Shouldn't FAULT_AROUND_ORDER and fault_around_order be changed to be
> >> > the order of the fault-around size in bytes, and fault_around_pages()
> >> > use 1UL << (fault_around_order - PAGE_SHIFT)
> >> 
> >> Yes.  And shame on me for missing it (this time!) at review.
> >> 
> >> There's still time to fix this.  Patches, please.
> >
> > Here it is. Made at 3.30 AM, build tested only.
> 
> Prefer on top of Maddy's patch which makes it always a variable, rather
> than CONFIG_DEBUG_FS.  It's got enough hair as it is.

Something like this?

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Tue, 20 May 2014 13:02:03 +0300
Subject: [PATCH] mm: nominate faultaround area in bytes rather than page order

There is evidence that the faultaround feature is less relevant on
architectures with a page size bigger than 4k, which makes sense since
the page fault overhead per byte of mapped area should be lower there.

Let's rework the feature to specify the faultaround area in bytes instead
of page order. It's 64 kilobytes for now.

The patch effectively disables faultaround on architectures with
page size >= 64k (like ppc64).

It's possible that some other faultaround area size is relevant for a
platform. We can expose the `fault_around_bytes' variable to arch-specific
code once such platforms are found.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c | 62 +++++++++++++++++++++++--------------------------------------
 1 file changed, 23 insertions(+), 39 deletions(-)

Comments

Andrew Morton May 20, 2014, 7:59 p.m. UTC | #1
On Tue, 20 May 2014 13:27:38 +0300 (EEST) "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> Rusty Russell wrote:
> > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> > > Andrew Morton wrote:
> > >> On Mon, 19 May 2014 16:23:07 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> > >> 
> > >> > Shouldn't FAULT_AROUND_ORDER and fault_around_order be changed to be
> > >> > the order of the fault-around size in bytes, and fault_around_pages()
> > >> > use 1UL << (fault_around_order - PAGE_SHIFT)
> > >> 
> > >> Yes.  And shame on me for missing it (this time!) at review.
> > >> 
> > >> There's still time to fix this.  Patches, please.
> > >
> > > Here it is. Made at 3.30 AM, build tested only.
> > 
> > Prefer on top of Maddy's patch which makes it always a variable, rather
> > than CONFIG_DEBUG_FS.  It's got enough hair as it is.
> 
> Something like this?

This appears to be against mainline, not against Madhavan's patch.  As
mentioned previously, I'd prefer it that way, but I'm confused.


> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Tue, 20 May 2014 13:02:03 +0300
> Subject: [PATCH] mm: nominate faultaround area in bytes rather then page order
> 
> There are evidences that faultaround feature is less relevant on
> architectures with page size bigger then 4k. Which makes sense since
> page fault overhead per byte of mapped area should be less there.
> 
> Let's rework the feature to specify faultaround area in bytes instead of
> page order. It's 64 kilobytes for now.
> 
> The patch effectively disables faultaround on architectures with
> page size >= 64k (like ppc64).
> 
> It's possible that some other size of faultaround area is relevant for a
> platform. We can expose `fault_around_bytes' variable to arch-specific
> code once such platforms will be found.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/memory.c | 62 +++++++++++++++++++++++--------------------------------------
>  1 file changed, 23 insertions(+), 39 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 037b812a9531..252b319e8cdf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3402,63 +3402,47 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
>  	update_mmu_cache(vma, address, pte);
>  }
>  
> -#define FAULT_AROUND_ORDER 4
> +static unsigned long fault_around_bytes = 65536;
> +
> +static inline unsigned long fault_around_pages(void)
> +{
> +	return rounddown_pow_of_two(fault_around_bytes) / PAGE_SIZE;
> +}

I think we should round up, not down.  So if the user asks for 1kb,
they get one page.

So this becomes

	return PAGE_ALIGN(fault_around_bytes) / PAGE_SIZE;

> +static inline unsigned long fault_around_mask(void)
> +{
> +	return ~(rounddown_pow_of_two(fault_around_bytes) - 1) & PAGE_MASK;
> +}

And this has me a bit stumped.  It's not helpful that do_fault_around()
is undocumented.  Does it fault in N/2 pages ahead and N/2 pages
behind?  Or does it align the address down to the highest multiple of
fault_around_bytes?  It appears to be the latter, so the location of
the faultaround window around the fault address is basically random,
depending on what address userspace happened to pick.  I don't know why
we did this :(

Or something.  Can we please get some code commentary over
do_fault_around() describing this design decision and explaining the
reasoning behind it?


Also, "neast" is not a word.
maddy May 27, 2014, 6:24 a.m. UTC | #2
On Tuesday 20 May 2014 03:57 PM, Kirill A. Shutemov wrote:
> Rusty Russell wrote:
>> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
>>> Andrew Morton wrote:
>>>> On Mon, 19 May 2014 16:23:07 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
>>>>
>>>>> Shouldn't FAULT_AROUND_ORDER and fault_around_order be changed to be
>>>>> the order of the fault-around size in bytes, and fault_around_pages()
>>>>> use 1UL << (fault_around_order - PAGE_SHIFT)
>>>>
>>>> Yes.  And shame on me for missing it (this time!) at review.
>>>>
>>>> There's still time to fix this.  Patches, please.
>>>
>>> Here it is. Made at 3.30 AM, build tested only.
>>
>> Prefer on top of Maddy's patch which makes it always a variable, rather
>> than CONFIG_DEBUG_FS.  It's got enough hair as it is.
> 
> Something like this?
> 
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Tue, 20 May 2014 13:02:03 +0300
> Subject: [PATCH] mm: nominate faultaround area in bytes rather then page order
> 
> There are evidences that faultaround feature is less relevant on
> architectures with page size bigger then 4k. Which makes sense since
> page fault overhead per byte of mapped area should be less there.
> 
> Let's rework the feature to specify faultaround area in bytes instead of
> page order. It's 64 kilobytes for now.
> 
> The patch effectively disables faultaround on architectures with
> page size >= 64k (like ppc64).
> 
> It's possible that some other size of faultaround area is relevant for a
> platform. We can expose `fault_around_bytes' variable to arch-specific
> code once such platforms will be found.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/memory.c | 62 +++++++++++++++++++++++--------------------------------------
>  1 file changed, 23 insertions(+), 39 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 037b812a9531..252b319e8cdf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3402,63 +3402,47 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
>  	update_mmu_cache(vma, address, pte);
>  }
> 
> -#define FAULT_AROUND_ORDER 4
> +static unsigned long fault_around_bytes = 65536;
> +
> +static inline unsigned long fault_around_pages(void)
> +{
> +	return rounddown_pow_of_two(fault_around_bytes) / PAGE_SIZE;
> +}
> +
> +static inline unsigned long fault_around_mask(void)
> +{
> +	return ~(rounddown_pow_of_two(fault_around_bytes) - 1) & PAGE_MASK;
> +}
> 
> -#ifdef CONFIG_DEBUG_FS
> -static unsigned int fault_around_order = FAULT_AROUND_ORDER;
> 
> -static int fault_around_order_get(void *data, u64 *val)
> +#ifdef CONFIG_DEBUG_FS
> +static int fault_around_bytes_get(void *data, u64 *val)
>  {
> -	*val = fault_around_order;
> +	*val = fault_around_bytes;
>  	return 0;
>  }
> 
> -static int fault_around_order_set(void *data, u64 val)
> +static int fault_around_bytes_set(void *data, u64 val)
>  {

Kindly ignore the question if it's not relevant. Even though we need
root access to alter the value, will we be fine with a negative value?

Regards
Maddy

> -	BUILD_BUG_ON((1UL << FAULT_AROUND_ORDER) > PTRS_PER_PTE);
> -	if (1UL << val > PTRS_PER_PTE)
> +	if (val / PAGE_SIZE > PTRS_PER_PTE)
>  		return -EINVAL;
> -	fault_around_order = val;
> +	fault_around_bytes = val;
>  	return 0;
>  }
> -DEFINE_SIMPLE_ATTRIBUTE(fault_around_order_fops,
> -		fault_around_order_get, fault_around_order_set, "%llu\n");
> +DEFINE_SIMPLE_ATTRIBUTE(fault_around_bytes_fops,
> +		fault_around_bytes_get, fault_around_bytes_set, "%llu\n");
> 
>  static int __init fault_around_debugfs(void)
>  {
>  	void *ret;
> 
> -	ret = debugfs_create_file("fault_around_order",	0644, NULL, NULL,
> -			&fault_around_order_fops);
> +	ret = debugfs_create_file("fault_around_bytes", 0644, NULL, NULL,
> +			&fault_around_bytes_fops);
>  	if (!ret)
> -		pr_warn("Failed to create fault_around_order in debugfs");
> +		pr_warn("Failed to create fault_around_bytes in debugfs");
>  	return 0;
>  }
>  late_initcall(fault_around_debugfs);
> -
> -static inline unsigned long fault_around_pages(void)
> -{
> -	return 1UL << fault_around_order;
> -}
> -
> -static inline unsigned long fault_around_mask(void)
> -{
> -	return ~((1UL << (PAGE_SHIFT + fault_around_order)) - 1);
> -}
> -#else
> -static inline unsigned long fault_around_pages(void)
> -{
> -	unsigned long nr_pages;
> -
> -	nr_pages = 1UL << FAULT_AROUND_ORDER;
> -	BUILD_BUG_ON(nr_pages > PTRS_PER_PTE);
> -	return nr_pages;
> -}
> -
> -static inline unsigned long fault_around_mask(void)
> -{
> -	return ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1);
> -}
>  #endif
> 
>  static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
> @@ -3515,7 +3499,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	 * if page by the offset is not ready to be mapped (cold cache or
>  	 * something).
>  	 */
> -	if (vma->vm_ops->map_pages) {
> +	if (vma->vm_ops->map_pages && fault_around_pages() > 1) {
>  		pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>  		do_fault_around(vma, address, pte, pgoff, flags);
>  		if (!pte_same(*pte, orig_pte))
>
Kirill A. Shutemov May 27, 2014, 10:21 a.m. UTC | #3
Madhavan Srinivasan wrote:
> On Tuesday 20 May 2014 03:57 PM, Kirill A. Shutemov wrote:
> > Rusty Russell wrote:
> >> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> >>> Andrew Morton wrote:
> >>>> On Mon, 19 May 2014 16:23:07 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> >>>>
> >>>>> Shouldn't FAULT_AROUND_ORDER and fault_around_order be changed to be
> >>>>> the order of the fault-around size in bytes, and fault_around_pages()
> >>>>> use 1UL << (fault_around_order - PAGE_SHIFT)
> >>>>
> >>>> Yes.  And shame on me for missing it (this time!) at review.
> >>>>
> >>>> There's still time to fix this.  Patches, please.
> >>>
> >>> Here it is. Made at 3.30 AM, build tested only.
> >>
> >> Prefer on top of Maddy's patch which makes it always a variable, rather
> >> than CONFIG_DEBUG_FS.  It's got enough hair as it is.
> > 
> > Something like this?
> > 
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > Date: Tue, 20 May 2014 13:02:03 +0300
> > Subject: [PATCH] mm: nominate faultaround area in bytes rather then page order
> > 
> > There are evidences that faultaround feature is less relevant on
> > architectures with page size bigger then 4k. Which makes sense since
> > page fault overhead per byte of mapped area should be less there.
> > 
> > Let's rework the feature to specify faultaround area in bytes instead of
> > page order. It's 64 kilobytes for now.
> > 
> > The patch effectively disables faultaround on architectures with
> > page size >= 64k (like ppc64).
> > 
> > It's possible that some other size of faultaround area is relevant for a
> > platform. We can expose `fault_around_bytes' variable to arch-specific
> > code once such platforms will be found.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  mm/memory.c | 62 +++++++++++++++++++++++--------------------------------------
> >  1 file changed, 23 insertions(+), 39 deletions(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 037b812a9531..252b319e8cdf 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3402,63 +3402,47 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
> >  	update_mmu_cache(vma, address, pte);
> >  }
> > 
> > -#define FAULT_AROUND_ORDER 4
> > +static unsigned long fault_around_bytes = 65536;
> > +
> > +static inline unsigned long fault_around_pages(void)
> > +{
> > +	return rounddown_pow_of_two(fault_around_bytes) / PAGE_SIZE;
> > +}
> > +
> > +static inline unsigned long fault_around_mask(void)
> > +{
> > +	return ~(rounddown_pow_of_two(fault_around_bytes) - 1) & PAGE_MASK;
> > +}
> > 
> > -#ifdef CONFIG_DEBUG_FS
> > -static unsigned int fault_around_order = FAULT_AROUND_ORDER;
> > 
> > -static int fault_around_order_get(void *data, u64 *val)
> > +#ifdef CONFIG_DEBUG_FS
> > +static int fault_around_bytes_get(void *data, u64 *val)
> >  {
> > -	*val = fault_around_order;
> > +	*val = fault_around_bytes;
> >  	return 0;
> >  }
> > 
> > -static int fault_around_order_set(void *data, u64 val)
> > +static int fault_around_bytes_set(void *data, u64 val)
> >  {
> 
> Kindly ignore the question if not relevant. Even though we need root
> access to alter the value, will we be fine with
> negative value?.

val is u64, or am I missing something?
maddy May 27, 2014, 10:44 a.m. UTC | #4
On Tuesday 27 May 2014 03:51 PM, Kirill A. Shutemov wrote:
> Madhavan Srinivasan wrote:
>> On Tuesday 20 May 2014 03:57 PM, Kirill A. Shutemov wrote:
>>> Rusty Russell wrote:
>>>> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
>>>>> Andrew Morton wrote:
>>>>>> On Mon, 19 May 2014 16:23:07 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
>>>>>>
>>>>>>> Shouldn't FAULT_AROUND_ORDER and fault_around_order be changed to be
>>>>>>> the order of the fault-around size in bytes, and fault_around_pages()
>>>>>>> use 1UL << (fault_around_order - PAGE_SHIFT)
>>>>>>
>>>>>> Yes.  And shame on me for missing it (this time!) at review.
>>>>>>
>>>>>> There's still time to fix this.  Patches, please.
>>>>>
>>>>> Here it is. Made at 3.30 AM, build tested only.
>>>>
>>>> Prefer on top of Maddy's patch which makes it always a variable, rather
>>>> than CONFIG_DEBUG_FS.  It's got enough hair as it is.
>>>
>>> Something like this?
>>>
>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>> Date: Tue, 20 May 2014 13:02:03 +0300
>>> Subject: [PATCH] mm: nominate faultaround area in bytes rather then page order
>>>
>>> There are evidences that faultaround feature is less relevant on
>>> architectures with page size bigger then 4k. Which makes sense since
>>> page fault overhead per byte of mapped area should be less there.
>>>
>>> Let's rework the feature to specify faultaround area in bytes instead of
>>> page order. It's 64 kilobytes for now.
>>>
>>> The patch effectively disables faultaround on architectures with
>>> page size >= 64k (like ppc64).
>>>
>>> It's possible that some other size of faultaround area is relevant for a
>>> platform. We can expose `fault_around_bytes' variable to arch-specific
>>> code once such platforms will be found.
>>>
>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> ---
>>>  mm/memory.c | 62 +++++++++++++++++++++++--------------------------------------
>>>  1 file changed, 23 insertions(+), 39 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 037b812a9531..252b319e8cdf 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3402,63 +3402,47 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
>>>  	update_mmu_cache(vma, address, pte);
>>>  }
>>>
>>> -#define FAULT_AROUND_ORDER 4
>>> +static unsigned long fault_around_bytes = 65536;
>>> +
>>> +static inline unsigned long fault_around_pages(void)
>>> +{
>>> +	return rounddown_pow_of_two(fault_around_bytes) / PAGE_SIZE;
>>> +}
>>> +
>>> +static inline unsigned long fault_around_mask(void)
>>> +{
>>> +	return ~(rounddown_pow_of_two(fault_around_bytes) - 1) & PAGE_MASK;
>>> +}
>>>
>>> -#ifdef CONFIG_DEBUG_FS
>>> -static unsigned int fault_around_order = FAULT_AROUND_ORDER;
>>>
>>> -static int fault_around_order_get(void *data, u64 *val)
>>> +#ifdef CONFIG_DEBUG_FS
>>> +static int fault_around_bytes_get(void *data, u64 *val)
>>>  {
>>> -	*val = fault_around_order;
>>> +	*val = fault_around_bytes;
>>>  	return 0;
>>>  }
>>>
>>> -static int fault_around_order_set(void *data, u64 val)
>>> +static int fault_around_bytes_set(void *data, u64 val)
>>>  {
>>
>> Kindly ignore the question if not relevant. Even though we need root
>> access to alter the value, will we be fine with
>> negative value?.
> val is u64. or I miss something?
> 

My bad. What I wanted to check was an all-0xf input, and I guess we are
fine. Sorry about that.

Regards
Maddy

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 037b812a9531..252b319e8cdf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3402,63 +3402,47 @@  void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 	update_mmu_cache(vma, address, pte);
 }
 
-#define FAULT_AROUND_ORDER 4
+static unsigned long fault_around_bytes = 65536;
+
+static inline unsigned long fault_around_pages(void)
+{
+	return rounddown_pow_of_two(fault_around_bytes) / PAGE_SIZE;
+}
+
+static inline unsigned long fault_around_mask(void)
+{
+	return ~(rounddown_pow_of_two(fault_around_bytes) - 1) & PAGE_MASK;
+}
 
-#ifdef CONFIG_DEBUG_FS
-static unsigned int fault_around_order = FAULT_AROUND_ORDER;
 
-static int fault_around_order_get(void *data, u64 *val)
+#ifdef CONFIG_DEBUG_FS
+static int fault_around_bytes_get(void *data, u64 *val)
 {
-	*val = fault_around_order;
+	*val = fault_around_bytes;
 	return 0;
 }
 
-static int fault_around_order_set(void *data, u64 val)
+static int fault_around_bytes_set(void *data, u64 val)
 {
-	BUILD_BUG_ON((1UL << FAULT_AROUND_ORDER) > PTRS_PER_PTE);
-	if (1UL << val > PTRS_PER_PTE)
+	if (val / PAGE_SIZE > PTRS_PER_PTE)
 		return -EINVAL;
-	fault_around_order = val;
+	fault_around_bytes = val;
 	return 0;
 }
-DEFINE_SIMPLE_ATTRIBUTE(fault_around_order_fops,
-		fault_around_order_get, fault_around_order_set, "%llu\n");
+DEFINE_SIMPLE_ATTRIBUTE(fault_around_bytes_fops,
+		fault_around_bytes_get, fault_around_bytes_set, "%llu\n");
 
 static int __init fault_around_debugfs(void)
 {
 	void *ret;
 
-	ret = debugfs_create_file("fault_around_order",	0644, NULL, NULL,
-			&fault_around_order_fops);
+	ret = debugfs_create_file("fault_around_bytes", 0644, NULL, NULL,
+			&fault_around_bytes_fops);
 	if (!ret)
-		pr_warn("Failed to create fault_around_order in debugfs");
+		pr_warn("Failed to create fault_around_bytes in debugfs");
 	return 0;
 }
 late_initcall(fault_around_debugfs);
-
-static inline unsigned long fault_around_pages(void)
-{
-	return 1UL << fault_around_order;
-}
-
-static inline unsigned long fault_around_mask(void)
-{
-	return ~((1UL << (PAGE_SHIFT + fault_around_order)) - 1);
-}
-#else
-static inline unsigned long fault_around_pages(void)
-{
-	unsigned long nr_pages;
-
-	nr_pages = 1UL << FAULT_AROUND_ORDER;
-	BUILD_BUG_ON(nr_pages > PTRS_PER_PTE);
-	return nr_pages;
-}
-
-static inline unsigned long fault_around_mask(void)
-{
-	return ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1);
-}
 #endif
 
 static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
@@ -3515,7 +3499,7 @@  static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * if page by the offset is not ready to be mapped (cold cache or
 	 * something).
 	 */
-	if (vma->vm_ops->map_pages) {
+	if (vma->vm_ops->map_pages && fault_around_pages() > 1) {
 		pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 		do_fault_around(vma, address, pte, pgoff, flags);
 		if (!pte_same(*pte, orig_pte))