[v3,0/9] parallelized "struct page" zeroing

Submitted by Michal Hocko on May 9, 2017, 6:12 p.m.


Message ID 20170509181234.GA4397@dhcp22.suse.cz
State New

Commit Message

Michal Hocko May 9, 2017, 6:12 p.m.
On Fri 05-05-17 13:03:07, Pavel Tatashin wrote:
> Changelog:
> 	v2 - v3
> 	- Addressed David's comments about one change per patch:
> 		* Split changes to platforms into 4 patches
> 		* Made "do not zero vmemmap_buf" as a separate patch
> 	v1 - v2
> 	- Per request, added s390 to deferred "struct page" zeroing
> 	- Collected performance data on x86 which proves the importance of
> 	  keeping memset() as a prefetch (see below).
> 
> When deferred struct page initialization feature is enabled, we get a
> performance gain of initializing vmemmap in parallel after other CPUs are
> started. However, we still zero the memory for vmemmap using one boot CPU.
> This patch-set fixes the memset-zeroing limitation by deferring it as well.

I like the idea of postponing the zeroing from the allocation to the
init time. To be honest the improvement looks much larger than I would
expect (btw. this should be a part of the changelog rather than an
outside link).

The implementation just looks larger than I would expect. E.g. do
we really need to add a zero argument to a large part of the memblock
API? Wouldn't it be easier to simply export memblock_virt_alloc_internal
(or its tiny wrapper memblock_virt_alloc_core) and move the zeroing
outside to its 2 callers? A completely untested scratch version is at the
end of the email.

Also it seems that this is not 100% correct either as it only cares
about VMEMMAP while DEFERRED_STRUCT_PAGE_INIT might be enabled also for
SPARSEMEM. This would suggest that we would zero out pages twice,
right?

A similar concern applies to the memory hotplug patch, which will
fall back to the slab/page allocator IIRC. On the other hand,
__init_single_page is shared with the hotplug code, so again we would
initialize twice.

So I suspect more changes are needed. I will have a closer look tomorrow.

>  arch/powerpc/mm/init_64.c |    4 +-
>  arch/s390/mm/vmem.c       |    5 ++-
>  arch/sparc/mm/init_64.c   |   26 +++++++----------------
>  arch/x86/mm/init_64.c     |    3 +-
>  include/linux/bootmem.h   |    3 ++
>  include/linux/mm.h        |   15 +++++++++++--
>  mm/memblock.c             |   46 ++++++++++++++++++++++++++++++++++++------
>  mm/page_alloc.c           |    3 ++
>  mm/sparse-vmemmap.c       |   48 +++++++++++++++++++++++++++++---------------
>  9 files changed, 103 insertions(+), 50 deletions(-)


The bootmem API change mentioned above.

 include/linux/bootmem.h |  3 +++
 mm/memblock.c           | 41 ++++++++++++++++++++++++++---------------
 mm/sparse-vmemmap.c     |  2 +-
 3 files changed, 30 insertions(+), 16 deletions(-)

Comments

Pavel Tatashin May 9, 2017, 6:54 p.m.
Hi Michal,

> I like the idea of postponing the zeroing from the allocation to the
> init time. To be honest the improvement looks much larger than I would
> expect (Btw. this should be a part of the changelog rather than a
> outside link).

The improvements are larger because this time was never measured, as 
Linux does not have early boot time stamps. I added them for x86 and 
SPARC to measure the performance. I am pushing those changes through 
separate patchsets.

> 
> The implementation just looks too large to what I would expect. E.g. do
> we really need to add zero argument to the large part of the memblock
> API? Wouldn't it be easier to simply export memblock_virt_alloc_internal
> (or its tiny wrapper memblock_virt_alloc_core) and move the zeroing
> outside to its 2 callers? A completely untested scratched version at the
> end of the email.

I am OK with this change. But I do not really see a difference between:

memblock_virt_alloc_raw()
and
memblock_virt_alloc_core()

In both cases we use memblock_virt_alloc_internal(); the only 
difference is that in my case we tell memblock_virt_alloc_internal() to 
zero the pages if needed, and in your case the other two callers are 
zeroing it. I like moving memblock_dbg() inside 
memblock_virt_alloc_internal().

> 
> Also it seems that this is not 100% correct either as it only cares
> about VMEMMAP while DEFERRED_STRUCT_PAGE_INIT might be enabled also for
> SPARSEMEM. This would suggest that we would zero out pages twice,
> right?

Thank you, I will check this combination before sending out the next patch.

> 
> A similar concern would go to the memory hotplug patch which will
> fall back to the slab/page allocator IIRC. On the other hand
> __init_single_page is shared with the hotplug code so again we would
> initialize 2 times.

Correct: when memory is hotplugged, to gain the benefit of this fix, and 
also not to regress by actually double zeroing "struct pages", we should 
not zero it out. However, I do not really have the means to test it.

> 
> So I suspect more changes are needed. I will have a closer look tomorrow.

Thank you for reviewing this work. I will wait for your comments before 
sending out updated patches.

Pasha
Michal Hocko May 10, 2017, 7:24 a.m.
On Tue 09-05-17 14:54:50, Pasha Tatashin wrote:
[...]
> >The implementation just looks too large to what I would expect. E.g. do
> >we really need to add zero argument to the large part of the memblock
> >API? Wouldn't it be easier to simply export memblock_virt_alloc_internal
> >(or its tiny wrapper memblock_virt_alloc_core) and move the zeroing
> >outside to its 2 callers? A completely untested scratched version at the
> >end of the email.
> 
> I am OK, with this change. But, I do not really see a difference between:
> 
> memblock_virt_alloc_raw()
> and
> memblock_virt_alloc_core()
> 
> In both cases we use memblock_virt_alloc_internal(), but the only difference
> is that in my case we tell memblock_virt_alloc_internal() to zero the pages
> if needed, and in your case the other two callers are zeroing it. I like
> moving memblock_dbg() inside memblock_virt_alloc_internal()

Well, I didn't object to this particular part. I was mostly concerned
about
http://lkml.kernel.org/r/1494003796-748672-4-git-send-email-pasha.tatashin@oracle.com
and the "zero" argument for other functions. I guess we can do without
that. I _think_ that we should simply _always_ initialize the page at the
__init_single_page time rather than during the allocation. That would
require dropping __GFP_ZERO for non-memblock allocations. Or do you
think we could regress for single threaded initialization?

> >Also it seems that this is not 100% correct either as it only cares
> >about VMEMMAP while DEFERRED_STRUCT_PAGE_INIT might be enabled also for
> >SPARSEMEM. This would suggest that we would zero out pages twice,
> >right?
> 
> Thank you, I will check this combination before sending out the next patch.
> 
> >
> >A similar concern would go to the memory hotplug patch which will
> >fall back to the slab/page allocator IIRC. On the other hand
> >__init_single_page is shared with the hotplug code so again we would
> >initialize 2 times.
> 
> Correct, when memory it hotplugged, to gain the benefit of this fix, and
> also not to regress by actually double zeroing "struct pages" we should not
> zero it out. However, I do not really have means to test it.

It should be pretty easy to test with kvm, but I can help with testing
on the real HW as well.

Thanks!
Pavel Tatashin May 10, 2017, 1:42 p.m.
> 
> Well, I didn't object to this particular part. I was mostly concerned
> about
> http://lkml.kernel.org/r/1494003796-748672-4-git-send-email-pasha.tatashin@oracle.com
> and the "zero" argument for other functions. I guess we can do without
> that. I _think_ that we should simply _always_ initialize the page at the
> __init_single_page time rather than during the allocation. That would
> require dropping __GFP_ZERO for non-memblock allocations. Or do you
> think we could regress for single threaded initialization?
> 

Hi Michal,

That's exactly right: I am worried that we will regress when there is no 
parallelized initialization of "struct pages" if we unconditionally do 
memset() in __init_single_page(). The overhead of calling memset() on 
smaller chunks (64 bytes) may cause a regression, which is why I opted 
to zero this metadata only in the parallelized case. This way, we are 
guaranteed to see great improvements from this change without having 
regressions on platforms and builds that do not support parallelized 
initialization of "struct pages".

However, on some chips, such as the latest SPARCs, it is beneficial to 
have memset() right inside __init_single_page() even for the single 
threaded case, because it can act as a prefetch on chips with optimized 
block initializing store instructions.

Pasha
Michal Hocko May 10, 2017, 2:57 p.m.
On Wed 10-05-17 09:42:22, Pasha Tatashin wrote:
> >
> >Well, I didn't object to this particular part. I was mostly concerned
> >about
> >http://lkml.kernel.org/r/1494003796-748672-4-git-send-email-pasha.tatashin@oracle.com
> >and the "zero" argument for other functions. I guess we can do without
> >that. I _think_ that we should simply _always_ initialize the page at the
> >__init_single_page time rather than during the allocation. That would
> >require dropping __GFP_ZERO for non-memblock allocations. Or do you
> >think we could regress for single threaded initialization?
> >
> 
> Hi Michal,
> 
> Thats exactly right, I am worried that we will regress when there is no
> parallelized initialization of "struct pages" if we force unconditionally do
> memset() in __init_single_page(). The overhead of calling memset() on a
> smaller chunks (64-bytes) may cause the regression, this is why I opted only
> for parallelized case to zero this metadata. This way, we are guaranteed to
> see great improvements from this change without having regressions on
> platforms and builds that do not support parallelized initialization of
> "struct pages".

Have you measured that? I do not think it would be super hard to
measure. I would be quite surprised if this added much if anything at
all as the whole struct page should be in the cache line already. We do
set reference count and other struct members. Almost nobody should be
looking at our page at this time and stealing the cache line. On the
other hand a large memcpy will basically wipe everything away from the
cpu cache. Or am I missing something?
Pavel Tatashin May 10, 2017, 3:01 p.m.
On 05/10/2017 10:57 AM, Michal Hocko wrote:
> On Wed 10-05-17 09:42:22, Pasha Tatashin wrote:
>>>
>>> Well, I didn't object to this particular part. I was mostly concerned
>>> about
>>> http://lkml.kernel.org/r/1494003796-748672-4-git-send-email-pasha.tatashin@oracle.com
>>> and the "zero" argument for other functions. I guess we can do without
>>> that. I _think_ that we should simply _always_ initialize the page at the
>>> __init_single_page time rather than during the allocation. That would
>>> require dropping __GFP_ZERO for non-memblock allocations. Or do you
>>> think we could regress for single threaded initialization?
>>>
>>
>> Hi Michal,
>>
>> Thats exactly right, I am worried that we will regress when there is no
>> parallelized initialization of "struct pages" if we force unconditionally do
>> memset() in __init_single_page(). The overhead of calling memset() on a
>> smaller chunks (64-bytes) may cause the regression, this is why I opted only
>> for parallelized case to zero this metadata. This way, we are guaranteed to
>> see great improvements from this change without having regressions on
>> platforms and builds that do not support parallelized initialization of
>> "struct pages".
> 
> Have you measured that? I do not think it would be super hard to
> measure. I would be quite surprised if this added much if anything at
> all as the whole struct page should be in the cache line already. We do
> set reference count and other struct members. Almost nobody should be
> looking at our page at this time and stealing the cache line. On the
> other hand a large memcpy will basically wipe everything away from the
> cpu cache. Or am I missing something?
> 

Perhaps you are right, and I will measure on x86. But I suspect it can 
become unacceptable on some platforms: there is an overhead of calling a 
function, even if it is leaf-optimized, and there is an overhead in 
memset() to check for alignments of size and address, types of setting 
(zeroing vs. non-zeroing), etc., that adds up quickly.

Pasha
David Miller May 10, 2017, 3:19 p.m.
From: Michal Hocko <mhocko@kernel.org>
Date: Wed, 10 May 2017 16:57:26 +0200

> Have you measured that? I do not think it would be super hard to
> measure. I would be quite surprised if this added much if anything at
> all as the whole struct page should be in the cache line already. We do
> set reference count and other struct members. Almost nobody should be
> looking at our page at this time and stealing the cache line. On the
> other hand a large memcpy will basically wipe everything away from the
> cpu cache. Or am I missing something?

I guess it might be clearer if you understand what the block
initializing stores do on sparc64.  There are no memory accesses at
all.

The cpu just zeros out the cache line, that's it.

No L3 cache line is allocated.  So this "wipe everything" behavior
will not happen in the L3.
David Miller May 10, 2017, 3:20 p.m.
From: Pasha Tatashin <pasha.tatashin@oracle.com>
Date: Wed, 10 May 2017 11:01:40 -0400

> Perhaps you are right, and I will measure on x86. But, I suspect hit
> can become unacceptable on some platfoms: there is an overhead of
> calling a function, even if it is leaf-optimized, and there is an
> overhead in memset() to check for alignments of size and address,
> types of setting (zeroing vs. non-zeroing), etc., that adds up
> quickly.

Another source of overhead on the sparc64 side is that we must do
memory barriers around the block initializing stores.  So batching
calls to memset() amortizes that as well.
Matthew Wilcox May 10, 2017, 5:17 p.m.
On Wed, May 10, 2017 at 11:19:43AM -0400, David Miller wrote:
> From: Michal Hocko <mhocko@kernel.org>
> Date: Wed, 10 May 2017 16:57:26 +0200
> 
> > Have you measured that? I do not think it would be super hard to
> > measure. I would be quite surprised if this added much if anything at
> > all as the whole struct page should be in the cache line already. We do
> > set reference count and other struct members. Almost nobody should be
> > looking at our page at this time and stealing the cache line. On the
> > other hand a large memcpy will basically wipe everything away from the
> > cpu cache. Or am I missing something?
> 
> I guess it might be clearer if you understand what the block
> initializing stores do on sparc64.  There are no memory accesses at
> all.
> 
> The cpu just zeros out the cache line, that's it.
> 
> No L3 cache line is allocated.  So this "wipe everything" behavior
> will not happen in the L3.

There's either something wrong with your explanation or my reading
skills :-)

"There are no memory accesses"
"No L3 cache line is allocated"

You can have one or the other ... either the CPU sends a cacheline-sized
write of zeroes to memory without allocating an L3 cache line (maybe
using the store buffer?), or the CPU allocates an L3 cache line and sets
its contents to zeroes, probably putting it in the last way of the set
so it's the first thing to be evicted if not touched.

Or there's some magic in the memory bus protocol where the CPU gets to
tell the DRAM "hey, clear these cache lines".  Although that's also a
memory access of sorts ...
David Miller May 10, 2017, 6 p.m.
From: Matthew Wilcox <willy@infradead.org>
Date: Wed, 10 May 2017 10:17:03 -0700

> On Wed, May 10, 2017 at 11:19:43AM -0400, David Miller wrote:
>> From: Michal Hocko <mhocko@kernel.org>
>> Date: Wed, 10 May 2017 16:57:26 +0200
>> 
>> > Have you measured that? I do not think it would be super hard to
>> > measure. I would be quite surprised if this added much if anything at
>> > all as the whole struct page should be in the cache line already. We do
>> > set reference count and other struct members. Almost nobody should be
>> > looking at our page at this time and stealing the cache line. On the
>> > other hand a large memcpy will basically wipe everything away from the
>> > cpu cache. Or am I missing something?
>> 
>> I guess it might be clearer if you understand what the block
>> initializing stores do on sparc64.  There are no memory accesses at
>> all.
>> 
>> The cpu just zeros out the cache line, that's it.
>> 
>> No L3 cache line is allocated.  So this "wipe everything" behavior
>> will not happen in the L3.
> 
> There's either something wrong with your explanation or my reading
> skills :-)
> 
> "There are no memory accesses"
> "No L3 cache line is allocated"
> 
> You can have one or the other ... either the CPU sends a cacheline-sized
> write of zeroes to memory without allocating an L3 cache line (maybe
> using the store buffer?), or the CPU allocates an L3 cache line and sets
> its contents to zeroes, probably putting it in the last way of the set
> so it's the first thing to be evicted if not touched.

There is no conflict in what I said.

Only an L2 cache line is allocated and cleared.  L3 is left alone.
Matthew Wilcox May 10, 2017, 9:11 p.m.
On Wed, May 10, 2017 at 02:00:26PM -0400, David Miller wrote:
> From: Matthew Wilcox <willy@infradead.org>
> Date: Wed, 10 May 2017 10:17:03 -0700
> > On Wed, May 10, 2017 at 11:19:43AM -0400, David Miller wrote:
> >> I guess it might be clearer if you understand what the block
> >> initializing stores do on sparc64.  There are no memory accesses at
> >> all.
> >> 
> >> The cpu just zeros out the cache line, that's it.
> >> 
> >> No L3 cache line is allocated.  So this "wipe everything" behavior
> >> will not happen in the L3.
> > 
> > There's either something wrong with your explanation or my reading
> > skills :-)
> > 
> > "There are no memory accesses"
> > "No L3 cache line is allocated"
> > 
> > You can have one or the other ... either the CPU sends a cacheline-sized
> > write of zeroes to memory without allocating an L3 cache line (maybe
> > using the store buffer?), or the CPU allocates an L3 cache line and sets
> > its contents to zeroes, probably putting it in the last way of the set
> > so it's the first thing to be evicted if not touched.
> 
> There is no conflict in what I said.
> 
> Only an L2 cache line is allocated and cleared.  L3 is left alone.

I thought SPARC had inclusive caches.  So allocating an L2 cacheline
would necessitate allocating an L3 cacheline.  Or is this an exception
to the normal order of things?
Michal Hocko May 11, 2017, 8:05 a.m.
On Wed 10-05-17 11:19:43, David S. Miller wrote:
> From: Michal Hocko <mhocko@kernel.org>
> Date: Wed, 10 May 2017 16:57:26 +0200
> 
> > Have you measured that? I do not think it would be super hard to
> > measure. I would be quite surprised if this added much if anything at
> > all as the whole struct page should be in the cache line already. We do
> > set reference count and other struct members. Almost nobody should be
> > looking at our page at this time and stealing the cache line. On the
> > other hand a large memcpy will basically wipe everything away from the
> > cpu cache. Or am I missing something?
> 
> I guess it might be clearer if you understand what the block
> initializing stores do on sparc64.  There are no memory accesses at
> all.
> 
> The cpu just zeros out the cache line, that's it.
> 
> No L3 cache line is allocated.  So this "wipe everything" behavior
> will not happen in the L3.

OK, good to know. My understanding of sparc64 is close to zero.

Anyway, do you agree that doing the struct page initialization along
with other writes to it shouldn't add a measurable overhead compared
to pre-zeroing a larger block of struct pages?  We already have an
exclusive cache line, and doing one 64B write along with a few other
stores should be basically the same.
David Miller May 11, 2017, 2:35 p.m.
From: Michal Hocko <mhocko@kernel.org>
Date: Thu, 11 May 2017 10:05:38 +0200

> Anyway, do you agree that doing the struct page initialization along
> with other writes to it shouldn't add a measurable overhead comparing
> to pre-zeroing of larger block of struct pages?  We already have an
> exclusive cache line and doing one 64B write along with few other stores
> should be basically the same.

Yes, it should be reasonably cheap.
Pavel Tatashin May 11, 2017, 8:47 p.m.
>>
>> Have you measured that? I do not think it would be super hard to
>> measure. I would be quite surprised if this added much if anything at
>> all as the whole struct page should be in the cache line already. We do
>> set reference count and other struct members. Almost nobody should be
>> looking at our page at this time and stealing the cache line. On the
>> other hand a large memcpy will basically wipe everything away from the
>> cpu cache. Or am I missing something?
>>

Here is data for single thread (deferred struct page init is disabled):

Intel CPU E7-8895 v3 @ 2.60GHz  1T memory
-----------------------------------------
time to memset "struct page"s in memblock: 11.28s
time to init "struct page"s:                4.90s

Moving memset into __init_single_page():
time to init and memset "struct page"s:     8.39s

SPARC M6 @ 3600 MHz  1T memory
-----------------------------------------
time to memset "struct page"s in memblock:  1.60s
time to init "struct page"s:                3.37s

Moving memset into __init_single_page():
time to init and memset "struct page"s:    12.99s


So, moving memset() into __init_single_page() benefits Intel. I am 
actually surprised that memset() is so slow on Intel when it is called 
from memblock. But it hurts SPARC; I guess the membars at the end of 
memset() kill the performance.

Also, when looking at these values, remember that Intel has twice as many 
"struct page"s for the same amount of memory.

Pasha
Pavel Tatashin May 11, 2017, 8:59 p.m.
We should either keep memset() only for deferred struct pages, as in 
what I have in my patches.

Another option is to add a new function, struct_page_clear(), which 
would default to memset() and to something else on platforms that 
decide to optimize it.

On SPARC it would call STBIs, and we would do one membar call after all 
"struct pages" are initialized.

I think what I sent out already is the cleaner and better solution, 
because I am not sure what kind of performance we would see on other 
chips.

On 05/11/2017 04:47 PM, Pasha Tatashin wrote:
>>>
>>> Have you measured that? I do not think it would be super hard to
>>> measure. I would be quite surprised if this added much if anything at
>>> all as the whole struct page should be in the cache line already. We do
>>> set reference count and other struct members. Almost nobody should be
>>> looking at our page at this time and stealing the cache line. On the
>>> other hand a large memcpy will basically wipe everything away from the
>>> cpu cache. Or am I missing something?
>>>
> 
> Here is data for single thread (deferred struct page init is disabled):
> 
> Intel CPU E7-8895 v3 @ 2.60GHz  1T memory
> -----------------------------------------
> time to memset "struct pages in memblock: 11.28s
> time to init "struct pag"es:               4.90s
> 
> Moving memset into __init_single_page()
> time to init and memset "struct page"es:   8.39s
> 
> SPARC M6 @ 3600 MHz  1T memory
> -----------------------------------------
> time to memset "struct pages in memblock:  1.60s
> time to init "struct pag"es:               3.37s
> 
> Moving memset into __init_single_page()
> time to init and memset "struct page"es:  12.99s
> 
> 
> So, moving memset() into __init_single_page() benefits Intel. I am 
> actually surprised why memset() is so slow on intel when it is called 
> from memblock. But, hurts SPARC, I guess these membars at the end of 
> memset() kills the performance.
> 
> Also, when looking at these values, remeber that Intel has twice as many 
> "struct page" for the same amount of memory.
> 
> Pasha
David Miller May 12, 2017, 4:56 p.m.
From: Pasha Tatashin <pasha.tatashin@oracle.com>
Date: Thu, 11 May 2017 16:47:05 -0400

> So, moving memset() into __init_single_page() benefits Intel. I am
> actually surprised why memset() is so slow on intel when it is called
> from memblock. But, hurts SPARC, I guess these membars at the end of
> memset() kills the performance.

Perhaps an x86 expert can chime in, but it might be the case that past
a certain size, the microcode for the enhanced stosb uses non-temporal
stores or something like that.

As for sparc64, yes we can get really killed by the transactional cost
of memset because of the membars.

But I wonder, for a single page struct, if we even use the special
stores and thus eat the membar cost.  struct page is only 64 bytes,
and the cutoff in the Niagara4 bzero implementation is "64 + (64 - 8)"
so indeed the initializing stores will not even be used.

So sparc64 will only use the initializing stores and do the membars if
at least 2 page structs are cleared at a time.
David Miller May 12, 2017, 4:57 p.m.
From: Pasha Tatashin <pasha.tatashin@oracle.com>
Date: Thu, 11 May 2017 16:59:33 -0400

> We should either keep memset() only for deferred struct pages as what
> I have in my patches.
> 
> Another option is to add a new function struct_page_clear() which
> would default to memset() and to something else on platforms that
> decide to optimize it.
> 
> On SPARC it would call STBIs, and we would do one membar call after
> all "struct pages" are initialized.

No membars will be performed for single individual page struct clear,
the cutoff to use the STBI is larger than that.
Pavel Tatashin May 12, 2017, 5:24 p.m.
On 05/12/2017 12:57 PM, David Miller wrote:
> From: Pasha Tatashin <pasha.tatashin@oracle.com>
> Date: Thu, 11 May 2017 16:59:33 -0400
> 
>> We should either keep memset() only for deferred struct pages as what
>> I have in my patches.
>>
>> Another option is to add a new function struct_page_clear() which
>> would default to memset() and to something else on platforms that
>> decide to optimize it.
>>
>> On SPARC it would call STBIs, and we would do one membar call after
>> all "struct pages" are initialized.
> 
> No membars will be performed for single individual page struct clear,
> the cutoff to use the STBI is larger than that.
> 

Right now it is larger, but what I suggested is to add a new optimized 
routine just for this case, which would do the STBI for 64 bytes but 
without a membar (doing the membar at the end of memmap_init_zone() and 
deferred_init_memmap()):

#define struct_page_clear(page)                                 \
         __asm__ __volatile__(                                   \
         "stxa   %%g0, [%0]%2\n"                                 \
         "stxa   %%g0, [%0 + %1]%2\n"                            \
         : /* No output */                                       \
         : "r" (page), "r" (0x20), "i"(ASI_BLK_INIT_QUAD_LDD_P))

And insert it into __init_single_page() instead of memset().

The final result is 4.01s/T, which is even faster than the current 4.97s/T.



Pasha
David Miller May 12, 2017, 5:37 p.m.
From: Pasha Tatashin <pasha.tatashin@oracle.com>
Date: Fri, 12 May 2017 13:24:52 -0400

> Right now it is larger, but what I suggested is to add a new optimized
> routine just for this case, which would do STBI for 64-bytes but
> without membar (do membar at the end of memmap_init_zone() and
> deferred_init_memmap()
> 
> #define struct_page_clear(page)                                 \
>         __asm__ __volatile__(                                   \
>         "stxa   %%g0, [%0]%2\n"                                 \
>         "stxa   %%xg0, [%0 + %1]%2\n"                           \
>         : /* No output */                                       \
>         : "r" (page), "r" (0x20), "i"(ASI_BLK_INIT_QUAD_LDD_P))
> 
> And insert it into __init_single_page() instead of memset()
> 
> The final result is 4.01s/T which is even faster compared to current
> 4.97s/T

Ok, indeed, that would work.
Pavel Tatashin May 15, 2017, 6:12 p.m.
Hi Michal,

After looking at your suggested memblock_virt_alloc_core() change again, 
I decided to keep what I have. I do not want to inline 
memblock_virt_alloc_internal(), because it is not a performance critical 
path, and by inlining it we will unnecessarily increase the text size on 
all platforms.

Also, because it will be very hard to make sure that no platform 
regresses by making memset() the default in memblock_virt_alloc_core() 
(as I already showed last week, at least sun4v SPARC64 will require 
special changes in order for this to work), I decided to make it 
available only for the "deferred struct page init" case, as is already 
in the patch.

I am working on testing to make sure we do not need to double zero in 
the two cases that you found: sparsemem, and mem hotplug. Please let me 
know if you have any more comments, or if I can send new patches out 
when they are ready.

Thank you,
Pasha
Michal Hocko May 15, 2017, 7:38 p.m.
On Mon 15-05-17 14:12:10, Pasha Tatashin wrote:
> Hi Michal,
> 
> After looking at your suggested memblock_virt_alloc_core() change again, I
> decided to keep what I have. I do not want to inline
> memblock_virt_alloc_internal(), because it is not a performance critical
> path, and by inlining it we will unnecessarily increase the text size on all
> platforms.

I do not insist but I would really _prefer_ if the bool zero argument
didn't proliferate all over the memblock API.
 
> Also, because it will be very hard to make sure that no platform regresses
> by making memset() default in _memblock_virt_alloc_core() (as I already
> showed last week at least sun4v SPARC64 will require special changes in
> order for this to work), I decided to make it available only for "deferred
> struct page init" case. As, what is already in the patch.

I do not think this is the right approach. Your measurements just show
that sparc could have a more optimized memset for small sizes. If you
keep the same memset only for the parallel initialization then you
just hide this fact. I wouldn't worry about other architectures. All
sane architectures should simply work reasonably well when touching a
single or only few cache lines at the same time. If some arches really
suffer from small memsets then the initialization should be driven by a
specific ARCH_WANT_LARGE_PAGEBLOCK_INIT rather than making this depend
on DEFERRED_INIT. Or if you are too worried then make it opt-in and make
it depend on ARCH_WANT_PER_PAGE_INIT and make it enabled for x86 and
sparc after memset optimization.
Pavel Tatashin May 15, 2017, 8:44 p.m.
On 05/15/2017 03:38 PM, Michal Hocko wrote:
> On Mon 15-05-17 14:12:10, Pasha Tatashin wrote:
>> Hi Michal,
>>
>> After looking at your suggested memblock_virt_alloc_core() change again, I
>> decided to keep what I have. I do not want to inline
>> memblock_virt_alloc_internal(), because it is not a performance critical
>> path, and by inlining it we will unnecessarily increase the text size on all
>> platforms.
> 
> I do not insist but I would really _prefer_ if the bool zero argument
> didn't proliferate all over the memblock API.

Sure, I will remove the zero boolean argument from 
memblock_virt_alloc_internal() and do the memset() calls inside the callers.

>   
>> Also, because it will be very hard to make sure that no platform regresses
>> by making memset() default in _memblock_virt_alloc_core() (as I already
>> showed last week at least sun4v SPARC64 will require special changes in
>> order for this to work), I decided to make it available only for "deferred
>> struct page init" case. As, what is already in the patch.
> 
> I do not think this is the right approach. Your measurements just show
> that sparc could have a more optimized memset for small sizes. If you
> keep the same memset only for the parallel initialization then you
> just hide this fact. I wouldn't worry about other architectures. All
> sane architectures should simply work reasonably well when touching a
> single or only few cache lines at the same time. If some arches really
> suffer from small memsets then the initialization should be driven by a
> specific ARCH_WANT_LARGE_PAGEBLOCK_INIT rather than making this depend
> on DEFERRED_INIT. Or if you are too worried then make it opt-in and make
> it depend on ARCH_WANT_PER_PAGE_INIT and make it enabled for x86 and
> sparc after memset optimization.

OK, I will think about this.

I do not really like adding new configs because they tend to clutter the 
code. This is why I wanted to rely on an already existing config that I 
know benefits all platforms that use it. Eventually, 
"CONFIG_DEFERRED_STRUCT_PAGE_INIT" is going to become the default 
everywhere, as there should not be a drawback to using it even on small 
machines.

Pasha
Michal Hocko May 16, 2017, 8:36 a.m.
On Mon 15-05-17 16:44:26, Pasha Tatashin wrote:
> On 05/15/2017 03:38 PM, Michal Hocko wrote:
> >I do not think this is the right approach. Your measurements just show
> >that sparc could have a more optimized memset for small sizes. If you
> >keep the same memset only for the parallel initialization then you
> >just hide this fact. I wouldn't worry about other architectures. All
> >sane architectures should simply work reasonably well when touching a
> >single or only few cache lines at the same time. If some arches really
> >suffer from small memsets then the initialization should be driven by a
> >specific ARCH_WANT_LARGE_PAGEBLOCK_INIT rather than making this depend
> >on DEFERRED_INIT. Or if you are too worried then make it opt-in and make
> >it depend on ARCH_WANT_PER_PAGE_INIT and make it enabled for x86 and
> >sparc after memset optimization.
> 
> OK, I will think about this.
> 
> I do not really like adding new configs because they tend to clutter the
> code. This is why,

Yes I hate adding new (arch) config options as well. And I still believe
we do not need any here either...

> I wanted to rely on already existing config that I know benefits all
> platforms that use it.

I wouldn't be so sure about this. If any other platform has a similar
issues with small memset as sparc then the overhead is just papered over
by parallel initialization.

> Eventually,
> "CONFIG_DEFERRED_STRUCT_PAGE_INIT" is going to become the default
> everywhere, as there should not be a drawback of using it even on small
> machines.

Maybe and I would highly appreciate that.
Benjamin Herrenschmidt May 16, 2017, 11:50 p.m.
On Fri, 2017-05-12 at 13:37 -0400, David Miller wrote:
> > Right now it is larger, but what I suggested is to add a new optimized
> > routine just for this case, which would do STBI for 64-bytes but
> > without membar (do membar at the end of memmap_init_zone() and
> > deferred_init_memmap()
> > 
> > #define struct_page_clear(page)                                 \
> >          __asm__ __volatile__(                                   \
> >          "stxa   %%g0, [%0]%2\n"                                 \
> >          "stxa   %%xg0, [%0 + %1]%2\n"                           \
> >          : /* No output */                                       \
> >          : "r" (page), "r" (0x20), "i"(ASI_BLK_INIT_QUAD_LDD_P))
> > 
> > And insert it into __init_single_page() instead of memset()
> > 
> > The final result is 4.01s/T which is even faster compared to current
> > 4.97s/T
> 
> Ok, indeed, that would work.

On ppc64, that might not. We have a dcbz instruction that clears an
entire cache line at once. That's what we use for memsets and page
clearing. However, 64 bytes is half a cache line on modern processors,
so we can't use it with that semantic and would have to fall back to the
slower stores.

Cheers,
Ben.
Pavel Tatashin May 26, 2017, 4:45 p.m.
Hi Michal,

I have considered your proposals:

1. Making memset(0) unconditional inside __init_single_page() is not 
going to work because it slows down SPARC and ppc64. On SPARC even the 
BSTI optimization that I proposed earlier won't work, because after 
consulting with other engineers I was told that stores (without loads!) 
after BSTI without a membar are unsafe.

2. Adding ARCH_WANT_LARGE_PAGEBLOCK_INIT is not going to solve the 
problem, because while an arch might want a large memset(), it still 
wants to get the benefit of parallelized struct page initialization.

3. Another approach that I have considered is moving memset() above 
__init_single_page() and doing it in larger chunks. However, this 
solution is also not going to work, because inside the loops there are 
cases where "struct page"s are skipped, so every single page has to be 
checked: early_pfn_valid(pfn), early_pfn_in_nid(), and also the 
mirrored_kernelcore cases.
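The constraint in point 3 can be illustrated with a toy userspace model 
(all names, sizes, and the skip predicate below are made up for 
illustration and are not the kernel's real semantics): because some pfns 
are skipped, the valid struct pages are fragmented, so the zeroing has 
to stay inside the per-page loop rather than being hoisted into one big 
memset() above it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for struct page: 64 bytes, like the configs discussed. */
struct toy_page { uint64_t words[8]; };

#define NR_PFNS 16
static struct toy_page toy_memmap[NR_PFNS];

/* Pretend predicate: in the real loop this role is played by
 * early_pfn_valid(), early_pfn_in_nid() and the mirrored_kernelcore
 * checks. The modulo here just punches arbitrary holes. */
static bool toy_pfn_valid(unsigned long pfn)
{
	return (pfn % 3) != 0;
}

/* Per-page init: zeroing happens page by page because the skipped pfns
 * fragment the range and defeat any bulk memset hoisted above the loop.
 * Returns how many pages were actually initialized. */
static unsigned long toy_init_range(void)
{
	unsigned long initialized = 0;

	for (unsigned long pfn = 0; pfn < NR_PFNS; pfn++) {
		if (!toy_pfn_valid(pfn))
			continue;
		memset(&toy_memmap[pfn], 0, sizeof(struct toy_page));
		initialized++;
	}
	return initialized;
}
```

With the arbitrary predicate above, 10 of the 16 toy pfns survive the 
checks; a single memset() over the whole span would have touched the 
other 6 for nothing, and the real predicates are not even contiguous 
ranges that could be batched.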

> I wouldn't be so sure about this. If any other platform has a similar
> issues with small memset as sparc then the overhead is just papered over
> by parallel initialization.

That is true, and that is fine, because parallelization gives an order 
of magnitude better improvement compared to the trade-off of slower 
single-thread performance. Remember, this happens only during boot and 
memory hotplug, and is not something that will eat up computing 
resources during runtime.

So, at the moment I cannot really find a better solution compared to 
what I have proposed: do memset() inside __init_single_page() only when 
deferred initialization is enabled.

Pasha
Michal Hocko May 29, 2017, 11:53 a.m.
On Fri 26-05-17 12:45:55, Pasha Tatashin wrote:
> Hi Michal,
> 
> I have considered your proposals:
> 
> 1. Making memset(0) unconditional inside __init_single_page() is not going
> to work because it slows down SPARC, and ppc64. On SPARC even the BSTI
> optimization that I have proposed earlier won't work, because after
> consulting with other engineers I was told that stores (without loads!)
> after BSTI without membar are unsafe

Could you be more specific? E.g. how are other stores done in
__init_single_page safe then? I am sorry to be dense here but how does
the full 64B store differ from other stores done in the same function.

[...]
> So, at the moment I cannot really find a better solution compared to what I
> have proposed: do memset() inside __init_single_page() only when deferred
> initialization is enabled.

As I've already said, I am not going to block your approach; I was just
hoping for something that doesn't depend on the deferred initialization.
Especially since struct page is a small object, it makes sense to
initialize it completely in a single place. Writing to a single cache
line should simply not add memory traffic for an exclusive cache line,
and struct pages are very likely to be exclusive at that stage.

If that doesn't fly then so be it, but I have to confess I still do not
understand why that is not the case.
Pavel Tatashin May 30, 2017, 5:16 p.m.
> Could you be more specific? E.g. how are other stores done in
> __init_single_page safe then? I am sorry to be dense here but how does
> the full 64B store differ from other stores done in the same function.

Hi Michal,

It is safe to do regular 8-byte and smaller stores (stx, st, sth, stb) 
without a membar, but they are slower compared to STBI, which requires a 
membar before the memory can be accessed. So when on SPARC we zero a 
larger span of memory, it is faster to use STBI and do one membar at the 
end. This is why for a single thread it is faster to zero multiple pages 
of memory and then initialize only the fields that are needed in "struct 
page". I believe the same is true for ppc64, as they clear the whole 
128-byte cacheline at a time with larger memsets.

Pasha
Michal Hocko May 31, 2017, 4:31 p.m.
On Tue 30-05-17 13:16:50, Pasha Tatashin wrote:
> >Could you be more specific? E.g. how are other stores done in
> >__init_single_page safe then? I am sorry to be dense here but how does
> >the full 64B store differ from other stores done in the same function.
> 
> Hi Michal,
> 
> It is safe to do regular 8-byte and smaller stores (stx, st, sth, stb)
> without membar, but they are slower compared to STBI which require a membar
> before memory can be accessed.

OK, so why cannot we make zero_struct_page 8x 8B stores, other arches
would do memset. You said it would be slower but would that be
measurable? I am sorry to be so persistent here but I would be really
happier if this didn't depend on the deferred initialization. If this is
absolutely a no-go then I can live with that of course.
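Michal's suggestion could look something like the following userspace 
sketch (hypothetical: `zero_struct_page` is not an existing kernel 
helper here, the 64-byte struct is a stand-in for the real struct page, 
and the arch split is illustrative only): sparc-like arches get eight 
plain 8-byte stores, which need no membar, while everyone else keeps 
memset().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 64-byte stand-in for struct page on the configs discussed. */
struct toy_page {
	uint64_t words[8];
};

/* Hypothetical per-arch zeroing helper: eight stx-sized stores where
 * small memsets are slow (sparc64's memset uses block-init stores that
 * require a membar), plain memset() elsewhere. */
static inline void zero_struct_page(struct toy_page *page)
{
#if defined(__sparc__)
	uint64_t *p = (uint64_t *)page;

	/* Regular 8-byte stores: safe without a membar, per the thread. */
	p[0] = 0; p[1] = 0; p[2] = 0; p[3] = 0;
	p[4] = 0; p[5] = 0; p[6] = 0; p[7] = 0;
#else
	memset(page, 0, sizeof(*page));
#endif
}
```

A caller such as __init_single_page() would then invoke 
zero_struct_page() first and set the live fields afterwards, so the 
whole object is initialized in one place while the cache line is 
exclusive.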
David Miller May 31, 2017, 4:51 p.m.
From: Michal Hocko <mhocko@kernel.org>
Date: Wed, 31 May 2017 18:31:31 +0200

> On Tue 30-05-17 13:16:50, Pasha Tatashin wrote:
>> >Could you be more specific? E.g. how are other stores done in
>> >__init_single_page safe then? I am sorry to be dense here but how does
>> >the full 64B store differ from other stores done in the same function.
>> 
>> Hi Michal,
>> 
>> It is safe to do regular 8-byte and smaller stores (stx, st, sth, stb)
>> without membar, but they are slower compared to STBI which require a membar
>> before memory can be accessed.
> 
> OK, so why cannot we make zero_struct_page 8x 8B stores, other arches
> would do memset. You said it would be slower but would that be
> measurable? I am sorry to be so persistent here but I would be really
> happier if this didn't depend on the deferred initialization. If this is
> absolutely a no-go then I can live with that of course.

It is measurable.  That's the impetus for this work in the first place.

When we do the memory barrier, the whole store buffer flushes, because
the memory barrier is done with a dependency on the next load or store
operation, one of which the caller is going to do immediately.
Pavel Tatashin June 1, 2017, 3:35 a.m.
> OK, so why cannot we make zero_struct_page 8x 8B stores, other arches
> would do memset. You said it would be slower but would that be
> measurable? I am sorry to be so persistent here but I would be really
> happier if this didn't depend on the deferred initialization. If this is
> absolutely a no-go then I can live with that of course.

Hi Michal,

This is actually a very good idea. I just did some measurements, and it 
looks like performance is very good.

Here is data from SPARC-M7 with 3312G memory with single thread performance:

Current:
memset() in memblock allocator takes: 8.83s
__init_single_page() takes: 8.63s

Option 1:
memset() in __init_single_page() takes: 61.09s (as we discussed, because 
of the membar overhead, memset should really be optimized to do STBI 
only when the size is one page or bigger).

Option 2:

8 stores (stx) in __init_single_page(): 8.525s!

So, even for single thread performance we can double the initialization 
speed of "struct page" on SPARC by removing memset() from memblock, and 
using 8 stx in __init_single_page(). It appears we never miss L1 in 
__init_single_page() after the initial 8 stx.

I will update patches with memset() on other platforms, and stx on SPARC.

My experimental code looks like this:

static void __meminit __init_single_page(struct page *page, unsigned 
long pfn, unsigned long zone, int nid)
{
         __asm__ __volatile__(
         "stx    %%g0, [%0 + 0x00]\n"
         "stx    %%g0, [%0 + 0x08]\n"
         "stx    %%g0, [%0 + 0x10]\n"
         "stx    %%g0, [%0 + 0x18]\n"
         "stx    %%g0, [%0 + 0x20]\n"
         "stx    %%g0, [%0 + 0x28]\n"
         "stx    %%g0, [%0 + 0x30]\n"
         "stx    %%g0, [%0 + 0x38]\n"
         :
         :"r"(page));
         set_page_links(page, zone, nid, pfn);
         init_page_count(page);
         page_mapcount_reset(page);
         page_cpupid_reset_last(page);

         INIT_LIST_HEAD(&page->lru);
#ifdef WANT_PAGE_VIRTUAL
         /* The shift won't overflow because ZONE_NORMAL is below 4G. */
         if (!is_highmem_idx(zone))
                 set_page_address(page, __va(pfn << PAGE_SHIFT));
#endif
}

Thank you,
Pasha
Michal Hocko June 1, 2017, 8:46 a.m.
On Wed 31-05-17 23:35:48, Pasha Tatashin wrote:
> >OK, so why cannot we make zero_struct_page 8x 8B stores, other arches
> >would do memset. You said it would be slower but would that be
> >measurable? I am sorry to be so persistent here but I would be really
> >happier if this didn't depend on the deferred initialization. If this is
> >absolutely a no-go then I can live with that of course.
> 
> Hi Michal,
> 
> This is actually a very good idea. I just did some measurements, and it
> looks like performance is very good.
> 
> Here is data from SPARC-M7 with 3312G memory with single thread performance:
> 
> Current:
> memset() in memblock allocator takes: 8.83s
> __init_single_page() take: 8.63s
> 
> Option 1:
> memset() in __init_single_page() takes: 61.09s (as we discussed because of
> membar overhead, memset should really be optimized to do STBI only when size
> is 1 page or bigger).
> 
> Option 2:
> 
> 8 stores (stx) in __init_single_page(): 8.525s!
> 
> So, even for single thread performance we can double the initialization
> speed of "struct page" on SPARC by removing memset() from memblock, and
> using 8 stx in __init_single_page(). It appears we never miss L1 in
> __init_single_page() after the initial 8 stx.

OK, that is good to hear and it actually matches my understanding that
writes to a single cacheline should not add an overhead.

Thanks!


diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index 962164d36506..c9a08463d9a8 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@  extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE		(~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void * memblock_virt_alloc_core(phys_addr_t size, phys_addr_t align,
+				phys_addr_t min_addr, phys_addr_t max_addr,
+				int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
 		phys_addr_t align, phys_addr_t min_addr,
 		phys_addr_t max_addr, int nid);
diff --git a/mm/memblock.c b/mm/memblock.c
index b049c9b2dba8..eab7da94f873 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1271,8 +1271,7 @@  phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
  *
  * The memory block is aligned on SMP_CACHE_BYTES if @align == 0.
  *
- * The phys address of allocated boot memory block is converted to virtual and
- * allocated memory is reset to 0.
+ * The allocated memory has to be zeroed out explicitly by the caller.
  *
  * In addition, function sets the min_count to 0 using kmemleak_alloc for
  * allocated boot memory block, so that it is never reported as leaks.
@@ -1280,15 +1279,18 @@  phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
  * RETURNS:
  * Virtual address of allocated memory block on success, NULL on failure.
  */
-static void * __init memblock_virt_alloc_internal(
+static inline void * __init memblock_virt_alloc_internal(
 				phys_addr_t size, phys_addr_t align,
 				phys_addr_t min_addr, phys_addr_t max_addr,
-				int nid)
+				int nid, void *caller)
 {
 	phys_addr_t alloc;
 	void *ptr;
 	ulong flags = choose_memblock_flags();
 
+	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
+		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
+		     (u64)max_addr, caller);
 	if (WARN_ONCE(nid == MAX_NUMNODES, "Usage of MAX_NUMNODES is deprecated. Use NUMA_NO_NODE instead\n"))
 		nid = NUMA_NO_NODE;
 
@@ -1334,7 +1336,6 @@  static void * __init memblock_virt_alloc_internal(
 	return NULL;
 done:
 	ptr = phys_to_virt(alloc);
-	memset(ptr, 0, size);
 
 	/*
 	 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1347,6 +1348,14 @@  static void * __init memblock_virt_alloc_internal(
 	return ptr;
 }
 
+void * __init memblock_virt_alloc_core(phys_addr_t size, phys_addr_t align,
+				phys_addr_t min_addr, phys_addr_t max_addr,
+				int nid)
+{
+	return memblock_virt_alloc_internal(size, align, min_addr, max_addr, nid,
+			(void *)_RET_IP_);
+}
+
 /**
  * memblock_virt_alloc_try_nid_nopanic - allocate boot memory block
  * @size: size of memory block to be allocated in bytes
@@ -1369,11 +1378,14 @@  void * __init memblock_virt_alloc_try_nid_nopanic(
 				phys_addr_t min_addr, phys_addr_t max_addr,
 				int nid)
 {
-	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
-		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
-		     (u64)max_addr, (void *)_RET_IP_);
-	return memblock_virt_alloc_internal(size, align, min_addr,
-					     max_addr, nid);
+	void *ptr;
+
+	ptr = memblock_virt_alloc_internal(size, align, min_addr,
+					     max_addr, nid, (void *)_RET_IP_);
+	if (ptr)
+		memset(ptr, 0, size);
+
+	return ptr;
 }
 
 /**
@@ -1401,13 +1413,12 @@  void * __init memblock_virt_alloc_try_nid(
 {
 	void *ptr;
 
-	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
-		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
-		     (u64)max_addr, (void *)_RET_IP_);
 	ptr = memblock_virt_alloc_internal(size, align,
-					   min_addr, max_addr, nid);
-	if (ptr)
+					   min_addr, max_addr, nid, (void *)_RET_IP_);
+	if (ptr) {
+		memset(ptr, 0, size);
 		return ptr;
+	}
 
 	panic("%s: Failed to allocate %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx\n",
 	      __func__, (u64)size, (u64)align, nid, (u64)min_addr,
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index a56c3989f773..4e060f0f9fe5 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@  static void * __ref __earlyonly_bootmem_alloc(int node,
 				unsigned long align,
 				unsigned long goal)
 {
-	return memblock_virt_alloc_try_nid(size, align, goal,
+	return memblock_virt_alloc_core(size, align, goal,
 					    BOOTMEM_ALLOC_ACCESSIBLE, node);
 }