Patchwork mmotm threatens ppc preemption again

login
register
mail settings
Submitter Hugh Dickins
Date March 20, 2011, 4:11 a.m.
Message ID <alpine.LSU.2.00.1103192041390.1592@sister.anvils>
Download mbox | patch
Permalink /patch/87645/
State Not Applicable
Headers show

Comments

Hugh Dickins - March 20, 2011, 4:11 a.m.
Hi Ben,

As I warned a few weeks ago, Jeremy has vmalloc apply_to_pte_range
patches in mmotm, which again assault PowerPC's expectations, and
cause lots of noise with CONFIG_PREEMPT=y CONFIG_PREEMPT_DEBUG=y.

This time in vmalloc as well as vfree; and Peter's fix to the last
lot, which went into 2.6.38, doesn't protect against these ones.
Here's what I now see when I swapon and swapoff:

BUG: using smp_processor_id() in preemptible [00000000] code: swapon/3230
caller is .apply_to_pte_range+0x118/0x1f0
Call Trace:
[c000000029c3b870] [c00000000000f38c] .show_stack+0x6c/0x16c (unreliable)
[c000000029c3b920] [c00000000022e024] .debug_smp_processor_id+0xe4/0x11c
[c000000029c3b9b0] [c0000000000de78c] .apply_to_pte_range+0x118/0x1f0
[c000000029c3ba70] [c0000000000de988] .apply_to_pud_range+0x124/0x188
[c000000029c3bb40] [c0000000000dea90] .apply_to_page_range_batch+0xa4/0xe8
[c000000029c3bc00] [c0000000000eb2c0] .map_vm_area+0x50/0x94
[c000000029c3bca0] [c0000000000ec368] .__vmalloc_area_node+0x144/0x190
[c000000029c3bd50] [c0000000000f1738] .SyS_swapon+0x270/0x704
[c000000029c3be30] [c0000000000075a8] syscall_exit+0x0/0x40
BUG: using smp_processor_id() in preemptible [00000000] code: swapon/3230
caller is .apply_to_pte_range+0x168/0x1f0
Call Trace:
[c000000029c3b870] [c00000000000f38c] .show_stack+0x6c/0x16c (unreliable)
[c000000029c3b920] [c00000000022e024] .debug_smp_processor_id+0xe4/0x11c
[c000000029c3b9b0] [c0000000000de7dc] .apply_to_pte_range+0x168/0x1f0
[c000000029c3ba70] [c0000000000de988] .apply_to_pud_range+0x124/0x188
[c000000029c3bb40] [c0000000000dea90] .apply_to_page_range_batch+0xa4/0xe8
[c000000029c3bc00] [c0000000000eb2c0] .map_vm_area+0x50/0x94
[c000000029c3bca0] [c0000000000ec368] .__vmalloc_area_node+0x144/0x190
[c000000029c3bd50] [c0000000000f1738] .SyS_swapon+0x270/0x704
[c000000029c3be30] [c0000000000075a8] syscall_exit+0x0/0x40
Adding 1572860k swap on /dev/sdb4.  Priority:-1 extents:1 across:1572860k SS
BUG: using smp_processor_id() in preemptible [00000000] code: swapoff/3231
caller is .apply_to_pte_range+0x118/0x1f0
Call Trace:
[c0000000260d38b0] [c00000000000f38c] .show_stack+0x6c/0x16c (unreliable)
[c0000000260d3960] [c00000000022e024] .debug_smp_processor_id+0xe4/0x11c
[c0000000260d39f0] [c0000000000de78c] .apply_to_pte_range+0x118/0x1f0
[c0000000260d3ab0] [c0000000000de988] .apply_to_pud_range+0x124/0x188
[c0000000260d3b80] [c0000000000dea90] .apply_to_page_range_batch+0xa4/0xe8
[c0000000260d3c40] [c0000000000eb0d8] .remove_vm_area+0x90/0xd4
[c0000000260d3cd0] [c0000000000ec0a8] .__vunmap+0x50/0x104
[c0000000260d3d60] [c0000000000f32fc] .SyS_swapoff+0x4d8/0x5e8
[c0000000260d3e30] [c0000000000075a8] syscall_exit+0x0/0x40
BUG: using smp_processor_id() in preemptible [00000000] code: swapoff/3231
caller is .apply_to_pte_range+0x168/0x1f0
Call Trace:
[c0000000260d38b0] [c00000000000f38c] .show_stack+0x6c/0x16c (unreliable)
[c0000000260d3960] [c00000000022e024] .debug_smp_processor_id+0xe4/0x11c
[c0000000260d39f0] [c0000000000de7dc] .apply_to_pte_range+0x168/0x1f0
[c0000000260d3ab0] [c0000000000de988] .apply_to_pud_range+0x124/0x188
[c0000000260d3b80] [c0000000000dea90] .apply_to_page_range_batch+0xa4/0xe8
[c0000000260d3c40] [c0000000000eb0d8] .remove_vm_area+0x90/0xd4
[c0000000260d3cd0] [c0000000000ec0a8] .__vunmap+0x50/0x104
[c0000000260d3d60] [c0000000000f32fc] .SyS_swapoff+0x4d8/0x5e8
[c0000000260d3e30] [c0000000000075a8] syscall_exit+0x0/0x40
BUG: using smp_processor_id() in preemptible [00000000] code: swapoff/3231
caller is .__flush_tlb_pending+0x20/0xb4
Call Trace:
[c0000000260d3830] [c00000000000f38c] .show_stack+0x6c/0x16c (unreliable)
[c0000000260d38e0] [c00000000022e024] .debug_smp_processor_id+0xe4/0x11c
[c0000000260d3970] [c00000000002efbc] .__flush_tlb_pending+0x20/0xb4
[c0000000260d39f0] [c0000000000de7fc] .apply_to_pte_range+0x188/0x1f0
[c0000000260d3ab0] [c0000000000de988] .apply_to_pud_range+0x124/0x188
[c0000000260d3b80] [c0000000000dea90] .apply_to_page_range_batch+0xa4/0xe8
[c0000000260d3c40] [c0000000000eb0d8] .remove_vm_area+0x90/0xd4
[c0000000260d3cd0] [c0000000000ec0a8] .__vunmap+0x50/0x104
[c0000000260d3d60] [c0000000000f32fc] .SyS_swapoff+0x4d8/0x5e8
[c0000000260d3e30] [c0000000000075a8] syscall_exit+0x0/0x40
BUG: using smp_processor_id() in preemptible [00000000] code: swapoff/3231
caller is .native_flush_hash_range+0x3c/0x384
Call Trace:
[c0000000260d36f0] [c00000000000f38c] .show_stack+0x6c/0x16c (unreliable)
[c0000000260d37a0] [c00000000022e024] .debug_smp_processor_id+0xe4/0x11c
[c0000000260d3830] [c00000000002e2c8] .native_flush_hash_range+0x3c/0x384
[c0000000260d38e0] [c00000000002c370] .flush_hash_range+0x4c/0xc8
[c0000000260d3970] [c00000000002f02c] .__flush_tlb_pending+0x90/0xb4
[c0000000260d39f0] [c0000000000de7fc] .apply_to_pte_range+0x188/0x1f0
[c0000000260d3ab0] [c0000000000de988] .apply_to_pud_range+0x124/0x188
[c0000000260d3b80] [c0000000000dea90] .apply_to_page_range_batch+0xa4/0xe8
[c0000000260d3c40] [c0000000000eb0d8] .remove_vm_area+0x90/0xd4
[c0000000260d3cd0] [c0000000000ec0a8] .__vunmap+0x50/0x104
[c0000000260d3d60] [c0000000000f32fc] .SyS_swapoff+0x4d8/0x5e8
[c0000000260d3e30] [c0000000000075a8] syscall_exit+0x0/0x40

I work around them with the patch below, but would prefer not to disable
preemption on all architectures there.  Though I'm not a huge fan of
apply_to_pte_range myself (I feel it glosses over differences, such as
how often one needs to let preemption in): I wouldn't mind if we left
vmalloc as is without it.

Hugh
Benjamin Herrenschmidt - March 20, 2011, 11:53 p.m.
On Sat, 2011-03-19 at 21:11 -0700, Hugh Dickins wrote:
> 
> As I warned a few weeks ago, Jeremy has vmalloc apply_to_pte_range
> patches in mmotm, which again assault PowerPC's expectations, and
> cause lots of noise with CONFIG_PREEMPT=y CONFIG_PREEMPT_DEBUG=y.
> 
> This time in vmalloc as well as vfree; and Peter's fix to the last
> lot, which went into 2.6.38, doesn't protect against these ones.
> Here's what I now see when I swapon and swapoff:

Right. And we said from day one we had the HARD WIRED assumption that
arch_enter/leave_lazy_mmu_mode() was ALWAYS going to be called within
a PTE lock section, and we did get reassurance that it was going to
remain so.

So why is it ok for them to change those and break us like that ?

Seriously, this is going out of control. If we can't even rely on
fundamental locking assumptions in the VM to remain reasonably stable
or at least get some amount of -care- from who changes them as to
whether they break others and work with us to fix them, wtf ?

I don't know what the right way to fix that is. We have an absolute
requirement that the batching we start within a lazy MMU section
is complete and flushed before any other PTE in that section can be
touched by anything else. Do we -at least- keep that guarantee ?

If yes, then maybe preempt_disable/enable() around
arch_enter/leave_lazy_mmu_mode() in apply_to_pte_range() would do... 

Or maybe I should just prevent any batching  of init_mm :-(

Cheers,
Ben.
Hugh Dickins - March 21, 2011, 1:41 a.m.
On Mon, 21 Mar 2011, Benjamin Herrenschmidt wrote:
> On Sat, 2011-03-19 at 21:11 -0700, Hugh Dickins wrote:
> > 
> > As I warned a few weeks ago, Jeremy has vmalloc apply_to_pte_range
> > patches in mmotm, which again assault PowerPC's expectations, and
> > cause lots of noise with CONFIG_PREEMPT=y CONFIG_PREEMPT_DEBUG=y.
> > 
> > This time in vmalloc as well as vfree; and Peter's fix to the last
> > lot, which went into 2.6.38, doesn't protect against these ones.
> > Here's what I now see when I swapon and swapoff:
> 
> Right. And we said from day one we had the HARD WIRED assumption that
> arch_enter/leave_lazy_mmu_mode() was ALWAYS going to be called within
> a PTE lock section, and we did get reassurance that it was going to
> remain so.
> 
> So why is it ok for them to change those and break us like that ?

It's not ok.  Sounds like Andrew should not forward

mm-remove-unused-token-argument-from-apply_to_page_range-callback.patch
mm-add-apply_to_page_range_batch.patch
ioremap-use-apply_to_page_range_batch-for-ioremap_page_range.patch
vmalloc-use-plain-pte_clear-for-unmaps.patch
vmalloc-use-apply_to_page_range_batch-for-vunmap_page_range.patch
vmalloc-use-apply_to_page_range_batch-for-vmap_page_range_noflush.patch
vmalloc-use-apply_to_page_range_batch-in-alloc_vm_area.patch
xen-mmu-use-apply_to_page_range_batch-in-xen_remap_domain_mfn_range.patch
xen-grant-table-use-apply_to_page_range_batch.patch

or some subset (the vmalloc-use-apply ones? and the ioremap one?)
of that set to Linus for 2.6.39.  Your call.

> 
> Seriously, this is going out of control. If we can't even rely on
> fundamental locking assumptions in the VM to remain reasonably stable
> or at least get some amount of -care- from who changes them as to
> whether they break others and work with us to fix them, wtf ?

I know next to nothing of arch_enter/leave_lazy_mmu_mode(),
and the same is probably true of most mm developers.  The only
people who have it defined to anything interesting appear to be
powerpc and xen and lguest: so it would be a gentleman's agreement
between you and Jeremy and Rusty.

If Jeremy has changed the rules without your agreement, then you
can fight a duel at daybreak, or, since your daybreaks are at
different times, Jeremy's patches just shouldn't go forward yet.

> 
> I don't know what the right way to fix that is. We have an absolute
> requirement that the batching we start within a lazy MMU section
> is complete and flushed before any other PTE in that section can be
> touched by anything else. Do we -at least- keep that guarantee ?

I'm guessing it's a guarantee of the same kind as led me to skip
page_table_lock on init_mm in 2.6.15: no locking to guarantee it,
but it would have to be a kernel bug, in a driver or wherever,
for us to be accessing such a section while it was in transit
(short of speculative access prior to tlb flush).

> 
> If yes, then maybe preempt_disable/enable() around
> arch_enter/leave_lazy_mmu_mode() in apply_to_pte_range() would do... 
> 
> Or maybe I should just prevent any batching  of init_mm :-(

I don't see where you're doing batching on init_mm today:
it looks as if Jeremy's patches, by using the same code as he has
for user mms, are now enabling batching on init_mm, and you should :-)

But I may be all wrong: it's between you and Jeremy,
and until he defends them, his patches should not go forward.

Hugh
Benjamin Herrenschmidt - March 21, 2011, 1:50 a.m.
On Sun, 2011-03-20 at 18:41 -0700, Hugh Dickins wrote:
> > I don't know what the right way to fix that is. We have an absolute
> > requirement that the batching we start within a lazy MMU section
> > is complete and flushed before any other PTE in that section can be
> > touched by anything else. Do we -at least- keep that guarantee ?
> 
> I'm guessing it's a guarantee of the same kind as led me to skip
> page_table_lock on init_mm in 2.6.15: no locking to guarantee it,
> but it would have to be a kernel bug, in a driver or wherever,
> for us to be accessing such a section while it was in transit
> (short of speculative access prior to tlb flush).

As long as the races to avoid are between map/unmap vs. access, yes, it
-should- be fine, and we used to not do demand faulting on kernel space
(but for how long ?). I'm wondering why we don't just stick a ptl in
there or is there a good reason why we can't ?

> I don't see where you're doing batching on init_mm today:
> it looks as if Jeremy's patches, by using the same code as he has
> for user mms, are now enabling batching on init_mm, and you should :-)
> 
> But I may be all wrong: it's between you and Jeremy,
> and until he defends them, his patches should not go forward. 

We don't do it today (batching). Jeremy's patches have the side effect
of "enabling" it, which isn't wrong per-se ... but on our side relies on
some locking assumptions we are missing.

Cheers,
Ben.
Hugh Dickins - March 21, 2011, 2:20 a.m.
On Mon, 21 Mar 2011, Benjamin Herrenschmidt wrote:
> 
> As long as the races to avoid are between map/unmap vs. access, yes, it
> -should- be fine, and we used to not do demand faulting on kernel space
> (but for how long ?). I'm wondering why we don't just stick a ptl in
> there or is there a good reason why we can't ?

We can - but we usually prefer to avoid unnecessary locking.
An arch function which locks init_mm.page_table_lock on powerpc,
but does nothing on others?

Hugh
Benjamin Herrenschmidt - March 21, 2011, 2:22 a.m.
On Sun, 2011-03-20 at 19:20 -0700, Hugh Dickins wrote:
> > As long as the races to avoid are between map/unmap vs. access, yes, it
> > -should- be fine, and we used to not do demand faulting on kernel space
> > (but for how long ?). I'm wondering why we don't just stick a ptl in
> > there or is there a good reason why we can't ?
> 
> We can - but we usually prefer to avoid unnecessary locking.
> An arch function which locks init_mm.page_table_lock on powerpc,
> but does nothing on others? 

That still means gratuitous differences between how the normal and
kernel page tables are handled. Maybe that's not worth bothering ...

Cheers,
Ben.
Jeremy Fitzhardinge - March 21, 2011, 11:24 a.m.
On 03/20/2011 11:53 PM, Benjamin Herrenschmidt wrote:
> On Sat, 2011-03-19 at 21:11 -0700, Hugh Dickins wrote:
>> As I warned a few weeks ago, Jeremy has vmalloc apply_to_pte_range
>> patches in mmotm, which again assault PowerPC's expectations, and
>> cause lots of noise with CONFIG_PREEMPT=y CONFIG_PREEMPT_DEBUG=y.
>>
>> This time in vmalloc as well as vfree; and Peter's fix to the last
>> lot, which went into 2.6.38, doesn't protect against these ones.
>> Here's what I now see when I swapon and swapoff:
> Right. And we said from day one we had the HARD WIRED assumption that
> arch_enter/leave_lazy_mmu_mode() was ALWAYS going to be called within
> a PTE lock section, and we did get reassurance that it was going to
> remain so.
>
> So why is it ok for them to change those and break us like that ?

In general, the pagetable's locking rules are that all *usermode* pte
updates have to be done under a pte lock, but kernel mode ones do not;
they generally have some kind of per-subsystem ad-hoc locking where
needed, which may or may not be no-preempt.

Originally the enter/leave_lazy_mmu_mode did require preemption to be
disabled for the whole time, but that was incompatible with the above
locking rules, and resulted in preemption being disabled for long
periods when using lazy mode which wouldn't normally happen.  This
raised a number of complaints.

To address this, I changed the x86 implementation to deal with
preemption in lazy mode by dropping out for lazy mode at context switch
time (and recording the fact that we were in lazy mode with a TIF flag
and re-entering on the next context switch).


> Seriously, this is going out of control. If we can't even rely on
> fundamental locking assumptions in the VM to remain reasonably stable
> or at least get some amount of -care- from who changes them as to
> whether they break others and work with us to fix them, wtf ?
>
> I don't know what the right way to fix that is. We have an absolute
> requirement that the batching we start within a lazy MMU section
> is complete and flushed before any other PTE in that section can be
> touched by anything else. Do we -at least- keep that guarantee ?
>
> If yes, then maybe preempt_disable/enable() around
> arch_enter/leave_lazy_mmu_mode() in apply_to_pte_range() would do... 
>
> Or maybe I should just prevent any batching  of init_mm :-(

I'm very sorry about that, I didn't realize power was also using that
interface.  Unfortunately, the "no preemption" definition was an error,
and had to be changed to match the pre-existing locking rules.

Could you implement a similar "flush batched pte updates on context
switch" as x86?

    J
Benjamin Herrenschmidt - March 21, 2011, 10:52 p.m.
On Mon, 2011-03-21 at 11:24 +0000, Jeremy Fitzhardinge wrote:
> 
> I'm very sorry about that, I didn't realize power was also using that
> interface.  Unfortunately, the "no preemption" definition was an error,
> and had to be changed to match the pre-existing locking rules.
> 
> Could you implement a similar "flush batched pte updates on context
> switch" as x86? 

Well, we already do that for -rt & co.

However, we have another issue which is the reason we used those
lazy_mmu hooks to do our flushing.

Our PTEs eventually get faulted into a hash table which is what the real
MMU uses. We must never (ever) allow that hash table to contain a
duplicate entry for a given virtual address.

When we do a batch, we remove things from the linux PTE, and keep a
reference in our batch structure, and only update the hash table at the
end of the batch.

That means that we must not allow a hash fault to populate the hash with
a "new" PTE value prior to the old one having been flushed out (which is
possible if they different in protection attributes for example). For
that to happen, we must basically not allow a page fault to re-populate
a PTE invalidated by a batch before that batch has completed.

That translates to batches must only happen within a PTE lock section.

Cheers,
Ben.
Jeremy Fitzhardinge - March 22, 2011, 1:34 p.m.
On 03/21/2011 10:52 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2011-03-21 at 11:24 +0000, Jeremy Fitzhardinge wrote:
>> I'm very sorry about that, I didn't realize power was also using that
>> interface.  Unfortunately, the "no preemption" definition was an error,
>> and had to be changed to match the pre-existing locking rules.
>>
>> Could you implement a similar "flush batched pte updates on context
>> switch" as x86? 
> Well, we already do that for -rt & co.
>
> However, we have another issue which is the reason we used those
> lazy_mmu hooks to do our flushing.
>
> Our PTEs eventually get faulted into a hash table which is what the real
> MMU uses. We must never (ever) allow that hash table to contain a
> duplicate entry for a given virtual address.
>
> When we do a batch, we remove things from the linux PTE, and keep a
> reference in our batch structure, and only update the hash table at the
> end of the batch.

Wouldn't implicitly ending a batch on context switch get the same effect?

> That means that we must not allow a hash fault to populate the hash with
> a "new" PTE value prior to the old one having been flushed out (which is
> possible if they different in protection attributes for example). For
> that to happen, we must basically not allow a page fault to re-populate
> a PTE invalidated by a batch before that batch has completed.

Kernel ptes are not generally populated on fault though, unless there's
something in power?  On x86 it can happen when syncing a process's
kernel pmd with the init_mm one, but that shouldn't happen in the middle
of an update since you'd deadlock anyway.  If a particular kernel
subsystem has its own locks to manage the ptes for a kernel mapping,
then that should prevent any nested updates within a batch shouldn't it?

> That translates to batches must only happen within a PTE lock section.

Well, in that case, I guess your best bet is to disable batching for
kernel pagetable updates.  These apply_to_page_range() changes are the
first time any attempt to batch kernel pagetable updates has been made
(otherwise you would have seen this problem earlier), so not batching
them will not be a regression for you.

But I'm not sure what the proper fix to get batching in your case will
be.  But the assumption that there's a pte lock for kernel ptes is not
valid.

    J
Andrew Morton - March 30, 2011, 8:53 p.m.
On Mon, 21 Mar 2011 13:22:30 +1100
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Sun, 2011-03-20 at 19:20 -0700, Hugh Dickins wrote:
> > > As long as the races to avoid are between map/unmap vs. access, yes, it
> > > -should- be fine, and we used to not do demand faulting on kernel space
> > > (but for how long ?). I'm wondering why we don't just stick a ptl in
> > > there or is there a good reason why we can't ?
> > 
> > We can - but we usually prefer to avoid unnecessary locking.
> > An arch function which locks init_mm.page_table_lock on powerpc,
> > but does nothing on others? 
> 
> That still means gratuitous differences between how the normal and
> kernel page tables are handled. Maybe that's not worth bothering ...

So what will we do here?  I still have

mm-remove-unused-token-argument-from-apply_to_page_range-callback.patch
mm-add-apply_to_page_range_batch.patch
ioremap-use-apply_to_page_range_batch-for-ioremap_page_range.patch
vmalloc-use-plain-pte_clear-for-unmaps.patch
vmalloc-use-apply_to_page_range_batch-for-vunmap_page_range.patch
vmalloc-use-apply_to_page_range_batch-for-vmap_page_range_noflush.patch
vmalloc-use-apply_to_page_range_batch-in-alloc_vm_area.patch
xen-mmu-use-apply_to_page_range_batch-in-xen_remap_domain_mfn_range.patch
xen-grant-table-use-apply_to_page_range_batch.patch

floating around and at some stage they may cause merge problems.
Jeremy Fitzhardinge - March 30, 2011, 9:07 p.m.
On 03/30/2011 01:53 PM, Andrew Morton wrote:
> On Mon, 21 Mar 2011 13:22:30 +1100
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>
>> On Sun, 2011-03-20 at 19:20 -0700, Hugh Dickins wrote:
>>>> As long as the races to avoid are between map/unmap vs. access, yes, it
>>>> -should- be fine, and we used to not do demand faulting on kernel space
>>>> (but for how long ?). I'm wondering why we don't just stick a ptl in
>>>> there or is there a good reason why we can't ?
>>> We can - but we usually prefer to avoid unnecessary locking.
>>> An arch function which locks init_mm.page_table_lock on powerpc,
>>> but does nothing on others? 
>> That still means gratuitous differences between how the normal and
>> kernel page tables are handled. Maybe that's not worth bothering ...
> So what will we do here?  I still have
>
> mm-remove-unused-token-argument-from-apply_to_page_range-callback.patch
> mm-add-apply_to_page_range_batch.patch
> ioremap-use-apply_to_page_range_batch-for-ioremap_page_range.patch
> vmalloc-use-plain-pte_clear-for-unmaps.patch
> vmalloc-use-apply_to_page_range_batch-for-vunmap_page_range.patch
> vmalloc-use-apply_to_page_range_batch-for-vmap_page_range_noflush.patch
> vmalloc-use-apply_to_page_range_batch-in-alloc_vm_area.patch
> xen-mmu-use-apply_to_page_range_batch-in-xen_remap_domain_mfn_range.patch
> xen-grant-table-use-apply_to_page_range_batch.patch
>
> floating around and at some stage they may cause merge problems.

Well, my understanding of the situation is:

   1. There's a basic asymmetry between user and kernel pagetables, in
      that the former has a standard pte locking scheme, whereas the
      latter uses ad-hoc locking depending on what particular subsystem
      is doing the changes (presumably to its own private piece of
      kernel virtual address space)
   2. Power was assuming that all lazy mmu updates were done under
      spinlock, or are at least non-preemptable.  This is incompatible
      with 1), but it was moot because no kernel updates were done lazily
   3. These patches add the first instance of lazy mmu updates, which
      reveals the mismatch between Power's expectations and the actual
      locking rules.


So the options are:

   1. Change the locking rules for kernel updates to also require a pte lock
   2. Special-case batched kernel updates to include a pte lock
   3. Make Power deal with preemption during a batched kernel update
   4. Never do batched kernel updates on Power
   5. Never do batched kernel updates (the current state)


1 seems like a big change.
2 is pretty awkward, and has the side-effect of increasing preemption
latencies (since if you're doing enough updates to be worth batching,
you'll be disabling preemption for a longish time).
I don't know how complex 3 is; I guess it depends on the details of the
batched hashtable update thingy.
4 looks like it should be simple.
5 is the default do-nothing state, but it seems unfair on anyone who can
actually take advantage of batched updates.

Ben, how hard would something like 3 or 4 be to implement?

Thanks,
    J
Benjamin Herrenschmidt - March 31, 2011, 12:52 a.m.
On Wed, 2011-03-30 at 14:07 -0700, Jeremy Fitzhardinge wrote:
> 
>    1. Change the locking rules for kernel updates to also require a pte lock
>    2. Special-case batched kernel updates to include a pte lock
>    3. Make Power deal with preemption during a batched kernel update
>    4. Never do batched kernel updates on Power
>    5. Never do batched kernel updates (the current state) 

We deal with preemption already since the PTL turns into a mutex on -rt,
so we could bring that patch into mainline. The easiest approach however
for now would be to not do the kernel batched updates on kernel
(solution 4), and I can sort it out later if I want to enable it.

The problem is that it's hard for me to "fix" that with the current
accessors as arch_enter_lazy_mmu_mode() don't get any argument that
could point me to which mm is being operated on.

Jeremy, I haven't had a chance to look at your patches in detail, do
you just use those accessors or do you create new ones for batching
kernel updates in which case powerpc could just make them do nothing ?

Else, we could have one patch that adds an mm argument accross the tree,
it shouldn't be too hard.

Later on, I can bring in the stuff from -rt stuff to enable lazy
batching of kernel pages on power if I wish to do so.

Cheers,
Ben
Jeremy Fitzhardinge - March 31, 2011, 5:21 p.m.
On 03/30/2011 05:52 PM, Benjamin Herrenschmidt wrote:
> We deal with preemption already since the PTL turns into a mutex on -rt,
> so we could bring that patch into mainline. The easiest approach however
> for now would be to not do the kernel batched updates on kernel
> (solution 4), and I can sort it out later if I want to enable it.
>
> The problem is that it's hard for me to "fix" that with the current
> accessors as arch_enter_lazy_mmu_mode() don't get any argument that
> could point me to which mm is being operated on.
>
> Jeremy, I haven't had a chance to look at your patches in detail, do
> you just use those accessors or do you create new ones for batching
> kernel updates in which case powerpc could just make them do nothing ?
>
> Else, we could have one patch that adds an mm argument accross the tree,
> it shouldn't be too hard.

No, its the same accessors for both, since the need to distinguish them
hasn't really come up.  Could you put a "if (preemptable()) return;"
guard in your implementations?

Otherwise I have no objections to passing the mm in (we'll probably just
continue to ignore the arg in x86-land).

Thanks,
    J
Benjamin Herrenschmidt - March 31, 2011, 8:38 p.m.
On Thu, 2011-03-31 at 10:21 -0700, Jeremy Fitzhardinge wrote:
> 
> No, its the same accessors for both, since the need to distinguish them
> hasn't really come up.  Could you put a "if (preemptable()) return;"
> guard in your implementations?

That would be a band-aid but would probably do the trick for now
for !-rt, tho it wouldn't do the right thing for -rt... 

> Otherwise I have no objections to passing the mm in (we'll probably just
> continue to ignore the arg in x86-land).

Ben.
Jeremy Fitzhardinge - May 18, 2011, 11:29 p.m.
On 03/31/2011 01:38 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2011-03-31 at 10:21 -0700, Jeremy Fitzhardinge wrote:
>> No, its the same accessors for both, since the need to distinguish them
>> hasn't really come up.  Could you put a "if (preemptable()) return;"
>> guard in your implementations?
> That would be a band-aid but would probably do the trick for now
> for !-rt, tho it wouldn't do the right thing for -rt... 

Hi Ben,

Have you had a chance to look at doing a workaround/fix for these power
problems?  I believe that's the only holdup to putting in the batching
changes.  I'd like to get them in for the next window if possible, since
they're a pretty significant performance win for us.

Thanks,
    J

Patch

--- mmotm/mm/memory.c
+++ fixed/mm/memory.c
@@ -2021,9 +2021,11 @@  static int apply_to_pte_range(struct mm_
 	int err;
 	spinlock_t *uninitialized_var(ptl);
 
-	pte = (mm == &init_mm) ?
-		pte_alloc_kernel(pmd, addr) :
-		pte_alloc_map_lock(mm, pmd, addr, &ptl);
+	if (mm == &init_mm) {
+		pte = pte_alloc_kernel(pmd, addr);
+		preempt_disable();
+	} else
+		pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
 		return -ENOMEM;
 
@@ -2033,7 +2035,9 @@  static int apply_to_pte_range(struct mm_
 	err = fn(pte, (end - addr) / PAGE_SIZE, addr, data);
 	arch_leave_lazy_mmu_mode();
 
-	if (mm != &init_mm)
+	if (mm == &init_mm)
+		preempt_enable();
+	else
 		pte_unmap_unlock(pte, ptl);
 	return err;
 }