[v3,2/2] cxl: Enable global TLBIs for cxl contexts

Message ID 20170903181513.29635-2-fbarrat@linux.vnet.ibm.com (mailing list archive)
State Accepted
Commit 03b8abedf4f4965e7e9e0d4f92877c42c07ce19f
Series [v3,1/2] powerpc/mm: Export flush_all_mm()

Commit Message

Frederic Barrat Sept. 3, 2017, 6:15 p.m. UTC
The PSL and nMMU need to see all TLB invalidations for the memory
contexts used on the adapter. For the hash memory model, it is done by
making all TLBIs global as soon as the cxl driver is in use. For
radix, we need something similar, but we can refine and only convert
to global the invalidations for contexts actually used by the device.

The new mm_context_add_copro() API increments the 'active_cpus' count
for the contexts attached to the cxl adapter. As soon as there's more
than 1 active cpu, the TLBIs for the context become global. Active cpu
count must be decremented when detaching to restore locality if
possible and to avoid overflowing the counter.

The hash memory model support is somewhat limited, as we can't
decrement the active cpus count when mm_context_remove_copro() is
called, because we can't flush the TLB for a mm on hash. So TLBIs
remain global on hash.

Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Fixes: f24be42aab37 ("cxl: Add psl9 specific code")
---
Changelog:
v3: don't decrement active cpus count with hash, as we don't know how to flush
v2: Replace flush_tlb_mm() by the new flush_all_mm() to flush the TLBs
and PWCs (thanks to Ben)

 arch/powerpc/include/asm/mmu_context.h | 46 ++++++++++++++++++++++++++++++++++
 arch/powerpc/mm/mmu_context.c          |  9 -------
 drivers/misc/cxl/api.c                 | 22 +++++++++++++---
 drivers/misc/cxl/context.c             |  3 +++
 drivers/misc/cxl/file.c                | 19 ++++++++++++--
 5 files changed, 85 insertions(+), 14 deletions(-)
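
Not part of the patch, just an illustration: a minimal sketch, assuming a
simplified attach path, of how a driver pairs these calls. attach_to_device()
is a made-up placeholder; the real call sites are cxl_start_context() and
afu_ioctl_start_work() in the diff below.

static int copro_attach(struct cxl_context *ctx, struct mm_struct *mm)
{
	int rc;

	/* from now on, TLBIs for this mm must be broadcast */
	mm_context_add_copro(mm);

	/*
	 * Make the active_cpus increment visible before the adapter
	 * can start using the context.
	 */
	smp_mb();

	rc = attach_to_device(ctx);	/* made-up placeholder */
	if (rc)
		/* global flush, then drop the count (radix only) */
		mm_context_remove_copro(mm);
	return rc;
}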

Comments

Nicholas Piggin Sept. 8, 2017, 6:56 a.m. UTC | #1
On Sun,  3 Sep 2017 20:15:13 +0200
Frederic Barrat <fbarrat@linux.vnet.ibm.com> wrote:

> The PSL and nMMU need to see all TLB invalidations for the memory
> contexts used on the adapter. For the hash memory model, it is done by
> making all TLBIs global as soon as the cxl driver is in use. For
> radix, we need something similar, but we can refine and only convert
> to global the invalidations for contexts actually used by the device.
> 
> The new mm_context_add_copro() API increments the 'active_cpus' count
> for the contexts attached to the cxl adapter. As soon as there's more
> than 1 active cpu, the TLBIs for the context become global. Active cpu
> count must be decremented when detaching to restore locality if
> possible and to avoid overflowing the counter.
> 
> The hash memory model support is somewhat limited, as we can't
> decrement the active cpus count when mm_context_remove_copro() is
> called, because we can't flush the TLB for a mm on hash. So TLBIs
> remain global on hash.

Sorry I didn't look at this earlier and just wading in here a bit, but
what do you think of using mmu notifiers for invalidating nMMU and
coprocessor caches, rather than put the details into the host MMU
management? npu-dma.c already looks to have almost everything covered
with its notifiers (in that it wouldn't have to rely on tlbie coming
from host MMU code).
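
Something like the sketch below, just to give the idea (the copro_* names
are made up; the ops signature matches the current mmu_notifier API):

#include <linux/mmu_notifier.h>

struct copro_mn {
	struct mmu_notifier mn;
	/* device handle, etc. */
};

static void copro_mn_invalidate_range(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start,
				      unsigned long end)
{
	struct copro_mn *cmn = container_of(mn, struct copro_mn, mn);

	/* issue whatever the unit needs: tlbie, MMIO invalidate, ... */
	copro_invalidate(cmn, start, end);	/* made-up helper */
}

static const struct mmu_notifier_ops copro_mn_ops = {
	.invalidate_range = copro_mn_invalidate_range,
};

/*
 * At attach time the driver would then do:
 *	cmn->mn.ops = &copro_mn_ops;
 *	rc = mmu_notifier_register(&cmn->mn, mm);
 */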

This change is not too bad today, but if we get to more complicated
MMU/nMMU TLB management like directed invalidation of particular units,
then putting more knowledge into the host code will end up being
complex I think.

I also want to do optimizations on the core code that assumes we
only have to take care of other CPUs, e.g.,

https://patchwork.ozlabs.org/patch/811068/

Or, another example, directed IPI invalidations from the mm_cpumask
bitmap.

I realize you want to get something merged! For the merge window and
backports this seems fine. I think it would be nice soon afterwards to
get nMMU knowledge out of the core code... Though I also realize that
with our tlbie instruction doing everything, it may be tricky to
make a really optimal notifier.

Thanks,
Nick

> 
> Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
> Fixes: f24be42aab37 ("cxl: Add psl9 specific code")
> ---
> Changelog:
> v3: don't decrement active cpus count with hash, as we don't know how to flush
> v2: Replace flush_tlb_mm() by the new flush_all_mm() to flush the TLBs
> and PWCs (thanks to Ben)
Frederic Barrat Sept. 8, 2017, 7:34 a.m. UTC | #2
On 08/09/2017 at 08:56, Nicholas Piggin wrote:
> On Sun,  3 Sep 2017 20:15:13 +0200
> Frederic Barrat <fbarrat@linux.vnet.ibm.com> wrote:
> 
>> The PSL and nMMU need to see all TLB invalidations for the memory
>> contexts used on the adapter. For the hash memory model, it is done by
>> making all TLBIs global as soon as the cxl driver is in use. For
>> radix, we need something similar, but we can refine and only convert
>> to global the invalidations for contexts actually used by the device.
>>
>> The new mm_context_add_copro() API increments the 'active_cpus' count
>> for the contexts attached to the cxl adapter. As soon as there's more
>> than 1 active cpu, the TLBIs for the context become global. Active cpu
>> count must be decremented when detaching to restore locality if
>> possible and to avoid overflowing the counter.
>>
>> The hash memory model support is somewhat limited, as we can't
>> decrement the active cpus count when mm_context_remove_copro() is
>> called, because we can't flush the TLB for a mm on hash. So TLBIs
>> remain global on hash.
> 
> Sorry I didn't look at this earlier and just wading in here a bit, but
> what do you think of using mmu notifiers for invalidating nMMU and
> coprocessor caches, rather than put the details into the host MMU
> management? npu-dma.c already looks to have almost everything covered
> with its notifiers (in that it wouldn't have to rely on tlbie coming
> from host MMU code).

Does npu-dma.c really do mmio nMMU invalidations? My understanding was 
that those atsd_launch operations are really targeted at the device 
behind the NPU, i.e. the nvidia card.
At some point, it was not possible to do mmio invalidations on the nMMU. 
At least on dd1. I'm checking the dd2 status with the nMMU team.

Alistair: is your code really doing an nMMU invalidation? Considering 
you're trying to also reuse the mm_context_add_copro() from this patch, 
I think I know the answer.

There are also other components relying on broadcasted invalidations 
from hardware: the PSL (for capi FPGA) and the XSL on the Mellanox CX5 
card, when in capi mode. They rely on hardware TLBIs, snooped and 
forwarded to them by the CAPP.
For the PSL, we do have an mmio interface to do targeted invalidations, 
but it was removed from the capi architecture (and left as a debug 
feature for our PSL implementation), because the nMMU would be out of 
sync with the PSL (due to the lack of interface to sync the nMMU, as 
mentioned above).
For the XSL on the Mellanox CX5, it's even more complicated. AFAIK, they 
do have a way to trigger invalidations through software, though the 
interface is private and Mellanox would have to be involved. They've 
also stated the performance is much worse through software invalidation.

Another consideration is performance. Which is best? Short of having 
real numbers, it's probably hard to know for sure.

So the road to getting rid of hardware invalidations for external 
components, if at all possible or even desirable, may be long.

   Fred



> This change is not too bad today, but if we get to more complicated
> MMU/nMMU TLB management like directed invalidation of particular units,
> then putting more knowledge into the host code will end up being
> complex I think.
> 
> I also want to do optimizations on the core code that assumes we
> only have to take care of other CPUs, e.g.,
> 
> https://patchwork.ozlabs.org/patch/811068/
> 
> Or, another example, directed IPI invalidations from the mm_cpumask
> bitmap.
> 
> I realize you want to get something merged! For the merge window and
> backports this seems fine. I think it would be nice soon afterwards to
> get nMMU knowledge out of the core code... Though I also realize that
> with our tlbie instruction doing everything, it may be tricky to
> make a really optimal notifier.
> 
> Thanks,
> Nick
> 
>>
>> Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
>> Fixes: f24be42aab37 ("cxl: Add psl9 specific code")
>> ---
>> Changelog:
>> v3: don't decrement active cpus count with hash, as we don't know how to flush
>> v2: Replace flush_tlb_mm() by the new flush_all_mm() to flush the TLBs
>> and PWCs (thanks to Ben)
>
Nicholas Piggin Sept. 8, 2017, 10:54 a.m. UTC | #3
On Fri, 8 Sep 2017 09:34:39 +0200
Frederic Barrat <fbarrat@linux.vnet.ibm.com> wrote:

> On 08/09/2017 at 08:56, Nicholas Piggin wrote:
> > On Sun,  3 Sep 2017 20:15:13 +0200
> > Frederic Barrat <fbarrat@linux.vnet.ibm.com> wrote:
> >   
> >> The PSL and nMMU need to see all TLB invalidations for the memory
> >> contexts used on the adapter. For the hash memory model, it is done by
> >> making all TLBIs global as soon as the cxl driver is in use. For
> >> radix, we need something similar, but we can refine and only convert
> >> to global the invalidations for contexts actually used by the device.
> >>
> >> The new mm_context_add_copro() API increments the 'active_cpus' count
> >> for the contexts attached to the cxl adapter. As soon as there's more
> >> than 1 active cpu, the TLBIs for the context become global. Active cpu
> >> count must be decremented when detaching to restore locality if
> >> possible and to avoid overflowing the counter.
> >>
> >> The hash memory model support is somewhat limited, as we can't
> >> decrement the active cpus count when mm_context_remove_copro() is
> >> called, because we can't flush the TLB for a mm on hash. So TLBIs
> >> remain global on hash.  
> > 
> > Sorry I didn't look at this earlier and just wading in here a bit, but
> > what do you think of using mmu notifiers for invalidating nMMU and
> > coprocessor caches, rather than put the details into the host MMU
> > management? npu-dma.c already looks to have almost everything covered
> > with its notifiers (in that it wouldn't have to rely on tlbie coming
> > from host MMU code).  
> 
> Does npu-dma.c really do mmio nMMU invalidations?

No, but it does do a flush_tlb_mm there to issue a tlbie (probably
buggy in some cases, as it does tlbiel without this patch of yours).
But the point is when you control the flushing you don't have to
mess with making the core flush code give you tlbies.

Just add a flush_nmmu_mm or something that does what you need.

If you can make a more targeted nMMU invalidate, then that's
even better.
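
i.e. something like this sketch (flush_nmmu_mm is a made-up name;
flush_all_mm() is the export from patch 1/2 of this series):

/* sketch only: a copro/nMMU flush decoupled from whatever
 * the core flush code happens to emit */
static void flush_nmmu_mm(struct mm_struct *mm)
{
	if (radix_enabled())
		flush_all_mm(mm);	/* full-PID tlbie, TLB + PWC */
	/* on hash there is no full-mm flush for now */
}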

One downside, I thought at first, is that the core code might already
do a broadcast tlbie; the mmu notifier does not easily know
about that, so it will do a second one, which would be suboptimal.

Possibly we could add some flag or state so the nmmu flush can
avoid the second one.

But now that I look again, the NPU code has this comment:

        /*
         * Unfortunately the nest mmu does not support flushing specific
         * addresses so we have to flush the whole mm.
         */

Which seems to indicate that you can't rely on core code to give
you full flushes because for range flushing it is possible that the
core code will do it with address flushes. Or am I missing something?

So it seems you really do need to always issue a full PID tlbie from
a notifier.

> My understanding was 
> that those atsd_launch operations are really targeted at the device 
> behind the NPU, i.e. the nvidia card.
> At some point, it was not possible to do mmio invalidations on the nMMU. 
> At least on dd1. I'm checking the dd2 status with the nMMU team.
> 
> Alistair: is your code really doing an nMMU invalidation? Considering 
> you're trying to also reuse the mm_context_add_copro() from this patch, 
> I think I know the answer.
> 
> There are also other components relying on broadcasted invalidations 
> from hardware: the PSL (for capi FPGA) and the XSL on the Mellanox CX5 
> card, when in capi mode. They rely on hardware TLBIs, snooped and 
> forwarded to them by the CAPP.
> For the PSL, we do have an mmio interface to do targeted invalidations, 
> but it was removed from the capi architecture (and left as a debug 
> feature for our PSL implementation), because the nMMU would be out of 
> sync with the PSL (due to the lack of interface to sync the nMMU, as 
> mentioned above).
> For the XSL on the Mellanox CX5, it's even more complicated. AFAIK, they 
> do have a way to trigger invalidations through software, though the 
> interface is private and Mellanox would have to be involved. They've 
> also stated the performance is much worse through software invalidation.

Okay, point is I think the nMMU and agent drivers will be in a better
position to handle all that. I don't see that flushing from your notifier
means that you can't issue a tlbie to do it.

> 
> Another consideration is performance. Which is best? Short of having 
> real numbers, it's probably hard to know for sure.

Let's come to that if we agree on a way to go. I *think* we can make it
at least no worse than we have today, using tlbie and possibly some small
changes to generic code callers.

Thanks,
Nick
Nicholas Piggin Sept. 8, 2017, 11:18 a.m. UTC | #4
On Fri, 8 Sep 2017 20:54:02 +1000
Nicholas Piggin <npiggin@gmail.com> wrote:

> On Fri, 8 Sep 2017 09:34:39 +0200
> Frederic Barrat <fbarrat@linux.vnet.ibm.com> wrote:
> 
> > On 08/09/2017 at 08:56, Nicholas Piggin wrote:
> > > On Sun,  3 Sep 2017 20:15:13 +0200
> > > Frederic Barrat <fbarrat@linux.vnet.ibm.com> wrote:
> > >     
> > >> The PSL and nMMU need to see all TLB invalidations for the memory
> > >> contexts used on the adapter. For the hash memory model, it is done by
> > >> making all TLBIs global as soon as the cxl driver is in use. For
> > >> radix, we need something similar, but we can refine and only convert
> > >> to global the invalidations for contexts actually used by the device.
> > >>
> > >> The new mm_context_add_copro() API increments the 'active_cpus' count
> > >> for the contexts attached to the cxl adapter. As soon as there's more
> > >> than 1 active cpu, the TLBIs for the context become global. Active cpu
> > >> count must be decremented when detaching to restore locality if
> > >> possible and to avoid overflowing the counter.
> > >>
> > >> The hash memory model support is somewhat limited, as we can't
> > >> decrement the active cpus count when mm_context_remove_copro() is
> > >> called, because we can't flush the TLB for a mm on hash. So TLBIs
> > >> remain global on hash.    
> > > 
> > > Sorry I didn't look at this earlier and just wading in here a bit, but
> > > what do you think of using mmu notifiers for invalidating nMMU and
> > > coprocessor caches, rather than put the details into the host MMU
> > > management? npu-dma.c already looks to have almost everything covered
> > > with its notifiers (in that it wouldn't have to rely on tlbie coming
> > > from host MMU code).    
> > 
> > Does npu-dma.c really do mmio nMMU invalidations?  
> 
> No, but it does do a flush_tlb_mm there to issue a tlbie (probably
> buggy in some cases, as it does tlbiel without this patch of yours).
> But the point is when you control the flushing you don't have to
> mess with making the core flush code give you tlbies.
> 
> Just add a flush_nmmu_mm or something that does what you need.
> 
> If you can make a more targeted nMMU invalidate, then that's
> even better.
> 
> One downside, I thought at first, is that the core code might already
> do a broadcast tlbie; the mmu notifier does not easily know
> about that, so it will do a second one, which would be suboptimal.
> 
> Possibly we could add some flag or state so the nmmu flush can
> avoid the second one.
> 
> But now that I look again, the NPU code has this comment:
> 
>         /*
>          * Unfortunately the nest mmu does not support flushing specific
>          * addresses so we have to flush the whole mm.
>          */
> 
> Which seems to indicate that you can't rely on core code to give
> you full flushes because for range flushing it is possible that the
> core code will do it with address flushes. Or am I missing something?
> 
> So it seems you really do need to always issue a full PID tlbie from
> a notifier.

Oh I see, actually it's fixed in newer firmware and there's a patch
out for it.

Okay, so the nMMU can take address tlbies; in that case it's not a
correctness issue (except for old firmware that still has the bug).
Alistair Popple Sept. 13, 2017, 3:53 a.m. UTC | #5
On Fri, 8 Sep 2017 04:56:24 PM Nicholas Piggin wrote:
> On Sun,  3 Sep 2017 20:15:13 +0200
> Frederic Barrat <fbarrat@linux.vnet.ibm.com> wrote:
> 
> > The PSL and nMMU need to see all TLB invalidations for the memory
> > contexts used on the adapter. For the hash memory model, it is done by
> > making all TLBIs global as soon as the cxl driver is in use. For
> > radix, we need something similar, but we can refine and only convert
> > to global the invalidations for contexts actually used by the device.
> > 
> > The new mm_context_add_copro() API increments the 'active_cpus' count
> > for the contexts attached to the cxl adapter. As soon as there's more
> > than 1 active cpu, the TLBIs for the context become global. Active cpu
> > count must be decremented when detaching to restore locality if
> > possible and to avoid overflowing the counter.
> > 
> > The hash memory model support is somewhat limited, as we can't
> > decrement the active cpus count when mm_context_remove_copro() is
> > called, because we can't flush the TLB for a mm on hash. So TLBIs
> > remain global on hash.
> 
> Sorry I didn't look at this earlier and just wading in here a bit, but
> what do you think of using mmu notifiers for invalidating nMMU and
> coprocessor caches, rather than put the details into the host MMU
> management? npu-dma.c already looks to have almost everything covered
> with its notifiers (in that it wouldn't have to rely on tlbie coming
> from host MMU code).

Sorry, just finding time to catch up on this. From subsequent emails it looks
like you may have figured this out. The TLB flush in npu-dma.c is a workaround
for a HW issue rather than there to explicitly manage the NMMU caches. The
intent for NPU was always to have the NMMU snoop normal core tlbies rather than
do it via notifiers. A subsequent patch series
(https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=1681) removes
this flush now that the HW issue has been worked around via a FW fix.

I agree this is something we could look into optimising in the medium term, but
for the moment it would be good if we could get this series merged.

- Alistair
Alistair Popple Sept. 13, 2017, 3:58 a.m. UTC | #6
I have tested the non-cxl specific parts
(mm_context_add_copro/mm_context_remove_copro) with this series -
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=1681 - and it
works well for npu.

Tested-by: Alistair Popple <alistair@popple.id.au>

On Sun, 3 Sep 2017 08:15:13 PM Frederic Barrat wrote:
> The PSL and nMMU need to see all TLB invalidations for the memory
> contexts used on the adapter. For the hash memory model, it is done by
> making all TLBIs global as soon as the cxl driver is in use. For
> radix, we need something similar, but we can refine and only convert
> to global the invalidations for contexts actually used by the device.
> 
> The new mm_context_add_copro() API increments the 'active_cpus' count
> for the contexts attached to the cxl adapter. As soon as there's more
> than 1 active cpu, the TLBIs for the context become global. Active cpu
> count must be decremented when detaching to restore locality if
> possible and to avoid overflowing the counter.
> 
> The hash memory model support is somewhat limited, as we can't
> decrement the active cpus count when mm_context_remove_copro() is
> called, because we can't flush the TLB for a mm on hash. So TLBIs
> remain global on hash.
> 
> Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
> Fixes: f24be42aab37 ("cxl: Add psl9 specific code")
> ---
> Changelog:
> v3: don't decrement active cpus count with hash, as we don't know how to flush
> v2: Replace flush_tlb_mm() by the new flush_all_mm() to flush the TLBs
> and PWCs (thanks to Ben)
> 
>  arch/powerpc/include/asm/mmu_context.h | 46 ++++++++++++++++++++++++++++++++++
>  arch/powerpc/mm/mmu_context.c          |  9 -------
>  drivers/misc/cxl/api.c                 | 22 +++++++++++++---
>  drivers/misc/cxl/context.c             |  3 +++
>  drivers/misc/cxl/file.c                | 19 ++++++++++++--
>  5 files changed, 85 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index 309592589e30..a0d7145d6cd2 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -77,6 +77,52 @@ extern void switch_cop(struct mm_struct *next);
>  extern int use_cop(unsigned long acop, struct mm_struct *mm);
>  extern void drop_cop(unsigned long acop, struct mm_struct *mm);
>  
> +#ifdef CONFIG_PPC_BOOK3S_64
> +static inline void inc_mm_active_cpus(struct mm_struct *mm)
> +{
> +	atomic_inc(&mm->context.active_cpus);
> +}
> +
> +static inline void dec_mm_active_cpus(struct mm_struct *mm)
> +{
> +	atomic_dec(&mm->context.active_cpus);
> +}
> +
> +static inline void mm_context_add_copro(struct mm_struct *mm)
> +{
> +	/*
> +	 * On hash, should only be called once over the lifetime of
> +	 * the context, as we can't decrement the active cpus count
> +	 * and flush properly for the time being.
> +	 */
> +	inc_mm_active_cpus(mm);
> +}
> +
> +static inline void mm_context_remove_copro(struct mm_struct *mm)
> +{
> +	/*
> +	 * Need to broadcast a global flush of the full mm before
> +	 * decrementing active_cpus count, as the next TLBI may be
> +	 * local and the nMMU and/or PSL need to be cleaned up.
> +	 * Should be rare enough so that it's acceptable.
> +	 *
> +	 * Skip on hash, as we don't know how to do the proper flush
> +	 * for the time being. Invalidations will remain global if
> +	 * used on hash.
> +	 */
> +	if (radix_enabled()) {
> +		flush_all_mm(mm);
> +		dec_mm_active_cpus(mm);
> +	}
> +}
> +#else
> +static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
> +static inline void dec_mm_active_cpus(struct mm_struct *mm) { }
> +static inline void mm_context_add_copro(struct mm_struct *mm) { }
> +static inline void mm_context_remove_copro(struct mm_struct *mm) { }
> +#endif
> +
> +
>  extern void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  			       struct task_struct *tsk);
>  
> diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
> index 0f613bc63c50..d60a62bf4fc7 100644
> --- a/arch/powerpc/mm/mmu_context.c
> +++ b/arch/powerpc/mm/mmu_context.c
> @@ -34,15 +34,6 @@ static inline void switch_mm_pgdir(struct task_struct *tsk,
>  				   struct mm_struct *mm) { }
>  #endif
>  
> -#ifdef CONFIG_PPC_BOOK3S_64
> -static inline void inc_mm_active_cpus(struct mm_struct *mm)
> -{
> -	atomic_inc(&mm->context.active_cpus);
> -}
> -#else
> -static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
> -#endif
> -
>  void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  			struct task_struct *tsk)
>  {
> diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
> index a0c44d16bf30..1137a2cc1d3e 100644
> --- a/drivers/misc/cxl/api.c
> +++ b/drivers/misc/cxl/api.c
> @@ -15,6 +15,7 @@
>  #include <linux/module.h>
>  #include <linux/mount.h>
>  #include <linux/sched/mm.h>
> +#include <linux/mmu_context.h>
>  
>  #include "cxl.h"
>  
> @@ -331,9 +332,12 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
>  		/* ensure this mm_struct can't be freed */
>  		cxl_context_mm_count_get(ctx);
>  
> -		/* decrement the use count */
> -		if (ctx->mm)
> +		if (ctx->mm) {
> +			/* decrement the use count from above */
>  			mmput(ctx->mm);
> +			/* make TLBIs for this context global */
> +			mm_context_add_copro(ctx->mm);
> +		}
>  	}
>  
>  	/*
> @@ -342,13 +346,25 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
>  	 */
>  	cxl_ctx_get();
>  
> +	/*
> +	 * Barrier is needed to make sure all TLBIs are global before
> +	 * we attach and the context starts being used by the adapter.
> +	 *
> +	 * Needed after mm_context_add_copro() for radix and
> +	 * cxl_ctx_get() for hash/p8
> +	 */
> +	smp_mb();
> +
>  	if ((rc = cxl_ops->attach_process(ctx, kernel, wed, 0))) {
>  		put_pid(ctx->pid);
>  		ctx->pid = NULL;
>  		cxl_adapter_context_put(ctx->afu->adapter);
>  		cxl_ctx_put();
> -		if (task)
> +		if (task) {
>  			cxl_context_mm_count_put(ctx);
> +			if (ctx->mm)
> +				mm_context_remove_copro(ctx->mm);
> +		}
>  		goto out;
>  	}
>  
> diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
> index 8c32040b9c09..12a41b2753f0 100644
> --- a/drivers/misc/cxl/context.c
> +++ b/drivers/misc/cxl/context.c
> @@ -18,6 +18,7 @@
>  #include <linux/slab.h>
>  #include <linux/idr.h>
>  #include <linux/sched/mm.h>
> +#include <linux/mmu_context.h>
>  #include <asm/cputable.h>
>  #include <asm/current.h>
>  #include <asm/copro.h>
> @@ -267,6 +268,8 @@ int __detach_context(struct cxl_context *ctx)
>  
>  	/* Decrease the mm count on the context */
>  	cxl_context_mm_count_put(ctx);
> +	if (ctx->mm)
> +		mm_context_remove_copro(ctx->mm);
>  	ctx->mm = NULL;
>  
>  	return 0;
> diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c
> index 4bfad9f6dc9f..84b801b5d0e5 100644
> --- a/drivers/misc/cxl/file.c
> +++ b/drivers/misc/cxl/file.c
> @@ -19,6 +19,7 @@
>  #include <linux/mm.h>
>  #include <linux/slab.h>
>  #include <linux/sched/mm.h>
> +#include <linux/mmu_context.h>
>  #include <asm/cputable.h>
>  #include <asm/current.h>
>  #include <asm/copro.h>
> @@ -220,9 +221,12 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
>  	/* ensure this mm_struct can't be freed */
>  	cxl_context_mm_count_get(ctx);
>  
> -	/* decrement the use count */
> -	if (ctx->mm)
> +	if (ctx->mm) {
> +		/* decrement the use count from above */
>  		mmput(ctx->mm);
> +		/* make TLBIs for this context global */
> +		mm_context_add_copro(ctx->mm);
> +	}
>  
>  	/*
>  	 * Increment driver use count. Enables global TLBIs for hash
> @@ -230,6 +234,15 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
>  	 */
>  	cxl_ctx_get();
>  
> +	/*
> +	 * Barrier is needed to make sure all TLBIs are global before
> +	 * we attach and the context starts being used by the adapter.
> +	 *
> +	 * Needed after mm_context_add_copro() for radix and
> +	 * cxl_ctx_get() for hash/p8
> +	 */
> +	smp_mb();
> +
>  	trace_cxl_attach(ctx, work.work_element_descriptor, work.num_interrupts, amr);
>  
>  	if ((rc = cxl_ops->attach_process(ctx, false, work.work_element_descriptor,
> @@ -240,6 +253,8 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
>  		ctx->pid = NULL;
>  		cxl_ctx_put();
>  		cxl_context_mm_count_put(ctx);
> +		if (ctx->mm)
> +			mm_context_remove_copro(ctx->mm);
>  		goto out;
>  	}
>  
>

Patch

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 309592589e30..a0d7145d6cd2 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -77,6 +77,52 @@  extern void switch_cop(struct mm_struct *next);
 extern int use_cop(unsigned long acop, struct mm_struct *mm);
 extern void drop_cop(unsigned long acop, struct mm_struct *mm);
 
+#ifdef CONFIG_PPC_BOOK3S_64
+static inline void inc_mm_active_cpus(struct mm_struct *mm)
+{
+	atomic_inc(&mm->context.active_cpus);
+}
+
+static inline void dec_mm_active_cpus(struct mm_struct *mm)
+{
+	atomic_dec(&mm->context.active_cpus);
+}
+
+static inline void mm_context_add_copro(struct mm_struct *mm)
+{
+	/*
+	 * On hash, should only be called once over the lifetime of
+	 * the context, as we can't decrement the active cpus count
+	 * and flush properly for the time being.
+	 */
+	inc_mm_active_cpus(mm);
+}
+
+static inline void mm_context_remove_copro(struct mm_struct *mm)
+{
+	/*
+	 * Need to broadcast a global flush of the full mm before
+	 * decrementing active_cpus count, as the next TLBI may be
+	 * local and the nMMU and/or PSL need to be cleaned up.
+	 * Should be rare enough so that it's acceptable.
+	 *
+	 * Skip on hash, as we don't know how to do the proper flush
+	 * for the time being. Invalidations will remain global if
+	 * used on hash.
+	 */
+	if (radix_enabled()) {
+		flush_all_mm(mm);
+		dec_mm_active_cpus(mm);
+	}
+}
+#else
+static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
+static inline void dec_mm_active_cpus(struct mm_struct *mm) { }
+static inline void mm_context_add_copro(struct mm_struct *mm) { }
+static inline void mm_context_remove_copro(struct mm_struct *mm) { }
+#endif
+
+
 extern void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			       struct task_struct *tsk);
 
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index 0f613bc63c50..d60a62bf4fc7 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -34,15 +34,6 @@  static inline void switch_mm_pgdir(struct task_struct *tsk,
 				   struct mm_struct *mm) { }
 #endif
 
-#ifdef CONFIG_PPC_BOOK3S_64
-static inline void inc_mm_active_cpus(struct mm_struct *mm)
-{
-	atomic_inc(&mm->context.active_cpus);
-}
-#else
-static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
-#endif
-
 void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			struct task_struct *tsk)
 {
diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index a0c44d16bf30..1137a2cc1d3e 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -15,6 +15,7 @@ 
 #include <linux/module.h>
 #include <linux/mount.h>
 #include <linux/sched/mm.h>
+#include <linux/mmu_context.h>
 
 #include "cxl.h"
 
@@ -331,9 +332,12 @@  int cxl_start_context(struct cxl_context *ctx, u64 wed,
 		/* ensure this mm_struct can't be freed */
 		cxl_context_mm_count_get(ctx);
 
-		/* decrement the use count */
-		if (ctx->mm)
+		if (ctx->mm) {
+			/* decrement the use count from above */
 			mmput(ctx->mm);
+			/* make TLBIs for this context global */
+			mm_context_add_copro(ctx->mm);
+		}
 	}
 
 	/*
@@ -342,13 +346,25 @@  int cxl_start_context(struct cxl_context *ctx, u64 wed,
 	 */
 	cxl_ctx_get();
 
+	/*
+	 * Barrier is needed to make sure all TLBIs are global before
+	 * we attach and the context starts being used by the adapter.
+	 *
+	 * Needed after mm_context_add_copro() for radix and
+	 * cxl_ctx_get() for hash/p8
+	 */
+	smp_mb();
+
 	if ((rc = cxl_ops->attach_process(ctx, kernel, wed, 0))) {
 		put_pid(ctx->pid);
 		ctx->pid = NULL;
 		cxl_adapter_context_put(ctx->afu->adapter);
 		cxl_ctx_put();
-		if (task)
+		if (task) {
 			cxl_context_mm_count_put(ctx);
+			if (ctx->mm)
+				mm_context_remove_copro(ctx->mm);
+		}
 		goto out;
 	}
 
diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index 8c32040b9c09..12a41b2753f0 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -18,6 +18,7 @@ 
 #include <linux/slab.h>
 #include <linux/idr.h>
 #include <linux/sched/mm.h>
+#include <linux/mmu_context.h>
 #include <asm/cputable.h>
 #include <asm/current.h>
 #include <asm/copro.h>
@@ -267,6 +268,8 @@  int __detach_context(struct cxl_context *ctx)
 
 	/* Decrease the mm count on the context */
 	cxl_context_mm_count_put(ctx);
+	if (ctx->mm)
+		mm_context_remove_copro(ctx->mm);
 	ctx->mm = NULL;
 
 	return 0;
diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c
index 4bfad9f6dc9f..84b801b5d0e5 100644
--- a/drivers/misc/cxl/file.c
+++ b/drivers/misc/cxl/file.c
@@ -19,6 +19,7 @@ 
 #include <linux/mm.h>
 #include <linux/slab.h>
 #include <linux/sched/mm.h>
+#include <linux/mmu_context.h>
 #include <asm/cputable.h>
 #include <asm/current.h>
 #include <asm/copro.h>
@@ -220,9 +221,12 @@  static long afu_ioctl_start_work(struct cxl_context *ctx,
 	/* ensure this mm_struct can't be freed */
 	cxl_context_mm_count_get(ctx);
 
-	/* decrement the use count */
-	if (ctx->mm)
+	if (ctx->mm) {
+		/* decrement the use count from above */
 		mmput(ctx->mm);
+		/* make TLBIs for this context global */
+		mm_context_add_copro(ctx->mm);
+	}
 
 	/*
 	 * Increment driver use count. Enables global TLBIs for hash
@@ -230,6 +234,15 @@  static long afu_ioctl_start_work(struct cxl_context *ctx,
 	 */
 	cxl_ctx_get();
 
+	/*
+	 * Barrier is needed to make sure all TLBIs are global before
+	 * we attach and the context starts being used by the adapter.
+	 *
+	 * Needed after mm_context_add_copro() for radix and
+	 * cxl_ctx_get() for hash/p8
+	 */
+	smp_mb();
+
 	trace_cxl_attach(ctx, work.work_element_descriptor, work.num_interrupts, amr);
 
 	if ((rc = cxl_ops->attach_process(ctx, false, work.work_element_descriptor,
@@ -240,6 +253,8 @@  static long afu_ioctl_start_work(struct cxl_context *ctx,
 		ctx->pid = NULL;
 		cxl_ctx_put();
 		cxl_context_mm_count_put(ctx);
+		if (ctx->mm)
+			mm_context_remove_copro(ctx->mm);
 		goto out;
 	}