Patchwork [v2,1/3] powerpc/booke64: add sync after writing PTE

Submitter: Scott Wood
Date: Sept. 14, 2013, 3:50 a.m.
Message ID: <1379130622-17436-1-git-send-email-scottwood@freescale.com>
Permalink: /patch/274893/
State: Superseded

Comments

Scott Wood - Sept. 14, 2013, 3:50 a.m.
The ISA says that a sync is needed to order a PTE write with a
subsequent hardware tablewalk lookup.  On e6500, without this sync
we've been observed to die with a DSI due to a PTE write not being seen
by a subsequent access, even when everything happens on the same
CPU.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
v2: new patch

Note that we saw this problem when running in emulation (died early in
boot), but when I tried just now with the mb() removed on actual
hardware, I didn't see any problem.  But since I'm not aware of any
change having been made in the hardware relative to this, and since it
is architecturally required, I'd be more comfortable leaving it in
unless something has changed in the kernel recently such that a sync
is happening somewhere else along the code path.
---
 arch/powerpc/include/asm/pgtable.h | 7 +++++++
 1 file changed, 7 insertions(+)
Benjamin Herrenschmidt - Sept. 15, 2013, 9:38 p.m.
On Fri, 2013-09-13 at 22:50 -0500, Scott Wood wrote:
> The ISA says that a sync is needed to order a PTE write with a
> subsequent hardware tablewalk lookup.  On e6500, without this sync
> we've been observed to die with a DSI due to a PTE write not being seen
> by a subsequent access, even when everything happens on the same
> CPU.

This is gross, I didn't realize we had that bogosity in the
architecture...

Did you measure the performance impact ?

Cheers,
Ben.

> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
> v2: new patch
> 
> Note that we saw this problem when running in emulation (died early in
> boot), but when I tried just now with the mb() removed on actual
> hardware, I didn't see any problem.  But since I'm not aware of any
> change having been made in the hardware relative to this, and since it
> is architecturally required, I'd be more comfortable leaving it in
> unless something has changed in the kernel recently such that a sync
> is happening somewhere else along the code path.
> ---
>  arch/powerpc/include/asm/pgtable.h | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
> index 7d6eacf..0459077 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -143,6 +143,13 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
>  	 * cases, and 32-bit non-hash with 32-bit PTEs.
>  	 */
>  	*ptep = pte;
> +#ifdef CONFIG_PPC_BOOK3E_64
> +	/*
> +	 * With hardware tablewalk, a sync is needed to ensure that
> +	 * subsequent accesses see the PTE we just wrote.
> +	 */
> +	mb();
> +#endif
>  #endif
>  }
>
Scott Wood - Sept. 17, 2013, 12:06 a.m.
On Mon, 2013-09-16 at 07:38 +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2013-09-13 at 22:50 -0500, Scott Wood wrote:
> > The ISA says that a sync is needed to order a PTE write with a
> > subsequent hardware tablewalk lookup.  On e6500, without this sync
> > we've been observed to die with a DSI due to a PTE write not being seen
> > by a subsequent access, even when everything happens on the same
> > CPU.
> 
> This is gross, I didn't realize we had that bogosity in the
> architecture...
> 
> Did you measure the performance impact ?

I didn't see a noticeable impact on the tests I ran, but those were
aimed at measuring TLB miss overhead.  I'll need to try it with a
benchmark that's more oriented around lots of page table updates.

-Scott
Scott Wood - Oct. 10, 2013, 10:31 p.m.
On Mon, 2013-09-16 at 19:06 -0500, Scott Wood wrote:
> On Mon, 2013-09-16 at 07:38 +1000, Benjamin Herrenschmidt wrote:
> > On Fri, 2013-09-13 at 22:50 -0500, Scott Wood wrote:
> > > The ISA says that a sync is needed to order a PTE write with a
> > > subsequent hardware tablewalk lookup.  On e6500, without this sync
> > > we've been observed to die with a DSI due to a PTE write not being seen
> > > by a subsequent access, even when everything happens on the same
> > > CPU.
> > 
> > This is gross, I didn't realize we had that bogosity in the
> > architecture...
> > 
> > Did you measure the performance impact ?
> 
> I didn't see a noticeable impact on the tests I ran, but those were
> aimed at measuring TLB miss overhead.  I'll need to try it with a
> benchmark that's more oriented around lots of page table updates.

Lmbench's fork test runs about 2% slower with the sync.  I've been told
that nothing relevant has changed since we saw the failure during
emulation; it's probably luck and/or timing, or maybe a sync got added
somewhere else since then?  I think it's only really a problem for
kernel page tables, since user page tables will retry if do_page_fault()
sees a valid PTE.  So maybe we should put an mb() in map_kernel_page()
instead.

-Scott
Scott Wood - Oct. 10, 2013, 11:25 p.m.
On Thu, 2013-10-10 at 17:31 -0500, Scott Wood wrote:
> On Mon, 2013-09-16 at 19:06 -0500, Scott Wood wrote:
> > On Mon, 2013-09-16 at 07:38 +1000, Benjamin Herrenschmidt wrote:
> > > On Fri, 2013-09-13 at 22:50 -0500, Scott Wood wrote:
> > > > The ISA says that a sync is needed to order a PTE write with a
> > > > subsequent hardware tablewalk lookup.  On e6500, without this sync
> > > > we've been observed to die with a DSI due to a PTE write not being seen
> > > > by a subsequent access, even when everything happens on the same
> > > > CPU.
> > > 
> > > This is gross, I didn't realize we had that bogosity in the
> > > architecture...
> > > 
> > > Did you measure the performance impact ?
> > 
> > I didn't see a noticeable impact on the tests I ran, but those were
> > aimed at measuring TLB miss overhead.  I'll need to try it with a
> > benchmark that's more oriented around lots of page table updates.
> 
> Lmbench's fork test runs about 2% slower with the sync.  I've been told
> that nothing relevant has changed since we saw the failure during
> emulation; it's probably luck and/or timing, or maybe a sync got added
> somewhere else since then?  I think it's only really a problem for
> kernel page tables, since user page tables will retry if do_page_fault()
> sees a valid PTE.  So maybe we should put an mb() in map_kernel_page()
> instead.

Looking at some of the code in mm/, I suspect that the normal callers of
set_pte_at() already have an unlock (and thus a sync), so we may
not even be relying on those retries.  Certainly some of them do; it
would take some effort to verify all of them.

Also, without such a sync in map_kernel_page(), even with software
tablewalk, couldn't we theoretically have a situation where a store to
pointer X that exposes a new mapping gets reordered before the PTE store
as seen by another CPU?  The other CPU could see non-NULL X and
dereference it, but get the stale PTE.  Callers of ioremap() generally
don't do a barrier of their own prior to exposing the result.

-Scott
Benjamin Herrenschmidt - Oct. 10, 2013, 11:51 p.m.
On Thu, 2013-10-10 at 18:25 -0500, Scott Wood wrote:

> Looking at some of the code in mm/, I suspect that the normal callers of
> set_pte_at() already have an unlock (and thus a sync) 

Unlock is lwsync actually...

> already, so we may
> not even be relying on those retries.  Certainly some of them do; it
> would take some effort to verify all of them.
> 
> Also, without such a sync in map_kernel_page(), even with software
> tablewalk, couldn't we theoretically have a situation where a store to
> pointer X that exposes a new mapping gets reordered before the PTE store
> as seen by another CPU?  The other CPU could see non-NULL X and
> dereference it, but get the stale PTE.  Callers of ioremap() generally
> don't do a barrier of their own prior to exposing the result.

Hrm, the transition to the new PTE either restricts the access
permissions, in which case it flushes the TLB (and synchronizes with
other CPUs), or extends access (adds dirty, sets the pte from 0 ->
populated, ...), in which case the worst case is that we see the old one
and take a spurious fault.

So the problem would only be with kernel mappings and in that case I
think we are fine. A driver doing an ioremap shouldn't then start using
that mapping on another CPU before having *informed* that other CPU of
the existence of the mapping and that should be ordered.

Ben.
Scott Wood - Oct. 11, 2013, 10:07 p.m.
On Fri, 2013-10-11 at 10:51 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2013-10-10 at 18:25 -0500, Scott Wood wrote:
> 
> > Looking at some of the code in mm/, I suspect that the normal callers of
> > set_pte_at() already have an unlock (and thus a sync) 
> 
> Unlock is lwsync actually...

Oops, I was seeing the conditional sync from SYNC_IO in the disassembly.
BTW, it's a bug that we don't do SYNC_IO on e500mc -- the assumption
that lwsync is 64-bit-only is no longer true.

> > already, so we may
> > not even be relying on those retries.  Certainly some of them do; it
> > would take some effort to verify all of them.
> > 
> > Also, without such a sync in map_kernel_page(), even with software
> > tablewalk, couldn't we theoretically have a situation where a store to
> > pointer X that exposes a new mapping gets reordered before the PTE store
> > as seen by another CPU?  The other CPU could see non-NULL X and
> > dereference it, but get the stale PTE.  Callers of ioremap() generally
> > don't do a barrier of their own prior to exposing the result.
> 
> Hrm, the transition to the new PTE either restricts the access
> permissions, in which case it flushes the TLB (and synchronizes with
> other CPUs), or extends access (adds dirty, sets the pte from 0 ->
> populated, ...), in which case the worst case is that we see the old one
> and take a spurious fault.

Yes, and the lwsync is good enough for software reading the PTE.  So it
becomes a question of how much spurious faults with hardware tablewalk
hurt performance, and at least for the lmbench fork test, the sync is
worse (or maybe lwsync happens to be good enough for hw tablewalk on
e6500?).

> So the problem would only be with kernel mappings and in that case I
> think we are fine. A driver doing an ioremap shouldn't then start using
> that mapping on another CPU before having *informed* that other CPU of
> the existence of the mapping and that should be ordered.

But are callers of ioremap() expected to use a barrier before exposing
the pointer (and what type)?  I don't think that's common practice.

map_kernel_page() should not be performance critical, so it shouldn't be
a big deal to put mb() in there.

-Scott
Benjamin Herrenschmidt - Oct. 11, 2013, 10:34 p.m.
On Fri, 2013-10-11 at 17:07 -0500, Scott Wood wrote:
> On Fri, 2013-10-11 at 10:51 +1100, Benjamin Herrenschmidt wrote:
> > On Thu, 2013-10-10 at 18:25 -0500, Scott Wood wrote:
> > 
> > > Looking at some of the code in mm/, I suspect that the normal callers of
> > > set_pte_at() already have an unlock (and thus a sync) 
> > 
> > Unlock is lwsync actually...
> 
> Oops, I was seeing the conditional sync from SYNC_IO in the disassembly.
> BTW, it's a bug that we don't do SYNC_IO on e500mc -- the assumption
> that lwsync is 64-bit-only is no longer true.

Patch welcome :)

> > > already, so we may
> > > not even be relying on those retries.  Certainly some of them do; it
> > > would take some effort to verify all of them.
> > > 
> > > Also, without such a sync in map_kernel_page(), even with software
> > > tablewalk, couldn't we theoretically have a situation where a store to
> > > pointer X that exposes a new mapping gets reordered before the PTE store
> > > as seen by another CPU?  The other CPU could see non-NULL X and
> > > dereference it, but get the stale PTE.  Callers of ioremap() generally
> > > don't do a barrier of their own prior to exposing the result.
> > 
> > Hrm, the transition to the new PTE either restricts the access
> > permissions, in which case it flushes the TLB (and synchronizes with
> > other CPUs), or extends access (adds dirty, sets the pte from 0 ->
> > populated, ...), in which case the worst case is that we see the old one
> > and take a spurious fault.
> 
> Yes, and the lwsync is good enough for software reading the PTE.  So it
> becomes a question of how much spurious faults with hardware tablewalk
> hurt performance, and at least for the lmbench fork test, the sync is
> worse (or maybe lwsync happens to be good enough for hw tablewalk on
> e6500?).
> 
> > So the problem would only be with kernel mappings and in that case I
> > think we are fine. A driver doing an ioremap shouldn't then start using
> > that mapping on another CPU before having *informed* that other CPU of
> > the existence of the mapping and that should be ordered.
> 
> But are callers of ioremap() expected to use a barrier before exposing
> the pointer (and what type)?  I don't think that's common practice.
> 
> map_kernel_page() should not be performance critical, so it shouldn't be
> a big deal to put mb() in there.

Yup, go for it.

Cheers,
Ben.

> -Scott
> 
>

Patch

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 7d6eacf..0459077 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -143,6 +143,13 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 	 * cases, and 32-bit non-hash with 32-bit PTEs.
 	 */
 	*ptep = pte;
+#ifdef CONFIG_PPC_BOOK3E_64
+	/*
+	 * With hardware tablewalk, a sync is needed to ensure that
+	 * subsequent accesses see the PTE we just wrote.
+	 */
+	mb();
+#endif
 #endif
 }