Message ID | 1379130622-17436-1-git-send-email-scottwood@freescale.com
---|---
State | Superseded
On Fri, 2013-09-13 at 22:50 -0500, Scott Wood wrote:
> The ISA says that a sync is needed to order a PTE write with a
> subsequent hardware tablewalk lookup. On e6500, without this sync
> we've been observed to die with a DSI due to a PTE write not being seen
> by a subsequent access, even when everything happens on the same
> CPU.

This is gross, I didn't realize we had that bogosity in the
architecture...

Did you measure the performance impact ?

Cheers,
Ben.

> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
> v2: new patch
>
> Note that we saw this problem when running in emulation (died early in
> boot), but when I tried just now with the mb() removed on actual
> hardware, I didn't see any problem. But since I'm not aware of any
> change having been made in the hardware relative to this, and since it
> is architecturally required, I'd be more comfortable leaving it in
> unless something has changed in the kernel recently such that a sync
> is happening somewhere else along the code path.
> ---
>  arch/powerpc/include/asm/pgtable.h | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
> index 7d6eacf..0459077 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -143,6 +143,13 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
>  	 * cases, and 32-bit non-hash with 32-bit PTEs.
>  	 */
>  	*ptep = pte;
> +#ifdef CONFIG_PPC_BOOK3E_64
> +	/*
> +	 * With hardware tablewalk, a sync is needed to ensure that
> +	 * subsequent accesses see the PTE we just wrote.
> +	 */
> +	mb();
> +#endif
>  #endif
>  }
On Mon, 2013-09-16 at 07:38 +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2013-09-13 at 22:50 -0500, Scott Wood wrote:
> > The ISA says that a sync is needed to order a PTE write with a
> > subsequent hardware tablewalk lookup. On e6500, without this sync
> > we've been observed to die with a DSI due to a PTE write not being seen
> > by a subsequent access, even when everything happens on the same
> > CPU.
>
> This is gross, I didn't realize we had that bogosity in the
> architecture...
>
> Did you measure the performance impact ?

I didn't see a noticeable impact on the tests I ran, but those were
aimed at measuring TLB miss overhead. I'll need to try it with a
benchmark that's more oriented around lots of page table updates.

-Scott
On Mon, 2013-09-16 at 19:06 -0500, Scott Wood wrote:
> On Mon, 2013-09-16 at 07:38 +1000, Benjamin Herrenschmidt wrote:
> > On Fri, 2013-09-13 at 22:50 -0500, Scott Wood wrote:
> > > The ISA says that a sync is needed to order a PTE write with a
> > > subsequent hardware tablewalk lookup. On e6500, without this sync
> > > we've been observed to die with a DSI due to a PTE write not being seen
> > > by a subsequent access, even when everything happens on the same
> > > CPU.
> >
> > This is gross, I didn't realize we had that bogosity in the
> > architecture...
> >
> > Did you measure the performance impact ?
>
> I didn't see a noticeable impact on the tests I ran, but those were
> aimed at measuring TLB miss overhead. I'll need to try it with a
> benchmark that's more oriented around lots of page table updates.

Lmbench's fork test runs about 2% slower with the sync. I've been told
that nothing relevant has changed since we saw the failure during
emulation; it's probably luck and/or timing, or maybe a sync got added
somewhere else since then? I think it's only really a problem for
kernel page tables, since user page tables will retry if do_page_fault()
sees a valid PTE. So maybe we should put an mb() in map_kernel_page()
instead.

-Scott
On Thu, 2013-10-10 at 17:31 -0500, Scott Wood wrote:
> On Mon, 2013-09-16 at 19:06 -0500, Scott Wood wrote:
> > On Mon, 2013-09-16 at 07:38 +1000, Benjamin Herrenschmidt wrote:
> > > On Fri, 2013-09-13 at 22:50 -0500, Scott Wood wrote:
> > > > The ISA says that a sync is needed to order a PTE write with a
> > > > subsequent hardware tablewalk lookup. On e6500, without this sync
> > > > we've been observed to die with a DSI due to a PTE write not being seen
> > > > by a subsequent access, even when everything happens on the same
> > > > CPU.
> > >
> > > This is gross, I didn't realize we had that bogosity in the
> > > architecture...
> > >
> > > Did you measure the performance impact ?
> >
> > I didn't see a noticeable impact on the tests I ran, but those were
> > aimed at measuring TLB miss overhead. I'll need to try it with a
> > benchmark that's more oriented around lots of page table updates.
>
> Lmbench's fork test runs about 2% slower with the sync. I've been told
> that nothing relevant has changed since we saw the failure during
> emulation; it's probably luck and/or timing, or maybe a sync got added
> somewhere else since then? I think it's only really a problem for
> kernel page tables, since user page tables will retry if do_page_fault()
> sees a valid PTE. So maybe we should put an mb() in map_kernel_page()
> instead.

Looking at some of the code in mm/, I suspect that the normal callers of
set_pte_at() already have an unlock (and thus a sync) already, so we may
not even be relying on those retries. Certainly some of them do; it
would take some effort to verify all of them.

Also, without such a sync in map_kernel_page(), even with software
tablewalk, couldn't we theoretically have a situation where a store to
pointer X that exposes a new mapping gets reordered before the PTE store
as seen by another CPU? The other CPU could see non-NULL X and
dereference it, but get the stale PTE. Callers of ioremap() generally
don't do a barrier of their own prior to exposing the result.

-Scott
On Thu, 2013-10-10 at 18:25 -0500, Scott Wood wrote:
> Looking at some of the code in mm/, I suspect that the normal callers of
> set_pte_at() already have an unlock (and thus a sync)

Unlock is lwsync actually...

> already, so we may
> not even be relying on those retries. Certainly some of them do; it
> would take some effort to verify all of them.
>
> Also, without such a sync in map_kernel_page(), even with software
> tablewalk, couldn't we theoretically have a situation where a store to
> pointer X that exposes a new mapping gets reordered before the PTE store
> as seen by another CPU? The other CPU could see non-NULL X and
> dereference it, but get the stale PTE. Callers of ioremap() generally
> don't do a barrier of their own prior to exposing the result.

Hrm, the transition to the new PTE either restricts the access
permissions, in which case it flushes the TLB (and synchronizes with
other CPUs), or extends access (adds dirty, sets the PTE from 0 ->
populated, ...), in which case the worst case is that we see the old one
and take a spurious fault.

So the problem would only be with kernel mappings, and in that case I
think we are fine. A driver doing an ioremap shouldn't then start using
that mapping on another CPU before having *informed* that other CPU of
the existence of the mapping, and that should be ordered.

Ben.
On Fri, 2013-10-11 at 10:51 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2013-10-10 at 18:25 -0500, Scott Wood wrote:
> > Looking at some of the code in mm/, I suspect that the normal callers of
> > set_pte_at() already have an unlock (and thus a sync)
>
> Unlock is lwsync actually...

Oops, I was seeing the conditional sync from SYNC_IO in the disassembly.
BTW, it's a bug that we don't do SYNC_IO on e500mc -- the assumption
that lwsync is 64-bit-only is no longer true.

> > already, so we may
> > not even be relying on those retries. Certainly some of them do; it
> > would take some effort to verify all of them.
> >
> > Also, without such a sync in map_kernel_page(), even with software
> > tablewalk, couldn't we theoretically have a situation where a store to
> > pointer X that exposes a new mapping gets reordered before the PTE store
> > as seen by another CPU? The other CPU could see non-NULL X and
> > dereference it, but get the stale PTE. Callers of ioremap() generally
> > don't do a barrier of their own prior to exposing the result.
>
> Hrm, the transition to the new PTE either restricts the access
> permissions, in which case it flushes the TLB (and synchronizes with
> other CPUs), or extends access (adds dirty, sets the PTE from 0 ->
> populated, ...), in which case the worst case is that we see the old one
> and take a spurious fault.

Yes, and the lwsync is good enough for software reading the PTE. So it
becomes a question of how much spurious faults with hardware tablewalk
hurt performance, and at least for the lmbench fork test, the sync is
worse (or maybe lwsync happens to be good enough for hw tablewalk on
e6500?).

> So the problem would only be with kernel mappings, and in that case I
> think we are fine. A driver doing an ioremap shouldn't then start using
> that mapping on another CPU before having *informed* that other CPU of
> the existence of the mapping, and that should be ordered.

But are callers of ioremap() expected to use a barrier before exposing
the pointer (and what type)? I don't think that's common practice.

map_kernel_page() should not be performance critical, so it shouldn't be
a big deal to put mb() in there.

-Scott
On Fri, 2013-10-11 at 17:07 -0500, Scott Wood wrote:
> On Fri, 2013-10-11 at 10:51 +1100, Benjamin Herrenschmidt wrote:
> > On Thu, 2013-10-10 at 18:25 -0500, Scott Wood wrote:
> > > Looking at some of the code in mm/, I suspect that the normal callers of
> > > set_pte_at() already have an unlock (and thus a sync)
> >
> > Unlock is lwsync actually...
>
> Oops, I was seeing the conditional sync from SYNC_IO in the disassembly.
> BTW, it's a bug that we don't do SYNC_IO on e500mc -- the assumption
> that lwsync is 64-bit-only is no longer true.

Patch welcome :)

> > > already, so we may
> > > not even be relying on those retries. Certainly some of them do; it
> > > would take some effort to verify all of them.
> > >
> > > Also, without such a sync in map_kernel_page(), even with software
> > > tablewalk, couldn't we theoretically have a situation where a store to
> > > pointer X that exposes a new mapping gets reordered before the PTE store
> > > as seen by another CPU? The other CPU could see non-NULL X and
> > > dereference it, but get the stale PTE. Callers of ioremap() generally
> > > don't do a barrier of their own prior to exposing the result.
> >
> > Hrm, the transition to the new PTE either restricts the access
> > permissions, in which case it flushes the TLB (and synchronizes with
> > other CPUs), or extends access (adds dirty, sets the PTE from 0 ->
> > populated, ...), in which case the worst case is that we see the old one
> > and take a spurious fault.
>
> Yes, and the lwsync is good enough for software reading the PTE. So it
> becomes a question of how much spurious faults with hardware tablewalk
> hurt performance, and at least for the lmbench fork test, the sync is
> worse (or maybe lwsync happens to be good enough for hw tablewalk on
> e6500?).
>
> > So the problem would only be with kernel mappings, and in that case I
> > think we are fine. A driver doing an ioremap shouldn't then start using
> > that mapping on another CPU before having *informed* that other CPU of
> > the existence of the mapping, and that should be ordered.
>
> But are callers of ioremap() expected to use a barrier before exposing
> the pointer (and what type)? I don't think that's common practice.
>
> map_kernel_page() should not be performance critical, so it shouldn't be
> a big deal to put mb() in there.

Yup, go for it.

Cheers,
Ben.

> -Scott
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 7d6eacf..0459077 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -143,6 +143,13 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 	 * cases, and 32-bit non-hash with 32-bit PTEs.
 	 */
 	*ptep = pte;
+#ifdef CONFIG_PPC_BOOK3E_64
+	/*
+	 * With hardware tablewalk, a sync is needed to ensure that
+	 * subsequent accesses see the PTE we just wrote.
+	 */
+	mb();
+#endif
 #endif
 }
The ISA says that a sync is needed to order a PTE write with a
subsequent hardware tablewalk lookup.  On e6500, without this sync
we've been observed to die with a DSI due to a PTE write not being seen
by a subsequent access, even when everything happens on the same
CPU.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
v2: new patch

Note that we saw this problem when running in emulation (died early in
boot), but when I tried just now with the mb() removed on actual
hardware, I didn't see any problem.  But since I'm not aware of any
change having been made in the hardware relative to this, and since it
is architecturally required, I'd be more comfortable leaving it in
unless something has changed in the kernel recently such that a sync
is happening somewhere else along the code path.
---
 arch/powerpc/include/asm/pgtable.h | 7 +++++++
 1 file changed, 7 insertions(+)