
[RFC,0/7] powerpc/64s/radix TLB flush performance improvements

Message ID 20171031064504.25245-1-npiggin@gmail.com (mailing list archive)


Nicholas Piggin Oct. 31, 2017, 6:44 a.m. UTC
Here's a random mix of performance improvements for radix TLB flushing
code. The main aims are to reduce the amount of translation that gets
invalidated, and to reduce global flushes where we can do local.

To that end, a parallel kernel compile benchmark using powerpc:tlbie
tracepoint shows a reduction in tlbie instructions from about 290,000
to 80,000, and a reduction in tlbiel instructions from 49,500,000 to
15,000,000. Looks great, but unfortunately does not translate to a
statistically significant performance improvement! The needle on TLB
misses does not move much, I suspect because a lot of the flushing is
done at startup and shutdown, and because a significant cost of TLB
flushing itself is in the barriers.

I have some microbenchmarks in the individual patches, and should
start looking around for some more interesting workloads. I think
most of this series is pretty obviously the right thing to do though.

This goes on top of the 3 radix TLB fixes I sent out earlier.

Thanks,
Nick

Nicholas Piggin (7):
  powerpc/64s/radix: optimize TLB range flush barriers
  powerpc/64s/radix: Implement _tlbie(l)_va_range flush functions
  powerpc/64s/radix: Optimize flush_tlb_range
  powerpc/64s/radix: Introduce local single page ceiling for TLB range
    flush
  powerpc/64s/radix: Improve TLB flushing for page table freeing
  powerpc/64s/radix: reset mm_cpumask for single thread process when
    possible
  powerpc/64s/radix: Only flush local TLB for spurious fault flushes

 .../powerpc/include/asm/book3s/64/tlbflush-radix.h |   5 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h      |  11 +
 arch/powerpc/include/asm/mmu_context.h             |  19 ++
 arch/powerpc/mm/pgtable-book3s64.c                 |   5 +-
 arch/powerpc/mm/pgtable.c                          |   2 +-
 arch/powerpc/mm/tlb-radix.c                        | 363 ++++++++++++++++-----
 6 files changed, 325 insertions(+), 80 deletions(-)

Comments

Anshuman Khandual Nov. 1, 2017, 12:05 p.m. UTC | #1
On 10/31/2017 12:14 PM, Nicholas Piggin wrote:
> Here's a random mix of performance improvements for radix TLB flushing
> code. The main aims are to reduce the amount of translation that gets
> invalidated, and to reduce global flushes where we can do local.
> 
> To that end, a parallel kernel compile benchmark using powerpc:tlbie
> tracepoint shows a reduction in tlbie instructions from about 290,000
> to 80,000, and a reduction in tlbiel instructions from 49,500,000 to
> 15,000,000. Looks great, but unfortunately does not translate to a
> statistically significant performance improvement! The needle on TLB
> misses does not move much, I suspect because a lot of the flushing is
> done at startup and shutdown, and because a significant cost of TLB
> flushing itself is in the barriers.

Does a memory barrier initiate a single global invalidation with tlbie?
Nicholas Piggin Nov. 1, 2017, 1:39 p.m. UTC | #2
On Wed, 1 Nov 2017 17:35:51 +0530
Anshuman Khandual <khandual@linux.vnet.ibm.com> wrote:

> On 10/31/2017 12:14 PM, Nicholas Piggin wrote:
> > Here's a random mix of performance improvements for radix TLB flushing
> > code. The main aims are to reduce the amount of translation that gets
> > invalidated, and to reduce global flushes where we can do local.
> > 
> > To that end, a parallel kernel compile benchmark using powerpc:tlbie
> > tracepoint shows a reduction in tlbie instructions from about 290,000
> > to 80,000, and a reduction in tlbiel instructions from 49,500,000 to
> > 15,000,000. Looks great, but unfortunately does not translate to a
> > statistically significant performance improvement! The needle on TLB
> > misses does not move much, I suspect because a lot of the flushing is
> > done at startup and shutdown, and because a significant cost of TLB
> > flushing itself is in the barriers.  
> 
> Does a memory barrier initiate a single global invalidation with tlbie?
> 

I'm not quite sure what you're asking, and I don't know the details
of how the hardware handles it, but from the measurements in patch
1 of the series we can see there is a benefit for both tlbie and
tlbiel of batching them up between barriers.

Thanks,
Nick
Anshuman Khandual Nov. 2, 2017, 3:19 a.m. UTC | #3
On 11/01/2017 07:09 PM, Nicholas Piggin wrote:
> On Wed, 1 Nov 2017 17:35:51 +0530
> Anshuman Khandual <khandual@linux.vnet.ibm.com> wrote:
> 
>> On 10/31/2017 12:14 PM, Nicholas Piggin wrote:
>>> Here's a random mix of performance improvements for radix TLB flushing
>>> code. The main aims are to reduce the amount of translation that gets
>>> invalidated, and to reduce global flushes where we can do local.
>>>
>>> To that end, a parallel kernel compile benchmark using powerpc:tlbie
>>> tracepoint shows a reduction in tlbie instructions from about 290,000
>>> to 80,000, and a reduction in tlbiel instructions from 49,500,000 to
>>> 15,000,000. Looks great, but unfortunately does not translate to a
>>> statistically significant performance improvement! The needle on TLB
>>> misses does not move much, I suspect because a lot of the flushing is
>>> done at startup and shutdown, and because a significant cost of TLB
>>> flushing itself is in the barriers.  
>>
>> Does a memory barrier initiate a single global invalidation with tlbie?
>>
> 
> I'm not quite sure what you're asking, and I don't know the details
> of how the hardware handles it, but from the measurements in patch
> 1 of the series we can see there is a benefit for both tlbie and
> tlbiel of batching them up between barriers.

Ahh, I might have misread the statement "a significant cost of TLB
flushing itself is in the barriers". I guess you were referring to the
total cost of multiple TLB flushes, each bracketed by memory barriers,
which drives up the execution cost. This got reduced by packing
multiple tlbie(l) instructions inside a single pair of memory barriers.
Nicholas Piggin Nov. 2, 2017, 3:27 a.m. UTC | #4
On Thu, 2 Nov 2017 08:49:49 +0530
Anshuman Khandual <khandual@linux.vnet.ibm.com> wrote:

> On 11/01/2017 07:09 PM, Nicholas Piggin wrote:
> > On Wed, 1 Nov 2017 17:35:51 +0530
> > Anshuman Khandual <khandual@linux.vnet.ibm.com> wrote:
> >   
> >> On 10/31/2017 12:14 PM, Nicholas Piggin wrote:  
> >>> Here's a random mix of performance improvements for radix TLB flushing
> >>> code. The main aims are to reduce the amount of translation that gets
> >>> invalidated, and to reduce global flushes where we can do local.
> >>>
> >>> To that end, a parallel kernel compile benchmark using powerpc:tlbie
> >>> tracepoint shows a reduction in tlbie instructions from about 290,000
> >>> to 80,000, and a reduction in tlbiel instructions from 49,500,000 to
> >>> 15,000,000. Looks great, but unfortunately does not translate to a
> >>> statistically significant performance improvement! The needle on TLB
> >>> misses does not move much, I suspect because a lot of the flushing is
> >>> done at startup and shutdown, and because a significant cost of TLB
> >>> flushing itself is in the barriers.    
> >>
> >> Does a memory barrier initiate a single global invalidation with tlbie?
> >>  
> > 
> > I'm not quite sure what you're asking, and I don't know the details
> > of how the hardware handles it, but from the measurements in patch
> > 1 of the series we can see there is a benefit for both tlbie and
> > tlbiel of batching them up between barriers.  
> 
> Ahh, I might have misread the statement "a significant cost of TLB
> flushing itself is in the barriers". I guess you were referring to the
> total cost of multiple TLB flushes, each bracketed by memory barriers,
> which drives up the execution cost. This got reduced by packing
> multiple tlbie(l) instructions inside a single pair of memory barriers.
> 

Yes, that did get reduced for the va range flush in my patches. However,
the big reduction in the number of tlbiel instructions came from using
more range flushes and fewer PID flushes, and the PID flushes already
have that optimization. Therefore, despite the tlbiel count being
reduced, the number of barriers has probably not gone down a great deal
on this workload, which may explain why the performance numbers are
basically in the noise.

Thanks,
Nick