[0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE

Message ID 20201113111901.743573013@infradead.org

Message

Peter Zijlstra Nov. 13, 2020, 11:19 a.m. UTC
Hi,

These patches provide generic infrastructure to determine TLB page size from
page table entries alone. Perf will use this (for either data or code address)
to aid in profiling TLB issues.

While most architectures only have page table aligned large pages, some
(notably ARM64, Sparc64 and Power) provide non page table aligned large pages
and need to provide their own implementation of these functions.

I've provided (completely untested) implementations for ARM64 and Sparc64, but
failed to penetrate the _many_ Power MMUs. I'm hoping Nick or Aneesh can help
me out there.

Comments

Christophe Leroy Nov. 13, 2020, 1:44 p.m. UTC | #1
Hi

On 13/11/2020 at 12:19, Peter Zijlstra wrote:
> Hi,
> 
> These patches provide generic infrastructure to determine TLB page size from
> page table entries alone. Perf will use this (for either data or code address)
> to aid in profiling TLB issues.
> 
> While most architectures only have page table aligned large pages, some
> (notably ARM64, Sparc64 and Power) provide non page table aligned large pages
> and need to provide their own implementation of these functions.
> 
> I've provided (completely untested) implementations for ARM64 and Sparc64, but
> failed to penetrate the _many_ Power MMUs. I'm hoping Nick or Aneesh can help
> me out there.
> 

I can help with powerpc 8xx. It is a 32-bit powerpc. The PGD has 1024 entries, which means each 
entry maps 4M.

Page sizes are 4k, 16k, 512k and 8M.

For the 8M pages we use hugepd with a single entry. The two related PGD entries point to the same 
hugepd.

For the other sizes, they are in standard page tables. 16k pages appear 4 times in the page table. 
512k entries appear 128 times in the page table.

When the PGD entry has the _PMD_PAGE_8M bits, the PMD entry points to a hugepd which holds the single 8M 
entry.

In the PTE, we have two bits: _PAGE_SPS and _PAGE_HUGE

_PAGE_HUGE means it is a 512k page
_PAGE_SPS means it is not a 4k page

The kernel can be built with either 4k or 16k as the standard page size. It doesn't change 
the page table layout though.

Hope this is clear. Now I don't really know how to wire that up to your series.

Christophe
Kirill A. Shutemov Nov. 16, 2020, 3:43 p.m. UTC | #2
On Fri, Nov 13, 2020 at 12:19:01PM +0100, Peter Zijlstra wrote:
> Hi,
> 
> These patches provide generic infrastructure to determine TLB page size from
> page table entries alone. Perf will use this (for either data or code address)
> to aid in profiling TLB issues.

I'm not sure it's an issue, but strictly speaking, the size of a page
according to the page table tree doesn't mean the pagewalk will fill a TLB
entry of that size. A CPU may support 1G pages in the page table tree
without having any 1G TLB entries at all.

IIRC, current Intel CPUs still don't have any 1G iTLB entries and fill 2M
iTLB entries instead.
Matthew Wilcox (Oracle) Nov. 16, 2020, 3:54 p.m. UTC | #3
On Mon, Nov 16, 2020 at 06:43:57PM +0300, Kirill A. Shutemov wrote:
> On Fri, Nov 13, 2020 at 12:19:01PM +0100, Peter Zijlstra wrote:
> > Hi,
> > 
> > These patches provide generic infrastructure to determine TLB page size from
> > page table entries alone. Perf will use this (for either data or code address)
> > to aid in profiling TLB issues.
> 
> I'm not sure it's an issue, but strictly speaking, the size of a page
> according to the page table tree doesn't mean the pagewalk will fill a TLB
> entry of that size. A CPU may support 1G pages in the page table tree
> without having any 1G TLB entries at all.
> 
> IIRC, current Intel CPUs still don't have any 1G iTLB entries and fill 2M
> iTLB entries instead.

It gets even more complicated with CPUs with multiple levels of TLB
which support different TLB entry sizes.  My CPU reports:

TLB info
 Instruction TLB: 2M/4M pages, fully associative, 8 entries
 Instruction TLB: 4K pages, 8-way associative, 64 entries
 Data TLB: 1GB pages, 4-way set associative, 4 entries
 Data TLB: 4KB pages, 4-way associative, 64 entries
 Shared L2 TLB: 4KB/2MB pages, 6-way associative, 1536 entries

I'm not quite sure what the rules are for evicting a 1GB entry in the
dTLB into the s2TLB.  I've read them for so many different processors,
I get quite confused.  Some CPUs fracture them; others ditch them entirely
and will look them up again if needed.

I think the architecture here is fine, but it'll need a little bit of
finagling to maybe pass i-vs-d to the pXd_leaf_size() routines, and x86
will need an implementation of pud_leaf_size() which interrogates the
TLB info to find out what size TLB entry will actually be used.
Dave Hansen Nov. 16, 2020, 4:28 p.m. UTC | #4
On 11/16/20 7:54 AM, Matthew Wilcox wrote:
> It gets even more complicated with CPUs with multiple levels of TLB
> which support different TLB entry sizes.  My CPU reports:
> 
> TLB info
>  Instruction TLB: 2M/4M pages, fully associative, 8 entries
>  Instruction TLB: 4K pages, 8-way associative, 64 entries
>  Data TLB: 1GB pages, 4-way set associative, 4 entries
>  Data TLB: 4KB pages, 4-way associative, 64 entries
>  Shared L2 TLB: 4KB/2MB pages, 6-way associative, 1536 entries

It's even "worse" on recent AMD systems.  Those will coalesce multiple
adjacent PTEs into a single TLB entry.  I think Alphas did something
like this back in the day with an opt-in.

Anyway, the changelog should probably replace:

> This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate TLB
> page sizes.

with something more like:

This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate page
table mapping sizes.

That's really the best we can do from software without digging into
microarchitecture-specific events.
Matthew Wilcox (Oracle) Nov. 16, 2020, 4:32 p.m. UTC | #5
On Mon, Nov 16, 2020 at 08:28:23AM -0800, Dave Hansen wrote:
> On 11/16/20 7:54 AM, Matthew Wilcox wrote:
> > It gets even more complicated with CPUs with multiple levels of TLB
> > which support different TLB entry sizes.  My CPU reports:
> > 
> > TLB info
> >  Instruction TLB: 2M/4M pages, fully associative, 8 entries
> >  Instruction TLB: 4K pages, 8-way associative, 64 entries
> >  Data TLB: 1GB pages, 4-way set associative, 4 entries
> >  Data TLB: 4KB pages, 4-way associative, 64 entries
> >  Shared L2 TLB: 4KB/2MB pages, 6-way associative, 1536 entries
> 
> It's even "worse" on recent AMD systems.  Those will coalesce multiple
> adjacent PTEs into a single TLB entry.  I think Alphas did something
> like this back in the day with an opt-in.

I debated mentioning that ;-)  We can detect in software whether that's
_possible_, but we can't detect whether it has actually *done* it.  I heard it
sometimes takes several faults on the 4kB entries for the CPU to decide
that it's beneficial to use a 32kB TLB entry.  But this is all rumour.

> Anyway, the changelog should probably replace:
> 
> > This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate TLB
> > page sizes.
> 
> with something more like:
> 
> This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate page
> table mapping sizes.
> 
> That's really the best we can do from software without digging into
> microarchitecture-specific events.

I mean this is perf.  Digging into microarch specific events is what it
does ;-)
Dave Hansen Nov. 16, 2020, 4:36 p.m. UTC | #6
On 11/16/20 8:32 AM, Matthew Wilcox wrote:
>>
>> That's really the best we can do from software without digging into
>> microarchitecture-specific events.
> I mean this is perf.  Digging into microarch specific events is what it
> does ;-)

Yeah, totally.

But, if we see a bunch of 4k TLB hit events, it's still handy to know
that those 4k TLB hits originated from a 2M page table entry.  This
series just makes sure that perf has the data about the page table
mapping sizes regardless of what the microarchitecture does with it.

I'm just saying we need to make the descriptions in this perf feature
specifically about the page tables, not the TLB.
Peter Zijlstra Nov. 16, 2020, 4:55 p.m. UTC | #7
On Mon, Nov 16, 2020 at 08:28:23AM -0800, Dave Hansen wrote:
> On 11/16/20 7:54 AM, Matthew Wilcox wrote:
> > It gets even more complicated with CPUs with multiple levels of TLB
> > which support different TLB entry sizes.  My CPU reports:
> > 
> > TLB info
> >  Instruction TLB: 2M/4M pages, fully associative, 8 entries
> >  Instruction TLB: 4K pages, 8-way associative, 64 entries
> >  Data TLB: 1GB pages, 4-way set associative, 4 entries
> >  Data TLB: 4KB pages, 4-way associative, 64 entries
> >  Shared L2 TLB: 4KB/2MB pages, 6-way associative, 1536 entries
> 
> It's even "worse" on recent AMD systems.  Those will coalesce multiple
> adjacent PTEs into a single TLB entry.  I think Alphas did something
> like this back in the day with an opt-in.
> 
> Anyway, the changelog should probably replace:

ARM64 does too.

> > This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate TLB
> > page sizes.
> 
> with something more like:
> 
> This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate page
> table mapping sizes.

Sure.
Peter Zijlstra Nov. 16, 2020, 4:57 p.m. UTC | #8
On Mon, Nov 16, 2020 at 08:36:36AM -0800, Dave Hansen wrote:
> On 11/16/20 8:32 AM, Matthew Wilcox wrote:
> >>
> >> That's really the best we can do from software without digging into
> >> microarchitecture-specific events.
> > I mean this is perf.  Digging into microarch specific events is what it
> > does ;-)
> 
> Yeah, totally.

Sure, but the automatic promotion/demotion of TLB sizes is not visible
if you don't know what you started out with.

> But, if we see a bunch of 4k TLB hit events, it's still handy to know
> that those 4k TLB hits originated from a 2M page table entry.  This
> series just makes sure that perf has the data about the page table
> mapping sizes regardless of what the microarchitecture does with it.

This.
Christophe Leroy Nov. 20, 2020, 11:18 a.m. UTC | #9
Hi Peter,

On 13/11/2020 at 14:44, Christophe Leroy wrote:
> Hi
> 
> On 13/11/2020 at 12:19, Peter Zijlstra wrote:
>> Hi,
>>
>> These patches provide generic infrastructure to determine TLB page size from
>> page table entries alone. Perf will use this (for either data or code address)
>> to aid in profiling TLB issues.
>>
>> While most architectures only have page table aligned large pages, some
>> (notably ARM64, Sparc64 and Power) provide non page table aligned large pages
>> and need to provide their own implementation of these functions.
>>
>> I've provided (completely untested) implementations for ARM64 and Sparc64, but
>> failed to penetrate the _many_ Power MMUs. I'm hoping Nick or Aneesh can help
>> me out there.
>>
> 
> I can help with powerpc 8xx. It is a 32-bit powerpc. The PGD has 1024 entries, which means each 
> entry maps 4M.
> 
> Page sizes are 4k, 16k, 512k and 8M.
> 
> For the 8M pages we use hugepd with a single entry. The two related PGD entries point to the same 
> hugepd.
> 
> For the other sizes, they are in standard page tables. 16k pages appear 4 times in the page table. 
> 512k entries appear 128 times in the page table.
> 
> When the PGD entry has the _PMD_PAGE_8M bits, the PMD entry points to a hugepd which holds the single 8M 
> entry.
> 
> In the PTE, we have two bits: _PAGE_SPS and _PAGE_HUGE
> 
> _PAGE_HUGE means it is a 512k page
> _PAGE_SPS means it is not a 4k page
> 
> The kernel can be built with either 4k or 16k as the standard page size. It doesn't change 
> the page table layout though.
> 
> Hope this is clear. Now I don't really know how to wire that up to your series.

Does my description make sense? Is there anything I can help with?

Christophe
Peter Zijlstra Nov. 20, 2020, 12:20 p.m. UTC | #10
On Fri, Nov 20, 2020 at 12:18:22PM +0100, Christophe Leroy wrote:
> Hi Peter,
> 
> On 13/11/2020 at 14:44, Christophe Leroy wrote:
> > Hi
> > 
> > On 13/11/2020 at 12:19, Peter Zijlstra wrote:
> > > Hi,
> > > 
> > > These patches provide generic infrastructure to determine TLB page size from
> > > page table entries alone. Perf will use this (for either data or code address)
> > > to aid in profiling TLB issues.
> > > 
> > > While most architectures only have page table aligned large pages, some
> > > (notably ARM64, Sparc64 and Power) provide non page table aligned large pages
> > > and need to provide their own implementation of these functions.
> > > 
> > > I've provided (completely untested) implementations for ARM64 and Sparc64, but
> > > failed to penetrate the _many_ Power MMUs. I'm hoping Nick or Aneesh can help
> > > me out there.
> > > 
> > 
> > I can help with powerpc 8xx. It is a 32-bit powerpc. The PGD has 1024
> > entries, which means each entry maps 4M.
> > 
> > Page sizes are 4k, 16k, 512k and 8M.
> > 
> > For the 8M pages we use hugepd with a single entry. The two related PGD
> > entries point to the same hugepd.
> > 
> > For the other sizes, they are in standard page tables. 16k pages appear
> > 4 times in the page table. 512k entries appear 128 times in the page
> > table.
> > 
> > When the PGD entry has the _PMD_PAGE_8M bits, the PMD entry points to a
> > hugepd which holds the single 8M entry.
> > 
> > In the PTE, we have two bits: _PAGE_SPS and _PAGE_HUGE
> > 
> > _PAGE_HUGE means it is a 512k page
> > _PAGE_SPS means it is not a 4k page
> > 
> > The kernel can be built with either 4k or 16k as the standard page
> > size. It doesn't change the page table layout though.
> > 
> > Hope this is clear. Now I don't really know how to wire that up to your series.
> 
> Does my description make sense? Is there anything I can help with?

It did, and I had vague memories from when we fixed that pgd_t issue.
I've just not had time to dig through the powerpc code yet to find the
right mmu header to stick it in.

I was meaning to get another version of these patches posted this week,
but time keeps slipping away; I'll try.
Peter Zijlstra Nov. 26, 2020, 10:46 a.m. UTC | #11
On Fri, Nov 20, 2020 at 01:20:04PM +0100, Peter Zijlstra wrote:

> > > I can help with powerpc 8xx. It is a 32-bit powerpc. The PGD has 1024
> > > entries, which means each entry maps 4M.
> > > 
> > > Page sizes are 4k, 16k, 512k and 8M.
> > > 
> > > For the 8M pages we use hugepd with a single entry. The two related PGD
> > > entries point to the same hugepd.
> > > 
> > > For the other sizes, they are in standard page tables. 16k pages appear
> > > 4 times in the page table. 512k entries appear 128 times in the page
> > > table.
> > > 
> > > When the PGD entry has the _PMD_PAGE_8M bits, the PMD entry points to a
> > > hugepd which holds the single 8M entry.
> > > 
> > > In the PTE, we have two bits: _PAGE_SPS and _PAGE_HUGE
> > > 
> > > _PAGE_HUGE means it is a 512k page
> > > _PAGE_SPS means it is not a 4k page
> > > 
> > > The kernel can be built with either 4k or 16k as the standard page
> > > size. It doesn't change the page table layout though.
> > > 
> > > Hope this is clear. Now I don't really know how to wire that up to your series.

Does the below accurately reflect things?

Let me go find a suitable cross-compiler ..

diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index 1581204467e1..fcc48d590d88 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -135,6 +135,29 @@ static inline pte_t pte_mkhuge(pte_t pte)
 }
 
 #define pte_mkhuge pte_mkhuge
+
+static inline unsigned long pgd_leaf_size(pgd_t pgd)
+{
+	if (pgd_val(pgd) & _PMD_PAGE_8M)
+		return SZ_8M;
+	return SZ_4M;
+}
+
+#define pgd_leaf_size pgd_leaf_size
+
+static inline unsigned long pte_leaf_size(pte_t pte)
+{
+	pte_basic_t val = pte_val(pte);
+
+	if (val & _PAGE_HUGE)
+		return SZ_512K;
+	if (val & _PAGE_SPS)
+		return SZ_16K;
+	return SZ_4K;
+}
+
+#define pte_leaf_size pte_leaf_size
+
 #endif
 
 #endif /* __KERNEL__ */