
[v5,22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()

Message ID: 20240202080756.1453939-23-ryan.roberts@arm.com (mailing list archive)
State: Handled Elsewhere
Series: Transparent Contiguous PTEs for User Mappings

Commit Message

Ryan Roberts Feb. 2, 2024, 8:07 a.m. UTC
Some architectures (e.g. arm64) can tell from looking at a pte whether
some follow-on ptes also map contiguous physical memory with the same
pgprot (for arm64, these are contpte mappings).

Take advantage of this knowledge to optimize folio_pte_batch() so that
it can skip these ptes when scanning to create a batch. By default, if
an arch does not opt in, pte_batch_hint() returns a compile-time 1, so
the changes to folio_pte_batch() are optimized out and the behaviour is
as before.

arm64 will opt in to providing this hint in the next patch, which will
greatly reduce the cost of ptep_get() when scanning a range of contptes.

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h | 18 ++++++++++++++++++
 mm/memory.c             | 20 +++++++++++++-------
 2 files changed, 31 insertions(+), 7 deletions(-)
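
To make the opt-in concrete: for a contpte-style mapping, an architecture
could report the number of entries remaining in the current contiguous
block. The sketch below is illustrative only, not the arm64 code added in
the next patch; pte_cont() and CONT_PTES stand in for the arch-specific
details:

static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
{
	/* No contiguity encoded in this pte: batch one entry at a time. */
	if (!pte_cont(pte))
		return 1;

	/*
	 * The pte sits inside a naturally-aligned contiguous block; report
	 * how many entries remain up to the end of that block so that
	 * folio_pte_batch() can step over them without reading each one.
	 */
	return CONT_PTES - (((unsigned long)ptep / sizeof(pte_t)) &
			    (CONT_PTES - 1));
}

Note that folio_pte_batch() only reads and checks the pte it lands on after
each hinted step; the intermediate ptes of a hinted batch are never
inspected, which is why the dirty/accessed expectations discussed in the
review below matter.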

Comments

David Hildenbrand Feb. 12, 2024, 1:43 p.m. UTC | #1
On 02.02.24 09:07, Ryan Roberts wrote:
> Some architectures (e.g. arm64) can tell from looking at a pte, if some
> follow-on ptes also map contiguous physical memory with the same pgprot.
> (for arm64, these are contpte mappings).
> 
> Take advantage of this knowledge to optimize folio_pte_batch() so that
> it can skip these ptes when scanning to create a batch. By default, if
> an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so
> the changes are optimized out and the behaviour is as before.
> 
> arm64 will opt-in to providing this hint in the next patch, which will
> greatly reduce the cost of ptep_get() when scanning a range of contptes.
> 
> Tested-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   include/linux/pgtable.h | 18 ++++++++++++++++++
>   mm/memory.c             | 20 +++++++++++++-------
>   2 files changed, 31 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 50f32cccbd92..cba31f177d27 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -212,6 +212,24 @@ static inline int pmd_dirty(pmd_t pmd)
>   #define arch_flush_lazy_mmu_mode()	do {} while (0)
>   #endif
>   
> +#ifndef pte_batch_hint
> +/**
> + * pte_batch_hint - Number of pages that can be added to batch without scanning.
> + * @ptep: Page table pointer for the entry.
> + * @pte: Page table entry.
> + *
> + * Some architectures know that a set of contiguous ptes all map the same
> + * contiguous memory with the same permissions. In this case, it can provide a
> + * hint to aid pte batching without the core code needing to scan every pte.

I think we might want to document here the expectation regarding
dirty/accessed bits. folio_pte_batch() will ignore dirty bits only with
FPB_IGNORE_DIRTY. But especially for arm64, it makes sense to ignore them
always when batching, because the dirty bit may target any pte part of the
cont-pte group either way.

Maybe something like:

"
An architecture implementation may only ignore the PTE accessed and dirty bits.
Further, it may only ignore the dirty bit if that bit is already not
maintained with precision per PTE inside the hinted batch, and ptep_get()
would already have to collect it from various PTEs.
"

I think there are some more details to it, but I'm hoping something along
the lines above is sufficient.


> +
>   #ifndef pte_advance_pfn
>   static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>   {
> diff --git a/mm/memory.c b/mm/memory.c
> index 65fbe4f886c1..902665b27702 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -988,16 +988,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>   {
>   	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>   	const pte_t *end_ptep = start_ptep + max_nr;
> -	pte_t expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, 1), flags);
> -	pte_t *ptep = start_ptep + 1;
> +	pte_t expected_pte = __pte_batch_clear_ignored(pte, flags);
> +	pte_t *ptep = start_ptep;
>   	bool writable;
> +	int nr;
>   
>   	if (any_writable)
>   		*any_writable = false;
>   
>   	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>   
> -	while (ptep != end_ptep) {
> +	nr = pte_batch_hint(ptep, pte);
> +	expected_pte = pte_advance_pfn(expected_pte, nr);
> +	ptep += nr;
> +

*Maybe* it's easier to get when initializing expected_pte+ptep only once.

Like:

[...]
pte_t expected_pte, *ptep;
[...]

nr = pte_batch_hint(start_ptep, pte);
expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
ptep = start_ptep + nr;

> +	while (ptep < end_ptep) {
>   		pte = ptep_get(ptep);
>   		if (any_writable)
>   			writable = !!pte_write(pte);
> @@ -1011,17 +1016,18 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>   		 * corner cases the next PFN might fall into a different
>   		 * folio.
>   		 */
> -		if (pte_pfn(pte) == folio_end_pfn)
> +		if (pte_pfn(pte) >= folio_end_pfn)
>   			break;
>   
>   		if (any_writable)
>   			*any_writable |= writable;
>   
> -		expected_pte = pte_advance_pfn(expected_pte, 1);
> -		ptep++;
> +		nr = pte_batch_hint(ptep, pte);
> +		expected_pte = pte_advance_pfn(expected_pte, nr);
> +		ptep += nr;
>   	}
>   
> -	return ptep - start_ptep;
> +	return min(ptep - start_ptep, max_nr);
>   }

Acked-by: David Hildenbrand <david@redhat.com>
Ryan Roberts Feb. 12, 2024, 3 p.m. UTC | #2
On 12/02/2024 13:43, David Hildenbrand wrote:
> On 02.02.24 09:07, Ryan Roberts wrote:
>> Some architectures (e.g. arm64) can tell from looking at a pte, if some
>> follow-on ptes also map contiguous physical memory with the same pgprot.
>> (for arm64, these are contpte mappings).
>>
>> Take advantage of this knowledge to optimize folio_pte_batch() so that
>> it can skip these ptes when scanning to create a batch. By default, if
>> an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so
>> the changes are optimized out and the behaviour is as before.
>>
>> arm64 will opt-in to providing this hint in the next patch, which will
>> greatly reduce the cost of ptep_get() when scanning a range of contptes.
>>
>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   include/linux/pgtable.h | 18 ++++++++++++++++++
>>   mm/memory.c             | 20 +++++++++++++-------
>>   2 files changed, 31 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 50f32cccbd92..cba31f177d27 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -212,6 +212,24 @@ static inline int pmd_dirty(pmd_t pmd)
>>   #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>   #endif
>>   +#ifndef pte_batch_hint
>> +/**
>> + * pte_batch_hint - Number of pages that can be added to batch without scanning.
>> + * @ptep: Page table pointer for the entry.
>> + * @pte: Page table entry.
>> + *
>> + * Some architectures know that a set of contiguous ptes all map the same
>> + * contiguous memory with the same permissions. In this case, it can provide a
>> + * hint to aid pte batching without the core code needing to scan every pte.
> 
> I think we might want to document here the expectation regarding
> dirty/accessed bits. folio_pte_batch() will ignore dirty bits only with
> FPB_IGNORE_DIRTY. But especially for arm64, it makes sense to ignore them
> always when batching, because the dirty bit may target any pte part of the
> cont-pte group either way.
> 
> Maybe something like:
> 
> "
> An architecture implementation may only ignore the PTE accessed and dirty bits.
> Further, it may only ignore the dirty bit if that bit is already not
> maintained with precision per PTE inside the hinted batch, and ptep_get()
> would already have to collect it from various PTEs.
> "

Yep, sounds good. I'll add it in next version.

> 
> I think there are some more details to it, but I'm hoping something along
> the lines above is sufficient.
> 
> 
>> +
>>   #ifndef pte_advance_pfn
>>   static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>>   {
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 65fbe4f886c1..902665b27702 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -988,16 +988,21 @@ static inline int folio_pte_batch(struct folio *folio,
>> unsigned long addr,
>>   {
>>       unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>       const pte_t *end_ptep = start_ptep + max_nr;
>> -    pte_t expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, 1),
>> flags);
>> -    pte_t *ptep = start_ptep + 1;
>> +    pte_t expected_pte = __pte_batch_clear_ignored(pte, flags);
>> +    pte_t *ptep = start_ptep;
>>       bool writable;
>> +    int nr;
>>         if (any_writable)
>>           *any_writable = false;
>>         VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>   -    while (ptep != end_ptep) {
>> +    nr = pte_batch_hint(ptep, pte);
>> +    expected_pte = pte_advance_pfn(expected_pte, nr);
>> +    ptep += nr;
>> +
> 
> *Maybe* it's easier to get when initializing expected_pte+ptep only once.
> 
> Like:
> 
> [...]
> pte_t expected_pte, *ptep;
> [...]
> 
> nr = pte_batch_hint(start_ptep, pte);
> expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
> ptep = start_ptep + nr;

Yeah that works for me. Will change for next version.

> 
>> +    while (ptep < end_ptep) {
>>           pte = ptep_get(ptep);
>>           if (any_writable)
>>               writable = !!pte_write(pte);
>> @@ -1011,17 +1016,18 @@ static inline int folio_pte_batch(struct folio *folio,
>> unsigned long addr,
>>            * corner cases the next PFN might fall into a different
>>            * folio.
>>            */
>> -        if (pte_pfn(pte) == folio_end_pfn)
>> +        if (pte_pfn(pte) >= folio_end_pfn)
>>               break;
>>             if (any_writable)
>>               *any_writable |= writable;
>>   -        expected_pte = pte_advance_pfn(expected_pte, 1);
>> -        ptep++;
>> +        nr = pte_batch_hint(ptep, pte);
>> +        expected_pte = pte_advance_pfn(expected_pte, nr);
>> +        ptep += nr;
>>       }
>>   -    return ptep - start_ptep;
>> +    return min(ptep - start_ptep, max_nr);
>>   }
> 
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks!

>
Ryan Roberts Feb. 12, 2024, 3:47 p.m. UTC | #3
On 12/02/2024 13:43, David Hildenbrand wrote:
> On 02.02.24 09:07, Ryan Roberts wrote:
>> Some architectures (e.g. arm64) can tell from looking at a pte, if some
>> follow-on ptes also map contiguous physical memory with the same pgprot.
>> (for arm64, these are contpte mappings).
>>
>> Take advantage of this knowledge to optimize folio_pte_batch() so that
>> it can skip these ptes when scanning to create a batch. By default, if
>> an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so
>> the changes are optimized out and the behaviour is as before.
>>
>> arm64 will opt-in to providing this hint in the next patch, which will
>> greatly reduce the cost of ptep_get() when scanning a range of contptes.
>>
>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   include/linux/pgtable.h | 18 ++++++++++++++++++
>>   mm/memory.c             | 20 +++++++++++++-------
>>   2 files changed, 31 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 50f32cccbd92..cba31f177d27 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -212,6 +212,24 @@ static inline int pmd_dirty(pmd_t pmd)
>>   #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>   #endif
>>   +#ifndef pte_batch_hint
>> +/**
>> + * pte_batch_hint - Number of pages that can be added to batch without scanning.
>> + * @ptep: Page table pointer for the entry.
>> + * @pte: Page table entry.
>> + *
>> + * Some architectures know that a set of contiguous ptes all map the same
>> + * contiguous memory with the same permissions. In this case, it can provide a
>> + * hint to aid pte batching without the core code needing to scan every pte.
> 
> I think we might want to document here the expectation regarding
> dirty/accessed bits. folio_pte_batch() will ignore dirty bits only with
> FPB_IGNORE_DIRTY. But especially for arm64, it makes sense to ignore them
> always when batching, because the dirty bit may target any pte part of the
> cont-pte group either way.
> 
> Maybe something like:
> 
> "
> An architecture implementation may only ignore the PTE accessed and dirty bits.
> Further, it may only ignore the dirty bit if that bit is already not
> maintained with precision per PTE inside the hinted batch, and ptep_get()
> would already have to collect it from various PTEs.
> "

I'm proposing to simplify this to:

"
An architecture implementation may ignore the PTE accessed state. Further, the
dirty state must apply atomically to all the PTEs described by the hint.
"

Which I think more accurately describes the requirement. Shout if you disagree.

> 
> I think there are some more details to it, but I'm hoping something along
> the lines above is sufficient.
> 
> 
>> +
>>   #ifndef pte_advance_pfn
>>   static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>>   {
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 65fbe4f886c1..902665b27702 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -988,16 +988,21 @@ static inline int folio_pte_batch(struct folio *folio,
>> unsigned long addr,
>>   {
>>       unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>       const pte_t *end_ptep = start_ptep + max_nr;
>> -    pte_t expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, 1),
>> flags);
>> -    pte_t *ptep = start_ptep + 1;
>> +    pte_t expected_pte = __pte_batch_clear_ignored(pte, flags);
>> +    pte_t *ptep = start_ptep;
>>       bool writable;
>> +    int nr;
>>         if (any_writable)
>>           *any_writable = false;
>>         VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>   -    while (ptep != end_ptep) {
>> +    nr = pte_batch_hint(ptep, pte);
>> +    expected_pte = pte_advance_pfn(expected_pte, nr);
>> +    ptep += nr;
>> +
> 
> *Maybe* it's easier to get when initializing expected_pte+ptep only once.
> 
> Like:
> 
> [...]
> pte_t expected_pte, *ptep;
> [...]
> 
> nr = pte_batch_hint(start_ptep, pte);
> expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
> ptep = start_ptep + nr;
> 
>> +    while (ptep < end_ptep) {
>>           pte = ptep_get(ptep);
>>           if (any_writable)
>>               writable = !!pte_write(pte);
>> @@ -1011,17 +1016,18 @@ static inline int folio_pte_batch(struct folio *folio,
>> unsigned long addr,
>>            * corner cases the next PFN might fall into a different
>>            * folio.
>>            */
>> -        if (pte_pfn(pte) == folio_end_pfn)
>> +        if (pte_pfn(pte) >= folio_end_pfn)
>>               break;
>>             if (any_writable)
>>               *any_writable |= writable;
>>   -        expected_pte = pte_advance_pfn(expected_pte, 1);
>> -        ptep++;
>> +        nr = pte_batch_hint(ptep, pte);
>> +        expected_pte = pte_advance_pfn(expected_pte, nr);
>> +        ptep += nr;
>>       }
>>   -    return ptep - start_ptep;
>> +    return min(ptep - start_ptep, max_nr);
>>   }
> 
> Acked-by: David Hildenbrand <david@redhat.com>
>
David Hildenbrand Feb. 12, 2024, 4:27 p.m. UTC | #4
On 12.02.24 16:47, Ryan Roberts wrote:
> On 12/02/2024 13:43, David Hildenbrand wrote:
>> On 02.02.24 09:07, Ryan Roberts wrote:
>>> Some architectures (e.g. arm64) can tell from looking at a pte, if some
>>> follow-on ptes also map contiguous physical memory with the same pgprot.
>>> (for arm64, these are contpte mappings).
>>>
>>> Take advantage of this knowledge to optimize folio_pte_batch() so that
>>> it can skip these ptes when scanning to create a batch. By default, if
>>> an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so
>>> the changes are optimized out and the behaviour is as before.
>>>
>>> arm64 will opt-in to providing this hint in the next patch, which will
>>> greatly reduce the cost of ptep_get() when scanning a range of contptes.
>>>
>>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>    include/linux/pgtable.h | 18 ++++++++++++++++++
>>>    mm/memory.c             | 20 +++++++++++++-------
>>>    2 files changed, 31 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index 50f32cccbd92..cba31f177d27 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -212,6 +212,24 @@ static inline int pmd_dirty(pmd_t pmd)
>>>    #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>    #endif
>>>    +#ifndef pte_batch_hint
>>> +/**
>>> + * pte_batch_hint - Number of pages that can be added to batch without scanning.
>>> + * @ptep: Page table pointer for the entry.
>>> + * @pte: Page table entry.
>>> + *
>>> + * Some architectures know that a set of contiguous ptes all map the same
>>> + * contiguous memory with the same permissions. In this case, it can provide a
>>> + * hint to aid pte batching without the core code needing to scan every pte.
>>
>> I think we might want to document here the expectation regarding
>> dirty/accessed bits. folio_pte_batch() will ignore dirty bits only with
>> FPB_IGNORE_DIRTY. But especially for arm64, it makes sense to ignore them
>> always when batching, because the dirty bit may target any pte part of the
>> cont-pte group either way.
>>
>> Maybe something like:
>>
>> "
>> An architecture implementation may only ignore the PTE accessed and dirty bits.
>> Further, it may only ignore the dirty bit if that bit is already not
>> maintained with precision per PTE inside the hinted batch, and ptep_get()
>> would already have to collect it from various PTEs.
>> "
> 
> I'm proposing to simplify this to:
> 
> "
> An architecture implementation may ignore the PTE accessed state. Further, the
> dirty state must apply atomically to all the PTEs described by the hint.
> "
> 
> Which I think more accurately describes the requirement. Shout if you disagree.

I'm not 100% sure if the "must apply atomically" is clear without all of 
the cont-pte details and ptep_get(). But I fail to describe it in a 
better way.

It's all better compared to what we had before, so LGTM :)
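
In practice, the expectation being discussed means: if the hardware
maintains the dirty/accessed bits on only some entries of a contiguous
block, ptep_get() must already fold those bits back in, so that reading any
single pte of the block reflects the state of the whole block. A
contpte-style ptep_get() could gather them roughly as below; this is a
sketch using assumed helper names (__ptep_get(), contpte_align_down(),
CONT_PTES), not the implementation from this series:

static inline pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
{
	pte_t pte;
	int i;

	/* Walk the whole naturally-aligned contiguous block... */
	ptep = contpte_align_down(ptep);

	for (i = 0; i < CONT_PTES; i++, ptep++) {
		pte = __ptep_get(ptep);

		/* ...accumulating dirty/accessed into the returned value. */
		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);

		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}

	return orig_pte;
}

With that behaviour in place, skipping the intermediate ptes of a hinted
batch inside folio_pte_batch() cannot lose dirty or accessed information.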

Patch

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 50f32cccbd92..cba31f177d27 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,6 +212,24 @@  static inline int pmd_dirty(pmd_t pmd)
 #define arch_flush_lazy_mmu_mode()	do {} while (0)
 #endif
 
+#ifndef pte_batch_hint
+/**
+ * pte_batch_hint - Number of pages that can be added to batch without scanning.
+ * @ptep: Page table pointer for the entry.
+ * @pte: Page table entry.
+ *
+ * Some architectures know that a set of contiguous ptes all map the same
+ * contiguous memory with the same permissions. In this case, it can provide a
+ * hint to aid pte batching without the core code needing to scan every pte.
+ *
+ * May be overridden by the architecture, else pte_batch_hint is always 1.
+ */
+static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
+{
+	return 1;
+}
+#endif
+
 #ifndef pte_advance_pfn
 static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 65fbe4f886c1..902665b27702 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -988,16 +988,21 @@  static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 {
 	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
 	const pte_t *end_ptep = start_ptep + max_nr;
-	pte_t expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, 1), flags);
-	pte_t *ptep = start_ptep + 1;
+	pte_t expected_pte = __pte_batch_clear_ignored(pte, flags);
+	pte_t *ptep = start_ptep;
 	bool writable;
+	int nr;
 
 	if (any_writable)
 		*any_writable = false;
 
 	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
 
-	while (ptep != end_ptep) {
+	nr = pte_batch_hint(ptep, pte);
+	expected_pte = pte_advance_pfn(expected_pte, nr);
+	ptep += nr;
+
+	while (ptep < end_ptep) {
 		pte = ptep_get(ptep);
 		if (any_writable)
 			writable = !!pte_write(pte);
@@ -1011,17 +1016,18 @@  static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		 * corner cases the next PFN might fall into a different
 		 * folio.
 		 */
-		if (pte_pfn(pte) == folio_end_pfn)
+		if (pte_pfn(pte) >= folio_end_pfn)
 			break;
 
 		if (any_writable)
 			*any_writable |= writable;
 
-		expected_pte = pte_advance_pfn(expected_pte, 1);
-		ptep++;
+		nr = pte_batch_hint(ptep, pte);
+		expected_pte = pte_advance_pfn(expected_pte, nr);
+		ptep += nr;
 	}
 
-	return ptep - start_ptep;
+	return min(ptep - start_ptep, max_nr);
 }
 
 /*
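
The batching only pays off if callers advance by the returned count. Below
is a minimal sketch of the usual caller shape; process_pte_range() is a
made-up name and the folio_pte_batch() parameter order beyond (folio, addr,
...) is assumed for illustration rather than copied from mm/memory.c:

static void process_pte_range(struct vm_area_struct *vma, pte_t *start_ptep,
			      unsigned long addr, unsigned long end)
{
	pte_t *ptep = start_ptep;

	while (addr < end) {
		pte_t pte = ptep_get(ptep);
		int max_nr = (end - addr) >> PAGE_SHIFT;
		int nr = 1;

		if (pte_present(pte)) {
			struct folio *folio = vm_normal_folio(vma, addr, pte);

			/* One call covers all consecutive ptes of this folio. */
			if (folio && folio_test_large(folio))
				nr = folio_pte_batch(folio, addr, ptep, pte,
						     max_nr, 0, NULL);

			/* ... operate on 'nr' pages of 'folio' in one go ... */
		}

		ptep += nr;
		addr += nr * PAGE_SIZE;
	}
}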