
[v2,07/12] s390: add pte_free_defer() for pgtables sharing page

Message ID a722dbec-bd9e-1213-1edd-53cd547aa4f@google.com
State New
Series mm: free retracted page table by RCU

Commit Message

Hugh Dickins June 20, 2023, 7:51 a.m. UTC
Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because s390 fits two 2K
page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves; with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule: that
a page is not linked to mm_context_t::pgtable_list while either half is
pending free, by either tlb_remove_table() or pte_free_defer(); but is
afterwards either relinked to the list (if other half is allocated), or
freed (if other half is free): by __tlb_remove_table() in both cases.

This rule ensures that page->lru is no longer in use while page->rcu_head
may be needed for use by pte_free_defer().  And a fortuitous byproduct of
following this rule is that page_table_free() no longer needs its curious
two-step manipulation of _refcount - read commit c2c224932fd0 ("s390/mm:
fix 2KB pgtable release race") for what to think of there.  But it does
not solve the problem that two halves may need rcu_head at the same time.

For that, add HHead bits between s390's AAllocated and PPending bits in
the upper byte of page->_refcount: then the second pte_free_defer() can
see that rcu_head is already in use, and the RCU callee pte_free_half()
can see that it needs to make a further call_rcu() for that other half.

page_table_alloc() sets the page->pt_mm field, so __tlb_remove_table()
knows where to link the freed half while its other half is allocated.
But linking to the list needs mm->context.lock: and although AA bit set
guarantees that pt_mm must still be valid, it does not guarantee that mm
is still valid an instant later: so acquiring mm->context.lock would not
be safe.  For now, use a static global mm_pgtable_list_lock instead:
then a soon-to-follow commit will split it per-mm as before (probably by
using a SLAB_TYPESAFE_BY_RCU structure for the list head and its lock);
and update the commentary on the pgtable_list.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/s390/include/asm/pgalloc.h |   4 +
 arch/s390/mm/pgalloc.c          | 205 +++++++++++++++++++++++---------
 include/linux/mm_types.h        |   2 +-
 3 files changed, 154 insertions(+), 57 deletions(-)

Comments

Gerald Schaefer June 28, 2023, 7:16 p.m. UTC | #1
On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
> 
> This version is more complicated than others: because s390 fits two 2K
> page tables into one 4K page (so page->rcu_head must be shared between
> both halves), and already uses page->lru (which page->rcu_head overlays)
> to list any free halves; with clever management by page->_refcount bits.
> 
> Build upon the existing management, adjusted to follow a new rule: that
> a page is not linked to mm_context_t::pgtable_list while either half is
> pending free, by either tlb_remove_table() or pte_free_defer(); but is
> afterwards either relinked to the list (if other half is allocated), or
> freed (if other half is free): by __tlb_remove_table() in both cases.
> 
> This rule ensures that page->lru is no longer in use while page->rcu_head
> may be needed for use by pte_free_defer().  And a fortuitous byproduct of
> following this rule is that page_table_free() no longer needs its curious
> two-step manipulation of _refcount - read commit c2c224932fd0 ("s390/mm:
> fix 2KB pgtable release race") for what to think of there.  But it does
> not solve the problem that two halves may need rcu_head at the same time.
> 
> For that, add HHead bits between s390's AAllocated and PPending bits in
> the upper byte of page->_refcount: then the second pte_free_defer() can
> see that rcu_head is already in use, and the RCU callee pte_free_half()
> can see that it needs to make a further call_rcu() for that other half.
> 
> page_table_alloc() sets the page->pt_mm field, so __tlb_remove_table()
> knows where to link the freed half while its other half is allocated.
> But linking to the list needs mm->context.lock: and although AA bit set
> guarantees that pt_mm must still be valid, it does not guarantee that mm
> is still valid an instant later: so acquiring mm->context.lock would not
> be safe.  For now, use a static global mm_pgtable_list_lock instead:
> then a soon-to-follow commit will split it per-mm as before (probably by
> using a SLAB_TYPESAFE_BY_RCU structure for the list head and its lock);
> and update the commentary on the pgtable_list.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  arch/s390/include/asm/pgalloc.h |   4 +
>  arch/s390/mm/pgalloc.c          | 205 +++++++++++++++++++++++---------
>  include/linux/mm_types.h        |   2 +-
>  3 files changed, 154 insertions(+), 57 deletions(-)

As discussed in the other thread, we would rather go with less complexity,
possibly switching to an approach w/o the list and fragment re-use in the
future. For now, as a first step in that direction, we can try with not
adding fragments back only for pte_free_defer(). Here is an adjusted
version of your patch, copying most of your pte_free_defer() logic and
also description, tested with LTP and all three of your patch series applied:

Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because s390 fits two 2K
page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves; with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule: that
a page is never added back to the list in pte_free_defer(). It is only
removed from the list, when currently listed because the other fragment
is not allocated. This introduces some asymmetry compared to the other
page table freeing paths, and in particular a list_del() for such pages
must be avoided there. Use page->pt_frag_refcount to keep track of the
list status, and check that before doing list_del() in any freeing path.

Other paths would also not add back such pages to the list, if the other
fragment happens to be freed in such a path at the same time, because
they would observe cleared AA bits.

This rule ensures that page->lru is no longer in use while page->rcu_head
may be needed for use by pte_free_defer(). But it does not solve the problem
that two halves may need rcu_head at the same time.

For that, add HHead bits between s390's AAllocated and PPending bits in
the upper byte of page->_refcount: then the second pte_free_defer() can
see that rcu_head is already in use, and the RCU callee pte_free_half()
can see that it needs to make a further call_rcu() for that other half.

Not adding back unallocated fragments to the list in pte_free_defer()
can result in wasting some amount of memory for pagetables, depending
on how long the allocated fragment will stay in use. In practice, this
effect is expected to be insignificant, and not justify a far more
complex approach, which might allow adding the fragments back later
in __tlb_remove_table(), where we might not have a stable mm any more.

Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/s390/include/asm/pgalloc.h |    4 +
 arch/s390/mm/pgalloc.c          |  136 +++++++++++++++++++++++++++++++++++++---
 2 files changed, 132 insertions(+), 8 deletions(-)

--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct m
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
 
+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -185,11 +185,13 @@ void page_table_free_pgste(struct page *
  * The upper byte (bits 24-31) of the parent page _refcount is used
  * for tracking contained 2KB-pgtables and has the following format:
  *
- *   PP  AA
+ *   PPHHAA
  * 01234567    upper byte (bits 24-31) of struct page::_refcount
- *   ||  ||
- *   ||  |+--- upper 2KB-pgtable is allocated
- *   ||  +---- lower 2KB-pgtable is allocated
+ *   ||||||
+ *   |||||+--- upper 2KB-pgtable is allocated
+ *   ||||+---- lower 2KB-pgtable is allocated
+ *   |||+----- upper 2KB-pgtable is pending free by page->rcu_head
+ *   ||+------ lower 2KB-pgtable is pending free by page->rcu_head
  *   |+------- upper 2KB-pgtable is pending for removal
  *   +-------- lower 2KB-pgtable is pending for removal
  *
@@ -229,6 +231,17 @@ void page_table_free_pgste(struct page *
  * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
  * while the PP bits are never used, nor such a page is added to or removed
  * from mm_context_t::pgtable_list.
+ *
+ * The HH bits are used to prevent double use of page->rcu_head in
+ * pte_free_defer(), when both 2K pagetables inside a page happen to get
+ * freed by that path at the same time.
+ *
+ * pte_free_defer() also cannot add 2K fragments back to the list, because
+ * page->rcu_head overlays with page->lru. This introduces some asymmetry
+ * compared to the other pagetable freeing paths, and the missing list_add()
+ * in pte_free_defer() could result in incorrect list_del(). Therefore, track
+ * the list status of a page with page->pt_frag_refcount, and check that
+ * before doing list_del() in any freeing path.
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
@@ -262,6 +275,7 @@ unsigned long *page_table_alloc(struct m
 				atomic_xor_bits(&page->_refcount,
 							0x01U << (bit + 24));
 				list_del(&page->lru);
+				atomic_set(&page->pt_frag_refcount, 0);
 			}
 		}
 		spin_unlock_bh(&mm->context.lock);
@@ -290,6 +304,7 @@ unsigned long *page_table_alloc(struct m
 		memset64((u64 *)table, _PAGE_INVALID, 2 * PTRS_PER_PTE);
 		spin_lock_bh(&mm->context.lock);
 		list_add(&page->lru, &mm->context.pgtable_list);
+		atomic_set(&page->pt_frag_refcount, 1);
 		spin_unlock_bh(&mm->context.lock);
 	}
 	return table;
@@ -325,13 +340,24 @@ void page_table_free(struct mm_struct *m
 		 */
 		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 		mask >>= 24;
-		if (mask & 0x03U)
+		if (mask & 0x03U) {
+			/*
+			 * Other half is allocated, add to list
+			 */
 			list_add(&page->lru, &mm->context.pgtable_list);
-		else
+			atomic_set(&page->pt_frag_refcount, 1);
+		} else if (atomic_read(&page->pt_frag_refcount)) {
+			/*
+			 * Other half is not allocated, and page is on the list,
+			 * remove from list
+			 */
 			list_del(&page->lru);
+			atomic_set(&page->pt_frag_refcount, 0);
+		}
 		spin_unlock_bh(&mm->context.lock);
 		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
 		mask >>= 24;
+		/* Return if other half is allocated, or delayed release pending */
 		if (mask != 0x00U)
 			return;
 		half = 0x01U << bit;
@@ -370,10 +396,22 @@ void page_table_free_rcu(struct mmu_gath
 	 */
 	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 	mask >>= 24;
-	if (mask & 0x03U)
+	if (mask & 0x03U) {
+		/*
+		 * Other half is allocated, add to end of list, as this
+		 * will not immediately be re-usable because it is marked
+		 * for delayed release
+		 */
 		list_add_tail(&page->lru, &mm->context.pgtable_list);
-	else
+		atomic_set(&page->pt_frag_refcount, 1);
+	} else if (atomic_read(&page->pt_frag_refcount)) {
+		/*
+		 * Other half is not allocated, and page is on the list,
+		 * remove from list
+		 */
 		list_del(&page->lru);
+		atomic_set(&page->pt_frag_refcount, 0);
+	}
 	spin_unlock_bh(&mm->context.lock);
 	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
 	tlb_remove_table(tlb, table);
@@ -407,6 +445,88 @@ void __tlb_remove_table(void *_table)
 	__free_page(page);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now0(struct rcu_head *head);
+static void pte_free_now1(struct rcu_head *head);
+
+static void pte_free_pgste(struct rcu_head *head)
+{
+	unsigned long *table;
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	table = (unsigned long *)page_to_virt(page);
+	table = (unsigned long *)((unsigned long)table | 0x03U);
+	__tlb_remove_table(table);
+}
+
+static void pte_free_half(struct rcu_head *head, unsigned int bit)
+{
+	unsigned long *table;
+	struct page *page;
+	unsigned int mask;
+
+	page = container_of(head, struct page, rcu_head);
+	mask = atomic_xor_bits(&page->_refcount, 0x04U << (bit + 24));
+
+	table = (unsigned long *)page_to_virt(page);
+	table += bit * PTRS_PER_PTE;
+	table = (unsigned long *)((unsigned long)table | (0x01U << bit));
+	__tlb_remove_table(table);
+
+	/* If pte_free_defer() of the other half came in, queue it now */
+	if (mask & 0x0CU)
+		call_rcu(&page->rcu_head, bit ? pte_free_now0 : pte_free_now1);
+}
+
+static void pte_free_now0(struct rcu_head *head)
+{
+	pte_free_half(head, 0);
+}
+
+static void pte_free_now1(struct rcu_head *head)
+{
+	pte_free_half(head, 1);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	unsigned int bit, mask;
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	if (mm_alloc_pgste(mm)) {
+		/*
+		 * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
+		 * page_table_free_rcu()?
+		 * If yes -> need addr parameter here, like in pte_free_tlb().
+		 */
+		call_rcu(&page->rcu_head, pte_free_pgste);
+		return;
+	}
+	bit = ((unsigned long)pgtable & ~PAGE_MASK) / (PTRS_PER_PTE * sizeof(pte_t));
+
+	spin_lock_bh(&mm->context.lock);
+	mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));
+	mask >>= 24;
+	if ((mask & 0x03U) == 0x00U && atomic_read(&page->pt_frag_refcount)) {
+		/*
+		 * Other half is not allocated, page is on the list,
+		 * remove from list
+		 */
+		list_del(&page->lru);
+		atomic_set(&page->pt_frag_refcount, 0);
+	}
+	/* Page must not be on the list, so rcu_head can be used */
+	BUG_ON(atomic_read(&page->pt_frag_refcount));
+	spin_unlock_bh(&mm->context.lock);
+
+	/* Do not relink on rcu_head if other half already linked on rcu_head */
+	if ((mask & 0x0CU) != 0x0CU)
+		call_rcu(&page->rcu_head, bit ? pte_free_now1 : pte_free_now0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 /*
  * Base infrastructure required to generate basic asces, region, segment,
  * and page tables that do not make use of enhanced features like EDAT1.
Hugh Dickins June 29, 2023, 5:08 a.m. UTC | #2
On Wed, 28 Jun 2023, Gerald Schaefer wrote:
> 
> As discussed in the other thread, we would rather go with less complexity,
> possibly switching to an approach w/o the list and fragment re-use in the
> future. For now, as a first step in that direction, we can try with not
> adding fragments back only for pte_free_defer(). Here is an adjusted
> version of your patch, copying most of your pte_free_defer() logic and
> also description, tested with LTP and all three of your patch series applied:

Thanks, Gerald: I don't mind abandoning my 13/12 SLAB_TYPESAFE_BY_RCU
patch (posted with fewer Cc's to the s390 list last week), and switching
to your simpler who-cares-if-we-sometimes-don't-make-maximal-use-of-page
patch.

But I didn't get deep enough into it today to confirm it - and disappointed
that you've found it necessary to play with pt_frag_refcount in addition to
_refcount and HH bits.  No real problem with that, but my instinct says it
should be simpler.

Tomorrow...
Hugh
Alexander Gordeev June 29, 2023, 1:59 p.m. UTC | #3
On Wed, Jun 28, 2023 at 09:16:24PM +0200, Gerald Schaefer wrote:
> On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:

Hi Gerald, Hugh!

...
> @@ -407,6 +445,88 @@ void __tlb_remove_table(void *_table)
>  	__free_page(page);
>  }
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static void pte_free_now0(struct rcu_head *head);
> +static void pte_free_now1(struct rcu_head *head);

What about pte_free_lower() / pte_free_upper()?

...
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> +	unsigned int bit, mask;
> +	struct page *page;
> +
> +	page = virt_to_page(pgtable);
> +	if (mm_alloc_pgste(mm)) {
> +		/*
> +		 * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
> +		 * page_table_free_rcu()?
> +		 * If yes -> need addr parameter here, like in pte_free_tlb().
> +		 */
> +		call_rcu(&page->rcu_head, pte_free_pgste);
> +		return;
> +	}
> +	bit = ((unsigned long)pgtable & ~PAGE_MASK) / (PTRS_PER_PTE * sizeof(pte_t));
> +
> +	spin_lock_bh(&mm->context.lock);
> +	mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));

This makes the bit logic increasingly complicated to me.

What if instead we set the rule "one bit at a time only"?
That means an atomic group bit flip is only allowed between
pairs of bits, namely:

bit flip	initiated from
-----------	----------------------------------------
P      <- A	page_table_free(), page_table_free_rcu()
     H <- A	pte_free_defer()
P <- H		pte_free_half()

In the current model the P bit could be set while the H bit is
also set. That actually brings nothing into the equation.

Besides, this check in page_table_alloc() (while still
correct) makes one (well, me) wonder "what about HH bits?":

	mask = (mask | (mask >> 4)) & 0x03U;
	if (mask != 0x03U) {
		...
	}

By contrast, with the "one bit at a time only" policy each of
the three bits effectively indicates which state a page half is
currently in.
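
To make that concrete, here is a rough sketch of those three transitions,
built on the existing atomic_xor_bits() helper and the A/H/P masks from the
patch above (0x01 / 0x04 / 0x10 per half, shifted by "bit"); the helper
names are only illustrative, not proposed additions:

/* P <- A : page_table_free(), page_table_free_rcu() */
static inline void pt_half_A_to_P(struct page *page, unsigned int bit)
{
	/* clear A, set P in one atomic flip */
	atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
}

/* H <- A : pte_free_defer() */
static inline void pt_half_A_to_H(struct page *page, unsigned int bit)
{
	/* clear A, set H; P stays clear until the RCU callback runs */
	atomic_xor_bits(&page->_refcount, 0x05U << (bit + 24));
}

/* P <- H : pte_free_half() */
static inline void pt_half_H_to_P(struct page *page, unsigned int bit)
{
	/* clear H, set P, before handing over to __tlb_remove_table() */
	atomic_xor_bits(&page->_refcount, 0x14U << (bit + 24));
}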

Thanks!
Jason Gunthorpe June 29, 2023, 3:22 p.m. UTC | #4
On Wed, Jun 28, 2023 at 10:08:08PM -0700, Hugh Dickins wrote:
> On Wed, 28 Jun 2023, Gerald Schaefer wrote:
> > 
> > As discussed in the other thread, we would rather go with less complexity,
> > possibly switching to an approach w/o the list and fragment re-use in the
> > future. For now, as a first step in that direction, we can try with not
> > adding fragments back only for pte_free_defer(). Here is an adjusted
> > version of your patch, copying most of your pte_free_defer() logic and
> > also description, tested with LTP and all three of your patch series applied:
> 
> Thanks, Gerald: I don't mind abandoning my 13/12 SLAB_TYPESAFE_BY_RCU
> patch (posted with fewer Cc's to the s390 list last week), and switching
> to your simpler who-cares-if-we-sometimes-don't-make-maximal-use-of-page
> patch.
> 
> But I didn't get deep enough into it today to confirm it - and disappointed
> that you've found it necessary to play with pt_frag_refcount in addition to
> _refcount and HH bits.  No real problem with that, but my instinct says it
> should be simpler.

Is there any reason it should be any different at all from what PPC is
doing?

I still think the right thing to do here is make the PPC code common
(with Hugh's proposed RCU modification) and just use it in both
arches....

Jason
Gerald Schaefer June 29, 2023, 3:43 p.m. UTC | #5
On Thu, 29 Jun 2023 15:59:07 +0200
Alexander Gordeev <agordeev@linux.ibm.com> wrote:

> On Wed, Jun 28, 2023 at 09:16:24PM +0200, Gerald Schaefer wrote:
> > On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
> > Hugh Dickins <hughd@google.com> wrote:  
> 
> Hi Gerald, Hugh!
> 
> ...
> > @@ -407,6 +445,88 @@ void __tlb_remove_table(void *_table)
> >  	__free_page(page);
> >  }
> >  
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +static void pte_free_now0(struct rcu_head *head);
> > +static void pte_free_now1(struct rcu_head *head);  
> 
> What about pte_free_lower() / pte_free_upper()?

I actually like the 0/1 better; I always get confused about what
exactly we mean by "lower / upper" in our code and comments. Is it
the first or second half? With 0/1 it is immediately clear to me.

> 
> ...
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > +	unsigned int bit, mask;
> > +	struct page *page;
> > +
> > +	page = virt_to_page(pgtable);
> > +	if (mm_alloc_pgste(mm)) {
> > +		/*
> > +		 * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
> > +		 * page_table_free_rcu()?
> > +		 * If yes -> need addr parameter here, like in pte_free_tlb().
> > +		 */
> > +		call_rcu(&page->rcu_head, pte_free_pgste);
> > +		return;
> > +	}
> > +	bit = ((unsigned long)pgtable & ~PAGE_MASK) / (PTRS_PER_PTE * sizeof(pte_t));
> > +
> > +	spin_lock_bh(&mm->context.lock);
> > +	mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));  
> 
> This  makes the bit logic increasingly complicated to me.

I think it is well in line with existing code in page_table_free[_rcu].
Only instead of doing xor with 0x11U, it does xor with 0x15U to also
switch on the H bit while at it.
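
For anyone decoding the constants, a short sketch of the per-half masks in
the upper byte of page->_refcount, with the values as used in the patch
above ("bit" is 0 for the first 2K half of the page, 1 for the second):

/*
 *   allocated (A bit):              0x01U << bit
 *   pending free via rcu_head (H):  0x04U << bit
 *   pending removal (P bit):        0x10U << bit
 *
 * So page_table_free[_rcu]() xors with 0x11U << (bit + 24) to flip P and A
 * together, while pte_free_defer() xors with 0x15U << (bit + 24) to flip
 * P, H and A in the same single atomic_xor_bits() call.
 */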

> 
> What if instead we set the rule "one bit at a time only"?
> That means an atomic group bit flip is only allowed between
> pairs of bits, namely:
> 
> bit flip	initiated from
> -----------	----------------------------------------
> P      <- A	page_table_free(), page_table_free_rcu()
>      H <- A	pte_free_defer()
> P <- H		pte_free_half()
> 
> In the current model P bit could be on together with H
> bit simultaneously. That actually brings in equation
> nothing.

P bit has to be set at the latest when __tlb_remove_table() gets
called, because there it is checked / cleared. It might be possible
to not set it in pte_free_defer() already, but only later in
pte_free_half() RCU callback, before calling __tlb_remove_table().
But that would not be in line any more with existing code, where it
is already set before scheduling the RCU callback.

Therefore, I would rather stick to the current approach, unless
you see some bug in it.

> 
> Besides, this check in page_table_alloc() (while still
> correct) makes one (well, me) wonder "what about HH bits?":
> 
> 	mask = (mask | (mask >> 4)) & 0x03U;
> 	if (mask != 0x03U) {
> 		...
> 	}

Without adding fragments back to the list, it is not necessary
to check any H bits in page_table_alloc(), or so I hope. Actually,
I like that aspect most, i.e. we have as little impact on current
code as possible.

And H bits are only relevant for preventing double use of rcu_head,
which is what they were designed for, and only the new code has
to care about them.
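
To spell out how the two H bits prevent double use of rcu_head, here is a
sketch of the intended sequence, matching the hunks above ("half 0" and
"half 1" being the two 2K fragments of the page):

/*
 * pte_free_defer(half 0): sets H0; H1 not yet set, so this call claims
 *                         page->rcu_head: call_rcu(..., pte_free_now0)
 * pte_free_defer(half 1): sets H1; sees H0 already set, so it must not
 *                         touch rcu_head again: no second call_rcu()
 * pte_free_now0() (RCU):  clears H0 and frees half 0; sees H1 still set,
 *                         so it re-queues rcu_head with pte_free_now1()
 * pte_free_now1() (RCU):  clears H1 and frees half 1
 */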
Gerald Schaefer June 29, 2023, 3:56 p.m. UTC | #6
On Thu, 29 Jun 2023 12:22:24 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Wed, Jun 28, 2023 at 10:08:08PM -0700, Hugh Dickins wrote:
> > On Wed, 28 Jun 2023, Gerald Schaefer wrote:  
> > > 
> > > As discussed in the other thread, we would rather go with less complexity,
> > > possibly switching to an approach w/o the list and fragment re-use in the
> > > future. For now, as a first step in that direction, we can try with not
> > > adding fragments back only for pte_free_defer(). Here is an adjusted
> > > version of your patch, copying most of your pte_free_defer() logic and
> > > also description, tested with LTP and all three of your patch series applied:  
> > 
> > Thanks, Gerald: I don't mind abandoning my 13/12 SLAB_TYPESAFE_BY_RCU
> > patch (posted with fewer Cc's to the s390 list last week), and switching
> > to your simpler who-cares-if-we-sometimes-don't-make-maximal-use-of-page
> > patch.
> > 
> > But I didn't get deep enough into it today to confirm it - and disappointed
> > that you've found it necessary to play with pt_frag_refcount in addition to
> > _refcount and HH bits.  No real problem with that, but my instinct says it
> > should be simpler.  

Yes, I also found it a bit awkward, but it seemed "good and simple enough",
to have something to go forward with, while my instinct was in line with yours.

> 
> Is there any reason it should be any different at all from what PPC is
> doing?
> 
> I still think the right thing to do here is make the PPC code common
> (with Hugh's proposed RCU modification) and just use it in both
> arches....

With the current approach, we would not add back fragments _only_ for
the new pte_free_defer() path, while keeping our cleverness for the other
paths. Not having a good overview of the negative impact wrt potential
memory waste, I would rather take small steps, if possible.

If we later switch to never adding back fragments, of course we should
try to be in line with PPC implementation.
Hugh Dickins June 30, 2023, 6 a.m. UTC | #7
On Thu, 29 Jun 2023, Gerald Schaefer wrote:
> On Thu, 29 Jun 2023 12:22:24 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Wed, Jun 28, 2023 at 10:08:08PM -0700, Hugh Dickins wrote:
> > > On Wed, 28 Jun 2023, Gerald Schaefer wrote:  
> > > > 
> > > > As discussed in the other thread, we would rather go with less complexity,
> > > > possibly switching to an approach w/o the list and fragment re-use in the
> > > > future. For now, as a first step in that direction, we can try with not
> > > > adding fragments back only for pte_free_defer(). Here is an adjusted
> > > > version of your patch, copying most of your pte_free_defer() logic and
> > > > also description, tested with LTP and all three of your patch series applied:  
> > > 
> > > Thanks, Gerald: I don't mind abandoning my 13/12 SLAB_TYPESAFE_BY_RCU
> > > patch (posted with fewer Cc's to the s390 list last week), and switching
> > > to your simpler who-cares-if-we-sometimes-don't-make-maximal-use-of-page
> > > patch.
> > > 
> > > But I didn't get deep enough into it today to confirm it - and disappointed
> > > that you've found it necessary to play with pt_frag_refcount in addition to
> > > _refcount and HH bits.  No real problem with that, but my instinct says it
> > > should be simpler.  
> 
> Yes, I also found it a bit awkward, but it seemed "good and simple enough",
> to have something to go forward with, while my instinct was in line with yours.
> 
> > 
> > Is there any reason it should be any different at all from what PPC is
> > doing?
> > 
> > I still think the right thing to do here is make the PPC code common
> > (with Hugh's proposed RCU modification) and just use it in both
> > arches....
> 
> With the current approach, we would not add back fragments _only_ for
> the new pte_free_defer() path, while keeping our cleverness for the other
> paths. Not having a good overview of the negative impact wrt potential
> memory waste, I would rather take small steps, if possible.
> 
> If we later switch to never adding back fragments, of course we should
> try to be in line with PPC implementation.

I find myself half-agreeing with everyone.

I agree with Gerald that s390 should keep close to what it is already
doing (except for adding pte_free_defer()): that changing its strategy
and implementation to be much more like powerpc, is a job for some other
occasion (and would depend on gathering data about how well each does).

But I agree with Jason that the powerpc solution we ended up with cut
out a lot of unnecessary complication: it shifts the RCU delay from
when pte_free_defer() is called, to when the shared page comes to be
freed; which may be a lot later, and might not be welcome in a common
path, but is quite okay for the uncommon pte_free_defer().
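
Concretely, that shift means pte_free_defer() itself can become almost
trivial, something like this sketch (which is essentially where the patch
further down this thread ends up):

void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
{
	struct page *page = virt_to_page(pgtable);

	/* Mark the page: its eventual freeing must go through call_rcu() */
	SetPageActive(page);
	page_table_free(mm, (unsigned long *)pgtable);
}

with the freeing paths then checking TestClearPageActive() to decide
whether an RCU grace period must be inserted when the whole page is
finally released.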

And I agree with Alexander that pte_free_lower() and pte_free_upper()
are better names than pte_free_now0() and pte_free_now1(): I was going
to make that change, except all those functions disappear if we follow
Jason's advice and switch the call_rcu() to when freeing the page.

(Lower and upper seem unambiguous to me: Gerald, does your confusion
come just from the way they are shown the wrong way round in the PP AA
diagram?  I corrected that in my patch, but you reverted it in yours.)

I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
extent that I've not even tried to verify it; but I think I do get the
point now, that we need further info than just PPHHAA to know whether
the page is on the list or not.  But I think that if we move where the
call_rcu() is done, then the page can stay on or off the list by same
rules as before (but need to check HH bits along with PP when deciding
whether to allocate, and whether to list_add_tail() when freeing).

So, starting from Gerald's but cutting it down, I was working on the
patch which follows those ideas.  But have run out of puff for tonight,
and would just waste all our time (again) if I sent anything out now.

Hugh
Claudio Imbrenda June 30, 2023, 1:38 p.m. UTC | #8
On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

[...]

> @@ -407,6 +429,77 @@ void __tlb_remove_table(void *_table)
>  	__free_page(page);
>  }
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static void pte_free_now0(struct rcu_head *head);
> +static void pte_free_now1(struct rcu_head *head);
> +
> +static void pte_free_pgste(struct rcu_head *head)
> +{
> +	unsigned long *table;
> +	struct page *page;
> +
> +	page = container_of(head, struct page, rcu_head);
> +	table = (unsigned long *)page_to_virt(page);
> +	table = (unsigned long *)((unsigned long)table | 0x03U);
> +	__tlb_remove_table(table);
> +}
> +
> +static void pte_free_half(struct rcu_head *head, unsigned int bit)
> +{
> +	unsigned long *table;
> +	struct page *page;
> +	unsigned int mask;
> +
> +	page = container_of(head, struct page, rcu_head);
> +	mask = atomic_xor_bits(&page->_refcount, 0x04U << (bit + 24));
> +
> +	table = (unsigned long *)page_to_virt(page);
> +	table += bit * PTRS_PER_PTE;
> +	table = (unsigned long *)((unsigned long)table | (0x01U << bit));
> +	__tlb_remove_table(table);
> +
> +	/* If pte_free_defer() of the other half came in, queue it now */
> +	if (mask & 0x0CU)
> +		call_rcu(&page->rcu_head, bit ? pte_free_now0 : pte_free_now1);
> +}
> +
> +static void pte_free_now0(struct rcu_head *head)
> +{
> +	pte_free_half(head, 0);
> +}
> +
> +static void pte_free_now1(struct rcu_head *head)
> +{
> +	pte_free_half(head, 1);
> +}
> +
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> +	unsigned int bit, mask;
> +	struct page *page;
> +
> +	page = virt_to_page(pgtable);
> +	if (mm_alloc_pgste(mm)) {
> +		call_rcu(&page->rcu_head, pte_free_pgste);

so is this now going to be used to free page tables
instead of page_table_free_rcu?

or will it be used instead of page_table_free?

this is actually quite important for KVM on s390

> +		return;
> +	}
> +	bit = ((unsigned long)pgtable & ~PAGE_MASK) /
> +			(PTRS_PER_PTE * sizeof(pte_t));
> +
> +	spin_lock_bh(&mm_pgtable_list_lock);
> +	mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));
> +	mask >>= 24;
> +	/* Other half not allocated? Other half not already pending free? */
> +	if ((mask & 0x03U) == 0x00U && (mask & 0x30U) != 0x30U)
> +		list_del(&page->lru);
> +	spin_unlock_bh(&mm_pgtable_list_lock);
> +
> +	/* Do not relink on rcu_head if other half already linked on rcu_head */
> +	if ((mask & 0x0CU) != 0x0CU)
> +		call_rcu(&page->rcu_head, bit ? pte_free_now1 : pte_free_now0);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  /*
>   * Base infrastructure required to generate basic asces, region, segment,
>   * and page tables that do not make use of enhanced features like EDAT1.
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 306a3d1a0fa6..1667a1bdb8a8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -146,7 +146,7 @@ struct page {
>  			pgtable_t pmd_huge_pte; /* protected by page->ptl */
>  			unsigned long _pt_pad_2;	/* mapping */
>  			union {
> -				struct mm_struct *pt_mm; /* x86 pgds only */
> +				struct mm_struct *pt_mm; /* x86 pgd, s390 */
>  				atomic_t pt_frag_refcount; /* powerpc */
>  			};
>  #if ALLOC_SPLIT_PTLOCKS
Hugh Dickins June 30, 2023, 3:28 p.m. UTC | #9
On Fri, 30 Jun 2023, Claudio Imbrenda wrote:
> On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> 
> [...]
> 
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > +	unsigned int bit, mask;
> > +	struct page *page;
> > +
> > +	page = virt_to_page(pgtable);
> > +	if (mm_alloc_pgste(mm)) {
> > +		call_rcu(&page->rcu_head, pte_free_pgste);
> 
> so is this now going to be used to free page tables
> instead of page_table_free_rcu?

No.

All pte_free_defer() is being used for (in this series; and any future
use beyond this series will have to undertake its own evaluations) is
for the case of removing an empty page table, which used to map a group
of PTE mappings of a file, in order to make way for one PMD mapping of
the huge page which those scattered pages have now been gathered into.

You're worried by that mm_alloc_pgste() block: it's something I didn't
have at all in my first draft, then I thought that perhaps the pgste
case might be able to come this way, so it seemed stupid to leave out
the handling for it.

I hope that you're implying that should be dead code here?  Perhaps,
that the pgste case corresponds to the case in s390 where THPs are
absolutely forbidden?  That would be good news for us.

Gerald, in his version of this block, added a comment asking:
	/*
	 * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
	 * page_table_free_rcu()?
	 * If yes -> need addr parameter here, like in pte_free_tlb().
	 */
Do you have the answer to that?  Neither of us could work it out.

> 
> or will it be used instead of page_table_free?

Not always; but yes, this case of removing a page table used
page_table_free() before; but now, with the lighter locking, needs
to keep the page table valid until the RCU grace period expires.

> 
> this is actually quite important for KVM on s390

None of us are wanting to break KVM on s390: your guidance appreciated!

Thanks,
Hugh
Claudio Imbrenda June 30, 2023, 4:25 p.m. UTC | #10
On Fri, 30 Jun 2023 08:28:54 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Fri, 30 Jun 2023, Claudio Imbrenda wrote:
> > On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
> > Hugh Dickins <hughd@google.com> wrote:
> > 
> > [...]
> >   
> > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > +{
> > > +	unsigned int bit, mask;
> > > +	struct page *page;
> > > +
> > > +	page = virt_to_page(pgtable);
> > > +	if (mm_alloc_pgste(mm)) {
> > > +		call_rcu(&page->rcu_head, pte_free_pgste);  
> > 
> > so is this now going to be used to free page tables
> > instead of page_table_free_rcu?  
> 
> No.
> 
> All pte_free_defer() is being used for (in this series; and any future
> use beyond this series will have to undertake its own evaluations) is
> for the case of removing an empty page table, which used to map a group
> of PTE mappings of a file, in order to make way for one PMD mapping of
> the huge page which those scattered pages have now been gathered into.
> 
> You're worried by that mm_alloc_pgste() block: it's something I didn't

actually no, but thanks for bringing it up :D

> have at all in my first draft, then I thought that perhaps the pgste
> case might be able to come this way, so it seemed stupid to leave out
> the handling for it.
> 
> I hope that you're implying that should be dead code here?  Perhaps,
> that the pgste case corresponds to the case in s390 where THPs are
> absolutely forbidden?  That would be good news for us.
> 
> Gerald, in his version of this block, added a comment asking:
> 	/*
> 	 * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
> 	 * page_table_free_rcu()?
> 	 * If yes -> need addr parameter here, like in pte_free_tlb().
> 	 */
> Do you have the answer to that?  Neither of us could work it out.

this is the thing I'm worried about; removing a page table that was
used to map a guest will leave dangling pointers in the gmap that will
cause memory corruption (I actually ran into that problem myself for
another patchseries).

gmap_unlink() is needed to clean up the pointers before they become
dangling (and also potentially do some TLB purging as needed)

the point here is: we need that only for page_table_free_rcu(); all
other users of page_table_free() cannot act on guest page tables
(because we don't allow THP for KVM guests). and that is why
page_table_free() does not do gmap_unlink() currently.
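
For reference, paraphrasing from memory (so treat this as a sketch rather
than a quote of the exact code): the relevant guard sits near the top of
page_table_free_rcu(), which is also why an address parameter would be
needed if pte_free_defer() ever had to do the same:

	/* in page_table_free_rcu(tlb, table, vmaddr), roughly: */
	if (mm_alloc_pgste(mm))
		gmap_unlink(mm, table, vmaddr);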

> 
> > 
> > or will it be used instead of page_table_free?  
> 
> Not always; but yes, this case of removing a page table used
> page_table_free() before; but now, with the lighter locking, needs
> to keep the page table valid until the RCU grace period expires.

so if I understand correctly your code will, sometimes, under some
circumstances, replace what page_table_free() does, but it will never
replace page_table_free_rcu()?

because in that case there would be no issues 

> 
> > 
> > this is actually quite important for KVM on s390  
> 
> None of us are wanting to break KVM on s390: your guidance appreciated!
> 
> Thanks,
> Hugh
Hugh Dickins June 30, 2023, 7:22 p.m. UTC | #11
On Fri, 30 Jun 2023, Claudio Imbrenda wrote:
> On Fri, 30 Jun 2023 08:28:54 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> > On Fri, 30 Jun 2023, Claudio Imbrenda wrote:
> > > On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
> > > Hugh Dickins <hughd@google.com> wrote:
> > > 
> > > [...]
> > >   
> > > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > > +{
> > > > +	unsigned int bit, mask;
> > > > +	struct page *page;
> > > > +
> > > > +	page = virt_to_page(pgtable);
> > > > +	if (mm_alloc_pgste(mm)) {
> > > > +		call_rcu(&page->rcu_head, pte_free_pgste);  
> > > 
> > > so is this now going to be used to free page tables
> > > instead of page_table_free_rcu?  
> > 
> > No.
> > 
> > All pte_free_defer() is being used for (in this series; and any future
> > use beyond this series will have to undertake its own evaluations) is
> > for the case of removing an empty page table, which used to map a group
> > of PTE mappings of a file, in order to make way for one PMD mapping of
> > the huge page which those scattered pages have now been gathered into.
> > 
> > You're worried by that mm_alloc_pgste() block: it's something I didn't
> 
> actually no, but thanks for bringing it up :D
> 
> > have at all in my first draft, then I thought that perhaps the pgste
> > case might be able to come this way, so it seemed stupid to leave out
> > the handling for it.
> > 
> > I hope that you're implying that should be dead code here?  Perhaps,
> > that the pgste case corresponds to the case in s390 where THPs are
> > absolutely forbidden?  That would be good news for us.
> > 
> > Gerald, in his version of this block, added a comment asking:
> > 	/*
> > 	 * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
> > 	 * page_table_free_rcu()?
> > 	 * If yes -> need addr parameter here, like in pte_free_tlb().
> > 	 */
> > Do you have the answer to that?  Neither of us could work it out.
> 
> this is the thing I'm worried about; removing a page table that was
> used to map a guest will leave dangling pointers in the gmap that will
> cause memory corruption (I actually ran into that problem myself for
> another patchseries).
> 
> gmap_unlink() is needed to clean up the pointers before they become
> dangling (and also potentially do some TLB purging as needed)

That's something I would have expected to be handled already via
mmu_notifiers, rather than buried inside the page table freeing.

If s390 is the only architecture to go that way, and could instead do
it via mmu_notifiers, then I think that will be more easily supported
in the long term.

But I'm writing from a position of very great ignorance: advising
KVM on s390 is many dimensions away from what I'm capable of.

> 
> the point here is: we need that only for page_table_free_rcu(); all
> other users of page_table_free() cannot act on guest page tables

I might be wrong, but I think that most users of page_table_free()
are merely freeing a page table which had to be allocated up front,
but was then found unnecessary (maybe a racing task already inserted
one): page tables which were never exposed to actual use.

> (because we don't allow THP for KVM guests). and that is why
> page_table_free() does not do gmap_unlink() currently.

But THP collapse does (or did before this series) use it to free a
page table which had been exposed to use.  The fact that s390 does
not allow THP for KVM guests makes page_table_free(), and this new
pte_free_defer(), safe for that; but it feels dangerously coincidental.

It's easy to imagine a future change being made, which would stumble
over this issue.  I have imagined that pte_free_defer() will be useful
in future, in the freeing of empty page tables: but s390 may pose a
problem there - though perhaps no more of a problem than additionally
needing to pass a virtual address down the stack.

> 
> > 
> > > 
> > > or will it be used instead of page_table_free?  
> > 
> > Not always; but yes, this case of removing a page table used
> > page_table_free() before; but now, with the lighter locking, needs
> > to keep the page table valid until the RCU grace period expires.
> 
> so if I understand correctly your code will, sometimes, under some
> circumstances, replace what page_table_free() does, but it will never
> replace page_table_free_rcu()?
> 
> because in that case there would be no issues 

Yes, thanks for confirming: we have no issue here at present, but may
do if use of pte_free_defer() is extended to other contexts in future.

Would it be appropriate to add a WARN_ON_ONCE around that
> > > > +	if (mm_alloc_pgste(mm)) {
in pte_free_defer()?

I ask that somewhat rhetorically: that block disappears in the later
version I was working on last night (and will return to shortly), in
which pte_free_defer() just sets a bit and calls page_table_free().

But I'd like to understand the possibilities better: does mm_alloc_pgste()
correspond 1:1 to KVM guest on s390, or does it cover several different
possibilities of which KVM guest is one, or am I just confused to be
thinking there's any relationship?

Thanks,
Hugh

> 
> > 
> > > 
> > > this is actually quite important for KVM on s390  
> > 
> > None of us are wanting to break KVM on s390: your guidance appreciated!
> > 
> > Thanks,
> > Hugh
Hugh Dickins July 2, 2023, 4:32 a.m. UTC | #12
On Thu, 29 Jun 2023, Hugh Dickins wrote:
> 
> I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> extent that I've not even tried to verify it; but I think I do get the
> point now, that we need further info than just PPHHAA to know whether
> the page is on the list or not.  But I think that if we move where the
> call_rcu() is done, then the page can stay on or off the list by same
> rules as before (but need to check HH bits along with PP when deciding
> whether to allocate, and whether to list_add_tail() when freeing).

No, not quite the same rules as before: I came to realize that using
list_add_tail() for the HH pages would be liable to put a page on the
list which forever blocked reuse of PP list_add_tail() pages after it
(could be solved by a list_move() somewhere, but we have agreed to
prefer simplicity).

I've dropped the HH bits, I'm using PageActive like we did on powerpc,
I've dropped most of the pte_free_*() helpers, and list_del_init() is
an easier way of dealing with those "is it on the list" questions.
I expect that we shall be close to reaching agreement on...

[PATCH v? 07/12] s390: add pte_free_defer() for pgtables sharing page

Add s390-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because s390 fits two 2K
page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves; with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule: that
a page is never on the free list if pte_free_defer() was used on either
half (marked by PageActive).  And for simplicity, delay calling RCU until
both halves are freed.

Not adding back unallocated fragments to the list in pte_free_defer()
can result in wasting some amount of memory for pagetables, depending
on how long the allocated fragment will stay in use. In practice, this
effect is expected to be insignificant, and not justify a far more
complex approach, which might allow adding the fragments back later
in __tlb_remove_table(), where we might not have a stable mm any more.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/s390/include/asm/pgalloc.h |  4 ++
 arch/s390/mm/pgalloc.c          | 75 +++++++++++++++++++++++++++------
 2 files changed, 67 insertions(+), 12 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
 
+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..fd0c4312da16 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
  * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
  * while the PP bits are never used, nor such a page is added to or removed
  * from mm_context_t::pgtable_list.
+ *
+ * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
+ * and prevents both 2K fragments from being reused. pte_free_defer() has to
+ * guarantee that its pgtable cannot be reused before the RCU grace period
+ * has elapsed (which page_table_free_rcu() does not actually guarantee).
+ * But for simplicity, because page->rcu_head overlays page->lru, and because
+ * the RCU callback might not be called before the mm_context_t has been freed,
+ * pte_free_defer() in this implementation prevents both fragments from being
+ * reused, and delays making the call to RCU until both fragments are freed.
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
@@ -261,7 +270,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 					table += PTRS_PER_PTE;
 				atomic_xor_bits(&page->_refcount,
 							0x01U << (bit + 24));
-				list_del(&page->lru);
+				list_del_init(&page->lru);
 			}
 		}
 		spin_unlock_bh(&mm->context.lock);
@@ -281,6 +290,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 	table = (unsigned long *) page_to_virt(page);
 	if (mm_alloc_pgste(mm)) {
 		/* Return 4K page table with PGSTEs */
+		INIT_LIST_HEAD(&page->lru);
 		atomic_xor_bits(&page->_refcount, 0x03U << 24);
 		memset64((u64 *)table, _PAGE_INVALID, PTRS_PER_PTE);
 		memset64((u64 *)table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
@@ -300,7 +310,9 @@ static void page_table_release_check(struct page *page, void *table,
 {
 	char msg[128];
 
-	if (!IS_ENABLED(CONFIG_DEBUG_VM) || !mask)
+	if (!IS_ENABLED(CONFIG_DEBUG_VM))
+		return;
+	if (!mask && list_empty(&page->lru))
 		return;
 	snprintf(msg, sizeof(msg),
 		 "Invalid pgtable %p release half 0x%02x mask 0x%02x",
@@ -308,6 +320,15 @@ static void page_table_release_check(struct page *page, void *table,
 	dump_page(page, msg);
 }
 
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pgtable_pte_page_dtor(page);
+	__free_page(page);
+}
+
 void page_table_free(struct mm_struct *mm, unsigned long *table)
 {
 	unsigned int mask, bit, half;
@@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
 		 */
 		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 		mask >>= 24;
-		if (mask & 0x03U)
+		if ((mask & 0x03U) && !PageActive(page)) {
+			/*
+			 * Other half is allocated, and neither half has had
+			 * its free deferred: add page to head of list, to make
+			 * this freed half available for immediate reuse.
+			 */
 			list_add(&page->lru, &mm->context.pgtable_list);
-		else
-			list_del(&page->lru);
+		} else {
+			/* If page is on list, now remove it. */
+			list_del_init(&page->lru);
+		}
 		spin_unlock_bh(&mm->context.lock);
 		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
 		mask >>= 24;
@@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
 	}
 
 	page_table_release_check(page, table, half, mask);
-	pgtable_pte_page_dtor(page);
-	__free_page(page);
+	if (TestClearPageActive(page))
+		call_rcu(&page->rcu_head, pte_free_now);
+	else
+		pte_free_now(&page->rcu_head);
 }
 
 void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
@@ -370,10 +400,18 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 	 */
 	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 	mask >>= 24;
-	if (mask & 0x03U)
+	if ((mask & 0x03U) && !PageActive(page)) {
+		/*
+		 * Other half is allocated, and neither half has had
+		 * its free deferred: add page to end of list, to make
+		 * this freed half available for reuse once its pending
+		 * bit has been cleared by __tlb_remove_table().
+		 */
 		list_add_tail(&page->lru, &mm->context.pgtable_list);
-	else
-		list_del(&page->lru);
+	} else {
+		/* If page is on list, now remove it. */
+		list_del_init(&page->lru);
+	}
 	spin_unlock_bh(&mm->context.lock);
 	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
 	tlb_remove_table(tlb, table);
@@ -403,10 +441,23 @@ void __tlb_remove_table(void *_table)
 	}
 
 	page_table_release_check(page, table, half, mask);
-	pgtable_pte_page_dtor(page);
-	__free_page(page);
+	if (TestClearPageActive(page))
+		call_rcu(&page->rcu_head, pte_free_now);
+	else
+		pte_free_now(&page->rcu_head);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	SetPageActive(page);
+	page_table_free(mm, (unsigned long *)pgtable);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 /*
  * Base infrastructure required to generate basic asces, region, segment,
  * and page tables that do not make use of enhanced features like EDAT1.
Claudio Imbrenda July 3, 2023, 11 a.m. UTC | #13
On Fri, 30 Jun 2023 12:22:43 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

[...]

> That's something I would have expected to be handled already via
> mmu_notifiers, rather than buried inside the page table freeing.
> 
> If s390 is the only architecture to go that way, and could instead do
> it via mmu_notifiers, then I think that will be more easily supported
> in the long term.

I am very well aware of that, and in fact I've been working on
exactly that for some time already. But it's a very complex minefield
and therefore I'm proceeding *very* carefully. It will take quite some
time before anything comes out.

> 
> But I'm writing from a position of very great ignorance: advising
> KVM on s390 is many dimensions away from what I'm capable of.

fair enough, but in this case it doesn't mean that you are not right :)

> 
> > 
> > the point here is: we need that only for page_table_free_rcu(); all
> > other users of page_table_free() cannot act on guest page tables  
> 
> I might be wrong, but I think that most users of page_table_free()
> are merely freeing a page table which had to be allocated up front,
> but was then found unnecessary (maybe a racing task already inserted
> one): page tables which were never exposed to actual use.

that was my impression as well

> > (because we don't allow THP for KVM guests). and that is why
> > page_table_free() does not do gmap_unlink() currently.  
> 
> But THP collapse does (or did before this series) use it to free a
> page table which had been exposed to use.  The fact that s390 does

that is also my understanding

> not allow THP for KVM guests makes page_table_free(), and this new
> pte_free_defer(), safe for that; but it feels dangerously coincidental.

not really; my guess is that we _intentionally_ did not do anything
there, because we knew we did not need it, knowing well that
we would need it once we would want to support THP for guests.
so not a coincidence, but a conscious decision based, I guess, on
touching as little code as needed.

> 
> It's easy to imagine a future change being made, which would stumble
> over this issue.  I have imagined that pte_free_defer() will be useful
> in future, in the freeing of empty page tables: but s390 may pose a
> problem there - though perhaps no more of a problem than additionally
> needing to pass a virtual address down the stack.

yeah it can always be fixed later if we need to

> 
> >   
> > >   
> > > > 
> > > > or will it be used instead of page_table_free?    
> > > 
> > > Not always; but yes, this case of removing a page table used
> > > page_table_free() before; but now, with the lighter locking, needs
> > > to keep the page table valid until the RCU grace period expires.  
> > 
> > so if I understand correctly your code will, sometimes, under some
> > circumstances, replace what page_table_free() does, but it will never
> > replace page_table_free_rcu()?
> > 
> > because in that case there would be no issues   
> 
> Yes, thanks for confirming: we have no issue here at present, but may
> do if use of pte_free_defer() is extended to other contexts in future.
> 
> Would it be appropriate to add a WARN_ON_ONCE around that
> > > > > +	if (mm_alloc_pgste(mm)) {  
> in pte_free_defer()?

that's actually not a bad idea. should never happen, but... that's the
whole point of a WARN_ON after all

> 
> I ask that somewhat rhetorically: that block disappears in the later
> version I was working on last night (and will return to shortly), in
> which pte_free_defer() just sets a bit and calls page_table_free().
> 
> But I'd like to understand the possibilities better: does mm_alloc_pgste()
> correspond 1:1 to KVM guest on s390, or does it cover several different
> possibilities of which KVM guest is one, or am I just confused to be
> thinking there's any relationship?

this is... historically complicated (because of course it is)

in theory any process can allocate PGSTEs by having the right bits in
the ELF header (that's how QEMU does it currently). And QEMU will have
PGSTEs allocated even if it does not actually start any guests.

Then we have the vm.allocate_pgste sysctl knob; once enabled, it will
cause all processes to have PGSTEs allocated. This is how we handled
PGSTEs before we switched to ELF header bits.

So in summary: in __practice__ yes, only QEMU will have PGSTEs. But in
theory anything is possible and allowed.

> 
> Thanks,
> Hugh
> 
> >   
> > >   
> > > > 
> > > > this is actually quite important for KVM on s390    
> > > 
> > > None of us are wanting to break KVM on s390: your guidance appreciated!
> > > 
> > > Thanks,
> > > Hugh
Gerald Schaefer July 3, 2023, 4:10 p.m. UTC | #14
On Thu, 29 Jun 2023 23:00:07 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Thu, 29 Jun 2023, Gerald Schaefer wrote:
> > On Thu, 29 Jun 2023 12:22:24 -0300
> > Jason Gunthorpe <jgg@ziepe.ca> wrote:  
> > > On Wed, Jun 28, 2023 at 10:08:08PM -0700, Hugh Dickins wrote:  
> > > > On Wed, 28 Jun 2023, Gerald Schaefer wrote:    
> > > > > 
> > > > > As discussed in the other thread, we would rather go with less complexity,
> > > > > possibly switching to an approach w/o the list and fragment re-use in the
> > > > > future. For now, as a first step in that direction, we can try with not
> > > > > adding fragments back only for pte_free_defer(). Here is an adjusted
> > > > > version of your patch, copying most of your pte_free_defer() logic and
> > > > > also description, tested with LTP and all three of your patch series applied:    
> > > > 
> > > > Thanks, Gerald: I don't mind abandoning my 13/12 SLAB_TYPESAFE_BY_RCU
> > > > patch (posted with fewer Cc's to the s390 list last week), and switching
> > > > to your simpler who-cares-if-we-sometimes-don't-make-maximal-use-of-page
> > > > patch.
> > > > 
> > > > But I didn't get deep enough into it today to confirm it - and disappointed
> > > > that you've found it necessary to play with pt_frag_refcount in addition to
> > > > _refcount and HH bits.  No real problem with that, but my instinct says it
> > > > should be simpler.    
> > 
> > Yes, I also found it a bit awkward, but it seemed "good and simple enough",
> > to have something to go forward with, while my instinct was in line with yours.
> >   
> > > 
> > > Is there any reason it should be any different at all from what PPC is
> > > doing?
> > > 
> > > I still think the right thing to do here is make the PPC code common
> > > (with Hugh's proposed RCU modification) and just use it in both
> > > arches....  
> > 
> > With the current approach, we would not add back fragments _only_ for
> > the new pte_free_defer() path, while keeping our cleverness for the other
> > paths. Not having a good overview of the negative impact wrt potential
> > memory waste, I would rather take small steps, if possible.
> > 
> > If we later switch to never adding back fragments, of course we should
> > try to be in line with PPC implementation.  
> 
> I find myself half-agreeing with everyone.
> 
> I agree with Gerald that s390 should keep close to what it is already
> doing (except for adding pte_free_defer()): that changing its strategy
> and implementation to be much more like powerpc, is a job for some other
> occasion (and would depend on gathering data about how well each does).
> 
> But I agree with Jason that the powerpc solution we ended up with cut
> out a lot of unnecessary complication: it shifts the RCU delay from
> when pte_free_defer() is called, to when the shared page comes to be
> freed; which may be a lot later, and might not be welcome in a common
> path, but is quite okay for the uncommon pte_free_defer().

Ok, I guess I must admit that I completely ignored the latest progress in
the powerpc thread, and therefore was not up-to-date. I still had the older
approach in mind, where you also checked pt_frag_refcount to avoid a
double call_rcu().

The new approach sounds very reasonable, and I also like your latest
s390 patch at first glance. I need to get more up-to-date with PageActive
and maybe also the powerpc approach, and give this some proper review tomorrow.

> 
> And I agree with Alexander that pte_free_lower() and pte_free_upper()
> are better names than pte_free_now0() and pte_free_now1(): I was going
> to make that change, except all those functions disappear if we follow
> Jason's advice and switch the call_rcu() to when freeing the page.
> 
> (Lower and upper seem unambiguous to me: Gerald, does your confusion
> come just from the way they are shown the wrong way round in the PP AA
> diagram?  I corrected that in my patch, but you reverted it in yours.)

Ah yes, that could well be, and unfortunately I did not notice that you
fixed that in the comment. I only saw that you "fixed" the bit numbering
from 01234567 to 76543210, which I think is wrong on big-endian s390,
and therefore I simply removed that hunk completely.

But thanks a lot for pointing that out! We will certainly want to fix that
comment in a later patch, to reduce some or maybe all of the (at least
my) upper/lower confusion.
Jason Gunthorpe July 3, 2023, 9:29 p.m. UTC | #15
On Mon, Jul 03, 2023 at 01:00:13PM +0200, Claudio Imbrenda wrote:
> On Fri, 30 Jun 2023 12:22:43 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> 
> [...]
> 
> > That's something I would have expected to be handled already via
> > mmu_notifiers, rather than buried inside the page table freeing.
> > 
> > If s390 is the only architecture to go that way, and could instead do
> > it via mmu_notifiers, then I think that will be more easily supported
> > in the long term.
> 
> I am very well aware of that, and in fact I've been working on
> exactly that for some time already. But it's a very complex minefield
> and therefore I'm proceeding *very* carefully. It will take quite some
> time before anything comes out.

Yes, +1 on this please: creating your own arch cross-connect with KVM
in the page table freers is really horrible...

Jason
Alexander Gordeev July 4, 2023, 1:40 p.m. UTC | #16
On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:
> On Thu, 29 Jun 2023, Hugh Dickins wrote:

Hi Hugh,

...
> No, not quite the same rules as before: I came to realize that using
> list_add_tail() for the HH pages would be liable to put a page on the
> list which forever blocked reuse of PP list_add_tail() pages after it
> (could be solved by a list_move() somewhere, but we have agreed to
> prefer simplicity).

Just to make things more clear for me: do I understand correctly that this
was an attempt to add HH fragments to pgtable_list from pte_free_defer()?

Thanks!
Gerald Schaefer July 4, 2023, 3:19 p.m. UTC | #17
On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Thu, 29 Jun 2023, Hugh Dickins wrote:
> > 
> > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> > extent that I've not even tried to verify it; but I think I do get the
> > point now, that we need further info than just PPHHAA to know whether
> > the page is on the list or not.  But I think that if we move where the
> > call_rcu() is done, then the page can stay on or off the list by same
> > rules as before (but need to check HH bits along with PP when deciding
> > whether to allocate, and whether to list_add_tail() when freeing).  
> 
> No, not quite the same rules as before: I came to realize that using
> list_add_tail() for the HH pages would be liable to put a page on the
> list which forever blocked reuse of PP list_add_tail() pages after it
> (could be solved by a list_move() somewhere, but we have agreed to
> prefer simplicity).
> 
> I've dropped the HH bits, I'm using PageActive like we did on powerpc,
> I've dropped most of the pte_free_*() helpers, and list_del_init() is
> an easier way of dealing with those "is it on the list" questions.
> I expect that we shall be close to reaching agreement on...

This looks really nice, almost too good and easy to be true. I did not
find any obvious flaw, just some comments below. It also survived LTP
without any visible havoc, so I guess this approach is the best so far.

> 
> [PATCH v? 07/12] s390: add pte_free_defer() for pgtables sharing page
> 
> Add s390-specific pte_free_defer(), to free table page via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
> 
> This version is more complicated than others: because s390 fits two 2K
> page tables into one 4K page (so page->rcu_head must be shared between
> both halves), and already uses page->lru (which page->rcu_head overlays)
> to list any free halves; with clever management by page->_refcount bits.
> 
> Build upon the existing management, adjusted to follow a new rule: that
> a page is never on the free list if pte_free_defer() was used on either
> half (marked by PageActive).  And for simplicity, delay calling RCU until
> both halves are freed.
> 
> Not adding back unallocated fragments to the list in pte_free_defer()
> can result in wasting some amount of memory for pagetables, depending
> on how long the allocated fragment will stay in use. In practice, this
> effect is expected to be insignificant, and not justify a far more
> complex approach, which might allow to add the fragments back later
> in __tlb_remove_table(), where we might not have a stable mm any more.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  arch/s390/include/asm/pgalloc.h |  4 ++
>  arch/s390/mm/pgalloc.c          | 75 +++++++++++++++++++++++++++------
>  2 files changed, 67 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
> index 17eb618f1348..89a9d5ef94f8 100644
> --- a/arch/s390/include/asm/pgalloc.h
> +++ b/arch/s390/include/asm/pgalloc.h
> @@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
>  #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
>  #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
>  
> +/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
> +#define pte_free_defer pte_free_defer
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
> +
>  void vmem_map_init(void);
>  void *vmem_crst_alloc(unsigned long val);
>  pte_t *vmem_pte_alloc(void);
> diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
> index 66ab68db9842..fd0c4312da16 100644
> --- a/arch/s390/mm/pgalloc.c
> +++ b/arch/s390/mm/pgalloc.c
> @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
>   * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
>   * while the PP bits are never used, nor such a page is added to or removed
>   * from mm_context_t::pgtable_list.
> + *
> + * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
> + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> + * guarantee that its pgtable cannot be reused before the RCU grace period
> + * has elapsed (which page_table_free_rcu() does not actually guarantee).

Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
allow reuse before grace period elapsed. And I hope that it does so, by
setting the PP bits, which would be noticed in page_table_alloc(), in
case the page would be seen there.

Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
end of the list, and so they could be seen in page_table_alloc(), but they
should not be reused before grace period elapsed and __tlb_remove_table()
cleared the PP bits, as far as I understand.

So what exactly do you mean with "which page_table_free_rcu() does not actually
guarantee"?

> + * But for simplicity, because page->rcu_head overlays page->lru, and because
> + * the RCU callback might not be called before the mm_context_t has been freed,
> + * pte_free_defer() in this implementation prevents both fragments from being
> + * reused, and delays making the call to RCU until both fragments are freed.
>   */
>  unsigned long *page_table_alloc(struct mm_struct *mm)
>  {
> @@ -261,7 +270,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
>  					table += PTRS_PER_PTE;
>  				atomic_xor_bits(&page->_refcount,
>  							0x01U << (bit + 24));
> -				list_del(&page->lru);
> +				list_del_init(&page->lru);
>  			}
>  		}
>  		spin_unlock_bh(&mm->context.lock);
> @@ -281,6 +290,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
>  	table = (unsigned long *) page_to_virt(page);
>  	if (mm_alloc_pgste(mm)) {
>  		/* Return 4K page table with PGSTEs */
> +		INIT_LIST_HEAD(&page->lru);
>  		atomic_xor_bits(&page->_refcount, 0x03U << 24);
>  		memset64((u64 *)table, _PAGE_INVALID, PTRS_PER_PTE);
>  		memset64((u64 *)table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
> @@ -300,7 +310,9 @@ static void page_table_release_check(struct page *page, void *table,
>  {
>  	char msg[128];
>  
> -	if (!IS_ENABLED(CONFIG_DEBUG_VM) || !mask)
> +	if (!IS_ENABLED(CONFIG_DEBUG_VM))
> +		return;
> +	if (!mask && list_empty(&page->lru))
>  		return;
>  	snprintf(msg, sizeof(msg),
>  		 "Invalid pgtable %p release half 0x%02x mask 0x%02x",
> @@ -308,6 +320,15 @@ static void page_table_release_check(struct page *page, void *table,
>  	dump_page(page, msg);
>  }
>  
> +static void pte_free_now(struct rcu_head *head)
> +{
> +	struct page *page;
> +
> +	page = container_of(head, struct page, rcu_head);
> +	pgtable_pte_page_dtor(page);
> +	__free_page(page);
> +}
> +
>  void page_table_free(struct mm_struct *mm, unsigned long *table)
>  {
>  	unsigned int mask, bit, half;
> @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
>  		 */
>  		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
>  		mask >>= 24;
> -		if (mask & 0x03U)
> +		if ((mask & 0x03U) && !PageActive(page)) {
> +			/*
> +			 * Other half is allocated, and neither half has had
> +			 * its free deferred: add page to head of list, to make
> +			 * this freed half available for immediate reuse.
> +			 */
>  			list_add(&page->lru, &mm->context.pgtable_list);
> -		else
> -			list_del(&page->lru);
> +		} else {
> +			/* If page is on list, now remove it. */
> +			list_del_init(&page->lru);
> +		}

Ok, we might end up with some unnecessary list_del_init() here, e.g. if
other half is still allocated, when called from pte_free_defer() on a
fully allocated page, which was not on the list (and with PageActive, and
(mask & 0x03U) true).
Not sure if adding an additional mask check to the else path would be
needed, but it seems that list_del_init() should also be able to handle
this.

Same thought applies to the similar logic in page_table_free_rcu()
below.

>  		spin_unlock_bh(&mm->context.lock);
>  		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
>  		mask >>= 24;
> @@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
>  	}
>  
>  	page_table_release_check(page, table, half, mask);
> -	pgtable_pte_page_dtor(page);
> -	__free_page(page);
> +	if (TestClearPageActive(page))
> +		call_rcu(&page->rcu_head, pte_free_now);
> +	else
> +		pte_free_now(&page->rcu_head);

This ClearPageActive, and the similar thing in __tlb_remove_table() below,
worries me a bit, because it is done outside the spin_lock. It "feels" like
there could be some race with the PageActive checks inside the spin_lock,
but when drawing some pictures, I could not find any such scenario yet.
Also, our existing spin_lock is probably not supposed to protect against
PageActive changes anyway, right?

>  }
>  
>  void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
> @@ -370,10 +400,18 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
>  	 */
>  	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
>  	mask >>= 24;
> -	if (mask & 0x03U)
> +	if ((mask & 0x03U) && !PageActive(page)) {
> +		/*
> +		 * Other half is allocated, and neither half has had
> +		 * its free deferred: add page to end of list, to make
> +		 * this freed half available for reuse once its pending
> +		 * bit has been cleared by __tlb_remove_table().
> +		 */
>  		list_add_tail(&page->lru, &mm->context.pgtable_list);
> -	else
> -		list_del(&page->lru);
> +	} else {
> +		/* If page is on list, now remove it. */
> +		list_del_init(&page->lru);
> +	}
>  	spin_unlock_bh(&mm->context.lock);
>  	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
>  	tlb_remove_table(tlb, table);
> @@ -403,10 +441,23 @@ void __tlb_remove_table(void *_table)
>  	}
>  
>  	page_table_release_check(page, table, half, mask);
> -	pgtable_pte_page_dtor(page);
> -	__free_page(page);
> +	if (TestClearPageActive(page))
> +		call_rcu(&page->rcu_head, pte_free_now);
> +	else
> +		pte_free_now(&page->rcu_head);
>  }
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> +	struct page *page;
> +
> +	page = virt_to_page(pgtable);
> +	SetPageActive(page);
> +	page_table_free(mm, (unsigned long *)pgtable);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  /*
>   * Base infrastructure required to generate basic asces, region, segment,
>   * and page tables that do not make use of enhanced features like EDAT1.
Hugh Dickins July 4, 2023, 4:03 p.m. UTC | #18
On Tue, 4 Jul 2023, Alexander Gordeev wrote:
> On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:
> > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> 
> Hi Hugh,
> 
> ...
> > No, not quite the same rules as before: I came to realize that using
> > list_add_tail() for the HH pages would be liable to put a page on the
> > list which forever blocked reuse of PP list_add_tail() pages after it
> > (could be solved by a list_move() somewhere, but we have agreed to
> > prefer simplicity).
> 
> Just to make things more clear for me: do I understand correctly that this
> was an attempt to add HH fragments to pgtable_list from pte_free_defer()?

Yes, from page_table_free() called from pte_free_defer(): I had claimed
they could be put on the list (or not) without needing to consider their
HH-ness, apart from wanting to list_add_tail() rather than list_add() them.

But then realized that this category of list_add_tail() pages would block
access to the others.

But I think I was mistaken then to say "could be solved by a list_move()
somewhere"; because "somewhere" would have had to be __tlb_remove_table()
when it removes PP-bits, which would bring us back to the issues of
getting a spinlock from an mm which might already be freed.

Hugh
Hugh Dickins July 4, 2023, 5:03 p.m. UTC | #19
On Tue, 4 Jul 2023, Gerald Schaefer wrote:
> On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> > > 
> > > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> > > extent that I've not even tried to verify it; but I think I do get the
> > > point now, that we need further info than just PPHHAA to know whether
> > > the page is on the list or not.  But I think that if we move where the
> > > call_rcu() is done, then the page can stay on or off the list by same
> > > rules as before (but need to check HH bits along with PP when deciding
> > > whether to allocate, and whether to list_add_tail() when freeing).  
> > 
> > No, not quite the same rules as before: I came to realize that using
> > list_add_tail() for the HH pages would be liable to put a page on the
> > list which forever blocked reuse of PP list_add_tail() pages after it
> > (could be solved by a list_move() somewhere, but we have agreed to
> > prefer simplicity).
> > 
> > I've dropped the HH bits, I'm using PageActive like we did on powerpc,
> > I've dropped most of the pte_free_*() helpers, and list_del_init() is
> > an easier way of dealing with those "is it on the list" questions.
> > I expect that we shall be close to reaching agreement on...
> 
> This looks really nice, almost too good and easy to be true. I did not
> find any obvious flaw, just some comments below. It also survived LTP
> without any visible havoc, so I guess this approach is the best so far.

Phew! I'm of course glad to hear this: thanks for your efforts on it.

...
> > --- a/arch/s390/mm/pgalloc.c
> > +++ b/arch/s390/mm/pgalloc.c
> > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> >   * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
> >   * while the PP bits are never used, nor such a page is added to or removed
> >   * from mm_context_t::pgtable_list.
> > + *
> > + * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
> > + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> > + * guarantee that its pgtable cannot be reused before the RCU grace period
> > + * has elapsed (which page_table_free_rcu() does not actually guarantee).
> 
> Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> allow reuse before grace period elapsed. And I hope that it does so, by
> setting the PP bits, which would be noticed in page_table_alloc(), in
> case the page would be seen there.
> 
> Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
> end of the list, and so they could be seen in page_table_alloc(), but they
> should not be reused before grace period elapsed and __tlb_remove_table()
> cleared the PP bits, as far as I understand.
> 
> So what exactly do you mean with "which page_table_free_rcu() does not actually
> guarantee"?

I'll answer without locating and re-reading what Jason explained earlier,
perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
he may have explained it better.  And without working out again all the
MMU_GATHER #defines, and which of them do and do not apply to s390 here.

The detail that sticks in my mind is the fallback in tlb_remove_table()
in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
batch the tables for freeing by RCU, and resorts instead to an immediate 
TLB flush (I think: that again involves chasing definitions) followed by
tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU,
and is commented: 
/*
 * This isn't an RCU grace period and hence the page-tables cannot be
 * assumed to be actually RCU-freed.
 *
 * It is however sufficient for software page-table walkers that rely on
 * IRQ disabling.
 */
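
For reference, the shape of that fallback is roughly this (paraphrasing
mm/mmu_gather.c from memory, so don't trust the exact names):

void tlb_remove_table(struct mmu_gather *tlb, void *table)
{
	struct mmu_table_batch **batch = &tlb->batch;

	if (*batch == NULL) {
		/* try to allocate a batch page, to free the tables by RCU */
		*batch = (struct mmu_table_batch *)
			__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
		if (*batch == NULL) {
			/* no memory: flush TLB now, IPI all CPUs, free at once */
			tlb_table_invalidate(tlb);
			tlb_remove_table_one(table);
			return;
		}
		(*batch)->nr = 0;
	}
	(*batch)->tables[(*batch)->nr++] = table;
	if ((*batch)->nr == MAX_TABLE_BATCH)
		tlb_table_flush(tlb);
}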

Whether that's good for your PP pages or not, I've given no thought:
I've just taken it on trust that what s390 has working today is good.

If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(),
then I would not have written "(which page_table_free_rcu() does not
actually guarantee)".  But it cannot use call_rcu() because it does
not have an rcu_head to work with - it's in some generic code, and
there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set.

And Jason would have much preferred us to address the issue from that
angle; but not only would doing so destroy my sanity, I'd also destroy
20 architectures TLB-flushing, unbuilt and untested, in the attempt.

...
> > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> >  		 */
> >  		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> >  		mask >>= 24;
> > -		if (mask & 0x03U)
> > +		if ((mask & 0x03U) && !PageActive(page)) {
> > +			/*
> > +			 * Other half is allocated, and neither half has had
> > +			 * its free deferred: add page to head of list, to make
> > +			 * this freed half available for immediate reuse.
> > +			 */
> >  			list_add(&page->lru, &mm->context.pgtable_list);
> > -		else
> > -			list_del(&page->lru);
> > +		} else {
> > +			/* If page is on list, now remove it. */
> > +			list_del_init(&page->lru);
> > +		}
> 
> Ok, we might end up with some unnecessary list_del_init() here, e.g. if
> other half is still allocated, when called from pte_free_defer() on a
> fully allocated page, which was not on the list (and with PageActive, and
> (mask & 0x03U) true).
> Not sure if adding an additional mask check to the else path would be
> needed, but it seems that list_del_init() should also be able to handle
> this.

list_del_init() is very cheap in the unnecessary case: the cachelines
required are already there.  You don't want a flag to say whether to
call it or not, it is already the efficient approach.
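
For the record, list_del_init() boils down to this (include/linux/list.h,
comments mine):

static inline void list_del_init(struct list_head *entry)
{
	__list_del_entry(entry);	/* prev->next = next; next->prev = prev */
	INIT_LIST_HEAD(entry);		/* entry->next = entry->prev = entry */
}

so calling it on an entry which already points to itself just rewrites
the same two pointers, on cachelines we have already touched.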

(But you were right not to use it in your pt_frag_refcount version,
because there we were still trying to do the call_rcu() per fragment
rather than per page, so page->lru could have been on the RCU queue.)

> 
> Same thought applies to the similar logic in page_table_free_rcu()
> below.
> 
> >  		spin_unlock_bh(&mm->context.lock);
> >  		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> >  		mask >>= 24;
> > @@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> >  	}
> >  
> >  	page_table_release_check(page, table, half, mask);
> > -	pgtable_pte_page_dtor(page);
> > -	__free_page(page);
> > +	if (TestClearPageActive(page))
> > +		call_rcu(&page->rcu_head, pte_free_now);
> > +	else
> > +		pte_free_now(&page->rcu_head);
> 
> This ClearPageActive, and the similar thing in __tlb_remove_table() below,
> worries me a bit, because it is done outside the spin_lock. It "feels" like
> there could be some race with the PageActive checks inside the spin_lock,
> but when drawing some pictures, I could not find any such scenario yet.
> Also, our existing spin_lock is probably not supposed to protect against
> PageActive changes anyway, right?

Here (and similarly in __tlb_remove_table()) is where we are about to free
the page table page: both of the fragments have already been released,
there is nobody left who could be racing against us to set PageActive.

I chose PageActive for its name, not for any special behaviour of that
flag: nothing else could be setting or clearing it while we own the page.

Hugh
Alexander Gordeev July 5, 2023, 6:46 a.m. UTC | #20
On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:
> On Thu, 29 Jun 2023, Hugh Dickins wrote:

Hi Hugh,

...

> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> +	struct page *page;

If I got your and Claudio conversation right, you were going to add
here WARN_ON_ONCE() in case of mm_alloc_pgste(mm)?

> +	page = virt_to_page(pgtable);
> +	SetPageActive(page);
> +	page_table_free(mm, (unsigned long *)pgtable);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  /*
>   * Base infrastructure required to generate basic asces, region, segment,
>   * and page tables that do not make use of enhanced features like EDAT1.

Thanks!
Gerald Schaefer July 5, 2023, 12:55 p.m. UTC | #21
On Tue, 4 Jul 2023 10:03:57 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Tue, 4 Jul 2023, Gerald Schaefer wrote:
> > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> > Hugh Dickins <hughd@google.com> wrote:  
> > > On Thu, 29 Jun 2023, Hugh Dickins wrote:  
> > > > 
> > > > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> > > > extent that I've not even tried to verify it; but I think I do get the
> > > > point now, that we need further info than just PPHHAA to know whether
> > > > the page is on the list or not.  But I think that if we move where the
> > > > call_rcu() is done, then the page can stay on or off the list by same
> > > > rules as before (but need to check HH bits along with PP when deciding
> > > > whether to allocate, and whether to list_add_tail() when freeing).    
> > > 
> > > No, not quite the same rules as before: I came to realize that using
> > > list_add_tail() for the HH pages would be liable to put a page on the
> > > list which forever blocked reuse of PP list_add_tail() pages after it
> > > (could be solved by a list_move() somewhere, but we have agreed to
> > > prefer simplicity).
> > > 
> > > I've dropped the HH bits, I'm using PageActive like we did on powerpc,
> > > I've dropped most of the pte_free_*() helpers, and list_del_init() is
> > > an easier way of dealing with those "is it on the list" questions.
> > > I expect that we shall be close to reaching agreement on...  
> > 
> > This looks really nice, almost too good and easy to be true. I did not
> > find any obvious flaw, just some comments below. It also survived LTP
> > without any visible havoc, so I guess this approach is the best so far.  
> 
> Phew! I'm of course glad to hear this: thanks for your efforts on it.
> 
> ...
> > > --- a/arch/s390/mm/pgalloc.c
> > > +++ b/arch/s390/mm/pgalloc.c
> > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> > >   * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
> > >   * while the PP bits are never used, nor such a page is added to or removed
> > >   * from mm_context_t::pgtable_list.
> > > + *
> > > + * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
> > > + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> > > + * guarantee that its pgtable cannot be reused before the RCU grace period
> > > + * has elapsed (which page_table_free_rcu() does not actually guarantee).  
> > 
> > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> > allow reuse before grace period elapsed. And I hope that it does so, by
> > setting the PP bits, which would be noticed in page_table_alloc(), in
> > case the page would be seen there.
> > 
> > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
> > end of the list, and so they could be seen in page_table_alloc(), but they
> > should not be reused before grace period elapsed and __tlb_remove_table()
> > cleared the PP bits, as far as I understand.
> > 
> > So what exactly do you mean with "which page_table_free_rcu() does not actually
> > guarantee"?  
> 
> I'll answer without locating and re-reading what Jason explained earlier,
> perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
> he may have explained it better.  And without working out again all the
> MMU_GATHER #defines, and which of them do and do not apply to s390 here.
> 
> The detail that sticks in my mind is the fallback in tlb_remove_table()

Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(),
but that is rather a generic issue, and not s390-specific. I thought you
meant some s390-oddity here, of which we have a lot, unfortunately...
Of course, we call tlb_remove_table() from our page_table_free_rcu(), so
I guess you could say that page_table_free_rcu() cannot guarantee what
tlb_remove_table() cannot guarantee.

Maybe change to "which page_table_free_rcu() does not actually guarantee,
by calling tlb_remove_table()", to make it clear that this is not a problem
of page_table_free_rcu() itself.

> in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
> batch the tables for freeing by RCU, and resorts instead to an immediate 
> TLB flush (I think: that again involves chasing definitions) followed by
> tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU,
> and is commented: 
> /*
>  * This isn't an RCU grace period and hence the page-tables cannot be
>  * assumed to be actually RCU-freed.
>  *
>  * It is however sufficient for software page-table walkers that rely on
>  * IRQ disabling.
>  */
> 
> Whether that's good for your PP pages or not, I've given no thought:
> I've just taken it on trust that what s390 has working today is good.

Yes, we should be fine with that, current code can be trusted :-)

> 
> If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(),
> then I would not have written "(which page_table_free_rcu() does not
> actually guarantee)".  But it cannot use call_rcu() because it does
> not have an rcu_head to work with - it's in some generic code, and
> there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set.
> 
> And Jason would have much preferred us to address the issue from that
> angle; but not only would doing so destroy my sanity, I'd also destroy
> 20 architectures TLB-flushing, unbuilt and untested, in the attempt.

Oh yes, if your changes would have allowed to get rid of that "semi RCU"
logic, that would really be a major boost in popularity, I guess. But
it probably is as it is, because it is not so easily fixed...

> 
> ...
> > > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > >  		 */
> > >  		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> > >  		mask >>= 24;
> > > -		if (mask & 0x03U)
> > > +		if ((mask & 0x03U) && !PageActive(page)) {
> > > +			/*
> > > +			 * Other half is allocated, and neither half has had
> > > +			 * its free deferred: add page to head of list, to make
> > > +			 * this freed half available for immediate reuse.
> > > +			 */
> > >  			list_add(&page->lru, &mm->context.pgtable_list);
> > > -		else
> > > -			list_del(&page->lru);
> > > +		} else {
> > > +			/* If page is on list, now remove it. */
> > > +			list_del_init(&page->lru);
> > > +		}  
> > 
> > Ok, we might end up with some unnecessary list_del_init() here, e.g. if
> > other half is still allocated, when called from pte_free_defer() on a
> > fully allocated page, which was not on the list (and with PageActive, and
> > (mask & 0x03U) true).
> > Not sure if adding an additional mask check to the else path would be
> > needed, but it seems that list_del_init() should also be able to handle
> > this.  
> 
> list_del_init() is very cheap in the unnecessary case: the cachelines
> required are already there.  You don't want a flag to say whether to
> call it or not, it is already the efficient approach.

Yes, I also see no functional issue here. Just thought that the extra
write could be avoided, e.g. by checking for list_empty() or mask first.
But I guess that is simply the benefit of list_del_init(), that you
don't have to check, at least if it is guaranteed that rcu_head is
never in use here.

Then maybe adjust the comment, because now it makes you wonder, when
you read (and understand) the code, you see that this list_del_init()
might also be called for pages not on the list.

> 
> (But you were right not to use it in your pt_frag_refcount version,
> because there we were still trying to do the call_rcu() per fragment
> rather than per page, so page->lru could have been on the RCU queue.)

That is actually the one thing I still try to figure out, by drawing
pictures, i.e. if we really really never end up here on list_del_init(),
while using rcu_head, e.g. by racing PageActive.

> 
> > 
> > Same thought applies to the similar logic in page_table_free_rcu()
> > below.
> >   
> > >  		spin_unlock_bh(&mm->context.lock);
> > >  		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> > >  		mask >>= 24;
> > > @@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > >  	}
> > >  
> > >  	page_table_release_check(page, table, half, mask);
> > > -	pgtable_pte_page_dtor(page);
> > > -	__free_page(page);
> > > +	if (TestClearPageActive(page))
> > > +		call_rcu(&page->rcu_head, pte_free_now);
> > > +	else
> > > +		pte_free_now(&page->rcu_head);  
> > 
> > This ClearPageActive, and the similar thing in __tlb_remove_table() below,
> > worries me a bit, because it is done outside the spin_lock. It "feels" like
> > there could be some race with the PageActive checks inside the spin_lock,
> > but when drawing some pictures, I could not find any such scenario yet.
> > Also, our existing spin_lock is probably not supposed to protect against
> > PageActive changes anyway, right?  
> 
> Here (and similarly in __tlb_remove_table()) is where we are about to free
> the page table page: both of the fragments have already been released,
> there is nobody left who could be racing against us to set PageActive.

Yes, that is what makes this approach so nice, i.e. no more checking
for HH bits or worry about double call_rcu(), simply do the the freeing
whenever the page is ready. At least in theory, still drawing pictures :-)

But this really looks very good to me, and also works with LTP not worse
than the other approaches.
Hugh Dickins July 6, 2023, 12:52 a.m. UTC | #22
On Wed, 5 Jul 2023, Alexander Gordeev wrote:
> On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:
> > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> 
> Hi Hugh,
> 
> ...
> 
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > +	struct page *page;
> 
> If I got your and Claudio conversation right, you were going to add
> here WARN_ON_ONCE() in case of mm_alloc_pgste(mm)?

Well, Claudio approved, so I would have put it in, if we had stuck with
that version which had "if (mm_alloc_pgste(mm)) {" in pte_free_defer();
but once that went away, it became somewhat irrelevant... to me anyway.

But I don't mind adding it here, in the v3 I'll post when -rc1 is out,
if it might help you guys - there is some point, since pte_free_defer()
is a route which can usefully check for such a case, without confusion
from harmless traffic from immediate frees of just-in-case allocations.

But don't expect it to catch all such cases (if they exist): another
category of s390 page_table_free()s comes from the PageAnon
zap_deposited_table() in zap_huge_pmd(): those tables might or might
not have been exposed to userspace at some time in the past.

I'll add the WARN_ON_ONCE in pte_free_defer() (after checking that
WARN_ON_ONCE is the one we want - I get confused by all the different
flavours of WARN, and have to check the header file each time to be
sure of the syntax and semantics): but be aware that it won't be
checking all potential cases.

Hugh

> 
> > +	page = virt_to_page(pgtable);
> > +	SetPageActive(page);
> > +	page_table_free(mm, (unsigned long *)pgtable);
> > +}
> > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > +
> >  /*
> >   * Base infrastructure required to generate basic asces, region, segment,
> >   * and page tables that do not make use of enhanced features like EDAT1.
> 
> Thanks!
Hugh Dickins July 6, 2023, 1:20 a.m. UTC | #23
On Wed, 5 Jul 2023, Gerald Schaefer wrote:
> On Tue, 4 Jul 2023 10:03:57 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> > On Tue, 4 Jul 2023, Gerald Schaefer wrote:
> > > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> > > Hugh Dickins <hughd@google.com> wrote:  
> > > > On Thu, 29 Jun 2023, Hugh Dickins wrote:  
> > ...
> > > > --- a/arch/s390/mm/pgalloc.c
> > > > +++ b/arch/s390/mm/pgalloc.c
> > > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> > > >   * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
> > > >   * while the PP bits are never used, nor such a page is added to or removed
> > > >   * from mm_context_t::pgtable_list.
> > > > + *
> > > > + * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
> > > > + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> > > > + * guarantee that its pgtable cannot be reused before the RCU grace period
> > > > + * has elapsed (which page_table_free_rcu() does not actually guarantee).  
> > > 
> > > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> > > allow reuse before grace period elapsed. And I hope that it does so, by
> > > setting the PP bits, which would be noticed in page_table_alloc(), in
> > > case the page would be seen there.
> > > 
> > > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
> > > end of the list, and so they could be seen in page_table_alloc(), but they
> > > should not be reused before grace period elapsed and __tlb_remove_table()
> > > cleared the PP bits, as far as I understand.
> > > 
> > > So what exactly do you mean with "which page_table_free_rcu() does not actually
> > > guarantee"?  
> > 
> > I'll answer without locating and re-reading what Jason explained earlier,
> > perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
> > he may have explained it better.  And without working out again all the
> > MMU_GATHER #defines, and which of them do and do not apply to s390 here.
> > 
> > The detail that sticks in my mind is the fallback in tlb_remove_table()
> 
> Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(),
> but that is rather a generic issue, and not s390-specific.

Yes.

> I thought you
> meant some s390-oddity here, of which we have a lot, unfortunately...
> Of course, we call tlb_remove_table() from our page_table_free_rcu(), so
> I guess you could say that page_table_free_rcu() cannot guarantee what
> tlb_remove_table() cannot guarantee.
> 
> Maybe change to "which page_table_free_rcu() does not actually guarantee,
> by calling tlb_remove_table()", to make it clear that this is not a problem
> of page_table_free_rcu() itself.

Okay - I'll rephrase slightly to avoid being sued by s390's lawyers :-)

> 
> > in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
> > batch the tables for freeing by RCU, and resorts instead to an immediate 
> > TLB flush (I think: that again involves chasing definitions) followed by
> > tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU,
> > and is commented: 
> > /*
> >  * This isn't an RCU grace period and hence the page-tables cannot be
> >  * assumed to be actually RCU-freed.
> >  *
> >  * It is however sufficient for software page-table walkers that rely on
> >  * IRQ disabling.
> >  */
> > 
> > Whether that's good for your PP pages or not, I've given no thought:
> > I've just taken it on trust that what s390 has working today is good.
> 
> Yes, we should be fine with that, current code can be trusted :-)

Glad to hear it :-)  Yes, I think it's not actually relying on the "rcu"
implied by the function name.

> 
> > 
> > If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(),
> > then I would not have written "(which page_table_free_rcu() does not
> > actually guarantee)".  But it cannot use call_rcu() because it does
> > not have an rcu_head to work with - it's in some generic code, and
> > there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set.
> > 
> > And Jason would have much preferred us to address the issue from that
> > angle; but not only would doing so destroy my sanity, I'd also destroy
> > 20 architectures TLB-flushing, unbuilt and untested, in the attempt.
> 
> Oh yes, if your changes would have allowed to get rid of that "semi RCU"
> logic, that would really be a major boost in popularity, I guess. But
> it probably is as it is, because it is not so easily fixed...

I'm hoping that this series might help stir someone else to get into that.

> 
> > 
> > ...
> > > > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > > >  		 */
> > > >  		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> > > >  		mask >>= 24;
> > > > -		if (mask & 0x03U)
> > > > +		if ((mask & 0x03U) && !PageActive(page)) {
> > > > +			/*
> > > > +			 * Other half is allocated, and neither half has had
> > > > +			 * its free deferred: add page to head of list, to make
> > > > +			 * this freed half available for immediate reuse.
> > > > +			 */
> > > >  			list_add(&page->lru, &mm->context.pgtable_list);
> > > > -		else
> > > > -			list_del(&page->lru);
> > > > +		} else {
> > > > +			/* If page is on list, now remove it. */
> > > > +			list_del_init(&page->lru);
> > > > +		}  
> > > 
> > > Ok, we might end up with some unnecessary list_del_init() here, e.g. if
> > > other half is still allocated, when called from pte_free_defer() on a
> > > fully allocated page, which was not on the list (and with PageActive, and
> > > (mask & 0x03U) true).
> > > Not sure if adding an additional mask check to the else path would be
> > > needed, but it seems that list_del_init() should also be able to handle
> > > this.  
> > 
> > list_del_init() is very cheap in the unnecessary case: the cachelines
> > required are already there.  You don't want a flag to say whether to
> > call it or not, it is already the efficient approach.
> 
> Yes, I also see no functional issue here. Just thought that the extra
> write could be avoided, e.g. by checking for list_empty() or mask first.
> But I guess that is simply the benefit of list_del_init(), that you
> don't have to check, at least if it is guaranteed that rcu_head is
> never in use here.
> 
> Then maybe adjust the comment, because now it makes you wonder, when
> you read (and understand) the code, you see that this list_del_init()
> might also be called for pages not on the list.

Sorry, I don't understand what clarification you're asking for there.
I thought
			/* If page is on list, now remove it. */
			list_del_init(&page->lru);
was good enough comment.

(I certainly don't want to enumerate the cases when it is or is not
already on the list there, that would be misery; but I don't think
that's the adjustment you were asking for either.)

> 
> > 
> > (But you were right not to use it in your pt_frag_refcount version,
> > because there we were still trying to do the call_rcu() per fragment
> > rather than per page, so page->lru could have been on the RCU queue.)
> 
> That is actually the one thing I still try to figure out, by drawing
> pictures, i.e. if we really really never end up here on list_del_init(),
> while using rcu_head, e.g. by racing PageActive.

There is no race with PageActive being seen when the table page is
finally to be freed (by RCU or not).  But there is definitely a harmless
race with pte_free_defer()er of other half setting PageActive an instant
after page_table_free() checked PageActive here.  So maybe this
page_table_free() does a list_add(), which the racer then list_del_init()s
when it gets the mm->context.lock; or maybe they both list_del_init().

> 
> > 
> > > 
> > > Same thought applies to the similar logic in page_table_free_rcu()
> > > below.
> > >   
> > > >  		spin_unlock_bh(&mm->context.lock);
> > > >  		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> > > >  		mask >>= 24;
> > > > @@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > > >  	}
> > > >  
> > > >  	page_table_release_check(page, table, half, mask);
> > > > -	pgtable_pte_page_dtor(page);
> > > > -	__free_page(page);
> > > > +	if (TestClearPageActive(page))
> > > > +		call_rcu(&page->rcu_head, pte_free_now);
> > > > +	else
> > > > +		pte_free_now(&page->rcu_head);  
> > > 
> > > This ClearPageActive, and the similar thing in __tlb_remove_table() below,
> > > worries me a bit, because it is done outside the spin_lock. It "feels" like
> > > there could be some race with the PageActive checks inside the spin_lock,
> > > but when drawing some pictures, I could not find any such scenario yet.
> > > Also, our existing spin_lock is probably not supposed to protect against
> > > PageActive changes anyway, right?  
> > 
> > Here (and similarly in __tlb_remove_table()) is where we are about to free
> > the page table page: both of the fragments have already been released,
> > there is nobody left who could be racing against us to set PageActive.
> 
> Yes, that is what makes this approach so nice, i.e. no more checking
> for HH bits or worry about double call_rcu(), simply do the the freeing
> whenever the page is ready. At least in theory, still drawing pictures :-)

Please do keep drawing: and perhaps you can sell them afterwards :-)

> 
> But this really looks very good to me, and also works with LTP not worse
> than the other approaches.

Great, thanks for all your help Gerald.

Hugh
Gerald Schaefer July 6, 2023, 3:02 p.m. UTC | #24
On Wed, 5 Jul 2023 18:20:21 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Wed, 5 Jul 2023, Gerald Schaefer wrote:
> > On Tue, 4 Jul 2023 10:03:57 -0700 (PDT)
> > Hugh Dickins <hughd@google.com> wrote:  
> > > On Tue, 4 Jul 2023, Gerald Schaefer wrote:  
> > > > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> > > > Hugh Dickins <hughd@google.com> wrote:    
> > > > > On Thu, 29 Jun 2023, Hugh Dickins wrote:    
> > > ...  
> > > > > --- a/arch/s390/mm/pgalloc.c
> > > > > +++ b/arch/s390/mm/pgalloc.c
> > > > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> > > > >   * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
> > > > >   * while the PP bits are never used, nor such a page is added to or removed
> > > > >   * from mm_context_t::pgtable_list.
> > > > > + *
> > > > > + * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
> > > > > + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> > > > > + * guarantee that its pgtable cannot be reused before the RCU grace period
> > > > > + * has elapsed (which page_table_free_rcu() does not actually guarantee).    
> > > > 
> > > > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> > > > allow reuse before grace period elapsed. And I hope that it does so, by
> > > > setting the PP bits, which would be noticed in page_table_alloc(), in
> > > > case the page would be seen there.
> > > > 
> > > > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
> > > > end of the list, and so they could be seen in page_table_alloc(), but they
> > > > should not be reused before grace period elapsed and __tlb_remove_table()
> > > > cleared the PP bits, as far as I understand.
> > > > 
> > > > So what exactly do you mean with "which page_table_free_rcu() does not actually
> > > > guarantee"?    
> > > 
> > > I'll answer without locating and re-reading what Jason explained earlier,
> > > perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
> > > he may have explained it better.  And without working out again all the
> > > MMU_GATHER #defines, and which of them do and do not apply to s390 here.
> > > 
> > > The detail that sticks in my mind is the fallback in tlb_remove_table()  
> > 
> > Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(),
> > but that is rather a generic issue, and not s390-specific.  
> 
> Yes.
> 
> > I thought you
> > meant some s390-oddity here, of which we have a lot, unfortunately...
> > Of course, we call tlb_remove_table() from our page_table_free_rcu(), so
> > I guess you could say that page_table_free_rcu() cannot guarantee what
> > tlb_remove_table() cannot guarantee.
> > 
> > Maybe change to "which page_table_free_rcu() does not actually guarantee,
> > by calling tlb_remove_table()", to make it clear that this is not a problem
> > of page_table_free_rcu() itself.  
> 
> Okay - I'll rephrase slightly to avoid being sued by s390's lawyers :-)
> 
> >   
> > > in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
> > > batch the tables for freeing by RCU, and resorts instead to an immediate 
> > > TLB flush (I think: that again involves chasing definitions) followed by
> > > tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU,
> > > and is commented: 
> > > /*
> > >  * This isn't an RCU grace period and hence the page-tables cannot be
> > >  * assumed to be actually RCU-freed.
> > >  *
> > >  * It is however sufficient for software page-table walkers that rely on
> > >  * IRQ disabling.
> > >  */
> > > 
> > > Whether that's good for your PP pages or not, I've given no thought:
> > > I've just taken it on trust that what s390 has working today is good.  
> > 
> > Yes, we should be fine with that, current code can be trusted :-)  
> 
> Glad to hear it :-)  Yes, I think it's not actually relying on the "rcu"
> implied by the function name.

Ah ok, now I get it. I never noticed that naming it "free_rcu" could be
misleading. It is only ever called from pte_free_tlb(), so always in that
"semi-RCU" context. If you just look at the name, you could expect it
to always free pagetables by RCU, which would be exactly what you need
for pte_free_defer(), and which of course cannot be guaranteed by our
page_table_free_rcu().

IOW, exactly what your comment says, and now I think it is actually fine
as it is :-)

I guess I am a bit lamebrained this week, due to early shift and not
enough sleep...

> 
> >   
> > > 
> > > If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(),
> > > then I would not have written "(which page_table_free_rcu() does not
> > > actually guarantee)".  But it cannot use call_rcu() because it does
> > > not have an rcu_head to work with - it's in some generic code, and
> > > there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set.
> > > 
> > > And Jason would have much preferred us to address the issue from that
> > > angle; but not only would doing so destroy my sanity, I'd also destroy
> > > 20 architectures TLB-flushing, unbuilt and untested, in the attempt.  
> > 
> > Oh yes, if your changes would have allowed to get rid of that "semi RCU"
> > logic, that would really be a major boost in popularity, I guess. But
> > it probably is as it is, because it is not so easily fixed...  
> 
> I'm hoping that this series might help stir someone else to get into that.
> 
> >   
> > > 
> > > ...  
> > > > > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > > > >  		 */
> > > > >  		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> > > > >  		mask >>= 24;
> > > > > -		if (mask & 0x03U)
> > > > > +		if ((mask & 0x03U) && !PageActive(page)) {
> > > > > +			/*
> > > > > +			 * Other half is allocated, and neither half has had
> > > > > +			 * its free deferred: add page to head of list, to make
> > > > > +			 * this freed half available for immediate reuse.
> > > > > +			 */
> > > > >  			list_add(&page->lru, &mm->context.pgtable_list);
> > > > > -		else
> > > > > -			list_del(&page->lru);
> > > > > +		} else {
> > > > > +			/* If page is on list, now remove it. */
> > > > > +			list_del_init(&page->lru);
> > > > > +		}    
> > > > 
> > > > Ok, we might end up with some unnecessary list_del_init() here, e.g. if
> > > > other half is still allocated, when called from pte_free_defer() on a
> > > > fully allocated page, which was not on the list (and with PageActive, and
> > > > (mask & 0x03U) true).
> > > > Not sure if adding an additional mask check to the else path would be
> > > > needed, but it seems that list_del_init() should also be able to handle
> > > > this.    
> > > 
> > > list_del_init() is very cheap in the unnecessary case: the cachelines
> > > required are already there.  You don't want a flag to say whether to
> > > call it or not, it is already the efficient approach.  
> > 
> > Yes, I also see no functional issue here. Just thought that the extra
> > write could be avoided, e.g. by checking for list_empty() or mask first.
> > But I guess that is simply the benefit of list_del_init(), that you
> > don't have to check, at least if it is guaranteed that rcu_head is
> > never in use here.
> > 
> > Then maybe adjust the comment, because now it makes you wonder, when
> > you read (and understand) the code, you see that this list_del_init()
> > might also be called for pages not on the list.  
> 
> Sorry, I don't understand what clarification you're asking for there.
> I thought
> 			/* If page is on list, now remove it. */
> 			list_del_init(&page->lru);
> was good enough comment.
> 
> (I certainly don't want to enumerate the cases when it is or is not
> already on the list there, that would be misery; but I don't think
> that's the adjustment you were asking for either.)

I was misled by the comment saying "If page is on list", in an
else path that we can also reach for pages which are not on the list
any more. I guess I would have added something like "it is also ok to do
list_del_init() here for pages not on the list". But thinking again,
that would probably just be a reminder of how list_del_init() works,
which should be obvious anyway, at least for people with enough sleep.

> 
> >   
> > > 
> > > (But you were right not to use it in your pt_frag_refcount version,
> > > because there we were still trying to do the call_rcu() per fragment
> > > rather than per page, so page->lru could have been on the RCU queue.)  
> > 
> > That is actually the one thing I still try to figure out, by drawing
> > pictures, i.e. if we really really never end up here on list_del_init(),
> > while using rcu_head, e.g. by racing PageActive.  
> 
> There is no race with PageActive being seen when the table page is
> finally to be freed (by RCU or not).  But there is definitely a harmless
> race with pte_free_defer()er of other half setting PageActive an instant
> after page_table_free() checked PageActive here.  So maybe this
> page_table_free() does a list_add(), which the racer then list_del_init()s
> when it gets the mm->context.lock; or maybe they both list_del_init().

Agree.

Since none of my remarks on the comments seem valid or strictly necessary
any more, and I also could not find functional issues, I think you can add
this patch as new version for 07/12. And I can now give you this:

Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Hugh Dickins July 6, 2023, 7:45 p.m. UTC | #25
On Thu, 6 Jul 2023, Gerald Schaefer wrote:
> 
> Since none of my remarks on the comments seem valid or strictly necessary
> any more, and I also could not find functional issues, I think you can add
> this patch as new version for 07/12. And I can now give you this:
> 
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>

Great, thanks a lot Gerald.
The one change I'm making to it is then this merged in:

--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -455,6 +455,11 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 	page = virt_to_page(pgtable);
 	SetPageActive(page);
 	page_table_free(mm, (unsigned long *)pgtable);
+	/*
+	 * page_table_free() does not do the pgste gmap_unlink() which
+	 * page_table_free_rcu() does: warn us if pgste ever reaches here.
+	 */
+	WARN_ON_ONCE(mm_alloc_pgste(mm));
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
Gerald Schaefer July 7, 2023, 2:37 p.m. UTC | #26
On Wed, 5 Jul 2023 17:52:40 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Wed, 5 Jul 2023, Alexander Gordeev wrote:
> > On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:  
> > > On Thu, 29 Jun 2023, Hugh Dickins wrote:  
> > 
> > Hi Hugh,
> > 
> > ...
> >   
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > +{
> > > +	struct page *page;  
> > 
> > If I got your and Claudio's conversation right, you were going to
> > add WARN_ON_ONCE() here in case of mm_alloc_pgste(mm)?

Good point, thanks Alexander for noticing!

> 
> Well, Claudio approved, so I would have put it in, if we had stuck with
> that version which had "if (mm_alloc_pgste(mm)) {" in pte_free_defer();
> but once that went away, it became somewhat irrelevant... to me anyway.
> 
> But I don't mind adding it here, in the v3 I'll post when -rc1 is out,
> if it might help you guys - there is some point, since pte_free_defer()
> is a route which can usefully check for such a case, without confusion
> from harmless traffic from immediate frees of just-in-case allocations.
> 
> But don't expect it to catch all such cases (if they exist): another
> category of s390 page_table_free()s comes from the PageAnon
> zap_deposited_table() in zap_huge_pmd(): those tables might or might
> not have been exposed to userspace at some time in the past.

Right, after THP collapse, the previously active PTE table would be
deposited in this case, and then later freed in zap_deposited_table().
I guess we need to be very careful if THP was ever enabled for KVM
guests.

> 
> I'll add the WARN_ON_ONCE in pte_free_defer() (after checking that
> WARN_ON_ONCE is the one we want - I get confused by all the different
> flavours of WARN, and have to check the header file each time to be
> sure of the syntax and semantics): but be aware that it won't be
> checking all potential cases.

Thanks, looks good.
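
For the record, the semantics being relied on are roughly the following
(a simplified userspace model of WARN_ON_ONCE(), not the real
<asm-generic/bug.h> macro, which also dumps a stack trace and taints
the kernel): evaluate the condition every time, warn only the first
time it is true at a given call site, and return the condition so it
can be tested.

#include <stdbool.h>
#include <stdio.h>

/*
 * Simplified model of WARN_ON_ONCE() as a GCC statement expression:
 * the per-call-site static ensures the message is printed only once,
 * while the condition's value is always handed back to the caller.
 */
#define WARN_ON_ONCE(cond) ({						\
	static bool __warned;						\
	bool __ret = !!(cond);						\
	if (__ret && !__warned) {					\
		__warned = true;					\
		fprintf(stderr, "WARNING at %s:%d: %s\n",		\
			__FILE__, __LINE__, #cond);			\
	}								\
	__ret;								\
})

int main(void)
{
	for (int i = 0; i < 3; i++)
		WARN_ON_ONCE(i > 0);	/* fires once, at i == 1 only */
	return 0;
}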
Jason Gunthorpe July 10, 2023, 5:21 p.m. UTC | #27
On Wed, Jul 05, 2023 at 02:55:16PM +0200, Gerald Schaefer wrote:

> Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(),
> but that is rather a generic issue, and not s390-specific. I thought you
> meant some s390-oddity here, of which we have a lot, unfortunately...
> Of course, we call tlb_remove_table() from our page_table_free_rcu(), so
> I guess you could say that page_table_free_rcu() cannot guarantee what
> tlb_remove_table() cannot guarantee.

The issue is that the arches don't provide a reliable way to RCU-free
things, so the core code creates an RCU-like situation using the MMU
batch, with a non-RCU-compatible IPI fallback. So it isn't actually
RCU; it is IPI, but optimized with RCU in some cases.

When Hugh introduces a reliable way to RCU-free stuff, we could fall
back to that in the TLB code instead of invoking synchronize_rcu().

For lots of arches, S390 included after this series, this would be
pretty easy.

What I see now as the big trouble is that this series only addresses
PTE RCU'ness, and making all the other levels RCUable would be much
harder on some arches like power.

In short, we could create a CONFIG_ARCH_RCU_SAFE_PAGEWALK and it could
be done on a lot of arches quite simply, but at least not power. Which
makes me wonder about the value, but maybe it could shame power into
doing something...

However, calling things 'page_table_free_rcu()' when it doesn't
actually always do RCU, but IPI-optimized RCU, is an unfortunate name :(
As long as you never assume it does RCU anywhere else, and don't use
rcu_read_lock(), it is fine :)

The corner case is narrow: you have to OOM the TLB batching before you
lose the RCU optimization of the IPI.  Then you can notice that
rcu_read_lock() doesn't actually protect against concurrent free.
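
To make that shape concrete, a userspace sketch of the control flow
(every name below is a placeholder chosen for illustration, not the
actual mm/mmu_gather.c code):

#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch of the "semi-RCU" batching: freed page tables are normally
 * queued in a batch and released only after a grace period, but when
 * no batch can be allocated the code falls back to an IPI broadcast
 * and frees immediately - so in that corner case rcu_read_lock()
 * alone does not protect a lockless walker.
 */
#define BATCH_MAX 8

struct table_batch {
	int nr;
	void *tables[BATCH_MAX];
};

static void ipi_sync_all_cpus(void)
{
	/* stands in for the IPI broadcast to every CPU */
	printf("IPI: wait for all CPUs to leave their lockless walks\n");
}

static void free_batch_after_grace_period(struct table_batch *batch)
{
	/* stands in for call_rcu() on the filled batch */
	printf("RCU: defer freeing a batch of %d tables\n", batch->nr);
	free(batch);
}

static void remove_table(struct table_batch **batchp, void *table)
{
	if (!*batchp)
		*batchp = calloc(1, sizeof(**batchp));
	if (!*batchp) {			/* OOM fallback: no RCU involved */
		ipi_sync_all_cpus();
		printf("free table %p right away\n", table);
		return;
	}
	(*batchp)->tables[(*batchp)->nr++] = table;
	if ((*batchp)->nr == BATCH_MAX) {
		free_batch_after_grace_period(*batchp);
		*batchp = NULL;
	}
}

int main(void)
{
	static char fake_tables[10];	/* stand-ins for pte table pages */
	struct table_batch *batch = NULL;
	int i;

	for (i = 0; i < 10; i++)
		remove_table(&batch, &fake_tables[i]);
	if (batch)			/* flush a partially filled batch */
		free_batch_after_grace_period(batch);
	return 0;
}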

Jason
diff mbox series

Patch

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@  static inline void pmd_populate(struct mm_struct *mm,
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
 
+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..11983a3ff95a 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -159,6 +159,11 @@  void page_table_free_pgste(struct page *page)
 
 #endif /* CONFIG_PGSTE */
 
+/*
+ * Temporarily use a global spinlock instead of mm->context.lock.
+ * This will be replaced by a per-mm spinlock in a followup commit.
+ */
+static DEFINE_SPINLOCK(mm_pgtable_list_lock);
 /*
  * A 2KB-pgtable is either upper or lower half of a normal page.
  * The second half of the page may be unused or used as another
@@ -172,7 +177,7 @@  void page_table_free_pgste(struct page *page)
  * When a parent page gets fully allocated it contains 2KB-pgtables in both
  * upper and lower halves and is removed from mm_context_t::pgtable_list.
  *
- * When 2KB-pgtable is freed from to fully allocated parent page that
+ * When 2KB-pgtable is freed from the fully allocated parent page that
  * page turns partially allocated and added to mm_context_t::pgtable_list.
  *
  * If 2KB-pgtable is freed from the partially allocated parent page that
@@ -182,16 +187,24 @@  void page_table_free_pgste(struct page *page)
  * As follows from the above, no unallocated or fully allocated parent
  * pages are contained in mm_context_t::pgtable_list.
  *
+ * NOTE NOTE NOTE: The commentary above and below has not yet been updated:
+ * the new rule is that a page is not linked to mm_context_t::pgtable_list
+ * while either half is pending free by any method; but afterwards is
+ * either relinked to it, or freed, by __tlb_remove_table().  This allows
+ * pte_free_defer() to use the page->rcu_head (which overlays page->lru).
+ *
  * The upper byte (bits 24-31) of the parent page _refcount is used
  * for tracking contained 2KB-pgtables and has the following format:
  *
- *   PP  AA
- * 01234567    upper byte (bits 24-31) of struct page::_refcount
- *   ||  ||
- *   ||  |+--- upper 2KB-pgtable is allocated
- *   ||  +---- lower 2KB-pgtable is allocated
- *   |+------- upper 2KB-pgtable is pending for removal
- *   +-------- lower 2KB-pgtable is pending for removal
+ *   PPHHAA
+ * 76543210    upper byte (bits 24-31) of struct page::_refcount
+ *   ||||||
+ *   |||||+--- lower 2KB-pgtable is allocated
+ *   ||||+---- upper 2KB-pgtable is allocated
+ *   |||+----- lower 2KB-pgtable is pending free by page->rcu_head
+ *   ||+------ upper 2KB-pgtable is pending free by page->rcu_head
+ *   |+------- lower 2KB-pgtable is pending free by any method
+ *   +-------- upper 2KB-pgtable is pending free by any method
  *
  * (See commit 620b4e903179 ("s390: use _refcount for pgtables") on why
  * using _refcount is possible).
@@ -200,7 +213,7 @@  void page_table_free_pgste(struct page *page)
  * The parent page is either:
  *   - added to mm_context_t::pgtable_list in case the second half of the
  *     parent page is still unallocated;
- *   - removed from mm_context_t::pgtable_list in case both hales of the
+ *   - removed from mm_context_t::pgtable_list in case both halves of the
  *     parent page are allocated;
  * These operations are protected with mm_context_t::lock.
  *
@@ -239,32 +252,22 @@  unsigned long *page_table_alloc(struct mm_struct *mm)
 	/* Try to get a fragment of a 4K page as a 2K page table */
 	if (!mm_alloc_pgste(mm)) {
 		table = NULL;
-		spin_lock_bh(&mm->context.lock);
+		spin_lock_bh(&mm_pgtable_list_lock);
 		if (!list_empty(&mm->context.pgtable_list)) {
 			page = list_first_entry(&mm->context.pgtable_list,
 						struct page, lru);
 			mask = atomic_read(&page->_refcount) >> 24;
-			/*
-			 * The pending removal bits must also be checked.
-			 * Failure to do so might lead to an impossible
-			 * value of (i.e 0x13 or 0x23) written to _refcount.
-			 * Such values violate the assumption that pending and
-			 * allocation bits are mutually exclusive, and the rest
-			 * of the code unrails as result. That could lead to
-			 * a whole bunch of races and corruptions.
-			 */
-			mask = (mask | (mask >> 4)) & 0x03U;
-			if (mask != 0x03U) {
-				table = (unsigned long *) page_to_virt(page);
-				bit = mask & 1;		/* =1 -> second 2K */
-				if (bit)
-					table += PTRS_PER_PTE;
-				atomic_xor_bits(&page->_refcount,
-							0x01U << (bit + 24));
-				list_del(&page->lru);
-			}
+			/* Cannot be on this list if either half pending free */
+			WARN_ON_ONCE(mask & ~0x03U);
+			/* One or other half must be available, but not both */
+			WARN_ON_ONCE(mask == 0x00U || mask == 0x03U);
+			table = (unsigned long *)page_to_virt(page);
+			bit = mask & 0x01U;	/* =1 -> second 2K available */
+			table += bit * PTRS_PER_PTE;
+			atomic_xor_bits(&page->_refcount, 0x01U << (bit + 24));
+			list_del(&page->lru);
 		}
-		spin_unlock_bh(&mm->context.lock);
+		spin_unlock_bh(&mm_pgtable_list_lock);
 		if (table)
 			return table;
 	}
@@ -278,6 +281,7 @@  unsigned long *page_table_alloc(struct mm_struct *mm)
 	}
 	arch_set_page_dat(page, 0);
 	/* Initialize page table */
+	page->pt_mm = mm;
 	table = (unsigned long *) page_to_virt(page);
 	if (mm_alloc_pgste(mm)) {
 		/* Return 4K page table with PGSTEs */
@@ -288,14 +292,14 @@  unsigned long *page_table_alloc(struct mm_struct *mm)
 		/* Return the first 2K fragment of the page */
 		atomic_xor_bits(&page->_refcount, 0x01U << 24);
 		memset64((u64 *)table, _PAGE_INVALID, 2 * PTRS_PER_PTE);
-		spin_lock_bh(&mm->context.lock);
+		spin_lock_bh(&mm_pgtable_list_lock);
 		list_add(&page->lru, &mm->context.pgtable_list);
-		spin_unlock_bh(&mm->context.lock);
+		spin_unlock_bh(&mm_pgtable_list_lock);
 	}
 	return table;
 }
 
-static void page_table_release_check(struct page *page, void *table,
+static void page_table_release_check(struct page *page, unsigned long *table,
 				     unsigned int half, unsigned int mask)
 {
 	char msg[128];
@@ -317,21 +321,18 @@  void page_table_free(struct mm_struct *mm, unsigned long *table)
 	if (!mm_alloc_pgste(mm)) {
 		/* Free 2K page table fragment of a 4K page */
 		bit = ((unsigned long) table & ~PAGE_MASK)/(PTRS_PER_PTE*sizeof(pte_t));
-		spin_lock_bh(&mm->context.lock);
+		spin_lock_bh(&mm_pgtable_list_lock);
 		/*
-		 * Mark the page for delayed release. The actual release
-		 * will happen outside of the critical section from this
-		 * function or from __tlb_remove_table()
+		 * Mark the page for release. The actual release will happen
+		 * below from this function, or later from __tlb_remove_table().
 		 */
-		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
+		mask = atomic_xor_bits(&page->_refcount, 0x01U << (bit + 24));
 		mask >>= 24;
-		if (mask & 0x03U)
+		if (mask & 0x03U)		/* other half is allocated */
 			list_add(&page->lru, &mm->context.pgtable_list);
-		else
+		else if (!(mask & 0x30U))	/* other half not pending */
 			list_del(&page->lru);
-		spin_unlock_bh(&mm->context.lock);
-		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
-		mask >>= 24;
+		spin_unlock_bh(&mm_pgtable_list_lock);
 		if (mask != 0x00U)
 			return;
 		half = 0x01U << bit;
@@ -362,19 +363,17 @@  void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 		return;
 	}
 	bit = ((unsigned long) table & ~PAGE_MASK) / (PTRS_PER_PTE*sizeof(pte_t));
-	spin_lock_bh(&mm->context.lock);
+	spin_lock_bh(&mm_pgtable_list_lock);
 	/*
-	 * Mark the page for delayed release. The actual release will happen
-	 * outside of the critical section from __tlb_remove_table() or from
-	 * page_table_free()
+	 * Mark the page for delayed release.
+	 * The actual release will happen later, from __tlb_remove_table().
 	 */
 	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 	mask >>= 24;
-	if (mask & 0x03U)
-		list_add_tail(&page->lru, &mm->context.pgtable_list);
-	else
+	/* Other half not allocated? Other half not already pending free? */
+	if ((mask & 0x03U) == 0x00U && (mask & 0x30U) != 0x30U)
 		list_del(&page->lru);
-	spin_unlock_bh(&mm->context.lock);
+	spin_unlock_bh(&mm_pgtable_list_lock);
 	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
 	tlb_remove_table(tlb, table);
 }
@@ -382,17 +381,40 @@  void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 void __tlb_remove_table(void *_table)
 {
 	unsigned int mask = (unsigned long) _table & 0x03U, half = mask;
-	void *table = (void *)((unsigned long) _table ^ mask);
+	unsigned long *table = (unsigned long *)((unsigned long) _table ^ mask);
 	struct page *page = virt_to_page(table);
 
 	switch (half) {
 	case 0x00U:	/* pmd, pud, or p4d */
-		free_pages((unsigned long)table, CRST_ALLOC_ORDER);
+		__free_pages(page, CRST_ALLOC_ORDER);
 		return;
 	case 0x01U:	/* lower 2K of a 4K page table */
-	case 0x02U:	/* higher 2K of a 4K page table */
-		mask = atomic_xor_bits(&page->_refcount, mask << (4 + 24));
-		mask >>= 24;
+	case 0x02U:	/* upper 2K of a 4K page table */
+		/*
+		 * If the other half is marked as allocated, page->pt_mm must
+		 * still be valid, page->rcu_head no longer in use so page->lru
+		 * good for use, so now make the freed half available for reuse.
+		 * But be wary of races with that other half being freed.
+		 */
+		if (atomic_read(&page->_refcount) & (0x03U << 24)) {
+			struct mm_struct *mm = page->pt_mm;
+			/*
+			 * It is safe to use page->pt_mm when the other half
+			 * is seen allocated while holding pgtable_list lock;
+			 * but how will it be safe to acquire that spinlock?
+			 * Global mm_pgtable_list_lock is safe and easy for
+			 * now, then a followup commit will split it per-mm.
+			 */
+			spin_lock_bh(&mm_pgtable_list_lock);
+			mask = atomic_xor_bits(&page->_refcount, mask << 28);
+			mask >>= 24;
+			if (mask & 0x03U)
+				list_add(&page->lru, &mm->context.pgtable_list);
+			spin_unlock_bh(&mm_pgtable_list_lock);
+		} else {
+			mask = atomic_xor_bits(&page->_refcount, mask << 28);
+			mask >>= 24;
+		}
 		if (mask != 0x00U)
 			return;
 		break;
@@ -407,6 +429,77 @@  void __tlb_remove_table(void *_table)
 	__free_page(page);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now0(struct rcu_head *head);
+static void pte_free_now1(struct rcu_head *head);
+
+static void pte_free_pgste(struct rcu_head *head)
+{
+	unsigned long *table;
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	table = (unsigned long *)page_to_virt(page);
+	table = (unsigned long *)((unsigned long)table | 0x03U);
+	__tlb_remove_table(table);
+}
+
+static void pte_free_half(struct rcu_head *head, unsigned int bit)
+{
+	unsigned long *table;
+	struct page *page;
+	unsigned int mask;
+
+	page = container_of(head, struct page, rcu_head);
+	mask = atomic_xor_bits(&page->_refcount, 0x04U << (bit + 24));
+
+	table = (unsigned long *)page_to_virt(page);
+	table += bit * PTRS_PER_PTE;
+	table = (unsigned long *)((unsigned long)table | (0x01U << bit));
+	__tlb_remove_table(table);
+
+	/* If pte_free_defer() of the other half came in, queue it now */
+	if (mask & 0x0CU)
+		call_rcu(&page->rcu_head, bit ? pte_free_now0 : pte_free_now1);
+}
+
+static void pte_free_now0(struct rcu_head *head)
+{
+	pte_free_half(head, 0);
+}
+
+static void pte_free_now1(struct rcu_head *head)
+{
+	pte_free_half(head, 1);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	unsigned int bit, mask;
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	if (mm_alloc_pgste(mm)) {
+		call_rcu(&page->rcu_head, pte_free_pgste);
+		return;
+	}
+	bit = ((unsigned long)pgtable & ~PAGE_MASK) /
+			(PTRS_PER_PTE * sizeof(pte_t));
+
+	spin_lock_bh(&mm_pgtable_list_lock);
+	mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));
+	mask >>= 24;
+	/* Other half not allocated? Other half not already pending free? */
+	if ((mask & 0x03U) == 0x00U && (mask & 0x30U) != 0x30U)
+		list_del(&page->lru);
+	spin_unlock_bh(&mm_pgtable_list_lock);
+
+	/* Do not relink on rcu_head if other half already linked on rcu_head */
+	if ((mask & 0x0CU) != 0x0CU)
+		call_rcu(&page->rcu_head, bit ? pte_free_now1 : pte_free_now0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 /*
  * Base infrastructure required to generate basic asces, region, segment,
  * and page tables that do not make use of enhanced features like EDAT1.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..1667a1bdb8a8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -146,7 +146,7 @@  struct page {
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
-				struct mm_struct *pt_mm; /* x86 pgds only */
+				struct mm_struct *pt_mm; /* x86 pgd, s390 */
 				atomic_t pt_frag_refcount; /* powerpc */
 			};
 #if ALLOC_SPLIT_PTLOCKS