3.10-rc ppc64 corrupts usermem when swapping

Message ID 87vc60na89.fsf@linux.vnet.ibm.com (mailing list archive)
State Not Applicable

Commit Message

Aneesh Kumar K.V May 30, 2013, 8:27 a.m. UTC
Benjamin Herrenschmidt <benh@au1.ibm.com> writes:

> On Wed, 2013-05-29 at 22:47 -0700, Hugh Dickins wrote:
>> Running my favourite swapping load (repeated make -j20 kernel builds
>> in tmpfs in parallel with repeated make -j20 kernel builds in ext4 on
>> loop on tmpfs file, all limited by mem=700M and swap 1.5G) on 3.10-rc
>> on PowerMac G5, the test dies with corrupted usermem after a few hours.
>> 
>> Variously, segmentation fault or Binutils assertion fail or gcc Internal
>> error in either or both builds: usually signs of swapping or TLB flushing
>> gone wrong.  Sometimes the tmpfs build breaks first, sometimes the ext4 on
>> loop on tmpfs, so at least it looks unrelated to loop.  No problem on x86.
>> 
>> This is 64-bit kernel but 4k pages and old SuSE 11.1 32-bit userspace.
>> 
>> I've just finished a manual bisection on arch/powerpc/mm (which might
>> have been a wrong guess, but has paid off): the first bad commit is
>> 7e74c3921ad9610c0b49f28b8fc69f7480505841
>> "powerpc: Fix hpte_decode to use the correct decoding for page sizes".
>
> Ok, I have other reasons to think it is wrong. I debugged a case last week
> where after kexec we still had stale TLB entries, due to the TLB cleanup
> not working.
>
> Thanks for doing that bisection! I'll investigate ASAP (though it will
> probably have to wait for tomorrow unless Paul beats me to it)
>
>> I don't know if it's actually swapping to swap that's triggering the
>> problem, or a more general page reclaim or TLB flush problem.  I hit
>> it originally when trying to test Mel Gorman's pagevec series on top
>> of 3.10-rc; and though I then reproduced it without that series, it
>> did seem to take much longer: so I have been applying Mel's series to
>> speed up each step of the bisection.  But if I went back again, might
>> find it was just chance that I hit it sooner with Mel's series than
>> without.  So, you're probably safe to ignore that detail, but I
>> mention it just in case it turns out to have some relevance.
>> 
>> Something else peculiar that I've been doing in these runs, may or may
>> not be relevant: I've been running swapon and swapoff repeatedly in the
>> background, so that we're doing swapoff even while busy building.
>> 
>> I probably can't go into much more detail on the test (it's hard
>> to get the balance right, to be swapping rather than OOMing or just
>> running without reclaim), but can test any patches you'd like me to
>> try (though it may take 24 hours for me to report back usefully).
>
> I think it's just failing to invalidate the TLB properly. At least one
> bug I can spot just looking at it:
>
> static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
> 				   int psize, int ssize, int local)
>
>    .../...
>
> 	native_lock_hpte(hptep);
> 	hpte_v = hptep->v;
>
> 	actual_psize = hpte_actual_psize(hptep, psize);
> 	if (actual_psize < 0) {
> 		native_unlock_hpte(hptep);
> 		local_irq_restore(flags);
> 		return;
> 	}
>
> That's wrong. We must still perform the TLB invalidation even if the
> hash PTE is empty.
>
> In fact, Aneesh, this is a problem with MPSS for your THP work, I just
> thought about it.
>
> The reason is that if a hash bucket gets full, we "evict" a more/less
> random entry from it. When we do that we don't invalidate the TLB
> (hpte_remove) because we assume the old translation is still technically
> "valid".
>

Hmm, that is correct; I missed that. But to do a tlb invalidate we need
both the base and the actual page size. One of the reasons I didn't update
the hpte_invalidate callback to take both page sizes was that PAPR
didn't need them for invalidate (H_REMOVE): hpte_remove did result in a
tlb invalidate there.


> However that means that an hpte_invalidate *must* invalidate the TLB
> later on even if it's not hitting the right entry in the hash.
>
> However, I can see why that cannot work with THP/MPSS since you have no
> way to know the page size from the PTE anymore....
>
> So my question is, apart from hpte_decode used by kexec, which I will
> fix by just blowing the whole TLB when not running phyp, why do you need
> the "actual" size in invalidate and updatepp ? You really can't rely on
> the size passed by the upper layers ?

So for upstream I have the patch below, which should address the
above. Meanwhile I will see what the impact would be of doing a tlb
invalidate in hpte_remove, so that we can keep the lpar and native
changes similar.

Comments

Benjamin Herrenschmidt May 30, 2013, 8:33 a.m. UTC | #1
On Thu, 2013-05-30 at 13:57 +0530, Aneesh Kumar K.V wrote:
> +               /* FIXME!!, will fail when we enable hugepage
> support */

Just fix that to say "Transparent huge pages" as normal huge pages
should work fine unless I'm missing something.

Hugh, any chance you can give that a spin?

Cheers,
Ben.
Hugh Dickins May 30, 2013, 1:48 p.m. UTC | #2
On Thu, 30 May 2013, Benjamin Herrenschmidt wrote:
> On Thu, 2013-05-30 at 13:57 +0530, Aneesh Kumar K.V wrote:
> > +               /* FIXME!!, will fail when we enable hugepage
> > support */
> 
> Just fix that to say "Transparent huge pages" as normal huge pages
> should work fine unless I'm missing something.
> 
> Hugh, any chance you can give that a spin?

Sure, it's now under way.  If all goes well, I'll give you a
progress report in about 15 hours time; but given the variance in
how long it took to hit, I won't feel fully confident until this
time tomorrow, when I'll update you again.

Thank you both for the great response!

Hugh
Hugh Dickins May 31, 2013, 5:05 a.m. UTC | #3
On Thu, 30 May 2013, Hugh Dickins wrote:
> On Thu, 30 May 2013, Benjamin Herrenschmidt wrote:
> > On Thu, 2013-05-30 at 13:57 +0530, Aneesh Kumar K.V wrote:
> > > +               /* FIXME!!, will fail when we enable hugepage
> > > support */
> > 
> > Just fix that to say "Transparent huge pages" as normal huge pages
> > should work fine unless I'm missing something.
> > 
> > Hugh, any chance you can give that a spin?
> 
> Sure, it's now under way.  If all goes well, I'll give you a
> progress report in about 15 hours time; but given the variance in
> how long it took to hit, I won't feel fully confident until this
> time tomorrow, when I'll update you again.

Still running fine.  I'll leave it running a few more hours to make
sure, and then try switching to the patch Aneesh sent afterwards -
or say if you'd prefer me to switch over to that one immediately.

Hugh
Benjamin Herrenschmidt May 31, 2013, 5:31 a.m. UTC | #4
On Thu, 2013-05-30 at 22:05 -0700, Hugh Dickins wrote:
> > Sure, it's now under way.  If all goes well, I'll give you a
> > progress report in about 15 hours time; but given the variance in
> > how long it took to hit, I won't feel fully confident until this
> > time tomorrow, when I'll update you again.
> 
> Still running fine.  I'll leave it running a few more hours to make
> sure, and then try switching to the patch Aneesh sent afterwards -
> or say if you'd prefer me to switch over to that one immediately.

The patch you are running on is what I'll send to Linus for 3.10 (+/-
cosmetics). Aneesh's second patch is a much larger rework which will be
needed for THP, but that will wait for 3.11. I'm happy for you to test it,
but I first want to make sure it's solid with the 3.10 fix :-)

Thanks!

Cheers,
Ben.
Hugh Dickins May 31, 2013, 2:45 p.m. UTC | #5
On Fri, 31 May 2013, Benjamin Herrenschmidt wrote:
> On Thu, 2013-05-30 at 22:05 -0700, Hugh Dickins wrote:
> > > Sure, it's now under way.  If all goes well, I'll give you a
> > > progress report in about 15 hours time; but given the variance in
> > > how long it took to hit, I won't feel fully confident until this
> > > time tomorrow, when I'll update you again.
> > 
> > Still running fine.  I'll leave it running a few more hours to make
> > sure, and then try switching to the patch Aneesh sent afterwards -
> > or say if you'd prefer me to switch over to that one immediately.
> 
> The patch you are running on is what I'll send to Linus for 3.10 (+/-
> cosmetics). Aneesh's second patch is a much larger rework which will be
> needed for THP but that will wait for 3.11. I'm happy for you to test it
> but I first want to make sure it's solid with the 3.10 fix :-)

That makes sense.
The first patch ran fine without incident for 25 hours; I'd say it's good.
I've now stopped that run and started another with the second patch in.

Hugh
Benjamin Herrenschmidt May 31, 2013, 10:23 p.m. UTC | #6
On Fri, 2013-05-31 at 14:45 +0530, Aneesh Kumar K.V wrote:

> > The patch you are running on is what I'll send to Linus for 3.10 (+/-
> > cosmetics). Aneesh's second patch is a much larger rework which will be
> > needed for THP but that will wait for 3.11. I'm happy for you to test it
> > but I first want to make sure it's solid with the 3.10 fix :-)

BTW. One concern I still have is that Hugh identified the bad commit
to be:

7e74c3921ad9610c0b49f28b8fc69f7480505841
"powerpc: Fix hpte_decode to use the correct decoding for page sizes".

However, you introduce the return on HPTE not found earlier, in

b1022fbd293564de91596b8775340cf41ad5214c
"powerpc: Decode the pte-lp-encoding bits correctly."

So while I'm still happy with the current band-aid for 3.10 and am
about to send it to Linus, the above *does* seem to indicate that
there is also something wrong with the "Fix hpte_decode..." commit,
which might not actually get the page size right...

Can you investigate?

Cheers,
Ben.
Aneesh Kumar K.V June 2, 2013, 7:22 a.m. UTC | #7
Benjamin Herrenschmidt <benh@au1.ibm.com> writes:

> On Fri, 2013-05-31 at 14:45 +0530, Aneesh Kumar K.V wrote:
>
>> > The patch you are running on is what I'll send to Linus for 3.10 (+/-
>> > cosmetics). Aneesh's second patch is a much larger rework which will be
>> > needed for THP but that will wait for 3.11. I'm happy for you to test it
>> > but I first want to make sure it's solid with the 3.10 fix :-)
>
> BTW. One concern I still have is that Hugh identified the bad commit
> to be:
>
> 7e74c3921ad9610c0b49f28b8fc69f7480505841
> "powerpc: Fix hpte_decode to use the correct decoding for page sizes".
>
> However, you introduce the return on HPTE not found earlier, in
>
> b1022fbd293564de91596b8775340cf41ad5214c
> "powerpc: Decode the pte-lp-encoding bits correctly."
>
> So while I'm still happy with the current band-aid for 3.10 and am
> about to send it to Linus, the above *does* seem to indicate that
> there is also something wrong with the "Fix hpte_decode..." commit,
> which might not actually get the page size right...
>
> Can you investigate?

The changes in 7e74c3921ad9610c0b49f28b8fc69f7480505841
("powerpc: Fix hpte_decode to use the correct decoding for page sizes")
should only impact hpte_decode. We don't change the details
of hpte_actual_psize at all in this patch. That means we should see a
difference only with kexec, right?

Hugh,

Will you be able to double-check whether
7e74c3921ad9610c0b49f28b8fc69f7480505841 is the bad commit? The one
before that is what we changed in the patch that fixed your problem.

-aneesh
Hugh Dickins June 2, 2013, 6:19 p.m. UTC | #8
On Sun, 2 Jun 2013, Aneesh Kumar K.V wrote:
> Benjamin Herrenschmidt <benh@au1.ibm.com> writes:
> > On Fri, 2013-05-31 at 14:45 +0530, Aneesh Kumar K.V wrote:
> >
> >> > The patch you are running on is what I'll send to Linus for 3.10 (+/-
> >> > cosmetics). Aneesh's second patch is a much larger rework which will be
> >> > needed for THP but that will wait for 3.11. I'm happy for you to test it
> >> > but I first want to make sure it's solid with the 3.10 fix :-)
> >
> > BTW. One concern I still have is that Hugh identified the bad commit
> > to be:
> >
> > 7e74c3921ad9610c0b49f28b8fc69f7480505841
> > "powerpc: Fix hpte_decode to use the correct decoding for page sizes".
> >
> > However, you introduce the return on HPTE not found earlier, in
> >
> > b1022fbd293564de91596b8775340cf41ad5214c
> > "powerpc: Decode the pte-lp-encoding bits correctly."
> >
> > So while I'm still happy with the current band-aid for 3.10 and am
> > about to send it to Linus, the above *does* seem to indicate that
> > there is also something wrong with the "Fix hpte_decode..." commit,
> > which might not actually get the page size right...
> >
> > Can you investigate?
> 
> 7e74c3921ad9610c0b49f28b8fc69f7480505841 
> "powerpc: Fix hpte_decode to use the correct decoding for page sizes"
> changes should only impact hpte_decode. We don't change the details
> of hpte_actual_psize at all in this patch. That means we should see a
> difference only with kexec, right?
> 
> Hugh,
> 
> Will you be able to double-check whether
> 7e74c3921ad9610c0b49f28b8fc69f7480505841 is the bad commit? The one
> before that is what we changed in the patch that fixed your problem.

You are absolutely right.  I just set b1022fbd29 going, expecting
to answer you tomorrow, but got a Segmentation fault in 20 minutes
(quicker than ever seen before).  It looks as if I was running some
other kernel for the last stage of my bisection: I can't see how that
came about, but it's not very interesting now - you got it right.

Prior to trying that, I had been running your second patch, 9f70fd8cfe,
and that tested out successfully for 50 hours before I stopped it.

Hugh
Patch

diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index 6a2aead..6d1bd81 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -336,11 +336,19 @@  static long native_hpte_updatepp(unsigned long slot, unsigned long newpp,
 
 	hpte_v = hptep->v;
 	actual_psize = hpte_actual_psize(hptep, psize);
+	/*
+	 * We need to invalidate the TLB always because hpte_remove doesn't do
+	 * a tlb invalidate. If a hash bucket gets full, we "evict" a more/less
+	 * random entry from it. When we do that we don't invalidate the TLB
+	 * (hpte_remove) because we assume the old translation is still technically
+	 * "valid".
+	 */
 	if (actual_psize < 0) {
-		native_unlock_hpte(hptep);
-		return -1;
+	/* FIXME!!, will fail when we enable hugepage support */
+		actual_psize = psize;
+		ret = -1;
+		goto err_out;
 	}
-	/* Even if we miss, we need to invalidate the TLB */
 	if (!HPTE_V_COMPARE(hpte_v, want_v)) {
 		DBG_LOW(" -> miss\n");
 		ret = -1;
@@ -350,6 +358,7 @@  static long native_hpte_updatepp(unsigned long slot, unsigned long newpp,
 		hptep->r = (hptep->r & ~(HPTE_R_PP | HPTE_R_N)) |
 			(newpp & (HPTE_R_PP | HPTE_R_N | HPTE_R_C));
 	}
+err_out:
 	native_unlock_hpte(hptep);
 
 	/* Ensure it is out of the tlb too. */
@@ -408,8 +417,9 @@  static void native_hpte_updateboltedpp(unsigned long newpp, unsigned long ea,
 		panic("could not find page to bolt\n");
 	hptep = htab_address + slot;
 	actual_psize = hpte_actual_psize(hptep, psize);
+	/* FIXME!! can this happen for a bolted entry? */
 	if (actual_psize < 0)
-		return;
+		actual_psize = psize;
 
 	/* Update the HPTE */
 	hptep->r = (hptep->r & ~(HPTE_R_PP | HPTE_R_N)) |
@@ -437,21 +447,28 @@  static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
 	hpte_v = hptep->v;
 
 	actual_psize = hpte_actual_psize(hptep, psize);
+	/*
+	 * We need to invalidate the TLB always because hpte_remove doesn't do
+	 * a tlb invalidate. If a hash bucket gets full, we "evict" a more/less
+	 * random entry from it. When we do that we don't invalidate the TLB
+	 * (hpte_remove) because we assume the old translation is still technically
+	 * "valid".
+	 */
 	if (actual_psize < 0) {
+		/* FIXME!!, will fail when we enable hugepage support */
+		actual_psize = psize;
 		native_unlock_hpte(hptep);
-		local_irq_restore(flags);
-		return;
+		goto err_out;
 	}
-	/* Even if we miss, we need to invalidate the TLB */
 	if (!HPTE_V_COMPARE(hpte_v, want_v))
 		native_unlock_hpte(hptep);
 	else
 		/* Invalidate the hpte. NOTE: this also unlocks it */
 		hptep->v = 0;
 
+err_out:
 	/* Invalidate the TLB */
 	tlbie(vpn, psize, actual_psize, ssize, local);
-
 	local_irq_restore(flags);
 }