[BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

Message ID 20160223193345.GC21820@node.shutemov.name (mailing list archive)
State Not Applicable

Commit Message

Kirill A. Shutemov Feb. 23, 2016, 7:33 p.m. UTC
On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
> I'll check with Martin, maybe it is actually trivial, then we can
do a quick test to rule that one out.

Oh. I found a bug in __split_huge_pmd_locked(). Although, not sure if it's
_the_ bug.

pmdp_invalidate() is called for the wrong address :-/
I guess that can be destructive on the architecture, right?

Could you check this?
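
To see where the wrong address comes from: the first loop in
__split_huge_pmd_locked() bumps haddr by PAGE_SIZE on each of its
HPAGE_PMD_NR iterations, so by the time pmdp_invalidate() runs, haddr
points a full HPAGE_PMD_SIZE past the huge mapping. A condensed sketch
of the pre-fix flow (paraphrased from the diff below, not verbatim
kernel code):

	haddr = address & HPAGE_PMD_MASK;	/* start of the huge mapping */
	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE)
		/* ... map the 4K ptes ... */;
	/* haddr == (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE here */
	pmdp_invalidate(vma, haddr, pmd);	/* wrong address */
	pmd_populate(mm, pmd, pgtable);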

Comments

Will Deacon Feb. 23, 2016, 8:22 p.m. UTC | #1
On Tue, Feb 23, 2016 at 10:33:45PM +0300, Kirill A. Shutemov wrote:
> On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
> > I'll check with Martin, maybe it is actually trivial, then we can
> > do a quick test to rule that one out.
> 
> Oh. I found a bug in __split_huge_pmd_locked(). Although, not sure if it's
> _the_ bug.
> 
> pmdp_invalidate() is called for the wrong address :-/
> I guess that can be destructive on the architecture, right?

FWIW, arm64 ignores the address parameter for set_pmd_at, so this would
only result in the TLBI nuking the wrong entries, which is going to be
tricky to observe in practice given that we install a table entry
immediately afterwards that maps the same pages. If s390 does more here
(I see some magic asm using the address), that could be the answer...
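
For reference, the arm64 definition at the time reduced to a plain
set_pmd -- a sketch from memory of arch/arm64/include/asm/pgtable.h:

/* the addr argument is simply dropped, so the entry update itself
 * cannot mis-target; only the TLB flush uses the address */
#define set_pmd_at(mm, addr, pmdp, pmd)	set_pmd(pmdp, pmd)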

Will
Martin Schwidefsky Feb. 24, 2016, 8:39 a.m. UTC | #2
On Tue, 23 Feb 2016 22:33:45 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
> > I'll check with Martin, maybe it is actually trivial, then we can
> > do a quick test to rule that one out.
> 
> Oh. I found a bug in __split_huge_pmd_locked(). Although, not sure if it's
> _the_ bug.
> 
> pmdp_invalidate() is called for the wrong address :-/
> I guess that can be destructive on the architecture, right?
> 
> Could you check this?
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1c317b85ea7d..4246bc70e55a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2865,7 +2865,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>  	pmd_populate(mm, &_pmd, pgtable);
> 
> -	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
>  		pte_t entry, *pte;
>  		/*
>  		 * Note that NUMA hinting access restrictions are not
> @@ -2886,9 +2886,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		}
>  		if (dirty)
>  			SetPageDirty(page + i);
> -		pte = pte_offset_map(&_pmd, haddr);
> +		pte = pte_offset_map(&_pmd, haddr + i * PAGE_SIZE);
>  		BUG_ON(!pte_none(*pte));
> -		set_pte_at(mm, haddr, pte, entry);
> +		set_pte_at(mm, haddr + i * PAGE_SIZE, pte, entry);
>  		atomic_inc(&page[i]._mapcount);
>  		pte_unmap(pte);
>  	}
> @@ -2938,7 +2938,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	pmd_populate(mm, pmd, pgtable);
> 
>  	if (freeze) {
> -		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> +		for (i = 0; i < HPAGE_PMD_NR; i++) {
>  			page_remove_rmap(page + i, false);
>  			put_page(page + i);
>  		}

The test is running and it looks good so far. For the final assessment I defer
to Gerald and Sebastian.
Christian Borntraeger Feb. 24, 2016, 10:16 a.m. UTC | #3
On 02/23/2016 09:22 PM, Will Deacon wrote:
> On Tue, Feb 23, 2016 at 10:33:45PM +0300, Kirill A. Shutemov wrote:
>> On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
>>> I'll check with Martin, maybe it is actually trivial, then we can
>>> do a quick test to rule that one out.
>>
>> Oh. I found a bug in __split_huge_pmd_locked(). Although, not sure if it's
>> _the_ bug.
>>
>> pmdp_invalidate() is called for the wrong address :-/
>> I guess that can be destructive on the architecture, right?
> 
> FWIW, arm64 ignores the address parameter for set_pmd_at, so this would
> only result in the TLBI nuking the wrong entries, which is going to be
> tricky to observe in practice given that we install a table entry
> immediately afterwards that maps the same pages. If s390 does more here
> (I see some magic asm using the address), that could be the answer...

This patch does not change the address for set_pmd_at; it changes it for the
pmdp_invalidate here (by keeping haddr at the start of the pmd):

--->    pmdp_invalidate(vma, haddr, pmd);
        pmd_populate(mm, pmd, pgtable);

Without that fix we would clearly have stale tlb entries, no?
Will Deacon Feb. 24, 2016, 10:41 a.m. UTC | #4
On Wed, Feb 24, 2016 at 11:16:34AM +0100, Christian Borntraeger wrote:
> On 02/23/2016 09:22 PM, Will Deacon wrote:
> > On Tue, Feb 23, 2016 at 10:33:45PM +0300, Kirill A. Shutemov wrote:
> >> On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
> >>> I'll check with Martin, maybe it is actually trivial, then we can
> >>> do a quick test to rule that one out.
> >>
> >> Oh. I found a bug in __split_huge_pmd_locked(). Although, not sure if it's
> >> _the_ bug.
> >>
> >> pmdp_invalidate() is called for the wrong address :-/
> >> I guess that can be destructive on the architecture, right?
> > 
> > FWIW, arm64 ignores the address parameter for set_pmd_at, so this would
> > only result in the TLBI nuking the wrong entries, which is going to be
> > tricky to observe in practice given that we install a table entry
> > immediately afterwards that maps the same pages. If s390 does more here
> > (I see some magic asm using the address), that could be the answer...
> 
> This patch does not change the address for set_pmd_at; it changes it for the
> pmdp_invalidate here (by keeping haddr at the start of the pmd):
> 
> --->    pmdp_invalidate(vma, haddr, pmd);
>         pmd_populate(mm, pmd, pgtable);

On arm64, pmdp_invalidate looks like:

void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
		     pmd_t *pmdp)
{
	pmd_t entry = *pmdp;
	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
	flush_pmd_tlb_range(vma, address, address + hpage_pmd_size);
}

so that's the set_pmd_at call I was referring to.

On s390, that address ends up in __pmdp_idte[_local], but I don't know
what .insn rrf,0xb98e0000,%2,%3,0,{0,1} do ;)

> Without that fix we would clearly have stale tlb entries, no?

Yes, but AFAIU the sequence on arm64 is:

1.  trans huge mapping (block mapping in arm64 speak)
2.  faulting entry (pmd_mknotpresent)
3.  tlb invalidation
4.  table entry mapping the same pages as (1).

so if the microarchitecture we're on can tolerate a mixture of block
mappings and page mappings mapping the same VA to the same PA, then the
lack of TLB maintenance would go unnoticed. There are certainly systems
where that could cause an issue, but I believe the one I've been testing
on would be ok.

Will
Christian Borntraeger Feb. 24, 2016, 10:51 a.m. UTC | #5
On 02/24/2016 11:41 AM, Will Deacon wrote:
> On Wed, Feb 24, 2016 at 11:16:34AM +0100, Christian Borntraeger wrote:
>> On 02/23/2016 09:22 PM, Will Deacon wrote:
>>> On Tue, Feb 23, 2016 at 10:33:45PM +0300, Kirill A. Shutemov wrote:
>>>> On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
>>>>> I'll check with Martin, maybe it is actually trivial, then we can
>>>>> do a quick test to rule that one out.
>>>>
>>>> Oh. I found a bug in __split_huge_pmd_locked(). Although, not sure if it's
>>>> _the_ bug.
>>>>
>>>> pmdp_invalidate() is called for the wrong address :-/
>>>> I guess that can be destructive on the architecture, right?
>>>
>>> FWIW, arm64 ignores the address parameter for set_pmd_at, so this would
>>> only result in the TLBI nuking the wrong entries, which is going to be
>>> tricky to observe in practice given that we install a table entry
>>> immediately afterwards that maps the same pages. If s390 does more here
>>> (I see some magic asm using the address), that could be the answer...
>>
>> This patch does not change the address for set_pmd_at; it changes it for the
>> pmdp_invalidate here (by keeping haddr at the start of the pmd):
>>
>> --->    pmdp_invalidate(vma, haddr, pmd);
>>         pmd_populate(mm, pmd, pgtable);
> 
> On arm64, pmdp_invalidate looks like:
> 
> void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> 		     pmd_t *pmdp)
> {
> 	pmd_t entry = *pmdp;
> 	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
> 	flush_pmd_tlb_range(vma, address, address + hpage_pmd_size);
> }
> 
> so that's the set_pmd_at call I was referring to.
> 
> On s390, that address ends up in __pmdp_idte[_local], but I don't know
> what .insn rrf,0xb98e0000,%2,%3,0,{0,1} do ;)

It invalidates the pmd entry and clears the TLB entries for it.
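
Roughly, the helper behind that insn (IDTE, Invalidate DAT Table Entry)
looked like this at the time -- a sketch from memory of
arch/s390/include/asm/pgtable.h, so treat the exact operands as an
assumption rather than gospel:

static inline void __pmdp_idte(unsigned long address, pmd_t *pmdp)
{
	unsigned long sto;

	/* Segment-table origin, derived from the entry pointer and the
	 * index bits of the address: a wrong address yields a wrong
	 * table origin. */
	sto = (unsigned long) pmdp - pmd_index(address) * sizeof(pmd_t);
	asm volatile(
		"	.insn	rrf,0xb98e0000,%2,%3,0,0"
		: "=m" (*pmdp)
		: "m" (*pmdp), "a" (sto), "a" (address & HPAGE_MASK)
		: "cc");
}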

> 
>> Without that fix we would clearly have stale tlb entries, no?
> 
> Yes, but AFAIU the sequence on arm64 is:
> 
> 1.  trans huge mapping (block mapping in arm64 speak)
> 2.  faulting entry (pmd_mknotpresent)
> 3.  tlb invalidation
> 4.  table entry mapping the same pages as (1).
> 
> so if the microarchitecture we're on can tolerate a mixture of block
> mappings and page mappings mapping the same VA to the same PA, then the
> lack of TLB maintenance would go unnoticed. There are certainly systems
> where that could cause an issue, but I believe the one I've been testing
> on would be ok.

So in essence you say it does not matter that you flush the wrong range in 
flush_pmd_tlb_range as long as it will be flushed later on when the pages
really go away. Yes, then it really might be ok for arm64.
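
Concretely: since the first loop has already advanced haddr a full
HPAGE_PMD_SIZE past the mapping, the buggy call on arm64 amounts to
flushing the next PMD's range rather than the one being split (a sketch
in terms of the function quoted above; "start" is a name introduced
here for address & HPAGE_PMD_MASK):

	/* buggy haddr == start + hpage_pmd_size, so the flush covers
	 * the range of the *next* PMD, not the split one */
	flush_pmd_tlb_range(vma, start + hpage_pmd_size,
			    start + 2 * hpage_pmd_size);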
Will Deacon Feb. 24, 2016, 11:02 a.m. UTC | #6
On Wed, Feb 24, 2016 at 11:51:47AM +0100, Christian Borntraeger wrote:
> On 02/24/2016 11:41 AM, Will Deacon wrote:
> > On Wed, Feb 24, 2016 at 11:16:34AM +0100, Christian Borntraeger wrote:
> >> Without that fix we would clearly have stale tlb entries, no?
> > 
> > Yes, but AFAIU the sequence on arm64 is:
> > 
> > 1.  trans huge mapping (block mapping in arm64 speak)
> > 2.  faulting entry (pmd_mknotpresent)
> > 3.  tlb invalidation
> > 4.  table entry mapping the same pages as (1).
> > 
> > so if the microarchitecture we're on can tolerate a mixture of block
> > mappings and page mappings mapping the same VA to the same PA, then the
> > lack of TLB maintenance would go unnoticed. There are certainly systems
> > where that could cause an issue, but I believe the one I've been testing
> > on would be ok.
> 
> So in essence you say it does not matter that you flush the wrong range in 
> flush_pmd_tlb_range as long as it will be flushed later on when the pages
> really go away. Yes, then it really might be ok for arm64.

Indeed, although that's a property of the microarchitecture I'm using
rather than an architectural guarantee, so the code should certainly be
fixed!

Will
Sebastian Ott Feb. 24, 2016, 12:11 p.m. UTC | #7
On Wed, 24 Feb 2016, Martin Schwidefsky wrote:
> On Tue, 23 Feb 2016 22:33:45 +0300
> "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> 
> > On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
> > > I'll check with Martin, maybe it is actually trivial, then we can
> > > do a quick test to rule that one out.
> > 
> > Oh. I found a bug in __split_huge_pmd_locked(). Although, not sure if it's
> > _the_ bug.
> > 
> > pmdp_invalidate() is called for the wrong address :-/
> > I guess that can be destructive on the architecture, right?
> > 
> > Could you check this?
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 1c317b85ea7d..4246bc70e55a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2865,7 +2865,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >  	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> >  	pmd_populate(mm, &_pmd, pgtable);
> > 
> > -	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> > +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> >  		pte_t entry, *pte;
> >  		/*
> >  		 * Note that NUMA hinting access restrictions are not
> > @@ -2886,9 +2886,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >  		}
> >  		if (dirty)
> >  			SetPageDirty(page + i);
> > -		pte = pte_offset_map(&_pmd, haddr);
> > +		pte = pte_offset_map(&_pmd, haddr + i * PAGE_SIZE);
> >  		BUG_ON(!pte_none(*pte));
> > -		set_pte_at(mm, haddr, pte, entry);
> > +		set_pte_at(mm, haddr + i * PAGE_SIZE, pte, entry);
> >  		atomic_inc(&page[i]._mapcount);
> >  		pte_unmap(pte);
> >  	}
> > @@ -2938,7 +2938,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >  	pmd_populate(mm, pmd, pgtable);
> > 
> >  	if (freeze) {
> > -		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> > +		for (i = 0; i < HPAGE_PMD_NR; i++) {
> >  			page_remove_rmap(page + i, false);
> >  			put_page(page + i);
> >  		}
> 
> The test is running and it looks good so far. For the final assessment I defer
> to Gerald and Sebastian.
> 

Yes, that one worked. My test system has been doing make -j10 && make clean
in a loop for 4 hours now. Thanks!

Sebastian
Gerald Schaefer Feb. 24, 2016, 4:44 p.m. UTC | #8
On Tue, 23 Feb 2016 22:33:45 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
> > I'll check with Martin, maybe it is actually trivial, then we can
> > do a quick test to rule that one out.
> 
> Oh. I found a bug in __split_huge_pmd_locked(). Although, not sure if it's
> _the_ bug.
> 
> pmdp_invalidate() is called for the wrong address :-/
> I guess that can be destructive on the architecture, right?

Thanks, that's it! We can no longer reproduce the crashes, and calling
pmdp_invalidate() with a wrong address also perfectly explains the
memory corruption I found in several dumps: 0x020 was ORed into pte
entries, which made no sense and, for example, caused the list
corruption. 0x020 is the invalid bit for pmd entries on s390, so the
corruption is explained by this bug when a pte table lies just before a
pmd table in memory.

> 
> Could you check this?
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1c317b85ea7d..4246bc70e55a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2865,7 +2865,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>  	pmd_populate(mm, &_pmd, pgtable);
> 
> -	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
>  		pte_t entry, *pte;
>  		/*
>  		 * Note that NUMA hinting access restrictions are not
> @@ -2886,9 +2886,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		}
>  		if (dirty)
>  			SetPageDirty(page + i);
> -		pte = pte_offset_map(&_pmd, haddr);
> +		pte = pte_offset_map(&_pmd, haddr + i * PAGE_SIZE);
>  		BUG_ON(!pte_none(*pte));
> -		set_pte_at(mm, haddr, pte, entry);
> +		set_pte_at(mm, haddr + i * PAGE_SIZE, pte, entry);
>  		atomic_inc(&page[i]._mapcount);
>  		pte_unmap(pte);
>  	}
> @@ -2938,7 +2938,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	pmd_populate(mm, pmd, pgtable);
> 
>  	if (freeze) {
> -		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> +		for (i = 0; i < HPAGE_PMD_NR; i++) {
>  			page_remove_rmap(page + i, false);
>  			put_page(page + i);
>  		}
Aneesh Kumar K.V Feb. 24, 2016, 5:22 p.m. UTC | #9
Christian Borntraeger <borntraeger@de.ibm.com> writes:

> On 02/24/2016 11:41 AM, Will Deacon wrote:
>> On Wed, Feb 24, 2016 at 11:16:34AM +0100, Christian Borntraeger wrote:
>>> On 02/23/2016 09:22 PM, Will Deacon wrote:
>>>> On Tue, Feb 23, 2016 at 10:33:45PM +0300, Kirill A. Shutemov wrote:
>>>>> On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
>>>>>> I'll check with Martin, maybe it is actually trivial, then we can
>>>>>> do a quick test to rule that one out.
>>>>>
>>>>> Oh. I found a bug in __split_huge_pmd_locked(). Although, not sure if it's
>>>>> _the_ bug.
>>>>>
>>>>> pmdp_invalidate() is called for the wrong address :-/
>>>>> I guess that can be destructive on the architecture, right?
>>>>
>>>> FWIW, arm64 ignores the address parameter for set_pmd_at, so this would
>>>> only result in the TLBI nuking the wrong entries, which is going to be
>>>> tricky to observe in practice given that we install a table entry
>>>> immediately afterwards that maps the same pages. If s390 does more here
>>>> (I see some magic asm using the address), that could be the answer...
>>>
>>> This patch does not change the address for set_pmd_at; it changes it for the
>>> pmdp_invalidate here (by keeping haddr at the start of the pmd):
>>>
>>> --->    pmdp_invalidate(vma, haddr, pmd);
>>>         pmd_populate(mm, pmd, pgtable);
>> 
>> On arm64, pmdp_invalidate looks like:
>> 
>> void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>> 		     pmd_t *pmdp)
>> {
>> 	pmd_t entry = *pmdp;
>> 	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
>> 	flush_pmd_tlb_range(vma, address, address + hpage_pmd_size);
>> }
>> 
>> so that's the set_pmd_at call I was referring to.
>> 
>> On s390, that address ends up in __pmdp_idte[_local], but I don't know
>> what .insn rrf,0xb98e0000,%2,%3,0,{0,1} do ;)
>
> It invalidates the pmd entry and clears the TLB entries for it.
>
>> 
>>> Without that fix we would clearly have stale tlb entries, no?
>> 
>> Yes, but AFAIU the sequence on arm64 is:
>> 
>> 1.  trans huge mapping (block mapping in arm64 speak)
>> 2.  faulting entry (pmd_mknotpresent)
>> 3.  tlb invalidation
>> 4.  table entry mapping the same pages as (1).
>> 
>> so if the microarchitecture we're on can tolerate a mixture of block
>> mappings and page mappings mapping the same VA to the same PA, then the
>> lack of TLB maintenance would go unnoticed. There are certainly systems
>> where that could cause an issue, but I believe the one I've been testing
>> on would be ok.
>
> So in essence you say it does not matter that you flush the wrong range in 
> flush_pmd_tlb_range as long as it will be flushed later on when the pages
> really go away. Yes, then it really might be ok for arm64.

This is more or less the same for ppc64 too. On ppc64 the actual flush
happens in pmdp_huge_split_prepare(), and pmdp_invalidate() is mostly a
no-op w.r.t. THP split in our case.

-aneesh

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1c317b85ea7d..4246bc70e55a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2865,7 +2865,7 @@  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 	pmd_populate(mm, &_pmd, pgtable);
 
-	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		pte_t entry, *pte;
 		/*
 		 * Note that NUMA hinting access restrictions are not
@@ -2886,9 +2886,9 @@  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		}
 		if (dirty)
 			SetPageDirty(page + i);
-		pte = pte_offset_map(&_pmd, haddr);
+		pte = pte_offset_map(&_pmd, haddr + i * PAGE_SIZE);
 		BUG_ON(!pte_none(*pte));
-		set_pte_at(mm, haddr, pte, entry);
+		set_pte_at(mm, haddr + i * PAGE_SIZE, pte, entry);
 		atomic_inc(&page[i]._mapcount);
 		pte_unmap(pte);
 	}
@@ -2938,7 +2938,7 @@  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	pmd_populate(mm, pmd, pgtable);
 
 	if (freeze) {
-		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		for (i = 0; i < HPAGE_PMD_NR; i++) {
 			page_remove_rmap(page + i, false);
 			put_page(page + i);
 		}