Message ID | 1391781117-19045-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
On Fri, Feb 07, 2014 at 07:21:57PM +0530, Aneesh Kumar K.V wrote: > From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> > > This patch fix the below crash > > NIP [c00000000004cee4] .__hash_page_thp+0x2a4/0x440 > LR [c0000000000439ac] .hash_page+0x18c/0x5e0 > ... > Call Trace: > [c000000736103c40] [00001ffffb000000] 0x1ffffb000000(unreliable) > [437908.479693] [c000000736103d50] [c0000000000439ac] .hash_page+0x18c/0x5e0 > [437908.479699] [c000000736103e30] [c00000000000924c] .do_hash_page+0x4c/0x58 > > On ppc64 we use the pgtable for storing the hpte slot information and > store address to the pgtable at a constant offset (PTRS_PER_PMD) from > pmd. On mremap, when we switch the pmd, we need to withdraw and deposit > the pgtable again, so that we find the pgtable at PTRS_PER_PMD offset > from new pmd. > > We also want to move the withdraw and deposit before the set_pmd so > that, when page fault find the pmd as trans huge we can be sure that > pgtable can be located at the offset. > > variant of upstream SHA1: b3084f4db3aeb991c507ca774337c7e7893ed04f > for 3.12 stable series This doesn't look like a "variant", it looks totally different. Why can't I just take the b3084f4db3aeb991c507ca774337c7e7893ed04f patch (and follow-on fix) for 3.12? I _REALLY_ dislike patches that are totally different from Linus's tree in stable trees, it has caused nothing but problems in the past. greg k-h
On Tue, 2014-02-11 at 09:31 -0800, Greg KH wrote: > On Fri, Feb 07, 2014 at 07:21:57PM +0530, Aneesh Kumar K.V wrote: > > From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> > > > > This patch fix the below crash > > > > NIP [c00000000004cee4] .__hash_page_thp+0x2a4/0x440 > > LR [c0000000000439ac] .hash_page+0x18c/0x5e0 > > ... > > Call Trace: > > [c000000736103c40] [00001ffffb000000] 0x1ffffb000000(unreliable) > > [437908.479693] [c000000736103d50] [c0000000000439ac] .hash_page+0x18c/0x5e0 > > [437908.479699] [c000000736103e30] [c00000000000924c] .do_hash_page+0x4c/0x58 > > > > On ppc64 we use the pgtable for storing the hpte slot information and > > store address to the pgtable at a constant offset (PTRS_PER_PMD) from > > pmd. On mremap, when we switch the pmd, we need to withdraw and deposit > > the pgtable again, so that we find the pgtable at PTRS_PER_PMD offset > > from new pmd. > > > > We also want to move the withdraw and deposit before the set_pmd so > > that, when page fault find the pmd as trans huge we can be sure that > > pgtable can be located at the offset. > > > > variant of upstream SHA1: b3084f4db3aeb991c507ca774337c7e7893ed04f > > for 3.12 stable series > > This doesn't look like a "variant", it looks totally different. Why > can't I just take the b3084f4db3aeb991c507ca774337c7e7893ed04f patch > (and follow-on fix) for 3.12? > > I _REALLY_ dislike patches that are totally different from Linus's tree > in stable trees, it has caused nothing but problems in the past. I don't think it applies... (I tried on an internal tree) but the affected function changed in 3.13 in various ways. Aneesh, please provide a more details explanation and whether we should backport those other changes too or whether this is not necessary. BTW. Aneesh, we need a 3.11.x one too Cheers, Ben.
Greg KH <gregkh@linuxfoundation.org> writes: > On Fri, Feb 07, 2014 at 07:21:57PM +0530, Aneesh Kumar K.V wrote: >> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> >> >> This patch fix the below crash >> >> NIP [c00000000004cee4] .__hash_page_thp+0x2a4/0x440 >> LR [c0000000000439ac] .hash_page+0x18c/0x5e0 >> ... >> Call Trace: >> [c000000736103c40] [00001ffffb000000] 0x1ffffb000000(unreliable) >> [437908.479693] [c000000736103d50] [c0000000000439ac] .hash_page+0x18c/0x5e0 >> [437908.479699] [c000000736103e30] [c00000000000924c] .do_hash_page+0x4c/0x58 >> >> On ppc64 we use the pgtable for storing the hpte slot information and >> store address to the pgtable at a constant offset (PTRS_PER_PMD) from >> pmd. On mremap, when we switch the pmd, we need to withdraw and deposit >> the pgtable again, so that we find the pgtable at PTRS_PER_PMD offset >> from new pmd. >> >> We also want to move the withdraw and deposit before the set_pmd so >> that, when page fault find the pmd as trans huge we can be sure that >> pgtable can be located at the offset. >> >> variant of upstream SHA1: b3084f4db3aeb991c507ca774337c7e7893ed04f >> for 3.12 stable series > > This doesn't look like a "variant", it looks totally different. Why > can't I just take the b3084f4db3aeb991c507ca774337c7e7893ed04f patch > (and follow-on fix) for 3.12? Because the code in that function changed in 3.13. Kirill added split ptl locks for huge pte, and we decide whether to withdraw and deposit again based on the ptl locks in 3.13. In 3.12 we do that only for ppc64 using #ifdef > > I _REALLY_ dislike patches that are totally different from Linus's tree > in stable trees, it has caused nothing but problems in the past. > -aneesh
Benjamin Herrenschmidt <benh@kernel.crashing.org> writes: > On Tue, 2014-02-11 at 09:31 -0800, Greg KH wrote: >> On Fri, Feb 07, 2014 at 07:21:57PM +0530, Aneesh Kumar K.V wrote: >> > From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> >> > >> > This patch fix the below crash >> > >> > NIP [c00000000004cee4] .__hash_page_thp+0x2a4/0x440 >> > LR [c0000000000439ac] .hash_page+0x18c/0x5e0 >> > ... >> > Call Trace: >> > [c000000736103c40] [00001ffffb000000] 0x1ffffb000000(unreliable) >> > [437908.479693] [c000000736103d50] [c0000000000439ac] .hash_page+0x18c/0x5e0 >> > [437908.479699] [c000000736103e30] [c00000000000924c] .do_hash_page+0x4c/0x58 >> > >> > On ppc64 we use the pgtable for storing the hpte slot information and >> > store address to the pgtable at a constant offset (PTRS_PER_PMD) from >> > pmd. On mremap, when we switch the pmd, we need to withdraw and deposit >> > the pgtable again, so that we find the pgtable at PTRS_PER_PMD offset >> > from new pmd. >> > >> > We also want to move the withdraw and deposit before the set_pmd so >> > that, when page fault find the pmd as trans huge we can be sure that >> > pgtable can be located at the offset. >> > >> > variant of upstream SHA1: b3084f4db3aeb991c507ca774337c7e7893ed04f >> > for 3.12 stable series >> >> This doesn't look like a "variant", it looks totally different. Why >> can't I just take the b3084f4db3aeb991c507ca774337c7e7893ed04f patch >> (and follow-on fix) for 3.12? >> >> I _REALLY_ dislike patches that are totally different from Linus's tree >> in stable trees, it has caused nothing but problems in the past. > > I don't think it applies... (I tried on an internal tree) but the > affected function changed in 3.13 in various ways. Aneesh, please > provide a more details explanation and whether we should backport those > other changes too or whether this is not necessary Yes the affected function added support for split ptl locks for huge pte. I don't think that is a stable material. . > > BTW. Aneesh, we need a 3.11.x one too > 3.11.x it is already applied. -aneesh
On Wed, Feb 12, 2014 at 08:22:02AM +0530, Aneesh Kumar K.V wrote: > Greg KH <gregkh@linuxfoundation.org> writes: > > > On Fri, Feb 07, 2014 at 07:21:57PM +0530, Aneesh Kumar K.V wrote: > >> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> > >> > >> This patch fix the below crash > >> > >> NIP [c00000000004cee4] .__hash_page_thp+0x2a4/0x440 > >> LR [c0000000000439ac] .hash_page+0x18c/0x5e0 > >> ... > >> Call Trace: > >> [c000000736103c40] [00001ffffb000000] 0x1ffffb000000(unreliable) > >> [437908.479693] [c000000736103d50] [c0000000000439ac] .hash_page+0x18c/0x5e0 > >> [437908.479699] [c000000736103e30] [c00000000000924c] .do_hash_page+0x4c/0x58 > >> > >> On ppc64 we use the pgtable for storing the hpte slot information and > >> store address to the pgtable at a constant offset (PTRS_PER_PMD) from > >> pmd. On mremap, when we switch the pmd, we need to withdraw and deposit > >> the pgtable again, so that we find the pgtable at PTRS_PER_PMD offset > >> from new pmd. > >> > >> We also want to move the withdraw and deposit before the set_pmd so > >> that, when page fault find the pmd as trans huge we can be sure that > >> pgtable can be located at the offset. > >> > >> variant of upstream SHA1: b3084f4db3aeb991c507ca774337c7e7893ed04f > >> for 3.12 stable series > > > > This doesn't look like a "variant", it looks totally different. Why > > can't I just take the b3084f4db3aeb991c507ca774337c7e7893ed04f patch > > (and follow-on fix) for 3.12? > > Because the code in that function changed in 3.13. Kirill added split > ptl locks for huge pte, and we decide whether to withdraw and > deposit again based on the ptl locks in 3.13. In 3.12 we do that only > for ppc64 using #ifdef I have no idea what that means... If you want this patch applied, please be specific as to what is going on, why the code is _very_ different, and all of that. Make it _obvious_ as to what is happening, and why I would be a fool not to take it in the stable tree. As it is, the code in this patch looks so different that I'm just assuming you got something wrong and are trying to really send me something else, so I'll just ignore it. greg k-h
Greg KH <gregkh@linuxfoundation.org> writes: > On Wed, Feb 12, 2014 at 08:22:02AM +0530, Aneesh Kumar K.V wrote: >> Greg KH <gregkh@linuxfoundation.org> writes: >> >> > On Fri, Feb 07, 2014 at 07:21:57PM +0530, Aneesh Kumar K.V wrote: >> >> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> >> >> >> >> This patch fix the below crash >> >> >> >> NIP [c00000000004cee4] .__hash_page_thp+0x2a4/0x440 >> >> LR [c0000000000439ac] .hash_page+0x18c/0x5e0 >> >> ... >> >> Call Trace: >> >> [c000000736103c40] [00001ffffb000000] 0x1ffffb000000(unreliable) >> >> [437908.479693] [c000000736103d50] [c0000000000439ac] .hash_page+0x18c/0x5e0 >> >> [437908.479699] [c000000736103e30] [c00000000000924c] .do_hash_page+0x4c/0x58 >> >> >> >> On ppc64 we use the pgtable for storing the hpte slot information and >> >> store address to the pgtable at a constant offset (PTRS_PER_PMD) from >> >> pmd. On mremap, when we switch the pmd, we need to withdraw and deposit >> >> the pgtable again, so that we find the pgtable at PTRS_PER_PMD offset >> >> from new pmd. >> >> >> >> We also want to move the withdraw and deposit before the set_pmd so >> >> that, when page fault find the pmd as trans huge we can be sure that >> >> pgtable can be located at the offset. >> >> >> >> variant of upstream SHA1: b3084f4db3aeb991c507ca774337c7e7893ed04f >> >> for 3.12 stable series >> > >> > This doesn't look like a "variant", it looks totally different. Why >> > can't I just take the b3084f4db3aeb991c507ca774337c7e7893ed04f patch >> > (and follow-on fix) for 3.12? >> >> Because the code in that function changed in 3.13. Kirill added split >> ptl locks for huge pte, and we decide whether to withdraw and >> deposit again based on the ptl locks in 3.13. In 3.12 we do that only >> for ppc64 using #ifdef > > I have no idea what that means... > > If you want this patch applied, please be specific as to what is going > on, why the code is _very_ different, and all of that. Make it > _obvious_ as to what is happening, and why I would be a fool not to take > it in the stable tree. > > As it is, the code in this patch looks so different that I'm just > assuming you got something wrong and are trying to really send me > something else, so I'll just ignore it. 3.13 we added split huge ptl lock which introduced separate lock at pmd level for hugepage (bf929152e9f6c49b66fad4ebf08cc95b02ce48f5). This required us 3592806cfa08b7cca968f793c33f8e9460bab395. ie, when we move huge page, we need to withdraw and deposit PTE page if we are moving them across different pmd page. We did that by checking spin lock address in 3.13. ie, we have if (new_ptl != old_ptl) { ..... pgtable = pgtable_trans_huge_withdraw(mm, old_pmd); pgtable_trans_huge_deposit(mm, new_pmd,pgtable); ... } ppc64 even without using split ptl had PTE page per pmd entry. The details for that are explained in the commit message above. So when we move huge page we need to withdraw and deposit PTE page always on ppc64. Now on 3.13 we added a new function which did static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl, spinlock_t *old_pmd_ptl) { /* * With split pmd lock we also need to move preallocated * PTE page table if new_pmd is on different PMD page table. */ return new_pmd_ptl != old_pmd_ptl; } for x86 and on ppc64 we did static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl, spinlock_t *old_pmd_ptl) { /* * Archs like ppc64 use pgtable to store per pmd * specific information. So when we switch the pmd, * we should also withdraw and deposit the pgtable */ return true; } ie, on ppc64 we always did withdraw and deposit and on x86 we do that only when spin lock address are different. For 3.12, since we don't have split huge ptl locks yet, we did the below +#ifdef CONFIG_ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW + /* + * Archs like ppc64 use pgtable to store per pmd + * specific information. So when we switch the pmd, + * we should also withdraw and deposit the pgtable + */ + pgtable = pgtable_trans_huge_withdraw(mm, old_pmd); + pgtable_trans_huge_deposit(mm, new_pmd, pgtable); +#endif CONFIG_ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW is only set for PPC64. -aneesh
On Wed, 2014-02-12 at 06:23 -0800, Greg KH wrote: > I have no idea what that means... > > If you want this patch applied, please be specific as to what is going > on, why the code is _very_ different, and all of that. Make it > _obvious_ as to what is happening, and why I would be a fool not to > take > it in the stable tree. > > As it is, the code in this patch looks so different that I'm just > assuming you got something wrong and are trying to really send me > something else, so I'll just ignore it. It looks very different because the function that needs to be fixed changed a lot upstream in 3.13. In practice it's *not* very different in behaviour. It's just that on powerpc we need to unconditionally call withdraw and deposit when moving PTEs or it will crash, due to how we keep the transparent huge page in sync with the hash table. With the 3.13 code, due to lock breaking introduced by Kirill in 3.13-rc's, there's already a generic case for doing that (if we dropped the lock). So we just changed the condition to essentially force the condition to true to always do it under control of an arch helper. The pre-3.13 code didn't do the withdraw and deposit at all in that function however, so in that case, the patch (this 3.12 one) basically just adds the calls to withdraw and deposit under control of an ifdef which is only enabled for powerpc64. So you are taking 0 risk with other architecture and as the powerpc maintainer I'm happy with the patch. Cheers, Ben.
On Thu, 2014-02-13 at 08:03 +1100, Benjamin Herrenschmidt wrote: > It looks very different because the function that needs to be fixed > changed a lot upstream in 3.13. .../... Hi Greg ! You didn't say if that explanation was to your liking :-) If it is, do you want Aneesh to re-submit the patch with such an explanation in the changelog ? Cheers, Ben. > In practice it's *not* very different in behaviour. It's just that > on powerpc we need to unconditionally call withdraw and deposit when > moving PTEs or it will crash, due to how we keep the transparent > huge page in sync with the hash table. > > With the 3.13 code, due to lock breaking introduced by Kirill in > 3.13-rc's, there's already a generic case for doing that (if we dropped > the lock). So we just changed the condition to essentially force the > condition to true to always do it under control of an arch helper. > > The pre-3.13 code didn't do the withdraw and deposit at all in that > function however, so in that case, the patch (this 3.12 one) basically > just adds the calls to withdraw and deposit under control of an ifdef > which is only enabled for powerpc64. > > So you are taking 0 risk with other architecture and as the powerpc > maintainer I'm happy with the patch. > > Cheers, > Ben. > >
diff --git a/arch/Kconfig b/arch/Kconfig index af2cc6eabcc7..bca9e7a18bd2 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -365,6 +365,9 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE config HAVE_ARCH_SOFT_DIRTY bool +config ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW + bool + config HAVE_MOD_ARCH_SPECIFIC bool help diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype index 6704e2e20e6b..0225011231ea 100644 --- a/arch/powerpc/platforms/Kconfig.cputype +++ b/arch/powerpc/platforms/Kconfig.cputype @@ -71,6 +71,7 @@ config PPC_BOOK3S_64 select PPC_FPU select PPC_HAVE_PMU_SUPPORT select SYS_SUPPORTS_HUGETLBFS + select ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW select HAVE_ARCH_TRANSPARENT_HUGEPAGE if PPC_64K_PAGES config PPC_BOOK3E_64 diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 292a266e0d42..89b7a647f1cb 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1474,8 +1474,20 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma, ret = __pmd_trans_huge_lock(old_pmd, vma); if (ret == 1) { +#ifdef CONFIG_ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW + pgtable_t pgtable; +#endif pmd = pmdp_get_and_clear(mm, old_addr, old_pmd); VM_BUG_ON(!pmd_none(*new_pmd)); +#ifdef CONFIG_ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW + /* + * Archs like ppc64 use pgtable to store per pmd + * specific information. So when we switch the pmd, + * we should also withdraw and deposit the pgtable + */ + pgtable = pgtable_trans_huge_withdraw(mm, old_pmd); + pgtable_trans_huge_deposit(mm, new_pmd, pgtable); +#endif set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd)); spin_unlock(&mm->page_table_lock); }