[3/7] powerpc: Free up four 64K PTE bits in 4K backed HPTE pages

Message ID 1504910713-7094-4-git-send-email-linuxram@us.ibm.com
State Changes Requested
Headers show
Series
  • powerpc: Free up RPAGE_RSV bits
Related show

Commit Message

Ram Pai Sept. 8, 2017, 10:44 p.m.
Rearrange 64K PTE bits to  free  up  bits 3, 4, 5  and  6,
in the 4K backed HPTE pages.These bits continue to be used
for 64K backed HPTE pages in this patch, but will be freed
up in the next patch. The  bit  numbers are big-endian  as
defined in the ISA3.0

The patch does the following change to the 4k htpe backed
64K PTE's format.

H_PAGE_BUSY moves from bit 3 to bit 9 (B bit in the figure
		below)
V0 which occupied bit 4 is not used anymore.
V1 which occupied bit 5 is not used anymore.
V2 which occupied bit 6 is not used anymore.
V3 which occupied bit 7 is not used anymore.

Before the patch, the 4k backed 64k PTE format was as follows

 0 1 2 3 4  5  6  7  8 9 10...........................63
 : : : : :  :  :  :  : : :                            :
 v v v v v  v  v  v  v v v                            v

,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
|x|x|x|B|V0|V1|V2|V3|x| | |x|x|................|x|x|x|x| <- primary pte
'_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'
|S|G|I|X|S |G |I |X |S|G|I|X|..................|S|G|I|X| <- secondary pte
'_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_'

After the patch, the 4k backed 64k PTE format is as follows

 0 1 2 3 4  5  6  7  8 9 10...........................63
 : : : : :  :  :  :  : : :                            :
 v v v v v  v  v  v  v v v                            v

,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
|x|x|x| |  |  |  |  |x|B| |x|x|................|.|.|.|.| <- primary pte
'_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'
|S|G|I|X|S |G |I |X |S|G|I|X|..................|S|G|I|X| <- secondary pte
'_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_'

the four  bits S,G,I,X (one quadruplet per 4k HPTE) that
cache  the  hash-bucket  slot  value, is initialized  to
1,1,1,1 indicating -- an invalid slot.   If  a HPTE gets
cached in a 1111  slot(i.e 7th  slot  of  secondary hash
bucket), it is  released  immediately. In  other  words,
even  though 1111   is   a valid slot  value in the hash
bucket, we consider it invalid and  release the slot and
the HPTE.  This  gives  us  the opportunity to determine
the validity of S,G,I,X  bits  based on its contents and
not on any of the bits V0,V1,V2 or V3 in the primary PTE

When   we  release  a    HPTE    cached in the 1111 slot
we also    release  a  legitimate   slot  in the primary
hash bucket  and  unmap  its  corresponding  HPTE.  This
is  to  ensure   that  we do get a HPTE cached in a slot
of the primary hash bucket, the next time we retry.

Though  treating  1111  slot  as  invalid,  reduces  the
number of  available  slots  in the hash bucket and  may
have  an  effect   on the performance, the probabilty of
hitting a 1111 slot is extermely low.

Compared  to  the   current    scheme,  the above scheme
reduces  the   number  of   false   hash  table  updates
significantly and  has the  added advantage of releasing
four  valuable  PTE bits for other purpose.

NOTE:even though bits 3, 4, 5, 6, 7 are  not  used  when
the  64K  PTE is backed by 4k HPTE,  they continue to be
used  if  the  PTE  gets  backed  by 64k HPTE.  The next
patch will decouple that aswell, and truely  release the
bits.

This idea was jointly developed by Paul Mackerras,
Aneesh, Michael Ellermen and myself.

4K PTE format remains unchanged currently.

The patch does the following code changes
a) PTE flags are split between 64k and 4k  header files.
b) __hash_page_4K()  is  reimplemented   to reflect the
   above logic.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |    2 +
 arch/powerpc/include/asm/book3s/64/hash-64k.h |    8 +--
 arch/powerpc/include/asm/book3s/64/hash.h     |    1 -
 arch/powerpc/mm/hash64_64k.c                  |  106 +++++++++++++------------
 arch/powerpc/mm/hash_utils_64.c               |    4 +-
 5 files changed, 63 insertions(+), 58 deletions(-)

Comments

Balbir Singh Sept. 14, 2017, 1:18 a.m. | #1
On Fri,  8 Sep 2017 15:44:43 -0700
Ram Pai <linuxram@us.ibm.com> wrote:

> Rearrange 64K PTE bits to  free  up  bits 3, 4, 5  and  6,
> in the 4K backed HPTE pages.These bits continue to be used
> for 64K backed HPTE pages in this patch, but will be freed
> up in the next patch. The  bit  numbers are big-endian  as
> defined in the ISA3.0
> 
> The patch does the following change to the 4k htpe backed
> 64K PTE's format.
> 
> H_PAGE_BUSY moves from bit 3 to bit 9 (B bit in the figure
> 		below)
> V0 which occupied bit 4 is not used anymore.
> V1 which occupied bit 5 is not used anymore.
> V2 which occupied bit 6 is not used anymore.
> V3 which occupied bit 7 is not used anymore.
> 
> Before the patch, the 4k backed 64k PTE format was as follows
> 
>  0 1 2 3 4  5  6  7  8 9 10...........................63
>  : : : : :  :  :  :  : : :                            :
>  v v v v v  v  v  v  v v v                            v
> 
> ,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
> |x|x|x|B|V0|V1|V2|V3|x| | |x|x|................|x|x|x|x| <- primary pte
> '_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'
> |S|G|I|X|S |G |I |X |S|G|I|X|..................|S|G|I|X| <- secondary pte
> '_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_'
> 
> After the patch, the 4k backed 64k PTE format is as follows
> 
>  0 1 2 3 4  5  6  7  8 9 10...........................63
>  : : : : :  :  :  :  : : :                            :
>  v v v v v  v  v  v  v v v                            v
> 
> ,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
> |x|x|x| |  |  |  |  |x|B| |x|x|................|.|.|.|.| <- primary pte
> '_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'
> |S|G|I|X|S |G |I |X |S|G|I|X|..................|S|G|I|X| <- secondary pte
> '_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_'
> 
> the four  bits S,G,I,X (one quadruplet per 4k HPTE) that
> cache  the  hash-bucket  slot  value, is initialized  to
> 1,1,1,1 indicating -- an invalid slot.   If  a HPTE gets
> cached in a 1111  slot(i.e 7th  slot  of  secondary hash
> bucket), it is  released  immediately. In  other  words,
> even  though 1111   is   a valid slot  value in the hash
> bucket, we consider it invalid and  release the slot and
> the HPTE.  This  gives  us  the opportunity to determine
> the validity of S,G,I,X  bits  based on its contents and
> not on any of the bits V0,V1,V2 or V3 in the primary PTE
> 
> When   we  release  a    HPTE    cached in the 1111 slot
> we also    release  a  legitimate   slot  in the primary
> hash bucket  and  unmap  its  corresponding  HPTE.  This
> is  to  ensure   that  we do get a HPTE cached in a slot
> of the primary hash bucket, the next time we retry.
> 
> Though  treating  1111  slot  as  invalid,  reduces  the
> number of  available  slots  in the hash bucket and  may
> have  an  effect   on the performance, the probabilty of
> hitting a 1111 slot is extermely low.
> 
> Compared  to  the   current    scheme,  the above scheme
> reduces  the   number  of   false   hash  table  updates
> significantly and  has the  added advantage of releasing
> four  valuable  PTE bits for other purpose.
> 
> NOTE:even though bits 3, 4, 5, 6, 7 are  not  used  when
> the  64K  PTE is backed by 4k HPTE,  they continue to be
> used  if  the  PTE  gets  backed  by 64k HPTE.  The next
> patch will decouple that aswell, and truely  release the
> bits.
> 
> This idea was jointly developed by Paul Mackerras,
> Aneesh, Michael Ellermen and myself.
> 

Acked-by: Balbir Singh <bsingharora@gmail.com>
Michael Ellerman Oct. 19, 2017, 3:25 a.m. | #2
Ram Pai <linuxram@us.ibm.com> writes:

> diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
> index 1a68cb1..c6c5559 100644
> --- a/arch/powerpc/mm/hash64_64k.c
> +++ b/arch/powerpc/mm/hash64_64k.c
> @@ -126,18 +113,13 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>  	if (__rpte_sub_valid(rpte, subpg_index)) {
>  		int ret;
>  
> -		hash = hpt_hash(vpn, shift, ssize);
> -		hidx = __rpte_to_hidx(rpte, subpg_index);
> -		if (hidx & _PTEIDX_SECONDARY)
> -			hash = ~hash;
> -		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> -		slot += hidx & _PTEIDX_GROUP_IX;
> +		gslot = pte_get_hash_gslot(vpn, shift, ssize, rpte,
> +					subpg_index);
> +		ret = mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn,
> +			MMU_PAGE_4K, MMU_PAGE_4K, ssize, flags);

This was formatted correctly before:
  
> -		ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
> -						 MMU_PAGE_4K, MMU_PAGE_4K,
> -						 ssize, flags);
>  		/*
> -		 *if we failed because typically the HPTE wasn't really here
> +		 * if we failed because typically the HPTE wasn't really here

If you're fixing it up please make it "If ...".

>  		 * we try an insertion.
>  		 */
>  		if (ret == -1)
> @@ -148,6 +130,15 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>  	}
>  
>  htab_insert_hpte:
> +
> +	/*
> +	 * initialize all hidx entries to invalid value,
> +	 * the first time the PTE is about to allocate
> +	 * a 4K hpte
> +	 */

Should be:
	/*
	 * Initialize all hidx entries to invalid value, the first time
         * the PTE is about to allocate a 4K HPTE.
	 */

> +	if (!(old_pte & H_PAGE_COMBO))
> +		rpte.hidx = ~0x0UL;
> +

Paul had the idea that if we biased the slot number by 1, we could make
the "invalid" value be == 0.

That would avoid needing to that above, and also mean the value is
correctly invalid from the get-go, which would be good IMO.

I think now that you've added the slot accessors it would be pretty easy
to do.


>  	/*
>  	 * handle H_PAGE_4K_PFN case
>  	 */
> @@ -172,15 +163,41 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>  	 * Primary is full, try the secondary
>  	 */
>  	if (unlikely(slot == -1)) {
> +		bool soft_invalid;
> +
>  		hpte_group = ((~hash & htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL;
>  		slot = mmu_hash_ops.hpte_insert(hpte_group, vpn, pa,
>  						rflags, HPTE_V_SECONDARY,
>  						MMU_PAGE_4K, MMU_PAGE_4K,
>  						ssize);
> -		if (slot == -1) {
> -			if (mftb() & 0x1)
> +
> +		soft_invalid = hpte_soft_invalid(slot);
> +		if (unlikely(soft_invalid)) {


> +			/*
> +			 * we got a valid slot from a hardware point of view.
> +			 * but we cannot use it, because we use this special
> +			 * value; as     defined   by    hpte_soft_invalid(),
> +			 * to  track    invalid  slots.  We  cannot  use  it.
> +			 * So invalidate it.
> +			 */
> +			gslot = slot & _PTEIDX_GROUP_IX;
> +			mmu_hash_ops.hpte_invalidate(hpte_group+gslot, vpn,
> +				MMU_PAGE_4K, MMU_PAGE_4K,
> +				ssize, 0);

Please:
			mmu_hash_ops.hpte_invalidate(hpte_group+gslot, vpn,
                        			     MMU_PAGE_4K, MMU_PAGE_4K,
						     ssize, 0);

> +		}
> +
> +		if (unlikely(slot == -1 || soft_invalid)) {
> +			/*
> +			 * for soft invalid slot, lets   ensure that we

For .. let's


cheers
Ram Pai Oct. 19, 2017, 5:02 p.m. | #3
On Thu, Oct 19, 2017 at 02:25:47PM +1100, Michael Ellerman wrote:
> Ram Pai <linuxram@us.ibm.com> writes:
> 
> > diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
> > index 1a68cb1..c6c5559 100644
> > --- a/arch/powerpc/mm/hash64_64k.c
> > +++ b/arch/powerpc/mm/hash64_64k.c
> > @@ -126,18 +113,13 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >  	if (__rpte_sub_valid(rpte, subpg_index)) {
> >  		int ret;
> >  
> > -		hash = hpt_hash(vpn, shift, ssize);
> > -		hidx = __rpte_to_hidx(rpte, subpg_index);
> > -		if (hidx & _PTEIDX_SECONDARY)
> > -			hash = ~hash;
> > -		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > -		slot += hidx & _PTEIDX_GROUP_IX;
> > +		gslot = pte_get_hash_gslot(vpn, shift, ssize, rpte,
> > +					subpg_index);
> > +		ret = mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn,
> > +			MMU_PAGE_4K, MMU_PAGE_4K, ssize, flags);
> 
> This was formatted correctly before:
>   
> > -		ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
> > -						 MMU_PAGE_4K, MMU_PAGE_4K,
> > -						 ssize, flags);
> >  		/*
> > -		 *if we failed because typically the HPTE wasn't really here
> > +		 * if we failed because typically the HPTE wasn't really here
> 
> If you're fixing it up please make it "If ...".
> 
> >  		 * we try an insertion.
> >  		 */
> >  		if (ret == -1)
> > @@ -148,6 +130,15 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >  	}
> >  
> >  htab_insert_hpte:
> > +
> > +	/*
> > +	 * initialize all hidx entries to invalid value,
> > +	 * the first time the PTE is about to allocate
> > +	 * a 4K hpte
> > +	 */
> 
> Should be:
> 	/*
> 	 * Initialize all hidx entries to invalid value, the first time
>          * the PTE is about to allocate a 4K HPTE.
> 	 */
> 
> > +	if (!(old_pte & H_PAGE_COMBO))
> > +		rpte.hidx = ~0x0UL;
> > +
> 
> Paul had the idea that if we biased the slot number by 1, we could make
> the "invalid" value be == 0.
> 
> That would avoid needing to that above, and also mean the value is
> correctly invalid from the get-go, which would be good IMO.
> 
> I think now that you've added the slot accessors it would be pretty easy
> to do.

I did attempt to do so, and was not getting it right. The machine went
unstable. So left it with an accessor, to be revisited at a
later point in time. That time has come... I suppose.  Shall I make it a
separate patch instead of baking it into this patch?

RP
Aneesh Kumar K.V Oct. 23, 2017, 8:47 a.m. | #4
Michael Ellerman <mpe@ellerman.id.au> writes:

> Ram Pai <linuxram@us.ibm.com> writes:
>
>> diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
>> index 1a68cb1..c6c5559 100644
>> --- a/arch/powerpc/mm/hash64_64k.c
>> +++ b/arch/powerpc/mm/hash64_64k.c
>> @@ -126,18 +113,13 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>>  	if (__rpte_sub_valid(rpte, subpg_index)) {
>>  		int ret;
>>  
>> -		hash = hpt_hash(vpn, shift, ssize);
>> -		hidx = __rpte_to_hidx(rpte, subpg_index);
>> -		if (hidx & _PTEIDX_SECONDARY)
>> -			hash = ~hash;
>> -		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
>> -		slot += hidx & _PTEIDX_GROUP_IX;
>> +		gslot = pte_get_hash_gslot(vpn, shift, ssize, rpte,
>> +					subpg_index);
>> +		ret = mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn,
>> +			MMU_PAGE_4K, MMU_PAGE_4K, ssize, flags);
>
> This was formatted correctly before:
>   
>> -		ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
>> -						 MMU_PAGE_4K, MMU_PAGE_4K,
>> -						 ssize, flags);
>>  		/*
>> -		 *if we failed because typically the HPTE wasn't really here
>> +		 * if we failed because typically the HPTE wasn't really here
>
> If you're fixing it up please make it "If ...".
>
>>  		 * we try an insertion.
>>  		 */
>>  		if (ret == -1)
>> @@ -148,6 +130,15 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>>  	}
>>  
>>  htab_insert_hpte:
>> +
>> +	/*
>> +	 * initialize all hidx entries to invalid value,
>> +	 * the first time the PTE is about to allocate
>> +	 * a 4K hpte
>> +	 */
>
> Should be:
> 	/*
> 	 * Initialize all hidx entries to invalid value, the first time
>          * the PTE is about to allocate a 4K HPTE.
> 	 */
>
>> +	if (!(old_pte & H_PAGE_COMBO))
>> +		rpte.hidx = ~0x0UL;
>> +
>
> Paul had the idea that if we biased the slot number by 1, we could make
> the "invalid" value be == 0.
>
> That would avoid needing to that above, and also mean the value is
> correctly invalid from the get-go, which would be good IMO.
>
> I think now that you've added the slot accessors it would be pretty easy
> to do.

That would be imply, we loose one slot in primary group, which means we
will do extra work in some case because our primary now has only 7
slots. And in case of pseries, the hypervisor will always return the
least available slot, which implie we will do extra hcalls in case of an
hpte insert to an empty group?

-aneesh
Ram Pai Oct. 23, 2017, 4:29 p.m. | #5
On Mon, Oct 23, 2017 at 02:17:39PM +0530, Aneesh Kumar K.V wrote:
> Michael Ellerman <mpe@ellerman.id.au> writes:
> 
> > Ram Pai <linuxram@us.ibm.com> writes:
> >
> >> diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
> >> index 1a68cb1..c6c5559 100644
> >> --- a/arch/powerpc/mm/hash64_64k.c
> >> +++ b/arch/powerpc/mm/hash64_64k.c
> >> @@ -126,18 +113,13 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >>  	if (__rpte_sub_valid(rpte, subpg_index)) {
> >>  		int ret;
> >>  
> >> -		hash = hpt_hash(vpn, shift, ssize);
> >> -		hidx = __rpte_to_hidx(rpte, subpg_index);
> >> -		if (hidx & _PTEIDX_SECONDARY)
> >> -			hash = ~hash;
> >> -		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> >> -		slot += hidx & _PTEIDX_GROUP_IX;
> >> +		gslot = pte_get_hash_gslot(vpn, shift, ssize, rpte,
> >> +					subpg_index);
> >> +		ret = mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn,
> >> +			MMU_PAGE_4K, MMU_PAGE_4K, ssize, flags);
> >
> > This was formatted correctly before:
> >   
> >> -		ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
> >> -						 MMU_PAGE_4K, MMU_PAGE_4K,
> >> -						 ssize, flags);
> >>  		/*
> >> -		 *if we failed because typically the HPTE wasn't really here
> >> +		 * if we failed because typically the HPTE wasn't really here
> >
> > If you're fixing it up please make it "If ...".
> >
> >>  		 * we try an insertion.
> >>  		 */
> >>  		if (ret == -1)
> >> @@ -148,6 +130,15 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >>  	}
> >>  
> >>  htab_insert_hpte:
> >> +
> >> +	/*
> >> +	 * initialize all hidx entries to invalid value,
> >> +	 * the first time the PTE is about to allocate
> >> +	 * a 4K hpte
> >> +	 */
> >
> > Should be:
> > 	/*
> > 	 * Initialize all hidx entries to invalid value, the first time
> >          * the PTE is about to allocate a 4K HPTE.
> > 	 */
> >
> >> +	if (!(old_pte & H_PAGE_COMBO))
> >> +		rpte.hidx = ~0x0UL;
> >> +
> >
> > Paul had the idea that if we biased the slot number by 1, we could make
> > the "invalid" value be == 0.
> >
> > That would avoid needing to that above, and also mean the value is
> > correctly invalid from the get-go, which would be good IMO.
> >
> > I think now that you've added the slot accessors it would be pretty easy
> > to do.
> 
> That would be imply, we loose one slot in primary group, which means we
> will do extra work in some case because our primary now has only 7
> slots. And in case of pseries, the hypervisor will always return the
> least available slot, which implie we will do extra hcalls in case of an
> hpte insert to an empty group?


No. that is not the idea.  The idea is that slot 'F' in the seconday
will continue to be a invalid slot, but will be represented as
offset-by-one in the PTE.  In other words, 0 will be repesented as 1,
1 as 2....   and  n as (n+1)%32

The idea seems feasible.  It has the advantage -- where 0 in the PTE
means invalid slot. But it can be confusing to the casual code-
reader. Will need to put in a big-huge comment to explain that.

RP
Michael Ellerman Oct. 25, 2017, 9:18 a.m. | #6
Ram Pai <linuxram@us.ibm.com> writes:
> On Mon, Oct 23, 2017 at 02:17:39PM +0530, Aneesh Kumar K.V wrote:
>> Michael Ellerman <mpe@ellerman.id.au> writes:
>> > Ram Pai <linuxram@us.ibm.com> writes:
...
>> > Should be:
>> > 	/*
>> > 	 * Initialize all hidx entries to invalid value, the first time
>> >          * the PTE is about to allocate a 4K HPTE.
>> > 	 */
>> >
>> >> +	if (!(old_pte & H_PAGE_COMBO))
>> >> +		rpte.hidx = ~0x0UL;
>> >> +
>> >
>> > Paul had the idea that if we biased the slot number by 1, we could make
>> > the "invalid" value be == 0.
>> >
>> > That would avoid needing to that above, and also mean the value is
>> > correctly invalid from the get-go, which would be good IMO.
>> >
>> > I think now that you've added the slot accessors it would be pretty easy
>> > to do.
>> 
>> That would be imply, we loose one slot in primary group, which means we
>> will do extra work in some case because our primary now has only 7
>> slots. And in case of pseries, the hypervisor will always return the
>> least available slot, which implie we will do extra hcalls in case of an
>> hpte insert to an empty group?
>
> No. that is not the idea.  The idea is that slot 'F' in the seconday
> will continue to be a invalid slot, but will be represented as
> offset-by-one in the PTE.  In other words, 0 will be repesented as 1,
> 1 as 2....   and  n as (n+1)%32

Right.

> The idea seems feasible.  It has the advantage -- where 0 in the PTE
> means invalid slot. But it can be confusing to the casual code-
> reader. Will need to put in a big-huge comment to explain that.

This code is already confusing to *any* reader, so I don't think it's a
worry. :)

cheers
Ram Pai Oct. 26, 2017, 6:08 a.m. | #7
On Wed, Oct 25, 2017 at 11:18:49AM +0200, Michael Ellerman wrote:
> Ram Pai <linuxram@us.ibm.com> writes:
> > On Mon, Oct 23, 2017 at 02:17:39PM +0530, Aneesh Kumar K.V wrote:
> >> Michael Ellerman <mpe@ellerman.id.au> writes:
> >> > Ram Pai <linuxram@us.ibm.com> writes:
> ...
> >> > Should be:
> >> > 	/*
> >> > 	 * Initialize all hidx entries to invalid value, the first time
> >> >          * the PTE is about to allocate a 4K HPTE.
> >> > 	 */
> >> >
> >> >> +	if (!(old_pte & H_PAGE_COMBO))
> >> >> +		rpte.hidx = ~0x0UL;
> >> >> +
> >> >
> >> > Paul had the idea that if we biased the slot number by 1, we could make
> >> > the "invalid" value be == 0.
> >> >
> >> > That would avoid needing to that above, and also mean the value is
> >> > correctly invalid from the get-go, which would be good IMO.
> >> >
> >> > I think now that you've added the slot accessors it would be pretty easy
> >> > to do.
> >> 
> >> That would be imply, we loose one slot in primary group, which means we
> >> will do extra work in some case because our primary now has only 7
> >> slots. And in case of pseries, the hypervisor will always return the
> >> least available slot, which implie we will do extra hcalls in case of an
> >> hpte insert to an empty group?
> >
> > No. that is not the idea.  The idea is that slot 'F' in the seconday
> > will continue to be a invalid slot, but will be represented as
> > offset-by-one in the PTE.  In other words, 0 will be repesented as 1,
> > 1 as 2....   and  n as (n+1)%32
> 
> Right.
> 
> > The idea seems feasible.  It has the advantage -- where 0 in the PTE
> > means invalid slot. But it can be confusing to the casual code-
> > reader. Will need to put in a big-huge comment to explain that.
> 
> This code is already confusing to *any* reader, so I don't think it's a
> worry. :)

I just got it coded and working.  But I see no advantage implementing
the shifted-value. The hidx in the secondary-part of the pte, still
needs to be initialzed to all-zeros. Because it could contain the value of
the hidx corresponding the 64k-backed hpte, which needs to be cleared.

I will send the patch anyway.  But we should not apply it
for I see no apparent gain.

RP

> 
> cheers

Patch

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 8909039..e66bfeb 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -16,6 +16,8 @@ 
 #define H_PUD_TABLE_SIZE	(sizeof(pud_t) << H_PUD_INDEX_SIZE)
 #define H_PGD_TABLE_SIZE	(sizeof(pgd_t) << H_PGD_INDEX_SIZE)
 
+#define H_PAGE_BUSY	_RPAGE_RSV1     /* software: PTE & hash are busy */
+
 /* PTE flags to conserve for HPTE identification */
 #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
 			 H_PAGE_F_SECOND | H_PAGE_F_GIX)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 6652669..e038f1c 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -12,18 +12,14 @@ 
  */
 #define H_PAGE_COMBO	_RPAGE_RPN0 /* this is a combo 4k page */
 #define H_PAGE_4K_PFN	_RPAGE_RPN1 /* PFN is for a single 4k page */
+#define H_PAGE_BUSY	_RPAGE_RPN42     /* software: PTE & hash are busy */
+
 /*
  * We need to differentiate between explicit huge page and THP huge
  * page, since THP huge page also need to track real subpage details
  */
 #define H_PAGE_THP_HUGE  H_PAGE_4K_PFN
 
-/*
- * Used to track subpage group valid if H_PAGE_COMBO is set
- * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
- */
-#define H_PAGE_COMBO_VALID	(H_PAGE_F_GIX | H_PAGE_F_SECOND)
-
 /* PTE flags to conserve for HPTE identification */
 #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
 			 H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
index 060c059..8ce4112 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -9,7 +9,6 @@ 
  */
 #define H_PTE_NONE_MASK		_PAGE_HPTEFLAGS
 #define H_PAGE_F_GIX_SHIFT	56
-#define H_PAGE_BUSY		_RPAGE_RSV1 /* software: PTE & hash are busy */
 #define H_PAGE_F_SECOND		_RPAGE_RSV2	/* HPTE is in 2ndary HPTEG */
 #define H_PAGE_F_GIX		(_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
 #define H_PAGE_HASHPTE		_RPAGE_RPN43	/* PTE has associated HPTE */
diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
index 1a68cb1..c6c5559 100644
--- a/arch/powerpc/mm/hash64_64k.c
+++ b/arch/powerpc/mm/hash64_64k.c
@@ -15,34 +15,22 @@ 
 #include <linux/mm.h>
 #include <asm/machdep.h>
 #include <asm/mmu.h>
+
 /*
- * index from 0 - 15
+ * return true, if the entry has a slot value which
+ * the software considers as invalid.
  */
-bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
+static inline bool hpte_soft_invalid(unsigned long slot)
 {
-	unsigned long g_idx;
-	unsigned long ptev = pte_val(rpte.pte);
-
-	g_idx = (ptev & H_PAGE_COMBO_VALID) >> H_PAGE_F_GIX_SHIFT;
-	index = index >> 2;
-	if (g_idx & (0x1 << index))
-		return true;
-	else
-		return false;
+	return ((slot & 0xfUL) == 0xfUL);
 }
+
 /*
  * index from 0 - 15
  */
-static unsigned long mark_subptegroup_valid(unsigned long ptev, unsigned long index)
+bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
 {
-	unsigned long g_idx;
-
-	if (!(ptev & H_PAGE_COMBO))
-		return ptev;
-	index = index >> 2;
-	g_idx = 0x1 << index;
-
-	return ptev | (g_idx << H_PAGE_F_GIX_SHIFT);
+	return !(hpte_soft_invalid(rpte.hidx >> (index << 2)));
 }
 
 int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
@@ -50,12 +38,11 @@  int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
 		   int ssize, int subpg_prot)
 {
 	real_pte_t rpte;
-	unsigned long *hidxp;
 	unsigned long hpte_group;
 	unsigned int subpg_index;
-	unsigned long rflags, pa, hidx;
+	unsigned long rflags, pa;
 	unsigned long old_pte, new_pte, subpg_pte;
-	unsigned long vpn, hash, slot;
+	unsigned long vpn, hash, slot, gslot;
 	unsigned long shift = mmu_psize_defs[MMU_PAGE_4K].shift;
 
 	/*
@@ -126,18 +113,13 @@  int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
 	if (__rpte_sub_valid(rpte, subpg_index)) {
 		int ret;
 
-		hash = hpt_hash(vpn, shift, ssize);
-		hidx = __rpte_to_hidx(rpte, subpg_index);
-		if (hidx & _PTEIDX_SECONDARY)
-			hash = ~hash;
-		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-		slot += hidx & _PTEIDX_GROUP_IX;
+		gslot = pte_get_hash_gslot(vpn, shift, ssize, rpte,
+					subpg_index);
+		ret = mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn,
+			MMU_PAGE_4K, MMU_PAGE_4K, ssize, flags);
 
-		ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
-						 MMU_PAGE_4K, MMU_PAGE_4K,
-						 ssize, flags);
 		/*
-		 *if we failed because typically the HPTE wasn't really here
+		 * if we failed because typically the HPTE wasn't really here
 		 * we try an insertion.
 		 */
 		if (ret == -1)
@@ -148,6 +130,15 @@  int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
 	}
 
 htab_insert_hpte:
+
+	/*
+	 * initialize all hidx entries to invalid value,
+	 * the first time the PTE is about to allocate
+	 * a 4K hpte
+	 */
+	if (!(old_pte & H_PAGE_COMBO))
+		rpte.hidx = ~0x0UL;
+
 	/*
 	 * handle H_PAGE_4K_PFN case
 	 */
@@ -172,15 +163,41 @@  int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
 	 * Primary is full, try the secondary
 	 */
 	if (unlikely(slot == -1)) {
+		bool soft_invalid;
+
 		hpte_group = ((~hash & htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL;
 		slot = mmu_hash_ops.hpte_insert(hpte_group, vpn, pa,
 						rflags, HPTE_V_SECONDARY,
 						MMU_PAGE_4K, MMU_PAGE_4K,
 						ssize);
-		if (slot == -1) {
-			if (mftb() & 0x1)
+
+		soft_invalid = hpte_soft_invalid(slot);
+		if (unlikely(soft_invalid)) {
+			/*
+			 * we got a valid slot from a hardware point of view.
+			 * but we cannot use it, because we use this special
+			 * value; as     defined   by    hpte_soft_invalid(),
+			 * to  track    invalid  slots.  We  cannot  use  it.
+			 * So invalidate it.
+			 */
+			gslot = slot & _PTEIDX_GROUP_IX;
+			mmu_hash_ops.hpte_invalidate(hpte_group+gslot, vpn,
+				MMU_PAGE_4K, MMU_PAGE_4K,
+				ssize, 0);
+		}
+
+		if (unlikely(slot == -1 || soft_invalid)) {
+			/*
+			 * for soft invalid slot, lets   ensure that we
+			 * release a slot from  the primary,   with the
+			 * hope that we  will  acquire that slot   next
+			 * time we try. This will ensure that we do not
+			 * get the same soft-invalid slot.
+			 */
+			if (soft_invalid || (mftb() & 0x1))
 				hpte_group = ((hash & htab_hash_mask) *
 					      HPTES_PER_GROUP) & ~0x7UL;
+
 			mmu_hash_ops.hpte_remove(hpte_group);
 			/*
 			 * FIXME!! Should be try the group from which we removed ?
@@ -198,21 +215,10 @@  int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
 				   MMU_PAGE_4K, MMU_PAGE_4K, old_pte);
 		return -1;
 	}
-	/*
-	 * Insert slot number & secondary bit in PTE second half,
-	 * clear H_PAGE_BUSY and set appropriate HPTE slot bit
-	 * Since we have H_PAGE_BUSY set on ptep, we can be sure
-	 * nobody is undating hidx.
-	 */
-	hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
-	rpte.hidx &= ~(0xfUL << (subpg_index << 2));
-	*hidxp = rpte.hidx  | (slot << (subpg_index << 2));
-	new_pte = mark_subptegroup_valid(new_pte, subpg_index);
-	new_pte |=  H_PAGE_HASHPTE;
-	/*
-	 * check __real_pte for details on matching smp_rmb()
-	 */
-	smp_wmb();
+
+	new_pte |= pte_set_hash_slot(ptep, rpte, subpg_index, slot);
+	new_pte |= H_PAGE_HASHPTE;
+
 	*ptep = __pte(new_pte & ~H_PAGE_BUSY);
 	return 0;
 }
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index e68f053..a40c7bc 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -978,8 +978,9 @@  void __init hash__early_init_devtree(void)
 
 void __init hash__early_init_mmu(void)
 {
+#ifndef CONFIG_PPC_64K_PAGES
 	/*
-	 * We have code in __hash_page_64K() and elsewhere, which assumes it can
+	 * We have code in __hash_page_4K() and elsewhere, which assumes it can
 	 * do the following:
 	 *   new_pte |= (slot << H_PAGE_F_GIX_SHIFT) & (H_PAGE_F_SECOND | H_PAGE_F_GIX);
 	 *
@@ -990,6 +991,7 @@  void __init hash__early_init_mmu(void)
 	 * with a BUILD_BUG_ON().
 	 */
 	BUILD_BUG_ON(H_PAGE_F_SECOND != (1ul  << (H_PAGE_F_GIX_SHIFT + 3)));
+#endif /* CONFIG_PPC_64K_PAGES */
 
 	htab_init_page_sizes();