Patchwork [-V1,09/24] powerpc: Decode the pte-lp-encoding bits correctly.

Submitter Aneesh Kumar K.V
Date March 4, 2013, 11:41 a.m.
Message ID <87y5e31jem.fsf@linux.vnet.ibm.com>
Permalink /patch/224703/
State Superseded

Comments

Aneesh Kumar K.V - March 4, 2013, 11:41 a.m.
Paul Mackerras <paulus@samba.org> writes:

> On Tue, Feb 26, 2013 at 01:34:59PM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> 
>> We look at both the segment base page size and actual page size and store
>> the pte-lp-encodings in an array per base page size.
>> 
>> We also update all relevant functions to take actual page size argument
>> so that we can use the correct PTE LP encoding in HPTE. This should also
>> get the basic Multiple Page Size per Segment (MPSS) support. This is needed
>> to enable THP on ppc64.
>
> Mostly looks OK, comments below...
>
>> +/*
>> + * HPTE LP details
>> + */
>> +#define LP_SHIFT	12
>> +#define LP_BITS		8
>> +#define LP_MASK(i)	((0xFF >> (i)) << LP_SHIFT)
>
> The reader might be wondering at this point what "LP" is; be kind and
> make it "large page" in the comment for them.

Fixed.

>
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 71d0c90..48f6d99 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -1515,7 +1515,7 @@ static void kvmppc_add_seg_page_size(struct kvm_ppc_one_seg_page_size **sps,
>>  	(*sps)->page_shift = def->shift;
>>  	(*sps)->slb_enc = def->sllp;
>>  	(*sps)->enc[0].page_shift = def->shift;
>> -	(*sps)->enc[0].pte_enc = def->penc;
>> +	(*sps)->enc[0].pte_enc = def->penc[linux_psize];
>>  	(*sps)++;
>>  }
>
> This will only return the entries where actual page size == base page
> size, which basically means that KVM guests won't be able to use
> MPSS.  We will need to return multiple entries in that case.

I did that in a follow-up patch:

[PATCH -V1 10/24] powerpc: Return all the valid pte ecndoing in
KVM_PPC_GET_SMMU_INFO ioctl

>
>> +static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
>> +{
>> +	unsigned int mask;
>> +	int i, penc, shift;
>> +	/* Look at the 8 bit LP value */
>> +	unsigned int lp = (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
>> +
>> +	penc = 0;
>> +	for (i = 0; i < MMU_PAGE_COUNT; i++) {
>> +		/* valid entries have a shift value */
>> +		if (!mmu_psize_defs[i].shift)
>> +			continue;
>> +
>> +		/* encoding bits per actual page size */
>> +		shift = mmu_psize_defs[i].shift - 11;
>> +		if (shift > 9)
>> +			shift = 9;
>> +		mask = (1 << shift) - 1;
>> +		if ((lp & mask) == mmu_psize_defs[psize].penc[i])
>> +			return i;
>> +	}
>> +	return -1;
>> +}
>
> This doesn't look right to me.  First, it's not clear what the 11 and
> 9 refer to, and I think the 9 should be LP_BITS (i.e. 8).  Secondly,
> the mask for the comparison needs to depend on the actual page size
> not the base page size.

That 11 should be 12. It depends on the fact that we have the mapping below:
 rrrr rrrz 	≥8KB

Yes, that 9 should be LP_BITS.

We are generating mask based on actual page size above (variable i in
the for loop).


>
> I strongly suggest you pull out this code together with
> native_hpte_insert into a little userspace test program that runs
> through all the possible page size combinations, creating an HPTE and
> then decoding it with hpte_actual_psize() to check that you get back
> the correct actual page size.
>

Will do.
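
A minimal userspace sketch of the round-trip test suggested above, not the kernel code itself. The page-size table, the penc values, and the helper names (encode_r, decode_actual, roundtrip_failures) are assumptions made up for illustration; real encodings come from the ibm,segment-page-sizes property. The decoder masks the LP field by the actual page size, as discussed in this thread, and skips combinations marked -1.

```c
#include <assert.h>
#include <stdint.h>

#define LP_SHIFT 12	/* position of the LP field in the second HPTE dword */
#define LP_BITS  8	/* width of the LP field */

enum { PAGE_4K, PAGE_64K, PAGE_16M, PAGE_COUNT };

static const int page_shift[PAGE_COUNT] = { 12, 16, 24 };

/*
 * penc[base][actual]; -1 marks an invalid combination.
 * Illustrative values only. 4K actual pages carry no LP encoding,
 * so that column is -1 in this model.
 */
static const int penc[PAGE_COUNT][PAGE_COUNT] = {
	[PAGE_4K]  = { -1,  7, 0x38 },
	[PAGE_64K] = { -1,  1, 0x08 },
	[PAGE_16M] = { -1, -1, 0x00 },
};

/* Build an "r" word: all real-page-number bits set, plus the encoding. */
static uint64_t encode_r(int base, int actual)
{
	uint64_t rpn = ~(uint64_t)0 << page_shift[actual];

	assert(penc[base][actual] >= 0);
	return rpn | ((uint64_t)penc[base][actual] << LP_SHIFT);
}

/* Recover the actual page size index from "r", given the base size. */
static int decode_actual(uint64_t r, int base)
{
	unsigned int lp = (unsigned int)(r >> LP_SHIFT) & ((1u << LP_BITS) - 1);
	int i;

	for (i = 0; i < PAGE_COUNT; i++) {
		unsigned int shift, mask;

		if (penc[base][i] < 0)
			continue;
		/* number of encoding bits grows with the actual page size */
		shift = page_shift[i] - page_shift[PAGE_4K];
		if (shift > LP_BITS)
			shift = LP_BITS;
		mask = (1u << shift) - 1;
		if ((lp & mask) == (unsigned int)penc[base][i])
			return i;
	}
	return -1;
}

/* Encode then decode every valid combination; count mismatches. */
static int roundtrip_failures(void)
{
	int base, actual, failures = 0;

	for (base = 0; base < PAGE_COUNT; base++)
		for (actual = 0; actual < PAGE_COUNT; actual++) {
			if (penc[base][actual] < 0)
				continue;
			if (decode_actual(encode_r(base, actual), base) != actual)
				failures++;
		}
	return failures;
}
```

Setting every real-page-number bit in encode_r() is deliberate: for a 64K page the upper four LP bits belong to the real page number, so a decoder that compared the whole 8-bit field would fail this test.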

>>  static void hpte_decode(struct hash_pte *hpte, unsigned long slot,
>> -			int *psize, int *ssize, unsigned long *vpn)
>> +			int *psize, int *apsize, int *ssize, unsigned long *vpn)
>>  {
>>  	unsigned long avpn, pteg, vpi;
>>  	unsigned long hpte_r = hpte->r;
>>  	unsigned long hpte_v = hpte->v;
>>  	unsigned long vsid, seg_off;
>> -	int i, size, shift, penc;
>> +	int i, size, a_size = MMU_PAGE_4K, shift, penc;
>>  
>>  	if (!(hpte_v & HPTE_V_LARGE))
>>  		size = MMU_PAGE_4K;
>> @@ -395,12 +422,13 @@ static void hpte_decode(struct hash_pte *hpte, unsigned long slot,
>>  			/* valid entries have a shift value */
>>  			if (!mmu_psize_defs[size].shift)
>>  				continue;
>> -
>> -			if (penc == mmu_psize_defs[size].penc)
>> -				break;
>> +			for (a_size = 0; a_size < MMU_PAGE_COUNT; a_size++)
>> +				if (penc == mmu_psize_defs[size].penc[a_size])
>> +					goto out;
>
> Once again I don't think this is correct, since the number of bits in
> the page size encoding depends on the page size.  In fact the
> calculation of penc in that function looks completely bogus to me (not
> that that is code that you have written or modified, but it looks to
> me like it needs fixing).

I am fixing that in a later patch:

powerpc: Fix hpte_decode to use the correct decoding for page sizes

But that will also need fixing as you suggested above.

>
>>  static int __init htab_dt_scan_page_sizes(unsigned long node,
>>  					  const char *uname, int depth,
>>  					  void *data)
>> @@ -294,60 +318,57 @@ static int __init htab_dt_scan_page_sizes(unsigned long node,
>>  		size /= 4;
>>  		cur_cpu_spec->mmu_features &= ~(MMU_FTR_16M_PAGE);
>>  		while(size > 0) {
>> -			unsigned int shift = prop[0];
>> +			unsigned int base_shift = prop[0];
>>  			unsigned int slbenc = prop[1];
>>  			unsigned int lpnum = prop[2];
>> -			unsigned int lpenc = 0;
>>  			struct mmu_psize_def *def;
>> -			int idx = -1;
>> +			int idx, base_idx;
>>  
>>  			size -= 3; prop += 3;
>> -			while(size > 0 && lpnum) {
>> -				if (prop[0] == shift)
>> -					lpenc = prop[1];
>> +			base_idx = get_idx_from_shift(base_shift);
>> +			if (base_idx < 0) {
>> +				/*
>> +				 * skip the pte encoding also
>> +				 */
>>  				prop += 2; size -= 2;
>> -				lpnum--;
>> +				continue;
>>  			}
>> -			switch(shift) {
>> -			case 0xc:
>> -				idx = MMU_PAGE_4K;
>> -				break;
>> -			case 0x10:
>> -				idx = MMU_PAGE_64K;
>> -				break;
>> -			case 0x14:
>> -				idx = MMU_PAGE_1M;
>> -				break;
>> -			case 0x18:
>> -				idx = MMU_PAGE_16M;
>> +			def = &mmu_psize_defs[base_idx];
>> +			if (base_idx == MMU_PAGE_16M)
>>  				cur_cpu_spec->mmu_features |= MMU_FTR_16M_PAGE;
>> -				break;
>> -			case 0x22:
>> -				idx = MMU_PAGE_16G;
>> -				break;
>> -			}
>> -			if (idx < 0)
>> -				continue;
>> -			def = &mmu_psize_defs[idx];
>> -			def->shift = shift;
>> -			if (shift <= 23)
>> +
>> +			def->shift = base_shift;
>> +			if (base_shift <= 23)
>>  				def->avpnm = 0;
>>  			else
>> -				def->avpnm = (1 << (shift - 23)) - 1;
>> +				def->avpnm = (1 << (base_shift - 23)) - 1;
>>  			def->sllp = slbenc;
>> -			def->penc = lpenc;
>> -			/* We don't know for sure what's up with tlbiel, so
>> +			/*
>> +			 * We don't know for sure what's up with tlbiel, so
>>  			 * for now we only set it for 4K and 64K pages
>>  			 */
>> -			if (idx == MMU_PAGE_4K || idx == MMU_PAGE_64K)
>> +			if (base_idx == MMU_PAGE_4K || base_idx == MMU_PAGE_64K)
>>  				def->tlbiel = 1;
>>  			else
>>  				def->tlbiel = 0;
>>  
>> -			DBG(" %d: shift=%02x, sllp=%04lx, avpnm=%08lx, "
>> -			    "tlbiel=%d, penc=%d\n",
>> -			    idx, shift, def->sllp, def->avpnm, def->tlbiel,
>> -			    def->penc);
>> +			while (size > 0 && lpnum) {
>> +				unsigned int shift = prop[0];
>> +				unsigned int penc  = prop[1];
>> +
>> +				prop += 2; size -= 2;
>> +				lpnum--;
>> +
>> +				idx = get_idx_from_shift(shift);
>> +				if (idx < 0)
>> +					continue;
>> +
>> +				def->penc[idx] = penc;
>> +				DBG(" %d: shift=%02x, sllp=%04lx, "
>> +				    "avpnm=%08lx, tlbiel=%d, penc=%d\n",
>> +				    idx, shift, def->sllp, def->avpnm,
>> +				    def->tlbiel, def->penc[idx]);
>> +			}
>
> I don't see where in this function you set the penc[] elements for
> invalid actual page sizes to -1.

We do the below
Paul Mackerras - March 5, 2013, 2:02 a.m.
On Mon, Mar 04, 2013 at 05:11:53PM +0530, Aneesh Kumar K.V wrote:
> Paul Mackerras <paulus@samba.org> writes:
> >> +static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
> >> +{
> >> +	unsigned int mask;
> >> +	int i, penc, shift;
> >> +	/* Look at the 8 bit LP value */
> >> +	unsigned int lp = (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
> >> +
> >> +	penc = 0;
> >> +	for (i = 0; i < MMU_PAGE_COUNT; i++) {
> >> +		/* valid entries have a shift value */
> >> +		if (!mmu_psize_defs[i].shift)
> >> +			continue;
> >> +
> >> +		/* encoding bits per actual page size */
> >> +		shift = mmu_psize_defs[i].shift - 11;
> >> +		if (shift > 9)
> >> +			shift = 9;
> >> +		mask = (1 << shift) - 1;
> >> +		if ((lp & mask) == mmu_psize_defs[psize].penc[i])
> >> +			return i;
> >> +	}
> >> +	return -1;
> >> +}
> >
> > This doesn't look right to me.  First, it's not clear what the 11 and
> > 9 refer to, and I think the 9 should be LP_BITS (i.e. 8).  Secondly,
> > the mask for the comparison needs to depend on the actual page size
> > not the base page size.
> 
> That 11 should be 12.That depends on the fact that we have below mapping

And the 12 should be LP_SHIFT, shouldn't it?

>  rrrr rrrz 	≥8KB
> 
> Yes, that 9 should be LP_BITs. 
> 
> We are generating mask based on actual page size above (variable i in
> the for loop).

OK, yes, you're right.

> > I don't see where in this function you set the penc[] elements for
> > invalid actual page sizes to -1.
> 
> We do the below
> 
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -125,7 +125,7 @@ static struct mmu_psize_def mmu_psize_defaults_old[] = {
>         [MMU_PAGE_4K] = {
>                 .shift  = 12,
>                 .sllp   = 0,
> -               .penc   = 0,
> +               .penc   = { [0 ... MMU_PAGE_COUNT - 1] = -1 },
>                 .avpnm  = 0,

Yes, which sets them for the entries you initialize, but not for the
others.  For example, the entry for MMU_PAGE_64K will initially be all
zeroes.  Then we find an entry in the ibm,segment-page-sizes property
for 64k pages, so we set mmu_psize_defs[MMU_PAGE_64K].shift to 16,
making that entry valid, but we never set any of the .penc[] entries
to -1, leading your other code to think that it can do (say) 1M pages
in a 64k segment using an encoding of 0.

Also, I noticed that the code in the if (base_idx < 0) statement is
wrong.  It needs to advance prop (and decrease size) by 2 * lpnum,
not just 2.

Paul.
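
The two points above (marking every penc[] entry invalid before parsing, and skipping 2 * lpnum words for an unknown base page size) can be sketched in userspace as follows. This is an illustration under assumptions, not the kernel function: struct psize_def, idx_from_shift() and parse_segment_page_sizes() are hypothetical names, and the property is modelled as a flat array of 32-bit cells laid out as (base shift, SLB encoding, lpnum) followed by lpnum (shift, penc) pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { PAGE_4K, PAGE_64K, PAGE_16M, PAGE_COUNT };

struct psize_def {
	unsigned int shift;	/* 0 means this base size is not supported */
	int penc[PAGE_COUNT];	/* -1 means invalid base/actual combination */
};

static int idx_from_shift(unsigned int shift)
{
	switch (shift) {
	case 12: return PAGE_4K;
	case 16: return PAGE_64K;
	case 24: return PAGE_16M;
	default: return -1;
	}
}

static void parse_segment_page_sizes(const uint32_t *prop, size_t size,
				     struct psize_def defs[PAGE_COUNT])
{
	int b, a;

	assert(prop != NULL || size == 0);

	/* Start with every base size unsupported and every encoding invalid. */
	for (b = 0; b < PAGE_COUNT; b++) {
		defs[b].shift = 0;
		for (a = 0; a < PAGE_COUNT; a++)
			defs[b].penc[a] = -1;
	}

	while (size >= 3) {
		unsigned int base_shift = prop[0];
		unsigned int lpnum = prop[2];	/* prop[1] (SLB encoding) unused here */
		int base_idx;

		prop += 3;
		size -= 3;

		base_idx = idx_from_shift(base_shift);
		if (base_idx < 0) {
			/* Unknown base size: skip all of its (shift, penc) pairs. */
			size_t skip = 2 * (size_t)lpnum;

			if (skip > size)
				skip = size;
			prop += skip;
			size -= skip;
			continue;
		}

		defs[base_idx].shift = base_shift;
		while (size >= 2 && lpnum) {
			int idx = idx_from_shift(prop[0]);

			if (idx >= 0)
				defs[base_idx].penc[idx] = (int)prop[1];
			prop += 2;
			size -= 2;
			lpnum--;
		}
	}
}
```

If the unknown-base branch skipped only two words, the parser would resume in the middle of that record and treat leftover (shift, penc) pairs as a new record header, so later base sizes would be misread.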
Aneesh Kumar K.V - March 6, 2013, 4:30 a.m.
Paul Mackerras <paulus@samba.org> writes:

> On Mon, Mar 04, 2013 at 05:11:53PM +0530, Aneesh Kumar K.V wrote:
>> Paul Mackerras <paulus@samba.org> writes:
>> >> +static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
>> >> +{
>> >> +	unsigned int mask;
>> >> +	int i, penc, shift;
>> >> +	/* Look at the 8 bit LP value */
>> >> +	unsigned int lp = (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
>> >> +
>> >> +	penc = 0;
>> >> +	for (i = 0; i < MMU_PAGE_COUNT; i++) {
>> >> +		/* valid entries have a shift value */
>> >> +		if (!mmu_psize_defs[i].shift)
>> >> +			continue;
>> >> +
>> >> +		/* encoding bits per actual page size */
>> >> +		shift = mmu_psize_defs[i].shift - 11;
>> >> +		if (shift > 9)
>> >> +			shift = 9;
>> >> +		mask = (1 << shift) - 1;
>> >> +		if ((lp & mask) == mmu_psize_defs[psize].penc[i])
>> >> +			return i;
>> >> +	}
>> >> +	return -1;
>> >> +}
>> >
>> > This doesn't look right to me.  First, it's not clear what the 11 and
>> > 9 refer to, and I think the 9 should be LP_BITS (i.e. 8).  Secondly,
>> > the mask for the comparison needs to depend on the actual page size
>> > not the base page size.
>> 
>> That 11 should be 12.That depends on the fact that we have below mapping
>
> And the 12 should be LP_SHIFT, shouldn't it?

LP_SHIFT indicates how many bit positions we need to shift to get to the
LP field in the HPTE. I guess what we want here is the shift value for a
4K page. How about

shift = mmu_psize_defs[i].shift - mmu_psize_defs[MMU_PAGE_4K].shift;
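
As a standalone illustration of that computation (a sketch; lp_mask_for_shift and SHIFT_4K are hypothetical names, with SHIFT_4K standing in for mmu_psize_defs[MMU_PAGE_4K].shift), the mask applied to the LP field follows from the actual page shift like this:

```c
#include <assert.h>

#define LP_BITS  8	/* width of the LP field */
#define SHIFT_4K 12	/* assumed shift of the smallest (4K) page */

/*
 * Return the mask selecting the LP bits that encode the page size
 * for an actual page of 2^actual_shift bytes (larger than 4K).
 * Each doubling of the page size frees one more low-order bit,
 * up to the full 8-bit field.
 */
static unsigned int lp_mask_for_shift(unsigned int actual_shift)
{
	unsigned int bits;

	assert(actual_shift > SHIFT_4K);
	bits = actual_shift - SHIFT_4K;
	if (bits > LP_BITS)
		bits = LP_BITS;
	return (1u << bits) - 1;
}
```

With these definitions an 8K page uses one bit (mask 0x01), a 64K page four bits (mask 0x0f), and anything of 1M or more the whole field (mask 0xff).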


>
>>  rrrr rrrz 	≥8KB
>> 
>> Yes, that 9 should be LP_BITs. 
>> 
>> We are generating mask based on actual page size above (variable i in
>> the for loop).
>
> OK, yes, you're right.
>
>> > I don't see where in this function you set the penc[] elements for
>> > invalid actual page sizes to -1.
>> 
>> We do the below
>> 
>> --- a/arch/powerpc/mm/hash_utils_64.c
>> +++ b/arch/powerpc/mm/hash_utils_64.c
>> @@ -125,7 +125,7 @@ static struct mmu_psize_def mmu_psize_defaults_old[] = {
>>         [MMU_PAGE_4K] = {
>>                 .shift  = 12,
>>                 .sllp   = 0,
>> -               .penc   = 0,
>> +               .penc   = { [0 ... MMU_PAGE_COUNT - 1] = -1 },
>>                 .avpnm  = 0,
>
> Yes, which sets them for the entries you initialize, but not for the
> others.  For example, the entry for MMU_PAGE_64K will initially be all
> zeroes.  Then we find an entry in the ibm,segment-page-sizes property
> for 64k pages, so we set mmu_psize_defs[MMU_PAGE_64K].shift to 16,
> making that entry valid, but we never set any of the .penc[] entries
> to -1, leading your other code to think that it can do (say) 1M pages
> in a 64k segment using an encoding of 0.
>

I noticed that earlier. This is what I currently have:

+static void mmu_psize_set_default_penc(struct mmu_psize_def *mmu_psize)
+{
+	int bpsize, apsize;
+	for (bpsize = 0; bpsize < MMU_PAGE_COUNT; bpsize++)
+		for (apsize = 0; apsize < MMU_PAGE_COUNT; apsize++)
+			mmu_psize[bpsize].penc[apsize] = -1;
+}
+
 static void __init htab_init_page_sizes(void)
 {
 	int rc;
 
+	mmu_psize_set_default_penc(mmu_psize_defaults_old);
+
 	/* Default to 4K pages only */
 	memcpy(mmu_psize_defs, mmu_psize_defaults_old,
 	       sizeof(mmu_psize_defaults_old));
@@ -411,6 +443,8 @@ static void __init htab_init_page_sizes(void)
 	if (rc != 0)  /* Found */
 		goto found;
 
+	mmu_psize_set_default_penc(mmu_psize_defaults_gp);
+
 	/*
 	 * Not in the device-tree, let's fallback on known size
 	 * list for 16M capable GP & GR



> Also, I noticed that the code in the if (base_idx < 0) statement is
> wrong.  It needs to advance prop (and decrease size) by 2 * lpnum,
> not just 2.
>

Ok. Fixed now.

-aneesh

Patch

--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -125,7 +125,7 @@  static struct mmu_psize_def mmu_psize_defaults_old[] = {
        [MMU_PAGE_4K] = {
                .shift  = 12,
                .sllp   = 0,
-               .penc   = 0,
+               .penc   = { [0 ... MMU_PAGE_COUNT - 1] = -1 },
                .avpnm  = 0,

-aneesh