
npu2: Invalidate entire TCE cache if many entries requested

Message ID 20190819061748.78922-1-aik@ozlabs.ru
State Accepted
Headers show
Series npu2: Invalidate entire TCE cache if many entries requested | expand

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch master (a1fced25bf41f1f94a3673a0b2bf68135eedce25)
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot success Test snowpatch/job/snowpatch-skiboot on branch master
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco success Signed-off-by present

Commit Message

Alexey Kardashevskiy Aug. 19, 2019, 6:17 a.m. UTC
It turns out that invalidating entries in the NPU TCE cache is so slow
that it becomes visible when running a 30+GB guest with GPU+NVLink2
passed through; a 100GB guest takes about 20s to map all 100GB.

This change falls through to invalidating the entire cache if more
than 128 TCEs were requested to be invalidated; this reduces the 20s
from the above to less than 1s. The KVM change [1] is required to see
this difference.

The threshold of 128 is chosen in an attempt not to affect performance
much, as it is not clear how expensive it is to repopulate the TCE
cache; all we know for sure is that mapping the guest produces
invalidation requests of 512 TCEs each.

Note that TCE cache invalidation on PHB4 is faster and does not
require the same workaround.

[1] KVM: PPC: vfio/spapr_tce: Split out TCE invalidation from TCE updates
https://patchwork.ozlabs.org/patch/1149003/
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/npu2.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)
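
For illustration, a distilled, compilable model of the strategy above.
This is not skiboot code: kill_one(), kill_all() and TCE_KILL_THRESHOLD
are made-up stand-ins for the NPU2_ATS_TCE_KILL register writes shown in
the patch below; only the threshold of 128 and the 512-TCE request size
come from the commit message.

#include <stdint.h>
#include <stdio.h>

#define TCE_KILL_THRESHOLD 128

/* Hypothetical stand-in for one NPU2_ATS_TCE_KILL_ONE register write. */
static void kill_one(uint64_t dma_addr)
{
	printf("kill one TCE @ 0x%llx\n", (unsigned long long)dma_addr);
}

/* Hypothetical stand-in for a whole-cache invalidation. */
static void kill_all(void)
{
	printf("kill entire TCE cache\n");
}

static void tce_kill(uint64_t dma_addr, uint64_t tce_size, uint32_t npages)
{
	if (npages < TCE_KILL_THRESHOLD) {
		while (npages--) {
			kill_one(dma_addr);
			dma_addr += tce_size;
		}
		return;
	}
	/* Too many entries: one full flush beats hundreds of slow
	 * per-entry kills. */
	kill_all();
}

int main(void)
{
	tce_kill(0x1000, 0x10000, 4);	/* small request: per-entry kills */
	tce_kill(0x1000, 0x10000, 512);	/* guest mapping: whole-cache flush */
	return 0;
}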

Comments

Alistair Popple Aug. 20, 2019, 5:21 a.m. UTC | #1
On Monday, 19 August 2019 4:17:48 PM AEST Alexey Kardashevskiy wrote:
> It turns out that invalidating entries in the NPU TCE cache is so slow
> that it becomes visible when running a 30+GB guest with GPU+NVLink2
> passed through; a 100GB guest takes about 20s to map all 100GB.
> 
> This change falls through to invalidating the entire cache if more
> than 128 TCEs were requested to be invalidated; this reduces the 20s
> from the above to less than 1s. The KVM change [1] is required to see
> this difference.
> 
> The threshold of 128 is chosen in an attempt not to affect performance
> much, as it is not clear how expensive it is to repopulate the TCE
> cache; all we know for sure is that mapping the guest produces
> invalidation requests of 512 TCEs each.
> 
> Note that TCE cache invalidation on PHB4 is faster and does not
> require the same workaround.

Do you know why PHB4 is so much faster? I suspect it is because the NPU is 
still doing SCOM read/writes (which itself is an indirect access) to perform 
the TCE invalidate rather than direct MMIO. Does the problem go away if you 
use out_be64(npu->regs + NPU2_ATS_TCE_KILL, ...) instead of npu2_write(...)?

- Alistair
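
For context, the experiment being suggested swaps the accessor inside
the per-TCE loop of npu2_tce_kill(). A minimal sketch, assuming
npu->regs points at the NPU MMIO space as in the out_be64() call above
(not tested code):

	while (npages--) {
		val = SETFIELD(NPU2_ATS_TCE_KILL_PENUM, dma_addr, pe_number);
		/* was: npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val); */
		out_be64(npu->regs + NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
		dma_addr += tce_size;
	}

Alexey reports the measured effect of exactly this substitution in the
reply below.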

Alexey Kardashevskiy Aug. 20, 2019, 6:39 a.m. UTC | #2
On 20/08/2019 15:21, Alistair Popple wrote:
> On Monday, 19 August 2019 4:17:48 PM AEST Alexey Kardashevskiy wrote:
>> It turns out that invalidating entries in the NPU TCE cache is so slow
>> that it becomes visible when running a 30+GB guest with GPU+NVLink2
>> passed through; a 100GB guest takes about 20s to map all 100GB.
>>
>> This change falls through to invalidating the entire cache if more
>> than 128 TCEs were requested to be invalidated; this reduces the 20s
>> from the above to less than 1s. The KVM change [1] is required to see
>> this difference.
>>
>> The threshold of 128 is chosen in an attempt not to affect performance
>> much, as it is not clear how expensive it is to repopulate the TCE
>> cache; all we know for sure is that mapping the guest produces
>> invalidation requests of 512 TCEs each.
>>
>> Note that TCE cache invalidation on PHB4 is faster and does not
>> require the same workaround.
> 
> Do you know why PHB4 is so much faster? I suspect it is because the NPU is
> still doing SCOM read/writes (which itself is an indirect access) to perform
> the TCE invalidate rather than direct MMIO. Does the problem go away if you
> use out_be64(npu->regs + NPU2_ATS_TCE_KILL, ...) instead of npu2_write(...)?


Doing this:

- npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
+ out_be64(npu->regs + NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);

brings the 10s down to 3-4s for a 100GB guest. Invalidating the entire
cache is still way faster.

Alistair Popple Aug. 20, 2019, 6:44 a.m. UTC | #3
On Tuesday, 20 August 2019 4:39:47 PM AEST Alexey Kardashevskiy wrote:
> 
> On 20/08/2019 15:21, Alistair Popple wrote:
> > Do you know why PHB4 is so much faster? I suspect it is because the NPU is
> > still doing SCOM read/writes (which itself is an indirect access) to perform
> > the TCE invalidate rather than direct MMIO. Does the problem go away if you
> > use out_be64(npu->regs + NPU2_ATS_TCE_KILL, ...) instead of npu2_write(...)?
> 
> 
> Doing this:
> 
> - npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
> + out_be64(npu->regs + NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
> 
> brings the 10s down to 3-4s for a 100GB guest. Invalidating the entire
> cache is still way faster.

Good to know, thanks.

In that case this does look like the best option, given that TCE-based
access should be rare anyway, so:

Reviewed-by: Alistair Popple <alistair@popple.id.au>

Patch

diff --git a/hw/npu2.c b/hw/npu2.c
index 1dba8bb00f85..40583e343a98 100644
--- a/hw/npu2.c
+++ b/hw/npu2.c
@@ -1277,12 +1277,19 @@  static int64_t npu2_tce_kill(struct phb *phb, uint32_t kill_type,
 			return OPAL_PARAMETER;
 		}
 
-		while (npages--) {
-			val = SETFIELD(NPU2_ATS_TCE_KILL_PENUM, dma_addr, pe_number);
-			npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
-			dma_addr += tce_size;
+		if (npages < 128) {
+			while (npages--) {
+				val = SETFIELD(NPU2_ATS_TCE_KILL_PENUM, dma_addr, pe_number);
+				npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
+				dma_addr += tce_size;
+			}
+			break;
 		}
-		break;
+		/*
+		 * When there are too many TCEs, do not bother with the loop above;
+		 * simply flush everything, which is going to be a lot faster.
+		 */
+		/* Fall through */
 	case OPAL_PCI_TCE_KILL_PE:
 		/*
 		 * NPU2 doesn't support killing a PE so fall through