[kernel] powerpc/pseries/iommu: Add cond_resched() for huge updates
diff mbox series

Message ID 20190722082821.37310-1-aik@ozlabs.ru
State Changes Requested
Headers show
Series
  • [kernel] powerpc/pseries/iommu: Add cond_resched() for huge updates
Related show

Checks

Context Check Description
snowpatch_ozlabs/checkpatch success total: 0 errors, 0 warnings, 0 checks, 7 lines checked
snowpatch_ozlabs/build-pmac32 success Build succeeded
snowpatch_ozlabs/build-ppc64e success Build succeeded
snowpatch_ozlabs/build-ppc64be success Build succeeded
snowpatch_ozlabs/build-ppc64le success Build succeeded
snowpatch_ozlabs/apply_patch success Successfully applied on branch next (f3365d1a959d5c6527efe3d38276acc9b58e3f3f)

Commit Message

Alexey Kardashevskiy July 22, 2019, 8:28 a.m. UTC
Mapping ~5.000.000 TCEs currently takes about 40s; this is the amount
required for a 300GB VM with 64k IOMMU page size. Anything bigger than
this produces RCU stall warnings.

This adds cond_resched() to allow the scheduler to do context switching
when it decides to.

This loop is called from dma_set_mask() which is a sleepable context.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/pseries/iommu.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Michael Ellerman July 23, 2019, 11:46 a.m. UTC | #1
Alexey Kardashevskiy <aik@ozlabs.ru> writes:
> Mapping ~5.000.000 TCEs currently takes about 40s; this is the amount
> required for a 300GB VM with 64k IOMMU page size. Anything bigger than
> this produces RCU stall warnings.

OK. Are we sure we're not doing anything stupid in that code to make it
go that slowly?

> This adds cond_resched() to allow the scheduler to do context switching
> when it decides to.
>
> This loop is called from dma_set_mask() which is a sleepable context.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 889dc2e44b89..2b8de822272f 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -459,6 +459,7 @@ static int tce_setrange_multi_pSeriesLP(unsigned long start_pfn,
>  static int tce_setrange_multi_pSeriesLP_walk(unsigned long start_pfn,
>  		unsigned long num_pfn, void *arg)
>  {
> +	cond_resched();
>  	return tce_setrange_multi_pSeriesLP(start_pfn, num_pfn, arg);
>  }

Why there and not in tce_setrange_multi_pSeriesLP() ?

I'm not sure what the maximum granularity walk_system_ram_range() will
ever call us with is.

cheers
Alexey Kardashevskiy July 23, 2019, 2:50 p.m. UTC | #2
On 23/07/2019 21:46, Michael Ellerman wrote:
> Alexey Kardashevskiy <aik@ozlabs.ru> writes:
>> Mapping ~5.000.000 TCEs currently takes about 40s; this is the amount
>> required for a 300GB VM with 64k IOMMU page size. Anything bigger than
>> this produces RCU stall warnings.
> 
> OK. Are we sure we're not doing anything stupid in that code to make it
> go that slowly?

Each tce_setrange_multi_pSeriesLP() is a hypercall and KVM (if it is KVM
and the call was not bounced to QEMU which I believe does not happen)
walks through all 512 TCEs in the request and does TCE Kill for each of
those TCEs which in turn are OPAL calls, and not just one per TCE but
two - one for PHB's TCE and one for NPU's TCE. And I have a test patch
to do TCE kills in powernv (not in OPAL), 40s figure was taken with this
patch.

So I agree it should be faster but it won't be 1-2s and for longer
operations we will need this resched.


> 
>> This adds cond_resched() to allow the scheduler to do context switching
>> when it decides to.
>>
>> This loop is called from dma_set_mask() which is a sleepable context.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  arch/powerpc/platforms/pseries/iommu.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>> index 889dc2e44b89..2b8de822272f 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -459,6 +459,7 @@ static int tce_setrange_multi_pSeriesLP(unsigned long start_pfn,
>>  static int tce_setrange_multi_pSeriesLP_walk(unsigned long start_pfn,
>>  		unsigned long num_pfn, void *arg)
>>  {
>> +	cond_resched();
>>  	return tce_setrange_multi_pSeriesLP(start_pfn, num_pfn, arg);
>>  }
> 
> Why there and not in tce_setrange_multi_pSeriesLP() ?

The other caller is iommu_mem_notifier and I am unsure about locking
there, there may be nasty surprises.

> I'm not sure what the maximum granularity walk_system_ram_range() will
> ever call us with is.

Contiguous blocks, which are too big. You're right, this resched better
go to tce_setrange_multi_pSeriesLP.

Patch
diff mbox series

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 889dc2e44b89..2b8de822272f 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -459,6 +459,7 @@  static int tce_setrange_multi_pSeriesLP(unsigned long start_pfn,
 static int tce_setrange_multi_pSeriesLP_walk(unsigned long start_pfn,
 		unsigned long num_pfn, void *arg)
 {
+	cond_resched();
 	return tce_setrange_multi_pSeriesLP(start_pfn, num_pfn, arg);
 }