| Message ID | 20190819061748.78922-1-aik@ozlabs.ru |
|---|---|
| State | Accepted |
| Series | npu2: Invalidate entire TCE cache if many entries requested |
| Context | Check | Description |
|---|---|---|
| snowpatch_ozlabs/apply_patch | success | Successfully applied on branch master (a1fced25bf41f1f94a3673a0b2bf68135eedce25) |
| snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot | success | Test snowpatch/job/snowpatch-skiboot on branch master |
| snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco | success | Signed-off-by present |
On Monday, 19 August 2019 4:17:48 PM AEST Alexey Kardashevskiy wrote:
> Turned out invalidating entries in the NPU TCE cache is so slow that it
> becomes visible when running a 30+GB guest with GPU+NVLink2 passed
> through; a 100GB guest takes about 20s to map all 100GB.
>
> This falls through to the entire cache invalidation if more than 128
> TCEs were requested to invalidate, which reduces the 20s above to less
> than 1s. The KVM change [1] is required to see this difference.
>
> The threshold of 128 is chosen in an attempt not to affect performance
> much, as it is not clear how expensive it is to populate the TCE cache
> again; all we know for sure is that mapping the guest produces
> invalidation requests of 512 TCEs each.
>
> Note TCE cache invalidation in PHB4 is faster and does not require
> the same workaround.

Do you know why PHB4 is so much faster? I suspect it is because the NPU
is still doing SCOM reads/writes (which are themselves indirect accesses)
to perform the TCE invalidate rather than direct MMIO. Does the problem
go away if you use out_be64(npu->regs + NPU2_ATS_TCE_KILL, ...) instead
of npu2_write(...)?

- Alistair

> [1] KVM: PPC: vfio/spapr_tce: Split out TCE invalidation from TCE updates
>     https://patchwork.ozlabs.org/patch/1149003/
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/npu2.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
On 20/08/2019 15:21, Alistair Popple wrote:
> On Monday, 19 August 2019 4:17:48 PM AEST Alexey Kardashevskiy wrote:
>> Note TCE cache invalidation in PHB4 is faster and does not require
>> the same workaround.
>
> Do you know why PHB4 is so much faster? I suspect it is because the NPU
> is still doing SCOM reads/writes (which are themselves indirect accesses)
> to perform the TCE invalidate rather than direct MMIO. Does the problem
> go away if you use out_be64(npu->regs + NPU2_ATS_TCE_KILL, ...) instead
> of npu2_write(...)?

Doing this:

-	npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
+	out_be64(npu->regs + NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);

brings 10sec down to 3-4sec for a 100GB guest. Invalidating the entire
cache is still way faster.
On Tuesday, 20 August 2019 4:39:47 PM AEST Alexey Kardashevskiy wrote:
> On 20/08/2019 15:21, Alistair Popple wrote:
>> Do you know why PHB4 is so much faster? I suspect it is because the NPU
>> is still doing SCOM reads/writes (which are themselves indirect
>> accesses) to perform the TCE invalidate rather than direct MMIO. Does
>> the problem go away if you use out_be64(npu->regs + NPU2_ATS_TCE_KILL,
>> ...) instead of npu2_write(...)?
>
> Doing this:
>
> -	npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
> +	out_be64(npu->regs + NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
>
> brings 10sec down to 3-4sec for a 100GB guest. Invalidating the entire
> cache is still way faster.

Good to know, thanks. In that case this does look like the best option
given TCE-based access should be rare anyway, so:

Reviewed-by: Alistair Popple <alistair@popple.id.au>
diff --git a/hw/npu2.c b/hw/npu2.c
index 1dba8bb00f85..40583e343a98 100644
--- a/hw/npu2.c
+++ b/hw/npu2.c
@@ -1277,12 +1277,19 @@ static int64_t npu2_tce_kill(struct phb *phb, uint32_t kill_type,
 			return OPAL_PARAMETER;
 		}
 
-		while (npages--) {
-			val = SETFIELD(NPU2_ATS_TCE_KILL_PENUM, dma_addr, pe_number);
-			npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
-			dma_addr += tce_size;
+		if (npages < 128) {
+			while (npages--) {
+				val = SETFIELD(NPU2_ATS_TCE_KILL_PENUM, dma_addr, pe_number);
+				npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
+				dma_addr += tce_size;
+			}
+			break;
 		}
-		break;
+		/*
+		 * For too many TCEs do not bother with the loop above and simply
+		 * flush everything, going to be lot faster.
+		 */
+		/* Fall through */
 	case OPAL_PCI_TCE_KILL_PE:
 		/*
 		 * NPU2 doesn't support killing a PE so fall through
Turned out invalidating entries in the NPU TCE cache is so slow that it
becomes visible when running a 30+GB guest with GPU+NVLink2 passed
through; a 100GB guest takes about 20s to map all 100GB.

This falls through to the entire cache invalidation if more than 128
TCEs were requested to invalidate, which reduces the 20s above to less
than 1s. The KVM change [1] is required to see this difference.

The threshold of 128 is chosen in an attempt not to affect performance
much, as it is not clear how expensive it is to populate the TCE cache
again; all we know for sure is that mapping the guest produces
invalidation requests of 512 TCEs each.

Note TCE cache invalidation in PHB4 is faster and does not require the
same workaround.

[1] KVM: PPC: vfio/spapr_tce: Split out TCE invalidation from TCE updates
    https://patchwork.ozlabs.org/patch/1149003/

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/npu2.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)