Message ID: 1538090591-28519-3-git-send-email-mhairgrove@nvidia.com (mailing list archive)
State: Changes Requested
Series: powerpc/powernv/npu: Improve ATSD invalidation overhead
Thanks Mark,

Looks like some worthwhile improvements to be had. I've added a couple of
comments inline below.

> +#define PAGE_64K (64UL * 1024)
> +#define PAGE_2M (2UL * 1024 * 1024)
> +#define PAGE_1G (1UL * 1024 * 1024 * 1024)

include/linux/sizes.h includes definitions for SZ_64K, SZ_2M, SZ_1G, etc. so
unless they're redefined here for some reason I personally think it's cleaner
to use those.

>  /*
> - * Invalidate either a single address or an entire PID depending on
> - * the value of va.
> + * Invalidate a virtual address range
>   */
> -static void mmio_invalidate(struct npu_context *npu_context, int va,
> -			unsigned long address, bool flush)
> +static void mmio_invalidate(struct npu_context *npu_context,
> +			unsigned long start, unsigned long size, bool flush)

With this optimisation every caller of mmio_invalidate() sets flush == true so
it no longer appears to be used. We should drop it as a parameter unless you
think there might be some reason to use it in future?

Therefore we could also drop it as a parameter to get_atsd_launch_val(),
mmio_invalidate_pid() and mmio_invalidate_range() as well, as I couldn't find
any callers of those that set it to anything other than true.

>  	struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS];
>  	unsigned long pid = npu_context->mm->context.id;
> +	unsigned long atsd_start = 0;
> +	unsigned long end = start + size - 1;
> +	int atsd_psize = MMU_PAGE_COUNT;
> +
> +	/*
> +	 * Convert the input range into one of the supported sizes. If the range
> +	 * doesn't fit, use the next larger supported size. Invalidation latency
> +	 * is high, so over-invalidation is preferred to issuing multiple
> +	 * invalidates.
> +	 */
> +	if (size == PAGE_64K) {

We also support 4K page sizes on PPC. If I am not mistaken this means every
ATSD would invalidate the entire GPU TLB for the given PID on those systems.
Could we change the above check to `if (size <= PAGE_64K)` to avoid this?

> +		atsd_start = start;

Which would also require:

	atsd_start = ALIGN_DOWN(start, PAGE_64K);

> +		atsd_psize = MMU_PAGE_64K;
> +	} else if (ALIGN_DOWN(start, PAGE_2M) == ALIGN_DOWN(end, PAGE_2M)) {

Wouldn't this lead to under invalidation in ranges which happen to cross a 2M
boundary? For example invalidating a 128K (ie. 2x64K pages) range with
start == 0x1f0000 and end == 0x210000 would result in an invalidation of the
range 0x0 - 0x200000, incorrectly leaving 0x200000 - 0x210000 in the GPU TLB.

> +		atsd_start = ALIGN_DOWN(start, PAGE_2M);
> +		atsd_psize = MMU_PAGE_2M;
> +	} else if (ALIGN_DOWN(start, PAGE_1G) == ALIGN_DOWN(end, PAGE_1G)) {

Ditto.

> +		atsd_start = ALIGN_DOWN(start, PAGE_1G);
> +		atsd_psize = MMU_PAGE_1G;
> +	}

- Alistair
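For reference, the selection logic under discussion can be sketched stand-alone in user space. This is not the patch itself: `atsd_select` is a hypothetical helper name, the `SZ_*` and `MMU_PAGE_*` values are local stand-ins for the kernel definitions, and the 64K clause incorporates both the `size <= SZ_64K` relaxation suggested above and a same-frame check so that a sub-64K range straddling a 64K boundary still falls through to a larger size.

```c
#include <assert.h>

#define SZ_64K (64UL * 1024)
#define SZ_2M  (2UL * 1024 * 1024)
#define SZ_1G  (1UL * 1024 * 1024 * 1024)

/* Power-of-two align-down, as the kernel's ALIGN_DOWN behaves for these sizes */
#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1UL))

/* Stand-ins for the kernel's MMU_PAGE_* selectors */
enum { MMU_PAGE_64K, MMU_PAGE_2M, MMU_PAGE_1G, MMU_PAGE_COUNT };

/*
 * Pick the smallest supported ATSD size whose naturally aligned frame
 * covers [start, start + size). Writes the aligned start address and
 * returns the selector; MMU_PAGE_COUNT means "invalidate the whole PID".
 */
static int atsd_select(unsigned long start, unsigned long size,
		       unsigned long *atsd_start)
{
	unsigned long end = start + size - 1;

	if (size <= SZ_64K &&
	    ALIGN_DOWN(start, SZ_64K) == ALIGN_DOWN(end, SZ_64K)) {
		*atsd_start = ALIGN_DOWN(start, SZ_64K);
		return MMU_PAGE_64K;
	} else if (ALIGN_DOWN(start, SZ_2M) == ALIGN_DOWN(end, SZ_2M)) {
		*atsd_start = ALIGN_DOWN(start, SZ_2M);
		return MMU_PAGE_2M;
	} else if (ALIGN_DOWN(start, SZ_1G) == ALIGN_DOWN(end, SZ_1G)) {
		*atsd_start = ALIGN_DOWN(start, SZ_1G);
		return MMU_PAGE_1G;
	}
	return MMU_PAGE_COUNT;
}
```

For example, a 64K-aligned 64K range selects MMU_PAGE_64K, while a 128K range crossing a 2M boundary falls through to MMU_PAGE_1G, matching the over-invalidation behaviour the patch comment describes.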
Thanks for the review. Comments below.

On Tue, 2 Oct 2018, Alistair Popple wrote:

> Thanks Mark,
>
> Looks like some worthwhile improvements to be had. I've added a couple of
> comments inline below.
>
> > +#define PAGE_64K (64UL * 1024)
> > +#define PAGE_2M (2UL * 1024 * 1024)
> > +#define PAGE_1G (1UL * 1024 * 1024 * 1024)
>
> include/linux/sizes.h includes definitions for SZ_64K, SZ_2M, SZ_1G, etc. so
> unless they're redefined here for some reason I personally think it's cleaner
> to use those.

Agreed, will fix. Thanks for the pointer.

> >  /*
> > - * Invalidate either a single address or an entire PID depending on
> > - * the value of va.
> > + * Invalidate a virtual address range
> >   */
> > -static void mmio_invalidate(struct npu_context *npu_context, int va,
> > -			unsigned long address, bool flush)
> > +static void mmio_invalidate(struct npu_context *npu_context,
> > +			unsigned long start, unsigned long size, bool flush)
>
> With this optimisation every caller of mmio_invalidate() sets flush == true so
> it no longer appears to be used. We should drop it as a parameter unless you
> think there might be some reason to use it in future?
>
> Therefore we could also drop it as a parameter to get_atsd_launch_val(),
> mmio_invalidate_pid() and mmio_invalidate_range() as well, as I couldn't find
> any callers of those that set it to anything other than true.

Yeah, good catch. I'll simplify all of those.

> >  	struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS];
> >  	unsigned long pid = npu_context->mm->context.id;
> > +	unsigned long atsd_start = 0;
> > +	unsigned long end = start + size - 1;
> > +	int atsd_psize = MMU_PAGE_COUNT;
> > +
> > +	/*
> > +	 * Convert the input range into one of the supported sizes. If the range
> > +	 * doesn't fit, use the next larger supported size. Invalidation latency
> > +	 * is high, so over-invalidation is preferred to issuing multiple
> > +	 * invalidates.
> > +	 */
> > +	if (size == PAGE_64K) {
>
> We also support 4K page sizes on PPC. If I am not mistaken this means every
> ATSD would invalidate the entire GPU TLB for the given PID on those systems.
> Could we change the above check to `if (size <= PAGE_64K)` to avoid this?

PPC supports 4K pages but the GPU ATS implementation does not. For that
reason I didn't bother handling invalidates smaller than 64K. I'll add a
comment on that.

I don't know that this requirement is enforced anywhere though. I could
add a PAGE_SIZE == 64K check to pnv_npu2_init_context if you think it
would be useful.

> > +		atsd_start = start;
>
> Which would also require:
>
> 	atsd_start = ALIGN_DOWN(start, PAGE_64K);
>
> > +		atsd_psize = MMU_PAGE_64K;
> > +	} else if (ALIGN_DOWN(start, PAGE_2M) == ALIGN_DOWN(end, PAGE_2M)) {
>
> Wouldn't this lead to under invalidation in ranges which happen to cross a 2M
> boundary? For example invalidating a 128K (ie. 2x64K pages) range with
> start == 0x1f0000 and end == 0x210000 would result in an invalidation of the
> range 0x0 - 0x200000, incorrectly leaving 0x200000 - 0x210000 in the GPU TLB.

In this example:

	start                       0x1f0000
	size                        0x020000
	end (start + size - 1)      0x20ffff
	ALIGN_DOWN(start, PAGE_2M)  0x000000
	ALIGN_DOWN(end, PAGE_2M)    0x200000

Since ALIGN_DOWN(start, PAGE_2M) != ALIGN_DOWN(end, PAGE_2M), the
condition fails and we move to the 1G clause. Then
ALIGN_DOWN(start, PAGE_1G) == ALIGN_DOWN(end, PAGE_1G) == 0, so we
invalidate the range [0, 1G).
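The arithmetic in that walkthrough is easy to check in isolation. A small stand-alone sketch (with a local power-of-two `ALIGN_DOWN` and `SZ_*` stand-ins for the kernel definitions; `range_fits` is a hypothetical helper, not code from the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Power-of-two align-down, as the kernel's ALIGN_DOWN behaves here */
#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1UL))

#define SZ_2M (2UL * 1024 * 1024)
#define SZ_1G (1UL * 1024 * 1024 * 1024)

/*
 * True if [start, start + size) fits inside a single naturally aligned
 * frame of the given power-of-two size -- the test each else-if clause
 * in mmio_invalidate() performs.
 */
static bool range_fits(unsigned long start, unsigned long size,
		       unsigned long frame)
{
	unsigned long end = start + size - 1;

	return ALIGN_DOWN(start, frame) == ALIGN_DOWN(end, frame);
}
```

For the 0x1f0000/128K example: the range does not fit in one 2M frame (start aligns to 0x0, end to 0x200000), so the 2M clause fails, while it does fit in one 1G frame, so the 1G clause takes it, over-invalidating rather than under-invalidating.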
> > We also support 4K page sizes on PPC. If I am not mistaken this means
> > every ATSD would invalidate the entire GPU TLB for the given PID on those
> > systems. Could we change the above check to `if (size <= PAGE_64K)` to
> > avoid this?
>
> PPC supports 4K pages but the GPU ATS implementation does not. For that
> reason I didn't bother handling invalidates smaller than 64K. I'll add a
> comment on that.

Interesting, I was not aware of that limitation. Do you know if it is a
SW/driver limitation or a HW limitation?

> I don't know that this requirement is enforced anywhere though. I could
> add a PAGE_SIZE == 64K check to pnv_npu2_init_context if you think it
> would be useful.

Given it's a static kernel build parameter perhaps it makes more sense to do
the check as part of the driver build in a conftest rather than a runtime
failure?

> > > +		atsd_start = start;
> >
> > Which would also require:
> >
> > 	atsd_start = ALIGN_DOWN(start, PAGE_64K);
> >
> > > +		atsd_psize = MMU_PAGE_64K;
> > > +	} else if (ALIGN_DOWN(start, PAGE_2M) == ALIGN_DOWN(end, PAGE_2M)) {
> >
> > Wouldn't this lead to under invalidation in ranges which happen to cross
> > a 2M boundary? For example invalidating a 128K (ie. 2x64K pages) range
> > with start == 0x1f0000 and end == 0x210000 would result in an
> > invalidation of the range 0x0 - 0x200000, incorrectly leaving
> > 0x200000 - 0x210000 in the GPU TLB.
>
> In this example:
>
> 	start                       0x1f0000
> 	size                        0x020000
> 	end (start + size - 1)      0x20ffff
> 	ALIGN_DOWN(start, PAGE_2M)  0x000000
> 	ALIGN_DOWN(end, PAGE_2M)    0x200000
>
> Since ALIGN_DOWN(start, PAGE_2M) != ALIGN_DOWN(end, PAGE_2M), the
> condition fails and we move to the 1G clause. Then
> ALIGN_DOWN(start, PAGE_1G) == ALIGN_DOWN(end, PAGE_1G) == 0, so we
> invalidate the range [0, 1G).

Oh yeah, sorry that makes sense and looks good to me.

- Alistair
On Wed, 3 Oct 2018, Alistair Popple wrote:

> > > We also support 4K page sizes on PPC. If I am not mistaken this means
> > > every ATSD would invalidate the entire GPU TLB for the given PID on
> > > those systems. Could we change the above check to
> > > `if (size <= PAGE_64K)` to avoid this?
> >
> > PPC supports 4K pages but the GPU ATS implementation does not. For that
> > reason I didn't bother handling invalidates smaller than 64K. I'll add a
> > comment on that.
>
> Interesting, I was not aware of that limitation. Do you know if it is a
> SW/driver limitation or a HW limitation?

HW.

> > I don't know that this requirement is enforced anywhere though. I could
> > add a PAGE_SIZE == 64K check to pnv_npu2_init_context if you think it
> > would be useful.
>
> Given it's a static kernel build parameter perhaps it makes more sense to do
> the check as part of the driver build in a conftest rather than a runtime
> failure?

Sure, we can do it in the driver. I'll just stick with a comment for the
kernel.
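Since PAGE_SIZE is fixed at build time, a kernel-side check could also be a compile-time assertion rather than a runtime failure. A sketch of the idea (the placement in pnv_npu2_init_context and the error message are hypothetical; in-kernel this would be BUILD_BUG_ON, shown here with its C11 `_Static_assert` analogue and a stand-in PAGE_SIZE so it stands alone):

```c
/* Stand-in for the kernel's PAGE_SIZE on a 64K-page ppc64 build;
 * in the kernel this comes from the build configuration. */
#define PAGE_SIZE (64UL * 1024)
#define SZ_64K    (64UL * 1024)

/* In-kernel: BUILD_BUG_ON(PAGE_SIZE != SZ_64K); fails the build,
 * not the boot, if the base page size assumption is violated. */
_Static_assert(PAGE_SIZE == SZ_64K,
	       "NPU ATS assumes a 64K base page size");
```

A conftest in the out-of-tree driver build, as suggested above, would accomplish the same thing from the driver side.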
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index c8f438a..e471a1a 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -509,15 +509,14 @@ static void mmio_invalidate_pid(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
 	mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_LAUNCH, launch);
 }
 
-static void mmio_invalidate_va(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
-			unsigned long va, unsigned long pid, bool flush)
+static void mmio_invalidate_range(struct mmio_atsd_reg
+			mmio_atsd_reg[NV_MAX_NPUS], unsigned long pid,
+			unsigned long start, unsigned long psize, bool flush)
 {
-	unsigned long launch;
-
-	launch = get_atsd_launch_val(pid, mmu_virtual_psize, flush);
+	unsigned long launch = get_atsd_launch_val(pid, psize, flush);
 
 	/* Write all VAs first */
-	mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_AVA, va);
+	mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_AVA, start);
 
 	/* Issue one barrier for all address writes */
 	eieio();
@@ -608,15 +607,38 @@ static void release_atsd_reg(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS])
 	}
 }
 
+#define PAGE_64K (64UL * 1024)
+#define PAGE_2M (2UL * 1024 * 1024)
+#define PAGE_1G (1UL * 1024 * 1024 * 1024)
+
 /*
- * Invalidate either a single address or an entire PID depending on
- * the value of va.
+ * Invalidate a virtual address range
  */
-static void mmio_invalidate(struct npu_context *npu_context, int va,
-			unsigned long address, bool flush)
+static void mmio_invalidate(struct npu_context *npu_context,
+			unsigned long start, unsigned long size, bool flush)
 {
 	struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS];
 	unsigned long pid = npu_context->mm->context.id;
+	unsigned long atsd_start = 0;
+	unsigned long end = start + size - 1;
+	int atsd_psize = MMU_PAGE_COUNT;
+
+	/*
+	 * Convert the input range into one of the supported sizes. If the range
+	 * doesn't fit, use the next larger supported size. Invalidation latency
+	 * is high, so over-invalidation is preferred to issuing multiple
+	 * invalidates.
+	 */
+	if (size == PAGE_64K) {
+		atsd_start = start;
+		atsd_psize = MMU_PAGE_64K;
+	} else if (ALIGN_DOWN(start, PAGE_2M) == ALIGN_DOWN(end, PAGE_2M)) {
+		atsd_start = ALIGN_DOWN(start, PAGE_2M);
+		atsd_psize = MMU_PAGE_2M;
+	} else if (ALIGN_DOWN(start, PAGE_1G) == ALIGN_DOWN(end, PAGE_1G)) {
+		atsd_start = ALIGN_DOWN(start, PAGE_1G);
+		atsd_psize = MMU_PAGE_1G;
+	}
 
 	if (npu_context->nmmu_flush)
 		/*
@@ -631,10 +653,12 @@ static void mmio_invalidate(struct npu_context *npu_context, int va,
 	 * an invalidate.
 	 */
 	acquire_atsd_reg(npu_context, mmio_atsd_reg);
-	if (va)
-		mmio_invalidate_va(mmio_atsd_reg, address, pid, flush);
-	else
+
+	if (atsd_psize == MMU_PAGE_COUNT)
 		mmio_invalidate_pid(mmio_atsd_reg, pid, flush);
+	else
+		mmio_invalidate_range(mmio_atsd_reg, pid, atsd_start,
+				atsd_psize, flush);
 
 	mmio_invalidate_wait(mmio_atsd_reg);
 	if (flush) {
@@ -664,7 +688,7 @@ static void pnv_npu2_mn_release(struct mmu_notifier *mn,
 	 * There should be no more translation requests for this PID, but we
 	 * need to ensure any entries for it are removed from the TLB.
 	 */
-	mmio_invalidate(npu_context, 0, 0, true);
+	mmio_invalidate(npu_context, 0, ~0UL, true);
 }
 
 static void pnv_npu2_mn_change_pte(struct mmu_notifier *mn,
@@ -673,8 +697,7 @@ static void pnv_npu2_mn_change_pte(struct mmu_notifier *mn,
 				pte_t pte)
 {
 	struct npu_context *npu_context = mn_to_npu_context(mn);
-
-	mmio_invalidate(npu_context, 1, address, true);
+	mmio_invalidate(npu_context, address, PAGE_SIZE, true);
 }
 
 static void pnv_npu2_mn_invalidate_range(struct mmu_notifier *mn,
@@ -682,21 +705,7 @@ static void pnv_npu2_mn_invalidate_range(struct mmu_notifier *mn,
 				unsigned long start, unsigned long end)
 {
 	struct npu_context *npu_context = mn_to_npu_context(mn);
-	unsigned long address;
-
-	if (end - start > atsd_threshold) {
-		/*
-		 * Just invalidate the entire PID if the address range is too
-		 * large.
-		 */
-		mmio_invalidate(npu_context, 0, 0, true);
-	} else {
-		for (address = start; address < end; address += PAGE_SIZE)
-			mmio_invalidate(npu_context, 1, address, false);
-
-		/* Do the flush only on the final addess == end */
-		mmio_invalidate(npu_context, 1, address, true);
-	}
+	mmio_invalidate(npu_context, start, end - start, true);
 }
 
 static const struct mmu_notifier_ops nv_nmmu_notifier_ops = {
Prior to this change only two types of ATSDs were issued to the NPU:
invalidates targeting a single page and invalidates targeting the whole
address space. The crossover point happened at the configurable
atsd_threshold which defaulted to 2M. Invalidates that size or smaller
would issue per-page invalidates for the whole range.

The NPU supports more invalidation sizes however: 64K, 2M, 1G, and all.
These invalidates target addresses aligned to their size. 2M is a common
invalidation size for GPU-enabled applications because that is a GPU page
size, so reducing the number of invalidates by 32x in that case is a clear
improvement.

ATSD latency is high in general so now we always issue a single invalidate
rather than multiple. This will over-invalidate in some cases, but for any
invalidation size over 2M it matches or improves the prior behavior.
There's also an improvement for single-page invalidates since the prior
version issued two invalidates for that case instead of one.

To show the benefit here are some performance numbers from a microbenchmark
which creates a 1G allocation then uses mprotect with PROT_NONE to trigger
invalidates in strides across the allocation.

One NPU (1 GPU):

	         mprotect rate (GB/s)
	Stride   Before   After   Speedup
	64K         5.3     5.6        5%
	1M         39.3    57.4       46%
	2M         49.7    82.6       66%
	4M        286.6   285.7        0%

Two NPUs (6 GPUs):

	         mprotect rate (GB/s)
	Stride   Before   After   Speedup
	64K         6.5     7.4       13%
	1M         33.4    67.9      103%
	2M         38.7    93.1      141%
	4M        356.7   354.6       -1%

Anything over 2M is roughly the same as before since both cases issue a
single ATSD.

Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
---
 arch/powerpc/platforms/powernv/npu-dma.c |   71 +++++++++++++++++-------------
 1 files changed, 40 insertions(+), 31 deletions(-)