
[2/3] powerpc/powernv/npu: Use size-based ATSD invalidates

Message ID 1538090591-28519-3-git-send-email-mhairgrove@nvidia.com (mailing list archive)
State: Changes Requested
Series: powerpc/powernv/npu: Improve ATSD invalidation overhead

Commit Message

Mark Hairgrove Sept. 27, 2018, 11:23 p.m. UTC
Prior to this change only two types of ATSDs were issued to the NPU:
invalidates targeting a single page and invalidates targeting the whole
address space. The crossover point happened at the configurable
atsd_threshold, which defaulted to 2M. Invalidates of that size or smaller
issued per-page invalidates across the whole range.

However, the NPU supports more invalidation sizes: 64K, 2M, 1G, and all.
These invalidates target addresses aligned to their size. 2M is a common
invalidation size for GPU-enabled applications because that is a GPU
page size, so reducing the number of invalidates by 32x in that case is a
clear improvement.

ATSD latency is high in general so now we always issue a single invalidate
rather than multiple. This will over-invalidate in some cases, but for any
invalidation size over 2M it matches or improves the prior behavior.
There's also an improvement for single-page invalidates since the prior
version issued two invalidates for that case instead of one.
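For illustration, the range-to-ATSD-size conversion described here can be modeled in a few lines of Python. This is an editor's sketch of the selection logic, not the kernel code itself; "ALL" stands in for the patch's MMU_PAGE_COUNT sentinel, which means "invalidate the whole PID":

```python
# Model of the range-to-ATSD-size conversion: pick the smallest supported
# naturally-aligned invalidation size that covers [start, start + size).
SZ_64K = 64 * 1024
SZ_2M = 2 * 1024 * 1024
SZ_1G = 1024 * 1024 * 1024

def align_down(addr, size):
    # Mirrors the kernel's ALIGN_DOWN for power-of-two sizes.
    return addr & ~(size - 1)

def pick_atsd(start, size):
    """Return (atsd_start, psize) for the single ATSD covering the range."""
    end = start + size - 1
    if size == SZ_64K:
        return start, "64K"
    if align_down(start, SZ_2M) == align_down(end, SZ_2M):
        return align_down(start, SZ_2M), "2M"
    if align_down(start, SZ_1G) == align_down(end, SZ_1G):
        return align_down(start, SZ_1G), "1G"
    return 0, "ALL"  # fall back to invalidating the entire PID

# A 2M-aligned 2M invalidate becomes one 2M ATSD instead of 32 64K ones.
s, p = pick_atsd(0x200000, SZ_2M)
print(hex(s), p)  # 0x200000 2M
```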

To show the benefit here are some performance numbers from a
microbenchmark which creates a 1G allocation then uses mprotect with
PROT_NONE to trigger invalidates in strides across the allocation.
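The benchmark source is not included in the posting; a minimal sketch of that pattern (anonymous mapping, PROT_NONE mprotect calls at a fixed stride, timing the sweep) might look like the following. Python's mmap module does not expose mprotect, so this reaches it through ctypes; the 64M size here is arbitrary, scaled down from the 1G allocation used for the numbers below:

```python
import ctypes
import mmap
import time

# Linux-only sketch: call mprotect(2) directly via libc.
libc = ctypes.CDLL(None, use_errno=True)
libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
PROT_NONE = 0

def mprotect_rate(size, stride):
    """Sweep PROT_NONE over an anonymous mapping at `stride`; return GB/s."""
    m = mmap.mmap(-1, size)  # page-aligned anonymous mapping
    addr = ctypes.addressof(ctypes.c_char.from_buffer(m))
    start = time.perf_counter()
    for off in range(0, size, stride):
        if libc.mprotect(addr + off, stride, PROT_NONE) != 0:
            raise OSError(ctypes.get_errno(), "mprotect failed")
    elapsed = time.perf_counter() - start
    return size / elapsed / 2**30

if __name__ == "__main__":
    size = 64 * 1024 * 1024  # scaled down from the 1G used in the measurements
    for stride in (64 * 1024, 2 * 1024 * 1024):
        print(f"stride {stride >> 10}K: {mprotect_rate(size, stride):.1f} GB/s")
```

Each mprotect call triggers an invalidate_range notification for its stride, so the stride controls how many invalidates the sweep generates.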

One NPU (1 GPU):

         mprotect rate (GB/s)
Stride   Before      After      Speedup
64K         5.3        5.6           5%
1M         39.3       57.4          46%
2M         49.7       82.6          66%
4M        286.6      285.7           0%

Two NPUs (6 GPUs):

         mprotect rate (GB/s)
Stride   Before      After      Speedup
64K         6.5        7.4          13%
1M         33.4       67.9         103%
2M         38.7       93.1         141%
4M        356.7      354.6          -1%

Anything over 2M is roughly the same as before since both cases issue a
single ATSD.

Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
---
 arch/powerpc/platforms/powernv/npu-dma.c |   71 +++++++++++++++++-------------
 1 file changed, 40 insertions(+), 31 deletions(-)

Comments

Alistair Popple Oct. 2, 2018, 7:11 a.m. UTC | #1
Thanks Mark,

Looks like some worthwhile improvements to be had. I've added a couple of
comments inline below.

> +#define PAGE_64K (64UL * 1024)
> +#define PAGE_2M  (2UL * 1024 * 1024)
> +#define PAGE_1G  (1UL * 1024 * 1024 * 1024)

include/linux/sizes.h includes definitions for SZ_64K, SZ_2M, SZ_1G, etc. so
unless they're redefined here for some reason I personally think it's cleaner to
use those.

>  /*
> - * Invalidate either a single address or an entire PID depending on
> - * the value of va.
> + * Invalidate a virtual address range
>   */
> -static void mmio_invalidate(struct npu_context *npu_context, int va,
> -			unsigned long address, bool flush)
> +static void mmio_invalidate(struct npu_context *npu_context,
> +			unsigned long start, unsigned long size, bool flush)

With this optimisation every caller of mmio_invalidate() sets flush == true so
it no longer appears to be used. We should drop it as a parameter unless you
think there might be some reason to use it in future?

Therefore we could also drop it as a parameter to get_atsd_launch_val(),
mmio_invalidate_pid() and mmio_invalidate_range() as well, as I couldn't find
any callers of those that set it to anything other than true.

>  	struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS];
>  	unsigned long pid = npu_context->mm->context.id;
> +	unsigned long atsd_start = 0;
> +	unsigned long end = start + size - 1;
> +	int atsd_psize = MMU_PAGE_COUNT;
> +
> +	/*
> +	 * Convert the input range into one of the supported sizes. If the range
> +	 * doesn't fit, use the next larger supported size. Invalidation latency
> +	 * is high, so over-invalidation is preferred to issuing multiple
> +	 * invalidates.
> +	 */
> +	if (size == PAGE_64K) {

We also support 4K page sizes on PPC. If I am not mistaken this means every ATSD
would invalidate the entire GPU TLB for the given PID on those systems. Could
we change the above check to `if (size <= PAGE_64K)` to avoid this?

> +		atsd_start = start;

Which would also require:

	atsd_start = ALIGN_DOWN(start, PAGE_64K);

> +		atsd_psize = MMU_PAGE_64K;
> +	} else if (ALIGN_DOWN(start, PAGE_2M) == ALIGN_DOWN(end, PAGE_2M)) {

Wouldn't this lead to under invalidation in ranges which happen to cross a 2M
boundary? For example invalidating a 128K (ie. 2x64K pages) range with start ==
0x1f0000 and end == 0x210000 would result in an invalidation of the range 0x0 -
0x200000 incorrectly leaving 0x200000 - 0x210000 in the GPU TLB.

> +		atsd_start = ALIGN_DOWN(start, PAGE_2M);
> +		atsd_psize = MMU_PAGE_2M;
> +	} else if (ALIGN_DOWN(start, PAGE_1G) == ALIGN_DOWN(end, PAGE_1G)) {

Ditto.

> +		atsd_start = ALIGN_DOWN(start, PAGE_1G);
> +		atsd_psize = MMU_PAGE_1G;
> +	}
>  

- Alistair
Mark Hairgrove Oct. 3, 2018, 1:10 a.m. UTC | #2
Thanks for the review. Comments below.

On Tue, 2 Oct 2018, Alistair Popple wrote:

> Thanks Mark,
> 
> Looks like some worthwhile improvements to be had. I've added a couple of
> comments inline below.
> 
> > +#define PAGE_64K (64UL * 1024)
> > +#define PAGE_2M  (2UL * 1024 * 1024)
> > +#define PAGE_1G  (1UL * 1024 * 1024 * 1024)
> 
> include/linux/sizes.h includes definitions for SZ_64K, SZ_2M, SZ_1G, etc. so
> unless they're redefined here for some reason I personally think it's cleaner to
> use those.

Agreed, will fix. Thanks for the pointer.


> 
> >  /*
> > - * Invalidate either a single address or an entire PID depending on
> > - * the value of va.
> > + * Invalidate a virtual address range
> >   */
> > -static void mmio_invalidate(struct npu_context *npu_context, int va,
> > -			unsigned long address, bool flush)
> > +static void mmio_invalidate(struct npu_context *npu_context,
> > +			unsigned long start, unsigned long size, bool flush)
> 
> With this optimisation every caller of mmio_invalidate() sets flush == true so
> it no longer appears to be used. We should drop it as a parameter unless you
> think there might be some reason to use it in future?
> 
> Therefore we could also drop it as a parameter to get_atsd_launch_val(),
> mmio_invalidate_pid() and mmio_invalidate_range() as well, as I couldn't find
> any callers of those that set it to anything other than true.

Yeah, good catch. I'll simplify all of those.


> 
> >  	struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS];
> >  	unsigned long pid = npu_context->mm->context.id;
> > +	unsigned long atsd_start = 0;
> > +	unsigned long end = start + size - 1;
> > +	int atsd_psize = MMU_PAGE_COUNT;
> > +
> > +	/*
> > +	 * Convert the input range into one of the supported sizes. If the range
> > +	 * doesn't fit, use the next larger supported size. Invalidation latency
> > +	 * is high, so over-invalidation is preferred to issuing multiple
> > +	 * invalidates.
> > +	 */
> > +	if (size == PAGE_64K) {
> 
> We also support 4K page sizes on PPC. If I am not mistaken this means every ATSD
> would invalidate the entire GPU TLB for the given PID on those systems. Could
> we change the above check to `if (size <= PAGE_64K)` to avoid this?

PPC supports 4K pages but the GPU ATS implementation does not. For that
reason I didn't bother handling invalidates smaller than 64K. I'll add a
comment on that.

I don't know that this requirement is enforced anywhere though. I could
add a PAGE_SIZE == 64K check to pnv_npu2_init_context if you think it
would be useful.


> 
> > +		atsd_start = start;
> 
> Which would also require:
> 
> 	atsd_start = ALIGN_DOWN(start, PAGE_64K);
> 
> > +		atsd_psize = MMU_PAGE_64K;
> > +	} else if (ALIGN_DOWN(start, PAGE_2M) == ALIGN_DOWN(end, PAGE_2M)) {
> 
> Wouldn't this lead to under invalidation in ranges which happen to cross a 2M
> boundary? For example invalidating a 128K (ie. 2x64K pages) range with start ==
> 0x1f0000 and end == 0x210000 would result in an invalidation of the range 0x0 -
> 0x200000 incorrectly leaving 0x200000 - 0x210000 in the GPU TLB.

In this example:
start                         0x1f0000
size                          0x020000
end (start + size - 1)        0x20ffff
ALIGN_DOWN(start, PAGE_2M)    0x000000
ALIGN_DOWN(end, PAGE_2M)      0x200000

Since ALIGN_DOWN(start, PAGE_2M) != ALIGN_DOWN(end, PAGE_2M), the
condition fails and we move to the 1G clause. Then
ALIGN_DOWN(start, PAGE_1G) == ALIGN_DOWN(end, PAGE_1G) == 0, so we
invalidate the range [0, 1G).
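The arithmetic above can be checked mechanically; here is a quick Python sanity check of the same values, with ALIGN_DOWN modeled as masking off the low bits (an editor's sketch, not kernel code):

```python
def align_down(addr, size):
    # Mirrors the kernel's ALIGN_DOWN for power-of-two sizes.
    return addr & ~(size - 1)

SZ_2M, SZ_1G = 2 * 1024 * 1024, 1024 * 1024 * 1024
start, size = 0x1f0000, 0x020000
end = start + size - 1

assert end == 0x20ffff
assert align_down(start, SZ_2M) == 0x000000
assert align_down(end, SZ_2M) == 0x200000  # differs, so the 2M clause is rejected
assert align_down(start, SZ_1G) == align_down(end, SZ_1G) == 0  # 1G clause taken
print("range falls through to the 1G invalidate, as described")
```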
Alistair Popple Oct. 3, 2018, 2:27 a.m. UTC | #3
> > 
> > We also support 4K page sizes on PPC. If I am not mistaken this means every ATSD
> > would invalidate the entire GPU TLB for the given PID on those systems. Could
> > we change the above check to `if (size <= PAGE_64K)` to avoid this?
> 
> PPC supports 4K pages but the GPU ATS implementation does not. For that
> reason I didn't bother handling invalidates smaller than 64K. I'll add a
> comment on that.

Interesting, I was not aware of that limitation. Do you know if it is a
SW/driver limitation or a HW limitation?

> I don't know that this requirement is enforced anywhere though. I could
> add a PAGE_SIZE == 64K check to pnv_npu2_init_context if you think it
> would be useful.

Given it's a static kernel build parameter, perhaps it makes more sense to do
the check as part of the driver build in a conftest rather than as a runtime
failure?

> >
> > > +		atsd_start = start;
> > 
> > Which would also require:
> > 
> > 	atsd_start = ALIGN_DOWN(start, PAGE_64K);
> > 
> > > +		atsd_psize = MMU_PAGE_64K;
> > > +	} else if (ALIGN_DOWN(start, PAGE_2M) == ALIGN_DOWN(end, PAGE_2M)) {
> > 
> > Wouldn't this lead to under invalidation in ranges which happen to cross a 2M
> > boundary? For example invalidating a 128K (ie. 2x64K pages) range with start ==
> > 0x1f0000 and end == 0x210000 would result in an invalidation of the range 0x0 -
> > 0x200000 incorrectly leaving 0x200000 - 0x210000 in the GPU TLB.
> 
> In this example:
> start                         0x1f0000
> size                          0x020000
> end (start + size - 1)        0x20ffff
> ALIGN_DOWN(start, PAGE_2M)    0x000000
> ALIGN_DOWN(end, PAGE_2M)      0x200000
> 
> Since ALIGN_DOWN(start, PAGE_2M) != ALIGN_DOWN(end, PAGE_2M), the
> condition fails and we move to the 1G clause. Then
> ALIGN_DOWN(start, PAGE_1G) == ALIGN_DOWN(end, PAGE_1G) == 0, so we
> invalidate the range [0, 1G).

Oh yeah, sorry that makes sense and looks good to me.

- Alistair
Mark Hairgrove Oct. 3, 2018, 6:33 p.m. UTC | #4
On Wed, 3 Oct 2018, Alistair Popple wrote:

> > > 
> > > We also support 4K page sizes on PPC. If I am not mistaken this means every ATSD
> > > would invalidate the entire GPU TLB for the given PID on those systems. Could
> > > we change the above check to `if (size <= PAGE_64K)` to avoid this?
> > 
> > PPC supports 4K pages but the GPU ATS implementation does not. For that
> > reason I didn't bother handling invalidates smaller than 64K. I'll add a
> > comment on that.
> 
> Interesting, I was not aware of that limitation. Do you know if it is a
> SW/driver limitation or a HW limitation?

HW.

> 
> > I don't know that this requirement is enforced anywhere though. I could
> > add a PAGE_SIZE == 64K check to pnv_npu2_init_context if you think it
> > would be useful.
> 
> Given it's a static kernel build parameter, perhaps it makes more sense to do
> the check as part of the driver build in a conftest rather than as a runtime
> failure?

Sure, we can do it in the driver. I'll just stick with a comment for the
kernel.

Patch

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index c8f438a..e471a1a 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -509,15 +509,14 @@  static void mmio_invalidate_pid(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
 	mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_LAUNCH, launch);
 }
 
-static void mmio_invalidate_va(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
-			unsigned long va, unsigned long pid, bool flush)
+static void mmio_invalidate_range(struct mmio_atsd_reg
+			mmio_atsd_reg[NV_MAX_NPUS], unsigned long pid,
+			unsigned long start, unsigned long psize, bool flush)
 {
-	unsigned long launch;
-
-	launch = get_atsd_launch_val(pid, mmu_virtual_psize, flush);
+	unsigned long launch = get_atsd_launch_val(pid, psize, flush);
 
 	/* Write all VAs first */
-	mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_AVA, va);
+	mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_AVA, start);
 
 	/* Issue one barrier for all address writes */
 	eieio();
@@ -608,15 +607,38 @@  static void release_atsd_reg(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS])
 	}
 }
 
+#define PAGE_64K (64UL * 1024)
+#define PAGE_2M  (2UL * 1024 * 1024)
+#define PAGE_1G  (1UL * 1024 * 1024 * 1024)
+
 /*
- * Invalidate either a single address or an entire PID depending on
- * the value of va.
+ * Invalidate a virtual address range
  */
-static void mmio_invalidate(struct npu_context *npu_context, int va,
-			unsigned long address, bool flush)
+static void mmio_invalidate(struct npu_context *npu_context,
+			unsigned long start, unsigned long size, bool flush)
 {
 	struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS];
 	unsigned long pid = npu_context->mm->context.id;
+	unsigned long atsd_start = 0;
+	unsigned long end = start + size - 1;
+	int atsd_psize = MMU_PAGE_COUNT;
+
+	/*
+	 * Convert the input range into one of the supported sizes. If the range
+	 * doesn't fit, use the next larger supported size. Invalidation latency
+	 * is high, so over-invalidation is preferred to issuing multiple
+	 * invalidates.
+	 */
+	if (size == PAGE_64K) {
+		atsd_start = start;
+		atsd_psize = MMU_PAGE_64K;
+	} else if (ALIGN_DOWN(start, PAGE_2M) == ALIGN_DOWN(end, PAGE_2M)) {
+		atsd_start = ALIGN_DOWN(start, PAGE_2M);
+		atsd_psize = MMU_PAGE_2M;
+	} else if (ALIGN_DOWN(start, PAGE_1G) == ALIGN_DOWN(end, PAGE_1G)) {
+		atsd_start = ALIGN_DOWN(start, PAGE_1G);
+		atsd_psize = MMU_PAGE_1G;
+	}
 
 	if (npu_context->nmmu_flush)
 		/*
@@ -631,10 +653,12 @@  static void mmio_invalidate(struct npu_context *npu_context, int va,
 	 * an invalidate.
 	 */
 	acquire_atsd_reg(npu_context, mmio_atsd_reg);
-	if (va)
-		mmio_invalidate_va(mmio_atsd_reg, address, pid, flush);
-	else
+
+	if (atsd_psize == MMU_PAGE_COUNT)
 		mmio_invalidate_pid(mmio_atsd_reg, pid, flush);
+	else
+		mmio_invalidate_range(mmio_atsd_reg, pid, atsd_start,
+					atsd_psize, flush);
 
 	mmio_invalidate_wait(mmio_atsd_reg);
 	if (flush) {
@@ -664,7 +688,7 @@  static void pnv_npu2_mn_release(struct mmu_notifier *mn,
 	 * There should be no more translation requests for this PID, but we
 	 * need to ensure any entries for it are removed from the TLB.
 	 */
-	mmio_invalidate(npu_context, 0, 0, true);
+	mmio_invalidate(npu_context, 0, ~0UL, true);
 }
 
 static void pnv_npu2_mn_change_pte(struct mmu_notifier *mn,
@@ -673,8 +697,7 @@  static void pnv_npu2_mn_change_pte(struct mmu_notifier *mn,
 				pte_t pte)
 {
 	struct npu_context *npu_context = mn_to_npu_context(mn);
-
-	mmio_invalidate(npu_context, 1, address, true);
+	mmio_invalidate(npu_context, address, PAGE_SIZE, true);
 }
 
 static void pnv_npu2_mn_invalidate_range(struct mmu_notifier *mn,
@@ -682,21 +705,7 @@  static void pnv_npu2_mn_invalidate_range(struct mmu_notifier *mn,
 					unsigned long start, unsigned long end)
 {
 	struct npu_context *npu_context = mn_to_npu_context(mn);
-	unsigned long address;
-
-	if (end - start > atsd_threshold) {
-		/*
-		 * Just invalidate the entire PID if the address range is too
-		 * large.
-		 */
-		mmio_invalidate(npu_context, 0, 0, true);
-	} else {
-		for (address = start; address < end; address += PAGE_SIZE)
-			mmio_invalidate(npu_context, 1, address, false);
-
-		/* Do the flush only on the final addess == end */
-		mmio_invalidate(npu_context, 1, address, true);
-	}
+	mmio_invalidate(npu_context, start, end - start, true);
 }
 
 static const struct mmu_notifier_ops nv_nmmu_notifier_ops = {