[v2] hw/phb4: Tune GPU direct performance on witherspoon in PCI mode

Message ID 20200324175125.54264-1-fbarrat@linux.ibm.com
State Superseded

Checks

Context Check Description
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco success Signed-off-by present
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot success Test snowpatch/job/snowpatch-skiboot on branch master
snowpatch_ozlabs/apply_patch success Successfully applied on branch master (d0043ee22d0d9a0a65ac6781c4fd3bf69492ddc5)

Commit Message

Frederic Barrat March 24, 2020, 5:51 p.m. UTC
Good GPU direct performance on witherspoon, with a Mellanox adapter
on the shared slot, requires reallocating some dma engines within
PEC2, "stealing" some from PHB4 and PHB5 and giving extras to PHB3.
This is currently done only when using CAPI mode, but the same is
true if the adapter stays in PCI mode.

In preparation for upcoming versions of MOFED, which may not use CAPI
mode, this patch reallocates dma engines even in PCI mode for a series
of Mellanox adapters that can be used with GPU direct, on witherspoon
and on the shared slot only.

The loss of dma engines for PHB4 and PHB5 on witherspoon has not shown
problems in testing, nor in current deployments where CAPI mode is
used.

Here is a comparison of the bandwidth numbers seen with the PHB in PCI
mode (no CAPI) with and without this patch. Variations at smaller
packet sizes can be attributed to jitter and are not meaningful.

 # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.6.1
 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
 # Size      Bandwidth (MB/s)       Bandwidth (MB/s)
 #              with patch           without patch
1                       1.29              1.48
2                       2.66              3.04
4                       5.34              5.93
8                      10.68             11.86
16                     21.39             23.71
32                     42.78             49.15
64                     85.43             97.67
128                   170.82            196.64
256                   385.47            383.02
512                   774.68            755.54
1024                 1535.14           1495.30
2048                 2599.31           2561.60
4096                 5192.31           5092.47
8192                 9930.30           9566.90
16384               18189.81          16803.42
32768               24671.48          21383.57
65536               28977.71          24104.50
131072              31110.55          25858.95
262144              32180.64          26470.61
524288              32842.23          26961.93
1048576             33184.87          27217.38
2097152             33342.67          27338.08

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---

Oliver, Andrew: I can only do basic testing on witherspoon. GPU direct
testing requires a much more complex setup and is handled by the IO
team. Since it's pretty time-consuming, I'd like to ask them to only
test what we expect to be the final version. Could you review this
one, even though it hasn't been fully tested? Then I can ask them for
a final round of testing before we can merge.

Changelog:
v2: use same code from capi setup (Andrew)
    trigger setup from witherspoon platform file (Oliver)
    

 hw/phb4.c                      | 53 +++++++++++++++++++---------------
 include/phb4.h                 |  2 ++
 platforms/astbmc/witherspoon.c | 43 +++++++++++++++++++++++++++
 3 files changed, 74 insertions(+), 24 deletions(-)

Comments

Oliver O'Halloran March 25, 2020, 5:11 a.m. UTC | #1
On Wed, Mar 25, 2020 at 4:51 AM Frederic Barrat <fbarrat@linux.ibm.com> wrote:
>
> +
> +static void witherspoon_pci_probe_complete(void)
> +{
> +       struct pci_device *dev;
> +       struct phb *phb;
> +       struct phb4 *p;
> +
> +       /*
> +        * Reallocate dma engines between stacks in PEC2 if a Mellanox
> +        * card is found on the shared slot, as it is required to get
> +        * good GPU direct performance.
> +        */
> +       for_each_phb(phb) {
> +               p = phb_to_phb4(phb);
> +               if (p->index != 3)
> +                       continue;

I think you need to check phb->type == phb_type_pcie_v4 before dereferencing p.

> +               dev = pci_walk_dev(phb, NULL, check_mlx_cards, NULL);
> +               if (dev)
> +                       phb4_pec2_dma_engine_realloc(p);
> +       }
> +}
Frederic Barrat March 25, 2020, 9:41 a.m. UTC | #2
On 25/03/2020 at 06:11, Oliver O'Halloran wrote:
> On Wed, Mar 25, 2020 at 4:51 AM Frederic Barrat <fbarrat@linux.ibm.com> wrote:
>>
>> +
>> +static void witherspoon_pci_probe_complete(void)
>> +{
>> +       struct pci_device *dev;
>> +       struct phb *phb;
>> +       struct phb4 *p;
>> +
>> +       /*
>> +        * Reallocate dma engines between stacks in PEC2 if a Mellanox
>> +        * card is found on the shared slot, as it is required to get
>> +        * good GPU direct performance.
>> +        */
>> +       for_each_phb(phb) {
>> +               p = phb_to_phb4(phb);
>> +               if (p->index != 3)
>> +                       continue;
> 
> I think you need to check phb->type == phb_type_pcie_v4 before dereferencing p.


Duh! Thanks...


>> +               dev = pci_walk_dev(phb, NULL, check_mlx_cards, NULL);
>> +               if (dev)
>> +                       phb4_pec2_dma_engine_realloc(p);
>> +       }
>> +}

Patch

diff --git a/hw/phb4.c b/hw/phb4.c
index f17625bb..60e797cf 100644
--- a/hw/phb4.c
+++ b/hw/phb4.c
@@ -809,6 +809,33 @@  static int64_t phb4_pcicfg_no_dstate(void *dev __unused,
 	return OPAL_PARTIAL;
 }
 
+void phb4_pec2_dma_engine_realloc(struct phb4 *p)
+{
+	uint64_t reg;
+
+	/*
+	 * Allocate 16 extra dma read engines to stack 0, to boost dma
+	 * performance for devices on stack 0 of PEC2, i.e PHB3.
+	 * It comes at a price of reduced read engine allocation for
+	 * devices on stack 1 and 2. The engine allocation becomes
+	 * 48/8/8 instead of the default 32/16/16.
+	 *
+	 * The reallocation magic value should be 0xffff0000ff008000,
+	 * but per the PCI designers, dma engine 32 (bit 0) has a
+	 * quirk, and 0x7fff80007F008000 has the same effect (engine
+	 * 32 goes to PHB4).
+	 */
+	if (p->index != 3) /* shared slot on PEC2 */
+		return;
+
+	PHBINF(p, "Allocating an extra 16 dma read engines on PEC2 stack0\n");
+	reg = 0x7fff80007F008000ULL;
+	xscom_write(p->chip_id,
+		    p->pci_xscom + XPEC_PCI_PRDSTKOVR, reg);
+	xscom_write(p->chip_id,
+		    p->pe_xscom  + XPEC_NEST_READ_STACK_OVERRIDE, reg);
+}
+
 static void phb4_check_device_quirks(struct pci_device *dev)
 {
 	/* Some special adapter tweaks for devices directly under the PHB */
@@ -4415,30 +4442,8 @@  static int64_t enable_capi_mode(struct phb4 *p, uint64_t pe_number,
 	 * dma-read engines allocations to maximize the DMA read performance
 	 */
 	if ((p->index == CAPP1_PHB_INDEX) &&
-	    (capp_eng & CAPP_MAX_DMA_READ_ENGINES)) {
-
-		/*
-		 * Allocate Additional 16/8 dma read engines to stack0/stack1
-		 * respectively. Read engines 0:31 are anyways always assigned
-		 * to stack0. Also skip allocating DMA Read Engine-32 by
-		 * enabling Bit[0] in XPEC_NEST_READ_STACK_OVERRIDE register.
-		 * Enabling this bit seems cause a parity error reported in
-		 * NFIR[1]-nonbar_pe.
-		 */
-		reg = 0x7fff80007F008000ULL;
-
-		xscom_write(p->chip_id, p->pci_xscom + XPEC_PCI_PRDSTKOVR, reg);
-		xscom_write(p->chip_id, p->pe_xscom +
-			    XPEC_NEST_READ_STACK_OVERRIDE, reg);
-
-		/* Log this reallocation as it may impact dma performance of
-		 * other slots connected to PEC2
-		 */
-		PHBINF(p, "CAPP: Set %d dma-read engines for PEC2/stack-0\n",
-		      32 + __builtin_popcountll(reg & PPC_BITMASK(0, 31)));
-		PHBDBG(p, "CAPP: XPEC_NEST_READ_STACK_OVERRIDE: %016llx\n",
-		       reg);
-	}
+	    (capp_eng & CAPP_MAX_DMA_READ_ENGINES))
+		phb4_pec2_dma_engine_realloc(p);
 
 	/* PCI to PB data movement ignores the PB init signal. */
 	xscom_write_mask(p->chip_id, p->pe_xscom + XPEC_NEST_PBCQ_HW_CONFIG,
diff --git a/include/phb4.h b/include/phb4.h
index 6d5fd510..abba2d9c 100644
--- a/include/phb4.h
+++ b/include/phb4.h
@@ -257,4 +257,6 @@  static inline int phb4_get_opal_id(unsigned int chip_id, unsigned int index)
 		return chip_id * PHB4_MAX_PHBS_PER_CHIP_P9P + index;
 }
 
+void phb4_pec2_dma_engine_realloc(struct phb4 *p);
+
 #endif /* __PHB4_H */
diff --git a/platforms/astbmc/witherspoon.c b/platforms/astbmc/witherspoon.c
index 6387af48..e0e6a51b 100644
--- a/platforms/astbmc/witherspoon.c
+++ b/platforms/astbmc/witherspoon.c
@@ -192,6 +192,48 @@  static void witherspoon_shared_slot_fixup(void)
 	}
 }
 
+static int check_mlx_cards(struct phb *phb __unused, struct pci_device *dev,
+			   void *userdata __unused)
+{
+	uint16_t mlx_cards[] = {
+		0x1017, /* ConnectX-5 */
+		0x1019, /* ConnectX-5 Ex */
+		0x101b, /* ConnectX-6 */
+		0x101d, /* ConnectX-6 Dx */
+		0x101f, /* ConnectX-6 Lx */
+		0x1021, /* ConnectX-7 */
+	};
+
+	if (PCI_VENDOR_ID(dev->vdid) == 0x15b3) { /* Mellanox */
+		for (int i = 0; i < ARRAY_SIZE(mlx_cards); i++) {
+			if (mlx_cards[i] == PCI_DEVICE_ID(dev->vdid))
+				return 1;
+		}
+	}
+	return 0;
+}
+
+static void witherspoon_pci_probe_complete(void)
+{
+	struct pci_device *dev;
+	struct phb *phb;
+	struct phb4 *p;
+
+	/*
+	 * Reallocate dma engines between stacks in PEC2 if a Mellanox
+	 * card is found on the shared slot, as it is required to get
+	 * good GPU direct performance.
+	 */
+	for_each_phb(phb) {
+		p = phb_to_phb4(phb);
+		if (p->index != 3)
+			continue;
+		dev = pci_walk_dev(phb, NULL, check_mlx_cards, NULL);
+		if (dev)
+			phb4_pec2_dma_engine_realloc(p);
+	}
+}
+
 static void set_link_details(struct npu2 *npu, uint32_t link_index,
 			     uint32_t brick_index, enum npu2_dev_type type)
 {
@@ -533,6 +575,7 @@  DECLARE_PLATFORM(witherspoon) = {
 	.probe			= witherspoon_probe,
 	.init			= astbmc_init,
 	.pre_pci_fixup		= witherspoon_shared_slot_fixup,
+	.pci_probe_complete	= witherspoon_pci_probe_complete,
 	.start_preload_resource	= flash_start_preload_resource,
 	.resource_loaded	= flash_resource_loaded,
 	.bmc			= &bmc_plat_ast2500_openbmc,