hw/phb4: Tune GPU direct performance on witherspoon in PCI mode

Message ID 20200311152805.20495-1-fbarrat@linux.ibm.com
State Superseded
Series
  • hw/phb4: Tune GPU direct performance on witherspoon in PCI mode

Checks

Context Check Description
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco success Signed-off-by present
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot success Test snowpatch/job/snowpatch-skiboot on branch master
snowpatch_ozlabs/apply_patch success Successfully applied on branch master (2700092e09adc8f3c2d578878c1a98cb44b9863d)

Commit Message

Frederic Barrat March 11, 2020, 3:28 p.m. UTC
Good GPU direct performance on witherspoon, with a Mellanox adapter
on the shared slot, requires reallocating some dma engines within
PEC2, "stealing" some from PHB4&5 and giving extras to PHB3. It's
currently done when using CAPI mode, but the same applies if the
adapter stays in PCI mode.

In preparation for upcoming versions of MOFED, which may not use CAPI
mode, this patch reallocates dma engines even in PCI mode for a series
of Mellanox adapters that can be used with GPU direct, on witherspoon
and on the shared slot only.

The loss of dma engines for PHB4&5 on witherspoon has not shown
problems in testing, nor in current deployments where CAPI mode
is used.

Here is a comparison of the bandwidth numbers seen with the PHB in
PCI mode (no CAPI) with and without this patch:

 # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.6.1
 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
 # Size      Bandwidth (MB/s)       Bandwidth (MB/s)
 #              with patch           without patch
1                       1.29              1.48
2                       2.66              3.04
4                       5.34              5.93
8                      10.68             11.86
16                     21.39             23.71
32                     42.78             49.15
64                     85.43             97.67
128                   170.82            196.64
256                   385.47            383.02
512                   774.68            755.54
1024                 1535.14           1495.30
2048                 2599.31           2561.60
4096                 5192.31           5092.47
8192                 9930.30           9566.90
16384               18189.81          16803.42
32768               24671.48          21383.57
65536               28977.71          24104.50
131072              31110.55          25858.95
262144              32180.64          26470.61
524288              32842.23          26961.93
1048576             33184.87          27217.38
2097152             33342.67          27338.08

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 hw/phb4.c | 92 +++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 76 insertions(+), 16 deletions(-)

Comments

Andrew Donnellan March 24, 2020, 5:29 a.m. UTC | #1
On 12/3/20 2:28 am, Frederic Barrat wrote:
> -static void phb4_check_device_quirks(struct pci_device *dev)
> +static void phb4_pec2_dma_engine_realloc(struct phb *phb)
>   {
> +	struct phb4 *p = phb_to_phb4(phb);
> +	uint64_t reg;
> +
> +	/*
> +	 * Allocate 16 extra dma read engines to stack 0, to boost dma
> +	 * performance for devices on stack 0 of PEC2, i.e PHB3.  It
> +	 * comes at a price of reduced read engine allocation for
> +	 * devices on stack 1 and 2. The engine allocation becomes
> +	 * 48/8/8 instead of the default 32/16/16.
> +	 *
> +	 * The reallocation magic value should be 0xffff0000ff008000,
> +	 * but per the PCI designers, dma engine 32 (bit 0) has a
> +	 * quirk, and 0x7fff80007F008000 has the same effect (engine
> +	 * 32 goes to PHB4).
> +	 */
> +	if (p->index != 3) /* shared slot on PEC2 */
> +		return;
> +
> +	PHBINF(p, "Allocating extra dma read engines on PEC2 stack0\n");
> +	reg = 0x7fff80007F008000ULL;
> +	xscom_write(p->chip_id,
> +		    p->pci_xscom + XPEC_PCI_PRDSTKOVR, reg);
> +	xscom_write(p->chip_id,
> +		    p->pe_xscom  + XPEC_NEST_READ_STACK_OVERRIDE, reg);
> +}

Can we use this function in the CAPI enable path as well so as not to 
duplicate?
Oliver O'Halloran March 24, 2020, 7:55 a.m. UTC | #2
On Thu, Mar 12, 2020 at 2:29 AM Frederic Barrat <fbarrat@linux.ibm.com> wrote:
>
> Good GPU direct performance on witherspoon, with a Mellanox adapter
> on the shared slot, requires to reallocate some dma engines within
> PEC2, "stealing" some from PHB4&5 and giving extras to PHB3. It's
> currently done when using CAPI mode. But the same is true if the
> adapter stays in PCI mode.
>
> In preparation for upcoming versions of MOFED, which may not use CAPI
> mode, this patch reallocates dma engines even in PCI mode for a series
> of Mellanox adapters that can be used with GPU direct, on witherspoon
> and on the shared slot only.
>
> The loss of dma engines for PHB4&5 on witherspoon has not shown
> problems in testing, as well as in current deployments where CAPI mode
> is used.
>
> Here is a comparison of the bandwidth numbers seen with the PHB in
> PCI mode (no CAPI) with and without this patch:
>
>  # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.6.1
>  # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>  # Size      Bandwidth (MB/s)       Bandwidth (MB/s)
>  #              with patch           without patch
> 1                       1.29              1.48
> 2                       2.66              3.04
> 4                       5.34              5.93
> 8                      10.68             11.86
> 16                     21.39             23.71
> 32                     42.78             49.15
> 64                     85.43             97.67
> 128                   170.82            196.64
> 256                   385.47            383.02

Looks like it actually makes things worse until you get to 256. What
is the "size" here? Bytes or something else?

> 512                   774.68            755.54
> 1024                 1535.14           1495.30
> 2048                 2599.31           2561.60
> 4096                 5192.31           5092.47
> 8192                 9930.30           9566.90
> 16384               18189.81          16803.42
> 32768               24671.48          21383.57
> 65536               28977.71          24104.50
> 131072              31110.55          25858.95
> 262144              32180.64          26470.61
> 524288              32842.23          26961.93
> 1048576             33184.87          27217.38
> 2097152             33342.67          27338.08
>
> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>


> ---
>  hw/phb4.c | 92 +++++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 76 insertions(+), 16 deletions(-)
>
> diff --git a/hw/phb4.c b/hw/phb4.c
> index ed7f4e5c..59c04b9f 100644
> --- a/hw/phb4.c
> +++ b/hw/phb4.c
> @@ -809,20 +809,88 @@ static int64_t phb4_pcicfg_no_dstate(void *dev __unused,
>         return OPAL_PARTIAL;
>  }
>
> -static void phb4_check_device_quirks(struct pci_device *dev)
> +static void phb4_pec2_dma_engine_realloc(struct phb *phb)
>  {
> +       struct phb4 *p = phb_to_phb4(phb);
> +       uint64_t reg;
> +
> +       /*
> +        * Allocate 16 extra dma read engines to stack 0, to boost dma
> +        * performance for devices on stack 0 of PEC2, i.e PHB3.  It
> +        * comes at a price of reduced read engine allocation for
> +        * devices on stack 1 and 2. The engine allocation becomes
> +        * 48/8/8 instead of the default 32/16/16.
> +        *
> +        * The reallocation magic value should be 0xffff0000ff008000,
> +        * but per the PCI designers, dma engine 32 (bit 0) has a
> +        * quirk, and 0x7fff80007F008000 has the same effect (engine
> +        * 32 goes to PHB4).
> +        */
> +       if (p->index != 3) /* shared slot on PEC2 */
> +               return;
> +
> +       PHBINF(p, "Allocating extra dma read engines on PEC2 stack0\n");
> +       reg = 0x7fff80007F008000ULL;
> +       xscom_write(p->chip_id,
> +                   p->pci_xscom + XPEC_PCI_PRDSTKOVR, reg);
> +       xscom_write(p->chip_id,
> +                   p->pe_xscom  + XPEC_NEST_READ_STACK_OVERRIDE, reg);
> +}
> +
> +struct pci_card_id {
> +       uint16_t vendor;
> +       uint16_t device;
> +};
> +
> +#define VENDOR(vdid) ((vdid) & 0xffff)
> +#define DEVICE(vdid) (((vdid) >> 16) & 0xffff)
> +
> +static struct pci_card_id dma_eng_realloc_whitelist[] = {
> +       { 0x15b3, 0x1017 }, /* Mellanox ConnectX-5 */
> +       { 0x15b3, 0x1019 }, /* Mellanox ConnectX-5 Ex */
> +       { 0x15b3, 0x101b }, /* Mellanox ConnectX-6 */
> +       { 0x15b3, 0x101d }, /* Mellanox ConnectX-6 Dx */
> +       { 0x15b3, 0x101f }, /* Mellanox ConnectX-6 Lx */
> +       { 0x15b3, 0x1021 }, /* Mellanox ConnectX-7 */
> +};
> +
> +static bool phb4_adapter_need_dma_engine_realloc(uint32_t vdid)
> +{
> +       int i;
> +
> +       for (i = 0; i < ARRAY_SIZE(dma_eng_realloc_whitelist); i++)
> +               if (dma_eng_realloc_whitelist[i].vendor == VENDOR(vdid) &&
> +                   dma_eng_realloc_whitelist[i].device == DEVICE(vdid))
> +                       return true;
> +       return false;
> +}

Considering this is a copy of the existing DD2.0 retry whitelist, why
not make it generic-ish? Something like:

pci_adapter_in_list(vdid, &table);
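
A minimal sketch of what such a generic helper might look like, assuming the table length is passed explicitly. Only the name pci_adapter_in_list comes from the suggestion above; the signature and the example_list table are hypothetical and not existing skiboot API. As in the VENDOR/DEVICE macros of the patch, the vendor ID is taken from the low 16 bits of vdid and the device ID from the high 16 bits:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pci_card_id {
	uint16_t vendor;
	uint16_t device;
};

#define VENDOR(vdid) ((vdid) & 0xffff)
#define DEVICE(vdid) (((vdid) >> 16) & 0xffff)

/* Hypothetical generic lookup: true if vdid matches one of list[0..n-1]. */
static bool pci_adapter_in_list(uint32_t vdid,
				const struct pci_card_id *list, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (list[i].vendor == VENDOR(vdid) &&
		    list[i].device == DEVICE(vdid))
			return true;
	return false;
}

/* Two entries copied from the patch, for illustration only. */
static const struct pci_card_id example_list[] = {
	{ 0x15b3, 0x1017 }, /* Mellanox ConnectX-5 */
	{ 0x15b3, 0x101b }, /* Mellanox ConnectX-6 */
};
```

With something of this shape, each caller would pass its own table and length, so the retry list and the dma engine list could presumably share one loop.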

> +static void phb4_check_device_quirks(struct phb *phb, struct pci_device *dev)
> +{
> +       struct phb4 *p = phb_to_phb4(phb);
> +
>         /* Some special adapter tweaks for devices directly under the PHB */
>         if (dev->primary_bus != 1)
>                 return;
>
>         /* PM quirk */
> -       if (!pci_has_cap(dev, PCI_CFG_CAP_ID_PM, false))
> -               return;
> +       if (pci_has_cap(dev, PCI_CFG_CAP_ID_PM, false)) {
> +               pci_add_cfg_reg_filter(dev,
> +                                      pci_cap(dev, PCI_CFG_CAP_ID_PM, false), 8,
> +                                      PCI_REG_FLAG_WRITE,
> +                                      phb4_pcicfg_no_dstate);
> +       }
>
> -       pci_add_cfg_reg_filter(dev,
> -                              pci_cap(dev, PCI_CFG_CAP_ID_PM, false), 8,
> -                              PCI_REG_FLAG_WRITE,
> -                              phb4_pcicfg_no_dstate);
> +       /*
> +        * PEC2 dma engine reallocation for Mellanox cards.
> +        * Only on witherspoon when the card is on the shared slot.
> +        * It improves GPU direct performance.
> +        */
> +       if (p->index == 3 && phb->slot->peer_slot &&
> +           phb4_adapter_need_dma_engine_realloc(dev->vdid)) {
> +               if (PCI_FUNC(dev->bdfn) == 0) // once per adapter
> +                       phb4_pec2_dma_engine_realloc(phb);

All these checks seem to be there to cover up for the fact that this is a
platform-specific hack being shoved into generic code. Why can't this
live in witherspoon.c with the rest of the shared slot hacks? The
SCOMs being poked are all part of the PEC rather than the PHB, so we
don't need to care about what happens on a PHB reset, and one of the
existing platform PCI hooks should do the trick.

Oliver
Frederic Barrat March 24, 2020, 5:03 p.m. UTC | #3
On 24/03/2020 at 06:29, Andrew Donnellan wrote:
> On 12/3/20 2:28 am, Frederic Barrat wrote:
>> -static void phb4_check_device_quirks(struct pci_device *dev)
>> +static void phb4_pec2_dma_engine_realloc(struct phb *phb)
>>   {
>> +    struct phb4 *p = phb_to_phb4(phb);
>> +    uint64_t reg;
>> +
>> +    /*
>> +     * Allocate 16 extra dma read engines to stack 0, to boost dma
>> +     * performance for devices on stack 0 of PEC2, i.e PHB3.  It
>> +     * comes at a price of reduced read engine allocation for
>> +     * devices on stack 1 and 2. The engine allocation becomes
>> +     * 48/8/8 instead of the default 32/16/16.
>> +     *
>> +     * The reallocation magic value should be 0xffff0000ff008000,
>> +     * but per the PCI designers, dma engine 32 (bit 0) has a
>> +     * quirk, and 0x7fff80007F008000 has the same effect (engine
>> +     * 32 goes to PHB4).
>> +     */
>> +    if (p->index != 3) /* shared slot on PEC2 */
>> +        return;
>> +
>> +    PHBINF(p, "Allocating extra dma read engines on PEC2 stack0\n");
>> +    reg = 0x7fff80007F008000ULL;
>> +    xscom_write(p->chip_id,
>> +            p->pci_xscom + XPEC_PCI_PRDSTKOVR, reg);
>> +    xscom_write(p->chip_id,
>> +            p->pe_xscom  + XPEC_NEST_READ_STACK_OVERRIDE, reg);
>> +}
> 
> Can we use this function in the CAPI enable path as well so as not to 
> duplicate?

Yes! The capi code actually becomes redundant, since the setup is now
done at boot and we're just redoing the same thing when switching the
PHB to capi mode. I'm going to keep the call for capi anyway, since it's
really under the control of the mlx5 driver and there's a (highly
unlikely) possibility that mlx5 would activate it for another adapter
type not listed in the PCI case.
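
To illustrate why keeping both call sites is harmless, here is a rough sketch using a mock register instead of real xscom writes. All names (mock_pec, boot_quirk_path, capi_enable_path) are made up for the example and are not skiboot functions; only the override value is taken from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Override value from the patch. */
#define PEC2_RD_ENGINE_OVERRIDE 0x7fff80007F008000ULL

/* Mock standing in for the PEC2 read stack override registers. */
struct mock_pec {
	uint64_t rd_stack_override;
	unsigned int writes;
};

/* Shared helper: always writes the same value, so repeating it is a no-op. */
static void pec2_dma_engine_realloc(struct mock_pec *pec)
{
	pec->rd_stack_override = PEC2_RD_ENGINE_OVERRIDE;
	pec->writes++;
}

/* Boot-time path: only for adapters found in the PCI-mode list. */
static void boot_quirk_path(struct mock_pec *pec, bool adapter_listed)
{
	if (adapter_listed)
		pec2_dma_engine_realloc(pec);
}

/* CAPI enable path: driven by the driver, regardless of the list. */
static void capi_enable_path(struct mock_pec *pec)
{
	pec2_dma_engine_realloc(pec);
}
```

For a listed adapter the second call simply rewrites the same value. For an unlisted adapter switched to capi mode, the capi path is the only one that performs the setup.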

   Fred
Frederic Barrat March 24, 2020, 5:31 p.m. UTC | #4
On 24/03/2020 at 08:55, Oliver O'Halloran wrote:
> On Thu, Mar 12, 2020 at 2:29 AM Frederic Barrat <fbarrat@linux.ibm.com> wrote:
>>
>> Good GPU direct performance on witherspoon, with a Mellanox adapter
>> on the shared slot, requires to reallocate some dma engines within
>> PEC2, "stealing" some from PHB4&5 and giving extras to PHB3. It's
>> currently done when using CAPI mode. But the same is true if the
>> adapter stays in PCI mode.
>>
>> In preparation for upcoming versions of MOFED, which may not use CAPI
>> mode, this patch reallocates dma engines even in PCI mode for a series
>> of Mellanox adapters that can be used with GPU direct, on witherspoon
>> and on the shared slot only.
>>
>> The loss of dma engines for PHB4&5 on witherspoon has not shown
>> problems in testing, as well as in current deployments where CAPI mode
>> is used.
>>
>> Here is a comparison of the bandwidth numbers seen with the PHB in
>> PCI mode (no CAPI) with and without this patch:
>>
>>   # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.6.1
>>   # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
>>   # Size      Bandwidth (MB/s)       Bandwidth (MB/s)
>>   #              with patch           without patch
>> 1                       1.29              1.48
>> 2                       2.66              3.04
>> 4                       5.34              5.93
>> 8                      10.68             11.86
>> 16                     21.39             23.71
>> 32                     42.78             49.15
>> 64                     85.43             97.67
>> 128                   170.82            196.64
>> 256                   385.47            383.02
> 
> Looks like it actually makes things worse until you get to 256. What
> is the "size" here? Bytes or something else?

It's the size, in bytes, of a message going from one GPU on node A to
another GPU on node B.
The IO team, which is asking for this change and doing the perf
analysis, is not really concerned about the variations for smaller
message sizes, as there's apparently quite a bit of jitter. RDMA, for
which this change is needed, kicks in at size=64k, so that's where
they really want to see an improvement.
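
For reference, the relative change can be worked out from the table with a few lines of C. This is just a sketch: the struct and function names are made up for the example, and the three rows are copied from the numbers in the commit message:

```c
struct bw_sample {
	unsigned long size;   /* message size in bytes */
	double with_patch;    /* MB/s */
	double without_patch; /* MB/s */
};

/* Rows copied from the table in the commit message. */
static const struct bw_sample samples[] = {
	{ 1,       1.29,     1.48 },
	{ 65536,   28977.71, 24104.50 },
	{ 2097152, 33342.67, 27338.08 },
};

/* Relative change in percent; positive means faster with the patch. */
static double gain_percent(const struct bw_sample *s)
{
	return (s->with_patch / s->without_patch - 1.0) * 100.0;
}
```

On those rows this gives roughly +20% at 64k and roughly +22% at 2M, against roughly -13% for 1-byte messages, where the absolute numbers are tiny.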



>> 512                   774.68            755.54
>> 1024                 1535.14           1495.30
>> 2048                 2599.31           2561.60
>> 4096                 5192.31           5092.47
>> 8192                 9930.30           9566.90
>> 16384               18189.81          16803.42
>> 32768               24671.48          21383.57
>> 65536               28977.71          24104.50
>> 131072              31110.55          25858.95
>> 262144              32180.64          26470.61
>> 524288              32842.23          26961.93
>> 1048576             33184.87          27217.38
>> 2097152             33342.67          27338.08
>>
>> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
> 
> 
>> ---
>>   hw/phb4.c | 92 +++++++++++++++++++++++++++++++++++++++++++++----------
>>   1 file changed, 76 insertions(+), 16 deletions(-)
>>
>> diff --git a/hw/phb4.c b/hw/phb4.c
>> index ed7f4e5c..59c04b9f 100644
>> --- a/hw/phb4.c
>> +++ b/hw/phb4.c
>> @@ -809,20 +809,88 @@ static int64_t phb4_pcicfg_no_dstate(void *dev __unused,
>>          return OPAL_PARTIAL;
>>   }
>>
>> -static void phb4_check_device_quirks(struct pci_device *dev)
>> +static void phb4_pec2_dma_engine_realloc(struct phb *phb)
>>   {
>> +       struct phb4 *p = phb_to_phb4(phb);
>> +       uint64_t reg;
>> +
>> +       /*
>> +        * Allocate 16 extra dma read engines to stack 0, to boost dma
>> +        * performance for devices on stack 0 of PEC2, i.e PHB3.  It
>> +        * comes at a price of reduced read engine allocation for
>> +        * devices on stack 1 and 2. The engine allocation becomes
>> +        * 48/8/8 instead of the default 32/16/16.
>> +        *
>> +        * The reallocation magic value should be 0xffff0000ff008000,
>> +        * but per the PCI designers, dma engine 32 (bit 0) has a
>> +        * quirk, and 0x7fff80007F008000 has the same effect (engine
>> +        * 32 goes to PHB4).
>> +        */
>> +       if (p->index != 3) /* shared slot on PEC2 */
>> +               return;
>> +
>> +       PHBINF(p, "Allocating extra dma read engines on PEC2 stack0\n");
>> +       reg = 0x7fff80007F008000ULL;
>> +       xscom_write(p->chip_id,
>> +                   p->pci_xscom + XPEC_PCI_PRDSTKOVR, reg);
>> +       xscom_write(p->chip_id,
>> +                   p->pe_xscom  + XPEC_NEST_READ_STACK_OVERRIDE, reg);
>> +}
>> +
>> +struct pci_card_id {
>> +       uint16_t vendor;
>> +       uint16_t device;
>> +};
>> +
>> +#define VENDOR(vdid) ((vdid) & 0xffff)
>> +#define DEVICE(vdid) (((vdid) >> 16) & 0xffff)
>> +
>> +static struct pci_card_id dma_eng_realloc_whitelist[] = {
>> +       { 0x15b3, 0x1017 }, /* Mellanox ConnectX-5 */
>> +       { 0x15b3, 0x1019 }, /* Mellanox ConnectX-5 Ex */
>> +       { 0x15b3, 0x101b }, /* Mellanox ConnectX-6 */
>> +       { 0x15b3, 0x101d }, /* Mellanox ConnectX-6 Dx */
>> +       { 0x15b3, 0x101f }, /* Mellanox ConnectX-6 Lx */
>> +       { 0x15b3, 0x1021 }, /* Mellanox ConnectX-7 */
>> +};
>> +
>> +static bool phb4_adapter_need_dma_engine_realloc(uint32_t vdid)
>> +{
>> +       int i;
>> +
>> +       for (i = 0; i < ARRAY_SIZE(dma_eng_realloc_whitelist); i++)
>> +               if (dma_eng_realloc_whitelist[i].vendor == VENDOR(vdid) &&
>> +                   dma_eng_realloc_whitelist[i].device == DEVICE(vdid))
>> +                       return true;
>> +       return false;
>> +}
> 
> Considering this is a copy of the existing DD2.0 retain whitelist it
> wouldn't make it generic-ish, something like:
> 
> pci_adapter_in_list(vdid, &table);
> 
>> +static void phb4_check_device_quirks(struct phb *phb, struct pci_device *dev)
>> +{
>> +       struct phb4 *p = phb_to_phb4(phb);
>> +
>>          /* Some special adapter tweaks for devices directly under the PHB */
>>          if (dev->primary_bus != 1)
>>                  return;
>>
>>          /* PM quirk */
>> -       if (!pci_has_cap(dev, PCI_CFG_CAP_ID_PM, false))
>> -               return;
>> +       if (pci_has_cap(dev, PCI_CFG_CAP_ID_PM, false)) {
>> +               pci_add_cfg_reg_filter(dev,
>> +                                      pci_cap(dev, PCI_CFG_CAP_ID_PM, false), 8,
>> +                                      PCI_REG_FLAG_WRITE,
>> +                                      phb4_pcicfg_no_dstate);
>> +       }
>>
>> -       pci_add_cfg_reg_filter(dev,
>> -                              pci_cap(dev, PCI_CFG_CAP_ID_PM, false), 8,
>> -                              PCI_REG_FLAG_WRITE,
>> -                              phb4_pcicfg_no_dstate);
>> +       /*
>> +        * PEC2 dma engine reallocation for Mellanox cards.
>> +        * Only on witherspoon when the card is on the shared slot.
>> +        * It improves GPU direct performance.
>> +        */
>> +       if (p->index == 3 && phb->slot->peer_slot &&
>> +           phb4_adapter_need_dma_engine_realloc(dev->vdid)) {
>> +               if (PCI_FUNC(dev->bdfn) == 0) // once per adapter
>> +                       phb4_pec2_dma_engine_realloc(phb);
> 
> All these checks seem to be there to cover up for the fact this is a
> platform specific hack being shoved into generic code. Why can't this
> live in witherspoon.c with the rest of the shared slot hacks? The
> SCOMs being poked are all part of the PEC rather than the PHB so we
> don't need to care about what happens on a PHB reset, so one of the
> existing platform PCI hooks should do the trick.

Yeah, I was somewhat reluctant to modify both phb4.c and witherspoon.c.
But I completely agree I'm abusing the meaning of slot->peer_slot to
mean witherspoon.
I've reworked the patch, triggering the setup from witherspoon.c, and I
think it looks a bit better. v2 is on its way.

   Fred

Patch

diff --git a/hw/phb4.c b/hw/phb4.c
index ed7f4e5c..59c04b9f 100644
--- a/hw/phb4.c
+++ b/hw/phb4.c
@@ -809,20 +809,88 @@  static int64_t phb4_pcicfg_no_dstate(void *dev __unused,
 	return OPAL_PARTIAL;
 }
 
-static void phb4_check_device_quirks(struct pci_device *dev)
+static void phb4_pec2_dma_engine_realloc(struct phb *phb)
 {
+	struct phb4 *p = phb_to_phb4(phb);
+	uint64_t reg;
+
+	/*
+	 * Allocate 16 extra dma read engines to stack 0, to boost dma
+	 * performance for devices on stack 0 of PEC2, i.e PHB3.  It
+	 * comes at a price of reduced read engine allocation for
+	 * devices on stack 1 and 2. The engine allocation becomes
+	 * 48/8/8 instead of the default 32/16/16.
+	 *
+	 * The reallocation magic value should be 0xffff0000ff008000,
+	 * but per the PCI designers, dma engine 32 (bit 0) has a
+	 * quirk, and 0x7fff80007F008000 has the same effect (engine
+	 * 32 goes to PHB4).
+	 */
+	if (p->index != 3) /* shared slot on PEC2 */
+		return;
+
+	PHBINF(p, "Allocating extra dma read engines on PEC2 stack0\n");
+	reg = 0x7fff80007F008000ULL;
+	xscom_write(p->chip_id,
+		    p->pci_xscom + XPEC_PCI_PRDSTKOVR, reg);
+	xscom_write(p->chip_id,
+		    p->pe_xscom  + XPEC_NEST_READ_STACK_OVERRIDE, reg);
+}
+
+struct pci_card_id {
+	uint16_t vendor;
+	uint16_t device;
+};
+
+#define VENDOR(vdid) ((vdid) & 0xffff)
+#define DEVICE(vdid) (((vdid) >> 16) & 0xffff)
+
+static struct pci_card_id dma_eng_realloc_whitelist[] = {
+	{ 0x15b3, 0x1017 }, /* Mellanox ConnectX-5 */
+	{ 0x15b3, 0x1019 }, /* Mellanox ConnectX-5 Ex */
+	{ 0x15b3, 0x101b }, /* Mellanox ConnectX-6 */
+	{ 0x15b3, 0x101d }, /* Mellanox ConnectX-6 Dx */
+	{ 0x15b3, 0x101f }, /* Mellanox ConnectX-6 Lx */
+	{ 0x15b3, 0x1021 }, /* Mellanox ConnectX-7 */
+};
+
+static bool phb4_adapter_need_dma_engine_realloc(uint32_t vdid)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(dma_eng_realloc_whitelist); i++)
+		if (dma_eng_realloc_whitelist[i].vendor == VENDOR(vdid) &&
+		    dma_eng_realloc_whitelist[i].device == DEVICE(vdid))
+			return true;
+	return false;
+}
+
+static void phb4_check_device_quirks(struct phb *phb, struct pci_device *dev)
+{
+	struct phb4 *p = phb_to_phb4(phb);
+
 	/* Some special adapter tweaks for devices directly under the PHB */
 	if (dev->primary_bus != 1)
 		return;
 
 	/* PM quirk */
-	if (!pci_has_cap(dev, PCI_CFG_CAP_ID_PM, false))
-		return;
+	if (pci_has_cap(dev, PCI_CFG_CAP_ID_PM, false)) {
+		pci_add_cfg_reg_filter(dev,
+				       pci_cap(dev, PCI_CFG_CAP_ID_PM, false), 8,
+				       PCI_REG_FLAG_WRITE,
+				       phb4_pcicfg_no_dstate);
+	}
 
-	pci_add_cfg_reg_filter(dev,
-			       pci_cap(dev, PCI_CFG_CAP_ID_PM, false), 8,
-			       PCI_REG_FLAG_WRITE,
-			       phb4_pcicfg_no_dstate);
+	/*
+	 * PEC2 dma engine reallocation for Mellanox cards.
+	 * Only on witherspoon when the card is on the shared slot.
+	 * It improves GPU direct performance.
+	 */
+	if (p->index == 3 && phb->slot->peer_slot &&
+	    phb4_adapter_need_dma_engine_realloc(dev->vdid)) {
+		if (PCI_FUNC(dev->bdfn) == 0) // once per adapter
+			phb4_pec2_dma_engine_realloc(phb);
+	}
 }
 
 static int phb4_device_init(struct phb *phb, struct pci_device *dev,
@@ -831,7 +899,7 @@  static int phb4_device_init(struct phb *phb, struct pci_device *dev,
 	int ecap, aercap;
 
 	/* Setup special device quirks */
-	phb4_check_device_quirks(dev);
+	phb4_check_device_quirks(phb, dev);
 
 	/* Common initialization for the device */
 	pci_device_init(phb, dev);
@@ -2581,11 +2649,6 @@  static bool phb4_chip_retry_workaround(void)
 	return false;
 }
 
-struct pci_card_id {
-	uint16_t vendor;
-	uint16_t device;
-};
-
 static struct pci_card_id retry_whitelist[] = {
 	{ 0x1000, 0x005d }, /* LSI Logic MegaRAID SAS-3 3108 */
 	{ 0x1000, 0x00c9 }, /* LSI MPT SAS-3 */
@@ -2602,9 +2665,6 @@  static struct pci_card_id retry_whitelist[] = {
 	{ 0x9005, 0x028d }, /* MicroSemi PM8069 */
 };
 
-#define VENDOR(vdid) ((vdid) & 0xffff)
-#define DEVICE(vdid) (((vdid) >> 16) & 0xffff)
-
 static bool phb4_adapter_in_whitelist(uint32_t vdid)
 {
 	int i;