
[RFC,net-next,v5,0/3] r8169: Implement dynamic ASPM mechanism for recent 1.0/2.5Gbps Realtek NICs

Message ID 20210916154417.664323-1-kai.heng.feng@canonical.com

Message

Kai-Heng Feng Sept. 16, 2021, 3:44 p.m. UTC
The purpose of this series is to get comments and reviews so we can merge
and test it in the downstream kernel.

The latest Realtek vendor driver and its Windows driver implement a
feature called "dynamic ASPM" which can improve performance on its
Ethernet NICs.

Heiner Kallweit pointed out that the potential root cause may be that
the buffer is too small for its ASPM exit latency.

So bring dynamic ASPM to r8169 so we can have both good performance
and power saving at the same time.

For slow/fast alternating traffic patterns, we'll need some real-world
testing to know whether the dynamic ASPM interval needs to be lowered.

v4:
https://lore.kernel.org/netdev/20210827171452.217123-1-kai.heng.feng@canonical.com/

v3:
https://lore.kernel.org/netdev/20210819054542.608745-1-kai.heng.feng@canonical.com/

v2:
https://lore.kernel.org/netdev/20210812155341.817031-1-kai.heng.feng@canonical.com/

v1:
https://lore.kernel.org/netdev/20210803152823.515849-1-kai.heng.feng@canonical.com/

Kai-Heng Feng (3):
  PCI/ASPM: Introduce a new helper to report ASPM capability
  r8169: Use PCIe ASPM status for NIC ASPM enablement
  r8169: Implement dynamic ASPM mechanism

 drivers/net/ethernet/realtek/r8169_main.c | 69 ++++++++++++++++++++---
 drivers/pci/pcie/aspm.c                   | 11 ++++
 include/linux/pci.h                       |  2 +
 3 files changed, 73 insertions(+), 9 deletions(-)

Comments

Bjorn Helgaas Sept. 17, 2021, 10:09 p.m. UTC | #1
On Thu, Sep 16, 2021 at 11:44:14PM +0800, Kai-Heng Feng wrote:
> The purpose of the series is to get comments and reviews so we can merge
> and test the series in downstream kernel.
> 
> The latest Realtek vendor driver and its Windows driver implements a
> feature called "dynamic ASPM" which can improve performance on it's
> ethernet NICs.
> 
> Heiner Kallweit pointed out the potential root cause can be that the
> buffer is too small for its ASPM exit latency.

I looked at the lspci data in your bugzilla
(https://bugzilla.kernel.org/show_bug.cgi?id=214307).

L1.2 is enabled, which requires the Latency Tolerance Reporting
capability, which helps determine when the Link will be put in L1.2.
IIUC, these are analogous to the DevCap "Acceptable Latency" values.
Zero latency values indicate the device will be impacted by any delay
(PCIe r5.0, sec 6.18).

Linux does not currently program those values, so the values there
must have been set by the BIOS.  On the working AMD system, they're
set to 1048576ns, while on the broken Intel system, they're set to
3145728ns.
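
(Assuming the standard LTR register encoding, both values use the
1048576ns scale: 1048576ns would be encoded as 0x1001, i.e. latency
value 1 with scale 4, and 3145728ns as 0x1003, i.e. value 3 with the
same scale.)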

I don't really understand how these values should be computed, and I
think they depend on some electrical characteristics of the Link, so
I'm not sure it's *necessarily* a problem that they are different.
But a 3X difference does seem pretty large.

So I'm curious whether this is related to the problem.  Here are some
things we could try on the broken Intel system:

  - What happens if you disable ASPM L1.2 using
    /sys/devices/pci*/.../link/l1_2_aspm?

  - If that doesn't work, what happens if you also disable PCI-PM L1.2
    using /sys/devices/pci*/.../link/l1_2_pcipm?

  - If either of the above makes things work, then at least we know
    the problem is sensitive to L1.2.

  - Then what happens if you use setpci to set the LTR Latency
    registers to 0, then re-enable ASPM L1.2 and PCI-PM L1.2?  This
    should mean the Realtek device wants the best possible service and
    the Link probably won't spend much time in L1.2.

  - What happens if you set the LTR Latency registers to 0x1001
    (should be the same as on the AMD system)?
Kai-Heng Feng Oct. 1, 2021, 4:17 a.m. UTC | #2
On Sat, Sep 18, 2021 at 6:09 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Sep 16, 2021 at 11:44:14PM +0800, Kai-Heng Feng wrote:
> > The purpose of the series is to get comments and reviews so we can merge
> > and test the series in downstream kernel.
> >
> > The latest Realtek vendor driver and its Windows driver implements a
> > feature called "dynamic ASPM" which can improve performance on it's
> > ethernet NICs.
> >
> > Heiner Kallweit pointed out the potential root cause can be that the
> > buffer is too small for its ASPM exit latency.
>
> I looked at the lspci data in your bugzilla
> (https://bugzilla.kernel.org/show_bug.cgi?id=214307).
>
> L1.2 is enabled, which requires the Latency Tolerance Reporting
> capability, which helps determine when the Link will be put in L1.2.
> IIUC, these are analogous to the DevCap "Acceptable Latency" values.
> Zero latency values indicate the device will be impacted by any delay
> (PCIe r5.0, sec 6.18).
>
> Linux does not currently program those values, so the values there
> must have been set by the BIOS.  On the working AMD system, they're
> set to 1048576ns, while on the broken Intel system, they're set to
> 3145728ns.
>
> I don't really understand how these values should be computed, and I
> think they depend on some electrical characteristics of the Link, so
> I'm not sure it's *necessarily* a problem that they are different.
> But a 3X difference does seem pretty large.
>
> So I'm curious whether this is related to the problem.  Here are some
> things we could try on the broken Intel system:

Original network speed, tested via iperf3:
TX: ~255 Mbps
RX: ~490 Mbps
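
(Measured with something like "iperf3 -c <server>" for TX and the same
command with "-R" for the RX direction; <server> here is just a
placeholder.)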

>
>   - What happens if you disable ASPM L1.2 using
>     /sys/devices/pci*/.../link/l1_2_aspm?

TX: ~670 Mbps
RX: ~670 Mbps
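
For reference, the sysfs knob above is just a 0/1 write, e.g.:

# echo 0 > /sys/devices/pci*/.../link/l1_2_aspm   # disable ASPM L1.2
# echo 1 > /sys/devices/pci*/.../link/l1_2_aspm   # re-enable it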

>
>   - If that doesn't work, what happens if you also disable PCI-PM L1.2
>     using /sys/devices/pci*/.../link/l1_2_pcipm?

Same as only disabling l1_2_aspm.

>
>   - If either of the above makes things work, then at least we know
>     the problem is sensitive to L1.2.

Right now the downstream kernel disables ASPM L1.2 as a workaround.

>
>   - Then what happens if you use setpci to set the LTR Latency
>     registers to 0, then re-enable ASPM L1.2 and PCI-PM L1.2?  This
>     should mean the Realtek device wants the best possible service and
>     the Link probably won't spend much time in L1.2.

# setpci -s 01:00.0 ECAP_LTR+4.w=0x0
# setpci -s 01:00.0 ECAP_LTR+6.w=0x0
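
(ECAP_LTR+4 is the Max Snoop Latency register and ECAP_LTR+6 the Max
No-Snoop Latency register of the LTR extended capability.)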

Then, after re-enabling ASPM L1.2, the issue persists - the network
speed is still very slow.

>
>   - What happens if you set the LTR Latency registers to 0x1001
>     (should be the same as on the AMD system)?

Same slow speed here.
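
For reference, the equivalent writes would be:

# setpci -s 01:00.0 ECAP_LTR+4.w=0x1001
# setpci -s 01:00.0 ECAP_LTR+6.w=0x1001

(0x1001 = latency value 1 at the 1048576ns scale = 1048576ns, the same
value as on the AMD system.)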

Kai-Heng
Bjorn Helgaas Oct. 7, 2021, 7:10 p.m. UTC | #3
On Fri, Oct 01, 2021 at 12:17:26PM +0800, Kai-Heng Feng wrote:
> On Sat, Sep 18, 2021 at 6:09 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Sep 16, 2021 at 11:44:14PM +0800, Kai-Heng Feng wrote:
> > > The purpose of the series is to get comments and reviews so we can merge
> > > and test the series in downstream kernel.
> > >
> > > The latest Realtek vendor driver and its Windows driver implements a
> > > feature called "dynamic ASPM" which can improve performance on it's
> > > ethernet NICs.
> > >
> > > Heiner Kallweit pointed out the potential root cause can be that the
> > > buffer is too small for its ASPM exit latency.
> >
> > I looked at the lspci data in your bugzilla
> > (https://bugzilla.kernel.org/show_bug.cgi?id=214307).
> >
> > L1.2 is enabled, which requires the Latency Tolerance Reporting
> > capability, which helps determine when the Link will be put in L1.2.
> > IIUC, these are analogous to the DevCap "Acceptable Latency" values.
> > Zero latency values indicate the device will be impacted by any delay
> > (PCIe r5.0, sec 6.18).
> >
> > Linux does not currently program those values, so the values there
> > must have been set by the BIOS.  On the working AMD system, they're
> > set to 1048576ns, while on the broken Intel system, they're set to
> > 3145728ns.
> >
> > I don't really understand how these values should be computed, and I
> > think they depend on some electrical characteristics of the Link, so
> > I'm not sure it's *necessarily* a problem that they are different.
> > But a 3X difference does seem pretty large.
> >
> > So I'm curious whether this is related to the problem.  Here are some
> > things we could try on the broken Intel system:
> 
> Original network speed, tested via iperf3:
> TX: ~255 Mbps
> RX: ~490 Mbps
> 
> >   - What happens if you disable ASPM L1.2 using
> >     /sys/devices/pci*/.../link/l1_2_aspm?
> 
> TX: ~670 Mbps
> RX: ~670 Mbps
> 
> >   - If that doesn't work, what happens if you also disable PCI-PM L1.2
> >     using /sys/devices/pci*/.../link/l1_2_pcipm?
> 
> Same as only disables l1_2_aspm.
> 
> >   - If either of the above makes things work, then at least we know
> >     the problem is sensitive to L1.2.
> 
> Right now the downstream kernel disables ASPM L1.2 as workaround.
> 
> >   - Then what happens if you use setpci to set the LTR Latency
> >     registers to 0, then re-enable ASPM L1.2 and PCI-PM L1.2?  This
> >     should mean the Realtek device wants the best possible service and
> >     the Link probably won't spend much time in L1.2.
> 
> # setpci -s 01:00.0 ECAP_LTR+4.w=0x0
> # setpci -s 01:00.0 ECAP_LTR+6.w=0x0
> 
> Then re-enable ASPM L1.2, the issue persists - the network speed is
> still very slow.
> 
> >   - What happens if you set the LTR Latency registers to 0x1001
> >     (should be the same as on the AMD system)?
> 
> Same slow speed here.

Thanks a lot for indulging my curiosity and testing this.  So I guess
you confirmed that specifically ASPM L1.2 is the issue, which makes
sense given the current downstream kernel workaround.
Bjorn Helgaas Oct. 8, 2021, 1:56 p.m. UTC | #4
On Fri, Oct 01, 2021 at 12:17:26PM +0800, Kai-Heng Feng wrote:
> On Sat, Sep 18, 2021 at 6:09 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Sep 16, 2021 at 11:44:14PM +0800, Kai-Heng Feng wrote:
> > > The purpose of the series is to get comments and reviews so we can merge
> > > and test the series in downstream kernel.
> > >
> > > The latest Realtek vendor driver and its Windows driver implements a
> > > feature called "dynamic ASPM" which can improve performance on it's
> > > ethernet NICs.
> > >
> > > Heiner Kallweit pointed out the potential root cause can be that the
> > > buffer is too small for its ASPM exit latency.
> >
> > I looked at the lspci data in your bugzilla
> > (https://bugzilla.kernel.org/show_bug.cgi?id=214307).
> >
> > L1.2 is enabled, which requires the Latency Tolerance Reporting
> > capability, which helps determine when the Link will be put in L1.2.
> > IIUC, these are analogous to the DevCap "Acceptable Latency" values.
> > Zero latency values indicate the device will be impacted by any delay
> > (PCIe r5.0, sec 6.18).
> >
> > Linux does not currently program those values, so the values there
> > must have been set by the BIOS.  On the working AMD system, they're
> > set to 1048576ns, while on the broken Intel system, they're set to
> > 3145728ns.
> >
> > I don't really understand how these values should be computed, and I
> > think they depend on some electrical characteristics of the Link, so
> > I'm not sure it's *necessarily* a problem that they are different.
> > But a 3X difference does seem pretty large.
> >
> > So I'm curious whether this is related to the problem.  Here are some
> > things we could try on the broken Intel system:
> 
> Original network speed, tested via iperf3:
> TX: ~255 Mbps
> RX: ~490 Mbps
> 
> >   - What happens if you disable ASPM L1.2 using
> >     /sys/devices/pci*/.../link/l1_2_aspm?
> 
> TX: ~670 Mbps
> RX: ~670 Mbps

Do you remember if there were any dropped packets here?  You mentioned
at [1] that you have also seen reports of issues with L0s and L1.1.
If you disable L1.2, L0s and L1.1 *should* still be enabled.
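
(Dropped packets should show up in the "ip -s link" or
"ethtool -S <iface>" counters.)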

[1] https://lore.kernel.org/r/CAAd53p4v+CmupCu2+3vY5N64WKkxcNvpk1M7+hhNoposx+aYCg@mail.gmail.com