Message ID | 20190910154238.9155-1-bob.beckett@collabora.com |
---|---|
Series | net: dsa: mv88e6xxx: features to handle network storms |
+Ido, Jiri,

On 9/10/19 8:41 AM, Robert Beckett wrote:
> This patch-set adds support for some features of the Marvell switch
> chips that can be used to handle packet storms.
>
> The rationale for this was a setup that requires the ability to receive
> traffic from one port, while a packet storm is occurring on another port
> (via an external switch with a deliberate loop). This is needed to
> ensure vital data delivery from a specific port, while mitigating any
> loops or DoS that a user may introduce on another port (can't guarantee
> sensible users).

The use case is reasonable, but the implementation is not really. You
are using Device Tree, which is meant to describe hardware, as a policy
holder for setting up queue priorities and likewise for queue scheduling.

The tool that should be used for that purpose is tc and possibly an
appropriately offloaded queue scheduler in order to map the desired
scheduling class to what the hardware supports.

Jiri, Ido, how do you guys support this with mlxsw?

>
> [patch 1/7] configures auto negotiation for CPU ports connected with
> phys to enable pause frame propagation.
>
> [patch 2/7] allows setting of port's default output queue priority for
> any ingressing packets on that port.
>
> [patch 3/7] dt-bindings for patch 2.
>
> [patch 4/7] allows setting of a port's queue scheduling so that it can
> prioritise egress of traffic routed from high priority ports.
>
> [patch 5/7] dt-bindings for patch 4.
>
> [patch 6/7] allows ports to rate limit their egress. This can be used to
> stop the host CPU from becoming swamped by packet delivery and exhausting
> descriptors.
>
> [patch 7/7] dt-bindings for patch 6.
>
>
> Robert Beckett (7):
>   net/dsa: configure autoneg for CPU port
>   net: dsa: mv88e6xxx: add ability to set default queue priorities per
>     port
>   dt-bindings: mv88e6xxx: add ability to set default queue priorities
>     per port
>   net: dsa: mv88e6xxx: add ability to set queue scheduling
>   dt-bindings: mv88e6xxx: add ability to set queue scheduling
>   net: dsa: mv88e6xxx: add egress rate limiting
>   dt-bindings: mv88e6xxx: add egress rate limiting
>
>  .../devicetree/bindings/net/dsa/marvell.txt   |  38 +++++
>  drivers/net/dsa/mv88e6xxx/chip.c              | 122 ++++++++++++---
>  drivers/net/dsa/mv88e6xxx/chip.h              |   5 +-
>  drivers/net/dsa/mv88e6xxx/port.c              | 140 +++++++++++++++++-
>  drivers/net/dsa/mv88e6xxx/port.h              |  24 ++-
>  include/dt-bindings/net/dsa-mv88e6xxx.h       |  22 +++
>  net/dsa/port.c                                |  10 ++
>  7 files changed, 327 insertions(+), 34 deletions(-)
>  create mode 100644 include/dt-bindings/net/dsa-mv88e6xxx.h
>

Hi Robert, On Tue, 10 Sep 2019 16:41:46 +0100, Robert Beckett <bob.beckett@collabora.com> wrote: > This patch-set adds support for some features of the Marvell switch > chips that can be used to handle packet storms. > > The rationale for this was a setup that requires the ability to receive > traffic from one port, while a packet storm is occuring on another port > (via an external switch with a deliberate loop). This is needed to > ensure vital data delivery from a specific port, while mitigating any > loops or DoS that a user may introduce on another port (can't guarantee > sensible users). > > [patch 1/7] configures auto negotiation for CPU ports connected with > phys to enable pause frame propogation. > > [patch 2/7] allows setting of port's default output queue priority for > any ingressing packets on that port. > > [patch 3/7] dt-bindings for patch 2. > > [patch 4/7] allows setting of a port's queue scheduling so that it can > prioritise egress of traffic routed from high priority ports. > > [patch 5/7] dt-bindings for patch 4. > > [patch 6/7] allows ports to rate limit their egress. This can be used to > stop the host CPU from becoming swamped by packet delivery and exhasting > descriptors. > > [patch 7/7] dt-bindings for patch 6. 
> > > Robert Beckett (7): > net/dsa: configure autoneg for CPU port > net: dsa: mv88e6xxx: add ability to set default queue priorities per > port > dt-bindings: mv88e6xxx: add ability to set default queue priorities > per port > net: dsa: mv88e6xxx: add ability to set queue scheduling > dt-bindings: mv88e6xxx: add ability to set queue scheduling > net: dsa: mv88e6xxx: add egress rate limiting > dt-bindings: mv88e6xxx: add egress rate limiting > > .../devicetree/bindings/net/dsa/marvell.txt | 38 +++++ > drivers/net/dsa/mv88e6xxx/chip.c | 122 ++++++++++++--- > drivers/net/dsa/mv88e6xxx/chip.h | 5 +- > drivers/net/dsa/mv88e6xxx/port.c | 140 +++++++++++++++++- > drivers/net/dsa/mv88e6xxx/port.h | 24 ++- > include/dt-bindings/net/dsa-mv88e6xxx.h | 22 +++ > net/dsa/port.c | 10 ++ > 7 files changed, 327 insertions(+), 34 deletions(-) > create mode 100644 include/dt-bindings/net/dsa-mv88e6xxx.h Feature series targeting netdev must be prefixed "PATCH net-next". As this approach was a PoC, sending it as "RFC net-next" would be even more appropriate. Thank you, Vivien
On Tue, 2019-09-10 at 09:49 -0700, Florian Fainelli wrote:
> +Ido, Jiri,
>
> On 9/10/19 8:41 AM, Robert Beckett wrote:
> > This patch-set adds support for some features of the Marvell switch
> > chips that can be used to handle packet storms.
> >
> > The rationale for this was a setup that requires the ability to
> > receive traffic from one port, while a packet storm is occurring on
> > another port (via an external switch with a deliberate loop). This is
> > needed to ensure vital data delivery from a specific port, while
> > mitigating any loops or DoS that a user may introduce on another port
> > (can't guarantee sensible users).
>
> The use case is reasonable, but the implementation is not really. You
> are using Device Tree which is meant to describe hardware as a policy
> holder for setting up queue priorities and likewise for queue
> scheduling.
>
> The tool that should be used for that purpose is tc and possibly an
> appropriately offloaded queue scheduler in order to map the desired
> scheduling class to what the hardware supports.

Thanks for the review and tip about tc. I'm currently not familiar with
that tool. I'll investigate it as an alternative approach.

>
> Jiri, Ido, how do you guys support this with mlxsw?
>
> >
> > [patch 1/7] configures auto negotiation for CPU ports connected with
> > phys to enable pause frame propagation.
> >
> > [patch 2/7] allows setting of port's default output queue priority
> > for any ingressing packets on that port.
> >
> > [patch 3/7] dt-bindings for patch 2.
> >
> > [patch 4/7] allows setting of a port's queue scheduling so that it
> > can prioritise egress of traffic routed from high priority ports.
> >
> > [patch 5/7] dt-bindings for patch 4.
> >
> > [patch 6/7] allows ports to rate limit their egress. This can be
> > used to stop the host CPU from becoming swamped by packet delivery
> > and exhausting descriptors.
> >
> > [patch 7/7] dt-bindings for patch 6.
On Tue, 2019-09-10 at 13:19 -0400, Vivien Didelot wrote: > Hi Robert, > > On Tue, 10 Sep 2019 16:41:46 +0100, Robert Beckett < > bob.beckett@collabora.com> wrote: > > This patch-set adds support for some features of the Marvell switch > > chips that can be used to handle packet storms. > > > > The rationale for this was a setup that requires the ability to > > receive > > traffic from one port, while a packet storm is occuring on another > > port > > (via an external switch with a deliberate loop). This is needed to > > ensure vital data delivery from a specific port, while mitigating > > any > > loops or DoS that a user may introduce on another port (can't > > guarantee > > sensible users). > > > > [patch 1/7] configures auto negotiation for CPU ports connected > > with > > phys to enable pause frame propogation. > > > > [patch 2/7] allows setting of port's default output queue priority > > for > > any ingressing packets on that port. > > > > [patch 3/7] dt-bindings for patch 2. > > > > [patch 4/7] allows setting of a port's queue scheduling so that it > > can > > prioritise egress of traffic routed from high priority ports. > > > > [patch 5/7] dt-bindings for patch 4. > > > > [patch 6/7] allows ports to rate limit their egress. This can be > > used to > > stop the host CPU from becoming swamped by packet delivery and > > exhasting > > descriptors. > > > > [patch 7/7] dt-bindings for patch 6. 
> >
> >
> > Robert Beckett (7):
> >   net/dsa: configure autoneg for CPU port
> >   net: dsa: mv88e6xxx: add ability to set default queue priorities
> >     per port
> >   dt-bindings: mv88e6xxx: add ability to set default queue priorities
> >     per port
> >   net: dsa: mv88e6xxx: add ability to set queue scheduling
> >   dt-bindings: mv88e6xxx: add ability to set queue scheduling
> >   net: dsa: mv88e6xxx: add egress rate limiting
> >   dt-bindings: mv88e6xxx: add egress rate limiting
> >
> >  .../devicetree/bindings/net/dsa/marvell.txt   |  38 +++++
> >  drivers/net/dsa/mv88e6xxx/chip.c              | 122 ++++++++++++---
> >  drivers/net/dsa/mv88e6xxx/chip.h              |   5 +-
> >  drivers/net/dsa/mv88e6xxx/port.c              | 140 +++++++++++++++++-
> >  drivers/net/dsa/mv88e6xxx/port.h              |  24 ++-
> >  include/dt-bindings/net/dsa-mv88e6xxx.h       |  22 +++
> >  net/dsa/port.c                                |  10 ++
> >  7 files changed, 327 insertions(+), 34 deletions(-)
> >  create mode 100644 include/dt-bindings/net/dsa-mv88e6xxx.h
>
> Feature series targeting netdev must be prefixed "PATCH net-next". As

Thanks for the info. Out of curiosity, where should I have gleaned this
info from? This is my first contribution to netdev, so I wasn't familiar
with the etiquette.

> this approach was a PoC, sending it as "RFC net-next" would be even
> more appropriate.
>
>
> Thank you,
>
> Vivien

On Tue, Sep 10, 2019 at 09:49:46AM -0700, Florian Fainelli wrote: > +Ido, Jiri, > > On 9/10/19 8:41 AM, Robert Beckett wrote: > > This patch-set adds support for some features of the Marvell switch > > chips that can be used to handle packet storms. > > > > The rationale for this was a setup that requires the ability to receive > > traffic from one port, while a packet storm is occuring on another port > > (via an external switch with a deliberate loop). This is needed to > > ensure vital data delivery from a specific port, while mitigating any > > loops or DoS that a user may introduce on another port (can't guarantee > > sensible users). > > The use case is reasonable, but the implementation is not really. You > are using Device Tree which is meant to describe hardware as a policy > holder for setting up queue priorities and likewise for queue scheduling. > > The tool that should be used for that purpose is tc and possibly an > appropriately offloaded queue scheduler in order to map the desired > scheduling class to what the hardware supports. > > Jiri, Ido, how do you guys support this with mlxsw? Hi Florian, Are you referring to policing traffic towards the CPU using a policer on the egress of the CPU port? At least that's what I understand from the description of patch 6 below. If so, mlxsw sets policers for different traffic types during its initialization sequence. These policers are not exposed to the user nor configurable. While the default settings are good for most users, we do want to allow users to change these and expose current settings. I agree that tc seems like the right choice, but the question is where are we going to install the filters? > > > > > [patch 1/7] configures auto negotiation for CPU ports connected with > > phys to enable pause frame propogation. > > > > [patch 2/7] allows setting of port's default output queue priority for > > any ingressing packets on that port. > > > > [patch 3/7] dt-bindings for patch 2. 
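The thread does not say how the mlxsw policers mentioned above are implemented. As a generic illustration of what "policing traffic towards the CPU" means, here is a minimal sketch of a token-bucket policer, a common technique that is assumed here and not taken from the thread. The class name, the packets-per-second rate and the burst size are all hypothetical.

```python
class TokenBucketPolicer:
    """Toy packet policer: admits packets while tokens are available.

    Hypothetical illustration only; not the mlxsw or mv88e6xxx behaviour.
    """

    def __init__(self, rate_pps, burst):
        self.rate = rate_pps        # tokens (packets) added per second
        self.burst = burst          # maximum number of stored tokens
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the previous packet

    def allow(self, now):
        """Return True if a packet arriving at time `now` is admitted."""
        elapsed = now - self.last
        self.last = now
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the configured rate: drop


policer = TokenBucketPolicer(rate_pps=10, burst=2)
# Three back-to-back packets: the burst admits two, the third is dropped.
print([policer.allow(0.0) for _ in range(3)])
```

Dropping in the switch, before the CPU port, is what keeps the excess packets from ever reaching the host's receive path.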
On Wed, 2019-09-11 at 11:21 +0000, Ido Schimmel wrote: > On Tue, Sep 10, 2019 at 09:49:46AM -0700, Florian Fainelli wrote: > > +Ido, Jiri, > > > > On 9/10/19 8:41 AM, Robert Beckett wrote: > > > This patch-set adds support for some features of the Marvell > > > switch > > > chips that can be used to handle packet storms. > > > > > > The rationale for this was a setup that requires the ability to > > > receive > > > traffic from one port, while a packet storm is occuring on > > > another port > > > (via an external switch with a deliberate loop). This is needed > > > to > > > ensure vital data delivery from a specific port, while mitigating > > > any > > > loops or DoS that a user may introduce on another port (can't > > > guarantee > > > sensible users). > > > > The use case is reasonable, but the implementation is not really. > > You > > are using Device Tree which is meant to describe hardware as a > > policy > > holder for setting up queue priorities and likewise for queue > > scheduling. > > > > The tool that should be used for that purpose is tc and possibly an > > appropriately offloaded queue scheduler in order to map the desired > > scheduling class to what the hardware supports. > > > > Jiri, Ido, how do you guys support this with mlxsw? > > Hi Florian, > > Are you referring to policing traffic towards the CPU using a policer > on > the egress of the CPU port? At least that's what I understand from > the > description of patch 6 below. > > If so, mlxsw sets policers for different traffic types during its > initialization sequence. These policers are not exposed to the user > nor > configurable. While the default settings are good for most users, we > do > want to allow users to change these and expose current settings. > > I agree that tc seems like the right choice, but the question is > where > are we going to install the filters? 
>

Before I go too far down the rabbit hole of tc traffic shaping, maybe it
would be good to explain in more detail the problem I am trying to
solve.

We have a setup as follows:

Marvell 88E6240 switch chip, accepting traffic from 4 ports. Port 1 (P1)
is critical priority, no dropped packets allowed, all others can be best
effort.

CPU port of switch chip is connected via phy to phy of intel i210 (igb
driver).

i210 is connected via pcie switch to imx6.

When too many small packets attempt to be delivered to CPU port (e.g.
during broadcast flood) we saw dropped packets.

The packets were being received by i210 into rx descriptor buffer fine,
but the CPU could not keep up with the load. We saw rx_fifo_errors
increasing rapidly and ksoftirqd at ~100% CPU.

With this in mind, I am wondering whether any amount of tc traffic
shaping would help? Would tc shaping require that the packet reception
manages to keep up before it can enact its policies? Does the
infrastructure have accelerator offload hooks to be able to apply it via
HW? I don't see how it would be able to inspect the packets to apply
filtering if they were dropped due to rx descriptor exhaustion. (Please
bear with me with the basic questions, I am not familiar with this part
of the stack.)

Assuming that tc is still the way to go, after a brief look into the man
pages and the documentation at lartc.org, it seems like it would need to
use the ingress qdisc, with some sort of system to segregate and
prioritise based on ingress port. Is this possible?

> > >
> > > [patch 1/7] configures auto negotiation for CPU ports connected
> > > with phys to enable pause frame propagation.
> > >
> > > [patch 2/7] allows setting of port's default output queue priority
> > > for any ingressing packets on that port.
> > >
> > > [patch 3/7] dt-bindings for patch 2.
> > >
> > > [patch 4/7] allows setting of a port's queue scheduling so that it
> > > can prioritise egress of traffic routed from high priority ports.
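To put the small-packet flood described above in numbers, here is a rough back-of-the-envelope sketch. It assumes a 1 Gbit/s link between the switch CPU port and the i210 and minimum-size 64-byte frames. The 256-entry receive ring is a hypothetical value chosen for illustration, not a figure from the thread.

```python
# Worst-case frame arrival rate on Ethernet for minimum-size frames.
LINK_BPS = 1_000_000_000   # assumption: 1 Gbit/s link
FRAME_BYTES = 64           # minimum Ethernet frame, including FCS
OVERHEAD_BYTES = 8 + 12    # preamble/SFD plus inter-frame gap


def max_frames_per_second(link_bps=LINK_BPS, frame_bytes=FRAME_BYTES):
    """Frames per second when the link is saturated with equal-size frames."""
    bits_on_wire = (frame_bytes + OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


def ring_fill_time_us(ring_size, link_bps=LINK_BPS):
    """Microseconds until `ring_size` descriptors fill if the CPU drains none."""
    return ring_size / max_frames_per_second(link_bps) * 1e6


print(f"{max_frames_per_second():,.0f} frames/s")
print(f"{ring_fill_time_us(256):.1f} us to fill a 256-entry ring")
```

Under these assumptions the host has well under a millisecond of buffering at line rate, which is consistent with the point that shaping done in software on the host cannot protect the critical port once descriptors run out.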
Hi Robert,

On Wed, 11 Sep 2019 10:46:05 +0100, Robert Beckett <bob.beckett@collabora.com> wrote:
> > Feature series targeting netdev must be prefixed "PATCH net-next". As
>
> Thanks for the info. Out of curiosity, where should I have gleaned this
> info from? This is my first contribution to netdev, so I wasn't familiar
> with the etiquette.
>
> > this approach was a PoC, sending it as "RFC net-next" would be even
> > more appropriate.

Netdev, being a huge subsystem, has specific rules for subject prefix or
merge window, which are described in
Documentation/networking/netdev-FAQ.rst

Thank you,

	Vivien

> We have a setup as follows:
>
> Marvell 88E6240 switch chip, accepting traffic from 4 ports. Port 1
> (P1) is critical priority, no dropped packets allowed, all others can
> be best effort.
>
> CPU port of switch chip is connected via phy to phy of intel i210 (igb
> driver).
>
> i210 is connected via pcie switch to imx6.
>
> When too many small packets attempt to be delivered to CPU port (e.g.
> during broadcast flood) we saw dropped packets.
>
> The packets were being received by i210 into rx descriptor buffer
> fine, but the CPU could not keep up with the load. We saw
> rx_fifo_errors increasing rapidly and ksoftirqd at ~100% CPU.
>
>
> With this in mind, I am wondering whether any amount of tc traffic
> shaping would help?

Hi Robert

The model in Linux is that you start with a software TC filter, and then
offload it to the hardware. So the user configures TC just as normal,
and then that is used to program the hardware to do the same thing as
what would happen in software. This is exactly the same as we do with
bridging. You create a software bridge and add interfaces to the
bridge. This then gets offloaded to the hardware and it does the
bridging for you.

So think about how you can model the Marvell switch capabilities
using TC, and implement offload support for it.

    Andrew

> > Feature series targeting netdev must be prefixed "PATCH net-next". As
>
> Thanks for the info. Out of curiosity, where should I have gleaned this
> info from? This is my first contribution to netdev, so I wasn't familiar
> with the etiquette.

It is also a good idea to 'lurk' in a mailing list for a while,
reading emails flying around, getting to know how things work. This
subject of "PATCH net-next" comes up maybe once a week. The idea of
offloads gets discussed once every couple of weeks, etc.

    Andrew

On Wed, Sep 11, 2019 at 12:49:03PM +0100, Robert Beckett wrote: > On Wed, 2019-09-11 at 11:21 +0000, Ido Schimmel wrote: > > On Tue, Sep 10, 2019 at 09:49:46AM -0700, Florian Fainelli wrote: > > > +Ido, Jiri, > > > > > > On 9/10/19 8:41 AM, Robert Beckett wrote: > > > > This patch-set adds support for some features of the Marvell > > > > switch > > > > chips that can be used to handle packet storms. > > > > > > > > The rationale for this was a setup that requires the ability to > > > > receive > > > > traffic from one port, while a packet storm is occuring on > > > > another port > > > > (via an external switch with a deliberate loop). This is needed > > > > to > > > > ensure vital data delivery from a specific port, while mitigating > > > > any > > > > loops or DoS that a user may introduce on another port (can't > > > > guarantee > > > > sensible users). > > > > > > The use case is reasonable, but the implementation is not really. > > > You > > > are using Device Tree which is meant to describe hardware as a > > > policy > > > holder for setting up queue priorities and likewise for queue > > > scheduling. > > > > > > The tool that should be used for that purpose is tc and possibly an > > > appropriately offloaded queue scheduler in order to map the desired > > > scheduling class to what the hardware supports. > > > > > > Jiri, Ido, how do you guys support this with mlxsw? > > > > Hi Florian, > > > > Are you referring to policing traffic towards the CPU using a policer > > on > > the egress of the CPU port? At least that's what I understand from > > the > > description of patch 6 below. > > > > If so, mlxsw sets policers for different traffic types during its > > initialization sequence. These policers are not exposed to the user > > nor > > configurable. While the default settings are good for most users, we > > do > > want to allow users to change these and expose current settings. 
> >
> > I agree that tc seems like the right choice, but the question is
> > where are we going to install the filters?
> >
>
> Before I go too far down the rabbit hole of tc traffic shaping, maybe
> it would be good to explain in more detail the problem I am trying to
> solve.
>
> We have a setup as follows:
>
> Marvell 88E6240 switch chip, accepting traffic from 4 ports. Port 1
> (P1) is critical priority, no dropped packets allowed, all others can
> be best effort.
>
> CPU port of switch chip is connected via phy to phy of intel i210 (igb
> driver).
>
> i210 is connected via pcie switch to imx6.
>
> When too many small packets attempt to be delivered to CPU port (e.g.
> during broadcast flood) we saw dropped packets.
>
> The packets were being received by i210 into rx descriptor buffer
> fine, but the CPU could not keep up with the load. We saw
> rx_fifo_errors increasing rapidly and ksoftirqd at ~100% CPU.
>
>
> With this in mind, I am wondering whether any amount of tc traffic
> shaping would help? Would tc shaping require that the packet reception
> manages to keep up before it can enact its policies? Does the
> infrastructure have accelerator offload hooks to be able to apply it
> via HW? I don't see how it would be able to inspect the packets to apply
> filtering if they were dropped due to rx descriptor exhaustion. (Please
> bear with me with the basic questions, I am not familiar with this part
> of the stack.)
>
> Assuming that tc is still the way to go, after a brief look into the
> man pages and the documentation at lartc.org, it seems like it would
> need to use the ingress qdisc, with some sort of system to segregate
> and prioritise based on ingress port. Is this possible?

Hi Robert,

As I see it, you have two problems here:

1. Classification: Based on ingress port in your case

2. Scheduling: How to schedule between the different transmission queues

Where the port from which the packets should egress is the CPU port,
before they cross the PCI towards the imx6.

Both of these issues can be solved by tc. The main problem is that today
we do not have a netdev to represent the CPU port and therefore can't
use existing infra like tc. I believe we need to create one. Besides
scheduling, we can also use it to permit/deny certain traffic from
reaching the CPU and perform policing.

Drivers can run the received packets via taps using
dev_queue_xmit_nit(), so that users will see all the traffic directed at
the host when running tcpdump on this netdev.

> > > >
> > > > [patch 1/7] configures auto negotiation for CPU ports connected
> > > > with phys to enable pause frame propagation.
> > > >
> > > > [patch 2/7] allows setting of port's default output queue
> > > > priority for any ingressing packets on that port.
> > > >
> > > > [patch 3/7] dt-bindings for patch 2.
> > > >
> > > > [patch 4/7] allows setting of a port's queue scheduling so that
> > > > it can prioritise egress of traffic routed from high priority
> > > > ports.
> > > >
> > > > [patch 5/7] dt-bindings for patch 4.
> > > >
> > > > [patch 6/7] allows ports to rate limit their egress. This can be
> > > > used to stop the host CPU from becoming swamped by packet
> > > > delivery and exhausting descriptors.
> > > >
> > > > [patch 7/7] dt-bindings for patch 6.
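The two steps Ido lists (classify by ingress port, then schedule between the CPU port's transmission queues) can be sketched as a toy software model. This is an illustrative assumption, not the mv88e6xxx hardware behaviour or any kernel API: the class name, the port-to-queue mapping and the choice of strict priority with higher queue index served first are all hypothetical.

```python
from collections import deque


class StrictPriorityPort:
    """Toy model of a CPU port with per-ingress-port classification.

    Hypothetical illustration: higher queue index means higher priority.
    """

    def __init__(self, port_to_queue, num_queues):
        self.port_to_queue = port_to_queue  # ingress port -> queue index
        self.queues = [deque() for _ in range(num_queues)]

    def enqueue(self, ingress_port, frame):
        # Classification: unlisted ports fall back to the lowest queue.
        queue_index = self.port_to_queue.get(ingress_port, 0)
        self.queues[queue_index].append(frame)

    def dequeue(self):
        # Scheduling: always drain the highest non-empty queue first.
        for queue in reversed(self.queues):
            if queue:
                return queue.popleft()
        return None  # nothing pending


cpu_port = StrictPriorityPort(port_to_queue={1: 3}, num_queues=4)
cpu_port.enqueue(2, "flood-frame-a")
cpu_port.enqueue(1, "critical-frame")
cpu_port.enqueue(3, "flood-frame-b")
# The frame from port 1 leaves first even though it arrived second.
print(cpu_port.dequeue())
```

With strict priority, best-effort traffic is only served when the critical queue is empty, which matches the stated requirement that port 1 must not lose packets during a storm on another port.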
On Thu, Sep 12, 2019 at 12:58:41AM +0200, Andrew Lunn wrote:
> So think about how you can model the Marvell switch capabilities
> using TC, and implement offload support for it.

+1 :)

> 2. Scheduling: How to schedule between the different transmission queues
>
> Where the port from which the packets should egress is the CPU port,
> before they cross the PCI towards the imx6.

Hi Ido

This is DSA, so the switch is connected via Ethernet to the IMX6, not
PCI. Minor detail, but that really is the core of what makes DSA DSA.

    Andrew

On 9/12/19 2:03 AM, Ido Schimmel wrote:
> On Wed, Sep 11, 2019 at 12:49:03PM +0100, Robert Beckett wrote:
>> On Wed, 2019-09-11 at 11:21 +0000, Ido Schimmel wrote:
>>> On Tue, Sep 10, 2019 at 09:49:46AM -0700, Florian Fainelli wrote:
>>>> +Ido, Jiri,
>>>>
>>>> On 9/10/19 8:41 AM, Robert Beckett wrote:
>>>>> This patch-set adds support for some features of the Marvell
>>>>> switch chips that can be used to handle packet storms.
>>>>>
>>>>> The rationale for this was a setup that requires the ability to
>>>>> receive traffic from one port, while a packet storm is occurring
>>>>> on another port (via an external switch with a deliberate loop).
>>>>> This is needed to ensure vital data delivery from a specific
>>>>> port, while mitigating any loops or DoS that a user may
>>>>> introduce on another port (can't guarantee sensible users).
>>>>
>>>> The use case is reasonable, but the implementation is not really.
>>>> You are using Device Tree, which is meant to describe hardware,
>>>> as a policy holder for setting up queue priorities and likewise
>>>> for queue scheduling.
>>>>
>>>> The tool that should be used for that purpose is tc and possibly
>>>> an appropriately offloaded queue scheduler in order to map the
>>>> desired scheduling class to what the hardware supports.
>>>>
>>>> Jiri, Ido, how do you guys support this with mlxsw?
>>>
>>> Hi Florian,
>>>
>>> Are you referring to policing traffic towards the CPU using a
>>> policer on the egress of the CPU port? At least that's what I
>>> understand from the description of patch 6 below.
>>>
>>> If so, mlxsw sets policers for different traffic types during its
>>> initialization sequence. These policers are not exposed to the user
>>> nor configurable. While the default settings are good for most
>>> users, we do want to allow users to change these and expose current
>>> settings.
>>>
>>> I agree that tc seems like the right choice, but the question is
>>> where are we going to install the filters?
>>
>> Before I go too far down the rabbit hole of tc traffic shaping, maybe
>> it would be good to explain in more detail the problem I am trying to
>> solve.
>>
>> We have a setup as follows:
>>
>> Marvell 88E6240 switch chip, accepting traffic from 4 ports. Port 1
>> (P1) is critical priority, no dropped packets allowed, all others can
>> be best effort.
>>
>> CPU port of switch chip is connected via phy to phy of intel i210
>> (igb driver).
>>
>> i210 is connected via pcie switch to imx6.
>>
>> When too many small packets attempt to be delivered to CPU port (e.g.
>> during broadcast flood) we saw dropped packets.
>>
>> The packets were being received by i210 into the rx descriptor buffer
>> fine, but the CPU could not keep up with the load. We saw
>> rx_fifo_errors increasing rapidly and ksoftirqd at ~100% CPU.
>>
>> With this in mind, I am wondering whether any amount of tc traffic
>> shaping would help? Would tc shaping require that the packet
>> reception manages to keep up before it can enact its policies? Does
>> the infrastructure have accelerator offload hooks to be able to apply
>> it via HW? I don't see how it would be able to inspect the packets to
>> apply filtering if they were dropped due to rx descriptor exhaustion.
>> (Please bear with me with the basic questions, I am not familiar with
>> this part of the stack.)
>>
>> Assuming that tc is still the way to go, after a brief look into the
>> man pages and the documentation at largc.org, it seems like it would
>> need to use the ingress qdisc, with some sort of system to segregate
>> and prioritise based on ingress port. Is this possible?
>
> Hi Robert,
>
> As I see it, you have two problems here:
>
> 1. Classification: Based on ingress port in your case
>
> 2. Scheduling: How to schedule between the different transmission
>    queues
>
> Where the port from which the packets should egress is the CPU port,
> before they cross the PCI towards the imx6.
>
> Both of these issues can be solved by tc. The main problem is that
> today we do not have a netdev to represent the CPU port and therefore
> can't use existing infra like tc. I believe we need to create one.
> Besides scheduling, we can also use it to permit/deny certain traffic
> from reaching the CPU and perform policing.

We do not necessarily have to create a CPU netdev. We can overlay netdev
operations onto the DSA master interface (fec in that case), and
whenever you configure the DSA master interface, we also call back into
the switch side for the CPU port. This is not necessarily the cleanest
way to do things, but that is how we support ethtool operations (and
some netdev operations incidentally), and it works.
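The two problems Ido lists (classification by ingress port, then scheduling between the CPU port's transmission queues) can be illustrated with a small software model. This is only a sketch under assumptions: the names `classify` and `StrictPriorityPort`, the choice of port 1 as the critical port, four queues, and strict-priority scheduling are hypothetical stand-ins for whatever the hardware and a tc offload would actually provide.

```python
from collections import deque

# Hypothetical policy: frames that entered on port 1 are critical and go to
# the highest queue; everything else is best effort and goes to queue 0.
CRITICAL_PORT = 1
NUM_QUEUES = 4


def classify(ingress_port: int) -> int:
    """Classification: choose a CPU-port egress queue from the ingress port."""
    return NUM_QUEUES - 1 if ingress_port == CRITICAL_PORT else 0


class StrictPriorityPort:
    """Scheduling: always drain the highest-numbered non-empty queue first."""

    def __init__(self, depth: int) -> None:
        self.queues = [deque() for _ in range(NUM_QUEUES)]
        self.depth = depth  # per-queue limit, so a flood only fills its own queue

    def enqueue(self, ingress_port: int, frame: str) -> bool:
        queue = self.queues[classify(ingress_port)]
        if len(queue) >= self.depth:
            return False  # tail drop: this queue is full
        queue.append(frame)
        return True

    def dequeue(self):
        for queue in reversed(self.queues):
            if queue:
                return queue.popleft()
        return None  # nothing waiting on any queue
```

In this model a flood on a best-effort port can only exhaust its own queue, and a frame from the critical port is still dequeued ahead of everything already waiting.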
On Thu, 2019-09-12 at 09:25 -0700, Florian Fainelli wrote:
> We do not necessarily have to create a CPU netdev, we can overlay
> netdev operations onto the DSA master interface (fec in that case),
> and whenever you configure the DSA master interface, we also call back
> into the switch side for the CPU port. This is not necessarily the
> cleanest way to do things, but that is how we support ethtool
> operations (and some netdev operations incidentally), and it works

After reading up on tc, I am not sure how this would work given the
semantics of the tool currently.

My initial thought was to model the switch's 4 output queues using an
mqprio qdisc for the CPU port, and then use either iptables' CLASSIFY
target on the input ports to set which queue traffic egresses from on
the CPU port, or use vlan tagging with id 0 and priority set (with the
many details of how to implement them still left to discover).

However, it looks like the mqprio qdisc could only be used for egress,
so without a netdev representing the CPU port, I don't know how it could
be used.

Another thing I thought of was just to use iptables' TOS target to set
the minimal delay bit and rely on default behaviours, but I've yet to
find anything in the Marvell manual that indicates it could set that bit
on all frames entering a port.

Another option might be to use vlans with their priority bits being used
to steer to output queues, but I really don't want to introduce more
virtual interfaces into the setup, and I can't see how to configure and
enforce a default vlan tag with id 0 and priority bits set via linux
userland tools.

It does look like tc would be quite nice for configuring the egress rate
limiting, assuming we have a netdev to target with the rate controls of
the qdisc.

So far, this seems like I am trying to shoehorn this stuff into tc. It
seems like tc is meant to configure how the IP stack handles flow within
the stack, whereas in a switch chip, the packets go nowhere near the
CPU's kernel IP stack. I can't help thinking that it would be good to
have a specific utility for configuring switches that operates on the
port level for managing flow within the chip, or maybe simple sysfs
attributes to set the port's priority.
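For reference, "vlan tagging with id 0 and priority set" corresponds to an 802.1Q priority-tagged frame: the 16-bit Tag Control Information field carries a 3-bit priority (PCP), a 1-bit drop-eligible indicator (DEI) and a 12-bit VLAN id. A minimal sketch of that bit layout follows; the helper names `make_tci` and `parse_tci` are made up for illustration.

```python
def make_tci(pcp: int, vid: int = 0, dei: int = 0) -> int:
    """Pack an 802.1Q TCI: PCP in bits 15-13, DEI in bit 12, VID in bits 11-0."""
    if not 0 <= pcp <= 7:
        raise ValueError("PCP must be in 0..7")
    if dei not in (0, 1):
        raise ValueError("DEI must be 0 or 1")
    if not 0 <= vid <= 0xFFF:
        raise ValueError("VID must be in 0..4095")
    return (pcp << 13) | (dei << 12) | vid


def parse_tci(tci: int) -> tuple:
    """Unpack a TCI into (pcp, dei, vid)."""
    return (tci >> 13) & 0x7, (tci >> 12) & 0x1, tci & 0xFFF
```

With VID 0 the tag carries only priority information, which is why it could in principle steer frames to an output queue without creating a separate VLAN interface for each class.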
On 9/12/19 9:46 AM, Robert Beckett wrote:
> After reading up on tc, I am not sure how this would work given the
> semantics of the tool currently.
>
> My initial thought was to model the switch's 4 output queues using an
> mqprio qdisc for the CPU port, and then use either iptables' CLASSIFY
> target on the input ports to set which queue traffic egresses from on
> the CPU port, or use vlan tagging with id 0 and priority set (with the
> many details of how to implement them still left to discover).
>
> However, it looks like the mqprio qdisc could only be used for egress,
> so without a netdev representing the CPU port, I don't know how it
> could be used.

If you are looking at mapping your DSA master/CPU port egress queues to
actual switch egress queues, you can look at what bcm_sf2.c and
bcmsysport.c do and read the commit messages that introduced the mapping
functionality for background on why this was done. In a nutshell, the
hardware has the ability to back pressure the Ethernet MAC behind the
CPU port in order to automatically rate limit the egress out of the
switch. So for instance, if your CPU tries to send 1 Gbit/s of traffic
to a port that is linked to a link partner at 100 Mbit/s, there is out
of band information between the switch and the Ethernet DMA of the CPU
port to pace the TX completion interrupt rate to match 100 Mbit/s.

This is going to be different for you here, obviously, because the
hardware has not been specifically designed for that, so you do need to
rely on more standard constructs, like actual egress QoS on both ends.

> Another thing I thought of was just to use iptables' TOS target to set
> the minimal delay bit and rely on default behaviours, but I've yet to
> find anything in the Marvell manual that indicates it could set that
> bit on all frames entering a port.
>
> Another option might be to use vlans with their priority bits being
> used to steer to output queues, but I really don't want to introduce
> more virtual interfaces into the setup, and I can't see how to
> configure and enforce a default vlan tag with id 0 and priority bits
> set via linux userland tools.
>
> It does look like tc would be quite nice for configuring the egress
> rate limiting, assuming we have a netdev to target with the rate
> controls of the qdisc.
>
> So far, this seems like I am trying to shoehorn this stuff into tc.
> It seems like tc is meant to configure how the IP stack handles flow
> within the stack, whereas in a switch chip, the packets go nowhere
> near the CPU's kernel IP stack. I can't help thinking that it would be
> good to have a specific utility for configuring switches that operates
> on the port level for managing flow within the chip, or maybe simple
> sysfs attributes to set the port's priority.

I am not looking at tc the same way you are: tc is just the tool to
configure all QoS/ingress/egress related operations on a network device.
Whether that network device can offload some of the TC operations or not
is where things get interesting.

TC has ingress filtering support, which is what you could use for
offloading the handling of broadcast storms. I would imagine that the
following should be possible to offload (this is not a working command,
but you get the idea):

  tc qdisc add dev sw0p0 handle ffff: ingress
  tc filter add dev sw0p0 parent ffff: protocol ip prio 1 u32 match \
     ether src 0xfffffffffffff police rate 100k burst 10k skip_sw

Something along those lines is how I would implement ingress rate
limiting leveraging what the switch could do. This might mean adding
support for offloading specific TC filters. Jiri and Ido can certainly
suggest a cleverer way of achieving that same functionality.
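For readers new to policing, the `police rate ... burst ...` pair in the command above describes a token bucket: tokens accrue at `rate`, the bucket holds at most `burst`, and a packet that cannot pay its cost is dropped rather than queued. The following is a minimal software sketch of that idea and not the tc or switch implementation; the class name and the explicit `now` argument are assumptions made so the example is deterministic.

```python
class TokenBucketPolicer:
    """Token-bucket policer: conforming traffic passes, excess is dropped."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate      # tokens added per second (bytes or frames)
        self.burst = burst    # bucket capacity, i.e. the largest allowed burst
        self.tokens = burst   # start with a full bucket
        self.last = 0.0       # timestamp of the previous update, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True if a packet of the given cost conforms at time `now`."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False  # out of profile: a policer drops instead of delaying
```

Charging a cost of 1 per packet gives a frame-based limit, while charging the packet length gives a byte-based one.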
On Thu, 2019-09-12 at 10:41 -0700, Florian Fainelli wrote:
> I am not looking at tc the same way you are doing, tc is just the tool
> to configure all QoS/ingress/egress related operations on a network
> device. Whether that network device can offload some of the TC
> operations or not is where things get interesting.
>
> TC has ingress filtering support, which is what you could use for
> offloading broadcast storms, I would imagine that the following should
> be possible to be offloaded (this is not a working command but you get
> the idea):
>
> tc qdisc add dev sw0p0 handle ffff: ingress
> tc filter add dev sw0p0 parent ffff: protocol ip prio 1 u32 match
> ether src 0xfffffffffffff police rate 100k burst 10k skip_sw
>
> something along those lines is how I would implement ingress rate
> limiting leveraging what the switch could do. This might mean adding
> support for offloading specific TC filters, Jiri and Ido can certainly
> suggest a cleverer way of achieving that same functionality.

Thanks for your thoughts on this, it's been very helpful in learning the
stack and coming up with ideas for a better design.

I wrote up a set of high level options for discussion internally, and
would appreciate any feedback you have on them:

To get this upstreamed, I think we will need something like the
following high level design:

1. Handle egress rate limiting

1.1. Add frames per second as a rate metric usable throughout tc and
     associated kernel interfaces.

1.2. Handle any changes required to make a command like this work:

       tc qdisc add dev enp4s0 handle ffff: ingress
       tc filter add dev enp4s0 parent ffff: protocol ip prio 1 u32 \
          match ether src 0xfffffffffffffffff police rate 50kfrm \
          burst 10kfrm

     This should mostly already work as a valid command, maybe with
     some changes for handling the new frame based rates.

1.3. Add tc bindings in the dsa driver that hook into the parent
     netdev's tc bindings (similar to how the ethtool bindings hook
     into the parent's ethtool bindings) to set up the HW egress rate
     limiting as done in the existing patch.

2. Add ability to set output queue scheduling algorithm

   Currently no netdev is created for the CPU port, so it can't be
   targeted by tc or any of the other userland utilities. We need to be
   able to configure that port. Currently I can think of 2 options:

2.1. Add a netdev to represent the CPU port. This will likely face
     objections from some people upstream, though it has already been
     suggested as a way to handle this by others upstream. This would
     likely require a lot of effort and learning to figure out how to
     do this in a way that doesn't start to break key assumptions with
     the rest of dsa (like the CPU port not having its own IP address).
     If this were achieved, we could then do one of the following:

2.1.1. Use an mqprio qdisc to model the 4 output queues, with a new
       parameter to select the scheduling mode.

2.1.2. Add an ethtool priv settings capability (similar to priv flags)
       that configures the scheduling mode. This would be my preferred
       method as it allows port prioritization irrespective of linux's
       qdisc priorities, which seems to model the HW better.

2.2. Add tc bindings similar to 1.3 above, which allow us to define a
     new ingress qdisc parameter for scheduling mode, which the CPU
     port code can see due to hooking into the parent device's tc
     bindings. This might be the simplest approach to implement, but
     feels a bit hinky w.r.t. the semantics of the ingress qdisc, as we
     are actually specifying the scheduling of output queues for its
     link partner.

3. Add ability to set default queue priority of incoming packets on a
   port.

   This could be done as a new parameter for tc's ingress qdisc. I
   would suggest that it should specify an 802.1p priority number (e.g.
   "hwprio 3" would specify all traffic ingressing should be considered
   critical) as this neatly lines up with the 8 priority levels used in
   mqprio and could be extended further to allow/disallow priority
   setting per frame from 802.1q tags (e.g. "hwprio 3 notag" or
   something similar). This might balloon in required effort,
   particularly if we have to handle the non-HW-offload path, requiring
   us to figure out and implement priority tagging within the kernel's
   buffers. This would likely be able to reuse a lot of the
   infrastructure in place for 802.1q tagging that currently exists
   within the kernel, though I've not looked into those code paths to
   estimate difficulty.
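To make the intended semantics of the proposed "hwprio 3" and "hwprio 3 notag" parameters concrete, here is a small sketch of how a frame's effective priority and output queue might be resolved. Everything in it is an assumption for illustration: the proposed parameter does not exist in tc, the function names are invented, and the even split of 8 priorities across 4 queues is only one possible mapping.

```python
from typing import Optional


def effective_priority(port_default: int, frame_pcp: Optional[int],
                       trust_tag: bool) -> int:
    """Resolve the priority of a frame entering a port.

    port_default: the per-port default (the "hwprio N" value in the proposal).
    frame_pcp:    the 802.1p priority from the frame's tag, or None if untagged.
    trust_tag:    False models the hypothetical "notag" variant, in which the
                  per-frame tag priority is ignored.
    """
    if trust_tag and frame_pcp is not None:
        return frame_pcp
    return port_default


def queue_for_priority(prio: int, num_queues: int = 4) -> int:
    """Assumed mapping: spread the 8 priority levels evenly over the queues."""
    if not 0 <= prio <= 7:
        raise ValueError("802.1p priority must be in 0..7")
    return prio * num_queues // 8
```

Under this reading, "notag" would stop a user on a best-effort port from promoting their own traffic by sending frames with a high tag priority.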