diff mbox series

[net] net: sched: cls: Fix offloading when ingress dev is vxlan

Message ID 1528185843-18645-1-git-send-email-paulb@mellanox.com
State Accepted, archived
Delegated to: David Miller
Headers show
Series [net] net: sched: cls: Fix offloading when ingress dev is vxlan | expand

Commit Message

Paul Blakey June 5, 2018, 8:04 a.m. UTC
When using a vxlan device as the ingress dev, we count it as a
"no offload dev", so when such a rule comes and err stop is true,
we fail early and don't try the egdev route which can offload it
through the egress device.

Fix that by not calling the block offload if one of the devices
attached to it is not offload capable, but make sure egress on such case
is capable instead.

Fixes: caa7260156eb ("net: sched: keep track of offloaded filters [..]")
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
---
 net/sched/cls_api.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

Comments

David Miller June 5, 2018, 2:30 p.m. UTC | #1
From: Paul Blakey <paulb@mellanox.com>
Date: Tue,  5 Jun 2018 11:04:03 +0300

> When using a vxlan device as the ingress dev, we count it as a
> "no offload dev", so when such a rule comes and err stop is true,
> we fail early and don't try the egdev route which can offload it
> through the egress device.
> 
> Fix that by not calling the block offload if one of the devices
> attached to it is not offload capable, but make sure egress on such case
> is capable instead.
> 
> Fixes: caa7260156eb ("net: sched: keep track of offloaded filters [..]")
> Reviewed-by: Roi Dayan <roid@mellanox.com>
> Acked-by: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Paul Blakey <paulb@mellanox.com>

Applied and queued up for -stable, thanks.
Jakub Kicinski June 5, 2018, 6:57 p.m. UTC | #2
On Tue,  5 Jun 2018 11:04:03 +0300, Paul Blakey wrote:
> When using a vxlan device as the ingress dev, we count it as a
> "no offload dev", so when such a rule comes and err stop is true,
> we fail early and don't try the egdev route which can offload it
> through the egress device.
> 
> Fix that by not calling the block offload if one of the devices
> attached to it is not offload capable, but make sure egress on such case
> is capable instead.
> 
> Fixes: caa7260156eb ("net: sched: keep track of offloaded filters [..]")
> Reviewed-by: Roi Dayan <roid@mellanox.com>
> Acked-by: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Paul Blakey <paulb@mellanox.com>

Very poor commit message.  What you're doing is re-enabling skip_sw
filters on tunnel devices which semantically make no sense and
shouldn't have been allowed in the first place.

This will breaks block sharing between tunnels and HW netdevs (because
you skip the tcf_block_cb_call() completely).  The entire egdev idea
remains badly broken accepting filters like this:

# tc filter add dev vxlan0 ingress \
	matchall action skip_sw \
		mirred egress redirect dev lo \
		mirred egress redirect dev sw1np0

Do we still care about correctness and not breaking backward
compatibility?
Jakub Kicinski June 5, 2018, 6:59 p.m. UTC | #3
On Tue, 5 Jun 2018 11:57:47 -0700, Jakub Kicinski wrote:
> On Tue,  5 Jun 2018 11:04:03 +0300, Paul Blakey wrote:
> > When using a vxlan device as the ingress dev, we count it as a
> > "no offload dev", so when such a rule comes and err stop is true,
> > we fail early and don't try the egdev route which can offload it
> > through the egress device.
> > 
> > Fix that by not calling the block offload if one of the devices
> > attached to it is not offload capable, but make sure egress on such case
> > is capable instead.
> > 
> > Fixes: caa7260156eb ("net: sched: keep track of offloaded filters [..]")
> > Reviewed-by: Roi Dayan <roid@mellanox.com>
> > Acked-by: Jiri Pirko <jiri@mellanox.com>
> > Signed-off-by: Paul Blakey <paulb@mellanox.com>  
> 
> Very poor commit message.  What you're doing is re-enabling skip_sw
> filters on tunnel devices which semantically make no sense and
> shouldn't have been allowed in the first place.
> 
> This will breaks block sharing between tunnels and HW netdevs (because
> you skip the tcf_block_cb_call() completely).  The entire egdev idea
> remains badly broken accepting filters like this:
> 
> # tc filter add dev vxlan0 ingress \
> 	matchall action skip_sw \
> 		mirred egress redirect dev lo \
> 		mirred egress redirect dev sw1np0

For above we probably need something like this (untested):

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 3f4cf930f809..71e5eebec31a 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1511,7 +1511,7 @@ int tc_setup_cb_egdev_call(const struct net_device *dev,
        struct tcf_action_egdev *egdev = tcf_action_egdev_lookup(dev);
 
        if (!egdev)
-               return 0;
+               return err_stop ? -EOPNOTSUPP : 0;
        return tcf_action_egdev_cb_call(egdev, type, type_data, err_stop);
 }
 EXPORT_SYMBOL_GPL(tc_setup_cb_egdev_call);

But the correct fix is to remove egdev crutch completely IMO.
David Miller June 5, 2018, 7:06 p.m. UTC | #4
From: Jakub Kicinski <kubakici@wp.pl>
Date: Tue, 5 Jun 2018 11:57:47 -0700

> Do we still care about correctness and not breaking backward
> compatibility?

Jakub let me know if you want me to revert this change.
Jakub Kicinski June 5, 2018, 9:27 p.m. UTC | #5
On Tue, 05 Jun 2018 15:06:40 -0400 (EDT), David Miller wrote:
> From: Jakub Kicinski <kubakici@wp.pl>
> Date: Tue, 5 Jun 2018 11:57:47 -0700
> 
> > Do we still care about correctness and not breaking backward
> > compatibility?  
> 
> Jakub let me know if you want me to revert this change.

Yes, I think this patch introduces a regression when block is shared
between offload capable and in-capable device, therefore it should be
reverted.  Shared blocks went through a number of review cycles to
ensure such cases are handled correctly.


Longer story about the actual issue which is never explained in the
commit message is this: in kernels 4.10 - 4.14 skip_sw flag was
supported on tunnels in cls_flower only:

static int fl_hw_replace_filter(struct tcf_proto *tp,
[...]
	if (!tc_can_offload(dev)) {
		if (tcf_exts_get_dev(dev, &f->exts, &f->hw_dev) ||
		    (f->hw_dev && !tc_can_offload(f->hw_dev))) {
			f->hw_dev = dev;
			return tc_skip_sw(f->flags) ? -EINVAL : 0;
		}
		dev = f->hw_dev;
		cls_flower.egress_dev = true;
	} else {
		f->hw_dev = dev;
	}


In 4.15 - 4.17 with addition of shared blocks egdev mechanism was
promoted to a generic TC thing supported on all filters but it no
longer works with skip_sw.

I'd argue skip_sw is incorrect for tunnels, because the rule will only
apply to traffic ingressing on ASIC ports, not all traffic which hits
the tunnel netdev.  Therefore we should keep the 4.15 - 4.17 behaviour.

But that's a side note, I don't think we should be breaking offload on
shared blocks whether we decide to support skip_sw on tunnels or not.
Or Gerlitz June 6, 2018, 5:12 a.m. UTC | #6
On Tue, Jun 5, 2018 at 9:59 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
> On Tue, 5 Jun 2018 11:57:47 -0700, Jakub Kicinski wrote:
>> On Tue,  5 Jun 2018 11:04:03 +0300, Paul Blakey wrote:
>> > When using a vxlan device as the ingress dev, we count it as a
>> > "no offload dev", so when such a rule comes and err stop is true,
>> > we fail early and don't try the egdev route which can offload it
>> > through the egress device.
>> >
>> > Fix that by not calling the block offload if one of the devices
>> > attached to it is not offload capable, but make sure egress on such case
>> > is capable instead.
>> >
>> > Fixes: caa7260156eb ("net: sched: keep track of offloaded filters [..]")
>> > Reviewed-by: Roi Dayan <roid@mellanox.com>
>> > Acked-by: Jiri Pirko <jiri@mellanox.com>
>> > Signed-off-by: Paul Blakey <paulb@mellanox.com>
>>
>> Very poor commit message.  What you're doing is re-enabling skip_sw
>> filters on tunnel devices which semantically make no sense and
>> shouldn't have been allowed in the first place.
>>
>> This will breaks block sharing between tunnels and HW netdevs (because
>> you skip the tcf_block_cb_call() completely).  The entire egdev idea
>> remains badly broken accepting filters like this:
>>
>> # tc filter add dev vxlan0 ingress \
>>       matchall action skip_sw \
>>               mirred egress redirect dev lo \
>>               mirred egress redirect dev sw1np0
>
> For above we probably need something like this (untested):
>
> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> index 3f4cf930f809..71e5eebec31a 100644
> --- a/net/sched/act_api.c
> +++ b/net/sched/act_api.c
> @@ -1511,7 +1511,7 @@ int tc_setup_cb_egdev_call(const struct net_device *dev,
>         struct tcf_action_egdev *egdev = tcf_action_egdev_lookup(dev);
>
>         if (!egdev)
> -               return 0;
> +               return err_stop ? -EOPNOTSUPP : 0;
>         return tcf_action_egdev_cb_call(egdev, type, type_data, err_stop);
>  }
>  EXPORT_SYMBOL_GPL(tc_setup_cb_egdev_call);

Jakub,

We will test it out and let you know

> But the correct fix is to remove egdev crutch completely IMO.

Not against it, sometimes designs should change and be replaced
with better ones, happens
Or Gerlitz June 6, 2018, 5:15 a.m. UTC | #7
On Wed, Jun 6, 2018 at 12:27 AM, Jakub Kicinski <kubakici@wp.pl> wrote:
> On Tue, 05 Jun 2018 15:06:40 -0400 (EDT), David Miller wrote:
>> From: Jakub Kicinski <kubakici@wp.pl>
>> Date: Tue, 5 Jun 2018 11:57:47 -0700
>>
>> > Do we still care about correctness and not breaking backward
>> > compatibility?
>>
>> Jakub let me know if you want me to revert this change.
>
> Yes, I think this patch introduces a regression when block is shared
> between offload capable and in-capable device, therefore it should be
> reverted.  Shared blocks went through a number of review cycles to
> ensure such cases are handled correctly.
>
>
> Longer story about the actual issue which is never explained in the
> commit message is this: in kernels 4.10 - 4.14 skip_sw flag was
> supported on tunnels in cls_flower only:
>
> static int fl_hw_replace_filter(struct tcf_proto *tp,
> [...]
>         if (!tc_can_offload(dev)) {
>                 if (tcf_exts_get_dev(dev, &f->exts, &f->hw_dev) ||
>                     (f->hw_dev && !tc_can_offload(f->hw_dev))) {
>                         f->hw_dev = dev;
>                         return tc_skip_sw(f->flags) ? -EINVAL : 0;
>                 }
>                 dev = f->hw_dev;
>                 cls_flower.egress_dev = true;
>         } else {
>                 f->hw_dev = dev;
>         }
>
>
> In 4.15 - 4.17 with addition of shared blocks egdev mechanism was
> promoted to a generic TC thing supported on all filters but it no
> longer works with skip_sw.
>
> I'd argue skip_sw is incorrect for tunnels, because the rule will only
> apply to traffic ingressing on ASIC ports, not all traffic which hits
> the tunnel netdev.

This argument makes sense, however, skip_sw for tunnel decap rules
 **is** allowed since 4.10 and we have some sort of regression here (turns
out that before and after the patch..)

> Therefore we should keep the 4.15 - 4.17 behaviour.
> But that's a side note, I don't think we should be breaking offload on
> shared blocks whether we decide to support skip_sw on tunnels or not.

skip_sw on tunnels was there before shared-block, newer features should
take care not to break existing ones.
Paul Blakey June 6, 2018, 7:59 a.m. UTC | #8
On 06/06/2018 00:27, Jakub Kicinski wrote:
> On Tue, 05 Jun 2018 15:06:40 -0400 (EDT), David Miller wrote:
>> From: Jakub Kicinski <kubakici@wp.pl>
>> Date: Tue, 5 Jun 2018 11:57:47 -0700
>>
>>> Do we still care about correctness and not breaking backward
>>> compatibility?
>>
>> Jakub let me know if you want me to revert this change.
> 
> Yes, I think this patch introduces a regression when block is shared
> between offload capable and in-capable device, therefore it should be
> reverted.  Shared blocks went through a number of review cycles to
> ensure such cases are handled correctly.
> 
> 
> Longer story about the actual issue which is never explained in the
> commit message is this: in kernels 4.10 - 4.14 skip_sw flag was
> supported on tunnels in cls_flower only:
> 
> static int fl_hw_replace_filter(struct tcf_proto *tp,
> [...]
> 	if (!tc_can_offload(dev)) {
> 		if (tcf_exts_get_dev(dev, &f->exts, &f->hw_dev) ||
> 		    (f->hw_dev && !tc_can_offload(f->hw_dev))) {
> 			f->hw_dev = dev;
> 			return tc_skip_sw(f->flags) ? -EINVAL : 0;
> 		}
> 		dev = f->hw_dev;
> 		cls_flower.egress_dev = true;
> 	} else {
> 		f->hw_dev = dev;
> 	}
> 
> 
> In 4.15 - 4.17 with addition of shared blocks egdev mechanism was
> promoted to a generic TC thing supported on all filters but it no
> longer works with skip_sw.
> 
> I'd argue skip_sw is incorrect for tunnels, because the rule will only
> apply to traffic ingressing on ASIC ports, not all traffic which hits
> the tunnel netdev.  Therefore we should keep the 4.15 - 4.17 behaviour.
> 
> But that's a side note, I don't think we should be breaking offload on
> shared blocks whether we decide to support skip_sw on tunnels or not.
> 

Maybe we can apply my patch logic of still trying the egress dev if the 
block has a single device, and not shared. Is that ok with you?

You're patch seems good as an add on, but the egress hw device (sw1np0) 
would go over the tc actions and see if it can offload such rule (output 
to software lo device) and would fail anyway.
Jakub Kicinski June 6, 2018, 3:46 p.m. UTC | #9
On Wed, 6 Jun 2018 08:15:27 +0300, Or Gerlitz wrote:
> On Wed, Jun 6, 2018 at 12:27 AM, Jakub Kicinski <kubakici@wp.pl> wrote:
> > On Tue, 05 Jun 2018 15:06:40 -0400 (EDT), David Miller wrote:  
> >> From: Jakub Kicinski <kubakici@wp.pl>
> >> Date: Tue, 5 Jun 2018 11:57:47 -0700
> >>  
> >> > Do we still care about correctness and not breaking backward
> >> > compatibility?  
> >>
> >> Jakub let me know if you want me to revert this change.  
> >
> > Yes, I think this patch introduces a regression when block is shared
> > between offload capable and in-capable device, therefore it should be
> > reverted.  Shared blocks went through a number of review cycles to
> > ensure such cases are handled correctly.
> >
> >
> > Longer story about the actual issue which is never explained in the
> > commit message is this: in kernels 4.10 - 4.14 skip_sw flag was
> > supported on tunnels in cls_flower only:
> >
> > static int fl_hw_replace_filter(struct tcf_proto *tp,
> > [...]
> >         if (!tc_can_offload(dev)) {
> >                 if (tcf_exts_get_dev(dev, &f->exts, &f->hw_dev) ||
> >                     (f->hw_dev && !tc_can_offload(f->hw_dev))) {
> >                         f->hw_dev = dev;
> >                         return tc_skip_sw(f->flags) ? -EINVAL : 0;
> >                 }
> >                 dev = f->hw_dev;
> >                 cls_flower.egress_dev = true;
> >         } else {
> >                 f->hw_dev = dev;
> >         }
> >
> >
> > In 4.15 - 4.17 with addition of shared blocks egdev mechanism was
> > promoted to a generic TC thing supported on all filters but it no
> > longer works with skip_sw.
> >
> > I'd argue skip_sw is incorrect for tunnels, because the rule will only
> > apply to traffic ingressing on ASIC ports, not all traffic which hits
> > the tunnel netdev.  
> 
> This argument makes sense, however, skip_sw for tunnel decap rules
>  **is** allowed since 4.10 and we have some sort of regression here (turns
> out that before and after the patch..)

As I said it was allowed in 4 releases, which was a mistake, in last 3
it wasn't.  I understand your use case, but the semantics of skip_sw
are not preserved here so we should find a different solution.

> > Therefore we should keep the 4.15 - 4.17 behaviour.
> > But that's a side note, I don't think we should be breaking offload on
> > shared blocks whether we decide to support skip_sw on tunnels or not.  
> 
> skip_sw on tunnels was there before shared-block, newer features should
> take care not to break existing ones.

Oh, I agree we shouldn't break existing use cases so please don't break
the use case I mentioned above.  I want to set up shared block between
a LAG and its members.  Now since the bond will not be offload-capable
TC will not even make an attempt to offload to members.

I'm gonna test that my reading of the code is correct and send a revert
later today, sorry.
Jakub Kicinski June 6, 2018, 3:50 p.m. UTC | #10
On Wed, 6 Jun 2018 08:12:20 +0300, Or Gerlitz wrote:
> On Tue, Jun 5, 2018 at 9:59 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
> > On Tue, 5 Jun 2018 11:57:47 -0700, Jakub Kicinski wrote:  
> >> On Tue,  5 Jun 2018 11:04:03 +0300, Paul Blakey wrote:  
> >> > When using a vxlan device as the ingress dev, we count it as a
> >> > "no offload dev", so when such a rule comes and err stop is true,
> >> > we fail early and don't try the egdev route which can offload it
> >> > through the egress device.
> >> >
> >> > Fix that by not calling the block offload if one of the devices
> >> > attached to it is not offload capable, but make sure egress on such case
> >> > is capable instead.
> >> >
> >> > Fixes: caa7260156eb ("net: sched: keep track of offloaded filters [..]")
> >> > Reviewed-by: Roi Dayan <roid@mellanox.com>
> >> > Acked-by: Jiri Pirko <jiri@mellanox.com>
> >> > Signed-off-by: Paul Blakey <paulb@mellanox.com>  
> >>
> >> Very poor commit message.  What you're doing is re-enabling skip_sw
> >> filters on tunnel devices which semantically make no sense and
> >> shouldn't have been allowed in the first place.
> >>
> >> This will breaks block sharing between tunnels and HW netdevs (because
> >> you skip the tcf_block_cb_call() completely).  The entire egdev idea
> >> remains badly broken accepting filters like this:
> >>
> >> # tc filter add dev vxlan0 ingress \
> >>       matchall action skip_sw \
> >>               mirred egress redirect dev lo \
> >>               mirred egress redirect dev sw1np0  
> >
> > For above we probably need something like this (untested):
> >
> > diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> > index 3f4cf930f809..71e5eebec31a 100644
> > --- a/net/sched/act_api.c
> > +++ b/net/sched/act_api.c
> > @@ -1511,7 +1511,7 @@ int tc_setup_cb_egdev_call(const struct net_device *dev,
> >         struct tcf_action_egdev *egdev = tcf_action_egdev_lookup(dev);
> >
> >         if (!egdev)
> > -               return 0;
> > +               return err_stop ? -EOPNOTSUPP : 0;
> >         return tcf_action_egdev_cb_call(egdev, type, type_data, err_stop);
> >  }
> >  EXPORT_SYMBOL_GPL(tc_setup_cb_egdev_call);  
> 
> Jakub,
> 
> We will test it out and let you know

That's probably insufficient, because the second action should be
skipped completely.  But an improvement nonetheless :)

> > But the correct fix is to remove egdev crutch completely IMO.  
> 
> Not against it, sometimes designs should change and be replaced
> with better ones, happens

Cool, I'm not sure when we'll get to it, but perhaps we can go over 
the ideas I presented at netconf on switchdev call next week (if it's
still active)?
Jakub Kicinski June 6, 2018, 3:57 p.m. UTC | #11
On Wed, 6 Jun 2018 10:59:30 +0300, Paul Blakey wrote:
> Maybe we can apply my patch logic of still trying the egress dev if the 
> block has a single device, and not shared. Is that ok with you?

I don't remember that patch but sounds pretty bad.

> You're patch seems good as an add on, but the egress hw device (sw1np0) 
> would go over the tc actions and see if it can offload such rule (output 
> to software lo device) and would fail anyway.

Ack, I'm not trying to solve the skip_sw problem.  It's a hard one.
David Miller June 6, 2018, 5:56 p.m. UTC | #12
From: Jakub Kicinski <kubakici@wp.pl>
Date: Tue, 5 Jun 2018 14:27:00 -0700

> On Tue, 05 Jun 2018 15:06:40 -0400 (EDT), David Miller wrote:
>> From: Jakub Kicinski <kubakici@wp.pl>
>> Date: Tue, 5 Jun 2018 11:57:47 -0700
>> 
>> > Do we still care about correctness and not breaking backward
>> > compatibility?  
>> 
>> Jakub let me know if you want me to revert this change.
> 
> Yes, I think this patch introduces a regression when block is shared
> between offload capable and in-capable device, therefore it should be
> reverted.

Ok, I've reverted the change.  Please sort out how to fix things
properly.
Or Gerlitz June 6, 2018, 7:07 p.m. UTC | #13
On Wed, Jun 6, 2018 at 6:50 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
> On Wed, 6 Jun 2018 08:12:20 +0300, Or Gerlitz wrote:
>> On Tue, Jun 5, 2018 at 9:59 PM, Jakub Kicinski <kubakici@wp.pl> wrote:

>> > But the correct fix is to remove egdev crutch completely IMO.

>> Not against it, sometimes designs should change and be replaced
>> with better ones, happens

> Cool, I'm not sure when we'll get to it, but perhaps we can go over
> the ideas I presented at netconf on switchdev call next week (if it's
> still active)?

sounds good, I'll check. Where the netconf slides can be found?
diff mbox series

Patch

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index a57e112..2cd579f 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -734,10 +734,6 @@  static int tcf_block_cb_call(struct tcf_block *block, enum tc_setup_type type,
 	int ok_count = 0;
 	int err;
 
-	/* Make sure all netdevs sharing this block are offload-capable. */
-	if (block->nooffloaddevcnt && err_stop)
-		return -EOPNOTSUPP;
-
 	list_for_each_entry(block_cb, &block->cb_list, list) {
 		err = block_cb->cb(type, type_data, block_cb->cb_priv);
 		if (err) {
@@ -1580,21 +1576,31 @@  static int tc_exts_setup_cb_egdev_call(struct tcf_exts *exts,
 int tc_setup_cb_call(struct tcf_block *block, struct tcf_exts *exts,
 		     enum tc_setup_type type, void *type_data, bool err_stop)
 {
-	int ok_count;
+	int ok_count = 0;
 	int ret;
 
-	ret = tcf_block_cb_call(block, type, type_data, err_stop);
-	if (ret < 0)
-		return ret;
-	ok_count = ret;
+	if (!block->nooffloaddevcnt) {
+		ret = tcf_block_cb_call(block, type, type_data, err_stop);
+		if (ret < 0)
+			return ret;
+		ok_count = ret;
+	}
 
 	if (!exts || ok_count)
-		return ok_count;
+		goto skip_egress;
+
 	ret = tc_exts_setup_cb_egdev_call(exts, type, type_data, err_stop);
 	if (ret < 0)
 		return ret;
 	ok_count += ret;
 
+skip_egress:
+	/* if one of the netdevs sharing this block are not offload-capable
+	 * make sure we succeeded in egress instead.
+	 */
+	if (block->nooffloaddevcnt && !ok_count && err_stop)
+		return -EOPNOTSUPP;
+
 	return ok_count;
 }
 EXPORT_SYMBOL(tc_setup_cb_call);