[net,1/1] bonding: fix PACKET_ORIGDEV regression on bonding masters
diff mbox series

Message ID 20190107162946.13072-2-soltys@ziu.info
State Changes Requested
Delegated to: David Miller
Headers show
Series
  • bonding: fix PACKET_ORIGDEV regression
Related show

Commit Message

Michal Soltys Jan. 7, 2019, 4:29 p.m. UTC
This patch reverts:

b89f04c61efe bonding: deliver link-local packets with skb->dev set to link that packets arrived on

And its subsequent fixups:

6a9e461f6fe4 bonding: pass link-local packets to bonding master also.
0f3b914c9cfc bonding: fix warning message

The intended functionality of the original patch (as explained by its
author) has been available in the kernel since v2.6.21-350-g80feaacb8a64
via PACKET_ORIGDEV socket option. The patch also broke that feature, as
it's now no longer possible to get the original incoming device. Quoting
the report:

> Unfortunately, this doesn't completely restore the previous
> functionality as PACKET_ORIGDEV is broken for the copy: the original
> interface is lost through the call to netif_rx(). A LLDP daemon
> listening to the master interface won't get the original interface like
> it was able to before 4.12.

The patch reverts to pre-b89f04c61efe state, so:

- both master and original (via PACKET_ORIGDEV) devices are available
  when listening on the master
- original device is available when listening directly on one of its
  slaves

Reported-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: Michal Soltys <soltys@ziu.info>
---
 drivers/net/bonding/bond_main.c | 21 ---------------------
 1 file changed, 21 deletions(-)

Comments

David Miller Jan. 7, 2019, 5:12 p.m. UTC | #1
From: Michal Soltys <soltys@ziu.info>
Date: Mon,  7 Jan 2019 17:29:46 +0100

> This patch reverts:
> 
> b89f04c61efe bonding: deliver link-local packets with skb->dev set to link that packets arrived on
> 
> And its subsequent fixups:
> 
> 6a9e461f6fe4 bonding: pass link-local packets to bonding master also.
> 0f3b914c9cfc bonding: fix warning message
> 
> The intended functionality of the original patch (as explained by its
> author) has been available in the kernel since v2.6.21-350-g80feaacb8a64
> via PACKET_ORIGDEV socket option. The patch also broke that feature, as
> it's now no longer possible to get the original incoming device. Quoting
> the report:
> 
>> Unfortunately, this doesn't completely restore the previous
>> functionality as PACKET_ORIGDEV is broken for the copy: the original
>> interface is lost through the call to netif_rx(). A LLDP daemon
>> listening to the master interface won't get the original interface like
>> it was able to before 4.12.
> 
> The patch reverts to pre-b89f04c61efe state, so:
> 
> - both master and original (via PACKET_ORIGDEV) devices are available
>   when listening on the master
> - original device is available when listening directly on one of its
>   slaves
> 
> Reported-by: Vincent Bernat <vincent@bernat.ch>
> Signed-off-by: Michal Soltys <soltys@ziu.info>

It is tiring and disappointing to see the behavior sway back and forth
multiple times like this.  We are breaking someone, then breaking them
again if they adjusted to the new behavior.

Pretty much unacceptable.

I'm not applying anything until someone can convince me the full scope
of the situation and why this time it's "right."
Vincent Bernat Jan. 8, 2019, 1:46 p.m. UTC | #2
❦  7 janvier 2019 09:12 -08, David Miller <davem@davemloft.net>:

>> The patch reverts to pre-b89f04c61efe state, so:
>> 
>> - both master and original (via PACKET_ORIGDEV) devices are available
>>   when listening on the master
>> - original device is available when listening directly on one of its
>>   slaves
>> 
>> Reported-by: Vincent Bernat <vincent@bernat.ch>
>> Signed-off-by: Michal Soltys <soltys@ziu.info>

Maybe add:

Fixes: b89f04c61efe ("bonding: deliver link-local packets with skb->dev set to link that packets arrived on")

> It is tiring and disappointing to see the behavior sway back and forth
> multiple times like this.  We are breaking someone, then breaking them
> again if they adjusted to the new behavior.
>
> Pretty much unacceptable.
>
> I'm not applying anything until someone can convince me the full scope
> of the situation and why this time it's "right."

For me, the problem was not visible until the fix introduced for 4.19,
hence the late report of the problem. Before 4.12 (and from 2.6.27 for
non-multicast packets), the behaviour was:

 - ability to receive link-local packets on physical interfaces by
   listening directly to them,
 - ability to receive link-local packets on bonded devices and get the
   original interface using PACKET_ORIGDEV.

The patch introduced in 4.12 doesn't change the first ability and breaks
the second one as the packets are not received on bonded devices
anymore. 

With the fix introduced in 4.19 enables to receive link-local packets on
bonded devices again, but it's still not possible to get the original
interface with PACKET_ORIGDEV.

I don't understand what b89f04c61efe purpose is. My first report was on
November 30th <https://marc.info/?l=linux-netdev&m=154361371507825&w=2>
but I didn't get any additional info, notably how the original interface
is retrieved when listening to a bond device. Without an example of
usage, it's difficult to understand if a revert would break anything.
However, a revert would unbreak users of PACKET_ORIGDEV.
David Miller Jan. 13, 2019, 11:03 p.m. UTC | #3
From: Michal Soltys <soltys@ziu.info>
Date: Mon,  7 Jan 2019 17:29:46 +0100

> This patch reverts:
> 
> b89f04c61efe bonding: deliver link-local packets with skb->dev set to link that packets arrived on
> 
> And its subsequent fixups:
> 
> 6a9e461f6fe4 bonding: pass link-local packets to bonding master also.
> 0f3b914c9cfc bonding: fix warning message
> 
> The intended functionality of the original patch (as explained by its
> author) has been available in the kernel since v2.6.21-350-g80feaacb8a64
> via PACKET_ORIGDEV socket option. The patch also broke that feature, as
> it's now no longer possible to get the original incoming device. Quoting
> the report:
> 
>> Unfortunately, this doesn't completely restore the previous
>> functionality as PACKET_ORIGDEV is broken for the copy: the original
>> interface is lost through the call to netif_rx(). A LLDP daemon
>> listening to the master interface won't get the original interface like
>> it was able to before 4.12.
> 
> The patch reverts to pre-b89f04c61efe state, so:
> 
> - both master and original (via PACKET_ORIGDEV) devices are available
>   when listening on the master
> - original device is available when listening directly on one of its
>   slaves
> 
> Reported-by: Vincent Bernat <vincent@bernat.ch>
> Signed-off-by: Michal Soltys <soltys@ziu.info>

Google folks I want to hear what you have to say about all of this.

Thank you.
Maciej Żenczykowski Jan. 14, 2019, 2:01 a.m. UTC | #4
So I don't remember the specifics...

(note I'm writing this all from memory without looking it up/testing
it - I may be utterly wrong or dreaming)

But I seem to recall that the core problem we were trying to solve was
that a daemon listening
on an AF_PACKET ethertype 88CC [LLDP] socket not bound to any device
would not receive LLDP packets
arriving on inactive bond slaves (either active-backup or lag).

[inactive = link/carrier up, but not part of active aggregator]

This made monitoring for miscabling harder (IFIRC the only non kernel
fix was to get the daemon to create
a separate AF_PACKET/88CC socket bound to every physical interface in
the system, or monitor for
inactive slaves and add extra packet sockets as needed).

They would get re-parented to the master and then since the slave was
inactive they would be considered RX_HANDLER_EXACT match only and not
match the * interface.

Honestly I wasn't aware of PACKET_ORIGDEV, although I don't think it
helps in this case - AFAICR the packets never made it to the packet
socket.

Perhaps going from:
  /* don't change skb->dev for link-local packets */
  if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
  if (bond_should_deliver_exact_match(skb, slave, bond)) return
RX_HANDLER_EXACT;

to something more like:
  if (bond_should_deliver_exact_match(skb, slave, bond)) {
    /* don't change skb->dev for link-local packets on inactive slaves */
    if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
    return RX_HANDLER_EXACT;
  }

would fix both problems?

On Sun, Jan 13, 2019 at 3:05 PM David Miller <davem@davemloft.net> wrote:
>
> From: Michal Soltys <soltys@ziu.info>
> Date: Mon,  7 Jan 2019 17:29:46 +0100
>
> > This patch reverts:
> >
> > b89f04c61efe bonding: deliver link-local packets with skb->dev set to link that packets arrived on
> >
> > And its subsequent fixups:
> >
> > 6a9e461f6fe4 bonding: pass link-local packets to bonding master also.
> > 0f3b914c9cfc bonding: fix warning message
> >
> > The intended functionality of the original patch (as explained by its
> > author) has been available in the kernel since v2.6.21-350-g80feaacb8a64
> > via PACKET_ORIGDEV socket option. The patch also broke that feature, as
> > it's now no longer possible to get the original incoming device. Quoting
> > the report:
> >
> >> Unfortunately, this doesn't completely restore the previous
> >> functionality as PACKET_ORIGDEV is broken for the copy: the original
> >> interface is lost through the call to netif_rx(). A LLDP daemon
> >> listening to the master interface won't get the original interface like
> >> it was able to before 4.12.
> >
> > The patch reverts to pre-b89f04c61efe state, so:
> >
> > - both master and original (via PACKET_ORIGDEV) devices are available
> >   when listening on the master
> > - original device is available when listening directly on one of its
> >   slaves
> >
> > Reported-by: Vincent Bernat <vincent@bernat.ch>
> > Signed-off-by: Michal Soltys <soltys@ziu.info>
>
> Google folks I want to hear what you have to say about all of this.
>
> Thank you.
Vincent Bernat Jan. 14, 2019, 8 a.m. UTC | #5
❦ 13 janvier 2019 18:01 -08, Maciej Żenczykowski <zenczykowski@gmail.com>:

> But I seem to recall that the core problem we were trying to solve was
> that a daemon listening
> on an AF_PACKET ethertype 88CC [LLDP] socket not bound to any device
> would not receive LLDP packets
> arriving on inactive bond slaves (either active-backup or lag).

Just tested and with 4.9.150, I am in fact unable to receive anything
on a backup link when listening to the active-backup master device or to
"any" device.

> Perhaps going from:
>   /* don't change skb->dev for link-local packets */
>   if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
>   if (bond_should_deliver_exact_match(skb, slave, bond)) return
> RX_HANDLER_EXACT;
>
> to something more like:
>   if (bond_should_deliver_exact_match(skb, slave, bond)) {
>     /* don't change skb->dev for link-local packets on inactive slaves */
>     if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
>     return RX_HANDLER_EXACT;
>   }
>
> would fix both problems?

It makes PACKET_ORIGDEV works again. Moreover, when not binding to any
interface, we receive packets on both active and backup links. But when
binding to the master device, I only receive packets from the active
devices (which is the same behaviour than pre-4.12). When not binding to
any device and not using PACKET_ORIGDEV, one packet is said to be from
the master device and one packet is said to be from the backup device.
Previously, I had one packet from active device, one packet from backup
device and two packets from master device.

For me, this is a better situation than previously as we return to the
situation before 4.12 but you can get what you want by not binding to
any device _and_ using PACKET_ORIGDEV (otherwise, you are don't get the
right interface in all cases).

If it's unclear, I can provide more extensive results. I am using this
test program (comment the s.bind/s.setsockopt line if needed):

#v+
#!/usr/bin/env python3

import sys
import socket
import datetime

socket.SOL_PACKET = 263
socket.PACKET_ORIGDEV = 9

s = socket.socket(socket.AF_PACKET,
                  socket.SOCK_RAW,
                  socket.htons(0x88cc))
s.bind(("bond0", 0))
s.setsockopt(socket.SOL_PACKET, socket.PACKET_ORIGDEV, 1)
while True:
    data, addrinfo = s.recvfrom(1500)
    if addrinfo[2] == socket.PACKET_OUTGOING:
        continue
    print(f"{datetime.datetime.now().isoformat()}: "
          f"Received {len(data)} bytes from {addrinfo}")
#v-
On Mon, Jan 14, 2019 at 12:00 AM Vincent Bernat <vincent@bernat.ch> wrote:
>
>  ❦ 13 janvier 2019 18:01 -08, Maciej Żenczykowski <zenczykowski@gmail.com>:
>
> > But I seem to recall that the core problem we were trying to solve was
> > that a daemon listening
> > on an AF_PACKET ethertype 88CC [LLDP] socket not bound to any device
> > would not receive LLDP packets
> > arriving on inactive bond slaves (either active-backup or lag).
>
> Just tested and with 4.9.150, I am in fact unable to receive anything
> on a backup link when listening to the active-backup master device or to
> "any" device.
>
> > Perhaps going from:
> >   /* don't change skb->dev for link-local packets */
> >   if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
> >   if (bond_should_deliver_exact_match(skb, slave, bond)) return
> > RX_HANDLER_EXACT;
> >
> > to something more like:
> >   if (bond_should_deliver_exact_match(skb, slave, bond)) {
> >     /* don't change skb->dev for link-local packets on inactive slaves */
> >     if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
> >     return RX_HANDLER_EXACT;
> >   }
> >
> > would fix both problems?
>
thanks for jumping in and offering a solution. This should fix the issue.

NACK for the revert-patch!

Folks, please, revert is not the solution! Last time when there was a
problem posted I offered you a solution, so wasn't that enough to
prove that we care about solving the problem that you are facing while
continuing to have this functionality? No one wants to break your use
case, it happens only because one is not aware of it. Thank you David
for resorting to resolve it.


> It makes PACKET_ORIGDEV works again. Moreover, when not binding to any
> interface, we receive packets on both active and backup links. But when
> binding to the master device, I only receive packets from the active
> devices (which is the same behaviour than pre-4.12). When not binding to
> any device and not using PACKET_ORIGDEV, one packet is said to be from
> the master device and one packet is said to be from the backup device.
> Previously, I had one packet from active device, one packet from backup
> device and two packets from master device.
>
> For me, this is a better situation than previously as we return to the
> situation before 4.12 but you can get what you want by not binding to
> any device _and_ using PACKET_ORIGDEV (otherwise, you are don't get the
> right interface in all cases).
>
> If it's unclear, I can provide more extensive results. I am using this
> test program (comment the s.bind/s.setsockopt line if needed):
>
> #v+
> #!/usr/bin/env python3
>
> import sys
> import socket
> import datetime
>
> socket.SOL_PACKET = 263
> socket.PACKET_ORIGDEV = 9
>
> s = socket.socket(socket.AF_PACKET,
>                   socket.SOCK_RAW,
>                   socket.htons(0x88cc))
> s.bind(("bond0", 0))
> s.setsockopt(socket.SOL_PACKET, socket.PACKET_ORIGDEV, 1)
> while True:
>     data, addrinfo = s.recvfrom(1500)
>     if addrinfo[2] == socket.PACKET_OUTGOING:
>         continue
>     print(f"{datetime.datetime.now().isoformat()}: "
>           f"Received {len(data)} bytes from {addrinfo}")
> #v-
> --
> Localise input and output in subroutines.
>             - The Elements of Programming Style (Kernighan & Plauger)
Michal Soltys Jan. 16, 2019, 2:01 a.m. UTC | #7
On 19/01/14 03:01, Maciej Żenczykowski wrote:
> So I don't remember the specifics...
> 
> (note I'm writing this all from memory without looking it up/testing
> it - I may be utterly wrong or dreaming)
> 
> But I seem to recall that the core problem we were trying to solve was
> that a daemon listening
> on an AF_PACKET ethertype 88CC [LLDP] socket not bound to any device
> would not receive LLDP packets
> arriving on inactive bond slaves (either active-backup or lag).
> 
> [inactive = link/carrier up, but not part of active aggregator]
> 
> This made monitoring for miscabling harder (IFIRC the only non kernel
> fix was to get the daemon to create
> a separate AF_PACKET/88CC socket bound to every physical interface in
> the system, or monitor for
> inactive slaves and add extra packet sockets as needed).
> 
> They would get re-parented to the master and then since the slave was
> inactive they would be considered RX_HANDLER_EXACT match only and not
> match the * interface.
> 
> Honestly I wasn't aware of PACKET_ORIGDEV, although I don't think it
> helps in this case - AFAICR the packets never made it to the packet
> socket.
> 
> Perhaps going from:
>    /* don't change skb->dev for link-local packets */
>    if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
>    if (bond_should_deliver_exact_match(skb, slave, bond)) return
> RX_HANDLER_EXACT;
> 
> to something more like:
>    if (bond_should_deliver_exact_match(skb, slave, bond)) {
>      /* don't change skb->dev for link-local packets on inactive slaves */
>      if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
>      return RX_HANDLER_EXACT;
>    }
> 
> would fix both problems?
> 

I'll test this change in bridging scenarios in coming days. And thanks 
for the explanation of what was the issue in your case.
Michal Soltys Jan. 16, 2019, 2:58 a.m. UTC | #8
On 19/01/15 03:19, Mahesh Bandewar (महेश बंडेवार) wrote:
> On Mon, Jan 14, 2019 at 12:00 AM Vincent Bernat <vincent@bernat.ch> wrote:
>>
>>  ❦ 13 janvier 2019 18:01 -08, Maciej Żenczykowski <zenczykowski@gmail.com>:
>>
>> > But I seem to recall that the core problem we were trying to solve was
>> > that a daemon listening
>> > on an AF_PACKET ethertype 88CC [LLDP] socket not bound to any device
>> > would not receive LLDP packets
>> > arriving on inactive bond slaves (either active-backup or lag).
>>
>> Just tested and with 4.9.150, I am in fact unable to receive anything
>> on a backup link when listening to the active-backup master device or to
>> "any" device.
>>
>> > Perhaps going from:
>> >   /* don't change skb->dev for link-local packets */
>> >   if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
>> >   if (bond_should_deliver_exact_match(skb, slave, bond)) return
>> > RX_HANDLER_EXACT;
>> >
>> > to something more like:
>> >   if (bond_should_deliver_exact_match(skb, slave, bond)) {
>> >     /* don't change skb->dev for link-local packets on inactive slaves */
>> >     if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
>> >     return RX_HANDLER_EXACT;
>> >   }
>> >
>> > would fix both problems?
>>
> thanks for jumping in and offering a solution. This should fix the issue.
> 
> NACK for the revert-patch!
> 
> Folks, please, revert is not the solution! Last time when there was a
> problem posted I offered you a solution, so wasn't that enough to
> prove that we care about solving the problem that you are facing while
> continuing to have this functionality? No one wants to break your use
> case, it happens only because one is not aware of it. Thank you David
> for resorting to resolve it.

Mahesh, that's not it. But:

Since Vincent reported PACKET_ORIGDEV regression late november, none of 
you replied to anything posted until now. And if David hadn't called you 
guys directly, I'm not sure you would have at this point. Reverting and 
Vincent's offer to patch to update packet(7) were also clearly mentioned 
in the previous thread, none of them commented/nacked/acked either. Me 
an Vincent have been scratching our head for a while off list - but our 
guessing can only go as far as time goes on.

Maciej now was the first to ever provide the actual details about the 
issue you were facing originally. My bad I haven't added him to CC from 
the very beginning.

Anyway, I'll test Maciej's version in bridging context in coming days 
and look closer at the code overall. It probably works fine if Vincent 
is seeing packets on masters, but I'd rather be sure.
Michal Soltys Jan. 18, 2019, 12:27 a.m. UTC | #9
On 19/01/14 03:01, Maciej Żenczykowski wrote:
> So I don't remember the specifics...
> 
> (note I'm writing this all from memory without looking it up/testing
> it - I may be utterly wrong or dreaming)
> 
> But I seem to recall that the core problem we were trying to solve was
> that a daemon listening
> on an AF_PACKET ethertype 88CC [LLDP] socket not bound to any device
> would not receive LLDP packets
> arriving on inactive bond slaves (either active-backup or lag).
> 
> [inactive = link/carrier up, but not part of active aggregator]
> 
> This made monitoring for miscabling harder (IFIRC the only non kernel
> fix was to get the daemon to create
> a separate AF_PACKET/88CC socket bound to every physical interface in
> the system, or monitor for
> inactive slaves and add extra packet sockets as needed).
> 
> They would get re-parented to the master and then since the slave was
> inactive they would be considered RX_HANDLER_EXACT match only and not
> match the * interface.
> 
> Honestly I wasn't aware of PACKET_ORIGDEV, although I don't think it
> helps in this case - AFAICR the packets never made it to the packet
> socket.
> 
> Perhaps going from:
>    /* don't change skb->dev for link-local packets */
>    if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
>    if (bond_should_deliver_exact_match(skb, slave, bond)) return
> RX_HANDLER_EXACT;
> 
> to something more like:
>    if (bond_should_deliver_exact_match(skb, slave, bond)) {
>      /* don't change skb->dev for link-local packets on inactive slaves */
>      if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
>      return RX_HANDLER_EXACT;
>    }

Having checked the code (if I get the flow correctly), one 
thing/question - currently with Mahesh's fixes, not bound LLDP listener 
will receive all packets - both from active and inactive slaves directly 
(as the check for suppression is done after the link-local check).

The version above will do the suppression check first - so all inactive 
slaves - excluding non-multi/non-broad ALB - will pass it and return 
RX_HANDLER_PASS if the packet is link-local. So those will be available 
w/o binding, but active slaves' packets will be available via master 
device (but with working PACKET_ORIGDEV now - so slave device can be 
retrieved easily). This is fine in your scenario I presume ?
Maciej Żenczykowski Jan. 18, 2019, 6:58 a.m. UTC | #10
I'm not sure there's a truly good answer.

Also note, that there's subtle differences between af_packet sockets
tied to ALL protocols vs tied to specific protocols (vs none/0 of
course).
However ETH_P_ALL protocol raw sockets need to be avoided if at all
possible due to perf impact.
Certainly we don't want any of these created outside of debugging.

I believe we only cared about the utterly unbound to any device case
working (ie. delivering all LLDP packets all nics are receiving).
So that a single simple daemon could collect and export all the link
information.
[btw. now that I know about the PACKET_ORIGDEV option, I do -
unsurprisingly - see our daemon sets it]
We really didn't want the complexity of having to bind to individual
interfaces and having to try to dynamically adjust the set of raw
sockets.
(not to mention that less raw sockets is always good, also it's never
actually entirely clear which interfaces would need to be monitored
due to the preponderance of tunnels/macvlan/ipvlan/veth/dummy and
other virtual interface types)

I think from a logic perspective:

- if you bind to slave (physical interface) you should see *all*
packets on that interface,
regardless of whether the slave is active or not, and whether the
packet is link-local or not.
ie. this should show you exactly what nic is receiving (& sending:
PACKET_OUTGOING)
I should see *exactly* one copy of any packet.
This mode has to be useful for debugging.
This includes seeing packets which aren't even destined for me if nic
receives them.
(ie. destined to unicast/multicast mac we'd filter out: PACKET_OTHERHOST)
Although by itself this socket's existence shouldn't affect nic mac
filters and promiscuous mode
(I believe there are all sorts of various additional socket options to
change those).

- if you don't bind to an interface then I think I'd expect to see
packets delivered to stack + link local packets
  - ie. should not see non-link-local packets discarded due to being
on inactive slaves
    similarly to how I should not see packets filtered out by virtue
of mac not matching interface mac filters
  - IMHO you should see *all* link local packets arriving at system
(ie. the original change's purpose)
    [and I think, but am not certain, I shouldn't need to use any
socket options to register the link local macs,
     although glancing at code I do think the daemon uses
PACKET_ADD_MEMBERSHIP to register lldp mac,
     so perhaps filtering should apply as normal?]
  - with PACKET_ORIGDEV they should always show up as from the
physical interfaces (ie. slaves)
  - without PACKET_ORIGDEV:
     - non link local packets should show up as from bonding/team master
     - link local packets on active slaves could show up on master or
slave (probably master is required ifirc some earlier fix???)
     - link local packets on inactive slaves should show up on slave -
and definitely not on master
  - I don't think I should ever see any PACKET_OTHERHOST packets

- if you bind to master you should see packets from active slaves (ie.
those that will get delivered to stack)
  - clearly you should not see non-link-local inactive slave packets
(they'll be dropped)
  - behaviour wrt. link local packets is more dicey... (I believe
somewhere in these threads patches there was some description of what
standard requires, but I don't off the top of my head remember)
  - for an active-backup bond it would seem logical to see only active
links link local
  - for a multiple-active aggregate bond I'd be fine with seeing none,
or all active slaves link local packets - I guess even though the
later is confusing it makes sense
  - it might be okay to see link-local inactive slave packets, but I
think I'd prefer not to (could be configurable though) - this would
seem confusing/wrong to me.
  - and again I don't think I should see any PACKET_OTHERHOST packets here...
  - PACKET_ORIGDEV as above...

I gave the above a fair amount of thought... but I'm not guaranteeing
I didn't make typos, or write something utterly stupid or
unimplementable or not how stuff should work for other reasons...
Comments welcome.

I think this continues to be in line with my proposal from earlier in
the thread?

On Thu, Jan 17, 2019 at 4:27 PM Michal Soltys <soltys@ziu.info> wrote:
>
> On 19/01/14 03:01, Maciej Żenczykowski wrote:
> > So I don't remember the specifics...
> >
> > (note I'm writing this all from memory without looking it up/testing
> > it - I may be utterly wrong or dreaming)
> >
> > But I seem to recall that the core problem we were trying to solve was
> > that a daemon listening
> > on an AF_PACKET ethertype 88CC [LLDP] socket not bound to any device
> > would not receive LLDP packets
> > arriving on inactive bond slaves (either active-backup or lag).
> >
> > [inactive = link/carrier up, but not part of active aggregator]
> >
> > This made monitoring for miscabling harder (IFIRC the only non kernel
> > fix was to get the daemon to create
> > a separate AF_PACKET/88CC socket bound to every physical interface in
> > the system, or monitor for
> > inactive slaves and add extra packet sockets as needed).
> >
> > They would get re-parented to the master and then since the slave was
> > inactive they would be considered RX_HANDLER_EXACT match only and not
> > match the * interface.
> >
> > Honestly I wasn't aware of PACKET_ORIGDEV, although I don't think it
> > helps in this case - AFAICR the packets never made it to the packet
> > socket.
> >
> > Perhaps going from:
> >    /* don't change skb->dev for link-local packets */
> >    if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
> >    if (bond_should_deliver_exact_match(skb, slave, bond)) return
> > RX_HANDLER_EXACT;
> >
> > to something more like:
> >    if (bond_should_deliver_exact_match(skb, slave, bond)) {
> >      /* don't change skb->dev for link-local packets on inactive slaves */
> >      if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
> >      return RX_HANDLER_EXACT;
> >    }
>
> Having checked the code (if I get the flow correctly), one
> thing/question - currently with Mahesh's fixes, not bound LLDP listener
> will receive all packets - both from active and inactive slaves directly
> (as the check for suppression is done after the link-local check).
>
> The version above will do the suppression check first - so all inactive
> slaves - excluding non-multi/non-broad ALB - will pass it and return
> RX_HANDLER_PASS if the packet is link-local. So those will be available
> w/o binding, but active slaves' packets will be available via master
> device (but with working PACKET_ORIGDEV now - so slave device can be
> retrieved easily). This is fine in your scenario I presume ?
>
Michal Soltys Jan. 29, 2019, 1:47 a.m. UTC | #11
On 19/01/18 07:58, Maciej Żenczykowski wrote:
> I'm not sure there's a truly good answer.
> 
> Also note, that there's subtle differences between af_packet sockets
> tied to ALL protocols vs tied to specific protocols (vs none/0 of
> course).
> However ETH_P_ALL protocol raw sockets need to be avoided if at all
> possible due to perf impact.
> Certainly we don't want any of these created outside of debugging.
> 
> I believe we only cared about the utterly unbound to any device case
> working (ie. delivering all LLDP packets all nics are receiving).
> So that a single simple daemon could collect and export all the link
> information.
> [btw. now that I know about the PACKET_ORIGDEV option, I do -
> unsurprisingly - see our daemon sets it]
> We really didn't want the complexity of having to bind to individual
> interfaces and having to try to dynamically adjust the set of raw
> sockets.
> (not to mention that less raw sockets is always good, also it's never
> actually entirely clear which interfaces would need to be monitored
> due to the preponderance of tunnels/macvlan/ipvlan/veth/dummy and
> other virtual interface types)
> 
> I think from a logic perspective:
> 
> - if you bind to slave (physical interface) you should see *all*
> packets on that interface,
> regardless of whether the slave is active or not, and whether the
> packet is link-local or not.
> ie. this should show you exactly what nic is receiving (& sending:
> PACKET_OUTGOING)
> I should see *exactly* one copy of any packet.
> This mode has to be useful for debugging.
> This includes seeing packets which aren't even destined for me if nic
> receives them.
> (ie. destined to unicast/multicast mac we'd filter out: PACKET_OTHERHOST)
> Although by itself this socket's existence shouldn't affect nic mac
> filters and promiscuous mode
> (I believe there are all sorts of various additional socket options to
> change those).
> 
> - if you don't bind to an interface then I think I'd expect to see
> packets delivered to stack + link local packets
>    - ie. should not see non-link-local packets discarded due to being
> on inactive slaves
>      similarly to how I should not see packets filtered out by virtue
> of mac not matching interface mac filters
>    - IMHO you should see *all* link local packets arriving at system
> (ie. the original change's purpose)
>      [and I think, but am not certain, I shouldn't need to use any
> socket options to register the link local macs,
>       although glancing at code I do think the daemon uses
> PACKET_ADD_MEMBERSHIP to register lldp mac,
>       so perhaps filtering should apply as normal?]
>    - with PACKET_ORIGDEV they should always show up as from the
> physical interfaces (ie. slaves)
>    - without PACKET_ORIGDEV:
>       - non link local packets should show up as from bonding/team master

>       - link local packets on active slaves could show up on master or
> slave (probably master is required ifirc some earlier fix???)

As far as I can see, they (LLCs coming on active slave) will show via 
master as from master (w/o the socket option) or as from slave (with). 
Think Vincent's tests confirmed it.

>       - link local packets on inactive slaves should show up on slave -
> and definitely not on master
>    - I don't think I should ever see any PACKET_OTHERHOST packets
> 
> - if you bind to master you should see packets from active slaves (ie.
> those that will get delivered to stack)
>    - clearly you should not see non-link-local inactive slave packets
> (they'll be dropped)

>    - behaviour wrt. link local packets is more dicey... (I believe
> somewhere in these threads patches there was some description of what
> standard requires, but I don't off the top of my head remember)

They must be seen, otherwise bonds attached to a linux bridge (I assume 
enslaving an interface to a bridge essentially counts as bind) will be 
blind to them (among those - e.g. to spanning tree information - this 
was what originally caused issues for me last year). Even the bridge 
code goes to extra length to not carelessly consume stp packets, if it's 
not actively participating in stp (of any kind). Aside that, certain 
[recent] bridge features like group_fwd_mask would be non-functional on 
bond ports as well.

As for standard (the one I quoted in the oldest thread) expected the 
link-local packet to be both readable via master and slaves depending on 
need (though it didn't go into exact gory details). Your current patch I 
think cover all possible cases quite greacefully.

>    - for an active-backup bond it would seem logical to see only active
> links link local
>    - for a multiple-active aggregate bond I'd be fine with seeing none,
> or all active slaves link local packets - I guess even though the
> later is confusing it makes sense
>    - it might be okay to see link-local inactive slave packets, but I
> think I'd prefer not to (could be configurable though) - this would
> seem confusing/wrong to me.
>    - and again I don't think I should see any PACKET_OTHERHOST packets here...
>    - PACKET_ORIGDEV as above...
> 
> I gave the above a fair amount of thought... but I'm not guaranteeing
> I didn't make typos, or write something utterly stupid or
> unimplementable or not how stuff should work for other reasons...
> Comments welcome.
> 
> I think this continues to be in line with my proposal from earlier in
> the thread?

(sorry for late reply)

Yes, pretty much spot on. With some confirmation comments above.

I did bridging tests yesterday (2 LACP bonds attached via separate 
switches treating linux bridge as a shared segment in mstp scenario) and 
all is working fine on my side as well.

If everyone agrees with the proposed code, I will submit v2 patch with 
added comment explaining basic logic (or you could submit if you prefer).

> 
> On Thu, Jan 17, 2019 at 4:27 PM Michal Soltys <soltys@ziu.info> wrote:
>>
>> On 19/01/14 03:01, Maciej Żenczykowski wrote:
>> > So I don't remember the specifics...
>> >
>> > (note I'm writing this all from memory without looking it up/testing
>> > it - I may be utterly wrong or dreaming)
>> >
>> > But I seem to recall that the core problem we were trying to solve was
>> > that a daemon listening
>> > on an AF_PACKET ethertype 88CC [LLDP] socket not bound to any device
>> > would not receive LLDP packets
>> > arriving on inactive bond slaves (either active-backup or lag).
>> >
>> > [inactive = link/carrier up, but not part of active aggregator]
>> >
>> > This made monitoring for miscabling harder (IFIRC the only non kernel
>> > fix was to get the daemon to create
>> > a separate AF_PACKET/88CC socket bound to every physical interface in
>> > the system, or monitor for
>> > inactive slaves and add extra packet sockets as needed).
>> >
>> > They would get re-parented to the master and then since the slave was
>> > inactive they would be considered RX_HANDLER_EXACT match only and not
>> > match the * interface.
>> >
>> > Honestly I wasn't aware of PACKET_ORIGDEV, although I don't think it
>> > helps in this case - AFAICR the packets never made it to the packet
>> > socket.
>> >
>> > Perhaps going from:
>> >    /* don't change skb->dev for link-local packets */
>> >    if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
>> >    if (bond_should_deliver_exact_match(skb, slave, bond)) return
>> > RX_HANDLER_EXACT;
>> >
>> > to something more like:
>> >    if (bond_should_deliver_exact_match(skb, slave, bond)) {
>> >      /* don't change skb->dev for link-local packets on inactive slaves */
>> >      if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
>> >      return RX_HANDLER_EXACT;
>> >    }
>>
>> Having checked the code (if I get the flow correctly), one
>> thing/question - currently with Mahesh's fixes, not bound LLDP listener
>> will receive all packets - both from active and inactive slaves directly
>> (as the check for suppression is done after the link-local check).
>>
>> The version above will do the suppression check first - so all inactive
>> slaves - excluding non-multi/non-broad ALB - will pass it and return
>> RX_HANDLER_PASS if the packet is link-local. So those will be available
>> w/o binding, but active slaves' packets will be available via master
>> device (but with working PACKET_ORIGDEV now - so slave device can be
>> retrieved easily). This is fine in your scenario I presume ?
>>
>
Maciej Żenczykowski Jan. 29, 2019, 9:39 a.m. UTC | #12
Sounds good to me.

On Mon, Jan 28, 2019 at 5:47 PM Michal Soltys <soltys@ziu.info> wrote:
>
> On 19/01/18 07:58, Maciej Żenczykowski wrote:
> > I'm not sure there's a truly good answer.
> >
> > Also note, that there's subtle differences between af_packet sockets
> > tied to ALL protocols vs tied to specific protocols (vs none/0 of
> > course).
> > However ETH_P_ALL protocol raw sockets need to be avoided if at all
> > possible due to perf impact.
> > Certainly we don't want any of these created outside of debugging.
> >
> > I believe we only cared about the utterly unbound to any device case
> > working (ie. delivering all LLDP packets all nics are receiving).
> > So that a single simple daemon could collect and export all the link
> > information.
> > [btw. now that I know about the PACKET_ORIGDEV option, I do -
> > unsurprisingly - see our daemon sets it]
> > We really didn't want the complexity of having to bind to individual
> > interfaces and having to try to dynamically adjust the set of raw
> > sockets.
> > (not to mention that less raw sockets is always good, also it's never
> > actually entirely clear which interfaces would need to be monitored
> > due to the preponderance of tunnels/macvlan/ipvlan/veth/dummy and
> > other virtual interface types)
> >
> > I think from a logic perspective:
> >
> > - if you bind to slave (physical interface) you should see *all*
> > packets on that interface,
> > regardless of whether the slave is active or not, and whether the
> > packet is link-local or not.
> > ie. this should show you exactly what nic is receiving (& sending:
> > PACKET_OUTGOING)
> > I should see *exactly* one copy of any packet.
> > This mode has to be useful for debugging.
> > This includes seeing packets which aren't even destined for me if nic
> > receives them.
> > (ie. destined to unicast/multicast mac we'd filter out: PACKET_OTHERHOST)
> > Although by itself this socket's existence shouldn't affect nic mac
> > filters and promiscuous mode
> > (I believe there are all sorts of various additional socket options to
> > change those).
> >
> > - if you don't bind to an interface then I think I'd expect to see
> > packets delivered to stack + link local packets
> >    - ie. should not see non-link-local packets discarded due to being
> > on inactive slaves
> >      similarly to how I should not see packets filtered out by virtue
> > of mac not matching interface mac filters
> >    - IMHO you should see *all* link local packets arriving at system
> > (ie. the original change's purpose)
> >      [and I think, but am not certain, I shouldn't need to use any
> > socket options to register the link local macs,
> >       although glancing at code I do think the daemon uses
> > PACKET_ADD_MEMBERSHIP to register lldp mac,
> >       so perhaps filtering should apply as normal?]
> >    - with PACKET_ORIGDEV they should always show up as from the
> > physical interfaces (ie. slaves)
> >    - without PACKET_ORIGDEV:
> >       - non link local packets should show up as from bonding/team master
>
> >       - link local packets on active slaves could show up on master or
> > slave (probably master is required ifirc some earlier fix???)
>
> As far as I can see, they (LLCs coming on active slave) will show via
> master as from master (w/o the socket option) or as from slave (with).
> Think Vincent's tests confirmed it.
>
> >       - link local packets on inactive slaves should show up on slave -
> > and definitely not on master
> >    - I don't think I should ever see any PACKET_OTHERHOST packets
> >
> > - if you bind to master you should see packets from active slaves (ie.
> > those that will get delivered to stack)
> >    - clearly you should not see non-link-local inactive slave packets
> > (they'll be dropped)
>
> >    - behaviour wrt. link local packets is more dicey... (I believe
> > somewhere in these threads patches there was some description of what
> > standard requires, but I don't off the top of my head remember)
>
> They must be seen, otherwise bonds attached to a linux bridge (I assume
> enslaving an interface to a bridge essentially counts as bind) will be
> blind to them (among those - e.g. to spanning tree information - this
> was what originally caused issues for me last year). Even the bridge
> code goes to extra length to not carelessly consume stp packets, if it's
> not actively participating in stp (of any kind). Aside that, certain
> [recent] bridge features like group_fwd_mask would be non-functional on
> bond ports as well.
>
> As for standard (the one I quoted in the oldest thread) expected the
> link-local packet to be both readable via master and slaves depending on
> need (though it didn't go into exact gory details). Your current patch I
> think cover all possible cases quite greacefully.
>
> >    - for an active-backup bond it would seem logical to see only active
> > links link local
> >    - for a multiple-active aggregate bond I'd be fine with seeing none,
> > or all active slaves link local packets - I guess even though the
> > later is confusing it makes sense
> >    - it might be okay to see link-local inactive slave packets, but I
> > think I'd prefer not to (could be configurable though) - this would
> > seem confusing/wrong to me.
> >    - and again I don't think I should see any PACKET_OTHERHOST packets here...
> >    - PACKET_ORIGDEV as above...
> >
> > I gave the above a fair amount of thought... but I'm not guaranteeing
> > I didn't make typos, or write something utterly stupid or
> > unimplementable or not how stuff should work for other reasons...
> > Comments welcome.
> >
> > I think this continues to be in line with my proposal from earlier in
> > the thread?
>
> (sorry for late reply)
>
> Yes, pretty much spot on. With some confirmation comments above.
>
> I did bridging tests yesterday (2 LACP bonds attached via separate
> switches treating linux bridge as a shared segment in mstp scenario) and
> all is working fine on my side as well.
>
> If everyone agrees with the proposed code, I will submit v2 patch with
> added comment explaining basic logic (or you could submit if you prefer).
>
> >
> > On Thu, Jan 17, 2019 at 4:27 PM Michal Soltys <soltys@ziu.info> wrote:
> >>
> >> On 19/01/14 03:01, Maciej Żenczykowski wrote:
> >> > So I don't remember the specifics...
> >> >
> >> > (note I'm writing this all from memory without looking it up/testing
> >> > it - I may be utterly wrong or dreaming)
> >> >
> >> > But I seem to recall that the core problem we were trying to solve was
> >> > that a daemon listening
> >> > on an AF_PACKET ethertype 88CC [LLDP] socket not bound to any device
> >> > would not receive LLDP packets
> >> > arriving on inactive bond slaves (either active-backup or lag).
> >> >
> >> > [inactive = link/carrier up, but not part of active aggregator]
> >> >
> >> > This made monitoring for miscabling harder (IFIRC the only non kernel
> >> > fix was to get the daemon to create
> >> > a separate AF_PACKET/88CC socket bound to every physical interface in
> >> > the system, or monitor for
> >> > inactive slaves and add extra packet sockets as needed).
> >> >
> >> > They would get re-parented to the master and then since the slave was
> >> > inactive they would be considered RX_HANDLER_EXACT match only and not
> >> > match the * interface.
> >> >
> >> > Honestly I wasn't aware of PACKET_ORIGDEV, although I don't think it
> >> > helps in this case - AFAICR the packets never made it to the packet
> >> > socket.
> >> >
> >> > Perhaps going from:
> >> >    /* don't change skb->dev for link-local packets */
> >> >    if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
> >> >    if (bond_should_deliver_exact_match(skb, slave, bond)) return
> >> > RX_HANDLER_EXACT;
> >> >
> >> > to something more like:
> >> >    if (bond_should_deliver_exact_match(skb, slave, bond)) {
> >> >      /* don't change skb->dev for link-local packets on inactive slaves */
> >> >      if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
> >> >      return RX_HANDLER_EXACT;
> >> >    }
> >>
> >> Having checked the code (if I get the flow correctly), one
> >> thing/question - currently with Mahesh's fixes, not bound LLDP listener
> >> will receive all packets - both from active and inactive slaves directly
> >> (as the check for suppression is done after the link-local check).
> >>
> >> The version above will do the suppression check first - so all inactive
> >> slaves - excluding non-multi/non-broad ALB - will pass it and return
> >> RX_HANDLER_PASS if the packet is link-local. So those will be available
> >> w/o binding, but active slaves' packets will be available via master
> >> device (but with working PACKET_ORIGDEV now - so slave device can be
> >> retrieved easily). This is fine in your scenario I presume ?
> >>
> >
>

Patch
diff mbox series

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index a9d597f28023..290235587a0e 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1183,27 +1183,6 @@  static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
 		}
 	}
 
-	/* Link-local multicast packets should be passed to the
-	 * stack on the link they arrive as well as pass them to the
-	 * bond-master device. These packets are mostly usable when
-	 * stack receives it with the link on which they arrive
-	 * (e.g. LLDP) they also must be available on master. Some of
-	 * the use cases include (but are not limited to): LLDP agents
-	 * that must be able to operate both on enslaved interfaces as
-	 * well as on bonds themselves; linux bridges that must be able
-	 * to process/pass BPDUs from attached bonds when any kind of
-	 * STP version is enabled on the network.
-	 */
-	if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) {
-		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
-
-		if (nskb) {
-			nskb->dev = bond->dev;
-			nskb->queue_mapping = 0;
-			netif_rx(nskb);
-		}
-		return RX_HANDLER_PASS;
-	}
 	if (bond_should_deliver_exact_match(skb, slave, bond))
 		return RX_HANDLER_EXACT;