RFC: bridge get fdb by bridge device

Message ID	52F79990.3000400@mojatatu.com
State	RFC, archived
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> Message-ID: <52F79990.3000400@mojatatu.com> Date: Sun, 09 Feb 2014 10:06:56 -0500 From: Jamal Hadi Salim <jhs@mojatatu.com> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: "netdev@vger.kernel.org" <netdev@vger.kernel.org> CC: vyasevic@redhat.com, Stephen Hemminger <stephen@networkplumber.org>, Scott Feldman <sfeldma@cumulusnetworks.com>, John Fastabend <john.r.fastabend@intel.com> Subject: RFC: bridge get fdb by bridge device References: <52F21F72.2090405@mojatatu.com> <52F29747.7040008@redhat.com> <52F3CF76.9090404@mojatatu.com> <52F3E357.4040006@redhat.com> In-Reply-To: <52F3E357.4040006@redhat.com> Content-Type: multipart/mixed; boundary="------------090501000001010305090103" Sender: netdev-owner@vger.kernel.org Precedence: bulk

Jamal Hadi Salim Feb. 9, 2014, 3:06 p.m. UTC

This patch allows something equivalent to
"brctl showmacs <bridge device>" with iproute2
syntax "bridge link br <bridge device>"
Filtering by bridge is done in the kernel.
The current setup doesnt scale when you have many bridges each
with large fdbs (preliminary fix with the kernel patch).

iproute2 allows filtering by bridge port, example:
"bridge link br br1234 dev port1234"
but the filtering is done in user space.
In a future patch i would like to do the port filtering
in the kernel. As well, adding a MAC filter in the kernel
makes sense.

Kernel patch is against net-next.

cheers,
jamal

diff --git a/bridge/fdb.c b/bridge/fdb.c
index e2e53f1..f3073d6 100644
--- a/bridge/fdb.c
+++ b/bridge/fdb.c
@@ -33,7 +33,7 @@ static void usage(void)
 	fprintf(stderr, "Usage: bridge fdb { add | append | del | replace } ADDR dev DEV {self|master} [ temp ]\n"
 		        "              [router] [ dst IPADDR] [ vlan VID ]\n"
 		        "              [ port PORT] [ vni VNI ] [via DEV]\n");
-	fprintf(stderr, "       bridge fdb {show} [ dev DEV ]\n");
+	fprintf(stderr, "       bridge fdb {show} [ br BRDEV ] [ dev DEV ]\n");
 	exit(-1);
 }
 
@@ -152,18 +152,35 @@ int print_fdb(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 
 static int fdb_show(int argc, char **argv)
 {
+	struct ndmsg ndm = { };
 	char *filter_dev = NULL;
+	char *br = NULL;
+
+	ndm.ndm_family = PF_BRIDGE;
+	ndm.ndm_state = NUD_NOARP;
 
 	while (argc > 0) {
-		if (strcmp(*argv, "dev") == 0) {
+		if ((strcmp(*argv, "port") == 0) || strcmp(*argv, "dev") == 0) {
 			NEXT_ARG();
-			if (filter_dev)
-				duparg("dev", *argv);
 			filter_dev = *argv;
+		} else if (strcmp(*argv, "br") == 0) {
+			NEXT_ARG();
+			br = *argv;
+		} else {
+			if (matches(*argv, "help") == 0)
+				usage();
 		}
 		argc--; argv++;
 	}
 
+	if (br) {
+		ndm.ndm_ifindex = ll_name_to_index(br);
+		if (ndm.ndm_ifindex == 0) {
+			fprintf(stderr, "Cannot find bridge device \"%s\"\n", br);
+			return -1;
+		}
+	}
+
 	if (filter_dev) {
 		filter_index = if_nametoindex(filter_dev);
 		if (filter_index == 0) {
@@ -171,13 +188,15 @@ static int fdb_show(int argc, char **argv)
 				filter_dev);
 			return -1;
 		}
+
 	}
 
-	if (rtnl_wilddump_request(&rth, PF_BRIDGE, RTM_GETNEIGH) < 0) {
+	if (rtnl_dump_request(&rth, RTM_GETNEIGH, &ndm, sizeof(struct ndmsg)) < 0) {
 		perror("Cannot send dump request");
 		exit(1);
 	}
 
+
 	if (rtnl_dump_filter(&rth, print_fdb, stdout) < 0) {
 		fprintf(stderr, "Dump terminated\n");
 		exit(1);

John Fastabend Feb. 9, 2014, 7:33 p.m. UTC | #1

On 02/09/2014 07:06 AM, Jamal Hadi Salim wrote:
>
> This patch allows something equivalent to
> "brctl showmacs <bridge device>" with iproute2
> syntax "bridge link br <bridge device>"
> Filtering by bridge is done in the kernel.
> The current setup doesnt scale when you have many bridges each
> with large fdbs (preliminary fix with the kernel patch).
>
> iproute2 allows filtering by bridge port, example:
> "bridge link br br1234 dev port1234"
> but the filtering is done in user space.
> In a future patch i would like to do the port filtering
> in the kernel. As well, adding a MAC filter in the kernel
> makes sense.
>
> Kernel patch is against net-next.
>
> cheers,
> jamal

[...]

> +	if (ndm->ndm_ifindex) {
> +		dev = __dev_get_by_index(net, ndm->ndm_ifindex);
> +		if (dev == NULL) {
> +			pr_info("PF_BRIDGE: RTM_GETNEIGH with unknown ifindex\n");
> +			return -ENODEV;
> +		}
> +	
> +		if (!(dev->priv_flags & IFF_EBRIDGE)) {

Can we drop this 'if case' and just use the 'if (ops->ndo_fdb_dump)'
below? IFF_EBRIDGE is specific to ./net/bridge so it will fail for
macvlans and I think the command is useful in both cases.


> +			pr_info("PF_BRIDGE: RTM_GETNEIGH %s not a bridge device\n",
> +				dev->name);
> +			return -EINVAL;
>  		}
> +		ops = dev->netdev_ops;
> +		if (ops->ndo_fdb_dump) {
> +			idx = ops->ndo_fdb_dump(skb, cb, dev, idx);
> +		} else {

Is there any problem with using the ndo_dflt_fdb_dump() in the else
here?

Userspace should be able to easily learn which ports are ebridge ports
so I don't think that should be an issue. Anyways with the above
IFF_EBRIDGE check you should never hit this else case although I think
its safe to drop the above check as noted.

> +			pr_info("PF_BRIDGE: RTM_GETNEIGH %s no dumper\n",
> +				dev->name);
> +			return -EINVAL;
> +		}
> +	} else {

Thanks,
John

Vlad Yasevich Feb. 10, 2014, 4:31 p.m. UTC | #2

On 02/09/2014 10:06 AM, Jamal Hadi Salim wrote:
>
> This patch allows something equivalent to
> "brctl showmacs <bridge device>" with iproute2
> syntax "bridge link br <bridge device>"
> Filtering by bridge is done in the kernel.
> The current setup doesnt scale when you have many bridges each
> with large fdbs (preliminary fix with the kernel patch).
>
> iproute2 allows filtering by bridge port, example:
> "bridge link br br1234 dev port1234"
> but the filtering is done in user space.
> In a future patch i would like to do the port filtering
> in the kernel. As well, adding a MAC filter in the kernel
> makes sense.
>
> Kernel patch is against net-next.
>
> cheers,
> jamal
>
> bridge-fdb-filter1
>
>
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 393b1bc..507ea4e 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -2423,26 +2423,50 @@ static int rtnl_fdb_dump(struct sk_buff *skb,
struct netlink_callback *cb)
>  {
>  	int idx = 0;
>  	struct net *net = sock_net(skb->sk);
> +	const struct net_device_ops *ops;
>  	struct net_device *dev;
> +	struct ndmsg *ndm;
>
> -	rcu_read_lock();
> -	for_each_netdev_rcu(net, dev) {
> -		if (dev->priv_flags & IFF_BRIDGE_PORT) {
> -			struct net_device *br_dev;
> -			const struct net_device_ops *ops;
> -
> -			br_dev = netdev_master_upper_dev_get(dev);
> -			ops = br_dev->netdev_ops;
> -			if (ops->ndo_fdb_dump)
> -				idx = ops->ndo_fdb_dump(skb, cb, dev, idx);
> +	ndm = nlmsg_data(cb->nlh);
> +	if (ndm->ndm_ifindex) {

We get really lucky here that ndm_ifindex and ifi_index happen to map to
the same location.

> +		dev = __dev_get_by_index(net, ndm->ndm_ifindex);
> +		if (dev == NULL) {
> +			pr_info("PF_BRIDGE: RTM_GETNEIGH with unknown ifindex\n");
> +			return -ENODEV;
> +		}
> +	
> +		if (!(dev->priv_flags & IFF_EBRIDGE)) {
> +			pr_info("PF_BRIDGE: RTM_GETNEIGH %s not a bridge device\n",
> +				dev->name);
> +			return -EINVAL;
>  		}
> +		ops = dev->netdev_ops;
> +		if (ops->ndo_fdb_dump) {
> +			idx = ops->ndo_fdb_dump(skb, cb, dev, idx);
> +		} else {
> +			pr_info("PF_BRIDGE: RTM_GETNEIGH %s no dumper\n",
> +				dev->name);
> +			return -EINVAL;
> +		}

I agree with both of Johns commens fro the above code.
I think you can use ndo_dflt_fdb_dump() here and remove the first check
for IFF_EBRIDGE.

The only odd thing is that it would permit syntax like:
 # bridge fbd show br eth0
or
 # bridge fdb show br macvlan0

but I think that's ok.

> +	} else {
> +		rcu_read_lock();
> +		for_each_netdev_rcu(net, dev) {
> +			if (dev->priv_flags & IFF_BRIDGE_PORT) {
> +				struct net_device *br_dev;
> +				br_dev = netdev_master_upper_dev_get(dev);
> +				ops = br_dev->netdev_ops;
> +				if (ops->ndo_fdb_dump)
> +					idx = ops->ndo_fdb_dump(skb, cb, dev, idx);
> +			}
>
> -		if (dev->netdev_ops->ndo_fdb_dump)
> -			idx = dev->netdev_ops->ndo_fdb_dump(skb, cb, dev, idx);
> -		else
> -			idx = ndo_dflt_fdb_dump(skb, cb, dev, idx);
> +			if (dev->netdev_ops->ndo_fdb_dump)
> +				idx = dev->netdev_ops->ndo_fdb_dump(skb, cb, dev,
> +								    idx);
> +			else
> +				idx = ndo_dflt_fdb_dump(skb, cb, dev, idx);
> +		}
> +		rcu_read_unlock();
>  	}
> -	rcu_read_unlock();
>
>  	cb->args[0] = idx;
>  	return skb->len;
>
>
> iprt-fdb-brfilter1
>
>
> diff --git a/bridge/fdb.c b/bridge/fdb.c
> index e2e53f1..f3073d6 100644
> --- a/bridge/fdb.c
> +++ b/bridge/fdb.c
> @@ -33,7 +33,7 @@ static void usage(void)
>  	fprintf(stderr, "Usage: bridge fdb { add | append | del | replace }
ADDR dev DEV {self|master} [ temp ]\n"
>  		        "              [router] [ dst IPADDR] [ vlan VID ]\n"
>  		        "              [ port PORT] [ vni VNI ] [via DEV]\n");
> -	fprintf(stderr, "       bridge fdb {show} [ dev DEV ]\n");
> +	fprintf(stderr, "       bridge fdb {show} [ br BRDEV ] [ dev DEV ]\n");

'port' option is now allowed in the show operation

-vlad

>  	exit(-1);
>  }
>
> @@ -152,18 +152,35 @@ int print_fdb(const struct sockaddr_nl *who,
struct nlmsghdr *n, void *arg)
>
>  static int fdb_show(int argc, char **argv)
>  {
> +	struct ndmsg ndm = { };
>  	char *filter_dev = NULL;
> +	char *br = NULL;
> +
> +	ndm.ndm_family = PF_BRIDGE;
> +	ndm.ndm_state = NUD_NOARP;
>
>  	while (argc > 0) {
> -		if (strcmp(*argv, "dev") == 0) {
> +		if ((strcmp(*argv, "port") == 0) || strcmp(*argv, "dev") == 0) {
>  			NEXT_ARG();
> -			if (filter_dev)
> -				duparg("dev", *argv);
>  			filter_dev = *argv;
> +		} else if (strcmp(*argv, "br") == 0) {
> +			NEXT_ARG();
> +			br = *argv;
> +		} else {
> +			if (matches(*argv, "help") == 0)
> +				usage();
>  		}
>  		argc--; argv++;
>  	}
>
> +	if (br) {
> +		ndm.ndm_ifindex = ll_name_to_index(br);
> +		if (ndm.ndm_ifindex == 0) {
> +			fprintf(stderr, "Cannot find bridge device \"%s\"\n", br);
> +			return -1;
> +		}
> +	}
> +
>  	if (filter_dev) {
>  		filter_index = if_nametoindex(filter_dev);
>  		if (filter_index == 0) {
> @@ -171,13 +188,15 @@ static int fdb_show(int argc, char **argv)
>  				filter_dev);
>  			return -1;
>  		}
> +
>  	}
>
> -	if (rtnl_wilddump_request(&rth, PF_BRIDGE, RTM_GETNEIGH) < 0) {
> +	if (rtnl_dump_request(&rth, RTM_GETNEIGH, &ndm, sizeof(struct
ndmsg)) < 0) {
>  		perror("Cannot send dump request");
>  		exit(1);
>  	}
>
> +
>  	if (rtnl_dump_filter(&rth, print_fdb, stdout) < 0) {
>  		fprintf(stderr, "Dump terminated\n");
>  		exit(1);
>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamal Hadi Salim Feb. 11, 2014, 5:03 p.m. UTC | #3

On 02/09/14 14:33, John Fastabend wrote:
> On 02/09/2014 07:06 AM, Jamal Hadi Salim wrote:

>
>> +    if (ndm->ndm_ifindex) {
>> +        dev = __dev_get_by_index(net, ndm->ndm_ifindex);
>> +        if (dev == NULL) {
>> +            pr_info("PF_BRIDGE: RTM_GETNEIGH with unknown ifindex\n");
>> +            return -ENODEV;
>> +        }
>> +
>> +        if (!(dev->priv_flags & IFF_EBRIDGE)) {
>
> Can we drop this 'if case' and just use the 'if (ops->ndo_fdb_dump)'
> below? IFF_EBRIDGE is specific to ./net/bridge so it will fail for
> macvlans and I think the command is useful in both cases.
>

My only worry is:
The target ndm_ifindex is to a bridge device i.e something
that has an "fdb" (user space says "br X" then we use X to
mean we are talking about a bridge device. This is what
brctl showmacs <X> means.
I know this doesnt seem right relative to current point of
view where the target is the bridge port ifindex.
I'd be more than happy to make the change if it makes sense,
but i am struggling with that thought.

>
>> +            pr_info("PF_BRIDGE: RTM_GETNEIGH %s not a bridge device\n",
>> +                dev->name);
>> +            return -EINVAL;
>>          }
>> +        ops = dev->netdev_ops;
>> +        if (ops->ndo_fdb_dump) {
>> +            idx = ops->ndo_fdb_dump(skb, cb, dev, idx);
>> +        } else {
>
> Is there any problem with using the ndo_dflt_fdb_dump() in the else
> here?
>

If the assumption is the target ifindex is to a bridge device, I would
expect it MUST have a ops->ndo_fdb_dump(), no?
the default dumper deals with bridge ports (which may be bridges, but
if i wanted their databases i would address them)

> Userspace should be able to easily learn which ports are ebridge ports
> so I don't think that should be an issue. Anyways with the above
> IFF_EBRIDGE check you should never hit this else case although I think
> its safe to drop the above check as noted.
>

Right - so if i can find out which is an ebridge i can send it the
same request.

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamal Hadi Salim Feb. 11, 2014, 5:07 p.m. UTC | #4

On 02/10/14 11:31, Vlad Yasevich wrote:
> On 02/09/2014 10:06 AM, Jamal Hadi Salim wrote:

>> +	ndm = nlmsg_data(cb->nlh);
>> +	if (ndm->ndm_ifindex) {
>
> We get really lucky here that ndm_ifindex and ifi_index happen to map to
> the same location.
>

Didnt follow - but I have a feeling you are looking at the reference
point of a bridge port.
Note as per my response to John: The target is a bridge device, not
a bridge port.

>
> I agree with both of Johns commens fro the above code.
> I think you can use ndo_dflt_fdb_dump() here and remove the first check
> for IFF_EBRIDGE.
>

Same comment i made to John. The goal is to emulate
brctl showmacs <bridge>
ndo_dflt_fdb_dump() gives me in theory all the bridge ports
unicast and multicast MAC addresses. There is a posibility that
the bridgeport is a bridge - in which case I can find out from
user space and safely request for it directly instead of via
its parent.

> The only odd thing is that it would permit syntax like:
>   # bridge fbd show br eth0
> or
>   # bridge fdb show br macvlan0
>
> but I think that's ok.

Ok, since both you and John point to macvlan - is that
considered as something with an fdb? It doesnt forward
packets between two devices.

>> diff --git a/bridge/fdb.c b/bridge/fdb.c
>> index e2e53f1..f3073d6 100644
>> --- a/bridge/fdb.c
>> +++ b/bridge/fdb.c
>> @@ -33,7 +33,7 @@ static void usage(void)
>>   	fprintf(stderr, "Usage: bridge fdb { add | append | del | replace }
> ADDR dev DEV {self|master} [ temp ]\n"
>>   		        "              [router] [ dst IPADDR] [ vlan VID ]\n"
>>   		        "              [ port PORT] [ vni VNI ] [via DEV]\n");
>> -	fprintf(stderr, "       bridge fdb {show} [ dev DEV ]\n");
>> +	fprintf(stderr, "       bridge fdb {show} [ br BRDEV ] [ dev DEV ]\n");
>
> 'port' option is now allowed in the show operation
>

Thanks - it is already taken seems by vxlan using the same interface.

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Vlad Yasevich Feb. 11, 2014, 6:21 p.m. UTC | #5

On 02/11/2014 12:07 PM, Jamal Hadi Salim wrote:
> On 02/10/14 11:31, Vlad Yasevich wrote:
>> On 02/09/2014 10:06 AM, Jamal Hadi Salim wrote:
> 
> 
>>> +    ndm = nlmsg_data(cb->nlh);
>>> +    if (ndm->ndm_ifindex) {
>>
>> We get really lucky here that ndm_ifindex and ifi_index happen to map to
>> the same location.
>>
> 
> Didnt follow - but I have a feeling you are looking at the reference
> point of a bridge port.
> Note as per my response to John: The target is a bridge device, not
> a bridge port.
> 

No, this was more the point that the current iproute code sends an
ifinfomsg struct down, and you change that to send ndmsg struct.
This is risky, but we luck out since the index is at the same offset
in both structs.

> 
> 
>>
>> I agree with both of Johns commens fro the above code.
>> I think you can use ndo_dflt_fdb_dump() here and remove the first check
>> for IFF_EBRIDGE.
>>
> 
> Same comment i made to John. The goal is to emulate
> brctl showmacs <bridge>
> ndo_dflt_fdb_dump() gives me in theory all the bridge ports
> unicast and multicast MAC addresses. There is a posibility that
> the bridgeport is a bridge - in which case I can find out from
> user space and safely request for it directly instead of via
> its parent.
> 

But that would only happen if the user said:
  # bridge fdb show br eth0

If eth0 in this case is a hw bridge device, getting the device's
version of fdb data is exactly what would be expected, isn't it?

If you mean a 'software bridge' above, then that's not an issue
since that's a disallowed config.  You can't stack software bridges
without something in the middle like bond or vlan.

>> The only odd thing is that it would permit syntax like:
>>   # bridge fbd show br eth0
>> or
>>   # bridge fdb show br macvlan0
>>
>> but I think that's ok.
> 
> Ok, since both you and John point to macvlan - is that
> considered as something with an fdb? It doesnt forward
> packets between two devices.
> 

Yes, macvlan can forward data to other macvlans, but that's
not the interesting thing.
When you configure multiple macvlan devices on top of the
same hw device, one could think of the hw device as a sort
of a bridge.  It's not really, but you could define it in
those terms.  The fdb entries, in this case, contain the mac
addresses of the macvlan devices.

> 
> 
>>> diff --git a/bridge/fdb.c b/bridge/fdb.c
>>> index e2e53f1..f3073d6 100644
>>> --- a/bridge/fdb.c
>>> +++ b/bridge/fdb.c
>>> @@ -33,7 +33,7 @@ static void usage(void)
>>>       fprintf(stderr, "Usage: bridge fdb { add | append | del |
>>> replace }
>> ADDR dev DEV {self|master} [ temp ]\n"
>>>                   "              [router] [ dst IPADDR] [ vlan VID ]\n"
>>>                   "              [ port PORT] [ vni VNI ] [via DEV]\n");
>>> -    fprintf(stderr, "       bridge fdb {show} [ dev DEV ]\n");
>>> +    fprintf(stderr, "       bridge fdb {show} [ br BRDEV ] [ dev DEV
>>> ]\n");
>>
>> 'port' option is now allowed in the show operation
>>
> 
> Thanks - it is already taken seems by vxlan using the same interface.
> 

Sorry, I wasn't very clear. What I meant was that you now support
  # bridge fdb show port <>

The usage message should reflect it.

-vlad
> 
> cheers,
> jamal
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamal Hadi Salim Feb. 11, 2014, 8:15 p.m. UTC | #6

On 02/11/14 13:21, Vlad Yasevich wrote:
> On 02/11/2014 12:07 PM, Jamal Hadi Salim wrote:
>> On 02/10/14 11:31, Vlad Yasevich wrote:

> No, this was more the point that the current iproute code sends an
> ifinfomsg struct down, and you change that to send ndmsg struct.
> This is risky, but we luck out since the index is at the same offset
> in both structs.
>

ah, ok, thanks for catching that. I should have said something - the
original code was wrong and i felt it was safe to make the change
given that the kernel code never even looked at what was being
sent to it. There is asymetry desires which are violated.
It doesnt make sense to send and ifm and expect back an ndm.
I should send that separately as a bug fix.

> But that would only happen if the user said:
>    # bridge fdb show br eth0
>
> If eth0 in this case is a hw bridge device, getting the device's
> version of fdb data is exactly what would be expected, isn't it?
>

Well, if it is a "bridge device" why would it not be tagged as a bridge
device?

> If you mean a 'software bridge' above, then that's not an issue
> since that's a disallowed config.  You can't stack software bridges
> without something in the middle like bond or vlan.
>

Ok, didnt realize that.
So i cant add a bridge as a bridge port to another bridge?

>
> Yes, macvlan can forward data to other macvlans, but that's
> not the interesting thing.

Sample config?

> When you configure multiple macvlan devices on top of the
> same hw device, one could think of the hw device as a sort
> of a bridge.  It's not really, but you could define it in
> those terms.  The fdb entries, in this case, contain the mac
> addresses of the macvlan devices.
>

It certainly has some equivalent semantics (looks at dst MAC then
picks the port). Possible to add Vlans as well?
Why dont we tag such a thing as a bridge then?

>
> Sorry, I wasn't very clear. What I meant was that you now support
>    # bridge fdb show port <>
>
> The usage message should reflect it.
>

Sorry - I noticed the word "port" at exactly where your quote came.
So i thought you noticed that "port" was already taken - it is used
for VXLAN fdb entries (for udp ports).

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

John Fastabend Feb. 11, 2014, 8:21 p.m. UTC | #7

On 2/11/2014 12:15 PM, Jamal Hadi Salim wrote:
>
>>
>> Yes, macvlan can forward data to other macvlans, but that's
>> not the interesting thing.
>
> Sample config?

ip link add link ethx name mv1 type macvlan mode bridge
ip link add link ethx name mv2 type macvlan mode bridge

Now you have a macvlan on ethx that will forward data between
mv1 and mv2.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

John Fastabend Feb. 11, 2014, 8:30 p.m. UTC | #8

On 2/11/2014 12:15 PM, Jamal Hadi Salim wrote:
> On 02/11/14 13:21, Vlad Yasevich wrote:
>> On 02/11/2014 12:07 PM, Jamal Hadi Salim wrote:
>>> On 02/10/14 11:31, Vlad Yasevich wrote:
>
>> No, this was more the point that the current iproute code sends an
>> ifinfomsg struct down, and you change that to send ndmsg struct.
>> This is risky, but we luck out since the index is at the same offset
>> in both structs.
>>
>
> ah, ok, thanks for catching that. I should have said something - the
> original code was wrong and i felt it was safe to make the change
> given that the kernel code never even looked at what was being
> sent to it. There is asymetry desires which are violated.
> It doesnt make sense to send and ifm and expect back an ndm.
> I should send that separately as a bug fix.
>
>
>> But that would only happen if the user said:
>>    # bridge fdb show br eth0
>>
>> If eth0 in this case is a hw bridge device, getting the device's
>> version of fdb data is exactly what would be expected, isn't it?
>>
>
> Well, if it is a "bridge device" why would it not be tagged as a bridge
> device?

What do you mean by "bridge device" are you specifically talking about
IFF_BRIDGE flag? This flag is used only for ./net/bridge devices. For
example macvlan uses its own flag. I think there is a good case to be
made for netdevices which are acting as the management interface for a
hardware bridge to set an identifying flag. Perhaps IFF_HWBRIDGE.

>
>> If you mean a 'software bridge' above, then that's not an issue
>> since that's a disallowed config.  You can't stack software bridges
>> without something in the middle like bond or vlan.
>>
>
> Ok, didnt realize that.
> So i cant add a bridge as a bridge port to another bridge?
>

# ip link set dev bridge0 master bridge1
RTNETLINK answers: Too many levels of symbolic links

in the bridge case this doesn't work. But you can stack a macvlan
on top of the bridge port,

# ip link add link bridge0 type macvlan mode vepa

11: macvlan0@bridge0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop 
state DOWN mode DEFAULT group default

And macvlans on macvlans is OK as well.

# ip link add link macvlan0 type macvlan mode vepa

[...]

>> When you configure multiple macvlan devices on top of the
>> same hw device, one could think of the hw device as a sort
>> of a bridge.  It's not really, but you could define it in
>> those terms.  The fdb entries, in this case, contain the mac
>> addresses of the macvlan devices.
>>
>
> It certainly has some equivalent semantics (looks at dst MAC then
> picks the port). Possible to add Vlans as well?
> Why dont we tag such a thing as a bridge then?
>

If its useful then we should. You can track them down in userspace
via /sys/class/net/ or looking for offloaded netdevices that point
to the interface but a flag is definitely more direct.

>>
>> Sorry, I wasn't very clear. What I meant was that you now support
>>    # bridge fdb show port <>
>>
>> The usage message should reflect it.
>>
>
> Sorry - I noticed the word "port" at exactly where your quote came.
> So i thought you noticed that "port" was already taken - it is used
> for VXLAN fdb entries (for udp ports).
>
>
> cheers,
> jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Vlad Yasevich Feb. 11, 2014, 9 p.m. UTC | #9

On 02/11/2014 03:15 PM, Jamal Hadi Salim wrote:
> On 02/11/14 13:21, Vlad Yasevich wrote:
>> On 02/11/2014 12:07 PM, Jamal Hadi Salim wrote:
>>> On 02/10/14 11:31, Vlad Yasevich wrote:
> 
>> No, this was more the point that the current iproute code sends an
>> ifinfomsg struct down, and you change that to send ndmsg struct.
>> This is risky, but we luck out since the index is at the same offset
>> in both structs.
>>
> 
> ah, ok, thanks for catching that. I should have said something - the
> original code was wrong and i felt it was safe to make the change
> given that the kernel code never even looked at what was being
> sent to it. There is asymetry desires which are violated.
> It doesnt make sense to send and ifm and expect back an ndm.
> I should send that separately as a bug fix.
> 
> 
>> But that would only happen if the user said:
>>    # bridge fdb show br eth0
>>
>> If eth0 in this case is a hw bridge device, getting the device's
>> version of fdb data is exactly what would be expected, isn't it?
>>
> 
> Well, if it is a "bridge device" why would it not be tagged as a bridge
> device?

Because it just a multi-function nic that isn't tagged with any
kine of bridge flag.  As John said, this might be useful, but not
done yet.

> 
>> If you mean a 'software bridge' above, then that's not an issue
>> since that's a disallowed config.  You can't stack software bridges
>> without something in the middle like bond or vlan.
>>
> 
> Ok, didnt realize that.
> So i cant add a bridge as a bridge port to another bridge?

Not directly.  However, if you put a layered software device in between
(vlan, bond, macvlan), then you can add that device to another bridge.
In fact, people do that to get GVRP working with VMs.

> 
>>
>> Yes, macvlan can forward data to other macvlans, but that's
>> not the interesting thing.
> 
> Sample config?
> 
>> When you configure multiple macvlan devices on top of the
>> same hw device, one could think of the hw device as a sort
>> of a bridge.  It's not really, but you could define it in
>> those terms.  The fdb entries, in this case, contain the mac
>> addresses of the macvlan devices.
>>
> 
> It certainly has some equivalent semantics (looks at dst MAC then
> picks the port). Possible to add Vlans as well?

I suppose.   You can do things like:
# ip link add link eth0 dev vlan100 protocol 8021Q id 100
# ip link add link vlan0 dev mac100 type macvlan

Now, you have a macvlan (mac100) that will only receive vlan100 traffic.
Expressing this in terms of fdb would be a bit difficult since each
interface is separate and eth0 doesn't really know about the stack.
It would require quite a lot of code.

> Why dont we tag such a thing as a bridge then?
> 

Because they are not always a bridge.  It could be just a nic capable of
mac filtering.

>>
>> Sorry, I wasn't very clear. What I meant was that you now support
>>    # bridge fdb show port <>
>>
>> The usage message should reflect it.
>>
> 
> Sorry - I noticed the word "port" at exactly where your quote came.
> So i thought you noticed that "port" was already taken - it is used
> for VXLAN fdb entries (for udp ports).
>

Didn't realize it has different connotation for vxlan.  The you probably
don't want to include and support in the bridge fdb show command.

-vlad

> 
> cheers,
> jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamal Hadi Salim Feb. 11, 2014, 9:04 p.m. UTC | #10

On 02/11/14 15:30, John Fastabend wrote:
> On 2/11/2014 12:15 PM, Jamal Hadi Salim wrote:

Thanks for the example on the other email.

> What do you mean by "bridge device" are you specifically talking about
> IFF_BRIDGE flag? This flag is used only for ./net/bridge devices.

Right - the simple definition is this thing has an fdb.
Yes, I know weve added vlan filtering and multicast snooping
but thats all lipstick. If it has an (ethernet) fdb it is a bridge.

>For
> example macvlan uses its own flag. I think there is a good case to be
> made for netdevices which are acting as the management interface for a
> hardware bridge to set an identifying flag. Perhaps IFF_HWBRIDGE.
>

If you introduce IFF_HWBRIDGE - I think that would satisfy the
distinction. The question then is why not just tag it IFF_BRIDGE?

>
> # ip link set dev bridge0 master bridge1
> RTNETLINK answers: Too many levels of symbolic links
>

pourquoi?  If the original rationale was to limit the
broadcast domain scope it sounds strange that a bridge in
the form a macvlan is allowed.

> in the bridge case this doesn't work. But you can stack a macvlan
> on top of the bridge port,
>
> # ip link add link bridge0 type macvlan mode vepa
>
> 11: macvlan0@bridge0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop
> state DOWN mode DEFAULT group default
>
> And macvlans on macvlans is OK as well.
>
> # ip link add link macvlan0 type macvlan mode vepa
>
> [...]
>

Ok, I need to let that sink in. Cool actually.

>
> If its useful then we should. You can track them down in userspace
> via /sys/class/net/ or looking for offloaded netdevices that point
> to the interface but a flag is definitely more direct.
>

I prefer a flag. Then i can deal with it via netlink.

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamal Hadi Salim Feb. 11, 2014, 9:08 p.m. UTC | #11

On 02/11/14 16:00, Vlad Yasevich wrote:
> On 02/11/2014 03:15 PM, Jamal Hadi Salim wrote:

>
> Because it just a multi-function nic that isn't tagged with any
> kine of bridge flag.  As John said, this might be useful, but not
> done yet.
>

Ok, fair enough. Someone should send a patch - John perhaps.

>
> Not directly.  However, if you put a layered software device in between
> (vlan, bond, macvlan), then you can add that device to another bridge.
> In fact, people do that to get GVRP working with VMs.
>

Do you recall the reasoning behind it?


>> It certainly has some equivalent semantics (looks at dst MAC then
>> picks the port). Possible to add Vlans as well?
>
> I suppose.   You can do things like:
> # ip link add link eth0 dev vlan100 protocol 8021Q id 100
> # ip link add link vlan0 dev mac100 type macvlan
>
> Now, you have a macvlan (mac100) that will only receive vlan100 traffic.
> Expressing this in terms of fdb would be a bit difficult since each
> interface is separate and eth0 doesn't really know about the stack.
> It would require quite a lot of code.
>

nice.

>> Why dont we tag such a thing as a bridge then?
>>
>
> Because they are not always a bridge.  It could be just a nic capable of
> mac filtering.
>

I think in one of the modes it is merely a filter.
But you turn on this other feature it is a bridge.

>
> Didn't realize it has different connotation for vxlan.  The you probably
> don't want to include and support in the bridge fdb show command.

Thats what i thought you said earlier ;->

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamal Hadi Salim Feb. 11, 2014, 9:12 p.m. UTC | #12

On 02/11/14 16:08, Jamal Hadi Salim wrote:
> On 02/11/14 16:00, Vlad Yasevich wrote:
>> On 02/11/2014 03:15 PM, Jamal Hadi Salim wrote:
>

> I think in one of the modes it is merely a filter.
> But you turn on this other feature it is a bridge.

IOW, VEPA should turn off IFF_BRIDGE and VEB should
turn it on. We probably need an event generated.


cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

John Fastabend Feb. 12, 2014, 6:50 p.m. UTC | #13

On 2/11/2014 1:04 PM, Jamal Hadi Salim wrote:
> On 02/11/14 15:30, John Fastabend wrote:
>> On 2/11/2014 12:15 PM, Jamal Hadi Salim wrote:
>
> Thanks for the example on the other email.
>

Just to wrap things up in one email. Changing between VEB and VEPA
modes already triggers an event. So management applications can listen
for this.

And I can send out a patch to add a flag to hardware bridge devices
I'll likely get to it next week sometime unless someone beats me
to it.

>
>> What do you mean by "bridge device" are you specifically talking about
>> IFF_BRIDGE flag? This flag is used only for ./net/bridge devices.
>
> Right - the simple definition is this thing has an fdb.
> Yes, I know weve added vlan filtering and multicast snooping
> but thats all lipstick. If it has an (ethernet) fdb it is a bridge.
>

Sure, IEEE802.1Q would call these edge relays.

>> For
>> example macvlan uses its own flag. I think there is a good case to be
>> made for netdevices which are acting as the management interface for a
>> hardware bridge to set an identifying flag. Perhaps IFF_HWBRIDGE.
>>
>
> If you introduce IFF_HWBRIDGE - I think that would satisfy the
> distinction. The question then is why not just tag it IFF_BRIDGE?
>

Because it is not the same type of object as the software bridge.
Most notably it doesn't do learning. If anything its more like a
macvlan device and we could just as easily tag it IFF_MACVLAN. So
because it doesn't really match 1:1 with either of those object I
would just presume give its own flag. Userspace can always create
a small macro call it is_bridge_like() and check for any of the
handful of bridge like objects.

>>
>> # ip link set dev bridge0 master bridge1
>> RTNETLINK answers: Too many levels of symbolic links
>>
>
> pourquoi?  If the original rationale was to limit the
> broadcast domain scope it sounds strange that a bridge in
> the form a macvlan is allowed.
>

Agreed. But there it is.

>> in the bridge case this doesn't work. But you can stack a macvlan
>> on top of the bridge port,
>>
>> # ip link add link bridge0 type macvlan mode vepa
>>
>> 11: macvlan0@bridge0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop
>> state DOWN mode DEFAULT group default
>>
>> And macvlans on macvlans is OK as well.
>>
>> # ip link add link macvlan0 type macvlan mode vepa
>>
>> [...]
>>
>
> Ok, I need to let that sink in. Cool actually.
>

Also note you can mix VEB and VEPA modes and if you do it correctly
can create blocks of virtual ports that can do east-west traffic and
isolate others.

>
>>
>> If its useful then we should. You can track them down in userspace
>> via /sys/class/net/ or looking for offloaded netdevices that point
>> to the interface but a flag is definitely more direct.
>>
>
> I prefer a flag. Then i can deal with it via netlink.
>

OK. I'll add one here shortly.

> cheers,
> jamal
>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Vlad Yasevich Feb. 12, 2014, 7:02 p.m. UTC | #14

On 02/11/2014 04:08 PM, Jamal Hadi Salim wrote:
> On 02/11/14 16:00, Vlad Yasevich wrote:
>> On 02/11/2014 03:15 PM, Jamal Hadi Salim wrote:
> 
>>
>> Because it just a multi-function nic that isn't tagged with any
>> kine of bridge flag.  As John said, this might be useful, but not
>> done yet.
>>
> 
> Ok, fair enough. Someone should send a patch - John perhaps.
> 
>>
>> Not directly.  However, if you put a layered software device in between
>> (vlan, bond, macvlan), then you can add that device to another bridge.
>> In fact, people do that to get GVRP working with VMs.
>>
> 
> Do you recall the reasoning behind it?

Before my time. It's there since before 2.6.12 :)

-vlad

> 
> 
>>> It certainly has some equivalent semantics (looks at dst MAC then
>>> picks the port). Possible to add Vlans as well?
>>
>> I suppose.   You can do things like:
>> # ip link add link eth0 dev vlan100 protocol 8021Q id 100
>> # ip link add link vlan0 dev mac100 type macvlan
>>
>> Now, you have a macvlan (mac100) that will only receive vlan100 traffic.
>> Expressing this in terms of fdb would be a bit difficult since each
>> interface is separate and eth0 doesn't really know about the stack.
>> It would require quite a lot of code.
>>
> 
> nice.
> 
>>> Why dont we tag such a thing as a bridge then?
>>>
>>
>> Because they are not always a bridge.  It could be just a nic capable of
>> mac filtering.
>>
> 
> I think in one of the modes it is merely a filter.
> But you turn on this other feature it is a bridge.
> 
>>
>> Didn't realize it has different connotation for vxlan.  The you probably
>> don't want to include and support in the bridge fdb show command.
> 
> Thats what i thought you said earlier ;->
> 
> cheers,
> jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamal Hadi Salim Feb. 13, 2014, 12:50 p.m. UTC | #15

On 02/12/14 13:50, John Fastabend wrote:

>
> Just to wrap things up in one email. Changing between VEB and VEPA
> modes already triggers an event. So management applications can listen
> for this.
>

Ok - reasonable.

> And I can send out a patch to add a flag to hardware bridge devices
> I'll likely get to it next week sometime unless someone beats me
> to it.
>

You understand this better - so i will wait.
I'll send an updated version of the patch this weekend
now that net-next is open.

>
> Sure, IEEE802.1Q would call these edge relays.
>

Ok, so what older kids used to call "repeaters".
The more i think about it, the more it looks like this is
still a bridge and we have a bridgeport mode as VEPA vs VEB.
IOW, as you said - you can have a bridge with mix and match of
VEB/VEPA. We can easily add such a feature to the software bridge
as well. It sounds simple and useful enough.

>
> Because it is not the same type of object as the software bridge.
> Most notably it doesn't do learning. If anything its more like a
> macvlan device and we could just as easily tag it IFF_MACVLAN. So
> because it doesn't really match 1:1 with either of those object I
> would just presume give its own flag. Userspace can always create
> a small macro call it is_bridge_like() and check for any of the
> handful of bridge like objects.
>

I think VEB/VEP may be somehow covering the "port" aspect.
The challenge is what to call "eth0 when running in SRIOV"

>>>
>>> # ip link set dev bridge0 master bridge1
>>> RTNETLINK answers: Too many levels of symbolic links
>>>
>>
>> pourquoi?  If the original rationale was to limit the
>> broadcast domain scope it sounds strange that a bridge in
>> the form a macvlan is allowed.
>>
>
> Agreed. But there it is.
>

I am sure someone knows why - Stephen?

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamal Hadi Salim Feb. 13, 2014, 3:37 p.m. UTC | #16

On 02/12/14 13:50, John Fastabend wrote:
> On 2/11/2014 1:04 PM, Jamal Hadi Salim wrote:
>
> Because it is not the same type of object as the software bridge.
> Most notably it doesn't do learning.

This kept nagging at me.
Learning is optional for a bridge. So is flooding.
I think we got this right in recent kernels.

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

John Fastabend Feb. 13, 2014, 4:03 p.m. UTC | #17

On 2/13/2014 7:37 AM, Jamal Hadi Salim wrote:
> On 02/12/14 13:50, John Fastabend wrote:
>> On 2/11/2014 1:04 PM, Jamal Hadi Salim wrote:
>>
>> Because it is not the same type of object as the software bridge.
>> Most notably it doesn't do learning.
>
> This kept nagging at me.
> Learning is optional for a bridge. So is flooding.
> I think we got this right in recent kernels.
>
> cheers,
> jamal

Yeah I remember now Vlad added them. The real distinction
is macvlan devices only have _one_ uplink port and support
other forwarding modes, VEPA, etc.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RFC: bridge get fdb by bridge device

Commit Message

Comments

Patch