diff mbox

[net-next,RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.

Message ID c6756bd161048ac4b4407e308045fe73.squirrel@poczta.wsisiz.edu.pl
State Rejected, archived
Delegated to: David Miller
Headers show

Commit Message

Hubert Sokolowski Dec. 10, 2014, 7:37 p.m. UTC
This change restores the semantic that was present
before 5e6d243587990a588143b9da3974833649595587
"bridge: netlink dump interface at par with brctl"
on how ndo_dflt_fdb_dump is called.
This semantic is still used for add and del operations
so let's keep it consistent.
Driver can still call ndo_dflt_fdb_dump from inside
its own fdb_dump routine when needed.

Signed-off-by: Hubert Sokolowski <h.sokolowski@wit.edu.pl>
---
 net/core/rtnetlink.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

David Miller Dec. 11, 2014, 4:32 a.m. UTC | #1
From: "Hubert Sokolowski" <h.sokolowski@wit.edu.pl>
Date: Wed, 10 Dec 2014 19:37:01 -0000

> This change restores the semantic that was present
> before 5e6d243587990a588143b9da3974833649595587
> "bridge: netlink dump interface at par with brctl"
> on how ndo_dflt_fdb_dump is called.
> This semantic is still used for add and del operations
> so let's keep it consistent.
> Driver can still call ndo_dflt_fdb_dump from inside
> its own fdb_dump routine when needed.
> 
> Signed-off-by: Hubert Sokolowski <h.sokolowski@wit.edu.pl>

Jamal, please review.

> ---
>  net/core/rtnetlink.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index eaa057f..a9e5c37 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -2692,10 +2692,11 @@ static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
>  							 idx);
>  		}
> 
> -		idx = ndo_dflt_fdb_dump(skb, cb, dev, NULL, idx);
>  		if (dev->netdev_ops->ndo_fdb_dump)
>  			idx = dev->netdev_ops->ndo_fdb_dump(skb, cb, bdev, dev,
>  							    idx);
> +		else
> +			idx = ndo_dflt_fdb_dump(skb, cb, dev, NULL, idx);
> 
>  		cops = NULL;
>  	}
> -- 
> 1.9.3
> 
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roopa Prabhu Dec. 11, 2014, 7:31 a.m. UTC | #2
On 12/10/14, 11:37 AM, Hubert Sokolowski wrote:
> This change restores the semantic that was present
> before 5e6d243587990a588143b9da3974833649595587
> "bridge: netlink dump interface at par with brctl"
> on how ndo_dflt_fdb_dump is called.
> This semantic is still used for add and del operations
> so let's keep it consistent.
> Driver can still call ndo_dflt_fdb_dump from inside
> its own fdb_dump routine when needed.
>
> Signed-off-by: Hubert Sokolowski <h.sokolowski@wit.edu.pl>
> ---
>   net/core/rtnetlink.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index eaa057f..a9e5c37 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -2692,10 +2692,11 @@ static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
>   							 idx);
>   		}
>
> -		idx = ndo_dflt_fdb_dump(skb, cb, dev, NULL, idx);
>   		if (dev->netdev_ops->ndo_fdb_dump)
>   			idx = dev->netdev_ops->ndo_fdb_dump(skb, cb, bdev, dev,
>   							    idx);
> +		else
> +			idx = ndo_dflt_fdb_dump(skb, cb, dev, NULL, idx);
>
>   		cops = NULL;
>   	}
I think commit
"5e6d243587990a588143b9da3974833649595587 "bridge: netlink dump 
interface at par with brctl" tried to make sure even the dflt entries 
(ie dev->uc and dev->mc) were also printed in the below case. ie the 
'self' entries in the below output.

# ./bridge fdb show brport eth1
02:00:00:12:01:02 vlan 0 master br0 permanent
00:17:42:8a:b4:05 vlan 0 master br0 permanent
00:17:42:8a:b4:07 self permanent
33:33:00:00:00:01 self permanent

Am guessing reverting the patch is going to make the 'self' entries in 
the above output to go away.
Can you please confirm ?.

Also, if i hear your concern correctly, for bridge ports that implement 
ndo_fdb_dump, with commit 5e6d243587990a588143b9da3974833649595587, we 
will get two entries for each 'self' entry above.
Can you also paste sample output for that ?.

Thanks,
Roopa


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jamal Hadi Salim Dec. 11, 2014, 11:49 a.m. UTC | #3
On 12/10/14 23:32, David Miller wrote:
> From: "Hubert Sokolowski" <h.sokolowski@wit.edu.pl>
> Date: Wed, 10 Dec 2014 19:37:01 -0000
>
>> This change restores the semantic that was present
>> before 5e6d243587990a588143b9da3974833649595587
>> "bridge: netlink dump interface at par with brctl"
>> on how ndo_dflt_fdb_dump is called.
>> This semantic is still used for add and del operations
>> so let's keep it consistent.
>> Driver can still call ndo_dflt_fdb_dump from inside
>> its own fdb_dump routine when needed.
>>
>> Signed-off-by: Hubert Sokolowski <h.sokolowski@wit.edu.pl>
>
> Jamal, please review.
>

It wont work. As pointed out by Roopa in
the other email dev->uc/mc will not get dumped with this
change. Vlad will be in a better position to comment.
CCing Vlad.

Hubert, immediate gratification never works on netdev.
I advised you to run the commit tests in at least
2 emails when you contacted me privately before posting.
It would have chewed about 5 minutes of your time.
I am sure it cost Roopa at least 1 hour. And if Dave
had sucked in your innocent looking patch we'd be playing
damage control after which is a lot more expensive.

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hubert Sokolowski Dec. 11, 2014, 4:39 p.m. UTC | #4
> I think commit
> "5e6d243587990a588143b9da3974833649595587 "bridge: netlink dump
> interface at par with brctl" tried to make sure even the dflt entries
> (ie dev->uc and dev->mc) were also printed in the below case. ie the
> 'self' entries in the below output.
>
> # ./bridge fdb show brport eth1
> 02:00:00:12:01:02 vlan 0 master br0 permanent
> 00:17:42:8a:b4:05 vlan 0 master br0 permanent
> 00:17:42:8a:b4:07 self permanent
> 33:33:00:00:00:01 self permanent
>
> Am guessing reverting the patch is going to make the 'self' entries in
> the above output to go away.
> Can you please confirm ?.

I don't want you to revert the patch, as the main goal of the patch
was to enable filtering in the kernel. I am only proposing
to revert part of it that allows driver to implement own dump.
This does not break the filtering in the kernel.
Whether the 'self' entries will go away it depends if the driver
overrides ndo_fdb_dump callback with its own. For cases where the driver
does not implement the callback, the dflt callback is still called
showing 'self' entries:
[root@silpixa00378825 ~]# bridge fdb show
33:33:00:00:00:01 dev em1 self permanent
01:00:5e:00:00:01 dev em1 self permanent
33:33:00:00:00:01 dev p4p1 self permanent
01:00:5e:00:00:01 dev p4p1 self permanent
33:33:ff:81:56:db dev p4p1 self permanent
01:00:5e:00:00:fb dev p4p1 self permanent
33:33:00:00:00:01 dev dummy0 self permanent

>
> Also, if i hear your concern correctly, for bridge ports that implement
> ndo_fdb_dump, with commit 5e6d243587990a588143b9da3974833649595587, we
> will get two entries for each 'self' entry above.
> Can you also paste sample output for that ?.

My patch affects *only* drivers that implements own dump callback.
Implementing own dump callback means the driver want to have a control
over what is being dumped. For example you may want to dump a hardware
MAC table only (my case) where 'self' entries created by kernel make no sense.
Also there are drivers that calls dflt callback from inside own dump function.
Please see following dump callback implemented for QLogic:
static int qlcnic_fdb_dump(struct sk_buff *skb, struct netlink_callback *ncb,
                        struct net_device *netdev,
                        struct net_device *filter_dev, int idx)
{
        struct qlcnic_adapter *adapter = netdev_priv(netdev);

        if (!adapter->fdb_mac_learn)
                return ndo_dflt_fdb_dump(skb, ncb, netdev, filter_dev, idx);
[...]

Another example of dflt being called twice is macvlan.c where ndo_fdb_dump
is actually initialized with the dflt callback:
macvlan.c:1022:        .ndo_fdb_dump           = ndo_dflt_fdb_dump,

Thanks,
Hubert

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hubert Sokolowski Dec. 11, 2014, 4:51 p.m. UTC | #5
>
> It wont work. As pointed out by Roopa in
> the other email dev->uc/mc will not get dumped with this
> change. Vlad will be in a better position to comment.
> CCing Vlad.

Jamal, this will still get dumped unless your driver implements
own dump callback. Can you please exactly tell us what wont work?
Please see my previous post where I added more explanation.

Thanks,
Hubert


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hubert Sokolowski Dec. 11, 2014, 5:06 p.m. UTC | #6
My apologies for sending again, I forgot to include a sample output you asked.

> Also, if i hear your concern correctly, for bridge ports that implement
> ndo_fdb_dump, with commit 5e6d243587990a588143b9da3974833649595587, we
> will get two entries for each 'self' entry above.
> Can you also paste sample output for that ?.
>

[root@silpixa00378825 ~]# bridge fdb show brport mac0
33:33:00:00:00:01 self permanent
33:33:00:00:00:01 self permanent

Regards,
Hubert


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roopa Prabhu Dec. 11, 2014, 5:32 p.m. UTC | #7
On 12/11/14, 9:06 AM, Hubert Sokolowski wrote:
> My apologies for sending again, I forgot to include a sample output you asked.
>
>> Also, if i hear your concern correctly, for bridge ports that implement
>> ndo_fdb_dump, with commit 5e6d243587990a588143b9da3974833649595587, we
>> will get two entries for each 'self' entry above.
>> Can you also paste sample output for that ?.
>>
> [root@silpixa00378825 ~]# bridge fdb show brport mac0
> 33:33:00:00:00:01 self permanent
> 33:33:00:00:00:01 self permanent
>
>
Thanks. yes, that is a problem. And, this mac0 is not a bridge port 
correct ?.

But, for the same test case, when mac0 is a bridge port, does your patch 
under review make both the entries go away for a bridge port ?.
(If i understand jamal correctly, this is his concern).
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arad, Ronen Dec. 11, 2014, 6:47 p.m. UTC | #8
> -----Original Message-----

> From: netdev-owner@vger.kernel.org [mailto:netdev-

> owner@vger.kernel.org] On Behalf Of Hubert Sokolowski

> Sent: Thursday, December 11, 2014 8:40 AM

> To: Roopa Prabhu

> Cc: netdev@vger.kernel.org; Jamal Hadi Salim

> Subject: Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if

> ndo_fdb_dump is defined.

> 

> > I think commit

> > "5e6d243587990a588143b9da3974833649595587 "bridge: netlink dump

> > interface at par with brctl" tried to make sure even the dflt entries

> > (ie dev->uc and dev->mc) were also printed in the below case. ie the

> > 'self' entries in the below output.

> >

> > # ./bridge fdb show brport eth1

> > 02:00:00:12:01:02 vlan 0 master br0 permanent

> > 00:17:42:8a:b4:05 vlan 0 master br0 permanent

> > 00:17:42:8a:b4:07 self permanent

> > 33:33:00:00:00:01 self permanent

> >

> > Am guessing reverting the patch is going to make the 'self' entries in

> > the above output to go away.

> > Can you please confirm ?.

> 

> I don't want you to revert the patch, as the main goal of the patch was to

> enable filtering in the kernel. I am only proposing to revert part of it that

> allows driver to implement own dump.

> This does not break the filtering in the kernel.

> Whether the 'self' entries will go away it depends if the driver overrides

> ndo_fdb_dump callback with its own. For cases where the driver does not

> implement the callback, the dflt callback is still called showing 'self' entries:

> [root@silpixa00378825 ~]# bridge fdb show

> 33:33:00:00:00:01 dev em1 self permanent

> 01:00:5e:00:00:01 dev em1 self permanent

> 33:33:00:00:00:01 dev p4p1 self permanent

> 01:00:5e:00:00:01 dev p4p1 self permanent 33:33:ff:81:56:db dev p4p1 self

> permanent 01:00:5e:00:00:fb dev p4p1 self permanent

> 33:33:00:00:00:01 dev dummy0 self permanent

> 

> >

> > Also, if i hear your concern correctly, for bridge ports that

> > implement ndo_fdb_dump, with commit

> > 5e6d243587990a588143b9da3974833649595587, we will get two entries

> for each 'self' entry above.

> > Can you also paste sample output for that ?.

> 

> My patch affects *only* drivers that implements own dump callback.

> Implementing own dump callback means the driver want to have a control

> over what is being dumped. For example you may want to dump a hardware

> MAC table only (my case) where 'self' entries created by kernel make no

> sense.

> Also there are drivers that calls dflt callback from inside own dump function.

> Please see following dump callback implemented for QLogic:

> static int qlcnic_fdb_dump(struct sk_buff *skb, struct netlink_callback *ncb,

>                         struct net_device *netdev,

>                         struct net_device *filter_dev, int idx) {

>         struct qlcnic_adapter *adapter = netdev_priv(netdev);

> 

>         if (!adapter->fdb_mac_learn)

>                 return ndo_dflt_fdb_dump(skb, ncb, netdev, filter_dev, idx); [...]

> 

> Another example of dflt being called twice is macvlan.c where

> ndo_fdb_dump is actually initialized with the dflt callback:

> macvlan.c:1022:        .ndo_fdb_dump           = ndo_dflt_fdb_dump,

> 

> Thanks,

> Hubert

>


I agree with Hubert on that. When a device defines its own ndo_fdb_dump it implies it wants control over the information that will be returned to user-space.
The proposed patch changes the recent status quo but it only restores it to the way it was before 5e6d243587990a588143b9da3974833649595587.
A compromise could be to also include in the same patch-set patches to the drivers that have benefited from the implicit call to ndo_dflt_fdb_dump.

-Ronen
 
> --

> To unsubscribe from this list: send the line "unsubscribe netdev" in the body

> of a message to majordomo@vger.kernel.org More majordomo info at

> http://vger.kernel.org/majordomo-info.html
Jamal Hadi Salim Dec. 11, 2014, 8:40 p.m. UTC | #9
On 12/11/14 12:32, Roopa Prabhu wrote:

>>
> Thanks. yes, that is a problem. And, this mac0 is not a bridge port
> correct ?.
>

Aha. I think this is the key. It is NOT a bridge port. This is why i
was explicit to ask for the commit log tests to be run.

> But, for the same test case, when mac0 is a bridge port, does your patch
> under review make both the entries go away for a bridge port ?.
> (If i understand jamal correctly, this is his concern).

I am interested to see the answer.
More importantly I am hoping Vlad would wake up and say something.

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hubert Sokolowski Dec. 12, 2014, 11:38 a.m. UTC | #10
> On 12/11/14, 9:06 AM, Hubert Sokolowski wrote:
>> My apologies for sending again, I forgot to include a sample output you asked.
>>
>>> Also, if i hear your concern correctly, for bridge ports that implement
>>> ndo_fdb_dump, with commit 5e6d243587990a588143b9da3974833649595587, we
>>> will get two entries for each 'self' entry above.
>>> Can you also paste sample output for that ?.
>>>
>> [root@silpixa00378825 ~]# bridge fdb show brport mac0
>> 33:33:00:00:00:01 self permanent
>> 33:33:00:00:00:01 self permanent
>>
>>
> Thanks. yes, that is a problem. And, this mac0 is not a bridge port
> correct ?.
>
in that part of the code it does not matter whether it is a bridge
port or not. I can show all devices and the entry is duplicated.
I can still filter by dev and you will see same problem, because
for every network device the dflt callback is called without checking
if the ndo_fdb_dump was defined:
[root@silpixa00378825 ~]# bridge fdb show
33:33:00:00:00:01 dev em1 self permanent
01:00:5e:00:00:01 dev em1 self permanent
33:33:00:00:00:01 dev p4p1 self permanent
01:00:5e:00:00:01 dev p4p1 self permanent
33:33:ff:81:56:db dev p4p1 self permanent
01:00:5e:00:00:fb dev p4p1 self permanent
33:33:00:00:00:01 dev dummy0 self permanent
33:33:00:00:00:01 dev mac0 self permanent
33:33:00:00:00:01 dev mac0 self permanent
[root@silpixa00378825 ~]# bridge fdb show dev mac0
33:33:00:00:00:01 self permanent
33:33:00:00:00:01 self permanent

Please see how the ndo_dflt_fdb_add and del are called:
		if (dev->netdev_ops->ndo_fdb_add)
			err = dev->netdev_ops->ndo_fdb_add(ndm, tb, dev, addr,
							   vid,
							   nlh->nlmsg_flags);
		else
			err = ndo_dflt_fdb_add(ndm, tb, dev, addr, vid,
					       nlh->nlmsg_flags);


> But, for the same test case, when mac0 is a bridge port, does your patch
> under review make both the entries go away for a bridge port ?.
> (If i understand jamal correctly, this is his concern).
>
As it was suggested by Ronen I can modify the patch so the dflt callback
is always called for bridge devices if this is desirable. Either by calling
it when following expression is true:
    (dev->priv_flags & IFF_BRIDGE_PORT)
or by modifying br_fdb_dump to call ndo_dflt_fdb_dump.

thanks,
Hubert

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jamal Hadi Salim Dec. 12, 2014, 11:54 a.m. UTC | #11
On 12/12/14 06:38, Hubert Sokolowski wrote:
>> On 12/11/14, 9:06 AM, Hubert Sokolowski wrote:

> Please see how the ndo_dflt_fdb_add and del are called:
> 		if (dev->netdev_ops->ndo_fdb_add)
> 			err = dev->netdev_ops->ndo_fdb_add(ndm, tb, dev, addr,
> 							   vid,
> 							   nlh->nlmsg_flags);
> 		else
> 			err = ndo_dflt_fdb_add(ndm, tb, dev, addr, vid,
> 					       nlh->nlmsg_flags);
>

Semantically add and dump are not the same.
Add adds a specific entry.
Dump not only dumps the fdb table but also the unicast and multicast
addresses.
the default dumper does the uni/multicast dumping.


> As it was suggested by Ronen I can modify the patch so the dflt callback
> is always called for bridge devices if this is desirable. Either by calling
> it when following expression is true:
>      (dev->priv_flags & IFF_BRIDGE_PORT)
> or by modifying br_fdb_dump to call ndo_dflt_fdb_dump.
>

Are you saying the above is going to work? You need to TEST please.

It seems to me drivers which do this:
---
.ndo_fdb_dump           = my_fdb_dump,

and then my_fdb_dump is:
return ndo_dflt_fdb_dump(skb, ncb, netdev, filter_dev, idx);

are broken because we always have to dump the uni/multicast
addresses *unconditionally* and these drivers claim to be
overriding the dumper but are in fact calling the default dumper.
They are not filtering anything as you had stated.
I wish Vlad (Cced) would show up and say something ;->

IOW, fix all the broken driver to not do that.

cheers,
jamal


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hubert Sokolowski Dec. 12, 2014, 1:36 p.m. UTC | #12
> On 12/12/14 06:38, Hubert Sokolowski wrote:
>>> On 12/11/14, 9:06 AM, Hubert Sokolowski wrote:
>
>> Please see how the ndo_dflt_fdb_add and del are called:
>> 		if (dev->netdev_ops->ndo_fdb_add)
>> 			err = dev->netdev_ops->ndo_fdb_add(ndm, tb, dev, addr,
>> 							   vid,
>> 							   nlh->nlmsg_flags);
>> 		else
>> 			err = ndo_dflt_fdb_add(ndm, tb, dev, addr, vid,
>> 					       nlh->nlmsg_flags);
>>
>
> Semantically add and dump are not the same.
> Add adds a specific entry.
> Dump not only dumps the fdb table but also the unicast and multicast
> addresses.
this is not true for 3.16 and before. Please see:
http://lxr.free-electrons.com/source/net/core/rtnetlink.c?v=3.16#L2545
It lets you fully manage the FDB table by overwriding fdb_add, del and dump
in the same way.


>
>
>> As it was suggested by Ronen I can modify the patch so the dflt callback
>> is always called for bridge devices if this is desirable. Either by calling
>> it when following expression is true:
>>      (dev->priv_flags & IFF_BRIDGE_PORT)
>> or by modifying br_fdb_dump to call ndo_dflt_fdb_dump.
>>
>
> Are you saying the above is going to work? You need to TEST please.
yes, it works and it is not a rocket science :). we just need to agree
on the approach to use.

>
> It seems to me drivers which do this:
> ---
> .ndo_fdb_dump           = my_fdb_dump,
>
> and then my_fdb_dump is:
> return ndo_dflt_fdb_dump(skb, ncb, netdev, filter_dev, idx);
>
> are broken because we always have to dump the uni/multicast
> addresses *unconditionally* and these drivers claim to be
> overriding the dumper but are in fact calling the default dumper.
> They are not filtering anything as you had stated.
> I wish Vlad (Cced) would show up and say something ;->
>
> IOW, fix all the broken driver to not do that.

These "broken" drivers were written before 3.17 when the unconditional call
to dflt_dump was introduced. Some drivers implement own dump and don't
call dflt_dump at all, some call dflt_dump depending on some condition.
You can also call dflt_dump inside, save the idx result and continue
with own dump resulting in both uc/mc and own entries returned to the user.

thanks,
Hubert

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jamal Hadi Salim Dec. 12, 2014, 2:35 p.m. UTC | #13
On 12/12/14 08:36, Hubert Sokolowski wrote:
>> On 12/12/14 06:38, Hubert Sokolowski wrote:
>>>> On 12/11/14, 9:06 AM, Hubert Sokolowski wrote:
>>
>>> Please see how the ndo_dflt_fdb_add and del are called:
>>> 		if (dev->netdev_ops->ndo_fdb_add)
>>> 			err = dev->netdev_ops->ndo_fdb_add(ndm, tb, dev, addr,
>>> 							   vid,
>>> 							   nlh->nlmsg_flags);
>>> 		else
>>> 			err = ndo_dflt_fdb_add(ndm, tb, dev, addr, vid,
>>> 					       nlh->nlmsg_flags);
>>>
>>
>> Semantically add and dump are not the same.
>> Add adds a specific entry.
>> Dump not only dumps the fdb table but also the unicast and multicast
>> addresses.
> this is not true for 3.16 and before. Please see:
> http://lxr.free-electrons.com/source/net/core/rtnetlink.c?v=3.16#L2545
> It lets you fully manage the FDB table by overwriding fdb_add, del and dump
> in the same way.
>
 >
>
>>
>>
>>> As it was suggested by Ronen I can modify the patch so the dflt callback
>>> is always called for bridge devices if this is desirable. Either by calling
>>> it when following expression is true:
>>>       (dev->priv_flags & IFF_BRIDGE_PORT)
>>> or by modifying br_fdb_dump to call ndo_dflt_fdb_dump.
>>>
>>
>> Are you saying the above is going to work? You need to TEST please.
> yes, it works and it is not a rocket science :). we just need to agree
> on the approach to use.
>

I am happy if this solves wont break
any of the use cases Vlad made me test and make sure work.
When i said "test" - I mean run the testcases outlined in the
commit log; if those work i dont see what the issue is.

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
John Fastabend Dec. 12, 2014, 8:05 p.m. UTC | #14
On 12/12/2014 06:35 AM, Jamal Hadi Salim wrote:
> On 12/12/14 08:36, Hubert Sokolowski wrote:
>>> On 12/12/14 06:38, Hubert Sokolowski wrote:
>>>>> On 12/11/14, 9:06 AM, Hubert Sokolowski wrote:
>>>
>>>> Please see how the ndo_dflt_fdb_add and del are called:
>>>>         if (dev->netdev_ops->ndo_fdb_add)
>>>>             err = dev->netdev_ops->ndo_fdb_add(ndm, tb, dev, addr,
>>>>                                vid,
>>>>                                nlh->nlmsg_flags);
>>>>         else
>>>>             err = ndo_dflt_fdb_add(ndm, tb, dev, addr, vid,
>>>>                            nlh->nlmsg_flags);
>>>>
>>>
>>> Semantically add and dump are not the same.
>>> Add adds a specific entry.
>>> Dump not only dumps the fdb table but also the unicast and multicast
>>> addresses.
>> this is not true for 3.16 and before. Please see:
>> http://lxr.free-electrons.com/source/net/core/rtnetlink.c?v=3.16#L2545
>> It lets you fully manage the FDB table by overwriding fdb_add, del and
>> dump
>> in the same way.
>>
>  >
>>
>>>
>>>
>>>> As it was suggested by Ronen I can modify the patch so the dflt
>>>> callback
>>>> is always called for bridge devices if this is desirable. Either by
>>>> calling
>>>> it when following expression is true:
>>>>       (dev->priv_flags & IFF_BRIDGE_PORT)
>>>> or by modifying br_fdb_dump to call ndo_dflt_fdb_dump.
>>>>
>>>
>>> Are you saying the above is going to work? You need to TEST please.
>> yes, it works and it is not a rocket science :). we just need to agree
>> on the approach to use.
>>
>
> I am happy if this solves wont break
> any of the use cases Vlad made me test and make sure work.
> When i said "test" - I mean run the testcases outlined in the
> commit log; if those work i dont see what the issue is.
>

I'll wake up ;)

First quick grep of code finds some strange uses of ndo_fdb_dump like
this in macvlan,

   ./drivers/net/macvlan.c
         .ndo_fdb_dump           = ndo_dflt_fdb_dump,

I'll be sending a patch once net-next opens up again to resolve it. Its
harmless though so not really a fix for net.

There seem to be a few places that have the potential to return
different values then the uc/mc lists.

	./drivers/net/vxlan.c
	./drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
	./drivers/net/ethernet/rocker/rocker.c

	./net/bridge/br_device.c

So I guess we can walk through the list and analyse them a bit.

vxlan:

Try stacking devices on top of the vxlan device this will call a uc_add
routine if you then change the mac addr on the vlan. This would get
reported by the dflt fdb dump handlers but not the drivers fdb dump
handlers. So removing the dflt dump handler from this patch at least
changes things. We should either explain why this is OK or accept that
the driver needs to be fixed. Or I guess that the patch is just wrong.
My guess is one of the latter options.

Also Jamal, your original patch seems like it might of changed this
and Hubert's patch is reverting back to its original case. Was this
specific part of your patch intentional?

qlcnic:

hmm again this is changed a bit. In the case of !learning and no
eswitch or sriov (side-bar question how do you do SR-IOV without
some sort of embedded switch?) then we return idx? See qlcnic_fdb_dump.

So at least this is different after the patch as well. Same questions
as above, explain why this is OK, fix it, or solve the issue some other
way. Although again it was this way before Jamal's patch.

rocker.c:

rocker never calls dflt dump either. Should it or is it not necessary?

br_device:

same story.

Hubert, can you run the set of commands in Jamal's patch and repost
your patch for us with them in the commit msg and call out any
differences?

Note the above is just from reading the code I never really ran any
tests to sort it out completely.

Thanks!
John



> cheers,
> jamal
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jamal Hadi Salim Dec. 15, 2014, 2:29 p.m. UTC | #15
On 12/12/14 15:05, John Fastabend wrote:
> On 12/12/2014 06:35 AM, Jamal Hadi Salim wrote:


> I'll wake up ;)


Vlad made me go over those patches in a few iterations to make
sure that the use cases covered in the test case work. It is
holiday season, so he may be offline.

> First quick grep of code finds some strange uses of ndo_fdb_dump like
> this in macvlan,
>
>    ./drivers/net/macvlan.c
>          .ndo_fdb_dump           = ndo_dflt_fdb_dump,
>
> I'll be sending a patch once net-next opens up again to resolve it. Its
> harmless though so not really a fix for net.
>
> There seem to be a few places that have the potential to return
> different values then the uc/mc lists.
>
>      ./drivers/net/vxlan.c
>      ./drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
>      ./drivers/net/ethernet/rocker/rocker.c
>
>      ./net/bridge/br_device.c
>

Yes, thats my observation as well.
The question is: Are multi/unicast address unconditionally dumped?
Some of these drivers may be just doing the LinuxWay(aka cutnpaste what
the other driver did).
If you go over the original thread exchange with Vlad, you'll notice
i was kind of unsure why dumping of unicast/multicast had anything to
do with fdb dumping.
It is still my view that we shouldnt be treating these addresses as if
they were fdb entries. But: The problem is once you allow an API to
user space you cant take it back even if people are depending on bugs.


> So I guess we can walk through the list and analyse them a bit.
>
> vxlan:
>
> Try stacking devices on top of the vxlan device this will call a uc_add
> routine if you then change the mac addr on the vlan. This would get
> reported by the dflt fdb dump handlers but not the drivers fdb dump
> handlers. So removing the dflt dump handler from this patch at least
> changes things. We should either explain why this is OK or accept that
> the driver needs to be fixed. Or I guess that the patch is just wrong.
> My guess is one of the latter options.
>
> Also Jamal, your original patch seems like it might of changed this
> and Hubert's patch is reverting back to its original case. Was this
> specific part of your patch intentional?
>

Yes.
This is based on the view that unicast/multicast must be dumped
*unconditionally*. If the view is that uni/mcast addresses are
dumped conditionally based on what the driver thinks, then Hubert's
one liner is good. But i really would like Vlad to comment. 80%
of the effort on my part if you look at the thread was the refactoring
of the code to meet the use case.

I thought the abstraction which requires that your own MAC addresses
are treated as fdb entries was broken - but it is too late to change
that.

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
John Fastabend Dec. 16, 2014, 12:45 a.m. UTC | #16
On 12/15/2014 06:29 AM, Jamal Hadi Salim wrote:
> On 12/12/14 15:05, John Fastabend wrote:
>> On 12/12/2014 06:35 AM, Jamal Hadi Salim wrote:
>
>
>> I'll wake up ;)
>
>
> Vlad made me go over those patches in a few iterations to make
> sure that the use cases covered in the test case work. It is
> holiday season, so he may be offline.
>

Yep.

>> First quick grep of code finds some strange uses of ndo_fdb_dump like
>> this in macvlan,
>>
>>    ./drivers/net/macvlan.c
>>          .ndo_fdb_dump           = ndo_dflt_fdb_dump,
>>
>> I'll be sending a patch once net-next opens up again to resolve it. Its
>> harmless though so not really a fix for net.
>>
>> There seem to be a few places that have the potential to return
>> different values then the uc/mc lists.
>>
>>      ./drivers/net/vxlan.c
>>      ./drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
>>      ./drivers/net/ethernet/rocker/rocker.c
>>
>>      ./net/bridge/br_device.c
>>
>
> Yes, thats my observation as well.
> The question is: Are multi/unicast address unconditionally dumped?

hmm good question. When I implemented this on the host nics with SR-IOV,
VMDQ, etc. The multi/unicast addresses were propagated into the FDB by
the driver. My logic was if some netdev ethx has a set of MAC addresses
above it well then any virtual function or virtual device also behind
the hardware shouldn't be sending those addresses out the egress switch
facing port. Otherwise the switch will see packets it knows are behind
that port and drop them. Or flood them if it hasn't learned the address
yet. Either way they will never get to the right netdev.

Admittedly I wasn't thinking about switches with many ports at the time.

> Some of these drivers may be just doing the LinuxWay(aka cutnpaste what
> the other driver did).

My original thinking here was... if it didn't implement fdb_add, fdb_del
and fdb_dump then if you wanted to think of it as having forwarding
database that was fine but it was really just a two port mac relay. In
which case just dump all the mac addresses it knows about. In this case
if it was something more fancy it could do its own dump like vxlan or
macvlan.

> If you go over the original thread exchange with Vlad, you'll notice
> i was kind of unsure why dumping of unicast/multicast had anything to
> do with fdb dumping.
> It is still my view that we shouldnt be treating these addresses as if
> they were fdb entries. But: The problem is once you allow an API to
> user space you cant take it back even if people are depending on bugs.
>

For a host nic ucast/multicast and fdb are the same, I think? The
code we had was just short-hand to allow the common case a host nic
to work. Notice vxlan and bridge drivers didn't dump there addr lists 
from fdb_dump until your patch.

Perhaps my implementation of macvlan fdb_{add|del|dump} is buggy. And
I shouldn't overload the addr lists.

>
>> So I guess we can walk through the list and analyse them a bit.
>>
>> vxlan:
>>
>> Try stacking devices on top of the vxlan device this will call a uc_add
>> routine if you then change the mac addr on the vlan. This would get
>> reported by the dflt fdb dump handlers but not the drivers fdb dump
>> handlers. So removing the dflt dump handler from this patch at least
>> changes things. We should either explain why this is OK or accept that
>> the driver needs to be fixed. Or I guess that the patch is just wrong.
>> My guess is one of the latter options.
>>
>> Also Jamal, your original patch seems like it might of changed this
>> and Hubert's patch is reverting back to its original case. Was this
>> specific part of your patch intentional?
>>
>
> Yes.
> This is based on the view that unicast/multicast must be dumped
> *unconditionally*. If the view is that uni/mcast addresses are
> dumped conditionally based on what the driver thinks, then Hubert's
> one liner is good. But i really would like Vlad to comment. 80%
> of the effort on my part if you look at the thread was the refactoring
> of the code to meet the use case.

I'm interested to see what Vlad says as well. But the current situation
is previously some drivers dumped their addr lists others didn't.
Specifically, the more switch like devices (bridge, vxlan) didn't. Now
every device will dump the addr lists. I'm not entirely convinced that
is correct.

>
> I thought the abstraction which requires that your own MAC addresses
> are treated as fdb entries was broken - but it is too late to change
> that.
>

It works OK for host nics (NICS that can't forward between ports) and
seems at best confusing for real switch asics. On a related question do
you expect the switch asic to trap any packets with MAC addresses in
the multi/unicast address lists and send them to the correct netdev? Or
will the switch forward them using normal FDB tables?

Also I don't think its too late to fix it though. Maybe we had some
buggy drivers is all.

> cheers,
> jamal
Jamal Hadi Salim Dec. 16, 2014, 1:06 p.m. UTC | #17
On 12/15/14 19:45, John Fastabend wrote:
> On 12/15/2014 06:29 AM, Jamal Hadi Salim wrote:

>
> hmm good question. When I implemented this on the host nics with SR-IOV,
> VMDQ, etc. The multi/unicast addresses were propagated into the FDB by
> the driver.

So if i understand correctly, this is a NIC with an FDB. And there is no
concept of a bridge to which it is attached. To the point of
classical uni/multicast addresses on a netdev abstraction; these
are typically stored in *much simpler tables* (used to be IO
registers back in the day)
Do these NICs not have such a concept?
An fdb entry has an egress port column; I have seen cases where the
port is labeled as "Cpu port" which would mean it belongs to the host;
but in this case it just seems there is no such concept and as Or
brought up in another email - what does "VLANid" mean in such a case?
If we go with a CPU port concept,
We could then use the concept of a vlan filter on a port basis
but then what happens when you dont have an fdb (majority of cases)?

> My logic was if some netdev ethx has a set of MAC addresses
> above it well then any virtual function or virtual device also behind
> the hardware shouldn't be sending those addresses out the egress switch
> facing port. Otherwise the switch will see packets it knows are behind
> that port and drop them. Or flood them if it hasn't learned the address
> yet. Either way they will never get to the right netdev.
>
> Admittedly I wasn't thinking about switches with many ports at the time.
>

I often struggle with trying to "box" SRIOV into some concept of a
switch abstraction and sometimes i am puzzled.
Would exposing the SRIOV underlay as a switch not have solved this
problem? Then the virtual ports essentially are bridge ports.
Maybe what we need is a concept of a "edge relay" extended netdev?
These things would have an fdb as well down and uplink relay ports that
can be attached to them.


>> Some of these drivers may be just doing the LinuxWay(aka cutnpaste what
>> the other driver did).
>
> My original thinking here was... if it didn't implement fdb_add, fdb_del
> and fdb_dump then if you wanted to think of it as having forwarding
> database that was fine but it was really just a two port mac relay. In
> which case just dump all the mac addresses it knows about. In this case
> if it was something more fancy it could do its own dump like vxlan or
> macvlan.
>

The challenge here is lack of separation between a NICs uni/multicast
ports which it owns - which is a traditional operation regardless of
what capabilities the NIC has; vs an fdb which has may have many
other capabilities. Probably all NICs capable of many MACs implement
fdbs?

> For a host nic ucast/multicast and fdb are the same, I think? The
> code we had was just short-hand to allow the common case a host nic
> to work. Notice vxlan and bridge drivers didn't dump there addr lists
> from fdb_dump until your patch.
>
> Perhaps my implementation of macvlan fdb_{add|del|dump} is buggy. And
> I shouldn't overload the addr lists.
>

Not just those - I am wondering about the general utility of what
Hubert was trying to do if all the driver does is call the default
dumper based on some flags presence and the default dumper
does a dump of uni/multicast host entries. Those are not really fdb
entries in the traditional sense.
Is there no way to get the unicast/multicast mac addresses for such
a driver?
I think that would help bring clarity to my confusion.


>
> I'm interested to see what Vlad says as well. But the current situation
> is previously some drivers dumped their addr lists others didn't.
> Specifically, the more switch like devices (bridge, vxlan) didn't. Now
> every device will dump the addr lists. I'm not entirely convinced that
> is correct.
>

I am glad this happened ;-> Otherwise we wouldnt be having this
discussion. When Vlad was asking me I was in a rush to get the patch
out and didnt question because i thought this was something some crazy
virtualization people needed.
If Vlad's use case goes away, then Hubert's little restoration is fine.


> It works OK for host nics (NICS that can't forward between ports) and
> seems at best confusing for real switch asics.

So if these NICs have fdb entries and i programmed it (meaning setting
which port a given MAC should be sent to), would it not work?

> On a related question do
> you expect the switch asic to trap any packets with MAC addresses in
> the multi/unicast address lists and send them to the correct netdev? Or
> will the switch forward them using normal FDB tables?
>

I think there would be a separate table for that. Roopa, can you check
with the ASICs you guys work on? The point i was trying to make above
is today there is a uni/multicast list or table of sorts that all NICs
expose.
There's always the hack of a "cpu port". I have also seen the "cpu port"
being conceptualized in L3 tables to imply "next hop is cpu" where you
have an IP address owned by the host; so maybe we need a concept of a
cpu port or again the revival of TheThing class device.

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hubert Sokolowski Dec. 16, 2014, 2:35 p.m. UTC | #18
>
> I am glad this happened ;-> Otherwise we wouldnt be having this
> discussion. When Vlad was asking me I was in a rush to get the patch
> out and didnt question because i thought this was something some crazy
> virtualization people needed.
> If Vlad's use case goes away, then Hubert's little restoration is fine.
>

I'm afraid there might be a little more to fix in here. I just tested
my patch after I moved the dflt_fdb_dump unconditional call inside br_fdb_dump,
so the "self" is returned again for the bridge and saw that br_fdb_dump is called
twice in some cases. As a result I saw duplicate "self" entries.
I think the problem is in rtnl_fdb_dump how it invokes ndo_fdb_dump.
In 3.16 the algorithm was very simple, now it is a little bit more complicated.

I tested on 3.18 without my changes only added pr_info inside br_fdb_dump:
pr_info("br fdb dump called dev %s filter dev %s.\n", dev->name,
			filter_dev->name);

I loaded dummy module, created bridge br0 with brctl and then attached dummy0
to that bridge:
    brctl  addif  br0 dummy0
Then when trying to filter by brport only:
./bridge fdb show brport dummy0
5e:e2:a0:21:0c:f5 vlan 1 master br0 permanent
5e:e2:a0:21:0c:f5 master br0 permanent
33:33:00:00:00:01 self permanent

Even though the output looks OK, I see in the journalctl logs the callback
was called twice with same attributes:
Dec 16 09:15:39 localhost.localdomain kernel: br fdb dump called dev br0 filter dev dummy0.
Dec 16 09:15:39 localhost.localdomain kernel: br fdb dump called dev br0 filter dev dummy0.

Do you have any idea why this is happening? I hope this test makes sense :).

Thanks,
Hubert

--
Hubert Sokolowski           Intel Corporation

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
John Fastabend Dec. 16, 2014, 4:35 p.m. UTC | #19
On 12/16/2014 05:06 AM, Jamal Hadi Salim wrote:
> On 12/15/14 19:45, John Fastabend wrote:
>> On 12/15/2014 06:29 AM, Jamal Hadi Salim wrote:
>
>>
>> hmm good question. When I implemented this on the host nics with SR-IOV,
>> VMDQ, etc. The multi/unicast addresses were propagated into the FDB by
>> the driver.
>
> So if i understand correctly, this is a NIC with an FDB. And there is no
> concept of a bridge to which it is attached. To the point of
> classical uni/multicast addresses on a netdev abstraction; these
> are typically stored in *much simpler tables* (used to be IO
> registers back in the day)

 From a model perspective it looks like a edge relay. Only a single
downlink with multiple uplinks. No learning, no loops and so no
STP, et. al. required. It may or may not support MAC+VLAN forwarding
or just MAC forwarding.

It may be configured via register writes or more complicated firmware
requests or some other mechanism. This is device dependent even across
devices by the same vendor the mechanisms change. But the driver
abstracts this.

> Do these NICs not have such a concept?
> An fdb entry has an egress port column; I have seen cases where the
> port is labeled as "Cpu port" which would mean it belongs to the host;

But in the SR-IOV case you have multiple "Cpu ports" and you want
to send packets to each of them depending on the configuration.


    port0   port1     port2  port3
     |        |        |      |      uplinks
  +------------------------------+
  |                              |
  |       SRIOV edge relay       |
  |                              |
  +------------------------------+
                  |                   downlink


In a host nic with SRIOV each port will be a PCIE function. So really
they are all CPU ports. For multi-function devices they might all be
physical functions.

In the hardware there needs to be a table to forward incoming traffic
to the correct port#. For L2 we use MAC+VLAN and an egress port column
to select the port. The model shouldn't care if the port is backed by
a VF or PF or set of queues. It just needs to forward packets to the
correct uplink.

One issue we have today when writing software for these edge relays
is we don't have a netdev representing the downlink. Or a netdev
representing management functions of the device. So if I want to
say change the mode of the edge relay from VEB to VEPA I usually
just send the message to the PF. Or if I want to send packets out on
the wire but not through the edge relay usually we do this by sending
control packets over an elected PF and it will attach a tag or something
so the edge relay doesn't forward or flood them to other uplinks. Adding
a netdev for the downlink would probably clean some of this up. Now
we rely on some behaviour that is not well-defined.

> but in this case it just seems there is no such concept and as Or
> brought up in another email - what does "VLANid" mean in such a case?

I think most host nics with SR-IOV can forward using VLAN + MAC and
do filtering on VLANid. Many can also put a default VLAN on the packet.

> If we go with a CPU port concept,
> We could then use the concept of a vlan filter on a port basis
> but then what happens when you dont have an fdb (majority of cases)?

Not sure what the question is here.. I'm hoping the above helped
explain my thinking on this.

Don't have an FDB? This means you don't have any way to forward
between ports so you must have a 1:1 mapping between the physical
port and the netdev. I think its fair to think of this as a TPMR
(two port mac relay) although not a very useful abstraction.

>
>> My logic was if some netdev ethx has a set of MAC addresses
>> above it well then any virtual function or virtual device also behind
>> the hardware shouldn't be sending those addresses out the egress switch
>> facing port. Otherwise the switch will see packets it knows are behind
>> that port and drop them. Or flood them if it hasn't learned the address
>> yet. Either way they will never get to the right netdev.
>>
>> Admittedly I wasn't thinking about switches with many ports at the time.
>>
>
> I often struggle with trying to "box" SRIOV into some concept of a
> switch abstraction and sometimes i am puzzled.
> Would exposing the SRIOV underlay as a switch not have solved this
> problem? Then the virtual ports essentially are bridge ports.

Yes this would help and this is how I view it. Although the
edge relay vs "real standards based" bridge distinction is important
because we don't do learning, only have a single uplink, don't run
loop detecting protocols, etc. All that stuff is not needed on a host
where you "know" your MAC addresses (at least for many use cases) and
can not build loops.

> Maybe what we need is a concept of a "edge relay" extended netdev?

This is effectively what the fdb table does right? Sure its not as
explicit as it could be but this is how I treat the NIC when I learn
it has multiple downlinks and a single uplink. At the moment we use
a trick similar to Jiri's on rocker, when we get a switch op like
getlink, setlink we "know" what switch object it refers to because
the netdev maps to a single switch always.

> These things would have an fdb as well down and uplink relay ports that
> can be attached to them.
>

Right in the current code paths there is no "attach" operation we assume
the edge relay and ports are attached when the ports are created via
SR-IOV or hw-offload or whatever.

What are we missing? We have the FDB and a unique id to show ports on
the same edge relay. User space can build this abstraction from those
two things. A downlink netdev port would probably clean up the
abstraction a bit especially for sending control frames.

>
>>> Some of these drivers may be just doing the LinuxWay(aka cutnpaste what
>>> the other driver did).
>>
>> My original thinking here was... if it didn't implement fdb_add, fdb_del
>> and fdb_dump then if you wanted to think of it as having forwarding
>> database that was fine but it was really just a two port mac relay. In
>> which case just dump all the mac addresses it knows about. In this case
>> if it was something more fancy it could do its own dump like vxlan or
>> macvlan.
>>
>
> The challenge here is lack of separation between a NICs uni/multicast
> ports which it owns - which is a traditional operation regardless of
> what capabilities the NIC has; vs an fdb which has may have many
> other capabilities. Probably all NICs capable of many MACs implement
> fdbs?

Yes they must to support forwarding. Agreed its a bit clunky they
way we overload uni/multicast address lists. But what does it mean
to add a unicast address to a port and not have it in the FDB? If
the port wants to receive traffic on a MAC because its added to the
unicast list doesn't it mean insert it into the FDB so the packets
actually get sent to the netdev?

Otherwise its a two step process one add it to the multicast list
and then add it to the FDB. I'm not sure why this is valuable.

>
>> For a host nic ucast/multicast and fdb are the same, I think? The
>> code we had was just short-hand to allow the common case a host nic
>> to work. Notice vxlan and bridge drivers didn't dump there addr lists
>> from fdb_dump until your patch.
>>
>> Perhaps my implementation of macvlan fdb_{add|del|dump} is buggy. And
>> I shouldn't overload the addr lists.
>>
>
> Not just those - I am wondering about the general utility of what
> Hubert was trying to do if all the driver does is call the default
> dumper based on some flags presence and the default dumper
> does a dump of uni/multicast host entries. Those are not really fdb
> entries in the traditional sense.

But as a practical matter any uni/multicast entry is in the FDB
so when the host nic has multiple ports we receive those mac addresses
on the port. The drivers do this today and it seems reasonable to me.

> Is there no way to get the unicast/multicast mac addresses for such
> a driver?

You can almost infer it from ip link by looking at all the stacked
drivers and figuring out how the address are propagated down. Then
look at the routes and figure out multicast address. But other than
the fdb dump mechanism I don't think there is anything.

> I think that would help bring clarity to my confusion.
>

clear as mud now?

>
>>
>> I'm interested to see what Vlad says as well. But the current situation
>> is previously some drivers dumped their addr lists others didn't.
>> Specifically, the more switch like devices (bridge, vxlan) didn't. Now
>> every device will dump the addr lists. I'm not entirely convinced that
>> is correct.
>>
>
> I am glad this happened ;-> Otherwise we wouldnt be having this
> discussion. When Vlad was asking me I was in a rush to get the patch
> out and didnt question because i thought this was something some crazy
> virtualization people needed.
> If Vlad's use case goes away, then Hubert's little restoration is fine.

Yep. maybe we can talk about it at the netdev users conference

>
>
>> It works OK for host nics (NICS that can't forward between ports) and
>> seems at best confusing for real switch asics.
>
> So if these NICs have fdb entries and i programmed it (meaning setting
> which port a given MAC should be sent to), would it not work?

You mean via 'bridge fdb add' yes this will work. But then as a short
hand we also program the ucast/multicast addresses. (have I beaten this
to death yet?)

>
>> On a related question do
>> you expect the switch asic to trap any packets with MAC addresses in
>> the multi/unicast address lists and send them to the correct netdev? Or
>> will the switch forward them using normal FDB tables?
>>
>
> I think there would be a separate table for that. Roopa, can you check
> with the ASICs you guys work on? The point i was trying to make above
> is today there is a uni/multicast list or table of sorts that all NICs
> expose.
> There's always the hack of a "cpu port". I have also seen the "cpu port"
> being conceptualized in L3 tables to imply "next hop is cpu" where you
> have an IP address owned by the host; so maybe we need a concept of a
> cpu port or again the revival of TheThing class device.

OK the confusing part of "cpu port" to me is in a host nic trying to
map this abstraction onto it implies a host nic may have many "cpu
ports".

Thanks,
.John

>
> cheers,
> jamal
>
Samudrala, Sridhar Dec. 16, 2014, 5:21 p.m. UTC | #20
On 12/16/2014 8:35 AM, John Fastabend wrote:
>
>> Is there no way to get the unicast/multicast mac addresses for such
>> a driver?
>
> You can almost infer it from ip link by looking at all the stacked
> drivers and figuring out how the address are propagated down. Then
> look at the routes and figure out multicast address. But other than
> the fdb dump mechanism I don't think there is anything.

It looks like we can get the device specific unicast/multicast mac 
addresses via 'ip maddr' too.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roopa Prabhu Dec. 16, 2014, 7:30 p.m. UTC | #21
On 12/16/14, 9:21 AM, Samudrala, Sridhar wrote:
>
> On 12/16/2014 8:35 AM, John Fastabend wrote:
>>
>>> Is there no way to get the unicast/multicast mac addresses for such
>>> a driver?
>>
>> You can almost infer it from ip link by looking at all the stacked
>> drivers and figuring out how the address are propagated down. Then
>> look at the routes and figure out multicast address. But other than
>> the fdb dump mechanism I don't think there is anything.
>
> It looks like we can get the device specific unicast/multicast mac 
> addresses via 'ip maddr' too.
if i remember correctly, 'ip maddr' was only for multicast list. And 
there was no way to dump the unicast list until bridge self was introduced.
the only way to dump unicast addresses today is by using the `bridge fdb 
show self`
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Samudrala, Sridhar Dec. 16, 2014, 8:11 p.m. UTC | #22
On 12/16/2014 11:30 AM, Roopa Prabhu wrote:
> On 12/16/14, 9:21 AM, Samudrala, Sridhar wrote:
>>
>> On 12/16/2014 8:35 AM, John Fastabend wrote:
>>>
>>>> Is there no way to get the unicast/multicast mac addresses for such
>>>> a driver?
>>>
>>> You can almost infer it from ip link by looking at all the stacked
>>> drivers and figuring out how the address are propagated down. Then
>>> look at the routes and figure out multicast address. But other than
>>> the fdb dump mechanism I don't think there is anything.
>>
>> It looks like we can get the device specific unicast/multicast mac 
>> addresses via 'ip maddr' too.
> if i remember correctly, 'ip maddr' was only for multicast list. And 
> there was no way to dump the unicast list until bridge self was 
> introduced.
> the only way to dump unicast addresses today is by using the `bridge 
> fdb show self`
Yes. 'ip maddr show' only lists the multicast macs as the name suggests. 
I stand corrected.
May be we need 'ip uaddr show' to list unicast macs instead of 
overloading 'bridge fdb show' to show unicast lists.

Thanks
Sridhar
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roopa Prabhu Dec. 17, 2014, 5:51 a.m. UTC | #23
On 12/16/14, 5:06 AM, Jamal Hadi Salim wrote:
> On 12/15/14 19:45, John Fastabend wrote:
>> On 12/15/2014 06:29 AM, Jamal Hadi Salim wrote:
>
>>
>> hmm good question. When I implemented this on the host nics with SR-IOV,
>> VMDQ, etc. The multi/unicast addresses were propagated into the FDB by
>> the driver.
>
> So if i understand correctly, this is a NIC with an FDB. And there is no
> concept of a bridge to which it is attached. To the point of
> classical uni/multicast addresses on a netdev abstraction; these
> are typically stored in *much simpler tables* (used to be IO
> registers back in the day)
> Do these NICs not have such a concept?
> An fdb entry has an egress port column; I have seen cases where the
> port is labeled as "Cpu port" which would mean it belongs to the host;
> but in this case it just seems there is no such concept and as Or
> brought up in another email - what does "VLANid" mean in such a case?
> If we go with a CPU port concept,
> We could then use the concept of a vlan filter on a port basis
> but then what happens when you dont have an fdb (majority of cases)?
>
>> My logic was if some netdev ethx has a set of MAC addresses
>> above it well then any virtual function or virtual device also behind
>> the hardware shouldn't be sending those addresses out the egress switch
>> facing port. Otherwise the switch will see packets it knows are behind
>> that port and drop them. Or flood them if it hasn't learned the address
>> yet. Either way they will never get to the right netdev.
>>
>> Admittedly I wasn't thinking about switches with many ports at the time.
>>
>
> I often struggle with trying to "box" SRIOV into some concept of a
> switch abstraction and sometimes i am puzzled.
> Would exposing the SRIOV underlay as a switch not have solved this
> problem? Then the virtual ports essentially are bridge ports.
> Maybe what we need is a concept of a "edge relay" extended netdev?
> These things would have an fdb as well down and uplink relay ports that
> can be attached to them.
>
>
>>> Some of these drivers may be just doing the LinuxWay(aka cutnpaste what
>>> the other driver did).
>>
>> My original thinking here was... if it didn't implement fdb_add, fdb_del
>> and fdb_dump then if you wanted to think of it as having forwarding
>> database that was fine but it was really just a two port mac relay. In
>> which case just dump all the mac addresses it knows about. In this case
>> if it was something more fancy it could do its own dump like vxlan or
>> macvlan.
>>
>
> The challenge here is lack of separation between a NICs uni/multicast
> ports which it owns - which is a traditional operation regardless of
> what capabilities the NIC has; vs an fdb which has may have many
> other capabilities. Probably all NICs capable of many MACs implement
> fdbs?
>
>> For a host nic ucast/multicast and fdb are the same, I think? The
>> code we had was just short-hand to allow the common case a host nic
>> to work. Notice vxlan and bridge drivers didn't dump there addr lists
>> from fdb_dump until your patch.
>>
>> Perhaps my implementation of macvlan fdb_{add|del|dump} is buggy. And
>> I shouldn't overload the addr lists.
>>
>
> Not just those - I am wondering about the general utility of what
> Hubert was trying to do if all the driver does is call the default
> dumper based on some flags presence and the default dumper
> does a dump of uni/multicast host entries. Those are not really fdb
> entries in the traditional sense.
> Is there no way to get the unicast/multicast mac addresses for such
> a driver?
> I think that would help bring clarity to my confusion.
>
>
>>
>> I'm interested to see what Vlad says as well. But the current situation
>> is previously some drivers dumped their addr lists others didn't.
>> Specifically, the more switch like devices (bridge, vxlan) didn't. Now
>> every device will dump the addr lists. I'm not entirely convinced that
>> is correct.
>>
>
> I am glad this happened ;-> Otherwise we wouldnt be having this
> discussion. When Vlad was asking me I was in a rush to get the patch
> out and didnt question because i thought this was something some crazy
> virtualization people needed.
> If Vlad's use case goes away, then Hubert's little restoration is fine.
>
>
>> It works OK for host nics (NICS that can't forward between ports) and
>> seems at best confusing for real switch asics.
>
> So if these NICs have fdb entries and i programmed it (meaning setting
> which port a given MAC should be sent to), would it not work?
>
>> On a related question do
>> you expect the switch asic to trap any packets with MAC addresses in
>> the multi/unicast address lists and send them to the correct netdev? Or
>> will the switch forward them using normal FDB tables?
>>
>
> I think there would be a separate table for that. Roopa, can you check
> with the ASICs you guys work on? 
Jamal, yes, AFAICS, we do have a separate table where we add some static 
entries
indicating send to  CPU (example IPV4 and IPV6 link local multicast) and 
such
packets are sent to the correct netdev

> The point i was trying to make above
> is today there is a uni/multicast list or table of sorts that all NICs
> expose.
> There's always the hack of a "cpu port". I have also seen the "cpu port"
> being conceptualized in L3 tables to imply "next hop is cpu" where you
> have an IP address owned by the host; so maybe we need a concept of a
> cpu port or again the revival of TheThing class device.
>
> cheers,
> jamal
>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roopa Prabhu Dec. 17, 2014, 5:54 a.m. UTC | #24
On 12/16/14, 12:11 PM, Samudrala, Sridhar wrote:
>
> On 12/16/2014 11:30 AM, Roopa Prabhu wrote:
>> On 12/16/14, 9:21 AM, Samudrala, Sridhar wrote:
>>>
>>> On 12/16/2014 8:35 AM, John Fastabend wrote:
>>>>
>>>>> Is there no way to get the unicast/multicast mac addresses for such
>>>>> a driver?
>>>>
>>>> You can almost infer it from ip link by looking at all the stacked
>>>> drivers and figuring out how the address are propagated down. Then
>>>> look at the routes and figure out multicast address. But other than
>>>> the fdb dump mechanism I don't think there is anything.
>>>
>>> It looks like we can get the device specific unicast/multicast mac 
>>> addresses via 'ip maddr' too.
>> if i remember correctly, 'ip maddr' was only for multicast list. And 
>> there was no way to dump the unicast list until bridge self was 
>> introduced.
>> the only way to dump unicast addresses today is by using the `bridge 
>> fdb show self`
> Yes. 'ip maddr show' only lists the multicast macs as the name 
> suggests. I stand corrected.
> May be we need 'ip uaddr show' to list unicast macs instead of 
> overloading 'bridge fdb show' to show unicast lists.
>
maybe too late for that ..., 'bridge fdb show self' has been out for 
sometime now and in use
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vlad Yasevich Dec. 17, 2014, 3:39 p.m. UTC | #25
On 12/15/2014 07:45 PM, John Fastabend wrote:
> On 12/15/2014 06:29 AM, Jamal Hadi Salim wrote:
>> On 12/12/14 15:05, John Fastabend wrote:
>>> On 12/12/2014 06:35 AM, Jamal Hadi Salim wrote:
>>
>>
>>> I'll wake up ;)
>>
>>
>> Vlad made me go over those patches in a few iterations to make
>> sure that the use cases covered in the test case work. It is
>> holiday season, so he may be offline.
>>
> 
> Yep.

Sorry,  had HW/network issues as well as holiday season...  Been trying to
catch up.

> 
>>> First quick grep of code finds some strange uses of ndo_fdb_dump like
>>> this in macvlan,
>>>
>>>    ./drivers/net/macvlan.c
>>>          .ndo_fdb_dump           = ndo_dflt_fdb_dump,
>>>
>>> I'll be sending a patch once net-next opens up again to resolve it. Its
>>> harmless though so not really a fix for net.
>>>
>>> There seem to be a few places that have the potential to return
>>> different values then the uc/mc lists.
>>>
>>>      ./drivers/net/vxlan.c
>>>      ./drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
>>>      ./drivers/net/ethernet/rocker/rocker.c
>>>
>>>      ./net/bridge/br_device.c
>>>
>>
>> Yes, thats my observation as well.
>> The question is: Are multi/unicast address unconditionally dumped?
> 
> hmm good question. When I implemented this on the host nics with SR-IOV,
> VMDQ, etc. The multi/unicast addresses were propagated into the FDB by
> the driver. My logic was if some netdev ethx has a set of MAC addresses
> above it well then any virtual function or virtual device also behind
> the hardware shouldn't be sending those addresses out the egress switch
> facing port. Otherwise the switch will see packets it knows are behind
> that port and drop them. Or flood them if it hasn't learned the address
> yet. Either way they will never get to the right netdev.
> 
> Admittedly I wasn't thinking about switches with many ports at the time.

Looking at the old code, we've always asked HW to dump it's state and if HW
didn't support the dumper, the default dumper of MC/UC lists was used.
This makes sense for most devices.

> 
>> Some of these drivers may be just doing the LinuxWay(aka cutnpaste what
>> the other driver did).
> 
> My original thinking here was... if it didn't implement fdb_add, fdb_del
> and fdb_dump then if you wanted to think of it as having forwarding
> database that was fine but it was really just a two port mac relay. In
> which case just dump all the mac addresses it knows about. In this case
> if it was something more fancy it could do its own dump like vxlan or
> macvlan.
> 
>> If you go over the original thread exchange with Vlad, you'll notice
>> i was kind of unsure why dumping of unicast/multicast had anything to
>> do with fdb dumping.
>> It is still my view that we shouldnt be treating these addresses as if
>> they were fdb entries. But: The problem is once you allow an API to
>> user space you cant take it back even if people are depending on bugs.
>>
> 
> For a host nic ucast/multicast and fdb are the same, I think? The
> code we had was just short-hand to allow the common case a host nic
> to work. Notice vxlan and bridge drivers didn't dump there addr lists from fdb_dump until
> your patch.

Right.  That patch added additional filtering code, but I guess we missed
the change that force it dump MC/UC lists from the master devices.  It did dump
those list from the slave devices.

> 
> Perhaps my implementation of macvlan fdb_{add|del|dump} is buggy. And
> I shouldn't overload the addr lists.
> 
>>
>>> So I guess we can walk through the list and analyse them a bit.
>>>
>>> vxlan:
>>>
>>> Try stacking devices on top of the vxlan device this will call a uc_add
>>> routine if you then change the mac addr on the vlan. This would get
>>> reported by the dflt fdb dump handlers but not the drivers fdb dump
>>> handlers. So removing the dflt dump handler from this patch at least
>>> changes things. We should either explain why this is OK or accept that
>>> the driver needs to be fixed. Or I guess that the patch is just wrong.
>>> My guess is one of the latter options.
>>>
>>> Also Jamal, your original patch seems like it might of changed this
>>> and Hubert's patch is reverting back to its original case. Was this
>>> specific part of your patch intentional?
>>>
>>
>> Yes.
>> This is based on the view that unicast/multicast must be dumped
>> *unconditionally*. If the view is that uni/mcast addresses are
>> dumped conditionally based on what the driver thinks, then Hubert's
>> one liner is good. But i really would like Vlad to comment. 80%
>> of the effort on my part if you look at the thread was the refactoring
>> of the code to meet the use case.

I don't think we have to dump uc/mc lists unconditionally.  What we
want is the lower diver's view of any fdb entries it things are appropriate.
For simple cards, this becomes equivalent to uc/mc lists.  For smarter cards
that override the default dumper, it makes sense for them to provide the info.

> 
> I'm interested to see what Vlad says as well. But the current situation
> is previously some drivers dumped their addr lists others didn't.
> Specifically, the more switch like devices (bridge, vxlan) didn't. Now
> every device will dump the addr lists. I'm not entirely convinced that
> is correct.
>

Well, the bridge would have dumped any fdbs that pointer at the bridge
device (port is NULL) as it's egress port.  Not sure about the vxlan.
For stacked situations of complex devices this make sense.  For
instance if you stack vxlan on top of bridge, then from the vxlan
perspective, we want to see what macs the bridge will forward to the
vxlan.  Here, the bridge is actually slightly broken as it wouldn't
actually dump all the pertinent info, but that's more of a bridge problem.

-vlad

>>
>> I thought the abstraction which requires that your own MAC addresses
>> are treated as fdb entries was broken - but it is too late to change
>> that.
>>
> 
> It works OK for host nics (NICS that can't forward between ports) and
> seems at best confusing for real switch asics. On a related question do
> you expect the switch asic to trap any packets with MAC addresses in
> the multi/unicast address lists and send them to the correct netdev? Or
> will the switch forward them using normal FDB tables?
> 
> Also I don't think its too late to fix it though. Maybe we had some
> buggy drivers is all.
> 
>> cheers,
>> jamal
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hubert Sokolowski Dec. 17, 2014, 4:18 p.m. UTC | #26
>
> I don't think we have to dump uc/mc lists unconditionally.  What we
> want is the lower diver's view of any fdb entries it things are appropriate.
> For simple cards, this becomes equivalent to uc/mc lists.  For smarter cards
> that override the default dumper, it makes sense for them to provide the info.
>

I am very glad to hear that :).

>
> Well, the bridge would have dumped any fdbs that pointer at the bridge
> device (port is NULL) as it's egress port.  Not sure about the vxlan.
> For stacked situations of complex devices this make sense.  For
> instance if you stack vxlan on top of bridge, then from the vxlan
> perspective, we want to see what macs the bridge will forward to the
> vxlan.  Here, the bridge is actually slightly broken as it wouldn't
> actually dump all the pertinent info, but that's more of a bridge problem.

I have just prepared a patch where I dump uc/mc for bridge devices
by looking at (dev->priv_flags & IFF_EBRIDGE), so I have same results
as without my changes. This should satisfy Jamal and Roopa.
I could send it as v3 of my patch along with the results if you are
interested.

>
> -vlad
>


thanks,
Hubert

--
Hubert Sokolowski    Intel Corporation





--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jamal Hadi Salim Dec. 18, 2014, 10:32 p.m. UTC | #27
Sorry for the latency (head-buried-in-sand in effect)

On 12/17/14 11:18, Hubert Sokolowski wrote:

> I have just prepared a patch where I dump uc/mc for bridge devices
> by looking at (dev->priv_flags & IFF_EBRIDGE), so I have same results
> as without my changes. This should satisfy Jamal and Roopa.
> I could send it as v3 of my patch along with the results if you are
> interested.
>


Please do. If you satisfy Vlad's goals then we are all happy.

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hubert Sokolowski Dec. 19, 2014, 3:17 p.m. UTC | #28
On 18/12/14 22:32, Jamal Hadi Salim wrote:

> Sorry for the latency (head-buried-in-sand in effect)
> On 12/17/14 11:18, Hubert Sokolowski wrote:
>> I have just prepared a patch where I dump uc/mc for bridge devices
>> by looking at (dev->priv_flags & IFF_EBRIDGE), so I have same results
>> as without my changes. This should satisfy Jamal and Roopa.
>> I could send it as v3 of my patch along with the results if you are
>> interested.
> Please do. If you satisfy Vlad's goals then we are all happy.

Posted as v3, please review.
There is still open question I asked sometime ago but never got explained.
It is about the new filter_dev parameter that was added to ndo_fdb_dump:
        int                     (*ndo_fdb_dump)(struct sk_buff *skb,
                                                struct netlink_callback *cb,
                                                struct net_device *dev,
                                                struct net_device *filter_dev,
                                                int idx);

When we call this function for a device, dev pointer is passed as the filter_dev:
                if (dev->netdev_ops->ndo_fdb_dump)
                        idx = dev->netdev_ops->ndo_fdb_dump(skb, cb, bdev, dev,
                                                            idx);

This is not an issue for a bridge device and a device that is not enslaved
in a bridge because bdev == dev, but this can be dangerous in other cases.
Let's assume QLogic NIC has a master device, in this case bdev != dev.
Now look what is happening, dev is passed as filter_dev to:
static int qlcnic_fdb_dump(struct sk_buff *skb, struct netlink_callback *ncb,
                        struct net_device *netdev,
                        struct net_device *filter_dev, int idx)
{
        struct qlcnic_adapter *adapter = netdev_priv(netdev);
...

netdev_priv(netdev) returns a pointer to private struct of the bridge,but the driver
is expecting it's own private stuff.

Should we fix the driver and assume filter_dev is /me and dev is our master
or the parameters were reversed and should be passed as (skb, cb, dev, bdev, idx) ?
Is this something for another patch/discussion?

regards,
Hubert
Roopa Prabhu Dec. 19, 2014, 4:32 p.m. UTC | #29
On 12/19/14, 7:17 AM, Hubert Sokolowski wrote:
> On 18/12/14 22:32, Jamal Hadi Salim wrote:
>
>> Sorry for the latency (head-buried-in-sand in effect)
>> On 12/17/14 11:18, Hubert Sokolowski wrote:
>>> I have just prepared a patch where I dump uc/mc for bridge devices
>>> by looking at (dev->priv_flags & IFF_EBRIDGE), so I have same results
>>> as without my changes. This should satisfy Jamal and Roopa.
>>> I could send it as v3 of my patch along with the results if you are
>>> interested.
>> Please do. If you satisfy Vlad's goals then we are all happy.
> Posted as v3, please review.
> There is still open question I asked sometime ago but never got explained.
> It is about the new filter_dev parameter that was added to ndo_fdb_dump:
>          int                     (*ndo_fdb_dump)(struct sk_buff *skb,
>                                                  struct netlink_callback *cb,
>                                                  struct net_device *dev,
>                                                  struct net_device *filter_dev,
>                                                  int idx);
>
> When we call this function for a device, dev pointer is passed as the filter_dev:
>                  if (dev->netdev_ops->ndo_fdb_dump)
>                          idx = dev->netdev_ops->ndo_fdb_dump(skb, cb, bdev, dev,
>                                                              idx);
seems like these calls should be fixed. bdev is really dev in this case. 
And filter_dev should be null.
>
> This is not an issue for a bridge device and a device that is not enslaved
> in a bridge because bdev == dev, but this can be dangerous in other cases.
> Let's assume QLogic NIC has a master device, in this case bdev != dev.
> Now look what is happening, dev is passed as filter_dev to:
> static int qlcnic_fdb_dump(struct sk_buff *skb, struct netlink_callback *ncb,
>                          struct net_device *netdev,
>                          struct net_device *filter_dev, int idx)
> {
>          struct qlcnic_adapter *adapter = netdev_priv(netdev);
> ...
>
> netdev_priv(netdev) returns a pointer to private struct of the bridge,but the driver
> is expecting it's own private stuff.
>
> Should we fix the driver and assume filter_dev is /me and dev is our master
> or the parameters were reversed and should be passed as (skb, cb, dev, bdev, idx) ?
> Is this something for another patch/discussion?
>
> regards,
> Hubert
>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jamal Hadi Salim Dec. 21, 2014, 2:27 p.m. UTC | #30
Sorry for the latency, Ive been down with a bad flu (its bad when i cant
type on my keyboard sitting infront of me;->),  recovering and the
thread seems to have caught on - should be able to catchup in the
next few days.
I am beginning to reach a conclusion that the current switchdev approach
is *not* going to work for SRIOV. I also worry it may be too late
to change that.
Shrijeet wanted to set up a BOF for netdev to have hopefully final 
consensus. Shrijeet, are you going to make an official request for the BOF?

Sorry John, I dont have enough energy to address all your points but i
will try to just focus on SRIOV and will save a few bytes while at it.


On 12/16/14 11:35, John Fastabend wrote:

> But in the SR-IOV case you have multiple "Cpu ports" and you want
> to send packets to each of them depending on the configuration.
>
>
>     port0   port1     port2  port3
>      |        |        |      |      uplinks
>   +------------------------------+
>   |                              |
>   |       SRIOV edge relay       |
>   |                              |
>   +------------------------------+
>                   |                   downlink
>
>

Two points above:
1) Did you flip uplink vs down link above?
(I Thought URP was the wire link)
2) What you are not showing above which is *very important* is that
infact there is an underlying embedded fdb.

point #2 brings out a lot of the weird things in some of the bridge
code. IOW, you have an *offloaded* bridge with _bridge ports_
visible in the kernel but not the bridge that is controlled
by standard Linux bridge tools. I am not saying that the model is
wrong; on the contrary what Ben had exposed may fall under the
same category i.e you have E_BRIDGE flag on the netdev to say it sits
on top of an offloaded bridge and you dont need a br0 to run
bridge command on. But then we need some proxy (TheClassThingy) to act
as intemediary to the offloaded hardware.
If you do that then the vf becomes simply a bridge port - which
means bridge port ops apply.

SRIOV it seems to have morphed its own toolkit.
The PF port, when acting as the control interface, is actually
TheClassThingy we discuss on/off.
To add an fdb entry to point to vf 1, where TheClassThingy is eth1:
ip link set eth1 vf 1 mac aa:bb:cc:dd:ee:ff vlan 10

IMO, SRIOV should expose these ports with names and ifindices
(probably does already) and pre-populated master or something
which points to its parent, then i can do the following:
bridge fdb add aa:bb:cc:dd:ee:ff vlan 10 dev vf1 master

master in such a case will go to TheClassThingy which would pass
such control to the underlying hardware.
The PF still stays but not as the management interface.

Ok, promised to keep this short ;-> I should respond to the other points
in a separate email.

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jamal Hadi Salim Dec. 21, 2014, 2:46 p.m. UTC | #31
John,

On CPU port: Yes, a VF with PCIE params is considered a "CPU port".
This is fine for SRIOV because it ties VF to a "CPU port". Other
ASIC models have an explicit single "CPU port" (I would say the tiny
openwrt types or larger broadcom switches). In such a case you program
separately the cpu MAC addresses and then you teach some table to
send to the CPU port. It doesnt matter whether it is L2 or L3. As
an example you can point a L3 NH to cpu port with a CPU port MAC address
so things dont show up as skb HOST tagged on the stack.
On Linux infact without the concept of an fdb - this is also true.
i.e not all NICs have an fdb. Macvlan was initially introduced (Patrick
McHardy iirc) to expose the multiple MAC address a standard NIC has.
Conflating this with having an fdb is where the water got mudied in my
opinion. I should be able to dump these MAC addresses without expecting
that i need to talk to an underlying hardware dumper (hence the default
dumper). Unless it is true today that _all_ NICs with multiple MACs have
an embedded fdb.
To summarize: I dont think we disagree much - i just wanted to emphasize
the concept of "cpu port" being important.

Other thing you brought up was the concept of uplink/downlink relay
ports.  The more i think about it, the more i am reaching a conclusion
that they are *not* anything speacial.
Essentially, in one case you have the port column pointing to the VF
after you populate the underlying fdb. By default it doesnt seem these
SRIOV switches cant learn. We have knobs for that and capability is
exposable to point to the fact
And it seems to me that VEPA is just a mode where flooding control
points out one port connected externally.
We already have knobs per port flood and learning controls.

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shrijeet Mukherjee Dec. 21, 2014, 4:33 p.m. UTC | #32
----- Original Message -----
> From: "jhs" <jhs@mojatatu.com>
> To: "John Fastabend" <john.fastabend@gmail.com>
> Cc: "Hubert Sokolowski" <h.sokolowski@wit.edu.pl>, "roopa" <roopa@cumulusnetworks.com>, netdev@vger.kernel.org, "Vlad
> Yasevich" <vyasevic@redhat.com>, "Shrijeet Mukherjee" <shm@cumulusnetworks.com>
> Sent: Sunday, December 21, 2014 6:27:46 AM
> Subject: SRIOV as bridge Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.

> Sorry for the latency, Ive been down with a bad flu (its bad when i cant
> type on my keyboard sitting infront of me;->),  recovering and the
> thread seems to have caught on - should be able to catchup in the
> next few days.
> I am beginning to reach a conclusion that the current switchdev approach
> is *not* going to work for SRIOV. I also worry it may be too late
> to change that.
> Shrijeet wanted to set up a BOF for netdev to have hopefully final
> consensus. Shrijeet, are you going to make an official request for the BOF?

Yes, will write something up tonite. The point you make about the PF being the TheClassThingy is what make the merging of models possible IMO.



> 
> Sorry John, I dont have enough energy to address all your points but i
> will try to just focus on SRIOV and will save a few bytes while at it.
> 
> 
> On 12/16/14 11:35, John Fastabend wrote:
> 
>> But in the SR-IOV case you have multiple "Cpu ports" and you want
>> to send packets to each of them depending on the configuration.
>>
>>
>>     port0   port1     port2  port3
>>      |        |        |      |      uplinks
>>   +------------------------------+
>>   |                              |
>>   |       SRIOV edge relay       |
>>   |                              |
>>   +------------------------------+
>>                   |                   downlink
>>
>>
> 
> Two points above:
> 1) Did you flip uplink vs down link above?
> (I Thought URP was the wire link)
> 2) What you are not showing above which is *very important* is that
> infact there is an underlying embedded fdb.
> 
> point #2 brings out a lot of the weird things in some of the bridge
> code. IOW, you have an *offloaded* bridge with _bridge ports_
> visible in the kernel but not the bridge that is controlled
> by standard Linux bridge tools. I am not saying that the model is
> wrong; on the contrary what Ben had exposed may fall under the
> same category i.e you have E_BRIDGE flag on the netdev to say it sits
> on top of an offloaded bridge and you dont need a br0 to run
> bridge command on. But then we need some proxy (TheClassThingy) to act
> as intemediary to the offloaded hardware.
> If you do that then the vf becomes simply a bridge port - which
> means bridge port ops apply.
> 
> SRIOV it seems to have morphed its own toolkit.
> The PF port, when acting as the control interface, is actually
> TheClassThingy we discuss on/off.
> To add an fdb entry to point to vf 1, where TheClassThingy is eth1:
> ip link set eth1 vf 1 mac aa:bb:cc:dd:ee:ff vlan 10
> 
> IMO, SRIOV should expose these ports with names and ifindices
> (probably does already) and pre-populated master or something
> which points to its parent, then i can do the following:
> bridge fdb add aa:bb:cc:dd:ee:ff vlan 10 dev vf1 master
> 
> master in such a case will go to TheClassThingy which would pass
> such control to the underlying hardware.
> The PF still stays but not as the management interface.
> 
> Ok, promised to keep this short ;-> I should respond to the other points
> in a separate email.
> 
> cheers,
> jamal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roopa Prabhu Dec. 21, 2014, 7:08 p.m. UTC | #33
On 12/21/14, 6:27 AM, Jamal Hadi Salim wrote:
>
> Sorry for the latency, Ive been down with a bad flu (its bad when i cant
> type on my keyboard sitting infront of me;->),  recovering and the
> thread seems to have caught on - should be able to catchup in the
> next few days.
> I am beginning to reach a conclusion that the current switchdev approach
> is *not* going to work for SRIOV. I also worry it may be too late
> to change that.
> Shrijeet wanted to set up a BOF for netdev to have hopefully final 
> consensus. Shrijeet, are you going to make an official request for the 
> BOF?
>
> Sorry John, I dont have enough energy to address all your points but i
> will try to just focus on SRIOV and will save a few bytes while at it.
>
>
> On 12/16/14 11:35, John Fastabend wrote:
>
>> But in the SR-IOV case you have multiple "Cpu ports" and you want
>> to send packets to each of them depending on the configuration.
>>
>>
>>     port0   port1     port2  port3
>>      |        |        |      |      uplinks
>>   +------------------------------+
>>   |                              |
>>   |       SRIOV edge relay       |
>>   |                              |
>>   +------------------------------+
>>                   |                   downlink
>>
>>
>
> Two points above:
> 1) Did you flip uplink vs down link above?
> (I Thought URP was the wire link)
> 2) What you are not showing above which is *very important* is that
> infact there is an underlying embedded fdb.
>
> point #2 brings out a lot of the weird things in some of the bridge
> code. IOW, you have an *offloaded* bridge with _bridge ports_
> visible in the kernel but not the bridge that is controlled
> by standard Linux bridge tools. I am not saying that the model is
> wrong; on the contrary what Ben had exposed may fall under the
> same category i.e you have E_BRIDGE flag on the netdev to say it sits
> on top of an offloaded bridge and you dont need a br0 to run
> bridge command on. But then we need some proxy (TheClassThingy) to act
> as intemediary to the offloaded hardware.
> If you do that then the vf becomes simply a bridge port - which
> means bridge port ops apply.
>
> SRIOV it seems to have morphed its own toolkit.
> The PF port, when acting as the control interface, is actually
> TheClassThingy we discuss on/off.
> To add an fdb entry to point to vf 1, where TheClassThingy is eth1:
> ip link set eth1 vf 1 mac aa:bb:cc:dd:ee:ff vlan 10
>
> IMO, SRIOV should expose these ports with names and ifindices
> (probably does already) and pre-populated master or something
> which points to its parent, then i can do the following:
> bridge fdb add aa:bb:cc:dd:ee:ff vlan 10 dev vf1 master
I had a slightly different understanding of how this would work for 
SRIOV. So, am attempting to respond to your questions for John..., ...so 
that he can correct my understanding too ..if needed :).

I think SRIOV VF's do have netdevs (John can confirm, I maybe wrong). To 
me if SRIOV has a single fdb for all VF's under a PF,
and it wants to bypass the bridge driver, there is still no reason to 
refer to the PF as a master.
You can use self and go to the vf driver directly and it will do the 
right thing.

bridge fdb add aa:bb:cc:dd:ee:ff vlan 10 dev vf1 self

>
> master in such a case will go to TheClassThingy which would pass
> such control to the underlying hardware.
> The PF still stays but not as the management interface.

Even if 'TheClassThingy' where there, you wouldn't refer to it as the 
master (ie the PF will not have a netdev master/slave relationship with 
the VF). 'master' will still be used for the netdev 'upper'  device if 
VF was enslaved to one (which could be a bridge).


Thanks,
Roopa


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jamal Hadi Salim Dec. 21, 2014, 7:19 p.m. UTC | #34
On 12/21/14 14:08, Roopa Prabhu wrote:
> PF still stays but not as the management interface.
>
> Even if 'TheClassThingy' where there, you wouldn't refer to it as the
> master (ie the PF will not have a netdev master/slave relationship with
> the VF). 'master' will still be used for the netdev 'upper'  device if
> VF was enslaved to one (which could be a bridge).
>

Well, there is an embedded switch underneath the VFs (in hardware).
You cant send pkts from one VF to another without going through this
switch (or in VEPA mode via it). i.e you dont need a kernel bridge.
So in essence the VF is a bridge port to  this embedded switch (as
is the PF). So the role of master points downwards from the kernel.
Master is just not visible at the kernel. I am not sure what "self"
would mean in this case.
This is why i dont think current switchdev approach would work.

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roopa Prabhu Dec. 21, 2014, 7:36 p.m. UTC | #35
On 12/21/14, 11:19 AM, Jamal Hadi Salim wrote:
> On 12/21/14 14:08, Roopa Prabhu wrote:
>> PF still stays but not as the management interface.
>>
>> Even if 'TheClassThingy' where there, you wouldn't refer to it as the
>> master (ie the PF will not have a netdev master/slave relationship with
>> the VF). 'master' will still be used for the netdev 'upper' device if
>> VF was enslaved to one (which could be a bridge).
>>
>
> Well, there is an embedded switch underneath the VFs (in hardware).
> You cant send pkts from one VF to another without going through this
> switch (or in VEPA mode via it). i.e you dont need a kernel bridge.
understood. And since you don't need the kernel bridge, you don't need 
the kernel netdev master construct here.
>
> So in essence the VF is a bridge port to  this embedded switch (as
> is the PF). So the role of master points downwards from the kernel.
> Master is just not visible at the kernel. 
exactly. So you will not be able to use the kernel 'master' in this case.
> I am not sure what "self"
> would mean in this case.
'self' would just mean the driver owns the PF embedded bridge and the 
kernel bridge driver has no role in this. 'self' will just tell the VF 
driver to deal with the fdb mac entry. And the VF driver can push the 
fdb to the PF  (John can confirm if the intel sriov devices really do it 
this way or some other way).

> This is why i dont think current switchdev approach would work.
Current switchdev code in the kernel supports a driver managing its own 
switch with 'self' calls. ie bypassing the kernel bridge driver. Since 
sriov devices already did it that way...., they will just continue to 
work. They are not broken.
But, yes, for sriov devices to use the switchdev model...may need some work.

Good thing you brought up these points. A BOF to close on these things 
at netdev will be good.

Thanks,
Roopa



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
John Fastabend Dec. 21, 2014, 7:52 p.m. UTC | #36
On 12/21/2014 11:08 AM, Roopa Prabhu wrote:
> On 12/21/14, 6:27 AM, Jamal Hadi Salim wrote:
>>
>> Sorry for the latency, Ive been down with a bad flu (its bad when i cant
>> type on my keyboard sitting infront of me;->),  recovering and the
>> thread seems to have caught on - should be able to catchup in the
>> next few days.
>> I am beginning to reach a conclusion that the current switchdev approach
>> is *not* going to work for SRIOV. I also worry it may be too late
>> to change that.
>> Shrijeet wanted to set up a BOF for netdev to have hopefully final
>> consensus. Shrijeet, are you going to make an official request for the
>> BOF?
>>
>> Sorry John, I dont have enough energy to address all your points but i
>> will try to just focus on SRIOV and will save a few bytes while at it.
>>

not a problem thanks for the response. I might try to document this
somewhere if folks think it would be useful. Something describing
how it works today would be helpful is my thought. Showing the
various stacked cases and how messages get propagated. (Some cases
being with bridge, without bridge, with bridge and multiple uplinks,
with bridge + VLAN filtering, macvlan, SR-IOV + bridge + VMDQ, etc.)

Its not a small task so likely won't get to it until after the new
year.

>>
>> On 12/16/14 11:35, John Fastabend wrote:
>>
>>> But in the SR-IOV case you have multiple "Cpu ports" and you want
>>> to send packets to each of them depending on the configuration.
>>>
>>>
>>>     port0   port1     port2  port3
>>>      |        |        |      |      uplinks
>>>   +------------------------------+
>>>   |                              |
>>>   |       SRIOV edge relay       |
>>>   |                              |
>>>   +------------------------------+
>>>                   |                   downlink
>>>
>>>
>>
>> Two points above:
>> 1) Did you flip uplink vs down link above?
>> (I Thought URP was the wire link)

yes sorry typo hopefully not too confusing.

>> 2) What you are not showing above which is *very important* is that
>> infact there is an underlying embedded fdb.

Yes. There is an embedded FDB.

>>
>> point #2 brings out a lot of the weird things in some of the bridge
>> code. IOW, you have an *offloaded* bridge with _bridge ports_
>> visible in the kernel but not the bridge that is controlled
>> by standard Linux bridge tools. I am not saying that the model is
>> wrong; on the contrary what Ben had exposed may fall under the
>> same category i.e you have E_BRIDGE flag on the netdev to say it sits
>> on top of an offloaded bridge and you dont need a br0 to run
>> bridge command on. But then we need some proxy (TheClassThingy) to act
>> as intemediary to the offloaded hardware.
>> If you do that then the vf becomes simply a bridge port - which
>> means bridge port ops apply.
>>
>> SRIOV it seems to have morphed its own toolkit.

Yes, but I don't think its too late to bring it into the picture here.

>> The PF port, when acting as the control interface, is actually
>> TheClassThingy we discuss on/off.

Yep or if you take Jiri's approach any port on the nic could be used
to manage this.

>> To add an fdb entry to point to vf 1, where TheClassThingy is eth1:
>> ip link set eth1 vf 1 mac aa:bb:cc:dd:ee:ff vlan 10
>>
>> IMO, SRIOV should expose these ports with names and ifindices
>> (probably does already) and pre-populated master or something
>> which points to its parent, then i can do the following:
>> bridge fdb add aa:bb:cc:dd:ee:ff vlan 10 dev vf1 master
> I had a slightly different understanding of how this would work for
> SRIOV. So, am attempting to respond to your questions for John..., ...so
> that he can correct my understanding too ..if needed :).
>
> I think SRIOV VF's do have netdevs (John can confirm, I maybe wrong). To
> me if SRIOV has a single fdb for all VF's under a PF,
> and it wants to bypass the bridge driver, there is still no reason to
> refer to the PF as a master.
> You can use self and go to the vf driver directly and it will do the
> right thing.

The VF's may have netdev's if they are in the host. In this case you
could use 'bridge fdb' to manage them. In many use cases though the
VFs are directly assigned to VMs and then are outside the hosts
management domain. For this case you can either let the host tell the
driver which addresses it would want to receive.

Another _idea_ would be to create a "shadow" netdev in the host
to manage the port even when the VF is direct assigned. Then you
could use all your normal commands from the host to set the MTU,
set any MACs, etc. At the moment as Jamal noted we have a subset
of 'ip link' commands that we use to work on VFs when they are not
in the host domain.

	'ip link set ethx vf # ...'

In the SR-IOV case you would have a PF and then a set of eth-vf#
netdev's which are not attached to a VF but act as the management
interface for the port.

>
> bridge fdb add aa:bb:cc:dd:ee:ff vlan 10 dev vf1 self
>
>>
>> master in such a case will go to TheClassThingy which would pass
>> such control to the underlying hardware.
>> The PF still stays but not as the management interface.
>

I think this is not specific to SR-IOV though right. This is the
same point for both "real" switch ASICs and SR-IOV. Using the netdev
directly as a management interace (a la rocker) seems to work OK.
But does it become cleaner to have the switch object represented
explicitly for management.

> Even if 'TheClassThingy' where there, you wouldn't refer to it as the
> master (ie the PF will not have a netdev master/slave relationship with
> the VF). 'master' will still be used for the netdev 'upper'  device if
> VF was enslaved to one (which could be a bridge).
>

Sounds right to me.

>
> Thanks,
> Roopa
>
>
Jamal Hadi Salim Dec. 21, 2014, 8:06 p.m. UTC | #37
On 12/21/14 14:36, Roopa Prabhu wrote:
> On 12/21/14, 11:19 AM, Jamal Hadi Salim wrote:
>> On 12/21/14 14:08, Roopa Prabhu wrote:

> 'self' would just mean the driver owns the PF embedded bridge and the
> kernel bridge driver has no role in this. 'self' will just tell the VF
> driver to deal with the fdb mac entry. And the VF driver can push the
> fdb to the PF  (John can confirm if the intel sriov devices really do it
> this way or some other way).
>

If the VFs are exposed as netdevs and the master (or Parent which i 
think the "P" stands for) they point to
is the PF - then is the PF a bridge abstraction?
And if the path is via is the PF - i think that seems like "master"
not self, no?

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roopa Prabhu Dec. 21, 2014, 8:46 p.m. UTC | #38
On 12/21/14, 12:06 PM, Jamal Hadi Salim wrote:
> On 12/21/14 14:36, Roopa Prabhu wrote:
>> On 12/21/14, 11:19 AM, Jamal Hadi Salim wrote:
>>> On 12/21/14 14:08, Roopa Prabhu wrote:
>
>> 'self' would just mean the driver owns the PF embedded bridge and the
>> kernel bridge driver has no role in this. 'self' will just tell the VF
>> driver to deal with the fdb mac entry. And the VF driver can push the
>> fdb to the PF  (John can confirm if the intel sriov devices really do it
>> this way or some other way).
>>
>
> If the VFs are exposed as netdevs and the master (or Parent which i 
> think the "P" stands for) they point to
> is the PF - then is the PF a bridge abstraction?
yes, could be, but its not today ('PF' is physical function and 'VF' is 
virtual function).
If you introduce a master/slave relationship between the PF and VF (ie 
VF's were assigned PF as the master using 'ip link set dev vf master 
PF), then yes.
> And if the path is via is the PF - i think that seems like "master"
> not self, no?

Today ...when you add fdb...path is not via the PF netdev. It maybe 
internally done that way in PF/VF driver.
so, 'master' does not apply today. But if there were such a relationship 
between PF/VF, yes, 'master' could be used.

PF does not really need to have a master relationship with the VF. Its 
better that way. Infact it should be that way even in the case of 'the 
switch device class model' because that will allow switch ports to be 
added to a linux bridge (and hence make use of the linux bridge (cumulus 
model). 'master' will be the 'linux bridge device' in this case).

Thanks,
Roopa
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jamal Hadi Salim Dec. 22, 2014, 2:59 a.m. UTC | #39
On 12/21/14 14:52, John Fastabend wrote:

> not a problem thanks for the response. I might try to document this
> somewhere if folks think it would be useful. Something describing
> how it works today would be helpful is my thought. Showing the
> various stacked cases and how messages get propagated. (Some cases
> being with bridge, without bridge, with bridge and multiple uplinks,
> with bridge + VLAN filtering, macvlan, SR-IOV + bridge + VMDQ, etc.)
>
> Its not a small task so likely won't get to it until after the new
> year.
>

I think this would be very useful.
We are looking for an L2 tutorial at netdev01 (I was looking at
Vlad and Roopa so far) so maybe we could split the work. Would
you be interested?
Yes, all those bridge and bridge like things like Vxlan for example
should be part of this de-mystification.
Shrijeet is also putting together a BOF for the offload stuff.

>
> Yes, but I don't think its too late to bring it into the picture here.
>

that would be nice.


>>> The PF port, when acting as the control interface, is actually
>>> TheClassThingy we discuss on/off.
>
> Yep or if you take Jiri's approach any port on the nic could be used
> to manage this.
>
> The VF's may have netdev's if they are in the host. In this case you
> could use 'bridge fdb' to manage them. In many use cases though the
> VFs are directly assigned to VMs and then are outside the hosts
> management domain. For this case you can either let the host tell the
> driver which addresses it would want to receive.
>

Which driver? The PF? I dont think the semantics allow for that;
How do i tell the PF using "fdb add" that a specific mac+vlan should
be sent to vf1 which is now sitting inside some vm?

> Another _idea_ would be to create a "shadow" netdev in the host
> to manage the port even when the VF is direct assigned.

That may work. But there are questions:
Can its name be changed within the container/vm or are those
different namespaces etc.
So if the answer is that "self" implies using the hardware fdb,
what does "master" mean?


>Then you
> could use all your normal commands from the host to set the MTU,
> set any MACs, etc. At the moment as Jamal noted we have a subset
> of 'ip link' commands that we use to work on VFs when they are not
> in the host domain.
>
>      'ip link set ethx vf # ...'
>

I think once it moves management of such things as MTUs should be from
wherever that thing sits at (vm/container) to avoid any suprises.


> In the SR-IOV case you would have a PF and then a set of eth-vf#
> netdev's which are not attached to a VF but act as the management
> interface for the port.
>

you can have more than one PF?


> I think this is not specific to SR-IOV though right.This is the
> same point for both "real" switch ASICs and SR-IOV. Using the netdev
> directly as a management interace (a la rocker) seems to work OK.
> But does it become cleaner to have the switch object represented
> explicitly for management.
>

Indeed it does. In particular if you had to move around a bridge port
to a container or VM.
Now if we introduced the idea of showing up with netdev for each vf
and you had a classthing you dont have to keep the vf1s visible
once migrated - but would be able to add fdb entries pointing to
them (assuming names/ifindices remain valid).

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jamal Hadi Salim Dec. 22, 2014, 3:13 a.m. UTC | #40
On 12/21/14 15:46, Roopa Prabhu wrote:
> On 12/21/14, 12:06 PM, Jamal Hadi Salim wrote:

> yes, could be, but its not today ('PF' is physical function and 'VF' is
> virtual function).
> If you introduce a master/slave relationship between the PF and VF (ie
> VF's were assigned PF as the master using 'ip link set dev vf master
> PF), then yes.


When someone says "modprobe igb max_vfs=19" then 19 VFs show up. i.e the
driver creates them. And then there is assumed direct relationship
between VF and PF. The PF being the parent. Adding fdbs goes via PF.

>> And if the path is via is the PF - i think that seems like "master"
>> not self, no?
>
> Today ...when you add fdb...path is not via the PF netdev.

For SRIOV it is. Example to add via pf eth10 an
fdb entry to the igb hardware fdb to point to vf1:
ip link set eth10 vf 1 mac aa:bb:cc:dd:ee:ff vlan 10
That last part "vf 1 mac aa:bb:cc:dd:ee:ff vlan 10" is typically
part of an "fdb add semantic" - but we explicitly call out
eth10, the parent. The PF has control of the hardware fdb.

It maybe
> internally done that way in PF/VF driver.
> so, 'master' does not apply today. But if there were such a relationship
> between PF/VF, yes, 'master' could be used.
>

I am refering if were to get rid of using iplink. There has to be 
something pointed to by vf1 that gets called to add the fdb entry in
hardware.

> PF does not really need to have a master relationship with the VF. Its
> better that way. Infact it should be that way even in the case of 'the
> switch device class model' because that will allow switch ports to be
> added to a linux bridge (and hence make use of the linux bridge (cumulus
> model). 'master' will be the 'linux bridge device' in this case).
>

So what do you do if the user sets either one of master/self and it 
doesnt make sense?

cheers,
jamal


> Thanks,
> Roopa

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roopa Prabhu Dec. 22, 2014, 6:24 a.m. UTC | #41
On 12/21/14, 7:13 PM, Jamal Hadi Salim wrote:
> On 12/21/14 15:46, Roopa Prabhu wrote:
>> On 12/21/14, 12:06 PM, Jamal Hadi Salim wrote:
>
>> yes, could be, but its not today ('PF' is physical function and 'VF' is
>> virtual function).
>> If you introduce a master/slave relationship between the PF and VF (ie
>> VF's were assigned PF as the master using 'ip link set dev vf master
>> PF), then yes.
>
>
> When someone says "modprobe igb max_vfs=19" then 19 VFs show up. i.e the
> driver creates them. And then there is assumed direct relationship
> between VF and PF. The PF being the parent. Adding fdbs goes via PF.
>
>>> And if the path is via is the PF - i think that seems like "master"
>>> not self, no?
>>
>> Today ...when you add fdb...path is not via the PF netdev.
>
> For SRIOV it is. Example to add via pf eth10 an
> fdb entry to the igb hardware fdb to point to vf1:
> ip link set eth10 vf 1 mac aa:bb:cc:dd:ee:ff vlan 10
> That last part "vf 1 mac aa:bb:cc:dd:ee:ff vlan 10" is typically
> part of an "fdb add semantic" - but we explicitly call out
> eth10, the parent. The PF has control of the hardware fdb.

Ah......i did not know this syntax with 'ip link set'. thanks for 
pointing out.
I always thought that you can still use 'bridge fdb add' for vfs. 
Curious why its not that way.

>
> It maybe
>> internally done that way in PF/VF driver.
>> so, 'master' does not apply today. But if there were such a relationship
>> between PF/VF, yes, 'master' could be used.
>>
>
> I am refering if were to get rid of using iplink. There has to be 
> something pointed to by vf1 that gets called to add the fdb entry in
> hardware.
ok, i assumed we were only talking about  'bridge fdb add'
>
>> PF does not really need to have a master relationship with the VF. Its
>> better that way. Infact it should be that way even in the case of 'the
>> switch device class model' because that will allow switch ports to be
>> added to a linux bridge (and hence make use of the linux bridge (cumulus
>> model). 'master' will be the 'linux bridge device' in this case).
>>
>
> So what do you do if the user sets either one of master/self and it 
> doesnt make sense?

Am guessing it will continue to do what it does today. If there is no 
master or if there is master and the master does not support the op, it 
will return -EOPNOTSUPP. And, self does not make sense in cases where 
the port driver does not support the op. In which case again you will 
get a -EOPNOTSUPP. Have not thought through all the other cases yet.

Thanks,
Roopa

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jamal Hadi Salim Dec. 22, 2014, 12:10 p.m. UTC | #42
On 12/22/14 01:24, Roopa Prabhu wrote:
> On 12/21/14, 7:13 PM, Jamal Hadi Salim wrote:

>> For SRIOV it is. Example to add via pf eth10 an
>> fdb entry to the igb hardware fdb to point to vf1:
>> ip link set eth10 vf 1 mac aa:bb:cc:dd:ee:ff vlan 10
>> That last part "vf 1 mac aa:bb:cc:dd:ee:ff vlan 10" is typically
>> part of an "fdb add semantic" - but we explicitly call out
>> eth10, the parent. The PF has control of the hardware fdb.
>
> Ah......i did not know this syntax with 'ip link set'. thanks for
> pointing out.
> I always thought that you can still use 'bridge fdb add' for vfs.
> Curious why its not that way.
>

It was there before bridge command showed up. I think thats why
John feels it can be repaired. The main challenge is that most
of the time the VFs are "migrated" from host to VMs, so you cant
even list them on the host.
Actually you cant list them as netdevs (please someone
correct me if i am wrong). What you see are PCI devices.

 From my notes:
---
modprobe igb max_vfs=6
# lspci | grep 82576
0b:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network 
Connection (rev 01)
0b:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network 
Connection(rev 01)
0b:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function 
(rev 01)
0b:10.1 Ethernet controller: Intel Corporation 82576 Virtual Function 
(rev 01)
..
...
....
---

And then you can tell qemu to use one of these devices as a network
device and holla it disappears from the host.

What you can do from the host side is tell the embedded hardware switch
to add an fdb which sends  packets to a VF using the above syntax (via
PF). I think PF as control interface to the VFs is a convinience issue.
It just happened to be there and could be used as the anchor.
This is why i was mapping it instead to theclassthingy which i think
is more generic.

>
>> So what do you do if the user sets either one of master/self and it
>> doesnt make sense?
>
> Am guessing it will continue to do what it does today. If there is no
> master or if there is master and the master does not support the op, it
> will return -EOPNOTSUPP. And, self does not make sense in cases where
> the port driver does not support the op. In which case again you will
> get a -EOPNOTSUPP. Have not thought through all the other cases yet.
>

So master is only valid if there is a software device parent?
Again note - none of these VFs are attached to a bridge today.
It doesnt seem to me like it even makes sense people want to
do that.
The PF could be and in such a case its master is the bridge.

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jamal Hadi Salim Dec. 22, 2014, 1:04 p.m. UTC | #43
On 12/22/14 07:10, Jamal Hadi Salim wrote:

> Actually you cant list them as netdevs (please someone
> correct me if i am wrong). What you see are PCI devices.

I have been corrected.
It turns out i was wrong. The VFs infact show up as ethx devices.
So your idea of having the devices with "bridge .. self" would work
except when the device moves to a VM and you cant see them anymore...
and you want to add an fdb entry to the embedded switch..

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hubert Sokolowski Jan. 5, 2015, 12:56 p.m. UTC | #44
On 19/12/14 16:32, Roopa Prabhu wrote:
> On 12/19/14, 7:17 AM, Hubert Sokolowski wrote:
>> On 18/12/14 22:32, Jamal Hadi Salim wrote:
>>
>>> Sorry for the latency (head-buried-in-sand in effect)
>>> On 12/17/14 11:18, Hubert Sokolowski wrote:
>>>> I have just prepared a patch where I dump uc/mc for bridge devices
>>>> by looking at (dev->priv_flags & IFF_EBRIDGE), so I have same results
>>>> as without my changes. This should satisfy Jamal and Roopa.
>>>> I could send it as v3 of my patch along with the results if you are
>>>> interested.
>>> Please do. If you satisfy Vlad's goals then we are all happy.
>> Posted as v3, please review.
>> There is still open question I asked sometime ago but never got explained.
>> It is about the new filter_dev parameter that was added to ndo_fdb_dump:
>>          int                     (*ndo_fdb_dump)(struct sk_buff *skb,
>>                                                  struct netlink_callback *cb,
>>                                                  struct net_device *dev,
>>                                                  struct net_device *filter_dev,
>>                                                  int idx);
>>
>> When we call this function for a device, dev pointer is passed as the filter_dev:
>>                  if (dev->netdev_ops->ndo_fdb_dump)
>>                          idx = dev->netdev_ops->ndo_fdb_dump(skb, cb, bdev, dev,
>>                                                              idx);
> seems like these calls should be fixed. bdev is really dev in this case. And filter_dev should be null.
>

Will fix this too and resend the patch as v4.

Thanks,
Hubert
diff mbox

Patch

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index eaa057f..a9e5c37 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2692,10 +2692,11 @@  static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
 							 idx);
 		}

-		idx = ndo_dflt_fdb_dump(skb, cb, dev, NULL, idx);
 		if (dev->netdev_ops->ndo_fdb_dump)
 			idx = dev->netdev_ops->ndo_fdb_dump(skb, cb, bdev, dev,
 							    idx);
+		else
+			idx = ndo_dflt_fdb_dump(skb, cb, dev, NULL, idx);

 		cops = NULL;
 	}