[net-next,v2] bridge: vlan: allow to suppress local mac install for all vlans

Message ID 1440549295-3979-1-git-send-email-razor@blackwall.org
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Nikolay Aleksandrov Aug. 26, 2015, 12:34 a.m. UTC
From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

This patch adds a new knob that, when enabled, suppresses the
installation of local fdb entries in newly created vlans. These entries
can pose a big scalability issue if we have a large number of ports and
a large number of vlans, e.g. in a 48 port device with 2000 vlans they
easily go up to 96000.
Note that packets for these macs are still received properly because they
are added in vlan 0 as "own" macs and referenced when fdb lookup by vlan
results in a miss.
Also note that vlan membership of ingress port and the bridge device
as egress are still being correctly enforced.

The default (0/off) keeps the current behaviour.

Based on a patch by Wilson Kok (wkok@cumulusnetworks.com).

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
v2: Triple checked the timezone

 include/uapi/linux/if_link.h |  1 +
 net/bridge/br_input.c        |  7 +++++++
 net/bridge/br_netlink.c      | 14 +++++++++++++-
 net/bridge/br_private.h      | 18 ++++++++++++++++++
 net/bridge/br_sysfs_br.c     | 18 ++++++++++++++++++
 net/bridge/br_vlan.c         | 18 +++++++++++++-----
 6 files changed, 70 insertions(+), 6 deletions(-)

Comments

Stephen Hemminger Aug. 26, 2015, 12:56 a.m. UTC | #1
On Tue, 25 Aug 2015 17:34:55 -0700
Nikolay Aleksandrov <razor@blackwall.org> wrote:

> From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> 
> This patch adds a new knob that, when enabled, suppresses the
> installation of local fdb entries in newly created vlans. These entries
> can pose a big scalability issue if we have a large number of ports and
> a large number of vlans, e.g. in a 48 port device with 2000 vlans they
> easily go up to 96000.
> Note that packets for these macs are still received properly because they
> are added in vlan 0 as "own" macs and referenced when fdb lookup by vlan
> results in a miss.
> Also note that vlan membership of ingress port and the bridge device
> as egress are still being correctly enforced.
> 
> The default (0/off) keeps the current behaviour.
> 
> Based on a patch by Wilson Kok (wkok@cumulusnetworks.com).


This is getting messy, but then again the bridge seems to have been
a ghetto for a long time. I would rather see the lookup code fixed so
that the fdb is correct.
Stephen Hemminger Aug. 26, 2015, 12:58 a.m. UTC | #2
On Tue, 25 Aug 2015 17:34:55 -0700
Nikolay Aleksandrov <razor@blackwall.org> wrote:

> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
> index 3d95647039d0..2bda472c5a6e 100644
> --- a/net/bridge/br_private.h
> +++ b/net/bridge/br_private.h
> @@ -294,6 +294,7 @@ struct net_bridge
>  	u32				auto_cnt;
>  #ifdef CONFIG_BRIDGE_VLAN_FILTERING
>  	u8				vlan_enabled;
> +	bool				vlan_ignore_local_fdb;

bool takes more space than u8.


>  	__be16				vlan_proto;
>  	u16				default_pvid;
>  	struct net_port_vlans __rcu	*vlan_info;

> +int br_vlan_ignore_local_fdb_toggle(struct net_bridge *br, unsigned long val)
> +{
> +	br->vlan_ignore_local_fdb = val ? true : false;

personal preference is for:
	br->vlan_ignore_local_fdb = !!val;
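
For reference, the toggle with both of Stephen's nits folded in would look like this (a sketch, not the posted patch):

	/* in struct net_bridge: u8 packs next to vlan_enabled, bool may be wider */
	u8				vlan_ignore_local_fdb;

	int br_vlan_ignore_local_fdb_toggle(struct net_bridge *br, unsigned long val)
	{
		br->vlan_ignore_local_fdb = !!val;

		return 0;
	}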
David Miller Aug. 26, 2015, 2:42 a.m. UTC | #3
From: Nikolay Aleksandrov <razor@blackwall.org>
Date: Tue, 25 Aug 2015 17:34:55 -0700

> From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> 
> This patch adds a new knob that, when enabled, suppresses the
> installation of local fdb entries in newly created vlans. These entries
> can pose a big scalability issue if we have a large number of ports and
> a large number of vlans, e.g. in a 48 port device with 2000 vlans they
> easily go up to 96000.
> Note that packets for these macs are still received properly because they
> are added in vlan 0 as "own" macs and referenced when fdb lookup by vlan
> results in a miss.
> Also note that vlan membership of ingress port and the bridge device
> as egress are still being correctly enforced.
> 
> The default (0/off) keeps the current behaviour.
> 
> Based on a patch by Wilson Kok (wkok@cumulusnetworks.com).
> 
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> ---
> v2: Triple checked the timezone

I'd rather we fix the essence of the scalability problem than add
more spaghetti code to the various bridge paths.

Can we make the fdb entries smaller?

Can we enhance how we store such local entries such that they live in
a compact datastructure?  Perhaps the FDB can consist of a very dense
lookup mechanism for local stuff sitting alongside the current table.
Nikolay Aleksandrov Aug. 26, 2015, 5:28 a.m. UTC | #4
> On Aug 25, 2015, at 7:42 PM, David Miller <davem@davemloft.net> wrote:
> 
> From: Nikolay Aleksandrov <razor@blackwall.org>
> Date: Tue, 25 Aug 2015 17:34:55 -0700
> 
<<<snip>>>
> 
> I'd rather we fix the essence of the scalability problem than add
> more spaghetti code to the various bridge paths.
> 
> Can we make the fdb entries smaller?
> 
> Can we enhance how we store such local entries such that they live in
> a compact datastructure?  Perhaps the FDB can consist of a very dense
> lookup mechanism for local stuff sitting alongside the current table.

Certainly, that should be done and I will look into it, but the essence of this patch
is a bit different. The problem here is not the size of the fdb entries, it’s more the
number of them - having 96000 entries (even if they were 1 byte ones) is just way
too much, especially when the fdb hash size is small and static. We could work on making
it dynamic though, but still this type of local entry per vlan per port can easily be avoided
with this option.


Nikolay Aleksandrov Aug. 26, 2015, 5:42 a.m. UTC | #5
> On Aug 25, 2015, at 5:56 PM, Stephen Hemminger <stephen@networkplumber.org> wrote:
> 
> On Tue, 25 Aug 2015 17:34:55 -0700
> Nikolay Aleksandrov <razor@blackwall.org> wrote:
> 
<<<snip>>>
> 
> 
> This is getting messy, but then again the bridge seems to have been
> a ghetto for a long time. I would rather see the lookup code fixed so
> that the fdb is correct.

What do you mean by it is getting messy? The entries (normally) are being added to each
vlan, so there's not much in terms of lookup that you can fix except making the table bigger/better,
but that will only be a temporary win. If you elaborate on what you mean by the fdb code being fixed,
I could spend time and work on fixing it. If it is resizing the table so it can handle 96k entries and
probably using an rhashtable, that is what I have in mind too.
I still think that it would be nice to have the option to avoid adding the 96k entries in the first place
and that space could be better utilized by real ones, which this option does.

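
For concreteness, this is how the knob from this patch would be flipped once applied (bridge name br0 assumed; the sysfs file comes from the DEVICE_ATTR_RW(vlan_ignore_local_fdb) added in br_sysfs_br.c, with IFLA_BR_VLAN_IGNORE_LOCAL_FDB as the netlink counterpart):

	# suppress local fdb entries for vlans added from now on
	echo 1 > /sys/class/net/br0/bridge/vlan_ignore_local_fdb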
B Viswanath Aug. 26, 2015, 6:10 a.m. UTC | #6
>>
>> I'd rather we fix the essence of the scalability problem than add
>> more spaghetti code to the various bridge paths.
>>
>> Can we make the fdb entries smaller?
>>
>> Can we enhance how we store such local entries such that they live in
>> a compact datastructure?  Perhaps the FDB can consist of a very dense
>> lookup mechanism for local stuff sitting alongside the current table.
>
> Certainly, that should be done and I will look into it, but the essence of this patch
> is a bit different. The problem here is not the size of the fdb entries, it’s more the
> number of them - having 96000 entries (even if they were 1 byte ones) is just way
>> too much, especially when the fdb hash size is small and static. We could work on making
>> it dynamic though, but still this type of local entry per vlan per port can easily be avoided
> with this option.
>

I was wondering if it is possible to assign a vlan bitmap for the FDB
entry, instead of replicating the entry for each vlan. (I believe
Roopa has done something similar, but I'm not so sure.) This means that
the number of FDB entries remains static for any number of vlans.

I guess it's more complicated than it sounds, but I just wanted to know
if it's feasible at all.

Thanks
Vissu

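
The shape Vissu is asking about might look roughly like this (purely illustrative, no such struct exists in the patch; Vlad's reply below explains why a sparse vlan set makes the bitmap wasteful):

	/* one shared fdb entry, valid on every vlan whose bit is set */
	struct shared_fdb_entry {
		struct hlist_node	hlist;
		struct mac_addr		addr;
		DECLARE_BITMAP(vlans, VLAN_N_VID);	/* 4096 bits = 512 bytes per entry */
	};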
Nikolay Aleksandrov Aug. 26, 2015, 11:33 a.m. UTC | #7
> On Aug 25, 2015, at 11:06 PM, David Miller <davem@davemloft.net> wrote:
> 
> From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> Date: Tue, 25 Aug 2015 22:28:16 -0700
> 
<<<snip>>>
> 
> 96000 bits can be stored in 12k.  Get where I'm going with this?
> 
> Look at the problem sideways.

Oh okay, I misunderstood your previous comment. I’ll look into that.

Thanks,
 Nik
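
Miller's arithmetic: one bit per (port, vlan) pair instead of one fdb entry per pair, i.e. 48 ports * 2000 vlans = 96000 bits = ~12k. A sketch of that direction (hypothetical members, not part of this patch):

	/* in struct net_bridge_port: vlans on which the port's own mac acts as local */
	DECLARE_BITMAP(local_vlans, VLAN_N_VID);	/* 512 bytes per port */

	static inline bool br_port_mac_is_local(const struct net_bridge_port *p, u16 vid)
	{
		return test_bit(vid, p->local_vlans);
	}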
Vlad Yasevich Aug. 26, 2015, 12:52 p.m. UTC | #8
On 08/26/2015 02:10 AM, B Viswanath wrote:
<<<snip>>>
> 
> I was wondering if it is possible to assign a vlan bitmap for the FDB
> entry, instead of replicating the entry for each vlan. (I believe
> Roopa has done something similar, but I'm not so sure.) This means that
> the number of FDB entries remains static for any number of vlans.
> 
> I guess it's more complicated than it sounds, but I just wanted to know
> if it's feasible at all.

I've actually had this done in one of the earlier attempts.  The issue was how
to compress it because there was absolutely no gain if you have a sparse vlan bitmap.

I even tried doing something along the lines of a vlan_group array, but that can
explode to full size almost as fast.

What actually worked better was a hash table of vlans where each entry in the table
contained a bunch of data, one of which was a list of fdbs for a given vlan.  It
didn't replicate fdbs but simply referenced the ones we cared about and bumped the ref.

However, this made vlan look-ups slower since we now had a hash instead of a bitmap lookup
and Stephen rejected it.

-vlad

> 
> Thanks
> Vissu
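
A rough sketch of the structure Vlad describes (names are illustrative; this code was never posted):

	/* per-vlan entry in a bridge-wide vlan hash table */
	struct br_vlan_bucket {
		struct hlist_node	hlist;		/* bucket chaining */
		u16			vid;
		struct list_head	ports;		/* member ports of this vlan */
		struct list_head	fdb_list;	/* fdbs referenced and refcounted, not copied */
	};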
Roopa Prabhu Aug. 27, 2015, 4:57 a.m. UTC | #9
On 8/26/15, 4:33 AM, Nikolay Aleksandrov wrote:
<<<snip>>>
> Oh okay, I misunderstood your previous comment. I’ll look into that.
>
I just wanted to add the other problems we have had with keeping these 
macs (mostly from userspace POV):
- add/del netlink notification storms
- and large netlink dumps

In addition to in-kernel optimizations, it will be nice to have a solution
that reduces the burden on userspace. That will need a newer netlink
dump format for fdbs. Considering all the changes needed, Nikolay's patch
seems less intrusive.


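
For scale, the dump Roopa refers to is what userspace receives from something like the command below; with local entries installed per vlan per port, each of the ~96000 entries is a separate netlink message:

	# count the local ("permanent") fdb entries of bridge br0
	bridge fdb show br br0 | grep -c permanent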
Nikolay Aleksandrov Aug. 27, 2015, 9:02 p.m. UTC | #10
> On Aug 26, 2015, at 9:57 PM, roopa <roopa@cumulusnetworks.com> wrote:
> 
<<<snip>>>
> I just wanted to add the other problems we have had with keeping these macs (mostly from userspace POV):
> - add/del netlink notification storms
> - and large netlink dumps
> 
> In addition to in-kernel optimizations, it will be nice to have a solution that reduces the burden on userspace. That will need a newer netlink dump format for fdbs. Considering all the changes needed, Nikolay's patch seems less intrusive.

Right, we need to take these into account as well. I'll continue the discussion on this (or restart it) because
I looked into using a bitmap for the local entries only, and while it fixes the scalability issue, it presents
a few new ones which are mostly related to the fact that these entries now exist only without a vlan:
if a new mac comes along which matches one of these but is in a vlan, the entry will get created
in br_fdb_update() unless we add a second lookup, but that will slow down the learning path.
Also this change requires an update of every fdb function that uses the vid as a key (every fdb function?!)
because now we can have the mac in two places instead of one, which is a pretty big churn with lots
of conditionals all over the place, and I don't like it. Adding this complexity for the local addresses only
seems like overkill, so I think I'll drop this issue for now.
This patch (that works around the initial problem) also has these issues.
Note that a more straightforward way to take care of this would be to have each entry
carry some sort of a bitmap (like Vlad tried earlier); then we can combine the paths so most
of these issues disappear, but that will not be easy, as was already commented earlier. I've looked
briefly into doing this with rhashtable so we can keep the memory footprint for each entry relatively
small, but it still affects the performance and we can have thousands of resizes happening.

On the notification side, if we can fix that, we can actually delete the 96000 entries without creating a
huge notification storm and do a user-land workaround of the original issue, so I’ll look into that next.

Any comments or ideas are very welcome.

Thank you,
 Nik

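
The second lookup Nikolay wants to avoid would sit in the learning path roughly like this (a sketch: br_fdb_find_local() is a hypothetical vlan-less lookup, fdb_find_rcu() is the existing per-vlan helper in br_fdb.c of that era):

	/* with local macs kept only at vlan 0, learning would have to check there too */
	fdb = fdb_find_rcu(head, addr, vid);		/* the normal per-vlan lookup */
	if (!fdb && br_fdb_find_local(br, addr))	/* hypothetical second lookup */
		return;					/* own mac seen in a vlan: don't learn it */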
Vlad Yasevich Aug. 27, 2015, 11:47 p.m. UTC | #11
On 08/27/2015 05:02 PM, Nikolay Aleksandrov wrote:
> 
<<<snip>>>
> 
> Right, we need to take these into account as well. I'll continue the discussion on this (or restart it) because
> I looked into using a bitmap for the local entries only, and while it fixes the scalability issue, it presents
> a few new ones which are mostly related to the fact that these entries now exist only without a vlan:
> if a new mac comes along which matches one of these but is in a vlan, the entry will get created
> in br_fdb_update() unless we add a second lookup, but that will slow down the learning path.
> Also this change requires an update of every fdb function that uses the vid as a key (every fdb function?!)
> because now we can have the mac in two places instead of one, which is a pretty big churn with lots
> of conditionals all over the place, and I don't like it. Adding this complexity for the local addresses only
> seems like overkill, so I think I'll drop this issue for now.

I seem to recall Roopa and I and maybe a few others discussing this a few
years ago at plumbers, but I can't remember the details any more.  All these local
addresses add a ton of confusion.  Does anyone (Stephen?) remember what the
original reason was for all these local addresses? I wonder if we can have
a knob to disable all of them (not just per vlan)?  That might be cleaner and
easier to swallow.

> This patch (that works around the initial problem) also has these issues.
> Note that a more straightforward way to take care of this would be to have each entry
> carry some sort of a bitmap (like Vlad tried earlier); then we can combine the paths so most
> of these issues disappear, but that will not be easy, as was already commented earlier. I've looked
> briefly into doing this with rhashtable so we can keep the memory footprint for each entry relatively
> small, but it still affects the performance and we can have thousands of resizes happening.
> 

So, one of the earlier approaches that I've tried (before rhashtable was
in the kernel) was to have a hash of vlan ids, each with a data structure
pointing to a list of ports for a given vlan as well as a list of fdbs for
a given vlan.  As far as scalability goes, that's really the best approach.
It would also allow us to do packet accounting per vlan.  The only concern
at the time was performance of ingress lookup.   I think rhashtables might
help with this, as well as the ability to grow the footprint of the vlan hash
table dynamically.

-vlad

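
If the vlan table does move to an rhashtable, the setup being weighed here could look like this (a sketch against the rhashtable API of that time; br_vlan_bucket is the illustrative struct from the earlier sketch, with a struct rhash_head rhnode in place of the hlist_node):

	static const struct rhashtable_params br_vlan_rht_params = {
		.head_offset		= offsetof(struct br_vlan_bucket, rhnode),
		.key_offset		= offsetof(struct br_vlan_bucket, vid),
		.key_len		= sizeof(u16),
		.automatic_shrinking	= true,	/* resizing is the stability concern raised below */
	};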
Nikolay Aleksandrov Aug. 28, 2015, 2:17 a.m. UTC | #12
> On Aug 27, 2015, at 4:47 PM, Vlad Yasevich <vyasevic@redhat.com> wrote:
> 
> On 08/27/2015 05:02 PM, Nikolay Aleksandrov wrote:
>> 
<<<snip>>>
> 
> I seem to recall Roopa and I and maybe a few others discussing this a few
> years ago at plumbers, but I can't remember the details any more.  All these local
> addresses add a ton of confusion.  Does anyone (Stephen?) remember what the
> original reason was for all these local addresses? I wonder if we can have
> a knob to disable all of them (not just per vlan)?  That might be cleaner and
> easier to swallow.
> 

Right, this would be the easiest way, and if the others agree I'll post a patch for it so we can
have some way to resolve it today. Even if we fix the scalability issue, this is still a valid case:
some people don't want local fdbs installed automatically.
Any objections to this?

<<<snip>>>
> 
> So, one of the earlier approaches that I've tried (before rhashtable was
> in the kernel) was to have a hash of vlan ids, each with a data structure
> pointing to a list of ports for a given vlan as well as a list of fdbs for
> a given vlan.  As far as scalability goes, that's really the best approach.
> It would also allow us to do packet accounting per vlan.  The only concern
> at the time was performance of ingress lookup.   I think rhashtables might
> help with this, as well as the ability to grow the footprint of the vlan hash
> table dynamically.
> 
> -vlad
> 
I’ll look into it but I’m guessing the learning will become a more complicated process with additional 
allocations and some hash handling.

Vlad Yasevich Aug. 28, 2015, 12:31 p.m. UTC | #13
On 08/27/2015 10:17 PM, Nikolay Aleksandrov wrote:
<<<snip>>>
> I’ll look into it but I’m guessing the learning will become a more complicated process with additional 
> allocations and some hash handling.

I don't remember learning being all that complicated.  The hash only changed under
rtnl when vlans were added/removed.  The nice thing is that we wouldn't need
to rebalance, because if the vlan is removed all fdb links get removed too.  They
don't move to another bucket (But that was with static hash.  Need to look at rhash in
more detail).

If you want, I might still have patches hanging around on my machine that had a hash
table implementation.  I can send them to you.

-vlad

Nikolay Aleksandrov Aug. 28, 2015, 3:26 p.m. UTC | #14
> On Aug 28, 2015, at 5:31 AM, Vlad Yasevich <vyasevic@redhat.com> wrote:
> 
> On 08/27/2015 10:17 PM, Nikolay Aleksandrov wrote:
<<<snip>>>
> I don't remember learning being all that complicated.  The hash only changed under
> rtnl when vlans were added/removed.  The nice thing is that we wouldn't need
> to rebalance, because if the vlan is removed all fdb links get removed too.  They
> don't move to another bucket (But that was with static hash.  Need to look at rhash in
> more detail).
> 
> If you want, I might still have patches hanging around on my machine that had a hash
> table implementation.  I can send them to you.
> 
> -vlad
> 

:-) Okay, I’m putting the crystal ball away. If you could send me these patches it’d be great so
I don’t have to start this from scratch.

Thanks,
 Nik

Vlad Yasevich Aug. 29, 2015, 1:11 a.m. UTC | #15
On 08/28/2015 11:26 AM, Nikolay Aleksandrov wrote:
<<<snip>>>
> 
> :-) Okay, I’m putting the crystal ball away. If you could send me these patches it’d be great so
> I don’t have to start this from scratch.
> 

So, I forgot that I lost an old disk that had all that code, so I am a bit bummed about
that.  I did however find the series that got posted.
http://www.spinics.net/lists/netdev/msg219737.html

That was the series where I briefly switched from bitmaps to hash and list.
However, I see that the fdb code I was playing with never got posted...

Sorry.

-vlad

Nikolay Aleksandrov Sept. 13, 2015, 1:22 p.m. UTC | #16
On 08/29/2015 03:11 AM, Vlad Yasevich wrote:
<<<snip>>>

So I've been looking into this for some time now and did a basic implementation of vlan handling
using rhashtables; here are some thoughts and a slightly different proposition.
First a few scenarios (the memory footprint is only the extra memory needed for the
vlans):
Current memory footprint for 48 ports & 2000 vlans ~ 50k

1. Bridge with vlan hash with port bitmaps (similar to Vlad's first set)
- On input we have hash lookup + bitmap lookup
- If an (r)hashtable is used, we need an additional list to handle stable walks, which are
needed all over the place, from error handling to compressed vlan dumps. The dumps actually
need this list to be kept sorted, since the already exposed user interfaces need to
behave without visible changes, but they also allow for per-port compressed vlan
dumping, which isn't easy to handle. Mostly the stability issue with rhashtable
is with resizing, since these entries change only under rtnl; we also need the sorted
order because of the compressed dump. One alternative way to solve this is to build the
sorted list each time a dump is requested, but again this falls under the workarounds
needed to satisfy current behaviour requirements.
If this is chosen, my preference is to have the vlans also in a list which is kept sorted
for the walks; then the compressed request can be satisfied more easily.
- memory footprint for 2000 vlans with 48 ports ~ 1.5 MB

2. Bridge with vlan hash, ports with vlan hashes (need a special per-port struct because
of the tagged/untagged case; we basically need per-port per-vlan flags)
- On input we have only 1 hash lookup, in the port vlan hash, where we get a pointer
to the bridge's vlan entry, so we get the global vlan context as well as the local one
- Same rhashtable handling requirements apply + more complexity & memory due to having
to keep in sync multiple (per-port, per-bridge global) rhashtables
- memory footprint for 2000 vlans with 48 ports ~ 2.6 MB

Up until now I've partially done point 1 to see how much churn it would take, and the
basic change is huge. Also the memory footprint increases a lot.
So I'd propose a third option which you may call middle ground between the current
implementation (which is very fast and compact) and points 1 & 2:

What do you think about adding an auxiliary per-vlan global context using rhashtable
which is not used in the ingress/egress decision making? We can contain it
via either a Kconfig option (so it can be compiled out) or via a dynamic run-time option,
so people who would like more features and are willing to trade some performance
and memory can enable it on demand.
This way we won't have to change most of the current API and won't have to add workarounds
to keep the user-facing behaviour the same; also, the syncing is reduced to
a refcount and the memory footprint is kept minimal.
The initial new features I'd like to introduce are per-vlan counters and also per-vlan
flags which at first will be used to enable/disable multicast on a vlan basis.
In terms of performance, if this is enabled it is close to point 1, but without the changes
all over the API and, more importantly, with a much smaller memory footprint.
The memory footprint of this option with 2000 vlans & 48 ports ~ +70k (without the per-cpu
counters, any additional feature will naturally add to this). This is because we don't
have a per-port increase for each vlan added and only keep the global context.

If it's acceptable to take the performance/memory hit and the huge churn, then I can continue
with 1 or 2, but I'm not a big fan of that idea.

Feedback before I go any further on this would be much appreciated.

Thank you,
 Nik
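
The auxiliary global per-vlan context from option 3 might look like this (a sketch with illustrative fields; the only synchronization shared between ports is the refcount, which is the point):

	/* global per-vlan context, keyed by vid, not consulted on ingress/egress */
	struct br_vlan_ctx {
		struct rhash_head	rhnode;
		u16			vid;
		u16			flags;			/* e.g. per-vlan multicast on/off */
		struct pcpu_sw_netstats __percpu *stats;	/* optional per-vlan counters */
		atomic_t		refcnt;			/* bumped once per port membership */
	};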

Patch

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 313c305fd1ad..df1c601dd315 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -231,6 +231,7 @@  enum {
 	IFLA_BR_STP_STATE,
 	IFLA_BR_PRIORITY,
 	IFLA_BR_VLAN_FILTERING,
+	IFLA_BR_VLAN_IGNORE_LOCAL_FDB,
 	__IFLA_BR_MAX,
 };
 
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index f921a5dce22d..a2b00849de3c 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -186,6 +186,13 @@  int br_handle_frame_finish(struct sock *sk, struct sk_buff *skb)
 		skb2 = skb;
 		/* Do not forward the packet since it's local. */
 		skb = NULL;
+	} else if (br_vlan_enabled(br) && br_vlan_ignore_local_fdb(br)) {
+		dst = __br_fdb_get(br, dest, 0);
+		if (dst && dst->is_local) {
+			skb2 = skb;
+			/* Do not forward the packet since it's local. */
+			skb = NULL;
+		}
 	}
 
 	if (skb) {
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index dbcb1949ea58..07978f7b6245 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -729,6 +729,7 @@  static const struct nla_policy br_policy[IFLA_BR_MAX + 1] = {
 	[IFLA_BR_STP_STATE] = { .type = NLA_U32 },
 	[IFLA_BR_PRIORITY] = { .type = NLA_U16 },
 	[IFLA_BR_VLAN_FILTERING] = { .type = NLA_U8 },
+	[IFLA_BR_VLAN_IGNORE_LOCAL_FDB] = { .type = NLA_U8 },
 };
 
 static int br_changelink(struct net_device *brdev, struct nlattr *tb[],
@@ -784,6 +785,14 @@  static int br_changelink(struct net_device *brdev, struct nlattr *tb[],
 			return err;
 	}
 
+	if (data[IFLA_BR_VLAN_IGNORE_LOCAL_FDB]) {
+		u8 vlan_ignore_local = nla_get_u8(data[IFLA_BR_VLAN_IGNORE_LOCAL_FDB]);
+
+		err = br_vlan_ignore_local_fdb_toggle(br, vlan_ignore_local);
+		if (err)
+			return err;
+	}
+
 	return 0;
 }
 
@@ -796,6 +805,7 @@  static size_t br_get_size(const struct net_device *brdev)
 	       nla_total_size(sizeof(u32)) +    /* IFLA_BR_STP_STATE */
 	       nla_total_size(sizeof(u16)) +    /* IFLA_BR_PRIORITY */
 	       nla_total_size(sizeof(u8)) +     /* IFLA_BR_VLAN_FILTERING */
+	       nla_total_size(sizeof(u8)) +     /* IFLA_BR_VLAN_IGNORE_LOCAL_FDB */
 	       0;
 }
 
@@ -809,6 +819,7 @@  static int br_fill_info(struct sk_buff *skb, const struct net_device *brdev)
 	u32 stp_enabled = br->stp_enabled;
 	u16 priority = (br->bridge_id.prio[0] << 8) | br->bridge_id.prio[1];
 	u8 vlan_enabled = br_vlan_enabled(br);
+	u8 vlan_ignore_local = br_vlan_ignore_local_fdb(br);
 
 	if (nla_put_u32(skb, IFLA_BR_FORWARD_DELAY, forward_delay) ||
 	    nla_put_u32(skb, IFLA_BR_HELLO_TIME, hello_time) ||
@@ -816,7 +827,8 @@  static int br_fill_info(struct sk_buff *skb, const struct net_device *brdev)
 	    nla_put_u32(skb, IFLA_BR_AGEING_TIME, ageing_time) ||
 	    nla_put_u32(skb, IFLA_BR_STP_STATE, stp_enabled) ||
 	    nla_put_u16(skb, IFLA_BR_PRIORITY, priority) ||
-	    nla_put_u8(skb, IFLA_BR_VLAN_FILTERING, vlan_enabled))
+	    nla_put_u8(skb, IFLA_BR_VLAN_FILTERING, vlan_enabled) ||
+	    nla_put_u8(skb, IFLA_BR_VLAN_IGNORE_LOCAL_FDB, vlan_ignore_local))
 		return -EMSGSIZE;
 
 	return 0;
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 3d95647039d0..2bda472c5a6e 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -294,6 +294,7 @@  struct net_bridge
 	u32				auto_cnt;
 #ifdef CONFIG_BRIDGE_VLAN_FILTERING
 	u8				vlan_enabled;
+	bool				vlan_ignore_local_fdb;
 	__be16				vlan_proto;
 	u16				default_pvid;
 	struct net_port_vlans __rcu	*vlan_info;
@@ -624,6 +625,7 @@  int nbp_vlan_delete(struct net_bridge_port *port, u16 vid);
 void nbp_vlan_flush(struct net_bridge_port *port);
 bool nbp_vlan_find(struct net_bridge_port *port, u16 vid);
 int nbp_vlan_init(struct net_bridge_port *port);
+int br_vlan_ignore_local_fdb_toggle(struct net_bridge *br, unsigned long val);
 
 static inline struct net_port_vlans *br_get_vlan_info(
 						const struct net_bridge *br)
@@ -667,6 +669,11 @@  static inline int br_vlan_enabled(struct net_bridge *br)
 {
 	return br->vlan_enabled;
 }
+
+static inline bool br_vlan_ignore_local_fdb(struct net_bridge *br)
+{
+	return br->vlan_ignore_local_fdb;
+}
 #else
 static inline bool br_allowed_ingress(struct net_bridge *br,
 				      struct net_port_vlans *v,
@@ -778,6 +785,17 @@  static inline int __br_vlan_filter_toggle(struct net_bridge *br,
 {
 	return -EOPNOTSUPP;
 }
+
+static inline int br_vlan_ignore_local_fdb_toggle(struct net_bridge *br,
+						  unsigned long val)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline bool br_vlan_ignore_local_fdb(struct net_bridge *br)
+{
+	return false;
+}
 #endif
 
 struct nf_br_ops {
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index 4c97fc50fb70..fca352f0943a 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -741,6 +741,23 @@  static ssize_t default_pvid_store(struct device *d,
 	return store_bridge_parm(d, buf, len, br_vlan_set_default_pvid);
 }
 static DEVICE_ATTR_RW(default_pvid);
+
+static ssize_t vlan_ignore_local_fdb_show(struct device *d,
+					  struct device_attribute *attr,
+					  char *buf)
+{
+	struct net_bridge *br = to_bridge(d);
+
+	return sprintf(buf, "%d\n", br->vlan_ignore_local_fdb);
+}
+
+static ssize_t vlan_ignore_local_fdb_store(struct device *d,
+					   struct device_attribute *attr,
+					   const char *buf, size_t len)
+{
+	return store_bridge_parm(d, buf, len, br_vlan_ignore_local_fdb_toggle);
+}
+static DEVICE_ATTR_RW(vlan_ignore_local_fdb);
 #endif
 
 static struct attribute *bridge_attrs[] = {
@@ -788,6 +805,7 @@  static struct attribute *bridge_attrs[] = {
 	&dev_attr_vlan_filtering.attr,
 	&dev_attr_vlan_protocol.attr,
 	&dev_attr_default_pvid.attr,
+	&dev_attr_vlan_ignore_local_fdb.attr,
 #endif
 	NULL
 };
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 3cef6892c0bb..f9efa1b07994 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -98,11 +98,12 @@  static int __vlan_add(struct net_port_vlans *v, u16 vid, u16 flags)
 			return err;
 	}
 
-	err = br_fdb_insert(br, p, dev->dev_addr, vid);
-	if (err) {
-		br_err(br, "failed insert local address into bridge "
-		       "forwarding table\n");
-		goto out_filt;
+	if (!br_vlan_ignore_local_fdb(br) || !v->port_idx) {
+		err = br_fdb_insert(br, p, dev->dev_addr, vid);
+		if (err) {
+			br_err(br, "failed insert local address into bridge forwarding table\n");
+			goto out_filt;
+		}
 	}
 
 	set_bit(vid, v->vlan_bitmap);
@@ -492,6 +493,13 @@  int br_vlan_filter_toggle(struct net_bridge *br, unsigned long val)
 	return 0;
 }
 
+int br_vlan_ignore_local_fdb_toggle(struct net_bridge *br, unsigned long val)
+{
+	br->vlan_ignore_local_fdb = val ? true : false;
+
+	return 0;
+}
+
 int br_vlan_set_proto(struct net_bridge *br, unsigned long val)
 {
 	int err = 0;