[RFC/RFT,net-next,00/17] net: Convert neighbor tables to per-namespace

Message ID	20180717120651.15748-1-dsahern@kernel.org
Headers	show Return-Path: <netfilter-devel-owner@vger.kernel.org> From: dsahern@kernel.org To: netdev@vger.kernel.org Cc: nikita.leshchenko@oracle.com, roopa@cumulusnetworks.com, stephen@networkplumber.org, idosch@mellanox.com, jiri@mellanox.com, saeedm@mellanox.com, alex.aring@gmail.com, linux-wpan@vger.kernel.org, netfilter-devel@vger.kernel.org, linux-kernel@vger.kernel.org, David Ahern <dsahern@gmail.com> Subject: [PATCH RFC/RFT net-next 00/17] net: Convert neighbor tables to per-namespace Date: Tue, 17 Jul 2018 05:06:34 -0700 Message-Id: <20180717120651.15748-1-dsahern@kernel.org> Sender: netfilter-devel-owner@vger.kernel.org Precedence: bulk
Series	net: Convert neighbor tables to per-namespace \| expand [RFC/RFT,net-next,00/17] net: Convert neighbor tables to per-namespace [RFC/RFT,net-next,01/17] net/ipv4: rename ipv4_neigh_lookup to ipv4_dst_neigh_lookup [RFC/RFT,net-next,02/17] net/neigh: export neigh_find_table [RFC/RFT,net-next,03/17] net/ipv4: wrappers for arp table references [RFC/RFT,net-next,04/17] net/ipv4: Remove open coded use of arp table [RFC/RFT,net-next,05/17] net/ipv6: wrappers for neighbor table references [RFC/RFT,net-next,06/17] net/ipv6: Remove open coded use of neighbor table [RFC/RFT,net-next,07/17] drivers/net: remove open coding of neighbor tables [RFC/RFT,net-next,08/17] net: Remove nd_tbl from ipv6 stub [RFC/RFT,net-next,09/17] net: Remove arp_tbl and nd_tbl from headers [RFC/RFT,net-next,10/17] net: Add key_len to neighbor constructor [RFC/RFT,net-next,11/17] net: Change neigh_table_init and neigh_table_clear signature [RFC/RFT,net-next,12/17] net/neigh: Change neigh_xmit to take an address family [RFC/RFT,net-next,13/17] net/neighbor: Convert internal functions away from neigh_tables [RFC/RFT,net-next,14/17] net/ipv4: Convert arp table to per namespace [RFC/RFT,net-next,16/17] net/decnet: Move neighbor table to per-namespace [RFC/RFT,net-next,17/17] net/neighbor: Remove neigh_tables and NEIGH enum

David Ahern July 17, 2018, 12:06 p.m. UTC

From: David Ahern <dsahern@gmail.com>

Nikita Leshenko reported that neighbor entries in one namespace can
evict neighbor entries in another. The problem is that the neighbor
tables have entries across all namespaces without separate accounting
and with global limits on when to scan for entries to evict.

Resolve by making the neighbor tables for ipv4, ipv6 and decnet per
namespace and making the accounting and threshold limits per namespace.

David Ahern (17):
  net/ipv4: rename ipv4_neigh_lookup to ipv4_dst_neigh_lookup
  net/neigh: export neigh_find_table
  net/ipv4: wrappers for arp table references
  net/ipv4: Remove open coded use of arp table
  net/ipv6: wrappers for neighbor table references
  net/ipv6: Remove open coded use of neighbor table
  drivers/net: remove open coding of neighbor tables
  net: Remove nd_tbl from ipv6 stub
  net: Remove arp_tbl and nd_tbl from headers
  net: Add key_len to neighbor constructor
  net: Change neigh_table_init and neigh_table_clear signature
  net/neigh: Change neigh_xmit to take an address family
  net/neighbor: Convert internal functions away from neigh_tables
  net/ipv4: Convert arp table to per namespace
  net/ipv6: Convert neighbor table to per-namespace
  net/decnet: Move neighbor table to per-namespace
  net/neighbor: Remove neigh_tables and NEIGH enum

 drivers/infiniband/ulp/ipoib/ipoib_main.c          |  14 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |  35 ++---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    |  11 +-
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |  27 ++--
 .../net/ethernet/mellanox/mlxsw/spectrum_span.c    |   8 +-
 .../ethernet/netronome/nfp/flower/tunnel_conf.c    |   2 +-
 drivers/net/ethernet/rocker/rocker_main.c          |   4 +-
 drivers/net/ethernet/rocker/rocker_ofdpa.c         |   2 +-
 drivers/net/vrf.c                                  |   4 +-
 drivers/net/vxlan.c                                |  10 +-
 include/net/addrconf.h                             |   1 -
 include/net/arp.h                                  |  25 +++-
 include/net/ndisc.h                                |  75 +++++++++-
 include/net/neighbour.h                            |  17 +--
 include/net/net_namespace.h                        |   3 +
 include/net/netns/ipv4.h                           |   1 +
 include/net/netns/ipv6.h                           |   1 +
 net/atm/clip.c                                     |  14 +-
 net/bridge/br_arp_nd_proxy.c                       |   4 +-
 net/core/filter.c                                  |   3 +-
 net/core/neighbour.c                               | 115 +++++++++-----
 net/decnet/dn_neigh.c                              |   8 +-
 net/ieee802154/6lowpan/tx.c                        |   2 +-
 net/ipv4/arp.c                                     | 130 +++++++++-------
 net/ipv4/devinet.c                                 |   8 +-
 net/ipv4/fib_semantics.c                           |   2 +-
 net/ipv4/ip_output.c                               |   2 +-
 net/ipv4/route.c                                   |  12 +-
 net/ipv6/addrconf.c                                |  16 +-
 net/ipv6/af_inet6.c                                |   1 -
 net/ipv6/ip6_output.c                              |   4 +-
 net/ipv6/ndisc.c                                   | 165 +++++++++++----------
 net/ipv6/route.c                                   |  12 +-
 net/mpls/af_mpls.c                                 |  33 ++---
 net/mpls/mpls_iptunnel.c                           |   6 +-
 net/netfilter/nf_flow_table_ip.c                   |   4 +-
 net/netfilter/nft_fwd_netdev.c                     |   6 +-
 37 files changed, 467 insertions(+), 320 deletions(-)

Cong Wang July 17, 2018, 5:40 p.m. UTC | #1

On Tue, Jul 17, 2018 at 5:11 AM <dsahern@kernel.org> wrote:
>
> From: David Ahern <dsahern@gmail.com>
>
> Nikita Leshenko reported that neighbor entries in one namespace can
> evict neighbor entries in another. The problem is that the neighbor
> tables have entries across all namespaces without separate accounting
> and with global limits on when to scan for entries to evict.

It is nothing new, people including me already noticed this before.


>
> Resolve by making the neighbor tables for ipv4, ipv6 and decnet per
> namespace and making the accounting and threshold limits per namespace.


The last discussion about this a long time ago concluded that neigh
table entries are controllable by remote, so after moving it to per netns,
it would be easier to DOS the host.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Ahern July 17, 2018, 5:43 p.m. UTC | #2

On 7/17/18 11:40 AM, Cong Wang wrote:
> On Tue, Jul 17, 2018 at 5:11 AM <dsahern@kernel.org> wrote:
>>
>> From: David Ahern <dsahern@gmail.com>
>>
>> Nikita Leshenko reported that neighbor entries in one namespace can
>> evict neighbor entries in another. The problem is that the neighbor
>> tables have entries across all namespaces without separate accounting
>> and with global limits on when to scan for entries to evict.
> 
> It is nothing new, people including me already noticed this before.
> 
> 
>>
>> Resolve by making the neighbor tables for ipv4, ipv6 and decnet per
>> namespace and making the accounting and threshold limits per namespace.
> 
> 
> The last discussion about this a long time ago concluded that neigh
> table entries are controllable by remote, so after moving it to per netns,
> it would be easier to DOS the host.
> 

There are still limits on the total number of entries and with
per-namespace limits an admin has better control.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Cong Wang July 17, 2018, 5:53 p.m. UTC | #3

On Tue, Jul 17, 2018 at 10:43 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 7/17/18 11:40 AM, Cong Wang wrote:
> > On Tue, Jul 17, 2018 at 5:11 AM <dsahern@kernel.org> wrote:
> >>
> >> From: David Ahern <dsahern@gmail.com>
> >>
> >> Nikita Leshenko reported that neighbor entries in one namespace can
> >> evict neighbor entries in another. The problem is that the neighbor
> >> tables have entries across all namespaces without separate accounting
> >> and with global limits on when to scan for entries to evict.
> >
> > It is nothing new, people including me already noticed this before.
> >
> >
> >>
> >> Resolve by making the neighbor tables for ipv4, ipv6 and decnet per
> >> namespace and making the accounting and threshold limits per namespace.
> >
> >
> > The last discussion about this a long time ago concluded that neigh
> > table entries are controllable by remote, so after moving it to per netns,
> > it would be easier to DOS the host.
> >
>
> There are still limits on the total number of entries and with
> per-namespace limits an admin has better control.

Per-netns limit is *exactly* the problem here.

Quote from David Miller:
"
From: ebiederm@xmission.com (Eric W. Biederman)
Date: Wed, 25 Jun 2014 18:17:08 -0700

> I disagree that removing a global DOS prevention check is a benefit.
> Certainly large semantics changes like that should not happen without
> being discussed in the patch description.

Agreed, this is the most important core issue.

If we just make these things per netns, then as a result if you create
N namespaces we will allow N times more neighbour entries to be
sitting in the system at once.

Actually, I'm really surprised the limits get hit and this actually
causes problems.
"

You can see the original discussion here:
https://marc.info/?l=linux-netdev&m=140356141019653&w=2
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Ahern July 17, 2018, 7:02 p.m. UTC | #4

On 7/17/18 11:53 AM, Cong Wang wrote:
> You can see the original discussion here:
> https://marc.info/?l=linux-netdev&m=140356141019653&w=2
> 

Thanks for the reference.

I was surprised that the tables are still global. A number of objections
raised in that thread were due to a large patch tackling multiple
issues. This set is focused one thing - moving the tables to net - and
does so in small incremental changes to make it easy to review.

One of DaveM's comments:

"Finally, another problem are permanent neigh entries as those cannot
be reclaimed, that might be part of the main problem here.

One idea wrt. permanent entries is that we could decide that, since
they are administratively added, they don't count against the
thresholds and limits."

this is another we have hit and with same thinking ... permanent entries
should not count in the gc numbers. We need to address this for EVPN.

As for the per-namespace tables, it is 4 years later and over that time
Linux supports a number of features: EVPN which is very mac heavy, VRR
which doubles mac entries (one against the VRR device and one against
the lower device) and NOS level features such as mlxsw which has to
ensure mac entries for nexthop gateaways stay active. In addition there
are other features on the horizon - like the ability to use namespaces
to create virtual switches (what Cisco calls a VDC) where you absolutely
want isolation and not allowing entries from virtual switch to evict
entries from another. And of course the continued proliferation of
containerized workloads where isolation is desired.

I understand the concern about global resource and limits: as it stands
you have to increase the limits in init_net to the max expected and hope
for the best. With per namespace limits you can lower the limits of each
namespace better control the total impact on the total memory used.
Perhaps the defaults for namespaces after init_net could have really low
defaults (e.g., 16 / 32 / 64 for gc_thresh 1/2/3) requiring admin
intervention.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Cong Wang July 17, 2018, 8:37 p.m. UTC | #5

On Tue, Jul 17, 2018 at 12:02 PM David Ahern <dsahern@gmail.com> wrote:
> As for the per-namespace tables, it is 4 years later and over that time
> Linux supports a number of features: EVPN which is very mac heavy, VRR
> which doubles mac entries (one against the VRR device and one against
> the lower device) and NOS level features such as mlxsw which has to
> ensure mac entries for nexthop gateaways stay active. In addition there
> are other features on the horizon - like the ability to use namespaces
> to create virtual switches (what Cisco calls a VDC) where you absolutely
> want isolation and not allowing entries from virtual switch to evict
> entries from another. And of course the continued proliferation of
> containerized workloads where isolation is desired.

As long as no change in neigh table code base itself, these can't
address the concern people raised before.


>
> I understand the concern about global resource and limits: as it stands
> you have to increase the limits in init_net to the max expected and hope
> for the best. With per namespace limits you can lower the limits of each
> namespace better control the total impact on the total memory used.

The problem is that the number of containers in a host is usually
not predictable.

Of course, you can say containers limit kernel memory too, but
memcg is not part of netns. I once told David Miller cpuset is the
isolation for isolating per-CPU softnet_data, he didn't like it. Based
on that I don't think you can convince him with memcg as a solution
here.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Miller July 18, 2018, 3:59 a.m. UTC | #6

From: David Ahern <dsahern@gmail.com>
Date: Tue, 17 Jul 2018 13:02:18 -0600

> I understand the concern about global resource and limits: as it stands
> you have to increase the limits in init_net to the max expected and hope
> for the best. With per namespace limits you can lower the limits of each
> namespace better control the total impact on the total memory used.
> Perhaps the defaults for namespaces after init_net could have really low
> defaults (e.g., 16 / 32 / 64 for gc_thresh 1/2/3) requiring admin
> intervention.

How does this work when a namespace creates another namespace?

Changing the defaults for non-init_net namespaces could work, but that
could be a surprise to some people.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Michael Richardson July 19, 2018, 12:54 a.m. UTC | #7

>>>>> David Ahern <dsahern@gmail.com> writes:
dsahern@kernel.org wrote:
    > Nikita Leshenko reported that neighbor entries in one namespace can
    > evict neighbor entries in another. The problem is that the neighbor
    > tables have entries across all namespaces without separate accounting
    > and with global limits on when to scan for entries to evict.

    > Resolve by making the neighbor tables for ipv4, ipv6 and decnet per
    > namespace and making the accounting and threshold limits per namespace.

This is a good improvement, thank you.
We absolutely need to keep a DOS against a single netns from causing
evictions in another netns.

Within a namespace there may be neighbours entries that are more
sure/valid/useful than others.  I would like an API to be able to
mark them explicitely, but that could come leter.

In particular, in the 802.15.4 case, NE that arrive via encrypted
channels should be preferred over entries that arrive over unencrypted
channels.  This is needed for IETF 6tisch secure join work, for instance.

I believe that we could use network namespaces to implement though.

I had not considered that before, and I think that it will work, but
there might be something subtle that I've missed.  (Alex?)
It appears that one can tune the amount of space on a per-namespace basis:

+	nd_tbl->gc_thresh1	= 128;
+	nd_tbl->gc_thresh2	= 512;
+	nd_tbl->gc_thresh3	= 1024;

> Remove open use of arp_tbl and nd_tbl in favor of the new
> ipv{4,6}_neigh_table helpers. Since the existence of the IPv6 table
> is managed by the core networking, the IS_ENABLED checks for IPv6
> can be removed in favor of "is the table non-NULL".

What's the advantage of changing this check? (I am ignorant)

David Ahern July 19, 2018, 3:49 p.m. UTC | #8

On 7/18/18 6:54 PM, Michael Richardson wrote:
>> Remove open use of arp_tbl and nd_tbl in favor of the new
>> ipv{4,6}_neigh_table helpers. Since the existence of the IPv6 table
>> is managed by the core networking, the IS_ENABLED checks for IPv6
>> can be removed in favor of "is the table non-NULL".
> 
> What's the advantage of changing this check? (I am ignorant)
> 

Just makes the code simpler.

The current nd_tbl is a global owned by the ipv6 code
(net/ipv6/ndisc.c). If CONFIG_IPV6 is not enabled, then nd_tbl is not
defined which leads code referencing it to use if checks like this:

#if IS_ENABLED(CONFIG_IPV6)
		if (!p->dev || (p->tbl != &nd_tbl && p->tbl != &arp_tbl))
#else
		if (!p->dev || p->tbl != &arp_tbl)
#endif

With the neigh_find_table approach the IS_ENABLED can be removed in
favor of 'if (tbl)'. If tbl is set, then ipv6 is loaded and initialized.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Ahern July 19, 2018, 4:16 p.m. UTC | #9

On 7/17/18 9:59 PM, David Miller wrote:
> From: David Ahern <dsahern@gmail.com>
> Date: Tue, 17 Jul 2018 13:02:18 -0600
> 
>> I understand the concern about global resource and limits: as it stands
>> you have to increase the limits in init_net to the max expected and hope
>> for the best. With per namespace limits you can lower the limits of each
>> namespace better control the total impact on the total memory used.
>> Perhaps the defaults for namespaces after init_net could have really low
>> defaults (e.g., 16 / 32 / 64 for gc_thresh 1/2/3) requiring admin
>> intervention.
> 
> How does this work when a namespace creates another namespace?
> 
> Changing the defaults for non-init_net namespaces could work, but that
> could be a surprise to some people.
> 

Patches 14 (ipv4) and 15 (ipv6) currently use the existing hardcoded
values - not based on current init_net or anything else. This could be
changed to:

+	if (net_eq(net, &init_net)) {
+		arp_tbl->gc_thresh1	= 128;
+		arp_tbl->gc_thresh2	= 512;
+		arp_tbl->gc_thresh3	= 1024;
+	} else {
+		arp_tbl->gc_thresh1	= 16;
+		arp_tbl->gc_thresh2	= 32;
+		arp_tbl->gc_thresh3	= 64;
+	}

and update the documentation that any new network namespaces have lower
defaults.

As for any change in behavior: today neighbor entries from one namespace
can be removed due to actions in another so no obvious correlation. With
lower settings then gc could kick in and remove entries that otherwise
would not have been. The big hit would be to a new namespace where an
app inserts a lot of PERMANENT entries.

Chatting with Nikolay about this and he brought up a good corollary - ip
fragmentation. It really is a similar problem in that memory is consumed
as a result of packets received from an external entity. The ipfrag
sysctls are per namespace with a limit that non-init_net namespaces can
not set high_thresh > the current value of init_net. Potential memory
consumed by fragments scales with the number of namespaces which is the
primary concern with making neighbor tables per namespace.

If we kept the current default settings (128/512/1024) per namespace we
still have capped memory use, and the one user visible hit that comes to
mind is the namespace with a lot of PERM entries.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Cong Wang July 19, 2018, 5:12 p.m. UTC | #10

On Thu, Jul 19, 2018 at 9:16 AM David Ahern <dsahern@gmail.com> wrote:
>
> Chatting with Nikolay about this and he brought up a good corollary - ip
> fragmentation. It really is a similar problem in that memory is consumed
> as a result of packets received from an external entity. The ipfrag
> sysctls are per namespace with a limit that non-init_net namespaces can
> not set high_thresh > the current value of init_net. Potential memory
> consumed by fragments scales with the number of namespaces which is the
> primary concern with making neighbor tables per namespace.

Nothing new, already discussed:
https://marc.info/?l=linux-netdev&m=140391416215988&w=2

:)
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Ahern July 24, 2018, 3:14 p.m. UTC | #11

On 7/19/18 11:12 AM, Cong Wang wrote:
> On Thu, Jul 19, 2018 at 9:16 AM David Ahern <dsahern@gmail.com> wrote:
>>
>> Chatting with Nikolay about this and he brought up a good corollary - ip
>> fragmentation. It really is a similar problem in that memory is consumed
>> as a result of packets received from an external entity. The ipfrag
>> sysctls are per namespace with a limit that non-init_net namespaces can
>> not set high_thresh > the current value of init_net. Potential memory
>> consumed by fragments scales with the number of namespaces which is the
>> primary concern with making neighbor tables per namespace.
> 
> Nothing new, already discussed:
> https://marc.info/?l=linux-netdev&m=140391416215988&w=2
> 
> :)
> 

Neighbor tables, bridge fdbs, vxlan fdbs and ip fragments all consume
local memory resources due to received packets. bridge and vxlan fdb's
are fairly straightforward analogs to neighbor entries; they are per
device with no limits on the number of entries. Fragments have memory
limits per namespace. So neighbor tables are the only ones with this
strict limitation and concern on memory consumption.

I get the impression there is no longer a strong resistance against
moving the tables to per namespace, but deciding what is the right
approach to handle backwards compatibility. Correct? Changing the
accounting is inevitably going to be noticeable to some use case(s), but
with sysctl settings it is a simple runtime update once the user knows
to make the change.

neighbor entries round up to 512 byte allocations, so with the current
gc_thresh defaults (128/512/1024) 512k can be consumed. Using those
limits per namespace seems high which is why I suggested a per-namespace
default of (16/32/64) which amounts to 32k per namespace limit by
default. Open to other suggestions as well.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Miller July 24, 2018, 5:14 p.m. UTC | #12

From: David Ahern <dsahern@gmail.com>
Date: Tue, 24 Jul 2018 09:14:01 -0600

> I get the impression there is no longer a strong resistance against
> moving the tables to per namespace, but deciding what is the right
> approach to handle backwards compatibility. Correct? Changing the
> accounting is inevitably going to be noticeable to some use case(s), but
> with sysctl settings it is a simple runtime update once the user knows
> to make the change.
> 
> neighbor entries round up to 512 byte allocations, so with the current
> gc_thresh defaults (128/512/1024) 512k can be consumed. Using those
> limits per namespace seems high which is why I suggested a per-namespace
> default of (16/32/64) which amounts to 32k per namespace limit by
> default. Open to other suggestions as well.

No objection from me about going to per-ns neigh tables.

About the defaults, I wonder if we can scale them to the amount of
memory given to the ns or something like that?  I bet this will better
match the intended use of the ns.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Cong Wang July 24, 2018, 10:09 p.m. UTC | #13

On Tue, Jul 24, 2018 at 8:14 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 7/19/18 11:12 AM, Cong Wang wrote:
> > On Thu, Jul 19, 2018 at 9:16 AM David Ahern <dsahern@gmail.com> wrote:
> >>
> >> Chatting with Nikolay about this and he brought up a good corollary - ip
> >> fragmentation. It really is a similar problem in that memory is consumed
> >> as a result of packets received from an external entity. The ipfrag
> >> sysctls are per namespace with a limit that non-init_net namespaces can
> >> not set high_thresh > the current value of init_net. Potential memory
> >> consumed by fragments scales with the number of namespaces which is the
> >> primary concern with making neighbor tables per namespace.
> >
> > Nothing new, already discussed:
> > https://marc.info/?l=linux-netdev&m=140391416215988&w=2
> >
> > :)
> >
>
> Neighbor tables, bridge fdbs, vxlan fdbs and ip fragments all consume
> local memory resources due to received packets. bridge and vxlan fdb's
> are fairly straightforward analogs to neighbor entries; they are per
> device with no limits on the number of entries. Fragments have memory
> limits per namespace. So neighbor tables are the only ones with this
> strict limitation and concern on memory consumption.
>
> I get the impression there is no longer a strong resistance against
> moving the tables to per namespace, but deciding what is the right
> approach to handle backwards compatibility. Correct? Changing the
> accounting is inevitably going to be noticeable to some use case(s), but
> with sysctl settings it is a simple runtime update once the user knows
> to make the change.

This question definitely should go to Eric Biederman who was against
my proposal.

Let's add Eric into CC.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman July 25, 2018, 12:33 p.m. UTC | #14

Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Tue, Jul 24, 2018 at 8:14 AM David Ahern <dsahern@gmail.com> wrote:
>>
>> On 7/19/18 11:12 AM, Cong Wang wrote:
>> > On Thu, Jul 19, 2018 at 9:16 AM David Ahern <dsahern@gmail.com> wrote:
>> >>
>> >> Chatting with Nikolay about this and he brought up a good corollary - ip
>> >> fragmentation. It really is a similar problem in that memory is consumed
>> >> as a result of packets received from an external entity. The ipfrag
>> >> sysctls are per namespace with a limit that non-init_net namespaces can
>> >> not set high_thresh > the current value of init_net. Potential memory
>> >> consumed by fragments scales with the number of namespaces which is the
>> >> primary concern with making neighbor tables per namespace.
>> >
>> > Nothing new, already discussed:
>> > https://marc.info/?l=linux-netdev&m=140391416215988&w=2
>> >
>> > :)
>> >
>>
>> Neighbor tables, bridge fdbs, vxlan fdbs and ip fragments all consume
>> local memory resources due to received packets. bridge and vxlan fdb's
>> are fairly straightforward analogs to neighbor entries; they are per
>> device with no limits on the number of entries. Fragments have memory
>> limits per namespace. So neighbor tables are the only ones with this
>> strict limitation and concern on memory consumption.
>>
>> I get the impression there is no longer a strong resistance against
>> moving the tables to per namespace, but deciding what is the right
>> approach to handle backwards compatibility. Correct? Changing the
>> accounting is inevitably going to be noticeable to some use case(s), but
>> with sysctl settings it is a simple runtime update once the user knows
>> to make the change.
>
> This question definitely should go to Eric Biederman who was against
> my proposal.
>
> Let's add Eric into CC.

Given that the entries are per device and the devices are per-namespace,
semantically neighbours are already kept in a per-namespace manner.  So
this is all about making the code not honoring global resource limits.
Making the code not honor gc_thresh3.

Skimming through the code today the default for gc_thresh3 is 1024.
Which means that we limit the neighbour tables to 1024 entries per
protocol type.

There are some pretty compelling reasons especially with ipv4 to keep
the subnet size down.  Arp storms are a real thing.

I don't know off the top of my head what the reasons for limiting the
neighbour table sizes.  I would be much more comfortable with a patchset
like this if we did some research and figured out the reasons why
we have a global limit.  Then changed the code to remove those limits.

When the limits are gone.  When the code can support large subnets
without tuning.  We we don't have to worry about someone scanning an all
addresses in an ipv6 subnet and causing a DOS on working machines.
I think it is completely appropriate to look to see if something per
network namespace needs to happen.

So please let's address the limits, not the fact that some specific
corner case ran into them.

If we are going to neuter gc_thresh3 let's go as far as removing it
entirely.  If we are going to make the neighbour table per something
let's make it per network device.  If we can afford the multiple hash
tables then a hash table per device is better.   Perhaps we want to move
to rhash tables while we look at this, instead of an old hand grown
version of resizable hash table.

Unless I misread something all your patchset did is reshuffle code and
data structures so that gc_thresh3 does not apply accross namespaces.
That does not feel like it really fixes anything.  That just lies to
people.

Further unless I misread something you are increasing the number of
timers to 3 per namespace.  If I create create a thousand network
namespaces that feels like it will hurt system performance overall.

Eric

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Ahern July 25, 2018, 2:06 p.m. UTC | #15

On 7/25/18 6:33 AM, Eric W. Biederman wrote:
> Cong Wang <xiyou.wangcong@gmail.com> writes:
> 
>> On Tue, Jul 24, 2018 at 8:14 AM David Ahern <dsahern@gmail.com> wrote:
>>>
>>> On 7/19/18 11:12 AM, Cong Wang wrote:
>>>> On Thu, Jul 19, 2018 at 9:16 AM David Ahern <dsahern@gmail.com> wrote:
>>>>>
>>>>> Chatting with Nikolay about this and he brought up a good corollary - ip
>>>>> fragmentation. It really is a similar problem in that memory is consumed
>>>>> as a result of packets received from an external entity. The ipfrag
>>>>> sysctls are per namespace with a limit that non-init_net namespaces can
>>>>> not set high_thresh > the current value of init_net. Potential memory
>>>>> consumed by fragments scales with the number of namespaces which is the
>>>>> primary concern with making neighbor tables per namespace.
>>>>
>>>> Nothing new, already discussed:
>>>> https://marc.info/?l=linux-netdev&m=140391416215988&w=2
>>>>
>>>> :)
>>>>
>>>
>>> Neighbor tables, bridge fdbs, vxlan fdbs and ip fragments all consume
>>> local memory resources due to received packets. bridge and vxlan fdb's
>>> are fairly straightforward analogs to neighbor entries; they are per
>>> device with no limits on the number of entries. Fragments have memory
>>> limits per namespace. So neighbor tables are the only ones with this
>>> strict limitation and concern on memory consumption.
>>>
>>> I get the impression there is no longer a strong resistance against
>>> moving the tables to per namespace, but deciding what is the right
>>> approach to handle backwards compatibility. Correct? Changing the
>>> accounting is inevitably going to be noticeable to some use case(s), but
>>> with sysctl settings it is a simple runtime update once the user knows
>>> to make the change.
>>
>> This question definitely should go to Eric Biederman who was against
>> my proposal.
>>
>> Let's add Eric into CC.
> 
> Given that the entries are per device and the devices are per-namespace,
> semantically neighbours are already kept in a per-namespace manner.  So
> this is all about making the code not honoring global resource limits.
> Making the code not honor gc_thresh3.
> 
> Skimming through the code today the default for gc_thresh3 is 1024.
> Which means that we limit the neighbour tables to 1024 entries per
> protocol type.
> 
> There are some pretty compelling reasons especially with ipv4 to keep
> the subnet size down.  Arp storms are a real thing.
> 
> I don't know off the top of my head what the reasons for limiting the
> neighbour table sizes.  I would be much more comfortable with a patchset
> like this if we did some research and figured out the reasons why
> we have a global limit.  Then changed the code to remove those limits.
> 
> When the limits are gone.  When the code can support large subnets
> without tuning.  We we don't have to worry about someone scanning an all
> addresses in an ipv6 subnet and causing a DOS on working machines.
> I think it is completely appropriate to look to see if something per
> network namespace needs to happen.
> 
> So please let's address the limits, not the fact that some specific
> corner case ran into them.
> 
> If we are going to neuter gc_thresh3 let's go as far as removing it
> entirely.  If we are going to make the neighbour table per something
> let's make it per network device.  If we can afford the multiple hash
> tables then a hash table per device is better.   Perhaps we want to move
> to rhash tables while we look at this, instead of an old hand grown
> version of resizable hash table.

Given the uses cases with increasing number of devices (> 10,000),
per-device tables will have more problems than per namespace - in
reference to your concern in the last paragraph below.

> 
> Unless I misread something all your patchset did is reshuffle code and
> data structures so that gc_thresh3 does not apply accross namespaces.
> That does not feel like it really fixes anything.  That just lies to
> people.

This patch set fixes the lie that network namespaces provide complete
isolation when in fact one namespace can evict neighbor entries from
another. An arp storm you are concerned about in one namespace impacts
all containers.

It starts by removing the proliferation of open coded references to
arp_tbl and nd_tbl, moving them behind the existing neigh_find_table.
From there (patches 14-16) it makes the tables per-namespace and hence
makes the gc_thresh parameters which are per-table now
per-table-per-namespace.

So it removes the global thresholds because the global ones are just
wrong given the meaning of a network namespace and provides the more
appropriate per-namespace limits.

> 
> Further unless I misread something you are increasing the number of
> timers to 3 per namespace.  If I create create a thousand network
> namespaces that feels like it will hurt system performance overall.

It seems to me the timers are per neighbor entry not table. The per
table ones are for proxies.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman July 25, 2018, 5:38 p.m. UTC | #16

David Ahern <dsahern@gmail.com> writes:

> On 7/25/18 6:33 AM, Eric W. Biederman wrote:
>> Cong Wang <xiyou.wangcong@gmail.com> writes:
>> 
>>> On Tue, Jul 24, 2018 at 8:14 AM David Ahern <dsahern@gmail.com> wrote:
>>>>
>>>> On 7/19/18 11:12 AM, Cong Wang wrote:
>>>>> On Thu, Jul 19, 2018 at 9:16 AM David Ahern <dsahern@gmail.com> wrote:
>>>>>>
>>>>>> Chatting with Nikolay about this and he brought up a good corollary - ip
>>>>>> fragmentation. It really is a similar problem in that memory is consumed
>>>>>> as a result of packets received from an external entity. The ipfrag
>>>>>> sysctls are per namespace with a limit that non-init_net namespaces can
>>>>>> not set high_thresh > the current value of init_net. Potential memory
>>>>>> consumed by fragments scales with the number of namespaces which is the
>>>>>> primary concern with making neighbor tables per namespace.
>>>>>
>>>>> Nothing new, already discussed:
>>>>> https://marc.info/?l=linux-netdev&m=140391416215988&w=2
>>>>>
>>>>> :)
>>>>>
>>>>
>>>> Neighbor tables, bridge fdbs, vxlan fdbs and ip fragments all consume
>>>> local memory resources due to received packets. bridge and vxlan fdb's
>>>> are fairly straightforward analogs to neighbor entries; they are per
>>>> device with no limits on the number of entries. Fragments have memory
>>>> limits per namespace. So neighbor tables are the only ones with this
>>>> strict limitation and concern on memory consumption.
>>>>
>>>> I get the impression there is no longer a strong resistance against
>>>> moving the tables to per namespace, but deciding what is the right
>>>> approach to handle backwards compatibility. Correct? Changing the
>>>> accounting is inevitably going to be noticeable to some use case(s), but
>>>> with sysctl settings it is a simple runtime update once the user knows
>>>> to make the change.
>>>
>>> This question definitely should go to Eric Biederman who was against
>>> my proposal.
>>>
>>> Let's add Eric into CC.
>> 
>> Given that the entries are per device and the devices are per-namespace,
>> semantically neighbours are already kept in a per-namespace manner.  So
>> this is all about making the code not honoring global resource limits.
>> Making the code not honor gc_thresh3.
>> 
>> Skimming through the code today the default for gc_thresh3 is 1024.
>> Which means that we limit the neighbour tables to 1024 entries per
>> protocol type.
>> 
>> There are some pretty compelling reasons especially with ipv4 to keep
>> the subnet size down.  Arp storms are a real thing.
>> 
>> I don't know off the top of my head what the reasons for limiting the
>> neighbour table sizes.  I would be much more comfortable with a patchset
>> like this if we did some research and figured out the reasons why
>> we have a global limit.  Then changed the code to remove those limits.
>> 
>> When the limits are gone.  When the code can support large subnets
>> without tuning.  We we don't have to worry about someone scanning an all
>> addresses in an ipv6 subnet and causing a DOS on working machines.
>> I think it is completely appropriate to look to see if something per
>> network namespace needs to happen.
>> 
>> So please let's address the limits, not the fact that some specific
>> corner case ran into them.
>> 
>> If we are going to neuter gc_thresh3 let's go as far as removing it
>> entirely.  If we are going to make the neighbour table per something
>> let's make it per network device.  If we can afford the multiple hash
>> tables then a hash table per device is better.   Perhaps we want to move
>> to rhash tables while we look at this, instead of an old hand grown
>> version of resizable hash table.
>
> Given the uses cases with increasing number of devices (> 10,000),
> per-device tables will have more problems than per namespace - in
> reference to your concern in the last paragraph below.
>
>> 
>> Unless I misread something all your patchset did is reshuffle code and
>> data structures so that gc_thresh3 does not apply accross namespaces.
>> That does not feel like it really fixes anything.  That just lies to
>> people.
>
> This patch set fixes the lie that network namespaces provide complete
> isolation when in fact one namespace can evict neighbor entries from
> another. An arp storm you are concerned about in one namespace impacts
> all containers.

Network namespaces can not provide complete isolation.  They share the
same kernel and they do not dedicate resources to each other.
Namespaces in general are about the names.  They are about sharing a
machine efficiently.

I humbly suggest that anyone who wants ``complete'' isolation to use vm
at the very least.

I do think the limits on the neighbour table are quite likely too
strict.  We should be able to relax them and continue to have a
networking stack that works for everyone.

> It starts by removing the proliferation of open coded references to
> arp_tbl and nd_tbl, moving them behind the existing neigh_find_table.
> From there (patches 14-16) it makes the tables per-namespace and hence
> makes the gc_thresh parameters which are per-table now
> per-table-per-namespace.
>
> So it removes the global thresholds because the global ones are just
> wrong given the meaning of a network namespace and provides the more
> appropriate per-namespace limits.

Absolutely NOT.  Global thresholds are exactly correct given the fact
you are running on a single kernel.

Memory is not free (Even though we are swimming in enough of it memory
rarely matters).  One of the few remaining challenges is for containers
is finding was to limit resources in such a way that one application
does not mess things up for another container during ordinary usage.

It looks like the neighbour tables absolutely are that kind of problem,
because the artificial limits are too strict.   Completely giving up on
limits does not seem right approach either.  We need to fix the limits
we have (perhaps making them go away entirely), not just apply a
band-aid.  Let's get to the bottom of this and make the system better.

>> Further unless I misread something you are increasing the number of
>> timers to 3 per namespace.  If I create create a thousand network
>> namespaces that feels like it will hurt system performance overall.
>
> It seems to me the timers are per neighbor entry not table. The per
> table ones are for proxies.

It seems I misread that bit when I was refreshing my memory on what
everything is doing.  If we can already have 1024 timers that makes
timers not a concern.

Eric

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Ahern July 25, 2018, 6:13 p.m. UTC | #17

On 7/25/18 11:38 AM, Eric W. Biederman wrote:
> 
> Absolutely NOT.  Global thresholds are exactly correct given the fact
> you are running on a single kernel.
> 
> Memory is not free (Even though we are swimming in enough of it memory
> rarely matters).  One of the few remaining challenges is for containers
> is finding was to limit resources in such a way that one application
> does not mess things up for another container during ordinary usage.
> 
> It looks like the neighbour tables absolutely are that kind of problem,
> because the artificial limits are too strict.   Completely giving up on
> limits does not seem right approach either.  We need to fix the limits
> we have (perhaps making them go away entirely), not just apply a
> band-aid.  Let's get to the bottom of this and make the system better.

Eric: yes, they all share the global resource of memory and there should
be limits on how many entries a remote entity can create.

Network namespaces can provide a separation such that one namespace does
not disrupt networking in another. It is absolutely appropriate to do
so. Your rigid stance is inconsistent given the basic meaning of a
network namespace and the parallels to this same problem -- bridges,
vxlans, and ip fragments. Only neighbor tables are not per-device or per
namespace; your insistence on global limits is missing the mark and wrong.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Ahern July 25, 2018, 6:23 p.m. UTC | #18

On 7/24/18 11:14 AM, David Miller wrote:
> From: David Ahern <dsahern@gmail.com>
> Date: Tue, 24 Jul 2018 09:14:01 -0600
> 
>> I get the impression there is no longer a strong resistance against
>> moving the tables to per namespace, but deciding what is the right
>> approach to handle backwards compatibility. Correct? Changing the
>> accounting is inevitably going to be noticeable to some use case(s), but
>> with sysctl settings it is a simple runtime update once the user knows
>> to make the change.
>>
>> neighbor entries round up to 512 byte allocations, so with the current
>> gc_thresh defaults (128/512/1024) 512k can be consumed. Using those
>> limits per namespace seems high which is why I suggested a per-namespace
>> default of (16/32/64) which amounts to 32k per namespace limit by
>> default. Open to other suggestions as well.
> 
> No objection from me about going to per-ns neigh tables.
> 
> About the defaults, I wonder if we can scale them to the amount of
> memory given to the ns or something like that?  I bet this will better
> match the intended use of the ns.
> 

Not sure how to do that. I am not aware of memory allocations to a
network namespace. As I understand it containers use cgroups to control
memory use, but I am not aware of any direct ties to namespace.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman July 25, 2018, 7:17 p.m. UTC | #19

David Ahern <dsahern@gmail.com> writes:

> On 7/25/18 11:38 AM, Eric W. Biederman wrote:
>> 
>> Absolutely NOT.  Global thresholds are exactly correct given the fact
>> you are running on a single kernel.
>> 
>> Memory is not free (Even though we are swimming in enough of it memory
>> rarely matters).  One of the few remaining challenges is for containers
>> is finding was to limit resources in such a way that one application
>> does not mess things up for another container during ordinary usage.
>> 
>> It looks like the neighbour tables absolutely are that kind of problem,
>> because the artificial limits are too strict.   Completely giving up on
>> limits does not seem right approach either.  We need to fix the limits
>> we have (perhaps making them go away entirely), not just apply a
>> band-aid.  Let's get to the bottom of this and make the system better.
>
> Eric: yes, they all share the global resource of memory and there should
> be limits on how many entries a remote entity can create.
>
> Network namespaces can provide a separation such that one namespace does
> not disrupt networking in another. It is absolutely appropriate to do
> so. Your rigid stance is inconsistent given the basic meaning of a
> network namespace and the parallels to this same problem -- bridges,
> vxlans, and ip fragments. Only neighbor tables are not per-device or per
> namespace; your insistence on global limits is missing the mark and wrong.

That is not what I said.  Let me rephrase and see if you understand.

The problem appears to be of lots of devices.  Fundamentally if you use
lots of network devices today unless you adjust gc_thresh3 you will run
out of neighbour table entries.

The problem has a bigger scope than what you are looking at.

If you fix the core problem you won't see the problem in the context
of network namespaces either.

Default limits should be something that will never be hit unless
something goes crazy.  We are hitting them.  Therefore by definition
there is a bug in these limits.

And yes there is absolutely a place for global limits on things like
inodes, file descriptors etc, that does not care about which part of the
kernel you are in.  However hitting those limits in normal operation is
a bug.

We have ourselves a bug.

Eric

p.s. I wrote the definition of network namespaces and it absolutely does
have room for global limits.   One of the things Linus has periodically
yelled at me about is that there are not enough of them.

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Laight July 26, 2018, 11:12 a.m. UTC | #20

From: Eric W. Biederman
> Sent: 25 July 2018 18:38
...
> >> Further unless I misread something you are increasing the number of
> >> timers to 3 per namespace.  If I create create a thousand network
> >> namespaces that feels like it will hurt system performance overall.
> >
> > It seems to me the timers are per neighbor entry not table. The per
> > table ones are for proxies.
> 
> It seems I misread that bit when I was refreshing my memory on what
> everything is doing.  If we can already have 1024 timers that makes
> timers not a concern.

Surely it is enough to just have a timestamp in each entry.
Deletion of expired items need not be done until insert (which
has the table suitable locked) bumps into an expired item.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman July 27, 2018, 4:27 p.m. UTC | #21

David Laight <David.Laight@ACULAB.COM> writes:

> From: Eric W. Biederman
>> Sent: 25 July 2018 18:38
> ...
>> >> Further unless I misread something you are increasing the number of
>> >> timers to 3 per namespace.  If I create create a thousand network
>> >> namespaces that feels like it will hurt system performance overall.
>> >
>> > It seems to me the timers are per neighbor entry not table. The per
>> > table ones are for proxies.
>> 
>> It seems I misread that bit when I was refreshing my memory on what
>> everything is doing.  If we can already have 1024 timers that makes
>> timers not a concern.
>
> Surely it is enough to just have a timestamp in each entry.
> Deletion of expired items need not be done until insert (which
> has the table suitable locked) bumps into an expired item.

Part of the state machine the timer is used for is sending retransmits
before we get out first response, so the state machine is not just
about expiry and may not be simple enough to do without a timer.

I honestly don't remember the tradeoffs between the various kinds of
timers.  If we already can have up to 1000 timers without problems I suspect
they are cheap enough optimizing for fewer times simply doesn't matter.
Certainly we are not in regression territory if we don't.

As for bumping into expired items on insert it would be a bad hash table
if we were bumping into any other entries frequently on insert.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Vasily Averin Aug. 12, 2018, 6:46 a.m. UTC | #22

On 07/17/2018 03:06 PM, dsahern@kernel.org wrote:
> From: David Ahern <dsahern@gmail.com>
> 
> Nikita Leshenko reported that neighbor entries in one namespace can
> evict neighbor entries in another. The problem is that the neighbor
> tables have entries across all namespaces without separate accounting
> and with global limits on when to scan for entries to evict.
> 
> Resolve by making the neighbor tables for ipv4, ipv6 and decnet per
> namespace and making the accounting and threshold limits per namespace.

Dear David,
I prepared own patch set to fix this problem and found your one.
It looks perfect for me, and I hope David Miller will merge it soon,
however I have found a few drawbacks:

1) I know that if net_device exist it always have correct net reference,
so dev_net(dev) will be always correct.
However I afraid that device reference itself is correct in some places.
For example, 
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_span.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_span.c
@@ -376,9 +377,10 @@ mlxsw_sp_span_entry_gretap4_parms(const struct net_device *to_dev,
 		return mlxsw_sp_span_entry_unoffloadable(sparmsp);

 	l3edev = mlxsw_sp_span_gretap4_route(to_dev, &saddr.addr4, &gw.addr4);
+	tbl = ipv4_neigh_table(dev_net(l3edev));
 	return mlxsw_sp_span_entry_tunnel_parms_common(l3edev, saddr, daddr, gw,
 						       tparm.iph.ttl,
-						       &arp_tbl, sparmsp);
+						       tbl, sparmsp);
 }

mlxsw_sp_span_entry_tunnel_parms_common() have "if (!edev)" check inside,
so it seems l3edev can be set to NULL here and lead to crash inside dev_net(l3edev).
There are few other suspicious places and I think they should be carefully re-checked.

2) modified arp_net_init() does not check return value neigh_sysctl_register() and lacks correct rollback.
It was acceptable in arp_init, because it was called only once on boot, but now it will be called 
for each new net namespace, it can have real chances to fail lead to memory crash/memory corruption.

3) modified neigh_table_init() is called many times per netns but it can panic in case failed memory allocation.
I think it should be reworked to return errors in such cases, its callers should check it and add correct rollbacks.

4) currently neigh_table_clear() always return 0, I think it makes sense to change it to return void.

Thank you,
	Vasily Averin

David Ahern Aug. 12, 2018, 5:37 p.m. UTC | #23

On 8/12/18 12:46 AM, Vasily Averin wrote:
> On 07/17/2018 03:06 PM, dsahern@kernel.org wrote:
>> From: David Ahern <dsahern@gmail.com>
>>
>> Nikita Leshenko reported that neighbor entries in one namespace can
>> evict neighbor entries in another. The problem is that the neighbor
>> tables have entries across all namespaces without separate accounting
>> and with global limits on when to scan for entries to evict.
>>
>> Resolve by making the neighbor tables for ipv4, ipv6 and decnet per
>> namespace and making the accounting and threshold limits per namespace.
> 
> Dear David,
> I prepared own patch set to fix this problem and found your one.
> It looks perfect for me, and I hope David Miller will merge it soon,
> however I have found a few drawbacks:
> 

Hi:

I just returned from an extended vacation. I will revive this topic in
the next few days.

Thanks for the comments. I will address in the next version.

David Ahern Aug. 13, 2018, 9:48 p.m. UTC | #24

On 7/25/18 1:17 PM, Eric W. Biederman wrote:
> David Ahern <dsahern@gmail.com> writes:
> 
>> On 7/25/18 11:38 AM, Eric W. Biederman wrote:
>>>
>>> Absolutely NOT.  Global thresholds are exactly correct given the fact
>>> you are running on a single kernel.
>>>
>>> Memory is not free (Even though we are swimming in enough of it memory
>>> rarely matters).  One of the few remaining challenges is for containers
>>> is finding was to limit resources in such a way that one application
>>> does not mess things up for another container during ordinary usage.
>>>
>>> It looks like the neighbour tables absolutely are that kind of problem,
>>> because the artificial limits are too strict.   Completely giving up on
>>> limits does not seem right approach either.  We need to fix the limits
>>> we have (perhaps making them go away entirely), not just apply a
>>> band-aid.  Let's get to the bottom of this and make the system better.
>>
>> Eric: yes, they all share the global resource of memory and there should
>> be limits on how many entries a remote entity can create.
>>
>> Network namespaces can provide a separation such that one namespace does
>> not disrupt networking in another. It is absolutely appropriate to do
>> so. Your rigid stance is inconsistent given the basic meaning of a
>> network namespace and the parallels to this same problem -- bridges,
>> vxlans, and ip fragments. Only neighbor tables are not per-device or per
>> namespace; your insistence on global limits is missing the mark and wrong.
> 
> That is not what I said.  Let me rephrase and see if you understand.
> 
> The problem appears to be of lots of devices.  Fundamentally if you use
> lots of network devices today unless you adjust gc_thresh3 you will run
> out of neighbour table entries.
> 
> The problem has a bigger scope than what you are looking at.
> 
> If you fix the core problem you won't see the problem in the context
> of network namespaces either.
> 
> Default limits should be something that will never be hit unless
> something goes crazy.  We are hitting them.  Therefore by definition
> there is a bug in these limits.

I disagree that the problem is a global limit. It is trivial for users
to increase gc_thresh3. That does not solve the fundamental problem.

> 
> 
> And yes there is absolutely a place for global limits on things like
> inodes, file descriptors etc, that does not care about which part of the
> kernel you are in.  However hitting those limits in normal operation is
> a bug.
> 
> We have ourselves a bug.

I agree we have a bug; we disagree on what that bug is.

I am just back from vacation and re-read your responses. No where do you
acknowledge the fundamental point of this patch set - that adding a new
neighbor entry in one namespace can evict an entry in another namespace
or worse networking in one namespace can fail due to table overflow
because of entries from another. That is a real problem.

It is not a matter of increasing the default gc_thresh3 to some number
N; it is ensuring that regardless of the value of gc_thresh3 one
namespace is not affected by another.

You created network namespaces and it provides isolation -- separate
tables essentially -- for devices, FIB entries, sockets, etc, but you
argue against completing the task with separate neighbor tables which is
very strange given the impact (completely broken networking).

> 
> Eric
> 
> p.s. I wrote the definition of network namespaces and it absolutely does
> have room for global limits.   One of the things Linus has periodically
> yelled at me about is that there are not enough of them.
>

Eric W. Biederman Aug. 15, 2018, 4:36 a.m. UTC | #25

David Ahern <dsahern@gmail.com> writes:

> On 7/25/18 1:17 PM, Eric W. Biederman wrote:
>> David Ahern <dsahern@gmail.com> writes:
>> 
>>> On 7/25/18 11:38 AM, Eric W. Biederman wrote:
>>>>
>>>> Absolutely NOT.  Global thresholds are exactly correct given the fact
>>>> you are running on a single kernel.
>>>>
>>>> Memory is not free (Even though we are swimming in enough of it memory
>>>> rarely matters).  One of the few remaining challenges is for containers
>>>> is finding was to limit resources in such a way that one application
>>>> does not mess things up for another container during ordinary usage.
>>>>
>>>> It looks like the neighbour tables absolutely are that kind of problem,
>>>> because the artificial limits are too strict.   Completely giving up on
>>>> limits does not seem right approach either.  We need to fix the limits
>>>> we have (perhaps making them go away entirely), not just apply a
>>>> band-aid.  Let's get to the bottom of this and make the system better.
>>>
>>> Eric: yes, they all share the global resource of memory and there should
>>> be limits on how many entries a remote entity can create.
>>>
>>> Network namespaces can provide a separation such that one namespace does
>>> not disrupt networking in another. It is absolutely appropriate to do
>>> so. Your rigid stance is inconsistent given the basic meaning of a
>>> network namespace and the parallels to this same problem -- bridges,
>>> vxlans, and ip fragments. Only neighbor tables are not per-device or per
>>> namespace; your insistence on global limits is missing the mark and wrong.
>> 
>> That is not what I said.  Let me rephrase and see if you understand.
>> 
>> The problem appears to be of lots of devices.  Fundamentally if you use
>> lots of network devices today unless you adjust gc_thresh3 you will run
>> out of neighbour table entries.
>> 
>> The problem has a bigger scope than what you are looking at.
>> 
>> If you fix the core problem you won't see the problem in the context
>> of network namespaces either.
>> 
>> Default limits should be something that will never be hit unless
>> something goes crazy.  We are hitting them.  Therefore by definition
>> there is a bug in these limits.
>
> I disagree that the problem is a global limit. It is trivial for users
> to increase gc_thresh3. That does not solve the fundamental problem.
>
>> 
>> 
>> And yes there is absolutely a place for global limits on things like
>> inodes, file descriptors etc, that does not care about which part of the
>> kernel you are in.  However hitting those limits in normal operation is
>> a bug.
>> 
>> We have ourselves a bug.
>
> I agree we have a bug; we disagree on what that bug is.
>
> I am just back from vacation and re-read your responses. No where do you
> acknowledge the fundamental point of this patch set - that adding a new
> neighbor entry in one namespace can evict an entry in another namespace
> or worse networking in one namespace can fail due to table overflow
> because of entries from another. That is a real problem.
>
> It is not a matter of increasing the default gc_thresh3 to some number
> N; it is ensuring that regardless of the value of gc_thresh3 one
> namespace is not affected by another.

My suggestion is to look at the problem and it's requirements and figure
out how to safely remove gc_thresh3 entirely.  We do have to ensure
neighbour tables don't grow too large, I expect we can do it in a way
that can scale from a small machine with few neighbours to a large
machine with many neighbours.

Perhaps the code just needs to limit the number of neighbours who have
never replied and the code is probing for an a per interface basis.

It still may make sense to have a global limit of perhaps a million
entries just because that would be an indicator that something has truly
gone weird.

> You created network namespaces and it provides isolation -- separate
> tables essentially -- for devices, FIB entries, sockets, etc, but you
> argue against completing the task with separate neighbor tables which is
> very strange given the impact (completely broken networking).

Namespaces provide isolation at the level of names.  The objects still
share a kernel and compete for resources.  Not competing for resources
would require each namespace have it's own dedicated pool of resources
which over the whole machine would be much less efficient.

That is the fundamental design difference between namespaces and VM's
and it is why namespaces can be much cheaper and much more resource
efficient.  Reserving your worst case resource usage ahead of time tends
to result in a lot of inefficiencies.

Eric

[RFC/RFT,net-next,00/17] net: Convert neighbor tables to per-namespace

Message

Comments