diff mbox

[net-next,v5] rtnetlink: add new RTM_GETSTATS message to dump link stats

Message ID 1461013819-23223-1-git-send-email-roopa@cumulusnetworks.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Roopa Prabhu April 18, 2016, 9:10 p.m. UTC
From: Roopa Prabhu <roopa@cumulusnetworks.com>

This patch adds a new RTM_GETSTATS message to query link stats via netlink
from the kernel. RTM_NEWLINK also dumps stats today, but RTM_NEWLINK
returns a lot more than just stats and is expensive in some cases when
frequent polling for stats from userspace is a common operation.

RTM_GETSTATS is an attempt to provide a light weight netlink message
to explicity query only link stats from the kernel on an interface.
The idea is to also keep it extensible so that new kinds of stats can be
added to it in the future.

This patch adds the following attribute for NETDEV stats:
struct nla_policy ifla_stats_policy[IFLA_STATS_MAX + 1] = {
        [IFLA_STATS_LINK_64]  = { .len = sizeof(struct rtnl_link_stats64) },
};

Like any other rtnetlink message, RTM_GETSTATS can be used to get stats of
a single interface or all interfaces with NLM_F_DUMP.

Future possible new types of stat attributes:
link af stats:
    - IFLA_STATS_LINK_IPV6  (nested. for ipv6 stats)
    - IFLA_STATS_LINK_MPLS  (nested. for mpls/mdev stats)
extended stats:
    - IFLA_STATS_LINK_EXTENDED (nested. extended software netdev stats like bridge,
      vlan, vxlan etc)
    - IFLA_STATS_LINK_HW_EXTENDED (nested. extended hardware stats which are
      available via ethtool today)

This patch also declares a filter mask for all stat attributes.
User has to provide a mask of stats attributes to query. filter mask
can be specified in the new hdr 'struct if_stats_msg' for stats messages.
Other important field in the header is the ifindex.

This api can also include attributes for global stats (eg tcp) in the future.
When global stats are included in a stats msg, the ifindex in the header
must be zero. A single stats message cannot contain both global and
netdev specific stats. To easily distinguish them, netdev specific stat
attributes name are prefixed with IFLA_STATS_LINK_

Without any attributes in the filter_mask, no stats will be returned.

This patch has been tested with mofified iproute2 ifstat.

Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
RFC to v1 (apologies for the delay in sending this version out. busy days):
        - Addressed feedback from Dave
                - removed rtnl_link_stats
                - Added hdr struct if_stats_msg to carry ifindex and
                  filter mask
                - new macro IFLA_STATS_FILTER_BIT(ATTR) for filter mask
        - split the ipv6 patch into a separate patch, need some more eyes on it
        - prefix attributes with IFLA_STATS instead of IFLA_LINK_STATS for
          shorter attribute names

v2:
        - move IFLA_STATS_INET6 declaration to the inet6 patch
        - get rid of RTM_DELSTATS
        - mark ipv6 patch RFC. It can be used as an example for
          other AF stats like stats

v3:
        - add required padding to the if_stats_msg structure(suggested by jamal)
        - rename netdev stat attributes with IFLA_STATS_LINK prefix
          so that they are easily distinguishable with global
          stats in the future (after global stats discussion with thomas)
        - get rid of unnecessary copy when getting stats with dev_get_stats
          (suggested by dave)

v4:
        - dropped calcit and af stats from this patch. Will add it
          back when it becomes necessary and with the first af stats
          patch
        - add check for null filter in dump and return -EINVAL:
          this follows rtnl_fdb_dump in returning an error.
          But since netlink_dump does not propagate the error
          to the user, the user will not see an error and
          but will also not see any data. This is consistent with
          other kinds of dumps.

v5:
        - fix selinux nlmsgtab to account for new RTM_*STATS messages

 include/uapi/linux/if_link.h   |  23 +++++++
 include/uapi/linux/rtnetlink.h |   5 ++
 net/core/rtnetlink.c           | 150 +++++++++++++++++++++++++++++++++++++++++
 security/selinux/nlmsgtab.c    |   4 +-
 4 files changed, 181 insertions(+), 1 deletion(-)

Comments

Eric Dumazet April 18, 2016, 9:35 p.m. UTC | #1
On Mon, 2016-04-18 at 14:10 -0700, Roopa Prabhu wrote:

> +	if (filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64)) {
> +		struct rtnl_link_stats64 *sp;
> +
> +		attr = nla_reserve(skb, IFLA_STATS_LINK_64,
> +				   sizeof(struct rtnl_link_stats64));
> +		if (!attr)
> +			goto nla_put_failure;
> +
> +		sp = nla_data(attr);

Are you sure we have a guarantee that sp is aligned to u64 fields ?

x86 does not care, but some arches would have a potential misalign
access here.


> +		dev_get_stats(dev, sp);
> +	}
David Miller April 19, 2016, 12:57 a.m. UTC | #2
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 18 Apr 2016 14:35:56 -0700

> On Mon, 2016-04-18 at 14:10 -0700, Roopa Prabhu wrote:
> 
>> +	if (filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64)) {
>> +		struct rtnl_link_stats64 *sp;
>> +
>> +		attr = nla_reserve(skb, IFLA_STATS_LINK_64,
>> +				   sizeof(struct rtnl_link_stats64));
>> +		if (!attr)
>> +			goto nla_put_failure;
>> +
>> +		sp = nla_data(attr);
> 
> Are you sure we have a guarantee that sp is aligned to u64 fields ?
> 
> x86 does not care, but some arches would have a potential misalign
> access here.

I'll do some testing on sparc and deal with any fallout.
David Miller April 19, 2016, 1:48 a.m. UTC | #3
From: David Miller <davem@davemloft.net>
Date: Mon, 18 Apr 2016 20:57:55 -0400 (EDT)

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 18 Apr 2016 14:35:56 -0700
> 
>> On Mon, 2016-04-18 at 14:10 -0700, Roopa Prabhu wrote:
>> 
>>> +	if (filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64)) {
>>> +		struct rtnl_link_stats64 *sp;
>>> +
>>> +		attr = nla_reserve(skb, IFLA_STATS_LINK_64,
>>> +				   sizeof(struct rtnl_link_stats64));
>>> +		if (!attr)
>>> +			goto nla_put_failure;
>>> +
>>> +		sp = nla_data(attr);
>> 
>> Are you sure we have a guarantee that sp is aligned to u64 fields ?
>> 
>> x86 does not care, but some arches would have a potential misalign
>> access here.
> 
> I'll do some testing on sparc and deal with any fallout.

Just thinking out loud before I start testing, yeah I think you are
right.  nlmsghdr is 64-bit aligned, but the nlattr header is 32-bit
which will thus make the attribute data area not be aligned.

I think the time has probably come to have a new netlink attribute
format that doesn't have this multi-decade old problem.

There are a lot of bits left in nla_type, we can use one to indicate
that the nlattr struct is another 4 bytes in length in order to
archieve proper alignment of the payload data.

+struct nlattr64 {
+	__u16           nla_len;
+	__u16           nla_type;
+	__u32		nla_pad;
+};
 ...
+#define NLA_F_64BIT_ALIGNED	(1 << 13)
-#define NLA_TYPE_MASK		~(NLA_F_NESTED | NLA_F_NET_BYTEORDER)
+#define NLA_TYPE_MASK		~(NLA_F_NESTED | NLA_F_NET_BYTEORDER |
				  NLA_F_64BIT_ALIGNED)
 ...
#define NLA64_ALIGNTO		8
#define NLA64_ALIGN(len)	(((len) + NLA64_ALIGNTO - 1) & ~(NLA64_ALIGNTO - 1))
#define NLA64_HDRLEN		((int) NLA64_ALIGN(sizeof(struct nlattr64)))

We're going to need some new nla64_*() helpers and code added to some
of the existing ones to test that new bit.

For example, nla_data would now be:

static inline void *nla_data(const struct nlattr *nla)
{
	if (nla->nla_type & NLA_F_64BIT_ALIGNED)
		return (char *) nla + NLA64_HDRLEN;
	else
		return (char *) nla + NLA_HDRLEN;
}

nla_len would be:

static inline int nla_len(const struct nlattr *nla)
{
	int hdrlen = NLA_HDRLEN;

	if (nla->nla_type & NLA_F_64BIT_ALIGNED)
		hdrlen = NLA64_hdrlen;
	return nla->nla_len - hdrlen;
}

etc. etc.

And anyways, I get unaligned accesses without Roopa's changes :-/

davem@patience:~$ ip l l
[3391066.656729] Kernel unaligned access at TPC[7d6c14] loopback_get_stats64+0x74/0xa0
[3391066.672020] Kernel unaligned access at TPC[7d6c18] loopback_get_stats64+0x78/0xa0
[3391066.687282] Kernel unaligned access at TPC[7d6c1c] loopback_get_stats64+0x7c/0xa0
[3391066.702573] Kernel unaligned access at TPC[7d6c20] loopback_get_stats64+0x80/0xa0
[3391066.717858] Kernel unaligned access at TPC[8609dc] dev_get_stats+0x3c/0xe0
Eric Dumazet April 19, 2016, 2:22 a.m. UTC | #4
On Mon, 2016-04-18 at 21:48 -0400, David Miller wrote:

> 
> And anyways, I get unaligned accesses without Roopa's changes :-/
> 
> davem@patience:~$ ip l l
> [3391066.656729] Kernel unaligned access at TPC[7d6c14] loopback_get_stats64+0x74/0xa0
> [3391066.672020] Kernel unaligned access at TPC[7d6c18] loopback_get_stats64+0x78/0xa0
> [3391066.687282] Kernel unaligned access at TPC[7d6c1c] loopback_get_stats64+0x7c/0xa0
> [3391066.702573] Kernel unaligned access at TPC[7d6c20] loopback_get_stats64+0x80/0xa0
> [3391066.717858] Kernel unaligned access at TPC[8609dc] dev_get_stats+0x3c/0xe0

Yes, rtnl_fill_stats() probably has the same mistake.

commit 550bce59baf3f3059cd4ae1e268f08f2d2cb1d5c
Author: Roopa Prabhu <roopa@cumulusnetworks.com>
Date:   Fri Apr 15 20:36:25 2016 -0700

    rtnetlink: rtnl_fill_stats: avoid an unnecssary stats copy
    
    This patch passes netlink attr data ptr directly to dev_get_stats
    thus elimiating a stats copy.
    
    Suggested-by: David Miller <davem@davemloft.net>
    Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu April 19, 2016, 2:30 a.m. UTC | #5
On 4/18/16, 5:57 PM, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 18 Apr 2016 14:35:56 -0700
>
>> On Mon, 2016-04-18 at 14:10 -0700, Roopa Prabhu wrote:
>>
>>> +	if (filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64)) {
>>> +		struct rtnl_link_stats64 *sp;
>>> +
>>> +		attr = nla_reserve(skb, IFLA_STATS_LINK_64,
>>> +				   sizeof(struct rtnl_link_stats64));
>>> +		if (!attr)
>>> +			goto nla_put_failure;
>>> +
>>> +		sp = nla_data(attr);
>> Are you sure we have a guarantee that sp is aligned to u64 fields ?
>>
>> x86 does not care, but some arches would have a potential misalign
>> access here.
> I'll do some testing on sparc and deal with any fallout.
Thanks David. I will be happy to respin if you see problems.
Roopa Prabhu April 19, 2016, 2:40 a.m. UTC | #6
On 4/18/16, 7:22 PM, Eric Dumazet wrote:
> On Mon, 2016-04-18 at 21:48 -0400, David Miller wrote:
>
>> And anyways, I get unaligned accesses without Roopa's changes :-/
>>
>> davem@patience:~$ ip l l
>> [3391066.656729] Kernel unaligned access at TPC[7d6c14] loopback_get_stats64+0x74/0xa0
>> [3391066.672020] Kernel unaligned access at TPC[7d6c18] loopback_get_stats64+0x78/0xa0
>> [3391066.687282] Kernel unaligned access at TPC[7d6c1c] loopback_get_stats64+0x7c/0xa0
>> [3391066.702573] Kernel unaligned access at TPC[7d6c20] loopback_get_stats64+0x80/0xa0
>> [3391066.717858] Kernel unaligned access at TPC[8609dc] dev_get_stats+0x3c/0xe0
> Yes, rtnl_fill_stats() probably has the same mistake.
>
> commit 550bce59baf3f3059cd4ae1e268f08f2d2cb1d5c
> Author: Roopa Prabhu <roopa@cumulusnetworks.com>
> Date:   Fri Apr 15 20:36:25 2016 -0700
>
>     rtnetlink: rtnl_fill_stats: avoid an unnecssary stats copy
>     
>     This patch passes netlink attr data ptr directly to dev_get_stats
>     thus elimiating a stats copy.
>     
>     Suggested-by: David Miller <davem@davemloft.net>
>     Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
>
>
>
David, if you revert the one in rtnl_fill_stats, i will take care of the dev_get_stats in RTM_GETSTATS in v6.

thanks,
Roopa
David Miller April 19, 2016, 3:49 a.m. UTC | #7
From: Roopa Prabhu <roopa@cumulusnetworks.com>
Date: Mon, 18 Apr 2016 19:40:08 -0700

> On 4/18/16, 7:22 PM, Eric Dumazet wrote:
>> On Mon, 2016-04-18 at 21:48 -0400, David Miller wrote:
>>
>>> And anyways, I get unaligned accesses without Roopa's changes :-/
>>>
>>> davem@patience:~$ ip l l
>>> [3391066.656729] Kernel unaligned access at TPC[7d6c14] loopback_get_stats64+0x74/0xa0
>>> [3391066.672020] Kernel unaligned access at TPC[7d6c18] loopback_get_stats64+0x78/0xa0
>>> [3391066.687282] Kernel unaligned access at TPC[7d6c1c] loopback_get_stats64+0x7c/0xa0
>>> [3391066.702573] Kernel unaligned access at TPC[7d6c20] loopback_get_stats64+0x80/0xa0
>>> [3391066.717858] Kernel unaligned access at TPC[8609dc] dev_get_stats+0x3c/0xe0
>> Yes, rtnl_fill_stats() probably has the same mistake.
>>
>> commit 550bce59baf3f3059cd4ae1e268f08f2d2cb1d5c
>> Author: Roopa Prabhu <roopa@cumulusnetworks.com>
>> Date:   Fri Apr 15 20:36:25 2016 -0700
>>
>>     rtnetlink: rtnl_fill_stats: avoid an unnecssary stats copy
>>     
>>     This patch passes netlink attr data ptr directly to dev_get_stats
>>     thus elimiating a stats copy.
>>     
>>     Suggested-by: David Miller <davem@davemloft.net>
>>     Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>>     Signed-off-by: David S. Miller <davem@davemloft.net>
>>
>>
>>
> David, if you revert the one in rtnl_fill_stats, i will take care of the dev_get_stats in RTM_GETSTATS in v6.

Don't need to revert, see my idea with the IFLA_PAD attribute. :-)
David Miller April 19, 2016, 3:52 a.m. UTC | #8
From: David Miller <davem@davemloft.net>
Date: Mon, 18 Apr 2016 21:48:51 -0400 (EDT)

> I think the time has probably come to have a new netlink attribute
> format that doesn't have this multi-decade old problem.
 ...

Scratch this whole idea, I think the padding attribute works a lot
better and is backwards compatible with every properly written
netlink application.
Nicolas Dichtel April 19, 2016, 8:26 a.m. UTC | #9
+ selinux maintainers

Le 18/04/2016 23:10, Roopa Prabhu a écrit :
[snip]
> diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
> index 8495b93..1714633 100644
> --- a/security/selinux/nlmsgtab.c
> +++ b/security/selinux/nlmsgtab.c
> @@ -76,6 +76,8 @@ static struct nlmsg_perm nlmsg_route_perms[] =
>   	{ RTM_NEWNSID,		NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
>   	{ RTM_DELNSID,		NETLINK_ROUTE_SOCKET__NLMSG_READ  },
>   	{ RTM_GETNSID,		NETLINK_ROUTE_SOCKET__NLMSG_READ  },
> +	{ RTM_NEWSTATS,		NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
I would say it's NETLINK_ROUTE_SOCKET__NLMSG_READ, not WRITE. This command is
only sent by the kernel, not by the userland.
Johannes Berg April 19, 2016, 10:09 a.m. UTC | #10
On Mon, 2016-04-18 at 23:52 -0400, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Mon, 18 Apr 2016 21:48:51 -0400 (EDT)
> 
> > I think the time has probably come to have a new netlink attribute
> > format that doesn't have this multi-decade old problem.
>  ...
> 
> Scratch this whole idea, I think the padding attribute works a lot
> better and is backwards compatible with every properly written
> netlink application.

This talk about attribute flags reminded me about something else.

At netconf, we talked about how awkward it can be that one doesn't know
if an attribute was accepted by the kernel or simply ignored because
it's not supported (older kernel version).

I considered (and Emmanuel has started to cook up a patch for it)
adding a flag here meaning "reject if not parsed" (or so).

Now, I realize that with all the existing netlink commands one can't
actually rely on that (since it requires some infrastructure), but new
commands in existing families and also entirely new generic netlink
families would allow let userspace rely on such a thing.

Thoughts?

johannes
David Miller April 19, 2016, 6:23 p.m. UTC | #11
From: Johannes Berg <johannes@sipsolutions.net>
Date: Tue, 19 Apr 2016 12:09:03 +0200

> At netconf, we talked about how awkward it can be that one doesn't know
> if an attribute was accepted by the kernel or simply ignored because
> it's not supported (older kernel version).
> 
> I considered (and Emmanuel has started to cook up a patch for it)
> adding a flag here meaning "reject if not parsed" (or so).
> 
> Now, I realize that with all the existing netlink commands one can't
> actually rely on that (since it requires some infrastructure), but new
> commands in existing families and also entirely new generic netlink
> families would allow let userspace rely on such a thing.

I like this nlattr flag idea, it's opt-in and any tool can be updated
to use the new facility.

I'd be willing the backport the nlattr flag bit change to all stable
releases as well.
Johannes Berg April 19, 2016, 7:41 p.m. UTC | #12
On Tue, 2016-04-19 at 14:23 -0400, David Miller wrote:

> I like this nlattr flag idea, it's opt-in and any tool can be updated
> to use the new facility.

Right.

> I'd be willing the backport the nlattr flag bit change to all stable
> releases as well.

I'm not really convinced that helps much - tools still can't really
rely on the kernel supporting it.

One thing that might work is that a tool might say it only wants to
support kernels that have this change (assuming we backport it to
enough kernels etc.); in that case the tool could add some absolutely
must-have information (like the IFINDEX or whatever, depends on the
command) with the new flag, this would get rejected since unpatched
kernels wouldn't understand the flag and wouldn't find that must-have
information.

Nevertheless, I think it's most reliable with new netlink commands that
are known to be available only on kernels understand and treating the
flag correctly.

johannes
Paul Moore April 19, 2016, 7:55 p.m. UTC | #13
On Tue, Apr 19, 2016 at 4:26 AM, Nicolas Dichtel
<nicolas.dichtel@6wind.com> wrote:
> + selinux maintainers
>
> Le 18/04/2016 23:10, Roopa Prabhu a écrit :
> [snip]
>>
>> diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
>> index 8495b93..1714633 100644
>> --- a/security/selinux/nlmsgtab.c
>> +++ b/security/selinux/nlmsgtab.c
>> @@ -76,6 +76,8 @@ static struct nlmsg_perm nlmsg_route_perms[] =
>>         { RTM_NEWNSID,          NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
>>         { RTM_DELNSID,          NETLINK_ROUTE_SOCKET__NLMSG_READ  },
>>         { RTM_GETNSID,          NETLINK_ROUTE_SOCKET__NLMSG_READ  },
>> +       { RTM_NEWSTATS,         NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
>
> I would say it's NETLINK_ROUTE_SOCKET__NLMSG_READ, not WRITE. This command
> is only sent by the kernel, not by the userland.

From what I could tell from the patch description, it looks like
RTM_NEWSTATS only dumps stats to userspace and doesn't alter the state
of the kernel, is that correct?  If so, then yes, NLMSG__READ is the
right SELinux permission.  However, if RTM_NEWSTATS does alter the
state/configuration of the kernel then we should use NLMSG__WRITE.
Roopa Prabhu April 19, 2016, 8:40 p.m. UTC | #14
On 4/19/16, 12:55 PM, Paul Moore wrote:
> On Tue, Apr 19, 2016 at 4:26 AM, Nicolas Dichtel
> <nicolas.dichtel@6wind.com> wrote:
>> + selinux maintainers
>>
>> Le 18/04/2016 23:10, Roopa Prabhu a écrit :
>> [snip]
>>> diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
>>> index 8495b93..1714633 100644
>>> --- a/security/selinux/nlmsgtab.c
>>> +++ b/security/selinux/nlmsgtab.c
>>> @@ -76,6 +76,8 @@ static struct nlmsg_perm nlmsg_route_perms[] =
>>>         { RTM_NEWNSID,          NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
>>>         { RTM_DELNSID,          NETLINK_ROUTE_SOCKET__NLMSG_READ  },
>>>         { RTM_GETNSID,          NETLINK_ROUTE_SOCKET__NLMSG_READ  },
>>> +       { RTM_NEWSTATS,         NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
>> I would say it's NETLINK_ROUTE_SOCKET__NLMSG_READ, not WRITE. This command
>> is only sent by the kernel, not by the userland.
> From what I could tell from the patch description, it looks like
> RTM_NEWSTATS only dumps stats to userspace and doesn't alter the state
> of the kernel, is that correct?  If so, then yes, NLMSG__READ is the
> right SELinux permission.  However, if RTM_NEWSTATS does alter the
> state/configuration of the kernel then we should use NLMSG__WRITE.
>
okay, will change it to READ in the next version,

thanks.
David Ahern April 20, 2016, 1:53 a.m. UTC | #15
On 4/19/16 1:41 PM, Johannes Berg wrote:
> On Tue, 2016-04-19 at 14:23 -0400, David Miller wrote:
>>
>> I like this nlattr flag idea, it's opt-in and any tool can be updated
>> to use the new facility.
>
> Right.
>
>> I'd be willing the backport the nlattr flag bit change to all stable
>> releases as well.
>
> I'm not really convinced that helps much - tools still can't really
> rely on the kernel supporting it.
>
> One thing that might work is that a tool might say it only wants to
> support kernels that have this change (assuming we backport it to
> enough kernels etc.); in that case the tool could add some absolutely
> must-have information (like the IFINDEX or whatever, depends on the
> command) with the new flag, this would get rejected since unpatched
> kernels wouldn't understand the flag and wouldn't find that must-have
> information.
>
> Nevertheless, I think it's most reliable with new netlink commands that
> are known to be available only on kernels understand and treating the
> flag correctly.

The kernel can set a flag in the response that it acknowledges the new 
attribute/flag. I did that for filtering neigh dumps -- 21fdd092acc7.
Johannes Berg April 20, 2016, 7:32 a.m. UTC | #16
On Tue, 2016-04-19 at 19:53 -0600, David Ahern wrote:

> The kernel can set a flag in the response that it acknowledges the
> new  attribute/flag. I did that for filtering neigh dumps --
> 21fdd092acc7.
> 

Hm, that works, but I think it requires writing extra code, which I was
kinda trying to avoid. With the patch that Emmanuel wrote, we can
restrict the changes to just nla_parse().

Anyway, I think we just have to document the behaviour very precisely,
and userspace can make its own decisions.

Essentially, apps will have a number of choices:

1) Use the new attribute flag only with commands known to have been
   added after the kernel support was added.

2) Use the new attribute flag with some required attribute for
   existing commands, so that older kernel will not find the required
   attribute and will reject the operation entirely.
   May or may not fall back to trying the operation again without the
   flag.

3) Simply use the new flag and do unexpected things on kernels not
   supporting the rejection mechanism - not much worse than today in
   many cases.

I guess we'll write a proper commit message and send the patch.

johannes
Jiri Benc April 20, 2016, 12:48 p.m. UTC | #17
On Wed, 20 Apr 2016 09:32:20 +0200, Johannes Berg wrote:
> 2) Use the new attribute flag with some required attribute for
>    existing commands, so that older kernel will not find the required
>    attribute and will reject the operation entirely.
>    May or may not fall back to trying the operation again without the
>    flag.

This is basically what I submitted half a year ago. See:
http://thread.gmane.org/gmane.linux.network/382850

 Jiri
Johannes Berg April 20, 2016, 1:17 p.m. UTC | #18
On Wed, 2016-04-20 at 14:48 +0200, Jiri Benc wrote:
> On Wed, 20 Apr 2016 09:32:20 +0200, Johannes Berg wrote:
> > 
> > 2) Use the new attribute flag with some required attribute for
> >    existing commands, so that older kernel will not find the
> > required
> >    attribute and will reject the operation entirely.
> >    May or may not fall back to trying the operation again without
> > the
> >    flag.
> This is basically what I submitted half a year ago. See:
> http://thread.gmane.org/gmane.linux.network/382850
> 

That looks like a *huge* patchset though - whereas my proposal really
required only what Emmanuel sent in this thread. It did make some
assumptions, for example that any attribute lower than the "maxtype"
argument to nla_parse() was understood. [1]

Looks like you have this on a per-message basis. I thought it was
better on an attribute basis because that's really where the issue is.

You can still detect it with the per-attribute flag approach as I
described in (2) - if, for your lwtunnel example, you could specify the
flag on the RTA_ENCAP attribute, without which no lwtunnel can be
created (if I understand the code correctly.)

johannes



[1] for example, if I have three attributes:
enum attrs {__unused, A, B, C};

and the policy

policy = {
	[A] = { .type = NLA_U32 },
	[C] = { .type = NLA_U8 },
}

and then do

nla_parse(tb, 3, msg, msg_len, &policy)

it would assume that "B" is valid. Since this policy is equivalent to
the policy with
	[B] = { .type = NLA_BINARY }

(minimum length 0) we could also reject anything that has type=len=0 in
the policy, if the NLA_F_NET_MUST_PARSE flag is set in the nla_type.

This would likely be the right approach for most netlink families,
since they usually don't have holes that they actually care about -
I've yet to see any attribute that's not specified at all in the policy
but used anyway, normally you want some level of checking, and indicate
that by using { .type = NLA_BINARY } - but other things are possible.

johannes
Jiri Benc April 20, 2016, 1:34 p.m. UTC | #19
On Wed, 20 Apr 2016 15:17:08 +0200, Johannes Berg wrote:
> Looks like you have this on a per-message basis. I thought it was
> better on an attribute basis because that's really where the issue is.

No problem. I'm not that happy with my patchset myself. Just wanted to
point it out in case it's useful.

 Jiri
Johannes Berg April 20, 2016, 8:13 p.m. UTC | #20
On Wed, 2016-04-20 at 15:34 +0200, Jiri Benc wrote:
> On Wed, 20 Apr 2016 15:17:08 +0200, Johannes Berg wrote:
> > 
> > Looks like you have this on a per-message basis. I thought it was
> > better on an attribute basis because that's really where the issue
> > is.
> No problem. I'm not that happy with my patchset myself. Just wanted
> to point it out in case it's useful.

Yeah, I looked at it, but I think it ended up a bit too complicated
really.

It does have slightly more validation in some sense, but I don't really
think that justifies the complexity?

No matter what, we'll always have to deal with the problem of not
having this capability on older kernels. One way to work around it
would be to add a new NLM_F_REQUEST2 flag, since the kernel currently
requires having NLM_F_REQUEST set, NLM_F_REQUEST2 messages would be
rejected by existing kernels. Dunno if it's really worth it though, I
suspect that family/command-specific detection will work in practically
all cases.

johannes
diff mbox

Patch

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index bb3a90b..0762f35 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -781,4 +781,27 @@  enum {
 
 #define IFLA_HSR_MAX (__IFLA_HSR_MAX - 1)
 
+/* STATS section */
+
+struct if_stats_msg {
+	__u8  family;
+	__u8  pad1;
+	__u16 pad2;
+	__u32 ifindex;
+	__u32 filter_mask;
+};
+
+/* A stats attribute can be netdev specific or a global stat.
+ * For netdev stats, lets use the prefix IFLA_STATS_LINK_*
+ */
+enum {
+	IFLA_STATS_UNSPEC,
+	IFLA_STATS_LINK_64,
+	__IFLA_STATS_MAX,
+};
+
+#define IFLA_STATS_MAX (__IFLA_STATS_MAX - 1)
+
+#define IFLA_STATS_FILTER_BIT(ATTR)	(1 << (ATTR))
+
 #endif /* _UAPI_LINUX_IF_LINK_H */
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index ca764b5..cc885c4 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -139,6 +139,11 @@  enum {
 	RTM_GETNSID = 90,
 #define RTM_GETNSID RTM_GETNSID
 
+	RTM_NEWSTATS = 92,
+#define RTM_NEWSTATS RTM_NEWSTATS
+	RTM_GETSTATS = 94,
+#define RTM_GETSTATS RTM_GETSTATS
+
 	__RTM_MAX,
 #define RTM_MAX		(((__RTM_MAX + 3) & ~3) - 1)
 };
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a75f7e9..fe35102 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3451,6 +3451,153 @@  out:
 	return err;
 }
 
+static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev,
+			       int type, u32 pid, u32 seq, u32 change,
+			       unsigned int flags, unsigned int filter_mask)
+{
+	struct if_stats_msg *ifsm;
+	struct nlmsghdr *nlh;
+	struct nlattr *attr;
+
+	ASSERT_RTNL();
+
+	nlh = nlmsg_put(skb, pid, seq, type, sizeof(*ifsm), flags);
+	if (!nlh)
+		return -EMSGSIZE;
+
+	ifsm = nlmsg_data(nlh);
+	ifsm->ifindex = dev->ifindex;
+	ifsm->filter_mask = filter_mask;
+
+	if (filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64)) {
+		struct rtnl_link_stats64 *sp;
+
+		attr = nla_reserve(skb, IFLA_STATS_LINK_64,
+				   sizeof(struct rtnl_link_stats64));
+		if (!attr)
+			goto nla_put_failure;
+
+		sp = nla_data(attr);
+		dev_get_stats(dev, sp);
+	}
+
+	nlmsg_end(skb, nlh);
+
+	return 0;
+
+nla_put_failure:
+	nlmsg_cancel(skb, nlh);
+
+	return -EMSGSIZE;
+}
+
+static const struct nla_policy ifla_stats_policy[IFLA_STATS_MAX + 1] = {
+	[IFLA_STATS_LINK_64]	= { .len = sizeof(struct rtnl_link_stats64) },
+};
+
+static size_t if_nlmsg_stats_size(const struct net_device *dev,
+				  u32 filter_mask)
+{
+	size_t size = 0;
+
+	if (filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64))
+		size += nla_total_size(sizeof(struct rtnl_link_stats64));
+
+	return size;
+}
+
+static int rtnl_stats_get(struct sk_buff *skb, struct nlmsghdr *nlh)
+{
+	struct net *net = sock_net(skb->sk);
+	struct if_stats_msg *ifsm;
+	struct net_device *dev = NULL;
+	struct sk_buff *nskb;
+	u32 filter_mask;
+	int err;
+
+	ifsm = nlmsg_data(nlh);
+	if (ifsm->ifindex > 0)
+		dev = __dev_get_by_index(net, ifsm->ifindex);
+	else
+		return -EINVAL;
+
+	if (!dev)
+		return -ENODEV;
+
+	filter_mask = ifsm->filter_mask;
+	if (!filter_mask)
+		return -EINVAL;
+
+	nskb = nlmsg_new(if_nlmsg_stats_size(dev, filter_mask), GFP_KERNEL);
+	if (!nskb)
+		return -ENOBUFS;
+
+	err = rtnl_fill_statsinfo(nskb, dev, RTM_NEWSTATS,
+				  NETLINK_CB(skb).portid, nlh->nlmsg_seq, 0,
+				  0, filter_mask);
+	if (err < 0) {
+		/* -EMSGSIZE implies BUG in if_nlmsg_stats_size */
+		WARN_ON(err == -EMSGSIZE);
+		kfree_skb(nskb);
+	} else {
+		err = rtnl_unicast(nskb, net, NETLINK_CB(skb).portid);
+	}
+
+	return err;
+}
+
+static int rtnl_stats_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct net *net = sock_net(skb->sk);
+	struct if_stats_msg *ifsm;
+	int h, s_h;
+	int idx = 0, s_idx;
+	struct net_device *dev;
+	struct hlist_head *head;
+	unsigned int flags = NLM_F_MULTI;
+	u32 filter_mask = 0;
+	int err;
+
+	s_h = cb->args[0];
+	s_idx = cb->args[1];
+
+	cb->seq = net->dev_base_seq;
+
+	ifsm = nlmsg_data(cb->nlh);
+	filter_mask = ifsm->filter_mask;
+	if (!filter_mask)
+		return -EINVAL;
+
+	for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
+		idx = 0;
+		head = &net->dev_index_head[h];
+		hlist_for_each_entry(dev, head, index_hlist) {
+			if (idx < s_idx)
+				goto cont;
+			err = rtnl_fill_statsinfo(skb, dev, RTM_NEWSTATS,
+						  NETLINK_CB(cb->skb).portid,
+						  cb->nlh->nlmsg_seq, 0,
+						  flags, filter_mask);
+			/* If we ran out of room on the first message,
+			 * we're in trouble
+			 */
+			WARN_ON((err == -EMSGSIZE) && (skb->len == 0));
+
+			if (err < 0)
+				goto out;
+
+			nl_dump_check_consistent(cb, nlmsg_hdr(skb));
+cont:
+			idx++;
+		}
+	}
+out:
+	cb->args[1] = idx;
+	cb->args[0] = h;
+
+	return skb->len;
+}
+
 /* Process one rtnetlink message. */
 
 static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
@@ -3600,4 +3747,7 @@  void __init rtnetlink_init(void)
 	rtnl_register(PF_BRIDGE, RTM_GETLINK, NULL, rtnl_bridge_getlink, NULL);
 	rtnl_register(PF_BRIDGE, RTM_DELLINK, rtnl_bridge_dellink, NULL, NULL);
 	rtnl_register(PF_BRIDGE, RTM_SETLINK, rtnl_bridge_setlink, NULL, NULL);
+
+	rtnl_register(PF_UNSPEC, RTM_GETSTATS, rtnl_stats_get, rtnl_stats_dump,
+		      NULL);
 }
diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
index 8495b93..1714633 100644
--- a/security/selinux/nlmsgtab.c
+++ b/security/selinux/nlmsgtab.c
@@ -76,6 +76,8 @@  static struct nlmsg_perm nlmsg_route_perms[] =
 	{ RTM_NEWNSID,		NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
 	{ RTM_DELNSID,		NETLINK_ROUTE_SOCKET__NLMSG_READ  },
 	{ RTM_GETNSID,		NETLINK_ROUTE_SOCKET__NLMSG_READ  },
+	{ RTM_NEWSTATS,		NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+	{ RTM_GETSTATS,		NETLINK_ROUTE_SOCKET__NLMSG_READ  },
 };
 
 static struct nlmsg_perm nlmsg_tcpdiag_perms[] =
@@ -155,7 +157,7 @@  int selinux_nlmsg_lookup(u16 sclass, u16 nlmsg_type, u32 *perm)
 	switch (sclass) {
 	case SECCLASS_NETLINK_ROUTE_SOCKET:
 		/* RTM_MAX always point to RTM_SETxxxx, ie RTM_NEWxxx + 3 */
-		BUILD_BUG_ON(RTM_MAX != (RTM_NEWNSID + 3));
+		BUILD_BUG_ON(RTM_MAX != (RTM_NEWSTATS + 3));
 		err = nlmsg_perm(nlmsg_type, perm, nlmsg_route_perms,
 				 sizeof(nlmsg_route_perms));
 		break;