diff mbox

net-netlink: Add a new attribute to expose TCLASS values via netlink

Message ID 1320711791-11005-1-git-send-email-zenczykowski@gmail.com
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Maciej Żenczykowski Nov. 8, 2011, 12:23 a.m. UTC
From: Maciej Żenczykowski <maze@google.com>

commit 3ceca749668a52bd795585e0f71c6f0b04814f7b added a TOS attribute.

Unfortunately TOS and TCLASS are both present in a dual-stack v6 socket,
furthermore they can have different values.  As such one cannot in a
sane way expose both through a single attribute.

Signed-off-by: Maciej Żenczyowski <maze@google.com>
CC: Murali Raja <muralira@google.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David S. Miller <davem@davemloft.net>
---
 include/linux/inet_diag.h |    3 ++-
 net/ipv4/inet_diag.c      |    4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

Comments

Maciej Żenczykowski Nov. 8, 2011, 12:25 a.m. UTC | #1
FYI, This obviously requires a follow up patch which will add TOS and
TCLASS info to appropriate dual-stack sockets (for example listening
tcp v6 non-ipv6only).

On Mon, Nov 7, 2011 at 4:23 PM, Maciej Żenczykowski
<zenczykowski@gmail.com> wrote:
> From: Maciej Żenczykowski <maze@google.com>
>
> commit 3ceca749668a52bd795585e0f71c6f0b04814f7b added a TOS attribute.
>
> Unfortunately TOS and TCLASS are both present in a dual-stack v6 socket,
> furthermore they can have different values.  As such one cannot in a
> sane way expose both through a single attribute.
>
> Signed-off-by: Maciej Żenczyowski <maze@google.com>
> CC: Murali Raja <muralira@google.com>
> CC: Stephen Hemminger <shemminger@vyatta.com>
> CC: Eric Dumazet <eric.dumazet@gmail.com>
> CC: David S. Miller <davem@davemloft.net>
> ---
>  include/linux/inet_diag.h |    3 ++-
>  net/ipv4/inet_diag.c      |    4 ++--
>  2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/inet_diag.h b/include/linux/inet_diag.h
> index 80b480c..abf5028 100644
> --- a/include/linux/inet_diag.h
> +++ b/include/linux/inet_diag.h
> @@ -98,9 +98,10 @@ enum {
>        INET_DIAG_VEGASINFO,
>        INET_DIAG_CONG,
>        INET_DIAG_TOS,
> +       INET_DIAG_TCLASS,
>  };
>
> -#define INET_DIAG_MAX INET_DIAG_TOS
> +#define INET_DIAG_MAX INET_DIAG_TCLASS
>
>
>  /* INET_DIAG_MEM */
> diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
> index f5e2bda..68e8ac5 100644
> --- a/net/ipv4/inet_diag.c
> +++ b/net/ipv4/inet_diag.c
> @@ -133,8 +133,8 @@ static int inet_csk_diag_fill(struct sock *sk,
>                               &np->rcv_saddr);
>                ipv6_addr_copy((struct in6_addr *)r->id.idiag_dst,
>                               &np->daddr);
> -               if (ext & (1 << (INET_DIAG_TOS - 1)))
> -                       RTA_PUT_U8(skb, INET_DIAG_TOS, np->tclass);
> +               if (ext & (1 << (INET_DIAG_TCLASS - 1)))
> +                       RTA_PUT_U8(skb, INET_DIAG_TCLASS, np->tclass);
>        }
>  #endif
>
> --
> 1.7.3.1
>
>
stephen hemminger Nov. 8, 2011, 12:27 a.m. UTC | #2
On Mon,  7 Nov 2011 16:23:11 -0800
Maciej Żenczykowski <zenczykowski@gmail.com> wrote:

> commit 3ceca749668a52bd795585e0f71c6f0b04814f7b added a TOS attribute.
> 
> Unfortunately TOS and TCLASS are both present in a dual-stack v6 socket,
> furthermore they can have different values.  As such one cannot in a
> sane way expose both through a single attribute.

Do we really want to continue to expose that as a supported
API. I would argue it was a mistake in the original implementation.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Maciej Żenczykowski Nov. 8, 2011, 12:29 a.m. UTC | #3
> Do we really want to continue to expose that as a supported
> API. I would argue it was a mistake in the original implementation.

Yes, that's an interesting question...

However, can this be changed at this point in time?

Theoretically network fabric hardware could interpret v4 and v6 dscp
code points differently
(don't think anyone sane would do really want to do this though...)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Nov. 9, 2011, 8:34 p.m. UTC | #4
From: Maciej Żenczykowski <zenczykowski@gmail.com>
Date: Mon,  7 Nov 2011 16:23:11 -0800

> From: Maciej Żenczykowski <maze@google.com>
> 
> commit 3ceca749668a52bd795585e0f71c6f0b04814f7b added a TOS attribute.
> 
> Unfortunately TOS and TCLASS are both present in a dual-stack v6 socket,
> furthermore they can have different values.  As such one cannot in a
> sane way expose both through a single attribute.
> 
> Signed-off-by: Maciej Żenczyowski <maze@google.com>

I can't see how an ipv6 mapped socket can even set the inet->tos value.

As far as I can see, only net/ipv4/ip_sockglue.c:ip_setsockopt() provides
the interface to change inet->tos.

And ipv6 sockets, of any type, are provided no such vector by which to
get at those interfaces.

So inet->tos is always left at it's default value for ipv6 mapped sockets,
and therefore I see no reason to report TCLASS vs. TOS separately.

In fact, what I would suggest is to do something about the lack of
ability to set inet->tos, and the best way to do that seems to be to
simply propagate the npinfo->tclass setting into inet->tos.  Performaing
any munging if necessary.

I'm not applying this patch.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Maciej Żenczykowski Nov. 10, 2011, 1:50 a.m. UTC | #5
> I can't see how an ipv6 mapped socket can even set the inet->tos value.
>
> As far as I can see, only net/ipv4/ip_sockglue.c:ip_setsockopt() provides
> the interface to change inet->tos.

Yes, but an ipv6 socket is permitted to setsockopt SOL_IP in addition
to SOL_IPV6.
As such, one can simply setsockopt(ipv6_socket, SOL_IP, IP_TOS, value).
This is just as well, because inet->tos is what determines the ipv4 tos value
of any ipv4 packets generated by an ipv6 socket, ie. when connect(ed/ing)
to an ipv4 mapped ipv6 address.

> And ipv6 sockets, of any type, are provided no such vector by which to
> get at those interfaces.

> So inet->tos is always left at it's default value for ipv6 mapped sockets,
> and therefore I see no reason to report TCLASS vs. TOS separately.

> In fact, what I would suggest is to do something about the lack of
> ability to set inet->tos, and the best way to do that seems to be to
> simply propagate the npinfo->tclass setting into inet->tos.  Performaing
> any munging if necessary.

> I'm not applying this patch.

If people are in agreement that inet->tos vs ipv6->tclass being
separately settable is not desirable,
then I'm willing to go through and remove the tclass field (and send
out a patch to do that).

However, this is _QUITE DEFINITELY_ a user visible change in API and
application visible behaviour.

It is however an annoying corner case, so perhaps would be best fixed
by getting rid of it?
OTOH, I'm not sure if people aren't perhaps relying on the ability to
separately set ipv4 dscp vs ipv6 tclass due to some internal network
constraints...

Personally, I'd prefer the simplification of not having to deal with
IP_TOS vs IPV6_TCLASS discrepancies on ipv6 dual-stack sockets.

- Maciej
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Nov. 10, 2011, 1:53 a.m. UTC | #6
From: Maciej Żenczykowski <zenczykowski@gmail.com>
Date: Wed, 9 Nov 2011 17:50:27 -0800

> Yes, but an ipv6 socket is permitted to setsockopt SOL_IP in addition
> to SOL_IPV6.
> As such, one can simply setsockopt(ipv6_socket, SOL_IP, IP_TOS, value).

That's my whole point, this operation isn't currently possible, the
socket ops for ipv4 setsockopt() aren't hooked up to ipv6 mapped
sockets in the kernel right now.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Maciej Żenczykowski Nov. 15, 2011, 6:13 a.m. UTC | #7
2011/11/13 David Miller <davem@davemloft.net>:
>>>> Yes, but an ipv6 socket is permitted to setsockopt SOL_IP in addition to SOL_IPV6.
>>>> As such, one can simply setsockopt(ipv6_socket, SOL_IP, IP_TOS, value).
>>>
>>> That's my whole point, this operation isn't currently possible, the
>>> socket ops for ipv4 setsockopt() aren't hooked up to ipv6 mapped
>>> sockets in the kernel right now.
>>
>> http://lxr.linux.no/linux+v3.1/net/ipv6/ipv6_sockglue.c#L822
>
> That definitely destroys the basis for all of my arguments :-)
>
> I'm going to add the TCLASS inet_diag patch, but can you find a cleaner
> way to make sure that all types of sockets have the value initialized
> to something sane?  The one you posted wasn't... optimal :-)
>
> Thanks.

I only realized just now that I had replied only to you and not to all.
I hope you don't mind that I'm adding everyone back into the loop.

(a) if we keep tos/tclass distinct.

We should make sure to get this patch (and the as yet unwritten
followup) into 3.2-final and not 3.3-rc1 (ie. net/master and not
net-next/master).
Since it does slightly change userspace visible API (thankfully, it
was just added so it's not yet too late to change it).
[FYI, I already see this patch in net/master - thanks!]

As for the followup, I can see three ways to proceed.
- The first involves figuring out what type of socket we're dealing
with when we look at it (ie. something like the RFC patch I posted - I
don't think it's going to be much prettier no matter what one does -
there's simply a lot of complexity...)
- The second involves keeping some sort of type-of-ip-socket
field/bitmap (none, ipv4 only, ipv4/ipv6 dual stack, ipv6 only)
directly in the socket.  This is probably much harder to get right
(changes all over the stack), but may have long term benefits???
- The third involves punting on the problem, we handle the simple
cases in kernel and sometimes return excess information to userspace
(ie. we take the RFC patch and strip out the address comparisons and
error on the side of returning 'true') -> it's userspace's problem to
sift through the excess garbage.

(b) personally I would still prefer to just merge tos and tclass and
get rid of the distinction... is this minor-but-userspace-visible-api
change out of the question?

- Maciej
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/linux/inet_diag.h b/include/linux/inet_diag.h
index 80b480c..abf5028 100644
--- a/include/linux/inet_diag.h
+++ b/include/linux/inet_diag.h
@@ -98,9 +98,10 @@  enum {
 	INET_DIAG_VEGASINFO,
 	INET_DIAG_CONG,
 	INET_DIAG_TOS,
+	INET_DIAG_TCLASS,
 };
 
-#define INET_DIAG_MAX INET_DIAG_TOS
+#define INET_DIAG_MAX INET_DIAG_TCLASS
 
 
 /* INET_DIAG_MEM */
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index f5e2bda..68e8ac5 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -133,8 +133,8 @@  static int inet_csk_diag_fill(struct sock *sk,
 			       &np->rcv_saddr);
 		ipv6_addr_copy((struct in6_addr *)r->id.idiag_dst,
 			       &np->daddr);
-		if (ext & (1 << (INET_DIAG_TOS - 1)))
-			RTA_PUT_U8(skb, INET_DIAG_TOS, np->tclass);
+		if (ext & (1 << (INET_DIAG_TCLASS - 1)))
+			RTA_PUT_U8(skb, INET_DIAG_TCLASS, np->tclass);
 	}
 #endif