Message ID | 20180307011230.24001-9-jesus.sanchez-palencia@intel.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
Series | Time based packet transmission | expand |
On Tue, 2018-03-06 at 17:12 -0800, Jesus Sanchez-Palencia wrote: > Extend SO_TXTIME APIs with new per-packet parameters: a clockid_t and > a drop_if_late flag. With this commit the API becomes: > > * diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h * index d8340e6e8814..951969ceaf65 100644 * --- a/include/linux/skbuff.h * +++ b/include/linux/skbuff.h * @@ -788,6 +788,9 @@ struct sk_buff { * __u8 tc_redirected:1; * __u8 tc_from_ingress:1; * #endif * + __u8 tc_drop_if_late:1; * + * + clockid_t txtime_clockid; * * #ifdef CONFIG_NET_SCHED * __u16 tc_index; /* traffic control index */ This is adding 32+1 bits to sk_buff, and possibly holes in this very very hot (and already too fat) structure. Do we really need 32 bits for a clockid_t ?
On Tue, Mar 06, 2018 at 06:53:29PM -0800, Eric Dumazet wrote: > This is adding 32+1 bits to sk_buff, and possibly holes in this very > very hot (and already too fat) structure. > > Do we really need 32 bits for a clockid_t ? Probably we can live with fewer bits. For clock IDs with a positive sign, the max possible clock value is 16. For clock IDs with a negative sign, IIRC, three bits are for the type code (we have also posix timers packed like this) and the are for the file descriptor. So maybe we could use 16 bits, allowing 12 bits or so for encoding the FD. The downside would be that this forces the application to make sure and open the dynamic posix clock early enough before the FD count gets too high. Thanks, Richard
On Wed, Mar 7, 2018 at 12:24 AM, Richard Cochran <richardcochran@gmail.com> wrote: > On Tue, Mar 06, 2018 at 06:53:29PM -0800, Eric Dumazet wrote: >> This is adding 32+1 bits to sk_buff, and possibly holes in this very >> very hot (and already too fat) structure. >> >> Do we really need 32 bits for a clockid_t ? > > Probably we can live with fewer bits. > > For clock IDs with a positive sign, the max possible clock value is 16. > > For clock IDs with a negative sign, IIRC, three bits are for the type > code (we have also posix timers packed like this) and the are for the > file descriptor. So maybe we could use 16 bits, allowing 12 bits or > so for encoding the FD. > > The downside would be that this forces the application to make sure > and open the dynamic posix clock early enough before the FD count gets > too high. The same choices are probably made for all packets on a given socket. Unless skb->sk gets scrubbed in some transmit paths, then these be set as sockopt instead of cmsg.
On Wed, Mar 07, 2018 at 12:01:19PM -0500, Willem de Bruijn wrote: > The same choices are probably made for all packets on a given > socket. Unless skb->sk gets scrubbed in some transmit paths, > then these be set as sockopt instead of cmsg. The discussion on v2 ended with this per-message idea, in preference to the per-socket idea, IIRC. Thanks, Richard
On Wed, Mar 07, 2018 at 09:35:24AM -0800, Richard Cochran wrote: > The discussion on v2 ended with this per-message idea, in preference > to the per-socket idea, IIRC. (But my own opinion is that per-socket is good enough...) Thanks, Richard
On Wed, Mar 7, 2018 at 9:37 AM, Richard Cochran <richardcochran@gmail.com> wrote: > On Wed, Mar 07, 2018 at 09:35:24AM -0800, Richard Cochran wrote: >> The discussion on v2 ended with this per-message idea, in preference >> to the per-socket idea, IIRC. > > (But my own opinion is that per-socket is good enough...) > > Thanks, > Richard I would love if skb->tstamp could be either 0 or expressed in ktime_get() base all the time. ( Even if we would have to convert this to other bases when/if needed) Having to deal with many clockid in the core networking stack seems over engineered.
Hi, On 03/06/2018 06:53 PM, Eric Dumazet wrote: > On Tue, 2018-03-06 at 17:12 -0800, Jesus Sanchez-Palencia wrote: >> Extend SO_TXTIME APIs with new per-packet parameters: a clockid_t and >> a drop_if_late flag. With this commit the API becomes: >> >> > > * diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > * index d8340e6e8814..951969ceaf65 100644 > * --- a/include/linux/skbuff.h > * +++ b/include/linux/skbuff.h > * @@ -788,6 +788,9 @@ struct sk_buff { > * __u8 tc_redirected:1; > * __u8 tc_from_ingress:1; > * #endif > * + __u8 tc_drop_if_late:1; > * + > * + clockid_t txtime_clockid; > * > * #ifdef CONFIG_NET_SCHED > * __u16 tc_index; /* traffic > control index */ > > > This is adding 32+1 bits to sk_buff, and possibly holes in this very > very hot (and already too fat) structure. I should have mentioned on the commit msg, but the tc_drop_if_late is actually filling a 1 bit hole that was already there. > > Do we really need 32 bits for a clockid_t ? There is a 2 bytes hole just after tc_index, so a u16 clockid would fit perfectly without increasing the skbuffs size / cachelines any further. From Richard's reply, it seems safe to just change the definition here if we make it explicit on the SCM_CLOCKID documentation the caveat about the max possible fd count for dynamic clocks. How does that sound? Thanks, Jesus
On Wed, 2018-03-07 at 13:52 -0800, Jesus Sanchez-Palencia wrote: > Hi, ... > I should have mentioned on the commit msg, but the tc_drop_if_late is > actually > filling a 1 bit hole that was already there. > > > > > > Do we really need 32 bits for a clockid_t ? > > There is a 2 bytes hole just after tc_index, so a u16 clockid would > fit > perfectly without increasing the skbuffs size / cachelines any > further. > > From Richard's reply, it seems safe to just change the definition > here if we > make it explicit on the SCM_CLOCKID documentation the caveat about > the max > possible fd count for dynamic clocks. > > How does that sound? Not convincing really :/ Next big feature needing one bit in sk_buff will add it, and add a 63bit hole. Then next feature(s) will happily consume 'because there are holes anyway'. Then at some point we will cross cache line boundary and performance will take a 10 % hit. It is a never ending trend. If you really need 33 bits, then maybe we'll ask you to guard the new bits with some #if IS_ENABLED(CONFIG_...) so that we can opt-out. Why do we _really_ need dynamic clocks being supported in core networking stack, other than 'that is needed to send 2 packets per second with precise departure time and arbitrary user defined clocks, so lets do that, and do not care of the other 10,000,000 packets we receive/send per second' I have one patch (TXCS, something that I called XPS in the past) implementing the remote-freeing of skbs that help workloads where skb are produced on cpu A and consumed on cpu B, using an additional 16bit field that I have not upstreamed yet (even if Mellanox folks want that), simply because of this additional field... Maybe I should eat this hole before you take it ? No, we need to be extra careful.
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Wed, 07 Mar 2018 14:45:45 -0800 > No, we need to be extra careful. +1
On Wed, Mar 07, 2018 at 02:45:45PM -0800, Eric Dumazet wrote: > On Wed, 2018-03-07 at 13:52 -0800, Jesus Sanchez-Palencia wrote: > > > Do we really need 32 bits for a clockid_t ? > > > > There is a 2 bytes hole just after tc_index, so a u16 clockid would > > fit > > perfectly without increasing the skbuffs size / cachelines any > > further. > Not convincing really :/ > > Next big feature needing one bit in sk_buff will add it, and add a > 63bit hole. Would it be possible to put the clockid in skb_shared_info? If that's technically difficult or does not make sense, I'm ok with the clockid being a socket option. If a packet is sent immediately after changing the clockid via setsockopt(), will it be still guaranteed that the packet is restricted by the new id? > Why do we _really_ need dynamic clocks being supported in core > networking stack, other than 'that is needed to send 2 packets per > second with precise departure time and arbitrary user defined clocks, > so lets do that, and do not care of the other 10,000,000 packets we > receive/send per second' Well, I'd not expect it to be a common use case, but a public NTP server could be sending millions of packets per second in traffic peaks (typically at *:00:00) over multiple interfaces.
From: Miroslav Lichvar <mlichvar@redhat.com> Date: Thu, 8 Mar 2018 12:37:22 +0100 > Well, I'd not expect it to be a common use case, but a public NTP > server could be sending millions of packets per second in traffic > peaks (typically at *:00:00) over multiple interfaces. That's the problem. Bloating up sk_buff for an uncommon use case, penalizing all others, is a non-starter. Sorry.
On Wed, Mar 07, 2018 at 09:47:40AM -0800, Eric Dumazet wrote: > I would love if skb->tstamp could be either 0 or expressed in > ktime_get() base all the time. > > ( Even if we would have to convert this to other bases when/if needed) We really do need variable clock IDs. Otherwise the HW offloading case won't work. The desired transmit time must be expressed in terms of the clock inside the MAC. This clock is not necessarily related to the system time at all. But in addition to the performance concerns, I think putting this into a socket option is the more natural solution. Thanks, Richard
Hi, On 03/08/2018 08:44 AM, Richard Cochran wrote: > On Wed, Mar 07, 2018 at 09:47:40AM -0800, Eric Dumazet wrote: >> I would love if skb->tstamp could be either 0 or expressed in >> ktime_get() base all the time. >> >> ( Even if we would have to convert this to other bases when/if needed) > > We really do need variable clock IDs. Otherwise the HW offloading > case won't work. The desired transmit time must be expressed in terms > of the clock inside the MAC. This clock is not necessarily related to > the system time at all. > > But in addition to the performance concerns, I think putting this into > a socket option is the more natural solution. Ok, so we have it settled for clockid now. Providing it per-socket was what we'd proposed previously, so this was just an attempt to accommodate all the feedback we got on the v2 RFC. What about the tc_drop_if_late bit, though? Would it be acceptable to keep it per-packet, thus eating the 1-bit hole from skbuff if we would #if guard it (e.g. with CONFIG_NET_SCH_TBS)? Thanks, Jesus
On Tue, 6 Mar 2018, Richard Cochran wrote: > On Tue, Mar 06, 2018 at 06:53:29PM -0800, Eric Dumazet wrote: > > This is adding 32+1 bits to sk_buff, and possibly holes in this very > > very hot (and already too fat) structure. > > > > Do we really need 32 bits for a clockid_t ? > > Probably we can live with fewer bits. > > For clock IDs with a positive sign, the max possible clock value is 16. > > For clock IDs with a negative sign, IIRC, three bits are for the type > code (we have also posix timers packed like this) and the are for the > file descriptor. So maybe we could use 16 bits, allowing 12 bits or > so for encoding the FD. > > The downside would be that this forces the application to make sure > and open the dynamic posix clock early enough before the FD count gets > too high. Errm. No. There is no way to support fd based clocks or one of the CPU time/process time based clocks for this. CLOCK_REALTIME and CLOCK_MONOTONIC are probably the only interesting ones. BOOTTIME is hopefully soon irrelevant as we make MONOTONIC and BOOTTIME the same unless this causes unexpectedly a major issues. I don't think that CLOCK_TAI makes sense in that context, but I might be wrong. The rest of the CLOCK_* space cannot be used at all. So you need at max 2 bits for this, but I think 1 is good enough. Thanks, tglx
On Wed, Mar 21, 2018 at 01:58:51PM +0100, Thomas Gleixner wrote: > Errm. No. There is no way to support fd based clocks or one of the CPU > time/process time based clocks for this. Why not? If the we have HW offloading, then the transmit time had better be expressed in terms of the MAC's internal clock. Otherwise we would need to translate between a kernel clock and the MAC clock, but that is expensive (eg over PCIe) and silly (because in a typical use case the MAC will already be synchronized to the network time). Thanks, Richard
On Wed, 21 Mar 2018, Richard Cochran wrote: @Intel: I removed intel-wired-lan@ as I have absolutely zero interest in the moderation spam from that list. Can you please either get rid of this moderation nonsense or stop CC'ing that list when posting to lkml/netdev? > On Wed, Mar 21, 2018 at 01:58:51PM +0100, Thomas Gleixner wrote: > > Errm. No. There is no way to support fd based clocks or one of the CPU > > time/process time based clocks for this. > > Why not? > > If the we have HW offloading, then the transmit time had better be > expressed in terms of the MAC's internal clock. Otherwise we would > need to translate between a kernel clock and the MAC clock, but that > is expensive (eg over PCIe) and silly (because in a typical use case > the MAC will already be synchronized to the network time). Sure, but you CANNOT use a clockid for that because there is NONE. The mac clock is exposed via a dynamic posix clock and can only be referenced via a file descriptor. The qdisc setup does fd = open(...) and hands that in as clockid. Later the application does fd = open(...) and uses that as clockid for tagging the messages. What the heck guarantees that both the setup and the application will get the same fd number? Exactly nothing. So any attempt to use the filedescriptor number as clockid is broken by definition. Thanks, tglx
diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h index 065fb372e355..3399dfefa579 100644 --- a/arch/alpha/include/uapi/asm/socket.h +++ b/arch/alpha/include/uapi/asm/socket.h @@ -114,5 +114,7 @@ #define SO_TXTIME 61 #define SCM_TXTIME SO_TXTIME +#define SCM_DROP_IF_LATE 62 +#define SCM_CLOCKID 63 #endif /* _UAPI_ASM_SOCKET_H */ diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h index 0e95f45cd058..43b636836722 100644 --- a/arch/frv/include/uapi/asm/socket.h +++ b/arch/frv/include/uapi/asm/socket.h @@ -107,6 +107,8 @@ #define SO_TXTIME 61 #define SCM_TXTIME SO_TXTIME +#define SCM_DROP_IF_LATE 62 +#define SCM_CLOCKID 63 #endif /* _ASM_SOCKET_H */ diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h index c872c4e6bafb..1f06d07aadbe 100644 --- a/arch/ia64/include/uapi/asm/socket.h +++ b/arch/ia64/include/uapi/asm/socket.h @@ -116,5 +116,7 @@ #define SO_TXTIME 61 #define SCM_TXTIME SO_TXTIME +#define SCM_DROP_IF_LATE 62 +#define SCM_CLOCKID 63 #endif /* _ASM_IA64_SOCKET_H */ diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h index 65276c95b8df..69ab380d8d48 100644 --- a/arch/m32r/include/uapi/asm/socket.h +++ b/arch/m32r/include/uapi/asm/socket.h @@ -107,5 +107,7 @@ #define SO_TXTIME 61 #define SCM_TXTIME SO_TXTIME +#define SCM_DROP_IF_LATE 62 +#define SCM_CLOCKID 63 #endif /* _ASM_M32R_SOCKET_H */ diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h index 71370fb3ceef..97da79f58538 100644 --- a/arch/mips/include/uapi/asm/socket.h +++ b/arch/mips/include/uapi/asm/socket.h @@ -125,5 +125,7 @@ #define SO_TXTIME 61 #define SCM_TXTIME SO_TXTIME +#define SCM_DROP_IF_LATE 62 +#define SCM_CLOCKID 63 #endif /* _UAPI_ASM_SOCKET_H */ diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h index d029a40b1b55..7c7a174fdfae 100644 --- a/arch/mn10300/include/uapi/asm/socket.h +++ b/arch/mn10300/include/uapi/asm/socket.h @@ -107,5 +107,7 @@ #define SO_TXTIME 61 #define SCM_TXTIME SO_TXTIME +#define SCM_DROP_IF_LATE 62 +#define SCM_CLOCKID 63 #endif /* _ASM_SOCKET_H */ diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h index 061b9cf2a779..7fe86b5cd593 100644 --- a/arch/parisc/include/uapi/asm/socket.h +++ b/arch/parisc/include/uapi/asm/socket.h @@ -106,5 +106,7 @@ #define SO_TXTIME 0x4036 #define SCM_TXTIME SO_TXTIME +#define SCM_DROP_IF_LATE 0x4037 +#define SCM_CLOCKID 0x4038 #endif /* _UAPI_ASM_SOCKET_H */ diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h index 39d901476ee5..97f90c4a9b8c 100644 --- a/arch/s390/include/uapi/asm/socket.h +++ b/arch/s390/include/uapi/asm/socket.h @@ -113,5 +113,7 @@ #define SO_TXTIME 61 #define SCM_TXTIME SO_TXTIME +#define SCM_DROP_IF_LATE 62 +#define SCM_CLOCKID 63 #endif /* _ASM_SOCKET_H */ diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h index 7ea35e5601b6..6397c366dd2d 100644 --- a/arch/sparc/include/uapi/asm/socket.h +++ b/arch/sparc/include/uapi/asm/socket.h @@ -103,6 +103,8 @@ #define SO_TXTIME 0x003f #define SCM_TXTIME SO_TXTIME +#define SCM_DROP_IF_LATE 0x0040 +#define SCM_CLOCKID 0x0041 /* Security levels - as per NRL IPv6 - don't actually do anything */ #define SO_SECURITY_AUTHENTICATION 0x5001 diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h index 1de07a7f7680..bc81b02a1f5f 100644 --- a/arch/xtensa/include/uapi/asm/socket.h +++ b/arch/xtensa/include/uapi/asm/socket.h @@ -118,5 +118,7 @@ #define SO_TXTIME 61 #define SCM_TXTIME SO_TXTIME +#define SCM_DROP_IF_LATE 62 +#define SCM_CLOCKID 63 #endif /* _XTENSA_SOCKET_H */ diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index d8340e6e8814..951969ceaf65 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -788,6 +788,9 @@ struct sk_buff { __u8 tc_redirected:1; __u8 tc_from_ingress:1; #endif + __u8 tc_drop_if_late:1; + + clockid_t txtime_clockid; #ifdef CONFIG_NET_SCHED __u16 tc_index; /* traffic control index */ diff --git a/include/net/sock.h b/include/net/sock.h index 16a90a69c9b3..50e36e0f62f6 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1571,7 +1571,9 @@ void sk_send_sigurg(struct sock *sk); struct sockcm_cookie { u64 transmit_time; u32 mark; + clockid_t clockid; u16 tsflags; + u8 drop_if_late; }; int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg, diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index a12692e5f7a8..c9e1ea0097e1 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -109,5 +109,7 @@ #define SO_TXTIME 61 #define SCM_TXTIME SO_TXTIME +#define SCM_DROP_IF_LATE 62 +#define SCM_CLOCKID 63 #endif /* __ASM_GENERIC_SOCKET_H */ diff --git a/net/core/sock.c b/net/core/sock.c index 2ba09f311e71..51cfade342ec 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2126,6 +2126,7 @@ int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg, struct sockcm_cookie *sockc) { u32 tsflags; + u8 drop; switch (cmsg->cmsg_type) { case SO_MARK: @@ -2146,13 +2147,32 @@ int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg, sockc->tsflags &= ~SOF_TIMESTAMPING_TX_RECORD_MASK; sockc->tsflags |= tsflags; break; - case SO_TXTIME: + case SCM_TXTIME: if (!sock_flag(sk, SOCK_TXTIME)) return -EINVAL; if (cmsg->cmsg_len != CMSG_LEN(sizeof(u64))) return -EINVAL; sockc->transmit_time = get_unaligned((u64 *)CMSG_DATA(cmsg)); break; + case SCM_DROP_IF_LATE: + if (!sock_flag(sk, SOCK_TXTIME)) + return -EINVAL; + if (cmsg->cmsg_len != CMSG_LEN(sizeof(u8))) + return -EINVAL; + + drop = get_unaligned((u8 *)CMSG_DATA(cmsg)); + if (drop < 0 || drop > 1) + return -EINVAL; + + sockc->drop_if_late = drop; + break; + case SCM_CLOCKID: + if (!sock_flag(sk, SOCK_TXTIME)) + return -EINVAL; + if (cmsg->cmsg_len != CMSG_LEN(sizeof(clockid_t))) + return -EINVAL; + sockc->clockid = get_unaligned((clockid_t *)CMSG_DATA(cmsg)); + break; /* SCM_RIGHTS and SCM_CREDENTIALS are semantically in SOL_UNIX. */ case SCM_RIGHTS: case SCM_CREDENTIALS:
Extend SO_TXTIME APIs with new per-packet parameters: a clockid_t and a drop_if_late flag. With this commit the API becomes: - use SO_TXTIME to enable the feature on a socket; - pass the per-packet arguments through the cmsg header using: * SCM_CLOCKID for the clockid to be used as the txtime clock source; * SCM_TXTIME for the txtime timestamp; * SCM_DROP_IF_LATE for the drop flag. This flag will be used by the traffic control to decide if a delayed packet should be dropped. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> --- arch/alpha/include/uapi/asm/socket.h | 2 ++ arch/frv/include/uapi/asm/socket.h | 2 ++ arch/ia64/include/uapi/asm/socket.h | 2 ++ arch/m32r/include/uapi/asm/socket.h | 2 ++ arch/mips/include/uapi/asm/socket.h | 2 ++ arch/mn10300/include/uapi/asm/socket.h | 2 ++ arch/parisc/include/uapi/asm/socket.h | 2 ++ arch/s390/include/uapi/asm/socket.h | 2 ++ arch/sparc/include/uapi/asm/socket.h | 2 ++ arch/xtensa/include/uapi/asm/socket.h | 2 ++ include/linux/skbuff.h | 3 +++ include/net/sock.h | 2 ++ include/uapi/asm-generic/socket.h | 2 ++ net/core/sock.c | 22 +++++++++++++++++++++- 14 files changed, 48 insertions(+), 1 deletion(-)