Message ID: 1457028388-18226-3-git-send-email-bro.devel+kernel@gmail.com
State: Deferred, archived
Delegated to: David Miller
On Thu, 2016-03-03 at 19:06 +0100, Bendik Rønning Opstad wrote:
> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
> the latency for applications sending time-dependent data.
>
> Latency-sensitive applications or services, such as online games,
> remote control systems, and VoIP, produce traffic with thin-stream
> characteristics, characterized by small packets and relatively high
> inter-transmission times (ITT). When experiencing packet loss, such
> latency-sensitive applications are heavily penalized by the need to
> retransmit lost packets, which increases the latency by a minimum of
> one RTT for the lost packet. Packets coming after a lost packet are
> held back due to head-of-line blocking, causing increased delays for
> all data segments until the lost packet has been retransmitted.

Acked-by: Eric Dumazet <edumazet@google.com>

Note that RDB probably should get some SNMP counters, so that we get an
idea of how many times a loss could be repaired.

Ideally, if the path happens to be lossless, all these proactive
bundles are overhead. Might be useful to make RDB conditional on
tp->total_retrans or something.
On Thu, Mar 3, 2016 at 10:06 AM, Bendik Rønning Opstad <bro.devel@gmail.com> wrote: > > Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing > the latency for applications sending time-dependent data. > > Latency-sensitive applications or services, such as online games, > remote control systems, and VoIP, produce traffic with thin-stream > characteristics, characterized by small packets and relatively high > inter-transmission times (ITT). When experiencing packet loss, such > latency-sensitive applications are heavily penalized by the need to > retransmit lost packets, which increases the latency by a minimum of > one RTT for the lost packet. Packets coming after a lost packet are > held back due to head-of-line blocking, causing increased delays for > all data segments until the lost packet has been retransmitted. > > RDB enables a TCP sender to bundle redundant (already sent) data with > TCP packets containing small segments of new data. By resending > un-ACKed data from the output queue in packets with new data, RDB > reduces the need to retransmit data segments on connections > experiencing sporadic packet loss. By avoiding a retransmit, RDB > evades the latency increase of at least one RTT for the lost packet, > as well as alleviating head-of-line blocking for the packets following > the lost packet. This makes the TCP connection more resistant to > latency fluctuations, and reduces the application layer latency > significantly in lossy environments. > > Main functionality added: > > o When a packet is scheduled for transmission, RDB builds and > transmits a new SKB containing both the unsent data as well as > data of previously sent packets from the TCP output queue. > > o RDB will only be used for streams classified as thin by the > function tcp_stream_is_thin_dpifl(). This enforces a lower bound > on the ITT for streams that may benefit from RDB, controlled by > the sysctl variable net.ipv4.tcp_thin_dpifl_itt_lower_bound. 
> > o Loss detection of hidden loss events: When bundling redundant data > with each packet, packet loss can be hidden from the TCP engine due > to lack of dupACKs. This is because the loss is "repaired" by the > redundant data in the packet coming after the lost packet. Based on > incoming ACKs, such hidden loss events are detected, and CWR state > is entered. > > RDB can be enabled on a connection with the socket option TCP_RDB, or > on all new connections by setting the sysctl variable > net.ipv4.tcp_rdb=1 > > Cc: Andreas Petlund <apetlund@simula.no> > Cc: Carsten Griwodz <griff@simula.no> > Cc: Pål Halvorsen <paalh@simula.no> > Cc: Jonas Markussen <jonassm@ifi.uio.no> > Cc: Kristian Evensen <kristian.evensen@gmail.com> > Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> > Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com> > --- > Documentation/networking/ip-sysctl.txt | 15 +++ > include/linux/skbuff.h | 1 + > include/linux/tcp.h | 3 +- > include/net/tcp.h | 15 +++ > include/uapi/linux/tcp.h | 1 + > net/core/skbuff.c | 2 +- > net/ipv4/Makefile | 3 +- > net/ipv4/sysctl_net_ipv4.c | 25 ++++ > net/ipv4/tcp.c | 14 +- > net/ipv4/tcp_input.c | 3 + > net/ipv4/tcp_output.c | 48 ++++--- > net/ipv4/tcp_rdb.c | 228 +++++++++++++++++++++++++++++++++ > 12 files changed, 335 insertions(+), 23 deletions(-) > create mode 100644 net/ipv4/tcp_rdb.c > > diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt > index 6a92b15..8f3f3bf 100644 > --- a/Documentation/networking/ip-sysctl.txt > +++ b/Documentation/networking/ip-sysctl.txt > @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER > calculated, which is used to classify whether a stream is thin. > Default: 10000 > > +tcp_rdb - BOOLEAN > + Enable RDB for all new TCP connections. Please describe RDB briefly, perhaps with a pointer to your paper. I suggest have three level of controls: 0: disable RDB completely 1: enable indiv. thin-stream conn. 
to use RDB via TCP_RDB socket options
2: enable RDB on all thin-stream conn. by default

Currently it only provides modes 1 and 2, but there may be cases where
the administrator wants to disallow it (e.g., broken middle-boxes).

> +	Default: 0
> +
> +tcp_rdb_max_bytes - INTEGER
> +	Enable restriction on how many bytes an RDB packet can contain.
> +	This is the total amount of payload including the new unsent data.
> +	Default: 0
> +
> +tcp_rdb_max_packets - INTEGER
> +	Enable restriction on how many previous packets in the output queue
> +	RDB may include data from. A value of 1 will restrict bundling to
> +	only the data from the last packet that was sent.
> +	Default: 1

Why two metrics on redundancy? It also seems better to allow an
individual socket to select the redundancy level (e.g., setsockopt
TCP_RDB=3 means <=3 pkts per bundle) vs a global setting. This requires
more bits in tcp_sock, but 2-3 more should suffice.

> +
> tcp_limit_output_bytes - INTEGER
>	Controls TCP Small Queue limit per tcp socket.
> TCP bulk sender tends to increase packets in flight until it > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index 797cefb..0f2c9d1 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -2927,6 +2927,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); > void skb_free_datagram(struct sock *sk, struct sk_buff *skb); > void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb); > int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags); > +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old); > int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len); > int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len); > __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to, > diff --git a/include/linux/tcp.h b/include/linux/tcp.h > index bcbf51d..c84de15 100644 > --- a/include/linux/tcp.h > +++ b/include/linux/tcp.h > @@ -207,9 +207,10 @@ struct tcp_sock { > } rack; > u16 advmss; /* Advertised MSS */ > u8 unused; > - u8 nonagle : 4,/* Disable Nagle algorithm? */ > + u8 nonagle : 3,/* Disable Nagle algorithm? 
*/ > thin_lto : 1,/* Use linear timeouts for thin streams */ > thin_dupack : 1,/* Fast retransmit on first dupack */ > + rdb : 1,/* Redundant Data Bundling enabled */ > repair : 1, > frto : 1;/* F-RTO (RFC5682) activated in CA_Loss */ > u8 repair_queue; > diff --git a/include/net/tcp.h b/include/net/tcp.h > index d38eae9..2d42f4a 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -267,6 +267,9 @@ extern int sysctl_tcp_slow_start_after_idle; > extern int sysctl_tcp_thin_linear_timeouts; > extern int sysctl_tcp_thin_dupack; > extern int sysctl_tcp_thin_dpifl_itt_lower_bound; > +extern int sysctl_tcp_rdb; > +extern int sysctl_tcp_rdb_max_bytes; > +extern int sysctl_tcp_rdb_max_packets; > extern int sysctl_tcp_early_retrans; > extern int sysctl_tcp_limit_output_bytes; > extern int sysctl_tcp_challenge_ack_limit; > @@ -539,6 +542,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss, > bool tcp_may_send_now(struct sock *sk); > int __tcp_retransmit_skb(struct sock *, struct sk_buff *); > int tcp_retransmit_skb(struct sock *, struct sk_buff *); > +int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, > + gfp_t gfp_mask); > void tcp_retransmit_timer(struct sock *sk); > void tcp_xmit_retransmit_queue(struct sock *); > void tcp_simple_retransmit(struct sock *); > @@ -556,6 +561,7 @@ void tcp_send_ack(struct sock *sk); > void tcp_send_delayed_ack(struct sock *sk); > void tcp_send_loss_probe(struct sock *sk); > bool tcp_schedule_loss_probe(struct sock *sk); > +void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb); > > /* tcp_input.c */ > void tcp_resume_early_retransmit(struct sock *sk); > @@ -565,6 +571,11 @@ void tcp_reset(struct sock *sk); > void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb); > void tcp_fin(struct sock *sk); > > +/* tcp_rdb.c */ > +void tcp_rdb_ack_event(struct sock *sk, u32 flags); > +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb, > + 
unsigned int mss_now, gfp_t gfp_mask); > + > /* tcp_timer.c */ > void tcp_init_xmit_timers(struct sock *); > static inline void tcp_clear_xmit_timers(struct sock *sk) > @@ -763,6 +774,7 @@ struct tcp_skb_cb { > union { > struct { > /* There is space for up to 20 bytes */ > + __u32 rdb_start_seq; /* Start seq of rdb data */ > } tx; /* only used for outgoing skbs */ > union { > struct inet_skb_parm h4; > @@ -1497,6 +1509,9 @@ static inline struct sk_buff *tcp_write_queue_prev(const struct sock *sk, > #define tcp_for_write_queue_from_safe(skb, tmp, sk) \ > skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp) > > +#define tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) \ > + skb_queue_reverse_walk_from_safe(&(sk)->sk_write_queue, skb, tmp) > + > static inline struct sk_buff *tcp_send_head(const struct sock *sk) > { > return sk->sk_send_head; > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > index fe95446..6799875 100644 > --- a/include/uapi/linux/tcp.h > +++ b/include/uapi/linux/tcp.h > @@ -115,6 +115,7 @@ enum { > #define TCP_CC_INFO 26 /* Get Congestion Control (optional) info */ > #define TCP_SAVE_SYN 27 /* Record SYN headers for new connections */ > #define TCP_SAVED_SYN 28 /* Get SYN headers recorded for connection */ > +#define TCP_RDB 29 /* Enable Redundant Data Bundling mechanism */ > > struct tcp_repair_opt { > __u32 opt_code; > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 7af7ec6..50bc5b0 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -1055,7 +1055,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off) > skb->inner_mac_header += off; > } > > -static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old) > +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old) > { > __copy_skb_header(new, old); > > diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile > index bfa1336..459048c 100644 > --- a/net/ipv4/Makefile > +++ b/net/ipv4/Makefile > @@ -12,7 +12,8 @@ obj-y 
:= route.o inetpeer.o protocol.o \ > tcp_offload.o datagram.o raw.o udp.o udplite.o \ > udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \ > fib_frontend.o fib_semantics.o fib_trie.o \ > - inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o > + inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \ > + tcp_rdb.o > > obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o > obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o > diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c > index f04320a..43b4390 100644 > --- a/net/ipv4/sysctl_net_ipv4.c > +++ b/net/ipv4/sysctl_net_ipv4.c > @@ -581,6 +581,31 @@ static struct ctl_table ipv4_table[] = { > .extra1 = &tcp_thin_dpifl_itt_lower_bound_min, > }, > { > + .procname = "tcp_rdb", > + .data = &sysctl_tcp_rdb, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = proc_dointvec_minmax, > + .extra1 = &zero, > + .extra2 = &one, > + }, > + { > + .procname = "tcp_rdb_max_bytes", > + .data = &sysctl_tcp_rdb_max_bytes, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = proc_dointvec_minmax, > + .extra1 = &zero, > + }, > + { > + .procname = "tcp_rdb_max_packets", > + .data = &sysctl_tcp_rdb_max_packets, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = proc_dointvec_minmax, > + .extra1 = &zero, > + }, > + { > .procname = "tcp_early_retrans", > .data = &sysctl_tcp_early_retrans, > .maxlen = sizeof(int), > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 8421f3d..b53d4cb 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -288,6 +288,8 @@ int sysctl_tcp_autocorking __read_mostly = 1; > > int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN; > > +int sysctl_tcp_rdb __read_mostly; > + > struct percpu_counter tcp_orphan_count; > EXPORT_SYMBOL_GPL(tcp_orphan_count); > > @@ -407,6 +409,7 @@ void tcp_init_sock(struct sock *sk) > u64_stats_init(&tp->syncp); > > tp->reordering = sock_net(sk)->ipv4.sysctl_tcp_reordering; > + tp->rdb = sysctl_tcp_rdb; > 
tcp_enable_early_retrans(tp); > tcp_assign_congestion_control(sk); > > @@ -2412,6 +2415,13 @@ static int do_tcp_setsockopt(struct sock *sk, int level, > } > break; > > + case TCP_RDB: > + if (val < 0 || val > 1) > + err = -EINVAL; > + else > + tp->rdb = val; > + break; > + > case TCP_REPAIR: > if (!tcp_can_repair_sock(sk)) > err = -EPERM; > @@ -2842,7 +2852,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level, > case TCP_THIN_DUPACK: > val = tp->thin_dupack; > break; > - > + case TCP_RDB: > + val = tp->rdb; > + break; > case TCP_REPAIR: > val = tp->repair; > break; > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index e6e65f7..7b52ce4 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -3537,6 +3537,9 @@ static inline void tcp_in_ack_event(struct sock *sk, u32 flags) > > if (icsk->icsk_ca_ops->in_ack_event) > icsk->icsk_ca_ops->in_ack_event(sk, flags); > + > + if (unlikely(tcp_sk(sk)->rdb)) > + tcp_rdb_ack_event(sk, flags); > } > > /* Congestion control has updated the cwnd already. So if we're in > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index 7d2c7a4..6f92fae 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -897,8 +897,8 @@ out: > * We are working here with either a clone of the original > * SKB, or a fresh unique copy made by the retransmit engine. 
> */ > -static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, > - gfp_t gfp_mask) > +int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, > + gfp_t gfp_mask) > { > const struct inet_connection_sock *icsk = inet_csk(sk); > struct inet_sock *inet; > @@ -2110,9 +2110,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, > break; > } > > - if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) > + if (unlikely(tcp_sk(sk)->rdb)) { > + if (tcp_transmit_rdb_skb(sk, skb, mss_now, gfp)) > + break; > + } else if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) { > break; > - > + } > repair: > /* Advance the send_head. This one is sent out. > * This call will increment packets_out. > @@ -2439,15 +2442,32 @@ u32 __tcp_select_window(struct sock *sk) > return window; > } > > +/** > + * tcp_skb_append_data() - copy the linear data from an SKB to the end > + * of another and update end sequence number > + * and checksum > + * @from_skb: the SKB to copy data from > + * @to_skb: the SKB to copy data to > + */ > +void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb) > +{ > + skb_copy_from_linear_data(from_skb, skb_put(to_skb, from_skb->len), > + from_skb->len); > + TCP_SKB_CB(to_skb)->end_seq = TCP_SKB_CB(from_skb)->end_seq; > + > + if (from_skb->ip_summed == CHECKSUM_PARTIAL) > + to_skb->ip_summed = CHECKSUM_PARTIAL; > + > + if (to_skb->ip_summed != CHECKSUM_PARTIAL) > + to_skb->csum = csum_block_add(to_skb->csum, from_skb->csum, > + to_skb->len); > +} > + > /* Collapses two adjacent SKB's during retransmission. 
*/ > static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb) > { > struct tcp_sock *tp = tcp_sk(sk); > struct sk_buff *next_skb = tcp_write_queue_next(sk, skb); > - int skb_size, next_skb_size; > - > - skb_size = skb->len; > - next_skb_size = next_skb->len; > > BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1); > > @@ -2455,17 +2475,7 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb) > > tcp_unlink_write_queue(next_skb, sk); > > - skb_copy_from_linear_data(next_skb, skb_put(skb, next_skb_size), > - next_skb_size); > - > - if (next_skb->ip_summed == CHECKSUM_PARTIAL) > - skb->ip_summed = CHECKSUM_PARTIAL; > - > - if (skb->ip_summed != CHECKSUM_PARTIAL) > - skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size); > - > - /* Update sequence range on original skb. */ > - TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq; > + tcp_skb_append_data(next_skb, skb); > > /* Merge over control information. This moves PSH/FIN etc. over */ > TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags; > diff --git a/net/ipv4/tcp_rdb.c b/net/ipv4/tcp_rdb.c > new file mode 100644 > index 0000000..2b37957 > --- /dev/null > +++ b/net/ipv4/tcp_rdb.c > @@ -0,0 +1,228 @@ > +#include <linux/skbuff.h> > +#include <net/tcp.h> > + > +int sysctl_tcp_rdb_max_bytes __read_mostly; > +int sysctl_tcp_rdb_max_packets __read_mostly = 1; > + > +/** > + * rdb_detect_loss() - perform RDB loss detection by analysing ACKs > + * @sk: socket > + * > + * Traverse the output queue and check if the ACKed packet is an RDB > + * packet and if the redundant data covers one or more un-ACKed SKBs. > + * If the incoming ACK acknowledges multiple SKBs, we can presume > + * packet loss has occurred. > + * > + * We can infer packet loss this way because we can expect one ACK per > + * transmitted data packet, as delayed ACKs are disabled when a host > + * receives packets where the sequence number is not the expected > + * sequence number. 
> + * > + * Return: The number of packets that are presumed to be lost > + */ > +static unsigned int rdb_detect_loss(struct sock *sk) > +{ > + struct sk_buff *skb, *tmp; > + struct tcp_skb_cb *scb; > + u32 seq_acked = tcp_sk(sk)->snd_una; > + unsigned int packets_lost = 0; > + > + tcp_for_write_queue(skb, sk) { > + if (skb == tcp_send_head(sk)) > + break; > + > + scb = TCP_SKB_CB(skb); > + /* The ACK acknowledges parts of the data in this SKB. > + * Can be caused by: > + * - TSO: We abort as RDB is not used on SKBs split across > + * multiple packets on lower layers as these are greater > + * than one MSS. > + * - Retrans collapse: We've had a retrans, so loss has already > + * been detected. > + */ > + if (after(scb->end_seq, seq_acked)) > + break; > + else if (scb->end_seq != seq_acked) > + continue; > + > + /* We have found the ACKed packet */ > + > + /* This packet was sent with no redundant data, or no prior > + * un-ACKed SKBs is in the output queue, so break here. > + */ > + if (scb->tx.rdb_start_seq == scb->seq || > + skb_queue_is_first(&sk->sk_write_queue, skb)) > + break; > + /* Find number of prior SKBs whose data was bundled in this > + * (ACKed) SKB. We presume any redundant data covering previous > + * SKB's are due to loss. (An exception would be reordering). > + */ > + skb = skb->prev; > + tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) { > + if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq)) > + break; > + packets_lost++; since we only care if there is packet loss or not, we can return early here? 
> + } > + break; > + } > + return packets_lost; > +} > + > +/** > + * tcp_rdb_ack_event() - initiate RDB loss detection > + * @sk: socket > + * @flags: flags > + */ > +void tcp_rdb_ack_event(struct sock *sk, u32 flags) flags are not used > +{ > + if (rdb_detect_loss(sk)) > + tcp_enter_cwr(sk); > +} > + > +/** > + * rdb_build_skb() - build a new RDB SKB and copy redundant + unsent > + * data to the linear page buffer > + * @sk: socket > + * @xmit_skb: the SKB processed for transmission in the output engine > + * @first_skb: the first SKB in the output queue to be bundled > + * @bytes_in_rdb_skb: the total number of data bytes for the new > + * rdb_skb (NEW + Redundant) > + * @gfp_mask: gfp_t allocation > + * > + * Return: A new SKB containing redundant data, or NULL if memory > + * allocation failed > + */ > +static struct sk_buff *rdb_build_skb(const struct sock *sk, > + struct sk_buff *xmit_skb, > + struct sk_buff *first_skb, > + u32 bytes_in_rdb_skb, > + gfp_t gfp_mask) > +{ > + struct sk_buff *rdb_skb, *tmp_skb = first_skb; > + > + rdb_skb = sk_stream_alloc_skb((struct sock *)sk, > + (int)bytes_in_rdb_skb, > + gfp_mask, false); > + if (!rdb_skb) > + return NULL; > + copy_skb_header(rdb_skb, xmit_skb); > + rdb_skb->ip_summed = xmit_skb->ip_summed; > + TCP_SKB_CB(rdb_skb)->seq = TCP_SKB_CB(first_skb)->seq; > + TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(rdb_skb)->seq; > + > + /* Start on first_skb and append payload from each SKB in the output > + * queue onto rdb_skb until we reach xmit_skb. 
> + */ > + tcp_for_write_queue_from(tmp_skb, sk) { > + tcp_skb_append_data(tmp_skb, rdb_skb); > + > + /* We reached xmit_skb, containing the unsent data */ > + if (tmp_skb == xmit_skb) > + break; > + } > + return rdb_skb; > +} > + > +/** > + * rdb_can_bundle_test() - test if redundant data can be bundled > + * @sk: socket > + * @xmit_skb: the SKB processed for transmission by the output engine > + * @max_payload: the maximum allowed payload bytes for the RDB SKB > + * @bytes_in_rdb_skb: store the total number of payload bytes in the > + * RDB SKB if bundling can be performed > + * > + * Traverse the output queue and check if any un-acked data may be > + * bundled. > + * > + * Return: The first SKB to be in the bundle, or NULL if no bundling > + */ > +static struct sk_buff *rdb_can_bundle_test(const struct sock *sk, > + struct sk_buff *xmit_skb, > + unsigned int max_payload, > + u32 *bytes_in_rdb_skb) > +{ > + struct sk_buff *first_to_bundle = NULL; > + struct sk_buff *tmp, *skb = xmit_skb->prev; > + u32 skbs_in_bundle_count = 1; /* Start on 1 to account for xmit_skb */ > + u32 total_payload = xmit_skb->len; > + > + if (sysctl_tcp_rdb_max_bytes) > + max_payload = min_t(unsigned int, max_payload, > + sysctl_tcp_rdb_max_bytes); > + > + /* We start at xmit_skb->prev, and go backwards */ > + tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) { > + /* Including data from this SKB would exceed payload limit */ > + if ((total_payload + skb->len) > max_payload) > + break; > + > + if (sysctl_tcp_rdb_max_packets && > + (skbs_in_bundle_count > sysctl_tcp_rdb_max_packets)) > + break; > + > + total_payload += skb->len; > + skbs_in_bundle_count++; > + first_to_bundle = skb; > + } > + *bytes_in_rdb_skb = total_payload; > + return first_to_bundle; > +} > + > +/** > + * tcp_transmit_rdb_skb() - try to create and send an RDB packet > + * @sk: socket > + * @xmit_skb: the SKB processed for transmission by the output engine > + * @mss_now: current mss value > + * @gfp_mask: gfp_t 
allocation > + * > + * If an RDB packet could not be created and sent, transmit the > + * original unmodified SKB (xmit_skb). > + * > + * Return: 0 if successfully sent packet, else error from > + * tcp_transmit_skb > + */ > +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb, > + unsigned int mss_now, gfp_t gfp_mask) > +{ > + struct sk_buff *rdb_skb = NULL; > + struct sk_buff *first_to_bundle; > + u32 bytes_in_rdb_skb = 0; > + > + /* How we detect that RDB was used. When equal, no RDB data was sent */ > + TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq; > + > + if (!tcp_stream_is_thin_dpifl(tcp_sk(sk))) During loss recovery tcp inflight fluctuates and would like to trigger this check even for non-thin-stream connections. Since the loss already occurs, RDB can only take advantage from limited-transmit, which it likely does not have (b/c its a thin-stream). It might be checking if the state is open. > + goto xmit_default; > + > + /* No bundling if first in queue, or on FIN packet */ > + if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb) || > + (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN)) seems there are still benefit to bundle packets up to FIN? > + goto xmit_default; > + > + /* Find number of (previous) SKBs to get data from */ > + first_to_bundle = rdb_can_bundle_test(sk, xmit_skb, mss_now, > + &bytes_in_rdb_skb); > + if (!first_to_bundle) > + goto xmit_default; > + > + /* Create an SKB that contains redundant data starting from > + * first_to_bundle. > + */ > + rdb_skb = rdb_build_skb(sk, xmit_skb, first_to_bundle, > + bytes_in_rdb_skb, gfp_mask); > + if (!rdb_skb) > + goto xmit_default; > + > + /* Set skb_mstamp for the SKB in the output queue (xmit_skb) containing > + * the yet unsent data. Normally this would be done by > + * tcp_transmit_skb(), but as we pass in rdb_skb instead, xmit_skb's > + * timestamp will not be touched. 
> + */ > + skb_mstamp_get(&xmit_skb->skb_mstamp); > + rdb_skb->skb_mstamp = xmit_skb->skb_mstamp; > + return tcp_transmit_skb(sk, rdb_skb, 0, gfp_mask); > + > +xmit_default: > + /* Transmit the unmodified SKB from output queue */ > + return tcp_transmit_skb(sk, xmit_skb, 1, gfp_mask); > +} > -- > 1.9.1 > since RDB will cause DSACKs, and we only blindly count DSACKs to perform CWND undo. How does RDB handle that false positives?
On Mon, 14 Mar 2016, Yuchung Cheng wrote:

> On Thu, Mar 3, 2016 at 10:06 AM, Bendik Rønning Opstad
> <bro.devel@gmail.com> wrote:
> >
> > Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
> > the latency for applications sending time-dependent data.
...
> > diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> > index 6a92b15..8f3f3bf 100644
> > --- a/Documentation/networking/ip-sysctl.txt
> > +++ b/Documentation/networking/ip-sysctl.txt
> > @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
> >	calculated, which is used to classify whether a stream is thin.
> >	Default: 10000
> >
> > +tcp_rdb - BOOLEAN
> > +	Enable RDB for all new TCP connections.
> Please describe RDB briefly, perhaps with a pointer to your paper.
> I suggest have three level of controls:
> 0: disable RDB completely
> 1: enable indiv. thin-stream conn. to use RDB via TCP_RDB socket
> options
> 2: enable RDB on all thin-stream conn. by default
>
> currently it only provides mode 1 and 2. but there may be cases where
> the administrator wants to disallow it (e.g., broken middle-boxes).
>
> > +	Default: 0

A per route setting to enable or disable tcp_rdb, overriding the global
setting, could also be useful to the administrator. Just a suggestion
for potential followup work.

-Bill
On 03/14/2016 02:15 PM, Eric Dumazet wrote:
> On Thu, 2016-03-03 at 19:06 +0100, Bendik Rønning Opstad wrote:
>> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
>> the latency for applications sending time-dependent data.
>>
>> Latency-sensitive applications or services, such as online games,
>> remote control systems, and VoIP, produce traffic with thin-stream
>> characteristics, characterized by small packets and relatively high
>> inter-transmission times (ITT). When experiencing packet loss, such
>> latency-sensitive applications are heavily penalized by the need to
>> retransmit lost packets, which increases the latency by a minimum of
>> one RTT for the lost packet. Packets coming after a lost packet are
>> held back due to head-of-line blocking, causing increased delays for
>> all data segments until the lost packet has been retransmitted.
>
> Acked-by: Eric Dumazet <edumazet@google.com>
>
> Note that RDB probably should get some SNMP counters,
> so that we get an idea of how many times a loss could be repaired.

And some idea of the duplication seen by receivers, assuming there
isn't already a counter for such a thing in Linux.

happy benchmarking,

rick jones

> Ideally, if the path happens to be lossless, all these proactive
> bundles are overhead. Might be useful to make RDB conditional on
> tp->total_retrans or something.
On Mon, Mar 14, 2016 at 6:04 PM, Rick Jones <rick.jones2@hpe.com> wrote: > > On 03/14/2016 02:15 PM, Eric Dumazet wrote: >> >> On Thu, 2016-03-03 at 19:06 +0100, Bendik Rønning Opstad wrote: >>> >>> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing >>> the latency for applications sending time-dependent data. >>> >>> Latency-sensitive applications or services, such as online games, >>> remote control systems, and VoIP, produce traffic with thin-stream >>> characteristics, characterized by small packets and relatively high >>> inter-transmission times (ITT). When experiencing packet loss, such >>> latency-sensitive applications are heavily penalized by the need to >>> retransmit lost packets, which increases the latency by a minimum of >>> one RTT for the lost packet. Packets coming after a lost packet are >>> held back due to head-of-line blocking, causing increased delays for >>> all data segments until the lost packet has been retransmitted. >> >> >> Acked-by: Eric Dumazet <edumazet@google.com> >> >> Note that RDB probably should get some SNMP counters, >> so that we get an idea of how many times a loss could be repaired. > > > And some idea of the duplication seen by receivers, assuming there isn't already a counter for such a thing in Linux. We sort of track that in the awkwardly named LINUX_MIB_DELAYEDACKLOST > > happy benchmarking, > > rick jones > > >> >> Ideally, if the path happens to be lossless, all these pro active >> bundles are overhead. Might be useful to make RDB conditional to >> tp->total_retrans or something. >> >> >
>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt >> index 6a92b15..8f3f3bf 100644 >> --- a/Documentation/networking/ip-sysctl.txt >> +++ b/Documentation/networking/ip-sysctl.txt >> @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER >> calculated, which is used to classify whether a stream is thin. >> Default: 10000 >> >> +tcp_rdb - BOOLEAN >> + Enable RDB for all new TCP connections. > Please describe RDB briefly, perhaps with a pointer to your paper. Ah, yes, that description may have been a bit too brief... What about pointing to tcp-thin.txt in the brief description, and rewrite tcp-thin.txt with a more detailed description of RDB along with a paper reference? > I suggest have three level of controls: > 0: disable RDB completely > 1: enable indiv. thin-stream conn. to use RDB via TCP_RDB socket > options > 2: enable RDB on all thin-stream conn. by default > > currently it only provides mode 1 and 2. but there may be cases where > the administrator wants to disallow it (e.g., broken middle-boxes). Good idea. Will change this. >> + Default: 0 >> + >> +tcp_rdb_max_bytes - INTEGER >> + Enable restriction on how many bytes an RDB packet can contain. >> + This is the total amount of payload including the new unsent data. >> + Default: 0 >> + >> +tcp_rdb_max_packets - INTEGER >> + Enable restriction on how many previous packets in the output queue >> + RDB may include data from. A value of 1 will restrict bundling to >> + only the data from the last packet that was sent. >> + Default: 1 > why two metrics on redundancy? We have primarily used the packet based limit in our tests. This is also the most important knob as it directly controls how many lost packets each RDB packet may recover. We believe that the byte based limit can also be useful because it allows more fine grained control on how much impact RDB can have on the increased bandwidth requirements of the flows. 
If an application writes 700 bytes per write call, the bandwidth increase can be quite significant (even with a 1 packet bundling limit) if we consider a scenario with thousands of RDB streams. In some of our experiments with many simultaneous thin streams, where we set up a bottleneck rate limited by a htb with pfifo queue, we observed considerable difference in loss rates depending on how many bytes (packets) were allowed to be bundled with each packet. This is partly why we recommend a default bundling limit of 1 packet. By limiting the total payload size of RDB packets to e.g. 100 bytes, only the smallest segments will benefit from RDB, while the segments that would increase the bandwidth requirements the most, will not. While a very large number of RDB streams from one sender may be a corner case, we still think this sysctl knob can be valuable for a sysadmin that finds himself in such a situation. > It also seems better to > allow individual socket to select the redundancy level (e.g., > setsockopt TCP_RDB=3 means <=3 pkts per bundle) vs a global setting. > This requires more bits in tcp_sock but 2-3 more is suffice. Most certainly. We decided not to implement this for the patch to keep it as simple as possible, however, we surely prefer to have this functionality included if possible. >> +static unsigned int rdb_detect_loss(struct sock *sk) >> +{ ... >> + tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) { >> + if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq)) >> + break; >> + packets_lost++; > since we only care if there is packet loss or not, we can return early here? Yes, I considered that, and as long as the number of packets presumed to be lost is not needed, that will suffice. However, could this not be useful for statistical purposes? This is also relevant to the comment from Eric on SNMP counters for how many times losses could be repaired by RDB? 
>> + } >> + break; >> + } >> + return packets_lost; >> +} >> + >> +/** >> + * tcp_rdb_ack_event() - initiate RDB loss detection >> + * @sk: socket >> + * @flags: flags >> + */ >> +void tcp_rdb_ack_event(struct sock *sk, u32 flags) > flags are not used Ah, yes, will remove that. >> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb, >> + unsigned int mss_now, gfp_t gfp_mask) >> +{ >> + struct sk_buff *rdb_skb = NULL; >> + struct sk_buff *first_to_bundle; >> + u32 bytes_in_rdb_skb = 0; >> + >> + /* How we detect that RDB was used. When equal, no RDB data was sent */ >> + TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq; > >> + >> + if (!tcp_stream_is_thin_dpifl(tcp_sk(sk))) > During loss recovery tcp inflight fluctuates and would like to trigger > this check even for non-thin-stream connections. Good point. > Since the loss > already occurs, RDB can only take advantage from limited-transmit, > which it likely does not have (b/c its a thin-stream). It might be > checking if the state is open. You mean to test for open state to avoid calling rdb_can_bundle_test() unnecessarily if we (presume to) know it cannot bundle anyway? That makes sense, however, I would like to do some tests on whether "state != open" is a good indicator on when bundling is not possible. >> + goto xmit_default; >> + >> + /* No bundling if first in queue, or on FIN packet */ >> + if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb) || >> + (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN)) > seems there are still benefit to bundle packets up to FIN? I was close to removing the FIN test, but decided to not remove it until I could verify that it will not cause any issues on some TCP receivers. If/(Since?) you are certain it will not cause any issues, I will remove it. > since RDB will cause DSACKs, and we only blindly count DSACKs to > perform CWND undo. How does RDB handle that false positives? That is a very good question. 
The simple answer is that the implementation does not handle any such false positives, which I expect can result in incorrectly undoing CWND reduction in some cases. This gets a bit complicated, so I'll have to do some more testing on this to verify with certainty when it happens. When there is no loss, and each RDB packet arriving at the receiver contains both already received and new data, the receiver will respond with an ACK that acknowledges new data (moves snd_una), with the SACK field populated with the already received sequence range (DSACK). The DSACKs in these incoming ACKs are not counted (tp->undo_retrans--) unless tp->undo_marker has been set by tcp_init_undo(), which is called by either tcp_enter_loss() or tcp_enter_recovery(). However, whenever a loss is detected by rdb_detect_loss(), tcp_enter_cwr() is called, which disables CWND undo. Therefore, I believe the incorrect counting of DSACKs from ACKs on RDB packets will only be a problem after the regular loss detection mechanisms (Fast Retransmit/RTO) have been triggered (i.e. we are in either TCP_CA_Recovery or TCP_CA_Loss). We have recorded the CWND values for both RDB and non-RDB streams in our experiments, and have not found any obvious red flags when analysing the results, so I presume (hope may be more precise) this is not a major issue we have missed. Nevertheless, I will investigate this in detail and get back to you. Thank you for the detailed comments. Bendik
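[As an aside, the ACK-walk that rdb_detect_loss() performs, discussed above, can be modelled in plain userspace C. This is a simplified sketch, not the kernel code: the "queue" is an array, SACK/delayed-ACK subtleties are ignored, and all names are illustrative.]

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Wrap-safe sequence comparisons, mirroring the kernel's before()/after() */
static int seq_before(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }
static int seq_after(uint32_t a, uint32_t b)  { return seq_before(b, a); }

struct pkt {
	uint32_t seq;           /* first byte of new data in this packet */
	uint32_t end_seq;       /* one past the last byte of new data */
	uint32_t rdb_start_seq; /* first byte of bundled (redundant) data */
};

/* Walk the un-ACKed "output queue" (oldest first), find the packet whose
 * end_seq matches snd_una, and count how many prior packets its bundled
 * data covered; those are presumed lost (reordering being the exception).
 */
static unsigned int detect_loss(const struct pkt *q, size_t n, uint32_t snd_una)
{
	unsigned int lost = 0;

	for (size_t i = 0; i < n; i++) {
		if (seq_after(q[i].end_seq, snd_una))
			break;			/* partial ACK: abort */
		if (q[i].end_seq != snd_una)
			continue;		/* not the ACKed packet */
		if (q[i].rdb_start_seq == q[i].seq)
			break;			/* no redundant data bundled */
		/* Count prior packets whose seq lies inside the bundled range */
		for (size_t j = i; j-- > 0; ) {
			if (seq_before(q[j].seq, q[i].rdb_start_seq))
				break;
			lost++;
		}
		break;
	}
	return lost;
}
```

With one prior segment bundled (rdb_start_seq reaching back one packet), an ACK for the RDB packet that also covers the un-ACKed prior packet yields a presumed loss count of 1; this is the quantity an SNMP counter could accumulate.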
On 14/03/16 22:15, Eric Dumazet wrote:
> Acked-by: Eric Dumazet <edumazet@google.com>
>
> Note that RDB probably should get some SNMP counters,
> so that we get an idea of how many times a loss could be repaired.

Good idea. Simply count how many times an RDB packet successfully repaired loss? Note that this can be one or more lost packets. When bundling N packets, the RDB packet can repair up to N losses in the previous N packets that were sent.

Which list should this be added to? snmp4_tcp_list? Any other counters that would be useful? Total number of RDB packets transmitted?

> Ideally, if the path happens to be lossless, all these pro active
> bundles are overhead. Might be useful to make RDB conditional to
> tp->total_retrans or something.

Yes, that is a good point. We have discussed this (for years really), but have not had the opportunity to investigate it in depth.

Having such a condition hard-coded is not ideal, as it depends very much on the use case whether bundling from the beginning is desirable. In most cases this is probably a fair compromise, but preferably we would have some logic/settings to control how the bundling rate can be dynamically adjusted in response to certain events, defined by a set of given metrics.

A conservative (default) setting would not do bundling until loss has been registered, and could also check against some smoothed loss indicator such that a certain amount of loss must have occurred within a specific time frame to allow bundling. This could be useful in cases where the network congestion varies greatly depending on factors such as the time of day. In a scenario where minimal application-layer latency is very important, but only sporadic (single) packet loss is expected to occur regularly, always bundling one previous packet may be both sufficient and desirable.

In the end, the best settings for an application/service depend on the degree to which application-layer latency (both the minimum and its variation) affects the QoE.
There are many possibilities to consider in this regard, and I expect we will not have this question fully explored any time soon. Most importantly, we should ensure that such logic can easily be added later on without breaking backwards compatibility. Suggestions and comments on this are very welcome. Bendik
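[To make the "smoothed loss indicator" idea above concrete, here is a minimal userspace sketch of a bundling gate driven by an EWMA of per-round loss events. This is purely illustrative of the kind of policy discussed; none of these names or thresholds exist in the patch.]

```c
#include <assert.h>

/* Hypothetical smoothed loss indicator: an EWMA of losses per round
 * (e.g. per RTT). Bundling is enabled only once the smoothed loss rate
 * crosses a threshold, and decays back below it on lossless rounds, so
 * a lossless path eventually stops paying the bundling overhead.
 */
struct rdb_gate {
	double loss_ewma;  /* smoothed losses per round */
	double alpha;      /* EWMA weight given to the newest sample */
	double threshold;  /* minimum smoothed loss rate to allow bundling */
};

/* Feed in the number of presumed losses observed in the last round */
static void rdb_gate_round(struct rdb_gate *g, unsigned int losses)
{
	g->loss_ewma = g->alpha * losses + (1.0 - g->alpha) * g->loss_ewma;
}

static int rdb_gate_allows_bundling(const struct rdb_gate *g)
{
	return g->loss_ewma >= g->threshold;
}
```

A connection starting with loss_ewma at 0 would not bundle at all until loss is registered, matching the conservative default sketched above; tuning alpha and threshold trades reaction speed against stability.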
On Thu, Mar 17, 2016 at 4:26 PM, Bendik Rønning Opstad <bro.devel@gmail.com> wrote: > > >> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt > >> index 6a92b15..8f3f3bf 100644 > >> --- a/Documentation/networking/ip-sysctl.txt > >> +++ b/Documentation/networking/ip-sysctl.txt > >> @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER > >> calculated, which is used to classify whether a stream is thin. > >> Default: 10000 > >> > >> +tcp_rdb - BOOLEAN > >> + Enable RDB for all new TCP connections. > > Please describe RDB briefly, perhaps with a pointer to your paper. > > Ah, yes, that description may have been a bit too brief... > > What about pointing to tcp-thin.txt in the brief description, and > rewrite tcp-thin.txt with a more detailed description of RDB along > with a paper reference? +1 > > > I suggest have three level of controls: > > 0: disable RDB completely > > 1: enable indiv. thin-stream conn. to use RDB via TCP_RDB socket > > options > > 2: enable RDB on all thin-stream conn. by default > > > > currently it only provides mode 1 and 2. but there may be cases where > > the administrator wants to disallow it (e.g., broken middle-boxes). > > Good idea. Will change this. > > >> + Default: 0 > >> + > >> +tcp_rdb_max_bytes - INTEGER > >> + Enable restriction on how many bytes an RDB packet can contain. > >> + This is the total amount of payload including the new unsent data. > >> + Default: 0 > >> + > >> +tcp_rdb_max_packets - INTEGER > >> + Enable restriction on how many previous packets in the output queue > >> + RDB may include data from. A value of 1 will restrict bundling to > >> + only the data from the last packet that was sent. > >> + Default: 1 > > why two metrics on redundancy? > > We have primarily used the packet based limit in our tests. This is > also the most important knob as it directly controls how many lost > packets each RDB packet may recover. 
> > We believe that the byte based limit can also be useful because it > allows more fine grained control on how much impact RDB can have on > the increased bandwidth requirements of the flows. If an application > writes 700 bytes per write call, the bandwidth increase can be quite > significant (even with a 1 packet bundling limit) if we consider a > scenario with thousands of RDB streams. > > In some of our experiments with many simultaneous thin streams, where > we set up a bottleneck rate limited by a htb with pfifo queue, we > observed considerable difference in loss rates depending on how many > bytes (packets) were allowed to be bundled with each packet. This is > partly why we recommend a default bundling limit of 1 packet. > > By limiting the total payload size of RDB packets to e.g. 100 bytes, > only the smallest segments will benefit from RDB, while the segments > that would increase the bandwidth requirements the most, will not. > > While a very large number of RDB streams from one sender may be a > corner case, we still think this sysctl knob can be valuable for a > sysadmin that finds himself in such a situation. These nice comments would be useful in the sysctl descriptions. > > > It also seems better to > > allow individual socket to select the redundancy level (e.g., > > setsockopt TCP_RDB=3 means <=3 pkts per bundle) vs a global setting. > > This requires more bits in tcp_sock but 2-3 more is suffice. > > Most certainly. We decided not to implement this for the patch to keep > it as simple as possible, however, we surely prefer to have this > functionality included if possible. > > >> +static unsigned int rdb_detect_loss(struct sock *sk) > >> +{ > ... > >> + tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) { > >> + if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq)) > >> + break; > >> + packets_lost++; > > since we only care if there is packet loss or not, we can return early here? 
> > Yes, I considered that, and as long as the number of packets presumed > to be lost is not needed, that will suffice. However, could this not > be useful for statistical purposes? > > This is also relevant to the comment from Eric on SNMP counters for > how many times losses could be repaired by RDB? > > >> + } > >> + break; > >> + } > >> + return packets_lost; > >> +} > >> + > >> +/** > >> + * tcp_rdb_ack_event() - initiate RDB loss detection > >> + * @sk: socket > >> + * @flags: flags > >> + */ > >> +void tcp_rdb_ack_event(struct sock *sk, u32 flags) > > flags are not used > > Ah, yes, will remove that. > > >> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb, > >> + unsigned int mss_now, gfp_t gfp_mask) > >> +{ > >> + struct sk_buff *rdb_skb = NULL; > >> + struct sk_buff *first_to_bundle; > >> + u32 bytes_in_rdb_skb = 0; > >> + > >> + /* How we detect that RDB was used. When equal, no RDB data was sent */ > >> + TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq; > > > >> + > >> + if (!tcp_stream_is_thin_dpifl(tcp_sk(sk))) > > During loss recovery tcp inflight fluctuates and would like to trigger > > this check even for non-thin-stream connections. > > Good point. > > > Since the loss > > already occurs, RDB can only take advantage from limited-transmit, > > which it likely does not have (b/c its a thin-stream). It might be > > checking if the state is open. > > You mean to test for open state to avoid calling rdb_can_bundle_test() > unnecessarily if we (presume to) know it cannot bundle anyway? That > makes sense, however, I would like to do some tests on whether "state > != open" is a good indicator on when bundling is not possible. > > >> + goto xmit_default; > >> + > >> + /* No bundling if first in queue, or on FIN packet */ > >> + if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb) || > >> + (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN)) > > seems there are still benefit to bundle packets up to FIN? 
> > I was close to removing the FIN test, but decided to not remove it > until I could verify that it will not cause any issues on some TCP > receivers. If/(Since?) you are certain it will not cause any issues, I > will remove it. > > > since RDB will cause DSACKs, and we only blindly count DSACKs to > > perform CWND undo. How does RDB handle that false positives? > > That is a very good question. The simple answer is that the > implementation does not handle any such false positives, which I > expect can result in incorrectly undoing CWND reduction in some cases. > This gets a bit complicated, so I'll have to do some more testing on > this to verify with certainty when it happens. > > When there is no loss, and each RDB packet arriving at the receiver > contains both already received and new data, the receiver will respond > with an ACK that acknowledges new data (moves snd_una), with the SACK > field populated with the already received sequence range (DSACK). > > The DSACKs in these incoming ACKs are not counted (tp->undo_retrans--) > unless tp->undo_marker has been set by tcp_init_undo(), which is > called by either tcp_enter_loss() or tcp_enter_recovery(). However, > whenever a loss is detected by rdb_detect_loss(), tcp_enter_cwr() is > called, which disables CWND undo. Therefore, I believe the incorrect thanks for the clarification. it might worth a short comment on why we use tcp_enter_cwr() (to disable undo) > counting of DSACKs from ACKs on RDB packets will only be a problem > after the regular loss detection mechanisms (Fast Retransmit/RTO) have > been triggered (i.e. we are in either TCP_CA_Recovery or TCP_CA_Loss). > > We have recorded the CWND values for both RDB and non-RDB streams in > our experiments, and have not found any obvious red flags when > analysing the results, so I presume (hope may be more precise) this is > not a major issue we have missed. Nevertheless, I will investigate > this in detail and get back to you. 
> > > Thank you for the detailed comments. > > Bendik >
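[The interplay of the two redundancy knobs discussed above — tcp_rdb_max_packets and tcp_rdb_max_bytes — can be sketched as a simplified userspace model. This is not the patch code; the function and its newest-first walk are illustrative, with 0 meaning "no limit" as in the sysctl defaults.]

```c
#include <assert.h>
#include <stddef.h>

/* Decide how many previous segments an RDB packet may bundle:
 * max_packets caps the number of prior packets, max_bytes caps the
 * total payload (redundant + new), and the result must still fit in
 * one MSS. prev_lens[0] is the most recently sent prior segment.
 */
static unsigned int rdb_bundle_count(const unsigned int *prev_lens,
				     size_t n_prev, unsigned int new_len,
				     unsigned int mss,
				     unsigned int max_packets,
				     unsigned int max_bytes)
{
	unsigned int total = new_len, count = 0;

	for (size_t i = 0; i < n_prev; i++) {
		if (max_packets && count >= max_packets)
			break;
		if (total + prev_lens[i] > mss)
			break;
		if (max_bytes && total + prev_lens[i] > max_bytes)
			break;
		total += prev_lens[i];
		count++;
	}
	return count;
}
```

This shows why both knobs are useful: a packet limit of 1 caps the number of losses each packet can repair, while a byte limit like 150 lets only the smallest segments bundle at all, and with 700-byte writes the MSS alone already restricts bundling to one prior segment.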
On 21/03/16 19:54, Yuchung Cheng wrote: > On Thu, Mar 17, 2016 at 4:26 PM, Bendik Rønning Opstad > <bro.devel@gmail.com> wrote: >> >>>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt >>>> index 6a92b15..8f3f3bf 100644 >>>> --- a/Documentation/networking/ip-sysctl.txt >>>> +++ b/Documentation/networking/ip-sysctl.txt >>>> @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER >>>> calculated, which is used to classify whether a stream is thin. >>>> Default: 10000 >>>> >>>> +tcp_rdb - BOOLEAN >>>> + Enable RDB for all new TCP connections. >>> Please describe RDB briefly, perhaps with a pointer to your paper. >> >> Ah, yes, that description may have been a bit too brief... >> >> What about pointing to tcp-thin.txt in the brief description, and >> rewrite tcp-thin.txt with a more detailed description of RDB along >> with a paper reference? > +1 >> >>> I suggest have three level of controls: >>> 0: disable RDB completely >>> 1: enable indiv. thin-stream conn. to use RDB via TCP_RDB socket >>> options >>> 2: enable RDB on all thin-stream conn. by default >>> >>> currently it only provides mode 1 and 2. but there may be cases where >>> the administrator wants to disallow it (e.g., broken middle-boxes). >> >> Good idea. Will change this. I have implemented your suggestion in the next patch. >>> It also seems better to >>> allow individual socket to select the redundancy level (e.g., >>> setsockopt TCP_RDB=3 means <=3 pkts per bundle) vs a global setting. >>> This requires more bits in tcp_sock but 2-3 more is suffice. >> >> Most certainly. We decided not to implement this for the patch to keep >> it as simple as possible, however, we surely prefer to have this >> functionality included if possible. Next patch version has a socket option to allow modifying the different RDB settings. 
>>>> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb, >>>> + unsigned int mss_now, gfp_t gfp_mask) >>>> +{ >>>> + struct sk_buff *rdb_skb = NULL; >>>> + struct sk_buff *first_to_bundle; >>>> + u32 bytes_in_rdb_skb = 0; >>>> + >>>> + /* How we detect that RDB was used. When equal, no RDB data was sent */ >>>> + TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq; >>> >>>> + >>>> + if (!tcp_stream_is_thin_dpifl(tcp_sk(sk))) >>> During loss recovery tcp inflight fluctuates and would like to trigger >>> this check even for non-thin-stream connections. >> >> Good point. >> >>> Since the loss >>> already occurs, RDB can only take advantage from limited-transmit, >>> which it likely does not have (b/c its a thin-stream). It might be >>> checking if the state is open. >> >> You mean to test for open state to avoid calling rdb_can_bundle_test() >> unnecessarily if we (presume to) know it cannot bundle anyway? That >> makes sense, however, I would like to do some tests on whether "state >> != open" is a good indicator on when bundling is not possible. When testing this I found that bundling can often be performed when not in Open state. For the most part in CWR mode, but also the other modes, so this does not seem like a good indicator. The only problem with tcp_stream_is_thin_dpifl() triggering for non-thin streams in loss recovery would be the performance penalty of calling rdb_can_bundle_test(). It would not be able to bundle anyways since the previous SKB would contain >= mss worth of data. The most reliable test is to check available space in the previous SKB, i.e. if (xmit_skb->prev->len == mss_now). Do you suggest, for performance reasons, to do this before the call to tcp_stream_is_thin_dpifl()? >>> since RDB will cause DSACKs, and we only blindly count DSACKs to >>> perform CWND undo. How does RDB handle that false positives? >> >> That is a very good question. 
The simple answer is that the >> implementation does not handle any such false positives, which I >> expect can result in incorrectly undoing CWND reduction in some cases. >> This gets a bit complicated, so I'll have to do some more testing on >> this to verify with certainty when it happens. >> >> When there is no loss, and each RDB packet arriving at the receiver >> contains both already received and new data, the receiver will respond >> with an ACK that acknowledges new data (moves snd_una), with the SACK >> field populated with the already received sequence range (DSACK). >> >> The DSACKs in these incoming ACKs are not counted (tp->undo_retrans--) >> unless tp->undo_marker has been set by tcp_init_undo(), which is >> called by either tcp_enter_loss() or tcp_enter_recovery(). However, >> whenever a loss is detected by rdb_detect_loss(), tcp_enter_cwr() is >> called, which disables CWND undo. Therefore, I believe the incorrect > thanks for the clarification. it might worth a short comment on why we > use tcp_enter_cwr() (to disable undo) > > >> counting of DSACKs from ACKs on RDB packets will only be a problem >> after the regular loss detection mechanisms (Fast Retransmit/RTO) have >> been triggered (i.e. we are in either TCP_CA_Recovery or TCP_CA_Loss). >> >> We have recorded the CWND values for both RDB and non-RDB streams in >> our experiments, and have not found any obvious red flags when >> analysing the results, so I presume (hope may be more precise) this is >> not a major issue we have missed. Nevertheless, I will investigate >> this in detail and get back to you. I've looked into this and tried to figure out in which cases this is actually a problem, but I have failed to find any. One scenario I considered is when an RDB packet is sent right after a retransmit, which would result in DSACK in the ACK in response to the RDB packet. With a bundling limit of 1 packet, two packets must be lost for RDB to fail to repair the loss, causing dupACKs. 
So if three packets are sent, where the first two are lost, the last packet will cause a dupACK, resulting in a fast retransmit (and entering recovery which calls tcp_init_undo()). By writing new data to the socket right after the fast retransmit, a new RDB packet is built with some old data that was just retransmitted. On the ACK on the fast retransmit the state is changed from Recovery to Open. The next incoming ACK (on the RDB packet) will contain a DSACK range, but it will not be considered dubious (tcp_ack_is_dubious()) since "!(flag & FLAG_NOT_DUP)" is false (new data was acked), state is Open, and "flag & FLAG_CA_ALERT" evaluates to false. Feel free to suggest scenarios (as detailed as possible) with the potential to cause such false positives, and I'll test them with packetdrill. Bendik
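[The "not considered dubious" condition at the heart of the scenario above can be modelled in a few lines of userspace C. This is a simplified sketch of the tcp_ack_is_dubious()-style check, not the kernel code: FLAG_NOT_DUP is reduced to a single "new data acked" bit, and the flag values are arbitrary.]

```c
#include <assert.h>

/* An ACK is treated as "dubious" (worth deeper inspection) if it does
 * not acknowledge new data, if it carries congestion-relevant signals,
 * or if the sender is not in the Open state.
 */
#define FLAG_DATA_ACKED 0x01 /* new data acked (stand-in for FLAG_NOT_DUP) */
#define FLAG_CA_ALERT   0x02 /* e.g. ECE or a SACK-state change */

enum ca_state { CA_OPEN, CA_CWR, CA_RECOVERY, CA_LOSS };

static int ack_is_dubious(unsigned int flag, enum ca_state state)
{
	return !(flag & FLAG_DATA_ACKED) ||
	       (flag & FLAG_CA_ALERT) ||
	       state != CA_OPEN;
}
```

In the scenario above, the ACK on the RDB packet acks new data, carries no alert flags, and arrives in Open state, so it is not flagged as dubious despite its DSACK range; the same ACK arriving during Recovery would be.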
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 6a92b15..8f3f3bf 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER calculated, which is used to classify whether a stream is thin. Default: 10000 +tcp_rdb - BOOLEAN + Enable RDB for all new TCP connections. + Default: 0 + +tcp_rdb_max_bytes - INTEGER + Enable restriction on how many bytes an RDB packet can contain. + This is the total amount of payload including the new unsent data. + Default: 0 + +tcp_rdb_max_packets - INTEGER + Enable restriction on how many previous packets in the output queue + RDB may include data from. A value of 1 will restrict bundling to + only the data from the last packet that was sent. + Default: 1 + tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. TCP bulk sender tends to increase packets in flight until it diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 797cefb..0f2c9d1 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2927,6 +2927,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); void skb_free_datagram(struct sock *sk, struct sk_buff *skb); void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb); int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags); +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old); int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len); int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len); __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to, diff --git a/include/linux/tcp.h b/include/linux/tcp.h index bcbf51d..c84de15 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -207,9 +207,10 @@ struct tcp_sock { } rack; u16 advmss; /* Advertised MSS */ u8 unused; - u8 nonagle : 4,/* Disable 
Nagle algorithm? */ + u8 nonagle : 3,/* Disable Nagle algorithm? */ thin_lto : 1,/* Use linear timeouts for thin streams */ thin_dupack : 1,/* Fast retransmit on first dupack */ + rdb : 1,/* Redundant Data Bundling enabled */ repair : 1, frto : 1;/* F-RTO (RFC5682) activated in CA_Loss */ u8 repair_queue; diff --git a/include/net/tcp.h b/include/net/tcp.h index d38eae9..2d42f4a 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -267,6 +267,9 @@ extern int sysctl_tcp_slow_start_after_idle; extern int sysctl_tcp_thin_linear_timeouts; extern int sysctl_tcp_thin_dupack; extern int sysctl_tcp_thin_dpifl_itt_lower_bound; +extern int sysctl_tcp_rdb; +extern int sysctl_tcp_rdb_max_bytes; +extern int sysctl_tcp_rdb_max_packets; extern int sysctl_tcp_early_retrans; extern int sysctl_tcp_limit_output_bytes; extern int sysctl_tcp_challenge_ack_limit; @@ -539,6 +542,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss, bool tcp_may_send_now(struct sock *sk); int __tcp_retransmit_skb(struct sock *, struct sk_buff *); int tcp_retransmit_skb(struct sock *, struct sk_buff *); +int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, + gfp_t gfp_mask); void tcp_retransmit_timer(struct sock *sk); void tcp_xmit_retransmit_queue(struct sock *); void tcp_simple_retransmit(struct sock *); @@ -556,6 +561,7 @@ void tcp_send_ack(struct sock *sk); void tcp_send_delayed_ack(struct sock *sk); void tcp_send_loss_probe(struct sock *sk); bool tcp_schedule_loss_probe(struct sock *sk); +void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb); /* tcp_input.c */ void tcp_resume_early_retransmit(struct sock *sk); @@ -565,6 +571,11 @@ void tcp_reset(struct sock *sk); void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb); void tcp_fin(struct sock *sk); +/* tcp_rdb.c */ +void tcp_rdb_ack_event(struct sock *sk, u32 flags); +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb, + unsigned int mss_now, 
gfp_t gfp_mask); + /* tcp_timer.c */ void tcp_init_xmit_timers(struct sock *); static inline void tcp_clear_xmit_timers(struct sock *sk) @@ -763,6 +774,7 @@ struct tcp_skb_cb { union { struct { /* There is space for up to 20 bytes */ + __u32 rdb_start_seq; /* Start seq of rdb data */ } tx; /* only used for outgoing skbs */ union { struct inet_skb_parm h4; @@ -1497,6 +1509,9 @@ static inline struct sk_buff *tcp_write_queue_prev(const struct sock *sk, #define tcp_for_write_queue_from_safe(skb, tmp, sk) \ skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp) +#define tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) \ + skb_queue_reverse_walk_from_safe(&(sk)->sk_write_queue, skb, tmp) + static inline struct sk_buff *tcp_send_head(const struct sock *sk) { return sk->sk_send_head; diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index fe95446..6799875 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -115,6 +115,7 @@ enum { #define TCP_CC_INFO 26 /* Get Congestion Control (optional) info */ #define TCP_SAVE_SYN 27 /* Record SYN headers for new connections */ #define TCP_SAVED_SYN 28 /* Get SYN headers recorded for connection */ +#define TCP_RDB 29 /* Enable Redundant Data Bundling mechanism */ struct tcp_repair_opt { __u32 opt_code; diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 7af7ec6..50bc5b0 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1055,7 +1055,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off) skb->inner_mac_header += off; } -static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old) +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old) { __copy_skb_header(new, old); diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile index bfa1336..459048c 100644 --- a/net/ipv4/Makefile +++ b/net/ipv4/Makefile @@ -12,7 +12,8 @@ obj-y := route.o inetpeer.o protocol.o \ tcp_offload.o datagram.o raw.o udp.o udplite.o \ udp_offload.o arp.o icmp.o devinet.o 
af_inet.o igmp.o \ fib_frontend.o fib_semantics.o fib_trie.o \ - inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o + inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \ + tcp_rdb.o obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index f04320a..43b4390 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -581,6 +581,31 @@ static struct ctl_table ipv4_table[] = { .extra1 = &tcp_thin_dpifl_itt_lower_bound_min, }, { + .procname = "tcp_rdb", + .data = &sysctl_tcp_rdb, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &one, + }, + { + .procname = "tcp_rdb_max_bytes", + .data = &sysctl_tcp_rdb_max_bytes, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + }, + { + .procname = "tcp_rdb_max_packets", + .data = &sysctl_tcp_rdb_max_packets, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + }, + { .procname = "tcp_early_retrans", .data = &sysctl_tcp_early_retrans, .maxlen = sizeof(int), diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 8421f3d..b53d4cb 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -288,6 +288,8 @@ int sysctl_tcp_autocorking __read_mostly = 1; int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN; +int sysctl_tcp_rdb __read_mostly; + struct percpu_counter tcp_orphan_count; EXPORT_SYMBOL_GPL(tcp_orphan_count); @@ -407,6 +409,7 @@ void tcp_init_sock(struct sock *sk) u64_stats_init(&tp->syncp); tp->reordering = sock_net(sk)->ipv4.sysctl_tcp_reordering; + tp->rdb = sysctl_tcp_rdb; tcp_enable_early_retrans(tp); tcp_assign_congestion_control(sk); @@ -2412,6 +2415,13 @@ static int do_tcp_setsockopt(struct sock *sk, int level, } break; + case TCP_RDB: + if (val < 0 || val > 1) + err = -EINVAL; + else + tp->rdb = val; + break; + 
case TCP_REPAIR: if (!tcp_can_repair_sock(sk)) err = -EPERM; @@ -2842,7 +2852,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level, case TCP_THIN_DUPACK: val = tp->thin_dupack; break; - + case TCP_RDB: + val = tp->rdb; + break; case TCP_REPAIR: val = tp->repair; break; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index e6e65f7..7b52ce4 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3537,6 +3537,9 @@ static inline void tcp_in_ack_event(struct sock *sk, u32 flags) if (icsk->icsk_ca_ops->in_ack_event) icsk->icsk_ca_ops->in_ack_event(sk, flags); + + if (unlikely(tcp_sk(sk)->rdb)) + tcp_rdb_ack_event(sk, flags); } /* Congestion control has updated the cwnd already. So if we're in diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 7d2c7a4..6f92fae 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -897,8 +897,8 @@ out: * We are working here with either a clone of the original * SKB, or a fresh unique copy made by the retransmit engine. */ -static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, - gfp_t gfp_mask) +int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, + gfp_t gfp_mask) { const struct inet_connection_sock *icsk = inet_csk(sk); struct inet_sock *inet; @@ -2110,9 +2110,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, break; } - if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) + if (unlikely(tcp_sk(sk)->rdb)) { + if (tcp_transmit_rdb_skb(sk, skb, mss_now, gfp)) + break; + } else if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) { break; - + } repair: /* Advance the send_head. This one is sent out. * This call will increment packets_out. 
@@ -2439,15 +2442,32 @@ u32 __tcp_select_window(struct sock *sk)
 	return window;
 }
 
+/**
+ * tcp_skb_append_data() - copy the linear data from an SKB to the end
+ *                         of another and update end sequence number
+ *                         and checksum
+ * @from_skb: the SKB to copy data from
+ * @to_skb: the SKB to copy data to
+ */
+void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb)
+{
+	skb_copy_from_linear_data(from_skb, skb_put(to_skb, from_skb->len),
+				  from_skb->len);
+	TCP_SKB_CB(to_skb)->end_seq = TCP_SKB_CB(from_skb)->end_seq;
+
+	if (from_skb->ip_summed == CHECKSUM_PARTIAL)
+		to_skb->ip_summed = CHECKSUM_PARTIAL;
+
+	if (to_skb->ip_summed != CHECKSUM_PARTIAL)
+		to_skb->csum = csum_block_add(to_skb->csum, from_skb->csum,
+					      to_skb->len);
+}
+
 /* Collapses two adjacent SKB's during retransmission. */
 static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *next_skb = tcp_write_queue_next(sk, skb);
-	int skb_size, next_skb_size;
-
-	skb_size = skb->len;
-	next_skb_size = next_skb->len;
 
 	BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);
 
@@ -2455,17 +2475,7 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 
 	tcp_unlink_write_queue(next_skb, sk);
 
-	skb_copy_from_linear_data(next_skb, skb_put(skb, next_skb_size),
-				  next_skb_size);
-
-	if (next_skb->ip_summed == CHECKSUM_PARTIAL)
-		skb->ip_summed = CHECKSUM_PARTIAL;
-
-	if (skb->ip_summed != CHECKSUM_PARTIAL)
-		skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size);
-
-	/* Update sequence range on original skb. */
-	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq;
+	tcp_skb_append_data(next_skb, skb);
 
 	/* Merge over control information. This moves PSH/FIN etc. over */
 	TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags;
diff --git a/net/ipv4/tcp_rdb.c b/net/ipv4/tcp_rdb.c
new file mode 100644
index 0000000..2b37957
--- /dev/null
+++ b/net/ipv4/tcp_rdb.c
@@ -0,0 +1,228 @@
+#include <linux/skbuff.h>
+#include <net/tcp.h>
+
+int sysctl_tcp_rdb_max_bytes __read_mostly;
+int sysctl_tcp_rdb_max_packets __read_mostly = 1;
+
+/**
+ * rdb_detect_loss() - perform RDB loss detection by analysing ACKs
+ * @sk: socket
+ *
+ * Traverse the output queue and check if the ACKed packet is an RDB
+ * packet and if the redundant data covers one or more un-ACKed SKBs.
+ * If the incoming ACK acknowledges multiple SKBs, we can presume
+ * packet loss has occurred.
+ *
+ * We can infer packet loss this way because we can expect one ACK per
+ * transmitted data packet, as delayed ACKs are disabled when a host
+ * receives packets where the sequence number is not the expected
+ * sequence number.
+ *
+ * Return: The number of packets that are presumed to be lost
+ */
+static unsigned int rdb_detect_loss(struct sock *sk)
+{
+	struct sk_buff *skb, *tmp;
+	struct tcp_skb_cb *scb;
+	u32 seq_acked = tcp_sk(sk)->snd_una;
+	unsigned int packets_lost = 0;
+
+	tcp_for_write_queue(skb, sk) {
+		if (skb == tcp_send_head(sk))
+			break;
+
+		scb = TCP_SKB_CB(skb);
+		/* The ACK acknowledges parts of the data in this SKB.
+		 * Can be caused by:
+		 * - TSO: We abort as RDB is not used on SKBs split across
+		 *   multiple packets on lower layers as these are greater
+		 *   than one MSS.
+		 * - Retrans collapse: We've had a retrans, so loss has already
+		 *   been detected.
+		 */
+		if (after(scb->end_seq, seq_acked))
+			break;
+		else if (scb->end_seq != seq_acked)
+			continue;
+
+		/* We have found the ACKed packet */
+
+		/* This packet was sent with no redundant data, or no prior
+		 * un-ACKed SKBs are in the output queue, so break here.
+		 */
+		if (scb->tx.rdb_start_seq == scb->seq ||
+		    skb_queue_is_first(&sk->sk_write_queue, skb))
+			break;
+		/* Find number of prior SKBs whose data was bundled in this
+		 * (ACKed) SKB. We presume any redundant data covering previous
+		 * SKBs is due to loss. (An exception would be reordering.)
+		 */
+		skb = skb->prev;
+		tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+			if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
+				break;
+			packets_lost++;
+		}
+		break;
+	}
+	return packets_lost;
+}
+
+/**
+ * tcp_rdb_ack_event() - initiate RDB loss detection
+ * @sk: socket
+ * @flags: flags
+ */
+void tcp_rdb_ack_event(struct sock *sk, u32 flags)
+{
+	if (rdb_detect_loss(sk))
+		tcp_enter_cwr(sk);
+}
+
+/**
+ * rdb_build_skb() - build a new RDB SKB and copy redundant + unsent
+ *                   data to the linear page buffer
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission in the output engine
+ * @first_skb: the first SKB in the output queue to be bundled
+ * @bytes_in_rdb_skb: the total number of data bytes for the new
+ *                    rdb_skb (NEW + Redundant)
+ * @gfp_mask: gfp_t allocation
+ *
+ * Return: A new SKB containing redundant data, or NULL if memory
+ *         allocation failed
+ */
+static struct sk_buff *rdb_build_skb(const struct sock *sk,
+				     struct sk_buff *xmit_skb,
+				     struct sk_buff *first_skb,
+				     u32 bytes_in_rdb_skb,
+				     gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb, *tmp_skb = first_skb;
+
+	rdb_skb = sk_stream_alloc_skb((struct sock *)sk,
+				      (int)bytes_in_rdb_skb,
+				      gfp_mask, false);
+	if (!rdb_skb)
+		return NULL;
+	copy_skb_header(rdb_skb, xmit_skb);
+	rdb_skb->ip_summed = xmit_skb->ip_summed;
+	TCP_SKB_CB(rdb_skb)->seq = TCP_SKB_CB(first_skb)->seq;
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(rdb_skb)->seq;
+
+	/* Start on first_skb and append payload from each SKB in the output
+	 * queue onto rdb_skb until we reach xmit_skb.
+	 */
+	tcp_for_write_queue_from(tmp_skb, sk) {
+		tcp_skb_append_data(tmp_skb, rdb_skb);
+
+		/* We reached xmit_skb, containing the unsent data */
+		if (tmp_skb == xmit_skb)
+			break;
+	}
+	return rdb_skb;
+}
+
+/**
+ * rdb_can_bundle_test() - test if redundant data can be bundled
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission by the output engine
+ * @max_payload: the maximum allowed payload bytes for the RDB SKB
+ * @bytes_in_rdb_skb: store the total number of payload bytes in the
+ *                    RDB SKB if bundling can be performed
+ *
+ * Traverse the output queue and check if any un-ACKed data may be
+ * bundled.
+ *
+ * Return: The first SKB to be in the bundle, or NULL if no bundling
+ */
+static struct sk_buff *rdb_can_bundle_test(const struct sock *sk,
+					   struct sk_buff *xmit_skb,
+					   unsigned int max_payload,
+					   u32 *bytes_in_rdb_skb)
+{
+	struct sk_buff *first_to_bundle = NULL;
+	struct sk_buff *tmp, *skb = xmit_skb->prev;
+	u32 skbs_in_bundle_count = 1; /* Start on 1 to account for xmit_skb */
+	u32 total_payload = xmit_skb->len;
+
+	if (sysctl_tcp_rdb_max_bytes)
+		max_payload = min_t(unsigned int, max_payload,
+				    sysctl_tcp_rdb_max_bytes);
+
+	/* We start at xmit_skb->prev, and go backwards */
+	tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
+		/* Including data from this SKB would exceed payload limit */
+		if ((total_payload + skb->len) > max_payload)
+			break;
+
+		if (sysctl_tcp_rdb_max_packets &&
+		    (skbs_in_bundle_count > sysctl_tcp_rdb_max_packets))
+			break;
+
+		total_payload += skb->len;
+		skbs_in_bundle_count++;
+		first_to_bundle = skb;
+	}
+	*bytes_in_rdb_skb = total_payload;
+	return first_to_bundle;
+}
+
+/**
+ * tcp_transmit_rdb_skb() - try to create and send an RDB packet
+ * @sk: socket
+ * @xmit_skb: the SKB processed for transmission by the output engine
+ * @mss_now: current mss value
+ * @gfp_mask: gfp_t allocation
+ *
+ * If an RDB packet could not be created and sent, transmit the
+ * original unmodified SKB (xmit_skb).
+ *
+ * Return: 0 if successfully sent packet, else error from
+ *         tcp_transmit_skb
+ */
+int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
+			 unsigned int mss_now, gfp_t gfp_mask)
+{
+	struct sk_buff *rdb_skb = NULL;
+	struct sk_buff *first_to_bundle;
+	u32 bytes_in_rdb_skb = 0;
+
+	/* How we detect that RDB was used. When equal, no RDB data was sent */
+	TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
+
+	if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)))
+		goto xmit_default;
+
+	/* No bundling if first in queue, or on FIN packet */
+	if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb) ||
+	    (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN))
+		goto xmit_default;
+
+	/* Find number of (previous) SKBs to get data from */
+	first_to_bundle = rdb_can_bundle_test(sk, xmit_skb, mss_now,
+					      &bytes_in_rdb_skb);
+	if (!first_to_bundle)
+		goto xmit_default;
+
+	/* Create an SKB that contains redundant data starting from
+	 * first_to_bundle.
+	 */
+	rdb_skb = rdb_build_skb(sk, xmit_skb, first_to_bundle,
+				bytes_in_rdb_skb, gfp_mask);
+	if (!rdb_skb)
+		goto xmit_default;
+
+	/* Set skb_mstamp for the SKB in the output queue (xmit_skb) containing
+	 * the yet unsent data. Normally this would be done by
+	 * tcp_transmit_skb(), but as we pass in rdb_skb instead, xmit_skb's
+	 * timestamp will not be touched.
+	 */
+	skb_mstamp_get(&xmit_skb->skb_mstamp);
+	rdb_skb->skb_mstamp = xmit_skb->skb_mstamp;
+	return tcp_transmit_skb(sk, rdb_skb, 0, gfp_mask);
+
+xmit_default:
+	/* Transmit the unmodified SKB from output queue */
+	return tcp_transmit_skb(sk, xmit_skb, 1, gfp_mask);
+}
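The counting step in rdb_detect_loss() above can be illustrated outside the kernel. The sketch below is plain userspace C (not part of the patch): it models the prior un-ACKed segments as an array of start sequence numbers and counts how many of them are covered by the redundant data of an exactly-ACKed RDB packet, using a plain `<` comparison instead of the kernel's wrap-safe before().

```c
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Simplified model of rdb_detect_loss(): an ACK that exactly covers an
 * RDB packet whose bundled data started at rdb_start_seq also covers
 * every prior un-ACKed segment whose start sequence number lies at or
 * after rdb_start_seq. Those segments never drew an ACK of their own,
 * so they are presumed lost (reordering being the exception).
 *
 * prev_seqs[] holds the start sequence numbers of the prior un-ACKed
 * segments, oldest first. Simplification: no sequence wrap-around.
 */
size_t rdb_presumed_lost(const uint32_t *prev_seqs, size_t n_prev,
			 uint32_t rdb_start_seq)
{
	size_t lost = 0;

	/* Walk newest to oldest, as the kernel walks skb->prev */
	for (size_t i = n_prev; i-- > 0; ) {
		if (prev_seqs[i] < rdb_start_seq)
			break;
		lost++;
	}
	return lost;
}
```

In the patch, any non-zero count makes tcp_rdb_ack_event() call tcp_enter_cwr(), so congestion control still reacts even though the repaired loss produces no dupACKs.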
Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency for applications sending time-dependent data.

Latency-sensitive applications or services, such as online games,
remote control systems, and VoIP, produce traffic with thin-stream
characteristics, characterized by small packets and relatively high
inter-transmission times (ITT). When experiencing packet loss, such
latency-sensitive applications are heavily penalized by the need to
retransmit lost packets, which increases the latency by a minimum of
one RTT for the lost packet. Packets coming after a lost packet are
held back due to head-of-line blocking, causing increased delays for
all data segments until the lost packet has been retransmitted.

RDB enables a TCP sender to bundle redundant (already sent) data with
TCP packets containing small segments of new data. By resending
un-ACKed data from the output queue in packets with new data, RDB
reduces the need to retransmit data segments on connections
experiencing sporadic packet loss. By avoiding a retransmit, RDB
evades the latency increase of at least one RTT for the lost packet,
as well as alleviating head-of-line blocking for the packets following
the lost packet. This makes the TCP connection more resistant to
latency fluctuations, and reduces the application layer latency
significantly in lossy environments.

Main functionality added:

  o When a packet is scheduled for transmission, RDB builds and
    transmits a new SKB containing both the unsent data as well as
    data of previously sent packets from the TCP output queue.

  o RDB will only be used for streams classified as thin by the
    function tcp_stream_is_thin_dpifl(). This enforces a lower bound
    on the ITT for streams that may benefit from RDB, controlled by
    the sysctl variable net.ipv4.tcp_thin_dpifl_itt_lower_bound.

  o Loss detection of hidden loss events: When bundling redundant data
    with each packet, packet loss can be hidden from the TCP engine due
    to lack of dupACKs. This is because the loss is "repaired" by the
    redundant data in the packet coming after the lost packet. Based on
    incoming ACKs, such hidden loss events are detected, and CWR state
    is entered.

RDB can be enabled on a connection with the socket option TCP_RDB, or
on all new connections by setting the sysctl variable
net.ipv4.tcp_rdb=1

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Pål Halvorsen <paalh@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  15 +++
 include/linux/skbuff.h                 |   1 +
 include/linux/tcp.h                    |   3 +-
 include/net/tcp.h                      |  15 +++
 include/uapi/linux/tcp.h               |   1 +
 net/core/skbuff.c                      |   2 +-
 net/ipv4/Makefile                      |   3 +-
 net/ipv4/sysctl_net_ipv4.c             |  25 ++++
 net/ipv4/tcp.c                         |  14 +-
 net/ipv4/tcp_input.c                   |   3 +
 net/ipv4/tcp_output.c                  |  48 ++++---
 net/ipv4/tcp_rdb.c                     | 228 +++++++++++++++++++++++++++++++++
 12 files changed, 335 insertions(+), 23 deletions(-)
 create mode 100644 net/ipv4/tcp_rdb.c
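The bundling limits that the patch applies in rdb_can_bundle_test() (the tcp_rdb_max_bytes and tcp_rdb_max_packets knobs) can be sketched outside the kernel. The userspace C below is not kernel code: it models the prior un-ACKed segments as an array of payload sizes and walks backwards from the newest, bundling while a byte budget and a packet-count budget hold. A limit of 0 is treated as "unlimited" here; the kernel additionally caps the total at the current MSS.

```c
#include <stddef.h>
#include <assert.h>

/* Simplified model of the rdb_can_bundle_test() limits.
 * prev_lens[] holds payload sizes of prior un-ACKed segments, oldest
 * first; new_len is the payload of the segment about to be sent.
 * Returns how many prior segments would be bundled into the RDB packet.
 */
size_t rdb_bundle_count(const size_t *prev_lens, size_t n_prev,
			size_t new_len, size_t max_bytes,
			size_t max_packets)
{
	size_t total = new_len;	/* the new segment's payload always counts */
	size_t in_bundle = 1;	/* starts at 1 for xmit_skb, as in the patch */
	size_t bundled = 0;

	for (size_t i = n_prev; i-- > 0; ) {	/* newest to oldest */
		if (max_bytes && total + prev_lens[i] > max_bytes)
			break;
		if (max_packets && in_bundle > max_packets)
			break;
		total += prev_lens[i];
		in_bundle++;
		bundled++;
	}
	return bundled;
}
```

With the patch default of tcp_rdb_max_packets = 1, each RDB packet carries at most one prior segment's worth of redundant data in this model.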