diff mbox

[1/2] virtio-net rsc: support coalescing ipv4 tcp traffic

Message ID 1458033424-25414-2-git-send-email-wexu@redhat.com
State New
Headers show

Commit Message

Wei Xu March 15, 2016, 9:17 a.m. UTC
From: Wei Xu <wexu@redhat.com>

All the data packets in a tcp connection will be cached to a big buffer
in every receive interval, and will be sent out via a timer, the
'virtio_net_rsc_timeout' controls the interval, the value will influent the
performance and response of tcp connection extremely, 50000(50us) is a
experience value to gain a performance improvement, since the whql test
sends packets every 100us, so '300000(300us)' can pass the test case,
this is also the default value, it's gonna to be tunable.

The timer will only be triggered if the packets pool is not empty,
and it'll drain off all the cached packets

'NetRscChain' is used to save the segments of different protocols in a
VirtIONet device.

The main handler of TCP includes TCP window update, duplicated ACK check
and the real data coalescing if the new segment passed sanity check
and is identified as an 'wanted' one.

An 'wanted' segment means:
1. Segment is within current window and the sequence is the expected one.
2. ACK of the segment is in the valid window.
3. If the ACK in the segment is a duplicated one, then it must less than 2,
   this is to notify upper layer TCP starting retransmission due to the spec.

Sanity check includes:
1. Incorrect version in IP header
2. IP options & IP fragment
3. Not a TCP packets
4. Sanity size check to prevent buffer overflow attack.

There maybe more cases should be considered such as ip identification other
flags, while it broke the test because windows set it to the same even it's
not a fragment.

Normally it includes 2 typical ways to handle a TCP control flag, 'bypass'
and 'finalize', 'bypass' means should be sent out directly, and 'finalize'
means the packets should also be bypassed, and this should be done
after searching for the same connection packets in the pool and sending
all of them out, this is to avoid out of data.

All the 'SYN' packets will be bypassed since this always begin a new'
connection, other flags such 'FIN/RST' will trigger a finalization, because
this normally happens upon a connection is going to be closed, an 'URG' packet
also finalize current coalescing unit while there maybe protocol difference to
different OS.

Statistics can be used to monitor the basic coalescing status, the 'out of order'
and 'out of window' means how many retransmitting packets, thus describe the
performance intuitively.

Signed-off-by: Wei Xu <wexu@redhat.com>
---
 hw/net/virtio-net.c            | 486 ++++++++++++++++++++++++++++++++++++++++-
 include/hw/virtio/virtio-net.h |   1 +
 include/hw/virtio/virtio.h     |  72 ++++++
 3 files changed, 558 insertions(+), 1 deletion(-)

Comments

Michael S. Tsirkin March 15, 2016, 10 a.m. UTC | #1
On Tue, Mar 15, 2016 at 05:17:03PM +0800, wexu@redhat.com wrote:
> From: Wei Xu <wexu@redhat.com>
> 
> All the data packets in a tcp connection will be cached to a big buffer
> in every receive interval, and will be sent out via a timer, the
> 'virtio_net_rsc_timeout' controls the interval, the value will influent the
> performance and response of tcp connection extremely, 50000(50us) is a
> experience value to gain a performance improvement, since the whql test
> sends packets every 100us, so '300000(300us)' can pass the test case,
> this is also the default value, it's gonna to be tunable.
> The timer will only be triggered if the packets pool is not empty,
> and it'll drain off all the cached packets
> 
> 'NetRscChain' is used to save the segments of different protocols in a
> VirtIONet device.
> 
> The main handler of TCP includes TCP window update, duplicated ACK check
> and the real data coalescing if the new segment passed sanity check
> and is identified as an 'wanted' one.
> 
> An 'wanted' segment means:
> 1. Segment is within current window and the sequence is the expected one.
> 2. ACK of the segment is in the valid window.
> 3. If the ACK in the segment is a duplicated one, then it must less than 2,
>    this is to notify upper layer TCP starting retransmission due to the spec.
> 
> Sanity check includes:
> 1. Incorrect version in IP header
> 2. IP options & IP fragment
> 3. Not a TCP packets
> 4. Sanity size check to prevent buffer overflow attack.
> 
> There maybe more cases should be considered such as ip identification other
> flags, while it broke the test because windows set it to the same even it's
> not a fragment.
> 
> Normally it includes 2 typical ways to handle a TCP control flag, 'bypass'
> and 'finalize', 'bypass' means should be sent out directly, and 'finalize'
> means the packets should also be bypassed, and this should be done
> after searching for the same connection packets in the pool and sending
> all of them out, this is to avoid out of data.
> 
> All the 'SYN' packets will be bypassed since this always begin a new'
> connection, other flags such 'FIN/RST' will trigger a finalization, because
> this normally happens upon a connection is going to be closed, an 'URG' packet
> also finalize current coalescing unit while there maybe protocol difference to
> different OS.
> 
> Statistics can be used to monitor the basic coalescing status, the 'out of order'
> and 'out of window' means how many retransmitting packets, thus describe the
> performance intuitively.
> 
> Signed-off-by: Wei Xu <wexu@redhat.com>
> ---
>  hw/net/virtio-net.c            | 486 ++++++++++++++++++++++++++++++++++++++++-
>  include/hw/virtio/virtio-net.h |   1 +
>  include/hw/virtio/virtio.h     |  72 ++++++
>  3 files changed, 558 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 5798f87..c23b45f 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -15,10 +15,12 @@
>  #include "qemu/iov.h"
>  #include "hw/virtio/virtio.h"
>  #include "net/net.h"
> +#include "net/eth.h"
>  #include "net/checksum.h"
>  #include "net/tap.h"
>  #include "qemu/error-report.h"
>  #include "qemu/timer.h"
> +#include "qemu/sockets.h"
>  #include "hw/virtio/virtio-net.h"
>  #include "net/vhost_net.h"
>  #include "hw/virtio/virtio-bus.h"
> @@ -38,6 +40,35 @@
>  #define endof(container, field) \
>      (offsetof(container, field) + sizeof(((container *)0)->field))
>  
> +#define ETH_HDR_SZ (sizeof(struct eth_header))
> +#define IP4_HDR_SZ (sizeof(struct ip_header))
> +#define TCP_HDR_SZ (sizeof(struct tcp_header))
> +#define ETH_IP4_HDR_SZ (ETH_HDR_SZ + IP4_HDR_SZ)

It's better to open-code these imho.

> +
> +#define IP4_ADDR_SIZE   8                   /* ipv4 saddr + daddr */
> +#define TCP_PORT_SIZE   4                   /* sport + dport */
> +
> +/* IPv4 max payload, 16 bits in the header */
> +#define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ)
> +#define MAX_TCP_PAYLOAD 65535
> +
> +/* max payload with virtio header */
> +#define MAX_VIRTIO_PAYLOAD  (sizeof(struct virtio_net_hdr_mrg_rxbuf) \
> +                                + ETH_HDR_SZ + MAX_TCP_PAYLOAD)
> +
> +#define IP4_HEADER_LEN 5 /* header lenght value in ip header without option */
> +
> +/* Purge coalesced packets timer interval */
> +#define RSC_TIMER_INTERVAL  300000

Pls prefix local macros with VIRTIO_NET_


> +
> +/* Switcher to enable/disable rsc */
> +static bool virtio_net_rsc_bypass = 1;
> +
> +/* This value affects the performance a lot, and should be tuned carefully,
> +   '300000'(300us) is the recommended value to pass the WHQL test, '50000' can
> +   gain 2x netperf throughput with tso/gso/gro 'off'. */

So either tests pass or we get good performance, but not both?
Hmm.

> +static uint32_t virtio_net_rsc_timeout = RSC_TIMER_INTERVAL;


This would beed to be tunable.

> +
>  typedef struct VirtIOFeature {
>      uint32_t flags;
>      size_t end;
> @@ -1089,7 +1120,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
>      return 0;
>  }
>  
> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size)
> +static ssize_t virtio_net_do_receive(NetClientState *nc,
> +                                  const uint8_t *buf, size_t size)
>  {
>      VirtIONet *n = qemu_get_nic_opaque(nc);
>      VirtIONetQueue *q = virtio_net_get_subqueue(nc);
> @@ -1685,6 +1717,456 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f,
>      return 0;
>  }
>  
> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
> +                                         const uint8_t *buf, NetRscUnit* unit)
> +{
> +    uint16_t ip_hdrlen;
> +
> +    unit->ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
> +    ip_hdrlen = ((0xF & unit->ip->ip_ver_len) << 2);
> +    unit->ip_plen = &unit->ip->ip_len;
> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
> +    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
> +}
> +
> +static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip)
> +{
> +    uint32_t sum;
> +
> +    ip->ip_sum = 0;
> +    sum = net_checksum_add_cont(IP4_HDR_SZ, (uint8_t *)ip, 0);
> +    ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
> +}
> +
> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
> +{
> +    int ret;
> +
> +    virtio_net_rsc_ipv4_checksum(seg->unit.ip);
> +    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
> +    QTAILQ_REMOVE(&chain->buffers, seg, next);
> +    g_free(seg->buf);
> +    g_free(seg);
> +
> +    return ret;
> +}
> +
> +static void virtio_net_rsc_purge(void *opq)
> +{
> +    NetRscChain *chain = (NetRscChain *)opq;
> +    NetRscSeg *seg, *rn;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.purge_failed++;
> +            continue;
> +        }
> +    }
> +
> +    chain->stat.timer++;
> +    if (!QTAILQ_EMPTY(&chain->buffers)) {
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
> +    }
> +}
> +
> +static void virtio_net_rsc_cleanup(VirtIONet *n)
> +{
> +    NetRscChain *chain, *rn_chain;
> +    NetRscSeg *seg, *rn_seg;
> +
> +    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
> +        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
> +            QTAILQ_REMOVE(&chain->buffers, seg, next);
> +            g_free(seg->buf);
> +            g_free(seg);
> +        }
> +
> +        timer_del(chain->drain_timer);
> +        timer_free(chain->drain_timer);
> +        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
> +        g_free(chain);
> +    }
> +}
> +
> +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
> +                                     const uint8_t *buf, size_t size)
> +{
> +    NetRscSeg *seg;
> +
> +    seg = g_malloc(sizeof(NetRscSeg));
> +    seg->buf = g_malloc(MAX_VIRTIO_PAYLOAD);
> +
> +    memmove(seg->buf, buf, size);
> +    seg->size = size;
> +    seg->dup_ack_count = 0;
> +    seg->is_coalesced = 0;
> +    seg->nc = nc;
> +
> +    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
> +    chain->stat.cache++;
> +
> +    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
> +}
> +
> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSeg *seg,
> +                                 const uint8_t *buf, struct tcp_header *n_tcp,
> +                                 struct tcp_header *o_tcp)
> +{
> +    uint32_t nack, oack;
> +    uint16_t nwin, owin;
> +
> +    nack = htonl(n_tcp->th_ack);
> +    nwin = htons(n_tcp->th_win);
> +    oack = htonl(o_tcp->th_ack);
> +    owin = htons(o_tcp->th_win);
> +
> +    if ((nack - oack) >= MAX_TCP_PAYLOAD) {
> +        chain->stat.ack_out_of_win++;
> +        return RSC_FINAL;
> +    } else if (nack == oack) {
> +        /* duplicated ack or window probe */
> +        if (nwin == owin) {
> +            /* duplicated ack, add dup ack count due to whql test up to 1 */
> +            chain->stat.dup_ack++;
> +
> +            if (seg->dup_ack_count == 0) {
> +                seg->dup_ack_count++;
> +                chain->stat.dup_ack1++;
> +                return RSC_COALESCE;
> +            } else {
> +                /* Spec says should send it directly */
> +                chain->stat.dup_ack2++;
> +                return RSC_FINAL;
> +            }
> +        } else {
> +            /* Coalesce window update */
> +            o_tcp->th_win = n_tcp->th_win;
> +            chain->stat.win_update++;
> +            return RSC_COALESCE;
> +        }
> +    } else {
> +        /* pure ack, update ack */
> +        o_tcp->th_ack = n_tcp->th_ack;
> +        chain->stat.pure_ack++;
> +        return RSC_COALESCE;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, NetRscSeg *seg,
> +                                    const uint8_t *buf, NetRscUnit *n_unit)
> +{
> +    void *data;
> +    uint16_t o_ip_len;
> +    uint32_t nseq, oseq;
> +    NetRscUnit *o_unit;
> +
> +    o_unit = &seg->unit;
> +    o_ip_len = htons(*o_unit->ip_plen);
> +    nseq = htonl(n_unit->tcp->th_seq);
> +    oseq = htonl(o_unit->tcp->th_seq);
> +
> +    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
> +        /* Log this only for debugging observation */
> +        chain->stat.tcp_option++;
> +    }
> +
> +    /* Ignore packet with more/larger tcp options */
> +    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {
> +        chain->stat.tcp_larger_option++;
> +        return RSC_FINAL;
> +    }
> +
> +    /* out of order or retransmitted. */
> +    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
> +        chain->stat.data_out_of_win++;
> +        return RSC_FINAL;
> +    }
> +
> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
> +    if (nseq == oseq) {
> +        if ((0 == o_unit->payload) && n_unit->payload) {
> +            /* From no payload to payload, normal case, not a dup ack or etc */
> +            chain->stat.data_after_pure_ack++;
> +            goto coalesce;
> +        } else {
> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
> +                                             n_unit->tcp, o_unit->tcp);
> +        }
> +    } else if ((nseq - oseq) != o_unit->payload) {
> +        /* Not a consistent packet, out of order */
> +        chain->stat.data_out_of_order++;
> +        return RSC_FINAL;
> +    } else {
> +coalesce:
> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
> +            chain->stat.over_size++;
> +            return RSC_FINAL;
> +        }
> +
> +        /* Here comes the right data, the payload lengh in v4/v6 is different,
> +           so use the field value to update and record the new data len */
> +        o_unit->payload += n_unit->payload; /* update new data len */
> +
> +        /* update field in ip header */
> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
> +
> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
> +           for windows guest, while this may change the behavior for linux
> +           guest. */
> +        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
> +
> +        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
> +        o_unit->tcp->th_win = n_unit->tcp->th_win;
> +
> +        memmove(seg->buf + seg->size, data, n_unit->payload);
> +        seg->size += n_unit->payload;
> +        chain->stat.coalesced++;
> +        return RSC_COALESCE;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
> +                        const uint8_t *buf, size_t size, NetRscUnit *unit)
> +{
> +    if ((unit->ip->ip_src ^ seg->unit.ip->ip_src)
> +        || (unit->ip->ip_dst ^ seg->unit.ip->ip_dst)
> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
> +        chain->stat.no_match++;
> +        return RSC_NO_MATCH;
> +    }
> +
> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
> +}
> +
> +/* Pakcets with 'SYN' should bypass, other flag should be sent after drain
> + * to prevent out of order */
> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
> +                                         struct tcp_header *tcp)
> +{
> +    uint16_t tcp_flag;
> +
> +    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
> +    if (tcp_flag & TH_SYN) {
> +        chain->stat.tcp_syn++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) {
> +        chain->stat.tcp_ctrl_drain++;
> +        return RSC_FINAL;
> +    }
> +
> +    return RSC_WANT;
> +}
> +
> +static bool virtio_net_rsc_empty_cache(NetRscChain *chain, NetClientState *nc,
> +                          const uint8_t *buf, size_t size)
> +{
> +    if (QTAILQ_EMPTY(&chain->buffers)) {
> +        chain->stat.empty_cache++;
> +        virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
> +        return 1;
> +    }
> +
> +    return 0;
> +}
> +
> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
> +                          const uint8_t *buf, size_t size, NetRscUnit *unit)
> +{
> +    int ret;
> +    NetRscSeg *seg, *nseg;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
> +
> +        if (ret == RSC_FINAL) {
> +            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +                /* Send failed */
> +                chain->stat.final_failed++;
> +                return 0;
> +            }
> +
> +            /* Send current packet */
> +            return virtio_net_do_receive(nc, buf, size);
> +        } else if (ret == RSC_NO_MATCH) {
> +            continue;
> +        } else {
> +            /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */
> +            seg->is_coalesced = 1;
> +            return size;
> +        }
> +    }
> +
> +    chain->stat.no_match_cache++;
> +    virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +    return size;
> +}
> +
> +/* Drain a connection data, this is to avoid out of order segments */
> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
> +                        const uint8_t *buf, size_t size, uint16_t ip_start,
> +                        uint16_t ip_size, uint16_t tcp_port, uint16_t port_size)
> +{
> +    NetRscSeg *seg, *nseg;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
> +            || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size)) {
> +            continue;
> +        }
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.drain_failed++;
> +        }
> +
> +        break;
> +    }
> +
> +    return virtio_net_do_receive(nc, buf, size);
> +}
> +
> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
> +                        struct ip_header *ip, const uint8_t *buf, size_t size)
> +{
> +    uint16_t ip_len;
> +
> +    if (size < (chain->hdr_size + ETH_IP4_HDR_SZ + TCP_HDR_SZ)) {
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Not an ipv4 one */
> +    if (((0xF0 & ip->ip_ver_len) >> 4) != IP_HEADER_VERSION_4) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip option */
> +    if (IP4_HEADER_LEN != (0xF & ip->ip_ver_len)) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip fragment */
> +    if (!(htons(ip->ip_off) & IP_DF)) {
> +        chain->stat.ip_frag++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (ip->ip_p != IPPROTO_TCP) {
> +        chain->stat.bypass_not_tcp++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Sanity check */
> +    ip_len = htons(ip->ip_len);
> +    if (ip_len < (IP4_HDR_SZ + TCP_HDR_SZ)
> +        || ip_len > (size - chain->hdr_size - ETH_HDR_SZ)) {
> +        chain->stat.ip_hacked++;
> +        return RSC_BYPASS;
> +    }
> +
> +    return RSC_WANT;
> +}
> +
> +static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    int32_t ret;
> +    NetRscChain *chain;
> +    NetRscUnit unit;
> +
> +    chain = (NetRscChain *)opq;
> +    virtio_net_rsc_extract_unit4(chain, buf, &unit);
> +    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
> +    if (ret == RSC_BYPASS) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else if (ret == RSC_FINAL) {
> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
> +                        ((chain->hdr_size + ETH_HDR_SZ) + 12), IP4_ADDR_SIZE,
> +                        (chain->hdr_size + ETH_IP4_HDR_SZ), TCP_PORT_SIZE);
> +    }
> +
> +    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
> +        return size;
> +    }
> +
> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
> +}
> +
> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
> +                                            NetClientState *nc, uint16_t proto)
> +{
> +    NetRscChain *chain;
> +
> +    /* Only handle IPv4/6 */
> +    if (proto != (uint16_t)ETH_P_IP) {
> +        return NULL;
> +    }
> +
> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
> +        if (chain->proto == proto) {
> +            return chain;
> +        }
> +    }
> +
> +    chain = g_malloc(sizeof(*chain));
> +    chain->hdr_size = n->guest_hdr_len;
> +    chain->proto = proto;
> +    chain->max_payload = MAX_IP4_PAYLOAD;
> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
> +                                      virtio_net_rsc_purge, chain);
> +    memset(&chain->stat, 0, sizeof(chain->stat));
> +
> +    QTAILQ_INIT(&chain->buffers);
> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
> +
> +    return chain;
> +}
> +
> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    uint16_t proto;
> +    NetRscChain *chain;
> +    struct eth_header *eth;
> +    VirtIONet *n;
> +
> +    n = qemu_get_nic_opaque(nc);
> +    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
> +    proto = htons(eth->h_proto);
> +
> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
> +    if (!chain) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else {
> +        chain->stat.received++;
> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
> +    }
> +}
> +
> +static ssize_t virtio_net_receive(NetClientState *nc,
> +                                  const uint8_t *buf, size_t size)
> +{
> +    if (virtio_net_rsc_bypass) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else {
> +        return virtio_net_rsc_receive(nc, buf, size);
> +    }
> +}
> +
>  static NetClientInfo net_virtio_info = {
>      .type = NET_CLIENT_OPTIONS_KIND_NIC,
>      .size = sizeof(NICState),
> @@ -1814,6 +2296,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>      nc = qemu_get_queue(n->nic);
>      nc->rxfilter_notify_enabled = 1;
>  
> +    QTAILQ_INIT(&n->rsc_chains);
>      n->qdev = dev;
>      register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION,
>                      virtio_net_save, virtio_net_load, n);
> @@ -1848,6 +2331,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
>      g_free(n->vqs);
>      qemu_del_nic(n->nic);
>      virtio_cleanup(vdev);
> +    virtio_net_rsc_cleanup(n);
>  }
>  
>  static void virtio_net_instance_init(Object *obj)
> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> index 0cabdb6..6939e92 100644
> --- a/include/hw/virtio/virtio-net.h
> +++ b/include/hw/virtio/virtio-net.h
> @@ -59,6 +59,7 @@ typedef struct VirtIONet {
>      VirtIONetQueue *vqs;
>      VirtQueue *ctrl_vq;
>      NICState *nic;
> +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
>      uint32_t tx_timeout;
>      int32_t tx_burst;
>      uint32_t has_vnet_hdr;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 2b5b248..3b1dfa8 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -128,6 +128,78 @@ typedef struct VirtioDeviceClass {
>      int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
>  } VirtioDeviceClass;
>  
> +/* Coalesced packets type & status */
> +typedef enum {
> +    RSC_COALESCE,           /* Data been coalesced */
> +    RSC_FINAL,              /* Will terminate current connection */
> +    RSC_NO_MATCH,           /* No matched in the buffer pool */
> +    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp ctrl, etc */
> +    RSC_WANT                /* Data want to be coalesced */
> +} COALESCE_STATUS;
> +
> +typedef struct NetRscStat {
> +    uint32_t received;
> +    uint32_t coalesced;
> +    uint32_t over_size;
> +    uint32_t cache;
> +    uint32_t empty_cache;
> +    uint32_t no_match_cache;
> +    uint32_t win_update;
> +    uint32_t no_match;
> +    uint32_t tcp_syn;
> +    uint32_t tcp_ctrl_drain;
> +    uint32_t dup_ack;
> +    uint32_t dup_ack1;
> +    uint32_t dup_ack2;
> +    uint32_t pure_ack;
> +    uint32_t ack_out_of_win;
> +    uint32_t data_out_of_win;
> +    uint32_t data_out_of_order;
> +    uint32_t data_after_pure_ack;
> +    uint32_t bypass_not_tcp;
> +    uint32_t tcp_option;
> +    uint32_t tcp_larger_option;
> +    uint32_t ip_frag;
> +    uint32_t ip_hacked;
> +    uint32_t ip_option;
> +    uint32_t purge_failed;
> +    uint32_t drain_failed;
> +    uint32_t final_failed;
> +    int64_t  timer;
> +} NetRscStat;
> +
> +/* Rsc unit general info used to checking if can coalescing */
> +typedef struct NetRscUnit {
> +   struct ip_header *ip;   /* ip header */
> +   uint16_t *ip_plen;      /* data len pointer in ip header field */
> +   struct tcp_header *tcp; /* tcp header */
> +   uint16_t tcp_hdrlen;    /* tcp header len */
> +   uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
> +} NetRscUnit;
> +
> +/* Coalesced segmant */
> +typedef struct NetRscSeg {
> +    QTAILQ_ENTRY(NetRscSeg) next;
> +    void *buf;
> +    size_t size;
> +    uint32_t dup_ack_count;
> +    bool is_coalesced;      /* need recal ipv4 header checksum, mark here */
> +    NetRscUnit unit;
> +    NetClientState *nc;
> +} NetRscSeg;
> +
> +
> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
> +typedef struct NetRscChain {
> +   QTAILQ_ENTRY(NetRscChain) next;
> +   uint16_t hdr_size;
> +   uint16_t proto;
> +   uint16_t max_payload;
> +   QEMUTimer *drain_timer;
> +   QTAILQ_HEAD(, NetRscSeg) buffers;
> +   NetRscStat stat;
> +} NetRscChain;
> +
>  void virtio_instance_init_common(Object *proxy_obj, void *data,
>                                   size_t vdev_size, const char *vdev_name);
>  
> -- 
> 2.5.0
Wei Xu March 16, 2016, 3:23 a.m. UTC | #2
----- Original Message -----
From: "Michael S. Tsirkin" <mst@redhat.com>
To: wexu@redhat.com
Cc: victork@redhat.com, jasowang@redhat.com, yvugenfi@redhat.com, qemu-devel@nongnu.org, marcel@redhat.com, dfleytma@redhat.com
Sent: Tuesday, March 15, 2016 6:00:03 PM
Subject: Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic

On Tue, Mar 15, 2016 at 05:17:03PM +0800, wexu@redhat.com wrote:
> From: Wei Xu <wexu@redhat.com>
> 
> All the data packets in a tcp connection will be cached to a big buffer
> in every receive interval, and will be sent out via a timer, the
> 'virtio_net_rsc_timeout' controls the interval, the value will influent the
> performance and response of tcp connection extremely, 50000(50us) is a
> experience value to gain a performance improvement, since the whql test
> sends packets every 100us, so '300000(300us)' can pass the test case,
> this is also the default value, it's gonna to be tunable.
> The timer will only be triggered if the packets pool is not empty,
> and it'll drain off all the cached packets
> 
> 'NetRscChain' is used to save the segments of different protocols in a
> VirtIONet device.
> 
> The main handler of TCP includes TCP window update, duplicated ACK check
> and the real data coalescing if the new segment passed sanity check
> and is identified as an 'wanted' one.
> 
> An 'wanted' segment means:
> 1. Segment is within current window and the sequence is the expected one.
> 2. ACK of the segment is in the valid window.
> 3. If the ACK in the segment is a duplicated one, then it must less than 2,
>    this is to notify upper layer TCP starting retransmission due to the spec.
> 
> Sanity check includes:
> 1. Incorrect version in IP header
> 2. IP options & IP fragment
> 3. Not a TCP packets
> 4. Sanity size check to prevent buffer overflow attack.
> 
> There maybe more cases should be considered such as ip identification other
> flags, while it broke the test because windows set it to the same even it's
> not a fragment.
> 
> Normally it includes 2 typical ways to handle a TCP control flag, 'bypass'
> and 'finalize', 'bypass' means should be sent out directly, and 'finalize'
> means the packets should also be bypassed, and this should be done
> after searching for the same connection packets in the pool and sending
> all of them out, this is to avoid out of data.
> 
> All the 'SYN' packets will be bypassed since this always begin a new'
> connection, other flags such 'FIN/RST' will trigger a finalization, because
> this normally happens upon a connection is going to be closed, an 'URG' packet
> also finalize current coalescing unit while there maybe protocol difference to
> different OS.
> 
> Statistics can be used to monitor the basic coalescing status, the 'out of order'
> and 'out of window' means how many retransmitting packets, thus describe the
> performance intuitively.
> 
> Signed-off-by: Wei Xu <wexu@redhat.com>
> ---
>  hw/net/virtio-net.c            | 486 ++++++++++++++++++++++++++++++++++++++++-
>  include/hw/virtio/virtio-net.h |   1 +
>  include/hw/virtio/virtio.h     |  72 ++++++
>  3 files changed, 558 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 5798f87..c23b45f 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -15,10 +15,12 @@
>  #include "qemu/iov.h"
>  #include "hw/virtio/virtio.h"
>  #include "net/net.h"
> +#include "net/eth.h"
>  #include "net/checksum.h"
>  #include "net/tap.h"
>  #include "qemu/error-report.h"
>  #include "qemu/timer.h"
> +#include "qemu/sockets.h"
>  #include "hw/virtio/virtio-net.h"
>  #include "net/vhost_net.h"
>  #include "hw/virtio/virtio-bus.h"
> @@ -38,6 +40,35 @@
>  #define endof(container, field) \
>      (offsetof(container, field) + sizeof(((container *)0)->field))
>  
> +#define ETH_HDR_SZ (sizeof(struct eth_header))
> +#define IP4_HDR_SZ (sizeof(struct ip_header))
> +#define TCP_HDR_SZ (sizeof(struct tcp_header))
> +#define ETH_IP4_HDR_SZ (ETH_HDR_SZ + IP4_HDR_SZ)

It's better to open-code these imho.

okay.

> +
> +#define IP4_ADDR_SIZE   8                   /* ipv4 saddr + daddr */
> +#define TCP_PORT_SIZE   4                   /* sport + dport */
> +
> +/* IPv4 max payload, 16 bits in the header */
> +#define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ)
> +#define MAX_TCP_PAYLOAD 65535
> +
> +/* max payload with virtio header */
> +#define MAX_VIRTIO_PAYLOAD  (sizeof(struct virtio_net_hdr_mrg_rxbuf) \
> +                                + ETH_HDR_SZ + MAX_TCP_PAYLOAD)
> +
> +#define IP4_HEADER_LEN 5 /* header lenght value in ip header without option */
> +
> +/* Purge coalesced packets timer interval */
> +#define RSC_TIMER_INTERVAL  300000

Pls prefix local macros with VIRTIO_NET_

sure.


> +
> +/* Switcher to enable/disable rsc */
> +static bool virtio_net_rsc_bypass = 1;
> +
> +/* This value affects the performance a lot, and should be tuned carefully,
> +   '300000'(300us) is the recommended value to pass the WHQL test, '50000' can
> +   gain 2x netperf throughput with tso/gso/gro 'off'. */

So either tests pass or we get good performance, but not both?
Hmm.

Yes, the test case send 6 data packets every 100us, and then capture and checking if some of the packets are coalesced,
so the interval less than 100us is beat by the case definitely, actually it's really depends on the interval to pass the case,
i tried 200/400/500+(us) but there are still a few cases failed, don't know how the internal of the test case is, maybe it is to
bypass 'tso/gso' like feature in windows stack not to coalesce packets before sending out.

> +static uint32_t virtio_net_rsc_timeout = RSC_TIMER_INTERVAL;


This would beed to be tunable.
Yes, I have a question, can it be controlled by 'ethtool' or 'qmp/hmp'?

> +
>  typedef struct VirtIOFeature {
>      uint32_t flags;
>      size_t end;
> @@ -1089,7 +1120,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
>      return 0;
>  }
>  
> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size)
> +static ssize_t virtio_net_do_receive(NetClientState *nc,
> +                                  const uint8_t *buf, size_t size)
>  {
>      VirtIONet *n = qemu_get_nic_opaque(nc);
>      VirtIONetQueue *q = virtio_net_get_subqueue(nc);
> @@ -1685,6 +1717,456 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f,
>      return 0;
>  }
>  
> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
> +                                         const uint8_t *buf, NetRscUnit* unit)
> +{
> +    uint16_t ip_hdrlen;
> +
> +    unit->ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
> +    ip_hdrlen = ((0xF & unit->ip->ip_ver_len) << 2);
> +    unit->ip_plen = &unit->ip->ip_len;
> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
> +    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
> +}
> +
> +static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip)
> +{
> +    uint32_t sum;
> +
> +    ip->ip_sum = 0;
> +    sum = net_checksum_add_cont(IP4_HDR_SZ, (uint8_t *)ip, 0);
> +    ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
> +}
> +
> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
> +{
> +    int ret;
> +
> +    virtio_net_rsc_ipv4_checksum(seg->unit.ip);
> +    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
> +    QTAILQ_REMOVE(&chain->buffers, seg, next);
> +    g_free(seg->buf);
> +    g_free(seg);
> +
> +    return ret;
> +}
> +
> +static void virtio_net_rsc_purge(void *opq)
> +{
> +    NetRscChain *chain = (NetRscChain *)opq;
> +    NetRscSeg *seg, *rn;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.purge_failed++;
> +            continue;
> +        }
> +    }
> +
> +    chain->stat.timer++;
> +    if (!QTAILQ_EMPTY(&chain->buffers)) {
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
> +    }
> +}
> +
> +static void virtio_net_rsc_cleanup(VirtIONet *n)
> +{
> +    NetRscChain *chain, *rn_chain;
> +    NetRscSeg *seg, *rn_seg;
> +
> +    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
> +        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
> +            QTAILQ_REMOVE(&chain->buffers, seg, next);
> +            g_free(seg->buf);
> +            g_free(seg);
> +        }
> +
> +        timer_del(chain->drain_timer);
> +        timer_free(chain->drain_timer);
> +        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
> +        g_free(chain);
> +    }
> +}
> +
> +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
> +                                     const uint8_t *buf, size_t size)
> +{
> +    NetRscSeg *seg;
> +
> +    seg = g_malloc(sizeof(NetRscSeg));
> +    seg->buf = g_malloc(MAX_VIRTIO_PAYLOAD);
> +
> +    memmove(seg->buf, buf, size);
> +    seg->size = size;
> +    seg->dup_ack_count = 0;
> +    seg->is_coalesced = 0;
> +    seg->nc = nc;
> +
> +    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
> +    chain->stat.cache++;
> +
> +    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
> +}
> +
> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSeg *seg,
> +                                 const uint8_t *buf, struct tcp_header *n_tcp,
> +                                 struct tcp_header *o_tcp)
> +{
> +    uint32_t nack, oack;
> +    uint16_t nwin, owin;
> +
> +    nack = htonl(n_tcp->th_ack);
> +    nwin = htons(n_tcp->th_win);
> +    oack = htonl(o_tcp->th_ack);
> +    owin = htons(o_tcp->th_win);
> +
> +    if ((nack - oack) >= MAX_TCP_PAYLOAD) {
> +        chain->stat.ack_out_of_win++;
> +        return RSC_FINAL;
> +    } else if (nack == oack) {
> +        /* duplicated ack or window probe */
> +        if (nwin == owin) {
> +            /* duplicated ack, add dup ack count due to whql test up to 1 */
> +            chain->stat.dup_ack++;
> +
> +            if (seg->dup_ack_count == 0) {
> +                seg->dup_ack_count++;
> +                chain->stat.dup_ack1++;
> +                return RSC_COALESCE;
> +            } else {
> +                /* Spec says should send it directly */
> +                chain->stat.dup_ack2++;
> +                return RSC_FINAL;
> +            }
> +        } else {
> +            /* Coalesce window update */
> +            o_tcp->th_win = n_tcp->th_win;
> +            chain->stat.win_update++;
> +            return RSC_COALESCE;
> +        }
> +    } else {
> +        /* pure ack, update ack */
> +        o_tcp->th_ack = n_tcp->th_ack;
> +        chain->stat.pure_ack++;
> +        return RSC_COALESCE;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, NetRscSeg *seg,
> +                                    const uint8_t *buf, NetRscUnit *n_unit)
> +{
> +    void *data;
> +    uint16_t o_ip_len;
> +    uint32_t nseq, oseq;
> +    NetRscUnit *o_unit;
> +
> +    o_unit = &seg->unit;
> +    o_ip_len = htons(*o_unit->ip_plen);
> +    nseq = htonl(n_unit->tcp->th_seq);
> +    oseq = htonl(o_unit->tcp->th_seq);
> +
> +    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
> +        /* Log this only for debugging observation */
> +        chain->stat.tcp_option++;
> +    }
> +
> +    /* Ignore packet with more/larger tcp options */
> +    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {
> +        chain->stat.tcp_larger_option++;
> +        return RSC_FINAL;
> +    }
> +
> +    /* out of order or retransmitted. */
> +    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
> +        chain->stat.data_out_of_win++;
> +        return RSC_FINAL;
> +    }
> +
> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
> +    if (nseq == oseq) {
> +        if ((0 == o_unit->payload) && n_unit->payload) {
> +            /* From no payload to payload, normal case, not a dup ack or etc */
> +            chain->stat.data_after_pure_ack++;
> +            goto coalesce;
> +        } else {
> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
> +                                             n_unit->tcp, o_unit->tcp);
> +        }
> +    } else if ((nseq - oseq) != o_unit->payload) {
> +        /* Not a consistent packet, out of order */
> +        chain->stat.data_out_of_order++;
> +        return RSC_FINAL;
> +    } else {
> +coalesce:
> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
> +            chain->stat.over_size++;
> +            return RSC_FINAL;
> +        }
> +
> +        /* Here comes the right data, the payload lengh in v4/v6 is different,
> +           so use the field value to update and record the new data len */
> +        o_unit->payload += n_unit->payload; /* update new data len */
> +
> +        /* update field in ip header */
> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
> +
> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
> +           for windows guest, while this may change the behavior for linux
> +           guest. */
> +        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
> +
> +        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
> +        o_unit->tcp->th_win = n_unit->tcp->th_win;
> +
> +        memmove(seg->buf + seg->size, data, n_unit->payload);
> +        seg->size += n_unit->payload;
> +        chain->stat.coalesced++;
> +        return RSC_COALESCE;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
> +                        const uint8_t *buf, size_t size, NetRscUnit *unit)
> +{
> +    if ((unit->ip->ip_src ^ seg->unit.ip->ip_src)
> +        || (unit->ip->ip_dst ^ seg->unit.ip->ip_dst)
> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
> +        chain->stat.no_match++;
> +        return RSC_NO_MATCH;
> +    }
> +
> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
> +}
> +
> +/* Pakcets with 'SYN' should bypass, other flag should be sent after drain
> + * to prevent out of order */
> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
> +                                         struct tcp_header *tcp)
> +{
> +    uint16_t tcp_flag;
> +
> +    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
> +    if (tcp_flag & TH_SYN) {
> +        chain->stat.tcp_syn++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) {
> +        chain->stat.tcp_ctrl_drain++;
> +        return RSC_FINAL;
> +    }
> +
> +    return RSC_WANT;
> +}
> +
> +static bool virtio_net_rsc_empty_cache(NetRscChain *chain, NetClientState *nc,
> +                          const uint8_t *buf, size_t size)
> +{
> +    if (QTAILQ_EMPTY(&chain->buffers)) {
> +        chain->stat.empty_cache++;
> +        virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
> +        return 1;
> +    }
> +
> +    return 0;
> +}
> +
> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
> +                          const uint8_t *buf, size_t size, NetRscUnit *unit)
> +{
> +    int ret;
> +    NetRscSeg *seg, *nseg;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
> +
> +        if (ret == RSC_FINAL) {
> +            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +                /* Send failed */
> +                chain->stat.final_failed++;
> +                return 0;
> +            }
> +
> +            /* Send current packet */
> +            return virtio_net_do_receive(nc, buf, size);
> +        } else if (ret == RSC_NO_MATCH) {
> +            continue;
> +        } else {
> +            /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */
> +            seg->is_coalesced = 1;
> +            return size;
> +        }
> +    }
> +
> +    chain->stat.no_match_cache++;
> +    virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +    return size;
> +}
> +
> +/* Drain a connection data, this is to avoid out of order segments */
> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
> +                        const uint8_t *buf, size_t size, uint16_t ip_start,
> +                        uint16_t ip_size, uint16_t tcp_port, uint16_t port_size)
> +{
> +    NetRscSeg *seg, *nseg;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
> +            || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size)) {
> +            continue;
> +        }
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.drain_failed++;
> +        }
> +
> +        break;
> +    }
> +
> +    return virtio_net_do_receive(nc, buf, size);
> +}
> +
> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
> +                        struct ip_header *ip, const uint8_t *buf, size_t size)
> +{
> +    uint16_t ip_len;
> +
> +    if (size < (chain->hdr_size + ETH_IP4_HDR_SZ + TCP_HDR_SZ)) {
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Not an ipv4 one */
> +    if (((0xF0 & ip->ip_ver_len) >> 4) != IP_HEADER_VERSION_4) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip option */
> +    if (IP4_HEADER_LEN != (0xF & ip->ip_ver_len)) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip fragment */
> +    if (!(htons(ip->ip_off) & IP_DF)) {
> +        chain->stat.ip_frag++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (ip->ip_p != IPPROTO_TCP) {
> +        chain->stat.bypass_not_tcp++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Sanity check */
> +    ip_len = htons(ip->ip_len);
> +    if (ip_len < (IP4_HDR_SZ + TCP_HDR_SZ)
> +        || ip_len > (size - chain->hdr_size - ETH_HDR_SZ)) {
> +        chain->stat.ip_hacked++;
> +        return RSC_BYPASS;
> +    }
> +
> +    return RSC_WANT;
> +}
> +
> +static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    int32_t ret;
> +    NetRscChain *chain;
> +    NetRscUnit unit;
> +
> +    chain = (NetRscChain *)opq;
> +    virtio_net_rsc_extract_unit4(chain, buf, &unit);
> +    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
> +    if (ret == RSC_BYPASS) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else if (ret == RSC_FINAL) {
> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
> +                        ((chain->hdr_size + ETH_HDR_SZ) + 12), IP4_ADDR_SIZE,
> +                        (chain->hdr_size + ETH_IP4_HDR_SZ), TCP_PORT_SIZE);
> +    }
> +
> +    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
> +        return size;
> +    }
> +
> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
> +}
> +
> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
> +                                            NetClientState *nc, uint16_t proto)
> +{
> +    NetRscChain *chain;
> +
> +    /* Only handle IPv4/6 */
> +    if (proto != (uint16_t)ETH_P_IP) {
> +        return NULL;
> +    }
> +
> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
> +        if (chain->proto == proto) {
> +            return chain;
> +        }
> +    }
> +
> +    chain = g_malloc(sizeof(*chain));
> +    chain->hdr_size = n->guest_hdr_len;
> +    chain->proto = proto;
> +    chain->max_payload = MAX_IP4_PAYLOAD;
> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
> +                                      virtio_net_rsc_purge, chain);
> +    memset(&chain->stat, 0, sizeof(chain->stat));
> +
> +    QTAILQ_INIT(&chain->buffers);
> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
> +
> +    return chain;
> +}
> +
> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    uint16_t proto;
> +    NetRscChain *chain;
> +    struct eth_header *eth;
> +    VirtIONet *n;
> +
> +    n = qemu_get_nic_opaque(nc);
> +    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
> +    proto = htons(eth->h_proto);
> +
> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
> +    if (!chain) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else {
> +        chain->stat.received++;
> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
> +    }
> +}
> +
> +static ssize_t virtio_net_receive(NetClientState *nc,
> +                                  const uint8_t *buf, size_t size)
> +{
> +    if (virtio_net_rsc_bypass) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else {
> +        return virtio_net_rsc_receive(nc, buf, size);
> +    }
> +}
> +
>  static NetClientInfo net_virtio_info = {
>      .type = NET_CLIENT_OPTIONS_KIND_NIC,
>      .size = sizeof(NICState),
> @@ -1814,6 +2296,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>      nc = qemu_get_queue(n->nic);
>      nc->rxfilter_notify_enabled = 1;
>  
> +    QTAILQ_INIT(&n->rsc_chains);
>      n->qdev = dev;
>      register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION,
>                      virtio_net_save, virtio_net_load, n);
> @@ -1848,6 +2331,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
>      g_free(n->vqs);
>      qemu_del_nic(n->nic);
>      virtio_cleanup(vdev);
> +    virtio_net_rsc_cleanup(n);
>  }
>  
>  static void virtio_net_instance_init(Object *obj)
> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> index 0cabdb6..6939e92 100644
> --- a/include/hw/virtio/virtio-net.h
> +++ b/include/hw/virtio/virtio-net.h
> @@ -59,6 +59,7 @@ typedef struct VirtIONet {
>      VirtIONetQueue *vqs;
>      VirtQueue *ctrl_vq;
>      NICState *nic;
> +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
>      uint32_t tx_timeout;
>      int32_t tx_burst;
>      uint32_t has_vnet_hdr;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 2b5b248..3b1dfa8 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -128,6 +128,78 @@ typedef struct VirtioDeviceClass {
>      int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
>  } VirtioDeviceClass;
>  
> +/* Coalesced packets type & status */
> +typedef enum {
> +    RSC_COALESCE,           /* Data been coalesced */
> +    RSC_FINAL,              /* Will terminate current connection */
> +    RSC_NO_MATCH,           /* No matched in the buffer pool */
> +    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp ctrl, etc */
> +    RSC_WANT                /* Data want to be coalesced */
> +} COALESCE_STATUS;
> +
> +typedef struct NetRscStat {
> +    uint32_t received;
> +    uint32_t coalesced;
> +    uint32_t over_size;
> +    uint32_t cache;
> +    uint32_t empty_cache;
> +    uint32_t no_match_cache;
> +    uint32_t win_update;
> +    uint32_t no_match;
> +    uint32_t tcp_syn;
> +    uint32_t tcp_ctrl_drain;
> +    uint32_t dup_ack;
> +    uint32_t dup_ack1;
> +    uint32_t dup_ack2;
> +    uint32_t pure_ack;
> +    uint32_t ack_out_of_win;
> +    uint32_t data_out_of_win;
> +    uint32_t data_out_of_order;
> +    uint32_t data_after_pure_ack;
> +    uint32_t bypass_not_tcp;
> +    uint32_t tcp_option;
> +    uint32_t tcp_larger_option;
> +    uint32_t ip_frag;
> +    uint32_t ip_hacked;
> +    uint32_t ip_option;
> +    uint32_t purge_failed;
> +    uint32_t drain_failed;
> +    uint32_t final_failed;
> +    int64_t  timer;
> +} NetRscStat;
> +
> +/* Rsc unit general info used to checking if can coalescing */
> +typedef struct NetRscUnit {
> +   struct ip_header *ip;   /* ip header */
> +   uint16_t *ip_plen;      /* data len pointer in ip header field */
> +   struct tcp_header *tcp; /* tcp header */
> +   uint16_t tcp_hdrlen;    /* tcp header len */
> +   uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
> +} NetRscUnit;
> +
> +/* Coalesced segmant */
> +typedef struct NetRscSeg {
> +    QTAILQ_ENTRY(NetRscSeg) next;
> +    void *buf;
> +    size_t size;
> +    uint32_t dup_ack_count;
> +    bool is_coalesced;      /* need recal ipv4 header checksum, mark here */
> +    NetRscUnit unit;
> +    NetClientState *nc;
> +} NetRscSeg;
> +
> +
> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
> +typedef struct NetRscChain {
> +   QTAILQ_ENTRY(NetRscChain) next;
> +   uint16_t hdr_size;
> +   uint16_t proto;
> +   uint16_t max_payload;
> +   QEMUTimer *drain_timer;
> +   QTAILQ_HEAD(, NetRscSeg) buffers;
> +   NetRscStat stat;
> +} NetRscChain;
> +
>  void virtio_instance_init_common(Object *proxy_obj, void *data,
>                                   size_t vdev_size, const char *vdev_name);
>  
> -- 
> 2.5.0
Jason Wang March 17, 2016, 8:42 a.m. UTC | #3
On 03/15/2016 05:17 PM, wexu@redhat.com wrote:
> From: Wei Xu <wexu@redhat.com>
>
> All the data packets in a tcp connection will be cached to a big buffer
> in every receive interval, and will be sent out via a timer, the
> 'virtio_net_rsc_timeout' controls the interval, the value will influent the
> performance and response of tcp connection extremely, 50000(50us) is a
> experience value to gain a performance improvement, since the whql test
> sends packets every 100us, so '300000(300us)' can pass the test case,
> this is also the default value, it's gonna to be tunable.
>
> The timer will only be triggered if the packets pool is not empty,
> and it'll drain off all the cached packets
>
> 'NetRscChain' is used to save the segments of different protocols in a
> VirtIONet device.
>
> The main handler of TCP includes TCP window update, duplicated ACK check
> and the real data coalescing if the new segment passed sanity check
> and is identified as an 'wanted' one.
>
> An 'wanted' segment means:
> 1. Segment is within current window and the sequence is the expected one.
> 2. ACK of the segment is in the valid window.
> 3. If the ACK in the segment is a duplicated one, then it must less than 2,
>    this is to notify upper layer TCP starting retransmission due to the spec.
>
> Sanity check includes:
> 1. Incorrect version in IP header
> 2. IP options & IP fragment
> 3. Not a TCP packets
> 4. Sanity size check to prevent buffer overflow attack.
>
> There maybe more cases should be considered such as ip identification other
> flags, while it broke the test because windows set it to the same even it's
> not a fragment.
>
> Normally it includes 2 typical ways to handle a TCP control flag, 'bypass'
> and 'finalize', 'bypass' means should be sent out directly, and 'finalize'
> means the packets should also be bypassed, and this should be done
> after searching for the same connection packets in the pool and sending
> all of them out, this is to avoid out of data.
>
> All the 'SYN' packets will be bypassed since this always begin a new'
> connection, other flags such 'FIN/RST' will trigger a finalization, because
> this normally happens upon a connection is going to be closed, an 'URG' packet
> also finalize current coalescing unit while there maybe protocol difference to
> different OS.

But URG packet should be sent as quickly as possible regardless of
ordering, no?

>
> Statistics can be used to monitor the basic coalescing status, the 'out of order'
> and 'out of window' means how many retransmitting packets, thus describe the
> performance intuitively.
>
> Signed-off-by: Wei Xu <wexu@redhat.com>
> ---
>  hw/net/virtio-net.c            | 486 ++++++++++++++++++++++++++++++++++++++++-
>  include/hw/virtio/virtio-net.h |   1 +
>  include/hw/virtio/virtio.h     |  72 ++++++
>  3 files changed, 558 insertions(+), 1 deletion(-)
>
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 5798f87..c23b45f 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -15,10 +15,12 @@
>  #include "qemu/iov.h"
>  #include "hw/virtio/virtio.h"
>  #include "net/net.h"
> +#include "net/eth.h"
>  #include "net/checksum.h"
>  #include "net/tap.h"
>  #include "qemu/error-report.h"
>  #include "qemu/timer.h"
> +#include "qemu/sockets.h"
>  #include "hw/virtio/virtio-net.h"
>  #include "net/vhost_net.h"
>  #include "hw/virtio/virtio-bus.h"
> @@ -38,6 +40,35 @@
>  #define endof(container, field) \
>      (offsetof(container, field) + sizeof(((container *)0)->field))
>  
> +#define ETH_HDR_SZ (sizeof(struct eth_header))
> +#define IP4_HDR_SZ (sizeof(struct ip_header))
> +#define TCP_HDR_SZ (sizeof(struct tcp_header))
> +#define ETH_IP4_HDR_SZ (ETH_HDR_SZ + IP4_HDR_SZ)
> +
> +#define IP4_ADDR_SIZE   8                   /* ipv4 saddr + daddr */
> +#define TCP_PORT_SIZE   4                   /* sport + dport */
> +
> +/* IPv4 max payload, 16 bits in the header */
> +#define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ)
> +#define MAX_TCP_PAYLOAD 65535
> +
> +/* max payload with virtio header */
> +#define MAX_VIRTIO_PAYLOAD  (sizeof(struct virtio_net_hdr_mrg_rxbuf) \
> +                                + ETH_HDR_SZ + MAX_TCP_PAYLOAD)

Should we use guest_hdr_len instead of sizeof() here? Consider the
vnet_hdr will be extended in the future.

> +
> +#define IP4_HEADER_LEN 5 /* header lenght value in ip header without option */

type, should be 'length'

> +
> +/* Purge coalesced packets timer interval */
> +#define RSC_TIMER_INTERVAL  300000
> +
> +/* Switcher to enable/disable rsc */
> +static bool virtio_net_rsc_bypass = 1;
> +
> +/* This value affects the performance a lot, and should be tuned carefully,
> +   '300000'(300us) is the recommended value to pass the WHQL test, '50000' can
> +   gain 2x netperf throughput with tso/gso/gro 'off'. */
> +static uint32_t virtio_net_rsc_timeout = RSC_TIMER_INTERVAL;
> +
>  typedef struct VirtIOFeature {
>      uint32_t flags;
>      size_t end;
> @@ -1089,7 +1120,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
>      return 0;
>  }
>  
> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size)
> +static ssize_t virtio_net_do_receive(NetClientState *nc,
> +                                  const uint8_t *buf, size_t size)

indentation should also changed here.

>  {
>      VirtIONet *n = qemu_get_nic_opaque(nc);
>      VirtIONetQueue *q = virtio_net_get_subqueue(nc);
> @@ -1685,6 +1717,456 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f,
>      return 0;
>  }
>  
> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
> +                                         const uint8_t *buf, NetRscUnit* unit)
> +{
> +    uint16_t ip_hdrlen;
> +
> +    unit->ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
> +    ip_hdrlen = ((0xF & unit->ip->ip_ver_len) << 2);
> +    unit->ip_plen = &unit->ip->ip_len;
> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
> +    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
> +}
> +
> +static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip)
> +{
> +    uint32_t sum;
> +
> +    ip->ip_sum = 0;
> +    sum = net_checksum_add_cont(IP4_HDR_SZ, (uint8_t *)ip, 0);
> +    ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
> +}
> +
> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
> +{
> +    int ret;
> +
> +    virtio_net_rsc_ipv4_checksum(seg->unit.ip);
> +    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
> +    QTAILQ_REMOVE(&chain->buffers, seg, next);
> +    g_free(seg->buf);
> +    g_free(seg);
> +
> +    return ret;
> +}
> +
> +static void virtio_net_rsc_purge(void *opq)
> +{
> +    NetRscChain *chain = (NetRscChain *)opq;
> +    NetRscSeg *seg, *rn;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.purge_failed++;
> +            continue;
> +        }
> +    }
> +
> +    chain->stat.timer++;
> +    if (!QTAILQ_EMPTY(&chain->buffers)) {
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
> +    }
> +}
> +
> +static void virtio_net_rsc_cleanup(VirtIONet *n)
> +{
> +    NetRscChain *chain, *rn_chain;
> +    NetRscSeg *seg, *rn_seg;
> +
> +    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
> +        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
> +            QTAILQ_REMOVE(&chain->buffers, seg, next);
> +            g_free(seg->buf);
> +            g_free(seg);
> +        }
> +
> +        timer_del(chain->drain_timer);
> +        timer_free(chain->drain_timer);
> +        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
> +        g_free(chain);
> +    }
> +}
> +
> +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
> +                                     const uint8_t *buf, size_t size)
> +{
> +    NetRscSeg *seg;
> +
> +    seg = g_malloc(sizeof(NetRscSeg));
> +    seg->buf = g_malloc(MAX_VIRTIO_PAYLOAD);
> +
> +    memmove(seg->buf, buf, size);

Can seg->buf overlap with buf? If not, why use memmove() instead of
memcpy()?

> +    seg->size = size;
> +    seg->dup_ack_count = 0;
> +    seg->is_coalesced = 0;
> +    seg->nc = nc;
> +
> +    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
> +    chain->stat.cache++;
> +
> +    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
> +}
> +
> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSeg *seg,
> +                                 const uint8_t *buf, struct tcp_header *n_tcp,
> +                                 struct tcp_header *o_tcp)
> +{
> +    uint32_t nack, oack;
> +    uint16_t nwin, owin;
> +
> +    nack = htonl(n_tcp->th_ack);
> +    nwin = htons(n_tcp->th_win);
> +    oack = htonl(o_tcp->th_ack);
> +    owin = htons(o_tcp->th_win);
> +
> +    if ((nack - oack) >= MAX_TCP_PAYLOAD) {
> +        chain->stat.ack_out_of_win++;
> +        return RSC_FINAL;
> +    } else if (nack == oack) {
> +        /* duplicated ack or window probe */
> +        if (nwin == owin) {
> +            /* duplicated ack, add dup ack count due to whql test up to 1 */
> +            chain->stat.dup_ack++;
> +
> +            if (seg->dup_ack_count == 0) {
> +                seg->dup_ack_count++;
> +                chain->stat.dup_ack1++;
> +                return RSC_COALESCE;
> +            } else {
> +                /* Spec says should send it directly */
> +                chain->stat.dup_ack2++;
> +                return RSC_FINAL;
> +            }
> +        } else {
> +            /* Coalesce window update */
> +            o_tcp->th_win = n_tcp->th_win;
> +            chain->stat.win_update++;
> +            return RSC_COALESCE;
> +        }
> +    } else {
> +        /* pure ack, update ack */
> +        o_tcp->th_ack = n_tcp->th_ack;
> +        chain->stat.pure_ack++;
> +        return RSC_COALESCE;

Looks like there're something I missed. The spec said:

"In other words, any pure ACK that is not a duplicate ACK or a window
update triggers an exception and must not be coalesced. All such pure
ACKs must be indicated as individual segments."

Does it mean we *should not* coalesce windows update and pure ack?
(Since it can wakeup transmission)?

> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, NetRscSeg *seg,
> +                                    const uint8_t *buf, NetRscUnit *n_unit)
> +{
> +    void *data;
> +    uint16_t o_ip_len;
> +    uint32_t nseq, oseq;
> +    NetRscUnit *o_unit;
> +
> +    o_unit = &seg->unit;
> +    o_ip_len = htons(*o_unit->ip_plen);
> +    nseq = htonl(n_unit->tcp->th_seq);
> +    oseq = htonl(o_unit->tcp->th_seq);
> +
> +    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
> +        /* Log this only for debugging observation */
> +        chain->stat.tcp_option++;
> +    }
> +
> +    /* Ignore packet with more/larger tcp options */
> +    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {

What if n_unit->tcp_hdrlen > o_uint->tcp_hdr_len ?

> +        chain->stat.tcp_larger_option++;
> +        return RSC_FINAL;
> +    }
> +
> +    /* out of order or retransmitted. */
> +    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
> +        chain->stat.data_out_of_win++;
> +        return RSC_FINAL;
> +    }
> +
> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
> +    if (nseq == oseq) {
> +        if ((0 == o_unit->payload) && n_unit->payload) {
> +            /* From no payload to payload, normal case, not a dup ack or etc */
> +            chain->stat.data_after_pure_ack++;
> +            goto coalesce;
> +        } else {
> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
> +                                             n_unit->tcp, o_unit->tcp);
> +        }
> +    } else if ((nseq - oseq) != o_unit->payload) {
> +        /* Not a consistent packet, out of order */
> +        chain->stat.data_out_of_order++;
> +        return RSC_FINAL;
> +    } else {
> +coalesce:
> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
> +            chain->stat.over_size++;
> +            return RSC_FINAL;
> +        }
> +
> +        /* Here comes the right data, the payload lengh in v4/v6 is different,
> +           so use the field value to update and record the new data len */
> +        o_unit->payload += n_unit->payload; /* update new data len */
> +
> +        /* update field in ip header */
> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
> +
> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
> +           for windows guest, while this may change the behavior for linux
> +           guest. */

This needs more thought, 'can' probably means don't. (Linux GRO won't
merge PUSH packet).

> +        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
> +
> +        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
> +        o_unit->tcp->th_win = n_unit->tcp->th_win;
> +
> +        memmove(seg->buf + seg->size, data, n_unit->payload);
> +        seg->size += n_unit->payload;
> +        chain->stat.coalesced++;
> +        return RSC_COALESCE;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
> +                        const uint8_t *buf, size_t size, NetRscUnit *unit)
> +{
> +    if ((unit->ip->ip_src ^ seg->unit.ip->ip_src)
> +        || (unit->ip->ip_dst ^ seg->unit.ip->ip_dst)
> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
> +        chain->stat.no_match++;
> +        return RSC_NO_MATCH;
> +    }
> +
> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
> +}
> +
> +/* Pakcets with 'SYN' should bypass, other flag should be sent after drain
> + * to prevent out of order */
> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
> +                                         struct tcp_header *tcp)
> +{
> +    uint16_t tcp_flag;
> +
> +    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
> +    if (tcp_flag & TH_SYN) {
> +        chain->stat.tcp_syn++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) {
> +        chain->stat.tcp_ctrl_drain++;
> +        return RSC_FINAL;
> +    }
> +
> +    return RSC_WANT;
> +}
> +
> +static bool virtio_net_rsc_empty_cache(NetRscChain *chain, NetClientState *nc,
> +                          const uint8_t *buf, size_t size)

indentation looks wrong.

> +{
> +    if (QTAILQ_EMPTY(&chain->buffers)) {
> +        chain->stat.empty_cache++;
> +        virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
> +        return 1;
> +    }
> +
> +    return 0;
> +}
> +
> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
> +                          const uint8_t *buf, size_t size, NetRscUnit *unit)
> +{

and here.

> +    int ret;
> +    NetRscSeg *seg, *nseg;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
> +
> +        if (ret == RSC_FINAL) {
> +            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +                /* Send failed */
> +                chain->stat.final_failed++;
> +                return 0;
> +            }
> +
> +            /* Send current packet */
> +            return virtio_net_do_receive(nc, buf, size);
> +        } else if (ret == RSC_NO_MATCH) {
> +            continue;
> +        } else {
> +            /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */
> +            seg->is_coalesced = 1;
> +            return size;
> +        }
> +    }
> +
> +    chain->stat.no_match_cache++;
> +    virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +    return size;
> +}
> +
> +/* Drain a connection data, this is to avoid out of order segments */
> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
> +                        const uint8_t *buf, size_t size, uint16_t ip_start,
> +                        uint16_t ip_size, uint16_t tcp_port, uint16_t port_size)
> +{
> +    NetRscSeg *seg, *nseg;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
> +            || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size)) {
> +            continue;
> +        }
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.drain_failed++;
> +        }
> +
> +        break;
> +    }
> +
> +    return virtio_net_do_receive(nc, buf, size);
> +}
> +
> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
> +                        struct ip_header *ip, const uint8_t *buf, size_t size)
> +{
> +    uint16_t ip_len;
> +
> +    if (size < (chain->hdr_size + ETH_IP4_HDR_SZ + TCP_HDR_SZ)) {
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Not an ipv4 one */
> +    if (((0xF0 & ip->ip_ver_len) >> 4) != IP_HEADER_VERSION_4) {

I've replied this several times, please use a consistent style. E.g
"ip->ip_ver_len & 0xF0".

> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip option */
> +    if (IP4_HEADER_LEN != (0xF & ip->ip_ver_len)) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip fragment */
> +    if (!(htons(ip->ip_off) & IP_DF)) {
> +        chain->stat.ip_frag++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (ip->ip_p != IPPROTO_TCP) {
> +        chain->stat.bypass_not_tcp++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Sanity check */
> +    ip_len = htons(ip->ip_len);
> +    if (ip_len < (IP4_HDR_SZ + TCP_HDR_SZ)
> +        || ip_len > (size - chain->hdr_size - ETH_HDR_SZ)) {
> +        chain->stat.ip_hacked++;
> +        return RSC_BYPASS;
> +    }
> +
> +    return RSC_WANT;
> +}
> +
> +static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    int32_t ret;
> +    NetRscChain *chain;
> +    NetRscUnit unit;
> +
> +    chain = (NetRscChain *)opq;
> +    virtio_net_rsc_extract_unit4(chain, buf, &unit);
> +    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
> +    if (ret == RSC_BYPASS) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else if (ret == RSC_FINAL) {
> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
> +                        ((chain->hdr_size + ETH_HDR_SZ) + 12), IP4_ADDR_SIZE,
> +                        (chain->hdr_size + ETH_IP4_HDR_SZ), TCP_PORT_SIZE);
> +    }
> +
> +    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
> +        return size;
> +    }
> +
> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
> +}
> +
> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
> +                                            NetClientState *nc, uint16_t proto)
> +{
> +    NetRscChain *chain;
> +
> +    /* Only handle IPv4/6 */
> +    if (proto != (uint16_t)ETH_P_IP) {

The code is conflict with the comment above.

> +        return NULL;
> +    }
> +
> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
> +        if (chain->proto == proto) {
> +            return chain;
> +        }
> +    }
> +
> +    chain = g_malloc(sizeof(*chain));
> +    chain->hdr_size = n->guest_hdr_len;

Why introduce a specified field for instead of just use n->guest_hdr_len?

> +    chain->proto = proto;
> +    chain->max_payload = MAX_IP4_PAYLOAD;
> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
> +                                      virtio_net_rsc_purge, chain);
> +    memset(&chain->stat, 0, sizeof(chain->stat));
> +
> +    QTAILQ_INIT(&chain->buffers);
> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
> +
> +    return chain;
> +}
> +
> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    uint16_t proto;
> +    NetRscChain *chain;
> +    struct eth_header *eth;
> +    VirtIONet *n;
> +
> +    n = qemu_get_nic_opaque(nc);
> +    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
> +    proto = htons(eth->h_proto);
> +
> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
> +    if (!chain) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else {
> +        chain->stat.received++;
> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
> +    }
> +}
> +
> +static ssize_t virtio_net_receive(NetClientState *nc,
> +                                  const uint8_t *buf, size_t size)
> +{
> +    if (virtio_net_rsc_bypass) {
> +        return virtio_net_do_receive(nc, buf, size);

You need a feature bit for this and compat it for older machine types.
And also need some work on virtio spec I think.

> +    } else {
> +        return virtio_net_rsc_receive(nc, buf, size);
> +    }
> +}
> +
>  static NetClientInfo net_virtio_info = {
>      .type = NET_CLIENT_OPTIONS_KIND_NIC,
>      .size = sizeof(NICState),
> @@ -1814,6 +2296,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>      nc = qemu_get_queue(n->nic);
>      nc->rxfilter_notify_enabled = 1;
>  
> +    QTAILQ_INIT(&n->rsc_chains);
>      n->qdev = dev;
>      register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION,
>                      virtio_net_save, virtio_net_load, n);
> @@ -1848,6 +2331,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
>      g_free(n->vqs);
>      qemu_del_nic(n->nic);
>      virtio_cleanup(vdev);
> +    virtio_net_rsc_cleanup(n);
>  }
>  
>  static void virtio_net_instance_init(Object *obj)
> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> index 0cabdb6..6939e92 100644
> --- a/include/hw/virtio/virtio-net.h
> +++ b/include/hw/virtio/virtio-net.h
> @@ -59,6 +59,7 @@ typedef struct VirtIONet {
>      VirtIONetQueue *vqs;
>      VirtQueue *ctrl_vq;
>      NICState *nic;
> +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
>      uint32_t tx_timeout;
>      int32_t tx_burst;
>      uint32_t has_vnet_hdr;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 2b5b248..3b1dfa8 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -128,6 +128,78 @@ typedef struct VirtioDeviceClass {
>      int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
>  } VirtioDeviceClass;
>  
> +/* Coalesced packets type & status */
> +typedef enum {
> +    RSC_COALESCE,           /* Data been coalesced */
> +    RSC_FINAL,              /* Will terminate current connection */
> +    RSC_NO_MATCH,           /* No matched in the buffer pool */
> +    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp ctrl, etc */
> +    RSC_WANT                /* Data want to be coalesced */
> +} COALESCE_STATUS;
> +
> +typedef struct NetRscStat {
> +    uint32_t received;
> +    uint32_t coalesced;
> +    uint32_t over_size;
> +    uint32_t cache;
> +    uint32_t empty_cache;
> +    uint32_t no_match_cache;
> +    uint32_t win_update;
> +    uint32_t no_match;
> +    uint32_t tcp_syn;
> +    uint32_t tcp_ctrl_drain;
> +    uint32_t dup_ack;
> +    uint32_t dup_ack1;
> +    uint32_t dup_ack2;
> +    uint32_t pure_ack;
> +    uint32_t ack_out_of_win;
> +    uint32_t data_out_of_win;
> +    uint32_t data_out_of_order;
> +    uint32_t data_after_pure_ack;
> +    uint32_t bypass_not_tcp;
> +    uint32_t tcp_option;
> +    uint32_t tcp_larger_option;
> +    uint32_t ip_frag;
> +    uint32_t ip_hacked;
> +    uint32_t ip_option;
> +    uint32_t purge_failed;
> +    uint32_t drain_failed;
> +    uint32_t final_failed;
> +    int64_t  timer;
> +} NetRscStat;
> +
> +/* Rsc unit general info used to checking if can coalescing */
> +typedef struct NetRscUnit {
> +   struct ip_header *ip;   /* ip header */
> +   uint16_t *ip_plen;      /* data len pointer in ip header field */
> +   struct tcp_header *tcp; /* tcp header */
> +   uint16_t tcp_hdrlen;    /* tcp header len */
> +   uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
> +} NetRscUnit;
> +
> +/* Coalesced segmant */
> +typedef struct NetRscSeg {
> +    QTAILQ_ENTRY(NetRscSeg) next;
> +    void *buf;
> +    size_t size;
> +    uint32_t dup_ack_count;
> +    bool is_coalesced;      /* need recal ipv4 header checksum, mark here */
> +    NetRscUnit unit;
> +    NetClientState *nc;
> +} NetRscSeg;
> +
> +
> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
> +typedef struct NetRscChain {
> +   QTAILQ_ENTRY(NetRscChain) next;
> +   uint16_t hdr_size;
> +   uint16_t proto;
> +   uint16_t max_payload;
> +   QEMUTimer *drain_timer;
> +   QTAILQ_HEAD(, NetRscSeg) buffers;
> +   NetRscStat stat;
> +} NetRscChain;
> +
>  void virtio_instance_init_common(Object *proxy_obj, void *data,
>                                   size_t vdev_size, const char *vdev_name);
>
Wei Xu March 17, 2016, 4:45 p.m. UTC | #4
On 2016年03月17日 16:42, Jason Wang wrote:

>
> On 03/15/2016 05:17 PM, wexu@redhat.com wrote:
>> From: Wei Xu <wexu@redhat.com>
>>
>> All the data packets in a tcp connection will be cached to a big buffer
>> in every receive interval, and will be sent out via a timer, the
>> 'virtio_net_rsc_timeout' controls the interval, the value will influent the
>> performance and response of tcp connection extremely, 50000(50us) is a
>> experience value to gain a performance improvement, since the whql test
>> sends packets every 100us, so '300000(300us)' can pass the test case,
>> this is also the default value, it's gonna to be tunable.
>>
>> The timer will only be triggered if the packets pool is not empty,
>> and it'll drain off all the cached packets
>>
>> 'NetRscChain' is used to save the segments of different protocols in a
>> VirtIONet device.
>>
>> The main handler of TCP includes TCP window update, duplicated ACK check
>> and the real data coalescing if the new segment passed sanity check
>> and is identified as an 'wanted' one.
>>
>> An 'wanted' segment means:
>> 1. Segment is within current window and the sequence is the expected one.
>> 2. ACK of the segment is in the valid window.
>> 3. If the ACK in the segment is a duplicated one, then it must less than 2,
>>     this is to notify upper layer TCP starting retransmission due to the spec.
>>
>> Sanity check includes:
>> 1. Incorrect version in IP header
>> 2. IP options & IP fragment
>> 3. Not a TCP packets
>> 4. Sanity size check to prevent buffer overflow attack.
>>
>> There maybe more cases should be considered such as ip identification other
>> flags, while it broke the test because windows set it to the same even it's
>> not a fragment.
>>
>> Normally it includes 2 typical ways to handle a TCP control flag, 'bypass'
>> and 'finalize', 'bypass' means should be sent out directly, and 'finalize'
>> means the packets should also be bypassed, and this should be done
>> after searching for the same connection packets in the pool and sending
>> all of them out, this is to avoid out of data.
>>
>> All the 'SYN' packets will be bypassed since this always begin a new'
>> connection, other flags such 'FIN/RST' will trigger a finalization, because
>> this normally happens upon a connection is going to be closed, an 'URG' packet
>> also finalize current coalescing unit while there maybe protocol difference to
>> different OS.
> But URG packet should be sent as quickly as possible regardless of
> ordering, no?

Yes, you right, URG will terminate the current 'SCU', i'll amend the commit log.

>
>> Statistics can be used to monitor the basic coalescing status, the 'out of order'
>> and 'out of window' means how many retransmitting packets, thus describe the
>> performance intuitively.
>>
>> Signed-off-by: Wei Xu <wexu@redhat.com>
>> ---
>>   hw/net/virtio-net.c            | 486 ++++++++++++++++++++++++++++++++++++++++-
>>   include/hw/virtio/virtio-net.h |   1 +
>>   include/hw/virtio/virtio.h     |  72 ++++++
>>   3 files changed, 558 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index 5798f87..c23b45f 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -15,10 +15,12 @@
>>   #include "qemu/iov.h"
>>   #include "hw/virtio/virtio.h"
>>   #include "net/net.h"
>> +#include "net/eth.h"
>>   #include "net/checksum.h"
>>   #include "net/tap.h"
>>   #include "qemu/error-report.h"
>>   #include "qemu/timer.h"
>> +#include "qemu/sockets.h"
>>   #include "hw/virtio/virtio-net.h"
>>   #include "net/vhost_net.h"
>>   #include "hw/virtio/virtio-bus.h"
>> @@ -38,6 +40,35 @@
>>   #define endof(container, field) \
>>       (offsetof(container, field) + sizeof(((container *)0)->field))
>>   
>> +#define ETH_HDR_SZ (sizeof(struct eth_header))
>> +#define IP4_HDR_SZ (sizeof(struct ip_header))
>> +#define TCP_HDR_SZ (sizeof(struct tcp_header))
>> +#define ETH_IP4_HDR_SZ (ETH_HDR_SZ + IP4_HDR_SZ)
>> +
>> +#define IP4_ADDR_SIZE   8                   /* ipv4 saddr + daddr */
>> +#define TCP_PORT_SIZE   4                   /* sport + dport */
>> +
>> +/* IPv4 max payload, 16 bits in the header */
>> +#define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ)
>> +#define MAX_TCP_PAYLOAD 65535
>> +
>> +/* max payload with virtio header */
>> +#define MAX_VIRTIO_PAYLOAD  (sizeof(struct virtio_net_hdr_mrg_rxbuf) \
>> +                                + ETH_HDR_SZ + MAX_TCP_PAYLOAD)
> Should we use guest_hdr_len instead of sizeof() here? Consider the
> vnet_hdr will be extended in the future.

Sure.

>
>> +
>> +#define IP4_HEADER_LEN 5 /* header lenght value in ip header without option */
> type, should be 'length'

ok.

>
>> +
>> +/* Purge coalesced packets timer interval */
>> +#define RSC_TIMER_INTERVAL  300000
>> +
>> +/* Switcher to enable/disable rsc */
>> +static bool virtio_net_rsc_bypass = 1;
>> +
>> +/* This value affects the performance a lot, and should be tuned carefully,
>> +   '300000'(300us) is the recommended value to pass the WHQL test, '50000' can
>> +   gain 2x netperf throughput with tso/gso/gro 'off'. */
>> +static uint32_t virtio_net_rsc_timeout = RSC_TIMER_INTERVAL;
>> +
>>   typedef struct VirtIOFeature {
>>       uint32_t flags;
>>       size_t end;
>> @@ -1089,7 +1120,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
>>       return 0;
>>   }
>>   
>> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size)
>> +static ssize_t virtio_net_do_receive(NetClientState *nc,
>> +                                  const uint8_t *buf, size_t size)
> indentation should also changed here.

ok.

>
>>   {
>>       VirtIONet *n = qemu_get_nic_opaque(nc);
>>       VirtIONetQueue *q = virtio_net_get_subqueue(nc);
>> @@ -1685,6 +1717,456 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f,
>>       return 0;
>>   }
>>   
>> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
>> +                                         const uint8_t *buf, NetRscUnit* unit)
>> +{
>> +    uint16_t ip_hdrlen;
>> +
>> +    unit->ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
>> +    ip_hdrlen = ((0xF & unit->ip->ip_ver_len) << 2);
>> +    unit->ip_plen = &unit->ip->ip_len;
>> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
>> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
>> +    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
>> +}
>> +
>> +static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip)
>> +{
>> +    uint32_t sum;
>> +
>> +    ip->ip_sum = 0;
>> +    sum = net_checksum_add_cont(IP4_HDR_SZ, (uint8_t *)ip, 0);
>> +    ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
>> +}
>> +
>> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
>> +{
>> +    int ret;
>> +
>> +    virtio_net_rsc_ipv4_checksum(seg->unit.ip);
>> +    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
>> +    QTAILQ_REMOVE(&chain->buffers, seg, next);
>> +    g_free(seg->buf);
>> +    g_free(seg);
>> +
>> +    return ret;
>> +}
>> +
>> +static void virtio_net_rsc_purge(void *opq)
>> +{
>> +    NetRscChain *chain = (NetRscChain *)opq;
>> +    NetRscSeg *seg, *rn;
>> +
>> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
>> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
>> +            chain->stat.purge_failed++;
>> +            continue;
>> +        }
>> +    }
>> +
>> +    chain->stat.timer++;
>> +    if (!QTAILQ_EMPTY(&chain->buffers)) {
>> +        timer_mod(chain->drain_timer,
>> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
>> +    }
>> +}
>> +
>> +static void virtio_net_rsc_cleanup(VirtIONet *n)
>> +{
>> +    NetRscChain *chain, *rn_chain;
>> +    NetRscSeg *seg, *rn_seg;
>> +
>> +    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
>> +        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
>> +            QTAILQ_REMOVE(&chain->buffers, seg, next);
>> +            g_free(seg->buf);
>> +            g_free(seg);
>> +        }
>> +
>> +        timer_del(chain->drain_timer);
>> +        timer_free(chain->drain_timer);
>> +        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
>> +        g_free(chain);
>> +    }
>> +}
>> +
>> +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
>> +                                     const uint8_t *buf, size_t size)
>> +{
>> +    NetRscSeg *seg;
>> +
>> +    seg = g_malloc(sizeof(NetRscSeg));
>> +    seg->buf = g_malloc(MAX_VIRTIO_PAYLOAD);
>> +
>> +    memmove(seg->buf, buf, size);
> Can seg->buf overlap with buf? If not, why use memmove() instead of
> memcpy()?

Yes, they will not overlap, memcpy() is good for it.

>
>> +    seg->size = size;
>> +    seg->dup_ack_count = 0;
>> +    seg->is_coalesced = 0;
>> +    seg->nc = nc;
>> +
>> +    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
>> +    chain->stat.cache++;
>> +
>> +    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
>> +}
>> +
>> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSeg *seg,
>> +                                 const uint8_t *buf, struct tcp_header *n_tcp,
>> +                                 struct tcp_header *o_tcp)
>> +{
>> +    uint32_t nack, oack;
>> +    uint16_t nwin, owin;
>> +
>> +    nack = htonl(n_tcp->th_ack);
>> +    nwin = htons(n_tcp->th_win);
>> +    oack = htonl(o_tcp->th_ack);
>> +    owin = htons(o_tcp->th_win);
>> +
>> +    if ((nack - oack) >= MAX_TCP_PAYLOAD) {
>> +        chain->stat.ack_out_of_win++;
>> +        return RSC_FINAL;
>> +    } else if (nack == oack) {
>> +        /* duplicated ack or window probe */
>> +        if (nwin == owin) {
>> +            /* duplicated ack, add dup ack count due to whql test up to 1 */
>> +            chain->stat.dup_ack++;
>> +
>> +            if (seg->dup_ack_count == 0) {
>> +                seg->dup_ack_count++;
>> +                chain->stat.dup_ack1++;
>> +                return RSC_COALESCE;
>> +            } else {
>> +                /* Spec says should send it directly */
>> +                chain->stat.dup_ack2++;
>> +                return RSC_FINAL;
>> +            }
>> +        } else {
>> +            /* Coalesce window update */
>> +            o_tcp->th_win = n_tcp->th_win;
>> +            chain->stat.win_update++;
>> +            return RSC_COALESCE;
>> +        }
>> +    } else {
>> +        /* pure ack, update ack */
>> +        o_tcp->th_ack = n_tcp->th_ack;
>> +        chain->stat.pure_ack++;
>> +        return RSC_COALESCE;
> Looks like there're something I missed. The spec said:
>
> "In other words, any pure ACK that is not a duplicate ACK or a window
> update triggers an exception and must not be coalesced. All such pure
> ACKs must be indicated as individual segments."
>
> Does it mean we *should not* coalesce windows update and pure ack?
> (Since it can wakeup transmission)?

It's also a little bit inexplicit and flexible due to the spec, please see the flowchart I on the same page.

Comments about the  flowchart:
------------------------------------------------------------------------
The first of the following two flowcharts describes the rules for coalescing segments and updating the TCP headers.
This flowchart refers to mechanisms for distinguishing valid duplicate ACKs and window updates. The second flowchart describes these mechanisms.
------------------------------------------------------------------------
As show in the flowchart, only status 'C' will break current scu and get finalized, both 'A' and 'B' can be coalesced afaik.

>
>> +    }
>> +}
>> +
>> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, NetRscSeg *seg,
>> +                                    const uint8_t *buf, NetRscUnit *n_unit)
>> +{
>> +    void *data;
>> +    uint16_t o_ip_len;
>> +    uint32_t nseq, oseq;
>> +    NetRscUnit *o_unit;
>> +
>> +    o_unit = &seg->unit;
>> +    o_ip_len = htons(*o_unit->ip_plen);
>> +    nseq = htonl(n_unit->tcp->th_seq);
>> +    oseq = htonl(o_unit->tcp->th_seq);
>> +
>> +    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
>> +        /* Log this only for debugging observation */
>> +        chain->stat.tcp_option++;
>> +    }
>> +
>> +    /* Ignore packet with more/larger tcp options */
>> +    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {
> What if n_unit->tcp_hdrlen > o_uint->tcp_hdr_len ?
do you mean '<'? that also means some option changed, maybe i should 
include it.
>
>> +        chain->stat.tcp_larger_option++;
>> +        return RSC_FINAL;
>> +    }
>> +
>> +    /* out of order or retransmitted. */
>> +    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
>> +        chain->stat.data_out_of_win++;
>> +        return RSC_FINAL;
>> +    }
>> +
>> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
>> +    if (nseq == oseq) {
>> +        if ((0 == o_unit->payload) && n_unit->payload) {
>> +            /* From no payload to payload, normal case, not a dup ack or etc */
>> +            chain->stat.data_after_pure_ack++;
>> +            goto coalesce;
>> +        } else {
>> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
>> +                                             n_unit->tcp, o_unit->tcp);
>> +        }
>> +    } else if ((nseq - oseq) != o_unit->payload) {
>> +        /* Not a consistent packet, out of order */
>> +        chain->stat.data_out_of_order++;
>> +        return RSC_FINAL;
>> +    } else {
>> +coalesce:
>> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
>> +            chain->stat.over_size++;
>> +            return RSC_FINAL;
>> +        }
>> +
>> +        /* Here comes the right data, the payload lengh in v4/v6 is different,
>> +           so use the field value to update and record the new data len */
>> +        o_unit->payload += n_unit->payload; /* update new data len */
>> +
>> +        /* update field in ip header */
>> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
>> +
>> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
>> +           for windows guest, while this may change the behavior for linux
>> +           guest. */
> This needs more thought, 'can' probably means don't. (Linux GRO won't
> merge PUSH packet).
Yes, since it's mainly for win guest, how about take this as this first 
and do more test and see how to handle it?
>
>> +        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
>> +
>> +        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
>> +        o_unit->tcp->th_win = n_unit->tcp->th_win;
>> +
>> +        memmove(seg->buf + seg->size, data, n_unit->payload);
>> +        seg->size += n_unit->payload;
>> +        chain->stat.coalesced++;
>> +        return RSC_COALESCE;
>> +    }
>> +}
>> +
>> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
>> +                        const uint8_t *buf, size_t size, NetRscUnit *unit)
>> +{
>> +    if ((unit->ip->ip_src ^ seg->unit.ip->ip_src)
>> +        || (unit->ip->ip_dst ^ seg->unit.ip->ip_dst)
>> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
>> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
>> +        chain->stat.no_match++;
>> +        return RSC_NO_MATCH;
>> +    }
>> +
>> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
>> +}
>> +
>> +/* Pakcets with 'SYN' should bypass, other flag should be sent after drain
>> + * to prevent out of order */
>> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
>> +                                         struct tcp_header *tcp)
>> +{
>> +    uint16_t tcp_flag;
>> +
>> +    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
>> +    if (tcp_flag & TH_SYN) {
>> +        chain->stat.tcp_syn++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) {
>> +        chain->stat.tcp_ctrl_drain++;
>> +        return RSC_FINAL;
>> +    }
>> +
>> +    return RSC_WANT;
>> +}
>> +
>> +static bool virtio_net_rsc_empty_cache(NetRscChain *chain, NetClientState *nc,
>> +                          const uint8_t *buf, size_t size)
> indentation looks wrong.
ok.
>
>> +{
>> +    if (QTAILQ_EMPTY(&chain->buffers)) {
>> +        chain->stat.empty_cache++;
>> +        virtio_net_rsc_cache_buf(chain, nc, buf, size);
>> +        timer_mod(chain->drain_timer,
>> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
>> +        return 1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
>> +                          const uint8_t *buf, size_t size, NetRscUnit *unit)
>> +{
> and here.
ok.
>
>> +    int ret;
>> +    NetRscSeg *seg, *nseg;
>> +
>> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
>> +        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
>> +
>> +        if (ret == RSC_FINAL) {
>> +            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
>> +                /* Send failed */
>> +                chain->stat.final_failed++;
>> +                return 0;
>> +            }
>> +
>> +            /* Send current packet */
>> +            return virtio_net_do_receive(nc, buf, size);
>> +        } else if (ret == RSC_NO_MATCH) {
>> +            continue;
>> +        } else {
>> +            /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */
>> +            seg->is_coalesced = 1;
>> +            return size;
>> +        }
>> +    }
>> +
>> +    chain->stat.no_match_cache++;
>> +    virtio_net_rsc_cache_buf(chain, nc, buf, size);
>> +    return size;
>> +}
>> +
>> +/* Drain a connection data, this is to avoid out of order segments */
>> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
>> +                        const uint8_t *buf, size_t size, uint16_t ip_start,
>> +                        uint16_t ip_size, uint16_t tcp_port, uint16_t port_size)
>> +{
>> +    NetRscSeg *seg, *nseg;
>> +
>> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
>> +        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
>> +            || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size)) {
>> +            continue;
>> +        }
>> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
>> +            chain->stat.drain_failed++;
>> +        }
>> +
>> +        break;
>> +    }
>> +
>> +    return virtio_net_do_receive(nc, buf, size);
>> +}
>> +
>> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
>> +                        struct ip_header *ip, const uint8_t *buf, size_t size)
>> +{
>> +    uint16_t ip_len;
>> +
>> +    if (size < (chain->hdr_size + ETH_IP4_HDR_SZ + TCP_HDR_SZ)) {
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Not an ipv4 one */
>> +    if (((0xF0 & ip->ip_ver_len) >> 4) != IP_HEADER_VERSION_4) {
> I've replied this several times, please use a consistent style. E.g
> "ip->ip_ver_len & 0xF0".
Sorry, i used to think you meant the 'IP_HEADER_VERSION_4'.
>
>> +        chain->stat.ip_option++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Don't handle packets with ip option */
>> +    if (IP4_HEADER_LEN != (0xF & ip->ip_ver_len)) {
>> +        chain->stat.ip_option++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Don't handle packets with ip fragment */
>> +    if (!(htons(ip->ip_off) & IP_DF)) {
>> +        chain->stat.ip_frag++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    if (ip->ip_p != IPPROTO_TCP) {
>> +        chain->stat.bypass_not_tcp++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Sanity check */
>> +    ip_len = htons(ip->ip_len);
>> +    if (ip_len < (IP4_HDR_SZ + TCP_HDR_SZ)
>> +        || ip_len > (size - chain->hdr_size - ETH_HDR_SZ)) {
>> +        chain->stat.ip_hacked++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    return RSC_WANT;
>> +}
>> +
>> +static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
>> +                                      const uint8_t *buf, size_t size)
>> +{
>> +    int32_t ret;
>> +    NetRscChain *chain;
>> +    NetRscUnit unit;
>> +
>> +    chain = (NetRscChain *)opq;
>> +    virtio_net_rsc_extract_unit4(chain, buf, &unit);
>> +    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    }
>> +
>> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
>> +    if (ret == RSC_BYPASS) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    } else if (ret == RSC_FINAL) {
>> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
>> +                        ((chain->hdr_size + ETH_HDR_SZ) + 12), IP4_ADDR_SIZE,
>> +                        (chain->hdr_size + ETH_IP4_HDR_SZ), TCP_PORT_SIZE);
>> +    }
>> +
>> +    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
>> +        return size;
>> +    }
>> +
>> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
>> +}
>> +
>> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
>> +                                            NetClientState *nc, uint16_t proto)
>> +{
>> +    NetRscChain *chain;
>> +
>> +    /* Only handle IPv4/6 */
>> +    if (proto != (uint16_t)ETH_P_IP) {
> The code is conflict with the comment above.
ok.
>
>> +        return NULL;
>> +    }
>> +
>> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
>> +        if (chain->proto == proto) {
>> +            return chain;
>> +        }
>> +    }
>> +
>> +    chain = g_malloc(sizeof(*chain));
>> +    chain->hdr_size = n->guest_hdr_len;
> Why introduce a specified field for instead of just use n->guest_hdr_len?
this is to reduce invoking times to find VirtIONet by 'nc', there are a 
few places will use it.
>
>> +    chain->proto = proto;
>> +    chain->max_payload = MAX_IP4_PAYLOAD;
>> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>> +                                      virtio_net_rsc_purge, chain);
>> +    memset(&chain->stat, 0, sizeof(chain->stat));
>> +
>> +    QTAILQ_INIT(&chain->buffers);
>> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
>> +
>> +    return chain;
>> +}
>> +
>> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
>> +                                      const uint8_t *buf, size_t size)
>> +{
>> +    uint16_t proto;
>> +    NetRscChain *chain;
>> +    struct eth_header *eth;
>> +    VirtIONet *n;
>> +
>> +    n = qemu_get_nic_opaque(nc);
>> +    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    }
>> +
>> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
>> +    proto = htons(eth->h_proto);
>> +
>> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
>> +    if (!chain) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    } else {
>> +        chain->stat.received++;
>> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
>> +    }
>> +}
>> +
>> +static ssize_t virtio_net_receive(NetClientState *nc,
>> +                                  const uint8_t *buf, size_t size)
>> +{
>> +    if (virtio_net_rsc_bypass) {
>> +        return virtio_net_do_receive(nc, buf, size);
> You need a feature bit for this and compat it for older machine types.
> And also need some work on virtio spec I think.
yes, not sure which way is good to support this, hmp/qmp/ethtool, this 
is gonna to support win guest,
so need a well-compatible interface, any comments?
>
>> +    } else {
>> +        return virtio_net_rsc_receive(nc, buf, size);
>> +    }
>> +}
>> +
>>   static NetClientInfo net_virtio_info = {
>>       .type = NET_CLIENT_OPTIONS_KIND_NIC,
>>       .size = sizeof(NICState),
>> @@ -1814,6 +2296,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>>       nc = qemu_get_queue(n->nic);
>>       nc->rxfilter_notify_enabled = 1;
>>   
>> +    QTAILQ_INIT(&n->rsc_chains);
>>       n->qdev = dev;
>>       register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION,
>>                       virtio_net_save, virtio_net_load, n);
>> @@ -1848,6 +2331,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
>>       g_free(n->vqs);
>>       qemu_del_nic(n->nic);
>>       virtio_cleanup(vdev);
>> +    virtio_net_rsc_cleanup(n);
>>   }
>>   
>>   static void virtio_net_instance_init(Object *obj)
>> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
>> index 0cabdb6..6939e92 100644
>> --- a/include/hw/virtio/virtio-net.h
>> +++ b/include/hw/virtio/virtio-net.h
>> @@ -59,6 +59,7 @@ typedef struct VirtIONet {
>>       VirtIONetQueue *vqs;
>>       VirtQueue *ctrl_vq;
>>       NICState *nic;
>> +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
>>       uint32_t tx_timeout;
>>       int32_t tx_burst;
>>       uint32_t has_vnet_hdr;
>> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
>> index 2b5b248..3b1dfa8 100644
>> --- a/include/hw/virtio/virtio.h
>> +++ b/include/hw/virtio/virtio.h
>> @@ -128,6 +128,78 @@ typedef struct VirtioDeviceClass {
>>       int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
>>   } VirtioDeviceClass;
>>   
>> +/* Coalesced packets type & status */
>> +typedef enum {
>> +    RSC_COALESCE,           /* Data been coalesced */
>> +    RSC_FINAL,              /* Will terminate current connection */
>> +    RSC_NO_MATCH,           /* No matched in the buffer pool */
>> +    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp ctrl, etc */
>> +    RSC_WANT                /* Data want to be coalesced */
>> +} COALESCE_STATUS;
>> +
>> +typedef struct NetRscStat {
>> +    uint32_t received;
>> +    uint32_t coalesced;
>> +    uint32_t over_size;
>> +    uint32_t cache;
>> +    uint32_t empty_cache;
>> +    uint32_t no_match_cache;
>> +    uint32_t win_update;
>> +    uint32_t no_match;
>> +    uint32_t tcp_syn;
>> +    uint32_t tcp_ctrl_drain;
>> +    uint32_t dup_ack;
>> +    uint32_t dup_ack1;
>> +    uint32_t dup_ack2;
>> +    uint32_t pure_ack;
>> +    uint32_t ack_out_of_win;
>> +    uint32_t data_out_of_win;
>> +    uint32_t data_out_of_order;
>> +    uint32_t data_after_pure_ack;
>> +    uint32_t bypass_not_tcp;
>> +    uint32_t tcp_option;
>> +    uint32_t tcp_larger_option;
>> +    uint32_t ip_frag;
>> +    uint32_t ip_hacked;
>> +    uint32_t ip_option;
>> +    uint32_t purge_failed;
>> +    uint32_t drain_failed;
>> +    uint32_t final_failed;
>> +    int64_t  timer;
>> +} NetRscStat;
>> +
>> +/* Rsc unit general info used to checking if can coalescing */
>> +typedef struct NetRscUnit {
>> +   struct ip_header *ip;   /* ip header */
>> +   uint16_t *ip_plen;      /* data len pointer in ip header field */
>> +   struct tcp_header *tcp; /* tcp header */
>> +   uint16_t tcp_hdrlen;    /* tcp header len */
>> +   uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
>> +} NetRscUnit;
>> +
>> +/* Coalesced segmant */
>> +typedef struct NetRscSeg {
>> +    QTAILQ_ENTRY(NetRscSeg) next;
>> +    void *buf;
>> +    size_t size;
>> +    uint32_t dup_ack_count;
>> +    bool is_coalesced;      /* need recal ipv4 header checksum, mark here */
>> +    NetRscUnit unit;
>> +    NetClientState *nc;
>> +} NetRscSeg;
>> +
>> +
>> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
>> +typedef struct NetRscChain {
>> +   QTAILQ_ENTRY(NetRscChain) next;
>> +   uint16_t hdr_size;
>> +   uint16_t proto;
>> +   uint16_t max_payload;
>> +   QEMUTimer *drain_timer;
>> +   QTAILQ_HEAD(, NetRscSeg) buffers;
>> +   NetRscStat stat;
>> +} NetRscChain;
>> +
>>   void virtio_instance_init_common(Object *proxy_obj, void *data,
>>                                    size_t vdev_size, const char *vdev_name);
>>   
>
>
Jason Wang March 18, 2016, 2:03 a.m. UTC | #5
On 03/18/2016 12:45 AM, Wei Xu wrote:
> On 2016年03月17日 16:42, Jason Wang wrote:
>
>>
>> On 03/15/2016 05:17 PM, wexu@redhat.com wrote:
>>> From: Wei Xu <wexu@redhat.com>
>>>
>>> All the data packets in a tcp connection will be cached to a big buffer
>>> in every receive interval, and will be sent out via a timer, the
>>> 'virtio_net_rsc_timeout' controls the interval, the value will
>>> influent the
>>> performance and response of tcp connection extremely, 50000(50us) is a
>>> experience value to gain a performance improvement, since the whql test
>>> sends packets every 100us, so '300000(300us)' can pass the test case,
>>> this is also the default value, it's gonna to be tunable.
>>>
>>> The timer will only be triggered if the packets pool is not empty,
>>> and it'll drain off all the cached packets
>>>
>>> 'NetRscChain' is used to save the segments of different protocols in a
>>> VirtIONet device.
>>>
>>> The main handler of TCP includes TCP window update, duplicated ACK
>>> check
>>> and the real data coalescing if the new segment passed sanity check
>>> and is identified as an 'wanted' one.
>>>
>>> An 'wanted' segment means:
>>> 1. Segment is within current window and the sequence is the expected
>>> one.
>>> 2. ACK of the segment is in the valid window.
>>> 3. If the ACK in the segment is a duplicated one, then it must less
>>> than 2,
>>>     this is to notify upper layer TCP starting retransmission due to
>>> the spec.
>>>
>>> Sanity check includes:
>>> 1. Incorrect version in IP header
>>> 2. IP options & IP fragment
>>> 3. Not a TCP packets
>>> 4. Sanity size check to prevent buffer overflow attack.
>>>
>>> There maybe more cases should be considered such as ip
>>> identification other
>>> flags, while it broke the test because windows set it to the same
>>> even it's
>>> not a fragment.
>>>
>>> Normally it includes 2 typical ways to handle a TCP control flag,
>>> 'bypass'
>>> and 'finalize', 'bypass' means should be sent out directly, and
>>> 'finalize'
>>> means the packets should also be bypassed, and this should be done
>>> after searching for the same connection packets in the pool and sending
>>> all of them out, this is to avoid out of data.
>>>
>>> All the 'SYN' packets will be bypassed since this always begin a new'
>>> connection, other flags such 'FIN/RST' will trigger a finalization,
>>> because
>>> this normally happens upon a connection is going to be closed, an
>>> 'URG' packet
>>> also finalize current coalescing unit while there maybe protocol
>>> difference to
>>> different OS.
>> But URG packet should be sent as quickly as possible regardless of
>> ordering, no?
>
> Yes, you right, URG will terminate the current 'SCU', i'll amend the
> commit log.
>
>>
>>> Statistics can be used to monitor the basic coalescing status, the
>>> 'out of order'
>>> and 'out of window' means how many retransmitting packets, thus
>>> describe the
>>> performance intuitively.
>>>
>>> Signed-off-by: Wei Xu <wexu@redhat.com>
>>> ---
>>>   hw/net/virtio-net.c            | 486
>>> ++++++++++++++++++++++++++++++++++++++++-
>>>   include/hw/virtio/virtio-net.h |   1 +
>>>   include/hw/virtio/virtio.h     |  72 ++++++
>>>   3 files changed, 558 insertions(+), 1 deletion(-)

[...]

>>> +        } else {
>>> +            /* Coalesce window update */
>>> +            o_tcp->th_win = n_tcp->th_win;
>>> +            chain->stat.win_update++;
>>> +            return RSC_COALESCE;
>>> +        }
>>> +    } else {
>>> +        /* pure ack, update ack */
>>> +        o_tcp->th_ack = n_tcp->th_ack;
>>> +        chain->stat.pure_ack++;
>>> +        return RSC_COALESCE;
>> Looks like there're something I missed. The spec said:
>>
>> "In other words, any pure ACK that is not a duplicate ACK or a window
>> update triggers an exception and must not be coalesced. All such pure
>> ACKs must be indicated as individual segments."
>>
>> Does it mean we *should not* coalesce windows update and pure ack?
>> (Since it can wakeup transmission)?
>
> It's also a little bit inexplicit and flexible due to the spec, please
> see the flowchart I on the same page.
>
> Comments about the  flowchart:
> ------------------------------------------------------------------------
> The first of the following two flowcharts describes the rules for
> coalescing segments and updating the TCP headers.
> This flowchart refers to mechanisms for distinguishing valid duplicate
> ACKs and window updates. The second flowchart describes these mechanisms.
> ------------------------------------------------------------------------
> As show in the flowchart, only status 'C' will break current scu and
> get finalized, both 'A' and 'B' can be coalesced afaik.
>

Interesting, looks like you're right.

>>
>>> +    }
>>> +}
>>> +
>>> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain,
>>> NetRscSeg *seg,
>>> +                                    const uint8_t *buf, NetRscUnit
>>> *n_unit)
>>> +{
>>> +    void *data;
>>> +    uint16_t o_ip_len;
>>> +    uint32_t nseq, oseq;
>>> +    NetRscUnit *o_unit;
>>> +
>>> +    o_unit = &seg->unit;
>>> +    o_ip_len = htons(*o_unit->ip_plen);
>>> +    nseq = htonl(n_unit->tcp->th_seq);
>>> +    oseq = htonl(o_unit->tcp->th_seq);
>>> +
>>> +    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
>>> +        /* Log this only for debugging observation */
>>> +        chain->stat.tcp_option++;
>>> +    }
>>> +
>>> +    /* Ignore packet with more/larger tcp options */
>>> +    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {
>> What if n_unit->tcp_hdrlen > o_uint->tcp_hdr_len ?
> do you mean '<'? that also means some option changed, maybe i should
> include it.

Yes.

>>
>>> +        chain->stat.tcp_larger_option++;
>>> +        return RSC_FINAL;
>>> +    }
>>> +
>>> +    /* out of order or retransmitted. */
>>> +    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
>>> +        chain->stat.data_out_of_win++;
>>> +        return RSC_FINAL;
>>> +    }
>>> +
>>> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
>>> +    if (nseq == oseq) {
>>> +        if ((0 == o_unit->payload) && n_unit->payload) {
>>> +            /* From no payload to payload, normal case, not a dup
>>> ack or etc */
>>> +            chain->stat.data_after_pure_ack++;
>>> +            goto coalesce;
>>> +        } else {
>>> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
>>> +                                             n_unit->tcp,
>>> o_unit->tcp);
>>> +        }
>>> +    } else if ((nseq - oseq) != o_unit->payload) {
>>> +        /* Not a consistent packet, out of order */
>>> +        chain->stat.data_out_of_order++;
>>> +        return RSC_FINAL;
>>> +    } else {
>>> +coalesce:
>>> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
>>> +            chain->stat.over_size++;
>>> +            return RSC_FINAL;
>>> +        }
>>> +
>>> +        /* Here comes the right data, the payload lengh in v4/v6 is
>>> different,
>>> +           so use the field value to update and record the new data
>>> len */
>>> +        o_unit->payload += n_unit->payload; /* update new data len */
>>> +
>>> +        /* update field in ip header */
>>> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
>>> +
>>> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be
>>> coalesced
>>> +           for windows guest, while this may change the behavior
>>> for linux
>>> +           guest. */
>> This needs more thought, 'can' probably means don't. (Linux GRO won't
>> merge PUSH packet).
> Yes, since it's mainly for win guest, how about take this as this
> first and do more test and see how to handle it?

Right, this is not an issue probably but just an optimization for latency.

[...]

>>
>>> +        return NULL;
>>> +    }
>>> +
>>> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
>>> +        if (chain->proto == proto) {
>>> +            return chain;
>>> +        }
>>> +    }
>>> +
>>> +    chain = g_malloc(sizeof(*chain));
>>> +    chain->hdr_size = n->guest_hdr_len;
>> Why introduce a specified field for instead of just use
>> n->guest_hdr_len?
> this is to reduce invoking times to find VirtIONet by 'nc', there are
> a few places will use it.

Okay, then I think you'd better keep a pointer to VirtIONet structure
instead. (Consider you may want to refer more data from it, we don't
want to duplicate fields in two places).

>>
>>> +    chain->proto = proto;
>>> +    chain->max_payload = MAX_IP4_PAYLOAD;
>>> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>>> +                                      virtio_net_rsc_purge, chain);
>>> +    memset(&chain->stat, 0, sizeof(chain->stat));
>>> +
>>> +    QTAILQ_INIT(&chain->buffers);
>>> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
>>> +
>>> +    return chain;
>>> +}
>>> +
>>> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
>>> +                                      const uint8_t *buf, size_t size)
>>> +{
>>> +    uint16_t proto;
>>> +    NetRscChain *chain;
>>> +    struct eth_header *eth;
>>> +    VirtIONet *n;
>>> +
>>> +    n = qemu_get_nic_opaque(nc);
>>> +    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
>>> +        return virtio_net_do_receive(nc, buf, size);
>>> +    }
>>> +
>>> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
>>> +    proto = htons(eth->h_proto);
>>> +
>>> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
>>> +    if (!chain) {
>>> +        return virtio_net_do_receive(nc, buf, size);
>>> +    } else {
>>> +        chain->stat.received++;
>>> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
>>> +    }
>>> +}
>>> +
>>> +static ssize_t virtio_net_receive(NetClientState *nc,
>>> +                                  const uint8_t *buf, size_t size)
>>> +{
>>> +    if (virtio_net_rsc_bypass) {
>>> +        return virtio_net_do_receive(nc, buf, size);
>> You need a feature bit for this and compat it for older machine types.
>> And also need some work on virtio spec I think.
> yes, not sure which way is good to support this, hmp/qmp/ethtool, this
> is gonna to support win guest,
> so need a well-compatible interface, any comments?

I think this should be implemented through feature bits/negotiation
instead of something like ethtool.

[...]
Wei Xu March 18, 2016, 4:17 a.m. UTC | #6
On 2016年03月18日 10:03, Jason Wang wrote:
>
> On 03/18/2016 12:45 AM, Wei Xu wrote:
>> On 2016年03月17日 16:42, Jason Wang wrote:
>>
>>> On 03/15/2016 05:17 PM, wexu@redhat.com wrote:
>>>> From: Wei Xu <wexu@redhat.com>
>>>>
>>>> All the data packets in a tcp connection will be cached to a big buffer
>>>> in every receive interval, and will be sent out via a timer, the
>>>> 'virtio_net_rsc_timeout' controls the interval, the value will
>>>> influent the
>>>> performance and response of tcp connection extremely, 50000(50us) is a
>>>> experience value to gain a performance improvement, since the whql test
>>>> sends packets every 100us, so '300000(300us)' can pass the test case,
>>>> this is also the default value, it's gonna to be tunable.
>>>>
>>>> The timer will only be triggered if the packets pool is not empty,
>>>> and it'll drain off all the cached packets
>>>>
>>>> 'NetRscChain' is used to save the segments of different protocols in a
>>>> VirtIONet device.
>>>>
>>>> The main handler of TCP includes TCP window update, duplicated ACK
>>>> check
>>>> and the real data coalescing if the new segment passed sanity check
>>>> and is identified as an 'wanted' one.
>>>>
>>>> An 'wanted' segment means:
>>>> 1. Segment is within current window and the sequence is the expected
>>>> one.
>>>> 2. ACK of the segment is in the valid window.
>>>> 3. If the ACK in the segment is a duplicated one, then it must less
>>>> than 2,
>>>>      this is to notify upper layer TCP starting retransmission due to
>>>> the spec.
>>>>
>>>> Sanity check includes:
>>>> 1. Incorrect version in IP header
>>>> 2. IP options & IP fragment
>>>> 3. Not a TCP packets
>>>> 4. Sanity size check to prevent buffer overflow attack.
>>>>
>>>> There maybe more cases should be considered such as ip
>>>> identification other
>>>> flags, while it broke the test because windows set it to the same
>>>> even it's
>>>> not a fragment.
>>>>
>>>> Normally it includes 2 typical ways to handle a TCP control flag,
>>>> 'bypass'
>>>> and 'finalize', 'bypass' means should be sent out directly, and
>>>> 'finalize'
>>>> means the packets should also be bypassed, and this should be done
>>>> after searching for the same connection packets in the pool and sending
>>>> all of them out, this is to avoid out of data.
>>>>
>>>> All the 'SYN' packets will be bypassed since this always begin a new'
>>>> connection, other flags such 'FIN/RST' will trigger a finalization,
>>>> because
>>>> this normally happens upon a connection is going to be closed, an
>>>> 'URG' packet
>>>> also finalize current coalescing unit while there maybe protocol
>>>> difference to
>>>> different OS.
>>> But URG packet should be sent as quickly as possible regardless of
>>> ordering, no?
>> Yes, you right, URG will terminate the current 'SCU', i'll amend the
>> commit log.
>>
>>>> Statistics can be used to monitor the basic coalescing status, the
>>>> 'out of order'
>>>> and 'out of window' means how many retransmitting packets, thus
>>>> describe the
>>>> performance intuitively.
>>>>
>>>> Signed-off-by: Wei Xu <wexu@redhat.com>
>>>> ---
>>>>    hw/net/virtio-net.c            | 486
>>>> ++++++++++++++++++++++++++++++++++++++++-
>>>>    include/hw/virtio/virtio-net.h |   1 +
>>>>    include/hw/virtio/virtio.h     |  72 ++++++
>>>>    3 files changed, 558 insertions(+), 1 deletion(-)
> [...]
>
>>>> +        } else {
>>>> +            /* Coalesce window update */
>>>> +            o_tcp->th_win = n_tcp->th_win;
>>>> +            chain->stat.win_update++;
>>>> +            return RSC_COALESCE;
>>>> +        }
>>>> +    } else {
>>>> +        /* pure ack, update ack */
>>>> +        o_tcp->th_ack = n_tcp->th_ack;
>>>> +        chain->stat.pure_ack++;
>>>> +        return RSC_COALESCE;
>>> Looks like there're something I missed. The spec said:
>>>
>>> "In other words, any pure ACK that is not a duplicate ACK or a window
>>> update triggers an exception and must not be coalesced. All such pure
>>> ACKs must be indicated as individual segments."
>>>
>>> Does it mean we *should not* coalesce windows update and pure ack?
>>> (Since it can wakeup transmission)?
>> It's also a little bit inexplicit and flexible due to the spec, please
>> see the flowchart I on the same page.
>>
>> Comments about the  flowchart:
>> ------------------------------------------------------------------------
>> The first of the following two flowcharts describes the rules for
>> coalescing segments and updating the TCP headers.
>> This flowchart refers to mechanisms for distinguishing valid duplicate
>> ACKs and window updates. The second flowchart describes these mechanisms.
>> ------------------------------------------------------------------------
>> As show in the flowchart, only status 'C' will break current scu and
>> get finalized, both 'A' and 'B' can be coalesced afaik.
>>
> Interesting, looks like you're right.
>
>>>> +    }
>>>> +}
>>>> +
>>>> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain,
>>>> NetRscSeg *seg,
>>>> +                                    const uint8_t *buf, NetRscUnit
>>>> *n_unit)
>>>> +{
>>>> +    void *data;
>>>> +    uint16_t o_ip_len;
>>>> +    uint32_t nseq, oseq;
>>>> +    NetRscUnit *o_unit;
>>>> +
>>>> +    o_unit = &seg->unit;
>>>> +    o_ip_len = htons(*o_unit->ip_plen);
>>>> +    nseq = htonl(n_unit->tcp->th_seq);
>>>> +    oseq = htonl(o_unit->tcp->th_seq);
>>>> +
>>>> +    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
>>>> +        /* Log this only for debugging observation */
>>>> +        chain->stat.tcp_option++;
>>>> +    }
>>>> +
>>>> +    /* Ignore packet with more/larger tcp options */
>>>> +    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {
>>> What if n_unit->tcp_hdrlen > o_uint->tcp_hdr_len ?
>> do you mean '<'? that also means some option changed, maybe i should
>> include it.
> Yes.
>
>>>> +        chain->stat.tcp_larger_option++;
>>>> +        return RSC_FINAL;
>>>> +    }
>>>> +
>>>> +    /* out of order or retransmitted. */
>>>> +    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
>>>> +        chain->stat.data_out_of_win++;
>>>> +        return RSC_FINAL;
>>>> +    }
>>>> +
>>>> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
>>>> +    if (nseq == oseq) {
>>>> +        if ((0 == o_unit->payload) && n_unit->payload) {
>>>> +            /* From no payload to payload, normal case, not a dup
>>>> ack or etc */
>>>> +            chain->stat.data_after_pure_ack++;
>>>> +            goto coalesce;
>>>> +        } else {
>>>> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
>>>> +                                             n_unit->tcp,
>>>> o_unit->tcp);
>>>> +        }
>>>> +    } else if ((nseq - oseq) != o_unit->payload) {
>>>> +        /* Not a consistent packet, out of order */
>>>> +        chain->stat.data_out_of_order++;
>>>> +        return RSC_FINAL;
>>>> +    } else {
>>>> +coalesce:
>>>> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
>>>> +            chain->stat.over_size++;
>>>> +            return RSC_FINAL;
>>>> +        }
>>>> +
>>>> +        /* Here comes the right data, the payload lengh in v4/v6 is
>>>> different,
>>>> +           so use the field value to update and record the new data
>>>> len */
>>>> +        o_unit->payload += n_unit->payload; /* update new data len */
>>>> +
>>>> +        /* update field in ip header */
>>>> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
>>>> +
>>>> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be
>>>> coalesced
>>>> +           for windows guest, while this may change the behavior
>>>> for linux
>>>> +           guest. */
>>> This needs more thought, 'can' probably means don't. (Linux GRO won't
>>> merge PUSH packet).
>> Yes, since it's mainly for win guest, how about take this as this
>> first and do more test and see how to handle it?
> Right, this is not an issue probably but just an optimization for latency.
>
> [...]
>
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
>>>> +        if (chain->proto == proto) {
>>>> +            return chain;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    chain = g_malloc(sizeof(*chain));
>>>> +    chain->hdr_size = n->guest_hdr_len;
>>> Why introduce a specified field for instead of just use
>>> n->guest_hdr_len?
>> this is to reduce invoking times to find VirtIONet by 'nc', there are
>> a few places will use it.
> Okay, then I think you'd better keep a pointer to VirtIONet structure
> instead. (Consider you may want to refer more data from it, we don't
> want to duplicate fields in two places).
sure, that's a good idea.
>
>>>> +    chain->proto = proto;
>>>> +    chain->max_payload = MAX_IP4_PAYLOAD;
>>>> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>>>> +                                      virtio_net_rsc_purge, chain);
>>>> +    memset(&chain->stat, 0, sizeof(chain->stat));
>>>> +
>>>> +    QTAILQ_INIT(&chain->buffers);
>>>> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
>>>> +
>>>> +    return chain;
>>>> +}
>>>> +
>>>> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
>>>> +                                      const uint8_t *buf, size_t size)
>>>> +{
>>>> +    uint16_t proto;
>>>> +    NetRscChain *chain;
>>>> +    struct eth_header *eth;
>>>> +    VirtIONet *n;
>>>> +
>>>> +    n = qemu_get_nic_opaque(nc);
>>>> +    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
>>>> +        return virtio_net_do_receive(nc, buf, size);
>>>> +    }
>>>> +
>>>> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
>>>> +    proto = htons(eth->h_proto);
>>>> +
>>>> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
>>>> +    if (!chain) {
>>>> +        return virtio_net_do_receive(nc, buf, size);
>>>> +    } else {
>>>> +        chain->stat.received++;
>>>> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
>>>> +    }
>>>> +}
>>>> +
>>>> +static ssize_t virtio_net_receive(NetClientState *nc,
>>>> +                                  const uint8_t *buf, size_t size)
>>>> +{
>>>> +    if (virtio_net_rsc_bypass) {
>>>> +        return virtio_net_do_receive(nc, buf, size);
>>> You need a feature bit for this and compat it for older machine types.
>>> And also need some work on virtio spec I think.
>> yes, not sure which way is good to support this, hmp/qmp/ethtool, this
>> is gonna to support win guest,
>> so need a well-compatible interface, any comments?
> I think this should be implemented through feature bits/negotiation
> instead of something like ethtool.
Looks this feature should be turn on/off dynamically due to the spec, so 
maybe this should be managed from the guest, is there any reference code 
for this?
>
> [...]
Jason Wang March 18, 2016, 5:20 a.m. UTC | #7
On 03/18/2016 12:17 PM, Wei Xu wrote:
>>>>>
>>>>> +static ssize_t virtio_net_receive(NetClientState *nc,
>>>>> +                                  const uint8_t *buf, size_t size)
>>>>> +{
>>>>> +    if (virtio_net_rsc_bypass) {
>>>>> +        return virtio_net_do_receive(nc, buf, size);
>>>> You need a feature bit for this and compat it for older machine types.
>>>> And also need some work on virtio spec I think.
>>> yes, not sure which way is good to support this, hmp/qmp/ethtool, this
>>> is gonna to support win guest,
>>> so need a well-compatible interface, any comments?
>> I think this should be implemented through feature bits/negotiation
>> instead of something like ethtool.
> Looks this feature should be turn on/off dynamically due to the spec,
> so maybe this should be managed from the guest, is there any reference
> code for this? 

Then you may want to look at implementation of
VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.
Wei Xu March 18, 2016, 6:38 a.m. UTC | #8
On 2016年03月18日 13:20, Jason Wang wrote:
>
> On 03/18/2016 12:17 PM, Wei Xu wrote:
>>>>>> +static ssize_t virtio_net_receive(NetClientState *nc,
>>>>>> +                                  const uint8_t *buf, size_t size)
>>>>>> +{
>>>>>> +    if (virtio_net_rsc_bypass) {
>>>>>> +        return virtio_net_do_receive(nc, buf, size);
>>>>> You need a feature bit for this and compat it for older machine types.
>>>>> And also need some work on virtio spec I think.
>>>> yes, not sure which way is good to support this, hmp/qmp/ethtool, this
>>>> is gonna to support win guest,
>>>> so need a well-compatible interface, any comments?
>>> I think this should be implemented through feature bits/negotiation
>>> instead of something like ethtool.
>> Looks this feature should be turn on/off dynamically due to the spec,
>> so maybe this should be managed from the guest, is there any reference
>> code for this?
> Then you may want to look at implementation of
> VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.
Have a short look at it, do you know how to control the feature bit?  
both when lauching vm and changing it during runtime?
Jason Wang March 18, 2016, 6:56 a.m. UTC | #9
On 03/18/2016 02:38 PM, Wei Xu wrote:
>
>
> On 2016年03月18日 13:20, Jason Wang wrote:
>>
>> On 03/18/2016 12:17 PM, Wei Xu wrote:
>>>>>>> +static ssize_t virtio_net_receive(NetClientState *nc,
>>>>>>> +                                  const uint8_t *buf, size_t size)
>>>>>>> +{
>>>>>>> +    if (virtio_net_rsc_bypass) {
>>>>>>> +        return virtio_net_do_receive(nc, buf, size);
>>>>>> You need a feature bit for this and compat it for older machine
>>>>>> types.
>>>>>> And also need some work on virtio spec I think.
>>>>> yes, not sure which way is good to support this, hmp/qmp/ethtool,
>>>>> this
>>>>> is gonna to support win guest,
>>>>> so need a well-compatible interface, any comments?
>>>> I think this should be implemented through feature bits/negotiation
>>>> instead of something like ethtool.
>>> Looks this feature should be turn on/off dynamically due to the spec,
>>> so maybe this should be managed from the guest, is there any reference
>>> code for this?
>> Then you may want to look at implementation of
>> VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.
> Have a short look at it, do you know how to control the feature bit? 
> both when lauching vm and changing it during runtime? 

Virtio spec and maybe windows driver source code can give you the answer.
Wei Xu March 18, 2016, 2:52 p.m. UTC | #10
On 2016年03月18日 14:56, Jason Wang wrote:
>
> On 03/18/2016 02:38 PM, Wei Xu wrote:
>>
>> On 2016年03月18日 13:20, Jason Wang wrote:
>>> On 03/18/2016 12:17 PM, Wei Xu wrote:
>>>>>>>> +static ssize_t virtio_net_receive(NetClientState *nc,
>>>>>>>> +                                  const uint8_t *buf, size_t size)
>>>>>>>> +{
>>>>>>>> +    if (virtio_net_rsc_bypass) {
>>>>>>>> +        return virtio_net_do_receive(nc, buf, size);
>>>>>>> You need a feature bit for this and compat it for older machine
>>>>>>> types.
>>>>>>> And also need some work on virtio spec I think.
>>>>>> yes, not sure which way is good to support this, hmp/qmp/ethtool,
>>>>>> this
>>>>>> is gonna to support win guest,
>>>>>> so need a well-compatible interface, any comments?
>>>>> I think this should be implemented through feature bits/negotiation
>>>>> instead of something like ethtool.
>>>> Looks this feature should be turn on/off dynamically due to the spec,
>>>> so maybe this should be managed from the guest, is there any reference
>>>> code for this?
>>> Then you may want to look at implementation of
>>> VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.
>> Have a short look at it, do you know how to control the feature bit?
>> both when lauching vm and changing it during runtime?
> Virtio spec and maybe windows driver source code can give you the answer.
OK, will check it out.
diff mbox

Patch

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 5798f87..c23b45f 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -15,10 +15,12 @@ 
 #include "qemu/iov.h"
 #include "hw/virtio/virtio.h"
 #include "net/net.h"
+#include "net/eth.h"
 #include "net/checksum.h"
 #include "net/tap.h"
 #include "qemu/error-report.h"
 #include "qemu/timer.h"
+#include "qemu/sockets.h"
 #include "hw/virtio/virtio-net.h"
 #include "net/vhost_net.h"
 #include "hw/virtio/virtio-bus.h"
@@ -38,6 +40,35 @@ 
 #define endof(container, field) \
     (offsetof(container, field) + sizeof(((container *)0)->field))
 
+#define ETH_HDR_SZ (sizeof(struct eth_header))
+#define IP4_HDR_SZ (sizeof(struct ip_header))
+#define TCP_HDR_SZ (sizeof(struct tcp_header))
+#define ETH_IP4_HDR_SZ (ETH_HDR_SZ + IP4_HDR_SZ)
+
+#define IP4_ADDR_SIZE   8                   /* ipv4 saddr + daddr */
+#define TCP_PORT_SIZE   4                   /* sport + dport */
+
+/* IPv4 max payload, 16 bits in the header */
+#define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ)
+#define MAX_TCP_PAYLOAD 65535
+
+/* max payload with virtio header */
+#define MAX_VIRTIO_PAYLOAD  (sizeof(struct virtio_net_hdr_mrg_rxbuf) \
+                                + ETH_HDR_SZ + MAX_TCP_PAYLOAD)
+
+#define IP4_HEADER_LEN 5 /* header lenght value in ip header without option */
+
+/* Purge coalesced packets timer interval */
+#define RSC_TIMER_INTERVAL  300000
+
+/* Switcher to enable/disable rsc */
+static bool virtio_net_rsc_bypass = 1;
+
+/* This value affects the performance a lot, and should be tuned carefully,
+   '300000'(300us) is the recommended value to pass the WHQL test, '50000' can
+   gain 2x netperf throughput with tso/gso/gro 'off'. */
+static uint32_t virtio_net_rsc_timeout = RSC_TIMER_INTERVAL;
+
 typedef struct VirtIOFeature {
     uint32_t flags;
     size_t end;
@@ -1089,7 +1120,8 @@  static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
     return 0;
 }
 
-static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size)
+static ssize_t virtio_net_do_receive(NetClientState *nc,
+                                  const uint8_t *buf, size_t size)
 {
     VirtIONet *n = qemu_get_nic_opaque(nc);
     VirtIONetQueue *q = virtio_net_get_subqueue(nc);
@@ -1685,6 +1717,456 @@  static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f,
     return 0;
 }
 
+static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
+                                         const uint8_t *buf, NetRscUnit* unit)
+{
+    uint16_t ip_hdrlen;
+
+    unit->ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
+    ip_hdrlen = ((0xF & unit->ip->ip_ver_len) << 2);
+    unit->ip_plen = &unit->ip->ip_len;
+    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
+    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
+    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
+}
+
+static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip)
+{
+    uint32_t sum;
+
+    ip->ip_sum = 0;
+    sum = net_checksum_add_cont(IP4_HDR_SZ, (uint8_t *)ip, 0);
+    ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
+}
+
+static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
+{
+    int ret;
+
+    virtio_net_rsc_ipv4_checksum(seg->unit.ip);
+    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
+    QTAILQ_REMOVE(&chain->buffers, seg, next);
+    g_free(seg->buf);
+    g_free(seg);
+
+    return ret;
+}
+
+static void virtio_net_rsc_purge(void *opq)
+{
+    NetRscChain *chain = (NetRscChain *)opq;
+    NetRscSeg *seg, *rn;
+
+    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
+        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
+            chain->stat.purge_failed++;
+            continue;
+        }
+    }
+
+    chain->stat.timer++;
+    if (!QTAILQ_EMPTY(&chain->buffers)) {
+        timer_mod(chain->drain_timer,
+              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
+    }
+}
+
+static void virtio_net_rsc_cleanup(VirtIONet *n)
+{
+    NetRscChain *chain, *rn_chain;
+    NetRscSeg *seg, *rn_seg;
+
+    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
+        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
+            QTAILQ_REMOVE(&chain->buffers, seg, next);
+            g_free(seg->buf);
+            g_free(seg);
+        }
+
+        timer_del(chain->drain_timer);
+        timer_free(chain->drain_timer);
+        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
+        g_free(chain);
+    }
+}
+
+static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
+                                     const uint8_t *buf, size_t size)
+{
+    NetRscSeg *seg;
+
+    seg = g_malloc(sizeof(NetRscSeg));
+    seg->buf = g_malloc(MAX_VIRTIO_PAYLOAD);
+
+    memmove(seg->buf, buf, size);
+    seg->size = size;
+    seg->dup_ack_count = 0;
+    seg->is_coalesced = 0;
+    seg->nc = nc;
+
+    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
+    chain->stat.cache++;
+
+    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
+}
+
+static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSeg *seg,
+                                 const uint8_t *buf, struct tcp_header *n_tcp,
+                                 struct tcp_header *o_tcp)
+{
+    uint32_t nack, oack;
+    uint16_t nwin, owin;
+
+    nack = htonl(n_tcp->th_ack);
+    nwin = htons(n_tcp->th_win);
+    oack = htonl(o_tcp->th_ack);
+    owin = htons(o_tcp->th_win);
+
+    if ((nack - oack) >= MAX_TCP_PAYLOAD) {
+        chain->stat.ack_out_of_win++;
+        return RSC_FINAL;
+    } else if (nack == oack) {
+        /* duplicated ack or window probe */
+        if (nwin == owin) {
+            /* duplicated ack, add dup ack count due to whql test up to 1 */
+            chain->stat.dup_ack++;
+
+            if (seg->dup_ack_count == 0) {
+                seg->dup_ack_count++;
+                chain->stat.dup_ack1++;
+                return RSC_COALESCE;
+            } else {
+                /* Spec says should send it directly */
+                chain->stat.dup_ack2++;
+                return RSC_FINAL;
+            }
+        } else {
+            /* Coalesce window update */
+            o_tcp->th_win = n_tcp->th_win;
+            chain->stat.win_update++;
+            return RSC_COALESCE;
+        }
+    } else {
+        /* pure ack, update ack */
+        o_tcp->th_ack = n_tcp->th_ack;
+        chain->stat.pure_ack++;
+        return RSC_COALESCE;
+    }
+}
+
+static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, NetRscSeg *seg,
+                                    const uint8_t *buf, NetRscUnit *n_unit)
+{
+    void *data;
+    uint16_t o_ip_len;
+    uint32_t nseq, oseq;
+    NetRscUnit *o_unit;
+
+    o_unit = &seg->unit;
+    o_ip_len = htons(*o_unit->ip_plen);
+    nseq = htonl(n_unit->tcp->th_seq);
+    oseq = htonl(o_unit->tcp->th_seq);
+
+    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
+        /* Log this only for debugging observation */
+        chain->stat.tcp_option++;
+    }
+
+    /* Ignore packet with more/larger tcp options */
+    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {
+        chain->stat.tcp_larger_option++;
+        return RSC_FINAL;
+    }
+
+    /* out of order or retransmitted. */
+    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
+        chain->stat.data_out_of_win++;
+        return RSC_FINAL;
+    }
+
+    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
+    if (nseq == oseq) {
+        if ((0 == o_unit->payload) && n_unit->payload) {
+            /* From no payload to payload, normal case, not a dup ack or etc */
+            chain->stat.data_after_pure_ack++;
+            goto coalesce;
+        } else {
+            return virtio_net_rsc_handle_ack(chain, seg, buf,
+                                             n_unit->tcp, o_unit->tcp);
+        }
+    } else if ((nseq - oseq) != o_unit->payload) {
+        /* Not a consistent packet, out of order */
+        chain->stat.data_out_of_order++;
+        return RSC_FINAL;
+    } else {
+coalesce:
+        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
+            chain->stat.over_size++;
+            return RSC_FINAL;
+        }
+
+        /* Here comes the right data, the payload lengh in v4/v6 is different,
+           so use the field value to update and record the new data len */
+        o_unit->payload += n_unit->payload; /* update new data len */
+
+        /* update field in ip header */
+        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
+
+        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
+           for windows guest, while this may change the behavior for linux
+           guest. */
+        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
+
+        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
+        o_unit->tcp->th_win = n_unit->tcp->th_win;
+
+        memmove(seg->buf + seg->size, data, n_unit->payload);
+        seg->size += n_unit->payload;
+        chain->stat.coalesced++;
+        return RSC_COALESCE;
+    }
+}
+
+static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
+                        const uint8_t *buf, size_t size, NetRscUnit *unit)
+{
+    if ((unit->ip->ip_src ^ seg->unit.ip->ip_src)
+        || (unit->ip->ip_dst ^ seg->unit.ip->ip_dst)
+        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
+        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
+        chain->stat.no_match++;
+        return RSC_NO_MATCH;
+    }
+
+    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
+}
+
+/* Pakcets with 'SYN' should bypass, other flag should be sent after drain
+ * to prevent out of order */
+static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
+                                         struct tcp_header *tcp)
+{
+    uint16_t tcp_flag;
+
+    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
+    if (tcp_flag & TH_SYN) {
+        chain->stat.tcp_syn++;
+        return RSC_BYPASS;
+    }
+
+    if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) {
+        chain->stat.tcp_ctrl_drain++;
+        return RSC_FINAL;
+    }
+
+    return RSC_WANT;
+}
+
+static bool virtio_net_rsc_empty_cache(NetRscChain *chain, NetClientState *nc,
+                          const uint8_t *buf, size_t size)
+{
+    if (QTAILQ_EMPTY(&chain->buffers)) {
+        chain->stat.empty_cache++;
+        virtio_net_rsc_cache_buf(chain, nc, buf, size);
+        timer_mod(chain->drain_timer,
+              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
+        return 1;
+    }
+
+    return 0;
+}
+
+static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
+                          const uint8_t *buf, size_t size, NetRscUnit *unit)
+{
+    int ret;
+    NetRscSeg *seg, *nseg;
+
+    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
+        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
+
+        if (ret == RSC_FINAL) {
+            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
+                /* Send failed */
+                chain->stat.final_failed++;
+                return 0;
+            }
+
+            /* Send current packet */
+            return virtio_net_do_receive(nc, buf, size);
+        } else if (ret == RSC_NO_MATCH) {
+            continue;
+        } else {
+            /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */
+            seg->is_coalesced = 1;
+            return size;
+        }
+    }
+
+    chain->stat.no_match_cache++;
+    virtio_net_rsc_cache_buf(chain, nc, buf, size);
+    return size;
+}
+
+/* Drain a connection data, this is to avoid out of order segments */
+static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
+                        const uint8_t *buf, size_t size, uint16_t ip_start,
+                        uint16_t ip_size, uint16_t tcp_port, uint16_t port_size)
+{
+    NetRscSeg *seg, *nseg;
+
+    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
+        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
+            || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size)) {
+            continue;
+        }
+        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
+            chain->stat.drain_failed++;
+        }
+
+        break;
+    }
+
+    return virtio_net_do_receive(nc, buf, size);
+}
+
+static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
+                        struct ip_header *ip, const uint8_t *buf, size_t size)
+{
+    uint16_t ip_len;
+
+    if (size < (chain->hdr_size + ETH_IP4_HDR_SZ + TCP_HDR_SZ)) {
+        return RSC_BYPASS;
+    }
+
+    /* Not an ipv4 one */
+    if (((0xF0 & ip->ip_ver_len) >> 4) != IP_HEADER_VERSION_4) {
+        chain->stat.ip_option++;
+        return RSC_BYPASS;
+    }
+
+    /* Don't handle packets with ip option */
+    if (IP4_HEADER_LEN != (0xF & ip->ip_ver_len)) {
+        chain->stat.ip_option++;
+        return RSC_BYPASS;
+    }
+
+    /* Don't handle packets with ip fragment */
+    if (!(htons(ip->ip_off) & IP_DF)) {
+        chain->stat.ip_frag++;
+        return RSC_BYPASS;
+    }
+
+    if (ip->ip_p != IPPROTO_TCP) {
+        chain->stat.bypass_not_tcp++;
+        return RSC_BYPASS;
+    }
+
+    /* Sanity check */
+    ip_len = htons(ip->ip_len);
+    if (ip_len < (IP4_HDR_SZ + TCP_HDR_SZ)
+        || ip_len > (size - chain->hdr_size - ETH_HDR_SZ)) {
+        chain->stat.ip_hacked++;
+        return RSC_BYPASS;
+    }
+
+    return RSC_WANT;
+}
+
+static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
+                                      const uint8_t *buf, size_t size)
+{
+    int32_t ret;
+    NetRscChain *chain;
+    NetRscUnit unit;
+
+    chain = (NetRscChain *)opq;
+    virtio_net_rsc_extract_unit4(chain, buf, &unit);
+    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)) {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+
+    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
+    if (ret == RSC_BYPASS) {
+        return virtio_net_do_receive(nc, buf, size);
+    } else if (ret == RSC_FINAL) {
+        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
+                        ((chain->hdr_size + ETH_HDR_SZ) + 12), IP4_ADDR_SIZE,
+                        (chain->hdr_size + ETH_IP4_HDR_SZ), TCP_PORT_SIZE);
+    }
+
+    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
+        return size;
+    }
+
+    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
+}
+
+static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
+                                            NetClientState *nc, uint16_t proto)
+{
+    NetRscChain *chain;
+
+    /* Only handle IPv4/6 */
+    if (proto != (uint16_t)ETH_P_IP) {
+        return NULL;
+    }
+
+    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
+        if (chain->proto == proto) {
+            return chain;
+        }
+    }
+
+    chain = g_malloc(sizeof(*chain));
+    chain->hdr_size = n->guest_hdr_len;
+    chain->proto = proto;
+    chain->max_payload = MAX_IP4_PAYLOAD;
+    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
+                                      virtio_net_rsc_purge, chain);
+    memset(&chain->stat, 0, sizeof(chain->stat));
+
+    QTAILQ_INIT(&chain->buffers);
+    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
+
+    return chain;
+}
+
+static ssize_t virtio_net_rsc_receive(NetClientState *nc,
+                                      const uint8_t *buf, size_t size)
+{
+    uint16_t proto;
+    NetRscChain *chain;
+    struct eth_header *eth;
+    VirtIONet *n;
+
+    n = qemu_get_nic_opaque(nc);
+    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+
+    eth = (struct eth_header *)(buf + n->guest_hdr_len);
+    proto = htons(eth->h_proto);
+
+    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
+    if (!chain) {
+        return virtio_net_do_receive(nc, buf, size);
+    } else {
+        chain->stat.received++;
+        return virtio_net_rsc_receive4(chain, nc, buf, size);
+    }
+}
+
+static ssize_t virtio_net_receive(NetClientState *nc,
+                                  const uint8_t *buf, size_t size)
+{
+    if (virtio_net_rsc_bypass) {
+        return virtio_net_do_receive(nc, buf, size);
+    } else {
+        return virtio_net_rsc_receive(nc, buf, size);
+    }
+}
+
 static NetClientInfo net_virtio_info = {
     .type = NET_CLIENT_OPTIONS_KIND_NIC,
     .size = sizeof(NICState),
@@ -1814,6 +2296,7 @@  static void virtio_net_device_realize(DeviceState *dev, Error **errp)
     nc = qemu_get_queue(n->nic);
     nc->rxfilter_notify_enabled = 1;
 
+    QTAILQ_INIT(&n->rsc_chains);
     n->qdev = dev;
     register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION,
                     virtio_net_save, virtio_net_load, n);
@@ -1848,6 +2331,7 @@  static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
     g_free(n->vqs);
     qemu_del_nic(n->nic);
     virtio_cleanup(vdev);
+    virtio_net_rsc_cleanup(n);
 }
 
 static void virtio_net_instance_init(Object *obj)
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index 0cabdb6..6939e92 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -59,6 +59,7 @@  typedef struct VirtIONet {
     VirtIONetQueue *vqs;
     VirtQueue *ctrl_vq;
     NICState *nic;
+    QTAILQ_HEAD(, NetRscChain) rsc_chains;
     uint32_t tx_timeout;
     int32_t tx_burst;
     uint32_t has_vnet_hdr;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 2b5b248..3b1dfa8 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -128,6 +128,78 @@  typedef struct VirtioDeviceClass {
     int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
 } VirtioDeviceClass;
 
+/* Coalesced packets type & status */
+typedef enum {
+    RSC_COALESCE,           /* Data been coalesced */
+    RSC_FINAL,              /* Will terminate current connection */
+    RSC_NO_MATCH,           /* No matched in the buffer pool */
+    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp ctrl, etc */
+    RSC_WANT                /* Data want to be coalesced */
+} COALESCE_STATUS;
+
+typedef struct NetRscStat {
+    uint32_t received;
+    uint32_t coalesced;
+    uint32_t over_size;
+    uint32_t cache;
+    uint32_t empty_cache;
+    uint32_t no_match_cache;
+    uint32_t win_update;
+    uint32_t no_match;
+    uint32_t tcp_syn;
+    uint32_t tcp_ctrl_drain;
+    uint32_t dup_ack;
+    uint32_t dup_ack1;
+    uint32_t dup_ack2;
+    uint32_t pure_ack;
+    uint32_t ack_out_of_win;
+    uint32_t data_out_of_win;
+    uint32_t data_out_of_order;
+    uint32_t data_after_pure_ack;
+    uint32_t bypass_not_tcp;
+    uint32_t tcp_option;
+    uint32_t tcp_larger_option;
+    uint32_t ip_frag;
+    uint32_t ip_hacked;
+    uint32_t ip_option;
+    uint32_t purge_failed;
+    uint32_t drain_failed;
+    uint32_t final_failed;
+    int64_t  timer;
+} NetRscStat;
+
+/* Rsc unit general info used to checking if can coalescing */
+typedef struct NetRscUnit {
+   struct ip_header *ip;   /* ip header */
+   uint16_t *ip_plen;      /* data len pointer in ip header field */
+   struct tcp_header *tcp; /* tcp header */
+   uint16_t tcp_hdrlen;    /* tcp header len */
+   uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
+} NetRscUnit;
+
+/* Coalesced segmant */
+typedef struct NetRscSeg {
+    QTAILQ_ENTRY(NetRscSeg) next;
+    void *buf;
+    size_t size;
+    uint32_t dup_ack_count;
+    bool is_coalesced;      /* need recal ipv4 header checksum, mark here */
+    NetRscUnit unit;
+    NetClientState *nc;
+} NetRscSeg;
+
+
+/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
+typedef struct NetRscChain {
+   QTAILQ_ENTRY(NetRscChain) next;
+   uint16_t hdr_size;
+   uint16_t proto;
+   uint16_t max_payload;
+   QEMUTimer *drain_timer;
+   QTAILQ_HEAD(, NetRscSeg) buffers;
+   NetRscStat stat;
+} NetRscChain;
+
 void virtio_instance_init_common(Object *proxy_obj, void *data,
                                  size_t vdev_size, const char *vdev_name);