[{"id":1765885,"web_url":"http://patchwork.ozlabs.org/comment/1765885/","msgid":"<CAKgT0UfZed3_3KBmKbMbcmdsA+ctFRUCu8jp_rCnatC9AMv__g@mail.gmail.com>","list_archive_url":null,"date":"2017-09-10T01:28:18","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":252,"url":"http://patchwork.ozlabs.org/api/people/252/","name":"Alexander Duyck","email":"alexander.duyck@gmail.com"},"content":"On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala\n<sridhar.samudrala@intel.com> wrote:\n> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used\n> to enable symmetric tx and rx queues on a socket.\n>\n> This option is specifically useful for epoll based multi threaded workloads\n> where each thread handles packets received on a single RX queue . In this model,\n> we have noticed that it helps to send the packets on the same TX queue\n> corresponding to the queue-pair associated with the RX queue specifically when\n> busy poll is enabled with epoll().\n>\n> Two new fields are added to struct sock_common to cache the last rx ifindex and\n> the rx queue in the receive path of an SKB. 
__netdev_pick_tx() returns the cached\n> rx queue when this option is enabled and the TX is happening on the same device.\n>\n> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>\n> ---\n>  include/net/request_sock.h        |  1 +\n>  include/net/sock.h                | 17 +++++++++++++++++\n>  include/uapi/asm-generic/socket.h |  2 ++\n>  net/core/dev.c                    |  8 +++++++-\n>  net/core/sock.c                   | 10 ++++++++++\n>  net/ipv4/tcp_input.c              |  1 +\n>  net/ipv4/tcp_ipv4.c               |  1 +\n>  net/ipv4/tcp_minisocks.c          |  1 +\n>  8 files changed, 40 insertions(+), 1 deletion(-)\n>\n> diff --git a/include/net/request_sock.h b/include/net/request_sock.h\n> index 23e2205..c3bc12e 100644\n> --- a/include/net/request_sock.h\n> +++ b/include/net/request_sock.h\n> @@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct request_sock *req)\n>         req_to_sk(req)->sk_prot = sk_listener->sk_prot;\n>         sk_node_init(&req_to_sk(req)->sk_node);\n>         sk_tx_queue_clear(req_to_sk(req));\n> +       req_to_sk(req)->sk_symmetric_queues = sk_listener->sk_symmetric_queues;\n>         req->saved_syn = NULL;\n>         refcount_set(&req->rsk_refcnt, 0);\n>\n> diff --git a/include/net/sock.h b/include/net/sock.h\n> index 03a3625..3421809 100644\n> --- a/include/net/sock.h\n> +++ b/include/net/sock.h\n> @@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char *msg, ...)\n>   *     @skc_node: main hash linkage for various protocol lookup tables\n>   *     @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol\n>   *     @skc_tx_queue_mapping: tx queue number for this connection\n> + *     @skc_rx_queue_mapping: rx queue number for this connection\n> + *     @skc_rx_ifindex: rx ifindex for this connection\n>   *     @skc_flags: place holder for sk_flags\n>   *             %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,\n>   *             %SO_OOBINLINE settings, %SO_TIMESTAMPING 
settings\n>   *     @skc_incoming_cpu: record/match cpu processing incoming packets\n>   *     @skc_refcnt: reference count\n> + *     @skc_symmetric_queues: symmetric tx/rx queues\n>   *\n>   *     This is the minimal network layer representation of sockets, the header\n>   *     for struct sock and struct inet_timewait_sock.\n> @@ -177,6 +180,7 @@ struct sock_common {\n>         unsigned char           skc_reuseport:1;\n>         unsigned char           skc_ipv6only:1;\n>         unsigned char           skc_net_refcnt:1;\n> +       unsigned char           skc_symmetric_queues:1;\n>         int                     skc_bound_dev_if;\n>         union {\n>                 struct hlist_node       skc_bind_node;\n> @@ -214,6 +218,8 @@ struct sock_common {\n>                 struct hlist_nulls_node skc_nulls_node;\n>         };\n>         int                     skc_tx_queue_mapping;\n> +       int                     skc_rx_queue_mapping;\n> +       int                     skc_rx_ifindex;\n>         union {\n>                 int             skc_incoming_cpu;\n>                 u32             skc_rcv_wnd;\n> @@ -324,6 +330,8 @@ struct sock {\n>  #define sk_nulls_node          __sk_common.skc_nulls_node\n>  #define sk_refcnt              __sk_common.skc_refcnt\n>  #define sk_tx_queue_mapping    __sk_common.skc_tx_queue_mapping\n> +#define sk_rx_queue_mapping    __sk_common.skc_rx_queue_mapping\n> +#define sk_rx_ifindex          __sk_common.skc_rx_ifindex\n>\n>  #define sk_dontcopy_begin      __sk_common.skc_dontcopy_begin\n>  #define sk_dontcopy_end                __sk_common.skc_dontcopy_end\n> @@ -340,6 +348,7 @@ struct sock {\n>  #define sk_reuseport           __sk_common.skc_reuseport\n>  #define sk_ipv6only            __sk_common.skc_ipv6only\n>  #define sk_net_refcnt          __sk_common.skc_net_refcnt\n> +#define sk_symmetric_queues    __sk_common.skc_symmetric_queues\n>  #define sk_bound_dev_if                __sk_common.skc_bound_dev_if\n>  #define 
sk_bind_node           __sk_common.skc_bind_node\n>  #define sk_prot                        __sk_common.skc_prot\n> @@ -1676,6 +1685,14 @@ static inline int sk_tx_queue_get(const struct sock *sk)\n>         return sk ? sk->sk_tx_queue_mapping : -1;\n>  }\n>\n> +static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)\n> +{\n> +       if (sk->sk_symmetric_queues) {\n> +               sk->sk_rx_ifindex = skb->skb_iif;\n> +               sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);\n> +       }\n> +}\n> +\n>  static inline void sk_set_socket(struct sock *sk, struct socket *sock)\n>  {\n>         sk_tx_queue_clear(sk);\n> diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h\n> index e47c9e4..f6b416e 100644\n> --- a/include/uapi/asm-generic/socket.h\n> +++ b/include/uapi/asm-generic/socket.h\n> @@ -106,4 +106,6 @@\n>\n>  #define SO_ZEROCOPY            60\n>\n> +#define SO_SYMMETRIC_QUEUES    61\n> +\n>  #endif /* __ASM_GENERIC_SOCKET_H */\n> diff --git a/net/core/dev.c b/net/core/dev.c\n> index 270b547..d96cda8 100644\n> --- a/net/core/dev.c\n> +++ b/net/core/dev.c\n> @@ -3322,7 +3322,13 @@ static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)\n>\n>         if (queue_index < 0 || skb->ooo_okay ||\n>             queue_index >= dev->real_num_tx_queues) {\n> -               int new_index = get_xps_queue(dev, skb);\n> +               int new_index = -1;\n> +\n> +               if (sk && sk->sk_symmetric_queues && dev->ifindex == sk->sk_rx_ifindex)\n> +                       new_index = sk->sk_rx_queue_mapping;\n> +\n> +               if (new_index < 0 || new_index >= dev->real_num_tx_queues)\n> +                       new_index = get_xps_queue(dev, skb);\n>\n>                 if (new_index < 0)\n>                         new_index = skb_tx_hash(dev, skb);\n\nSo one thing I am not sure about is if we should be overriding XPS. 
It\nmight make sense to instead place this after XPS so that if the root\nuser configures it then it applies, otherwise if the socket is\nrequesting symmetric queues you could fall back to that, and then\nfinally just use hashing as the final solution for distributing the\nworkload.\n\nThat way if somebody decides to reserve queues for some sort of\nspecific traffic like AF_PACKET then they can configure the Tx via\nXPS, configure the Rx via RSS redirection table reprogramming, and\nthen set up filters on the hardware to direct the traffic they want\nto the queues that are running AF_PACKET.\n\n> diff --git a/net/core/sock.c b/net/core/sock.c\n> index 9b7b6bb..3876cce 100644\n> --- a/net/core/sock.c\n> +++ b/net/core/sock.c\n> @@ -1059,6 +1059,10 @@ int sock_setsockopt(struct socket *sock, int level, int optname,\n>                         sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);\n>                 break;\n>\n> +       case SO_SYMMETRIC_QUEUES:\n> +               sk->sk_symmetric_queues = valbool;\n> +               break;\n> +\n>         default:\n>                 ret = -ENOPROTOOPT;\n>                 break;\n> @@ -1391,6 +1395,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,\n>                 v.val = sock_flag(sk, SOCK_ZEROCOPY);\n>                 break;\n>\n> +       case SO_SYMMETRIC_QUEUES:\n> +               v.val = sk->sk_symmetric_queues;\n> +               break;\n> +\n>         default:\n>                 /* We implement the SO_SNDLOWAT etc to not be settable\n>                  * (1003.1g 7).\n> @@ -2738,6 +2746,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)\n>         sk->sk_max_pacing_rate = ~0U;\n>         sk->sk_pacing_rate = ~0U;\n>         sk->sk_incoming_cpu = -1;\n> +       sk->sk_rx_ifindex = -1;\n> +       sk->sk_rx_queue_mapping = -1;\n>         /*\n>          * Before updating sk_refcnt, we must commit prior changes to memory\n>          * (Documentation/RCU/rculist_nulls.txt for details)\n> 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c\n> index c5d7656..12381e0 100644\n> --- a/net/ipv4/tcp_input.c\n> +++ b/net/ipv4/tcp_input.c\n> @@ -6356,6 +6356,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,\n>         tcp_rsk(req)->snt_isn = isn;\n>         tcp_rsk(req)->txhash = net_tx_rndhash();\n>         tcp_openreq_init_rwin(req, sk, dst);\n> +       sk_mark_rx_queue(req_to_sk(req), skb);\n>         if (!want_cookie) {\n>                 tcp_reqsk_record_syn(sk, req, skb);\n>                 fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc);\n> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c\n> index a63486a..82f9af4 100644\n> --- a/net/ipv4/tcp_ipv4.c\n> +++ b/net/ipv4/tcp_ipv4.c\n> @@ -1450,6 +1450,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)\n>\n>                 sock_rps_save_rxhash(sk, skb);\n>                 sk_mark_napi_id(sk, skb);\n> +               sk_mark_rx_queue(sk, skb);\n>                 if (dst) {\n>                         if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||\n>                             !dst->ops->check(dst, 0)) {\n> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c\n> index 188a6f3..2b5efd5 100644\n> --- a/net/ipv4/tcp_minisocks.c\n> +++ b/net/ipv4/tcp_minisocks.c\n> @@ -809,6 +809,7 @@ int tcp_child_process(struct sock *parent, struct sock *child,\n>\n>         /* record NAPI ID of child */\n>         sk_mark_napi_id(child, skb);\n> +       sk_mark_rx_queue(child, skb);\n>\n>         tcp_segs_in(tcp_sk(child), skb);\n>         if (!sock_owned_by_user(child)) {\n> --\n> 1.8.3.1\n>","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; 
dkim=pass (2048-bit key;\n\tunprotected) header.d=gmail.com header.i=@gmail.com\n\theader.b=\"RkDlh6Ta\"; dkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xqYNX4bCDz9sQl\n\tfor <patchwork-incoming@ozlabs.org>;\n\tSun, 10 Sep 2017 11:28:28 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1750819AbdIJB2V (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tSat, 9 Sep 2017 21:28:21 -0400","from mail-qk0-f196.google.com ([209.85.220.196]:34190 \"EHLO\n\tmail-qk0-f196.google.com\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1750788AbdIJB2U (ORCPT\n\t<rfc822;netdev@vger.kernel.org>); Sat, 9 Sep 2017 21:28:20 -0400","by mail-qk0-f196.google.com with SMTP id d70so3625444qkc.1\n\tfor <netdev@vger.kernel.org>; Sat, 09 Sep 2017 18:28:19 -0700 (PDT)","by 10.140.85.211 with HTTP; Sat, 9 Sep 2017 18:28:18 -0700 (PDT)"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=gmail.com; s=20161025;\n\th=mime-version:in-reply-to:references:from:date:message-id:subject:to\n\t:cc; bh=1pfI8wmYrz5zyeqyht4+kljHW8Y9ZmtL2ZjNN4iao98=;\n\tb=RkDlh6TaFKRqEgxQQ55yFOzN05hCgc0lahZCY+LH4DP37EYbjvYPDyoalySn+cNQZ6\n\t7fJFatGzB5LdQdjyMNtJjTX4wl4R7TM75D48yBnuABAT8roAYUgtmqrGBW9DlxDnODov\n\tN+sbsTqGtVdc1JworfRdX/1v/MWCqW7EVn5aqTx+n6R/Hr5ldUDPcgUEm9Z+UrDCSJUO\n\tdHR0sLqdwXPbOjfRP1D/h9BAvfp1wRw7E86Wc0oIng/jut7+z3U5mR+dmffrTbulwTip\n\tPWWA9s02EkQ+UJxNQYd3+arygnsIGU1ct4MWtIvd/8OK0dOJgIyutz49uuy+AF7D1dXU\n\t0ImA==","X-Google-DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=1e100.net; 
s=20161025;\n\th=x-gm-message-state:mime-version:in-reply-to:references:from:date\n\t:message-id:subject:to:cc;\n\tbh=1pfI8wmYrz5zyeqyht4+kljHW8Y9ZmtL2ZjNN4iao98=;\n\tb=CS5XYB476+LfF7m1oHS702uziRX6In0nua49mqAZYOLlaq8lOanXwDi16QHYAlqQ30\n\tdF0E7DBNbQ/szbL/MczSLisxtqGSDUKass9RsYyN+C4W78YJ4mkaSDLX+2RuOuBZ6jou\n\tGzrQja/bbqb2oz6viIwJb01n8XdCAdRhsuR+SLS+q4EnpTHJu6eZAke7aYXptXvNvjkU\n\tOPzzigZaG+XJtMYJo0zpJprWW6kjwf43zRfnaMXtBwNtyJtjfInsg0AhOIuwyzOEIAfG\n\t+6MPy1egdIVmU8LW8ZNKeqS1KEpzrdtCzJKgWgTDZnVWFhrIW+WiHUwh1sH4wp7k6gFH\n\tsreg==","X-Gm-Message-State":"AHPjjUieXk66s+9s1dO5g8Zpf7ocAZCudYQnaMTTnn82ZOv0N/YMS1It\n\t96cJ6BCAl4Kx6qvxf9ctn57My+GDzHwW","X-Google-Smtp-Source":"AOwi7QB/NoZJNLpKdIsrljfKnXyQv0nf2QfsZIHohTyf/aKlnPqmsONVheQ38svL4uEp0ZYema5TtliB7gjMNLZCanA=","X-Received":"by 10.55.169.81 with SMTP id s78mr10680069qke.34.1505006899090; \n\tSat, 09 Sep 2017 18:28:19 -0700 (PDT)","MIME-Version":"1.0","In-Reply-To":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>","From":"Alexander Duyck <alexander.duyck@gmail.com>","Date":"Sat, 9 Sep 2017 18:28:18 -0700","Message-ID":"<CAKgT0UfZed3_3KBmKbMbcmdsA+ctFRUCu8jp_rCnatC9AMv__g@mail.gmail.com>","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","To":"Sridhar Samudrala <sridhar.samudrala@intel.com>","Cc":"\"Duyck, Alexander H\" <alexander.h.duyck@intel.com>,\n\tNetdev <netdev@vger.kernel.org>","Content-Type":"text/plain; charset=\"UTF-8\"","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1765905,"web_url":"http://patchwork.ozlabs.org/comment/1765905/","msgid":"<CALx6S3527ETZjZBBJTYrorWziwFi+eXMHFOsc-hC5YzPUWZ3Cw@mail.gmail.com>","list_archive_url":null,"date":"2017-09-10T05:32:17","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx 
queue.","submitter":{"id":65986,"url":"http://patchwork.ozlabs.org/api/people/65986/","name":"Tom Herbert","email":"tom@herbertland.com"},"content":"On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala\n<sridhar.samudrala@intel.com> wrote:\n> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used\n> to enable symmetric tx and rx queues on a socket.\n>\n> This option is specifically useful for epoll based multi threaded workloads\n> where each thread handles packets received on a single RX queue . In this model,\n> we have noticed that it helps to send the packets on the same TX queue\n> corresponding to the queue-pair associated with the RX queue specifically when\n> busy poll is enabled with epoll().\n>\nPlease provide more details, test results on exactly how this helps.\nWhy would this be better than optimized XPS?\n\nThanks,\nTom\n\n> Two new fields are added to struct sock_common to cache the last rx ifindex and\n> the rx queue in the receive path of an SKB. 
__netdev_pick_tx() returns the cached\n> rx queue when this option is enabled and the TX is happening on the same device.\n>\n> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>\n> ---\n>  include/net/request_sock.h        |  1 +\n>  include/net/sock.h                | 17 +++++++++++++++++\n>  include/uapi/asm-generic/socket.h |  2 ++\n>  net/core/dev.c                    |  8 +++++++-\n>  net/core/sock.c                   | 10 ++++++++++\n>  net/ipv4/tcp_input.c              |  1 +\n>  net/ipv4/tcp_ipv4.c               |  1 +\n>  net/ipv4/tcp_minisocks.c          |  1 +\n>  8 files changed, 40 insertions(+), 1 deletion(-)\n>\n> diff --git a/include/net/request_sock.h b/include/net/request_sock.h\n> index 23e2205..c3bc12e 100644\n> --- a/include/net/request_sock.h\n> +++ b/include/net/request_sock.h\n> @@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct request_sock *req)\n>         req_to_sk(req)->sk_prot = sk_listener->sk_prot;\n>         sk_node_init(&req_to_sk(req)->sk_node);\n>         sk_tx_queue_clear(req_to_sk(req));\n> +       req_to_sk(req)->sk_symmetric_queues = sk_listener->sk_symmetric_queues;\n>         req->saved_syn = NULL;\n>         refcount_set(&req->rsk_refcnt, 0);\n>\n> diff --git a/include/net/sock.h b/include/net/sock.h\n> index 03a3625..3421809 100644\n> --- a/include/net/sock.h\n> +++ b/include/net/sock.h\n> @@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char *msg, ...)\n>   *     @skc_node: main hash linkage for various protocol lookup tables\n>   *     @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol\n>   *     @skc_tx_queue_mapping: tx queue number for this connection\n> + *     @skc_rx_queue_mapping: rx queue number for this connection\n> + *     @skc_rx_ifindex: rx ifindex for this connection\n>   *     @skc_flags: place holder for sk_flags\n>   *             %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,\n>   *             %SO_OOBINLINE settings, %SO_TIMESTAMPING 
settings\n>   *     @skc_incoming_cpu: record/match cpu processing incoming packets\n>   *     @skc_refcnt: reference count\n> + *     @skc_symmetric_queues: symmetric tx/rx queues\n>   *\n>   *     This is the minimal network layer representation of sockets, the header\n>   *     for struct sock and struct inet_timewait_sock.\n> @@ -177,6 +180,7 @@ struct sock_common {\n>         unsigned char           skc_reuseport:1;\n>         unsigned char           skc_ipv6only:1;\n>         unsigned char           skc_net_refcnt:1;\n> +       unsigned char           skc_symmetric_queues:1;\n>         int                     skc_bound_dev_if;\n>         union {\n>                 struct hlist_node       skc_bind_node;\n> @@ -214,6 +218,8 @@ struct sock_common {\n>                 struct hlist_nulls_node skc_nulls_node;\n>         };\n>         int                     skc_tx_queue_mapping;\n> +       int                     skc_rx_queue_mapping;\n> +       int                     skc_rx_ifindex;\n>         union {\n>                 int             skc_incoming_cpu;\n>                 u32             skc_rcv_wnd;\n> @@ -324,6 +330,8 @@ struct sock {\n>  #define sk_nulls_node          __sk_common.skc_nulls_node\n>  #define sk_refcnt              __sk_common.skc_refcnt\n>  #define sk_tx_queue_mapping    __sk_common.skc_tx_queue_mapping\n> +#define sk_rx_queue_mapping    __sk_common.skc_rx_queue_mapping\n> +#define sk_rx_ifindex          __sk_common.skc_rx_ifindex\n>\n>  #define sk_dontcopy_begin      __sk_common.skc_dontcopy_begin\n>  #define sk_dontcopy_end                __sk_common.skc_dontcopy_end\n> @@ -340,6 +348,7 @@ struct sock {\n>  #define sk_reuseport           __sk_common.skc_reuseport\n>  #define sk_ipv6only            __sk_common.skc_ipv6only\n>  #define sk_net_refcnt          __sk_common.skc_net_refcnt\n> +#define sk_symmetric_queues    __sk_common.skc_symmetric_queues\n>  #define sk_bound_dev_if                __sk_common.skc_bound_dev_if\n>  #define 
sk_bind_node           __sk_common.skc_bind_node\n>  #define sk_prot                        __sk_common.skc_prot\n> @@ -1676,6 +1685,14 @@ static inline int sk_tx_queue_get(const struct sock *sk)\n>         return sk ? sk->sk_tx_queue_mapping : -1;\n>  }\n>\n> +static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)\n> +{\n> +       if (sk->sk_symmetric_queues) {\n> +               sk->sk_rx_ifindex = skb->skb_iif;\n> +               sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);\n> +       }\n> +}\n> +\n>  static inline void sk_set_socket(struct sock *sk, struct socket *sock)\n>  {\n>         sk_tx_queue_clear(sk);\n> diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h\n> index e47c9e4..f6b416e 100644\n> --- a/include/uapi/asm-generic/socket.h\n> +++ b/include/uapi/asm-generic/socket.h\n> @@ -106,4 +106,6 @@\n>\n>  #define SO_ZEROCOPY            60\n>\n> +#define SO_SYMMETRIC_QUEUES    61\n> +\n>  #endif /* __ASM_GENERIC_SOCKET_H */\n> diff --git a/net/core/dev.c b/net/core/dev.c\n> index 270b547..d96cda8 100644\n> --- a/net/core/dev.c\n> +++ b/net/core/dev.c\n> @@ -3322,7 +3322,13 @@ static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)\n>\n>         if (queue_index < 0 || skb->ooo_okay ||\n>             queue_index >= dev->real_num_tx_queues) {\n> -               int new_index = get_xps_queue(dev, skb);\n> +               int new_index = -1;\n> +\n> +               if (sk && sk->sk_symmetric_queues && dev->ifindex == sk->sk_rx_ifindex)\n> +                       new_index = sk->sk_rx_queue_mapping;\n> +\n> +               if (new_index < 0 || new_index >= dev->real_num_tx_queues)\n> +                       new_index = get_xps_queue(dev, skb);\n>\n>                 if (new_index < 0)\n>                         new_index = skb_tx_hash(dev, skb);\n> diff --git a/net/core/sock.c b/net/core/sock.c\n> index 9b7b6bb..3876cce 100644\n> --- a/net/core/sock.c\n> +++ b/net/core/sock.c\n> @@ -1059,6 
+1059,10 @@ int sock_setsockopt(struct socket *sock, int level, int optname,\n>                         sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);\n>                 break;\n>\n> +       case SO_SYMMETRIC_QUEUES:\n> +               sk->sk_symmetric_queues = valbool;\n> +               break;\n> +\n>         default:\n>                 ret = -ENOPROTOOPT;\n>                 break;\n> @@ -1391,6 +1395,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,\n>                 v.val = sock_flag(sk, SOCK_ZEROCOPY);\n>                 break;\n>\n> +       case SO_SYMMETRIC_QUEUES:\n> +               v.val = sk->sk_symmetric_queues;\n> +               break;\n> +\n>         default:\n>                 /* We implement the SO_SNDLOWAT etc to not be settable\n>                  * (1003.1g 7).\n> @@ -2738,6 +2746,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)\n>         sk->sk_max_pacing_rate = ~0U;\n>         sk->sk_pacing_rate = ~0U;\n>         sk->sk_incoming_cpu = -1;\n> +       sk->sk_rx_ifindex = -1;\n> +       sk->sk_rx_queue_mapping = -1;\n>         /*\n>          * Before updating sk_refcnt, we must commit prior changes to memory\n>          * (Documentation/RCU/rculist_nulls.txt for details)\n> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c\n> index c5d7656..12381e0 100644\n> --- a/net/ipv4/tcp_input.c\n> +++ b/net/ipv4/tcp_input.c\n> @@ -6356,6 +6356,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,\n>         tcp_rsk(req)->snt_isn = isn;\n>         tcp_rsk(req)->txhash = net_tx_rndhash();\n>         tcp_openreq_init_rwin(req, sk, dst);\n> +       sk_mark_rx_queue(req_to_sk(req), skb);\n>         if (!want_cookie) {\n>                 tcp_reqsk_record_syn(sk, req, skb);\n>                 fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc);\n> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c\n> index a63486a..82f9af4 100644\n> --- a/net/ipv4/tcp_ipv4.c\n> +++ b/net/ipv4/tcp_ipv4.c\n> @@ -1450,6 +1450,7 @@ 
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)\n>\n>                 sock_rps_save_rxhash(sk, skb);\n>                 sk_mark_napi_id(sk, skb);\n> +               sk_mark_rx_queue(sk, skb);\n>                 if (dst) {\n>                         if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||\n>                             !dst->ops->check(dst, 0)) {\n> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c\n> index 188a6f3..2b5efd5 100644\n> --- a/net/ipv4/tcp_minisocks.c\n> +++ b/net/ipv4/tcp_minisocks.c\n> @@ -809,6 +809,7 @@ int tcp_child_process(struct sock *parent, struct sock *child,\n>\n>         /* record NAPI ID of child */\n>         sk_mark_napi_id(child, skb);\n> +       sk_mark_rx_queue(child, skb);\n>\n>         tcp_segs_in(tcp_sk(child), skb);\n>         if (!sock_owned_by_user(child)) {\n> --\n> 1.8.3.1\n>","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (2048-bit key;\n\tunprotected) header.d=herbertland-com.20150623.gappssmtp.com\n\theader.i=@herbertland-com.20150623.gappssmtp.com\n\theader.b=\"G2AJfABk\"; dkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xqfp36lfLz9sDB\n\tfor <patchwork-incoming@ozlabs.org>;\n\tSun, 10 Sep 2017 15:32:27 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1750989AbdIJFcU (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tSun, 10 Sep 2017 01:32:20 -0400","from mail-qt0-f196.google.com ([209.85.216.196]:33383 \"EHLO\n\tmail-qt0-f196.google.com\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1750730AbdIJFcS 
(ORCPT\n\t<rfc822;netdev@vger.kernel.org>); Sun, 10 Sep 2017 01:32:18 -0400","by mail-qt0-f196.google.com with SMTP id b1so1520932qtc.0\n\tfor <netdev@vger.kernel.org>; Sat, 09 Sep 2017 22:32:18 -0700 (PDT)","by 10.237.61.196 with HTTP; Sat, 9 Sep 2017 22:32:17 -0700 (PDT)"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=herbertland-com.20150623.gappssmtp.com; s=20150623;\n\th=mime-version:in-reply-to:references:from:date:message-id:subject:to\n\t:cc; bh=ZVkWqJeLiB0MAzSH+wqFZsnIRnL0BJ0vltrAM9ZAA1k=;\n\tb=G2AJfABkRujj0BIV7y0WQMrBVf4gsedllpaJmOY0n6m5PBDbjv18GBehWOcEWUec+8\n\tLyhBzlm/i6d8pIS3FozT38wFgBYCIIjytbKC2Cu5ytsSB2z9wap8TUaKcA6+VQsf09Pb\n\t0nk/3AwGCNwK6JsaOYNjase0bcKP/6EN0Xj20EZkCw6tas/UIOM2Uqs0++TfI1u/wNjl\n\tGqDnuG3Z+V43up820CW95G0obb0NtC5rZRl62X/XzNjAAHdTkWd28k4pIgM2R5SVG+qY\n\ttueHp4sELiy3C0qmPYqPZ1OIgRCPofGzuVCB/S23PGFJ24qlHHykwzAjYkdpszIbkz2D\n\tlksA==","X-Google-DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=1e100.net; s=20161025;\n\th=x-gm-message-state:mime-version:in-reply-to:references:from:date\n\t:message-id:subject:to:cc;\n\tbh=ZVkWqJeLiB0MAzSH+wqFZsnIRnL0BJ0vltrAM9ZAA1k=;\n\tb=bsDxwji3gAuF3WpaAP6yC2OBAr9nG/97QZh2AEKhKOJF5oD+uQLKvcv84RPraYnvK6\n\tatbs0hdGj4yrd8koeWnEiQ+Aecbua1bcMa2EPy4xLnFx40o8v7vbNbY1POF3/DXMf7lA\n\th7Oup+7ApS4KLtFX2OfVo++gXNJo3ZCPEwXw1R0CYjyWgCI/GgNdVU0XdnqCRb5C4Se5\n\tOBnfWjxKM3YQocd2S6WP/OyiWCiYlQRCyEymlAYB8j4vWaLvcGZIlRLQ49TdZwmFxBQV\n\tarOkAqqanUuqI9rMKv3xzbgpmb6R6Xes7PXuVt+OLvHttQLNBWSancKDamozur77xe0+\n\tXUig==","X-Gm-Message-State":"AHPjjUgRCvMTTK9VSAm9cDvpDOH7V0PUlxDnvBGfrD9GnlaJY2ueUf4c\n\tfqIzcq4Yh7oJkZosnur2GG0ZmVZNSWQt","X-Google-Smtp-Source":"AOwi7QBBGX/rcoshD8oNjbYxphssR3G4aw5TRD4DWeMzfuevn+RVFvHoqnoLfX+JxD1UrdLICv2duokejckg7wnzpTk=","X-Received":"by 10.237.56.167 with SMTP id k36mr11554962qte.286.1505021537949;\n\tSat, 09 Sep 2017 22:32:17 -0700 
(PDT)","MIME-Version":"1.0","In-Reply-To":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>","From":"Tom Herbert <tom@herbertland.com>","Date":"Sat, 9 Sep 2017 22:32:17 -0700","Message-ID":"<CALx6S3527ETZjZBBJTYrorWziwFi+eXMHFOsc-hC5YzPUWZ3Cw@mail.gmail.com>","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","To":"Sridhar Samudrala <sridhar.samudrala@intel.com>","Cc":"Alexander Duyck <alexander.h.duyck@intel.com>,\n\tLinux Kernel Network Developers <netdev@vger.kernel.org>","Content-Type":"text/plain; charset=\"UTF-8\"","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1765960,"web_url":"http://patchwork.ozlabs.org/comment/1765960/","msgid":"<CALx6S371o0Oz9M7jmQWLuG2n3J0NGYGxZ_+okOkFL3x_+GB+2w@mail.gmail.com>","list_archive_url":null,"date":"2017-09-10T15:19:39","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":65986,"url":"http://patchwork.ozlabs.org/api/people/65986/","name":"Tom Herbert","email":"tom@herbertland.com"},"content":"On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala\n<sridhar.samudrala@intel.com> wrote:\n> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used\n> to enable symmetric tx and rx queues on a socket.\n>\n> This option is specifically useful for epoll based multi threaded workloads\n> where each thread handles packets received on a single RX queue . In this model,\n> we have noticed that it helps to send the packets on the same TX queue\n> corresponding to the queue-pair associated with the RX queue specifically when\n> busy poll is enabled with epoll().\n>\n> Two new fields are added to struct sock_common to cache the last rx ifindex and\n> the rx queue in the receive path of an SKB. 
__netdev_pick_tx() returns the cached\n> rx queue when this option is enabled and the TX is happening on the same device.\n>\n> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>\n> ---\n>  include/net/request_sock.h        |  1 +\n>  include/net/sock.h                | 17 +++++++++++++++++\n>  include/uapi/asm-generic/socket.h |  2 ++\n>  net/core/dev.c                    |  8 +++++++-\n>  net/core/sock.c                   | 10 ++++++++++\n>  net/ipv4/tcp_input.c              |  1 +\n>  net/ipv4/tcp_ipv4.c               |  1 +\n>  net/ipv4/tcp_minisocks.c          |  1 +\n>  8 files changed, 40 insertions(+), 1 deletion(-)\n>\n> diff --git a/include/net/request_sock.h b/include/net/request_sock.h\n> index 23e2205..c3bc12e 100644\n> --- a/include/net/request_sock.h\n> +++ b/include/net/request_sock.h\n> @@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct request_sock *req)\n>         req_to_sk(req)->sk_prot = sk_listener->sk_prot;\n>         sk_node_init(&req_to_sk(req)->sk_node);\n>         sk_tx_queue_clear(req_to_sk(req));\n> +       req_to_sk(req)->sk_symmetric_queues = sk_listener->sk_symmetric_queues;\n>         req->saved_syn = NULL;\n>         refcount_set(&req->rsk_refcnt, 0);\n>\n> diff --git a/include/net/sock.h b/include/net/sock.h\n> index 03a3625..3421809 100644\n> --- a/include/net/sock.h\n> +++ b/include/net/sock.h\n> @@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char *msg, ...)\n>   *     @skc_node: main hash linkage for various protocol lookup tables\n>   *     @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol\n>   *     @skc_tx_queue_mapping: tx queue number for this connection\n> + *     @skc_rx_queue_mapping: rx queue number for this connection\n> + *     @skc_rx_ifindex: rx ifindex for this connection\n>   *     @skc_flags: place holder for sk_flags\n>   *             %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,\n>   *             %SO_OOBINLINE settings, %SO_TIMESTAMPING 
settings\n>   *     @skc_incoming_cpu: record/match cpu processing incoming packets\n>   *     @skc_refcnt: reference count\n> + *     @skc_symmetric_queues: symmetric tx/rx queues\n>   *\n>   *     This is the minimal network layer representation of sockets, the header\n>   *     for struct sock and struct inet_timewait_sock.\n> @@ -177,6 +180,7 @@ struct sock_common {\n>         unsigned char           skc_reuseport:1;\n>         unsigned char           skc_ipv6only:1;\n>         unsigned char           skc_net_refcnt:1;\n> +       unsigned char           skc_symmetric_queues:1;\n>         int                     skc_bound_dev_if;\n>         union {\n>                 struct hlist_node       skc_bind_node;\n> @@ -214,6 +218,8 @@ struct sock_common {\n>                 struct hlist_nulls_node skc_nulls_node;\n>         };\n>         int                     skc_tx_queue_mapping;\n> +       int                     skc_rx_queue_mapping;\n> +       int                     skc_rx_ifindex;\n>         union {\n>                 int             skc_incoming_cpu;\n>                 u32             skc_rcv_wnd;\n> @@ -324,6 +330,8 @@ struct sock {\n>  #define sk_nulls_node          __sk_common.skc_nulls_node\n>  #define sk_refcnt              __sk_common.skc_refcnt\n>  #define sk_tx_queue_mapping    __sk_common.skc_tx_queue_mapping\n> +#define sk_rx_queue_mapping    __sk_common.skc_rx_queue_mapping\n> +#define sk_rx_ifindex          __sk_common.skc_rx_ifindex\n>\n>  #define sk_dontcopy_begin      __sk_common.skc_dontcopy_begin\n>  #define sk_dontcopy_end                __sk_common.skc_dontcopy_end\n> @@ -340,6 +348,7 @@ struct sock {\n>  #define sk_reuseport           __sk_common.skc_reuseport\n>  #define sk_ipv6only            __sk_common.skc_ipv6only\n>  #define sk_net_refcnt          __sk_common.skc_net_refcnt\n> +#define sk_symmetric_queues    __sk_common.skc_symmetric_queues\n>  #define sk_bound_dev_if                __sk_common.skc_bound_dev_if\n>  #define 
sk_bind_node           __sk_common.skc_bind_node\n>  #define sk_prot                        __sk_common.skc_prot\n> @@ -1676,6 +1685,14 @@ static inline int sk_tx_queue_get(const struct sock *sk)\n>         return sk ? sk->sk_tx_queue_mapping : -1;\n>  }\n>\n> +static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)\n> +{\n> +       if (sk->sk_symmetric_queues) {\n> +               sk->sk_rx_ifindex = skb->skb_iif;\n> +               sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);\n> +       }\n> +}\n> +\n>  static inline void sk_set_socket(struct sock *sk, struct socket *sock)\n>  {\n>         sk_tx_queue_clear(sk);\n> diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h\n> index e47c9e4..f6b416e 100644\n> --- a/include/uapi/asm-generic/socket.h\n> +++ b/include/uapi/asm-generic/socket.h\n> @@ -106,4 +106,6 @@\n>\n>  #define SO_ZEROCOPY            60\n>\n> +#define SO_SYMMETRIC_QUEUES    61\n> +\n>  #endif /* __ASM_GENERIC_SOCKET_H */\n> diff --git a/net/core/dev.c b/net/core/dev.c\n> index 270b547..d96cda8 100644\n> --- a/net/core/dev.c\n> +++ b/net/core/dev.c\n> @@ -3322,7 +3322,13 @@ static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)\n>\n>         if (queue_index < 0 || skb->ooo_okay ||\n>             queue_index >= dev->real_num_tx_queues) {\n> -               int new_index = get_xps_queue(dev, skb);\n> +               int new_index = -1;\n> +\n> +               if (sk && sk->sk_symmetric_queues && dev->ifindex == sk->sk_rx_ifindex)\n> +                       new_index = sk->sk_rx_queue_mapping;\n> +\n> +               if (new_index < 0 || new_index >= dev->real_num_tx_queues)\n> +                       new_index = get_xps_queue(dev, skb);\n\nThis enforces that notion of queue pairs which is not universal\nconcept to NICs. 
There are many devices and instances where we\npurposely avoid having a 1-1 relationship between rx and tx queues.\nAn alternative might be to create a rx queue to tx queue map, add the\nrx queue argument to get_xps_queue, and then that function can\nconsider the mapping. The administrator can configure the mapping as\nappropriate and can select which rx queues are subject to the mapping.\n\n>\n>                 if (new_index < 0)\n>                         new_index = skb_tx_hash(dev, skb);\n> diff --git a/net/core/sock.c b/net/core/sock.c\n> index 9b7b6bb..3876cce 100644\n> --- a/net/core/sock.c\n> +++ b/net/core/sock.c\n> @@ -1059,6 +1059,10 @@ int sock_setsockopt(struct socket *sock, int level, int optname,\n>                         sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);\n>                 break;\n>\n> +       case SO_SYMMETRIC_QUEUES:\n> +               sk->sk_symmetric_queues = valbool;\n> +               break;\n> +\nAllowing users control over this seems problematic to me. The intent\nof packet steering is to provide good loading across the whole\nsystems, not just for individual applications. 
Exposing this control\nmakes that mission harder.\n\n>         default:\n>                 ret = -ENOPROTOOPT;\n>                 break;\n> @@ -1391,6 +1395,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,\n>                 v.val = sock_flag(sk, SOCK_ZEROCOPY);\n>                 break;\n>\n> +       case SO_SYMMETRIC_QUEUES:\n> +               v.val = sk->sk_symmetric_queues;\n> +               break;\n> +\n>         default:\n>                 /* We implement the SO_SNDLOWAT etc to not be settable\n>                  * (1003.1g 7).\n> @@ -2738,6 +2746,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)\n>         sk->sk_max_pacing_rate = ~0U;\n>         sk->sk_pacing_rate = ~0U;\n>         sk->sk_incoming_cpu = -1;\n> +       sk->sk_rx_ifindex = -1;\n> +       sk->sk_rx_queue_mapping = -1;\n>         /*\n>          * Before updating sk_refcnt, we must commit prior changes to memory\n>          * (Documentation/RCU/rculist_nulls.txt for details)\n> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c\n> index c5d7656..12381e0 100644\n> --- a/net/ipv4/tcp_input.c\n> +++ b/net/ipv4/tcp_input.c\n> @@ -6356,6 +6356,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,\n>         tcp_rsk(req)->snt_isn = isn;\n>         tcp_rsk(req)->txhash = net_tx_rndhash();\n>         tcp_openreq_init_rwin(req, sk, dst);\n> +       sk_mark_rx_queue(req_to_sk(req), skb);\n>         if (!want_cookie) {\n>                 tcp_reqsk_record_syn(sk, req, skb);\n>                 fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc);\n> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c\n> index a63486a..82f9af4 100644\n> --- a/net/ipv4/tcp_ipv4.c\n> +++ b/net/ipv4/tcp_ipv4.c\n> @@ -1450,6 +1450,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)\n>\n>                 sock_rps_save_rxhash(sk, skb);\n>                 sk_mark_napi_id(sk, skb);\n> +               sk_mark_rx_queue(sk, skb);\n\nThis could be part of sock_rps_save_rxhash 
instead of new functions in\ncore receive path. That could be renamed sock_save_rx_info or\nsomething like that. UDP support is also lacking with this patch so\nthat gets solved by a common function also.\n\n>                 if (dst) {\n>                         if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||\n>                             !dst->ops->check(dst, 0)) {\n> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c\n> index 188a6f3..2b5efd5 100644\n> --- a/net/ipv4/tcp_minisocks.c\n> +++ b/net/ipv4/tcp_minisocks.c\n> @@ -809,6 +809,7 @@ int tcp_child_process(struct sock *parent, struct sock *child,\n>\n>         /* record NAPI ID of child */\n>         sk_mark_napi_id(child, skb);\n> +       sk_mark_rx_queue(child, skb);\n>\n>         tcp_segs_in(tcp_sk(child), skb);\n>         if (!sock_owned_by_user(child)) {\n> --\n> 1.8.3.1\n>","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (2048-bit key;\n\tunprotected) header.d=herbertland-com.20150623.gappssmtp.com\n\theader.i=@herbertland-com.20150623.gappssmtp.com\n\theader.b=\"0N5ddetU\"; dkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xqvqj5lKjz9s7f\n\tfor <patchwork-incoming@ozlabs.org>;\n\tMon, 11 Sep 2017 01:19:45 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751435AbdIJPTm (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tSun, 10 Sep 2017 11:19:42 -0400","from mail-qk0-f196.google.com ([209.85.220.196]:34397 \"EHLO\n\tmail-qk0-f196.google.com\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1751089AbdIJPTl 
(ORCPT\n\t<rfc822;netdev@vger.kernel.org>); Sun, 10 Sep 2017 11:19:41 -0400","by mail-qk0-f196.google.com with SMTP id d70so4165163qkc.1\n\tfor <netdev@vger.kernel.org>; Sun, 10 Sep 2017 08:19:40 -0700 (PDT)","by 10.237.61.196 with HTTP; Sun, 10 Sep 2017 08:19:39 -0700 (PDT)"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=herbertland-com.20150623.gappssmtp.com; s=20150623;\n\th=mime-version:in-reply-to:references:from:date:message-id:subject:to\n\t:cc; bh=yozoVyggMEt9Q+AXesjIkCaQA3AAMMg6j/hpQp2iqyw=;\n\tb=0N5ddetUjmDq8Qd+X+isxMVpT5gFdn860ObG2sLbIiqSA3sePAWsIr3jNB/y7mSh4w\n\tRT58YASSrGTu415Uhsh0nK0xAorRciZl8R3lzL9sJRC9j3uvNHLYELrrPQCezqj66JOk\n\tVGbyQyxoXeWT++cyS/YdCeHCacx+wHOkoPi1rfOp2es2/i13x4z/gPF1y85wr0LM1n7S\n\tfEDVQ+G5FSV4gX75vy9vd1U8Yg27wCeTkl3c9E7vpFMOlJqlc7CRNk/OrLLT/u3S3qgY\n\tWXPj3tEW+H3WtqNprY1Y5zOaPg+37w3CaBuE/acgkUvx6GWQs0159//8caWOkXOCluz4\n\tMoJg==","X-Google-DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=1e100.net; s=20161025;\n\th=x-gm-message-state:mime-version:in-reply-to:references:from:date\n\t:message-id:subject:to:cc;\n\tbh=yozoVyggMEt9Q+AXesjIkCaQA3AAMMg6j/hpQp2iqyw=;\n\tb=XXtTmdIxxNDWHnR9CAvFsD5FT7sjXFTzea91qVQnjvoIWdcOS8ej0mE5rMy2TNCAtz\n\ttTVHf1O1ofGyha8Rv+hbSpUcOezTPWS1GkPhtTGfBRiZWScunE+06m6OOXWhtF2sFJbx\n\thbhzeGFwf2Wvvpdk74fFQL8wi+DTZUJkthEAjxLRNkGb6Lt3z+7yhHJl+SOqMkVoxQyy\n\taUYA/4ESSeWcCcs7xoQIHraBx5/QlnIxbbyH9gL8OS2bBlH7WONcgbYs/Dlo2BIxfQ5Z\n\t6ZNw66xehf/JRnWAJEw7KPw+6H1fibLXwd7aSCGEyuoN9QNu0VNlCcyBNU4QXkwI+kNH\n\ti8WQ==","X-Gm-Message-State":"AHPjjUjQi7aIeafhAOwEwr3KQ0XNt2mQBG3Jtm3L1ZYvJgfczxq3+XBp\n\t8eh6ebklx0HV5Rb42Pu7wsuu7thwxenS","X-Google-Smtp-Source":"AOwi7QCOgmn10iCI0iq7fYw4N/mJHckMRQV3R6/86MRWfcih7CqY5MCJZFiORcQLA8ejo+ylbGEBCQQ+fDnxjwmOV/s=","X-Received":"by 10.55.65.133 with SMTP id o127mr11190180qka.51.1505056780106; \n\tSun, 10 Sep 2017 08:19:40 -0700 
(PDT)","MIME-Version":"1.0","In-Reply-To":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>","From":"Tom Herbert <tom@herbertland.com>","Date":"Sun, 10 Sep 2017 08:19:39 -0700","Message-ID":"<CALx6S371o0Oz9M7jmQWLuG2n3J0NGYGxZ_+okOkFL3x_+GB+2w@mail.gmail.com>","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","To":"Sridhar Samudrala <sridhar.samudrala@intel.com>","Cc":"Alexander Duyck <alexander.h.duyck@intel.com>,\n\tLinux Kernel Network Developers <netdev@vger.kernel.org>","Content-Type":"text/plain; charset=\"UTF-8\"","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1766388,"web_url":"http://patchwork.ozlabs.org/comment/1766388/","msgid":"<91a48bc5-57d9-db09-78c0-98a49b414a28@intel.com>","list_archive_url":null,"date":"2017-09-11T16:48:35","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":65219,"url":"http://patchwork.ozlabs.org/api/people/65219/","name":"Samudrala, Sridhar","email":"sridhar.samudrala@intel.com"},"content":"On 9/9/2017 6:28 PM, Alexander Duyck wrote:\n> On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala\n> <sridhar.samudrala@intel.com> wrote:\n>> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used\n>> to enable symmetric tx and rx queues on a socket.\n>>\n>> This option is specifically useful for epoll based multi threaded workloads\n>> where each thread handles packets received on a single RX queue . 
In this model,\n>> we have noticed that it helps to send the packets on the same TX queue\n>> corresponding to the queue-pair associated with the RX queue specifically when\n>> busy poll is enabled with epoll().\n>>\n>> Two new fields are added to struct sock_common to cache the last rx ifindex and\n>> the rx queue in the receive path of an SKB. __netdev_pick_tx() returns the cached\n>> rx queue when this option is enabled and the TX is happening on the same device.\n>>\n>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>\n>> ---\n>>   include/net/request_sock.h        |  1 +\n>>   include/net/sock.h                | 17 +++++++++++++++++\n>>   include/uapi/asm-generic/socket.h |  2 ++\n>>   net/core/dev.c                    |  8 +++++++-\n>>   net/core/sock.c                   | 10 ++++++++++\n>>   net/ipv4/tcp_input.c              |  1 +\n>>   net/ipv4/tcp_ipv4.c               |  1 +\n>>   net/ipv4/tcp_minisocks.c          |  1 +\n>>   8 files changed, 40 insertions(+), 1 deletion(-)\n>>\n>> diff --git a/include/net/request_sock.h b/include/net/request_sock.h\n>> index 23e2205..c3bc12e 100644\n>> --- a/include/net/request_sock.h\n>> +++ b/include/net/request_sock.h\n>> @@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct request_sock *req)\n>>          req_to_sk(req)->sk_prot = sk_listener->sk_prot;\n>>          sk_node_init(&req_to_sk(req)->sk_node);\n>>          sk_tx_queue_clear(req_to_sk(req));\n>> +       req_to_sk(req)->sk_symmetric_queues = sk_listener->sk_symmetric_queues;\n>>          req->saved_syn = NULL;\n>>          refcount_set(&req->rsk_refcnt, 0);\n>>\n>> diff --git a/include/net/sock.h b/include/net/sock.h\n>> index 03a3625..3421809 100644\n>> --- a/include/net/sock.h\n>> +++ b/include/net/sock.h\n>> @@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char *msg, ...)\n>>    *     @skc_node: main hash linkage for various protocol lookup tables\n>>    *     @skc_nulls_node: main hash linkage for 
TCP/UDP/UDP-Lite protocol\n>>    *     @skc_tx_queue_mapping: tx queue number for this connection\n>> + *     @skc_rx_queue_mapping: rx queue number for this connection\n>> + *     @skc_rx_ifindex: rx ifindex for this connection\n>>    *     @skc_flags: place holder for sk_flags\n>>    *             %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,\n>>    *             %SO_OOBINLINE settings, %SO_TIMESTAMPING settings\n>>    *     @skc_incoming_cpu: record/match cpu processing incoming packets\n>>    *     @skc_refcnt: reference count\n>> + *     @skc_symmetric_queues: symmetric tx/rx queues\n>>    *\n>>    *     This is the minimal network layer representation of sockets, the header\n>>    *     for struct sock and struct inet_timewait_sock.\n>> @@ -177,6 +180,7 @@ struct sock_common {\n>>          unsigned char           skc_reuseport:1;\n>>          unsigned char           skc_ipv6only:1;\n>>          unsigned char           skc_net_refcnt:1;\n>> +       unsigned char           skc_symmetric_queues:1;\n>>          int                     skc_bound_dev_if;\n>>          union {\n>>                  struct hlist_node       skc_bind_node;\n>> @@ -214,6 +218,8 @@ struct sock_common {\n>>                  struct hlist_nulls_node skc_nulls_node;\n>>          };\n>>          int                     skc_tx_queue_mapping;\n>> +       int                     skc_rx_queue_mapping;\n>> +       int                     skc_rx_ifindex;\n>>          union {\n>>                  int             skc_incoming_cpu;\n>>                  u32             skc_rcv_wnd;\n>> @@ -324,6 +330,8 @@ struct sock {\n>>   #define sk_nulls_node          __sk_common.skc_nulls_node\n>>   #define sk_refcnt              __sk_common.skc_refcnt\n>>   #define sk_tx_queue_mapping    __sk_common.skc_tx_queue_mapping\n>> +#define sk_rx_queue_mapping    __sk_common.skc_rx_queue_mapping\n>> +#define sk_rx_ifindex          __sk_common.skc_rx_ifindex\n>>\n>>   #define sk_dontcopy_begin      
__sk_common.skc_dontcopy_begin\n>>   #define sk_dontcopy_end                __sk_common.skc_dontcopy_end\n>> @@ -340,6 +348,7 @@ struct sock {\n>>   #define sk_reuseport           __sk_common.skc_reuseport\n>>   #define sk_ipv6only            __sk_common.skc_ipv6only\n>>   #define sk_net_refcnt          __sk_common.skc_net_refcnt\n>> +#define sk_symmetric_queues    __sk_common.skc_symmetric_queues\n>>   #define sk_bound_dev_if                __sk_common.skc_bound_dev_if\n>>   #define sk_bind_node           __sk_common.skc_bind_node\n>>   #define sk_prot                        __sk_common.skc_prot\n>> @@ -1676,6 +1685,14 @@ static inline int sk_tx_queue_get(const struct sock *sk)\n>>          return sk ? sk->sk_tx_queue_mapping : -1;\n>>   }\n>>\n>> +static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)\n>> +{\n>> +       if (sk->sk_symmetric_queues) {\n>> +               sk->sk_rx_ifindex = skb->skb_iif;\n>> +               sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);\n>> +       }\n>> +}\n>> +\n>>   static inline void sk_set_socket(struct sock *sk, struct socket *sock)\n>>   {\n>>          sk_tx_queue_clear(sk);\n>> diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h\n>> index e47c9e4..f6b416e 100644\n>> --- a/include/uapi/asm-generic/socket.h\n>> +++ b/include/uapi/asm-generic/socket.h\n>> @@ -106,4 +106,6 @@\n>>\n>>   #define SO_ZEROCOPY            60\n>>\n>> +#define SO_SYMMETRIC_QUEUES    61\n>> +\n>>   #endif /* __ASM_GENERIC_SOCKET_H */\n>> diff --git a/net/core/dev.c b/net/core/dev.c\n>> index 270b547..d96cda8 100644\n>> --- a/net/core/dev.c\n>> +++ b/net/core/dev.c\n>> @@ -3322,7 +3322,13 @@ static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)\n>>\n>>          if (queue_index < 0 || skb->ooo_okay ||\n>>              queue_index >= dev->real_num_tx_queues) {\n>> -               int new_index = get_xps_queue(dev, skb);\n>> +               int new_index = -1;\n>> +\n>> +               
if (sk && sk->sk_symmetric_queues && dev->ifindex == sk->sk_rx_ifindex)\n>> +                       new_index = sk->sk_rx_queue_mapping;\n>> +\n>> +               if (new_index < 0 || new_index >= dev->real_num_tx_queues)\n>> +                       new_index = get_xps_queue(dev, skb);\n>>\n>>                  if (new_index < 0)\n>>                          new_index = skb_tx_hash(dev, skb);\n> So one thing I am not sure about is if we should be overriding XPS. It\n> might make sense to instead place this after XPS so that if the root\n> user configures it then it applies, otherwise if the socket is\n> requesting symmetric queues you could fall back to that, and then\n> finally just use hashing as the final solution for distributing the\n> workload.\nIsn't XPS on by default and all the devices that support XPS setup XPS \nmaps as part of\nthe initialization?\nAre you suggesting that the root user needs to disable XPS on the \nspecific queues before an\napplication can use this option to enable symmetric queues?\n\n\n>\n> That way if somebody decides to reserve queues for some sort of\n> specific traffic like AF_PACKET then they can configure the Tx via\n> XPS, configure the Rx via RSS redirection table reprogramming, and\n> then setup a filters on the hardware to direct the traffic they want\n> to the queues that are running AF_PACKET.\n>\n>> diff --git a/net/core/sock.c b/net/core/sock.c\n>> index 9b7b6bb..3876cce 100644\n>> --- a/net/core/sock.c\n>> +++ b/net/core/sock.c\n>> @@ -1059,6 +1059,10 @@ int sock_setsockopt(struct socket *sock, int level, int optname,\n>>                          sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);\n>>                  break;\n>>\n>> +       case SO_SYMMETRIC_QUEUES:\n>> +               sk->sk_symmetric_queues = valbool;\n>> +               break;\n>> +\n>>          default:\n>>                  ret = -ENOPROTOOPT;\n>>                  break;\n>> @@ -1391,6 +1395,10 @@ int sock_getsockopt(struct socket *sock, int level, int 
optname,\n>>                  v.val = sock_flag(sk, SOCK_ZEROCOPY);\n>>                  break;\n>>\n>> +       case SO_SYMMETRIC_QUEUES:\n>> +               v.val = sk->sk_symmetric_queues;\n>> +               break;\n>> +\n>>          default:\n>>                  /* We implement the SO_SNDLOWAT etc to not be settable\n>>                   * (1003.1g 7).\n>> @@ -2738,6 +2746,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)\n>>          sk->sk_max_pacing_rate = ~0U;\n>>          sk->sk_pacing_rate = ~0U;\n>>          sk->sk_incoming_cpu = -1;\n>> +       sk->sk_rx_ifindex = -1;\n>> +       sk->sk_rx_queue_mapping = -1;\n>>          /*\n>>           * Before updating sk_refcnt, we must commit prior changes to memory\n>>           * (Documentation/RCU/rculist_nulls.txt for details)\n>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c\n>> index c5d7656..12381e0 100644\n>> --- a/net/ipv4/tcp_input.c\n>> +++ b/net/ipv4/tcp_input.c\n>> @@ -6356,6 +6356,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,\n>>          tcp_rsk(req)->snt_isn = isn;\n>>          tcp_rsk(req)->txhash = net_tx_rndhash();\n>>          tcp_openreq_init_rwin(req, sk, dst);\n>> +       sk_mark_rx_queue(req_to_sk(req), skb);\n>>          if (!want_cookie) {\n>>                  tcp_reqsk_record_syn(sk, req, skb);\n>>                  fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc);\n>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c\n>> index a63486a..82f9af4 100644\n>> --- a/net/ipv4/tcp_ipv4.c\n>> +++ b/net/ipv4/tcp_ipv4.c\n>> @@ -1450,6 +1450,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)\n>>\n>>                  sock_rps_save_rxhash(sk, skb);\n>>                  sk_mark_napi_id(sk, skb);\n>> +               sk_mark_rx_queue(sk, skb);\n>>                  if (dst) {\n>>                          if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||\n>>                              !dst->ops->check(dst, 0)) {\n>> diff --git 
a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c\n>> index 188a6f3..2b5efd5 100644\n>> --- a/net/ipv4/tcp_minisocks.c\n>> +++ b/net/ipv4/tcp_minisocks.c\n>> @@ -809,6 +809,7 @@ int tcp_child_process(struct sock *parent, struct sock *child,\n>>\n>>          /* record NAPI ID of child */\n>>          sk_mark_napi_id(child, skb);\n>> +       sk_mark_rx_queue(child, skb);\n>>\n>>          tcp_segs_in(tcp_sk(child), skb);\n>>          if (!sock_owned_by_user(child)) {\n>> --\n>> 1.8.3.1\n>>","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":"ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xrYmH2b7cz9s81\n\tfor <patchwork-incoming@ozlabs.org>;\n\tTue, 12 Sep 2017 02:49:03 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1752214AbdIKQtA (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tMon, 11 Sep 2017 12:49:00 -0400","from mga01.intel.com ([192.55.52.88]:45964 \"EHLO mga01.intel.com\"\n\trhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP\n\tid S1752025AbdIKQs7 (ORCPT <rfc822;netdev@vger.kernel.org>);\n\tMon, 11 Sep 2017 12:48:59 -0400","from orsmga002.jf.intel.com ([10.7.209.21])\n\tby fmsmga101.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;\n\t11 Sep 2017 09:48:36 -0700","from samudral-mobl1.amr.corp.intel.com (HELO [10.165.248.23])\n\t([10.165.248.23])\n\tby orsmga002.jf.intel.com with ESMTP; 11 Sep 2017 09:48:35 -0700"],"X-ExtLoop1":"1","X-IronPort-AV":"E=Sophos;i=\"5.42,378,1500966000\"; d=\"scan'208\";a=\"134108856\"","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","To":"Alexander 
Duyck <alexander.duyck@gmail.com>","Cc":"\"Duyck, Alexander H\" <alexander.h.duyck@intel.com>,\n\tNetdev <netdev@vger.kernel.org>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>\n\t<CAKgT0UfZed3_3KBmKbMbcmdsA+ctFRUCu8jp_rCnatC9AMv__g@mail.gmail.com>","From":"\"Samudrala, Sridhar\" <sridhar.samudrala@intel.com>","Message-ID":"<91a48bc5-57d9-db09-78c0-98a49b414a28@intel.com>","Date":"Mon, 11 Sep 2017 09:48:35 -0700","User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101\n\tThunderbird/52.2.1","MIME-Version":"1.0","In-Reply-To":"<CAKgT0UfZed3_3KBmKbMbcmdsA+ctFRUCu8jp_rCnatC9AMv__g@mail.gmail.com>","Content-Type":"text/plain; charset=utf-8; format=flowed","Content-Transfer-Encoding":"7bit","Content-Language":"en-US","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1766389,"web_url":"http://patchwork.ozlabs.org/comment/1766389/","msgid":"<f98198c4-16f2-5271-3e12-7984a03bd6db@intel.com>","list_archive_url":null,"date":"2017-09-11T16:49:29","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":65219,"url":"http://patchwork.ozlabs.org/api/people/65219/","name":"Samudrala, Sridhar","email":"sridhar.samudrala@intel.com"},"content":"On 9/10/2017 8:19 AM, Tom Herbert wrote:\n> On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala\n> <sridhar.samudrala@intel.com> wrote:\n>> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used\n>> to enable symmetric tx and rx queues on a socket.\n>>\n>> This option is specifically useful for epoll based multi threaded workloads\n>> where each thread handles packets received on a single RX queue . 
In this model,\n>> we have noticed that it helps to send the packets on the same TX queue\n>> corresponding to the queue-pair associated with the RX queue specifically when\n>> busy poll is enabled with epoll().\n>>\n>> Two new fields are added to struct sock_common to cache the last rx ifindex and\n>> the rx queue in the receive path of an SKB. __netdev_pick_tx() returns the cached\n>> rx queue when this option is enabled and the TX is happening on the same device.\n>>\n>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>\n>> ---\n>>   include/net/request_sock.h        |  1 +\n>>   include/net/sock.h                | 17 +++++++++++++++++\n>>   include/uapi/asm-generic/socket.h |  2 ++\n>>   net/core/dev.c                    |  8 +++++++-\n>>   net/core/sock.c                   | 10 ++++++++++\n>>   net/ipv4/tcp_input.c              |  1 +\n>>   net/ipv4/tcp_ipv4.c               |  1 +\n>>   net/ipv4/tcp_minisocks.c          |  1 +\n>>   8 files changed, 40 insertions(+), 1 deletion(-)\n>>\n>> diff --git a/include/net/request_sock.h b/include/net/request_sock.h\n>> index 23e2205..c3bc12e 100644\n>> --- a/include/net/request_sock.h\n>> +++ b/include/net/request_sock.h\n>> @@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct request_sock *req)\n>>          req_to_sk(req)->sk_prot = sk_listener->sk_prot;\n>>          sk_node_init(&req_to_sk(req)->sk_node);\n>>          sk_tx_queue_clear(req_to_sk(req));\n>> +       req_to_sk(req)->sk_symmetric_queues = sk_listener->sk_symmetric_queues;\n>>          req->saved_syn = NULL;\n>>          refcount_set(&req->rsk_refcnt, 0);\n>>\n>> diff --git a/include/net/sock.h b/include/net/sock.h\n>> index 03a3625..3421809 100644\n>> --- a/include/net/sock.h\n>> +++ b/include/net/sock.h\n>> @@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char *msg, ...)\n>>    *     @skc_node: main hash linkage for various protocol lookup tables\n>>    *     @skc_nulls_node: main hash linkage for 
TCP/UDP/UDP-Lite protocol\n>>    *     @skc_tx_queue_mapping: tx queue number for this connection\n>> + *     @skc_rx_queue_mapping: rx queue number for this connection\n>> + *     @skc_rx_ifindex: rx ifindex for this connection\n>>    *     @skc_flags: place holder for sk_flags\n>>    *             %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,\n>>    *             %SO_OOBINLINE settings, %SO_TIMESTAMPING settings\n>>    *     @skc_incoming_cpu: record/match cpu processing incoming packets\n>>    *     @skc_refcnt: reference count\n>> + *     @skc_symmetric_queues: symmetric tx/rx queues\n>>    *\n>>    *     This is the minimal network layer representation of sockets, the header\n>>    *     for struct sock and struct inet_timewait_sock.\n>> @@ -177,6 +180,7 @@ struct sock_common {\n>>          unsigned char           skc_reuseport:1;\n>>          unsigned char           skc_ipv6only:1;\n>>          unsigned char           skc_net_refcnt:1;\n>> +       unsigned char           skc_symmetric_queues:1;\n>>          int                     skc_bound_dev_if;\n>>          union {\n>>                  struct hlist_node       skc_bind_node;\n>> @@ -214,6 +218,8 @@ struct sock_common {\n>>                  struct hlist_nulls_node skc_nulls_node;\n>>          };\n>>          int                     skc_tx_queue_mapping;\n>> +       int                     skc_rx_queue_mapping;\n>> +       int                     skc_rx_ifindex;\n>>          union {\n>>                  int             skc_incoming_cpu;\n>>                  u32             skc_rcv_wnd;\n>> @@ -324,6 +330,8 @@ struct sock {\n>>   #define sk_nulls_node          __sk_common.skc_nulls_node\n>>   #define sk_refcnt              __sk_common.skc_refcnt\n>>   #define sk_tx_queue_mapping    __sk_common.skc_tx_queue_mapping\n>> +#define sk_rx_queue_mapping    __sk_common.skc_rx_queue_mapping\n>> +#define sk_rx_ifindex          __sk_common.skc_rx_ifindex\n>>\n>>   #define sk_dontcopy_begin      
__sk_common.skc_dontcopy_begin\n>>   #define sk_dontcopy_end                __sk_common.skc_dontcopy_end\n>> @@ -340,6 +348,7 @@ struct sock {\n>>   #define sk_reuseport           __sk_common.skc_reuseport\n>>   #define sk_ipv6only            __sk_common.skc_ipv6only\n>>   #define sk_net_refcnt          __sk_common.skc_net_refcnt\n>> +#define sk_symmetric_queues    __sk_common.skc_symmetric_queues\n>>   #define sk_bound_dev_if                __sk_common.skc_bound_dev_if\n>>   #define sk_bind_node           __sk_common.skc_bind_node\n>>   #define sk_prot                        __sk_common.skc_prot\n>> @@ -1676,6 +1685,14 @@ static inline int sk_tx_queue_get(const struct sock *sk)\n>>          return sk ? sk->sk_tx_queue_mapping : -1;\n>>   }\n>>\n>> +static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)\n>> +{\n>> +       if (sk->sk_symmetric_queues) {\n>> +               sk->sk_rx_ifindex = skb->skb_iif;\n>> +               sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);\n>> +       }\n>> +}\n>> +\n>>   static inline void sk_set_socket(struct sock *sk, struct socket *sock)\n>>   {\n>>          sk_tx_queue_clear(sk);\n>> diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h\n>> index e47c9e4..f6b416e 100644\n>> --- a/include/uapi/asm-generic/socket.h\n>> +++ b/include/uapi/asm-generic/socket.h\n>> @@ -106,4 +106,6 @@\n>>\n>>   #define SO_ZEROCOPY            60\n>>\n>> +#define SO_SYMMETRIC_QUEUES    61\n>> +\n>>   #endif /* __ASM_GENERIC_SOCKET_H */\n>> diff --git a/net/core/dev.c b/net/core/dev.c\n>> index 270b547..d96cda8 100644\n>> --- a/net/core/dev.c\n>> +++ b/net/core/dev.c\n>> @@ -3322,7 +3322,13 @@ static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)\n>>\n>>          if (queue_index < 0 || skb->ooo_okay ||\n>>              queue_index >= dev->real_num_tx_queues) {\n>> -               int new_index = get_xps_queue(dev, skb);\n>> +               int new_index = -1;\n>> +\n>> +               
if (sk && sk->sk_symmetric_queues && dev->ifindex == sk->sk_rx_ifindex)\n>> +                       new_index = sk->sk_rx_queue_mapping;\n>> +\n>> +               if (new_index < 0 || new_index >= dev->real_num_tx_queues)\n>> +                       new_index = get_xps_queue(dev, skb);\n> This enforces that notion of queue pairs which is not universal\n> concept to NICs. There are many devices and instances where we\n> purposely avoid having a 1-1 relationship between rx and tx queues.\n\nYes. This patch assumes that TX and RX queues come in pairs.\n\n> An alternative might be to create a rx queue to tx queue map, add the\n> rx queue argument to get_xps_queue, and then that function can\n> consider the mapping. The administrator can configure the mapping as\n> appropriate and can select which rx queues are subject to the mapping.\nThis alternative looks much cleaner and doesn't require the apps to \nconfigure the\nqueues. Do we need to support 1 to many rx to tx queue mappings?\nFor our symmetric queues use case, where a single application thread is \nassociated with\n1 queue-pair, a 1-1 mapping is sufficient.\nDo you see any use case where it is useful to support 1-many mappings?\nI guess I can add a sysfs entry per rx-queue to set up a tx-queue OR \ntx-queue-map.\n\n\n>\n>>                  if (new_index < 0)\n>>                          new_index = skb_tx_hash(dev, skb);\n>> diff --git a/net/core/sock.c b/net/core/sock.c\n>> index 9b7b6bb..3876cce 100644\n>> --- a/net/core/sock.c\n>> +++ b/net/core/sock.c\n>> @@ -1059,6 +1059,10 @@ int sock_setsockopt(struct socket *sock, int level, int optname,\n>>                          sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);\n>>                  break;\n>>\n>> +       case SO_SYMMETRIC_QUEUES:\n>> +               sk->sk_symmetric_queues = valbool;\n>> +               break;\n>> +\n> Allowing users control over this seems problematic to me. 
The intent\n> of packet steering is to provide good loading across the whole\n> systems, not just for individual applications. Exposing this control\n> makes that mission harder.\n\nSure. If we can do this on a per rxqueue basis that can be configured by \nthe administrator,\nit would be a better option.\n\n>\n>>          default:\n>>                  ret = -ENOPROTOOPT;\n>>                  break;\n>> @@ -1391,6 +1395,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,\n>>                  v.val = sock_flag(sk, SOCK_ZEROCOPY);\n>>                  break;\n>>\n>> +       case SO_SYMMETRIC_QUEUES:\n>> +               v.val = sk->sk_symmetric_queues;\n>> +               break;\n>> +\n>>          default:\n>>                  /* We implement the SO_SNDLOWAT etc to not be settable\n>>                   * (1003.1g 7).\n>> @@ -2738,6 +2746,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)\n>>          sk->sk_max_pacing_rate = ~0U;\n>>          sk->sk_pacing_rate = ~0U;\n>>          sk->sk_incoming_cpu = -1;\n>> +       sk->sk_rx_ifindex = -1;\n>> +       sk->sk_rx_queue_mapping = -1;\n>>          /*\n>>           * Before updating sk_refcnt, we must commit prior changes to memory\n>>           * (Documentation/RCU/rculist_nulls.txt for details)\n>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c\n>> index c5d7656..12381e0 100644\n>> --- a/net/ipv4/tcp_input.c\n>> +++ b/net/ipv4/tcp_input.c\n>> @@ -6356,6 +6356,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,\n>>          tcp_rsk(req)->snt_isn = isn;\n>>          tcp_rsk(req)->txhash = net_tx_rndhash();\n>>          tcp_openreq_init_rwin(req, sk, dst);\n>> +       sk_mark_rx_queue(req_to_sk(req), skb);\n>>          if (!want_cookie) {\n>>                  tcp_reqsk_record_syn(sk, req, skb);\n>>                  fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc);\n>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c\n>> index a63486a..82f9af4 100644\n>> --- 
a/net/ipv4/tcp_ipv4.c\n>> +++ b/net/ipv4/tcp_ipv4.c\n>> @@ -1450,6 +1450,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)\n>>\n>>                  sock_rps_save_rxhash(sk, skb);\n>>                  sk_mark_napi_id(sk, skb);\n>> +               sk_mark_rx_queue(sk, skb);\n> This could be part of sock_rps_save_rxhash instead of new functions in\n> core receive path. That could be renamed sock_save_rx_info or\n> something like that. UDP support is also lacking with this patch so\n> that gets solved by a common function also.\n\nSure. Will look into refactoring sock_rps_save_rxhash() to save the rx \nqueue info.\n\n>\n>>                  if (dst) {\n>>                          if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||\n>>                              !dst->ops->check(dst, 0)) {\n>> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c\n>> index 188a6f3..2b5efd5 100644\n>> --- a/net/ipv4/tcp_minisocks.c\n>> +++ b/net/ipv4/tcp_minisocks.c\n>> @@ -809,6 +809,7 @@ int tcp_child_process(struct sock *parent, struct sock *child,\n>>\n>>          /* record NAPI ID of child */\n>>          sk_mark_napi_id(child, skb);\n>> +       sk_mark_rx_queue(child, skb);\n>>\n>>          tcp_segs_in(tcp_sk(child), skb);\n>>          if (!sock_owned_by_user(child)) {\n>> --\n>> 1.8.3.1\n>>","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":"ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xrYnG30Bgz9s81\n\tfor <patchwork-incoming@ozlabs.org>;\n\tTue, 12 Sep 2017 02:49:54 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1752180AbdIKQtw (ORCPT 
<rfc822;patchwork-incoming@ozlabs.org>);\n\tMon, 11 Sep 2017 12:49:52 -0400","from mga02.intel.com ([134.134.136.20]:51222 \"EHLO mga02.intel.com\"\n\trhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP\n\tid S1751916AbdIKQtv (ORCPT <rfc822;netdev@vger.kernel.org>);\n\tMon, 11 Sep 2017 12:49:51 -0400","from orsmga002.jf.intel.com ([10.7.209.21])\n\tby orsmga101.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;\n\t11 Sep 2017 09:49:32 -0700","from samudral-mobl1.amr.corp.intel.com (HELO [10.165.248.23])\n\t([10.165.248.23])\n\tby orsmga002.jf.intel.com with ESMTP; 11 Sep 2017 09:49:30 -0700"],"X-ExtLoop1":"1","X-IronPort-AV":"E=Sophos;i=\"5.42,378,1500966000\"; d=\"scan'208\";a=\"134109443\"","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","To":"Tom Herbert <tom@herbertland.com>","Cc":"Alexander Duyck <alexander.h.duyck@intel.com>,\n\tLinux Kernel Network Developers <netdev@vger.kernel.org>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>\n\t<CALx6S371o0Oz9M7jmQWLuG2n3J0NGYGxZ_+okOkFL3x_+GB+2w@mail.gmail.com>","From":"\"Samudrala, Sridhar\" <sridhar.samudrala@intel.com>","Message-ID":"<f98198c4-16f2-5271-3e12-7984a03bd6db@intel.com>","Date":"Mon, 11 Sep 2017 09:49:29 -0700","User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101\n\tThunderbird/52.2.1","MIME-Version":"1.0","In-Reply-To":"<CALx6S371o0Oz9M7jmQWLuG2n3J0NGYGxZ_+okOkFL3x_+GB+2w@mail.gmail.com>","Content-Type":"text/plain; charset=utf-8; format=flowed","Content-Transfer-Encoding":"7bit","Content-Language":"en-US","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1766424,"web_url":"http://patchwork.ozlabs.org/comment/1766424/","msgid":"<CAKgT0Uc_-GVJgg3q6N4orJufW1iU9VZ5ZMaGo=SNUeCZr+k7HA@mail.gmail.com>","list_archive_url":null,"date":"2017-09-11T17:48:55","subject":"Re: [RFC PATCH] net: Introduce 
a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":252,"url":"http://patchwork.ozlabs.org/api/people/252/","name":"Alexander Duyck","email":"alexander.duyck@gmail.com"},"content":"On Mon, Sep 11, 2017 at 9:48 AM, Samudrala, Sridhar\n<sridhar.samudrala@intel.com> wrote:\n>\n>\n> On 9/9/2017 6:28 PM, Alexander Duyck wrote:\n>>\n>> On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala\n>> <sridhar.samudrala@intel.com> wrote:\n>>>\n>>> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be\n>>> used\n>>> to enable symmetric tx and rx queues on a socket.\n>>>\n>>> This option is specifically useful for epoll based multi threaded\n>>> workloads\n>>> where each thread handles packets received on a single RX queue . In this\n>>> model,\n>>> we have noticed that it helps to send the packets on the same TX queue\n>>> corresponding to the queue-pair associated with the RX queue specifically\n>>> when\n>>> busy poll is enabled with epoll().\n>>>\n>>> Two new fields are added to struct sock_common to cache the last rx\n>>> ifindex and\n>>> the rx queue in the receive path of an SKB. 
__netdev_pick_tx() returns\n>>> the cached\n>>> rx queue when this option is enabled and the TX is happening on the same\n>>> device.\n>>>\n>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>\n>>> ---\n>>>   include/net/request_sock.h        |  1 +\n>>>   include/net/sock.h                | 17 +++++++++++++++++\n>>>   include/uapi/asm-generic/socket.h |  2 ++\n>>>   net/core/dev.c                    |  8 +++++++-\n>>>   net/core/sock.c                   | 10 ++++++++++\n>>>   net/ipv4/tcp_input.c              |  1 +\n>>>   net/ipv4/tcp_ipv4.c               |  1 +\n>>>   net/ipv4/tcp_minisocks.c          |  1 +\n>>>   8 files changed, 40 insertions(+), 1 deletion(-)\n>>>\n>>> diff --git a/include/net/request_sock.h b/include/net/request_sock.h\n>>> index 23e2205..c3bc12e 100644\n>>> --- a/include/net/request_sock.h\n>>> +++ b/include/net/request_sock.h\n>>> @@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct\n>>> request_sock *req)\n>>>          req_to_sk(req)->sk_prot = sk_listener->sk_prot;\n>>>          sk_node_init(&req_to_sk(req)->sk_node);\n>>>          sk_tx_queue_clear(req_to_sk(req));\n>>> +       req_to_sk(req)->sk_symmetric_queues =\n>>> sk_listener->sk_symmetric_queues;\n>>>          req->saved_syn = NULL;\n>>>          refcount_set(&req->rsk_refcnt, 0);\n>>>\n>>> diff --git a/include/net/sock.h b/include/net/sock.h\n>>> index 03a3625..3421809 100644\n>>> --- a/include/net/sock.h\n>>> +++ b/include/net/sock.h\n>>> @@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char\n>>> *msg, ...)\n>>>    *     @skc_node: main hash linkage for various protocol lookup tables\n>>>    *     @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol\n>>>    *     @skc_tx_queue_mapping: tx queue number for this connection\n>>> + *     @skc_rx_queue_mapping: rx queue number for this connection\n>>> + *     @skc_rx_ifindex: rx ifindex for this connection\n>>>    *     @skc_flags: place holder for sk_flags\n>>>    *            
 %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,\n>>>    *             %SO_OOBINLINE settings, %SO_TIMESTAMPING settings\n>>>    *     @skc_incoming_cpu: record/match cpu processing incoming packets\n>>>    *     @skc_refcnt: reference count\n>>> + *     @skc_symmetric_queues: symmetric tx/rx queues\n>>>    *\n>>>    *     This is the minimal network layer representation of sockets, the\n>>> header\n>>>    *     for struct sock and struct inet_timewait_sock.\n>>> @@ -177,6 +180,7 @@ struct sock_common {\n>>>          unsigned char           skc_reuseport:1;\n>>>          unsigned char           skc_ipv6only:1;\n>>>          unsigned char           skc_net_refcnt:1;\n>>> +       unsigned char           skc_symmetric_queues:1;\n>>>          int                     skc_bound_dev_if;\n>>>          union {\n>>>                  struct hlist_node       skc_bind_node;\n>>> @@ -214,6 +218,8 @@ struct sock_common {\n>>>                  struct hlist_nulls_node skc_nulls_node;\n>>>          };\n>>>          int                     skc_tx_queue_mapping;\n>>> +       int                     skc_rx_queue_mapping;\n>>> +       int                     skc_rx_ifindex;\n>>>          union {\n>>>                  int             skc_incoming_cpu;\n>>>                  u32             skc_rcv_wnd;\n>>> @@ -324,6 +330,8 @@ struct sock {\n>>>   #define sk_nulls_node          __sk_common.skc_nulls_node\n>>>   #define sk_refcnt              __sk_common.skc_refcnt\n>>>   #define sk_tx_queue_mapping    __sk_common.skc_tx_queue_mapping\n>>> +#define sk_rx_queue_mapping    __sk_common.skc_rx_queue_mapping\n>>> +#define sk_rx_ifindex          __sk_common.skc_rx_ifindex\n>>>\n>>>   #define sk_dontcopy_begin      __sk_common.skc_dontcopy_begin\n>>>   #define sk_dontcopy_end                __sk_common.skc_dontcopy_end\n>>> @@ -340,6 +348,7 @@ struct sock {\n>>>   #define sk_reuseport           __sk_common.skc_reuseport\n>>>   #define sk_ipv6only            __sk_common.skc_ipv6only\n>>>   
#define sk_net_refcnt          __sk_common.skc_net_refcnt\n>>> +#define sk_symmetric_queues    __sk_common.skc_symmetric_queues\n>>>   #define sk_bound_dev_if                __sk_common.skc_bound_dev_if\n>>>   #define sk_bind_node           __sk_common.skc_bind_node\n>>>   #define sk_prot                        __sk_common.skc_prot\n>>> @@ -1676,6 +1685,14 @@ static inline int sk_tx_queue_get(const struct\n>>> sock *sk)\n>>>          return sk ? sk->sk_tx_queue_mapping : -1;\n>>>   }\n>>>\n>>> +static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff\n>>> *skb)\n>>> +{\n>>> +       if (sk->sk_symmetric_queues) {\n>>> +               sk->sk_rx_ifindex = skb->skb_iif;\n>>> +               sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);\n>>> +       }\n>>> +}\n>>> +\n>>>   static inline void sk_set_socket(struct sock *sk, struct socket *sock)\n>>>   {\n>>>          sk_tx_queue_clear(sk);\n>>> diff --git a/include/uapi/asm-generic/socket.h\n>>> b/include/uapi/asm-generic/socket.h\n>>> index e47c9e4..f6b416e 100644\n>>> --- a/include/uapi/asm-generic/socket.h\n>>> +++ b/include/uapi/asm-generic/socket.h\n>>> @@ -106,4 +106,6 @@\n>>>\n>>>   #define SO_ZEROCOPY            60\n>>>\n>>> +#define SO_SYMMETRIC_QUEUES    61\n>>> +\n>>>   #endif /* __ASM_GENERIC_SOCKET_H */\n>>> diff --git a/net/core/dev.c b/net/core/dev.c\n>>> index 270b547..d96cda8 100644\n>>> --- a/net/core/dev.c\n>>> +++ b/net/core/dev.c\n>>> @@ -3322,7 +3322,13 @@ static u16 __netdev_pick_tx(struct net_device\n>>> *dev, struct sk_buff *skb)\n>>>\n>>>          if (queue_index < 0 || skb->ooo_okay ||\n>>>              queue_index >= dev->real_num_tx_queues) {\n>>> -               int new_index = get_xps_queue(dev, skb);\n>>> +               int new_index = -1;\n>>> +\n>>> +               if (sk && sk->sk_symmetric_queues && dev->ifindex ==\n>>> sk->sk_rx_ifindex)\n>>> +                       new_index = sk->sk_rx_queue_mapping;\n>>> +\n>>> +               if (new_index < 0 || new_index >=\n>>> 
dev->real_num_tx_queues)\n>>> +                       new_index = get_xps_queue(dev, skb);\n>>>\n>>>                  if (new_index < 0)\n>>>                          new_index = skb_tx_hash(dev, skb);\n>>\n>> So one thing I am not sure about is if we should be overriding XPS. It\n>> might make sense to instead place this after XPS so that if the root\n>> user configures it then it applies, otherwise if the socket is\n>> requesting symmetric queues you could fall back to that, and then\n>> finally just use hashing as the final solution for distributing the\n>> workload.\n>\n> Isn't XPS on by default and all the devices that support XPS setup XPS maps\n> as part of\n> the initialization?\n> Are you suggesting that the root user needs to disable XPS on the specific\n> queues before an\n> application can use this option to enable symmetric queues?\n>\n\nXPS and this symmetric queue logic won't really play well together\nsince they are both attempting to do the same thing but from different\nends. Some sort of priority needs to be defined, and I would place the\nrequest of the kernel/root user above the request of a socket and/or\napplication.\n\nI guess it comes down to if we let the Tx CPU pick the queue or Rx\nqueue, and I would argue with an XPS configuration in place the kernel\nis requesting that the Tx CPU sets the Tx queue, not the Rx/Tx hash or\nRx queue. 
This is similar to how we handle this for routing today as\nXPS will also override the Rx to Tx mapping used there.\n\n>>\n>> That way if somebody decides to reserve queues for some sort of\n>> specific traffic like AF_PACKET then they can configure the Tx via\n>> XPS, configure the Rx via RSS redirection table reprogramming, and\n>> then setup a filters on the hardware to direct the traffic they want\n>> to the queues that are running AF_PACKET.\n>>\n>>> diff --git a/net/core/sock.c b/net/core/sock.c\n>>> index 9b7b6bb..3876cce 100644\n>>> --- a/net/core/sock.c\n>>> +++ b/net/core/sock.c\n>>> @@ -1059,6 +1059,10 @@ int sock_setsockopt(struct socket *sock, int\n>>> level, int optname,\n>>>                          sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);\n>>>                  break;\n>>>\n>>> +       case SO_SYMMETRIC_QUEUES:\n>>> +               sk->sk_symmetric_queues = valbool;\n>>> +               break;\n>>> +\n>>>          default:\n>>>                  ret = -ENOPROTOOPT;\n>>>                  break;\n>>> @@ -1391,6 +1395,10 @@ int sock_getsockopt(struct socket *sock, int\n>>> level, int optname,\n>>>                  v.val = sock_flag(sk, SOCK_ZEROCOPY);\n>>>                  break;\n>>>\n>>> +       case SO_SYMMETRIC_QUEUES:\n>>> +               v.val = sk->sk_symmetric_queues;\n>>> +               break;\n>>> +\n>>>          default:\n>>>                  /* We implement the SO_SNDLOWAT etc to not be settable\n>>>                   * (1003.1g 7).\n>>> @@ -2738,6 +2746,8 @@ void sock_init_data(struct socket *sock, struct\n>>> sock *sk)\n>>>          sk->sk_max_pacing_rate = ~0U;\n>>>          sk->sk_pacing_rate = ~0U;\n>>>          sk->sk_incoming_cpu = -1;\n>>> +       sk->sk_rx_ifindex = -1;\n>>> +       sk->sk_rx_queue_mapping = -1;\n>>>          /*\n>>>           * Before updating sk_refcnt, we must commit prior changes to\n>>> memory\n>>>           * (Documentation/RCU/rculist_nulls.txt for details)\n>>> diff --git a/net/ipv4/tcp_input.c 
b/net/ipv4/tcp_input.c\n>>> index c5d7656..12381e0 100644\n>>> --- a/net/ipv4/tcp_input.c\n>>> +++ b/net/ipv4/tcp_input.c\n>>> @@ -6356,6 +6356,7 @@ int tcp_conn_request(struct request_sock_ops\n>>> *rsk_ops,\n>>>          tcp_rsk(req)->snt_isn = isn;\n>>>          tcp_rsk(req)->txhash = net_tx_rndhash();\n>>>          tcp_openreq_init_rwin(req, sk, dst);\n>>> +       sk_mark_rx_queue(req_to_sk(req), skb);\n>>>          if (!want_cookie) {\n>>>                  tcp_reqsk_record_syn(sk, req, skb);\n>>>                  fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc);\n>>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c\n>>> index a63486a..82f9af4 100644\n>>> --- a/net/ipv4/tcp_ipv4.c\n>>> +++ b/net/ipv4/tcp_ipv4.c\n>>> @@ -1450,6 +1450,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff\n>>> *skb)\n>>>\n>>>                  sock_rps_save_rxhash(sk, skb);\n>>>                  sk_mark_napi_id(sk, skb);\n>>> +               sk_mark_rx_queue(sk, skb);\n>>>                  if (dst) {\n>>>                          if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif\n>>> ||\n>>>                              !dst->ops->check(dst, 0)) {\n>>> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c\n>>> index 188a6f3..2b5efd5 100644\n>>> --- a/net/ipv4/tcp_minisocks.c\n>>> +++ b/net/ipv4/tcp_minisocks.c\n>>> @@ -809,6 +809,7 @@ int tcp_child_process(struct sock *parent, struct\n>>> sock *child,\n>>>\n>>>          /* record NAPI ID of child */\n>>>          sk_mark_napi_id(child, skb);\n>>> +       sk_mark_rx_queue(child, skb);\n>>>\n>>>          tcp_segs_in(tcp_sk(child), skb);\n>>>          if (!sock_owned_by_user(child)) {\n>>> --\n>>> 1.8.3.1\n>>>\n>","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; 
helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (2048-bit key;\n\tunprotected) header.d=gmail.com header.i=@gmail.com\n\theader.b=\"SrUAy+l3\"; dkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xrb5T22F2z9s7F\n\tfor <patchwork-incoming@ozlabs.org>;\n\tTue, 12 Sep 2017 03:49:01 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751337AbdIKRs6 (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tMon, 11 Sep 2017 13:48:58 -0400","from mail-qk0-f195.google.com ([209.85.220.195]:32899 \"EHLO\n\tmail-qk0-f195.google.com\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1750987AbdIKRs5 (ORCPT\n\t<rfc822;netdev@vger.kernel.org>); Mon, 11 Sep 2017 13:48:57 -0400","by mail-qk0-f195.google.com with SMTP id g128so5864931qke.0\n\tfor <netdev@vger.kernel.org>; Mon, 11 Sep 2017 10:48:57 -0700 (PDT)","by 10.140.85.211 with HTTP; Mon, 11 Sep 2017 10:48:55 -0700 (PDT)"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=gmail.com; s=20161025;\n\th=mime-version:in-reply-to:references:from:date:message-id:subject:to\n\t:cc; bh=uZK0qJQph6FdPRll8jo/JB0x0ZQU+L0PXhfHs3a0wbY=;\n\tb=SrUAy+l3xdn3EK7qE5ZoGjfeajaMx+34EdjpNlqgpwkXrl3moAFH2nywbRzlYpSSqd\n\tTFU9mKJ9wAkoK304sWO7/fsQVea0KNR4/ob8FHcIHu1jMbXJKzNFG3/xQEimSAUNDcQD\n\tAtd6DloOR4m4wvVR1Z7dkVNl4mAoCDoV8llinNWqNbeXH7N8m74/MFc1uEDxCKrhUPF3\n\tcTfb5PHuPCt5D60TTTKMb+JI8KMwjYNd8gkOqc0LWbXo78op/Q3dQwBRVckTa1cjdzSW\n\tWLh+MD2l7wQpy0rWo9UR6RW+zGLDkI6sbiTHMfbcqs93r6U4O2h6dl2aoYfUtz14U60t\n\tyN4A==","X-Google-DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=1e100.net; 
s=20161025;\n\th=x-gm-message-state:mime-version:in-reply-to:references:from:date\n\t:message-id:subject:to:cc;\n\tbh=uZK0qJQph6FdPRll8jo/JB0x0ZQU+L0PXhfHs3a0wbY=;\n\tb=T420x4uu2d6L7Sj1gRJcEfVT7Y9SHCfPnhP628lomzM91BIYikfVlain6A2j1ZcWlI\n\tGAM5+JqUNWBqb2THtzB9JKBjvk2fJWiwIBeKyWGzXeadKj8f05ZcwTyj+16pcmDW+VL+\n\tFkQWKBjbKUk1DKEc1wlsq7+mBv9kgbtpT05f2Ewdl4WtPDW2nX3m6m9rnCayNm3Sxv9U\n\tIuAVtjPV+FUXuebM9ndB0n/0UrAUSxdFT7B9sb6V6axpMzjpPC420ixOtJRErLjpuPUR\n\tLkN0lX1IV24G4zFJ+5JedKtCe+TSY847IiZgPRYaMby6lQsDcBFH7uiBTPMDO1RTCqWZ\n\tK+Ig==","X-Gm-Message-State":"AHPjjUiLtvPMV/HwgZRaSPXqcm+IASbP1nhsr1FHweugq6oIQ52xPHXi\n\tSkHC73dNRgsjBIGM6UrLit2fk5wLzw==","X-Google-Smtp-Source":"AOwi7QANYRf83VR9+oqvuyLYfjc5h8oHYQ2rIcyr52TjgtyEAdjVkUkjCRFEIZpVU/q8B75NKaqwvj65orNS2PXRJFA=","X-Received":"by 10.55.190.198 with SMTP id\n\to189mr15416295qkf.103.1505152136276; \n\tMon, 11 Sep 2017 10:48:56 -0700 (PDT)","MIME-Version":"1.0","In-Reply-To":"<91a48bc5-57d9-db09-78c0-98a49b414a28@intel.com>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>\n\t<CAKgT0UfZed3_3KBmKbMbcmdsA+ctFRUCu8jp_rCnatC9AMv__g@mail.gmail.com>\n\t<91a48bc5-57d9-db09-78c0-98a49b414a28@intel.com>","From":"Alexander Duyck <alexander.duyck@gmail.com>","Date":"Mon, 11 Sep 2017 10:48:55 -0700","Message-ID":"<CAKgT0Uc_-GVJgg3q6N4orJufW1iU9VZ5ZMaGo=SNUeCZr+k7HA@mail.gmail.com>","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","To":"\"Samudrala, Sridhar\" <sridhar.samudrala@intel.com>","Cc":"\"Duyck, Alexander H\" <alexander.h.duyck@intel.com>,\n\tNetdev <netdev@vger.kernel.org>","Content-Type":"text/plain; 
charset=\"UTF-8\"","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1766603,"web_url":"http://patchwork.ozlabs.org/comment/1766603/","msgid":"<CALx6S37XvbjcRon6g+gS19wj14SebL4qKFrVeGyagG0K2EStzA@mail.gmail.com>","list_archive_url":null,"date":"2017-09-11T22:07:25","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":65986,"url":"http://patchwork.ozlabs.org/api/people/65986/","name":"Tom Herbert","email":"tom@herbertland.com"},"content":"On Mon, Sep 11, 2017 at 9:49 AM, Samudrala, Sridhar\n<sridhar.samudrala@intel.com> wrote:\n> On 9/10/2017 8:19 AM, Tom Herbert wrote:\n>>\n>> On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala\n>> <sridhar.samudrala@intel.com> wrote:\n>>>\n>>> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be\n>>> used\n>>> to enable symmetric tx and rx queues on a socket.\n>>>\n>>> This option is specifically useful for epoll based multi threaded\n>>> workloads\n>>> where each thread handles packets received on a single RX queue . In this\n>>> model,\n>>> we have noticed that it helps to send the packets on the same TX queue\n>>> corresponding to the queue-pair associated with the RX queue specifically\n>>> when\n>>> busy poll is enabled with epoll().\n>>>\n>>> Two new fields are added to struct sock_common to cache the last rx\n>>> ifindex and\n>>> the rx queue in the receive path of an SKB. 
__netdev_pick_tx() returns\n>>> the cached\n>>> rx queue when this option is enabled and the TX is happening on the same\n>>> device.\n>>>\n>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>\n>>> ---\n>>>   include/net/request_sock.h        |  1 +\n>>>   include/net/sock.h                | 17 +++++++++++++++++\n>>>   include/uapi/asm-generic/socket.h |  2 ++\n>>>   net/core/dev.c                    |  8 +++++++-\n>>>   net/core/sock.c                   | 10 ++++++++++\n>>>   net/ipv4/tcp_input.c              |  1 +\n>>>   net/ipv4/tcp_ipv4.c               |  1 +\n>>>   net/ipv4/tcp_minisocks.c          |  1 +\n>>>   8 files changed, 40 insertions(+), 1 deletion(-)\n>>>\n>>> diff --git a/include/net/request_sock.h b/include/net/request_sock.h\n>>> index 23e2205..c3bc12e 100644\n>>> --- a/include/net/request_sock.h\n>>> +++ b/include/net/request_sock.h\n>>> @@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct\n>>> request_sock *req)\n>>>          req_to_sk(req)->sk_prot = sk_listener->sk_prot;\n>>>          sk_node_init(&req_to_sk(req)->sk_node);\n>>>          sk_tx_queue_clear(req_to_sk(req));\n>>> +       req_to_sk(req)->sk_symmetric_queues =\n>>> sk_listener->sk_symmetric_queues;\n>>>          req->saved_syn = NULL;\n>>>          refcount_set(&req->rsk_refcnt, 0);\n>>>\n>>> diff --git a/include/net/sock.h b/include/net/sock.h\n>>> index 03a3625..3421809 100644\n>>> --- a/include/net/sock.h\n>>> +++ b/include/net/sock.h\n>>> @@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char\n>>> *msg, ...)\n>>>    *     @skc_node: main hash linkage for various protocol lookup tables\n>>>    *     @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol\n>>>    *     @skc_tx_queue_mapping: tx queue number for this connection\n>>> + *     @skc_rx_queue_mapping: rx queue number for this connection\n>>> + *     @skc_rx_ifindex: rx ifindex for this connection\n>>>    *     @skc_flags: place holder for sk_flags\n>>>    *            
 %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,\n>>>    *             %SO_OOBINLINE settings, %SO_TIMESTAMPING settings\n>>>    *     @skc_incoming_cpu: record/match cpu processing incoming packets\n>>>    *     @skc_refcnt: reference count\n>>> + *     @skc_symmetric_queues: symmetric tx/rx queues\n>>>    *\n>>>    *     This is the minimal network layer representation of sockets, the\n>>> header\n>>>    *     for struct sock and struct inet_timewait_sock.\n>>> @@ -177,6 +180,7 @@ struct sock_common {\n>>>          unsigned char           skc_reuseport:1;\n>>>          unsigned char           skc_ipv6only:1;\n>>>          unsigned char           skc_net_refcnt:1;\n>>> +       unsigned char           skc_symmetric_queues:1;\n>>>          int                     skc_bound_dev_if;\n>>>          union {\n>>>                  struct hlist_node       skc_bind_node;\n>>> @@ -214,6 +218,8 @@ struct sock_common {\n>>>                  struct hlist_nulls_node skc_nulls_node;\n>>>          };\n>>>          int                     skc_tx_queue_mapping;\n>>> +       int                     skc_rx_queue_mapping;\n>>> +       int                     skc_rx_ifindex;\n>>>          union {\n>>>                  int             skc_incoming_cpu;\n>>>                  u32             skc_rcv_wnd;\n>>> @@ -324,6 +330,8 @@ struct sock {\n>>>   #define sk_nulls_node          __sk_common.skc_nulls_node\n>>>   #define sk_refcnt              __sk_common.skc_refcnt\n>>>   #define sk_tx_queue_mapping    __sk_common.skc_tx_queue_mapping\n>>> +#define sk_rx_queue_mapping    __sk_common.skc_rx_queue_mapping\n>>> +#define sk_rx_ifindex          __sk_common.skc_rx_ifindex\n>>>\n>>>   #define sk_dontcopy_begin      __sk_common.skc_dontcopy_begin\n>>>   #define sk_dontcopy_end                __sk_common.skc_dontcopy_end\n>>> @@ -340,6 +348,7 @@ struct sock {\n>>>   #define sk_reuseport           __sk_common.skc_reuseport\n>>>   #define sk_ipv6only            __sk_common.skc_ipv6only\n>>>   
#define sk_net_refcnt          __sk_common.skc_net_refcnt\n>>> +#define sk_symmetric_queues    __sk_common.skc_symmetric_queues\n>>>   #define sk_bound_dev_if                __sk_common.skc_bound_dev_if\n>>>   #define sk_bind_node           __sk_common.skc_bind_node\n>>>   #define sk_prot                        __sk_common.skc_prot\n>>> @@ -1676,6 +1685,14 @@ static inline int sk_tx_queue_get(const struct\n>>> sock *sk)\n>>>          return sk ? sk->sk_tx_queue_mapping : -1;\n>>>   }\n>>>\n>>> +static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff\n>>> *skb)\n>>> +{\n>>> +       if (sk->sk_symmetric_queues) {\n>>> +               sk->sk_rx_ifindex = skb->skb_iif;\n>>> +               sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);\n>>> +       }\n>>> +}\n>>> +\n>>>   static inline void sk_set_socket(struct sock *sk, struct socket *sock)\n>>>   {\n>>>          sk_tx_queue_clear(sk);\n>>> diff --git a/include/uapi/asm-generic/socket.h\n>>> b/include/uapi/asm-generic/socket.h\n>>> index e47c9e4..f6b416e 100644\n>>> --- a/include/uapi/asm-generic/socket.h\n>>> +++ b/include/uapi/asm-generic/socket.h\n>>> @@ -106,4 +106,6 @@\n>>>\n>>>   #define SO_ZEROCOPY            60\n>>>\n>>> +#define SO_SYMMETRIC_QUEUES    61\n>>> +\n>>>   #endif /* __ASM_GENERIC_SOCKET_H */\n>>> diff --git a/net/core/dev.c b/net/core/dev.c\n>>> index 270b547..d96cda8 100644\n>>> --- a/net/core/dev.c\n>>> +++ b/net/core/dev.c\n>>> @@ -3322,7 +3322,13 @@ static u16 __netdev_pick_tx(struct net_device\n>>> *dev, struct sk_buff *skb)\n>>>\n>>>          if (queue_index < 0 || skb->ooo_okay ||\n>>>              queue_index >= dev->real_num_tx_queues) {\n>>> -               int new_index = get_xps_queue(dev, skb);\n>>> +               int new_index = -1;\n>>> +\n>>> +               if (sk && sk->sk_symmetric_queues && dev->ifindex ==\n>>> sk->sk_rx_ifindex)\n>>> +                       new_index = sk->sk_rx_queue_mapping;\n>>> +\n>>> +               if (new_index < 0 || new_index >=\n>>> 
dev->real_num_tx_queues)\n>>> +                       new_index = get_xps_queue(dev, skb);\n>>\n>> This enforces that notion of queue pairs which is not universal\n>> concept to NICs. There are many devices and instances where we\n>> purposely avoid having a 1-1 relationship between rx and tx queues.\n>\n>\n> Yes. This patch assumes that TX and RX queues come in pairs.\n>\n>> An alternative might be to create a rx queue to tx queue map, add the\n>> rx queue argument to get_xps_queue, and then that function can\n>> consider the mapping. The administrator can configure the mapping as\n>> appropriate and can select which rx queues are subject to the mapping.\n>\n> This alternative looks much cleaner and doesn't require the apps to\n> configure the\n> queues. Do we need to support 1 to many rx to tx queue mappings?\n> For our symmetric queues usecase, where a single application thread is\n> associated with\n> 1 queue-pair,  1-1 mapping is sufficient.\n> Do you see any usecase where it is useful to support 1-many mappings?\n> I guess i can add a sysfs entry per rx-queue to setup a tx-queue OR\n> tx-queue-map.\n\nThere is no reason to disallow 1 to many; XPS already does that. In\nfact, the mapping algorithm in XPS is pretty much what is needed where\ninstead of mapping a CPU to a queue set, this just maps an rx queue to a\nqueue set. 
ooo handling can still be done, although it might be less\ncritical in this case.\n\nTom","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (2048-bit key;\n\tunprotected) header.d=herbertland-com.20150623.gappssmtp.com\n\theader.i=@herbertland-com.20150623.gappssmtp.com\n\theader.b=\"hiEtmZya\"; dkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xrhwy3Flzz9s81\n\tfor <patchwork-incoming@ozlabs.org>;\n\tTue, 12 Sep 2017 08:12:02 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1750999AbdIKWH1 (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tMon, 11 Sep 2017 18:07:27 -0400","from mail-qt0-f193.google.com ([209.85.216.193]:37297 \"EHLO\n\tmail-qt0-f193.google.com\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1750911AbdIKWH0 (ORCPT\n\t<rfc822;netdev@vger.kernel.org>); Mon, 11 Sep 2017 18:07:26 -0400","by mail-qt0-f193.google.com with SMTP id u48so3492631qtc.4\n\tfor <netdev@vger.kernel.org>; Mon, 11 Sep 2017 15:07:26 -0700 (PDT)","by 10.237.61.196 with HTTP; Mon, 11 Sep 2017 15:07:25 -0700 (PDT)"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=herbertland-com.20150623.gappssmtp.com; s=20150623;\n\th=mime-version:in-reply-to:references:from:date:message-id:subject:to\n\t:cc; 
bh=UQXboQSwl+6fv90A7iJg0uN9zeDOiA4xqUfLR+MToOU=;\n\tb=hiEtmZyanVmv2POLX++4CLEZrjXkisVfKtmDu9Q7lnANowHIJYPoQTdgpp81u5JylN\n\tQxMAdcSiUvEZbWTartjzs/ZhzeZR9GwDuLG7DQNvVse4onWSmcLKgO3bBy6YF/NplxqC\n\tAgKD4Cd8LCsO1IAPMcr/ICASVQxJigJf9C0nRqenGHILG5kyjPq7Dsvfzzj5DtTs+410\n\tB8iDX8aa44HWQ7HPyQx5N0N/a4+Z5m0XtMpBum+vsgbhMDRpLlR2vs4vcxft+ws0kqlI\n\tm/UwJqT4HO1BZpF0QJatXQc0eN7IlRYGVBdo/5kWD0l2vp3EzBGhOrFV6JlF1XwzQOBy\n\ttpZw==","X-Google-DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=1e100.net; s=20161025;\n\th=x-gm-message-state:mime-version:in-reply-to:references:from:date\n\t:message-id:subject:to:cc;\n\tbh=UQXboQSwl+6fv90A7iJg0uN9zeDOiA4xqUfLR+MToOU=;\n\tb=I6FB1pADQw7ygynKUIi5uNwUcwby7aHwuw7aQDcApizw9MMM/o3MTiayoyi5OrzM34\n\tWzM6tFKedxn6F1HYHVg2LIteXb37xkMCUX+nQ+8TLREFfewnRRpM+mJJO7zJC4GyHAn5\n\tG5B/5T02BX6oAyg/sh1GI92lS25Zu1Br5Ppp143EqefnTp8aYik88rxYMiXaFQvuLsyv\n\taPYLeGp8TgAOzjTQrKeYxGBu18iXLn7gWq+2g/7dCGTqcJn+nK/5pg4Q5/o/BP/14aUG\n\tNMs2tExYoJf+8TtyMaS8vtNq03d3/QFgnn4SaPzvCXYFnRYjSQ0jDGtLkruBy9YNTgdX\n\tYoGA==","X-Gm-Message-State":"AHPjjUhwMh15O0nsTzqPO2bmWC0sDot9/xOhDWde/hMAA1ERcYgDIPBF\n\tLFAJM9erdIqlQ76ck6WQekqxq0qeDGS9","X-Google-Smtp-Source":"AOwi7QBb4pqxIBTlX5omurcd0teO2mfVZxnqsj/LHe11npvd2kGzC8/OjYDj8U8BUM0ebtOBTR4PP9231XJcE2hsbU0=","X-Received":"by 10.200.22.105 with SMTP id x38mr3849551qtk.108.1505167645488; \n\tMon, 11 Sep 2017 15:07:25 -0700 (PDT)","MIME-Version":"1.0","In-Reply-To":"<f98198c4-16f2-5271-3e12-7984a03bd6db@intel.com>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>\n\t<CALx6S371o0Oz9M7jmQWLuG2n3J0NGYGxZ_+okOkFL3x_+GB+2w@mail.gmail.com>\n\t<f98198c4-16f2-5271-3e12-7984a03bd6db@intel.com>","From":"Tom Herbert <tom@herbertland.com>","Date":"Mon, 11 Sep 2017 15:07:25 -0700","Message-ID":"<CALx6S37XvbjcRon6g+gS19wj14SebL4qKFrVeGyagG0K2EStzA@mail.gmail.com>","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","To":"\"Samudrala, Sridhar\" 
<sridhar.samudrala@intel.com>","Cc":"Alexander Duyck <alexander.h.duyck@intel.com>,\n\tLinux Kernel Network Developers <netdev@vger.kernel.org>","Content-Type":"text/plain; charset=\"UTF-8\"","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1766613,"web_url":"http://patchwork.ozlabs.org/comment/1766613/","msgid":"<CAKgT0UdsO31Gcb_J__n_WMFVnm++3R8FvCwA3aSLH5Ghd7eqWA@mail.gmail.com>","list_archive_url":null,"date":"2017-09-11T22:50:11","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":252,"url":"http://patchwork.ozlabs.org/api/people/252/","name":"Alexander Duyck","email":"alexander.duyck@gmail.com"},"content":"On Mon, Sep 11, 2017 at 3:07 PM, Tom Herbert <tom@herbertland.com> wrote:\n> On Mon, Sep 11, 2017 at 9:49 AM, Samudrala, Sridhar\n> <sridhar.samudrala@intel.com> wrote:\n>> On 9/10/2017 8:19 AM, Tom Herbert wrote:\n>>>\n>>> On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala\n>>> <sridhar.samudrala@intel.com> wrote:\n>>>>\n>>>> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be\n>>>> used\n>>>> to enable symmetric tx and rx queues on a socket.\n>>>>\n>>>> This option is specifically useful for epoll based multi threaded\n>>>> workloads\n>>>> where each thread handles packets received on a single RX queue . In this\n>>>> model,\n>>>> we have noticed that it helps to send the packets on the same TX queue\n>>>> corresponding to the queue-pair associated with the RX queue specifically\n>>>> when\n>>>> busy poll is enabled with epoll().\n>>>>\n>>>> Two new fields are added to struct sock_common to cache the last rx\n>>>> ifindex and\n>>>> the rx queue in the receive path of an SKB. 
__netdev_pick_tx() returns\n>>>> the cached\n>>>> rx queue when this option is enabled and the TX is happening on the same\n>>>> device.\n>>>>\n>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>\n>>>> ---\n>>>>   include/net/request_sock.h        |  1 +\n>>>>   include/net/sock.h                | 17 +++++++++++++++++\n>>>>   include/uapi/asm-generic/socket.h |  2 ++\n>>>>   net/core/dev.c                    |  8 +++++++-\n>>>>   net/core/sock.c                   | 10 ++++++++++\n>>>>   net/ipv4/tcp_input.c              |  1 +\n>>>>   net/ipv4/tcp_ipv4.c               |  1 +\n>>>>   net/ipv4/tcp_minisocks.c          |  1 +\n>>>>   8 files changed, 40 insertions(+), 1 deletion(-)\n>>>>\n>>>> diff --git a/include/net/request_sock.h b/include/net/request_sock.h\n>>>> index 23e2205..c3bc12e 100644\n>>>> --- a/include/net/request_sock.h\n>>>> +++ b/include/net/request_sock.h\n>>>> @@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct\n>>>> request_sock *req)\n>>>>          req_to_sk(req)->sk_prot = sk_listener->sk_prot;\n>>>>          sk_node_init(&req_to_sk(req)->sk_node);\n>>>>          sk_tx_queue_clear(req_to_sk(req));\n>>>> +       req_to_sk(req)->sk_symmetric_queues =\n>>>> sk_listener->sk_symmetric_queues;\n>>>>          req->saved_syn = NULL;\n>>>>          refcount_set(&req->rsk_refcnt, 0);\n>>>>\n>>>> diff --git a/include/net/sock.h b/include/net/sock.h\n>>>> index 03a3625..3421809 100644\n>>>> --- a/include/net/sock.h\n>>>> +++ b/include/net/sock.h\n>>>> @@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char\n>>>> *msg, ...)\n>>>>    *     @skc_node: main hash linkage for various protocol lookup tables\n>>>>    *     @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol\n>>>>    *     @skc_tx_queue_mapping: tx queue number for this connection\n>>>> + *     @skc_rx_queue_mapping: rx queue number for this connection\n>>>> + *     @skc_rx_ifindex: rx ifindex for this connection\n>>>>    *     @skc_flags: 
place holder for sk_flags\n>>>>    *             %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,\n>>>>    *             %SO_OOBINLINE settings, %SO_TIMESTAMPING settings\n>>>>    *     @skc_incoming_cpu: record/match cpu processing incoming packets\n>>>>    *     @skc_refcnt: reference count\n>>>> + *     @skc_symmetric_queues: symmetric tx/rx queues\n>>>>    *\n>>>>    *     This is the minimal network layer representation of sockets, the\n>>>> header\n>>>>    *     for struct sock and struct inet_timewait_sock.\n>>>> @@ -177,6 +180,7 @@ struct sock_common {\n>>>>          unsigned char           skc_reuseport:1;\n>>>>          unsigned char           skc_ipv6only:1;\n>>>>          unsigned char           skc_net_refcnt:1;\n>>>> +       unsigned char           skc_symmetric_queues:1;\n>>>>          int                     skc_bound_dev_if;\n>>>>          union {\n>>>>                  struct hlist_node       skc_bind_node;\n>>>> @@ -214,6 +218,8 @@ struct sock_common {\n>>>>                  struct hlist_nulls_node skc_nulls_node;\n>>>>          };\n>>>>          int                     skc_tx_queue_mapping;\n>>>> +       int                     skc_rx_queue_mapping;\n>>>> +       int                     skc_rx_ifindex;\n>>>>          union {\n>>>>                  int             skc_incoming_cpu;\n>>>>                  u32             skc_rcv_wnd;\n>>>> @@ -324,6 +330,8 @@ struct sock {\n>>>>   #define sk_nulls_node          __sk_common.skc_nulls_node\n>>>>   #define sk_refcnt              __sk_common.skc_refcnt\n>>>>   #define sk_tx_queue_mapping    __sk_common.skc_tx_queue_mapping\n>>>> +#define sk_rx_queue_mapping    __sk_common.skc_rx_queue_mapping\n>>>> +#define sk_rx_ifindex          __sk_common.skc_rx_ifindex\n>>>>\n>>>>   #define sk_dontcopy_begin      __sk_common.skc_dontcopy_begin\n>>>>   #define sk_dontcopy_end                __sk_common.skc_dontcopy_end\n>>>> @@ -340,6 +348,7 @@ struct sock {\n>>>>   #define sk_reuseport           
__sk_common.skc_reuseport\n>>>>   #define sk_ipv6only            __sk_common.skc_ipv6only\n>>>>   #define sk_net_refcnt          __sk_common.skc_net_refcnt\n>>>> +#define sk_symmetric_queues    __sk_common.skc_symmetric_queues\n>>>>   #define sk_bound_dev_if                __sk_common.skc_bound_dev_if\n>>>>   #define sk_bind_node           __sk_common.skc_bind_node\n>>>>   #define sk_prot                        __sk_common.skc_prot\n>>>> @@ -1676,6 +1685,14 @@ static inline int sk_tx_queue_get(const struct\n>>>> sock *sk)\n>>>>          return sk ? sk->sk_tx_queue_mapping : -1;\n>>>>   }\n>>>>\n>>>> +static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff\n>>>> *skb)\n>>>> +{\n>>>> +       if (sk->sk_symmetric_queues) {\n>>>> +               sk->sk_rx_ifindex = skb->skb_iif;\n>>>> +               sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);\n>>>> +       }\n>>>> +}\n>>>> +\n>>>>   static inline void sk_set_socket(struct sock *sk, struct socket *sock)\n>>>>   {\n>>>>          sk_tx_queue_clear(sk);\n>>>> diff --git a/include/uapi/asm-generic/socket.h\n>>>> b/include/uapi/asm-generic/socket.h\n>>>> index e47c9e4..f6b416e 100644\n>>>> --- a/include/uapi/asm-generic/socket.h\n>>>> +++ b/include/uapi/asm-generic/socket.h\n>>>> @@ -106,4 +106,6 @@\n>>>>\n>>>>   #define SO_ZEROCOPY            60\n>>>>\n>>>> +#define SO_SYMMETRIC_QUEUES    61\n>>>> +\n>>>>   #endif /* __ASM_GENERIC_SOCKET_H */\n>>>> diff --git a/net/core/dev.c b/net/core/dev.c\n>>>> index 270b547..d96cda8 100644\n>>>> --- a/net/core/dev.c\n>>>> +++ b/net/core/dev.c\n>>>> @@ -3322,7 +3322,13 @@ static u16 __netdev_pick_tx(struct net_device\n>>>> *dev, struct sk_buff *skb)\n>>>>\n>>>>          if (queue_index < 0 || skb->ooo_okay ||\n>>>>              queue_index >= dev->real_num_tx_queues) {\n>>>> -               int new_index = get_xps_queue(dev, skb);\n>>>> +               int new_index = -1;\n>>>> +\n>>>> +               if (sk && sk->sk_symmetric_queues && dev->ifindex ==\n>>>> 
sk->sk_rx_ifindex)\n>>>> +                       new_index = sk->sk_rx_queue_mapping;\n>>>> +\n>>>> +               if (new_index < 0 || new_index >=\n>>>> dev->real_num_tx_queues)\n>>>> +                       new_index = get_xps_queue(dev, skb);\n>>>\n>>> This enforces that notion of queue pairs which is not universal\n>>> concept to NICs. There are many devices and instances where we\n>>> purposely avoid having a 1-1 relationship between rx and tx queues.\n>>\n>>\n>> Yes. This patch assumes that TX and RX queues come in pairs.\n>>\n>>> An alternative might be to create a rx queue to tx queue map, add the\n>>> rx queue argument to get_xps_queue, and then that function can\n>>> consider the mapping. The administrator can configure the mapping as\n>>> appropriate and can select which rx queues are subject to the mapping.\n>>\n>> This alternative looks much cleaner and doesn't require the apps to\n>> configure the\n>> queues. Do we need to support 1 to many rx to tx queue mappings?\n>> For our symmetric queues usecase, where a single application thread is\n>> associated with\n>> 1 queue-pair,  1-1 mapping is sufficient.\n>> Do you see any usecase where it is useful to support 1-many mappings?\n>> I guess i can add a sysfs entry per rx-queue to setup a tx-queue OR\n>> tx-queue-map.\n>\n> There is no reason do disallow 1 to many, XPS already does that. In\n> fact, the mapping algorithm in XSP is pretty much what is needed where\n> instead of mapping a CPU to a queue set, this just maps a rx queue to\n> queue set. ooo handling can still be done, although it might be less\n> critical in this case.\n>\n> Tom\n\nActually I wonder if we couldn't re-purpose the queue_mapping field\nthat is already there in the sk_buff for this. It might provide a more\nelegant way to deal with the logic for already dealing with the\nrecorded Rx queue at the start of __skb_tx_hash. 
Then if the socket\nwants this it could send down the packet with the queue_mapping field\nset and we would be using some the map to figure out the Rx to Tx\nmapping for either routing/bridging or this socket option.\n\nAnyway just a thought.\n\n- Alex","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (2048-bit key;\n\tunprotected) header.d=gmail.com header.i=@gmail.com\n\theader.b=\"AM/XNTGa\"; dkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xrjnz1K9Kz9s4q\n\tfor <patchwork-incoming@ozlabs.org>;\n\tTue, 12 Sep 2017 08:51:03 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751004AbdIKWuO (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tMon, 11 Sep 2017 18:50:14 -0400","from mail-qt0-f196.google.com ([209.85.216.196]:35159 \"EHLO\n\tmail-qt0-f196.google.com\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1750911AbdIKWuN (ORCPT\n\t<rfc822;netdev@vger.kernel.org>); Mon, 11 Sep 2017 18:50:13 -0400","by mail-qt0-f196.google.com with SMTP id l25so3514304qtf.2\n\tfor <netdev@vger.kernel.org>; Mon, 11 Sep 2017 15:50:12 -0700 (PDT)","by 10.140.85.211 with HTTP; Mon, 11 Sep 2017 15:50:11 -0700 (PDT)"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=gmail.com; s=20161025;\n\th=mime-version:in-reply-to:references:from:date:message-id:subject:to\n\t:cc; 
bh=F+GBPNMAdtucIRj4ONNqstcXyptCyUjnkty21uWBVYg=;\n\tb=AM/XNTGauFxeBqgOLlwwR2xjflY49LWF1nR8e03Rj7VN00JN0Fstn1UWnUlu25lnjP\n\tOZzhn6CUheG8FlQMGSb+a4tAXbI52ZxjLXKIoao5TJz23U0eV/118N7oRkjQAIK14Wxt\n\tpDFd1IBVAC49DIkYcHqaUJzjtRGyuaKMmyNemyJhEx84jEryr8VJ2jYfy7ERAUAHN+8c\n\t4qXsu4j2A9YOT6/roCEB6/1sGSvBdl6SIHpePclFqzwRcVo+LT4cQfPTpUgCi09T1kNj\n\tydtCTWkshQa8N7JUma6ZnwUwwc06wuIqVQfpaKot+FNkodooagfK0dFCZvASppJG63cu\n\ttZRg==","X-Google-DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=1e100.net; s=20161025;\n\th=x-gm-message-state:mime-version:in-reply-to:references:from:date\n\t:message-id:subject:to:cc;\n\tbh=F+GBPNMAdtucIRj4ONNqstcXyptCyUjnkty21uWBVYg=;\n\tb=Ub4UHQ1Rm04H6TeEPsZzyIleuPlo7spwEIz9+gPfUgYLELRq+n4ravx6TgSjC3zCli\n\trBqqRCn1+7GEHUtiR7W2odhFtTN47M132SPpfqCW7QqB7oKkwSjQ/PBrgRueqvAEpcRw\n\tyw8q3WIro2wls43OjmAU9Xrolb5IEqG/jdAJA/LEt0Ib2uXS1/foPF30rSnMaFg6lCL8\n\tPTBftYSVv6Xhz99HVI8btg3VEaX/uzOpLl2rzaCWCDXP//jyPsvQM5myCk0wr7Ys4avL\n\tLtXsH4pgfAwRj8nqSk+XaTU8LN/aa0NNTDAjWZOtqsMN60XNM4qwAnXhplsDgwW5wUIX\n\t1SqQ==","X-Gm-Message-State":"AHPjjUg5JvuiT885GZrj67aRlGv3/6dHqn2VzECsBKZvKBQy5wprvYpd\n\tW6M9bzl6i4NWpUddPJG5rBXlwY4pTg==","X-Google-Smtp-Source":"AOwi7QByzs4ld3CkRElmQon9B45uExqrDZQPkHfkQr5k2dxrcK8Yc4PUTEdPf818haRF/M53BjaTVjLnrez031w+DXY=","X-Received":"by 10.237.62.200 with SMTP id o8mr18936774qtf.294.1505170212018; \n\tMon, 11 Sep 2017 15:50:12 -0700 (PDT)","MIME-Version":"1.0","In-Reply-To":"<CALx6S37XvbjcRon6g+gS19wj14SebL4qKFrVeGyagG0K2EStzA@mail.gmail.com>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>\n\t<CALx6S371o0Oz9M7jmQWLuG2n3J0NGYGxZ_+okOkFL3x_+GB+2w@mail.gmail.com>\n\t<f98198c4-16f2-5271-3e12-7984a03bd6db@intel.com>\n\t<CALx6S37XvbjcRon6g+gS19wj14SebL4qKFrVeGyagG0K2EStzA@mail.gmail.com>","From":"Alexander Duyck <alexander.duyck@gmail.com>","Date":"Mon, 11 Sep 2017 15:50:11 -0700","Message-ID":"<CAKgT0UdsO31Gcb_J__n_WMFVnm++3R8FvCwA3aSLH5Ghd7eqWA@mail.gmail.com>","Subject":"Re: [RFC PATCH] net: Introduce a 
socket option to enable picking tx\n\tqueue based on rx queue.","To":"Tom Herbert <tom@herbertland.com>","Cc":"\"Samudrala, Sridhar\" <sridhar.samudrala@intel.com>,\n\tAlexander Duyck <alexander.h.duyck@intel.com>,\n\tLinux Kernel Network Developers <netdev@vger.kernel.org>","Content-Type":"text/plain; charset=\"UTF-8\"","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1766656,"web_url":"http://patchwork.ozlabs.org/comment/1766656/","msgid":"<CALx6S35sM1CDrFuNE+L59Op_wKTRpATLAdRJafihr0mB9+vQ8g@mail.gmail.com>","list_archive_url":null,"date":"2017-09-12T03:12:19","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":65986,"url":"http://patchwork.ozlabs.org/api/people/65986/","name":"Tom Herbert","email":"tom@herbertland.com"},"content":"On Thu, Aug 31, 2017 at 4:27 PM, Sridhar Samudrala\n<sridhar.samudrala@intel.com> wrote:\n> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used\n> to enable symmetric tx and rx queues on a socket.\n>\n> This option is specifically useful for epoll based multi threaded workloads\n> where each thread handles packets received on a single RX queue . In this model,\n> we have noticed that it helps to send the packets on the same TX queue\n> corresponding to the queue-pair associated with the RX queue specifically when\n> busy poll is enabled with epoll().\n>\n> Two new fields are added to struct sock_common to cache the last rx ifindex and\n> the rx queue in the receive path of an SKB. 
__netdev_pick_tx() returns the cached\n> rx queue when this option is enabled and the TX is happening on the same device.\n>\n> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>\n> ---\n>  include/net/request_sock.h        |  1 +\n>  include/net/sock.h                | 17 +++++++++++++++++\n>  include/uapi/asm-generic/socket.h |  2 ++\n>  net/core/dev.c                    |  8 +++++++-\n>  net/core/sock.c                   | 10 ++++++++++\n>  net/ipv4/tcp_input.c              |  1 +\n>  net/ipv4/tcp_ipv4.c               |  1 +\n>  net/ipv4/tcp_minisocks.c          |  1 +\n>  8 files changed, 40 insertions(+), 1 deletion(-)\n>\n> diff --git a/include/net/request_sock.h b/include/net/request_sock.h\n> index 23e2205..c3bc12e 100644\n> --- a/include/net/request_sock.h\n> +++ b/include/net/request_sock.h\n> @@ -100,6 +100,7 @@ static inline struct sock *req_to_sk(struct request_sock *req)\n>         req_to_sk(req)->sk_prot = sk_listener->sk_prot;\n>         sk_node_init(&req_to_sk(req)->sk_node);\n>         sk_tx_queue_clear(req_to_sk(req));\n> +       req_to_sk(req)->sk_symmetric_queues = sk_listener->sk_symmetric_queues;\n>         req->saved_syn = NULL;\n>         refcount_set(&req->rsk_refcnt, 0);\n>\n> diff --git a/include/net/sock.h b/include/net/sock.h\n> index 03a3625..3421809 100644\n> --- a/include/net/sock.h\n> +++ b/include/net/sock.h\n> @@ -138,11 +138,14 @@ void SOCK_DEBUG(const struct sock *sk, const char *msg, ...)\n>   *     @skc_node: main hash linkage for various protocol lookup tables\n>   *     @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol\n>   *     @skc_tx_queue_mapping: tx queue number for this connection\n> + *     @skc_rx_queue_mapping: rx queue number for this connection\n> + *     @skc_rx_ifindex: rx ifindex for this connection\n>   *     @skc_flags: place holder for sk_flags\n>   *             %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,\n>   *             %SO_OOBINLINE settings, %SO_TIMESTAMPING 
settings\n>   *     @skc_incoming_cpu: record/match cpu processing incoming packets\n>   *     @skc_refcnt: reference count\n> + *     @skc_symmetric_queues: symmetric tx/rx queues\n>   *\n>   *     This is the minimal network layer representation of sockets, the header\n>   *     for struct sock and struct inet_timewait_sock.\n> @@ -177,6 +180,7 @@ struct sock_common {\n>         unsigned char           skc_reuseport:1;\n>         unsigned char           skc_ipv6only:1;\n>         unsigned char           skc_net_refcnt:1;\n> +       unsigned char           skc_symmetric_queues:1;\n>         int                     skc_bound_dev_if;\n>         union {\n>                 struct hlist_node       skc_bind_node;\n> @@ -214,6 +218,8 @@ struct sock_common {\n>                 struct hlist_nulls_node skc_nulls_node;\n>         };\n>         int                     skc_tx_queue_mapping;\n> +       int                     skc_rx_queue_mapping;\n> +       int                     skc_rx_ifindex;\n\nTwo ints in sock_common for this purpose is quite expensive and the\nuse case for this is limited-- even if a RX->TX queue mapping were\nintroduced to eliminate the queue pair assumption this still won't\nhelp if the receive and transmit interfaces are different for the\nconnection. 
I think we really need to see some very compelling results\nto be able to justify this.\n\nThanks,\nTom","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (2048-bit key;\n\tunprotected) header.d=herbertland-com.20150623.gappssmtp.com\n\theader.i=@herbertland-com.20150623.gappssmtp.com\n\theader.b=\"tgB9S+S9\"; dkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xrqmJ3fG7z9s4s\n\tfor <patchwork-incoming@ozlabs.org>;\n\tTue, 12 Sep 2017 13:20:00 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751105AbdILDMV (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tMon, 11 Sep 2017 23:12:21 -0400","from mail-qk0-f194.google.com ([209.85.220.194]:38877 \"EHLO\n\tmail-qk0-f194.google.com\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1751022AbdILDMU (ORCPT\n\t<rfc822;netdev@vger.kernel.org>); Mon, 11 Sep 2017 23:12:20 -0400","by mail-qk0-f194.google.com with SMTP id c69so6517697qke.5\n\tfor <netdev@vger.kernel.org>; Mon, 11 Sep 2017 20:12:20 -0700 (PDT)","by 10.237.61.196 with HTTP; Mon, 11 Sep 2017 20:12:19 -0700 (PDT)"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=herbertland-com.20150623.gappssmtp.com; s=20150623;\n\th=mime-version:in-reply-to:references:from:date:message-id:subject:to\n\t:cc; 
bh=j0Zk1cUPVEhJkiqs8BuyfFD3HVDKj0oZw7L4GAzFDLY=;\n\tb=tgB9S+S9q4hCXF78yLk8je4apH3zMJL5K2WcBRY4s2C71yH12rF8hdwMj/XGjo+YAI\n\tAKJMynDb7BQHlg9wbF2nUlC/itlHCXtRCedC1ec9Om7PSlx4fn/aZCWgRNUSrbK72mlT\n\tlPcymVj36lpKxbpqTI60+krxsReWGv6Y3A0yXH+6PWbd80n/QfI7iK/es8DDY+xdDIDd\n\tPFAd4TPIC1RxrY+u+9x8cfLRAd1Rv55OfJAOvihueJf3T8Ll8cpiVRzaD4Kloi0uPoYZ\n\tqVJ1cXi+L3tv6l5tnOdWOeLpbPwPc4BDdKqwBD8xgKxb9fjeos3WWSiGBqeZu9etcKGi\n\tDlWA==","X-Google-DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=1e100.net; s=20161025;\n\th=x-gm-message-state:mime-version:in-reply-to:references:from:date\n\t:message-id:subject:to:cc;\n\tbh=j0Zk1cUPVEhJkiqs8BuyfFD3HVDKj0oZw7L4GAzFDLY=;\n\tb=uSPZK18n7QM5Q8UoJqNwmHK0LuJ5kqQ4Qe17a1H11xYcZ7ShTilOynaT6Qiope+jFO\n\tBtILXEvbEzoVgsa5IoOraAEhXJygHVg91hJ9ICOLCXUfdRZ6D/dgYiZvW4O/wLnw4clL\n\tJiBU/4ROGnJ28eeQ+rzEMmfOCdHFbIDpUcoH9jUWBl8NvjLTX72zXZITiY6m25IceTAt\n\tHkthTyaXba8iOOV6/qW5lXE88VxBirqjBP8TaEr9j+GiFQRuS5CSrQQh5EGu7LZWMNWy\n\taVEZR+AckCqtNnnV96tagX4peUtOUKkkhWmI2hAJ6dt2aByWlgBdaSCD+Trl2R3Ddmvu\n\tByVw==","X-Gm-Message-State":"AHPjjUgopYNpazy8tow95EYzAyspAMrpr58b6RaGYdZvwc17hfWQ2POC\n\t/anvYq9vp6VO+AkdfLTUS6IoM+z1/MB9","X-Google-Smtp-Source":"ADKCNb5nyKjBwsKy8BUbVbe8HLa/O+hx7yHcbZG+YYq2LB6fQEpE3qWeXJbNfUzVWTC899Xdu0CvBN5hUdgPBRjlBaM=","X-Received":"by 10.55.167.135 with SMTP id\n\tq129mr17791795qke.311.1505185939533; \n\tMon, 11 Sep 2017 20:12:19 -0700 (PDT)","MIME-Version":"1.0","In-Reply-To":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>","From":"Tom Herbert <tom@herbertland.com>","Date":"Mon, 11 Sep 2017 20:12:19 -0700","Message-ID":"<CALx6S35sM1CDrFuNE+L59Op_wKTRpATLAdRJafihr0mB9+vQ8g@mail.gmail.com>","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","To":"Sridhar Samudrala <sridhar.samudrala@intel.com>","Cc":"Alexander Duyck <alexander.h.duyck@intel.com>,\n\tLinux Kernel Network Developers 
<netdev@vger.kernel.org>","Content-Type":"text/plain; charset=\"UTF-8\"","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1766659,"web_url":"http://patchwork.ozlabs.org/comment/1766659/","msgid":"<1505188437.15310.137.camel@edumazet-glaptop3.roam.corp.google.com>","list_archive_url":null,"date":"2017-09-12T03:53:57","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":2404,"url":"http://patchwork.ozlabs.org/api/people/2404/","name":"Eric Dumazet","email":"eric.dumazet@gmail.com"},"content":"On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:\n\n> Two ints in sock_common for this purpose is quite expensive and the\n> use case for this is limited-- even if a RX->TX queue mapping were\n> introduced to eliminate the queue pair assumption this still won't\n> help if the receive and transmit interfaces are different for the\n> connection. 
I think we really need to see some very compelling results\n> to be able to justify this.\n> \n\nYes, this is unreasonable cost.\n\nXPS should really cover the case already.","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (2048-bit key;\n\tunprotected) header.d=gmail.com header.i=@gmail.com\n\theader.b=\"fDaoY6R7\"; dkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xrrWb29fxz9s7M\n\tfor <patchwork-incoming@ozlabs.org>;\n\tTue, 12 Sep 2017 13:54:03 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751326AbdILDyA (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tMon, 11 Sep 2017 23:54:00 -0400","from mail-pg0-f53.google.com ([74.125.83.53]:33752 \"EHLO\n\tmail-pg0-f53.google.com\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1751041AbdILDx7 (ORCPT\n\t<rfc822;netdev@vger.kernel.org>); Mon, 11 Sep 2017 23:53:59 -0400","by mail-pg0-f53.google.com with SMTP id u18so2517219pgo.0\n\tfor <netdev@vger.kernel.org>; Mon, 11 Sep 2017 20:53:59 -0700 (PDT)","from [192.168.86.171] (c-67-180-167-114.hsd1.ca.comcast.net.\n\t[67.180.167.114]) by smtp.googlemail.com with ESMTPSA id\n\ty4sm15309584pgs.19.2017.09.11.20.53.57\n\t(version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);\n\tMon, 11 Sep 2017 20:53:58 -0700 (PDT)"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=gmail.com; 
s=20161025;\n\th=message-id:subject:from:to:cc:date:in-reply-to:references\n\t:mime-version:content-transfer-encoding;\n\tbh=ngKgXh3+vkiott5m/M8uTpBnfVaEgXQi+AR0mF0Ierg=;\n\tb=fDaoY6R7ZLpfGOMgi3oZmYpdu4DtMVgwZSPcFbR4AZlG2BGeBTJiMMoOxJk9z+B/tw\n\tY7XWvEGBxU1y05f2PgK3WL2RohgW8uXVMPIZqGSK5q/1BD9OXPg+aww47oXjKNii5GpI\n\tU4QRDKyna1+fMKTjZl7cG+96zoRuGeBdrqM5qiRRiYCNZTh2agMwbusDUA7wDPXbTrj9\n\tSiRVEoaZ6KH2zLLrkdggibRIk+4jCgVoh2xvMRuTuwlygpjFJQxx8k5C8Dt4LEllcmG2\n\ttNvFm5LTfXMXelSQWUYdvuBsmGRW1+18gTfHZ2or7ZkyOkCKL+dqaksrFoOe2X5/sDXY\n\tC5Vg==","X-Google-DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=1e100.net; s=20161025;\n\th=x-gm-message-state:message-id:subject:from:to:cc:date:in-reply-to\n\t:references:mime-version:content-transfer-encoding;\n\tbh=ngKgXh3+vkiott5m/M8uTpBnfVaEgXQi+AR0mF0Ierg=;\n\tb=Aj6kG02Xe6yyI3Ll3kJPFsjXHlWTq41R+ryabxulrnXTWJdzvdDqU7eXnIu+Nlcb2O\n\tcRAHmlMxTUZ+KJ8F9nb1IlOSJKoCZsbdmyjLrMlxtqT+/eDVA1hmAZWgQH0eIMohrie9\n\tZQQis7YrDpQK5v7HzMV1cCxWKCOnDLohPMRstKBHABPua13w5pIJaDsc/h6L93PFSF4Z\n\texLvHJyzp4D5oxCkczLW5tNpNPMvsZJqcwVivoJK5U3NWS0bWNFYwyChS2uH5Ju0D4pW\n\t9yqaNJzf/kesOngga2q2LFbcTQz/cjxutQ9+7WTBokyfi0bCDbwOjc52XdU6FtwAKyYH\n\t7pZQ==","X-Gm-Message-State":"AHPjjUgt/tKSTlTN/w9G4Mq0G8+28BuJ66pvlvei5lRTErlGSPjRFltk\n\tAW4zArq+b4ToxaDu","X-Google-Smtp-Source":"ADKCNb4prX/5yG5DWyCQ7Gux3rVooFohYgoHgM1elAK0+njaL0ULEv2U/GLXNTHEOBQCs20V90SLZQ==","X-Received":"by 10.98.216.202 with SMTP id\n\te193mr13989127pfg.344.1505188439412; \n\tMon, 11 Sep 2017 20:53:59 -0700 (PDT)","Message-ID":"<1505188437.15310.137.camel@edumazet-glaptop3.roam.corp.google.com>","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","From":"Eric Dumazet <eric.dumazet@gmail.com>","To":"Tom Herbert <tom@herbertland.com>","Cc":"Sridhar Samudrala <sridhar.samudrala@intel.com>,\n\tAlexander Duyck <alexander.h.duyck@intel.com>,\n\tLinux Kernel Network Developers <netdev@vger.kernel.org>","Date":"Mon, 11 Sep 2017 20:53:57 
-0700","In-Reply-To":"<CALx6S35sM1CDrFuNE+L59Op_wKTRpATLAdRJafihr0mB9+vQ8g@mail.gmail.com>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>\n\t<CALx6S35sM1CDrFuNE+L59Op_wKTRpATLAdRJafihr0mB9+vQ8g@mail.gmail.com>","Content-Type":"text/plain; charset=\"UTF-8\"","X-Mailer":"Evolution 3.10.4-0ubuntu2 ","Mime-Version":"1.0","Content-Transfer-Encoding":"7bit","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1766702,"web_url":"http://patchwork.ozlabs.org/comment/1766702/","msgid":"<b2ea01b3-b984-d59f-cbaf-b2fe6b5d9eea@intel.com>","list_archive_url":null,"date":"2017-09-12T06:27:13","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":65219,"url":"http://patchwork.ozlabs.org/api/people/65219/","name":"Samudrala, Sridhar","email":"sridhar.samudrala@intel.com"},"content":"On 9/11/2017 8:53 PM, Eric Dumazet wrote:\n> On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:\n>\n>> Two ints in sock_common for this purpose is quite expensive and the\n>> use case for this is limited-- even if a RX->TX queue mapping were\n>> introduced to eliminate the queue pair assumption this still won't\n>> help if the receive and transmit interfaces are different for the\n>> connection. 
I think we really need to see some very compelling results\n>> to be able to justify this.\nWill try to collect and post some perf data with symmetric queue \nconfiguration.\n\n> Yes, this is unreasonable cost.\n>\n> XPS should really cover the case already.\n>   \nEric,\n\nCan you clarify how XPS covers the RX-> TX queue mapping case?\nIs it possible to configure XPS to select TX queue based on the RX queue \nof a flow?\nIIUC, it is based on the CPU of the thread doing the transmit OR based \non skb->priority to TC mapping?\nIt may be possible to get this effect if the the threads are pinned to a \ncore, but if the app threads are\nfreely moving, i am not sure how XPS can be configured to select the TX \nqueue based on the RX queue of a flow.\n\nThanks\nSridhar","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":"ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xrvwP61pnz9s7g\n\tfor <patchwork-incoming@ozlabs.org>;\n\tTue, 12 Sep 2017 16:27:17 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751195AbdILG1P (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tTue, 12 Sep 2017 02:27:15 -0400","from mga03.intel.com ([134.134.136.65]:32432 \"EHLO mga03.intel.com\"\n\trhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP\n\tid S1750911AbdILG1P (ORCPT <rfc822;netdev@vger.kernel.org>);\n\tTue, 12 Sep 2017 02:27:15 -0400","from orsmga005.jf.intel.com ([10.7.209.41])\n\tby orsmga103.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;\n\t11 Sep 2017 23:27:14 -0700","from samudral-mobl1.amr.corp.intel.com (HELO [10.252.140.74])\n\t([10.252.140.74])\n\tby 
orsmga005.jf.intel.com with ESMTP; 11 Sep 2017 23:27:14 -0700"],"X-ExtLoop1":"1","X-IronPort-AV":"E=Sophos;i=\"5.42,382,1500966000\"; d=\"scan'208\";a=\"148150989\"","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","To":"Eric Dumazet <eric.dumazet@gmail.com>, Tom Herbert <tom@herbertland.com>","Cc":"Alexander Duyck <alexander.h.duyck@intel.com>,\n\tLinux Kernel Network Developers <netdev@vger.kernel.org>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>\n\t<CALx6S35sM1CDrFuNE+L59Op_wKTRpATLAdRJafihr0mB9+vQ8g@mail.gmail.com>\n\t<1505188437.15310.137.camel@edumazet-glaptop3.roam.corp.google.com>","From":"\"Samudrala, Sridhar\" <sridhar.samudrala@intel.com>","Message-ID":"<b2ea01b3-b984-d59f-cbaf-b2fe6b5d9eea@intel.com>","Date":"Mon, 11 Sep 2017 23:27:13 -0700","User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101\n\tThunderbird/52.2.1","MIME-Version":"1.0","In-Reply-To":"<1505188437.15310.137.camel@edumazet-glaptop3.roam.corp.google.com>","Content-Type":"text/plain; charset=utf-8; format=flowed","Content-Transfer-Encoding":"7bit","Content-Language":"en-US","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1767178,"web_url":"http://patchwork.ozlabs.org/comment/1767178/","msgid":"<1505231262.15310.149.camel@edumazet-glaptop3.roam.corp.google.com>","list_archive_url":null,"date":"2017-09-12T15:47:42","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":2404,"url":"http://patchwork.ozlabs.org/api/people/2404/","name":"Eric Dumazet","email":"eric.dumazet@gmail.com"},"content":"On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:\n> \n> On 9/11/2017 8:53 PM, Eric Dumazet wrote:\n> > On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:\n> >\n> >> Two ints in sock_common for this purpose is 
quite expensive and the\n> >> use case for this is limited-- even if a RX->TX queue mapping were\n> >> introduced to eliminate the queue pair assumption this still won't\n> >> help if the receive and transmit interfaces are different for the\n> >> connection. I think we really need to see some very compelling results\n> >> to be able to justify this.\n> Will try to collect and post some perf data with symmetric queue \n> configuration.\n> \n> > Yes, this is unreasonable cost.\n> >\n> > XPS should really cover the case already.\n> >   \n> Eric,\n> \n> Can you clarify how XPS covers the RX-> TX queue mapping case?\n> Is it possible to configure XPS to select TX queue based on the RX queue \n> of a flow?\n> IIUC, it is based on the CPU of the thread doing the transmit OR based \n> on skb->priority to TC mapping?\n> It may be possible to get this effect if the the threads are pinned to a \n> core, but if the app threads are\n> freely moving, i am not sure how XPS can be configured to select the TX \n> queue based on the RX queue of a flow.\n\nIf application is freely moving, how NIC can properly select the RX\nqueue so that packets are coming to the appropriate queue ?\n\nThis is called aRFS, and it does not scale to millions of flows.\nWe tried in the past, and this went nowhere really, since the setup cost\nis prohibitive and DDOS vulnerable.\n\nXPS will follow the thread, since selection is done on current cpu.\n\nThe problem is RX side. 
If application is free to migrate, then special\nsupport (aRFS) is needed from the hardware.\n\nAt least for passive connections, we already have all the support in the\nkernel so that you can have one thread per NIC queue, dealing with\nsockets that have incoming packets all received on one NIC RX queue.\n(And of course all TX packets will use the symmetric TX queue)\n\nSO_REUSEPORT plus appropriate BPF filter can achieve that.\n\nSay you have 32 queues, 32 cpus.\n\nSimply use 32 listeners, 32 threads (or 32 pools of threads)","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (2048-bit key;\n\tunprotected) header.d=gmail.com header.i=@gmail.com\n\theader.b=\"fjE8xPRW\"; dkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xs8M71mN4z9s76\n\tfor <patchwork-incoming@ozlabs.org>;\n\tWed, 13 Sep 2017 01:47:47 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751529AbdILPro (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tTue, 12 Sep 2017 11:47:44 -0400","from mail-pf0-f193.google.com ([209.85.192.193]:34973 \"EHLO\n\tmail-pf0-f193.google.com\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1751054AbdILPrn (ORCPT\n\t<rfc822;netdev@vger.kernel.org>); Tue, 12 Sep 2017 11:47:43 -0400","by mail-pf0-f193.google.com with SMTP id i23so1591618pfi.2\n\tfor <netdev@vger.kernel.org>; Tue, 12 Sep 2017 08:47:43 -0700 (PDT)","from ?IPv6:2620:15c:2c1:100:e4ac:feb:7de5:11b8?\n\t([2620:15c:2c1:100:e4ac:feb:7de5:11b8])\n\tby smtp.googlemail.com with ESMTPSA id\n\tf74sm2807589pfa.36.2017.09.12.08.47.42\n\t(version=TLS1_2 
cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);\n\tTue, 12 Sep 2017 08:47:42 -0700 (PDT)"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=gmail.com; s=20161025;\n\th=message-id:subject:from:to:cc:date:in-reply-to:references\n\t:mime-version:content-transfer-encoding;\n\tbh=PD30p6mBN3iYS2xJScsdA2xW6onLQbvW6h1LTE/l7VA=;\n\tb=fjE8xPRWK7/aHe0kF4Yb/KDeEyGNevFPznN7w0byBPRRkyb+ZTEWH6W+tWPTweYuHD\n\thxZ8ein53kr257U2CLTbpOkajV0ZQmOK/Z0Xr9FIcFRkAzjXr1yAwLollj6rSYpRA1li\n\t/Zx5cQDo6W9NkFe/5tOFOlKDKxUcxiEo9GuNHVVe+gEqWXnn7bT2JCevVSTC6dQKgoDG\n\t+e9RbN9Y6IoI/SAVkrl5Jl99EOyHU7t7oh+RK3Lfj3KekHY17pLdNCD+7n/V0M+LuFlp\n\tkLwqUJXKePIQ3DaYrF3ziISE8NPNP2dEIHa44+HqwCaN0RtW51pWX112b7qqP2GBkzqX\n\tW9CA==","X-Google-DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=1e100.net; s=20161025;\n\th=x-gm-message-state:message-id:subject:from:to:cc:date:in-reply-to\n\t:references:mime-version:content-transfer-encoding;\n\tbh=PD30p6mBN3iYS2xJScsdA2xW6onLQbvW6h1LTE/l7VA=;\n\tb=sKmoSx2xqJ6lR5JCtF9y8pPpMHeI1/vWuBrVE23tcWOU6s8gdny+ZmYSwCtc3xneTG\n\tognqOR4c0Vor2uQnwVFWtsIX8DPyKUZnAyx3v11s5YVpjosZ6l/BffahElabJu1cRSCI\n\tQKPuexy1d/gLCRDosL0WBaqBLlxDHMwePi2zgAVhRwM4zpW/baf2I68Zo6peUx3xw45k\n\tQ8OrGkP5rk8XaXON/CmnInY1fgSfQBShrVbwB0uq5X5A0PuMoUvb0kvZRPljvOcMBMj1\n\tVUIFlbGwgEaHMfA3BnrfZxNx1g6bMl/nzUJu3I97XNneSmtMuvUOJEp++sXlg+c0thYv\n\tTpyA==","X-Gm-Message-State":"AHPjjUiiN8d6gVL23aVr+x/xRIuTI1Ku4fT3iIudovGBDfz9+3X3aJdh\n\t2BDAgMVb1H+E9w==","X-Google-Smtp-Source":"ADKCNb629UWQWMyIZJQn7akoXXYsl2gnUPjJ8Tb46v9a9IQRq/UAs/VCAAOzNI+Az6BgksNj6Je9Og==","X-Received":"by 10.84.217.131 with SMTP id p3mr6971461pli.126.1505231263225; \n\tTue, 12 Sep 2017 08:47:43 -0700 (PDT)","Message-ID":"<1505231262.15310.149.camel@edumazet-glaptop3.roam.corp.google.com>","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","From":"Eric Dumazet <eric.dumazet@gmail.com>","To":"\"Samudrala, Sridhar\" <sridhar.samudrala@intel.com>","Cc":"Tom Herbert 
<tom@herbertland.com>,\n\tAlexander Duyck <alexander.h.duyck@intel.com>,\n\tLinux Kernel Network Developers <netdev@vger.kernel.org>","Date":"Tue, 12 Sep 2017 08:47:42 -0700","In-Reply-To":"<b2ea01b3-b984-d59f-cbaf-b2fe6b5d9eea@intel.com>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>\n\t<CALx6S35sM1CDrFuNE+L59Op_wKTRpATLAdRJafihr0mB9+vQ8g@mail.gmail.com>\n\t<1505188437.15310.137.camel@edumazet-glaptop3.roam.corp.google.com>\n\t<b2ea01b3-b984-d59f-cbaf-b2fe6b5d9eea@intel.com>","Content-Type":"text/plain; charset=\"UTF-8\"","X-Mailer":"Evolution 3.10.4-0ubuntu2 ","Mime-Version":"1.0","Content-Transfer-Encoding":"7bit","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1767415,"web_url":"http://patchwork.ozlabs.org/comment/1767415/","msgid":"<ef594f90-0f76-bdf5-63ce-e8750ee0d60f@intel.com>","list_archive_url":null,"date":"2017-09-12T22:31:45","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":65219,"url":"http://patchwork.ozlabs.org/api/people/65219/","name":"Samudrala, Sridhar","email":"sridhar.samudrala@intel.com"},"content":"On 9/12/2017 8:47 AM, Eric Dumazet wrote:\n> On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:\n>> On 9/11/2017 8:53 PM, Eric Dumazet wrote:\n>>> On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:\n>>>\n>>>> Two ints in sock_common for this purpose is quite expensive and the\n>>>> use case for this is limited-- even if a RX->TX queue mapping were\n>>>> introduced to eliminate the queue pair assumption this still won't\n>>>> help if the receive and transmit interfaces are different for the\n>>>> connection. 
I think we really need to see some very compelling results\n>>>> to be able to justify this.\n>> Will try to collect and post some perf data with symmetric queue\n>> configuration.\n>>\n>>> Yes, this is unreasonable cost.\n>>>\n>>> XPS should really cover the case already.\n>>>    \n>> Eric,\n>>\n>> Can you clarify how XPS covers the RX-> TX queue mapping case?\n>> Is it possible to configure XPS to select TX queue based on the RX queue\n>> of a flow?\n>> IIUC, it is based on the CPU of the thread doing the transmit OR based\n>> on skb->priority to TC mapping?\n>> It may be possible to get this effect if the the threads are pinned to a\n>> core, but if the app threads are\n>> freely moving, i am not sure how XPS can be configured to select the TX\n>> queue based on the RX queue of a flow.\n> If application is freely moving, how NIC can properly select the RX\n> queue so that packets are coming to the appropriate queue ?\nThe RX queue is selected via RSS and we don't want to move the flow based on\nwhere the thread is running.\n>\n> This is called aRFS, and it does not scale to millions of flows.\n> We tried in the past, and this went nowhere really, since the setup cost\n> is prohibitive and DDOS vulnerable.\n>\n> XPS will follow the thread, since selection is done on current cpu.\n>\n> The problem is RX side. 
If application is free to migrate, then special\n> support (aRFS) is needed from the hardware.\nThis may be true if most of the rx processing is happening in the \ninterrupt context.\nBut with busy polling,  i think we don't need aRFS as a thread should be \nable to poll\nany queue irrespective of where it is running.\n>\n> At least for passive connections, we already have all the support in the\n> kernel so that you can have one thread per NIC queue, dealing with\n> sockets that have incoming packets all received on one NIC RX queue.\n> (And of course all TX packets will use the symmetric TX queue)\n>\n> SO_REUSEPORT plus appropriate BPF filter can achieve that.\n>\n> Say you have 32 queues, 32 cpus.\n>\n> Simply use 32 listeners, 32 threads (or 32 pools of threads)\nYes. This will work if each thread is pinned to a core associated with \nthe RX interrupt.\nIt may not be possible to pin the threads to a core.\nInstead we want to associate a thread to a queue and do all the RX and \nTX completion\nof a queue in the same thread context via busy polling.\n\nThanks\nSridhar","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":"ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xsKKQ3B4Sz9t39\n\tfor <patchwork-incoming@ozlabs.org>;\n\tWed, 13 Sep 2017 08:31:54 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751308AbdILWbr (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tTue, 12 Sep 2017 18:31:47 -0400","from mga05.intel.com ([192.55.52.43]:13122 \"EHLO mga05.intel.com\"\n\trhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP\n\tid S1751020AbdILWbq (ORCPT 
<rfc822;netdev@vger.kernel.org>);\n\tTue, 12 Sep 2017 18:31:46 -0400","from orsmga001.jf.intel.com ([10.7.209.18])\n\tby fmsmga105.fm.intel.com with ESMTP; 12 Sep 2017 15:31:46 -0700","from samudral-mobl1.amr.corp.intel.com (HELO [10.165.248.23])\n\t([10.165.248.23])\n\tby orsmga001.jf.intel.com with ESMTP; 12 Sep 2017 15:31:45 -0700"],"X-ExtLoop1":"1","X-IronPort-AV":"E=Sophos;i=\"5.42,384,1500966000\"; d=\"scan'208\";a=\"1171641614\"","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","To":"Eric Dumazet <eric.dumazet@gmail.com>","Cc":"Tom Herbert <tom@herbertland.com>,\n\tAlexander Duyck <alexander.h.duyck@intel.com>,\n\tLinux Kernel Network Developers <netdev@vger.kernel.org>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>\n\t<CALx6S35sM1CDrFuNE+L59Op_wKTRpATLAdRJafihr0mB9+vQ8g@mail.gmail.com>\n\t<1505188437.15310.137.camel@edumazet-glaptop3.roam.corp.google.com>\n\t<b2ea01b3-b984-d59f-cbaf-b2fe6b5d9eea@intel.com>\n\t<1505231262.15310.149.camel@edumazet-glaptop3.roam.corp.google.com>","From":"\"Samudrala, Sridhar\" <sridhar.samudrala@intel.com>","Message-ID":"<ef594f90-0f76-bdf5-63ce-e8750ee0d60f@intel.com>","Date":"Tue, 12 Sep 2017 15:31:45 -0700","User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101\n\tThunderbird/52.2.1","MIME-Version":"1.0","In-Reply-To":"<1505231262.15310.149.camel@edumazet-glaptop3.roam.corp.google.com>","Content-Type":"text/plain; charset=utf-8; format=flowed","Content-Transfer-Encoding":"7bit","Content-Language":"en-US","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1767425,"web_url":"http://patchwork.ozlabs.org/comment/1767425/","msgid":"<CALx6S372oQ4OsyMd66zwQ08pMvPvLj7Ejf=Cv24xDkdtVXaYjA@mail.gmail.com>","list_archive_url":null,"date":"2017-09-12T22:53:11","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking 
tx\n\tqueue based on rx queue.","submitter":{"id":65986,"url":"http://patchwork.ozlabs.org/api/people/65986/","name":"Tom Herbert","email":"tom@herbertland.com"},"content":"On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar\n<sridhar.samudrala@intel.com> wrote:\n>\n>\n> On 9/12/2017 8:47 AM, Eric Dumazet wrote:\n>>\n>> On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:\n>>>\n>>> On 9/11/2017 8:53 PM, Eric Dumazet wrote:\n>>>>\n>>>> On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:\n>>>>\n>>>>> Two ints in sock_common for this purpose is quite expensive and the\n>>>>> use case for this is limited-- even if a RX->TX queue mapping were\n>>>>> introduced to eliminate the queue pair assumption this still won't\n>>>>> help if the receive and transmit interfaces are different for the\n>>>>> connection. I think we really need to see some very compelling results\n>>>>> to be able to justify this.\n>>>\n>>> Will try to collect and post some perf data with symmetric queue\n>>> configuration.\n>>>\n>>>> Yes, this is unreasonable cost.\n>>>>\n>>>> XPS should really cover the case already.\n>>>>\n>>>\n>>> Eric,\n>>>\n>>> Can you clarify how XPS covers the RX-> TX queue mapping case?\n>>> Is it possible to configure XPS to select TX queue based on the RX queue\n>>> of a flow?\n>>> IIUC, it is based on the CPU of the thread doing the transmit OR based\n>>> on skb->priority to TC mapping?\n>>> It may be possible to get this effect if the the threads are pinned to a\n>>> core, but if the app threads are\n>>> freely moving, i am not sure how XPS can be configured to select the TX\n>>> queue based on the RX queue of a flow.\n>>\n>> If application is freely moving, how NIC can properly select the RX\n>> queue so that packets are coming to the appropriate queue ?\n>\n> The RX queue is selected via RSS and we don't want to move the flow based on\n> where the thread is running.\n\nUnless flow director is enabled on the Intel device... 
This was, I\nbelieve, one of the first attempts to introduce a queue pair notion to\ngeneral purpose NICs. The idea was that the device records the TX\nqueue for a flow and then uses that to determine receive queue in a\nsymmetric fashion. aRFS is similar, but was under SW control how the\nmapping is done. As Eric mentioned there are scalability issues with\nthese mechanisms, but we also found that flow director can easily\nreorder packets whenever the thread moves.\n\n>>\n>>\n>> This is called aRFS, and it does not scale to millions of flows.\n>> We tried in the past, and this went nowhere really, since the setup cost\n>> is prohibitive and DDOS vulnerable.\n>>\n>> XPS will follow the thread, since selection is done on current cpu.\n>>\n>> The problem is RX side. If application is free to migrate, then special\n>> support (aRFS) is needed from the hardware.\n>\n> This may be true if most of the rx processing is happening in the interrupt\n> context.\n> But with busy polling,  i think we don't need aRFS as a thread should be\n> able to poll\n> any queue irrespective of where it is running.\n\nIt's not just a problem with interrupt processing, in general we like\nto have all receive processing and subsequent transmit of a reply to be\ndone on one CPU. Silo'ing is good for performance and parallelism.\nThis can sometimes be relaxed in situations where CPUs share a cache\nso crossing CPUs is not costly.\n\n>>\n>>\n>> At least for passive connections, we already have all the support in the\n>> kernel so that you can have one thread per NIC queue, dealing with\n>> sockets that have incoming packets all received on one NIC RX queue.\n>> (And of course all TX packets will use the symmetric TX queue)\n>>\n>> SO_REUSEPORT plus appropriate BPF filter can achieve that.\n>>\n>> Say you have 32 queues, 32 cpus.\n>>\n>> Simply use 32 listeners, 32 threads (or 32 pools of threads)\n>\n> Yes. 
This will work if each thread is pinned to a core associated with the\n> RX interrupt.\n> It may not be possible to pin the threads to a core.\n> Instead we want to associate a thread to a queue and do all the RX and TX\n> completion\n> of a queue in the same thread context via busy polling.\n>\nWhen that happens it's possible for RX to be done on the completely\nwrong CPU which we know is suboptimal. However, this shouldn't\nnegatively affect TX side since XPS will just use the queue\nappropriate for running CPU. Like Eric said, this is really a receive\nproblem more than a transmit problem. Keeping them as independent\npaths seems to be a good approach.\n\nTom","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (2048-bit key;\n\tunprotected) header.d=herbertland-com.20150623.gappssmtp.com\n\theader.i=@herbertland-com.20150623.gappssmtp.com\n\theader.b=\"eA/xKG9r\"; dkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xsKp70HyMz9sPt\n\tfor <patchwork-incoming@ozlabs.org>;\n\tWed, 13 Sep 2017 08:53:18 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751351AbdILWxP (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tTue, 12 Sep 2017 18:53:15 -0400","from mail-qk0-f177.google.com ([209.85.220.177]:36663 \"EHLO\n\tmail-qk0-f177.google.com\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1751003AbdILWxN (ORCPT\n\t<rfc822;netdev@vger.kernel.org>); Tue, 12 Sep 2017 18:53:13 -0400","by mail-qk0-f177.google.com with SMTP id z143so28362964qkb.3\n\tfor <netdev@vger.kernel.org>; Tue, 12 Sep 2017 15:53:13 -0700 
(PDT)","by 10.237.61.196 with HTTP; Tue, 12 Sep 2017 15:53:11 -0700 (PDT)"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=herbertland-com.20150623.gappssmtp.com; s=20150623;\n\th=mime-version:in-reply-to:references:from:date:message-id:subject:to\n\t:cc; bh=41gWqsxuNzrblE292lNBQ6D4d2XMDKI8/KSWsPLVUm4=;\n\tb=eA/xKG9rd8nzuYNdH+bwEMN0IE2kEOJrBR52Pxv4ObQjI3tkYuASenmknOZiyWFirk\n\t9YE4F/Ioki8eTp3R3c1d7GEQf1C5UgM78M7kaShvYjlQLC7oIUjTW5dBqm4S7VOaSf/F\n\tS3VPikpjjDA/HJYoHTSpKUJel+pdnBs+LyrJFTRdimNe+yEeMKbxnIcX1I/ob8VCIaMM\n\tvvvIKcbuXv/Haj8zWVOR74nZV0oWzSdpi166tYpvL24d3Gjqj6ZFcpgTwLotqO1ISNNN\n\tiHRlaREV13UmZ0G0CBaNjd8SyIYPa71PkrofbgsYpQLVmo7BY2m1hhrLRnc4GOIzFrqn\n\tP5QA==","X-Google-DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=1e100.net; s=20161025;\n\th=x-gm-message-state:mime-version:in-reply-to:references:from:date\n\t:message-id:subject:to:cc;\n\tbh=41gWqsxuNzrblE292lNBQ6D4d2XMDKI8/KSWsPLVUm4=;\n\tb=Etw3K6LaW8DbDfrcnc7KAHDv8wn/SNZl9ugrsduoyJI0NQF+fgMKPFzVB8NEgZzfRT\n\tUNdTWRKwFJrIUgTCEukgS6vpyyUTN4d1QPj1Um9Aoq7V6rmlIHOELSttYNd407OMYCHp\n\tG2sCHRCGk768IymHyKvWk3kCCmUcCKZcAfZRs9RniDakNUyVIS/SRCnUodoR66QJAFZ5\n\tVgC8gclx7mtW3sLOL/n3Cg8m+kBr35C5w5ulCMc/m+nEgAU6rPBP5syF2VFxx1K99SeW\n\tWWaPtYPoAncOgaJ31nPXJGc8RTZzeGkaGv/ufG6wLnppEG4FEaQc+zWv4pLo40Z5IGn5\n\tBuaA==","X-Gm-Message-State":"AHPjjUiIdiX8csHRXyoitclEkZwEBH9TKrI2gQpDttt2LBWeFy23wfWk\n\t1dddURR9XuywEeDWE/sVRbjM0Fm3Nn/n","X-Google-Smtp-Source":"AOwi7QC2P5A5XEmQa6QJxHVmBTVw5s/QZkvi+n0lZzYGsVyL0aWbQwim4rH0PIpVtFSt9bI1ypA2xXbpriScTBAemsQ=","X-Received":"by 10.55.133.6 with SMTP id h6mr21357854qkd.17.1505256792566;\n\tTue, 12 Sep 2017 15:53:12 -0700 
(PDT)","MIME-Version":"1.0","In-Reply-To":"<ef594f90-0f76-bdf5-63ce-e8750ee0d60f@intel.com>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>\n\t<CALx6S35sM1CDrFuNE+L59Op_wKTRpATLAdRJafihr0mB9+vQ8g@mail.gmail.com>\n\t<1505188437.15310.137.camel@edumazet-glaptop3.roam.corp.google.com>\n\t<b2ea01b3-b984-d59f-cbaf-b2fe6b5d9eea@intel.com>\n\t<1505231262.15310.149.camel@edumazet-glaptop3.roam.corp.google.com>\n\t<ef594f90-0f76-bdf5-63ce-e8750ee0d60f@intel.com>","From":"Tom Herbert <tom@herbertland.com>","Date":"Tue, 12 Sep 2017 15:53:11 -0700","Message-ID":"<CALx6S372oQ4OsyMd66zwQ08pMvPvLj7Ejf=Cv24xDkdtVXaYjA@mail.gmail.com>","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","To":"\"Samudrala, Sridhar\" <sridhar.samudrala@intel.com>","Cc":"Eric Dumazet <eric.dumazet@gmail.com>,\n\tAlexander Duyck <alexander.h.duyck@intel.com>,\n\tLinux Kernel Network Developers <netdev@vger.kernel.org>","Content-Type":"text/plain; charset=\"UTF-8\"","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1771478,"web_url":"http://patchwork.ozlabs.org/comment/1771478/","msgid":"<9b247caf-fde0-1e39-aa94-f7b3bc4fc88a@intel.com>","list_archive_url":null,"date":"2017-09-20T00:34:21","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":65219,"url":"http://patchwork.ozlabs.org/api/people/65219/","name":"Samudrala, Sridhar","email":"sridhar.samudrala@intel.com"},"content":"On 9/12/2017 3:53 PM, Tom Herbert wrote:\n> On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar\n> <sridhar.samudrala@intel.com> wrote:\n>>\n>> On 9/12/2017 8:47 AM, Eric Dumazet wrote:\n>>> On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:\n>>>> On 9/11/2017 8:53 PM, Eric Dumazet wrote:\n>>>>> On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:\n>>>>>\n>>>>>> Two 
ints in sock_common for this purpose is quite expensive and the\n>>>>>> use case for this is limited-- even if a RX->TX queue mapping were\n>>>>>> introduced to eliminate the queue pair assumption this still won't\n>>>>>> help if the receive and transmit interfaces are different for the\n>>>>>> connection. I think we really need to see some very compelling results\n>>>>>> to be able to justify this.\n>>>> Will try to collect and post some perf data with symmetric queue\n>>>> configuration.\n\nHere is some performance data i collected with memcached workload over\nixgbe 10Gb NIC with mcblaster benchmark.\nixgbe is configured with 16 queues and rx-usecs is set to 1000 for a \nvery low\ninterrupt rate.\n       ethtool -L p1p1 combined 16\n       ethtool -C p1p1 rx-usecs 1000\nand busy poll is set to 1000usecs\n       sysctl net.core.busy_poll = 1000\n\n16 threads  800K requests/sec\n=============================\n                   rtt(min/avg/max)usecs     intr/sec contextswitch/sec\n-----------------------------------------------------------------------\nDefault                2/182/10641            23391 61163\nSymmetric Queues       2/50/6311              20457 32843\n\n32 threads  800K requests/sec\n=============================\n                  rtt(min/avg/max)usecs     intr/sec contextswitch/sec\n------------------------------------------------------------------------\nDefault                2/162/6390            32168 69450\nSymmetric Queues        2/50/3853            35044 35847\n\n>>>>\n>>>>> Yes, this is unreasonable cost.\n>>>>>\n>>>>> XPS should really cover the case already.\n>>>>>\n>>>> Eric,\n>>>>\n>>>> Can you clarify how XPS covers the RX-> TX queue mapping case?\n>>>> Is it possible to configure XPS to select TX queue based on the RX queue\n>>>> of a flow?\n>>>> IIUC, it is based on the CPU of the thread doing the transmit OR based\n>>>> on skb->priority to TC mapping?\n>>>> It may be possible to get this effect if the the threads are pinned to 
a\n>>>> core, but if the app threads are\n>>>> freely moving, i am not sure how XPS can be configured to select the TX\n>>>> queue based on the RX queue of a flow.\n>>> If application is freely moving, how NIC can properly select the RX\n>>> queue so that packets are coming to the appropriate queue ?\n>> The RX queue is selected via RSS and we don't want to move the flow based on\n>> where the thread is running.\n> Unless flow director is enabled on the Intel device... This was, I\n> believe, one of the first attempts to introduce a queue pair notion to\n> general purpose NICs. The idea was that the device records the TX\n> queue for a flow and then uses that to determine receive queue in a\n> symmetric fashion. aRFS is similar, but was under SW control how the\n> mapping is done. As Eric mentioned there are scalability issues with\n> these mechanisms, but we also found that flow director can easily\n> reorder packets whenever the thread moves.\n\nYou must be referring to the ATR (application targeted routing) feature \non Intel\nNICs where a flow director entry is added for a flow based on TX queue \nused for\nthat flow. Instead, we would like to select the TX queue based on the RX \nqueue\nof a flow.\n\n\n>\n>>>\n>>> This is called aRFS, and it does not scale to millions of flows.\n>>> We tried in the past, and this went nowhere really, since the setup cost\n>>> is prohibitive and DDOS vulnerable.\n>>>\n>>> XPS will follow the thread, since selection is done on current cpu.\n>>>\n>>> The problem is RX side. 
If application is free to migrate, then special\n>>> support (aRFS) is needed from the hardware.\n>> This may be true if most of the rx processing is happening in the interrupt\n>> context.\n>> But with busy polling,  i think we don't need aRFS as a thread should be\n>> able to poll\n>> any queue irrespective of where it is running.\n> It's not just a problem with interrupt processing, in general we like\n> to have all receive processing and subsequent transmit of a reply to be\n> done on one CPU. Silo'ing is good for performance and parallelism.\n> This can sometimes be relaxed in situations where CPUs share a cache\n> so crossing CPUs is not costly.\n\nYes. We would like to get this behavior even without binding the app \nthread to a CPU.\n\n\n>\n>>>\n>>> At least for passive connections, we already have all the support in the\n>>> kernel so that you can have one thread per NIC queue, dealing with\n>>> sockets that have incoming packets all received on one NIC RX queue.\n>>> (And of course all TX packets will use the symmetric TX queue)\n>>>\n>>> SO_REUSEPORT plus appropriate BPF filter can achieve that.\n>>>\n>>> Say you have 32 queues, 32 cpus.\n>>>\n>>> Simply use 32 listeners, 32 threads (or 32 pools of threads)\n>> Yes. This will work if each thread is pinned to a core associated with the\n>> RX interrupt.\n>> It may not be possible to pin the threads to a core.\n>> Instead we want to associate a thread to a queue and do all the RX and TX\n>> completion\n>> of a queue in the same thread context via busy polling.\n>>\n> When that happens it's possible for RX to be done on the completely\n> wrong CPU which we know is suboptimal. However, this shouldn't\n> negatively affect TX side since XPS will just use the queue\n> appropriate for running CPU. Like Eric said, this is really a receive\n> problem more than a transmit problem. 
Keeping them as independent\n> paths seems to be a good approach.\n>\n>\n\nWe are noticing that when majority of packets are received via busy \npolling, it\nshould not be an issue if RX processing is handled by a thread running \non a core\nthat is different from the core that is associated with the RX \ninterrupt. Also, as\nthe TX completions on the associated TX queue are processed along with \nthe RX\nprocessing via busy polling, we would like the Transmits also to happen \nin the same\nthread context.\n\nWould appreciate any feedback or thoughts on optional configuration to \nenable selection\nof TX queue based on the RX queue of a flow.\n\nThanks\nSridhar","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":"ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xxgjZ5HkMz9sPs\n\tfor <patchwork-incoming@ozlabs.org>;\n\tWed, 20 Sep 2017 10:34:26 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751480AbdITAeX (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tTue, 19 Sep 2017 20:34:23 -0400","from mga06.intel.com ([134.134.136.31]:41088 \"EHLO mga06.intel.com\"\n\trhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP\n\tid S1751396AbdITAeW (ORCPT <rfc822;netdev@vger.kernel.org>);\n\tTue, 19 Sep 2017 20:34:22 -0400","from orsmga004.jf.intel.com ([10.7.209.38])\n\tby orsmga104.jf.intel.com with ESMTP; 19 Sep 2017 17:34:21 -0700","from samudral-mobl1.amr.corp.intel.com (HELO [10.165.248.23])\n\t([10.165.248.23])\n\tby orsmga004.jf.intel.com with ESMTP; 19 Sep 2017 17:34:21 -0700"],"X-ExtLoop1":"1","X-IronPort-AV":"E=Sophos;i=\"5.42,419,1500966000\"; 
d=\"scan'208\";a=\"130491444\"","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","To":"Tom Herbert <tom@herbertland.com>","Cc":"Eric Dumazet <eric.dumazet@gmail.com>,\n\tAlexander Duyck <alexander.h.duyck@intel.com>,\n\tLinux Kernel Network Developers <netdev@vger.kernel.org>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>\n\t<CALx6S35sM1CDrFuNE+L59Op_wKTRpATLAdRJafihr0mB9+vQ8g@mail.gmail.com>\n\t<1505188437.15310.137.camel@edumazet-glaptop3.roam.corp.google.com>\n\t<b2ea01b3-b984-d59f-cbaf-b2fe6b5d9eea@intel.com>\n\t<1505231262.15310.149.camel@edumazet-glaptop3.roam.corp.google.com>\n\t<ef594f90-0f76-bdf5-63ce-e8750ee0d60f@intel.com>\n\t<CALx6S372oQ4OsyMd66zwQ08pMvPvLj7Ejf=Cv24xDkdtVXaYjA@mail.gmail.com>","From":"\"Samudrala, Sridhar\" <sridhar.samudrala@intel.com>","Message-ID":"<9b247caf-fde0-1e39-aa94-f7b3bc4fc88a@intel.com>","Date":"Tue, 19 Sep 2017 17:34:21 -0700","User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101\n\tThunderbird/52.2.1","MIME-Version":"1.0","In-Reply-To":"<CALx6S372oQ4OsyMd66zwQ08pMvPvLj7Ejf=Cv24xDkdtVXaYjA@mail.gmail.com>","Content-Type":"text/plain; charset=utf-8; format=flowed","Content-Transfer-Encoding":"7bit","Content-Language":"en-US","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1771482,"web_url":"http://patchwork.ozlabs.org/comment/1771482/","msgid":"<CALx6S35wbwhz7COqGuUgJZcd8TwYcaVOHpxZxTOd4TuQX76Crg@mail.gmail.com>","list_archive_url":null,"date":"2017-09-20T00:48:19","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":65986,"url":"http://patchwork.ozlabs.org/api/people/65986/","name":"Tom Herbert","email":"tom@herbertland.com"},"content":"On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar\n<sridhar.samudrala@intel.com> wrote:\n> On 9/12/2017 3:53 
PM, Tom Herbert wrote:\n>>\n>> On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar\n>> <sridhar.samudrala@intel.com> wrote:\n>>>\n>>>\n>>> On 9/12/2017 8:47 AM, Eric Dumazet wrote:\n>>>>\n>>>> On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:\n>>>>>\n>>>>> On 9/11/2017 8:53 PM, Eric Dumazet wrote:\n>>>>>>\n>>>>>> On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:\n>>>>>>\n>>>>>>> Two ints in sock_common for this purpose is quite expensive and the\n>>>>>>> use case for this is limited-- even if a RX->TX queue mapping were\n>>>>>>> introduced to eliminate the queue pair assumption this still won't\n>>>>>>> help if the receive and transmit interfaces are different for the\n>>>>>>> connection. I think we really need to see some very compelling\n>>>>>>> results\n>>>>>>> to be able to justify this.\n>>>>>\n>>>>> Will try to collect and post some perf data with symmetric queue\n>>>>> configuration.\n>\n>\n> Here is some performance data i collected with memcached workload over\n> ixgbe 10Gb NIC with mcblaster benchmark.\n> ixgbe is configured with 16 queues and rx-usecs is set to 1000 for a very\n> low\n> interrupt rate.\n>       ethtool -L p1p1 combined 16\n>       ethtool -C p1p1 rx-usecs 1000\n> and busy poll is set to 1000usecs\n>       sysctl net.core.busy_poll = 1000\n>\n> 16 threads  800K requests/sec\n> =============================\n>                   rtt(min/avg/max)usecs     intr/sec contextswitch/sec\n> -----------------------------------------------------------------------\n> Default                2/182/10641            23391 61163\n> Symmetric Queues       2/50/6311              20457 32843\n>\n> 32 threads  800K requests/sec\n> =============================\n>                  rtt(min/avg/max)usecs     intr/sec contextswitch/sec\n> ------------------------------------------------------------------------\n> Default                2/162/6390            32168 69450\n> Symmetric Queues        2/50/3853            35044 35847\n>\nNo idea what 
\"Default\" configuration is. Please report how xps_cpus is\nbeing set, how many RSS queues there are, and what the mapping is\nbetween RSS queues and CPUs and shared caches. Also, whether and\nthreads are pinned.\n\nThanks,\nTom","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (2048-bit key;\n\tunprotected) header.d=herbertland-com.20150623.gappssmtp.com\n\theader.i=@herbertland-com.20150623.gappssmtp.com\n\theader.b=\"NV7IKpeL\"; dkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xxh1h3QPkz9sNr\n\tfor <patchwork-incoming@ozlabs.org>;\n\tWed, 20 Sep 2017 10:48:24 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751568AbdITAsW (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tTue, 19 Sep 2017 20:48:22 -0400","from mail-qk0-f194.google.com ([209.85.220.194]:37443 \"EHLO\n\tmail-qk0-f194.google.com\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1751506AbdITAsV (ORCPT\n\t<rfc822;netdev@vger.kernel.org>); Tue, 19 Sep 2017 20:48:21 -0400","by mail-qk0-f194.google.com with SMTP id r66so796185qke.4\n\tfor <netdev@vger.kernel.org>; Tue, 19 Sep 2017 17:48:20 -0700 (PDT)","by 10.237.61.196 with HTTP; Tue, 19 Sep 2017 17:48:19 -0700 (PDT)"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=herbertland-com.20150623.gappssmtp.com; s=20150623;\n\th=mime-version:in-reply-to:references:from:date:message-id:subject:to\n\t:cc; 
bh=folq+nx7qkRPBHjVCw3uITA7bmEA/3+11ciMHLoIVbI=;\n\tb=NV7IKpeLXLQIn0RzxKa8LhGDD2umBlolBuOsT37SWzLyuscXFYDotRTJd1ynqLzOyE\n\tysCWoEl36Bqf8dfRlq5hRmVgJCq4BA1RGnojW+rDZHEMmyH5ZaQ2sQoDu3kzKxITTek+\n\tv+6QwJ7az7TtJXRkHnYc4mzGkURpu3T/vTecGQ3Qb1fJXqHt8rdMjty0x36WbdS3VzGK\n\tt99KxW5Wnp0PGbgAwKHpo+R4mzV7DMQ5SkQbsxqIoJQ+YBTFhulTrl4fJjb8oAezDuux\n\tZ8NFp4igFYqBsnMj2Tw87S5JDFrPbjRzDDBqNPGa30ZNrch/jOIXAutIqDAf/EgPC63+\n\t4BCg==","X-Google-DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=1e100.net; s=20161025;\n\th=x-gm-message-state:mime-version:in-reply-to:references:from:date\n\t:message-id:subject:to:cc;\n\tbh=folq+nx7qkRPBHjVCw3uITA7bmEA/3+11ciMHLoIVbI=;\n\tb=S2Zp0fiNydM1p2+bVmYF3jOYJsE1M4qV52Gt0dorEk271D9C2Hwoqt/l9f8I++ULRs\n\t+XYArZFl+KSDVW2P2NK8aV9yVGSfSUyQHy/t1rWAkQQX8OZxCDrtVU+QKz9hoWj//eYC\n\tlUBGquq0XK/8X0caDtIGp68dqtKfQZ+5onxmnKMC3gduEv4zSZxyjlRQurt9ofjd4xhG\n\tpBMKMkWsf3RG1Cl5Sr/IQemLUbhwXg9qoIq527PFQ1aGrv52FuBK1r6TiTm1snF0vGjD\n\tFgwrTYgVLsWAzGXIcLUpnsAdupQ+CCVdFl7Aghg4MY8STDz3R+9YztKOx4csPpeFG3yz\n\tpJmw==","X-Gm-Message-State":"AHPjjUgxuJjXOjqBnDxmEIyerdRcln9jLp9+aY/LM8mQ1r/gWD84BQqK\n\t5zQh4Bun5NnLaPi5YU9AQtVA3ejunWROlS/qs1ILnA==","X-Google-Smtp-Source":"AOwi7QAJNu6C0Lrjg+YIHAZqzg/8g0wiz4gSjuf1uv/h6anL9e1Xc4H9VcHG//ar8qvJB6+YQrQMdrkhZ3rp62A/aVI=","X-Received":"by 10.55.109.131 with SMTP id i125mr4501659qkc.17.1505868500359; \n\tTue, 19 Sep 2017 17:48:20 -0700 
(PDT)","MIME-Version":"1.0","In-Reply-To":"<9b247caf-fde0-1e39-aa94-f7b3bc4fc88a@intel.com>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>\n\t<CALx6S35sM1CDrFuNE+L59Op_wKTRpATLAdRJafihr0mB9+vQ8g@mail.gmail.com>\n\t<1505188437.15310.137.camel@edumazet-glaptop3.roam.corp.google.com>\n\t<b2ea01b3-b984-d59f-cbaf-b2fe6b5d9eea@intel.com>\n\t<1505231262.15310.149.camel@edumazet-glaptop3.roam.corp.google.com>\n\t<ef594f90-0f76-bdf5-63ce-e8750ee0d60f@intel.com>\n\t<CALx6S372oQ4OsyMd66zwQ08pMvPvLj7Ejf=Cv24xDkdtVXaYjA@mail.gmail.com>\n\t<9b247caf-fde0-1e39-aa94-f7b3bc4fc88a@intel.com>","From":"Tom Herbert <tom@herbertland.com>","Date":"Tue, 19 Sep 2017 17:48:19 -0700","Message-ID":"<CALx6S35wbwhz7COqGuUgJZcd8TwYcaVOHpxZxTOd4TuQX76Crg@mail.gmail.com>","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","To":"\"Samudrala, Sridhar\" <sridhar.samudrala@intel.com>","Cc":"Eric Dumazet <eric.dumazet@gmail.com>,\n\tAlexander Duyck <alexander.h.duyck@intel.com>,\n\tLinux Kernel Network Developers <netdev@vger.kernel.org>","Content-Type":"text/plain; charset=\"UTF-8\"","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1771571,"web_url":"http://patchwork.ozlabs.org/comment/1771571/","msgid":"<1505884427.29839.84.camel@edumazet-glaptop3.roam.corp.google.com>","list_archive_url":null,"date":"2017-09-20T05:13:47","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":2404,"url":"http://patchwork.ozlabs.org/api/people/2404/","name":"Eric Dumazet","email":"eric.dumazet@gmail.com"},"content":"On Tue, 2017-09-19 at 21:59 -0700, Samudrala, Sridhar wrote:\n> On 9/19/2017 5:48 PM, Tom Herbert wrote:\n> > On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar\n> > <sridhar.samudrala@intel.com> wrote:\n> > > On 9/12/2017 3:53 PM, Tom Herbert wrote:\n> 
> > > On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar\n> > > > <sridhar.samudrala@intel.com> wrote:\n> > > > > \n> > > > > On 9/12/2017 8:47 AM, Eric Dumazet wrote:\n> > > > > > On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:\n> > > > > > > On 9/11/2017 8:53 PM, Eric Dumazet wrote:\n> > > > > > > > On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:\n> > > > > > > > \n> > > > > > > > > Two ints in sock_common for this purpose is quite expensive and the\n> > > > > > > > > use case for this is limited-- even if a RX->TX queue mapping were\n> > > > > > > > > introduced to eliminate the queue pair assumption this still won't\n> > > > > > > > > help if the receive and transmit interfaces are different for the\n> > > > > > > > > connection. I think we really need to see some very compelling\n> > > > > > > > > results\n> > > > > > > > > to be able to justify this.\n> > > > > > > Will try to collect and post some perf data with symmetric queue\n> > > > > > > configuration.\n> > > \n> > > Here is some performance data i collected with memcached workload over\n> > > ixgbe 10Gb NIC with mcblaster benchmark.\n> > > ixgbe is configured with 16 queues and rx-usecs is set to 1000 for a very\n> > > low\n> > > interrupt rate.\n> > >       ethtool -L p1p1 combined 16\n> > >       ethtool -C p1p1 rx-usecs 1000\n> > > and busy poll is set to 1000usecs\n> > >       sysctl net.core.busy_poll = 1000\n> > > \n> > > 16 threads  800K requests/sec\n> > > =============================\n> > >                   rtt(min/avg/max)usecs     intr/sec contextswitch/sec\n> > > -----------------------------------------------------------------------\n> > > Default                2/182/10641            23391 61163\n> > > Symmetric Queues       2/50/6311              20457 32843\n> > > \n> > > 32 threads  800K requests/sec\n> > > =============================\n> > >                  rtt(min/avg/max)usecs     intr/sec contextswitch/sec\n> > > 
------------------------------------------------------------------------\n> > > Default                2/162/6390            32168 69450\n> > > Symmetric Queues        2/50/3853            35044 35847\n> > > \n> > No idea what \"Default\" configuration is. Please report how xps_cpus is\n> > being set, how many RSS queues there are, and what the mapping is\n> > between RSS queues and CPUs and shared caches. Also, whether and\n> > threads are pinned.\n> Default is linux 4.13 with the settings i listed above.    \n>         ethtool -L p1p1 combined 16\n>         ethtool -C p1p1 rx-usecs 1000\n>         sysctl net.core.busy_poll = 1000\n> \n> # ethtool -x p1p1\n> RX flow hash indirection table for p1p1 with 16 RX ring(s):\n>     0:      0     1     2     3     4     5     6     7\n>     8:      8     9    10    11    12    13    14    15\n>    16:      0     1     2     3     4     5     6     7\n>    24:      8     9    10    11    12    13    14    15\n>    32:      0     1     2     3     4     5     6     7\n>    40:      8     9    10    11    12    13    14    15\n>    48:      0     1     2     3     4     5     6     7\n>    56:      8     9    10    11    12    13    14    15\n>    64:      0     1     2     3     4     5     6     7\n>    72:      8     9    10    11    12    13    14    15\n>    80:      0     1     2     3     4     5     6     7\n>    88:      8     9    10    11    12    13    14    15\n>    96:      0     1     2     3     4     5     6     7\n>   104:      8     9    10    11    12    13    14    15\n>   112:      0     1     2     3     4     5     6     7\n>   120:      8     9    10    11    12    13    14    15\n> \n> smp_affinity for the 16 queuepairs\n>         141 p1p1-TxRx-0 0000,00000001\n>         142 p1p1-TxRx-1 0000,00000002\n>         143 p1p1-TxRx-2 0000,00000004\n>         144 p1p1-TxRx-3 0000,00000008\n>         145 p1p1-TxRx-4 0000,00000010\n>         146 p1p1-TxRx-5 0000,00000020\n>         147 p1p1-TxRx-6 
0000,00000040\n>         148 p1p1-TxRx-7 0000,00000080\n>         149 p1p1-TxRx-8 0000,00000100\n>         150 p1p1-TxRx-9 0000,00000200\n>         151 p1p1-TxRx-10 0000,00000400\n>         152 p1p1-TxRx-11 0000,00000800\n>         153 p1p1-TxRx-12 0000,00001000\n>         154 p1p1-TxRx-13 0000,00002000\n>         155 p1p1-TxRx-14 0000,00004000\n>         156 p1p1-TxRx-15 0000,00008000\n> xps_cpus for the 16 Tx queues\n>         0000,00000001\n>         0000,00000002\n>         0000,00000004\n>         0000,00000008\n>         0000,00000010\n>         0000,00000020\n>         0000,00000040\n>         0000,00000080\n>         0000,00000100\n>         0000,00000200\n>         0000,00000400\n>         0000,00000800\n>         0000,00001000\n>         0000,00002000\n>         0000,00004000\n>         0000,00008000\n> memcached threads are not pinned.\n> \n\n...\n\nI urge you to take the time to properly tune this host.\n\nlinux kernel does not do automagic configuration. This is user policy.\n\nDocumentation/networking/scaling.txt has everything you need.","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (2048-bit key;\n\tunprotected) header.d=gmail.com header.i=@gmail.com\n\theader.b=\"NNDh6VA2\"; dkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xxnw24LJtz9s82\n\tfor <patchwork-incoming@ozlabs.org>;\n\tWed, 20 Sep 2017 15:13:54 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751397AbdITFNv (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tWed, 20 Sep 2017 01:13:51 -0400","from mail-io0-f193.google.com 
([209.85.223.193]:38577 \"EHLO\n\tmail-io0-f193.google.com\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1751011AbdITFNu (ORCPT\n\t<rfc822;netdev@vger.kernel.org>); Wed, 20 Sep 2017 01:13:50 -0400","by mail-io0-f193.google.com with SMTP id e9so1408887iod.5\n\tfor <netdev@vger.kernel.org>; Tue, 19 Sep 2017 22:13:50 -0700 (PDT)","from [192.168.86.171] (c-67-180-167-114.hsd1.ca.comcast.net.\n\t[67.180.167.114]) by smtp.googlemail.com with ESMTPSA id\n\tk18sm1853267itb.2.2017.09.19.22.13.48\n\t(version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);\n\tTue, 19 Sep 2017 22:13:48 -0700 (PDT)"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=gmail.com; s=20161025;\n\th=message-id:subject:from:to:cc:date:in-reply-to:references\n\t:mime-version:content-transfer-encoding;\n\tbh=nDxkIMKDgDsaeeLsyvHYVFarFSgMoqkcqVtyd0eWXJE=;\n\tb=NNDh6VA2JLDcp2lv8AbCFgVYDo4gm5Kv7lFJF3/+LKpTnRyMkbnxbE7/bpVUCtjyAo\n\tm/a4RtNjfTAvNAkk9KLfzN5vMKUazMojmfXFB++ahw8jzCzFc7hrddgB796f/rErSW9w\n\tmX5PSdtRmH9zjv+QVaGdzIdSrcpRzVrqaassmqrkAaq6+iCFZlAdSjTmimtxRMjopjcQ\n\tQj7F7W+j57G/m9ttR9p678nVtuKahiQQzVcDOGrxYySXPlecxdgeG9H0jRnZtdewb851\n\tJ4y3bW6VFgiusdRk2cU0jchlTYvsxpC0k1e/zFdxaMXoTVo1Zb5yADPLgmlUEWhS8YXF\n\tL/6Q==","X-Google-DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=1e100.net; 
s=20161025;\n\th=x-gm-message-state:message-id:subject:from:to:cc:date:in-reply-to\n\t:references:mime-version:content-transfer-encoding;\n\tbh=nDxkIMKDgDsaeeLsyvHYVFarFSgMoqkcqVtyd0eWXJE=;\n\tb=cM2QS6Q9QJHSkVpIfZtLYSIbjsndeXPzheEb3x5B9IwozkM7EzjvmS5mdip6/CACSV\n\tND4C7ArRytyIiWcue4BMlNi50wHVEzG7fHop8qV19aWOPh9niSommuJqHVOHoVs5rNXA\n\ttQn27tWNy0r5AYzuRdVUwzWF+ptJBb0AQLeprWi3PRo+IfJ/JO7tRXgJbJ4Z4HKV+XcF\n\tRhbnL28vIRiYn3DKw8sFWhKTs0J8OwFyvVZkvW2JjP2/XxGKLfTtcYO+6BDt9u/nT/WB\n\tVFz1qhbKNNPqOxfAjfOZw2dDyAemdzI4pY8l01DP5FYlYud4Sjkue4DvFpimp5xCB5og\n\tcMPA==","X-Gm-Message-State":"AHPjjUg9FrC9OCcPB2YRPVxahLpZZbQC+KXqjMxMJBMFfIwSDqSTAiCm\n\tqSW09HELWuGCvx0XTdCZisA=","X-Google-Smtp-Source":"AOwi7QCFhQNLX4AdIcTWXUmz6JLuB+aSkOYAigyWA0JdSZ2qby/pv5Un0CI3Xw5VAeG+at03KoXXbQ==","X-Received":"by 10.107.162.3 with SMTP id l3mr5659755ioe.227.1505884429561;\n\tTue, 19 Sep 2017 22:13:49 -0700 (PDT)","Message-ID":"<1505884427.29839.84.camel@edumazet-glaptop3.roam.corp.google.com>","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","From":"Eric Dumazet <eric.dumazet@gmail.com>","To":"\"Samudrala, Sridhar\" <sridhar.samudrala@intel.com>","Cc":"Tom Herbert <tom@herbertland.com>,\n\tAlexander Duyck <alexander.h.duyck@intel.com>,\n\tLinux Kernel Network Developers <netdev@vger.kernel.org>","Date":"Tue, 19 Sep 2017 22:13:47 
-0700","In-Reply-To":"<fe565f14-156e-d703-c91d-d67136a0a0c0@intel.com>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>\n\t<CALx6S35sM1CDrFuNE+L59Op_wKTRpATLAdRJafihr0mB9+vQ8g@mail.gmail.com>\n\t<1505188437.15310.137.camel@edumazet-glaptop3.roam.corp.google.com>\n\t<b2ea01b3-b984-d59f-cbaf-b2fe6b5d9eea@intel.com>\n\t<1505231262.15310.149.camel@edumazet-glaptop3.roam.corp.google.com>\n\t<ef594f90-0f76-bdf5-63ce-e8750ee0d60f@intel.com>\n\t<CALx6S372oQ4OsyMd66zwQ08pMvPvLj7Ejf=Cv24xDkdtVXaYjA@mail.gmail.com>\n\t<9b247caf-fde0-1e39-aa94-f7b3bc4fc88a@intel.com>\n\t<CALx6S35wbwhz7COqGuUgJZcd8TwYcaVOHpxZxTOd4TuQX76Crg@mail.gmail.com>\n\t<fe565f14-156e-d703-c91d-d67136a0a0c0@intel.com>","Content-Type":"text/plain; charset=\"UTF-8\"","X-Mailer":"Evolution 3.10.4-0ubuntu2 ","Mime-Version":"1.0","Content-Transfer-Encoding":"7bit","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1771908,"web_url":"http://patchwork.ozlabs.org/comment/1771908/","msgid":"<CALx6S374dN944bdJ87Za+MzFH3YV_6S5L3ZVGKD9503fp=-6Bg@mail.gmail.com>","list_archive_url":null,"date":"2017-09-20T14:18:57","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":65986,"url":"http://patchwork.ozlabs.org/api/people/65986/","name":"Tom Herbert","email":"tom@herbertland.com"},"content":"On Tue, Sep 19, 2017 at 10:13 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:\n> On Tue, 2017-09-19 at 21:59 -0700, Samudrala, Sridhar wrote:\n>> On 9/19/2017 5:48 PM, Tom Herbert wrote:\n>> > On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar\n>> > <sridhar.samudrala@intel.com> wrote:\n>> > > On 9/12/2017 3:53 PM, Tom Herbert wrote:\n>> > > > On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar\n>> > > > <sridhar.samudrala@intel.com> wrote:\n>> > > > >\n>> > > > > On 9/12/2017 8:47 AM, Eric Dumazet wrote:\n>> > > > > > On Mon, 2017-09-11 
at 23:27 -0700, Samudrala, Sridhar wrote:\n>> > > > > > > On 9/11/2017 8:53 PM, Eric Dumazet wrote:\n>> > > > > > > > On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:\n>> > > > > > > >\n>> > > > > > > > > Two ints in sock_common for this purpose is quite expensive and the\n>> > > > > > > > > use case for this is limited-- even if a RX->TX queue mapping were\n>> > > > > > > > > introduced to eliminate the queue pair assumption this still won't\n>> > > > > > > > > help if the receive and transmit interfaces are different for the\n>> > > > > > > > > connection. I think we really need to see some very compelling\n>> > > > > > > > > results\n>> > > > > > > > > to be able to justify this.\n>> > > > > > > Will try to collect and post some perf data with symmetric queue\n>> > > > > > > configuration.\n>> > >\n>> > > Here is some performance data i collected with memcached workload over\n>> > > ixgbe 10Gb NIC with mcblaster benchmark.\n>> > > ixgbe is configured with 16 queues and rx-usecs is set to 1000 for a very\n>> > > low\n>> > > interrupt rate.\n>> > >       ethtool -L p1p1 combined 16\n>> > >       ethtool -C p1p1 rx-usecs 1000\n>> > > and busy poll is set to 1000usecs\n>> > >       sysctl net.core.busy_poll = 1000\n>> > >\n>> > > 16 threads  800K requests/sec\n>> > > =============================\n>> > >                   rtt(min/avg/max)usecs     intr/sec contextswitch/sec\n>> > > -----------------------------------------------------------------------\n>> > > Default                2/182/10641            23391 61163\n>> > > Symmetric Queues       2/50/6311              20457 32843\n>> > >\n>> > > 32 threads  800K requests/sec\n>> > > =============================\n>> > >                  rtt(min/avg/max)usecs     intr/sec contextswitch/sec\n>> > > ------------------------------------------------------------------------\n>> > > Default                2/162/6390            32168 69450\n>> > > Symmetric Queues        2/50/3853            35044 35847\n>> > >\n>> > 
No idea what \"Default\" configuration is. Please report how xps_cpus is\n>> > being set, how many RSS queues there are, and what the mapping is\n>> > between RSS queues and CPUs and shared caches. Also, whether and\n>> > threads are pinned.\n>> Default is linux 4.13 with the settings i listed above.\n>>         ethtool -L p1p1 combined 16\n>>         ethtool -C p1p1 rx-usecs 1000\n>>         sysctl net.core.busy_poll = 1000\n>>\n>> # ethtool -x p1p1\n>> RX flow hash indirection table for p1p1 with 16 RX ring(s):\n>>     0:      0     1     2     3     4     5     6     7\n>>     8:      8     9    10    11    12    13    14    15\n>>    16:      0     1     2     3     4     5     6     7\n>>    24:      8     9    10    11    12    13    14    15\n>>    32:      0     1     2     3     4     5     6     7\n>>    40:      8     9    10    11    12    13    14    15\n>>    48:      0     1     2     3     4     5     6     7\n>>    56:      8     9    10    11    12    13    14    15\n>>    64:      0     1     2     3     4     5     6     7\n>>    72:      8     9    10    11    12    13    14    15\n>>    80:      0     1     2     3     4     5     6     7\n>>    88:      8     9    10    11    12    13    14    15\n>>    96:      0     1     2     3     4     5     6     7\n>>   104:      8     9    10    11    12    13    14    15\n>>   112:      0     1     2     3     4     5     6     7\n>>   120:      8     9    10    11    12    13    14    15\n>>\n>> smp_affinity for the 16 queuepairs\n>>         141 p1p1-TxRx-0 0000,00000001\n>>         142 p1p1-TxRx-1 0000,00000002\n>>         143 p1p1-TxRx-2 0000,00000004\n>>         144 p1p1-TxRx-3 0000,00000008\n>>         145 p1p1-TxRx-4 0000,00000010\n>>         146 p1p1-TxRx-5 0000,00000020\n>>         147 p1p1-TxRx-6 0000,00000040\n>>         148 p1p1-TxRx-7 0000,00000080\n>>         149 p1p1-TxRx-8 0000,00000100\n>>         150 p1p1-TxRx-9 0000,00000200\n>>         151 p1p1-TxRx-10 0000,00000400\n>>         
152 p1p1-TxRx-11 0000,00000800\n>>         153 p1p1-TxRx-12 0000,00001000\n>>         154 p1p1-TxRx-13 0000,00002000\n>>         155 p1p1-TxRx-14 0000,00004000\n>>         156 p1p1-TxRx-15 0000,00008000\n>> xps_cpus for the 16 Tx queues\n>>         0000,00000001\n>>         0000,00000002\n>>         0000,00000004\n>>         0000,00000008\n>>         0000,00000010\n>>         0000,00000020\n>>         0000,00000040\n>>         0000,00000080\n>>         0000,00000100\n>>         0000,00000200\n>>         0000,00000400\n>>         0000,00000800\n>>         0000,00001000\n>>         0000,00002000\n>>         0000,00004000\n>>         0000,00008000\n>> memcached threads are not pinned.\n>>\n>\n> ...\n>\n> I urge you to take the time to properly tune this host.\n>\n> linux kernel does not do automagic configuration. This is user policy.\n>\n> Documentation/networking/scaling.txt has everything you need.\n>\nYes, tuning a system for optimal performance is difficult. Even if you\nfind a performance benefit for a configuration on one system, that\nmight not translate to another. In other words, if you've produced\nsome code that seems to perform better than previous implementation on\na test machine it's not enough to be satisfied with that. We want\nunderstand _why_ there is a difference. If you can show there is\nintrinsic benefits to the queue-pair model that we can't achieve with\nexisting implementation _and_ can show there are ill effects in other\ncircumstances, then you should have a good case to make changes.\n\nIn the case of memcached, threads inevitably migrate off the CPU they\nwere created on, the data follows the thread but the RX-queue does not\nchange which means that the receive path is crosses CPUs or caches.\nBut, then in the queuepair case that also means transmit completions\nare crossing CPUs. 
We don't normally expect that to be a good thing.\nHowever, transmit completion processing does not happen in the\ncritical path, so if that work is being deferred to a less busy CPU\nthere may benefits. That's only a theory, analysis and experimentation\nshould be able to get to the root cause.\n\nThanks,\nTom","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (2048-bit key;\n\tunprotected) header.d=herbertland-com.20150623.gappssmtp.com\n\theader.i=@herbertland-com.20150623.gappssmtp.com\n\theader.b=\"jKWziiiE\"; dkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xy2124Tw2z9s82\n\tfor <patchwork-incoming@ozlabs.org>;\n\tThu, 21 Sep 2017 00:19:02 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751614AbdITOTA (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tWed, 20 Sep 2017 10:19:00 -0400","from mail-qk0-f195.google.com ([209.85.220.195]:33324 \"EHLO\n\tmail-qk0-f195.google.com\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1750892AbdITOS6 (ORCPT\n\t<rfc822;netdev@vger.kernel.org>); Wed, 20 Sep 2017 10:18:58 -0400","by mail-qk0-f195.google.com with SMTP id g128so1780963qke.0\n\tfor <netdev@vger.kernel.org>; Wed, 20 Sep 2017 07:18:58 -0700 (PDT)","by 10.237.61.196 with HTTP; Wed, 20 Sep 2017 07:18:57 -0700 (PDT)"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=herbertland-com.20150623.gappssmtp.com; s=20150623;\n\th=mime-version:in-reply-to:references:from:date:message-id:subject:to\n\t:cc; 
bh=hLKXo6R+ajDiujlddeeQ/Whkb5s6j/NVPlU8jExJkW4=;\n\tb=jKWziiiElvaTx9GLeyXgt7PDRISUmpVi3n6JeOJoGmLSrsZ2/RLwAWiBUAyA4Lduan\n\tspPGPt39k2tOlIX8Yn8lswIJD71AoaReEAfFXLCvCWN4S3veLbD3dcQ57W9MPdX/3bTV\n\tbi09gYsskvIKzmtwvdDsLgfWd+kD014GSIpjYe597zeUZEsgBosmhY85Cs0QWderiGrH\n\tI3N320+gZSg7S32sZkhPPxDerEgtsfPqkY+q6MEPiOlnqXLNIdfPXRC+ajTZFjN/yO0k\n\txrvRzwNJ0DqjYRR68dGIU1hHqNNqSied/kj6/hn2i4cCCuGj6IzSAGzt40DuX1ivB/mf\n\ti7+w==","X-Google-DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=1e100.net; s=20161025;\n\th=x-gm-message-state:mime-version:in-reply-to:references:from:date\n\t:message-id:subject:to:cc;\n\tbh=hLKXo6R+ajDiujlddeeQ/Whkb5s6j/NVPlU8jExJkW4=;\n\tb=R4MxeobXLgqx7OT50HxJy3Iz4ZYrm8/XvFiW1w9Wv9DG6tditAPQ38zDteYKw71Otu\n\tboke5HOTpo3Lw+37BswTTIDTgDnNJuHY5yAy3uDI3ytrOOel6pC/zXoUjf1MyXZMYqyf\n\taFC0qgsoriO6zqv/ARp/pKzXNbeMJFKWeQ8STuj/qArAjYMQkOIaIcAk7Kv+lpY/BX1s\n\ttRYkcMrQMovFrH9TWFcHRL4EeY0hoWl6Ey4A6/vOsXlSrozRM6OKfzNzyVr3QvJ/3cDk\n\tUhKUQ14y1Nt5gGwvCDzxiZYjjBZTRADB4WGT9a2HdJaYqqw/kidaT6BbSMnhNKgnsDvH\n\tktmQ==","X-Gm-Message-State":"AHPjjUij61JaU9IeLzgZ0z63H6fejggeAWJ1zEtj488gP3gkkIGcVyvZ\n\tlp9b6xfMWJQts0klGxGGuL1goSLX889IHq/8/keHSQ==","X-Google-Smtp-Source":"AOwi7QDBvAK5AbZaHHI9s2GckTNII6SMCbn1f4Vw4szNyZfLaifxiEdYt/4O0Q7/aRKpnX55MhNoNbTfbt1415icqF8=","X-Received":"by 10.55.102.73 with SMTP id a70mr6895363qkc.345.1505917137957; \n\tWed, 20 Sep 2017 07:18:57 -0700 
(PDT)","MIME-Version":"1.0","In-Reply-To":"<1505884427.29839.84.camel@edumazet-glaptop3.roam.corp.google.com>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>\n\t<CALx6S35sM1CDrFuNE+L59Op_wKTRpATLAdRJafihr0mB9+vQ8g@mail.gmail.com>\n\t<1505188437.15310.137.camel@edumazet-glaptop3.roam.corp.google.com>\n\t<b2ea01b3-b984-d59f-cbaf-b2fe6b5d9eea@intel.com>\n\t<1505231262.15310.149.camel@edumazet-glaptop3.roam.corp.google.com>\n\t<ef594f90-0f76-bdf5-63ce-e8750ee0d60f@intel.com>\n\t<CALx6S372oQ4OsyMd66zwQ08pMvPvLj7Ejf=Cv24xDkdtVXaYjA@mail.gmail.com>\n\t<9b247caf-fde0-1e39-aa94-f7b3bc4fc88a@intel.com>\n\t<CALx6S35wbwhz7COqGuUgJZcd8TwYcaVOHpxZxTOd4TuQX76Crg@mail.gmail.com>\n\t<fe565f14-156e-d703-c91d-d67136a0a0c0@intel.com>\n\t<1505884427.29839.84.camel@edumazet-glaptop3.roam.corp.google.com>","From":"Tom Herbert <tom@herbertland.com>","Date":"Wed, 20 Sep 2017 07:18:57 -0700","Message-ID":"<CALx6S374dN944bdJ87Za+MzFH3YV_6S5L3ZVGKD9503fp=-6Bg@mail.gmail.com>","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","To":"Eric Dumazet <eric.dumazet@gmail.com>","Cc":"\"Samudrala, Sridhar\" <sridhar.samudrala@intel.com>,\n\tAlexander Duyck <alexander.h.duyck@intel.com>,\n\tLinux Kernel Network Developers <netdev@vger.kernel.org>","Content-Type":"text/plain; charset=\"UTF-8\"","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1771987,"web_url":"http://patchwork.ozlabs.org/comment/1771987/","msgid":"<87vakdbd5k.fsf@stressinduktion.org>","list_archive_url":null,"date":"2017-09-20T15:30:47","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":18284,"url":"http://patchwork.ozlabs.org/api/people/18284/","name":"Hannes Frederic Sowa","email":"hannes@stressinduktion.org"},"content":"Sridhar Samudrala <sridhar.samudrala@intel.com> 
writes:\n\n> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used\n> to enable symmetric tx and rx queues on a socket.\n>\n> This option is specifically useful for epoll based multi threaded workloads\n> where each thread handles packets received on a single RX queue . In this model,\n> we have noticed that it helps to send the packets on the same TX queue\n> corresponding to the queue-pair associated with the RX queue specifically when\n> busy poll is enabled with epoll().\n>\n> Two new fields are added to struct sock_common to cache the last rx ifindex and\n> the rx queue in the receive path of an SKB. __netdev_pick_tx() returns the cached\n> rx queue when this option is enabled and the TX is happening on the same device.\n\nWould it help to make the rx and tx skb hashes symmetric\n(skb_get_hash_symmetric) on request?","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (2048-bit key;\n\tunprotected) header.d=stressinduktion.org\n\theader.i=@stressinduktion.org header.b=\"SDVVEcaC\"; \n\tdkim=pass (2048-bit key;\n\tunprotected) header.d=messagingengine.com\n\theader.i=@messagingengine.com header.b=\"k8/OGI2/\"; \n\tdkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xy3c002S9z9s8J\n\tfor <patchwork-incoming@ozlabs.org>;\n\tThu, 21 Sep 2017 01:30:56 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751588AbdITPax (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tWed, 20 Sep 2017 11:30:53 -0400","from out1-smtp.messagingengine.com ([66.111.4.25]:60299 \"EHLO\n\tout1-smtp.messagingengine.com\" 
rhost-flags-OK-OK-OK-OK)\n\tby vger.kernel.org with ESMTP id S1751507AbdITPaw (ORCPT\n\t<rfc822;netdev@vger.kernel.org>); Wed, 20 Sep 2017 11:30:52 -0400","from compute7.internal (compute7.nyi.internal [10.202.2.47])\n\tby mailout.nyi.internal (Postfix) with ESMTP id C1C2B2134A;\n\tWed, 20 Sep 2017 11:30:49 -0400 (EDT)","from frontend2 ([10.202.2.161])\n\tby compute7.internal (MEProxy); Wed, 20 Sep 2017 11:30:49 -0400","from z.localhost.stressinduktion.org (unknown [217.192.177.51])\n\tby mail.messagingengine.com (Postfix) with ESMTPA id 950112498B;\n\tWed, 20 Sep 2017 11:30:48 -0400 (EDT)"],"DKIM-Signature":["v=1; a=rsa-sha256; c=relaxed/relaxed; d=\n\tstressinduktion.org; h=cc:content-type:date:from:in-reply-to\n\t:message-id:mime-version:references:subject:to:x-me-sender\n\t:x-me-sender:x-sasl-enc:x-sasl-enc; s=fm1; bh=5fzRyTvRAdMY9wdRf4\n\tPcWFaXDPm0CljOJPiYzyD6uNY=; b=SDVVEcaC4Ol28C+54GmRyrXf4jAtK+Cz1F\n\t1CKfe1uzOioJlhuJea96hHsyOKoKkoC3TwYUemUfjOcB6u5GcT0BgI6PKMPbJdO1\n\t+ZLlXzmXMneXhShCZQSZO9aefgAFWINqirX2ejq0/W+YiprMp8+pVbYalldO0KUo\n\tbd8rb7mlySFY8gX2CRX34NrwtPU2mjhiLO+KiL9AWJ017Jjg2h0r0aRse625ATpk\n\t1SocW7yyrWCrEUH+PsY2Y26e3ho9km4YOgPh1aokZHxJQSC/lMpR3gjJK3K/0qEh\n\twYkCvTDGx9Z7zX7RFZucV33SkqvDu4Dhjg2T8Qv/Ww+kHuHPrXow==","v=1; a=rsa-sha256; c=relaxed/relaxed; d=\n\tmessagingengine.com; h=cc:content-type:date:from:in-reply-to\n\t:message-id:mime-version:references:subject:to:x-me-sender\n\t:x-me-sender:x-sasl-enc:x-sasl-enc; s=fm1; bh=5fzRyTvRAdMY9wdRf4\n\tPcWFaXDPm0CljOJPiYzyD6uNY=; 
b=k8/OGI2/ox/5IG1x7kG8XWtPraKPQWVb4T\n\to/cBEqx4YI4nAqiGJnxAHA7UQyDL6HI/Wfc9uMLBqaPKeJz5EYOPFAHYy9IH6HLD\n\tWF1zjhxwfX6xvKQ3hH6CcVNmFW3xRHS8Tc2Vyd87Tj7smGRHMsxw15170HI3SxMU\n\t/2XyQ9z+6qUfglyOKs/rU9YXuOlmnnb1DZs9rbicZcS/zPBW786Q87lPn44O4nb9\n\tgjHl4ucnyXyXBkR4JlUJEOBkltqkBLpjwrkPZ8e+lBD92VM7DXXXu2i1vt1dS83K\n\tZ8OCHJNel5O/Bkwmfc0obfiGw5490MriPX2d+huWbJ0/Y9OrQsFg=="],"X-ME-Sender":"<xms:qYnCWdPO61MdipqwUDknMoyGTTtrS54MtrqRBXKKPZtqO6eH4kdHRQ>","X-Sasl-enc":"KYAMXNx2hXHpxoYdRxkTMZHW7Zh/ZGLv+AKZxSIn4qNF 1505921449","From":"Hannes Frederic Sowa <hannes@stressinduktion.org>","To":"Sridhar Samudrala <sridhar.samudrala@intel.com>","Cc":"alexander.h.duyck@intel.com, netdev@vger.kernel.org","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>","Date":"Wed, 20 Sep 2017 17:30:47 +0200","In-Reply-To":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>\n\t(Sridhar Samudrala's message of \"Thu, 31 Aug 2017 16:27:12 -0700\")","Message-ID":"<87vakdbd5k.fsf@stressinduktion.org>","User-Agent":"Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux)","MIME-Version":"1.0","Content-Type":"text/plain","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1772063,"web_url":"http://patchwork.ozlabs.org/comment/1772063/","msgid":"<4d1cf2be-23b6-ed43-972e-bdb9f13c772b@intel.com>","list_archive_url":null,"date":"2017-09-20T16:51:12","subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","submitter":{"id":65219,"url":"http://patchwork.ozlabs.org/api/people/65219/","name":"Samudrala, Sridhar","email":"sridhar.samudrala@intel.com"},"content":"On 9/20/2017 7:18 AM, Tom Herbert wrote:\n> On Tue, Sep 19, 2017 at 10:13 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:\n>> On Tue, 2017-09-19 at 21:59 -0700, Samudrala, Sridhar 
wrote:\n>>> On 9/19/2017 5:48 PM, Tom Herbert wrote:\n>>>> On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar\n>>>> <sridhar.samudrala@intel.com> wrote:\n>>>>> On 9/12/2017 3:53 PM, Tom Herbert wrote:\n>>>>>> On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar\n>>>>>> <sridhar.samudrala@intel.com> wrote:\n>>>>>>> On 9/12/2017 8:47 AM, Eric Dumazet wrote:\n>>>>>>>> On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:\n>>>>>>>>> On 9/11/2017 8:53 PM, Eric Dumazet wrote:\n>>>>>>>>>> On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:\n>>>>>>>>>>\n>>>>>>>>>>> Two ints in sock_common for this purpose is quite expensive and the\n>>>>>>>>>>> use case for this is limited-- even if a RX->TX queue mapping were\n>>>>>>>>>>> introduced to eliminate the queue pair assumption this still won't\n>>>>>>>>>>> help if the receive and transmit interfaces are different for the\n>>>>>>>>>>> connection. I think we really need to see some very compelling\n>>>>>>>>>>> results\n>>>>>>>>>>> to be able to justify this.\n>>>>>>>>> Will try to collect and post some perf data with symmetric queue\n>>>>>>>>> configuration.\n>>>>> Here is some performance data i collected with memcached workload over\n>>>>> ixgbe 10Gb NIC with mcblaster benchmark.\n>>>>> ixgbe is configured with 16 queues and rx-usecs is set to 1000 for a very\n>>>>> low\n>>>>> interrupt rate.\n>>>>>        ethtool -L p1p1 combined 16\n>>>>>        ethtool -C p1p1 rx-usecs 1000\n>>>>> and busy poll is set to 1000usecs\n>>>>>        sysctl net.core.busy_poll = 1000\n>>>>>\n>>>>> 16 threads  800K requests/sec\n>>>>> =============================\n>>>>>                    rtt(min/avg/max)usecs     intr/sec contextswitch/sec\n>>>>> -----------------------------------------------------------------------\n>>>>> Default                2/182/10641            23391 61163\n>>>>> Symmetric Queues       2/50/6311              20457 32843\n>>>>>\n>>>>> 32 threads  800K requests/sec\n>>>>> =============================\n>>>>>         
          rtt(min/avg/max)usecs     intr/sec contextswitch/sec\n>>>>> ------------------------------------------------------------------------\n>>>>> Default                2/162/6390            32168 69450\n>>>>> Symmetric Queues        2/50/3853            35044 35847\n>>>>>\n>>>> No idea what \"Default\" configuration is. Please report how xps_cpus is\n>>>> being set, how many RSS queues there are, and what the mapping is\n>>>> between RSS queues and CPUs and shared caches. Also, whether and\n>>>> threads are pinned.\n>>> Default is linux 4.13 with the settings i listed above.\n>>>          ethtool -L p1p1 combined 16\n>>>          ethtool -C p1p1 rx-usecs 1000\n>>>          sysctl net.core.busy_poll = 1000\n>>>\n>>> # ethtool -x p1p1\n>>> RX flow hash indirection table for p1p1 with 16 RX ring(s):\n>>>      0:      0     1     2     3     4     5     6     7\n>>>      8:      8     9    10    11    12    13    14    15\n>>>     16:      0     1     2     3     4     5     6     7\n>>>     24:      8     9    10    11    12    13    14    15\n>>>     32:      0     1     2     3     4     5     6     7\n>>>     40:      8     9    10    11    12    13    14    15\n>>>     48:      0     1     2     3     4     5     6     7\n>>>     56:      8     9    10    11    12    13    14    15\n>>>     64:      0     1     2     3     4     5     6     7\n>>>     72:      8     9    10    11    12    13    14    15\n>>>     80:      0     1     2     3     4     5     6     7\n>>>     88:      8     9    10    11    12    13    14    15\n>>>     96:      0     1     2     3     4     5     6     7\n>>>    104:      8     9    10    11    12    13    14    15\n>>>    112:      0     1     2     3     4     5     6     7\n>>>    120:      8     9    10    11    12    13    14    15\n>>>\n>>> smp_affinity for the 16 queuepairs\n>>>          141 p1p1-TxRx-0 0000,00000001\n>>>          142 p1p1-TxRx-1 0000,00000002\n>>>          143 p1p1-TxRx-2 0000,00000004\n>>>          144 
p1p1-TxRx-3 0000,00000008\n>>>          145 p1p1-TxRx-4 0000,00000010\n>>>          146 p1p1-TxRx-5 0000,00000020\n>>>          147 p1p1-TxRx-6 0000,00000040\n>>>          148 p1p1-TxRx-7 0000,00000080\n>>>          149 p1p1-TxRx-8 0000,00000100\n>>>          150 p1p1-TxRx-9 0000,00000200\n>>>          151 p1p1-TxRx-10 0000,00000400\n>>>          152 p1p1-TxRx-11 0000,00000800\n>>>          153 p1p1-TxRx-12 0000,00001000\n>>>          154 p1p1-TxRx-13 0000,00002000\n>>>          155 p1p1-TxRx-14 0000,00004000\n>>>          156 p1p1-TxRx-15 0000,00008000\n>>> xps_cpus for the 16 Tx queues\n>>>          0000,00000001\n>>>          0000,00000002\n>>>          0000,00000004\n>>>          0000,00000008\n>>>          0000,00000010\n>>>          0000,00000020\n>>>          0000,00000040\n>>>          0000,00000080\n>>>          0000,00000100\n>>>          0000,00000200\n>>>          0000,00000400\n>>>          0000,00000800\n>>>          0000,00001000\n>>>          0000,00002000\n>>>          0000,00004000\n>>>          0000,00008000\n>>> memcached threads are not pinned.\n>>>\n>> ...\n>>\n>> I urge you to take the time to properly tune this host.\n>>\n>> linux kernel does not do automagic configuration. This is user policy.\n>>\n>> Documentation/networking/scaling.txt has everything you need.\n>>\n> Yes, tuning a system for optimal performance is difficult. Even if you\n> find a performance benefit for a configuration on one system, that\n> might not translate to another. In other words, if you've produced\n> some code that seems to perform better than previous implementation on\n> a test machine it's not enough to be satisfied with that. We want\n> understand _why_ there is a difference. 
If you can show there are\n> intrinsic benefits to the queue-pair model that we can't achieve with\n> the existing implementation _and_ can show there are ill effects in other\n> circumstances, then you should have a good case to make changes.\n>\n> In the case of memcached, threads inevitably migrate off the CPU they\n> were created on, the data follows the thread but the RX-queue does not\n> change, which means that the receive path crosses CPUs or caches.\n> But, then in the queuepair case that also means transmit completions\n> are crossing CPUs. We don't normally expect that to be a good thing.\n> However, transmit completion processing does not happen in the\n> critical path, so if that work is being deferred to a less busy CPU\n> there may be benefits. That's only a theory; analysis and experimentation\n> should be able to get to the root cause.\n>\nWith regard to tuning, I forgot to mention that memcached is updated to\nselect the thread based on incoming queue via SO_INCOMING_NAPI_ID and\nis started with 16 threads to match the number of RX queues.\nIf I pin the memcached threads to each of the 16 cores, I do get\nsimilar performance to symmetric queues. 
But this symmetric queues \nconfiguration\nis to support scenarios where it is not possible to pin the threads of the\napplication.\n\nThanks\nSridhar","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":"ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xy5Nk6pYLz9sP1\n\tfor <patchwork-incoming@ozlabs.org>;\n\tThu, 21 Sep 2017 02:51:18 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751524AbdITQvO (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tWed, 20 Sep 2017 12:51:14 -0400","from mga07.intel.com ([134.134.136.100]:60980 \"EHLO\n\tmga07.intel.com\"\n\trhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP\n\tid S1751024AbdITQvN (ORCPT <rfc822;netdev@vger.kernel.org>);\n\tWed, 20 Sep 2017 12:51:13 -0400","from orsmga003.jf.intel.com ([10.7.209.27])\n\tby orsmga105.jf.intel.com with ESMTP; 20 Sep 2017 09:51:12 -0700","from samudral-mobl1.amr.corp.intel.com (HELO [10.165.248.23])\n\t([10.165.248.23])\n\tby orsmga003.jf.intel.com with ESMTP; 20 Sep 2017 09:51:12 -0700"],"X-ExtLoop1":"1","X-IronPort-AV":"E=Sophos;i=\"5.42,421,1500966000\"; d=\"scan'208\";a=\"1016658757\"","Subject":"Re: [RFC PATCH] net: Introduce a socket option to enable picking tx\n\tqueue based on rx queue.","To":"Tom Herbert <tom@herbertland.com>, Eric Dumazet <eric.dumazet@gmail.com>","Cc":"Alexander Duyck <alexander.h.duyck@intel.com>,\n\tLinux Kernel Network Developers 
<netdev@vger.kernel.org>","References":"<1504222032-6337-1-git-send-email-sridhar.samudrala@intel.com>\n\t<CALx6S35sM1CDrFuNE+L59Op_wKTRpATLAdRJafihr0mB9+vQ8g@mail.gmail.com>\n\t<1505188437.15310.137.camel@edumazet-glaptop3.roam.corp.google.com>\n\t<b2ea01b3-b984-d59f-cbaf-b2fe6b5d9eea@intel.com>\n\t<1505231262.15310.149.camel@edumazet-glaptop3.roam.corp.google.com>\n\t<ef594f90-0f76-bdf5-63ce-e8750ee0d60f@intel.com>\n\t<CALx6S372oQ4OsyMd66zwQ08pMvPvLj7Ejf=Cv24xDkdtVXaYjA@mail.gmail.com>\n\t<9b247caf-fde0-1e39-aa94-f7b3bc4fc88a@intel.com>\n\t<CALx6S35wbwhz7COqGuUgJZcd8TwYcaVOHpxZxTOd4TuQX76Crg@mail.gmail.com>\n\t<fe565f14-156e-d703-c91d-d67136a0a0c0@intel.com>\n\t<1505884427.29839.84.camel@edumazet-glaptop3.roam.corp.google.com>\n\t<CALx6S374dN944bdJ87Za+MzFH3YV_6S5L3ZVGKD9503fp=-6Bg@mail.gmail.com>","From":"\"Samudrala, Sridhar\" <sridhar.samudrala@intel.com>","Message-ID":"<4d1cf2be-23b6-ed43-972e-bdb9f13c772b@intel.com>","Date":"Wed, 20 Sep 2017 09:51:12 -0700","User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101\n\tThunderbird/52.2.1","MIME-Version":"1.0","In-Reply-To":"<CALx6S374dN944bdJ87Za+MzFH3YV_6S5L3ZVGKD9503fp=-6Bg@mail.gmail.com>","Content-Type":"text/plain; charset=utf-8; format=flowed","Content-Transfer-Encoding":"7bit","Content-Language":"en-US","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}}]