
[rfc] packet: zerocopy packet_snd

Message ID 1416602694-7540-1-git-send-email-willemb@google.com
State RFC, archived
Delegated to: David Miller

Commit Message

Willem de Bruijn Nov. 21, 2014, 8:44 p.m. UTC
From: Willem de Bruijn <willemb@google.com>

Extend the copy avoidance ("zero copy") interface from virtio and tap
to packet sockets.

Interface (send):
  Pass MSG_ZEROCOPY flag in send to have the kernel link the original
  user pages into its skb_shinfo(skb)->frags.

Interface (recv notification):
  Process the socket error queue. Ancillary data of type
  SO_EE_ORIGIN_UPAGE will return the pointer to the buffer passed in
  send to notify the completion of that send request.

  If the process allows the error queue to overflow, notifications
  will be dropped.
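
As a concrete illustration, here is a minimal userspace sketch of the
proposed interface. It is hypothetical and untested: MSG_ZEROCOPY and
SO_EE_ORIGIN_UPAGE take the values defined in this patch (an RFC, not
a released ABI), and for brevity only the extended-error origin is
checked, not the cmsg level and type:

  #include <stdio.h>
  #include <sys/socket.h>
  #include <linux/errqueue.h>

  #ifndef MSG_ZEROCOPY
  #define MSG_ZEROCOPY       0x4000000   /* value proposed below */
  #endif
  #define SO_EE_ORIGIN_UPAGE 5           /* value proposed below */

  static void send_zc_and_wait(int fd, void *buf, size_t len)
  {
          char control[128];
          struct msghdr msg = {0};
          struct cmsghdr *cm;

          /* The kernel links the pages of buf into the skb frags, so
           * buf must not be modified until the notification arrives. */
          if (send(fd, buf, len, MSG_ZEROCOPY) < 0) {
                  perror("send");
                  return;
          }

          /* The completion arrives on the error queue. MSG_ERRQUEUE
           * reads do not block; a real application would poll(). */
          do {
                  msg.msg_control = control;
                  msg.msg_controllen = sizeof(control);
          } while (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0);

          for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                  struct sock_extended_err *serr;

                  serr = (struct sock_extended_err *)CMSG_DATA(cm);
                  if (serr->ee.ee_origin != SO_EE_ORIGIN_UPAGE)
                          continue;
                  /* ee_info:ee_data carry the iov_base pointer that was
                   * passed to send(), identifying the completed buffer. */
                  printf("completed buffer %#llx\n",
                         ((unsigned long long)serr->ee.ee_info << 32) |
                         serr->ee.ee_data);
          }
  }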

Tested by injecting max-sized TCP packets on a host with TSO enabled
and GSO disabled. This passes the full packet to the NIC. Perf shows
a large drop in cycles spent (perf record -C 4 -a sleep 4).

Sending 10 Kpps without MSG_ZEROCOPY:

    Event count (approx.): 1191500371
    30.63%  psock_reinject  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
     7.50%  psock_reinject  [kernel.kallsyms]  [k] __rmqueue
     6.26%         swapper  [kernel.kallsyms]  [k] intel_idle
     5.06%  psock_reinject  [kernel.kallsyms]  [k] get_page_from_freelist
     1.99%  psock_reinject  [kernel.kallsyms]  [k] mlx4_en_xmit
     1.60%  psock_reinject  [kernel.kallsyms]  [k] __alloc_pages_nodemask

and the same with MSG_ZEROCOPY:

    Event count (approx.): 390783801
    13.96%         swapper  [kernel.kallsyms]  [k] intel_idle
     3.11%  psock_reinject  [kernel.kallsyms]  [k] mlx4_en_xmit
     2.75%  psock_reinject  [kernel.kallsyms]  [k] _raw_spin_lock
     2.67%         swapper  [kernel.kallsyms]  [k] lapic_next_deadline
     2.33%  psock_reinject  [kernel.kallsyms]  [k] gup_huge_pmd
     1.90%         swapper  [kernel.kallsyms]  [k] apic_timer_interrupt

The idle time is probably due to using usleep to pace the sender, so
this underreports the effect.

Another test sent packets over loopback to a UDP socket. The test
executes mostly synchronously in a single thread:
  sendmsg(fd_packet);
  recvmsg(fd_inet);
  recvmsg(fd_packet, .., MSG_ERRQUEUE);

Result from that at various packet sizes:

    size    no-zc    zc   !ubufs  recvmmsg !callback
1           573     368      359       439       517
10          569     368      361       441       507
100         572     299      340       408       478
1000        564     300      338       412       486
10000       475     247      331       405       479
30000       328     177      324       385       450
65000       218     117      307       359       417

Legend:
  size:       packet size in bytes
  zc:         pass MSG_ZEROCOPY
  !ubufs:     (gross hack) comment out skb_orphan_frags
                  in __netif_receive_skb_core
  recvmmsg:   switch from synchronous completion notification
                  handling to batching.
  !callback:  (gross hack) disable upcall mechanism completely,
                  to expose its cost.

This demonstrates one of the major blocking issues for practical
use of this feature in many datapaths: the kernel copies the data of
sk_buffs that have SKBTX_DEV_ZEROCOPY set at skb_clone and other points
to ensure safe reference counting. Follow-on work would be needed
to refine this mechanism and increase the number of datapaths where
user pages can safely be used. This is one reason why I demonstrate
with TSO, as opposed to GSO: GSO also has to copy.
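
For reference, the copy in question happens through skb_orphan_frags().
Approximately as in include/linux/skbuff.h of this era, paths that may
share the skb beyond the sender's control (skb_clone,
__netif_receive_skb_core, and others) call:

  static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
  {
          if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
                  return 0;
          /* Deep copy: moves the user frags into kernel pages and fires
           * the completion callback, releasing the user buffer early. */
          return skb_copy_ubufs(skb, gfp_mask);
  }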

Implementation note: returning the pointer is somewhat ugly. An
alternative is to keep a per-socket counter, increment it for
each MSG_ZEROCOPY sendmsg() call, and return this counter as the
notification.
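
A rough sketch of that alternative, assuming a hypothetical per-socket
field zerocopy_seq (not part of this patch):

  /* At sendmsg() time, stamp the request with a sequence number
   * instead of the user pointer: */
  uarg->desc = po->zerocopy_seq++;

  /* The completion callback then reports the counter, and userspace
   * matches it against its own record of submitted buffers: */
  serr->ee.ee_data = uarg->desc & 0xFFFFFFFF;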

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h        |   1 +
 include/linux/socket.h        |   1 +
 include/uapi/linux/errqueue.h |   1 +
 net/packet/af_packet.c        | 110 ++++++++++++++++++++++++++++++++++++++----
 4 files changed, 103 insertions(+), 10 deletions(-)

Comments

Michael S. Tsirkin Nov. 26, 2014, 6:24 p.m. UTC | #1
On Fri, Nov 21, 2014 at 03:44:54PM -0500, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> Extend the copy avoidance ("zero copy") interface from virtio and tap
> to packet sockets.
> 
> Interface (send):
>   Pass MSG_ZEROCOPY flag in send to have the kernel link the original
>   user pages into its skb_shinfo(skb)->frags.
> 
> Interface (recv notification):
>   Process the socket error queue. Ancillary data of type
>   SO_EE_ORIGIN_UPAGE will return the pointer to the buffer passed in
>   send to notify the completion of that send request.
> 
>   If the process allows the error queue to overflow, notifications
>   will be dropped.
> 
> Tested by injecting max-sized TCP packets on a host with TSO enabled
> and GSO disabled. This passes the full packet to the NIC. Perf shows
> a large drop in cycles spent (perf record -C 4 -a sleep 4).
> 
> Sending 10 Kpps without MSG_ZEROCOPY:
> 
>     Event count (approx.): 1191500371
>     30.63%  psock_reinject  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
>      7.50%  psock_reinject  [kernel.kallsyms]  [k] __rmqueue
>      6.26%         swapper  [kernel.kallsyms]  [k] intel_idle
>      5.06%  psock_reinject  [kernel.kallsyms]  [k] get_page_from_freelist
>      1.99%  psock_reinject  [kernel.kallsyms]  [k] mlx4_en_xmit
>      1.60%  psock_reinject  [kernel.kallsyms]  [k] __alloc_pages_nodemask
> 
> and the same with MSG_ZEROCOPY:
> 
>     Event count (approx.): 390783801
>     13.96%         swapper  [kernel.kallsyms]  [k] intel_idle
>      3.11%  psock_reinject  [kernel.kallsyms]  [k] mlx4_en_xmit
>      2.75%  psock_reinject  [kernel.kallsyms]  [k] _raw_spin_lock
>      2.67%         swapper  [kernel.kallsyms]  [k] lapic_next_deadline
>      2.33%  psock_reinject  [kernel.kallsyms]  [k] gup_huge_pmd
>      1.90%         swapper  [kernel.kallsyms]  [k] apic_timer_interrupt
> 
> The idle time is probably due to using usleep to pace the sender, so
> this underreports the effect.

About 15% gain in CPU utilization with huge packets is what I see with
vhost-net too.  So that seems reasonable.


> Another test sent packets over loopback to a UDP socket. The test
> executes mostly synchronously in a single thread:
>   sendmsg(fd_packet);
>   recvmsg(fd_inet);
>   recvmsg(fd_packet, .., MSG_ERRQUEUE);
> 
> Result from that at various packet sizes:
> 
>             no-zc    zc   !ubufs  recvmmsg !callback
> 1           573     368      359       439       517
> 10          569     368      361       441       507
> 100         572     299      340       408       478
> 1000        564     300      338       412       486
> 10000       475     247      331       405       479
> 30000       328     177      324       385       450
> 65000       218     117      307       359       417
> 
> Legend:
>   zc:         pass MSG_ZEROCOPY
>   !ubufs:     (gross hack) comment out skb_orphan_frags
>                   in __netif_receive_skb_core
>   recvmmsg:   switch from synchronous completion notification
>                   handling to batching.
>   !callback:  (gross hack) disable upcall mechanism completely,
>                   to expose its cost.
> 
> This demonstrates one of the major blocking issues for practical
> use of this feature in many datapaths: the kernel copies the data of
> sk_buffs that have SKBTX_DEV_ZEROCOPY set at skb_clone and other points
> to ensure safe reference counting. Follow-on work would be needed
> to refine this mechanism and increase the number of datapaths where
> user pages can safely be used. This is one reason why I demonstrate
> with TSO, as opposed to GSO: GSO also has to copy.
> 
> Implementation note: returning the pointer is somewhat ugly. An
> alternative is to keep a per-socket counter, increment it for
> each MSG_ZEROCOPY sendmsg() call, and return this counter as the
> notification.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>

The main problem with zero copy ATM is with queueing disciplines
which might keep the socket around essentially forever.
The case was described here:
https://lkml.org/lkml/2014/1/17/105
and of course this will make it more serious now that
more applications will be able to do this, so
chances that an administrator enables this
are higher.

One possible solution is some kind of timer orphaning frags
for skbs that have been around for too long.


> ---
>  include/linux/skbuff.h        |   1 +
>  include/linux/socket.h        |   1 +
>  include/uapi/linux/errqueue.h |   1 +
>  net/packet/af_packet.c        | 110 ++++++++++++++++++++++++++++++++++++++----
>  4 files changed, 103 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 78c299f..8e661d2 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -295,6 +295,7 @@ struct ubuf_info {
>  	void (*callback)(struct ubuf_info *, bool zerocopy_success);
>  	void *ctx;
>  	unsigned long desc;
> +	void *callback_priv;
>  };
>  
>  /* This data is invariant across clones and lives at
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index de52228..0a2e0ea 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -265,6 +265,7 @@ struct ucred {
>  #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
>  #define MSG_EOF         MSG_FIN
>  
> +#define MSG_ZEROCOPY	0x4000000	/* Pin user pages */
>  #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
>  #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exec for file
>  					   descriptor received through
> diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
> index 07bdce1..61bf318 100644
> --- a/include/uapi/linux/errqueue.h
> +++ b/include/uapi/linux/errqueue.h
> @@ -19,6 +19,7 @@ struct sock_extended_err {
>  #define SO_EE_ORIGIN_ICMP6	3
>  #define SO_EE_ORIGIN_TXSTATUS	4
>  #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
> +#define SO_EE_ORIGIN_UPAGE	5
>  
>  #define SO_EE_OFFENDER(ee)	((struct sockaddr*)((ee)+1))
>  
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 58af5802..367c23a 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2370,28 +2370,112 @@ out:
>  
>  static struct sk_buff *packet_alloc_skb(struct sock *sk, size_t prepad,
>  				        size_t reserve, size_t len,
> -				        size_t linear, int noblock,
> +					size_t linear, int flags,
>  				        int *err)
>  {
>  	struct sk_buff *skb;
> +	size_t data_len;
>  
> -	/* Under a page?  Don't bother with paged skb. */
> -	if (prepad + len < PAGE_SIZE || !linear)
> -		linear = len;
> +	if (flags & MSG_ZEROCOPY) {
> +		/* Minimize linear, but respect header lower bound */
> +		linear = min(len, max_t(size_t, linear, MAX_HEADER));
> +		data_len = 0;
> +	} else {
> +		/* Under a page? Don't bother with paged skb. */
> +		if (prepad + len < PAGE_SIZE || !linear)
> +			linear = len;
> +		data_len = len - linear;
> +	}
>  
> -	skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
> -				   err, 0);
> +	skb = sock_alloc_send_pskb(sk, prepad + linear, data_len,
> +				   flags & MSG_DONTWAIT, err, 0);
>  	if (!skb)
>  		return NULL;
>  
>  	skb_reserve(skb, reserve);
>  	skb_put(skb, linear);
> -	skb->data_len = len - linear;
> -	skb->len += len - linear;
> +	skb->data_len = data_len;
> +	skb->len += data_len;
>  
>  	return skb;
>  }
>  
> +/* release zerocopy resources and avoid destructor callback on kfree */
> +static void packet_snd_zerocopy_free(struct sk_buff *skb)
> +{
> +	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
> +
> +	if (uarg) {
> +		kfree_skb(uarg->callback_priv);
> +		sock_put((struct sock *) uarg->ctx);
> +		kfree(uarg);
> +
> +		skb_shinfo(skb)->destructor_arg = NULL;
> +		skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
> +	}
> +}
> +
> +static void packet_snd_zerocopy_callback(struct ubuf_info *uarg,
> +					 bool zerocopy_success)
> +{
> +	struct sk_buff *err_skb;
> +	struct sock *err_sk;
> +	struct sock_exterr_skb *serr;
> +
> +	err_sk = uarg->ctx;
> +	err_skb = uarg->callback_priv;
> +
> +	serr = SKB_EXT_ERR(err_skb);
> +	memset(serr, 0, sizeof(*serr));
> +	serr->ee.ee_errno = ENOMSG;
> +	serr->ee.ee_origin = SO_EE_ORIGIN_UPAGE;
> +	serr->ee.ee_data = uarg->desc & 0xFFFFFFFF;
> +	serr->ee.ee_info = ((u64) uarg->desc) >> 32;
> +	if (sock_queue_err_skb(err_sk, err_skb))
> +		kfree_skb(err_skb);
> +
> +	kfree(uarg);
> +	sock_put(err_sk);
> +}
> +
> +static int packet_zerocopy_sg_from_iovec(struct sk_buff *skb,
> +					 struct msghdr *msg)
> +{
> +	struct ubuf_info *uarg = NULL;
> +	int ret;
> +
> +	if (iov_pages(msg->msg_iov, 0, msg->msg_iovlen) > MAX_SKB_FRAGS)
> +		return -EMSGSIZE;
> +
> +	uarg = kzalloc(sizeof(*uarg), GFP_KERNEL);
> +	if (!uarg)
> +		return -ENOMEM;
> +
> +	uarg->callback_priv = alloc_skb(0, GFP_KERNEL);
> +	if (!uarg->callback_priv) {
> +		kfree(uarg);
> +		return -ENOMEM;
> +	}
> +
> +	ret = zerocopy_sg_from_iovec(skb, msg->msg_iov, 0, msg->msg_iovlen);
> +	if (ret) {
> +		kfree_skb(uarg->callback_priv);
> +		kfree(uarg);
> +		return -EIO;
> +	}
> +
> +	sock_hold(skb->sk);
> +	uarg->ctx = skb->sk;
> +	uarg->callback = packet_snd_zerocopy_callback;
> +	uarg->desc = (unsigned long) msg->msg_iov[0].iov_base;
> +
> +	skb_shinfo(skb)->destructor_arg = uarg;
> +	skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
> +	skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
> +
> +	return 0;
> +}
> +
>  static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>  {
>  	struct sock *sk = sock->sk;
> @@ -2408,6 +2492,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>  	unsigned short gso_type = 0;
>  	int hlen, tlen;
>  	int extra_len = 0;
> +	bool zerocopy = msg->msg_flags & MSG_ZEROCOPY;
>  
>  	/*
>  	 *	Get and verify the address.
> @@ -2501,7 +2586,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>  	hlen = LL_RESERVED_SPACE(dev);
>  	tlen = dev->needed_tailroom;
>  	skb = packet_alloc_skb(sk, hlen + tlen, hlen, len, vnet_hdr.hdr_len,
> -			       msg->msg_flags & MSG_DONTWAIT, &err);
> +			       msg->msg_flags, &err);
>  	if (skb == NULL)
>  		goto out_unlock;
>  
> @@ -2518,7 +2603,11 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>  	}
>  
>  	/* Returns -EFAULT on error */
> -	err = skb_copy_datagram_from_iovec(skb, offset, msg->msg_iov, 0, len);
> +	if (zerocopy)
> +		err = packet_zerocopy_sg_from_iovec(skb, msg);
> +	else
> +		err = skb_copy_datagram_from_iovec(skb, offset, msg->msg_iov,
> +						   0, len);
>  	if (err)
>  		goto out_free;
>  
> @@ -2578,6 +2667,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>  	return len;
>  
>  out_free:
> +	packet_snd_zerocopy_free(skb);
>  	kfree_skb(skb);
>  out_unlock:
>  	if (dev)
> -- 
> 2.1.0.rc2.206.gedb03e5
Willem de Bruijn Nov. 26, 2014, 7:59 p.m. UTC | #2
> The main problem with zero copy ATM is with queueing disciplines
> which might keep the socket around essentially forever.
> The case was described here:
> https://lkml.org/lkml/2014/1/17/105
> and of course this will make it more serious now that
> more applications will be able to do this, so
> chances that an administrator enables this
> are higher.

The denial of service issue raised there, that a single queue can
block an entire virtio-net device, is less problematic in the case of
packet sockets. A socket can run out of sk_wmem_alloc, but a prudent
application can increase the limit or use separate sockets for
separate flows.
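
For instance (a sketch; this assumes the limit in question is the
per-socket send buffer cap, which bounds sk_wmem_alloc):

  /* Raise the send buffer limit. The effective value is still capped
   * by /proc/sys/net/core/wmem_max unless SO_SNDBUFFORCE is used. */
  int val = 4 * 1024 * 1024;  /* example value */
  setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, sizeof(val));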

> One possible solution is some kind of timer orphaning frags
> for skbs that have been around for too long.

Perhaps this can be approximated without an explicit timer by calling
skb_copy_ubufs on enqueue whenever qlen exceeds a threshold value?
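
A minimal sketch of that idea, untested and with a hypothetical
threshold constant (ZC_ORPHAN_THRESHOLD) and hook point:

  /* Orphan zerocopy frags at enqueue once the queue grows past a
   * threshold, so user pages cannot sit pinned behind a long device
   * queue indefinitely. */
  static int zc_guard_enqueue(struct sk_buff *skb, struct Qdisc *sch)
  {
          if ((skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) &&
              sch->q.qlen > ZC_ORPHAN_THRESHOLD &&
              skb_copy_ubufs(skb, GFP_ATOMIC))
                  return qdisc_drop(skb, sch);  /* copy failed: drop */

          return qdisc_enqueue_tail(skb, sch);
  }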

>> [... full patch quoted verbatim, trimmed ...]
Michael S. Tsirkin Nov. 26, 2014, 9:17 p.m. UTC | #3
On Wed, Nov 26, 2014 at 02:59:34PM -0500, Willem de Bruijn wrote:
> > The main problem with zero copy ATM is with queueing disciplines
> > which might keep the socket around essentially forever.
> > The case was described here:
> > https://lkml.org/lkml/2014/1/17/105
> > and of course this will make it more serious now that
> > more applications will be able to do this, so
> > chances that an administrator enables this
> > are higher.
> 
> The denial of service issue raised there, that a single queue can
> block an entire virtio-net device, is less problematic in the case of
> packet sockets. A socket can run out of sk_wmem_alloc, but a prudent
> application can increase the limit or use separate sockets for
> separate flows.

Socket per flow? Maybe just use TCP then? Increasing the limit
sounds like the wrong solution; it hurts security.

> > One possible solution is some kind of timer orphaning frags
> > for skbs that have been around for too long.
> 
> Perhaps this can be approximated without an explicit timer by calling
> skb_copy_ubufs on enqueue whenever qlen exceeds a threshold value?

Hard to say. Will have to see that patch to judge how robust this is.


> >> [... full patch quoted verbatim, trimmed ...]
Michael S. Tsirkin Nov. 26, 2014, 9:20 p.m. UTC | #4
On Wed, Nov 26, 2014 at 02:59:34PM -0500, Willem de Bruijn wrote:
> > The main problem with zero copy ATM is with queueing disciplines
> > which might keep the socket around essentially forever.
> > The case was described here:
> > https://lkml.org/lkml/2014/1/17/105
> > and of course this will make it more serious now that
> > more applications will be able to do this, so
> > chances that an administrator enables this
> > are higher.
> 
> The denial of service issue raised there, that a single queue can
> block an entire virtio-net device, is less problematic in the case of
> packet sockets. A socket can run out of sk_wmem_alloc, but a prudent
> application can increase the limit or use separate sockets for
> separate flows.

Sounds like this interface is very hard to use correctly.

> > One possible solution is some kind of timer orphaning frags
> > for skbs that have been around for too long.
> 
> Perhaps this can be approximated without an explicit timer by calling
> skb_copy_ubufs on enqueue whenever qlen exceeds a threshold value?

Not sure.  I'll have to see that patch to judge.

> >> [... full patch quoted verbatim, trimmed ...]
Willem de Bruijn Nov. 26, 2014, 11:05 p.m. UTC | #5
On Wed, Nov 26, 2014 at 4:20 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Nov 26, 2014 at 02:59:34PM -0500, Willem de Bruijn wrote:
>> > The main problem with zero copy ATM is with queueing disciplines
>> > which might keep the socket around essentially forever.
>> > The case was described here:
>> > https://lkml.org/lkml/2014/1/17/105
>> > and of course this will make it more serious now that
>> > more applications will be able to do this, so
>> > chances that an administrator enables this
>> > are higher.
>>
>> The denial of service issue raised there, that a single queue can
>> block an entire virtio-net device, is less problematic in the case of
>> packet sockets. A socket can run out of sk_wmem_alloc, but a prudent
>> application can increase the limit or use separate sockets for
>> separate flows.
>
> Sounds like this interface is very hard to use correctly.

Actually, this socket alloc issue is the same for zerocopy and
non-zerocopy. Packets can be held in deep queues at which point
the packet socket is blocked. This is accepted behavior.

From the above thread:

"It's ok for non-zerocopy packet to be blocked since VM1 thought the
packets has been sent instead of pending in the virtqueue. So VM1 can
still send packet to other destination."

This is very specific to virtio and vhost-net. I don't think that
concern applies to a packet interface.

Another issue, though, is that the method currently really only helps
TSO, because all other paths cause a deep copy. There are more use
cases once it can send up to 64KB MTU over loopback or send out
GSO datagrams without triggering skb_copy_ubufs. I have not looked
into how (or if) that can be achieved yet.

>
>> > One possible solution is some kind of timer orphaning frags
>> > for skbs that have been around for too long.
>>
>> Perhaps this can be approximated without an explicit timer by calling
>> skb_copy_ubufs on enqueue whenever qlen exceeds a threshold value?
>
> Not sure.  I'll have to see that patch to judge.
>
>> >> [... full patch quoted verbatim, trimmed ...]
Michael S. Tsirkin Nov. 27, 2014, 7:27 a.m. UTC | #6
On Wed, Nov 26, 2014 at 06:05:16PM -0500, Willem de Bruijn wrote:
> On Wed, Nov 26, 2014 at 4:20 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Wed, Nov 26, 2014 at 02:59:34PM -0500, Willem de Bruijn wrote:
> >> > The main problem with zero copy ATM is with queueing disciplines
> >> > which might keep the socket around essentially forever.
> >> > The case was described here:
> >> > https://lkml.org/lkml/2014/1/17/105
> >> > and of course this will make it more serious now that
> >> > more applications will be able to do this, so
> >> > chances that an administrator enables this
> >> > are higher.
> >>
> >> The denial of service issue raised there, that a single queue can
> >> block an entire virtio-net device, is less problematic in the case of
> >> packet sockets. A socket can run out of sk_wmem_alloc, but a prudent
> >> application can increase the limit or use separate sockets for
> >> separate flows.
> >
> > Sounds like this interface is very hard to use correctly.
> 
> Actually, this socket alloc issue is the same for zerocopy and
> non-zerocopy. Packets can be held in deep queues at which point
> the packet socket is blocked. This is accepted behavior.
>
> From the above thread:
> 
> "It's ok for non-zerocopy packet to be blocked since VM1 thought the
> packets has been sent instead of pending in the virtqueue. So VM1 can
> still send packet to other destination."
> 
> This is very specific to virtio and vhost-net. I don't think that
> concern applies to a packet interface.

Well, you are obviously building the interface with some use-case in mind.
Let's try to make it work for multiple use-cases.

So at some level, you are right.  The issue is not running out of wmem.
But I think I'm right too - this is hard to use correctly.

I think the difference is that with your patch, the application
can't reuse the memory until the packet is transmitted; otherwise junk
goes out on the network. Even closing the socket won't help.
Is this true?

I see this as a problem.

I'm trying to figure out how one would use this interface. One obvious
use would be to tunnel out raw packets directly from VM memory.
For this application, a zero-copy packet never completing is a problem:
at minimum, you want to be able to remove the device, which
translates to a requirement that closing the socket effectively stops
using userspace memory. In case you want to be able to run e.g. a watchdog,
the ability to specify a deadline also seems beneficial.


> Another issue, though, is that the method currently really only helps
> TSO, because all other paths cause a deep copy. There are more use
> cases once it can send up to 64KB MTU over loopback or send out
> GSO datagrams without triggering skb_copy_ubufs. I have not looked
> into how (or if) that can be achieved yet.

I think this was done intentionally at some point; try looking at the
git history to find out the reasons.

> >
> >> > One possible solution is some kind of timer orphaning frags
> >> > for skbs that have been around for too long.
> >>
> >> Perhaps this can be approximated without an explicit timer by calling
> >> skb_copy_ubufs on enqueue whenever qlen exceeds a threshold value?
> >
> > Not sure.  I'll have to see that patch to judge.
Jason Wang Nov. 27, 2014, 9:10 a.m. UTC | #7
On Thu, Nov 27, 2014 at 5:17 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Nov 26, 2014 at 02:59:34PM -0500, Willem de Bruijn wrote:
>>  > The main problem with zero copy ATM is with queueing disciplines
>>  > which might keep the socket around essentially forever.
>>  > The case was described here:
>>  > https://lkml.org/lkml/2014/1/17/105
>>  > and of course this will make it more serious now that
>>  > more applications will be able to do this, so
>>  > chances that an administrator enables this
>>  > are higher.
>>  
>>  The denial of service issue raised there, that a single queue can
>>  block an entire virtio-net device, is less problematic in the case of
>>  packet sockets. A socket can run out of sk_wmem_alloc, but a prudent
>>  application can increase the limit or use separate sockets for
>>  separate flows.
> 
> Socket per flow? Maybe just use TCP then? Increasing the limit
> sounds like a wrong solution; it hurts security.
> 
>>  > One possible solution is some kind of timer orphaning frags
>>  > for skbs that have been around for too long.
>>  
>>  Perhaps this can be approximated without an explicit timer by calling
>>  skb_copy_ubufs on enqueue whenever qlen exceeds a threshold value?
> 
> Hard to say. Will have to see that patch to judge how robust this is.

This would not work: if the threshold is greater than the vring size
or the vhost_net pending limit, transmission may still be blocked.

Michael S. Tsirkin Nov. 27, 2014, 10:44 a.m. UTC | #8
On Thu, Nov 27, 2014 at 09:18:12AM +0008, Jason Wang wrote:
> 
> 
> On Thu, Nov 27, 2014 at 5:17 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >On Wed, Nov 26, 2014 at 02:59:34PM -0500, Willem de Bruijn wrote:
> >> > The main problem with zero copy ATM is with queueing disciplines
> >> > which might keep the socket around essentially forever.
> >> > The case was described here:
> >> > https://lkml.org/lkml/2014/1/17/105
> >> > and of course this will make it more serious now that
> >> > more applications will be able to do this, so
> >> > chances that an administrator enables this
> >> > are higher.
> >> The denial of service issue raised there, that a single queue can
> >> block an entire virtio-net device, is less problematic in the case of
> >> packet sockets. A socket can run out of sk_wmem_alloc, but a prudent
> >> application can increase the limit or use separate sockets for
> >> separate flows.
> >
> >Socket per flow? Maybe just use TCP then? Increasing the limit
> >sounds like a wrong solution; it hurts security.
> >
> >> > One possible solution is some kind of timer orphaning frags
> >> > for skbs that have been around for too long.
> >>   Perhaps this can be approximated without an explicit timer by calling
> >> skb_copy_ubufs on enqueue whenever qlen exceeds a threshold value?
> >
> >Hard to say. Will have to see that patch to judge how robust this is.
> 
> This would not work: if the threshold is greater than the vring size
> or the vhost_net pending limit, transmission may still be blocked.

Well, the application can e.g. just switch to non-zerocopy after
reaching a specific number of requests.
I think the real problem isn't reaching the queue full
condition, it's the fact that a specific buffer might never
get freed. This API isn't half as useful as it could be
if applications had a way to force the memory
to be reclaimed.
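
A sketch of that fallback (assuming the application counts outstanding
zerocopy sends and decrements the counter on each error-queue
notification; the names and the limit are illustrative):

#include <sys/types.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000		/* added by the patch below */
#endif

#define ZC_OUTSTANDING_LIMIT 64		/* hypothetical */

/* Use zerocopy only while few sends are unacknowledged; otherwise
 * fall back to a copying send so no further pages get pinned.
 */
static ssize_t zc_send(int fd, const void *buf, size_t len,
		       int *outstanding)
{
	int flags = 0;

	if (*outstanding < ZC_OUTSTANDING_LIMIT) {
		flags = MSG_ZEROCOPY;
		(*outstanding)++;	/* decrement on notification */
	}
	return send(fd, buf, len, flags);
}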


And actually, I see a way for applications to reclaim the memory:
the application could invoke something like MADV_SOFT_OFFLINE on the memory
submitted for zero copy transmit, to invalidate the PTEs and make the next
access fault new pages in.
If dedicated memory is used for packets, you could even use
MADV_DONTNEED - but this doesn't work in many cases, certainly
not for virtualization type workloads.
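
As a minimal sketch of the MADV_DONTNEED variant (assuming dedicated
anonymous packet memory; the helper name is illustrative):

#include <stddef.h>
#include <sys/mman.h>

/* Invalidate the PTEs of a buffer whose zero copy send never
 * completed. Only safe for dedicated anonymous memory: the next
 * access faults in fresh zero pages, while the kernel keeps its own
 * references to the pages still linked into in-flight skbs.
 */
static int zc_force_reclaim(void *buf, size_t len)
{
	return madvise(buf, len, MADV_DONTNEED);
}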

Playing with PTEs needs to invalidate the TLB, so it is not fast,
but it does not need to be: we are talking about the ability to close
the socket, which should be rare.

For example, an application/hypervisor can detect a timeout when a
packet is not transmitted within a predefined time period, and trigger
such a reclaim.
Making this period shorter than the network watchdog timer of the VM
will ensure that the watchdog does not trigger within the VM.
Alternatively, VM network watchdog could trigger this reclaim
in order to recover packet memory.

With this idea, if the application merely reads the memory, we incur a lot of
overhead with page faults. So maybe a new call to enable COW for a range
of pages would be a good idea.


We'd have to make sure whatever's used for reclaim works for
a wide range of memory types: mmap-ed file, hugetlbfs, anonymous memory.


Thoughts?
Willem de Bruijn Nov. 28, 2014, 12:32 a.m. UTC | #9
On Thu, Nov 27, 2014 at 2:27 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Nov 26, 2014 at 06:05:16PM -0500, Willem de Bruijn wrote:
>> On Wed, Nov 26, 2014 at 4:20 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Wed, Nov 26, 2014 at 02:59:34PM -0500, Willem de Bruijn wrote:
>> >> > The main problem with zero copy ATM is with queueing disciplines
>> >> > which might keep the socket around essentially forever.
>> >> > The case was described here:
>> >> > https://lkml.org/lkml/2014/1/17/105
>> >> > and of course this will make it more serious now that
>> >> > more applications will be able to do this, so
>> >> > chances that an administrator enables this
>> >> > are higher.
>> >>
>> >> The denial of service issue raised there, that a single queue can
>> >> block an entire virtio-net device, is less problematic in the case of
>> >> packet sockets. A socket can run out of sk_wmem_alloc, but a prudent
>> >> application can increase the limit or use separate sockets for
>> >> separate flows.
>> >
>> > Sounds like this interface is very hard to use correctly.
>>
>> Actually, this socket alloc issue is the same for zerocopy and
>> non-zerocopy. Packets can be held in deep queues at which point
>> the packet socket is blocked. This is accepted behavior.
>>
>> From the above thread:
>>
>> "It's ok for non-zerocopy packet to be blocked since VM1 thought the
>> packets has been sent instead of pending in the virtqueue. So VM1 can
>> still send packet to other destination."
>>
>> This is very specific to virtio and vhost-net. I don't think that that
>> concern applies to a packet interface.
>
> Well, you are obviously building the interface with some use-case in mind.
> Let's try to make it work for multiple use-cases.
>
> So at some level, you are right.  The issue is not running out of wmem.
> But I think I'm right too - this is hard to use correctly.
>
> I think the difference is that with your patch, the application
> can't reuse the memory until the packet is transmitted; otherwise junk goes
> out on the network.

The packet headers are in linear skb memory, so the integrity
of the kernel is not affected. But the payload (and derived data,
like checksums) may be junk.

> Even closing the socket won't help.
> Is this true?

Good point. Indeed. Your approach in the follow-up email,
to let the process release its mappings, sounds like a good
answer.

In general, I think that leaving this under process control is
preferable to kernel guarantees using a timer, because it
is more predictable from the process's point of view. Both
the state of the system at any point in time and the per-packet
cycle cost are easier to reason about (were packets sent out
with zero-copy, or were deep copies triggered by a timer? This
question would be difficult to answer otherwise, and it is
important, as deep copies are likely more expensive).
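
For reference, a minimal sketch of that per-packet completion
handling, using MSG_ERRQUEUE and the SO_EE_ORIGIN_UPAGE notification
added by the patch below (cmsg level/type checks and error handling
elided; the helper name is illustrative):

#include <stdint.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

#ifndef SO_EE_ORIGIN_UPAGE
#define SO_EE_ORIGIN_UPAGE 5	/* added by the patch below */
#endif

/* Read one notification from the error queue and recover the
 * iov_base of the send it completes; the patch splits the pointer
 * across ee_data (low 32 bits) and ee_info (high 32 bits).
 */
static void *zc_read_notification(int fd)
{
	char control[64];
	struct msghdr msg = { .msg_control = control,
			      .msg_controllen = sizeof(control) };
	struct cmsghdr *cm;
	struct sock_extended_err *serr;

	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return NULL;

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		serr = (struct sock_extended_err *)CMSG_DATA(cm);
		if (serr->ee_origin == SO_EE_ORIGIN_UPAGE)
			return (void *)(uintptr_t)
			       (((uint64_t)serr->ee_info << 32) |
				serr->ee_data);
	}
	return NULL;
}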

>
> I see this as a problem.
>
> I'm trying to figure out how one would use this interface; one obvious
> use would be to tunnel out raw packets directly from VM memory.
> For this application, a zero copy packet never completing is a problem:
> at minimum, you want to be able to remove the device, which
> translates to a requirement that closing the socket effectively stops
> using userspace memory.

The application can ensure this by releasing the relevant
pages. For mmap()ed pages, munmap() is sufficient. The approach
you mentioned with madvise may be more widely applicable. I had
so far only targeted mmap()ed (incl. MAP_HUGETLB) pages.
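
A sketch of that release path for a dedicated mmap()ed region (the
helper name is illustrative):

#include <stddef.h>
#include <sys/mman.h>

/* Drop the process mapping of a zero copy send region. In-flight
 * skb frags hold their own page references, so the pages stay valid
 * until transmit completes; the point is that the process can no
 * longer reuse, and thereby corrupt, the memory.
 */
static int zc_region_release(void *region, size_t len)
{
	return munmap(region, len);
}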

Alternatively, the kernel could enforce this by keeping a reference
on all in-flight ubuf_info structs associated with the sk. Though this
adds non-trivial accounting overhead and potentially massive
copying at socket close.

If outstanding ubuf_info are tracked, then the forced copy can
be triggered by other actions as well, such as in response to a
sendmsg cmsg or setsockopt.

Because there is a per-call cost associated with maintaining
kernel state, but not necessarily with userspace strategies,
I would choose to leave this responsibility to the user. That
is already the behavior with vmsplice gifting, I think.

> In case you want to be able to run e.g. a watchdog,
> the ability to specify a deadline also seems beneficial.

>> Another issue, though, is that the method currently really only helps
>> TSO because all other paths cause a deep copy. There are more use
>> cases once it can send up to 64KB MTU over loopback or send out
>> GSO datagrams without triggering skb_copy_ubufs. I have not looked
>> into how (or if) that can be achieved yet.
>
> I think this was done intentionally at some point,
> try to look at git history to find out the reasons.
>
>> >
>> >> > One possible solution is some kind of timer orphaning frags
>> >> > for skbs that have been around for too long.
>> >>
>> >> Perhaps this can be approximated without an explicit timer by calling
>> >> skb_copy_ubufs on enqueue whenever qlen exceeds a threshold value?
>> >
>> > Not sure.  I'll have to see that patch to judge.
Willem de Bruijn Nov. 28, 2014, 12:39 a.m. UTC | #10
On Thu, Nov 27, 2014 at 5:44 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Nov 27, 2014 at 09:18:12AM +0008, Jason Wang wrote:
>>
>>
>> On Thu, Nov 27, 2014 at 5:17 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >On Wed, Nov 26, 2014 at 02:59:34PM -0500, Willem de Bruijn wrote:
>> >> > The main problem with zero copy ATM is with queueing disciplines
>> >> > which might keep the socket around essentially forever.
>> >> > The case was described here:
>> >> > https://lkml.org/lkml/2014/1/17/105
>> >> > and of course this will make it more serious now that
>> >> > more applications will be able to do this, so
>> >> > chances that an administrator enables this
>> >> > are higher.
>> >> The denial of service issue raised there, that a single queue can
>> >> block an entire virtio-net device, is less problematic in the case of
>> >> packet sockets. A socket can run out of sk_wmem_alloc, but a prudent
>> >> application can increase the limit or use separate sockets for
>> >> separate flows.
>> >
>> >Socket per flow? Maybe just use TCP then? Increasing the limit
>> >sounds like a wrong solution; it hurts security.
>> >
>> >> > One possible solution is some kind of timer orphaning frags
>> >> > for skbs that have been around for too long.
>> >>   Perhaps this can be approximated without an explicit timer by calling
>> >> skb_copy_ubufs on enqueue whenever qlen exceeds a threshold value?
>> >
>> >Hard to say. Will have to see that patch to judge how robust this is.
>>
>> This would not work: if the threshold is greater than the vring size
>> or the vhost_net pending limit, transmission may still be blocked.
>
>
> Well, the application can e.g. just switch to non-zerocopy after
> reaching a specific number of requests.
> I think the real problem isn't reaching the queue full
> condition, it's the fact that a specific buffer might never
> get freed. This API isn't half as useful as it could be
> if applications had a way to force the memory
> to be reclaimed.
>
>
> And actually, I see a way for applications to reclaim the memory:
> the application could invoke something like MADV_SOFT_OFFLINE on the memory
> submitted for zero copy transmit, to invalidate the PTEs and make the next
> access fault new pages in.
> If dedicated memory is used for packets, you could even use
> MADV_DONTNEED - but this doesn't work in many cases, certainly
> not for virtualization type workloads.

Why not?

>
>
> Playing with PTEs needs to invalidate the TLB, so it is not fast,
> but it does not need to be: we are talking about the ability to close
> the socket, which should be rare.
>
> For example, an application/hypervisor can detect a timeout when a
> packet is not transmitted within a predefined time period, and trigger
> such a reclaim.
> Making this period shorter than the network watchdog timer of the VM
> will ensure that the watchdog does not trigger within the VM.
> Alternatively, VM network watchdog could trigger this reclaim
> in order to recover packet memory.
>
> With this idea, if the application merely reads the memory, we incur a lot of
> overhead with page faults. So maybe a new call to enable COW for a range
> of pages would be a good idea.
>
>
> We'd have to make sure whatever's used for reclaim works for
> a wide range of memory types: mmap-ed file, hugetlbfs, anonymous memory.
>
>
> Thoughts?

Very nice. When exactly are these madvise hints preferable to munmap?

> --
> MST
Jason Wang Nov. 28, 2014, 6:21 a.m. UTC | #11
On Thu, Nov 27, 2014 at 6:44 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Nov 27, 2014 at 09:18:12AM +0008, Jason Wang wrote:
>>  
>>  
>>  On Thu, Nov 27, 2014 at 5:17 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>  >On Wed, Nov 26, 2014 at 02:59:34PM -0500, Willem de Bruijn wrote:
>>  >> > The main problem with zero copy ATM is with queueing disciplines
>>  >> > which might keep the socket around essentially forever.
>>  >> > The case was described here:
>>  >> > https://lkml.org/lkml/2014/1/17/105
>>  >> > and of course this will make it more serious now that
>>  >> > more applications will be able to do this, so
>>  >> > chances that an administrator enables this
>>  >> > are higher.
>>  >> The denial of service issue raised there, that a single queue can
>>  >> block an entire virtio-net device, is less problematic in the case of
>>  >> packet sockets. A socket can run out of sk_wmem_alloc, but a prudent
>>  >> application can increase the limit or use separate sockets for
>>  >> separate flows.
>>  >
>>  >Socket per flow? Maybe just use TCP then? Increasing the limit
>>  >sounds like a wrong solution; it hurts security.
>>  >
>>  >> > One possible solution is some kind of timer orphaning frags
>>  >> > for skbs that have been around for too long.
>>  >>   Perhaps this can be approximated without an explicit timer by calling
>>  >> skb_copy_ubufs on enqueue whenever qlen exceeds a threshold value?
>>  >
>>  >Hard to say. Will have to see that patch to judge how robust this is.
>>  
>>  This would not work: if the threshold is greater than the vring size
>>  or the vhost_net pending limit, transmission may still be blocked.
> 
> Well, the application can e.g. just switch to non-zerocopy after
> reaching a specific number of requests.

Yes, but that only works if the user is ok with out-of-order completion.
> 
> I think the real problem isn't reaching the queue full
> condition, it's the fact that a specific buffer might never
> get freed. This API isn't half as useful as it could be
> if applications had a way to force the memory
> to be reclaimed.
> 

Agreed.
> 
> And actually, I see a way for applications to reclaim the memory:
> the application could invoke something like MADV_SOFT_OFFLINE on the memory
> submitted for zero copy transmit, to invalidate the PTEs and make the next
> access fault new pages in.
> If dedicated memory is used for packets, you could even use
> MADV_DONTNEED - but this doesn't work in many cases, certainly
> not for virtualization type workloads.
> 
> Playing with PTEs needs to invalidate the TLB, so it is not fast,
> but it does not need to be: we are talking about the ability to close
> the socket, which should be rare.
> 
> For example, an application/hypervisor can detect a timeout when a
> packet is not transmitted within a predefined time period, and trigger
> such a reclaim.
> Making this period shorter than the network watchdog timer of the VM
> will ensure that the watchdog does not trigger within the VM.
> Alternatively, VM network watchdog could trigger this reclaim
> in order to recover packet memory.

Doing this in the hypervisor seems better. It could reduce the
behavior that a guest can trigger.

But this only fixes the transmission being stuck in the guest; the host
socket still needs to wait for the packet to be sent by the host?
> 
> 
> With this idea, if the application merely reads the memory, we incur a lot
> of overhead with page faults. So maybe a new call to enable COW for a range
> of pages would be a good idea.
> 

Not very clear - doesn't COW still depend on page faults to work?
> 
> We'd have to make sure whatever's used for reclaim works for
> a wide range of memory types: mmap-ed file, hugetlbfs, anonymous memory.
> 
> 
> Thoughts?
> 
> -- 
> MST

diff mbox

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 78c299f..8e661d2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -295,6 +295,7 @@  struct ubuf_info {
 	void (*callback)(struct ubuf_info *, bool zerocopy_success);
 	void *ctx;
 	unsigned long desc;
+	void *callback_priv;
 };
 
 /* This data is invariant across clones and lives at
diff --git a/include/linux/socket.h b/include/linux/socket.h
index de52228..0a2e0ea 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -265,6 +265,7 @@  struct ucred {
 #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
 #define MSG_EOF         MSG_FIN
 
+#define MSG_ZEROCOPY	0x4000000	/* Pin user pages */
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
 #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exec for file
 					   descriptor received through
diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 07bdce1..61bf318 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -19,6 +19,7 @@  struct sock_extended_err {
 #define SO_EE_ORIGIN_ICMP6	3
 #define SO_EE_ORIGIN_TXSTATUS	4
 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
+#define SO_EE_ORIGIN_UPAGE	5
 
 #define SO_EE_OFFENDER(ee)	((struct sockaddr*)((ee)+1))
 
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 58af5802..367c23a 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2370,28 +2370,112 @@  out:
 
 static struct sk_buff *packet_alloc_skb(struct sock *sk, size_t prepad,
 				        size_t reserve, size_t len,
-				        size_t linear, int noblock,
+					size_t linear, int flags,
 				        int *err)
 {
 	struct sk_buff *skb;
+	size_t data_len;
 
-	/* Under a page?  Don't bother with paged skb. */
-	if (prepad + len < PAGE_SIZE || !linear)
-		linear = len;
+	if (flags & MSG_ZEROCOPY) {
+		/* Minimize linear, but respect header lower bound */
+		linear = min(len, max_t(size_t, linear, MAX_HEADER));
+		data_len = 0;
+	} else {
+		/* Under a page? Don't bother with paged skb. */
+		if (prepad + len < PAGE_SIZE || !linear)
+			linear = len;
+		data_len = len - linear;
+	}
 
-	skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
-				   err, 0);
+	skb = sock_alloc_send_pskb(sk, prepad + linear, data_len,
+				   flags & MSG_DONTWAIT, err, 0);
 	if (!skb)
 		return NULL;
 
 	skb_reserve(skb, reserve);
 	skb_put(skb, linear);
-	skb->data_len = len - linear;
-	skb->len += len - linear;
+	skb->data_len = data_len;
+	skb->len += data_len;
 
 	return skb;
 }
 
+/* release zerocopy resources and avoid destructor callback on kfree */
+static void packet_snd_zerocopy_free(struct sk_buff *skb)
+{
+	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
+
+	if (uarg) {
+		kfree_skb(uarg->callback_priv);
+		sock_put((struct sock *) uarg->ctx);
+		kfree(uarg);
+
+		skb_shinfo(skb)->destructor_arg = NULL;
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+	}
+}
+
+static void packet_snd_zerocopy_callback(struct ubuf_info *uarg,
+					 bool zerocopy_success)
+{
+	struct sk_buff *err_skb;
+	struct sock *err_sk;
+	struct sock_exterr_skb *serr;
+
+	err_sk = uarg->ctx;
+	err_skb = uarg->callback_priv;
+
+	serr = SKB_EXT_ERR(err_skb);
+	memset(serr, 0, sizeof(*serr));
+	serr->ee.ee_errno = ENOMSG;
+	serr->ee.ee_origin = SO_EE_ORIGIN_UPAGE;
+	serr->ee.ee_data = uarg->desc & 0xFFFFFFFF;
+	serr->ee.ee_info = ((u64) uarg->desc) >> 32;
+	if (sock_queue_err_skb(err_sk, err_skb))
+		kfree_skb(err_skb);
+
+	kfree(uarg);
+	sock_put(err_sk);
+}
+
+static int packet_zerocopy_sg_from_iovec(struct sk_buff *skb,
+					 struct msghdr *msg)
+{
+	struct ubuf_info *uarg = NULL;
+	int ret;
+
+	if (iov_pages(msg->msg_iov, 0, msg->msg_iovlen) > MAX_SKB_FRAGS)
+		return -EMSGSIZE;
+
+	uarg = kzalloc(sizeof(*uarg), GFP_KERNEL);
+	if (!uarg)
+		return -ENOMEM;
+
+	uarg->callback_priv = alloc_skb(0, GFP_KERNEL);
+	if (!uarg->callback_priv) {
+		kfree(uarg);
+		return -ENOMEM;
+	}
+
+	ret = zerocopy_sg_from_iovec(skb, msg->msg_iov, 0, msg->msg_iovlen);
+	if (ret) {
+		kfree_skb(uarg->callback_priv);
+		kfree(uarg);
+		return -EIO;
+	}
+
+	sock_hold(skb->sk);
+	uarg->ctx = skb->sk;
+	uarg->callback = packet_snd_zerocopy_callback;
+	uarg->desc = (unsigned long) msg->msg_iov[0].iov_base;
+
+	skb_shinfo(skb)->destructor_arg = uarg;
+	skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
+	skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
+
+	return 0;
+}
+
 static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 {
 	struct sock *sk = sock->sk;
@@ -2408,6 +2492,7 @@  static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	unsigned short gso_type = 0;
 	int hlen, tlen;
 	int extra_len = 0;
+	bool zerocopy = msg->msg_flags & MSG_ZEROCOPY;
 
 	/*
 	 *	Get and verify the address.
@@ -2501,7 +2586,7 @@  static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	hlen = LL_RESERVED_SPACE(dev);
 	tlen = dev->needed_tailroom;
 	skb = packet_alloc_skb(sk, hlen + tlen, hlen, len, vnet_hdr.hdr_len,
-			       msg->msg_flags & MSG_DONTWAIT, &err);
+			       msg->msg_flags, &err);
 	if (skb == NULL)
 		goto out_unlock;
 
@@ -2518,7 +2603,11 @@  static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	}
 
 	/* Returns -EFAULT on error */
-	err = skb_copy_datagram_from_iovec(skb, offset, msg->msg_iov, 0, len);
+	if (zerocopy)
+		err = packet_zerocopy_sg_from_iovec(skb, msg);
+	else
+		err = skb_copy_datagram_from_iovec(skb, offset, msg->msg_iov,
+						   0, len);
 	if (err)
 		goto out_free;
 
@@ -2578,6 +2667,7 @@  static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	return len;
 
 out_free:
+	packet_snd_zerocopy_free(skb);
 	kfree_skb(skb);
 out_unlock:
 	if (dev)