diff mbox series

[V2,net-next,4/7] rds: support for zcopy completion notification

Message ID f9cf8d59bed78822f177364eea8e3f55f197776b.1517843755.git.sowmini.varadhan@oracle.com
State Superseded, archived
Delegated to: David Miller
Headers show
Series RDS: zerocopy support | expand

Commit Message

Sowmini Varadhan Feb. 14, 2018, 10:28 a.m. UTC
RDS removes a datagram (rds_message) from the retransmit queue when
an ACK is received. The ACK indicates that the receiver has queued
the RDS datagram, so that the sender can safely forget the datagram.
When all references to the rds_message are quiesced, rds_message_purge
is called to release resources used by the rds_message

If the datagram to be removed had pinned pages set up, add
an entry to the rs->rs_znotify_queue so that the notifcation
will be sent up via rds_rm_zerocopy_callback() when the
rds_message is eventually freed by rds_message_purge.

rds_rm_zerocopy_callback() attempts to batch the number of cookies
sent with each notification  to a max of SO_EE_ORIGIN_MAX_ZCOOKIES.
This is achieved by checking the tail skb in the sk_error_queue:
if this has room for one more cookie, the cookie from the
current notification is added; else a new skb is added to the
sk_error_queue. Every invocation of rds_rm_zerocopy_callback() will
trigger a ->sk_error_report to notify the application.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
v2: 
  - make sure to always sock_put m_rs even if there is no znotifier.
  - major rewrite of notification, resulting in much simplification.

 include/uapi/linux/errqueue.h |    2 +
 net/rds/af_rds.c              |    2 +
 net/rds/message.c             |   83 +++++++++++++++++++++++++++++++++++++---
 net/rds/rds.h                 |   14 +++++++
 net/rds/recv.c                |    2 +
 5 files changed, 96 insertions(+), 7 deletions(-)

Comments

Santosh Shilimkar Feb. 14, 2018, 6:50 p.m. UTC | #1
On 2/14/2018 2:28 AM, Sowmini Varadhan wrote:
> RDS removes a datagram (rds_message) from the retransmit queue when
> an ACK is received. The ACK indicates that the receiver has queued
> the RDS datagram, so that the sender can safely forget the datagram.
> When all references to the rds_message are quiesced, rds_message_purge
> is called to release resources used by the rds_message
> 
> If the datagram to be removed had pinned pages set up, add
> an entry to the rs->rs_znotify_queue so that the notifcation
> will be sent up via rds_rm_zerocopy_callback() when the
> rds_message is eventually freed by rds_message_purge.
> 
> rds_rm_zerocopy_callback() attempts to batch the number of cookies
> sent with each notification  to a max of SO_EE_ORIGIN_MAX_ZCOOKIES.
> This is achieved by checking the tail skb in the sk_error_queue:
> if this has room for one more cookie, the cookie from the
> current notification is added; else a new skb is added to the
> sk_error_queue. Every invocation of rds_rm_zerocopy_callback() will
> trigger a ->sk_error_report to notify the application.
> 
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
> v2:
>    - make sure to always sock_put m_rs even if there is no znotifier.
>    - major rewrite of notification, resulting in much simplification.
>
>   include/uapi/linux/errqueue.h |    2 +
>   net/rds/af_rds.c              |    2 +
>   net/rds/message.c             |   83 +++++++++++++++++++++++++++++++++++++---
>   net/rds/rds.h                 |   14 +++++++
>   net/rds/recv.c                |    2 +
>   5 files changed, 96 insertions(+), 7 deletions(-)
>
generic comment and please update it where it is applicable
in terms of variable names, notifiers etc.

RDS support true zero copy already with RDMA transport so some of
this code can easily get confused.

So I suggest something like below.
s/zerocopy/zeromsgcopy
s/zcopy/zmsgcopy
s/zcookie/zmsgcpycookie
s/znotifier/zmsgcpynotifier	


> diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
> index dc64cfa..28812ed 100644
> --- a/include/uapi/linux/errqueue.h
> +++ b/include/uapi/linux/errqueue.h
> @@ -20,11 +20,13 @@ struct sock_extended_err {
>   #define SO_EE_ORIGIN_ICMP6	3
>   #define SO_EE_ORIGIN_TXSTATUS	4
>   #define SO_EE_ORIGIN_ZEROCOPY	5
> +#define SO_EE_ORIGIN_ZCOOKIE	6
>   #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
>   
>   #define SO_EE_OFFENDER(ee)	((struct sockaddr*)((ee)+1))
>   
>   #define SO_EE_CODE_ZEROCOPY_COPIED	1
> +#define	SO_EE_ORIGIN_MAX_ZCOOKIES	8

This error change might need to go though other subsystem tree. May
be you can seperate it and also copy "linux-api@vger.kernel.org"

Otherwise changes looks fine to me.

Regards,
Santosh
Sowmini Varadhan Feb. 14, 2018, 7:01 p.m. UTC | #2
On (02/14/18 10:50), Santosh Shilimkar wrote:
> generic comment and please update it where it is applicable
> in terms of variable names, notifiers etc.
> 
> RDS support true zero copy already with RDMA transport so some of
> this code can easily get confused.
> 
> So I suggest something like below.
> s/zerocopy/zeromsgcopy
> s/zcopy/zmsgcopy
> s/zcookie/zmsgcpycookie
> s/znotifier/zmsgcpynotifier	

I'd like to hear some additional opinions from the list on this: 
the existing socket API for TCP etc.  already uses ZEROCOPY, and other
than extending variable names (and putting me at risk of violating the
"fit within 80 chars per line" requirement, leading to not-so-pretty
line wraps), I'm not seeing much value in this.

> This error change might need to go though other subsystem tree. May
> be you can seperate it and also copy "linux-api@vger.kernel.org"

sure I can do that, either  when I put out patch v3 or later today,
after others have had a chance to review this.

--Sowmini
David Miller Feb. 14, 2018, 7:02 p.m. UTC | #3
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Wed, 14 Feb 2018 14:01:10 -0500

> On (02/14/18 10:50), Santosh Shilimkar wrote:
>> generic comment and please update it where it is applicable
>> in terms of variable names, notifiers etc.
>> 
>> RDS support true zero copy already with RDMA transport so some of
>> this code can easily get confused.
>> 
>> So I suggest something like below.
>> s/zerocopy/zeromsgcopy
>> s/zcopy/zmsgcopy
>> s/zcookie/zmsgcpycookie
>> s/znotifier/zmsgcpynotifier	
> 
> I'd like to hear some additional opinions from the list on this: 
> the existing socket API for TCP etc.  already uses ZEROCOPY, and other
> than extending variable names (and putting me at risk of violating the
> "fit within 80 chars per line" requirement, leading to not-so-pretty
> line wraps), I'm not seeing much value in this.

I agree, this name change requires seems pointless.  Just keep the names
the way they are, thank you.
Santosh Shilimkar Feb. 14, 2018, 7:17 p.m. UTC | #4
On 2/14/2018 11:01 AM, Sowmini Varadhan wrote:
> On (02/14/18 10:50), Santosh Shilimkar wrote:
>> generic comment and please update it where it is applicable
>> in terms of variable names, notifiers etc.
>>
>> RDS support true zero copy already with RDMA transport so some of
>> this code can easily get confused.
>>
>> So I suggest something like below.
>> s/zerocopy/zeromsgcopy
>> s/zcopy/zmsgcopy
>> s/zcookie/zmsgcpycookie
>> s/znotifier/zmsgcpynotifier	
> 
> I'd like to hear some additional opinions from the list on this:
> the existing socket API for TCP etc.  already uses ZEROCOPY, and other
> than extending variable names (and putting me at risk of violating the
> "fit within 80 chars per line" requirement, leading to not-so-pretty
> line wraps), I'm not seeing much value in this.
> 
What you mean by not seeing value ?
It has a value for RDS code which already support ZERO copy via
RDMA transport. For TCP sockets in isolation, ZCOPY may be
alright. Patch under discussion is for RDS and hence the
comment.

Regards,
Santosh
Santosh Shilimkar Feb. 14, 2018, 9:10 p.m. UTC | #5
On 2/14/2018 11:02 AM, David Miller wrote:
> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> Date: Wed, 14 Feb 2018 14:01:10 -0500
> 
>> On (02/14/18 10:50), Santosh Shilimkar wrote:
>>> generic comment and please update it where it is applicable
>>> in terms of variable names, notifiers etc.
>>>
>>> RDS support true zero copy already with RDMA transport so some of
>>> this code can easily get confused.
>>>
>>> So I suggest something like below.
>>> s/zerocopy/zeromsgcopy
>>> s/zcopy/zmsgcopy
>>> s/zcookie/zmsgcpycookie
>>> s/znotifier/zmsgcpynotifier	
>>
>> I'd like to hear some additional opinions from the list on this:
>> the existing socket API for TCP etc.  already uses ZEROCOPY, and other
>> than extending variable names (and putting me at risk of violating the
>> "fit within 80 chars per line" requirement, leading to not-so-pretty
>> line wraps), I'm not seeing much value in this.
> 
> I agree, this name change requires seems pointless.  Just keep the names
> the way they are, thank you.
> 
Dave I understand your point for generic code and I was not all asking 
to rename anything in generic code.

My point was new code getting added to RDS to support this zero copy
message option. Within RDS, core code is common and all the 
infrastructure with it and it will be hard to follow the code if
that distinction is not maintained.

Ofcourse you have a last say on this list so if you want to
over rule the comment, I will step back :-)

Regards,
Santosh
Sowmini Varadhan Feb. 14, 2018, 9:25 p.m. UTC | #6
On (02/14/18 13:10), Santosh Shilimkar wrote:
> >>>RDS support true zero copy already with RDMA transport so some of
> >>>this code can easily get confused.

btw, another way to solve this is to have the RDMA code use the 
suffix "rdma" (which is what it really is) as needed.

--Sowmini
Santosh Shilimkar Feb. 14, 2018, 9:39 p.m. UTC | #7
On 2/14/2018 1:25 PM, Sowmini Varadhan wrote:
> On (02/14/18 13:10), Santosh Shilimkar wrote:
>>>>> RDS support true zero copy already with RDMA transport so some of
>>>>> this code can easily get confused.
> 
> btw, another way to solve this is to have the RDMA code use the
> suffix "rdma" (which is what it really is) as needed.
> 
And same breath,here zcopy is No message from user ;-)
ZCOPY is otherwise often tied with RDMA directly.

Renaming churns are not that useful and I definitely agree
with what Dave said if it was renaming change to the
existing code like what you are suggesting with 'rdma'

Anyways I don't want to contest this too much since I can
follow that code and know what each does and means :-)

The comment was long term readability perspective for
some one completely new reading the code and being able to
distinguish the different modes.

Regards,
Santosh
Santosh Shilimkar Feb. 14, 2018, 10:08 p.m. UTC | #8
On 2/14/2018 1:25 PM, Sowmini Varadhan wrote:
> On (02/14/18 13:10), Santosh Shilimkar wrote:
>>>>> RDS support true zero copy already with RDMA transport so some of
>>>>> this code can easily get confused.
> 
> btw, another way to solve this is to have the RDMA code use the
> suffix "rdma" (which is what it really is) as needed.
> 
And same breath,here zcopy is No message from user ;-)
ZCOPY is otherwise often tied with RDMA directly.

Renaming churns are not that useful and I definitely agree
with what Dave said if it was renaming change to the
existing code like what you are suggesting with 'rdma'

Anyways I don't want to contest this too much since I can
follow that code and know what each does and means :-)

The comment was long term readability perspective for
some one completely new reading the code and being able to
distinguish the different modes.

Regards,
Santosh
Willem de Bruijn Feb. 14, 2018, 10:26 p.m. UTC | #9
On Wed, Feb 14, 2018 at 1:50 PM, Santosh Shilimkar
<santosh.shilimkar@oracle.com> wrote:
> On 2/14/2018 2:28 AM, Sowmini Varadhan wrote:
>>
>> RDS removes a datagram (rds_message) from the retransmit queue when
>> an ACK is received. The ACK indicates that the receiver has queued
>> the RDS datagram, so that the sender can safely forget the datagram.
>> When all references to the rds_message are quiesced, rds_message_purge
>> is called to release resources used by the rds_message
>>
>> If the datagram to be removed had pinned pages set up, add
>> an entry to the rs->rs_znotify_queue so that the notifcation
>> will be sent up via rds_rm_zerocopy_callback() when the
>> rds_message is eventually freed by rds_message_purge.
>>
>> rds_rm_zerocopy_callback() attempts to batch the number of cookies
>> sent with each notification  to a max of SO_EE_ORIGIN_MAX_ZCOOKIES.
>> This is achieved by checking the tail skb in the sk_error_queue:
>> if this has room for one more cookie, the cookie from the
>> current notification is added; else a new skb is added to the
>> sk_error_queue. Every invocation of rds_rm_zerocopy_callback() will
>> trigger a ->sk_error_report to notify the application.
>>
>> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>> v2:
>>    - make sure to always sock_put m_rs even if there is no znotifier.
>>    - major rewrite of notification, resulting in much simplification.
>>
>>   include/uapi/linux/errqueue.h |    2 +
>>   net/rds/af_rds.c              |    2 +
>>   net/rds/message.c             |   83
>> +++++++++++++++++++++++++++++++++++++---
>>   net/rds/rds.h                 |   14 +++++++
>>   net/rds/recv.c                |    2 +
>>   5 files changed, 96 insertions(+), 7 deletions(-)
>>
> generic comment and please update it where it is applicable
> in terms of variable names, notifiers etc.
>
> RDS support true zero copy already with RDMA transport so some of
> this code can easily get confused.
>
> So I suggest something like below.
> s/zerocopy/zeromsgcopy
> s/zcopy/zmsgcopy
> s/zcookie/zmsgcpycookie
> s/znotifier/zmsgcpynotifier
>
>
>> diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
>> index dc64cfa..28812ed 100644
>> --- a/include/uapi/linux/errqueue.h
>> +++ b/include/uapi/linux/errqueue.h
>> @@ -20,11 +20,13 @@ struct sock_extended_err {
>>   #define SO_EE_ORIGIN_ICMP6    3
>>   #define SO_EE_ORIGIN_TXSTATUS 4
>>   #define SO_EE_ORIGIN_ZEROCOPY 5
>> +#define SO_EE_ORIGIN_ZCOOKIE   6
>>   #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
>>     #define SO_EE_OFFENDER(ee)  ((struct sockaddr*)((ee)+1))
>>     #define SO_EE_CODE_ZEROCOPY_COPIED  1
>> +#define        SO_EE_ORIGIN_MAX_ZCOOKIES       8
>
>
> This error change might need to go though other subsystem tree. May
> be you can seperate it and also copy "linux-api@vger.kernel.org"

Previous changes to this file also went in through net-next, so this is
fine. The error queue is a socket, so network, datastructure.

As for naming, zerocopy is unfortunately an overused term. I did not
help matters when adding msg_zerocopy. That said, let's try to keep
consistent naming across socket families, so no zmsg prefix only in
RDS.
Willem de Bruijn Feb. 15, 2018, 12:12 a.m. UTC | #10
On Wed, Feb 14, 2018 at 5:28 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> RDS removes a datagram (rds_message) from the retransmit queue when
> an ACK is received. The ACK indicates that the receiver has queued
> the RDS datagram, so that the sender can safely forget the datagram.
> When all references to the rds_message are quiesced, rds_message_purge
> is called to release resources used by the rds_message
>
> If the datagram to be removed had pinned pages set up, add
> an entry to the rs->rs_znotify_queue so that the notifcation
> will be sent up via rds_rm_zerocopy_callback() when the
> rds_message is eventually freed by rds_message_purge.
>
> rds_rm_zerocopy_callback() attempts to batch the number of cookies
> sent with each notification  to a max of SO_EE_ORIGIN_MAX_ZCOOKIES.
> This is achieved by checking the tail skb in the sk_error_queue:
> if this has room for one more cookie, the cookie from the
> current notification is added; else a new skb is added to the
> sk_error_queue. Every invocation of rds_rm_zerocopy_callback() will
> trigger a ->sk_error_report to notify the application.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---

> +static void rds_rm_zerocopy_callback(struct rds_sock *rs,
> +                                    struct rds_znotifier *znotif)
> +{
> +       struct sock *sk = rds_rs_to_sk(rs);
> +       struct sk_buff *skb, *tail;
> +       struct sock_exterr_skb *serr;
> +       unsigned long flags;
> +       struct sk_buff_head *q;
> +       u32 cookie = znotif->z_cookie;
> +
> +       q = &sk->sk_error_queue;
> +       spin_lock_irqsave(&q->lock, flags);
> +       tail = skb_peek_tail(q);
> +
> +       if (tail && skb_zcookie_add(tail, cookie)) {
> +               spin_unlock_irqrestore(&q->lock, flags);
> +               mm_unaccount_pinned_pages(&znotif->z_mmp);
> +               consume_skb(rds_skb_from_znotifier(znotif));
> +               sk->sk_error_report(sk);
> +               return;
> +       }
> +
> +       skb = rds_skb_from_znotifier(znotif);
> +       serr = SKB_EXT_ERR(skb);
> +       serr->ee.ee_errno = 0;
> +       serr->ee.ee_origin = SO_EE_ORIGIN_ZCOOKIE;
> +       serr->ee.ee_info = 0;
> +       serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED;

This resets selective fields in skb->cb[]. That array was previously used
as struct rds_znotifier, so may have non-zero state left over. Due to length
of serr->header it is likely that notifier and serr->ee do not overlap. But if
they do kernel data is leaked to userspace. And size of rds_znotifier may
change, so relying on that is fragile. Safer to clear serr->ee completely.
Santosh Shilimkar Feb. 15, 2018, 12:19 a.m. UTC | #11
On 2/14/2018 2:26 PM, Willem de Bruijn wrote:
> On Wed, Feb 14, 2018 at 1:50 PM, Santosh Shilimkar

[...]

>> This error change might need to go though other subsystem tree. May
>> be you can seperate it and also copy "linux-api@vger.kernel.org"
> 
> Previous changes to this file also went in through net-next, so this is
> fine. The error queue is a socket, so network, datastructure.
>
OK if that acceptable then there is no need to split the change.

> As for naming, zerocopy is unfortunately an overused term. I did not
> help matters when adding msg_zerocopy. 
:-)

> That said, let's try to keep
> consistent naming across socket families, so no zmsg prefix only in
> RDS.
> Fair enough.

Regards,
Santosh
Willem de Bruijn Feb. 15, 2018, 12:41 a.m. UTC | #12
On Wed, Feb 14, 2018 at 5:28 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> RDS removes a datagram (rds_message) from the retransmit queue when
> an ACK is received. The ACK indicates that the receiver has queued
> the RDS datagram, so that the sender can safely forget the datagram.
> When all references to the rds_message are quiesced, rds_message_purge
> is called to release resources used by the rds_message
>
> If the datagram to be removed had pinned pages set up, add
> an entry to the rs->rs_znotify_queue so that the notifcation
> will be sent up via rds_rm_zerocopy_callback() when the
> rds_message is eventually freed by rds_message_purge.
>
> rds_rm_zerocopy_callback() attempts to batch the number of cookies
> sent with each notification  to a max of SO_EE_ORIGIN_MAX_ZCOOKIES.
> This is achieved by checking the tail skb in the sk_error_queue:
> if this has room for one more cookie, the cookie from the
> current notification is added; else a new skb is added to the
> sk_error_queue. Every invocation of rds_rm_zerocopy_callback() will
> trigger a ->sk_error_report to notify the application.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---

> +static void rds_rm_zerocopy_callback(struct rds_sock *rs,
> +                                    struct rds_znotifier *znotif)
> +{
> +       struct sock *sk = rds_rs_to_sk(rs);
> +       struct sk_buff *skb, *tail;
> +       struct sock_exterr_skb *serr;
> +       unsigned long flags;
> +       struct sk_buff_head *q;
> +       u32 cookie = znotif->z_cookie;
> +
> +       q = &sk->sk_error_queue;
> +       spin_lock_irqsave(&q->lock, flags);
> +       tail = skb_peek_tail(q);
> +
> +       if (tail && skb_zcookie_add(tail, cookie)) {
> +               spin_unlock_irqrestore(&q->lock, flags);
> +               mm_unaccount_pinned_pages(&znotif->z_mmp);
> +               consume_skb(rds_skb_from_znotifier(znotif));
> +               sk->sk_error_report(sk);
> +               return;
> +       }
> +
> +       skb = rds_skb_from_znotifier(znotif);
> +       serr = SKB_EXT_ERR(skb);
> +       serr->ee.ee_errno = 0;
> +       serr->ee.ee_origin = SO_EE_ORIGIN_ZCOOKIE;
> +       serr->ee.ee_info = 0;
> +       serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED;

One more thing: this code notifies that the operation succeeded, but
the data was copied in the process. It does not have to be set otherwise.
Sowmini Varadhan Feb. 15, 2018, 12:03 p.m. UTC | #13
On (02/14/18 19:41), Willem de Bruijn wrote:
> 
> One more thing: this code notifies that the operation succeeded, but
> the data was copied in the process. It does not have to be set otherwise.

I see. this one was a bit confusing for me (hence the copy/paste) -
maybe because the TCP/UDP/PACKET case is a bit different from RDS,
and sendmsg may find that it has to switch from zcopy to bcopy 
after the sendmsg() itself has returned, but this cannot happen
for RDS (thus I should never set this code?).

Is it ok if I fix this in the next round or do you think
this warrants a V3?

--Sowmini
Willem de Bruijn Feb. 15, 2018, 4:35 p.m. UTC | #14
On Thu, Feb 15, 2018 at 7:03 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (02/14/18 19:41), Willem de Bruijn wrote:
>>
>> One more thing: this code notifies that the operation succeeded, but
>> the data was copied in the process. It does not have to be set otherwise.
>
> I see. this one was a bit confusing for me (hence the copy/paste) -
> maybe because the TCP/UDP/PACKET case is a bit different from RDS,
> and sendmsg may find that it has to switch from zcopy to bcopy
> after the sendmsg() itself has returned, but this cannot happen
> for RDS (thus I should never set this code?).

Agreed.

> Is it ok if I fix this in the next round or do you think
> this warrants a V3?

The flag is a hint, so on its own it does not require a respin.

There are by now four small items to patch up. I would respin so that the
patchset that is recorded can be read later as a single correct solution.

But either way works, I don't feel strongly.
Sowmini Varadhan Feb. 15, 2018, 4:43 p.m. UTC | #15
On (02/15/18 11:35), Willem de Bruijn wrote:
> 
> The flag is a hint, so on its own it does not require a respin.
> 
> There are by now four small items to patch up. I would respin so that the
> patchset that is recorded can be read later as a single correct solution.

Agreed, I'll genenrate V3 later today

--Sowmini
diff mbox series

Patch

diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index dc64cfa..28812ed 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -20,11 +20,13 @@  struct sock_extended_err {
 #define SO_EE_ORIGIN_ICMP6	3
 #define SO_EE_ORIGIN_TXSTATUS	4
 #define SO_EE_ORIGIN_ZEROCOPY	5
+#define SO_EE_ORIGIN_ZCOOKIE	6
 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
 
 #define SO_EE_OFFENDER(ee)	((struct sockaddr*)((ee)+1))
 
 #define SO_EE_CODE_ZEROCOPY_COPIED	1
+#define	SO_EE_ORIGIN_MAX_ZCOOKIES	8
 
 /**
  *	struct scm_timestamping - timestamps exposed through cmsg
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index b405f77..74a4bf8 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -183,6 +183,8 @@  static unsigned int rds_poll(struct file *file, struct socket *sock,
 		mask |= (POLLIN | POLLRDNORM);
 	if (rs->rs_snd_bytes < rds_sk_sndbuf(rs))
 		mask |= (POLLOUT | POLLWRNORM);
+	if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue))
+		mask |= POLLERR;
 	read_unlock_irqrestore(&rs->rs_recv_lock, flags);
 
 	/* clear state any time we wake a seen-congested socket */
diff --git a/net/rds/message.c b/net/rds/message.c
index ef3daaf..d874b74 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -33,6 +33,9 @@ 
 #include <linux/kernel.h>
 #include <linux/slab.h>
 #include <linux/export.h>
+#include <linux/skbuff.h>
+#include <linux/list.h>
+#include <linux/errqueue.h>
 
 #include "rds.h"
 
@@ -53,29 +56,95 @@  void rds_message_addref(struct rds_message *rm)
 }
 EXPORT_SYMBOL_GPL(rds_message_addref);
 
+static inline bool skb_zcookie_add(struct sk_buff *skb, u32 cookie)
+{
+	struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
+	int ncookies;
+	u32 *ptr;
+
+	if (serr->ee.ee_origin != SO_EE_ORIGIN_ZCOOKIE)
+		return false;
+	ncookies = serr->ee.ee_data;
+	if (ncookies == SO_EE_ORIGIN_MAX_ZCOOKIES)
+		return false;
+	ptr = skb_put(skb, sizeof(u32));
+	*ptr = cookie;
+	serr->ee.ee_data = ++ncookies;
+	return true;
+}
+
+static void rds_rm_zerocopy_callback(struct rds_sock *rs,
+				     struct rds_znotifier *znotif)
+{
+	struct sock *sk = rds_rs_to_sk(rs);
+	struct sk_buff *skb, *tail;
+	struct sock_exterr_skb *serr;
+	unsigned long flags;
+	struct sk_buff_head *q;
+	u32 cookie = znotif->z_cookie;
+
+	q = &sk->sk_error_queue;
+	spin_lock_irqsave(&q->lock, flags);
+	tail = skb_peek_tail(q);
+
+	if (tail && skb_zcookie_add(tail, cookie)) {
+		spin_unlock_irqrestore(&q->lock, flags);
+		mm_unaccount_pinned_pages(&znotif->z_mmp);
+		consume_skb(rds_skb_from_znotifier(znotif));
+		sk->sk_error_report(sk);
+		return;
+	}
+
+	skb = rds_skb_from_znotifier(znotif);
+	serr = SKB_EXT_ERR(skb);
+	serr->ee.ee_errno = 0;
+	serr->ee.ee_origin = SO_EE_ORIGIN_ZCOOKIE;
+	serr->ee.ee_info = 0;
+	serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED;
+	WARN_ON(!skb_zcookie_add(skb, cookie));
+
+	__skb_queue_tail(q, skb);
+
+	spin_unlock_irqrestore(&q->lock, flags);
+	sk->sk_error_report(sk);
+
+	mm_unaccount_pinned_pages(&znotif->z_mmp);
+}
+
 /*
  * This relies on dma_map_sg() not touching sg[].page during merging.
  */
 static void rds_message_purge(struct rds_message *rm)
 {
 	unsigned long i, flags;
+	bool zcopy = false;
 
 	if (unlikely(test_bit(RDS_MSG_PAGEVEC, &rm->m_flags)))
 		return;
 
-	for (i = 0; i < rm->data.op_nents; i++) {
-		rdsdebug("putting data page %p\n", (void *)sg_page(&rm->data.op_sg[i]));
-		/* XXX will have to put_page for page refs */
-		__free_page(sg_page(&rm->data.op_sg[i]));
-	}
-	rm->data.op_nents = 0;
 	spin_lock_irqsave(&rm->m_rs_lock, flags);
 	if (rm->m_rs) {
-		sock_put(rds_rs_to_sk(rm->m_rs));
+		struct rds_sock *rs = rm->m_rs;
+
+		if (rm->data.op_mmp_znotifier) {
+			zcopy = true;
+			rds_rm_zerocopy_callback(rs, rm->data.op_mmp_znotifier);
+			rm->data.op_mmp_znotifier = NULL;
+		}
+		sock_put(rds_rs_to_sk(rs));
 		rm->m_rs = NULL;
 	}
 	spin_unlock_irqrestore(&rm->m_rs_lock, flags);
 
+	for (i = 0; i < rm->data.op_nents; i++) {
+		/* XXX will have to put_page for page refs */
+		if (!zcopy)
+			__free_page(sg_page(&rm->data.op_sg[i]));
+		else
+			put_page(sg_page(&rm->data.op_sg[i]));
+	}
+	rm->data.op_nents = 0;
+
 	if (rm->rdma.op_active)
 		rds_rdma_free_op(&rm->rdma);
 	if (rm->rdma.op_rdma_mr)
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 374ae83..6e8fc4c 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -356,6 +356,19 @@  static inline u32 rds_rdma_cookie_offset(rds_rdma_cookie_t cookie)
 #define RDS_MSG_PAGEVEC		7
 #define RDS_MSG_FLUSH		8
 
+struct rds_znotifier {
+	struct list_head	z_list;
+	struct mmpin		z_mmp;
+	u32			z_cookie;
+};
+
+#define	RDS_ZCOPY_SKB(__skb)	((struct rds_znotifier *)&((__skb)->cb[0]))
+
+static inline struct sk_buff *rds_skb_from_znotifier(struct rds_znotifier *z)
+{
+	return container_of((void *)z, struct sk_buff, cb);
+}
+
 struct rds_message {
 	refcount_t		m_refcount;
 	struct list_head	m_sock_item;
@@ -436,6 +449,7 @@  struct rds_message {
 			unsigned int		op_count;
 			unsigned int		op_dmasg;
 			unsigned int		op_dmaoff;
+			struct rds_znotifier	*op_mmp_znotifier;
 			struct scatterlist	*op_sg;
 		} data;
 	};
diff --git a/net/rds/recv.c b/net/rds/recv.c
index b25bcfe..b080961 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -594,6 +594,8 @@  int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 
 	if (msg_flags & MSG_OOB)
 		goto out;
+	if (msg_flags & MSG_ERRQUEUE)
+		return sock_recv_errqueue(sk, msg, size, SOL_IP, IP_RECVERR);
 
 	while (1) {
 		/* If there are pending notifications, do those - and nothing else */