Message ID | 20090323093353.14253.76823.stgit@Decadence |
---|---|
State | Superseded, archived |
Delegated to: | David Miller |
Pablo Neira Ayuso wrote:
> This patch adds the NETLINK_NO_ENOBUFS socket flag. This flag can
> be used by unicast and broadcast listeners to avoid receiving
> ENOBUFS errors.
>
> Generally speaking, ENOBUFS errors are useful to notify two things
> to the listener:
>
> a) You may increase the receiver buffer size via setsockopt().
> b) You have lost messages, you may be out of sync.
>
> In some cases, ignoring ENOBUFS errors can be useful. For example:
>
> a) nfnetlink_queue: this subsystem does not have any sort of resync
>    method and you can decide to ignore ENOBUFS once you have set a
>    given buffer size.
>
> b) ctnetlink: you can use this together with the socket flag
>    NETLINK_BROADCAST_SEND_ERROR to stop getting ENOBUFS errors as
>    you do not need to resync (packets whose events are not delivered
>    are dropped to provide reliable logging and state-synchronization).
>
> Moreover, the use of NETLINK_NO_ENOBUFS also reduces a "go up, go down"
> effect in terms of performance which is due to the netlink congestion
> control when the listener cannot back off. The effect is the following:
>
> 1) throughput rate goes up and netlink messages are inserted in the
>    receiver buffer.
> 2) Then, netlink buffer fills and overruns (set on nlk->state bit 0).
> 3) While the listener empties the receiver buffer, netlink keeps
>    dropping messages. Thus, throughput goes dramatically down.
> 4) Then, once the listener has emptied the buffer (nlk->state
>    bit 0 is set off), goto step 1.

I agree that not having netlink drop new messages after congestion
might be useful. Two suggestions though:

- NETLINK_NO_CONGESTION_CONTROL seems a bit more descriptive than
  "NO_ENOBUFS"

- The ENOBUFS error itself is actually not the problem, but the
  congestion handling. It still makes sense to notify userspace
  of congestion. I'd suggest to deliver the error, but avoid setting
  the congestion bit.
-- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Patrick McHardy wrote:
> Pablo Neira Ayuso wrote:
>> This patch adds the NETLINK_NO_ENOBUFS socket flag. This flag can
>> be used by unicast and broadcast listeners to avoid receiving
>> ENOBUFS errors.
>>
>> Generally speaking, ENOBUFS errors are useful to notify two things
>> to the listener:
>>
>> a) You may increase the receiver buffer size via setsockopt().
>> b) You have lost messages, you may be out of sync.
>>
>> In some cases, ignoring ENOBUFS errors can be useful. For example:
>>
>> a) nfnetlink_queue: this subsystem does not have any sort of resync
>>    method and you can decide to ignore ENOBUFS once you have set a
>>    given buffer size.
>>
>> b) ctnetlink: you can use this together with the socket flag
>>    NETLINK_BROADCAST_SEND_ERROR to stop getting ENOBUFS errors as
>>    you do not need to resync (packets whose events are not delivered
>>    are dropped to provide reliable logging and state-synchronization).
>>
>> Moreover, the use of NETLINK_NO_ENOBUFS also reduces a "go up, go down"
>> effect in terms of performance which is due to the netlink congestion
>> control when the listener cannot back off. The effect is the following:
>>
>> 1) throughput rate goes up and netlink messages are inserted in the
>>    receiver buffer.
>> 2) Then, netlink buffer fills and overruns (set on nlk->state bit 0).
>> 3) While the listener empties the receiver buffer, netlink keeps
>>    dropping messages. Thus, throughput goes dramatically down.
>> 4) Then, once the listener has emptied the buffer (nlk->state
>>    bit 0 is set off), goto step 1.
>
> I agree that not having netlink drop new messages after congestion
> might be useful. Two suggestions though:
>
> - NETLINK_NO_CONGESTION_CONTROL seems a bit more descriptive than
>   "NO_ENOBUFS"
>
> - The ENOBUFS error itself is actually not the problem, but the
>   congestion handling. It still makes sense to notify userspace
>   of congestion. I'd suggest to deliver the error, but avoid setting
>   the congestion bit.

I thought about this choice but I see one problem with this. The ENOBUFS error is attached to the congestion control. If we keep reporting ENOBUFS errors to userspace with no congestion control, the listener may keep receiving ENOBUFS indefinitely. In other words, the congestion control seems to me like a way to avoid spamming ENOBUFS errors to userspace.
Pablo Neira Ayuso wrote:
> Patrick McHardy wrote:
>> - NETLINK_NO_CONGESTION_CONTROL seems a bit more descriptive than
>>   "NO_ENOBUFS"
>>
>> - The ENOBUFS error itself is actually not the problem, but the
>>   congestion handling. It still makes sense to notify userspace
>>   of congestion. I'd suggest to deliver the error, but avoid setting
>>   the congestion bit.
>
> I thought about this choice but I see one problem with this. The ENOBUFS
> error is attached to the congestion control.

What do you mean by "attached to"? Congestion control is done by
setting and testing bit 0 of nlk->state.

> If we keep reporting
> ENOBUFS errors to userspace with no congestion control, the listener may
> keep receiving ENOBUFS indefinitely. In other words, the congestion
> control seems to me like a way to avoid spamming ENOBUFS errors to
> userspace.

The error will be cleared by the next call to recvmsg().
Patrick McHardy wrote:
> Pablo Neira Ayuso wrote:
>> Patrick McHardy wrote:
>>> - NETLINK_NO_CONGESTION_CONTROL seems a bit more descriptive than
>>>   "NO_ENOBUFS"
>>>
>>> - The ENOBUFS error itself is actually not the problem, but the
>>>   congestion handling. It still makes sense to notify userspace
>>>   of congestion. I'd suggest to deliver the error, but avoid setting
>>>   the congestion bit.
>>
>> I thought about this choice but I see one problem with this. The ENOBUFS
>> error is attached to the congestion control.
>
> What do you mean by "attached to"? Congestion control is done by
> setting and testing bit 0 of nlk->state.

Yes, but once we set that bit to 1, we stop sending ENOBUFS to
userspace. So I think that congestion also applies to error reporting,
with "attached to" I meant "related" :).

>> If we keep reporting
>> ENOBUFS errors to userspace with no congestion control, the listener may
>> keep receiving ENOBUFS indefinitely. In other words, the congestion
>> control seems to me like a way to avoid spamming ENOBUFS errors to
>> userspace.
>
> The error will be cleared by the next call to recvmsg().

Yes, but think about this scenario:

1) We hit ENOBUFS, you call recvmsg() you get the error, and error is
   cleared.
2) You're going to call recvmsg() again but before doing so, we hit
   ENOBUFS again. So you call recvmsg() and you get the error again.

I think that this may lead to indefinitely getting ENOBUFS without
retrieving data under very heavy load.
Pablo Neira Ayuso wrote:
> Patrick McHardy wrote:
>> Pablo Neira Ayuso wrote:
>>> Patrick McHardy wrote:
>>>> - NETLINK_NO_CONGESTION_CONTROL seems a bit more descriptive than
>>>>   "NO_ENOBUFS"
>>>>
>>>> - The ENOBUFS error itself is actually not the problem, but the
>>>>   congestion handling. It still makes sense to notify userspace
>>>>   of congestion. I'd suggest to deliver the error, but avoid setting
>>>>   the congestion bit.
>>> I thought about this choice but I see one problem with this. The ENOBUFS
>>> error is attached to the congestion control.
>> What do you mean by "attached to"? Congestion control is done by
>> setting and testing bit 0 of nlk->state.
>
> Yes, but once we set that bit to 1, we stop sending ENOBUFS to
> userspace. So I think that congestion also applies to error reporting,
> with "attached to" I meant "related" :).

That's correct, there can only be a single outstanding error
at any time.

>>> If we keep reporting
>>> ENOBUFS errors to userspace with no congestion control, the listener may
>>> keep receiving ENOBUFS indefinitely. In other words, the congestion
>>> control seems to me like a way to avoid spamming ENOBUFS errors to
>>> userspace.
>> The error will be cleared by the next call to recvmsg().
>
> Yes, but think about this scenario:
>
> 1) We hit ENOBUFS, you call recvmsg() you get the error, and error is
>    cleared.
> 2) You're going to call recvmsg() again but before doing so, we hit
>    ENOBUFS again. So you call recvmsg() and you get the error again.
>
> I think that this may lead to indefinitely getting ENOBUFS without
> retrieving data under very heavy load.

I'm not sure that this would be a bad thing under the circumstances
you describe. We drop packets, we notify userspace.

I agree though that my proposed way isn't ideal either: since we
can't queue errors, they will be delivered sporadically (not
reflecting the true number of dropped messages), and without
stopping to queue new messages, it can't be determined at which
"position" the error occurred.

But I think some notification or other way to notice what's happening
is needed for userspace, otherwise it can neither report nor handle
this in any way.
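The two error-reporting policies debated above can be compared with a small userspace sketch in C. This is a toy model under stated assumptions, not kernel code: `struct toy_sock` and every `toy_*` function are hypothetical names invented for illustration, `congested` stands in for bit 0 of nlk->state, and `pending_err` stands in for sk->sk_err. The model never clears the congestion flag, which corresponds to a listener that has not yet drained its receive queue.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the socket fields discussed in the thread (hypothetical). */
struct toy_sock {
    bool congested;   /* models bit 0 of nlk->state */
    int  pending_err; /* models sk->sk_err: a single slot, errors cannot queue */
};

/* Existing behaviour: only the first overrun of a congestion episode
 * raises ENOBUFS; later overruns are silent while the bit stays set. */
static void toy_overrun_congestion(struct toy_sock *sk)
{
    if (!sk->congested) {
        sk->congested = true;
        sk->pending_err = ENOBUFS;
    }
}

/* Variant suggested in the thread: report the error on every overrun,
 * but leave the congestion bit alone. */
static void toy_overrun_report_only(struct toy_sock *sk)
{
    sk->pending_err = ENOBUFS;
}

/* Models the error path of recvmsg(): return the pending error, clear it. */
static int toy_recv_error(struct toy_sock *sk)
{
    int err = sk->pending_err;

    sk->pending_err = 0;
    return err;
}

/* The scenario from the thread: an overrun happens before each of three
 * recvmsg() calls. Returns how many of those calls saw ENOBUFS. */
static int toy_errors_seen(void (*overrun)(struct toy_sock *))
{
    struct toy_sock sk = { false, 0 };
    int seen = 0;

    assert(overrun != NULL);
    for (int i = 0; i < 3; i++) {
        overrun(&sk);
        if (toy_recv_error(&sk) == ENOBUFS)
            seen++;
    }
    return seen;
}
```

In this model, `toy_errors_seen(toy_overrun_congestion)` returns 1 and `toy_errors_seen(toy_overrun_report_only)` returns 3: with the congestion bit there is a single outstanding error per episode, while without it every recvmsg() in the scenario gets ENOBUFS again.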
Patrick McHardy wrote:
> But I think some notification or other way to notice what's happening
> is needed for userspace, otherwise it can neither report nor handle
> this in any way.

Hm, I see. I think that we can increase sk_drops like in the UDP code
when the NETLINK_NO_ENOBUFS flag is set. We can display it in the
netlink /proc entry. Would you be OK with this?
Pablo Neira Ayuso wrote:
> Patrick McHardy wrote:
>> But I think some notification or other way to notice what's happening
>> is needed for userspace, otherwise it can neither report nor handle
>> this in any way.
>
> Hm, I see. I think that we can increase sk_drops like in the UDP code
> when the NETLINK_NO_ENOBUFS flag is set. We can display it in the
> netlink /proc entry. Would you be OK with this?

Yes, something like that seems OK.
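The drop-counter idea agreed on above could look roughly like the following userspace sketch in C. It is only a model of the proposal as worded in the thread, not the code that was eventually submitted: `struct drop_sock` and `drop_overrun()` are hypothetical names, and the choice to count drops only when the flag is set follows the wording of the message (counting unconditionally would be another reasonable option).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the socket state; not kernel code. */
struct drop_sock {
    bool no_enobufs;        /* models NETLINK_RECV_NO_ENOBUFS in nlk->flags */
    bool congested;         /* models bit 0 of nlk->state */
    int sk_err;             /* models sk->sk_err */
    unsigned long sk_drops; /* models the per-socket drop counter */
};

/* Called whenever a message cannot be queued to the listener. */
static void drop_overrun(struct drop_sock *sk)
{
    assert(sk != NULL);

    if (sk->no_enobufs) {
        /* No error and no congestion bit: just account for the loss so
         * that userspace can still find out that messages were dropped. */
        sk->sk_drops++;
        return;
    }

    /* Default path, as in the posted patch: first overrun sets the
     * congestion bit and raises ENOBUFS. */
    if (!sk->congested) {
        sk->congested = true;
        sk->sk_err = ENOBUFS;
    }
}
```

With the flag set, two overruns leave `sk_err` at 0 and `sk_drops` at 2; with the flag clear, the first overrun sets `sk_err` to ENOBUFS and the counter is untouched.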
diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 1e6bf99..5ba398e 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -104,6 +104,7 @@ struct nlmsgerr
 #define NETLINK_DROP_MEMBERSHIP	2
 #define NETLINK_PKTINFO		3
 #define NETLINK_BROADCAST_ERROR	4
+#define NETLINK_NO_ENOBUFS	5
 
 struct nl_pktinfo
 {
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index dc93836..1c76c8e 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -86,6 +86,7 @@ struct netlink_sock {
 #define NETLINK_KERNEL_SOCKET	0x1
 #define NETLINK_RECV_PKTINFO	0x2
 #define NETLINK_BROADCAST_SEND_ERROR	0x4
+#define NETLINK_RECV_NO_ENOBUFS	0x8
 
 static inline struct netlink_sock *nlk_sk(struct sock *sk)
 {
@@ -717,9 +718,13 @@ static int netlink_getname(struct socket *sock, struct sockaddr *addr,
 
 static void netlink_overrun(struct sock *sk)
 {
-	if (!test_and_set_bit(0, &nlk_sk(sk)->state)) {
-		sk->sk_err = ENOBUFS;
-		sk->sk_error_report(sk);
+	struct netlink_sock *nlk = nlk_sk(sk);
+
+	if (!(nlk->flags & NETLINK_RECV_NO_ENOBUFS)) {
+		if (!test_and_set_bit(0, &nlk_sk(sk)->state)) {
+			sk->sk_err = ENOBUFS;
+			sk->sk_error_report(sk);
+		}
 	}
 }
 
@@ -1183,6 +1188,15 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
 			nlk->flags &= ~NETLINK_BROADCAST_SEND_ERROR;
 		err = 0;
 		break;
+	case NETLINK_NO_ENOBUFS:
+		if (val) {
+			nlk->flags |= NETLINK_RECV_NO_ENOBUFS;
+			clear_bit(0, &nlk->state);
+			wake_up_interruptible(&nlk->wait);
+		} else
+			nlk->flags &= ~NETLINK_RECV_NO_ENOBUFS;
+		err = 0;
+		break;
 	default:
 		err = -ENOPROTOOPT;
 	}
@@ -1225,6 +1239,16 @@ static int netlink_getsockopt(struct socket *sock, int level, int optname,
 			return -EFAULT;
 		err = 0;
 		break;
+	case NETLINK_NO_ENOBUFS:
+		if (len < sizeof(int))
+			return -EINVAL;
+		len = sizeof(int);
+		val = nlk->flags & NETLINK_RECV_NO_ENOBUFS ? 1 : 0;
+		if (put_user(len, optlen) ||
+		    put_user(val, optval))
+			return -EFAULT;
+		err = 0;
+		break;
 	default:
 		err = -ENOPROTOOPT;
 	}
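For illustration, a listener might toggle and query the option added by this diff roughly as follows. This is a minimal userspace sketch in C, assuming a Linux system whose headers already define NETLINK_NO_ENOBUFS; the helper names `netlink_set_no_enobufs()` and `netlink_get_no_enobufs()` are hypothetical, and the fallback definition of SOL_NETLINK is only there for C libraries that do not export it.

```c
#include <assert.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <unistd.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270 /* socket level for netlink options */
#endif

/* Hypothetical helper: enable (on != 0) or disable (on == 0) the flag.
 * Returns 0 on success, -1 with errno set on failure. */
static int netlink_set_no_enobufs(int fd, int on)
{
    assert(fd >= 0);
    return setsockopt(fd, SOL_NETLINK, NETLINK_NO_ENOBUFS, &on, sizeof(on));
}

/* Hypothetical helper: returns 1 if the flag is set, 0 if it is clear,
 * or -1 with errno set if the kernel rejects the query. */
static int netlink_get_no_enobufs(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    assert(fd >= 0);
    if (getsockopt(fd, SOL_NETLINK, NETLINK_NO_ENOBUFS, &val, &len) < 0)
        return -1;
    return val;
}
```

A caller would typically open the socket with `socket(AF_NETLINK, SOCK_RAW, ...)`, size the receive buffer with SO_RCVBUF, and then call `netlink_set_no_enobufs(fd, 1)`. On a kernel without this patch the call is expected to fail with ENOPROTOOPT, matching the `default:` branch shown in the diff.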
This patch adds the NETLINK_NO_ENOBUFS socket flag. This flag can
be used by unicast and broadcast listeners to avoid receiving
ENOBUFS errors.

Generally speaking, ENOBUFS errors are useful to notify two things
to the listener:

a) You may increase the receiver buffer size via setsockopt().
b) You have lost messages, you may be out of sync.

In some cases, ignoring ENOBUFS errors can be useful. For example:

a) nfnetlink_queue: this subsystem does not have any sort of resync
   method and you can decide to ignore ENOBUFS once you have set a
   given buffer size.

b) ctnetlink: you can use this together with the socket flag
   NETLINK_BROADCAST_SEND_ERROR to stop getting ENOBUFS errors as
   you do not need to resync (packets whose events are not delivered
   are dropped to provide reliable logging and state-synchronization).

Moreover, the use of NETLINK_NO_ENOBUFS also reduces a "go up, go down"
effect in terms of performance which is due to the netlink congestion
control when the listener cannot back off. The effect is the following:

1) throughput rate goes up and netlink messages are inserted in the
   receiver buffer.
2) Then, netlink buffer fills and overruns (set on nlk->state bit 0).
3) While the listener empties the receiver buffer, netlink keeps
   dropping messages. Thus, throughput goes dramatically down.
4) Then, once the listener has emptied the buffer (nlk->state
   bit 0 is set off), goto step 1.

This effect is easier to trigger with netlink broadcast under heavy
load, and it is more noticeable when using a big receiver buffer.
You can find some results in [1] that show this problem.

[1] http://1984.lsi.us.es/linux/netlink/

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netlink.h  |    1 +
 net/netlink/af_netlink.c |   30 +++++++++++++++++++++++++++---
 2 files changed, 28 insertions(+), 3 deletions(-)
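The four-step "go up, go down" cycle in the description can be reproduced with a small simulation in C. This is a toy model under loud assumptions, not a measurement and not kernel code: `sim_accepted_per_tick()` and its parameters are hypothetical, time advances in discrete ticks, the sender offers a fixed number of messages per tick, and the listener drains a fixed number per tick. It only illustrates the shape of the oscillation and makes no claim about the results in [1].

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Fill accepted[0..ticks-1] with the number of messages queued per tick.
 *
 * capacity           - receive buffer size, in messages
 * produce            - messages offered by the sender per tick
 * consume            - messages read by the listener per tick
 * congestion_control - true: after an overrun, drop everything until the
 *                      buffer is empty (models nlk->state bit 0);
 *                      false: drop only what does not fit
 */
static void sim_accepted_per_tick(int capacity, int produce, int consume,
                                  bool congestion_control,
                                  int ticks, int *accepted)
{
    int queued = 0;
    bool congested = false;

    assert(capacity > 0 && accepted != NULL);

    for (int t = 0; t < ticks; t++) {
        accepted[t] = 0;

        /* Steps 1 and 2: messages are queued until the buffer overruns. */
        for (int i = 0; i < produce; i++) {
            if (queued < capacity && !congested) {
                queued++;
                accepted[t]++;
            } else if (congestion_control) {
                congested = true; /* step 3: keep dropping from now on */
            }
        }

        /* The listener drains part of the buffer. */
        queued -= (queued < consume) ? queued : consume;

        /* Step 4: buffer empty, congestion bit cleared, back to step 1. */
        if (queued == 0)
            congested = false;
    }
}
```

With a capacity of 8, 4 messages offered and 2 consumed per tick, the first eight ticks accept 4, 4, 4, 2, 0, 0, 0, 4 messages with congestion control, and 4, 4, 4, 2, 2, 2, 2, 2 without it. The first series is the sawtooth described in steps 1 to 4; the second settles at the rate the listener can actually sustain.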