Message ID | 4990C337.3040704@netfilter.org |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
Pablo Neira Ayuso wrote: > Patrick McHardy wrote: >> Pablo Neira Ayuso wrote: > >> But unless I'm missing something, there's nothing wrong with this >> as long as the error is ignored. The fact that something was received >> by some listener doesn't have any meaning anyways, it might have >> been "ip monitor". Which somehow raises doubt about your proposed >> interface change though, I think anything that wants a reliable >> answer whether a packet was delivered to a process handling it >> appropriately should use unicast. > > Don't get me wrong, I agree with you that all netlink_broadcast callers > in the kernel should ignore the return value... > > ... unless they have "some way" (like in Netfilter) to make event > delivery reliable: I have attached a patch that I didn't send you yet, > I'm still reviewing and testing it. It adds an entry to /proc to enable > reliable event delivery over netlink by dropping packets whose events > were not delivered, you mentioned that possibility once during one of > our conversations ;). I know, but in the mean time I think its wrong :) The delivery isn't reliable and what the admin is effectively expressing by setting your sysctl is "I don't have any listeners besides the synchronization daemon running". So it might as well use unicast. > I'm aware of that this option may be dangerous if used by a buggy > process that trigger frequent overflows but it the cost of having > realible logging for ctnetlink (still, this behaviour is not the one by > default!). > > And I need this option to make conntrackd synchronize state-changes > appropriately under very heavy load: I've testing the daemon with these > patches and it reliably synchronizes state-changes (my system were 100% > busy filtering traffic and fully synchronizing all TCP state-changes in > near real-time effort, with a noticeable performance drop of 30% in > terms of filtered connections). So you're dropping the packet if you can't manage to synchronize. Doesn't that defeat the entire purpose of synchronizing, which is *increasing* reliability? :) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Patrick McHardy wrote: > Pablo Neira Ayuso wrote: >> Patrick McHardy wrote: >>> Pablo Neira Ayuso wrote: >> >>> But unless I'm missing something, there's nothing wrong with this >>> as long as the error is ignored. The fact that something was received >>> by some listener doesn't have any meaning anyways, it might have >>> been "ip monitor". Which somehow raises doubt about your proposed >>> interface change though, I think anything that wants a reliable >>> answer whether a packet was delivered to a process handling it >>> appropriately should use unicast. >> >> Don't get me wrong, I agree with you that all netlink_broadcast callers >> in the kernel should ignore the return value... >> >> ... unless they have "some way" (like in Netfilter) to make event >> delivery reliable: I have attached a patch that I didn't send you yet, >> I'm still reviewing and testing it. It adds an entry to /proc to enable >> reliable event delivery over netlink by dropping packets whose events >> were not delivered, you mentioned that possibility once during one of >> our conversations ;). > > I know, but in the mean time I think its wrong :) The delivery > isn't reliable and what the admin is effectively expressing by > setting your sysctl is "I don't have any listeners besides the > synchronization daemon running". So it might as well use unicast. No :), this setting means "state-changes over ctnetlink will be reliable at the cost of dropping packets (if needed)", it's an optional trade-off. You may also have more listeners like a logging daemon (ulogd), similarly this will be useful to ensure that ulogd doesn't leak logging information which may happen under very heavy load. This option is *not* only oriented to state-synchronization. Using unicast would not do any different from broadcast as you may have two listeners receiving state-changes from ctnetlink via unicast, so the problem would be basically the same as above if you want reliable state-change information at the cost of dropping packets. BTW, the netlink_broadcast return value looked to me inconsistent before the patch. It returned ENOBUFS if it could not clone the skb, but zero when at least one message was delivered. How useful can be this return value for the callers? I would expect to have a similar behaviour to the one of netlink_unicast (reporting EAGAIN error when it could not deliver the message), even if the return value for most callers should be ignored as it is not of any help. >> I'm aware of that this option may be dangerous if used by a buggy >> process that trigger frequent overflows but it the cost of having >> realible logging for ctnetlink (still, this behaviour is not the one by >> default!). >> >> And I need this option to make conntrackd synchronize state-changes >> appropriately under very heavy load: I've testing the daemon with these >> patches and it reliably synchronizes state-changes (my system were 100% >> busy filtering traffic and fully synchronizing all TCP state-changes in >> near real-time effort, with a noticeable performance drop of 30% in >> terms of filtered connections). > > So you're dropping the packet if you can't manage to synchronize. > Doesn't that defeat the entire purpose of synchronizing, which is > *increasing* reliability? :) This reduces communications reliability a bit under very heavy load, yes, because it may drop some packets but it adds reliable flow-based logging accounting / state-synchronization in return. Both refers to reliability in different contexts. In the end, it's a trade-off world. There's some point at which you may want to choose which one you prefer, reliable communications if the system is under heavy load or reliable logging (no leaks in the logging) / state-synchronization (the backup firewall is able to follow state-changes of the master under heavy load). In my experiments, reaching 100% of CPU consumption, the number of packets drop where in fact very few indeed, but the harm in logging and state-synchronization reliability is considerable in the long run, as the backup starts getting unsynchronized (thus, becoming useless to increase cluster reliability but consuming resources) and you also have to interpret log information without forgetting the margin of error in the case of logging. BTW, I did not tell you, I can give you access to my testbed platform at any time, of course ;).
Pablo Neira Ayuso wrote: > Patrick McHardy wrote: >> I know, but in the mean time I think its wrong :) The delivery >> isn't reliable and what the admin is effectively expressing by >> setting your sysctl is "I don't have any listeners besides the >> synchronization daemon running". So it might as well use unicast. > > No :), this setting means "state-changes over ctnetlink will be reliable > at the cost of dropping packets (if needed)", it's an optional > trade-off. You may also have more listeners like a logging daemon > (ulogd), similarly this will be useful to ensure that ulogd doesn't leak > logging information which may happen under very heavy load. This option > is *not* only oriented to state-synchronization. I'm aware of that. But you're adding a policy knob to control the behaviour of a one-to-many interface based on what a single listener (or maybe even two) want. Its not possible anymore to just listen to events for debugging, since that might even lock you out. You also can't use ulogd and say that you *don't* care whether every last state change was delivered to it. This seems very wrong to me. And I don't even see a reason to do this since its easy to use unicast and per-listener state. > Using unicast would not do any different from broadcast as you may have > two listeners receiving state-changes from ctnetlink via unicast, so the > problem would be basically the same as above if you want reliable > state-change information at the cost of dropping packets. Only the processes that actually care can specify this behaviour. They're likely to have more CPU time, better adjusted receive buffers etc than f.i. the conntrack tool when dumping events. > BTW, the netlink_broadcast return value looked to me inconsistent before > the patch. It returned ENOBUFS if it could not clone the skb, but zero > when at least one message was delivered. How useful can be this return > value for the callers? I would expect to have a similar behaviour to the > one of netlink_unicast (reporting EAGAIN error when it could not deliver > the message), even if the return value for most callers should be > ignored as it is not of any help. Its useless since you don't know how received it. It should return void IMO. >> So you're dropping the packet if you can't manage to synchronize. >> Doesn't that defeat the entire purpose of synchronizing, which is >> *increasing* reliability? :) > > This reduces communications reliability a bit under very heavy load, > yes, because it may drop some packets but it adds reliable flow-based > logging accounting / state-synchronization in return. Both refers to > reliability in different contexts. In the end, it's a trade-off world. > There's some point at which you may want to choose which one you prefer, > reliable communications if the system is under heavy load or reliable > logging (no leaks in the logging) / state-synchronization (the backup > firewall is able to follow state-changes of the master under heavy load). Logging yes, but I can't see the point in perfect synchronization if that leads to less throughput. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
First of all, sorry, this email is probably too long. Patrick McHardy wrote: > Pablo Neira Ayuso wrote: >> Patrick McHardy wrote: >>> I know, but in the mean time I think its wrong :) The delivery >>> isn't reliable and what the admin is effectively expressing by >>> setting your sysctl is "I don't have any listeners besides the >>> synchronization daemon running". So it might as well use unicast. >> >> No :), this setting means "state-changes over ctnetlink will be reliable >> at the cost of dropping packets (if needed)", it's an optional >> trade-off. You may also have more listeners like a logging daemon >> (ulogd), similarly this will be useful to ensure that ulogd doesn't leak >> logging information which may happen under very heavy load. This option >> is *not* only oriented to state-synchronization. > > I'm aware of that. But you're adding a policy knob to control the > behaviour of a one-to-many interface based on what a single listener > (or maybe even two) want. Its not possible anymore to just listen to > events for debugging, since that might even lock you out. Can you think of one example where one ctnetlink listener may not find useful reliable state-change reports? Still, this setting is optional (it will be disabled by default) and, if turned on, you can disable it for debugging purposes. Thinking more about it, reliable logging and monitoring would be even something interesting in terms of security. > You also > can't use ulogd and say that you *don't* care whether every last state > change was delivered to it. > > This seems very wrong to me. And I don't even see a reason to do > this since its easy to use unicast and per-listener state. Netlink unicast would not be of any help either if you want reliable state-change reporting via ctnetlink. If one process receives the event and the other does not, you would also need to drop the packet to perform reliable logging. >> Using unicast would not do any different from broadcast as you may have >> two listeners receiving state-changes from ctnetlink via unicast, so the >> problem would be basically the same as above if you want reliable >> state-change information at the cost of dropping packets. > > Only the processes that actually care can specify this behaviour. No, because this behaviour implies that the packet would be drop if the state-change is not delivered correctly to all. It has to be an on/off behaviour for all listeners. > They're likely to have more CPU time, better adjusted receive > buffers etc than f.i. the conntrack tool when dumping events. > >> BTW, the netlink_broadcast return value looked to me inconsistent before >> the patch. It returned ENOBUFS if it could not clone the skb, but zero >> when at least one message was delivered. How useful can be this return >> value for the callers? I would expect to have a similar behaviour to the >> one of netlink_unicast (reporting EAGAIN error when it could not deliver >> the message), even if the return value for most callers should be >> ignored as it is not of any help. > > Its useless since you don't know how received it. It should return > void IMO. > >>> So you're dropping the packet if you can't manage to synchronize. >>> Doesn't that defeat the entire purpose of synchronizing, which is >>> *increasing* reliability? :) >> >> This reduces communications reliability a bit under very heavy load, >> yes, because it may drop some packets but it adds reliable flow-based >> logging accounting / state-synchronization in return. Both refers to >> reliability in different contexts. In the end, it's a trade-off world. >> There's some point at which you may want to choose which one you prefer, >> reliable communications if the system is under heavy load or reliable >> logging (no leaks in the logging) / state-synchronization (the backup >> firewall is able to follow state-changes of the master under heavy load). > > Logging yes, but I can't see the point in perfect synchronization if > that leads to less throughput. Indeed, (reactive) fault-tolerance force you to trade-off between the synchronization degree and performance. Conntrackd is far from doing "perfect synchronization", let me develop this idea a bit. Perfect synchronization (or call it "synchronous" replication) indeed implies *way* less performance. In the particular case of stateful firewalls, synchronous replication means that each packet would have to wait until one state-change is propagated to all backups. Then, once the backups have confirmed that the state-change has been propagated correctly, the packet continues its travel. Thus, packets would be delayed and throughput would severely drop. This is what fault-tolerant "erudite" people call a "correct fault-tolerant system" since the status of the replication is known at any time and the backups can successfully recover the stateful filtering at any time. However, the cost in terms of performance is *crap*, of course :), think of the delay in the packet delivery of that stateful firewall, like getting a coke from the moon to be "correct". It's clear that synchronous replication is not feasible in today's Internet systems. So, let's consider asynchronous replication, in the case of stateful firewalls, this means that the packet is not kept until the state-change is delivered, instead the packet continues its travel and the state-change event is delivered to the backups in a "do your best" approach. This is indeed a trade-off, we relax replication by allowing better performance to make fault-tolerant Internet systems feasible. But, in return, the backups are ready to recover a sub-set of state-changes while others may not be recovered (think of long-standing TCP established connections and very short TCP connections, the first sort can be recovered, the latter may not). Nevertheless, asynchronous replication works fine in practise. But asynchronous replication may become useless to achieve fault-tolerance if the rate of state-changes is high enough not to allow backup nodes follow the primary node. Going back to the problem, if Netlink cannot deliver the state-change, the backup would be able to recover the filtering if the primary fails. At some point you have to set a boundary limit after which you can ensure an acceptable synchronization and performance, and if the boundary is overpassed from one side, the other gets harmed. I would have to tell sysadmins that conntrackd becomes unreliable under heavy load in full near real-time mode, that would be horrible!. Instead, with this option, I can tell them that, if they have selected full near real-time event-driven synchronization, that reduces performance. BTW, conntrackd has one batch mode that relaxes synchronization *a lot* (it sends to the backup nodes the states that have been living in the kernel conntrack table between a range of time, say, [10-20) seconds, this also is possible. But, with the option that I'm proposing, we could allow the network designer choose what synchronization approach it prefers according to the network requirements. That includes that he/she understands that he/she assumes a performance drop (which I have measured in ~30% less with full TCP state replication of very short connections in event-driven near real-time fashion, which I think that it is close to the worst case).
Pablo Neira Ayuso wrote: > First of all, sorry, this email is probably too long. Indeed, I'm doing some trimminng :) > Patrick McHardy wrote: >> I'm aware of that. But you're adding a policy knob to control the >> behaviour of a one-to-many interface based on what a single listener >> (or maybe even two) want. Its not possible anymore to just listen to >> events for debugging, since that might even lock you out. > > Can you think of one example where one ctnetlink listener may not find > useful reliable state-change reports? Still, this setting is optional > (it will be disabled by default) and, if turned on, you can disable it > for debugging purposes. As I already said, "conntrack -E" used for debugging. Nobody cares whether it misses a few events instead of causing dropped packets. Whether its on or not by default is secondary to being the right thing at all. > Thinking more about it, reliable logging and monitoring would be even > something interesting in terms of security. I don't doubt that, I question the mechanism. >> This seems very wrong to me. And I don't even see a reason to do >> this since its easy to use unicast and per-listener state. > > Netlink unicast would not be of any help either if you want reliable > state-change reporting via ctnetlink. If one process receives the event > and the other does not, you would also need to drop the packet to > perform reliable logging. Yes, and you don't need to if you don't want "reliable" logging. The point is that you can choose per socket. Only if a socket that really wants this doesn't get a copy you drop. >>> Using unicast would not do any different from broadcast as you may have >>> two listeners receiving state-changes from ctnetlink via unicast, so the >>> problem would be basically the same as above if you want reliable >>> state-change information at the cost of dropping packets. No, its not the same. ctsync sets big receive buffers and requests "reliable" delivery, "conntrack -E" does nothing special and doesn't care whether messages are dropped because its receive queue is too small. >> Only the processes that actually care can specify this behaviour. > > No, because this behaviour implies that the packet would be drop if the > state-change is not delivered correctly to all. It has to be an on/off > behaviour for all listeners. You keep saying that, but its only the case because the way you implemented it requires this. Why would ctsync care whether conntrack -E missed a packet? >> [...] > I would have to tell sysadmins that conntrackd becomes unreliable under > heavy load in full near real-time mode, that would be horrible!. > Instead, with this option, I can tell them that, if they have selected > full near real-time event-driven synchronization, that reduces performance. Again, I'm not arguing about the option but about making it a sysctl or something that affects all (ctnetlink) sockets whether they care or not. You could even make it a per-broadcast listener option, but the sysctl is effectively converting broadcast operation to reliable unicast semantics and that seems wrong. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
I'm also trimming ;) Patrick McHardy wrote: > Pablo Neira Ayuso wrote: >> Can you think of one example where one ctnetlink listener may not find >> useful reliable state-change reports? Still, this setting is optional >> (it will be disabled by default) and, if turned on, you can disable it >> for debugging purposes. > > As I already said, "conntrack -E" used for debugging. Nobody cares > whether it misses a few events instead of causing dropped packets. > Whether its on or not by default is secondary to being the right > thing at all. In particular, conntrack -E returns an error message when it hits ENOBUFS, so it's a bad example. Indeed, I think that other programs in userspace should do this if they don't know what to do with ENOBUFS, otherwise increase the buffer up to a reasonable limit (set by the user), and then report that this limit has been reached telling that they have become unreliable (or depending on the sysctl value that I'm proposing, tell that they may drop packets). And I think that there are tons of interfaces that userspace programs can abuse to do the wrong thing. >>>> Using unicast would not do any different from broadcast as you may have >>>> two listeners receiving state-changes from ctnetlink via unicast, so >>>> the >>>> problem would be basically the same as above if you want reliable >>>> state-change information at the cost of dropping packets. > > No, its not the same. ctsync sets big receive buffers and requests > "reliable" delivery, "conntrack -E" does nothing special and doesn't > care whether messages are dropped because its receive queue is too > small. conntrack -E is a bad example but I get the point. This sysctl has to be for all ctnetlink listeners. >> I would have to tell sysadmins that conntrackd becomes unreliable under >> heavy load in full near real-time mode, that would be horrible!. >> Instead, with this option, I can tell them that, if they have selected >> full near real-time event-driven synchronization, that reduces >> performance. > > Again, I'm not arguing about the option but about making it a > sysctl or something that affects all (ctnetlink) sockets whether > they care or not. You could even make it a per-broadcast listener > option, but the sysctl is effectively converting broadcast > operation to reliable unicast semantics and that seems wrong. And again, you point that this should be per-socket, but how can you make this option per-socket? The only way that I see to make state-change reporting reliable is to drop the packet to force the peer to retransmit the packet and trigger the same state-change, and that affect all ctnetlink listeners.
Pablo Neira Ayuso wrote: >>> Can you think of one example where one ctnetlink listener may not find >>> useful reliable state-change reports? Still, this setting is optional >>> (it will be disabled by default) and, if turned on, you can disable it >>> for debugging purposes. >> As I already said, "conntrack -E" used for debugging. Nobody cares >> whether it misses a few events instead of causing dropped packets. >> Whether its on or not by default is secondary to being the right >> thing at all. > > In particular, conntrack -E returns an error message when it hits > ENOBUFS, so it's a bad example. You're proposing to drop packets, I don't think an error message after the fact makes up for that :) > Indeed, I think that other programs in > userspace should do this if they don't know what to do with ENOBUFS, > otherwise increase the buffer up to a reasonable limit (set by the > user), and then report that this limit has been reached telling that > they have become unreliable (or depending on the sysctl value that I'm > proposing, tell that they may drop packets). > > And I think that there are tons of interfaces that userspace programs > can abuse to do the wrong thing. Thats true, in this case the userspace program doesn't need to do anything wrong though. >>> I would have to tell sysadmins that conntrackd becomes unreliable under >>> heavy load in full near real-time mode, that would be horrible!. >>> Instead, with this option, I can tell them that, if they have selected >>> full near real-time event-driven synchronization, that reduces >>> performance. >> Again, I'm not arguing about the option but about making it a >> sysctl or something that affects all (ctnetlink) sockets whether >> they care or not. You could even make it a per-broadcast listener >> option, but the sysctl is effectively converting broadcast >> operation to reliable unicast semantics and that seems wrong. > > And again, you point that this should be per-socket, but how can you > make this option per-socket? The only way that I see to make > state-change reporting reliable is to drop the packet to force the peer > to retransmit the packet and trigger the same state-change, and that > affect all ctnetlink listeners. For unicast its obviously simple, for broadcast you'd need something like this: err = 0; for (all netlink sockets; sk && !err; ...) { skb = skb_clone(...) if (skb == NULL) { if (sk->flags & NETLINK_HIGHLY_RELIABLE) err = -ENOBUFS; continue; } ... } So you're returning an error when at least one of the "reliable" sockets doesn't get its delivery. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
ctnetlink: optional packet drop to make event delivery reliable From: Pablo Neira Ayuso <pablo@netfilter.org> This patch adds /proc entry to enable reliable ctnetlink event delivery. The entry is located at: /proc/sys/net/netfilter/nf_conntrack_netlink_broadcast_reliable When this entry is != 0, ctnetlink drops the packet if the delivery of an event over netlink fails. This patch is useful to provide reliable state synchronization for conntrackd. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> --- include/linux/netfilter/nfnetlink.h | 4 + include/net/netfilter/nf_conntrack_core.h | 6 +- include/net/netfilter/nf_conntrack_ecache.h | 2 - include/net/netns/conntrack.h | 2 + net/netfilter/nf_conntrack_ecache.c | 18 +++-- net/netfilter/nf_conntrack_netlink.c | 108 ++++++++++++++++++++++++++- net/netfilter/nfnetlink.c | 24 +++++- 7 files changed, 146 insertions(+), 18 deletions(-) diff --git a/include/linux/netfilter/nfnetlink.h b/include/linux/netfilter/nfnetlink.h index 7d8e045..b89d5f3 100644 --- a/include/linux/netfilter/nfnetlink.h +++ b/include/linux/netfilter/nfnetlink.h @@ -74,8 +74,8 @@ extern int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n); extern int nfnetlink_subsys_unregister(const struct nfnetlink_subsystem *n); extern int nfnetlink_has_listeners(unsigned int group); -extern int nfnetlink_send(struct sk_buff *skb, u32 pid, unsigned group, - int echo); +extern int nfnetlink_notify(struct sk_buff *skb, u32 pid, unsigned group, + int echo); extern int nfnetlink_unicast(struct sk_buff *skb, u_int32_t pid, int flags); extern void nfnl_lock(void); diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h index e78afe7..0c6826d 100644 --- a/include/net/netfilter/nf_conntrack_core.h +++ b/include/net/netfilter/nf_conntrack_core.h @@ -62,7 +62,11 @@ static inline int nf_conntrack_confirm(struct sk_buff *skb) if (ct) { if (!nf_ct_is_confirmed(ct) && !nf_ct_is_dying(ct)) ret = __nf_conntrack_confirm(skb); - nf_ct_deliver_cached_events(ct); + if (ret == NF_ACCEPT && nf_ct_deliver_cached_events(ct) < 0) { + struct net *net = nf_ct_net(ct); + NF_CT_STAT_INC_ATOMIC(net, drop); + return NF_DROP; + } } return ret; } diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h index 0ff0dc6..6e9e1f7 100644 --- a/include/net/netfilter/nf_conntrack_ecache.h +++ b/include/net/netfilter/nf_conntrack_ecache.h @@ -28,7 +28,7 @@ extern struct atomic_notifier_head nf_conntrack_chain; extern int nf_conntrack_register_notifier(struct notifier_block *nb); extern int nf_conntrack_unregister_notifier(struct notifier_block *nb); -extern void nf_ct_deliver_cached_events(const struct nf_conn *ct); +extern int nf_ct_deliver_cached_events(const struct nf_conn *ct); extern void __nf_ct_event_cache_init(struct nf_conn *ct); extern void nf_ct_event_cache_flush(struct net *net); diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h index f4498a6..1ff61dd 100644 --- a/include/net/netns/conntrack.h +++ b/include/net/netns/conntrack.h @@ -20,9 +20,11 @@ struct netns_ct { int sysctl_acct; int sysctl_checksum; unsigned int sysctl_log_invalid; /* Log invalid packets */ + int sysctl_ctnetlink_event_reliable; #ifdef CONFIG_SYSCTL struct ctl_table_header *sysctl_header; struct ctl_table_header *acct_sysctl_header; + struct ctl_table_header *ctnetlink_sysctl_header; #endif int hash_vmalloc; int expect_vmalloc; diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c index dee4190..9c21269 100644 --- a/net/netfilter/nf_conntrack_ecache.c +++ b/net/netfilter/nf_conntrack_ecache.c @@ -31,9 +31,11 @@ EXPORT_SYMBOL_GPL(nf_ct_expect_chain); /* deliver cached events and clear cache entry - must be called with locally * disabled softirqs */ -static inline void +static inline int __nf_ct_deliver_cached_events(struct nf_conntrack_ecache *ecache) { + int ret = 0; + if (nf_ct_is_confirmed(ecache->ct) && !nf_ct_is_dying(ecache->ct) && ecache->events) { struct nf_ct_event item = { @@ -42,28 +44,32 @@ __nf_ct_deliver_cached_events(struct nf_conntrack_ecache *ecache) .report = 0 }; - atomic_notifier_call_chain(&nf_conntrack_chain, - ecache->events, - &item); + ret = atomic_notifier_call_chain(&nf_conntrack_chain, + ecache->events, + &item); + ret = notifier_to_errno(ret); } ecache->events = 0; nf_ct_put(ecache->ct); ecache->ct = NULL; + return ret; } /* Deliver all cached events for a particular conntrack. This is called * by code prior to async packet handling for freeing the skb */ -void nf_ct_deliver_cached_events(const struct nf_conn *ct) +int nf_ct_deliver_cached_events(const struct nf_conn *ct) { struct net *net = nf_ct_net(ct); struct nf_conntrack_ecache *ecache; + int ret = 0; local_bh_disable(); ecache = per_cpu_ptr(net->ct.ecache, raw_smp_processor_id()); if (ecache->ct == ct) - __nf_ct_deliver_cached_events(ecache); + ret = __nf_ct_deliver_cached_events(ecache); local_bh_enable(); + return ret; } EXPORT_SYMBOL_GPL(nf_ct_deliver_cached_events); diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 47c2f54..3e0ffb6 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -517,6 +517,8 @@ static int ctnetlink_conntrack_event(struct notifier_block *this, unsigned int type; sk_buff_data_t b; unsigned int flags = 0, group; + struct net *net = nf_ct_net(ct); + int err; /* ignore our fake conntrack entry */ if (ct == &nf_conntrack_untracked) @@ -613,13 +615,20 @@ static int ctnetlink_conntrack_event(struct notifier_block *this, rcu_read_unlock(); nlh->nlmsg_len = skb->tail - b; - nfnetlink_send(skb, item->pid, group, item->report); + err = nfnetlink_notify(skb, item->pid, group, item->report); + if (net->ct.sysctl_ctnetlink_event_reliable && + (err == -ENOBUFS || err == -EAGAIN)) + return notifier_from_errno(err); + return NOTIFY_DONE; nla_put_failure: rcu_read_unlock(); nlmsg_failure: kfree_skb(skb); + if (net->ct.sysctl_ctnetlink_event_reliable) + return notifier_from_errno(-ENOSPC); + return NOTIFY_DONE; } #endif /* CONFIG_NF_CONNTRACK_EVENTS */ @@ -1604,7 +1613,8 @@ static int ctnetlink_expect_event(struct notifier_block *this, struct sk_buff *skb; unsigned int type; sk_buff_data_t b; - int flags = 0; + int flags = 0, err; + struct net *net = nf_ct_exp_net(exp); if (events & IPEXP_NEW) { type = IPCTNL_MSG_EXP_NEW; @@ -1637,13 +1647,21 @@ static int ctnetlink_expect_event(struct notifier_block *this, rcu_read_unlock(); nlh->nlmsg_len = skb->tail - b; - nfnetlink_send(skb, item->pid, NFNLGRP_CONNTRACK_EXP_NEW, item->report); + err = nfnetlink_notify(skb, item->pid, NFNLGRP_CONNTRACK_EXP_NEW, + item->report); + if (net->ct.sysctl_ctnetlink_event_reliable && + (err == -ENOBUFS || err == -EAGAIN)) + return notifier_from_errno(err); + return NOTIFY_DONE; nla_put_failure: rcu_read_unlock(); nlmsg_failure: kfree_skb(skb); + if (net->ct.sysctl_ctnetlink_event_reliable) + return notifier_from_errno(-ENOSPC); + return NOTIFY_DONE; } #endif @@ -2003,7 +2021,63 @@ MODULE_ALIAS("ip_conntrack_netlink"); MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_CTNETLINK); MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_CTNETLINK_EXP); -static int __init ctnetlink_init(void) +#ifdef CONFIG_SYSCTL +static struct ctl_table ctnetlink_sysctl_table[] = { + { + .ctl_name = CTL_UNNUMBERED, + .procname = "nf_conntrack_netlink_broadcast_reliable", + .data = &init_net.ct.sysctl_ctnetlink_event_reliable, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + {} +}; + +static int ctnetlink_init_sysctl(struct net *net) +{ + struct ctl_table *table; + + table = kmemdup(ctnetlink_sysctl_table, sizeof(ctnetlink_sysctl_table), + GFP_KERNEL); + if (!table) + goto out; + + table[0].data = &net->ct.sysctl_ctnetlink_event_reliable; + + net->ct.ctnetlink_sysctl_header = register_net_sysctl_table(net, + nf_net_netfilter_sysctl_path, table); + if (!net->ct.ctnetlink_sysctl_header) + goto out_register; + + return 0; + +out_register: + kfree(table); +out: + return -ENOMEM; +} + +static void ctnetlink_fini_sysctl(struct net *net) +{ + struct ctl_table *table; + + table = net->ct.ctnetlink_sysctl_header->ctl_table_arg; + unregister_net_sysctl_table(net->ct.ctnetlink_sysctl_header); + kfree(table); +} +#else +static int ctnetlink_init_sysctl(struct net *net) +{ + return 0; +} + +static void ctnetlink_fini_sysctl(struct net *net) +{ +} +#endif /* CONFIG_SYSCTL */ + +static int ctnetlink_net_init(struct net *net) { int ret; @@ -2033,10 +2107,18 @@ static int __init ctnetlink_init(void) goto err_unreg_notifier; } #endif + ret = ctnetlink_init_sysctl(net); + if (ret < 0) { + printk("ctnetlink_init: cannot register sysctl.\n"); + goto err_unreg_exp_notifier; + } + net->ct.sysctl_ctnetlink_event_reliable = 0; return 0; #ifdef CONFIG_NF_CONNTRACK_EVENTS +err_unreg_exp_notifier: + nf_ct_expect_unregister_notifier(&ctnl_notifier_exp); err_unreg_notifier: nf_conntrack_unregister_notifier(&ctnl_notifier); err_unreg_exp_subsys: @@ -2048,7 +2130,7 @@ err_out: return ret; } -static void __exit ctnetlink_exit(void) +static void ctnetlink_net_exit(struct net *net) { printk("ctnetlink: unregistering from nfnetlink.\n"); @@ -2059,8 +2141,24 @@ static void __exit ctnetlink_exit(void) nfnetlink_subsys_unregister(&ctnl_exp_subsys); nfnetlink_subsys_unregister(&ctnl_subsys); + ctnetlink_fini_sysctl(net); return; } +static struct pernet_operations ctnetlink_net_ops = { + .init = ctnetlink_net_init, + .exit = ctnetlink_net_exit, +}; + +static int __init ctnetlink_init(void) +{ + return register_pernet_subsys(&ctnetlink_net_ops); +} + +static void __exit ctnetlink_exit(void) +{ + unregister_pernet_subsys(&ctnetlink_net_ops); +} + module_init(ctnetlink_init); module_exit(ctnetlink_exit); diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c index 9c0ba17..fd7bbf4 100644 --- a/net/netfilter/nfnetlink.c +++ b/net/netfilter/nfnetlink.c @@ -107,11 +107,29 @@ int nfnetlink_has_listeners(unsigned int group) } EXPORT_SYMBOL_GPL(nfnetlink_has_listeners); -int nfnetlink_send(struct sk_buff *skb, u32 pid, unsigned group, int echo) +/* like nlmsg_notify, but we return the multicast error */ +int nfnetlink_notify(struct sk_buff *skb, u32 pid, unsigned group, int report) { - return nlmsg_notify(nfnl, skb, pid, group, echo, gfp_any()); + int err = 0, mcast_err = 0; + + if (group) { + int exclude_pid = 0; + + if (report) { + atomic_inc(&skb->users); + exclude_pid = pid; + } + + mcast_err = nlmsg_multicast(nfnl, skb, exclude_pid, + group, gfp_any()); + } + + if (report) + err = nlmsg_unicast(nfnl, skb, pid); + + return mcast_err ? mcast_err : err; } -EXPORT_SYMBOL_GPL(nfnetlink_send); +EXPORT_SYMBOL_GPL(nfnetlink_notify); int nfnetlink_unicast(struct sk_buff *skb, u_int32_t pid, int flags) {