
[RFC] netlink broadcast return value

Message ID 4990C337.3040704@netfilter.org
State RFC, archived
Delegated to: David Miller

Commit Message

Pablo Neira Ayuso Feb. 9, 2009, 11:58 p.m. UTC
Patrick McHardy wrote:
> Pablo Neira Ayuso wrote:
>> Patrick McHardy wrote:
>>> We have at least one case where the caller wants to know of
>>> any successful delivery. Keymanager queries done by xfrm_state
>>> want to know whether an acquire was delivered to any keymanager.
>>> So we need to continue to indicate this, maybe using a different
>>> errno code than -ENOBUFS. I don't have a suggestion which one to
>>> use though.
>>
>> Indeed, I have missed that spot. I'm not very familiar with that code,
>> however, I see that the creation of a state depends on the netlink
>> broadcast return value, but how useful is that? I think that the state
>> should be created even if the broadcast fails, the userspace daemon
>> should request a resync to the kernel as soon as it hits ENOBUFS, then
>> it would be in sync again with that state.
> 
> The idea is that the kernel is performing an active query. I agree
> that there's nothing wrong with installing the SA and indicating the
> error to userspace. Userspace could dump the SADB and look for new
> larval states, however thats unlikely to be very useful since once
> an overflow occurs, you probably have a lot of states.

Other situations may also trigger overflows: a "slow" reader (for
example, one that spends time on other work instead of retrieving
messages) and a userspace process whose receive buffer is too small.
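
On the userspace side, the resync I have in mind would look roughly
like this (untested sketch; request_full_dump() and handle_event() are
just placeholders for whatever the daemon really does, not real library
calls):

/* untested sketch: recover from a netlink overrun by resyncing */
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

static void request_full_dump(int fd) { (void)fd; /* send a dump request here */ }
static void handle_event(const char *buf, ssize_t len) { (void)buf; (void)len; }

static void event_loop(int fd)
{
        char buf[8192];

        for (;;) {
                ssize_t len = recv(fd, buf, sizeof(buf), 0);

                if (len < 0 && errno == ENOBUFS) {
                        /* the kernel dropped events behind our back: our
                         * view of the kernel state is stale, resync it */
                        request_full_dump(fd);
                        continue;
                }
                if (len <= 0)
                        break;
                handle_event(buf, len);
        }
}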

> But unless I'm missing something, there's nothing wrong with this
> as long as the error is ignored. The fact that something was received
> by some listener doesn't have any meaning anyways, it might have
> been "ip monitor". Which somehow raises doubt about your proposed
> interface change though, I think anything that wants a reliable
> answer whether a packet was delivered to a process handling it
> appropriately should use unicast.

Don't get me wrong, I agree with you that all netlink_broadcast callers
in the kernel should ignore the return value...

... unless they have "some way" (like in Netfilter) to make event
delivery reliable: I have attached a patch that I didn't send you yet,
I'm still reviewing and testing it. It adds an entry to /proc to enable
reliable event delivery over netlink by dropping packets whose events
were not delivered, you mentioned that possibility once during one of
our conversations ;).

I'm aware that this option may be dangerous if used by a buggy
process that triggers frequent overflows, but that is the cost of having
reliable logging for ctnetlink (still, this behaviour is not enabled by
default!).

And I need this option to make conntrackd synchronize state-changes
appropriately under very heavy load: I've been testing the daemon with
these patches and it reliably synchronizes state-changes (my system was
100% busy filtering traffic while fully synchronizing all TCP
state-changes in near real-time, with a noticeable performance drop of
30% in terms of filtered connections).

Comments

Patrick McHardy Feb. 10, 2009, 1:50 p.m. UTC | #1
Pablo Neira Ayuso wrote:
> Patrick McHardy wrote:
>> Pablo Neira Ayuso wrote:
> 
>> But unless I'm missing something, there's nothing wrong with this
>> as long as the error is ignored. The fact that something was received
>> by some listener doesn't have any meaning anyways, it might have
>> been "ip monitor". Which somehow raises doubt about your proposed
>> interface change though, I think anything that wants a reliable
>> answer whether a packet was delivered to a process handling it
>> appropriately should use unicast.
> 
> Don't get me wrong, I agree with you that all netlink_broadcast callers
> in the kernel should ignore the return value...
> 
> ... unless they have "some way" (like in Netfilter) to make event
> delivery reliable: I have attached a patch that I didn't send you yet,
> I'm still reviewing and testing it. It adds an entry to /proc to enable
> reliable event delivery over netlink by dropping packets whose events
> were not delivered, you mentioned that possibility once during one of
> our conversations ;).

I know, but in the mean time I think its wrong :) The delivery
isn't reliable and what the admin is effectively expressing by
setting your sysctl is "I don't have any listeners besides the
synchronization daemon running". So it might as well use unicast.

> I'm aware that this option may be dangerous if used by a buggy
> process that triggers frequent overflows, but that is the cost of having
> reliable logging for ctnetlink (still, this behaviour is not enabled by
> default!).
> 
> And I need this option to make conntrackd synchronize state-changes
> appropriately under very heavy load: I've been testing the daemon with
> these patches and it reliably synchronizes state-changes (my system was
> 100% busy filtering traffic while fully synchronizing all TCP
> state-changes in near real-time, with a noticeable performance drop of
> 30% in terms of filtered connections).

So you're dropping the packet if you can't manage to synchronize.
Doesn't that defeat the entire purpose of synchronizing, which is
*increasing* reliability? :)
Pablo Neira Ayuso Feb. 10, 2009, 6:51 p.m. UTC | #2
Patrick McHardy wrote:
> Pablo Neira Ayuso wrote:
>> Patrick McHardy wrote:
>>> Pablo Neira Ayuso wrote:
>>
>>> But unless I'm missing something, there's nothing wrong with this
>>> as long as the error is ignored. The fact that something was received
>>> by some listener doesn't have any meaning anyways, it might have
>>> been "ip monitor". Which somehow raises doubt about your proposed
>>> interface change though, I think anything that wants a reliable
>>> answer whether a packet was delivered to a process handling it
>>> appropriately should use unicast.
>>
>> Don't get me wrong, I agree with you that all netlink_broadcast callers
>> in the kernel should ignore the return value...
>>
>> ... unless they have "some way" (like in Netfilter) to make event
>> delivery reliable: I have attached a patch that I didn't send you yet,
>> I'm still reviewing and testing it. It adds an entry to /proc to enable
>> reliable event delivery over netlink by dropping packets whose events
>> were not delivered, you mentioned that possibility once during one of
>> our conversations ;).
> 
> I know, but in the mean time I think its wrong :) The delivery
> isn't reliable and what the admin is effectively expressing by
> setting your sysctl is "I don't have any listeners besides the
> synchronization daemon running". So it might as well use unicast.

No :), this setting means "state-changes over ctnetlink will be reliable
at the cost of dropping packets (if needed)", it's an optional
trade-off. You may also have more listeners like a logging daemon
(ulogd), similarly this will be useful to ensure that ulogd doesn't leak
logging information which may happen under very heavy load. This option
is *not* only oriented to state-synchronization.

Using unicast would not do any different from broadcast as you may have
two listeners receiving state-changes from ctnetlink via unicast, so the
problem would be basically the same as above if you want reliable
state-change information at the cost of dropping packets.

BTW, the netlink_broadcast return value looked inconsistent to me before
the patch. It returned ENOBUFS if it could not clone the skb, but zero
when at least one message was delivered. How useful can this return
value be for the callers? I would expect behaviour similar to that of
netlink_unicast (reporting an EAGAIN error when it could not deliver the
message), even if most callers should ignore the return value as it is
of no help to them.
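
In pseudo-code, this is how I read the pre-patch behaviour (a paraphrase
only, not the literal af_netlink.c code; for_each_listener() and
deliver() are made up for the example):

/* paraphrase of the behaviour described above, not the real function */
int netlink_broadcast_as_described(struct sk_buff *skb)
{
        int delivered = 0;

        for_each_listener(sk) {
                struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);

                if (skb2 == NULL)
                        return -ENOBUFS;  /* could not clone the skb */
                if (deliver(sk, skb2) == 0)
                        delivered++;      /* per-socket overruns are not reported */
        }
        /* zero as soon as at least one listener got it, no matter who;
         * -ESRCH when nobody was listening at all, if I remember correctly */
        return delivered ? 0 : -ESRCH;
}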

>> I'm aware that this option may be dangerous if used by a buggy
>> process that triggers frequent overflows, but that is the cost of having
>> reliable logging for ctnetlink (still, this behaviour is not enabled by
>> default!).
>>
>> And I need this option to make conntrackd synchronize state-changes
>> appropriately under very heavy load: I've been testing the daemon with
>> these patches and it reliably synchronizes state-changes (my system was
>> 100% busy filtering traffic while fully synchronizing all TCP
>> state-changes in near real-time, with a noticeable performance drop of
>> 30% in terms of filtered connections).
> 
> So you're dropping the packet if you can't manage to synchronize.
> Doesn't that defeat the entire purpose of synchronizing, which is
> *increasing* reliability? :)

This reduces communications reliability a bit under very heavy load,
yes, because it may drop some packets but it adds reliable flow-based
logging accounting / state-synchronization in return. Both refer to
reliability in different contexts. In the end, it's a trade-off world.
There's some point at which you may want to choose which one you prefer,
reliable communications if the system is under heavy load or reliable
logging (no leaks in the logging) / state-synchronization (the backup
firewall is able to follow state-changes of the master under heavy load).

In my experiments, reaching 100% of CPU consumption, the number of
dropped packets was in fact very small, but the harm to logging and
state-synchronization reliability is considerable in the long run: the
backup starts getting unsynchronized (thus becoming useless for
increasing cluster reliability while still consuming resources) and, in
the case of logging, you also have to interpret the log information
keeping its margin of error in mind.

BTW, I did not mention it before, but I can give you access to my
testbed platform at any time, of course ;).
Patrick McHardy Feb. 11, 2009, 12:44 p.m. UTC | #3
Pablo Neira Ayuso wrote:
> Patrick McHardy wrote:
>> I know, but in the mean time I think its wrong :) The delivery
>> isn't reliable and what the admin is effectively expressing by
>> setting your sysctl is "I don't have any listeners besides the
>> synchronization daemon running". So it might as well use unicast.
> 
> No :), this setting means "state-changes over ctnetlink will be reliable
> at the cost of dropping packets (if needed)", it's an optional
> trade-off. You may also have more listeners like a logging daemon
> (ulogd), similarly this will be useful to ensure that ulogd doesn't leak
> logging information which may happen under very heavy load. This option
> is *not* only oriented to state-synchronization.

I'm aware of that. But you're adding a policy knob to control the
behaviour of a one-to-many interface based on what a single listener
(or maybe even two) want. Its not possible anymore to just listen to
events for debugging, since that might even lock you out. You also
can't use ulogd and say that you *don't* care whether every last state
change was delivered to it.

This seems very wrong to me. And I don't even see a reason to do
this since its easy to use unicast and per-listener state.

> Using unicast would not do any different from broadcast as you may have
> two listeners receiving state-changes from ctnetlink via unicast, so the
> problem would be basically the same as above if you want reliable
> state-change information at the cost of dropping packets.

Only the processes that actually care can specify this behaviour.
They're likely to have more CPU time, better adjusted receive
buffers etc than f.i. the conntrack tool when dumping events.

> BTW, the netlink_broadcast return value looked inconsistent to me before
> the patch. It returned ENOBUFS if it could not clone the skb, but zero
> when at least one message was delivered. How useful can this return
> value be for the callers? I would expect behaviour similar to that of
> netlink_unicast (reporting an EAGAIN error when it could not deliver the
> message), even if most callers should ignore the return value as it is
> of no help to them.

Its useless since you don't know who received it. It should return
void IMO.

>> So you're dropping the packet if you can't manage to synchronize.
>> Doesn't that defeat the entire purpose of synchronizing, which is
>> *increasing* reliability? :)
> 
> This reduces communications reliability a bit under very heavy load,
> yes, because it may drop some packets but it adds reliable flow-based
> logging accounting / state-synchronization in return. Both refer to
> reliability in different contexts. In the end, it's a trade-off world.
> There's some point at which you may want to choose which one you prefer,
> reliable communications if the system is under heavy load or reliable
> logging (no leaks in the logging) / state-synchronization (the backup
> firewall is able to follow state-changes of the master under heavy load).

Logging yes, but I can't see the point in perfect synchronization if
that leads to less throughput.
Pablo Neira Ayuso Feb. 11, 2009, 4:39 p.m. UTC | #4
First of all, sorry, this email is probably too long.

Patrick McHardy wrote:
> Pablo Neira Ayuso wrote:
>> Patrick McHardy wrote:
>>> I know, but in the mean time I think its wrong :) The delivery
>>> isn't reliable and what the admin is effectively expressing by
>>> setting your sysctl is "I don't have any listeners besides the
>>> synchronization daemon running". So it might as well use unicast.
>>
>> No :), this setting means "state-changes over ctnetlink will be reliable
>> at the cost of dropping packets (if needed)", it's an optional
>> trade-off. You may also have more listeners like a logging daemon
>> (ulogd), similarly this will be useful to ensure that ulogd doesn't leak
>> logging information which may happen under very heavy load. This option
>> is *not* only oriented to state-synchronization.
> 
> I'm aware of that. But you're adding a policy knob to control the
> behaviour of a one-to-many interface based on what a single listener
> (or maybe even two) want. Its not possible anymore to just listen to
> events for debugging, since that might even lock you out.

Can you think of one example where a ctnetlink listener would not find
reliable state-change reports useful? Still, this setting is optional
(it will be disabled by default) and, if turned on, you can disable it
for debugging purposes.

Thinking more about it, reliable logging and monitoring would even be
interesting in terms of security.

> You also
> can't use ulogd and say that you *don't* care whether every last state
> change was delivered to it.
> 
> This seems very wrong to me. And I don't even see a reason to do
> this since its easy to use unicast and per-listener state.

Netlink unicast would not be of any help either if you want reliable
state-change reporting via ctnetlink. If one process receives the event
and the other does not, you would also need to drop the packet to
perform reliable logging.

>> Using unicast would not do any different from broadcast as you may have
>> two listeners receiving state-changes from ctnetlink via unicast, so the
>> problem would be basically the same as above if you want reliable
>> state-change information at the cost of dropping packets.
> 
> Only the processes that actually care can specify this behaviour.

No, because this behaviour implies that the packet would be dropped if
the state-change is not delivered correctly to all. It has to be an
on/off behaviour for all listeners.

> They're likely to have more CPU time, better adjusted receive
> buffers etc than f.i. the conntrack tool when dumping events.
> 
>> BTW, the netlink_broadcast return value looked inconsistent to me before
>> the patch. It returned ENOBUFS if it could not clone the skb, but zero
>> when at least one message was delivered. How useful can this return
>> value be for the callers? I would expect behaviour similar to that of
>> netlink_unicast (reporting an EAGAIN error when it could not deliver the
>> message), even if most callers should ignore the return value as it is
>> of no help to them.
> 
> Its useless since you don't know who received it. It should return
> void IMO.
> 
>>> So you're dropping the packet if you can't manage to synchronize.
>>> Doesn't that defeat the entire purpose of synchronizing, which is
>>> *increasing* reliability? :)
>>
>> This reduces communications reliability a bit under very heavy load,
>> yes, because it may drop some packets but it adds reliable flow-based
>> logging accounting / state-synchronization in return. Both refer to
>> reliability in different contexts. In the end, it's a trade-off world.
>> There's some point at which you may want to choose which one you prefer,
>> reliable communications if the system is under heavy load or reliable
>> logging (no leaks in the logging) / state-synchronization (the backup
>> firewall is able to follow state-changes of the master under heavy load).
> 
> Logging yes, but I can't see the point in perfect synchronization if
> that leads to less throughput.

Indeed, (reactive) fault-tolerance force you to trade-off between the
synchronization degree and performance. Conntrackd is far from doing
"perfect synchronization", let me develop this idea a bit.

Perfect synchronization (or call it "synchronous" replication) indeed
implies *way* less performance. In the particular case of stateful
firewalls, synchronous replication means that each packet would have to
wait until one state-change is propagated to all backups.
Then, once the backups have confirmed that the state-change has been
propagated correctly, the packet continues its travel. Thus, packets
would be delayed and throughput would severely drop. This is what
fault-tolerant "erudite" people call a "correct fault-tolerant system"
since the status of the replication is known at any time and the backups
can successfully recover the stateful filtering at any time. However,
the cost in terms of performance is *crap*, of course :), think of the
delay in the packet delivery of that stateful firewall, like getting a
coke from the moon to be "correct".

It's clear that synchronous replication is not feasible in today's
Internet systems. So, let's consider asynchronous replication: in the
case of stateful firewalls, this means that the packet is not held until
the state-change is delivered; instead, the packet continues its travel
and the state-change event is delivered to the backups in a "do your
best" approach. This is indeed a trade-off: we relax replication in
exchange for better performance, to make fault-tolerant Internet systems
feasible. In return, the backups can only recover a subset of the
state-changes while others may be lost (think of long-standing
established TCP connections versus very short TCP connections: the first
sort can be recovered, the latter may not). Nevertheless, asynchronous
replication works fine in practice.
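
In scheme form (a toy illustration of the trade-off, not conntrackd
code; all the helpers are invented for the example):

/* toy comparison of synchronous vs. asynchronous replication */
#include <stdio.h>

static void send_state_change_to_backups(int id) { (void)id; /* best effort */ }
static void wait_for_all_backup_acks(void)       { /* may take "forever" */ }
static void forward_packet(int id)               { printf("packet %d forwarded\n", id); }

/* synchronous: the packet waits until every backup has confirmed the
 * state-change -> recovery is always possible, latency is terrible */
static void handle_packet_sync(int id)
{
        send_state_change_to_backups(id);
        wait_for_all_backup_acks();
        forward_packet(id);
}

/* asynchronous: the packet leaves immediately, the state-change is
 * propagated best-effort and may never reach the backups */
static void handle_packet_async(int id)
{
        forward_packet(id);
        send_state_change_to_backups(id);
}

int main(void)
{
        handle_packet_sync(1);
        handle_packet_async(2);
        return 0;
}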

But asynchronous replication may become useless for fault-tolerance if
the rate of state-changes is so high that the backup nodes cannot follow
the primary node. Going back to the problem: if Netlink cannot deliver
the state-change, the backup would not be able to recover the filtering
if the primary fails. At some point you have to set a boundary within
which you can ensure acceptable synchronization and performance, and if
that boundary is pushed in favour of one side, the other side suffers.

I would have to tell sysadmins that conntrackd becomes unreliable under
heavy load in full near real-time mode; that would be horrible! Instead,
with this option, I can tell them that selecting full near real-time
event-driven synchronization reduces performance.

BTW, conntrackd has a batch mode that relaxes synchronization *a lot*
(it sends to the backup nodes the states that have been living in the
kernel conntrack table for a given range of time, say, [10-20) seconds);
this is also possible. But, with the option that I'm proposing, we could
let the network designer choose the synchronization approach he/she
prefers according to the network requirements. That includes
understanding that he/she accepts a performance drop (which I have
measured at ~30% with full TCP state replication of very short
connections in event-driven near real-time fashion, which I think is
close to the worst case).
Patrick McHardy Feb. 11, 2009, 4:54 p.m. UTC | #5
Pablo Neira Ayuso wrote:
> First of all, sorry, this email is probably too long.

Indeed, I'm doing some trimming :)

> Patrick McHardy wrote:
>> I'm aware of that. But you're adding a policy knob to control the
>> behaviour of a one-to-many interface based on what a single listener
>> (or maybe even two) want. Its not possible anymore to just listen to
>> events for debugging, since that might even lock you out.
> 
> Can you think of one example where a ctnetlink listener would not find
> reliable state-change reports useful? Still, this setting is optional
> (it will be disabled by default) and, if turned on, you can disable it
> for debugging purposes.

As I already said, "conntrack -E" used for debugging. Nobody cares
whether it misses a few events instead of causing dropped packets.
Whether its on or not by default is secondary to being the right
thing at all.

> Thinking more about it, reliable logging and monitoring would even be
> interesting in terms of security.

I don't doubt that, I question the mechanism.

>> This seems very wrong to me. And I don't even see a reason to do
>> this since its easy to use unicast and per-listener state.
> 
> Netlink unicast would not be of any help either if you want reliable
> state-change reporting via ctnetlink. If one process receives the event
> and the other does not, you would also need to drop the packet to
> perform reliable logging.

Yes, and you don't need to if you don't want "reliable" logging.
The point is that you can choose per socket. You only drop if a socket
that really wants this doesn't get its copy.

>>> Using unicast would not do any different from broadcast as you may have
>>> two listeners receiving state-changes from ctnetlink via unicast, so the
>>> problem would be basically the same as above if you want reliable
>>> state-change information at the cost of dropping packets.

No, its not the same. ctsync sets big receive buffers and requests
"reliable" delivery, "conntrack -E" does nothing special and doesn't
care whether messages are dropped because its receive queue is too
small.

>> Only the processes that actually care can specify this behaviour.
> 
> No, because this behaviour implies that the packet would be dropped if
> the state-change is not delivered correctly to all. It has to be an
> on/off behaviour for all listeners.

You keep saying that, but its only the case because the way you
implemented it requires this. Why would ctsync care whether
conntrack -E missed a packet?

>> [...]
> I would have to tell sysadmins that conntrackd becomes unreliable under
> heavy load in full near real-time mode; that would be horrible! Instead,
> with this option, I can tell them that selecting full near real-time
> event-driven synchronization reduces performance.

Again, I'm not arguing about the option but about making it a
sysctl or something that affects all (ctnetlink) sockets whether
they care or not. You could even make it a per-broadcast listener
option, but the sysctl is effectively converting broadcast
operation to reliable unicast semantics and that seems wrong.
Pablo Neira Ayuso Feb. 11, 2009, 9:01 p.m. UTC | #6
I'm also trimming ;)

Patrick McHardy wrote:
> Pablo Neira Ayuso wrote:
>> Can you think of one example where a ctnetlink listener would not find
>> reliable state-change reports useful? Still, this setting is optional
>> (it will be disabled by default) and, if turned on, you can disable it
>> for debugging purposes.
> 
> As I already said, "conntrack -E" used for debugging. Nobody cares
> whether it misses a few events instead of causing dropped packets.
> Whether its on or not by default is secondary to being the right
> thing at all.

In particular, conntrack -E returns an error message when it hits
ENOBUFS, so it's a bad example. Indeed, I think that other programs in
userspace should do this if they don't know what to do with ENOBUFS;
otherwise, they should increase the buffer up to a reasonable limit (set
by the user) and then report that this limit has been reached, telling
the user that they have become unreliable (or, depending on the sysctl
value that I'm proposing, that they may drop packets).
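
Something along these lines on the userspace side (sketch only; the
ceiling is whatever limit the user configured):

/* sketch: on ENOBUFS, grow the receive buffer up to a user-set ceiling,
 * then warn that events may be missed from now on */
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

static int grow_rcvbuf(int fd, int *cur, int max)
{
        if (*cur >= max)
                return -1;                      /* user-set ceiling reached */
        *cur = (*cur * 2 > max) ? max : *cur * 2;
        /* a privileged daemon could use SO_RCVBUFFORCE to go past rmem_max */
        return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, cur, sizeof(*cur));
}

static void on_enobufs(int fd, int *rcvbuf, int rcvbuf_max)
{
        if (grow_rcvbuf(fd, rcvbuf, rcvbuf_max) < 0)
                fprintf(stderr, "listener is now unreliable: receive buffer "
                        "limit of %d bytes reached, events may be lost\n",
                        rcvbuf_max);
}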

And I think that there are tons of interfaces that userspace programs
can abuse to do the wrong thing.

>>>> Using unicast would not do any different from broadcast as you may have
>>>> two listeners receiving state-changes from ctnetlink via unicast, so
>>>> the
>>>> problem would be basically the same as above if you want reliable
>>>> state-change information at the cost of dropping packets.
> 
> No, its not the same. ctsync sets big receive buffers and requests
> "reliable" delivery, "conntrack -E" does nothing special and doesn't
> care whether messages are dropped because its receive queue is too
> small.

conntrack -E is a bad example but I get the point. This sysctl has to be
for all ctnetlink listeners.

>> I would have to tell sysadmins that conntrackd becomes unreliable under
>> heavy load in full near real-time mode; that would be horrible! Instead,
>> with this option, I can tell them that selecting full near real-time
>> event-driven synchronization reduces performance.
> 
> Again, I'm not arguing about the option but about making it a
> sysctl or something that affects all (ctnetlink) sockets whether
> they care or not. You could even make it a per-broadcast listener
> option, but the sysctl is effectively converting broadcast
> operation to reliable unicast semantics and that seems wrong.

And again, you point out that this should be per-socket, but how can
you make this option per-socket? The only way that I see to make
state-change reporting reliable is to drop the packet to force the peer
to retransmit it and trigger the same state-change again, and that
affects all ctnetlink listeners.
Patrick McHardy Feb. 12, 2009, 5:07 a.m. UTC | #7
Pablo Neira Ayuso wrote:
>>> Can you think of one example where a ctnetlink listener would not find
>>> reliable state-change reports useful? Still, this setting is optional
>>> (it will be disabled by default) and, if turned on, you can disable it
>>> for debugging purposes.
>> As I already said, "conntrack -E" used for debugging. Nobody cares
>> whether it misses a few events instead of causing dropped packets.
>> Whether its on or not by default is secondary to being the right
>> thing at all.
> 
> In particular, conntrack -E returns an error message when it hits
> ENOBUFS, so it's a bad example.

You're proposing to drop packets, I don't think an error message
after the fact makes up for that :)

> Indeed, I think that other programs in
> userspace should do this if they don't know what to do with ENOBUFS;
> otherwise, they should increase the buffer up to a reasonable limit (set
> by the user) and then report that this limit has been reached, telling
> the user that they have become unreliable (or, depending on the sysctl
> value that I'm proposing, that they may drop packets).
> 
> And I think that there are tons of interfaces that userspace programs
> can abuse to do the wrong thing.

Thats true, in this case the userspace program doesn't need to
do anything wrong though.

>>> I would have to tell sysadmins that conntrackd becomes unreliable under
>>> heavy load in full near real-time mode, that would be horrible!.
>>> Instead, with this option, I can tell them that, if they have selected
>>> full near real-time event-driven synchronization, that reduces
>>> performance.
>> Again, I'm not arguing about the option but about making it a
>> sysctl or something that affects all (ctnetlink) sockets whether
>> they care or not. You could even make it a per-broadcast listener
>> option, but the sysctl is effectively converting broadcast
>> operation to reliable unicast semantics and that seems wrong.
> 
> And again, you point out that this should be per-socket, but how can
> you make this option per-socket? The only way that I see to make
> state-change reporting reliable is to drop the packet to force the peer
> to retransmit it and trigger the same state-change again, and that
> affects all ctnetlink listeners.

For unicast its obviously simple, for broadcast you'd need something
like this:

err = 0;
for (all netlink sockets; sk && !err; ...) {
	skb = skb_clone(...)
	if (skb == NULL) {
		if (sk->flags & NETLINK_HIGHLY_RELIABLE)
			err = -ENOBUFS;
		continue;
	}
	...
}

So you're returning an error when at least one of the "reliable"
sockets doesn't get its delivery.

Patch

ctnetlink: optional packet drop to make event delivery reliable

From: Pablo Neira Ayuso <pablo@netfilter.org>

This patch adds a /proc entry to enable reliable ctnetlink event
delivery. The entry is located at:

/proc/sys/net/netfilter/nf_conntrack_netlink_broadcast_reliable

When this entry is != 0, ctnetlink drops the packet if the delivery of
an event over netlink fails. This patch is useful to provide reliable
state synchronization for conntrackd.
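
For example, with the patch applied:

  echo 1 > /proc/sys/net/netfilter/nf_conntrack_netlink_broadcast_reliable

enables the behaviour, and writing 0 restores the default (non-reliable)
event delivery.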

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---

 include/linux/netfilter/nfnetlink.h         |    4 +
 include/net/netfilter/nf_conntrack_core.h   |    6 +-
 include/net/netfilter/nf_conntrack_ecache.h |    2 -
 include/net/netns/conntrack.h               |    2 +
 net/netfilter/nf_conntrack_ecache.c         |   18 +++--
 net/netfilter/nf_conntrack_netlink.c        |  108 ++++++++++++++++++++++++++-
 net/netfilter/nfnetlink.c                   |   24 +++++-
 7 files changed, 146 insertions(+), 18 deletions(-)


diff --git a/include/linux/netfilter/nfnetlink.h b/include/linux/netfilter/nfnetlink.h
index 7d8e045..b89d5f3 100644
--- a/include/linux/netfilter/nfnetlink.h
+++ b/include/linux/netfilter/nfnetlink.h
@@ -74,8 +74,8 @@  extern int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n);
 extern int nfnetlink_subsys_unregister(const struct nfnetlink_subsystem *n);
 
 extern int nfnetlink_has_listeners(unsigned int group);
-extern int nfnetlink_send(struct sk_buff *skb, u32 pid, unsigned group, 
-			  int echo);
+extern int nfnetlink_notify(struct sk_buff *skb, u32 pid, unsigned group,
+			    int echo);
 extern int nfnetlink_unicast(struct sk_buff *skb, u_int32_t pid, int flags);
 
 extern void nfnl_lock(void);
diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index e78afe7..0c6826d 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -62,7 +62,11 @@  static inline int nf_conntrack_confirm(struct sk_buff *skb)
 	if (ct) {
 		if (!nf_ct_is_confirmed(ct) && !nf_ct_is_dying(ct))
 			ret = __nf_conntrack_confirm(skb);
-		nf_ct_deliver_cached_events(ct);
+		if (ret == NF_ACCEPT && nf_ct_deliver_cached_events(ct) < 0) {
+			struct net *net = nf_ct_net(ct);
+			NF_CT_STAT_INC_ATOMIC(net, drop);
+			return NF_DROP;
+		}
 	}
 	return ret;
 }
diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h
index 0ff0dc6..6e9e1f7 100644
--- a/include/net/netfilter/nf_conntrack_ecache.h
+++ b/include/net/netfilter/nf_conntrack_ecache.h
@@ -28,7 +28,7 @@  extern struct atomic_notifier_head nf_conntrack_chain;
 extern int nf_conntrack_register_notifier(struct notifier_block *nb);
 extern int nf_conntrack_unregister_notifier(struct notifier_block *nb);
 
-extern void nf_ct_deliver_cached_events(const struct nf_conn *ct);
+extern int nf_ct_deliver_cached_events(const struct nf_conn *ct);
 extern void __nf_ct_event_cache_init(struct nf_conn *ct);
 extern void nf_ct_event_cache_flush(struct net *net);
 
diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
index f4498a6..1ff61dd 100644
--- a/include/net/netns/conntrack.h
+++ b/include/net/netns/conntrack.h
@@ -20,9 +20,11 @@  struct netns_ct {
 	int			sysctl_acct;
 	int			sysctl_checksum;
 	unsigned int		sysctl_log_invalid; /* Log invalid packets */
+	int			sysctl_ctnetlink_event_reliable;
 #ifdef CONFIG_SYSCTL
 	struct ctl_table_header	*sysctl_header;
 	struct ctl_table_header	*acct_sysctl_header;
+	struct ctl_table_header *ctnetlink_sysctl_header;
 #endif
 	int			hash_vmalloc;
 	int			expect_vmalloc;
diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c
index dee4190..9c21269 100644
--- a/net/netfilter/nf_conntrack_ecache.c
+++ b/net/netfilter/nf_conntrack_ecache.c
@@ -31,9 +31,11 @@  EXPORT_SYMBOL_GPL(nf_ct_expect_chain);
 
 /* deliver cached events and clear cache entry - must be called with locally
  * disabled softirqs */
-static inline void
+static inline int
 __nf_ct_deliver_cached_events(struct nf_conntrack_ecache *ecache)
 {
+	int ret = 0;
+
 	if (nf_ct_is_confirmed(ecache->ct) && !nf_ct_is_dying(ecache->ct)
 	    && ecache->events) {
 		struct nf_ct_event item = {
@@ -42,28 +44,32 @@  __nf_ct_deliver_cached_events(struct nf_conntrack_ecache *ecache)
 			.report	= 0
 		};
 
-		atomic_notifier_call_chain(&nf_conntrack_chain,
-					   ecache->events,
-					   &item);
+		ret = atomic_notifier_call_chain(&nf_conntrack_chain,
+						 ecache->events,
+						 &item);
+		ret = notifier_to_errno(ret);
 	}
 
 	ecache->events = 0;
 	nf_ct_put(ecache->ct);
 	ecache->ct = NULL;
+	return ret;
 }
 
 /* Deliver all cached events for a particular conntrack. This is called
  * by code prior to async packet handling for freeing the skb */
-void nf_ct_deliver_cached_events(const struct nf_conn *ct)
+int nf_ct_deliver_cached_events(const struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
 	struct nf_conntrack_ecache *ecache;
+	int ret = 0;
 
 	local_bh_disable();
 	ecache = per_cpu_ptr(net->ct.ecache, raw_smp_processor_id());
 	if (ecache->ct == ct)
-		__nf_ct_deliver_cached_events(ecache);
+		ret = __nf_ct_deliver_cached_events(ecache);
 	local_bh_enable();
+	return ret;
 }
 EXPORT_SYMBOL_GPL(nf_ct_deliver_cached_events);
 
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 47c2f54..3e0ffb6 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -517,6 +517,8 @@  static int ctnetlink_conntrack_event(struct notifier_block *this,
 	unsigned int type;
 	sk_buff_data_t b;
 	unsigned int flags = 0, group;
+	struct net *net = nf_ct_net(ct);
+	int err;
 
 	/* ignore our fake conntrack entry */
 	if (ct == &nf_conntrack_untracked)
@@ -613,13 +615,20 @@  static int ctnetlink_conntrack_event(struct notifier_block *this,
 	rcu_read_unlock();
 
 	nlh->nlmsg_len = skb->tail - b;
-	nfnetlink_send(skb, item->pid, group, item->report);
+	err = nfnetlink_notify(skb, item->pid, group, item->report);
+	if (net->ct.sysctl_ctnetlink_event_reliable &&
+	    (err == -ENOBUFS || err == -EAGAIN))
+		return notifier_from_errno(err);
+
 	return NOTIFY_DONE;
 
 nla_put_failure:
 	rcu_read_unlock();
 nlmsg_failure:
 	kfree_skb(skb);
+	if (net->ct.sysctl_ctnetlink_event_reliable)
+		return notifier_from_errno(-ENOSPC);
+
 	return NOTIFY_DONE;
 }
 #endif /* CONFIG_NF_CONNTRACK_EVENTS */
@@ -1604,7 +1613,8 @@  static int ctnetlink_expect_event(struct notifier_block *this,
 	struct sk_buff *skb;
 	unsigned int type;
 	sk_buff_data_t b;
-	int flags = 0;
+	int flags = 0, err;
+	struct net *net = nf_ct_exp_net(exp);
 
 	if (events & IPEXP_NEW) {
 		type = IPCTNL_MSG_EXP_NEW;
@@ -1637,13 +1647,21 @@  static int ctnetlink_expect_event(struct notifier_block *this,
 	rcu_read_unlock();
 
 	nlh->nlmsg_len = skb->tail - b;
-	nfnetlink_send(skb, item->pid, NFNLGRP_CONNTRACK_EXP_NEW, item->report);
+	err = nfnetlink_notify(skb, item->pid, NFNLGRP_CONNTRACK_EXP_NEW,
+			       item->report);
+	if (net->ct.sysctl_ctnetlink_event_reliable &&
+	    (err == -ENOBUFS || err == -EAGAIN))
+		return notifier_from_errno(err);
+
 	return NOTIFY_DONE;
 
 nla_put_failure:
 	rcu_read_unlock();
 nlmsg_failure:
 	kfree_skb(skb);
+	if (net->ct.sysctl_ctnetlink_event_reliable)
+		return notifier_from_errno(-ENOSPC);
+
 	return NOTIFY_DONE;
 }
 #endif
@@ -2003,7 +2021,63 @@  MODULE_ALIAS("ip_conntrack_netlink");
 MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_CTNETLINK);
 MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_CTNETLINK_EXP);
 
-static int __init ctnetlink_init(void)
+#ifdef CONFIG_SYSCTL
+static struct ctl_table ctnetlink_sysctl_table[] = {
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "nf_conntrack_netlink_broadcast_reliable",
+		.data		= &init_net.ct.sysctl_ctnetlink_event_reliable,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{}
+};
+
+static int ctnetlink_init_sysctl(struct net *net)
+{
+	struct ctl_table *table;
+
+	table = kmemdup(ctnetlink_sysctl_table, sizeof(ctnetlink_sysctl_table),
+			GFP_KERNEL);
+	if (!table)
+		goto out;
+
+	table[0].data = &net->ct.sysctl_ctnetlink_event_reliable;
+
+	net->ct.ctnetlink_sysctl_header = register_net_sysctl_table(net,
+			nf_net_netfilter_sysctl_path, table);
+	if (!net->ct.ctnetlink_sysctl_header)
+		goto out_register;
+
+	return 0;
+
+out_register:
+	kfree(table);
+out:
+	return -ENOMEM;
+}
+
+static void ctnetlink_fini_sysctl(struct net *net)
+{
+	struct ctl_table *table;
+
+	table = net->ct.ctnetlink_sysctl_header->ctl_table_arg;
+	unregister_net_sysctl_table(net->ct.ctnetlink_sysctl_header);
+	kfree(table);
+}
+#else
+static int ctnetlink_init_sysctl(struct net *net)
+{
+	return 0;
+}
+
+static void ctnetlink_fini_sysctl(struct net *net)
+{
+}
+#endif /* CONFIG_SYSCTL */
+
+static int ctnetlink_net_init(struct net *net)
 {
 	int ret;
 
@@ -2033,10 +2107,18 @@  static int __init ctnetlink_init(void)
 		goto err_unreg_notifier;
 	}
 #endif
+	ret = ctnetlink_init_sysctl(net);
+	if (ret < 0) {
+		printk("ctnetlink_init: cannot register sysctl.\n");
+		goto err_unreg_exp_notifier;
+	}
+	net->ct.sysctl_ctnetlink_event_reliable = 0;
 
 	return 0;
 
 #ifdef CONFIG_NF_CONNTRACK_EVENTS
+err_unreg_exp_notifier:
+	nf_ct_expect_unregister_notifier(&ctnl_notifier_exp);
 err_unreg_notifier:
 	nf_conntrack_unregister_notifier(&ctnl_notifier);
 err_unreg_exp_subsys:
@@ -2048,7 +2130,7 @@  err_out:
 	return ret;
 }
 
-static void __exit ctnetlink_exit(void)
+static void ctnetlink_net_exit(struct net *net)
 {
 	printk("ctnetlink: unregistering from nfnetlink.\n");
 
@@ -2059,8 +2141,24 @@  static void __exit ctnetlink_exit(void)
 
 	nfnetlink_subsys_unregister(&ctnl_exp_subsys);
 	nfnetlink_subsys_unregister(&ctnl_subsys);
+	ctnetlink_fini_sysctl(net);
 	return;
 }
 
+static struct pernet_operations ctnetlink_net_ops = {
+	.init = ctnetlink_net_init,
+	.exit = ctnetlink_net_exit,
+};
+
+static int __init ctnetlink_init(void)
+{
+	return register_pernet_subsys(&ctnetlink_net_ops);
+}
+
+static void __exit ctnetlink_exit(void)
+{
+	unregister_pernet_subsys(&ctnetlink_net_ops);
+}
+
 module_init(ctnetlink_init);
 module_exit(ctnetlink_exit);
diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 9c0ba17..fd7bbf4 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -107,11 +107,29 @@  int nfnetlink_has_listeners(unsigned int group)
 }
 EXPORT_SYMBOL_GPL(nfnetlink_has_listeners);
 
-int nfnetlink_send(struct sk_buff *skb, u32 pid, unsigned group, int echo)
+/* like nlmsg_notify, but we return the multicast error */
+int nfnetlink_notify(struct sk_buff *skb, u32 pid, unsigned group, int report)
 {
-	return nlmsg_notify(nfnl, skb, pid, group, echo, gfp_any());
+	int err = 0, mcast_err = 0;
+
+	if (group) {
+		int exclude_pid = 0;
+
+		if (report) {
+			atomic_inc(&skb->users);
+			exclude_pid = pid;
+		}
+
+		mcast_err = nlmsg_multicast(nfnl, skb, exclude_pid,
+					    group, gfp_any());
+	}
+
+	if (report)
+		err = nlmsg_unicast(nfnl, skb, pid);
+
+	return mcast_err ? mcast_err : err;
 }
-EXPORT_SYMBOL_GPL(nfnetlink_send);
+EXPORT_SYMBOL_GPL(nfnetlink_notify);
 
 int nfnetlink_unicast(struct sk_buff *skb, u_int32_t pid, int flags)
 {