diff mbox

[1/3,nfnetlink_acct] numerous changes and improvements to the kernel code

Message ID 514D9D45.6090804@googlemail.com
State RFC
Headers show

Commit Message

Michael Zintakis March 23, 2013, 12:17 p.m. UTC
The following is a first patch of a series of 3 patches dealing with the
following kernel changes to nfnetlink_acct:

* fmt and bthr (format and bytes threshold) properties have been added to
  the nfacct object.

* ability to change all nfacct object properties (with the exception of
  name) has been added.

* as a result of the above, a full save/restore is now possible, even if
  the accounting object is in use by iptables.

Signed-off-by: Michael Zintakis <michael.zintakis@googlemail.com>
---
 include/uapi/linux/netfilter/nfnetlink_acct.h |    2 +
 net/netfilter/nfnetlink_acct.c                |   63 ++++++++++++++++++++++++-
 2 files changed, 64 insertions(+), 1 deletion(-)



--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Pablo Neira Ayuso March 23, 2013, 3:12 p.m. UTC | #1
Hi Michael,

On Sat, Mar 23, 2013 at 12:17:09PM +0000, Michael Zintakis wrote:
> The following is a first patch of a series of 3 patches dealing with the
> following kernel changes to nfnetlink_acct:
> 
> * fmt and bthr (format and bytes threshold) properties have been added to
>   the nfacct object.
> 
> * ability to change all nfacct object properties (with the exception of
>   name) has been added.
> 
> * as a result of the above, a full save/restore is now possible, even if
>   the accounting object is in use by iptables.
> 
> Signed-off-by: Michael Zintakis <michael.zintakis@googlemail.com>
> ---
>  include/uapi/linux/netfilter/nfnetlink_acct.h |    2 +
>  net/netfilter/nfnetlink_acct.c                |   63 ++++++++++++++++++++++++-
>  2 files changed, 64 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/netfilter/nfnetlink_acct.h b/include/uapi/linux/netfilter/nfnetlink_acct.h
> index c7b6269..f07e825 100644
> --- a/include/uapi/linux/netfilter/nfnetlink_acct.h
> +++ b/include/uapi/linux/netfilter/nfnetlink_acct.h
> @@ -18,6 +18,8 @@ enum nfnl_acct_type {
>  	NFACCT_NAME,
>  	NFACCT_PKTS,
>  	NFACCT_BYTES,
> +	NFACCT_BTHR,
> +	NFACCT_FMT,
>  	NFACCT_USE,
>  	__NFACCT_MAX
>  };
> diff --git a/net/netfilter/nfnetlink_acct.c b/net/netfilter/nfnetlink_acct.c
> index 589d686..bcd4ae8 100644
> --- a/net/netfilter/nfnetlink_acct.c
> +++ b/net/netfilter/nfnetlink_acct.c
> @@ -32,6 +32,8 @@ static LIST_HEAD(nfnl_acct_list);
>  struct nf_acct {
>  	atomic64_t		pkts;
>  	atomic64_t		bytes;
> +	atomic64_t		bthr;
> +	atomic_t		fmt;

These two new fields are meaningless to the kernel and they consume
extra memory for other people that may not want to use these new
features.

Instead of this, you can have a /etc/nfacct.conf file that contains
the formats and thresholds:

name "ALL 27 net" {
        pkts GiB
        bytes TiB
        threshold 6TiB
}

name "ALL misc" {
        bytes GiB
}

...

and so on. You can add new options for the `nfacct add' command so
this formats and thresholds are automatically appended to the
configuration file.

I can help you by making a little parser to read the file and put that
formatting information into a list or hashtable. Thus, you can edit
the format and thresholds by modifying the configuration file, without
the need for interactions with the kernel.

BTW, atomic is not required for those two fields, this is protected by
the nfnl_lock.

>  	struct list_head	head;
>  	atomic_t		refcnt;
>  	char			name[NFACCT_NAME_MAX];
> @@ -63,9 +65,55 @@ nfnl_acct_new(struct sock *nfnl, struct sk_buff *skb,
>  
>  	if (matching) {
>  		if (nlh->nlmsg_flags & NLM_F_REPLACE) {
> -			/* reset counters if you request a replacement. */
> +			/* reset counters if you request a replacement */
> +			if (!tb[NFACCT_PKTS]) {
> +				/*
> +				 * Prevent resetting the packets counter if
> +				 * either fmt or bthr are specified.
> +				 *
> +				 * This is done for backward compatibility,
> +				 * otherwise resetting these counters should
> +				 * only be allowed when tb[NFACCT_PKTS] is
> +				 * explicitly specified and == 0.
> +				 *
> +				 */
> +				if (!tb[NFACCT_FMT] &&
> +				    !tb[NFACCT_BTHR]) {
>  			atomic64_set(&matching->pkts, 0);
> +				}
> +			} else {
> +				atomic64_set(&matching->pkts,
> +				be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS])));

The replacement operation is not so easy. Note that you may hit
inconsistencies if while replacing the packet counter, the kernel
updates the byte counter, and then you replace the byte counter. You
would be leaking bytes and packets.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Zintakis March 26, 2013, 8:24 p.m. UTC | #2
Hello Pablo,


My apologies for this late reply...

>>  struct nf_acct {
>>  	atomic64_t		pkts;
>>  	atomic64_t		bytes;
>> +	atomic64_t		bthr;
>> +	atomic_t		fmt;
> 
> These two new fields are meaningless to the kernel and they consume
> extra memory for other people that may not want to use these new
> features.
> 
> Instead of this, you can have a /etc/nfacct.conf file that contains
> the formats and thresholds:
> 
> name "ALL 27 net" {
>         pkts GiB
>         bytes TiB
>         threshold 6TiB
> }
> 
> name "ALL misc" {
>         bytes GiB
> }
> 
> ...
> 
> and so on. You can add new options for the `nfacct add' command so
> this formats and thresholds are automatically appended to the
> configuration file.
> 
> I can help you by making a little parser to read the file and put that
> formatting information into a list or hashtable. Thus, you can edit
> the format and thresholds by modifying the configuration file, without
> the need for interactions with the kernel.
> 
> BTW, atomic is not required for those two fields, this is protected by
> the nfnl_lock.
I disagree with you entirely.

The two fields above aren't "meaningless" - they have a purpose and are an integral part of the nfacct object. None of the fields in the nfacct object are really used in the operation of the kernel itself - the kernel merely updates these. To put it bluntly, the whole nfacct struct is used as storage and the kernel don't use any of these fields in its core operation. OK, I agree that packet and byte counters are going to be updated much more frequently by the kernel than the fmt and bthr variables, but the fact is that these, strictly speaking, do not dependent for the kernel to operate. What I've done above is not a unique approach - in the kernel code there are a lot of examples like this, it is not something new and we are not re-inventing the wheel here or breaking a new ground.

Another reason for having fmt and bthr as part of the kernel nfacct struct is that this data, alongside the bytes and packet counters, is shielded and protected (in ring0), its integrity guaranteed and can only be changed via the kernel itself and nothing else. This is very important for us.

In the early stages of the nfacct changes, we considered a very similar approach to what you have proposed above, but rejected it, not least because the integrity and security of the data encapsulated in the nfacct struct cannot be guaranteed otherwise. We will rely heavily on this and don't need it to be changed arbitrarily (either by "mistake" or intentionally), so we decided that this is the best way going forward.

In stage 2 (current plans are this to be completed by the end of q2 of this year) we will extend this and create a specifically designed daemon (nfacctd), which will "talk" directly to the kernel and retrieve accounting data (much like what nfacct executable currently does, but on a bigger scale) on-demand and in real time and then send it to a central point where such data will be collected from all other machines for statistics and analysis. That daemon can't be spending extra cpu cycles parsing files that may or may not exist or check whether the data in them is or isn't reliable (whether it is "changed" on purpose or not) - that really isn't good enough.

>>  	if (matching) {
>>  		if (nlh->nlmsg_flags & NLM_F_REPLACE) {
>> -			/* reset counters if you request a replacement. */
>> +			/* reset counters if you request a replacement */
>> +			if (!tb[NFACCT_PKTS]) {
>> +				/*
>> +				 * Prevent resetting the packets counter if
>> +				 * either fmt or bthr are specified.
>> +				 *
>> +				 * This is done for backward compatibility,
>> +				 * otherwise resetting these counters should
>> +				 * only be allowed when tb[NFACCT_PKTS] is
>> +				 * explicitly specified and == 0.
>> +				 *
>> +				 */
>> +				if (!tb[NFACCT_FMT] &&
>> +				    !tb[NFACCT_BTHR]) {
>>  			atomic64_set(&matching->pkts, 0);
>> +				}
>> +			} else {
>> +				atomic64_set(&matching->pkts,
>> +				be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS])));
> 
> The replacement operation is not so easy. Note that you may hit
> inconsistencies if while replacing the packet counter, the kernel
> updates the byte counter, and then you replace the byte counter. You
> would be leaking bytes and packets.
If I understand you correctly Pablo, you are worried that the packet and byte counters can be, in a way "misaligned". But if we are to update these values we don't really care if the old values get updated or not prior to that change, do we?

In other words, the packet value is updated and the kernel changes the byte value in the meantime. If we are later (in the same call) replacing the bytes value why do we care that there is a momentary inconsistency - we are going to replace it with a new value anyway?

One thing I'll correct with the next submission is to make sure that request for packet counter update is made in the same call as a request for an update to byte counters, otherwise we might be having only the packet or only the byte counters updated, which will introduce an inconsistency. In other words, have something like:

if (tb[NFACCT_PKTS] && tb[NFACCT_BYTES]) {
 <do an update on both counters>
} else if (tb[NFACCT_PKTS] || tb[NFACCT_BYTES]) { // only one of packets/bytes request has been made, reject it
  return -EINVAL;
}

I hope I understood you correctly here.

One "technical" query regarding future submissions: from reading this list, I gather that the patches need to be submitted using git, is that correct?

The reason I ask this is because my git experience is zero (we use svn on all our development systems) and if it is possible to attach my patches as files (during my submission the last time, a friend of mine gave me a hand and prepared the patches for me as I am an absolute noob as far as git is concerned).

If it is too much trouble, I'll probably scratch out the commands needed to execute git and do it that way - let me know.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso April 3, 2013, 10:46 a.m. UTC | #3
Hi Michael,

On Tue, Mar 26, 2013 at 08:24:22PM +0000, Michael Zintakis wrote:
[...]
> > Instead of this, you can have a /etc/nfacct.conf file that contains
> > the formats and thresholds:
> > 
> > name "ALL 27 net" {
> >         pkts GiB
> >         bytes TiB
> >         threshold 6TiB
> > }
> > 
> > name "ALL misc" {
> >         bytes GiB
> > }
> > 
> > ...
> > 
> > and so on. You can add new options for the `nfacct add' command so
> > this formats and thresholds are automatically appended to the
> > configuration file.
> > 
> > I can help you by making a little parser to read the file and put that
> > formatting information into a list or hashtable. Thus, you can edit
> > the format and thresholds by modifying the configuration file, without
> > the need for interactions with the kernel.
> > 
> > BTW, atomic is not required for those two fields, this is protected by
> > the nfnl_lock.
>
> I disagree with you entirely.

At least you have to agree with me in that atomic is not required for
those variables :)

> The two fields above aren't "meaningless" - they have a purpose and
> are an integral part of the nfacct object. None of the fields in the
> nfacct object are really used in the operation of the kernel itself
> - the kernel merely updates these. To put it bluntly, the whole
> nfacct struct is used as storage and the kernel don't use any of
> these fields in its core operation. OK, I agree that packet and byte
> counters are going to be updated much more frequently by the kernel
> than the fmt and bthr variables, but the fact is that these,
> strictly speaking, do not dependent for the kernel to operate. What
> I've done above is not a unique approach - in the kernel code there
> are a lot of examples like this, it is not something new and we are
> not re-inventing the wheel here or breaking a new ground.
> 
> Another reason for having fmt and bthr as part of the kernel nfacct
> struct is that this data, alongside the bytes and packet counters,
> is shielded and protected (in ring0), its integrity guaranteed and
> can only be changed via the kernel itself and nothing else. This is
> very important for us.

Sorry, I don't think putting thing in the kernel makes things more
secure.

> In the early stages of the nfacct changes, we considered a very
> similar approach to what you have proposed above, but rejected it,
> not least because the integrity and security of the data
> encapsulated in the nfacct struct cannot be guaranteed otherwise. We
> will rely heavily on this and don't need it to be changed
> arbitrarily (either by "mistake" or intentionally), so we decided
> that this is the best way going forward.
>
> In stage 2 (current plans are this to be completed by the end of q2
> of this year) we will extend this and create a specifically designed
> daemon (nfacctd), which will "talk" directly to the kernel and
> retrieve accounting data (much like what nfacct executable currently
> does, but on a bigger scale) on-demand and in real time and then
> send it to a central point where such data will be collected from
> all other machines for statistics and analysis. That daemon can't be
> spending extra cpu cycles parsing files that may or may not exist or
> check whether the data in them is or isn't reliable (whether it is
> "changed" on purpose or not) - that really isn't good enough.

I think that daemon you want to implement is ulogd2. There's already a
plugin available for nfacct. Thus, you can skip parsing the
configuration file over and over again. To dump existing stats, you
can add a unix socket interface for this new extension you want to do.

> >>  	if (matching) {
> >>  		if (nlh->nlmsg_flags & NLM_F_REPLACE) {
> >> -			/* reset counters if you request a replacement. */
> >> +			/* reset counters if you request a replacement */
> >> +			if (!tb[NFACCT_PKTS]) {
> >> +				/*
> >> +				 * Prevent resetting the packets counter if
> >> +				 * either fmt or bthr are specified.
> >> +				 *
> >> +				 * This is done for backward compatibility,
> >> +				 * otherwise resetting these counters should
> >> +				 * only be allowed when tb[NFACCT_PKTS] is
> >> +				 * explicitly specified and == 0.
> >> +				 *
> >> +				 */
> >> +				if (!tb[NFACCT_FMT] &&
> >> +				    !tb[NFACCT_BTHR]) {
> >>  			atomic64_set(&matching->pkts, 0);
> >> +				}
> >> +			} else {
> >> +				atomic64_set(&matching->pkts,
> >> +				be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS])));
> > 
> > The replacement operation is not so easy. Note that you may hit
> > inconsistencies if while replacing the packet counter, the kernel
> > updates the byte counter, and then you replace the byte counter. You
> > would be leaking bytes and packets.
>
> If I understand you correctly Pablo, you are worried that the packet
> and byte counters can be, in a way "misaligned". But if we are to
> update these values we don't really care if the old values get
> updated or not prior to that change, do we?
>
> In other words, the packet value is updated and the kernel changes
> the byte value in the meantime. If we are later (in the same call)
> replacing the bytes value why do we care that there is a momentary
> inconsistency - we are going to replace it with a new value anyway?

IMO we should care. Let me check if I can come with some lockless way
to replace counters, so we can remove that -EBUSY limitation.

> One thing I'll correct with the next submission is to make sure that
> request for packet counter update is made in the same call as a
> request for an update to byte counters, otherwise we might be having
> only the packet or only the byte counters updated, which will
> introduce an inconsistency. In other words, have something like:
> 
> if (tb[NFACCT_PKTS] && tb[NFACCT_BYTES]) {
>  <do an update on both counters>
> } else if (tb[NFACCT_PKTS] || tb[NFACCT_BYTES]) { // only one of packets/bytes request has been made, reject it
>   return -EINVAL;
> }
> 
> I hope I understood you correctly here.
> 
> One "technical" query regarding future submissions: from reading
> this list, I gather that the patches need to be submitted using git,
> is that correct?

Some people use stgit and quilt. You may want to use diff as well (but
make sure the patch applies with patch -p1). No restriction in that
regard as long as the patch is well-formed.

> The reason I ask this is because my git experience is zero (we use
> svn on all our development systems) and if it is possible to attach
> my patches as files (during my submission the last time, a friend of
> mine gave me a hand and prepared the patches for me as I am an
> absolute noob as far as git is concerned).
> 
> If it is too much trouble, I'll probably scratch out the commands
> needed to execute git and do it that way - let me know.

Ping me privately, I can send you some small sheet with typical
commands I use. But I'd suggest stgit as a good way to start, I used
it for years and the pop/push/refresh logic it implements is very
intuitive.

Regards.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Zintakis April 4, 2013, 8:37 p.m. UTC | #4
Hello Pablo,


Pablo Neira Ayuso wrote:
>>> BTW, atomic is not required for those two fields, this is protected by
>>> the nfnl_lock.
>> I disagree with you entirely.
> 
> At least you have to agree with me in that atomic is not required for
> those variables :)
I am not sure I do, sorry Pablo!

If we demote the two variables from "atomic_t" and "atomic64_t" to uint32_t and uint64_t respectively, then if thread A is in the middle of changing 'bthr' for example, and thread B tries to access it, there is nothing stopping both threads gaining access to that variable since it isn't afforded the protection of the atomic operations (put it simple there isn't atomic lock to prevent such access), which will cause a mess.

There is a similar case with the "refcnt" variable - it is atomic and therefore protected, so I don't see if I change these 2 variables (bthr and fmt) to "normal" unsigned integers how (or what) will protect the access to these variables and prevent what I just described from happening...

>> The two fields above aren't "meaningless" - they have a purpose and
>> are an integral part of the nfacct object. None of the fields in the
>> nfacct object are really used in the operation of the kernel itself
>> - the kernel merely updates these. To put it bluntly, the whole
>> nfacct struct is used as storage and the kernel don't use any of
>> these fields in its core operation. OK, I agree that packet and byte
>> counters are going to be updated much more frequently by the kernel
>> than the fmt and bthr variables, but the fact is that these,
>> strictly speaking, do not dependent for the kernel to operate. What
>> I've done above is not a unique approach - in the kernel code there
>> are a lot of examples like this, it is not something new and we are
>> not re-inventing the wheel here or breaking a new ground.
>>
>> Another reason for having fmt and bthr as part of the kernel nfacct
>> struct is that this data, alongside the bytes and packet counters,
>> is shielded and protected (in ring0), its integrity guaranteed and
>> can only be changed via the kernel itself and nothing else. This is
>> very important for us.
> 
> Sorry, I don't think putting thing in the kernel makes things more
> secure.
These properties are not directly accessible or arbitrarily modifiable (whether intentional or not) via userspace so in that sense the information contained within is "shielded" or "protected" by the kernel, which is what I meant with my previous posting (sorry, my English is not 100%).

>> In the early stages of the nfacct changes, we considered a very
>> similar approach to what you have proposed above, but rejected it,
>> not least because the integrity and security of the data
>> encapsulated in the nfacct struct cannot be guaranteed otherwise. We
>> will rely heavily on this and don't need it to be changed
>> arbitrarily (either by "mistake" or intentionally), so we decided
>> that this is the best way going forward.
>>
>> In stage 2 (current plans are this to be completed by the end of q2
>> of this year) we will extend this and create a specifically designed
>> daemon (nfacctd), which will "talk" directly to the kernel and
>> retrieve accounting data (much like what nfacct executable currently
>> does, but on a bigger scale) on-demand and in real time and then
>> send it to a central point where such data will be collected from
>> all other machines for statistics and analysis. That daemon can't be
>> spending extra cpu cycles parsing files that may or may not exist or
>> check whether the data in them is or isn't reliable (whether it is
>> "changed" on purpose or not) - that really isn't good enough.
> 
> I think that daemon you want to implement is ulogd2. There's already a
> plugin available for nfacct. Thus, you can skip parsing the
> configuration file over and over again. To dump existing stats, you
> can add a unix socket interface for this new extension you want to do.
OK, let me illustrate my point with a little pictogram. We have two separate possible scenarios:

Scenario 1:
Request Station (RS) <=> Internal (trusted) network <=> Home Station (HS) - where nfacctd daemon and friends is located

Scenario 2:
Request Station <=> Public (untrusted) network <=> Home Station

In the above scenarios, the "Request Station" can be two different types: a) a central server gathering statistics for all registered machines on the network; or b) a separate work station running nfacct.

In case a) above, what usually happens is that the RS connects to nfacctd on all registered machines using tcp connection and tells them "I want traffic stats from you every X seconds" (without this initial command the nfacctd daemon is idle and simply listens on the pre-defined tcp port).

Once this is done, the nfacctd daemon start sending those stats to RS and gets an acknowledgment. If that doesn't happen, say, 3 times in a row - because the RS may be inaccessible or for other reasons, then the daemon stores these stats in memory if permissible up to a point for later transmission, while regularly trying to connect to the RS.

Periodically, particularly during "peak" times when we expect traffic to be high, the central server on RS can send a command to the nfacctd daemons to throttle up the interval at which these statistics are sent. Similarly, during "slow-traffic" times, the opposite may happen and this interval can be extended.

In addition, the RS needs to be able to request the current traffic stats for a particular nfacct object (similar to "nfacct get"), as well as be able to change the threshold and format properties of an individual nfacct object on the HS.

Another possible type of "Request Station" we need to implement (this would require extending the nfacct executable capabilities once again) is a RS requesting full or partial traffic statistics by querying nfacctd on a particular Home Station using the nfacct executable. In other words, be able to do this:

nfacct list [host <HS_HOST>] [port <HS_PORT>]
nfacct get <object-name> [host <HS_HOST>] [port <HS_PORT>]
nfacct add <object-name> replace [format FMT_SPEC] [threshold THRESHOLD] [host <HS_HOST>] [port <HS_PORT>]

In other words, nfacct from the RS should be able to connect to nfacctd daemon on HS_HOST:HS_PORT and request traffic statistics on all objects, a single object, as well as be able to change the format and threshold values of a particular nfacct object.

The connection itself will usually be on a pre-defined tcp port (above the 30000 range). For the final stage of the project (this is still in its infancy though) we will need to implement Scenario 2 above: if the medium used by the RS to connect to the HS is untrusted network, then we will deploy tls encryption (that will, of course, require certificate chain verification on both sides).

I haven't looked at the ulog2 and the nfacct module implementation there in great detail, but from what I can gather, it just issues and records stats on all nfacct objects at pre-selected interval configurable at start-up (via the "pollinterval" option) and all this is done locally, which is obviously insufficient for us.

>>>>  	if (matching) {
>>>>  		if (nlh->nlmsg_flags & NLM_F_REPLACE) {
>>>> -			/* reset counters if you request a replacement. */
>>>> +			/* reset counters if you request a replacement */
>>>> +			if (!tb[NFACCT_PKTS]) {
>>>> +				/*
>>>> +				 * Prevent resetting the packets counter if
>>>> +				 * either fmt or bthr are specified.
>>>> +				 *
>>>> +				 * This is done for backward compatibility,
>>>> +				 * otherwise resetting these counters should
>>>> +				 * only be allowed when tb[NFACCT_PKTS] is
>>>> +				 * explicitly specified and == 0.
>>>> +				 *
>>>> +				 */
>>>> +				if (!tb[NFACCT_FMT] &&
>>>> +				    !tb[NFACCT_BTHR]) {
>>>>  			atomic64_set(&matching->pkts, 0);
>>>> +				}
>>>> +			} else {
>>>> +				atomic64_set(&matching->pkts,
>>>> +				be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS])));
>>> The replacement operation is not so easy. Note that you may hit
>>> inconsistencies if while replacing the packet counter, the kernel
>>> updates the byte counter, and then you replace the byte counter. You
>>> would be leaking bytes and packets.
>> If I understand you correctly Pablo, you are worried that the packet
>> and byte counters can be, in a way "misaligned". But if we are to
>> update these values we don't really care if the old values get
>> updated or not prior to that change, do we?
>>
>> In other words, the packet value is updated and the kernel changes
>> the byte value in the meantime. If we are later (in the same call)
>> replacing the bytes value why do we care that there is a momentary
>> inconsistency - we are going to replace it with a new value anyway?
> 
> IMO we should care.
Pablo, let me re-cap so that I make sure we are discussing the same thing...

Since nfnetlink_acct.c allows only packet or only byte counters to be updated, this may cause inconsistency (a valid concern!). So, what I proposed in my previous posting is roughly the following:

nfnetlink_acct.c::nfnl_acct_new:
	if (matching) {
/*...*/
		if (tb[NFACCT_PKTS] && tb[NFACCT_BYTES]) {
			atomic64_set(&nfacct->pkts,
				be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS])));
			atomic64_set(&nfacct->bytes,
				be64_to_cpu(nla_get_be64(tb[NFACCT_BYTES])));
		} else if (tb[NFACCT_PKTS] || tb[NFACCT_BYTES])	{
			return -EINVAL;	// or "return -EBUSY;"
/*...*/

If I understand you correctly, you are concerned that between the time after "atomic64_set(&nfacct->pkts..." is executed and "atomic64_set(&nfacct->bytes" not yet started/finished, even though these two statements follow one after another, you think there will be a momentary inconsistency between the byte and packet counter values, is that right?

If that is so, then the existing kernel code is having the same problem. For example:

a)
nfnetlink_acct.c::nfnl_acct_new:
	if (matching) {
		if (nlh->nlmsg_flags & NLM_F_REPLACE) {
			/* reset counters if you request a replacement.	*/
			atomic64_set(&matching->pkts, 0);
			atomic64_set(&matching->bytes, 0);
			return 0;
		}
		return -EBUSY;
	}

b)
nfnetlink_acct.c::nfnl_acct_fill_info:
	if (type == NFNL_MSG_ACCT_GET_CTRZERO) {
		pkts = atomic64_xchg(&acct->pkts, 0);
		bytes =	atomic64_xchg(&acct->bytes, 0);
	} else {
		pkts = atomic64_read(&acct->pkts);
		bytes =	atomic64_read(&acct->bytes);
	}

c)
nfnetlink_acct.c::nfnl_acct_update:
	atomic64_inc(&nfacct->pkts);
	atomic64_add(skb->len, &nfacct->bytes);

In all of the above cases (a-c) the update statements are identical with what I proposed above, so if this is undesirable, then the existing code in the kernel in the above 3 cases I listed should also be changed, correct?

> Let me check if I can come with some lockless way
> to replace counters, so we can remove that -EBUSY limitation.
This discussion we are having pointed me to some interesting idea.

Packet and byte counters are always updated in "unison" - one cannot (or should not) be updated without the other. The current "atomic" operations are performed separately (and independently!) for packet and then byte counters. Each atomic operation (atomic64_set, atomic64_inc etc) does a spinlock, again, separately for packet and byte counters. If we are always updating these variables in "unison" together, then we don't really need to do 2 locks - we need just one.

In other words, we can "demote" pkts and bytes variables from "atomic64_t" to uint64_t and use a manual lock which is specific for each nfacct object to update these values. In other words, something like this:

struct nf_acct {
	uint64_t		pkts;
	uint64_t		bytes;
	atomic64_t		bthr;
	atomic_t		fmt;
	struct list_head	head;
	atomic_t		refcnt;
	char			name[NFACCT_NAME_MAX];
	/* Lock	protecting packet & byte counters */
	rwlock_t		pb_lock;
	struct rcu_head		rcu_head;
};
/*...*/

nfnetlink_acct.c::nfnl_acct_new:
	if (matching) {
/*...*/
		if (tb[NFACCT_PKTS] && tb[NFACCT_BYTES]) {
			write_lock_bh(&nfacct->pb_lock);
			nfacct->pkts =
				be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS]));
			nfacct->bytes =
				be64_to_cpu(nla_get_be64(tb[NFACCT_BYTES]));
			write_unlock_bh(&nfacct->pb_lock);
		} else if (tb[NFACCT_PKTS] || tb[NFACCT_BYTES])	{
			return -EINVAL;	// or "return -EBUSY;"
/*...*/

	nfacct = kzalloc(sizeof(struct nf_acct), GFP_KERNEL);
	if (nfacct == NULL)
		return -ENOMEM;

	strncpy(nfacct->name, nla_data(tb[NFACCT_NAME]), NFACCT_NAME_MAX);
	rwlock_init(&nfacct->pb_lock);
/*...*/

nfnetlink_acct.c::nfnl_acct_fill_info:
	if (type == NFNL_MSG_ACCT_GET_CTRZERO) {
		write_lock_bh(&acct->pb_lock);
		pkts = acct->pkts;
		bytes =	acct->bytes;
		acct->pkts = 0;
		acct->bytes = 0;
		write_unlock_bh(&acct->pb_lock);
	} else {
		read_lock_bh(&acct->pb_lock);
		pkts = acct->pkts;
		bytes =	acct->bytes;
		read_unlock_bh(&acct->pb_lock);
	}
/*...*/

nfnetlink_acct.c::nfnl_acct_update:
	write_lock_bh(&nfacct->pb_lock);
	nfacct->pkts += 1;
	nfacct->bytes += skb->len;
	write_unlock_bh(&nfacct->pb_lock);

That way, we use only one lock (instead of two as was the case with "atomic64_set" and friends) and there should not be any concerns any longer as to whether there will be any inconsistency between packet and byte counter values because these are applied in "unison" during a write lock, which is later released after the operation is complete. What do you think, would that be satisfactory?

>> If it is too much trouble, I'll probably scratch out the commands
>> needed to execute git and do it that way - let me know.
> 
> Ping me privately, I can send you some small sheet with typical
> commands I use. But I'd suggest stgit as a good way to start, I used
> it for years and the pop/push/refresh logic it implements is very
> intuitive.
I'll post you privately as this is off-topic, thanks Pablo.


MZ
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso April 11, 2013, 10:18 a.m. UTC | #5
Hi Michael,

On Thu, Apr 04, 2013 at 09:37:03PM +0100, Michael Zintakis wrote:
> Hello Pablo,
> 
> Pablo Neira Ayuso wrote:
> >>> BTW, atomic is not required for those two fields, this is protected by
> >>> the nfnl_lock.
> >> I disagree with you entirely.
> > 
> > At least you have to agree with me in that atomic is not required for
> > those variables :)
>
> I am not sure I do, sorry Pablo!
> 
> If we demote the two variables from "atomic_t" and "atomic64_t" to
> uint32_t and uint64_t respectively, then if thread A is in the
> middle of changing 'bthr' for example, and thread B tries to access
> it, there is nothing stopping both threads gaining access to that
> variable since it isn't afforded the protection of the atomic
> operations (put it simple there isn't atomic lock to prevent such
> access), which will cause a mess.

Those variables are already protected by nfnl_lock in
net/netfilter/nfnetlink.c.

> There is a similar case with the "refcnt" variable - it is atomic
> and therefore protected, so I don't see if I change these 2
> variables (bthr and fmt) to "normal" unsigned integers how (or what)
> will protect the access to these variables and prevent what I just
> described from happening...
> 
> >> The two fields above aren't "meaningless" - they have a purpose and
> >> are an integral part of the nfacct object. None of the fields in the
> >> nfacct object are really used in the operation of the kernel itself
> >> - the kernel merely updates these. To put it bluntly, the whole
> >> nfacct struct is used as storage and the kernel don't use any of
> >> these fields in its core operation. OK, I agree that packet and byte
> >> counters are going to be updated much more frequently by the kernel
> >> than the fmt and bthr variables, but the fact is that these,
> >> strictly speaking, do not dependent for the kernel to operate. What
> >> I've done above is not a unique approach - in the kernel code there
> >> are a lot of examples like this, it is not something new and we are
> >> not re-inventing the wheel here or breaking a new ground.
> >>
> >> Another reason for having fmt and bthr as part of the kernel nfacct
> >> struct is that this data, alongside the bytes and packet counters,
> >> is shielded and protected (in ring0), its integrity guaranteed and
> >> can only be changed via the kernel itself and nothing else. This is
> >> very important for us.
> > 
> > Sorry, I don't think putting thing in the kernel makes things more
> > secure.
>
> These properties are not directly accessible or arbitrarily
> modifiable (whether intentional or not) via userspace so in that
> sense the information contained within is "shielded" or "protected"
> by the kernel, which is what I meant with my previous posting
> (sorry, my English is not 100%).

You can modify kernel information if you manage to load a module and
have some little knowlege on where to find things in the memory space.
Truly, I think you can achive this by protecting the configuration
file from changes in your tree and some extra checksumming to follow
the track of unexpected changes.

> >> In the early stages of the nfacct changes, we considered a very
> >> similar approach to what you have proposed above, but rejected it,
> >> not least because the integrity and security of the data
> >> encapsulated in the nfacct struct cannot be guaranteed otherwise. We
> >> will rely heavily on this and don't need it to be changed
> >> arbitrarily (either by "mistake" or intentionally), so we decided
> >> that this is the best way going forward.
> >>
> >> In stage 2 (current plans are this to be completed by the end of q2
> >> of this year) we will extend this and create a specifically designed
> >> daemon (nfacctd), which will "talk" directly to the kernel and
> >> retrieve accounting data (much like what nfacct executable currently
> >> does, but on a bigger scale) on-demand and in real time and then
> >> send it to a central point where such data will be collected from
> >> all other machines for statistics and analysis. That daemon can't be
> >> spending extra cpu cycles parsing files that may or may not exist or
> >> check whether the data in them is or isn't reliable (whether it is
> >> "changed" on purpose or not) - that really isn't good enough.
> > 
> > I think that daemon you want to implement is ulogd2. There's already a
> > plugin available for nfacct. Thus, you can skip parsing the
> > configuration file over and over again. To dump existing stats, you
> > can add a unix socket interface for this new extension you want to do.
>
> OK, let me illustrate my point with a little pictogram. We have two
> separate possible scenarios:
> 
> Scenario 1:
> Request Station (RS) <=> Internal (trusted) network <=> Home Station (HS) - where nfacctd daemon and friends is located
> 
> Scenario 2:
> Request Station <=> Public (untrusted) network <=> Home Station
> 
> In the above scenarios, the "Request Station" can be two different
> types: a) a central server gathering statistics for all registered
> machines on the network; or b) a separate work station running
> nfacct.
> 
> In case a) above, what usually happens is that the RS connects to
> nfacctd on all registered machines using tcp connection and tells
> them "I want traffic stats from you every X seconds" (without this
> initial command the nfacctd daemon is idle and simply listens on the
> pre-defined tcp port).
> 
> Once this is done, the nfacctd daemon start sending those stats to
> RS and gets an acknowledgment. If that doesn't happen, say, 3 times
> in a row - because the RS may be inaccessible or for other reasons,
> then the daemon stores these stats in memory if permissible up to a
> point for later transmission, while regularly trying to connect to
> the RS.
> 
> Periodically, particularly during "peak" times when we expect
> traffic to be high, the central server on RS can send a command to
> the nfacctd daemons to throttle up the interval at which these
> statistics are sent. Similarly, during "slow-traffic" times, the
> opposite may happen and this interval can be extended.
> 
> In addition, the RS needs to be able to request the current traffic
> stats for a particular nfacct object (similar to "nfacct get"), as
> well as be able to change the threshold and format properties of an
> individual nfacct object on the HS.
> 
> Another possible type of "Request Station" we need to implement
> (this would require extending the nfacct executable capabilities
> once again) is a RS requesting full or partial traffic statistics by
> querying nfacctd on a particular Home Station using the nfacct
> executable. In other words, be able to do this:
> 
> nfacct list [host <HS_HOST>] [port <HS_PORT>]
> nfacct get <object-name> [host <HS_HOST>] [port <HS_PORT>]
> nfacct add <object-name> replace [format FMT_SPEC] [threshold THRESHOLD] [host <HS_HOST>] [port <HS_PORT>]
> 
> In other words, nfacct from the RS should be able to connect to
> nfacctd daemon on HS_HOST:HS_PORT and request traffic statistics on
> all objects, a single object, as well as be able to change the
> format and threshold values of a particular nfacct object.

That sounds to me a lot like IPFIX. There's some preliminary support
for it in ulogd2, it should be relatively easy to finish it.

> The connection itself will usually be on a pre-defined tcp port
> (above the 30000 range). For the final stage of the project (this is
> still in its infancy though) we will need to implement Scenario 2
> above: if the medium used by the RS to connect to the HS is
> untrusted network, then we will deploy tls encryption (that will, of
> course, require certificate chain verification on both sides).
> 
> I haven't looked at the ulog2 and the nfacct module implementation
> there in great detail, but from what I can gather, it just issues
> and records stats on all nfacct objects at pre-selected interval
> configurable at start-up (via the "pollinterval" option) and all
> this is done locally, which is obviously insufficient for us.

I mean that ulogd2 provides the framework for Netfilter's user-space
logging. It's pretty extensible. You will surely need to make some
changes for what you need, but most of the infrastructure is already
there. No need for another separated daemon.

> >>>>  	if (matching) {
> >>>>  		if (nlh->nlmsg_flags & NLM_F_REPLACE) {
> >>>> -			/* reset counters if you request a replacement. */
> >>>> +			/* reset counters if you request a replacement */
> >>>> +			if (!tb[NFACCT_PKTS]) {
> >>>> +				/*
> >>>> +				 * Prevent resetting the packets counter if
> >>>> +				 * either fmt or bthr are specified.
> >>>> +				 *
> >>>> +				 * This is done for backward compatibility,
> >>>> +				 * otherwise resetting these counters should
> >>>> +				 * only be allowed when tb[NFACCT_PKTS] is
> >>>> +				 * explicitly specified and == 0.
> >>>> +				 *
> >>>> +				 */
> >>>> +				if (!tb[NFACCT_FMT] &&
> >>>> +				    !tb[NFACCT_BTHR]) {
> >>>>  			atomic64_set(&matching->pkts, 0);
> >>>> +				}
> >>>> +			} else {
> >>>> +				atomic64_set(&matching->pkts,
> >>>> +				be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS])));
> >>> The replacement operation is not so easy. Note that you may hit
> >>> inconsistencies if while replacing the packet counter, the kernel
> >>> updates the byte counter, and then you replace the byte counter. You
> >>> would be leaking bytes and packets.
> >>>
> >> If I understand you correctly Pablo, you are worried that the packet
> >> and byte counters can be, in a way "misaligned". But if we are to
> >> update these values we don't really care if the old values get
> >> updated or not prior to that change, do we?
> >>
> >> In other words, the packet value is updated and the kernel changes
> >> the byte value in the meantime. If we are later (in the same call)
> >> replacing the bytes value why do we care that there is a momentary
> >> inconsistency - we are going to replace it with a new value anyway?
> > 
> > IMO we should care.
>
> Pablo, let me re-cap so that I make sure we are discussing the same
> thing...
> 
> Since nfnetlink_acct.c allows only packet or only byte counters to
> be updated, this may cause inconsistency (a valid concern!). So,
> what I proposed in my previous posting is roughly the following:
> 
> nfnetlink_acct.c::nfnl_acct_new:
> 	if (matching) {
> /*...*/
> 		if (tb[NFACCT_PKTS] && tb[NFACCT_BYTES]) {
> 			atomic64_set(&nfacct->pkts,
> 				be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS])));
> 			atomic64_set(&nfacct->bytes,
> 				be64_to_cpu(nla_get_be64(tb[NFACCT_BYTES])));
> 		} else if (tb[NFACCT_PKTS] || tb[NFACCT_BYTES])	{
> 			return -EINVAL;	// or "return -EBUSY;"
> /*...*/
> 
> If I understand you correctly, you are concerned that between the
> time after "atomic64_set(&nfacct->pkts..." is executed and
> "atomic64_set(&nfacct->bytes" not yet started/finished, even though
> these two statements follow one after another, you think there will
> be a momentary inconsistency between the byte and packet counter
> values, is that right?
> 
> If that is so, then the existing kernel code is having the same
> problem. For example:
> 
> a)
> nfnetlink_acct.c::nfnl_acct_new:
> 	if (matching) {
> 		if (nlh->nlmsg_flags & NLM_F_REPLACE) {
> 			/* reset counters if you request a replacement.	*/
> 			atomic64_set(&matching->pkts, 0);
> 			atomic64_set(&matching->bytes, 0);

We should check there if that acct object is in used. I think there is
no user-space code using the NLM_F_REPLACE flag IIRC.

> 			return 0;
> 		}
> 		return -EBUSY;
> 	}
> 
> b)
> nfnetlink_acct.c::nfnl_acct_fill_info:
> 	if (type == NFNL_MSG_ACCT_GET_CTRZERO) {
> 		pkts = atomic64_xchg(&acct->pkts, 0);
> 		bytes =	atomic64_xchg(&acct->bytes, 0);
> 	} else {
> 		pkts = atomic64_read(&acct->pkts);
> 		bytes =	atomic64_read(&acct->bytes);
> 	}

In this case, we don't leak packet/bytes counters.

> c)
> nfnetlink_acct.c::nfnl_acct_update:
> 	atomic64_inc(&nfacct->pkts);
> 	atomic64_add(skb->len, &nfacct->bytes);
> 
> In all of the above cases (a-c) the update statements are identical
> with what I proposed above, so if this is undesirable, then the
> existing code in the kernel in the above 3 cases I listed should
> also be changed, correct?

I don't see why c) is wrong. My concern is mostly about leaking data.

> > Let me check if I can come with some lockless way
> > to replace counters, so we can remove that -EBUSY limitation.
>
> This discussion we are having pointed me to some interesting idea.
> 
> Packet and byte counters are always updated in "unison" - one cannot
> (or should not) be updated without the other. The current "atomic"
> operations are performed separately (and independently!) for packet
> and then byte counters. Each atomic operation (atomic64_set,
> atomic64_inc etc) does a spinlock, again, separately for packet and
> byte counters. If we are always updating these variables in "unison"
> together, then we don't really need to do 2 locks - we need just
> one.
> 
> In other words, we can "demote" pkts and bytes variables from
> "atomic64_t" to uint64_t and use a manual lock which is specific for
> each nfacct object to update these values. In other words, something
> like this:
> 
> struct nf_acct {
> 	uint64_t		pkts;
> 	uint64_t		bytes;
> 	atomic64_t		bthr;
> 	atomic_t		fmt;
> 	struct list_head	head;
> 	atomic_t		refcnt;
> 	char			name[NFACCT_NAME_MAX];
> 	/* Lock	protecting packet & byte counters */
> 	rwlock_t		pb_lock;
> 	struct rcu_head		rcu_head;
> };
> /*...*/
> 
> nfnetlink_acct.c::nfnl_acct_new:
> 	if (matching) {
> /*...*/
> 		if (tb[NFACCT_PKTS] && tb[NFACCT_BYTES]) {
> 			write_lock_bh(&nfacct->pb_lock);
> 			nfacct->pkts =
> 				be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS]));
> 			nfacct->bytes =
> 				be64_to_cpu(nla_get_be64(tb[NFACCT_BYTES]));
> 			write_unlock_bh(&nfacct->pb_lock);
> 		} else if (tb[NFACCT_PKTS] || tb[NFACCT_BYTES])	{
> 			return -EINVAL;	// or "return -EBUSY;"
> /*...*/
> 
> 	nfacct = kzalloc(sizeof(struct nf_acct), GFP_KERNEL);
> 	if (nfacct == NULL)
> 		return -ENOMEM;
> 
> 	strncpy(nfacct->name, nla_data(tb[NFACCT_NAME]), NFACCT_NAME_MAX);
> 	rwlock_init(&nfacct->pb_lock);
> /*...*/
> 
> nfnetlink_acct.c::nfnl_acct_fill_info:
> 	if (type == NFNL_MSG_ACCT_GET_CTRZERO) {
> 		write_lock_bh(&acct->pb_lock);
> 		pkts = acct->pkts;
> 		bytes =	acct->bytes;
> 		acct->pkts = 0;
> 		acct->bytes = 0;
> 		write_unlock_bh(&acct->pb_lock);
> 	} else {
> 		read_lock_bh(&acct->pb_lock);
> 		pkts = acct->pkts;
> 		bytes =	acct->bytes;
> 		read_unlock_bh(&acct->pb_lock);
> 	}
> /*...*/
> 
> nfnetlink_acct.c::nfnl_acct_update:
> 	write_lock_bh(&nfacct->pb_lock);
> 	nfacct->pkts += 1;
> 	nfacct->bytes += skb->len;
> 	write_unlock_bh(&nfacct->pb_lock);
> 
> That way, we use only one lock (instead of two as was the case with
> "atomic64_set" and friends) and there should not be any concerns any
> longer as to whether there will be any inconsistency between packet
> and byte counter values because these are applied in "unison" during
> a write lock, which is later released after the operation is
> complete. What do you think, would that be satisfactory?

Eric Dumazet proposed during the latest workshop that we could add
some flag to allow different flavours of acct object:

* per-cpu based, they consume more memory, but they are lockless.
* compact, lock based, they consume little memory, but bottleneck is
  locking.

So you select what you need according to your requirements.  I plan to
send some patches to support this.

Regards.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Zintakis April 14, 2013, 9:50 a.m. UTC | #6
Hello Pablo,


Apologies once again for this late post...

Pablo Neira Ayuso wrote:
> Hi Michael,
> 
> On Thu, Apr 04, 2013 at 09:37:03PM +0100, Michael Zintakis wrote:
>> Hello Pablo,
>>
>> Pablo Neira Ayuso wrote:
>>>>> BTW, atomic is not required for those two fields, this is protected by
>>>>> the nfnl_lock.
>>>> I disagree with you entirely.
>>> At least you have to agree with me in that atomic is not required for
>>> those variables :)
>> I am not sure I do, sorry Pablo!
>>
>> If we demote the two variables from "atomic_t" and "atomic64_t" to
>> uint32_t and uint64_t respectively, then if thread A is in the
>> middle of changing 'bthr' for example, and thread B tries to access
>> it, there is nothing stopping both threads gaining access to that
>> variable since it isn't afforded the protection of the atomic
>> operations (put it simple there isn't atomic lock to prevent such
>> access), which will cause a mess.
> 
> Those variables are already protected by nfnl_lock in
> net/netfilter/nfnetlink.c.
Ah, I think I understand this now - since these two variables are modified only via the nfnetlink implementation (and not the kernel itself), then nfnl_lock() function comes into play since nfnl_nfacct_new is called by nfnetlinkc::nfnetlink_rcv_msg (where nfnl_lock is "active"), preventing another request (made possibly by another thread also via nfnetlink) to update these two variables in nfnetlink_acct.c. Is my assumption correct or do I have this wrong?

If it is correct, you are right that in such a case we don't really need the fmt and bthr variables to be atomic as no race condition can occur. Thank you for this clarification!

>> There is a similar case with the "refcnt" variable - it is atomic
>> and therefore protected, so I don't see if I change these 2
>> variables (bthr and fmt) to "normal" unsigned integers how (or what)
>> will protect the access to these variables and prevent what I just
>> described from happening...
>>
>>>> The two fields above aren't "meaningless" - they have a purpose and
>>>> are an integral part of the nfacct object. None of the fields in the
>>>> nfacct object are really used in the operation of the kernel itself
>>>> - the kernel merely updates these. To put it bluntly, the whole
>>>> nfacct struct is used as storage and the kernel don't use any of
>>>> these fields in its core operation. OK, I agree that packet and byte
>>>> counters are going to be updated much more frequently by the kernel
>>>> than the fmt and bthr variables, but the fact is that these,
>>>> strictly speaking, do not dependent for the kernel to operate. What
>>>> I've done above is not a unique approach - in the kernel code there
>>>> are a lot of examples like this, it is not something new and we are
>>>> not re-inventing the wheel here or breaking a new ground.
>>>>
>>>> Another reason for having fmt and bthr as part of the kernel nfacct
>>>> struct is that this data, alongside the bytes and packet counters,
>>>> is shielded and protected (in ring0), its integrity guaranteed and
>>>> can only be changed via the kernel itself and nothing else. This is
>>>> very important for us.
>>> Sorry, I don't think putting thing in the kernel makes things more
>>> secure.
>> These properties are not directly accessible or arbitrarily
>> modifiable (whether intentional or not) via userspace so in that
>> sense the information contained within is "shielded" or "protected"
>> by the kernel, which is what I meant with my previous posting
>> (sorry, my English is not 100%).
> 
> You can modify kernel information if you manage to load a module and
> have some little knowlege on where to find things in the memory space.
> Truly, I think you can achive this by protecting the configuration
> file from changes in your tree and some extra checksumming to follow
> the track of unexpected changes.
I understand that Pablo, but this is quite a bit far-fetched and, in my view, less practical (both in terms of performance as well as memory footprint) where compared to just using these 2 properties in the kernel where they are protected without the need to write extra code.

With placing these properties into the kernel they will also occupy less memory and the performance would be better, compared to if all this is done in user space (not to mention possible synchronization issues as the code in user space needs to "track" nfacct objects in the kernel and do that flawlessly).

As I already mention previously, these two properties form a logical part of the nfacct object - they are not some separate and completely independently detached "entities".

>>>> In the early stages of the nfacct changes, we considered a very
>>>> similar approach to what you have proposed above, but rejected it,
>>>> not least because the integrity and security of the data
>>>> encapsulated in the nfacct struct cannot be guaranteed otherwise. We
>>>> will rely heavily on this and don't need it to be changed
>>>> arbitrarily (either by "mistake" or intentionally), so we decided
>>>> that this is the best way going forward.
>>>>
>>>> In stage 2 (current plans are this to be completed by the end of q2
>>>> of this year) we will extend this and create a specifically designed
>>>> daemon (nfacctd), which will "talk" directly to the kernel and
>>>> retrieve accounting data (much like what nfacct executable currently
>>>> does, but on a bigger scale) on-demand and in real time and then
>>>> send it to a central point where such data will be collected from
>>>> all other machines for statistics and analysis. That daemon can't be
>>>> spending extra cpu cycles parsing files that may or may not exist or
>>>> check whether the data in them is or isn't reliable (whether it is
>>>> "changed" on purpose or not) - that really isn't good enough.
>>> I think that daemon you want to implement is ulogd2. There's already a
>>> plugin available for nfacct. Thus, you can skip parsing the
>>> configuration file over and over again. To dump existing stats, you
>>> can add a unix socket interface for this new extension you want to do.
>> OK, let me illustrate my point with a little pictogram. We have two
>> separate possible scenarios:
>>
>> Scenario 1:
>> Request Station (RS) <=> Internal (trusted) network <=> Home Station (HS) - where nfacctd daemon and friends is located
>>
>> Scenario 2:
>> Request Station <=> Public (untrusted) network <=> Home Station
>>
>> In the above scenarios, the "Request Station" can be two different
>> types: a) a central server gathering statistics for all registered
>> machines on the network; or b) a separate work station running
>> nfacct.
>>
>> In case a) above, what usually happens is that the RS connects to
>> nfacctd on all registered machines using tcp connection and tells
>> them "I want traffic stats from you every X seconds" (without this
>> initial command the nfacctd daemon is idle and simply listens on the
>> pre-defined tcp port).
>>
>> Once this is done, the nfacctd daemon start sending those stats to
>> RS and gets an acknowledgment. If that doesn't happen, say, 3 times
>> in a row - because the RS may be inaccessible or for other reasons,
>> then the daemon stores these stats in memory if permissible up to a
>> point for later transmission, while regularly trying to connect to
>> the RS.
>>
>> Periodically, particularly during "peak" times when we expect
>> traffic to be high, the central server on RS can send a command to
>> the nfacctd daemons to throttle up the interval at which these
>> statistics are sent. Similarly, during "slow-traffic" times, the
>> opposite may happen and this interval can be extended.
>>
>> In addition, the RS needs to be able to request the current traffic
>> stats for a particular nfacct object (similar to "nfacct get"), as
>> well as be able to change the threshold and format properties of an
>> individual nfacct object on the HS.
>>
>> Another possible type of "Request Station" we need to implement
>> (this would require extending the nfacct executable capabilities
>> once again) is a RS requesting full or partial traffic statistics by
>> querying nfacctd on a particular Home Station using the nfacct
>> executable. In other words, be able to do this:
>>
>> nfacct list [host <HS_HOST>] [port <HS_PORT>]
>> nfacct get <object-name> [host <HS_HOST>] [port <HS_PORT>]
>> nfacct add <object-name> replace [format FMT_SPEC] [threshold THRESHOLD] [host <HS_HOST>] [port <HS_PORT>]
>>
>> In other words, nfacct from the RS should be able to connect to
>> nfacctd daemon on HS_HOST:HS_PORT and request traffic statistics on
>> all objects, a single object, as well as be able to change the
>> format and threshold values of a particular nfacct object.
> 
> That sounds to me a lot like IPFIX. There's some preliminary support
> for it in ulogd2, it should be relatively easy to finish it.
> 
>> The connection itself will usually be on a pre-defined tcp port
>> (above the 30000 range). For the final stage of the project (this is
>> still in its infancy though) we will need to implement Scenario 2
>> above: if the medium used by the RS to connect to the HS is
>> untrusted network, then we will deploy tls encryption (that will, of
>> course, require certificate chain verification on both sides).
>>
>> I haven't looked at the ulog2 and the nfacct module implementation
>> there in great detail, but from what I can gather, it just issues
>> and records stats on all nfacct objects at pre-selected interval
>> configurable at start-up (via the "pollinterval" option) and all
>> this is done locally, which is obviously insufficient for us.
> 
> I mean that ulogd2 provides the framework for Netfilter's user-space
> logging. It's pretty extensible. You will surely need to make some
> changes for what you need, but most of the infrastructure is already
> there. No need for another separated daemon.
My concern with this is that even though you say it is extensible, there will be quite a lot of redesigning and rewriting of code - the ulogd2 primary function is to check packets and pipe this through various "channels" defined in advance at start up, which is quite different from the functionality required by the nfacct daemon. To use a common expression - I think using ulogd2 as nfacct daemon will be like trying to put square pegs in round holes - not impossible, but difficult and cumbersome to implement somewhat.

Having written all of this, I will leave this alternative active and see if something can be done, even if we don't actually deploy this, for the benefit of the community.

>>>>>>  	if (matching) {
>>>>>>  		if (nlh->nlmsg_flags & NLM_F_REPLACE) {
>>>>>> -			/* reset counters if you request a replacement. */
>>>>>> +			/* reset counters if you request a replacement */
>>>>>> +			if (!tb[NFACCT_PKTS]) {
>>>>>> +				/*
>>>>>> +				 * Prevent resetting the packets counter if
>>>>>> +				 * either fmt or bthr are specified.
>>>>>> +				 *
>>>>>> +				 * This is done for backward compatibility,
>>>>>> +				 * otherwise resetting these counters should
>>>>>> +				 * only be allowed when tb[NFACCT_PKTS] is
>>>>>> +				 * explicitly specified and == 0.
>>>>>> +				 *
>>>>>> +				 */
>>>>>> +				if (!tb[NFACCT_FMT] &&
>>>>>> +				    !tb[NFACCT_BTHR]) {
>>>>>>  			atomic64_set(&matching->pkts, 0);
>>>>>> +				}
>>>>>> +			} else {
>>>>>> +				atomic64_set(&matching->pkts,
>>>>>> +				be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS])));
>>>>> The replacement operation is not so easy. Note that you may hit
>>>>> inconsistencies if while replacing the packet counter, the kernel
>>>>> updates the byte counter, and then you replace the byte counter. You
>>>>> would be leaking bytes and packets.
>>>>>
>>>> If I understand you correctly Pablo, you are worried that the packet
>>>> and byte counters can be, in a way "misaligned". But if we are to
>>>> update these values we don't really care if the old values get
>>>> updated or not prior to that change, do we?
>>>>
>>>> In other words, the packet value is updated and the kernel changes
>>>> the byte value in the meantime. If we are later (in the same call)
>>>> replacing the bytes value why do we care that there is a momentary
>>>> inconsistency - we are going to replace it with a new value anyway?
>>> IMO we should care.
>> Pablo, let me re-cap so that I make sure we are discussing the same
>> thing...
>>
>> Since nfnetlink_acct.c allows only packet or only byte counters to
>> be updated, this may cause inconsistency (a valid concern!). So,
>> what I proposed in my previous posting is roughly the following:
>>
>> nfnetlink_acct.c::nfnl_acct_new:
>> 	if (matching) {
>> /*...*/
>> 		if (tb[NFACCT_PKTS] && tb[NFACCT_BYTES]) {
>> 			atomic64_set(&nfacct->pkts,
>> 				be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS])));
>> 			atomic64_set(&nfacct->bytes,
>> 				be64_to_cpu(nla_get_be64(tb[NFACCT_BYTES])));
>> 		} else if (tb[NFACCT_PKTS] || tb[NFACCT_BYTES])	{
>> 			return -EINVAL;	// or "return -EBUSY;"
>> /*...*/
>>
>> If I understand you correctly, you are concerned that between the
>> time after "atomic64_set(&nfacct->pkts..." is executed and
>> "atomic64_set(&nfacct->bytes" not yet started/finished, even though
>> these two statements follow one after another, you think there will
>> be a momentary inconsistency between the byte and packet counter
>> values, is that right?
>>
>> If that is so, then the existing kernel code is having the same
>> problem. For example:
>>
>> a)
>> nfnetlink_acct.c::nfnl_acct_new:
>> 	if (matching) {
>> 		if (nlh->nlmsg_flags & NLM_F_REPLACE) {
>> 			/* reset counters if you request a replacement.	*/
>> 			atomic64_set(&matching->pkts, 0);
>> 			atomic64_set(&matching->bytes, 0);
> 
> We should check there if that acct object is in used.
Why? In my opinion, this is not necessary. The replace "mode" can be needed in both cases: where refcnt > 0 (nfacct object is used) and also where refcnt == 0 (nfacct object is not used). I don't see why do we need to make that distinction - we should be able to replace nfacct properties regardless of whether that object is used (in other words, regardless of the value of refcnt).

> I think there is
> no user-space code using the NLM_F_REPLACE flag IIRC.
I don't know whether that is the case or not as I don't have access to all machines on which this kernel code is deployed. We certainly are using this (as part of the "nfacct add/get" as well as "nfacct restore").

The above, I think, is a mute point since the code is there and is part of the current kernel and therefore we should consider the possibility that this functionality has been used (otherwise why implement it?).

The point I was trying to make is that what I proposed with the patch is not different from what currently exists in the kernel code (above) in view of your concerns.

> 
>> 			return 0;
>> 		}
>> 		return -EBUSY;
>> 	}
>>
>> b)
>> nfnetlink_acct.c::nfnl_acct_fill_info:
>> 	if (type == NFNL_MSG_ACCT_GET_CTRZERO) {
>> 		pkts = atomic64_xchg(&acct->pkts, 0);
>> 		bytes =	atomic64_xchg(&acct->bytes, 0);
>> 	} else {
>> 		pkts = atomic64_read(&acct->pkts);
>> 		bytes =	atomic64_read(&acct->bytes);
>> 	}
> 
> In this case, we don't leak packet/bytes counters.
Isn't the first part of this "If" statement the equivalent of what is currently done in the existing nfnetlink_acct.c::nfnl_acct_new (clearing both counters on replace "mode")? To me that is the same thing with the same problems I described above. The only difference is that we save the "old" values in a pair of variables.

> 
>> c)
>> nfnetlink_acct.c::nfnl_acct_update:
>> 	atomic64_inc(&nfacct->pkts);
>> 	atomic64_add(skb->len, &nfacct->bytes);
>>
>> In all of the above cases (a-c) the update statements are identical
>> with what I proposed above, so if this is undesirable, then the
>> existing code in the kernel in the above 3 cases I listed should
>> also be changed, correct?
> 
> I don't see why c) is wrong. My concern is mostly about leaking data.
I am assuming that by that you mean byte and packet counters not updated synchronously when we "lose" either packet or byte counts, correct?

>>> Let me check if I can come with some lockless way
>>> to replace counters, so we can remove that -EBUSY limitation.
>> This discussion we are having pointed me to some interesting idea.
>>
>> Packet and byte counters are always updated in "unison" - one cannot
>> (or should not) be updated without the other. The current "atomic"
>> operations are performed separately (and independently!) for packet
>> and then byte counters. Each atomic operation (atomic64_set,
>> atomic64_inc etc) does a spinlock, again, separately for packet and
>> byte counters. If we are always updating these variables in "unison"
>> together, then we don't really need to do 2 locks - we need just
>> one.
>>
>> In other words, we can "demote" pkts and bytes variables from
>> "atomic64_t" to uint64_t and use a manual lock which is specific for
>> each nfacct object to update these values. In other words, something
>> like this:
>>
>> struct nf_acct {
>> 	uint64_t		pkts;
>> 	uint64_t		bytes;
>> 	atomic64_t		bthr;
>> 	atomic_t		fmt;
>> 	struct list_head	head;
>> 	atomic_t		refcnt;
>> 	char			name[NFACCT_NAME_MAX];
>> 	/* Lock	protecting packet & byte counters */
>> 	rwlock_t		pb_lock;
>> 	struct rcu_head		rcu_head;
>> };
>> /*...*/
>>
>> nfnetlink_acct.c::nfnl_acct_new:
>> 	if (matching) {
>> /*...*/
>> 		if (tb[NFACCT_PKTS] && tb[NFACCT_BYTES]) {
>> 			write_lock_bh(&nfacct->pb_lock);
>> 			nfacct->pkts =
>> 				be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS]));
>> 			nfacct->bytes =
>> 				be64_to_cpu(nla_get_be64(tb[NFACCT_BYTES]));
>> 			write_unlock_bh(&nfacct->pb_lock);
>> 		} else if (tb[NFACCT_PKTS] || tb[NFACCT_BYTES])	{
>> 			return -EINVAL;	// or "return -EBUSY;"
>> /*...*/
>>
>> 	nfacct = kzalloc(sizeof(struct nf_acct), GFP_KERNEL);
>> 	if (nfacct == NULL)
>> 		return -ENOMEM;
>>
>> 	strncpy(nfacct->name, nla_data(tb[NFACCT_NAME]), NFACCT_NAME_MAX);
>> 	rwlock_init(&nfacct->pb_lock);
>> /*...*/
>>
>> nfnetlink_acct.c::nfnl_acct_fill_info:
>> 	if (type == NFNL_MSG_ACCT_GET_CTRZERO) {
>> 		write_lock_bh(&acct->pb_lock);
>> 		pkts = acct->pkts;
>> 		bytes =	acct->bytes;
>> 		acct->pkts = 0;
>> 		acct->bytes = 0;
>> 		write_unlock_bh(&acct->pb_lock);
>> 	} else {
>> 		read_lock_bh(&acct->pb_lock);
>> 		pkts = acct->pkts;
>> 		bytes =	acct->bytes;
>> 		read_unlock_bh(&acct->pb_lock);
>> 	}
>> /*...*/
>>
>> nfnetlink_acct.c::nfnl_acct_update:
>> 	write_lock_bh(&nfacct->pb_lock);
>> 	nfacct->pkts += 1;
>> 	nfacct->bytes += skb->len;
>> 	write_unlock_bh(&nfacct->pb_lock);
>>
>> That way, we use only one lock (instead of two as was the case with
>> "atomic64_set" and friends) and there should not be any concerns any
>> longer as to whether there will be any inconsistency between packet
>> and byte counter values because these are applied in "unison" during
>> a write lock, which is later released after the operation is
>> complete. What do you think, would that be satisfactory?
> 
> Eric Dumazet proposed during the latest workshop that we could add
> some flag to allow different flavours of acct object:
> 
> * per-cpu based, they consume more memory, but they are lockless.
By "lockless" you mean using atomic-types? They aren't "lockless" since the atomic operations do uses a lock. As I pointed out before, for updating packet and bytes counters we use two such locks (where I think only one is needed since we should update the byte and packet counters together, not separate).

> * compact, lock based, they consume little memory, but bottleneck is
>   locking.
I'll see what would be the difference in performance between atomic locking (or "lockless" as you mentioned above) and using a "proper" lock (what I proposed with my previous posting) - it would be interesting to check this out.

> So you select what you need according to your requirements.  I plan to
> send some patches to support this.
OK, at least for the kernel part, I will then update the patches to v2 in a few moments and make sure that:
* packet and byte counters are updated together in nfnl_acct_new and issue "EINVAL" error if that is not the case;
* fmt and bthr are "normal" variables and drop the "atomic" part;

Thanks Pablo.


MZ




--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso April 19, 2013, 2:04 a.m. UTC | #7
On Sun, Apr 14, 2013 at 10:50:26AM +0100, Michael Zintakis wrote:
[...]
> > You can modify kernel information if you manage to load a module and
> > have some little knowlege on where to find things in the memory space.
> > Truly, I think you can achive this by protecting the configuration
> > file from changes in your tree and some extra checksumming to follow
> > the track of unexpected changes.
>
> I understand that Pablo, but this is quite a bit far-fetched and, in
> my view, less practical (both in terms of performance as well as
> memory footprint) where compared to just using these 2 properties in
> the kernel where they are protected without the need to write extra
> code.
> 
> With placing these properties into the kernel they will also occupy
> less memory and the performance would be better, compared to if all
> this is done in user space (not to mention possible synchronization
> issues as the code in user space needs to "track" nfacct objects in
> the kernel and do that flawlessly).

Check /etc/iproute2/ files, they all contain mappings and information
that are parsed from user-space utilities in runtime by many utilities
and daemons. Nobody has put that information into the kernel.

And more relevantly, these fields you want to add are bloating the
accounting structure for people that don't want your feature.

Sorry, you have to do this from user-space, you have the
infrastructure and the libraries to make it from there.

[...]
> > I mean that ulogd2 provides the framework for Netfilter's user-space
> > logging. It's pretty extensible. You will surely need to make some
> > changes for what you need, but most of the infrastructure is already
> > there. No need for another separated daemon.
>
> My concern with this is that even though you say it is extensible,
> there will be quite a lot of redesigning and rewriting of code - the
> ulogd2 primary function is to check packets and pipe this through
> various "channels" defined in advance at start up, which is quite
> different from the functionality required by the nfacct daemon.

ulogd2 is a generic framework in user-space for packet logging, it
could be extended relatively easy to support what you need.

[...]
> >> a)
> >> nfnetlink_acct.c::nfnl_acct_new:
> >> 	if (matching) {
> >> 		if (nlh->nlmsg_flags & NLM_F_REPLACE) {
> >> 			/* reset counters if you request a replacement.	*/
> >> 			atomic64_set(&matching->pkts, 0);
> >> 			atomic64_set(&matching->bytes, 0);
> > 
> > We should check there if that acct object is in used.
>
> Why? In my opinion, this is not necessary. The replace "mode" can be
> needed in both cases: where refcnt > 0 (nfacct object is used) and
> also where refcnt == 0 (nfacct object is not used). I don't see why
> do we need to make that distinction - we should be able to replace
> nfacct properties regardless of whether that object is used (in
> other words, regardless of the value of refcnt).

You have the operation to atomically dump and reset counters that are
in used via NFNL_MSG_ACCT_GET_CTRZERO with the NLM_F_DUMP flag set (or
nfacct list reset in case you want to use the command line utility).
That allows you to fetch the counters while resetting them, without
leaking counter information.

[...]
> By "lockless" you mean using atomic-types? They aren't "lockless"
> since the atomic operations do uses a lock.

No. By Lockless I mean per-cpu based counters that require no locking
at all.

Regards.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Zintakis July 10, 2013, 6:22 p.m. UTC | #8
Hello Pablo,


Pablo Neira Ayuso wrote:
> On Sun, Apr 14, 2013 at 10:50:26AM +0100, Michael Zintakis wrote:
>> With placing these properties into the kernel they will also occupy
>> less memory and the performance would be better, compared to if all
>> this is done in user space (not to mention possible synchronization
>> issues as the code in user space needs to "track" nfacct objects in
>> the kernel and do that flawlessly).
> 
> Check /etc/iproute2/ files, they all contain mappings and information
> that are parsed from user-space utilities in runtime by many utilities
> and daemons. Nobody has put that information into the kernel.
Very true, but that information in this file is self-contained and relates to one logical object. The case with nfacct is very different since if we use both a file and kernel memory to store information about a single object (nfacct in this case), then the properties of that object will be in two separate places, which will require extra coding to "sync" both "sides" and even then, this is not guaranteed that the "file" side won't contain "useless" data.

Say I execute "nfacct flush" - what then? The kernel will clear the relevant part of the nfacct properties, leaving only nfacct objects intact which have been used by iptables. So, unless you make another extra round of kernel call to determine what is left in the main table, then the "file" side is bound to contain "useless" data as you mentioned it before.

I also mentioned in previous post that the information contained in the "file" side may, deliberately or otherwise, be changed, which at least in our case, is not desirable since the nfacct (executable), as well as nfacctd (the daemon) will rely heavily on it.
 
> And more relevantly, these fields you want to add are bloating the
> accounting structure for people that don't want your feature.
That is an unfair point Pablo! Like everything else in the kernel (and not only there - user space libraries or tools as well), when new features are introduced, they are not always "optional" - you don't have kernel config menu "bloated" with all 1000's of possible features of a given components since the beginning of time listed as y/n choices - the kernel config would look like a Christmas shopping list.
 
>> My concern with this is that even though you say it is extensible,
>> there will be quite a lot of redesigning and rewriting of code - the
>> ulogd2 primary function is to check packets and pipe this through
>> various "channels" defined in advance at start up, which is quite
>> different from the functionality required by the nfacct daemon.
> 
> ulogd2 is a generic framework in user-space for packet logging, it
> could be extended relatively easy to support what you need.
We decided at the end to go with the nfacctd (nfacct daemon) route because everything there will be tailored for the specific purpose of that daemon and we won't be "bloating" it with extra features/code we don't need as was the case with ulog daemon. We just completed a major part of the testing and the code is fairly stable - more or less.

In the next couple of posts I will enclose the latest nfacct patches (I am getting to know git better these days, hehe) where I followed your previous advice and "split" everything up in smaller "tasks" so that it makes it easier for you/anybody else to look at it. I also created these patches in a specific sequence so that the more "complex" features are left towards the end.

I also excluded the entire nfacct daemon (nfacctd) code from the 2 user space components - will include this when it is fully completed and we are confident that it does its job and if there is interest in it, of course.


MZ
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/uapi/linux/netfilter/nfnetlink_acct.h b/include/uapi/linux/netfilter/nfnetlink_acct.h
index c7b6269..f07e825 100644
--- a/include/uapi/linux/netfilter/nfnetlink_acct.h
+++ b/include/uapi/linux/netfilter/nfnetlink_acct.h
@@ -18,6 +18,8 @@  enum nfnl_acct_type {
 	NFACCT_NAME,
 	NFACCT_PKTS,
 	NFACCT_BYTES,
+	NFACCT_BTHR,
+	NFACCT_FMT,
 	NFACCT_USE,
 	__NFACCT_MAX
 };
diff --git a/net/netfilter/nfnetlink_acct.c b/net/netfilter/nfnetlink_acct.c
index 589d686..bcd4ae8 100644
--- a/net/netfilter/nfnetlink_acct.c
+++ b/net/netfilter/nfnetlink_acct.c
@@ -32,6 +32,8 @@  static LIST_HEAD(nfnl_acct_list);
 struct nf_acct {
 	atomic64_t		pkts;
 	atomic64_t		bytes;
+	atomic64_t		bthr;
+	atomic_t		fmt;
 	struct list_head	head;
 	atomic_t		refcnt;
 	char			name[NFACCT_NAME_MAX];
@@ -63,9 +65,55 @@  nfnl_acct_new(struct sock *nfnl, struct sk_buff *skb,
 
 	if (matching) {
 		if (nlh->nlmsg_flags & NLM_F_REPLACE) {
-			/* reset counters if you request a replacement. */
+			/* reset counters if you request a replacement */
+			if (!tb[NFACCT_PKTS]) {
+				/*
+				 * Prevent resetting the packets counter if
+				 * either fmt or bthr are specified.
+				 *
+				 * This is done for backward compatibility,
+				 * otherwise resetting these counters should
+				 * only be allowed when tb[NFACCT_PKTS] is
+				 * explicitly specified and == 0.
+				 *
+				 */
+				if (!tb[NFACCT_FMT] &&
+				    !tb[NFACCT_BTHR]) {
 			atomic64_set(&matching->pkts, 0);
+				}
+			} else {
+				atomic64_set(&matching->pkts,
+				be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS])));
+			}
+			if (!tb[NFACCT_BYTES]) {
+				/*
+				 * Prevent resetting the packets counter if
+				 * either fmt or bthr are specified.
+				 *
+				 * This is done for backward compatibility,
+				 * otherwise resetting these counters should
+				 * only be allowed when tb[NFACCT_BYTES] is
+				 * explicitly specified and == 0.
+				 *
+				 */
+				if (!tb[NFACCT_FMT] &&
+				    !tb[NFACCT_BTHR]) {
 			atomic64_set(&matching->bytes, 0);
+				}
+			} else {
+				atomic64_set(&matching->bytes,
+				be64_to_cpu(nla_get_be64(tb[NFACCT_BYTES])));
+			}
+			/* ...and change the format... */
+			if (tb[NFACCT_FMT]) {
+				atomic_set(&matching->fmt,
+				be32_to_cpu(nla_get_be32(tb[NFACCT_FMT])));
+			}
+			/* ... as well as the bytes threshold */
+			if (tb[NFACCT_BTHR]) {
+				atomic64_set(&matching->bthr,
+				be64_to_cpu(nla_get_be64(tb[NFACCT_BTHR])));
+			}
 			return 0;
 		}
 		return -EBUSY;
@@ -85,6 +133,14 @@  nfnl_acct_new(struct sock *nfnl, struct sk_buff *skb,
 		atomic64_set(&nfacct->pkts,
 			     be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS])));
 	}
+	if (tb[NFACCT_FMT]) {
+		atomic_set(&nfacct->fmt,
+			   be32_to_cpu(nla_get_be32(tb[NFACCT_FMT])));
+	}
+	if (tb[NFACCT_BTHR]) {
+		atomic64_set(&nfacct->bthr,
+			     be64_to_cpu(nla_get_be64(tb[NFACCT_BTHR])));
+	}
 	atomic_set(&nfacct->refcnt, 1);
 	list_add_tail_rcu(&nfacct->head, &nfnl_acct_list);
 	return 0;
@@ -121,6 +177,9 @@  nfnl_acct_fill_info(struct sk_buff *skb, u32 portid, u32 seq, u32 type,
 	}
 	if (nla_put_be64(skb, NFACCT_PKTS, cpu_to_be64(pkts)) ||
 	    nla_put_be64(skb, NFACCT_BYTES, cpu_to_be64(bytes)) ||
+	    nla_put_be64(skb, NFACCT_BTHR,
+			 cpu_to_be64(atomic64_read(&acct->bthr))) ||
+	    nla_put_be32(skb, NFACCT_FMT, htonl(atomic_read(&acct->fmt))) ||
 	    nla_put_be32(skb, NFACCT_USE, htonl(atomic_read(&acct->refcnt))))
 		goto nla_put_failure;
 
@@ -265,6 +324,8 @@  static const struct nla_policy nfnl_acct_policy[NFACCT_MAX+1] = {
 	[NFACCT_NAME] = { .type = NLA_NUL_STRING, .len = NFACCT_NAME_MAX-1 },
 	[NFACCT_BYTES] = { .type = NLA_U64 },
 	[NFACCT_PKTS] = { .type = NLA_U64 },
+	[NFACCT_BTHR] = { .type = NLA_U64 },
+	[NFACCT_FMT] = { .type = NLA_U32 },
 };
 
 static const struct nfnl_callback nfnl_acct_cb[NFNL_MSG_ACCT_MAX] = {