iptables nfacct match question

Message ID 512BC79F.1070708@googlemail.com
State Superseded
Commit Message

Michael Zintakis Feb. 25, 2013, 8:20 p.m. UTC
> Could you develop some example usage of your extension?
Sure. As part of the services we provide, we need to "meter" and/or restrict user traffic. Since we value everything in "megabytes" (a misnomer, actually, since it is really mebibytes we all speak of), it is much easier to have specific counters shown as, or "locked" to, MiB.

Also, for some other types the traffic is not that big, so a higher "resolution" (i.e. KiB or plain bytes, formatted "nicely") is required.

> Yes, this is how our (limited) revision infrastructure works at this
> moment.
Thank you Pablo.

I've given up on my initial idea, which was to create this custom formatting (as well as the object itself) at the point where the first iptables statement is created for a particular nfacct object. Instead I adopted a "plan b", where everything is done via the "nfacct" executable.

In the end, I've left iptables alone and dealt only with the other 3 (core) nfacct components, which do the job quite nicely.

If there is interest in what I've done, or anyone on this list finds this useful, here is a full description. I am also attaching 3 patches, which cover what I describe below. I don't know what the formal "submission" process is, but if some of the Netfilter guys find this useful or wish to expand on it, I'll gladly prepare formal manual pages/text/notes - just let me know.

On the other hand, if you have a bit of critique or find something I've done wrong in the code - please do not hesitate and fire away!

The additional functionality I've implemented is to modify the original nfacct components (nfacct executable, libnetfilter_acct userspace library and nfacct kernel component) to store and display "bytes" and "packets" numbers according to a custom format, specified at nfacct object creation time (in other words, during "nfacct add").

The format of "bytes" and "packets" can be independently specified. In other words, the format for showing "packet" numbers can be different from the format used for "bytes" for each individual nfacct object.
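Because the two formats are independent, both have to be carried per object. In the attached patch they travel in a single 32-bit attribute, with the "packets" format in bits 4-7 and the "bytes" format in bits 0-3 (the patch masks with 240 and 15). A minimal sketch of that packing follows; the helper names are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; only the nibble layout comes from the patch:
 * packets format in bits 4-7, bytes format in bits 0-3. */
uint32_t fmt_pack(uint32_t pkts_fmt, uint32_t bytes_fmt)
{
	/* Each format index must fit in four bits (0..15). */
	assert(pkts_fmt < 16 && bytes_fmt < 16);
	return (pkts_fmt << 4) | bytes_fmt;
}

/* Extract the packets format (upper nibble, mask 240). */
uint32_t fmt_pkts(uint32_t fmt)
{
	return (fmt & 0xF0) >> 4;
}

/* Extract the bytes format (lower nibble, mask 15). */
uint32_t fmt_bytes(uint32_t fmt)
{
	return fmt & 0x0F;
}
```

With the enum order used in the patch, "3pl" is 1 and "mib" is 4, so "3pl,mib" would be stored as 0x14.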

This (custom) format can be specified at object creation time and takes the following form:

nfacct add <object-name> [[fmt][,fmt]] 

The first component indicates the format for "packets"; the second, if specified, the format for "bytes". Here "fmt" is one of the following:

def: 00000000000001048576  - default format, as is the case at present
3pl:            1,048,576  - display numbers in "triplets", separated by the "thousand separator" symbol, which is locale-dependent
iec:           133.012MiB  - display numbers according to the IEC standard (KiB/MiB...); the suffix is determined "automatically" from the value
kib:       136,204.288KiB  - same as "iec", but the suffix is "locked" as KiB (kibibytes)
mib:           133.012MiB  - same as "iec", but the suffix is "locked" as MiB (mebibytes)
gib:             0.130GiB  - same as "iec", but the suffix is "locked" as GiB (gibibytes)
tib:         1,008.345TiB  - same as "iec", but the suffix is "locked" as TiB (tebibytes)
pib:         1,008.345PiB  - same as "iec", but the suffix is "locked" as PiB (pebibytes)
eib:         1,008.345EiB  - same as "iec", but the suffix is "locked" as EiB (exbibytes)
si:             139.473MB  - display numbers according to the "old" SI standard (KB/MB...); the suffix is determined "automatically" from the value
kb:         139,473.191KB  - same as "si", but the suffix is "locked" as KB (kilobytes)
mb:             139.473MB  - same as "si", but the suffix is "locked" as MB (megabytes)
gb:               0.139GB  - same as "si", but the suffix is "locked" as GB (gigabytes)
tb:           1,082.702TB  - same as "si", but the suffix is "locked" as TB (terabytes)
pb:           1,082.702PB  - same as "si", but the suffix is "locked" as PB (petabytes)
eb:           1,082.702EB  - same as "si", but the suffix is "locked" as EB (exabytes)

A note about the "iec" and "si" values: when these formats are specified, the suffix is determined "automatically" from the value of the actual number. For example, a value of 1048576 bytes will be shown as "1.000MiB" (or "1.049MB" if "si" is used). However, a value of 999999 bytes will be shown as "976.562KiB" (or "999.999KB" respectively).
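The automatic suffix selection can be sketched roughly as below. This is not the patch's format_number(): the function name and signature are hypothetical, it uses double rather than float, and it leaves out locale grouping. It only illustrates picking the largest unit that does not exceed the value and printing three decimals.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: iec != 0 selects 1024-based units (KiB, MiB...),
 * iec == 0 selects 1000-based units (KB, MB...). */
void format_auto(unsigned long long val, int iec, char *buf, size_t len)
{
	static const char *const iec_sfx[] = {
		"", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"
	};
	static const char *const si_sfx[] = {
		"", "KB", "MB", "GB", "TB", "PB", "EB"
	};
	const double base = iec ? 1024.0 : 1000.0;
	double v = (double) val;
	int unit = 0;

	assert(buf != NULL && len > 0);

	/* Step up one unit at a time until the value drops below the base. */
	while (v >= base && unit < 6) {
		v /= base;
		unit++;
	}
	snprintf(buf, len, "%.3f%s", v, iec ? iec_sfx[unit] : si_sfx[unit]);
}
```

Called with 1048576 this yields "1.000MiB" for the IEC variant and "1.049MB" for the SI variant, matching the figures quoted above. Values below the base are printed without any suffix.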

Another note on the locale-specific formats, where the "thousand separator" or "decimal point" symbols are used: these are determined "automatically" and depend on the language settings of the particular machine/server/embedded device on which the libnetfilter_acct userspace library is installed. If that locale cannot be determined, then "en_GB" is used.
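For readers unfamiliar with locale-aware grouping in C, here is a small sketch of the idea. It is an approximation only: the patch itself reads the LANG variable and calls setlocale(LC_ALL, ...), whereas this hypothetical helper lets setlocale() consult the environment and touches only LC_NUMERIC. The "en_GB" fallback may not be installed on every system, in which case the plain "C" locale (no separators) stays in effect.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

#define FALLBACK_LOCALE "en_GB"	/* the fallback named in the post */

/* Hypothetical helper: format val with the locale's thousands separator.
 * Returns what snprintf() returns. */
int format_triplets(unsigned long long val, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);

	/* "" asks for the locale configured in the environment. */
	if (setlocale(LC_NUMERIC, "") == NULL)
		setlocale(LC_NUMERIC, FALLBACK_LOCALE);

	/* The ' flag is a POSIX (XSI) extension, not ISO C; it inserts
	 * the locale's grouping character, if the locale defines one. */
	return snprintf(buf, len, "%'llu", val);
}
```

The exact output depends on the locale in use: an en_GB locale would typically give "1,048,576" for 1048576, while the "C" locale prints the digits ungrouped.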

Also, if the "fmt" value for a particular component ("packets" or "bytes") is not specified, then the default ("def") is assumed. For example: 

a) "nfacct add in12 ,mib" is the equivalent of "nfacct add in12 def,mib";
b) "nfacct add in12 3pl," is the equivalent of "nfacct add in12 3pl,def"; and
c) "nfacct add in12 ," is the equivalent of "nfacct add in12", as well as "nfacct add in12 def,def";

Another point worth mentioning: if only one "fmt" component is specified, it is assumed that the specified format applies to *both* "bytes" and "packets". In other words:

a) "nfacct add in12 3pl" is the equivalent of "nfacct add in12 3pl,3pl"
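The defaulting rules above can be sketched as a small parser. This is an independent illustration, not the nfacct_parse_format_options() function from the patch; the function names are hypothetical, and the name table simply follows the order of the enum in the patch (so "def" is 0, "3pl" is 1, "mib" is 4).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char *const fmt_names[] = {
	"def", "3pl", "iec", "kib", "mib", "gib", "tib", "pib", "eib",
	"si", "kb", "mb", "gb", "tb", "pb", "eb",
};
#define FMT_COUNT ((int) (sizeof(fmt_names) / sizeof(fmt_names[0])))

/* Look up one component of length n; an empty component means "def". */
static int lookup_fmt(const char *s, size_t n)
{
	if (n == 0)
		return 0;
	for (int i = 0; i < FMT_COUNT; i++)
		if (strlen(fmt_names[i]) == n &&
		    strncmp(fmt_names[i], s, n) == 0)
			return i;
	return -1;
}

/* Hypothetical parser for "[fmt][,fmt]".
 * Returns 0 on success, -1 on an unknown or malformed spec. */
int parse_fmt_spec(const char *spec, int *pkts, int *bytes)
{
	const char *comma;

	assert(spec != NULL && pkts != NULL && bytes != NULL);
	comma = strchr(spec, ',');

	if (comma == NULL) {
		/* A single component applies to both counters. */
		*pkts = *bytes = lookup_fmt(spec, strlen(spec));
	} else {
		if (strchr(comma + 1, ',') != NULL)
			return -1;	/* more than two components */
		*pkts = lookup_fmt(spec, (size_t) (comma - spec));
		*bytes = lookup_fmt(comma + 1, strlen(comma + 1));
	}
	return (*pkts < 0 || *bytes < 0) ? -1 : 0;
}
```

With this sketch, ",mib" parses to (def, mib), "3pl," to (3pl, def), "," to (def, def) and "3pl" to (3pl, 3pl), as in the examples above.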

Finally, I've tested these patches against "peculiar" userspace/kernel combinations (i.e. old nfacct/libnetfilter_acct with a new kernel and vice versa) - they seem to work OK.

--- a/include/libnetfilter_acct/libnetfilter_acct.h
+++ b/include/libnetfilter_acct/libnetfilter_acct.h
@@ -11,6 +11,27 @@
 	NFACCT_ATTR_NAME = 0,
 	NFACCT_ATTR_PKTS,
 	NFACCT_ATTR_BYTES,
+	NFACCT_ATTR_FMT,
+};
+
+enum nfacct_format {
+	FMT_DEFAULT=0,		/*   00001048576 			*/
+	FMT_TRIPLETS,		/*     1,048,576 - locale-dependent 	*/
+	FMT_IEC,		/*       133.012MiB - dynamic 		*/
+	FMT_IEC_KIBIBYTE,	/*     1,145.178KiB - fixed   		*/
+	FMT_IEC_MEBIBYTE,	/*     1,145.178MiB - fixed   		*/
+	FMT_IEC_GIBIBYTE,	/*     1,145.178GiB - fixed   		*/
+	FMT_IEC_TEBIBYTE,	/*     1,145.178TiB - fixed   		*/
+	FMT_IEC_PEBIBYTE,	/*     1,145.178PiB - fixed   		*/
+	FMT_IEC_EXBIBYTE,	/*     1,145.178EiB - fixed   		*/
+	FMT_SI,			/*       133.012MB  - dynamic 		*/
+	FMT_SI_KILOBYTE,	/*     1,145.178KB  - fixed   		*/
+	FMT_SI_MEGABYTE,	/*     1,145.178MB  - fixed   		*/
+	FMT_SI_GIGABYTE,	/*     1,145.178GB  - fixed   		*/
+	FMT_SI_TERABYTE,	/*     1,145.178TB  - fixed   		*/
+	FMT_SI_PETABYTE,	/*     1,145.178PB  - fixed   		*/
+	FMT_SI_EXABYTE,		/*     1,145.178EB  - fixed   		*/
+	FMT_MAX,
 };
 
 struct nfacct *nfacct_alloc(void);
@@ -19,11 +40,13 @@
 void nfacct_attr_set(struct nfacct *nfacct, enum nfacct_attr_type type, const void *data);
 void nfacct_attr_set_str(struct nfacct *nfacct, enum nfacct_attr_type type, const char *name);
 void nfacct_attr_set_u64(struct nfacct *nfacct, enum nfacct_attr_type type, uint64_t value);
+void nfacct_attr_set_u32(struct nfacct *nfacct, enum nfacct_attr_type type, uint32_t value);
 void nfacct_attr_unset(struct nfacct *nfacct, enum nfacct_attr_type type);
 
 const void *nfacct_attr_get(struct nfacct *nfacct, enum nfacct_attr_type type);
 const char *nfacct_attr_get_str(struct nfacct *nfacct, enum nfacct_attr_type type);
 uint64_t nfacct_attr_get_u64(struct nfacct *nfacct, enum nfacct_attr_type type);
+uint32_t nfacct_attr_get_u32(struct nfacct *nfacct, enum nfacct_attr_type type);
 
 struct nlmsghdr;
 
--- a/include/linux/netfilter/nfnetlink_acct.h
+++ b/include/linux/netfilter/nfnetlink_acct.h
@@ -18,6 +18,7 @@
 	NFACCT_NAME,
 	NFACCT_PKTS,
 	NFACCT_BYTES,
+	NFACCT_FMT,
 	NFACCT_USE,
 	__NFACCT_MAX
 };
--- a/src/libnetfilter_acct.c
+++ b/src/libnetfilter_acct.c
@@ -13,6 +13,7 @@
 #include <endian.h>
 #include <stdlib.h>
 #include <string.h>
+#include <locale.h>
 
 #include <libmnl/libmnl.h>
 #include <linux/netfilter/nfnetlink.h>
@@ -59,6 +60,7 @@
 	char		name[NFACCT_NAME_MAX];
 	uint64_t	pkts;
 	uint64_t	bytes;
+	uint32_t	fmt;
 	uint32_t	bitset;
 };
 
@@ -113,6 +115,10 @@
 		nfacct->bytes = *((uint64_t *) data);
 		nfacct->bitset |= (1 << NFACCT_ATTR_BYTES);
 		break;
+	case NFACCT_ATTR_FMT:
+		nfacct->fmt = *((uint32_t *) data);
+		nfacct->bitset |= (1 << NFACCT_ATTR_FMT);
+		break;
 	}
 }
 EXPORT_SYMBOL(nfacct_attr_set);
@@ -146,6 +152,20 @@
 EXPORT_SYMBOL(nfacct_attr_set_u64);
 
 /**
+ * nfacct_attr_set_u32 - set one attribute the accounting object
+ * \param nfacct pointer to the accounting object
+ * \param type attribute type you want to set
+ * \param value unsigned 32-bit integer
+ */
+void
+nfacct_attr_set_u32(struct nfacct *nfacct, enum nfacct_attr_type type,
+		    uint32_t value)
+{
+	nfacct_attr_set(nfacct, type, &value);
+}
+EXPORT_SYMBOL(nfacct_attr_set_u32);
+
+/**
  * nfacct_attr_unset - unset one attribute the accounting object
  * \param nfacct pointer to the accounting object
  * \param type attribute type you want to set
@@ -163,6 +183,9 @@
 	case NFACCT_ATTR_BYTES:
 		nfacct->bitset &= ~(1 << NFACCT_ATTR_BYTES);
 		break;
+	case NFACCT_ATTR_FMT:
+		nfacct->bitset &= ~(1 << NFACCT_ATTR_FMT);
+		break;
 	}
 }
 EXPORT_SYMBOL(nfacct_attr_unset);
@@ -192,6 +215,10 @@
 		if (nfacct->bitset & (1 << NFACCT_ATTR_BYTES))
 			ret = &nfacct->bytes;
 		break;
+	case NFACCT_ATTR_FMT:
+		if (nfacct->bitset & (1 << NFACCT_ATTR_FMT))
+			ret = &nfacct->fmt;
+		break;
 	}
 	return ret;
 }
@@ -227,19 +254,180 @@
 }
 EXPORT_SYMBOL(nfacct_attr_get_u64);
 
+/**
+ * nfacct_attr_get_u32 - get one attribute the accounting object
+ * \param nfacct pointer to the accounting object
+ * \param type attribute type you want to get
+ *
+ * This function returns an unsigned 32-bit integer. If the attribute is
+ * unset, this returns 0.
+ */
+uint32_t nfacct_attr_get_u32(struct nfacct *nfacct, enum nfacct_attr_type type)
+{
+	const void *ret = nfacct_attr_get(nfacct, type);
+	return ret ? *((uint32_t *)ret) : 0;
+}
+EXPORT_SYMBOL(nfacct_attr_get_u32);
+
+#define KiB ((unsigned long long) 1 << 10)
+#define MiB ((unsigned long long) 1 << 20)
+#define GiB ((unsigned long long) 1 << 30)
+#define TiB ((unsigned long long) 1 << 40)
+#define PiB ((unsigned long long) 1 << 50)
+#define EiB ((unsigned long long) 1 << 60)
+#define KB ((unsigned long long)  1*1000)
+#define MB ((unsigned long long) KB*1000)
+#define GB ((unsigned long long) MB*1000)
+#define TB ((unsigned long long) GB*1000)
+#define PB ((unsigned long long) TB*1000)
+#define EB ((unsigned long long) PB*1000)
+
+#define STR_FMT_PLAIN		"{ pkts = %s, bytes = %s } = %s;"
+#define STR_FMT_XML		"<obj><name>%s</name>"	\
+				"<pkts>%s</pkts>"	\
+				"<bytes>%s</bytes>"
+#define STR_FMT_DEFAULT		"%020.0f%s"
+#define STR_FMT_TRIPLETS	"%'26.0f%s"
+#define STR_FMT_SI_IEC		"%'26.3f%s"
+#define STR_FMT_XML_DEFAULT	STR_FMT_DEFAULT
+#define STR_FMT_XML_TRIPLETS	"%'.0f%s"
+#define STR_FMT_XML_SI_IEC	"%'.3f%s"
+
+struct nfacct_number {
+	float value;
+	enum nfacct_format fmt;
+	char *fmt_str;
+};
+
+struct fmt_key {
+	unsigned long long num;
+	char name[4];
+};
+
+static struct fmt_key fmt_keys[] = {
+	[FMT_DEFAULT] 		= { .num = 0, .name = "" },
+	[FMT_TRIPLETS] 		= { .num = 0, .name = "" },
+	[FMT_IEC] 		= { .num = 0, .name = "" },
+	[FMT_IEC_KIBIBYTE] 	= { .num = KiB,	.name = "KiB" },
+	[FMT_IEC_MEBIBYTE] 	= { .num = MiB,	.name = "MiB" },
+	[FMT_IEC_GIBIBYTE] 	= { .num = GiB,	.name = "GiB" },
+	[FMT_IEC_TEBIBYTE] 	= { .num = TiB,	.name = "TiB" },
+	[FMT_IEC_PEBIBYTE] 	= { .num = PiB,	.name = "PiB" },
+	[FMT_IEC_EXBIBYTE] 	= { .num = EiB,	.name = "EiB" },
+	[FMT_SI] 		= { .num = 0, .name = "" },
+	[FMT_SI_KILOBYTE] 	= { .num = KB, .name = "KB" },
+	[FMT_SI_MEGABYTE] 	= { .num = MB, .name = "MB" },
+	[FMT_SI_GIGABYTE] 	= { .num = GB, .name = "GB" },
+	[FMT_SI_TERABYTE] 	= { .num = TB, .name = "TB" },
+	[FMT_SI_PETABYTE] 	= { .num = PB, .name = "PB" },
+	[FMT_SI_EXABYTE] 	= { .num = EB, .name = "EB" },
+};
+
+#define SET_RET(x) 			\
+	ret.value /= fmt_keys[x].num;	\
+	ret.fmt = x;
+
+#define SET_RET_FMT(x)			\
+	ret.fmt = fmt;			\
+	ret.fmt_str = xml ? STR_FMT_XML_##x : STR_FMT_##x;
+
+static struct nfacct_number
+format_number(const unsigned long long val, const enum nfacct_format fmt, 
+	      const int xml)
+{
+	struct nfacct_number ret;
+	ret.value = (float) val;
+	SET_RET_FMT(SI_IEC);
+	switch (fmt) {	
+	case FMT_IEC:
+		if (ret.value >= EiB) {
+			SET_RET(FMT_IEC_EXBIBYTE);
+	        } else if (ret.value >= PiB) {
+			SET_RET(FMT_IEC_PEBIBYTE);
+	        } else if (ret.value >= TiB) {
+			SET_RET(FMT_IEC_TEBIBYTE);
+	        } else if (ret.value >= GiB) {
+			SET_RET(FMT_IEC_GIBIBYTE);
+	        } else if (ret.value >= MiB) {
+			SET_RET(FMT_IEC_MEBIBYTE);
+	        } else if (ret.value >= KiB) {
+			SET_RET(FMT_IEC_KIBIBYTE);
+		}
+		break;
+	case FMT_SI:
+		if (ret.value >= EB) {
+			SET_RET(FMT_SI_EXABYTE);
+	        } else if (ret.value >= PB) {
+			SET_RET(FMT_SI_PETABYTE);
+	        } else if (ret.value >= TB) {
+			SET_RET(FMT_SI_TERABYTE);
+	        } else if (ret.value >= GB) {
+			SET_RET(FMT_SI_GIGABYTE);
+	        } else if (ret.value >= MB) {
+			SET_RET(FMT_SI_MEGABYTE);
+	        } else if (ret.value >= KB) {
+			SET_RET(FMT_SI_KILOBYTE);
+		}
+		break;
+	case FMT_IEC_EXBIBYTE:
+	case FMT_IEC_PEBIBYTE:
+	case FMT_IEC_TEBIBYTE:
+	case FMT_IEC_GIBIBYTE:
+	case FMT_IEC_MEBIBYTE:
+	case FMT_IEC_KIBIBYTE:
+	case FMT_SI_EXABYTE:
+	case FMT_SI_PETABYTE:
+	case FMT_SI_TERABYTE:
+	case FMT_SI_GIGABYTE:
+	case FMT_SI_MEGABYTE:
+	case FMT_SI_KILOBYTE:
+		SET_RET(fmt);
+		break;
+	case FMT_DEFAULT:
+		SET_RET_FMT(DEFAULT);
+		break;
+	case FMT_TRIPLETS:
+		SET_RET_FMT(TRIPLETS);
+		break;
+	}
+	return ret;
+}
+
+#define DEFAULT_LOCALE "en_GB"
+
+static void init_locale(void) {
+	char *lang;
+	char *env = "LANG";
+	lang = getenv(env);
+	setlocale(LC_ALL,(lang == NULL ? DEFAULT_LOCALE : lang));
+}
+
 static int
 nfacct_snprintf_plain(char *buf, size_t rem, struct nfacct *nfacct,
 		      uint16_t flags)
 {
 	int ret;
+	char fmt_str[sizeof(STR_FMT_PLAIN) + 
+		     sizeof(STR_FMT_DEFAULT) * 2 + 10];
+	struct nfacct_number p;
+	struct nfacct_number b;
+	uint32_t fmt;
 
 	if (flags & NFACCT_SNPRINTF_F_FULL) {
-		ret = snprintf(buf, rem,
-			"{ pkts = %.20llu, bytes = %.20llu } = %s;",
-			(unsigned long long)
+		fmt = nfacct_attr_get_u32(nfacct, NFACCT_ATTR_FMT);
+		if (((fmt & 240) >> 4) || (fmt & 15)) // locale-dependent
+			init_locale();
+
+		p = format_number((unsigned long long)
 			nfacct_attr_get_u64(nfacct, NFACCT_ATTR_PKTS),
-			(unsigned long long)
+				  ((fmt & 240) >> 4), 0);
+		b = format_number((unsigned long long)
 			nfacct_attr_get_u64(nfacct, NFACCT_ATTR_BYTES),
+				  (fmt & 15), 0);
+		sprintf(fmt_str, STR_FMT_PLAIN,	p.fmt_str, b.fmt_str, "%s");
+		ret = snprintf(buf, rem, fmt_str, 
+				p.value, fmt_keys[p.fmt].name, 
+				b.value, fmt_keys[b.fmt].name,
 			nfacct_attr_get_str(nfacct, NFACCT_ATTR_NAME));
 	} else {
 		ret = snprintf(buf, rem, "%s\n",
@@ -293,16 +481,26 @@
 {
 	int ret = 0;
 	unsigned int size = 0, offset = 0;
+	char fmt_str[sizeof(STR_FMT_XML) + sizeof(STR_FMT_DEFAULT) * 2 + 10];
+	struct nfacct_number p;
+	struct nfacct_number b;
+	uint32_t fmt;
+
+	fmt = nfacct_attr_get_u32(nfacct, NFACCT_ATTR_FMT);
+	if (((fmt & 240) >> 4) || (fmt & 15)) // locale-dependent
+		init_locale();
 
-	ret = snprintf(buf, rem,
-			"<obj><name>%s</name>"
-			"<pkts>%.20llu</pkts>"
-			"<bytes>%.20llu</bytes>",
-			nfacct_attr_get_str(nfacct, NFACCT_ATTR_NAME),
-			(unsigned long long)
+	p = format_number((unsigned long long)
+			nfacct_attr_get_u64(nfacct, NFACCT_ATTR_PKTS),
+			  ((fmt & 240) >> 4), 1);
+	b = format_number((unsigned long long)
 			nfacct_attr_get_u64(nfacct, NFACCT_ATTR_BYTES),
-			(unsigned long long)
-			nfacct_attr_get_u64(nfacct, NFACCT_ATTR_PKTS));
+			  (fmt & 15), 1);
+	sprintf(fmt_str, STR_FMT_XML, "%s", p.fmt_str, b.fmt_str);
+	ret = snprintf(buf, rem, fmt_str, 
+			nfacct_attr_get_str(nfacct, NFACCT_ATTR_NAME),
+			p.value, fmt_keys[p.fmt].name, 
+			b.value, fmt_keys[b.fmt].name);
 	BUFFER_SIZE(ret, size, rem, offset);
 
 	if (flags & NFACCT_SNPRINTF_F_TIME) {
@@ -427,6 +625,9 @@
 
 	if (nfacct->bitset & (1 << NFACCT_ATTR_BYTES))
 		mnl_attr_put_u64(nlh, NFACCT_BYTES, htobe64(nfacct->bytes));
+
+	if (nfacct->bitset & (1 << NFACCT_ATTR_FMT))
+		mnl_attr_put_u32(nlh, NFACCT_FMT, htobe32(nfacct->fmt));
 }
 EXPORT_SYMBOL(nfacct_nlmsg_build_payload);
 
@@ -452,6 +653,12 @@
 			return MNL_CB_ERROR;
 		}
 		break;
+	case NFACCT_FMT:
+		if (mnl_attr_validate(attr, MNL_TYPE_U32) < 0) {
+			perror("mnl_attr_validate");
+			return MNL_CB_ERROR;
+		}
+		break;
 	}
 	tb[type] = attr;
 	return MNL_CB_OK;
@@ -481,6 +688,9 @@
 			    be64toh(mnl_attr_get_u64(tb[NFACCT_PKTS])));
 	nfacct_attr_set_u64(nfacct, NFACCT_ATTR_BYTES,
 			    be64toh(mnl_attr_get_u64(tb[NFACCT_BYTES])));
+	if (tb[NFACCT_FMT])
+		nfacct_attr_set_u32(nfacct, NFACCT_ATTR_FMT,
+				    be32toh(mnl_attr_get_u32(tb[NFACCT_FMT])));
 
 	return 0;
 }
--- a/src/nfacct.c
+++ b/src/nfacct.c
@@ -211,19 +211,61 @@
 	return 0;
 }
 
+static const char *fmt_options[FMT_MAX + 1] =	{ 
+						"def","3pl","iec","kib",
+						"mib","gib","tib","pib","eib",
+						"si","kb","mb", "gb", "tb",
+						"pb","eb",""
+						};
+
+static uint32_t nfacct_parse_format_options(const char *argv) {
+	char *ptr;
+	char *tmp = strdup(argv);
+	int i, j;
+	enum nfacct_format fmt[2] = { FMT_MAX, FMT_MAX };
+
+	for (i = 0; i <= 2 && tmp != NULL; i++) {
+		ptr = strsep(&tmp, ",");
+
+		if (ptr != NULL && strlen(ptr) == 0) {
+			fmt[i] = FMT_DEFAULT;
+		} else {
+			for (j = 0; j <= FMT_MAX && 
+			      strncmp(ptr, fmt_options[j], 3) != 0; j++) { ; }
+
+			if (j >= FMT_MAX)
+				break;
+
+			fmt[i] = j;
+		}
+	}
+
+	if (i > 2 || j >= FMT_MAX)
+		return -1;
+
+	if (fmt[0] == FMT_MAX) 
+		fmt[0] = fmt[1];
+
+	if (fmt[1] == FMT_MAX)
+		fmt[1] = fmt[0];
+
+	return (fmt[0] << 4) | fmt[1];
+}
+
 static int nfacct_cmd_add(int argc, char *argv[])
 {
 	struct mnl_socket *nl;
 	char buf[MNL_SOCKET_BUFFER_SIZE];
 	struct nlmsghdr *nlh;
 	uint32_t portid, seq;
+	uint32_t fmt;
 	struct nfacct *nfacct;
 	int ret;
 
 	if (argc < 3) {
 		nfacct_perror("missing object name");
 		return -1;
-	} else if (argc > 3) {
+	} else if (argc > 4) {
 		nfacct_perror("too many arguments");
 		return -1;
 	}
@@ -236,6 +278,15 @@
 
 	nfacct_attr_set(nfacct, NFACCT_ATTR_NAME, argv[2]);
 
+	if (argc == 4 && argv[3] != NULL) {
+		fmt = nfacct_parse_format_options(argv[3]);
+		if (fmt == -1) {
+			nfacct_perror("packets/bytes format wrong");
+			return -1;
+		}
+		nfacct_attr_set(nfacct, NFACCT_ATTR_FMT, &fmt);
+	}
+
 	seq = time(NULL);
 	nlh = nfacct_nlmsg_build_hdr(buf, NFNL_MSG_ACCT_NEW,
 				     NLM_F_CREATE | NLM_F_ACK, seq);

Comments

Pablo Neira Ayuso Feb. 26, 2013, 1:55 p.m. UTC | #1
Hi Michael,

On Mon, Feb 25, 2013 at 08:20:47PM +0000, Michael Zintakis wrote:
[...]
> I've given up on my initial idea, which was to create this custom
> formatting (as well as object creation) at the point where the first
> iptables statement is created for a particular nfacct object, so I
> adopted a "plan b", where everything is done via the "nfacct"
> executable.

Thanks for the explanation. I think that, for most users, something
like:

        nfacct list MiB

would be just fine, so all counters will be displayed using the
formatting (MiB in the example case) that has been requested.

I'm still missing why different formatting according to the accounting
object can be useful.

Regards.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Zintakis Feb. 26, 2013, 7:23 p.m. UTC | #2
Pablo Neira Ayuso wrote:
> Thanks for the explanation.
No problem.

> I think that, for most users, something
> like:
> 
>         nfacct list MiB
I can't speak for other people (it would be very foolish of me to do so on this occasion), but judging this from our own needs/experience, the traffic - both by type and volume - is quite different. One cannot simply shoe-horn all traffic under a single denominator and say "that's it" - it doesn't work like that.

> I'm still missing why different formatting according to the accounting
> object can be useful.
OK, I tried to explain this in my previous post, but if it wasn't clear I'll expand a bit further. 

Different types of traffic, by their very nature, have different volume requirements. At the "low" end, we have DNS and authentication-type traffic (think RADIUS for example), where the denomination needs to be pretty "low" - in KiB or even "plain bytes" range.

At the other end of that scale you have much higher volume of traffic (think HD video streaming for example or private customers running their own PBXs, taking video/voice calls in their thousands), where the denomination needs to be much higher - in the GiB or even TiB range in some circumstances.

Not to mention that we have our own internal measurements, where we combine the total traffic counters of whole subnets and that denomination goes much, much higher than "GiB".

On top of all that, you have the traffic which could be quite unpredictable (think someone running, or connecting to, a private VPN server for example), hence the need for a "dynamic" denomination, depending on the volume of that traffic, which is what I implemented with the "iec" and "si" options.

Not to mention that in your example above, the chosen measurement (MiB) would also apply to packet counters. That isn't very appropriate, since packet counts are much lower (by orders of magnitude!) than byte counts.

One cannot simply brush it aside and design a one-size-fits-all measurement and apply it. 

We've had this problem with the "old" iptables accounting and it is one of the reasons we moved on from that, because it simply wasn't flexible enough. What I did with nfacct provides for flexibility - it can be configured to fit quite a variety of scenarios and individual needs. I hope I've explained myself a bit better this time.


MZ
Pablo Neira Ayuso Feb. 26, 2013, 9:47 p.m. UTC | #3
On Tue, Feb 26, 2013 at 07:23:16PM +0000, Michael Zintakis wrote:
[...]
> Different types of traffic, by their very nature, have different
> volume requirements. At the "low" end, we have DNS and
> authentication-type traffic (think RADIUS for example), where the
> denomination needs to be pretty "low" - in KiB or even "plain bytes"
> range.

I see. Then my new proposal is to add a new automagic function to
round the output to the most expressive measure, would be somehow
similar to xtables_print_num:

http://git.netfilter.org/cgi-bin/gitweb.cgi?p=iptables.git;a=blob;f=libxtables/xtables.c;h=009ab9115f6fd687a762a2552f89ac0b81ee1a42;hb=HEAD#l1915

Would that fit into your needs?

Regards.
Michael Zintakis Feb. 27, 2013, 8:57 p.m. UTC | #4
Pablo Neira Ayuso wrote:
> I see. Then my new proposal is to add a new automagic function to
> round the output to the most expressive measure, would be somehow
> similar to xtables_print_num:
> 
> http://git.netfilter.org/cgi-bin/gitweb.cgi?p=iptables.git;a=blob;f=libxtables/xtables.c;h=009ab9115f6fd687a762a2552f89ac0b81ee1a42;hb=HEAD#l1915
I might be seeing this wrong and if so I apologize, but is this the same/similar function which exists in the "old" iptables accounting, as well as seen for packets/bytes counter when iptables -L -vn is executed? If so, that isn't very appropriate as I indicated in my previous posting.

Pablo, do you think there is something wrong with the "iec" and "si" options already in place? If you think that I've done something wrong, please let me know, because getting such feedback was one of the reasons for posting the changes (and attaching the patches) earlier. I would gladly benefit from feedback on that code.
 
> Would that fit into your needs?
Short answer: no, not really.

As I already posted, the "iec" and "si" options deal with the two numbering standards (IEC and the "old" SI), have 3-digit decimal point resolution and, most importantly in this case, they were put in place to cover traffic which is of unpredictable/unknown quantity/volume. Going from our own experience, this covers about 20% of the traffic we measure. 

For the vast majority of all other traffic, we "lock" the denominator and use the appropriate format ("kib", "mib" etc). This is so that if that traffic is different from what we expected to see, this is instantly reflected in the numbers and is immediately flagged for further analysis.

Let me illustrate this with a small example: if we use "3pl,mib" format options for a specific type of traffic and we start getting byte count numbers like, for example, "140,666,825.688MiB" (in other words, over 134TiB), then this is instantly flagged to be analyzed to find why that traffic shot our pre-determined "expectation" from being in the "MiB range" and jumped two ranges and got into "TiB" territory.

The packet count is also a different matter. Even though in the vast majority of cases we use the "3pl" format, this is by no means set in stone, so the packet count format would also need to be configurable for each traffic measure (i.e. nfacct object).

The "old" SI-type options ("kb", "mb" and so on) were also put there for a reason - we still have people who are used to this measure, so it is more convenient for them to have this working range of options available. I hope I have explained this clearly; let me know if this is not the case.


MZ
Michael Zintakis March 23, 2013, 12:12 p.m. UTC | #5
Hello Pablo and all,


> Pablo Neira Ayuso wrote:
>> Would that fit into your needs?
> Short answer: no, not really.
In connection with this subject, I wanted to let you know that I have made quite a lot of changes, which I will try to describe below.

We had an internal team gathering almost 3 weeks ago and started planning changes to nfacct in order to make it more useful and more functional. This was also done with a view to a presentation to all major stakeholders of the company, which had been planned previously and finally took place 2 days ago (Thursday). During that we demonstrated our new and improved capability (utilizing the new and improved nfacct was part of that, of course!).

I am glad to let you know that everything was very well received and, after fine-tuning my work, I am going to submit 3 patches to this community very shortly, with the changes I've made to the nfacct system. They are quite extensive, and the nfacct executable in particular was almost completely rewritten. I also found a few bugs, which I fixed.

The new changes follow (I will also include printouts to make things clearer). The only change I have made to the kernel code since my last posting is the introduction of another property called 'bytes threshold' (a 64-bit number). Its main purpose is to let us register 'an expectation' of the traffic passing through a given accounting object; if this threshold is exceeded (in other words, if bytes count > threshold), this is visually indicated by the 'list' and 'get' commands. In other words:

[root@27_13 ~]# nfacct list
[ pkts =    7.260GiB  bytes =   6.817TiB+ ] = "ALL 27 net"
[ pkts = 296,615,264  bytes =  21.750GiB  ] = " IN web;streaming"
[ pkts = 533,035,424  bytes = 721.382GiB  ] = "OUT web;streaming"
[ pkts = 263,548,272  bytes = 236.012GiB+ ] = "ALL misc"
[ pkts =  12,852,909  bytes =  11.510GiB  ] = "ALL private"
[ pkts =     942,885  bytes = 864.635MiB  ] = "ALL sec;audit"


As we see above, the plus sign (+) next to the bytes count indicates that the registered threshold for this accounting object has been exceeded (enabling such threshold is, of course, entirely optional). The actual threshold value can be shown with a new option of the 'list' and 'get' commands (called 'show') in which I can specify what columns to view. In other words:

[root@27_13 shorewall]# nfacct list show bytes
[ bytes =   6.817TiB+ ] = "ALL 27 net"
[ bytes =  21.750GiB  ] = " IN web;streaming"
[ bytes = 721.382GiB  ] = "OUT web;streaming"
[ bytes = 236.012GiB+ ] = "ALL misc"
[ bytes =  11.510GiB  ] = "ALL private"
[ bytes = 864.635MiB  ] = "ALL sec;audit"

As we can see, with the above I am shown only the name and bytes columns.

[root@27_13 ~]# nfacct list show extended
[ pkts =    7.260GiB  bytes =   6.817TiB+ thr =   6.000TiB ] = "ALL 27 net"
[ pkts = 296,615,264  bytes =  21.750GiB  thr =          - ] = " IN web;streaming"
[ pkts = 533,035,424  bytes = 721.382GiB  thr =          - ] = "OUT web;streaming"
[ pkts = 263,548,272  bytes = 236.012GiB+ thr = 200.000GiB ] = "ALL misc"
[ pkts =  12,852,909  bytes =  11.510GiB  thr =  50.000GiB ] = "ALL private"
[ pkts =     942,885  bytes = 864.635MiB  thr =          - ] = "ALL sec;audit"

As we can see now, by selecting a different 'show' option ('extended' in this case), different properties are shown (I am now shown all properties - packets and byte counters, as well as the threshold values and threshold exceeded indicator, plus account object names).

Another good feature is that all column widths are now adjusted 'automatically' by nfacct (libnetfilter_acct plays a major part in this) so that we don't get excessive amount of space shown on the user screen or numbers displayed like 00000000000000001234, which was a bit ugly to say the least.

Coming back to the 'bytes threshold', from the last example above we can see that for "ALL 27 net" and "ALL misc" accounting objects, the threshold of 6TiB and 200GiB respectively, has been exceeded and that is indicated by the "+" sign next to the bytes counter. 
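The threshold check itself is a simple comparison, which might look like the sketch below. The function names are hypothetical, and the convention that a threshold of 0 means "no threshold registered" is an assumption inferred from the listings in this post, not something the patches are known to guarantee.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper. Assumption: thr == 0 means "no threshold set". */
char threshold_mark(uint64_t bytes, uint64_t thr)
{
	return (thr != 0 && bytes > thr) ? '+' : ' ';
}

/* Print the raw byte count followed by the exceeded-threshold marker. */
int format_bytes_cell(uint64_t bytes, uint64_t thr, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%" PRIu64 "%c", bytes,
			threshold_mark(bytes, thr));
}
```

For example, 7495370670080 bytes against a threshold of 6597069766656 (6TiB) gets the '+' marker, while 12358768640 bytes against 53687091200 (50GiB) does not.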

You will also notice that all accounting object names that contain 'odd' symbols are now encoded and shown in quotation marks. This was one of many bugs I found while improving nfacct: if a name contained any of these characters, restore failed. With the current improvements, this is all gone.

Related to that, not all data was properly encoded when the 'xml' output parameter was used: characters which are not conformant to the XML specification (like '>' or '&' for example) were output unescaped. But enough about bad bugs...
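As an illustration of the kind of encoding needed for XML output, here is a generic escaping routine for the five predefined XML entities. It is a sketch with a hypothetical name and is not taken from the patches, which have not been posted yet at this point in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst, replacing the characters
 * & < > " ' with their predefined XML entities.
 * Returns 0 on success, -1 if dst (of size len) is too small. */
int xml_escape(const char *src, char *dst, size_t len)
{
	size_t used = 0;

	assert(src != NULL && dst != NULL);
	if (len == 0)
		return -1;

	for (; *src != '\0'; src++) {
		char one[2] = { *src, '\0' };
		const char *rep;
		size_t n;

		switch (*src) {
		case '&':  rep = "&amp;";  break;
		case '<':  rep = "&lt;";   break;
		case '>':  rep = "&gt;";   break;
		case '"':  rep = "&quot;"; break;
		case '\'': rep = "&apos;"; break;
		default:   rep = one;      break;
		}
		n = strlen(rep);
		/* Always keep one byte free for the terminating NUL. */
		if (used + n + 1 > len)
			return -1;
		memcpy(dst + used, rep, n);
		used += n;
	}
	dst[used] = '\0';
	return 0;
}
```

An object name such as "a>b&c" would come out as "a&gt;b&amp;c".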

The formatting of objects can now be overwritten by the 'list' and 'get' commands too. The formatting of the numbers of all accounting objects in the above example is 'natural' to the accounting objects themselves, but this can be changed. In other words:

[root@27_13 ~]# nfacct list show extended format raw
[ pkts = 7795058176  bytes = 7495370670080+ thr = 6597069766656 ] = "ALL 27 net"
[ pkts =  296615264  bytes =   23353884672  thr =             - ] = " IN web;streaming"
[ pkts =  533035424  bytes =  774578044928  thr =             - ] = "OUT web;streaming"
[ pkts =  263548272  bytes =  253415948288+ thr =  214748364800 ] = "ALL misc"
[ pkts =   12852909  bytes =   12358768640  thr =   53687091200 ] = "ALL private"
[ pkts =     942885  bytes =     906635520  thr =             - ] = "ALL sec;audit"


With the above, I asked the 'list' command to show me unformatted values ('raw' was the format used, but I can select any formatting option I choose - I now have complete freedom).
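
The raw values make it easy to check the thresholds by hand. The following Python sketch does that arithmetic for the objects in the listing above that have a threshold. It is only a check of the numbers: the `exceeded` helper is made up for the example, and whether the kernel compares with '>' or '>=' is an assumption here.

```python
GIB = 1 << 30   # 2**30 bytes
TIB = 1 << 40   # 2**40 bytes

# (name, raw bytes, raw threshold) copied from the raw listing above
objects = [
    ("ALL 27 net", 7495370670080, 6597069766656),
    ("ALL misc", 253415948288, 214748364800),
    ("ALL private", 12358768640, 53687091200),
]

def exceeded(nbytes, threshold):
    # Assumption: "exceeded" means strictly greater than the threshold.
    return nbytes > threshold

# The raw thresholds are exactly 6 TiB, 200 GiB and 50 GiB.
assert objects[0][2] == 6 * TIB
assert objects[1][2] == 200 * GIB
assert objects[2][2] == 50 * GIB

for name, nbytes, threshold in objects:
    mark = "+" if exceeded(nbytes, threshold) else " "
    print(f"{nbytes}{mark} thr = {threshold}  {name}")
```

Only the first two objects get the "+" mark, which agrees with the listing.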

Perhaps the most important improvement in terms of administration is the new 'save' and 'restore' commands.

The previous 'restore' command wasn't working, and it was capturing input from the 'list' command. This was ugly (a bit like trying to do iptables-restore from 'iptables -L'). The new 'save' command now produces output to stdout in a form completely suitable for the new 'restore' command. In other words:

[root@27_13 ~]# nfacct save
"ALL 27 net" iec,tib 7795057933 7495370766549 6597069766656
" IN web;streaming" 3pl,gib 296615255 23353884672 0
"OUT web;streaming" 3pl,gib 533035414 774578024481 0
"ALL misc" 3pl,gib 263548277 253415955366 214748364800
"ALL private" 3pl,gib 12852909 12358768394 53687091200
"ALL sec;audit" 3pl,mib 942885 906635509 0

As we can see, this can be safely directed to a file and then used with the new 'nfacct restore'.
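
For anyone who wants to post-process such a file, here is a minimal Python sketch of a parser for the lines shown above. The field order (name, format, packets, bytes, threshold) is inferred by comparing the 'save' output with the 'list' output and is an assumption, as is the `SavedObject` type. The sketch relies on shell-style quoting via `shlex`, which may not match how nfacct encodes every character.

```python
import shlex
from typing import NamedTuple

class SavedObject(NamedTuple):
    name: str
    fmt: str
    pkts: int
    nbytes: int
    threshold: int   # 0 appears to mean "no threshold" (assumption)

def parse_save_line(line):
    # shlex.split keeps the quoted name, including spaces and ';', as one field.
    name, fmt, pkts, nbytes, threshold = shlex.split(line)
    return SavedObject(name, fmt, int(pkts), int(nbytes), int(threshold))

sample = '" IN web;streaming" 3pl,gib 296615255 23353884672 0'
obj = parse_save_line(sample)
print(repr(obj.name), obj.fmt, obj.pkts, obj.nbytes, obj.threshold)
```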

The 'restore' command also had a lot of changes: The best improvement in this is that it now allows all accounting objects to be restored regardless of whether they are used by iptables or not. This was not possible before. 

The 'restore' command takes two additional parameters. The first, 'flush', allows the accounting table to be flushed (though objects used by iptables are still not deleted). The second, 'replace', makes sure that accounting object properties are replaced if the objects already exist in the accounting table; it can modify object properties even if these are in use/locked by iptables. The 'add' and 'get' commands have similar options allowing accounting object properties to be modified at will. That was not possible before.

So, with the new 'save' and 'restore' nfacct commands, a full and complete restoration of all accounting objects is now possible. I will list the detailed changes I've made for each nfacct component (kernel, libnetfilter_acct and nfacct) in the patches I will submit shortly.

For full information about the new and improved features, there is an almost completely rewritten man page, but here I am listing the output of the 'help' command, which shows very briefly all the options currently available. The nfacct executable now has the following options (from the improved 'nfacct help' command):

nfacct v1.0.1: utility for the Netfilter extended accounting infrastructure
Usage: nfacct command [parameters]...

Commands:
  list LST_PARAMS	List the accounting object table
  add NAME ADD_PARAMS	Add new accounting object NAME to table
  delete NAME		Delete existing accounting object NAME
  get NAME GET_PARAMS	Get and list existing accounting object NAME
  flush			Flush accounting object table
  save			Dump current accounting object table to stdout
  restore RST_PARAMS	Restore accounting object table from stdin
  version		Display version and disclaimer
  help			Display this help message

Parameters:
  LST_PARAMS := [ reset ] [ show SHOW_SPEC ] [ format FMT_SPEC ] [ xml ]
  ADD_PARAMS := [ replace ] [ format FMT_SPEC ] [ threshold NUMBER ]
  GET_PARAMS := [ reset ] [ show SHOW_SPEC ] [ format FMT_SPEC ] [ xml ]
  RST_PARAMS := [ flush ] [ replace ]
  SHOW_SPEC := { bytes | extended }
  FMT_SPEC := { [FMT] | [,] | [FMT] ... }
  FMT := { def | raw | 3pl | iec | kib | mib | gib | tib | pib | eib |
  	   si | kb | mb | gb | tb | pb | eb }

After all this, I do have a question: is it possible for the kernel part to be unable to update the accounting object counters and, if so, in what circumstances and how likely is that to happen?

It is important for us to know. I was asked this question and didn't really know the answer. Looking at the kernel code, I couldn't find anything that could prevent the update from happening, but I thought I'd ask here anyway.


MZ
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Zintakis April 4, 2013, 8:37 p.m. UTC | #6
Hello Pablo,


Michael Zintakis wrote:
> Pablo Neira Ayuso wrote:
>> I see. Then my new proposal is to add a new automagic function to
>> round the output to the most expressive measure, would be somehow
>> similar to xtables_print_num:
>>
>> http://git.netfilter.org/cgi-bin/gitweb.cgi?p=iptables.git;a=blob;f=libxtables/xtables.c;h=009ab9115f6fd687a762a2552f89ac0b81ee1a42;hb=HEAD#l1915
Something we've discovered with regards to the nfacct match recently. If I have the following iptables statement:

iptables -A INPUT -m nfacct --nfacct <nfacct_obj> -m <match2> -m <match3>


The above always updates the "nfacct_obj" byte and packet counters, regardless of whether "match2" and "match3" actually match. However, if we have:

iptables -A INPUT -m <match2> -m nfacct --nfacct <nfacct_obj> -m <match3>


then "nfacct_obj" counters are updated only when "match2" is satisfied, but if we have:

iptables -A INPUT -m <match2> -m <match3> -m nfacct --nfacct <nfacct_obj>


then "nfacct_obj" counters are updated when both match2 and match3 are matched (which was the initial intention).

This inconsistency stems from the fact that the nfacct match in the kernel (xt_nfacct.c::nfacct_mt) always returns true, but also because of how iptables evaluates matches: it does so from left to right.

Since there isn't a callback in the xt_match struct which is called after ALL matches have been satisfied (xt_match.match is called for each registered match in that statement), this causes the nfacct counters to be updated (or not) depending on the position of the nfacct match.
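
The effect can be reproduced with a toy model. The Python sketch below is not kernel code: `evaluate`, `nfacct`, `is_tcp` and `is_port_80` are made-up stand-ins that only mimic "evaluate matches left to right, stop at the first failure" with an accounting match that counts the packet and always returns true.

```python
def evaluate(rule, packet, counters):
    """Toy model: run matches left to right, stop at the first failure."""
    for match in rule:
        if not match(packet, counters):
            return False
    return True

def nfacct(name):
    # Models a match that counts the packet and always succeeds.
    def match(packet, counters):
        counters[name] = counters.get(name, 0) + 1
        return True
    return match

def is_tcp(packet, counters):        # stand-in for <match2>
    return packet["proto"] == "tcp"

def is_port_80(packet, counters):    # stand-in for <match3>
    return packet["dport"] == 80

packets = [
    {"proto": "tcp", "dport": 80},
    {"proto": "tcp", "dport": 22},
    {"proto": "udp", "dport": 53},
]

def count(rule):
    counters = {}
    for packet in packets:
        evaluate(rule, packet, counters)
    return counters.get("acct", 0)

first = count([nfacct("acct"), is_tcp, is_port_80])
middle = count([is_tcp, nfacct("acct"), is_port_80])
last = count([is_tcp, is_port_80, nfacct("acct")])
print(first, middle, last)   # prints: 3 2 1
```

The same three packets give three different totals depending only on where the accounting match sits in the rule.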

What I have done locally is to add a separate callback (I called it "matched"), which is called for every match once all matches in a particular statement have been satisfied. That obviously will break lots of code depending on the old xt_match struct if such an approach is adopted. My question is: is there a more elegant solution to do this? Thanks.


MZ
Jozsef Kadlecsik April 4, 2013, 9:46 p.m. UTC | #7
On Thu, 4 Apr 2013, Michael Zintakis wrote:

> Michael Zintakis wrote:
> > Pablo Neira Ayuso wrote:
> >> I see. Then my new proposal is to add a new automagic function to
> >> round the output to the most expressive measure, would be somehow
> >> similar to xtables_print_num:
> >>
> >> http://git.netfilter.org/cgi-bin/gitweb.cgi?p=iptables.git;a=blob;f=libxtables/xtables.c;h=009ab9115f6fd687a762a2552f89ac0b81ee1a42;hb=HEAD#l1915
> Something we've discovered with regards to the nfacct match recently. If 
> I have the following iptables statement:
> 
> iptables -A INPUT -m nfacct --nfacct <nfacct_obj> -m <match2> -m <match3>
>
> The above aklways updates the "nfacct_obj" byte and packet counters, 
> regardless of whether "match2" and "match3" actually matches. However, 
> if we have:
> 
> iptables -A INPUT -m <match2> -m nfacct --nfacct <nfacct_obj> -m <match3>
>
> then "nfacct_obj" counters are updated only when "match1" is satisfied, 
> but if we have:
> 
> iptables -A INPUT -m <match2> -m <match3> -m nfacct --nfacct <nfacct_obj>
>
> then "nfacct_obj" counters are updated when both match2 and match3 are 
> matched (which was the initial intention).
> 
> This inconsistency stems from the fact that the nfacct match in the 
> kernel (xt_nfacct.c::nfacct_mt) always returns true, but also because of 
> how iptables evaluates matches: it does so from left to right.
> 
> Since there isn't a callback in the xt_match struct which is called 
> after ALL matches have been satisfied (xt_match.match is called for each 
> registered match in that statement), this causes the nfacct counters to 
> be updated (or not) depending on the position of the nfacct match.
> 
> What I have done locally is to add a separate callback (I called it 
> "matched") which is called for all matches after all such matches in a 
> particular statement have been satisfied, but that obviously will break 
> lots of code depending on the old xt_match struct if such approach is 
> adopted. My question is: is there more elegant solution to do this? 

In my opinion this is not inconsistency at all, but the intended 
behaviour. So I don't see any reason to add such a hack to override it.

What prevents you from entering the matches in the order you want them to 
be evaluated?

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary
Michael Zintakis April 5, 2013, 7:10 p.m. UTC | #8
Hello Jozsef,

Jozsef Kadlecsik wrote:
> On Thu, 4 Apr 2013, Michael Zintakis wrote:
>> Something we've discovered with regards to the nfacct match recently. If 
>> I have the following iptables statement:
>>
>> iptables -A INPUT -m nfacct --nfacct <nfacct_obj> -m <match2> -m <match3>
>>
>> The above aklways updates the "nfacct_obj" byte and packet counters, 
>> regardless of whether "match2" and "match3" actually matches. However, 
>> if we have:
>>
>> iptables -A INPUT -m <match2> -m nfacct --nfacct <nfacct_obj> -m <match3>
>>
>> then "nfacct_obj" counters are updated only when "match1" is satisfied, 
>> but if we have:
>>
>> iptables -A INPUT -m <match2> -m <match3> -m nfacct --nfacct <nfacct_obj>
>>
>> then "nfacct_obj" counters are updated when both match2 and match3 are 
>> matched (which was the initial intention).
>>
>> This inconsistency stems from the fact that the nfacct match in the 
>> kernel (xt_nfacct.c::nfacct_mt) always returns true, but also because of 
>> how iptables evaluates matches: it does so from left to right.
>>
>> Since there isn't a callback in the xt_match struct which is called 
>> after ALL matches have been satisfied (xt_match.match is called for each 
>> registered match in that statement), this causes the nfacct counters to 
>> be updated (or not) depending on the position of the nfacct match.
>>
>> What I have done locally is to add a separate callback (I called it 
>> "matched") which is called for all matches after all such matches in a 
>> particular statement have been satisfied, but that obviously will break 
>> lots of code depending on the old xt_match struct if such approach is 
>> adopted. My question is: is there more elegant solution to do this? 
> 
> In my opinion this is not inconsistency at all, but the intended 
> behaviour. So I don't see any reason to add such a hack to override it.
I meant inconsistent in terms of the end result, which in the example above is packet/bytes counting.

That result is different depending on the order of the conditions (i.e. matches) attached to the iptables rule. With the 'old' accounting we didn't have that. In other words, with the old accounting we've had:

if (match1 && match2 && matchN) {
  do_packet_and_bytes_counting();
}

No matter how we arrange the order of match1, match2 and matchN, the end result is (or should be) the same. With the nfacct match that isn't the case. That isn't the nfacct match's fault, though; I guess it is because of the way iptables examines the matches.
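
The order independence of the 'old' accounting can be checked with a short Python sketch. It is a model only: `rule_counter` and the three predicates are hypothetical stand-ins for match1, match2 and matchN, and the packets are made-up sample data.

```python
from itertools import permutations

def is_tcp(packet):
    return packet["proto"] == "tcp"

def is_port_80(packet):
    return packet["dport"] == 80

def is_large(packet):
    return packet["len"] > 100

def rule_counter(matches, packets):
    # Old-style accounting: count once, only after every match has succeeded.
    return sum(1 for p in packets if all(m(p) for m in matches))

packets = [
    {"proto": "tcp", "dport": 80, "len": 1500},
    {"proto": "tcp", "dport": 80, "len": 60},
    {"proto": "udp", "dport": 80, "len": 1500},
    {"proto": "tcp", "dport": 22, "len": 1500},
]

# Try all six orderings of the three matches.
results = {
    rule_counter(order, packets)
    for order in permutations([is_tcp, is_port_80, is_large])
}
print(results)   # one value only: the order does not matter
```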

We would have had that consistency (in other words, a consistent result regardless of the order of the various conditions/matches) if nfacct was a target, not a match. I know that would be difficult (I already examined that possibility), since the x_tables target does not provide a 'destroy' method, so there isn't a way to track the 'refcnt' in the nfacct kernel struct. Inventing this method would be just as ugly as the hack I did with the nfacct match above, so I thought to ask and see whether there is a better solution.

> What prevents you from entering the matches in the order you want them to 
> be evaluated?
Nothing. Again, I am coming from the point of view of the 'old' accounting where I did not have that, so I didn't expect this change.
Jozsef Kadlecsik April 5, 2013, 7:24 p.m. UTC | #9
Hi Michael,

On Fri, 5 Apr 2013, Michael Zintakis wrote:

> Jozsef Kadlecsik wrote:
> > On Thu, 4 Apr 2013, Michael Zintakis wrote:
> >> Something we've discovered with regards to the nfacct match recently. If 
> >> I have the following iptables statement:
> >>
> >> iptables -A INPUT -m nfacct --nfacct <nfacct_obj> -m <match2> -m <match3>
> >>
> >> The above aklways updates the "nfacct_obj" byte and packet counters, 
> >> regardless of whether "match2" and "match3" actually matches. However, 
> >> if we have:
> >>
> >> iptables -A INPUT -m <match2> -m nfacct --nfacct <nfacct_obj> -m <match3>
> >>
> >> then "nfacct_obj" counters are updated only when "match1" is satisfied, 
> >> but if we have:
> >>
> >> iptables -A INPUT -m <match2> -m <match3> -m nfacct --nfacct <nfacct_obj>
> >>
> >> then "nfacct_obj" counters are updated when both match2 and match3 are 
> >> matched (which was the initial intention).
> >>
> >> This inconsistency stems from the fact that the nfacct match in the 
> >> kernel (xt_nfacct.c::nfacct_mt) always returns true, but also because of 
> >> how iptables evaluates matches: it does so from left to right.
> >>
> >> Since there isn't a callback in the xt_match struct which is called 
> >> after ALL matches have been satisfied (xt_match.match is called for each 
> >> registered match in that statement), this causes the nfacct counters to 
> >> be updated (or not) depending on the position of the nfacct match.
> >>
> >> What I have done locally is to add a separate callback (I called it 
> >> "matched") which is called for all matches after all such matches in a 
> >> particular statement have been satisfied, but that obviously will break 
> >> lots of code depending on the old xt_match struct if such approach is 
> >> adopted. My question is: is there more elegant solution to do this? 
> > 
> > In my opinion this is not inconsistency at all, but the intended 
> > behaviour. So I don't see any reason to add such a hack to override it.
> I meant inconsistent in terms of the end result, which in the example 
> above is packet/bytes counting.
> 
> That result is different depending on the order of the conditions (i.e. 
> matches) attached to the iptables rule. With the 'old' accounting we 
> didn't have that. In other words, with the old accounting we've had:
> 
> If (match1 && match2 && matchN) {
>   do_packet_and_bytes_counting();
> }
> 
> No matter how we arrange the order of match1, match2 and matchN, the end 
> result is (or should be) the same. With the nfacct match that isn't the 
> case, but that isn't nfacct match's fault, but I guess it is because of 
> the way iptables is examining the matches.

Yes, exactly. And actually it supports rules like this:

iptables -A INPUT -m <match0> -m nfacct --nfacct acct0 \
                  -m <match1> -m nfacct --nfacct acct1 \ 
		  ...

Also, this is a new accounting method, which is just not the same as the 
old one.
 
> We would have had the consistency (in other words, getting a consistent 
> result regardless of the order of the various conditions/matches) if 
> nfacct was a target, not a match, but I know that would be difficult (I 
> already examined that possibility) since the x_tables target does not 
> provide a 'destroy' method, so there isn't a way to track the 'refcnt' 
> in the nfacct kernel struct, so inventing this method is as equally as 
> ugly as the hack I did with the nfacct match above, so I thought to ask 
> and see whether there is a better solution.

Targets do have a destroy method.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary
Michael Zintakis April 5, 2013, 7:27 p.m. UTC | #10
Michael Zintakis wrote:
> We would have had the consistency (in other words, getting a consistent result regardless of the order of the various conditions/matches) if nfacct was a target, not a match, but I know that would be difficult (I already examined that possibility) since the x_tables target does not provide a 'destroy' method, so there isn't a way to track the 'refcnt' in the nfacct kernel struct, so inventing this method is as equally as ugly as the hack I did with the nfacct match above, so I thought to ask and see whether there is a better solution.
It looks as though I was wrong - I must have been blind when I looked in the x_tables header file!

There is a destroy method as part of xt_target. So if I 'reform' the nfacct match and make it a target, then I guess that whole 'inconsistency' thing will disappear, since I could now use something like:

iptables -A INPUT -m match1 -m match2 -j NFACCT --nfacct <nfacct_obj>

and regardless of the order of match1 and match2, the result will be the same. Am I correct, or is there something very wrong?
Michael Zintakis April 5, 2013, 7:34 p.m. UTC | #11
Hello Jozsef,

Jozsef Kadlecsik wrote:
> Hi Michael,
> 
> On Fri, 5 Apr 2013, Michael Zintakis wrote:
> 
>> Jozsef Kadlecsik wrote:
>>> On Thu, 4 Apr 2013, Michael Zintakis wrote:
>>>> Something we've discovered with regards to the nfacct match recently. If 
>>>> I have the following iptables statement:
>>>>
>>>> iptables -A INPUT -m nfacct --nfacct <nfacct_obj> -m <match2> -m <match3>
>>>>
>>>> The above aklways updates the "nfacct_obj" byte and packet counters, 
>>>> regardless of whether "match2" and "match3" actually matches. However, 
>>>> if we have:
>>>>
>>>> iptables -A INPUT -m <match2> -m nfacct --nfacct <nfacct_obj> -m <match3>
>>>>
>>>> then "nfacct_obj" counters are updated only when "match1" is satisfied, 
>>>> but if we have:
>>>>
>>>> iptables -A INPUT -m <match2> -m <match3> -m nfacct --nfacct <nfacct_obj>
>>>>
>>>> then "nfacct_obj" counters are updated when both match2 and match3 are 
>>>> matched (which was the initial intention).
>>>>
>>>> This inconsistency stems from the fact that the nfacct match in the 
>>>> kernel (xt_nfacct.c::nfacct_mt) always returns true, but also because of 
>>>> how iptables evaluates matches: it does so from left to right.
>>>>
>>>> Since there isn't a callback in the xt_match struct which is called 
>>>> after ALL matches have been satisfied (xt_match.match is called for each 
>>>> registered match in that statement), this causes the nfacct counters to 
>>>> be updated (or not) depending on the position of the nfacct match.
>>>>
>>>> What I have done locally is to add a separate callback (I called it 
>>>> "matched") which is called for all matches after all such matches in a 
>>>> particular statement have been satisfied, but that obviously will break 
>>>> lots of code depending on the old xt_match struct if such approach is 
>>>> adopted. My question is: is there more elegant solution to do this? 
>>> In my opinion this is not inconsistency at all, but the intended 
>>> behaviour. So I don't see any reason to add such a hack to override it.
>> I meant inconsistent in terms of the end result, which in the example 
>> above is packet/bytes counting.
>>
>> That result is different depending on the order of the conditions (i.e. 
>> matches) attached to the iptables rule. With the 'old' accounting we 
>> didn't have that. In other words, with the old accounting we've had:
>>
>> If (match1 && match2 && matchN) {
>>   do_packet_and_bytes_counting();
>> }
>>
>> No matter how we arrange the order of match1, match2 and matchN, the end 
>> result is (or should be) the same. With the nfacct match that isn't the 
>> case, but that isn't nfacct match's fault, but I guess it is because of 
>> the way iptables is examining the matches.
> 
> Yes, exactly. And actually it supports rules like this:
> 
> iptables -A INPUT -m <match0> -m nfacct --nfacct acct0 \
>                   -m <match1> -m nfacct --nfacct acct1 \ 
> 		  ...
Hm, never thought of that, but I guess one learns something new every day. Thanks Jozsef!

> Also, this is a new accounting method, which is just not the same as the 
> old one.
Yes, I know, I wasn't disputing that - it is just that I am used to the 'old' accounting and when you've been using it for years it is not so easy to 'detach' yourself from that.

>> We would have had the consistency (in other words, getting a consistent 
>> result regardless of the order of the various conditions/matches) if 
>> nfacct was a target, not a match, but I know that would be difficult (I 
>> already examined that possibility) since the x_tables target does not 
>> provide a 'destroy' method, so there isn't a way to track the 'refcnt' 
>> in the nfacct kernel struct, so inventing this method is as equally as 
>> ugly as the hack I did with the nfacct match above, so I thought to ask 
>> and see whether there is a better solution.
> 
> Targets do have a destroy method.
Haha, you are far too quick for me!

I just found that out - I don't know how I did not see it when I first looked at it. I guess if I 'convert' nfacct to a target I could get that 'consistency', but I appreciate the new example you gave above, which I have to admit is very useful indeed (one can hit two or more birds with one stone so to speak).
Jozsef Kadlecsik April 5, 2013, 9:01 p.m. UTC | #12
Hi Michael,

On Fri, 5 Apr 2013, Michael Zintakis wrote:

> Jozsef Kadlecsik wrote:
> > On Fri, 5 Apr 2013, Michael Zintakis wrote:
> > 
> >> Jozsef Kadlecsik wrote:
> >>> On Thu, 4 Apr 2013, Michael Zintakis wrote:
> >>>> Something we've discovered with regards to the nfacct match recently. If 
> >>>> I have the following iptables statement:
> >>>>
> >>>> iptables -A INPUT -m nfacct --nfacct <nfacct_obj> -m <match2> -m <match3>
> >>>>
> >>>> The above aklways updates the "nfacct_obj" byte and packet counters, 
> >>>> regardless of whether "match2" and "match3" actually matches. However, 
> >>>> if we have:
> >>>>
> >>>> iptables -A INPUT -m <match2> -m nfacct --nfacct <nfacct_obj> -m <match3>
> >>>>
> >>>> then "nfacct_obj" counters are updated only when "match1" is satisfied, 
> >>>> but if we have:
> >>>>
> >>>> iptables -A INPUT -m <match2> -m <match3> -m nfacct --nfacct <nfacct_obj>
> >>>>
> >>>> then "nfacct_obj" counters are updated when both match2 and match3 are 
> >>>> matched (which was the initial intention).
> >>>>
> >>>> This inconsistency stems from the fact that the nfacct match in the 
> >>>> kernel (xt_nfacct.c::nfacct_mt) always returns true, but also because of 
> >>>> how iptables evaluates matches: it does so from left to right.
> >>>>
> >>>> Since there isn't a callback in the xt_match struct which is called 
> >>>> after ALL matches have been satisfied (xt_match.match is called for each 
> >>>> registered match in that statement), this causes the nfacct counters to 
> >>>> be updated (or not) depending on the position of the nfacct match.
> >>>>
> >>>> What I have done locally is to add a separate callback (I called it 
> >>>> "matched") which is called for all matches after all such matches in a 
> >>>> particular statement have been satisfied, but that obviously will break 
> >>>> lots of code depending on the old xt_match struct if such approach is 
> >>>> adopted. My question is: is there more elegant solution to do this? 
> >>> In my opinion this is not inconsistency at all, but the intended 
> >>> behaviour. So I don't see any reason to add such a hack to override it.
> >> I meant inconsistent in terms of the end result, which in the example 
> >> above is packet/bytes counting.
> >>
> >> That result is different depending on the order of the conditions (i.e. 
> >> matches) attached to the iptables rule. With the 'old' accounting we 
> >> didn't have that. In other words, with the old accounting we've had:
> >>
> >> If (match1 && match2 && matchN) {
> >>   do_packet_and_bytes_counting();
> >> }
> >>
> >> No matter how we arrange the order of match1, match2 and matchN, the end 
> >> result is (or should be) the same. With the nfacct match that isn't the 
> >> case, but that isn't nfacct match's fault, but I guess it is because of 
> >> the way iptables is examining the matches.
> > 
> > Yes, exactly. And actually it supports rules like this:
> > 
> > iptables -A INPUT -m <match0> -m nfacct --nfacct acct0 \
> >                   -m <match1> -m nfacct --nfacct acct1 \ 
> > 		  ...
> Hm, never thought of that, but I guess one learns something new every 
> day. Thanks Jozsef!
> 
> > Also, this is a new accounting method, which is just not the same as the 
> > old one.
> Yes, I know, I wasn't disputing that - it is just that I am used to the 
> 'old' accounting and when you've been using it for years it is not so 
> easy to 'detach' yourself from that.
> 
> >> We would have had the consistency (in other words, getting a consistent 
> >> result regardless of the order of the various conditions/matches) if 
> >> nfacct was a target, not a match, but I know that would be difficult (I 
> >> already examined that possibility) since the x_tables target does not 
> >> provide a 'destroy' method, so there isn't a way to track the 'refcnt' 
> >> in the nfacct kernel struct, so inventing this method is as equally as 
> >> ugly as the hack I did with the nfacct match above, so I thought to ask 
> >> and see whether there is a better solution.
> > 
> > Targets do have a destroy method.
> Haha, you are far too quick for me!
> 
> I just found that out - I don't know how I did not see it when I first 
> looked at it. I guess if I 'convert' nfacct to a target I could get that 
> 'consistency', but I appreciate the new example you gave above, which I 
> have to admit is very useful indeed (one can hit two or more birds with 
> one stone so to speak).

nfacct can't be converted to a target, because it'd result in backward 
incompatibility - it already exists as a match. The module could be 
extended to play the role of target as well, but it seems to be 
unnecessary: there's no need to have a target in a rule, so in userspace 
"-j NFACCT" could simply be replaced by "-m nfacct".

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary
Michael Zintakis April 6, 2013, 4:14 p.m. UTC | #13
Hello Jozsef,

Jozsef Kadlecsik wrote:
> Hi Michael,
> 
> On Fri, 5 Apr 2013, Michael Zintakis wrote:
> 
>> Jozsef Kadlecsik wrote:
>>> On Fri, 5 Apr 2013, Michael Zintakis wrote:
>>>
>>>> Jozsef Kadlecsik wrote:
>>>>> On Thu, 4 Apr 2013, Michael Zintakis wrote:
>>>>>> Something we've discovered with regards to the nfacct match recently. If 
>>>>>> I have the following iptables statement:
>>>>>>
>>>>>> iptables -A INPUT -m nfacct --nfacct <nfacct_obj> -m <match2> -m <match3>
>>>>>>
>>>>>> The above aklways updates the "nfacct_obj" byte and packet counters, 
>>>>>> regardless of whether "match2" and "match3" actually matches. However, 
>>>>>> if we have:
>>>>>>
>>>>>> iptables -A INPUT -m <match2> -m nfacct --nfacct <nfacct_obj> -m <match3>
>>>>>>
>>>>>> then "nfacct_obj" counters are updated only when "match1" is satisfied, 
>>>>>> but if we have:
>>>>>>
>>>>>> iptables -A INPUT -m <match2> -m <match3> -m nfacct --nfacct <nfacct_obj>
>>>>>>
>>>>>> then "nfacct_obj" counters are updated when both match2 and match3 are 
>>>>>> matched (which was the initial intention).
>>>>>>
>>>>>> This inconsistency stems from the fact that the nfacct match in the 
>>>>>> kernel (xt_nfacct.c::nfacct_mt) always returns true, but also because of 
>>>>>> how iptables evaluates matches: it does so from left to right.
>>>>>>
>>>>>> Since there isn't a callback in the xt_match struct which is called 
>>>>>> after ALL matches have been satisfied (xt_match.match is called for each 
>>>>>> registered match in that statement), this causes the nfacct counters to 
>>>>>> be updated (or not) depending on the position of the nfacct match.
>>>>>>
>>>>>> What I have done locally is to add a separate callback (I called it 
>>>>>> "matched") which is called for all matches after all such matches in a 
>>>>>> particular statement have been satisfied, but that obviously will break 
>>>>>> lots of code depending on the old xt_match struct if such approach is 
>>>>>> adopted. My question is: is there more elegant solution to do this? 
>>>>> In my opinion this is not inconsistency at all, but the intended 
>>>>> behaviour. So I don't see any reason to add such a hack to override it.
>>>> I meant inconsistent in terms of the end result, which in the example 
>>>> above is packet/bytes counting.
>>>>
>>>> That result is different depending on the order of the conditions (i.e. 
>>>> matches) attached to the iptables rule. With the 'old' accounting we 
>>>> didn't have that. In other words, with the old accounting we've had:
>>>>
>>>> If (match1 && match2 && matchN) {
>>>>   do_packet_and_bytes_counting();
>>>> }
>>>>
>>>> No matter how we arrange the order of match1, match2 and matchN, the end 
>>>> result is (or should be) the same. With the nfacct match that isn't the 
>>>> case, but that isn't nfacct match's fault, but I guess it is because of 
>>>> the way iptables is examining the matches.
>>> Yes, exactly. And actually it supports rules like this:
>>>
>>> iptables -A INPUT -m <match0> -m nfacct --nfacct acct0 \
>>>                   -m <match1> -m nfacct --nfacct acct1 \ 
>>> 		  ...
>> Hm, never thought of that, but I guess one learns something new every 
>> day. Thanks Jozsef!
Just as a side note (which wasn't obvious to me at first): even though acct0 gets updated when match0 returns true, acct1 only gets updated when both match0 AND match1 return true...
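
A toy model of that chained rule makes the side note concrete. The Python sketch below is not kernel code: `run_rule` and the two predicates are made-up stand-ins for match0 and match1, and the packets are sample data.

```python
def is_tcp(packet):        # stand-in for <match0>
    return packet["proto"] == "tcp"

def is_port_443(packet):   # stand-in for <match1>
    return packet["dport"] == 443

def run_rule(packet, counters):
    # -m <match0> -m nfacct acct0 -m <match1> -m nfacct acct1, left to right
    if not is_tcp(packet):
        return False
    counters["acct0"] += 1
    if not is_port_443(packet):
        return False
    counters["acct1"] += 1
    return True

counters = {"acct0": 0, "acct1": 0}
packets = [
    {"proto": "tcp", "dport": 443},
    {"proto": "tcp", "dport": 25},
    {"proto": "udp", "dport": 443},   # satisfies match1 but not match0
]
for packet in packets:
    run_rule(packet, counters)

print(counters)   # {'acct0': 2, 'acct1': 1}
```

The UDP packet to port 443 satisfies match1 on its own, yet it never reaches acct1 because match0 has already failed.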


>>> Also, this is a new accounting method, which is just not the same as the 
>>> old one.
>> Yes, I know, I wasn't disputing that - it is just that I am used to the 
>> 'old' accounting and when you've been using it for years it is not so 
>> easy to 'detach' yourself from that.
>>
>>>> We would have had the consistency (in other words, getting a consistent 
>>>> result regardless of the order of the various conditions/matches) if 
>>>> nfacct was a target, not a match, but I know that would be difficult (I 
>>>> already examined that possibility) since the x_tables target does not 
>>>> provide a 'destroy' method, so there isn't a way to track the 'refcnt' 
>>>> in the nfacct kernel struct, so inventing this method is as equally as 
>>>> ugly as the hack I did with the nfacct match above, so I thought to ask 
>>>> and see whether there is a better solution.
>>> Targets do have a destroy method.
>> Haha, you are far too quick for me!
>>
>> I just found that out - I don't know how I did not see it when I first 
>> looked at it. I guess if I 'convert' nfacct to a target I could get that 
>> 'consistency', but I appreciate the new example you gave above, which I 
>> have to admit is very useful indeed (one can hit two or more birds with 
>> one stone so to speak).
> 
> nfacct can't be converted to a target, because it'd result in backward 
> incompatibility - it already exists as a match.
Sorry Jozsef, I meant for nfacct to be added as a target (in addition to nfacct as a match).

> The module could be 
> extended to play the role of target as well, but it seems to be 
> unnecessary: there's no need to have a target in a rule,
I agree. nfacct (as a match) has the full functionality of nfacct (as a target), though one needs to get used to the 'new' matching and be aware of it. Maybe a note in the man pages to that effect would do.

> so in userspace 
> "-j NFACCT" could simply be replaced by "-m nfacct".
I just did a quick hack and implemented nfacct as a target, out of curiosity if nothing else. It works well, and I can do something like:

iptables -I INPUT 1 -m nfacct --nfacct-name test -m conntrack --ctstate NEW -j NFACCT --nfacct-name test2

In the above statement, the nfacct match on 'test' gets updated regardless of the state of the connection, while the nfacct target (for 'test2') only gets executed when ctstate is NEW (this statement even works with '-j NFACCT --nfacct-name test').
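
As a rough illustration of that rule, the sketch below models what happens to the two counters per packet. It is not the code from my hack: struct counter, count() and eval_rule() are made-up names, and the conntrack state is reduced to a single boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an nfacct object. */
struct counter {
	uint64_t pkts;
	uint64_t bytes;
};

static void count(struct counter *c, uint32_t len)
{
	c->pkts++;
	c->bytes += len;
}

/*
 * Models the rule:
 *   -m nfacct --nfacct-name test -m conntrack --ctstate NEW
 *   -j NFACCT --nfacct-name test2
 */
static void eval_rule(bool ctstate_new, uint32_t len,
		      struct counter *test, struct counter *test2)
{
	count(test, len);	/* nfacct match: first in the rule, always evaluated */
	if (!ctstate_new)	/* conntrack match fails, so the rule does not match */
		return;
	count(test2, len);	/* target: reached only once every match has passed */
}

/* One NEW packet and one non-NEW packet through the rule. */
static void demo(void)
{
	struct counter test = { 0, 0 }, test2 = { 0, 0 };

	eval_rule(true, 60, &test, &test2);
	eval_rule(false, 1500, &test, &test2);
	assert(test.pkts == 2 && test.bytes == 1560);
	assert(test2.pkts == 1 && test2.bytes == 60);
}
```

In this model, passing the same counter for both 'test' and 'test2' would count every NEW packet twice; I am only stating that for the sketch, not as the verified behaviour of the real target.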

This is all academic though - I agree that the existing nfacct match covers all the functionality of the nfacct target, even if one needs to be aware of how this all works...


MZ

Patch

--- a/net/netfilter/nfnetlink_acct.c
+++ b/net/netfilter/nfnetlink_acct.c
@@ -32,6 +32,7 @@ 
 struct nf_acct {
 	atomic64_t		pkts;
 	atomic64_t		bytes;
+	atomic_t		fmt;
 	struct list_head	head;
 	atomic_t		refcnt;
 	char			name[NFACCT_NAME_MAX];
@@ -63,9 +64,14 @@ 
 
 	if (matching) {
 		if (nlh->nlmsg_flags & NLM_F_REPLACE) {
-			/* reset counters if you request a replacement. */
+			/* reset counters if you request a replacement... */
 			atomic64_set(&matching->pkts, 0);
 			atomic64_set(&matching->bytes, 0);
+			/* ... and change the format */
+			if (tb[NFACCT_FMT]) {
+				atomic_set(&matching->fmt,
+				   be32_to_cpu(nla_get_be32(tb[NFACCT_FMT])));
+			}
 			return 0;
 		}
 		return -EBUSY;
@@ -85,6 +91,10 @@ 
 		atomic64_set(&nfacct->pkts,
 			     be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS])));
 	}
+	if (tb[NFACCT_FMT]) {
+		atomic_set(&nfacct->fmt,
+			   be32_to_cpu(nla_get_be32(tb[NFACCT_FMT])));
+	}
 	atomic_set(&nfacct->refcnt, 1);
 	list_add_tail_rcu(&nfacct->head, &nfnl_acct_list);
 	return 0;
@@ -121,6 +131,7 @@ 
 	}
 	if (nla_put_be64(skb, NFACCT_PKTS, cpu_to_be64(pkts)) ||
 	    nla_put_be64(skb, NFACCT_BYTES, cpu_to_be64(bytes)) ||
+	    nla_put_be32(skb, NFACCT_FMT, htonl(atomic_read(&acct->fmt))) ||
 	    nla_put_be32(skb, NFACCT_USE, htonl(atomic_read(&acct->refcnt))))
 		goto nla_put_failure;
 
@@ -265,6 +276,7 @@ 
 	[NFACCT_NAME] = { .type = NLA_NUL_STRING, .len = NFACCT_NAME_MAX-1 },
 	[NFACCT_BYTES] = { .type = NLA_U64 },
 	[NFACCT_PKTS] = { .type = NLA_U64 },
+	[NFACCT_FMT] = { .type = NLA_U32 },
 };
 
 static const struct nfnl_callback nfnl_acct_cb[NFNL_MSG_ACCT_MAX] = {
--- a/include/uapi/linux/netfilter/nfnetlink_acct.h
+++ b/include/uapi/linux/netfilter/nfnetlink_acct.h
@@ -18,6 +18,7 @@ 
 	NFACCT_NAME,
 	NFACCT_PKTS,
 	NFACCT_BYTES,
+	NFACCT_FMT,
 	NFACCT_USE,
 	__NFACCT_MAX
 };