[nf-next-2.6] netfilter: add xt_cpu match

Message ID 1279811939.2467.79.camel@edumazet-laptop
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Eric Dumazet July 22, 2010, 3:18 p.m. UTC
On Thursday, 22 July 2010 at 16:19 +0200, Jan Engelhardt wrote:
> On Thursday 2010-07-22 16:03, Eric Dumazet wrote:
> 
> >This match is a bit strange, being packet content agnostic...
> >+/*
> >+ * Yes, packet content is not interesting for us, we only take care
> >+ * of cpu handling this packet
> >+ */
> 
> That is not so strange after all, we have many packet agnostic matches: 
> xt_time, xt_condition, xt_IDLETIMER, xt_iface.
> So this little comment looks a bit redundant.
> 
> Or it seems that academia can't come up with enough new protocols in time that
> we have to resort to do -m coffeemaker :)
> 
> >@@ -0,0 +1,8 @@
> >+#ifndef _XT_CPU_H
> >+#define _XT_CPU_H
> >+
> >+struct xt_cpu_info {
> >+	unsigned int	cpu;
> >+	int		invert;
> >+};
> >+#endif /*_XT_MAC_H*/
> 
> Please take a read in "Writing Netfilter Modules" e-book :-)
> It will tell you that types other than fixed ones are a no-no.

OK, let's do that, although I doubt sizeof(int) can be anything other than 4
on a Linux 2.6 host right now.

I prefer not to do the !!info->invert in the match function, and to do the
check only once instead.
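
The trade-off mentioned here can be illustrated with a minimal userspace sketch. It is not the kernel code: `cpu_info_model`, `model_check`, `model_match` and `model_match_normalizing` are hypothetical names, and `uint32_t` stands in for `__u32`. The assumption is that the check function runs once when a rule is installed, so the per-packet function can rely on `invert` being 0 or 1.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for struct xt_cpu_info (hypothetical name). */
struct cpu_info_model {
	uint32_t cpu;
	uint32_t invert;
};

/* Runs once per rule: reject any invert value other than 0 or 1. */
static int model_check(const struct cpu_info_model *info)
{
	if (info->invert & ~1u)
		return -EINVAL;
	return 0;
}

/* Per-packet path; relies on model_check() having accepted the rule. */
static bool model_match(const struct cpu_info_model *info, uint32_t current_cpu)
{
	assert(info->invert <= 1);
	return (info->cpu == current_cpu) ^ info->invert;
}

/* Alternative: skip the up-front check and normalize on every packet. */
static bool model_match_normalizing(const struct cpu_info_model *info,
				    uint32_t current_cpu)
{
	return (info->cpu == current_cpu) ^ !!info->invert;
}
```

With the up-front check, a rule carrying `invert = 2` is refused at install time, whereas the normalizing variant would silently treat it as `invert = 1`.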

Thanks

[PATCH nf-next-2.6] netfilter: add xt_cpu match

In some situations a CPU match permits a better spreading of
connections, or selecting targets only for a given CPU.

With Receive Packet Steering (RPS) or a multiqueue NIC and appropriate IRQ
affinities, we can distribute traffic across the available CPUs, per session
(all RX packets for a given flow are handled by a given CPU).

Some legacy applications are not SMP friendly, so one way to scale a
server is to run multiple copies of them.

Instead of randomly choosing an instance, we can use the CPU number as a
key so that the softirq handler for a whole instance runs on a single
CPU, maximizing cache effects in the TCP/UDP stacks.

Using NAT for example, a four-way machine might run four copies of a
server application, using a separate listening port for each instance,
but still presenting a unique external port:

iptables -t nat -A PREROUTING -p tcp --dport 80 -m cpu --cpu 0 \
        -j REDIRECT --to-port 8080

iptables -t nat -A PREROUTING -p tcp --dport 80 -m cpu --cpu 1 \
        -j REDIRECT --to-port 8081

iptables -t nat -A PREROUTING -p tcp --dport 80 -m cpu --cpu 2 \
        -j REDIRECT --to-port 8082

iptables -t nat -A PREROUTING -p tcp --dport 80 -m cpu --cpu 3 \
        -j REDIRECT --to-port 8083
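
The four rules above follow one pattern: CPU n is redirected to port 8080 + n. As a hedged illustration only, a small C helper could generate such rules for any CPU count. The names `format_cpu_rule` and `print_cpu_rules` are hypothetical, and the "base port plus CPU number" mapping is an assumption read off the example, not something the match itself requires.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format one REDIRECT rule for a given CPU (hypothetical helper).
 * Returns 0 on success, -1 if the rule does not fit in buf.
 */
static int format_cpu_rule(char *buf, size_t len, unsigned int cpu,
			   unsigned int dport, unsigned int base_port)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len,
		     "iptables -t nat -A PREROUTING -p tcp --dport %u "
		     "-m cpu --cpu %u -j REDIRECT --to-port %u",
		     dport, cpu, base_port + cpu);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Print one rule per CPU, e.g. print_cpu_rules(4, 80, 8080). */
static void print_cpu_rules(unsigned int ncpus, unsigned int dport,
			    unsigned int base_port)
{
	char rule[160];
	unsigned int cpu;

	for (cpu = 0; cpu < ncpus; cpu++)
		if (format_cpu_rule(rule, sizeof(rule), cpu, dport, base_port) == 0)
			puts(rule);
}
```

Calling `print_cpu_rules(4, 80, 8080)` prints the same four rules on single lines, without the backslash continuations.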


Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netfilter/Kbuild   |    3 -
 include/linux/netfilter/xt_cpu.h |   11 +++++
 net/netfilter/Kconfig            |    9 ++++
 net/netfilter/Makefile           |    1 
 net/netfilter/xt_cpu.c           |   63 +++++++++++++++++++++++++++++
 5 files changed, 86 insertions(+), 1 deletion(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jan Engelhardt July 22, 2010, 3:39 p.m. UTC | #1
On Thursday 2010-07-22 17:18, Eric Dumazet wrote:

>On Thursday, 22 July 2010 at 16:19 +0200, Jan Engelhardt wrote:
>> On Thursday 2010-07-22 16:03, Eric Dumazet wrote:
>> 
>> >This match is a bit strange, being packet content agnostic...
>> >+/*
>> >+ * Yes, packet content is not interesting for us, we only take care
>> >+ * of cpu handling this packet
>> >+ */
>> 
>> That is not so strange after all, we have many packet agnostic matches: 
>> xt_time, xt_condition, xt_IDLETIMER, xt_iface.
>> So this little comment looks a bit redundant.
>> 
>> Or it seems that academia can't come up with enough new protocols in time that
>> we have to resort to do -m coffeemaker :)
>> 
>> >@@ -0,0 +1,8 @@
>> >+#ifndef _XT_CPU_H
>> >+#define _XT_CPU_H
>> >+
>> >+struct xt_cpu_info {
>> >+	unsigned int	cpu;
>> >+	int		invert;
>> >+};
>> >+#endif /*_XT_MAC_H*/
>> 
>> Please take a read in "Writing Netfilter Modules" e-book :-)
>> It will tell you that types other than fixed ones are a no-no.
>
>Ok, let's do that, but I doubt sizeof(int) can be different than 4 on a
>Linux 2.6 host right now.

Never say never. "long" already bit people in the past, and now we
have that CONFIG_COMPAT stuff.
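
The concern about "long" can be sketched in userspace as follows. This is an illustration under stated assumptions, not kernel code: `uint32_t` models `__u32`, `cpu_info_fixed` mirrors the fixed-width layout, and `info_with_long` is a hypothetical counter-example that does not appear in the patch. On common ABIs its size differs between 32-bit and 64-bit builds, which is the kind of mismatch a 32-bit userspace on a 64-bit kernel would run into.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed-width layout: same size and offsets for 32- and 64-bit builds. */
struct cpu_info_fixed {
	uint32_t cpu;
	uint32_t invert;
};

/* Hypothetical layout using "long"; NOT taken from the patch. */
struct info_with_long {
	long     value;
	uint32_t flag;
};

static_assert(sizeof(struct cpu_info_fixed) == 8,
	      "fixed-width layout is always 8 bytes");
static_assert(offsetof(struct cpu_info_fixed, invert) == 4,
	      "invert immediately follows cpu");

static size_t fixed_layout_size(void)
{
	return sizeof(struct cpu_info_fixed);
}

/* Typically 8 on ILP32 targets and 16 on LP64 targets. */
static size_t long_layout_size(void)
{
	return sizeof(struct info_with_long);
}
```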

If invert is the only flag, perhaps it makes sense to use __u8 
for it. 

>I prefer not doing the !!info->invert, and do the check only once.

>+static int cpu_mt_check(const struct xt_mtchk_param *par)
>+{
>+	const struct xt_cpu_info *info = par->matchinfo;
>+
>+	if (info->invert & ~1)
>+		return -EINVAL;
>+	return 0;
>+}
>+
>+static bool cpu_mt(const struct sk_buff *skb, struct xt_action_param *par)
>+{
>+	const struct xt_cpu_info *info = par->matchinfo;
>+
>+	return (info->cpu == smp_processor_id()) ^ info->invert;
>+}

That works nicely indeed. Do you anticipate any future flags?

Eric Dumazet July 22, 2010, 4:24 p.m. UTC | #2
On Thursday, 22 July 2010 at 17:39 +0200, Jan Engelhardt wrote:

> Never say never. "long" already bit people in the past, and now we
> have that CONFIG_COMPAT stuff.
> 

I know the "long" problem pretty well: I received one of the first Alpha
machines ever built in the world (a DEC 3000 AXP, with a fast 133 MHz
CPU ;) ), before I began to use Linux :)


> If invert is the only flag, perhaps it makes sense to use __u8 
> for it. 
> 

Quite frankly, it brings more problems than a plain u32:

- Possible security problems (padding bytes). Not applicable to
iptables.

- Some arches have slow byte/short accesses (21064 for example :) ).
"int" is the natural type, fast on all arches.

- Given the alignment requirements of iptables rules, using less than
32 bits here saves no RAM.

But I don't care that much.
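
The padding and "saves no RAM" points can be checked with a small userspace sketch. The struct names are hypothetical, `uint32_t` and `uint8_t` stand in for `__u32` and `__u8`, and the numbers in the comments assume an ABI where `uint32_t` is 4-byte aligned (true on x86 and most other targets, but not guaranteed everywhere).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout that was applied: two 32-bit fields. */
struct cpu_info_u32 {
	uint32_t cpu;
	uint32_t invert;
};

/* Hypothetical variant with a one-byte flag, as suggested in the thread. */
struct cpu_info_u8 {
	uint32_t cpu;
	uint8_t  invert;
};

static_assert(offsetof(struct cpu_info_u8, invert) == 4,
	      "the flag still starts at offset 4");

/* Bytes after the flag that the compiler adds as trailing padding. */
static size_t u8_trailing_padding(void)
{
	return sizeof(struct cpu_info_u8) -
	       (offsetof(struct cpu_info_u8, invert) + sizeof(uint8_t));
}

/* Size difference between the two layouts. */
static size_t bytes_saved_by_u8(void)
{
	return sizeof(struct cpu_info_u32) - sizeof(struct cpu_info_u8);
}
```

Under the stated alignment assumption the one-byte variant ends with three padding bytes and is no smaller than the two-word layout.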

I even see that the compiler doesn't want to use an XOR instruction:

00000018 <cpu_mt>:
  18:	55                   	push   %ebp
  19:	8b 42 04             	mov    0x4(%edx),%eax
  1c:	64 8b 15 00 00 00 00 	mov    %fs:0x0,%edx
  23:	89 e5                	mov    %esp,%ebp
  25:	5d                   	pop    %ebp
  26:	39 10                	cmp    %edx,(%eax)
  28:	0f 94 c2             	sete   %dl
  2b:	0f b6 d2             	movzbl %dl,%edx
  2e:	3b 50 04             	cmp    0x4(%eax),%edx
  31:	0f 95 c0             	setne  %al
  34:	c3                   	ret    
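
As I read the listing, the compiler first turns the CPU comparison into 0 or 1 (`sete`), then compares that value with `invert` (`cmp`, `setne`) rather than XORing. Because the function returns `bool`, the XOR result is only tested for being nonzero, and `a ^ b` is nonzero exactly when `a != b`, so both forms give the same answer. The sketch below, with hypothetical function names and userspace types, checks that equivalence over a small grid of inputs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Form written in the patch: XOR, then implicit conversion to bool. */
static bool match_xor(uint32_t rule_cpu, uint32_t current_cpu, uint32_t invert)
{
	return (rule_cpu == current_cpu) ^ invert;
}

/* Form suggested by the listing: 0/1 equality flag compared with invert. */
static bool match_cmp(uint32_t rule_cpu, uint32_t current_cpu, uint32_t invert)
{
	uint32_t equal = (rule_cpu == current_cpu);

	return equal != invert;
}

/* Check both forms on a 4x4x4 grid; returns the number of cases checked. */
static unsigned int verify_forms_agree(void)
{
	unsigned int checked = 0;
	uint32_t rule_cpu, cur, invert;

	for (rule_cpu = 0; rule_cpu < 4; rule_cpu++)
		for (cur = 0; cur < 4; cur++)
			for (invert = 0; invert < 4; invert++) {
				assert(match_xor(rule_cpu, cur, invert) ==
				       match_cmp(rule_cpu, cur, invert));
				checked++;
			}
	return checked;
}
```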




Patrick McHardy July 23, 2010, 11 a.m. UTC | #3
On 22.07.2010 17:18, Eric Dumazet wrote:
> [PATCH nf-next-2.6] netfilter: add xt_cpu match
> 
> In some situations a CPU match permits a better spreading of
> connections, or selecting targets only for a given CPU.
> 
> With Receive Packet Steering (RPS) or a multiqueue NIC and appropriate IRQ
> affinities, we can distribute traffic across the available CPUs, per session
> (all RX packets for a given flow are handled by a given CPU).
> 
> Some legacy applications are not SMP friendly, so one way to scale a
> server is to run multiple copies of them.
> 
> Instead of randomly choosing an instance, we can use the CPU number as a
> key so that the softirq handler for a whole instance runs on a single
> CPU, maximizing cache effects in the TCP/UDP stacks.
> 
> Using NAT for example, a four-way machine might run four copies of a
> server application, using a separate listening port for each instance,
> but still presenting a unique external port:
> 
> iptables -t nat -A PREROUTING -p tcp --dport 80 -m cpu --cpu 0 \
>         -j REDIRECT --to-port 8080
> 
> iptables -t nat -A PREROUTING -p tcp --dport 80 -m cpu --cpu 1 \
>         -j REDIRECT --to-port 8081
> 
> iptables -t nat -A PREROUTING -p tcp --dport 80 -m cpu --cpu 2 \
>         -j REDIRECT --to-port 8082
> 
> iptables -t nat -A PREROUTING -p tcp --dport 80 -m cpu --cpu 3 \
>         -j REDIRECT --to-port 8083
> 

Applied, thanks Eric.

Patch

diff --git a/include/linux/netfilter/Kbuild b/include/linux/netfilter/Kbuild
index bb103f4..1041a1d 100644
--- a/include/linux/netfilter/Kbuild
+++ b/include/linux/netfilter/Kbuild
@@ -19,12 +19,13 @@  header-y += xt_TCPMSS.h
 header-y += xt_TCPOPTSTRIP.h
 header-y += xt_TEE.h
 header-y += xt_TPROXY.h
+header-y += xt_cluster.h
 header-y += xt_comment.h
 header-y += xt_connbytes.h
 header-y += xt_connlimit.h
 header-y += xt_connmark.h
 header-y += xt_conntrack.h
-header-y += xt_cluster.h
+header-y += xt_cpu.h
 header-y += xt_dccp.h
 header-y += xt_dscp.h
 header-y += xt_esp.h
diff --git a/include/linux/netfilter/xt_cpu.h b/include/linux/netfilter/xt_cpu.h
index e69de29..93c7f11 100644
--- a/include/linux/netfilter/xt_cpu.h
+++ b/include/linux/netfilter/xt_cpu.h
@@ -0,0 +1,11 @@ 
+#ifndef _XT_CPU_H
+#define _XT_CPU_H
+
+#include <linux/types.h>
+
+struct xt_cpu_info {
+	__u32	cpu;
+	__u32	invert;
+};
+
+#endif /*_XT_CPU_H*/
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index aa2f106..523e8d0 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -647,6 +647,15 @@  config NETFILTER_XT_MATCH_CONNTRACK
 
 	  To compile it as a module, choose M here.  If unsure, say N.
 
+config NETFILTER_XT_MATCH_CPU
+	tristate '"cpu" match support'
+	depends on NETFILTER_ADVANCED
+	help
+	  CPU matching allows you to match packets based on the CPU
+	  currently handling the packet.
+
+	  To compile it as a module, choose M here.  If unsure, say N.
+
 config NETFILTER_XT_MATCH_DCCP
 	tristate '"dccp" protocol match support'
 	depends on NETFILTER_ADVANCED
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index e28420a..6da84c3 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -69,6 +69,7 @@  obj-$(CONFIG_NETFILTER_XT_MATCH_COMMENT) += xt_comment.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_CONNBYTES) += xt_connbytes.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_CONNLIMIT) += xt_connlimit.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_CONNTRACK) += xt_conntrack.o
+obj-$(CONFIG_NETFILTER_XT_MATCH_CPU) += xt_cpu.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_DCCP) += xt_dccp.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_DSCP) += xt_dscp.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_ESP) += xt_esp.o
diff --git a/net/netfilter/xt_cpu.c b/net/netfilter/xt_cpu.c
index e69de29..b39db8a 100644
--- a/net/netfilter/xt_cpu.c
+++ b/net/netfilter/xt_cpu.c
@@ -0,0 +1,63 @@ 
+/* Kernel module to match running CPU */
+
+/*
+ * Might be used to distribute connections on several daemons, if
+ * RPS (Receive Packet Steering) is enabled or NIC is multiqueue capable,
+ * each RX queue IRQ affined to one CPU (1:1 mapping)
+ *
+ */
+
+/* (C) 2010 Eric Dumazet
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/netfilter/xt_cpu.h>
+#include <linux/netfilter/x_tables.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Eric Dumazet <eric.dumazet@gmail.com>");
+MODULE_DESCRIPTION("Xtables: CPU match");
+
+static int cpu_mt_check(const struct xt_mtchk_param *par)
+{
+	const struct xt_cpu_info *info = par->matchinfo;
+
+	if (info->invert & ~1)
+		return -EINVAL;
+	return 0;
+}
+
+static bool cpu_mt(const struct sk_buff *skb, struct xt_action_param *par)
+{
+	const struct xt_cpu_info *info = par->matchinfo;
+
+	return (info->cpu == smp_processor_id()) ^ info->invert;
+}
+
+static struct xt_match cpu_mt_reg __read_mostly = {
+	.name       = "cpu",
+	.revision   = 0,
+	.family     = NFPROTO_UNSPEC,
+	.checkentry = cpu_mt_check,
+	.match      = cpu_mt,
+	.matchsize  = sizeof(struct xt_cpu_info),
+	.me         = THIS_MODULE,
+};
+
+static int __init cpu_mt_init(void)
+{
+	return xt_register_match(&cpu_mt_reg);
+}
+
+static void __exit cpu_mt_exit(void)
+{
+	xt_unregister_match(&cpu_mt_reg);
+}
+
+module_init(cpu_mt_init);
+module_exit(cpu_mt_exit);