diff mbox

pkt_sched: act_xt support new Xtables interface

Message ID 50D8413C.8050508@openwrt.org
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Felix Fietkau Dec. 24, 2012, 11:49 a.m. UTC
On 2012-12-24 12:34 PM, Jamal Hadi Salim wrote:
> 
> Some good news Yury.
> I am told Felix Fietkau <nbd@openwrt.org> (on CC) actually
> already solved this issue and it is a feature in openwrt. I
> cant find the code.
> 
> Felix - Yury is trying to retrieve skb->mark fields from
> netfilter connmark. My understanding is you have written
> such an action. Can you please point us to it - and any
> reason you havent submitted this for inclusion in kernel
> proper?
After I added it as an experiment, I got distracted with other projects
again and forgot about submitting it. Take a look at the code - if the
approach is reasonable, I'll submit this thing for inclusion soon.

- Felix


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jamal Hadi Salim Dec. 24, 2012, 12:19 p.m. UTC | #1
On 12-12-24 06:49 AM, Felix Fietkau wrote:

>
> After I added it as an experiment, I got distracted with other projects
> again and forgot about submitting it. Take a look at the code - if the
> approach is reasonable, I'll submit this thing for inclusion soon.
>

Excellent ;-> Simple and elegant.

Usable as is  - some minor comments.
First nitpick: The name is not very reflective, how about:
GetMarkFromConntrack or something along those lines?


> +static int tcf_connmark(struct sk_buff *skb, const struct tc_action *a,
> +		       struct tcf_result *res)
> +{
> +	struct nf_conn *c;
> +	enum ip_conntrack_info ctinfo;
> +	int proto;
> +	int r;
> +
> +	if (skb->protocol == htons(ETH_P_IP)) {
> +		if (skb->len < sizeof(struct iphdr))
> +			goto out;
> +		proto = PF_INET;
> +	} else if (skb->protocol == htons(ETH_P_IPV6)) {
> +		if (skb->len < sizeof(struct ipv6hdr))
> +			goto out;
> +		proto = PF_INET6;
> +	} else
> +		goto out;
> +

I would have said that this action is probably also not useful for 
egress qdisc path since skb->mark would already be set. It maybe worth 
checking skb->tc_verd and skipping overhead of nf_conntrack_in() call.
Look at act_mirred for such a check.

> +	r = nf_conntrack_in(dev_net(skb->dev), proto, NF_INET_PRE_ROUTING, skb);
> +	if (r != NF_ACCEPT)
> +		goto out;
> +
> +	c = nf_ct_get(skb, &ctinfo);
> +	if (!c)
> +		goto out;
> +
> +	skb->mark = c->mark;
> +	nf_conntrack_put(skb->nfct);
> +	skb->nfct = NULL;
> +
> +out:
> +	return TC_ACT_PIPE;

Ok, perhaps set tcf_action in (iproute2) user space to TC_ACT_PIPE then 
just return policy->tcf_action here.

Even better is to have a different TC_ACT_XXX returned for failure
vs success... Your success path becomes TC_ACT_PIPE and let the
user program the failure branch optionally. This would allow for 
branching to different actions if success/failure, example:
if mark is found {
    if mark is 0xa redirect to ifb0
    else
      redirect to ifb1
} else
       set mark to 3 then redirect to ifb9

etc.

Not sure if that made sense. I am under the influence of nyquil ;->

cheers,
jamal


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso Dec. 24, 2012, 1:12 p.m. UTC | #2
Hi Felix,

On Mon, Dec 24, 2012 at 12:49:16PM +0100, Felix Fietkau wrote:
> On 2012-12-24 12:34 PM, Jamal Hadi Salim wrote:
> > 
> > Some good news Yury.
> > I am told Felix Fietkau <nbd@openwrt.org> (on CC) actually
> > already solved this issue and it is a feature in openwrt. I
> > cant find the code.
> > 
> > Felix - Yury is trying to retrieve skb->mark fields from
> > netfilter connmark. My understanding is you have written
> > such an action. Can you please point us to it - and any
> > reason you havent submitted this for inclusion in kernel
> > proper?
> After I added it as an experiment, I got distracted with other projects
> again and forgot about submitting it. Take a look at the code - if the
> approach is reasonable, I'll submit this thing for inclusion soon.
> 
> - Felix
> 
> --- /dev/null
> +++ b/net/sched/act_connmark.c
> @@ -0,0 +1,137 @@
> +/*
> + * Copyright (c) 2011 Felix Fietkau <nbd@openwrt.org>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
> + * Place - Suite 330, Boston, MA 02111-1307 USA.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/skbuff.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/pkt_cls.h>
> +#include <linux/ip.h>
> +#include <linux/ipv6.h>
> +#include <net/netlink.h>
> +#include <net/pkt_sched.h>
> +#include <net/act_api.h>
> +
> +#include <net/netfilter/nf_conntrack.h>
> +#include <net/netfilter/nf_conntrack_core.h>
> +
> +#define TCA_ACT_CONNMARK	20
> +
> +#define CONNMARK_TAB_MASK     3
> +static struct tcf_common *tcf_connmark_ht[CONNMARK_TAB_MASK + 1];
> +static u32 connmark_idx_gen;
> +static DEFINE_RWLOCK(connmark_lock);
> +
> +static struct tcf_hashinfo connmark_hash_info = {
> +	.htab	=	tcf_connmark_ht,
> +	.hmask	=	CONNMARK_TAB_MASK,
> +	.lock	=	&connmark_lock,
> +};
> +
> +static int tcf_connmark(struct sk_buff *skb, const struct tc_action *a,
> +		       struct tcf_result *res)
> +{
> +	struct nf_conn *c;
> +	enum ip_conntrack_info ctinfo;
> +	int proto;
> +	int r;
> +
> +	if (skb->protocol == htons(ETH_P_IP)) {
> +		if (skb->len < sizeof(struct iphdr))
> +			goto out;
> +		proto = PF_INET;
> +	} else if (skb->protocol == htons(ETH_P_IPV6)) {
> +		if (skb->len < sizeof(struct ipv6hdr))
> +			goto out;
> +		proto = PF_INET6;
> +	} else
> +		goto out;
> +
> +	r = nf_conntrack_in(dev_net(skb->dev), proto, NF_INET_PRE_ROUTING, skb);

conntrack needs to see defragmented packets, you have to call
nf_defrag_ipv4 / _ipv6 respectively before that.

This also changes the semantics of the raw table in iptables since it
will now see packet with conntrack already attached. So this would
also break -j CT --notrack.

This needs more thinking. I can appreciate the value of calling
conntrack from different points of the packet traversal, but there are
a couple of thing we have to resolve before allowing that.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jamal Hadi Salim Dec. 24, 2012, 2:05 p.m. UTC | #3
Hi Pablo,

On 12-12-24 08:12 AM, Pablo Neira Ayuso wrote:

>
> conntrack needs to see defragmented packets, you have to call
> nf_defrag_ipv4 / _ipv6 respectively before that.
>

This should not be too hard to do - although my thinking says this
should be a separate action.

> This also changes the semantics of the raw table in iptables since it
> will now see packet with conntrack already attached. So this would
> also break -j CT --notrack.
>

Is there a flag we can check which says a flow is not to be tracked?
Doesnt nf_conntrack_in() fail if --no track is set?

> This needs more thinking. I can appreciate the value of calling
> conntrack from different points of the packet traversal, but there are
> a couple of thing we have to resolve before allowing that.

There is user need for this Pablo - as you can see from what Felix
deployed it seems to be used a lot more wider audience dependency.
What do we need to do to get this to work properly?

cheers,
jamal




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso Dec. 24, 2012, 6:19 p.m. UTC | #4
Hi Jamal,

On Mon, Dec 24, 2012 at 09:05:42AM -0500, Jamal Hadi Salim wrote:
> On 12-12-24 08:12 AM, Pablo Neira Ayuso wrote:
> >
> >conntrack needs to see defragmented packets, you have to call
> >nf_defrag_ipv4 / _ipv6 respectively before that.
> >
> 
> This should not be too hard to do - although my thinking says this
> should be a separate action.
> 
> >This also changes the semantics of the raw table in iptables since it
> >will now see packet with conntrack already attached. So this would
> >also break -j CT --notrack.
> 
> Is there a flag we can check which says a flow is not to be tracked?
> Doesnt nf_conntrack_in() fail if --no track is set?

The notrack dummy conntrack (consider it a flag) is attached in
prerouting raw table. By attaching conntracks at ingress, the notrack
flag will be ignored. Note that this also breaks conntrack templates
via -j CT, that allows us to set custom conntrack timeouts, zones and
helpers at prerouting raw.

Basically, ct templates are attached via -j CT, this template is
munched by nf_conntrack_in, which adds the corresponding ct features
based on the template information.

> >This needs more thinking. I can appreciate the value of calling
> >conntrack from different points of the packet traversal, but there are
> >a couple of thing we have to resolve before allowing that.
> 
> There is user need for this Pablo - as you can see from what Felix
> deployed it seems to be used a lot more wider audience dependency.
> What do we need to do to get this to work properly?

The conntrack code needs to be generalized to allow creating conntrack
with features all at once (so we can remove the template
infrastructure). Even after that, we'll still have that -j CT rules
will be ignored if you're using, let's name it, act_ct from ingress to
attach the conntrack to it.

With the current approach you're using, people will see conntracks in
the iptables raw table, that breaks the current semantics.

We'll have the netfilter workshop by Q1/Q2 2013 (still TBA), I think
this is material for discussion in it.

cheers,
Pablo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso Dec. 26, 2012, 11:10 p.m. UTC | #5
On Mon, Dec 24, 2012 at 07:19:43PM +0100, Pablo Neira Ayuso wrote:
> Hi Jamal,
> 
> On Mon, Dec 24, 2012 at 09:05:42AM -0500, Jamal Hadi Salim wrote:
> > On 12-12-24 08:12 AM, Pablo Neira Ayuso wrote:
> > >
> > >conntrack needs to see defragmented packets, you have to call
> > >nf_defrag_ipv4 / _ipv6 respectively before that.
> > >
> > 
> > This should not be too hard to do - although my thinking says this
> > should be a separate action.
> > 
> > >This also changes the semantics of the raw table in iptables since it
> > >will now see packet with conntrack already attached. So this would
> > >also break -j CT --notrack.
> > 
> > Is there a flag we can check which says a flow is not to be tracked?
> > Doesnt nf_conntrack_in() fail if --no track is set?
> 
> The notrack dummy conntrack (consider it a flag) is attached in
> prerouting raw table. By attaching conntracks at ingress, the notrack
> flag will be ignored. Note that this also breaks conntrack templates
> via -j CT, that allows us to set custom conntrack timeouts, zones and
> helpers at prerouting raw.
> 
> Basically, ct templates are attached via -j CT, this template is
> munched by nf_conntrack_in, which adds the corresponding ct features
> based on the template information.

I'm still spinning around this and I don't come with some easy
solution that doesn't break the existing semantics. One possibility
can be to drop the ct reference after leaving ingress, so the lookup
happens again in prerouting after the raw table to attach it again and
no ct is seen in the raw table but:

1) it's suboptimal in case users have rules using ct at ingress and in
   iptables.

2) the conntrack template infrastructure needs to be reworked/replaced
   by something more flexible to attach features to conntracks, so we
   can still attach features for conntrack entries that were created
   at ingress (so helpers / custom timeouts / notrack don't break).
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

--- /dev/null
+++ b/net/sched/act_connmark.c
@@ -0,0 +1,137 @@ 
+/*
+ * Copyright (c) 2011 Felix Fietkau <nbd@openwrt.org>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/skbuff.h>
+#include <linux/rtnetlink.h>
+#include <linux/pkt_cls.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/act_api.h>
+
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_core.h>
+
+#define TCA_ACT_CONNMARK	20
+
+#define CONNMARK_TAB_MASK     3
+static struct tcf_common *tcf_connmark_ht[CONNMARK_TAB_MASK + 1];
+static u32 connmark_idx_gen;
+static DEFINE_RWLOCK(connmark_lock);
+
+static struct tcf_hashinfo connmark_hash_info = {
+	.htab	=	tcf_connmark_ht,
+	.hmask	=	CONNMARK_TAB_MASK,
+	.lock	=	&connmark_lock,
+};
+
+static int tcf_connmark(struct sk_buff *skb, const struct tc_action *a,
+		       struct tcf_result *res)
+{
+	struct nf_conn *c;
+	enum ip_conntrack_info ctinfo;
+	int proto;
+	int r;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		if (skb->len < sizeof(struct iphdr))
+			goto out;
+		proto = PF_INET;
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		if (skb->len < sizeof(struct ipv6hdr))
+			goto out;
+		proto = PF_INET6;
+	} else
+		goto out;
+
+	r = nf_conntrack_in(dev_net(skb->dev), proto, NF_INET_PRE_ROUTING, skb);
+	if (r != NF_ACCEPT)
+		goto out;
+
+	c = nf_ct_get(skb, &ctinfo);
+	if (!c)
+		goto out;
+
+	skb->mark = c->mark;
+	nf_conntrack_put(skb->nfct);
+	skb->nfct = NULL;
+
+out:
+	return TC_ACT_PIPE;
+}
+
+static int tcf_connmark_init(struct nlattr *nla, struct nlattr *est,
+			 struct tc_action *a, int ovr, int bind)
+{
+	struct tcf_common *pc;
+
+	pc = tcf_hash_create(0, est, a, sizeof(*pc), bind,
+			     &connmark_idx_gen, &connmark_hash_info);
+	if (IS_ERR(pc))
+	    return PTR_ERR(pc);
+
+	tcf_hash_insert(pc, &connmark_hash_info);
+
+	return ACT_P_CREATED;
+}
+
+static inline int tcf_connmark_cleanup(struct tc_action *a, int bind)
+{
+	if (a->priv)
+		return tcf_hash_release(a->priv, bind, &connmark_hash_info);
+	return 0;
+}
+
+static inline int tcf_connmark_dump(struct sk_buff *skb, struct tc_action *a,
+				int bind, int ref)
+{
+	return skb->len;
+}
+
+static struct tc_action_ops act_connmark_ops = {
+	.kind		=	"connmark",
+	.hinfo		=	&connmark_hash_info,
+	.type		=	TCA_ACT_CONNMARK,
+	.capab		=	TCA_CAP_NONE,
+	.owner		=	THIS_MODULE,
+	.act		=	tcf_connmark,
+	.dump		=	tcf_connmark_dump,
+	.cleanup	=	tcf_connmark_cleanup,
+	.init		=	tcf_connmark_init,
+	.walk		=	tcf_generic_walker,
+};
+
+MODULE_AUTHOR("Felix Fietkau <nbd@openwrt.org>");
+MODULE_DESCRIPTION("Connection tracking mark restoring");
+MODULE_LICENSE("GPL");
+
+static int __init connmark_init_module(void)
+{
+	return tcf_register_action(&act_connmark_ops);
+}
+
+static void __exit connmark_cleanup_module(void)
+{
+	tcf_unregister_action(&act_connmark_ops);
+}
+
+module_init(connmark_init_module);
+module_exit(connmark_cleanup_module);
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -670,6 +670,19 @@  config NET_ACT_CSUM
 	  To compile this code as a module, choose M here: the
 	  module will be called act_csum.
 
+config NET_ACT_CONNMARK
+        tristate "Connection Tracking Marking"
+        depends on NET_CLS_ACT
+        depends on NF_CONNTRACK
+	 depends on NF_CONNTRACK_MARK
+        ---help---
+	  Say Y here to restore the connmark from a scheduler action
+
+	  If unsure, say N.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called act_connmark.
+
 config NET_CLS_IND
 	bool "Incoming device classification"
 	depends on NET_CLS_U32 || NET_CLS_FW
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -16,6 +16,7 @@  obj-$(CONFIG_NET_ACT_PEDIT)	+= act_pedit
 obj-$(CONFIG_NET_ACT_SIMP)	+= act_simple.o
 obj-$(CONFIG_NET_ACT_SKBEDIT)	+= act_skbedit.o
 obj-$(CONFIG_NET_ACT_CSUM)	+= act_csum.o
+obj-$(CONFIG_NET_ACT_CONNMARK)	+= act_connmark.o
 obj-$(CONFIG_NET_SCH_FIFO)	+= sch_fifo.o
 obj-$(CONFIG_NET_SCH_CBQ)	+= sch_cbq.o
 obj-$(CONFIG_NET_SCH_HTB)	+= sch_htb.o