diff mbox

net: reflect the fwmark for replies with no socket

Message ID 1394536541-25702-1-git-send-email-lorenzo@google.com
State Rejected, archived
Delegated to: David Miller
Headers show

Commit Message

Lorenzo Colitti March 11, 2014, 11:15 a.m. UTC
Kernel-originated IP packets that have no user socket associated
with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.)
are emitted with a mark of zero. Instead, make them have the
same mark as the packet they are replying to.

This is consistent with TOS, which is also reflected from the
incoming packet, and it allows the administrator to use
mark-based routing, firewalling, etc. for these replies by
marking the original packets inbound.

Also fix the IPv6 code to reflect the tclass in replies like the
IPv4 code does.

Change-Id: Ifd8dd75016e60dc982e7860f720d45c27dcaf04c
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
---
 net/ipv4/icmp.c      | 10 ++++++++--
 net/ipv4/ip_output.c |  2 +-
 net/ipv6/tcp_ipv6.c  |  7 +++++--
 3 files changed, 14 insertions(+), 5 deletions(-)

Comments

Julian Anastasov March 11, 2014, 8:52 p.m. UTC | #1
Hello,

On Tue, 11 Mar 2014, Lorenzo Colitti wrote:

> Kernel-originated IP packets that have no user socket associated
> with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.)
> are emitted with a mark of zero. Instead, make them have the
> same mark as the packet they are replying to.
> 
> This is consistent with TOS, which is also reflected from the
> incoming packet, and it allows the administrator to use
> mark-based routing, firewalling, etc. for these replies by
> marking the original packets inbound.

	If you can do it with CONNMARK save+restore,
do not change behavior for existing setups. It is
hard to predict how the mark is used in ip and
POST ROUTING rules.

> Also fix the IPv6 code to reflect the tclass in replies like the
> IPv4 code does.

Regards

--
Julian Anastasov <ja@ssi.bg>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lorenzo Colitti March 12, 2014, 1:36 a.m. UTC | #2
On Wed, Mar 12, 2014 at 5:52 AM, Julian Anastasov <ja@ssi.bg> wrote:
>         If you can do it with CONNMARK save+restore,
> do not change behavior for existing setups. It is
> hard to predict how the mark is used in ip and
> POST ROUTING rules.

I think CONNMARK save/restore can do some of this but not all of it.

For example, consider the case of mark-based routing ("ip rule add
fwmark 123 lookup 456"). Setting the mark when emitting the packet
from protocol code (i.e., this patch), will cause the right route to
be selected. This will properly follow the routing policy chain
(including match routes like "unreachable" and "prohibit" if present),
use the right MTU, select the right source address for ICMP errors,
pick the right MSS/rwin for inclusion in SYN+ACK packets, and so on.

Some of this can be done after the fact using a combination of the
route target, snat/masquerade, MSS rewriting, and so on. However, that
essentially requires duplicating routing and source address selection
into iptables rules. This is brittle and particularly difficult when
dealing with IPv6, where routes can come from autoconf and source
address selection is more complex than in IPv4. It also requires using
snat/masquerade for IPv6, which is something we should avoid if
possible. In general rule it seems to me that it's better to have the
kernel emit the correct packet the first time around than having to
use iptables CONNTRACK to fix it up after the fact.

How much of a concern is modifying existing behaviour? Personally I
think reflecting the mark is better than just using 0 all the time,
but of course, it's possible that people are explicitly matching mark
0 in firewall rules or routing tables. Would it be better if this was
a sysctl?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Julian Anastasov March 12, 2014, 8:22 a.m. UTC | #3
Hello,

On Wed, 12 Mar 2014, Lorenzo Colitti wrote:

> On Wed, Mar 12, 2014 at 5:52 AM, Julian Anastasov <ja@ssi.bg> wrote:
> >         If you can do it with CONNMARK save+restore,
> > do not change behavior for existing setups. It is
> > hard to predict how the mark is used in ip and
> > POST ROUTING rules.
> 
> I think CONNMARK save/restore can do some of this but not all of it.
> 
> For example, consider the case of mark-based routing ("ip rule add
> fwmark 123 lookup 456"). Setting the mark when emitting the packet
> from protocol code (i.e., this patch), will cause the right route to
> be selected. This will properly follow the routing policy chain
> (including match routes like "unreachable" and "prohibit" if present),
> use the right MTU, select the right source address for ICMP errors,
> pick the right MSS/rwin for inclusion in SYN+ACK packets, and so on.

	I agree that CONNMARK will do second route lookup
in the OUTPUT hook, ipt_mangle_out will detect the mark
change, so it is a slower alternative. And limited as
you pointed out.

> Some of this can be done after the fact using a combination of the
> route target, snat/masquerade, MSS rewriting, and so on. However, that
> essentially requires duplicating routing and source address selection
> into iptables rules. This is brittle and particularly difficult when
> dealing with IPv6, where routes can come from autoconf and source
> address selection is more complex than in IPv4. It also requires using
> snat/masquerade for IPv6, which is something we should avoid if
> possible. In general rule it seems to me that it's better to have the
> kernel emit the correct packet the first time around than having to
> use iptables CONNTRACK to fix it up after the fact.
> 
> How much of a concern is modifying existing behaviour? Personally I
> think reflecting the mark is better than just using 0 all the time,

	Problem should be for setups that use fwmark
for different purposes in both directions. There can
be many examples, one is for asymmetric routing,
ip rule that delivers locally the marked packets:

1. PRE_ROUTING: set mark for dport 80
2. ip rule ... fwmark N table to_local;
   ip route add local 0/0 dev lo table to_local
3. IPVS direct routing to many proxies on LAN

	What we can say is that such ip rule with fwmark
will start to work for the output traffic unexpectedly.
The TOS replying suffers from the same problem, I guess
routing by TOS is not used much and nobody complained.

> but of course, it's possible that people are explicitly matching mark
> 0 in firewall rules or routing tables. Would it be better if this was
> a sysctl?

	sysctl looks better to me. For example,
ip_reply_fwmark(net, fwmark) that will return 0 or
fwmark depending on net->ipv4.sysctl_reply_fwmark boolean.
Similar for ip6_reply_fwmark() and net->ipv6.sysctl_reply_fwmark.
You can select the right names.

Regards

--
Julian Anastasov <ja@ssi.bg>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 12, 2014, 8:17 p.m. UTC | #4
From: Lorenzo Colitti <lorenzo@google.com>
Date: Tue, 11 Mar 2014 20:15:41 +0900

> Kernel-originated IP packets that have no user socket associated
> with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.)
> are emitted with a mark of zero. Instead, make them have the
> same mark as the packet they are replying to.
> 
> This is consistent with TOS, which is also reflected from the
> incoming packet, and it allows the administrator to use
> mark-based routing, firewalling, etc. for these replies by
> marking the original packets inbound.
> 
> Also fix the IPv6 code to reflect the tclass in replies like the
> IPv4 code does.
> 
> Change-Id: Ifd8dd75016e60dc982e7860f720d45c27dcaf04c
> Signed-off-by: Lorenzo Colitti <lorenzo@google.com>

I don't think it's safe to change this behavior after all of this
time.

And incoming marks absolutely do not have any automatic relation
to what an administartor might want to use for output packets.

I'm not applying this, sorry.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lorenzo Colitti March 17, 2014, 5:24 a.m. UTC | #5
On Thu, Mar 13, 2014 at 5:17 AM, David Miller <davem@davemloft.net> wrote:
>> Kernel-originated IP packets that have no user socket associated
>> with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.)
>> are emitted with a mark of zero. Instead, make them have the
>> same mark as the packet they are replying to.
>>
>> This is consistent with TOS, which is also reflected from the
>> incoming packet, and it allows the administrator to use
>> mark-based routing, firewalling, etc. for these replies by
>> marking the original packets inbound.
>>
>> Also fix the IPv6 code to reflect the tclass in replies like the
>> IPv4 code does.
>>
>> Change-Id: Ifd8dd75016e60dc982e7860f720d45c27dcaf04c
>> Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
>
> I don't think it's safe to change this behavior after all of this
> time.

Fair enough. My bad.

> And incoming marks absolutely do not have any automatic relation
> to what an administartor might want to use for output packets.

In general, that's true. Still - currently the mark on these packets
is 0, period, and the administrator can't do anything about that at
all. The idea was that some control is better than zero control, and
since these packets are always in reply to other packets, then marking
responses based on the packets that provoked them made sense, since
the input marking can be as flexible as the administrator desires.

If this was optional behaviour (e.g., controlled by
/proc/sys/net/ipv{4,6}/reflect_fwmark), would that be acceptable?
That way, even if it doesn't work for everyone, at least some
environments will be able to use it. I'm happy to send in a new patch
that does that.

If not, do you see another way to do this? If you use mark-based
routing, there doesn't currently seem to be a good way of sending
these packets where they need to go. iptables doesn't really work,
because it can only touch the packet after a lot of it has been
decided. You can override some of those decisions via NAT / MSS
rewrite / ..., but it gets messy, and it seems better to send the
right packet off the bat rather than send the wrong one and then
rewrite it without updating any of the socket structures.

Regards,
Lorenzo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 17, 2014, 6:55 p.m. UTC | #6
From: Lorenzo Colitti <lorenzo@google.com>
Date: Mon, 17 Mar 2014 14:24:57 +0900

> In general, that's true. Still - currently the mark on these packets
> is 0, period, and the administrator can't do anything about that at
> all. The idea was that some control is better than zero control, and
> since these packets are always in reply to other packets, then marking
> responses based on the packets that provoked them made sense, since
> the input marking can be as flexible as the administrator desires.

I do not agree with this assesment at all, a default of 0 is better
than an potentially undesirable default.

The incoming fwmark is controlled by external entities, and most people
are usually unhappy with external entities having influence upon their
machine's behavior.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lorenzo Colitti April 3, 2014, 8:55 a.m. UTC | #7
On Wed, Mar 12, 2014 at 5:22 PM, Julian Anastasov <ja@ssi.bg> wrote:
>         sysctl looks better to me. For example,
> ip_reply_fwmark(net, fwmark) that will return 0 or
> fwmark depending on net->ipv4.sysctl_reply_fwmark boolean.
> Similar for ip6_reply_fwmark() and net->ipv6.sysctl_reply_fwmark.
> You can select the right names.

Ack. I'll shortly post a patch that does this. It uses one
per-namespace sysctl - net.ipv4.fwmark_reflect - for both IPv4 and
IPv6. Using only one sysctl is consistent with things like ping and
tcp, and it marginally shortens the code, but I can change this to two
if desired.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lorenzo Colitti April 3, 2014, 9:06 a.m. UTC | #8
On Thu, Apr 3, 2014 at 5:55 PM, Lorenzo Colitti <lorenzo@google.com> wrote:
> Ack. I'll shortly post a patch that does this. It uses one
> per-namespace sysctl - net.ipv4.fwmark_reflect - for both IPv4 and
> IPv6.

http://patchwork.ozlabs.org/patch/336553/

David, does it look better to you now it's an option?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 0134663..3b101a4 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -349,6 +349,7 @@  static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	icmp_param->data.icmph.checksum = 0;
 
 	inet->tos = ip_hdr(skb)->tos;
+	sk->sk_mark = skb->mark;
 	daddr = ipc.addr = ip_hdr(skb)->saddr;
 	saddr = fib_compute_spec_dst(skb);
 	ipc.opt = NULL;
@@ -364,6 +365,7 @@  static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	memset(&fl4, 0, sizeof(fl4));
 	fl4.daddr = daddr;
 	fl4.saddr = saddr;
+	fl4.flowi4_mark = skb->mark;
 	fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
 	fl4.flowi4_proto = IPPROTO_ICMP;
 	security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
@@ -382,7 +384,7 @@  static struct rtable *icmp_route_lookup(struct net *net,
 					struct flowi4 *fl4,
 					struct sk_buff *skb_in,
 					const struct iphdr *iph,
-					__be32 saddr, u8 tos,
+					__be32 saddr, u8 tos, u32 mark,
 					int type, int code,
 					struct icmp_bxm *param)
 {
@@ -394,6 +396,7 @@  static struct rtable *icmp_route_lookup(struct net *net,
 	fl4->daddr = (param->replyopts.opt.opt.srr ?
 		      param->replyopts.opt.opt.faddr : iph->saddr);
 	fl4->saddr = saddr;
+	fl4->flowi4_mark = mark;
 	fl4->flowi4_tos = RT_TOS(tos);
 	fl4->flowi4_proto = IPPROTO_ICMP;
 	fl4->fl4_icmp_type = type;
@@ -491,6 +494,7 @@  void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 	struct flowi4 fl4;
 	__be32 saddr;
 	u8  tos;
+	u32 mark;
 	struct net *net;
 	struct sock *sk;
 
@@ -592,6 +596,7 @@  void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 	tos = icmp_pointers[type].error ? ((iph->tos & IPTOS_TOS_MASK) |
 					   IPTOS_PREC_INTERNETCONTROL) :
 					  iph->tos;
+	mark = skb_in->mark;
 
 	if (ip_options_echo(&icmp_param->replyopts.opt.opt, skb_in))
 		goto out_unlock;
@@ -608,13 +613,14 @@  void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 	icmp_param->skb	  = skb_in;
 	icmp_param->offset = skb_network_offset(skb_in);
 	inet_sk(sk)->tos = tos;
+	sk->sk_mark = mark;
 	ipc.addr = iph->saddr;
 	ipc.opt = &icmp_param->replyopts.opt;
 	ipc.tx_flags = 0;
 	ipc.ttl = 0;
 	ipc.tos = -1;
 
-	rt = icmp_route_lookup(net, &fl4, skb_in, iph, saddr, tos,
+	rt = icmp_route_lookup(net, &fl4, skb_in, iph, saddr, tos, mark,
 			       type, code, icmp_param);
 	if (IS_ERR(rt))
 		goto out_unlock;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 1a0755f..a6039b1 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1501,7 +1501,7 @@  void ip_send_unicast_reply(struct net *net, struct sk_buff *skb, __be32 daddr,
 			daddr = replyopts.opt.opt.faddr;
 	}
 
-	flowi4_init_output(&fl4, arg->bound_dev_if, 0,
+	flowi4_init_output(&fl4, arg->bound_dev_if, skb->mark,
 			   RT_TOS(arg->tos),
 			   RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
 			   ip_reply_arg_flowi_flags(arg),
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3277680..b4f0388 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -802,6 +802,7 @@  static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
 	fl6.flowi6_proto = IPPROTO_TCP;
 	if (ipv6_addr_type(&fl6.daddr) & IPV6_ADDR_LINKLOCAL)
 		fl6.flowi6_oif = inet6_iif(skb);
+	fl6.flowi6_mark = skb->mark;
 	fl6.fl6_dport = t1->dest;
 	fl6.fl6_sport = t1->source;
 	security_skb_classify_flow(skb, flowi6_to_flowi(&fl6));
@@ -828,6 +829,7 @@  static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
 	const struct tcphdr *th = tcp_hdr(skb);
 	u32 seq = 0, ack_seq = 0;
 	struct tcp_md5sig_key *key = NULL;
+	u8 tclass;
 #ifdef CONFIG_TCP_MD5SIG
 	const __u8 *hash_location = NULL;
 	struct ipv6hdr *ipv6h = ipv6_hdr(skb);
@@ -878,7 +880,8 @@  static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
 		ack_seq = ntohl(th->seq) + th->syn + th->fin + skb->len -
 			  (th->doff << 2);
 
-	tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, key, 1, 0, 0);
+	tclass = ipv6_get_dsfield(ipv6_hdr(skb));
+	tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, key, 1, tclass, 0);
 
 #ifdef CONFIG_TCP_MD5SIG
 release_sk1:
@@ -918,7 +921,7 @@  static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
 	tcp_v6_send_ack(skb, tcp_rsk(req)->snt_isn + 1, tcp_rsk(req)->rcv_isn + 1,
 			req->rcv_wnd, tcp_time_stamp, req->ts_recent,
 			tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr),
-			0, 0);
+			ipv6_get_dsfield(ipv6_hdr(skb)), 0);
 }