| Message ID | 20260515100411.3141-1-fw@strlen.de |
|---|---|
| State | RFC, archived |
| Headers | show |
| Series | [RFC] netfilter: disable payload mangling in userns | expand |
I agree with the userns block. I'll keep pushing the five consumer-side bounds-check patches: root in init userns can still install the same payload-set rule and trigger the same OOB at the re-read sites, so the underlying bug fix is still worth landing. None of the five sites overlap with the relax wishlist (saddr/ daddr, transport, linklayer). Same class showed up in an earlier patch: https://lore.kernel.org/netdev/20260514035802.1540395-1-tpluszz77@gmail.com/ These five are unlikely to be all of them; we think the consumer side warrants a broader audit. Thoughts? Qi
Qi Tang <tpluszz77@gmail.com> wrote: > I agree with the userns block. I'll keep pushing the five > consumer-side bounds-check patches: root in init userns can > still install the same payload-set rule and trigger the same > OOB at the re-read sites, so the underlying bug fix is still > worth landing. > > None of the five sites overlap with the relax wishlist (saddr/ > daddr, transport, linklayer). Same class showed up in an > earlier patch: > https://lore.kernel.org/netdev/20260514035802.1540395-1-tpluszz77@gmail.com/ > > These five are unlikely to be all of them; we think the > consumer side warrants a broader audit. Thoughts? I think we have to do both. For nft_payload/nft_exthdr: 1. Writes from !initial_userns are rejected (rule insert fails). 2. Writes for initial userns get validated at rule add time: - netdev ingress is allowed to alter everything, I think this is early enough to not introduce oddities that can't come from wire / untrusted peer. - bridge is allowed to alter everything: AFAICS there shouldn't be a problem with this, same as with ingress. - inet (ipv4/ip6): Check base (offset is unsigned and relative to ll/network header/transport header/payload - Allow modifications past transport header. - Allow modifications of transport header - Allow network header modifications for a subset of valid offsets/lengths: saddr, daddr, checksum, tos, id, ttl / hl. Reject everything else. - Allow link-layer; but check at from packetpath that offset + len don't write past start of the network header. Else: no-op. nfqueue is the bigger problem: userspace gives us a whole new packet data blob, not a delta that we should apply at offset x). I think we have to update nfqueue to do rudimentary header revalidation and drop the packet on failure, e.g. at least making sure tot_len is not past skb->len.
On Fri, May 15, 2026 at 02:48:05PM +0200, Florian Westphal wrote: > Qi Tang <tpluszz77@gmail.com> wrote: > > I agree with the userns block. I'll keep pushing the five > > consumer-side bounds-check patches: root in init userns can > > still install the same payload-set rule and trigger the same > > OOB at the re-read sites, so the underlying bug fix is still > > worth landing. > > > > None of the five sites overlap with the relax wishlist (saddr/ > > daddr, transport, linklayer). Same class showed up in an > > earlier patch: > > https://lore.kernel.org/netdev/20260514035802.1540395-1-tpluszz77@gmail.com/ > > > > These five are unlikely to be all of them; we think the > > consumer side warrants a broader audit. Thoughts? > > I think we have to do both. > > For nft_payload/nft_exthdr: > 1. Writes from !initial_userns are rejected (rule insert fails). > 2. Writes for initial userns get validated at rule add time: > - netdev ingress is allowed to alter everything, I think > this is early enough to not introduce oddities that can't come > from wire / untrusted peer. > - bridge is allowed to alter everything: AFAICS there shouldn't > be a problem with this, same as with ingress. > - inet (ipv4/ip6): Check base (offset is unsigned and relative > to ll/network header/transport header/payload > - Allow modifications past transport header. > - Allow modifications of transport header > - Allow network header modifications for a subset of valid > offsets/lengths: saddr, daddr, checksum, tos, id, ttl / hl. > Reject everything else. > - Allow link-layer; but check at from packetpath that > offset + len don't write past start of the network header. > Else: no-op. > > nfqueue is the bigger problem: userspace gives us a whole new > packet data blob, not a delta that we should apply at offset x). > > I think we have to update nfqueue to do rudimentary header revalidation > and drop the packet on failure, e.g. at least making sure tot_len is not > past skb->len. Agreed, nfqueue has been there for long time and such parser would validate the packet is correct from stack POV. But if a new function to validate that the IP header mangling is not valid is feasible, then why not simply apply the same sanitization from inet hooks to nft_payload/nft_exthdr? That can be a function in net/netfilter/utils.c.
Pablo Neira Ayuso <pablo@netfilter.org> wrote: > > I think we have to update nfqueue to do rudimentary header revalidation > > and drop the packet on failure, e.g. at least making sure tot_len is not > > past skb->len. > > Agreed, nfqueue has been there for long time and such parser would > validate the packet is correct from stack POV. > > But if a new function to validate that the IP header mangling is not > valid is feasible, then why not simply apply the same sanitization > from inet hooks to nft_payload/nft_exthdr? Ok, I will have a look. Thanks for the feedback.
diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c index 58304fd1f70f..e1e1d11fdf04 100644 --- a/net/netfilter/nfnetlink_queue.c +++ b/net/netfilter/nfnetlink_queue.c @@ -1141,6 +1141,9 @@ nfqnl_mangle(void *data, unsigned int data_len, struct nf_queue_entry *e, int di { struct sk_buff *nskb; + if (e->state.net->user_ns != &init_user_ns) + return -EPERM; + if (diff < 0) { unsigned int min_len = skb_transport_offset(e->skb); diff --git a/net/netfilter/nft_exthdr.c b/net/netfilter/nft_exthdr.c index e6a07c0df207..577a15383e98 100644 --- a/net/netfilter/nft_exthdr.c +++ b/net/netfilter/nft_exthdr.c @@ -551,6 +551,9 @@ static int nft_exthdr_tcp_set_init(const struct nft_ctx *ctx, u32 offset, len, flags = 0, op = NFT_EXTHDR_OP_IPV6; int err; + if (ctx->net->user_ns != &init_user_ns) + return -EPERM; + if (!tb[NFTA_EXTHDR_SREG] || !tb[NFTA_EXTHDR_TYPE] || !tb[NFTA_EXTHDR_OFFSET] || diff --git a/net/netfilter/nft_payload.c b/net/netfilter/nft_payload.c index 01e13e5255a9..484a5490832e 100644 --- a/net/netfilter/nft_payload.c +++ b/net/netfilter/nft_payload.c @@ -917,6 +917,9 @@ static int nft_payload_set_init(const struct nft_ctx *ctx, struct nft_payload_set *priv = nft_expr_priv(expr); int err; + if (ctx->net->user_ns != &init_user_ns) + return -EPERM; + priv->base = ntohl(nla_get_be32(tb[NFTA_PAYLOAD_BASE])); priv->len = ntohl(nla_get_be32(tb[NFTA_PAYLOAD_LEN]));
Several parts of network stack rely on iph->ihl validation done by network stack before PRE_ROUTING. Disable this feature for user namespaces for now. This could be relaxed later. Example: - allow userns only for ingress hook. - allow userns write if base is transport header - allow userns write if base is linklayer and offset below network header offset - allow userns write for ipv4 if offset+len match saddr/daddr - allow userns write for ipv6 if offset+len match saddr/daddr ... etc. tcp option handling might be safe even for LOCAL_IN, as LOCAL_IN gets invoked before tcp stack, but this turns it off too. optstrip remains enabled, see no problem with that one. I don't think these are the only means to alter packets, but these appear to be relatively prominent. Another option would be to restrict this generally, however, this is harder to do for nfqueue. For nftables we know where the modification happens and can even reject a subset from netlink path directly. But for nfqueue, we'd need to 'revalidate' at least ip/ipv6 header for ipv4/ipv6 families. Bridge path might be okay with arbitray header modifications. Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michael Bommarito <michael.bommarito@gmail.com> Cc: Qi Tang <tpluszz77@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> --- net/netfilter/nfnetlink_queue.c | 3 +++ net/netfilter/nft_exthdr.c | 3 +++ net/netfilter/nft_payload.c | 3 +++ 3 files changed, 9 insertions(+)