Message ID | 1408061394.6804.55.camel@edumazet-glaptop2.roam.corp.google.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
On Thu, Aug 14, 2014 at 05:09:54PM -0700, Eric Dumazet wrote: > It looks like PACKET_V3 has no check that a packet can always fit in a > block. > > Its trivial with GRO to break the assumption and write into kernel > memory. > > [ 840.989881] BUG: unable to handle kernel paging req0 > [ 840.997667] IP: [<ffffffff9ed17466>] memcpy+0x6/0x110 > [ 841.003314] PGD 2068b067 PUD 20371d3067 PMD 20370e7067 PTE 80000001dd6b2060 > [ 841.011120] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC > [ 841.016199] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 xt_NFLOG nfnetlink_log nfo > [ 841.060739] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 3.16 > [ 841.077824] task: ffff880119ad2580 ti: ffff880119ad8000 task.ti: ffff880119ad8000 > [ 841.086173] RIP: 0010:[<ffffffff9ed17466>] [<ffffffff9ed17466>] memcpy+0x6/0x110 > [ 841.094531] RSP: 0018:ffff880fffac3b48 EFLAGS: 00010286 > [ 841.100461] RAX: ffff8801dd6b1b9c RBX: 0000000000000a74 RCX: 00000000000000d6 > [ 841.108412] RDX: 000000000000053a RSI: ffff880132a769ce RDI: ffff8801dd6b2000 > [ 841.116373] RBP: ffff880fffac3bb0 R08: 0000000000000ff0 R09: ffff8801dd6b1b9c > [ 841.124332] R10: ffff8801bb529f00 R11: 000000000000053a R12: 0000000000000ab6 > [ 841.132292] R13: ffff880119ad8000 R14: 0000000000000001 R15: ffff8801bb529f00 > [ 841.140252] FS: 0000000000000000(0000) GS:ffff880fffac0000(0000) knlGS:0000000000000000 > [ 841.149279] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 841.155689] CR2: ffff8801dd6b2000 CR3: 000000001f40e000 CR4: 00000000001427e0 > [ 841.163649] Stack: > [ 841.165888] ffffffff9eed1a15 ffff8801e1d06800 ffff88010000053a 000000000000053a > [ 841.174176] ffff8801dd6b1b9c ffff880100000ff0 ffff880119adbfd8 ffff880132a755b6 > [ 841.182466] ffff8801bb529f00 ffff8801e1d06800 000000000000151c ffff880132a755b6 > [ 841.190764] Call Trace: > [ 841.193490] <IRQ> > [ 841.195632] [<ffffffff9eed1a15>] ? skb_copy_bits+0x145/0x290 > [ 841.202272] [<ffffffff9ef95899>] tpacket_rcv+0x249/0x820 > [ 841.208295] [<ffffffff9eee3358>] __netif_receive_skb_core+0x388/0x950 > [ 841.215578] [<ffffffff9eee30fb>] ? __netif_receive_skb_core+0x12b/0x950 > [ 841.223054] [<ffffffff9eee3941>] __netif_receive_skb+0x21/0x70 > [ 841.229658] [<ffffffff9eee3b35>] netif_receive_skb+0x25/0x120 > [ 841.236166] [<ffffffff9eee3b60>] ? netif_receive_skb+0x50/0x120 > [ 841.242869] [<ffffffff9ef67b5e>] ? inet_gro_receive+0x6e/0x290 > [ 841.249474] [<ffffffff9eee3d6e>] napi_gro_complete.isra.77+0x13e/0x160 > [ 841.256853] [<ffffffff9eee3c60>] ? napi_gro_complete.isra.77+0x30/0x160 > [ 841.264331] [<ffffffffc03dc0bd>] ? gro_cell_poll+0x8d/0xd0 [ip_tunnel] > [ 841.271711] [<ffffffff9eee4178>] napi_gro_flush+0x78/0xa0 > [ 841.277830] [<ffffffff9eee41d3>] napi_complete+0x33/0x80 > [ 841.283852] [<ffffffffc03dc0d4>] gro_cell_poll+0xa4/0xd0 [ip_tunnel] > [ 841.291038] [<ffffffff9eee435a>] net_rx_action+0x13a/0x240 > [ 841.297254] [<ffffffff9ea575c0>] __do_softirq+0x100/0x290 > [ 841.303374] [<ffffffff9efb1b0c>] call_softirq+0x1c/0x30 > [ 841.309300] [<ffffffff9ea0bdbd>] do_softirq+0x8d/0xc0 > [ 841.315032] [<ffffffff9ea5791d>] irq_exit+0xcd/0xd0 > [ 841.320571] [<ffffffff9ea05db3>] do_IRQ+0x63/0xe0 > [ 841.325915] [<ffffffff9efa762f>] common_interrupt+0x6f/0x6f > > Not sure how to fix this. > > This patch only shows where the problem is, but should we : > > 1) drop the too long packet > 2) clamp size to maximal admissible size > 3) other solution ? (PACKET_V2 can queue a clone of skb in > receive_queue, but PACKET_V3 has no such capability) > Can we separate a gro packet into its constituent pieces and write them into/across blocks accordingly? Or alternatively can we add a bit to the tp_status field to indicate that a packet may span a block? That probably breaks applications I would imagine, but maybe we can create a sockopt to allow applications to opt in to that behavior to give them the choice between that, and a simple packet drop. Neil > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index 8d9f8042705a..6ee072b746fb 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -1073,6 +1073,11 @@ static void *__packet_lookup_frame_in_block(struct packet_sock *po, > return (void *)curr; > } > > + /* If frame do not fit on a single block, bail out */ > + if (BLK_PLUS_PRIV(pkc->blk_sizeof_priv) + > + TOTAL_PKT_LEN_INCL_ALIGN(len) >= pkc->kblk_size) > + return NULL; > + > /* Ok, close the current block */ > prb_retire_current_block(pkc, po, 0); > > > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Eric, On Fri, Aug 15, 2014, at 02:09, Eric Dumazet wrote: > It looks like PACKET_V3 has no check that a packet can always fit in a > block. > > Its trivial with GRO to break the assumption and write into kernel > memory. > > [...] > > Not sure how to fix this. > > This patch only shows where the problem is, but should we : > > 1) drop the too long packet Someone could use GRO to create packet trains to hide from intrustion detection systems, which maybe are the main user of TPACKET_V3. I don't think this is a good idea. > 2) clamp size to maximal admissible size Maybe. > 3) other solution ? (PACKET_V2 can queue a clone of skb in > receive_queue, but PACKET_V3 has no such capability) 4) Can we still try to skb_gso_segment the packet again? Not nice, but I guess this will work. Maybe depending on a tunable (default to on)? Bye, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Another thought... On Fri, Aug 15, 2014, at 02:43, Hannes Frederic Sowa wrote: > On Fri, Aug 15, 2014, at 02:09, Eric Dumazet wrote: > > It looks like PACKET_V3 has no check that a packet can always fit in a > > block. > > > > Its trivial with GRO to break the assumption and write into kernel > > memory. > > [...] > > 4) Can we still try to skb_gso_segment the packet again? Not nice, but I > guess this will work. Maybe depending on a tunable (default to on)? 5) Or let a setsockopt control the NETIF_F_GRO bit of an interface if it is bound to it? This will likely result in an implementation where we must count current TPACKET_V3 sockets disabling GRO on an interface, much like promisc counter. Bye, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2014-08-15 at 02:43 +0200, Hannes Frederic Sowa wrote: > Someone could use GRO to create packet trains to hide from intrustion > detection systems, which maybe are the main user of TPACKET_V3. I don't > think this is a good idea. Presumably these tools already use a large enough bloc_size, and not a 4KB one ;) Even without GRO, a jumbo frame (9K) can trigger the bug. I do not think we need to skb_gso_segment() for the cases user setup a really small bloc_size. This looks like a lot of consumed cycles (we even might have to recompute the TCP checksums) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2014-08-15 at 02:50 +0200, Hannes Frederic Sowa wrote: > 5) Or let a setsockopt control the NETIF_F_GRO bit of an interface if it > is bound to it? This will likely result in an implementation where we > must count current TPACKET_V3 sockets disabling GRO on an interface, > much like promisc counter. loopback interface has MTU = 64K, it looks like the problem is not tied to GRO. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, On Fri, Aug 15, 2014, at 02:54, Eric Dumazet wrote: > On Fri, 2014-08-15 at 02:43 +0200, Hannes Frederic Sowa wrote: > > > Someone could use GRO to create packet trains to hide from intrustion > > detection systems, which maybe are the main user of TPACKET_V3. I don't > > think this is a good idea. > > Presumably these tools already use a large enough bloc_size, and not a > 4KB one ;) > > Even without GRO, a jumbo frame (9K) can trigger the bug. Sure, but if I would have written such a tool without knowledge of GRO I would have queried at least the MTU. ;) If an IDS allocates block_sizes below the MTU there is not much we can do. But up to the MTU we should let GRO behave transparently and here we violate this. There are also interfaces which extremely large MTUs but at least they report the MTU size correctly to user space. > I do not think we need to skb_gso_segment() for the cases user setup a > really small bloc_size. This looks like a lot of consumed cycles (we > even might have to recompute the TCP checksums) Yes, I feared that, too. Attacker could specifically target a slow path. Bye, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2014-08-15 at 03:04 +0200, Hannes Frederic Sowa wrote: > Sure, but if I would have written such a tool without knowledge of GRO I > would have queried at least the MTU. ;) Would all existing tools react to a mtu change properly ? (It would have to reinit its af_packet ring) I do not think its a GRO issue, really, but an optimistic PACKET_V3 implementation. GRO was already there when PACKET_V3 was added. Even a non GRO packet might not fit and we need to avoid the crash/corruption. I believe I am going to implement 2) (clamp the snaplen) An application should catch that tp_len might be bigger than tp_snaplen and eventually do something sensible about it. BTW, tp_sizeof_priv doesnt look to be checked at all, user input is accepted as is. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Aug 15, 2014, at 04:01, Eric Dumazet wrote: > On Fri, 2014-08-15 at 03:04 +0200, Hannes Frederic Sowa wrote: > > > Sure, but if I would have written such a tool without knowledge of GRO I > > would have queried at least the MTU. ;) > > Would all existing tools react to a mtu change properly ? > > (It would have to reinit its af_packet ring) > > I do not think its a GRO issue, really, but an optimistic PACKET_V3 > implementation. GRO was already there when PACKET_V3 was added. > > Even a non GRO packet might not fit and we need to avoid the > crash/corruption. > > I believe I am going to implement 2) (clamp the snaplen) Ok, cool, I think this is the best way forward for now. Do you think a a printk_once if the packet gets clamped because of GRO would be sensible with a hint that GRO could be disabled in such cases? > An application should catch that tp_len might be bigger than tp_snaplen > and eventually do something sensible about it. > > BTW, tp_sizeof_priv doesnt look to be checked at all, user input is > accepted as is. Hmm... Bye, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Aug 14, 2014, at 6:04 PM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote: > On Fri, Aug 15, 2014, at 02:54, Eric Dumazet wrote: >> On Fri, 2014-08-15 at 02:43 +0200, Hannes Frederic Sowa wrote: >> >>> Someone could use GRO to create packet trains to hide from intrustion >>> detection systems, which maybe are the main user of TPACKET_V3. I don't >>> think this is a good idea. >> >> Presumably these tools already use a large enough bloc_size, and not a >> 4KB one ;) >> >> Even without GRO, a jumbo frame (9K) can trigger the bug. > > Sure, but if I would have written such a tool without knowledge of GRO I > would have queried at least the MTU. ;) ...and then queried the maximum size of the headers that precede the link-layer payload. Except that you *can't* do that, and it can be variable-length without an obvious maximum (think 802.11 in monitor mode, where you have radiotap headers). This causes much pain for libpcap when using TPACKET_V1 and TPACKET_V2, forcing it to allocate huge blocks when smaller ones might be sufficient. > If an IDS allocates block_sizes below the MTU there is not much we can > do. If an IDS uses libpcap, it will get libpcap's behavior, which, for TPACKET_V3 is, roughly req.tp_frame_size = MAXIMUM_SNAPLEN; req.tp_frame_nr = handle->opt.buffer_size/req.tp_frame_size; /* compute the minumum block size that will handle this frame. * The block has to be page size aligned. * The max block size allowed by the kernel is arch-dependent and * it's not explicitly checked here. */ req.tp_block_size = getpagesize(); while (req.tp_block_size < req.tp_frame_size) req.tp_block_size <<= 1; frames_per_block = req.tp_block_size/req.tp_frame_size; req.tp_block_nr = req.tp_frame_nr / frames_per_block; /* req.tp_frame_nr is requested to match frames_per_block*req.tp_block_nr */ req.tp_frame_nr = req.tp_block_nr * frames_per_block; (the last two C statements are actually part of a loop, where it'll reduce req.tp_frame_nr if it gets told "I don't have enough room for that big a ring"). MAXIMUM_SNAPLEN is 65535 in older versions of libpcap and 262144 in newer versions; it's the maximum frame size. handle->opt.buffer_size is the buffer size requested by the application; it defaults to 2 MiB. That calculation, with the default values for the latest version of libpcap, ends up with: req.tp_frame_size = 262144; req.tp_frame_nr = /* 2097152/262144 */ 8; /* compute the minumum block size that will handle this frame. * The block has to be page size aligned. * The max block size allowed by the kernel is arch-dependent and * it's not explicitly checked here. */ req.tp_block_size = 4096; /* IA-32 and x86-64, and probably many others */ while (req.tp_block_size < req.tp_frame_size) req.tp_block_size <<= 1; /* ends up with req.tp_block_size = 262144 */ frames_per_block = /* 262144/262144 */ 1; req.tp_block_nr = /* 8 / 1 */ 8; /* req.tp_frame_nr is requested to match frames_per_block*req.tp_block_nr */ req.tp_frame_nr = /* 8 * 1 / 8; which I think means "8 256 KiB blocks". > But up to the MTU we should let GRO behave transparently and here we > violate this. There are also interfaces which extremely large MTUs but > at least they report the MTU size correctly to user space. Reporting the MTU to user space is insufficient for libpcap; it needs to know the *maximum packet size*, which includes link-layer headers (which I guess it could compute based on the ARPHRD_ for the interface, although for 802.11 that may be subject to change, e.g. the maximum link-layer header length grew when the QoS stuff was added) *and* metadata headers (such as radiotap, for which there really is no generic maximum length other than the 65535 byte limit imposed by the header length field being 16 bits, but a given driver can presumably return its maximum). It also doesn't help with segmentation/reassembly offloading, where the MTU is decoupled from the maximum packet size. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Aug 14, 2014, at 7:01 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> I believe I am going to implement 2) (clamp the snaplen)
If you mean "a packet that's too big to fit into a single block will be cut off at the size of the block, even if the user-requested snaplen is bigger than that", I, at least, have no problem whatsoever with that from libpcap's point of view. We, the libpcap developers, could just boost the maximum snaplen if necessary. (Disclaimer: I don't speak for all developers or users of libpcap.)--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 14, 2014 at 05:56:20PM -0700, Eric Dumazet wrote: > On Fri, 2014-08-15 at 02:50 +0200, Hannes Frederic Sowa wrote: > > > 5) Or let a setsockopt control the NETIF_F_GRO bit of an interface if it > > is bound to it? This will likely result in an implementation where we > > must count current TPACKET_V3 sockets disabling GRO on an interface, > > much like promisc counter. > > loopback interface has MTU = 64K, it looks like the problem is not tied > to GRO. > Can we add a listener to the af packet code to resize the block length of the socket based on the MTU? I realize that overloads user configuration, but if we warn them... Neil > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, On Do, 2014-08-14 at 21:54 -0700, Guy Harris wrote: > On Aug 14, 2014, at 6:04 PM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote: > > > On Fri, Aug 15, 2014, at 02:54, Eric Dumazet wrote: > >> On Fri, 2014-08-15 at 02:43 +0200, Hannes Frederic Sowa wrote: > >> > >>> Someone could use GRO to create packet trains to hide from intrustion > >>> detection systems, which maybe are the main user of TPACKET_V3. I don't > >>> think this is a good idea. > >> > >> Presumably these tools already use a large enough bloc_size, and not a > >> 4KB one ;) > >> > >> Even without GRO, a jumbo frame (9K) can trigger the bug. > > > > Sure, but if I would have written such a tool without knowledge of GRO I > > would have queried at least the MTU. ;) > > ...and then queried the maximum size of the headers that precede the link-layer payload. But those are mostly constant, no? > Except that you *can't* do that, and it can be variable-length without an obvious maximum (think 802.11 in monitor mode, where you have radiotap headers). This causes much pain for libpcap when using TPACKET_V1 and TPACKET_V2, forcing it to allocate huge blocks when smaller ones might be sufficient. > > If an IDS allocates block_sizes below the MTU there is not much we can > > do. > > If an IDS uses libpcap, it will get libpcap's behavior, which, for TPACKET_V3 is, roughly > > req.tp_frame_size = MAXIMUM_SNAPLEN; > req.tp_frame_nr = handle->opt.buffer_size/req.tp_frame_size; > > /* compute the minumum block size that will handle this frame. > * The block has to be page size aligned. > * The max block size allowed by the kernel is arch-dependent and > * it's not explicitly checked here. */ > req.tp_block_size = getpagesize(); > while (req.tp_block_size < req.tp_frame_size) > req.tp_block_size <<= 1; > > frames_per_block = req.tp_block_size/req.tp_frame_size; > > req.tp_block_nr = req.tp_frame_nr / frames_per_block; > > /* req.tp_frame_nr is requested to match frames_per_block*req.tp_block_nr */ > req.tp_frame_nr = req.tp_block_nr * frames_per_block; > > (the last two C statements are actually part of a loop, where it'll reduce req.tp_frame_nr if it gets told "I don't have enough room for that big a ring"). > > MAXIMUM_SNAPLEN is 65535 in older versions of libpcap and 262144 in newer versions; it's the maximum frame size. > > handle->opt.buffer_size is the buffer size requested by the application; it defaults to 2 MiB. > > That calculation, with the default values for the latest version of libpcap, ends up with: > > req.tp_frame_size = 262144; > req.tp_frame_nr = /* 2097152/262144 */ 8; > > /* compute the minumum block size that will handle this frame. > * The block has to be page size aligned. > * The max block size allowed by the kernel is arch-dependent and > * it's not explicitly checked here. */ > req.tp_block_size = 4096; /* IA-32 and x86-64, and probably many others */ > while (req.tp_block_size < req.tp_frame_size) > req.tp_block_size <<= 1; > /* ends up with req.tp_block_size = 262144 */ > > frames_per_block = /* 262144/262144 */ 1; > > req.tp_block_nr = /* 8 / 1 */ 8; > > /* req.tp_frame_nr is requested to match frames_per_block*req.tp_block_nr */ > req.tp_frame_nr = /* 8 * 1 / 8; > > which I think means "8 256 KiB blocks". > > > But up to the MTU we should let GRO behave transparently and here we > > violate this. There are also interfaces which extremely large MTUs but > > at least they report the MTU size correctly to user space. > > Reporting the MTU to user space is insufficient for libpcap; it needs to know the *maximum packet size*, which includes link-layer headers (which I guess it could compute based on the ARPHRD_ for the interface, although for 802.11 that may be subject to change, e.g. the maximum link-layer header length grew when the QoS stuff was added) *and* metadata headers (such as radiotap, for which there really is no generic maximum length other than the 65535 byte limit imposed by the header length field being 16 bits, but a given driver can presumably return its maximum). > > It also doesn't help with segmentation/reassembly offloading, where the MTU is decoupled from the maximum packet size. Thanks, that answered the question above. Still, I think that the most commonly use-cases would get covered by querying the MTU. libpcap needs to be much more general and so must pay more attention. Also, IMHO GRO is something special and does not fit that model very well. But I am also fine with clamping. If people complain, we can revisit that topic. Thanks, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 8d9f8042705a..6ee072b746fb 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1073,6 +1073,11 @@ static void *__packet_lookup_frame_in_block(struct packet_sock *po, return (void *)curr; } + /* If frame do not fit on a single block, bail out */ + if (BLK_PLUS_PRIV(pkc->blk_sizeof_priv) + + TOTAL_PKT_LEN_INCL_ALIGN(len) >= pkc->kblk_size) + return NULL; + /* Ok, close the current block */ prb_retire_current_block(pkc, po, 0);