diff mbox

[RFC] packet: handle too big packets for PACKET_V3

Message ID 1408061394.6804.55.camel@edumazet-glaptop2.roam.corp.google.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Aug. 15, 2014, 12:09 a.m. UTC
It looks like PACKET_V3 has no check that a packet can always fit in a
block.

Its trivial with GRO to break the assumption and write into kernel
memory.

[  840.989881] BUG: unable to handle kernel paging req0
[  840.997667] IP: [<ffffffff9ed17466>] memcpy+0x6/0x110
[  841.003314] PGD 2068b067 PUD 20371d3067 PMD 20370e7067 PTE 80000001dd6b2060
[  841.011120] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
[  841.016199] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 xt_NFLOG nfnetlink_log nfo
[  841.060739] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 3.16
[  841.077824] task: ffff880119ad2580 ti: ffff880119ad8000 task.ti: ffff880119ad8000
[  841.086173] RIP: 0010:[<ffffffff9ed17466>]  [<ffffffff9ed17466>] memcpy+0x6/0x110
[  841.094531] RSP: 0018:ffff880fffac3b48  EFLAGS: 00010286
[  841.100461] RAX: ffff8801dd6b1b9c RBX: 0000000000000a74 RCX: 00000000000000d6
[  841.108412] RDX: 000000000000053a RSI: ffff880132a769ce RDI: ffff8801dd6b2000
[  841.116373] RBP: ffff880fffac3bb0 R08: 0000000000000ff0 R09: ffff8801dd6b1b9c
[  841.124332] R10: ffff8801bb529f00 R11: 000000000000053a R12: 0000000000000ab6
[  841.132292] R13: ffff880119ad8000 R14: 0000000000000001 R15: ffff8801bb529f00
[  841.140252] FS:  0000000000000000(0000) GS:ffff880fffac0000(0000) knlGS:0000000000000000
[  841.149279] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  841.155689] CR2: ffff8801dd6b2000 CR3: 000000001f40e000 CR4: 00000000001427e0
[  841.163649] Stack:
[  841.165888]  ffffffff9eed1a15 ffff8801e1d06800 ffff88010000053a 000000000000053a
[  841.174176]  ffff8801dd6b1b9c ffff880100000ff0 ffff880119adbfd8 ffff880132a755b6
[  841.182466]  ffff8801bb529f00 ffff8801e1d06800 000000000000151c ffff880132a755b6
[  841.190764] Call Trace:
[  841.193490]  <IRQ>
[  841.195632]  [<ffffffff9eed1a15>] ? skb_copy_bits+0x145/0x290
[  841.202272]  [<ffffffff9ef95899>] tpacket_rcv+0x249/0x820
[  841.208295]  [<ffffffff9eee3358>] __netif_receive_skb_core+0x388/0x950
[  841.215578]  [<ffffffff9eee30fb>] ? __netif_receive_skb_core+0x12b/0x950
[  841.223054]  [<ffffffff9eee3941>] __netif_receive_skb+0x21/0x70
[  841.229658]  [<ffffffff9eee3b35>] netif_receive_skb+0x25/0x120
[  841.236166]  [<ffffffff9eee3b60>] ? netif_receive_skb+0x50/0x120
[  841.242869]  [<ffffffff9ef67b5e>] ? inet_gro_receive+0x6e/0x290
[  841.249474]  [<ffffffff9eee3d6e>] napi_gro_complete.isra.77+0x13e/0x160
[  841.256853]  [<ffffffff9eee3c60>] ? napi_gro_complete.isra.77+0x30/0x160
[  841.264331]  [<ffffffffc03dc0bd>] ? gro_cell_poll+0x8d/0xd0 [ip_tunnel]
[  841.271711]  [<ffffffff9eee4178>] napi_gro_flush+0x78/0xa0
[  841.277830]  [<ffffffff9eee41d3>] napi_complete+0x33/0x80
[  841.283852]  [<ffffffffc03dc0d4>] gro_cell_poll+0xa4/0xd0 [ip_tunnel]
[  841.291038]  [<ffffffff9eee435a>] net_rx_action+0x13a/0x240
[  841.297254]  [<ffffffff9ea575c0>] __do_softirq+0x100/0x290
[  841.303374]  [<ffffffff9efb1b0c>] call_softirq+0x1c/0x30
[  841.309300]  [<ffffffff9ea0bdbd>] do_softirq+0x8d/0xc0
[  841.315032]  [<ffffffff9ea5791d>] irq_exit+0xcd/0xd0
[  841.320571]  [<ffffffff9ea05db3>] do_IRQ+0x63/0xe0
[  841.325915]  [<ffffffff9efa762f>] common_interrupt+0x6f/0x6f

Not sure how to fix this.

This patch only shows where the problem is, but should we :

1) drop the too long packet
2) clamp size to maximal admissible size
3) other solution ? (PACKET_V2 can queue a clone of skb in
receive_queue, but PACKET_V3 has no such capability)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Neil Horman Aug. 15, 2014, 12:36 a.m. UTC | #1
On Thu, Aug 14, 2014 at 05:09:54PM -0700, Eric Dumazet wrote:
> It looks like PACKET_V3 has no check that a packet can always fit in a
> block.
> 
> Its trivial with GRO to break the assumption and write into kernel
> memory.
> 
> [  840.989881] BUG: unable to handle kernel paging req0
> [  840.997667] IP: [<ffffffff9ed17466>] memcpy+0x6/0x110
> [  841.003314] PGD 2068b067 PUD 20371d3067 PMD 20370e7067 PTE 80000001dd6b2060
> [  841.011120] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
> [  841.016199] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 xt_NFLOG nfnetlink_log nfo
> [  841.060739] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 3.16
> [  841.077824] task: ffff880119ad2580 ti: ffff880119ad8000 task.ti: ffff880119ad8000
> [  841.086173] RIP: 0010:[<ffffffff9ed17466>]  [<ffffffff9ed17466>] memcpy+0x6/0x110
> [  841.094531] RSP: 0018:ffff880fffac3b48  EFLAGS: 00010286
> [  841.100461] RAX: ffff8801dd6b1b9c RBX: 0000000000000a74 RCX: 00000000000000d6
> [  841.108412] RDX: 000000000000053a RSI: ffff880132a769ce RDI: ffff8801dd6b2000
> [  841.116373] RBP: ffff880fffac3bb0 R08: 0000000000000ff0 R09: ffff8801dd6b1b9c
> [  841.124332] R10: ffff8801bb529f00 R11: 000000000000053a R12: 0000000000000ab6
> [  841.132292] R13: ffff880119ad8000 R14: 0000000000000001 R15: ffff8801bb529f00
> [  841.140252] FS:  0000000000000000(0000) GS:ffff880fffac0000(0000) knlGS:0000000000000000
> [  841.149279] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  841.155689] CR2: ffff8801dd6b2000 CR3: 000000001f40e000 CR4: 00000000001427e0
> [  841.163649] Stack:
> [  841.165888]  ffffffff9eed1a15 ffff8801e1d06800 ffff88010000053a 000000000000053a
> [  841.174176]  ffff8801dd6b1b9c ffff880100000ff0 ffff880119adbfd8 ffff880132a755b6
> [  841.182466]  ffff8801bb529f00 ffff8801e1d06800 000000000000151c ffff880132a755b6
> [  841.190764] Call Trace:
> [  841.193490]  <IRQ>
> [  841.195632]  [<ffffffff9eed1a15>] ? skb_copy_bits+0x145/0x290
> [  841.202272]  [<ffffffff9ef95899>] tpacket_rcv+0x249/0x820
> [  841.208295]  [<ffffffff9eee3358>] __netif_receive_skb_core+0x388/0x950
> [  841.215578]  [<ffffffff9eee30fb>] ? __netif_receive_skb_core+0x12b/0x950
> [  841.223054]  [<ffffffff9eee3941>] __netif_receive_skb+0x21/0x70
> [  841.229658]  [<ffffffff9eee3b35>] netif_receive_skb+0x25/0x120
> [  841.236166]  [<ffffffff9eee3b60>] ? netif_receive_skb+0x50/0x120
> [  841.242869]  [<ffffffff9ef67b5e>] ? inet_gro_receive+0x6e/0x290
> [  841.249474]  [<ffffffff9eee3d6e>] napi_gro_complete.isra.77+0x13e/0x160
> [  841.256853]  [<ffffffff9eee3c60>] ? napi_gro_complete.isra.77+0x30/0x160
> [  841.264331]  [<ffffffffc03dc0bd>] ? gro_cell_poll+0x8d/0xd0 [ip_tunnel]
> [  841.271711]  [<ffffffff9eee4178>] napi_gro_flush+0x78/0xa0
> [  841.277830]  [<ffffffff9eee41d3>] napi_complete+0x33/0x80
> [  841.283852]  [<ffffffffc03dc0d4>] gro_cell_poll+0xa4/0xd0 [ip_tunnel]
> [  841.291038]  [<ffffffff9eee435a>] net_rx_action+0x13a/0x240
> [  841.297254]  [<ffffffff9ea575c0>] __do_softirq+0x100/0x290
> [  841.303374]  [<ffffffff9efb1b0c>] call_softirq+0x1c/0x30
> [  841.309300]  [<ffffffff9ea0bdbd>] do_softirq+0x8d/0xc0
> [  841.315032]  [<ffffffff9ea5791d>] irq_exit+0xcd/0xd0
> [  841.320571]  [<ffffffff9ea05db3>] do_IRQ+0x63/0xe0
> [  841.325915]  [<ffffffff9efa762f>] common_interrupt+0x6f/0x6f
> 
> Not sure how to fix this.
> 
> This patch only shows where the problem is, but should we :
> 
> 1) drop the too long packet
> 2) clamp size to maximal admissible size
> 3) other solution ? (PACKET_V2 can queue a clone of skb in
> receive_queue, but PACKET_V3 has no such capability)
> 

Can we separate a gro packet into its constituent pieces and write them
into/across blocks accordingly?  Or alternatively can we add a bit to the
tp_status field to indicate that a packet may span a block?  That probably
breaks applications I would imagine, but maybe we can create a sockopt to allow
applications to opt in to that behavior to give them the choice between that,
and a simple packet drop.

Neil


> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 8d9f8042705a..6ee072b746fb 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -1073,6 +1073,11 @@ static void *__packet_lookup_frame_in_block(struct packet_sock *po,
>  		return (void *)curr;
>  	}
>  
> +	/* If frame do not fit on a single block, bail out */
> +	if (BLK_PLUS_PRIV(pkc->blk_sizeof_priv) +
> +	    TOTAL_PKT_LEN_INCL_ALIGN(len) >= pkc->kblk_size)
> +		return NULL;
> +
>  	/* Ok, close the current block */
>  	prb_retire_current_block(pkc, po, 0);
>  
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Aug. 15, 2014, 12:43 a.m. UTC | #2
Hi Eric,

On Fri, Aug 15, 2014, at 02:09, Eric Dumazet wrote:
> It looks like PACKET_V3 has no check that a packet can always fit in a
> block.
> 
> Its trivial with GRO to break the assumption and write into kernel
> memory.
> 
> [...]
>
> Not sure how to fix this.
> 
> This patch only shows where the problem is, but should we :
> 
> 1) drop the too long packet

Someone could use GRO to create packet trains to hide from intrustion
detection systems, which maybe are the main user of TPACKET_V3. I don't
think this is a good idea.

> 2) clamp size to maximal admissible size

Maybe.

> 3) other solution ? (PACKET_V2 can queue a clone of skb in
> receive_queue, but PACKET_V3 has no such capability)

4) Can we still try to skb_gso_segment the packet again? Not nice, but I
guess this will work. Maybe depending on a tunable (default to on)?

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Aug. 15, 2014, 12:50 a.m. UTC | #3
Another thought...

On Fri, Aug 15, 2014, at 02:43, Hannes Frederic Sowa wrote:
> On Fri, Aug 15, 2014, at 02:09, Eric Dumazet wrote:
> > It looks like PACKET_V3 has no check that a packet can always fit in a
> > block.
> > 
> > Its trivial with GRO to break the assumption and write into kernel
> > memory.
>
> [...]
>
> 4) Can we still try to skb_gso_segment the packet again? Not nice, but I
> guess this will work. Maybe depending on a tunable (default to on)?

5) Or let a setsockopt control the NETIF_F_GRO bit of an interface if it
is bound to it? This will likely result in an implementation where we
must count current TPACKET_V3 sockets disabling GRO on an interface,
much like promisc counter.

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Aug. 15, 2014, 12:54 a.m. UTC | #4
On Fri, 2014-08-15 at 02:43 +0200, Hannes Frederic Sowa wrote:

> Someone could use GRO to create packet trains to hide from intrustion
> detection systems, which maybe are the main user of TPACKET_V3. I don't
> think this is a good idea.

Presumably these tools already use a large enough bloc_size, and not a
4KB one ;)

Even without GRO, a jumbo frame (9K) can trigger the bug.

I do not think we need to skb_gso_segment() for the cases user setup a
really small bloc_size. This looks like a lot of consumed cycles (we
even might have to recompute the TCP checksums)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Aug. 15, 2014, 12:56 a.m. UTC | #5
On Fri, 2014-08-15 at 02:50 +0200, Hannes Frederic Sowa wrote:

> 5) Or let a setsockopt control the NETIF_F_GRO bit of an interface if it
> is bound to it? This will likely result in an implementation where we
> must count current TPACKET_V3 sockets disabling GRO on an interface,
> much like promisc counter.

loopback interface has MTU = 64K, it looks like the problem is not tied
to GRO.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Aug. 15, 2014, 1:04 a.m. UTC | #6
Hi,

On Fri, Aug 15, 2014, at 02:54, Eric Dumazet wrote:
> On Fri, 2014-08-15 at 02:43 +0200, Hannes Frederic Sowa wrote:
> 
> > Someone could use GRO to create packet trains to hide from intrustion
> > detection systems, which maybe are the main user of TPACKET_V3. I don't
> > think this is a good idea.
> 
> Presumably these tools already use a large enough bloc_size, and not a
> 4KB one ;)
> 
> Even without GRO, a jumbo frame (9K) can trigger the bug.

Sure, but if I would have written such a tool without knowledge of GRO I
would have queried at least the MTU. ;)
If an IDS allocates block_sizes below the MTU there is not much we can
do. But up to the MTU we should let GRO behave transparently and here we
violate this. There are also interfaces which extremely large MTUs but
at least they report the MTU size correctly to user space.

> I do not think we need to skb_gso_segment() for the cases user setup a
> really small bloc_size. This looks like a lot of consumed cycles (we
> even might have to recompute the TCP checksums)

Yes, I feared that, too. Attacker could specifically target a slow path.

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Aug. 15, 2014, 2:01 a.m. UTC | #7
On Fri, 2014-08-15 at 03:04 +0200, Hannes Frederic Sowa wrote:

> Sure, but if I would have written such a tool without knowledge of GRO I
> would have queried at least the MTU. ;)

Would all existing tools react to a mtu change properly ?

(It would have to reinit its af_packet ring)

I do not think its a GRO issue, really, but an optimistic PACKET_V3
implementation. GRO was already there when PACKET_V3 was added.

Even a non GRO packet might not fit and we need to avoid the
crash/corruption.

I believe I am going to implement 2)  (clamp the snaplen)

An application should catch that tp_len might be bigger than tp_snaplen
and eventually do something sensible about it.

BTW, tp_sizeof_priv doesnt look to be checked at all, user input is
accepted as is.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Aug. 15, 2014, 2:08 a.m. UTC | #8
On Fri, Aug 15, 2014, at 04:01, Eric Dumazet wrote:
> On Fri, 2014-08-15 at 03:04 +0200, Hannes Frederic Sowa wrote:
> 
> > Sure, but if I would have written such a tool without knowledge of GRO I
> > would have queried at least the MTU. ;)
> 
> Would all existing tools react to a mtu change properly ?
> 
> (It would have to reinit its af_packet ring)
> 
> I do not think its a GRO issue, really, but an optimistic PACKET_V3
> implementation. GRO was already there when PACKET_V3 was added.
> 
> Even a non GRO packet might not fit and we need to avoid the
> crash/corruption.
> 
> I believe I am going to implement 2)  (clamp the snaplen)

Ok, cool, I think this is the best way forward for now.

Do you think a a printk_once if the packet gets clamped because of GRO
would be sensible with a hint that GRO could be disabled in such cases?

> An application should catch that tp_len might be bigger than tp_snaplen
> and eventually do something sensible about it.
> 
> BTW, tp_sizeof_priv doesnt look to be checked at all, user input is
> accepted as is.

Hmm...

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Guy Harris Aug. 15, 2014, 4:54 a.m. UTC | #9
On Aug 14, 2014, at 6:04 PM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:

> On Fri, Aug 15, 2014, at 02:54, Eric Dumazet wrote:
>> On Fri, 2014-08-15 at 02:43 +0200, Hannes Frederic Sowa wrote:
>> 
>>> Someone could use GRO to create packet trains to hide from intrustion
>>> detection systems, which maybe are the main user of TPACKET_V3. I don't
>>> think this is a good idea.
>> 
>> Presumably these tools already use a large enough bloc_size, and not a
>> 4KB one ;)
>> 
>> Even without GRO, a jumbo frame (9K) can trigger the bug.
> 
> Sure, but if I would have written such a tool without knowledge of GRO I
> would have queried at least the MTU. ;)

...and then queried the maximum size of the headers that precede the link-layer payload.

Except that you *can't* do that, and it can be variable-length without an obvious maximum (think 802.11 in monitor mode, where you have radiotap headers).  This causes much pain for libpcap when using TPACKET_V1 and TPACKET_V2, forcing it to allocate huge blocks when smaller ones might be sufficient.

> If an IDS allocates block_sizes below the MTU there is not much we can
> do.

If an IDS uses libpcap, it will get libpcap's behavior, which, for TPACKET_V3 is, roughly

	req.tp_frame_size = MAXIMUM_SNAPLEN;
	req.tp_frame_nr = handle->opt.buffer_size/req.tp_frame_size;

	/* compute the minumum block size that will handle this frame. 
	 * The block has to be page size aligned. 
	 * The max block size allowed by the kernel is arch-dependent and 
	 * it's not explicitly checked here. */
	req.tp_block_size = getpagesize();
	while (req.tp_block_size < req.tp_frame_size) 
		req.tp_block_size <<= 1;

	frames_per_block = req.tp_block_size/req.tp_frame_size;

	req.tp_block_nr = req.tp_frame_nr / frames_per_block;

	/* req.tp_frame_nr is requested to match frames_per_block*req.tp_block_nr */
	req.tp_frame_nr = req.tp_block_nr * frames_per_block;

(the last two C statements are actually part of a loop, where it'll reduce req.tp_frame_nr if it gets told "I don't have enough room for that big a ring").

MAXIMUM_SNAPLEN is 65535 in older versions of libpcap and 262144 in newer versions; it's the maximum frame size.

handle->opt.buffer_size is the buffer size requested by the application; it defaults to 2 MiB.

That calculation, with the default values for the latest version of libpcap, ends up with:

	req.tp_frame_size = 262144;
	req.tp_frame_nr = /* 2097152/262144 */ 8;

	/* compute the minumum block size that will handle this frame. 
	 * The block has to be page size aligned. 
	 * The max block size allowed by the kernel is arch-dependent and 
	 * it's not explicitly checked here. */
	req.tp_block_size = 4096;	/* IA-32 and x86-64, and probably many others */
	while (req.tp_block_size < req.tp_frame_size) 
		req.tp_block_size <<= 1;
	/* ends up with req.tp_block_size = 262144 */

	frames_per_block = /* 262144/262144 */ 1;

	req.tp_block_nr = /* 8 / 1 */ 8;

	/* req.tp_frame_nr is requested to match frames_per_block*req.tp_block_nr */
	req.tp_frame_nr = /* 8 * 1 / 8;

which I think means "8 256 KiB blocks".

> But up to the MTU we should let GRO behave transparently and here we
> violate this. There are also interfaces which extremely large MTUs but
> at least they report the MTU size correctly to user space.

Reporting the MTU to user space is insufficient for libpcap; it needs to know the *maximum packet size*, which includes link-layer headers (which I guess it could compute based on the ARPHRD_ for the interface, although for 802.11 that may be subject to change, e.g. the maximum link-layer header length grew when the QoS stuff was added) *and* metadata headers (such as radiotap, for which there really is no generic maximum length other than the 65535 byte limit imposed by the header length field being 16 bits, but a given driver can presumably return its maximum).

It also doesn't help with segmentation/reassembly offloading, where the MTU is decoupled from the maximum packet size.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Guy Harris Aug. 15, 2014, 5:02 a.m. UTC | #10
On Aug 14, 2014, at 7:01 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> I believe I am going to implement 2)  (clamp the snaplen)

If you mean "a packet that's too big to fit into a single block will be cut off at the size of the block, even if the user-requested snaplen is bigger than that", I, at least, have no problem whatsoever with that from libpcap's point of view.  We, the libpcap developers, could just boost the maximum snaplen if necessary.  (Disclaimer: I don't speak for all developers or users of libpcap.)--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neil Horman Aug. 15, 2014, 10:19 a.m. UTC | #11
On Thu, Aug 14, 2014 at 05:56:20PM -0700, Eric Dumazet wrote:
> On Fri, 2014-08-15 at 02:50 +0200, Hannes Frederic Sowa wrote:
> 
> > 5) Or let a setsockopt control the NETIF_F_GRO bit of an interface if it
> > is bound to it? This will likely result in an implementation where we
> > must count current TPACKET_V3 sockets disabling GRO on an interface,
> > much like promisc counter.
> 
> loopback interface has MTU = 64K, it looks like the problem is not tied
> to GRO.
> 
Can we add a listener to the af packet code to resize the block length of the
socket based on the MTU?  

I realize that overloads user configuration, but if we warn them...

Neil
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Aug. 15, 2014, 11:37 a.m. UTC | #12
Hi,

On Do, 2014-08-14 at 21:54 -0700, Guy Harris wrote:
> On Aug 14, 2014, at 6:04 PM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
> 
> > On Fri, Aug 15, 2014, at 02:54, Eric Dumazet wrote:
> >> On Fri, 2014-08-15 at 02:43 +0200, Hannes Frederic Sowa wrote:
> >> 
> >>> Someone could use GRO to create packet trains to hide from intrustion
> >>> detection systems, which maybe are the main user of TPACKET_V3. I don't
> >>> think this is a good idea.
> >> 
> >> Presumably these tools already use a large enough bloc_size, and not a
> >> 4KB one ;)
> >> 
> >> Even without GRO, a jumbo frame (9K) can trigger the bug.
> > 
> > Sure, but if I would have written such a tool without knowledge of GRO I
> > would have queried at least the MTU. ;)
> 
> ...and then queried the maximum size of the headers that precede the link-layer payload.

But those are mostly constant, no?

> Except that you *can't* do that, and it can be variable-length without an obvious maximum (think 802.11 in monitor mode, where you have radiotap headers).  This causes much pain for libpcap when using TPACKET_V1 and TPACKET_V2, forcing it to allocate huge blocks when smaller ones might be sufficient.
> > If an IDS allocates block_sizes below the MTU there is not much we can
> > do.
> 
> If an IDS uses libpcap, it will get libpcap's behavior, which, for TPACKET_V3 is, roughly
> 
> 	req.tp_frame_size = MAXIMUM_SNAPLEN;
> 	req.tp_frame_nr = handle->opt.buffer_size/req.tp_frame_size;
> 
> 	/* compute the minumum block size that will handle this frame. 
> 	 * The block has to be page size aligned. 
> 	 * The max block size allowed by the kernel is arch-dependent and 
> 	 * it's not explicitly checked here. */
> 	req.tp_block_size = getpagesize();
> 	while (req.tp_block_size < req.tp_frame_size) 
> 		req.tp_block_size <<= 1;
> 
> 	frames_per_block = req.tp_block_size/req.tp_frame_size;
> 
> 	req.tp_block_nr = req.tp_frame_nr / frames_per_block;
> 
> 	/* req.tp_frame_nr is requested to match frames_per_block*req.tp_block_nr */
> 	req.tp_frame_nr = req.tp_block_nr * frames_per_block;
> 
> (the last two C statements are actually part of a loop, where it'll reduce req.tp_frame_nr if it gets told "I don't have enough room for that big a ring").
> 
> MAXIMUM_SNAPLEN is 65535 in older versions of libpcap and 262144 in newer versions; it's the maximum frame size.
> 
> handle->opt.buffer_size is the buffer size requested by the application; it defaults to 2 MiB.
> 
> That calculation, with the default values for the latest version of libpcap, ends up with:
> 
> 	req.tp_frame_size = 262144;
> 	req.tp_frame_nr = /* 2097152/262144 */ 8;
> 
> 	/* compute the minumum block size that will handle this frame. 
> 	 * The block has to be page size aligned. 
> 	 * The max block size allowed by the kernel is arch-dependent and 
> 	 * it's not explicitly checked here. */
> 	req.tp_block_size = 4096;	/* IA-32 and x86-64, and probably many others */
> 	while (req.tp_block_size < req.tp_frame_size) 
> 		req.tp_block_size <<= 1;
> 	/* ends up with req.tp_block_size = 262144 */
> 
> 	frames_per_block = /* 262144/262144 */ 1;
> 
> 	req.tp_block_nr = /* 8 / 1 */ 8;
> 
> 	/* req.tp_frame_nr is requested to match frames_per_block*req.tp_block_nr */
> 	req.tp_frame_nr = /* 8 * 1 / 8;
> 
> which I think means "8 256 KiB blocks".
> 
> > But up to the MTU we should let GRO behave transparently and here we
> > violate this. There are also interfaces which extremely large MTUs but
> > at least they report the MTU size correctly to user space.
> 
> Reporting the MTU to user space is insufficient for libpcap; it needs to know the *maximum packet size*, which includes link-layer headers (which I guess it could compute based on the ARPHRD_ for the interface, although for 802.11 that may be subject to change, e.g. the maximum link-layer header length grew when the QoS stuff was added) *and* metadata headers (such as radiotap, for which there really is no generic maximum length other than the 65535 byte limit imposed by the header length field being 16 bits, but a given driver can presumably return its maximum).
> 
> It also doesn't help with segmentation/reassembly offloading, where the MTU is decoupled from the maximum packet size.

Thanks, that answered the question above. Still, I think that the most
commonly use-cases would get covered by querying the MTU. libpcap needs
to be much more general and so must pay more attention.

Also, IMHO GRO is something special and does not fit that model very
well. But I am also fine with clamping. If people complain, we can
revisit that topic.

Thanks,
Hannes


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 8d9f8042705a..6ee072b746fb 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1073,6 +1073,11 @@  static void *__packet_lookup_frame_in_block(struct packet_sock *po,
 		return (void *)curr;
 	}
 
+	/* If frame do not fit on a single block, bail out */
+	if (BLK_PLUS_PRIV(pkc->blk_sizeof_priv) +
+	    TOTAL_PKT_LEN_INCL_ALIGN(len) >= pkc->kblk_size)
+		return NULL;
+
 	/* Ok, close the current block */
 	prb_retire_current_block(pkc, po, 0);