diff mbox series

skb_can_coalesce() merges tcp frags with XFS-related slab objects

Message ID f8fde7da-9402-4e17-1bcc-cba08f1ec18b@virtuozzo.com
State Changes Requested
Delegated to: David Miller
Headers show
Series skb_can_coalesce() merges tcp frags with XFS-related slab objects | expand

Commit Message

Vasily Averin Feb. 20, 2019, 1:34 p.m. UTC
Dear David,

currently do_tcp_sendpages() calls skb_can_coalesce() to merge proper tcp fragments.
If these fragments are slab objects and the data is not transferred out of the local host
then tcp_recvmsg() can crash host on BUG_ON (see [2] below).

There is known usecase when slab objects are provided to tcp_sendpage:
XFS over locally landed network blockdevice.

I found few such cases:
- _drbd_send_page() had PageSlab() check log time ago.
- recently Ilya Dryomov fixed it in ceph 
 by commit 7e241f647dc7 "libceph: fall back to sendmsg for slab pages"

Recently OpenVZ team noticed this problem during experiments with
XFS over locally-landed iscsi target.

I would note: triggered BUG is not a real problem but false alert,
that though crashes host.

I can fix last problem by adding PageSlab() into iscsi_tcp_segment_map(),
however it does not fix the problem completely,
there are chances that the problem will be reproduced again with some other filesystems 
or with some other kind of network blockdevice.

David, what do you think, is it probably better to add PageSlab() check
directly into skb_can_coalesce()? (see [1] below)

Thank you,
	Vasily Averin

[1] The patch preventing the merge of the Slab-based tcp fragments

[2] oops example, RHEL7-based OpenVZ node, XFS over locally-landed iscsi target

[ 4902.545219] usercopy: kernel memory exposure attempt detected from ffff8c1497c92200 (kmalloc-512) (1024 bytes)
[ 4902.550585] ------------[ cut here ]------------
[ 4902.552472] kernel BUG at mm/usercopy.c:72!
[ 4902.554148] invalid opcode: 0000 [#1] SMP 
[ 4902.555906] Modules linked in: nf_conntrack_netlink xt_mark raw_diag udp_diag netlink_diag af_packet_diag unix_diag dm_service_time iscsi_tcp libiscsi_tcp libiscsi xfs xt_CHECKSUM ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables binfmt_misc tun devlink tcp_diag inet_diag ip_set nfnetlink fuse kvm_intel ppdev kvm i2c_piix4 irqbypass sg virtio_balloon parport_pc parport joydev pcspkr libcrc32c br_netfilter veth overlay ip6_vzprivnet ip6_vznetstat
[ 4903.413397]  ip_vznetstat ip_vzprivnet vziolimit vzevent vzlist vzstat vznetstat vznetdev vzmon vzdev bridge pio_kaio pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod crct10dif_generic cdrom crct10dif_common ata_generic pata_acpi 8021q garp mrp stp llc virtio_net virtio_console virtio_scsi bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix scsi_transport_iscsi drm libata crc32c_intel serio_raw virtio_pci virtio_ring dm_multipath virtio drm_panel_orientation_quirks floppy sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ip_tables]
[ 4903.433902] CPU: 0 PID: 13793 Comm: tgtd ve: 0 Kdump: loaded Not tainted 3.10.0-957.1.3.vz7.83.4 #1 83.4
[ 4903.436975] Hardware name: Virtuozzo KVM, BIOS 1.10.2-3.1.vz7.3 04/01/2014
[ 4903.439474] task: ffff8c14f898c740 ti: ffff8c14f8a94000 task.ti: ffff8c14f8a94000
[ 4903.442278] RIP: 0010:[<ffffffff9ce59427>]  [<ffffffff9ce59427>] __check_object_size+0x87/0x250
[ 4903.445370] RSP: 0018:ffff8c14f8a97b58  EFLAGS: 00010246
[ 4903.447547] RAX: 0000000000000062 RBX: ffff8c1497c92200 RCX: 0000000000000000
[ 4903.450108] RDX: 0000000000000000 RSI: ffff8c167fc138d8 RDI: ffff8c167fc138d8
[ 4903.452753] RBP: ffff8c14f8a97b78 R08: 0000000000000004 R09: ffff8c166d14af00
[ 4903.455451] R10: 0000000000000080 R11: ffff97b6016ffff8 R12: 0000000000000400
[ 4903.458142] R13: 0000000000000001 R14: ffff8c1497c92600 R15: 0000000000000400
[ 4903.460759] FS:  00007fd59b2b7740(0000) GS:ffff8c167fc00000(0000) knlGS:0000000000000000
[ 4903.463776] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4903.466111] CR2: 0000000002701b58 CR3: 00000000b8a98000 CR4: 00000000000006f0
[ 4903.468860] Call Trace:
[ 4903.470483]  [<ffffffff9cfb310d>] memcpy_toiovec+0x4d/0xb0
[ 4903.472780]  [<ffffffff9d2589e8>] skb_copy_datagram_iovec+0x128/0x280
[ 4903.475388]  [<ffffffff9d2c0646>] tcp_recvmsg+0x246/0xbb0
[ 4903.477884]  [<ffffffff9cdd42cd>] ? __alloc_pages_nodemask+0x26d/0x610
[ 4903.480432]  [<ffffffff9d2ef1d0>] inet_recvmsg+0x80/0xb0
[ 4903.482718]  [<ffffffff9d2467bc>] sock_aio_read.part.12+0x14c/0x170
[ 4903.485255]  [<ffffffff9d246801>] sock_aio_read+0x21/0x30
[ 4903.487495]  [<ffffffff9ce5edd6>] do_sync_read+0x96/0xe0
[ 4903.489710]  [<ffffffff9ce5f8b5>] vfs_read+0x145/0x170
[ 4903.491975]  [<ffffffff9ce606cf>] SyS_read+0x7f/0xf0
[ 4903.494179]  [<ffffffff9d3a4de1>] ? system_call_after_swapgs+0xae/0x146
[ 4903.496684]  [<ffffffff9d3a4e9b>] system_call_fastpath+0x22/0x27
[ 4903.499143]  [<ffffffff9d3a4de1>] ? system_call_after_swapgs+0xae/0x146
[ 4903.501748] Code: 45 d1 48 c7 c6 f8 c3 68 9d 48 c7 c1 93 5e 69 9d 48 0f 45 f1 49 89 c0 4d 89 e1 48 89 d9 48 c7 c7 68 2b 69 9d 31 c0 e8 4f 1e 53 00 <0f> 0b 0f 1f 80 00 00 00 00 48 c7 c0 00 00 c0 9c 4c 39 f0 73 0d 
[ 4903.510508] RIP  [<ffffffff9ce59427>] __check_object_size+0x87/0x250
[ 4903.513149]  RSP <ffff8c14f8a97b58>

Comments

Eric Dumazet Feb. 20, 2019, 3:53 p.m. UTC | #1
On 02/20/2019 05:34 AM, Vasily Averin wrote:
> Dear David,
> 
> currently do_tcp_sendpages() calls skb_can_coalesce() to merge proper tcp fragments.
> If these fragments are slab objects and the data is not transferred out of the local host
> then tcp_recvmsg() can crash host on BUG_ON (see [2] below).
> 
> There is known usecase when slab objects are provided to tcp_sendpage:
> XFS over locally landed network blockdevice.
> 
> I found few such cases:
> - _drbd_send_page() had PageSlab() check log time ago.
> - recently Ilya Dryomov fixed it in ceph 
>  by commit 7e241f647dc7 "libceph: fall back to sendmsg for slab pages"
> 
> Recently OpenVZ team noticed this problem during experiments with
> XFS over locally-landed iscsi target.
> 
> I would note: triggered BUG is not a real problem but false alert,
> that though crashes host.
> 
> I can fix last problem by adding PageSlab() into iscsi_tcp_segment_map(),
> however it does not fix the problem completely,
> there are chances that the problem will be reproduced again with some other filesystems 
> or with some other kind of network blockdevice.
> 
> David, what do you think, is it probably better to add PageSlab() check
> directly into skb_can_coalesce()? (see [1] below)
> 

No, this would be wrong.

There is no way a page fragment can be backed by slab object,
since a page fragment can be shared (the page refcount needs to be manipulated, without slab/slub
being aware of this)

Please fix the callers.

> Thank you,
> 	Vasily Averin
> 
> [1] The patch preventing the merge of the Slab-based tcp fragments
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 95d25b010a25..e1d200ba1fef 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -3089,7 +3089,7 @@ static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
>  	if (i) {
>  		const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];
>  
> -		return page == skb_frag_page(frag) &&
> +		return page == skb_frag_page(frag) && !PageSlab(page) &&
>  		       off == frag->page_offset + skb_frag_size(frag);
>  	}
>  	return false;
> 
> [2] oops example, RHEL7-based OpenVZ node, XFS over locally-landed iscsi target
> 
> [ 4902.545219] usercopy: kernel memory exposure attempt detected from ffff8c1497c92200 (kmalloc-512) (1024 bytes)
> [ 4902.550585] ------------[ cut here ]------------
> [ 4902.552472] kernel BUG at mm/usercopy.c:72!
> [ 4902.554148] invalid opcode: 0000 [#1] SMP 
> [ 4902.555906] Modules linked in: nf_conntrack_netlink xt_mark raw_diag udp_diag netlink_diag af_packet_diag unix_diag dm_service_time iscsi_tcp libiscsi_tcp libiscsi xfs xt_CHECKSUM ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables binfmt_misc tun devlink tcp_diag inet_diag ip_set nfnetlink fuse kvm_intel ppdev kvm i2c_piix4 irqbypass sg virtio_balloon parport_pc parport joydev pcspkr libcrc32c br_netfilter veth overlay ip6_vzprivnet ip6_vznetstat
> [ 4903.413397]  ip_vznetstat ip_vzprivnet vziolimit vzevent vzlist vzstat vznetstat vznetdev vzmon vzdev bridge pio_kaio pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod crct10dif_generic cdrom crct10dif_common ata_generic pata_acpi 8021q garp mrp stp llc virtio_net virtio_console virtio_scsi bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix scsi_transport_iscsi drm libata crc32c_intel serio_raw virtio_pci virtio_ring dm_multipath virtio drm_panel_orientation_quirks floppy sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ip_tables]
> [ 4903.433902] CPU: 0 PID: 13793 Comm: tgtd ve: 0 Kdump: loaded Not tainted 3.10.0-957.1.3.vz7.83.4 #1 83.4
> [ 4903.436975] Hardware name: Virtuozzo KVM, BIOS 1.10.2-3.1.vz7.3 04/01/2014
> [ 4903.439474] task: ffff8c14f898c740 ti: ffff8c14f8a94000 task.ti: ffff8c14f8a94000
> [ 4903.442278] RIP: 0010:[<ffffffff9ce59427>]  [<ffffffff9ce59427>] __check_object_size+0x87/0x250
> [ 4903.445370] RSP: 0018:ffff8c14f8a97b58  EFLAGS: 00010246
> [ 4903.447547] RAX: 0000000000000062 RBX: ffff8c1497c92200 RCX: 0000000000000000
> [ 4903.450108] RDX: 0000000000000000 RSI: ffff8c167fc138d8 RDI: ffff8c167fc138d8
> [ 4903.452753] RBP: ffff8c14f8a97b78 R08: 0000000000000004 R09: ffff8c166d14af00
> [ 4903.455451] R10: 0000000000000080 R11: ffff97b6016ffff8 R12: 0000000000000400
> [ 4903.458142] R13: 0000000000000001 R14: ffff8c1497c92600 R15: 0000000000000400
> [ 4903.460759] FS:  00007fd59b2b7740(0000) GS:ffff8c167fc00000(0000) knlGS:0000000000000000
> [ 4903.463776] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4903.466111] CR2: 0000000002701b58 CR3: 00000000b8a98000 CR4: 00000000000006f0
> [ 4903.468860] Call Trace:
> [ 4903.470483]  [<ffffffff9cfb310d>] memcpy_toiovec+0x4d/0xb0
> [ 4903.472780]  [<ffffffff9d2589e8>] skb_copy_datagram_iovec+0x128/0x280
> [ 4903.475388]  [<ffffffff9d2c0646>] tcp_recvmsg+0x246/0xbb0
> [ 4903.477884]  [<ffffffff9cdd42cd>] ? __alloc_pages_nodemask+0x26d/0x610
> [ 4903.480432]  [<ffffffff9d2ef1d0>] inet_recvmsg+0x80/0xb0
> [ 4903.482718]  [<ffffffff9d2467bc>] sock_aio_read.part.12+0x14c/0x170
> [ 4903.485255]  [<ffffffff9d246801>] sock_aio_read+0x21/0x30
> [ 4903.487495]  [<ffffffff9ce5edd6>] do_sync_read+0x96/0xe0
> [ 4903.489710]  [<ffffffff9ce5f8b5>] vfs_read+0x145/0x170
> [ 4903.491975]  [<ffffffff9ce606cf>] SyS_read+0x7f/0xf0
> [ 4903.494179]  [<ffffffff9d3a4de1>] ? system_call_after_swapgs+0xae/0x146
> [ 4903.496684]  [<ffffffff9d3a4e9b>] system_call_fastpath+0x22/0x27
> [ 4903.499143]  [<ffffffff9d3a4de1>] ? system_call_after_swapgs+0xae/0x146
> [ 4903.501748] Code: 45 d1 48 c7 c6 f8 c3 68 9d 48 c7 c1 93 5e 69 9d 48 0f 45 f1 49 89 c0 4d 89 e1 48 89 d9 48 c7 c7 68 2b 69 9d 31 c0 e8 4f 1e 53 00 <0f> 0b 0f 1f 80 00 00 00 00 48 c7 c0 00 00 c0 9c 4c 39 f0 73 0d 
> [ 4903.510508] RIP  [<ffffffff9ce59427>] __check_object_size+0x87/0x250
> [ 4903.513149]  RSP <ffff8c14f8a97b58>
>
Vasily Averin Feb. 20, 2019, 4:19 p.m. UTC | #2
On 2/20/19 6:53 PM, Eric Dumazet wrote:
> On 02/20/2019 05:34 AM, Vasily Averin wrote:
>> Dear David,
>>
>> currently do_tcp_sendpages() calls skb_can_coalesce() to merge proper tcp fragments.
>> If these fragments are slab objects and the data is not transferred out of the local host
>> then tcp_recvmsg() can crash host on BUG_ON (see [2] below).
>>
>> There is known usecase when slab objects are provided to tcp_sendpage:
>> XFS over locally landed network blockdevice.
>>
>> I found few such cases:
>> - _drbd_send_page() had PageSlab() check log time ago.
>> - recently Ilya Dryomov fixed it in ceph 
>>  by commit 7e241f647dc7 "libceph: fall back to sendmsg for slab pages"
>>
>> Recently OpenVZ team noticed this problem during experiments with
>> XFS over locally-landed iscsi target.
>>
>> I would note: triggered BUG is not a real problem but false alert,
>> that though crashes host.
>>
>> I can fix last problem by adding PageSlab() into iscsi_tcp_segment_map(),
>> however it does not fix the problem completely,
>> there are chances that the problem will be reproduced again with some other filesystems 
>> or with some other kind of network blockdevice.
>>
>> David, what do you think, is it probably better to add PageSlab() check
>> directly into skb_can_coalesce()? (see [1] below)
>>
> 
> No, this would be wrong.
> 
> There is no way a page fragment can be backed by slab object,
> since a page fragment can be shared (the page refcount needs to be manipulated, without slab/slub
> being aware of this)

Thank you for explanation, 
though this happen in real life and triggers BUG_ON only if receiving side is located on the same host.
Is it probably makes sense to add WARN_ON into skb_can_coalesce to detect such cases?

> Please fix the callers.

Ok, will do it.
Eric Dumazet Feb. 20, 2019, 4:35 p.m. UTC | #3
On 02/20/2019 08:19 AM, Vasily Averin wrote:

> Thank you for explanation, 
> though this happen in real life and triggers BUG_ON only if receiving side is located on the same host.
> Is it probably makes sense to add WARN_ON into skb_can_coalesce to detect such cases?

Yes, but please do it only in the sendpage() path, or only in CONFIG_DEBUG_PAGEALLOC / CONFIG_DEBUG_VM cases.

tcp_sendmsg() uses a per task page (look at sk_page_frag()), and it seems
strange to recheck what we already know (it is a page not backed/used by SLAB)
diff mbox series

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 95d25b010a25..e1d200ba1fef 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3089,7 +3089,7 @@  static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
 	if (i) {
 		const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];
 
-		return page == skb_frag_page(frag) &&
+		return page == skb_frag_page(frag) && !PageSlab(page) &&
 		       off == frag->page_offset + skb_frag_size(frag);
 	}
 	return false;