[3/3] tun: Limit amount of queued packets per device

Message ID E1LUfJt-0005ZQ-4D@gondolin.me.apana.org.au
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Herbert Xu Feb. 4, 2009, 10:49 a.m. UTC
tun: Limit amount of queued packets per device

Unlike a normal socket path, the tuntap device send path does
not have any accounting.  This means that the user-space sender
may be able to pin down arbitrary amounts of kernel memory by
continuing to send data to an end-point that is congested.

Even when this isn't an issue because of limited queueing at
most end points, this can also be a problem because the only
response to congestion is packet loss.  That is, when the
local queues at the end-point fill up, the tuntap device will
start wasting system time because it will continue to send
data there which simply gets dropped straight away.

Of course one could argue that everybody should do congestion
control end-to-end.  Unfortunately there are people in this world
still hooked on UDP, and they don't appear to be going away
any time soon.  In fact, we've always helped them by performing
accounting in our UDP code, the sole purpose of which is to
provide congestion feedback other than through packet loss.

This patch attempts to apply the same bandaid to the tuntap device.
It creates a pseudo-socket object which is used to account our
packets just as a normal socket does for UDP.  Of course things
are a little complex because we're actually reinjecting traffic
back into the stack rather than sending it out of the stack.

The stack complexities however should have been resolved by preceding
patches.  So this one can simply start using skb_set_owner_w.
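
The accounting idea described above can be illustrated with a small
userspace model.  This is only a sketch under stated assumptions, not the
kernel implementation: struct toy_sock and the toy_* helpers are
hypothetical names, the fields merely echo the kernel's sk_wmem_alloc and
sk_sndbuf, and the writeability test is simplified.  Each packet is charged
when it is sent and uncharged when it is freed, and the sender gets -EAGAIN
instead of silent loss once the cap is reached.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Toy stand-in for the pseudo-socket; not kernel code. */
struct toy_sock {
	int wmem_alloc;	/* bytes currently charged to in-flight packets */
	int sndbuf;	/* cap on wmem_alloc (INT_MAX means "no real cap") */
};

/* Simplified writeability test: room is left while below the cap. */
static int toy_writeable(const struct toy_sock *sk)
{
	return sk->wmem_alloc < sk->sndbuf;
}

/* Charge one packet on the send path: 0 on success, -EAGAIN at the cap. */
static int toy_charge(struct toy_sock *sk, int truesize)
{
	if (!toy_writeable(sk))
		return -EAGAIN;
	sk->wmem_alloc += truesize;
	return 0;
}

/* Uncharge when the packet is freed, which makes room for the sender. */
static void toy_uncharge(struct toy_sock *sk, int truesize)
{
	assert(sk->wmem_alloc >= truesize);
	sk->wmem_alloc -= truesize;
}
```

With sndbuf set to INT_MAX, as this patch does by default, toy_charge()
never fails in practice, which mirrors the "essentially disabled" default
described below.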

For now the accounting is essentially disabled by default for
backwards compatibility.  In particular, we set the cap to INT_MAX.
This is so that existing applications don't get confused by the
sudden arrival of EAGAIN errors.

In future we may wish (or be forced) to do this by default.
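
For applications that do want a real cap, a userspace sketch of the two new
ioctls might look like the following.  The request numbers are the ones this
patch adds to include/linux/if_tun.h; the helper names tun_set_sndbuf() and
tun_get_sndbuf() are hypothetical, and the file descriptor is assumed to be
a /dev/net/tun descriptor already attached with TUNSETIFF.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Request numbers as added by this patch; the fallback definitions are
 * only used when the installed headers do not provide them. */
#ifndef TUNGETSNDBUF
#define TUNGETSNDBUF _IOR('T', 211, int)
#endif
#ifndef TUNSETSNDBUF
#define TUNSETSNDBUF _IOW('T', 212, int)
#endif

/* Hypothetical helper: set the send-buffer cap in bytes.
 * Returns 0 on success or a negative errno value. */
static int tun_set_sndbuf(int tun_fd, int bytes)
{
	if (ioctl(tun_fd, TUNSETSNDBUF, &bytes) < 0)
		return -errno;
	return 0;
}

/* Hypothetical helper: read the current cap back into *bytes.
 * Returns 0 on success or a negative errno value. */
static int tun_get_sndbuf(int tun_fd, int *bytes)
{
	assert(bytes != NULL);
	if (ioctl(tun_fd, TUNGETSNDBUF, bytes) < 0)
		return -errno;
	return 0;
}
```

Once a finite cap is set, a writer that opened the device with O_NONBLOCK
should be prepared for write() to fail with EAGAIN and can wait for POLLOUT
before retrying; a blocking writer would instead sleep until space is
available.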

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 drivers/net/tun.c      |  161 +++++++++++++++++++++++++++++++++----------------
 include/linux/if_tun.h |    2 
 2 files changed, 111 insertions(+), 52 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller Feb. 5, 2009, 12:56 a.m. UTC | #1
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 04 Feb 2009 21:49:25 +1100

> tun: Limit amount of queued packets per device

When adding new tun ioctls, you need to add compat entries
to fs/compat_ioctl.c

Please make this correction and resubmit.

Thanks.
Alex Williamson Feb. 10, 2009, 6:33 p.m. UTC | #2
On Wed, 2009-02-04 at 21:49 +1100, Herbert Xu wrote:
> tun: Limit amount of queued packets per device

Hi Herbert,

I'm getting a variety of Oopses, null pointer derefs, etc... from this
patch when trying to run a qemu guest on net-next-2.6 using a standard
tap/bridge config.  I've included a sample below.  Thanks,

Alex


[  173.231609] BUG: unable to handle kernel paging request at ffffffffffff8871
[  173.233252] IP: [<ffffffff8044875e>] skb_copy_datagram_from_iovec+0x1e/0x260
[  173.233252] PGD 203067 PUD 204067 PMD 0 
[  173.233252] Oops: 0000 [#1] SMP 
[  173.233252] last sysfs file: /sys/kernel/uevent_seqnum
[  173.233252] CPU 5 
[  173.233252] Modules linked in: tun nfs lockd nfs_acl auth_rpcgss sunrpc iptable_filter ip_tables ebtable_broute bridge stp ebtable_nat ebtable_filter ebtables x_tables ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi parport_pc lp parport loop af_packet ipmi_devintf hpilo ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support hpwdt i5000_edac serio_raw edac_core psmouse pcspkr shpchp button container i5k_amb pci_hotplug joydev evdev ext3 jbd mbcache usbhid hid sg sd_mod ehci_hcd uhci_hcd lpfc scsi_transport_fc usbcore cciss scsi_tgt scsi_mod bnx2 dm_mirror dm_region_hash dm_log dm_snapshot dm_mod thermal processor fan thermal_sys fuse
[  173.233252] Pid: 6770, comm: qemu-system-x86 Not tainted 2.6.29-rc3 #4
[  173.233252] RIP: 0010:[<ffffffff8044875e>]  [<ffffffff8044875e>] skb_copy_datagram_from_iovec+0x1e/0x260
[  173.233252] RSP: 0018:ffff880827cbfc68  EFLAGS: 00010292
[  173.233252] RAX: 0000000000000000 RBX: ffffffffffff8809 RCX: 0000000000000148
[  173.233252] RDX: ffff880827cbfe78 RSI: 0000000000000000 RDI: ffffffffffff8809
[  173.233252] RBP: ffffffffffff8809 R08: ffff880827cbfcf4 R09: 0000000000000000
[  173.233252] R10: 0000000000000000 R11: ffffffff80350440 R12: 0000000000000148
[  173.233252] R13: ffff88082b414840 R14: 0000000000000000 R15: 0000000000000148
[  173.233252] FS:  00007f184f8756e0(0000) GS:ffff88082bfe1100(0000) knlGS:0000000000000000
[  173.233252] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  173.233252] CR2: ffffffffffff8871 CR3: 000000081d963000 CR4: 00000000000006e0
[  173.233252] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  173.233252] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  173.233252] Process qemu-system-x86 (pid: 6770, threadinfo ffff880827cbe000, task ffff88082bb7cbc0)
[  173.233252] Stack:
[  173.233252]  000000000000001e 000000000000001e ffff880827cbfe78 ffffffffffff8809
[  173.233252]  000000004991c50c ffffffffffff8809 ffffffffffff8809 0000000000000148
[  173.233252]  ffff88082b414840 0000000000000156 0000000000000148 ffffffffa047f5ac
[  173.233252] Call Trace:
[  173.233252]  [<ffffffffa047f5ac>] ? tun_chr_aio_write+0x19c/0x440 [tun]
[  173.233252]  [<ffffffff802b68ad>] ? zone_statistics+0x7d/0x80
[  173.233252]  [<ffffffffa047f410>] ? tun_chr_aio_write+0x0/0x440 [tun]
[  173.233252]  [<ffffffff802df90b>] ? do_sync_readv_writev+0xcb/0x110
[  173.233252]  [<ffffffff80261f90>] ? autoremove_wake_function+0x0/0x30
[  173.233252]  [<ffffffff802dcf25>] ? mem_cgroup_charge_common+0x75/0xa0
[  173.233252]  [<ffffffff802df74d>] ? rw_copy_check_uvector+0x9d/0x150
[  173.233252]  [<ffffffff802e0062>] ? do_readv_writev+0xe2/0x220
[  173.233252]  [<ffffffff8022cc35>] ? default_spin_lock_flags+0x5/0x10
[  173.233252]  [<ffffffff804de09e>] ? _spin_lock_irqsave+0x2e/0x40
[  173.233252]  [<ffffffff804e0ae3>] ? do_page_fault+0x523/0xaa0
[  173.233252]  [<ffffffff804de09e>] ? _spin_lock_irqsave+0x2e/0x40
[  173.233252]  [<ffffffff802e0693>] ? sys_writev+0x53/0xc0
[  173.233252]  [<ffffffff8021252a>] ? system_call_fastpath+0x16/0x1b
[  173.233252] Code: c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 89 f6 41 55 41 54 41 89 cc 55 53 48 83 ec 28 48 89 7c 24 18 48 89 54 24 10 <8b> 6f 68 2b 6f 6c 89 e8 29 f0 85 c0 0f 8f 6f 01 00 00 48 8b 4c 
[  173.233252] RIP  [<ffffffff8044875e>] skb_copy_datagram_from_iovec+0x1e/0x260
[  173.233252]  RSP <ffff880827cbfc68>
[  173.233252] CR2: ffffffffffff8871
[  173.233252] ---[ end trace efbfb68cafc813b4 ]---

[  298.181441] general protection fault: 0000 [#2] SMP 
[  298.184002] last sysfs file: /sys/kernel/uevent_seqnum
[  298.184002] CPU 0 
[  298.184002] Modules linked in: tun nfs lockd nfs_acl auth_rpcgss sunrpc iptable_filter ip_tables ebtable_broute bridge stp ebtable_nat ebtable_filter ebtables x_tables ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi parport_pc lp parport loop af_packet ipmi_devintf hpilo ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support hpwdt i5000_edac serio_raw edac_core psmouse pcspkr shpchp button container i5k_amb pci_hotplug joydev evdev ext3 jbd mbcache usbhid hid sg sd_mod ehci_hcd uhci_hcd lpfc scsi_transport_fc usbcore cciss scsi_tgt scsi_mod bnx2 dm_mirror dm_region_hash dm_log dm_snapshot dm_mod thermal processor fan thermal_sys fuse
[  298.184002] Pid: 6822, comm: qemu-system-x86 Tainted: G      D    2.6.29-rc3 #4
[  298.184002] RIP: 0010:[<ffffffff8044144a>]  [<ffffffff8044144a>] sock_alloc_send_pskb+0x7a/0x2c0
[  298.184002] RSP: 0018:ffff880828dc5c48  EFLAGS: 00010217
[  298.184002] RAX: 1f00ffffffffffff RBX: ffff88082036fd80 RCX: 0000000000000800
[  298.184002] RDX: 0000000000000000 RSI: 0000000000000148 RDI: ffff88082036fd80
[  298.184002] RBP: 0000000000000000 R08: ffff880828dc5cf4 R09: 0000000000000000
[  298.184002] R10: 0000000000000000 R11: ffffffff80350440 R12: ffff880828dc5c58
[  298.184002] R13: ffff880828dc5c70 R14: 00000000e9291f00 R15: 0000000000000000
[  298.184002] FS:  00007f5e2e9c46e0(0000) GS:ffffffff80797000(0000) knlGS:0000000000000000
[  298.184002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  298.184002] CR2: 00007fff369c6f90 CR3: 00000007df827000 CR4: 00000000000006e0
[  298.184002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  298.184002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  298.184002] Process qemu-system-x86 (pid: 6822, threadinfo ffff880828dc4000, task ffff88081f7d0650)
[  298.184002] Stack:
[  298.184002]  ffff880828dc5cf4 0000000000000148 ffff880828915c78 ffffe2001bdc76c8
[  298.184002]  000000000000001e 000000000000001e ffff880000001d90 0000000000000002
[  298.184002]  000000004991c589 0000000000000800 ffffffffa047f410 0000000000000148
[  298.184002] Call Trace:
[  298.184002]  [<ffffffffa047f410>] tun_chr_aio_write+0x0/0x440 [tun]
[  298.184002]  [<ffffffffa047f554>] tun_chr_aio_write+0x144/0x440 [tun]
[  298.184002]  [<ffffffff802b68ad>] zone_statistics+0x7d/0x80
[  298.184002]  [<ffffffffa047f410>] tun_chr_aio_write+0x0/0x440 [tun]
[  298.184002]  [<ffffffff802df90b>] do_sync_readv_writev+0xcb/0x110
[  298.184002]  [<ffffffff80261f90>] autoremove_wake_function+0x0/0x30
[  298.184002]  [<ffffffff802dcf25>] mem_cgroup_charge_common+0x75/0xa0
[  298.184002]  [<ffffffff802df74d>] rw_copy_check_uvector+0x9d/0x150
[  298.184002]  [<ffffffff802e0062>] do_readv_writev+0xe2/0x220
[  298.184002]  [<ffffffff8022cc35>] default_spin_lock_flags+0x5/0x10
[  298.184002]  [<ffffffff804de09e>] _spin_lock_irqsave+0x2e/0x40
[  298.184002]  [<ffffffff804e0ae3>] do_page_fault+0x523/0xaa0
[  298.184002]  [<ffffffff804de09e>] _spin_lock_irqsave+0x2e/0x40
[  298.184002]  [<ffffffff802e0693>] sys_writev+0x53/0xc0
[  298.184002]  [<ffffffff8021252a>] system_call_fastpath+0x16/0x1b
[  298.184002] Code: 85 c0 0f 85 fb 00 00 00 f6 43 38 02 0f 85 09 01 00 00 8b 83 98 00 00 00 3b 83 a0 00 00 00 0f 8c 16 01 00 00 48 8b 83 e0 01 00 00 <f0> 80 48 08 01 48 8b 83 e0 01 00 00 f0 80 48 08 04 48 85 ed 0f 
[  298.184002] RIP  [<ffffffff8044144a>] sock_alloc_send_pskb+0x7a/0x2c0
[  298.184002]  RSP <ffff880828dc5c48>
[  298.314428] ---[ end trace efbfb68cafc813b5 ]---

[  490.120309] BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8
[  490.121002] IP: [<ffffffff804413ed>] sock_alloc_send_pskb+0x1d/0x2c0
[  490.121002] PGD 7df826067 PUD 8234fd067 PMD 0 
[  490.121002] Oops: 0000 [#3] SMP 
[  490.121002] last sysfs file: /sys/kernel/uevent_seqnum
[  490.121002] CPU 4 
[  490.121002] Modules linked in: tun nfs lockd nfs_acl auth_rpcgss sunrpc iptable_filter ip_tables ebtable_broute bridge stp ebtable_nat ebtable_filter ebtables x_tables ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi parport_pc lp parport loop af_packet ipmi_devintf hpilo ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support hpwdt i5000_edac serio_raw edac_core psmouse pcspkr shpchp button container i5k_amb pci_hotplug joydev evdev ext3 jbd mbcache usbhid hid sg sd_mod ehci_hcd uhci_hcd lpfc scsi_transport_fc usbcore cciss scsi_tgt scsi_mod bnx2 dm_mirror dm_region_hash dm_log dm_snapshot dm_mod thermal processor fan thermal_sys fuse
[  490.121002] Pid: 6864, comm: qemu-system-x86 Tainted: G      D    2.6.29-rc3 #4
[  490.121002] RIP: 0010:[<ffffffff804413ed>]  [<ffffffff804413ed>] sock_alloc_send_pskb+0x1d/0x2c0
[  490.121002] RSP: 0018:ffff88081f4f1c48  EFLAGS: 00010296
[  490.121002] RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000800
[  490.121002] RDX: 0000000000000000 RSI: 0000000000000148 RDI: 0000000000000000
[  490.121002] RBP: ffffffffa047f410 R08: ffff88081f4f1cf4 R09: 0000000000000000
[  490.121002] R10: 0000000000000000 R11: ffffffff80350440 R12: 0000000000000148
[  490.121002] R13: ffff880823575240 R14: 0000000000000156 R15: 0000000000000000
[  490.121002] FS:  00007f9436fce6e0(0000) GS:ffff88082bfe0d80(0000) knlGS:0000000000000000
[  490.121002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  490.121002] CR2: 00000000000000f8 CR3: 000000081f4a8000 CR4: 00000000000006e0
[  490.121002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  490.121002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  490.121002] Process qemu-system-x86 (pid: 6864, threadinfo ffff88081f4f0000, task ffff88081dc7e500)
[  490.121002] Stack:
[  490.121002]  ffff88081f4f1cf4 0000000000000148 00000000012f53da ffffe2001b9c05f8
[  490.121002]  000000000000001e 000000000000001e ffff880000001d90 0000000000000002
[  490.121002]  000000004991c649 0000000000000800 ffffffffa047f410 0000000000000148
[  490.121002] Call Trace:
[  490.121002]  [<ffffffffa047f410>] ? tun_chr_aio_write+0x0/0x440 [tun]
[  490.121002]  [<ffffffffa047f554>] ? tun_chr_aio_write+0x144/0x440 [tun]
[  490.121002]  [<ffffffff804ddf75>] ? _spin_lock+0x5/0x10
[  490.121002]  [<ffffffff802f0008>] ? sys_ppoll+0xe8/0x170
[  490.121002]  [<ffffffff804ddf75>] ? _spin_lock+0x5/0x10
[  490.121002]  [<ffffffffa047f410>] ? tun_chr_aio_write+0x0/0x440 [tun]
[  490.121002]  [<ffffffff802df90b>] ? do_sync_readv_writev+0xcb/0x110
[  490.121002]  [<ffffffff80261f90>] ? autoremove_wake_function+0x0/0x30
[  490.121002]  [<ffffffff80265380>] ? ktime_get_ts+0x20/0x60
[  490.121002]  [<ffffffff802653cc>] ? ktime_get+0xc/0x50
[  490.121002]  [<ffffffff802df74d>] ? rw_copy_check_uvector+0x9d/0x150
[  490.121002]  [<ffffffff802e0062>] ? do_readv_writev+0xe2/0x220
[  490.121002]  [<ffffffff802615fe>] ? sys_timer_settime+0x14e/0x340
[  490.121002]  [<ffffffff804de09e>] ? _spin_lock_irqsave+0x2e/0x40
[  490.121002]  [<ffffffff802e0693>] ? sys_writev+0x53/0xc0
[  490.121002]  [<ffffffff8021252a>] ? system_call_fastpath+0x16/0x1b
[  490.121002] Code: 00 00 5b 48 89 d0 c3 0f 1f 80 00 00 00 00 41 57 49 89 d7 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 48 48 89 74 24 08 4c 89 04 24 <44> 8b b7 f8 00 00 00 44 89 f0 80 cc 04 41 f6 c6 10 44 0f 45 f0 
[  490.121002] RIP  [<ffffffff804413ed>] sock_alloc_send_pskb+0x1d/0x2c0
[  490.121002]  RSP <ffff88081f4f1c48>
[  490.121002] CR2: 00000000000000f8
[  490.259999] ---[ end trace efbfb68cafc813b6 ]---


Herbert Xu Feb. 12, 2009, 11:13 a.m. UTC | #3
On Tue, Feb 10, 2009 at 11:33:45AM -0700, Alex Williamson wrote:
> 
> I'm getting a variety of Oopses, null pointer derefs, etc... from this
> patch when trying to run a qemu guest on net-next-2.6 using a standard
> tap/bridge config.  I've included a sample below.  Thanks,

Are you using the current net-next-2.6 (which already has the
patch) or an older net-next-2.6 with the patch added by hand?

> [  173.231609] BUG: unable to handle kernel paging request at ffffffffffff8871
> [  173.233252] IP: [<ffffffff8044875e>] skb_copy_datagram_from_iovec+0x1e/0x260
> [  173.233252] PGD 203067 PUD 204067 PMD 0 
> [  173.233252] Oops: 0000 [#1] SMP 
> [  173.233252] last sysfs file: /sys/kernel/uevent_seqnum
> [  173.233252] CPU 5 
> [  173.233252] Modules linked in: tun nfs lockd nfs_acl auth_rpcgss sunrpc iptable_filter ip_tables ebtable_broute bridge stp ebtable_nat ebtable_filter ebtables x_tables ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi parport_pc lp parport loop af_packet ipmi_devintf hpilo ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support hpwdt i5000_edac serio_raw edac_core psmouse pcspkr shpchp button container i5k_amb pci_hotplug joydev evdev ext3 jbd mbcache usbhid hid sg sd_mod ehci_hcd uhci_hcd lpfc scsi_transport_fc usbcore cciss scsi_tgt scsi_mod bnx2 dm_mirror dm_region_hash dm_log dm_snapshot dm_mod thermal processor fan thermal_sys fuse
> [  173.233252] Pid: 6770, comm: qemu-system-x86 Not tainted 2.6.29-rc3 #4
> [  173.233252] RIP: 0010:[<ffffffff8044875e>]  [<ffffffff8044875e>] skb_copy_datagram_from_iovec+0x1e/0x260
> [  173.233252] RSP: 0018:ffff880827cbfc68  EFLAGS: 00010292
> [  173.233252] RAX: 0000000000000000 RBX: ffffffffffff8809 RCX: 0000000000000148
> [  173.233252] RDX: ffff880827cbfe78 RSI: 0000000000000000 RDI: ffffffffffff8809

This means that the skb argument (RDI) is bogus.  However, I can't
see how that can happen unless some other corruption happened
earlier.
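
The register dump is consistent with that reading.  As a worked check (a
sketch, not part of the original analysis): the faulting instruction bytes
in the Code: line of the first oops, "<8b> 6f 68", are read here as
mov 0x68(%rdi),%ebp, so the faulting address should be RDI plus 0x68.  The
names FAULT_DISP, fault_addr() and check_first_oops() are made up for this
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Displacement from the faulting instruction bytes "<8b> 6f 68",
 * decoded here as mov 0x68(%rdi),%ebp. */
#define FAULT_DISP 0x68u

/* Address that the faulting load touches for a given first argument. */
static uint64_t fault_addr(uint64_t rdi)
{
	return rdi + FAULT_DISP;
}

/* RDI and CR2 values copied from the first oops in this thread. */
static void check_first_oops(void)
{
	assert(fault_addr(0xffffffffffff8809ULL) == 0xffffffffffff8871ULL);
}
```

In other words, the function faulted on its first load through the skb
pointer, which points at a bad pointer being passed in rather than at
damage inside an otherwise valid skb.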

Does this occur on the first packet written?

Thanks,
Alex Williamson Feb. 12, 2009, 7:35 p.m. UTC | #4
On Thu, 2009-02-12 at 19:13 +0800, Herbert Xu wrote:
> On Tue, Feb 10, 2009 at 11:33:45AM -0700, Alex Williamson wrote:
> > 
> > I'm getting a variety of Oopses, null pointer derefs, etc... from this
> > patch when trying to run a qemu guest on net-next-2.6 using a standard
> > tap/bridge config.  I've included a sample below.  Thanks,
> 
> Are you using the current net-next-2.6 (which already has the
> patch) or an older net-next-2.6 with the patch added by hand?

Current net-next-2.6 (v2.6.29-rc2-1715-g367681f).  I just reverified it
with kvm-userspace (kvm-83-389-ga1efe3d).  The problem goes away if I
patch -R commit 33dccbb.

> > [  173.231609] BUG: unable to handle kernel paging request at ffffffffffff8871
> > [  173.233252] IP: [<ffffffff8044875e>] skb_copy_datagram_from_iovec+0x1e/0x260
> > [  173.233252] PGD 203067 PUD 204067 PMD 0 
> > [  173.233252] Oops: 0000 [#1] SMP 
> > [  173.233252] last sysfs file: /sys/kernel/uevent_seqnum
> > [  173.233252] CPU 5 
> > [  173.233252] Modules linked in: tun nfs lockd nfs_acl auth_rpcgss sunrpc iptable_filter ip_tables ebtable_broute bridge stp ebtable_nat ebtable_filter ebtables x_tables ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi parport_pc lp parport loop af_packet ipmi_devintf hpilo ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support hpwdt i5000_edac serio_raw edac_core psmouse pcspkr shpchp button container i5k_amb pci_hotplug joydev evdev ext3 jbd mbcache usbhid hid sg sd_mod ehci_hcd uhci_hcd lpfc scsi_transport_fc usbcore cciss scsi_tgt scsi_mod bnx2 dm_mirror dm_region_hash dm_log dm_snapshot dm_mod thermal processor fan thermal_sys fuse
> > [  173.233252] Pid: 6770, comm: qemu-system-x86 Not tainted 2.6.29-rc3 #4
> > [  173.233252] RIP: 0010:[<ffffffff8044875e>]  [<ffffffff8044875e>] skb_copy_datagram_from_iovec+0x1e/0x260
> > [  173.233252] RSP: 0018:ffff880827cbfc68  EFLAGS: 00010292
> > [  173.233252] RAX: 0000000000000000 RBX: ffffffffffff8809 RCX: 0000000000000148
> > [  173.233252] RDX: ffff880827cbfe78 RSI: 0000000000000000 RDI: ffffffffffff8809
> 
> This means that the skb argument (RDI) is bogus.  However, I can't
> see how that can happen unless some other corruption happened
> earlier.
> 
> Does this occur on the first packet written?

It seems to happen a little beyond the first packet.  This time my VM
made it to starting sshd before causing this fault in the host (so it
had at least DHCP'd an address):

[  208.823990] BUG: unable to handle kernel paging request at 0000000000007860
[  208.826836] IP: [<ffffffff804481be>] skb_copy_datagram_from_iovec+0x1e/0x260
[  208.827918] PGD 827032067 PUD 81d9d6067 PMD 0 
[  208.827918] Oops: 0000 [#1] SMP 
[  208.827918] last sysfs file: /sys/kernel/uevent_seqnum
[  208.827918] CPU 6 
[  208.827918] Modules linked in: kvm_intel kvm tun nfs lockd nfs_acl auth_rpcgss sunrpc bridge stp ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi parport_pc lp parport loop af_packet ipmi_devintf ipmi_si iTCO_wdt iTCO_vendor_support hpwdt serio_raw hpilo psmouse ipmi_msghandler i5000_edac container edac_core pcspkr i5k_amb shpchp pci_hotplug button joydev evdev ext3 jbd mbcache usbhid hid sg sd_mod lpfc scsi_transport_fc ehci_hcd uhci_hcd scsi_tgt usbcore cciss bnx2 scsi_mod dm_mirror dm_region_hash dm_log dm_snapshot dm_mod thermal processor fan thermal_sys fuse
[  208.827918] Pid: 8078, comm: qemu-system-x86 Not tainted 2.6.29-rc3 #8
[  208.827918] RIP: 0010:[<ffffffff804481be>]  [<ffffffff804481be>] skb_copy_datagram_from_iovec+0x1e/0x260
[  208.827918] RSP: 0018:ffff88082284fc68  EFLAGS: 00010292
[  208.827918] RAX: 0000000000000000 RBX: 00000000000077f8 RCX: 0000000000000056
[  208.827918] RDX: ffff88082284fe78 RSI: 0000000000000000 RDI: 00000000000077f8
[  208.827918] RBP: 00000000000077f8 R08: ffff88082284fcf4 R09: 0000000000000000
[  208.827918] R10: 0000000000000000 R11: ffffffff80350430 R12: 0000000000000056
[  208.827918] R13: ffff8808299c2280 R14: 0000000000000000 R15: 0000000000000056
[  208.827918] FS:  00007f7932b216e0(0000) GS:ffff88082bfe1480(0000) knlGS:0000000000000000
[  208.827918] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  208.827918] CR2: 0000000000007860 CR3: 00000008271cc000 CR4: 00000000000026e0
[  208.827918] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  208.827918] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  208.827918] Process qemu-system-x86 (pid: 8078, threadinfo ffff88082284e000, task ffff88082bbf8650)
[  208.827918] Stack:
[  208.827918]  ffffffff802efec0 0000000000100100 ffff88082284fe78 00000000000077f8
[  208.827918]  00000000499472fc 00000000000077f8 00000000000077f8 0000000000000056
[  208.827918]  ffff8808299c2280 0000000000000064 0000000000000056 ffffffffa04455ac
[  208.827918] Call Trace:
[  208.827918]  [<ffffffff802efec0>] ? pollwake+0x0/0x50
[  208.827918]  [<ffffffffa04455ac>] ? tun_chr_aio_write+0x19c/0x440 [tun]
[  208.827918]  [<ffffffff802b68ad>] ? zone_statistics+0x7d/0x80
[  208.827918]  [<ffffffffa0445410>] ? tun_chr_aio_write+0x0/0x440 [tun]
[  208.827918]  [<ffffffff802df8fb>] ? do_sync_readv_writev+0xcb/0x110
[  208.827918]  [<ffffffff80261f90>] ? autoremove_wake_function+0x0/0x30
[  208.827918]  [<ffffffff802dcf15>] ? mem_cgroup_charge_common+0x75/0xa0
[  208.827918]  [<ffffffff802df73d>] ? rw_copy_check_uvector+0x9d/0x150
[  208.827918]  [<ffffffff802e0052>] ? do_readv_writev+0xe2/0x220
[  208.827918]  [<ffffffff80265380>] ? ktime_get_ts+0x20/0x60
[  208.827918]  [<ffffffff8022cc35>] ? default_spin_lock_flags+0x5/0x10
[  208.827918]  [<ffffffff804db50e>] ? _spin_lock_irqsave+0x2e/0x40
[  208.827918]  [<ffffffff804ddf53>] ? do_page_fault+0x523/0xaa0
[  208.827918]  [<ffffffff802e0683>] ? sys_writev+0x53/0xc0
[  208.827918]  [<ffffffff8021252a>] ? system_call_fastpath+0x16/0x1b
[  208.827918] Code: c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 89 f6 41 55 41 54 41 89 cc 55 53 48 83 ec 28 48 89 7c 24 18 48 89 54 24 10 <8b> 6f 68 2b 6f 6c 89 e8 29 f0 85 c0 0f 8f 6f 01 00 00 48 8b 4c 
[  208.827918] RIP  [<ffffffff804481be>] skb_copy_datagram_from_iovec+0x1e/0x260
[  208.827918]  RSP <ffff88082284fc68>
[  208.827918] CR2: 0000000000007860
[  208.959112] ---[ end trace a4838e8d8e9e602d ]---



Patch

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d7b81e4..a97448d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -63,6 +63,7 @@ 
 #include <linux/virtio_net.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
+#include <net/sock.h>
 
 #include <asm/system.h>
 #include <asm/uaccess.h>
@@ -87,6 +88,8 @@  struct tap_filter {
 	unsigned char	addr[FLT_EXACT_COUNT][ETH_ALEN];
 };
 
+struct tun_sock;
+
 struct tun_struct {
 	struct list_head        list;
 	unsigned int 		flags;
@@ -101,12 +104,24 @@  struct tun_struct {
 	struct fasync_struct	*fasync;
 
 	struct tap_filter       txflt;
+	struct sock		*sk;
+	struct socket		socket;
 
 #ifdef TUN_DEBUG
 	int debug;
 #endif
 };
 
+struct tun_sock {
+	struct sock		sk;
+	struct tun_struct	*tun;
+};
+
+static inline struct tun_sock *tun_sk(struct sock *sk)
+{
+	return container_of(sk, struct tun_sock, sk);
+}
+
 /* TAP filterting */
 static void addr_hash_set(u32 *mask, const u8 *addr)
 {
@@ -360,7 +375,8 @@  static void tun_net_init(struct net_device *dev)
 static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
 {
 	struct tun_struct *tun = file->private_data;
-	unsigned int mask = POLLOUT | POLLWRNORM;
+	struct sock *sk = tun->sk;
+	unsigned int mask = 0;
 
 	if (!tun)
 		return -EBADFD;
@@ -372,71 +388,45 @@  static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
 	if (!skb_queue_empty(&tun->readq))
 		mask |= POLLIN | POLLRDNORM;
 
+	if (sock_writeable(sk) ||
+	    (!test_and_set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
+	     sock_writeable(sk)))
+		mask |= POLLOUT | POLLWRNORM;
+
 	return mask;
 }
 
 /* prepad is the amount to reserve at front.  len is length after that.
  * linear is a hint as to how much to copy (usually headers). */
-static struct sk_buff *tun_alloc_skb(size_t prepad, size_t len, size_t linear,
-				     gfp_t gfp)
+static inline struct sk_buff *tun_alloc_skb(struct tun_struct *tun,
+					    size_t prepad, size_t len,
+					    size_t linear, int noblock)
 {
+	struct sock *sk = tun->sk;
 	struct sk_buff *skb;
-	unsigned int i;
-
-	skb = alloc_skb(prepad + len, gfp|__GFP_NOWARN);
-	if (skb) {
-		skb_reserve(skb, prepad);
-		skb_put(skb, len);
-		return skb;
-	}
+	int err;
 
 	/* Under a page?  Don't bother with paged skb. */
 	if (prepad + len < PAGE_SIZE)
-		return NULL;
+		linear = len;
 
-	/* Start with a normal skb, and add pages. */
-	skb = alloc_skb(prepad + linear, gfp);
+	skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
+				   &err);
 	if (!skb)
-		return NULL;
+		return ERR_PTR(err);
 
 	skb_reserve(skb, prepad);
 	skb_put(skb, linear);
-
-	len -= linear;
-
-	for (i = 0; i < MAX_SKB_FRAGS; i++) {
-		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
-
-		f->page = alloc_page(gfp|__GFP_ZERO);
-		if (!f->page)
-			break;
-
-		f->page_offset = 0;
-		f->size = PAGE_SIZE;
-
-		skb->data_len += PAGE_SIZE;
-		skb->len += PAGE_SIZE;
-		skb->truesize += PAGE_SIZE;
-		skb_shinfo(skb)->nr_frags++;
-
-		if (len < PAGE_SIZE) {
-			len = 0;
-			break;
-		}
-		len -= PAGE_SIZE;
-	}
-
-	/* Too large, or alloc fail? */
-	if (unlikely(len)) {
-		kfree_skb(skb);
-		skb = NULL;
-	}
+	skb->data_len = len - linear;
+	skb->len += len - linear;
 
 	return skb;
 }
 
 /* Get packet from user space buffer */
-static __inline__ ssize_t tun_get_user(struct tun_struct *tun, struct iovec *iv, size_t count)
+static __inline__ ssize_t tun_get_user(struct tun_struct *tun,
+				       struct iovec *iv, size_t count,
+				       int noblock)
 {
 	struct tun_pi pi = { 0, __constant_htons(ETH_P_IP) };
 	struct sk_buff *skb;
@@ -468,9 +458,11 @@  static __inline__ ssize_t tun_get_user(struct tun_struct *tun, struct iovec *iv,
 			return -EINVAL;
 	}
 
-	if (!(skb = tun_alloc_skb(align, len, gso.hdr_len, GFP_KERNEL))) {
-		tun->dev->stats.rx_dropped++;
-		return -ENOMEM;
+	skb = tun_alloc_skb(tun, align, len, gso.hdr_len, noblock);
+	if (IS_ERR(skb)) {
+		if (PTR_ERR(skb) != -EAGAIN)
+			tun->dev->stats.rx_dropped++;
+		return PTR_ERR(skb);
 	}
 
 	if (skb_copy_datagram_from_iovec(skb, 0, iv, len)) {
@@ -556,14 +548,16 @@  static __inline__ ssize_t tun_get_user(struct tun_struct *tun, struct iovec *iv,
 static ssize_t tun_chr_aio_write(struct kiocb *iocb, const struct iovec *iv,
 			      unsigned long count, loff_t pos)
 {
-	struct tun_struct *tun = iocb->ki_filp->private_data;
+	struct file *file = iocb->ki_filp;
+	struct tun_struct *tun = file->private_data;
 
 	if (!tun)
 		return -EBADFD;
 
 	DBG(KERN_INFO "%s: tun_chr_write %ld\n", tun->dev->name, count);
 
-	return tun_get_user(tun, (struct iovec *) iv, iov_length(iv, count));
+	return tun_get_user(tun, (struct iovec *) iv, iov_length(iv, count),
+			    file->f_flags & O_NONBLOCK);
 }
 
 /* Put packet to the user space buffer */
@@ -710,8 +704,37 @@  static struct tun_struct *tun_get_by_name(struct tun_net *tn, const char *name)
 	return NULL;
 }
 
+static void tun_sock_write_space(struct sock *sk)
+{
+	struct tun_struct *tun;
+
+	if (!sock_writeable(sk))
+		return;
+
+	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+		wake_up_interruptible_sync(sk->sk_sleep);
+
+	if (!test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
+		return;
+
+	tun = container_of(sk, struct tun_sock, sk)->tun;
+	kill_fasync(&tun->fasync, SIGIO, POLL_OUT);
+}
+
+static void tun_sock_destruct(struct sock *sk)
+{
+	dev_put(container_of(sk, struct tun_sock, sk)->tun->dev);
+}
+
+static struct proto tun_proto = {
+	.name		= "tun",
+	.owner		= THIS_MODULE,
+	.obj_size	= sizeof(struct tun_sock),
+};
+
 static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 {
+	struct sock *sk;
 	struct tun_net *tn;
 	struct tun_struct *tun;
 	struct net_device *dev;
@@ -771,14 +794,31 @@  static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 		tun->flags = flags;
 		tun->txflt.count = 0;
 
+		err = -ENOMEM;
+		sk = sk_alloc(net, AF_UNSPEC, GFP_KERNEL, &tun_proto);
+		if (!sk)
+			goto err_free_dev;
+
+		/* This ref count is for tun->sk. */
+		dev_hold(dev);
+		sock_init_data(&tun->socket, sk);
+		sk->sk_write_space = tun_sock_write_space;
+		sk->sk_destruct = tun_sock_destruct;
+		sk->sk_sndbuf = INT_MAX;
+		sk->sk_sleep = &tun->read_wait;
+
+		tun->sk = sk;
+		container_of(sk, struct tun_sock, sk)->tun = tun;
+
 		tun_net_init(dev);
 
 		if (strchr(dev->name, '%')) {
 			err = dev_alloc_name(dev, dev->name);
 			if (err < 0)
-				goto err_free_dev;
+				goto err_free_sk;
 		}
 
+		err = -EINVAL;
 		err = register_netdevice(tun->dev);
 		if (err < 0)
 			goto err_free_dev;
@@ -816,6 +856,8 @@  static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 	strcpy(ifr->ifr_name, tun->dev->name);
 	return 0;
 
+ err_free_sk:
+	sock_put(sk);
  err_free_dev:
 	free_netdev(dev);
  failed:
@@ -898,6 +940,7 @@  static int tun_chr_ioctl(struct inode *inode, struct file *file,
 	struct tun_struct *tun = file->private_data;
 	void __user* argp = (void __user*)arg;
 	struct ifreq ifr;
+	int sndbuf;
 	int ret;
 
 	if (cmd == TUNSETIFF || _IOC_TYPE(cmd) == 0x89)
@@ -1034,6 +1077,18 @@  static int tun_chr_ioctl(struct inode *inode, struct file *file,
 		rtnl_unlock();
 		return ret;
 
+	case TUNGETSNDBUF:
+		sndbuf = tun->sk->sk_sndbuf;
+		if (copy_to_user(argp, &sndbuf, sizeof(sndbuf)))
+			return -EFAULT;
+		return 0;
+
+	case TUNSETSNDBUF:
+		if (copy_from_user(&sndbuf, argp, sizeof(sndbuf)))
+			return -EFAULT;
+		tun->sk->sk_sndbuf = sndbuf;
+		return 0;
+
 	default:
 		return -EINVAL;
 	};
@@ -1097,6 +1152,7 @@  static int tun_chr_close(struct inode *inode, struct file *file)
 
 	if (!(tun->flags & TUN_PERSIST)) {
 		list_del(&tun->list);
+		sock_put(tun->sk);
 		unregister_netdevice(tun->dev);
 	}
 
@@ -1238,6 +1294,7 @@  static void tun_exit_net(struct net *net)
 	rtnl_lock();
 	list_for_each_entry_safe(tun, nxt, &tn->dev_list, list) {
 		DBG(KERN_INFO "%s cleaned up\n", tun->dev->name);
+		sock_put(tun->sk);
 		unregister_netdevice(tun->dev);
 	}
 	rtnl_unlock();
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 8529f57..049d6c9 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -46,6 +46,8 @@ 
 #define TUNSETOFFLOAD  _IOW('T', 208, unsigned int)
 #define TUNSETTXFILTER _IOW('T', 209, unsigned int)
 #define TUNGETIFF      _IOR('T', 210, unsigned int)
+#define TUNGETSNDBUF   _IOR('T', 211, int)
+#define TUNSETSNDBUF   _IOW('T', 212, int)
 
 /* TUNSETIFF ifr flags */
 #define IFF_TUN		0x0001