diff mbox

[0/1] proposed export branch rebase

Message ID 20191208182417.16789-1-fw@strlen.de
State Accepted, archived
Delegated to: Florian Westphal
Headers show

Pull-request

git://git.breakpoint.cc/fw/mptcp-next.git export_rebase_19

Commit Message

Florian Westphal Dec. 8, 2019, 6:24 p.m. UTC
It includes 48 patches (down from 66), I will refrain from patch-bombing
the list with it.  There is one patch as followup here because its
"almost new" in the sense that its an amalgamation of different patches
that I placed in that commit because the others seemed quite large already.

The __tcp_poll patch is gone.  Later patches, in particular the changes to
only store 'tcp_sock' rather than 'struct socket' on the conn_list, have
been brought forward and squashed into the relevant commits, would thus
be implicitly part of basic mptcp submission.  I've amemded commit
messages where appropriate.

Here is the full diff to the current export branch:




As you can see, the result is almost identical.

I've done the following tests:
1. all commits compile with MPTCP=y and =n
2. A build at the 'kselftest' commit passes the kselftests
3. The topmost commit passes kselftests.

Major changes are in the following commits:
mptcp: Create SUBFLOW socket for incoming connections

This squashed the patch that copied saddr/daddr from first subflow to msk
and the patch changing conn_list to store 'struct tcp_sock'.

mptcp: Add shutdown() socket operation
This merges the patch adding subflow_shutdown helper and uses it in all
needed places.

The change in the bit flag numbering is due to the need to have a
working mptcp_poll without calling into tcp_poll by the time the
kselftests are added -- for that, DATA_READY and SEND_SPACE flags
are needed, so they are added in a different ordering compared
to the current export branch.

There are more opportunities for squashing after the kselftest commit,
but I did not do this.  Unless there is a need to do so I would prefer
to add smaller patches after basic functionality is working.

I've reviewed the formatted patches up to 'kselftest' and I did not see
any silly 'add x; remove x; add x' sequences.

If you spot changes in 'Author' as result of the squashing --
please report and I will fix it up.

Thanks,
Florian


The following changes since commit 596cf45cbf6e4fa7bcb0df33e373a7d062b644b5:

  Merge branch 'akpm' (patches from Andrew) (2019-12-01 20:36:41 -0800)

are available in the Git repository at:

  git://git.breakpoint.cc/fw/mptcp-next.git export_rebase_19

for you to fetch changes up to e562b10de2abdcbda68aef01e36b35b7116b083e:

  subflow: place further subflows on new 'join_list' (2019-12-08 19:03:06 +0100)

----------------------------------------------------------------
Christoph Paasch (2):
      mptcp: parse and emit MP_CAPABLE option according to v1 spec.
      mptcp: process MP_CAPABLE data option.

Davide Caratti (1):
      mptcp: allow dumping subflow context to userspace

Florian Westphal (8):
      mptcp: add subflow write space signalling and mptcp_poll
      mptcp: add basic kselftest for mptcp
      mptcp: allow partial cleaning of rtx head dfrag
      mptcp: add MIB counter infrastructure
      mptcp: increment MIB counters in a few places
      subflow: check parent mptcp socket on subflow state change
      sendmsg: truncate source buffer if mptcp sndbuf size was set from userspace
      subflow: place further subflows on new 'join_list'

Mat Martineau (12):
      net: Make sock protocol value checks more specific
      sock: Make sk_protocol a 16-bit value
      tcp: Define IPPROTO_MPTCP
      tcp: Add MPTCP option number
      tcp, ulp: Add clone operation to tcp_ulp_ops
      mptcp: Add MPTCP to skb extensions
      tcp: Prevent coalesce/collapse when skb has MPTCP extensions
      tcp: Export TCP functions and ops struct
      tcp: Check for filled TCP option space before SACK
      mptcp: Add MPTCP socket stubs
      mptcp: Write MPTCP DSS headers to outgoing data packets
      mptcp: Implement MPTCP receive path

Matthieu Baerts (1):
      mptcp: new sysctl to control the activation per NS

Paolo Abeni (11):
      tcp: clean ext on tx recycle
      skb: add helpers to allocate ext independently from sk_buff
      mptcp: recvmsg() can drain data from multiple subflows
      mptcp: allow collapsing consecutive sendpages on the same substream
      mptcp: move from sha1 (v0) to sha256 (v1)
      mptcp: update per unacked sequence on pkt reception
      mptcp: queue data for mptcp level retransmission
      mptcp: introduce MPTCP retransmission timer
      mptcp: implement memory accounting for mptcp rtx queue
      mptcp: rework mptcp_sendmsg_frag to accept optional dfrag
      mptcp: implement and use MPTCP-level retransmission

Peter Krystad (13):
      mptcp: Handle MPTCP TCP options
      mptcp: Associate MPTCP context with TCP socket
      mptcp: Handle MP_CAPABLE options for outgoing connections
      mptcp: Create SUBFLOW socket for incoming connections
      mptcp: Add key generation and token tree
      mptcp: Add shutdown() socket operation
      mptcp: Add setsockopt()/getsockopt() socket operations
      mptcp: Add path manager interface
      mptcp: Add ADD_ADDR handling
      mptcp: Add handling of incoming MP_JOIN requests
      mptcp: Add handling of outgoing MP_JOIN requests
      mptcp: Implement path manager interface commands
      mptcp: Implement basic path manager

 include/linux/skbuff.h                             |    6 +
 include/linux/tcp.h                                |   48 +
 include/net/mptcp.h                                |  168 ++
 include/net/netns/mib.h                            |    3 +
 include/net/sock.h                                 |    6 +-
 include/net/tcp.h                                  |   22 +
 include/trace/events/sock.h                        |    5 +-
 include/uapi/linux/in.h                            |    2 +
 include/uapi/linux/inet_diag.h                     |    1 +
 include/uapi/linux/mptcp.h                         |   34 +
 net/Kconfig                                        |    1 +
 net/Makefile                                       |    1 +
 net/ax25/af_ax25.c                                 |    2 +-
 net/core/skbuff.c                                  |   42 +-
 net/decnet/af_decnet.c                             |    2 +-
 net/ipv4/af_inet.c                                 |    4 +
 net/ipv4/inet_connection_sock.c                    |    2 +
 net/ipv4/proc.c                                    |    2 +
 net/ipv4/tcp.c                                     |    8 +-
 net/ipv4/tcp_input.c                               |   29 +-
 net/ipv4/tcp_ipv4.c                                |    2 +-
 net/ipv4/tcp_minisocks.c                           |    6 +
 net/ipv4/tcp_output.c                              |   62 +-
 net/ipv4/tcp_ulp.c                                 |   12 +
 net/ipv6/tcp_ipv6.c                                |   13 +-
 net/mptcp/Kconfig                                  |   26 +
 net/mptcp/Makefile                                 |    4 +
 net/mptcp/basic.c                                  |  270 +++
 net/mptcp/crypto.c                                 |  151 ++
 net/mptcp/ctrl.c                                   |  130 ++
 net/mptcp/diag.c                                   |  101 ++
 net/mptcp/mib.c                                    |   66 +
 net/mptcp/mib.h                                    |   37 +
 net/mptcp/options.c                                |  906 ++++++++++
 net/mptcp/pm.c                                     |  273 +++
 net/mptcp/protocol.c                               | 1742 ++++++++++++++++++++
 net/mptcp/protocol.h                               |  376 +++++
 net/mptcp/subflow.c                                | 1014 ++++++++++++
 net/mptcp/token.c                                  |  222 +++
 tools/include/uapi/linux/in.h                      |    2 +
 tools/testing/selftests/Makefile                   |    1 +
 tools/testing/selftests/kselftest/runner.sh        |    2 +-
 tools/testing/selftests/net/mptcp/.gitignore       |    2 +
 tools/testing/selftests/net/mptcp/Makefile         |   13 +
 tools/testing/selftests/net/mptcp/config           |    2 +
 tools/testing/selftests/net/mptcp/mptcp_connect.c  |  834 ++++++++++
 tools/testing/selftests/net/mptcp/mptcp_connect.sh |  595 +++++++
 tools/testing/selftests/net/mptcp/settings         |    1 +
 48 files changed, 7232 insertions(+), 21 deletions(-)
 create mode 100644 include/net/mptcp.h
 create mode 100644 include/uapi/linux/mptcp.h
 create mode 100644 net/mptcp/Kconfig
 create mode 100644 net/mptcp/Makefile
 create mode 100644 net/mptcp/basic.c
 create mode 100644 net/mptcp/crypto.c
 create mode 100644 net/mptcp/ctrl.c
 create mode 100644 net/mptcp/diag.c
 create mode 100644 net/mptcp/mib.c
 create mode 100644 net/mptcp/mib.h
 create mode 100644 net/mptcp/options.c
 create mode 100644 net/mptcp/pm.c
 create mode 100644 net/mptcp/protocol.c
 create mode 100644 net/mptcp/protocol.h
 create mode 100644 net/mptcp/subflow.c
 create mode 100644 net/mptcp/token.c
 create mode 100644 tools/testing/selftests/net/mptcp/.gitignore
 create mode 100644 tools/testing/selftests/net/mptcp/Makefile
 create mode 100644 tools/testing/selftests/net/mptcp/config
 create mode 100644 tools/testing/selftests/net/mptcp/mptcp_connect.c
 create mode 100755 tools/testing/selftests/net/mptcp/mptcp_connect.sh
 create mode 100644 tools/testing/selftests/net/mptcp/settings

Comments

Paolo Abeni Dec. 9, 2019, 12:33 p.m. UTC | #1
Hi,

On Sun, 2019-12-08 at 19:24 +0100, Florian Westphal wrote:
> It includes 48 patches (down from 66), I will refrain from patch-bombing
> the list with it.  There is one patch as followup here because its
> "almost new" in the sense that its an amalgamation of different patches
> that I placed in that commit because the others seemed quite large already.
> 
> The __tcp_poll patch is gone.  Later patches, in particular the changes to
> only store 'tcp_sock' rather than 'struct socket' on the conn_list, have
> been brought forward and squashed into the relevant commits, would thus
> be implicitly part of basic mptcp submission.  I've amemded commit
> messages where appropriate.

Thank you Florian for the great work!

I went through the patches up to v1 support, and LGTM.

> There are more opportunities for squashing after the kselftest commit,
> but I did not do this.  Unless there is a need to do so I would prefer
> to add smaller patches after basic functionality is working.

Agreed!

I'll try to rebase and squash the fallback cleanup patches on top of
this.

Thanks,

Paolo
Matthieu Baerts Dec. 10, 2019, 6:24 p.m. UTC | #2
Hi Florian, Paolo,

On 09/12/2019 13:33, Paolo Abeni wrote:
> Hi,
> 
> On Sun, 2019-12-08 at 19:24 +0100, Florian Westphal wrote:
>> It includes 48 patches (down from 66), I will refrain from patch-bombing
>> the list with it.  There is one patch as followup here because its
>> "almost new" in the sense that its an amalgamation of different patches
>> that I placed in that commit because the others seemed quite large already.
>>
>> The __tcp_poll patch is gone.  Later patches, in particular the changes to
>> only store 'tcp_sock' rather than 'struct socket' on the conn_list, have
>> been brought forward and squashed into the relevant commits, would thus
>> be implicitly part of basic mptcp submission.  I've amemded commit
>> messages where appropriate.
> 
> Thank you Florian for the great work!
> 
> I went through the patches up to v1 support, and LGTM.

It is also good to me! Thank you very much for the very big work! The 
kind of work where even you you have to sit down to do it! ...  O:-D

I just re-created the tree using your branch. I am also checking now 
that all commits are OK, with and without MPTCP. Also the usual 
compilation tests with i386 and without IPv6.

I will do a second pass on the different commits tomorrow!

Cheers,
Matt
Matthieu Baerts Dec. 10, 2019, 9:27 p.m. UTC | #3
Hi,

On 10/12/2019 19:24, Matthieu Baerts wrote:
> Hi Florian, Paolo,
> 
> On 09/12/2019 13:33, Paolo Abeni wrote:
>> Hi,
>>
>> On Sun, 2019-12-08 at 19:24 +0100, Florian Westphal wrote:
>>> It includes 48 patches (down from 66), I will refrain from patch-bombing
>>> the list with it.  There is one patch as followup here because its
>>> "almost new" in the sense that its an amalgamation of different patches
>>> that I placed in that commit because the others seemed quite large 
>>> already.
>>>
>>> The __tcp_poll patch is gone.  Later patches, in particular the 
>>> changes to
>>> only store 'tcp_sock' rather than 'struct socket' on the conn_list, have
>>> been brought forward and squashed into the relevant commits, would thus
>>> be implicitly part of basic mptcp submission.  I've amemded commit
>>> messages where appropriate.
>>
>> Thank you Florian for the great work!
>>
>> I went through the patches up to v1 support, and LGTM.
> 
> It is also good to me! Thank you very much for the very big work! The 
> kind of work where even you you have to sit down to do it! ...  O:-D
> 
> I just re-created the tree using your branch. I am also checking now 
> that all commits are OK, with and without MPTCP. Also the usual 
> compilation tests with i386 and without IPv6.

FYI, all tests are green!
I didn't re-enable the automatic sync with net-next. I will do that 
after having applied recent Paolo's changes.

Cheers,
Matt
Paolo Abeni Dec. 11, 2019, 4:32 p.m. UTC | #4
On Tue, 2019-12-10 at 22:27 +0100, Matthieu Baerts wrote:
> Hi,
> 
> On 10/12/2019 19:24, Matthieu Baerts wrote:
> > Hi Florian, Paolo,
> > 
> > On 09/12/2019 13:33, Paolo Abeni wrote:
> > > Hi,
> > > 
> > > On Sun, 2019-12-08 at 19:24 +0100, Florian Westphal wrote:
> > > > It includes 48 patches (down from 66), I will refrain from patch-bombing
> > > > the list with it.  There is one patch as followup here because its
> > > > "almost new" in the sense that its an amalgamation of different patches
> > > > that I placed in that commit because the others seemed quite large 
> > > > already.
> > > > 
> > > > The __tcp_poll patch is gone.  Later patches, in particular the 
> > > > changes to
> > > > only store 'tcp_sock' rather than 'struct socket' on the conn_list, have
> > > > been brought forward and squashed into the relevant commits, would thus
> > > > be implicitly part of basic mptcp submission.  I've amemded commit
> > > > messages where appropriate.
> > > 
> > > Thank you Florian for the great work!
> > > 
> > > I went through the patches up to v1 support, and LGTM.
> > 
> > It is also good to me! Thank you very much for the very big work! The 
> > kind of work where even you you have to sit down to do it! ...  O:-D
> > 
> > I just re-created the tree using your branch. I am also checking now 
> > that all commits are OK, with and without MPTCP. Also the usual 
> > compilation tests with i386 and without IPv6.
> 
> FYI, all tests are green!
> I didn't re-enable the automatic sync with net-next. I will do that 
> after having applied recent Paolo's changes.

The rebased and amost squashed thing is available here:

https://github.com/pabeni/mptcp/tree/fwestpha-mptcp_export_rebase_19_fallback_cleanup_2

It build on each commit. Not tested yet, sent-out early for review's
sake.

Note that some patches after MP_JOIN required some relevant rework,
see:

https://github.com/pabeni/mptcp/commit/0a84b5b3547cba0bea6186402d0eed4981b6174a
https://github.com/pabeni/mptcp/commit/481d3137ce456df2dc93f4b64f0204201d6fee7f

@Florian, WDYT?

Thanks,

Paolo
Paolo Abeni Dec. 11, 2019, 5:15 p.m. UTC | #5
On Wed, 2019-12-11 at 17:32 +0100, Paolo Abeni wrote:
> On Tue, 2019-12-10 at 22:27 +0100, Matthieu Baerts wrote:
> > Hi,
> > 
> > On 10/12/2019 19:24, Matthieu Baerts wrote:
> > > Hi Florian, Paolo,
> > > 
> > > On 09/12/2019 13:33, Paolo Abeni wrote:
> > > > Hi,
> > > > 
> > > > On Sun, 2019-12-08 at 19:24 +0100, Florian Westphal wrote:
> > > > > It includes 48 patches (down from 66), I will refrain from patch-bombing
> > > > > the list with it.  There is one patch as followup here because its
> > > > > "almost new" in the sense that its an amalgamation of different patches
> > > > > that I placed in that commit because the others seemed quite large 
> > > > > already.
> > > > > 
> > > > > The __tcp_poll patch is gone.  Later patches, in particular the 
> > > > > changes to
> > > > > only store 'tcp_sock' rather than 'struct socket' on the conn_list, have
> > > > > been brought forward and squashed into the relevant commits, would thus
> > > > > be implicitly part of basic mptcp submission.  I've amemded commit
> > > > > messages where appropriate.
> > > > 
> > > > Thank you Florian for the great work!
> > > > 
> > > > I went through the patches up to v1 support, and LGTM.
> > > 
> > > It is also good to me! Thank you very much for the very big work! The 
> > > kind of work where even you you have to sit down to do it! ...  O:-D
> > > 
> > > I just re-created the tree using your branch. I am also checking now 
> > > that all commits are OK, with and without MPTCP. Also the usual 
> > > compilation tests with i386 and without IPv6.
> > 
> > FYI, all tests are green!
> > I didn't re-enable the automatic sync with net-next. I will do that 
> > after having applied recent Paolo's changes.
> 
> The rebased and amost squashed thing is available here:
> 
> https://github.com/pabeni/mptcp/tree/fwestpha-mptcp_export_rebase_19_fallback_cleanup_2
> 
> It build on each commit. Not tested yet, sent-out early for review's
> sake.

Just tested successfully on top of V1 patches and on branch head.

For the record, I observed once on the unsquashed patches, the
following warn (with decoded stack trace):

[ 1093.576154] ------------[ cut here ]------------
[ 1093.577188] refcount_t: addition on 0; use-after-free.
[ 1093.577966] WARNING: CPU: 2 PID: 29790 at lib/refcount.c:25 refcount_warn_saturate (/home/pabeni/devel/linux-mptcp/lib/refcount.c:25 (discriminator 3))
[ 1093.579153] Modules linked in: sch_netem veth intel_rapl_msr rfkill intel_rapl_common kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 ip_tables xfs libcrc32c crc32c_intel serio_raw virtio_net net_failover virtio_blk ata_generic failover virtio_console ata_piix libata
[ 1093.584101] CPU: 2 PID: 29790 Comm: mptcp_connect Not tainted 5.4.0.mptcp_bfaef67f2a47+ #464
[ 1093.585723] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
[ 1093.587415] RIP: 0010:refcount_warn_saturate (/home/pabeni/devel/linux-mptcp/lib/refcount.c:25 (discriminator 3))
[ 1093.588409] Code: b9 2d 41 02 01 e8 3b 90 41 ff 0f 0b eb 9d 80 3d a8 2d 41 02 00 75 94 48 c7 c7 c0 09 e8 87 c6 05 98 2d 41 02 01 e8 1b 90 41 ff <0f> 0b e9 7a ff ff ff 80 3d 82 2d 41 02 00 0f 85 6d ff ff ff 48 c7
All code
========
   0:   b9 2d 41 02 01          mov    $0x102412d,%ecx
   5:   e8 3b 90 41 ff          callq  0xffffffffff419045
   a:   0f 0b                   ud2
   c:   eb 9d                   jmp    0xffffffffffffffab
   e:   80 3d a8 2d 41 02 00    cmpb   $0x0,0x2412da8(%rip)        # 0x2412dbd
  15:   75 94                   jne    0xffffffffffffffab
  17:   48 c7 c7 c0 09 e8 87    mov    $0xffffffff87e809c0,%rdi
  1e:   c6 05 98 2d 41 02 01    movb   $0x1,0x2412d98(%rip)        # 0x2412dbd
  25:   e8 1b 90 41 ff          callq  0xffffffffff419045
  2a:*  0f 0b                   ud2             <-- trapping instruction
  2c:   e9 7a ff ff ff          jmpq   0xffffffffffffffab
  31:   80 3d 82 2d 41 02 00    cmpb   $0x0,0x2412d82(%rip)        # 0x2412dba
  38:   0f 85 6d ff ff ff       jne    0xffffffffffffffab
  3e:   48                      rex.W
  3f:   c7                      .byte 0xc7

Code starting with the faulting instruction
===========================================
   0:   0f 0b                   ud2
   2:   e9 7a ff ff ff          jmpq   0xffffffffffffff81
   7:   80 3d 82 2d 41 02 00    cmpb   $0x0,0x2412d82(%rip)        # 0x2412d90
   e:   0f 85 6d ff ff ff       jne    0xffffffffffffff81
  14:   48                      rex.W
  15:   c7                      .byte 0xc7
[ 1093.592029] RSP: 0018:ffff8881061f7cd0 EFLAGS: 00010282
[ 1093.593155] RAX: 0000000000000000 RBX: ffff888035b54040 RCX: 0000000000000000
[ 1093.594476] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffed1020c3ef8c
[ 1093.595769] RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed102187c4d1
[ 1093.597063] R10: ffffed102187c4d0 R11: ffff88810c3e2687 R12: ffff888035b53fe8
[ 1093.598354] R13: ffff888035b54040 R14: ffff888035b53fc0 R15: ffff888035b53fc0
[ 1093.599654] FS:  00007f9b8b9ad740(0000) GS:ffff88810c200000(0000) knlGS:0000000000000000
[ 1093.601104] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1093.602174] CR2: 00007fffdcdc0000 CR3: 0000000103e04001 CR4: 0000000000160ee0
[ 1093.603472] Call Trace:
[ 1093.603997] tcp_close (/home/pabeni/devel/linux-mptcp/./include/linux/refcount.h:191 /home/pabeni/devel/linux-mptcp/./include/linux/refcount.h:228 /home/pabeni/devel/linux-mptcp/./include/net/sock.h:647 /home/pabeni/devel/linux-mptcp/net/ipv4/tcp.c:2434)
[ 1093.605757] mptcp_close (/home/pabeni/devel/linux-mptcp/net/mptcp/protocol.c:521 /home/pabeni/devel/linux-mptcp/net/mptcp/protocol.c:579)
[ 1093.606604] inet_release (/home/pabeni/devel/linux-mptcp/net/ipv4/af_inet.c:428)
[ 1093.607427] __sock_release (/home/pabeni/devel/linux-mptcp/net/socket.c:593)
[ 1093.608230] sock_close (/home/pabeni/devel/linux-mptcp/net/socket.c:1272)
[ 1093.608792] __fput (/home/pabeni/devel/linux-mptcp/fs/file_table.c:280)
[ 1093.609430] task_work_run (/home/pabeni/devel/linux-mptcp/kernel/task_work.c:115 (discriminator 1))
[ 1093.610100] exit_to_usermode_loop (/home/pabeni/devel/linux-mptcp/./include/linux/tracehook.h:188 /home/pabeni/devel/linux-mptcp/arch/x86/entry/common.c:164)
[ 1093.610858] do_syscall_64 (/home/pabeni/devel/linux-mptcp/arch/x86/entry/common.c:195 /home/pabeni/devel/linux-mptcp/arch/x86/entry/common.c:278 /home/pabeni/devel/linux-mptcp/arch/x86/entry/common.c:304)
[ 1093.611579] entry_SYSCALL_64_after_hwframe (/home/pabeni/devel/linux-mptcp/arch/x86/entry/entry_64.S:177)
[ 1093.612617] RIP: 0033:0x7f9b8b4b77a0
[ 1093.613289] Code: 00 64 c7 00 0d 00 00 00 b8 ff ff ff ff eb 90 b8 ff ff ff ff eb 89 0f 1f 40 00 83 3d fd c7 2d 00 00 75 10 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 9e c5 01 00 48 89 04 24
All code
========
   0:   00 64 c7 00             add    %ah,0x0(%rdi,%rax,8)
   4:   0d 00 00 00 b8          or     $0xb8000000,%eax
   9:   ff                      (bad)
   a:   ff                      (bad)
   b:   ff                      (bad)
   c:   ff                      (bad)
   d:   eb 90                   jmp    0xffffffffffffff9f
   f:   b8 ff ff ff ff          mov    $0xffffffff,%eax
  14:   eb 89                   jmp    0xffffffffffffff9f
  16:   0f 1f 40 00             nopl   0x0(%rax)
  1a:   83 3d fd c7 2d 00 00    cmpl   $0x0,0x2dc7fd(%rip)        # 0x2dc81e
  21:   75 10                   jne    0x33
  23:   b8 03 00 00 00          mov    $0x3,%eax
  28:   0f 05                   syscall
  2a:*  48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax         <-- trapping instruction
  30:   73 31                   jae    0x63
  32:   c3                      retq
---

But I could not reproduce it (so far) on the squashed branch.

I'll suggest to merge the patches as-is and ev. look after bug after
that.

Cheers,

Paolo
Paolo Abeni Dec. 11, 2019, 7 p.m. UTC | #6
On Wed, 2019-12-11 at 17:32 +0100, Paolo Abeni wrote:
> On Tue, 2019-12-10 at 22:27 +0100, Matthieu Baerts wrote:
> > Hi,
> > 
> > On 10/12/2019 19:24, Matthieu Baerts wrote:
> > > Hi Florian, Paolo,
> > > 
> > > On 09/12/2019 13:33, Paolo Abeni wrote:
> > > > Hi,
> > > > 
> > > > On Sun, 2019-12-08 at 19:24 +0100, Florian Westphal wrote:
> > > > > It includes 48 patches (down from 66), I will refrain from patch-bombing
> > > > > the list with it.  There is one patch as followup here because its
> > > > > "almost new" in the sense that its an amalgamation of different patches
> > > > > that I placed in that commit because the others seemed quite large 
> > > > > already.
> > > > > 
> > > > > The __tcp_poll patch is gone.  Later patches, in particular the 
> > > > > changes to
> > > > > only store 'tcp_sock' rather than 'struct socket' on the conn_list, have
> > > > > been brought forward and squashed into the relevant commits, would thus
> > > > > be implicitly part of basic mptcp submission.  I've amemded commit
> > > > > messages where appropriate.
> > > > 
> > > > Thank you Florian for the great work!
> > > > 
> > > > I went through the patches up to v1 support, and LGTM.
> > > 
> > > It is also good to me! Thank you very much for the very big work! The 
> > > kind of work where even you you have to sit down to do it! ...  O:-D
> > > 
> > > I just re-created the tree using your branch. I am also checking now 
> > > that all commits are OK, with and without MPTCP. Also the usual 
> > > compilation tests with i386 and without IPv6.
> > 
> > FYI, all tests are green!
> > I didn't re-enable the automatic sync with net-next. I will do that 
> > after having applied recent Paolo's changes.
> 
> The rebased and amost squashed thing is available here:
> 
> https://github.com/pabeni/mptcp/tree/fwestpha-mptcp_export_rebase_19_fallback_cleanup_2
> 
> It build on each commit. Not tested yet, sent-out early for review's
> sake.
> 
> Note that some patches after MP_JOIN required some relevant rework,
> see:
> 
> https://github.com/pabeni/mptcp/commit/0a84b5b3547cba0bea6186402d0eed4981b6174a

This one had a couple of gross mistake, just cooked a less bugged
version. The new tree is here:

https://github.com/pabeni/mptcp/commits/fwestpha-mptcp_export_rebase_19_fallback_cleanup_3

The only updated patch is the last one:

https://github.com/pabeni/mptcp/commit/831e5c21e2751c2598b3e25516544d59d65c72d3

@Florian: I tested the whole tree vs self-tests - so MP_JOIN is not
really tested. Are you ok to merge as-is and then look after eventual
bug introduced by me here?

Thanks,

Paolo
Florian Westphal Dec. 11, 2019, 7:22 p.m. UTC | #7
Paolo Abeni <pabeni@redhat.com> wrote:
> @Florian: I tested the whole tree vs self-tests - so MP_JOIN is not
> really tested. Are you ok to merge as-is and then look after eventual
> bug introduced by me here?

Maybe, can you explain __mptcp_flush_join_list() usage?

AFAICS subflow on join list can be moved to conn_list at 'random'
point in time, even if its not fully established yet?

(there are calls to __mptcp_flush_join_list in recv/sendmsg path...)

Given MP_JOIN isn't in the 1st/2nd batch i would be ok with merging
now though.
Paolo Abeni Dec. 11, 2019, 9:36 p.m. UTC | #8
On Wed, 2019-12-11 at 20:22 +0100, Florian Westphal wrote:
> Paolo Abeni <pabeni@redhat.com> wrote:
> > @Florian: I tested the whole tree vs self-tests - so MP_JOIN is not
> > really tested. Are you ok to merge as-is and then look after eventual
> > bug introduced by me here?
> 
> Maybe, can you explain __mptcp_flush_join_list() usage?

mp_join subflow are appended to join_list by mptcp_finish_join(), for
passive connections, at when the tcp subflow become established.

We can't add directly to conn_list, or we will hit the ABBA locking
schema the fallback cleanup tried to avoid.

The code tries to splice join list into conn_list every just before
each conn_list traversing, so that that later lookup/access have to
cope only with conn_list.

> AFAICS subflow on join list can be moved to conn_list at 'random'
> point in time, even if its not fully established yet?

passive subflow are moved into join_list when they are 'established' at
the tcp level, but the MP_JOIN handshake is not completed (fourth ack
still not reached). Should be ok for recvmsg() - as we check the
'data_ready' field - and close(), possibly we need some additional care
for mptcp_subflow_get_send().

Cheers,

Paolo
Mat Martineau Dec. 12, 2019, 1:06 a.m. UTC | #9
On Wed, 11 Dec 2019, Paolo Abeni wrote:

> On Wed, 2019-12-11 at 17:32 +0100, Paolo Abeni wrote:
>> On Tue, 2019-12-10 at 22:27 +0100, Matthieu Baerts wrote:
>>> Hi,
>>>
>>> On 10/12/2019 19:24, Matthieu Baerts wrote:
>>>> Hi Florian, Paolo,
>>>>
>>>> On 09/12/2019 13:33, Paolo Abeni wrote:
>>>>> Hi,
>>>>>
>>>>> On Sun, 2019-12-08 at 19:24 +0100, Florian Westphal wrote:
>>>>>> It includes 48 patches (down from 66), I will refrain from patch-bombing
>>>>>> the list with it.  There is one patch as followup here because its
>>>>>> "almost new" in the sense that its an amalgamation of different patches
>>>>>> that I placed in that commit because the others seemed quite large
>>>>>> already.
>>>>>>
>>>>>> The __tcp_poll patch is gone.  Later patches, in particular the
>>>>>> changes to
>>>>>> only store 'tcp_sock' rather than 'struct socket' on the conn_list, have
>>>>>> been brought forward and squashed into the relevant commits, would thus
>>>>>> be implicitly part of basic mptcp submission.  I've amemded commit
>>>>>> messages where appropriate.
>>>>>
>>>>> Thank you Florian for the great work!
>>>>>
>>>>> I went through the patches up to v1 support, and LGTM.
>>>>
>>>> It is also good to me! Thank you very much for the very big work! The
>>>> kind of work where even you you have to sit down to do it! ...  O:-D
>>>>
>>>> I just re-created the tree using your branch. I am also checking now
>>>> that all commits are OK, with and without MPTCP. Also the usual
>>>> compilation tests with i386 and without IPv6.
>>>
>>> FYI, all tests are green!
>>> I didn't re-enable the automatic sync with net-next. I will do that
>>> after having applied recent Paolo's changes.
>>
>> The rebased and amost squashed thing is available here:
>>
>> https://github.com/pabeni/mptcp/tree/fwestpha-mptcp_export_rebase_19_fallback_cleanup_2
>>
>> It build on each commit. Not tested yet, sent-out early for review's
>> sake.
>
> Just tested successfully on top of V1 patches and on branch head.
>
> For the record, I observed once on the unsquashed patches, the
> following warn (with decoded stack trace):
>
> [ 1093.576154] ------------[ cut here ]------------
> [ 1093.577188] refcount_t: addition on 0; use-after-free.
> [ 1093.577966] WARNING: CPU: 2 PID: 29790 at lib/refcount.c:25 refcount_warn_saturate (/home/pabeni/devel/linux-mptcp/lib/refcount.c:25 (discriminator 3))
> [ 1093.579153] Modules linked in: sch_netem veth intel_rapl_msr rfkill intel_rapl_common kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 ip_tables xfs libcrc32c crc32c_intel serio_raw virtio_net net_failover virtio_blk ata_generic failover virtio_console ata_piix libata
> [ 1093.584101] CPU: 2 PID: 29790 Comm: mptcp_connect Not tainted 5.4.0.mptcp_bfaef67f2a47+ #464
> [ 1093.585723] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
> [ 1093.587415] RIP: 0010:refcount_warn_saturate (/home/pabeni/devel/linux-mptcp/lib/refcount.c:25 (discriminator 3))
> [ 1093.588409] Code: b9 2d 41 02 01 e8 3b 90 41 ff 0f 0b eb 9d 80 3d a8 2d 41 02 00 75 94 48 c7 c7 c0 09 e8 87 c6 05 98 2d 41 02 01 e8 1b 90 41 ff <0f> 0b e9 7a ff ff ff 80 3d 82 2d 41 02 00 0f 85 6d ff ff ff 48 c7
> All code
> ========
>   0:   b9 2d 41 02 01          mov    $0x102412d,%ecx
>   5:   e8 3b 90 41 ff          callq  0xffffffffff419045
>   a:   0f 0b                   ud2
>   c:   eb 9d                   jmp    0xffffffffffffffab
>   e:   80 3d a8 2d 41 02 00    cmpb   $0x0,0x2412da8(%rip)        # 0x2412dbd
>  15:   75 94                   jne    0xffffffffffffffab
>  17:   48 c7 c7 c0 09 e8 87    mov    $0xffffffff87e809c0,%rdi
>  1e:   c6 05 98 2d 41 02 01    movb   $0x1,0x2412d98(%rip)        # 0x2412dbd
>  25:   e8 1b 90 41 ff          callq  0xffffffffff419045
>  2a:*  0f 0b                   ud2             <-- trapping instruction
>  2c:   e9 7a ff ff ff          jmpq   0xffffffffffffffab
>  31:   80 3d 82 2d 41 02 00    cmpb   $0x0,0x2412d82(%rip)        # 0x2412dba
>  38:   0f 85 6d ff ff ff       jne    0xffffffffffffffab
>  3e:   48                      rex.W
>  3f:   c7                      .byte 0xc7
>
> Code starting with the faulting instruction
> ===========================================
>   0:   0f 0b                   ud2
>   2:   e9 7a ff ff ff          jmpq   0xffffffffffffff81
>   7:   80 3d 82 2d 41 02 00    cmpb   $0x0,0x2412d82(%rip)        # 0x2412d90
>   e:   0f 85 6d ff ff ff       jne    0xffffffffffffff81
>  14:   48                      rex.W
>  15:   c7                      .byte 0xc7
> [ 1093.592029] RSP: 0018:ffff8881061f7cd0 EFLAGS: 00010282
> [ 1093.593155] RAX: 0000000000000000 RBX: ffff888035b54040 RCX: 0000000000000000
> [ 1093.594476] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffed1020c3ef8c
> [ 1093.595769] RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed102187c4d1
> [ 1093.597063] R10: ffffed102187c4d0 R11: ffff88810c3e2687 R12: ffff888035b53fe8
> [ 1093.598354] R13: ffff888035b54040 R14: ffff888035b53fc0 R15: ffff888035b53fc0
> [ 1093.599654] FS:  00007f9b8b9ad740(0000) GS:ffff88810c200000(0000) knlGS:0000000000000000
> [ 1093.601104] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1093.602174] CR2: 00007fffdcdc0000 CR3: 0000000103e04001 CR4: 0000000000160ee0
> [ 1093.603472] Call Trace:
> [ 1093.603997] tcp_close (/home/pabeni/devel/linux-mptcp/./include/linux/refcount.h:191 /home/pabeni/devel/linux-mptcp/./include/linux/refcount.h:228 /home/pabeni/devel/linux-mptcp/./include/net/sock.h:647 /home/pabeni/devel/linux-mptcp/net/ipv4/tcp.c:2434)

There's an extra sock_put() (or a missing sock_hold()) somewhere before 
the sock_hold() in tcp_close(). Seems like the only candidate for that is 
an unmatched sock_put() in subflow_ulp_release() which is releasing the 
reference count for mptcp_subflow_context->conn.

It looks like that mptcp_subflow_context->conn reference count is intended 
to be matched by a sock_hold() when the conn value is assigned (see 
mptcp_subflow_create_socket()). There are other assignments to 
mptcp_subflow_context->conn in subflow_syn_recv_sock() and mptcp_accept() 
which are not calling sock_hold() - so maybe that's the problem. If that's 
it, I'm not sure why this warning happens rarely instead of all the time.

I'll try adding some debug statements to confirm this theory.


Mat


> [ 1093.605757] mptcp_close (/home/pabeni/devel/linux-mptcp/net/mptcp/protocol.c:521 /home/pabeni/devel/linux-mptcp/net/mptcp/protocol.c:579)
> [ 1093.606604] inet_release (/home/pabeni/devel/linux-mptcp/net/ipv4/af_inet.c:428)
> [ 1093.607427] __sock_release (/home/pabeni/devel/linux-mptcp/net/socket.c:593)
> [ 1093.608230] sock_close (/home/pabeni/devel/linux-mptcp/net/socket.c:1272)
> [ 1093.608792] __fput (/home/pabeni/devel/linux-mptcp/fs/file_table.c:280)
> [ 1093.609430] task_work_run (/home/pabeni/devel/linux-mptcp/kernel/task_work.c:115 (discriminator 1))
> [ 1093.610100] exit_to_usermode_loop (/home/pabeni/devel/linux-mptcp/./include/linux/tracehook.h:188 /home/pabeni/devel/linux-mptcp/arch/x86/entry/common.c:164)
> [ 1093.610858] do_syscall_64 (/home/pabeni/devel/linux-mptcp/arch/x86/entry/common.c:195 /home/pabeni/devel/linux-mptcp/arch/x86/entry/common.c:278 /home/pabeni/devel/linux-mptcp/arch/x86/entry/common.c:304)
> [ 1093.611579] entry_SYSCALL_64_after_hwframe (/home/pabeni/devel/linux-mptcp/arch/x86/entry/entry_64.S:177)
> [ 1093.612617] RIP: 0033:0x7f9b8b4b77a0
> [ 1093.613289] Code: 00 64 c7 00 0d 00 00 00 b8 ff ff ff ff eb 90 b8 ff ff ff ff eb 89 0f 1f 40 00 83 3d fd c7 2d 00 00 75 10 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 9e c5 01 00 48 89 04 24
> All code
> ========
>   0:   00 64 c7 00             add    %ah,0x0(%rdi,%rax,8)
>   4:   0d 00 00 00 b8          or     $0xb8000000,%eax
>   9:   ff                      (bad)
>   a:   ff                      (bad)
>   b:   ff                      (bad)
>   c:   ff                      (bad)
>   d:   eb 90                   jmp    0xffffffffffffff9f
>   f:   b8 ff ff ff ff          mov    $0xffffffff,%eax
>  14:   eb 89                   jmp    0xffffffffffffff9f
>  16:   0f 1f 40 00             nopl   0x0(%rax)
>  1a:   83 3d fd c7 2d 00 00    cmpl   $0x0,0x2dc7fd(%rip)        # 0x2dc81e
>  21:   75 10                   jne    0x33
>  23:   b8 03 00 00 00          mov    $0x3,%eax
>  28:   0f 05                   syscall
>  2a:*  48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax         <-- trapping instruction
>  30:   73 31                   jae    0x63
>  32:   c3                      retq
> ---
>
> But I could not reproduce it (so far) on the squashed branch.
>
> I'll suggest to merge the patches as-is and ev. look after bug after
> that.
>
> Cheers,
>
> Paolo
> _______________________________________________
> mptcp mailing list -- mptcp@lists.01.org
> To unsubscribe send an email to mptcp-leave@lists.01.org
>

--
Mat Martineau
Intel
Paolo Abeni Dec. 12, 2019, 12:03 p.m. UTC | #10
On Wed, 2019-12-11 at 17:06 -0800, Mat Martineau wrote:
> On Wed, 11 Dec 2019, Paolo Abeni wrote:
> 
> > On Wed, 2019-12-11 at 17:32 +0100, Paolo Abeni wrote:
> > > On Tue, 2019-12-10 at 22:27 +0100, Matthieu Baerts wrote:
> > > > Hi,
> > > > 
> > > > On 10/12/2019 19:24, Matthieu Baerts wrote:
> > > > > Hi Florian, Paolo,
> > > > > 
> > > > > On 09/12/2019 13:33, Paolo Abeni wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > On Sun, 2019-12-08 at 19:24 +0100, Florian Westphal wrote:
> > > > > > > It includes 48 patches (down from 66), I will refrain from patch-bombing
> > > > > > > the list with it.  There is one patch as followup here because its
> > > > > > > "almost new" in the sense that its an amalgamation of different patches
> > > > > > > that I placed in that commit because the others seemed quite large
> > > > > > > already.
> > > > > > > 
> > > > > > > The __tcp_poll patch is gone.  Later patches, in particular the
> > > > > > > changes to
> > > > > > > only store 'tcp_sock' rather than 'struct socket' on the conn_list, have
> > > > > > > been brought forward and squashed into the relevant commits, would thus
> > > > > > > be implicitly part of basic mptcp submission.  I've amemded commit
> > > > > > > messages where appropriate.
> > > > > > 
> > > > > > Thank you Florian for the great work!
> > > > > > 
> > > > > > I went through the patches up to v1 support, and LGTM.
> > > > > 
> > > > > It is also good to me! Thank you very much for the very big work! The
> > > > > kind of work where even you you have to sit down to do it! ...  O:-D
> > > > > 
> > > > > I just re-created the tree using your branch. I am also checking now
> > > > > that all commits are OK, with and without MPTCP. Also the usual
> > > > > compilation tests with i386 and without IPv6.
> > > > 
> > > > FYI, all tests are green!
> > > > I didn't re-enable the automatic sync with net-next. I will do that
> > > > after having applied recent Paolo's changes.
> > > 
> > > The rebased and amost squashed thing is available here:
> > > 
> > > https://github.com/pabeni/mptcp/tree/fwestpha-mptcp_export_rebase_19_fallback_cleanup_2
> > > 
> > > It build on each commit. Not tested yet, sent-out early for review's
> > > sake.
> > 
> > Just tested successfully on top of V1 patches and on branch head.
> > 
> > For the record, I observed once on the unsquashed patches, the
> > following warn (with decoded stack trace):
> > 
> > [ 1093.576154] ------------[ cut here ]------------
> > [ 1093.577188] refcount_t: addition on 0; use-after-free.
> > [ 1093.577966] WARNING: CPU: 2 PID: 29790 at lib/refcount.c:25 refcount_warn_saturate (/home/pabeni/devel/linux-mptcp/lib/refcount.c:25 (discriminator 3))
> > [ 1093.579153] Modules linked in: sch_netem veth intel_rapl_msr rfkill intel_rapl_common kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 ip_tables xfs libcrc32c crc32c_intel serio_raw virtio_net net_failover virtio_blk ata_generic failover virtio_console ata_piix libata
> > [ 1093.584101] CPU: 2 PID: 29790 Comm: mptcp_connect Not tainted 5.4.0.mptcp_bfaef67f2a47+ #464
> > [ 1093.585723] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
> > [ 1093.587415] RIP: 0010:refcount_warn_saturate (/home/pabeni/devel/linux-mptcp/lib/refcount.c:25 (discriminator 3))
> > [ 1093.588409] Code: b9 2d 41 02 01 e8 3b 90 41 ff 0f 0b eb 9d 80 3d a8 2d 41 02 00 75 94 48 c7 c7 c0 09 e8 87 c6 05 98 2d 41 02 01 e8 1b 90 41 ff <0f> 0b e9 7a ff ff ff 80 3d 82 2d 41 02 00 0f 85 6d ff ff ff 48 c7
> > All code
> > ========
> >   0:   b9 2d 41 02 01          mov    $0x102412d,%ecx
> >   5:   e8 3b 90 41 ff          callq  0xffffffffff419045
> >   a:   0f 0b                   ud2
> >   c:   eb 9d                   jmp    0xffffffffffffffab
> >   e:   80 3d a8 2d 41 02 00    cmpb   $0x0,0x2412da8(%rip)        # 0x2412dbd
> >  15:   75 94                   jne    0xffffffffffffffab
> >  17:   48 c7 c7 c0 09 e8 87    mov    $0xffffffff87e809c0,%rdi
> >  1e:   c6 05 98 2d 41 02 01    movb   $0x1,0x2412d98(%rip)        # 0x2412dbd
> >  25:   e8 1b 90 41 ff          callq  0xffffffffff419045
> >  2a:*  0f 0b                   ud2             <-- trapping instruction
> >  2c:   e9 7a ff ff ff          jmpq   0xffffffffffffffab
> >  31:   80 3d 82 2d 41 02 00    cmpb   $0x0,0x2412d82(%rip)        # 0x2412dba
> >  38:   0f 85 6d ff ff ff       jne    0xffffffffffffffab
> >  3e:   48                      rex.W
> >  3f:   c7                      .byte 0xc7
> > 
> > Code starting with the faulting instruction
> > ===========================================
> >   0:   0f 0b                   ud2
> >   2:   e9 7a ff ff ff          jmpq   0xffffffffffffff81
> >   7:   80 3d 82 2d 41 02 00    cmpb   $0x0,0x2412d82(%rip)        # 0x2412d90
> >   e:   0f 85 6d ff ff ff       jne    0xffffffffffffff81
> >  14:   48                      rex.W
> >  15:   c7                      .byte 0xc7
> > [ 1093.592029] RSP: 0018:ffff8881061f7cd0 EFLAGS: 00010282
> > [ 1093.593155] RAX: 0000000000000000 RBX: ffff888035b54040 RCX: 0000000000000000
> > [ 1093.594476] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffed1020c3ef8c
> > [ 1093.595769] RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed102187c4d1
> > [ 1093.597063] R10: ffffed102187c4d0 R11: ffff88810c3e2687 R12: ffff888035b53fe8
> > [ 1093.598354] R13: ffff888035b54040 R14: ffff888035b53fc0 R15: ffff888035b53fc0
> > [ 1093.599654] FS:  00007f9b8b9ad740(0000) GS:ffff88810c200000(0000) knlGS:0000000000000000
> > [ 1093.601104] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1093.602174] CR2: 00007fffdcdc0000 CR3: 0000000103e04001 CR4: 0000000000160ee0
> > [ 1093.603472] Call Trace:
> > [ 1093.603997] tcp_close (/home/pabeni/devel/linux-mptcp/./include/linux/refcount.h:191 /home/pabeni/devel/linux-mptcp/./include/linux/refcount.h:228 /home/pabeni/devel/linux-mptcp/./include/net/sock.h:647 /home/pabeni/devel/linux-mptcp/net/ipv4/tcp.c:2434)
> 
> There's an extra sock_put() (or a missing sock_hold()) somewhere before 
> the sock_hold() in tcp_close(). Seems like the only candidate for that is 
> an unmatched sock_put() in subflow_ulp_release() which is releasing the 
> reference count for mptcp_subflow_context->conn.
> 
> It looks like that mptcp_subflow_context->conn reference count is intended 
> to be matched by a sock_hold() when the conn value is assigned (see 
> mptcp_subflow_create_socket()). There are other assignments to 
> mptcp_subflow_context->conn in subflow_syn_recv_sock() and mptcp_accept() 
> which are not calling sock_hold() - so maybe that's the problem. If that's 
> it, I'm not sure why this warning happens rarely instead of all the time.
> 
> I'll try adding some debug statements to confirm this theory.

[...]

> [ 1093.605757] mptcp_close (/home/pabeni/devel/linux-mptcp/net/mptcp/protocol.c:521 /home/pabeni/devel/linux-mptcp/net/mptcp/protocol.c:579)

Thank you for the feedback!

I think the refcount mismatch happens on a subflow socket, in the
relevant checked-out tree:

/home/pabeni/devel/linux-mptcp/net/mptcp/protocol.c:521 is:

static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
                              struct mptcp_subflow_context *subflow,
                              long timeout)
{
        struct socket *sock = READ_ONCE(ssk->sk_socket);

        list_del(&subflow->node);

        if (sock && sock != sk->sk_socket) {
                /* outgoing subflow */
                sock_release(sock);
        } else {
                /* incoming subflow */
                sock_orphan(ssk);
                tcp_close(ssk, timeout);  // XXXXXXX <- this line
        }
}

and /home/pabeni/devel/linux-mptcp/net/mptcp/protocol.c:579 is:

static void mptcp_close(struct sock *sk, long timeout)
{
        struct mptcp_subflow_context *subflow, *tmp;
        struct mptcp_sock *msk = mptcp_sk(sk);

        mptcp_token_destroy(msk->token);
        inet_sk_state_store(sk, TCP_CLOSE);

        lock_sock(sk);

        list_for_each_entry_safe(subflow, tmp, &msk->conn_list, node) {
                struct sock *ssk = mptcp_subflow_tcp_sock(subflow);

                __mptcp_close_ssk(sk, ssk, subflow, timeout);  
// ^^^^^^^^^^^^^^^^^^ this line XXXXXXXXXXXXXXXXX ^^^^^^^^^^^^
        }
	
Given the very low frequency of the event this, the refcount mismatch
is very likely due to some race conditions.

I think the following one could be a possible (even likely) root cause.
In __mptcp_close_ssk() - ssk lock is not held:

	sock_orphan(ssk); // ssk->flags |= SOCK_DEAD
	
	// ssk is in LAST_ACK status, some other CPU
	// receives the last ack and call tcp_done()
	// which in turns call inet_csk_destroy_sock(), as ssk
	// has SOCK_DEAD set, ssk is freed

	tcp_close(ssk, timeout);
	// whooops :(

I'll try to validate the above with the following dumb/hacky fix, if it
works, we can think something more elegant meanwhile/later:

	sock_hold(ssk); // so that eventual inet_csk_destroy_sock() 
			// in between orphan and close will not free
			// ssk
	sock_orphan(ssk); 
	tcp_close(ssk, timeout);
	sock_put(ssk);

Cheers,

Paolo
Davide Caratti Dec. 12, 2019, 3:27 p.m. UTC | #11
On Wed, 2019-12-11 at 20:00 +0100, Paolo Abeni wrote:
> On Wed, 2019-12-11 at 17:32 +0100, Paolo Abeni wrote:
> > On Tue, 2019-12-10 at 22:27 +0100, Matthieu Baerts wrote:
> > > 
> This one had a couple of gross mistake, just cooked a less bugged
> version. The new tree is here:
> 
> https://github.com/pabeni/mptcp/commits/fwestpha-mptcp_export_rebase_19_fallback_cleanup_3
> 
> The only updated patch is the last one:

FYI, Paolo did a newer version at

https://github.com/pabeni/mptcp/commits/fwestpha-mptcp_export_rebase_19_fallback_cleanup_4

that has squashed the fixes. Over that tree, I fixed most of checkpatch issues
on the first 26 patches (up to kselftests), rebuilt, retested and pushed at

https://github.com/dcaratti/mptcp/tree/fwestpha-mptcp_export_rebase_19_fallback_cleanup_5

Please find below the list of patches where the code actually changed:

0003-tcp-Define-IPPROTO_MPTCP.patch  
0013-mptcp-Handle-MPTCP-TCP-options.patch
0017-mptcp-Add-key-generation-and-token-tree.patch
0020-mptcp-Write-MPTCP-DSS-headers-to-outgoing-data-packe.patch
0021-mptcp-Implement-MPTCP-receive-path.patch
0022-mptcp-add-subflow-write-space-signalling-and-mptcp_p.patch
0023-mptcp-recvmsg-can-drain-data-from-multiple-subflows.patch
0026-mptcp-add-basic-kselftest-for-mptcp.patch
0027-mptcp-move-from-sha1-v0-to-sha256-v1.patch
0028-mptcp-parse-and-emit-MP_CAPABLE-option-according-to-.patch
0031-mptcp-Add-ADD_ADDR-handling.patch
diff mbox

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e0525a5eb1fb..354a4ce9e5e9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -396,7 +396,6 @@  void tcp_init_sock(struct sock *sk);
 void tcp_init_transfer(struct sock *sk, int bpf_op);
 __poll_t tcp_poll(struct file *file, struct socket *sock,
 		      struct poll_table_struct *wait);
-__poll_t __tcp_poll(struct sock *sk);
 int tcp_getsockopt(struct sock *sk, int level, int optname,
 		   char __user *optval, int __user *optlen);
 int tcp_setsockopt(struct sock *sk, int level, int optname,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1cc7eb288029..9994bb49b8cd 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -490,12 +490,15 @@  static inline bool tcp_stream_is_readable(const struct tcp_sock *tp,
  *	take care of normal races (between the test and the event) and we don't
  *	go look at any of the socket buffers directly.
  */
-__poll_t __tcp_poll(struct sock *sk)
+__poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 {
 	__poll_t mask;
+	struct sock *sk = sock->sk;
 	const struct tcp_sock *tp = tcp_sk(sk);
 	int state;
 
+	sock_poll_wait(file, sock, wait);
+
 	state = inet_sk_state_load(sk);
 	if (state == TCP_LISTEN)
 		return inet_csk_listen_poll(sk);
@@ -587,13 +590,6 @@  __poll_t __tcp_poll(struct sock *sk)
 
 	return mask;
 }
-
-__poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
-{
-	sock_poll_wait(file, sock, wait);
-
-	return __tcp_poll(sock->sk);
-}
 EXPORT_SYMBOL(tcp_poll);
 
 int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 156752e11591..57e092aca615 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -513,7 +513,6 @@  static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	ssk_check_wmem(msk, ssk);
 	release_sock(ssk);
-
 out:
 	release_sock(sk);
 	return ret;
@@ -902,9 +901,9 @@  static int __mptcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&msk->conn_list);
 	INIT_LIST_HEAD(&msk->join_list);
 	INIT_LIST_HEAD(&msk->rtx_queue);
+	__set_bit(MPTCP_SEND_SPACE, &msk->flags);
 
 	INIT_WORK(&msk->rtx_work, mptcp_worker);
-	__set_bit(MPTCP_SEND_SPACE, &msk->flags);
 
 	/* re-use the csk retrans timer for MPTCP-level retrans */
 	timer_setup(&msk->sk.icsk_retransmit_timer, mptcp_retransmit_timer, 0);
@@ -1012,15 +1011,6 @@  static void mptcp_close(struct sock *sk, long timeout)
 	sk_common_release(sk);
 }
 
-static int mptcp_disconnect(struct sock *sk, int flags)
-{
-	lock_sock(sk);
-	__mptcp_clear_xmit(sk);
-	release_sock(sk);
-	mptcp_cancel_work(sk);
-	return tcp_disconnect(sk, flags);
-}
-
 static void mptcp_copy_inaddrs(struct sock *msk, const struct sock *ssk)
 {
 	const struct ipv6_pinfo *ssk6 = inet6_sk(ssk);
@@ -1044,6 +1034,15 @@  static void mptcp_copy_inaddrs(struct sock *msk, const struct sock *ssk)
 #endif
 }
 
+static int mptcp_disconnect(struct sock *sk, int flags)
+{
+	lock_sock(sk);
+	__mptcp_clear_xmit(sk);
+	release_sock(sk);
+	mptcp_cancel_work(sk);
+	return tcp_disconnect(sk, flags);
+}
+
 static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
 				 bool kern)
 {
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 8251408610f0..1563d52eaa8d 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -79,8 +79,8 @@ 
 
 /* MPTCP socket flags */
 #define MPTCP_DATA_READY	BIT(0)
-#define MPTCP_WORK_RTX		BIT(1)
-#define MPTCP_SEND_SPACE	BIT(2)
+#define MPTCP_SEND_SPACE	BIT(1)
+#define MPTCP_WORK_RTX		BIT(2)
 #define MPTCP_WORK_EOF		BIT(3)
 
 static inline __be32 mptcp_option(u8 subopt, u8 len, u8 nib, u8 field)