[net] net/tls: Fix driver request resync

Message ID	20200520151408.8080-1-tariqt@mellanox.com
State	Changes Requested
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> From: Tariq Toukan <tariqt@mellanox.com> To: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org, Jakub Kicinski <kuba@kernel.org>, Boris Pismenny <borisp@mellanox.com>, Tariq Toukan <tariqt@mellanox.com>, Saeed Mahameed <saeedm@mellanox.com> Subject: [PATCH net] net/tls: Fix driver request resync Date: Wed, 20 May 2020 18:14:08 +0300 Message-Id: <20200520151408.8080-1-tariqt@mellanox.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: netdev-owner@vger.kernel.org Precedence: bulk
Series	[net] net/tls: Fix driver request resync \| expand [net] net/tls: Fix driver request resync

Message ID

20200520151408.8080-1-tariqt@mellanox.com

State

Changes Requested

Delegated to:

David Miller

Headers

From: Tariq Toukan <tariqt@mellanox.com>
To: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org, Jakub Kicinski <kuba@kernel.org>,
        Boris Pismenny <borisp@mellanox.com>,
        Tariq Toukan <tariqt@mellanox.com>,
        Saeed Mahameed <saeedm@mellanox.com>
Subject: [PATCH net] net/tls: Fix driver request resync
Date: Wed, 20 May 2020 18:14:08 +0300
Message-Id: <20200520151408.8080-1-tariqt@mellanox.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: netdev-owner@vger.kernel.org
Precedence: bulk

Series

[net] net/tls: Fix driver request resync | expand

Commit Message

Tariq Toukan May 20, 2020, 3:14 p.m. UTC

From: Boris Pismenny <borisp@mellanox.com>

In driver request resync, the hardware requests a resynchronization
request at some TCP sequence number. If that TCP sequence number does
not point to a TLS record header, then the resync attempt has failed.

Failed resync should reset the resync request to avoid spurious resyncs
after the TCP sequence number has wrapped around.

Fix this by resetting the resync request when the TLS record header
sequence number is not before the requested sequence number.
As a result, drivers may be called with a sequence number that is not
equal to the requested sequence number.

Fixes: f953d33ba122 ("net/tls: add kernel-driven TLS RX resync")
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
---
 net/tls/tls_device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jakub Kicinski May 20, 2020, 8:34 p.m. UTC | #1

On Wed, 20 May 2020 18:14:08 +0300 Tariq Toukan wrote:
> From: Boris Pismenny <borisp@mellanox.com>
> 
> In driver request resync, the hardware requests a resynchronization
> request at some TCP sequence number. If that TCP sequence number does
> not point to a TLS record header, then the resync attempt has failed.
> 
> Failed resync should reset the resync request to avoid spurious resyncs
> after the TCP sequence number has wrapped around.
> 
> Fix this by resetting the resync request when the TLS record header
> sequence number is not before the requested sequence number.
> As a result, drivers may be called with a sequence number that is not
> equal to the requested sequence number.
> 
> Fixes: f953d33ba122 ("net/tls: add kernel-driven TLS RX resync")
> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  net/tls/tls_device.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
> index a562ebaaa33c..cbb13001b4a9 100644
> --- a/net/tls/tls_device.c
> +++ b/net/tls/tls_device.c
> @@ -714,7 +714,7 @@ void tls_device_rx_resync_new_rec(struct sock *sk, u32 rcd_len, u32 seq)
>  		seq += TLS_HEADER_SIZE - 1;
>  		is_req_pending = resync_req;
>  
> -		if (likely(!is_req_pending) || req_seq != seq ||
> +		if (likely(!is_req_pending) || before(seq, req_seq) ||

So the kernel is going to send the sync message to the device with at
sequence number the device never asked about? 

Kernel usually can't guarantee that the notification will happen,
(memory allocation errors, etc.) so the device needs to do the
restarting itself. The notification should not be necessary.

>  		    !atomic64_try_cmpxchg(&rx_ctx->resync_req, &resync_req, 0))
>  			return;
>  		break;

Boris Pismenny May 28, 2020, 6:03 a.m. UTC | #2

On 20/05/2020 23:34, Jakub Kicinski wrote:
> On Wed, 20 May 2020 18:14:08 +0300 Tariq Toukan wrote:
>> From: Boris Pismenny <borisp@mellanox.com>
>>
>> In driver request resync, the hardware requests a resynchronization
>> request at some TCP sequence number. If that TCP sequence number does
>> not point to a TLS record header, then the resync attempt has failed.
>>
>> Failed resync should reset the resync request to avoid spurious resyncs
>> after the TCP sequence number has wrapped around.
>>
>> Fix this by resetting the resync request when the TLS record header
>> sequence number is not before the requested sequence number.
>> As a result, drivers may be called with a sequence number that is not
>> equal to the requested sequence number.
>>
>> Fixes: f953d33ba122 ("net/tls: add kernel-driven TLS RX resync")
>> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
>> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
>> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
>> ---
>>  net/tls/tls_device.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
>> index a562ebaaa33c..cbb13001b4a9 100644
>> --- a/net/tls/tls_device.c
>> +++ b/net/tls/tls_device.c
>> @@ -714,7 +714,7 @@ void tls_device_rx_resync_new_rec(struct sock *sk, u32 rcd_len, u32 seq)
>>  		seq += TLS_HEADER_SIZE - 1;
>>  		is_req_pending = resync_req;
>>  
>> -		if (likely(!is_req_pending) || req_seq != seq ||
>> +		if (likely(!is_req_pending) || before(seq, req_seq) ||
> So the kernel is going to send the sync message to the device with at
> sequence number the device never asked about? 

Yes, although I would phrase it differently: the kernel would indicate to the driver,
that the resync request is wrong, and that it can go back to searching for a header.
If there are any drivers that need an extra check, then we can add it in the driver itself.

>
> Kernel usually can't guarantee that the notification will happen,
> (memory allocation errors, etc.) so the device needs to do the
> restarting itself. The notification should not be necessary.
>

Usually, all is best effort, but in principle, reliability should be guaranteed by higher layers to simplify the design.
On the one hand, resync depends on packet arrival, which may take a while, and implementing different heuristics in each driver to timeout is complex.
On the other hand, assuming the user reads the record data eventually, ktls will be able to deliver the resync request, so implementing this in the tls layer is simple.

In this case, I see no reason for the tls layer to fail --- did you have a specific flow in mind?
AFAICT, there are no memory allocation/error flows that will prevent the driver to receive a resync without an error on the socket (bad tls header).
If tls received the header, then the driver will receive the resync call, and it will take responsibility for reliably delivering it to HW.

Boris

Jakub Kicinski May 28, 2020, 5:29 p.m. UTC | #3

On Thu, 28 May 2020 09:03:07 +0300 Boris Pismenny wrote:
> On 20/05/2020 23:34, Jakub Kicinski wrote:
> > On Wed, 20 May 2020 18:14:08 +0300 Tariq Toukan wrote:  
> >> From: Boris Pismenny <borisp@mellanox.com>
> >>
> >> In driver request resync, the hardware requests a resynchronization
> >> request at some TCP sequence number. If that TCP sequence number does
> >> not point to a TLS record header, then the resync attempt has failed.
> >>
> >> Failed resync should reset the resync request to avoid spurious resyncs
> >> after the TCP sequence number has wrapped around.
> >>
> >> Fix this by resetting the resync request when the TLS record header
> >> sequence number is not before the requested sequence number.
> >> As a result, drivers may be called with a sequence number that is not
> >> equal to the requested sequence number.
> >>
> >> Fixes: f953d33ba122 ("net/tls: add kernel-driven TLS RX resync")
> >> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> >> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
> >> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
> >> ---
> >>  net/tls/tls_device.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
> >> index a562ebaaa33c..cbb13001b4a9 100644
> >> --- a/net/tls/tls_device.c
> >> +++ b/net/tls/tls_device.c
> >> @@ -714,7 +714,7 @@ void tls_device_rx_resync_new_rec(struct sock *sk, u32 rcd_len, u32 seq)
> >>  		seq += TLS_HEADER_SIZE - 1;
> >>  		is_req_pending = resync_req;
> >>  
> >> -		if (likely(!is_req_pending) || req_seq != seq ||
> >> +		if (likely(!is_req_pending) || before(seq, req_seq) ||  
> > So the kernel is going to send the sync message to the device with at
> > sequence number the device never asked about?   
> 
> Yes, although I would phrase it differently: the kernel would indicate to the driver,
> that the resync request is wrong, and that it can go back to searching for a header.
> If there are any drivers that need an extra check, then we can add it in the driver itself.

I'd rather make the API clear and use a different op to indicate this is
a reset rather than a valid sync response. sync callback already has
the enum for sync type.

> > Kernel usually can't guarantee that the notification will happen,
> > (memory allocation errors, etc.) so the device needs to do the
> > restarting itself. The notification should not be necessary.
> >  
> 
> Usually, all is best effort, but in principle, reliability should be guaranteed by higher layers to simplify the design.

Since we're talking high level design perspectives here - IMO when you
talk to FW on a device - it's a distributed system. The days when you
could say driver on the host is higher layer ended when people started
putting fat firmwares on the NICs. So no, restart has to be handled by
the system making the request. In this case the NIC.

> On the one hand, resync depends on packet arrival, which may take a while, and implementing different heuristics in each driver to timeout is complex.
> On the other hand, assuming the user reads the record data eventually, ktls will be able to deliver the resync request, so implementing this in the tls layer is simple.

We definitely not want any driver logic here - the resync restart logic
has to be implemented on the device.

> In this case, I see no reason for the tls layer to fail --- did you have a specific flow in mind?
> AFAICT, there are no memory allocation/error flows that will prevent the driver to receive a resync without an error on the socket (bad tls header).
> If tls received the header, then the driver will receive the resync call, and it will take responsibility for reliably delivering it to HW.

So you're saying the request path and the response path for resync are
both 100% lossless both on the NIC and the host? There is no scenario
in which queue overflows, PCIe gets congested, etc.?

Boris Pismenny May 31, 2020, 11:38 a.m. UTC | #4

>>
>> Yes, although I would phrase it differently: the kernel would indicate to the driver,
>> that the resync request is wrong, and that it can go back to searching for a header.
>> If there are any drivers that need an extra check, then we can add it in the driver itself.
> 
> I'd rather make the API clear and use a different op to indicate this is
> a reset rather than a valid sync response. sync callback already has
> the enum for sync type.
> 

Sure, so we will add another flag to `tls_offload_rx_resync_request` and respin this patch with driver changes which use it.

>>> Kernel usually can't guarantee that the notification will happen,
>>> (memory allocation errors, etc.) so the device needs to do the
>>> restarting itself. The notification should not be necessary.
>>>  
>>
>> Usually, all is best effort, but in principle, reliability should be guaranteed by higher layers to simplify the design.
> 
> Since we're talking high level design perspectives here - IMO when you
> talk to FW on a device - it's a distributed system. The days when you
> could say driver on the host is higher layer ended when people started
> putting fat firmwares on the NICs. So no, restart has to be handled by
> the system making the request. In this case the NIC.
> 

When the driver talks to the device, then the driver is responsible to ensure messages are received reliably - no doubt about that.

>> On the one hand, resync depends on packet arrival, which may take a while, and implementing different heuristics in each driver to timeout is complex.
>> On the other hand, assuming the user reads the record data eventually, ktls will be able to deliver the resync request, so implementing this in the tls layer is simple.
> 
> We definitely not want any driver logic here - the resync restart logic
> has to be implemented on the device.
> 

Restart can be triggered by the device if there is traffic, as it would identify bad record headers. But, otherwise, software should help restart the device.
From an API perspective, I think it makes little sense for the device to send a request without any response from the TLS layer, as it is now; this patches fixes this.

>> In this case, I see no reason for the tls layer to fail --- did you have a specific flow in mind?
>> AFAICT, there are no memory allocation/error flows that will prevent the driver to receive a resync without an error on the socket (bad tls header).
>> If tls received the header, then the driver will receive the resync call, and it will take responsibility for reliably delivering it to HW.
> 
> So you're saying the request path and the response path for resync are
> both 100% lossless both on the NIC and the host? There is no scenario
> in which queue overflows, PCIe gets congested, etc.?
> 

Not at all.
As I've mentioned above "all is best effort", but we try our best to make things work when all is in proper order. In our case, resync is fully implemented in ASIC, so no FW failures. This leaves only hardware and host software failures. Hardware failures will likely cause the entire NIC to restart, closing all queues in the process. Software failures should be handled by each layer respectively. In the TLS layer, all is very simple and I see no reason for failure. Each driver should handle its own reliability trade-offs according to its own goals, which might vary from vendor to vendor and device to device.

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index a562ebaaa33c..cbb13001b4a9 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -714,7 +714,7 @@  void tls_device_rx_resync_new_rec(struct sock *sk, u32 rcd_len, u32 seq)
 		seq += TLS_HEADER_SIZE - 1;
 		is_req_pending = resync_req;
 
-		if (likely(!is_req_pending) || req_seq != seq ||
+		if (likely(!is_req_pending) || before(seq, req_seq) ||
 		    !atomic64_try_cmpxchg(&rx_ctx->resync_req, &resync_req, 0))
 			return;
 		break;

[net] net/tls: Fix driver request resync

Commit Message

Comments

Patch