[sparc] ldc_connect() should not return EINVAL when handshake is in progress.

Message ID 20140801135040.GE19031@oracle.com
State Accepted
Delegated to: David Miller
Headers show

Commit Message

Sowmini Varadhan Aug. 1, 2014, 1:50 p.m. UTC
The LDC handshake could have been asynchronously triggered
after ldc_bind() enables the ldc_rx() receive interrupt-handler
(and thus intercepts incoming control packets)
and before vio_port_up() calls ldc_connect(). If that is the case,
ldc_connect() should return 0 and let the state-machine
progress.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Karl Volz <karl.volz@oracle.com>
---
Resending to the correct list this time.

 arch/sparc/kernel/ldc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

David Miller Aug. 5, 2014, 3:22 a.m. UTC | #1
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Fri, 1 Aug 2014 09:50:40 -0400

> The LDC handshake could have been asynchronously triggered
> after ldc_bind() enables the ldc_rx() receive interrupt-handler
> (and thus intercepts incoming control packets)
> and before vio_port_up() calls ldc_connect(). If that is the case,
> ldc_connect() should return 0 and let the state-machine
> progress.
> 
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> Acked-by: Karl Volz <karl.volz@oracle.com>
> ---
> Resending to the correct list this time.

Applied and queued up for -stable, thanks.

Please, in the future, format your Subject line something like
the following:

[PATCH] $subsystem: $description

Everything after "[xxx] " goes into the commit log header line, and
putting a subsystem prefix allows people scanning the shortlog to
get a good idea what area a change touches.

In this case I added "sparc64: " for the subsystem prefix when
committing your change.

BTW, you'll probably come to notice that not a lot of work has gone
into handling hard disconnect failures properly in LDC and the drivers
that use it.  It's the one area where the LDC framework is very weak.
Unfortunately a lot of cooperation is required by the drivers
themselves, because what to do when a LDC connection goes down is
different in different drivers (networking drivers can drop packets,
but block drivers have to redo the I/O when the LDC link comes back
up, for example).


--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sowmini Varadhan Aug. 5, 2014, 1:52 p.m. UTC | #2
On (08/04/14 20:22), David Miller wrote:
> 
> In this case I added "sparc64: " for the subsystem prefix when
> committing your change.

Thanks, I'll keep that in mind for the future; I wasn't sure of
the taxonomy for ldc.

> BTW, you'll probably come to notice that not a lot of work has gone
> into handling hard disconnect failures properly in LDC and the drivers
> that use it.  It's the one area where the LDC framework is very weak.
> Unfortunately a lot of cooperation is required by the drivers
> themselves, because what to do when a LDC connection goes down is
> different in different drivers (networking drivers can drop packets,
> but block drivers have to redo the I/O when the LDC link comes back
> up, for example).

Noted - I've not run into this personally, but we'll add it to the
list.

FWIW, what I'm trying to sort out at the moment is the potential for 
triggering a soft lockup on a peer (i.e., a DoS vector).
If a node sends a burst of data and then never drains its incoming
ldc channel, the targeted peer will eventually encounter a
tx_has_no_space_for()/EWOULDBLOCK from __vnet_tx_trigger/vnet_send_ack 
(and also __vdc_tx_trigger?)  - all of these are infinite loops, 
and vnet_send_ack is particularly vicious because it's actually a 
paradoxical blocking Tx in the Rx path (due to the vnet protocol design, 
I suppose).  

Carefully asserting netif_stop_queue() for __vnet_tx_trigger() might 
possibly help that case (though it flow-controls all ldc_channels, 
instead of just the blocked one), but recovering gracefully
for the other cases needs thought.

--Sowmini
David Miller Aug. 7, 2014, 6 a.m. UTC | #3
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Tue, 5 Aug 2014 09:52:38 -0400

> FWIW, what I'm trying to sort out at the moment is the potential for 
> triggering a soft lockup on a peer (i.e., a DoS vector).
> If a node sends a burst of data and then never drains its incoming
> ldc channel, the targeted peer will eventually encounter a
> tx_has_no_space_for()/EWOULDBLOCK from __vnet_tx_trigger/vnet_send_ack 
> (and also __vdc_tx_trigger?)  - all of these are infinite loops, 
> and vnet_send_ack is particularly vicious because it's actually a 
> paradoxical blocking Tx in the Rx path (due to the vnet protocol design, 
> I suppose).  
> 
> Carefully asserting netif_stop_queue() for __vnet_tx_trigger() might 
> possibly help that case (though it flow-controls all ldc_channels, 
> instead of just the blocked one), but recovering gracefully
> for the other cases needs thought.

Other virtualization drivers tend to have this issue.  In fact this
was just being discussed over the past week on the netdev list
wrt. the xen-netback driver.

The scheme used there basically is to mark the carrier off when
the peer stops sinking packets we are TX'ing to it, and queue
up a timer.

If the timer fires, we free up all of the packets in the transmit
queue so they don't linger forever and take up resources.

Usually there is some "event" that can let us know that data is
being sinked again, and we can use that to turn the carrier on
and start transmitting again.  Otherwise we'll have to poll
for such things in the timer.
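A minimal userspace model of that scheme (all names here are illustrative stand-ins; the real driver would call netif_carrier_off()/netif_carrier_on() and arm a struct timer_list):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the xen-netback-style recovery scheme described above.
 * All names are illustrative; the real driver would use
 * netif_carrier_{off,on}() and a struct timer_list. */
struct port_model {
	bool carrier_on;
	bool timer_armed;
	int queued_pkts;	/* packets parked in the stalled TX queue */
};

/* Peer stopped sinking packets: mark carrier off and queue a timer. */
static void peer_stalled(struct port_model *p)
{
	p->carrier_on = false;
	p->timer_armed = true;
}

/* Timer fired with no progress: free the queued packets so they do
 * not linger forever and take up resources. */
static void stall_timer_expired(struct port_model *p)
{
	if (!p->carrier_on)
		p->queued_pkts = 0;
	p->timer_armed = false;
}

/* "Event" showing data is being sinked again: carrier back on. */
static void peer_resumed(struct port_model *p)
{
	p->carrier_on = true;
	p->timer_armed = false;
}
```

This only captures the state transitions; in the driver the hard part is finding a reliable "progress" event to hang peer_resumed() off of, which is exactly the polling question raised above.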
Sowmini Varadhan Aug. 7, 2014, 8:17 p.m. UTC | #4
(Maybe this discussion belongs in netdev?)

On (08/06/14 23:00), David Miller wrote:
> The scheme used there basically is to mark the carrier off when
> the peer stops sinking packets we are TX'ing to it, and queue
> up a timer.
> 
> If the timer fires, we free up all of the packets in the transmit
> queue so they don't linger forever and take up resources.
> 
> Usually there is some "event" that can let us know that data is
> being sinked again, and we can use that to turn the carrier on
> and start transmitting again.  Otherwise we'll have to poll
> for such things in the timer.

so in the case of sunvnet, there are two writes that can block:
either !vnet_tx_dring_avail() (i.e., no free descriptor ring entries), or
!tx_has_space_for() (i.e., cannot send the ldc trigger).

vnet_start_xmit() already seems to have most of the smarts to
handle both failures; the one missing piece is that __vnet_tx_trigger()
should not loop infinitely on the ldc_write, but give up with
EWOULDBLOCK after some finite number of tries; at that point
vnet_start_xmit() recovers correctly (afaict).
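A sketch of that missing piece, as self-contained C with a stubbed ldc_write() (MAX_TX_RETRIES and stub_ldc_write() are made up for illustration; only the bounded-retry shape matters):

```c
#include <assert.h>
#include <errno.h>

#define MAX_TX_RETRIES 10	/* arbitrary finite bound, for illustration */

/* Stub for ldc_write()/vio_ldc_send(): keeps returning -EAGAIN until
 * 'busy_for' attempts have been burned, then succeeds. */
static int stub_ldc_write(int *busy_for)
{
	if (*busy_for > 0) {
		(*busy_for)--;
		return -EAGAIN;
	}
	return 0;
}

/* Bounded variant of the trigger: retry a finite number of times,
 * then give up with -EWOULDBLOCK so the caller's error path
 * (dropping the packet, stopping the queue) can run instead of
 * spinning forever. */
static int bounded_tx_trigger(int *busy_for)
{
	int i, err = -EAGAIN;

	for (i = 0; i < MAX_TX_RETRIES && err == -EAGAIN; i++)
		err = stub_ldc_write(busy_for);

	return err == -EAGAIN ? -EWOULDBLOCK : err;
}
```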

But then there's another problem at the tail of vnet_event()-
if we hit the maybe_tx_wakeup() condition, we try to take the
netif_tx_lock() in the recv-interrupt-context and can deadlock
with dev_watchdog(). 

I could move the contents of maybe_tx_wakeup() into 
vnet_tx_timeout(), and that avoids deadlocks. However, I think that now
matches the description of polling-to-recover-from-blocked-tx above. 
And one side-effect of moving from the synchronous maybe_tx_wakeup()
to the reliance on watchdog-timer to unblock things is that 
things like iperf throughput can fluctuate wildly. But given that
this actually *is* a performance weakness, perhaps it's not worth
over-optimizing this case?

Another data-point in favor of not over-optimizing the wrong thing:
in separate experiments, when I try to move the data-handling code in
vnet_event off to a bh-handler, performance improves noticeably- I 
rarely encounter the netif_queue_stopped(), so relying on 
vnet_tx_timeout() for recovery is not an issue with that prototype.

Unless there's a simple way to kick off netif_wake_queue()
from vnet_event(), it seems better to just make vnet_tx_timeout()
do this job.

--Sowmini
David Miller Aug. 8, 2014, 5:26 a.m. UTC | #5
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Thu, 7 Aug 2014 16:17:08 -0400

> vnet_start_xmit() already seems to have most of the smarts to
> handle both failures; the one missing piece is that __vnet_tx_trigger()
> should not loop infinitely on the ldc_write, but give up with
> EWOULDBLOCK after some finite number of tries; at that point
> vnet_start_xmit() recovers correctly (afaict).

Actually, if we look at the vio_ldc_send() code path, it does its
own delay loop waiting for ldc_write() to stop signalling -EAGAIN.
It loops up to 1000 times, doing a udelay(1) each iteration.
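That inner retry loop looks roughly like the following paraphrase (stub_ldc_write() and spin_delay_us() stand in for ldc_write() and udelay(); this is not the exact kernel source):

```c
#include <assert.h>
#include <errno.h>

/* Stand-ins for ldc_write() and udelay().  The stub models a channel
 * that frees space only after 'busy_for' attempts. */
static int stub_ldc_write(int *busy_for)
{
	if (*busy_for > 0) {
		(*busy_for)--;
		return -EAGAIN;
	}
	return 0;
}

static void spin_delay_us(int us) { (void)us; }

/* Paraphrase of vio_ldc_send()'s retry policy as described above:
 * up to 1000 attempts, pausing 1us between them, so a second delay
 * loop in __vnet_tx_trigger() on top of this is redundant. */
static int vio_ldc_send_sketch(int *busy_for)
{
	int err, limit = 1000;

	do {
		err = stub_ldc_write(busy_for);
		if (err != -EAGAIN)
			break;
		spin_delay_us(1);
	} while (--limit);

	return err;
}
```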

So firstly the backoff based delay loop in __vnet_tx_trigger() should
probably be removed entirely.

Now, besides taking the error path in vnet_start_xmit() and dropping
the packet, we need to think about what we should do subsequently when
this condition triggers.

We probably want to take the carrier down via netif_carrier_off(), and
then when we see the head pointer advancing (I guess via
vnet_event()'s logic), the networking driver can netif_carrier_on() in
response.

> But then there's another problem at the tail of vnet_event()-
> if we hit the maybe_tx_wakeup() condition, we try to take the
> netif_tx_lock() in the recv-interrupt-context and can deadlock
> with dev_watchdog(). 
> 
> I could move the contents of maybe_tx_wakeup() into 
> vnet_tx_timeout(), and that avoids deadlocks. However, I think that now
> matches the description of polling-to-recover-from-blocked-tx above. 

Indeed, that's a bad deadlock.

You can just use a tasklet to make it execute in a software interrupt
"right now".

> Another data-point in favor of not over-optimizing the wrong thing:
> in separate experiments, when I try to move the data-handling code in
> vnet_event off to a bh-handler, performance improves noticeably- I 
> rarely encounter the netif_queue_stopped(), so relying on 
> vnet_tx_timeout() for recovery is not an issue with that prototype.
> 
> Unless there's a simple way to kick off netif_wake_queue()
> from vnet_event(), it seems better to just make vnet_tx_timeout()
> do this job.

I think just the time it takes to unwind the HW interrupt handler and
enter the software interrupt code path is enough time to let the other
side of the LDC link make a bunch of progress.

And yes, hitting stop queue conditions less often will improve performance
if only because it means we have less overhead going in and out of stopped
queue state.

Anyways, see my tasklet suggestion above.

Also it's a real pain that one port backing up can head-of-line block
the entire interface.
David L Stevens Aug. 8, 2014, 1:55 p.m. UTC | #6
On 08/08/2014 01:26 AM, David Miller wrote:

> Now, besides taking the error path in vnet_start_xmit() and dropping
> the packet, we need to think about what we should do subsequently when
> this condition triggers.
> 
> We probably want to take the carrier down via netif_carrier_off(), and
> then when we see the head pointer advancing (I guess via
> vnet_event()'s logic), the networking driver can netif_carrier_on() in
> response.

Aren't we talking about one ldc port (out of many) for the device here?
I'm not sure we want to drop carrier if only one port isn't making
progress, unless maybe that's the only switch port.

						+-DLS

David Miller Aug. 8, 2014, 5:33 p.m. UTC | #7
From: David L Stevens <david.stevens@oracle.com>
Date: Fri, 08 Aug 2014 09:55:55 -0400

> On 08/08/2014 01:26 AM, David Miller wrote:
> 
>> Now, besides taking the error path in vnet_start_xmit() and dropping
>> the packet, we need to think about what we should do subsequently when
>> this condition triggers.
>> 
>> We probably want to take the carrier down via netif_carrier_off(), and
>> then when we see the head pointer advancing (I guess via
>> vnet_event()'s logic), the networking driver can netif_carrier_on() in
>> response.
> 
> Aren't we talking about one ldc port (out of many) for the device here?
> I'm not sure we want to drop carrier if only one port isn't making
> progress, unless maybe that's the only switch port.

Right now we already are stopping the entire TX queue of the device if
just one port backs up.

Effectively the link is down at that point, and as long as it persists
no traffic is going to pass over the device.

The problem with not taking the carrier off is that the qdisc layer is
going to spam messages into the log each time the TX watchdog timer
goes off.
Sowmini Varadhan Aug. 8, 2014, 6:39 p.m. UTC | #8
On (08/08/14 10:33), David Miller wrote:
> 
> The problem with not taking the carrier off is that the qdisc layer is
> going to spam messages into the log each time the TX watchdog timer
> goes off.

spamming messages into logs is the lesser of my problems at this point :-)

The tasklet mechanism for kicking off netif_wake_queue() works quite
well, and is simple enough to do.

But once I removed the heuristic exponential backoff/retry for
vnet_send_ack(), I'm frequently not able to send any DRING_STOPPED
messages, and that seems to freeze all access even over the switch-port
to the VM (even though, afaict, netif_stop_queue() has not been called).

If we can't send the LDC ack from vnet_event, we need to reset
this peer, but vio_conn_reset() is a no-op. Recovering from here
is going to be quite sticky.



David Miller Aug. 8, 2014, 6:46 p.m. UTC | #9
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Fri, 8 Aug 2014 14:39:39 -0400

> The tasklet mechanism for kicking off netif_wake_queue() works quite
> well, and is simple enough to do. 

So you are able to successfully trigger the tasklet from vnet_event(),
and have that tasklet do the queue wakeups?

> But once I removed the heuristic exponential backoff/retry for
> vnet_send_ack(), I'm frequently not able to send any DRING_STOPPED
> messages, and that seems to freeze all access even over the switch-port
> to the VM (even though, afaict, netif_stop_queue() has not been called).
> 
> If we can't send the LDC ack from vnet_event, we need to reset
> this peer, but vio_conn_reset() is a no-op. Recovering from here
> is going to be quite sticky.

But removing the backoff logic from __vnet_tx_trigger() does work,
right?

I don't think vnet_walk_rx() is really able to handle any kind of real
failures from vnet_send_ack() properly.  If we send one or more
VIO_DRING_ACTIVE ACKs and then can't send the VIO_DRING_STOPPED one
out, the ring will likely be left in an inconsistent state.
Sowmini Varadhan Aug. 8, 2014, 6:55 p.m. UTC | #10
On (08/08/14 11:46), David Miller wrote:
> So you are able to successfully trigger the tasklet from vnet_event(),
> and have that tasklet do the queue wakeups?

yes.

> But removing the backoff logic from __vnet_tx_trigger() does work,
> right?

It "works" to the extent that it recovers. You get a lot more 
errors, much more easily, though -  thus throughput sinks. 
I dont know how the heuristics were determined, but they seem to help...
 
> I don't think vnet_walk_rx() is really able to handle any kind of real
> failures from vnet_send_ack() properly.  If we send one or more
> VIO_DRING_ACTIVE ACKs and then can't send the VIO_DRING_STOPPED one
> out, the ring will likely be left in an inconsistent state.

I just found out last week that you don't actually need to set the
VIO_ACK_ENABLE (and thus trigger the ACTIVE acks)- evidently the protocol
is such that the STOPPED ldc message is sufficient. 

So one patch that I'm working on lining up (after due testing etc)
is to not set VIO_ACK_ENABLE in vnet_start_xmit- it also helps perf
slightly because it reduces the trips through ldc (and the potential
for filling up the ldc ring).

--Sowmini

David Miller Aug. 8, 2014, 7:59 p.m. UTC | #11
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Fri, 8 Aug 2014 14:55:22 -0400

> On (08/08/14 11:46), David Miller wrote:
>> I don't think vnet_walk_rx() is really able to handle any kind of real
>> failures from vnet_send_ack() properly.  If we send one or more
>> VIO_DRING_ACTIVE ACKs and then can't send the VIO_DRING_STOPPED one
>> out, the ring will likely be left in an inconsistent state.
> 
> I just found out last week that you don't actually need to set the
> VIO_ACK_ENABLE (and thus trigger the ACTIVE acks)- evidently the protocol
> is such that the STOPPED ldc message is sufficient. 
> 
> So one patch that I'm working on lining up (after due testing etc)
> is to not set VIO_ACK_ENABLE in vnet_start_xmit- it also helps perf
> slightly because it reduces the trips through ldc (and the potential
> for filling up the ldc ring).

This only works because we free the packet in the ->ndo_start_xmit()
method.  If we didn't do that, and we freed it up at ACK time, we'd
need to force the ACKs in order to guarantee that we release the SKB
in a finite amount of time.
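A toy model of that ownership point (free_in_xmit mirrors today's behavior of freeing the skb inside ->ndo_start_xmit(); all names here are made up for illustration):

```c
#include <assert.h>

/* Toy model of SKB ownership: if the skb is freed right in
 * ->ndo_start_xmit() (today's behavior), ACKs are optional; if it is
 * freed only when an ACK arrives, unacknowledged traffic pins skbs
 * until the peer ACKs, so ACKs would have to be forced. */
struct tx_model {
	int owned_skbs;		/* skbs the driver has not yet freed */
};

static void xmit(struct tx_model *q, int free_in_xmit)
{
	q->owned_skbs++;		/* skb handed to the driver */
	if (free_in_xmit)
		q->owned_skbs--;	/* copied into the dring, freed now */
}

static void ack_received(struct tx_model *q)
{
	if (q->owned_skbs)
		q->owned_skbs--;	/* freed only at ACK time */
}
```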

Sowmini Varadhan Aug. 8, 2014, 8:47 p.m. UTC | #12
On (08/08/14 12:59), David Miller wrote:
> 
> This only works because we free the packet in the ->ndo_start_xmit()
> method.  If we didn't do that, and we freed it up at ACK time, we'd
> need to force the ACKs in order to guarantee that we release the SKB
> in a finite amount of time.
> 

I see. I missed that subtlety. I'll make a note in the comments for
my proposed change, in case this piece of code changes in the future.

--Sowmini

Sowmini Varadhan Aug. 10, 2014, 7:56 p.m. UTC | #13
To wrap up this thread..

I just sent out a tentative patch with the first round of fixes
to netdev. These fixes take care of the bare minimum of making
sure we don't soft-lockup when the sink does not receive packets.

I can still see at least 2 areas of improvement, that I'd like
to address separately, since the changes are non-trivial and
have to be done carefully

1. finer granularity of flow-control in vnet_start_xmit(): instead
   of doing a netif_stop_queue() when any single peer is congested,
   try to track flow-control for that peer only, and let the others
   continue Tx/Rx

2. better recovery from vnet_send_ack() failure: I have a somewhat
   odd printk there today, just to let the admin know that help
   is needed. I've tried calling ldc_disconnect() here, but it
   doesn't really reset the peer, though a module-reload fixes it.
   So what's needed is to trigger just the unregister/register of
   the problematic port, and this will need more than a few lines
   of change (I think it has to be triggered by ds?)

I'll take a look at those two over the next few weeks, but didn't
want to hold these changes hostage while that's happening.

--Sowmini

David Miller Aug. 11, 2014, 8:58 p.m. UTC | #14
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Sun, 10 Aug 2014 15:56:22 -0400

> 1. finer granularity of flow-control in vnet_start_xmit(): instead
>    of doing a netif_stop_queue() when any single peer is congested,
>    try to track flow-control for that peer only, and let the others
>    continue Tx/Rx

The big blocker is that you can't send to other peers if you receive
a packet destined for the stopped peer.

It's just not legal to reorder packets in this manner.
Patch

diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c
index e01d75d..66dacd5 100644
--- a/arch/sparc/kernel/ldc.c
+++ b/arch/sparc/kernel/ldc.c
@@ -1336,7 +1336,7 @@  int ldc_connect(struct ldc_channel *lp)
 	if (!(lp->flags & LDC_FLAG_ALLOCED_QUEUES) ||
 	    !(lp->flags & LDC_FLAG_REGISTERED_QUEUES) ||
 	    lp->hs_state != LDC_HS_OPEN)
-		err = -EINVAL;
+		err = ((lp->hs_state > LDC_HS_OPEN) ? 0 : -EINVAL);
 	else
 		err = start_handshake(lp);