[net-next,3/3] sunvnet: Schedule maybe_tx_wakeup as a tasklet from ldc_rx path

Message ID 20140812143531.GJ2404@oracle.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Sowmini Varadhan Aug. 12, 2014, 2:35 p.m. UTC
This patch must be applied after
"[PATCH net-next 2/3] sunvnet: Do not spin in an infinite loop when vio_ldc_send() returns EAGAIN"

At the tail of vnet_event(), if we hit the maybe_tx_wakeup()
condition, we try to take the netif_tx_lock() in the
recv-interrupt-context and can deadlock with dev_watchdog().
Two changes to fix this:

- The contents of maybe_tx_wakeup() should be moved into
  vnet_tx_timeout
- vnet_event() should kick off a tasklet that triggers
  netif_wake_queue()

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Raghuram Kothakota <raghuram.kothakota@oracle.com>

---
 drivers/net/ethernet/sun/sunvnet.c | 52 +++++++++++++++++++++++---------------
 drivers/net/ethernet/sun/sunvnet.h |  4 +++
 2 files changed, 36 insertions(+), 20 deletions(-)

Comments

David Miller Aug. 12, 2014, 10:13 p.m. UTC | #1
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Tue, 12 Aug 2014 10:35:31 -0400

> @@ -489,28 +491,13 @@ static int handle_mcast(struct vnet_port *port, void *msgbuf)
>  	return 0;
>  }
>  
> -static void maybe_tx_wakeup(struct vnet *vp)
> +static void maybe_tx_wakeup(unsigned long param)
>  {
> +	struct vnet *vp = (struct vnet *)param;
>  	struct net_device *dev = vp->dev;
>  
 ...
> -	}
> +	vnet_tx_timeout(dev);
>  	netif_tx_unlock(dev);
>  }
>  
> @@ -764,7 +756,25 @@ out_dropped:
>  
>  static void vnet_tx_timeout(struct net_device *dev)
>  {
> -	/* XXX Implement me XXX */
> +	struct vnet *vp = netdev_priv(dev);
> +

Please do not usurp vnet_tx_timeout() to handle queue waking.

It is hooked up to netdev_ops->ndo_tx_timeout() which is invoked when
the qdisc watchdog triggers after dev->watchdog_timeout time elapses
after the queue is stopped without a subsequent wake.

This method is therefore executed after a hard timeout, meaning that
the device should probably be reset, whereas maybe_tx_wakeup() is
normal processing meant to drive the engine of TX flow control.

In the context of sunvnet, vnet_tx_timeout() should probably reset the
LDC channel(s) of the unresponsive peer(s).
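
A rough sketch of what a reset-oriented handler along those lines might
look like; vnet_port_reset() is a hypothetical placeholder for whatever
LDC reset primitive gets chosen, and port_list locking is omitted:

/* Sketch only: ndo_tx_timeout() runs after dev->watchdog_timeo expires
 * with the queue still stopped, so the handler should recover the
 * stuck channel(s) rather than just wake the queue.
 */
static void vnet_tx_timeout(struct net_device *dev)
{
	struct vnet *vp = netdev_priv(dev);
	struct vnet_port *port;

	list_for_each_entry(port, &vp->port_list, list) {
		struct vio_dring_state *dr;

		dr = &port->vio.drings[VIO_DRIVER_TX_RING];
		/* A peer whose TX ring has stayed full for a whole
		 * watchdog interval is presumed unresponsive.
		 */
		if (vnet_tx_dring_avail(dr) < VNET_TX_WAKEUP_THRESH(dr))
			vnet_port_reset(port);	/* hypothetical helper */
	}
	netif_wake_queue(dev);
}

static const struct net_device_ops vnet_ops = {
	/* ...existing ops... */
	.ndo_tx_timeout	= vnet_tx_timeout,
};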

Thanks.
Sowmini Varadhan Aug. 13, 2014, 1:58 a.m. UTC | #2
> 
> Please do not usurp vnet_tx_timeout() to handle queue waking.
> 
> It is hooked up to netdev_ops->ndo_tx_timeout() which is invoked when
> the qdisc watchdog triggers after dev->watchdog_timeout time elapses
> after the queue is stopped without a subsequent wake.
> 
> This method is a therefore executed after a hard timeout, meaning that
> the device should probably be reset, whereas maybe_tx_wakeup() is
> normal processing meant to drive the engine of TX flow control.
> 
> In the context of sunvnet, vnet_tx_timeout() should probably reset the
> LDC channel(s) of the unresponsive peer(s).

yes, after a couple of days of reflection on the lingering issues in
http://www.spinics.net/lists/sparclinux/msg12360.html
I too was starting to arrive at  similar conclusions about 
maybe_tx_wakeup vs vnet_tx_timeout.

First, case 2 in the "todo" list (cannot send DRING_STOPPED):
If we cannot send a vnet_send_ack() we should invoke ldc_disconnect() 
for this peer.  The ldc_disconnect() will reset hs_state and flags 
in the ldc_channel, so subsequent attempts to ldc_write() 
(e.g., from vio_ldc_send()) will return -ENOTCONN. So, for
that case, a subsequent vnet_start_xmit() would *not* netif_stop_queue(),
but just drop the packet, reset the dring, and keep driving.
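
As a rough sketch of that xmit-side behaviour (vnet_port_reset_dring()
is a hypothetical helper, and the surrounding locking and out_dropped
handling are elided):

/* Sketch: once ldc_disconnect() has torn down the channel,
 * vio_ldc_send() fails with -ENOTCONN; instead of stopping the whole
 * device queue, drop the packet for that peer, reset the ring state
 * and keep driving.
 */
static netdev_tx_t vnet_handle_send_err(struct vnet_port *port,
					struct sk_buff *skb,
					struct net_device *dev, int err)
{
	if (err == -ENOTCONN) {
		vnet_port_reset_dring(port);	/* hypothetical helper */
		dev->stats.tx_dropped++;
		dev_kfree_skb_any(skb);
		return NETDEV_TX_OK;		/* queue stays awake */
	}

	/* Other errors fall through to the existing out_dropped path. */
	return NETDEV_TX_BUSY;
}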

To recover the disconnected channel, the admin would need to (ldm) 
unbind/bind the offender, identifiable by their mac address.
And none of the other channels should be affected (assuming
inter-vnet-links is set to on).

So for case 1 we could do something very similar- I haven't tried this yet,
but here's what I'm thinking: vnet_start_xmit() should not do a
netif_stop_queue.  Instead it should do a (new function) "vio_stop_queue()"
which sets some (new) VIO_DR_FLOW_CONTROLLED  state in the vio_driver_state, 
(or perhaps some flag bit in the ldc_channel?) and also marks a 
(new) boolean is_flow_controlled state variable in the struct vnet 
as TRUE.

The modified maybe_tx_wakeup would check is_flow_controlled on the vnet,
and call a new vnet_wake_queue() to reset VIO_DR_FLOW_CONTROLLED
if appropriate.

vnet_start_xmit should drop packets as long as VIO_DR_FLOW_CONTROLLED
is set.
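
A code sketch of this proposal, purely to make the intended mechanics
concrete: VIO_DR_FLOW_CONTROLLED, flow_flags and is_flow_controlled are
all hypothetical additions, and where the flag should actually live
(vio_driver_state, ldc_channel, or the port as shown here) is left open
above.

#define VIO_DR_FLOW_CONTROLLED	0x01		/* hypothetical flag bit */

/* Called from vnet_start_xmit() instead of netif_stop_queue(). */
static void vio_stop_queue(struct vnet *vp, struct vnet_port *port)
{
	port->flow_flags |= VIO_DR_FLOW_CONTROLLED;	/* hypothetical field */
	vp->is_flow_controlled = true;			/* hypothetical field */
}

/* Called from the modified maybe_tx_wakeup() once the peer's dring
 * has drained enough.
 */
static void vnet_wake_queue(struct vnet *vp, struct vnet_port *port)
{
	port->flow_flags &= ~VIO_DR_FLOW_CONTROLLED;
	vp->is_flow_controlled = false;
}

/* vnet_start_xmit() would drop the packet while this returns true. */
static bool vnet_port_flow_controlled(struct vnet_port *port)
{
	return port->flow_flags & VIO_DR_FLOW_CONTROLLED;
}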

The above sketch has a deficiency that it doesn't avoid packet drops
at the net_device layer itself - packets that can't make it out because
of FLOW_CONTROLLED will get dropped - I think the Qdisc code path
deals better with the back-pressure. I don't know if there is a way
to leverage that path and still avoid the head-of-line blocking due
to one flow-controlled ldc_channel.

I'm a little hesitant to ldc_disconnect a peer from the vnet_start_xmit
path - if we have a super-efficient Tx path, we don't want to
reset things merely because the receiver can't keep up - we just want to
flow-control the Tx? Whereas the inability to transmit a DRING_STOPPED
from the Rx side is likely to be a more unusual situation (it would need
the peer to send a burst of packets and then not be prepared to do
even a single ldc_read()!)?

thoughts? 

--Sowmini
David Miller Aug. 13, 2014, 4:25 a.m. UTC | #3
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Tue, 12 Aug 2014 21:58:17 -0400

>> 
>> Please do not usurp vnet_tx_timeout() to handle queue waking.
>> 
>> It is hooked up to netdev_ops->ndo_tx_timeout() which is invoked when
>> the qdisc watchdog triggers after dev->watchdog_timeout time elapses
>> after the queue is stopped without a subsequent wake.
>> 
>> This method is a therefore executed after a hard timeout, meaning that
>> the device should probably be reset, whereas maybe_tx_wakeup() is
>> normal processing meant to drive the engine of TX flow control.
>> 
>> In the context of sunvnet, vnet_tx_timeout() should probably reset the
>> LDC channel(s) of the unresponsive peer(s).
> 
> yes, after a couple of days of reflection on the lingering issues in
> http://www.spinics.net/lists/sparclinux/msg12360.html
> I too was starting to arrive at  similar conclusions about 
> maybe_tx_wakeup vs vnet_tx_timeout.

However, don't get ahead of yourself.  All I'm saying in the context
of reviewing this patch #3 is that you should leave vnet_tx_timeout()
alone and just put what you were putting into vnet_tx_timeout() into
maybe_tx_wakeup().

> First case 2 in the "todo" list (cannot send DRING_STOPPED):
> If we cannot send a vnet_send_ack() we should invoke ldc_disconnect() 
> for this peer.  The ldc_disconnect() will reset hs_state and flags 
> in the ldc_channel, so subsequent attempts to ldc_write() 
> (e.g., from vio_ldc_send()) will return -ENOTCONN. So, for
> that case, a subsequent vnet_start_xmit() would *not* netif_stop_queue(),
> but just drop the packet, reset the dring, and keep driving.
> 
> To recover the disconnected channel, the admin would need to (ldm) 
> unbind/bind the offender, identifiable by their mac address.
> And none of the other channels shold be affected (assuming
> inter-vnet-links is set to on)

This doesn't match my understanding.   My understanding is that when
an LDC connection resets, the peer on the other side should automatically
try to re-bind or whatever is necessary to re-establish the LDC link.

It's supposed to be fault tolerant in as many situations as possible
without the need for explicit administrator intervention.

But besides that, an LDC reset is a big hammer.  So I would say that it
should be deferred to the last possible point.

So initially we do the backoff spinning loop synchronously.  If that
doesn't succeed, we schedule a workqueue that can poll the LDC channel
some more trying to do the write, in a sleepable context (so a loop
with delays and preemption points) until a much larger timeout.  Only
at this second timeout do we declare things to be serious enough for
an LDC channel reset.
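
As a rough sketch of that two-stage scheme (the delayed-work field, the
pending-message bookkeeping, the deadline, and the hard-timeout value
are all assumptions, not existing driver code):

#define VNET_LDC_HARD_TIMEOUT	(5 * HZ)	/* assumed hard timeout */

/* port->ldc_deadline would be set to jiffies + VNET_LDC_HARD_TIMEOUT
 * when this work is first scheduled, after the synchronous backoff
 * loop gives up.
 */
static void vnet_ldc_retry_work(struct work_struct *work)
{
	struct vnet_port *port = container_of(work, struct vnet_port,
					      ldc_retry.work);	/* hypothetical field */
	int err;

	/* Retry the write that kept returning EAGAIN. */
	err = vio_ldc_send(&port->vio, port->pending_msg,	/* hypothetical fields */
			   port->pending_len);
	if (err != -EAGAIN)
		return;				/* sent, or failed for good */

	if (time_after(jiffies, port->ldc_deadline)) {
		/* Second, much larger timeout expired: big hammer. */
		ldc_disconnect(port->vio.lp);
		return;
	}

	/* Still busy: keep polling from sleepable context. */
	schedule_delayed_work(&port->ldc_retry, msecs_to_jiffies(10));
}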

> So for case 1 we could do something very similar- I haven't tried this yet,
> but here's what I'm thinking: vnet_start_xmit() should not do a
> netif_stop_queue.  Instead it should do a (new function) "vio_stop_queue()"
> which sets some (new) VIO_DR_FLOW_CONTROLLED  state in the vio_driver_state, 
> (or perhaps some flag bit in the ldc_channel?) and also marks a 
> (new) boolean is_flow_controlled state variable in the struct vnet 
> as TRUE.
> 
> The modified maybe_tx_wakeup would check is_flow_controlled on the vnet,
> and call a new vnet_wake_queue() to reset VIO_DR_FLOW_CONTROLLED
> if appropriate.
> 
> vnet_start_xmit should drop packets as long as VIO_DR_FLOW_CONTROLLED
> is set.

Again, as I stated yesterday, you really cannot do this.  Head of line
blocking when one peer gets wedged is absolutely necessary.

If a packet for a blocked peer is pending, you have to make the other
guys wait until either the blocked peer gets unstuck or you reset
the blocked peer's LDC channel on a hard timeout.

> I dont know if there is a way to leverage that path and still avoid
> the head-of-line blocking due to one flow-controlled ldc_channel.

Again, I don't think the semantics resulting from head-of-line
blocking avoidance are acceptable.
Raghuram Kothakota Aug. 13, 2014, 4:26 a.m. UTC | #4
On Aug 12, 2014, at 6:58 PM, Sowmini Varadhan <sowmini.varadhan@oracle.com> wrote:

>> 
>> Please do not usurp vnet_tx_timeout() to handle queue waking.
>> 
>> It is hooked up to netdev_ops->ndo_tx_timeout() which is invoked when
>> the qdisc watchdog triggers after dev->watchdog_timeout time elapses
>> after the queue is stopped without a subsequent wake.
>> 
>> This method is a therefore executed after a hard timeout, meaning that
>> the device should probably be reset, whereas maybe_tx_wakeup() is
>> normal processing meant to drive the engine of TX flow control.
>> 
>> In the context of sunvnet, vnet_tx_timeout() should probably reset the
>> LDC channel(s) of the unresponsive peer(s).
> 
> yes, after a couple of days of reflection on the lingering issues in
> http://www.spinics.net/lists/sparclinux/msg12360.html
> I too was starting to arrive at  similar conclusions about 
> maybe_tx_wakeup vs vnet_tx_timeout.
> 
> First case 2 in the "todo" list (cannot send DRING_STOPPED):
> If we cannot send a vnet_send_ack() we should invoke ldc_disconnect() 
> for this peer.  The ldc_disconnect() will reset hs_state and flags 
> in the ldc_channel, so subsequent attempts to ldc_write() 
> (e.g., from vio_ldc_send()) will return -ENOTCONN. So, for
> that case, a subsequent vnet_start_xmit() would *not* netif_stop_queue(),
> but just drop the packet, reset the dring, and keep driving.
> 
> To recover the disconnected channel, the admin would need to (ldm) 
> unbind/bind the offender, identifiable by their mac address.
> And none of the other channels shold be affected (assuming
> inter-vnet-links is set to on)

We do not consider a peer which couldn't process LDC messages
an offender, but just a peer in a broken state. A peer could get
into this type of situation for many different reasons - for example,
a hang or lockup in entirely unrelated code, being stuck in a debugger,
or a bug in the sunvnet driver itself. Such a guest should be able
to recover when it is rebooted. Requiring an unbind/bind is not considered
a solution; we don't expect admins to perform unbind/bind unless they
are really trying to destroy the domain. Note that an unbind of a domain
causes the vnet device ports in all of the peer domains to be removed,
and the bind brings them back, which is a heavy hammer. What we really want
is an LDC reset which puts the LDC into a clean state, allowing
the peer to restart the handshake and establish communication again.
This can be accomplished either by rebooting the peer or by unloading and
reloading the sunvnet driver in the peer domain. It could also recover
automatically. You can see this as something similar to resetting a NIC
when the NIC h/w is locked up.

From what I read of the ldc_disconnect() code, it seems like it is reconfiguring
the queues, so I assume a reset event would re-enable the communication
fine, but I don't know for sure without testing this specific condition.

> 
> So for case 1 we could do something very similar- I haven't tried this yet,
> but here's what I'm thinking: vnet_start_xmit() should not do a
> netif_stop_queue.  Instead it should do a (new function) "vio_stop_queue()"
> which sets some (new) VIO_DR_FLOW_CONTROLLED  state in the vio_driver_state, 
> (or perhaps some flag bit in the ldc_channel?) and also marks a 
> (new) boolean is_flow_controlled state variable in the struct vnet 
> as TRUE.
> 
> The modified maybe_tx_wakeup would check is_flow_controlled on the vnet,
> and call a new vnet_wake_queue() to reset VIO_DR_FLOW_CONTROLLED
> if appropriate.
> 
> vnet_start_xmit should drop packets as long as VIO_DR_FLOW_CONTROLLED
> is set.
> 

One important point to keep in mind is that packets to different peers shouldn't
be blocked by one blocked peer. Any flow control or dropping of packets
needs to be done on a per-port basis.

-Raghuram
> The above sketch has a deficiency that it doesnt avoid packet drops
> at the net_device layer itself - packets that can't make it out because
> of FLOW_CONTROLLED will get dropped - I think the Qdisc code path
> deals better with the back-pressure. I dont know if there is a way
> to leverage that path and still avoid the head-of-line blocking due
> to one flow-controlled ldc_channel.
> 
> I'm a little hesitant to ldc_disconnect a peer from the vnet_start_xmit
> path- if we have a super-efficient Tx path, we dont want to
> reset things merely because the receiver can't keep up- we just want to
> flow-control the Tx? Whereas the inability to transmit a DRING_STOPPED
> from the Rx side is likely to be a more unusual situation (would need
> the peer to send a burst of packets and then not be prepared to do
> even a single ldc_read()!)?
> 
> thoughts? 
> 
> --Sowmini

David Miller Aug. 13, 2014, 4:31 a.m. UTC | #5
From: Raghuram Kothakota <Raghuram.Kothakota@oracle.com>
Date: Tue, 12 Aug 2014 21:26:28 -0700

> One important point to keep in mind that packets to different peers shouldn't
> be blocked by one blocked peer. Any flow controlling or dropping of the packets
> needs to be done on a per-port basis.

Until we use a big hammer and reset the LDC channel of that peer, we
_absolutely_ should not reorder traffic and send to non-stuck peers.

A networking stack should avoid packet reordering in all cases when it
has the ability to do so, and sending packets which came in later to
unstuck peers when there are older packets still pending to the stuck
peer violates that requirement.
Raghuram Kothakota Aug. 13, 2014, 5:14 a.m. UTC | #6
On Aug 12, 2014, at 9:31 PM, David Miller <davem@davemloft.net> wrote:

> From: Raghuram Kothakota <Raghuram.Kothakota@oracle.com>
> Date: Tue, 12 Aug 2014 21:26:28 -0700
> 
>> One important point to keep in mind that packets to different peers shouldn't
>> be blocked by one blocked peer. Any flow controlling or dropping of the packets
>> needs to be done on a per-port basis.
> 
> Until we use a big hammer and reset the LDC channel of that peer, we
> _absolutely_ should not reorder traffic and send to non-stuck peers.

The packet ordering requirement applies only to packets destined for
a specific destination. The packets to different destinations can flow
as long as they do not cause ordering issues. In the case of sunvnet,
each peer except the switch-port would be for a specific destination,
so sending packets to other ports would not result in re-ordering of the packets.

Even if the peer-to-peer LDC is the one that is blocked, disconnecting
that LDC channel and sending the packets via the switch-port would not be
expected to change the order of the packets, as the LDC reset is expected
to drop all packets in the ring so far.

-Raghuram
> 
> A networking stack should avoid packet reordering in all cases when it
> has the ability to do so, and sending packets which came in later to
> unstuck peers when there are older packets still pending to the stuck
> peer violates that requirement.

David Miller Aug. 13, 2014, 5:29 a.m. UTC | #7
From: Raghuram Kothakota <Raghuram.Kothakota@oracle.com>
Date: Tue, 12 Aug 2014 22:14:09 -0700

> On Aug 12, 2014, at 9:31 PM, David Miller <davem@davemloft.net> wrote:
> 
>> From: Raghuram Kothakota <Raghuram.Kothakota@oracle.com>
>> Date: Tue, 12 Aug 2014 21:26:28 -0700
>> 
>>> One important point to keep in mind that packets to different peers shouldn't
>>> be blocked by one blocked peer. Any flow controlling or dropping of the packets
>>> needs to be done on a per-port basis.
>> 
>> Until we use a big hammer and reset the LDC channel of that peer, we
>> _absolutely_ should not reorder traffic and send to non-stuck peers.
> 
> The packet ordering requirement applies only for packets destined to
> a specific destination. The packets to different destinations can flow
> as long as they do not cause ordering issues. In case of sunvnet, 
> each peer except the switch-port would be for a specific destination,
> so sending packets to other ports would not result in re-ordering of the packets.
> 
> Even if the peer to peer LDC is the one that is blocked then disconnecting
> that LDC channel and sending the packets via switch-port would not be
> expected to change the order of the packet as the LDC reset is expect
> to drop all packets in the ring so far.  

Ok, that makes sense.

So the question is how to manage this on the driver side, and the most
natural way I see to do this would be to use multiple TX netdev queues
and a custom netdev_ops->ndo_select_queue() method which selects the
queue based upon the peer that would be selected.
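
A rough sketch of what that could look like: the ndo_select_queue()
signature shown is the one used around this era and changes across
kernel versions, __tx_port_find() stands for the driver's
destination-MAC-to-port lookup, q_index is an assumed per-port queue
id, and the device would also have to be allocated with enough TX
queues (e.g. via alloc_etherdev_mqs()) for this to matter.

/* Sketch: pick the TX queue that corresponds to the peer (LDC channel)
 * the packet will go over, so a stuck peer only stalls its own queue.
 */
static u16 vnet_select_queue(struct net_device *dev, struct sk_buff *skb,
			     void *accel_priv, select_queue_fallback_t fallback)
{
	struct vnet *vp = netdev_priv(dev);
	struct vnet_port *port = __tx_port_find(vp, skb);

	if (!port)
		return 0;	/* unknown destination: switch-port queue */

	return port->q_index;	/* hypothetical per-port queue number */
}

static const struct net_device_ops vnet_ops = {
	/* ...existing ops... */
	.ndo_select_queue	= vnet_select_queue,
};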
Raghuram Kothakota Aug. 13, 2014, 5:44 a.m. UTC | #8
On Aug 12, 2014, at 10:29 PM, David Miller <davem@davemloft.net> wrote:

> From: Raghuram Kothakota <Raghuram.Kothakota@oracle.com>
> Date: Tue, 12 Aug 2014 22:14:09 -0700
> 
>> On Aug 12, 2014, at 9:31 PM, David Miller <davem@davemloft.net> wrote:
>> 
>>> From: Raghuram Kothakota <Raghuram.Kothakota@oracle.com>
>>> Date: Tue, 12 Aug 2014 21:26:28 -0700
>>> 
>>>> One important point to keep in mind that packets to different peers shouldn't
>>>> be blocked by one blocked peer. Any flow controlling or dropping of the packets
>>>> needs to be done on a per-port basis.
>>> 
>>> Until we use a big hammer and reset the LDC channel of that peer, we
>>> _absolutely_ should not reorder traffic and send to non-stuck peers.
>> 
>> The packet ordering requirement applies only for packets destined to
>> a specific destination. The packets to different destinations can flow
>> as long as they do not cause ordering issues. In case of sunvnet, 
>> each peer except the switch-port would be for a specific destination,
>> so sending packets to other ports would not result in re-ordering of the packets.
>> 
>> Even if the peer to peer LDC is the one that is blocked then disconnecting
>> that LDC channel and sending the packets via switch-port would not be
>> expected to change the order of the packet as the LDC reset is expect
>> to drop all packets in the ring so far.  
> 
> Ok, that makes sense.
> 
> So the question is how to manage this on the driver side, and the most
> natural way I see do this would be to use multiple TX netdev queues
> and a custom netdev_ops->ndo_select_queue() method which selects the
> queue based upon the peer that would be selected.


Thanks, if there is a method to accomplish this with multiple Tx netdev queues
that would be wonderful. We will research the custom ndo_select_queue()
method to direct traffic automatically at the network stack level.
We probably also need to look for methods to increase parallelism to boost
performance; I assume multiple Tx queues would be a way to accomplish
that as well.

-Raghuram
David Miller Aug. 13, 2014, 6:11 a.m. UTC | #9
From: Raghuram Kothakota <Raghuram.Kothakota@oracle.com>
Date: Tue, 12 Aug 2014 22:44:27 -0700

> Thanks, if there is a method to accomplish this with multiple Tx netdev queues
> would be wonderful. We will research on the custom ndo_selec_queue()
> method to direct traffic automatically at the network stack level.
> We probably also need to look for methods to increase parallelism to boost
> performance, I assume multiple Tx queues would be a method to accomplish
> that as well.

It will introduce parallelism only when there is traffic to multiple
peers at the same time.
Raghuram Kothakota Aug. 13, 2014, 6:37 a.m. UTC | #10
On Aug 12, 2014, at 11:11 PM, David Miller <davem@davemloft.net> wrote:

> From: Raghuram Kothakota <Raghuram.Kothakota@oracle.com>
> Date: Tue, 12 Aug 2014 22:44:27 -0700
> 
>> Thanks, if there is a method to accomplish this with multiple Tx netdev queues
>> would be wonderful. We will research on the custom ndo_selec_queue()
>> method to direct traffic automatically at the network stack level.
>> We probably also need to look for methods to increase parallelism to boost
>> performance, I assume multiple Tx queues would be a method to accomplish
>> that as well.
> 
> It will introduce parallelism only when there is traffic to multiple
> peers at the same time.

Yes, I should not have related these two things, especially without
proper experimentation. In other implementations, we have used
techniques such as multiple Tx queues to increase parallelism, but
path selection is done within the driver to get the benefit of the
parallelism for all peers. We need to see which method suits sunvnet
best; we hope to know more once we investigate all the methods available.

In the case of a custom ndo_select_queue() method, we need to see if there
are ways to deal with ports being dynamically added/removed too.

-Raghuram
Sowmini Varadhan Aug. 13, 2014, 10:51 a.m. UTC | #11
On (08/12/14 21:26), Raghuram Kothakota wrote:
> 
> We do not consider a peer which couldn't process LDC messages
> as an offender but just a peer in a broken state. A peer could get

So I was using "offender" in the same way that one typically
talks about sending ICMP errors for "offending IP packets"
(nothing personal :-)).  I don't want to quibble over terminology here.
 
> From what I read the ldc_disconect() code, it seems like it is reconfiguring
> the queues, so I assume a reset event would re-enable the communication
> fine, but I don't know for sure without testing this specific condition.

No such events were generated by the code when I tested it.

I'll send a consolidated reply to the rest of that very long thread 
in a bit.

--Sowmini

Sowmini Varadhan Aug. 13, 2014, 11:20 a.m. UTC | #12
On (08/12/14 21:25), David Miller wrote:
> 
> However, don't get ahead of yourself.  All I'm saying in the context
> of reviewing this patch #3 is that you should leave vnet_tx_timeout()
> alone and just put what you were putting into vnet_tx_timeout() into
> maybe_tx_wakeup().

And just leave vnet_tx_timeout() as before (i.e., "XXX implement me")?

Sure, I can generate that (as a stand-alone unified-diff patch?);
it would be the minimum fix needed to avoid the deadlock on
netif_tx_lock() itself.

> This doesn't match my understanding.   My understanding is that when
> an LDC connection resets, the peer on the other side should automatically
> try to re-bind or whatever is necessary to re-establish the LDC link.
> 
> It's supposed to be fault tolerant in as many situations as possible
> without the need for explicit adminstrator intervention.

Ideally, yes. But that's not what I was observing when I tested this.

> But besides that, an LDC reset is a big hammer.  So I would say that it
> should be deferred to the last possible point.

Correct, and that's why it feels ok (to me) to do this when the
DRING_STOPPED fails (so that we also fail subsequent ldc_write's
from vnet_start_xmit()) but not in any other case.

Note that, as I pointed out to Raghuram already, doing ldc_disconnect
does *not* send any reset events. The code may have other missing
functionality. 

> So initially we do the backoff spinning loop synchronously.  If that
> doesn't succeed, we schedule a workqueue that can poll the LDC channel
> some more trying to do the write, in a sleepable context (so a loop
> with delays and preemption points) until a much larger timeout.  Only
> at this second timeout do we declare things to be serious enough for
> an LDC channel reset.
> 
> > So for case 1 we could do something very similar- I haven't tried this yet,
> > but here's what I'm thinking: vnet_start_xmit() should not do a
> > netif_stop_queue.  Instead it should do a (new function) "vio_stop_queue()"
> > which sets some (new) VIO_DR_FLOW_CONTROLLED  state in the vio_driver_state, 
> > (or perhaps some flag bit in the ldc_channel?) and also marks a 
> > (new) boolean is_flow_controlled state variable in the struct vnet 
> > as TRUE.
> > 
> > The modified maybe_tx_wakeup would check is_flow_controlled on the vnet,
> > and call a new vnet_wake_queue() to reset VIO_DR_FLOW_CONTROLLED
> > if appropriate.
> > 
> > vnet_start_xmit should drop packets as long as VIO_DR_FLOW_CONTROLLED
> > is set.
> 
> Again, as I stated yesterday, you really cannot do this.  Head of line
> blocking when one peer gets wedged is absolutely necessary.

In the light of the subsequent discussion about head-of-line blocking
(and its applicability per ldc_channel, instead of per net_device),
do we still think the above has issues, or is otherwise not worth
doing?

Further in the thread, David Miller wrote:

> So the question is how to manage this on the driver side, and the most
> natural way I see do this would be to use multiple TX netdev queues
> and a custom netdev_ops->ndo_select_queue() method which selects the
> queue based upon the peer that would be selected.

This might work well for the current sunvnet model (and even
there, it has limitations - if all the traffic is going out
via the switchport to a gateway, you are again back to the
head-of-line blocking model).

But how generally useful is this model? For point-to-multipoint links
like Ethernet, I don't think you actually track one channel per
peer.

--Sowmini


Raghuram Kothakota Aug. 13, 2014, 9:59 p.m. UTC | #13
On Aug 13, 2014, at 3:51 AM, Sowmini Varadhan <sowmini.varadhan@oracle.com> wrote:

> On (08/12/14 21:26), Raghuram Kothakota wrote:
>> 
>> We do not consider a peer which couldn't process LDC messages
>> as an offender but just a peer in a broken state. A peer could get
> 
> So I was using "offender" in the same way that one typically
> talks about sending  ICMP errors for "offending IP packets" 
> (nothing pesonal :-)).  I dont want to quibble over terminology here.
> 
>> From what I read the ldc_disconect() code, it seems like it is reconfiguring
>> the queues, so I assume a reset event would re-enable the communication
>> fine, but I don't know for sure without testing this specific condition.
> 
> No such events were generated by the code when I tested it.

The reset event is expected for the peer; if it is not happening,
then we probably need to check why.

-Raghuram
> 
> I'll send a consolidated reply to the rest of that very long thread 
> in a bit.
> 
> --Sowmini
> 

David Miller Aug. 13, 2014, 11:29 p.m. UTC | #14
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Wed, 13 Aug 2014 07:20:10 -0400

> On (08/12/14 21:25), David Miller wrote:
>> 
>> However, don't get ahead of yourself.  All I'm saying in the context
>> of reviewing this patch #3 is that you should leave vnet_tx_timeout()
>> alone and just put what you were putting into vnet_tx_timeout() into
>> maybe_tx_wakeup().
> 
> And just leave vnet_tx_timeout() as before (i.e, "XXX implement me"?)?

Yes, and I see this is what you have done.

> Note that, as I pointed out to Raghuram already, doing ldc_disconnect
> does *not* send any reset events. The code may have other missing
> functionality. 

For the caller of ldc_disconnect() the reset is implied, if this is
what you are talking about.

I don't see any value in generating an event that the user of the LDC
channel itself generated.

>> So the question is how to manage this on the driver side, and the most
>> natural way I see do this would be to use multiple TX netdev queues
>> and a custom netdev_ops->ndo_select_queue() method which selects the
>> queue based upon the peer that would be selected.
> 
> This might work well for the current sunvnet model (and even
> there, it has limitations- if all the traffic is going out 
> via the switchport to a gateway, you are again back to the 
> head-of-line blocking model).
> 
> But how generally useful is this model? For point-to-multipoint links
> like ethernet, I dont think you actually track one channel per
> peer.  

I don't understand what your concern could be.

Ethernet is a broadcast medium, so all traffic (as far as the NIC is
concerned) goes via the same single queue.

But here we have independent queues leading up to different
destinations, so representing that as independent queues makes
sense.

This way, if one non-switch peer goes down, all switch based traffic
will still continue to flow, and vice versa.

Patch

diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index 39c39bf..c75a638 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -32,6 +32,8 @@  MODULE_DESCRIPTION("Sun LDOM virtual network driver");
 MODULE_LICENSE("GPL");
 MODULE_VERSION(DRV_MODULE_VERSION);
 
+static void vnet_tx_timeout(struct net_device *dev);
+
 /* Heuristic for the number of times to exponentially backoff and
  * retry sending an LDC trigger when EAGAIN is encountered
  */
@@ -489,28 +491,13 @@  static int handle_mcast(struct vnet_port *port, void *msgbuf)
 	return 0;
 }
 
-static void maybe_tx_wakeup(struct vnet *vp)
+static void maybe_tx_wakeup(unsigned long param)
 {
+	struct vnet *vp = (struct vnet *)param;
 	struct net_device *dev = vp->dev;
 
 	netif_tx_lock(dev);
-	if (likely(netif_queue_stopped(dev))) {
-		struct vnet_port *port;
-		int wake = 1;
-
-		list_for_each_entry(port, &vp->port_list, list) {
-			struct vio_dring_state *dr;
-
-			dr = &port->vio.drings[VIO_DRIVER_TX_RING];
-			if (vnet_tx_dring_avail(dr) <
-			    VNET_TX_WAKEUP_THRESH(dr)) {
-				wake = 0;
-				break;
-			}
-		}
-		if (wake)
-			netif_wake_queue(dev);
-	}
+	vnet_tx_timeout(dev);
 	netif_tx_unlock(dev);
 }
 
@@ -587,8 +574,13 @@  static void vnet_event(void *arg, int event)
 			break;
 	}
 	spin_unlock(&vio->lock);
+	/* Kick off a tasklet to wake the queue.  We cannot call
+	 * maybe_tx_wakeup directly here because we could deadlock on
+	 * netif_tx_lock() with dev_watchdog()
+	 */
 	if (unlikely(tx_wakeup && err != -ECONNRESET))
-		maybe_tx_wakeup(port->vp);
+		tasklet_schedule(&port->vp->vnet_tx_wakeup);
+
 	local_irq_restore(flags);
 }
 
@@ -764,7 +756,25 @@  out_dropped:
 
 static void vnet_tx_timeout(struct net_device *dev)
 {
-	/* XXX Implement me XXX */
+	struct vnet *vp = netdev_priv(dev);
+
+	if (likely(netif_queue_stopped(dev))) {
+		struct vnet_port *port;
+		int wake = 1;
+
+		list_for_each_entry(port, &vp->port_list, list) {
+			struct vio_dring_state *dr;
+
+			dr = &port->vio.drings[VIO_DRIVER_TX_RING];
+			if (vnet_tx_dring_avail(dr) <
+			    VNET_TX_WAKEUP_THRESH(dr)) {
+				wake = 0;
+				break;
+			}
+		}
+		if (wake)
+			netif_wake_queue(dev);
+	}
 }
 
 static int vnet_open(struct net_device *dev)
@@ -1071,6 +1081,7 @@  static struct vnet *vnet_new(const u64 *local_mac)
 	vp = netdev_priv(dev);
 
 	spin_lock_init(&vp->lock);
+	tasklet_init(&vp->vnet_tx_wakeup, maybe_tx_wakeup, (unsigned long)vp);
 	vp->dev = dev;
 
 	INIT_LIST_HEAD(&vp->port_list);
@@ -1130,6 +1141,7 @@  static void vnet_cleanup(void)
 		vp = list_first_entry(&vnet_list, struct vnet, list);
 		list_del(&vp->list);
 		dev = vp->dev;
+		tasklet_kill(&vp->vnet_tx_wakeup);
 		/* vio_unregister_driver() should have cleaned up port_list */
 		BUG_ON(!list_empty(&vp->port_list));
 		unregister_netdev(dev);
diff --git a/drivers/net/ethernet/sun/sunvnet.h b/drivers/net/ethernet/sun/sunvnet.h
index d347a5b..0e872aa 100644
--- a/drivers/net/ethernet/sun/sunvnet.h
+++ b/drivers/net/ethernet/sun/sunvnet.h
@@ -1,6 +1,8 @@ 
 #ifndef _SUNVNET_H
 #define _SUNVNET_H
 
+#include <linux/interrupt.h>
+
 #define DESC_NCOOKIES(entry_size)	\
 	((entry_size) - sizeof(struct vio_net_desc))
 
@@ -78,6 +80,8 @@  struct vnet {
 
 	struct list_head	list;
 	u64			local_mac;
+
+	struct tasklet_struct	vnet_tx_wakeup;
 };
 
 #endif /* _SUNVNET_H */