
[RFC PATCH] sched: only dequeue if packet can be queued to hardware queue.

Message ID 20080918194419.GA2982@ami.dom.local
State RFC, archived
Delegated to: David Miller

Commit Message

Jarek Poplawski Sept. 18, 2008, 7:44 p.m. UTC
Alexander Duyck wrote, On 09/18/2008 08:43 AM:
...
> ---
> This patch changes the behavior of the sch->dequeue to only
> dequeue a packet if the queue it is bound for is not currently
> stopped.  This functionality is provided via a new op called
> smart_dequeue.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---

I think these changes make sense only if they don't add more than they
give, and two dequeues (and still no way to kill requeue) is IMHO too
much.  (I mean the maintenance.)  As far as I can see it's mainly for
HFSC's qdisc_peek_len(); anyway, this looks like the main issue to me.

Below are a few small doubts.  (I still need to find some time for the details.)
BTW, this patch needs a checkpatch run.

Comments

Alexander H Duyck Sept. 19, 2008, 1:11 a.m. UTC | #1
On Thu, Sep 18, 2008 at 12:44 PM, Jarek Poplawski <jarkao2@gmail.com> wrote:
> I think these changes make sense only if they don't add more than they
> give, and two dequeues (and still no way to kill requeue) is IMHO too
> much.  (I mean the maintenance.)  As far as I can see it's mainly for
> HFSC's qdisc_peek_len(); anyway, this looks like the main issue to me.

The thing is, this was mostly meant to be a proof of concept, so I was
doing a lot of cut-and-paste coding, and as a result the size increased
by a good amount.  I admit this could be cleaned up a lot; I just
wanted to verify some things.

Also, my ultimate goal wasn't to kill requeue completely; you can't do
that, since with TSO/GSO you will end up with SKBs which require
multiple transmit descriptors, so you will always need an option to
requeue.  The advantage of this approach is that you don't incur the
CPU cost, which is a significant saving: the requeue approach was
generating 13% CPU, compared to only 3% for the smart dequeue.

> Below are a few small doubts.  (I still need to find some time for the details.)
> BTW, this patch needs a checkpatch run.

I did run checkpatch on this.  Most of the errors are inherited from
the cut and paste, and I didn't want to take the time to completely
rewrite the core qdisc functionality.  Other than those, I believe some
were whitespace complaints, since I was using tabs up to the start of
my functions and then spaces to indent the function parameters when
wrapping long lines.

> ---
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index b786a5b..4082f39 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -90,10 +90,7 @@ extern void __qdisc_run(struct Qdisc *q);
>
>  static inline void qdisc_run(struct Qdisc *q)
>  {
> -       struct netdev_queue *txq = q->dev_queue;
> -
> -       if (!netif_tx_queue_stopped(txq) &&
>
> I think there is no reason to do a full dequeue attempt each time
> instead of this check, even if we save on requeuing now.  We could try
> to save the result of the last dequeue, e.g. a number or a mask of the
> few tx_queues which prevented dequeuing, and only check for a change
> of their state.  (Or alternatively, what I mentioned before: a flag
> set on a full stop or freeze.)

Once again, if you have a suggestion on the approach, feel free to
modify the patch and see how it works for you.  My only concern is that
there are several qdiscs which won't give you the same packet twice, so
you don't know what is going to pop out until you go in and check.

>
> -           !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
> +       if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
>                __qdisc_run(q);
>  }
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index e556962..4400a18 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> ...
> +static inline struct sk_buff *__qdisc_smart_dequeue(struct Qdisc *sch,
> +                                                   struct sk_buff_head *list)
> +{
> +       struct sk_buff *skb = skb_peek(list);
>
> Since success is much more likely here, __skb_dequeue() with
> __skb_queue_head() on fail looks better to me.
>
> Of course, it's a matter of taste, but (if we really need these two
> dequeues) maybe qdisc_dequeue_smart() would be more in line with
> qdisc_dequeue_head()? (And similarly for the other smart names below.)

Right, but then we are getting back to the queue/requeue stuff.  If
you want, feel free to use my patch to generate your own that takes
that approach.  I just don't like changing things unless I absolutely
have to, and all I did was essentially tear apart __skb_dequeue and
place the bits inline with my check for netif_tx_subqueue_stopped.

> +       struct netdev_queue *txq;
> +
> +       if (!skb)
> +               return NULL;
> +
> +       txq = netdev_get_tx_queue(qdisc_dev(sch), skb_get_queue_mapping(skb));
> +       if (netif_tx_queue_stopped(txq) || netif_tx_queue_frozen(txq)) {
> +               sch->flags |= TCQ_F_STOPPED;
> +               return NULL;
> +       }
> +       __skb_unlink(skb, list);
> +       sch->qstats.backlog -= qdisc_pkt_len(skb);
> +       sch->flags &= ~TCQ_F_STOPPED;
>
> This clearing is probably needed in ->reset() and in ->drop() too.
> (Or above, after if (!skb))

For the most part I would agree with you, but for now I was only using
the flag as part of the smart_dequeue process to signal the upper
queues, so I didn't give it much thought.  It is yet another thing I
probably should have cleaned up but didn't get around to, since this
was mostly a proof of concept.

> ...
> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> index d14f020..4da1a85 100644
> --- a/net/sched/sch_htb.c
> +++ b/net/sched/sch_htb.c
> ...
> +static struct sk_buff *htb_smart_dequeue(struct Qdisc *sch)
> +{
> +       struct sk_buff *skb = NULL;
> +       struct htb_sched *q = qdisc_priv(sch);
> +       int level, stopped = false;
> +       psched_time_t next_event;
> +
> +       /* try to dequeue direct packets as high prio (!) to minimize cpu work */
> +       skb = skb_peek(&q->direct_queue);
>
> As above: __skb_dequeue()?

Actually, I could probably replace most of the skb_peek stuff with
calls to __qdisc_smart_dequeue instead and then just check for the flag
on failure.
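
Roughly this shape, just to illustrate what I mean (an untested sketch,
not part of the patch; it reuses __qdisc_smart_dequeue(), TCQ_F_STOPPED
and the local variables from the htb_smart_dequeue() hunk quoted above):

	/* Untested sketch: let __qdisc_smart_dequeue() do the peek and the
	 * stopped-queue check for the direct queue, and use TCQ_F_STOPPED to
	 * tell "head packet's tx queue is stopped" apart from "queue empty". */
	skb = __qdisc_smart_dequeue(sch, &q->direct_queue);
	if (skb) {
		sch->q.qlen--;
		return skb;
	}
	if (sch->flags & TCQ_F_STOPPED)
		stopped = true;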

> Thanks,
> Jarek P.

I probably won't be able to contribute to this for at least the next
two weeks, since I am going to be out on vacation from Saturday until
the start of October.

In the meantime, I also just threw out a couple of patches which may
help anyone who is trying to test this stuff.  It turns out that if you
do a netperf UDP_STREAM test on a multiqueue-aware system with 2.6.27,
you get horrible performance on the receive side.  The root cause
appears to be that simple_tx_hash was hashing fragmented packets and,
as a result, placing them in pseudo-random queues, which caused
packet-ordering issues.

Thanks,

Alex
Jarek Poplawski Sept. 19, 2008, 3:16 p.m. UTC | #2
On Thu, Sep 18, 2008 at 09:44:19PM +0200, Jarek Poplawski wrote:
> Alexander Duyck wrote, On 09/18/2008 08:43 AM:
> ...
> > ---
> > This patch changes the behavior of the sch->dequeue to only
> > dequeue a packet if the queue it is bound for is not currently
> > stopped.  This functionality is provided via a new op called
> > smart_dequeue.
> > 
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> > ---

Alexander, I guess you've seen my last messages/patches in this thread
and noticed that this __netif_schedule() problem is present both in
this RFC patch and in sch_multiq: if an skb isn't dequeued because of
the inner tx_queue stopped-state check, a __netif_schedule() is missing
before the exit from qdisc_run().

Jarek P.
Duyck, Alexander H Sept. 19, 2008, 4:26 p.m. UTC | #3
Jarek Poplawski wrote:
> On Thu, Sep 18, 2008 at 09:44:19PM +0200, Jarek Poplawski wrote:
>> Alexander Duyck wrote, On 09/18/2008 08:43 AM:
>> ...
>>> ---
>>> This patch changes the behavior of the sch->dequeue to only
>>> dequeue a packet if the queue it is bound for is not currently
>>> stopped.  This functionality is provided via a new op called
>>> smart_dequeue.
>>>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>> ---
>
> Alexander, I guess you've seen my last messages/patches in this thread
> and noticed that this __netif_schedule() problem is present both in
> this RFC patch and in sch_multiq: if an skb isn't dequeued because of
> the inner tx_queue stopped-state check, a __netif_schedule() is missing
> before the exit from qdisc_run().
>
> Jarek P.

Actually, most of that is handled by the fact that netif_tx_wake_queue
will call __netif_schedule when it decides to wake up a queue that has
been stopped.  Putting it in skb_dequeue is unnecessary, and the way
you did it would cause issues because you should be scheduling the root
qdisc, not the one currently doing the dequeue.
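
From memory, the wake path in 2.6.27 looks roughly like this
(paraphrased, netpoll handling omitted, so double-check the tree):

static inline void netif_tx_wake_queue(struct netdev_queue *dev_queue)
{
	/* Paraphrased from memory, not copied from the tree: clearing XOFF on
	 * a stopped tx queue already schedules its qdisc, which is why an
	 * extra __netif_schedule() in the dequeue path isn't needed. */
	if (test_and_clear_bit(__QUEUE_STATE_XOFF, &dev_queue->state))
		__netif_schedule(dev_queue->qdisc);
}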

My bigger concern is the fact that one queue is back to stopping all
queues.  By using one global XOFF state, you create head-of-line
blocking.  I can see the merit in this approach, but the problem is
that for multiqueue the queues really should be independent.  What your
change does is reduce the benefit of having multiqueue by throwing in a
new state which pretty much matches what netif_stop/wake_queue used to
do in the 2.6.26 version of tx multiqueue.

Also, changing dequeue_skb will likely cause additional issues for
several qdiscs, since it doesn't report anything up to parent queues;
as a result you will end up with qdiscs like prio acting more like
multiq, because they won't know whether a queue is empty, stopped, or
throttled.  I also believe this will cause a panic in pfifo_fast and
several other qdiscs, because some of them check whether there are
packets in the queue and, if so, dequeue with the assumption that they
will get a packet out.  You will need to add checks to resolve this.
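
The pattern I mean is roughly this (illustration only, not taken from
any particular qdisc; "child" here just stands for some inner qdisc):

	/* Illustration only: if ->dequeue() can now return NULL while packets
	 * are still queued (their tx queue is stopped), a caller written like
	 * this ends up dereferencing a NULL skb. */
	if (sch->q.qlen) {
		struct sk_buff *skb = child->dequeue(child);

		sch->q.qlen--;			/* wrong when skb is NULL */
		sch->qstats.backlog -= qdisc_pkt_len(skb);	/* NULL deref */
		return skb;
	}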

The one thing I liked about my approach was that, after I was done,
you could have prio as the parent of multiq leaves or multiq as the
parent of prio leaves, and as a result all the hardware queues would
receive the packets in the same order.

Thanks,

Alex
Jarek Poplawski Sept. 19, 2008, 4:37 p.m. UTC | #4
On Fri, Sep 19, 2008 at 05:16:32PM +0200, Jarek Poplawski wrote:
...
> Alexander, I guess you've seen my last messages/patches in this thread
> and noticed that this __netif_schedule() problem is present both in
> this RFC patch and in sch_multiq: if an skb isn't dequeued because of
> the inner tx_queue stopped-state check, a __netif_schedule() is missing
> before the exit from qdisc_run().

Hmm... probably a false alarm.  It seems there has to be a wake_queue
call there after all.

Sorry,
Jarek P.
Jarek Poplawski Sept. 19, 2008, 5:35 p.m. UTC | #5
On Fri, Sep 19, 2008 at 09:26:29AM -0700, Duyck, Alexander H wrote:
...
> Actually, most of that is handled by the fact that netif_tx_wake_queue
> will call __netif_schedule when it decides to wake up a queue that has
> been stopped.  Putting it in skb_dequeue is unnecessary,

You are right, I've just noticed this too, and I'll withdraw this change.

> and the way
> you did it would cause issues because you should be scheduling the root
> qdisc, not the one currently doing the dequeue.

I think this is the right way (but useless here).

> My bigger concern is the fact that one queue is back to stopping all
> queues.  By using one global XOFF state, you create head-of-line
> blocking.  I can see the merit in this approach, but the problem is
> that for multiqueue the queues really should be independent.  What your
> change does is reduce the benefit of having multiqueue by throwing in a
> new state which pretty much matches what netif_stop/wake_queue used to
> do in the 2.6.26 version of tx multiqueue.

Do you mean netif_tx_stop_all_queues() and then netif_tx_wake_queue()
before netif_tx_wake_all_queues()? OK, I'll withdraw this patch too.

> Also, changing dequeue_skb will likely cause additional issues for
> several qdiscs, since it doesn't report anything up to parent queues;
> as a result you will end up with qdiscs like prio acting more like
> multiq, because they won't know whether a queue is empty, stopped, or
> throttled.  I also believe this will cause a panic in pfifo_fast and
> several other qdiscs, because some of them check whether there are
> packets in the queue and, if so, dequeue with the assumption that they
> will get a packet out.  You will need to add checks to resolve this.

I really can't get your point. Don't you mean skb_dequeue()?
dequeue_skb() is used only by qdisc_restart()...

> The one thing I liked about my approach was that, after I was done,
> you could have prio as the parent of multiq leaves or multiq as the
> parent of prio leaves, and as a result all the hardware queues would
> receive the packets in the same order.

I'm not against your approach, but I'd like to be sure these
complications are really worth it.  Of course, if my proposal, the
first take of 3 patches, doesn't work as you predict (and I doubt it
will), then we can forget about it.

Thanks,
Jarek P.
Duyck, Alexander H Sept. 19, 2008, 6:01 p.m. UTC | #6
Jarek Poplawski wrote:

>> Also, changing dequeue_skb will likely cause additional issues for
>> several qdiscs, since it doesn't report anything up to parent queues;
>> as a result you will end up with qdiscs like prio acting more like
>> multiq, because they won't know whether a queue is empty, stopped, or
>> throttled.  I also believe this will cause a panic in pfifo_fast and
>> several other qdiscs, because some of them check whether there are
>> packets in the queue and, if so, dequeue with the assumption that they
>> will get a packet out.  You will need to add checks to resolve this.
>
> I really can't get your point. Don't you mean skb_dequeue()?
> dequeue_skb() is used only by qdisc_restart()...
You're right.  I misread it as skb_dequeue.  The problem is, though,
that you are still relying on q->requeue, which last I knew was only
being used by a few qdiscs.  In addition, you will still be taking the
CPU hit for the dequeue/requeue on several qdiscs which can't use
q->requeue without violating the way they are supposed to work.

>> The one thing I liked about my approach was that, after I was done,
>> you could have prio as the parent of multiq leaves or multiq as the
>> parent of prio leaves, and as a result all the hardware queues would
>> receive the packets in the same order.
>
> I'm not against your approach, but I'd like to be sure these
> complications are really worth it.  Of course, if my proposal, the
> first take of 3 patches, doesn't work as you predict (and I doubt it
> will), then we can forget about it.
Well, when you get some testing done, let me know.  The main areas I am
concerned with are:
1.  CPU utilization stays the same regardless of which queue is used.
2.  Maintain current qdisc behavior on a per-hardware-queue basis.
3.  Avoid head-of-line blocking where it applies;
        for example: prio band 0 not blocked by band 1, or 1 by 2, etc.,
        or multiq not blocked on any band because one band is blocked.

As long as all 3 criteria are met I would be happy with any solution
provided.

Thanks,

Alex
Jarek Poplawski Sept. 19, 2008, 6:51 p.m. UTC | #7
On Fri, Sep 19, 2008 at 11:01:12AM -0700, Duyck, Alexander H wrote:
> Jarek Poplawski wrote:
> 
> >> Also, changing dequeue_skb will likely cause additional issues for
> >> several qdiscs, since it doesn't report anything up to parent queues;
> >> as a result you will end up with qdiscs like prio acting more like
> >> multiq, because they won't know whether a queue is empty, stopped, or
> >> throttled.  I also believe this will cause a panic in pfifo_fast and
> >> several other qdiscs, because some of them check whether there are
> >> packets in the queue and, if so, dequeue with the assumption that they
> >> will get a packet out.  You will need to add checks to resolve this.
> >
> > I really can't get your point. Don't you mean skb_dequeue()?
> > dequeue_skb() is used only by qdisc_restart()...
> You're right.  I misread it as skb_dequeue.  The problem is, though,
> that you are still relying on q->requeue, which last I knew was only
> being used by a few qdiscs.  In addition, you will still be taking the
> CPU hit for the dequeue/requeue on several qdiscs which can't use
> q->requeue without violating the way they are supposed to work.

I'm not sure I understand your point.  No qdisc will use any new
q->requeue.  All qdiscs do dequeue as before, but on a tx failure the
skb isn't requeued back into the qdisc; it is queued by qdisc_restart()
on a private list (of the root qdisc) and then retried before any new
dequeuing.  There will be no CPU hit, because each retry is only an
skb_peek() on this list, until the tx queue of the skb at the head is
working again.
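
To make it concrete, the shape is roughly this (only a sketch of the
idea, not the actual code from David's patches; q->requeue is the list
name used in this thread):

/* Sketch only: a failed transmit leaves the skb on the root qdisc's private
 * requeue list, and each retry is just an skb_peek() until the skb's tx
 * queue is running again, so there is no dequeue/requeue churn. */
static struct sk_buff *dequeue_skb(struct Qdisc *q)
{
	struct sk_buff *skb = skb_peek(&q->requeue);

	if (skb) {
		struct netdev_queue *txq =
			netdev_get_tx_queue(qdisc_dev(q),
					    skb_get_queue_mapping(skb));

		if (netif_tx_queue_stopped(txq) || netif_tx_queue_frozen(txq))
			return NULL;	/* leave it queued, try again later */

		__skb_unlink(skb, &q->requeue);
		return skb;
	}
	return q->dequeue(q);
}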

> >> The one thing I liked about my approach was that, after I was done,
> >> you could have prio as the parent of multiq leaves or multiq as the
> >> parent of prio leaves, and as a result all the hardware queues would
> >> receive the packets in the same order.
> >
> > I'm not against your approach, but I'd like to be sure these
> > complications are really worth it.  Of course, if my proposal, the
> > first take of 3 patches, doesn't work as you predict (and I doubt it
> > will), then we can forget about it.
> Well, when you get some testing done, let me know.  The main areas I am
> concerned with are:
> 
> 1.  CPU utilization stays the same regardless of which queue is used.
> 2.  Maintain current qdisc behavior on a per-hardware-queue basis.
> 3.  Avoid head-of-line blocking where it applies;
>         for example: prio band 0 not blocked by band 1, or 1 by 2, etc.,
>         or multiq not blocked on any band because one band is blocked.
> 
> As long as all 3 criteria are met I would be happy with any solution
> provided.

Alas, this testing is both the weakest side of this solution (I'm not
going to do it) and the strongest, because it's mostly David's work
(my patch could actually be skipped without much impact).  So I hope
you aren't suggesting it could be insufficiently tested or otherwise
not the best?!

Anyway, I don't insist on "my" solution.  I simply think it's
reasonable, and not for personal reasons, because I didn't even come
up with it myself.  I think if it's so wrong, you should have no
problem showing it in one little test.  And maybe David has changed
his mind in the meantime and will choose your way even without this.

Thanks,
Jarek P.
Duyck, Alexander H Sept. 19, 2008, 9:43 p.m. UTC | #8
Jarek Poplawski wrote:
> On Fri, Sep 19, 2008 at 11:01:12AM -0700, Duyck, Alexander H wrote:
> I'm not sure I understand your point.  No qdisc will use any new
> q->requeue.  All qdiscs do dequeue as before, but on a tx failure the
> skb isn't requeued back into the qdisc; it is queued by qdisc_restart()
> on a private list (of the root qdisc) and then retried before any new
> dequeuing.  There will be no CPU hit, because each retry is only an
> skb_peek() on this list, until the tx queue of the skb at the head is
> working again.

It was me missing a piece.  I didn't look over the portion where Dave
changed requeue to always use the requeue list.  This breaks multiq,
and we are right back to head-of-line blocking.  Also, I think patch 1
will break things, since queueing a packet with skb->next already set
will cause issues, as the queue uses prev and next, last I knew.

> Alas, this testing is both the weakest side of this solution (I'm not
> going to do it) and the strongest, because it's mostly David's work
> (my patch could actually be skipped without much impact).  So I hope
> you aren't suggesting it could be insufficiently tested or otherwise
> not the best?!
>
> Anyway, I don't insist on "my" solution.  I simply think it's
> reasonable, and not for personal reasons, because I didn't even come
> up with it myself.  I think if it's so wrong, you should have no
> problem showing it in one little test.  And maybe David has changed
> his mind in the meantime and will choose your way even without this.

I didn't mean to insist on this, if that is how it came across.  My
main concern is that I just don't want to break what I just got
working.  Most of this requeue-to-q->requeue stuff will break multiq
and introduce head-of-line blocking.  When that happens, it will fall
back into my lap to resolve, so that is why I want to avoid it in the
first place.

Anyway, I'm out for the next two weeks, with at least the first week
of that being without email access.  I hope I provided some valuable
input so that this can head in the right direction.

Thanks,

Alex

Patch

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index b786a5b..4082f39 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -90,10 +90,7 @@  extern void __qdisc_run(struct Qdisc *q);
 
 static inline void qdisc_run(struct Qdisc *q)
 {
-	struct netdev_queue *txq = q->dev_queue;
-
-	if (!netif_tx_queue_stopped(txq) &&

I think there is no reason to do a full dequeue attempt each time instead
of this check, even if we save on requeuing now.  We could try to save
the result of the last dequeue, e.g. a number or a mask of the few