
bond + tc regression ?

Message ID 4A0105A8.3060707@cosmosbay.com
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet May 6, 2009, 3:36 a.m. UTC
Vladimir Ivashchenko wrote:
> On Tue, May 05, 2009 at 08:50:26PM +0200, Eric Dumazet wrote:
> 
>>> I have tried with IRQs bound to one CPU per NIC. Same result.
>> Did you check "grep eth /proc/interrupts" that your affinities setup 
>> were indeed taken into account ?
>>
>> You should use same CPU for eth0 and eth2 (bond0),
>>
>> and another CPU for eth1 and eth3 (bond1)
> 
> Ok, the best result is when assign all IRQs to the same CPU. Zero drops.
> 
> When I bind slaves of bond interfaces to the same CPU, I start to get 
> some drops, but much less than before. I didn't play with combinations.
> 
> My problem is, after applying your accounting patch below, one of my 
> HTB servers reports only 30-40% CPU idle on one of the cores. That won't 
> take me for very long, load balancing across cores is needed.
> 
> Is there any way at least to balance individual NICs on per core basis?
> 

The problem with this setup is that you have four NICs but only two logical devices (bond0
& bond1) and a central HTB instance. This essentially makes all flows go through the same
locks (some rwlocks guarding the bonding driver, and others guarding the HTB structures).

Also, when a CPU receives a frame on ethX, it has to forward it to ethY, and
another lock guards access to the TX queue of the ethY device. If another CPU receives
a frame on ethZ and wants to forward it to ethY, that other CPU will
need the same locks and everything slows down.

I am pretty sure you could get good results by choosing two CPUs that share the same L2
cache. The L2 on your CPU is 6MB. Another point would be to carefully choose the size
of the RX rings on the ethX devices. You could try to *reduce* them so that the number
of in-flight skbs is small enough that everything fits in this 6MB cache.
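For example (an illustrative sketch; eth0 and the value 256 are placeholders, the right size depends on your NIC and traffic):

# show the current and maximum RX/TX ring sizes
ethtool -g eth0
# shrink the RX ring so fewer skbs are in flight at once
ethtool -G eth0 rx 256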

The problem is not really CPU power, but RAM bandwidth. Having two cores instead of one
attached to one central memory bank won't increase RAM bandwidth, it will reduce it.

And making several cores compete for locks on this RAM only slows down processing.

The only choice we have is to change bonding so that the driver uses RCU instead
of rwlocks, but that is probably a complex task. Multiple CPUs accessing the
bonding structures could then share memory structures without dirtying them
and ping-ponging cache lines.

Ah, I forgot about one patch that could help your setup too (if you are using more than one
CPU for the NIC IRQs, of course), queued for 2.6.31

(commit 6a321cb370ad3db4ba6e405e638b3a42c41089b0)

You could post oprofile results to help us find other hot spots.
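A minimal session would look something like this (a sketch assuming the opcontrol-based oprofile tools and a vmlinux with symbols; adjust the path to your build):

opcontrol --vmlinux=/usr/src/linux/vmlinux
opcontrol --start
# ... let the box forward traffic for a minute or two ...
opcontrol --dump
opreport --symbols | head -40
opcontrol --stop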


[PATCH] net: netif_tx_queue_stopped too expensive

netif_tx_queue_stopped(txq) is false most of the time.

Yet it is very expensive on SMP.

static inline int netif_tx_queue_stopped(const struct netdev_queue *dev_queue)
{
	return test_bit(__QUEUE_STATE_XOFF, &dev_queue->state);
}

I saw this while hunting with oprofile in the bnx2 driver's bnx2_tx_int().

We should probably split "struct netdev_queue" into two parts, one
being read-mostly.

__netif_tx_lock() touches _xmit_lock & xmit_lock_owner, these
deserve a separate cache line.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>




Comments

Vladimir Ivashchenko May 6, 2009, 10:28 a.m. UTC | #1
On Wed, May 06, 2009 at 05:36:08AM +0200, Eric Dumazet wrote:

> > Is there any way at least to balance individual NICs on per core basis?
> > 
> 
> Problem of this setup is you have four NICS, but two logical devices (bond0
> & bond1) and a central HTB thing. This essentialy makes flows go through the same
> locks (some rwlocks guarding bonding driver, and others guarding HTB structures).
> 
> Also when a cpu receives a frame on ethX, it has to forward it on ethY, and
> another lock guards access to TX queue of ethY device. If another cpus receives
> a frame on ethZ and want to forward it to ethY device, this other cpu will
> need same locks and everything slowdown.
> 
> I am pretty sure you could get good results choosing two cpus sharing same L2
> cache. L2 on your cpu is 6MB. Another point would be to carefuly choose size
> of RX rings on ethX devices. You could try to *reduce* them so that number
> of inflight skb is small enough that everything fits in this 6MB cache.
> 
> Problem is not really CPU power, but RAM bandwidth. Having two cores instead of one
> attached to one central memory bank wont increase ram bandwidth, but reduce it.

Thanks for the detailed explanation.

On the particular server I reported, I worked around the problem by getting rid of classes 
and switching to ingress policers.

However, I have one central box doing HTB, with a small number of classes but 850 mbps of
traffic. The CPU is a dual-core 5160 @ 3 GHz. With 2.6.29 + bonding I'm experiencing strange problems
with HTB: under high load, borrowing doesn't seem to work properly. This box has two
BNX2 and two E1000 NICs, and for some reason I cannot force the BNX2 IRQs onto a single CPU -
even though I put only one CPU into smp_affinity, it keeps balancing across both. So I cannot
figure out whether it's related to IRQ balancing or not.

[root@tshape3 tshaper]# cat /proc/irq/63/smp_affinity
01
[root@tshape3 tshaper]# cat /proc/interrupts | grep eth0
 63:   44610754   95469129   PCI-MSI-edge      eth0
[root@tshape3 tshaper]# cat /proc/interrupts | grep eth0
 63:   44614125   95472512   PCI-MSI-edge      eth0

lspci -v:

03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
        Subsystem: Hewlett-Packard Company NC373i Integrated Multifunction Gigabit Server Adapter
        Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 63
        Memory at f8000000 (64-bit, non-prefetchable) [size=32M]
        [virtual] Expansion ROM at 88200000 [disabled] [size=2K]
        Capabilities: [40] PCI-X non-bridge device
        Capabilities: [48] Power Management version 2
        Capabilities: [50] Vital Product Data <?>
        Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
        Kernel driver in use: bnx2
        Kernel modules: bnx2


Any ideas on how to force it on a single CPU ?

Thanks for the new patch, I will try it and let you know.
Eric Dumazet May 6, 2009, 10:41 a.m. UTC | #2
Vladimir Ivashchenko wrote:
> On Wed, May 06, 2009 at 05:36:08AM +0200, Eric Dumazet wrote:
> 
>>> Is there any way at least to balance individual NICs on per core basis?
>>>
>> Problem of this setup is you have four NICS, but two logical devices (bond0
>> & bond1) and a central HTB thing. This essentialy makes flows go through the same
>> locks (some rwlocks guarding bonding driver, and others guarding HTB structures).
>>
>> Also when a cpu receives a frame on ethX, it has to forward it on ethY, and
>> another lock guards access to TX queue of ethY device. If another cpus receives
>> a frame on ethZ and want to forward it to ethY device, this other cpu will
>> need same locks and everything slowdown.
>>
>> I am pretty sure you could get good results choosing two cpus sharing same L2
>> cache. L2 on your cpu is 6MB. Another point would be to carefuly choose size
>> of RX rings on ethX devices. You could try to *reduce* them so that number
>> of inflight skb is small enough that everything fits in this 6MB cache.
>>
>> Problem is not really CPU power, but RAM bandwidth. Having two cores instead of one
>> attached to one central memory bank wont increase ram bandwidth, but reduce it.
> 
> Thanks for the detailed explanation.
> 
> On the particular server I reported, I worked around the problem by getting rid of classes 
> and switching to ingress policers.
> 
> However, I have one central box doing HTB, small amount of classes, but 850 mbps of
> traffic. The CPU is dual-core 5160 @ 3 Ghz. With 2.6.29 + bond I'm experiencing strange problems 
> with HTB, under high load borrowing doesn't seem to work properly. This box has two 
> BNX2 and two E1000 NICs, and for some reason I cannot force BNX2 to sit on a single IRQ -
> even though I put only one CPU into smp_affinity, it keeps balancing on both. So I cannot
> figure out if its related to IRQ balancing or not.
> 
> [root@tshape3 tshaper]# cat /proc/irq/63/smp_affinity
> 01
> [root@tshape3 tshaper]# cat /proc/interrupts | grep eth0
>  63:   44610754   95469129   PCI-MSI-edge      eth0
> [root@tshape3 tshaper]# cat /proc/interrupts | grep eth0
>  63:   44614125   95472512   PCI-MSI-edge      eth0
> 
> lspci -v:
> 
> 03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
>         Subsystem: Hewlett-Packard Company NC373i Integrated Multifunction Gigabit Server Adapter
>         Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 63
>         Memory at f8000000 (64-bit, non-prefetchable) [size=32M]
>         [virtual] Expansion ROM at 88200000 [disabled] [size=2K]
>         Capabilities: [40] PCI-X non-bridge device
>         Capabilities: [48] Power Management version 2
>         Capabilities: [50] Vital Product Data <?>
>         Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
>         Kernel driver in use: bnx2
>         Kernel modules: bnx2
> 
> 
> Any ideas on how to force it on a single CPU ?
> 
> Thanks for the new patch, I will try it and let you know.
> 

Yes, it's doable but tricky with bnx2; this is a known problem on recent kernels as well.


You must do, for example (to bind to CPU 0):

echo 1 >/proc/irq/default_smp_affinity

ifconfig eth1 down
# IRQ of eth1 handled by CPU0 only
echo 1 >/proc/irq/34/smp_affinity
ifconfig eth1 up

ifconfig eth0 down
# IRQ of eth0 handled by CPU0 only
echo 1 >/proc/irq/36/smp_affinity
ifconfig eth0 up
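To confirm the binding took effect, sample /proc/interrupts twice; only the CPU0 column of the eth0 line should keep growing:

grep eth0 /proc/interrupts
sleep 5
grep eth0 /proc/interrupts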


One more thing to consider is a BIOS option you might have, labeled "Adjacent Sector Prefetch".

This basically tells your CPU to use 128-byte cache lines instead of 64-byte ones.

In your forwarding workload, I believe this extra prefetch can slow down your machine.


Denys Fedoryshchenko May 6, 2009, 10:49 a.m. UTC | #3
On Wednesday 06 May 2009 13:41:25 Eric Dumazet wrote:
> You must do for example (to bind on CPU 0)
>
> echo 1 >/proc/irq/default_smp_affinity
>
> ifconfig eth1 down
> # IRQ of eth1 handled by CPU0 only
> echo 1 >/proc/irq/34/smp_affinity
> ifconfig eth1 up
>
> ifconfig eth0 down
> # IRQ of eth0 handled by CPU0 only
> echo 1 >/proc/irq/36/smp_affinity
> ifconfig eth0 up
I think it is better to use some method via ethtool that will cause a reset.
When you take the interface down you will lose the default route, so beware of that.
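One defensive sketch (assuming iproute2 and a single default route; the eth0/IRQ numbers are taken from the example above) is to remember the default route and restore it after the down/up cycle:

DEFROUTE="$(ip route | grep '^default')"   # remember the current default route
ifconfig eth0 down
echo 1 >/proc/irq/36/smp_affinity          # IRQ of eth0 handled by CPU0 only
ifconfig eth0 up
ip route replace $DEFROUTE                 # put the default route back
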
>
>
> One thing to consider too is the BIOS option you might have, labeled
> "Adjacent Sector Prefetch"
>
> This basically tells your cpu to use 128 bytes cache lines, instead of 64
>
> In your forwarding worload, I believe this extra prefetch can slowdown your
> machine.
>
>


Vladimir Ivashchenko May 6, 2009, 6:45 p.m. UTC | #4
On Wed, 2009-05-06 at 05:36 +0200, Eric Dumazet wrote:

> Ah, I forgot about one patch that could help your setup too (if using more than one
> cpu on NIC irqs of course), queued for 2.6.31

I have tried the patch. It didn't make a noticeable difference. Under 850
mbps HTB+sfq load, with 2.6.29.1, four NICs / two bond interfaces and IRQ balancing,
the dual-core server has only 25% idle on each CPU.

What's interesting is that with the same 850mbps load on an identical machine, but with
only two NICs and no bonding, HTB+esfq on kernel 2.6.21.2 leaves 60% CPU idle.
That's 2.5x the overhead.

> (commit 6a321cb370ad3db4ba6e405e638b3a42c41089b0)
> 
> You could post oprofile results to help us finding other hot spots.
> 
> 
> [PATCH] net: netif_tx_queue_stopped too expensive
> 
> netif_tx_queue_stopped(txq) is most of the time false.
> 
> Yet its cost is very expensive on SMP.
> 
> static inline int netif_tx_queue_stopped(const struct netdev_queue *dev_queue)
> {
> 	return test_bit(__QUEUE_STATE_XOFF, &dev_queue->state);
> }
> 
> I saw this on oprofile hunting and bnx2 driver bnx2_tx_int().
> 
> We probably should split "struct netdev_queue" in two parts, one
> being read mostly.
> 
> __netif_tx_lock() touches _xmit_lock & xmit_lock_owner, these
> deserve a separate cache line.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 2e7783f..1caaebb 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -447,12 +447,18 @@ enum netdev_queue_state_t
>  };
>  
>  struct netdev_queue {
> +/*
> + * read mostly part
> + */
>  	struct net_device	*dev;
>  	struct Qdisc		*qdisc;
>  	unsigned long		state;
> -	spinlock_t		_xmit_lock;
> -	int			xmit_lock_owner;
>  	struct Qdisc		*qdisc_sleeping;
> +/*
> + * write mostly part
> + */
> +	spinlock_t		_xmit_lock ____cacheline_aligned_in_smp;
> +	int			xmit_lock_owner;
>  } ____cacheline_aligned_in_smp;
>  
>
Denys Fedoryshchenko May 6, 2009, 7:30 p.m. UTC | #5
On Wednesday 06 May 2009 21:45:18 Vladimir Ivashchenko wrote:
> On Wed, 2009-05-06 at 05:36 +0200, Eric Dumazet wrote:
> > Ah, I forgot about one patch that could help your setup too (if using
> > more than one cpu on NIC irqs of course), queued for 2.6.31
>
> I have tried the patch. Didn't make a noticeable difference. Under 850
> mbps HTB+sfq load, 2.6.29.1, four NICs / two bond ifaces, IRQ balancing,
> the dual-core server has only 25% idle on each CPU.
>
> What's interesting, the same 850mbps load, identical machine, but with
> only two NICs and no bond, HTB+esfq, kernel 2.6.21.2 => 60% CPU idle.
> 2.5x overhead.

Probably oprofile can shed some light on this.
In my own experience, IRQ balancing hurts performance a lot because of cache
misses.

Vladimir Ivashchenko May 6, 2009, 8:47 p.m. UTC | #6
On Wed, May 06, 2009 at 10:30:04PM +0300, Denys Fedoryschenko wrote:

> > What's interesting, the same 850mbps load, identical machine, but with
> > only two NICs and no bond, HTB+esfq, kernel 2.6.21.2 => 60% CPU idle.
> > 2.5x overhead.
> 
> Probably oprofile can sched some light on this.
> On my own experience IRQ balancing hurt performance a lot, because of cache 
> misses.

This is a dual-core machine; isn't the cache shared between the cores?

Without IRQ balancing, one of the cores drops to around 10% idle and HTB doesn't do
its job properly. Actually, in my experience HTB stops working properly once
idle goes below 35%.

I'll try gathering some stats using oprofile.
Denys Fedoryshchenko May 6, 2009, 9:46 p.m. UTC | #7
On Wednesday 06 May 2009 23:47:59 Vladimir Ivashchenko wrote:
> On Wed, May 06, 2009 at 10:30:04PM +0300, Denys Fedoryschenko wrote:
> > > What's interesting, the same 850mbps load, identical machine, but with
> > > only two NICs and no bond, HTB+esfq, kernel 2.6.21.2 => 60% CPU idle.
> > > 2.5x overhead.
> >
> > Probably oprofile can sched some light on this.
> > On my own experience IRQ balancing hurt performance a lot, because of
> > cache misses.
>
> This is a dual-core machine, isn't cache shared between the cores?
>
> Without IRQ balancing, one of the cores goes around 10% idle and HTB
> doesn't do its job properly. Actually, in my experience HTB stops working
> properly after idle goes below 35%.
It seems it should be shared. No idea though; more experienced guys will know better.

Can you please show me:
cat /proc/net/psched
If high-resolution timers are working, try adding, as the first line of your HTB script,

HZ=1000
to set the environment variable, because when the clock resolution is high the burst
calculation goes crazy at high speeds.
Maybe it will help.
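The number HZ feeds into is easy to estimate by hand (a back-of-the-envelope sketch; 1 Gbit is just an illustration, and the exact value iproute2 computes also folds in roughly one MTU):

RATE=1000000000                  # 1 Gbit/s, for illustration
HZ=1000                          # effective timer resolution
echo $(( RATE / 8 / HZ ))        # 125000 bytes must be releasable per tick,
                                 # so the burst has to be at least ~125KB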

Also, without IRQ balancing, did you try to assign each interface to a CPU via
smp_affinity? (/proc/irq/NN/smp_affinity)

And I still think the best thing is oprofile. It can show the "hot" places in the code
that are spending CPU cycles.

>
> I'll try gathering some stats using oprofile.


Vladimir Ivashchenko May 8, 2009, 8:46 p.m. UTC | #8
> > Without IRQ balancing, one of the cores goes around 10% idle and HTB
> > doesn't do its job properly. Actually, in my experience HTB stops working
> > properly after idle goes below 35%.
> It seems they should. No idea, more experienced guys should know more.
> 
> Can you show me please
> cat /proc/net/psched
> If it is highres working, try to add in HTB script, first line
> 
> HZ=1000
> to set environment variable. Because if clock resolution high, burst 
> calculation going crazy on high speeds.
> Maybe it will help.

Wow, instead of a 98425b burst, it's calculating 970203b.

Exporting HZ=1000 doesn't help. However, even if I recompile the kernel
to 1000 Hz and the burst is calculated correctly, for some reason HTB on
2.6.29 is still worse at rate control than 2.6.21.

With 2.6.21, ceil of 775 mbits, burst 99425b -> actual rate 825 mbits.
With 2.6.29, same ceil/burst -> actual rate 890 mbits.

Moreover, after I stop the traffic *COMPLETELY* on 2.6.29, the actual rate
reported by HTB goes ballistic and stays at 1100mbits. Then it drops
back to the expected value after a minute or so.

> Also without irq balance, did you try to assign interface to cpu by 
> smp_affinity? (/proc/irq/NN/smp_affinity)

Yes, I did; it didn't make any difference.

> And still i think best thing is oprofile. It can show "hot" places in code, 
> who is spending cpu cycles.

For some reason I get a hard freeze when I start the oprofile daemon, even
without traffic. I've never used oprofile before, so I'm not sure if I'm
doing something wrong ... I'm starting it with just the --vmlinux parameter
and nothing else. I use vanilla 2.6.29 and oprofile from FC8.
Denys Fedoryshchenko May 8, 2009, 9:05 p.m. UTC | #9
On Friday 08 May 2009 23:46:11 Vladimir Ivashchenko wrote:
> > > Without IRQ balancing, one of the cores goes around 10% idle and HTB
> > > doesn't do its job properly. Actually, in my experience HTB stops
> > > working properly after idle goes below 35%.
> >
> > It seems they should. No idea, more experienced guys should know more.
> >
> > Can you show me please
> > cat /proc/net/psched
> > If it is highres working, try to add in HTB script, first line
> >
> > HZ=1000
> > to set environment variable. Because if clock resolution high, burst
> > calculation going crazy on high speeds.
> > Maybe it will help.
>
> Wow, instead of 98425b burst, its calculating 970203b.
Kind of a strange burst; something is wrong there. For 1000 HZ and 1 Gbit it should
be 126375b. Your value is for 8 Gbit/s.
What version of iproute2 are you using (tc -V)?

>
> Exporting HZ=1000 doesn't help. However, even if I recompile the kernel
> to 1000 Hz and the burst is calculated correctly, for some reason HTB on
> 2.6.29 is still worse at rate control than 2.6.21.
>
> With 2.6.21, ceil of 775 mbits, burst 99425b -> actual rate 825 mbits.
> With 2.6.29, same ceil/burst -> actual rate 890 mbits.
It also depends on whether there are child classes, what bursts are set for them, and
what ceil/burst is set for them.

>
> Moreover, after I stop the traffic *COMPLETELY* on 2.6.29, actual rate
> reported by htb goes ballistic and stays at 1100mbits. Then it drops
> back to expected value after a minute or so.
It is the average bandwidth over some period, not a real-time value.

>
> > Also without irq balance, did you try to assign interface to cpu by
> > smp_affinity? (/proc/irq/NN/smp_affinity)
>
> Yes, I did, didn't make any difference.
What is the clock source?
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
And the timer resolution?
cat /proc/net/psched

>
> > And still i think best thing is oprofile. It can show "hot" places in
> > code, who is spending cpu cycles.
>
> For some reason I get a hard freeze when I start oprofile daemon, even
> without traffic. Never used oprofile before, so I'm not sure if I'm
> doing something wrong ... I'm starting it just with --vmlinux parameter
> and nothing else. I use vanilla 2.6.29 and oprofile from FC8.


Vladimir Ivashchenko May 8, 2009, 10:07 p.m. UTC | #10
> > Wow, instead of 98425b burst, its calculating 970203b.
> Kind of strange burst, something wrong there. For 1000HZ and 1 Gbit it should 
> be 126375b. You value is for 8Gbit/s.
> What version of iproute2 you are using ( tc -V )?

That was iproute2-ss080725; I think it is confused by tickless mode.
With iproute2-ss090324 I'm getting the opposite extreme: 1589b :)

> >
> > With 2.6.21, ceil of 775 mbits, burst 99425b -> actual rate 825 mbits.
> > With 2.6.29, same ceil/burst -> actual rate 890 mbits.
> It depends also if there is child classes, what is bursts set for them, and 
> what is ceil/burst set for them.

All child classes have smaller bursts than the parent. However, there are two
sub-classes which have a ceil at 70% of the parent, i.e. ~500mbit each. I
don't know the HTB internals; perhaps these two classes make the parent class
overstretch itself.

By the way, I experience the same "overstretching" with hfsc. In any case, 
I prefer HTB because it reports statistics of parent classes, unlike hfsc.

> > Moreover, after I stop the traffic *COMPLETELY* on 2.6.29, actual rate
> > reported by htb goes ballistic and stays at 1100mbits. Then it drops
> > back to expected value after a minute or so.
> It is average bandwidth for some period, it is not realtime value. 

But why would it jump from 850mbits to 1200mbits *AFTER* I remove all
the traffic?

> > Yes, I did, didn't make any difference.
> What is a clock source?
> cat /sys/devices/system/clocksource/clocksource0/current_clocksource

tsc

> Timer resolution?
> cat /proc/net/psched

With tickless kernel:

000003e8 00000400 000f4240 3b9aca00
Denys Fedoryshchenko May 8, 2009, 10:42 p.m. UTC | #11
On Saturday 09 May 2009 01:07:27 Vladimir Ivashchenko wrote:
> > > Wow, instead of 98425b burst, its calculating 970203b.
> >
> > Kind of strange burst, something wrong there. For 1000HZ and 1 Gbit it
> > should be 126375b. You value is for 8Gbit/s.
> > What version of iproute2 you are using ( tc -V )?
>
> That was iproute2-ss080725, I think it is confused by tickless mode.
> With iproute2-ss090324 I'm getting an opposite: 1589b :)
And that is too low. That's why I set HZ=1000.
>
>
> All child classes have smaller bursts than the parent. However, there are
> two sub-classes which have ceil at 70% of parent, e.g. ~500mbit each. I
> don't know HTB internals, perhaps these two classes make the parent class
> overstretch itself.
As I remember, it is important to keep the sum of the child rates lower than or equal to the parent rate.
And of course the ceil of the children must not exceed the ceil of the parent.
Sometimes I got into a mess when I tried to play with the quantum value. After all that
I switched to HFSC, which works flawlessly for me. Maybe we should give more
attention to the HTB problem at high speeds and help the kernel developers spot
the problem, if there is any.
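A minimal sketch of that rule, reusing the 775mbit figure from this thread (the class ids, the child split, the default class and leaving bursts at their defaults are all made up for illustration):

tc qdisc add dev bond0 root handle 1: htb default 20
tc class add dev bond0 parent 1:  classid 1:2  htb rate 775mbit ceil 775mbit
# child rates sum to the parent rate, child ceils stay below the parent ceil
tc class add dev bond0 parent 1:2 classid 1:10 htb rate 385mbit ceil 500mbit
tc class add dev bond0 parent 1:2 classid 1:20 htb rate 390mbit ceil 500mbit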

>
> By the way, I experience the same "overstretching" with hfsc. In any case,
> I prefer HTB because it reports statistics of parent classes, unlike hfsc.
Sometimes this happens when some offloading is enabled on the devices.
Check with ethtool -k device.

I think everything except rx/tx checksumming has to be off, at least for a
test.

Disable them with "ethtool -K device tso off", for example.
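For instance (eth0 and the exact flag names are assumptions; what is available varies per driver and ethtool version):

ethtool -k eth0                          # list the current offload settings
ethtool -K eth0 tso off gso off sg off   # keep rx/tx checksumming, turn the rest off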


>
> But why it would it jump from 850mbits to 1200mbits *AFTER* I remove all
> the traffic ?
>
Well, I don't know how it does the averaging, maybe even over 1 minute.
I don't like it at all, and that's why I prefer HFSC. But HTB works very well in
some setups.
Vladimir Ivashchenko May 17, 2009, 6:46 p.m. UTC | #12
> > All child classes have smaller bursts than the parent. However, there are
> > two sub-classes which have ceil at 70% of parent, e.g. ~500mbit each. I
> > don't know HTB internals, perhaps these two classes make the parent class
> > overstretch itself.
> As i remember important to keep sum of child rates lower or equal parent rate.
> Sure ceil of childs must not exceed ceil of parent.
> Sometimes i had mess, when i tried to play with quantum value. After all that 
> i switched to HFSC which works for me flawlessly. Maybe we should give more 
> attention to HTB problem with high speeds and help kernel developers spot 
> problem, if there is any.

In the case of HFSC my problem is even worse. With a 775mbit ceiling
configured, it is passing over 900mbit in reality. Moreover, not having
statistics for parent classes makes it difficult to troubleshoot :( I'm
100% sure that it is 900 mbps; I see it on the switch.

Attached is "tc -s -d class show dev bond0" output.

To calculate total traffic rate:

$ cat hfsc-stat.txt | grep rate | grep Kbit | sed 's/Kbit//' | awk
'{ a=a+$2; } END { print a; }'
906955

Did I misconfigure something?... How can hfsc go above 775mbit when
everything goes via class 1:2 with a 775mbit rate & ul?

> > By the way, I experience the same "overstretching" with hfsc. In any case,
> > I prefer HTB because it reports statistics of parent classes, unlike hfsc.
> Sometimes it happen when some offloading enabled on devices.
> Check ethtool -k device

Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off


> I think everything except rx/tx checksumming have to be off, at least for 
> test.
> 
> Disable them by "ethtool -K device tso off " for example.

Doesn't help.
Jarek Poplawski May 18, 2009, 8:51 a.m. UTC | #13
On 17-05-2009 20:46, Vladimir Ivashchenko wrote:
>>> All child classes have smaller bursts than the parent. However, there are
>>> two sub-classes which have ceil at 70% of parent, e.g. ~500mbit each. I
>>> don't know HTB internals, perhaps these two classes make the parent class
>>> overstretch itself.
>> As i remember important to keep sum of child rates lower or equal parent rate.
>> Sure ceil of childs must not exceed ceil of parent.
>> Sometimes i had mess, when i tried to play with quantum value. After all that 
>> i switched to HFSC which works for me flawlessly. Maybe we should give more 
>> attention to HTB problem with high speeds and help kernel developers spot 
>> problem, if there is any.
> 
> In case of HFSC my problem is even worse. With 775mbit ceiling
> configured it is passing over 900mbit in reality. Moreover not having
> statistics for parent classes makes it difficult to troubleshoot :( I'm
> 100% sure that it is 900 mbps, I see this on the switch.
> 
> Attached is "tc -s -d class show dev bond0" output.
> 
> To calculate total traffic rate:
> 
> $ cat hfsc-stat.txt | grep rate | grep Kbit | sed 's/Kbit//' | awk
> '{ a=a+$2; } END { print a; }'
> 906955
> 
> Did I misconfigure something ?... How can hfsc go above 775mbit when
> everything goes via class 1:2 with 775mbit rate & ul ?

Maybe... It's a lot to check - it seems simpler test cases could
show the real problem. Anyway, it looks like the sum of the m2 values of the 1:2
children is more than 775Mbit.
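A very rough way to check that against the attached dump (a sketch: it blindly sums every m2 figure in the file, so it will overcount where a class specifies both sc and ul curves, and it assumes the values are printed in Mbit):

awk '{ for (i = 1; i < NF; i++)
         if ($i == "m2" && $(i+1) ~ /Mbit$/) { v = $(i+1); sub(/Mbit$/, "", v); sum += v } }
     END { print sum, "Mbit total m2" }' hfsc-stat.txt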


>>> By the way, I experience the same "overstretching" with hfsc. In any case,
>>> I prefer HTB because it reports statistics of parent classes, unlike hfsc.
>> Sometimes it happen when some offloading enabled on devices.
>> Check ethtool -k device
> 
> Offload parameters for eth0:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp segmentation offload: off
> udp fragmentation offload: off

Current versions of ethtool should show "generic segmentation offload"
too.

I hope you've read the nearby thread "HTB accuracy for high speed",
which at least partially explains some problems/bugs, and maybe you'll
try some of the patches too (at least one of them addresses the problem you've
reported). Anyway, unless you find hfsc works better for you, I'd be
more interested in tracking this down with HTB test cases.

Thanks,
Jarek P.

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2e7783f..1caaebb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -447,12 +447,18 @@  enum netdev_queue_state_t
 };
 
 struct netdev_queue {
+/*
+ * read mostly part
+ */
 	struct net_device	*dev;
 	struct Qdisc		*qdisc;
 	unsigned long		state;
-	spinlock_t		_xmit_lock;
-	int			xmit_lock_owner;
 	struct Qdisc		*qdisc_sleeping;
+/*
+ * write mostly part
+ */
+	spinlock_t		_xmit_lock ____cacheline_aligned_in_smp;
+	int			xmit_lock_owner;
 } ____cacheline_aligned_in_smp;