
rps performance WAS(Re: rps: question

Message ID 1271424065.4606.31.camel@bigi
State RFC, archived
Delegated to: David Miller

Commit Message

jamal April 16, 2010, 1:21 p.m. UTC
On Fri, 2010-04-16 at 07:18 +0200, Eric Dumazet wrote:

> 
> A kernel module might do this, this could be integrated in perf bench so
> that we can regression tests upcoming kernels.

Perf would be good - but even a softnet_stat counter, cleaner than the nasty
hack i use (attached), would be a good start; the ping with and without
rps gives me a ballpark number.

IPI is important to me because i tried it before and it failed
miserably. I was thinking the improvement may be due to the hardware used,
but i am having a hard time getting people to tell me what hardware they
used! I am old school - I need data;-> The RFS patch commit seems to
have more info but is still vague, for example:
"The benefits of RFS are dependent on cache hierarchy, application
load, and other factors"
Also, what does a "simple" or "complex" benchmark mean?;->
I think it is only fair to get this info, no?

Please dont consider what i say above as being anti-RPS.
5 microsec extra latency is not bad if it can be amortized.
Unfortunately, the best traffic i could generate was < 20Kpps of
ping, which still manages to get 1 IPI/packet on Nehalem. I am going
to write up some app (lots of cycles available tomorrow). I still think
it is valuable.

cheers,
jamal

Comments

Changli Gao April 16, 2010, 1:34 p.m. UTC | #1
On Fri, Apr 16, 2010 at 9:21 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-04-16 at 07:18 +0200, Eric Dumazet wrote:
>
>>
>> A kernel module might do this, this could be integrated in perf bench so
>> that we can regression tests upcoming kernels.
>
> Perf would be good - but even softnet_stat cleaner than the the nasty
> hack i use (attached) would be a good start; the ping with and without
> rps gives me a ballpark number.
>
> IPI is important to me because having tried it before it and failed
> miserably. I was thinking the improvement may be due to hardware used
> but i am having a hard time to get people to tell me what hardware they
> used! I am old school - I need data;-> The RFS patch commit seems to
> have more info but still vague, example:
> "The benefits of RFS are dependent on cache hierarchy, application
> load, and other factors"
> Also, what does a "simple" or "complex" benchmark mean?;->
> I think it is only fair to get this info, no?
>
> Please dont consider what i say above as being anti-RPS.
> 5 microsec extra latency is not bad if it can be amortized.
> Unfortunately, the best traffic i could generate was < 20Kpps of
> ping which still manages to get 1 IPI/packet on Nehalem. I am going
> to write up some app (lots of cycles available tommorow). I still think
> it is valueable.
>

+	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
 		   s->total, s->dropped, s->time_squeeze, 0,
 		   0, 0, 0, 0, /* was fastroute */
-		   s->cpu_collision, s->received_rps);
+		   s->cpu_collision, s->received_rps, s->ipi_rps);

Do you mean that received_rps is equal to ipi_rps? received_rps is the
number of IPIs used by RPS, and ipi_rps is the number of IPIs sent by
generic_exec_single(). If there isn't another user of
generic_exec_single(), received_rps should be equal to ipi_rps.

@@ -158,7 +159,10 @@ void generic_exec_single(int cpu, struct call_single_data *data, int wait)
 	 * equipped to do the right thing...
 	 */
 	if (ipi)
+{
 		arch_send_call_function_single_ipi(cpu);
+		__get_cpu_var(netdev_rx_stat).ipi_rps++;
+}
jamal April 16, 2010, 1:49 p.m. UTC | #2
On Fri, 2010-04-16 at 21:34 +0800, Changli Gao wrote:

> 
> +	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
>  		   s->total, s->dropped, s->time_squeeze, 0,
>  		   0, 0, 0, 0, /* was fastroute */
> -		   s->cpu_collision, s->received_rps);
> +		   s->cpu_collision, s->received_rps, s->ipi_rps);
> 
> Do you mean that received_rps is equal to ipi_rps? received_rps is the
> number of IPI used by RPS. And ipi_rps is the number of IPIs sent by
> function generic_exec_single(). If there isn't other user of
> generic_exec_single(), received_rps should be equal to ipi_rps.
> 

my observation is:
s->total is the sum of all packets received by a cpu (some directly from
ethernet).
s->received_rps was the count the receiver cpu saw of incoming packets
that were sent to it by another cpu.
s->ipi_rps is the number of times we tried to enqueue to a remote cpu,
found its queue to be empty, and had to send an IPI.
ipi_rps can be < received_rps if we receive > 1 packet without
generating an IPI. What did i miss?

cheers,
jamal

Changli Gao April 16, 2010, 2:10 p.m. UTC | #3
On Fri, Apr 16, 2010 at 9:49 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-04-16 at 21:34 +0800, Changli Gao wrote:
>
>
> my observation is:
> s->total is the sum of all packets received by cpu (some directly from
> ethernet)

It is meaningless currently. If rps is enabled, it may be twice the
number of packets received, because one packet may be counted twice:
once in enqueue_to_backlog(), and once in __netif_receive_skb(). I
had posted a patch to solve this problem.

http://patchwork.ozlabs.org/patch/50217/

If you don't apply my patch, you'd better refer to /proc/net/dev for
the total number.

> s->received_rps was what the count receiver cpu saw incoming if they
> were sent by another cpu.

Maybe its name confused you.

/* Called from hardirq (IPI) context */
static void trigger_softirq(void *data)
{
        struct softnet_data *queue = data;
        __napi_schedule(&queue->backlog);
        __get_cpu_var(netdev_rx_stat).received_rps++;
}

The function above is called in the hardirq context of the IPI. It counts
the number of IPIs received. It is actually the ipi_rps you need.

> s-> ipi_rps is the times we tried to enq to remote cpu but found it to
> be empty and had to send an IPI.
> ipi_rps can be < received_rps if we receive > 1 packet without
> generating an IPI. What did i miss?
>
jamal April 16, 2010, 2:43 p.m. UTC | #4
On Fri, 2010-04-16 at 22:10 +0800, Changli Gao wrote:
> On Fri, Apr 16, 2010 at 9:49 PM, jamal <hadi@cyberus.ca> wrote:

> > my observation is:
> > s->total is the sum of all packets received by cpu (some directly from
> > ethernet)
> 
> It is meaningless currently. If rps is enabled, it may be twice of the
> number of the packets received, because one packet may be count twice:
> one in enqueue_to_backlog(), and the other in __netif_receive_skb(). 

You are probably right - you made me look at my collected data ;->
i will look closely later, but it seems they are accounting for
different cpus, no?
For example, attached are some of the stats i captured when i was running
the tests redirecting 1M packets from CPU0 to CPU1 at about 20Kpps (cut
to just the first and last two columns):

cpu   Total     |rps_recv |rps_ipi
-----+----------+---------+---------
cpu0 | 002dc7f1 |00000000 |000f4246
cpu1 | 002dc804 |000f4240 |00000000
-------------------------------------

So: cpu0 received 0x2dc7f1 pkts accumulated over time and
redirected them to cpu1 (mostly; the extra 5 are maybe leftover since i clear
the data), and for this test it generated an IPI 0xf4246 times. It can be
seen that the running total for CPU1 is 0x2dc804, but in this one run it
received 1M packets (0xf4240).
i.e i dont see the double accounting..

cheers,
jamal
002dc7f1 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000f4246
002dc804 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000f4240 00000000
00000004 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000004 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000006 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0000003c 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000004 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0000003e 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Changli Gao April 16, 2010, 2:58 p.m. UTC | #5
On Fri, Apr 16, 2010 at 10:43 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-04-16 at 22:10 +0800, Changli Gao wrote:
>> On Fri, Apr 16, 2010 at 9:49 PM, jamal <hadi@cyberus.ca> wrote:
>
>> > my observation is:
>> > s->total is the sum of all packets received by cpu (some directly from
>> > ethernet)
>>
>> It is meaningless currently. If rps is enabled, it may be twice of the
>> number of the packets received, because one packet may be count twice:
>> one in enqueue_to_backlog(), and the other in __netif_receive_skb().
>
> You are probably right - you made me look at my collected data ;->
> i will look closely later, but it seems they are accounting for
> different cpus, no?
> Example, attached are some of the stats i captured when i was running
> the tests redirecting from CPU0 to CPU1 1M packets at about 20Kpps (just
> cut to the first and last two columns):
>
> cpu   Total     |rps_recv |rps_ipi
> -----+----------+---------+---------
> cpu0 | 002dc7f1 |00000000 |000f4246
> cpu1 | 002dc804 |000f4240 |00000000
> -------------------------------------
>
> So: cpu0 receive 0x2dc7f1 pkts accummulative over time and
> redirected to cpu1 (mostly, the extra 5 maybe to leftover since i clear
> the data) and for the test 0xf4246 times it generated an IPI. It can be
> seen that total running for CPU1 is 0x2dc804 but in this one run it
> received 1M packets (0xf4240).

I remember you redirected all the traffic from cpu0 to cpu1, and the data shows:

about 0x2dc7f1 packets were processed, and about 0xf4240 IPIs were generated.

> i.e i dont see the double accounting..
>

a single packet is counted twice, by CPU0 and by CPU1. If you change the RPS setting with:

echo 1 > ..../rps_cpus

you will find the total numbers are doubled.
Eric Dumazet April 17, 2010, 7:35 a.m. UTC | #6
On Fri, 2010-04-16 at 09:21 -0400, jamal wrote:
> On Fri, 2010-04-16 at 07:18 +0200, Eric Dumazet wrote:
> 
> > 
> > A kernel module might do this, this could be integrated in perf bench so
> > that we can regression tests upcoming kernels.
> 
> Perf would be good - but even softnet_stat cleaner than the the nasty
> hack i use (attached) would be a good start; the ping with and without
> rps gives me a ballpark number.
> 
> IPI is important to me because having tried it before it and failed
> miserably. I was thinking the improvement may be due to hardware used
> but i am having a hard time to get people to tell me what hardware they
> used! I am old school - I need data;-> The RFS patch commit seems to
> have more info but still vague, example: 
> "The benefits of RFS are dependent on cache hierarchy, application
> load, and other factors"
> Also, what does a "simple" or "complex" benchmark mean?;->
> I think it is only fair to get this info, no?
> 
> Please dont consider what i say above as being anti-RPS.
> 5 microsec extra latency is not bad if it can be amortized.
> Unfortunately, the best traffic i could generate was < 20Kpps of
> ping which still manages to get 1 IPI/packet on Nehalem. I am going
> to write up some app (lots of cycles available tommorow). I still think
> it is valueable.

I did some tests on a dual quad core machine (E5450 @ 3.00GHz), not
nehalem, so a 3-4 year old design.

For all tests, I use the best time of 3 runs of "ping -f -q -c 100000
192.168.0.2". Yes, ping is not very good, but it's available ;)

Note: I make sure all 8 cpus of target are busy, eating cpu cycles in
user land. I dont want to tweak acpi or whatever smart power saving
mechanisms.

When RPS off
100000 packets transmitted, 100000 received, 0% packet loss, time 4160ms

RPS on, but directed on the cpu0 handling device interrupts (tg3, napi)
(echo 01 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
100000 packets transmitted, 100000 received, 0% packet loss, time 4234ms

So the cost of queueing the packet into our own queue (netif_receive_skb
-> enqueue_to_backlog) is about 0.74 us (74 ms / 100000).

I personally think we should process the packet instead of queueing it, but
Tom disagrees with me.

RPS on, directed on cpu1 (other socket)
(echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms

So the extra cost to enqueue to a remote cpu queue (IPI, softirq handling, ...)
is 3 us. Note this cost is for the case where we receive a single packet.

I suspect the IPI itself is in the 1.5 us range, not very far from the
queueing-to-ourselves case.

For me RPS use cases are :

1) Value-added apps handling lots of TCP data, where the cost of cache
misses in the tcp stack easily justifies spending 3 us to gain much more.

2) Network appliance, where a single cpu is filled 100% to handle one
device's hardware and software/RPS interrupts, delegating all higher-level
work to a pool of cpus.

I'll try to do these tests on a Nehalem target.



Tom Herbert April 17, 2010, 8:43 a.m. UTC | #7
> So the cost of queing the packet into our own queue (netif_receive_skb
> -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
>
> I personally think we should process packet instead of queeing it, but
> Tom disagree with me.
>
You could do that, but then the packet processing becomes HOL blocking
on all the packets that are being sent to other queues for
processing -- remember the IPIs are only sent at the end of the NAPI poll.
So unless the upper stack processing is <0.74us in your case, I think
processing packets directly on the local queue would improve best-case
latency, but would increase average latency and, even more likely, worst-case
latency on loads with multiple flows.

> RPS on, directed on cpu1 (other socket)
> (echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms
>
> So extra cost to enqueue to a remote cpu queue, IPI, softirq handling...
> is 3 us. Note this cost is in case we receive a single packet.
>
> I suspect IPI itself is in the 1.5 us range, not very far from the
> queing to ourself case.
>
> For me RPS use cases are :
>
> 1) Value added apps handling lot of TCP data, where the costs of cache
> misses in tcp stack easily justify to spend 3 us to gain much more.
>
> 2) Network appliance, where a single cpu is filled 100% to handle one
> device hardware and software/RPS interrupts, delegating all higher level
> works to a pool of cpus.
>
> I'll try to do these tests on a Nehalem target.
>
>
>
>
Eric Dumazet April 17, 2010, 9:23 a.m. UTC | #8
On Sat, 2010-04-17 at 01:43 -0700, Tom Herbert wrote:
> > So the cost of queing the packet into our own queue (netif_receive_skb
> > -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
> >
> > I personally think we should process packet instead of queeing it, but
> > Tom disagree with me.
> >
> You could do that, but then the packet processing becomes HOL blocking
> on all the packets that are being sent to other queues for
> processing-- remember the IPIs is only sent at the end of the NAPI.
> So unless the upper stack processing is <0.74us in your case, I think
> processing packets directly on the local queue would improve best case
> latency, but would increase average latency and even more likely worse
> case latency on loads with multiple flows.

Anyway, a big part of this 0.74 us overhead comes from get_rps_cpu()
itself, computing skb->rxhash and all. We should review how many
cache lines we exchange per skb, and try to reduce this number.
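To make the per-packet work concrete, here is a rough userspace sketch of
what get_rps_cpu() has to do for every skb - hash the flow tuple, then index
the configured rps map with the hash. This is only an illustration written
for this discussion, not the kernel code: the real function parses the
IP/TCP/UDP headers and uses jhash_3words() with a random seed, while the
mixing function below is just a stand-in so the example compiles on its own.

#include <stdint.h>
#include <stdio.h>

/* stand-in for the kernel's jhash_3words(addr1, addr2, ports, hashrnd) */
static uint32_t mix3(uint32_t a, uint32_t b, uint32_t c)
{
	a ^= b * 0x9e3779b1u;
	a ^= c * 0x85ebca6bu;
	a ^= a >> 16;
	return a * 0xc2b2ae35u;
}

/* cpus[]/len model the map configured via
   /sys/class/net/<dev>/queues/rx-0/rps_cpus */
static int pick_cpu(uint32_t saddr, uint32_t daddr, uint32_t ports,
		    const int *cpus, unsigned int len)
{
	uint32_t hash = mix3(saddr, daddr, ports);

	/* scale the hash into [0, len) without a modulo */
	return cpus[((uint64_t)hash * len) >> 32];
}

int main(void)
{
	const int cpus[] = { 1, 2, 3 };          /* allowed target cpus */
	uint32_t saddr = 0x0a000001;             /* 10.0.0.1 */
	uint32_t daddr = 0x0a000002;             /* 10.0.0.2 */
	uint32_t ports = (53000u << 16) | 5001;  /* sport:dport */

	printf("target cpu: %d\n", pick_cpu(saddr, daddr, ports, cpus, 3));
	return 0;
}

Every header field touched here means pulling packet cache lines onto the
demuxing cpu, which is where the per-skb cost above comes from.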



Eric Dumazet April 17, 2010, 2:27 p.m. UTC | #9
On Sat, 2010-04-17 at 11:23 +0200, Eric Dumazet wrote:
> On Sat, 2010-04-17 at 01:43 -0700, Tom Herbert wrote:
> > > So the cost of queing the packet into our own queue (netif_receive_skb
> > > -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
> > >
> > > I personally think we should process packet instead of queeing it, but
> > > Tom disagree with me.
> > >
> > You could do that, but then the packet processing becomes HOL blocking
> > on all the packets that are being sent to other queues for
> > processing-- remember the IPIs is only sent at the end of the NAPI.
> > So unless the upper stack processing is <0.74us in your case, I think
> > processing packets directly on the local queue would improve best case
> > latency, but would increase average latency and even more likely worse
> > case latency on loads with multiple flows.


Tom, I am not sure what you describe even holds for NAPI devices.
(I hope you use napi devices in your company ;) )

If we enqueue a skb to the backlog, we also link our backlog napi into our
poll_list, if it is not already there.

So the loop in net_rx_action() will make us handle our backlog napi a
bit after this network device napi (if the time limit of 2 jiffies has not
elapsed) and *before* sending IPIs to remote cpus anyway.




Tom Herbert April 17, 2010, 5:26 p.m. UTC | #10
> Tom, I am not sure what you describe is even respected for NAPI devices.
> (I hope you use napi devices in your company ;) )
>
> If we enqueue a skb to backlog, we also link our backlog napi into our
> poll_list, if not already there.
>
> So the loop in net_rx_action() will make us handle our backlog napi a
> bit after this network device napi (if time limit of 2 jiffies not
> elapsed) and *before* sending IPIS to remote cpus anyway.
>
Then I think that's a bug you've identified ;-)

>
>
>
>
jamal April 17, 2010, 5:31 p.m. UTC | #11
On Sat, 2010-04-17 at 09:35 +0200, Eric Dumazet wrote:

> I did some tests on a dual quad core machine (E5450  @ 3.00GHz), not
> nehalem. So a 3-4 years old design.

Eric, I thank you kind sir for going out of your way to do this - it is
certainly a good processor to compare against 

> For all test, I use the best time of 3 runs of "ping -f -q -c 100000
> 192.168.0.2". Yes ping is not very good, but its available ;)

It is a reasonable quick test, no fancy setup required ;->

> Note: I make sure all 8 cpus of target are busy, eating cpu cycles in
> user land. 

I didnt keep the cpus busy. I should re-run with such a setup; any
specific app that you used to keep them busy? Keeping them busy could
have consequences; I am speculating you probably ended up with a greater
than one packet/IPI ratio, i.e an amortization benefit..
  
> I dont want to tweak acpi or whatever smart power saving
> mechanisms.

I should mention i turned off acpi as well in the bios; it was consuming
more cpu cycles than net-processing and was interfering in my tests.

> When RPS off
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4160ms
> 
> RPS on, but directed on the cpu0 handling device interrupts (tg3, napi)
> (echo 01 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4234ms
> 
> So the cost of queing the packet into our own queue (netif_receive_skb
> -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
> 

Excellent analysis.

> I personally think we should process packet instead of queeing it, but
> Tom disagree with me.

Sorry - I am gonna have to turn on some pedagogy and offer my
Canadian 2 cents;->
I would lean toward agreeing with Tom, but maybe go one step further (sans
packet-reordering): we should never process packets up to the socket layer
on the demuxing cpu.
Enqueue everything you receive on a different cpu - so somehow the receiving
cpu becomes part of the hashing decision ...

The reason is derived from queueing theory - of which i know dangerously
little - but i refer you to mr. little his-self[1] (pun fully
intended;->):
i.e a fixed serving time provides more predictable results, as opposed to
an occasional spike as you receive packets destined to "our cpu".
Queueing packets and later allocating cycles to processing them adds
variability, but is not as bad as processing to completion up to the socket
layer.

> RPS on, directed on cpu1 (other socket)
> (echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms

Good test - it should be the worst case scenario. But there are two other
scenarios which will give different results in my opinion.
On your setup i think each socket has two dies, each with two cores. So
my feeling is you will get different numbers if you go within the same die
and across dies within the same socket. If i am not mistaken, the mapping
would be something like socket0/die0{core0/2}, socket0/die1{core4/6},
socket1/die0{core1/3}, socket1/die1{core5/7}.
If you have cycles, can you try the same socket+die but different cores
and the same socket but different die test?

> So extra cost to enqueue to a remote cpu queue, IPI, softirq handling...
> is 3 us. Note this cost is in case we receive a single packet.

Which is not too bad if amortized. Were you able to check if you
processed one packet per IPI? One way to achieve that is just standard ping.
On the nehalem my number for going to a different core was in the range
of a 5 microsecond effect on RTT when the system was not busy. I think it
would be higher going across QPI.

> I suspect IPI itself is in the 1.5 us range, not very far from the
> queing to ourself case.

Sounds about right - maybe 2 us in my case. I am still mystified by "what
damage does an IPI do?" to the system harmony. I have to do some
reading. Andi mentioned the APIC connection - but my gut feeling is you
probably end up going to main memory and invalidating cache.

> For me RPS use cases are :
> 
> 1) Value added apps handling lot of TCP data, where the costs of cache
> misses in tcp stack easily justify to spend 3 us to gain much more.
> 
> 2) Network appliance, where a single cpu is filled 100% to handle one
> device hardware and software/RPS interrupts, delegating all higher level
> works to a pool of cpus.
> 

Agreed on both. 
The caveat to note:
- what hardware would be reasonable
- within same hardware what setups would be good to use 
- when it doesnt benefit even with everything correct (eg low tcp
throughput)

> I'll try to do these tests on a Nehalem target.

Thanks again Eric.

cheers,
jamal 

[1]http://en.wikipedia.org/wiki/Little's_law

Eric Dumazet April 18, 2010, 9:39 a.m. UTC | #12
On Sat, 2010-04-17 at 13:31 -0400, jamal wrote:
> On Sat, 2010-04-17 at 09:35 +0200, Eric Dumazet wrote:
> 
> > I did some tests on a dual quad core machine (E5450  @ 3.00GHz), not
> > nehalem. So a 3-4 years old design.
> 
> Eric, I thank you kind sir for going out of your way to do this - it is
> certainly a good processor to compare against 
> 
> > For all test, I use the best time of 3 runs of "ping -f -q -c 100000
> > 192.168.0.2". Yes ping is not very good, but its available ;)
> 
> It is a reasonable quick test, no fancy setup required ;->
> 
> > Note: I make sure all 8 cpus of target are busy, eating cpu cycles in
> > user land. 
> 
> I didnt keep the cpus busy. I should re-run with such a setup, any
> specific app that you used to keep them busy? Keeping them busy could
> have consequences;  I am speculating you probably ended having greater
> than one packet/IPI ratio i.e amortization benefit..

No, only one packet per IPI: since I set my tg3 coalescing parameters
to the minimum value, I received one packet per interrupt.

The specific app is :

for f in `seq 1 8`; do while :; do :; done& done


>   
> > I dont want to tweak acpi or whatever smart power saving
> > mechanisms.
> 
> I should mention i turned off acpi as well in the bios; it was consuming
> more cpu cycles than net-processing and was interfering in my tests.
> 
> > When RPS off
> > 100000 packets transmitted, 100000 received, 0% packet loss, time 4160ms
> > 
> > RPS on, but directed on the cpu0 handling device interrupts (tg3, napi)
> > (echo 01 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> > 100000 packets transmitted, 100000 received, 0% packet loss, time 4234ms
> > 
> > So the cost of queing the packet into our own queue (netif_receive_skb
> > -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
> > 
> 
> Excellent analysis.
> 
> > I personally think we should process packet instead of queeing it, but
> > Tom disagree with me.
> 
> Sorry - I am gonna have to turn on some pedagogy and offer my
> Canadian 2 cents;->
> I would lean on agreeing with Tom, but maybe go one step further (sans
> packet-reordering): we should never process packets to socket layer on
> the demuxing cpu.
> enqueue everything you receive on a different cpu - so somehow receiving
> cpu becomes part of a hashing decision ...
> 
> The reason is derived from queueing theory - of which i know dangerously
> little - but refer you to mr. little his-self[1] (pun fully
> intended;->):
> i.e fixed serving time provides more predictable results as opposed to
> once in a while a spike as you receive packets destined to "our cpu".
> Queueing packets and later allocating cycles to processing them adds to
> variability, but is not as bad as processing to completion to socket
> layer.
> 
> > RPS on, directed on cpu1 (other socket)
> > (echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> > 100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms
> 
> Good test - should be worst case scenario. But there are two other 
> scenarios which will give different results in my opinion.
> On your setup i think each socket has two dies, each with two cores. So
> my feeling is you will get different numbers if you go within same die
> and across dies within same socket. If i am not mistaken, the mapping
> would be something like socket0/die0{core0/2}, socket0/die1{core4/6},
> socket1/die0{core1/3}, socket1{core5/7}.
> If you have cycles can you try the same socket+die but different cores
> and same socket but different die test?

Sure, lets redo a full test, taking lowest time of three ping runs


echo 00 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4151ms

echo 01 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4254ms

echo 02 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4563ms

echo 04 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4458ms

echo 08 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4563ms

echo 10 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4327ms

echo 20 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4571ms

echo 40 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4472ms

echo 80 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4568ms


# egrep "physical id|core|apicid" /proc/cpuinfo 
physical id	: 0
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0

physical id	: 1
core id		: 0
cpu cores	: 4
apicid		: 4
initial apicid	: 4

physical id	: 0
core id		: 2
cpu cores	: 4
apicid		: 2
initial apicid	: 2

physical id	: 1
core id		: 2
cpu cores	: 4
apicid		: 6
initial apicid	: 6

physical id	: 0
core id		: 1
cpu cores	: 4
apicid		: 1
initial apicid	: 1

physical id	: 1
core id		: 1
cpu cores	: 4
apicid		: 5
initial apicid	: 5

physical id	: 0
core id		: 3
cpu cores	: 4
apicid		: 3
initial apicid	: 3

physical id	: 1
core id		: 3
cpu cores	: 4
apicid		: 7
initial apicid	: 7



Eric Dumazet April 18, 2010, 11:34 a.m. UTC | #13
On Sun, 2010-04-18 at 11:39 +0200, Eric Dumazet wrote:
> No, only one packet per IPI, since I setup my tg3 coalescing parameter
> to the minimum value, I received one packet per interrupt.
> 
> The specific app is :
> 
> for f in `seq 1 8`; do while :; do :; done& done
> 

Another interesting user-land app would be to use a cpu _and_ memory
cruncher, because of the cache misses we'll get.

$ cat nloop.c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SZ 4*1024*1024

int main(int argc, char *argv[])
{
	int nproc = 8;
	char *buffer;

	if (argc > 1)
		nproc = atoi(argv[1]);
	while (nproc > 1) {
		if (fork() == 0)
			break;
		nproc--;
	}
	buffer = malloc(SZ);
	while (1)
		memset(buffer, 0x55, SZ);
}

$ ./nloop 8 &

echo 00 >/sys/class/net/eth3/queues/rx-0/rps_cpus
4861ms

echo 01 >/sys/class/net/eth3/queues/rx-0/rps_cpus
4981ms

echo 02 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7191ms

echo 04 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7128ms

echo 08 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7107ms

echo 10 >/sys/class/net/eth3/queues/rx-0/rps_cpus
5505ms

echo 20 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7125ms

echo 40 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7022ms

echo 80 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7157ms


Maximum overhead is 7191 - 4861 = 2330 ms, i.e. 23.3 us per packet



jamal April 19, 2010, 2:09 a.m. UTC | #14
Thanks Eric. I tried to visualize your results - attached.
There are 2-3 odd numbers (labelled with *) but other
than that results are as expected...

I did run some experiments with a udp sink server
and i saw the IPIs amortized; unfortunately the sky2 h/ware
proved to be the bottleneck (at > 750Kpps incoming, it started
dropping and wasnt recording the drops, so i had to slow things down). I
need to digest my results a little more - but it seems i was getting
better throughput results with RPS (i.e it was able to sink
more packets)..

cheers,
jamal
jamal April 19, 2010, 12:48 p.m. UTC | #15
Sorry, didnt respond to you - i was busy setting up before trying
to think a little more about this..

On Fri, 2010-04-16 at 22:58 +0800, Changli Gao wrote:

> >
> > cpu   Total     |rps_recv |rps_ipi
> > -----+----------+---------+---------
> > cpu0 | 002dc7f1 |00000000 |000f4246
> > cpu1 | 002dc804 |000f4240 |00000000
> > -------------------------------------
> >
> > So: cpu0 receive 0x2dc7f1 pkts accummulative over time and
> > redirected to cpu1 (mostly, the extra 5 maybe to leftover since i clear
> > the data) and for the test 0xf4246 times it generated an IPI. It can be
> > seen that total running for CPU1 is 0x2dc804 but in this one run it
> > received 1M packets (0xf4240).
> 
> I remeber you redirected all the traffic from cpu0 to cpu1, and the data shows:
> 
> about 0x2dc7f1 packets are processed, and about 0xf4240 IPI are generated.

If you look at the patch, I am zeroing those stats - so 0xf4240 is only
one test (decimal 1M). I think there is something to what you are
saying; rps_ipi on cpu0 is ambiguous because it counts the number of
times the cpu0 softirq was scheduled as well as the number of times cpu0
scheduled other cpus.
The extra six for cpu0 turn out to be the times an ethernet interrupt
scheduled the cpu0 softirq.

> a single packet is counted twice by CPU0 and CPU1. 

Well, the counts have different meanings; rps_ipi applies to source cpu
activity and rps_recv applies to the destination. For example, if cpu0 found
some destination cpu to be empty 6 times in total, and 2 of those each
happened to be on cpu1, cpu2 and cpu3, then
cpu0: ipi_rps = 6
cpu1: rps_recv = 2
cpu2: rps_recv = 2
cpu3: rps_recv = 2


> If you change RPS setting by:
> 
> echo 1 > ..../rps_cpus
> 
> you will find the total number are doubled.

This is true. But IMO it deserves to be double counted.
It is just more fine-grained accounting.
IOW, I am not sure we need your patch because we would lose the
fine-grained accounting - and mine requires more work to be less ambiguous.

cheers,
jamal 

jamal April 20, 2010, 12:02 p.m. UTC | #16
folks,

Thanks to everybody (Eric stands out) for your patience. 
I ended up mostly validating whats already been said. I have a lot of data
and can describe in detail how i tested etc, but it would require
patience in reading, so i will spare you;-> If you are interested let me
know and i will be happy to share.

Summary is: 
-rps good, gives higher throughput for apps
-rps not so good, latency worse but gets better with higher input rate
or increasing number of flows (which translates to higher pps)
-rps works well with newer hardware that has better cache structures.
[Gives great results on my test machine, a Nehalem single processor, 4
cores each with two SMT threads, with a shared L2 between the threads and
a shared L3 between the cores].
Your selection of what the demux cpu is and where the target cpus are is
an influencing factor in the latency results. If you have a system with
multiple sockets, you should get better numbers if you stay within the
same socket relative to going across sockets.
-rps does a better job at helping schedule apps on same cpu thus
localizing the app. The throughput results with rps are very consistent
and better whereas in non-rps case, variance is _high_.

My next step is to do some forwarding tests - probably next week. I am
concerned here because i expect the cache misses to be higher than the
app scenario (netdev structure and attributes could be touched by many
cpus)

cheers,
jamal

Eric Dumazet April 20, 2010, 1:13 p.m. UTC | #17
On Tue, 2010-04-20 at 08:02 -0400, jamal wrote:
> folks,
> 
> Thanks to everybody (Eric stands out) for your patience. 
> I ended mostly validating whats already been said. I have a lot of data
> and can describe in details how i tested etc but it would require
> patience in reading, so i will spare you;-> If you are interested let me
> know and i will be happy to share.
> 
> Summary is: 
> -rps good, gives higher throughput for apps
> -rps not so good, latency worse but gets better with higher input rate
> or increasing number of flows (which translates to higher pps)
> -rps works well with newer hardware that has better cache structures.
> [Gives great results on my test machine a Nehalem single processor, 4
> cores each with two SMT threads that has a shared L2 between threads and
> a shared L3 between cores]. 
> Your selection of what the demux cpu is and where the target cpus are is
> an influencing factor in the latency results. If you have a system with
> multiple sockets, you should get better numbers if you stay within the
> same socket relative to going across sockets.
> -rps does a better job at helping schedule apps on same cpu thus
> localizing the app. The throughput results with rps are very consistent
> and better whereas in non-rps case, variance is _high_.
> 
> My next step is to do some forwarding tests - probably next week. I am
> concerned here because i expect the cache misses to be higher than the
> app scenario (netdev structure and attributes could be touched by many
> cpus)
> 

Hi Jamal

I think your tests are very interesting, maybe you could publish them
somehow? (I forgot to thank you for the previous report and nice
graph)

perf reports would be good too, to help spot hot spots.



Eric Dumazet April 21, 2010, 7:01 p.m. UTC | #18
On Wed, 2010-04-21 at 08:39 -0400, jamal wrote:
> On Tue, 2010-04-20 at 15:13 +0200, Eric Dumazet wrote:
> 
> 
> > I think your tests are very interesting, maybe could you publish them
> > somehow ? (I forgot to thank you about the previous report and nice
> > graph)
> > perf reports would be good too to help to spot hot points.
> 
> Ok ;->
> Let me explain my test setup (which some app types may gasp at;->):
> 
> SUT(system under test) was a nehalem single processor (4 cores, 2 SMT
> threads per core). 
> SUT runs a udp sink server i wrote (with apologies to Rick Jones[1])
> which forks at most a process per detected cpu and binds to a different
> udp port on each processor.
> Traffic generator sent to SUT upto 750Kpps of udp packets round-robbin
> and varied the destination port to select a different flow on each of
> the outgoing packets. I could further increment the number of flows by
> varying the source address and source port number but in the end i 
> settled down to fixed srcip/srcport/destinationip and just varied the
> port number in order to simplify results collection.
> For rps i selected mask "ee" and bound interrupt to cpu0. ee leaves
> out cpu0 and cpu4 from the set of target cpus. Because Nehalem has SMT
> threads, cpu0 and cpu4 are SMT threads that reside on core0 and they
> steal execution cycles from each other - so i didnt want that to happen
> and instead tried to have as many of those cycles as possible for
> demuxing incoming packets.
> 
> Overall, in best case scenario rps had 5-7% better throughput than
> nonrps setup. It had upto 10% more cpu use and about 2-5% more latency.
> I am attaching some visualization of the way 8 flows were distributed
> around the different cpus. The diagrams show some samples - but what you
> see there was a good reflection of what i saw in many runs of the tests.
> Essentially, for localization is better with rps which gets better if
> you can somehow map the target cpus as selected by rps to what the app
> binds to.
> Ive also attached a small annotated perf output - sorry i didnt have
> time to dig deeper into the code; maybe later this week. I think my
> biggest problem in this setup was the sky2 driver or hardware poor
> ability to handle lots of traffic.
> 
> 
> cheers,
> jamal
> 
> [1] I want to hump on the SUT with tons of traffic and count packets;
> too complex to do with netperf

Thanks a lot Jamal, this is really useful

A drawback of using a fixed src ip from your generator is that all flows
share the same struct dst entry on the SUT. This might explain some glitches
you noticed (ip_route_input + ip_rcv showing up high on slave/application
cpus).
Also note your test is one-way. If some data was sent in reply, we would see
much more use of the 'flows'.

I notice epoll_ctl() is used a lot; are you re-arming epoll each time you
receive a datagram?
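For what it's worth, the usual way to avoid that is to register the socket
once with a level-triggered EPOLLIN and then never touch epoll_ctl() in the
per-datagram path. A minimal sketch, purely to illustrate the point - this is
not jamal's actual server, and the port number is made up:

#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in addr = { .sin_family = AF_INET,
				    .sin_port = htons(5000),
				    .sin_addr.s_addr = htonl(INADDR_ANY) };
	int ep = epoll_create1(0);
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
	char buf[2048];

	bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	/* registered once; level-triggered, so no re-arm per datagram */
	epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

	for (;;) {
		struct epoll_event out;

		if (epoll_wait(ep, &out, 1, -1) == 1)
			recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
	}
}

Re-arming with EPOLLONESHOT or an EPOLL_CTL_MOD after every receive costs an
extra syscall per datagram, which would show up exactly as epoll_ctl() in the
profile.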

I see slave/application cpus hit _raw_spin_lock_irqsave() and  
_raw_spin_unlock_irqrestore().

Maybe a ring buffer could help (instead of a doubly linked list queue) for the
backlog, or the double-queue trick, if Changli wants to respin his
patch.





Rick Jones April 21, 2010, 9:53 p.m. UTC | #19
> Let me explain my test setup (which some app types may gasp at;->):
> 
> SUT(system under test) was a nehalem single processor (4 cores, 2 SMT
> threads per core). 
> SUT runs a udp sink server i wrote (with apologies to Rick Jones[1])
 > ...
> 
> [1] I want to hump on the SUT with tons of traffic and count packets;
> too complex to do with netperf

No need to apologize,  if you like I'd be happy to discuss netperf usage tips 
offline.  That offer stands for everyone.

happy benchmarking,

rick jones
Changli Gao April 22, 2010, 1:27 a.m. UTC | #20
On Thu, Apr 22, 2010 at 3:01 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Thanks a lot Jamal, this is really useful
>
> Drawback of using a fixed src ip from your generator is that all flows
> share the same struct dst entry on SUT. This might explain some glitches
> you noticed (ip_route_input + ip_rcv at high level on slave/application
> cpus)
> Also note your test is one way. If some data was replied we would see
> much use of the 'flows'
>
> I notice epoll_ctl() used a lot, are you re-arming epoll each time you
> receive a datagram ?
>
> I see slave/application cpus hit _raw_spin_lock_irqsave() and
> _raw_spin_unlock_irqrestore().
>
> Maybe a ring buffer could help (instead of a double linked queue) for
> backlog, or the double queue trick, if Changli wants to respin his
> patch.
>
>

OK, I'll post a new patch against the current tree, so Jamal can have
a try. I am sorry, but I don't have a suitable computer for benchmarking.
jamal April 22, 2010, 12:12 p.m. UTC | #21
On Wed, 2010-04-21 at 21:01 +0200, Eric Dumazet wrote:

> Drawback of using a fixed src ip from your generator is that all flows
> share the same struct dst entry on SUT. This might explain some glitches
> you noticed (ip_route_input + ip_rcv at high level on slave/application
> cpus)

yes, that would explain it ;-> I could have had flows going to each cpu,
generating different unique dsts. It is good i didnt ;->

> Also note your test is one way. If some data was replied we would see
> much use of the 'flows'
> 

In my next step i wanted to "route" these packets at the app level, and for
this stage of testing i just wanted to sink the data to reduce experiment
variables. Reason:
The netdev structure would hit a lot of cache misses if i started using
it to both send and recv, since lots of things are shared on tx/rx (for
example, napi tx pruning could happen on either the tx or receive path);
same thing with the qdisc path, which is at netdev granularity.. I think
there may be room for interesting improvements in this area..

> I notice epoll_ctl() used a lot, are you re-arming epoll each time you
> receive a datagram ?

I am using the default libevent on debian. It looks very old and maybe
buggy. I will try to upgrade first and, if i still see the same thing,
investigate.
  
> I see slave/application cpus hit _raw_spin_lock_irqsave() and  
> _raw_spin_unlock_irqrestore().
> 
> Maybe a ring buffer could help (instead of a double linked queue) for
> backlog, or the double queue trick, if Changli wants to respin his
> patch.
> 

Ok, I will have some cycles later today/tomorrow or for sure on the weekend.
My setup is still intact - so i can test.

cheers,
jamal

Changli Gao April 25, 2010, 2:31 a.m. UTC | #22
On Thu, Apr 22, 2010 at 8:12 PM, jamal <hadi@cyberus.ca> wrote:
>
>> I see slave/application cpus hit _raw_spin_lock_irqsave() and
>> _raw_spin_unlock_irqrestore().
>>
>> Maybe a ring buffer could help (instead of a double linked queue) for
>> backlog, or the double queue trick, if Changli wants to respin his
>> patch.
>>
>
> Ok, I will have some cycles later today/tommorow or for sure on weekend.
> My setup is still intact - so i can test.
>

I read the code again, and found that we don't use spin_lock_irqsave();
we use local_irq_save() and spin_lock() instead, so
_raw_spin_lock_irqsave() and _raw_spin_unlock_irqrestore() should not be
related to the backlog. The lock is maybe sk_receive_queue.lock.

Jamal, did you use a single socket to serve all the clients?

BTW: completion_queue and output_queue in softnet_data are both LIFO
queues. For completion_queue, FIFO is better, as the last used skb is
more likely in cache, and should be reused first. Since slab always
caches the last used memory at the head, we'd better free the skbs in a
FIFO manner. For output_queue, FIFO is good for fairness among qdiscs.
jamal April 26, 2010, 11:35 a.m. UTC | #23
On Sun, 2010-04-25 at 10:31 +0800, Changli Gao wrote:

> I read the code again, and find that we don't use spin_lock_irqsave(),
> and we use local_irq_save() and spin_lock() instead, so
> _raw_spin_lock_irqsave() and _raw_spin_lock_irqrestore() should not be
> related to backlog. the lock maybe sk_receive_queue.lock.

Possible.
I am wondering if there's a way we can precisely nail down where that is
happening? Is lockstat of any use?
Fixing _raw_spin_lock_irqsave and friends is the lowest-hanging fruit.

So looking at your patch now, i see it is likely there was an improvement
made for the non-rps case (moving some irq_enable etc out of the loop).
i.e my results may not be crazy after adding your patch and seeing an
improvement for the non-rps case.
However, whatever your patch did - it did not help the rps case:
call_function_single_interrupt() comes out higher in the profile,
and the # of IPIs seems to have gone up (although i did not measure this, I
can see the interrupts/second went up by almost 50-60%)

> Jamal, did you use a single socket to serve all the clients?

Socket per detected cpu.

> BTW:  completion_queue and output_queue in softnet_data both are LIFO
> queues. For completion_queue, FIFO is better, as the last used skb is
> more likely in cache, and should be used first. Since slab has always
> cache the last used memory at the head, we'd better free the skb in
> FIFO manner. For output_queue, FIFO is good for fairness among qdiscs.

I think it will depend on how many of those skbs are sitting in the
completion queue, cache warmth etc. LIFO is always safest; you have a
higher probability of finding a cached skb in front.

cheers,
jamal

Changli Gao April 26, 2010, 1:35 p.m. UTC | #24
On Mon, Apr 26, 2010 at 7:35 PM, jamal <hadi@cyberus.ca> wrote:
> On Sun, 2010-04-25 at 10:31 +0800, Changli Gao wrote:
>
>> I read the code again, and find that we don't use spin_lock_irqsave(),
>> and we use local_irq_save() and spin_lock() instead, so
>> _raw_spin_lock_irqsave() and _raw_spin_lock_irqrestore() should not be
>> related to backlog. the lock maybe sk_receive_queue.lock.
>
> Possible.
> I am wondering if there's a way we can precisely nail where that is
> happening? is lockstat any use?
> Fixing _raw_spin_lock_irqsave and friend is the lowest hanging fruit.
>

Maybe lockstat can help in this case.

> So looking at your patch now i see it is likely there was an improvement
> made for non-rps case (moving out of loop some irq_enable etc).
> i.e my results may not be crazy after adding your patch and seeing an
> improvement for non-rps case.
> However, whatever your patch did - it did not help the rps case case:
> call_function_single_interrupt() comes out higher in the profile,
> and # of IPIs seems to have gone up (although i did not measure this, I
> can see the interrupts/second went up by almost 50-60%)

Did you apply the patch from Eric? It would reduce the number of
local_irq_disable() calls but increase the number of IPIs.

>
>> Jamal, did you use a single socket to serve all the clients?
>
> Socket per detected cpu.

Ignore it. I made a mistake here.

>
>> BTW:  completion_queue and output_queue in softnet_data both are LIFO
>> queues. For completion_queue, FIFO is better, as the last used skb is
>> more likely in cache, and should be used first. Since slab has always
>> cache the last used memory at the head, we'd better free the skb in
>> FIFO manner. For output_queue, FIFO is good for fairness among qdiscs.
>
> I think it will depend on how many of those skbs are sitting in the
> completion queue, cache warmth etc. LIFO is always safest, you have
> higher probability of finding a cached skb infront.
>

we call kfree_skb() to release skbs to the slab allocator, and the slab
allocator stores them in a LIFO queue. If the completion queue is also a
LIFO queue, the most recently unused skb will be at the front of the queue,
and will be released to the slab allocator first. The next time we
call alloc_skb(), the memory used by the skb at the end of the
completion queue will be returned instead of the hot one.

However, as Eric said, new drivers don't rely on the completion queue, so it
isn't a real problem, especially in your test case.

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d1a21b5..f8267fc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -224,6 +224,7 @@  struct netif_rx_stats {
 	unsigned time_squeeze;
 	unsigned cpu_collision;
 	unsigned received_rps;
+	unsigned ipi_rps;
 };
 
 DECLARE_PER_CPU(struct netif_rx_stats, netdev_rx_stat);
diff --git a/kernel/smp.c b/kernel/smp.c
index 9867b6b..8c5dcb7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -11,6 +11,7 @@ 
 #include <linux/init.h>
 #include <linux/smp.h>
 #include <linux/cpu.h>
+#include <linux/netdevice.h>
 
 static struct {
 	struct list_head	queue;
@@ -158,7 +159,10 @@  void generic_exec_single(int cpu, struct call_single_data *data, int wait)
 	 * equipped to do the right thing...
 	 */
 	if (ipi)
+{
 		arch_send_call_function_single_ipi(cpu);
+		__get_cpu_var(netdev_rx_stat).ipi_rps++;
+}
 
 	if (wait)
 		csd_lock_wait(data);
diff --git a/net/core/dev.c b/net/core/dev.c
index b98ddc6..0bbbdcf 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3563,10 +3563,12 @@  static int softnet_seq_show(struct seq_file *seq, void *v)
 {
 	struct netif_rx_stats *s = v;
 
-	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
+	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
 		   s->total, s->dropped, s->time_squeeze, 0,
 		   0, 0, 0, 0, /* was fastroute */
-		   s->cpu_collision, s->received_rps);
+		   s->cpu_collision, s->received_rps, s->ipi_rps);
+	s->ipi_rps = 0;
+	s->received_rps = 0;
 	return 0;
 }