
Question about the way that NICs deliver packets to the kernel

Message ID 20100715142418.GA26491@host-a-229.ustcsz.edu.cn
State RFC, archived
Delegated to: David Miller

Commit Message

Junchang Wang July 15, 2010, 2:24 p.m. UTC
Hi list,
My understanding of the way that NICs deliver packets to the kernel is
as follows. Correct me if any of this is wrong. Thanks.

1) The device receive buffer is fixed. When the kernel is notified of the
arrival of a new packet, it dynamically allocates a new skb and copies the
packet into it. For example, 8139too.

2) The device buffer is mapped with streaming DMA. When the kernel is
notified of the arrival of a new packet, it unmaps the previously mapped
region and hands the buffer up directly. Obviously, there is NO memcpy
operation; the additional cost is the streaming DMA map/unmap operations.
For example, e100 and e1000.
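
To make the two schemes concrete, here is a rough sketch of what the receive
path boils down to in each case. This is my own illustration rather than code
taken from 8139too or e1000; the helper names and parameters are made up.

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/pci.h>
#include <linux/skbuff.h>
#include <linux/string.h>

/* Scheme 1: the device buffer stays put; each frame is copied into a
 * freshly allocated skb. */
static void rx_by_copy(struct net_device *dev, const void *rx_buf,
                       unsigned int pkt_size)
{
        struct sk_buff *skb = netdev_alloc_skb(dev, pkt_size + NET_IP_ALIGN);

        if (!skb)
                return;         /* drop; the device buffer is reused as-is */
        skb_reserve(skb, NET_IP_ALIGN);
        memcpy(skb_put(skb, pkt_size), rx_buf, pkt_size);      /* the copy */
        skb->protocol = eth_type_trans(skb, dev);
        netif_receive_skb(skb);
}

/* Scheme 2: each descriptor owns a streaming-DMA-mapped skb; on receive the
 * mapping is torn down and that skb is handed up directly, after which the
 * slot must be refilled with a newly allocated and mapped skb. */
static void rx_by_unmap(struct net_device *dev, struct pci_dev *pdev,
                        struct sk_buff *ring_skb, dma_addr_t mapping,
                        unsigned int pkt_size, unsigned int buf_size)
{
        pci_unmap_single(pdev, mapping, buf_size, PCI_DMA_FROMDEVICE);
        skb_put(ring_skb, pkt_size);
        ring_skb->protocol = eth_type_trans(ring_skb, dev);
        netif_receive_skb(ring_skb);
        /* ...then allocate and pci_map_single() a replacement skb here */
}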

Here come my questions:
1) Is there a principle indicating which one is better? Are streaming DMA
map/unmap operations more expensive than a memcpy?


2) Why is r8169 biased towards the first approach even though it supports
both? I converted r8169 to the second one and got a 5% performance boost.
Below is the result of running a netperf TCP_STREAM test with a 1.6 KB packet length.
        scheme 1    scheme 2    Imp.
r8169     683M        718M       5%

The following patch shows what I did:


Thanks in advance.

--Junchang

Comments

Ben Hutchings July 15, 2010, 2:33 p.m. UTC | #1
On Thu, 2010-07-15 at 22:24 +0800, Junchang Wang wrote:
> Hi list,
> My understanding of the way that NICs deliver packets to the kernel is
> as follows. Correct me if any of this is wrong. Thanks.
> 
> 1) The device receive buffer is fixed. When the kernel is notified of the
> arrival of a new packet, it dynamically allocates a new skb and copies the
> packet into it. For example, 8139too.
> 
> 2) The device buffer is mapped with streaming DMA. When the kernel is
> notified of the arrival of a new packet, it unmaps the previously mapped
> region and hands the buffer up directly. Obviously, there is NO memcpy
> operation; the additional cost is the streaming DMA map/unmap operations.
> For example, e100 and e1000.
> 
> Here comes my question:
> 1) Is there a principle indicating which one is better? Are streaming DMA
> map/unmap operations more expensive than a memcpy?

DMA should result in lower CPU usage and higher maximum performance.

> 2) Why is r8169 biased towards the first approach even though it supports
> both? I converted r8169 to the second one and got a 5% performance boost.
> Below is the result of running a netperf TCP_STREAM test with a 1.6 KB
> packet length.
>         scheme 1    scheme 2    Imp.
> r8169     683M        718M       5%
[...]

You should also compare the CPU usage.

Ben.
stephen hemminger July 15, 2010, 3:59 p.m. UTC | #2
On Thu, 15 Jul 2010 15:33:37 +0100
Ben Hutchings <bhutchings@solarflare.com> wrote:

> On Thu, 2010-07-15 at 22:24 +0800, Junchang Wang wrote:
> > Hi list,
> > My understanding of the way that NICs deliver packets to the kernel is
> > as follows. Correct me if any of this is wrong. Thanks.
> > 
> > 1) The device receive buffer is fixed. When the kernel is notified of the
> > arrival of a new packet, it dynamically allocates a new skb and copies the
> > packet into it. For example, 8139too.
> > 
> > 2) The device buffer is mapped with streaming DMA. When the kernel is
> > notified of the arrival of a new packet, it unmaps the previously mapped
> > region and hands the buffer up directly. Obviously, there is NO memcpy
> > operation; the additional cost is the streaming DMA map/unmap operations.
> > For example, e100 and e1000.
> > 
> > Here comes my question:
> > 1) Is there a principle indicating which one is better? Are streaming DMA
> > map/unmap operations more expensive than a memcpy?
> 
> DMA should result in lower CPU usage and higher maximum performance.
> 
> > 2) Why is r8169 biased towards the first approach even though it supports
> > both? I converted r8169 to the second one and got a 5% performance boost.
> > Below is the result of running a netperf TCP_STREAM test with a 1.6 KB
> > packet length.
> >         scheme 1    scheme 2    Imp.
> > r8169     683M        718M       5%
> [...]
> 
> You should also compare the CPU usage.

Also, many drivers copy small receives into a new buffer,
which saves space and often gives better performance.
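
Roughly, the idea looks like this. This is an illustration only, not lifted
from any particular driver; the rx_copybreak value and helper names are
invented.

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/pci.h>
#include <linux/skbuff.h>

static unsigned int rx_copybreak = 256; /* illustrative threshold, in bytes */

/* Small frames are copied into a right-sized skb so the large, already
 * mapped ring buffer can be handed straight back to the NIC; large frames
 * are unmapped and passed up, as in scheme 2 of the original mail. */
static struct sk_buff *rx_copy_or_unmap(struct net_device *dev,
                                        struct pci_dev *pdev,
                                        struct sk_buff *ring_skb,
                                        dma_addr_t mapping,
                                        unsigned int pkt_size,
                                        unsigned int buf_size,
                                        bool *ring_buf_reused)
{
        if (pkt_size < rx_copybreak) {
                struct sk_buff *skb = netdev_alloc_skb(dev, pkt_size + NET_IP_ALIGN);

                if (skb) {
                        skb_reserve(skb, NET_IP_ALIGN);
                        pci_dma_sync_single_for_cpu(pdev, mapping, pkt_size,
                                                    PCI_DMA_FROMDEVICE);
                        skb_copy_from_linear_data(ring_skb,
                                                  skb_put(skb, pkt_size),
                                                  pkt_size);
                        pci_dma_sync_single_for_device(pdev, mapping, pkt_size,
                                                       PCI_DMA_FROMDEVICE);
                        *ring_buf_reused = true;  /* ring skb stays mapped */
                        return skb;
                }
        }
        /* big frame, or no memory for a copy: take the unmap path */
        pci_unmap_single(pdev, mapping, buf_size, PCI_DMA_FROMDEVICE);
        skb_put(ring_skb, pkt_size);
        *ring_buf_reused = false;
        return ring_skb;
}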
Francois Romieu July 15, 2010, 9:12 p.m. UTC | #3
Junchang Wang <junchangwang@gmail.com> :
[...]
> 2) Why is r8169 biased towards the first approach even though it supports both?

It is a simple, straightforward fix against an 8169 hardware bug.

See commit c0cd884af045338476b8e69a61fceb3f34ff22f1.
Junchang Wang July 16, 2010, 7:05 a.m. UTC | #4
>
> You should also compare the CPU usage.
>
> Ben.
>
Hi Ben,
I added options -c -C to netperf's command line. Result is as follows:
                    scheme 1    scheme 2    Imp.
Throughput:     683M        718M       5%
CPU usage:     47.8%       45.6%

That really surprised me because the "top" command showed the CPU usage
was fluctuating between 0.5% and 1.5% rather than between 45% and 50%.

How can I get the exact CPU usage?

Thanks.
Junchang Wang July 16, 2010, 7:35 a.m. UTC | #5
> It is a simple, straightforward fix against a 8169 hardware bug.
>
> See commit c0cd884af045338476b8e69a61fceb3f34ff22f1.
>
Fortunately, it seems my device is unaffected by this issue. :)

Thanks Francois.
Rick Jones July 16, 2010, 5:58 p.m. UTC | #6
Junchang Wang wrote:
>>You should also compare the CPU usage.
>>
>>Ben.
>>
> 
> Hi Ben,
> I added options -c -C to netperf's command line. Result is as follows:
>                     scheme 1    scheme 2    Imp.
> Throughput:     683M        718M       5%
> CPU usage:     47.8%       45.6%
> 
> That really surprised me because the "top" command showed the CPU usage
> was fluctuating between 0.5% and 1.5% rather than between 45% and 50%.

Can you tell us a bit more about the system, and which version of netperf you 
are using?  Any chance that the CPU utilization you were looking at in top was 
just that being charged to netperf the process?  "Network processing" does not 
often get charged to the responsible process, so netperf reports system-wide CPU 
utilization on the assumption it is the only thing causing the CPUs to be utilized.

happy benchmarking,

rick jones
Junchang Wang July 20, 2010, 1:15 a.m. UTC | #7
On Fri, Jul 16, 2010 at 10:58:46AM -0700, Rick Jones wrote:
>>Hi Ben,
>>I added options -c -C to netperf's command line. Result is as follows:
>>                    scheme 1    scheme 2    Imp.
>>Throughput:     683M        718M       5%
>>CPU usage:     47.8%       45.6%
>>
>>That really surprised me because the "top" command showed the CPU usage
>>was fluctuating between 0.5% and 1.5% rather than between 45% and 50%.
>

Hi Rick,
Very sorry for my late reply. I just recovered from the final exam. :)

>Can you tell us a bit more about the system, and which version of
>netperf you are using?  

The target machine is a Pentium Dual-Core E2200 desktop with an r8169
gigabit NIC. (I couldn't find a better server with an old PCI slot.)

The other machine is a Nehalem-based system with an Intel 82576 NIC.

The target machine runs netserver and the Nehalem machine runs netperf.
The netperf version is 2.4.5.

>Any chance that the CPU utilization you were
>looking at in top was just that being charged to netperf the process?

What I see on the target machine is as follows:

top - 21:37:12 up 21 min,  6 users,  load average: 0.43, 0.28, 0.19
Tasks: 152 total,   2 running, 149 sleeping,   0 stopped,   1 zombie
Cpu(s):  2.3%us,  1.5%sy,  0.1%ni, 89.5%id,  2.7%wa,  0.0%hi,  3.9%si,  0.0%
Mem:   2074064k total,   690200k used,  1383864k free,    39372k buffers
Swap:  2096476k total,        0k used,  2096476k free,   435044k cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND    
3916 root      20   0  2228  584  296 R 84.6  0.0   0:07.12 netserver    

It shows the CPU usage of the target machine is around 10%.

The Nehalem machine's report, meanwhile, is as follows:

TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.1 (192.168.2.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

87380  16384  16384    10.05       679.79   1.63     48.27    1.571   11.634 

It shows the CPU usage of the target machine is 48.27%.

>"Network processing" does not often get charged to the responsible
>process, so netperf reports system-wide CPU utilization on the
>assumption it is the only thing causing the CPUs to be utilized.

My understanding of your comments is:
1) Except when running in ksoftirqd, network processing cannot be correctly
  counted because it runs in interrupt context that does not get charged to
  the correct process. So "top" misses lots of CPU usage in high-interrupt-rate
  network situations.
2) As you have mentioned in netperf's manual, netperf uses /proc/stat on Linux
  to retrieve the time spent in idle mode. In other words, it counts CPU time
  spent in all other modes, including hardware and software interrupts, as
  busy, making the CPU usage more accurate in high-interrupt situations.
3) Since most processes on the target machine are sleeping, the CPU usage
  of network processing is actually very close to 48.27%. Right?

Correct me if any of them are incorrect. Thanks.
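
As an aside, below is a minimal userspace sketch of the idle-time calculation
described in point 2). It is my own illustration of the method, not netperf's
actual code.

#include <stdio.h>
#include <unistd.h>

static int read_cpu_times(unsigned long long *idle, unsigned long long *total)
{
        unsigned long long v[10] = {0};
        FILE *f = fopen("/proc/stat", "r");
        int i, n;

        if (!f)
                return -1;
        /* aggregate line: cpu user nice system idle iowait irq softirq ... */
        n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &v[0], &v[1], &v[2], &v[3], &v[4],
                   &v[5], &v[6], &v[7], &v[8], &v[9]);
        fclose(f);
        if (n < 4)
                return -1;
        *idle = v[3] + v[4];    /* treat iowait as idle time, too */
        *total = 0;
        for (i = 0; i < n; i++)
                *total += v[i];
        return 0;
}

int main(void)
{
        unsigned long long i0, t0, i1, t1;

        if (read_cpu_times(&i0, &t0))
                return 1;
        sleep(10);              /* roughly one netperf run */
        if (read_cpu_times(&i1, &t1))
                return 1;
        printf("CPU utilization: %.1f%%\n",
               100.0 * (1.0 - (double)(i1 - i0) / (double)(t1 - t0)));
        return 0;
}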

--Junchang
Rick Jones July 20, 2010, 5:16 p.m. UTC | #8
Junchang Wang wrote:
> On Fri, Jul 16, 2010 at 10:58:46AM -0700, Rick Jones wrote:
> 
>>>Hi Ben,
>>>I added options -c -C to netperf's command line. Result is as follows:
>>>                   scheme 1    scheme 2    Imp.
>>>Throughput:     683M        718M       5%
>>>CPU usage:     47.8%       45.6%
>>>
>>>That really surprised me because the "top" command showed the CPU usage
>>>was fluctuating between 0.5% and 1.5% rather than between 45% and 50%.
>>
> 
> Hi Rick,
> Very sorry for my late reply. I just recovered from the final exam. :)
> 
> 
>>Can you tell us a bit more about the system, and which version of
>>netperf you are using?  
> 
> 
> The target machine is a Pentium Dual-Core E2200 desktop with an r8169
> gigabit NIC. (I couldn't find a better server with an old PCI slot.)
> 
> The other machine is a Nehalem-based system with an Intel 82576 NIC.
> 
> The target machine runs netserver and the Nehalem machine runs netperf.
> The netperf version is 2.4.5.
> 
> 
>>Any chance that the CPU utilization you were
>>looking at in top was just that being charged to netperf the process?
> 
> 
> What I see on the target machine is as follows:
> 
> top - 21:37:12 up 21 min,  6 users,  load average: 0.43, 0.28, 0.19
> Tasks: 152 total,   2 running, 149 sleeping,   0 stopped,   1 zombie
> Cpu(s):  2.3%us,  1.5%sy,  0.1%ni, 89.5%id,  2.7%wa,  0.0%hi,  3.9%si,  0.0%
> Mem:   2074064k total,   690200k used,  1383864k free,    39372k buffers
> Swap:  2096476k total,        0k used,  2096476k free,   435044k cached
> 
> PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND    
> 3916 root      20   0  2228  584  296 R 84.6  0.0   0:07.12 netserver    

You said this was a dual-core system, right?  So two cores, no threads?  If so,
then that does look odd - if netserver is consuming 84% of a CPU (core) and
there are only two CPUs (cores) in the system, how the system can be 89.5% idle
is beyond me. The 48% reported by netperf below makes better sense. If you press
"1" while top is running, it should start to show per-CPU statistics.

> It shows the CPU usage of the target machine is around 10%.
> 
> The Nehalem machine's report, meanwhile, is as follows:
> 
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.1 (192.168.2.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
> 87380  16384  16384    10.05       679.79   1.63     48.27    1.571   11.634 
> 
> It shows the CPU usage of the target machine is 48.27%.

Clearly something is out of joint - let's go off-list (or on to 
netperf-talk@netperf.org) and hash that out to see what may be happening.  It 
will probably involve variations on grabbing the top-of-trunk, adding the debug 
option etc.

> 
> 
>>"Network processing" does not often get charged to the responsible
>>process, so netperf reports system-wide CPU utilization on the
>>assumption it is the only thing causing the CPUs to be utilized.
> 
> 
> My understanding of your comments is:
> 1) Except when running in ksoftirqd, network processing cannot be correctly
>   counted because it runs in interrupt context that does not get charged to
>   the correct process. So "top" misses lots of CPU usage in high-interrupt-rate
>   network situations.

Top *shouldn't* miss it as far as reporting overall CPU utilization.  It just may
not be charged to the process on whose behalf the work is done.

> 2) As you have mentioned in netperf's manual, netperf uses /proc/stat on Linux
>   to retrieve the time spent in idle mode. In other words, it counts CPU time
>   spent in all other modes, including hardware and software interrupts, as
>   busy, making the CPU usage more accurate in high-interrupt situations.

That is the theory.  In practice however...  while the top output you've 
provided looks like there is an "issue" in top, netperf has been known to have a 
bug or three.

> 3) Since most processes on the target machine are sleeping, the CPU usage
>   of network processing is actually very close to 48.27%. Right?

I do not expect there to be a huge discrepancy between the overall CPU 
utilization reported by top and the CPU utilization reported by netperf.  That 
there seems to be such a discrepancy has me wanting to make certain that netperf 
is operating correctly.

happy benchmarking,

rick jones

> 
> Correct me if any of them are incorrect. Thanks.
> 
> --Junchang

Junchang Wang July 25, 2010, 2:18 p.m. UTC | #9
Hi list,

> Clearly something is out of joint - let's go off-list (or on to
> netperf-talk@netperf.org) and hash that out to see what may be happening.
>  It will probably involve variations on grabbing the top-of-trunk, adding
> the debug option etc.
>
The discrepancy between netperf and top has been worked out.

It seems top produces buggy data when I try to send its output to a file.
For example, "top -b > output" gives my previous buggy numbers in its
first iteration.

Actually, top's report should be:

top - 21:37:15 up 21 min,  6 users,  load average: 0.43, 0.28, 0.19
Tasks: 152 total,   2 running, 149 sleeping,   0 stopped,   1 zombie
Cpu(s):  0.2%us,  5.4%sy,  0.0%ni, 50.9%id,  0.0%wa,  0.0%hi, 43.5%si,  0.0%
Mem:   2074064k total,   690192k used,  1383872k free,    39372k buffers
Swap:  2096476k total,        0k used,  2096476k free,   435056k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3916 root      20   0  2228  584  296 R 86.3  0.0   0:09.72 netserver

I think 50.9% system idle makes sense because this is a dual-core system
and netserver is consuming 86.3% of one core. On average, the CPU usage
of the whole system reported by top can be regarded as between 46.2% and
50.1%.

netperf's report of 48% is right and confirms that "there is no huge
discrepancy between the overall CPU utilization reported by top and the
CPU utilization reported by netperf."

Thanks Rick.

Patch

diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index 239d7ef..707876f 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -4556,15 +4556,9 @@  static int rtl8169_rx_interrupt(struct net_device *dev,
 
 			rtl8169_rx_csum(skb, desc);
 
-			if (rtl8169_try_rx_copy(&skb, tp, pkt_size, addr)) {
-				pci_dma_sync_single_for_device(pdev, addr,
-					pkt_size, PCI_DMA_FROMDEVICE);
-				rtl8169_mark_to_asic(desc, tp->rx_buf_sz);
-			} else {
-				pci_unmap_single(pdev, addr, tp->rx_buf_sz,
-						 PCI_DMA_FROMDEVICE);
-				tp->Rx_skbuff[entry] = NULL;
-			}
+			pci_unmap_single(pdev, addr, tp->rx_buf_sz,
+					 PCI_DMA_FROMDEVICE);
+			tp->Rx_skbuff[entry] = NULL;
 
 			skb_put(skb, pkt_size);
 			skb->protocol = eth_type_trans(skb, dev);