[ovs-dev,RFC,5/5] dpif-netdev: Prefetch the cacheline having the cycle stats.

Message ID 1512418610-84032-5-git-send-email-bhanuprakash.bodireddy@intel.com
State New
Series
  • [ovs-dev,RFC,1/5] compiler: Introduce OVS_PREFETCH variants.

Commit Message

Bodireddy, Bhanuprakash Dec. 4, 2017, 8:16 p.m.
Prefetch the cacheline having the cycle stats so that we can speed up
the cycles_count_start() and cycles_count_intermediate().

Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
---
 lib/dpif-netdev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Ilya Maximets Dec. 5, 2017, 9:50 a.m. | #1
> Prefetch the cacheline having the cycle stats so that we can speed up
> the cycles_count_start() and cycles_count_intermediate().

Do you have any performance results?

> 
> Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy at intel.com>
> ---
>  lib/dpif-netdev.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index b74b5d7..ab13d83 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -576,7 +576,7 @@ struct dp_netdev_pmd_thread {
>          struct ovs_mutex flow_mutex;
>          /* 8 pad bytes. */
>      );
> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
> +    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cachelineC,
>          struct cmap flow_table OVS_GUARDED; /* Flow table. */
>  
>          /* One classifier per in_port polled by the pmd */
> @@ -4082,6 +4082,7 @@ reload:
>          lc = UINT_MAX;
>      }
>  
> +    OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_HTW);
>      cycles_count_start(pmd);
>      for (;;) {
>          for (i = 0; i < poll_cnt; i++) {
> -- 
> 2.4.11
Bodireddy, Bhanuprakash Dec. 5, 2017, 3:11 p.m. | #2
>
>> Prefetch the cacheline having the cycle stats so that we can speed up
>> the cycles_count_start() and cycles_count_intermediate().
>
>Do you have any performance results?

I don’t have numbers for this patch alone. I was testing the overall throughput along with other patches (that were *not* part of this RFC series) to verify performance improvements. I will include per-patch results in the commit log once I have measured the patches individually.

BTW, I usually look at the percentage of total instructions retired and the cycles spent in the front end and back end for the affected functions to see whether prefetching improves or degrades performance.

- Bhanuprakash.

>
>>
>> Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy at intel.com>
>> ---
>>  lib/dpif-netdev.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index b74b5d7..ab13d83 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -576,7 +576,7 @@ struct dp_netdev_pmd_thread {
>>          struct ovs_mutex flow_mutex;
>>          /* 8 pad bytes. */
>>      );
>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>> +    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cachelineC,
>>          struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>
>>          /* One classifier per in_port polled by the pmd */
>> @@ -4082,6 +4082,7 @@ reload:
>>          lc = UINT_MAX;
>>      }
>>
>> +    OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_HTW);
>>      cycles_count_start(pmd);
>>      for (;;) {
>>          for (i = 0; i < poll_cnt; i++) {
>> --
>> 2.4.11
Ilya Maximets Dec. 7, 2017, 2:04 p.m. | #3
On 05.12.2017 18:11, Bodireddy, Bhanuprakash wrote:
>>
>>> Prefetch the cacheline having the cycle stats so that we can speed up
>>> the cycles_count_start() and cycles_count_intermediate().
>>
>> Do you have any performance results?
> 
> I don’t have numbers for this patch alone. I was testing the overall throughput along with other patches (that were *not* part of this RFC series) to verify performance improvements. I will include per-patch results in the commit log once I have measured the patches individually.
> 
> BTW, I usually look at the percentage of total instructions retired and the cycles spent in the front end and back end for the affected functions to see whether prefetching improves or degrades performance.
> 
> - Bhanuprakash.
> 
>>
>>>
>>> Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy at intel.com>
>>> ---
>>>  lib/dpif-netdev.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>> index b74b5d7..ab13d83 100644
>>> --- a/lib/dpif-netdev.c
>>> +++ b/lib/dpif-netdev.c
>>> @@ -576,7 +576,7 @@ struct dp_netdev_pmd_thread {
>>>          struct ovs_mutex flow_mutex;
>>>          /* 8 pad bytes. */
>>>      );
>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>> +    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cachelineC,
>>>          struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>
>>>          /* One classifier per in_port polled by the pmd */
>>> @@ -4082,6 +4082,7 @@ reload:
>>>          lc = UINT_MAX;
>>>      }
>>>
>>> +    OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_HTW);

How should a prefetch issued just before the infinite loop improve performance?
I didn't test that, but IMHO, this should have zero impact.

>>>      cycles_count_start(pmd);
>>>      for (;;) {
>>>          for (i = 0; i < poll_cnt; i++) {
>>> --
>>> 2.4.11
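
The placement question raised in comment #3 can be made concrete with a small sketch. It only counts how often a prefetch hint is issued relative to how often the counter is written; it does not measure performance. All names here (`loop_stats`, `hint_for_write`, `run_hint_before_loop`, `run_hint_each_iteration`) are hypothetical and not part of OVS, and `__builtin_prefetch` is assumed to be available (GCC or Clang).

```c
#include <stdint.h>

/* Hypothetical bookkeeping: how many hints were issued and how many
 * times the "cycle counter" was written. */
struct loop_stats {
    uint64_t prefetches;
    uint64_t counter_updates;
};

/* Issue a prefetch-for-write hint and record that it happened. */
static inline void hint_for_write(const void *addr, struct loop_stats *s)
{
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(addr, 1, 3);   /* rw = 1 (write), locality = 3 */
#else
    (void) addr;                      /* No hint on other compilers. */
#endif
    s->prefetches++;
}

/* Shape of the patch: one hint, then a long-running loop that keeps
 * writing the counter. */
static struct loop_stats run_hint_before_loop(unsigned iterations)
{
    struct loop_stats s = { 0, 0 };
    uint64_t last_cycles = 0;

    hint_for_write(&last_cycles, &s);
    for (unsigned i = 0; i < iterations; i++) {
        last_cycles += 1;
        s.counter_updates++;
    }
    (void) last_cycles;
    return s;
}

/* Alternative shape, for comparison only: a hint on every iteration. */
static struct loop_stats run_hint_each_iteration(unsigned iterations)
{
    struct loop_stats s = { 0, 0 };
    uint64_t last_cycles = 0;

    for (unsigned i = 0; i < iterations; i++) {
        hint_for_write(&last_cycles, &s);
        last_cycles += 1;
        s.counter_updates++;
    }
    (void) last_cycles;
    return s;
}
```

With the hint outside the loop, it is issued once per entry into the loop no matter how many iterations follow, so any benefit would be limited to the first access. Whether a per-iteration hint would help or hurt is a separate question that needs measurement.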

Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index b74b5d7..ab13d83 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -576,7 +576,7 @@  struct dp_netdev_pmd_thread {
         struct ovs_mutex flow_mutex;
         /* 8 pad bytes. */
     );
-    PADDED_MEMBERS(CACHE_LINE_SIZE,
+    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cachelineC,
         struct cmap flow_table OVS_GUARDED; /* Flow table. */
 
         /* One classifier per in_port polled by the pmd */
@@ -4082,6 +4082,7 @@  reload:
         lc = UINT_MAX;
     }
 
+    OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_HTW);
     cycles_count_start(pmd);
     for (;;) {
         for (i = 0; i < poll_cnt; i++) {
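
For readers without the rest of the series at hand, the idea behind the patch can be sketched in standalone C: keep the cycle counters on a known cache line and issue a write prefetch for that line before touching them. This is a minimal illustration, not the OVS implementation. `SKETCH_CACHE_LINE_SIZE`, `struct sketch_pmd`, and the `sketch_*` functions are hypothetical stand-ins, a 64-byte line is assumed, and `__builtin_prefetch` (GCC or Clang) is used in place of the `OVS_PREFETCH_CACHE` macro introduced in patch 1/5.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_LINE_SIZE 64   /* Assumed line size; hypothetical name. */

/* Hypothetical stand-in for a per-thread structure. Each alignas starts
 * a new cache line, so the cycle counters share one line of their own. */
struct sketch_pmd {
    alignas(SKETCH_CACHE_LINE_SIZE) uint64_t flow_lookups;  /* first line */
    alignas(SKETCH_CACHE_LINE_SIZE) uint64_t last_cycles;   /* stats line */
    uint64_t cycles[2];                                     /* stats line */
};

/* Hint that the line holding 'addr' is about to be written. */
static inline void sketch_prefetch_write(const void *addr)
{
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(addr, 1, 3);   /* rw = 1 (write), locality = 3 */
#else
    (void) addr;                      /* No hint on other compilers. */
#endif
}

/* Simplified analogue of cycles_count_start(): remember the timestamp. */
static inline void sketch_cycles_start(struct sketch_pmd *pmd, uint64_t now)
{
    pmd->last_cycles = now;
}

/* Simplified analogue of cycles_count_intermediate(): charge the elapsed
 * interval to one counter and restart the interval. */
static inline void sketch_cycles_intermediate(struct sketch_pmd *pmd,
                                              uint64_t now, int type)
{
    assert(type == 0 || type == 1);
    pmd->cycles[type] += now - pmd->last_cycles;
    pmd->last_cycles = now;
}
```

A caller would issue `sketch_prefetch_write(&pmd->last_cycles)` shortly before `sketch_cycles_start(pmd, now)`. A prefetch is only a hint and never changes program results, so its effect can be judged only by measurement.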