[ovs-dev] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

Message ID 1511352705-2125-1-git-send-email-i.maximets@samsung.com
State Accepted
Delegated to: Ian Stokes
Series
  • [ovs-dev] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

Commit Message

Ilya Maximets Nov. 22, 2017, 12:11 p.m.
This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.

Padding and aligning of dp_netdev_pmd_thread structure members
is useless, broken in several ways, and only degrades the
maintainability and extensibility of the structure.

Issues:

    1. It's not working, because all instances of struct
       dp_netdev_pmd_thread are allocated by plain malloc. The
       memory is not cacheline-aligned, so the structure almost never
       starts at an aligned memory address. This means that any further
       padding and alignment inside the structure is completely
       useless. For example:

       Breakpoint 1, pmd_thread_main
       (gdb) p pmd
       $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
       (gdb) p &pmd->cacheline1
       $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
       (gdb) p &pmd->cacheline0
       $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
       (gdb) p &pmd->flow_cache
       $53 = (struct emc_cache *) 0x1b1afe0

       All of the above addresses are shifted from the cacheline start
       by 32 bytes.

       Can we fix it properly? NO.
       OVS currently doesn't have an appropriate API for allocating
       aligned memory. The best candidate is 'xmalloc_cacheline()', but
       it clearly states that "The memory returned will not be at the
       start of a cache line, though, so don't assume such alignment".
       Moreover, this function will never return aligned memory on
       Windows or MacOS.

    2. CACHE_LINE_SIZE is not constant. Different architectures have
       different cache line sizes, but the code assumes that
       CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
       members are grouped by 64 bytes and padded to CACHE_LINE_SIZE.
       This leads to huge holes in the structures if CACHE_LINE_SIZE
       differs from 64, which is the opposite of portability. If I want
       good cmap performance, I need CACHE_LINE_SIZE to equal the real
       cache line size, but then I get huge holes in the structures.
       If you take a look at struct rte_mbuf in DPDK, you'll see that
       it uses two defines, RTE_CACHE_LINE_SIZE and
       RTE_CACHE_LINE_MIN_SIZE, to avoid holes in the mbuf structure.

    3. The sizes of system/libc-defined types are not constant across
       systems. For example, sizeof(pthread_mutex_t) == 48 on my
       ARMv8 machine, but only 40 on x86. The difference could be
       much bigger on Windows or MacOS systems. But the code assumes
       that sizeof(struct ovs_mutex) is always 48 bytes. This may lead
       to broken alignment, big holes from padding, or wrong comments
       about the number of free pad bytes.

    4. The sizes of many fields in the structure depend on defines
       such as DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so
       on. Any change to these defines, or to any structure embedded
       in the thread structure, forces a non-trivial refactoring of
       the whole dp_netdev_pmd_thread structure. This greatly reduces
       maintainability and complicates the development of new
       features.

    5. There is no reason to align the flow_cache member: it is too
       big, and we usually access random entries from a single thread
       only.

So, the padding/alignment only creates the appearance of a performance
optimization but does nothing useful in reality. It only complicates
maintenance and adds huge holes on non-x86 architectures and non-Linux
systems. The performance improvement stated in the original commit
message is likely random and not meaningful. I see no performance
difference.

Most of the above issues are also true for some other padded/aligned
structures like 'struct netdev_dpdk'. They will be treated separately.

CC: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
CC: Ben Pfaff <blp@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
---
 lib/dpif-netdev.c | 160 +++++++++++++++++++++++-------------------------------
 1 file changed, 69 insertions(+), 91 deletions(-)

Comments

Bodireddy, Bhanuprakash Nov. 22, 2017, 5:14 p.m. | #1
>This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>
>Padding and aligning of dp_netdev_pmd_thread structure members is
>useless, broken in several ways and only greatly degrades maintainability
>and extensibility of the structure.

The idea of my earlier patch was to mark the cache lines and reduce the holes while still maintaining the grouping of related members in this structure.
Cache line marking is also good practice: it makes someone extra cautious when extending or editing important structures.
Most importantly, I was experimenting with prefetching on this structure and needed the cache line markers for it.

I see that you are on ARM (I don't have hardware to test) and want to know if this commit has any negative effect; any numbers would be appreciated.
More comments inline.

>
>Issues:
>
>    1. It's not working because all the instances of struct
>       dp_netdev_pmd_thread allocated only by usual malloc. All the
>       memory is not aligned to cachelines -> structure almost never
>       starts at aligned memory address. This means that any further
>       paddings and alignments inside the structure are completely
>       useless. For example:
>
>       Breakpoint 1, pmd_thread_main
>       (gdb) p pmd
>       $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>       (gdb) p &pmd->cacheline1
>       $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>       (gdb) p &pmd->cacheline0
>       $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>       (gdb) p &pmd->flow_cache
>       $53 = (struct emc_cache *) 0x1b1afe0
>
>       All of the above addresses are shifted from the cacheline start by 32 bytes.

If you see below, all the addresses are 64-byte aligned.

(gdb) p pmd
$1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
(gdb) p &pmd->cacheline0
$2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
(gdb) p &pmd->cacheline1
$3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
(gdb) p &pmd->flow_cache
$4 = (struct emc_cache *) 0x7fc1e9b1a0c0
(gdb) p &pmd->flow_table
$5 = (struct cmap *) 0x7fc1e9fba100
(gdb) p &pmd->stats
$6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
(gdb) p &pmd->port_mutex
$7 = (struct ovs_mutex *) 0x7fc1e9fba180
(gdb) p &pmd->poll_list
$8 = (struct hmap *) 0x7fc1e9fba1c0
(gdb) p &pmd->tnl_port_cache
$9 = (struct hmap *) 0x7fc1e9fba200
(gdb) p &pmd->stats_zero
$10 = (unsigned long long (*)[5]) 0x7fc1e9fba240

I tried using xzalloc_cacheline() instead of the default xzalloc() here. I tried tens of times and always found that the address is
64-byte aligned, so it should start at the beginning of a cache line on x86.
Not sure why the comment "(The memory returned will not be at the start of a cache line, though, so don't assume such alignment.)" says otherwise?

>
>       Can we fix it properly? NO.
>       OVS currently doesn't have appropriate API to allocate aligned
>       memory. The best candidate is 'xmalloc_cacheline()' but it
>       clearly states that "The memory returned will not be at the
>       start of a cache line, though, so don't assume such alignment".
>       And also, this function will never return aligned memory on
>       Windows or MacOS.
>
>    2. CACHE_LINE_SIZE is not constant. Different architectures have
>       different cache line sizes, but the code assumes that
>       CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
>       members are grouped by 64 bytes and padded to CACHE_LINE_SIZE.
>       This leads to a huge holes in a structures if CACHE_LINE_SIZE
>       differs from 64. This is opposite to portability. If I want
>       good performance of cmap I need to have CACHE_LINE_SIZE equal
>       to the real cache line size, but I will have huge holes in the
>       structures. If you'll take a look to struct rte_mbuf from DPDK
>       you'll see that it uses 2 defines: RTE_CACHE_LINE_SIZE and
>       RTE_CACHE_LINE_MIN_SIZE to avoid holes in mbuf structure.

I understand that ARM and a few other processors (like OCTEON) have 128-byte cache lines.
But again, I am curious about the performance impact in your case with this new alignment.

>
>    3. Sizes of system/libc defined types are not constant for all the
>       systems. For example, sizeof(pthread_mutex_t) == 48 on my
>       ARMv8 machine, but only 40 on x86. The difference could be
>       much bigger on Windows or MacOS systems. But the code assumes
>       that sizeof(struct ovs_mutex) is always 48 bytes. This may lead
>       to broken alignment/big holes in case of padding/wrong comments
>       about amount of free pad bytes.

This isn't a functional issue, as you mention; it's more a problem with the comment that states the number of pad bytes.
On ARM it would be just 8 pad bytes instead of 16 on x86.

        union {
                struct {
                        struct ovs_mutex port_mutex;     /* 4849984    48 */
                };    /*          48 */
                uint8_t            pad13[64];            /*          64 */
        };

>
>    4. The sizes of many fields in the structure depend on defines like
>       DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so on.
>       Any change in these defines or any change in any structure
>       contained by thread should lead to the not so simple
>       refactoring of the whole dp_netdev_pmd_thread structure. This
>       greatly reduces maintainability and complicates development of
>       a new features.

I don't think it complicates development; instead, I feel the commit gives a clear indication
to the developer that the members are grouped, aligned, and marked with cacheline markers.
This makes the developer extra cautious when adding new members, so that holes can be avoided.

Cacheline marking the structure is good practice, and I am sure this structure is significant
and should be carefully extended in the future.

>
>    5. There is no reason to align flow_cache member because it's
>       too big and we usually access random entries by single thread
>       only.
>

I see your point. This patch wasn't done for performance; it was more about bringing some order to this
ever-growing structure. During testing I found that for some test cases aligning the flow_cache consistently gave
me a 100k+ performance gain, so it was added.

>So, the padding/alignment only creates some visibility of performance
>optimization but does nothing useful in reality. It only complicates
>maintenance and adds huge holes for non-x86 architectures and non-Linux
>systems. Performance improvement stated in a original commit message
>should be random and not valuable. I see no performance difference.

I understand that this causes issues on architectures with different cache line sizes,
but the majority of them have 64-byte cache lines, so this change makes sense.

If you have performance data proving that this causes severe perf degradation, I can think of workarounds for ARM.

- Bhanuprakash.

>
>Most of the above issues are also true for some other padded/aligned
>structures like 'struct netdev_dpdk'. They will be treated separately.
>
>CC: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
>CC: Ben Pfaff <blp@ovn.org>
>Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>---
> lib/dpif-netdev.c | 160 +++++++++++++++++++++++-----------------------------
>--
> 1 file changed, 69 insertions(+), 91 deletions(-)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 0a62630..6784269
>100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -547,31 +547,18 @@ struct tx_port {
>  * actions in either case.
>  * */
> struct dp_netdev_pmd_thread {
>-    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
>-        struct dp_netdev *dp;
>-        struct cmap_node node;          /* In 'dp->poll_threads'. */
>-        pthread_cond_t cond;            /* For synchronizing pmd thread
>-                                           reload. */
>-    );
>-
>-    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>-        struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>-        pthread_t thread;
>-        unsigned core_id;               /* CPU core id of this pmd thread. */
>-        int numa_id;                    /* numa node id of this pmd thread. */
>-    );
>+    struct dp_netdev *dp;
>+    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>+    struct cmap_node node;          /* In 'dp->poll_threads'. */
>+
>+    pthread_cond_t cond;            /* For synchronizing pmd thread reload. */
>+    struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>
>     /* Per thread exact-match cache.  Note, the instance for cpu core
>      * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
>      * need to be protected by 'non_pmd_mutex'.  Every other instance
>      * will only be accessed by its own pmd thread. */
>-    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
>-    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>-
>-    /* Queue id used by this pmd thread to send packets on all netdevs if
>-     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>-     * than 'cmap_count(dp->poll_threads)'. */
>-    uint32_t static_tx_qid;
>+    struct emc_cache flow_cache;
>
>     /* Flow-Table and classifiers
>      *
>@@ -580,77 +567,68 @@ struct dp_netdev_pmd_thread {
>      * 'flow_mutex'.
>      */
>     struct ovs_mutex flow_mutex;
>-    PADDED_MEMBERS(CACHE_LINE_SIZE,
>-        struct cmap flow_table OVS_GUARDED; /* Flow table. */
>-
>-        /* One classifier per in_port polled by the pmd */
>-        struct cmap classifiers;
>-        /* Periodically sort subtable vectors according to hit frequencies */
>-        long long int next_optimization;
>-        /* End of the next time interval for which processing cycles
>-           are stored for each polled rxq. */
>-        long long int rxq_next_cycle_store;
>-
>-        /* Cycles counters */
>-        struct dp_netdev_pmd_cycles cycles;
>-
>-        /* Used to count cycles. See 'cycles_counter_end()'. */
>-        unsigned long long last_cycles;
>-        struct latch exit_latch;        /* For terminating the pmd thread. */
>-    );
>-
>-    PADDED_MEMBERS(CACHE_LINE_SIZE,
>-        /* Statistics. */
>-        struct dp_netdev_pmd_stats stats;
>-
>-        struct seq *reload_seq;
>-        uint64_t last_reload_seq;
>-        atomic_bool reload;             /* Do we need to reload ports? */
>-        bool isolated;
>-
>-        /* Set to true if the pmd thread needs to be reloaded. */
>-        bool need_reload;
>-        /* 5 pad bytes. */
>-    );
>-
>-    PADDED_MEMBERS(CACHE_LINE_SIZE,
>-        struct ovs_mutex port_mutex;    /* Mutex for 'poll_list'
>-                                           and 'tx_ports'. */
>-        /* 16 pad bytes. */
>-    );
>-    PADDED_MEMBERS(CACHE_LINE_SIZE,
>-        /* List of rx queues to poll. */
>-        struct hmap poll_list OVS_GUARDED;
>-        /* Map of 'tx_port's used for transmission.  Written by the main
>-         * thread, read by the pmd thread. */
>-        struct hmap tx_ports OVS_GUARDED;
>-    );
>-    PADDED_MEMBERS(CACHE_LINE_SIZE,
>-        /* These are thread-local copies of 'tx_ports'.  One contains only
>-         * tunnel ports (that support push_tunnel/pop_tunnel), the other
>-         * contains ports with at least one txq (that support send).
>-         * A port can be in both.
>-         *
>-         * There are two separate maps to make sure that we don't try to
>-         * execute OUTPUT on a device which has 0 txqs or PUSH/POP on a
>-         * non-tunnel device.
>-         *
>-         * The instances for cpu core NON_PMD_CORE_ID can be accessed by
>-         * multiple threads and thusly need to be protected by
>'non_pmd_mutex'.
>-         * Every other instance will only be accessed by its own pmd thread. */
>-        struct hmap tnl_port_cache;
>-        struct hmap send_port_cache;
>-    );
>-
>-    PADDED_MEMBERS(CACHE_LINE_SIZE,
>-        /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>-         * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>-         * values and subtracts them from 'stats' and 'cycles' before
>-         * reporting to the user */
>-        unsigned long long stats_zero[DP_N_STATS];
>-        uint64_t cycles_zero[PMD_N_CYCLES];
>-        /* 8 pad bytes. */
>-    );
>+    struct cmap flow_table OVS_GUARDED; /* Flow table. */
>+
>+    /* One classifier per in_port polled by the pmd */
>+    struct cmap classifiers;
>+    /* Periodically sort subtable vectors according to hit frequencies */
>+    long long int next_optimization;
>+    /* End of the next time interval for which processing cycles
>+       are stored for each polled rxq. */
>+    long long int rxq_next_cycle_store;
>+
>+    /* Statistics. */
>+    struct dp_netdev_pmd_stats stats;
>+
>+    /* Cycles counters */
>+    struct dp_netdev_pmd_cycles cycles;
>+
>+    /* Used to count cicles. See 'cycles_counter_end()' */
>+    unsigned long long last_cycles;
>+
>+    struct latch exit_latch;        /* For terminating the pmd thread. */
>+    struct seq *reload_seq;
>+    uint64_t last_reload_seq;
>+    atomic_bool reload;             /* Do we need to reload ports? */
>+    pthread_t thread;
>+    unsigned core_id;               /* CPU core id of this pmd thread. */
>+    int numa_id;                    /* numa node id of this pmd thread. */
>+    bool isolated;
>+
>+    /* Queue id used by this pmd thread to send packets on all netdevs if
>+     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>+     * than 'cmap_count(dp->poll_threads)'. */
>+    uint32_t static_tx_qid;
>+
>+    struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and 'tx_ports'. */
>+    /* List of rx queues to poll. */
>+    struct hmap poll_list OVS_GUARDED;
>+    /* Map of 'tx_port's used for transmission.  Written by the main thread,
>+     * read by the pmd thread. */
>+    struct hmap tx_ports OVS_GUARDED;
>+
>+    /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
>+     * ports (that support push_tunnel/pop_tunnel), the other contains ports
>+     * with at least one txq (that support send).  A port can be in both.
>+     *
>+     * There are two separate maps to make sure that we don't try to execute
>+     * OUTPUT on a device which has 0 txqs or PUSH/POP on a non-tunnel
>device.
>+     *
>+     * The instances for cpu core NON_PMD_CORE_ID can be accessed by
>multiple
>+     * threads, and thusly need to be protected by 'non_pmd_mutex'.  Every
>+     * other instance will only be accessed by its own pmd thread. */
>+    struct hmap tnl_port_cache;
>+    struct hmap send_port_cache;
>+
>+    /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>+     * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>+     * values and subtracts them from 'stats' and 'cycles' before
>+     * reporting to the user */
>+    unsigned long long stats_zero[DP_N_STATS];
>+    uint64_t cycles_zero[PMD_N_CYCLES];
>+
>+    /* Set to true if the pmd thread needs to be reloaded. */
>+    bool need_reload;
> };
>
> /* Interface to netdev-based datapath. */
>--
>2.7.4
Ilya Maximets Nov. 24, 2017, 3:04 p.m. | #2
On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>>
>> Padding and aligning of dp_netdev_pmd_thread structure members is
>> useless, broken in several ways and only greatly degrades maintainability
>> and extensibility of the structure.
> 
> The idea of my earlier patch was to mark the cache lines and reduce the holes while still maintaining the grouping of related members in this structure.

Some of the grouping choices look strange. For example, it looks illogical that
'exit_latch' is grouped with 'flow_table' but not with 'reload_seq' and the other
reload-related members. It looks strange that statistics and counters are spread
across different groups. So, IMHO, it's not well grouped.

> Also cache line marking is a good practice to make some one extra cautious when extending or editing important structures . 
> Most importantly I was experimenting with prefetching on this structure and needed cache line markers for it. 
> 
> I see that you are on ARM (I don't have HW to test) and want to know if this commit has any negative effect and any numbers would be appreciated.

Basic VM-VM testing shows a stable 0.5% performance improvement with the revert applied.
The padding adds 560 additional bytes of holes.

> More comments inline.
> 
>>
>> Issues:
>>
>>    1. It's not working because all the instances of struct
>>       dp_netdev_pmd_thread allocated only by usual malloc. All the
>>       memory is not aligned to cachelines -> structure almost never
>>       starts at aligned memory address. This means that any further
>>       paddings and alignments inside the structure are completely
>>       useless. For example:
>>
>>       Breakpoint 1, pmd_thread_main
>>       (gdb) p pmd
>>       $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>>       (gdb) p &pmd->cacheline1
>>       $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>>       (gdb) p &pmd->cacheline0
>>       $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>>       (gdb) p &pmd->flow_cache
>>       $53 = (struct emc_cache *) 0x1b1afe0
>>
>>       All of the above addresses shifted from cacheline start by 32B.
> 
> If you see below all the addresses are 64 byte aligned.
> 
> (gdb) p pmd
> $1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
> (gdb) p &pmd->cacheline0
> $2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
> (gdb) p &pmd->cacheline1
> $3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
> (gdb) p &pmd->flow_cache
> $4 = (struct emc_cache *) 0x7fc1e9b1a0c0
> (gdb) p &pmd->flow_table
> $5 = (struct cmap *) 0x7fc1e9fba100
> (gdb) p &pmd->stats
> $6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
> (gdb) p &pmd->port_mutex
> $7 = (struct ovs_mutex *) 0x7fc1e9fba180
> (gdb) p &pmd->poll_list
> $8 = (struct hmap *) 0x7fc1e9fba1c0
> (gdb) p &pmd->tnl_port_cache
> $9 = (struct hmap *) 0x7fc1e9fba200
> (gdb) p &pmd->stats_zero
> $10 = (unsigned long long (*)[5]) 0x7fc1e9fba240
> 
> I tried using xzalloc_cacheline instead of default xzalloc() here.  I tried tens of times and always found that the address is
> 64 byte aligned and it should start at the beginning of cache line on X86. 
> Not sure why the comment  " (The memory returned will not be at the start of  a cache line, though, so don't assume such alignment.)" says otherwise?

Yes, you will always get aligned addresses on your x86 Linux system, which supports the
posix_memalign() call. The comment says what it says because the function falls back to some
memory-allocation tricks when posix_memalign() is not available (Windows, some MacOS,
maybe some Linux systems (not sure)), and the address will not be aligned in that case.

> 
>>
>>       Can we fix it properly? NO.
>>       OVS currently doesn't have appropriate API to allocate aligned
>>       memory. The best candidate is 'xmalloc_cacheline()' but it
>>       clearly states that "The memory returned will not be at the
>>       start of a cache line, though, so don't assume such alignment".
>>       And also, this function will never return aligned memory on
>>       Windows or MacOS.
>>
>>    2. CACHE_LINE_SIZE is not constant. Different architectures have
>>       different cache line sizes, but the code assumes that
>>       CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
>>       members are grouped by 64 bytes and padded to CACHE_LINE_SIZE.
>>       This leads to a huge holes in a structures if CACHE_LINE_SIZE
>>       differs from 64. This is opposite to portability. If I want
>>       good performance of cmap I need to have CACHE_LINE_SIZE equal
>>       to the real cache line size, but I will have huge holes in the
>>       structures. If you'll take a look to struct rte_mbuf from DPDK
>>       you'll see that it uses 2 defines: RTE_CACHE_LINE_SIZE and
>>       RTE_CACHE_LINE_MIN_SIZE to avoid holes in mbuf structure.
> 
> I understand that ARM and few other processors (like OCTEON) have 128 bytes cache lines.
> But  again curious of performance impact in your case with this new alignment.
> 
>>
>>    3. Sizes of system/libc defined types are not constant for all the
>>       systems. For example, sizeof(pthread_mutex_t) == 48 on my
>>       ARMv8 machine, but only 40 on x86. The difference could be
>>       much bigger on Windows or MacOS systems. But the code assumes
>>       that sizeof(struct ovs_mutex) is always 48 bytes. This may lead
>>       to broken alignment/big holes in case of padding/wrong comments
>>       about amount of free pad bytes.
> 
> This isn't an issue as you would have already mentioned and more about issue with the comment that reads the pad bytes.
> In case of ARM it would be just 8 pad bytes instead of 16 on X86. 
> 
>         union {
>                 struct {
>                         struct ovs_mutex port_mutex;     /* 4849984    48 */
>                 };    /*          48 */
>                 uint8_t            pad13[64];            /*          64 */
>         };                                               /
> 

It's not only about 'port_mutex'. If you take a look at 'flow_mutex',
you will see that it is not even padded. So, increasing the size of
'flow_mutex' shifts all the following padded blocks, and no block will
be properly aligned, even if the structure is allocated on aligned memory.

>>
>>    4. The sizes of many fields in the structure depend on defines like
>>       DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so on.
>>       Any change in these defines or any change in any structure
>>       contained by thread should lead to the not so simple
>>       refactoring of the whole dp_netdev_pmd_thread structure. This
>>       greatly reduces maintainability and complicates development of
>>       a new features.
> 
> I don't think it complicates development and instead I feel the commit gives a clear indication
> to the developer that the members are grouped and aligned and marked with cacheline markers.
> This makes the developer extra cautious when adding new members so that holes can be avoided. 

While rebasing the output batching patch set I found that I need to
remove 'unsigned long long last_cycles' and add 'struct dp_netdev_pmd_thread_ctx ctx',
which is 8 bytes larger. Could you please suggest where I should place this
new structure member, and what to do with the hole left by 'last_cycles'?

This is not a trivial question, because the already poor grouping will become worse
almost any way we do it.

> 
> Cacheline marking the structure is a good practice and I am sure this structure is significant
> and should be carefully extended in the future.

Not so sure about that.

> 
>>
>>    5. There is no reason to align flow_cache member because it's
>>       too big and we usually access random entries by single thread
>>       only.
>>
> 
> I see your point. This patch wasn't done for performance and instead more to have some order with this
> ever growing structure. During testing I found that for some test cases aligning the flow_cache was giving 
> me 100k+ performance consistently and so was added.

That was a random performance boost. You achieved it without aligned memory allocation;
just luck with your system environment. Using xzalloc_cacheline() will likely
eliminate this performance difference or even degrade the performance.

> 
>> So, the padding/alignment only creates some visibility of performance
>> optimization but does nothing useful in reality. It only complicates
>> maintenance and adds huge holes for non-x86 architectures and non-Linux
>> systems. Performance improvement stated in a original commit message
>> should be random and not valuable. I see no performance difference.
> 
> I understand that this is causing issues with architecture having different cache line sizes,
> but unfortunately majority of them have 64 byte cache lines so this change makes sense.

I understand that 64-byte cache lines are far more widespread. I also have x86 as a target arch,
but still, IMHO, OVS is a cross-platform application, and it should not have platform-dependent
tricks that favor one architecture/platform and worsen others.

> 
> If you have performance data to prove that this causes severe perf degradation I can think of workarounds for ARM.
> 
> - Bhanuprakash.


P.S.: If you want to test with CACHE_LINE_SIZE=128, you will have to apply the
following patch to avoid a build-time assert (I'll send it formally later):

-----------------------------------------------------------------------------
diff --git a/lib/cmap.c b/lib/cmap.c
index 35decea..5b15ecd 100644
--- a/lib/cmap.c
+++ b/lib/cmap.c
@@ -123,12 +123,11 @@ COVERAGE_DEFINE(cmap_shrink);
 /* Number of entries per bucket: 7 on 32-bit, 5 on 64-bit. */
 #define CMAP_K ((CACHE_LINE_SIZE - 4) / CMAP_ENTRY_SIZE)
 
-/* Pad to make a bucket a full cache line in size: 4 on 32-bit, 0 on 64-bit. */
-#define CMAP_PADDING ((CACHE_LINE_SIZE - 4) - (CMAP_K * CMAP_ENTRY_SIZE))
-
 /* A cuckoo hash bucket.  Designed to be cache-aligned and exactly one cache
  * line long. */
 struct cmap_bucket {
+    /* Padding to make cmap_bucket exactly one cache line long. */
+    PADDED_MEMBERS(CACHE_LINE_SIZE,
     /* Allows readers to track in-progress changes.  Initially zero, each
      * writer increments this value just before and just after each change (see
      * cmap_set_bucket()).  Thus, a reader can ensure that it gets a consistent
@@ -145,11 +144,7 @@ struct cmap_bucket {
      * slots. */
     uint32_t hashes[CMAP_K];
     struct cmap_node nodes[CMAP_K];
-
-    /* Padding to make cmap_bucket exactly one cache line long. */
-#if CMAP_PADDING > 0
-    uint8_t pad[CMAP_PADDING];
-#endif
+    );
 };
 BUILD_ASSERT_DECL(sizeof(struct cmap_bucket) == CACHE_LINE_SIZE);
 
diff --git a/lib/util.h b/lib/util.h
index 3c43c2c..514fdaa 100644
--- a/lib/util.h
+++ b/lib/util.h
@@ -61,7 +61,7 @@ struct Bad_arg_to_ARRAY_SIZE {
 
 /* This system's cache line size, in bytes.
  * Being wrong hurts performance but not correctness. */
-#define CACHE_LINE_SIZE 64
+#define CACHE_LINE_SIZE 128
 BUILD_ASSERT_DECL(IS_POW2(CACHE_LINE_SIZE));
 
 /* Cacheline marking is typically done using zero-sized array.
-----------------------------------------------------------------------------


Best regards, Ilya Maximets.

> 
>>
>> Most of the above issues are also true for some other padded/aligned
>> structures like 'struct netdev_dpdk'. They will be treated separately.
>>
>> CC: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
>> CC: Ben Pfaff <blp@ovn.org>
>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>> ---
>> lib/dpif-netdev.c | 160 +++++++++++++++++++++++-----------------------------
>> --
>> 1 file changed, 69 insertions(+), 91 deletions(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 0a62630..6784269
>> 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -547,31 +547,18 @@ struct tx_port {
>>  * actions in either case.
>>  * */
>> struct dp_netdev_pmd_thread {
>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
>> -        struct dp_netdev *dp;
>> -        struct cmap_node node;          /* In 'dp->poll_threads'. */
>> -        pthread_cond_t cond;            /* For synchronizing pmd thread
>> -                                           reload. */
>> -    );
>> -
>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>> -        struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>> -        pthread_t thread;
>> -        unsigned core_id;               /* CPU core id of this pmd thread. */
>> -        int numa_id;                    /* numa node id of this pmd thread. */
>> -    );
>> +    struct dp_netdev *dp;
>> +    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>> +    struct cmap_node node;          /* In 'dp->poll_threads'. */
>> +
>> +    pthread_cond_t cond;            /* For synchronizing pmd thread reload. */
>> +    struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>
>>     /* Per thread exact-match cache.  Note, the instance for cpu core
>>      * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
>>      * need to be protected by 'non_pmd_mutex'.  Every other instance
>>      * will only be accessed by its own pmd thread. */
>> -    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
>> -    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>> -
>> -    /* Queue id used by this pmd thread to send packets on all netdevs if
>> -     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>> -     * than 'cmap_count(dp->poll_threads)'. */
>> -    uint32_t static_tx_qid;
>> +    struct emc_cache flow_cache;
>>
>>     /* Flow-Table and classifiers
>>      *
>> @@ -580,77 +567,68 @@ struct dp_netdev_pmd_thread {
>>      * 'flow_mutex'.
>>      */
>>     struct ovs_mutex flow_mutex;
>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>> -        struct cmap flow_table OVS_GUARDED; /* Flow table. */
>> -
>> -        /* One classifier per in_port polled by the pmd */
>> -        struct cmap classifiers;
>> -        /* Periodically sort subtable vectors according to hit frequencies */
>> -        long long int next_optimization;
>> -        /* End of the next time interval for which processing cycles
>> -           are stored for each polled rxq. */
>> -        long long int rxq_next_cycle_store;
>> -
>> -        /* Cycles counters */
>> -        struct dp_netdev_pmd_cycles cycles;
>> -
>> -        /* Used to count cycles. See 'cycles_counter_end()'. */
>> -        unsigned long long last_cycles;
>> -        struct latch exit_latch;        /* For terminating the pmd thread. */
>> -    );
>> -
>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>> -        /* Statistics. */
>> -        struct dp_netdev_pmd_stats stats;
>> -
>> -        struct seq *reload_seq;
>> -        uint64_t last_reload_seq;
>> -        atomic_bool reload;             /* Do we need to reload ports? */
>> -        bool isolated;
>> -
>> -        /* Set to true if the pmd thread needs to be reloaded. */
>> -        bool need_reload;
>> -        /* 5 pad bytes. */
>> -    );
>> -
>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>> -        struct ovs_mutex port_mutex;    /* Mutex for 'poll_list'
>> -                                           and 'tx_ports'. */
>> -        /* 16 pad bytes. */
>> -    );
>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>> -        /* List of rx queues to poll. */
>> -        struct hmap poll_list OVS_GUARDED;
>> -        /* Map of 'tx_port's used for transmission.  Written by the main
>> -         * thread, read by the pmd thread. */
>> -        struct hmap tx_ports OVS_GUARDED;
>> -    );
>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>> -        /* These are thread-local copies of 'tx_ports'.  One contains only
>> -         * tunnel ports (that support push_tunnel/pop_tunnel), the other
>> -         * contains ports with at least one txq (that support send).
>> -         * A port can be in both.
>> -         *
>> -         * There are two separate maps to make sure that we don't try to
>> -         * execute OUTPUT on a device which has 0 txqs or PUSH/POP on a
>> -         * non-tunnel device.
>> -         *
>> -         * The instances for cpu core NON_PMD_CORE_ID can be accessed by
>> -         * multiple threads and thusly need to be protected by 'non_pmd_mutex'.
>> -         * Every other instance will only be accessed by its own pmd thread. */
>> -        struct hmap tnl_port_cache;
>> -        struct hmap send_port_cache;
>> -    );
>> -
>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>> -        /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>> -         * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>> -         * values and subtracts them from 'stats' and 'cycles' before
>> -         * reporting to the user */
>> -        unsigned long long stats_zero[DP_N_STATS];
>> -        uint64_t cycles_zero[PMD_N_CYCLES];
>> -        /* 8 pad bytes. */
>> -    );
>> +    struct cmap flow_table OVS_GUARDED; /* Flow table. */
>> +
>> +    /* One classifier per in_port polled by the pmd */
>> +    struct cmap classifiers;
>> +    /* Periodically sort subtable vectors according to hit frequencies */
>> +    long long int next_optimization;
>> +    /* End of the next time interval for which processing cycles
>> +       are stored for each polled rxq. */
>> +    long long int rxq_next_cycle_store;
>> +
>> +    /* Statistics. */
>> +    struct dp_netdev_pmd_stats stats;
>> +
>> +    /* Cycles counters */
>> +    struct dp_netdev_pmd_cycles cycles;
>> +
>> +    /* Used to count cicles. See 'cycles_counter_end()' */
>> +    unsigned long long last_cycles;
>> +
>> +    struct latch exit_latch;        /* For terminating the pmd thread. */
>> +    struct seq *reload_seq;
>> +    uint64_t last_reload_seq;
>> +    atomic_bool reload;             /* Do we need to reload ports? */
>> +    pthread_t thread;
>> +    unsigned core_id;               /* CPU core id of this pmd thread. */
>> +    int numa_id;                    /* numa node id of this pmd thread. */
>> +    bool isolated;
>> +
>> +    /* Queue id used by this pmd thread to send packets on all netdevs if
>> +     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>> +     * than 'cmap_count(dp->poll_threads)'. */
>> +    uint32_t static_tx_qid;
>> +
>> +    struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and 'tx_ports'. */
>> +    /* List of rx queues to poll. */
>> +    struct hmap poll_list OVS_GUARDED;
>> +    /* Map of 'tx_port's used for transmission.  Written by the main thread,
>> +     * read by the pmd thread. */
>> +    struct hmap tx_ports OVS_GUARDED;
>> +
>> +    /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
>> +     * ports (that support push_tunnel/pop_tunnel), the other contains ports
>> +     * with at least one txq (that support send).  A port can be in both.
>> +     *
>> +     * There are two separate maps to make sure that we don't try to execute
>> +     * OUTPUT on a device which has 0 txqs or PUSH/POP on a non-tunnel device.
>> +     *
>> +     * The instances for cpu core NON_PMD_CORE_ID can be accessed by multiple
>> +     * threads, and thusly need to be protected by 'non_pmd_mutex'.  Every
>> +     * other instance will only be accessed by its own pmd thread. */
>> +    struct hmap tnl_port_cache;
>> +    struct hmap send_port_cache;
>> +
>> +    /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>> +     * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>> +     * values and subtracts them from 'stats' and 'cycles' before
>> +     * reporting to the user */
>> +    unsigned long long stats_zero[DP_N_STATS];
>> +    uint64_t cycles_zero[PMD_N_CYCLES];
>> +
>> +    /* Set to true if the pmd thread needs to be reloaded. */
>> +    bool need_reload;
>> };
>>
>> /* Interface to netdev-based datapath. */
>> --
>> 2.7.4
> 
> 
> 
>
Bodireddy, Bhanuprakash Nov. 24, 2017, 4:08 p.m. | #3
>On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>>>
>>> Padding and aligning of dp_netdev_pmd_thread structure members is
>>> useless, broken in several ways and only greatly degrades
>>> maintainability and extensibility of the structure.
>>
>> The idea of my earlier patch was to mark the cache lines and reduce the
>holes while still maintaining the grouping of related members in this structure.
>
>Some of the grouping aspects look strange. For example, it looks illogical that
>'exit_latch' is grouped with 'flow_table' rather than with 'reload_seq' and the
>other reload-related members. It also looks strange that statistics and counters
>are spread across different groups. So, IMHO, it's not well grouped.

I had to strike a fine balance and some members may be placed in a different group
due to their sizes and importance. Let me think if I can make it better.

>
>> Also, cache line marking is a good practice to make someone extra cautious
>> when extending or editing important structures.
>> Most importantly I was experimenting with prefetching on this structure
>and needed cache line markers for it.
>>
>> I see that you are on ARM (I don't have HW to test) and want to know if this
>> commit has any negative effect; any numbers would be appreciated.
>
>Basic VM-VM testing shows a stable 0.5% performance improvement with the
>revert applied.

I did P2P, PVP and PVVP with IXIA and haven't noticed any drop on X86.  

>Padding adds 560 additional bytes of holes.
As the cache line on ARM is 128 bytes, it created holes; I can find a workaround to handle this.

>
>> More comments inline.
>>
>>>
>>> Issues:
>>>
>>>    1. It's not working because all the instances of struct
>>>       dp_netdev_pmd_thread allocated only by usual malloc. All the
>>>       memory is not aligned to cachelines -> structure almost never
>>>       starts at aligned memory address. This means that any further
>>>       paddings and alignments inside the structure are completely
>>>       useless. For example:
>>>
>>>       Breakpoint 1, pmd_thread_main
>>>       (gdb) p pmd
>>>       $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>>>       (gdb) p &pmd->cacheline1
>>>       $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>>>       (gdb) p &pmd->cacheline0
>>>       $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>>>       (gdb) p &pmd->flow_cache
>>>       $53 = (struct emc_cache *) 0x1b1afe0
>>>
>>>       All of the above addresses shifted from cacheline start by 32B.
>>
>> If you see below all the addresses are 64 byte aligned.
>>
>> (gdb) p pmd
>> $1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
>> (gdb) p &pmd->cacheline0
>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
>> (gdb) p &pmd->cacheline1
>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
>> (gdb) p &pmd->flow_cache
>> $4 = (struct emc_cache *) 0x7fc1e9b1a0c0
>> (gdb) p &pmd->flow_table
>> $5 = (struct cmap *) 0x7fc1e9fba100
>> (gdb) p &pmd->stats
>> $6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
>> (gdb) p &pmd->port_mutex
>> $7 = (struct ovs_mutex *) 0x7fc1e9fba180
>> (gdb) p &pmd->poll_list
>> $8 = (struct hmap *) 0x7fc1e9fba1c0
>> (gdb) p &pmd->tnl_port_cache
>> $9 = (struct hmap *) 0x7fc1e9fba200
>> (gdb) p &pmd->stats_zero
>> $10 = (unsigned long long (*)[5]) 0x7fc1e9fba240
>>
>> I tried using xzalloc_cacheline instead of the default xzalloc() here.  I
>> tried tens of times and always found that the address is
>> 64-byte aligned and it should start at the beginning of a cache line on X86.
>> Not sure why the comment "(The memory returned will not be at the start
>> of a cache line, though, so don't assume such alignment.)" says otherwise?
>
>Yes, you will always get aligned addresses on your x86 Linux system, which
>supports the posix_memalign() call. The comment says what it says because the
>function will use some memory allocation tricks in case posix_memalign() is
>not available (Windows, some MacOS, maybe some Linux systems (not sure)), and
>the address will not be aligned in this case.

I also verified the other case, when posix_memalign isn't available, and even in that case
it returns the address aligned on a CACHE_LINE_SIZE boundary. I will send out a patch to use
xzalloc_cacheline for allocating the memory.

>
>>
>>>
>>>       Can we fix it properly? NO.
>>>       OVS currently doesn't have appropriate API to allocate aligned
>>>       memory. The best candidate is 'xmalloc_cacheline()' but it
>>>       clearly states that "The memory returned will not be at the
>>>       start of a cache line, though, so don't assume such alignment".
>>>       And also, this function will never return aligned memory on
>>>       Windows or MacOS.
>>>
>>>    2. CACHE_LINE_SIZE is not constant. Different architectures have
>>>       different cache line sizes, but the code assumes that
>>>       CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
>>>       members are grouped by 64 bytes and padded to CACHE_LINE_SIZE.
>>>       This leads to a huge holes in a structures if CACHE_LINE_SIZE
>>>       differs from 64. This is opposite to portability. If I want
>>>       good performance of cmap I need to have CACHE_LINE_SIZE equal
>>>       to the real cache line size, but I will have huge holes in the
>>>       structures. If you'll take a look to struct rte_mbuf from DPDK
>>>       you'll see that it uses 2 defines: RTE_CACHE_LINE_SIZE and
>>>       RTE_CACHE_LINE_MIN_SIZE to avoid holes in mbuf structure.
>>
>> I understand that ARM and a few other processors (like OCTEON) have
>> 128-byte cache lines.
>> But again, I'm curious about the performance impact in your case with this
>> new alignment.
>>
>>>
>>>    3. Sizes of system/libc defined types are not constant for all the
>>>       systems. For example, sizeof(pthread_mutex_t) == 48 on my
>>>       ARMv8 machine, but only 40 on x86. The difference could be
>>>       much bigger on Windows or MacOS systems. But the code assumes
>>>       that sizeof(struct ovs_mutex) is always 48 bytes. This may lead
>>>       to broken alignment/big holes in case of padding/wrong comments
>>>       about amount of free pad bytes.
>>
>> This isn't an issue, as you have already mentioned; it's more an issue with
>> the comment that states the number of pad bytes.
>> In case of ARM it would be just 8 pad bytes instead of 16 on X86.
>>
>>         union {
>>                 struct {
>>                         struct ovs_mutex port_mutex;     /* 4849984    48 */
>>                 };    /*          48 */
>>                 uint8_t            pad13[64];            /*          64 */
>>         };                                               /
>>
>
>It's not only about 'port_mutex'. If you take a look at 'flow_mutex', you will
>see that it is not even padded. So, increasing the size of 'flow_mutex'
>shifts all the following padded blocks, and none of them will be
>properly aligned even if the structure is allocated on aligned memory.
>
>>>
>>>    4. Sizes of many fields in the structure depend on defines like
>>>       DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so on.
>>>       Any change in these defines or any change in any structure
>>>       contained by thread should lead to the not so simple
>>>       refactoring of the whole dp_netdev_pmd_thread structure. This
>>>       greatly reduces maintainability and complicates development of
>>>       a new features.
>>
>> I don't think it complicates development; instead, I feel the commit
>> gives a clear indication to the developer that the members are grouped,
>> aligned, and marked with cacheline markers.
>> This makes the developer extra cautious when adding new members so that
>> holes can be avoided.
>
>Starting rebase of the output batching patch-set I figured out that I need to
>remove 'unsigned long long last_cycles' and add 'struct
>dp_netdev_pmd_thread_ctx ctx'
>which is 8 bytes larger. Could you please suggest where I should place
>that new structure member and what to do with the hole from 'last_cycles'?
>
>This is not a trivial question, because the already poor grouping will
>become worse almost any way you do it.

Aah, realized now that the batching series doesn't apply cleanly on master.
Let me check this and will send across the changes that should fix this.

- Bhanuprakash

>>
>> Cacheline marking the structure is a good practice and I am sure this
>> structure is significant and should be carefully extended in the future.
>
>Not so sure about that.
>
>>
>>>
>>>    5. There is no reason to align flow_cache member because it's
>>>       too big and we usually access random entries by a single thread
>>>       only.
>>>
>>
>> I see your point. This patch wasn't done for performance, but rather to
>> bring some order to this ever-growing structure. During
>> testing I found that for some test cases aligning the flow_cache was
>> consistently giving me 100k+ performance, and so it was added.
>
>This was a random performance boost. You achieved it without aligned
>memory allocation.
>Just luck with your system environment. Using xmalloc_cacheline will likely
>eliminate this performance difference or even degrade the performance.
>
>>
>>> So, the padding/alignment only creates the appearance of a performance
>>> optimization but does nothing useful in reality. It only complicates
>>> maintenance and adds huge holes for non-x86 architectures and
>>> non-Linux systems. The performance improvement stated in the original
>>> commit message must be random and not valuable. I see no
>>> performance difference.
>>
>> I understand that this is causing issues with architectures having
>> different cache line sizes, but unfortunately the majority of them have
>> 64-byte cache lines, so this change makes sense.
>
>I understand that 64-byte cache lines are far more widespread. I also have x86
>as a target arch, but still, IMHO, OVS is a cross-platform application and it
>should not have platform-dependent stuff which makes one architecture/platform
>better and worsens others.
>
>>
>> If you have performance data to prove that this causes severe perf
>> degradation, I can think of workarounds for ARM.
>>
>> - Bhanuprakash.
>
>
>P.S.: If you want to test with CACHE_LINE_SIZE=128 you will have to apply the
>following patch to avoid a build-time assert (I'll send it formally later):
>
>-----------------------------------------------------------------------------
>diff --git a/lib/cmap.c b/lib/cmap.c
>index 35decea..5b15ecd 100644
>--- a/lib/cmap.c
>+++ b/lib/cmap.c
>@@ -123,12 +123,11 @@ COVERAGE_DEFINE(cmap_shrink);
> /* Number of entries per bucket: 7 on 32-bit, 5 on 64-bit. */
> #define CMAP_K ((CACHE_LINE_SIZE - 4) / CMAP_ENTRY_SIZE)
>
>-/* Pad to make a bucket a full cache line in size: 4 on 32-bit, 0 on 64-bit. */
>-#define CMAP_PADDING ((CACHE_LINE_SIZE - 4) - (CMAP_K * CMAP_ENTRY_SIZE))
>-
> /* A cuckoo hash bucket.  Designed to be cache-aligned and exactly one cache
>  * line long. */
> struct cmap_bucket {
>+    /* Padding to make cmap_bucket exactly one cache line long. */
>+    PADDED_MEMBERS(CACHE_LINE_SIZE,
>     /* Allows readers to track in-progress changes.  Initially zero, each
>      * writer increments this value just before and just after each change (see
>      * cmap_set_bucket()).  Thus, a reader can ensure that it gets a consistent
>@@ -145,11 +144,7 @@ struct cmap_bucket {
>      * slots. */
>     uint32_t hashes[CMAP_K];
>     struct cmap_node nodes[CMAP_K];
>-
>-    /* Padding to make cmap_bucket exactly one cache line long. */
>-#if CMAP_PADDING > 0
>-    uint8_t pad[CMAP_PADDING];
>-#endif
>+    );
> };
> BUILD_ASSERT_DECL(sizeof(struct cmap_bucket) == CACHE_LINE_SIZE);
>
>diff --git a/lib/util.h b/lib/util.h
>index 3c43c2c..514fdaa 100644
>--- a/lib/util.h
>+++ b/lib/util.h
>@@ -61,7 +61,7 @@ struct Bad_arg_to_ARRAY_SIZE {
>
> /* This system's cache line size, in bytes.
>  * Being wrong hurts performance but not correctness. */
>-#define CACHE_LINE_SIZE 64
>+#define CACHE_LINE_SIZE 128
> BUILD_ASSERT_DECL(IS_POW2(CACHE_LINE_SIZE));
>
> /* Cacheline marking is typically done using zero-sized array.
>-----------------------------------------------------------------------------
>
>
>Best regards, Ilya Maximets.
>
>>
>>>
>>> Most of the above issues are also true for some other padded/aligned
>>> structures like 'struct netdev_dpdk'. They will be treated separately.
>>>
>>> CC: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
>>> CC: Ben Pfaff <blp@ovn.org>
>>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>>> ---
>>> lib/dpif-netdev.c | 160 +++++++++++++++++++++++-------------------------------
>>> 1 file changed, 69 insertions(+), 91 deletions(-)
>>>
>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>> index 0a62630..6784269 100644
>>> --- a/lib/dpif-netdev.c
>>> +++ b/lib/dpif-netdev.c
>>> @@ -547,31 +547,18 @@ struct tx_port {
>>>  * actions in either case.
>>>  * */
>>> struct dp_netdev_pmd_thread {
>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
>>> -        struct dp_netdev *dp;
>>> -        struct cmap_node node;          /* In 'dp->poll_threads'. */
>>> -        pthread_cond_t cond;            /* For synchronizing pmd thread
>>> -                                           reload. */
>>> -    );
>>> -
>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>>> -        struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>> -        pthread_t thread;
>>> -        unsigned core_id;               /* CPU core id of this pmd thread. */
>>> -        int numa_id;                    /* numa node id of this pmd thread. */
>>> -    );
>>> +    struct dp_netdev *dp;
>>> +    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>>> +    struct cmap_node node;          /* In 'dp->poll_threads'. */
>>> +
>>> +    pthread_cond_t cond;            /* For synchronizing pmd thread reload. */
>>> +    struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>>
>>>     /* Per thread exact-match cache.  Note, the instance for cpu core
>>>      * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
>>>      * need to be protected by 'non_pmd_mutex'.  Every other instance
>>>      * will only be accessed by its own pmd thread. */
>>> -    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
>>> -    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed.
>*/
>>> -
>>> -    /* Queue id used by this pmd thread to send packets on all netdevs if
>>> -     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>> -     * than 'cmap_count(dp->poll_threads)'. */
>>> -    uint32_t static_tx_qid;
>>> +    struct emc_cache flow_cache;
>>>
>>>     /* Flow-Table and classifiers
>>>      *
>>> @@ -580,77 +567,68 @@ struct dp_netdev_pmd_thread {
>>>      * 'flow_mutex'.
>>>      */
>>>     struct ovs_mutex flow_mutex;
>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>> -        struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>> -
>>> -        /* One classifier per in_port polled by the pmd */
>>> -        struct cmap classifiers;
>>> -        /* Periodically sort subtable vectors according to hit frequencies */
>>> -        long long int next_optimization;
>>> -        /* End of the next time interval for which processing cycles
>>> -           are stored for each polled rxq. */
>>> -        long long int rxq_next_cycle_store;
>>> -
>>> -        /* Cycles counters */
>>> -        struct dp_netdev_pmd_cycles cycles;
>>> -
>>> -        /* Used to count cycles. See 'cycles_counter_end()'. */
>>> -        unsigned long long last_cycles;
>>> -        struct latch exit_latch;        /* For terminating the pmd thread. */
>>> -    );
>>> -
>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>> -        /* Statistics. */
>>> -        struct dp_netdev_pmd_stats stats;
>>> -
>>> -        struct seq *reload_seq;
>>> -        uint64_t last_reload_seq;
>>> -        atomic_bool reload;             /* Do we need to reload ports? */
>>> -        bool isolated;
>>> -
>>> -        /* Set to true if the pmd thread needs to be reloaded. */
>>> -        bool need_reload;
>>> -        /* 5 pad bytes. */
>>> -    );
>>> -
>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>> -        struct ovs_mutex port_mutex;    /* Mutex for 'poll_list'
>>> -                                           and 'tx_ports'. */
>>> -        /* 16 pad bytes. */
>>> -    );
>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>> -        /* List of rx queues to poll. */
>>> -        struct hmap poll_list OVS_GUARDED;
>>> -        /* Map of 'tx_port's used for transmission.  Written by the main
>>> -         * thread, read by the pmd thread. */
>>> -        struct hmap tx_ports OVS_GUARDED;
>>> -    );
>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>> -        /* These are thread-local copies of 'tx_ports'.  One contains only
>>> -         * tunnel ports (that support push_tunnel/pop_tunnel), the other
>>> -         * contains ports with at least one txq (that support send).
>>> -         * A port can be in both.
>>> -         *
>>> -         * There are two separate maps to make sure that we don't try to
>>> -         * execute OUTPUT on a device which has 0 txqs or PUSH/POP on a
>>> -         * non-tunnel device.
>>> -         *
>>> -         * The instances for cpu core NON_PMD_CORE_ID can be accessed by
>>> -         * multiple threads and thusly need to be protected by
>>> 'non_pmd_mutex'.
>>> -         * Every other instance will only be accessed by its own pmd thread. */
>>> -        struct hmap tnl_port_cache;
>>> -        struct hmap send_port_cache;
>>> -    );
>>> -
>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>> -        /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>> -         * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>> -         * values and subtracts them from 'stats' and 'cycles' before
>>> -         * reporting to the user */
>>> -        unsigned long long stats_zero[DP_N_STATS];
>>> -        uint64_t cycles_zero[PMD_N_CYCLES];
>>> -        /* 8 pad bytes. */
>>> -    );
>>> +    struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>> +
>>> +    /* One classifier per in_port polled by the pmd */
>>> +    struct cmap classifiers;
>>> +    /* Periodically sort subtable vectors according to hit frequencies */
>>> +    long long int next_optimization;
>>> +    /* End of the next time interval for which processing cycles
>>> +       are stored for each polled rxq. */
>>> +    long long int rxq_next_cycle_store;
>>> +
>>> +    /* Statistics. */
>>> +    struct dp_netdev_pmd_stats stats;
>>> +
>>> +    /* Cycles counters */
>>> +    struct dp_netdev_pmd_cycles cycles;
>>> +
>>> +    /* Used to count cicles. See 'cycles_counter_end()' */
>>> +    unsigned long long last_cycles;
>>> +
>>> +    struct latch exit_latch;        /* For terminating the pmd thread. */
>>> +    struct seq *reload_seq;
>>> +    uint64_t last_reload_seq;
>>> +    atomic_bool reload;             /* Do we need to reload ports? */
>>> +    pthread_t thread;
>>> +    unsigned core_id;               /* CPU core id of this pmd thread. */
>>> +    int numa_id;                    /* numa node id of this pmd thread. */
>>> +    bool isolated;
>>> +
>>> +    /* Queue id used by this pmd thread to send packets on all netdevs if
>>> +     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>> +     * than 'cmap_count(dp->poll_threads)'. */
>>> +    uint32_t static_tx_qid;
>>> +
>>> +    struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and 'tx_ports'. */
>>> +    /* List of rx queues to poll. */
>>> +    struct hmap poll_list OVS_GUARDED;
>>> +    /* Map of 'tx_port's used for transmission.  Written by the main thread,
>>> +     * read by the pmd thread. */
>>> +    struct hmap tx_ports OVS_GUARDED;
>>> +
>>> +    /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
>>> +     * ports (that support push_tunnel/pop_tunnel), the other contains ports
>>> +     * with at least one txq (that support send).  A port can be in both.
>>> +     *
>>> +     * There are two separate maps to make sure that we don't try to execute
>>> +     * OUTPUT on a device which has 0 txqs or PUSH/POP on a non-tunnel device.
>>> +     *
>>> +     * The instances for cpu core NON_PMD_CORE_ID can be accessed by multiple
>>> +     * threads, and thusly need to be protected by 'non_pmd_mutex'.  Every
>>> +     * other instance will only be accessed by its own pmd thread. */
>>> +    struct hmap tnl_port_cache;
>>> +    struct hmap send_port_cache;
>>> +
>>> +    /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>> +     * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>> +     * values and subtracts them from 'stats' and 'cycles' before
>>> +     * reporting to the user */
>>> +    unsigned long long stats_zero[DP_N_STATS];
>>> +    uint64_t cycles_zero[PMD_N_CYCLES];
>>> +
>>> +    /* Set to true if the pmd thread needs to be reloaded. */
>>> +    bool need_reload;
>>> };
>>>
>>> /* Interface to netdev-based datapath. */
>>> --
>>> 2.7.4
>>
>>
>>
>>
Bodireddy, Bhanuprakash Nov. 26, 2017, 9:41 p.m. | #4
[snip]
>>>
>>> I don't think it complicates development and instead I feel the
>>> commit gives a clear indication to the developer that the members are
>>> grouped and
>>aligned and marked with cacheline markers.
>>> This makes the developer extra cautious when adding new members so
>>> that
>>holes can be avoided.
>>
>>Starting rebase of the output batching patch-set I figured out that I
>>need to remove 'unsigned long long last_cycles' and add 'struct
>>dp_netdev_pmd_thread_ctx ctx'
>>which is 8 bytes larger. Could you please suggest where I should
>>place that new structure member and what to do with the hole from
>>'last_cycles'?
>>
>>This is not a trivial question, because the already poor grouping will
>>become worse almost any way you do it.
>
>Aah, realized now that the batching series doesn't cleanly apply on master.
>Let me check this and will send across the changes that should fix this.
>

I see that 2 patches of the output batching series touch this structure, and I modified
the structure to factor in the below changes introduced in the batching series:
     -  Include the dp_netdev_pmd_thread_ctx structure.
     -  Include the n_output_batches variable.
     -  Change in sizes of the dp_netdev_pmd_stats struct and stats_zero.
     -  ovs_mutex size (48 bytes on x86 vs 56 bytes on ARM).

Also carried out some testing and found no performance impact with the below changes.

---------------------------------------------------------------------------------------------------
struct dp_netdev_pmd_thread {
    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
        struct dp_netdev *dp;
        struct cmap_node node;          /* In 'dp->poll_threads'. */
        pthread_cond_t cond;            /* For synchronizing pmd thread	
                                           reload. */
    );

    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
        struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
        pthread_t thread;
    );

    /* Per thread exact-match cache.  Note, the instance for cpu core
     * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
     * need to be protected by 'non_pmd_mutex'.  Every other instance
     * will only be accessed by its own pmd thread. */
    PADDED_MEMBERS(CACHE_LINE_SIZE,
        OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
    );

    /* Flow-Table and classifiers
     *
     * Writers of 'flow_table' must take the 'flow_mutex'.  Corresponding
     * changes to 'classifiers' must be made while still holding the
     * 'flow_mutex'.
     */
    PADDED_MEMBERS(CACHE_LINE_SIZE,
        struct ovs_mutex flow_mutex;
    );
    PADDED_MEMBERS(CACHE_LINE_SIZE,
        struct cmap flow_table OVS_GUARDED; /* Flow table. */

        /* One classifier per in_port polled by the pmd */
        struct cmap classifiers;
        /* Periodically sort subtable vectors according to hit frequencies */
        long long int next_optimization;
        /* End of the next time interval for which processing cycles
           are stored for each polled rxq. */
        long long int rxq_next_cycle_store;

        /* Cycles counters */
        struct dp_netdev_pmd_cycles cycles;

        /* Current context of the PMD thread. */
        struct dp_netdev_pmd_thread_ctx ctx;
    );
    PADDED_MEMBERS(CACHE_LINE_SIZE,
        /* Statistics. */
        struct dp_netdev_pmd_stats stats;
        /* 8 pad bytes. */
    );

    PADDED_MEMBERS(CACHE_LINE_SIZE,
        struct latch exit_latch;        /* For terminating the pmd thread. */
        struct seq *reload_seq;
        uint64_t last_reload_seq;
        atomic_bool reload;             /* Do we need to reload ports? */
        /* Set to true if the pmd thread needs to be reloaded. */
        bool need_reload;
        bool isolated;

        struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */

        /* Queue id used by this pmd thread to send packets on all netdevs if
         * XPS disabled for this netdev. All static_tx_qid's are unique and less
         * than 'cmap_count(dp->poll_threads)'. */
        uint32_t static_tx_qid;

        /* Number of filled output batches. */
        int n_output_batches;
        unsigned core_id;               /* CPU core id of this pmd thread. */
        int numa_id;                    /* numa node id of this pmd thread. */

        /* 16 pad bytes. */
    );

    PADDED_MEMBERS(CACHE_LINE_SIZE,
        struct ovs_mutex port_mutex;    /* Mutex for 'poll_list'
                                           and 'tx_ports'. */
        /* 16 pad bytes. */
    );
    PADDED_MEMBERS(CACHE_LINE_SIZE,
        /* List of rx queues to poll. */
        struct hmap poll_list OVS_GUARDED;
        /* Map of 'tx_port's used for transmission.  Written by the main
         * thread, read by the pmd thread. */
        struct hmap tx_ports OVS_GUARDED;
    );
    PADDED_MEMBERS(CACHE_LINE_SIZE,
        /* These are thread-local copies of 'tx_ports'.  One contains only
         * tunnel ports (that support push_tunnel/pop_tunnel), the other
         * contains ports with at least one txq (that support send).
         * A port can be in both.
         *
         * There are two separate maps to make sure that we don't try to
         * execute OUTPUT on a device which has 0 txqs or PUSH/POP on a
         * non-tunnel device.
         *
         * The instances for cpu core NON_PMD_CORE_ID can be accessed by
         * multiple threads and thusly need to be protected by 'non_pmd_mutex'.
         * Every other instance will only be accessed by its own pmd thread. */
        struct hmap tnl_port_cache;
        struct hmap send_port_cache;
    );

    PADDED_MEMBERS(CACHE_LINE_SIZE,
        /* Only a pmd thread can write on its own 'cycles' and 'stats'.
         * The main thread keeps 'stats_zero' and 'cycles_zero' as base
         * values and subtracts them from 'stats' and 'cycles' before
         * reporting to the user */
        unsigned long long stats_zero[DP_N_STATS];
        /* 8 pad bytes. */
    );

    PADDED_MEMBERS(CACHE_LINE_SIZE,
        uint64_t cycles_zero[PMD_N_CYCLES];
        /* 48 pad bytes. */
    );
};
---------------------------------------------------------------------------------------------------

- Bhanuprakash.
Jan Scheurich Nov. 27, 2017, 11:29 a.m. | #5
I agree with Ilya here. Adding these cache line markers and re-grouping variables to minimize gaps in cache lines creates a maintenance burden without any tangible benefit. I had to go through the pain of refactoring my PMD Performance Metrics patch to the new dp_netdev_pmd_thread struct and spent a lot of time analyzing the actual memory layout with GDB and playing Tetris with the variables.

There will never be more than a handful of PMDs, so minimizing the gaps does not matter from a memory perspective. And whether the individual members occupy 4 or 5 cache lines does not matter either, compared to the many hundred cache lines touched for EMC and DPCLS lookups of an Rx batch. And any optimization done for x86 is not necessarily optimal for other architectures.

Finally, even for x86 there is not even a performance improvement. I re-ran our standard L3VPN over VXLAN performance PVP test on master and with Ilya's revert patch:

Flows    master  reverted
8        4.46    4.48
100      4.27    4.29
1000     4.07    4.07
2000     3.68    3.68
5000     3.03    3.03
10000    2.76    2.77
20000    2.64    2.65
50000    2.60    2.61
100000   2.60    2.61
500000   2.60    2.61

All in all, I support reverting this change.

Regards, Jan

Acked-by: Jan Scheurich <jan.scheurich@ericsson.com>

> -----Original Message-----
> From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-bounces@openvswitch.org] On Behalf Of Bodireddy, Bhanuprakash
> Sent: Friday, 24 November, 2017 17:09
> To: Ilya Maximets <i.maximets@samsung.com>; ovs-dev@openvswitch.org; Ben Pfaff <blp@ovn.org>
> Cc: Heetae Ahn <heetae82.ahn@samsung.com>
> Subject: Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."
> 
> >On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
> >>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
> >>>
> >>> Padding and aligning of dp_netdev_pmd_thread structure members is
> >>> useless, broken in several ways and only greatly degrades
> >>> maintainability and extensibility of the structure.
> >>
> >> The idea of my earlier patch was to mark the cache lines and reduce the
> >holes while still maintaining the grouping of related members in this structure.
> >
> >Some of the grouping aspects look strange. For example, it looks illogical that
> >'exit_latch' is grouped with 'flow_table' but not the 'reload_seq' and other
> >reload related stuff. It looks strange that statistics and counters spread across
> >different groups. So, IMHO, it's not well grouped.
> 
> I had to strike a fine balance and some members may be placed in a different group
> due to their sizes and importance. Let me think if I can make it better.
> 
> >
> >> Also cache line marking is a good practice to make someone extra cautious
> >when extending or editing important structures.
> >> Most importantly I was experimenting with prefetching on this structure
> >and needed cache line markers for it.
> >>
> >> I see that you are on ARM (I don't have HW to test) and want to know if this
> >commit has any negative affect and any numbers would be appreciated.
> >
> >Basic VM-VM testing shows a stable 0.5% performance improvement with
> >revert applied.
> 
> I did P2P, PVP and PVVP with IXIA and haven't noticed any drop on X86.
> 
> >Padding adds 560 additional bytes of holes.
> As the cache line on ARM is 128 bytes, it created holes; I can find a workaround to handle this.
> 
> >
> >> More comments inline.
> >>
> >>>
> >>> Issues:
> >>>
> >>>    1. It's not working because all the instances of struct
> >>>       dp_netdev_pmd_thread allocated only by usual malloc. All the
> >>>       memory is not aligned to cachelines -> structure almost never
> >>>       starts at aligned memory address. This means that any further
> >>>       paddings and alignments inside the structure are completely
> >>>       useless. For example:
> >>>
> >>>       Breakpoint 1, pmd_thread_main
> >>>       (gdb) p pmd
> >>>       $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
> >>>       (gdb) p &pmd->cacheline1
> >>>       $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
> >>>       (gdb) p &pmd->cacheline0
> >>>       $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
> >>>       (gdb) p &pmd->flow_cache
> >>>       $53 = (struct emc_cache *) 0x1b1afe0
> >>>
> >>>       All of the above addresses shifted from cacheline start by 32B.
> >>
> >> If you see below, all the addresses are 64-byte aligned.
> >>
> >> (gdb) p pmd
> >> $1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
> >> (gdb) p &pmd->cacheline0
> >> $2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
> >> (gdb) p &pmd->cacheline1
> >> $3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
> >> (gdb) p &pmd->flow_cache
> >> $4 = (struct emc_cache *) 0x7fc1e9b1a0c0
> >> (gdb) p &pmd->flow_table
> >> $5 = (struct cmap *) 0x7fc1e9fba100
> >> (gdb) p &pmd->stats
> >> $6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
> >> (gdb) p &pmd->port_mutex
> >> $7 = (struct ovs_mutex *) 0x7fc1e9fba180
> >> (gdb) p &pmd->poll_list
> >> $8 = (struct hmap *) 0x7fc1e9fba1c0
> >> (gdb) p &pmd->tnl_port_cache
> >> $9 = (struct hmap *) 0x7fc1e9fba200
> >> (gdb) p &pmd->stats_zero
> >> $10 = (unsigned long long (*)[5]) 0x7fc1e9fba240
> >>
> >> I tried using xzalloc_cacheline instead of default xzalloc() here.  I
> >> tried tens of times and always found that the address is
> >> 64-byte aligned and it should start at the beginning of a cache line on X86.
> >> Not sure why the comment  " (The memory returned will not be at the start
> >of  a cache line, though, so don't assume such alignment.)" says otherwise?
> >
> >Yes, you will always get aligned addresses on your x86 Linux system that
> >supports
> >posix_memalign() call. The comment says what it says because it will use
> >some memory allocation tricks in case posix_memalign() is not available
> >(Windows, some MacOS, maybe some Linux systems (not sure)) and the
> >address will not be aligned in this case.
> 
> I also verified the other case when posix_memalign isn't available and even in that case
> it returns the address aligned on CACHE_LINE_SIZE boundary. I will send out a patch to use
>  xzalloc_cacheline for allocating the memory.
> 
> >
> >>
> >>>
> >>>       Can we fix it properly? NO.
> >>>       OVS currently doesn't have appropriate API to allocate aligned
> >>>       memory. The best candidate is 'xmalloc_cacheline()' but it
> >>>       clearly states that "The memory returned will not be at the
> >>>       start of a cache line, though, so don't assume such alignment".
> >>>       And also, this function will never return aligned memory on
> >>>       Windows or MacOS.
> >>>
> >>>    2. CACHE_LINE_SIZE is not constant. Different architectures have
> >>>       different cache line sizes, but the code assumes that
> >>>       CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
> >>>       members are grouped by 64 bytes and padded to CACHE_LINE_SIZE.
> >>>       This leads to huge holes in the structures if CACHE_LINE_SIZE
> >>>       differs from 64. This is opposite to portability. If I want
> >>>       good performance of cmap I need to have CACHE_LINE_SIZE equal
> >>>       to the real cache line size, but I will have huge holes in the
> >>>       structures. If you'll take a look at struct rte_mbuf from DPDK
> >>>       you'll see that it uses 2 defines: RTE_CACHE_LINE_SIZE and
> >>>       RTE_CACHE_LINE_MIN_SIZE to avoid holes in mbuf structure.
> >>
> >> I understand that ARM and a few other processors (like OCTEON) have
> >128-byte cache lines.
> >> But again, I'm curious about the performance impact in your case with this
> >new alignment.
> >>
> >>>
> >>>    3. Sizes of system/libc defined types are not constant for all the
> >>>       systems. For example, sizeof(pthread_mutex_t) == 48 on my
> >>>       ARMv8 machine, but only 40 on x86. The difference could be
> >>>       much bigger on Windows or MacOS systems. But the code assumes
> >>>       that sizeof(struct ovs_mutex) is always 48 bytes. This may lead
> >>>       to broken alignment/big holes in case of padding/wrong comments
> >>>       about amount of free pad bytes.
> >>
> >> This isn't an issue as such, as you already mentioned; it's more about the
> >comment that states the number of pad bytes.
> >> In case of ARM it would be just 8 pad bytes instead of 16 on X86.
> >>
> >>         union {
> >>                 struct {
> >>                         struct ovs_mutex port_mutex;     /* 4849984    48 */
> >>                 };    /*          48 */
> >>                 uint8_t            pad13[64];            /*          64 */
> >>         };                                               /
> >>
> >
> >It's not only about 'port_mutex'. If you'll take a look at 'flow_mutex', you will
> >see, that it even not padded. So, increasing the size of 'flow_mutex'
> >leads to shifting of all the following padded blocks and no other blocks will be
> >properly aligned even if the structure allocated on the aligned memory.
> >
> >>>
> >>>    4. Sizes of many fields in the structure depend on defines like
> >>>       DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so on.
> >>>       Any change in these defines or any change in any structure
> >>>       contained by thread should lead to the not so simple
> >>>       refactoring of the whole dp_netdev_pmd_thread structure. This
> >>>       greatly reduces maintainability and complicates development of
> >>>       new features.
> >>
> >> I don't think it complicates development and instead I feel the commit
> >> gives a clear indication to the developer that the members are grouped and
> >aligned and marked with cacheline markers.
> >> This makes the developer extra cautious when adding new members so that
> >holes can be avoided.
> >
> >Starting rebase of the output batching patch-set I figured out that I need to
> >remove 'unsigned long long last_cycles' and add 'struct
> >dp_netdev_pmd_thread_ctx ctx'
> >which is 8 bytes larger. Could you, please, suggest me where should I place
> >that new structure member and what to do with a hole from 'last_cycles'?
> >
> >This is not a trivial question, because the already poor grouping will become
> >worse in almost any case.
> 
> Aah, realized now that the batching series doesn't cleanly apply on master.
> Let me check this and will send across the changes that should fix this.
> 
> - Bhanuprakash
> 
> >>
> >> Cacheline marking the structure is a good practice and I am sure this
> >> structure is significant and should be carefully extended in the future.
> >
> >Not so sure about that.
> >
> >>
> >>>
> >>>    5. There is no reason to align flow_cache member because it's
> >>>       too big and we usually access random entries by single thread
> >>>       only.
> >>>
> >>
> >> I see your point. This patch wasn't done for performance; it was more to
> >> bring some order to this ever-growing structure. During
> >> testing I found that in some test cases aligning the flow_cache consistently
> >gave me a 100k+ performance gain, and so it was added.
> >
> >This was a random performance boost. You achieved it without aligned
> >memory allocation.
> >Just luck with your system environment. Using xzalloc_cacheline will likely
> >eliminate this performance difference or even degrade the performance.
> >
> >>
> >>> So, the padding/alignment only creates some visibility of performance
> >>> optimization but does nothing useful in reality. It only complicates
> >>> maintenance and adds huge holes for non-x86 architectures and
> >>> non-Linux systems. The performance improvement stated in the original
> >>> commit message should be random and not valuable. I see no
> >performance difference.
> >>
> >> I understand that this is causing issues with architectures having
> >> different cache line sizes, but unfortunately the majority of them have 64-byte
> >cache lines, so this change makes sense.
> >
> >I understand that 64 byte cache lines are spread a lot wider. I also have x86 as
> >a target arch, but still, IMHO, OVS is a cross-platform application and it should
> >not have platform-dependent stuff which makes one architecture/platform
> >better and worsens others.
> >
> >>
> >> If you have performance data to prove that this causes severe perf
> >degradation, I can think of workarounds for ARM.
> >>
> >> - Bhanuprakash.
> >
> >
> >P.S.: If you'll want to test with CACHE_LINE_SIZE=128 you will have to apply
> >following patch to avoid build time assert (I'll send it formally later):
> >
> >-----------------------------------------------------------------------------
> >diff --git a/lib/cmap.c b/lib/cmap.c
> >index 35decea..5b15ecd 100644
> >--- a/lib/cmap.c
> >+++ b/lib/cmap.c
> >@@ -123,12 +123,11 @@ COVERAGE_DEFINE(cmap_shrink);
> > /* Number of entries per bucket: 7 on 32-bit, 5 on 64-bit. */  #define CMAP_K
> >((CACHE_LINE_SIZE - 4) / CMAP_ENTRY_SIZE)
> >
> >-/* Pad to make a bucket a full cache line in size: 4 on 32-bit, 0 on 64-bit. */ -
> >#define CMAP_PADDING ((CACHE_LINE_SIZE - 4) - (CMAP_K *
> >CMAP_ENTRY_SIZE))
> >-
> > /* A cuckoo hash bucket.  Designed to be cache-aligned and exactly one
> >cache
> >  * line long. */
> > struct cmap_bucket {
> >+    /* Padding to make cmap_bucket exactly one cache line long. */
> >+    PADDED_MEMBERS(CACHE_LINE_SIZE,
> >     /* Allows readers to track in-progress changes.  Initially zero, each
> >      * writer increments this value just before and just after each change (see
> >      * cmap_set_bucket()).  Thus, a reader can ensure that it gets a consistent
> >@@ -145,11 +144,7 @@ struct cmap_bucket {
> >      * slots. */
> >     uint32_t hashes[CMAP_K];
> >     struct cmap_node nodes[CMAP_K];
> >-
> >-    /* Padding to make cmap_bucket exactly one cache line long. */
> >-#if CMAP_PADDING > 0
> >-    uint8_t pad[CMAP_PADDING];
> >-#endif
> >+    );
> > };
> > BUILD_ASSERT_DECL(sizeof(struct cmap_bucket) == CACHE_LINE_SIZE);
> >
> >diff --git a/lib/util.h b/lib/util.h
> >index 3c43c2c..514fdaa 100644
> >--- a/lib/util.h
> >+++ b/lib/util.h
> >@@ -61,7 +61,7 @@ struct Bad_arg_to_ARRAY_SIZE {
> >
> > /* This system's cache line size, in bytes.
> >  * Being wrong hurts performance but not correctness. */ -#define
> >CACHE_LINE_SIZE 64
> >+#define CACHE_LINE_SIZE 128
> > BUILD_ASSERT_DECL(IS_POW2(CACHE_LINE_SIZE));
> >
> > /* Cacheline marking is typically done using zero-sized array.
> >-----------------------------------------------------------------------------
> >
> >
> >Best regards, Ilya Maximets.
> >
> >>
> >>>
> >>> Most of the above issues are also true for some other padded/aligned
> >>> structures like 'struct netdev_dpdk'. They will be treated separately.
> >>>
> >>> CC: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
> >>> CC: Ben Pfaff <blp@ovn.org>
> >>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> >>> ---
> >>> lib/dpif-netdev.c | 160
> >>> +++++++++++++++++++++++-----------------------------
> >>> --
> >>> 1 file changed, 69 insertions(+), 91 deletions(-)
> >>>
> >>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> >>> 0a62630..6784269
> >>> 100644
> >>> --- a/lib/dpif-netdev.c
> >>> +++ b/lib/dpif-netdev.c
> >>> @@ -547,31 +547,18 @@ struct tx_port {
> >>>  * actions in either case.
> >>>  * */
> >>> struct dp_netdev_pmd_thread {
> >>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE,
> >cacheline0,
> >>> -        struct dp_netdev *dp;
> >>> -        struct cmap_node node;          /* In 'dp->poll_threads'. */
> >>> -        pthread_cond_t cond;            /* For synchronizing pmd thread
> >>> -                                           reload. */
> >>> -    );
> >>> -
> >>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE,
> >cacheline1,
> >>> -        struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
> >>> -        pthread_t thread;
> >>> -        unsigned core_id;               /* CPU core id of this pmd thread. */
> >>> -        int numa_id;                    /* numa node id of this pmd thread. */
> >>> -    );
> >>> +    struct dp_netdev *dp;
> >>> +    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed.
> >*/
> >>> +    struct cmap_node node;          /* In 'dp->poll_threads'. */
> >>> +
> >>> +    pthread_cond_t cond;            /* For synchronizing pmd thread reload. */
> >>> +    struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
> >>>
> >>>     /* Per thread exact-match cache.  Note, the instance for cpu core
> >>>      * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
> >>>      * need to be protected by 'non_pmd_mutex'.  Every other instance
> >>>      * will only be accessed by its own pmd thread. */
> >>> -    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
> >>> -    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed.
> >*/
> >>> -
> >>> -    /* Queue id used by this pmd thread to send packets on all netdevs if
> >>> -     * XPS disabled for this netdev. All static_tx_qid's are unique and less
> >>> -     * than 'cmap_count(dp->poll_threads)'. */
> >>> -    uint32_t static_tx_qid;
> >>> +    struct emc_cache flow_cache;
> >>>
> >>>     /* Flow-Table and classifiers
> >>>      *
> >>> @@ -580,77 +567,68 @@ struct dp_netdev_pmd_thread {
> >>>      * 'flow_mutex'.
> >>>      */
> >>>     struct ovs_mutex flow_mutex;
> >>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
> >>> -        struct cmap flow_table OVS_GUARDED; /* Flow table. */
> >>> -
> >>> -        /* One classifier per in_port polled by the pmd */
> >>> -        struct cmap classifiers;
> >>> -        /* Periodically sort subtable vectors according to hit frequencies */
> >>> -        long long int next_optimization;
> >>> -        /* End of the next time interval for which processing cycles
> >>> -           are stored for each polled rxq. */
> >>> -        long long int rxq_next_cycle_store;
> >>> -
> >>> -        /* Cycles counters */
> >>> -        struct dp_netdev_pmd_cycles cycles;
> >>> -
> >>> -        /* Used to count cycles. See 'cycles_counter_end()'. */
> >>> -        unsigned long long last_cycles;
> >>> -        struct latch exit_latch;        /* For terminating the pmd thread. */
> >>> -    );
> >>> -
> >>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
> >>> -        /* Statistics. */
> >>> -        struct dp_netdev_pmd_stats stats;
> >>> -
> >>> -        struct seq *reload_seq;
> >>> -        uint64_t last_reload_seq;
> >>> -        atomic_bool reload;             /* Do we need to reload ports? */
> >>> -        bool isolated;
> >>> -
> >>> -        /* Set to true if the pmd thread needs to be reloaded. */
> >>> -        bool need_reload;
> >>> -        /* 5 pad bytes. */
> >>> -    );
> >>> -
> >>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
> >>> -        struct ovs_mutex port_mutex;    /* Mutex for 'poll_list'
> >>> -                                           and 'tx_ports'. */
> >>> -        /* 16 pad bytes. */
> >>> -    );
> >>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
> >>> -        /* List of rx queues to poll. */
> >>> -        struct hmap poll_list OVS_GUARDED;
> >>> -        /* Map of 'tx_port's used for transmission.  Written by the main
> >>> -         * thread, read by the pmd thread. */
> >>> -        struct hmap tx_ports OVS_GUARDED;
> >>> -    );
> >>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
> >>> -        /* These are thread-local copies of 'tx_ports'.  One contains only
> >>> -         * tunnel ports (that support push_tunnel/pop_tunnel), the other
> >>> -         * contains ports with at least one txq (that support send).
> >>> -         * A port can be in both.
> >>> -         *
> >>> -         * There are two separate maps to make sure that we don't try to
> >>> -         * execute OUTPUT on a device which has 0 txqs or PUSH/POP on a
> >>> -         * non-tunnel device.
> >>> -         *
> >>> -         * The instances for cpu core NON_PMD_CORE_ID can be accessed by
> >>> -         * multiple threads and thusly need to be protected by
> >>> 'non_pmd_mutex'.
> >>> -         * Every other instance will only be accessed by its own pmd thread.
> >*/
> >>> -        struct hmap tnl_port_cache;
> >>> -        struct hmap send_port_cache;
> >>> -    );
> >>> -
> >>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
> >>> -        /* Only a pmd thread can write on its own 'cycles' and 'stats'.
> >>> -         * The main thread keeps 'stats_zero' and 'cycles_zero' as base
> >>> -         * values and subtracts them from 'stats' and 'cycles' before
> >>> -         * reporting to the user */
> >>> -        unsigned long long stats_zero[DP_N_STATS];
> >>> -        uint64_t cycles_zero[PMD_N_CYCLES];
> >>> -        /* 8 pad bytes. */
> >>> -    );
> >>> +    struct cmap flow_table OVS_GUARDED; /* Flow table. */
> >>> +
> >>> +    /* One classifier per in_port polled by the pmd */
> >>> +    struct cmap classifiers;
> >>> +    /* Periodically sort subtable vectors according to hit frequencies */
> >>> +    long long int next_optimization;
> >>> +    /* End of the next time interval for which processing cycles
> >>> +       are stored for each polled rxq. */
> >>> +    long long int rxq_next_cycle_store;
> >>> +
> >>> +    /* Statistics. */
> >>> +    struct dp_netdev_pmd_stats stats;
> >>> +
> >>> +    /* Cycles counters */
> >>> +    struct dp_netdev_pmd_cycles cycles;
> >>> +
> >>> +    /* Used to count cicles. See 'cycles_counter_end()' */
> >>> +    unsigned long long last_cycles;
> >>> +
> >>> +    struct latch exit_latch;        /* For terminating the pmd thread. */
> >>> +    struct seq *reload_seq;
> >>> +    uint64_t last_reload_seq;
> >>> +    atomic_bool reload;             /* Do we need to reload ports? */
> >>> +    pthread_t thread;
> >>> +    unsigned core_id;               /* CPU core id of this pmd thread. */
> >>> +    int numa_id;                    /* numa node id of this pmd thread. */
> >>> +    bool isolated;
> >>> +
> >>> +    /* Queue id used by this pmd thread to send packets on all netdevs if
> >>> +     * XPS disabled for this netdev. All static_tx_qid's are unique and less
> >>> +     * than 'cmap_count(dp->poll_threads)'. */
> >>> +    uint32_t static_tx_qid;
> >>> +
> >>> +    struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and 'tx_ports'.
> >*/
> >>> +    /* List of rx queues to poll. */
> >>> +    struct hmap poll_list OVS_GUARDED;
> >>> +    /* Map of 'tx_port's used for transmission.  Written by the main thread,
> >>> +     * read by the pmd thread. */
> >>> +    struct hmap tx_ports OVS_GUARDED;
> >>> +
> >>> +    /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
> >>> +     * ports (that support push_tunnel/pop_tunnel), the other contains
> >ports
> >>> +     * with at least one txq (that support send).  A port can be in both.
> >>> +     *
> >>> +     * There are two separate maps to make sure that we don't try to
> >execute
> >>> +     * OUTPUT on a device which has 0 txqs or PUSH/POP on a
> >>> + non-tunnel
> >>> device.
> >>> +     *
> >>> +     * The instances for cpu core NON_PMD_CORE_ID can be accessed by
> >>> multiple
> >>> +     * threads, and thusly need to be protected by 'non_pmd_mutex'.
> >Every
> >>> +     * other instance will only be accessed by its own pmd thread. */
> >>> +    struct hmap tnl_port_cache;
> >>> +    struct hmap send_port_cache;
> >>> +
> >>> +    /* Only a pmd thread can write on its own 'cycles' and 'stats'.
> >>> +     * The main thread keeps 'stats_zero' and 'cycles_zero' as base
> >>> +     * values and subtracts them from 'stats' and 'cycles' before
> >>> +     * reporting to the user */
> >>> +    unsigned long long stats_zero[DP_N_STATS];
> >>> +    uint64_t cycles_zero[PMD_N_CYCLES];
> >>> +
> >>> +    /* Set to true if the pmd thread needs to be reloaded. */
> >>> +    bool need_reload;
> >>> };
> >>>
> >>> /* Interface to netdev-based datapath. */
> >>> --
> >>> 2.7.4
> >>
> >>
> >>
> >>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Ilya Maximets Nov. 27, 2017, 11:59 a.m. | #6
On 24.11.2017 19:08, Bodireddy, Bhanuprakash wrote:
>> On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>>>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>>>>
>>>> Padding and aligning of dp_netdev_pmd_thread structure members is
>>>> useless, broken in several ways and only greatly degrades
>>>> maintainability and extensibility of the structure.
>>>
>>> The idea of my earlier patch was to mark the cache lines and reduce the
>> holes while still maintaining the grouping of related members in this structure.
>>
>> Some of the grouping aspects look strange. For example, it looks illogical that
>> 'exit_latch' is grouped with 'flow_table' but not the 'reload_seq' and other
>> reload related stuff. It looks strange that statistics and counters spread across
>> different groups. So, IMHO, it's not well grouped.
> 
> I had to strike a fine balance and some members may be placed in a different group
> due to their sizes and importance. Let me think if I can make it better.
> 
>>
>>> Also cache line marking is a good practice to make someone extra cautious
>> when extending or editing important structures.
>>> Most importantly I was experimenting with prefetching on this structure
>> and needed cache line markers for it.
>>>
>>> I see that you are on ARM (I don't have HW to test) and want to know if this
>> commit has any negative affect and any numbers would be appreciated.
>>
>> Basic VM-VM testing shows a stable 0.5% performance improvement with
>> revert applied.
> 
> I did P2P, PVP and PVVP with IXIA and haven't noticed any drop on X86.  
> 
>> Padding adds 560 additional bytes of holes.
> As the cache line on ARM is 128 bytes, it created holes; I can find a workaround to handle this.
> 
>>
>>> More comments inline.
>>>
>>>>
>>>> Issues:
>>>>
>>>>    1. It's not working because all the instances of struct
>>>>       dp_netdev_pmd_thread allocated only by usual malloc. All the
>>>>       memory is not aligned to cachelines -> structure almost never
>>>>       starts at aligned memory address. This means that any further
>>>>       paddings and alignments inside the structure are completely
>>>>       useless. For example:
>>>>
>>>>       Breakpoint 1, pmd_thread_main
>>>>       (gdb) p pmd
>>>>       $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>>>>       (gdb) p &pmd->cacheline1
>>>>       $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>>>>       (gdb) p &pmd->cacheline0
>>>>       $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>>>>       (gdb) p &pmd->flow_cache
>>>>       $53 = (struct emc_cache *) 0x1b1afe0
>>>>
>>>>       All of the above addresses shifted from cacheline start by 32B.
>>>
>>> If you see below, all the addresses are 64-byte aligned.
>>>
>>> (gdb) p pmd
>>> $1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
>>> (gdb) p &pmd->cacheline0
>>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
>>> (gdb) p &pmd->cacheline1
>>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
>>> (gdb) p &pmd->flow_cache
>>> $4 = (struct emc_cache *) 0x7fc1e9b1a0c0
>>> (gdb) p &pmd->flow_table
>>> $5 = (struct cmap *) 0x7fc1e9fba100
>>> (gdb) p &pmd->stats
>>> $6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
>>> (gdb) p &pmd->port_mutex
>>> $7 = (struct ovs_mutex *) 0x7fc1e9fba180
>>> (gdb) p &pmd->poll_list
>>> $8 = (struct hmap *) 0x7fc1e9fba1c0
>>> (gdb) p &pmd->tnl_port_cache
>>> $9 = (struct hmap *) 0x7fc1e9fba200
>>> (gdb) p &pmd->stats_zero
>>> $10 = (unsigned long long (*)[5]) 0x7fc1e9fba240
>>>
>>> I tried using xzalloc_cacheline instead of default xzalloc() here.  I
>>> tried tens of times and always found that the address is
>>> 64-byte aligned and it should start at the beginning of a cache line on X86.
>>> Not sure why the comment  " (The memory returned will not be at the start
>> of  a cache line, though, so don't assume such alignment.)" says otherwise?
>>
>> Yes, you will always get aligned addresses on your x86 Linux system that
>> supports
>> posix_memalign() call. The comment says what it says because it will use
>> some memory allocation tricks in case posix_memalign() is not available
>> (Windows, some MacOS, maybe some Linux systems (not sure)) and the
>> address will not be aligned in this case.
> 
> I also verified the other case when posix_memalign isn't available and even in that case
> it returns the address aligned on CACHE_LINE_SIZE boundary. I will send out a patch to use
>  xzalloc_cacheline for allocating the memory.

I don't know how you tested this, because it is impossible:

	1. OVS allocates some memory:
		base = xmalloc(...);

	2. Rounds it up to the cache line start:
		payload = (void **) ROUND_UP((uintptr_t) base, CACHE_LINE_SIZE);

	3. Returns the pointer increased by 8 bytes:
		return (char *) payload + MEM_ALIGN;

So, unless you redefined MEM_ALIGN to zero, you will never get an aligned memory
address when allocating with xmalloc_cacheline() on a system without posix_memalign().


> 
>>
>>>
>>>>
>>>>       Can we fix it properly? NO.
>>>>       OVS currently doesn't have appropriate API to allocate aligned
>>>>       memory. The best candidate is 'xmalloc_cacheline()' but it
>>>>       clearly states that "The memory returned will not be at the
>>>>       start of a cache line, though, so don't assume such alignment".
>>>>       And also, this function will never return aligned memory on
>>>>       Windows or MacOS.
>>>>
>>>>    2. CACHE_LINE_SIZE is not constant. Different architectures have
>>>>       different cache line sizes, but the code assumes that
>>>>       CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
>>>>       members are grouped by 64 bytes and padded to CACHE_LINE_SIZE.
>>>>       This leads to a huge holes in a structures if CACHE_LINE_SIZE
>>>>       differs from 64. This is opposite to portability. If I want
>>>>       good performance of cmap I need to have CACHE_LINE_SIZE equal
>>>>       to the real cache line size, but I will have huge holes in the
>>>>       structures. If you'll take a look to struct rte_mbuf from DPDK
>>>>       you'll see that it uses 2 defines: RTE_CACHE_LINE_SIZE and
>>>>       RTE_CACHE_LINE_MIN_SIZE to avoid holes in mbuf structure.
>>>
>>> I understand that ARM and few other processors (like OCTEON) have 128
>> bytes cache lines.
>>> But  again curious of performance impact in your case with this new
>> alignment.
>>>
>>>>
>>>>    3. Sizes of system/libc defined types are not constant for all the
>>>>       systems. For example, sizeof(pthread_mutex_t) == 48 on my
>>>>       ARMv8 machine, but only 40 on x86. The difference could be
>>>>       much bigger on Windows or MacOS systems. But the code assumes
>>>>       that sizeof(struct ovs_mutex) is always 48 bytes. This may lead
>>>>       to broken alignment/big holes in case of padding/wrong comments
>>>>       about amount of free pad bytes.
>>>
>>> This isn't a functional issue; it's more an issue with the comment that
>> states the number of pad bytes.
>>> In case of ARM it would be just 8 pad bytes instead of 16 on X86.
>>>
>>>         union {
>>>                 struct {
>>>                         struct ovs_mutex port_mutex;     /* 4849984    48 */
>>>                 };    /*          48 */
>>>                 uint8_t            pad13[64];            /*          64 */
>>>         };                                               /
>>>
>>
>> It's not only about 'port_mutex'. If you take a look at 'flow_mutex', you will
>> see that it is not even padded. So, increasing the size of 'flow_mutex'
>> shifts all the following padded blocks, and no other block will be
>> properly aligned even if the structure is allocated on aligned memory.
>>
>>>>
>>>>    4. Sizes of the many fileds in structure depends on defines like
>>>>       DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so on.
>>>>       Any change in these defines or any change in any structure
>>>>       contained by thread should lead to the not so simple
>>>>       refactoring of the whole dp_netdev_pmd_thread structure. This
>>>>       greatly reduces maintainability and complicates development of
>>>>       a new features.
>>>
>>> I don't think it complicates development and instead I feel the commit
>>> gives a clear indication to the developer that the members are grouped and
>> aligned and marked with cacheline markers.
>>> This makes the developer extra cautious when adding new members so that
>> holes can be avoided.
>>
>> Starting rebase of the output batching patch-set I figured out that I need to
>> remove 'unsigned long long last_cycles' and add 'struct
>> dp_netdev_pmd_thread_ctx ctx'
>> which is 8 bytes larger. Could you, please, suggest me where should I place
>> that new structure member and what to do with a hole from 'last_cycles'?
>>
>> This is not a trivial question, because the already poor grouping will become
>> worse almost any way you do it.
> 
> Aah, realized now that the batching series doesn't cleanly apply on master.
> Let me check this and will send across the changes that should fix this.
> 
> - Bhanuprakash
> 
>>>
>>> Cacheline marking the structure is a good practice and I am sure this
>>> structure is significant and should be carefully extended in the future.
>>
>> Not so sure about that.
>>
>>>
>>>>
>>>>    5. There is no reason to align flow_cache member because it's
>>>>       too big and we usually access random entries by single thread
>>>>       only.
>>>>
>>>
>>> I see your point. This patch wasn't done for performance and instead
>>> more to have some order with this ever growing structure. During
>>> testing I found that for some test cases aligning the flow_cache was giving
>> me 100k+ performance consistently and so was added.
>>
>> This was a random performance boost. You achieved it without aligned
>> memory allocation.
>> Just luck with your system environment. Using xmalloc_cacheline will likely
>> eliminate this performance difference or even degrade the performance.
>>
>>>
>>>> So, the padding/alignment only creates some visibility of performance
>>>> optimization but does nothing useful in reality. It only complicates
>>>> maintenance and adds huge holes for non-x86 architectures and
>>>> non-Linux systems. Performance improvement stated in a original
>>>> commit message should be random and not valuable. I see no
>> performance difference.
>>>
>>> I understand that this is causing issues with architecture having
>>> different cache line sizes, but unfortunately majority of them have 64 byte
>> cache lines so this change makes sense.
>>
>> I understand that 64 byte cache lines are far more widespread. I also have x86 as
>> a target arch, but still, IMHO, OVS is a cross-platform application and it should
>> not have platform dependent stuff which makes one architecture/platform
>> better and worsens others.
>>
>>>
>>> If you have performance data to prove that this causes severe perf
>> degradation I can think of workarounds for ARM.
>>>
>>> - Bhanuprakash.
>>
>>
>> P.S.: If you'll want to test with CACHE_LINE_SIZE=128 you will have to apply
>> following patch to avoid build time assert (I'll send it formally later):
>>
>> -----------------------------------------------------------------------------
>> diff --git a/lib/cmap.c b/lib/cmap.c
>> index 35decea..5b15ecd 100644
>> --- a/lib/cmap.c
>> +++ b/lib/cmap.c
>> @@ -123,12 +123,11 @@ COVERAGE_DEFINE(cmap_shrink);
>> /* Number of entries per bucket: 7 on 32-bit, 5 on 64-bit. */  #define CMAP_K
>> ((CACHE_LINE_SIZE - 4) / CMAP_ENTRY_SIZE)
>>
>> -/* Pad to make a bucket a full cache line in size: 4 on 32-bit, 0 on 64-bit. */ -
>> #define CMAP_PADDING ((CACHE_LINE_SIZE - 4) - (CMAP_K *
>> CMAP_ENTRY_SIZE))
>> -
>> /* A cuckoo hash bucket.  Designed to be cache-aligned and exactly one
>> cache
>>  * line long. */
>> struct cmap_bucket {
>> +    /* Padding to make cmap_bucket exactly one cache line long. */
>> +    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>     /* Allows readers to track in-progress changes.  Initially zero, each
>>      * writer increments this value just before and just after each change (see
>>      * cmap_set_bucket()).  Thus, a reader can ensure that it gets a consistent
>> @@ -145,11 +144,7 @@ struct cmap_bucket {
>>      * slots. */
>>     uint32_t hashes[CMAP_K];
>>     struct cmap_node nodes[CMAP_K];
>> -
>> -    /* Padding to make cmap_bucket exactly one cache line long. */
>> -#if CMAP_PADDING > 0
>> -    uint8_t pad[CMAP_PADDING];
>> -#endif
>> +    );
>> };
>> BUILD_ASSERT_DECL(sizeof(struct cmap_bucket) == CACHE_LINE_SIZE);
>>
>> diff --git a/lib/util.h b/lib/util.h
>> index 3c43c2c..514fdaa 100644
>> --- a/lib/util.h
>> +++ b/lib/util.h
>> @@ -61,7 +61,7 @@ struct Bad_arg_to_ARRAY_SIZE {
>>
>> /* This system's cache line size, in bytes.
>>  * Being wrong hurts performance but not correctness. */ -#define
>> CACHE_LINE_SIZE 64
>> +#define CACHE_LINE_SIZE 128
>> BUILD_ASSERT_DECL(IS_POW2(CACHE_LINE_SIZE));
>>
>> /* Cacheline marking is typically done using zero-sized array.
>> -----------------------------------------------------------------------------
>>
>>
>> Best regards, Ilya Maximets.
>>
>>>
>>>>
>>>> Most of the above issues are also true for some other padded/aligned
>>>> structures like 'struct netdev_dpdk'. They will be treated separately.
>>>>
>>>> CC: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
>>>> CC: Ben Pfaff <blp@ovn.org>
>>>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>>>> ---
>>>> lib/dpif-netdev.c | 160
>>>> +++++++++++++++++++++++-----------------------------
>>>> --
>>>> 1 file changed, 69 insertions(+), 91 deletions(-)
>>>>
>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>>>> 0a62630..6784269
>>>> 100644
>>>> --- a/lib/dpif-netdev.c
>>>> +++ b/lib/dpif-netdev.c
>>>> @@ -547,31 +547,18 @@ struct tx_port {
>>>>  * actions in either case.
>>>>  * */
>>>> struct dp_netdev_pmd_thread {
>>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE,
>> cacheline0,
>>>> -        struct dp_netdev *dp;
>>>> -        struct cmap_node node;          /* In 'dp->poll_threads'. */
>>>> -        pthread_cond_t cond;            /* For synchronizing pmd thread
>>>> -                                           reload. */
>>>> -    );
>>>> -
>>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE,
>> cacheline1,
>>>> -        struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>>> -        pthread_t thread;
>>>> -        unsigned core_id;               /* CPU core id of this pmd thread. */
>>>> -        int numa_id;                    /* numa node id of this pmd thread. */
>>>> -    );
>>>> +    struct dp_netdev *dp;
>>>> +    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed.
>> */
>>>> +    struct cmap_node node;          /* In 'dp->poll_threads'. */
>>>> +
>>>> +    pthread_cond_t cond;            /* For synchronizing pmd thread reload. */
>>>> +    struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>>>
>>>>     /* Per thread exact-match cache.  Note, the instance for cpu core
>>>>      * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
>>>>      * need to be protected by 'non_pmd_mutex'.  Every other instance
>>>>      * will only be accessed by its own pmd thread. */
>>>> -    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
>>>> -    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed.
>> */
>>>> -
>>>> -    /* Queue id used by this pmd thread to send packets on all netdevs if
>>>> -     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>>> -     * than 'cmap_count(dp->poll_threads)'. */
>>>> -    uint32_t static_tx_qid;
>>>> +    struct emc_cache flow_cache;
>>>>
>>>>     /* Flow-Table and classifiers
>>>>      *
>>>> @@ -580,77 +567,68 @@ struct dp_netdev_pmd_thread {
>>>>      * 'flow_mutex'.
>>>>      */
>>>>     struct ovs_mutex flow_mutex;
>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> -        struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>> -
>>>> -        /* One classifier per in_port polled by the pmd */
>>>> -        struct cmap classifiers;
>>>> -        /* Periodically sort subtable vectors according to hit frequencies */
>>>> -        long long int next_optimization;
>>>> -        /* End of the next time interval for which processing cycles
>>>> -           are stored for each polled rxq. */
>>>> -        long long int rxq_next_cycle_store;
>>>> -
>>>> -        /* Cycles counters */
>>>> -        struct dp_netdev_pmd_cycles cycles;
>>>> -
>>>> -        /* Used to count cycles. See 'cycles_counter_end()'. */
>>>> -        unsigned long long last_cycles;
>>>> -        struct latch exit_latch;        /* For terminating the pmd thread. */
>>>> -    );
>>>> -
>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> -        /* Statistics. */
>>>> -        struct dp_netdev_pmd_stats stats;
>>>> -
>>>> -        struct seq *reload_seq;
>>>> -        uint64_t last_reload_seq;
>>>> -        atomic_bool reload;             /* Do we need to reload ports? */
>>>> -        bool isolated;
>>>> -
>>>> -        /* Set to true if the pmd thread needs to be reloaded. */
>>>> -        bool need_reload;
>>>> -        /* 5 pad bytes. */
>>>> -    );
>>>> -
>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> -        struct ovs_mutex port_mutex;    /* Mutex for 'poll_list'
>>>> -                                           and 'tx_ports'. */
>>>> -        /* 16 pad bytes. */
>>>> -    );
>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> -        /* List of rx queues to poll. */
>>>> -        struct hmap poll_list OVS_GUARDED;
>>>> -        /* Map of 'tx_port's used for transmission.  Written by the main
>>>> -         * thread, read by the pmd thread. */
>>>> -        struct hmap tx_ports OVS_GUARDED;
>>>> -    );
>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> -        /* These are thread-local copies of 'tx_ports'.  One contains only
>>>> -         * tunnel ports (that support push_tunnel/pop_tunnel), the other
>>>> -         * contains ports with at least one txq (that support send).
>>>> -         * A port can be in both.
>>>> -         *
>>>> -         * There are two separate maps to make sure that we don't try to
>>>> -         * execute OUTPUT on a device which has 0 txqs or PUSH/POP on a
>>>> -         * non-tunnel device.
>>>> -         *
>>>> -         * The instances for cpu core NON_PMD_CORE_ID can be accessed by
>>>> -         * multiple threads and thusly need to be protected by
>>>> 'non_pmd_mutex'.
>>>> -         * Every other instance will only be accessed by its own pmd thread.
>> */
>>>> -        struct hmap tnl_port_cache;
>>>> -        struct hmap send_port_cache;
>>>> -    );
>>>> -
>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> -        /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>>> -         * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>>> -         * values and subtracts them from 'stats' and 'cycles' before
>>>> -         * reporting to the user */
>>>> -        unsigned long long stats_zero[DP_N_STATS];
>>>> -        uint64_t cycles_zero[PMD_N_CYCLES];
>>>> -        /* 8 pad bytes. */
>>>> -    );
>>>> +    struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>> +
>>>> +    /* One classifier per in_port polled by the pmd */
>>>> +    struct cmap classifiers;
>>>> +    /* Periodically sort subtable vectors according to hit frequencies */
>>>> +    long long int next_optimization;
>>>> +    /* End of the next time interval for which processing cycles
>>>> +       are stored for each polled rxq. */
>>>> +    long long int rxq_next_cycle_store;
>>>> +
>>>> +    /* Statistics. */
>>>> +    struct dp_netdev_pmd_stats stats;
>>>> +
>>>> +    /* Cycles counters */
>>>> +    struct dp_netdev_pmd_cycles cycles;
>>>> +
>>>> +    /* Used to count cicles. See 'cycles_counter_end()' */
>>>> +    unsigned long long last_cycles;
>>>> +
>>>> +    struct latch exit_latch;        /* For terminating the pmd thread. */
>>>> +    struct seq *reload_seq;
>>>> +    uint64_t last_reload_seq;
>>>> +    atomic_bool reload;             /* Do we need to reload ports? */
>>>> +    pthread_t thread;
>>>> +    unsigned core_id;               /* CPU core id of this pmd thread. */
>>>> +    int numa_id;                    /* numa node id of this pmd thread. */
>>>> +    bool isolated;
>>>> +
>>>> +    /* Queue id used by this pmd thread to send packets on all netdevs if
>>>> +     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>>> +     * than 'cmap_count(dp->poll_threads)'. */
>>>> +    uint32_t static_tx_qid;
>>>> +
>>>> +    struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and 'tx_ports'.
>> */
>>>> +    /* List of rx queues to poll. */
>>>> +    struct hmap poll_list OVS_GUARDED;
>>>> +    /* Map of 'tx_port's used for transmission.  Written by the main thread,
>>>> +     * read by the pmd thread. */
>>>> +    struct hmap tx_ports OVS_GUARDED;
>>>> +
>>>> +    /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
>>>> +     * ports (that support push_tunnel/pop_tunnel), the other contains
>> ports
>>>> +     * with at least one txq (that support send).  A port can be in both.
>>>> +     *
>>>> +     * There are two separate maps to make sure that we don't try to
>> execute
>>>> +     * OUTPUT on a device which has 0 txqs or PUSH/POP on a
>>>> + non-tunnel
>>>> device.
>>>> +     *
>>>> +     * The instances for cpu core NON_PMD_CORE_ID can be accessed by
>>>> multiple
>>>> +     * threads, and thusly need to be protected by 'non_pmd_mutex'.
>> Every
>>>> +     * other instance will only be accessed by its own pmd thread. */
>>>> +    struct hmap tnl_port_cache;
>>>> +    struct hmap send_port_cache;
>>>> +
>>>> +    /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>>> +     * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>>> +     * values and subtracts them from 'stats' and 'cycles' before
>>>> +     * reporting to the user */
>>>> +    unsigned long long stats_zero[DP_N_STATS];
>>>> +    uint64_t cycles_zero[PMD_N_CYCLES];
>>>> +
>>>> +    /* Set to true if the pmd thread needs to be reloaded. */
>>>> +    bool need_reload;
>>>> };
>>>>
>>>> /* Interface to netdev-based datapath. */
>>>> --
>>>> 2.7.4
>>>
>>>
>>>
>>>
Bodireddy, Bhanuprakash Nov. 27, 2017, 3:22 p.m. | #7
[ snip]
>>> Yes, you will always get aligned addresses on your x86 Linux system
>>> that supports the
>>> posix_memalign() call. The comment says what it says because it
>>> plays some memory allocation tricks in case posix_memalign() is not
>>> available (Windows, some MacOS, maybe some Linux systems (not sure))
>>> and the address will not be aligned in this case.
>>
>> I also verified the other case when posix_memalign isn't available and
>> even in that case it returns the address aligned on CACHE_LINE_SIZE
>> boundary. I will send out a patch to use  xzalloc_cacheline for allocating the
>memory.
>
>I don't know how you tested this, because it is impossible:
>
>	1. OVS allocates some memory:
>		base = xmalloc(...);
>
>	2. Rounds it up to the cache line start:
>		payload = (void **) ROUND_UP((uintptr_t) base,
>CACHE_LINE_SIZE);
>
>	3. Returns the pointer increased by 8 bytes:
>		return (char *) payload + MEM_ALIGN;
>
>So, unless you redefined MEM_ALIGN to zero, you will never get aligned
>memory address while allocating by xmalloc_cacheline() on system without
>posix_memalign().
>

Hmmm, I didn't set MEM_ALIGN to zero; instead I used the test code below to get
aligned addresses when posix_memalign() isn't available.  We can't set MEM_ALIGN
to zero, so we have to do this hack to get an aligned address and store the
original address (as returned by malloc) just before the aligned location so
that a later call to free() can release it.  (I should have mentioned this in my previous mail.)

-------------------------------------------------------------------------------------------------
    void **payload;
    void *base;

    base = xmalloc(CACHE_LINE_SIZE + size + MEM_ALIGN);
    /* Address aligned on CACHE_LINE_SIZE boundary. */
    payload = (void**)(((uintptr_t) base + CACHE_LINE_SIZE + MEM_ALIGN) &
                            ~(CACHE_LINE_SIZE - 1));
    /* Store the original address so it can be freed later. */
    payload[-1] = base;
    return (char *)payload;
-------------------------------------------------------------------------------------------------

- Bhanuprakash.
Bodireddy, Bhanuprakash Nov. 27, 2017, 5:02 p.m. | #8
>I agree with Ilya here. Adding theses cache line markers and re-grouping
>variables to minimize gaps in cache lines is creating a maintenance burden
>without any tangible benefit. I have had to go through the pain of refactoring
>my PMD Performance Metrics patch to the new dp_netdev_pmd_thread
>struct and spent a lot of time to analyze the actual memory layout with GDB
>and play Tetris with the variables.

Analyzing the memory layout with gdb for large structures is time consuming and not usually recommended.
I would suggest using poke-a-hole (pahole), which helps to understand and fix the structures in no time.
With pahole it's going to be a lot easier to work with large structures especially.

>
>There will never be more than a handful of PMDs, so minimizing the gaps does
>not matter from memory perspective. And whether the individual members
>occupy 4 or 5 cache lines does not matter either compared to the many
>hundred cache lines touched for EMC and DPCLS lookups of an Rx batch. And
>any optimization done for x86 is not necessarily optimal for other
>architectures.

I agree that an optimization targeted at x86 doesn't necessarily suit ARM due to its different cache line size.

>
>Finally, even for x86 there is not even a performance improvement. I re-ran
>our standard L3VPN over VXLAN performance PVP test on master and with
>Ilya's revert patch:
>
>Flows   master  reverted
>8,      4.46    4.48
>100,    4.27    4.29
>1000,   4.07    4.07
>2000,   3.68    3.68
>5000,   3.03    3.03
>10000,  2.76    2.77
>20000,  2.64    2.65
>50000,  2.60    2.61
>100000, 2.60    2.61
>500000, 2.60    2.61

What are the CFLAGS in this case, as they seem to make a difference. I have added my findings here for a different patch targeted at performance:
      https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341270.html

Patches to consider when testing your use case:
     Xzalloc_cachline:  https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341231.html
     (If using output batching)      https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341230.html

- Bhanuprakash.

>
>All in all, I support reverting this change.
>
>Regards, Jan
>
>Acked-by: Jan Scheurich <jan.scheurich@ericsson.com>
>
>> -----Original Message-----
>> From: ovs-dev-bounces@openvswitch.org
>> [mailto:ovs-dev-bounces@openvswitch.org] On Behalf Of Bodireddy,
>> Bhanuprakash
>> Sent: Friday, 24 November, 2017 17:09
>> To: Ilya Maximets <i.maximets@samsung.com>; ovs-dev@openvswitch.org;
>> Ben Pfaff <blp@ovn.org>
>> Cc: Heetae Ahn <heetae82.ahn@samsung.com>
>> Subject: Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor
>dp_netdev_pmd_thread structure."
>>
>> >On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>> >>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>> >>>
>> >>> Padding and aligning of dp_netdev_pmd_thread structure members is
>> >>> useless, broken in a several ways and only greatly degrades
>> >>> maintainability and extensibility of the structure.
>> >>
>> >> The idea of my earlier patch was to mark the cache lines and reduce
>> >> the
>> >holes while still maintaining the grouping of related members in this
>structure.
>> >
>> >Some of the grouping aspects look strange. For example, it looks
>> >illogical that 'exit_latch' is grouped with 'flow_table' but not with
>> >'reload_seq' and other reload related stuff. It looks strange that
>> >statistics and counters are spread across different groups. So, IMHO, it's not
>well grouped.
>>
>> I had to strike a fine balance and some members may be placed in a
>> different group due to their sizes and importance. Let me think if I can make
>it better.
>>
>> >
>> >> Also cache line marking is a good practice to make some one extra
>> >> cautious
>> >when extending or editing important structures .
>> >> Most importantly I was experimenting with prefetching on this
>> >> structure
>> >and needed cache line markers for it.
>> >>
>> >> I see that you are on ARM (I don't have HW to test) and want to
>> >> know if this
>> >commit has any negative affect and any numbers would be appreciated.
>> >
>> >Basic VM-VM testing shows stable 0.5% performance improvement with
>> >revert applied.
>>
>> I did P2P, PVP and PVVP with IXIA and haven't noticed any drop on X86.
>>
>> >Padding adds 560 additional bytes of holes.
>> As the cache line on ARM is 128 bytes, it creates holes; I can find a workaround to
>handle this.
>>
>> >
>> >> More comments inline.
>> >>
>> >>>
>> >>> Issues:
>> >>>
>> >>>    1. It's not working because all the instances of struct
>> >>>       dp_netdev_pmd_thread allocated only by usual malloc. All the
>> >>>       memory is not aligned to cachelines -> structure almost never
>> >>>       starts at aligned memory address. This means that any further
>> >>>       paddings and alignments inside the structure are completely
>> >>>       useless. Fo example:
>> >>>
>> >>>       Breakpoint 1, pmd_thread_main
>> >>>       (gdb) p pmd
>> >>>       $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>> >>>       (gdb) p &pmd->cacheline1
>> >>>       $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>> >>>       (gdb) p &pmd->cacheline0
>> >>>       $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>> >>>       (gdb) p &pmd->flow_cache
>> >>>       $53 = (struct emc_cache *) 0x1b1afe0
>> >>>
>> >>>       All of the above addresses shifted from cacheline start by 32B.
>> >>
>> >> If you see below all the addresses are 64 byte aligned.
>> >>
>> >> (gdb) p pmd
>> >> $1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
>> >> (gdb) p &pmd->cacheline0
>> >> $2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
>> >> (gdb) p &pmd->cacheline1
>> >> $3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
>> >> (gdb) p &pmd->flow_cache
>> >> $4 = (struct emc_cache *) 0x7fc1e9b1a0c0
>> >> (gdb) p &pmd->flow_table
>> >> $5 = (struct cmap *) 0x7fc1e9fba100
>> >> (gdb) p &pmd->stats
>> >> $6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
>> >> (gdb) p &pmd->port_mutex
>> >> $7 = (struct ovs_mutex *) 0x7fc1e9fba180
>> >> (gdb) p &pmd->poll_list
>> >> $8 = (struct hmap *) 0x7fc1e9fba1c0
>> >> (gdb) p &pmd->tnl_port_cache
>> >> $9 = (struct hmap *) 0x7fc1e9fba200
>> >> (gdb) p &pmd->stats_zero
>> >> $10 = (unsigned long long (*)[5]) 0x7fc1e9fba240
>> >>
>> >> I tried using xzalloc_cacheline instead of default xzalloc() here.
>> >> I tried tens of times and always found that the address is
>> >> 64 byte aligned and it should start at the beginning of cache line on X86.
>> >> Not sure why the comment  " (The memory returned will not be at the
>> >> start
>> >of  a cache line, though, so don't assume such alignment.)" says otherwise?
>> >
>> >Yes, you will always get aligned addresses on your x86 Linux system
>> >that supports the
>> >posix_memalign() call. The comment says what it says because it
>> >plays some memory allocation tricks in case posix_memalign() is not
>> >available (Windows, some MacOS, maybe some Linux systems (not sure))
>> >and the address will not be aligned in this case.
>>
>> I also verified the other case when posix_memalign isn't available and
>> even in that case it returns the address aligned on CACHE_LINE_SIZE
>> boundary. I will send out a patch to use  xzalloc_cacheline for allocating the
>memory.
>>
>> >
>> >>
>> >>>
>> >>>       Can we fix it properly? NO.
>> >>>       OVS currently doesn't have appropriate API to allocate aligned
>> >>>       memory. The best candidate is 'xmalloc_cacheline()' but it
>> >>>       clearly states that "The memory returned will not be at the
>> >>>       start of a cache line, though, so don't assume such alignment".
>> >>>       And also, this function will never return aligned memory on
>> >>>       Windows or MacOS.
>> >>>
>> >>>    2. CACHE_LINE_SIZE is not constant. Different architectures have
>> >>>       different cache line sizes, but the code assumes that
>> >>>       CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
>> >>>       members are grouped by 64 bytes and padded to CACHE_LINE_SIZE.
>> >>>       This leads to a huge holes in a structures if CACHE_LINE_SIZE
>> >>>       differs from 64. This is opposite to portability. If I want
>> >>>       good performance of cmap I need to have CACHE_LINE_SIZE equal
>> >>>       to the real cache line size, but I will have huge holes in the
>> >>>       structures. If you'll take a look to struct rte_mbuf from DPDK
>> >>>       you'll see that it uses 2 defines: RTE_CACHE_LINE_SIZE and
>> >>>       RTE_CACHE_LINE_MIN_SIZE to avoid holes in mbuf structure.
>> >>
>> >> I understand that ARM and few other processors (like OCTEON) have
>> >> 128
>> >bytes cache lines.
>> >> But  again curious of performance impact in your case with this new
>> >alignment.
>> >>
>> >>>
>> >>>    3. Sizes of system/libc defined types are not constant for all the
>> >>>       systems. For example, sizeof(pthread_mutex_t) == 48 on my
>> >>>       ARMv8 machine, but only 40 on x86. The difference could be
>> >>>       much bigger on Windows or MacOS systems. But the code assumes
>> >>>       that sizeof(struct ovs_mutex) is always 48 bytes. This may lead
>> >>>       to broken alignment/big holes in case of padding/wrong comments
>> >>>       about amount of free pad bytes.
>> >>
>> >> This isn't an issue as you would have already mentioned and more
>> >> about
>> >issue with the comment that reads the pad bytes.
>> >> In case of ARM it would be just 8 pad bytes instead of 16 on X86.
>> >>
>> >>         union {
>> >>                 struct {
>> >>                         struct ovs_mutex port_mutex;     /* 4849984    48 */
>> >>                 };    /*          48 */
>> >>                 uint8_t            pad13[64];            /*          64 */
>> >>         };                                               /
>> >>
>> >
>> >It's not only about 'port_mutex'. If you take a look at
>> >'flow_mutex', you will see that it is not even padded. So, increasing the size
>of 'flow_mutex'
>> >shifts all the following padded blocks, and no other
>> >block will be properly aligned even if the structure is allocated on
>aligned memory.
>> >
>> >>>
>> >>>    4. Sizes of the many fileds in structure depends on defines like
>> >>>       DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so
>on.
>> >>>       Any change in these defines or any change in any structure
>> >>>       contained by thread should lead to the not so simple
>> >>>       refactoring of the whole dp_netdev_pmd_thread structure. This
>> >>>       greatly reduces maintainability and complicates development of
>> >>>       a new features.
>> >>
>> >> I don't think it complicates development and instead I feel the
>> >> commit gives a clear indication to the developer that the members
>> >> are grouped and
>> >aligned and marked with cacheline markers.
>> >> This makes the developer extra cautious when adding new members so
>> >> that
>> >holes can be avoided.
>> >
>> >Starting rebase of the output batching patch-set I figured out that I
>> >need to remove 'unsigned long long last_cycles' and add 'struct
>> >dp_netdev_pmd_thread_ctx ctx'
>> >which is 8 bytes larger. Could you, please, suggest me where should I
>> >place that new structure member and what to do with a hole from
>'last_cycles'?
>> >
>> >This is not a trivial question, because the already poor grouping will
>> >become worse almost any way you do it.
>>
>> Aah, realized now that the batching series doesn't cleanly apply on master.
>> Let me check this and will send across the changes that should fix this.
>>
>> - Bhanuprakash
>>
>> >>
>> >> Cacheline marking the structure is a good practice and I am sure
>> >> this structure is significant and should be carefully extended in the
>future.
>> >
>> >Not so sure about that.
>> >
>> >>
>> >>>
>> >>>    5. There is no reason to align flow_cache member because it's
>> >>>       too big and we usually access random entries by single thread
>> >>>       only.
>> >>>
>> >>
>> >> I see your point. This patch wasn't done for performance and
>> >> instead more to have some order with this ever growing structure.
>> >> During testing I found that for some test cases aligning the
>> >> flow_cache was giving
>> >me 100k+ performance consistently and so was added.
>> >
>> >This was a random performance boost. You achieved it without aligned
>> >memory allocation.
>> >Just luck with your system environment. Using xzalloc_cacheline
>> >will likely eliminate this performance difference or even degrade the
>performance.
>> >
>> >>
>> >>> So, the padding/alignment only creates the appearance of
>> >>> performance optimization but does nothing useful in reality. It
>> >>> only complicates maintenance and adds huge holes for non-x86
>> >>> architectures and non-Linux systems. Performance improvement
>> >>> stated in the original commit message is likely random and not
>> >>> valuable. I see no
>> >performance difference.
>> >>
>> >> I understand that this is causing issues with architectures having
>> >> different cache line sizes, but unfortunately the majority of them have
>> >> 64 byte
>> >cache lines so this change makes sense.
>> >
>> >I understand that 64 byte cache lines are spread a lot wider. I also
>> >have x86 as a target arch, but still, IMHO, OVS is a cross-platform
>> >application and it should not have platform dependent stuff which
>> >makes one architecture/platform better and worsens others.
>> >
>> >>
>> >> If you have performance data to prove that this causes severe perf
>> >degradation I can think of workarounds for ARM.
>> >>
>> >> - Bhanuprakash.
>> >
>> >
>> >P.S.: If you want to test with CACHE_LINE_SIZE=128 you will have
>> >to apply the following patch to avoid a build-time assert (I'll send it formally
>later):
>> >
>> >---------------------------------------------------------------------
>> >--------
>> >diff --git a/lib/cmap.c b/lib/cmap.c
>> >index 35decea..5b15ecd 100644
>> >--- a/lib/cmap.c
>> >+++ b/lib/cmap.c
>> >@@ -123,12 +123,11 @@ COVERAGE_DEFINE(cmap_shrink);
>> > /* Number of entries per bucket: 7 on 32-bit, 5 on 64-bit. */
>> >#define CMAP_K ((CACHE_LINE_SIZE - 4) / CMAP_ENTRY_SIZE)
>> >
>> >-/* Pad to make a bucket a full cache line in size: 4 on 32-bit, 0 on 64-bit. */
>> >-#define CMAP_PADDING ((CACHE_LINE_SIZE - 4) - (CMAP_K * CMAP_ENTRY_SIZE))
>> >-
>> > /* A cuckoo hash bucket.  Designed to be cache-aligned and exactly
>> >one cache
>> >  * line long. */
>> > struct cmap_bucket {
>> >+    /* Padding to make cmap_bucket exactly one cache line long. */
>> >+    PADDED_MEMBERS(CACHE_LINE_SIZE,
>> >     /* Allows readers to track in-progress changes.  Initially zero, each
>> >      * writer increments this value just before and just after each change
>(see
>> >      * cmap_set_bucket()).  Thus, a reader can ensure that it gets a
>> >consistent @@ -145,11 +144,7 @@ struct cmap_bucket {
>> >      * slots. */
>> >     uint32_t hashes[CMAP_K];
>> >     struct cmap_node nodes[CMAP_K];
>> >-
>> >-    /* Padding to make cmap_bucket exactly one cache line long. */
>> >-#if CMAP_PADDING > 0
>> >-    uint8_t pad[CMAP_PADDING];
>> >-#endif
>> >+    );
>> > };
>> > BUILD_ASSERT_DECL(sizeof(struct cmap_bucket) == CACHE_LINE_SIZE);
>> >
>> >diff --git a/lib/util.h b/lib/util.h
>> >index 3c43c2c..514fdaa 100644
>> >--- a/lib/util.h
>> >+++ b/lib/util.h
>> >@@ -61,7 +61,7 @@ struct Bad_arg_to_ARRAY_SIZE {
>> >
>> > /* This system's cache line size, in bytes.
>> >  * Being wrong hurts performance but not correctness. */
>> >-#define CACHE_LINE_SIZE 64
>> >+#define CACHE_LINE_SIZE 128
>> > BUILD_ASSERT_DECL(IS_POW2(CACHE_LINE_SIZE));
>> >
>> > /* Cacheline marking is typically done using zero-sized array.
>> >---------------------------------------------------------------------
>> >--------
>> >
>> >
>> >Best regards, Ilya Maximets.
>> >
>> >>
>> >>>
>> >>> Most of the above issues are also true for some other
>> >>> padded/aligned structures like 'struct netdev_dpdk'. They will be
>treated separately.
>> >>>
>> >>> CC: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
>> >>> CC: Ben Pfaff <blp@ovn.org>
>> >>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>> >>> ---
>> >>> lib/dpif-netdev.c | 160 +++++++++++++++++++++++-------------------------------
>> >>> 1 file changed, 69 insertions(+), 91 deletions(-)
>> >>>
>> >>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> >>> index 0a62630..6784269 100644
>> >>> --- a/lib/dpif-netdev.c
>> >>> +++ b/lib/dpif-netdev.c
>> >>> @@ -547,31 +547,18 @@ struct tx_port {
>> >>>  * actions in either case.
>> >>>  * */
>> >>> struct dp_netdev_pmd_thread {
>> >>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE,
>> >cacheline0,
>> >>> -        struct dp_netdev *dp;
>> >>> -        struct cmap_node node;          /* In 'dp->poll_threads'. */
>> >>> -        pthread_cond_t cond;            /* For synchronizing pmd thread
>> >>> -                                           reload. */
>> >>> -    );
>> >>> -
>> >>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE,
>> >cacheline1,
>> >>> -        struct ovs_mutex cond_mutex;    /* Mutex for condition variable.
>*/
>> >>> -        pthread_t thread;
>> >>> -        unsigned core_id;               /* CPU core id of this pmd thread. */
>> >>> -        int numa_id;                    /* numa node id of this pmd thread. */
>> >>> -    );
>> >>> +    struct dp_netdev *dp;
>> >>> +    struct ovs_refcount ref_cnt;    /* Every reference must be
>refcount'ed.
>> >*/
>> >>> +    struct cmap_node node;          /* In 'dp->poll_threads'. */
>> >>> +
>> >>> +    pthread_cond_t cond;            /* For synchronizing pmd thread reload.
>*/
>> >>> +    struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>> >>>
>> >>>     /* Per thread exact-match cache.  Note, the instance for cpu core
>> >>>      * NON_PMD_CORE_ID can be accessed by multiple threads, and
>thusly
>> >>>      * need to be protected by 'non_pmd_mutex'.  Every other instance
>> >>>      * will only be accessed by its own pmd thread. */
>> >>> -    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache
>flow_cache;
>> >>> -    struct ovs_refcount ref_cnt;    /* Every reference must be
>refcount'ed.
>> >*/
>> >>> -
>> >>> -    /* Queue id used by this pmd thread to send packets on all netdevs
>if
>> >>> -     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>> >>> -     * than 'cmap_count(dp->poll_threads)'. */
>> >>> -    uint32_t static_tx_qid;
>> >>> +    struct emc_cache flow_cache;
>> >>>
>> >>>     /* Flow-Table and classifiers
>> >>>      *
>> >>> @@ -580,77 +567,68 @@ struct dp_netdev_pmd_thread {
>> >>>      * 'flow_mutex'.
>> >>>      */
>> >>>     struct ovs_mutex flow_mutex;
>> >>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>> >>> -        struct cmap flow_table OVS_GUARDED; /* Flow table. */
>> >>> -
>> >>> -        /* One classifier per in_port polled by the pmd */
>> >>> -        struct cmap classifiers;
>> >>> -        /* Periodically sort subtable vectors according to hit frequencies */
>> >>> -        long long int next_optimization;
>> >>> -        /* End of the next time interval for which processing cycles
>> >>> -           are stored for each polled rxq. */
>> >>> -        long long int rxq_next_cycle_store;
>> >>> -
>> >>> -        /* Cycles counters */
>> >>> -        struct dp_netdev_pmd_cycles cycles;
>> >>> -
>> >>> -        /* Used to count cycles. See 'cycles_counter_end()'. */
>> >>> -        unsigned long long last_cycles;
>> >>> -        struct latch exit_latch;        /* For terminating the pmd thread. */
>> >>> -    );
>> >>> -
>> >>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>> >>> -        /* Statistics. */
>> >>> -        struct dp_netdev_pmd_stats stats;
>> >>> -
>> >>> -        struct seq *reload_seq;
>> >>> -        uint64_t last_reload_seq;
>> >>> -        atomic_bool reload;             /* Do we need to reload ports? */
>> >>> -        bool isolated;
>> >>> -
>> >>> -        /* Set to true if the pmd thread needs to be reloaded. */
>> >>> -        bool need_reload;
>> >>> -        /* 5 pad bytes. */
>> >>> -    );
>> >>> -
>> >>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>> >>> -        struct ovs_mutex port_mutex;    /* Mutex for 'poll_list'
>> >>> -                                           and 'tx_ports'. */
>> >>> -        /* 16 pad bytes. */
>> >>> -    );
>> >>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>> >>> -        /* List of rx queues to poll. */
>> >>> -        struct hmap poll_list OVS_GUARDED;
>> >>> -        /* Map of 'tx_port's used for transmission.  Written by the main
>> >>> -         * thread, read by the pmd thread. */
>> >>> -        struct hmap tx_ports OVS_GUARDED;
>> >>> -    );
>> >>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>> >>> -        /* These are thread-local copies of 'tx_ports'.  One contains only
>> >>> -         * tunnel ports (that support push_tunnel/pop_tunnel), the other
>> >>> -         * contains ports with at least one txq (that support send).
>> >>> -         * A port can be in both.
>> >>> -         *
>> >>> -         * There are two separate maps to make sure that we don't try to
>> >>> -         * execute OUTPUT on a device which has 0 txqs or PUSH/POP on a
>> >>> -         * non-tunnel device.
>> >>> -         *
>> >>> -         * The instances for cpu core NON_PMD_CORE_ID can be accessed
>by
>> >>> -         * multiple threads and thusly need to be protected by
>> >>> 'non_pmd_mutex'.
>> >>> -         * Every other instance will only be accessed by its own pmd
>thread.
>> >*/
>> >>> -        struct hmap tnl_port_cache;
>> >>> -        struct hmap send_port_cache;
>> >>> -    );
>> >>> -
>> >>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>> >>> -        /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>> >>> -         * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>> >>> -         * values and subtracts them from 'stats' and 'cycles' before
>> >>> -         * reporting to the user */
>> >>> -        unsigned long long stats_zero[DP_N_STATS];
>> >>> -        uint64_t cycles_zero[PMD_N_CYCLES];
>> >>> -        /* 8 pad bytes. */
>> >>> -    );
>> >>> +    struct cmap flow_table OVS_GUARDED; /* Flow table. */
>> >>> +
>> >>> +    /* One classifier per in_port polled by the pmd */
>> >>> +    struct cmap classifiers;
>> >>> +    /* Periodically sort subtable vectors according to hit frequencies */
>> >>> +    long long int next_optimization;
>> >>> +    /* End of the next time interval for which processing cycles
>> >>> +       are stored for each polled rxq. */
>> >>> +    long long int rxq_next_cycle_store;
>> >>> +
>> >>> +    /* Statistics. */
>> >>> +    struct dp_netdev_pmd_stats stats;
>> >>> +
>> >>> +    /* Cycles counters */
>> >>> +    struct dp_netdev_pmd_cycles cycles;
>> >>> +
>> >>> +    /* Used to count cicles. See 'cycles_counter_end()' */
>> >>> +    unsigned long long last_cycles;
>> >>> +
>> >>> +    struct latch exit_latch;        /* For terminating the pmd thread. */
>> >>> +    struct seq *reload_seq;
>> >>> +    uint64_t last_reload_seq;
>> >>> +    atomic_bool reload;             /* Do we need to reload ports? */
>> >>> +    pthread_t thread;
>> >>> +    unsigned core_id;               /* CPU core id of this pmd thread. */
>> >>> +    int numa_id;                    /* numa node id of this pmd thread. */
>> >>> +    bool isolated;
>> >>> +
>> >>> +    /* Queue id used by this pmd thread to send packets on all netdevs
>if
>> >>> +     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>> >>> +     * than 'cmap_count(dp->poll_threads)'. */
>> >>> +    uint32_t static_tx_qid;
>> >>> +
>> >>> +    struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and
>'tx_ports'.
>> >*/
>> >>> +    /* List of rx queues to poll. */
>> >>> +    struct hmap poll_list OVS_GUARDED;
>> >>> +    /* Map of 'tx_port's used for transmission.  Written by the main
>thread,
>> >>> +     * read by the pmd thread. */
>> >>> +    struct hmap tx_ports OVS_GUARDED;
>> >>> +
>> >>> +    /* These are thread-local copies of 'tx_ports'.  One contains only
>tunnel
>> >>> +     * ports (that support push_tunnel/pop_tunnel), the other
>> >>> + contains
>> >ports
>> >>> +     * with at least one txq (that support send).  A port can be in both.
>> >>> +     *
>> >>> +     * There are two separate maps to make sure that we don't try
>> >>> + to
>> >execute
>> >>> +     * OUTPUT on a device which has 0 txqs or PUSH/POP on a
>> >>> + non-tunnel
>> >>> device.
>> >>> +     *
>> >>> +     * The instances for cpu core NON_PMD_CORE_ID can be accessed
>> >>> + by
>> >>> multiple
>> >>> +     * threads, and thusly need to be protected by 'non_pmd_mutex'.
>> >Every
>> >>> +     * other instance will only be accessed by its own pmd thread. */
>> >>> +    struct hmap tnl_port_cache;
>> >>> +    struct hmap send_port_cache;
>> >>> +
>> >>> +    /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>> >>> +     * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>> >>> +     * values and subtracts them from 'stats' and 'cycles' before
>> >>> +     * reporting to the user */
>> >>> +    unsigned long long stats_zero[DP_N_STATS];
>> >>> +    uint64_t cycles_zero[PMD_N_CYCLES];
>> >>> +
>> >>> +    /* Set to true if the pmd thread needs to be reloaded. */
>> >>> +    bool need_reload;
>> >>> };
>> >>>
>> >>> /* Interface to netdev-based datapath. */
>> >>> --
>> >>> 2.7.4
>> >>
>> >>
>> >>
>> >>
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Ben Pfaff Nov. 27, 2017, 10:54 p.m. | #9
On Mon, Nov 27, 2017 at 02:59:29PM +0300, Ilya Maximets wrote:
> > I also verified the other case when posix_memalign isn't available and even in that case
> > it returns the address aligned on CACHE_LINE_SIZE boundary. I will send out a patch to use
> >  xzalloc_cacheline for allocating the memory.
> 
> I don't know how you tested this, because it is impossible:
> 
> 	1. OVS allocates some memory:
> 		base = xmalloc(...);
> 
> 	2. Rounds it up to the cache line start:
> 		payload = (void **) ROUND_UP((uintptr_t) base, CACHE_LINE_SIZE);
> 
> 	3. Returns the pointer increased by 8 bytes:
> 		return (char *) payload + MEM_ALIGN;
> 
> So, unless you redefined MEM_ALIGN to zero, you will never get an aligned memory
> address when allocating with xmalloc_cacheline() on a system without posix_memalign().

Maybe we should make the non-HAVE_POSIX_MEMALIGN case better.  Something
like this (compile tested only):

diff --git a/lib/util.c b/lib/util.c
index 9e6edd27ae4c..137091a3cd4f 100644
--- a/lib/util.c
+++ b/lib/util.c
@@ -196,15 +196,9 @@ x2nrealloc(void *p, size_t *n, size_t s)
     return xrealloc(p, *n * s);
 }
 
-/* The desired minimum alignment for an allocated block of memory. */
-#define MEM_ALIGN MAX(sizeof(void *), 8)
-BUILD_ASSERT_DECL(IS_POW2(MEM_ALIGN));
-BUILD_ASSERT_DECL(CACHE_LINE_SIZE >= MEM_ALIGN);
-
-/* Allocates and returns 'size' bytes of memory in dedicated cache lines.  That
- * is, the memory block returned will not share a cache line with other data,
- * avoiding "false sharing".  (The memory returned will not be at the start of
- * a cache line, though, so don't assume such alignment.)
+/* Allocates and returns 'size' bytes of memory aligned to a cache line and in
+ * dedicated cache lines.  That is, the memory block returned will not share a
+ * cache line with other data, avoiding "false sharing".
  *
  * Use free_cacheline() to free the returned memory block. */
 void *
@@ -221,28 +215,37 @@ xmalloc_cacheline(size_t size)
     }
     return p;
 #else
-    void **payload;
-    void *base;
-
     /* Allocate room for:
      *
-     *     - Up to CACHE_LINE_SIZE - 1 bytes before the payload, so that the
-     *       start of the payload doesn't potentially share a cache line.
+     *     - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to allow the
+     *       pointer to be aligned exactly sizeof(void *) bytes before the
+     *       beginning of a cache line.
      *
-     *     - A payload consisting of a void *, followed by padding out to
-     *       MEM_ALIGN bytes, followed by 'size' bytes of user data.
+     *     - Pointer: A pointer to the start of the header padding, to allow us
+     *       to free() the block later.
      *
-     *     - Space following the payload up to the end of the cache line, so
-     *       that the end of the payload doesn't potentially share a cache line
-     *       with some following block. */
-    base = xmalloc((CACHE_LINE_SIZE - 1)
-                   + ROUND_UP(MEM_ALIGN + size, CACHE_LINE_SIZE));
-
-    /* Locate the payload and store a pointer to the base at the beginning. */
-    payload = (void **) ROUND_UP((uintptr_t) base, CACHE_LINE_SIZE);
-    *payload = base;
-
-    return (char *) payload + MEM_ALIGN;
+     *     - User data: 'size' bytes.
+     *
+     *     - Trailer padding: Enough to bring the user data up to a cache line
+     *       multiple.
+     *
+     * +---------------+---------+------------------------+---------+
+     * | header        | pointer | user data              | trailer |
+     * +---------------+---------+------------------------+---------+
+     * ^               ^         ^
+     * |               |         |
+     * p               q         r
+     *
+     */
+    void *p = xmalloc((CACHE_LINE_SIZE - 1)
+                      + sizeof(void *)
+                      + ROUND_UP(size, CACHE_LINE_SIZE));
+    bool runt = PAD_SIZE((uintptr_t) p, CACHE_LINE_SIZE) < sizeof(void *);
+    void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? CACHE_LINE_SIZE : 0),
+                                CACHE_LINE_SIZE);
+    void **q = (void **) r - 1;
+    *q = p;
+    return r;
 #endif
 }
 
@@ -265,7 +268,8 @@ free_cacheline(void *p)
     free(p);
 #else
     if (p) {
-        free(*(void **) ((uintptr_t) p - MEM_ALIGN));
+        void **q = (void **) p - 1;
+        free(*q);
     }
 #endif
 }
Jan Scheurich Nov. 28, 2017, 8:35 a.m. | #10
> Analyzing the memory layout with gdb for large structures is time-consuming and not usually recommended.
> I would suggest using Poke-a-hole (pahole), which helps to understand and fix the structures in no time.
> With pahole it's going to be lot easier to work with large structures especially.

Thanks for the pointer. I'll have a look at pahole.
It doesn't affect my reasoning against optimizing the compactification of struct dp_netdev_pmd_thread, though.

> >Finally, even for x86 there is not even a performance improvement. I re-ran
> >our standard L3VPN over VXLAN performance PVP test on master and with
> >Ilya's revert patch:
> >
> >Flows   master  reverted
> >8,      4.46    4.48
> >100,    4.27    4.29
> >1000,   4.07    4.07
> >2000,   3.68    3.68
> >5000,   3.03    3.03
> >10000,  2.76    2.77
> >20000,  2.64    2.65
> >50000,  2.60    2.61
> >100000, 2.60    2.61
> >500000, 2.60    2.61
> 
> What are the CFLAGS in this case, as they seem to make a difference. I have added my findings here for a different patch targeted at
> performance
>       https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341270.html

I'm compiling with "-O3 -msse4.2" to be in line with production deployments of OVS-DPDK that need to run on a wider family of Xeon generations. 

> 
> Patches to consider when testing your use case:
>      Xzalloc_cachline:  https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341231.html
>      (If using output batching)      https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341230.html

I didn't use these. Tx batching is not relevant here. And I understand the xzalloc_cacheline patch alone does not guarantee that the allocated memory is indeed cache line-aligned.

Thx, Jan
Ilya Maximets Nov. 28, 2017, 11:47 a.m. | #11
On 27.11.2017 20:02, Bodireddy, Bhanuprakash wrote:
>> I agree with Ilya here. Adding theses cache line markers and re-grouping
>> variables to minimize gaps in cache lines is creating a maintenance burden
>> without any tangible benefit. I have had to go through the pain of refactoring
>> my PMD Performance Metrics patch to the new dp_netdev_pmd_thread
>> struct and spent a lot of time to analyze the actual memory layout with GDB
>> and play Tetris with the variables.
> 
> Analyzing the memory layout with gdb for large structures is time-consuming and not usually recommended.
> I would suggest using Poke-a-hole (pahole), which helps to understand and fix the structures in no time.
> With pahole it's going to be lot easier to work with large structures especially.

Interesting tool, but it doesn't seem to work perfectly. I see duplicated unions
and zero-length arrays in the output, and I still need to check sizes by hand.
It also fails when trying to run on my Ubuntu 16.04 LTS on x86.
IMHO, the code should be simple enough that no external utilities are needed
to check a single structure.

>>
>> There will never be more than a handful of PMDs, so minimizing the gaps does
>> not matter from a memory perspective. And whether the individual members
>> occupy 4 or 5 cache lines does not matter either compared to the many
>> hundred cache lines touched for EMC and DPCLS lookups of an Rx batch. And
>> any optimization done for x86 is not necessarily optimal for other
>> architectures.
> 
> I agree that optimization targeted for x86 doesn't necessarily suit ARM due to its different cache line size.
> 
>>
>> Finally, even for x86 there is not even a performance improvement. I re-ran
>> our standard L3VPN over VXLAN performance PVP test on master and with
>> Ilya's revert patch:
>>
>> Flows   master  reverted
>> 8,      4.46    4.48
>> 100,    4.27    4.29
>> 1000,   4.07    4.07
>> 2000,   3.68    3.68
>> 5000,   3.03    3.03
>> 10000,  2.76    2.77
>> 20000,  2.64    2.65
>> 50000,  2.60    2.61
>> 100000, 2.60    2.61
>> 500000, 2.60    2.61
> 
> What are the CFLAGS in this case, as they seem to make a difference. I have added my findings here for a different patch targeted at performance
>       https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341270.html

Do you have any performance results that show a significant performance difference
between the above cases? Please describe your test scenario and environment so we'll
be able to see that the padding/alignment is really needed here. I've seen no such results yet.

BTW, in one place you're saying that the patch was not about performance, while at the
same time you're trying to show that it has some positive performance impact. I'm a bit
confused by that.

> 
> Patches to consider when testing your use case:
>      Xzalloc_cachline:  https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341231.html
>      (If using output batching)      https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341230.html
> 
> - Bhanuprakash.
> 
>>
>> All in all, I support reverting this change.
>>
>> Regards, Jan
>>
>> Acked-by: Jan Scheurich <jan.scheurich@ericsson.com>
>>
>>> -----Original Message-----
>>> From: ovs-dev-bounces@openvswitch.org
>>> [mailto:ovs-dev-bounces@openvswitch.org] On Behalf Of Bodireddy,
>>> Bhanuprakash
>>> Sent: Friday, 24 November, 2017 17:09
>>> To: Ilya Maximets <i.maximets@samsung.com>; ovs-dev@openvswitch.org;
>>> Ben Pfaff <blp@ovn.org>
>>> Cc: Heetae Ahn <heetae82.ahn@samsung.com>
>>> Subject: Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor
>> dp_netdev_pmd_thread structure."
>>>
>>>> On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>>>>>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>>>>>>
>>>>>> Padding and aligning of dp_netdev_pmd_thread structure members is
>>>>>> useless, broken in several ways and only greatly degrades
>>>>>> maintainability and extensibility of the structure.
>>>>>
>>>>> The idea of my earlier patch was to mark the cache lines and reduce
>>>>> the
>>>> holes while still maintaining the grouping of related members in this
>> structure.
>>>>
>>>> Some of the grouping aspects look strange. For example, it looks
>>>> illogical that 'exit_latch' is grouped with 'flow_table' but not the
>>>> 'reload_seq' and other reload related stuff. It looks strange that
>>>> statistics and counters spread across different groups. So, IMHO, it's not
>> well grouped.
>>>
>>> I had to strike a fine balance and some members may be placed in a
>>> different group due to their sizes and importance. Let me think if I can make
>> it better.
>>>
>>>>
>>>>> Also, cache line marking is a good practice to make someone extra
>>>>> cautious
>>>> when extending or editing important structures.
>>>>> Most importantly I was experimenting with prefetching on this
>>>>> structure
>>>> and needed cache line markers for it.
>>>>>
>>>>> I see that you are on ARM (I don't have HW to test) and want to
>>>>> know if this
>>>> commit has any negative effect and any numbers would be appreciated.
>>>>
>>>> Basic VM-VM testing shows a stable 0.5% performance improvement with
>>>> revert applied.
>>>
>>> I did P2P, PVP and PVVP with IXIA and haven't noticed any drop on X86.
>>>
>>>> Padding adds 560 additional bytes of holes.
>>> As the cache line on ARM is 128 bytes, it created holes; I can find a workaround to
>> handle this.
>>>
>>>>
>>>>> More comments inline.
>>>>>
>>>>>>
>>>>>> Issues:
>>>>>>
>>>>>>    1. It's not working because all the instances of struct
>>>>>>       dp_netdev_pmd_thread allocated only by usual malloc. All the
>>>>>>       memory is not aligned to cachelines -> structure almost never
>>>>>>       starts at aligned memory address. This means that any further
>>>>>>       paddings and alignments inside the structure are completely
>>>>>>       useless. For example:
>>>>>>
>>>>>>       Breakpoint 1, pmd_thread_main
>>>>>>       (gdb) p pmd
>>>>>>       $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>>>>>>       (gdb) p &pmd->cacheline1
>>>>>>       $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>>>>>>       (gdb) p &pmd->cacheline0
>>>>>>       $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>>>>>>       (gdb) p &pmd->flow_cache
>>>>>>       $53 = (struct emc_cache *) 0x1b1afe0
>>>>>>
>>>>>>       All of the above addresses shifted from cacheline start by 32B.
>>>>>
>>>>> If you see below all the addresses are 64 byte aligned.
>>>>>
>>>>> (gdb) p pmd
>>>>> $1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
>>>>> (gdb) p &pmd->cacheline0
>>>>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
>>>>> (gdb) p &pmd->cacheline1
>>>>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
>>>>> (gdb) p &pmd->flow_cache
>>>>> $4 = (struct emc_cache *) 0x7fc1e9b1a0c0
>>>>> (gdb) p &pmd->flow_table
>>>>> $5 = (struct cmap *) 0x7fc1e9fba100
>>>>> (gdb) p &pmd->stats
>>>>> $6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
>>>>> (gdb) p &pmd->port_mutex
>>>>> $7 = (struct ovs_mutex *) 0x7fc1e9fba180
>>>>> (gdb) p &pmd->poll_list
>>>>> $8 = (struct hmap *) 0x7fc1e9fba1c0
>>>>> (gdb) p &pmd->tnl_port_cache
>>>>> $9 = (struct hmap *) 0x7fc1e9fba200
>>>>> (gdb) p &pmd->stats_zero
>>>>> $10 = (unsigned long long (*)[5]) 0x7fc1e9fba240
>>>>>
>>>>> I tried using xzalloc_cacheline instead of the default xzalloc() here.
>>>>> I tried tens of times and always found that the address is
>>>>> 64 byte aligned and it should start at the beginning of cache line on X86.
>>>>> Not sure why the comment  " (The memory returned will not be at the
>>>>> start
>>>> of  a cache line, though, so don't assume such alignment.)" says otherwise?
>>>>
>>>> Yes, you will always get aligned addresses on your x86 Linux system
>>>> that supports the
>>>> posix_memalign() call. The comment says what it says because it will
>>>> resort to some memory allocation tricks in case posix_memalign() is not
>>>> available (Windows, some MacOS, maybe some Linux systems (not sure)),
>>>> and the address will not be aligned in this case.
>>>
>>> I also verified the other case when posix_memalign isn't available and
>>> even in that case it returns the address aligned on CACHE_LINE_SIZE
>>> boundary. I will send out a patch to use  xzalloc_cacheline for allocating the
>> memory.
>>>
>>>>
>>>>>
>>>>>>
>>>>>>       Can we fix it properly? NO.
>>>>>>       OVS currently doesn't have appropriate API to allocate aligned
>>>>>>       memory. The best candidate is 'xmalloc_cacheline()' but it
>>>>>>       clearly states that "The memory returned will not be at the
>>>>>>       start of a cache line, though, so don't assume such alignment".
>>>>>>       And also, this function will never return aligned memory on
>>>>>>       Windows or MacOS.
>>>>>>
>>>>>>    2. CACHE_LINE_SIZE is not constant. Different architectures have
>>>>>>       different cache line sizes, but the code assumes that
>>>>>>       CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
>>>>>>       members are grouped by 64 bytes and padded to CACHE_LINE_SIZE.
>>>>>>       This leads to a huge holes in a structures if CACHE_LINE_SIZE
>>>>>>       differs from 64. This is opposite to portability. If I want
>>>>>>       good performance of cmap I need to have CACHE_LINE_SIZE equal
>>>>>>       to the real cache line size, but I will have huge holes in the
>>>>>>       structures. If you'll take a look to struct rte_mbuf from DPDK
>>>>>>       you'll see that it uses 2 defines: RTE_CACHE_LINE_SIZE and
>>>>>>       RTE_CACHE_LINE_MIN_SIZE to avoid holes in mbuf structure.
>>>>>
>>>>> I understand that ARM and a few other processors (like OCTEON) have
>>>>> 128
>>>> bytes cache lines.
>>>>> But  again curious of performance impact in your case with this new
>>>> alignment.
>>>>>
>>>>>>
>>>>>>    3. Sizes of system/libc defined types are not constant for all the
>>>>>>       systems. For example, sizeof(pthread_mutex_t) == 48 on my
>>>>>>       ARMv8 machine, but only 40 on x86. The difference could be
>>>>>>       much bigger on Windows or MacOS systems. But the code assumes
>>>>>>       that sizeof(struct ovs_mutex) is always 48 bytes. This may lead
>>>>>>       to broken alignment/big holes in case of padding/wrong comments
>>>>>>       about amount of free pad bytes.
>>>>>
>>>>> This isn't an issue, as you have already mentioned, and is more
>>>> about
>>>> the comment that reads the pad bytes.
>>>>> In case of ARM it would be just 8 pad bytes instead of 16 on X86.
>>>>>
>>>>>         union {
>>>>>                 struct {
>>>>>                         struct ovs_mutex port_mutex;     /* 4849984    48 */
>>>>>                 };    /*          48 */
>>>>>                 uint8_t            pad13[64];            /*          64 */
>>>>>         };                                               /
>>>>>
>>>>
>>>> It's not only about 'port_mutex'. If you take a look at
>>>> 'flow_mutex', you will see that it is not even padded. So, increasing the size
>> of 'flow_mutex'
>>>> leads to shifting of all the following padded blocks and no other
>>>> blocks will be properly aligned even if the structure is allocated on
>> aligned memory.
>>>>
>>>>>>
>>>>>>    4. Sizes of the many fields in the structure depend on defines like
>>>>>>       DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so
>> on.
>>>>>>       Any change in these defines or any change in any structure
>>>>>>       contained by the thread should lead to a not-so-simple
>>>>>>       refactoring of the whole dp_netdev_pmd_thread structure. This
>>>>>>       greatly reduces maintainability and complicates development of
>>>>>>       new features.
>>>>>
>>>>> I don't think it complicates development and instead I feel the
>>>>> commit gives a clear indication to the developer that the members
>>>>> are grouped and
>>>> aligned and marked with cacheline markers.
>>>>> This makes the developer extra cautious when adding new members so
>>>>> that
>>>> holes can be avoided.
>>>>
>>>> Starting rebase of the output batching patch-set I figured out that I
>>>> need to remove 'unsigned long long last_cycles' and add 'struct
>>>> dp_netdev_pmd_thread_ctx ctx'
>>>> which is 8 bytes larger. Could you please suggest where I should
>>>> place that new structure member and what to do with the hole from
>> 'last_cycles'?
>>>>
>>>> This is not a trivial question, because the already poor grouping will
>>>> become worse almost anyway.
>>>
>>> Aah, realized now that the batching series doesn't cleanly apply on master.
>>> Let me check this and will send across the changes that should fix this.
>>>
>>> - Bhanuprakash
>>>
>>>>>
>>>>> Cacheline marking the structure is a good practice and I am sure
>>>>> this structure is significant and should be carefully extended in
>>>>> the future.
>>>>
>>>> Not so sure about that.
>>>>
>>>>>
>>>>>>
>>>>>>    5. There is no reason to align flow_cache member because it's
>>>>>>       too big and we usually access random entries by single thread
>>>>>>       only.
>>>>>>
>>>>>
>>>>> I see your point. This patch wasn't done for performance and
>>>>> instead more to have some order with this ever growing structure.
>>>>> During testing I found that for some test cases aligning the
>>>>> flow_cache was giving me 100k+ performance consistently, and so it
>>>>> was added.
>>>>
>>>> This was a random performance boost. You achieved it without aligned
>>>> memory allocation; it was just luck with your system environment.
>>>> Using xzalloc_cacheline will likely eliminate this performance
>>>> difference or even degrade the performance.
>>>>
>>>>>
>>>>>> So, the padding/alignment only creates an appearance of
>>>>>> performance optimization but does nothing useful in reality. It
>>>>>> only complicates maintenance and adds huge holes for non-x86
>>>>>> architectures and non-Linux systems. The performance improvement
>>>>>> stated in the original commit message is likely random and not
>>>>>> significant. I see no performance difference.
>>>>>
>>>>> I understand that this is causing issues with architectures having
>>>>> different cache line sizes, but unfortunately the majority of them
>>>>> have 64 byte cache lines, so this change makes sense.
>>>>
>>>> I understand that 64 byte cache lines are far more widespread. I
>>>> also have x86 as a target arch, but still, IMHO, OVS is a
>>>> cross-platform application and it should not have platform-dependent
>>>> stuff which makes one architecture/platform better and worsens
>>>> others.
>>>>
>>>>>
>>>>> If you have performance data to prove that this causes severe perf
>>>>> degradation, I can think of workarounds for ARM.
>>>>>
>>>>> - Bhanuprakash.
>>>>
>>>>
>>>> P.S.: If you want to test with CACHE_LINE_SIZE=128 you will have to
>>>> apply the following patch to avoid a build time assert (I'll send it
>>>> formally later):
>>>>
>>>> -----------------------------------------------------------------------------
>>>> diff --git a/lib/cmap.c b/lib/cmap.c
>>>> index 35decea..5b15ecd 100644
>>>> --- a/lib/cmap.c
>>>> +++ b/lib/cmap.c
>>>> @@ -123,12 +123,11 @@ COVERAGE_DEFINE(cmap_shrink);
>>>> /* Number of entries per bucket: 7 on 32-bit, 5 on 64-bit. */
>>>> #define CMAP_K ((CACHE_LINE_SIZE - 4) / CMAP_ENTRY_SIZE)
>>>>
>>>> -/* Pad to make a bucket a full cache line in size: 4 on 32-bit, 0 on 64-bit. */
>>>> -#define CMAP_PADDING ((CACHE_LINE_SIZE - 4) - (CMAP_K * CMAP_ENTRY_SIZE))
>>>> -
>>>> /* A cuckoo hash bucket.  Designed to be cache-aligned and exactly one cache
>>>>  * line long. */
>>>> struct cmap_bucket {
>>>> +    /* Padding to make cmap_bucket exactly one cache line long. */
>>>> +    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>     /* Allows readers to track in-progress changes.  Initially zero, each
>>>>      * writer increments this value just before and just after each change (see
>>>>      * cmap_set_bucket()).  Thus, a reader can ensure that it gets a consistent
>>>> @@ -145,11 +144,7 @@ struct cmap_bucket {
>>>>      * slots. */
>>>>     uint32_t hashes[CMAP_K];
>>>>     struct cmap_node nodes[CMAP_K];
>>>> -
>>>> -    /* Padding to make cmap_bucket exactly one cache line long. */
>>>> -#if CMAP_PADDING > 0
>>>> -    uint8_t pad[CMAP_PADDING];
>>>> -#endif
>>>> +    );
>>>> };
>>>> BUILD_ASSERT_DECL(sizeof(struct cmap_bucket) == CACHE_LINE_SIZE);
>>>>
>>>> diff --git a/lib/util.h b/lib/util.h
>>>> index 3c43c2c..514fdaa 100644
>>>> --- a/lib/util.h
>>>> +++ b/lib/util.h
>>>> @@ -61,7 +61,7 @@ struct Bad_arg_to_ARRAY_SIZE {
>>>>
>>>> /* This system's cache line size, in bytes.
>>>>  * Being wrong hurts performance but not correctness. */
>>>> -#define CACHE_LINE_SIZE 64
>>>> +#define CACHE_LINE_SIZE 128
>>>> BUILD_ASSERT_DECL(IS_POW2(CACHE_LINE_SIZE));
>>>>
>>>> /* Cacheline marking is typically done using zero-sized array.
>>>> -----------------------------------------------------------------------------
>>>>
>>>>
>>>> Best regards, Ilya Maximets.
>>>>
>>>>>
>>>>>>
>>>>>> Most of the above issues are also true for some other
>>>>>> padded/aligned structures like 'struct netdev_dpdk'. They will be
>> treated separately.
>>>>>>
>>>>>> CC: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
>>>>>> CC: Ben Pfaff <blp@ovn.org>
>>>>>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>>>>>> ---
>>>>>> lib/dpif-netdev.c | 160 +++++++++++++++++++++++-----------------------------
>>>>>> 1 file changed, 69 insertions(+), 91 deletions(-)
>>>>>>
>>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>>>> index 0a62630..6784269 100644
>>>>>> --- a/lib/dpif-netdev.c
>>>>>> +++ b/lib/dpif-netdev.c
>>>>>> @@ -547,31 +547,18 @@ struct tx_port {
>>>>>>  * actions in either case.
>>>>>>  * */
>>>>>> struct dp_netdev_pmd_thread {
>>>>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
>>>>>> -        struct dp_netdev *dp;
>>>>>> -        struct cmap_node node;          /* In 'dp->poll_threads'. */
>>>>>> -        pthread_cond_t cond;            /* For synchronizing pmd thread
>>>>>> -                                           reload. */
>>>>>> -    );
>>>>>> -
>>>>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>>>>>> -        struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>>>>> -        pthread_t thread;
>>>>>> -        unsigned core_id;               /* CPU core id of this pmd thread. */
>>>>>> -        int numa_id;                    /* numa node id of this pmd thread. */
>>>>>> -    );
>>>>>> +    struct dp_netdev *dp;
>>>>>> +    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>>>>>> +    struct cmap_node node;          /* In 'dp->poll_threads'. */
>>>>>> +
>>>>>> +    pthread_cond_t cond;            /* For synchronizing pmd thread reload. */
>>>>>> +    struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>>>>>
>>>>>>     /* Per thread exact-match cache.  Note, the instance for cpu core
>>>>>>      * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
>>>>>>      * need to be protected by 'non_pmd_mutex'.  Every other instance
>>>>>>      * will only be accessed by its own pmd thread. */
>>>>>> -    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
>>>>>> -    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>>>>>> -
>>>>>> -    /* Queue id used by this pmd thread to send packets on all netdevs if
>>>>>> -     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>>>>> -     * than 'cmap_count(dp->poll_threads)'. */
>>>>>> -    uint32_t static_tx_qid;
>>>>>> +    struct emc_cache flow_cache;
>>>>>>
>>>>>>     /* Flow-Table and classifiers
>>>>>>      *
>>>>>> @@ -580,77 +567,68 @@ struct dp_netdev_pmd_thread {
>>>>>>      * 'flow_mutex'.
>>>>>>      */
>>>>>>     struct ovs_mutex flow_mutex;
>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>> -        struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>>>> -
>>>>>> -        /* One classifier per in_port polled by the pmd */
>>>>>> -        struct cmap classifiers;
>>>>>> -        /* Periodically sort subtable vectors according to hit frequencies */
>>>>>> -        long long int next_optimization;
>>>>>> -        /* End of the next time interval for which processing cycles
>>>>>> -           are stored for each polled rxq. */
>>>>>> -        long long int rxq_next_cycle_store;
>>>>>> -
>>>>>> -        /* Cycles counters */
>>>>>> -        struct dp_netdev_pmd_cycles cycles;
>>>>>> -
>>>>>> -        /* Used to count cycles. See 'cycles_counter_end()'. */
>>>>>> -        unsigned long long last_cycles;
>>>>>> -        struct latch exit_latch;        /* For terminating the pmd thread. */
>>>>>> -    );
>>>>>> -
>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>> -        /* Statistics. */
>>>>>> -        struct dp_netdev_pmd_stats stats;
>>>>>> -
>>>>>> -        struct seq *reload_seq;
>>>>>> -        uint64_t last_reload_seq;
>>>>>> -        atomic_bool reload;             /* Do we need to reload ports? */
>>>>>> -        bool isolated;
>>>>>> -
>>>>>> -        /* Set to true if the pmd thread needs to be reloaded. */
>>>>>> -        bool need_reload;
>>>>>> -        /* 5 pad bytes. */
>>>>>> -    );
>>>>>> -
>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>> -        struct ovs_mutex port_mutex;    /* Mutex for 'poll_list'
>>>>>> -                                           and 'tx_ports'. */
>>>>>> -        /* 16 pad bytes. */
>>>>>> -    );
>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>> -        /* List of rx queues to poll. */
>>>>>> -        struct hmap poll_list OVS_GUARDED;
>>>>>> -        /* Map of 'tx_port's used for transmission.  Written by the main
>>>>>> -         * thread, read by the pmd thread. */
>>>>>> -        struct hmap tx_ports OVS_GUARDED;
>>>>>> -    );
>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>> -        /* These are thread-local copies of 'tx_ports'.  One contains only
>>>>>> -         * tunnel ports (that support push_tunnel/pop_tunnel), the other
>>>>>> -         * contains ports with at least one txq (that support send).
>>>>>> -         * A port can be in both.
>>>>>> -         *
>>>>>> -         * There are two separate maps to make sure that we don't try to
>>>>>> -         * execute OUTPUT on a device which has 0 txqs or PUSH/POP on a
>>>>>> -         * non-tunnel device.
>>>>>> -         *
>>>>>> -         * The instances for cpu core NON_PMD_CORE_ID can be accessed by
>>>>>> -         * multiple threads and thusly need to be protected by 'non_pmd_mutex'.
>>>>>> -         * Every other instance will only be accessed by its own pmd thread. */
>>>>>> -        struct hmap tnl_port_cache;
>>>>>> -        struct hmap send_port_cache;
>>>>>> -    );
>>>>>> -
>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>> -        /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>>>>> -         * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>>>>> -         * values and subtracts them from 'stats' and 'cycles' before
>>>>>> -         * reporting to the user */
>>>>>> -        unsigned long long stats_zero[DP_N_STATS];
>>>>>> -        uint64_t cycles_zero[PMD_N_CYCLES];
>>>>>> -        /* 8 pad bytes. */
>>>>>> -    );
>>>>>> +    struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>>>> +
>>>>>> +    /* One classifier per in_port polled by the pmd */
>>>>>> +    struct cmap classifiers;
>>>>>> +    /* Periodically sort subtable vectors according to hit frequencies */
>>>>>> +    long long int next_optimization;
>>>>>> +    /* End of the next time interval for which processing cycles
>>>>>> +       are stored for each polled rxq. */
>>>>>> +    long long int rxq_next_cycle_store;
>>>>>> +
>>>>>> +    /* Statistics. */
>>>>>> +    struct dp_netdev_pmd_stats stats;
>>>>>> +
>>>>>> +    /* Cycles counters */
>>>>>> +    struct dp_netdev_pmd_cycles cycles;
>>>>>> +
>>>>>> +    /* Used to count cicles. See 'cycles_counter_end()' */
>>>>>> +    unsigned long long last_cycles;
>>>>>> +
>>>>>> +    struct latch exit_latch;        /* For terminating the pmd thread. */
>>>>>> +    struct seq *reload_seq;
>>>>>> +    uint64_t last_reload_seq;
>>>>>> +    atomic_bool reload;             /* Do we need to reload ports? */
>>>>>> +    pthread_t thread;
>>>>>> +    unsigned core_id;               /* CPU core id of this pmd thread. */
>>>>>> +    int numa_id;                    /* numa node id of this pmd thread. */
>>>>>> +    bool isolated;
>>>>>> +
>>>>>> +    /* Queue id used by this pmd thread to send packets on all netdevs if
>>>>>> +     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>>>>> +     * than 'cmap_count(dp->poll_threads)'. */
>>>>>> +    uint32_t static_tx_qid;
>>>>>> +
>>>>>> +    struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and 'tx_ports'. */
>>>>>> +    /* List of rx queues to poll. */
>>>>>> +    struct hmap poll_list OVS_GUARDED;
>>>>>> +    /* Map of 'tx_port's used for transmission.  Written by the main thread,
>>>>>> +     * read by the pmd thread. */
>>>>>> +    struct hmap tx_ports OVS_GUARDED;
>>>>>> +
>>>>>> +    /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
>>>>>> +     * ports (that support push_tunnel/pop_tunnel), the other contains ports
>>>>>> +     * with at least one txq (that support send).  A port can be in both.
>>>>>> +     *
>>>>>> +     * There are two separate maps to make sure that we don't try to execute
>>>>>> +     * OUTPUT on a device which has 0 txqs or PUSH/POP on a non-tunnel device.
>>>>>> +     *
>>>>>> +     * The instances for cpu core NON_PMD_CORE_ID can be accessed by multiple
>>>>>> +     * threads, and thusly need to be protected by 'non_pmd_mutex'.  Every
>>>>>> +     * other instance will only be accessed by its own pmd thread. */
>>>>>> +    struct hmap tnl_port_cache;
>>>>>> +    struct hmap send_port_cache;
>>>>>> +
>>>>>> +    /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>>>>> +     * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>>>>> +     * values and subtracts them from 'stats' and 'cycles' before
>>>>>> +     * reporting to the user */
>>>>>> +    unsigned long long stats_zero[DP_N_STATS];
>>>>>> +    uint64_t cycles_zero[PMD_N_CYCLES];
>>>>>> +
>>>>>> +    /* Set to true if the pmd thread needs to be reloaded. */
>>>>>> +    bool need_reload;
>>>>>> };
>>>>>>
>>>>>> /* Interface to netdev-based datapath. */
>>>>>> --
>>>>>> 2.7.4
>>>>>
>>>>>
>>>>>
>>>>>
>>> _______________________________________________
>>> dev mailing list
>>> dev@openvswitch.org
>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> 
> 
>
Ilya Maximets Nov. 28, 2017, 1:22 p.m. | #12
Hello, Ben.
Thanks for looking at this.

On 28.11.2017 01:54, Ben Pfaff wrote:
> On Mon, Nov 27, 2017 at 02:59:29PM +0300, Ilya Maximets wrote:
>>> I also verified the other case when posix_memalign isn't available and even in that case
>>> it returns the address aligned on CACHE_LINE_SIZE boundary. I will send out a patch to use
>>>  xzalloc_cacheline for allocating the memory.
>>
>> I don't know how you tested this, because it is impossible:
>>
>> 	1. OVS allocates some memory:
>> 		base = xmalloc(...);
>>
>> 	2. Rounds it up to the cache line start:
>> 		payload = (void **) ROUND_UP((uintptr_t) base, CACHE_LINE_SIZE);
>>
>> 	3. Returns the pointer increased by 8 bytes:
>> 		return (char *) payload + MEM_ALIGN;
>>
>> So, unless you redefined MEM_ALIGN to zero, you will never get aligned memory
>> address while allocating by xmalloc_cacheline() on system without posix_memalign().
> 
> Maybe we should make the non-HAVE_POSIX_MEMALIGN case better.  Something
> like this (compile tested only):

That's interesting, but the question here is a bit different.
We definitely can apply 5-10 more patches to fix this unaligned memory
allocation, the paddings, and the holes in case of different cache line /
mutex sizes, plus whatever else we forget to mention, but *do we need this*?
Why are we making big efforts to make this work as it should?

Padding/alignment of struct dp_netdev_pmd_thread:

- Cons:
	* Code complication.
	* Worse grouping of structure members.
	  (Compared with what we could have without padding/alignment restrictions)
	* Poor extensibility: even small struct modifications may require regrouping
	  and rearranging of many other structure members. For example:
		https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341230.html
	* No measurable performance impact. (At least, all the results I have seen
	  show close performance between the 'with' and 'without' padding/alignment
	  cases)
	* Broken now: a few additional patches are required to make it work as
	  intended (aligned memory allocation).
	* Targeted at optimizing x86 only. Doesn't account for different cache line
	  sizes or sizes of system/libc-defined structures on different platforms.
	  (A few more patches are required to fix this)

- Pros:
	* Didn't find any.

I consider the previously applied structure refactoring to be over-engineering
that doesn't have any positive impact.

---

Besides that, looking at the code below:
Do we care about false sharing on alloc/free operations? That is the difference
between the old and new implementations. The new one will write and read bytes
possibly placed on a shared cacheline on alloc/free.

Best regards, Ilya Maximets.

> 
> diff --git a/lib/util.c b/lib/util.c
> index 9e6edd27ae4c..137091a3cd4f 100644
> --- a/lib/util.c
> +++ b/lib/util.c
> @@ -196,15 +196,9 @@ x2nrealloc(void *p, size_t *n, size_t s)
>      return xrealloc(p, *n * s);
>  }
>  
> -/* The desired minimum alignment for an allocated block of memory. */
> -#define MEM_ALIGN MAX(sizeof(void *), 8)
> -BUILD_ASSERT_DECL(IS_POW2(MEM_ALIGN));
> -BUILD_ASSERT_DECL(CACHE_LINE_SIZE >= MEM_ALIGN);
> -
> -/* Allocates and returns 'size' bytes of memory in dedicated cache lines.  That
> - * is, the memory block returned will not share a cache line with other data,
> - * avoiding "false sharing".  (The memory returned will not be at the start of
> - * a cache line, though, so don't assume such alignment.)
> +/* Allocates and returns 'size' bytes of memory aligned to a cache line and in
> + * dedicated cache lines.  That is, the memory block returned will not share a
> + * cache line with other data, avoiding "false sharing".
>   *
>   * Use free_cacheline() to free the returned memory block. */
>  void *
> @@ -221,28 +215,37 @@ xmalloc_cacheline(size_t size)
>      }
>      return p;
>  #else
> -    void **payload;
> -    void *base;
> -
>      /* Allocate room for:
>       *
> -     *     - Up to CACHE_LINE_SIZE - 1 bytes before the payload, so that the
> -     *       start of the payload doesn't potentially share a cache line.
> +     *     - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to allow the
> +     *       pointer to be aligned exactly sizeof(void *) bytes before the
> +     *       beginning of a cache line.
>       *
> -     *     - A payload consisting of a void *, followed by padding out to
> -     *       MEM_ALIGN bytes, followed by 'size' bytes of user data.
> +     *     - Pointer: A pointer to the start of the header padding, to allow us
> +     *       to free() the block later.
>       *
> -     *     - Space following the payload up to the end of the cache line, so
> -     *       that the end of the payload doesn't potentially share a cache line
> -     *       with some following block. */
> -    base = xmalloc((CACHE_LINE_SIZE - 1)
> -                   + ROUND_UP(MEM_ALIGN + size, CACHE_LINE_SIZE));
> -
> -    /* Locate the payload and store a pointer to the base at the beginning. */
> -    payload = (void **) ROUND_UP((uintptr_t) base, CACHE_LINE_SIZE);
> -    *payload = base;
> -
> -    return (char *) payload + MEM_ALIGN;
> +     *     - User data: 'size' bytes.
> +     *
> +     *     - Trailer padding: Enough to bring the user data up to a cache line
> +     *       multiple.
> +     *
> +     * +---------------+---------+------------------------+---------+
> +     * | header        | pointer | user data              | trailer |
> +     * +---------------+---------+------------------------+---------+
> +     * ^               ^         ^
> +     * |               |         |
> +     * p               q         r
> +     *
> +     */
> +    void *p = xmalloc((CACHE_LINE_SIZE - 1)
> +                      + sizeof(void *)
> +                      + ROUND_UP(size, CACHE_LINE_SIZE));
> +    bool runt = PAD_SIZE((uintptr_t) p, CACHE_LINE_SIZE) < sizeof(void *);
> +    void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? CACHE_LINE_SIZE : 0),
> +                                CACHE_LINE_SIZE);
> +    void **q = (void **) r - 1;
> +    *q = p;
> +    return r;
>  #endif
>  }
>  
> @@ -265,7 +268,8 @@ free_cacheline(void *p)
>      free(p);
>  #else
>      if (p) {
> -        free(*(void **) ((uintptr_t) p - MEM_ALIGN));
> +        void **q = (void **) p - 1;
> +        free(*q);
>      }
>  #endif
>  }
> 
> 
>
Ben Pfaff Nov. 28, 2017, 3:19 p.m. | #13
On Tue, Nov 28, 2017 at 04:22:46PM +0300, Ilya Maximets wrote:
> On 28.11.2017 01:54, Ben Pfaff wrote:
> > On Mon, Nov 27, 2017 at 02:59:29PM +0300, Ilya Maximets wrote:
> >>> I also verified the other case when posix_memalign isn't available and even in that case
> >>> it returns the address aligned on CACHE_LINE_SIZE boundary. I will send out a patch to use
> >>>  xzalloc_cacheline for allocating the memory.
> >>
> >> I don't know how you tested this, because it is impossible:
> >>
> >> 	1. OVS allocates some memory:
> >> 		base = xmalloc(...);
> >>
> >> 	2. Rounds it up to the cache line start:
> >> 		payload = (void **) ROUND_UP((uintptr_t) base, CACHE_LINE_SIZE);
> >>
> >> 	3. Returns the pointer increased by 8 bytes:
> >> 		return (char *) payload + MEM_ALIGN;
> >>
> >> So, unless you redefined MEM_ALIGN to zero, you will never get aligned memory
> >> address while allocating by xmalloc_cacheline() on system without posix_memalign().
> > 
> > Maybe we should make the non-HAVE_POSIX_MEMALIGN case better.  Something
> > like this (compile tested only);
> 
> That's interesting, but the question here is a bit different.
> We definitely can apply 5-10 more patches to fix this unaligned memory allocation,
> paddings, holes in case of different cache line sizes / mutex sizes, whatever else
> we forget to mention, but *do we need this*? Why we're making big efforts to make
> this work as it should?

For what it's worth, I was addressing the xmalloc_cacheline() code
independent of your comments on the PMD code.  I don't know the PMD code
well enough to take a position.  I'll let others comment on that.

> Beside that, looking at the code below:
> Do we care about false sharing on alloc/free operations? That is the difference between
> old and new implementations. The new one will write and read bytes possibly placed
> on a shared cacheline on alloc/free.

I'd say that the value of the change that I'm proposing is that it
allows programmers to design their structures knowing that the first
cacheline ends after 64 bytes, not after 64-8 bytes in some cases.
Bodireddy, Bhanuprakash Nov. 28, 2017, 3:42 p.m. | #14
>
>> Analyzing the memory layout with gdb for large structures is time consuming
>> and not usually recommended.
>> I would suggest using Poke-a-hole (pahole), which helps to understand
>> and fix the structures in no time.
>> With pahole it's going to be a lot easier to work with large structures
>> especially.
>
>Thanks for the pointer. I'll have a look at pahole.
>It doesn't affect my reasoning against optimizing the compactification of struct
>dp_netdev_pmd_thread, though.
>
>> >Finally, even for x86 there is not even a performance improvement. I
>> >re-ran our standard L3VPN over VXLAN performance PVP test on master
>> >and with Ilya's revert patch:
>> >
>> >Flows   master  reverted
>> >8,      4.46    4.48
>> >100,    4.27    4.29
>> >1000,   4.07    4.07
>> >2000,   3.68    3.68
>> >5000,   3.03    3.03
>> >10000,  2.76    2.77
>> >20000,  2.64    2.65
>> >50000,  2.60    2.61
>> >100000, 2.60    2.61
>> >500000, 2.60    2.61
>>
>> What are the CFLAGS in this case, as they seem to make a difference. I
>> have added my findings here for a different patch targeted at performance:
>>
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341270.html
>
>I'm compiling with "-O3 -msse4.2" to be in line with production deployments
>of OVS-DPDK that need to run on a wider family of Xeon generations.

Thanks for this. AFAIK, by specifying '-msse4.2' alone you don't get to use
the builtin popcount. One way to enable it is to use '-mpopcnt' in CFLAGS or
to build with -march=native.

(This is slightly out of context for this thread and JFYI. Ignore this if you
only want to use intrinsics and not the builtin popcount.)

>
>>
>> Patches to consider when testing your use case:
>>      Xzalloc_cachline:  https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341231.html
>>      (If using output batching)      https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341230.html
>
>I didn't use these. Tx batching is not relevant here. And I understand the
>xzalloc_cacheline patch alone does not guarantee that the allocated memory
>is indeed cache line-aligned.

At least with POSIX_MEMALIGN, the address will be aligned on 64 bytes and
start at a CACHE_LINE_SIZE boundary.
I am yet to check Ben's new patch and test it.

- Bhanuprakash.

>
>Thx, Jan
Ben Pfaff Nov. 28, 2017, 6:29 p.m. | #15
On Tue, Nov 28, 2017 at 07:19:26AM -0800, Ben Pfaff wrote:
> On Tue, Nov 28, 2017 at 04:22:46PM +0300, Ilya Maximets wrote:
> > On 28.11.2017 01:54, Ben Pfaff wrote:
> > > On Mon, Nov 27, 2017 at 02:59:29PM +0300, Ilya Maximets wrote:
> > >>> I also verified the other case when posix_memalign isn't available and even in that case
> > >>> it returns the address aligned on CACHE_LINE_SIZE boundary. I will send out a patch to use
> > >>>  xzalloc_cacheline for allocating the memory.
> > >>
> > >> I don't know how you tested this, because it is impossible:
> > >>
> > >> 	1. OVS allocates some memory:
> > >> 		base = xmalloc(...);
> > >>
> > >> 	2. Rounds it up to the cache line start:
> > >> 		payload = (void **) ROUND_UP((uintptr_t) base, CACHE_LINE_SIZE);
> > >>
> > >> 	3. Returns the pointer increased by 8 bytes:
> > >> 		return (char *) payload + MEM_ALIGN;
> > >>
> > >> So, unless you redefined MEM_ALIGN to zero, you will never get aligned memory
> > >> address while allocating by xmalloc_cacheline() on system without posix_memalign().
> > > 
> > > Maybe we should make the non-HAVE_POSIX_MEMALIGN case better.  Something
> > > like this (compile tested only);
> > 
> > That's interesting, but the question here is a bit different.
> > We definitely can apply 5-10 more patches to fix this unaligned memory allocation,
> > paddings, holes in case of different cache line sizes / mutex sizes, whatever else
> > we forget to mention, but *do we need this*? Why we're making big efforts to make
> > this work as it should?
> 
> For what it's worth, I was addressing the xmalloc_cacheline() code
> independent of your comments on the PMD code.  I don't know the PMD code
> well enough to take a position.  I'll let others comment on that.
> 
> > Beside that, looking at the code below:
> > Do we care about false sharing on alloc/free operations? That is the difference between
> > old and new implementations. The new one will write and read bytes possibly placed
> > on a shared cacheline on alloc/free.
> 
> I'd say that the value of the change that I'm proposing is that it
> allows programmers to design their structures knowing that the first
> cacheline ends after 64 bytes, not after 64-8 bytes in some cases.

Anyway, I sent out the patch officially.  I haven't tested it beyond
compile-testing:
        https://patchwork.ozlabs.org/patch/842251/
Bodireddy, Bhanuprakash Nov. 28, 2017, 9:40 p.m. | #16
>On 27.11.2017 20:02, Bodireddy, Bhanuprakash wrote:

>>> I agree with Ilya here. Adding these cache line markers and
>>> re-grouping variables to minimize gaps in cache lines is creating a
>>> maintenance burden without any tangible benefit. I have had to go
>>> through the pain of refactoring my PMD Performance Metrics patch to
>>> the new dp_netdev_pmd_thread struct and spent a lot of time to
>>> analyze the actual memory layout with GDB and play Tetris with the
>>> variables.
>>
>> Analyzing the memory layout with gdb for large structures is time consuming
>> and not usually recommended.
>> I would suggest using Poke-a-hole (pahole), which helps to understand
>> and fix the structures in no time.
>> With pahole it's going to be a lot easier to work with large structures
>> especially.
>
>Interesting tool, but it seems it doesn't work perfectly. I see duplicated
>unions and zero length arrays in the output and I still need to check sizes
>by hand. And it fails trying to run on my ubuntu 16.04 LTS on x86.
>IMHO, the code should be simple enough to not need external utilities when
>you need to check a single structure.

Pahole has been there for a while, is available with most distributions, and
works reliably on RHEL distros. I am on fedora and built pahole from sources
and it displays all the sizes and cacheline boundaries.

>
>>>
>>> There will never be more than a handful of PMDs, so minimizing the
>>> gaps does not matter from a memory perspective. And whether the
>>> individual members occupy 4 or 5 cache lines does not matter either
>>> compared to the many hundred cache lines touched for EMC and DPCLS
>>> lookups of an Rx batch. And any optimization done for x86 is not
>>> necessarily optimal for other architectures.
>>
>> I agree that optimization targeted for x86 doesn't necessarily suit ARM due
>> to its different cache line size.
>>
>>>
>>> Finally, even for x86 there is not even a performance improvement. I
>>> re-ran our standard L3VPN over VXLAN performance PVP test on master
>>> and with Ilya's revert patch:
>>>
>>> Flows   master  reverted
>>> 8,      4.46    4.48
>>> 100,    4.27    4.29
>>> 1000,   4.07    4.07
>>> 2000,   3.68    3.68
>>> 5000,   3.03    3.03
>>> 10000,  2.76    2.77
>>> 20000,  2.64    2.65
>>> 50000,  2.60    2.61
>>> 100000, 2.60    2.61
>>> 500000, 2.60    2.61
>>
>> What are the CFLAGS in this case, as they seem to make a difference. I have
>> added my findings here for a different patch targeted at performance:
>>
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341270.html
>
>Do you have any performance results that show a significant performance
>difference between the above cases? Please describe your test scenario and
>environment so we'll be able to see that padding/alignment is really needed
>here. I saw no such results yet.
>
>BTW, at one place you're saying that the patch was not about performance; at
>the same time you're trying to show that it has some positive performance
>impact. I'm a bit confused by that.

Giving a bit more context here, I have been experimenting with *prefetching* in
OvS, as prefetching isn't used except in 2 instances (emc_processing & cmaps).
This work is aimed at checking the performance benefits of prefetching not just
on Haswell but also on newer ranges of processors.

The best way to prefetch part of a structure is to mark the portion of it.
This isn't possible unless we have some kind of cache line marking. This is
what my patch initially did, and then we can prefetch portions of it based on
cacheline markers. You can find an example in the pkt_metadata struct.

My point is that on x86, with cache line marking and the xzalloc_cacheline()
API, one shouldn't see a drop in performance, if not an improvement. But the
real improvements would be seen when the prefetching is done at the right
places, and that's WIP.

Bhanuprakash.

>
>>
>> Patches to consider when testing your use case:
>>      Xzalloc_cachline:  https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>November/341231.html
>>      (If using output batching)      https://mail.openvswitch.org/pipermail/ovs-
>dev/2017-November/341230.html
>>
>> - Bhanuprakash.
>>
>>>
>>> All in all, I support reverting this change.
>>>
>>> Regards, Jan
>>>
>>> Acked-by: Jan Scheurich <jan.scheurich@ericsson.com>
>>>
>>>> -----Original Message-----
>>>> From: ovs-dev-bounces@openvswitch.org
>>>> [mailto:ovs-dev-bounces@openvswitch.org] On Behalf Of Bodireddy,
>>>> Bhanuprakash
>>>> Sent: Friday, 24 November, 2017 17:09
>>>> To: Ilya Maximets <i.maximets@samsung.com>; ovs-
>dev@openvswitch.org;
>>>> Ben Pfaff <blp@ovn.org>
>>>> Cc: Heetae Ahn <heetae82.ahn@samsung.com>
>>>> Subject: Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor
>>> dp_netdev_pmd_thread structure."
>>>>
>>>>> On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>>>>>>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>>>>>>>
>>>>>>> Padding and aligning of dp_netdev_pmd_thread structure members
>is
>>>>>>> useless, broken in a several ways and only greatly degrades
>>>>>>> maintainability and extensibility of the structure.
>>>>>>
>>>>>> The idea of my earlier patch was to mark the cache lines and
>>>>>> reduce the
>>>>> holes while still maintaining the grouping of related members in
>>>>> this
>>> structure.
>>>>>
>>>>> Some of the grouping aspects looks strange. For example, it looks
>>>>> illogical that 'exit_latch' is grouped with 'flow_table' but not
>>>>> the 'reload_seq' and other reload related stuff. It looks strange
>>>>> that statistics and counters spread across different groups. So,
>>>>> IMHO, it's not
>>> well grouped.
>>>>
>>>> I had to strike a fine balance and some members may be placed in a
>>>> different group due to their sizes and importance. Let me think if I
>>>> can make
>>> it better.
>>>>
>>>>>
>>>>>> Also cache line marking is a good practice to make some one extra
>>>>>> cautious
>>>>> when extending or editing important structures .
>>>>>> Most importantly I was experimenting with prefetching on this
>>>>>> structure
>>>>> and needed cache line markers for it.
>>>>>>
>>>>>> I see that you are on ARM (I don't have HW to test) and want to
>>>>>> know if this
>>>>> commit has any negative affect and any numbers would be appreciated.
>>>>>
>>>>> Basic VM-VM testing shows stable 0.5% perfromance improvement with
>>>>> revert applied.
>>>>
>>>> I did P2P, PVP and PVVP with IXIA and haven't noticed any drop on X86.
>>>>
>>>>> Padding adds 560 additional bytes of holes.
>>>> As the cache line in ARM is 128 , it created holes, I can find a
>>>> workaround to
>>> handle this.
>>>>
>>>>>
>>>>>> More comments inline.
>>>>>>
>>>>>>>
>>>>>>> Issues:
>>>>>>>
>>>>>>>    1. It's not working because all the instances of struct
>>>>>>>       dp_netdev_pmd_thread allocated only by usual malloc. All the
>>>>>>>       memory is not aligned to cachelines -> structure almost never
>>>>>>>       starts at aligned memory address. This means that any further
>>>>>>>       paddings and alignments inside the structure are completely
>>>>>>>       useless. Fo example:
>>>>>>>
>>>>>>>       Breakpoint 1, pmd_thread_main
>>>>>>>       (gdb) p pmd
>>>>>>>       $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>>>>>>>       (gdb) p &pmd->cacheline1
>>>>>>>       $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>>>>>>>       (gdb) p &pmd->cacheline0
>>>>>>>       $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>>>>>>>       (gdb) p &pmd->flow_cache
>>>>>>>       $53 = (struct emc_cache *) 0x1b1afe0
>>>>>>>
>>>>>>>       All of the above addresses shifted from cacheline start by 32B.
>>>>>>
>>>>>> If you see below all the addresses are 64 byte aligned.
>>>>>>
>>>>>> (gdb) p pmd
>>>>>> $1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
>>>>>> (gdb) p &pmd->cacheline0
>>>>>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
>>>>>> (gdb) p &pmd->cacheline1
>>>>>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
>>>>>> (gdb) p &pmd->flow_cache
>>>>>> $4 = (struct emc_cache *) 0x7fc1e9b1a0c0
>>>>>> (gdb) p &pmd->flow_table
>>>>>> $5 = (struct cmap *) 0x7fc1e9fba100
>>>>>> (gdb) p &pmd->stats
>>>>>> $6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
>>>>>> (gdb) p &pmd->port_mutex
>>>>>> $7 = (struct ovs_mutex *) 0x7fc1e9fba180
>>>>>> (gdb) p &pmd->poll_list
>>>>>> $8 = (struct hmap *) 0x7fc1e9fba1c0
>>>>>> (gdb) p &pmd->tnl_port_cache
>>>>>> $9 = (struct hmap *) 0x7fc1e9fba200
>>>>>> (gdb) p &pmd->stats_zero
>>>>>> $10 = (unsigned long long (*)[5]) 0x7fc1e9fba240
>>>>>>
>>>>>> I tried using xzalloc_cacheline instead of default xzalloc() here.
>>>>>> I tried tens of times and always found that the address is
>>>>>> 64 byte aligned and it should start at the beginning of cache line on X86.
>>>>>> Not sure why the comment  " (The memory returned will not be at
>>>>>> the start
>>>>> of  a cache line, though, so don't assume such alignment.)" says
>otherwise?
>>>>>
>>>>> Yes, you will always get aligned addressess on your x86 Linux
>>>>> system that supports
>>>>> posix_memalign() call. The comment says what it says because it
>>>>> will make some memory allocation tricks in case posix_memalign() is
>>>>> not available (Windows, some MacOS, maybe some Linux systems (not
>>>>> sure)) and the address will not be aligned it this case.
>>>>
>>>> I also verified the other case when posix_memalign isn't available
>>>> and even in that case it returns the address aligned on
>>>> CACHE_LINE_SIZE boundary. I will send out a patch to use
>>>> xzalloc_cacheline for allocating the
>>> memory.
>>>>

>>>>>
>>>>>>
>>>>>>>
>>>>>>>       Can we fix it properly? NO.
>>>>>>>       OVS currently doesn't have appropriate API to allocate aligned
>>>>>>>       memory. The best candidate is 'xmalloc_cacheline()' but it
>>>>>>>       clearly states that "The memory returned will not be at the
>>>>>>>       start of a cache line, though, so don't assume such alignment".
>>>>>>>       And also, this function will never return aligned memory on
>>>>>>>       Windows or MacOS.
>>>>>>>
>>>>>>>    2. CACHE_LINE_SIZE is not constant. Different architectures have
>>>>>>>       different cache line sizes, but the code assumes that
>>>>>>>       CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
>>>>>>>       members are grouped by 64 bytes and padded to
>CACHE_LINE_SIZE.
>>>>>>>       This leads to a huge holes in a structures if CACHE_LINE_SIZE
>>>>>>>       differs from 64. This is opposite to portability. If I want
>>>>>>>       good performance of cmap I need to have CACHE_LINE_SIZE equal
>>>>>>>       to the real cache line size, but I will have huge holes in the
>>>>>>>       structures. If you'll take a look to struct rte_mbuf from DPDK
>>>>>>>       you'll see that it uses 2 defines: RTE_CACHE_LINE_SIZE and
>>>>>>>       RTE_CACHE_LINE_MIN_SIZE to avoid holes in mbuf structure.
>>>>>>
>>>>>> I understand that ARM and few other processors (like OCTEON) have
>>>>>> 128
>>>>> bytes cache lines.
>>>>>> But  again curious of performance impact in your case with this
>>>>>> new
>>>>> alignment.
>>>>>>
>>>>>>>
>>>>>>>    3. Sizes of system/libc defined types are not constant for all the
>>>>>>>       systems. For example, sizeof(pthread_mutex_t) == 48 on my
>>>>>>>       ARMv8 machine, but only 40 on x86. The difference could be
>>>>>>>       much bigger on Windows or MacOS systems. But the code
>assumes
>>>>>>>       that sizeof(struct ovs_mutex) is always 48 bytes. This may lead
>>>>>>>       to broken alignment/big holes in case of padding/wrong
>comments
>>>>>>>       about amount of free pad bytes.
>>>>>>
>>>>>> This isn't an issue as you would have already mentioned and more
>>>>>> about
>>>>> issue with the comment that reads the pad bytes.
>>>>>> In case of ARM it would be just 8 pad bytes instead of 16 on X86.
>>>>>>
>>>>>>         union {
>>>>>>                 struct {
>>>>>>                         struct ovs_mutex port_mutex;     /* 4849984    48 */
>>>>>>                 };    /*          48 */
>>>>>>                 uint8_t            pad13[64];            /*          64 */
>>>>>>         };                                               /
>>>>>>

>>>>>
>>>>> It's not only about 'port_mutex'. If you'll take a look at
>>>>> 'flow_mutex', you will see, that it even not padded. So, increasing
>>>>> the size
>>> of 'flow_mutex'
>>>>> leads to shifting of all the following padded blocks and no other
>>>>> blocks will be properly aligned even if the structure allocated on
>>>>> the
>>> aligned memory.
>>>>>
>>>>>>>
>>>>>>>    4. Sizes of the many fileds in structure depends on defines like
>>>>>>>       DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so
>>> on.
>>>>>>>       Any change in these defines or any change in any structure
>>>>>>>       contained by thread should lead to the not so simple
>>>>>>>       refactoring of the whole dp_netdev_pmd_thread structure. This
>>>>>>>       greatly reduces maintainability and complicates development of
>>>>>>>       a new features.
>>>>>>
>>>>>> I don't think it complicates development and instead I feel the
>>>>>> commit gives a clear indication to the developer that the members
>>>>>> are grouped and
>>>>> aligned and marked with cacheline markers.
>>>>>> This makes the developer extra cautious when adding new members
>so
>>>>>> that
>>>>> holes can be avoided.
>>>>>
>>>>> Starting rebase of the output batching patch-set I figured out that
>>>>> I need to remove 'unsigned long long last_cycles' and add 'struct
>>>>> dp_netdev_pmd_thread_ctx ctx'
>>>>> which is 8 bytes larger. Could you, please, suggest me where should
>>>>> I place that new structure member and what to do with a hole from
>>> 'last_cycles'?
>>>>>
>>>>> This is not a trivial question, because already poor grouping will
>>>>> become worse almost anyway.
>>>>
>>>> Aah, realized now that the batching series doesn't cleanly apply on
>master.
>>>> Let me check this and will send across the changes that should fix this.
>>>>
>>>> - Bhanuprakash
>>>>
>>>>>>
>>>>>> Cacheline marking the structure is a good practice and I am sure
>>>>>> this structure is significant and should be carefully extended in
>>>>>> the
>>> future.
>>>>>
>>>>> Not so sure about that.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>    5. There is no reason to align flow_cache member because it's
>>>>>>>       too big and we usually access random entries by single thread
>>>>>>>       only.
>>>>>>>
>>>>>>
>>>>>> I see your point. This patch wasn't done for performance and
>>>>>> instead more to have some order with this ever growing structure.
>>>>>> During testing I found that for some test cases aligning the
>>>>>> flow_cache was giving
>>>>> me 100k+ performance consistently and so was added.
>>>>>
>>>>> This was a random performance boost. You achieved it without
>>>>> aligned memory allocation.
>>>>> Just a luck with you system environment. Using of mzalloc_cacheline
>>>>> will likely eliminate this performance difference or even degrade
>>>>> the
>>> performance.
>>>>>
>>>>>>
>>>>>>> So, the padding/alignment only creates some visibility of
>>>>>>> performance optimization but does nothing useful in reality. It
>>>>>>> only complicates maintenance and adds huge holes for non-x86
>>>>>>> architectures and non-Linux systems. Performance improvement
>>>>>>> stated in a original commit message should be random and not
>>>>>>> valuable. I see no
>>>>> performance difference.
>>>>>>
>>>>>> I understand that this is causing issues with architecture having
>>>>>> different cache line sizes, but unfortunately majority of them
>>>>>> have
>>>>>> 64 byte
>>>>> cache lines so this change makes sense.
>>>>>
>>>>> I understand that 64 byte cache lines are spread a lot wider. I
>>>>> also have x86 as a target arch, but still, IMHO, OVS is a
>>>>> cross-platform application and it should not have platform
>>>>> dependent stuff which makes one architecture/platform better and
>worsen others.
>>>>>
>>>>>>
>>>>>> If you have performance data to prove that this causes sever perf
>>>>> degradation I can think of work arounds for ARM.
>>>>>>
>>>>>> - Bhanuprakash.
>>>>>
>>>>>
>>>>> P.S.: If you'll want to test with CACHE_LINE_SIZE=128 you will have
>>>>> to apply following patch to avoid build time assert (I'll send it
>>>>> formally
>>> later):
>>>>>
>>>>> -------------------------------------------------------------------
>>>>> --
>>>>> --------
>>>>> diff --git a/lib/cmap.c b/lib/cmap.c index 35decea..5b15ecd 100644
>>>>> --- a/lib/cmap.c
>>>>> +++ b/lib/cmap.c
>>>>> @@ -123,12 +123,11 @@ COVERAGE_DEFINE(cmap_shrink);
>>>>> /* Number of entries per bucket: 7 on 32-bit, 5 on 64-bit. */
>>>>> #define CMAP_K ((CACHE_LINE_SIZE - 4) / CMAP_ENTRY_SIZE)
>>>>>
>>>>> -/* Pad to make a bucket a full cache line in size: 4 on 32-bit, 0
>>>>> on 64-bit. */ - #define CMAP_PADDING ((CACHE_LINE_SIZE - 4) -
>>>>> (CMAP_K *
>>>>> CMAP_ENTRY_SIZE))
>>>>> -
>>>>> /* A cuckoo hash bucket.  Designed to be cache-aligned and exactly
>>>>> one cache
>>>>>  * line long. */
>>>>> struct cmap_bucket {
>>>>> +    /* Padding to make cmap_bucket exactly one cache line long. */
>>>>> +    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>     /* Allows readers to track in-progress changes.  Initially zero, each
>>>>>      * writer increments this value just before and just after each
>>>>> change
>>> (see
>>>>>      * cmap_set_bucket()).  Thus, a reader can ensure that it gets
>>>>> a consistent @@ -145,11 +144,7 @@ struct cmap_bucket {
>>>>>      * slots. */
>>>>>     uint32_t hashes[CMAP_K];
>>>>>     struct cmap_node nodes[CMAP_K];
>>>>> -
>>>>> -    /* Padding to make cmap_bucket exactly one cache line long. */
>>>>> -#if CMAP_PADDING > 0
>>>>> -    uint8_t pad[CMAP_PADDING];
>>>>> -#endif
>>>>> +    );
>>>>> };
>>>>> BUILD_ASSERT_DECL(sizeof(struct cmap_bucket) == CACHE_LINE_SIZE);
>>>>>
>>>>> diff --git a/lib/util.h b/lib/util.h index 3c43c2c..514fdaa 100644
>>>>> --- a/lib/util.h
>>>>> +++ b/lib/util.h
>>>>> @@ -61,7 +61,7 @@ struct Bad_arg_to_ARRAY_SIZE {
>>>>>
>>>>> /* This system's cache line size, in bytes.
>>>>>  * Being wrong hurts performance but not correctness. */ -#define
>>>>> CACHE_LINE_SIZE 64
>>>>> +#define CACHE_LINE_SIZE 128
>>>>> BUILD_ASSERT_DECL(IS_POW2(CACHE_LINE_SIZE));
>>>>>
>>>>> /* Cacheline marking is typically done using zero-sized array.
>>>>> -------------------------------------------------------------------
>>>>> --
>>>>> --------

>>>>>
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Most of the above issues are also true for some other
>>>>>>> padded/aligned structures like 'struct netdev_dpdk'. They will be
>>> treated separately.
>>>>>>>
>>>>>>> CC: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
>>>>>>> CC: Ben Pfaff <blp@ovn.org>
>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>>>>>>> ---
>>>>>>> lib/dpif-netdev.c | 160
>>>>>>> +++++++++++++++++++++++-----------------------------
>>>>>>> --
>>>>>>> 1 file changed, 69 insertions(+), 91 deletions(-)
>>>>>>>
>>>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>>>>>>> 0a62630..6784269
>>>>>>> 100644
>>>>>>> --- a/lib/dpif-netdev.c
>>>>>>> +++ b/lib/dpif-netdev.c
>>>>>>> @@ -547,31 +547,18 @@ struct tx_port {
>>>>>>>  * actions in either case.
>>>>>>>  * */
>>>>>>> struct dp_netdev_pmd_thread {
>>>>>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE,
>>>>> cacheline0,
>>>>>>> -        struct dp_netdev *dp;
>>>>>>> -        struct cmap_node node;          /* In 'dp->poll_threads'. */
>>>>>>> -        pthread_cond_t cond;            /* For synchronizing pmd thread
>>>>>>> -                                           reload. */
>>>>>>> -    );
>>>>>>> -
>>>>>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE,
>>>>> cacheline1,
>>>>>>> -        struct ovs_mutex cond_mutex;    /* Mutex for condition
>variable.
>>> */
>>>>>>> -        pthread_t thread;
>>>>>>> -        unsigned core_id;               /* CPU core id of this pmd thread. */
>>>>>>> -        int numa_id;                    /* numa node id of this pmd thread. */
>>>>>>> -    );
>>>>>>> +    struct dp_netdev *dp;
>>>>>>> +    struct ovs_refcount ref_cnt;    /* Every reference must be
>>> refcount'ed.
>>>>> */
>>>>>>> +    struct cmap_node node;          /* In 'dp->poll_threads'. */
>>>>>>> +
>>>>>>> +    pthread_cond_t cond;            /* For synchronizing pmd thread
>reload.
>>> */
>>>>>>> +    struct ovs_mutex cond_mutex;    /* Mutex for condition variable.
>*/
>>>>>>>
>>>>>>>     /* Per thread exact-match cache.  Note, the instance for cpu core
>>>>>>>      * NON_PMD_CORE_ID can be accessed by multiple threads, and
>>> thusly
>>>>>>>      * need to be protected by 'non_pmd_mutex'.  Every other
>instance
>>>>>>>      * will only be accessed by its own pmd thread. */
>>>>>>> -    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache
>>> flow_cache;
>>>>>>> -    struct ovs_refcount ref_cnt;    /* Every reference must be
>>> refcount'ed.
>>>>> */
>>>>>>> -
>>>>>>> -    /* Queue id used by this pmd thread to send packets on all
>netdevs
>>> if
>>>>>>> -     * XPS disabled for this netdev. All static_tx_qid's are unique and
>less
>>>>>>> -     * than 'cmap_count(dp->poll_threads)'. */
>>>>>>> -    uint32_t static_tx_qid;
>>>>>>> +    struct emc_cache flow_cache;
>>>>>>>
>>>>>>>     /* Flow-Table and classifiers
>>>>>>>      *
>>>>>>> @@ -580,77 +567,68 @@ struct dp_netdev_pmd_thread {
>>>>>>>      * 'flow_mutex'.
>>>>>>>      */
>>>>>>>     struct ovs_mutex flow_mutex;
>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>> -        struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>>>>> -
>>>>>>> -        /* One classifier per in_port polled by the pmd */
>>>>>>> -        struct cmap classifiers;
>>>>>>> -        /* Periodically sort subtable vectors according to hit frequencies
>*/
>>>>>>> -        long long int next_optimization;
>>>>>>> -        /* End of the next time interval for which processing cycles
>>>>>>> -           are stored for each polled rxq. */
>>>>>>> -        long long int rxq_next_cycle_store;
>>>>>>> -
>>>>>>> -        /* Cycles counters */
>>>>>>> -        struct dp_netdev_pmd_cycles cycles;
>>>>>>> -
>>>>>>> -        /* Used to count cycles. See 'cycles_counter_end()'. */
>>>>>>> -        unsigned long long last_cycles;
>>>>>>> -        struct latch exit_latch;        /* For terminating the pmd thread. */
>>>>>>> -    );
>>>>>>> -
>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>> -        /* Statistics. */
>>>>>>> -        struct dp_netdev_pmd_stats stats;
>>>>>>> -
>>>>>>> -        struct seq *reload_seq;
>>>>>>> -        uint64_t last_reload_seq;
>>>>>>> -        atomic_bool reload;             /* Do we need to reload ports? */
>>>>>>> -        bool isolated;
>>>>>>> -
>>>>>>> -        /* Set to true if the pmd thread needs to be reloaded. */
>>>>>>> -        bool need_reload;
>>>>>>> -        /* 5 pad bytes. */
>>>>>>> -    );
>>>>>>> -
>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>> -        struct ovs_mutex port_mutex;    /* Mutex for 'poll_list'
>>>>>>> -                                           and 'tx_ports'. */
>>>>>>> -        /* 16 pad bytes. */
>>>>>>> -    );
>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>> -        /* List of rx queues to poll. */
>>>>>>> -        struct hmap poll_list OVS_GUARDED;
>>>>>>> -        /* Map of 'tx_port's used for transmission.  Written by the main
>>>>>>> -         * thread, read by the pmd thread. */
>>>>>>> -        struct hmap tx_ports OVS_GUARDED;
>>>>>>> -    );
>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>> -        /* These are thread-local copies of 'tx_ports'.  One contains only
>>>>>>> -         * tunnel ports (that support push_tunnel/pop_tunnel), the
>other
>>>>>>> -         * contains ports with at least one txq (that support send).
>>>>>>> -         * A port can be in both.
>>>>>>> -         *
>>>>>>> -         * There are two separate maps to make sure that we don't try
>to
>>>>>>> -         * execute OUTPUT on a device which has 0 txqs or PUSH/POP
>on a
>>>>>>> -         * non-tunnel device.
>>>>>>> -         *
>>>>>>> -         * The instances for cpu core NON_PMD_CORE_ID can be
>accessed
>>> by
>>>>>>> -         * multiple threads and thusly need to be protected by
>>>>>>> 'non_pmd_mutex'.
>>>>>>> -         * Every other instance will only be accessed by its own pmd
>>> thread.
>>>>> */
>>>>>>> -        struct hmap tnl_port_cache;
>>>>>>> -        struct hmap send_port_cache;
>>>>>>> -    );
>>>>>>> -
>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>> -        /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>>>>>> -         * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>>>>>> -         * values and subtracts them from 'stats' and 'cycles' before
>>>>>>> -         * reporting to the user */
>>>>>>> -        unsigned long long stats_zero[DP_N_STATS];
>>>>>>> -        uint64_t cycles_zero[PMD_N_CYCLES];
>>>>>>> -        /* 8 pad bytes. */
>>>>>>> -    );
>>>>>>> +    struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>>>>> +
>>>>>>> +    /* One classifier per in_port polled by the pmd */
>>>>>>> +    struct cmap classifiers;
>>>>>>> +    /* Periodically sort subtable vectors according to hit frequencies
>*/
>>>>>>> +    long long int next_optimization;
>>>>>>> +    /* End of the next time interval for which processing cycles
>>>>>>> +       are stored for each polled rxq. */
>>>>>>> +    long long int rxq_next_cycle_store;
>>>>>>> +
>>>>>>> +    /* Statistics. */
>>>>>>> +    struct dp_netdev_pmd_stats stats;
>>>>>>> +
>>>>>>> +    /* Cycles counters */
>>>>>>> +    struct dp_netdev_pmd_cycles cycles;
>>>>>>> +
>>>>>>> +    /* Used to count cicles. See 'cycles_counter_end()' */
>>>>>>> +    unsigned long long last_cycles;
>>>>>>> +
>>>>>>> +    struct latch exit_latch;        /* For terminating the pmd thread. */
>>>>>>> +    struct seq *reload_seq;
>>>>>>> +    uint64_t last_reload_seq;
>>>>>>> +    atomic_bool reload;             /* Do we need to reload ports? */
>>>>>>> +    pthread_t thread;
>>>>>>> +    unsigned core_id;               /* CPU core id of this pmd thread. */
>>>>>>> +    int numa_id;                    /* numa node id of this pmd thread. */
>>>>>>> +    bool isolated;
>>>>>>> +
>>>>>>> +    /* Queue id used by this pmd thread to send packets on all
>>>>>>> + netdevs
>>> if
>>>>>>> +     * XPS disabled for this netdev. All static_tx_qid's are unique and
>less
>>>>>>> +     * than 'cmap_count(dp->poll_threads)'. */
>>>>>>> +    uint32_t static_tx_qid;
>>>>>>> +
>>>>>>> +    struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and
>>> 'tx_ports'.
>>>>> */
>>>>>>> +    /* List of rx queues to poll. */
>>>>>>> +    struct hmap poll_list OVS_GUARDED;
>>>>>>> +    /* Map of 'tx_port's used for transmission.  Written by the
>>>>>>> + main
>>> thread,
>>>>>>> +     * read by the pmd thread. */
>>>>>>> +    struct hmap tx_ports OVS_GUARDED;
>>>>>>> +
>>>>>>> +    /* These are thread-local copies of 'tx_ports'.  One
>>>>>>> + contains only
>>> tunnel
>>>>>>> +     * ports (that support push_tunnel/pop_tunnel), the other
>>>>>>> + contains
>>>>> ports
>>>>>>> +     * with at least one txq (that support send).  A port can be in both.
>>>>>>> +     *
>>>>>>> +     * There are two separate maps to make sure that we don't
>>>>>>> + try to
>>>>> execute
>>>>>>> +     * OUTPUT on a device which has 0 txqs or PUSH/POP on a
>>>>>>> + non-tunnel
>>>>>>> device.
>>>>>>> +     *
>>>>>>> +     * The instances for cpu core NON_PMD_CORE_ID can be
>>>>>>> + accessed by
>>>>>>> multiple
>>>>>>> +     * threads, and thusly need to be protected by 'non_pmd_mutex'.
>>>>> Every
>>>>>>> +     * other instance will only be accessed by its own pmd thread. */
>>>>>>> +    struct hmap tnl_port_cache;
>>>>>>> +    struct hmap send_port_cache;
>>>>>>> +
>>>>>>> +    /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>>>>>> +     * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>>>>>> +     * values and subtracts them from 'stats' and 'cycles' before
>>>>>>> +     * reporting to the user */
>>>>>>> +    unsigned long long stats_zero[DP_N_STATS];
>>>>>>> +    uint64_t cycles_zero[PMD_N_CYCLES];
>>>>>>> +
>>>>>>> +    /* Set to true if the pmd thread needs to be reloaded. */
>>>>>>> +    bool need_reload;
>>>>>>> };
>>>>>>>
>>>>>>> /* Interface to netdev-based datapath. */
>>>>>>> --
>>>>>>> 2.7.4
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>> _______________________________________________
>>>> dev mailing list
>>>> dev@openvswitch.org
>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>
>>
>>
Ilya Maximets Dec. 1, 2017, 3:55 p.m. | #17
On 29.11.2017 00:40, Bodireddy, Bhanuprakash wrote:
>> On 27.11.2017 20:02, Bodireddy, Bhanuprakash wrote:
>>>> I agree with Ilya here. Adding theses cache line markers and
>>>> re-grouping variables to minimize gaps in cache lines is creating a
>>>> maintenance burden without any tangible benefit. I have had to go
>>>> through the pain of refactoring my PMD Performance Metrics patch to
>>>> the new dp_netdev_pmd_thread struct and spent a lot of time to
>>>> analyze the actual memory layout with GDB and play Tetris with the
>> variables.
>>>
>>> Analyzing the memory layout with gdb for large structures is time consuming
>> and not usually recommended.
>>> I would suggest using Poke-a-hole(pahole) and that helps to understand
>> and fix the structures in no time.
>>> With pahole it's going to be lot easier to work with large structures
>> especially.
>>
>> Interesting tool, but it seems doesn't work perfectly. I see duplicated unions
>> and zero length arrays in the output and I still need to check sizes by hands.
>> And it fails trying to run on my ubuntu 16.04 LTS on x86.
>> IMHO, the code should be simple enough to not use external utilities when
>> you need to check the single structure.
> 
> Pahole has been there for a while and is available with most distributions and works reliably on RHEL distros.
> I am on fedora and built pahole from sources and it displays all the sizes and cacheline boundaries.
> 
>>
>>>>
>>>> There will never be more than a handful of PMDs, so minimizing the
>>>> gaps does not matter from memory perspective. And whether the
>>>> individual members occupy 4 or 5 cache lines does not matter either
>>>> compared to the many hundred cache lines touched for EMC and DPCLS
>>>> lookups of an Rx batch. And any optimization done for x86 is not
>>>> necessarily optimal for other architectures.
>>>
>>> I agree that optimization targeted for x86 doesn't necessarily suit ARM due
>> to its different cache line size.
>>>
>>>>
>>>> Finally, even for x86 there is not even a performance improvement. I
>>>> re-ran our standard L3VPN over VXLAN performance PVP test on master
>>>> and with Ilya's revert patch:
>>>>
>>>> Flows   master  reverted
>>>> 8,      4.46    4.48
>>>> 100,    4.27    4.29
>>>> 1000,   4.07    4.07
>>>> 2000,   3.68    3.68
>>>> 5000,   3.03    3.03
>>>> 10000,  2.76    2.77
>>>> 20000,  2.64    2.65
>>>> 50000,  2.60    2.61
>>>> 100000, 2.60    2.61
>>>> 500000, 2.60    2.61
>>>
>>> What are the  CFLAGS in this case, as they seem to make difference. I have
>> added my finding here for a different patch targeted at performance
>>>
>>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>> November/341270.ht
>>> ml
>>
>> Do you have any performance results that shows significant performance
>> difference between above cases? Please describe your test scenario and
>> environment so we'll be able to see that padding/alignment really needed
>> here. I saw no such results yet.
>>
>> BTW, at one place you're saying that patch was not about performance, at the
>> same time you're trying to show that it has some positive performance
>> impact. I'm a bit confused with that.
> 
> Giving a bit more context here, I have been experimenting  with *prefetching* in
> OvS as prefetching isn't used except in 2 instances(emc_processing & cmaps).
> This work is aimed at checking the performance benefits with prefetching on not just Haswell
> but also with newer range of processors.
> 
> The best way to prefetch a part of structure is to mark the portion of it. This isn't possible
> unless we have some kind of cache line marking. This is what my patch initially did and then
> we can prefetch portion of it based on cacheline markers. You can find an example in pkt_metadata struct.
> 
> My point is on X86  with cache line marking and with xzalloc_cacheline() API one shouldn't
> see drop in performance if not improvement. But the real improvements would be seen when the
> prefetching is done at right places and that’s WIP.

Are there any preliminary results for this work? Any significant performance
improvement from prefetching the 'dp_netdev_pmd_thread' structure?
Otherwise I see no reason to keep these paddings, which are needed only for
experiments and complicate normal development.

> 
> Bhanuprakash.
> 
>>
>>>
>>> Patches to consider when testing your use case:
>>>      Xzalloc_cachline:  https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>> November/341231.html
>>>      (If using output batching)      https://mail.openvswitch.org/pipermail/ovs-
>> dev/2017-November/341230.html
>>>
>>> - Bhanuprakash.
>>>
>>>>
>>>> All in all, I support reverting this change.
>>>>
>>>> Regards, Jan
>>>>
>>>> Acked-by: Jan Scheurich <jan.scheurich@ericsson.com>
>>>>
>>>>> -----Original Message-----
>>>>> From: ovs-dev-bounces@openvswitch.org
>>>>> [mailto:ovs-dev-bounces@openvswitch.org] On Behalf Of Bodireddy,
>>>>> Bhanuprakash
>>>>> Sent: Friday, 24 November, 2017 17:09
>>>>> To: Ilya Maximets <i.maximets@samsung.com>; ovs-
>> dev@openvswitch.org;
>>>>> Ben Pfaff <blp@ovn.org>
>>>>> Cc: Heetae Ahn <heetae82.ahn@samsung.com>
>>>>> Subject: Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor
>>>> dp_netdev_pmd_thread structure."
>>>>>
>>>>>> On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>>>>>>>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>>>>>>>>
>>>>>>>> Padding and aligning of dp_netdev_pmd_thread structure members
>> is
>>>>>>>> useless, broken in a several ways and only greatly degrades
>>>>>>>> maintainability and extensibility of the structure.
>>>>>>>
>>>>>>> The idea of my earlier patch was to mark the cache lines and
>>>>>>> reduce the
>>>>>> holes while still maintaining the grouping of related members in
>>>>>> this
>>>> structure.
>>>>>>
>>>>>> Some of the grouping aspects looks strange. For example, it looks
>>>>>> illogical that 'exit_latch' is grouped with 'flow_table' but not
>>>>>> the 'reload_seq' and other reload related stuff. It looks strange
>>>>>> that statistics and counters spread across different groups. So,
>>>>>> IMHO, it's not
>>>> well grouped.
>>>>>
>>>>> I had to strike a fine balance and some members may be placed in a
>>>>> different group due to their sizes and importance. Let me think if I
>>>>> can make
>>>> it better.
>>>>>
>>>>>>
>>>>>>> Also cache line marking is a good practice to make some one extra
>>>>>>> cautious
>>>>>> when extending or editing important structures .
>>>>>>> Most importantly I was experimenting with prefetching on this
>>>>>>> structure
>>>>>> and needed cache line markers for it.
>>>>>>>
>>>>>>> I see that you are on ARM (I don't have HW to test) and want to
>>>>>>> know if this
>>>>>> commit has any negative affect and any numbers would be appreciated.
>>>>>>
>>>>>> Basic VM-VM testing shows stable 0.5% perfromance improvement with
>>>>>> revert applied.
>>>>>
>>>>> I did P2P, PVP and PVVP with IXIA and haven't noticed any drop on X86.
>>>>>
>>>>>> Padding adds 560 additional bytes of holes.
>>>>> As the cache line in ARM is 128 , it created holes, I can find a
>>>>> workaround to
>>>> handle this.
>>>>>
>>>>>>
>>>>>>> More comments inline.
>>>>>>>
>>>>>>>>
>>>>>>>> Issues:
>>>>>>>>
>>>>>>>>    1. It's not working because all the instances of struct
>>>>>>>>       dp_netdev_pmd_thread are allocated with plain malloc. The
>>>>>>>>       memory is not aligned to cache lines, so the structure almost
>>>>>>>>       never starts at an aligned memory address. This means that
>>>>>>>>       any further padding and alignment inside the structure is
>>>>>>>>       completely useless. For example:
>>>>>>>>
>>>>>>>>       Breakpoint 1, pmd_thread_main
>>>>>>>>       (gdb) p pmd
>>>>>>>>       $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>>>>>>>>       (gdb) p &pmd->cacheline1
>>>>>>>>       $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>>>>>>>>       (gdb) p &pmd->cacheline0
>>>>>>>>       $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>>>>>>>>       (gdb) p &pmd->flow_cache
>>>>>>>>       $53 = (struct emc_cache *) 0x1b1afe0
>>>>>>>>
>>>>>>>>       All of the above addresses are shifted from the cache line
>>>>>>>>       start by 32B.
>>>>>>>
>>>>>>> If you see below all the addresses are 64 byte aligned.
>>>>>>>
>>>>>>> (gdb) p pmd
>>>>>>> $1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
>>>>>>> (gdb) p &pmd->cacheline0
>>>>>>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
>>>>>>> (gdb) p &pmd->cacheline1
>>>>>>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
>>>>>>> (gdb) p &pmd->flow_cache
>>>>>>> $4 = (struct emc_cache *) 0x7fc1e9b1a0c0
>>>>>>> (gdb) p &pmd->flow_table
>>>>>>> $5 = (struct cmap *) 0x7fc1e9fba100
>>>>>>> (gdb) p &pmd->stats
>>>>>>> $6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
>>>>>>> (gdb) p &pmd->port_mutex
>>>>>>> $7 = (struct ovs_mutex *) 0x7fc1e9fba180
>>>>>>> (gdb) p &pmd->poll_list
>>>>>>> $8 = (struct hmap *) 0x7fc1e9fba1c0
>>>>>>> (gdb) p &pmd->tnl_port_cache
>>>>>>> $9 = (struct hmap *) 0x7fc1e9fba200
>>>>>>> (gdb) p &pmd->stats_zero
>>>>>>> $10 = (unsigned long long (*)[5]) 0x7fc1e9fba240
>>>>>>>
>>>>>>> I tried using xzalloc_cacheline() instead of the default xzalloc()
>>>>>>> here. I tried tens of times and always found that the address is
>>>>>>> 64-byte aligned, so it should start at the beginning of a cache
>>>>>>> line on x86. Not sure why the comment "(The memory returned will
>>>>>>> not be at the start of a cache line, though, so don't assume such
>>>>>>> alignment.)" says otherwise?
>>>>>>
>>>>>> Yes, you will always get aligned addresses on your x86 Linux system
>>>>>> because it supports the posix_memalign() call. The comment says what
>>>>>> it says because the function falls back to memory allocation tricks
>>>>>> when posix_memalign() is not available (Windows, some macOS, maybe
>>>>>> some Linux systems (not sure)), and the address will not be aligned
>>>>>> in that case.
>>>>>
>>>>> I also verified the other case, when posix_memalign() isn't
>>>>> available, and even then it returns an address aligned on a
>>>>> CACHE_LINE_SIZE boundary. I will send out a patch to use
>>>>> xzalloc_cacheline() for allocating the memory.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>       Can we fix it properly? NO.
>>>>>>>>       OVS currently doesn't have appropriate API to allocate aligned
>>>>>>>>       memory. The best candidate is 'xmalloc_cacheline()' but it
>>>>>>>>       clearly states that "The memory returned will not be at the
>>>>>>>>       start of a cache line, though, so don't assume such alignment".
>>>>>>>>       And also, this function will never return aligned memory on
>>>>>>>>       Windows or MacOS.
>>>>>>>>
>>>>>>>>    2. CACHE_LINE_SIZE is not constant. Different architectures have
>>>>>>>>       different cache line sizes, but the code assumes that
>>>>>>>>       CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
>>>>>>>>       members are grouped by 64 bytes and padded to
>> CACHE_LINE_SIZE.
>>>>>>>>       This leads to huge holes in the structures if CACHE_LINE_SIZE
>>>>>>>>       differs from 64. This is the opposite of portability. If I
>>>>>>>>       want good cmap performance I need CACHE_LINE_SIZE equal to
>>>>>>>>       the real cache line size, but then I will have huge holes in
>>>>>>>>       the structures. If you take a look at struct rte_mbuf from
>>>>>>>>       DPDK, you'll see that it uses two defines, RTE_CACHE_LINE_SIZE
>>>>>>>>       and RTE_CACHE_LINE_MIN_SIZE, to avoid holes in the mbuf
>>>>>>>>       structure.
>>>>>>>
>>>>>>> I understand that ARM and a few other processors (like OCTEON)
>>>>>>> have 128-byte cache lines. But again, I am curious about the
>>>>>>> performance impact in your case with this new alignment.
>>>>>>>
>>>>>>>>
>>>>>>>>    3. Sizes of system/libc-defined types are not constant across
>>>>>>>>       systems. For example, sizeof(pthread_mutex_t) == 48 on my
>>>>>>>>       ARMv8 machine, but only 40 on x86. The difference could be
>>>>>>>>       much bigger on Windows or MacOS systems. But the code assumes
>>>>>>>>       that sizeof(struct ovs_mutex) is always 48 bytes. This may
>>>>>>>>       lead to broken alignment or big holes in case of padding, and
>>>>>>>>       wrong comments about the number of free pad bytes.
>>>>>>>
>>>>>>> This isn't really an issue; as you mentioned, it's more a problem
>>>>>>> with the comment that states the number of pad bytes. In the ARM
>>>>>>> case it would be just 8 pad bytes instead of 16 on x86.
>>>>>>>
>>>>>>>         union {
>>>>>>>                 struct {
>>>>>>>                         struct ovs_mutex port_mutex;     /* 4849984    48 */
>>>>>>>                 };    /*          48 */
>>>>>>>                 uint8_t            pad13[64];            /*          64 */
>>>>>>>         };                                               /
>>>>>>>
>>>>>>
>>>>>> It's not only about 'port_mutex'. If you take a look at
>>>>>> 'flow_mutex', you will see that it is not even padded. So,
>>>>>> increasing the size of 'flow_mutex' shifts all the following padded
>>>>>> blocks, and no other block will be properly aligned even if the
>>>>>> structure is allocated in aligned memory.
>>>>>>
>>>>>>>>
>>>>>>>>    4. The sizes of many fields in the structure depend on defines
>>>>>>>>       like DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so
>>>>>>>>       on. Any change to these defines, or to any structure
>>>>>>>>       contained by the thread, requires a non-trivial refactoring
>>>>>>>>       of the whole dp_netdev_pmd_thread structure. This greatly
>>>>>>>>       reduces maintainability and complicates development of new
>>>>>>>>       features.
>>>>>>>
>>>>>>> I don't think it complicates development; instead, I feel the
>>>>>>> commit gives a clear indication to the developer that the members
>>>>>>> are grouped, aligned and marked with cacheline markers. This makes
>>>>>>> the developer extra cautious when adding new members, so that holes
>>>>>>> can be avoided.
>>>>>>
>>>>>> While rebasing the output batching patch set, I figured out that I
>>>>>> need to remove 'unsigned long long last_cycles' and add 'struct
>>>>>> dp_netdev_pmd_thread_ctx ctx', which is 8 bytes larger. Could you
>>>>>> please suggest where I should place that new structure member and
>>>>>> what to do with the hole left by 'last_cycles'?
>>>>>>
>>>>>> This is not a trivial question, because the already poor grouping
>>>>>> will become worse almost no matter what.
>>>>>
>>>>> Aah, I realized now that the batching series doesn't apply cleanly
>>>>> on master. Let me check this, and I will send across the changes
>>>>> that should fix this.
>>>>>
>>>>> - Bhanuprakash
>>>>>
>>>>>>>
>>>>>>> Cacheline-marking the structure is a good practice, and I am sure
>>>>>>> this structure is significant and should be carefully extended in
>>>>>>> the future.
>>>>>>
>>>>>> Not so sure about that.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>    5. There is no reason to align the flow_cache member because it
>>>>>>>>       is too big and we usually access random entries, from a
>>>>>>>>       single thread only.
>>>>>>>>
>>>>>>>
>>>>>>> I see your point. This patch wasn't done for performance, but
>>>>>>> rather to bring some order to this ever-growing structure. During
>>>>>>> testing I found that for some test cases aligning the flow_cache
>>>>>>> was consistently giving me a 100k+ improvement, so it was added.
>>>>>>
>>>>>> This was a random performance boost. You achieved it without
>>>>>> aligned memory allocation; it was just luck with your system
>>>>>> environment. Using xzalloc_cacheline() will likely eliminate this
>>>>>> performance difference or even degrade performance.
>>>>>>
>>>>>>>
>>>>>>>> So, the padding/alignment only creates the appearance of a
>>>>>>>> performance optimization but does nothing useful in reality. It
>>>>>>>> only complicates maintenance and adds huge holes for non-x86
>>>>>>>> architectures and non-Linux systems. The performance improvement
>>>>>>>> stated in the original commit message is likely random and not
>>>>>>>> meaningful. I see no performance difference.
>>>>>>>
>>>>>>> I understand that this causes issues on architectures with
>>>>>>> different cache line sizes, but the majority of them have 64-byte
>>>>>>> cache lines, so this change makes sense.
>>>>>>
>>>>>> I understand that 64-byte cache lines are far more widespread. I
>>>>>> also have x86 as a target arch, but still, IMHO, OVS is a
>>>>>> cross-platform application and should not have platform-dependent
>>>>>> code that makes one architecture/platform better and worsens others.
>>>>>>
>>>>>>>
>>>>>>> If you have performance data proving that this causes severe perf
>>>>>> degradation, I can think of workarounds for ARM.
>>>>>>>
>>>>>>> - Bhanuprakash.
>>>>>>
>>>>>>
>>>>>> P.S.: If you want to test with CACHE_LINE_SIZE=128, you will have
>>>>>> to apply the following patch to avoid a build-time assert (I'll
>>>>>> send it formally later):
>>>>>>
>>>>>> ---------------------------------------------------------------------------
>>>>>> diff --git a/lib/cmap.c b/lib/cmap.c
>>>>>> index 35decea..5b15ecd 100644
>>>>>> --- a/lib/cmap.c
>>>>>> +++ b/lib/cmap.c
>>>>>> @@ -123,12 +123,11 @@ COVERAGE_DEFINE(cmap_shrink);
>>>>>> /* Number of entries per bucket: 7 on 32-bit, 5 on 64-bit. */
>>>>>> #define CMAP_K ((CACHE_LINE_SIZE - 4) / CMAP_ENTRY_SIZE)
>>>>>>
>>>>>> -/* Pad to make a bucket a full cache line in size: 4 on 32-bit, 0 on 64-bit. */
>>>>>> -#define CMAP_PADDING ((CACHE_LINE_SIZE - 4) - (CMAP_K * CMAP_ENTRY_SIZE))
>>>>>> -
>>>>>> /* A cuckoo hash bucket.  Designed to be cache-aligned and exactly one cache
>>>>>>  * line long. */
>>>>>> struct cmap_bucket {
>>>>>> +    /* Padding to make cmap_bucket exactly one cache line long. */
>>>>>> +    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>     /* Allows readers to track in-progress changes.  Initially zero, each
>>>>>>      * writer increments this value just before and just after each change (see
>>>>>>      * cmap_set_bucket()).  Thus, a reader can ensure that it gets a consistent
>>>>>> @@ -145,11 +144,7 @@ struct cmap_bucket {
>>>>>>      * slots. */
>>>>>>     uint32_t hashes[CMAP_K];
>>>>>>     struct cmap_node nodes[CMAP_K];
>>>>>> -
>>>>>> -    /* Padding to make cmap_bucket exactly one cache line long. */
>>>>>> -#if CMAP_PADDING > 0
>>>>>> -    uint8_t pad[CMAP_PADDING];
>>>>>> -#endif
>>>>>> +    );
>>>>>> };
>>>>>> BUILD_ASSERT_DECL(sizeof(struct cmap_bucket) == CACHE_LINE_SIZE);
>>>>>>
>>>>>> diff --git a/lib/util.h b/lib/util.h
>>>>>> index 3c43c2c..514fdaa 100644
>>>>>> --- a/lib/util.h
>>>>>> +++ b/lib/util.h
>>>>>> @@ -61,7 +61,7 @@ struct Bad_arg_to_ARRAY_SIZE {
>>>>>>
>>>>>> /* This system's cache line size, in bytes.
>>>>>>  * Being wrong hurts performance but not correctness. */
>>>>>> -#define CACHE_LINE_SIZE 64
>>>>>> +#define CACHE_LINE_SIZE 128
>>>>>> BUILD_ASSERT_DECL(IS_POW2(CACHE_LINE_SIZE));
>>>>>>
>>>>>> /* Cacheline marking is typically done using zero-sized array.
>>>>>> ---------------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> Best regards, Ilya Maximets.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Most of the above issues are also true for some other
>>>>>>>> padded/aligned structures like 'struct netdev_dpdk'. They will be
>>>>>>>> treated separately.
>>>>>>>>
>>>>>>>> CC: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
>>>>>>>> CC: Ben Pfaff <blp@ovn.org>
>>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>>>>>>>> ---
>>>>>>>> lib/dpif-netdev.c | 160 +++++++++++++++++++++++-------------------------------
>>>>>>>> 1 file changed, 69 insertions(+), 91 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>>>>>> index 0a62630..6784269 100644
>>>>>>>> --- a/lib/dpif-netdev.c
>>>>>>>> +++ b/lib/dpif-netdev.c
>>>>>>>> @@ -547,31 +547,18 @@ struct tx_port {
>>>>>>>>  * actions in either case.
>>>>>>>>  * */
>>>>>>>> struct dp_netdev_pmd_thread {
>>>>>>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
>>>>>>>> -        struct dp_netdev *dp;
>>>>>>>> -        struct cmap_node node;          /* In 'dp->poll_threads'. */
>>>>>>>> -        pthread_cond_t cond;            /* For synchronizing pmd thread
>>>>>>>> -                                           reload. */
>>>>>>>> -    );
>>>>>>>> -
>>>>>>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>>>>>>>> -        struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>>>>>>> -        pthread_t thread;
>>>>>>>> -        unsigned core_id;               /* CPU core id of this pmd thread. */
>>>>>>>> -        int numa_id;                    /* numa node id of this pmd thread. */
>>>>>>>> -    );
>>>>>>>> +    struct dp_netdev *dp;
>>>>>>>> +    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>>>>>>>> +    struct cmap_node node;          /* In 'dp->poll_threads'. */
>>>>>>>> +
>>>>>>>> +    pthread_cond_t cond;            /* For synchronizing pmd thread reload. */
>>>>>>>> +    struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>>>>>>>
>>>>>>>>     /* Per thread exact-match cache.  Note, the instance for cpu core
>>>>>>>>      * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
>>>>>>>>      * need to be protected by 'non_pmd_mutex'.  Every other instance
>>>>>>>>      * will only be accessed by its own pmd thread. */
>>>>>>>> -    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
>>>>>>>> -    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>>>>>>>> -
>>>>>>>> -    /* Queue id used by this pmd thread to send packets on all netdevs if
>>>>>>>> -     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>>>>>>> -     * than 'cmap_count(dp->poll_threads)'. */
>>>>>>>> -    uint32_t static_tx_qid;
>>>>>>>> +    struct emc_cache flow_cache;
>>>>>>>>
>>>>>>>>     /* Flow-Table and classifiers
>>>>>>>>      *
>>>>>>>> @@ -580,77 +567,68 @@ struct dp_netdev_pmd_thread {
>>>>>>>>      * 'flow_mutex'.
>>>>>>>>      */
>>>>>>>>     struct ovs_mutex flow_mutex;
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>>>>>> -
>>>>>>>> -        /* One classifier per in_port polled by the pmd */
>>>>>>>> -        struct cmap classifiers;
>>>>>>>> -        /* Periodically sort subtable vectors according to hit frequencies */
>>>>>>>> -        long long int next_optimization;
>>>>>>>> -        /* End of the next time interval for which processing cycles
>>>>>>>> -           are stored for each polled rxq. */
>>>>>>>> -        long long int rxq_next_cycle_store;
>>>>>>>> -
>>>>>>>> -        /* Cycles counters */
>>>>>>>> -        struct dp_netdev_pmd_cycles cycles;
>>>>>>>> -
>>>>>>>> -        /* Used to count cycles. See 'cycles_counter_end()'. */
>>>>>>>> -        unsigned long long last_cycles;
>>>>>>>> -        struct latch exit_latch;        /* For terminating the pmd thread. */
>>>>>>>> -    );
>>>>>>>> -
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        /* Statistics. */
>>>>>>>> -        struct dp_netdev_pmd_stats stats;
>>>>>>>> -
>>>>>>>> -        struct seq *reload_seq;
>>>>>>>> -        uint64_t last_reload_seq;
>>>>>>>> -        atomic_bool reload;             /* Do we need to reload ports? */
>>>>>>>> -        bool isolated;
>>>>>>>> -
>>>>>>>> -        /* Set to true if the pmd thread needs to be reloaded. */
>>>>>>>> -        bool need_reload;
>>>>>>>> -        /* 5 pad bytes. */
>>>>>>>> -    );
>>>>>>>> -
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        struct ovs_mutex port_mutex;    /* Mutex for 'poll_list'
>>>>>>>> -                                           and 'tx_ports'. */
>>>>>>>> -        /* 16 pad bytes. */
>>>>>>>> -    );
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        /* List of rx queues to poll. */
>>>>>>>> -        struct hmap poll_list OVS_GUARDED;
>>>>>>>> -        /* Map of 'tx_port's used for transmission.  Written by the main
>>>>>>>> -         * thread, read by the pmd thread. */
>>>>>>>> -        struct hmap tx_ports OVS_GUARDED;
>>>>>>>> -    );
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        /* These are thread-local copies of 'tx_ports'.  One contains only
>>>>>>>> -         * tunnel ports (that support push_tunnel/pop_tunnel), the other
>>>>>>>> -         * contains ports with at least one txq (that support send).
>>>>>>>> -         * A port can be in both.
>>>>>>>> -         *
>>>>>>>> -         * There are two separate maps to make sure that we don't try to
>>>>>>>> -         * execute OUTPUT on a device which has 0 txqs or PUSH/POP on a
>>>>>>>> -         * non-tunnel device.
>>>>>>>> -         *
>>>>>>>> -         * The instances for cpu core NON_PMD_CORE_ID can be accessed by
>>>>>>>> -         * multiple threads and thusly need to be protected by 'non_pmd_mutex'.
>>>>>>>> -         * Every other instance will only be accessed by its own pmd thread. */
>>>>>>>> -        struct hmap tnl_port_cache;
>>>>>>>> -        struct hmap send_port_cache;
>>>>>>>> -    );
>>>>>>>> -
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>>>>>>> -         * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>>>>>>> -         * values and subtracts them from 'stats' and 'cycles' before
>>>>>>>> -         * reporting to the user */
>>>>>>>> -        unsigned long long stats_zero[DP_N_STATS];
>>>>>>>> -        uint64_t cycles_zero[PMD_N_CYCLES];
>>>>>>>> -        /* 8 pad bytes. */
>>>>>>>> -    );
>>>>>>>> +    struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>>>>>> +
>>>>>>>> +    /* One classifier per in_port polled by the pmd */
>>>>>>>> +    struct cmap classifiers;
>>>>>>>> +    /* Periodically sort subtable vectors according to hit frequencies */
>>>>>>>> +    long long int next_optimization;
>>>>>>>> +    /* End of the next time interval for which processing cycles
>>>>>>>> +       are stored for each polled rxq. */
>>>>>>>> +    long long int rxq_next_cycle_store;
>>>>>>>> +
>>>>>>>> +    /* Statistics. */
>>>>>>>> +    struct dp_netdev_pmd_stats stats;
>>>>>>>> +
>>>>>>>> +    /* Cycles counters */
>>>>>>>> +    struct dp_netdev_pmd_cycles cycles;
>>>>>>>> +
>>>>>>>> +    /* Used to count cicles. See 'cycles_counter_end()' */
>>>>>>>> +    unsigned long long last_cycles;
>>>>>>>> +
>>>>>>>> +    struct latch exit_latch;        /* For terminating the pmd thread. */
>>>>>>>> +    struct seq *reload_seq;
>>>>>>>> +    uint64_t last_reload_seq;
>>>>>>>> +    atomic_bool reload;             /* Do we need to reload ports? */
>>>>>>>> +    pthread_t thread;
>>>>>>>> +    unsigned core_id;               /* CPU core id of this pmd thread. */
>>>>>>>> +    int numa_id;                    /* numa node id of this pmd thread. */
>>>>>>>> +    bool isolated;
>>>>>>>> +
>>>>>>>> +    /* Queue id used by this pmd thread to send packets on all netdevs if
>>>>>>>> +     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>>>>>>> +     * than 'cmap_count(dp->poll_threads)'. */
>>>>>>>> +    uint32_t static_tx_qid;
>>>>>>>> +
>>>>>>>> +    struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and 'tx_ports'. */
>>>>>>>> +    /* List of rx queues to poll. */
>>>>>>>> +    struct hmap poll_list OVS_GUARDED;
>>>>>>>> +    /* Map of 'tx_port's used for transmission.  Written by the main thread,
>>>>>>>> +     * read by the pmd thread. */
>>>>>>>> +    struct hmap tx_ports OVS_GUARDED;
>>>>>>>> +
>>>>>>>> +    /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
>>>>>>>> +     * ports (that support push_tunnel/pop_tunnel), the other contains ports
>>>>>>>> +     * with at least one txq (that support send).  A port can be in both.
>>>>>>>> +     *
>>>>>>>> +     * There are two separate maps to make sure that we don't try to execute
>>>>>>>> +     * OUTPUT on a device which has 0 txqs or PUSH/POP on a non-tunnel device.
>>>>>>>> +     *
>>>>>>>> +     * The instances for cpu core NON_PMD_CORE_ID can be accessed by multiple
>>>>>>>> +     * threads, and thusly need to be protected by 'non_pmd_mutex'.  Every
>>>>>>>> +     * other instance will only be accessed by its own pmd thread. */
>>>>>>>> +    struct hmap tnl_port_cache;
>>>>>>>> +    struct hmap send_port_cache;
>>>>>>>> +
>>>>>>>> +    /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>>>>>>> +     * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>>>>>>> +     * values and subtracts them from 'stats' and 'cycles' before
>>>>>>>> +     * reporting to the user */
>>>>>>>> +    unsigned long long stats_zero[DP_N_STATS];
>>>>>>>> +    uint64_t cycles_zero[PMD_N_CYCLES];
>>>>>>>> +
>>>>>>>> +    /* Set to true if the pmd thread needs to be reloaded. */
>>>>>>>> +    bool need_reload;
>>>>>>>> };
>>>>>>>>
>>>>>>>> /* Interface to netdev-based datapath. */
>>>>>>>> --
>>>>>>>> 2.7.4
>>>>>>>
>>>>> _______________________________________________
>>>>> dev mailing list
>>>>> dev@openvswitch.org
>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 0a62630..6784269 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -547,31 +547,18 @@  struct tx_port {
  * actions in either case.
  * */
 struct dp_netdev_pmd_thread {
-    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
-        struct dp_netdev *dp;
-        struct cmap_node node;          /* In 'dp->poll_threads'. */
-        pthread_cond_t cond;            /* For synchronizing pmd thread
-                                           reload. */
-    );
-
-    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
-        struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
-        pthread_t thread;
-        unsigned core_id;               /* CPU core id of this pmd thread. */
-        int numa_id;                    /* numa node id of this pmd thread. */
-    );
+    struct dp_netdev *dp;
+    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
+    struct cmap_node node;          /* In 'dp->poll_threads'. */
+
+    pthread_cond_t cond;            /* For synchronizing pmd thread reload. */
+    struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
 
     /* Per thread exact-match cache.  Note, the instance for cpu core
      * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
      * need to be protected by 'non_pmd_mutex'.  Every other instance
      * will only be accessed by its own pmd thread. */
-    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
-    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
-
-    /* Queue id used by this pmd thread to send packets on all netdevs if
-     * XPS disabled for this netdev. All static_tx_qid's are unique and less
-     * than 'cmap_count(dp->poll_threads)'. */
-    uint32_t static_tx_qid;
+    struct emc_cache flow_cache;
 
     /* Flow-Table and classifiers
      *
@@ -580,77 +567,68 @@  struct dp_netdev_pmd_thread {
      * 'flow_mutex'.
      */
     struct ovs_mutex flow_mutex;
-    PADDED_MEMBERS(CACHE_LINE_SIZE,
-        struct cmap flow_table OVS_GUARDED; /* Flow table. */
-
-        /* One classifier per in_port polled by the pmd */
-        struct cmap classifiers;
-        /* Periodically sort subtable vectors according to hit frequencies */
-        long long int next_optimization;
-        /* End of the next time interval for which processing cycles
-           are stored for each polled rxq. */
-        long long int rxq_next_cycle_store;
-
-        /* Cycles counters */
-        struct dp_netdev_pmd_cycles cycles;
-
-        /* Used to count cycles. See 'cycles_counter_end()'. */
-        unsigned long long last_cycles;
-        struct latch exit_latch;        /* For terminating the pmd thread. */
-    );
-
-    PADDED_MEMBERS(CACHE_LINE_SIZE,
-        /* Statistics. */
-        struct dp_netdev_pmd_stats stats;
-
-        struct seq *reload_seq;
-        uint64_t last_reload_seq;
-        atomic_bool reload;             /* Do we need to reload ports? */
-        bool isolated;
-
-        /* Set to true if the pmd thread needs to be reloaded. */
-        bool need_reload;
-        /* 5 pad bytes. */
-    );
-
-    PADDED_MEMBERS(CACHE_LINE_SIZE,
-        struct ovs_mutex port_mutex;    /* Mutex for 'poll_list'
-                                           and 'tx_ports'. */
-        /* 16 pad bytes. */
-    );
-    PADDED_MEMBERS(CACHE_LINE_SIZE,
-        /* List of rx queues to poll. */
-        struct hmap poll_list OVS_GUARDED;
-        /* Map of 'tx_port's used for transmission.  Written by the main
-         * thread, read by the pmd thread. */
-        struct hmap tx_ports OVS_GUARDED;
-    );
-    PADDED_MEMBERS(CACHE_LINE_SIZE,
-        /* These are thread-local copies of 'tx_ports'.  One contains only
-         * tunnel ports (that support push_tunnel/pop_tunnel), the other
-         * contains ports with at least one txq (that support send).
-         * A port can be in both.
-         *
-         * There are two separate maps to make sure that we don't try to
-         * execute OUTPUT on a device which has 0 txqs or PUSH/POP on a
-         * non-tunnel device.
-         *
-         * The instances for cpu core NON_PMD_CORE_ID can be accessed by
-         * multiple threads and thusly need to be protected by 'non_pmd_mutex'.
-         * Every other instance will only be accessed by its own pmd thread. */
-        struct hmap tnl_port_cache;
-        struct hmap send_port_cache;
-    );
-
-    PADDED_MEMBERS(CACHE_LINE_SIZE,
-        /* Only a pmd thread can write on its own 'cycles' and 'stats'.
-         * The main thread keeps 'stats_zero' and 'cycles_zero' as base
-         * values and subtracts them from 'stats' and 'cycles' before
-         * reporting to the user */
-        unsigned long long stats_zero[DP_N_STATS];
-        uint64_t cycles_zero[PMD_N_CYCLES];
-        /* 8 pad bytes. */
-    );
+    struct cmap flow_table OVS_GUARDED; /* Flow table. */
+
+    /* One classifier per in_port polled by the pmd */
+    struct cmap classifiers;
+    /* Periodically sort subtable vectors according to hit frequencies */
+    long long int next_optimization;
+    /* End of the next time interval for which processing cycles
+       are stored for each polled rxq. */
+    long long int rxq_next_cycle_store;
+
+    /* Statistics. */
+    struct dp_netdev_pmd_stats stats;
+
+    /* Cycles counters */
+    struct dp_netdev_pmd_cycles cycles;
+
+    /* Used to count cicles. See 'cycles_counter_end()' */
+    unsigned long long last_cycles;
+
+    struct latch exit_latch;        /* For terminating the pmd thread. */
+    struct seq *reload_seq;
+    uint64_t last_reload_seq;
+    atomic_bool reload;             /* Do we need to reload ports? */
+    pthread_t thread;
+    unsigned core_id;               /* CPU core id of this pmd thread. */
+    int numa_id;                    /* numa node id of this pmd thread. */
+    bool isolated;
+
+    /* Queue id used by this pmd thread to send packets on all netdevs if
+     * XPS disabled for this netdev. All static_tx_qid's are unique and less
+     * than 'cmap_count(dp->poll_threads)'. */
+    uint32_t static_tx_qid;
+
+    struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and 'tx_ports'. */
+    /* List of rx queues to poll. */
+    struct hmap poll_list OVS_GUARDED;
+    /* Map of 'tx_port's used for transmission.  Written by the main thread,
+     * read by the pmd thread. */
+    struct hmap tx_ports OVS_GUARDED;
+
+    /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
+     * ports (that support push_tunnel/pop_tunnel), the other contains ports
+     * with at least one txq (that support send).  A port can be in both.
+     *
+     * There are two separate maps to make sure that we don't try to execute
+     * OUTPUT on a device which has 0 txqs or PUSH/POP on a non-tunnel device.
+     *
+     * The instances for cpu core NON_PMD_CORE_ID can be accessed by multiple
+     * threads, and thusly need to be protected by 'non_pmd_mutex'.  Every
+     * other instance will only be accessed by its own pmd thread. */
+    struct hmap tnl_port_cache;
+    struct hmap send_port_cache;
+
+    /* Only a pmd thread can write on its own 'cycles' and 'stats'.
+     * The main thread keeps 'stats_zero' and 'cycles_zero' as base
+     * values and subtracts them from 'stats' and 'cycles' before
+     * reporting to the user */
+    unsigned long long stats_zero[DP_N_STATS];
+    uint64_t cycles_zero[PMD_N_CYCLES];
+
+    /* Set to true if the pmd thread needs to be reloaded. */
+    bool need_reload;
 };
 
 /* Interface to netdev-based datapath. */