[ovs-dev,v5,4/6] dpif-netdev: Change rxq_scheduling to use rxq processing cycles.

Message ID 1503495281-1741-5-git-send-email-ktraynor@redhat.com
State Superseded
Delegated to: Darrell Ball

Commit Message

Kevin Traynor Aug. 23, 2017, 1:34 p.m. UTC
Previously rxqs were assigned to pmds by round robin in
port/queue order.

Now that we have the processing cycles used for existing rxqs,
use that information to try to produce a better-balanced
distribution of rxqs across pmds. I.e. given multiple pmds, the
rxqs which have consumed the largest amount of processing cycles
will be placed on different pmds.

The rxqs are sorted by their processing cycles and assigned (in
sorted order) round robin across pmds.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
---
 Documentation/howto/dpdk.rst |   7 +++
 lib/dpif-netdev.c            | 123 ++++++++++++++++++++++++++++++++-----------
 2 files changed, 99 insertions(+), 31 deletions(-)

Comments

Darrell Ball Aug. 24, 2017, 8:47 a.m. UTC | #1
There is a minor checkpatch warning


== Checking "/home/dball/Desktop/patches/ovs-dev-v5-4-6-dpif-netdev-Change-rxq_scheduling-to-use-rxq-processing-cycles..patch" ==
WARNING: Line lacks whitespace around operator
#170 FILE: lib/dpif-netdev.c:3456:
               Round-robin on the NUMA nodes that do have pmds. */

Lines checked: 204, Warnings: 1, Errors: 0

I marked it below

    
    On 8/23/17, 6:34 AM, "Kevin Traynor" <ktraynor@redhat.com> wrote:
    
        Previously rxqs were assigned to pmds by round robin in
        port/queue order.
        
        Now that we have the processing cycles used for existing rxqs,
        use that information to try and produced a better balanced
        distribution of rxqs across pmds. i.e. given multiple pmds, the
        rxqs which have consumed the largest amount of processing cycles
        will be placed on different pmds.
        
        The rxqs are sorted by their processing cycles and assigned (in
        sorted order) round robin across pmds.
        
        Signed-off-by: Kevin Traynor <ktraynor@redhat.com>

        ---
         Documentation/howto/dpdk.rst |   7 +++
         lib/dpif-netdev.c            | 123 ++++++++++++++++++++++++++++++++-----------
         2 files changed, 99 insertions(+), 31 deletions(-)
        
        diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
        index d7f6610..44737e4 100644
        --- a/Documentation/howto/dpdk.rst
        +++ b/Documentation/howto/dpdk.rst
        @@ -119,4 +119,11 @@ After that PMD threads on cores where RX queues was pinned will become
           thread.
         
        +If pmd-rxq-affinity is not set for rxqs, they will be assigned to pmds (cores)
        +automatically. The processing cycles that have been required for each rxq

        +will be used where known to assign rxqs with the highest consumption of
        +processing cycles to different pmds.

‘will be used where known to assign rxqs to pmds based on round robin of the sorted rxqs’ ?


        +
        +Rxq to pmds assignment takes place whenever there are configuration changes.
        +
         QoS
         ---
        diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
        index 5670c55..afbf591 100644
        --- a/lib/dpif-netdev.c
        +++ b/lib/dpif-netdev.c
        @@ -712,4 +712,6 @@ static void
         dp_netdev_rxq_set_interval(struct dp_netdev_rxq *rx,
                                    unsigned long long cycles);
        +static uint64_t
        +dp_netdev_rxq_get_interval(struct dp_netdev_rxq *rx, unsigned idx);
         static void
         dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread *pmd,
        @@ -3166,4 +3168,12 @@ dp_netdev_rxq_set_interval(struct dp_netdev_rxq *rx,
         }
         
        +static uint64_t
        +dp_netdev_rxq_get_interval(struct dp_netdev_rxq *rx, unsigned idx)
        +{
        +    unsigned long long tmp;
        +    atomic_read_relaxed(&rx->cycles_intrvl[idx], &tmp);
        +    return tmp;

Could we use something like ‘intrv_processing_cycles’ instead of ‘tmp’?
Also ‘_get_intrv_cycles’? Same forward comment I mentioned in patch 3.


        +}
        +
         static int
         dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
        @@ -3352,8 +3362,37 @@ rr_numa_list_destroy(struct rr_numa_list *rr)
         }
         
        +/* Sort Rx Queues by the processing cycles they are consuming. */
        +static int
        +rxq_cycle_sort(const void *a, const void *b)
        +{
        +    struct dp_netdev_rxq * qa;
        +    struct dp_netdev_rxq * qb;
        +    uint64_t total_qa, total_qb;
        +    unsigned i;
        +
        +    qa = *(struct dp_netdev_rxq **) a;
        +    qb = *(struct dp_netdev_rxq **) b;
        +
        +    total_qa = total_qb = 0;
        +    for (i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
        +        total_qa += dp_netdev_rxq_get_interval(qa, i);
        +        total_qb += dp_netdev_rxq_get_interval(qb, i);
        +    }
        +    dp_netdev_rxq_set_cycles(qa, RXQ_CYCLES_PROC_HIST, total_qa);
        +    dp_netdev_rxq_set_cycles(qb, RXQ_CYCLES_PROC_HIST, total_qb);
        +
        +    if (total_qa >= total_qb) {
        +        return -1;
        +    }
        +    return 1;
        +}
        +
         /* Assign pmds to queues.  If 'pinned' is true, assign pmds to pinned
          * queues and marks the pmds as isolated.  Otherwise, assign non isolated
          * pmds to unpinned queues.
          *
        + * If 'pinned' is false queues will be sorted by processing cycles they are
        + * consuming and then assigned to pmds in round robin order.
        + *
          * The function doesn't touch the pmd threads, it just stores the assignment
          * in the 'pmd' member of each rxq. */
        @@ -3364,18 +3403,14 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
             struct rr_numa_list rr;
             struct rr_numa *non_local_numa = NULL;
        -
        -    rr_numa_list_populate(dp, &rr);
        +    struct dp_netdev_rxq ** rxqs = NULL;
        +    int i, n_rxqs = 0;
        +    struct rr_numa *numa = NULL;
        +    int numa_id;
         
             HMAP_FOR_EACH (port, node, &dp->ports) {
        -        struct rr_numa *numa;
        -        int numa_id;
        -
                 if (!netdev_is_pmd(port->netdev)) {
                     continue;
                 }
         
        -        numa_id = netdev_get_numa_id(port->netdev);
        -        numa = rr_numa_list_lookup(&rr, numa_id);
        -
                 for (int qid = 0; qid < port->n_rxq; qid++) {
                     struct dp_netdev_rxq *q = &port->rxqs[qid];
        @@ -3395,34 +3430,60 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
                         }
                     } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
        -                if (!numa) {
        -                    /* There are no pmds on the queue's local NUMA node.
        -                       Round-robin on the NUMA nodes that do have pmds. */
        -                    non_local_numa = rr_numa_list_next(&rr, non_local_numa);
        -                    if (!non_local_numa) {
        -                        VLOG_ERR("There is no available (non-isolated) pmd "
        -                                 "thread for port \'%s\' queue %d. This queue "
        -                                 "will not be polled. Is pmd-cpu-mask set to "
        -                                 "zero? Or are all PMDs isolated to other "
        -                                 "queues?", netdev_get_name(port->netdev),
        -                                 qid);
        -                        continue;
        -                    }
        -                    q->pmd = rr_numa_get_pmd(non_local_numa);
        -                    VLOG_WARN("There's no available (non-isolated) pmd thread "
        -                              "on numa node %d. Queue %d on port \'%s\' will "
        -                              "be assigned to the pmd on core %d "
        -                              "(numa node %d). Expect reduced performance.",
        -                              numa_id, qid, netdev_get_name(port->netdev),
        -                              q->pmd->core_id, q->pmd->numa_id);
        +                if (n_rxqs == 0) {

checkpatch lack of whitespace warning above




        +                    rxqs = xmalloc(sizeof *rxqs);
                         } else {
        -                    /* Assign queue to the next (round-robin) PMD on it's local
        -                       NUMA node. */
        -                    q->pmd = rr_numa_get_pmd(numa);
        +                    rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
                         }
        +                /* Store the queue. */
        +                rxqs[n_rxqs++] = q;
                     }
                 }
             }
         
        +    if (n_rxqs > 1) {
        +        /* Sort the queues in order of the processing cycles
        +         * they consumed during their last pmd interval. */
        +        qsort(rxqs, n_rxqs, sizeof *rxqs, rxq_cycle_sort);
        +    }
        +
        +    rr_numa_list_populate(dp, &rr);
        +    /* Assign the sorted queues to pmds in round robin. */
        +    for (i = 0; i < n_rxqs; i++) {
        +        numa_id = netdev_get_numa_id(rxqs[i]->port->netdev);
        +        numa = rr_numa_list_lookup(&rr, numa_id);
        +        if (!numa) {
        +            /* There are no pmds on the queue's local NUMA node.
        +               Round-robin on the NUMA nodes that do have pmds. */
        +            non_local_numa = rr_numa_list_next(&rr, non_local_numa);
        +            if (!non_local_numa) {
        +                VLOG_ERR("There is no available (non-isolated) pmd "
        +                         "thread for port \'%s\' queue %d. This queue "
        +                         "will not be polled. Is pmd-cpu-mask set to "
        +                         "zero? Or are all PMDs isolated to other "
        +                         "queues?", netdev_rxq_get_name(rxqs[i]->rx),
        +                         netdev_rxq_get_queue_id(rxqs[i]->rx));
        +                continue;
        +            }
        +            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa);
        +            VLOG_WARN("There's no available (non-isolated) pmd thread "
        +                      "on numa node %d. Queue %d on port \'%s\' will "
        +                      "be assigned to the pmd on core %d "
        +                      "(numa node %d). Expect reduced performance.",
        +                      numa_id, netdev_rxq_get_queue_id(rxqs[i]->rx),
        +                      netdev_rxq_get_name(rxqs[i]->rx),
        +                      rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id);
        +        } else {
        +        rxqs[i]->pmd = rr_numa_get_pmd(numa);
        +        VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
        +                  "rx queue %d (measured processing cycles %"PRIu64").",
        +                  rxqs[i]->pmd->core_id, numa_id,
        +                  netdev_rxq_get_name(rxqs[i]->rx),
        +                  netdev_rxq_get_queue_id(rxqs[i]->rx),
        +                  dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_HIST));
        +        }
        +    }
        +
             rr_numa_list_destroy(&rr);
        +    free(rxqs);
         }
         
        -- 
        1.8.3.1
Darrell Ball Aug. 24, 2017, 9:06 a.m. UTC | #2
On 8/24/17, 1:47 AM, "Darrell Ball" <dball@vmware.com> wrote:

    There is a minor checkpatch warning
    
    
    == Checking "/home/dball/Desktop/patches/ovs-dev-v5-4-6-dpif-netdev-Change-rxq_scheduling-to-use-rxq-processing-cycles..patch" ==
    WARNING: Line lacks whitespace around operator
    #170 FILE: lib/dpif-netdev.c:3456:
                   Round-robin on the NUMA nodes that do have pmds. */
    
    Lines checked: 204, Warnings: 1, Errors: 0

[Darrell] 
Maybe try
* Round-robin on the NUMA nodes that do have pmds. */

    
    I marked it below

Ignore what I marked for the warning inline
    
        
        On 8/23/17, 6:34 AM, "Kevin Traynor" <ktraynor@redhat.com> wrote:
        
            Previously rxqs were assigned to pmds by round robin in
            port/queue order.
            
            Now that we have the processing cycles used for existing rxqs,
            use that information to try and produced a better balanced
            distribution of rxqs across pmds. i.e. given multiple pmds, the
            rxqs which have consumed the largest amount of processing cycles
            will be placed on different pmds.
            
            The rxqs are sorted by their processing cycles and assigned (in
            sorted order) round robin across pmds.
            
            Signed-off-by: Kevin Traynor <ktraynor@redhat.com>

            ---
             Documentation/howto/dpdk.rst |   7 +++
             lib/dpif-netdev.c            | 123 ++++++++++++++++++++++++++++++++-----------
             2 files changed, 99 insertions(+), 31 deletions(-)
            
            diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
            index d7f6610..44737e4 100644
            --- a/Documentation/howto/dpdk.rst
            +++ b/Documentation/howto/dpdk.rst
            @@ -119,4 +119,11 @@ After that PMD threads on cores where RX queues was pinned will become
               thread.
             
            +If pmd-rxq-affinity is not set for rxqs, they will be assigned to pmds (cores)
            +automatically. The processing cycles that have been required for each rxq
    
            +will be used where known to assign rxqs with the highest consumption of
            +processing cycles to different pmds.
    
    ‘will be used where known to assign rxqs to pmds based on round robin of the sorted rxqs’ ?
    
    
            +
            +Rxq to pmds assignment takes place whenever there are configuration changes.
            +
             QoS
             ---
            diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
            index 5670c55..afbf591 100644
            --- a/lib/dpif-netdev.c
            +++ b/lib/dpif-netdev.c
            @@ -712,4 +712,6 @@ static void
             dp_netdev_rxq_set_interval(struct dp_netdev_rxq *rx,
                                        unsigned long long cycles);
            +static uint64_t
            +dp_netdev_rxq_get_interval(struct dp_netdev_rxq *rx, unsigned idx);
             static void
             dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread *pmd,
            @@ -3166,4 +3168,12 @@ dp_netdev_rxq_set_interval(struct dp_netdev_rxq *rx,
             }
             
            +static uint64_t
            +dp_netdev_rxq_get_interval(struct dp_netdev_rxq *rx, unsigned idx)
            +{
            +    unsigned long long tmp;
            +    atomic_read_relaxed(&rx->cycles_intrvl[idx], &tmp);
            +    return tmp;
    
    Could we use something like ‘intrv_processing_cycles’ instead of ‘tmp’
    Also _get_intrv_cycles ?; same forward comment I mentioned in patch 3
    
    
            +}
            +
             static int
             dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
            @@ -3352,8 +3362,37 @@ rr_numa_list_destroy(struct rr_numa_list *rr)
             }
             
            +/* Sort Rx Queues by the processing cycles they are consuming. */
            +static int
            +rxq_cycle_sort(const void *a, const void *b)
            +{
            +    struct dp_netdev_rxq * qa;
            +    struct dp_netdev_rxq * qb;
            +    uint64_t total_qa, total_qb;
            +    unsigned i;
            +
            +    qa = *(struct dp_netdev_rxq **) a;
            +    qb = *(struct dp_netdev_rxq **) b;
            +
            +    total_qa = total_qb = 0;
            +    for (i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
            +        total_qa += dp_netdev_rxq_get_interval(qa, i);
            +        total_qb += dp_netdev_rxq_get_interval(qb, i);
            +    }
            +    dp_netdev_rxq_set_cycles(qa, RXQ_CYCLES_PROC_HIST, total_qa);
            +    dp_netdev_rxq_set_cycles(qb, RXQ_CYCLES_PROC_HIST, total_qb);
            +
            +    if (total_qa >= total_qb) {
            +        return -1;
            +    }
            +    return 1;
            +}
            +
             /* Assign pmds to queues.  If 'pinned' is true, assign pmds to pinned
              * queues and marks the pmds as isolated.  Otherwise, assign non isolated
              * pmds to unpinned queues.
              *
            + * If 'pinned' is false queues will be sorted by processing cycles they are
            + * consuming and then assigned to pmds in round robin order.
            + *
              * The function doesn't touch the pmd threads, it just stores the assignment
              * in the 'pmd' member of each rxq. */
            @@ -3364,18 +3403,14 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
                 struct rr_numa_list rr;
                 struct rr_numa *non_local_numa = NULL;
            -
            -    rr_numa_list_populate(dp, &rr);
            +    struct dp_netdev_rxq ** rxqs = NULL;
            +    int i, n_rxqs = 0;
            +    struct rr_numa *numa = NULL;
            +    int numa_id;
             
                 HMAP_FOR_EACH (port, node, &dp->ports) {
            -        struct rr_numa *numa;
            -        int numa_id;
            -
                     if (!netdev_is_pmd(port->netdev)) {
                         continue;
                     }
             
            -        numa_id = netdev_get_numa_id(port->netdev);
            -        numa = rr_numa_list_lookup(&rr, numa_id);
            -
                     for (int qid = 0; qid < port->n_rxq; qid++) {
                         struct dp_netdev_rxq *q = &port->rxqs[qid];
            @@ -3395,34 +3430,60 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
                             }
                         } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
            -                if (!numa) {
            -                    /* There are no pmds on the queue's local NUMA node.
            -                       Round-robin on the NUMA nodes that do have pmds. */
            -                    non_local_numa = rr_numa_list_next(&rr, non_local_numa);
            -                    if (!non_local_numa) {
            -                        VLOG_ERR("There is no available (non-isolated) pmd "
            -                                 "thread for port \'%s\' queue %d. This queue "
            -                                 "will not be polled. Is pmd-cpu-mask set to "
            -                                 "zero? Or are all PMDs isolated to other "
            -                                 "queues?", netdev_get_name(port->netdev),
            -                                 qid);
            -                        continue;
            -                    }
            -                    q->pmd = rr_numa_get_pmd(non_local_numa);
            -                    VLOG_WARN("There's no available (non-isolated) pmd thread "
            -                              "on numa node %d. Queue %d on port \'%s\' will "
            -                              "be assigned to the pmd on core %d "
            -                              "(numa node %d). Expect reduced performance.",
            -                              numa_id, qid, netdev_get_name(port->netdev),
            -                              q->pmd->core_id, q->pmd->numa_id);
            +                if (n_rxqs == 0) {
    
    checkpatch lack of whitespace warning above

[Darrell]
Ignore this – I rechecked this line and it was fine
    
    
    
    
            +                    rxqs = xmalloc(sizeof *rxqs);
                             } else {
            -                    /* Assign queue to the next (round-robin) PMD on it's local
            -                       NUMA node. */
            -                    q->pmd = rr_numa_get_pmd(numa);
            +                    rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
                             }
            +                /* Store the queue. */
            +                rxqs[n_rxqs++] = q;
                         }
                     }
                 }
             
            +    if (n_rxqs > 1) {
            +        /* Sort the queues in order of the processing cycles
            +         * they consumed during their last pmd interval. */
            +        qsort(rxqs, n_rxqs, sizeof *rxqs, rxq_cycle_sort);
            +    }
            +
            +    rr_numa_list_populate(dp, &rr);
            +    /* Assign the sorted queues to pmds in round robin. */
            +    for (i = 0; i < n_rxqs; i++) {
            +        numa_id = netdev_get_numa_id(rxqs[i]->port->netdev);
            +        numa = rr_numa_list_lookup(&rr, numa_id);
            +        if (!numa) {
            +            /* There are no pmds on the queue's local NUMA node.
            +               Round-robin on the NUMA nodes that do have pmds. */
            +            non_local_numa = rr_numa_list_next(&rr, non_local_numa);
            +            if (!non_local_numa) {
            +                VLOG_ERR("There is no available (non-isolated) pmd "
            +                         "thread for port \'%s\' queue %d. This queue "
            +                         "will not be polled. Is pmd-cpu-mask set to "
            +                         "zero? Or are all PMDs isolated to other "
            +                         "queues?", netdev_rxq_get_name(rxqs[i]->rx),
            +                         netdev_rxq_get_queue_id(rxqs[i]->rx));
            +                continue;
            +            }
            +            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa);
            +            VLOG_WARN("There's no available (non-isolated) pmd thread "
            +                      "on numa node %d. Queue %d on port \'%s\' will "
            +                      "be assigned to the pmd on core %d "
            +                      "(numa node %d). Expect reduced performance.",
            +                      numa_id, netdev_rxq_get_queue_id(rxqs[i]->rx),
            +                      netdev_rxq_get_name(rxqs[i]->rx),
            +                      rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id);
            +        } else {
            +        rxqs[i]->pmd = rr_numa_get_pmd(numa);
            +        VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
            +                  "rx queue %d (measured processing cycles %"PRIu64").",
            +                  rxqs[i]->pmd->core_id, numa_id,
            +                  netdev_rxq_get_name(rxqs[i]->rx),
            +                  netdev_rxq_get_queue_id(rxqs[i]->rx),
            +                  dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_HIST));
            +        }
            +    }
            +
                 rr_numa_list_destroy(&rr);
            +    free(rxqs);
             }
             
            -- 
            1.8.3.1
Kevin Traynor Aug. 24, 2017, 11:33 p.m. UTC | #3
On 08/24/2017 10:06 AM, Darrell Ball wrote:
> 
> 
> On 8/24/17, 1:47 AM, "Darrell Ball" <dball@vmware.com> wrote:
> 
>     There is a minor checkpatch warning
>     
>     
>     == Checking "/home/dball/Desktop/patches/ovs-dev-v5-4-6-dpif-netdev-Change-rxq_scheduling-to-use-rxq-processing-cycles..patch" ==
>     WARNING: Line lacks whitespace around operator
>     #170 FILE: lib/dpif-netdev.c:3456:
>                    Round-robin on the NUMA nodes that do have pmds. */
>     
>     Lines checked: 204, Warnings: 1, Errors: 0
> 
> [Darrell] 
> Maybe try
> * Round-robin on the NUMA nodes that do have pmds. */
> 
>     
>     I marked it below
> 
> Ignore what I marked for the warning inline
> 

I noticed the checkpatch warning too. It's a false positive for the
"Round-robin" in a comment block from a previous commit that I moved to
later in the function. I didn't want to confuse things by fixing it
inline with the other changes, but now that you've highlighted it I
might as well change it.

>         
>         On 8/23/17, 6:34 AM, "Kevin Traynor" <ktraynor@redhat.com> wrote:
>         
>             Previously rxqs were assigned to pmds by round robin in
>             port/queue order.
>             
>             Now that we have the processing cycles used for existing rxqs,
>             use that information to try and produced a better balanced
>             distribution of rxqs across pmds. i.e. given multiple pmds, the
>             rxqs which have consumed the largest amount of processing cycles
>             will be placed on different pmds.
>             
>             The rxqs are sorted by their processing cycles and assigned (in
>             sorted order) round robin across pmds.
>             
>             Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
>             ---
>              Documentation/howto/dpdk.rst |   7 +++
>              lib/dpif-netdev.c            | 123 ++++++++++++++++++++++++++++++++-----------
>              2 files changed, 99 insertions(+), 31 deletions(-)
>             
>             diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
>             index d7f6610..44737e4 100644
>             --- a/Documentation/howto/dpdk.rst
>             +++ b/Documentation/howto/dpdk.rst
>             @@ -119,4 +119,11 @@ After that PMD threads on cores where RX queues was pinned will become
>                thread.
>              
>             +If pmd-rxq-affinity is not set for rxqs, they will be assigned to pmds (cores)
>             +automatically. The processing cycles that have been required for each rxq
>     

I noticed this also
s/required/stored

>             +will be used where known to assign rxqs with the highest consumption of
>             +processing cycles to different pmds.
>     
>     ‘will be used where known to assign rxqs to pmds based on round robin of the sorted rxqs’ ?
>     

changed

>     
>             +
>             +Rxq to pmds assignment takes place whenever there are configuration changes.
>             +
>              QoS
>              ---
>             diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>             index 5670c55..afbf591 100644
>             --- a/lib/dpif-netdev.c
>             +++ b/lib/dpif-netdev.c
>             @@ -712,4 +712,6 @@ static void
>              dp_netdev_rxq_set_interval(struct dp_netdev_rxq *rx,
>                                         unsigned long long cycles);
>             +static uint64_t
>             +dp_netdev_rxq_get_interval(struct dp_netdev_rxq *rx, unsigned idx);
>              static void
>              dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread *pmd,
>             @@ -3166,4 +3168,12 @@ dp_netdev_rxq_set_interval(struct dp_netdev_rxq *rx,
>              }
>              
>             +static uint64_t
>             +dp_netdev_rxq_get_interval(struct dp_netdev_rxq *rx, unsigned idx)
>             +{
>             +    unsigned long long tmp;
>             +    atomic_read_relaxed(&rx->cycles_intrvl[idx], &tmp);
>             +    return tmp;
>     
>     Could we use something like ‘intrv_processing_cycles’ instead of ‘tmp’
>     Also _get_intrv_cycles ?; same forward comment I mentioned in patch 3
>     

removed tmp and changed to processing_cycles. made _get_intrvl_cycles to
match the _set_intrvl_cycles in 3/6

>     
>             +}
>             +
>              static int
>              dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>             @@ -3352,8 +3362,37 @@ rr_numa_list_destroy(struct rr_numa_list *rr)
>              }
>              
>             +/* Sort Rx Queues by the processing cycles they are consuming. */
>             +static int
>             +rxq_cycle_sort(const void *a, const void *b)
>             +{
>             +    struct dp_netdev_rxq * qa;
>             +    struct dp_netdev_rxq * qb;
>             +    uint64_t total_qa, total_qb;
>             +    unsigned i;
>             +
>             +    qa = *(struct dp_netdev_rxq **) a;
>             +    qb = *(struct dp_netdev_rxq **) b;
>             +
>             +    total_qa = total_qb = 0;
>             +    for (i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
>             +        total_qa += dp_netdev_rxq_get_interval(qa, i);
>             +        total_qb += dp_netdev_rxq_get_interval(qb, i);
>             +    }
>             +    dp_netdev_rxq_set_cycles(qa, RXQ_CYCLES_PROC_HIST, total_qa);
>             +    dp_netdev_rxq_set_cycles(qb, RXQ_CYCLES_PROC_HIST, total_qb);
>             +
>             +    if (total_qa >= total_qb) {
>             +        return -1;
>             +    }
>             +    return 1;
>             +}
>             +
>              /* Assign pmds to queues.  If 'pinned' is true, assign pmds to pinned
>               * queues and marks the pmds as isolated.  Otherwise, assign non isolated
>               * pmds to unpinned queues.
>               *
>             + * If 'pinned' is false queues will be sorted by processing cycles they are
>             + * consuming and then assigned to pmds in round robin order.
>             + *
>               * The function doesn't touch the pmd threads, it just stores the assignment
>               * in the 'pmd' member of each rxq. */
>             @@ -3364,18 +3403,14 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>                  struct rr_numa_list rr;
>                  struct rr_numa *non_local_numa = NULL;
>             -
>             -    rr_numa_list_populate(dp, &rr);
>             +    struct dp_netdev_rxq ** rxqs = NULL;
>             +    int i, n_rxqs = 0;
>             +    struct rr_numa *numa = NULL;
>             +    int numa_id;
>              
>                  HMAP_FOR_EACH (port, node, &dp->ports) {
>             -        struct rr_numa *numa;
>             -        int numa_id;
>             -
>                      if (!netdev_is_pmd(port->netdev)) {
>                          continue;
>                      }
>              
>             -        numa_id = netdev_get_numa_id(port->netdev);
>             -        numa = rr_numa_list_lookup(&rr, numa_id);
>             -
>                      for (int qid = 0; qid < port->n_rxq; qid++) {
>                          struct dp_netdev_rxq *q = &port->rxqs[qid];
>             @@ -3395,34 +3430,60 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>                              }
>                          } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
>             -                if (!numa) {
>             -                    /* There are no pmds on the queue's local NUMA node.
>             -                       Round-robin on the NUMA nodes that do have pmds. */
>             -                    non_local_numa = rr_numa_list_next(&rr, non_local_numa);
>             -                    if (!non_local_numa) {
>             -                        VLOG_ERR("There is no available (non-isolated) pmd "
>             -                                 "thread for port \'%s\' queue %d. This queue "
>             -                                 "will not be polled. Is pmd-cpu-mask set to "
>             -                                 "zero? Or are all PMDs isolated to other "
>             -                                 "queues?", netdev_get_name(port->netdev),
>             -                                 qid);
>             -                        continue;
>             -                    }
>             -                    q->pmd = rr_numa_get_pmd(non_local_numa);
>             -                    VLOG_WARN("There's no available (non-isolated) pmd thread "
>             -                              "on numa node %d. Queue %d on port \'%s\' will "
>             -                              "be assigned to the pmd on core %d "
>             -                              "(numa node %d). Expect reduced performance.",
>             -                              numa_id, qid, netdev_get_name(port->netdev),
>             -                              q->pmd->core_id, q->pmd->numa_id);
>             +                if (n_rxqs == 0) {
>     
>     checkpatch lack of whitespace warning above
> 
> [Darrell]
> Ignore this – I rechecked this line and it was fine
>     

np, thanks. As mentioned above, I changed the "Round-robin" comment that
was causing the checkpatch issue.
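
As background for reviewers, the per-interval cycle accounting this series relies on can be sketched standalone. The `toy_*` names and `TOY_INTERVAL_MAX` below are illustrative stand-ins, not the real dpif-netdev definitions; the real code uses OVS's atomics wrappers rather than <stdatomic.h> directly:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the rxq cycle history: one counter per measurement
 * interval, written by the pmd thread and read by the scheduler. */
#define TOY_INTERVAL_MAX 6

struct toy_rxq_hist {
    atomic_uint_fast64_t cycles_intrvl[TOY_INTERVAL_MAX];
};

/* Relaxed read of one interval, mirroring dp_netdev_rxq_get_interval. */
static uint64_t
toy_rxq_get_interval(struct toy_rxq_hist *rx, unsigned idx)
{
    return atomic_load_explicit(&rx->cycles_intrvl[idx],
                                memory_order_relaxed);
}

/* Sum over the whole history, as rxq_cycle_sort does per queue. */
static uint64_t
toy_rxq_total_cycles(struct toy_rxq_hist *rx)
{
    uint64_t total = 0;
    for (unsigned i = 0; i < TOY_INTERVAL_MAX; i++) {
        total += toy_rxq_get_interval(rx, i);
    }
    return total;
}
```

Relaxed ordering is enough here because the scheduler only needs an approximate, eventually-consistent view of the counters, not synchronization with the pmd thread.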

>     
>     
>     
>             +                    rxqs = xmalloc(sizeof *rxqs);
>                              } else {
>             -                    /* Assign queue to the next (round-robin) PMD on it's local
>             -                       NUMA node. */
>             -                    q->pmd = rr_numa_get_pmd(numa);
>             +                    rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
>                              }
>             +                /* Store the queue. */
>             +                rxqs[n_rxqs++] = q;
>                          }
>                      }
>                  }
>              
>             +    if (n_rxqs > 1) {
>             +        /* Sort the queues in order of the processing cycles
>             +         * they consumed during their measurement intervals. */
>             +        qsort(rxqs, n_rxqs, sizeof *rxqs, rxq_cycle_sort);
>             +    }
>             +
>             +    rr_numa_list_populate(dp, &rr);
>             +    /* Assign the sorted queues to pmds in round robin. */
>             +    for (i = 0; i < n_rxqs; i++) {
>             +        numa_id = netdev_get_numa_id(rxqs[i]->port->netdev);
>             +        numa = rr_numa_list_lookup(&rr, numa_id);
>             +        if (!numa) {
>             +            /* There are no pmds on the queue's local NUMA node.
>             +               Round-robin on the NUMA nodes that do have pmds. */
>             +            non_local_numa = rr_numa_list_next(&rr, non_local_numa);
>             +            if (!non_local_numa) {
>             +                VLOG_ERR("There is no available (non-isolated) pmd "
>             +                         "thread for port \'%s\' queue %d. This queue "
>             +                         "will not be polled. Is pmd-cpu-mask set to "
>             +                         "zero? Or are all PMDs isolated to other "
>             +                         "queues?", netdev_rxq_get_name(rxqs[i]->rx),
>             +                         netdev_rxq_get_queue_id(rxqs[i]->rx));
>             +                continue;
>             +            }
>             +            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa);
>             +            VLOG_WARN("There's no available (non-isolated) pmd thread "
>             +                      "on numa node %d. Queue %d on port \'%s\' will "
>             +                      "be assigned to the pmd on core %d "
>             +                      "(numa node %d). Expect reduced performance.",
>             +                      numa_id, netdev_rxq_get_queue_id(rxqs[i]->rx),
>             +                      netdev_rxq_get_name(rxqs[i]->rx),
>             +                      rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id);
>             +        } else {
>             +            rxqs[i]->pmd = rr_numa_get_pmd(numa);
>             +            VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
>             +                      "rx queue %d (measured processing cycles %"PRIu64").",
>             +                      rxqs[i]->pmd->core_id, numa_id,
>             +                      netdev_rxq_get_name(rxqs[i]->rx),
>             +                      netdev_rxq_get_queue_id(rxqs[i]->rx),
>             +                      dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_HIST));
>             +        }
>             +    }
>             +
>                  rr_numa_list_destroy(&rr);
>             +    free(rxqs);
>              }
>              
>             -- 
>             1.8.3.1
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
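
As an aside on the collection step in the patch below: rxq_scheduling grows the rxq pointer array one element per queue with xmalloc/xrealloc. A minimal standalone sketch of that pattern, using plain malloc/realloc and toy names rather than the OVS util wrappers (which abort on allocation failure, so here we abort() explicitly):

```c
#include <assert.h>
#include <stdlib.h>

/* Collect pointers to matching items into a dynamically grown array,
 * one realloc per element, as rxq_scheduling does for unpinned rxqs.
 * Stores the count in *n_out; the caller frees the returned array. */
static int **
collect_even(int *items, int n_items, int *n_out)
{
    int **out = NULL;
    int n = 0;

    for (int i = 0; i < n_items; i++) {
        if (items[i] % 2 == 0) {
            /* Grow by exactly one slot, mirroring the patch's
             * xmalloc/xrealloc sequence. */
            out = n == 0 ? malloc(sizeof *out)
                         : realloc(out, sizeof *out * (n + 1));
            if (!out) {
                abort();
            }
            out[n++] = &items[i];
        }
    }
    *n_out = n;
    return out;
}
```

Growing geometrically would avoid the quadratic copy cost, but for the handful of rxqs involved the per-element xrealloc is simple and adequate.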

Patch

diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
index d7f6610..44737e4 100644
--- a/Documentation/howto/dpdk.rst
+++ b/Documentation/howto/dpdk.rst
@@ -119,4 +119,11 @@  After that PMD threads on cores where RX queues was pinned will become
   thread.
 
+If pmd-rxq-affinity is not set for rxqs, they will be assigned to pmds (cores)
+automatically. Where known, the processing cycles previously consumed by each
+rxq will be used to assign the rxqs with the highest cycle consumption to
+different pmds.
+
+Rxq to pmd assignment takes place whenever there are configuration changes.
+
 QoS
 ---
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 5670c55..afbf591 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -712,4 +712,6 @@  static void
 dp_netdev_rxq_set_interval(struct dp_netdev_rxq *rx,
                            unsigned long long cycles);
+static uint64_t
+dp_netdev_rxq_get_interval(struct dp_netdev_rxq *rx, unsigned idx);
 static void
 dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread *pmd,
@@ -3166,4 +3168,12 @@  dp_netdev_rxq_set_interval(struct dp_netdev_rxq *rx,
 }
 
+static uint64_t
+dp_netdev_rxq_get_interval(struct dp_netdev_rxq *rx, unsigned idx)
+{
+    unsigned long long tmp;
+    atomic_read_relaxed(&rx->cycles_intrvl[idx], &tmp);
+    return tmp;
+}
+
 static int
 dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
@@ -3352,8 +3362,37 @@  rr_numa_list_destroy(struct rr_numa_list *rr)
 }
 
+/* Sort Rx Queues by the processing cycles they are consuming. */
+static int
+rxq_cycle_sort(const void *a, const void *b)
+{
+    struct dp_netdev_rxq *qa;
+    struct dp_netdev_rxq *qb;
+    uint64_t total_qa, total_qb;
+    unsigned i;
+
+    qa = *(struct dp_netdev_rxq **) a;
+    qb = *(struct dp_netdev_rxq **) b;
+
+    total_qa = total_qb = 0;
+    for (i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
+        total_qa += dp_netdev_rxq_get_interval(qa, i);
+        total_qb += dp_netdev_rxq_get_interval(qb, i);
+    }
+    dp_netdev_rxq_set_cycles(qa, RXQ_CYCLES_PROC_HIST, total_qa);
+    dp_netdev_rxq_set_cycles(qb, RXQ_CYCLES_PROC_HIST, total_qb);
+
+    if (total_qa != total_qb) {
+        return total_qa > total_qb ? -1 : 1;
+    }
+    return 0;
+}
+
 /* Assign pmds to queues.  If 'pinned' is true, assign pmds to pinned
  * queues and marks the pmds as isolated.  Otherwise, assign non isolated
  * pmds to unpinned queues.
  *
+ * If 'pinned' is false, queues will be sorted by the processing cycles they
+ * are consuming and then assigned to pmds in round-robin order.
+ *
  * The function doesn't touch the pmd threads, it just stores the assignment
  * in the 'pmd' member of each rxq. */
@@ -3364,18 +3403,14 @@  rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
     struct rr_numa_list rr;
     struct rr_numa *non_local_numa = NULL;
-
-    rr_numa_list_populate(dp, &rr);
+    struct dp_netdev_rxq **rxqs = NULL;
+    int i, n_rxqs = 0;
+    struct rr_numa *numa = NULL;
+    int numa_id;
 
     HMAP_FOR_EACH (port, node, &dp->ports) {
-        struct rr_numa *numa;
-        int numa_id;
-
         if (!netdev_is_pmd(port->netdev)) {
             continue;
         }
 
-        numa_id = netdev_get_numa_id(port->netdev);
-        numa = rr_numa_list_lookup(&rr, numa_id);
-
         for (int qid = 0; qid < port->n_rxq; qid++) {
             struct dp_netdev_rxq *q = &port->rxqs[qid];
@@ -3395,34 +3430,60 @@  rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
                 }
             } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
-                if (!numa) {
-                    /* There are no pmds on the queue's local NUMA node.
-                       Round-robin on the NUMA nodes that do have pmds. */
-                    non_local_numa = rr_numa_list_next(&rr, non_local_numa);
-                    if (!non_local_numa) {
-                        VLOG_ERR("There is no available (non-isolated) pmd "
-                                 "thread for port \'%s\' queue %d. This queue "
-                                 "will not be polled. Is pmd-cpu-mask set to "
-                                 "zero? Or are all PMDs isolated to other "
-                                 "queues?", netdev_get_name(port->netdev),
-                                 qid);
-                        continue;
-                    }
-                    q->pmd = rr_numa_get_pmd(non_local_numa);
-                    VLOG_WARN("There's no available (non-isolated) pmd thread "
-                              "on numa node %d. Queue %d on port \'%s\' will "
-                              "be assigned to the pmd on core %d "
-                              "(numa node %d). Expect reduced performance.",
-                              numa_id, qid, netdev_get_name(port->netdev),
-                              q->pmd->core_id, q->pmd->numa_id);
+                if (n_rxqs == 0) {
+                    rxqs = xmalloc(sizeof *rxqs);
                 } else {
-                    /* Assign queue to the next (round-robin) PMD on it's local
-                       NUMA node. */
-                    q->pmd = rr_numa_get_pmd(numa);
+                    rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
                 }
+                /* Store the queue. */
+                rxqs[n_rxqs++] = q;
             }
         }
     }
 
+    if (n_rxqs > 1) {
+        /* Sort the queues in order of the processing cycles
+         * they consumed during their measurement intervals. */
+        qsort(rxqs, n_rxqs, sizeof *rxqs, rxq_cycle_sort);
+    }
+
+    rr_numa_list_populate(dp, &rr);
+    /* Assign the sorted queues to pmds in round robin. */
+    for (i = 0; i < n_rxqs; i++) {
+        numa_id = netdev_get_numa_id(rxqs[i]->port->netdev);
+        numa = rr_numa_list_lookup(&rr, numa_id);
+        if (!numa) {
+            /* There are no pmds on the queue's local NUMA node.
+               Round-robin on the NUMA nodes that do have pmds. */
+            non_local_numa = rr_numa_list_next(&rr, non_local_numa);
+            if (!non_local_numa) {
+                VLOG_ERR("There is no available (non-isolated) pmd "
+                         "thread for port \'%s\' queue %d. This queue "
+                         "will not be polled. Is pmd-cpu-mask set to "
+                         "zero? Or are all PMDs isolated to other "
+                         "queues?", netdev_rxq_get_name(rxqs[i]->rx),
+                         netdev_rxq_get_queue_id(rxqs[i]->rx));
+                continue;
+            }
+            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa);
+            VLOG_WARN("There's no available (non-isolated) pmd thread "
+                      "on numa node %d. Queue %d on port \'%s\' will "
+                      "be assigned to the pmd on core %d "
+                      "(numa node %d). Expect reduced performance.",
+                      numa_id, netdev_rxq_get_queue_id(rxqs[i]->rx),
+                      netdev_rxq_get_name(rxqs[i]->rx),
+                      rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id);
+        } else {
+            rxqs[i]->pmd = rr_numa_get_pmd(numa);
+            VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
+                      "rx queue %d (measured processing cycles %"PRIu64").",
+                      rxqs[i]->pmd->core_id, numa_id,
+                      netdev_rxq_get_name(rxqs[i]->rx),
+                      netdev_rxq_get_queue_id(rxqs[i]->rx),
+                      dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_HIST));
+        }
+    }
+
     rr_numa_list_destroy(&rr);
+    free(rxqs);
 }
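
Taken together, the new policy reduces to: sort the unpinned rxqs by measured cycles, then deal them round robin across the available cores, so the busiest queues land on different pmds. A self-contained toy version of that policy follows; the `toy_*` names are illustrative stand-ins, not the dpif-netdev API, and NUMA locality is ignored for brevity:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Minimal stand-in for an rxq: measured cycles in, assigned core out. */
struct toy_rxq {
    uint64_t proc_cycles;
    int pmd_core;
};

/* Descending comparator in the style of rxq_cycle_sort: the busiest
 * queue sorts first. Returning 0 on ties keeps the ordering
 * consistent, as qsort() requires. */
static int
toy_rxq_cycle_sort(const void *a, const void *b)
{
    const struct toy_rxq *qa = *(const struct toy_rxq *const *) a;
    const struct toy_rxq *qb = *(const struct toy_rxq *const *) b;

    if (qa->proc_cycles != qb->proc_cycles) {
        return qa->proc_cycles > qb->proc_cycles ? -1 : 1;
    }
    return 0;
}

/* Sort, then assign cores round robin so the heaviest queues land
 * on different pmds. */
static void
toy_schedule(struct toy_rxq **rxqs, int n_rxqs,
             const int *cores, int n_cores)
{
    qsort(rxqs, n_rxqs, sizeof *rxqs, toy_rxq_cycle_sort);
    for (int i = 0; i < n_rxqs; i++) {
        rxqs[i]->pmd_core = cores[i % n_cores];
    }
}
```

With queues measuring 400, 300, 200 and 100 cycles and two cores, the two heaviest queues end up on different cores, which is exactly the balancing improvement over the old port/queue-order round robin.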