Patchwork Proposed patch: huge RX speedup for hw/e1000.c

Submitter Paolo Bonzini
Date May 31, 2012, 7:38 a.m.
Message ID <4FC71FEB.9030100@redhat.com>
Permalink /patch/162111/
State New

Comments

Paolo Bonzini - May 31, 2012, 7:38 a.m.
Il 31/05/2012 00:53, Luigi Rizzo ha scritto:
> The image contains my fast packet generator "pkt-gen" (a stock
> traffic generator such as netperf etc. is too slow to show the
> problem). pkt-gen can send about 1Mpps in this configuration using
> -net netmap in the backend. The qemu process in this case takes 100%
> CPU. On the receive side, i cannot receive more than 50Kpps, even if i
> flood the bridge with a huge amount of traffic. The qemu process stays
> at 5% cpu or less.
> 
> Then i read on the docs in main-loop.h which says that one case where
> the qemu_notify_event() is needed is when using 
> qemu_set_fd_handler2(), which is exactly what my backend uses
> (similar to tap.c)

The path is a bit involved, but I think Luigi is right.  The docs say
"Remember to call qemu_notify_event whenever the [return value of the
fd_read_poll callback] may change from false to true."  Now net/tap.c has

    static int tap_can_send(void *opaque)
    {
        TAPState *s = opaque;

        return qemu_can_send_packet(&s->nc);
    }

and (ignoring VLANs) qemu_can_send_packet is

    int qemu_can_send_packet(VLANClientState *sender)
    {
        if (sender->peer->receive_disabled) {
            return 0;
        } else if (sender->peer->info->can_receive &&
                   !sender->peer->info->can_receive(sender->peer)) {
            return 0;
        } else {
            return 1;
        }
    }

So whenever receive_disabled goes from 1 to 0 or can_receive goes from 0 to 1,
the _peer_ has to call qemu_notify_event.  In e1000.c we have

    static bool e1000_has_rxbufs(E1000State *s, size_t total_size)
    {
        int bufs;
        /* Fast-path short packets */
        if (total_size <= s->rxbuf_size) {
            return s->mac_reg[RDH] != s->mac_reg[RDT] || !s->check_rxov;
        }
        if (s->mac_reg[RDH] < s->mac_reg[RDT]) {
            bufs = s->mac_reg[RDT] - s->mac_reg[RDH];
        } else if (s->mac_reg[RDH] > s->mac_reg[RDT] || !s->check_rxov) {
            bufs = s->mac_reg[RDLEN] / sizeof(struct e1000_rx_desc) +
                s->mac_reg[RDT] - s->mac_reg[RDH];
        } else {
            return false;
        }
        return total_size <= bufs * s->rxbuf_size;
    }

    static int
    e1000_can_receive(VLANClientState *nc)
    {
        E1000State *s = DO_UPCAST(NICState, nc, nc)->opaque;
    
        return (s->mac_reg[RCTL] & E1000_RCTL_EN) && e1000_has_rxbufs(s, 1);
    }

So as a conservative approximation, you need to fire qemu_notify_event
whenever you write to RDH, RDT, RDLEN and RCTL, or when check_rxov becomes
zero.  In practice, only RDT, RCTL and check_rxov matter.  Luigi, does this
patch work for you?

diff --git a/hw/e1000.c b/hw/e1000.c
index 4573f13..0069103 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -295,6 +295,7 @@ set_rx_control(E1000State *s, int index, uint32_t val)
     s->rxbuf_min_shift = ((val / E1000_RCTL_RDMTS_QUAT) & 3) + 1;
     DBGOUT(RX, "RCTL: %d, mac_reg[RCTL] = 0x%x\n", s->mac_reg[RDT],
            s->mac_reg[RCTL]);
+    qemu_notify_event();
 }
 
 static void
@@ -922,6 +923,7 @@ set_rdt(E1000State *s, int index, uint32_t val)
 {
     s->check_rxov = 0;
     s->mac_reg[index] = val & 0xffff;
+    qemu_notify_event();
 }
 
 static void

RDT is indeed written in the ISR.  In the Linux driver, e1000_clean_rx_irq
calls adapter->alloc_rx_buf which is e1000_alloc_rx_buffers.  There you
see this:

        if (likely(rx_ring->next_to_use != i)) {
                rx_ring->next_to_use = i;
                if (unlikely(i-- == 0))
                        i = (rx_ring->count - 1);

                /* Force memory writes to complete before letting h/w
                 * know there are new descriptors to fetch.  (Only
                 * applicable for weak-ordered memory model archs,
                 * such as IA-64). */
                wmb();
                writel(i, hw->hw_addr + rx_ring->rdt);
        }

Similarly for all other devices:
- cadence_gem -> GEM_NWCTRL
- dp8393x -> SONIC_CR, SONIC_ISR
- eepro100 -> set_ru_state
- mcf_fec -> mcf_fec_enable_rx
- milkymist-minimax2 -> R_STATE0, R_STATE1
- mipsnet -> MIPSNET_INT_CTL, MIPSNET_RX_DATA_BUFFER
- ne2000 -> EN0_STARTPG, EN0_STOPPG, E8390_CMD
- opencores_eth -> TX_BD_NUM, MODER, rx_desc
- pcnet -> pcnet_start, csr[5]
- rtl8139 -> RxBufPtr and Cfg9346
- smc91c111 -> RCR, smc91c111_release_packet
- spapr_llan -> h_add_logical_lan_buffer
- stellaris_enet -> RCTL, DATA
- xgmac -> DMA_CONTROL
- xilinx_axienet -> rcw[1]
- xilinx_ethlite -> R_RX_CTRL0

For Xen I think this is not possible at the moment because it doesn't
implement rx notification.

Paolo
Luigi Rizzo - May 31, 2012, 8:23 a.m.
On Thu, May 31, 2012 at 9:38 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:

> Il 31/05/2012 00:53, Luigi Rizzo ha scritto:
> > The image contains my fast packet generator "pkt-gen" (a stock
> > traffic generator such as netperf etc. is too slow to show the
> > problem). pkt-gen can send about 1Mpps in this configuration using
> > -net netmap in the backend. The qemu process in this case takes 100%
> > CPU. On the receive side, i cannot receive more than 50Kpps, even if i
> > flood the bridge with a huge amount of traffic. The qemu process stays
> > at 5% cpu or less.
> >
> > Then i read on the docs in main-loop.h which says that one case where
> > the qemu_notify_event() is needed is when using
> > qemu_set_fd_handler2(), which is exactly what my backend uses
> > (similar to tap.c)
>
> The path is a bit involved, but I think Luigi is right.  The docs say
> "Remember to call qemu_notify_event whenever the [return value of the
> fd_read_poll callback] may change from false to true."  Now net/tap.c has
>
...


> So as a conservative approximation, you need to fire qemu_notify_event
> whenever you write to RDH, RDT, RDLEN and RCTL, or when check_rxov becomes
> zero.  In practice, only RDT, RCTL and check_rxov matter.  Luigi, does this
> patch work for you?
>

It almost surely works (I cannot check today), as my (guest) driver
modifies RDT to notify the hardware of read packets. I know/mentioned
that the notification can be optimized and sent only in specific cases,
as you describe above (I might be missing the check_rxov).

But I think it would be more robust to make fewer assumptions about
what the guest does, and send the notify on all register writes
(those are relatively rare in a driver, and the datapath touches
exactly the registers we ought to be interested in), and possibly on
those reads that may have side effects, such as interrupt registers,
together with a big comment in the code explaining why we need to
call qemu_notify_event().

Actually, what would pay off even more is to devise a way to avoid the
write() syscall in qemu_notify_event when the other process (or is it a
thread?) has data queued. This is not so important in my netmap driver,
but a standard driver is likely to update the RDT on every single
packet, which would pretty much kill performance.

cheers
luigi
Jan Kiszka - May 31, 2012, 8:23 a.m.
On 2012-05-31 09:38, Paolo Bonzini wrote:
> Il 31/05/2012 00:53, Luigi Rizzo ha scritto:
> ...
>
> So as a conservative approximation, you need to fire qemu_notify_event
> whenever you write to RDH, RDT, RDLEN and RCTL, or when check_rxov becomes
> zero.  In practice, only RDT, RCTL and check_rxov matter.  Luigi, does this
> patch work for you?
> 
> diff --git a/hw/e1000.c b/hw/e1000.c
> index 4573f13..0069103 100644
> --- a/hw/e1000.c
> +++ b/hw/e1000.c
> @@ -295,6 +295,7 @@ set_rx_control(E1000State *s, int index, uint32_t val)
>      s->rxbuf_min_shift = ((val / E1000_RCTL_RDMTS_QUAT) & 3) + 1;
>      DBGOUT(RX, "RCTL: %d, mac_reg[RCTL] = 0x%x\n", s->mac_reg[RDT],
>             s->mac_reg[RCTL]);
> +    qemu_notify_event();
>  }
>  
>  static void
> @@ -922,6 +923,7 @@ set_rdt(E1000State *s, int index, uint32_t val)
>  {
>      s->check_rxov = 0;
>      s->mac_reg[index] = val & 0xffff;
> +    qemu_notify_event();

This still looks like the wrong tool: Packets that can't be delivered
are queued. So we need to flush the queue and clear the blocked delivery
there. qemu_flush_queued_packets appears more appropriate for this.

Conceptually, the backend should be responsible for kicking the iothread
as needed.

Jan
Jan Kiszka - May 31, 2012, 8:26 a.m.
On 2012-05-31 10:23, Luigi Rizzo wrote:
> But i think it would be more robust to make fewer assumptions on
> what the guest does, and send the notify on all register writes
> (those are relatively rare in a driver, and the datapath touches
> exactly the registers we ought to be interested in),
> and possibly on those reads that may have side effects, such as interrupt
> registers, together with a big comment in the code explaining
> why we need to call qemu_notify_event().

As we know the hardware that we emulate and its states, such hacks are
unneeded.

Jan
Luigi Rizzo - May 31, 2012, 8:32 a.m.
On Thu, May 31, 2012 at 10:23 AM, Jan Kiszka <jan.kiszka@web.de> wrote:

> On 2012-05-31 09:38, Paolo Bonzini wrote
>
...

>
> This still looks like the wrong tool: Packets that can't be delivered
> are queued. So we need to flush the queue and clear the blocked delivery
> there. qemu_flush_queued_packets appears more appropriate for this.
>
> Conceptually, the backend should be responsible for kicking the iothread
> as needed.
>
>
As I understand the code, the backend _is_ the iothread, and it is
sleeping when the frontend becomes able to receive again.

cheers
luigi
Jan Kiszka - May 31, 2012, 8:41 a.m.
On 2012-05-31 10:32, Luigi Rizzo wrote:
> On Thu, May 31, 2012 at 10:23 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
> 
>> On 2012-05-31 09:38, Paolo Bonzini wrote
>>
> ...
> 
>>
>> This still looks like the wrong tool: Packets that can't be delivered
>> are queued. So we need to flush the queue and clear the blocked delivery
>> there. qemu_flush_queued_packets appears more appropriate for this.
>>
>> Conceptually, the backend should be responsible for kicking the iothread
>> as needed.
>>
>>
> as i understand the code, the backend _is_ the iothread, and it is
> sleeping when the frontend becomes able to receive again.

The backend is a subsystem that can run in the context of the vcpu or
the iothread. The question is in which context which job shall run.
That's something the backend should decide, not the NIC (which should be
agnostic of the I/O architecture - we may have different ones in the
future). Possibly it does make sense to run the flush in a different
context than the vcpu, but let's keep this decision out of the NICs.

Jan
Stefano Stabellini - May 31, 2012, 11:06 a.m.
On Thu, 31 May 2012, Paolo Bonzini wrote:
> Il 31/05/2012 00:53, Luigi Rizzo ha scritto:
> ...
> 
> For Xen I think this is not possible at the moment because it doesn't
> implement rx notification.
 
Why do you say that?
Xen supports the iothread and CONFIG_EVENTFD.
Paolo Bonzini - May 31, 2012, 11:15 a.m.
Il 31/05/2012 13:06, Stefano Stabellini ha scritto:
>> > For Xen I think this is not possible at the moment because it doesn't
>> > implement rx notification.
>  
> Why do you say that? Xen supports the iothread and CONFIG_EVENTFD.

No, it's not that.  I was talking about feature-rx-notify.  But I
remembered wrong, that feature is advertised by the front-end, not the
back-end.  IIRC there is only one event channel per Xen network device,
so net_event is the place to add the call to qemu_flush_queued_packets.

Paolo
Stefano Stabellini - May 31, 2012, 11:43 a.m.
On Thu, 31 May 2012, Paolo Bonzini wrote:
> Il 31/05/2012 13:06, Stefano Stabellini ha scritto:
> >> > For Xen I think this is not possible at the moment because it doesn't
> >> > implement rx notification.
> >  
> > Why do you say that? Xen supports the iothread and CONFIG_EVENTFD.
> 
> No, it's not that.  I was talking about feature-rx-notify.  But I
> remembered wrong, that feature is advertised by the front-end, not the
> back-end. 

Yes, that's right.


> IIRC there is only one event channel per Xen network device,
> so net_event is the place to add the call to qemu_flush_queued_packets.

I see what you mean and I think that you are correct.

Could you please add that change to your submission?
Or do you want me to send it separately?
Paolo Bonzini - May 31, 2012, 11:44 a.m.
Il 31/05/2012 13:43, Stefano Stabellini ha scritto:
>> > No, it's not that.  I was talking about feature-rx-notify.  But I
>> > remembered wrong, that feature is advertised by the front-end, not the
>> > back-end. 
> Yes, that's right.
> 
> 
>> > IIRC there is only one event channel per Xen network device,
>> > so net_event is the place to add the call to qemu_flush_queued_packets.
> I see what you mean and I think that you are correct.
> 
> Could you please add that change to your submission?
> Or do you want me to send it separately?

Don't hold your breath, but I can do it.

Paolo

Patch

diff --git a/hw/e1000.c b/hw/e1000.c
index 4573f13..0069103 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -295,6 +295,7 @@  set_rx_control(E1000State *s, int index, uint32_t val)
     s->rxbuf_min_shift = ((val / E1000_RCTL_RDMTS_QUAT) & 3) + 1;
     DBGOUT(RX, "RCTL: %d, mac_reg[RCTL] = 0x%x\n", s->mac_reg[RDT],
            s->mac_reg[RCTL]);
+    qemu_notify_event();
 }
 
 static void
@@ -922,6 +923,7 @@  set_rdt(E1000State *s, int index, uint32_t val)
 {
     s->check_rxov = 0;
     s->mac_reg[index] = val & 0xffff;
+    qemu_notify_event();
 }
 
 static void