diff mbox series

e1000: set RX descriptor status in a separate operation

Message ID 20220629094026.558-1-dinghui@sangfor.com.cn
State New
Headers show
Series e1000: set RX descriptor status in a separate operation | expand

Commit Message

Ding Hui June 29, 2022, 9:40 a.m. UTC
The code of setting RX descriptor status field maybe work fine in
previously, however with the update of glibc version, it shows two
issues when guest using dpdk receive packets:

  1. The dpdk has a certain probability getting wrong buffer_addr

     this impact may be not obvious, such as lost a packet once in
     a while

  2. The dpdk may consume a packet twice when scan the RX desc queue
     over again

     this impact will lead a infinite wait in Qemu, since the RDT
     (tail pointer) be inscreased to equal to RDH by unexpected,
     which regard as the RX desc queue is full

Write a whole of RX desc with DD flag on is not quite correct, because
when the underlying implementation of memcpy using XMM registers to
copy e1000_rx_desc (when AVX or something else CPU feature is usable),
the bytes order of desc writing to memory is indeterminacy

We can use full-scale test case to reproduce the issue-2 by
https://github.com/BASM/qemu_dpdk_e1000_test (thanks to Leonid Myravjev)

I also write a POC test case at https://github.com/cdkey/e1000_poc
which can reproduce both of them, and easy to verify the patch effect.

The hw watchpoint also shows that, when Qemu using XMM related instructions
writing 16 bytes e1000_rx_desc, concurrent with DPDK using movb
writing 1 byte status, the final result of writing to memory will be one
of them, if it made by Qemu which DD flag is on, DPDK will consume it
again.

Setting DD status in a separate operation, can prevent the impact of
disorder memory writing by memcpy, also avoid unexpected data when
concurrent writing status by qemu and guest dpdk.

Links: https://lore.kernel.org/qemu-devel/20200102110504.GG121208@stefanha-x1.localdomain/T/

Reported-by: Leonid Myravjev <asm@asm.pp.ru>
Cc: Stefan Hajnoczi <stefanha@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: qemu-stable@nongnu.org
Tested-by: Jing Zhang <zhangjing@sangfor.com.cn>
Reviewed-by: Frank Lee <lifan38153@sangfor.com.cn>
Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
---
 hw/net/e1000.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Jason Wang July 4, 2022, 7:10 a.m. UTC | #1
在 2022/6/29 17:40, Ding Hui 写道:
> The code of setting RX descriptor status field maybe work fine in
> previously, however with the update of glibc version, it shows two
> issues when guest using dpdk receive packets:
>
>    1. The dpdk has a certain probability getting wrong buffer_addr
>
>       this impact may be not obvious, such as lost a packet once in
>       a while
>
>    2. The dpdk may consume a packet twice when scan the RX desc queue
>       over again
>
>       this impact will lead a infinite wait in Qemu, since the RDT
>       (tail pointer) be inscreased to equal to RDH by unexpected,
>       which regard as the RX desc queue is full
>
> Write a whole of RX desc with DD flag on is not quite correct, because
> when the underlying implementation of memcpy using XMM registers to
> copy e1000_rx_desc (when AVX or something else CPU feature is usable),
> the bytes order of desc writing to memory is indeterminacy
>
> We can use full-scale test case to reproduce the issue-2 by
> https://github.com/BASM/qemu_dpdk_e1000_test (thanks to Leonid Myravjev)
>
> I also write a POC test case at https://github.com/cdkey/e1000_poc
> which can reproduce both of them, and easy to verify the patch effect.
>
> The hw watchpoint also shows that, when Qemu using XMM related instructions
> writing 16 bytes e1000_rx_desc, concurrent with DPDK using movb
> writing 1 byte status, the final result of writing to memory will be one
> of them, if it made by Qemu which DD flag is on, DPDK will consume it
> again.
>
> Setting DD status in a separate operation, can prevent the impact of
> disorder memory writing by memcpy, also avoid unexpected data when
> concurrent writing status by qemu and guest dpdk.
>
> Links: https://lore.kernel.org/qemu-devel/20200102110504.GG121208@stefanha-x1.localdomain/T/
>
> Reported-by: Leonid Myravjev <asm@asm.pp.ru>
> Cc: Stefan Hajnoczi <stefanha@gmail.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: qemu-stable@nongnu.org
> Tested-by: Jing Zhang <zhangjing@sangfor.com.cn>
> Reviewed-by: Frank Lee <lifan38153@sangfor.com.cn>
> Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
> ---
>   hw/net/e1000.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/hw/net/e1000.c b/hw/net/e1000.c
> index f5bc81296d..e26e0a64c1 100644
> --- a/hw/net/e1000.c
> +++ b/hw/net/e1000.c
> @@ -979,7 +979,7 @@ e1000_receive_iov(NetClientState *nc, const struct iovec *iov, int iovcnt)
>           base = rx_desc_base(s) + sizeof(desc) * s->mac_reg[RDH];
>           pci_dma_read(d, base, &desc, sizeof(desc));
>           desc.special = vlan_special;
> -        desc.status |= (vlan_status | E1000_RXD_STAT_DD);
> +        desc.status &= ~E1000_RXD_STAT_DD;
>           if (desc.buffer_addr) {
>               if (desc_offset < size) {
>                   size_t iov_copy;
> @@ -1013,6 +1013,9 @@ e1000_receive_iov(NetClientState *nc, const struct iovec *iov, int iovcnt)
>               DBGOUT(RX, "Null RX descriptor!!\n");
>           }
>           pci_dma_write(d, base, &desc, sizeof(desc));
> +        desc.status |= (vlan_status | E1000_RXD_STAT_DD);
> +        pci_dma_write(d, base + offsetof(struct e1000_rx_desc, status),
> +                      &desc.status, sizeof(desc.status));


Good catch, but to be more safe, I wonder if we can simply use 
stx_le_pci_dma() here?

Thanks


>   
>           if (++s->mac_reg[RDH] * sizeof(desc) >= s->mac_reg[RDLEN])
>               s->mac_reg[RDH] = 0;
Ding Hui July 4, 2022, 9:04 a.m. UTC | #2
On 2022/7/4 15:10, Jason Wang wrote:
> 
> 在 2022/6/29 17:40, Ding Hui 写道:
>> @@ -1013,6 +1013,9 @@ e1000_receive_iov(NetClientState *nc, const 
>> struct iovec *iov, int iovcnt)
>>               DBGOUT(RX, "Null RX descriptor!!\n");
>>           }
>>           pci_dma_write(d, base, &desc, sizeof(desc));
>> +        desc.status |= (vlan_status | E1000_RXD_STAT_DD);
>> +        pci_dma_write(d, base + offsetof(struct e1000_rx_desc, status),
>> +                      &desc.status, sizeof(desc.status));
> 
> 
> Good catch, but to be more safe, I wonder if we can simply use 
> stx_le_pci_dma() here?
> 

Do you mean stb_le_pci_dma(d, base + offsetof(struct e1000_rx_desc, 
status), desc.status, MEMTXATTRS_UNSPECIFIED)?

I checked both pci_dma_write() and stb_le_pci_dma(), there is no 
difference between them, but I'm not sure whether it is proper to mixed 
use address based api and value based api, besides that it's OK to me.

Thanks for reply.
Jason Wang July 5, 2022, 12:54 a.m. UTC | #3
On Mon, Jul 4, 2022 at 5:05 PM Ding Hui <dinghui@sangfor.com.cn> wrote:
>
> On 2022/7/4 15:10, Jason Wang wrote:
> >
> > 在 2022/6/29 17:40, Ding Hui 写道:
> >> @@ -1013,6 +1013,9 @@ e1000_receive_iov(NetClientState *nc, const
> >> struct iovec *iov, int iovcnt)
> >>               DBGOUT(RX, "Null RX descriptor!!\n");
> >>           }
> >>           pci_dma_write(d, base, &desc, sizeof(desc));
> >> +        desc.status |= (vlan_status | E1000_RXD_STAT_DD);
> >> +        pci_dma_write(d, base + offsetof(struct e1000_rx_desc, status),
> >> +                      &desc.status, sizeof(desc.status));
> >
> >
> > Good catch, but to be more safe, I wonder if we can simply use
> > stx_le_pci_dma() here?
> >
>
> Do you mean stb_le_pci_dma(d, base + offsetof(struct e1000_rx_desc,
> status), desc.status, MEMTXATTRS_UNSPECIFIED)?
>
> I checked both pci_dma_write() and stb_le_pci_dma(), there is no
> difference between them,

I think the difference is that the stx_xxx() can guarantee the atomicy
when it is allowed by the arch.

> but I'm not sure whether it is proper to mixed
> use address based api and value based api, besides that it's OK to me.
>
> Thanks for reply.

I apply this patch as is.

Thanks

>
> --
> Thanks,
> - Ding Hui
>
>
diff mbox series

Patch

diff --git a/hw/net/e1000.c b/hw/net/e1000.c
index f5bc81296d..e26e0a64c1 100644
--- a/hw/net/e1000.c
+++ b/hw/net/e1000.c
@@ -979,7 +979,7 @@  e1000_receive_iov(NetClientState *nc, const struct iovec *iov, int iovcnt)
         base = rx_desc_base(s) + sizeof(desc) * s->mac_reg[RDH];
         pci_dma_read(d, base, &desc, sizeof(desc));
         desc.special = vlan_special;
-        desc.status |= (vlan_status | E1000_RXD_STAT_DD);
+        desc.status &= ~E1000_RXD_STAT_DD;
         if (desc.buffer_addr) {
             if (desc_offset < size) {
                 size_t iov_copy;
@@ -1013,6 +1013,9 @@  e1000_receive_iov(NetClientState *nc, const struct iovec *iov, int iovcnt)
             DBGOUT(RX, "Null RX descriptor!!\n");
         }
         pci_dma_write(d, base, &desc, sizeof(desc));
+        desc.status |= (vlan_status | E1000_RXD_STAT_DD);
+        pci_dma_write(d, base + offsetof(struct e1000_rx_desc, status),
+                      &desc.status, sizeof(desc.status));
 
         if (++s->mac_reg[RDH] * sizeof(desc) >= s->mac_reg[RDLEN])
             s->mac_reg[RDH] = 0;