
[Fwd: qemu tx stop in cloonix]

Message ID 6389e86d0d68cc476c04c1e9b944208e.squirrel@clownix.net
State New

Commit Message

clownix@clownix.net Feb. 24, 2013, 8:14 p.m. UTC
Hello,
I use qemu inside a GPLv3 software called cloonix. I have patched qemu to
use unix sockets instead of inet ones, but the bug I hit with unix
sockets may also happen with inet ones.

The bug can be reproduced in a cloonix context by using iperf. It occurs
randomly in a virtual cloonix network, but occurs within seconds when
using iperf under nested virtualisation (cloonix inside cloonix). The
problem begins when a lot of packets must be transmitted and the socket
(inet in classical qemu, unix in cloonix) gets full:
qemu_net_queue_append_iov is called, and then tx never restarts.
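
For reference, the queueing path in question looks like this, paraphrased
from qemu-1.4 net/queue.c (names from that tree; I may be slightly off on
details):

    ssize_t qemu_net_queue_send_iov(NetQueue *queue, NetClientState *sender,
                                    unsigned flags, const struct iovec *iov,
                                    int iovcnt, NetPacketSent *sent_cb)
    {
        ssize_t ret;

        /* Peer busy (receive_disabled set, or its can_receive() says no):
         * the packet is parked in the queue and 0 is returned.  Nothing
         * moves again until something flushes this queue. */
        if (queue->delivering || !qemu_can_send_packet(sender)) {
            qemu_net_queue_append_iov(queue, sender, flags, iov, iovcnt,
                                      sent_cb);
            return 0;
        }

        ret = qemu_net_queue_deliver_iov(queue, sender, flags, iov, iovcnt);
        if (ret == 0) {
            /* Delivery wrote nothing (socket full): park it as well. */
            qemu_net_queue_append_iov(queue, sender, flags, iov, iovcnt,
                                      sent_cb);
            return 0;
        }

        qemu_net_queue_flush(queue);
        return ret;
    }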

The patch attached below shows the way I avoided queuing anything. It
works, even if it is not a proper fix for the bug...

The patch is for version 21.3 of cloonix, which uses qemu-1.4.0-rc1, but
I now use qemu-1.4.0 and the bug is still there.

Regards
Vincent Perrier



---------------------------- Original Message ----------------------------
Subject: qemu tx stop in cloonix
From:    clownix@clownix.net
Date:    Sun, February 24, 2013 9:14 am
To:      "list" <cloonix-list@clownix.net>
--------------------------------------------------------------------------

There is a bug, visible particularly when doing nested cloonix and
running iperf inside the second-level nested machines.
The ethernet interface emitting a big load stops working. This has been
corrected in my version, but I will not deliver the correction outside the
regular deliveries (every 2 or 3 months).

If your ethernet access stops after a burst of traffic, here is the
cause:

From the kernel virtio_net driver inside the guest, piles of messages are
sent through a virtio queue to the qemu user process.
The qemu user process does what it can to hand the messages to a unix
socket (towards cloonix). When too much traffic arrives, the unix socket
gets full and writes 0 bytes.
Then qemu, instead of dropping packets (too much is too much, no need to
try harder), refuses to drop: it tries to enqueue packets until the unix
socket clears...
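
In qemu terms, this is the latch, paraphrased from qemu_deliver_packet()
in qemu-1.4 net.c (approximate, secondary branches omitted):

    ssize_t qemu_deliver_packet(NetClientState *sender, unsigned flags,
                                const uint8_t *data, size_t size,
                                void *opaque)
    {
        NetClientState *nc = opaque;
        ssize_t ret;

        if (nc->receive_disabled) {
            return 0;                   /* caller queues the packet */
        }
        ret = nc->info->receive(nc, data, size);
        if (ret == 0) {                 /* socket full: 0 bytes written */
            nc->receive_disabled = 1;   /* set until a flush clears it */
        }
        return ret;
    }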

The mechanics of this unusual-case handling are too complex, and I did not
dig into them to repair the bug. I just dropped the packets, the simplest
solution; the layers above know that low-level packets can simply
disappear...

Here is the solution:
in "sources/Cloonix-Net-Lab/qemu", in the file cmd, add the
qemu_drop_burst.patch line below, after having put qemu_drop_burst.patch
in the qemu directory.

  patch -p1 < ../cloonix_qemu.patch
  patch -p1 < ../qemu_drop_burst.patch


The qemu_drop_burst.patch should be attached to this mail...

Comments

Stefan Hajnoczi Feb. 25, 2013, 2:20 p.m. UTC | #1
On Sun, Feb 24, 2013 at 02:14:43PM -0600, clownix@clownix.net wrote:
> I use qemu inside a GPLv3 software called cloonix. I have patched qemu to
> use unix sockets instead of inet ones, but the bug I hit with unix
> sockets may also happen with inet ones.
> 
> The bug can be reproduced in a cloonix context by using iperf. It occurs
> randomly in a virtual cloonix network, but occurs within seconds when
> using iperf under nested virtualisation (cloonix inside cloonix). The
> problem begins when a lot of packets must be transmitted and the socket
> (inet in classical qemu, unix in cloonix) gets full:
> qemu_net_queue_append_iov is called, and then tx never restarts.
> 
> The patch attached below shows the way I avoided queuing anything. It
> works, even if it is not a proper fix for the bug...
> 
> The patch is for version 21.3 of cloonix, which uses qemu-1.4.0-rc1, but
> I now use qemu-1.4.0 and the bug is still there.

Thanks for the bug report.  This sounds like a problem with net/socket.c
- it's supposed to restart the queue when the socket becomes writable
again.
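
The restart path, roughly as it appears in qemu-1.4 net/socket.c and
net.c (paraphrased):

    static void net_socket_writable(void *opaque)
    {
        NetSocketState *s = opaque;

        net_socket_write_poll(s, false);    /* stop polling for POLLOUT */
        qemu_flush_queued_packets(&s->nc);  /* retry the parked packets */
    }

    void qemu_flush_queued_packets(NetClientState *nc)
    {
        nc->receive_disabled = 0;                  /* re-enable delivery */
        qemu_net_queue_flush(nc->incoming_queue);  /* drain the queue */
    }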

Can you share a way to reproduce this tx stall with vanilla QEMU?

Stefan
Stefan Hajnoczi Feb. 25, 2013, 2:24 p.m. UTC | #2
On Sun, Feb 24, 2013 at 02:14:43PM -0600, clownix@clownix.net wrote:
> The patch is for version 21.3 of cloonix, which uses qemu-1.4.0-rc1, but
> I now use qemu-1.4.0 and the bug is still there.

Please post the QEMU command-line so we can see how the socket netdev
was configured.

Stefan
clownix@clownix.net Feb. 25, 2013, 10:38 p.m. UTC | #3
Hello Stefan,

I coded a socket-based cable between 2 vanilla kvm instances; here are the
commands to run:


tar xvf qemu_test_sock.tar.gz
cd qemu_test_sock
make
./qemu_test_sock



    kvm \
    -nodefaults \
    -nographic \
    -serial stdio \
    -drive file=guest1,media=disk,if=virtio \
    -device virtio-net-pci,tx=bh,vlan=1,mac=02:01:01:01:01:01 \
    -net socket,vlan=1,connect=127.0.0.1:47654

    kvm \
    -nodefaults \
    -nographic \
    -serial stdio \
    -drive file=guest2,media=disk,if=virtio \
    -device virtio-net-pci,tx=bh,vlan=1,mac=02:02:02:02:02:02 \
    -net socket,vlan=1,connect=127.0.0.1:47655


on guest 1:
ifconfig eth0 1.1.1.1
iperf -s -u

on guest 2:
ifconfig eth0 1.1.1.2
iperf -c 1.1.1.1 -u -b 100M


then, when you feel something is not right, ping from guest 2:

ping 1.1.1.1
From 1.1.1.2 icmp_seq=24 Destination Host Unreachable
From 1.1.1.2 icmp_seq=25 Destination Host Unreachable
From 1.1.1.2 icmp_seq=26 Destination Host Unreachable
64 bytes from 1.1.1.1: icmp_req=1 ttl=64 time=29128 ms
64 bytes from 1.1.1.1: icmp_req=2 ttl=64 time=28121 ms



The principle: the process between the two kvm instances just passes
messages from one side to the other, then stops working for 5 seconds
every 30 seconds to create a socket-full condition.
I do not get the same trouble as with my unix socket: in the inet case
the ethernet access has an empty moment (of 30 sec in the case above)
and then starts again...
I hope that helps, but I could not reproduce the total ethernet stop
I had in cloonix...
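
The core of the cable process is roughly the following sketch (simplified;
the real code is in qemu_test_sock.tar.gz and may differ):

    /* Accepts one connection on 47654 and one on 47655, copies bytes
     * both ways, and stalls 5 seconds out of every 30 so that the
     * sockets fill up. */
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <sys/time.h>

    static int accept_on(int port)
    {
        struct sockaddr_in a;
        int s = socket(AF_INET, SOCK_STREAM, 0), c, one = 1;

        setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        memset(&a, 0, sizeof(a));
        a.sin_family = AF_INET;
        a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        a.sin_port = htons(port);
        bind(s, (struct sockaddr *)&a, sizeof(a));
        listen(s, 1);
        c = accept(s, NULL, NULL);
        close(s);
        return c;
    }

    static void forward(int from, int to)
    {
        char buf[65536];
        ssize_t n = read(from, buf, sizeof(buf));

        if (n > 0)
            write(to, buf, n);   /* short writes ignored in this sketch */
    }

    int main(void)
    {
        int fd1 = accept_on(47654);
        int fd2 = accept_on(47655);
        int max = (fd1 > fd2 ? fd1 : fd2) + 1;

        for (;;) {
            fd_set rfds;
            struct timeval tv = { 1, 0 };

            if (time(NULL) % 30 >= 25) {   /* 5-second stall every 30 s */
                sleep(5);
                continue;
            }
            FD_ZERO(&rfds);
            FD_SET(fd1, &rfds);
            FD_SET(fd2, &rfds);
            if (select(max, &rfds, NULL, NULL, &tv) <= 0)
                continue;
            if (FD_ISSET(fd1, &rfds))
                forward(fd1, fd2);
            if (FD_ISSET(fd2, &rfds))
                forward(fd2, fd1);
        }
    }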



> On Sun, Feb 24, 2013 at 02:14:43PM -0600, clownix@clownix.net wrote:
>> The patch is for version 21.3 of cloonix, which uses qemu-1.4.0-rc1, but
>> I now use qemu-1.4.0 and the bug is still there.
>
> Please post the QEMU command-line so we can see how the socket netdev
> was configured.
>
> Stefan
>
Stefan Hajnoczi Feb. 26, 2013, 9:21 a.m. UTC | #4
On Mon, Feb 25, 2013 at 11:38 PM,  <clownix@clownix.net> wrote:
> I coded a socket-based cable between 2 vanilla kvm instances; here are the
> commands to run:

Please try:

    kvm \
    -nodefaults \
    -nographic \
    -serial stdio \
    -drive file=guest1,media=disk,if=virtio \
    -netdev socket,id=socket0,connect=127.0.0.1:47654 \
    -device virtio-net-pci,tx=bh,netdev=socket0,mac=02:01:01:01:01:01

    kvm \
    -nodefaults \
    -nographic \
    -serial stdio \
    -drive file=guest2,media=disk,if=virtio \
    -netdev socket,id=socket0,connect=127.0.0.1:47655 \
    -device virtio-net-pci,tx=bh,netdev=socket0,mac=02:02:02:02:02:02

Notice that -netdev socket is used instead of -net socket,vlan=1.

Luigi Rizzo recently fixed a bug where traffic could stall when using
the QEMU "vlan" feature:

http://lists.gnu.org/archive/html/qemu-devel/2013-02/msg00679.html

If you want to try this fix, use the
git://github.com/stefanha/qemu.git net branch.

Stefan
Jan Kiszka Feb. 26, 2013, 11:25 a.m. UTC | #5
On 2013-02-26 10:21, Stefan Hajnoczi wrote:
> On Mon, Feb 25, 2013 at 11:38 PM,  <clownix@clownix.net> wrote:
>> I coded a socket-based cable between 2 vanilla kvm instances; here are
>> the commands to run:
> 
> Please try:
> 
>     kvm \
>     -nodefaults \
>     -nographic \
>     -serial stdio \
>     -drive file=guest1,media=disk,if=virtio \
>     -netdev socket,id=socket0,connect=127.0.0.1:47654 \
>     -device virtio-net-pci,tx=bh,netdev=socket0,mac=02:01:01:01:01:01
> 
>     kvm \
>     -nodefaults \
>     -nographic \
>     -serial stdio \
>     -drive file=guest2,media=disk,if=virtio \
>     -netdev socket,id=socket0,connect=127.0.0.1:47655 \
>     -device virtio-net-pci,tx=bh,netdev=socket0,mac=02:02:02:02:02:02
> 
> Notice that -netdev socket is used instead of -net socket,vlan=1.

That's pointless. -netdev socket is still broken, only -net works.

Jan

Patch

diff -Naur qemu-1.4.0-rc1/net/net.c new_qemu-1.4.0-rc1/net/net.c
--- qemu-1.4.0-rc1/net/net.c	2013-02-07 01:40:56.000000000 +0100
+++ new_qemu-1.4.0-rc1/net/net.c	2013-02-24 16:03:45.139853349 +0100
@@ -388,10 +388,14 @@ 
     }
 
     if (sender->peer->receive_disabled) {
-        return 0;
+//cloonix DROP
+//        return 0;
+        return 1;
     } else if (sender->peer->info->can_receive &&
                !sender->peer->info->can_receive(sender->peer)) {
-        return 0;
+//cloonix DROP
+//        return 0;
+        return 1;
     }
     return 1;
 }
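
For reference, the hunk above appears to be in qemu_can_send_packet() in
net.c (qemu-1.4.0-rc1): both "peer cannot receive" branches are forced to
report that sending is possible, so the up-front queueing path shown
earlier is skipped and overload packets are simply lost instead of parked,
which keeps tx running.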