Message ID | 1464000201-15560-3-git-send-email-mst@redhat.com
---|---
State | RFC, archived
Delegated to | David Miller
On Mon, 23 May 2016 13:43:46 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote:

> Add ringtest based unit test for skb array.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  tools/virtio/ringtest/skb_array.c | 167 ++++++++++++++++++++++++++++++++++++++
>  tools/virtio/ringtest/Makefile    |   4 +-

Patch didn't apply cleanly to Makefile, as you also seem to have
"virtio_ring_inorder"; I manually applied it.

I chdir to tools/virtio/ringtest/ and I could compile "skb_array",
BUT how do I use it??? (the README is not helpful)

What is the "output", are there any performance measurement results?

> diff --git a/tools/virtio/ringtest/Makefile b/tools/virtio/ringtest/Makefile
> index 6ba7455..87e58cf 100644
> --- a/tools/virtio/ringtest/Makefile
> +++ b/tools/virtio/ringtest/Makefile
> @@ -1,6 +1,6 @@
>  all:
>
> -all: ring virtio_ring_0_9 virtio_ring_poll virtio_ring_inorder
> +all: ring virtio_ring_0_9 virtio_ring_poll virtio_ring_inorder skb_array
                             ^^^^^^^^^^^^^^^^^^^
>
>  CFLAGS += -Wall
>  CFLAGS += -pthread -O2 -ggdb
> @@ -8,6 +8,7 @@ LDFLAGS += -pthread -O2 -ggdb
>
>  main.o: main.c main.h
>  ring.o: ring.c main.h
> +skb_array.o: skb_array.c main.h ../../../include/linux/skb_array.h
>  virtio_ring_0_9.o: virtio_ring_0_9.c main.h
>  virtio_ring_poll.o: virtio_ring_poll.c virtio_ring_0_9.c main.h
>  virtio_ring_inorder.o: virtio_ring_inorder.c virtio_ring_0_9.c main.h
> @@ -15,6 +16,7 @@ ring: ring.o main.o
>  virtio_ring_0_9: virtio_ring_0_9.o main.o
>  virtio_ring_poll: virtio_ring_poll.o main.o
>  virtio_ring_inorder: virtio_ring_inorder.o main.o
  ^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^
> +skb_array: skb_array.o main.o
>  clean:
>          -rm main.o
>          -rm ring.o ring
On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 23 May 2016 13:43:46 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > Add ringtest based unit test for skb array.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  tools/virtio/ringtest/skb_array.c | 167 ++++++++++++++++++++++++++++++++++++++
> >  tools/virtio/ringtest/Makefile    |   4 +-
>
> Patch didn't apply cleanly to Makefile, as you also seem to have
> "virtio_ring_inorder"; I manually applied it.
>
> I chdir to tools/virtio/ringtest/ and I could compile "skb_array",
> BUT how do I use it??? (the README is not helpful)
>
> What is the "output", are there any performance measurement results?

First, if it completes successfully this means it completed a ton of
cycles without errors. It catches any missing barriers which aren't
nops on your system.

Second - use perf. E.g. a simple perf stat will measure how long it
takes to execute. There's a script that runs it on different CPUs,
so I normally do:

sh run-on-all.sh perf stat -r 5 ./skb_array

> > diff --git a/tools/virtio/ringtest/Makefile b/tools/virtio/ringtest/Makefile
> > index 6ba7455..87e58cf 100644
> > --- a/tools/virtio/ringtest/Makefile
> > +++ b/tools/virtio/ringtest/Makefile
> > @@ -1,6 +1,6 @@
> >  all:
> >
> > -all: ring virtio_ring_0_9 virtio_ring_poll virtio_ring_inorder
> > +all: ring virtio_ring_0_9 virtio_ring_poll virtio_ring_inorder skb_array
>                               ^^^^^^^^^^^^^^^^^^^
> >
> >  CFLAGS += -Wall
> >  CFLAGS += -pthread -O2 -ggdb
> > @@ -8,6 +8,7 @@ LDFLAGS += -pthread -O2 -ggdb
> >
> >  main.o: main.c main.h
> >  ring.o: ring.c main.h
> > +skb_array.o: skb_array.c main.h ../../../include/linux/skb_array.h
> >  virtio_ring_0_9.o: virtio_ring_0_9.c main.h
> >  virtio_ring_poll.o: virtio_ring_poll.c virtio_ring_0_9.c main.h
> >  virtio_ring_inorder.o: virtio_ring_inorder.c virtio_ring_0_9.c main.h
> > @@ -15,6 +16,7 @@ ring: ring.o main.o
> >  virtio_ring_0_9: virtio_ring_0_9.o main.o
> >  virtio_ring_poll: virtio_ring_poll.o main.o
> >  virtio_ring_inorder: virtio_ring_inorder.o main.o
>   ^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^
> > +skb_array: skb_array.o main.o
> >  clean:
> >          -rm main.o
> >          -rm ring.o ring
>
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> Author of http://www.iptv-analyzer.org
> LinkedIn: http://www.linkedin.com/in/brouer
On Mon, 23 May 2016 23:52:47 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote:
> > On Mon, 23 May 2016 13:43:46 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >
> > > Add ringtest based unit test for skb array.
> > >
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > ---
> > >  tools/virtio/ringtest/skb_array.c | 167 ++++++++++++++++++++++++++++++++++++++
> > >  tools/virtio/ringtest/Makefile    |   4 +-
> >
> > Patch didn't apply cleanly to Makefile, as you also seem to have
> > "virtio_ring_inorder"; I manually applied it.
> >
> > I chdir to tools/virtio/ringtest/ and I could compile "skb_array",
> > BUT how do I use it??? (the README is not helpful)
> >
> > What is the "output", are there any performance measurement results?
>
> First, if it completes successfully this means it completed a ton of
> cycles without errors. It catches any missing barriers which aren't
> nops on your system.

I applied these patches on net-next (at commit 07b75260e) and the
skb_array test program never terminates. Strangely if I use your git
tree[1] (on branch vhost) the program does terminate... I didn't spot
the difference.

> Second - use perf.

I do like perf, but it does not answer my questions about the
performance of this queue. I will code something up in my own
framework[2] to answer my own performance questions.

Like what is the minimum overhead (in cycles) achievable with this type
of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
for fastpath usage.

Then I also want to know how this performs when two CPUs are involved.
As this is also a primary use-case, for you, when sending packets into
a guest.

> E.g. a simple perf stat will measure how long it takes to execute.
> There's a script that runs it on different CPUs,
> so I normally do:
>
> sh run-on-all.sh perf stat -r 5 ./skb_array

I recommend documenting this in the README file in the same dir ;-)

[1] https://git.kernel.org/cgit/linux/kernel/git/mst/vhost.git/log/?h=vhost
[2] https://github.com/netoptimizer/prototype-kernel
On Tue, May 24, 2016 at 12:28:09PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 23 May 2016 23:52:47 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote:
> > > On Mon, 23 May 2016 13:43:46 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >
> > > > Add ringtest based unit test for skb array.
> > > >
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > ---
> > > >  tools/virtio/ringtest/skb_array.c | 167 ++++++++++++++++++++++++++++++++++++++
> > > >  tools/virtio/ringtest/Makefile    |   4 +-
> > >
> > > Patch didn't apply cleanly to Makefile, as you also seem to have
> > > "virtio_ring_inorder"; I manually applied it.
> > >
> > > I chdir to tools/virtio/ringtest/ and I could compile "skb_array",
> > > BUT how do I use it??? (the README is not helpful)
> > >
> > > What is the "output", are there any performance measurement results?
> >
> > First, if it completes successfully this means it completed a ton of
> > cycles without errors. It catches any missing barriers which aren't
> > nops on your system.
>
> I applied these patches on net-next (at commit 07b75260e) and the
> skb_array test program never terminates. Strangely if I use your git
> tree[1] (on branch vhost) the program does terminate... I didn't spot
> the difference.

Disassemble the binaries and compare? Should be identical.
Or attach gdb and look at array.producer and array.consumer.

> > Second - use perf.
>
> I do like perf, but it does not answer my questions about the
> performance of this queue. I will code something up in my own
> framework[2] to answer my own performance questions.

Sounds good.

> Like what is the minimum overhead (in cycles) achievable with this type
> of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> for fastpath usage.

Interesting.

> Then I also want to know how this performs when two CPUs are involved.

This has flags to pin threads to different CPUs.

> As this is also a primary use-case, for you, when sending packets into
> a guest.

That's absolutely the primary usecase. Was designed with this in mind.

> > E.g. a simple perf stat will measure how long it takes to execute.
> > There's a script that runs it on different CPUs,
> > so I normally do:
> >
> > sh run-on-all.sh perf stat -r 5 ./skb_array
>
> I recommend documenting this in the README file in the same dir ;-)

Good idea. Will do.

> [1] https://git.kernel.org/cgit/linux/kernel/git/mst/vhost.git/log/?h=vhost
> [2] https://github.com/netoptimizer/prototype-kernel
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> Author of http://www.iptv-analyzer.org
> LinkedIn: http://www.linkedin.com/in/brouer
On Tue, May 24, 2016 at 12:28:09PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 23 May 2016 23:52:47 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote:
> > > On Mon, 23 May 2016 13:43:46 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >
> > > > Add ringtest based unit test for skb array.
> > > >
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > ---
> > > >  tools/virtio/ringtest/skb_array.c | 167 ++++++++++++++++++++++++++++++++++++++
> > > >  tools/virtio/ringtest/Makefile    |   4 +-
> > >
> > > Patch didn't apply cleanly to Makefile, as you also seem to have
> > > "virtio_ring_inorder"; I manually applied it.
> > >
> > > I chdir to tools/virtio/ringtest/ and I could compile "skb_array",
> > > BUT how do I use it??? (the README is not helpful)
> > >
> > > What is the "output", are there any performance measurement results?
> >
> > First, if it completes successfully this means it completed a ton of
> > cycles without errors. It catches any missing barriers which aren't
> > nops on your system.
>
> I applied these patches on net-next (at commit 07b75260e) and the
> skb_array test program never terminates. Strangely if I use your git
> tree[1] (on branch vhost) the program does terminate... I didn't spot
> the difference.

Oh, that's my bad. You need this commit from my tree:

    ringtest: pass buf != NULL

    just a stub pointer for now.

    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
On Tue, May 24, 2016 at 12:28:09PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 23 May 2016 23:52:47 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote:
> > > On Mon, 23 May 2016 13:43:46 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >
> > > > Add ringtest based unit test for skb array.
> > > >
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > ---
> > > >  tools/virtio/ringtest/skb_array.c | 167 ++++++++++++++++++++++++++++++++++++++
> > > >  tools/virtio/ringtest/Makefile    |   4 +-
> > >
> > > Patch didn't apply cleanly to Makefile, as you also seem to have
> > > "virtio_ring_inorder"; I manually applied it.
> > >
> > > I chdir to tools/virtio/ringtest/ and I could compile "skb_array",
> > > BUT how do I use it??? (the README is not helpful)
> > >
> > > What is the "output", are there any performance measurement results?
> >
> > First, if it completes successfully this means it completed a ton of
> > cycles without errors. It catches any missing barriers which aren't
> > nops on your system.
>
> I applied these patches on net-next (at commit 07b75260e) and the
> skb_array test program never terminates. Strangely if I use your git
> tree[1] (on branch vhost) the program does terminate... I didn't spot
> the difference.
>
> > Second - use perf.
>
> I do like perf, but it does not answer my questions about the
> performance of this queue. I will code something up in my own
> framework[2] to answer my own performance questions.
>
> Like what is the minimum overhead (in cycles) achievable with this type
> of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> for fastpath usage.

Actually there is, kind of, a way to find out with my tool if you have
an HT CPU. When you do run-on-all.sh it will pin the consumer to the
last CPU, then run the producer on all of them. Look for the number for
the HT pair - this shares cache between producer and consumer.
This is not the same as doing produce + consume on the same CPU, but
it's close enough I think.

To measure overhead I guess I should build a NOP tool that does not
actually produce or consume anything. Will do.

> Then I also want to know how this performs when two CPUs are involved.
> As this is also a primary use-case, for you, when sending packets into
> a guest.
>
> > E.g. a simple perf stat will measure how long it takes to execute.
> > There's a script that runs it on different CPUs,
> > so I normally do:
> >
> > sh run-on-all.sh perf stat -r 5 ./skb_array
>
> I recommend documenting this in the README file in the same dir ;-)
>
> [1] https://git.kernel.org/cgit/linux/kernel/git/mst/vhost.git/log/?h=vhost
> [2] https://github.com/netoptimizer/prototype-kernel
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> Author of http://www.iptv-analyzer.org
> LinkedIn: http://www.linkedin.com/in/brouer
On Tue, 24 May 2016 12:28:09 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> I do like perf, but it does not answer my questions about the
> performance of this queue. I will code something up in my own
> framework[2] to answer my own performance questions.
>
> Like what is the minimum overhead (in cycles) achievable with this type
> of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> for fastpath usage.

Coded it up here:
 https://github.com/netoptimizer/prototype-kernel/commit/b16a3332184
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c

This is a really fake benchmark, but it sort of shows the minimum
overhead achievable with this type of queue, where it is the same
CPU enqueuing and dequeuing, and the cache is guaranteed to be hot.

Measured on an i7-4790K CPU @ 4.00GHz, the average cost of
enqueue+dequeue of a single object is around 102 cycles(tsc).

To compare this with below, where enq and deq are measured separately:
 102 / 2 = 51 cycles

> Then I also want to know how this performs when two CPUs are involved.
> As this is also a primary use-case, for you, when sending packets into
> a guest.

Coded it up here:
 https://github.com/netoptimizer/prototype-kernel/commit/75fe31ef62e
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c

This parallel benchmark tries to keep two (or more) CPUs busy enqueuing
or dequeuing on the same skb_array queue. It prefills the queue, and
stops the test as soon as the queue is empty or full, or it completes a
number of "loops"/cycles.

For two CPUs the results are really good:
 enqueue: 54 cycles(tsc)
 dequeue: 53 cycles(tsc)

Going to 4 CPUs, things break down (but it was not the primary use-case?):
 CPU(0) 927 cycles(tsc) enqueue
 CPU(1) 921 cycles(tsc) dequeue
 CPU(2) 927 cycles(tsc) enqueue
 CPU(3) 898 cycles(tsc) dequeue

Next on my todo-list is to implement the same tests for e.g. alf_queue,
so we can compare the concurrency part (which is the important part).
But FYI I'll be busy the next days at conf http://fosd2016.itu.dk/
On Tue, May 24, 2016 at 07:03:20PM +0200, Jesper Dangaard Brouer wrote:
>
> On Tue, 24 May 2016 12:28:09 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
> > I do like perf, but it does not answer my questions about the
> > performance of this queue. I will code something up in my own
> > framework[2] to answer my own performance questions.
> >
> > Like what is the minimum overhead (in cycles) achievable with this type
> > of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> > for fastpath usage.
>
> Coded it up here:
>  https://github.com/netoptimizer/prototype-kernel/commit/b16a3332184
>  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c
>
> This is a really fake benchmark, but it sort of shows the minimum
> overhead achievable with this type of queue, where it is the same
> CPU enqueuing and dequeuing, and the cache is guaranteed to be hot.
>
> Measured on an i7-4790K CPU @ 4.00GHz, the average cost of
> enqueue+dequeue of a single object is around 102 cycles(tsc).
>
> To compare this with below, where enq and deq are measured separately:
>  102 / 2 = 51 cycles
>
> > Then I also want to know how this performs when two CPUs are involved.
> > As this is also a primary use-case, for you, when sending packets into
> > a guest.
>
> Coded it up here:
>  https://github.com/netoptimizer/prototype-kernel/commit/75fe31ef62e
>  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
>
> This parallel benchmark tries to keep two (or more) CPUs busy enqueuing
> or dequeuing on the same skb_array queue. It prefills the queue, and
> stops the test as soon as the queue is empty or full, or it completes a
> number of "loops"/cycles.
>
> For two CPUs the results are really good:
>  enqueue: 54 cycles(tsc)
>  dequeue: 53 cycles(tsc)
>
> Going to 4 CPUs, things break down (but it was not the primary use-case?):
>  CPU(0) 927 cycles(tsc) enqueue
>  CPU(1) 921 cycles(tsc) dequeue
>  CPU(2) 927 cycles(tsc) enqueue
>  CPU(3) 898 cycles(tsc) dequeue

It's mostly the spinlock contention I guess.
Maybe we don't need fair spinlocks in this case.
Try replacing spinlocks with a simple cmpxchg
and see what happens?
On Tue, 24 May 2016 23:34:14 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, May 24, 2016 at 07:03:20PM +0200, Jesper Dangaard Brouer wrote:
> >
> > On Tue, 24 May 2016 12:28:09 +0200
> > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >
> > > I do like perf, but it does not answer my questions about the
> > > performance of this queue. I will code something up in my own
> > > framework[2] to answer my own performance questions.
> > >
> > > Like what is the minimum overhead (in cycles) achievable with this type
> > > of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> > > for fastpath usage.
> >
> > Coded it up here:
> >  https://github.com/netoptimizer/prototype-kernel/commit/b16a3332184
> >  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c
> >
> > This is a really fake benchmark, but it sort of shows the minimum
> > overhead achievable with this type of queue, where it is the same
> > CPU enqueuing and dequeuing, and the cache is guaranteed to be hot.
> >
> > Measured on an i7-4790K CPU @ 4.00GHz, the average cost of
> > enqueue+dequeue of a single object is around 102 cycles(tsc).
> >
> > To compare this with below, where enq and deq are measured separately:
> >  102 / 2 = 51 cycles

The alf_queue[1] baseline is 26 cycles in this minimum-overhead
benchmark with a MPMC (Multi-Producer/Multi-Consumer) queue which
uses a locked cmpxchg. (The SPSC variant is 5 cycles, thus most of
the cost comes from the locked cmpxchg.)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue.h

> > > Then I also want to know how this performs when two CPUs are involved.
> > > As this is also a primary use-case, for you, when sending packets into
> > > a guest.
> >
> > Coded it up here:
> >  https://github.com/netoptimizer/prototype-kernel/commit/75fe31ef62e
> >  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
> >
> > This parallel benchmark tries to keep two (or more) CPUs busy enqueuing
> > or dequeuing on the same skb_array queue. It prefills the queue, and
> > stops the test as soon as the queue is empty or full, or it completes a
> > number of "loops"/cycles.
> >
> > For two CPUs the results are really good:
> >  enqueue: 54 cycles(tsc)
> >  dequeue: 53 cycles(tsc)

As MST points out, a scheme like the alf_queue[1] has the issue that it
"reads" the opposite cacheline of the consumer.tail/producer.tail to
determine if space-is-left/queue-is-empty. This causes an expensive
transition for the cache coherency protocol.

Coded up a similar test for alf_queue:
 https://github.com/netoptimizer/prototype-kernel/commit/b3ff2624f1
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/alf_queue_parallel01.c

For two CPUs the MPMC results are significantly worse, and demonstrate MST's point:
 enqueue: 227 cycles(tsc)
 dequeue: 231 cycles(tsc)

Alf_queue also has a SPSC (Single-Producer/Single-Consumer) variant:
 enqueue: 24 cycles(tsc)
 dequeue: 23 cycles(tsc)

> > Going to 4 CPUs, things break down (but it was not the primary use-case?):
> >  CPU(0) 927 cycles(tsc) enqueue
> >  CPU(1) 921 cycles(tsc) dequeue
> >  CPU(2) 927 cycles(tsc) enqueue
> >  CPU(3) 898 cycles(tsc) dequeue
>
> It's mostly the spinlock contention I guess.
> Maybe we don't need fair spinlocks in this case.
> Try replacing spinlocks with a simple cmpxchg
> and see what happens?

The alf_queue uses a cmpxchg scheme, and it does scale better when the
number of CPUs increases:

 CPUs:4 Average: 586 cycles(tsc)
 CPUs:6 Average: 744 cycles(tsc)
 CPUs:8 Average: 1578 cycles(tsc)

Notice the alf_queue was designed with the purpose of bulking, to
mitigate the effect of this cacheline bouncing, but that was not
covered in this test.
On Thu, 2 Jun 2016 20:47:25 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Tue, 24 May 2016 23:34:14 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > On Tue, May 24, 2016 at 07:03:20PM +0200, Jesper Dangaard Brouer wrote:
> > >
> > > On Tue, 24 May 2016 12:28:09 +0200
> > > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > >
> > > > I do like perf, but it does not answer my questions about the
> > > > performance of this queue. I will code something up in my own
> > > > framework[2] to answer my own performance questions.
> > > >
> > > > Like what is the minimum overhead (in cycles) achievable with this type
> > > > of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> > > > for fastpath usage.
> > >
> > > Coded it up here:
> > >  https://github.com/netoptimizer/prototype-kernel/commit/b16a3332184
> > >  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c
> > >
> > > This is a really fake benchmark, but it sort of shows the minimum
> > > overhead achievable with this type of queue, where it is the same
> > > CPU enqueuing and dequeuing, and the cache is guaranteed to be hot.
> > >
> > > Measured on an i7-4790K CPU @ 4.00GHz, the average cost of
> > > enqueue+dequeue of a single object is around 102 cycles(tsc).
> > >
> > > To compare this with below, where enq and deq are measured separately:
> > >  102 / 2 = 51 cycles
>
> The alf_queue[1] baseline is 26 cycles in this minimum-overhead
> benchmark with a MPMC (Multi-Producer/Multi-Consumer) queue which
> uses a locked cmpxchg. (The SPSC variant is 5 cycles, thus most of
> the cost comes from the locked cmpxchg.)
>
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue.h
>
> > > > Then I also want to know how this performs when two CPUs are involved.
> > > > As this is also a primary use-case, for you, when sending packets into
> > > > a guest.
> > >
> > > Coded it up here:
> > >  https://github.com/netoptimizer/prototype-kernel/commit/75fe31ef62e
> > >  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
> > >
> > > This parallel benchmark tries to keep two (or more) CPUs busy enqueuing
> > > or dequeuing on the same skb_array queue. It prefills the queue, and
> > > stops the test as soon as the queue is empty or full, or it completes a
> > > number of "loops"/cycles.
> > >
> > > For two CPUs the results are really good:
> > >  enqueue: 54 cycles(tsc)
> > >  dequeue: 53 cycles(tsc)
>
> As MST points out, a scheme like the alf_queue[1] has the issue that it
> "reads" the opposite cacheline of the consumer.tail/producer.tail to
> determine if space-is-left/queue-is-empty. This causes an expensive
> transition for the cache coherency protocol.
>
> Coded up a similar test for alf_queue:
>  https://github.com/netoptimizer/prototype-kernel/commit/b3ff2624f1
>  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/alf_queue_parallel01.c
>
> For two CPUs the MPMC results are significantly worse, and demonstrate MST's point:
>  enqueue: 227 cycles(tsc)
>  dequeue: 231 cycles(tsc)
>
> Alf_queue also has a SPSC (Single-Producer/Single-Consumer) variant:
>  enqueue: 24 cycles(tsc)
>  dequeue: 23 cycles(tsc)
>
> > > Going to 4 CPUs, things break down (but it was not the primary use-case?):
> > >  CPU(0) 927 cycles(tsc) enqueue
> > >  CPU(1) 921 cycles(tsc) dequeue
> > >  CPU(2) 927 cycles(tsc) enqueue
> > >  CPU(3) 898 cycles(tsc) dequeue
> >
> > It's mostly the spinlock contention I guess.
> > Maybe we don't need fair spinlocks in this case.
> > Try replacing spinlocks with a simple cmpxchg
> > and see what happens?
>
> The alf_queue uses a cmpxchg scheme, and it does scale better when the
> number of CPUs increases:
>
>  CPUs:4 Average: 586 cycles(tsc)
>  CPUs:6 Average: 744 cycles(tsc)
>  CPUs:8 Average: 1578 cycles(tsc)
>
> Notice the alf_queue was designed with the purpose of bulking, to
> mitigate the effect of this cacheline bouncing, but that was not
> covered in this test.

Added bulking to the alf_queue test:
 https://github.com/netoptimizer/prototype-kernel/commit/e22a0d8745

This does help significantly, but it requires use-cases where there are
packets to be bulk enqueued/dequeued. On the other hand, the skb_array
also requires that objects in the queue/array exceed one cacheline
before it starts to scale.

For two CPUs we need bulk=4 before beating skb_array. See benchmark
adjusting the bulk size:

 CPUs:2 bulk=step:1  Average: 231 cycles(tsc)
 CPUs:2 bulk=step:2  Average: 118 cycles(tsc)
 CPUs:2 bulk=step:3  Average:  65 cycles(tsc)
 CPUs:2 bulk=step:4  Average:  48 cycles(tsc)
 CPUs:2 bulk=step:5  Average:  40 cycles(tsc)
 CPUs:2 bulk=step:6  Average:  37 cycles(tsc)
 CPUs:2 bulk=step:7  Average:  29 cycles(tsc)
 CPUs:2 bulk=step:8  Average:  24 cycles(tsc)
 CPUs:2 bulk=step:9  Average:  23 cycles(tsc)
 CPUs:2 bulk=step:10 Average:  20 cycles(tsc)

Keeping bulk=8 and increasing the CPUs does show better scalability,
due to bulking.

This system (i7-4790K CPU @ 4.00GHz) only has 8 CPUs:

 CPUs:2 bulk=step:8 Average:  25 cycles(tsc)
 CPUs:4 bulk=step:8 Average:  71 cycles(tsc)
 CPUs:6 bulk=step:8 Average: 100 cycles(tsc)
 CPUs:8 bulk=step:8 Average: 185 cycles(tsc)

Found a (slower) 24-core CPU system (E5-2695v2-ES @ 2.50GHz):

 CPUs:2  bulk=step:8 Average:   50 cycles(tsc)
 CPUs:4  bulk=step:8 Average:  101 cycles(tsc)
 CPUs:6  bulk=step:8 Average:  214 cycles(tsc)
 CPUs:8  bulk=step:8 Average:  347 cycles(tsc)
 CPUs:10 bulk=step:8 Average:  468 cycles(tsc)
 CPUs:12 bulk=step:8 Average:  670 cycles(tsc)
 CPUs:14 bulk=step:8 Average:  698 cycles(tsc)
 CPUs:16 bulk=step:8 Average: 1149 cycles(tsc)
 CPUs:18 bulk=step:8 Average: 1094 cycles(tsc)
 CPUs:20 bulk=step:8 Average: 1349 cycles(tsc)
 CPUs:22 bulk=step:8 Average: 1406 cycles(tsc)
 CPUs:24 bulk=step:8 Average: 1553 cycles(tsc)

I still think skb_array is the winner, when the normal use-case is two
CPUs and we cannot guarantee CPU pinning (thus cannot use SPSC).
diff --git a/tools/virtio/ringtest/skb_array.c b/tools/virtio/ringtest/skb_array.c
new file mode 100644
index 0000000..4ab7e31
--- /dev/null
+++ b/tools/virtio/ringtest/skb_array.c
@@ -0,0 +1,167 @@
+#define _GNU_SOURCE
+#include "main.h"
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <pthread.h>
+#include <malloc.h>
+#include <assert.h>
+#include <errno.h>
+#include <limits.h>
+
+struct sk_buff;
+#define SMP_CACHE_BYTES 64
+#define cache_line_size() SMP_CACHE_BYTES
+#define ____cacheline_aligned_in_smp __attribute__ ((aligned (SMP_CACHE_BYTES)))
+#define unlikely(x) (__builtin_expect(!!(x), 0))
+#define ALIGN(x, a) (((x) + (a) - 1) / (a) * (a))
+typedef pthread_spinlock_t spinlock_t;
+
+typedef int gfp_t;
+static void *kzalloc(unsigned size, gfp_t gfp)
+{
+        void *p = memalign(64, size);
+        if (!p)
+                return p;
+        memset(p, 0, size);
+
+        return p;
+}
+
+static void kfree(void *p)
+{
+        if (p)
+                free(p);
+}
+
+static void spin_lock_init(spinlock_t *lock)
+{
+        int r = pthread_spin_init(lock, 0);
+        assert(!r);
+}
+
+static void spin_lock_bh(spinlock_t *lock)
+{
+        int ret = pthread_spin_lock(lock);
+        assert(!ret);
+}
+
+static void spin_unlock_bh(spinlock_t *lock)
+{
+        int ret = pthread_spin_unlock(lock);
+        assert(!ret);
+}
+
+#include "../../../include/linux/skb_array.h"
+
+static unsigned long long headcnt, tailcnt;
+static struct skb_array array ____cacheline_aligned_in_smp;
+
+/* implemented by ring */
+void alloc_ring(void)
+{
+        int ret = skb_array_init(&array, ring_size, 0);
+        assert(!ret);
+}
+
+/* guest side */
+int add_inbuf(unsigned len, void *buf, void *datap)
+{
+        int ret;
+
+        assert(headcnt - tailcnt <= ring_size);
+        ret = __skb_array_produce(&array, buf);
+        if (ret >= 0) {
+                ret = 0;
+                headcnt++;
+        }
+
+        return ret;
+}
+
+/*
+ * skb_array API provides no way for producer to find out whether a given
+ * buffer was consumed.  Our tests merely require that a successful get_buf
+ * implies that add_inbuf succeed in the past, and that add_inbuf will succeed,
+ * fake it accordingly.
+ */
+void *get_buf(unsigned *lenp, void **bufp)
+{
+        void *datap;
+
+        if (tailcnt == headcnt || __skb_array_full(&array))
+                datap = NULL;
+        else {
+                datap = "Buffer\n";
+                ++tailcnt;
+        }
+
+        return datap;
+}
+
+void poll_used(void)
+{
+        void *b;
+
+        do {
+                if (tailcnt == headcnt || __skb_array_full(&array)) {
+                        b = NULL;
+                        barrier();
+                } else {
+                        b = "Buffer\n";
+                }
+        } while (!b);
+}
+
+void disable_call()
+{
+        assert(0);
+}
+
+bool enable_call()
+{
+        assert(0);
+}
+
+void kick_available(void)
+{
+        assert(0);
+}
+
+/* host side */
+void disable_kick()
+{
+        assert(0);
+}
+
+bool enable_kick()
+{
+        assert(0);
+}
+
+void poll_avail(void)
+{
+        void *b;
+
+        do {
+                b = __skb_array_peek(&array);
+                barrier();
+        } while (!b);
+}
+
+bool use_buf(unsigned *lenp, void **bufp)
+{
+        struct sk_buff *skb;
+
+        skb = __skb_array_peek(&array);
+        if (skb) {
+                __skb_array_consume(&array);
+        }
+
+        return skb;
+}
+
+void call_used(void)
+{
+        assert(0);
+}
diff --git a/tools/virtio/ringtest/Makefile b/tools/virtio/ringtest/Makefile
index 6ba7455..87e58cf 100644
--- a/tools/virtio/ringtest/Makefile
+++ b/tools/virtio/ringtest/Makefile
@@ -1,6 +1,6 @@
 all:
 
-all: ring virtio_ring_0_9 virtio_ring_poll virtio_ring_inorder
+all: ring virtio_ring_0_9 virtio_ring_poll virtio_ring_inorder skb_array
 
 CFLAGS += -Wall
 CFLAGS += -pthread -O2 -ggdb
@@ -8,6 +8,7 @@ LDFLAGS += -pthread -O2 -ggdb
 
 main.o: main.c main.h
 ring.o: ring.c main.h
+skb_array.o: skb_array.c main.h ../../../include/linux/skb_array.h
 virtio_ring_0_9.o: virtio_ring_0_9.c main.h
 virtio_ring_poll.o: virtio_ring_poll.c virtio_ring_0_9.c main.h
 virtio_ring_inorder.o: virtio_ring_inorder.c virtio_ring_0_9.c main.h
@@ -15,6 +16,7 @@ ring: ring.o main.o
 virtio_ring_0_9: virtio_ring_0_9.o main.o
 virtio_ring_poll: virtio_ring_poll.o main.o
 virtio_ring_inorder: virtio_ring_inorder.o main.o
+skb_array: skb_array.o main.o
 clean:
         -rm main.o
         -rm ring.o ring
Add ringtest based unit test for skb array.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 tools/virtio/ringtest/skb_array.c | 167 ++++++++++++++++++++++++++++++++++++++
 tools/virtio/ringtest/Makefile    |   4 +-
 2 files changed, 170 insertions(+), 1 deletion(-)
 create mode 100644 tools/virtio/ringtest/skb_array.c