
[v13,19/25] replay: add BH oneshot event for block layer

Message ID 155074715265.32129.1158027635780211224.stgit@pasha-VirtualBox
State New
Series Fixing record/replay and adding reverse debugging

Commit Message

Pavel Dovgalyuk Feb. 21, 2019, 11:05 a.m. UTC
Replay is capable of recording normal BH events, but sometimes
there are single-use callbacks scheduled with the aio_bh_schedule_oneshot
function. This patch enables recording and replaying such callbacks.
The block layer uses these events for calling the completion function.
Replaying these calls makes the execution deterministic.

Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru>

--

v6:
 - moved stub function to the separate file for fixing linux-user build
v10:
 - replaced all block layer aio_bh_schedule_oneshot calls
---
 block/block-backend.c    |    8 +++++---
 block/io.c               |    4 ++--
 block/iscsi.c            |    5 +++--
 block/nfs.c              |    5 +++--
 block/null.c             |    4 +++-
 block/nvme.c             |    6 ++++--
 block/rbd.c              |    5 +++--
 block/vxhs.c             |    5 +++--
 include/sysemu/replay.h  |    3 +++
 replay/replay-events.c   |   16 ++++++++++++++++
 replay/replay-internal.h |    1 +
 replay/replay.c          |    2 +-
 stubs/Makefile.objs      |    1 +
 stubs/replay-user.c      |    9 +++++++++
 14 files changed, 57 insertions(+), 17 deletions(-)
 create mode 100644 stubs/replay-user.c
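The replay/replay-events.c hunk that defines the new replay_bh_schedule_oneshot_event() helper is not part of the excerpt quoted below. As a rough sketch of the shape such a wrapper takes (the names events_enabled, replay_add_event and REPLAY_ASYNC_EVENT_BH_ONESHOT are assumptions modelled on the existing replay event helpers, not the verbatim patch contents):

/* Sketch only: when record/replay is active, route the callback through the
 * replay event queue so it runs at a deterministic point; otherwise fall
 * back to a plain oneshot BH. */
void replay_bh_schedule_oneshot_event(AioContext *ctx,
                                      QEMUBHFunc *cb, void *opaque)
{
    if (events_enabled) {
        replay_add_event(REPLAY_ASYNC_EVENT_BH_ONESHOT, cb, opaque, 0);
    } else {
        aio_bh_schedule_oneshot(ctx, cb, opaque);
    }
}

The stub in stubs/replay-user.c for the linux-user build presumably just calls aio_bh_schedule_oneshot() unconditionally.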

Comments

Kevin Wolf March 4, 2019, 10:33 a.m. UTC | #1
On 21.02.2019 at 12:05, Pavel Dovgalyuk wrote:
> Replay is capable of recording normal BH events, but sometimes
> there are single use callbacks scheduled with aio_bh_schedule_oneshot
> function. This patch enables recording and replaying such callbacks.
> Block layer uses these events for calling the completion function.
> Replaying these calls makes the execution deterministic.
> 
> Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru>
> 
> --
> 
> v6:
>  - moved stub function to the separate file for fixing linux-user build
> v10:
>  - replaced all block layer aio_bh_schedule_oneshot calls
> ---
>  block/block-backend.c    |    8 +++++---
>  block/io.c               |    4 ++--
>  block/iscsi.c            |    5 +++--
>  block/nfs.c              |    5 +++--
>  block/null.c             |    4 +++-
>  block/nvme.c             |    6 ++++--
>  block/rbd.c              |    5 +++--
>  block/vxhs.c             |    5 +++--
>  include/sysemu/replay.h  |    3 +++
>  replay/replay-events.c   |   16 ++++++++++++++++
>  replay/replay-internal.h |    1 +
>  replay/replay.c          |    2 +-
>  stubs/Makefile.objs      |    1 +
>  stubs/replay-user.c      |    9 +++++++++
>  14 files changed, 57 insertions(+), 17 deletions(-)
>  create mode 100644 stubs/replay-user.c

This still doesn't catch all instances, e.g. everything that goes
through aio_co_schedule() is missing.
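For context, a call of that kind looks roughly like the following line (illustrative, not a specific hunk from this series); it re-enters a coroutine in the given AioContext at a point that the replay log does not record:

    aio_co_schedule(bdrv_get_aio_context(bs), co);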

But I fully expect this to get broken anyway all the time because nobody
understands which function to use, and if it works for your special case
now and we'll fix other stuff as you encounter it, maybe that's good
enough for you.

> @@ -1349,8 +1351,8 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t offset, int bytes,
>  
>      acb->has_returned = true;
>      if (acb->rwco.ret != NOT_DONE) {
> -        aio_bh_schedule_oneshot(blk_get_aio_context(blk),
> -                                blk_aio_complete_bh, acb);
> +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> +                                         blk_aio_complete_bh, acb);
>      }

This, and a few other places that you convert, are in fast paths and add
some calls that are unnecessary for non-replay cases.

I wonder if we could make replay optional in ./configure and then make
replay_bh_schedule_oneshot_event() a static inline function that can get
optimised away at compile time if the feature is disabled.
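To illustrate, the configure-gated variant being suggested might look roughly like this in include/sysemu/replay.h, assuming a hypothetical CONFIG_REPLAY switch (nothing like this exists in the series; it is only a sketch of the idea):

#ifdef CONFIG_REPLAY
void replay_bh_schedule_oneshot_event(AioContext *ctx,
                                      QEMUBHFunc *cb, void *opaque);
#else
/* Replay compiled out: collapse to the plain oneshot BH so the compiler
 * can inline it and the fast path pays nothing extra. */
static inline void replay_bh_schedule_oneshot_event(AioContext *ctx,
                                                    QEMUBHFunc *cb,
                                                    void *opaque)
{
    aio_bh_schedule_oneshot(ctx, cb, opaque);
}
#endif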

Kevin
Pavel Dovgalyuk March 4, 2019, 12:17 p.m. UTC | #2
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> On 21.02.2019 at 12:05, Pavel Dovgalyuk wrote:
> > Replay is capable of recording normal BH events, but sometimes
> > there are single use callbacks scheduled with aio_bh_schedule_oneshot
> > function. This patch enables recording and replaying such callbacks.
> > Block layer uses these events for calling the completion function.
> > Replaying these calls makes the execution deterministic.
> >
> > Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru>
> >
> > --
> >
> > v6:
> >  - moved stub function to the separate file for fixing linux-user build
> > v10:
> >  - replaced all block layer aio_bh_schedule_oneshot calls
> This still doesn't catch all instances, e.g. everything that goes
> through aio_co_schedule() is missing.

It seems that everything else is synchronized with the blkreplay driver,
which is mandatory when using block devices in rr mode.
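For reference, the blkreplay filter is stacked on top of the real image roughly as in the following invocation (an illustrative command line with placeholder file names, based on the record/replay documentation, not taken from this thread):

qemu-system-x86_64 \
  -icount shift=auto,rr=record,rrfile=replay.bin \
  -drive file=disk.qcow2,if=none,snapshot,id=img-direct \
  -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \
  -device ide-hd,drive=img-blkreplay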

> But I fully expect this to get broken anyway all the time because nobody
> understands which function to use, and if it works for your special case
> now and we'll fix other stuff as you encouter it, maybe that's good
> enough for you.

This problem exists in every subsystem, and it is ok for now, while record/replay
is not mature enough and not familiar to others.
When virtual devices are updated, developers may miss implementing loadvm/savevm
correctly. For example, loading the audio device state may fail to
shift the phase of the output signal. Nobody will notice that bug
in the migration process, but it shows up when we use record/replay.

We can't cover everything with record/replay tests. Most of the new bugs
only show up in complex configurations after billions of executed instructions.
But when this feature is available out of the box, we'll
at least get more smoke testing.

> 
> > @@ -1349,8 +1351,8 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t offset, int
> bytes,
> >
> >      acb->has_returned = true;
> >      if (acb->rwco.ret != NOT_DONE) {
> > -        aio_bh_schedule_oneshot(blk_get_aio_context(blk),
> > -                                blk_aio_complete_bh, acb);
> > +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> > +                                         blk_aio_complete_bh, acb);
> >      }
> 
> This, and a few other places that you convert, are in fast paths and add
> some calls that are unnecessary for non-replay cases.

I don't think this can cause a noticeable slowdown, but we can run
the tests if you want.
We have a test suite which performs disk-intensive computation.
It was created to measure the effect of running BH callbacks through
the virtual timer infrastructure.

> I wonder if we could make replay optional in ./configure and then make
> replay_bh_schedule_oneshot_event() a static inline function that can get
> optimised away at compile time if the feature is disabled.

It is coupled with icount. However, some icount calls also lie on
the fast paths and are completely useless when icount is not enabled.

Pavel Dovgalyuk
Kevin Wolf March 5, 2019, 9:52 a.m. UTC | #3
On 04.03.2019 at 13:17, Pavel Dovgalyuk wrote:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > On 21.02.2019 at 12:05, Pavel Dovgalyuk wrote:
> > > Replay is capable of recording normal BH events, but sometimes
> > > there are single use callbacks scheduled with aio_bh_schedule_oneshot
> > > function. This patch enables recording and replaying such callbacks.
> > > Block layer uses these events for calling the completion function.
> > > Replaying these calls makes the execution deterministic.
> > >
> > > Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru>
> > >
> > > --
> > >
> > > v6:
> > >  - moved stub function to the separate file for fixing linux-user build
> > > v10:
> > >  - replaced all block layer aio_bh_schedule_oneshot calls
> > This still doesn't catch all instances, e.g. everything that goes
> > through aio_co_schedule() is missing.
> 
> It seems, that everything else is synchronized with blkreplay driver
> which is mandatory when using block devices in rr mode.

Ah, yes, this is a good point. blkreplay goes through
replay_block_event(), which is where things get synchronised, right?

Does this mean that most of the places where you replaced a BH with your
new function don't actually need it either because they are called
through blkreplay and will go through replay_block_event() before
reaching the guest?

> > But I fully expect this to get broken anyway all the time because nobody
> > understands which function to use, and if it works for your special case
> > now and we'll fix other stuff as you encouter it, maybe that's good
> > enough for you.
> 
> This problem exists in every subsystem and it is ok for now, when
> record/replay is not mature enough, and not familiar for others.  When
> virtual devices are updated, developers may miss correct loadvm/savevm
> implementation. For example, loading the audio device state may miss
> shift the phase of the output signal. Nobody will notice that bug in
> the migration process, but it reveals when we use record/replay.
> 
> We can't cover everything with record/replay tests. Most of the new
> bugs can be revealed in complex configurations after billions of
> executed instructions.  But when this feature will be available out of
> the box, we'll at least get more smoke testing.

Ok.

> > > @@ -1349,8 +1351,8 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t offset, int
> > bytes,
> > >
> > >      acb->has_returned = true;
> > >      if (acb->rwco.ret != NOT_DONE) {
> > > -        aio_bh_schedule_oneshot(blk_get_aio_context(blk),
> > > -                                blk_aio_complete_bh, acb);
> > > +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> > > +                                         blk_aio_complete_bh, acb);
> > >      }
> > 
> > This, and a few other places that you convert, are in fast paths and add
> > some calls that are unnecessary for non-replay cases.
> 
> I don't think that this can make a noticeable slowdown, but we can run
> the tests if you want.
> We have the test suite which performs disk-intensive computation.
> It was created to measure the effect of running BH callbacks through
> the virtual timer infrastructure.

I think this requires quite fast storage to possibly make a difference.
Or if you don't have that, maybe a ramdisk or even a null-co:// backend
could do the trick. Maybe null-co:// is actually the best option.

Anyway, if it's not too much work for you, running some tests would be
good.

> > I wonder if we could make replay optional in ./configure and then make
> > replay_bh_schedule_oneshot_event() a static inline function that can get
> > optimised away at compile time if the feature is disabled.
> 
> It is coupled with icount. However, some icount calls are also lie on
> the fast paths and are completely useless when icount is not enabled.

Well, the common fast path is KVM, which doesn't have icount at all, so
that might make it less critical. :-)

I get your point, though maybe that just means that it should be possible
to disable both at configure time.

Kevin
Pavel Dovgalyuk March 5, 2019, 11:04 a.m. UTC | #4
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> On 04.03.2019 at 13:17, Pavel Dovgalyuk wrote:
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > On 21.02.2019 at 12:05, Pavel Dovgalyuk wrote:
> > > > Replay is capable of recording normal BH events, but sometimes
> > > > there are single use callbacks scheduled with aio_bh_schedule_oneshot
> > > > function. This patch enables recording and replaying such callbacks.
> > > > Block layer uses these events for calling the completion function.
> > > > Replaying these calls makes the execution deterministic.
> > > >
> > > > Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru>
> > > >
> > > > --
> > > >
> > > > v6:
> > > >  - moved stub function to the separate file for fixing linux-user build
> > > > v10:
> > > >  - replaced all block layer aio_bh_schedule_oneshot calls
> > > This still doesn't catch all instances, e.g. everything that goes
> > > through aio_co_schedule() is missing.
> >
> > It seems, that everything else is synchronized with blkreplay driver
> > which is mandatory when using block devices in rr mode.
> 
> Ah, yes, this is a good point. blkreplay goes through
> replay_block_event(), which is where things get synchronised, right?

Right.

> Does this mean that most of the places where you replaced a BH with your
> new function don't actually need it either because they are called
> through blkreplay and will go through replay_block_event() before
> reaching the guest?

I'm not sure, maybe some of them are going through the blkreplay driver.

> > > > @@ -1349,8 +1351,8 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t offset,
> int
> > > bytes,
> > > >
> > > >      acb->has_returned = true;
> > > >      if (acb->rwco.ret != NOT_DONE) {
> > > > -        aio_bh_schedule_oneshot(blk_get_aio_context(blk),
> > > > -                                blk_aio_complete_bh, acb);
> > > > +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> > > > +                                         blk_aio_complete_bh, acb);
> > > >      }
> > >
> > > This, and a few other places that you convert, are in fast paths and add
> > > some calls that are unnecessary for non-replay cases.
> >
> > I don't think that this can make a noticeable slowdown, but we can run
> > the tests if you want.
> > We have the test suite which performs disk-intensive computation.
> > It was created to measure the effect of running BH callbacks through
> > the virtual timer infrastructure.
> 
> I think this requires quite fast storage to possibly make a difference.

True.

> Or if you don't have that, maybe a ramdisk or even a null-co:// backend
> could do the trick. Maybe null-co:// is actually the best option.

We've got tests with file copying and compression on qcow2 disks.
How can the null backend be applied there?

> Anyway, if it's not too much work for you, running some tests would be
> good.
> 
> > > I wonder if we could make replay optional in ./configure and then make
> > > replay_bh_schedule_oneshot_event() a static inline function that can get
> > > optimised away at compile time if the feature is disabled.
> >
> > It is coupled with icount. However, some icount calls are also lie on
> > the fast paths and are completely useless when icount is not enabled.
> 
> Well, the common fast path is KVM, which doesn't have icount at all, so
> that might make it less critical. :-)

I see.

> I get your point, though maybe that just means that both should be
> possible to be disabled at configure time.

Right, it may be reasonable for some TCG use cases.
This could be a separate patch set, because it would touch everything.

Pavel Dovgalyuk
Kevin Wolf March 5, 2019, 11:12 a.m. UTC | #5
On 05.03.2019 at 12:04, Pavel Dovgalyuk wrote:
> > > > > @@ -1349,8 +1351,8 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t offset,
> > int
> > > > bytes,
> > > > >
> > > > >      acb->has_returned = true;
> > > > >      if (acb->rwco.ret != NOT_DONE) {
> > > > > -        aio_bh_schedule_oneshot(blk_get_aio_context(blk),
> > > > > -                                blk_aio_complete_bh, acb);
> > > > > +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> > > > > +                                         blk_aio_complete_bh, acb);
> > > > >      }
> > > >
> > > > This, and a few other places that you convert, are in fast paths and add
> > > > some calls that are unnecessary for non-replay cases.
> > >
> > > I don't think that this can make a noticeable slowdown, but we can run
> > > the tests if you want.
> > > We have the test suite which performs disk-intensive computation.
> > > It was created to measure the effect of running BH callbacks through
> > > the virtual timer infrastructure.
> > 
> > I think this requires quite fast storage to possibly make a difference.
> 
> True.
> 
> > Or if you don't have that, maybe a ramdisk or even a null-co:// backend
> > could do the trick. Maybe null-co:// is actually the best option.
> 
> We've got tests with file copying and compression on qcow2 disks.
> How the null backend can be applied there?

With qcow2, it can't really. null-co:// would work for running something
like fio directly against a virtual disk, without any image format
involved. Getting the image format out of the way makes things even a
little bit faster.

Maybe we should run an fio micro-benchmark with null-co just in addition
to your higher-level tests?

> > Anyway, if it's not too much work for you, running some tests would be
> > good.
> > 
> > > > I wonder if we could make replay optional in ./configure and then make
> > > > replay_bh_schedule_oneshot_event() a static inline function that can get
> > > > optimised away at compile time if the feature is disabled.
> > >
> > > It is coupled with icount. However, some icount calls are also lie on
> > > the fast paths and are completely useless when icount is not enabled.
> > 
> > Well, the common fast path is KVM, which doesn't have icount at all, so
> > that might make it less critical. :-)
> 
> I see.
> 
> > I get your point, though maybe that just means that both should be
> > possible to be disabled at configure time.
> 
> Right, it may be reasonable for some TCG use cases.
> This could be a separate patch set, because everything changes.

Yes, I agree.

Kevin
Pavel Dovgalyuk March 6, 2019, 8:33 a.m. UTC | #6
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> On 05.03.2019 at 12:04, Pavel Dovgalyuk wrote:
> > > > > > @@ -1349,8 +1351,8 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t
> offset,
> > > int
> > > > > bytes,
> > > > > >
> > > > > >      acb->has_returned = true;
> > > > > >      if (acb->rwco.ret != NOT_DONE) {
> > > > > > -        aio_bh_schedule_oneshot(blk_get_aio_context(blk),
> > > > > > -                                blk_aio_complete_bh, acb);
> > > > > > +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> > > > > > +                                         blk_aio_complete_bh, acb);
> > > > > >      }
> > > > >
> > > > > This, and a few other places that you convert, are in fast paths and add
> > > > > some calls that are unnecessary for non-replay cases.
> > > >
> > > > I don't think that this can make a noticeable slowdown, but we can run
> > > > the tests if you want.
> > > > We have the test suite which performs disk-intensive computation.
> > > > It was created to measure the effect of running BH callbacks through
> > > > the virtual timer infrastructure.
> > >
> > > I think this requires quite fast storage to possibly make a difference.
> >
> > True.
> >
> > > Or if you don't have that, maybe a ramdisk or even a null-co:// backend
> > > could do the trick. Maybe null-co:// is actually the best option.
> >
> > We've got tests with file copying and compression on qcow2 disks.
> > How the null backend can be applied there?
> 
> With qcow2, it can't really. null-co:// would work for running something
> like fio directly against a virtual disk, without any image format
> involved. Getting the image format out of the way makes things even a
> little bit faster.
> 
> Maybe we should run a micro-benchmark fio with null-co just in addition
> to your higher level tests?

We run something like that:
  -drive file={},if=none,id=drv,snapshot -device ide-hd,drive=drv -drive
file={},if=none,id=tst,snapshot -device ide-hd,drive=tst -net none -monitor stdio --enable-kvm -m 2G

I don't really get your idea. What should be added to the command line to run a null-co benchmark?

Pavel Dovgalyuk
Kevin Wolf March 6, 2019, 9:09 a.m. UTC | #7
On 06.03.2019 at 09:33, Pavel Dovgalyuk wrote:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> On 05.03.2019 at 12:04, Pavel Dovgalyuk wrote:
> > > > > > > @@ -1349,8 +1351,8 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t
> > offset,
> > > > int
> > > > > > bytes,
> > > > > > >
> > > > > > >      acb->has_returned = true;
> > > > > > >      if (acb->rwco.ret != NOT_DONE) {
> > > > > > > -        aio_bh_schedule_oneshot(blk_get_aio_context(blk),
> > > > > > > -                                blk_aio_complete_bh, acb);
> > > > > > > +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> > > > > > > +                                         blk_aio_complete_bh, acb);
> > > > > > >      }
> > > > > >
> > > > > > This, and a few other places that you convert, are in fast paths and add
> > > > > > some calls that are unnecessary for non-replay cases.
> > > > >
> > > > > I don't think that this can make a noticeable slowdown, but we can run
> > > > > the tests if you want.
> > > > > We have the test suite which performs disk-intensive computation.
> > > > > It was created to measure the effect of running BH callbacks through
> > > > > the virtual timer infrastructure.
> > > >
> > > > I think this requires quite fast storage to possibly make a difference.
> > >
> > > True.
> > >
> > > > Or if you don't have that, maybe a ramdisk or even a null-co:// backend
> > > > could do the trick. Maybe null-co:// is actually the best option.
> > >
> > > We've got tests with file copying and compression on qcow2 disks.
> > > How the null backend can be applied there?
> > 
> > With qcow2, it can't really. null-co:// would work for running something
> > like fio directly against a virtual disk, without any image format
> > involved. Getting the image format out of the way makes things even a
> > little bit faster.
> > 
> > Maybe we should run a micro-benchmark fio with null-co just in addition
> > to your higher level tests?
> 
> We run something like that:
>   -drive file={},if=none,id=drv,snapshot -device ide-hd,drive=drv -drive
> file={},if=none,id=tst,snapshot -device ide-hd,drive=tst -net none -monitor stdio --enable-kvm -m 2G
> 
> I don't really get your idea. What should be added to the command line
> to run null-co benchmark?

Something like:

-drive file=null-co://,if=none,id=null -device virtio-blk,drive=null

(Testing with IDE is useless, IDE is slow, you'll never get very high
IOPS or bandwidth with it, neither before nor after your series.)

Kevin
Pavel Dovgalyuk March 6, 2019, 9:18 a.m. UTC | #8
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> On 06.03.2019 at 09:33, Pavel Dovgalyuk wrote:
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > On 05.03.2019 at 12:04, Pavel Dovgalyuk wrote:
> > > > > > > > @@ -1349,8 +1351,8 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t
> > > offset,
> > > > > int
> > > > > > > bytes,
> > > > > > > >
> > > > > > > >      acb->has_returned = true;
> > > > > > > >      if (acb->rwco.ret != NOT_DONE) {
> > > > > > > > -        aio_bh_schedule_oneshot(blk_get_aio_context(blk),
> > > > > > > > -                                blk_aio_complete_bh, acb);
> > > > > > > > +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> > > > > > > > +                                         blk_aio_complete_bh, acb);
> > > > > > > >      }
> > > > > > >
> > > > > > > This, and a few other places that you convert, are in fast paths and add
> > > > > > > some calls that are unnecessary for non-replay cases.
> > > > > >
> > > > > > I don't think that this can make a noticeable slowdown, but we can run
> > > > > > the tests if you want.
> > > > > > We have the test suite which performs disk-intensive computation.
> > > > > > It was created to measure the effect of running BH callbacks through
> > > > > > the virtual timer infrastructure.
> > > > >
> > > > > I think this requires quite fast storage to possibly make a difference.
> > > >
> > > > True.
> > > >
> > > > > Or if you don't have that, maybe a ramdisk or even a null-co:// backend
> > > > > could do the trick. Maybe null-co:// is actually the best option.
> > > >
> > > > We've got tests with file copying and compression on qcow2 disks.
> > > > How the null backend can be applied there?
> > >
> > > With qcow2, it can't really. null-co:// would work for running something
> > > like fio directly against a virtual disk, without any image format
> > > involved. Getting the image format out of the way makes things even a
> > > little bit faster.
> > >
> > > Maybe we should run a micro-benchmark fio with null-co just in addition
> > > to your higher level tests?
> >
> > We run something like that:
> >   -drive file={},if=none,id=drv,snapshot -device ide-hd,drive=drv -drive
> > file={},if=none,id=tst,snapshot -device ide-hd,drive=tst -net none -monitor stdio --enable-
> kvm -m 2G
> >
> > I don't really get your idea. What should be added to the command line
> > to run null-co benchmark?
> 
> Something like:
> 
> -drive file=null-co://,if=none,id=null -device virtio-blk,drive=null

And this drive should be the destination of the copy operations, right?

> (Testing with IDE is useless, IDE is slow, you'll never get very high
> IOPS or bandwidth with it, neither before nor after your series.)

SCSI or something else?

Pavel Dovgalyuk
Kevin Wolf March 6, 2019, 9:29 a.m. UTC | #9
On 06.03.2019 at 10:18, Pavel Dovgalyuk wrote:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> On 06.03.2019 at 09:33, Pavel Dovgalyuk wrote:
> > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > On 05.03.2019 at 12:04, Pavel Dovgalyuk wrote:
> > > > > > > > > @@ -1349,8 +1351,8 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t
> > > > offset,
> > > > > > int
> > > > > > > > bytes,
> > > > > > > > >
> > > > > > > > >      acb->has_returned = true;
> > > > > > > > >      if (acb->rwco.ret != NOT_DONE) {
> > > > > > > > > -        aio_bh_schedule_oneshot(blk_get_aio_context(blk),
> > > > > > > > > -                                blk_aio_complete_bh, acb);
> > > > > > > > > +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> > > > > > > > > +                                         blk_aio_complete_bh, acb);
> > > > > > > > >      }
> > > > > > > >
> > > > > > > > This, and a few other places that you convert, are in fast paths and add
> > > > > > > > some calls that are unnecessary for non-replay cases.
> > > > > > >
> > > > > > > I don't think that this can make a noticeable slowdown, but we can run
> > > > > > > the tests if you want.
> > > > > > > We have the test suite which performs disk-intensive computation.
> > > > > > > It was created to measure the effect of running BH callbacks through
> > > > > > > the virtual timer infrastructure.
> > > > > >
> > > > > > I think this requires quite fast storage to possibly make a difference.
> > > > >
> > > > > True.
> > > > >
> > > > > > Or if you don't have that, maybe a ramdisk or even a null-co:// backend
> > > > > > could do the trick. Maybe null-co:// is actually the best option.
> > > > >
> > > > > We've got tests with file copying and compression on qcow2 disks.
> > > > > How the null backend can be applied there?
> > > >
> > > > With qcow2, it can't really. null-co:// would work for running something
> > > > like fio directly against a virtual disk, without any image format
> > > > involved. Getting the image format out of the way makes things even a
> > > > little bit faster.
> > > >
> > > > Maybe we should run a micro-benchmark fio with null-co just in addition
> > > > to your higher level tests?
> > >
> > > We run something like that:
> > >   -drive file={},if=none,id=drv,snapshot -device ide-hd,drive=drv -drive
> > > file={},if=none,id=tst,snapshot -device ide-hd,drive=tst -net none -monitor stdio --enable-
> > kvm -m 2G
> > >
> > > I don't really get your idea. What should be added to the command line
> > > to run null-co benchmark?
> > 
> > Something like:
> > 
> > -drive file=null-co://,if=none,id=null -device virtio-blk,drive=null
> 
> And this drive should be destination of the copy operations, right?

I don't know your exact benchmark, but this drive should be where the
high I/O rates are, yes.

For getting meaningful numbers, you should have I/O only on the fast
test disk (you're talking about a copy, where is the source?), you should
use direct I/O to get the page cache of the guest out of the way, and
you should make sure that multiple requests are issued in parallel.
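For example, an fio job along these lines would exercise exactly that pattern (the device name, size and runtime are placeholders):

fio --name=randread --filename=/dev/vdb --rw=randread --bs=4k \
    --direct=1 --ioengine=libaio --iodepth=32 \
    --size=1G --runtime=60 --time_based --group_reporting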

> > (Testing with IDE is useless, IDE is slow, you'll never get very high
> > IOPS or bandwidth with it, neither before nor after your series.)
> 
> SCSI or something else?

I think virtio-blk is still the fastest we have.

Kevin
Pavel Dovgalyuk March 6, 2019, 9:37 a.m. UTC | #10
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> On 06.03.2019 at 10:18, Pavel Dovgalyuk wrote:
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > On 06.03.2019 at 09:33, Pavel Dovgalyuk wrote:
> > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > On 05.03.2019 at 12:04, Pavel Dovgalyuk wrote:
> > > > > > > > > > @@ -1349,8 +1351,8 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk,
> int64_t
> > > > > offset,
> > > > > > > int
> > > > > > > > > bytes,
> > > > > > > > > >
> > > > > > > > > >      acb->has_returned = true;
> > > > > > > > > >      if (acb->rwco.ret != NOT_DONE) {
> > > > > > > > > > -        aio_bh_schedule_oneshot(blk_get_aio_context(blk),
> > > > > > > > > > -                                blk_aio_complete_bh, acb);
> > > > > > > > > > +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> > > > > > > > > > +                                         blk_aio_complete_bh, acb);
> > > > > > > > > >      }
> > > > > > > > >
> > > > > > > > > This, and a few other places that you convert, are in fast paths and add
> > > > > > > > > some calls that are unnecessary for non-replay cases.
> > > > > > > >
> > > > > > > > I don't think that this can make a noticeable slowdown, but we can run
> > > > > > > > the tests if you want.
> > > > > > > > We have the test suite which performs disk-intensive computation.
> > > > > > > > It was created to measure the effect of running BH callbacks through
> > > > > > > > the virtual timer infrastructure.
> > > > > > >
> > > > > > > I think this requires quite fast storage to possibly make a difference.
> > > > > >
> > > > > > True.
> > > > > >
> > > > > > > Or if you don't have that, maybe a ramdisk or even a null-co:// backend
> > > > > > > could do the trick. Maybe null-co:// is actually the best option.
> > > > > >
> > > > > > We've got tests with file copying and compression on qcow2 disks.
> > > > > > How the null backend can be applied there?
> > > > >
> > > > > With qcow2, it can't really. null-co:// would work for running something
> > > > > like fio directly against a virtual disk, without any image format
> > > > > involved. Getting the image format out of the way makes things even a
> > > > > little bit faster.
> > > > >
> > > > > Maybe we should run a micro-benchmark fio with null-co just in addition
> > > > > to your higher level tests?
> > > >
> > > > We run something like that:
> > > >   -drive file={},if=none,id=drv,snapshot -device ide-hd,drive=drv -drive
> > > > file={},if=none,id=tst,snapshot -device ide-hd,drive=tst -net none -monitor stdio --
> enable-
> > > kvm -m 2G
> > > >
> > > > I don't really get your idea. What should be added to the command line
> > > > to run null-co benchmark?
> > >
> > > Something like:
> > >
> > > -drive file=null-co://,if=none,id=null -device virtio-blk,drive=null
> >
> > And this drive should be destination of the copy operations, right?
> 
> I don't know your exact benchmark, but this drive should be where the
> high I/O rates are, yes.

Ok.

> For getting meaningful numbers, you should have I/O only on the fast
> test disk (you're talking about a copy, where is source?), 

We used a qcow2 image as a source.

> you should
> use direct I/O to get the page cache of the guest out of the way, and
> you should make sure that multiple requests are issued in parallel.

Is this possible if we have only conventional HDDs?


Pavel Dovgalyuk
Kevin Wolf March 6, 2019, 10:33 a.m. UTC | #11
On 06.03.2019 at 10:37, Pavel Dovgalyuk wrote:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > On 06.03.2019 at 10:18, Pavel Dovgalyuk wrote:
> > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > On 06.03.2019 at 09:33, Pavel Dovgalyuk wrote:
> > > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > > On 05.03.2019 at 12:04, Pavel Dovgalyuk wrote:
> > > > > > > > > > > @@ -1349,8 +1351,8 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk,
> > int64_t
> > > > > > offset,
> > > > > > > > int
> > > > > > > > > > bytes,
> > > > > > > > > > >
> > > > > > > > > > >      acb->has_returned = true;
> > > > > > > > > > >      if (acb->rwco.ret != NOT_DONE) {
> > > > > > > > > > > -        aio_bh_schedule_oneshot(blk_get_aio_context(blk),
> > > > > > > > > > > -                                blk_aio_complete_bh, acb);
> > > > > > > > > > > +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> > > > > > > > > > > +                                         blk_aio_complete_bh, acb);
> > > > > > > > > > >      }
> > > > > > > > > >
> > > > > > > > > > This, and a few other places that you convert, are in fast paths and add
> > > > > > > > > > some calls that are unnecessary for non-replay cases.
> > > > > > > > >
> > > > > > > > > I don't think that this can make a noticeable slowdown, but we can run
> > > > > > > > > the tests if you want.
> > > > > > > > > We have the test suite which performs disk-intensive computation.
> > > > > > > > > It was created to measure the effect of running BH callbacks through
> > > > > > > > > the virtual timer infrastructure.
> > > > > > > >
> > > > > > > > I think this requires quite fast storage to possibly make a difference.
> > > > > > >
> > > > > > > True.
> > > > > > >
> > > > > > > > Or if you don't have that, maybe a ramdisk or even a null-co:// backend
> > > > > > > > could do the trick. Maybe null-co:// is actually the best option.
> > > > > > >
> > > > > > > We've got tests with file copying and compression on qcow2 disks.
> > > > > > > How the null backend can be applied there?
> > > > > >
> > > > > > With qcow2, it can't really. null-co:// would work for running something
> > > > > > like fio directly against a virtual disk, without any image format
> > > > > > involved. Getting the image format out of the way makes things even a
> > > > > > little bit faster.
> > > > > >
> > > > > > Maybe we should run a micro-benchmark fio with null-co just in addition
> > > > > > to your higher level tests?
> > > > >
> > > > > We run something like that:
> > > > >   -drive file={},if=none,id=drv,snapshot -device ide-hd,drive=drv -drive
> > > > > file={},if=none,id=tst,snapshot -device ide-hd,drive=tst -net none -monitor stdio --
> > enable-
> > > > kvm -m 2G
> > > > >
> > > > > I don't really get your idea. What should be added to the command line
> > > > > to run null-co benchmark?
> > > >
> > > > Something like:
> > > >
> > > > -drive file=null-co://,if=none,id=null -device virtio-blk,drive=null
> > >
> > > And this drive should be destination of the copy operations, right?
> > 
> > I don't know your exact benchmark, but this drive should be where the
> > high I/O rates are, yes.
> 
> Ok.
> 
> > For getting meaningful numbers, you should have I/O only on the fast
> > test disk (you're talking about a copy, where is source?), 
> 
> We used a qcow2 image as a source.

So the source is going to slow down the I/O and you won't actually test
whether the possible maximum changes.

> > you should
> > use direct I/O to get the page cache of the guest out of the way, and
> > you should make sure that multiple requests are issued in parallel.
> 
> Is this possible, if we have only conventional HDDs?

null-co:// doesn't access your disk at all, so if this is the only
virtual disk that has I/O, the conventional HDD doesn't hurt. But you're
right that you probably can't use your physical disk for high
performance benchmarks then.

I'm going to suggest once more to use fio for storage testing. Actually,
maybe I can find the time to do this myself on my system, too.

But anyway, I think it's good if you're at least aware that the test you
used so far was a low-performance test where any possible performance
degradations would be dwarfed by both the slow physical disk and the
slow IDE interface to communicate with the guest (which also limits you
to one request at a time).

Kevin
Pavel Dovgalyuk March 6, 2019, 10:57 a.m. UTC | #12
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> On 06.03.2019 at 10:37, Pavel Dovgalyuk wrote:
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > On 06.03.2019 at 10:18, Pavel Dovgalyuk wrote:
> > > > > Something like:
> > > > >
> > > > > -drive file=null-co://,if=none,id=null -device virtio-blk,drive=null
> > > >
> > > > And this drive should be destination of the copy operations, right?
> > >
> > > I don't know your exact benchmark, but this drive should be where the
> > > high I/O rates are, yes.
> >
> > Ok.
> >
> > > For getting meaningful numbers, you should have I/O only on the fast
> > > test disk (you're talking about a copy, where is source?),
> >
> > We used a qcow2 image as a source.
> 
> So the source is going to slow down the I/O and you won't actually test
> whether the possible maximum changes.
> 
> > > you should
> > > use direct I/O to get the page cache of the guest out of the way, and
> > > you should make sure that multiple requests are issued in parallel.
> >
> > Is this possible, if we have only conventional HDDs?
> 
> null-co:// doesn't access your disk at all, so if this is the only
> virtual disk that has I/O, the conventional HDD doesn't hurt. But you're
> right that you probably can't use your physical disk for high
> performance benchmarks then.
> 
> I'm going to suggest once more to use fio for storage testing. Actually,
> maybe I can find the time to do this myself on my system, too.

Ok, I'll try it.

> But anyway, I think it's good if you're at least aware that the test you
> used so far was a low-performance test where any possible performance
> degradations would be dwarved by both the slow physical disk and the
> slow IDE interface to communicate with the guest (which also enforces to
> have only one request at the same time).

Pavel Dovgalyuk
Pavel Dovgalyuk March 6, 2019, 2 p.m. UTC | #13
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> On 06.03.2019 at 10:37, Pavel Dovgalyuk wrote:
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > On 06.03.2019 at 10:18, Pavel Dovgalyuk wrote:
> > > > > Something like:
> > > > >
> > > > > -drive file=null-co://,if=none,id=null -device virtio-blk,drive=null
> > > >
> > > > And this drive should be destination of the copy operations, right?
> > >
> > > I don't know your exact benchmark, but this drive should be where the
> > > high I/O rates are, yes.
> >
> > Ok.
> >
> > > For getting meaningful numbers, you should have I/O only on the fast
> > > test disk (you're talking about a copy, where is source?),
> >
> > We used a qcow2 image as a source.
> 
> So the source is going to slow down the I/O and you won't actually test
> whether the possible maximum changes.
> 
> > > you should
> > > use direct I/O to get the page cache of the guest out of the way, and
> > > you should make sure that multiple requests are issued in parallel.
> >
> > Is this possible, if we have only conventional HDDs?
> 
> null-co:// doesn't access your disk at all, so if this is the only
> virtual disk that has I/O, the conventional HDD doesn't hurt. But you're
> right that you probably can't use your physical disk for high
> performance benchmarks then.
> 
> I'm going to suggest once more to use fio for storage testing. Actually,
> maybe I can find the time to do this myself on my system, too.

We've run some tests with the following fio configs:

[readtest]
blocksize=4k
filename=/dev/vda
rw=randread
direct=1
buffered=0
ioengine=libaio
iodepth=32

[writetest]
blocksize=4k
filename=/dev/vda
rw=randwrite
direct=1
buffered=0
ioengine=libaio
iodepth=32

One with read, one with write, and one with both.

master branch:
1  read : io=1024.0MB, bw=475545KB/s, iops=118886, runt=  2205msec

2  write: io=1024.0MB, bw=445444KB/s, iops=111361, runt=  2354msec

3  read : io=1024.0MB, bw=229850KB/s, iops=57462, runt=  4562msec
   write: io=1024.0MB, bw=227210KB/s, iops=56802, runt=  4615msec

rr branch:
1  read : io=1024.0MB, bw=479021KB/s, iops=119755, runt=  2189msec
2  write: io=1024.0MB, bw=440763KB/s, iops=110190, runt=  2379msec

3  read : io=1024.0MB, bw=230456KB/s, iops=57614, runt=  4550msec
   write: io=1024.0MB, bw=228548KB/s, iops=57136, runt=  4588msec

It seems that the difference can't be measured in our environment.

Pavel Dovgalyuk
Pavel Dovgalyuk March 13, 2019, 5:57 a.m. UTC | #14
Kevin, what about this one?


Pavel Dovgalyuk

> -----Original Message-----
> From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru]
> Sent: Wednesday, March 06, 2019 5:01 PM
> To: 'Kevin Wolf'
> Cc: 'Pavel Dovgalyuk'; qemu-devel@nongnu.org; peter.maydell@linaro.org; war2jordan@live.com;
> crosthwaite.peter@gmail.com; boost.lists@gmail.com; artem.k.pisarenko@gmail.com;
> quintela@redhat.com; ciro.santilli@gmail.com; jasowang@redhat.com; mst@redhat.com;
> armbru@redhat.com; mreitz@redhat.com; maria.klimushenkova@ispras.ru; kraxel@redhat.com;
> thomas.dullien@googlemail.com; pbonzini@redhat.com; alex.bennee@linaro.org;
> dgilbert@redhat.com; rth@twiddle.net
> Subject: RE: [PATCH v13 19/25] replay: add BH oneshot event for block layer
> 
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > On 06.03.2019 at 10:37, Pavel Dovgalyuk wrote:
> > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > On 06.03.2019 at 10:18, Pavel Dovgalyuk wrote:
> > > > > > Something like:
> > > > > >
> > > > > > -drive file=null-co://,if=none,id=null -device virtio-blk,drive=null
> > > > >
> > > > > And this drive should be destination of the copy operations, right?
> > > >
> > > > I don't know your exact benchmark, but this drive should be where the
> > > > high I/O rates are, yes.
> > >
> > > Ok.
> > >
> > > > For getting meaningful numbers, you should have I/O only on the fast
> > > > test disk (you're talking about a copy, where is source?),
> > >
> > > We used a qcow2 image as a source.
> >
> > So the source is going to slow down the I/O and you won't actually test
> > whether the possible maximum changes.
> >
> > > > you should
> > > > use direct I/O to get the page cache of the guest out of the way, and
> > > > you should make sure that multiple requests are issued in parallel.
> > >
> > > Is this possible, if we have only conventional HDDs?
> >
> > null-co:// doesn't access your disk at all, so if this is the only
> > virtual disk that has I/O, the conventional HDD doesn't hurt. But you're
> > right that you probably can't use your physical disk for high
> > performance benchmarks then.
> >
> > I'm going to suggest once more to use fio for storage testing. Actually,
> > maybe I can find the time to do this myself on my system, too.
> 
> We've made some testing with the following fio configs:
> 
> [readtest]
> blocksize=4k
> filename=/dev/vda
> rw=randread
> direct=1
> buffered=0
> ioengine=libaio
> iodepth=32
> 
> [writetest]
> blocksize=4k
> filename=/dev/vda
> rw=randwrite
> direct=1
> buffered=0
> ioengine=libaio
> iodepth=32
> 
> One with read, one with write, and one with both.
> 
> master branch:
> 1  read : io=1024.0MB, bw=475545KB/s, iops=118886, runt=  2205msec
> 
> 2  write: io=1024.0MB, bw=445444KB/s, iops=111361, runt=  2354msec
> 
> 3  read : io=1024.0MB, bw=229850KB/s, iops=57462, runt=  4562msec
>    write: io=1024.0MB, bw=227210KB/s, iops=56802, runt=  4615msec
> 
> rr branch:
> 1  read : io=1024.0MB, bw=479021KB/s, iops=119755, runt=  2189msec
> 2  write: io=1024.0MB, bw=440763KB/s, iops=110190, runt=  2379msec
> 
> 3  read : io=1024.0MB, bw=230456KB/s, iops=57614, runt=  4550msec
>    write: io=1024.0MB, bw=228548KB/s, iops=57136, runt=  4588msec
> 
> It seems that the difference can't be measured in our environment.
> 
> Pavel Dovgalyuk
Kevin Wolf March 14, 2019, 2:57 p.m. UTC | #15
On 13.03.2019 at 06:57, Pavel Dovgalyuk wrote:
> Kevin, what about this one?

I ran some benchmarks on my system, too, and included some cases with
your series. I agree that there is no significant difference, so that's
fine.

Kevin

> > -----Original Message-----
> > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru]
> > Sent: Wednesday, March 06, 2019 5:01 PM
> > To: 'Kevin Wolf'
> > Cc: 'Pavel Dovgalyuk'; qemu-devel@nongnu.org; peter.maydell@linaro.org; war2jordan@live.com;
> > crosthwaite.peter@gmail.com; boost.lists@gmail.com; artem.k.pisarenko@gmail.com;
> > quintela@redhat.com; ciro.santilli@gmail.com; jasowang@redhat.com; mst@redhat.com;
> > armbru@redhat.com; mreitz@redhat.com; maria.klimushenkova@ispras.ru; kraxel@redhat.com;
> > thomas.dullien@googlemail.com; pbonzini@redhat.com; alex.bennee@linaro.org;
> > dgilbert@redhat.com; rth@twiddle.net
> > Subject: RE: [PATCH v13 19/25] replay: add BH oneshot event for block layer
> > 
> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > On 06.03.2019 at 10:37, Pavel Dovgalyuk wrote:
> > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > > > > On 06.03.2019 at 10:18, Pavel Dovgalyuk wrote:
> > > > > > > Something like:
> > > > > > >
> > > > > > > -drive file=null-co://,if=none,id=null -device virtio-blk,drive=null
> > > > > >
> > > > > > And this drive should be destination of the copy operations, right?
> > > > >
> > > > > I don't know your exact benchmark, but this drive should be where the
> > > > > high I/O rates are, yes.
> > > >
> > > > Ok.
> > > >
> > > > > For getting meaningful numbers, you should have I/O only on the fast
> > > > > test disk (you're talking about a copy, where is source?),
> > > >
> > > > We used a qcow2 image as a source.
> > >
> > > So the source is going to slow down the I/O and you won't actually test
> > > whether the possible maximum changes.
> > >
> > > > > you should
> > > > > use direct I/O to get the page cache of the guest out of the way, and
> > > > > you should make sure that multiple requests are issued in parallel.
> > > >
> > > > Is this possible, if we have only conventional HDDs?
> > >
> > > null-co:// doesn't access your disk at all, so if this is the only
> > > virtual disk that has I/O, the conventional HDD doesn't hurt. But you're
> > > right that you probably can't use your physical disk for high
> > > performance benchmarks then.
> > >
> > > I'm going to suggest once more to use fio for storage testing. Actually,
> > > maybe I can find the time to do this myself on my system, too.
> > 
> > We've made some testing with the following fio configs:
> > 
> > [readtest]
> > blocksize=4k
> > filename=/dev/vda
> > rw=randread
> > direct=1
> > buffered=0
> > ioengine=libaio
> > iodepth=32
> > 
> > [writetest]
> > blocksize=4k
> > filename=/dev/vda
> > rw=randwrite
> > direct=1
> > buffered=0
> > ioengine=libaio
> > iodepth=32
> > 
> > One with read, one with write, and one with both.
> > 
> > master branch:
> > 1  read : io=1024.0MB, bw=475545KB/s, iops=118886, runt=  2205msec
> > 
> > 2  write: io=1024.0MB, bw=445444KB/s, iops=111361, runt=  2354msec
> > 
> > 3  read : io=1024.0MB, bw=229850KB/s, iops=57462, runt=  4562msec
> >    write: io=1024.0MB, bw=227210KB/s, iops=56802, runt=  4615msec
> > 
> > rr branch:
> > 1  read : io=1024.0MB, bw=479021KB/s, iops=119755, runt=  2189msec
> > 2  write: io=1024.0MB, bw=440763KB/s, iops=110190, runt=  2379msec
> > 
> > 3  read : io=1024.0MB, bw=230456KB/s, iops=57614, runt=  4550msec
> >    write: io=1024.0MB, bw=228548KB/s, iops=57136, runt=  4588msec
> > 
> > It seems that the difference can't be measured in our environment.
> > 
> > Pavel Dovgalyuk
> 
>
Pavel Dovgalyuk March 14, 2019, 3:12 p.m. UTC | #16
Thanks.
What about a formal Reviewed-by then?

Pavel Dovgalyuk

Sent with BlueMail

On 14 Mar 2019 at 17:57, Kevin Wolf <kwolf@redhat.com> wrote:
>On 13.03.2019 at 06:57, Pavel Dovgalyuk wrote:
>> Kevin, what about this one?
>
>I made some benchmark on my system, too, and included some cases with
>your series. I agree that there is no significant difference, so that's
>fine.
>
>Kevin
>
>> > -----Original Message-----
>> > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru]
>> > Sent: Wednesday, March 06, 2019 5:01 PM
>> > To: 'Kevin Wolf'
>> > Cc: 'Pavel Dovgalyuk'; qemu-devel@nongnu.org;
>peter.maydell@linaro.org; war2jordan@live.com;
>> > crosthwaite.peter@gmail.com; boost.lists@gmail.com;
>artem.k.pisarenko@gmail.com;
>> > quintela@redhat.com; ciro.santilli@gmail.com; jasowang@redhat.com;
>mst@redhat.com;
>> > armbru@redhat.com; mreitz@redhat.com;
>maria.klimushenkova@ispras.ru; kraxel@redhat.com;
>> > thomas.dullien@googlemail.com; pbonzini@redhat.com;
>alex.bennee@linaro.org;
>> > dgilbert@redhat.com; rth@twiddle.net
>> > Subject: RE: [PATCH v13 19/25] replay: add BH oneshot event for
>block layer
>> > 
>> > > From: Kevin Wolf [mailto:kwolf@redhat.com]
>> > > On 06.03.2019 at 10:37, Pavel Dovgalyuk wrote:
>> > > > > From: Kevin Wolf [mailto:kwolf@redhat.com]
>> > > > > On 06.03.2019 at 10:18, Pavel Dovgalyuk wrote:
>> > > > > > > Something like:
>> > > > > > >
>> > > > > > > -drive file=null-co://,if=none,id=null -device
>virtio-blk,drive=null
>> > > > > >
>> > > > > > And this drive should be destination of the copy
>operations, right?
>> > > > >
>> > > > > I don't know your exact benchmark, but this drive should be
>where the
>> > > > > high I/O rates are, yes.
>> > > >
>> > > > Ok.
>> > > >
>> > > > > For getting meaningful numbers, you should have I/O only on
>the fast
>> > > > > test disk (you're talking about a copy, where is source?),
>> > > >
>> > > > We used a qcow2 image as a source.
>> > >
>> > > So the source is going to slow down the I/O and you won't
>actually test
>> > > whether the possible maximum changes.
>> > >
>> > > > > you should
>> > > > > use direct I/O to get the page cache of the guest out of the
>way, and
>> > > > > you should make sure that multiple requests are issued in
>parallel.
>> > > >
>> > > > Is this possible, if we have only conventional HDDs?
>> > >
>> > > null-co:// doesn't access your disk at all, so if this is the
>only
>> > > virtual disk that has I/O, the conventional HDD doesn't hurt. But
>you're
>> > > right that you probably can't use your physical disk for high
>> > > performance benchmarks then.
>> > >
>> > > I'm going to suggest once more to use fio for storage testing.
>Actually,
>> > > maybe I can find the time to do this myself on my system, too.
>> > 
>> > We've made some testing with the following fio configs:
>> > 
>> > [readtest]
>> > blocksize=4k
>> > filename=/dev/vda
>> > rw=randread
>> > direct=1
>> > buffered=0
>> > ioengine=libaio
>> > iodepth=32
>> > 
>> > [writetest]
>> > blocksize=4k
>> > filename=/dev/vda
>> > rw=randwrite
>> > direct=1
>> > buffered=0
>> > ioengine=libaio
>> > iodepth=32
>> > 
>> > One with read, one with write, and one with both.
>> > 
>> > master branch:
>> > 1  read : io=1024.0MB, bw=475545KB/s, iops=118886, runt=  2205msec
>> > 
>> > 2  write: io=1024.0MB, bw=445444KB/s, iops=111361, runt=  2354msec
>> > 
>> > 3  read : io=1024.0MB, bw=229850KB/s, iops=57462, runt=  4562msec
>> >    write: io=1024.0MB, bw=227210KB/s, iops=56802, runt=  4615msec
>> > 
>> > rr branch:
>> > 1  read : io=1024.0MB, bw=479021KB/s, iops=119755, runt=  2189msec
>> > 2  write: io=1024.0MB, bw=440763KB/s, iops=110190, runt=  2379msec
>> > 
>> > 3  read : io=1024.0MB, bw=230456KB/s, iops=57614, runt=  4550msec
>> >    write: io=1024.0MB, bw=228548KB/s, iops=57136, runt=  4588msec
>> > 
>> > It seems that the difference can't be measured in our environment.
>> > 
>> > Pavel Dovgalyuk
>> 
>>
Kevin Wolf March 14, 2019, 4:09 p.m. UTC | #17
On 21.02.2019 at 12:05, Pavel Dovgalyuk wrote:
> Replay is capable of recording normal BH events, but sometimes
> there are single use callbacks scheduled with aio_bh_schedule_oneshot
> function. This patch enables recording and replaying such callbacks.
> Block layer uses these events for calling the completion function.
> Replaying these calls makes the execution deterministic.
> 
> Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru>

Acked-by: Kevin Wolf <kwolf@redhat.com>
Pavel Dovgalyuk March 15, 2019, 7:26 a.m. UTC | #18
> From: Kevin Wolf [mailto:kwolf@redhat.com]
> On 21.02.2019 at 12:05, Pavel Dovgalyuk wrote:
> > Replay is capable of recording normal BH events, but sometimes
> > there are single use callbacks scheduled with aio_bh_schedule_oneshot
> > function. This patch enables recording and replaying such callbacks.
> > Block layer uses these events for calling the completion function.
> > Replaying these calls makes the execution deterministic.
> >
> > Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru>
> 
> Acked-by: Kevin Wolf <kwolf@redhat.com>

Thanks, Kevin!

Can you also take a look at patches 2, 3, 5?
They affect only RR behavior, but change the block layer too.

Pavel Dovgalyuk
Kevin Wolf March 15, 2019, 10:32 a.m. UTC | #19
On 15.03.2019 at 08:26, Pavel Dovgalyuk wrote:
> > From: Kevin Wolf [mailto:kwolf@redhat.com]
> > On 21.02.2019 at 12:05, Pavel Dovgalyuk wrote:
> > > Replay is capable of recording normal BH events, but sometimes
> > > there are single use callbacks scheduled with aio_bh_schedule_oneshot
> > > function. This patch enables recording and replaying such callbacks.
> > > Block layer uses these events for calling the completion function.
> > > Replaying these calls makes the execution deterministic.
> > >
> > > Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru>
> > 
> > Acked-by: Kevin Wolf <kwolf@redhat.com>
> 
> Thanks, Kevin!
> 
> Can you also take a look at patches 2, 3, 5?
> They affect only RR behavior, but change the block layer too.

No objection there either (I see most of them more as RR things than
block layer things anyway).

Kevin

Patch

diff --git a/block/block-backend.c b/block/block-backend.c
index f6ea824308..9ccd89b56a 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -17,6 +17,7 @@ 
 #include "block/throttle-groups.h"
 #include "sysemu/blockdev.h"
 #include "sysemu/sysemu.h"
+#include "sysemu/replay.h"
 #include "qapi/error.h"
 #include "qapi/qapi-events-block.h"
 #include "qemu/id.h"
@@ -1293,7 +1294,8 @@  BlockAIOCB *blk_abort_aio_request(BlockBackend *blk,
     acb->blk = blk;
     acb->ret = ret;
 
-    aio_bh_schedule_oneshot(blk_get_aio_context(blk), error_callback_bh, acb);
+    replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+                                     error_callback_bh, acb);
     return &acb->common;
 }
 
@@ -1349,8 +1351,8 @@  static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, int64_t offset, int bytes,
 
     acb->has_returned = true;
     if (acb->rwco.ret != NOT_DONE) {
-        aio_bh_schedule_oneshot(blk_get_aio_context(blk),
-                                blk_aio_complete_bh, acb);
+        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+                                         blk_aio_complete_bh, acb);
     }
 
     return &acb->common;
diff --git a/block/io.c b/block/io.c
index c25864a8ee..a7d97b59e4 100644
--- a/block/io.c
+++ b/block/io.c
@@ -340,8 +340,8 @@  static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
     if (bs) {
         bdrv_inc_in_flight(bs);
     }
-    aio_bh_schedule_oneshot(bdrv_get_aio_context(bs),
-                            bdrv_co_drain_bh_cb, &data);
+    replay_bh_schedule_oneshot_event(bdrv_get_aio_context(bs),
+                                     bdrv_co_drain_bh_cb, &data);
 
     qemu_coroutine_yield();
     /* If we are resumed from some other event (such as an aio completion or a
diff --git a/block/iscsi.c b/block/iscsi.c
index ff473206e6..571bec80c9 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -38,6 +38,7 @@ 
 #include "qemu/iov.h"
 #include "qemu/option.h"
 #include "qemu/uuid.h"
+#include "sysemu/replay.h"
 #include "qapi/error.h"
 #include "qapi/qapi-commands-misc.h"
 #include "qapi/qmp/qdict.h"
@@ -275,8 +276,8 @@  iscsi_co_generic_cb(struct iscsi_context *iscsi, int status,
 
 out:
     if (iTask->co) {
-        aio_bh_schedule_oneshot(iTask->iscsilun->aio_context,
-                                 iscsi_co_generic_bh_cb, iTask);
+        replay_bh_schedule_oneshot_event(iTask->iscsilun->aio_context,
+                                         iscsi_co_generic_bh_cb, iTask);
     } else {
         iTask->complete = 1;
     }
diff --git a/block/nfs.c b/block/nfs.c
index eab1a2c408..e8b6c90cf1 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -36,6 +36,7 @@ 
 #include "qemu/uri.h"
 #include "qemu/cutils.h"
 #include "sysemu/sysemu.h"
+#include "sysemu/replay.h"
 #include "qapi/qapi-visit-block-core.h"
 #include "qapi/qmp/qdict.h"
 #include "qapi/qmp/qstring.h"
@@ -256,8 +257,8 @@  nfs_co_generic_cb(int ret, struct nfs_context *nfs, void *data,
     if (task->ret < 0) {
         error_report("NFS Error: %s", nfs_get_error(nfs));
     }
-    aio_bh_schedule_oneshot(task->client->aio_context,
-                            nfs_co_generic_bh_cb, task);
+    replay_bh_schedule_oneshot_event(task->client->aio_context,
+                                     nfs_co_generic_bh_cb, task);
 }
 
 static int coroutine_fn nfs_co_preadv(BlockDriverState *bs, uint64_t offset,
diff --git a/block/null.c b/block/null.c
index d442d3e901..733873c79b 100644
--- a/block/null.c
+++ b/block/null.c
@@ -16,6 +16,7 @@ 
 #include "qapi/qmp/qstring.h"
 #include "qemu/option.h"
 #include "block/block_int.h"
+#include "sysemu/replay.h"
 
 #define NULL_OPT_LATENCY "latency-ns"
 #define NULL_OPT_ZEROES  "read-zeroes"
@@ -178,7 +179,8 @@  static inline BlockAIOCB *null_aio_common(BlockDriverState *bs,
         timer_mod_ns(&acb->timer,
                      qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + s->latency_ns);
     } else {
-        aio_bh_schedule_oneshot(bdrv_get_aio_context(bs), null_bh_cb, acb);
+        replay_bh_schedule_oneshot_event(bdrv_get_aio_context(bs),
+                                         null_bh_cb, acb);
     }
     return &acb->common;
 }
diff --git a/block/nvme.c b/block/nvme.c
index b5952c9b08..7ada0794e8 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -21,6 +21,7 @@ 
 #include "qemu/option.h"
 #include "qemu/vfio-helpers.h"
 #include "block/block_int.h"
+#include "sysemu/replay.h"
 #include "trace.h"
 
 #include "block/nvme.h"
@@ -346,7 +347,8 @@  static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
         smp_mb_release();
         *q->cq.doorbell = cpu_to_le32(q->cq.head);
         if (!qemu_co_queue_empty(&q->free_req_queue)) {
-            aio_bh_schedule_oneshot(s->aio_context, nvme_free_req_queue_cb, q);
+            replay_bh_schedule_oneshot_event(s->aio_context,
+                                             nvme_free_req_queue_cb, q);
         }
     }
     q->busy = false;
@@ -897,7 +899,7 @@  static void nvme_rw_cb(void *opaque, int ret)
         /* The rw coroutine hasn't yielded, don't try to enter. */
         return;
     }
-    aio_bh_schedule_oneshot(data->ctx, nvme_rw_cb_bh, data);
+    replay_bh_schedule_oneshot_event(data->ctx, nvme_rw_cb_bh, data);
 }
 
 static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
diff --git a/block/rbd.c b/block/rbd.c
index 8a1a9f4b6e..daaddbc28c 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -21,6 +21,7 @@ 
 #include "block/qdict.h"
 #include "crypto/secret.h"
 #include "qemu/cutils.h"
+#include "sysemu/replay.h"
 #include "qapi/qmp/qstring.h"
 #include "qapi/qmp/qdict.h"
 #include "qapi/qmp/qjson.h"
@@ -858,8 +859,8 @@  static void rbd_finish_aiocb(rbd_completion_t c, RADOSCB *rcb)
     rcb->ret = rbd_aio_get_return_value(c);
     rbd_aio_release(c);
 
-    aio_bh_schedule_oneshot(bdrv_get_aio_context(acb->common.bs),
-                            rbd_finish_bh, rcb);
+    replay_bh_schedule_oneshot_event(bdrv_get_aio_context(acb->common.bs),
+                                     rbd_finish_bh, rcb);
 }
 
 static int rbd_aio_discard_wrapper(rbd_image_t image,
diff --git a/block/vxhs.c b/block/vxhs.c
index 0cb0a007e9..3ac35cb6bd 100644
--- a/block/vxhs.c
+++ b/block/vxhs.c
@@ -21,6 +21,7 @@ 
 #include "qapi/error.h"
 #include "qemu/uuid.h"
 #include "crypto/tlscredsx509.h"
+#include "sysemu/replay.h"
 
 #define VXHS_OPT_FILENAME           "filename"
 #define VXHS_OPT_VDISK_ID           "vdisk-id"
@@ -104,8 +105,8 @@  static void vxhs_iio_callback(void *ctx, uint32_t opcode, uint32_t error)
             trace_vxhs_iio_callback(error);
         }
 
-        aio_bh_schedule_oneshot(bdrv_get_aio_context(acb->common.bs),
-                                vxhs_complete_aio_bh, acb);
+        replay_bh_schedule_oneshot_event(bdrv_get_aio_context(acb->common.bs),
+                                         vxhs_complete_aio_bh, acb);
         break;
 
     default:
diff --git a/include/sysemu/replay.h b/include/sysemu/replay.h
index 1d18c9b6ea..b7394a1f5c 100644
--- a/include/sysemu/replay.h
+++ b/include/sysemu/replay.h
@@ -163,6 +163,9 @@  bool replay_events_enabled(void);
 void replay_flush_events(void);
 /*! Adds bottom half event to the queue */
 void replay_bh_schedule_event(QEMUBH *bh);
+/* Adds oneshot bottom half event to the queue */
+void replay_bh_schedule_oneshot_event(AioContext *ctx,
+    QEMUBHFunc *cb, void *opaque);
 /*! Adds input event to the queue */
 void replay_input_event(QemuConsole *src, InputEvent *evt);
 /*! Adds input sync event to the queue */
diff --git a/replay/replay-events.c b/replay/replay-events.c
index d9a2d495b9..fc9b08225d 100644
--- a/replay/replay-events.c
+++ b/replay/replay-events.c
@@ -37,6 +37,9 @@  static void replay_run_event(Event *event)
     case REPLAY_ASYNC_EVENT_BH:
         aio_bh_call(event->opaque);
         break;
+    case REPLAY_ASYNC_EVENT_BH_ONESHOT:
+        ((QEMUBHFunc *)event->opaque)(event->opaque2);
+        break;
     case REPLAY_ASYNC_EVENT_INPUT:
         qemu_input_event_send_impl(NULL, (InputEvent *)event->opaque);
         qapi_free_InputEvent((InputEvent *)event->opaque);
@@ -132,6 +135,17 @@  void replay_bh_schedule_event(QEMUBH *bh)
     }
 }
 
+void replay_bh_schedule_oneshot_event(AioContext *ctx,
+    QEMUBHFunc *cb, void *opaque)
+{
+    if (events_enabled) {
+        uint64_t id = replay_get_current_step();
+        replay_add_event(REPLAY_ASYNC_EVENT_BH_ONESHOT, cb, opaque, id);
+    } else {
+        aio_bh_schedule_oneshot(ctx, cb, opaque);
+    }
+}
+
 void replay_add_input_event(struct InputEvent *event)
 {
     replay_add_event(REPLAY_ASYNC_EVENT_INPUT, event, NULL, 0);
@@ -162,6 +176,7 @@  static void replay_save_event(Event *event, int checkpoint)
         /* save event-specific data */
         switch (event->event_kind) {
         case REPLAY_ASYNC_EVENT_BH:
+        case REPLAY_ASYNC_EVENT_BH_ONESHOT:
             replay_put_qword(event->id);
             break;
         case REPLAY_ASYNC_EVENT_INPUT:
@@ -217,6 +232,7 @@  static Event *replay_read_event(int checkpoint)
     /* Events that has not to be in the queue */
     switch (replay_state.read_event_kind) {
     case REPLAY_ASYNC_EVENT_BH:
+    case REPLAY_ASYNC_EVENT_BH_ONESHOT:
         if (replay_state.read_event_id == -1) {
             replay_state.read_event_id = replay_get_qword();
         }
diff --git a/replay/replay-internal.h b/replay/replay-internal.h
index 945802361b..e9786b2377 100644
--- a/replay/replay-internal.h
+++ b/replay/replay-internal.h
@@ -51,6 +51,7 @@  enum ReplayEvents {
 
 enum ReplayAsyncEventKind {
     REPLAY_ASYNC_EVENT_BH,
+    REPLAY_ASYNC_EVENT_BH_ONESHOT,
     REPLAY_ASYNC_EVENT_INPUT,
     REPLAY_ASYNC_EVENT_INPUT_SYNC,
     REPLAY_ASYNC_EVENT_CHAR_READ,
diff --git a/replay/replay.c b/replay/replay.c
index 1be34aa824..e578958659 100644
--- a/replay/replay.c
+++ b/replay/replay.c
@@ -22,7 +22,7 @@ 
 
 /* Current version of the replay mechanism.
    Increase it when file format changes. */
-#define REPLAY_VERSION              0xe02007
+#define REPLAY_VERSION              0xe02008
 /* Size of replay log header */
 #define HEADER_SIZE                 (sizeof(uint32_t) + sizeof(uint64_t))
 
diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
index 269dfa5832..72351de769 100644
--- a/stubs/Makefile.objs
+++ b/stubs/Makefile.objs
@@ -20,6 +20,7 @@  stub-obj-y += monitor.o
 stub-obj-y += notify-event.o
 stub-obj-y += qtest.o
 stub-obj-y += replay.o
+stub-obj-y += replay-user.o
 stub-obj-y += runstate-check.o
 stub-obj-y += set-fd-handler.o
 stub-obj-y += sysbus.o
diff --git a/stubs/replay-user.c b/stubs/replay-user.c
new file mode 100644
index 0000000000..2ad9e27203
--- /dev/null
+++ b/stubs/replay-user.c
@@ -0,0 +1,9 @@ 
+#include "qemu/osdep.h"
+#include "sysemu/replay.h"
+#include "sysemu/sysemu.h"
+
+void replay_bh_schedule_oneshot_event(AioContext *ctx,
+    QEMUBHFunc *cb, void *opaque)
+{
+    aio_bh_schedule_oneshot(ctx, cb, opaque);
+}
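
For reference, a minimal sketch (hypothetical driver code, not part of this patch) of how a block driver's completion path is expected to use the new wrapper. The ExampleAIOCB type and the example_* function names are made up for illustration; the wrapper itself, replay_bh_schedule_oneshot_event(), is the one added by this patch and falls back to aio_bh_schedule_oneshot() when record/replay is not active.

/*
 * Hypothetical completion path for a block driver (illustration only,
 * not part of this patch).  Scheduling the single-use completion BH
 * through the replay-aware wrapper lets record/replay log it as a
 * REPLAY_ASYNC_EVENT_BH_ONESHOT event and re-run it deterministically;
 * outside of record/replay it behaves like aio_bh_schedule_oneshot().
 */
#include "qemu/osdep.h"
#include "block/block_int.h"
#include "sysemu/replay.h"

typedef struct ExampleAIOCB {
    BlockAIOCB common;
    int ret;
} ExampleAIOCB;

static void example_complete_bh(void *opaque)
{
    ExampleAIOCB *acb = opaque;

    /* Deliver the completion to the guest-visible callback. */
    acb->common.cb(acb->common.opaque, acb->ret);
    qemu_aio_unref(acb);
}

static void example_request_done(ExampleAIOCB *acb, int ret)
{
    acb->ret = ret;
    /* Previously: aio_bh_schedule_oneshot(ctx, example_complete_bh, acb); */
    replay_bh_schedule_oneshot_event(bdrv_get_aio_context(acb->common.bs),
                                     example_complete_bh, acb);
}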