[v3] qapi/qmp: Add timestamps to qmp command responses

Message ID	20221011153408.495401-1-den-plotnikov@yandex-team.ru
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> Precedence: bulk From: Denis Plotnikov <den-plotnikov@yandex-team.ru> To: qemu-devel@nongnu.org Cc: yc-core@yandex-team.ru, armbru@redhat.com, michael.roth@amd.com, vsementsov@yandex-team.ru, berrange@redhat.com, marcandre.lureau@gmail.com Subject: [PATCH v3] qapi/qmp: Add timestamps to qmp command responses Date: Tue, 11 Oct 2022 18:34:08 +0300 Message-Id: <20221011153408.495401-1-den-plotnikov@yandex-team.ru> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=178.154.239.136; envelope-from=den-plotnikov@yandex-team.ru; helo=forwardcorp1b.mail.yandex.net X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>
Series	[v3] qapi/qmp: Add timestamps to qmp command responses \| expand [v3] qapi/qmp: Add timestamps to qmp command responses

Denis Plotnikov Oct. 11, 2022, 3:34 p.m. UTC

Add "start" & "end" time values to qmp command responses.

These time values are added to let the qemu management layer get the exact
command execution time without any other time variance which might be brought by
other parts of management layer or qemu internals. This is particulary useful
for the management layer logging for later problems resolving.

Example of result:

    ./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket

    (QEMU) query-status
    {"end": {"seconds": 1650367305, "microseconds": 831032},
     "start": {"seconds": 1650367305, "microseconds": 831012},
     "return": {"status": "running", "singlestep": false, "running": true}}

The responce of the qmp command contains the start & end time of
the qmp command processing.

Suggested-by: Andrey Ryabinin <arbn@yandex-team.ru>
Signed-off-by: Denis Plotnikov <den-plotnikov@yandex-team.ru>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---

v0->v1:
 - remove interface to control "start" and "end" time values: return timestamps unconditionally
 - add description to qmp specification
 - leave the same timestamp format in "seconds", "microseconds" to be consistent with events
   timestamp
 - fix patch description

v1->v2:
 - rephrase doc descriptions [Daniel]
 - add tests for qmp timestamps to qmp test and qga test [Daniel]
 - adjust asserts in test-qmp-cmds according to the new number of returning keys

v2->v3:
 - fix typo "timestaps -> timestamps" [Marc-André]

 docs/interop/qmp-spec.txt  | 28 ++++++++++++++++++++++++++--
 qapi/qmp-dispatch.c        | 18 ++++++++++++++++++
 tests/qtest/qmp-test.c     | 34 ++++++++++++++++++++++++++++++++++
 tests/unit/test-qga.c      | 31 +++++++++++++++++++++++++++++++
 tests/unit/test-qmp-cmds.c |  4 ++--
 5 files changed, 111 insertions(+), 4 deletions(-)

Markus Armbruster Oct. 13, 2022, 3 p.m. UTC | #1

Denis Plotnikov <den-plotnikov@yandex-team.ru> writes:

> Add "start" & "end" time values to qmp command responses.

Please spell it QMP.  More of the same below.

>
> These time values are added to let the qemu management layer get the exact
> command execution time without any other time variance which might be brought by
> other parts of management layer or qemu internals. This is particulary useful
> for the management layer logging for later problems resolving.

I'm still having difficulties seeing the value add over existing
tracepoints and logging.

Can you tell me about a problem you cracked (or could have cracked) with
the help of this?

> Example of result:
>
>     ./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket
>
>     (QEMU) query-status
>     {"end": {"seconds": 1650367305, "microseconds": 831032},
>      "start": {"seconds": 1650367305, "microseconds": 831012},
>      "return": {"status": "running", "singlestep": false, "running": true}}
>
> The responce of the qmp command contains the start & end time of

response

> the qmp command processing.
>
> Suggested-by: Andrey Ryabinin <arbn@yandex-team.ru>
> Signed-off-by: Denis Plotnikov <den-plotnikov@yandex-team.ru>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

Please spell out that this affects both QMP and qemu-ga.

> ---
>
> v0->v1:
>  - remove interface to control "start" and "end" time values: return timestamps unconditionally
>  - add description to qmp specification
>  - leave the same timestamp format in "seconds", "microseconds" to be consistent with events
>    timestamp
>  - fix patch description
>
> v1->v2:
>  - rephrase doc descriptions [Daniel]
>  - add tests for qmp timestamps to qmp test and qga test [Daniel]
>  - adjust asserts in test-qmp-cmds according to the new number of returning keys
>
> v2->v3:
>  - fix typo "timestaps -> timestamps" [Marc-André]
>
>  docs/interop/qmp-spec.txt  | 28 ++++++++++++++++++++++++++--
>  qapi/qmp-dispatch.c        | 18 ++++++++++++++++++
>  tests/qtest/qmp-test.c     | 34 ++++++++++++++++++++++++++++++++++
>  tests/unit/test-qga.c      | 31 +++++++++++++++++++++++++++++++
>  tests/unit/test-qmp-cmds.c |  4 ++--
>  5 files changed, 111 insertions(+), 4 deletions(-)
>
> diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.txt
> index b0e8351d5b261..2e0b7de0c4dc7 100644
> --- a/docs/interop/qmp-spec.txt
> +++ b/docs/interop/qmp-spec.txt
> @@ -158,7 +158,9 @@ responses that have an unknown "id" field.
>  
>  The format of a success response is:
>  
> -{ "return": json-value, "id": json-value }
> +{ "return": json-value, "id": json-value,
> +  "start": {"seconds": json-value, "microseconds": json-value},
> +  "end": {"seconds": json-value, "microseconds": json-value} }
>  
>   Where,
>  
> @@ -169,13 +171,25 @@ The format of a success response is:
>    command does not return data
>  - The "id" member contains the transaction identification associated
>    with the command execution if issued by the Client
> +- The "start" member contains the exact time of when the server
> +  started executing the command. This excludes any time the
> +  command request spent queued, after reading it off the wire.
> +  It is a fixed json-object with time in seconds and microseconds
> +  relative to the Unix Epoch (1 Jan 1970)

What's a "fixed json-object"?

Hmm, I guess you're copying from the description of event member
"timestamp".

Let's go with "a json-object with the number of seconds and microseconds
since the Unix epoch" everywhere.

> +- The "end" member contains the exact time of when the server
> +  finished executing the command. This excludes any time the
> +  command response spent queued, waiting to be sent on the wire.
> +  It is a fixed json-object with time in seconds and microseconds
> +  relative to the Unix Epoch (1 Jan 1970)
>  
>  2.4.2 error
>  -----------
>  
>  The format of an error response is:
>  
> -{ "error": { "class": json-string, "desc": json-string }, "id": json-value }
> +{ "error": { "class": json-string, "desc": json-string }, "id": json-value
> +  "start": {"seconds": json-value, "microseconds": json-value},
> +  "end": {"seconds": json-value, "microseconds": json-value} }
>  
>   Where,
>  
> @@ -184,6 +198,16 @@ The format of an error response is:
>    not attempt to parse this message.
>  - The "id" member contains the transaction identification associated with
>    the command execution if issued by the Client
> +- The "start" member contains the exact time of when the server
> +  started executing the command. This excludes any time the
> +  command request spent queued, after reading it off the wire.
> +  It is a fixed json-object with time in seconds and microseconds
> +  relative to the Unix Epoch (1 Jan 1970)
> +- The "end" member contains the exact time of when the server
> +  finished executing the command. This excludes any time the
> +  command response spent queued, waiting to be sent on the wire.
> +  It is a fixed json-object with time in seconds and microseconds
> +  relative to the Unix Epoch (1 Jan 1970)
>  
>  NOTE: Some errors can occur before the Server is able to read the "id" member,
>  in these cases the "id" member will not be part of the error response, even
> diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
> index 0990873ec8ec1..fce87416f2128 100644
> --- a/qapi/qmp-dispatch.c
> +++ b/qapi/qmp-dispatch.c
> @@ -130,6 +130,22 @@ static void do_qmp_dispatch_bh(void *opaque)
>      aio_co_wake(data->co);
>  }
>  
> +static void add_timestamps(QDict *qdict, uint64_t start_ms, uint64_t end_ms)
> +{
> +    QDict *start_dict, *end_dict;
> +
> +    start_dict = qdict_new();
> +    qdict_put_int(start_dict, "seconds", start_ms / G_USEC_PER_SEC);
> +    qdict_put_int(start_dict, "microseconds", start_ms % G_USEC_PER_SEC);
> +
> +    end_dict = qdict_new();
> +    qdict_put_int(end_dict, "seconds", end_ms / G_USEC_PER_SEC);
> +    qdict_put_int(end_dict, "microseconds", end_ms % G_USEC_PER_SEC);
> +
> +    qdict_put_obj(qdict, "start", QOBJECT(start_dict));
> +    qdict_put_obj(qdict, "end", QOBJECT(end_dict));
> +}
> +
>  /*
>   * Runs outside of coroutine context for OOB commands, but in coroutine
>   * context for everything else.
> @@ -146,6 +162,7 @@ QDict *qmp_dispatch(const QmpCommandList *cmds, QObject *request,
>      QObject *id;
>      QObject *ret = NULL;
>      QDict *rsp = NULL;
> +    uint64_t ts_start = g_get_real_time();

Make this int64_t, because that's what g_get_real_time() returns.

Same for add_timestamps() parameters.

>  
>      dict = qobject_to(QDict, request);
>      if (!dict) {
> @@ -270,5 +287,6 @@ out:
>          qdict_put_obj(rsp, "id", qobject_ref(id));
>      }
>  
> +    add_timestamps(rsp, ts_start, g_get_real_time());
>      return rsp;
>  }
> diff --git a/tests/qtest/qmp-test.c b/tests/qtest/qmp-test.c
> index 22957fa49c228..e0d83eea91ae5 100644
> --- a/tests/qtest/qmp-test.c
> +++ b/tests/qtest/qmp-test.c
> @@ -33,6 +33,30 @@ static void test_version(QObject *version)
>      visit_free(v);
>  }
>  
> +static void test_timestamps(QDict *resp)
> +{
> +    QDict *start, *end;
> +    uint64_t start_s, start_us, end_s, end_us, start_ts, end_ts;

Make these int64_t, because...

> +
> +    start = qdict_get_qdict(resp, "start");
> +    g_assert(start);
> +    end = qdict_get_qdict(resp, "end");
> +    g_assert(end);
> +
> +    start_s = qdict_get_try_int(start, "seconds", 0);

... that's what qdict_get_try_int() returns.

> +    g_assert(start_s);

Fails when the real time clock is within one second of the epoch.
You'll arguably have bigger problems to worry about then.  However, code
that doesn't have this "problem" is simpler:

       start_s = qdict_get_int(start, "seconds");

> +    start_us = qdict_get_try_int(start, "microseconds", 0);

Use qdict_get_int(), so we actually fail when "microseconds" is absent.

> +
> +    end_s = qdict_get_try_int(end, "seconds", 0);
> +    g_assert(end_s);
> +    end_us = qdict_get_try_int(end, "microseconds", 0);

Likewise.

> +
> +    start_ts = (start_s * G_USEC_PER_SEC) + start_us;
> +    end_ts = (end_s * G_USEC_PER_SEC) + end_us;
> +
> +    g_assert(end_ts > start_ts);
> +}
> +
>  static void assert_recovered(QTestState *qts)
>  {
>      QDict *resp;
> @@ -156,6 +180,16 @@ static void test_qmp_protocol(void)
>      g_assert_cmpint(qdict_get_int(resp, "id"), ==, 2);
>      qmp_expect_error_and_unref(resp, "GenericError");
>  
> +    /* Test timestamps on success */
> +    resp = qtest_qmp(qts, "{ 'execute': 'query-version' }");
> +    test_timestamps(resp);
> +    qobject_unref(resp);
> +
> +    /* Test timestamps on error */
> +    resp = qtest_qmp(qts, "{ 'execute': 'not-existing-cmd' }");
> +    test_timestamps(resp);
> +    qobject_unref(resp);

I'd be tempted to fold this into existing tests.

> +
>      qtest_quit(qts);
>  }
>  
> diff --git a/tests/unit/test-qga.c b/tests/unit/test-qga.c
> index b4e0a145737d1..18ec9bac3650e 100644
> --- a/tests/unit/test-qga.c
> +++ b/tests/unit/test-qga.c
> @@ -217,6 +217,36 @@ static void test_qga_ping(gconstpointer fix)
>      qmp_assert_no_error(ret);
>  }
>  
> +static void test_qga_timestamps(gconstpointer fix)
> +{
> +    QDict *start, *end;
> +    uint64_t start_s, start_us, end_s, end_us, start_ts, end_ts;
> +    const TestFixture *fixture = fix;
> +    g_autoptr(QDict) ret = NULL;
> +
> +    ret = qmp_fd(fixture->fd, "{'execute': 'guest-ping'}");
> +    g_assert_nonnull(ret);
> +    qmp_assert_no_error(ret);
> +
> +    start = qdict_get_qdict(ret, "start");
> +    g_assert(start);
> +    end = qdict_get_qdict(ret, "end");
> +    g_assert(end);
> +
> +    start_s = qdict_get_try_int(start, "seconds", 0);
> +    g_assert(start_s);
> +    start_us = qdict_get_try_int(start, "microseconds", 0);
> +
> +    end_s = qdict_get_try_int(end, "seconds", 0);
> +    g_assert(end_s);
> +    end_us = qdict_get_try_int(end, "microseconds", 0);
> +
> +    start_ts = (start_s * G_USEC_PER_SEC) + start_us;
> +    end_ts = (end_s * G_USEC_PER_SEC) + end_us;
> +
> +    g_assert(end_ts > start_ts);
> +}
> +
>  static void test_qga_id(gconstpointer fix)
>  {
>      const TestFixture *fixture = fix;
> @@ -948,6 +978,7 @@ int main(int argc, char **argv)
>      g_test_add_data_func("/qga/sync-delimited", &fix, test_qga_sync_delimited);
>      g_test_add_data_func("/qga/sync", &fix, test_qga_sync);
>      g_test_add_data_func("/qga/ping", &fix, test_qga_ping);
> +    g_test_add_data_func("/qga/timestamps", &fix, test_qga_timestamps);
>      g_test_add_data_func("/qga/info", &fix, test_qga_info);
>      g_test_add_data_func("/qga/network-get-interfaces", &fix,
>                           test_qga_network_get_interfaces);
> diff --git a/tests/unit/test-qmp-cmds.c b/tests/unit/test-qmp-cmds.c
> index 6085c099950b5..54d63bb8e346f 100644
> --- a/tests/unit/test-qmp-cmds.c
> +++ b/tests/unit/test-qmp-cmds.c
> @@ -154,7 +154,7 @@ static QObject *do_qmp_dispatch(bool allow_oob, const char *template, ...)
>      g_assert(resp);
>      ret = qdict_get(resp, "return");
>      g_assert(ret);
> -    g_assert(qdict_size(resp) == 1);
> +    g_assert(qdict_size(resp) == 3);
>  
>      qobject_ref(ret);
>      qobject_unref(resp);
> @@ -181,7 +181,7 @@ static void do_qmp_dispatch_error(bool allow_oob, ErrorClass cls,
>                      ==, QapiErrorClass_str(cls));
>      g_assert(qdict_get_try_str(error, "desc"));
>      g_assert(qdict_size(error) == 2);
> -    g_assert(qdict_size(resp) == 1);
> +    g_assert(qdict_size(resp) == 3);
>  
>      qobject_unref(resp);
>      qobject_unref(req);

Daniel P. Berrangé Oct. 13, 2022, 3:40 p.m. UTC | #2

On Thu, Oct 13, 2022 at 05:00:26PM +0200, Markus Armbruster wrote:
> Denis Plotnikov <den-plotnikov@yandex-team.ru> writes:
> 
> > Add "start" & "end" time values to qmp command responses.
> 
> Please spell it QMP.  More of the same below.
> 
> >
> > These time values are added to let the qemu management layer get the exact
> > command execution time without any other time variance which might be brought by
> > other parts of management layer or qemu internals. This is particulary useful
> > for the management layer logging for later problems resolving.
> 
> I'm still having difficulties seeing the value add over existing
> tracepoints and logging.
> 
> Can you tell me about a problem you cracked (or could have cracked) with
> the help of this?

Consider your QMP client is logging all commands and replies in its
own logfile (libvirt can do this). Having this start/end timestamps
included means the QMP client log is self contained.

Relying on tracing means that when a user attaches the QMP client log
to a bug report, the timing info is missing. You have to ask the user
to try to reproduce again with QEMU tracing enabled, which may be
impossible, and then correlate the tracing output with the QMP client
log.

QEMU side tracing & logging is fine, but not a substitute for having
this info included by default IMHO.

With regards,
Daniel

Markus Armbruster Oct. 14, 2022, 9:31 a.m. UTC | #3

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Thu, Oct 13, 2022 at 05:00:26PM +0200, Markus Armbruster wrote:
>> Denis Plotnikov <den-plotnikov@yandex-team.ru> writes:
>> 
>> > Add "start" & "end" time values to qmp command responses.
>> 
>> Please spell it QMP.  More of the same below.
>> 
>> >
>> > These time values are added to let the qemu management layer get the exact
>> > command execution time without any other time variance which might be brought by
>> > other parts of management layer or qemu internals. This is particulary useful
>> > for the management layer logging for later problems resolving.
>> 
>> I'm still having difficulties seeing the value add over existing
>> tracepoints and logging.
>> 
>> Can you tell me about a problem you cracked (or could have cracked) with
>> the help of this?
>
> Consider your QMP client is logging all commands and replies in its
> own logfile (libvirt can do this). Having this start/end timestamps
> included means the QMP client log is self contained.

A QMP client can include client-side timestamps in its log.  What value
is being added by server-side timestamps?  According to the commit
message, it's for getting "the exact command execution time without any
other time variance which might be brought by other parts of management
layer or qemu internals."  Why is that useful?  In particular, why is
excluding network and QEMU queueing delays (inbound and outbound)
useful?

> Relying on tracing means that when a user attaches the QMP client log
> to a bug report, the timing info is missing. You have to ask the user
> to try to reproduce again with QEMU tracing enabled, which may be
> impossible, and then correlate the tracing output with the QMP client
> log.
>
> QEMU side tracing & logging is fine, but not a substitute for having
> this info included by default IMHO.

Not an answer to my question, but helpful all the same.

Denis Plotnikov Oct. 14, 2022, 9:47 a.m. UTC | #4

On 13.10.2022 18:00, Markus Armbruster wrote:
> Denis Plotnikov <den-plotnikov@yandex-team.ru> writes:
>
>> Add "start" & "end" time values to qmp command responses.
> Please spell it QMP.  More of the same below.
ok
> Can you tell me about a problem you cracked (or could have cracked) with
> the help of this?

We have a management layer which interacts with qemu via qmp. When it 
issues a qmp command we measure execution time which takes to perform a 
certain qmp command. Some of that commands seems to execute longer that 
expected. In that case there is a question what part of command 
execution takes the majority of time. Is it the flaw in the management 
layer or in qemu qmp command scheduling or the qmp command execution 
itself? The timestaps being added help to exclude the qmp command 
execution time from the question. Also timestamps helps to get know the 
exact time when the command is started and ended and put that 
information to a system logs properly according to timestamps.

>>       "return": {"status": "running", "singlestep": false, "running": true}}
>>
>> The responce of the qmp command contains the start & end time of
> response
ok
>
>> the qmp command processing.
>>
>> Suggested-by: Andrey Ryabinin <arbn@yandex-team.ru>
>> Signed-off-by: Denis Plotnikov <den-plotnikov@yandex-team.ru>
>> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
> Please spell out that this affects both QMP and qemu-ga.
ok
>>     command does not return data
>>   - The "id" member contains the transaction identification associated
>>     with the command execution if issued by the Client
>> +- The "start" member contains the exact time of when the server
>> +  started executing the command. This excludes any time the
>> +  command request spent queued, after reading it off the wire.
>> +  It is a fixed json-object with time in seconds and microseconds
>> +  relative to the Unix Epoch (1 Jan 1970)
> What's a "fixed json-object"?
>
> Hmm, I guess you're copying from the description of event member
> "timestamp".
That's right
> Let's go with "a json-object with the number of seconds and microseconds
> since the Unix epoch" everywhere.
ok
>
> Make this int64_t, because that's what g_get_real_time() returns.
>
> Same for add_timestamps() parameters.
ok, will fix the type everywhere
>
> +    qobject_unref(resp);
> I'd be tempted to fold this into existing tests.

Do you want me to put timestamp checking to an existing testcase?


Thanks,

Denis

>
>> +
>>       qtest_quit(qts);
>>   }
>>   
>> diff --git a/tests/unit/test-qga.c b/tests/unit/test-qga.c
>> index b4e0a145737d1..18ec9bac3650e 100644
>> --- a/tests/unit/test-qga.c
>> +++ b/tests/unit/test-qga.c
>> @@ -217,6 +217,36 @@ static void test_qga_ping(gconstpointer fix)
>>       qmp_assert_no_error(ret);
>>   }
>>   
>> +static void test_qga_timestamps(gconstpointer fix)
>> +{
>> +    QDict *start, *end;
>> +    uint64_t start_s, start_us, end_s, end_us, start_ts, end_ts;
>> +    const TestFixture *fixture = fix;
>> +    g_autoptr(QDict) ret = NULL;
>> +
>> +    ret = qmp_fd(fixture->fd, "{'execute': 'guest-ping'}");
>> +    g_assert_nonnull(ret);
>> +    qmp_assert_no_error(ret);
>> +
>> +    start = qdict_get_qdict(ret, "start");
>> +    g_assert(start);
>> +    end = qdict_get_qdict(ret, "end");
>> +    g_assert(end);
>> +
>> +    start_s = qdict_get_try_int(start, "seconds", 0);
>> +    g_assert(start_s);
>> +    start_us = qdict_get_try_int(start, "microseconds", 0);
>> +
>> +    end_s = qdict_get_try_int(end, "seconds", 0);
>> +    g_assert(end_s);
>> +    end_us = qdict_get_try_int(end, "microseconds", 0);
>> +
>> +    start_ts = (start_s * G_USEC_PER_SEC) + start_us;
>> +    end_ts = (end_s * G_USEC_PER_SEC) + end_us;
>> +
>> +    g_assert(end_ts > start_ts);
>> +}
>> +
>>   static void test_qga_id(gconstpointer fix)
>>   {
>>       const TestFixture *fixture = fix;
>> @@ -948,6 +978,7 @@ int main(int argc, char **argv)
>>       g_test_add_data_func("/qga/sync-delimited", &fix, test_qga_sync_delimited);
>>       g_test_add_data_func("/qga/sync", &fix, test_qga_sync);
>>       g_test_add_data_func("/qga/ping", &fix, test_qga_ping);
>> +    g_test_add_data_func("/qga/timestamps", &fix, test_qga_timestamps);
>>       g_test_add_data_func("/qga/info", &fix, test_qga_info);
>>       g_test_add_data_func("/qga/network-get-interfaces", &fix,
>>                            test_qga_network_get_interfaces);
>> diff --git a/tests/unit/test-qmp-cmds.c b/tests/unit/test-qmp-cmds.c
>> index 6085c099950b5..54d63bb8e346f 100644
>> --- a/tests/unit/test-qmp-cmds.c
>> +++ b/tests/unit/test-qmp-cmds.c
>> @@ -154,7 +154,7 @@ static QObject *do_qmp_dispatch(bool allow_oob, const char *template, ...)
>>       g_assert(resp);
>>       ret = qdict_get(resp, "return");
>>       g_assert(ret);
>> -    g_assert(qdict_size(resp) == 1);
>> +    g_assert(qdict_size(resp) == 3);
>>   
>>       qobject_ref(ret);
>>       qobject_unref(resp);
>> @@ -181,7 +181,7 @@ static void do_qmp_dispatch_error(bool allow_oob, ErrorClass cls,
>>                       ==, QapiErrorClass_str(cls));
>>       g_assert(qdict_get_try_str(error, "desc"));
>>       g_assert(qdict_size(error) == 2);
>> -    g_assert(qdict_size(resp) == 1);
>> +    g_assert(qdict_size(resp) == 3);
>>   
>>       qobject_unref(resp);
>>       qobject_unref(req);

Daniel P. Berrangé Oct. 14, 2022, 11:31 a.m. UTC | #5

On Fri, Oct 14, 2022 at 11:31:13AM +0200, Markus Armbruster wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Thu, Oct 13, 2022 at 05:00:26PM +0200, Markus Armbruster wrote:
> >> Denis Plotnikov <den-plotnikov@yandex-team.ru> writes:
> >> 
> >> > Add "start" & "end" time values to qmp command responses.
> >> 
> >> Please spell it QMP.  More of the same below.
> >> 
> >> >
> >> > These time values are added to let the qemu management layer get the exact
> >> > command execution time without any other time variance which might be brought by
> >> > other parts of management layer or qemu internals. This is particulary useful
> >> > for the management layer logging for later problems resolving.
> >> 
> >> I'm still having difficulties seeing the value add over existing
> >> tracepoints and logging.
> >> 
> >> Can you tell me about a problem you cracked (or could have cracked) with
> >> the help of this?
> >
> > Consider your QMP client is logging all commands and replies in its
> > own logfile (libvirt can do this). Having this start/end timestamps
> > included means the QMP client log is self contained.
> 
> A QMP client can include client-side timestamps in its log.  What value
> is being added by server-side timestamps?  According to the commit
> message, it's for getting "the exact command execution time without any
> other time variance which might be brought by other parts of management
> layer or qemu internals."  Why is that useful?  In particular, why is
> excluding network and QEMU queueing delays (inbound and outbound)
> useful?

Lets, say some commands normally runs in ~100ms, but occasionally
runs in 2secs, and you want to understand why.

A first step is understanding whether a given command itself is
slow at executing, or whether its execution has merely been
delayed because some other aspect of QEMU has delayed its execution.
If the server timestamps show it was very fast, then that indicates
delayed processing. Thus instead of debugging the slow command, I
can think about what scenarios would be responsible for the delay.
Perhaps a previous QMP command was very slow, or maybe there is
simply a large volume of QMP commands backlogged, or some part of
QEMU got blocked.

Another case would be a command that is normally fast, and sometimes
is slower, but still relatively fast. The network and queueing side
might be a significant enough proportion of the total time to obscure
the slowdown. If you can eliminate the non-execution time, you can
see the performance trends over time to spot the subtle slowdowns
and detect abnormal behaviour before it becomes too terrible.

With regards,
Daniel

Markus Armbruster Oct. 14, 2022, 12:57 p.m. UTC | #6

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Fri, Oct 14, 2022 at 11:31:13AM +0200, Markus Armbruster wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>> > On Thu, Oct 13, 2022 at 05:00:26PM +0200, Markus Armbruster wrote:
>> >> Denis Plotnikov <den-plotnikov@yandex-team.ru> writes:
>> >> 
>> >> > Add "start" & "end" time values to qmp command responses.
>> >> 
>> >> Please spell it QMP.  More of the same below.
>> >> 
>> >> >
>> >> > These time values are added to let the qemu management layer get the exact
>> >> > command execution time without any other time variance which might be brought by
>> >> > other parts of management layer or qemu internals. This is particulary useful
>> >> > for the management layer logging for later problems resolving.
>> >> 
>> >> I'm still having difficulties seeing the value add over existing
>> >> tracepoints and logging.
>> >> 
>> >> Can you tell me about a problem you cracked (or could have cracked) with
>> >> the help of this?
>> >
>> > Consider your QMP client is logging all commands and replies in its
>> > own logfile (libvirt can do this). Having this start/end timestamps
>> > included means the QMP client log is self contained.
>> 
>> A QMP client can include client-side timestamps in its log.  What value
>> is being added by server-side timestamps?  According to the commit
>> message, it's for getting "the exact command execution time without any
>> other time variance which might be brought by other parts of management
>> layer or qemu internals."  Why is that useful?  In particular, why is
>> excluding network and QEMU queueing delays (inbound and outbound)
>> useful?
>
> Lets, say some commands normally runs in ~100ms, but occasionally
> runs in 2secs, and you want to understand why.
>
> A first step is understanding whether a given command itself is
> slow at executing, or whether its execution has merely been
> delayed because some other aspect of QEMU has delayed its execution.
> If the server timestamps show it was very fast, then that indicates
> delayed processing. Thus instead of debugging the slow command, I
> can think about what scenarios would be responsible for the delay.
> Perhaps a previous QMP command was very slow, or maybe there is
> simply a large volume of QMP commands backlogged, or some part of
> QEMU got blocked.
>
> Another case would be a command that is normally fast, and sometimes
> is slower, but still relatively fast. The network and queueing side
> might be a significant enough proportion of the total time to obscure
> the slowdown. If you can eliminate the non-execution time, you can
> see the performance trends over time to spot the subtle slowdowns
> and detect abnormal behaviour before it becomes too terrible.

This is troubleshooting.  Asking for better troubleshooting tools is
fair.

However, the proposed timestamps provide much more limited insight than
existing tracepoints.  For instance, enabling

    monitor_qmp_*
    handle_qmp_command

results in something like

    2656634@1665750359.529763:handle_qmp_command mon 0x55f93a4375d0 req: {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}}
    2656634@1665750359.529789:monitor_qmp_in_band_enqueue 0x7fbaa8003870 mon 0x55f93a4375d0 len 0
    2656629@1665750359.529997:monitor_qmp_in_band_dequeue 0x7fbaa8003870 len 0
    2656629@1665750359.530081:monitor_qmp_cmd_in_band 
    2656629@1665750359.530166:monitor_qmp_respond mon 0x55f93a4375d0 resp: {"return": {}}

This gives me timestamps for

    complete JSON value read from socket and parsed
    value enqueued in-band (with mutex held)
    value dequeued in-band (with mutex held)
    in-band command dispatched
    reply enqueued

The last two are pretty close to the timestamps added by the patch.

The patch's timestamps together with client-side timestamps can tell me
whether the command is delayed somewhere between the client's timestamp
(presumably close to socket send) and QEMU's dispatch.  It could be the
network, QEMU's networking code, QEMU's JSON parser, QEMU's queuing, or
QEMU's coroutine scheduling.

To narrow down the delay to something useful, I still need to resort to
tracepoints.

The value added by having timestamps in QMP is "always on".  But is that
really worthwhile?  What would you *do* with timestamps fished out of
logs?

My gut feeling is that client-side timestamps suffice to signal "there
is a problem", but the additional server-side QMP timestamps are
unlikely to be enough to figure out the problem, so you enable
tracepoints and wait for the problem to return.

That's why I'm asking for concrete experience supporting "yes, it
is worthwhile".

Daniel P. Berrangé Oct. 14, 2022, 1:19 p.m. UTC | #7

On Fri, Oct 14, 2022 at 02:57:06PM +0200, Markus Armbruster wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Fri, Oct 14, 2022 at 11:31:13AM +0200, Markus Armbruster wrote:
> >> Daniel P. Berrangé <berrange@redhat.com> writes:
> >> 
> >> > On Thu, Oct 13, 2022 at 05:00:26PM +0200, Markus Armbruster wrote:
> >> >> Denis Plotnikov <den-plotnikov@yandex-team.ru> writes:
> >> >> 
> >> >> > Add "start" & "end" time values to qmp command responses.
> >> >> 
> >> >> Please spell it QMP.  More of the same below.
> >> >> 
> >> >> >
> >> >> > These time values are added to let the qemu management layer get the exact
> >> >> > command execution time without any other time variance which might be brought by
> >> >> > other parts of management layer or qemu internals. This is particulary useful
> >> >> > for the management layer logging for later problems resolving.
> >> >> 
> >> >> I'm still having difficulties seeing the value add over existing
> >> >> tracepoints and logging.
> >> >> 
> >> >> Can you tell me about a problem you cracked (or could have cracked) with
> >> >> the help of this?
> >> >
> >> > Consider your QMP client is logging all commands and replies in its
> >> > own logfile (libvirt can do this). Having this start/end timestamps
> >> > included means the QMP client log is self contained.
> >> 
> >> A QMP client can include client-side timestamps in its log.  What value
> >> is being added by server-side timestamps?  According to the commit
> >> message, it's for getting "the exact command execution time without any
> >> other time variance which might be brought by other parts of management
> >> layer or qemu internals."  Why is that useful?  In particular, why is
> >> excluding network and QEMU queueing delays (inbound and outbound)
> >> useful?
> >
> > Lets, say some commands normally runs in ~100ms, but occasionally
> > runs in 2secs, and you want to understand why.
> >
> > A first step is understanding whether a given command itself is
> > slow at executing, or whether its execution has merely been
> > delayed because some other aspect of QEMU has delayed its execution.
> > If the server timestamps show it was very fast, then that indicates
> > delayed processing. Thus instead of debugging the slow command, I
> > can think about what scenarios would be responsible for the delay.
> > Perhaps a previous QMP command was very slow, or maybe there is
> > simply a large volume of QMP commands backlogged, or some part of
> > QEMU got blocked.
> >
> > Another case would be a command that is normally fast, and sometimes
> > is slower, but still relatively fast. The network and queueing side
> > might be a significant enough proportion of the total time to obscure
> > the slowdown. If you can eliminate the non-execution time, you can
> > see the performance trends over time to spot the subtle slowdowns
> > and detect abnormal behaviour before it becomes too terrible.
> 
> This is troubleshooting.  Asking for better troubleshooting tools is
> fair.
> 
> However, the proposed timestamps provide much more limited insight than
> existing tracepoints.  For instance, enabling

tracepoints are absolutely great and let you get a hell of alot
more information, *provided* you are in a position to actually
use tracepoints. This is, unfortunately, frequently not the case
when supporting real world production deployments.

Bug reports from customers typically include little more than a
log file they got from the mgmt client at time the problem happened.
The problem experianced may no longer exist, so asking them to run
a tracepoint script is not possible. They may also be reluctant to
actually run tracepoint scripts on a production system, or simply
lack the ability todo so at all, due to constraints of the deployment
environment. Logs from libvirt are something that are collected by
default for many mgmt apps, or can be turned on by the user with
minimal risk of disruption. 

Overall, there's a compelling desire to be proactive in collecting
information ahead of time, that might be useful in diagnosing
future bug reports.

So it isn't an 'either / or' decision of QMP reply logs vs use of
tracepoints, both are beneficial, with their own pros/cons.

With regards,
Daniel

Denis Plotnikov Oct. 16, 2022, 9:25 a.m. UTC | #8

On 14.10.2022 16:19, Daniel P. Berrangé wrote:
> On Fri, Oct 14, 2022 at 02:57:06PM +0200, Markus Armbruster wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>
>>> On Fri, Oct 14, 2022 at 11:31:13AM +0200, Markus Armbruster wrote:
>>>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>>>
>>>>> On Thu, Oct 13, 2022 at 05:00:26PM +0200, Markus Armbruster wrote:
>>>>>> Denis Plotnikov <den-plotnikov@yandex-team.ru> writes:
>>>>>>
>>>>>>> Add "start" & "end" time values to qmp command responses.
>>>>>> Please spell it QMP.  More of the same below.
>>>>>>
>>>>>>> These time values are added to let the qemu management layer get the exact
>>>>>>> command execution time without any other time variance which might be brought by
>>>>>>> other parts of management layer or qemu internals. This is particulary useful
>>>>>>> for the management layer logging for later problems resolving.
>>>>>> I'm still having difficulties seeing the value add over existing
>>>>>> tracepoints and logging.
>>>>>>
>>>>>> Can you tell me about a problem you cracked (or could have cracked) with
>>>>>> the help of this?
>>>>> Consider your QMP client is logging all commands and replies in its
>>>>> own logfile (libvirt can do this). Having this start/end timestamps
>>>>> included means the QMP client log is self contained.
>>>> A QMP client can include client-side timestamps in its log.  What value
>>>> is being added by server-side timestamps?  According to the commit
>>>> message, it's for getting "the exact command execution time without any
>>>> other time variance which might be brought by other parts of management
>>>> layer or qemu internals."  Why is that useful?  In particular, why is
>>>> excluding network and QEMU queueing delays (inbound and outbound)
>>>> useful?
>>> Lets, say some commands normally runs in ~100ms, but occasionally
>>> runs in 2secs, and you want to understand why.
>>>
>>> A first step is understanding whether a given command itself is
>>> slow at executing, or whether its execution has merely been
>>> delayed because some other aspect of QEMU has delayed its execution.
>>> If the server timestamps show it was very fast, then that indicates
>>> delayed processing. Thus instead of debugging the slow command, I
>>> can think about what scenarios would be responsible for the delay.
>>> Perhaps a previous QMP command was very slow, or maybe there is
>>> simply a large volume of QMP commands backlogged, or some part of
>>> QEMU got blocked.
>>>
>>> Another case would be a command that is normally fast, and sometimes
>>> is slower, but still relatively fast. The network and queueing side
>>> might be a significant enough proportion of the total time to obscure
>>> the slowdown. If you can eliminate the non-execution time, you can
>>> see the performance trends over time to spot the subtle slowdowns
>>> and detect abnormal behaviour before it becomes too terrible.
>> This is troubleshooting.  Asking for better troubleshooting tools is
>> fair.
>>
>> However, the proposed timestamps provide much more limited insight than
>> existing tracepoints.  For instance, enabling
> tracepoints are absolutely great and let you get a hell of alot
> more information, *provided* you are in a position to actually
> use tracepoints. This is, unfortunately, frequently not the case
> when supporting real world production deployments.
Exactly!!! Thanks for the pointing out!
>
> Bug reports from customers typically include little more than a
> log file they got from the mgmt client at time the problem happened.
> The problem experianced may no longer exist, so asking them to run
> a tracepoint script is not possible. They may also be reluctant to
> actually run tracepoint scripts on a production system, or simply
> lack the ability todo so at all, due to constraints of the deployment
> environment. Logs from libvirt are something that are collected by
> default for many mgmt apps, or can be turned on by the user with
> minimal risk of disruption.
>
> Overall, there's a compelling desire to be proactive in collecting
> information ahead of time, that might be useful in diagnosing
> future bug reports.

This is the main reason. When you encounter a problem one of the first 
questions is "Was there something similar in the past. Another question 
is how often does it happen.

With the timestamps these questions answering becomes easier.

Another thing is that with the qmp command timestamps you can build a 
monitoring system which will report about the cases when 
execution_time_from_mgmt_perspective - excution_time_qmp_command > 
some_threshold which in turn proactively tell you about the potential 
problems. And then you'll start using the qmp tracepoints (and other 
means) to figure out the real reason of the execution time variance.

Thanks, Denis

>
> So it isn't an 'either / or' decision of QMP reply logs vs use of
> tracepoints, both are beneficial, with their own pros/cons.
>
> With regards,
> Daniel

Markus Armbruster Oct. 17, 2022, 5:23 p.m. UTC | #9

Denis Plotnikov <den-plotnikov@yandex-team.ru> writes:

> On 14.10.2022 16:19, Daniel P. Berrangé wrote:
>> On Fri, Oct 14, 2022 at 02:57:06PM +0200, Markus Armbruster wrote:
>>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>>
>>>> On Fri, Oct 14, 2022 at 11:31:13AM +0200, Markus Armbruster wrote:
>>>>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>>>>
>>>>>> On Thu, Oct 13, 2022 at 05:00:26PM +0200, Markus Armbruster wrote:
>>>>>>> Denis Plotnikov <den-plotnikov@yandex-team.ru> writes:
>>>>>>>
>>>>>>>> Add "start" & "end" time values to qmp command responses.
>>>>>>> Please spell it QMP.  More of the same below.
>>>>>>>
>>>>>>>> These time values are added to let the qemu management layer get the exact
>>>>>>>> command execution time without any other time variance which might be brought by
>>>>>>>> other parts of management layer or qemu internals. This is particulary useful
>>>>>>>> for the management layer logging for later problems resolving.
>>>>>>> I'm still having difficulties seeing the value add over existing
>>>>>>> tracepoints and logging.
>>>>>>>
>>>>>>> Can you tell me about a problem you cracked (or could have cracked) with
>>>>>>> the help of this?
>>>>>> Consider your QMP client is logging all commands and replies in its
>>>>>> own logfile (libvirt can do this). Having this start/end timestamps
>>>>>> included means the QMP client log is self contained.
>>>>> A QMP client can include client-side timestamps in its log.  What value
>>>>> is being added by server-side timestamps?  According to the commit
>>>>> message, it's for getting "the exact command execution time without any
>>>>> other time variance which might be brought by other parts of management
>>>>> layer or qemu internals."  Why is that useful?  In particular, why is
>>>>> excluding network and QEMU queueing delays (inbound and outbound)
>>>>> useful?
>>>> Lets, say some commands normally runs in ~100ms, but occasionally
>>>> runs in 2secs, and you want to understand why.
>>>>
>>>> A first step is understanding whether a given command itself is
>>>> slow at executing, or whether its execution has merely been
>>>> delayed because some other aspect of QEMU has delayed its execution.
>>>> If the server timestamps show it was very fast, then that indicates
>>>> delayed processing. Thus instead of debugging the slow command, I
>>>> can think about what scenarios would be responsible for the delay.
>>>> Perhaps a previous QMP command was very slow, or maybe there is
>>>> simply a large volume of QMP commands backlogged, or some part of
>>>> QEMU got blocked.
>>>>
>>>> Another case would be a command that is normally fast, and sometimes
>>>> is slower, but still relatively fast. The network and queueing side
>>>> might be a significant enough proportion of the total time to obscure
>>>> the slowdown. If you can eliminate the non-execution time, you can
>>>> see the performance trends over time to spot the subtle slowdowns
>>>> and detect abnormal behaviour before it becomes too terrible.
>>> This is troubleshooting.  Asking for better troubleshooting tools is
>>> fair.
>>>
>>> However, the proposed timestamps provide much more limited insight than
>>> existing tracepoints.  For instance, enabling
>> tracepoints are absolutely great and let you get a hell of alot
>> more information, *provided* you are in a position to actually
>> use tracepoints. This is, unfortunately, frequently not the case
>> when supporting real world production deployments.
>
> Exactly!!! Thanks for the pointing out!
>>
>> Bug reports from customers typically include little more than a
>> log file they got from the mgmt client at time the problem happened.
>> The problem experianced may no longer exist, so asking them to run
>> a tracepoint script is not possible. They may also be reluctant to
>> actually run tracepoint scripts on a production system, or simply
>> lack the ability todo so at all, due to constraints of the deployment
>> environment. Logs from libvirt are something that are collected by
>> default for many mgmt apps, or can be turned on by the user with
>> minimal risk of disruption.
>>
>> Overall, there's a compelling desire to be proactive in collecting
>> information ahead of time, that might be useful in diagnosing
>> future bug reports.
>
> This is the main reason. When you encounter a problem one of the first
> questions is "Was there something similar in the past. Another question
> is how often does it happen.
>
> With the timestamps these questions answering becomes easier.
>
> Another thing is that with the qmp command timestamps you can build a
> monitoring system which will report about the cases when
> execution_time_from_mgmt_perspective - excution_time_qmp_command >
> some_threshold which in turn proactively tell you about the potential
> problems. And then you'll start using the qmp tracepoints (and other
> means) to figure out the real reason of the execution time variance.
>
> Thanks, Denis
>
>>
>> So it isn't an 'either / or' decision of QMP reply logs vs use of
>> tracepoints, both are beneficial, with their own pros/cons.
>>
>> With regards,
>> Daniel

Please give full rationale in the commit message.  Make sure to address
the following points:

1. Why tracepoints can't do the job

2. Why client-side logging can't do the job

3. Why of all the possible points of interest in a QMP command's flow
   through QEMU, you're picking these two.

So far, I've seen a full argument on 1., not much on 2., and basically
nothing on 3.

[v3] qapi/qmp: Add timestamps to qmp command responses

Commit Message

Comments

Patch