
[v3,4/5] qtest: precompute hex nibs

Message ID 1430864578-22072-5-git-send-email-jsnow@redhat.com
State New

Commit Message

John Snow May 5, 2015, 10:22 p.m. UTC
Instead of letting printf and friends do this for us
one byte at a time, fill a buffer ourselves and then
send the entire buffer in one go.

This gives a moderate speed improvement over the old
method.

Signed-off-by: John Snow <jsnow@redhat.com>
---
 qtest.c          | 20 ++++++++++++++++----
 tests/libqtest.c | 17 ++++++++++++++---
 2 files changed, 30 insertions(+), 7 deletions(-)

Comments

Markus Armbruster May 6, 2015, 6:25 a.m. UTC | #1
John Snow <jsnow@redhat.com> writes:

> Instead of letting printf and friends do this for us
> one byte at a time, fill a buffer ourselves and then
> send the entire buffer in one go.
>
> This gives a moderate speed improvement over the old
> method.

Out of curiosity: how much of the improvement is due to doing our own
buffering instead of printf()'s (assuming the stream is buffered), and
how much is due to doing our own hex formatting instead of printf()'s?
John Snow May 6, 2015, 2:12 p.m. UTC | #2
On 05/06/2015 02:25 AM, Markus Armbruster wrote:
> John Snow <jsnow@redhat.com> writes:
> 
>> Instead of letting printf and friends do this for us
>> one byte at a time, fill a buffer ourselves and then
>> send the entire buffer in one go.
>>
>> This gives a moderate speed improvement over the old
>> method.
> 
> Out of curiosity: how much of the improvement is due to doing our own
> buffering instead of printf()'s (assuming the stream is buffered), and
> how much is due to doing our own hex formatting instead of printf()'s?
> 

Out of ignorance: How would I measure?
Markus Armbruster May 6, 2015, 3:19 p.m. UTC | #3
John Snow <jsnow@redhat.com> writes:

> On 05/06/2015 02:25 AM, Markus Armbruster wrote:
>> John Snow <jsnow@redhat.com> writes:
>> 
>>> Instead of letting printf and friends do this for us
>>> one byte at a time, fill a buffer ourselves and then
>>> send the entire buffer in one go.
>>>
>>> This gives a moderate speed improvement over the old
>>> method.
>> 
>> Out of curiosity: how much of the improvement is due to doing our own
>> buffering instead of printf()'s (assuming the stream is buffered), and
>> how much is due to doing our own hex formatting instead of printf()'s?
>> 
>
> Out of ignorance: How would I measure?

Heh, well played!

The code before the series uses chr unbuffered:

        for (i = 0; i < len; i++) {
            qtest_sendf(chr, "%02x", data[i]);
        }

qtest_sendf() formats into two bytes, passes them to
qemu_chr_fe_write_all(), which writes them to chr.

The chr are typically unbuffered, so this could well produce a series of
two-byte write() system calls.

Adding some buffering will obviously make a difference for larger len.

Whether formatting hex digits by hand can make a difference is not
obvious.

To find out, add just buffering.  Something like this in your patch
instead of byte2hex():

         for (i = 0; i < len; i++) {
-            qtest_sendf(chr, "%02x", data[i]);
+            snprintf(&enc[i * 2], 2, "%02x", data[i]);
         }

If the speedup is pretty much entirely due to buffering (which I
suspect), then your commit message could use a bit of love :)
John Snow May 6, 2015, 4:18 p.m. UTC | #4
On 05/06/2015 11:19 AM, Markus Armbruster wrote:
> John Snow <jsnow@redhat.com> writes:
> 
>> On 05/06/2015 02:25 AM, Markus Armbruster wrote:
>>> John Snow <jsnow@redhat.com> writes:
>>>
>>>> Instead of letting printf and friends do this for us
>>>> one byte at a time, fill a buffer ourselves and then
>>>> send the entire buffer in one go.
>>>>
>>>> This gives a moderate speed improvement over the old
>>>> method.
>>>
>>> Out of curiosity: how much of the improvement is due to doing our own
>>> buffering instead of printf()'s (assuming the stream is buffered), and
>>> how much is due to doing our own hex formatting instead of printf()'s?
>>>
>>
>> Out of ignorance: How would I measure?
> 
> Heh, well played!
> 
> The code before the series uses chr unbuffered:
> 
>         for (i = 0; i < len; i++) {
>             qtest_sendf(chr, "%02x", data[i]);
>         }
> 
> qtest_sendf() formats into two bytes, passes them to
> qemu_chr_fe_write_all(), which writes them to chr.
> 
> The chr are typically unbuffered, so this could well produce a series of
> two-byte write() system calls.
> 
> Adding some buffering will obviously make a difference for larger len.
> 
> Whether formatting hex digits by hand can make a difference is not
> obvious.
> 
> To find out, add just buffering.  Something like this in your patch
> instead of byte2hex():
> 
>          for (i = 0; i < len; i++) {
> -            qtest_sendf(chr, "%02x", data[i]);
> +            snprintf(&enc[i * 2], 2, "%02x", data[i]);
>          }
> 
> If the speedup is pretty much entirely due to buffering (which I
> suspect), then your commit message could use a bit of love :)
> 

When you're right, you're right. The difference may not be statistically
meaningful, but with today's planetary alignment, using
sprintf() to batch the sends instead of my home-rolled nib computation
function, I can eke out a few more tenths of a second.

If there are no objections, I will stage patches 1-3 and 5 and resubmit
a quick v2 of just this single patch. Or, if you want to go ahead and
say the edit is fine, I will just make it before sending the pullreq.

--js
Markus Armbruster May 7, 2015, 6:13 a.m. UTC | #5
John Snow <jsnow@redhat.com> writes:

> On 05/06/2015 11:19 AM, Markus Armbruster wrote:
>> John Snow <jsnow@redhat.com> writes:
>> 
>>> On 05/06/2015 02:25 AM, Markus Armbruster wrote:
>>>> John Snow <jsnow@redhat.com> writes:
>>>>
>>>>> Instead of letting printf and friends do this for us
>>>>> one byte at a time, fill a buffer ourselves and then
>>>>> send the entire buffer in one go.
>>>>>
>>>>> This gives a moderate speed improvement over the old
>>>>> method.
>>>>
>>>> Out of curiosity: how much of the improvement is due to doing our own
>>>> buffering instead of printf()'s (assuming the stream is buffered), and
>>>> how much is due to doing our own hex formatting instead of printf()'s?
>>>>
>>>
>>> Out of ignorance: How would I measure?
>> 
>> Heh, well played!
>> 
>> The code before the series uses chr unbuffered:
>> 
>>         for (i = 0; i < len; i++) {
>>             qtest_sendf(chr, "%02x", data[i]);
>>         }
>> 
>> qtest_sendf() formats into two bytes, passes them to
>> qemu_chr_fe_write_all(), which writes them to chr.
>> 
>> The chr are typically unbuffered, so this could well produce a series of
>> two-byte write() system calls.
>> 
>> Adding some buffering will obviously make a difference for larger len.
>> 
>> Whether formatting hex digits by hand can make a difference is not
>> obvious.
>> 
>> To find out, add just buffering.  Something like this in your patch
>> instead of byte2hex():
>> 
>>          for (i = 0; i < len; i++) {
>> -            qtest_sendf(chr, "%02x", data[i]);
>> +            snprintf(&enc[i * 2], 2, "%02x", data[i]);
>>          }
>> 
>> If the speedup is pretty much entirely due to buffering (which I
>> suspect), then your commit message could use a bit of love :)
>> 
>
> When you're right, you're right. The difference may not be statistically
> meaningful, but with today's planetary alignment, using
> sprintf() to batch the sends instead of my home-rolled nib computation
> function, I can eke out a few more tenths of a second.

:)

> If there are no objections, I will stage patches 1-3 and 5 and resubmit
> a quick v2 of just this single patch. Or, if you want to go ahead and
> say the edit is fine, I will just make it before sending the pullreq.

I'd recommend posting the revised patch with an updated commit message.
John Snow May 7, 2015, 5:52 p.m. UTC | #6
On 05/07/2015 02:13 AM, Markus Armbruster wrote:
> John Snow <jsnow@redhat.com> writes:
> 
>> On 05/06/2015 11:19 AM, Markus Armbruster wrote:
>>> John Snow <jsnow@redhat.com> writes:
>>>
>>>> On 05/06/2015 02:25 AM, Markus Armbruster wrote:
>>>>> John Snow <jsnow@redhat.com> writes:
>>>>>
>>>>>> Instead of letting printf and friends do this for us
>>>>>> one byte at a time, fill a buffer ourselves and then
>>>>>> send the entire buffer in one go.
>>>>>>
>>>>>> This gives a moderate speed improvement over the old
>>>>>> method.
>>>>>
>>>>> Out of curiosity: how much of the improvement is due to doing our own
>>>>> buffering instead of printf()'s (assuming the stream is buffered), and
>>>>> how much is due to doing our own hex formatting instead of printf()'s?
>>>>>
>>>>
>>>> Out of ignorance: How would I measure?
>>>
>>> Heh, well played!
>>>
>>> The code before the series uses chr unbuffered:
>>>
>>>         for (i = 0; i < len; i++) {
>>>             qtest_sendf(chr, "%02x", data[i]);
>>>         }
>>>
>>> qtest_sendf() formats into two bytes, passes them to
>>> qemu_chr_fe_write_all(), which writes them to chr.
>>>
>>> The chr are typically unbuffered, so this could well produce a series of
>>> two-byte write() system calls.
>>>
>>> Adding some buffering will obviously make a difference for larger len.
>>>
>>> Whether formatting hex digits by hand can make a difference is not
>>> obvious.
>>>
>>> To find out, add just buffering.  Something like this in your patch
>>> instead of byte2hex():
>>>
>>>          for (i = 0; i < len; i++) {
>>> -            qtest_sendf(chr, "%02x", data[i]);
>>> +            snprintf(&enc[i * 2], 2, "%02x", data[i]);
>>>          }
>>>
>>> If the speedup is pretty much entirely due to buffering (which I
>>> suspect), then your commit message could use a bit of love :)
>>>
>>
>> When you're right, you're right. The difference may not be statistically
>> meaningful, but with today's planetary alignment, using
>> sprintf() to batch the sends instead of my home-rolled nib computation
>> function, I can eke out a few more tenths of a second.
> 
> :)
> 
>> If there are no objections, I will stage patches 1-3 and 5 and resubmit
>> a quick v2 of just this single patch. Or, if you want to go ahead and
>> say the edit is fine, I will just make it before sending the pullreq.
> 
> I'd recommend posting the revised patch with an updated commit message.
> 

Staged patches and dropped patch 4, thanks.

--js
Eric Blake May 7, 2015, 8:27 p.m. UTC | #7
On 05/06/2015 10:18 AM, John Snow wrote:

>> To find out, add just buffering.  Something like this in your patch
>> instead of byte2hex():
>>
>>          for (i = 0; i < len; i++) {
>> -            qtest_sendf(chr, "%02x", data[i]);
>> +            snprintf(&enc[i * 2], 2, "%02x", data[i]);
>>          }
>>
>> If the speedup is pretty much entirely due to buffering (which I
>> suspect), then your commit message could use a bit of love :)
>>
> 
> When you're right, you're right. The difference may not be statistically
> meaningful, but with today's planetary alignment, using
> sprintf() to batch the sends instead of my home-rolled nib computation
> function, I can eke out a few more tenths of a second.

I'm a bit surprised - making a function call per byte generally executes
more instructions than open-coding the conversion (albeit the branch
prediction in the hardware probably does fairly well over long strings,
since it is a tight and predictable loop).  Remember, sprintf() has to
decode the format string on every call (unless the compiler is smart
enough to open-code what sprintf would do).
Markus Armbruster May 8, 2015, 6:25 a.m. UTC | #8
Eric Blake <eblake@redhat.com> writes:

> On 05/06/2015 10:18 AM, John Snow wrote:
>
>>> To find out, add just buffering.  Something like this in your patch
>>> instead of byte2hex():
>>>
>>>          for (i = 0; i < len; i++) {
>>> -            qtest_sendf(chr, "%02x", data[i]);
>>> +            snprintf(&enc[i * 2], 2, "%02x", data[i]);
>>>          }
>>>
>>> If the speedup is pretty much entirely due to buffering (which I
>>> suspect), then your commit message could use a bit of love :)
>>>
>> 
>> When you're right, you're right. The difference may not be statistically
>> meaningful, but with today's planetary alignment, using
>> sprintf() to batch the sends instead of my home-rolled nib computation
>> function, I can eke out a few more tenths of a second.
>
> I'm a bit surprised - making a function call per byte generally executes
> more instructions than open-coding the conversion (albeit the branch
> prediction in the hardware probably does fairly well over long strings,
> since it is a tight and predictable loop).  Remember, sprintf() has to
> decode the format string on every call (unless the compiler is smart
> enough to open-code what sprintf would do).

John's measurements show that the speed difference between snprintf()
and a local copy of formatting code gets thoroughly drowned in noise.

The snprintf() version takes 18 lines less, according to diffstat.  Less
code, same measured performance, what's not to like?

However, if you feel strongly about avoiding snprintf() here, I won't
argue further.  Except for the commit message: it needs to be fixed not
to claim avoiding "printf and friends" makes a speed difference.
John Snow May 8, 2015, 4:22 p.m. UTC | #9
On 05/08/2015 02:25 AM, Markus Armbruster wrote:
> Eric Blake <eblake@redhat.com> writes:
> 
>> On 05/06/2015 10:18 AM, John Snow wrote:
>>
>>>> To find out, add just buffering.  Something like this in your patch
>>>> instead of byte2hex():
>>>>
>>>>          for (i = 0; i < len; i++) {
>>>> -            qtest_sendf(chr, "%02x", data[i]);
>>>> +            snprintf(&enc[i * 2], 2, "%02x", data[i]);
>>>>          }
>>>>
>>>> If the speedup is pretty much entirely due to buffering (which I
>>>> suspect), then your commit message could use a bit of love :)
>>>>
>>>
>>> When you're right, you're right. The difference may not be statistically
>>> meaningful, but with today's current planetary alignment, using
>>> sprintf() to batch the sends instead of my home-rolled nib computation
>>> function, I can eke out a few more tenths of a second.
>>
>> I'm a bit surprised - making a function call per byte generally executes
>> more instructions than open-coding the conversion (albeit the branch
>> prediction in the hardware probably does fairly well over long strings,
>> since it is a tight and predictable loop).  Remember, sprintf() has to
>> decode the format string on every call (unless the compiler is smart
>> enough to open-code what sprintf would do).
> 
> John's measurements show that the speed difference between snprintf()
> and a local copy of formatting code gets thoroughly drowned in noise.
> 
> The snprintf() version takes 18 lines less, according to diffstat.  Less
> code, same measured performance, what's not to like?
> 
> However, if you feel strongly about avoiding snprintf() here, I won't
> argue further.  Except for the commit message: it needs to be fixed not
> to claim avoiding "printf and friends" makes a speed difference.
> 

My reasoning was the same as Markus's: the difference was so negligible
that I went with the "less home-rolled code" version.

I already staged this series without the nib functions and submitted the
snprintf version as its own patch with a less disparaging (to printf and
friends) commit message.

Any further micro-optimization isn't worth the time it would take to
benchmark properly; it's splitting hairs. I already dropped the test
from ~14s to ~4s. Good enough.

--js
Eric Blake May 8, 2015, 7:47 p.m. UTC | #10
On 05/08/2015 10:22 AM, John Snow wrote:
>>> I'm a bit surprised - making a function call per byte generally executes
>>> more instructions than open-coding the conversion (albeit the branch
>>> prediction in the hardware probably does fairly well over long strings,
>>> since it is a tight and predictable loop).  Remember, sprintf() has to
>>> decode the format string on every call (unless the compiler is smart
>>> enough to open-code what sprintf would do).
>>
>> John's measurements show that the speed difference between snprintf()
>> and a local copy of formatting code gets thoroughly drowned in noise.
>>
>> The snprintf() version takes 18 lines less, according to diffstat.  Less
>> code, same measured performance, what's not to like?
>>
>> However, if you feel strongly about avoiding snprintf() here, I won't
>> argue further.  Except for the commit message: it needs to be fixed not
>> to claim avoiding "printf and friends" makes a speed difference.
>>
> 
> My reasoning was the same as Markus's: the difference was so negligible
> that I went with the "less home-rolled code" version.
> 
> I already staged this series without the nib functions and submitted the
> snprintf version as its own patch with a less disparaging (to printf and
> friends) commit message.
> 
> Any further micro-optimization isn't worth the time it would take to
> benchmark properly; it's splitting hairs. I already dropped the test
> from ~14s to ~4s. Good enough.

Okay, I'm convinced - keep the s[n]printf, on the grounds that it wasn't
the bottleneck.  You are right that premature optimization is not worth
the complexity if it gives no real speed gain.

Patch

diff --git a/qtest.c b/qtest.c
index c1a0493..383f9ed 100644
--- a/qtest.c
+++ b/qtest.c
@@ -157,6 +157,8 @@  static bool qtest_opened;
  * NUM=0 even though it is remapped to GSI 2).
  */
 
+static const char *hex = "0123456789abcdef";
+
 static int hex2nib(char ch)
 {
     if (ch >= '0' && ch <= '9') {
@@ -170,6 +172,12 @@  static int hex2nib(char ch)
     }
 }
 
+static inline void byte2hex(uint8_t byte, uint8_t *out)
+{
+    *out++ = hex[byte >> 4];
+    *out = hex[byte & 0x0F];
+}
+
 static void qtest_get_time(qemu_timeval *tv)
 {
     qemu_gettimeofday(tv);
@@ -414,6 +422,7 @@  static void qtest_process_command(CharDriverState *chr, gchar **words)
     } else if (strcmp(words[0], "read") == 0) {
         uint64_t addr, len, i;
         uint8_t *data;
+        uint8_t *enc;
 
         g_assert(words[1] && words[2]);
         addr = strtoull(words[1], NULL, 0);
@@ -422,14 +431,17 @@  static void qtest_process_command(CharDriverState *chr, gchar **words)
         data = g_malloc(len);
         cpu_physical_memory_read(addr, data, len);
 
-        qtest_send_prefix(chr);
-        qtest_send(chr, "OK 0x");
+        enc = g_malloc(2 * len + 1);
         for (i = 0; i < len; i++) {
-            qtest_sendf(chr, "%02x", data[i]);
+            byte2hex(data[i], &enc[i * 2]);
         }
-        qtest_send(chr, "\n");
+        enc[2 * len] = '\0';
+
+        qtest_send_prefix(chr);
+        qtest_sendf(chr, "OK 0x%s\n", enc);
 
         g_free(data);
+        g_free(enc);
     } else if (strcmp(words[0], "b64read") == 0) {
         uint64_t addr, len;
         uint8_t *data;
diff --git a/tests/libqtest.c b/tests/libqtest.c
index 055aad6..1b246c9 100644
--- a/tests/libqtest.c
+++ b/tests/libqtest.c
@@ -651,6 +651,8 @@  uint64_t qtest_readq(QTestState *s, uint64_t addr)
     return qtest_read(s, "readq", addr);
 }
 
+static const char *hex = "0123456789abcdef";
+
 static int hex2nib(char ch)
 {
     if (ch >= '0' && ch <= '9') {
@@ -664,6 +666,12 @@  static int hex2nib(char ch)
     }
 }
 
+static inline void byte2hex(uint8_t byte, uint8_t *out)
+{
+    *out++ = hex[byte >> 4];
+    *out = hex[byte & 0x0f];
+}
+
 void qtest_memread(QTestState *s, uint64_t addr, void *data, size_t size)
 {
     uint8_t *ptr = data;
@@ -730,13 +738,16 @@  void qtest_memwrite(QTestState *s, uint64_t addr, const void *data, size_t size)
 {
     const uint8_t *ptr = data;
     size_t i;
+    uint8_t *enc = g_malloc(2 * size + 1);
 
-    qtest_sendf(s, "write 0x%" PRIx64 " 0x%zx 0x", addr, size);
     for (i = 0; i < size; i++) {
-        qtest_sendf(s, "%02x", ptr[i]);
+        byte2hex(ptr[i], &enc[i * 2]);
     }
-    qtest_sendf(s, "\n");
+    enc[2 * size] = '\0';
+
+    qtest_sendf(s, "write 0x%" PRIx64 " 0x%zx 0x%s\n", addr, size, enc);
     qtest_rsp(s, 0);
+    g_free(enc);
 }
 
 void qtest_memset(QTestState *s, uint64_t addr, uint8_t pattern, size_t size)