Converted benchmark to benchtest.
diff mbox

Message ID 20140605123701.GF9145@spoyarek.pnq.redhat.com
State New
Headers show

Commit Message

Siddhesh Poyarekar June 5, 2014, 12:37 p.m. UTC
On Sat, May 03, 2014 at 12:48:31PM +0200, Ondřej Bílka wrote:
> Its nice to have but orthogonal on proposed change, so it should not
> block this patch

This looks like something that could easily use the $bench-inputs
format since you're really just looking for mean times.  I haven't
checked to see if a failed branch would be expensive compared to the
lock/unlock cycle we're testing, but I'm guessing that it should get
cached and hence not have an impact since we don't interleave the
rdlock and wrlock calls.

Siddhesh

Comments

Torvald Riegel June 5, 2014, 12:46 p.m. UTC | #1
On Thu, 2014-06-05 at 18:07 +0530, Siddhesh Poyarekar wrote:
> -bench-pthread := pthread_once
> +bench-pthread := pthread_once pthread_rwlock_test

Could we rename this to pthread_rwlock_uncontended or
pthread_rwlock_1thread to signal that this measures just the
single-thread overheads on an uncontended rwlock?
Ondřej Bílka June 5, 2014, 12:57 p.m. UTC | #2
On Thu, Jun 05, 2014 at 06:07:01PM +0530, Siddhesh Poyarekar wrote:
> On Sat, May 03, 2014 at 12:48:31PM +0200, Ondřej Bílka wrote:
> > Its nice to have but orthogonal on proposed change, so it should not
> > block this patch
> 
> This looks like something that could easily use the $bench-inputs
> format since you're really just looking for mean times.  I haven't
> checked to see if a failed branch would be expensive compared to the
> lock/unlock cycle we're testing, but I'm guessing that it should get
> cached and hence not have an impact since we don't interleave the
> rdlock and wrlock calls.
> 
Anyway I realized that microbenchmarks may be wrong way here.

How it is likely that you have lock in L1 cache? Andy any insigth? 
I could try to simulate it by using many locks and choosing one at
random, how that would work.

Also Siddhesh, benchmarks are useless unless you use their result so
what did you measured with your benchmark and what do you thing about
Andi's patch.
Siddhesh Poyarekar June 5, 2014, 1:06 p.m. UTC | #3
On 5 June 2014 18:27, Ondřej Bílka <neleai@seznam.cz> wrote:
> Also Siddhesh, benchmarks are useless unless you use their result so
> what did you measured with your benchmark and what do you thing about
> Andi's patch.

I haven't looked at Andi's patch and I haven't done any comparative
testing; I merely proposed a test similar to your test in the standard
format to measure uncontended rdlock/wrlock+unlock performance because
such an example of framework usage is not present in the current
tests.

Torvald, I'm not very picky about the name if we find such a test
useful to include in the microbenchmark.  I haven't actually looked at
the original problem statement yet to make any comments about that.

Siddhesh
Torvald Riegel June 5, 2014, 1:06 p.m. UTC | #4
On Thu, 2014-06-05 at 14:57 +0200, Ondřej Bílka wrote:
> On Thu, Jun 05, 2014 at 06:07:01PM +0530, Siddhesh Poyarekar wrote:
> > On Sat, May 03, 2014 at 12:48:31PM +0200, Ondřej Bílka wrote:
> > > Its nice to have but orthogonal on proposed change, so it should not
> > > block this patch
> > 
> > This looks like something that could easily use the $bench-inputs
> > format since you're really just looking for mean times.  I haven't
> > checked to see if a failed branch would be expensive compared to the
> > lock/unlock cycle we're testing, but I'm guessing that it should get
> > cached and hence not have an impact since we don't interleave the
> > rdlock and wrlock calls.
> > 
> Anyway I realized that microbenchmarks may be wrong way here.
> 
> How it is likely that you have lock in L1 cache?

That can indeed be likely, depending on the way the program uses the
locks.  Especially for a rdlock this can be likely.

> Andy any insigth? 
> I could try to simulate it by using many locks and choosing one at
> random, how that would work.

I don't think that's necessary.  The purpose of this benchmark is to
measure the single-thread overheads, and whether they differ between a C
implementation and an assembly implementation.  Assuming that does
implement the same algorithm, they should roughly make the same memory
accesses.

> Also Siddhesh, benchmarks are useless unless you use their result so
> what did you measured with your benchmark and what do you thing about
> Andi's patch.

The fact that Siddhesh reviewed it doesn't imply that he has to measure,
nor that the benchmark would be useless.
Ondřej Bílka June 5, 2014, 1:51 p.m. UTC | #5
On Thu, Jun 05, 2014 at 03:06:24PM +0200, Torvald Riegel wrote:
> On Thu, 2014-06-05 at 14:57 +0200, Ondřej Bílka wrote:
> > On Thu, Jun 05, 2014 at 06:07:01PM +0530, Siddhesh Poyarekar wrote:
> > > On Sat, May 03, 2014 at 12:48:31PM +0200, Ondřej Bílka wrote:
> > > > Its nice to have but orthogonal on proposed change, so it should not
> > > > block this patch
> > > 
> > > This looks like something that could easily use the $bench-inputs
> > > format since you're really just looking for mean times.  I haven't
> > > checked to see if a failed branch would be expensive compared to the
> > > lock/unlock cycle we're testing, but I'm guessing that it should get
> > > cached and hence not have an impact since we don't interleave the
> > > rdlock and wrlock calls.
> > > 
> > Anyway I realized that microbenchmarks may be wrong way here.
> > 
> > How it is likely that you have lock in L1 cache?
> 
> That can indeed be likely, depending on the way the program uses the
> locks.  Especially for a rdlock this can be likely.
>
That I wanted to hear. 
> > Andy any insigth? 
> > I could try to simulate it by using many locks and choosing one at
> > random, how that would work.
> 
> I don't think that's necessary.  The purpose of this benchmark is to
> measure the single-thread overheads, and whether they differ between a C
> implementation and an assembly implementation.  Assuming that does
> implement the same algorithm, they should roughly make the same memory
> accesses.
> 
That could depend on scheduling but rwlock code looks like taking gcc -S
output anyway.

> > Also Siddhesh, benchmarks are useless unless you use their result so
> > what did you measured with your benchmark and what do you thing about
> > Andi's patch.
> 
> The fact that Siddhesh reviewed it doesn't imply that he has to measure,
> nor that the benchmark would be useless.

For benchmark review you should definitely run it to check if result
make sense. It helps to prevent oops like what happened to me when I
checked pthread_once it took always few cycles and problem turned out to
be not linking with pthread.

Siddhesh send a alternate benchmark so somebody needs to stand before
it. Who will do it? I wont as I am convinced by previous benchmark.
Will one of you stop bikesheding and say that there is really no difference 
between assembly and c rwlock implementation or will you keep arguing
about name of benchmark for next month?
Torvald Riegel June 5, 2014, 2:14 p.m. UTC | #6
On Thu, 2014-06-05 at 15:51 +0200, Ondřej Bílka wrote:
> Will one of you stop bikesheding and say that there is really no difference 
> between assembly and c rwlock implementation or will you keep arguing
> about name of benchmark for next month?

I find the tone of this message inappropriate. YMMV -- but don't be
surprised if that doesn't make your stuff bounce up on my priority list.
Siddhesh Poyarekar June 5, 2014, 3:53 p.m. UTC | #7
On 5 June 2014 19:21, Ondřej Bílka <neleai@seznam.cz> wrote:
> For benchmark review you should definitely run it to check if result
> make sense. It helps to prevent oops like what happened to me when I
> checked pthread_once it took always few cycles and problem turned out to
> be not linking with pthread.

If sanity is what you're looking for, i.e. the requisite calls not
being optimized out, then yes, I did verify that and even ran it once
against current master:

  "pthread_rwlock_test": {
   "rwlock": {
    "duration": 2.87958e+09,
    "iterations": 2.7688e+07,
    "max": 172.944,
    "min": 97.516,
    "mean": 104.001
   },
   "rdlock": {
    "duration": 2.88252e+09,
    "iterations": 2.7022e+07,
    "max": 252.66,
    "min": 101.541,
    "mean": 106.673
   }
  }

> Siddhesh send a alternate benchmark so somebody needs to stand before
> it. Who will do it? I wont as I am convinced by previous benchmark.

The benchmark you proposed is equivalent in terms of stuff we're
measuring and I was only trying to help you get your benchmark into
the standard format so that you don't have to redo the json output or
leave that for someone else to do.  If uncontended performance is
important then this benchmark is useful regardless of whether it shows
that the assembly implementation is better than the C one or not.  In
fact you argued in your patch proposal that the benchmark itself is
orthogonal to Andi's patch.  Did you change your mind about that?

> Will one of you stop bikesheding and say that there is really no difference
> between assembly and c rwlock implementation or will you keep arguing
> about name of benchmark for next month?

I like my benchmark name in italics :)

Siddhesh
Ondřej Bílka June 5, 2014, 7:54 p.m. UTC | #8
On Thu, Jun 05, 2014 at 09:23:04PM +0530, Siddhesh Poyarekar wrote:
> On 5 June 2014 19:21, Ondřej Bílka <neleai@seznam.cz> wrote:
> > For benchmark review you should definitely run it to check if result
> > make sense. It helps to prevent oops like what happened to me when I
> > checked pthread_once it took always few cycles and problem turned out to
> > be not linking with pthread.
> 
> If sanity is what you're looking for, i.e. the requisite calls not
> being optimized out, then yes, I did verify that and even ran it once
> against current master:
> 
>   "pthread_rwlock_test": {
>    "rwlock": {
>     "duration": 2.87958e+09,
>     "iterations": 2.7688e+07,
>     "max": 172.944,
>     "min": 97.516,
>     "mean": 104.001
>    },
>    "rdlock": {
>     "duration": 2.88252e+09,
>     "iterations": 2.7022e+07,
>     "max": 252.66,
>     "min": 101.541,
>     "mean": 106.673
>    }
>   }
>
It is that you must always look for ways that could make benchmark
invalid.
Also I noticed that this takes long to run, did you also noticed that?
 
> > Siddhesh send a alternate benchmark so somebody needs to stand before
> > it. Who will do it? I wont as I am convinced by previous benchmark.
> 
> The benchmark you proposed is equivalent in terms of stuff we're
> measuring and I was only trying to help you get your benchmark into
> the standard format so that you don't have to redo the json output or
> leave that for someone else to do.

I am ok also with this benchmark but its your benchmark so I treated it
like that. 

> If uncontended performance is
> important then this benchmark is useful regardless of whether it shows
> that the assembly implementation is better than the C one or not. 

That is true to some extend, but lets assume that we check this patch in
other way and commit your benchmark. Do you thing that next time when
there comes a patch touching rdlocks somebody will do a comparison if
nobody did it first time?

> In fact you argued in your patch proposal that the benchmark itself is
> orthogonal to Andi's patch.  Did you change your mind about that?
> 
Its orthogonal what benchmark we use. However as Carlos said that he
wants a benchtest before it could be commited we should settle for
something.

> > Will one of you stop bikesheding and say that there is really no difference
> > between assembly and c rwlock implementation or will you keep arguing
> > about name of benchmark for next month?
> 
> I like my benchmark name in italics :)
> 
> Siddhesh
> -- 
> http://siddhesh.in
Ondřej Bílka June 5, 2014, 8:02 p.m. UTC | #9
On Thu, Jun 05, 2014 at 06:07:01PM +0530, Siddhesh Poyarekar wrote:
> On Sat, May 03, 2014 at 12:48:31PM +0200, Ondřej Bílka wrote:
> > Its nice to have but orthogonal on proposed change, so it should not
> > block this patch
> 
> This looks like something that could easily use the $bench-inputs
> format since you're really just looking for mean times.  I haven't
> checked to see if a failed branch would be expensive compared to the
> lock/unlock cycle we're testing, but I'm guessing that it should get
> cached and hence not have an impact since we don't interleave the
> rdlock and wrlock calls.
> 
I assume that you asked about branch misprediction, that should not be
problem here, there are two ways how that it could happen. First one is
that it was not cache but then this is in cold path so it is not that
interesting. Second possibility is that we took different path previous
time, this is also not that interesting because previous call went to
slow path which was order of magnitude slower so its contribution is
tiny.

A third case where misprediction can happen is that there are parallel
readers. That would need a different benchmark.
Siddhesh Poyarekar June 5, 2014, 9:25 p.m. UTC | #10
On 6 June 2014 01:24, Ondřej Bílka <neleai@seznam.cz> wrote:
>>   "pthread_rwlock_test": {
>>    "rwlock": {
>>     "duration": 2.87958e+09,
>>     "iterations": 2.7688e+07,
>>     "max": 172.944,
>>     "min": 97.516,
>>     "mean": 104.001
>>    },
>>    "rdlock": {
>>     "duration": 2.88252e+09,
>>     "iterations": 2.7022e+07,
>>     "max": 252.66,
>>     "min": 101.541,
>>     "mean": 106.673
>>    }
>>   }
>>
> It is that you must always look for ways that could make benchmark
> invalid.
> Also I noticed that this takes long to run, did you also noticed that?

The standard benchmarks run only for BENCH_DURATION time and the above
was for 1 second, so I'm not sure how I should be concluding whether
it 'takes long' or not.

> That is true to some extend, but lets assume that we check this patch in
> other way and commit your benchmark. Do you thing that next time when
> there comes a patch touching rdlocks somebody will do a comparison if
> nobody did it first time?

Why not?  That is the point of the microbenchmarks - if there is a
significant change in a function and there's a microbenchmark for it,
the submitter should post the before and after results of the change.
Since we don't have any disagreement over the approach and the
generated code also is not a problem (as you pointed out in your other
response) there should be no problem with getting the benchmark in and
letting Andi use it to compare the before and after results.

I don't even mind waiting for Andi to revert with results before
committing the benchmark if that's what you want.

Siddhesh

Patch
diff mbox

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 63a5a7f..d74de80 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -25,7 +25,7 @@  include ../Makeconfig
 bench-math := acos acosh asin asinh atan atanh cos cosh exp exp2 ffs ffsll \
 	      log log2 modf pow rint sin sincos sinh sqrt tan tanh
 
-bench-pthread := pthread_once
+bench-pthread := pthread_once pthread_rwlock_test
 
 bench := $(bench-math) $(bench-pthread)
 
diff --git a/benchtests/pthread_rwlock_test-inputs b/benchtests/pthread_rwlock_test-inputs
new file mode 100644
index 0000000..9122afb
--- /dev/null
+++ b/benchtests/pthread_rwlock_test-inputs
@@ -0,0 +1,8 @@ 
+## args: int
+## includes: pthread.h
+## include-sources: pthread_rwlock_test-source.c
+
+##name: rdlock
+0
+##name: rwlock
+1
diff --git a/benchtests/pthread_rwlock_test-source.c b/benchtests/pthread_rwlock_test-source.c
new file mode 100644
index 0000000..c679f7b
--- /dev/null
+++ b/benchtests/pthread_rwlock_test-source.c
@@ -0,0 +1,13 @@ 
+
+static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
+
+static void
+pthread_rwlock_test (int write)
+{
+  if (write)
+    pthread_rwlock_wrlock (&rwlock);
+  else
+    pthread_rwlock_rdlock (&rwlock);
+
+  pthread_rwlock_unlock (&rwlock);
+}