Patchwork [v6,04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

login
register
mail settings
Submitter Srivatsa S. Bhat
Date Feb. 18, 2013, 12:38 p.m.
Message ID <20130218123856.26245.46705.stgit@srivatsabhat.in.ibm.com>
Download mbox | patch
Permalink /patch/221282/
State Not Applicable
Headers show

Comments

Srivatsa S. Bhat - Feb. 18, 2013, 12:38 p.m.
Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
lock-ordering related problems (unlike per-cpu locks). However, global
rwlocks lead to unnecessary cache-line bouncing even when there are no
writers present, which can slow down the system needlessly.

Per-cpu counters can help solve the cache-line bouncing problem. So we
actually use the best of both: per-cpu counters (no-waiting) at the reader
side in the fast-path, and global rwlocks in the slowpath.

[ Fastpath = no writer is active; Slowpath = a writer is active ]

IOW, the readers just increment/decrement their per-cpu refcounts (disabling
interrupts during the updates, if necessary) when no writer is active.
When a writer becomes active, he signals all readers to switch to global
rwlocks for the duration of his activity. The readers switch over when it
is safe for them (ie., when they are about to start a fresh, non-nested
read-side critical section) and start using (holding) the global rwlock for
read in their subsequent critical sections.

The writer waits for every existing reader to switch, and then acquires the
global rwlock for write and enters his critical section. Later, the writer
signals all readers that he is done, and that they can go back to using their
per-cpu refcounts again.

Note that the lock-safety (despite the per-cpu scheme) comes from the fact
that the readers can *choose* _when_ to switch to rwlocks upon the writer's
signal. And the readers don't wait on anybody based on the per-cpu counters.
The only true synchronization that involves waiting at the reader-side in this
scheme, is the one arising from the global rwlock, which is safe from circular
locking dependency issues.

Reader-writer locks and per-cpu counters are recursive, so they can be
used in a nested fashion in the reader-path, which makes per-CPU rwlocks also
recursive. Also, this design of switching the synchronization scheme ensures
that you can safely nest and use these locks in a very flexible manner.

I'm indebted to Michael Wang and Xiao Guangrong for their numerous thoughtful
suggestions and ideas, which inspired and influenced many of the decisions in
this as well as previous designs. Thanks a lot Michael and Xiao!

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 lib/percpu-rwlock.c |  139 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 137 insertions(+), 2 deletions(-)
Michel Lespinasse - Feb. 18, 2013, 3:45 p.m.
Hi Srivasta,

I admit not having followed in detail the threads about the previous
iteration, so some of my comments may have been discussed already
before - apologies if that is the case.

On Mon, Feb 18, 2013 at 8:38 PM, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> Reader-writer locks and per-cpu counters are recursive, so they can be
> used in a nested fashion in the reader-path, which makes per-CPU rwlocks also
> recursive. Also, this design of switching the synchronization scheme ensures
> that you can safely nest and use these locks in a very flexible manner.

I like the general idea of switching between per-cpu and global
rwlocks as needed; however I dislike unfair locks, and especially
unfair recursive rwlocks.

If you look at rwlock_t, the main reason we haven't been able to
implement reader/writer fairness there is because tasklist_lock makes
use of the recursive nature of the rwlock_t read side. I'm worried
about introducing more lock usages that would make use of the same
property for your proposed lock.

I am fine with your proposal not implementing reader/writer fairness
from day 1, but I am worried about your proposal having a recursive
reader side. Or, to put it another way: if your proposal didn't have a
recursive reader side, and rwlock_t could somehow be changed to
implement reader/writer fairness, then this property could
automatically propagate into your proposed rwlock; but if anyone makes
use of the recursive nature of your proposal then implementing
reader/writer fairness later won't be as easy.

I see that the very next change in this series is talking about
acquiring the read side from interrupts, so it does look like you're
planning to make use of the recursive nature of the read side. I kinda
wish you didn't, as this is exactly replicating the design of
tasklist_lock which is IMO problematic. Your prior proposal of
disabling interrupts during the read side had other disadvantages, but
I think it was nice that it didn't rely on having a recursive read
side.

> +#define reader_yet_to_switch(pcpu_rwlock, cpu)                             \
> +       (ACCESS_ONCE(per_cpu_ptr((pcpu_rwlock)->rw_state, cpu)->reader_refcnt))
> +
> +#define reader_percpu_nesting_depth(pcpu_rwlock)                 \
> +       (__this_cpu_read((pcpu_rwlock)->rw_state->reader_refcnt))
> +
> +#define reader_uses_percpu_refcnt(pcpu_rwlock)                         \
> +                               reader_percpu_nesting_depth(pcpu_rwlock)
> +
> +#define reader_nested_percpu(pcpu_rwlock)                              \
> +                       (reader_percpu_nesting_depth(pcpu_rwlock) > 1)
> +
> +#define writer_active(pcpu_rwlock)                                     \
> +       (__this_cpu_read((pcpu_rwlock)->rw_state->writer_signal))

I'm personally not a fan of such one-line shorthand functions - I
think they tend to make the code harder to read instead of easier, as
one constantly has to refer to them to understand what's actually
going on.

>  void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
>  {
> +       unsigned int cpu;
> +
> +       /*
> +        * Tell all readers that a writer is becoming active, so that they
> +        * start switching over to the global rwlock.
> +        */
> +       for_each_possible_cpu(cpu)
> +               per_cpu_ptr(pcpu_rwlock->rw_state, cpu)->writer_signal = true;

I don't see anything preventing a race with the corresponding code in
percpu_write_unlock() that sets writer_signal back to false. Did I
miss something here ? It seems to me we don't have any guarantee that
all writer signals will be set to true at the end of the loop...
Srivatsa S. Bhat - Feb. 18, 2013, 4:21 p.m.
Hi Michel,

On 02/18/2013 09:15 PM, Michel Lespinasse wrote:
> Hi Srivasta,
> 
> I admit not having followed in detail the threads about the previous
> iteration, so some of my comments may have been discussed already
> before - apologies if that is the case.
> 
> On Mon, Feb 18, 2013 at 8:38 PM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> Reader-writer locks and per-cpu counters are recursive, so they can be
>> used in a nested fashion in the reader-path, which makes per-CPU rwlocks also
>> recursive. Also, this design of switching the synchronization scheme ensures
>> that you can safely nest and use these locks in a very flexible manner.
> 
> I like the general idea of switching between per-cpu and global
> rwlocks as needed; however I dislike unfair locks, and especially
> unfair recursive rwlocks.
> 
> If you look at rwlock_t, the main reason we haven't been able to
> implement reader/writer fairness there is because tasklist_lock makes
> use of the recursive nature of the rwlock_t read side. I'm worried
> about introducing more lock usages that would make use of the same
> property for your proposed lock.
> 
> I am fine with your proposal not implementing reader/writer fairness
> from day 1, but I am worried about your proposal having a recursive
> reader side. Or, to put it another way: if your proposal didn't have a
> recursive reader side, and rwlock_t could somehow be changed to
> implement reader/writer fairness, then this property could
> automatically propagate into your proposed rwlock; but if anyone makes
> use of the recursive nature of your proposal then implementing
> reader/writer fairness later won't be as easy.
>

Actually, we don't want reader/writer fairness in this particular case.
We want deadlock safety - and in this particular case, this is guaranteed
by the unfair nature of rwlocks today.

I understand that you want to make rwlocks fair. So, I am thinking of
going ahead with Tejun's proposal - implementing our own unfair locking
scheme inside percpu-rwlocks using atomic ops or something like that, and
being completely independent of rwlock_t. That way, you can easily go
ahead with making rwlocks fair without fear of breaking CPU hotplug.
However I would much prefer making that change to percpu-rwlocks as a
separate patchset, after this patchset goes in, so that we can also see
how well this unfair logic performs in practice.

And regarding recursive reader side,... the way I see it, having a
recursive reader side is a primary requirement in this case. The reason is
that the existing reader side (with stop_machine) uses preempt_disable(),
which is recursive. So our replacement also has to be recursive.
 
> I see that the very next change in this series is talking about
> acquiring the read side from interrupts, so it does look like you're
> planning to make use of the recursive nature of the read side.

Yes.. I don't think we can avoid that. Moreover, since we _want_ unfair
reader/writer semantics to allow flexible locking rules and guarantee
deadlock-safety, having a recursive reader side is not even an issue, IMHO.

> I kinda
> wish you didn't, as this is exactly replicating the design of
> tasklist_lock which is IMO problematic. Your prior proposal of
> disabling interrupts during the read side had other disadvantages, but
> I think it was nice that it didn't rely on having a recursive read
> side.
> 

We can have readers from non-interrupt contexts too, which depend on the
recursive property...

>> +#define reader_yet_to_switch(pcpu_rwlock, cpu)                             \
>> +       (ACCESS_ONCE(per_cpu_ptr((pcpu_rwlock)->rw_state, cpu)->reader_refcnt))
>> +
>> +#define reader_percpu_nesting_depth(pcpu_rwlock)                 \
>> +       (__this_cpu_read((pcpu_rwlock)->rw_state->reader_refcnt))
>> +
>> +#define reader_uses_percpu_refcnt(pcpu_rwlock)                         \
>> +                               reader_percpu_nesting_depth(pcpu_rwlock)
>> +
>> +#define reader_nested_percpu(pcpu_rwlock)                              \
>> +                       (reader_percpu_nesting_depth(pcpu_rwlock) > 1)
>> +
>> +#define writer_active(pcpu_rwlock)                                     \
>> +       (__this_cpu_read((pcpu_rwlock)->rw_state->writer_signal))
> 
> I'm personally not a fan of such one-line shorthand functions - I
> think they tend to make the code harder to read instead of easier, as
> one constantly has to refer to them to understand what's actually
> going on.
> 

I got rid of most of the helper functions in this version. But I would rather
prefer retaining the above ones, because they are unwieldy and long. And IMHO
the short-hand names are pretty descriptive, so you might not actually need
to keep referring to their implementations all the time.

>>  void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
>>  {
>> +       unsigned int cpu;
>> +
>> +       /*
>> +        * Tell all readers that a writer is becoming active, so that they
>> +        * start switching over to the global rwlock.
>> +        */
>> +       for_each_possible_cpu(cpu)
>> +               per_cpu_ptr(pcpu_rwlock->rw_state, cpu)->writer_signal = true;
> 
> I don't see anything preventing a race with the corresponding code in
> percpu_write_unlock() that sets writer_signal back to false. Did I
> miss something here ? It seems to me we don't have any guarantee that
> all writer signals will be set to true at the end of the loop...
> 

Ah, thanks for pointing that out! IIRC Oleg had pointed this issue in the last
version, but back then, I hadn't fully understood what he meant. Your
explanation made it clear. I'll work on fixing this.

Thanks a lot for your review Michel!

Regards,
Srivatsa S. Bhat
Steven Rostedt - Feb. 18, 2013, 4:31 p.m.
On Mon, 2013-02-18 at 21:51 +0530, Srivatsa S. Bhat wrote:
> Hi Michel,

> Yes.. I don't think we can avoid that. Moreover, since we _want_ unfair
> reader/writer semantics to allow flexible locking rules and guarantee
> deadlock-safety, having a recursive reader side is not even an issue, IMHO.

Recursive unfair reader lock may guarantee deadlock-safety, but
remember, it adds a higher probability of live-locking the write_lock.
Which is another argument to keep this separate to cpu hotplug only.

-- Steve
Srivatsa S. Bhat - Feb. 18, 2013, 4:46 p.m.
On 02/18/2013 10:01 PM, Steven Rostedt wrote:
> On Mon, 2013-02-18 at 21:51 +0530, Srivatsa S. Bhat wrote:
>> Hi Michel,
> 
>> Yes.. I don't think we can avoid that. Moreover, since we _want_ unfair
>> reader/writer semantics to allow flexible locking rules and guarantee
>> deadlock-safety, having a recursive reader side is not even an issue, IMHO.
> 
> Recursive unfair reader lock may guarantee deadlock-safety, but
> remember, it adds a higher probability of live-locking the write_lock.
> Which is another argument to keep this separate to cpu hotplug only.
> 

True.
 
Regards,
Srivatsa S. Bhat
Lai Jiangshan - Feb. 26, 2013, 2:17 p.m.
On Mon, Feb 18, 2013 at 8:38 PM, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
> lock-ordering related problems (unlike per-cpu locks). However, global
> rwlocks lead to unnecessary cache-line bouncing even when there are no
> writers present, which can slow down the system needlessly.

per-CPU rwlocks(yours and mine) are the exactly same as rwlock_t in
the view of lock dependency(except reader-C.S. can be nested in
writer-C.S.)
so they can deadlock in this order:

spin_lock(some_lock);				percpu_write_lock_irqsave()
						case CPU_DYING
percpu_read_lock_irqsafe();	<---deadlock--->	spin_lock(some_lock);

The lockdep can find out such dependency, but we must try our best to
find out them before merge the patchset to mainline. We can  review
all the code of cpu_disable() and CPU_DYING and fix this kinds of lock
dependency, but it is not easy thing, it may be a long term project.

======
And if there is any CPU_DYING code takes no locks and do some
works(because they know they are called via stop_machine()) we need to
add that locking code back if there is such code.(I don't know whether
such code exist or not)

>
> Per-cpu counters can help solve the cache-line bouncing problem. So we
> actually use the best of both: per-cpu counters (no-waiting) at the reader
> side in the fast-path, and global rwlocks in the slowpath.
>
> [ Fastpath = no writer is active; Slowpath = a writer is active ]
>
> IOW, the readers just increment/decrement their per-cpu refcounts (disabling
> interrupts during the updates, if necessary) when no writer is active.
> When a writer becomes active, he signals all readers to switch to global
> rwlocks for the duration of his activity. The readers switch over when it
> is safe for them (ie., when they are about to start a fresh, non-nested
> read-side critical section) and start using (holding) the global rwlock for
> read in their subsequent critical sections.
>
> The writer waits for every existing reader to switch, and then acquires the
> global rwlock for write and enters his critical section. Later, the writer
> signals all readers that he is done, and that they can go back to using their
> per-cpu refcounts again.
>
> Note that the lock-safety (despite the per-cpu scheme) comes from the fact
> that the readers can *choose* _when_ to switch to rwlocks upon the writer's
> signal. And the readers don't wait on anybody based on the per-cpu counters.
> The only true synchronization that involves waiting at the reader-side in this
> scheme, is the one arising from the global rwlock, which is safe from circular
> locking dependency issues.
>
> Reader-writer locks and per-cpu counters are recursive, so they can be
> used in a nested fashion in the reader-path, which makes per-CPU rwlocks also
> recursive. Also, this design of switching the synchronization scheme ensures
> that you can safely nest and use these locks in a very flexible manner.
>
> I'm indebted to Michael Wang and Xiao Guangrong for their numerous thoughtful
> suggestions and ideas, which inspired and influenced many of the decisions in
> this as well as previous designs. Thanks a lot Michael and Xiao!
>
> Cc: David Howells <dhowells@redhat.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
>
>  lib/percpu-rwlock.c |  139 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 137 insertions(+), 2 deletions(-)
>
> diff --git a/lib/percpu-rwlock.c b/lib/percpu-rwlock.c
> index f938096..edefdea 100644
> --- a/lib/percpu-rwlock.c
> +++ b/lib/percpu-rwlock.c
> @@ -27,6 +27,24 @@
>  #include <linux/percpu-rwlock.h>
>  #include <linux/errno.h>
>
> +#include <asm/processor.h>
> +
> +
> +#define reader_yet_to_switch(pcpu_rwlock, cpu)                             \
> +       (ACCESS_ONCE(per_cpu_ptr((pcpu_rwlock)->rw_state, cpu)->reader_refcnt))
> +
> +#define reader_percpu_nesting_depth(pcpu_rwlock)                 \
> +       (__this_cpu_read((pcpu_rwlock)->rw_state->reader_refcnt))
> +
> +#define reader_uses_percpu_refcnt(pcpu_rwlock)                         \
> +                               reader_percpu_nesting_depth(pcpu_rwlock)
> +
> +#define reader_nested_percpu(pcpu_rwlock)                              \
> +                       (reader_percpu_nesting_depth(pcpu_rwlock) > 1)
> +
> +#define writer_active(pcpu_rwlock)                                     \
> +       (__this_cpu_read((pcpu_rwlock)->rw_state->writer_signal))
> +
>
>  int __percpu_init_rwlock(struct percpu_rwlock *pcpu_rwlock,
>                          const char *name, struct lock_class_key *rwlock_key)
> @@ -55,21 +73,138 @@ void percpu_free_rwlock(struct percpu_rwlock *pcpu_rwlock)
>
>  void percpu_read_lock(struct percpu_rwlock *pcpu_rwlock)
>  {
> -       read_lock(&pcpu_rwlock->global_rwlock);
> +       preempt_disable();
> +
> +       /*
> +        * Let the writer know that a reader is active, even before we choose
> +        * our reader-side synchronization scheme.
> +        */
> +       this_cpu_inc(pcpu_rwlock->rw_state->reader_refcnt);
> +
> +       /*
> +        * If we are already using per-cpu refcounts, it is not safe to switch
> +        * the synchronization scheme. So continue using the refcounts.
> +        */
> +       if (reader_nested_percpu(pcpu_rwlock))
> +               return;
> +
> +       /*
> +        * The write to 'reader_refcnt' must be visible before we read
> +        * 'writer_signal'.
> +        */
> +       smp_mb();
> +
> +       if (likely(!writer_active(pcpu_rwlock))) {
> +               goto out;
> +       } else {
> +               /* Writer is active, so switch to global rwlock. */
> +               read_lock(&pcpu_rwlock->global_rwlock);
> +
> +               /*
> +                * We might have raced with a writer going inactive before we
> +                * took the read-lock. So re-evaluate whether we still need to
> +                * hold the rwlock or if we can switch back to per-cpu
> +                * refcounts. (This also helps avoid heterogeneous nesting of
> +                * readers).
> +                */
> +               if (writer_active(pcpu_rwlock)) {
> +                       /*
> +                        * The above writer_active() check can get reordered
> +                        * with this_cpu_dec() below, but this is OK, because
> +                        * holding the rwlock is conservative.
> +                        */
> +                       this_cpu_dec(pcpu_rwlock->rw_state->reader_refcnt);
> +               } else {
> +                       read_unlock(&pcpu_rwlock->global_rwlock);
> +               }
> +       }
> +
> +out:
> +       /* Prevent reordering of any subsequent reads/writes */
> +       smp_mb();
>  }
>
>  void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock)
>  {
> -       read_unlock(&pcpu_rwlock->global_rwlock);
> +       /*
> +        * We never allow heterogeneous nesting of readers. So it is trivial
> +        * to find out the kind of reader we are, and undo the operation
> +        * done by our corresponding percpu_read_lock().
> +        */
> +
> +       /* Try to fast-path: a nested percpu reader is the simplest case */
> +       if (reader_nested_percpu(pcpu_rwlock)) {
> +               this_cpu_dec(pcpu_rwlock->rw_state->reader_refcnt);
> +               preempt_enable();
> +               return;
> +       }
> +
> +       /*
> +        * Now we are left with only 2 options: a non-nested percpu reader,
> +        * or a reader holding rwlock
> +        */
> +       if (reader_uses_percpu_refcnt(pcpu_rwlock)) {
> +               /*
> +                * Complete the critical section before decrementing the
> +                * refcnt. We can optimize this away if we are a nested
> +                * reader (the case above).
> +                */
> +               smp_mb();
> +               this_cpu_dec(pcpu_rwlock->rw_state->reader_refcnt);
> +       } else {
> +               read_unlock(&pcpu_rwlock->global_rwlock);
> +       }
> +
> +       preempt_enable();
>  }
>
>  void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
>  {
> +       unsigned int cpu;
> +
> +       /*
> +        * Tell all readers that a writer is becoming active, so that they
> +        * start switching over to the global rwlock.
> +        */
> +       for_each_possible_cpu(cpu)
> +               per_cpu_ptr(pcpu_rwlock->rw_state, cpu)->writer_signal = true;
> +
> +       smp_mb();
> +
> +       /*
> +        * Wait for every reader to see the writer's signal and switch from
> +        * percpu refcounts to global rwlock.
> +        *
> +        * If a reader is still using percpu refcounts, wait for him to switch.
> +        * Else, we can safely go ahead, because either the reader has already
> +        * switched over, or the next reader that comes along on that CPU will
> +        * notice the writer's signal and will switch over to the rwlock.
> +        */
> +
> +       for_each_possible_cpu(cpu) {
> +               while (reader_yet_to_switch(pcpu_rwlock, cpu))
> +                       cpu_relax();
> +       }
> +
> +       smp_mb(); /* Complete the wait-for-readers, before taking the lock */
>         write_lock(&pcpu_rwlock->global_rwlock);
>  }
>
>  void percpu_write_unlock(struct percpu_rwlock *pcpu_rwlock)
>  {
> +       unsigned int cpu;
> +
> +       /* Complete the critical section before clearing ->writer_signal */
> +       smp_mb();
> +
> +       /*
> +        * Inform all readers that we are done, so that they can switch back
> +        * to their per-cpu refcounts. (We don't need to wait for them to
> +        * see it).
> +        */
> +       for_each_possible_cpu(cpu)
> +               per_cpu_ptr(pcpu_rwlock->rw_state, cpu)->writer_signal = false;
> +
>         write_unlock(&pcpu_rwlock->global_rwlock);
>  }
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
Srivatsa S. Bhat - Feb. 26, 2013, 2:37 p.m.
On 02/26/2013 07:47 PM, Lai Jiangshan wrote:
> On Mon, Feb 18, 2013 at 8:38 PM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
>> lock-ordering related problems (unlike per-cpu locks). However, global
>> rwlocks lead to unnecessary cache-line bouncing even when there are no
>> writers present, which can slow down the system needlessly.
> 
> per-CPU rwlocks(yours and mine) are the exactly same as rwlock_t in
> the view of lock dependency(except reader-C.S. can be nested in
> writer-C.S.)
> so they can deadlock in this order:
> 
> spin_lock(some_lock);				percpu_write_lock_irqsave()
> 						case CPU_DYING
> percpu_read_lock_irqsafe();	<---deadlock--->	spin_lock(some_lock);
> 

Yes, of course! But this is the most straight-forward of cases to fix,
because there are a well-defined number of CPU_DYING notifiers, and the fix
is to make the lock ordering same at both sides.

The real challenge is with cases like below:

spin_lock(some_lock)			percpu_read_lock_irqsafe()

percpu_read_lock_irqsafe()		spin_lock(some_lock)

The locking primitives (percpu_read_lock/unlock_irqsafe) have been explicitly
designed to keep cases like the above deadlock-free. Because, they ease
conversions from preempt_disable() to the new locking primitive without
lock-ordering headache.

> The lockdep can find out such dependency, but we must try our best to
> find out them before merge the patchset to mainline. We can  review
> all the code of cpu_disable() and CPU_DYING and fix this kinds of lock
> dependency, but it is not easy thing, it may be a long term project.
> 

:-)

That's exactly what I have done in this patchset, and that's why there
are 46 patches in this series! :-) I have run this patchset with lockdep
turned on, and verified that there are no locking problems due to the
conversion. I even posted out performance numbers from this patchset
(ie., in comparison to stop_machine), if you are curious...
http://article.gmane.org/gmane.linux.kernel/1435249

But yes, I'll have to re-verify this because of the new code that went in
during this merge window.

> ======
> And if there is any CPU_DYING code takes no locks and do some
> works(because they know they are called via stop_machine()) we need to
> add that locking code back if there is such code.(I don't know whether
> such code exist or not)
>

Yes, I explicitly verified this too. (I had mentioned this in the cover letter).

Regards,
Srivatsa S. Bhat

Patch

diff --git a/lib/percpu-rwlock.c b/lib/percpu-rwlock.c
index f938096..edefdea 100644
--- a/lib/percpu-rwlock.c
+++ b/lib/percpu-rwlock.c
@@ -27,6 +27,24 @@ 
 #include <linux/percpu-rwlock.h>
 #include <linux/errno.h>
 
+#include <asm/processor.h>
+
+
+#define reader_yet_to_switch(pcpu_rwlock, cpu)				    \
+	(ACCESS_ONCE(per_cpu_ptr((pcpu_rwlock)->rw_state, cpu)->reader_refcnt))
+
+#define reader_percpu_nesting_depth(pcpu_rwlock)		  \
+	(__this_cpu_read((pcpu_rwlock)->rw_state->reader_refcnt))
+
+#define reader_uses_percpu_refcnt(pcpu_rwlock)				\
+				reader_percpu_nesting_depth(pcpu_rwlock)
+
+#define reader_nested_percpu(pcpu_rwlock)				\
+			(reader_percpu_nesting_depth(pcpu_rwlock) > 1)
+
+#define writer_active(pcpu_rwlock)					\
+	(__this_cpu_read((pcpu_rwlock)->rw_state->writer_signal))
+
 
 int __percpu_init_rwlock(struct percpu_rwlock *pcpu_rwlock,
 			 const char *name, struct lock_class_key *rwlock_key)
@@ -55,21 +73,138 @@  void percpu_free_rwlock(struct percpu_rwlock *pcpu_rwlock)
 
 void percpu_read_lock(struct percpu_rwlock *pcpu_rwlock)
 {
-	read_lock(&pcpu_rwlock->global_rwlock);
+	preempt_disable();
+
+	/*
+	 * Let the writer know that a reader is active, even before we choose
+	 * our reader-side synchronization scheme.
+	 */
+	this_cpu_inc(pcpu_rwlock->rw_state->reader_refcnt);
+
+	/*
+	 * If we are already using per-cpu refcounts, it is not safe to switch
+	 * the synchronization scheme. So continue using the refcounts.
+	 */
+	if (reader_nested_percpu(pcpu_rwlock))
+		return;
+
+	/*
+	 * The write to 'reader_refcnt' must be visible before we read
+	 * 'writer_signal'.
+	 */
+	smp_mb();
+
+	if (likely(!writer_active(pcpu_rwlock))) {
+		goto out;
+	} else {
+		/* Writer is active, so switch to global rwlock. */
+		read_lock(&pcpu_rwlock->global_rwlock);
+
+		/*
+		 * We might have raced with a writer going inactive before we
+		 * took the read-lock. So re-evaluate whether we still need to
+		 * hold the rwlock or if we can switch back to per-cpu
+		 * refcounts. (This also helps avoid heterogeneous nesting of
+		 * readers).
+		 */
+		if (writer_active(pcpu_rwlock)) {
+			/*
+			 * The above writer_active() check can get reordered
+			 * with this_cpu_dec() below, but this is OK, because
+			 * holding the rwlock is conservative.
+			 */
+			this_cpu_dec(pcpu_rwlock->rw_state->reader_refcnt);
+		} else {
+			read_unlock(&pcpu_rwlock->global_rwlock);
+		}
+	}
+
+out:
+	/* Prevent reordering of any subsequent reads/writes */
+	smp_mb();
 }
 
 void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock)
 {
-	read_unlock(&pcpu_rwlock->global_rwlock);
+	/*
+	 * We never allow heterogeneous nesting of readers. So it is trivial
+	 * to find out the kind of reader we are, and undo the operation
+	 * done by our corresponding percpu_read_lock().
+	 */
+
+	/* Try to fast-path: a nested percpu reader is the simplest case */
+	if (reader_nested_percpu(pcpu_rwlock)) {
+		this_cpu_dec(pcpu_rwlock->rw_state->reader_refcnt);
+		preempt_enable();
+		return;
+	}
+
+	/*
+	 * Now we are left with only 2 options: a non-nested percpu reader,
+	 * or a reader holding rwlock
+	 */
+	if (reader_uses_percpu_refcnt(pcpu_rwlock)) {
+		/*
+		 * Complete the critical section before decrementing the
+		 * refcnt. We can optimize this away if we are a nested
+		 * reader (the case above).
+		 */
+		smp_mb();
+		this_cpu_dec(pcpu_rwlock->rw_state->reader_refcnt);
+	} else {
+		read_unlock(&pcpu_rwlock->global_rwlock);
+	}
+
+	preempt_enable();
 }
 
 void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
 {
+	unsigned int cpu;
+
+	/*
+	 * Tell all readers that a writer is becoming active, so that they
+	 * start switching over to the global rwlock.
+	 */
+	for_each_possible_cpu(cpu)
+		per_cpu_ptr(pcpu_rwlock->rw_state, cpu)->writer_signal = true;
+
+	smp_mb();
+
+	/*
+	 * Wait for every reader to see the writer's signal and switch from
+	 * percpu refcounts to global rwlock.
+	 *
+	 * If a reader is still using percpu refcounts, wait for him to switch.
+	 * Else, we can safely go ahead, because either the reader has already
+	 * switched over, or the next reader that comes along on that CPU will
+	 * notice the writer's signal and will switch over to the rwlock.
+	 */
+
+	for_each_possible_cpu(cpu) {
+		while (reader_yet_to_switch(pcpu_rwlock, cpu))
+			cpu_relax();
+	}
+
+	smp_mb(); /* Complete the wait-for-readers, before taking the lock */
 	write_lock(&pcpu_rwlock->global_rwlock);
 }
 
 void percpu_write_unlock(struct percpu_rwlock *pcpu_rwlock)
 {
+	unsigned int cpu;
+
+	/* Complete the critical section before clearing ->writer_signal */
+	smp_mb();
+
+	/*
+	 * Inform all readers that we are done, so that they can switch back
+	 * to their per-cpu refcounts. (We don't need to wait for them to
+	 * see it).
+	 */
+	for_each_possible_cpu(cpu)
+		per_cpu_ptr(pcpu_rwlock->rw_state, cpu)->writer_signal = false;
+
 	write_unlock(&pcpu_rwlock->global_rwlock);
 }