Patchwork [v5,04/45] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

login
register
mail settings
Submitter Srivatsa S. Bhat
Date Jan. 22, 2013, 7:33 a.m.
Message ID <20130122073347.13822.85876.stgit@srivatsabhat.in.ibm.com>
Download mbox | patch
Permalink /patch/214368/
State Not Applicable
Headers show

Comments

Srivatsa S. Bhat - Jan. 22, 2013, 7:33 a.m.
Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
lock-ordering related problems (unlike per-cpu locks). However, global
rwlocks lead to unnecessary cache-line bouncing even when there are no
writers present, which can slow down the system needlessly.

Per-cpu counters can help solve the cache-line bouncing problem. So we
actually use the best of both: per-cpu counters (no-waiting) at the reader
side in the fast-path, and global rwlocks in the slowpath.

[ Fastpath = no writer is active; Slowpath = a writer is active ]

IOW, the readers just increment/decrement their per-cpu refcounts (disabling
interrupts during the updates, if necessary) when no writer is active.
When a writer becomes active, he signals all readers to switch to global
rwlocks for the duration of his activity. The readers switch over when it
is safe for them (ie., when they are about to start a fresh, non-nested
read-side critical section) and start using (holding) the global rwlock for
read in their subsequent critical sections.

The writer waits for every existing reader to switch, and then acquires the
global rwlock for write and enters his critical section. Later, the writer
signals all readers that he is done, and that they can go back to using their
per-cpu refcounts again.

Note that the lock-safety (despite the per-cpu scheme) comes from the fact
that the readers can *choose* _when_ to switch to rwlocks upon the writer's
signal. And the readers don't wait on anybody based on the per-cpu counters.
The only true synchronization that involves waiting at the reader-side in this
scheme, is the one arising from the global rwlock, which is safe from circular
locking dependency issues.

Reader-writer locks and per-cpu counters are recursive, so they can be
used in a nested fashion in the reader-path, which makes per-CPU rwlocks also
recursive. Also, this design of switching the synchronization scheme ensures
that you can safely nest and use these locks in a very flexible manner.

I'm indebted to Michael Wang and Xiao Guangrong for their numerous thoughtful
suggestions and ideas, which inspired and influenced many of the decisions in
this as well as previous designs. Thanks a lot Michael and Xiao!

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/percpu-rwlock.h |   10 +++
 lib/percpu-rwlock.c           |  128 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 136 insertions(+), 2 deletions(-)
Tejun Heo - Jan. 23, 2013, 6:55 p.m.
Hello, Srivatsa.

First of all, I'm not sure whether we need to be this step-by-step
when introducing something new.  It's not like we're transforming an
existing implementation and it doesn't seem to help understanding the
series that much either.

On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote:
> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
> lock-ordering related problems (unlike per-cpu locks). However, global

So, unfortunately, this already seems broken, right?  The problem here
seems to be that previously, say, read_lock() implied
preempt_disable() but as this series aims to move away from it, it
introduces the problem of locking order between such locks and the new
contruct.

The only two options are either punishing writers or identifying and
updating all such possible deadlocks.  percpu_rwsem does the former,
right?  I don't know how feasible the latter would be.  Srivatsa,
you've been looking at all the places which would require conversion,
how difficult would doing the latter be?

> +#define reader_uses_percpu_refcnt(pcpu_rwlock, cpu)			\
> +		(ACCESS_ONCE(per_cpu(*((pcpu_rwlock)->reader_refcnt), cpu)))
> +
> +#define reader_nested_percpu(pcpu_rwlock)				\
> +			(__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) > 1)
> +
> +#define writer_active(pcpu_rwlock)					\
> +			(__this_cpu_read(*((pcpu_rwlock)->writer_signal)))

Why are these in the public header file?  Are they gonna be used to
inline something?

> +static inline void raise_writer_signal(struct percpu_rwlock *pcpu_rwlock,
> +				       unsigned int cpu)
> +{
> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = true;
> +}
> +
> +static inline void drop_writer_signal(struct percpu_rwlock *pcpu_rwlock,
> +				      unsigned int cpu)
> +{
> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = false;
> +}
> +
> +static void announce_writer_active(struct percpu_rwlock *pcpu_rwlock)
> +{
> +	unsigned int cpu;
> +
> +	for_each_online_cpu(cpu)
> +		raise_writer_signal(pcpu_rwlock, cpu);
> +
> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
> +}
> +
> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
> +{
> +	unsigned int cpu;
> +
> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
> +
> +	for_each_online_cpu(cpu)
> +		drop_writer_signal(pcpu_rwlock, cpu);
> +
> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
> +}

It could be just personal preference but I find the above one line
wrappers more obfuscating than anything else.  What's the point of
wrapping writer_signal = true/false into a separate function?  These
simple wrappers just add layers that people have to dig through to
figure out what's going on without adding anything of value.  I'd much
prefer collapsing these into the percpu_write_[un]lock().

Thanks.
Srivatsa S. Bhat - Jan. 23, 2013, 7:33 p.m.
On 01/24/2013 12:25 AM, Tejun Heo wrote:
> Hello, Srivatsa.
> 
> First of all, I'm not sure whether we need to be this step-by-step
> when introducing something new.  It's not like we're transforming an
> existing implementation and it doesn't seem to help understanding the
> series that much either.
> 

Hmm.. I split it up into steps to help explain the reasoning behind
the code sufficiently, rather than spring all of the intricacies at
one go (which would make it very hard to write the changelog/comments
also). The split made it easier for me to document it well in the
changelog, because I could deal with reasonable chunks of code/complexity
at a time. IMHO that helps people reading it for the first time to
understand the logic easily.

> On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote:
>> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
>> lock-ordering related problems (unlike per-cpu locks). However, global
> 
> So, unfortunately, this already seems broken, right?  The problem here
> seems to be that previously, say, read_lock() implied
> preempt_disable() but as this series aims to move away from it, it
> introduces the problem of locking order between such locks and the new
> contruct.
>

Not sure I got your point correctly. Are you referring to Steve's comment
that rwlocks are probably fair now (and hence not really safe when used
like this)? If yes, I haven't actually verified that yet, but yes, that
will make this hard to use, since we need to take care of locking rules.

But suppose rwlocks are unfair (as I had assumed them to be), then we
have absolutely no problems and no lock-ordering to worry about.
 
> The only two options are either punishing writers or identifying and
> updating all such possible deadlocks.  percpu_rwsem does the former,
> right?  I don't know how feasible the latter would be.

I don't think we can avoid looking into all the possible deadlocks,
as long as we use rwlocks inside get/put_online_cpus_atomic() (assuming
rwlocks are fair). Even with Oleg's idea of using synchronize_sched()
at the writer, we still need to take care of locking rules, because the
synchronize_sched() only helps avoid the memory barriers at the reader,
and doesn't help get rid of the rwlocks themselves.
So in short, I don't see how we can punish the writers and thereby somehow
avoid looking into possible deadlocks (if rwlocks are fair).

>  Srivatsa,
> you've been looking at all the places which would require conversion,
> how difficult would doing the latter be?
> 

The problem is that some APIs like smp_call_function() will need to use
get/put_online_cpus_atomic(). That is when the locking becomes tricky
in the subsystem which invokes these APIs with other (subsystem-specific,
internal) locks held. So we could potentially use a convention such as
"Make get/put_online_cpus_atomic() your outer-most calls, within which
you nest the other locks" to rule out all ABBA deadlock possibilities...
But we might still hit some hard-to-convert places.. 

BTW, Steve, fair rwlocks doesn't mean the following scenario will result
in a deadlock right?

CPU 0                          CPU 1

read_lock(&rwlock)

                              write_lock(&rwlock) //spins, because CPU 0
                              //has acquired the lock for read

read_lock(&rwlock)
   ^^^^^
What happens here? Does CPU 0 start spinning (and hence deadlock) or will
it continue realizing that it already holds the rwlock for read?

If the above ends in a deadlock, then its next to impossible to convert
all the places safely (because the above mentioned convention will simply
fall apart).

>> +#define reader_uses_percpu_refcnt(pcpu_rwlock, cpu)			\
>> +		(ACCESS_ONCE(per_cpu(*((pcpu_rwlock)->reader_refcnt), cpu)))
>> +
>> +#define reader_nested_percpu(pcpu_rwlock)				\
>> +			(__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) > 1)
>> +
>> +#define writer_active(pcpu_rwlock)					\
>> +			(__this_cpu_read(*((pcpu_rwlock)->writer_signal)))
> 
> Why are these in the public header file?  Are they gonna be used to
> inline something?
>

No, I can put it in the .c file itself. Will do.
 
>> +static inline void raise_writer_signal(struct percpu_rwlock *pcpu_rwlock,
>> +				       unsigned int cpu)
>> +{
>> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = true;
>> +}
>> +
>> +static inline void drop_writer_signal(struct percpu_rwlock *pcpu_rwlock,
>> +				      unsigned int cpu)
>> +{
>> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = false;
>> +}
>> +
>> +static void announce_writer_active(struct percpu_rwlock *pcpu_rwlock)
>> +{
>> +	unsigned int cpu;
>> +
>> +	for_each_online_cpu(cpu)
>> +		raise_writer_signal(pcpu_rwlock, cpu);
>> +
>> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
>> +}
>> +
>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
>> +{
>> +	unsigned int cpu;
>> +
>> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
>> +
>> +	for_each_online_cpu(cpu)
>> +		drop_writer_signal(pcpu_rwlock, cpu);
>> +
>> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
>> +}
> 
> It could be just personal preference but I find the above one line
> wrappers more obfuscating than anything else.  What's the point of
> wrapping writer_signal = true/false into a separate function?  These
> simple wrappers just add layers that people have to dig through to
> figure out what's going on without adding anything of value.  I'd much
> prefer collapsing these into the percpu_write_[un]lock().
>

Sure, I see your point. I'll change that.

Thanks a lot for your feedback Tejun!

Regards,
Srivatsa S. Bhat
Tejun Heo - Jan. 23, 2013, 7:57 p.m.
Hello, Srivatsa.

On Thu, Jan 24, 2013 at 01:03:52AM +0530, Srivatsa S. Bhat wrote:
> Hmm.. I split it up into steps to help explain the reasoning behind
> the code sufficiently, rather than spring all of the intricacies at
> one go (which would make it very hard to write the changelog/comments
> also). The split made it easier for me to document it well in the
> changelog, because I could deal with reasonable chunks of code/complexity
> at a time. IMHO that helps people reading it for the first time to
> understand the logic easily.

I don't know.  It's a judgement call I guess.  I personally would much
prefer having ample documentation as comments in the source itself or
as a separate Documentation/ file as that's what most people are gonna
be looking at to figure out what's going on.  Maybe just compact it a
bit and add more in-line documentation instead?

> > The only two options are either punishing writers or identifying and
> > updating all such possible deadlocks.  percpu_rwsem does the former,
> > right?  I don't know how feasible the latter would be.
> 
> I don't think we can avoid looking into all the possible deadlocks,
> as long as we use rwlocks inside get/put_online_cpus_atomic() (assuming
> rwlocks are fair). Even with Oleg's idea of using synchronize_sched()
> at the writer, we still need to take care of locking rules, because the
> synchronize_sched() only helps avoid the memory barriers at the reader,
> and doesn't help get rid of the rwlocks themselves.

Well, percpu_rwlock don't have to use rwlock for the slow path.  It
can implement its own writer starving locking scheme.  It's not like
implementing slow path global rwlock logic is difficult.

> CPU 0                          CPU 1
> 
> read_lock(&rwlock)
> 
>                               write_lock(&rwlock) //spins, because CPU 0
>                               //has acquired the lock for read
> 
> read_lock(&rwlock)
>    ^^^^^
> What happens here? Does CPU 0 start spinning (and hence deadlock) or will
> it continue realizing that it already holds the rwlock for read?

I don't think rwlock allows nesting write lock inside read lock.
read_lock(); write_lock() will always deadlock.

Thanks.
Srivatsa S. Bhat - Jan. 24, 2013, 4:30 a.m.
On 01/24/2013 01:27 AM, Tejun Heo wrote:
> Hello, Srivatsa.
> 
> On Thu, Jan 24, 2013 at 01:03:52AM +0530, Srivatsa S. Bhat wrote:
>> Hmm.. I split it up into steps to help explain the reasoning behind
>> the code sufficiently, rather than spring all of the intricacies at
>> one go (which would make it very hard to write the changelog/comments
>> also). The split made it easier for me to document it well in the
>> changelog, because I could deal with reasonable chunks of code/complexity
>> at a time. IMHO that helps people reading it for the first time to
>> understand the logic easily.
> 
> I don't know.  It's a judgement call I guess.  I personally would much
> prefer having ample documentation as comments in the source itself or
> as a separate Documentation/ file as that's what most people are gonna
> be looking at to figure out what's going on.  Maybe just compact it a
> bit and add more in-line documentation instead?
> 

OK, I'll think about this.

>>> The only two options are either punishing writers or identifying and
>>> updating all such possible deadlocks.  percpu_rwsem does the former,
>>> right?  I don't know how feasible the latter would be.
>>
>> I don't think we can avoid looking into all the possible deadlocks,
>> as long as we use rwlocks inside get/put_online_cpus_atomic() (assuming
>> rwlocks are fair). Even with Oleg's idea of using synchronize_sched()
>> at the writer, we still need to take care of locking rules, because the
>> synchronize_sched() only helps avoid the memory barriers at the reader,
>> and doesn't help get rid of the rwlocks themselves.
> 
> Well, percpu_rwlock don't have to use rwlock for the slow path.  It
> can implement its own writer starving locking scheme.  It's not like
> implementing slow path global rwlock logic is difficult.
>

Great idea! So probably I could use atomic ops or something similar in the
slow path to implement the scheme we need...

>> CPU 0                          CPU 1
>>
>> read_lock(&rwlock)
>>
>>                               write_lock(&rwlock) //spins, because CPU 0
>>                               //has acquired the lock for read
>>
>> read_lock(&rwlock)
>>    ^^^^^
>> What happens here? Does CPU 0 start spinning (and hence deadlock) or will
>> it continue realizing that it already holds the rwlock for read?
> 
> I don't think rwlock allows nesting write lock inside read lock.
> read_lock(); write_lock() will always deadlock.
> 

Sure, I understand that :-) My question was, what happens when *two* CPUs
are involved, as in, the read_lock() is invoked only on CPU 0 whereas the
write_lock() is invoked on CPU 1.

For example, the same scenario shown above, but with slightly different
timing, will NOT result in a deadlock:

Scenario 2:
  CPU 0                                CPU 1

read_lock(&rwlock)


read_lock(&rwlock) //doesn't spin

                                    write_lock(&rwlock) //spins, because CPU 0
                                    //has acquired the lock for read


So I was wondering whether the "fairness" logic of rwlocks would cause
the second read_lock() to spin (in the first scenario shown above) because
a writer is already waiting (and hence new readers should spin) and thus
cause a deadlock.

Regards,
Srivatsa S. Bhat
Namhyung Kim - Jan. 29, 2013, 11:12 a.m.
On Thu, 24 Jan 2013 10:00:04 +0530, Srivatsa S. Bhat wrote:
> On 01/24/2013 01:27 AM, Tejun Heo wrote:
>> On Thu, Jan 24, 2013 at 01:03:52AM +0530, Srivatsa S. Bhat wrote:
>>> CPU 0                          CPU 1
>>>
>>> read_lock(&rwlock)
>>>
>>>                               write_lock(&rwlock) //spins, because CPU 0
>>>                               //has acquired the lock for read
>>>
>>> read_lock(&rwlock)
>>>    ^^^^^
>>> What happens here? Does CPU 0 start spinning (and hence deadlock) or will
>>> it continue realizing that it already holds the rwlock for read?
>> 
>> I don't think rwlock allows nesting write lock inside read lock.
>> read_lock(); write_lock() will always deadlock.
>> 
>
> Sure, I understand that :-) My question was, what happens when *two* CPUs
> are involved, as in, the read_lock() is invoked only on CPU 0 whereas the
> write_lock() is invoked on CPU 1.
>
> For example, the same scenario shown above, but with slightly different
> timing, will NOT result in a deadlock:
>
> Scenario 2:
>   CPU 0                                CPU 1
>
> read_lock(&rwlock)
>
>
> read_lock(&rwlock) //doesn't spin
>
>                                     write_lock(&rwlock) //spins, because CPU 0
>                                     //has acquired the lock for read
>
>
> So I was wondering whether the "fairness" logic of rwlocks would cause
> the second read_lock() to spin (in the first scenario shown above) because
> a writer is already waiting (and hence new readers should spin) and thus
> cause a deadlock.

In my understanding, current x86 rwlock does basically this (of course,
in an atomic fashion):


#define RW_LOCK_BIAS 0x10000

rwlock_init(rwlock)
{
	rwlock->lock = RW_LOCK_BIAS;
}

arch_read_lock(rwlock)
{
retry:
	if (--rwlock->lock >= 0)
		return;

        rwlock->lock++;
        while (rwlock->lock < 1)
        	continue;

        goto retry;
}

arch_write_lock(rwlock)
{
retry:
	if ((rwlock->lock -= RW_LOCK_BIAS) == 0)
        	return;

        rwlock->lock += RW_LOCK_BIAS;
	while (rwlock->lock != RW_LOCK_BIAS)
		continue;

        goto retry;
}


So I can't find where the 'fairness' logic comes from..

Thanks,
Namhyung
Paul E. McKenney - Feb. 8, 2013, 10:47 p.m.
On Tue, Jan 29, 2013 at 08:12:37PM +0900, Namhyung Kim wrote:
> On Thu, 24 Jan 2013 10:00:04 +0530, Srivatsa S. Bhat wrote:
> > On 01/24/2013 01:27 AM, Tejun Heo wrote:
> >> On Thu, Jan 24, 2013 at 01:03:52AM +0530, Srivatsa S. Bhat wrote:
> >>> CPU 0                          CPU 1
> >>>
> >>> read_lock(&rwlock)
> >>>
> >>>                               write_lock(&rwlock) //spins, because CPU 0
> >>>                               //has acquired the lock for read
> >>>
> >>> read_lock(&rwlock)
> >>>    ^^^^^
> >>> What happens here? Does CPU 0 start spinning (and hence deadlock) or will
> >>> it continue realizing that it already holds the rwlock for read?
> >> 
> >> I don't think rwlock allows nesting write lock inside read lock.
> >> read_lock(); write_lock() will always deadlock.
> >> 
> >
> > Sure, I understand that :-) My question was, what happens when *two* CPUs
> > are involved, as in, the read_lock() is invoked only on CPU 0 whereas the
> > write_lock() is invoked on CPU 1.
> >
> > For example, the same scenario shown above, but with slightly different
> > timing, will NOT result in a deadlock:
> >
> > Scenario 2:
> >   CPU 0                                CPU 1
> >
> > read_lock(&rwlock)
> >
> >
> > read_lock(&rwlock) //doesn't spin
> >
> >                                     write_lock(&rwlock) //spins, because CPU 0
> >                                     //has acquired the lock for read
> >
> >
> > So I was wondering whether the "fairness" logic of rwlocks would cause
> > the second read_lock() to spin (in the first scenario shown above) because
> > a writer is already waiting (and hence new readers should spin) and thus
> > cause a deadlock.
> 
> In my understanding, current x86 rwlock does basically this (of course,
> in an atomic fashion):
> 
> 
> #define RW_LOCK_BIAS 0x10000
> 
> rwlock_init(rwlock)
> {
> 	rwlock->lock = RW_LOCK_BIAS;
> }
> 
> arch_read_lock(rwlock)
> {
> retry:
> 	if (--rwlock->lock >= 0)
> 		return;
> 
>         rwlock->lock++;
>         while (rwlock->lock < 1)
>         	continue;
> 
>         goto retry;
> }
> 
> arch_write_lock(rwlock)
> {
> retry:
> 	if ((rwlock->lock -= RW_LOCK_BIAS) == 0)
>         	return;
> 
>         rwlock->lock += RW_LOCK_BIAS;
> 	while (rwlock->lock != RW_LOCK_BIAS)
> 		continue;
> 
>         goto retry;
> }
> 
> 
> So I can't find where the 'fairness' logic comes from..

I looked through several of the rwlock implementations, and in all of
them the writer backs off if it sees readers -- or refrains from asserting
ownership of the lock to begin with.

So it should be OK to use rwlock as shown in the underlying patch.

							Thanx, Paul
Paul E. McKenney - Feb. 8, 2013, 11:10 p.m.
On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote:
> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
> lock-ordering related problems (unlike per-cpu locks). However, global
> rwlocks lead to unnecessary cache-line bouncing even when there are no
> writers present, which can slow down the system needlessly.
> 
> Per-cpu counters can help solve the cache-line bouncing problem. So we
> actually use the best of both: per-cpu counters (no-waiting) at the reader
> side in the fast-path, and global rwlocks in the slowpath.
> 
> [ Fastpath = no writer is active; Slowpath = a writer is active ]
> 
> IOW, the readers just increment/decrement their per-cpu refcounts (disabling
> interrupts during the updates, if necessary) when no writer is active.
> When a writer becomes active, he signals all readers to switch to global
> rwlocks for the duration of his activity. The readers switch over when it
> is safe for them (ie., when they are about to start a fresh, non-nested
> read-side critical section) and start using (holding) the global rwlock for
> read in their subsequent critical sections.
> 
> The writer waits for every existing reader to switch, and then acquires the
> global rwlock for write and enters his critical section. Later, the writer
> signals all readers that he is done, and that they can go back to using their
> per-cpu refcounts again.
> 
> Note that the lock-safety (despite the per-cpu scheme) comes from the fact
> that the readers can *choose* _when_ to switch to rwlocks upon the writer's
> signal. And the readers don't wait on anybody based on the per-cpu counters.
> The only true synchronization that involves waiting at the reader-side in this
> scheme, is the one arising from the global rwlock, which is safe from circular
> locking dependency issues.
> 
> Reader-writer locks and per-cpu counters are recursive, so they can be
> used in a nested fashion in the reader-path, which makes per-CPU rwlocks also
> recursive. Also, this design of switching the synchronization scheme ensures
> that you can safely nest and use these locks in a very flexible manner.
> 
> I'm indebted to Michael Wang and Xiao Guangrong for their numerous thoughtful
> suggestions and ideas, which inspired and influenced many of the decisions in
> this as well as previous designs. Thanks a lot Michael and Xiao!

Looks pretty close!  Some comments interspersed below.  Please either
fix the code or my confusion, as the case may be.  ;-)

							Thanx, Paul

> Cc: David Howells <dhowells@redhat.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
> 
>  include/linux/percpu-rwlock.h |   10 +++
>  lib/percpu-rwlock.c           |  128 ++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 136 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/percpu-rwlock.h b/include/linux/percpu-rwlock.h
> index 8dec8fe..6819bb8 100644
> --- a/include/linux/percpu-rwlock.h
> +++ b/include/linux/percpu-rwlock.h
> @@ -68,4 +68,14 @@ extern void percpu_free_rwlock(struct percpu_rwlock *);
>  	__percpu_init_rwlock(pcpu_rwlock, #pcpu_rwlock, &rwlock_key);	\
>  })
> 
> +#define reader_uses_percpu_refcnt(pcpu_rwlock, cpu)			\
> +		(ACCESS_ONCE(per_cpu(*((pcpu_rwlock)->reader_refcnt), cpu)))
> +
> +#define reader_nested_percpu(pcpu_rwlock)				\
> +			(__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) > 1)
> +
> +#define writer_active(pcpu_rwlock)					\
> +			(__this_cpu_read(*((pcpu_rwlock)->writer_signal)))
> +
>  #endif
> +
> diff --git a/lib/percpu-rwlock.c b/lib/percpu-rwlock.c
> index 80dad93..992da5c 100644
> --- a/lib/percpu-rwlock.c
> +++ b/lib/percpu-rwlock.c
> @@ -64,21 +64,145 @@ void percpu_free_rwlock(struct percpu_rwlock *pcpu_rwlock)
> 
>  void percpu_read_lock(struct percpu_rwlock *pcpu_rwlock)
>  {
> -	read_lock(&pcpu_rwlock->global_rwlock);
> +	preempt_disable();
> +
> +	/* First and foremost, let the writer know that a reader is active */
> +	this_cpu_inc(*pcpu_rwlock->reader_refcnt);
> +
> +	/*
> +	 * If we are already using per-cpu refcounts, it is not safe to switch
> +	 * the synchronization scheme. So continue using the refcounts.
> +	 */
> +	if (reader_nested_percpu(pcpu_rwlock)) {
> +		goto out;
> +	} else {
> +		/*
> +		 * The write to 'reader_refcnt' must be visible before we
> +		 * read 'writer_signal'.
> +		 */
> +		smp_mb(); /* Paired with smp_rmb() in sync_reader() */
> +
> +		if (likely(!writer_active(pcpu_rwlock))) {
> +			goto out;
> +		} else {
> +			/* Writer is active, so switch to global rwlock. */
> +			read_lock(&pcpu_rwlock->global_rwlock);
> +
> +			/*
> +			 * We might have raced with a writer going inactive
> +			 * before we took the read-lock. So re-evaluate whether
> +			 * we still need to hold the rwlock or if we can switch
> +			 * back to per-cpu refcounts. (This also helps avoid
> +			 * heterogeneous nesting of readers).
> +			 */
> +			if (writer_active(pcpu_rwlock))

The above writer_active() can be reordered with the following this_cpu_dec(),
strange though it might seem.  But this is OK because holding the rwlock
is conservative.  But might be worth a comment.

> +				this_cpu_dec(*pcpu_rwlock->reader_refcnt);
> +			else

In contrast, no reordering can happen here because read_unlock() is
required to keep the critical section underneath the lock.

> +				read_unlock(&pcpu_rwlock->global_rwlock);
> +		}
> +	}
> +
> +out:
> +	/* Prevent reordering of any subsequent reads */
> +	smp_rmb();

This should be smp_mb().  "Readers" really can do writes.  Hence the
name lglock -- "local/global" rather than "reader/writer".

>  }
> 
>  void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock)
>  {
> -	read_unlock(&pcpu_rwlock->global_rwlock);

We need an smp_mb() here to keep the critical section ordered before the
this_cpu_dec() below.  Otherwise, if a writer shows up just after we
exit the fastpath, that writer is not guaranteed to see the effects of
our critical section.  Equivalently, the prior read-side critical section
just might see some of the writer's updates, which could be a bit of
a surprise to the reader.

> +	/*
> +	 * We never allow heterogeneous nesting of readers. So it is trivial
> +	 * to find out the kind of reader we are, and undo the operation
> +	 * done by our corresponding percpu_read_lock().
> +	 */
> +	if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) {
> +		this_cpu_dec(*pcpu_rwlock->reader_refcnt);
> +		smp_wmb(); /* Paired with smp_rmb() in sync_reader() */

Given an smp_mb() above, I don't understand the need for this smp_wmb().
Isn't the idea that if the writer sees ->reader_refcnt decremented to
zero, it also needs to see the effects of the corresponding reader's
critical section?

Or am I missing something subtle here?  In any case, if this smp_wmb()
really is needed, there should be some subsequent write that the writer
might observe.  From what I can see, there is no subsequent write from
this reader that the writer cares about.

> +	} else {
> +		read_unlock(&pcpu_rwlock->global_rwlock);
> +	}
> +
> +	preempt_enable();
> +}
> +
> +static inline void raise_writer_signal(struct percpu_rwlock *pcpu_rwlock,
> +				       unsigned int cpu)
> +{
> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = true;
> +}
> +
> +static inline void drop_writer_signal(struct percpu_rwlock *pcpu_rwlock,
> +				      unsigned int cpu)
> +{
> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = false;
> +}
> +
> +static void announce_writer_active(struct percpu_rwlock *pcpu_rwlock)
> +{
> +	unsigned int cpu;
> +
> +	for_each_online_cpu(cpu)
> +		raise_writer_signal(pcpu_rwlock, cpu);
> +
> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
> +}
> +
> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
> +{
> +	unsigned int cpu;
> +
> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());

Why do we drop ourselves twice?  More to the point, why is it important to
drop ourselves first?

> +
> +	for_each_online_cpu(cpu)
> +		drop_writer_signal(pcpu_rwlock, cpu);
> +
> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
> +}
> +
> +/*
> + * Wait for the reader to see the writer's signal and switch from percpu
> + * refcounts to global rwlock.
> + *
> + * If the reader is still using percpu refcounts, wait for him to switch.
> + * Else, we can safely go ahead, because either the reader has already
> + * switched over, or the next reader that comes along on that CPU will
> + * notice the writer's signal and will switch over to the rwlock.
> + */
> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
> +			       unsigned int cpu)
> +{
> +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */

As I understand it, the purpose of this memory barrier is to ensure
that the stores in drop_writer_signal() happen before the reads from
->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
race between a new reader attempting to use the fastpath and this writer
acquiring the lock.  Unless I am confused, this must be smp_mb() rather
than smp_rmb().

Also, why not just have a single smp_mb() at the beginning of
sync_all_readers() instead of executing one barrier per CPU?

> +
> +	while (reader_uses_percpu_refcnt(pcpu_rwlock, cpu))
> +		cpu_relax();
> +}
> +
> +static void sync_all_readers(struct percpu_rwlock *pcpu_rwlock)
> +{
> +	unsigned int cpu;
> +
> +	for_each_online_cpu(cpu)
> +		sync_reader(pcpu_rwlock, cpu);
>  }
> 
>  void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
>  {
> +	/*
> +	 * Tell all readers that a writer is becoming active, so that they
> +	 * start switching over to the global rwlock.
> +	 */
> +	announce_writer_active(pcpu_rwlock);
> +	sync_all_readers(pcpu_rwlock);
>  	write_lock(&pcpu_rwlock->global_rwlock);
>  }
> 
>  void percpu_write_unlock(struct percpu_rwlock *pcpu_rwlock)
>  {
> +	/*
> +	 * Inform all readers that we are done, so that they can switch back
> +	 * to their per-cpu refcounts. (We don't need to wait for them to
> +	 * see it).
> +	 */
> +	announce_writer_inactive(pcpu_rwlock);
>  	write_unlock(&pcpu_rwlock->global_rwlock);
>  }
> 
>
Oleg Nesterov - Feb. 10, 2013, 6:06 p.m.
On 02/08, Paul E. McKenney wrote:
>
> On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote:
> >
> >  void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock)
> >  {
> > -	read_unlock(&pcpu_rwlock->global_rwlock);
>
> We need an smp_mb() here to keep the critical section ordered before the
> this_cpu_dec() below.  Otherwise, if a writer shows up just after we
> exit the fastpath, that writer is not guaranteed to see the effects of
> our critical section.  Equivalently, the prior read-side critical section
> just might see some of the writer's updates, which could be a bit of
> a surprise to the reader.

Agreed, we should not assume that a "reader" doesn't write. And we should
ensure that this "read" section actually completes before this_cpu_dec().

> > +	/*
> > +	 * We never allow heterogeneous nesting of readers. So it is trivial
> > +	 * to find out the kind of reader we are, and undo the operation
> > +	 * done by our corresponding percpu_read_lock().
> > +	 */
> > +	if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) {
> > +		this_cpu_dec(*pcpu_rwlock->reader_refcnt);
> > +		smp_wmb(); /* Paired with smp_rmb() in sync_reader() */
>
> Given an smp_mb() above, I don't understand the need for this smp_wmb().
> Isn't the idea that if the writer sees ->reader_refcnt decremented to
> zero, it also needs to see the effects of the corresponding reader's
> critical section?

I am equally confused ;)

OTOH, we can probably aboid any barrier if reader_nested_percpu() == T.


> > +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
> > +{
> > +   unsigned int cpu;
> > +
> > +   drop_writer_signal(pcpu_rwlock, smp_processor_id());
>
> Why do we drop ourselves twice?  More to the point, why is it important to
> drop ourselves first?

And don't we need mb() _before_ we clear ->writer_signal ?

> > +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
> > +			       unsigned int cpu)
> > +{
> > +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
>
> As I understand it, the purpose of this memory barrier is to ensure
> that the stores in drop_writer_signal() happen before the reads from
> ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
> race between a new reader attempting to use the fastpath and this writer
> acquiring the lock.  Unless I am confused, this must be smp_mb() rather
> than smp_rmb().

And note that before sync_reader() we call announce_writer_active() which
already adds mb() before sync_all_readers/sync_reader, so this rmb() looks
unneeded.

But, at the same time, could you confirm that we do not need another mb()
after sync_all_readers() in percpu_write_lock() ? I mean, without mb(),
can't this reader_uses_percpu_refcnt() LOAD leak into the critical section
protected by ->global_rwlock? Then this LOAD can be re-ordered with other
memory operations done by the writer.



Srivatsa, I think that the code would be more understandable if you kill
the helpers like sync_reader/raise_writer_signal. Perhaps even all "write"
helpers, I am not sure. At least, it seems to me that all barriers should
be moved to percpu_write_lock/unlock. But I won't insist of course, up to
you.

And cosmetic nit... How about

	struct xxx {
		unsigned long	reader_refcnt;
		bool		writer_signal;
	}

	struct percpu_rwlock {
		struct xxx __percpu	*xxx;
		rwlock_t		global_rwlock;
	};

?

This saves one alloc_percpu() and ensures that reader_refcnt/writer_signal
are always in the same cache-line.

Oleg.
Srivatsa S. Bhat - Feb. 10, 2013, 6:38 p.m.
On 02/09/2013 04:17 AM, Paul E. McKenney wrote:
> On Tue, Jan 29, 2013 at 08:12:37PM +0900, Namhyung Kim wrote:
>> On Thu, 24 Jan 2013 10:00:04 +0530, Srivatsa S. Bhat wrote:
>>> On 01/24/2013 01:27 AM, Tejun Heo wrote:
>>>> On Thu, Jan 24, 2013 at 01:03:52AM +0530, Srivatsa S. Bhat wrote:
>>>>> CPU 0                          CPU 1
>>>>>
>>>>> read_lock(&rwlock)
>>>>>
>>>>>                               write_lock(&rwlock) //spins, because CPU 0
>>>>>                               //has acquired the lock for read
>>>>>
>>>>> read_lock(&rwlock)
>>>>>    ^^^^^
>>>>> What happens here? Does CPU 0 start spinning (and hence deadlock) or will
>>>>> it continue realizing that it already holds the rwlock for read?
>>>>
>>>> I don't think rwlock allows nesting write lock inside read lock.
>>>> read_lock(); write_lock() will always deadlock.
>>>>
>>>
>>> Sure, I understand that :-) My question was, what happens when *two* CPUs
>>> are involved, as in, the read_lock() is invoked only on CPU 0 whereas the
>>> write_lock() is invoked on CPU 1.
>>>
>>> For example, the same scenario shown above, but with slightly different
>>> timing, will NOT result in a deadlock:
>>>
>>> Scenario 2:
>>>   CPU 0                                CPU 1
>>>
>>> read_lock(&rwlock)
>>>
>>>
>>> read_lock(&rwlock) //doesn't spin
>>>
>>>                                     write_lock(&rwlock) //spins, because CPU 0
>>>                                     //has acquired the lock for read
>>>
>>>
>>> So I was wondering whether the "fairness" logic of rwlocks would cause
>>> the second read_lock() to spin (in the first scenario shown above) because
>>> a writer is already waiting (and hence new readers should spin) and thus
>>> cause a deadlock.
>>
>> In my understanding, current x86 rwlock does basically this (of course,
>> in an atomic fashion):
>>
>>
>> #define RW_LOCK_BIAS 0x10000
>>
>> rwlock_init(rwlock)
>> {
>> 	rwlock->lock = RW_LOCK_BIAS;
>> }
>>
>> arch_read_lock(rwlock)
>> {
>> retry:
>> 	if (--rwlock->lock >= 0)
>> 		return;
>>
>>         rwlock->lock++;
>>         while (rwlock->lock < 1)
>>         	continue;
>>
>>         goto retry;
>> }
>>
>> arch_write_lock(rwlock)
>> {
>> retry:
>> 	if ((rwlock->lock -= RW_LOCK_BIAS) == 0)
>>         	return;
>>
>>         rwlock->lock += RW_LOCK_BIAS;
>> 	while (rwlock->lock != RW_LOCK_BIAS)
>> 		continue;
>>
>>         goto retry;
>> }
>>
>>
>> So I can't find where the 'fairness' logic comes from..
> 
> I looked through several of the rwlock implementations, and in all of
> them the writer backs off if it sees readers -- or refrains from asserting
> ownership of the lock to begin with.
> 
> So it should be OK to use rwlock as shown in the underlying patch.
> 

Thanks a lot for confirming that Paul! So I guess we can use rwlocks
as it is, since its behaviour suits our needs perfectly. So I won't tinker
with atomic counters for a while, atleast not until someone starts making
rwlocks fair.. ;-)

Regards,
Srivatsa S. Bhat
Srivatsa S. Bhat - Feb. 10, 2013, 7:10 p.m.
On 02/09/2013 04:40 AM, Paul E. McKenney wrote:
> On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote:
>> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
>> lock-ordering related problems (unlike per-cpu locks). However, global
>> rwlocks lead to unnecessary cache-line bouncing even when there are no
>> writers present, which can slow down the system needlessly.
>>
[...]
> 
> Looks pretty close!  Some comments interspersed below.  Please either
> fix the code or my confusion, as the case may be.  ;-)
> 

Sure :-)

>> ---
>>
>>  include/linux/percpu-rwlock.h |   10 +++
>>  lib/percpu-rwlock.c           |  128 ++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 136 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/percpu-rwlock.h b/include/linux/percpu-rwlock.h
>> index 8dec8fe..6819bb8 100644
>> --- a/include/linux/percpu-rwlock.h
>> +++ b/include/linux/percpu-rwlock.h
>> @@ -68,4 +68,14 @@ extern void percpu_free_rwlock(struct percpu_rwlock *);
>>  	__percpu_init_rwlock(pcpu_rwlock, #pcpu_rwlock, &rwlock_key);	\
>>  })
>>
>> +#define reader_uses_percpu_refcnt(pcpu_rwlock, cpu)			\
>> +		(ACCESS_ONCE(per_cpu(*((pcpu_rwlock)->reader_refcnt), cpu)))
>> +
>> +#define reader_nested_percpu(pcpu_rwlock)				\
>> +			(__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) > 1)
>> +
>> +#define writer_active(pcpu_rwlock)					\
>> +			(__this_cpu_read(*((pcpu_rwlock)->writer_signal)))
>> +
>>  #endif
>> +
>> diff --git a/lib/percpu-rwlock.c b/lib/percpu-rwlock.c
>> index 80dad93..992da5c 100644
>> --- a/lib/percpu-rwlock.c
>> +++ b/lib/percpu-rwlock.c
>> @@ -64,21 +64,145 @@ void percpu_free_rwlock(struct percpu_rwlock *pcpu_rwlock)
>>
>>  void percpu_read_lock(struct percpu_rwlock *pcpu_rwlock)
>>  {
>> -	read_lock(&pcpu_rwlock->global_rwlock);
>> +	preempt_disable();
>> +
>> +	/* First and foremost, let the writer know that a reader is active */
>> +	this_cpu_inc(*pcpu_rwlock->reader_refcnt);
>> +
>> +	/*
>> +	 * If we are already using per-cpu refcounts, it is not safe to switch
>> +	 * the synchronization scheme. So continue using the refcounts.
>> +	 */
>> +	if (reader_nested_percpu(pcpu_rwlock)) {
>> +		goto out;
>> +	} else {
>> +		/*
>> +		 * The write to 'reader_refcnt' must be visible before we
>> +		 * read 'writer_signal'.
>> +		 */
>> +		smp_mb(); /* Paired with smp_rmb() in sync_reader() */
>> +
>> +		if (likely(!writer_active(pcpu_rwlock))) {
>> +			goto out;
>> +		} else {
>> +			/* Writer is active, so switch to global rwlock. */
>> +			read_lock(&pcpu_rwlock->global_rwlock);
>> +
>> +			/*
>> +			 * We might have raced with a writer going inactive
>> +			 * before we took the read-lock. So re-evaluate whether
>> +			 * we still need to hold the rwlock or if we can switch
>> +			 * back to per-cpu refcounts. (This also helps avoid
>> +			 * heterogeneous nesting of readers).
>> +			 */
>> +			if (writer_active(pcpu_rwlock))
> 
> The above writer_active() can be reordered with the following this_cpu_dec(),
> strange though it might seem.  But this is OK because holding the rwlock
> is conservative.  But might be worth a comment.
> 

Ok..

>> +				this_cpu_dec(*pcpu_rwlock->reader_refcnt);
>> +			else
> 
> In contrast, no reordering can happen here because read_unlock() is
> required to keep the critical section underneath the lock.
>

Ok..
 
>> +				read_unlock(&pcpu_rwlock->global_rwlock);
>> +		}
>> +	}
>> +
>> +out:
>> +	/* Prevent reordering of any subsequent reads */
>> +	smp_rmb();
> 
> This should be smp_mb().  "Readers" really can do writes.  Hence the
> name lglock -- "local/global" rather than "reader/writer".
> 

Ok!

[ We were trying to avoid full memory barriers in get/put_online_cpus_atomic()
in the fastpath, as far as possible... Now it feels like all that discussion
was for nothing :-( ]

>>  }
>>
>>  void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock)
>>  {
>> -	read_unlock(&pcpu_rwlock->global_rwlock);
> 
> We need an smp_mb() here to keep the critical section ordered before the
> this_cpu_dec() below.  Otherwise, if a writer shows up just after we
> exit the fastpath, that writer is not guaranteed to see the effects of
> our critical section.  Equivalently, the prior read-side critical section
> just might see some of the writer's updates, which could be a bit of
> a surprise to the reader.
> 

Ok, will add it.

>> +	/*
>> +	 * We never allow heterogeneous nesting of readers. So it is trivial
>> +	 * to find out the kind of reader we are, and undo the operation
>> +	 * done by our corresponding percpu_read_lock().
>> +	 */
>> +	if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) {
>> +		this_cpu_dec(*pcpu_rwlock->reader_refcnt);
>> +		smp_wmb(); /* Paired with smp_rmb() in sync_reader() */
> 
> Given an smp_mb() above, I don't understand the need for this smp_wmb().
> Isn't the idea that if the writer sees ->reader_refcnt decremented to
> zero, it also needs to see the effects of the corresponding reader's
> critical section?
>

Not sure what you meant, but my idea here was that the writer should see
the reader_refcnt falling to zero as soon as possible, to avoid keeping the
writer waiting in a tight loop for longer than necessary.
I might have been a little over-zealous to use lighter memory barriers though,
(given our lengthy discussions in the previous versions to reduce the memory
barrier overheads), so the smp_wmb() used above might be wrong.

So, are you saying that the smp_mb() you indicated above would be enough
to make the writer observe the 1->0 transition of reader_refcnt immediately?
 
> Or am I missing something subtle here?  In any case, if this smp_wmb()
> really is needed, there should be some subsequent write that the writer
> might observe.  From what I can see, there is no subsequent write from
> this reader that the writer cares about.
> 

I thought the smp_wmb() here and the smp_rmb() at the writer would ensure
immediate reflection of the reader state at the writer side... Please correct
me if my understanding is incorrect.

>> +	} else {
>> +		read_unlock(&pcpu_rwlock->global_rwlock);
>> +	}
>> +
>> +	preempt_enable();
>> +}
>> +
>> +static inline void raise_writer_signal(struct percpu_rwlock *pcpu_rwlock,
>> +				       unsigned int cpu)
>> +{
>> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = true;
>> +}
>> +
>> +static inline void drop_writer_signal(struct percpu_rwlock *pcpu_rwlock,
>> +				      unsigned int cpu)
>> +{
>> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = false;
>> +}
>> +
>> +static void announce_writer_active(struct percpu_rwlock *pcpu_rwlock)
>> +{
>> +	unsigned int cpu;
>> +
>> +	for_each_online_cpu(cpu)
>> +		raise_writer_signal(pcpu_rwlock, cpu);
>> +
>> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
>> +}
>> +
>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
>> +{
>> +	unsigned int cpu;
>> +
>> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
> 
> Why do we drop ourselves twice?  More to the point, why is it important to
> drop ourselves first?
> 

I don't see where we are dropping ourselves twice. Note that we are no longer
in the cpu_online_mask, so the 'for' loop below won't include us. So we need
to manually drop ourselves. It doesn't matter whether we drop ourselves first
or later.

>> +
>> +	for_each_online_cpu(cpu)
>> +		drop_writer_signal(pcpu_rwlock, cpu);
>> +
>> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
>> +}
>> +
>> +/*
>> + * Wait for the reader to see the writer's signal and switch from percpu
>> + * refcounts to global rwlock.
>> + *
>> + * If the reader is still using percpu refcounts, wait for him to switch.
>> + * Else, we can safely go ahead, because either the reader has already
>> + * switched over, or the next reader that comes along on that CPU will
>> + * notice the writer's signal and will switch over to the rwlock.
>> + */
>> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
>> +			       unsigned int cpu)
>> +{
>> +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
> 
> As I understand it, the purpose of this memory barrier is to ensure
> that the stores in drop_writer_signal() happen before the reads from
> ->reader_refcnt in reader_uses_percpu_refcnt(),

No, that was not what I intended. announce_writer_inactive() already does
a full smp_mb() after calling drop_writer_signal().

I put the smp_rmb() here and the smp_wmb() at the reader side (after updates
to the ->reader_refcnt) to reflect the state change of ->reader_refcnt
immediately at the writer, so that the writer doesn't have to keep spinning
unnecessarily still referring to the old (non-zero) value of ->reader_refcnt.
Or perhaps I am confused about how to use memory barriers properly.. :-(

> thus preventing the
> race between a new reader attempting to use the fastpath and this writer
> acquiring the lock.  Unless I am confused, this must be smp_mb() rather
> than smp_rmb().
> 
> Also, why not just have a single smp_mb() at the beginning of
> sync_all_readers() instead of executing one barrier per CPU?

Well, since my intention was to help the writer see the update (->reader_refcnt
dropping to zero) ASAP, I kept the multiple smp_rmb()s.

> 
>> +
>> +	while (reader_uses_percpu_refcnt(pcpu_rwlock, cpu))
>> +		cpu_relax();
>> +}
>> +
>> +static void sync_all_readers(struct percpu_rwlock *pcpu_rwlock)
>> +{
>> +	unsigned int cpu;
>> +
>> +	for_each_online_cpu(cpu)
>> +		sync_reader(pcpu_rwlock, cpu);
>>  }
>>
>>  void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
>>  {
>> +	/*
>> +	 * Tell all readers that a writer is becoming active, so that they
>> +	 * start switching over to the global rwlock.
>> +	 */
>> +	announce_writer_active(pcpu_rwlock);
>> +	sync_all_readers(pcpu_rwlock);
>>  	write_lock(&pcpu_rwlock->global_rwlock);
>>  }
>>
>>  void percpu_write_unlock(struct percpu_rwlock *pcpu_rwlock)
>>  {
>> +	/*
>> +	 * Inform all readers that we are done, so that they can switch back
>> +	 * to their per-cpu refcounts. (We don't need to wait for them to
>> +	 * see it).
>> +	 */
>> +	announce_writer_inactive(pcpu_rwlock);
>>  	write_unlock(&pcpu_rwlock->global_rwlock);
>>  }
>>
>>

Thanks a lot for your detailed review and comments! :-)

Regards,
Srivatsa S. Bhat
Srivatsa S. Bhat - Feb. 10, 2013, 7:24 p.m.
On 02/10/2013 11:36 PM, Oleg Nesterov wrote:
> On 02/08, Paul E. McKenney wrote:
>>
>> On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote:
>>>
>>>  void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock)
>>>  {
>>> -	read_unlock(&pcpu_rwlock->global_rwlock);
>>
>> We need an smp_mb() here to keep the critical section ordered before the
>> this_cpu_dec() below.  Otherwise, if a writer shows up just after we
>> exit the fastpath, that writer is not guaranteed to see the effects of
>> our critical section.  Equivalently, the prior read-side critical section
>> just might see some of the writer's updates, which could be a bit of
>> a surprise to the reader.
> 
> Agreed, we should not assume that a "reader" doesn't write. And we should
> ensure that this "read" section actually completes before this_cpu_dec().
>

Right, will fix.
 
>>> +	/*
>>> +	 * We never allow heterogeneous nesting of readers. So it is trivial
>>> +	 * to find out the kind of reader we are, and undo the operation
>>> +	 * done by our corresponding percpu_read_lock().
>>> +	 */
>>> +	if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) {
>>> +		this_cpu_dec(*pcpu_rwlock->reader_refcnt);
>>> +		smp_wmb(); /* Paired with smp_rmb() in sync_reader() */
>>
>> Given an smp_mb() above, I don't understand the need for this smp_wmb().
>> Isn't the idea that if the writer sees ->reader_refcnt decremented to
>> zero, it also needs to see the effects of the corresponding reader's
>> critical section?
> 
> I am equally confused ;)
> 
> OTOH, we can probably aboid any barrier if reader_nested_percpu() == T.
> 

Good point! Will add that optimization, thank you!

> 
>>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
>>> +{
>>> +   unsigned int cpu;
>>> +
>>> +   drop_writer_signal(pcpu_rwlock, smp_processor_id());
>>
>> Why do we drop ourselves twice?  More to the point, why is it important to
>> drop ourselves first?
> 
> And don't we need mb() _before_ we clear ->writer_signal ?
> 

Oh, right! Or, how about moving announce_writer_inactive() to _after_
write_unlock()?

>>> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
>>> +			       unsigned int cpu)
>>> +{
>>> +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
>>
>> As I understand it, the purpose of this memory barrier is to ensure
>> that the stores in drop_writer_signal() happen before the reads from
>> ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
>> race between a new reader attempting to use the fastpath and this writer
>> acquiring the lock.  Unless I am confused, this must be smp_mb() rather
>> than smp_rmb().
> 
> And note that before sync_reader() we call announce_writer_active() which
> already adds mb() before sync_all_readers/sync_reader, so this rmb() looks
> unneeded.
> 

My intention was to help the writer see the ->reader_refcnt drop to zero
ASAP; hence I used smp_wmb() at reader and smp_rmb() here at the writer.
Please correct me if my understanding of memory barriers is wrong here..

> But, at the same time, could you confirm that we do not need another mb()
> after sync_all_readers() in percpu_write_lock() ? I mean, without mb(),
> can't this reader_uses_percpu_refcnt() LOAD leak into the critical section
> protected by ->global_rwlock? Then this LOAD can be re-ordered with other
> memory operations done by the writer.
> 

Hmm.. it appears that we need a smp_mb() there.

> 
> 
> Srivatsa, I think that the code would be more understandable if you kill
> the helpers like sync_reader/raise_writer_signal. Perhaps even all "write"
> helpers, I am not sure. At least, it seems to me that all barriers should
> be moved to percpu_write_lock/unlock. But I won't insist of course, up to
> you.
> 

Sure, sure. Even Tejun pointed out that those helpers are getting in the way
of readability. I'll get rid of them in the next version.

> And cosmetic nit... How about
> 
> 	struct xxx {
> 		unsigned long	reader_refcnt;
> 		bool		writer_signal;
> 	}
> 
> 	struct percpu_rwlock {
> 		struct xxx __percpu	*xxx;
> 		rwlock_t		global_rwlock;
> 	};
> 
> ?
> 
> This saves one alloc_percpu() and ensures that reader_refcnt/writer_signal
> are always in the same cache-line.
>

Ok, that sounds better. Will make that change. Thanks a lot Oleg!

Regards,
Srivatsa S. Bhat
Paul E. McKenney - Feb. 10, 2013, 7:47 p.m.
On Mon, Feb 11, 2013 at 12:40:56AM +0530, Srivatsa S. Bhat wrote:
> On 02/09/2013 04:40 AM, Paul E. McKenney wrote:
> > On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote:
> >> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
> >> lock-ordering related problems (unlike per-cpu locks). However, global
> >> rwlocks lead to unnecessary cache-line bouncing even when there are no
> >> writers present, which can slow down the system needlessly.
> >>
> [...]
> > 
> > Looks pretty close!  Some comments interspersed below.  Please either
> > fix the code or my confusion, as the case may be.  ;-)
> > 
> 
> Sure :-)
> 
> >> ---
> >>
> >>  include/linux/percpu-rwlock.h |   10 +++
> >>  lib/percpu-rwlock.c           |  128 ++++++++++++++++++++++++++++++++++++++++-
> >>  2 files changed, 136 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/percpu-rwlock.h b/include/linux/percpu-rwlock.h
> >> index 8dec8fe..6819bb8 100644
> >> --- a/include/linux/percpu-rwlock.h
> >> +++ b/include/linux/percpu-rwlock.h
> >> @@ -68,4 +68,14 @@ extern void percpu_free_rwlock(struct percpu_rwlock *);
> >>  	__percpu_init_rwlock(pcpu_rwlock, #pcpu_rwlock, &rwlock_key);	\
> >>  })
> >>
> >> +#define reader_uses_percpu_refcnt(pcpu_rwlock, cpu)			\
> >> +		(ACCESS_ONCE(per_cpu(*((pcpu_rwlock)->reader_refcnt), cpu)))
> >> +
> >> +#define reader_nested_percpu(pcpu_rwlock)				\
> >> +			(__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) > 1)
> >> +
> >> +#define writer_active(pcpu_rwlock)					\
> >> +			(__this_cpu_read(*((pcpu_rwlock)->writer_signal)))
> >> +
> >>  #endif
> >> +
> >> diff --git a/lib/percpu-rwlock.c b/lib/percpu-rwlock.c
> >> index 80dad93..992da5c 100644
> >> --- a/lib/percpu-rwlock.c
> >> +++ b/lib/percpu-rwlock.c
> >> @@ -64,21 +64,145 @@ void percpu_free_rwlock(struct percpu_rwlock *pcpu_rwlock)
> >>
> >>  void percpu_read_lock(struct percpu_rwlock *pcpu_rwlock)
> >>  {
> >> -	read_lock(&pcpu_rwlock->global_rwlock);
> >> +	preempt_disable();
> >> +
> >> +	/* First and foremost, let the writer know that a reader is active */
> >> +	this_cpu_inc(*pcpu_rwlock->reader_refcnt);
> >> +
> >> +	/*
> >> +	 * If we are already using per-cpu refcounts, it is not safe to switch
> >> +	 * the synchronization scheme. So continue using the refcounts.
> >> +	 */
> >> +	if (reader_nested_percpu(pcpu_rwlock)) {
> >> +		goto out;
> >> +	} else {
> >> +		/*
> >> +		 * The write to 'reader_refcnt' must be visible before we
> >> +		 * read 'writer_signal'.
> >> +		 */
> >> +		smp_mb(); /* Paired with smp_rmb() in sync_reader() */
> >> +
> >> +		if (likely(!writer_active(pcpu_rwlock))) {
> >> +			goto out;
> >> +		} else {
> >> +			/* Writer is active, so switch to global rwlock. */
> >> +			read_lock(&pcpu_rwlock->global_rwlock);
> >> +
> >> +			/*
> >> +			 * We might have raced with a writer going inactive
> >> +			 * before we took the read-lock. So re-evaluate whether
> >> +			 * we still need to hold the rwlock or if we can switch
> >> +			 * back to per-cpu refcounts. (This also helps avoid
> >> +			 * heterogeneous nesting of readers).
> >> +			 */
> >> +			if (writer_active(pcpu_rwlock))
> > 
> > The above writer_active() can be reordered with the following this_cpu_dec(),
> > strange though it might seem.  But this is OK because holding the rwlock
> > is conservative.  But might be worth a comment.
> > 
> 
> Ok..
> 
> >> +				this_cpu_dec(*pcpu_rwlock->reader_refcnt);
> >> +			else
> > 
> > In contrast, no reordering can happen here because read_unlock() is
> > required to keep the critical section underneath the lock.
> >
> 
> Ok..
> 
> >> +				read_unlock(&pcpu_rwlock->global_rwlock);
> >> +		}
> >> +	}
> >> +
> >> +out:
> >> +	/* Prevent reordering of any subsequent reads */
> >> +	smp_rmb();
> > 
> > This should be smp_mb().  "Readers" really can do writes.  Hence the
> > name lglock -- "local/global" rather than "reader/writer".
> > 
> 
> Ok!
> 
> [ We were trying to avoid full memory barriers in get/put_online_cpus_atomic()
> in the fastpath, as far as possible... Now it feels like all that discussion
> was for nothing :-( ]
> 
> >>  }
> >>
> >>  void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock)
> >>  {
> >> -	read_unlock(&pcpu_rwlock->global_rwlock);
> > 
> > We need an smp_mb() here to keep the critical section ordered before the
> > this_cpu_dec() below.  Otherwise, if a writer shows up just after we
> > exit the fastpath, that writer is not guaranteed to see the effects of
> > our critical section.  Equivalently, the prior read-side critical section
> > just might see some of the writer's updates, which could be a bit of
> > a surprise to the reader.
> > 
> 
> Ok, will add it.
> 
> >> +	/*
> >> +	 * We never allow heterogeneous nesting of readers. So it is trivial
> >> +	 * to find out the kind of reader we are, and undo the operation
> >> +	 * done by our corresponding percpu_read_lock().
> >> +	 */
> >> +	if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) {
> >> +		this_cpu_dec(*pcpu_rwlock->reader_refcnt);
> >> +		smp_wmb(); /* Paired with smp_rmb() in sync_reader() */
> > 
> > Given an smp_mb() above, I don't understand the need for this smp_wmb().
> > Isn't the idea that if the writer sees ->reader_refcnt decremented to
> > zero, it also needs to see the effects of the corresponding reader's
> > critical section?
> >
> 
> Not sure what you meant, but my idea here was that the writer should see
> the reader_refcnt falling to zero as soon as possible, to avoid keeping the
> writer waiting in a tight loop for longer than necessary.
> I might have been a little over-zealous to use lighter memory barriers though,
> (given our lengthy discussions in the previous versions to reduce the memory
> barrier overheads), so the smp_wmb() used above might be wrong.
> 
> So, are you saying that the smp_mb() you indicated above would be enough
> to make the writer observe the 1->0 transition of reader_refcnt immediately?
> 
> > Or am I missing something subtle here?  In any case, if this smp_wmb()
> > really is needed, there should be some subsequent write that the writer
> > might observe.  From what I can see, there is no subsequent write from
> > this reader that the writer cares about.
> 
> I thought the smp_wmb() here and the smp_rmb() at the writer would ensure
> immediate reflection of the reader state at the writer side... Please correct
> me if my understanding is incorrect.

Ah, but memory barriers are not so much about making data move faster
through the machine, but more about making sure that ordering constraints
are met.  After all, memory barriers cannot make electrons flow faster
through silicon.  You should therefore use memory barriers only to
constrain ordering, not to try to expedite electrons.

> >> +	} else {
> >> +		read_unlock(&pcpu_rwlock->global_rwlock);
> >> +	}
> >> +
> >> +	preempt_enable();
> >> +}
> >> +
> >> +static inline void raise_writer_signal(struct percpu_rwlock *pcpu_rwlock,
> >> +				       unsigned int cpu)
> >> +{
> >> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = true;
> >> +}
> >> +
> >> +static inline void drop_writer_signal(struct percpu_rwlock *pcpu_rwlock,
> >> +				      unsigned int cpu)
> >> +{
> >> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = false;
> >> +}
> >> +
> >> +static void announce_writer_active(struct percpu_rwlock *pcpu_rwlock)
> >> +{
> >> +	unsigned int cpu;
> >> +
> >> +	for_each_online_cpu(cpu)
> >> +		raise_writer_signal(pcpu_rwlock, cpu);
> >> +
> >> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
> >> +}
> >> +
> >> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
> >> +{
> >> +	unsigned int cpu;
> >> +
> >> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
> > 
> > Why do we drop ourselves twice?  More to the point, why is it important to
> > drop ourselves first?
> 
> I don't see where we are dropping ourselves twice. Note that we are no longer
> in the cpu_online_mask, so the 'for' loop below won't include us. So we need
> to manually drop ourselves. It doesn't matter whether we drop ourselves first
> or later.

Good point, apologies for my confusion!  Still worth a commment, though.

> >> +
> >> +	for_each_online_cpu(cpu)
> >> +		drop_writer_signal(pcpu_rwlock, cpu);
> >> +
> >> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
> >> +}
> >> +
> >> +/*
> >> + * Wait for the reader to see the writer's signal and switch from percpu
> >> + * refcounts to global rwlock.
> >> + *
> >> + * If the reader is still using percpu refcounts, wait for him to switch.
> >> + * Else, we can safely go ahead, because either the reader has already
> >> + * switched over, or the next reader that comes along on that CPU will
> >> + * notice the writer's signal and will switch over to the rwlock.
> >> + */
> >> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
> >> +			       unsigned int cpu)
> >> +{
> >> +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
> > 
> > As I understand it, the purpose of this memory barrier is to ensure
> > that the stores in drop_writer_signal() happen before the reads from
> > ->reader_refcnt in reader_uses_percpu_refcnt(),
> 
> No, that was not what I intended. announce_writer_inactive() already does
> a full smp_mb() after calling drop_writer_signal().
> 
> I put the smp_rmb() here and the smp_wmb() at the reader side (after updates
> to the ->reader_refcnt) to reflect the state change of ->reader_refcnt
> immediately at the writer, so that the writer doesn't have to keep spinning
> unnecessarily still referring to the old (non-zero) value of ->reader_refcnt.
> Or perhaps I am confused about how to use memory barriers properly.. :-(

Sadly, no, memory barriers don't make electrons move faster.  So you
should only need the one -- the additional memory barriers are just
slowing things down.

> > thus preventing the
> > race between a new reader attempting to use the fastpath and this writer
> > acquiring the lock.  Unless I am confused, this must be smp_mb() rather
> > than smp_rmb().
> > 
> > Also, why not just have a single smp_mb() at the beginning of
> > sync_all_readers() instead of executing one barrier per CPU?
> 
> Well, since my intention was to help the writer see the update (->reader_refcnt
> dropping to zero) ASAP, I kept the multiple smp_rmb()s.

At least you were consistent.  ;-)

> >> +
> >> +	while (reader_uses_percpu_refcnt(pcpu_rwlock, cpu))
> >> +		cpu_relax();
> >> +}
> >> +
> >> +static void sync_all_readers(struct percpu_rwlock *pcpu_rwlock)
> >> +{
> >> +	unsigned int cpu;
> >> +
> >> +	for_each_online_cpu(cpu)
> >> +		sync_reader(pcpu_rwlock, cpu);
> >>  }
> >>
> >>  void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
> >>  {
> >> +	/*
> >> +	 * Tell all readers that a writer is becoming active, so that they
> >> +	 * start switching over to the global rwlock.
> >> +	 */
> >> +	announce_writer_active(pcpu_rwlock);
> >> +	sync_all_readers(pcpu_rwlock);
> >>  	write_lock(&pcpu_rwlock->global_rwlock);
> >>  }
> >>
> >>  void percpu_write_unlock(struct percpu_rwlock *pcpu_rwlock)
> >>  {
> >> +	/*
> >> +	 * Inform all readers that we are done, so that they can switch back
> >> +	 * to their per-cpu refcounts. (We don't need to wait for them to
> >> +	 * see it).
> >> +	 */
> >> +	announce_writer_inactive(pcpu_rwlock);
> >>  	write_unlock(&pcpu_rwlock->global_rwlock);
> >>  }
> >>
> >>
> 
> Thanks a lot for your detailed review and comments! :-)

It will be good to get this in!

							Thanx, Paul
Oleg Nesterov - Feb. 10, 2013, 7:50 p.m.
On 02/11, Srivatsa S. Bhat wrote:
>
> On 02/10/2013 11:36 PM, Oleg Nesterov wrote:
> >>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
> >>> +{
> >>> +   unsigned int cpu;
> >>> +
> >>> +   drop_writer_signal(pcpu_rwlock, smp_processor_id());
> >>
> >> Why do we drop ourselves twice?  More to the point, why is it important to
> >> drop ourselves first?
> >
> > And don't we need mb() _before_ we clear ->writer_signal ?
> >
>
> Oh, right! Or, how about moving announce_writer_inactive() to _after_
> write_unlock()?

Not sure this will help... but, either way it seems we have another
problem...

percpu_rwlock tries to be "generic". This means we should "ignore" its
usage in hotplug, and _write_lock should not race with _write_unlock.

IOW. Suppose that _write_unlock clears ->writer_signal. We need to ensure
that this can't race with another write which wants to set this flag.
Perhaps it should be counter as well, and it should be protected by
the same ->global_rwlock, but _write_lock() should drop it before
sync_all_readers() and then take it again?

> >>> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
> >>> +			       unsigned int cpu)
> >>> +{
> >>> +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
> >>
> >> As I understand it, the purpose of this memory barrier is to ensure
> >> that the stores in drop_writer_signal() happen before the reads from
> >> ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
> >> race between a new reader attempting to use the fastpath and this writer
> >> acquiring the lock.  Unless I am confused, this must be smp_mb() rather
> >> than smp_rmb().
> >
> > And note that before sync_reader() we call announce_writer_active() which
> > already adds mb() before sync_all_readers/sync_reader, so this rmb() looks
> > unneeded.
> >
>
> My intention was to help the writer see the ->reader_refcnt drop to zero
> ASAP; hence I used smp_wmb() at reader and smp_rmb() here at the writer.

Hmm, interesting... Not sure, but can't really comment. However I can
answer your next question:

> Please correct me if my understanding of memory barriers is wrong here..

Who? Me??? No we have paulmck for that ;)

Oleg.
Paul E. McKenney - Feb. 10, 2013, 7:54 p.m.
On Sun, Feb 10, 2013 at 07:06:07PM +0100, Oleg Nesterov wrote:
> On 02/08, Paul E. McKenney wrote:

[ . . . ]

> > > +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
> > > +			       unsigned int cpu)
> > > +{
> > > +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
> >
> > As I understand it, the purpose of this memory barrier is to ensure
> > that the stores in drop_writer_signal() happen before the reads from
> > ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
> > race between a new reader attempting to use the fastpath and this writer
> > acquiring the lock.  Unless I am confused, this must be smp_mb() rather
> > than smp_rmb().
> 
> And note that before sync_reader() we call announce_writer_active() which
> already adds mb() before sync_all_readers/sync_reader, so this rmb() looks
> unneeded.
> 
> But, at the same time, could you confirm that we do not need another mb()
> after sync_all_readers() in percpu_write_lock() ? I mean, without mb(),
> can't this reader_uses_percpu_refcnt() LOAD leak into the critical section
> protected by ->global_rwlock? Then this LOAD can be re-ordered with other
> memory operations done by the writer.

As soon as I get the rest of the way through Thomas's patchset.  ;-)

							Thanx, Paul
Srivatsa S. Bhat - Feb. 10, 2013, 7:57 p.m.
On 02/11/2013 01:17 AM, Paul E. McKenney wrote:
> On Mon, Feb 11, 2013 at 12:40:56AM +0530, Srivatsa S. Bhat wrote:
>> On 02/09/2013 04:40 AM, Paul E. McKenney wrote:
>>> On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote:
>>>> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
>>>> lock-ordering related problems (unlike per-cpu locks). However, global
>>>> rwlocks lead to unnecessary cache-line bouncing even when there are no
>>>> writers present, which can slow down the system needlessly.
>>>>
>> [...]
>>>> +	/*
>>>> +	 * We never allow heterogeneous nesting of readers. So it is trivial
>>>> +	 * to find out the kind of reader we are, and undo the operation
>>>> +	 * done by our corresponding percpu_read_lock().
>>>> +	 */
>>>> +	if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) {
>>>> +		this_cpu_dec(*pcpu_rwlock->reader_refcnt);
>>>> +		smp_wmb(); /* Paired with smp_rmb() in sync_reader() */
>>>
>>> Given an smp_mb() above, I don't understand the need for this smp_wmb().
>>> Isn't the idea that if the writer sees ->reader_refcnt decremented to
>>> zero, it also needs to see the effects of the corresponding reader's
>>> critical section?
>>>
>>
>> Not sure what you meant, but my idea here was that the writer should see
>> the reader_refcnt falling to zero as soon as possible, to avoid keeping the
>> writer waiting in a tight loop for longer than necessary.
>> I might have been a little over-zealous to use lighter memory barriers though,
>> (given our lengthy discussions in the previous versions to reduce the memory
>> barrier overheads), so the smp_wmb() used above might be wrong.
>>
>> So, are you saying that the smp_mb() you indicated above would be enough
>> to make the writer observe the 1->0 transition of reader_refcnt immediately?
>>
>>> Or am I missing something subtle here?  In any case, if this smp_wmb()
>>> really is needed, there should be some subsequent write that the writer
>>> might observe.  From what I can see, there is no subsequent write from
>>> this reader that the writer cares about.
>>
>> I thought the smp_wmb() here and the smp_rmb() at the writer would ensure
>> immediate reflection of the reader state at the writer side... Please correct
>> me if my understanding is incorrect.
> 
> Ah, but memory barriers are not so much about making data move faster
> through the machine, but more about making sure that ordering constraints
> are met.  After all, memory barriers cannot make electrons flow faster
> through silicon.  You should therefore use memory barriers only to
> constrain ordering, not to try to expedite electrons.
>

I guess I must have been confused after looking at that graph which showed
how much time it takes for other CPUs to notice the change in value of a
variable performed in a given CPU.. and must have gotten the (wrong) idea
that memory barriers also help speed that up! Very sorry about that!
 
>>>> +	} else {
>>>> +		read_unlock(&pcpu_rwlock->global_rwlock);
>>>> +	}
>>>> +
>>>> +	preempt_enable();
>>>> +}
>>>> +
>>>> +static inline void raise_writer_signal(struct percpu_rwlock *pcpu_rwlock,
>>>> +				       unsigned int cpu)
>>>> +{
>>>> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = true;
>>>> +}
>>>> +
>>>> +static inline void drop_writer_signal(struct percpu_rwlock *pcpu_rwlock,
>>>> +				      unsigned int cpu)
>>>> +{
>>>> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = false;
>>>> +}
>>>> +
>>>> +static void announce_writer_active(struct percpu_rwlock *pcpu_rwlock)
>>>> +{
>>>> +	unsigned int cpu;
>>>> +
>>>> +	for_each_online_cpu(cpu)
>>>> +		raise_writer_signal(pcpu_rwlock, cpu);
>>>> +
>>>> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
>>>> +}
>>>> +
>>>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
>>>> +{
>>>> +	unsigned int cpu;
>>>> +
>>>> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
>>>
>>> Why do we drop ourselves twice?  More to the point, why is it important to
>>> drop ourselves first?
>>
>> I don't see where we are dropping ourselves twice. Note that we are no longer
>> in the cpu_online_mask, so the 'for' loop below won't include us. So we need
>> to manually drop ourselves. It doesn't matter whether we drop ourselves first
>> or later.
> 
> Good point, apologies for my confusion!  Still worth a commment, though.
> 

Sure, will add it.

>>>> +
>>>> +	for_each_online_cpu(cpu)
>>>> +		drop_writer_signal(pcpu_rwlock, cpu);
>>>> +
>>>> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
>>>> +}
>>>> +
>>>> +/*
>>>> + * Wait for the reader to see the writer's signal and switch from percpu
>>>> + * refcounts to global rwlock.
>>>> + *
>>>> + * If the reader is still using percpu refcounts, wait for him to switch.
>>>> + * Else, we can safely go ahead, because either the reader has already
>>>> + * switched over, or the next reader that comes along on that CPU will
>>>> + * notice the writer's signal and will switch over to the rwlock.
>>>> + */
>>>> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
>>>> +			       unsigned int cpu)
>>>> +{
>>>> +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
>>>
>>> As I understand it, the purpose of this memory barrier is to ensure
>>> that the stores in drop_writer_signal() happen before the reads from
>>> ->reader_refcnt in reader_uses_percpu_refcnt(),
>>
>> No, that was not what I intended. announce_writer_inactive() already does
>> a full smp_mb() after calling drop_writer_signal().
>>
>> I put the smp_rmb() here and the smp_wmb() at the reader side (after updates
>> to the ->reader_refcnt) to reflect the state change of ->reader_refcnt
>> immediately at the writer, so that the writer doesn't have to keep spinning
>> unnecessarily still referring to the old (non-zero) value of ->reader_refcnt.
>> Or perhaps I am confused about how to use memory barriers properly.. :-(
> 
> Sadly, no, memory barriers don't make electrons move faster.  So you
> should only need the one -- the additional memory barriers are just
> slowing things down.
> 

Ok..

>>> thus preventing the
>>> race between a new reader attempting to use the fastpath and this writer
>>> acquiring the lock.  Unless I am confused, this must be smp_mb() rather
>>> than smp_rmb().
>>>
>>> Also, why not just have a single smp_mb() at the beginning of
>>> sync_all_readers() instead of executing one barrier per CPU?
>>
>> Well, since my intention was to help the writer see the update (->reader_refcnt
>> dropping to zero) ASAP, I kept the multiple smp_rmb()s.
> 
> At least you were consistent.  ;-)
>

Haha, that's an optimistic way of looking at it, but its no good if I was
consistently _wrong_! ;-)

>>>> +
>>>> +	while (reader_uses_percpu_refcnt(pcpu_rwlock, cpu))
>>>> +		cpu_relax();
>>>> +}
>>>> +
>>>> +static void sync_all_readers(struct percpu_rwlock *pcpu_rwlock)
>>>> +{
>>>> +	unsigned int cpu;
>>>> +
>>>> +	for_each_online_cpu(cpu)
>>>> +		sync_reader(pcpu_rwlock, cpu);
>>>>  }
>>>>
>>>>  void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
>>>>  {
>>>> +	/*
>>>> +	 * Tell all readers that a writer is becoming active, so that they
>>>> +	 * start switching over to the global rwlock.
>>>> +	 */
>>>> +	announce_writer_active(pcpu_rwlock);
>>>> +	sync_all_readers(pcpu_rwlock);
>>>>  	write_lock(&pcpu_rwlock->global_rwlock);
>>>>  }
>>>>
>>>>  void percpu_write_unlock(struct percpu_rwlock *pcpu_rwlock)
>>>>  {
>>>> +	/*
>>>> +	 * Inform all readers that we are done, so that they can switch back
>>>> +	 * to their per-cpu refcounts. (We don't need to wait for them to
>>>> +	 * see it).
>>>> +	 */
>>>> +	announce_writer_inactive(pcpu_rwlock);
>>>>  	write_unlock(&pcpu_rwlock->global_rwlock);
>>>>  }
>>>>
>>>>
>>
>> Thanks a lot for your detailed review and comments! :-)
> 
> It will be good to get this in!
>

Thank you :-) I'll try to address the review comments and respin the
patchset soon.

Regards,
Srivatsa S. Bhat
Srivatsa S. Bhat - Feb. 10, 2013, 8:09 p.m.
On 02/11/2013 01:20 AM, Oleg Nesterov wrote:
> On 02/11, Srivatsa S. Bhat wrote:
>>
>> On 02/10/2013 11:36 PM, Oleg Nesterov wrote:
>>>>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
>>>>> +{
>>>>> +   unsigned int cpu;
>>>>> +
>>>>> +   drop_writer_signal(pcpu_rwlock, smp_processor_id());
>>>>
>>>> Why do we drop ourselves twice?  More to the point, why is it important to
>>>> drop ourselves first?
>>>
>>> And don't we need mb() _before_ we clear ->writer_signal ?
>>>
>>
>> Oh, right! Or, how about moving announce_writer_inactive() to _after_
>> write_unlock()?
> 
> Not sure this will help... but, either way it seems we have another
> problem...
> 
> percpu_rwlock tries to be "generic". This means we should "ignore" its
> usage in hotplug, and _write_lock should not race with _write_unlock.
> 

Yes, good point!

> IOW. Suppose that _write_unlock clears ->writer_signal. We need to ensure
> that this can't race with another write which wants to set this flag.
> Perhaps it should be counter as well, and it should be protected by
> the same ->global_rwlock, but _write_lock() should drop it before
> sync_all_readers() and then take it again?

Hmm, or we could just add an extra mb() like you suggested, and keep it
simple...

> 
>>>>> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
>>>>> +			       unsigned int cpu)
>>>>> +{
>>>>> +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
>>>>
>>>> As I understand it, the purpose of this memory barrier is to ensure
>>>> that the stores in drop_writer_signal() happen before the reads from
>>>> ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
>>>> race between a new reader attempting to use the fastpath and this writer
>>>> acquiring the lock.  Unless I am confused, this must be smp_mb() rather
>>>> than smp_rmb().
>>>
>>> And note that before sync_reader() we call announce_writer_active() which
>>> already adds mb() before sync_all_readers/sync_reader, so this rmb() looks
>>> unneeded.
>>>
>>
>> My intention was to help the writer see the ->reader_refcnt drop to zero
>> ASAP; hence I used smp_wmb() at reader and smp_rmb() here at the writer.
> 
> Hmm, interesting... Not sure, but can't really comment. However I can
> answer your next question:
>

Paul told me in another mail that I was expecting too much out of memory
barriers, like increasing the speed of electrons and what not ;-)
[ It would have been cool though, if it had such magical powers :P ]
 
>> Please correct me if my understanding of memory barriers is wrong here..
> 
> Who? Me??? No we have paulmck for that ;)
>

Haha ;-)

Regards,
Srivatsa S. Bhat
Oleg Nesterov - Feb. 10, 2013, 8:13 p.m.
On 02/11, Srivatsa S. Bhat wrote:
>
> On 02/09/2013 04:40 AM, Paul E. McKenney wrote:
> >> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
> >> +{
> >> +	unsigned int cpu;
> >> +
> >> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
> >
> > Why do we drop ourselves twice?  More to the point, why is it important to
> > drop ourselves first?
> >
>
> I don't see where we are dropping ourselves twice. Note that we are no longer
> in the cpu_online_mask, so the 'for' loop below won't include us. So we need
> to manually drop ourselves. It doesn't matter whether we drop ourselves first
> or later.

Yes, but this just reflects its usage in cpu-hotplug. cpu goes away under
_write_lock.

Perhaps _write_lock/unlock shoud use for_each_possible_cpu() instead?

Hmm... I think this makes sense anyway. Otherwise, in theory,
percpu_write_lock(random_non_hotplug_lock) can race with cpu_up?

Oleg.
Srivatsa S. Bhat - Feb. 10, 2013, 8:20 p.m.
On 02/11/2013 01:43 AM, Oleg Nesterov wrote:
> On 02/11, Srivatsa S. Bhat wrote:
>>
>> On 02/09/2013 04:40 AM, Paul E. McKenney wrote:
>>>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
>>>> +{
>>>> +	unsigned int cpu;
>>>> +
>>>> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
>>>
>>> Why do we drop ourselves twice?  More to the point, why is it important to
>>> drop ourselves first?
>>>
>>
>> I don't see where we are dropping ourselves twice. Note that we are no longer
>> in the cpu_online_mask, so the 'for' loop below won't include us. So we need
>> to manually drop ourselves. It doesn't matter whether we drop ourselves first
>> or later.
> 
> Yes, but this just reflects its usage in cpu-hotplug. cpu goes away under
> _write_lock.
> 

Ah, right. I guess the code still has remnants from the older version in which
this locking scheme wasn't generic and was tied to cpu-hotplug alone..

> Perhaps _write_lock/unlock shoud use for_each_possible_cpu() instead?
> 

Hmm, that wouldn't be too bad.

> Hmm... I think this makes sense anyway. Otherwise, in theory,
> percpu_write_lock(random_non_hotplug_lock) can race with cpu_up?
> 

Yeah, makes sense. Will change it to for_each_possible_cpu().
And I had previously fixed such races with lglocks with a complicated scheme (to
avoid the costly for_each_possible loop), which was finally rewritten to use
for_each_possible_cpu() for the sake of simplicity..
Regards,
Srivatsa S. Bhat
Paul E. McKenney - Feb. 10, 2013, 10:13 p.m.
On Mon, Feb 11, 2013 at 01:39:24AM +0530, Srivatsa S. Bhat wrote:
> On 02/11/2013 01:20 AM, Oleg Nesterov wrote:
> > On 02/11, Srivatsa S. Bhat wrote:
> >>
> >> On 02/10/2013 11:36 PM, Oleg Nesterov wrote:
> >>>>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
> >>>>> +{
> >>>>> +   unsigned int cpu;
> >>>>> +
> >>>>> +   drop_writer_signal(pcpu_rwlock, smp_processor_id());
> >>>>
> >>>> Why do we drop ourselves twice?  More to the point, why is it important to
> >>>> drop ourselves first?
> >>>
> >>> And don't we need mb() _before_ we clear ->writer_signal ?
> >>>
> >>
> >> Oh, right! Or, how about moving announce_writer_inactive() to _after_
> >> write_unlock()?
> > 
> > Not sure this will help... but, either way it seems we have another
> > problem...
> > 
> > percpu_rwlock tries to be "generic". This means we should "ignore" its
> > usage in hotplug, and _write_lock should not race with _write_unlock.
> > 
> 
> Yes, good point!
> 
> > IOW. Suppose that _write_unlock clears ->writer_signal. We need to ensure
> > that this can't race with another write which wants to set this flag.
> > Perhaps it should be counter as well, and it should be protected by
> > the same ->global_rwlock, but _write_lock() should drop it before
> > sync_all_readers() and then take it again?
> 
> Hmm, or we could just add an extra mb() like you suggested, and keep it
> simple...
> 
> > 
> >>>>> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
> >>>>> +			       unsigned int cpu)
> >>>>> +{
> >>>>> +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
> >>>>
> >>>> As I understand it, the purpose of this memory barrier is to ensure
> >>>> that the stores in drop_writer_signal() happen before the reads from
> >>>> ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
> >>>> race between a new reader attempting to use the fastpath and this writer
> >>>> acquiring the lock.  Unless I am confused, this must be smp_mb() rather
> >>>> than smp_rmb().
> >>>
> >>> And note that before sync_reader() we call announce_writer_active() which
> >>> already adds mb() before sync_all_readers/sync_reader, so this rmb() looks
> >>> unneeded.
> >>>
> >>
> >> My intention was to help the writer see the ->reader_refcnt drop to zero
> >> ASAP; hence I used smp_wmb() at reader and smp_rmb() here at the writer.
> > 
> > Hmm, interesting... Not sure, but can't really comment. However I can
> > answer your next question:
> 
> Paul told me in another mail that I was expecting too much out of memory
> barriers, like increasing the speed of electrons and what not ;-)
> [ It would have been cool though, if it had such magical powers :P ]

"But because you have used the special mb_tachyonic instruction, the
speed of light is 600,000 km/s for the next five clock cycles"...  ;-)

							Thanx, Paul

> >> Please correct me if my understanding of memory barriers is wrong here..
> > 
> > Who? Me??? No we have paulmck for that ;)
> >
> 
> Haha ;-)
> 
> Regards,
> Srivatsa S. Bhat
>
Paul E. McKenney - Feb. 12, 2013, 4:15 p.m.
On Sun, Feb 10, 2013 at 11:54:17AM -0800, Paul E. McKenney wrote:
> On Sun, Feb 10, 2013 at 07:06:07PM +0100, Oleg Nesterov wrote:
> > On 02/08, Paul E. McKenney wrote:
> 
> [ . . . ]
> 
> > > > +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
> > > > +			       unsigned int cpu)
> > > > +{
> > > > +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
> > >
> > > As I understand it, the purpose of this memory barrier is to ensure
> > > that the stores in drop_writer_signal() happen before the reads from
> > > ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
> > > race between a new reader attempting to use the fastpath and this writer
> > > acquiring the lock.  Unless I am confused, this must be smp_mb() rather
> > > than smp_rmb().
> > 
> > And note that before sync_reader() we call announce_writer_active() which
> > already adds mb() before sync_all_readers/sync_reader, so this rmb() looks
> > unneeded.
> > 
> > But, at the same time, could you confirm that we do not need another mb()
> > after sync_all_readers() in percpu_write_lock() ? I mean, without mb(),
> > can't this reader_uses_percpu_refcnt() LOAD leak into the critical section
> > protected by ->global_rwlock? Then this LOAD can be re-ordered with other
> > memory operations done by the writer.
> 
> As soon as I get the rest of the way through Thomas's patchset.  ;-)

There is a memory barrier associated with write_lock(), but it is
only required to keep the critical section inside the lock -- and is
permitted to allow stuff outside of the lock to be reordered into the
critical section.  So I believe we do indeed need an smp_mb() between
sync_all_readers() and write_lock() in percpu_write_lock().

Good eyes, Oleg!

							Thanx, Paul

Patch

diff --git a/include/linux/percpu-rwlock.h b/include/linux/percpu-rwlock.h
index 8dec8fe..6819bb8 100644
--- a/include/linux/percpu-rwlock.h
+++ b/include/linux/percpu-rwlock.h
@@ -68,4 +68,14 @@  extern void percpu_free_rwlock(struct percpu_rwlock *);
 	__percpu_init_rwlock(pcpu_rwlock, #pcpu_rwlock, &rwlock_key);	\
 })
 
+#define reader_uses_percpu_refcnt(pcpu_rwlock, cpu)			\
+		(ACCESS_ONCE(per_cpu(*((pcpu_rwlock)->reader_refcnt), cpu)))
+
+#define reader_nested_percpu(pcpu_rwlock)				\
+			(__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) > 1)
+
+#define writer_active(pcpu_rwlock)					\
+			(__this_cpu_read(*((pcpu_rwlock)->writer_signal)))
+
 #endif
+
diff --git a/lib/percpu-rwlock.c b/lib/percpu-rwlock.c
index 80dad93..992da5c 100644
--- a/lib/percpu-rwlock.c
+++ b/lib/percpu-rwlock.c
@@ -64,21 +64,145 @@  void percpu_free_rwlock(struct percpu_rwlock *pcpu_rwlock)
 
 void percpu_read_lock(struct percpu_rwlock *pcpu_rwlock)
 {
-	read_lock(&pcpu_rwlock->global_rwlock);
+	preempt_disable();
+
+	/* First and foremost, let the writer know that a reader is active */
+	this_cpu_inc(*pcpu_rwlock->reader_refcnt);
+
+	/*
+	 * If we are already using per-cpu refcounts, it is not safe to switch
+	 * the synchronization scheme. So continue using the refcounts.
+	 */
+	if (reader_nested_percpu(pcpu_rwlock)) {
+		goto out;
+	} else {
+		/*
+		 * The write to 'reader_refcnt' must be visible before we
+		 * read 'writer_signal'.
+		 */
+		smp_mb(); /* Paired with smp_rmb() in sync_reader() */
+
+		if (likely(!writer_active(pcpu_rwlock))) {
+			goto out;
+		} else {
+			/* Writer is active, so switch to global rwlock. */
+			read_lock(&pcpu_rwlock->global_rwlock);
+
+			/*
+			 * We might have raced with a writer going inactive
+			 * before we took the read-lock. So re-evaluate whether
+			 * we still need to hold the rwlock or if we can switch
+			 * back to per-cpu refcounts. (This also helps avoid
+			 * heterogeneous nesting of readers).
+			 */
+			if (writer_active(pcpu_rwlock))
+				this_cpu_dec(*pcpu_rwlock->reader_refcnt);
+			else
+				read_unlock(&pcpu_rwlock->global_rwlock);
+		}
+	}
+
+out:
+	/* Prevent reordering of any subsequent reads */
+	smp_rmb();
 }
 
 void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock)
 {
-	read_unlock(&pcpu_rwlock->global_rwlock);
+	/*
+	 * We never allow heterogeneous nesting of readers. So it is trivial
+	 * to find out the kind of reader we are, and undo the operation
+	 * done by our corresponding percpu_read_lock().
+	 */
+	if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) {
+		this_cpu_dec(*pcpu_rwlock->reader_refcnt);
+		smp_wmb(); /* Paired with smp_rmb() in sync_reader() */
+	} else {
+		read_unlock(&pcpu_rwlock->global_rwlock);
+	}
+
+	preempt_enable();
+}
+
+static inline void raise_writer_signal(struct percpu_rwlock *pcpu_rwlock,
+				       unsigned int cpu)
+{
+	per_cpu(*pcpu_rwlock->writer_signal, cpu) = true;
+}
+
+static inline void drop_writer_signal(struct percpu_rwlock *pcpu_rwlock,
+				      unsigned int cpu)
+{
+	per_cpu(*pcpu_rwlock->writer_signal, cpu) = false;
+}
+
+static void announce_writer_active(struct percpu_rwlock *pcpu_rwlock)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		raise_writer_signal(pcpu_rwlock, cpu);
+
+	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
+}
+
+static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
+{
+	unsigned int cpu;
+
+	drop_writer_signal(pcpu_rwlock, smp_processor_id());
+
+	for_each_online_cpu(cpu)
+		drop_writer_signal(pcpu_rwlock, cpu);
+
+	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
+}
+
+/*
+ * Wait for the reader to see the writer's signal and switch from percpu
+ * refcounts to global rwlock.
+ *
+ * If the reader is still using percpu refcounts, wait for him to switch.
+ * Else, we can safely go ahead, because either the reader has already
+ * switched over, or the next reader that comes along on that CPU will
+ * notice the writer's signal and will switch over to the rwlock.
+ */
+static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
+			       unsigned int cpu)
+{
+	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
+
+	while (reader_uses_percpu_refcnt(pcpu_rwlock, cpu))
+		cpu_relax();
+}
+
+static void sync_all_readers(struct percpu_rwlock *pcpu_rwlock)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		sync_reader(pcpu_rwlock, cpu);
 }
 
 void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
 {
+	/*
+	 * Tell all readers that a writer is becoming active, so that they
+	 * start switching over to the global rwlock.
+	 */
+	announce_writer_active(pcpu_rwlock);
+	sync_all_readers(pcpu_rwlock);
 	write_lock(&pcpu_rwlock->global_rwlock);
 }
 
 void percpu_write_unlock(struct percpu_rwlock *pcpu_rwlock)
 {
+	/*
+	 * Inform all readers that we are done, so that they can switch back
+	 * to their per-cpu refcounts. (We don't need to wait for them to
+	 * see it).
+	 */
+	announce_writer_inactive(pcpu_rwlock);
 	write_unlock(&pcpu_rwlock->global_rwlock);
 }