
[RFC] powerpc/32: Switch VDSO to C implementation.

Message ID 8ce3582f7f7da9ff0286ced857e5aa2e5ae6746e.1571662378.git.christophe.leroy@c-s.fr
State Superseded
Series [RFC] powerpc/32: Switch VDSO to C implementation.

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch next (600802af9049be799465b24d14162918545634bf)
snowpatch_ozlabs/build-ppc64le fail build failed!
snowpatch_ozlabs/build-ppc64be fail build failed!
snowpatch_ozlabs/build-ppc64e fail build failed!
snowpatch_ozlabs/build-pmac32 success Build succeeded
snowpatch_ozlabs/checkpatch warning total: 0 errors, 0 warnings, 4 checks, 654 lines checked

Commit Message

Christophe Leroy Oct. 21, 2019, 12:53 p.m. UTC
This is a tentative patch to switch the powerpc/32 VDSO to the generic C implementation.
It will likely not work on 64 bits or even build properly at the moment.

powerpc is a bit special for the VDSO as well as for system calls in
that it requires setting the CR SO bit, which cannot be done in C.
Therefore, entry/exit and fallback need to be performed in ASM.

To allow that, the C fallbacks just return -1, and the ASM entry point
performs the system call when the C function returns -1.

The performance is rather disappointing. That's most likely because all
calculations in the C implementation are based on 64-bit math and
converted to 32 bits at the very end. I guess the C implementation should
use 32-bit math like the assembly VDSO does today.

With current powerpc/32 ASM VDSO:

gettimeofday:    vdso: 750 nsec/call
clock-getres-realtime:    vdso: 382 nsec/call
clock-gettime-realtime:    vdso: 928 nsec/call
clock-getres-monotonic:    vdso: 382 nsec/call
clock-gettime-monotonic:    vdso: 1033 nsec/call

Once switched to C implementation:

gettimeofday:    vdso: 1533 nsec/call
clock-getres-realtime:    vdso: 853 nsec/call
clock-gettime-realtime:    vdso: 1570 nsec/call
clock-getres-monotonic:    vdso: 835 nsec/call
clock-gettime-monotonic:    vdso: 1605 nsec/call

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/Kconfig                         |   2 +
 arch/powerpc/include/asm/vdso/gettimeofday.h |  81 ++++++++
 arch/powerpc/include/asm/vdso/vsyscall.h     |  27 +++
 arch/powerpc/include/asm/vdso_datapage.h     |  16 +-
 arch/powerpc/kernel/asm-offsets.c            |  26 +--
 arch/powerpc/kernel/time.c                   |  90 +--------
 arch/powerpc/kernel/vdso.c                   |  19 +-
 arch/powerpc/kernel/vdso32/Makefile          |  19 +-
 arch/powerpc/kernel/vdso32/gettimeofday.S    | 270 ++++++---------------------
 arch/powerpc/kernel/vdso32/vgettimeofday.c   |  32 ++++
 10 files changed, 233 insertions(+), 349 deletions(-)
 create mode 100644 arch/powerpc/include/asm/vdso/gettimeofday.h
 create mode 100644 arch/powerpc/include/asm/vdso/vsyscall.h
 create mode 100644 arch/powerpc/kernel/vdso32/vgettimeofday.c

Comments

Thomas Gleixner Oct. 21, 2019, 9:29 p.m. UTC | #1
On Mon, 21 Oct 2019, Christophe Leroy wrote:

> This is a tentative patch to switch the powerpc/32 VDSO to the generic C implementation.
> It will likely not work on 64 bits or even build properly at the moment.
> 
> powerpc is a bit special for the VDSO as well as for system calls in
> that it requires setting the CR SO bit, which cannot be done in C.
> Therefore, entry/exit and fallback need to be performed in ASM.
> 
> To allow that, the C fallbacks just return -1, and the ASM entry point
> performs the system call when the C function returns -1.

> The performance is rather disappointing. That's most likely because all
> calculations in the C implementation are based on 64-bit math and
> converted to 32 bits at the very end. I guess the C implementation should
> use 32-bit math like the assembly VDSO does today.

> gettimeofday:    vdso: 750 nsec/call
> 
> gettimeofday:    vdso: 1533 nsec/call

The only real 64bit math which can matter is the 64bit * 32bit multiply,
i.e.

static __always_inline
u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
{
        return ((cycles - last) & mask) * mult;
}

Everything else is trivial add/sub/shift, which should be roughly the same
in ASM.

Can you try to replace that with:

static __always_inline
u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
{
        u64 res, delta = ((cycles - last) & mask);
        u32 dh, dl;

        dl = delta;		/* low 32 bits of the delta */
        dh = delta >> 32;	/* high 32 bits, usually zero */

        res = mul_u32_u32(dl, mult);
        if (dh)
                res += mul_u32_u32(dh, mult) << 32;

        return res;
}

That's pretty much what __do_get_tspec does in ASM.
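
For reference, mul_u32_u32() is the kernel's 32x32->64 multiply helper
from include/linux/math64.h (architectures may override it with inline
asm); its generic form is just the widening multiply below:

static inline u64 mul_u32_u32(u32 a, u32 b)
{
	/* widening 32x32->64 multiply; a mullw/mulhwu pair on PPC32 */
	return (u64)a * b;
}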

Thanks,

	tglx
Christophe Leroy Oct. 22, 2019, 9:01 a.m. UTC | #2
Le 21/10/2019 à 23:29, Thomas Gleixner a écrit :
> On Mon, 21 Oct 2019, Christophe Leroy wrote:
> 
>> This is a tentative patch to switch the powerpc/32 VDSO to the generic C implementation.
>> It will likely not work on 64 bits or even build properly at the moment.
>>
>> powerpc is a bit special for the VDSO as well as for system calls in
>> that it requires setting the CR SO bit, which cannot be done in C.
>> Therefore, entry/exit and fallback need to be performed in ASM.
>>
>> To allow that, the C fallbacks just return -1, and the ASM entry point
>> performs the system call when the C function returns -1.
>>
>> The performance is rather disappointing. That's most likely because all
>> calculations in the C implementation are based on 64-bit math and
>> converted to 32 bits at the very end. I guess the C implementation should
>> use 32-bit math like the assembly VDSO does today.
> 
>> gettimeofday:    vdso: 750 nsec/call
>>
>> gettimeofday:    vdso: 1533 nsec/call

Small improvement (3%) with the proposed change:

gettimeofday:    vdso: 1485 nsec/call

Though still some way to go.

Christophe

> 
> The only real 64bit math which can matter is the 64bit * 32bit multiply,
> i.e.
> 
> static __always_inline
> u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
> {
>          return ((cycles - last) & mask) * mult;
> }
> 
> Everything else is trivial add/sub/shift, which should be roughly the same
> in ASM.
> 
> Can you try to replace that with:
> 
> static __always_inline
> u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
> {
>          u64 res, delta = ((cycles - last) & mask);
>          u32 dh, dl;
> 
>          dl = delta;
>          dh = delta >> 32;
> 
>          res = mul_u32_u32(dl, mult);
>          if (dh)
>                  res += mul_u32_u32(dh, mult) << 32;
> 
>          return res;
> }
> 
> That's pretty much what __do_get_tspec does in ASM.
> 
> Thanks,
> 
> 	tglx
>
Christophe Leroy Oct. 22, 2019, 1:56 p.m. UTC | #3
Le 22/10/2019 à 11:01, Christophe Leroy a écrit :
> 
> 
> Le 21/10/2019 à 23:29, Thomas Gleixner a écrit :
>> On Mon, 21 Oct 2019, Christophe Leroy wrote:
>>
>>> This is a tentative patch to switch the powerpc/32 VDSO to the generic
>>> C implementation.
>>> It will likely not work on 64 bits or even build properly at the moment.
>>>
>>> powerpc is a bit special for the VDSO as well as for system calls in
>>> that it requires setting the CR SO bit, which cannot be done in C.
>>> Therefore, entry/exit and fallback need to be performed in ASM.
>>>
>>> To allow that, the C fallbacks just return -1, and the ASM entry point
>>> performs the system call when the C function returns -1.
>>>
>>> The performance is rather disappointing. That's most likely because all
>>> calculations in the C implementation are based on 64-bit math and
>>> converted to 32 bits at the very end. I guess the C implementation should
>>> use 32-bit math like the assembly VDSO does today.
>>
>>> gettimeofday:    vdso: 750 nsec/call
>>>
>>> gettimeofday:    vdso: 1533 nsec/call
> 
> Small improvement (3%) with the proposed change:
> 
> gettimeofday:    vdso: 1485 nsec/call

By inlining do_hres() I get the following:

gettimeofday:    vdso: 1072 nsec/call

Christophe

> 
> Though still some way to go.
> 
> Christophe
> 
>>
>> The only real 64bit math which can matter is the 64bit * 32bit multiply,
>> i.e.
>>
>> static __always_inline
>> u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
>> {
>>          return ((cycles - last) & mask) * mult;
>> }
>>
>> Everything else is trivial add/sub/shift, which should be roughly the 
>> same
>> in ASM.
>>
>> Can you try to replace that with:
>>
>> static __always_inline
>> u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
>> {
>>          u64 res, delta = ((cycles - last) & mask);
>>          u32 dh, dl;
>>
>>          dl = delta;
>>          dh = delta >> 32;
>>
>>          res = mul_u32_u32(dl, mult);
>>          if (dh)
>>                  res += mul_u32_u32(dh, mult) << 32;
>>
>>          return res;
>> }
>>
>> That's pretty much what __do_get_tspec does in ASM.
>>
>> Thanks,
>>
>>     tglx
>>
Andy Lutomirski Oct. 26, 2019, 1:55 p.m. UTC | #4
On Tue, Oct 22, 2019 at 6:56 AM Christophe Leroy
<christophe.leroy@c-s.fr> wrote:
>
>
> >>> The performance is rather disappointing. That's most likely because all
> >>> calculations in the C implementation are based on 64-bit math and
> >>> converted to 32 bits at the very end. I guess the C implementation should
> >>> use 32-bit math like the assembly VDSO does today.
> >>
> >>> gettimeofday:    vdso: 750 nsec/call
> >>>
> >>> gettimeofday:    vdso: 1533 nsec/call
> >
> > Small improvement (3%) with the proposed change:
> >
> > gettimeofday:    vdso: 1485 nsec/call
>
> By inlining do_hres() I get the following:
>
> gettimeofday:    vdso: 1072 nsec/call
>

A perf report might be informative.
Thomas Gleixner Oct. 26, 2019, 3:53 p.m. UTC | #5
On Tue, 22 Oct 2019, Christophe Leroy wrote:
> Le 22/10/2019 à 11:01, Christophe Leroy a écrit :
> > Le 21/10/2019 à 23:29, Thomas Gleixner a écrit :
> > > On Mon, 21 Oct 2019, Christophe Leroy wrote:
> > > 
> > > > This is a tentative patch to switch the powerpc/32 VDSO to the generic
> > > > C implementation.
> > > > It will likely not work on 64 bits or even build properly at the moment.
> > > > 
> > > > powerpc is a bit special for the VDSO as well as for system calls in
> > > > that it requires setting the CR SO bit, which cannot be done in C.
> > > > Therefore, entry/exit and fallback need to be performed in ASM.
> > > > 
> > > > To allow that, the C fallbacks just return -1, and the ASM entry point
> > > > performs the system call when the C function returns -1.
> > > > 
> > > > The performance is rather disappointing. That's most likely because all
> > > > calculations in the C implementation are based on 64-bit math and
> > > > converted to 32 bits at the very end. I guess the C implementation should
> > > > use 32-bit math like the assembly VDSO does today.
> > > 
> > > > gettimeofday:    vdso: 750 nsec/call
> > > > 
> > > > gettimeofday:    vdso: 1533 nsec/call
> > 
> > Small improvement (3%) with the proposed change:
> > 
> > gettimeofday:    vdso: 1485 nsec/call
> 
> By inlining do_hres() I get the following:
> 
> gettimeofday:    vdso: 1072 nsec/call

What's the effect for clock_gettime()?

gettimeofday() is suboptimal vs. the PPC ASM variant due to an extra
division, but clock_gettime() should be 1:1 comparable.

Thanks,

	tglx
Christophe Leroy Oct. 26, 2019, 3:54 p.m. UTC | #6
Le 26/10/2019 à 15:55, Andy Lutomirski a écrit :
> On Tue, Oct 22, 2019 at 6:56 AM Christophe Leroy
> <christophe.leroy@c-s.fr> wrote:
>>
>>
>>>>> The performance is rather disappointing. That's most likely because all
>>>>> calculations in the C implementation are based on 64-bit math and
>>>>> converted to 32 bits at the very end. I guess the C implementation should
>>>>> use 32-bit math like the assembly VDSO does today.
>>>>
>>>>> gettimeofday:    vdso: 750 nsec/call
>>>>>
>>>>> gettimeofday:    vdso: 1533 nsec/call
>>>
>>> Small improvement (3%) with the proposed change:
>>>
>>> gettimeofday:    vdso: 1485 nsec/call
>>
>> By inlining do_hres() I get the following:
>>
>> gettimeofday:    vdso: 1072 nsec/call
>>
> 
> A perf report might be informative.
> 

Not sure there is much to learn from perf report:

With the original RFC:

     51.83%  test_vdso  [vdso]             [.] do_hres
     24.86%  test_vdso  [vdso]             [.] __c_kernel_gettimeofday
      7.33%  test_vdso  [vdso]             [.] __kernel_gettimeofday
      5.77%  test_vdso  test_vdso          [.] main
      1.55%  test_vdso  [kernel.kallsyms]  [k] copy_page
      0.67%  test_vdso  libc-2.23.so       [.] _dl_addr
      0.55%  test_vdso  ld-2.23.so         [.] do_lookup_x

With vdso_calc_delta() optimised as suggested by Thomas + inlined do_hres():

     68.00%  test_vdso  [vdso]             [.] __c_kernel_gettimeofday
     12.59%  test_vdso  [vdso]             [.] __kernel_gettimeofday
      6.22%  test_vdso  test_vdso          [.] main
      2.07%  test_vdso  [kernel.kallsyms]  [k] copy_page
      1.04%  test_vdso  ld-2.23.so         [.] _dl_relocate_object
      0.89%  test_vdso  ld-2.23.so         [.] do_lookup_x

I've tried 'perf annotate', but I have not found how to tell perf to use
the vdso32.so.dbg file to annotate [vdso].

Test app:

#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/time.h>

static int (*gettimeofday_vdso)(struct timeval *tv, struct timezone *tz);

int main(int argc, char **argv)
{
	void *handle = dlopen("linux-vdso32.so.1", RTLD_NOW | RTLD_GLOBAL);
	struct timeval tv;
	struct timezone tz;
	int i;

	(void)dlerror();

	gettimeofday_vdso = dlsym(handle, "__kernel_gettimeofday");
	if (dlerror() || !gettimeofday_vdso)
		return 1;	/* avoid calling through a NULL pointer */

	for (i = 0; i < 100000; i++)
		gettimeofday_vdso(&tv, &tz);
}


Christophe
Christophe Leroy Oct. 26, 2019, 4:06 p.m. UTC | #7
Le 26/10/2019 à 17:53, Thomas Gleixner a écrit :
> On Tue, 22 Oct 2019, Christophe Leroy wrote:
>> Le 22/10/2019 à 11:01, Christophe Leroy a écrit :
>>> Le 21/10/2019 à 23:29, Thomas Gleixner a écrit :
>>>> On Mon, 21 Oct 2019, Christophe Leroy wrote:
>>>>
>>>>> This is a tentative patch to switch the powerpc/32 VDSO to the generic
>>>>> C implementation.
>>>>> It will likely not work on 64 bits or even build properly at the moment.
>>>>>
>>>>> powerpc is a bit special for the VDSO as well as for system calls in
>>>>> that it requires setting the CR SO bit, which cannot be done in C.
>>>>> Therefore, entry/exit and fallback need to be performed in ASM.
>>>>>
>>>>> To allow that, the C fallbacks just return -1, and the ASM entry point
>>>>> performs the system call when the C function returns -1.
>>>>>
>>>>> The performance is rather disappointing. That's most likely because all
>>>>> calculations in the C implementation are based on 64-bit math and
>>>>> converted to 32 bits at the very end. I guess the C implementation should
>>>>> use 32-bit math like the assembly VDSO does today.
>>>>
>>>>> gettimeofday:    vdso: 750 nsec/call
>>>>>
>>>>> gettimeofday:    vdso: 1533 nsec/call
>>>
>>> Small improvement (3%) with the proposed change:
>>>
>>> gettimeofday:    vdso: 1485 nsec/call
>>
>> By inlining do_hres() I get the following:
>>
>> gettimeofday:    vdso: 1072 nsec/call
> 
> What's the effect for clock_gettime()?
> 
> gettimeofday() is suboptimal vs. the PPC ASM variant due to an extra
> division, but clock_gettime() should be 1:1 comparable.
> 

Original PPC asm:
clock-gettime-realtime:    vdso: 928 nsec/call

My original RFC:
clock-gettime-realtime:    vdso: 1570 nsec/call

With your suggested vdso_calc_delta():
clock-gettime-realtime:    vdso: 1512 nsec/call

With your vdso_calc_delta() and inlined do_hres():
clock-gettime-realtime:    vdso: 1302 nsec/call

Christophe
Thomas Gleixner Oct. 26, 2019, 6:48 p.m. UTC | #8
On Sat, 26 Oct 2019, Christophe Leroy wrote:
> Le 26/10/2019 à 17:53, Thomas Gleixner a écrit :
> > > > > > gettimeofday:    vdso: 750 nsec/call
> > > > > > 
> > > > > > gettimeofday:    vdso: 1533 nsec/call
> > > > 
> > > > Small improvement (3%) with the proposed change:
> > > > 
> > > > gettimeofday:    vdso: 1485 nsec/call
> > > 
> > > By inlining do_hres() I get the following:
> > > 
> > > gettimeofday:    vdso: 1072 nsec/call
> > 
> > What's the effect for clock_gettime()?
> > 
> > gettimeofday() is suboptimal vs. the PPC ASM variant due to an extra
> > division, but clock_gettime() should be 1:1 comparable.
> > 
> 
> Original PPC asm:
> clock-gettime-realtime:    vdso: 928 nsec/call
> 
> My original RFC:
> clock-gettime-realtime:    vdso: 1570 nsec/call
> 
> With your suggested vdso_calc_delta():
> clock-gettime-realtime:    vdso: 1512 nsec/call
> 
> With your vdso_calc_delta() and inlined do_hres():
> clock-gettime-realtime:    vdso: 1302 nsec/call

That does not make any sense at all.

gettimeofday() is basically the same as clock_gettime(REALTIME) and does
an extra division. So I would expect it to be slower. 

Let's look at the code:

__cvdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
{
        const struct vdso_data *vd = __arch_get_vdso_data();

        if (likely(tv != NULL)) {
		struct __kernel_timespec ts;

                if (do_hres(&vd[CS_HRES_COARSE], CLOCK_REALTIME, &ts))
                        return gettimeofday_fallback(tv, tz);

                tv->tv_sec = ts.tv_sec;
                tv->tv_usec = (u32)ts.tv_nsec / NSEC_PER_USEC;

IIRC PPC did some magic math tricks to avoid that. Could you just for the
fun of it replace this division with

       (u32)ts.tv_nsec >> 10;

That's obviously incorrect, but it would tell us how heavy the division
is. If that brings us close we might do something special for
gettimeofday().

OTOH, last time I checked clock_gettime() was by far more used than
gettimeofday() but that obviously depends on the use cases.

        }
	...
}

and

__cvdso_clock_gettime_common(clockid_t clock, struct __kernel_timespec *ts)
{
        const struct vdso_data *vd = __arch_get_vdso_data();
        u32 msk;

	/* Check for negative values or invalid clocks */
        if (unlikely((u32) clock >= MAX_CLOCKS))
                return -1;

        /*
         * Convert the clockid to a bitmask and use it to check which
         * clocks are handled in the VDSO directly.
         */
        msk = 1U << clock;
        if (likely(msk & VDSO_HRES)) {
		return do_hres(&vd[CS_HRES_COARSE], clock, ts);

So this is the extra code which is executed for clock_gettime(REAL) which
is pure logic and certainly not as heavyweight as the division in the
gettimeofday() code path.

}

static __maybe_unused int
__cvdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
{
        int ret = __cvdso_clock_gettime_common(clock, ts);

	if (unlikely(ret))
                return clock_gettime_fallback(clock, ts);
        return 0;
}

One thing which might be worth trying as well is to mark all functions in
that file as inline. The speedup from the do_hres() inlining was impressive
on PPC.

Thanks,

	tglx
Segher Boessenkool Oct. 26, 2019, 11:06 p.m. UTC | #9
On Sat, Oct 26, 2019 at 08:48:27PM +0200, Thomas Gleixner wrote:
> On Sat, 26 Oct 2019, Christophe Leroy wrote:
> Let's look at the code:
> 
> __cvdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
> {
>         const struct vdso_data *vd = __arch_get_vdso_data();
> 
>         if (likely(tv != NULL)) {
> 		struct __kernel_timespec ts;
> 
>                 if (do_hres(&vd[CS_HRES_COARSE], CLOCK_REALTIME, &ts))
>                         return gettimeofday_fallback(tv, tz);
> 
>                 tv->tv_sec = ts.tv_sec;
>                 tv->tv_usec = (u32)ts.tv_nsec / NSEC_PER_USEC;
> 
> IIRC PPC did some magic math tricks to avoid that. Could you just for the
> fun of it replace this division with
> 
>        (u32)ts.tv_nsec >> 10;

On this particular CPU (the 885, right?) a division by 1000 is just 9
cycles.  On other CPUs it can be more, say 19 cycles like on the 750; not
cheap at all, but not hugely expensive either, comparatively.

(A 64/32->32 division is expensive on all 32-bit PowerPC: there is no
hardware help for it at all, so it's all done in software.)

Of course the compiler won't do a division by a constant with a division
instruction at all, so it's somewhat cheaper even, 5 or 6 cycles or so.

> One thing which might be worth trying as well is to mark all functions in
> that file as inline. The speedup from the do_hres() inlining was impressive
> on PPC.

The hand-optimised asm code will pretty likely win handsomely, whatever
you do.  Especially on cores like the 885 (no branch prediction, single
issue, small caches, etc.: every instruction counts).

Is there any reason to replace this hand-optimised code?  It was written
for exactly this reason: these functions are critical and should be as
fast as possible.


Segher
Christophe Leroy Oct. 27, 2019, 9:21 a.m. UTC | #10
Le 27/10/2019 à 01:06, Segher Boessenkool a écrit :
> On Sat, Oct 26, 2019 at 08:48:27PM +0200, Thomas Gleixner wrote:
>> On Sat, 26 Oct 2019, Christophe Leroy wrote:
>> Let's look at the code:
>>
>> __cvdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
>> {
>>          const struct vdso_data *vd = __arch_get_vdso_data();
>>
>>          if (likely(tv != NULL)) {
>> 		struct __kernel_timespec ts;
>>
>>                  if (do_hres(&vd[CS_HRES_COARSE], CLOCK_REALTIME, &ts))
>>                          return gettimeofday_fallback(tv, tz);
>>
>>                  tv->tv_sec = ts.tv_sec;
>>                  tv->tv_usec = (u32)ts.tv_nsec / NSEC_PER_USEC;
>>
>> IIRC PPC did some magic math tricks to avoid that. Could you just for the
>> fun of it replace this division with
>>
>>         (u32)ts.tv_nsec >> 10;
> 
> On this particular CPU (the 885, right?) a division by 1000 is just 9
> cycles.  On other CPUs it can be more, say 19 cycles like on the 750; not
> cheap at all, but not hugely expensive either, comparatively.
> 
> (A 64/32->32 division is expensive on all 32-bit PowerPC: there is no
> hardware help for it at all, so it's all done in software.)
> 
> Of course the compiler won't do a division by a constant with a division
> instruction at all, so it's somewhat cheaper even, 5 or 6 cycles or so.
> 
>> One thing which might be worth trying as well is to mark all functions in
>> that file as inline. The speedup from the do_hres() inlining was impressive
>> on PPC.
> 
> The hand-optimised asm code will pretty likely win handsomely, whatever
> you do.  Especially on cores like the 885 (no branch prediction, single
> issue, small caches, etc.: every instruction counts).
> 
> Is there any reason to replace this hand-optimised code?  It was written
> for exactly this reason: these functions are critical and should be as
> fast as possible.

Well, all this started with COARSE clocks not being supported by PPC32 
VDSO. I first submitted a series with a set of optimisations including 
the implementation of COARSE clocks 
(https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=126779)

Then, after a comment from Santosh Sivaraj on patch 4 of the series
asking for a common implementation for PPC32 and PPC64, I started
looking into making the whole VDSO source code common to PPC32 and
PPC64. Most functions are similar. The time functions are also rather
similar but unfortunately don't use the same registers. They also don't 
cover all possible clocks. And getres() is also buggy, see series 
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=110321

So instead of reworking the existing time functions, I started
investigating whether we could plug powerpc into the generic
implementation. One drawback of PPC is that we need to set up an ASM
trampoline to handle the SO bit, as it can't be handled from C directly,
can it?

How critical are these functions? Although we have a slight degradation
with the C implementation, they are still way faster than the
corresponding syscall.

Another thing I was wondering: is it worth using the 64-bit timebase on
PPC32? As far as I understand, the timebase is there to calculate a
linear time update since the last VDSO datapage update. How often is the
VDSO datapage updated? On the 885 clocked at 132 MHz, the timebase runs at
8.25 MHz, which means it takes more than 8 minutes to wrap over 32 bits.

Christophe
Segher Boessenkool Oct. 27, 2019, 7:07 p.m. UTC | #11
On Sun, Oct 27, 2019 at 10:21:25AM +0100, Christophe Leroy wrote:
> Le 27/10/2019 à 01:06, Segher Boessenkool a écrit :
> >The hand-optimised asm code will pretty likely win handsomely, whatever
> >you do.  Especially on cores like the 885 (no branch prediction, single
> >issue, small caches, etc.: every instruction counts).
> >
> >Is there any reason to replace this hand-optimised code?  It was written
> >for exactly this reason: these functions are critical and should be as
> >fast as possible.
> 
> Well, all this started with COARSE clocks not being supported by PPC32 
> VDSO. I first submitted a series with a set of optimisations including 
> the implementation of COARSE clocks 
> (https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=126779)
> 
> Then, after a comment from Santosh Sivaraj on patch 4 of the series
> asking for a common implementation for PPC32 and PPC64, I started
> looking into making the whole VDSO source code common to PPC32 and
> PPC64. Most functions are similar. The time functions are also rather
> similar but unfortunately don't use the same registers. They also don't 
> cover all possible clocks. And getres() is also buggy, see series 
> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=110321

That is all nice work :-)

> So instead of reworking the existing time functions, I started
> investigating whether we could plug powerpc into the generic
> implementation. One drawback of PPC is that we need to set up an ASM
> trampoline to handle the SO bit, as it can't be handled from C directly,
> can it?

There is no way to say from C which CR bits to return.  The ABI requires
some of those bits to be preserved, and some are volatile.  System calls
use a different ABI, one the compiler knows nothing about, so you cannot
even present system calls as ordinary calls to the compiler.

> How critical are these functions? Although we have a slight degradation
> with the C implementation, they are still way faster than the
> corresponding syscall.

"Slight":

	With current powerpc/32 ASM VDSO:

	gettimeofday:    vdso: 750 nsec/call
	clock-getres-realtime:    vdso: 382 nsec/call
	clock-gettime-realtime:    vdso: 928 nsec/call
	clock-getres-monotonic:    vdso: 382 nsec/call
	clock-gettime-monotonic:    vdso: 1033 nsec/call

	Once switched to C implementation:

	gettimeofday:    vdso: 1533 nsec/call
	clock-getres-realtime:    vdso: 853 nsec/call
	clock-gettime-realtime:    vdso: 1570 nsec/call
	clock-getres-monotonic:    vdso: 835 nsec/call
	clock-gettime-monotonic:    vdso: 1605 nsec/call


---> Those that are not more than two times slower are almost that. <---


This also needs measurements on more representative PowerPC cores, say
some G3 or G4; and on modern CPUs (Power7/8/9).

It also needs context with those measurements: what CPU core is it?
Running at what clock frequency?

> Another thing I was wondering: is it worth using the 64-bit timebase on
> PPC32? As far as I understand, the timebase is there to calculate a
> linear time update since the last VDSO datapage update. How often is the
> VDSO datapage updated? On the 885 clocked at 132 MHz, the timebase runs at
> 8.25 MHz, which means it takes more than 8 minutes to wrap over 32 bits.

On most PowerPC cores the time base is incremented significantly faster.
Usual speeds for older cores are 50MHz to 100MHz, and for newer cores ten
times that.  Recommended frequency is currently 512MHz, so you'll wrap the
low 32 bits in 8s or so on those, and in about a minute on many powermac
etc. machines already.  How can you know this long hasn't passed since the
last time you read the high half of the time base?  Without reading that
high part?
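
As a quick back-of-the-envelope check of those wrap times (2^32 / f_TB,
for some of the timebase frequencies mentioned in this thread):

#include <stdio.h>

int main(void)
{
	/* timebase frequencies discussed above, in Hz */
	const double f_tb[] = { 8.25e6, 50e6, 512e6 };
	int i;

	for (i = 0; i < 3; i++)	/* 2^32 / f_TB = seconds until wrap */
		printf("%6.2f MHz -> %6.1f s\n",
		       f_tb[i] / 1e6, 4294967296.0 / f_tb[i]);
	return 0;
}

This prints ~520.6 s (~8.7 minutes), ~85.9 s and ~8.4 s, in line with
the figures above.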

The current (assembler) code already optimises converting this to some
other scale quite well, better than a compiler can (see __do_get_tspec).


Segher
Christophe Leroy Dec. 20, 2019, 6:24 p.m. UTC | #12
Hi Thomas,

In do_hres(), I see:

		cycles = __arch_get_hw_counter(vd->clock_mode);
		ns = vdso_ts->nsec;
		last = vd->cycle_last;
		if (unlikely((s64)cycles < 0))
			return -1;

__arch_get_hw_counter() returns a u64 value. On PPC, this is read
from the timebase, which is a 64-bit counter.

Why return -1 if (s64)cycles < 0? Does it mean we have to mask out
the most significant bit when reading the HW counter?

Christophe
Thomas Gleixner Jan. 9, 2020, 2:05 p.m. UTC | #13
Christophe!

Christophe Leroy <christophe.leroy@c-s.fr> writes:
> In do_hres(), I see:
>
> 		cycles = __arch_get_hw_counter(vd->clock_mode);
> 		ns = vdso_ts->nsec;
> 		last = vd->cycle_last;
> 		if (unlikely((s64)cycles < 0))
> 			return -1;
>
> __arch_get_hw_counter() returns a u64 value. On PPC, this is read
> from the timebase, which is a 64-bit counter.
>
> Why return -1 if (s64)cycles < 0? Does it mean we have to mask out
> the most significant bit when reading the HW counter?

Only if you expect the HW counter to reach a value which has bit 63
set. That'd require:

uptime		counter frequency

~292 years      1GHz
~ 58 years      5GHz

assumed that the HW counter starts at 0 when the box is powered on.

The reason why this is implemented in this way is that
__arch_get_hw_counter() needs a way to express that the clocksource of
the moment is not suitable for VDSO so that the syscall fallback gets
invoked.

Sure we could have used a pointer for the value and a return value
indicating the validity, but given the required uptime the resulting
code overhead seemed to be not worth it. At least not for me as I'm not
planning to be around 58 years from now :)

Thanks,

        tglx
Christophe Leroy Jan. 9, 2020, 3:21 p.m. UTC | #14
Hi Thomas,

On 01/09/2020 02:05 PM, Thomas Gleixner wrote:
> Christophe!
> 
> Christophe Leroy <christophe.leroy@c-s.fr> writes:
>> In do_hres(), I see:
>>
>> 		cycles = __arch_get_hw_counter(vd->clock_mode);
>> 		ns = vdso_ts->nsec;
>> 		last = vd->cycle_last;
>> 		if (unlikely((s64)cycles < 0))
>> 			return -1;
>>
>> __arch_get_hw_counter() returns a u64 value. On PPC, this is read
>> from the timebase, which is a 64-bit counter.
>>
>> Why return -1 if (s64)cycles < 0? Does it mean we have to mask out
>> the most significant bit when reading the HW counter?
> 
> Only if you expect the HW counter to reach a value which has bit 63
> set. That'd require:
> 
> uptime		counter frequency
> 
> ~292 years      1GHz
> ~ 58 years      5GHz
> 
> assumed that the HW counter starts at 0 when the box is powered on.
> 
> The reason why this is implemented in this way is that
> __arch_get_hw_counter() needs a way to express that the clocksource of
> the moment is not suitable for VDSO so that the syscall fallback gets
> invoked.
> 
> Sure we could have used a pointer for the value and a return value
> indicating the validity, but given the required uptime the resulting
> code overhead seemed to be not worth it. At least not for me as I'm not
> planning to be around 58 years from now :)
> 

I managed to get better code and better performance by splitting out the
validity check as follows. Would it be suitable for all arches?


diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h b/arch/powerpc/include/asm/vdso/gettimeofday.h
index 689f51b0d8c9..11cdd6faa4ad 100644
--- a/arch/powerpc/include/asm/vdso/gettimeofday.h
+++ b/arch/powerpc/include/asm/vdso/gettimeofday.h
@@ -114,15 +114,17 @@ int clock_getres32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
  	return ret;
  }

-static __always_inline u64 __arch_get_hw_counter(s32 clock_mode)
+static __always_inline bool __arch_is_hw_counter_valid(s32 clock_mode)
  {
  	/*
  	 * clock_mode == 0 implies that vDSO are enabled otherwise
  	 * fallback on syscall.
  	 */
-	if (clock_mode)
-		return ULLONG_MAX;
+	return clock_mode ? false : true;
+}

+static __always_inline u64 __arch_get_hw_counter(s32 clock_mode)
+{
  	return get_tb();
  }

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index ee9da52a3e02..90bb5dfd0db0 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -46,11 +46,12 @@ static inline int do_hres(const struct vdso_data *vd, clockid_t clk,

  	do {
  		seq = vdso_read_begin(vd);
+		if (!__arch_is_hw_counter_valid(vd->clock_mode))
+			return -1;
+
  		cycles = __arch_get_hw_counter(vd->clock_mode);
  		ns = vdso_ts->nsec;
  		last = vd->cycle_last;
-		if (unlikely((s64)cycles < 0))
-			return -1;

  		ns += vdso_calc_delta(cycles, last, vd->mask, vd->mult);
  		ns >>= vd->shift;


Thanks
Christophe
Thomas Gleixner Jan. 10, 2020, 10:42 p.m. UTC | #15
Christophe,

Christophe Leroy <christophe.leroy@c-s.fr> writes:
> On 01/09/2020 02:05 PM, Thomas Gleixner wrote:
>> The reason why this is implemented in this way is that
>> __arch_get_hw_counter() needs a way to express that the clocksource of
>> the moment is not suitable for VDSO so that the syscall fallback gets
>> invoked.
>> 
>> Sure we could have used a pointer for the value and a return value
>> indicating the validity, but given the required uptime the resulting
>> code overhead seemed to be not worth it. At least not for me as I'm not
>> planning to be around 58 years from now :)
>> 
>
> I managed to get better code and better performance by splitting out the
> validity check as follows. Would it be suitable for all arches?

A quick test on x86 shows only a minimal impact, but it's in the
noise. So from my side that's fine, but I can't speak for ARM[64]/MIPS.

> diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h b/arch/powerpc/include/asm/vdso/gettimeofday.h
> index 689f51b0d8c9..11cdd6faa4ad 100644
> --- a/arch/powerpc/include/asm/vdso/gettimeofday.h
> +++ b/arch/powerpc/include/asm/vdso/gettimeofday.h
> @@ -114,15 +114,17 @@ int clock_getres32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
>   	return ret;
>   }
>
> -static __always_inline u64 __arch_get_hw_counter(s32 clock_mode)
> +static __always_inline bool __arch_is_hw_counter_valid(s32 clock_mode)
>   {
>   	/*
>   	 * clock_mode == 0 implies that vDSO are enabled otherwise
>   	 * fallback on syscall.
>   	 */
> -	if (clock_mode)
> -		return ULLONG_MAX;
> +	return clock_mode ? false : true;

I don't think we need an arch-specific function here. I'd rather convert
the boolean of ARM[64] to the x86/MIPS way of VCLOCK_* modes and let the
generic code check for clock_mode == VCLOCK_NONE.

There is some magic in ARM and MIPS which wants to be able to disable
the whole thing at compile time, but that can be handled in the generic
code with just an extra config switch.

I'll have a stab at that tomorrow.

Thanks,

        tglx

Patch

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 3e56c9c2f16e..a363c5186b82 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -168,6 +168,7 @@  config PPC
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 	select GENERIC_TIME_VSYSCALL
+	select GENERIC_GETTIMEOFDAY
 	select HAVE_ARCH_AUDITSYSCALL
 	select HAVE_ARCH_HUGE_VMAP		if PPC_BOOK3S_64 && PPC_RADIX_MMU
 	select HAVE_ARCH_JUMP_LABEL
@@ -197,6 +198,7 @@  config PPC
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_FUNCTION_TRACER
 	select HAVE_GCC_PLUGINS			if GCC_VERSION >= 50200   # plugin support on gcc <= 5.1 is buggy on PPC
+	select HAVE_GENERIC_VDSO
 	select HAVE_HW_BREAKPOINT		if PERF_EVENTS && (PPC_BOOK3S || PPC_8xx)
 	select HAVE_IDE
 	select HAVE_IOREMAP_PROT
diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h b/arch/powerpc/include/asm/vdso/gettimeofday.h
new file mode 100644
index 000000000000..6de875cf4b75
--- /dev/null
+++ b/arch/powerpc/include/asm/vdso/gettimeofday.h
@@ -0,0 +1,81 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2018 ARM Limited
+ */
+#ifndef __ASM_VDSO_GETTIMEOFDAY_H
+#define __ASM_VDSO_GETTIMEOFDAY_H
+
+#ifndef __ASSEMBLY__
+
+#include <asm/time.h>
+#include <asm/unistd.h>
+#include <uapi/linux/time.h>
+
+#define __VDSO_USE_SYSCALL		ULLONG_MAX
+
+#define VDSO_HAS_CLOCK_GETRES		1
+
+#define VDSO_HAS_TIME			1
+
+#define VDSO_HAS_32BIT_FALLBACK		1
+
+static __always_inline
+int gettimeofday_fallback(struct __kernel_old_timeval *_tv,
+			  struct timezone *_tz)
+{
+	return -1;
+}
+
+static __always_inline
+long clock_gettime_fallback(clockid_t _clkid, struct __kernel_timespec *_ts)
+{
+	return -1;
+}
+
+static __always_inline
+int clock_getres_fallback(clockid_t _clkid, struct __kernel_timespec *_ts)
+{
+	return -1;
+}
+
+static __always_inline
+int clock_getres32_fallback(clockid_t clock, struct old_timespec32 *res)
+{
+	return -1;
+}
+
+static __always_inline
+int clock_gettime32_fallback(clockid_t clock, struct old_timespec32 *res)
+{
+	return -1;
+}
+
+static __always_inline u64 __arch_get_hw_counter(s32 clock_mode)
+{
+	/*
+	 * clock_mode == 0 implies that vDSO are enabled otherwise
+	 * fallback on syscall.
+	 */
+	if (clock_mode)
+		return __VDSO_USE_SYSCALL;
+
+	return get_tb();
+}
+
+static __always_inline
+const struct vdso_data *__arch_get_vdso_data(void)
+{
+	void *ptr;
+
+	asm volatile(
+		"	bcl	20, 31, .+4;\n"
+		"	mflr	%0;\n"
+		"	addi	%0, %0, __kernel_datapage_offset - (.-4);\n"
+		: "=b"(ptr) : : "lr");
+
+	return ptr + *(unsigned long *)ptr;
+}
+
+#endif /* !__ASSEMBLY__ */
+
+#endif /* __ASM_VDSO_GETTIMEOFDAY_H */
diff --git a/arch/powerpc/include/asm/vdso/vsyscall.h b/arch/powerpc/include/asm/vdso/vsyscall.h
new file mode 100644
index 000000000000..d12c2298cbb8
--- /dev/null
+++ b/arch/powerpc/include/asm/vdso/vsyscall.h
@@ -0,0 +1,27 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_VDSO_VSYSCALL_H
+#define __ASM_VDSO_VSYSCALL_H
+
+#ifndef __ASSEMBLY__
+
+#include <linux/timekeeper_internal.h>
+#include <asm/vdso_datapage.h>
+
+extern struct vdso_arch_data *vdso_arch_data;
+
+/*
+ * Update the vDSO data page to keep in sync with kernel timekeeping.
+ */
+static __always_inline
+struct vdso_data *__powerpc_get_k_vdso_data(void)
+{
+	return vdso_arch_data->data;
+}
+#define __arch_get_k_vdso_data __powerpc_get_k_vdso_data
+
+/* The asm-generic header needs to be included after the definitions above */
+#include <asm-generic/vdso/vsyscall.h>
+
+#endif /* !__ASSEMBLY__ */
+
+#endif /* __ASM_VDSO_VSYSCALL_H */
diff --git a/arch/powerpc/include/asm/vdso_datapage.h b/arch/powerpc/include/asm/vdso_datapage.h
index c61d59ed3b45..f9b50bc50989 100644
--- a/arch/powerpc/include/asm/vdso_datapage.h
+++ b/arch/powerpc/include/asm/vdso_datapage.h
@@ -36,6 +36,7 @@ 
 
 #include <linux/unistd.h>
 #include <linux/time.h>
+#include <vdso/datapage.h>
 
 #define SYSCALL_MAP_SIZE      ((NR_syscalls + 31) / 32)
 
@@ -91,18 +92,9 @@  struct vdso_data {
 /*
  * And here is the simpler 32 bits version
  */
-struct vdso_data {
-	__u64 tb_orig_stamp;		/* Timebase at boot		0x30 */
+struct vdso_arch_data {
+	struct vdso_data data[CS_BASES];
 	__u64 tb_ticks_per_sec;		/* Timebase tics / sec		0x38 */
-	__u64 tb_to_xs;			/* Inverse of TB to 2^20	0x40 */
-	__u64 stamp_xsec;		/*				0x48 */
-	__u32 tb_update_count;		/* Timebase atomicity ctr	0x50 */
-	__u32 tz_minuteswest;		/* Minutes west of Greenwich	0x58 */
-	__u32 tz_dsttime;		/* Type of dst correction	0x5C */
-	__s32 wtom_clock_sec;			/* Wall to monotonic clock */
-	__s32 wtom_clock_nsec;
-	struct timespec stamp_xtime;	/* xtime as at tb_orig_stamp */
-	__u32 stamp_sec_fraction;	/* fractional seconds of stamp_xtime */
    	__u32 syscall_map_32[SYSCALL_MAP_SIZE]; /* map of syscalls */
 	__u32 dcache_block_size;	/* L1 d-cache block size     */
 	__u32 icache_block_size;	/* L1 i-cache block size     */
@@ -112,7 +104,7 @@  struct vdso_data {
 
 #endif /* CONFIG_PPC64 */
 
-extern struct vdso_data *vdso_data;
+extern struct vdso_arch_data *vdso_arch_data;
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 484f54dab247..88ec3acde094 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -376,21 +376,12 @@  int main(void)
 #endif /* ! CONFIG_PPC64 */
 
 	/* datapage offsets for use by vdso */
-	OFFSET(CFG_TB_ORIG_STAMP, vdso_data, tb_orig_stamp);
-	OFFSET(CFG_TB_TICKS_PER_SEC, vdso_data, tb_ticks_per_sec);
-	OFFSET(CFG_TB_TO_XS, vdso_data, tb_to_xs);
-	OFFSET(CFG_TB_UPDATE_COUNT, vdso_data, tb_update_count);
-	OFFSET(CFG_TZ_MINUTEWEST, vdso_data, tz_minuteswest);
-	OFFSET(CFG_TZ_DSTTIME, vdso_data, tz_dsttime);
-	OFFSET(CFG_SYSCALL_MAP32, vdso_data, syscall_map_32);
-	OFFSET(WTOM_CLOCK_SEC, vdso_data, wtom_clock_sec);
-	OFFSET(WTOM_CLOCK_NSEC, vdso_data, wtom_clock_nsec);
-	OFFSET(STAMP_XTIME, vdso_data, stamp_xtime);
-	OFFSET(STAMP_SEC_FRAC, vdso_data, stamp_sec_fraction);
-	OFFSET(CFG_ICACHE_BLOCKSZ, vdso_data, icache_block_size);
-	OFFSET(CFG_DCACHE_BLOCKSZ, vdso_data, dcache_block_size);
-	OFFSET(CFG_ICACHE_LOGBLOCKSZ, vdso_data, icache_log_block_size);
-	OFFSET(CFG_DCACHE_LOGBLOCKSZ, vdso_data, dcache_log_block_size);
+	OFFSET(CFG_TB_TICKS_PER_SEC, vdso_arch_data, tb_ticks_per_sec);
+	OFFSET(CFG_SYSCALL_MAP32, vdso_arch_data, syscall_map_32);
+	OFFSET(CFG_ICACHE_BLOCKSZ, vdso_arch_data, icache_block_size);
+	OFFSET(CFG_DCACHE_BLOCKSZ, vdso_arch_data, dcache_block_size);
+	OFFSET(CFG_ICACHE_LOGBLOCKSZ, vdso_arch_data, icache_log_block_size);
+	OFFSET(CFG_DCACHE_LOGBLOCKSZ, vdso_arch_data, dcache_log_block_size);
 #ifdef CONFIG_PPC64
 	OFFSET(CFG_SYSCALL_MAP64, vdso_data, syscall_map_64);
 	OFFSET(TVAL64_TV_SEC, timeval, tv_sec);
@@ -401,11 +392,6 @@  int main(void)
 	OFFSET(TSPC64_TV_NSEC, timespec, tv_nsec);
 	OFFSET(TSPC32_TV_SEC, old_timespec32, tv_sec);
 	OFFSET(TSPC32_TV_NSEC, old_timespec32, tv_nsec);
-#else
-	OFFSET(TVAL32_TV_SEC, timeval, tv_sec);
-	OFFSET(TVAL32_TV_USEC, timeval, tv_usec);
-	OFFSET(TSPC32_TV_SEC, timespec, tv_sec);
-	OFFSET(TSPC32_TV_NSEC, timespec, tv_nsec);
 #endif
 	/* timeval/timezone offsets for use by vdso */
 	OFFSET(TZONE_TZ_MINWEST, timezone, tz_minuteswest);
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 694522308cd5..57bdaf08bafc 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -882,93 +882,6 @@  static notrace u64 timebase_read(struct clocksource *cs)
 	return (u64)get_tb();
 }
 
-
-void update_vsyscall(struct timekeeper *tk)
-{
-	struct timespec xt;
-	struct clocksource *clock = tk->tkr_mono.clock;
-	u32 mult = tk->tkr_mono.mult;
-	u32 shift = tk->tkr_mono.shift;
-	u64 cycle_last = tk->tkr_mono.cycle_last;
-	u64 new_tb_to_xs, new_stamp_xsec;
-	u64 frac_sec;
-
-	if (clock != &clocksource_timebase)
-		return;
-
-	xt.tv_sec = tk->xtime_sec;
-	xt.tv_nsec = (long)(tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift);
-
-	/* Make userspace gettimeofday spin until we're done. */
-	++vdso_data->tb_update_count;
-	smp_mb();
-
-	/*
-	 * This computes ((2^20 / 1e9) * mult) >> shift as a
-	 * 0.64 fixed-point fraction.
-	 * The computation in the else clause below won't overflow
-	 * (as long as the timebase frequency is >= 1.049 MHz)
-	 * but loses precision because we lose the low bits of the constant
-	 * in the shift.  Note that 19342813113834067 ~= 2^(20+64) / 1e9.
-	 * For a shift of 24 the error is about 0.5e-9, or about 0.5ns
-	 * over a second.  (Shift values are usually 22, 23 or 24.)
-	 * For high frequency clocks such as the 512MHz timebase clock
-	 * on POWER[6789], the mult value is small (e.g. 32768000)
-	 * and so we can shift the constant by 16 initially
-	 * (295147905179 ~= 2^(20+64-16) / 1e9) and then do the
-	 * remaining shifts after the multiplication, which gives a
-	 * more accurate result (e.g. with mult = 32768000, shift = 24,
-	 * the error is only about 1.2e-12, or 0.7ns over 10 minutes).
-	 */
-	if (mult <= 62500000 && clock->shift >= 16)
-		new_tb_to_xs = ((u64) mult * 295147905179ULL) >> (clock->shift - 16);
-	else
-		new_tb_to_xs = (u64) mult * (19342813113834067ULL >> clock->shift);
-
-	/*
-	 * Compute the fractional second in units of 2^-32 seconds.
-	 * The fractional second is tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift
-	 * in nanoseconds, so multiplying that by 2^32 / 1e9 gives
-	 * it in units of 2^-32 seconds.
-	 * We assume shift <= 32 because clocks_calc_mult_shift()
-	 * generates shift values in the range 0 - 32.
-	 */
-	frac_sec = tk->tkr_mono.xtime_nsec << (32 - shift);
-	do_div(frac_sec, NSEC_PER_SEC);
-
-	/*
-	 * Work out new stamp_xsec value for any legacy users of systemcfg.
-	 * stamp_xsec is in units of 2^-20 seconds.
-	 */
-	new_stamp_xsec = frac_sec >> 12;
-	new_stamp_xsec += tk->xtime_sec * XSEC_PER_SEC;
-
-	/*
-	 * tb_update_count is used to allow the userspace gettimeofday code
-	 * to assure itself that it sees a consistent view of the tb_to_xs and
-	 * stamp_xsec variables.  It reads the tb_update_count, then reads
-	 * tb_to_xs and stamp_xsec and then reads tb_update_count again.  If
-	 * the two values of tb_update_count match and are even then the
-	 * tb_to_xs and stamp_xsec values are consistent.  If not, then it
-	 * loops back and reads them again until this criteria is met.
-	 */
-	vdso_data->tb_orig_stamp = cycle_last;
-	vdso_data->stamp_xsec = new_stamp_xsec;
-	vdso_data->tb_to_xs = new_tb_to_xs;
-	vdso_data->wtom_clock_sec = tk->wall_to_monotonic.tv_sec;
-	vdso_data->wtom_clock_nsec = tk->wall_to_monotonic.tv_nsec;
-	vdso_data->stamp_xtime = xt;
-	vdso_data->stamp_sec_fraction = frac_sec;
-	smp_wmb();
-	++(vdso_data->tb_update_count);
-}
-
-void update_vsyscall_tz(void)
-{
-	vdso_data->tz_minuteswest = sys_tz.tz_minuteswest;
-	vdso_data->tz_dsttime = sys_tz.tz_dsttime;
-}
-
 static void __init clocksource_init(void)
 {
 	struct clocksource *clock;
@@ -1138,8 +1051,7 @@  void __init time_init(void)
 		sys_tz.tz_dsttime = 0;
 	}
 
-	vdso_data->tb_update_count = 0;
-	vdso_data->tb_ticks_per_sec = tb_ticks_per_sec;
+	vdso_arch_data->tb_ticks_per_sec = tb_ticks_per_sec;
 
 	/* initialise and enable the large decrementer (if we have one) */
 	set_decrementer_max();
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index eae9ddaecbcf..d1e4f3a3a781 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -17,6 +17,7 @@ 
 #include <linux/elf.h>
 #include <linux/security.h>
 #include <linux/memblock.h>
+#include <vdso/datapage.h>
 
 #include <asm/pgtable.h>
 #include <asm/processor.h>
@@ -71,10 +72,10 @@  static int vdso_ready;
  * with it, it will become dynamically allocated
  */
 static union {
-	struct vdso_data	data;
+	struct vdso_arch_data	arch_data;
 	u8			page[PAGE_SIZE];
 } vdso_data_store __page_aligned_data;
-struct vdso_data *vdso_data = &vdso_data_store.data;
+struct vdso_arch_data *vdso_arch_data = &vdso_data_store.arch_data;
 
 /* Format of the patch table */
 struct vdso_patch_def
@@ -661,7 +662,7 @@  static void __init vdso_setup_syscall_map(void)
 				0x80000000UL >> (i & 0x1f);
 #else /* CONFIG_PPC64 */
 		if (sys_call_table[i] != sys_ni_syscall)
-			vdso_data->syscall_map_32[i >> 5] |=
+			vdso_arch_data->syscall_map_32[i >> 5] |=
 				0x80000000UL >> (i & 0x1f);
 #endif /* CONFIG_PPC64 */
 	}
@@ -729,10 +730,10 @@  static int __init vdso_init(void)
 	vdso64_pages = (&vdso64_end - &vdso64_start) >> PAGE_SHIFT;
 	DBG("vdso64_kbase: %p, 0x%x pages\n", vdso64_kbase, vdso64_pages);
 #else
-	vdso_data->dcache_block_size = L1_CACHE_BYTES;
-	vdso_data->dcache_log_block_size = L1_CACHE_SHIFT;
-	vdso_data->icache_block_size = L1_CACHE_BYTES;
-	vdso_data->icache_log_block_size = L1_CACHE_SHIFT;
+	vdso_arch_data->dcache_block_size = L1_CACHE_BYTES;
+	vdso_arch_data->dcache_log_block_size = L1_CACHE_SHIFT;
+	vdso_arch_data->icache_block_size = L1_CACHE_BYTES;
+	vdso_arch_data->icache_log_block_size = L1_CACHE_SHIFT;
 #endif /* CONFIG_PPC64 */
 
 
@@ -775,7 +776,7 @@  static int __init vdso_init(void)
 		get_page(pg);
 		vdso32_pagelist[i] = pg;
 	}
-	vdso32_pagelist[i++] = virt_to_page(vdso_data);
+	vdso32_pagelist[i++] = virt_to_page(vdso_arch_data);
 	vdso32_pagelist[i] = NULL;
 #endif
 
@@ -792,7 +793,7 @@  static int __init vdso_init(void)
 	vdso64_pagelist[i] = NULL;
 #endif /* CONFIG_PPC64 */
 
-	get_page(virt_to_page(vdso_data));
+	get_page(virt_to_page(vdso_arch_data));
 
 	smp_wmb();
 	vdso_ready = 1;
diff --git a/arch/powerpc/kernel/vdso32/Makefile b/arch/powerpc/kernel/vdso32/Makefile
index 06f54d947057..09edcd1a2dc7 100644
--- a/arch/powerpc/kernel/vdso32/Makefile
+++ b/arch/powerpc/kernel/vdso32/Makefile
@@ -2,10 +2,17 @@ 
 
 # List of files in the vdso, has to be asm only for now
 
+ARCH_REL_TYPE_ABS := R_PPC_JUMP_SLOT|R_PPC_GLOB_DAT|R_PPC_ADDR32|R_PPC_ADDR24|R_PPC_ADDR16|R_PPC_ADDR16_LO|R_PPC_ADDR16_HI|R_PPC_ADDR16_HA|R_PPC_ADDR14|R_PPC_ADDR14_BRTAKEN|R_PPC_ADDR14_BRNTAKEN
+include $(srctree)/lib/vdso/Makefile
+
 obj-vdso32-$(CONFIG_PPC64) = getcpu.o
 obj-vdso32 = sigtramp.o gettimeofday.o datapage.o cacheflush.o note.o \
 		$(obj-vdso32-y)
 
+ifneq ($(c-gettimeofday-y),)
+  CFLAGS_vgettimeofday.o += -include $(c-gettimeofday-y)
+endif
+
 # Build rules
 
 ifdef CROSS32_COMPILE
@@ -38,8 +45,8 @@  CPPFLAGS_vdso32.lds += -P -C -Upowerpc
 $(obj)/vdso32_wrapper.o : $(obj)/vdso32.so
 
 # link rule for the .so file, .lds has to be first
-$(obj)/vdso32.so.dbg: $(src)/vdso32.lds $(obj-vdso32) FORCE
-	$(call if_changed,vdso32ld)
+$(obj)/vdso32.so.dbg: $(src)/vdso32.lds $(obj-vdso32) $(obj)/vgettimeofday.o FORCE
+	$(call if_changed,vdso32ld_and_check)
 
 # strip rule for the .so file
 $(obj)/%.so: OBJCOPYFLAGS := -S
@@ -49,12 +56,16 @@  $(obj)/%.so: $(obj)/%.so.dbg FORCE
 # assembly rules for the .S files
 $(obj-vdso32): %.o: %.S FORCE
 	$(call if_changed_dep,vdso32as)
+$(obj)/vgettimeofday.o: %.o: %.c FORCE
+	$(call if_changed_dep,vdso32cc)
 
 # actual build commands
-quiet_cmd_vdso32ld = VDSO32L $@
-      cmd_vdso32ld = $(VDSOCC) $(c_flags) $(CC32FLAGS) -o $@ -Wl,-T$(filter %.lds,$^) $(filter %.o,$^)
+quiet_cmd_vdso32ld_and_check = VDSO32L $@
+      cmd_vdso32ld_and_check = $(VDSOCC) $(c_flags) $(CC32FLAGS) -o $@ -Wl,-T$(filter %.lds,$^) $(filter %.o,$^) ; $(cmd_vdso_check)
 quiet_cmd_vdso32as = VDSO32A $@
       cmd_vdso32as = $(VDSOCC) $(a_flags) $(CC32FLAGS) -c -o $@ $<
+quiet_cmd_vdso32cc = VDSO32A $@
+      cmd_vdso32cc = $(VDSOCC) $(c_flags) $(CC32FLAGS) -c -o $@ $<
 
 # install commands for the unstripped file
 quiet_cmd_vdso_install = INSTALL $@
diff --git a/arch/powerpc/kernel/vdso32/gettimeofday.S b/arch/powerpc/kernel/vdso32/gettimeofday.S
index becd9f8767ed..6f2671101248 100644
--- a/arch/powerpc/kernel/vdso32/gettimeofday.S
+++ b/arch/powerpc/kernel/vdso32/gettimeofday.S
@@ -12,15 +12,6 @@ 
 #include <asm/asm-offsets.h>
 #include <asm/unistd.h>
 
-/* Offset for the low 32-bit part of a field of long type */
-#ifdef CONFIG_PPC64
-#define LOPART	4
-#define TSPEC_TV_SEC	TSPC64_TV_SEC+LOPART
-#else
-#define LOPART	0
-#define TSPEC_TV_SEC	TSPC32_TV_SEC
-#endif
-
 	.text
 /*
  * Exact prototype of gettimeofday
@@ -30,31 +21,25 @@ 
  */
 V_FUNCTION_BEGIN(__kernel_gettimeofday)
   .cfi_startproc
-	mflr	r12
-  .cfi_register lr,r12
-
-	mr	r10,r3			/* r10 saves tv */
-	mr	r11,r4			/* r11 saves tz */
-	bl	__get_datapage@local	/* get data page */
-	mr	r9, r3			/* datapage ptr in r9 */
-	cmplwi	r10,0			/* check if tv is NULL */
-	beq	3f
-	lis	r7,1000000@ha		/* load up USEC_PER_SEC */
-	addi	r7,r7,1000000@l		/* so we get microseconds in r4 */
-	bl	__do_get_tspec@local	/* get sec/usec from tb & kernel */
-	stw	r3,TVAL32_TV_SEC(r10)
-	stw	r4,TVAL32_TV_USEC(r10)
-
-3:	cmplwi	r11,0			/* check if tz is NULL */
-	beq	1f
-	lwz	r4,CFG_TZ_MINUTEWEST(r9)/* fill tz */
-	lwz	r5,CFG_TZ_DSTTIME(r9)
-	stw	r4,TZONE_TZ_MINWEST(r11)
-	stw	r5,TZONE_TZ_DSTTIME(r11)
-
-1:	mtlr	r12
+	mflr	r0
+	stwu	r1, -16(r1)
+	stw	r0, 20(r1)
+	stw	r3, 8(r1)
+	stw	r4, 12(r1)
+	bl	__c_kernel_gettimeofday
+	cmpwi	r3, 0
+	lwz	r0, 20(r1)
+	mtlr	r0
+	bne	99f
+	addi	r1, r1, 16
 	crclr	cr0*4+so
-	li	r3,0
+	blr
+99:
+	lwz	r3, 8(r1)
+	lwz	r4, 12(r1)
+	addi	r1, r1, 16
+	li	r0, __NR_gettimeofday
+	sc
 	blr
   .cfi_endproc
 V_FUNCTION_END(__kernel_gettimeofday)
@@ -67,75 +52,24 @@  V_FUNCTION_END(__kernel_gettimeofday)
  */
 V_FUNCTION_BEGIN(__kernel_clock_gettime)
   .cfi_startproc
-	/* Check for supported clock IDs */
-	cmpli	cr0,r3,CLOCK_REALTIME
-	cmpli	cr1,r3,CLOCK_MONOTONIC
-	cror	cr0*4+eq,cr0*4+eq,cr1*4+eq
-	bne	cr0,99f
-
-	mflr	r12			/* r12 saves lr */
-  .cfi_register lr,r12
-	mr	r11,r4			/* r11 saves tp */
-	bl	__get_datapage@local	/* get data page */
-	mr	r9,r3			/* datapage ptr in r9 */
-	lis	r7,NSEC_PER_SEC@h	/* want nanoseconds */
-	ori	r7,r7,NSEC_PER_SEC@l
-50:	bl	__do_get_tspec@local	/* get sec/nsec from tb & kernel */
-	bne	cr1,80f			/* not monotonic -> all done */
-
-	/*
-	 * CLOCK_MONOTONIC
-	 */
-
-	/* now we must fixup using wall to monotonic. We need to snapshot
-	 * that value and do the counter trick again. Fortunately, we still
-	 * have the counter value in r8 that was returned by __do_get_xsec.
-	 * At this point, r3,r4 contain our sec/nsec values, r5 and r6
-	 * can be used, r7 contains NSEC_PER_SEC.
-	 */
-
-	lwz	r5,(WTOM_CLOCK_SEC+LOPART)(r9)
-	lwz	r6,WTOM_CLOCK_NSEC(r9)
-
-	/* We now have our offset in r5,r6. We create a fake dependency
-	 * on that value and re-check the counter
-	 */
-	or	r0,r6,r5
-	xor	r0,r0,r0
-	add	r9,r9,r0
-	lwz	r0,(CFG_TB_UPDATE_COUNT+LOPART)(r9)
-        cmpl    cr0,r8,r0		/* check if updated */
-	bne-	50b
-
-	/* Calculate and store result. Note that this mimics the C code,
-	 * which may cause funny results if nsec goes negative... is that
-	 * possible at all ?
-	 */
-	add	r3,r3,r5
-	add	r4,r4,r6
-	cmpw	cr0,r4,r7
-	cmpwi	cr1,r4,0
-	blt	1f
-	subf	r4,r7,r4
-	addi	r3,r3,1
-1:	bge	cr1,80f
-	addi	r3,r3,-1
-	add	r4,r4,r7
-
-80:	stw	r3,TSPC32_TV_SEC(r11)
-	stw	r4,TSPC32_TV_NSEC(r11)
-
-	mtlr	r12
+	mflr	r0
+	stwu	r1, -16(r1)
+	stw	r0, 20(r1)
+	stw	r3, 8(r1)
+	stw	r4, 12(r1)
+	bl	__c_kernel_clock_gettime
+	cmpwi	r3, 0
+	lwz	r0, 20(r1)
+	mtlr	r0
+	bne	99f
+	addi	r1, r1, 16
 	crclr	cr0*4+so
-	li	r3,0
 	blr
-
-	/*
-	 * syscall fallback
-	 */
 99:
-	li	r0,__NR_clock_gettime
-  .cfi_restore lr
+	lwz	r3, 8(r1)
+	lwz	r4, 12(r1)
+	addi	r1, r1, 16
+	li	r0, __NR_clock_gettime
 	sc
 	blr
   .cfi_endproc
@@ -150,27 +84,24 @@  V_FUNCTION_END(__kernel_clock_gettime)
  */
 V_FUNCTION_BEGIN(__kernel_clock_getres)
   .cfi_startproc
-	/* Check for supported clock IDs */
-	cmpwi	cr0,r3,CLOCK_REALTIME
-	cmpwi	cr1,r3,CLOCK_MONOTONIC
-	cror	cr0*4+eq,cr0*4+eq,cr1*4+eq
-	bne	cr0,99f
-
-	li	r3,0
-	cmpli	cr0,r4,0
+	mflr	r0
+	stwu	r1, -16(r1)
+	stw	r0, 20(r1)
+	stw	r3, 8(r1)
+	stw	r4, 12(r1)
+	bl	__c_kernel_clock_getres
+	cmpwi	r3, 0
+	lwz	r0, 20(r1)
+	mtlr	r0
+	bne	99f
+	addi	r1, r1, 16
 	crclr	cr0*4+so
-	beqlr
-	lis	r5,CLOCK_REALTIME_RES@h
-	ori	r5,r5,CLOCK_REALTIME_RES@l
-	stw	r3,TSPC32_TV_SEC(r4)
-	stw	r5,TSPC32_TV_NSEC(r4)
 	blr
-
-	/*
-	 * syscall fallback
-	 */
 99:
-	li	r0,__NR_clock_getres
+	lwz	r3, 8(r1)
+	lwz	r4, 12(r1)
+	addi	r1, r1, 16
+	li	r0, __NR_clock_getres
 	sc
 	blr
   .cfi_endproc
@@ -185,105 +116,14 @@  V_FUNCTION_END(__kernel_clock_getres)
  */
 V_FUNCTION_BEGIN(__kernel_time)
   .cfi_startproc
-	mflr	r12
-  .cfi_register lr,r12
-
-	mr	r11,r3			/* r11 holds t */
-	bl	__get_datapage@local
-	mr	r9, r3			/* datapage ptr in r9 */
-
-	lwz	r3,STAMP_XTIME+TSPEC_TV_SEC(r9)
-
-	cmplwi	r11,0			/* check if t is NULL */
-	beq	2f
-	stw	r3,0(r11)		/* store result at *t */
-2:	mtlr	r12
+	mflr	r0
+	stwu	r1, -16(r1)
+	stw	r0, 20(r1)
+	bl	__c_kernel_time
+	lwz	r0, 20(r1)
 	crclr	cr0*4+so
+	mtlr	r0
+	addi	r1, r1, 16
 	blr
   .cfi_endproc
 V_FUNCTION_END(__kernel_time)
-
-/*
- * This is the core of clock_gettime() and gettimeofday(),
- * it returns the current time in r3 (seconds) and r4.
- * On entry, r7 gives the resolution of r4, either USEC_PER_SEC
- * or NSEC_PER_SEC, giving r4 in microseconds or nanoseconds.
- * It expects the datapage ptr in r9 and doesn't clobber it.
- * It clobbers r0, r5 and r6.
- * On return, r8 contains the counter value that can be reused.
- * This clobbers cr0 but not any other cr field.
- */
-__do_get_tspec:
-  .cfi_startproc
-	/* Check for update count & load values. We use the low
-	 * order 32 bits of the update count
-	 */
-1:	lwz	r8,(CFG_TB_UPDATE_COUNT+LOPART)(r9)
-	andi.	r0,r8,1			/* pending update ? loop */
-	bne-	1b
-	xor	r0,r8,r8		/* create dependency */
-	add	r9,r9,r0
-
-	/* Load orig stamp (offset to TB) */
-	lwz	r5,CFG_TB_ORIG_STAMP(r9)
-	lwz	r6,(CFG_TB_ORIG_STAMP+4)(r9)
-
-	/* Get a stable TB value */
-2:	MFTBU(r3)
-	MFTBL(r4)
-	MFTBU(r0)
-	cmplw	cr0,r3,r0
-	bne-	2b
-
-	/* Subtract tb orig stamp and shift left 12 bits.
-	 */
-	subfc	r4,r6,r4
-	subfe	r0,r5,r3
-	slwi	r0,r0,12
-	rlwimi.	r0,r4,12,20,31
-	slwi	r4,r4,12
-
-	/*
-	 * Load scale factor & do multiplication.
-	 * We only use the high 32 bits of the tb_to_xs value.
-	 * Even with a 1GHz timebase clock, the high 32 bits of
-	 * tb_to_xs will be at least 4 million, so the error from
-	 * ignoring the low 32 bits will be no more than 0.25ppm.
-	 * The error will just make the clock run very very slightly
-	 * slow until the next time the kernel updates the VDSO data,
-	 * at which point the clock will catch up to the kernel's value,
-	 * so there is no long-term error accumulation.
-	 */
-	lwz	r5,CFG_TB_TO_XS(r9)	/* load values */
-	mulhwu	r4,r4,r5
-	li	r3,0
-
-	beq+	4f			/* skip high part computation if 0 */
-	mulhwu	r3,r0,r5
-	mullw	r5,r0,r5
-	addc	r4,r4,r5
-	addze	r3,r3
-4:
-	/* At this point, we have seconds since the xtime stamp
-	 * as a 32.32 fixed-point number in r3 and r4.
-	 * Load & add the xtime stamp.
-	 */
-	lwz	r5,STAMP_XTIME+TSPEC_TV_SEC(r9)
-	lwz	r6,STAMP_SEC_FRAC(r9)
-	addc	r4,r4,r6
-	adde	r3,r3,r5
-
-	/* We create a fake dependency on the result in r3/r4
-	 * and re-check the counter
-	 */
-	or	r6,r4,r3
-	xor	r0,r6,r6
-	add	r9,r9,r0
-	lwz	r0,(CFG_TB_UPDATE_COUNT+LOPART)(r9)
-        cmplw	cr0,r8,r0		/* check if updated */
-	bne-	1b
-
-	mulhwu	r4,r4,r7		/* convert to micro or nanoseconds */
-
-	blr
-  .cfi_endproc
diff --git a/arch/powerpc/kernel/vdso32/vgettimeofday.c b/arch/powerpc/kernel/vdso32/vgettimeofday.c
new file mode 100644
index 000000000000..9cc39fc60b30
--- /dev/null
+++ b/arch/powerpc/kernel/vdso32/vgettimeofday.c
@@ -0,0 +1,32 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PowerPC userspace implementations of gettimeofday() and similar.
+ *
+ * Copyright (C) 2018 ARM Limited
+ *
+ */
+#include <linux/time.h>
+#include <linux/types.h>
+
+int __c_kernel_clock_gettime(clockid_t clock,
+			   struct old_timespec32 *ts)
+{
+	return __cvdso_clock_gettime32(clock, ts);
+}
+
+int __c_kernel_gettimeofday(struct __kernel_old_timeval *tv,
+			  struct timezone *tz)
+{
+	return __cvdso_gettimeofday(tv, tz);
+}
+
+int __c_kernel_clock_getres(clockid_t clock_id,
+			  struct old_timespec32 *res)
+{
+	return __cvdso_clock_getres_time32(clock_id, res);
+}
+
+time_t __c_kernel_time(time_t *time)
+{
+	return __cvdso_time(time);
+}