[RFC,5/5] arm: Simplify cycle counter

Message ID	1430417667-4245-5-git-send-email-christopher.covington@linaro.org
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> From: Christopher Covington <christopher.covington@linaro.org> To: peter.maydell@linaro.org, qemu-devel@nongnu.org Date: Thu, 30 Apr 2015 14:14:27 -0400 Message-Id: <1430417667-4245-5-git-send-email-christopher.covington@linaro.org> In-Reply-To: <1430417667-4245-1-git-send-email-christopher.covington@linaro.org> References: <1430417667-4245-1-git-send-email-christopher.covington@linaro.org> Cc: Christopher Covington <christopher.covington@linaro.org> Subject: [Qemu-devel] [RFC 5/5] arm: Simplify cycle counter Precedence: list Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org

Christopher Covington April 30, 2015, 6:14 p.m. UTC

Present a system with an instructions per cycle of exactly one.
This makes it less likely a user will mistake the cycle counter
values as meaningful and makes calculations involving cycles
trivial while preserving the necessary property of the cycle
counter register as monotonically increasing.

Signed-off-by: Christopher Covington <christopher.covington@linaro.org>
---
 target-arm/helper.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

Peter Maydell April 30, 2015, 6:27 p.m. UTC | #1

On 30 April 2015 at 19:14, Christopher Covington
<christopher.covington@linaro.org> wrote:
> Present a system with an instructions per cycle of exactly one.
> This makes it less likely a user will mistake the cycle counter
> values as meaningful and makes calculations involving cycles
> trivial while preserving the necessary property of the cycle
> counter register as monotonically increasing.
>
> Signed-off-by: Christopher Covington <christopher.covington@linaro.org>
> ---
>  target-arm/helper.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/target-arm/helper.c b/target-arm/helper.c
> index 3e6fb0b..a027a19 100644
> --- a/target-arm/helper.c
> +++ b/target-arm/helper.c
> @@ -648,8 +648,7 @@ void pmccntr_sync(CPUARMState *env)
>  {
>      uint64_t temp_ticks;
>
> -    temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL),
> -                          get_ticks_per_sec(), 1000000);
> +    temp_ticks = cpu_get_icount_raw();

Are you really really sure the _raw function is the correct one?
Nowhere else in the codebase calls it except the other wrappers
in cpu.c which provide a sane view of the instruction count...
(I suspect cpu_get_icount_raw() should really be static.)

PS: it would be helpful if you could make sure you include
a cover letter for future patchseries: patchsets without a
cover letter are awkward for my workflow... (A single
standalone patch doesn't need a coverletter.)

thanks
-- PMM

Christopher Covington April 30, 2015, 9:33 p.m. UTC | #2

Hi Peter,

Thanks for taking a look.

On Apr 30, 2015 2:28 PM, "Peter Maydell" <peter.maydell@linaro.org> wrote:
>
> On 30 April 2015 at 19:14, Christopher Covington
> <christopher.covington@linaro.org> wrote:
> > Present a system with an instructions per cycle of exactly one.
> > This makes it less likely a user will mistake the cycle counter
> > values as meaningful and makes calculations involving cycles
> > trivial while preserving the necessary property of the cycle
> > counter register as monotonically increasing.
> >
> > Signed-off-by: Christopher Covington <christopher.covington@linaro.org>
> > ---
> >  target-arm/helper.c | 9 +++------
> >  1 file changed, 3 insertions(+), 6 deletions(-)
> >
> > diff --git a/target-arm/helper.c b/target-arm/helper.c
> > index 3e6fb0b..a027a19 100644
> > --- a/target-arm/helper.c
> > +++ b/target-arm/helper.c
> > @@ -648,8 +648,7 @@ void pmccntr_sync(CPUARMState *env)
> >  {
> >      uint64_t temp_ticks;
> >
> > -    temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL),
> > -                          get_ticks_per_sec(), 1000000);
> > +    temp_ticks = cpu_get_icount_raw();
>
> Are you really really sure the _raw function is the correct one?
> Nowhere else in the codebase calls it except the other wrappers
> in cpu.c which provide a sane view of the instruction count...
> (I suspect cpu_get_icount_raw() should really be static.)

I thought it wasn't being used because it was new, having been introduced
by Pavel Dovgalyuk's determinism patches.

When you talk about sanity, what would this patch make insane? The
instructions per second and cycles per second that would result? If so,
what if seconds/timer ticks were changed in the same patch to be derived
from the instruction count. All of this could potentially only apply with
-icount specified.

> PS: it would be helpful if you could make sure you include
> a cover letter for future patchseries: patchsets without a
> cover letter are awkward for my workflow... (A single
> standalone patch doesn't need a coverletter.)

Will do. For this series it should have been that the first four patches
are hopefully not controversial, but so far have only been tested by some
very simple bare metal code (using PSCI exit instead of Angel exit at the
end) and booting the Linux kernel and looking at the printks, but not yet
by running `perf stat`. The last patch (this one) I realize may be
controversial as it tries to make QEMU conform to certain expectations such
as determinism that it has not historically met.

Thanks,
Chris

Peter Maydell April 30, 2015, 10:02 p.m. UTC | #3

On 30 April 2015 at 22:33, Christopher Covington
<christopher.covington@linaro.org> wrote:
> On Apr 30, 2015 2:28 PM, "Peter Maydell" <peter.maydell@linaro.org> wrote:
>> Are you really really sure the _raw function is the correct one?
>> Nowhere else in the codebase calls it except the other wrappers
>> in cpu.c which provide a sane view of the instruction count...
>> (I suspect cpu_get_icount_raw() should really be static.)
>
> I thought it wasn't being used because it was new, having been introduced by
> Pavel Dovgalyuk's determinism patches.
>
> When you talk about sanity, what would this patch make insane? The
> instructions per second and cycles per second that would result? If so, what
> if seconds/timer ticks were changed in the same patch to be derived from the
> instruction count. All of this could potentially only apply with -icount
> specified.

I don't really know for certain how the code here works, but
it makes me very dubious when I see a function that is being
used literally nowhere else in any other target CPU, and
where every other code path to it goes via taking a lock.

Paolo: can you suggest what the right function to call to implement
a cycle counter is?

thanks
-- PMM

Peter Crosthwaite May 1, 2015, 1:24 a.m. UTC | #4

On Thu, Apr 30, 2015 at 11:14 AM, Christopher Covington
<christopher.covington@linaro.org> wrote:
> Present a system with an instructions per cycle of exactly one.
> This makes it less likely a user will mistake the cycle counter
> values as meaningful and makes calculations involving cycles
> trivial while preserving the necessary property of the cycle
> counter register as monotonically increasing.
>
> Signed-off-by: Christopher Covington <christopher.covington@linaro.org>
> ---
>  target-arm/helper.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/target-arm/helper.c b/target-arm/helper.c
> index 3e6fb0b..a027a19 100644
> --- a/target-arm/helper.c
> +++ b/target-arm/helper.c
> @@ -648,8 +648,7 @@ void pmccntr_sync(CPUARMState *env)
>  {
>      uint64_t temp_ticks;
>
> -    temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL),
> -                          get_ticks_per_sec(), 1000000);
> +    temp_ticks = cpu_get_icount_raw();

So I have guests (for better or worse) that make assumptions about the
rate of the PMCCNTR WRT real time. But isn't the PMCCNTR really a
clock cycle counter rather than an instruction counter? That clock
rate is well defined even if it is just the trivial
get_ticks_per_sec() at the moment. Ideally we should have a
configurable clock rate in there instead of get_ticks_per_sec(). This
is a major change in definition.

I can see your use case though, where you actually want this to mean
something WRT program performance. Should we add a switch between the
two behaviours?

Regards,
Peter

>
>      if (env->cp15.c9_pmcr & PMCRD) {
>          /* Increment once every 64 processor clock cycles */
> @@ -687,8 +686,7 @@ static uint64_t pmccntr_read(CPUARMState *env, const ARMCPRegInfo *ri)
>          return env->cp15.c15_ccnt;
>      }
>
> -    total_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL),
> -                           get_ticks_per_sec(), 1000000);
> +    total_ticks = cpu_get_icount_raw();
>
>      if (env->cp15.c9_pmcr & PMCRD) {
>          /* Increment once every 64 processor clock cycles */
> @@ -708,8 +706,7 @@ static void pmccntr_write(CPUARMState *env, const ARMCPRegInfo *ri,
>          return;
>      }
>
> -    total_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL),
> -                           get_ticks_per_sec(), 1000000);
> +    total_ticks = cpu_get_icount_raw();
>
>      if (env->cp15.c9_pmcr & PMCRD) {
>          /* Increment once every 64 processor clock cycles */
> --
> 1.9.1
>
>

Christopher Covington May 1, 2015, 2:35 p.m. UTC | #5

On Thu, Apr 30, 2015 at 9:24 PM, Peter Crosthwaite
<peter.crosthwaite@xilinx.com> wrote:
> On Thu, Apr 30, 2015 at 11:14 AM, Christopher Covington
> <christopher.covington@linaro.org> wrote:
>> Present a system with an instructions per cycle of exactly one.
>> This makes it less likely a user will mistake the cycle counter
>> values as meaningful and makes calculations involving cycles
>> trivial while preserving the necessary property of the cycle
>> counter register as monotonically increasing.
>>
>> Signed-off-by: Christopher Covington <christopher.covington@linaro.org>
>> ---
>>  target-arm/helper.c | 9 +++------
>>  1 file changed, 3 insertions(+), 6 deletions(-)
>>
>> diff --git a/target-arm/helper.c b/target-arm/helper.c
>> index 3e6fb0b..a027a19 100644
>> --- a/target-arm/helper.c
>> +++ b/target-arm/helper.c
>> @@ -648,8 +648,7 @@ void pmccntr_sync(CPUARMState *env)
>>  {
>>      uint64_t temp_ticks;
>>
>> -    temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL),
>> -                          get_ticks_per_sec(), 1000000);
>> +    temp_ticks = cpu_get_icount_raw();
>
> So I have guests (for better or worse) that make assumptions about the
> rate of the PMCCNTR WRT real time. But isn't the PMCCNTR really a
> clock cycle counter rather than an instruction counter? That clock
> rate is well defined even if it is just the trivial
> get_ticks_per_sec() at the moment. Ideally we should have a
> configurable clock rate in there instead of get_ticks_per_sec(). This
> is a major change in definition.

Is this with KVM, user emulation, or system emulation mode? I'm mainly
looking at things from the TCG (and mostly system mode) perspective.
If not using KVM, would the assumptions of your guests hold if you ran
them on a hardware device instead of QEMU TCG? I'm still trying to
understand all the code involved, but I don't see get_ticks_per_sec()
or any other constants as problematic.

> I can see your use case though, where you actually want this to mean
> something WRT program performance. Should we add a switch between the
> two behaviours?

This switch already exists, in the form of -icount, and I really
should have already been using it. What initially scared me away was
qemu_icount_bias and icount_adjust(). Accounting for halting seems
like a good thing but the accounting for "speed" by referencing a host
clock doesn't make sense to me. If I have an ARM development board
hooked up to an x86 desktop, it's not querying the desktop for time.
So why does it make sense for QEMU system emulation mode behave like
that? The -icount help text makes it sound like adjustment for speed
reasons should only take place when "auto" is specified, but I have
yet to understand the code well enough to verify that.

Chris

Paolo Bonzini May 4, 2015, 9:54 a.m. UTC | #6

On 01/05/2015 00:02, Peter Maydell wrote:
> On 30 April 2015 at 22:33, Christopher Covington
> <christopher.covington@linaro.org> wrote:
>> On Apr 30, 2015 2:28 PM, "Peter Maydell" <peter.maydell@linaro.org> wrote:
>>> Are you really really sure the _raw function is the correct one?
>>> Nowhere else in the codebase calls it except the other wrappers
>>> in cpu.c which provide a sane view of the instruction count...
>>> (I suspect cpu_get_icount_raw() should really be static.)
>>
>> I thought it wasn't being used because it was new, having been introduced by
>> Pavel Dovgalyuk's determinism patches.
>>
>> When you talk about sanity, what would this patch make insane?

The right function is cpu_get_ticks().  This one works without icount as
well.

At the top there is:

    if (use_icount) {
        return cpu_get_icount();
    }

and perhaps it would be correct to use cpu_get_icount_raw() if you do
not want the cycle counter to increase in halted state.  But at least
x86 wants it to increase.

Paolo

 The
>> instructions per second and cycles per second that would result? If so, what
>> if seconds/timer ticks were changed in the same patch to be derived from the
>> instruction count. All of this could potentially only apply with -icount
>> specified.
> 
> I don't really know for certain how the code here works, but
> it makes me very dubious when I see a function that is being
> used literally nowhere else in any other target CPU, and
> where every other code path to it goes via taking a lock.
> 
> Paolo: can you suggest what the right function to call to implement
> a cycle counter is?
> 
> thanks
> -- PMM
>

Peter Crosthwaite May 6, 2015, 2:05 p.m. UTC | #7

On Fri, May 1, 2015 at 7:35 AM, Christopher Covington
<christopher.covington@linaro.org> wrote:
> On Thu, Apr 30, 2015 at 9:24 PM, Peter Crosthwaite
> <peter.crosthwaite@xilinx.com> wrote:
>> On Thu, Apr 30, 2015 at 11:14 AM, Christopher Covington
>> <christopher.covington@linaro.org> wrote:
>>> Present a system with an instructions per cycle of exactly one.
>>> This makes it less likely a user will mistake the cycle counter
>>> values as meaningful and makes calculations involving cycles
>>> trivial while preserving the necessary property of the cycle
>>> counter register as monotonically increasing.
>>>
>>> Signed-off-by: Christopher Covington <christopher.covington@linaro.org>
>>> ---
>>>  target-arm/helper.c | 9 +++------
>>>  1 file changed, 3 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/target-arm/helper.c b/target-arm/helper.c
>>> index 3e6fb0b..a027a19 100644
>>> --- a/target-arm/helper.c
>>> +++ b/target-arm/helper.c
>>> @@ -648,8 +648,7 @@ void pmccntr_sync(CPUARMState *env)
>>>  {
>>>      uint64_t temp_ticks;
>>>
>>> -    temp_ticks = muldiv64(qemu_clock_get_us(QEMU_CLOCK_VIRTUAL),
>>> -                          get_ticks_per_sec(), 1000000);
>>> +    temp_ticks = cpu_get_icount_raw();
>>
>> So I have guests (for better or worse) that make assumptions about the
>> rate of the PMCCNTR WRT real time. But isn't the PMCCNTR really a
>> clock cycle counter rather than an instruction counter? That clock
>> rate is well defined even if it is just the trivial
>> get_ticks_per_sec() at the moment. Ideally we should have a
>> configurable clock rate in there instead of get_ticks_per_sec(). This
>> is a major change in definition.
>
> Is this with KVM, user emulation, or system emulation mode? I'm mainly
> looking at things from the TCG (and mostly system mode) perspective.

The same. My guest has only ever run on TCG.

> If not using KVM, would the assumptions of your guests hold if you ran
> them on a hardware device instead of QEMU TCG?

I run them as a matter of course on hardware devices.

What I am most worried about (and I need to run some tests to really
confirm facts) is what happens if a CPU WFIs. Should the PMCCNTR
continue on or hold its value? If we match instruction execution to
PMCCNTR to the PMCCNTR will freeze.

Regards,
Peter

> I'm still trying to
> understand all the code involved, but I don't see get_ticks_per_sec()
> or any other constants as problematic.
>
>> I can see your use case though, where you actually want this to mean
>> something WRT program performance. Should we add a switch between the
>> two behaviours?
>
> This switch already exists, in the form of -icount, and I really
> should have already been using it. What initially scared me away was
> qemu_icount_bias and icount_adjust(). Accounting for halting seems
> like a good thing but the accounting for "speed" by referencing a host
> clock doesn't make sense to me. If I have an ARM development board
> hooked up to an x86 desktop, it's not querying the desktop for time.
> So why does it make sense for QEMU system emulation mode behave like
> that? The -icount help text makes it sound like adjustment for speed
> reasons should only take place when "auto" is specified, but I have
> yet to understand the code well enough to verify that.
>
> Chris
>

Peter Maydell May 6, 2015, 3:38 p.m. UTC | #8

On 6 May 2015 at 15:05, Peter Crosthwaite <peter.crosthwaite@xilinx.com> wrote:
> What I am most worried about (and I need to run some tests to really
> confirm facts) is what happens if a CPU WFIs. Should the PMCCNTR
> continue on or hold its value? If we match instruction execution to
> PMCCNTR to the PMCCNTR will freeze.

See the v8 ARM ARM D5.1.1: this doesn't count PE clock cycles, it's
linked to the hardware processor clock. The exact relationship is
IMPDEF so we have some leeway for doing whatever seems reasonable here.
Permitted things:
 * counter can stop on WFI (see D5.1.3)
 * counter can continue to run on WFI
   (it's impdef whether the PE clock is gated when in WFI, though
   I would expect that to be a popular implementation)
 * counter can read the same value if read twice in a row
 * counter can run forward a lot even if no insns executed
   (the example given is of a hyperthreading implementation)

So we should do whatever seems most convenient implementation-wise
I think...

-- PMM

[RFC,5/5] arm: Simplify cycle counter

Commit Message

Comments

Patch