Patchwork [RFC] perf: Add a few generic stalled-cycles events

Submitter sukadev@linux.vnet.ibm.com
Date Oct. 12, 2012, 1:28 a.m.
Message ID <20121012012839.GA15348@us.ibm.com>
Permalink /patch/191036/
State Not Applicable

Comments

sukadev@linux.vnet.ibm.com - Oct. 12, 2012, 1:28 a.m.
From 89cb6a25b9f714e55a379467a832ee015014ed11 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Date: Tue, 18 Sep 2012 10:59:01 -0700
Subject: [PATCH] perf: Add a few generic stalled-cycles events

The existing generic event 'stalled-cycles-backend' corresponds to
PM_CMPLU_STALL event in Power7. While this event is useful, detailed
performance analysis often requires us to find more specific reasons
for the stalled cycle. For instance, stalled cycles in Power7 can
occur due to, among others:

	- instruction fetch unit (IFU),
	- Load-store-unit (LSU),
	- Fixed point unit (FXU)
	- Branch unit (BRU)

While it is possible to use raw codes to monitor these events, it quickly
becomes cumbersome with performance analysis frequently requiring mapping
the raw event codes in reports to their symbolic names.

This patch is a proposal to try and generalize such perf events. Since
the code changes are quite simple, I bunched all the 4 events together.

I am not familiar with how readily these events would map to other
architectures. Here is some information on the events for Power7:

	stalled-cycles-fixed-point (PM_CMPLU_STALL_FXU)

		Following a completion stall, the last instruction to finish
		before completion resumes was from the Fixed Point Unit.

		Completion stall is any period when no groups completed and
		the completion table was not empty for that thread.

	stalled-cycles-load-store (PM_CMPLU_STALL_LSU)

		Following a completion stall, the last instruction to finish
		before completion resumes was from the Load-Store Unit.

	stalled-cycles-instruction-fetch (PM_CMPLU_STALL_IFU)

		Following a completion stall, the last instruction to finish
		before completion resumes was from the Instruction Fetch Unit.

	stalled-cycles-branch (PM_CMPLU_STALL_BRU)

		Following a completion stall, the last instruction to finish
		before completion resumes was from the Branch Unit.

Looking for feedback on this approach and if this can be further extended.
Power7 has 530 events[2] out of which a "CPI stack analysis"[1] uses about 26
events.


[1] CPI Stack analysis
	https://www.power.org/documentation/commonly-used-metrics-for-performance-analysis

[2] Power7 events:
	https://www.power.org/documentation/comprehensive-pmu-event-reference-power7/

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 arch/powerpc/perf/power7-pmu.c |    4 ++++
 include/linux/perf_event.h     |    4 ++++
 tools/perf/builtin-stat.c      |    4 ++++
 tools/perf/util/evsel.c        |    4 ++++
 tools/perf/util/parse-events.l |    4 ++++
 tools/perf/util/python.c       |    4 ++++
 6 files changed, 24 insertions(+), 0 deletions(-)
Anshuman Khandual - Oct. 15, 2012, 5:26 a.m.
On 10/12/2012 06:58 AM, Sukadev Bhattiprolu wrote:
> 
> From 89cb6a25b9f714e55a379467a832ee015014ed11 Mon Sep 17 00:00:00 2001
> From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> Date: Tue, 18 Sep 2012 10:59:01 -0700
> Subject: [PATCH] perf: Add a few generic stalled-cycles events
> 
> The existing generic event 'stalled-cycles-backend' corresponds to
> PM_CMPLU_STALL event in Power7. While this event is useful, detailed
> performance analysis often requires us to find more specific reasons
> for the stalled cycle. For instance, stalled cycles in Power7 can
> occur due to, among others:
> 
> 	- instruction fetch unit (IFU),
> 	- Load-store-unit (LSU),
> 	- Fixed point unit (FXU)
> 	- Branch unit (BRU)
> 
> While it is possible to use raw codes to monitor these events, it quickly
> becomes cumbersome with performance analysis frequently requiring mapping
> the raw event codes in reports to their symbolic names.
> 
> This patch is a proposal to try and generalize such perf events. Since
> the code changes are quite simple, I bunched all the 4 events together.
> 
> I am not familiar with how readily these events would map to other
> architectures. Here is some information on the events for Power7:
> 
> 	stalled-cycles-fixed-point (PM_CMPLU_STALL_FXU)
> 
> 		Following a completion stall, the last instruction to finish
> 		before completion resumes was from the Fixed Point Unit.
> 
> 		Completion stall is any period when no groups completed and
> 		the completion table was not empty for that thread.
> 
> 	stalled-cycles-load-store (PM_CMPLU_STALL_LSU)
> 
> 		Following a completion stall, the last instruction to finish
> 		before completion resumes was from the Load-Store Unit.
> 
> 	stalled-cycles-instruction-fetch (PM_CMPLU_STALL_IFU)
> 
> 		Following a completion stall, the last instruction to finish
> 		before completion resumes was from the Instruction Fetch Unit.
> 
> 	stalled-cycles-branch (PM_CMPLU_STALL_BRU)
> 
> 		Following a completion stall, the last instruction to finish
> 		before completion resumes was from the Branch Unit.
> 
> Looking for feedback on this approach and if this can be further extended.
> Power7 has 530 events[2] out of which a "CPI stack analysis"[1] uses about 26
> events.
> 
> 
> [1] CPI Stack analysis
> 	https://www.power.org/documentation/commonly-used-metrics-for-performance-analysis
> 
> [2] Power7 events:
> 	https://www.power.org/documentation/comprehensive-pmu-event-reference-power7/

Here we should try to come up with a generic list of places in the processor where
the cycles can stall.

PERF_COUNT_HW_STALLED_CYCLES_FIXED_POINT
PERF_COUNT_HW_STALLED_CYCLES_LOAD_STORE
PERF_COUNT_HW_STALLED_CYCLES_INSTRUCTION_FETCH
PERF_COUNT_HW_STALLED_CYCLES_BRANCH
PERF_COUNT_HW_STALLED_CYCLES_<ANY_OTHER_PLACE1>
PERF_COUNT_HW_STALLED_CYCLES_<ANY_OTHER_PLACE2>
PERF_COUNT_HW_STALLED_CYCLES_<ANY_OTHER_PLACE3>
-----------------------------------------------

This generic list can be a superset that accommodates all architectures, giving
each one the flexibility to implement events selectively thereafter. Stall locations
are very important from a CPI analysis standpoint in real-world use cases. This will
definitely help us in that direction.

Regards
Anshuman
Robert Richter - Oct. 15, 2012, 3:55 p.m.
On 11.10.12 18:28:39, Sukadev Bhattiprolu wrote:
> +  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_FIXED_POINT },
> +  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_LOAD_STORE },
> +  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_INSTRUCTION_FETCH },
> +  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_BRANCH },

Instead of adding new hardware event types I would prefer to use raw
events in conjunction with sysfs, see e.g. the intel-uncore
implementation. Something like:

 $ find /sys/bus/event_source/devices/cpu/events/
 ...
 /sys/bus/event_source/devices/cpu/events/stalled-cycles-fixed-point
 /sys/bus/event_source/devices/cpu/events/stalled-cycles-load-store
 /sys/bus/event_source/devices/cpu/events/stalled-cycles-instruction-fetch
 /sys/bus/event_source/devices/cpu/events/stalled-cycles-branch
 ...
 $ cat /sys/bus/event_source/devices/cpu/events/stalled-cycles-fixed-point
 event=0xff,umask=0x00

Perf tool works then out-of-the-box with:

 $ perf record -e cpu/stalled-cycles-fixed-point/ ...

The event string can easily be reused by other architectures as a
quasi standard.
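
For illustration, the alias file content shown above is a comma-separated
list of name=value terms. A minimal C sketch of splitting such a string
follows. The names struct alias_term and parse_alias_terms() are
hypothetical and chosen for this example; this is not the perf tool's
actual parser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One "name=value" term from an events/ alias file (hypothetical type). */
struct alias_term {
	char name[32];
	uint64_t value;
};

/*
 * Split a string such as "event=0xff,umask=0x00" into terms.
 * Returns the number of terms stored in out, or -1 on malformed input
 * or when more than max terms are present.
 */
int parse_alias_terms(const char *s, struct alias_term *out, size_t max)
{
	size_t n = 0;

	assert(s != NULL && out != NULL);
	while (*s != '\0' && *s != '\n') {
		const char *eq = strchr(s, '=');
		char *end;
		size_t len;

		if (eq == NULL || n == max)
			return -1;
		len = (size_t)(eq - s);
		/* Reject empty, oversized, or comma-containing names. */
		if (len == 0 || len >= sizeof(out[n].name) ||
		    memchr(s, ',', len) != NULL)
			return -1;
		memcpy(out[n].name, s, len);
		out[n].name[len] = '\0';

		/* Base 0 accepts both hex ("0xff") and decimal values. */
		out[n].value = strtoull(eq + 1, &end, 0);
		if (end == eq + 1)
			return -1;
		n++;

		if (*end == ',')
			end++;
		else if (*end != '\0' && *end != '\n')
			return -1;
		s = end;
	}
	return (int)n;
}
```

Calling parse_alias_terms() on the example content yields two terms,
"event" and "umask", which a tool could then place into the raw config
according to the PMU's format/ description.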

-Robert
Arun Sharma - Oct. 15, 2012, 5:23 p.m.
On 10/15/12 8:55 AM, Robert Richter wrote:

[..]
> Perf tool works then out-of-the-box with:
>
>   $ perf record -e cpu/stalled-cycles-fixed-point/ ...
>
> The event string can easily be reused by other architectures as a
> quasi standard.

I like Robert's proposal better. It's hard to model all the stall events 
(eg: instruction decoder related stalls on x86) in a hardware 
independent way.

Another area to think about: software engineers are generally busy and 
have a limited amount of time to devote to hardware event based 
optimizations. The most common question I hear is: what is the expected 
perf gain if I fix this? It's hard to answer that with just the stall 
events.

  -Arun
Anshuman Khandual - Oct. 16, 2012, 5:28 a.m.
On 10/15/2012 10:53 PM, Arun Sharma wrote:
> On 10/15/12 8:55 AM, Robert Richter wrote:
> 
> [..]
>> Perf tool works then out-of-the-box with:
>>
>>   $ perf record -e cpu/stalled-cycles-fixed-point/ ...
>>
>> The event string can easily be reused by other architectures as a
>> quasi standard.
> 
> I like Robert's proposal better. It's hard to model all the stall events
> (eg: instruction decoder related stalls on x86) in a hardware
> independent way.
> 
> Another area to think about: software engineers are generally busy and
> have a limited amount of time to devote to hardware event based
> optimizations. The most common question I hear is: what is the expected
> perf gain if I fix this? It's hard to answer that with just the stall
> events.
> 

Hardware event based optimization is a very important aspect of real-world application
tuning. CPI stack analysis is a good reason why perf should have stall events as generic
ones. But I am not clear on when we should add new generic events to
linux/perf_event.h and when we should go with the sysfs interface.
Could you please elaborate on this?

Regards
Anshuman
Robert Richter - Oct. 16, 2012, 10:08 a.m.
Sukadev,

On 15.10.12 17:55:34, Robert Richter wrote:
> On 11.10.12 18:28:39, Sukadev Bhattiprolu wrote:
> > +  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_FIXED_POINT },
> > +  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_LOAD_STORE },
> > +  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_INSTRUCTION_FETCH },
> > +  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_BRANCH },
> 
> Instead of adding new hardware event types I would prefer to use raw
> events in conjunction with sysfs, see e.g. the intel-uncore
> implementation. Something like:
> 
>  $ find /sys/bus/event_source/devices/cpu/events/
>  ...
>  /sys/bus/event_source/devices/cpu/events/stalled-cycles-fixed-point
>  /sys/bus/event_source/devices/cpu/events/stalled-cycles-load-store
>  /sys/bus/event_source/devices/cpu/events/stalled-cycles-instruction-fetch
>  /sys/bus/event_source/devices/cpu/events/stalled-cycles-branch
>  ...
>  $ cat /sys/bus/event_source/devices/cpu/events/stalled-cycles-fixed-point
>  event=0xff,umask=0x00
> 
> Perf tool works then out-of-the-box with:
> 
>  $ perf record -e cpu/stalled-cycles-fixed-point/ ...

I refer here to arch/x86/kernel/cpu/perf_event_intel_uncore.c (should
be in v3.7-rc1 or tip:perf/core). See the INTEL_UNCORE_EVENT_DESC()
macro and 'if (type->event_descs) ...' in uncore_type_init(). The code
should be reworked to be non-architectural.

PMU registration has been implemented for some time already for all
architectures and PMU types:

 /sys/bus/event_source/devices/*

But

 /sys/bus/event_source/devices/*/events/

exists only for a small number of pmus. Perf tool support of this was
implemented with:

 a6146d5 perf/tool: Add PMU event alias support

-Robert
Stephane Eranian - Oct. 16, 2012, 12:21 p.m.
On Tue, Oct 16, 2012 at 12:08 PM, Robert Richter <robert.richter@amd.com> wrote:
> Sukadev,
>
> On 15.10.12 17:55:34, Robert Richter wrote:
>> On 11.10.12 18:28:39, Sukadev Bhattiprolu wrote:
>> > +  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_FIXED_POINT },
>> > +  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_LOAD_STORE },
>> > +  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_INSTRUCTION_FETCH },
>> > +  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_BRANCH },
>>
>> Instead of adding new hardware event types I would prefer to use raw
>> events in conjunction with sysfs, see e.g. the intel-uncore
>> implementation. Something like:
>>
In general, I don't like generic events, and especially stall events. I
have not seen a clear definition of what they mean. Without it, there is
no way to understand how to map them across architectures. If the
definition is too precise, you may not be able to find an exact mapping.
If the definition is too loose, then it is unclear what you are
measuring.

Also, this opens another can of worms: on some processors, you may need
more than one event to encapsulate what the generic event is supposed to
measure. That means developing a lot of code in the kernel to express
and manage that. And of course, you would not be able to sample on those
events (you cannot sample on a difference, for instance). So all in all,
I think this is not a very good idea. You have to put this into the tool
or a library that auto-detects the host CPU and programs the right set
of events.
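
As a rough sketch of the tool-side approach described above, a userspace
table could map a detected CPU and a symbolic name to a raw code. The
Power7 codes below are the ones used in the patch under discussion. The
"power7" identifier string, the struct, and lookup_raw_event() are
assumptions made for illustration, and CPU auto-detection is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One row of a hypothetical userspace event table. */
struct raw_event_map {
	const char *cpu;	/* assumed CPU identifier string */
	const char *name;	/* symbolic event name */
	uint64_t raw;		/* raw event code for that CPU */
};

/* Raw codes taken from the power7-pmu.c hunk of this patch. */
static const struct raw_event_map raw_event_table[] = {
	{ "power7", "stalled-cycles-fixed-point",       0x20014 },
	{ "power7", "stalled-cycles-load-store",        0x20012 },
	{ "power7", "stalled-cycles-instruction-fetch", 0x4004c },
	{ "power7", "stalled-cycles-branch",            0x4004e },
};

/*
 * Look up the raw code for (cpu, name).
 * Returns 0 and stores the code in *raw, or -1 if there is no mapping.
 */
int lookup_raw_event(const char *cpu, const char *name, uint64_t *raw)
{
	size_t i;
	size_t n = sizeof(raw_event_table) / sizeof(raw_event_table[0]);

	assert(cpu != NULL && name != NULL && raw != NULL);
	for (i = 0; i < n; i++) {
		if (strcmp(raw_event_table[i].cpu, cpu) == 0 &&
		    strcmp(raw_event_table[i].name, name) == 0) {
			*raw = raw_event_table[i].raw;
			return 0;
		}
	}
	return -1;
}
```

A missing row simply means the symbolic name has no mapping on that CPU,
which keeps the kernel free of any cross-architecture definition.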

We've had that discussion many times. Just reiterating my personal
opinion on this.

>>  $ find /sys/bus/event_source/devices/cpu/events/
>>  ...
>>  /sys/bus/event_source/devices/cpu/events/stalled-cycles-fixed-point
>>  /sys/bus/event_source/devices/cpu/events/stalled-cycles-load-store
>>  /sys/bus/event_source/devices/cpu/events/stalled-cycles-instruction-fetch
>>  /sys/bus/event_source/devices/cpu/events/stalled-cycles-branch
>>  ...
>>  $ cat /sys/bus/event_source/devices/cpu/events/stalled-cycles-fixed-point
>>  event=0xff,umask=0x00
>>
>> Perf tool works then out-of-the-box with:
>>
>>  $ perf record -e cpu/stalled-cycles-fixed-point/ ...
>
> I refer here to arch/x86/kernel/cpu/perf_event_intel_uncore.c (should
> be in v3.7-rc1 or tip:perf/core). See the INTEL_UNCORE_EVENT_DESC()
> macro and 'if (type->event_descs) ...' in uncore_type_init(). The code
> should be reworked to be non-architectural.
>
> PMU registration is implemented for a longer time already for all
> architectures and pmu types:
>
>  /sys/bus/event_source/devices/*
>
> But
>
>  /sys/bus/event_source/devices/*/events/
>
> exists only for a small number of pmus. Perf tool support of this was
> implemented with:
>
>  a6146d5 perf/tool: Add PMU event alias support
>
> -Robert
>
> --
> Advanced Micro Devices, Inc.
> Operating System Research Center
>
sukadev@linux.vnet.ibm.com - Oct. 16, 2012, 6:31 p.m.
Robert Richter [robert.richter@amd.com] wrote:
| Sukadev,
| 
| On 15.10.12 17:55:34, Robert Richter wrote:
| > On 11.10.12 18:28:39, Sukadev Bhattiprolu wrote:
| > > +  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_FIXED_POINT },
| > > +  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_LOAD_STORE },
| > > +  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_INSTRUCTION_FETCH },
| > > +  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_BRANCH },
| > 
| > Instead of adding new hardware event types I would prefer to use raw
| > events in conjunction with sysfs, see e.g. the intel-uncore
| > implementation. Something like:
| > 
| >  $ find /sys/bus/event_source/devices/cpu/events/
| >  ...
| >  /sys/bus/event_source/devices/cpu/events/stalled-cycles-fixed-point
| >  /sys/bus/event_source/devices/cpu/events/stalled-cycles-load-store
| >  /sys/bus/event_source/devices/cpu/events/stalled-cycles-instruction-fetch
| >  /sys/bus/event_source/devices/cpu/events/stalled-cycles-branch
| >  ...
| >  $ cat /sys/bus/event_source/devices/cpu/events/stalled-cycles-fixed-point
| >  event=0xff,umask=0x00
| > 
| > Perf tool works then out-of-the-box with:
| > 
| >  $ perf record -e cpu/stalled-cycles-fixed-point/ ...
| 
| I refer here to arch/x86/kernel/cpu/perf_event_intel_uncore.c (should
| be in v3.7-rc1 or tip:perf/core). See the INTEL_UNCORE_EVENT_DESC()
| macro and 'if (type->event_descs) ...' in uncore_type_init(). The code
| should be reworked to be non-architectural.


Ok. I will look through that code. Does that mean we are trying to avoid
any more new hardware generic events ?

Also a broader question - is the sysfs approach intended for all raw events
or just for the generic events supported in the kernel ?

If it is intended for all events a CPU supports, isn't there a chance of
bloating kernel code ? Power7 has 530 events and Intel Nehalem (in libpfm)
seems to have 370 events. Would that mean we would need to represent all these
events in the kernel so they are available in sysfs ?

On a side note, how does the kernel on x86 use the 'config' information in 
say /sys/bus/event_source/devices/cpu/format/cccr ? On Power7, the raw
code encodes the information such as the PMC to use for the event. Is that
how the 'config' info in Intel is used ?

Does the 'config' info change from system to system or is it static for
a given event on a given CPU ?

I guess I am trying to understand if this mapping between event-name (event
code) and the config info is something the kernel needs/uses or is it something
the kernel simply passes through from userspace to CPU ?

AFAICT, on Power we use the raw codes to determine which PMC to select
and which bits to set in some registers. That selection is static for a given 
CPU type such as Power7. If it is static, is it worth adding all this static 
mapping (for 530 events) into the kernel ?

If we don't add to the kernel, we don't seem to have a way to specify the 
events symbolically.

Thanks for your detailed comments.
| 
| PMU registration is implemented for a longer time already for all
| architectures and pmu types:
| 
|  /sys/bus/event_source/devices/*

Yes I see this.

| 
| But
| 
|  /sys/bus/event_source/devices/*/events/

Thanks for clarifying. I was looking to see if this was implemented too :-)

Sukadev

| 
| exists only for a small number of pmus. Perf tool support of this was
| implemented with:
| 
|  a6146d5 perf/tool: Add PMU event alias support
| 
| -Robert
| 
| -- 
| Advanced Micro Devices, Inc.
| Operating System Research Center
sukadev@linux.vnet.ibm.com - Oct. 19, 2012, 5:05 p.m.
Stephane Eranian [eranian@google.com] wrote:
| So all in all, I think this is not a very good idea. You have to put
| this into the tool or a library that auto-detects the
| host CPU and programs the right set of events.
| 
| We've had that discussion many times. Just reiterating my personal
| opinion on this.

Yes, that would work too. One drawback is that the hardware events
will be in the tool, while the software/tracepoint events will be in
the kernel's sysfs representation.

Or is that the reason we want all events in one place (sysfs) ?

Sukadev
Peter Zijlstra - Oct. 24, 2012, 12:27 p.m.
On Tue, 2012-10-16 at 11:31 -0700, Sukadev Bhattiprolu wrote:
> On a side note, how does the kernel on x86 use the 'config' information in 
> say /sys/bus/event_source/devices/cpu/format/cccr ? On Power7, the raw
> code encodes the information such as the PMC to use for the event. Is that
> how the 'config' info in Intel is used ?
> 
> Does the 'config' info change from system to system or is it static for
> a given event on a given CPU ? 

Have a look at commits (tip/master):

  641cc938815dfd09f8fa1ec72deb814f0938ac33
  a47473939db20e3961b200eb00acf5fcf084d755
  43c032febde48aabcf6d59f47cdcb7b5debbdc63


So basically

 /sys/bus/event_source/devices/cpu/format/event

contains something like:

  config:0-7

Which says that for the 'cpu' PMU, field 'event' fills
perf_event_attr::config bits 0 through 7 (for type=PERF_TYPE_RAW).
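
To make the bit-range idea concrete, here is a minimal C sketch that
parses a format string such as "config:0-7" and places a field value into
those bits. The names struct field_fmt, parse_format() and pack_field()
are hypothetical. The sketch handles only the plain "config" word with a
single range or a single bit, not config1/config2 or multi-range lists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bit range taken from a format/ file such as "config:0-7". */
struct field_fmt {
	int lo;
	int hi;
};

/*
 * Parse "config:LO-HI" or "config:BIT".
 * Returns 0 on success, -1 if the string is not in that form.
 */
int parse_format(const char *spec, struct field_fmt *fmt)
{
	int lo, hi, n;

	assert(spec != NULL && fmt != NULL);
	n = sscanf(spec, "config:%d-%d", &lo, &hi);
	if (n == 1)
		hi = lo;		/* single-bit field */
	else if (n != 2)
		return -1;
	if (lo < 0 || hi < lo || hi > 63)
		return -1;
	fmt->lo = lo;
	fmt->hi = hi;
	return 0;
}

/* Place value into bits lo..hi of config, leaving other bits unchanged. */
uint64_t pack_field(uint64_t config, const struct field_fmt *fmt,
		    uint64_t value)
{
	int width = fmt->hi - fmt->lo + 1;
	uint64_t mask = (width == 64) ? ~0ULL : ((1ULL << width) - 1);

	return (config & ~(mask << fmt->lo)) | ((value & mask) << fmt->lo);
}
```

With the format above, packing event=0x3c into an empty config gives a
raw config of 0x3c; a second field described as bits 8-15 would land in
the next byte without disturbing the first.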

The perf tool syntax for this is:

  perf stat -e 'cpu/event=0x3c/'

This basically allows you to expose bitfields in the 'raw' event format
for ease of writing raw events. I do not know if the Power PMU has such
or not.

Using this,

  /sys/bus/event_source/devices/cpu/events/cpu-cycles

would contain something like:

  event=0x3c

which one can use as:

  perf stat -e 'cpu/event=cpu-cycles/'
  perf stat -e 'cpu/cpu-cycles/'

The tool will then read the sysfs file, substitute the content to
obtain:

  perf stat -e 'cpu/event=0x3c/'

and run with that.
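
The substitution step can be sketched in C as a simple table lookup. The
table stands in for the contents of the events/ directory; struct
event_alias and expand_alias() are hypothetical names for this example
and do not reflect how the perf tool implements it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* In-memory stand-in for one file under events/ (hypothetical type). */
struct event_alias {
	const char *name;	/* file name, e.g. "cpu-cycles" */
	const char *terms;	/* file content, e.g. "event=0x3c" */
};

/*
 * If term names an alias, copy the alias content into out; otherwise
 * copy term unchanged. Returns 0 on success, -1 if out is too small.
 */
int expand_alias(const struct event_alias *tbl, size_t n, const char *term,
		 char *out, size_t outsz)
{
	const char *src = term;
	size_t i;
	int len;

	assert(tbl != NULL || n == 0);
	assert(term != NULL && out != NULL);
	for (i = 0; i < n; i++) {
		if (strcmp(tbl[i].name, term) == 0) {
			src = tbl[i].terms;
			break;
		}
	}
	len = snprintf(out, outsz, "%s", src);
	return (len < 0 || (size_t)len >= outsz) ? -1 : 0;
}
```

A term that is not an alias passes through untouched, so an explicit
'event=0x3c' and the symbolic 'cpu-cycles' end up as the same string.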

Within all this, the perf_event_attr::config* field names are hard-coded
special, so 'cpu/config=0xffff/' will always work, even without sysfs
format/ specification and is equivalent to the raw event stuff we had
before.


If the Power PMU lacks any structure to the raw config, you could simply
provide sysfs events/ files with content like:

  config=0xdeadbeef
sukadev@linux.vnet.ibm.com - Oct. 31, 2012, 6:40 a.m.
Peter Zijlstra [peterz@infradead.org] wrote:
| On Tue, 2012-10-16 at 11:31 -0700, Sukadev Bhattiprolu wrote:
| > On a side note, how does the kernel on x86 use the 'config' information in 
| > say /sys/bus/event_source/devices/cpu/format/cccr ? On Power7, the raw
| > code encodes the information such as the PMC to use for the event. Is that
| > how the 'config' info in Intel is used ?
| > 
| > Does the 'config' info change from system to system or is it static for
| > a given event on a given CPU ? 
| 
| Have a look at commits (tip/master):
| 
|   641cc938815dfd09f8fa1ec72deb814f0938ac33
|   a47473939db20e3961b200eb00acf5fcf084d755
|   43c032febde48aabcf6d59f47cdcb7b5debbdc63
| 
| 
| So basically
| 
|  /sys/bus/event_source/devices/cpu/format/event
| 
| contains something like:
| 
|   config:0-7
| 
| Which says that for the 'cpu' PMU, field 'event' fills
| perf_event_attr::config bits 0 through 7 (for type=PERF_TYPE_RAW).
| 
| The perf tool syntax for this is:
| 
|   perf stat -e 'cpu/event=0x3c/'
| 
| This basically allows you to expose bitfields in the 'raw' event format
| for ease of writing raw events. I do not know if the Power PMU has such
| or not.

Thanks for the detailed explanation.

Power does not support this yet, but I have started working on it now.

BTW, does this mean that we can use arch-specific names for the sysfs entries
within:

	/sys/bus/event_source/devices/cpu/events/

So instead of the names I came up with in this patch, stalled-cycles-fixed-point
we could use the name used in the CPU spec - 'cmplu_stall_fxu' in the arch
specific code ?

Sukadev
Peter Zijlstra - Oct. 31, 2012, 7:22 a.m.
On Tue, 2012-10-30 at 23:40 -0700, Sukadev Bhattiprolu wrote:
> So instead of the names I came up with in this patch, stalled-cycles-fixed-point
> we could use the name used in the CPU spec - 'cmplu_stall_fxu' in the arch
> specific code ? 

You could, but I would advise against it. Human readable names are so
much more accessible.

Patch

diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index 1251e4d..813e7c7 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -304,6 +304,10 @@  static int power7_generic_events[] = {
 	[PERF_COUNT_HW_CACHE_MISSES] = 0x400f0,		/* LD_MISS_L1	*/
 	[PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x10068,	/* BRU_FIN	*/
 	[PERF_COUNT_HW_BRANCH_MISSES] = 0x400f6,	/* BR_MPRED	*/
+	[PERF_COUNT_HW_STALLED_CYCLES_FIXED_POINT] = 0x20014,/* CMPLU_STALL_FXU */
+	[PERF_COUNT_HW_STALLED_CYCLES_LOAD_STORE] = 0x20012,/* CMPLU_STALL_LSU */
+	[PERF_COUNT_HW_STALLED_CYCLES_INSTRUCTION_FETCH] = 0x4004c,/* CMPLU_STALL_IFU */
+	[PERF_COUNT_HW_STALLED_CYCLES_BRANCH] = 0x4004e,/* CMPLU_STALL_BRU */
 };
 
 #define C(x)	PERF_COUNT_HW_CACHE_##x
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index bdb4161..ff9f0a6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -55,6 +55,10 @@  enum perf_hw_id {
 	PERF_COUNT_HW_STALLED_CYCLES_FRONTEND	= 7,
 	PERF_COUNT_HW_STALLED_CYCLES_BACKEND	= 8,
 	PERF_COUNT_HW_REF_CPU_CYCLES		= 9,
+	PERF_COUNT_HW_STALLED_CYCLES_FIXED_POINT = 10,
+	PERF_COUNT_HW_STALLED_CYCLES_LOAD_STORE	= 11,
+	PERF_COUNT_HW_STALLED_CYCLES_INSTRUCTION_FETCH = 12,
+	PERF_COUNT_HW_STALLED_CYCLES_BRANCH	= 13,
 
 	PERF_COUNT_HW_MAX,			/* non-ABI */
 };
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 861f0ae..6275dbb 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -77,6 +77,10 @@  static struct perf_event_attr default_attrs[] = {
   { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_INSTRUCTIONS		},
   { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS	},
   { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_BRANCH_MISSES		},
+  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_FIXED_POINT },
+  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_LOAD_STORE },
+  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_INSTRUCTION_FETCH },
+  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_STALLED_CYCLES_BRANCH },
 
 };
 
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 2eaae14..17e3190 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -77,6 +77,10 @@  static const char *perf_evsel__hw_names[PERF_COUNT_HW_MAX] = {
 	"stalled-cycles-frontend",
 	"stalled-cycles-backend",
 	"ref-cycles",
+	"stalled-cycles-fixed-point",
+	"stalled-cycles-load-store",
+	"stalled-cycles-instruction-fetch",
+	"stalled-cycles-branch",
 };
 
 static const char *__perf_evsel__hw_name(u64 config)
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 384ca74..0c49c05 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -102,6 +102,10 @@  branch-instructions|branches			{ return sym(yyscanner, PERF_TYPE_HARDWARE, PERF_
 branch-misses					{ return sym(yyscanner, PERF_TYPE_HARDWARE, PERF_COUNT_HW_BRANCH_MISSES); }
 bus-cycles					{ return sym(yyscanner, PERF_TYPE_HARDWARE, PERF_COUNT_HW_BUS_CYCLES); }
 ref-cycles					{ return sym(yyscanner, PERF_TYPE_HARDWARE, PERF_COUNT_HW_REF_CPU_CYCLES); }
+stalled-cycles-fixed-point			{ return sym(yyscanner, PERF_TYPE_HARDWARE, PERF_COUNT_HW_STALLED_CYCLES_FIXED_POINT); }
+stalled-cycles-load-store			{ return sym(yyscanner, PERF_TYPE_HARDWARE, PERF_COUNT_HW_STALLED_CYCLES_LOAD_STORE); }
+stalled-cycles-instruction-fetch		{ return sym(yyscanner, PERF_TYPE_HARDWARE, PERF_COUNT_HW_STALLED_CYCLES_INSTRUCTION_FETCH); }
+stalled-cycles-branch				{ return sym(yyscanner, PERF_TYPE_HARDWARE, PERF_COUNT_HW_STALLED_CYCLES_BRANCH); }
 cpu-clock					{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_CPU_CLOCK); }
 task-clock					{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_TASK_CLOCK); }
 page-faults|faults				{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_PAGE_FAULTS); }
diff --git a/tools/perf/util/python.c b/tools/perf/util/python.c
index 0688bfb..c563b30 100644
--- a/tools/perf/util/python.c
+++ b/tools/perf/util/python.c
@@ -952,6 +952,10 @@  static struct {
 
 	{ "COUNT_HW_STALLED_CYCLES_FRONTEND",	  PERF_COUNT_HW_STALLED_CYCLES_FRONTEND },
 	{ "COUNT_HW_STALLED_CYCLES_BACKEND",	  PERF_COUNT_HW_STALLED_CYCLES_BACKEND },
+	{ "COUNT_HW_STALLED_CYCLES_FIXED_POINT",  PERF_COUNT_HW_STALLED_CYCLES_FIXED_POINT },
+	{ "COUNT_HW_STALLED_CYCLES_LOAD_STORE",  PERF_COUNT_HW_STALLED_CYCLES_LOAD_STORE },
+	{ "COUNT_HW_STALLED_CYCLES_INSTRUCTION_FETCH",  PERF_COUNT_HW_STALLED_CYCLES_INSTRUCTION_FETCH },
+	{ "COUNT_HW_STALLED_CYCLES_BRANCH",  PERF_COUNT_HW_STALLED_CYCLES_BRANCH },
 
 	{ "COUNT_SW_CPU_CLOCK",	       PERF_COUNT_SW_CPU_CLOCK },
 	{ "COUNT_SW_TASK_CLOCK",       PERF_COUNT_SW_TASK_CLOCK },