[RFC,00/11] perf: Enhancing perf to export processor hazard information

Message ID 20200302052355.36365-1-ravi.bangoria@linux.ibm.com (mailing list archive)

Message

Ravi Bangoria March 2, 2020, 5:23 a.m. UTC
Most modern microprocessors employ complex instruction execution
pipelines such that many instructions can be 'in flight' at any
given point in time. Various factors affect this pipeline, and
hazards are primary among them. Different types of hazards exist:
data hazards, structural hazards and control hazards. A data
hazard occurs when data dependencies exist between instructions
in different stages of the pipeline. A structural hazard occurs
when the same processor hardware is needed by more than one
in-flight instruction at the same time. Control hazards arise
from events such as branch mispredictions.

Information about these hazards is critical for analyzing
performance issues and for tuning software to overcome them.
Modern processors export such hazard data in Performance
Monitoring Unit (PMU) registers. For example, the 'Sampled
Instruction Event Register' on IBM PowerPC[1][2] and
'Instruction-Based Sampling' on AMD[3] provide this kind of
information.

Implementation detail:

A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
If it is set, the kernel converts arch-specific hazard information
into a generic format:

  struct perf_pipeline_haz_data {
         /* Instruction/Opcode type: Load, Store, Branch .... */
         __u8    itype;
         /* Instruction Cache source */
         __u8    icache;
         /* Instruction suffered hazard in pipeline stage */
         __u8    hazard_stage;
         /* Hazard reason */
         __u8    hazard_reason;
         /* Instruction suffered stall in pipeline stage */
         __u8    stall_stage;
         /* Stall reason */
         __u8    stall_reason;
         __u16   pad;
  };

... which can be read by the user from the mmap() ring buffer. With
this approach, a sample perf report in hazard mode looks like this
(on IBM PowerPC):

  # ./perf record --hazard ./ebizzy
  # ./perf report --hazard
  Overhead  Symbol          Shared  Instruction Type  Hazard Stage   Hazard Reason         Stall Stage   Stall Reason  ICache access
    36.58%  [.] thread_run  ebizzy  Load              LSU            Mispredict            LSU           Load fin      L1 hit
     9.46%  [.] thread_run  ebizzy  Load              LSU            Mispredict            LSU           Dcache_miss   L1 hit
     1.76%  [.] thread_run  ebizzy  Fixed point       -              -                     -             -             L1 hit
     1.31%  [.] thread_run  ebizzy  Load              LSU            ERAT Miss             LSU           Load fin      L1 hit
     1.27%  [.] thread_run  ebizzy  Load              LSU            Mispredict            -             -             L1 hit
     1.16%  [.] thread_run  ebizzy  Fixed point       -              -                     FXU           Fixed cycle   L1 hit
     0.50%  [.] thread_run  ebizzy  Fixed point       ISU            Source Unavailable    FXU           Fixed cycle   L1 hit
     0.30%  [.] thread_run  ebizzy  Load              LSU            LMQ Full, DERAT Miss  LSU           Load fin      L1 hit
     0.24%  [.] thread_run  ebizzy  Load              LSU            ERAT Miss             -             -             L1 hit
     0.08%  [.] thread_run  ebizzy  -                 -              -                     BRU           Fixed cycle   L1 hit
     0.05%  [.] thread_run  ebizzy  Branch            -              -                     BRU           Fixed cycle   L1 hit
     0.04%  [.] thread_run  ebizzy  Fixed point       ISU            Source Unavailable    -             -             L1 hit
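
For illustration, here is a minimal sketch of how a userspace
consumer might request and read these samples, assuming this series
is applied. The event selection is just an example, and
read_haz_from_sample() is a hypothetical helper; only the
sample_type bit and the struct come from the series:

  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <stdio.h>

  int main(void)
  {
          struct perf_event_attr attr = {0};
          int fd;

          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_RAW;
          attr.config = 0x401e0;     /* e.g. PM_MRK_INST_CMPL on POWER9 */
          attr.sample_period = 10000;
          /* Proposed bit from this series */
          attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_PIPELINE_HAZ;

          fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
          if (fd < 0)
                  return 1;

          /*
           * Each PERF_RECORD_SAMPLE in the mmap()ed ring buffer then
           * carries the IP followed by struct perf_pipeline_haz_data;
           * the tool maps the __u8 codes to strings ("LSU", "ERAT
           * Miss", ...) using arch-specific tables.
           * read_haz_from_sample() is a hypothetical helper that
           * walks the ring buffer to the hazard payload.
           */
          struct perf_pipeline_haz_data haz = read_haz_from_sample(fd);
          printf("itype=%u hazard_stage=%u hazard_reason=%u\n",
                 haz.itype, haz.hazard_stage, haz.hazard_reason);
          return 0;
  }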

Also perf annotate with hazard data:

         │    Disassembly of section .text:
         │
         │    0000000010001cf8 <compare>:
         │    compare():
         │    return NULL;
         │    }
         │
         │    static int
         │    compare(const void *p1, const void *p2)
         │    {
   33.23 │      std    r31,-8(r1)
         │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
         │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
         │       {haz_stage: LSU, haz_reason: Load Hit Store, stall_stage: LSU, stall_reason: -, icache: L3 hit}
         │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: -, stall_reason: -, icache: L1 hit}
         │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
         │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
    0.84 │      stdu   r1,-64(r1)
         │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: -, stall_reason: -, icache: L1 hit}
    0.24 │      mr     r31,r1
         │       {haz_stage: -, haz_reason: -, stall_stage: -, stall_reason: -, icache: L1 hit}
   21.18 │      std    r3,32(r31)
         │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
         │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
         │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}


Patches:
 - Patch #1 is a simple cleanup patch
 - Patches #2, #3 and #4 implement the generic and arch-specific
   kernel infrastructure
 - Patch #5 enables perf record and perf script with hazard mode
 - Patches #6, #7 and #8 enable perf report with hazard mode
 - Patches #9, #10 and #11 enable perf annotate with hazard mode

Note:
 - This series is based on the talk by Madhavan at LPC 2018[4]. It is
   an early RFC to get comments on the approach and is not intended
   to be merged yet.
 - I've prepared the series on top of v5.6-rc3, but it depends on
   generic perf annotate fixes [5][6] which Arnaldo has already
   merged into perf/urgent and perf/core.

[1]: Book III, Section 9.4.10:
     https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0 
[2]: https://wiki.raptorcs.com/w/images/6/6b/POWER9_PMU_UG_v12_28NOV2018_pub.pdf#G9.1106986
[3]: https://www.amd.com/system/files/TechDocs/24593.pdf#G19.1089550
[4]: https://linuxplumbersconf.org/event/2/contributions/76/
[5]: http://lore.kernel.org/r/20200204045233.474937-1-ravi.bangoria@linux.ibm.com
[6]: http://lore.kernel.org/r/20200213064306.160480-1-ravi.bangoria@linux.ibm.com


Madhavan Srinivasan (7):
  perf/core: Data structure to present hazard data
  powerpc/perf: Arch specific definitions for pipeline
  powerpc/perf: Arch support to expose Hazard data
  perf tools: Enable record and script to record and show hazard data
  perf hists: Make a room for hazard info in struct hist_entry
  perf hazard: Functions to convert generic hazard data to arch specific
    string
  perf report: Enable hazard mode

Ravi Bangoria (4):
  powerpc/perf: Simplify ISA207_SIER macros
  perf annotate: Introduce type for annotation_line
  perf annotate: Preparation for hazard
  perf annotate: Show hazard data in tui mode

 arch/powerpc/include/asm/perf_event_server.h  |   2 +
 .../include/uapi/asm/perf_pipeline_haz.h      |  80 ++++++
 arch/powerpc/perf/core-book3s.c               |   4 +
 arch/powerpc/perf/isa207-common.c             | 165 ++++++++++++-
 arch/powerpc/perf/isa207-common.h             |  23 +-
 arch/powerpc/perf/power8-pmu.c                |   1 +
 arch/powerpc/perf/power9-pmu.c                |   1 +
 include/linux/perf_event.h                    |   7 +
 include/uapi/linux/perf_event.h               |  32 ++-
 kernel/events/core.c                          |   6 +
 tools/include/uapi/linux/perf_event.h         |  32 ++-
 tools/perf/Documentation/perf-record.txt      |   3 +
 tools/perf/builtin-annotate.c                 |   7 +-
 tools/perf/builtin-c2c.c                      |   4 +-
 tools/perf/builtin-diff.c                     |   6 +-
 tools/perf/builtin-record.c                   |   1 +
 tools/perf/builtin-report.c                   |  29 +++
 tools/perf/tests/hists_link.c                 |   4 +-
 tools/perf/ui/browsers/annotate.c             | 128 ++++++++--
 tools/perf/ui/gtk/annotate.c                  |   6 +-
 tools/perf/util/Build                         |   2 +
 tools/perf/util/annotate.c                    | 153 +++++++++++-
 tools/perf/util/annotate.h                    |  38 ++-
 tools/perf/util/event.h                       |   1 +
 tools/perf/util/evsel.c                       |  10 +
 tools/perf/util/hazard.c                      |  51 ++++
 tools/perf/util/hazard.h                      |  14 ++
 tools/perf/util/hazard/Build                  |   1 +
 .../util/hazard/powerpc/perf_pipeline_haz.h   |  80 ++++++
 .../perf/util/hazard/powerpc/powerpc_hazard.c | 142 +++++++++++
 .../perf/util/hazard/powerpc/powerpc_hazard.h |  14 ++
 tools/perf/util/hist.c                        | 112 ++++++++-
 tools/perf/util/hist.h                        |  13 +
 tools/perf/util/machine.c                     |   6 +
 tools/perf/util/machine.h                     |   3 +
 tools/perf/util/perf_event_attr_fprintf.c     |   1 +
 tools/perf/util/record.h                      |   1 +
 tools/perf/util/session.c                     |  16 ++
 tools/perf/util/sort.c                        | 230 ++++++++++++++++++
 tools/perf/util/sort.h                        |  23 ++
 40 files changed, 1387 insertions(+), 65 deletions(-)
 create mode 100644 arch/powerpc/include/uapi/asm/perf_pipeline_haz.h
 create mode 100644 tools/perf/util/hazard.c
 create mode 100644 tools/perf/util/hazard.h
 create mode 100644 tools/perf/util/hazard/Build
 create mode 100644 tools/perf/util/hazard/powerpc/perf_pipeline_haz.h
 create mode 100644 tools/perf/util/hazard/powerpc/powerpc_hazard.c
 create mode 100644 tools/perf/util/hazard/powerpc/powerpc_hazard.h

Comments

Peter Zijlstra March 2, 2020, 10:13 a.m. UTC | #1
On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
> Modern processors export such hazard data in Performance
> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
> AMD[3] provides similar information.
> 
> Implementation detail:
> 
> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
> If it's set, kernel converts arch specific hazard information
> into generic format:
> 
>   struct perf_pipeline_haz_data {
>          /* Instruction/Opcode type: Load, Store, Branch .... */
>          __u8    itype;
>          /* Instruction Cache source */
>          __u8    icache;
>          /* Instruction suffered hazard in pipeline stage */
>          __u8    hazard_stage;
>          /* Hazard reason */
>          __u8    hazard_reason;
>          /* Instruction suffered stall in pipeline stage */
>          __u8    stall_stage;
>          /* Stall reason */
>          __u8    stall_reason;
>          __u16   pad;
>   };

Kim, does this format indeed work for AMD IBS?
Stephane Eranian March 2, 2020, 8:21 p.m. UTC | #2
On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
> > Modern processors export such hazard data in Performance
> > Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
> > Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
> > AMD[3] provides similar information.
> >
> > Implementation detail:
> >
> > A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
> > If it's set, kernel converts arch specific hazard information
> > into generic format:
> >
> >   struct perf_pipeline_haz_data {
> >          /* Instruction/Opcode type: Load, Store, Branch .... */
> >          __u8    itype;
> >          /* Instruction Cache source */
> >          __u8    icache;
> >          /* Instruction suffered hazard in pipeline stage */
> >          __u8    hazard_stage;
> >          /* Hazard reason */
> >          __u8    hazard_reason;
> >          /* Instruction suffered stall in pipeline stage */
> >          __u8    stall_stage;
> >          /* Stall reason */
> >          __u8    stall_reason;
> >          __u16   pad;
> >   };
>
> Kim, does this format indeed work for AMD IBS?


Personally, I don't like the term hazard; it is too IBM Power
specific. We need to find a better term, maybe stall or penalty.
Also worth considering is support for ARM SPE (Statistical
Profiling Extension), which is ARM's version of IBS.
Whatever gets added needs to cover all three with no limitations.
Paul A. Clarke March 2, 2020, 9:08 p.m. UTC | #3
On 3/1/20 11:23 PM, Ravi Bangoria wrote:
> Most modern microprocessors employ complex instruction execution
> pipelines such that many instructions can be 'in flight' at any
> given point in time. Various factors affect this pipeline and
> hazards are the primary among them. Different types of hazards
> exist - Data hazards, Structural hazards and Control hazards.
> Data hazard is the case where data dependencies exist between
> instructions in different stages in the pipeline. Structural
> hazard is when the same processor hardware is needed by more
> than one instruction in flight at the same time. Control hazards
> are more the branch misprediction kinds. 
> 
> Information about these hazards are critical towards analyzing
> performance issues and also to tune software to overcome such
> issues. Modern processors export such hazard data in Performance
> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
> AMD[3] provides similar information.
> 
> Implementation detail:
> 
> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
> If it's set, kernel converts arch specific hazard information
> into generic format:
> 
>   struct perf_pipeline_haz_data {
>          /* Instruction/Opcode type: Load, Store, Branch .... */
>          __u8    itype;

At the risk of bike-shedding (in an RFC, no less), "itype" doesn't convey enough meaning to me.  "inst_type"?  I see that in 03/11 you use "perf_inst_type".

>          /* Instruction Cache source */
>          __u8    icache;

Possibly same here, and you use "perf_inst_cache" in 03/11.

>          /* Instruction suffered hazard in pipeline stage */
>          __u8    hazard_stage;
>          /* Hazard reason */
>          __u8    hazard_reason;
>          /* Instruction suffered stall in pipeline stage */
>          __u8    stall_stage;
>          /* Stall reason */
>          __u8    stall_reason;
>          __u16   pad;
>   };
> 
> ... which can be read by user from mmap() ring buffer. With this
> approach, sample perf report in hazard mode looks like (On IBM
> PowerPC):
> 
>   # ./perf record --hazard ./ebizzy
>   # ./perf report --hazard
>   Overhead  Symbol          Shared  Instruction Type  Hazard Stage   Hazard Reason         Stall Stage   Stall Reason  ICache access
>     36.58%  [.] thread_run  ebizzy  Load              LSU            Mispredict            LSU           Load fin      L1 hit
>      9.46%  [.] thread_run  ebizzy  Load              LSU            Mispredict            LSU           Dcache_miss   L1 hit
>      1.76%  [.] thread_run  ebizzy  Fixed point       -              -                     -             -             L1 hit
>      1.31%  [.] thread_run  ebizzy  Load              LSU            ERAT Miss             LSU           Load fin      L1 hit
>      1.27%  [.] thread_run  ebizzy  Load              LSU            Mispredict            -             -             L1 hit
>      1.16%  [.] thread_run  ebizzy  Fixed point       -              -                     FXU           Fixed cycle   L1 hit
>      0.50%  [.] thread_run  ebizzy  Fixed point       ISU            Source Unavailable    FXU           Fixed cycle   L1 hit
>      0.30%  [.] thread_run  ebizzy  Load              LSU            LMQ Full, DERAT Miss  LSU           Load fin      L1 hit
>      0.24%  [.] thread_run  ebizzy  Load              LSU            ERAT Miss             -             -             L1 hit
>      0.08%  [.] thread_run  ebizzy  -                 -              -                     BRU           Fixed cycle   L1 hit
>      0.05%  [.] thread_run  ebizzy  Branch            -              -                     BRU           Fixed cycle   L1 hit
>      0.04%  [.] thread_run  ebizzy  Fixed point       ISU            Source Unavailable    -             -             L1 hit

How are these to be interpreted?  This is great information, but is it possible to make it more readable for non-experts?  If each of these maps 1:1 with a hardware event, should you emit the name of the event here, so that it can be used to look up further information?  For example, does the first line map to PM_CMPLU_STALL_LSU_FIN?
What was "Mispredict[ed]"? (Is it different from a branch misprediction?) And how does this relate to "L1 hit"?
Can we emit "Load finish" instead of "Load fin" for easier reading?  03/11 also has "Marked fin before NTC".
Nit: why does "Dcache_miss" have an underscore when none of the others do?

> Also perf annotate with hazard data:

>          │    static int
>          │    compare(const void *p1, const void *p2)
>          │    {
>    33.23 │      std    r31,-8(r1)
>          │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>          │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>          │       {haz_stage: LSU, haz_reason: Load Hit Store, stall_stage: LSU, stall_reason: -, icache: L3 hit}
>          │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: -, stall_reason: -, icache: L1 hit}
>          │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>          │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>     0.84 │      stdu   r1,-64(r1)
>          │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: -, stall_reason: -, icache: L1 hit}
>     0.24 │      mr     r31,r1
>          │       {haz_stage: -, haz_reason: -, stall_stage: -, stall_reason: -, icache: L1 hit}
>    21.18 │      std    r3,32(r31)
>          │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>          │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>          │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
> 
> 
> Patches:
>  - Patch #1 is a simple cleanup patch
>  - Patch #2, #3, #4 implements generic and arch specific kernel
>    infrastructure
>  - Patch #5 enables perf record and script with hazard mode
>  - Patch #6, #7, #8 enables perf report with hazard mode
>  - Patch #9, #10, #11 enables perf annotate with hazard mode
> 
> Note:
>  - This series is based on the talk by Madhavan in LPC 2018[4]. This is
>    just an early RFC to get comments about the approach and not intended
>    to be merged yet.
>  - I've prepared the series base on v5.6-rc3. But it depends on generic
>    perf annotate fixes [5][6] which are already merged by Arnaldo in
>    perf/urgent and perf/core.
> 
> [1]: Book III, Section 9.4.10:
>      https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0 
> [2]: https://wiki.raptorcs.com/w/images/6/6b/POWER9_PMU_UG_v12_28NOV2018_pub.pdf#G9.1106986

This document is also available from the "IBM Portal for OpenPOWER" under the "All IBM Material for OpenPOWER" https://www-355.ibm.com/systems/power/openpower/tgcmDocumentRepository.xhtml?aliasId=OpenPOWER, under each of the individual modules.  (Well hidden, it might be said, and not a simple link like you have here.)

> [3]: https://www.amd.com/system/files/TechDocs/24593.pdf#G19.1089550
> [4]: https://linuxplumbersconf.org/event/2/contributions/76/
> [5]: http://lore.kernel.org/r/20200204045233.474937-1-ravi.bangoria@linux.ibm.com
> [6]: http://lore.kernel.org/r/20200213064306.160480-1-ravi.bangoria@linux.ibm.com

PC
Kim Phillips March 2, 2020, 10:25 p.m. UTC | #4
On 3/2/20 2:21 PM, Stephane Eranian wrote:
> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>> Modern processors export such hazard data in Performance
>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>> AMD[3] provides similar information.
>>>
>>> Implementation detail:
>>>
>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>> If it's set, kernel converts arch specific hazard information
>>> into generic format:
>>>
>>>   struct perf_pipeline_haz_data {
>>>          /* Instruction/Opcode type: Load, Store, Branch .... */
>>>          __u8    itype;
>>>          /* Instruction Cache source */
>>>          __u8    icache;
>>>          /* Instruction suffered hazard in pipeline stage */
>>>          __u8    hazard_stage;
>>>          /* Hazard reason */
>>>          __u8    hazard_reason;
>>>          /* Instruction suffered stall in pipeline stage */
>>>          __u8    stall_stage;
>>>          /* Stall reason */
>>>          __u8    stall_reason;
>>>          __u16   pad;
>>>   };
>>
>> Kim, does this format indeed work for AMD IBS?

It's not really 1:1; we don't have these separations of stages
and reasons. We have, for example, 'missed in L2 cache'.
So IBS output is flatter, with more cycle latency figures than
IBM's, AFAICT.

> Personally, I don't like the term hazard. This is too IBM Power
> specific. We need to find a better term, maybe stall or penalty.

Right, IBS doesn't have a filter to only count stalled or otherwise
bad events.  The IBS descriptions in the PPR have one occurrence
of the word 'stall', and none of 'penalty'.  The way I read IBS,
it's just reporting more sample data than just the precise IP:
things like hits, misses, cycle latencies, addresses, types, etc.,
so words like 'extended', or the 'auxiliary' already used today,
are more appropriate for IBS, although I'm the last person to
bikeshed.

> Also worth considering is the support of ARM SPE (Statistical
> Profiling Extension) which is their version of IBS.
> Whatever gets added need to cover all three with no limitations.

I thought Intel's various LBR, PEBS, and PT supported providing
similar sample data in perf already, like with perf mem/c2c?

Kim
Andi Kleen March 3, 2020, 1:33 a.m. UTC | #5
On Mon, Mar 02, 2020 at 11:13:32AM +0100, Peter Zijlstra wrote:
> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
> > Modern processors export such hazard data in Performance
> > Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
> > Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
> > AMD[3] provides similar information.
> > 
> > Implementation detail:
> > 
> > A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
> > If it's set, kernel converts arch specific hazard information
> > into generic format:
> > 
> >   struct perf_pipeline_haz_data {
> >          /* Instruction/Opcode type: Load, Store, Branch .... */
> >          __u8    itype;
> >          /* Instruction Cache source */
> >          __u8    icache;
> >          /* Instruction suffered hazard in pipeline stage */
> >          __u8    hazard_stage;
> >          /* Hazard reason */
> >          __u8    hazard_reason;
> >          /* Instruction suffered stall in pipeline stage */
> >          __u8    stall_stage;
> >          /* Stall reason */
> >          __u8    stall_reason;
> >          __u16   pad;
> >   };
> 
> Kim, does this format indeed work for AMD IBS?

Intel PEBS has a similar concept for annotation of memory accesses,
which is already exported through perf_mem_data_src. This is essentially
an extension. It would be better to have something unified here. 
Right now it seems to duplicate at least part of the PEBS facility.

-Andi
Madhavan Srinivasan March 5, 2020, 4:28 a.m. UTC | #6
On 3/3/20 1:51 AM, Stephane Eranian wrote:
> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>> Modern processors export such hazard data in Performance
>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>> AMD[3] provides similar information.
>>>
>>> Implementation detail:
>>>
>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>> If it's set, kernel converts arch specific hazard information
>>> into generic format:
>>>
>>>    struct perf_pipeline_haz_data {
>>>           /* Instruction/Opcode type: Load, Store, Branch .... */
>>>           __u8    itype;
>>>           /* Instruction Cache source */
>>>           __u8    icache;
>>>           /* Instruction suffered hazard in pipeline stage */
>>>           __u8    hazard_stage;
>>>           /* Hazard reason */
>>>           __u8    hazard_reason;
>>>           /* Instruction suffered stall in pipeline stage */
>>>           __u8    stall_stage;
>>>           /* Stall reason */
>>>           __u8    stall_reason;
>>>           __u16   pad;
>>>    };
>> Kim, does this format indeed work for AMD IBS?
>
> Personally, I don't like the term hazard. This is too IBM Power
> specific. We need to find a better term, maybe stall or penalty.

Yes, the names can be reworked. Thinking more about it, how about
presenting these as "pipeline" data instead of "hazard" data?

> Also worth considering is the support of ARM SPE (Statistical
> Profiling Extension) which is their version of IBS.
> Whatever gets added need to cover all three with no limitations.

Thanks for pointing this out. We looked at the ARM SPE spec and it
does provide information like issue latency, translation latency and
so on. And AMD IBS provides data like fetch latency, tag-to-retire
latency, completion-to-retire latency and so on when using Fetch
sampling. So yes, we will rework the struct definition to include
data from ARM SPE and AMD IBS as well, and will post a newer
version soon.

Thanks for the comments
Maddy
Ravi Bangoria March 5, 2020, 4:46 a.m. UTC | #7
Hi Kim,

Sorry about being a bit late.

On 3/3/20 3:55 AM, Kim Phillips wrote:
> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>> Modern processors export such hazard data in Performance
>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>> AMD[3] provides similar information.
>>>>
>>>> Implementation detail:
>>>>
>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>> If it's set, kernel converts arch specific hazard information
>>>> into generic format:
>>>>
>>>>    struct perf_pipeline_haz_data {
>>>>           /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>           __u8    itype;
>>>>           /* Instruction Cache source */
>>>>           __u8    icache;
>>>>           /* Instruction suffered hazard in pipeline stage */
>>>>           __u8    hazard_stage;
>>>>           /* Hazard reason */
>>>>           __u8    hazard_reason;
>>>>           /* Instruction suffered stall in pipeline stage */
>>>>           __u8    stall_stage;
>>>>           /* Stall reason */
>>>>           __u8    stall_reason;
>>>>           __u16   pad;
>>>>    };
>>>
>>> Kim, does this format indeed work for AMD IBS?
> 
> It's not really 1:1, we don't have these separations of stages
> and reasons, for example: we have missed in L2 cache, for example.
> So IBS output is flatter, with more cycle latency figures than
> IBM's AFAICT.

AMD IBS captures pipeline latency data in the case of Fetch
sampling, like fetch latency, tag-to-retire latency,
completion-to-retire latency and so on. Yes, Ops sampling does
provide more load/store-centric data, but it also captures more
detailed data for branch instructions. And we also looked at ARM
SPE, which likewise captures detailed pipeline data and latency
information.

> 
>> Personally, I don't like the term hazard. This is too IBM Power
>> specific. We need to find a better term, maybe stall or penalty.
> 
> Right, IBS doesn't have a filter to only count stalled or otherwise
> bad events.  IBS' PPR descriptions has one occurrence of the
> word stall, and no penalty.  The way I read IBS is it's just
> reporting more sample data than just the precise IP: things like
> hits, misses, cycle latencies, addresses, types, etc., so words
> like 'extended', or the 'auxiliary' already used today even
> are more appropriate for IBS, although I'm the last person to
> bikeshed.

We are thinking of using the word "pipeline" instead of "hazard".

> 
>> Also worth considering is the support of ARM SPE (Statistical
>> Profiling Extension) which is their version of IBS.
>> Whatever gets added need to cover all three with no limitations.
> 
> I thought Intel's various LBR, PEBS, and PT supported providing
> similar sample data in perf already, like with perf mem/c2c?

perf-mem is more data-centric in my opinion; it is geared towards
memory profiling. The proposal here is to expose pipeline-related
details like stalls and latencies.

Thanks for the review,
Ravi
Ravi Bangoria March 5, 2020, 5:06 a.m. UTC | #8
Hi Paul,

Sorry for the slightly late reply.

On 3/3/20 2:38 AM, Paul Clarke wrote:
> On 3/1/20 11:23 PM, Ravi Bangoria wrote:
>> Most modern microprocessors employ complex instruction execution
>> pipelines such that many instructions can be 'in flight' at any
>> given point in time. Various factors affect this pipeline and
>> hazards are the primary among them. Different types of hazards
>> exist - Data hazards, Structural hazards and Control hazards.
>> Data hazard is the case where data dependencies exist between
>> instructions in different stages in the pipeline. Structural
>> hazard is when the same processor hardware is needed by more
>> than one instruction in flight at the same time. Control hazards
>> are more the branch misprediction kinds.
>>
>> Information about these hazards are critical towards analyzing
>> performance issues and also to tune software to overcome such
>> issues. Modern processors export such hazard data in Performance
>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>> AMD[3] provides similar information.
>>
>> Implementation detail:
>>
>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>> If it's set, kernel converts arch specific hazard information
>> into generic format:
>>
>>    struct perf_pipeline_haz_data {
>>           /* Instruction/Opcode type: Load, Store, Branch .... */
>>           __u8    itype;
> 
> At the risk of bike-shedding (in an RFC, no less), "itype" doesn't convey enough meaning to me.  "inst_type"?  I see in 03/11, you use "perf_inst_type".

I was thinking of renaming itype to operation_type or op_type,
because AMD IBS and ARM SPE observe micro-ops, and op_type also
aligns better with the "pipeline" terminology.

> 
>>           /* Instruction Cache source */
>>           __u8    icache;
> 
> Possibly same here, and you use "perf_inst_cache" in 03/11.

Sure.

> 
>>           /* Instruction suffered hazard in pipeline stage */
>>           __u8    hazard_stage;
>>           /* Hazard reason */
>>           __u8    hazard_reason;
>>           /* Instruction suffered stall in pipeline stage */
>>           __u8    stall_stage;
>>           /* Stall reason */
>>           __u8    stall_reason;
>>           __u16   pad;
>>    };
>>
>> ... which can be read by user from mmap() ring buffer. With this
>> approach, sample perf report in hazard mode looks like (On IBM
>> PowerPC):
>>
>>    # ./perf record --hazard ./ebizzy
>>    # ./perf report --hazard
>>    Overhead  Symbol          Shared  Instruction Type  Hazard Stage   Hazard Reason         Stall Stage   Stall Reason  ICache access
>>      36.58%  [.] thread_run  ebizzy  Load              LSU            Mispredict            LSU           Load fin      L1 hit
>>       9.46%  [.] thread_run  ebizzy  Load              LSU            Mispredict            LSU           Dcache_miss   L1 hit
>>       1.76%  [.] thread_run  ebizzy  Fixed point       -              -                     -             -             L1 hit
>>       1.31%  [.] thread_run  ebizzy  Load              LSU            ERAT Miss             LSU           Load fin      L1 hit
>>       1.27%  [.] thread_run  ebizzy  Load              LSU            Mispredict            -             -             L1 hit
>>       1.16%  [.] thread_run  ebizzy  Fixed point       -              -                     FXU           Fixed cycle   L1 hit
>>       0.50%  [.] thread_run  ebizzy  Fixed point       ISU            Source Unavailable    FXU           Fixed cycle   L1 hit
>>       0.30%  [.] thread_run  ebizzy  Load              LSU            LMQ Full, DERAT Miss  LSU           Load fin      L1 hit
>>       0.24%  [.] thread_run  ebizzy  Load              LSU            ERAT Miss             -             -             L1 hit
>>       0.08%  [.] thread_run  ebizzy  -                 -              -                     BRU           Fixed cycle   L1 hit
>>       0.05%  [.] thread_run  ebizzy  Branch            -              -                     BRU           Fixed cycle   L1 hit
>>       0.04%  [.] thread_run  ebizzy  Fixed point       ISU            Source Unavailable    -             -             L1 hit
> 
> How are these to be interpreted?  This is great information, but is it possible to make it more readable for non-experts?

For the RFC proposal we just pulled the details from the spec. But
yes, we will look into this.

>  If each of these map 1:1 with hardware events, should you emit the name of the event here, so that can be used to look up further information? For example, does the first line map to PM_CMPLU_STALL_LSU_FIN?

I'm using the PM_MRK_INST_CMPL event in perf record, and the SIER
provides all of this information.

> What was "Mispredict[ed]"? (Is it different from a branch misprediction?) And how does this relate to "L1 hit"?

I'm not 100% sure. I'll check with the hw folks about it.

> Can we emit "Load finish" instead of "Load fin" for easier reading?  03/11 also has "Marked fin before NTC".
> Nit: why does "Dcache_miss" have an underscore and none of the others?

Sure. Will change it.

> 
>> Also perf annotate with hazard data:
> 
>>           │    static int
>>           │    compare(const void *p1, const void *p2)
>>           │    {
>>     33.23 │      std    r31,-8(r1)
>>           │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>>           │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>>           │       {haz_stage: LSU, haz_reason: Load Hit Store, stall_stage: LSU, stall_reason: -, icache: L3 hit}
>>           │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: -, stall_reason: -, icache: L1 hit}
>>           │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>>           │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>>      0.84 │      stdu   r1,-64(r1)
>>           │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: -, stall_reason: -, icache: L1 hit}
>>      0.24 │      mr     r31,r1
>>           │       {haz_stage: -, haz_reason: -, stall_stage: -, stall_reason: -, icache: L1 hit}
>>     21.18 │      std    r3,32(r31)
>>           │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>>           │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>>           │       {haz_stage: LSU, haz_reason: ERAT Miss, stall_stage: LSU, stall_reason: Store, icache: L1 hit}
>>
>>
>> Patches:
>>   - Patch #1 is a simple cleanup patch
>>   - Patch #2, #3, #4 implements generic and arch specific kernel
>>     infrastructure
>>   - Patch #5 enables perf record and script with hazard mode
>>   - Patch #6, #7, #8 enables perf report with hazard mode
>>   - Patch #9, #10, #11 enables perf annotate with hazard mode
>>
>> Note:
>>   - This series is based on the talk by Madhavan in LPC 2018[4]. This is
>>     just an early RFC to get comments about the approach and not intended
>>     to be merged yet.
>>   - I've prepared the series base on v5.6-rc3. But it depends on generic
>>     perf annotate fixes [5][6] which are already merged by Arnaldo in
>>     perf/urgent and perf/core.
>>
>> [1]: Book III, Section 9.4.10:
>>       https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
>> [2]: https://wiki.raptorcs.com/w/images/6/6b/POWER9_PMU_UG_v12_28NOV2018_pub.pdf#G9.1106986
> 
> This document is also available from the "IBM Portal for OpenPOWER" under the "All IBM Material for OpenPOWER" https://www-355.ibm.com/systems/power/openpower/tgcmDocumentRepository.xhtml?aliasId=OpenPOWER, under each of the individual modules.  (Well hidden, it might be said, and not a simple link like you have here.)

Thanks for pointing it out :)
Ravi
Ravi Bangoria March 5, 2020, 5:06 a.m. UTC | #9
Hi Andi,

Sorry for being a bit late.

On 3/3/20 7:03 AM, Andi Kleen wrote:
> On Mon, Mar 02, 2020 at 11:13:32AM +0100, Peter Zijlstra wrote:
>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>> Modern processors export such hazard data in Performance
>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>> AMD[3] provides similar information.
>>>
>>> Implementation detail:
>>>
>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>> If it's set, kernel converts arch specific hazard information
>>> into generic format:
>>>
>>>    struct perf_pipeline_haz_data {
>>>           /* Instruction/Opcode type: Load, Store, Branch .... */
>>>           __u8    itype;
>>>           /* Instruction Cache source */
>>>           __u8    icache;
>>>           /* Instruction suffered hazard in pipeline stage */
>>>           __u8    hazard_stage;
>>>           /* Hazard reason */
>>>           __u8    hazard_reason;
>>>           /* Instruction suffered stall in pipeline stage */
>>>           __u8    stall_stage;
>>>           /* Stall reason */
>>>           __u8    stall_reason;
>>>           __u16   pad;
>>>    };
>>
>> Kim, does this format indeed work for AMD IBS?
> 
> Intel PEBS has a similar concept for annotation of memory accesses,
> which is already exported through perf_mem_data_src. This is essentially
> an extension. It would be better to have something unified here.
> Right now it seems to duplicate at least part of the PEBS facility.

IIUC there is a distinction between perf mem and exposing pipeline
details. perf-mem/perf_mem_data_src is more about memory access
profiling, while the proposal here is to expose pipeline-related
details like stalls and latencies. I would prefer/suggest not to
extend the current structure further to capture pipeline details.

Ravi
Kim Phillips March 5, 2020, 10:06 p.m. UTC | #10
On 3/4/20 10:46 PM, Ravi Bangoria wrote:
> Hi Kim,

Hi Ravi,

> On 3/3/20 3:55 AM, Kim Phillips wrote:
>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>>
>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>> Modern processors export such hazard data in Performance
>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>> AMD[3] provides similar information.
>>>>>
>>>>> Implementation detail:
>>>>>
>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>> If it's set, kernel converts arch specific hazard information
>>>>> into generic format:
>>>>>
>>>>>    struct perf_pipeline_haz_data {
>>>>>           /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>           __u8    itype;
>>>>>           /* Instruction Cache source */
>>>>>           __u8    icache;
>>>>>           /* Instruction suffered hazard in pipeline stage */
>>>>>           __u8    hazard_stage;
>>>>>           /* Hazard reason */
>>>>>           __u8    hazard_reason;
>>>>>           /* Instruction suffered stall in pipeline stage */
>>>>>           __u8    stall_stage;
>>>>>           /* Stall reason */
>>>>>           __u8    stall_reason;
>>>>>           __u16   pad;
>>>>>    };
>>>>
>>>> Kim, does this format indeed work for AMD IBS?
>>
>> It's not really 1:1, we don't have these separations of stages
>> and reasons, for example: we have missed in L2 cache, for example.
>> So IBS output is flatter, with more cycle latency figures than
>> IBM's AFAICT.
> 
> AMD IBS captures pipeline latency data incase Fetch sampling like the
> Fetch latency, tag to retire latency, completion to retire latency and
> so on. Yes, Ops sampling do provide more data on load/store centric
> information. But it also captures more detailed data for Branch instructions.
> And we also looked at ARM SPE, which also captures more details pipeline
> data and latency information.
> 
>>> Personally, I don't like the term hazard. This is too IBM Power
>>> specific. We need to find a better term, maybe stall or penalty.
>>
>> Right, IBS doesn't have a filter to only count stalled or otherwise
>> bad events.  IBS' PPR descriptions has one occurrence of the
>> word stall, and no penalty.  The way I read IBS is it's just
>> reporting more sample data than just the precise IP: things like
>> hits, misses, cycle latencies, addresses, types, etc., so words
>> like 'extended', or the 'auxiliary' already used today even
>> are more appropriate for IBS, although I'm the last person to
>> bikeshed.
> 
> We are thinking of using "pipeline" word instead of Hazard.

Hm, the word 'pipeline' occurs 0 times in IBS documentation.

I realize there are a couple of core pipeline-specific pieces
of information coming out of it, but the vast majority
are addresses, latencies of various components in the memory
hierarchy, and various component hit/miss bits.

What's needed here is vendor-specific extended sample information
covering what all these technologies gather, some of which, e.g.
'L1 TLB cycle latency', we should all have in common.

I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
either.  Can we use PERF_SAMPLE_AUX instead?  Take a look at
commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
definitions".  The sample identifier can be used to determine
which vendor's sampling IP's data is in it, and events can
be recorded just by copying the contents of the SIER, etc.
registers; events then get synthesized from the aux sample at
report/inject/annotate etc. time.  This allows for less
sample-recording overhead, and moves all the vendor-specific
decoding and common event conversion to userspace to figure out.
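
(For concreteness, a rough sketch of that setup; the AUX PMU type
and snapshot size below are placeholders, and only the attrs are
shown:)

  /* Vendor AUX-area event (the sampling IP, e.g. an SPE/PT-style PMU) */
  struct perf_event_attr aux_attr = {0};

  aux_attr.size = sizeof(aux_attr);
  aux_attr.type = vendor_aux_pmu_type;  /* placeholder, read from sysfs */

  /* Sampled event that snapshots raw vendor bytes into each sample */
  struct perf_event_attr samp_attr = {0};

  samp_attr.size = sizeof(samp_attr);
  samp_attr.sample_period = 10000;
  samp_attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_AUX;
  samp_attr.aux_sample_size = 128;      /* raw SIER/IBS bytes per sample */

  /* Open the AUX event as group leader and the sampled event in the
   * same group; report/inject/annotate later decode the raw bytes
   * per vendor and synthesize events from them. */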

>>> Also worth considering is the support of ARM SPE (Statistical
>>> Profiling Extension) which is their version of IBS.
>>> Whatever gets added need to cover all three with no limitations.
>>
>> I thought Intel's various LBR, PEBS, and PT supported providing
>> similar sample data in perf already, like with perf mem/c2c?
> 
> perf-mem is more of data centric in my opinion. It is more towards
> memory profiling. So proposal here is to expose pipeline related
> details like stalls and latencies.

Like I said, I don't see it that way, I see it as "any particular
vendor's event's extended details', and these pipeline details
have overlap with existing infrastructure within perf, e.g., L2
cache misses.

Kim
Ravi Bangoria March 11, 2020, 4 p.m. UTC | #11
Hi Kim,

On 3/6/20 3:36 AM, Kim Phillips wrote:
>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>
>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>> Modern processors export such hazard data in Performance
>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>> AMD[3] provides similar information.
>>>>>>
>>>>>> Implementation detail:
>>>>>>
>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>> into generic format:
>>>>>>
>>>>>>     struct perf_pipeline_haz_data {
>>>>>>            /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>            __u8    itype;
>>>>>>            /* Instruction Cache source */
>>>>>>            __u8    icache;
>>>>>>            /* Instruction suffered hazard in pipeline stage */
>>>>>>            __u8    hazard_stage;
>>>>>>            /* Hazard reason */
>>>>>>            __u8    hazard_reason;
>>>>>>            /* Instruction suffered stall in pipeline stage */
>>>>>>            __u8    stall_stage;
>>>>>>            /* Stall reason */
>>>>>>            __u8    stall_reason;
>>>>>>            __u16   pad;
>>>>>>     };
>>>>>
>>>>> Kim, does this format indeed work for AMD IBS?
>>>
>>> It's not really 1:1, we don't have these separations of stages
>>> and reasons, for example: we have missed in L2 cache, for example.
>>> So IBS output is flatter, with more cycle latency figures than
>>> IBM's AFAICT.
>>
>> AMD IBS captures pipeline latency data incase Fetch sampling like the
>> Fetch latency, tag to retire latency, completion to retire latency and
>> so on. Yes, Ops sampling do provide more data on load/store centric
>> information. But it also captures more detailed data for Branch instructions.
>> And we also looked at ARM SPE, which also captures more details pipeline
>> data and latency information.
>>
>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>> specific. We need to find a better term, maybe stall or penalty.
>>>
>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>> bad events.  IBS' PPR descriptions has one occurrence of the
>>> word stall, and no penalty.  The way I read IBS is it's just
>>> reporting more sample data than just the precise IP: things like
>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>> like 'extended', or the 'auxiliary' already used today even
>>> are more appropriate for IBS, although I'm the last person to
>>> bikeshed.
>>
>> We are thinking of using "pipeline" word instead of Hazard.
> 
> Hm, the word 'pipeline' occurs 0 times in IBS documentation.

NP. We thought "pipeline" is a generic hw term, so we proposed that
word. We are open to any term that is generic enough.

> 
> I realize there are a couple of core pipeline-specific pieces
> of information coming out of it, but the vast majority
> are addresses, latencies of various components in the memory
> hierarchy, and various component hit/miss bits.

Yes, we should capture core pipeline-specific details. For example,
IBS generates branch unit information (IbsOpData1) and icache-related
data (IbsFetchCtl), which is something that shouldn't be exposed as
an extension of perf-mem, IMO.

> 
> What's needed here is a vendor-specific extended
> sample information that all these technologies gather,
> of which things like e.g., 'L1 TLB cycle latency' we
> all should have in common.

Yes. We will include fields to capture the latency cycles (like
issue latency, instruction completion latency, etc.) along with
other pipeline details in the proposed structure.
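
(Purely illustrative, one possible shape; the latency field names
below are invented for this sketch, not taken from any posted patch:)

   struct perf_pipeline_haz_data {
          __u8    itype;              /* may be renamed op_type */
          __u8    icache;
          __u8    hazard_stage;
          __u8    hazard_reason;
          __u8    stall_stage;
          __u8    stall_reason;
          /* hypothetical SPE/IBS-style latency fields, in cycles */
          __u16   issue_latency;
          __u16   completion_latency;
          __u16   pad[3];             /* keep the struct 8-byte aligned */
   };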

> 
> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
> either.  Can we use PERF_SAMPLE_AUX instead?

We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended
for cases where a large volume of data needs to be captured as part
of perf.data without frequent PMIs. The proposed type, by contrast,
is meant to capture pipeline information on each sample, using a PMI
at periodic intervals. Hence we are proposing
PERF_SAMPLE_PIPELINE_HAZ.

>  Take a look at
> commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
> definitions".  The sample identifier can be used to determine
> which vendor's sampling IP's data is in it, and events can
> be recorded just by copying the content of the SIER, etc.
> registers, and then events get synthesized from the aux
> sample at report/inject/annotate etc. time.  This allows
> for less sample recording overhead, and moves all the vendor
> specific decoding and common event conversions for userspace
> to figure out.

If the AUX buffer data is structured, the tool-side changes added to
present the pipeline data can be reused.

> 
>>>> Also worth considering is the support of ARM SPE (Statistical
>>>> Profiling Extension) which is their version of IBS.
>>>> Whatever gets added need to cover all three with no limitations.
>>>
>>> I thought Intel's various LBR, PEBS, and PT supported providing
>>> similar sample data in perf already, like with perf mem/c2c?
>>
>> perf-mem is more of data centric in my opinion. It is more towards
>> memory profiling. So proposal here is to expose pipeline related
>> details like stalls and latencies.
> 
> Like I said, I don't see it that way, I see it as "any particular
> vendor's event's extended details', and these pipeline details
> have overlap with existing infrastructure within perf, e.g., L2
> cache misses.
> 
> Kim
>
Kim Phillips March 12, 2020, 10:38 p.m. UTC | #12
On 3/11/20 11:00 AM, Ravi Bangoria wrote:
> Hi Kim,

Hi Ravi,

> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>
>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>> Modern processors export such hazard data in Performance
>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>> AMD[3] provides similar information.
>>>>>>>
>>>>>>> Implementation detail:
>>>>>>>
>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>> into generic format:
>>>>>>>
>>>>>>>     struct perf_pipeline_haz_data {
>>>>>>>            /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>            __u8    itype;
>>>>>>>            /* Instruction Cache source */
>>>>>>>            __u8    icache;
>>>>>>>            /* Instruction suffered hazard in pipeline stage */
>>>>>>>            __u8    hazard_stage;
>>>>>>>            /* Hazard reason */
>>>>>>>            __u8    hazard_reason;
>>>>>>>            /* Instruction suffered stall in pipeline stage */
>>>>>>>            __u8    stall_stage;
>>>>>>>            /* Stall reason */
>>>>>>>            __u8    stall_reason;
>>>>>>>            __u16   pad;
>>>>>>>     };
>>>>>>
>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>
>>>> It's not really 1:1, we don't have these separations of stages
>>>> and reasons, for example: we have missed in L2 cache, for example.
>>>> So IBS output is flatter, with more cycle latency figures than
>>>> IBM's AFAICT.
>>>
>>> AMD IBS captures pipeline latency data incase Fetch sampling like the
>>> Fetch latency, tag to retire latency, completion to retire latency and
>>> so on. Yes, Ops sampling do provide more data on load/store centric
>>> information. But it also captures more detailed data for Branch instructions.
>>> And we also looked at ARM SPE, which also captures more details pipeline
>>> data and latency information.
>>>
>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>
>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>> bad events.  IBS' PPR descriptions has one occurrence of the
>>>> word stall, and no penalty.  The way I read IBS is it's just
>>>> reporting more sample data than just the precise IP: things like
>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>> like 'extended', or the 'auxiliary' already used today even
>>>> are more appropriate for IBS, although I'm the last person to
>>>> bikeshed.
>>>
>>> We are thinking of using "pipeline" word instead of Hazard.
>>
>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
> 
> NP. We thought pipeline is generic hw term so we proposed "pipeline"
> word. We are open to term which can be generic enough.
> 
>>
>> I realize there are a couple of core pipeline-specific pieces
>> of information coming out of it, but the vast majority
>> are addresses, latencies of various components in the memory
>> hierarchy, and various component hit/miss bits.
> 
> Yes. we should capture core pipeline specific details. For example,
> IBS generates Branch unit information(IbsOpData1) and Icahce related
> data(IbsFetchCtl) which is something that shouldn't be extended as
> part of perf-mem, IMO.

Sure, IBS Op-side output is more 'perf mem' friendly, and so it
should populate perf_mem_data_src fields, just like POWER9 can:

union perf_mem_data_src {
...
                __u64   mem_rsvd:24,
                        mem_snoopx:2,   /* snoop mode, ext */
                        mem_remote:1,   /* remote */
                        mem_lvl_num:4,  /* memory hierarchy level number */
                        mem_dtlb:7,     /* tlb access */
                        mem_lock:2,     /* lock instr */
                        mem_snoop:5,    /* snoop mode */
                        mem_lvl:14,     /* memory hierarchy level */
                        mem_op:5;       /* type of opcode */


E.g., SIER[LDST] and SIER[A_XLATE_SRC] can be used to populate
mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op' and
'mem_lock', and the Reload Bus Source Encoding bits can be used
to populate mem_snoop, right?

For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
used for the ld/st target addresses, too.
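
(Schematically, using the existing uapi encodings; the actual
decoding of the SIER/IbsOpData bits into these values is left
abstract here:)

  union perf_mem_data_src dsrc = { 0 };

  /* Values would come from decoding SIER (POWER) or IbsOpData* (AMD) */
  dsrc.mem_op      = PERF_MEM_OP_LOAD;        /* e.g. from SIER_TYPE */
  dsrc.mem_lvl_num = PERF_MEM_LVLNUM_L2;      /* e.g. from reload source */
  dsrc.mem_dtlb    = PERF_MEM_TLB_HIT | PERF_MEM_TLB_L1;
  dsrc.mem_snoop   = PERF_MEM_SNOOP_HIT;

  data->data_src = dsrc;            /* struct perf_sample_data *data */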

>> What's needed here is a vendor-specific extended
>> sample information that all these technologies gather,
>> of which things like e.g., 'L1 TLB cycle latency' we
>> all should have in common.
> 
> Yes. We will include fields to capture the latency cycles (like Issue
> latency, Instruction completion latency etc..) along with other pipeline
> details in the proposed structure.

Latency figures are just an example, and from what I
can tell, struct perf_sample_data already has a 'weight' member,
used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
transfer memory access latency figures.  Granted, that's
a bad name given that other vendors don't call latency
'weight'.

I didn't see any latency figures coming out of POWER9,
and do not expect this patch series to implement those
of other vendors, e.g., AMD's IBS; leave each vendor
to amend perf to suit their own h/w output, please.

My main point there, however, was that each vendor should
use streamlined record-level code to just copy the data
in the proprietary format that their hardware produces,
and then perf tooling can synthesize the events
from the raw data at report/script/etc. time.

>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>> either.  Can we use PERF_SAMPLE_AUX instead?
> 
> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended when
> large volume of data needs to be captured as part of perf.data without
> frequent PMIs. But proposed type is to address the capture of pipeline

SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
PMIs are, even though it may be used in those environments.

> information on each sample using PMI at periodic intervals. Hence proposing
> PERF_SAMPLE_PIPELINE_HAZ.

And that's fine for any extra bits that POWER9 has to convey
to its users beyond things already represented by other sample
types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
and other vendors' (e.g., AMD IBS) data can be made
vendor-independent at record time by using SAMPLE_AUX, or even
SAMPLE_RAW, which is what IBS currently uses.

>>  Take a look at
>> commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
>> definitions".  The sample identifier can be used to determine
>> which vendor's sampling IP's data is in it, and events can
>> be recorded just by copying the content of the SIER, etc.
>> registers, and then events get synthesized from the aux
>> sample at report/inject/annotate etc. time.  This allows
>> for less sample recording overhead, and leaves all the vendor
>> specific decoding and common event conversions for userspace
>> to figure out.
> 
> When AUX buffer data is structured, the tool side changes added to present
> the pipeline data can be re-used.

Not sure I understand: AUX data would be structured on
each vendor's raw h/w register formats.

Thanks,

Kim

>>>>> Also worth considering is the support of ARM SPE (Statistical
>>>>> Profiling Extension) which is their version of IBS.
>>>>> Whatever gets added needs to cover all three with no limitations.
>>>>
>>>> I thought Intel's various LBR, PEBS, and PT supported providing
>>>> similar sample data in perf already, like with perf mem/c2c?
>>>
>>> perf-mem is more data centric, in my opinion. It is more towards
>>> memory profiling. So the proposal here is to expose pipeline related
>>> details like stalls and latencies.
>>
>> Like I said, I don't see it that way, I see it as 'any particular
>> vendor's event's extended details', and these pipeline details
>> have overlap with existing infrastructure within perf, e.g., L2
>> cache misses.
>>
>> Kim
>>
>
Madhavan Srinivasan March 17, 2020, 6:50 a.m. UTC | #13
On 3/13/20 4:08 AM, Kim Phillips wrote:
> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>> Hi Kim,
> Hi Ravi,
>
>> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>>> Modern processors export such hazard data in Performance
>>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>>> AMD[3] provides similar information.
>>>>>>>>
>>>>>>>> Implementation detail:
>>>>>>>>
>>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>>> into generic format:
>>>>>>>>
>>>>>>>>      struct perf_pipeline_haz_data {
>>>>>>>>             /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>>             __u8    itype;
>>>>>>>>             /* Instruction Cache source */
>>>>>>>>             __u8    icache;
>>>>>>>>             /* Instruction suffered hazard in pipeline stage */
>>>>>>>>             __u8    hazard_stage;
>>>>>>>>             /* Hazard reason */
>>>>>>>>             __u8    hazard_reason;
>>>>>>>>             /* Instruction suffered stall in pipeline stage */
>>>>>>>>             __u8    stall_stage;
>>>>>>>>             /* Stall reason */
>>>>>>>>             __u8    stall_reason;
>>>>>>>>             __u16   pad;
>>>>>>>>      };
>>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>> It's not really 1:1: we don't have these separations of stages
>>>>> and reasons; we have, for example, 'missed in L2 cache'.
>>>>> So IBS output is flatter, with more cycle latency figures than
>>>>> IBM's, AFAICT.
>>>> AMD IBS captures pipeline latency data in case of Fetch sampling, like the
>>>> Fetch latency, tag-to-retire latency, completion-to-retire latency, and
>>>> so on. Yes, Ops sampling does provide more data on load/store centric
>>>> information. But it also captures more detailed data for Branch instructions.
>>>> And we also looked at ARM SPE, which also captures more detailed pipeline
>>>> data and latency information.
>>>>
>>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>>> bad events.  IBS's PPR description has one occurrence of the
>>>>> word 'stall', and none of 'penalty'.  The way I read IBS is it's just
>>>>> reporting more sample data than just the precise IP: things like
>>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>>> like 'extended', or even the 'auxiliary' already in use today,
>>>>> are more appropriate for IBS, although I'm the last person to
>>>>> bikeshed.
>>>> We are thinking of using the word "pipeline" instead of "hazard".
>>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>> NP. We thought pipeline is a generic hw term, so we proposed the word
>> "pipeline". We are open to any term that is generic enough.
>>
>>> I realize there are a couple of core pipeline-specific pieces
>>> of information coming out of it, but the vast majority
>>> are addresses, latencies of various components in the memory
>>> hierarchy, and various component hit/miss bits.
>> Yes, we should capture core pipeline specific details. For example,
>> IBS generates Branch unit information (IbsOpData1) and Icache related
>> data (IbsFetchCtl), which is something that shouldn't be extended as
>> part of perf-mem, IMO.
> Sure, IBS Op-side output is more 'perf mem' friendly, and so it
> should populate perf_mem_data_src fields, just like POWER9 can:
>
> union perf_mem_data_src {
> ...
>                  __u64   mem_rsvd:24,
>                          mem_snoopx:2,   /* snoop mode, ext */
>                          mem_remote:1,   /* remote */
>                          mem_lvl_num:4,  /* memory hierarchy level number */
>                          mem_dtlb:7,     /* tlb access */
>                          mem_lock:2,     /* lock instr */
>                          mem_snoop:5,    /* snoop mode */
>                          mem_lvl:14,     /* memory hierarchy level */
>                          mem_op:5;       /* type of opcode */
>
>
> E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
> mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
> 'mem_lock', and the Reload Bus Source Encoding bits can
> be used to populate mem_snoop, right?
Hi Kim,

Yes. We do expose these data as part of perf-mem for POWER.


> For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
> used for the ld/st target addresses, too.
>
>>> What's needed here is a vendor-specific extended
>>> sample information that all these technologies gather,
>>> of which things like e.g., 'L1 TLB cycle latency' we
>>> all should have in common.
>> Yes. We will include fields to capture the latency cycles (like Issue
>> latency, Instruction completion latency, etc.) along with other pipeline
>> details in the proposed structure.
> Latency figures are just an example, and from what I
> can tell, struct perf_sample_data already has a 'weight' member,
> used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
> transfer memory access latency figures.  Granted, that's
> a bad name given all other vendors don't call latency
> 'weight'.
>
> I didn't see any latency figures coming out of POWER9,
> and do not expect this patch series to implement those
> of other vendors, e.g., AMD's IBS; leave each vendor
> to amend perf to suit their own h/w output please.

The reference structure proposed in this patchset did not have members
to capture latency info for that exact reason. But the idea here is to
abstract away as much vendor-specific detail as possible. So if we
include a u16 array, then this format can also capture data from IBS,
since it provides a few latency details.
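
Something along these lines, purely illustrative (the slot count and
meanings are hypothetical, not part of the posted patches):

  struct perf_pipeline_haz_data {
         /* Instruction/Opcode type: Load, Store, Branch .... */
         __u8    itype;
         /* Instruction Cache source */
         __u8    icache;
         /* Instruction suffered hazard in pipeline stage */
         __u8    hazard_stage;
         /* Hazard reason */
         __u8    hazard_reason;
         /* Instruction suffered stall in pipeline stage */
         __u8    stall_stage;
         /* Stall reason */
         __u8    stall_reason;
         __u16   pad;
         /* Vendor-defined latency slots, e.g. IBS fetch/iTLB/DC-miss cycles */
         __u16   latency[4];
  };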


>
> My main point there, however, was that each vendor should
> use streamlined record-level code to just copy the data
> in the proprietary format that their hardware produces,
> and then perf tooling can synthesize the events
> from the raw data at report/script/etc. time.
>
>>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>>> either.  Can we use PERF_SAMPLE_AUX instead?
>> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended for when
>> a large volume of data needs to be captured as part of perf.data without
>> frequent PMIs. But the proposed type is to address the capture of pipeline
> SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
> PMIs are, even though it may be used in those environments.
>
>> information on each sample using PMI at periodic intervals. Hence proposing
>> PERF_SAMPLE_PIPELINE_HAZ.
> And that's fine for any extra bits that POWER9 has to convey
> to its users beyond things already represented by other sample
> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
> and other vendors' data, e.g., AMD IBS, can be made vendor-independent
> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
> what IBS currently uses.

My bad. Not sure what you mean by this. We are trying to abstract
as much vendor specific data as possible with this (like perf-mem).


Maddy
>
>>>   Take a look at
>>> commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
>>> definitions".  The sample identifier can be used to determine
>>> which vendor's sampling IP's data is in it, and events can
>>> be recorded just by copying the content of the SIER, etc.
>>> registers, and then events get synthesized from the aux
>>> sample at report/inject/annotate etc. time.  This allows
>>> for less sample recording overhead, and moves all the vendor
>>> specific decoding and common event conversions for userspace
>>> to figure out.
>> When AUX buffer data is structured, the tool side changes added to present
>> the pipeline data can be re-used.
> Not sure I understand: AUX data would be structured on
> each vendor's raw h/w register formats.
>
> Thanks,
>
> Kim
>
>>>>>> Also worth considering is the support of ARM SPE (Statistical
>>>>>> Profiling Extension) which is their version of IBS.
>>>>>> Whatever gets added needs to cover all three with no limitations.
>>>>> I thought Intel's various LBR, PEBS, and PT supported providing
>>>>> similar sample data in perf already, like with perf mem/c2c?
>>>>> perf-mem is more data centric, in my opinion. It is more towards
>>>>> memory profiling. So the proposal here is to expose pipeline related
>>>>> details like stalls and latencies.
>>>> Like I said, I don't see it that way, I see it as 'any particular
>>>> vendor's event's extended details', and these pipeline details
>>> have overlap with existing infrastructure within perf, e.g., L2
>>> cache misses.
>>>
>>> Kim
>>>
Kim Phillips March 18, 2020, 5:35 p.m. UTC | #14
Hi Maddy,

On 3/17/20 1:50 AM, maddy wrote:
> On 3/13/20 4:08 AM, Kim Phillips wrote:
>> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>>> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>>>> Modern processors export such hazard data in Performance
>>>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>>>> AMD[3] provides similar information.
>>>>>>>>>
>>>>>>>>> Implementation detail:
>>>>>>>>>
>>>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>>>> into generic format:
>>>>>>>>>
>>>>>>>>>      struct perf_pipeline_haz_data {
>>>>>>>>>             /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>>>             __u8    itype;
>>>>>>>>>             /* Instruction Cache source */
>>>>>>>>>             __u8    icache;
>>>>>>>>>             /* Instruction suffered hazard in pipeline stage */
>>>>>>>>>             __u8    hazard_stage;
>>>>>>>>>             /* Hazard reason */
>>>>>>>>>             __u8    hazard_reason;
>>>>>>>>>             /* Instruction suffered stall in pipeline stage */
>>>>>>>>>             __u8    stall_stage;
>>>>>>>>>             /* Stall reason */
>>>>>>>>>             __u8    stall_reason;
>>>>>>>>>             __u16   pad;
>>>>>>>>>      };
>>>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>>> It's not really 1:1: we don't have these separations of stages
>>>>>> and reasons; we have, for example, 'missed in L2 cache'.
>>>>>> So IBS output is flatter, with more cycle latency figures than
>>>>>> IBM's, AFAICT.
>>>>> AMD IBS captures pipeline latency data in case of Fetch sampling, like the
>>>>> Fetch latency, tag-to-retire latency, completion-to-retire latency, and
>>>>> so on. Yes, Ops sampling does provide more data on load/store centric
>>>>> information. But it also captures more detailed data for Branch instructions.
>>>>> And we also looked at ARM SPE, which also captures more detailed pipeline
>>>>> data and latency information.
>>>>>
>>>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>>>> bad events.  IBS's PPR description has one occurrence of the
>>>>>> word 'stall', and none of 'penalty'.  The way I read IBS is it's just
>>>>>> reporting more sample data than just the precise IP: things like
>>>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>>>> like 'extended', or even the 'auxiliary' already in use today,
>>>>>> are more appropriate for IBS, although I'm the last person to
>>>>>> bikeshed.
>>>>> We are thinking of using the word "pipeline" instead of "hazard".
>>>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>>> NP. We thought pipeline is a generic hw term, so we proposed the word
>>> "pipeline". We are open to any term that is generic enough.
>>>
>>>> I realize there are a couple of core pipeline-specific pieces
>>>> of information coming out of it, but the vast majority
>>>> are addresses, latencies of various components in the memory
>>>> hierarchy, and various component hit/miss bits.
>>> Yes, we should capture core pipeline specific details. For example,
>>> IBS generates Branch unit information (IbsOpData1) and Icache related
>>> data (IbsFetchCtl), which is something that shouldn't be extended as
>>> part of perf-mem, IMO.
>> Sure, IBS Op-side output is more 'perf mem' friendly, and so it
>> should populate perf_mem_data_src fields, just like POWER9 can:
>>
>> union perf_mem_data_src {
>> ...
>>                  __u64   mem_rsvd:24,
>>                          mem_snoopx:2,   /* snoop mode, ext */
>>                          mem_remote:1,   /* remote */
>>                          mem_lvl_num:4,  /* memory hierarchy level number */
>>                          mem_dtlb:7,     /* tlb access */
>>                          mem_lock:2,     /* lock instr */
>>                          mem_snoop:5,    /* snoop mode */
>>                          mem_lvl:14,     /* memory hierarchy level */
>>                          mem_op:5;       /* type of opcode */
>>
>>
>> E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
>> mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
>> 'mem_lock', and the Reload Bus Source Encoding bits can
>> be used to populate mem_snoop, right?
> Hi Kim,
> 
> Yes. We do expose these data as part of perf-mem for POWER.

OK, I see relevant PERF_MEM_S bits in arch/powerpc/perf/isa207-common.c:
isa207_find_source now, thanks.
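
For other readers: isa207_find_source() composes its result with the
PERF_MEM_S() helper from the uapi header, roughly in this style (the
condition below is illustrative, not copied from the file):

	u64 dsrc = PERF_MEM_S(OP, LOAD);	/* sampled a load */

	if (hit_in_l1)	/* hypothetical decode of the SIER bits */
		dsrc |= PERF_MEM_S(LVL, HIT) | PERF_MEM_S(LVL, L1);
	else
		dsrc |= PERF_MEM_S(LVL, MISS) | PERF_MEM_S(LVL, L1);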

>> For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
>> used for the ld/st target addresses, too.
>>
>>>> What's needed here is a vendor-specific extended
>>>> sample information that all these technologies gather,
>>>> of which things like e.g., 'L1 TLB cycle latency' we
>>>> all should have in common.
>>> Yes. We will include fields to capture the latency cycles (like Issue
>>> latency, Instruction completion latency, etc.) along with other pipeline
>>> details in the proposed structure.
>> Latency figures are just an example, and from what I
>> can tell, struct perf_sample_data already has a 'weight' member,
>> used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
>> transfer memory access latency figures.  Granted, that's
>> a bad name given all other vendors don't call latency
>> 'weight'.
>>
>> I didn't see any latency figures coming out of POWER9,
>> and do not expect this patch series to implement those
>> of other vendors, e.g., AMD's IBS; leave each vendor
>> to amend perf to suit their own h/w output please.
> 
> The reference structure proposed in this patchset did not have members
> to capture latency info for that exact reason. But the idea here is to
> abstract away as much vendor-specific detail as possible. So if we
> include a u16 array, then this format can also capture data from IBS,
> since it provides a few latency details.

OK, that sounds a bit different from the 6 x u8's + 1 u16 padded
struct presented in this patchset.

IBS Ops can report e.g.:

15 tag-to-retire cycles bits,
15 completion to retire count bits,
15 L1 DTLB refill latency bits,
15 DC miss latency bits,
5 outstanding memory requests on mem refill bits, and so on.

IBS Fetch reports 15 bits of fetch latency, and another 16
for iTLB latency, among others.

Some of these may/may not be valid simultaneously, and
there are IBS specific rules to establish validity.
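
So any consumer has to gate on those validity rules; a sketch of what
that looks like, with placeholder bit positions (the real IbsOpData3
layout is in the PPR, not reproduced here):

	#define IBS_DC_MISS_BIT		(1ULL << 7)		/* placeholder */
	#define IBS_DC_MISS_LAT(x)	(((x) >> 32) & 0x7fff)	/* placeholder */

	static int ibs_dc_miss_latency(u64 op_data3, u16 *lat)
	{
		if (!(op_data3 & IBS_DC_MISS_BIT))
			return -1;	/* field not valid for this op */
		*lat = IBS_DC_MISS_LAT(op_data3);
		return 0;
	}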

>> My main point there, however, was that each vendor should
>> use streamlined record-level code to just copy the data
>> in the proprietary format that their hardware produces,
>> and then perf tooling can synthesize the events
>> from the raw data at report/script/etc. time.
>>
>>>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>>>> either.  Can we use PERF_SAMPLE_AUX instead?
>>> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended for when
>>> a large volume of data needs to be captured as part of perf.data without
>>> frequent PMIs. But the proposed type is to address the capture of pipeline
>> SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
>> PMIs are, even though it may be used in those environments.
>>
>>> information on each sample using PMI at periodic intervals. Hence proposing
>>> PERF_SAMPLE_PIPELINE_HAZ.
>> And that's fine for any extra bits that POWER9 has to convey
>> to its users beyond things already represented by other sample
>> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
>> and other vendors' data, e.g., AMD IBS, can be made vendor-independent
>> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
>> what IBS currently uses.
> 
> My bad. Not sure what you mean by this. We are trying to abstract
> as much vendor specific data as possible with this (like perf-mem).

Perhaps if I say it this way: instead of doing all the 
isa207_get_phazard_data() work past the mfspr(SPRN_SIER)
in patch 4/11, just put the raw sier value in a
PERF_SAMPLE_RAW or _AUX event, and call perf_event_update_userpage.
Specific SIER capabilities can be written as part of the perf.data
header.  Then synthesize the true pipe events from the raw SIER
values later, and in userspace.
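
In code, the record-time side could be as small as this sketch
(driver plumbing elided; it mirrors what the IBS driver already
does with perf_raw_record):

	u64 sier = mfspr(SPRN_SIER);
	struct perf_raw_record raw = {
		.frag = {
			.size = sizeof(sier),
			.data = &sier,
		},
	};

	data.raw = &raw;	/* emitted when PERF_SAMPLE_RAW is set */
	perf_event_update_userpage(event);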

I guess it's technically optional, but I think that's how
I'd do it in IBS, since it minimizes the record-time overhead.

Thanks,

Kim

> Maddy
>>
>>>>   Take a look at
>>>> commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
>>>> definitions".  The sample identifier can be used to determine
>>>> which vendor's sampling IP's data is in it, and events can
>>>> be recorded just by copying the content of the SIER, etc.
>>>> registers, and then events get synthesized from the aux
>>>> sample at report/inject/annotate etc. time.  This allows
>>>> for less sample recording overhead, and leaves all the vendor
>>>> specific decoding and common event conversions for userspace
>>>> to figure out.
>>> When AUX buffer data is structured, the tool side changes added to present
>>> the pipeline data can be re-used.
>> Not sure I understand: AUX data would be structured on
>> each vendor's raw h/w register formats.
>>
>> Thanks,
>>
>> Kim
>>
>>>>>>> Also worth considering is the support of ARM SPE (Statistical
>>>>>>> Profiling Extension) which is their version of IBS.
>>>>>>> Whatever gets added needs to cover all three with no limitations.
>>>>>> I thought Intel's various LBR, PEBS, and PT supported providing
>>>>>> similar sample data in perf already, like with perf mem/c2c?
>>>>> perf-mem is more data centric, in my opinion. It is more towards
>>>>> memory profiling. So the proposal here is to expose pipeline related
>>>>> details like stalls and latencies.
>>>> Like I said, I don't see it that way, I see it as 'any particular
>>>> vendor's event's extended details', and these pipeline details
>>>> have overlap with existing infrastructure within perf, e.g., L2
>>>> cache misses.
>>>>
>>>> Kim
>>>>
>
Michael Ellerman March 19, 2020, 11:22 a.m. UTC | #15
Kim Phillips <kim.phillips@amd.com> writes:
> On 3/17/20 1:50 AM, maddy wrote:
>> On 3/13/20 4:08 AM, Kim Phillips wrote:
>>> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>>>
>>>> information on each sample using PMI at periodic intervals. Hence proposing
>>>> PERF_SAMPLE_PIPELINE_HAZ.
>>>
>>> And that's fine for any extra bits that POWER9 has to convey
>>> to its users beyond things already represented by other sample
>>> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
>>> and other vendors' data, e.g., AMD IBS, can be made vendor-independent
>>> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
>>> what IBS currently uses.
>> 
>> My bad. Not sure what you mean by this. We are trying to abstract
>> as much vendor specific data as possible with this (like perf-mem).
>
> Perhaps if I say it this way: instead of doing all the 
> isa207_get_phazard_data() work past the mfspr(SPRN_SIER)
> in patch 4/11, just put the raw sier value in a
> PERF_SAMPLE_RAW or _AUX event, and call perf_event_update_userpage.
> Specific SIER capabilities can be written as part of the perf.data
> header.  Then synthesize the true pipe events from the raw SIER
> values later, and in userspace.

In the past the perf maintainers have wanted the perf API to abstract
over the specific CPU details, rather than just pushing raw register
values out to userspace.

But maybe that's no longer the case and we should just use
PERF_SAMPLE_AUX?

cheers
Madhavan Srinivasan March 26, 2020, 10:19 a.m. UTC | #16
On 3/18/20 11:05 PM, Kim Phillips wrote:
> Hi Maddy,
>
> On 3/17/20 1:50 AM, maddy wrote:
>> On 3/13/20 4:08 AM, Kim Phillips wrote:
>>> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>>>> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>>>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>>>>> Modern processors export such hazard data in Performance
>>>>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>>>>> AMD[3] provides similar information.
>>>>>>>>>>
>>>>>>>>>> Implementation detail:
>>>>>>>>>>
>>>>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>>>>> into generic format:
>>>>>>>>>>
>>>>>>>>>>       struct perf_pipeline_haz_data {
>>>>>>>>>>              /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>>>>              __u8    itype;
>>>>>>>>>>              /* Instruction Cache source */
>>>>>>>>>>              __u8    icache;
>>>>>>>>>>              /* Instruction suffered hazard in pipeline stage */
>>>>>>>>>>              __u8    hazard_stage;
>>>>>>>>>>              /* Hazard reason */
>>>>>>>>>>              __u8    hazard_reason;
>>>>>>>>>>              /* Instruction suffered stall in pipeline stage */
>>>>>>>>>>              __u8    stall_stage;
>>>>>>>>>>              /* Stall reason */
>>>>>>>>>>              __u8    stall_reason;
>>>>>>>>>>              __u16   pad;
>>>>>>>>>>       };
>>>>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>>>> It's not really 1:1: we don't have these separations of stages
>>>>>>> and reasons; we have, for example, 'missed in L2 cache'.
>>>>>>> So IBS output is flatter, with more cycle latency figures than
>>>>>>> IBM's, AFAICT.
>>>>>> AMD IBS captures pipeline latency data in case of Fetch sampling, like the
>>>>>> Fetch latency, tag-to-retire latency, completion-to-retire latency, and
>>>>>> so on. Yes, Ops sampling does provide more data on load/store centric
>>>>>> information. But it also captures more detailed data for Branch instructions.
>>>>>> And we also looked at ARM SPE, which also captures more detailed pipeline
>>>>>> data and latency information.
>>>>>>
>>>>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>>>>> bad events.  IBS's PPR description has one occurrence of the
>>>>>>> word 'stall', and none of 'penalty'.  The way I read IBS is it's just
>>>>>>> reporting more sample data than just the precise IP: things like
>>>>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>>>>> like 'extended', or even the 'auxiliary' already in use today,
>>>>>>> are more appropriate for IBS, although I'm the last person to
>>>>>>> bikeshed.
>>>>>> We are thinking of using the word "pipeline" instead of "hazard".
>>>>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>>>> NP. We thought pipeline is a generic hw term, so we proposed the word
>>>> "pipeline". We are open to any term that is generic enough.
>>>>
>>>>> I realize there are a couple of core pipeline-specific pieces
>>>>> of information coming out of it, but the vast majority
>>>>> are addresses, latencies of various components in the memory
>>>>> hierarchy, and various component hit/miss bits.
>>>> Yes, we should capture core pipeline specific details. For example,
>>>> IBS generates Branch unit information (IbsOpData1) and Icache related
>>>> data (IbsFetchCtl), which is something that shouldn't be extended as
>>>> part of perf-mem, IMO.
>>> Sure, IBS Op-side output is more 'perf mem' friendly, and so it
>>> should populate perf_mem_data_src fields, just like POWER9 can:
>>>
>>> union perf_mem_data_src {
>>> ...
>>>                   __u64   mem_rsvd:24,
>>>                           mem_snoopx:2,   /* snoop mode, ext */
>>>                           mem_remote:1,   /* remote */
>>>                           mem_lvl_num:4,  /* memory hierarchy level number */
>>>                           mem_dtlb:7,     /* tlb access */
>>>                           mem_lock:2,     /* lock instr */
>>>                           mem_snoop:5,    /* snoop mode */
>>>                           mem_lvl:14,     /* memory hierarchy level */
>>>                           mem_op:5;       /* type of opcode */
>>>
>>>
>>> E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
>>> mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
>>> 'mem_lock', and the Reload Bus Source Encoding bits can
>>> be used to populate mem_snoop, right?
>> Hi Kim,
>>
>> Yes. We do expose these data as part of perf-mem for POWER.
> OK, I see relevant PERF_MEM_S bits in arch/powerpc/perf/isa207-common.c:
> isa207_find_source now, thanks.
>
>>> For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
>>> used for the ld/st target addresses, too.
>>>
>>>>> What's needed here is a vendor-specific extended
>>>>> sample information that all these technologies gather,
>>>>> of which things like e.g., 'L1 TLB cycle latency' we
>>>>> all should have in common.
>>>> Yes. We will include fields to capture the latency cycles (like Issue
>>>> latency, Instruction completion latency, etc.) along with other pipeline
>>>> details in the proposed structure.
>>> Latency figures are just an example, and from what I
>>> can tell, struct perf_sample_data already has a 'weight' member,
>>> used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
>>> transfer memory access latency figures.  Granted, that's
>>> a bad name given all other vendors don't call latency
>>> 'weight'.
>>>
>>> I didn't see any latency figures coming out of POWER9,
>>> and do not expect this patch series to implement those
>>> of other vendors, e.g., AMD's IBS; leave each vendor
>>> to amend perf to suit their own h/w output please.
>> The reference structure proposed in this patchset did not have members
>> to capture latency info for that exact reason. But the idea here is to
>> abstract away as much vendor-specific detail as possible. So if we
>> include a u16 array, then this format can also capture data from IBS,
>> since it provides a few latency details.
> OK, that sounds a bit different from the 6 x u8's + 1 u16 padded
> struct presented in this patchset.
>
> IBS Ops can report e.g.:
>
> 15 tag-to-retire cycles bits,
> 15 completion to retire count bits,
> 15 L1 DTLB refill latency bits,
> 15 DC miss latency bits,
> 5 outstanding memory requests on mem refill bits, and so on.
>
> IBS Fetch reports 15 bits of fetch latency, and another 16
> for iTLB latency, among others.
>
> Some of these may/may not be valid simultaneously, and
> there are IBS specific rules to establish validity.
>
>>> My main point there, however, was that each vendor should
>>> use streamlined record-level code to just copy the data
>>> in the proprietary format that their hardware produces,
>>> and then perf tooling can synthesize the events
>>> from the raw data at report/script/etc. time.
>>>
>>>>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>>>>> either.  Can we use PERF_SAMPLE_AUX instead?
>>>> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended for when
>>>> a large volume of data needs to be captured as part of perf.data without
>>>> frequent PMIs. But the proposed type is to address the capture of pipeline
>>> SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
>>> PMIs are, even though it may be used in those environments.
>>>
>>>> information on each sample using PMI at periodic intervals. Hence proposing
>>>> PERF_SAMPLE_PIPELINE_HAZ.
>>> And that's fine for any extra bits that POWER9 has to convey
>>> to its users beyond things already represented by other sample
>>> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
>>> and other vendors' data, e.g., AMD IBS, can be made vendor-independent
>>> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
>>> what IBS currently uses.
>> My bad. Not sure what you mean by this. We are trying to abstract
>> as much vendor specific data as possible with this (like perf-mem).
> Perhaps if I say it this way: instead of doing all the
> isa207_get_phazard_data() work past the mfspr(SPRN_SIER)
> in patch 4/11, just put the raw sier value in a
> PERF_SAMPLE_RAW or _AUX event, and call perf_event_update_userpage.
> Specific SIER capabilities can be written as part of the perf.data
> header.  Then synthesize the true pipe events from the raw SIER
> values later, and in userspace.

Hi Kim,

Would like to stay away from the SAMPLE_RAW type because of these
comments in perf_event.h:

*      #
*      # The RAW record below is opaque data wrt the ABI
*      #
*      # That is, the ABI doesn't make any promises wrt to
*      # the stability of its content, it may vary depending
*      # on event, hardware, kernel version and phase of
*      # the moon.
*      #
*      # In other words, PERF_SAMPLE_RAW contents are not an ABI.
*      #

Secondly, sorry, I didn't understand your suggestion about using
PERF_SAMPLE_AUX. IIUC, SAMPLE_AUX will go to the AUX ring buffer, which
means more memory, and makes it more challenging to correlate and
present the pipeline details for each IP. IMO, having a new sample type
can be useful to capture the pipeline data both in perf_sample_data
and, if _AUX is enabled, it can also be pushed to the AUX buffer.
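
i.e., roughly (a hypothetical sketch; PERF_SAMPLE_PIPELINE_HAZ and the
'pipeline_haz' member follow this RFC's proposal, not upstream code):

	if (event->attr.sample_type & PERF_SAMPLE_PIPELINE_HAZ)
		perf_output_copy(&handle, &data->pipeline_haz,
				 sizeof(struct perf_pipeline_haz_data));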

Maddy

>
> I guess it's technically optional, but I think that's how
> I'd do it in IBS, since it minimizes the record-time overhead.
>
> Thanks,
>
> Kim
>
>> Maddy
>>>>>    Take a look at
>>>>> commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
>>>>> definitions".  The sample identifier can be used to determine
>>>>> which vendor's sampling IP's data is in it, and events can
>>>>> be recorded just by copying the content of the SIER, etc.
>>>>> registers, and then events get synthesized from the aux
>>>>> sample at report/inject/annotate etc. time.  This allows
>>>>> for less sample recording overhead, and leaves all the vendor
>>>>> specific decoding and common event conversions for userspace
>>>>> to figure out.
>>>> When AUX buffer data is structured, the tool side changes added to present
>>>> the pipeline data can be re-used.
>>> Not sure I understand: AUX data would be structured on
>>> each vendor's raw h/w register formats.
>>>
>>> Thanks,
>>>
>>> Kim
>>>
>>>>>>>> Also worth considering is the support of ARM SPE (Statistical
>>>>>>>> Profiling Extension) which is their version of IBS.
>>>>>>>> Whatever gets added needs to cover all three with no limitations.
>>>>>>> I thought Intel's various LBR, PEBS, and PT supported providing
>>>>>>> similar sample data in perf already, like with perf mem/c2c?
>>>>>> perf-mem is more data centric, in my opinion. It is more towards
>>>>>> memory profiling. So the proposal here is to expose pipeline related
>>>>>> details like stalls and latencies.
>>>>> Like I said, I don't see it that way, I see it as 'any particular
>>>>> vendor's event's extended details', and these pipeline details
>>>>> have overlap with existing infrastructure within perf, e.g., L2
>>>>> cache misses.
>>>>>
>>>>> Kim
>>>>>
Kim Phillips March 26, 2020, 7:48 p.m. UTC | #17
On 3/26/20 5:19 AM, maddy wrote:
> 
> 
> On 3/18/20 11:05 PM, Kim Phillips wrote:
>> Hi Maddy,
>>
>> On 3/17/20 1:50 AM, maddy wrote:
>>> On 3/13/20 4:08 AM, Kim Phillips wrote:
>>>> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>>>>> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>>>>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>>>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>>>>>> Modern processors export such hazard data in Performance
>>>>>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>>>>>> AMD[3] provides similar information.
>>>>>>>>>>>
>>>>>>>>>>> Implementation detail:
>>>>>>>>>>>
>>>>>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>>>>>> into generic format:
>>>>>>>>>>>
>>>>>>>>>>>       struct perf_pipeline_haz_data {
>>>>>>>>>>>              /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>>>>>              __u8    itype;
>>>>>>>>>>>              /* Instruction Cache source */
>>>>>>>>>>>              __u8    icache;
>>>>>>>>>>>              /* Instruction suffered hazard in pipeline stage */
>>>>>>>>>>>              __u8    hazard_stage;
>>>>>>>>>>>              /* Hazard reason */
>>>>>>>>>>>              __u8    hazard_reason;
>>>>>>>>>>>              /* Instruction suffered stall in pipeline stage */
>>>>>>>>>>>              __u8    stall_stage;
>>>>>>>>>>>              /* Stall reason */
>>>>>>>>>>>              __u8    stall_reason;
>>>>>>>>>>>              __u16   pad;
>>>>>>>>>>>       };
>>>>>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>>>>> It's not really 1:1: we don't have these separations of stages
>>>>>>>> and reasons; we have, for example, 'missed in L2 cache'.
>>>>>>>> So IBS output is flatter, with more cycle latency figures than
>>>>>>>> IBM's, AFAICT.
>>>>>>> AMD IBS captures pipeline latency data in case of Fetch sampling, like the
>>>>>>> Fetch latency, tag-to-retire latency, completion-to-retire latency, and
>>>>>>> so on. Yes, Ops sampling does provide more data on load/store centric
>>>>>>> information. But it also captures more detailed data for Branch instructions.
>>>>>>> And we also looked at ARM SPE, which also captures more detailed pipeline
>>>>>>> data and latency information.
>>>>>>>
>>>>>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>>>>>> bad events.  IBS's PPR description has one occurrence of the
>>>>>>>> word 'stall', and none of 'penalty'.  The way I read IBS is it's just
>>>>>>>> reporting more sample data than just the precise IP: things like
>>>>>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>>>>>> like 'extended', or even the 'auxiliary' already in use today,
>>>>>>>> are more appropriate for IBS, although I'm the last person to
>>>>>>>> bikeshed.
>>>>>>> We are thinking of using the word "pipeline" instead of "hazard".
>>>>>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>>>>> NP. We thought pipeline is a generic hw term, so we proposed the word
>>>>> "pipeline". We are open to any term that is generic enough.
>>>>>
>>>>>> I realize there are a couple of core pipeline-specific pieces
>>>>>> of information coming out of it, but the vast majority
>>>>>> are addresses, latencies of various components in the memory
>>>>>> hierarchy, and various component hit/miss bits.
>>>>> Yes, we should capture core pipeline specific details. For example,
>>>>> IBS generates Branch unit information (IbsOpData1) and Icache related
>>>>> data (IbsFetchCtl), which is something that shouldn't be extended as
>>>>> part of perf-mem, IMO.
>>>> Sure, IBS Op-side output is more 'perf mem' friendly, and so it
>>>> should populate perf_mem_data_src fields, just like POWER9 can:
>>>>
>>>> union perf_mem_data_src {
>>>> ...
>>>>                   __u64   mem_rsvd:24,
>>>>                           mem_snoopx:2,   /* snoop mode, ext */
>>>>                           mem_remote:1,   /* remote */
>>>>                           mem_lvl_num:4,  /* memory hierarchy level number */
>>>>                           mem_dtlb:7,     /* tlb access */
>>>>                           mem_lock:2,     /* lock instr */
>>>>                           mem_snoop:5,    /* snoop mode */
>>>>                           mem_lvl:14,     /* memory hierarchy level */
>>>>                           mem_op:5;       /* type of opcode */
>>>>
>>>>
>>>> E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
>>>> mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
>>>> 'mem_lock', and the Reload Bus Source Encoding bits can
>>>> be used to populate mem_snoop, right?
>>> Hi Kim,
>>>
>>> Yes. We do expose these data as part of perf-mem for POWER.
>> OK, I see relevant PERF_MEM_S bits in arch/powerpc/perf/isa207-common.c:
>> isa207_find_source now, thanks.
>>
>>>> For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
>>>> used for the ld/st target addresses, too.
>>>>
>>>>>> What's needed here is a vendor-specific extended
>>>>>> sample information that all these technologies gather,
>>>>>> of which things like e.g., 'L1 TLB cycle latency' we
>>>>>> all should have in common.
>>>>> Yes. We will include fields to capture the latency cycles (like Issue
>>>>> latency, Instruction completion latency, etc.) along with other pipeline
>>>>> details in the proposed structure.
>>>> Latency figures are just an example, and from what I
>>>> can tell, struct perf_sample_data already has a 'weight' member,
>>>> used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
>>>> transfer memory access latency figures.  Granted, that's
>>>> a bad name given all other vendors don't call latency
>>>> 'weight'.
>>>>
>>>> I didn't see any latency figures coming out of POWER9,
>>>> and do not expect this patch series to implement those
>>>> of other vendors, e.g., AMD's IBS; leave each vendor
>>>> to amend perf to suit their own h/w output please.
>>> The reference structure proposed in this patchset did not have members
>>> to capture latency info for that exact reason. But the idea here is to
>>> abstract away as much vendor-specific detail as possible. So if we
>>> include a u16 array, then this format can also capture data from IBS,
>>> since it provides a few latency details.
>> OK, that sounds a bit different from the 6 x u8's + 1 u16 padded
>> struct presented in this patchset.
>>
>> IBS Ops can report e.g.:
>>
>> 15 tag-to-retire cycles bits,
>> 15 completion to retire count bits,
>> 15 L1 DTLB refill latency bits,
>> 15 DC miss latency bits,
>> 5 outstanding memory requests on mem refill bits, and so on.
>>
>> IBS Fetch reports 15 bits of fetch latency, and another 16
>> for iTLB latency, among others.
>>
>> Some of these may/may not be valid simultaneously, and
>> there are IBS specific rules to establish validity.
>>
>>>> My main point there, however, was that each vendor should
>>>> use streamlined record-level code to just copy the data
>>>> in the proprietary format that their hardware produces,
>>>> and then perf tooling can synthesize the events
>>>> from the raw data at report/script/etc. time.
>>>>
>>>>>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>>>>>> either.  Can we use PERF_SAMPLE_AUX instead?
>>>>> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended for when
>>>>> a large volume of data needs to be captured as part of perf.data without
>>>>> frequent PMIs. But the proposed type is to address the capture of pipeline
>>>> SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
>>>> PMIs are, even though it may be used in those environments.
>>>>
>>>>> information on each sample using PMI at periodic intervals. Hence proposing
>>>>> PERF_SAMPLE_PIPELINE_HAZ.
>>>> And that's fine for any extra bits that POWER9 has to convey
>>>> to its users beyond things already represented by other sample
>>>> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
>>>> and other vendors' data, e.g., AMD IBS, can be made vendor-independent
>>>> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
>>>> what IBS currently uses.
>>> My bad. Not sure what you mean by this. We are trying to abstract
>>> as much vendor specific data as possible with this (like perf-mem).
>> Perhaps if I say it this way: instead of doing all the
>> isa207_get_phazard_data() work past the mfspr(SPRN_SIER)
>> in patch 4/11, just put the raw sier value in a
>> PERF_SAMPLE_RAW or _AUX event, and call perf_event_update_userpage.
>> Specific SIER capabilities can be written as part of the perf.data
>> header.  Then synthesize the true pipe events from the raw SIER
>> values later, and in userspace.
> 
> Hi Kim,
> 
> Would like to stay away from the SAMPLE_RAW type because of these comments in perf_event.h:
> 
> *      #
> *      # The RAW record below is opaque data wrt the ABI
> *      #
> *      # That is, the ABI doesn't make any promises wrt to
> *      # the stability of its content, it may vary depending
> *      # on event, hardware, kernel version and phase of
> *      # the moon.
> *      #
> *      # In other words, PERF_SAMPLE_RAW contents are not an ABI.
> *      #

The "it may vary depending on ... hardware" clause makes it sound
appropriate for the use-case where the raw hardware register contents
are copied directly into the user buffer.

> Secondly, sorry, I didn't understand your suggestion about using
> PERF_SAMPLE_AUX. IIUC, SAMPLE_AUX will go to the AUX ring buffer, which
> means more memory, and makes it more challenging to correlate and
> present the pipeline details for each IP. IMO, having a new sample type
> can be useful to capture the pipeline data both in perf_sample_data
> and, if _AUX is enabled, it can also be pushed to the AUX buffer.

OK, I didn't think SAMPLE_AUX and the aux ring buffer were
interdependent, sorry.

Thanks,

Kim
Madhavan Srinivasan April 20, 2020, 7:09 a.m. UTC | #18
On 3/27/20 1:18 AM, Kim Phillips wrote:
>
> On 3/26/20 5:19 AM, maddy wrote:
>>
>> On 3/18/20 11:05 PM, Kim Phillips wrote:
>>> Hi Maddy,
>>>
>>> On 3/17/20 1:50 AM, maddy wrote:
>>>> On 3/13/20 4:08 AM, Kim Phillips wrote:
>>>>> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>>>>>> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>>>>>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>>>>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>>>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>>>>>>> Modern processors export such hazard data in Performance
>>>>>>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>>>>>>> AMD[3] provides similar information.
>>>>>>>>>>>>
>>>>>>>>>>>> Implementation detail:
>>>>>>>>>>>>
>>>>>>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>>>>>>> into generic format:
>>>>>>>>>>>>
>>>>>>>>>>>>        struct perf_pipeline_haz_data {
>>>>>>>>>>>>               /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>>>>>>               __u8    itype;
>>>>>>>>>>>>               /* Instruction Cache source */
>>>>>>>>>>>>               __u8    icache;
>>>>>>>>>>>>               /* Instruction suffered hazard in pipeline stage */
>>>>>>>>>>>>               __u8    hazard_stage;
>>>>>>>>>>>>               /* Hazard reason */
>>>>>>>>>>>>               __u8    hazard_reason;
>>>>>>>>>>>>               /* Instruction suffered stall in pipeline stage */
>>>>>>>>>>>>               __u8    stall_stage;
>>>>>>>>>>>>               /* Stall reason */
>>>>>>>>>>>>               __u8    stall_reason;
>>>>>>>>>>>>               __u16   pad;
>>>>>>>>>>>>        };
>>>>>>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>>>>>> It's not really 1:1: we don't have these separations of stages
>>>>>>>>> and reasons; we have, for example, 'missed in L2 cache'.
>>>>>>>>> So IBS output is flatter, with more cycle latency figures than
>>>>>>>>> IBM's, AFAICT.
>>>>>>>> AMD IBS captures pipeline latency data in case of Fetch sampling, like the
>>>>>>>> Fetch latency, tag-to-retire latency, completion-to-retire latency, and
>>>>>>>> so on. Yes, Ops sampling does provide more data on load/store centric
>>>>>>>> information. But it also captures more detailed data for Branch instructions.
>>>>>>>> And we also looked at ARM SPE, which also captures more detailed pipeline
>>>>>>>> data and latency information.
>>>>>>>>
>>>>>>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>>>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>>>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>>>>>>> bad events.  IBS's PPR description has one occurrence of the
>>>>>>>>> word 'stall', and none of 'penalty'.  The way I read IBS is it's just
>>>>>>>>> reporting more sample data than just the precise IP: things like
>>>>>>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>>>>>>> like 'extended', or even the 'auxiliary' already in use today,
>>>>>>>>> are more appropriate for IBS, although I'm the last person to
>>>>>>>>> bikeshed.
>>>>>>>> We are thinking of using the word "pipeline" instead of "hazard".
>>>>>>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>>>>>> NP. We thought pipeline is a generic hw term, so we proposed the word
>>>>>> "pipeline". We are open to any term that is generic enough.
>>>>>>
>>>>>>> I realize there are a couple of core pipeline-specific pieces
>>>>>>> of information coming out of it, but the vast majority
>>>>>>> are addresses, latencies of various components in the memory
>>>>>>> hierarchy, and various component hit/miss bits.
>>>>>> Yes, we should capture core pipeline specific details. For example,
>>>>>> IBS generates Branch unit information (IbsOpData1) and Icache related
>>>>>> data (IbsFetchCtl), which is something that shouldn't be extended as
>>>>>> part of perf-mem, IMO.
>>>>> Sure, IBS Op-side output is more 'perf mem' friendly, and so it
>>>>> should populate perf_mem_data_src fields, just like POWER9 can:
>>>>>
>>>>> union perf_mem_data_src {
>>>>> ...
>>>>>                    __u64   mem_rsvd:24,
>>>>>                            mem_snoopx:2,   /* snoop mode, ext */
>>>>>                            mem_remote:1,   /* remote */
>>>>>                            mem_lvl_num:4,  /* memory hierarchy level number */
>>>>>                            mem_dtlb:7,     /* tlb access */
>>>>>                            mem_lock:2,     /* lock instr */
>>>>>                            mem_snoop:5,    /* snoop mode */
>>>>>                            mem_lvl:14,     /* memory hierarchy level */
>>>>>                            mem_op:5;       /* type of opcode */
>>>>>
>>>>>
>>>>> E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
>>>>> mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
>>>>> 'mem_lock', and the Reload Bus Source Encoding bits can
>>>>> be used to populate mem_snoop, right?
>>>> Hi Kim,
>>>>
>>>> Yes. We do expose these data as part of perf-mem for POWER.
>>> OK, I see relevant PERF_MEM_S bits in arch/powerpc/perf/isa207-common.c:
>>> isa207_find_source now, thanks.
>>>
>>>>> For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
>>>>> used for the ld/st target addresses, too.
>>>>>
>>>>>>> What's needed here is a vendor-specific extended
>>>>>>> sample information that all these technologies gather,
>>>>>>> of which things like e.g., 'L1 TLB cycle latency' we
>>>>>>> all should have in common.
>>>>>> Yes. We will include fields to capture the latency cycles (like Issue
>>>>>> latency, Instruction completion latency, etc.) along with other pipeline
>>>>>> details in the proposed structure.
>>>>> Latency figures are just an example, and from what I
>>>>> can tell, struct perf_sample_data already has a 'weight' member,
>>>>> used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
>>>>> transfer memory access latency figures.  Granted, that's
>>>>> a bad name given all other vendors don't call latency
>>>>> 'weight'.
>>>>>
>>>>> I didn't see any latency figures coming out of POWER9,
>>>>> and do not expect this patch series to implement those
>>>>> of other vendors, e.g., AMD's IBS; leave each vendor
>>>>> to amend perf to suit their own h/w output please.
>>>> The reference structure proposed in this patchset did not have members
>>>> to capture latency info for that exact reason. But the idea here is to
>>>> abstract away as much vendor-specific detail as possible. So if we
>>>> include a u16 array, then this format can also capture data from IBS,
>>>> since it provides a few latency details.
>>> OK, that sounds a bit different from the 6 x u8's + 1 u16 padded
>>> struct presented in this patchset.
>>>
>>> IBS Ops can report e.g.:
>>>
>>> 15 tag-to-retire cycles bits,
>>> 15 completion to retire count bits,
>>> 15 L1 DTLB refill latency bits,
>>> 15 DC miss latency bits,
>>> 5 outstanding memory requests on mem refill bits, and so on.
>>>
>>> IBS Fetch reports 15 bits of fetch latency, and another 16
>>> for iTLB latency, among others.
>>>
>>> Some of these may/may not be valid simultaneously, and
>>> there are IBS specific rules to establish validity.
>>>
>>>>> My main point there, however, was that each vendor should
>>>>> use streamlined record-level code to just copy the data
>>>>> in the proprietary format that their hardware produces,
>>>>> and then perf tooling can synthesize the events
>>>>> from the raw data at report/script/etc. time.
>>>>>
>>>>>>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>>>>>>> either.  Can we use PERF_SAMPLE_AUX instead?
>>>>>> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended for when
>>>>>> a large volume of data needs to be captured as part of perf.data without
>>>>>> frequent PMIs. But the proposed type is to address the capture of pipeline
>>>>> SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
>>>>> PMIs are, even though it may be used in those environments.
>>>>>
>>>>>> information on each sample using PMI at periodic intervals. Hence proposing
>>>>>> PERF_SAMPLE_PIPELINE_HAZ.
>>>>> And that's fine for any extra bits that POWER9 has to convey
>>>>> to its users beyond things already represented by other sample
>>>>> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
>>>>> and other vendors' data, e.g., AMD IBS, can be made vendor-independent
>>>>> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
>>>>> what IBS currently uses.
>>>> My bad. Not sure what you mean by this. We are trying to abstract
>>>> as much vendor specific data as possible with this (like perf-mem).
>>> Perhaps if I say it this way: instead of doing all the
>>> isa207_get_phazard_data() work past the mfspr(SPRN_SIER)
>>> in patch 4/11, just put the raw sier value in a
>>> PERF_SAMPLE_RAW or _AUX event, and call perf_event_update_userpage.
>>> Specific SIER capabilities can be written as part of the perf.data
>>> header.  Then synthesize the true pipe events from the raw SIER
>>> values later, and in userspace.
>> Hi Kim,
>>
>> Would like to stay away from the SAMPLE_RAW type because of these comments in perf_event.h:
>>
>> *      #
>> *      # The RAW record below is opaque data wrt the ABI
>> *      #
>> *      # That is, the ABI doesn't make any promises wrt to
>> *      # the stability of its content, it may vary depending
>> *      # on event, hardware, kernel version and phase of
>> *      # the moon.
>> *      #
>> *      # In other words, PERF_SAMPLE_RAW contents are not an ABI.
>> *      #
> The "it may vary depending on ... hardware" clause makes it sound
> appropriate for the use-case where the raw hardware register contents
> are copied directly into the user buffer.


Hi Kim,

Sorry for the delayed response.

But the perf tool side needs infrastructure to handle raw sample
data from the cpu-pmu (PERF_SAMPLE_RAW is used by tracepoints today).
I am not sure whether this is the approach we should take here.

peterz, any comments?

>
>> Secondly, sorry, I didn't understand your suggestion about using
>> PERF_SAMPLE_AUX. IIUC, SAMPLE_AUX will go to the AUX ring buffer, which
>> means more memory, and makes it more challenging to correlate and
>> present the pipeline details for each IP. IMO, having a new sample type
>> can be useful to capture the pipeline data both in perf_sample_data
>> and, if _AUX is enabled, it can also be pushed to the AUX buffer.
> OK, I didn't think SAMPLE_AUX and the aux ring buffer were
> interdependent, sorry.
>
> Thanks,
>
> Kim
Madhavan Srinivasan April 27, 2020, 7:18 a.m. UTC | #19
peterz,

     Can you please help? Is it okay to use PERF_SAMPLE_RAW to expose
the pipeline stall details and to add tool-side infrastructure to
handle PERF_SAMPLE_RAW for cpu-pmu samples?

Maddy

On 4/20/20 12:39 PM, Madhavan Srinivasan wrote:
>
>
> On 3/27/20 1:18 AM, Kim Phillips wrote:
>>
>> On 3/26/20 5:19 AM, maddy wrote:
>>>
>>> On 3/18/20 11:05 PM, Kim Phillips wrote:
>>>> Hi Maddy,
>>>>
>>>> On 3/17/20 1:50 AM, maddy wrote:
>>>>> On 3/13/20 4:08 AM, Kim Phillips wrote:
>>>>>> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>>>>>>> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>>>>>>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>>>>>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>>>>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>>>>>>>> Modern processors export such hazard data in Performance
>>>>>>>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>>>>>>>> AMD[3] provides similar information.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Implementation detail:
>>>>>>>>>>>>>
>>>>>>>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>>>>>>>> into generic format:
>>>>>>>>>>>>>
>>>>>>>>>>>>>        struct perf_pipeline_haz_data {
>>>>>>>>>>>>>               /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>>>>>>>               __u8    itype;
>>>>>>>>>>>>>               /* Instruction Cache source */
>>>>>>>>>>>>>               __u8    icache;
>>>>>>>>>>>>>               /* Instruction suffered hazard in pipeline stage */
>>>>>>>>>>>>>               __u8    hazard_stage;
>>>>>>>>>>>>>               /* Hazard reason */
>>>>>>>>>>>>>               __u8    hazard_reason;
>>>>>>>>>>>>>               /* Instruction suffered stall in pipeline stage */
>>>>>>>>>>>>>               __u8    stall_stage;
>>>>>>>>>>>>>               /* Stall reason */
>>>>>>>>>>>>>               __u8    stall_reason;
>>>>>>>>>>>>>               __u16   pad;
>>>>>>>>>>>>>        };
>>>>>>>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>>>>>>> It's not really 1:1; we don't have these separations of stages
>>>>>>>>>> and reasons.  We have, for example, "missed in L2 cache".
>>>>>>>>>> So IBS output is flatter, with more cycle latency figures than
>>>>>>>>>> IBM's, AFAICT.
>>>>>>>>> AMD IBS captures pipeline latency data in the case of Fetch
>>>>>>>>> sampling, like the fetch latency, tag-to-retire latency,
>>>>>>>>> completion-to-retire latency, and so on. Yes, Op sampling does
>>>>>>>>> provide more load/store-centric information, but it also captures
>>>>>>>>> more detailed data for branch instructions. And we also looked at
>>>>>>>>> Arm SPE, which likewise captures detailed pipeline data and
>>>>>>>>> latency information.
>>>>>>>>>
>>>>>>>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>>>>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>>>>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>>>>>>>> bad events.  IBS' PPR description has one occurrence of the word
>>>>>>>>>> 'stall', and none of 'penalty'.  The way I read IBS, it's just
>>>>>>>>>> reporting more sample data than the precise IP: things like hits,
>>>>>>>>>> misses, cycle latencies, addresses, types, etc., so words like
>>>>>>>>>> 'extended', or the 'auxiliary' already used today, are more
>>>>>>>>>> appropriate for IBS, although I'm the last person to bikeshed.
>>>>>>>>> We are thinking of using the word "pipeline" instead of "hazard".
>>>>>>>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>>>>>>> NP. We thought 'pipeline' was a generic hw term, so we proposed
>>>>>>> that word. We are open to any term that is generic enough.
>>>>>>>
>>>>>>>> I realize there are a couple of core pipeline-specific pieces
>>>>>>>> of information coming out of it, but the vast majority
>>>>>>>> are addresses, latencies of various components in the memory
>>>>>>>> hierarchy, and various component hit/miss bits.
>>>>>>> Yes, we should capture core-pipeline-specific details. For example,
>>>>>>> IBS generates branch unit information (IbsOpData1) and icache-related
>>>>>>> data (IbsFetchCtl), which is something that shouldn't be grafted onto
>>>>>>> perf-mem, IMO.
>>>>>> Sure, IBS Op-side output is more 'perf mem' friendly, and so it
>>>>>> should populate perf_mem_data_src fields, just like POWER9 can:
>>>>>>
>>>>>> union perf_mem_data_src {
>>>>>> ...
>>>>>>                    __u64   mem_rsvd:24,
>>>>>>                            mem_snoopx:2,   /* snoop mode, ext */
>>>>>>                            mem_remote:1,   /* remote */
>>>>>>                            mem_lvl_num:4,  /* memory hierarchy level number */
>>>>>>                            mem_dtlb:7,     /* tlb access */
>>>>>>                            mem_lock:2,     /* lock instr */
>>>>>>                            mem_snoop:5,    /* snoop mode */
>>>>>>                            mem_lvl:14,     /* memory hierarchy level */
>>>>>>                            mem_op:5;       /* type of opcode */
>>>>>>
>>>>>>
>>>>>> E.g., SIER[LDST] and SIER[A_XLATE_SRC] can be used to populate
>>>>>> mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op' and
>>>>>> 'mem_lock', and the Reload Bus Source Encoding bits can
>>>>>> be used to populate mem_snoop, right?
>>>>> Hi Kim,
>>>>>
>>>>> Yes, we do expose this data as part of perf-mem for POWER.
>>>> OK, I see the relevant PERF_MEM_S bits in
>>>> arch/powerpc/perf/isa207-common.c:isa207_find_source now, thanks.
>>>>
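
[Editor's note: for readers unfamiliar with that file,
isa207_find_source composes the data_src word with the PERF_MEM_S()
helper from the uapi perf_event.h; a minimal sketch of that composition
pattern follows. The particular values (a load hitting in local L2) are
chosen purely for illustration, not decoded from any real SIER
contents.]

  #include <linux/perf_event.h>  /* uapi: PERF_MEM_S(), PERF_MEM_* values */

  /*
   * Sketch: compose a perf_mem_data_src word the way
   * isa207_find_source() does.  Values are illustrative only.
   */
  static __u64 example_data_src(void)
  {
          return PERF_MEM_S(OP, LOAD) |
                 PERF_MEM_S(LVL, HIT) |
                 PERF_MEM_S(LVL, L2)  |
                 PERF_MEM_S(SNOOP, NONE);
  }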
>>>>>> For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
>>>>>> used for the ld/st target addresses, too.
>>>>>>
>>>>>>>> What's needed here is vendor-specific extended sample
>>>>>>>> information that all these technologies gather, of which
>>>>>>>> things like 'L1 TLB cycle latency' we all should have in
>>>>>>>> common.
>>>>>>> Yes. We will include fields to capture the latency cycles (like issue
>>>>>>> latency, instruction completion latency, etc.) along with other
>>>>>>> pipeline details in the proposed structure.
>>>>>> Latency figures are just an example, and from what I
>>>>>> can tell, struct perf_sample_data already has a 'weight' member,
>>>>>> used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
>>>>>> transfer memory access latency figures.  Granted, that's
>>>>>> a bad name given all other vendors don't call latency
>>>>>> 'weight'.
>>>>>>
>>>>>> I didn't see any latency figures coming out of POWER9,
>>>>>> and do not expect this patchseries to implement those
>>>>>> of other vendors, e.g., AMD's IBS; leave each vendor
>>>>>> to amend perf to suit their own h/w output please.
>>>>> The reference structure proposed in this patchset did not have members
>>>>> to capture latency info for that exact reason. But the idea here is to
>>>>> abstract away as much vendor-specific detail as possible. So if we
>>>>> include a u16 array, this format can also capture data from IBS, since
>>>>> it provides a few latency details.
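
[Editor's note: a hypothetical sketch of the u16-array idea floated
above; the array length, the lat_valid mask, and all field names are
invented for illustration and appear nowhere in the patchset.]

  struct perf_pipeline_haz_data_v2 {          /* hypothetical */
          __u8    itype;
          __u8    icache;
          __u8    hazard_stage;
          __u8    hazard_reason;
          __u8    stall_stage;
          __u8    stall_reason;
          __u16   lat_valid;  /* bitmask: which lat[] entries are set */
          __u16   lat[8];     /* vendor-defined latency counters, e.g.
                                 IBS tag-to-retire or DC miss latency */
  };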
>>>> OK, that sounds a bit different from the 6 x u8's + 1 u16 padded
>>>> struct presented in this patchset.
>>>>
>>>> IBS Ops can report e.g.:
>>>>
>>>> 15 tag-to-retire cycles bits,
>>>> 15 completion to retire count bits,
>>>> 15 L1 DTLB refill latency bits,
>>>> 15 DC miss latency bits,
>>>> 5 outstanding memory requests on mem refill bits, and so on.
>>>>
>>>> IBS Fetch reports 15 bits of fetch latency, and another 16
>>>> for iTLB latency, among others.
>>>>
>>>> Some of these may/may not be valid simultaneously, and
>>>> there are IBS specific rules to establish validity.
>>>>
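
[Editor's note: a sketch of what validity-gated decoding of such fields
can look like. The register name echoes the discussion, but the bit
offsets and the validity bit below are placeholders, not the layout
documented in AMD's PPR.]

  #include <linux/types.h>

  /* PLACEHOLDER offsets/masks -- not the PPR-documented layout. */
  #define EX_TAG_TO_RET_SHIFT     16
  #define EX_TAG_TO_RET_MASK      0x7fff
  #define EX_OP_DATA_VALID        (1ULL << 0)

  /* Consume a latency field only when the validity rule allows it. */
  static int example_tag_to_retire(__u64 ibs_op_data, __u16 *cycles)
  {
          if (!(ibs_op_data & EX_OP_DATA_VALID))
                  return -1;      /* field undefined for this sample */

          *cycles = (ibs_op_data >> EX_TAG_TO_RET_SHIFT) & EX_TAG_TO_RET_MASK;
          return 0;
  }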
>>>>>> My main point there, however, was that each vendor should
>>>>>> use streamlined record-level code to just copy the data
>>>>>> in the proprietary format that their hardware produces,
>>>>>> and then perf tooling can synthesize the events
>>>>>> from the raw data at report/script/etc. time.
>>>>>>
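
[Editor's note: a minimal sketch of that record-level copy, modeled on
the PERF_SAMPLE_RAW path in the existing AMD IBS driver
(arch/x86/events/amd/ibs.c). The wrapper and its hw_buf/hw_size
parameters are hypothetical stand-ins for whatever the vendor PMI
handler collected.]

  /* Hand the vendor's raw hardware bytes to the sample unmodified. */
  static void example_push_raw(struct perf_event *event,
                               struct perf_sample_data *data,
                               struct pt_regs *regs,
                               void *hw_buf, u32 hw_size)
  {
          struct perf_raw_record raw;

          if (event->attr.sample_type & PERF_SAMPLE_RAW) {
                  raw = (struct perf_raw_record){
                          .frag = {
                                  .size = hw_size,   /* raw register bytes */
                                  .data = hw_buf,
                          },
                  };
                  data->raw = &raw;
          }
          perf_event_overflow(event, data, regs);
  }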
>>>>>>>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>>>>>>>> either.  Can we use PERF_SAMPLE_AUX instead?
>>>>>>> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended
>>>>>>> for cases where a large volume of data needs to be captured as part
>>>>>>> of perf.data without frequent PMIs. But the proposed type is to
>>>>>>> address the capture of pipeline
>>>>>> SAMPLE_AUX shouldn't care whether the volume is large or how frequent
>>>>>> PMIs are, even though it may be used in those environments.
>>>>>>
>>>>>>> information on each sample using PMIs at periodic intervals. Hence we
>>>>>>> are proposing PERF_SAMPLE_PIPELINE_HAZ.
>>>>>> And that's fine for any extra bits that POWER9 has to convey
>>>>>> to its users beyond things already represented by other sample
>>>>>> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
>>>>>> and other vendors' (e.g., AMD IBS) data can be made vendor-independent
>>>>>> at record time by using SAMPLE_AUX, or even SAMPLE_RAW, which is
>>>>>> what IBS currently uses.
>>>>> My bad, I'm not sure what you mean by this. We are trying to abstract
>>>>> as much vendor-specific data as possible with this (like perf-mem).
>>>> Perhaps if I say it this way: instead of doing all the
>>>> isa207_get_phazard_data() work past the mfspr(SPRN_SIER)
>>>> in patch 4/11, just put the raw SIER value in a
>>>> PERF_SAMPLE_RAW or _AUX event, and call perf_event_update_userpage.
>>>> Specific SIER capabilities can be written as part of the perf.data
>>>> header.  Then synthesize the true pipe events from the raw SIER
>>>> values later, in userspace.
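
[Editor's note: a sketch of the userspace half of that suggestion. The
EX_* shift/mask values and the stage table are placeholders invented
for illustration, not the POWER9 SIER encoding; a real tool would take
the layout from the SIER capabilities written into the perf.data
header, as described above.]

  #include <linux/types.h>

  /* PLACEHOLDER layout -- not the documented POWER9 SIER encoding. */
  #define EX_SIER_STAGE_SHIFT  8
  #define EX_SIER_STAGE_MASK   0xfULL

  static const char * const ex_stage_name[] = {
          "FETCH", "DECODE", "ISU", "LSU", "FXU", "BRU",
  };

  /* Synthesize a human-readable pipe stage from a raw SIER value. */
  static const char *sier_stage(__u64 sier)
  {
          unsigned int s = (sier >> EX_SIER_STAGE_SHIFT) & EX_SIER_STAGE_MASK;

          if (s >= sizeof(ex_stage_name) / sizeof(ex_stage_name[0]))
                  return "UNKNOWN";
          return ex_stage_name[s];
  }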
>>> Hi Kim,
>>>
>>> Would like to stay away from the SAMPLE_RAW type because of these
>>> comments in perf_event.h:
>>>
>>> *      #
>>> *      # The RAW record below is opaque data wrt the ABI
>>> *      #
>>> *      # That is, the ABI doesn't make any promises wrt to
>>> *      # the stability of its content, it may vary depending
>>> *      # on event, hardware, kernel version and phase of
>>> *      # the moon.
>>> *      #
>>> *      # In other words, PERF_SAMPLE_RAW contents are not an ABI.
>>> *      #
>> The "it may vary depending on ... hardware" clause makes it sound
>> appropriate for the use-case where the raw hardware register contents
>> are copied directly into the user buffer.
>
>
> Hi Kim,
>
> Sorry for the delayed response.
>
> But the perf tool side needs infrastructure to handle raw sample
> data from the cpu-pmu (PERF_SAMPLE_RAW is currently used by tracepoints).
> I am not sure whether this is the approach we should take here.
>
> peterz any comments?
>
>>
>>> Secondly, sorry, I didn't understand your suggestion about using
>>> PERF_SAMPLE_AUX. IIUC, SAMPLE_AUX will go to the AUX ring buffer, which
>>> costs more memory and is more challenging when correlating and presenting
>>> the pipeline details for each IP. IMO, having a new sample type can be
>>> useful to capture the pipeline data both in perf_sample_data and, if _AUX
>>> is enabled, it can be made to push to the AUX buffer.
>> OK, I didn't think SAMPLE_AUX and the aux ring buffer were
>> interdependent, sorry.
>>
>> Thanks,
>>
>> Kim
>