[v3,7/8] perf: Define PMU_TXN_READ interface

Message ID 1436929315-28520-8-git-send-email-sukadev@linux.vnet.ibm.com (mailing list archive)
State Superseded
Commit Message

Sukadev Bhattiprolu July 15, 2015, 3:01 a.m. UTC
Define a new PERF_PMU_TXN_READ interface to read a group of counters
at once. Note that we use this interface with all PMUs.

PMUs that implement this interface use the ->read() operation to _queue_
the counters to be read and use ->commit_txn() to actually read all the
queued counters at once.

PMUs that don't implement PERF_PMU_TXN_READ ignore ->start_txn() and
->commit_txn() and continue to read counters one at a time.
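
The queue-then-commit flow described above can be modelled in plain userspace C. This is only a sketch, not the kernel implementation: struct mock_pmu, struct mock_event and every mock_* function are hypothetical stand-ins, and nr_hw_reads merely counts how often the pretend hardware is accessed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TXN_NONE   0x0
#define TXN_READ   0x2	/* mirrors PERF_PMU_TXN_READ from this patch */
#define MAX_QUEUED 8	/* arbitrary limit for this sketch */

struct mock_event {
	uint64_t hw_value;	/* pretend hardware counter */
	uint64_t count;		/* value visible to the reader */
};

struct mock_pmu {
	unsigned int txn_flags;
	struct mock_event *queued[MAX_QUEUED];
	size_t nr_queued;
	unsigned int nr_hw_reads;	/* number of pretend hardware accesses */
};

static void mock_start_txn(struct mock_pmu *pmu, unsigned int flags)
{
	pmu->txn_flags = flags;
	pmu->nr_queued = 0;
}

/* Inside a READ transaction only queue the event; otherwise read it now. */
static void mock_read(struct mock_pmu *pmu, struct mock_event *event)
{
	if (pmu->txn_flags == TXN_READ) {
		assert(pmu->nr_queued < MAX_QUEUED);
		pmu->queued[pmu->nr_queued++] = event;
		return;
	}
	event->count = event->hw_value;
	pmu->nr_hw_reads++;
}

/* Read every queued event with a single pretend hardware access. */
static int mock_commit_txn(struct mock_pmu *pmu)
{
	size_t i;

	if (pmu->txn_flags == TXN_READ && pmu->nr_queued) {
		for (i = 0; i < pmu->nr_queued; i++)
			pmu->queued[i]->count = pmu->queued[i]->hw_value;
		pmu->nr_hw_reads++;
	}
	pmu->txn_flags = TXN_NONE;
	pmu->nr_queued = 0;
	return 0;
}

/* Same shape as perf_event_read_group(): start, read each event, commit. */
static int mock_read_group(struct mock_pmu *pmu, struct mock_event *events,
			   size_t n)
{
	size_t i;

	mock_start_txn(pmu, TXN_READ);
	for (i = 0; i < n; i++)
		mock_read(pmu, &events[i]);
	return mock_commit_txn(pmu);
}
```

Calling mock_read_group() on three events fills in all three counts with one pretend hardware access, whereas calling mock_read() outside a transaction costs one access per event.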

Thanks to input from Peter Zijlstra.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 include/linux/perf_event.h |    1 +
 kernel/events/core.c       |   35 +++++++++++++++++++++++++++++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)

Comments

Peter Zijlstra July 16, 2015, 10:20 p.m. UTC | #1
On Tue, Jul 14, 2015 at 08:01:54PM -0700, Sukadev Bhattiprolu wrote:
> +/*
> + * Use the transaction interface to read the group of events in @leader.
> + * PMUs like the 24x7 counters in Power, can use this to queue the events
> + * in the ->read() operation and perform the actual read in ->commit_txn.
> + *
> + * Other PMUs can ignore the ->start_txn and ->commit_txn and read each
> + * PMU directly in the ->read() operation.
> + */
> +static int perf_event_read_group(struct perf_event *leader)
> +{
> +	int ret;
> +	struct perf_event *sub;
> +	struct pmu *pmu;
> +
> +	pmu = leader->pmu;
> +
> +	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
> +
> +	perf_event_read(leader);

There should be a lockdep assert with that list iteration.

> +	list_for_each_entry(sub, &leader->sibling_list, group_entry)
> +		perf_event_read(sub);
> +
> +	ret = pmu->commit_txn(pmu);
> +
> +	return ret;
> +}
Sukadev Bhattiprolu July 22, 2015, 1:50 a.m. UTC | #2
Peter Zijlstra [peterz@infradead.org] wrote:
| On Tue, Jul 14, 2015 at 08:01:54PM -0700, Sukadev Bhattiprolu wrote:
| > +/*
| > + * Use the transaction interface to read the group of events in @leader.
| > + * PMUs like the 24x7 counters in Power, can use this to queue the events
| > + * in the ->read() operation and perform the actual read in ->commit_txn.
| > + *
| > + * Other PMUs can ignore the ->start_txn and ->commit_txn and read each
| > + * PMU directly in the ->read() operation.
| > + */
| > +static int perf_event_read_group(struct perf_event *leader)
| > +{
| > +	int ret;
| > +	struct perf_event *sub;
| > +	struct pmu *pmu;
| > +
| > +	pmu = leader->pmu;
| > +
| > +	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
| > +
| > +	perf_event_read(leader);
| 
| There should be a lockdep assert with that list iteration.
| 
| > +	list_for_each_entry(sub, &leader->sibling_list, group_entry)
| > +		perf_event_read(sub);
| > +
| > +	ret = pmu->commit_txn(pmu);

Peter,

I have a situation :-)

We are trying to use the following interface:

	pmu->start_txn(pmu, PERF_PMU_TXN_READ);

	perf_event_read(leader);
	list_for_each_entry(sibling, &leader->sibling_list, group_entry)
		perf_event_read(sibling);

	pmu->commit_txn(pmu);

with the idea that the PMU driver would save the type of transaction in
->start_txn() and use it in ->read() and ->commit_txn().

But since ->start_txn() and the ->read() operations could happen on different
CPUs (perf_event_read() uses the event->oncpu to schedule a call), the PMU
driver cannot use a per-cpu variable to save the state in ->start_txn().

I tried using a pmu-wide global, but that would also need us to hold a mutex
to serialize access to that global. The problem is that ->start_txn() can be
called from interrupt context for TXN_ADD transactions, where sleeping on a
mutex is not allowed (I got the following backtrace during testing):

	mutex_lock_nested+0x504/0x520 (unreliable)
	h_24x7_event_start_txn+0x3c/0xd0
	group_sched_in+0x70/0x230
	ctx_sched_in.isra.63+0x150/0x230
	__perf_install_in_context+0x1c8/0x1e0
	remote_function+0x7c/0xa0
	flush_smp_call_function_queue+0xb0/0x1d0
	smp_ipi_demux+0x88/0xf0
	icp_hv_ipi_action+0x54/0xc0
	handle_irq_event_percpu+0x98/0x2b0
	handle_percpu_irq+0x7c/0xc0
	generic_handle_irq+0x4c/0x80
	__do_irq+0x7c/0x190
	call_do_irq+0x14/0x24
	do_IRQ+0x8c/0x100
	hardware_interrupt_common+0x168/0x180
	--- interrupt: 501 at .plpar_hcall_norets+0x14/0x20

I am basically stuck trying to save the txn type in ->start_txn() and retrieve
it in ->read().

Couple of options I can think of are:

	- having ->start_txn() return a handle that should then be passed in
	  with ->read() (yuck) and ->commit_txn().

	- serialize the READ transaction for the PMU in perf_event_read_group()
	  with a new pmu->txn_mutex:

		mutex_lock(&pmu->txn_mutex);

		pmu->start_txn(pmu, PERF_PMU_TXN_READ);
		list_for_each_entry(sub, &leader->sibling_list, group_entry)
			perf_event_read(sub);

		ret = pmu->commit_txn(pmu);

		mutex_unlock(&pmu->txn_mutex);

	  Such serialization would be OK with 24x7 counters (they are system-wide
	  counters anyway). We could maybe skip the mutex for PMUs that don't
	  implement the TXN_READ interface.

Or is there a better way?

Sukadev
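
The first option in the message above (a handle returned by ->start_txn() and passed to ->read() and ->commit_txn()) can be sketched in userspace C as follows. Everything here is an assumption made for illustration: struct mock_txn and the mock_* functions are hypothetical and do not match any real kernel signature. The point is only that the transaction state travels with the caller, so the driver needs neither a per-cpu variable nor a mutex-protected global.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TXN_READ   0x2	/* mirrors PERF_PMU_TXN_READ from the patch */
#define MAX_QUEUED 8	/* arbitrary limit for this sketch */

struct mock_event {
	uint64_t hw_value;	/* pretend hardware counter */
	uint64_t count;		/* value visible to the reader */
};

/* Hypothetical handle: all transaction state lives here, owned by the caller. */
struct mock_txn {
	unsigned int flags;
	struct mock_event *queued[MAX_QUEUED];
	size_t nr_queued;
};

static void mock_start_txn(struct mock_txn *txn, unsigned int flags)
{
	assert(txn != NULL);
	txn->flags = flags;
	txn->nr_queued = 0;
}

/* Queue the event on the caller's handle, or read it directly. */
static int mock_read(struct mock_txn *txn, struct mock_event *event)
{
	if (txn->flags != TXN_READ) {
		event->count = event->hw_value;
		return 0;
	}
	if (txn->nr_queued == MAX_QUEUED)
		return -1;	/* queue full */
	txn->queued[txn->nr_queued++] = event;
	return 0;
}

/* Read everything queued on this handle, then reset the handle. */
static int mock_commit_txn(struct mock_txn *txn)
{
	size_t i;

	for (i = 0; i < txn->nr_queued; i++)
		txn->queued[i]->count = txn->queued[i]->hw_value;
	txn->flags = 0;
	txn->nr_queued = 0;
	return 0;
}
```

Two handles started at the same time do not interfere with each other, which is what a shared global could not guarantee without locking. The cost is the one already noted above: every ->read() call has to carry the extra argument.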
Peter Zijlstra July 22, 2015, 5:55 a.m. UTC | #3
On Tue, Jul 21, 2015 at 06:50:45PM -0700, Sukadev Bhattiprolu wrote:
> We are trying to use the following interface:
> 
> 	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
> 
> 	perf_event_read(leader);
> 	list_for_each_entry(sibling, &leader->sibling_list, group_entry)
> 		perf_event_read(sibling);
> 
> 	pmu->commit_txn(pmu);
> 
> with the idea that the PMU driver would save the type of transaction in
> ->start_txn() and use it in ->read() and ->commit_txn().
> 
> But since ->start_txn() and the ->read() operations could happen on different
> CPUs (perf_event_read() uses the event->oncpu to schedule a call), the PMU
> driver cannot use a per-cpu variable to save the state in ->start_txn().

> Or is there a better way?


I've not woken up yet, and not actually fully read the email, but can
you stuff the entire above chunk inside the IPI?

I think you could then actually optimize __perf_event_read() as well,
because all these events should be on the same context, so no point in
calling update_*time*() for every event or so.
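
A userspace sketch of the suggestion above: if the whole start/read/commit sequence runs inside one cross-CPU call, every step sees the same per-CPU slot, so the driver can keep the transaction type in per-CPU state. All names are hypothetical. run_on_cpu() only imitates a synchronous cross-CPU call such as smp_call_function_single() by switching a current_cpu variable, and per_cpu_state[] stands in for real per-CPU data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_MOCK_CPUS 4
#define TXN_READ     0x2	/* mirrors PERF_PMU_TXN_READ from the patch */
#define MAX_QUEUED   8		/* arbitrary limit for this sketch */

struct mock_event {
	uint64_t hw_value;	/* pretend hardware counter */
	uint64_t count;		/* value visible to the reader */
};

struct cpu_txn_state {
	unsigned int txn_flags;
	struct mock_event *queued[MAX_QUEUED];
	size_t nr_queued;
};

static struct cpu_txn_state per_cpu_state[NR_MOCK_CPUS];
static int current_cpu;	/* stand-in for the CPU the code is running on */

static struct cpu_txn_state *this_cpu_state(void)
{
	return &per_cpu_state[current_cpu];
}

static void mock_start_txn(unsigned int flags)
{
	this_cpu_state()->txn_flags = flags;
	this_cpu_state()->nr_queued = 0;
}

static void mock_read(struct mock_event *event)
{
	struct cpu_txn_state *st = this_cpu_state();

	if (st->txn_flags == TXN_READ) {
		assert(st->nr_queued < MAX_QUEUED);
		st->queued[st->nr_queued++] = event;
	} else {
		event->count = event->hw_value;
	}
}

static int mock_commit_txn(void)
{
	struct cpu_txn_state *st = this_cpu_state();
	size_t i;

	for (i = 0; i < st->nr_queued; i++)
		st->queued[i]->count = st->queued[i]->hw_value;
	st->txn_flags = 0;
	st->nr_queued = 0;
	return 0;
}

struct read_group_args {
	struct mock_event *events;
	size_t n;
	int ret;
};

/* The entire transaction runs here, i.e. on a single CPU. */
static void read_group_on_this_cpu(void *info)
{
	struct read_group_args *args = info;
	size_t i;

	mock_start_txn(TXN_READ);
	for (i = 0; i < args->n; i++)
		mock_read(&args->events[i]);
	args->ret = mock_commit_txn();
}

/* Imitates a synchronous cross-CPU function call. */
static void run_on_cpu(int cpu, void (*fn)(void *), void *info)
{
	int prev = current_cpu;

	current_cpu = cpu;
	fn(info);
	current_cpu = prev;
}

static int mock_read_group(int cpu, struct mock_event *events, size_t n)
{
	struct read_group_args args = { events, n, -1 };

	run_on_cpu(cpu, read_group_on_this_cpu, &args);
	return args.ret;
}
```

With this shape there is one cross-CPU call per group instead of one per event, and the state saved by the start step is guaranteed to be the state seen by the read and commit steps.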
Patch

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 44bf05f..da307ad 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -169,6 +169,7 @@  struct perf_event;
 #define PERF_EVENT_TXN 0x1
 
 #define PERF_PMU_TXN_ADD  0x1		/* txn to add/schedule event on PMU */
+#define PERF_PMU_TXN_READ 0x2		/* txn to read event group from PMU */
 
 /**
  * pmu::capabilities flags
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a83d45c..2ea06c4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3763,6 +3763,33 @@  static void orphans_remove_work(struct work_struct *work)
 	put_ctx(ctx);
 }
 
+/*
+ * Use the transaction interface to read the group of events in @leader.
+ * PMUs like the 24x7 counters in Power can use this to queue the events
+ * in the ->read() operation and perform the actual read in ->commit_txn().
+ *
+ * Other PMUs can ignore ->start_txn() and ->commit_txn() and read each
+ * event directly in the ->read() operation.
+ */
+static int perf_event_read_group(struct perf_event *leader)
+{
+	int ret;
+	struct perf_event *sub;
+	struct pmu *pmu;
+
+	pmu = leader->pmu;
+
+	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
+
+	perf_event_read(leader);
+	list_for_each_entry(sub, &leader->sibling_list, group_entry)
+		perf_event_read(sub);
+
+	ret = pmu->commit_txn(pmu);
+
+	return ret;
+}
+
 u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
 {
 	u64 total = 0;
@@ -3792,7 +3819,11 @@  static int perf_read_group(struct perf_event *event,
 
 	lockdep_assert_held(&ctx->mutex);
 
-	count = perf_event_read_value(leader, &enabled, &running);
+	ret = perf_event_read_group(leader);
+	if (ret)
+		return ret;
+
+	count = perf_event_compute(leader, &enabled, &running);
 
 	values[n++] = 1 + leader->nr_siblings;
 	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
@@ -3813,7 +3844,7 @@  static int perf_read_group(struct perf_event *event,
 	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
 		n = 0;
 
-		values[n++] = perf_event_read_value(sub, &enabled, &running);
+		values[n++] = perf_event_compute(sub, &enabled, &running);
 		if (read_format & PERF_FORMAT_ID)
 			values[n++] = primary_event_id(sub);