amdgcn: Tune default OpenMP/OpenACC GPU utilization

Message ID: 87lezfskrd.fsf@euler.schwinge.homeip.net
State: New

Headers

From: Thomas Schwinge <thomas@codesourcery.com>
To: <gcc-patches@gcc.gnu.org>
Subject: amdgcn: Tune default OpenMP/OpenACC GPU utilization
In-Reply-To: <08b8cdb2-11ef-1ceb-efc2-b8495bda6bef@codesourcery.com>
References: <08b8cdb2-11ef-1ceb-efc2-b8495bda6bef@codesourcery.com>
User-Agent: Notmuch/0.29.3+94~g74c3f1b (https://notmuchmail.org) Emacs/27.1
 (x86_64-pc-linux-gnu)
Date: Sun, 16 Jan 2022 17:33:42 +0100
Message-ID: <87lezfskrd.fsf@euler.schwinge.homeip.net>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
Precedence: list
Cc: Andrew Stubbs <ams@codesourcery.com>,
 Kwok Cheung Yeung <kcy@codesourcery.com>,
 Tobias Burnus <tobias@codesourcery.com>
Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org
Sender: "Gcc-patches"
 <gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org>

Series: amdgcn: Tune default OpenMP/OpenACC GPU utilization

Commit Message

Thomas Schwinge Jan. 16, 2022, 4:33 p.m. UTC

Hi!

On 2020-07-15T21:49:11+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
> This patch tunes the default GPU thread count for OpenMP and OpenACC on
> AMD GCN devices. It chooses a sensible default if no attributes are
> given at all, increases the number of OpenACC gangs if only one worker
> per gang is specified, and increases the number of workers otherwise.
> The tuning is still a work in progress as we fix issues that limit
> occupancy.
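
As a reading aid, the OpenACC half of the tuning described above can be sketched as a stand-alone function. This is only an illustration modelled on the gcn_exec hunk in the attached patch, not the libgomp code itself: the name pick_oacc_dims and the cu_count parameter (standing in for get_cu_count (kernel->agent)) are assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical helper, not part of libgomp: mirrors the defaults the patch
   adds to gcn_exec.  *gangs and *workers are in/out; 0 means "unspecified".
   cu_count stands in for get_cu_count (kernel->agent).  */
void
pick_oacc_dims (int cu_count, int *gangs, int *workers)
{
  assert (cu_count > 0 && *gangs >= 0 && *workers >= 0);

  if (*gangs == 0 && *workers == 0)
    {
      /* Nothing specified: 4 gangs per CU with 8 workers each,
	 i.e. 32 workers per CU.  */
      *gangs = cu_count * 4;
      *workers = 8;
    }
  else if (*gangs == 0)
    {
      /* Workers given: scale the gangs so that gangs * workers is about
	 32 per CU.  Integer division, as in the patch.  */
      *gangs = cu_count * (32 / *workers);
    }
  else if (*workers == 0)
    {
      /* Gangs given: scale the workers, clamped to the range 1..16.  */
      *workers = cu_count * 32 / *gangs;
      if (*workers == 0)
	*workers = 1;
      if (*workers > 16)
	*workers = 16;
    }
  /* Both specified: leave the request untouched.  */
}
```

With cu_count = 60 and nothing specified, the sketch picks 240 gangs of 8 workers; with only one worker per gang requested, it picks 1920 gangs.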

Pushed in commit a78b1ab1df9ca44acc5638e8f9d0ae2e62bd65ed
"amdgcn: Tune default OpenMP/OpenACC GPU utilization", see attached.


Tobias, this should've unblocked your
"[wwwdocs] gcc-12/changes.html (GCN): >1 workers per gang"; see
<http://mid.mail-archive.com/87a6lhpwlq.fsf@euler.schwinge.homeip.net>.


Regards
 Thomas



From a78b1ab1df9ca44acc5638e8f9d0ae2e62bd65ed Mon Sep 17 00:00:00 2001
From: Kwok Cheung Yeung <kcy@codesourcery.com>
Date: Thu, 29 Aug 2019 10:16:42 -0700
Subject: [PATCH] amdgcn: Tune default OpenMP/OpenACC GPU utilization

	libgomp/
	* plugin/plugin-gcn.c (parse_target_attributes): Automatically set
	the number of teams and threads if necessary.
	(gcn_exec): Automatically set the number of gangs and workers if
	necessary.

Co-Authored-By: Andrew Stubbs  <ams@codesourcery.com>
---
 libgomp/plugin/plugin-gcn.c | 82 ++++++++++++++++++++++++++++++-------
 1 file changed, 67 insertions(+), 15 deletions(-)

diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index d0f05b28bf3..f305d726874 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -1219,24 +1219,55 @@  parse_target_attributes (void **input,
 
   if (gcn_dims_found)
     {
+      bool gfx900_workaround_p = false;
+
       if (agent->device_isa == EF_AMDGPU_MACH_AMDGCN_GFX900
 	  && gcn_threads == 0 && override_z_dim == 0)
 	{
-	  gcn_threads = 4;
+	  gfx900_workaround_p = true;
 	  GCN_WARNING ("VEGA BUG WORKAROUND: reducing default number of "
-		       "threads to 4 per team.\n");
+		       "threads to at most 4 per team.\n");
 	  GCN_WARNING (" - If this is not a Vega 10 device, please use "
 		       "GCN_NUM_THREADS=16\n");
 	}
 
+      /* Ideally, when a dimension isn't explicitly specified, we should
+	 tune it to run 40 (or 32?) threads per CU with no threads getting queued.
+	 In practice, we tune for peak performance on BabelStream, which
+	 for OpenACC is currently 32 threads per CU.  */
       def->ndim = 3;
-      /* Fiji has 64 CUs, but Vega20 has 60.  */
-      def->gdims[0] = (gcn_teams > 0) ? gcn_teams : get_cu_count (agent);
-      /* Each thread is 64 work items wide.  */
-      def->gdims[1] = 64;
-      /* A work group can have 16 wavefronts.  */
-      def->gdims[2] = (gcn_threads > 0) ? gcn_threads : 16;
-      def->wdims[0] = 1; /* Single team per work-group.  */
+      if (gcn_teams <= 0 && gcn_threads <= 0)
+	{
+	  /* Set up a reasonable number of teams and threads.  */
+	  gcn_threads = gfx900_workaround_p ? 4 : 16; // 8;
+	  def->gdims[0] = get_cu_count (agent); // * (40 / gcn_threads);
+	  def->gdims[2] = gcn_threads;
+	}
+      else if (gcn_teams <= 0 && gcn_threads > 0)
+	{
+	  /* Auto-scale the number of teams with the number of threads.  */
+	  def->gdims[0] = get_cu_count (agent); // * (40 / gcn_threads);
+	  def->gdims[2] = gcn_threads;
+	}
+      else if (gcn_teams > 0 && gcn_threads <= 0)
+	{
+	  int max_threads = gfx900_workaround_p ? 4 : 16;
+
+	  /* Auto-scale the number of threads with the number of teams.  */
+	  def->gdims[0] = gcn_teams;
+	  def->gdims[2] = 16; // get_cu_count (agent) * 40 / gcn_teams;
+	  if (def->gdims[2] == 0)
+	    def->gdims[2] = 1;
+	  else if (def->gdims[2] > max_threads)
+	    def->gdims[2] = max_threads;
+	}
+      else
+	{
+	  def->gdims[0] = gcn_teams;
+	  def->gdims[2] = gcn_threads;
+	}
+      def->gdims[1] = 64; /* Each thread is 64 work items wide.  */
+      def->wdims[0] = 1;  /* Single team per work-group.  */
       def->wdims[1] = 64;
       def->wdims[2] = 16;
       *result = def;
@@ -3031,13 +3062,34 @@  gcn_exec (struct kernel_info *kernel, size_t mapnum, void **hostaddrs,
   if (hsa_kernel_desc->oacc_dims[2] > 0)
     dims[2] = hsa_kernel_desc->oacc_dims[2];
 
-  /* If any of the OpenACC dimensions remain 0 then we get to pick a number.
-     There isn't really a correct answer for this without a clue about the
-     problem size, so let's do a reasonable number of single-worker gangs.
-     64 gangs matches a typical Fiji device.  */
+  /* Ideally, when a dimension isn't explicitly specified, we should
+     tune it to run 40 (or 32?) threads per CU with no threads getting queued.
+     In practice, we tune for peak performance on BabelStream, which
+     for OpenACC is currently 32 threads per CU.  */
+  if (dims[0] == 0 && dims[1] == 0)
+    {
+      /* If any of the OpenACC dimensions remain 0 then we get to pick a
+	 number.  There isn't really a correct answer for this without a clue
+	 about the problem size, so let's do a reasonable number of workers
+	 and gangs.  */
 
-  if (dims[0] == 0) dims[0] = get_cu_count (kernel->agent); /* Gangs.  */
-  if (dims[1] == 0) dims[1] = 16; /* Workers.  */
+      dims[0] = get_cu_count (kernel->agent) * 4; /* Gangs.  */
+      dims[1] = 8; /* Workers.  */
+    }
+  else if (dims[0] == 0 && dims[1] > 0)
+    {
+      /* Auto-scale the number of gangs with the requested number of workers.  */
+      dims[0] = get_cu_count (kernel->agent) * (32 / dims[1]);
+    }
+  else if (dims[0] > 0 && dims[1] == 0)
+    {
+      /* Auto-scale the number of workers with the requested number of gangs.  */
+      dims[1] = get_cu_count (kernel->agent) * 32 / dims[0];
+      if (dims[1] == 0)
+	dims[1] = 1;
+      if (dims[1] > 16)
+	dims[1] = 16;
+    }
 
   /* The incoming dimensions are expressed in terms of gangs, workers, and
      vectors.  The HSA dimensions are expressed in terms of "work-items",
-- 
2.34.1
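
For readers who want the OpenMP side without the diff markers, the four cases added to parse_target_attributes can be sketched as below. This is a simplified illustration under stated assumptions, not the plugin code: pick_omp_dims is a hypothetical name, cu_count stands in for get_cu_count (agent), and gfx900_workaround stands in for the patch's gfx900_workaround_p flag. The outputs correspond to def->gdims[0] (teams) and def->gdims[2] (threads).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not part of libgomp: mirrors the team/thread
   defaults in the parse_target_attributes hunk.  *teams and *threads are
   in/out; a value <= 0 means "unspecified".  */
void
pick_omp_dims (int cu_count, bool gfx900_workaround, int *teams, int *threads)
{
  /* The Vega 10 workaround limits the default to 4 threads per team.  */
  int max_threads = gfx900_workaround ? 4 : 16;

  assert (cu_count > 0);

  if (*teams <= 0 && *threads <= 0)
    {
      /* Nothing specified: one team per CU, default thread count.  */
      *teams = cu_count;
      *threads = max_threads;
    }
  else if (*teams <= 0)
    {
      /* Threads given: one team per CU, keep the requested threads.  */
      *teams = cu_count;
    }
  else if (*threads <= 0)
    {
      /* Teams given: start from 16 threads and apply the cap.  */
      *threads = 16;
      if (*threads > max_threads)
	*threads = max_threads;
    }
  /* Both specified: use the request as-is.  */
}
```

In the patch itself the workaround flag is only set when no thread count was requested, so it never overrides an explicit setting; the sketch has the same property because max_threads is not consulted when *threads is positive.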
