From patchwork Sat Nov 15 00:49:04 2014 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Julian Brown X-Patchwork-Id: 411055 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id BEA301400A0 for ; Sat, 15 Nov 2014 11:50:07 +1100 (AEDT) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:date :from:to:cc:subject:message-id:in-reply-to:references :mime-version:content-type; q=dns; s=default; b=O57UrpnDhJVutB3F en/QidMAzK/HQ+32ymQA7w2j0YSsGikkzaZlJcZFC6VdB6qIasH8O8XQuSaruHvz TnbzIPsaMNfJUTCnlkJDKimLA+llZIwkJFeQQXYZGTU7sg/N2Ft+J4bql0J/s/G7 6QqoCsh9GDtoxE1n4wLoL+DWQEQ= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:date :from:to:cc:subject:message-id:in-reply-to:references :mime-version:content-type; s=default; bh=0BK6XOh6cZy4LBhKYOpoR3 ppbnw=; b=YeaVn7PAQ6XQ3y8A7X7SB12hu5fs+myufotBy/jkU+wn0dvujRxUVf cxt5bDWbpdeVyl/K0WIC0GA437PqXtJ253AInQdkYFAFBBxue7/XR1bp2P8BYbBP veQeHHpK+tRJUu3hvDrObqRDw332bPLb9S5xjf+2GFVfleh7CIdxM= Received: (qmail 16800 invoked by alias); 15 Nov 2014 00:49:49 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 16781 invoked by uid 89); 15 Nov 2014 00:49:46 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=0.3 required=5.0 tests=AWL, BAYES_50, KAM_STOCKGEN, RCVD_IN_DNSWL_NONE autolearn=no version=3.3.2 X-HELO: relay1.mentorg.com Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Sat, 15 Nov 2014 00:49:28 +0000 Received: from nat-ies.mentorg.com ([192.94.31.2] helo=SVR-IES-FEM-01.mgc.mentorg.com) by relay1.mentorg.com with esmtp id 1XpRYH-00064X-1l from Julian_Brown@mentor.com ; Fri, 14 Nov 2014 16:49:24 -0800 Received: from octopus (137.202.0.76) by SVR-IES-FEM-01.mgc.mentorg.com (137.202.0.104) with Microsoft SMTP Server id 14.3.181.6; Sat, 15 Nov 2014 00:49:16 +0000 Date: Sat, 15 Nov 2014 00:49:04 +0000 From: Julian Brown To: Jakub Jelinek CC: , Thomas Schwinge , Ilya Verbin Subject: Re: [PATCH 1/5] OpenACC 2.0 support for libgomp - OpenACC runtime, NVidia PTX/CUDA plugin (repost) Message-ID: <20141115004904.55b16737@octopus> In-Reply-To: <20141112100626.GP5026@tucnak.redhat.com> References: <20140923191931.2177e60f@octopus> <20141111135323.29e0f27b@octopus> <20141112100626.GP5026@tucnak.redhat.com> MIME-Version: 1.0 X-IsSubscribed: yes Hi, On Wed, 12 Nov 2014 11:06:26 +0100 Jakub Jelinek wrote: > On Tue, Nov 11, 2014 at 01:53:23PM +0000, Julian Brown wrote: > > A few OpenMP tests fail with the new host_nonshm plugin (with > > failures of the form "libgomp: Trying to update > > [0x605820..0x605824) object that is not mapped"), probably because > > of middle-end bugs. I haven't investigated those in detail. > > Depends how exactly your host_nonshm plugin works. [...] > > One can always use the intelmicemul plugin to test nonshared-memory > stuff without any HW (provided the host is x86_64/i686), so do we > really need host_nonshm plugin? This is a new version of the patch with (hopefully) all these comments addressed. For now, I've left the host_nonshm plugin present, but removed the OpenMP-support capability from it, since it doesn't seem likely that we can support the required semantics for that in the short term. > > --- a/libgomp/configure.ac > > +++ b/libgomp/configure.ac > > @@ -2,6 +2,8 @@ > > # aclocal -I ../config && autoconf && autoheader && automake > > > > AC_PREREQ(2.64) > > +#TODO: Update for OpenACC? But then also have to update copyright > > notices in +#all source files... I've incorporated David Malcolm's suggestion (and patch) into the configury bits of this patch, and into the documentation bits. I'll be reposting those too shortly. > > @@ -1181,6 +1197,7 @@ initialize_env (void) > > gomp_global_icv.thread_limit_var > > = thread_limit_var > INT_MAX ? UINT_MAX : thread_limit_var; > > } > > + parse_int ("GCC_ACC_NOTIFY", &goacc_notify_var, true); > > I would have expected GACC_NOTIFY name instead (or GOACC_NOTIFY) > to match GOMP_SPINCOUNT and similar env vars. I've renamed the environment variable, added a configure-time flag to enable verbose libgomp output (--enable-libgomp-verbose, disabled by default), and added the suggested __builtin_expect. The configure flag might be overkill, considering Thomas's later comments, or maybe the default could be verbose-output enabled instead. > > +OACC_2.0 { > > + global: > > + acc_get_num_devices; > > + acc_get_num_devices_h_; > > Somebody recently suggested (for OpenMP) that we just should use > bind(C) in the Fortran module, it is too late for OpenMP, as we > have to keep the *_ entrypoints for compatibility anyway, but > for OpenACC and new OpenMP functions supposedly you could avoid > exporting all the *_ wrappers and use * directly. I enlisted Jim Norris's help with this -- I might not understand the issues with the Fortran bindings fully (discussed in another thread), but it seems like the interfaces that only accept/return scalars at least can use the "non-decorated" C function directly. That removes some of the underscore-suffixed symbols, at least. I've renamed several other variables that used unnecessary capitals, and standardized on "goacc_" for internal interfaces, where it seemed appropriate. I also moved some of the host-specific parts of plugin/plugin-host.c into oacc-host.c. Thanks, Julian commit 48ae7eecfbca988d1bd85e28d2ee52bb2ebb7e27 Author: Julian Brown Date: Thu Nov 13 04:21:00 2014 -0800 OpenACC support for libgomp. xxxx-xx-xx Nathan Sidwell James Norris Thomas Schwinge Tom de Vries Julian Brown Bernd Schmidt Cesar Philippidis include/ * gomp-constants.h: New file. libgomp/ * Makefile.am (search_path): Search in $(top_srcidr)/../include also. (libgomp_la_SOURCES): Add oacc-parallel.c, splay-tree.c, oacc-fortran.c, oacc-host.c, oacc-init.c, oacc-mem.c, oacc-async.c, oacc-plugin.c, oacc-cuda.c, libgomp-plugin.c. (Makefrag.am): Include. (libgomp_la_SOURCES): Add openacc.f90 if USE_FORTRAN is true. (nodist_libsubinclude_HEADERS): Add openacc.h, ../include/gomp-constants.h. (nodist_finclude_HEADERS): Add openacc_lib.h, openacc.f90, openacc.mod, openacc_kinds.mod. * configure.ac (plugin_support): Add check for accelerators if attempting to build plugins. (plugin/configfrag.ac): Include. (offload_targets): Add host_nonshm target by default, nvptx target conditionally if the corresponding offload target is enabled. (testsuite/libgomp-test-support.exp): Add to AC_CONFIG_FILES. * env.c (libgomp_target.h, oacc-int.h): Include. (goacc_notify_var, goacc_device_num, goacc_device_type): New globals. (goacc_parse_device_type): New functions. (initialize_env): Parse GCC_ACC_NOTIFY, ACC_DEVICE_TYPE, ACC_DEVICE_NUM environment variables. Call ACC_runtime_initialize. * error.c (gomp_verror): Make global. (gomp_vfatal, gomp_vnotify, gomp_notify): New functions. (gomp_fatal): Use gomp_vfatal instead of gomp_verror. * libgomp.h (stdarg.h): Include. (struct gomp_memory_mapping): Forward declaration. (goacc_notify_var, goacc_device_num, goacc_device_type): Add extern declarations. (gomp_vnotify, gomp_notify, gomp_verror, gomp_vfatal): Add prototypes. (gomp_init_targets_once): Add prototype. * libgomp.map (OACC_2.0): New symbol version. Add public acc_* interface functions. (PLUGIN_1.0): New symbol version. Add gomp plugin interface functions. * libgomp_g.h (GOACC_data_start, GOACC_data_end, GOACC_kernels) (GOACC_parallel, GOACC_wait): Add prototypes. * libgomp_target.h (gomp-constants.h, splay-tree.h): Include. (offload_target_type): Set enumeration values from constants in gomp-constants.h. Add OFFLOAD_TARGET_TYPE_HOST_NONSHM and OFFLOAD_TARGET_TYPE_NVIDIA_PTX. (struct target_mem_desc): Move to here. (TARGET_CAP_SHARED_MEM, TARGET_CAP_NATIVE_EXEC, TARGET_CAP_OPENMP_400) (TARGET_CAP_OPENACC_200): Define macros. (struct gomp_memory_mapping): New. (struct ACC_dispatch_t): New. (struct gomp_device_descr): Move here. Add offload_regions_registered, openacc dispatch functions, target_data. (gomp_map_vars, gomp_copy_from_async, gomp_unmap_vars, gomp_init_device) (gomp_init_tables, gomp_fini_device, gomp_free_memmap): Add prototypes. * target.c (oacc-plugin.h, gomp-constants.h, oacc-int.h, stdio.h) (assert.h): Include. (splay_tree_node, splay_tree, splay_tree_key, target_mem_desc) (splay_tree_key_s, gomp_device_descr): Don't declare here. (splay_compare): Change linkage to hidden not static. (gomp_init_targets_once): New function. (gomp_get_num_devices): Use above. (get_kind): New function. (gomp_map_vars): Add is_openacc parameter. Change KINDS to void *. Use lock from memory map not device. Use macros from gomp-constants.h instead of hard-coded values. Support OpenACC-specific mappings. (gomp_copy_from_async): New function. (gomp_unmap_vars): Add DO_COPYFROM argument. Only copy memory back from device if it is true. Use lock from memory map not device. (gomp_update): Add is_openacc parameter. Use lock from memory map not device. Use macros from gomp-constants.h instead of hard-coded values. (gomp_register_image_for_device): Add forward declaration. (GOMP_offload_register): Check realloc result. (gomp_init_device): Change linkage to hidden not static. (gomp_init_tables, gomp_init_dev_tables, gomp_free_memmap) (gomp_fini_device): New function. (GOMP_target): Adjust lazy initialization, check target capabilities for OpenMP 4.0 support. Update call to gomp_map_vars, gomp_unmap_vars. (GOMP_target_data): Adjust lazy initialization. Update call to gomp_map_vars. (GOMP_target_end_data): Update call to gomp_unmap_vars. (GOMP_target_update): Tweak lazy initialization. Add new args to gomp_update call. (gomp_load_plugin_for_device): Initialize get_name, get_caps, device_fini and OpenACC-specific plugin hooks. (gomp_register_images_for_device): Rename to... (gomp_register_image_for_device): This, and register a single device only, and only if it has not already had images registered. (gomp_find_available_plugins): Initialize OpenACC-specific bits, offload image registration, and other new device member data. Prefer device with TARGET_CAP_OPENMP_400 if more than one plugin is available. * libgomp-plugin.c: New file. * libgomp-plugin.h: New file. * oacc-async.c: New file. * oacc-cuda.c: New file. * oacc-fortran.c: New file. * oacc-host.c: New file. * oacc-init.c: New file. * oacc-int.h: New file. * oacc-mem.c: New file. * oacc-parallel.c: New file. * oacc-plugin.c: New file. * oacc-plugin.h: New file. * openacc.f90: New file. * openacc.h: New file. * openacc_lib.h: New file. * splay-tree.h: Move bulk of implementation to... * splay-tree.c: New file. * Makefile.in: Regenerate. * config.h.in: Regenerate. * configure: Regenerate. * plugin/Makefrag.am: New file. * plugin/configfrag.am: New file. * plugin/plugin-host.c: New file. * plugin/plugin-nvptx.c: New file. * testsuite/libgomp-test-support.exp.in: New file. add --enable-libgomp-verbose to compile-time disable notify calls __builtin_expect for gomp_notify, when enabled diff --git a/include/gomp-constants.h b/include/gomp-constants.h new file mode 100644 index 0000000..7ef5c88 --- /dev/null +++ b/include/gomp-constants.h @@ -0,0 +1,45 @@ +#ifndef GOMP_CONSTANTS_H +#define GOMP_CONSTANTS_H 1 + +/* Enumerated variable mapping types used to communicate between GCC and + libgomp. These values are used for both OpenMP and OpenACC. */ + +#define GOMP_MAP_ALLOC 0x00 +#define GOMP_MAP_ALLOC_TO 0x01 +#define GOMP_MAP_ALLOC_FROM 0x02 +#define GOMP_MAP_ALLOC_TOFROM 0x03 +#define GOMP_MAP_POINTER 0x04 +#define GOMP_MAP_TO_PSET 0x05 +#define GOMP_MAP_FORCE_ALLOC 0x08 +#define GOMP_MAP_FORCE_TO 0x09 +#define GOMP_MAP_FORCE_FROM 0x0a +#define GOMP_MAP_FORCE_TOFROM 0x0b +#define GOMP_MAP_FORCE_PRESENT 0x0c +#define GOMP_MAP_FORCE_DEALLOC 0x0d +#define GOMP_MAP_FORCE_DEVICEPTR 0x0e +#define GOMP_MAP_FORCE_PRIVATE 0x18 +#define GOMP_MAP_FORCE_FIRSTPRIVATE 0x19 + +#define GOMP_MAP_COPYTO_P(X) \ + ((X) == GOMP_MAP_ALLOC_TO || (X) == GOMP_MAP_FORCE_TO) + +#define GOMP_MAP_COPYFROM_P(X) \ + ((X) == GOMP_MAP_ALLOC_FROM || (X) == GOMP_MAP_FORCE_FROM) + +#define GOMP_MAP_TOFROM_P(X) \ + ((X) == GOMP_MAP_ALLOC_TOFROM || (X) == GOMP_MAP_FORCE_TOFROM) + +#define GOMP_MAP_POINTER_P(X) \ + ((X) == GOMP_MAP_POINTER) + +#define GOMP_IF_CLAUSE_FALSE -2 + +/* Canonical list of target type codes for OpenMP/OpenACC. */ +#define GOMP_TARGET_NONE 0 +#define GOMP_TARGET_HOST 2 +#define GOMP_TARGET_HOST_NONSHM 3 +#define GOMP_TARGET_NOT_HOST 4 +#define GOMP_TARGET_NVIDIA_PTX 5 +#define GOMP_TARGET_INTEL_MIC 6 + +#endif diff --git a/libgomp/Makefile.am b/libgomp/Makefile.am index 427415e..f48c1ff 100644 --- a/libgomp/Makefile.am +++ b/libgomp/Makefile.am @@ -7,7 +7,8 @@ SUBDIRS = testsuite gcc_version := $(shell cat $(top_srcdir)/../gcc/BASE-VER) config_path = @config_path@ -search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir) +search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir) \ + $(top_srcdir)/../include fincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/finclude libsubincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/include @@ -18,6 +19,10 @@ AM_CPPFLAGS = $(addprefix -I, $(search_path)) AM_CFLAGS = $(XCFLAGS) AM_LDFLAGS = $(XLDFLAGS) $(SECTION_LDFLAGS) $(OPT_LDFLAGS) +if LIBGOMP_VERBOSE +AM_CPPFLAGS += -DLIBGOMP_VERBOSE +endif + toolexeclib_LTLIBRARIES = libgomp.la nodist_toolexeclib_HEADERS = libgomp.spec @@ -60,12 +65,21 @@ libgomp_la_LINK = $(LINK) $(libgomp_la_LDFLAGS) libgomp_la_SOURCES = alloc.c barrier.c critical.c env.c error.c iter.c \ iter_ull.c loop.c loop_ull.c ordered.c parallel.c sections.c single.c \ task.c team.c work.c lock.c mutex.c proc.c sem.c bar.c ptrlock.c \ - time.c fortran.c affinity.c target.c + time.c fortran.c affinity.c target.c oacc-parallel.c splay-tree.c \ + oacc-host.c oacc-init.c oacc-mem.c oacc-async.c oacc-plugin.c \ + oacc-cuda.c libgomp-plugin.c + +include $(top_srcdir)/plugin/Makefrag.am + +if USE_FORTRAN +libgomp_la_SOURCES += openacc.f90 +endif nodist_noinst_HEADERS = libgomp_f.h -nodist_libsubinclude_HEADERS = omp.h +nodist_libsubinclude_HEADERS = omp.h openacc.h ../include/gomp-constants.h if USE_FORTRAN -nodist_finclude_HEADERS = omp_lib.h omp_lib.f90 omp_lib.mod omp_lib_kinds.mod +nodist_finclude_HEADERS = omp_lib.h omp_lib.f90 omp_lib.mod omp_lib_kinds.mod \ + openacc_lib.h openacc.f90 openacc.mod openacc_kinds.mod endif LTLDFLAGS = $(shell $(SHELL) $(top_srcdir)/../libtool-ldflags $(LDFLAGS)) diff --git a/libgomp/Makefile.in b/libgomp/Makefile.in index 8e4774f..d2a803a 100644 diff --git a/libgomp/config.h.in b/libgomp/config.h.in index 94a2b3b..309962d 100644 diff --git a/libgomp/configure b/libgomp/configure index 19f36c6..83a6a11 100755 diff --git a/libgomp/configure.ac b/libgomp/configure.ac index cea6366..68bcb27 100644 --- a/libgomp/configure.ac +++ b/libgomp/configure.ac @@ -2,7 +2,7 @@ # aclocal -I ../config && autoconf && autoheader && automake AC_PREREQ(2.64) -AC_INIT([GNU OpenMP Runtime Library], 1.0,,[libgomp]) +AC_INIT([GNU Offloading and Multi Processing Runtime Library], 1.0,,[libgomp]) AC_CONFIG_HEADER(config.h) # ------- @@ -28,7 +28,6 @@ LIBGOMP_ENABLE(generated-files-in-srcdir, no, , AC_MSG_RESULT($enable_generated_files_in_srcdir) AM_CONDITIONAL(GENINSRC, test "$enable_generated_files_in_srcdir" = yes) - # ------- # ------- @@ -193,13 +192,28 @@ AC_LINK_IFELSE( [], [AC_MSG_ERROR([Pthreads are required to build libgomp])])]) +# Enable --enable-libgomp-verbose +AC_ARG_ENABLE(libgomp-verbose, +[AS_HELP_STRING([--enable-libgomp-verbose], + [enable verbose debugging output for libgomp])], +[case "${enableval}" in + yes) libgomp_verbose=true ;; + no) libgomp_verbose=false ;; + *) AC_MSG_ERROR([bad value ${enableval} for --enable-libgomp-verbose]) ;; +esac], [libgomp_verbose=false]) +AM_CONDITIONAL([LIBGOMP_VERBOSE], [test x$libgomp_verbose = xtrue]) + plugin_support=yes AC_CHECK_LIB(dl, dlsym, , [plugin_support=no]) if test x"$plugin_support" = xyes; then AC_DEFINE(PLUGIN_SUPPORT, 1, [Define if all infrastructure, needed for plugins, is supported.]) +elif test "x$enable_accelerator" != xno; then + AC_MSG_ERROR([Can't have support for accelerators without support for plugins]) fi +m4_include([plugin/configfrag.ac]) + # Check for functions needed. AC_CHECK_FUNCS(getloadavg clock_gettime strtoull) @@ -283,7 +297,7 @@ fi # Get accel target and path to install tree of accel compiler offload_additional_options= offload_additional_lib_paths= -offload_targets= +offload_targets=host_nonshm if test x"$enable_offload_targets" != x; then for tgt in `echo $enable_offload_targets | sed -e 's#,# #g'`; do tgt_dir=`echo $tgt | grep '=' | sed 's/.*=//'` @@ -291,6 +305,8 @@ if test x"$enable_offload_targets" != x; then case $tgt in *-intelmic-* | *-intelmicemul-*) tgt_name="intelmic" ;; + nvptx-*) + tgt_name="nvptx" ;; *) AC_MSG_ERROR([unknown offload target specified]) ;; esac @@ -388,4 +404,5 @@ CFLAGS="$save_CFLAGS" AC_CONFIG_FILES(omp.h omp_lib.h omp_lib.f90 libgomp_f.h) AC_CONFIG_FILES(Makefile testsuite/Makefile libgomp.spec) +AC_CONFIG_FILES([testsuite/libgomp-test-support.exp]) AC_OUTPUT diff --git a/libgomp/env.c b/libgomp/env.c index 94c72a3..7e32eb7 100644 --- a/libgomp/env.c +++ b/libgomp/env.c @@ -27,6 +27,8 @@ #include "libgomp.h" #include "libgomp_f.h" +#include "libgomp_target.h" +#include "oacc-int.h" #include #include #include @@ -77,6 +79,10 @@ unsigned long gomp_bind_var_list_len; void **gomp_places_list; unsigned long gomp_places_list_len; +int goacc_notify_var; +int goacc_device_num; +char* goacc_device_type; + /* Parse the OMP_SCHEDULE environment variable. */ static void @@ -1011,6 +1017,16 @@ parse_affinity (bool ignore) return false; } +static void +goacc_parse_device_type (void) +{ + const char *env = getenv ("ACC_DEVICE_TYPE"); + + if (env && *env != '\0') + goacc_device_type = strdup (env); + else + goacc_device_type = NULL; +} static void handle_omp_display_env (unsigned long stacksize, int wait_policy) @@ -1181,6 +1197,7 @@ initialize_env (void) gomp_global_icv.thread_limit_var = thread_limit_var > INT_MAX ? UINT_MAX : thread_limit_var; } + parse_int ("GOACC_NOTIFY", &goacc_notify_var, true); #ifndef HAVE_SYNC_BUILTINS gomp_mutex_init (&gomp_managed_threads_lock); #endif @@ -1271,6 +1288,15 @@ initialize_env (void) } handle_omp_display_env (stacksize, wait_policy); + + /* Look for OpenACC-specific environment variables. */ + if (!parse_int ("ACC_DEVICE_NUM", &goacc_device_num, true)) + goacc_device_num = 0; + + goacc_parse_device_type (); + + /* Initialize OpenACC-specific internal state. */ + goacc_runtime_initialize (); } diff --git a/libgomp/error.c b/libgomp/error.c index d9b28f1..c455f58 100644 --- a/libgomp/error.c +++ b/libgomp/error.c @@ -35,7 +35,7 @@ #include -static void +void gomp_verror (const char *fmt, va_list list) { fputs ("\nlibgomp: ", stderr); @@ -54,13 +54,40 @@ gomp_error (const char *fmt, ...) } void +gomp_vfatal (const char *fmt, va_list list) +{ + gomp_verror (fmt, list); + exit (EXIT_FAILURE); +} + +void gomp_fatal (const char *fmt, ...) { va_list list; va_start (list, fmt); - gomp_verror (fmt, list); + gomp_vfatal (fmt, list); va_end (list); +} - exit (EXIT_FAILURE); +#ifdef LIBGOMP_VERBOSE + +#undef gomp_vnotify +void +gomp_vnotify (const char *msg, va_list list) +{ + if (goacc_notify_var) + vfprintf (stderr, msg, list); +} + +#undef gomp_notify +void +gomp_notify (const char *msg, ...) +{ + va_list list; + + va_start (list, msg); + gomp_vnotify (msg, list); + va_end (list); } +#endif diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c new file mode 100644 index 0000000..f0e35d6 --- /dev/null +++ b/libgomp/libgomp-plugin.c @@ -0,0 +1,107 @@ +/* Copyright (C) 2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* Exported (non-hidden) functions exposing libgomp interface for plugins. */ + +#include + +#include "libgomp.h" +#include "libgomp-plugin.h" +#include "target.h" + +void * +GOMP_PLUGIN_malloc (size_t size) +{ + return gomp_malloc (size); +} + +void * +GOMP_PLUGIN_malloc_cleared (size_t size) +{ + return gomp_malloc_cleared (size); +} + +void * +GOMP_PLUGIN_realloc (void *ptr, size_t size) +{ + return gomp_realloc (ptr, size); +} + +void +GOMP_PLUGIN_error (const char *msg, ...) +{ + va_list ap; + + va_start (ap, msg); + gomp_verror (msg, ap); + va_end (ap); +} + +void +GOMP_PLUGIN_notify (const char *msg, ...) +{ + va_list ap; + + va_start (ap, msg); + gomp_vnotify (msg, ap); + va_end (ap); +} + +void +GOMP_PLUGIN_fatal (const char *msg, ...) +{ + va_list ap; + + va_start (ap, msg); + gomp_vfatal (msg, ap); + va_end (ap); + + /* Unreachable. */ + abort (); +} + +void +GOMP_PLUGIN_mutex_init (gomp_mutex_t *mutex) +{ + gomp_mutex_init (mutex); +} + +void +GOMP_PLUGIN_mutex_destroy (gomp_mutex_t *mutex) +{ + gomp_mutex_destroy (mutex); +} + +void +GOMP_PLUGIN_mutex_lock (gomp_mutex_t *mutex) +{ + gomp_mutex_lock (mutex); +} + +void +GOMP_PLUGIN_mutex_unlock (gomp_mutex_t *mutex) +{ + gomp_mutex_unlock (mutex); +} diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h new file mode 100644 index 0000000..87367e3 --- /dev/null +++ b/libgomp/libgomp-plugin.h @@ -0,0 +1,54 @@ +/* Copyright (C) 2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* An interface to various libgomp-internal functions for use by plugins. */ + +#ifndef LIBGOMP_PLUGIN_H +#define LIBGOMP_PLUGIN_H 1 + +#include "mutex.h" + +/* alloc.c */ + +extern void *GOMP_PLUGIN_malloc (size_t) __attribute__((malloc)); +extern void *GOMP_PLUGIN_malloc_cleared (size_t) __attribute__((malloc)); +extern void *GOMP_PLUGIN_realloc (void *, size_t); + +/* error.c */ + +extern void GOMP_PLUGIN_notify(const char *msg, ...); +extern void GOMP_PLUGIN_error (const char *, ...) + __attribute__((format (printf, 1, 2))); +extern void GOMP_PLUGIN_fatal (const char *, ...) + __attribute__((noreturn, format (printf, 1, 2))); + +/* mutex.c */ + +extern void GOMP_PLUGIN_mutex_init (gomp_mutex_t *mutex); +extern void GOMP_PLUGIN_mutex_destroy (gomp_mutex_t *mutex); +extern void GOMP_PLUGIN_mutex_lock (gomp_mutex_t *mutex); +extern void GOMP_PLUGIN_mutex_unlock (gomp_mutex_t *mutex); + +#endif diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index a1482cc..b86b960 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -40,6 +40,7 @@ #include #include #include +#include #ifdef HAVE_ATTRIBUTE_VISIBILITY # pragma GCC visibility push(hidden) @@ -220,6 +221,7 @@ struct gomp_team_state }; struct target_mem_desc; +struct gomp_memory_mapping; /* These are the OpenMP 4.0 Internal Control Variables described in section 2.3.1. Those described as having one copy per task are @@ -254,6 +256,10 @@ extern unsigned long gomp_bind_var_list_len; extern void **gomp_places_list; extern unsigned long gomp_places_list_len; +extern int goacc_notify_var; +extern int goacc_device_num; +extern char* goacc_device_type; + enum gomp_task_kind { GOMP_TASK_IMPLICIT, @@ -532,8 +538,29 @@ extern void *gomp_realloc (void *, size_t); /* error.c */ +#ifdef LIBGOMP_VERBOSE +extern void gomp_vnotify (const char *, va_list); +extern void gomp_notify (const char *msg, ...) + __attribute__((format (printf, 1, 2))); +#define gomp_notify(...) \ + do { \ + if (__builtin_expect (goacc_notify_var, 0)) \ + (gomp_notify) (__VA_ARGS__); \ + } while (0) +#define gomp_vnotify(FMT, VALIST) \ + do { \ + if (__builtin_expect (goacc_notify_var, 0)) \ + (gomp_vnotify) ((FMT), (VALIST)); \ + } while (0) +#else +#define gomp_vnotify(FMT, VALIST) +#define gomp_notify(FMT, ...) +#endif +extern void gomp_verror (const char *, va_list); extern void gomp_error (const char *, ...) __attribute__((format (printf, 1, 2))); +extern void gomp_vfatal (const char *, va_list) + __attribute__((noreturn)); extern void gomp_fatal (const char *, ...) __attribute__((noreturn, format (printf, 1, 2))); @@ -606,6 +633,7 @@ extern void gomp_free_thread (void *); /* target.c */ +extern void gomp_init_targets_once (void); extern int gomp_get_num_devices (void); /* work.c */ diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map index f36df23..f6e70e9 100644 --- a/libgomp/libgomp.map +++ b/libgomp/libgomp.map @@ -232,3 +232,98 @@ GOMP_4.0.1 { global: GOMP_offload_register; } GOMP_4.0; + +OACC_2.0 { + global: + acc_get_num_devices; + acc_set_device_type; + acc_get_device_type; + acc_set_device_num; + acc_get_device_num; + acc_async_test; + acc_async_test_h_; + acc_async_test_all; + acc_async_test_all_h_; + acc_wait; + acc_wait_async; + acc_wait_all; + acc_wait_all_async; + acc_init; + acc_shutdown; + acc_on_device; + acc_on_device_h_; + acc_malloc; + acc_free; + acc_copyin; + acc_copyin_32_h_; + acc_copyin_64_h_; + acc_copyin_array_h_; + acc_present_or_copyin; + acc_present_or_copyin_32_h_; + acc_present_or_copyin_64_h_; + acc_present_or_copyin_array_h_; + acc_create; + acc_create_32_h_; + acc_create_64_h_; + acc_create_array_h_; + acc_present_or_create; + acc_present_or_create_32_h_; + acc_present_or_create_64_h_; + acc_present_or_create_array_h_; + acc_copyout; + acc_copyout_32_h_; + acc_copyout_64_h_; + acc_copyout_array_h_; + acc_delete; + acc_delete_32_h_; + acc_delete_64_h_; + acc_delete_array_h_; + acc_update_device; + acc_update_device_32_h_; + acc_update_device_64_h_; + acc_update_device_array_h_; + acc_update_self; + acc_update_self_32_h_; + acc_update_self_64_h_; + acc_update_self_array_h_; + acc_map_data; + acc_unmap_data; + acc_deviceptr; + acc_hostptr; + acc_is_present; + acc_is_present_32_h_; + acc_is_present_64_h_; + acc_is_present_array_h_; + acc_memcpy_to_device; + acc_memcpy_from_device; + acc_get_current_cuda_device; + acc_get_current_cuda_context; + acc_get_cuda_stream; + acc_set_cuda_stream; +}; + +GOACC_2.0 { + global: + GOACC_data_end; + GOACC_data_start; + GOACC_kernels; + GOACC_parallel; + GOACC_update; + GOACC_wait; +}; + +GOMP_PLUGIN_1.0 { + global: + GOMP_PLUGIN_malloc; + GOMP_PLUGIN_malloc_cleared; + GOMP_PLUGIN_realloc; + GOMP_PLUGIN_error; + GOMP_PLUGIN_notify; + GOMP_PLUGIN_fatal; + GOMP_PLUGIN_mutex_init; + GOMP_PLUGIN_mutex_destroy; + GOMP_PLUGIN_mutex_lock; + GOMP_PLUGIN_mutex_unlock; + GOMP_PLUGIN_async_unmap_vars; + GOMP_PLUGIN_acc_thread; +}; diff --git a/libgomp/libgomp_g.h b/libgomp/libgomp_g.h index be0c6ea..44f200c 100644 --- a/libgomp/libgomp_g.h +++ b/libgomp/libgomp_g.h @@ -214,4 +214,17 @@ extern void GOMP_target_update (int, const void *, size_t, void **, size_t *, unsigned char *); extern void GOMP_teams (unsigned int, unsigned int); +/* oacc-parallel.c */ + +extern void GOACC_data_start (int, const void *, + size_t, void **, size_t *, unsigned short *); +extern void GOACC_data_end (void); +extern void GOACC_kernels (int, void (*) (void *), const void *, + size_t, void **, size_t *, unsigned short *, + int, int, int, int, int, ...); +extern void GOACC_parallel (int, void (*) (void *), const void *, + size_t, void **, size_t *, unsigned short *, + int, int, int, int, int, ...); +extern void GOACC_wait (int, int, ...); + #endif /* LIBGOMP_G_H */ diff --git a/libgomp/libgomp_target.h b/libgomp/libgomp_target.h index f7d19d0..679368a 100644 --- a/libgomp/libgomp_target.h +++ b/libgomp/libgomp_target.h @@ -24,11 +24,15 @@ #ifndef LIBGOMP_TARGET_H #define LIBGOMP_TARGET_H 1 -/* Type of offload target device. */ +#include "gomp-constants.h" + +/* Type of offload target device. Keep in sync with openacc.h:acc_device_t. */ enum offload_target_type { - OFFLOAD_TARGET_TYPE_HOST, - OFFLOAD_TARGET_TYPE_INTEL_MIC + OFFLOAD_TARGET_TYPE_HOST = GOMP_TARGET_HOST, + OFFLOAD_TARGET_TYPE_HOST_NONSHM = GOMP_TARGET_HOST_NONSHM, + OFFLOAD_TARGET_TYPE_NVIDIA_PTX = GOMP_TARGET_NVIDIA_PTX, + OFFLOAD_TARGET_TYPE_INTEL_MIC = GOMP_TARGET_INTEL_MIC }; /* Auxiliary struct, used for transferring a host-target address range mapping @@ -41,4 +45,177 @@ struct mapping_table uintptr_t tgt_end; }; +#include "splay-tree.h" + +struct target_mem_desc { + /* Reference count. */ + uintptr_t refcount; + /* All the splay nodes allocated together. */ + splay_tree_node array; + /* Start of the target region. */ + uintptr_t tgt_start; + /* End of the targer region. */ + uintptr_t tgt_end; + /* Handle to free. */ + void *to_free; + /* Previous target_mem_desc. */ + struct target_mem_desc *prev; + /* Number of items in following list. */ + size_t list_count; + + /* Corresponding target device descriptor. */ + struct gomp_device_descr *device_descr; + + /* Memory mapping info for the thread that created this descriptor. */ + struct gomp_memory_mapping *mem_map; + + /* List of splay keys to remove (or decrease refcount) + at the end of region. */ + splay_tree_key list[]; +}; + +#define TARGET_CAP_SHARED_MEM 1 +#define TARGET_CAP_NATIVE_EXEC 2 +#define TARGET_CAP_OPENMP_400 4 +#define TARGET_CAP_OPENACC_200 8 + +/* Information about mapped memory regions (per device/context). */ + +struct gomp_memory_mapping +{ + /* Splay tree containing information about mapped memory regions. */ + struct splay_tree_s splay_tree; + + /* Mutex for operating with the splay tree and other shared structures. */ + gomp_mutex_t lock; + + /* True when tables have been added to this memory map. */ + bool is_initialized; +}; + +typedef struct acc_dispatch_t +{ + /* This is a linked list of data mapped using the + acc_map_data/acc_unmap_data or "acc enter data"/"acc exit data" pragmas + (TODO). Unlike mapped_data in the goacc_thread struct, unmapping can + happen out-of-order with respect to mapping. */ + struct target_mem_desc *data_environ; + + /* Open or close a device instance. */ + void *(*open_device_func) (int n); + int (*close_device_func) (void *h); + + /* Set or get the device number. */ + int (*get_device_num_func) (void); + void (*set_device_num_func) (int); + + /* Execute. */ + void (*exec_func) (void (*) (void *), size_t, void **, void **, size_t *, + unsigned short *, int, int, int, int, void *); + + /* Async cleanup callback registration. */ + void (*register_async_cleanup_func) (void *); + + /* Asynchronous routines. */ + int (*async_test_func) (int); + int (*async_test_all_func) (void); + void (*async_wait_func) (int); + void (*async_wait_async_func) (int, int); + void (*async_wait_all_func) (void); + void (*async_wait_all_async_func) (int); + void (*async_set_async_func) (int); + + /* Create/destroy TLS data. */ + void *(*create_thread_data_func) (void *); + void (*destroy_thread_data_func) (void *); + + /* NVIDIA target specific routines. */ + struct { + void *(*get_current_device_func) (void); + void *(*get_current_context_func) (void); + void *(*get_stream_func) (int); + int (*set_stream_func) (int, void *); + } cuda; +} acc_dispatch_t; + +/* This structure describes accelerator device. + It contains name of the corresponding libgomp plugin, function handlers for + interaction with the device, ID-number of the device, and information about + mapped memory. */ +struct gomp_device_descr +{ + /* The name of the device. */ + const char *name; + + /* Capabilities of device (supports OpenACC, OpenMP). */ + unsigned int capabilities; + + /* This is the ID number of device. It could be specified in DEVICE-clause of + TARGET construct. */ + int id; + + /* This is the ID number of device among devices of the same type. */ + int target_id; + + /* This is the TYPE of device. */ + enum offload_target_type type; + + /* Set to true when device is initialized. */ + bool is_initialized; + + /* True when offload regions have been registered with this device. */ + bool offload_regions_registered; + + /* Plugin file handler. */ + void *plugin_handle; + + /* Function handlers. */ + const char *(*get_name_func) (void); + unsigned int (*get_caps_func) (void); + int (*get_type_func) (void); + int (*get_num_devices_func) (void); + void (*register_image_func) (void *, void *); + void (*init_device_func) (int); + void (*fini_device_func) (int); + int (*get_table_func) (int, struct mapping_table **); + void *(*alloc_func) (int, size_t); + void (*free_func) (int, void *); + void *(*dev2host_func) (int, void *, const void *, size_t); + void *(*host2dev_func) (int, void *, const void *, size_t); + void (*run_func) (int, void *, void *); + + /* OpenACC-specific functions. */ + acc_dispatch_t openacc; + + /* Memory-mapping info for this device instance. */ + struct gomp_memory_mapping mem_map; + + /* Extra information required for a device instance by a given target. */ + void *target_data; +}; + +extern struct target_mem_desc * +gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, + void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds, + bool is_openacc, bool is_target); + +extern void +gomp_copy_from_async (struct target_mem_desc *tgt); + +extern void +gomp_unmap_vars (struct target_mem_desc *tgt, bool); + +extern attribute_hidden void +gomp_init_device (struct gomp_device_descr *devicep); + +extern attribute_hidden void +gomp_init_tables (const struct gomp_device_descr *devicep, + struct gomp_memory_mapping *mm); + +extern attribute_hidden void +gomp_fini_device (struct gomp_device_descr *devicep); + +extern attribute_hidden void +gomp_free_memmap (struct gomp_device_descr *devicep); + #endif /* LIBGOMP_TARGET_H */ diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c new file mode 100644 index 0000000..94c62d8 --- /dev/null +++ b/libgomp/oacc-async.c @@ -0,0 +1,77 @@ +/* OpenACC Runtime Library Definitions. + + Copyright (C) 2013-2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + + +#include "openacc.h" +#include "libgomp.h" +#include "libgomp_target.h" +#include "oacc-int.h" + +int +acc_async_test (int async) +{ + if (async < acc_async_sync) + gomp_fatal ("invalid async argument: %d", async); + + return base_dev->openacc.async_test_func (async); +} + +int +acc_async_test_all (void) +{ + return base_dev->openacc.async_test_all_func (); +} + +void +acc_wait (int async) +{ + if (async < acc_async_sync) + gomp_fatal ("invalid async argument: %d", async); + + base_dev->openacc.async_wait_func (async); +} + +void +acc_wait_async (int async1, int async2) +{ + base_dev->openacc.async_wait_async_func (async1, async2); +} + +void +acc_wait_all (void) +{ + base_dev->openacc.async_wait_all_func (); +} + +void +acc_wait_all_async (int async) +{ + if (async < acc_async_sync) + gomp_fatal ("invalid async argument: %d", async); + + base_dev->openacc.async_wait_all_async_func (async); +} diff --git a/libgomp/oacc-cuda.c b/libgomp/oacc-cuda.c new file mode 100644 index 0000000..4d0b284 --- /dev/null +++ b/libgomp/oacc-cuda.c @@ -0,0 +1,84 @@ +/* OpenACC Runtime Library: CUDA support glue. + + Copyright (C) 2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +#include "openacc.h" +#include "config.h" +#include "libgomp.h" +#include "libgomp_target.h" +#include "oacc-int.h" + +void * +acc_get_current_cuda_device (void) +{ + void *p = NULL; + + if (base_dev && base_dev->openacc.cuda.get_current_device_func) + p = base_dev->openacc.cuda.get_current_device_func (); + + return p; +} + +void * +acc_get_current_cuda_context (void) +{ + void *p = NULL; + + if (base_dev && base_dev->openacc.cuda.get_current_context_func) + p = base_dev->openacc.cuda.get_current_context_func (); + + return p; +} + +void * +acc_get_cuda_stream (int async) +{ + void *p = NULL; + + if (async < 0) + return p; + + if (base_dev && base_dev->openacc.cuda.get_stream_func) + p = base_dev->openacc.cuda.get_stream_func (async); + + return p; +} + +int +acc_set_cuda_stream (int async, void *stream) +{ + int s = -1; + + if (async < 0 || stream == NULL) + return 0; + + goacc_lazy_initialize (); + + if (base_dev && base_dev->openacc.cuda.set_stream_func) + s = base_dev->openacc.cuda.set_stream_func (async, stream); + + return s; +} diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c new file mode 100644 index 0000000..0d94465 --- /dev/null +++ b/libgomp/oacc-host.c @@ -0,0 +1,99 @@ +/* OpenACC Runtime Library: acc_device_host, acc_device_host_nonshm. + + Copyright (C) 2013-2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* This shares much of the implementation of the plugin-host.c "host_nonshm" + plugin. */ +#include "plugin/plugin-host.c" + +static struct gomp_device_descr host_dispatch = + { + .name = "host", + + .type = OFFLOAD_TARGET_TYPE_HOST, + .capabilities = TARGET_CAP_OPENACC_200 | TARGET_CAP_NATIVE_EXEC + | TARGET_CAP_SHARED_MEM, + .id = 0, + + .is_initialized = false, + .offload_regions_registered = false, + + .get_name_func = GOMP_OFFLOAD_get_name, + .get_type_func = GOMP_OFFLOAD_get_type, + .get_caps_func = GOMP_OFFLOAD_get_caps, + + .init_device_func = GOMP_OFFLOAD_init_device, + .fini_device_func = GOMP_OFFLOAD_fini_device, + .get_num_devices_func = GOMP_OFFLOAD_get_num_devices, + .register_image_func = GOMP_OFFLOAD_register_image, + .get_table_func = GOMP_OFFLOAD_get_table, + + .alloc_func = GOMP_OFFLOAD_alloc, + .free_func = GOMP_OFFLOAD_free, + .host2dev_func = GOMP_OFFLOAD_host2dev, + .dev2host_func = GOMP_OFFLOAD_dev2host, + + .run_func = GOMP_OFFLOAD_run, + + .openacc = { + .open_device_func = GOMP_OFFLOAD_openacc_open_device, + .close_device_func = GOMP_OFFLOAD_openacc_close_device, + + .get_device_num_func = GOMP_OFFLOAD_openacc_get_device_num, + .set_device_num_func = GOMP_OFFLOAD_openacc_set_device_num, + + .exec_func = GOMP_OFFLOAD_openacc_parallel, + + .register_async_cleanup_func + = GOMP_OFFLOAD_openacc_register_async_cleanup, + + .async_set_async_func = GOMP_OFFLOAD_openacc_async_set_async, + .async_test_func = GOMP_OFFLOAD_openacc_async_test, + .async_test_all_func = GOMP_OFFLOAD_openacc_async_test_all, + .async_wait_func = GOMP_OFFLOAD_openacc_async_wait, + .async_wait_async_func = GOMP_OFFLOAD_openacc_async_wait_async, + .async_wait_all_func = GOMP_OFFLOAD_openacc_async_wait_all, + .async_wait_all_async_func = GOMP_OFFLOAD_openacc_async_wait_all_async, + + .create_thread_data_func = GOMP_OFFLOAD_openacc_create_thread_data, + .destroy_thread_data_func = GOMP_OFFLOAD_openacc_destroy_thread_data, + + .cuda = { + .get_current_device_func = NULL, + .get_current_context_func = NULL, + .get_stream_func = NULL, + .set_stream_func = NULL, + } + } + }; + +/* Register this device type. */ +static __attribute__ ((constructor)) +void goacc_host_init (void) +{ + gomp_mutex_init (&host_dispatch.mem_map.lock); + goacc_register (&host_dispatch); +} diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c new file mode 100644 index 0000000..ed5deb3 --- /dev/null +++ b/libgomp/oacc-init.c @@ -0,0 +1,613 @@ +/* OpenACC Runtime initialization routines + + Copyright (C) 2013-2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +#include "libgomp.h" +#include "libgomp_target.h" +#include "oacc-int.h" +#include "openacc.h" +#include +#include +#include +#include +#include + +static gomp_mutex_t acc_device_lock; + +/* The dispatch table for the current accelerator device. This is global, so + you can only have one type of device open at any given time in a program. + This is the "base" device in that several devices that use the same + dispatch table may be active concurrently: this one (the "zeroth") is used + for overall initialisation/shutdown, and other instances -- not necessarily + including this one -- may be opened and closed once the base device has + been initialized. */ +struct gomp_device_descr const *base_dev; + +#ifdef HAVE_TLS +__thread struct goacc_thread *goacc_tls_data; +#else +pthread_key_t goacc_tls_key; +#endif +static pthread_key_t goacc_cleanup_key; + +/* Current dispatcher, and how it was initialized */ +static acc_device_t init_key = _ACC_device_hwm; + +static struct goacc_thread *goacc_threads; +static gomp_mutex_t goacc_thread_lock; + +/* An array of dispatchers for device types, indexed by the type. This array + only references "base" devices, and other instances of the same type are + found by simply indexing from each such device (which are stored linearly, + grouped by device in target.c:devices). */ +static struct gomp_device_descr const *dispatchers[_ACC_device_hwm] = { 0 }; + +attribute_hidden void +goacc_register (struct gomp_device_descr const *disp) +{ + /* Only register the 0th device here. */ + if (disp->target_id != 0) + return; + + gomp_mutex_lock (&acc_device_lock); + + assert (acc_device_type (disp->type) != acc_device_none + && acc_device_type (disp->type) != acc_device_default + && acc_device_type (disp->type) != acc_device_not_host); + assert (!dispatchers[disp->type]); + dispatchers[disp->type] = disp; + + gomp_mutex_unlock (&acc_device_lock); +} + +static struct gomp_device_descr const * +resolve_device (acc_device_t d) +{ + acc_device_t d_arg = d; + + switch (d) + { + case acc_device_default: + { + if (goacc_device_type) + { + /* Lookup the named device. */ + while (++d != _ACC_device_hwm) + if (dispatchers[d] + && !strcasecmp (goacc_device_type, dispatchers[d]->name) + && dispatchers[d]->get_num_devices_func () > 0) + goto found; + + gomp_fatal ("device type %s not supported", goacc_device_type); + } + + /* No default device specified, so start scanning for any non-host + device that is available. */ + d = acc_device_not_host; + } + /* FALLTHROUGH */ + + case acc_device_not_host: + /* Find the first available device after acc_device_not_host. */ + while (++d != _ACC_device_hwm) + if (dispatchers[d] && dispatchers[d]->get_num_devices_func () > 0) + goto found; + if (d_arg == acc_device_default) + { + d = acc_device_host; + goto found; + } + gomp_fatal ("no device found"); + break; + + case acc_device_host: + break; + + default: + if (d > _ACC_device_hwm) + gomp_fatal ("device %u out of range", (unsigned)d); + break; + } + found: + + assert (d != acc_device_none + && d != acc_device_default + && d != acc_device_not_host); + + return dispatchers[d]; +} + +/* This is called when plugins have been initialized, and serves to call + (indirectly) the target's device_init hook. Calling multiple times without + an intervening acc_shutdown_1 call is an error. */ + +static struct gomp_device_descr const * +acc_init_1 (acc_device_t d) +{ + struct gomp_device_descr const *acc_dev; + + acc_dev = resolve_device (d); + + if (!acc_dev || acc_dev->get_num_devices_func () <= 0) + gomp_fatal ("device %u not supported", (unsigned)d); + + if (acc_dev->is_initialized) + gomp_fatal ("device already active"); + + /* We need to remember what we were intialized as, to check shutdown etc. */ + init_key = d; + + gomp_init_device ((struct gomp_device_descr *) acc_dev); + + return acc_dev; +} + +static struct goacc_thread * +goacc_new_thread (void) +{ + struct goacc_thread *thr = gomp_malloc (sizeof (struct gomp_thread)); + +#ifdef HAVE_TLS + goacc_tls_data = thr; +#else + pthread_setspecific (goacc_tls_key, thr); +#endif + + pthread_setspecific (goacc_cleanup_key, thr); + + gomp_mutex_lock (&goacc_thread_lock); + thr->next = goacc_threads; + goacc_threads = thr; + gomp_mutex_unlock (&goacc_thread_lock); + + return thr; +} + +static void +goacc_destroy_thread (void *data) +{ + struct goacc_thread *thr = data, *walk, *prev; + + gomp_mutex_lock (&goacc_thread_lock); + + if (thr) + { + if (base_dev && thr->target_tls) + { + base_dev->openacc.destroy_thread_data_func (thr->target_tls); + thr->target_tls = NULL; + } + + assert (!thr->mapped_data); + + /* Remove from thread list. */ + for (prev = NULL, walk = goacc_threads; walk; + prev = walk, walk = walk->next) + if (walk == thr) + { + if (prev == NULL) + goacc_threads = walk->next; + else + prev->next = walk->next; + + free (thr); + + break; + } + + assert (walk); + } + + gomp_mutex_unlock (&goacc_thread_lock); +} + +/* Open the ORD'th device of the currently-active type (base_dev must be + initialised before calling). If ORD is < 0, open the default-numbered + device (set by the ACC_DEVICE_NUM environment variable or a call to + acc_set_device_num), or leave any currently-opened device as is. "Opening" + consists of calling the device's open_device_func hook, and setting up + thread-local data (maybe allocating, then initializing with information + pertaining to the newly-opened or previously-opened device). */ + +static void +lazy_open (int ord) +{ + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev; + + if (thr && thr->dev) + { + assert (ord < 0 || ord == thr->dev->target_id); + return; + } + + assert (base_dev); + + if (ord < 0) + ord = goacc_device_num; + + if (ord >= base_dev->get_num_devices_func ()) + gomp_fatal ("device %u does not exist", ord); + + if (!thr) + thr = goacc_new_thread (); + + acc_dev = thr->dev = (struct gomp_device_descr *) &base_dev[ord]; + + assert (acc_dev->target_id == ord); + + thr->saved_bound_dev = NULL; + thr->mapped_data = NULL; + + if (!acc_dev->target_data) + acc_dev->target_data = acc_dev->openacc.open_device_func (ord); + + thr->target_tls + = acc_dev->openacc.create_thread_data_func (acc_dev->target_data); + + acc_dev->openacc.async_set_async_func (acc_async_sync); + + if (!acc_dev->mem_map.is_initialized) + gomp_init_tables (acc_dev, &acc_dev->mem_map); +} + +/* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of + init/shutdown is per-process or per-thread. We choose per-process. */ + +void +acc_init (acc_device_t d) +{ + if (!base_dev) + gomp_init_targets_once (); + + gomp_mutex_lock (&acc_device_lock); + + base_dev = acc_init_1 (d); + + lazy_open (-1); + + gomp_mutex_unlock (&acc_device_lock); +} + +ialias (acc_init) + +void +acc_shutdown_1 (acc_device_t d) +{ + struct goacc_thread *walk; + + /* We don't check whether d matches the actual device found, because + OpenACC 2.0 (3.2.12) says the parameters to the init and this + call must match (for the shutdown call anyway, it's silent on + others). */ + + if (!base_dev) + gomp_fatal ("no device initialized"); + if (d != init_key) + gomp_fatal ("device %u(%u) is initialized", + (unsigned) init_key, (unsigned) base_dev->type); + + gomp_mutex_lock (&goacc_thread_lock); + + /* Free target-specific TLS data and close all devices. */ + for (walk = goacc_threads; walk != NULL; walk = walk->next) + { + if (walk->target_tls) + base_dev->openacc.destroy_thread_data_func (walk->target_tls); + + walk->target_tls = NULL; + + /* This would mean the user is shutting down OpenACC in the middle of an + "acc data" pragma. Likely not intentional. */ + if (walk->mapped_data) + gomp_fatal ("shutdown in 'acc data' region"); + + if (walk->dev) + { + if (walk->dev->openacc.close_device_func (walk->dev->target_data) < 0) + gomp_fatal ("failed to close device"); + + walk->dev->target_data = NULL; + + gomp_free_memmap (walk->dev); + + walk->dev = NULL; + } + } + + gomp_mutex_unlock (&goacc_thread_lock); + + gomp_fini_device ((struct gomp_device_descr *) base_dev); + + base_dev = NULL; +} + +void +acc_shutdown (acc_device_t d) +{ + gomp_mutex_lock (&acc_device_lock); + + acc_shutdown_1 (d); + + gomp_mutex_unlock (&acc_device_lock); +} + +ialias (acc_shutdown) + +/* This function is called after plugins have been initialized. It deals with + the "base" device, and is used to prepare the runtime for dealing with a + number of such devices (as implemented by some particular plugin). If the + argument device type D matches a previous call to the function, return the + current base device, else shut the old device down and re-initialize with + the new device type. */ + +static struct gomp_device_descr const * +lazy_init (acc_device_t d) +{ + if (base_dev) + { + /* Re-initializing the same device, do nothing. */ + if (d == init_key) + return base_dev; + + acc_shutdown_1 (init_key); + } + + assert (!base_dev); + + return acc_init_1 (d); +} + +/* Ensure that plugins are loaded, initialize and open the (default-numbered) + device. */ + +static void +lazy_init_and_open (acc_device_t d) +{ + if (!base_dev) + gomp_init_targets_once (); + + gomp_mutex_lock (&acc_device_lock); + + base_dev = lazy_init (d); + + lazy_open (-1); + + gomp_mutex_unlock (&acc_device_lock); +} + +int +acc_get_num_devices (acc_device_t d) +{ + int n = 0; + struct gomp_device_descr const *acc_dev; + + if (d == acc_device_none) + return 0; + + if (!base_dev) + gomp_init_targets_once (); + + acc_dev = resolve_device (d); + if (!acc_dev) + return 0; + + n = acc_dev->get_num_devices_func (); + if (n < 0) + n = 0; + + return n; +} + +ialias (acc_get_num_devices) + +void +acc_set_device_type (acc_device_t d) +{ + lazy_init_and_open (d); +} + +ialias (acc_set_device_type) + +acc_device_t +acc_get_device_type (void) +{ + acc_device_t res = acc_device_none; + const struct gomp_device_descr *dev; + + if (base_dev) + res = acc_device_type (base_dev->type); + else + { + gomp_init_targets_once (); + + dev = resolve_device (acc_device_default); + res = acc_device_type (dev->type); + } + + assert (res != acc_device_default + && res != acc_device_not_host); + + return res; +} + +ialias (acc_get_device_type) + +int +acc_get_device_num (acc_device_t d) +{ + const struct gomp_device_descr *dev; + int num; + + if (d >= _ACC_device_hwm) + gomp_fatal ("device %u out of range", (unsigned)d); + + if (!base_dev) + gomp_init_targets_once (); + + dev = resolve_device (d); + if (!dev) + gomp_fatal ("no devices of type %u", d); + + /* We might not have called lazy_open for this host thread yet, in which case + the get_device_num_func hook will return -1. */ + num = dev->openacc.get_device_num_func (); + if (num < 0) + num = goacc_device_num; + + return num; +} + +ialias (acc_get_device_num) + +void +acc_set_device_num (int n, acc_device_t d) +{ + const struct gomp_device_descr *dev; + int num_devices; + + if (!base_dev) + gomp_init_targets_once (); + + if ((int) d == 0) + { + int i; + + /* A device setting of zero sets all device types on the system to use + the Nth instance of that device type. Only attempt it for initialized + devices though. */ + for (i = acc_device_not_host + 1; i < _ACC_device_hwm; i++) + { + dev = resolve_device (d); + if (dev && dev->is_initialized) + dev->openacc.set_device_num_func (n); + } + + /* ...and for future calls to acc_init/acc_set_device_type, etc. */ + goacc_device_num = n; + } + else + { + struct goacc_thread *thr = goacc_thread (); + + gomp_mutex_lock (&acc_device_lock); + + base_dev = lazy_init (d); + + num_devices = base_dev->get_num_devices_func (); + + if (n >= num_devices) + gomp_fatal ("device %u out of range", n); + + /* If we're changing the device number, de-associate this thread with + the device (but don't close the device, since it may be in use by + other threads). */ + if (thr && thr->dev && n != thr->dev->target_id) + thr->dev = NULL; + + lazy_open (n); + + gomp_mutex_unlock (&acc_device_lock); + } +} + +ialias (acc_set_device_num) + +int +acc_on_device (acc_device_t dev) +{ + struct goacc_thread *thr = goacc_thread (); + + if (thr && thr->dev + && acc_device_type (thr->dev->type) == acc_device_host_nonshm) + return dev == acc_device_host_nonshm || dev == acc_device_not_host; + + /* Just rely on the compiler builtin. */ + return __builtin_acc_on_device (dev); +} +ialias (acc_on_device) + +attribute_hidden void +goacc_runtime_initialize (void) +{ + gomp_mutex_init (&acc_device_lock); + +#ifndef HAVE_TLS + pthread_key_create (&goacc_tls_key, NULL); +#endif + + pthread_key_create (&goacc_cleanup_key, goacc_destroy_thread); + + base_dev = NULL; + + goacc_threads = NULL; + gomp_mutex_init (&goacc_thread_lock); +} + +/* Compiler helper functions */ + +attribute_hidden void +goacc_save_and_set_bind (acc_device_t d) +{ + struct goacc_thread *thr = goacc_thread (); + + assert (!thr->saved_bound_dev); + + thr->saved_bound_dev = thr->dev; + thr->dev = (struct gomp_device_descr *) dispatchers[d]; +} + +attribute_hidden void +goacc_restore_bind (void) +{ + struct goacc_thread *thr = goacc_thread (); + + thr->dev = thr->saved_bound_dev; + thr->saved_bound_dev = NULL; +} + +/* This is called from any OpenACC support function that may need to implicitly + initialize the libgomp runtime. On exit all such initialization will have + been done, and both the global ACC_dev and the per-host-thread ACC_memmap + pointers will be valid. */ + +attribute_hidden void +goacc_lazy_initialize (void) +{ + struct goacc_thread *thr = goacc_thread (); + + if (thr && thr->dev) + return; + + if (!base_dev) + lazy_init_and_open (acc_device_default); + else + { + gomp_mutex_lock (&acc_device_lock); + lazy_open (-1); + gomp_mutex_unlock (&acc_device_lock); + } +} diff --git a/libgomp/oacc-int.h b/libgomp/oacc-int.h new file mode 100644 index 0000000..c333a20 --- /dev/null +++ b/libgomp/oacc-int.h @@ -0,0 +1,106 @@ +/* OpenACC Runtime - internal declarations + + Copyright (C) 2005-2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* This file contains data types and function declarations that are not + part of the official OpenACC user interface. There are declarations + in here that are part of the GNU OpenACC ABI, in that the compiler is + required to know about them and use them. + + The convention is that the all caps prefix "GOACC" is used group items + that are part of the external ABI, and the lower case prefix "goacc" + is used group items that are completely private to the library. */ + +#ifndef _OACC_INT_H +#define _OACC_INT_H 1 + +#include "openacc.h" +#include "config.h" +#include +#include +#include + +#ifdef HAVE_ATTRIBUTE_VISIBILITY +# pragma GCC visibility push(hidden) +#endif + +static inline enum acc_device_t +acc_device_type (enum offload_target_type type) +{ + return (enum acc_device_t) type; +} + +struct goacc_thread +{ + /* The device for the current thread. */ + struct gomp_device_descr *dev; + + struct gomp_device_descr *saved_bound_dev; + + /* This is a linked list of data mapped by the "acc data" pragma, following + strictly push/pop semantics according to lexical scope. */ + struct target_mem_desc *mapped_data; + + /* These structures form a list: this is the next thread in that list. */ + struct goacc_thread *next; + + /* Target-specific data (used by plugin). */ + void *target_tls; +}; + +#ifdef HAVE_TLS +extern __thread struct goacc_thread *goacc_tls_data; +static inline struct goacc_thread * +goacc_thread (void) +{ + return goacc_tls_data; +} +#else +extern pthread_key_t goacc_tls_key; +static inline struct goacc_thread * +goacc_thread (void) +{ + return pthread_getspecific (goacc_tls_key); +} +#endif + +struct gomp_device_descr; + +void goacc_register (struct gomp_device_descr const *) __GOACC_NOTHROW; + +/* Current dispatcher. */ +extern struct gomp_device_descr const *base_dev; + +void goacc_runtime_initialize (void); +void goacc_save_and_set_bind (acc_device_t); +void goacc_restore_bind (void); +void goacc_lazy_initialize (void); + +#ifdef HAVE_ATTRIBUTE_VISIBILITY +# pragma GCC visibility pop +#endif + +#endif /* _OACC_INT_H */ diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c new file mode 100644 index 0000000..ac1ea47 --- /dev/null +++ b/libgomp/oacc-mem.c @@ -0,0 +1,510 @@ +/* OpenACC Runtime initialization routines + + Copyright (C) 2013 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +#include "openacc.h" +#include "config.h" +#include "libgomp.h" +#include "gomp-constants.h" +#include "libgomp_target.h" +#include "oacc-int.h" +#include +#include +#include + +#include "splay-tree.h" + +/* Return block containing [H->S), or NULL if not contained. */ + +attribute_hidden splay_tree_key +lookup_host (struct gomp_memory_mapping *mem_map, void *h, size_t s) +{ + struct splay_tree_key_s node; + splay_tree_key key; + + node.host_start = (uintptr_t) h; + node.host_end = (uintptr_t) h + s; + + gomp_mutex_lock (&mem_map->lock); + + key = splay_tree_lookup (&mem_map->splay_tree, &node); + + gomp_mutex_unlock (&mem_map->lock); + + return key; +} + +/* Return block containing [D->S), or NULL if not contained. + The list isn't ordered by device address, so we have to iterate + over the whole array. This is not expected to be a common + operation. */ + +static splay_tree_key +lookup_dev (struct target_mem_desc *tgt, void *d, size_t s) +{ + int i; + struct target_mem_desc *t; + struct gomp_memory_mapping *mem_map; + + if (!tgt) + return NULL; + + mem_map = tgt->mem_map; + + gomp_mutex_lock (&mem_map->lock); + + for (t = tgt; t != NULL; t = t->prev) + { + if (t->tgt_start <= (uintptr_t) d && t->tgt_end >= (uintptr_t) d + s) + break; + } + + gomp_mutex_unlock (&mem_map->lock); + + if (!t) + return NULL; + + for (i = 0; i < t->list_count; i++) + { + void * offset; + + splay_tree_key k = &t->array[i].key; + offset = d - t->tgt_start + k->tgt_offset; + + if (k->host_start + offset <= (void *) k->host_end) + return k; + } + + return NULL; +} + +/* OpenACC is silent on how memory exhaustion is indicated. We return + NULL. */ + +void * +acc_malloc (size_t s) +{ + if (!s) + return NULL; + + goacc_lazy_initialize (); + + struct goacc_thread *thr = goacc_thread (); + + return base_dev->alloc_func (thr->dev->target_id, s); +} + +/* OpenACC 2.0a (3.2.16) doesn't specify what to do in the event + the device address is mapped. We choose to check if it mapped, + and if it is, to unmap it. */ +void +acc_free (void *d) +{ + splay_tree_key k; + struct goacc_thread *thr = goacc_thread (); + + if (!d) + return; + + /* We don't have to call lazy open here, as the ptr value must have + been returned by acc_malloc. It's not permitted to pass NULL in + (unless you got that null from acc_malloc). */ + if ((k = lookup_dev (thr->dev->openacc.data_environ, d, 1))) + { + void *offset; + + offset = d - k->tgt->tgt_start + k->tgt_offset; + + acc_unmap_data ((void *)(k->host_start + offset)); + } + + base_dev->free_func (thr->dev->target_id, d); +} + +void +acc_memcpy_to_device (void *d, void *h, size_t s) +{ + /* No need to call lazy open here, as the device pointer must have + been obtained from a routine that did that. */ + struct goacc_thread *thr = goacc_thread (); + + base_dev->host2dev_func (thr->dev->target_id, d, h, s); +} + +void +acc_memcpy_from_device (void *h, void *d, size_t s) +{ + /* No need to call lazy open here, as the device pointer must have + been obtained from a routine that did that. */ + struct goacc_thread *thr = goacc_thread (); + + base_dev->dev2host_func (thr->dev->target_id, h, d, s); +} + +/* Return the device pointer that corresponds to host data H. Or NULL + if no mapping. */ + +void * +acc_deviceptr (void *h) +{ + splay_tree_key n; + void *d; + void *offset; + + goacc_lazy_initialize (); + + struct goacc_thread *thr = goacc_thread (); + + n = lookup_host (&thr->dev->mem_map, h, 1); + + if (!n) + return NULL; + + offset = h - n->host_start; + + d = n->tgt->tgt_start + n->tgt_offset + offset; + + return d; +} + +/* Return the host pointer that corresponds to device data D. Or NULL + if no mapping. */ + +void * +acc_hostptr (void *d) +{ + splay_tree_key n; + void *h; + void *offset; + + goacc_lazy_initialize (); + + struct goacc_thread *thr = goacc_thread (); + + n = lookup_dev (thr->dev->openacc.data_environ, d, 1); + + if (!n) + return NULL; + + offset = d - n->tgt->tgt_start + n->tgt_offset; + + h = n->host_start + offset; + + return h; +} + +/* Return 1 if host data [H,+S] is present on the device. */ + +int +acc_is_present (void *h, size_t s) +{ + splay_tree_key n; + + if (!s || !h) + return 0; + + goacc_lazy_initialize (); + + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + + n = lookup_host (&acc_dev->mem_map, h, s); + + if (n && ((uintptr_t)h < n->host_start + || (uintptr_t)h + s > n->host_end + || s > n->host_end - n->host_start)) + n = NULL; + + return n != NULL; +} + +/* Create a mapping for host [H,+S] -> device [D,+S] */ + +void +acc_map_data (void *h, void *d, size_t s) +{ + struct target_mem_desc *tgt; + size_t mapnum = 1; + void *hostaddrs = h; + void *devaddrs = d; + size_t sizes = s; + unsigned short kinds = GOMP_MAP_ALLOC; + + goacc_lazy_initialize (); + + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + + if (acc_dev->capabilities & TARGET_CAP_SHARED_MEM) + { + if (d != h) + gomp_fatal ("cannot map data on shared-memory system"); + + tgt = gomp_map_vars (NULL, 0, NULL, NULL, NULL, NULL, true, false); + } + else + { + struct goacc_thread *thr = goacc_thread (); + + if (!d || !h || !s) + gomp_fatal ("[%p,+%d]->[%p,+%d] is a bad map", + (void *)h, (int)s, (void *)d, (int)s); + + if (lookup_host (&acc_dev->mem_map, h, s)) + gomp_fatal ("host address [%p, +%d] is already mapped", (void *)h, + (int)s); + + if (lookup_dev (thr->dev->openacc.data_environ, d, s)) + gomp_fatal ("device address [%p, +%d] is already mapped", (void *)d, + (int)s); + + tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, &devaddrs, &sizes, + &kinds, true, false); + } + + tgt->prev = acc_dev->openacc.data_environ; + acc_dev->openacc.data_environ = tgt; +} + +void +acc_unmap_data (void *h) +{ + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + + /* No need to call lazy open, as the address must have been mapped. */ + + size_t host_size; + splay_tree_key n = lookup_host (&acc_dev->mem_map, h, 1); + struct target_mem_desc *t; + + if (!n) + gomp_fatal ("%p is not a mapped block", (void *)h); + + host_size = n->host_end - n->host_start; + + if (n->host_start != (uintptr_t) h) + gomp_fatal ("[%p,%d] surrounds1 %p", + (void *) n->host_start, (int) host_size, (void *) h); + + t = n->tgt; + + if (t->refcount == 2) + { + struct target_mem_desc *tp; + + /* This is the last reference, so pull the descriptor off the + chain. This avoids gomp_unmap_vars via gomp_unmap_tgt from + freeing the device memory. */ + t->tgt_end = 0; + t->to_free = 0; + + gomp_mutex_lock (&acc_dev->mem_map.lock); + + for (tp = NULL, t = acc_dev->openacc.data_environ; t != NULL; + tp = t, t = t->prev) + if (n->tgt == t) + { + if (tp) + tp->prev = t->prev; + else + acc_dev->openacc.data_environ = t->prev; + + break; + } + + gomp_mutex_unlock (&acc_dev->mem_map.lock); + } + + gomp_unmap_vars (t, true); +} + +#define PCC_Present (1 << 0) +#define PCC_Create (1 << 1) +#define PCC_Copy (1 << 2) + +attribute_hidden void * +present_create_copy (unsigned f, void *h, size_t s) +{ + void *d; + splay_tree_key n; + + if (!h || !s) + gomp_fatal ("[%p,+%d] is a bad range", (void *)h, (int)s); + + goacc_lazy_initialize (); + + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + + n = lookup_host (&acc_dev->mem_map, h, s); + if (n) + { + /* Present. */ + d = (void *) (n->tgt->tgt_start + n->tgt_offset); + + if (!(f & PCC_Present)) + gomp_fatal ("[%p,+%d] already mapped to [%p,+%d]", + (void *)h, (int)s, (void *)d, (int)s); + if ((h + s) > (void *)n->host_end) + gomp_fatal ("[%p,+%d] not mapped", (void *)h, (int)s); + } + else if (!(f & PCC_Create)) + { + gomp_fatal ("[%p,+%d] not mapped", (void *)h, (int)s); + } + else + { + struct target_mem_desc *tgt; + size_t mapnum = 1; + unsigned short kinds; + void *hostaddrs = h; + + if (f & PCC_Copy) + kinds = GOMP_MAP_ALLOC_TO; + else + kinds = GOMP_MAP_ALLOC; + + tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true, + false); + + gomp_mutex_lock (&acc_dev->mem_map.lock); + + d = tgt->to_free; + tgt->prev = acc_dev->openacc.data_environ; + acc_dev->openacc.data_environ = tgt; + + gomp_mutex_unlock (&acc_dev->mem_map.lock); + } + + return d; +} + +void * +acc_create (void *h, size_t s) +{ + return present_create_copy (PCC_Create, h, s); +} + +void * +acc_copyin (void *h, size_t s) +{ + return present_create_copy (PCC_Create | PCC_Copy, h, s); +} + +void * +acc_present_or_create (void *h, size_t s) +{ + return present_create_copy (PCC_Present | PCC_Create, h, s); +} + +void * +acc_present_or_copyin (void *h, size_t s) +{ + return present_create_copy (PCC_Present | PCC_Create | PCC_Copy, h, s); +} + +#define DC_Copyout (1 << 0) + +static void +delete_copyout (unsigned f, void *h, size_t s) +{ + size_t host_size; + splay_tree_key n; + void *d; + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + + n = lookup_host (&acc_dev->mem_map, h, s); + + /* No need to call lazy open, as the data must already have been + mapped. */ + + if (!n) + gomp_fatal ("[%p,%d] is not mapped", (void *)h, (int)s); + + d = (void *) (n->tgt->tgt_start + n->tgt_offset); + + host_size = n->host_end - n->host_start; + + if (n->host_start != (uintptr_t) h || host_size != s) + gomp_fatal ("[%p,%d] surrounds2 [%p,+%d]", + (void *) n->host_start, (int) host_size, (void *) h, (int) s); + + if (f & DC_Copyout) + acc_dev->dev2host_func (acc_dev->target_id, h, d, s); + + acc_unmap_data (h); + + acc_dev->free_func (acc_dev->target_id, d); +} + +void +acc_delete (void *h , size_t s) +{ + delete_copyout (0, h, s); +} + +void acc_copyout (void *h, size_t s) +{ + delete_copyout (DC_Copyout, h, s); +} + +static void +update_dev_host (int is_dev, void *h, size_t s) +{ + splay_tree_key n; + void *d; + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + + n = lookup_host (&acc_dev->mem_map, h, s); + + /* No need to call lazy open, as the data must already have been + mapped. */ + + if (!n) + gomp_fatal ("[%p,%d] is not mapped", h, (int)s); + + d = (void *) (n->tgt->tgt_start + n->tgt_offset); + + if (is_dev) + acc_dev->host2dev_func (acc_dev->target_id, d, h, s); + else + acc_dev->dev2host_func (acc_dev->target_id, h, d, s); +} + +void +acc_update_device (void *h, size_t s) +{ + update_dev_host (1, h, s); +} + +void +acc_update_self (void *h, size_t s) +{ + update_dev_host (0, h, s); +} diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c new file mode 100644 index 0000000..0ff44bf --- /dev/null +++ b/libgomp/oacc-parallel.c @@ -0,0 +1,388 @@ +/* Copyright (C) 2013-2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* This file handles OpenACC constructs. */ + +#include "openacc.h" +#include "libgomp.h" +#include "libgomp_g.h" +#include "gomp-constants.h" +#include "libgomp_target.h" +#include "oacc-int.h" +#include +#include +#include +#include +#include + +static void +dump_var (char *s, size_t idx, void *hostaddr, size_t size, unsigned char kind) +{ + gomp_notify (" %2zi: %3s 0x%.2x -", idx, s, kind & 0xff); + + switch (kind & 0xff) + { + case 0x00: gomp_notify (" ALLOC "); break; + case 0x01: gomp_notify (" ALLOC TO "); break; + case 0x02: gomp_notify (" ALLOC FROM "); break; + case 0x03: gomp_notify (" ALLOC TOFROM "); break; + case 0x04: gomp_notify (" POINTER "); break; + case 0x05: gomp_notify (" TO_PSET "); break; + + case 0x08: gomp_notify (" FORCE_ALLOC "); break; + case 0x09: gomp_notify (" FORCE_TO "); break; + case 0x0a: gomp_notify (" FORCE_FROM "); break; + case 0x0b: gomp_notify (" FORCE_TOFROM "); break; + case 0x0c: gomp_notify (" FORCE_PRESENT "); break; + case 0x0d: gomp_notify (" FORCE_DEALLOC "); break; + case 0x0e: gomp_notify (" FORCE_DEVICEPTR "); break; + + case 0x18: gomp_notify (" FORCE_PRIVATE "); break; + case 0x19: gomp_notify (" FORCE_FIRSTPRIVATE "); break; + + case (unsigned char) -1: gomp_notify (" DUMMY "); break; + default: gomp_notify ("UGH! 0x%x\n", kind); + } + + gomp_notify ("- %d - %4d/0x%04x ", 1 << (kind >> 8), (int) size, (int) size); + gomp_notify ("- %p\n", hostaddr); +} + +/* Ensure that the target device for DEVICE_TYPE is initialised (and that + plugins have been loaded if appropriate). The ACC_dev variable for the + current thread will be set appropriately for the given device type on + return. */ + +attribute_hidden void +select_acc_device (int device_type) +{ + goacc_lazy_initialize (); + + if (device_type == GOMP_IF_CLAUSE_FALSE) + return; + + if (device_type == acc_device_none) + device_type = acc_device_host; + + if (device_type >= 0) + { + /* NOTE: this will go badly if the surrounding data environment is set up + to use a different device type. We'll just have to trust that users + know what they're doing... */ + acc_set_device_type (device_type); + } +} + +void goacc_wait (int async, int num_waits, va_list ap); + +void +GOACC_parallel (int device, void (*fn) (void *), const void *openmp_target, + size_t mapnum, void **hostaddrs, size_t *sizes, + unsigned short *kinds, + int num_gangs, int num_workers, int vector_length, + int async, int num_waits, ...) +{ + bool if_clause_condition_value = device != GOMP_IF_CLAUSE_FALSE; + va_list ap; + struct goacc_thread *thr; + struct gomp_device_descr *acc_dev; + struct target_mem_desc *tgt; + void **devaddrs; + unsigned int i; + struct splay_tree_key_s k; + splay_tree_key tgt_fn_key; + void (*tgt_fn); + + if (num_gangs != 1) + gomp_fatal ("num_gangs (%d) different from one is not yet supported", + num_gangs); + if (num_workers != 1) + gomp_fatal ("num_workers (%d) different from one is not yet supported", + num_workers); + + gomp_notify ("%s: mapnum=%zd, hostaddrs=%p, sizes=%p, kinds=%p, async=%d\n", + __FUNCTION__, mapnum, hostaddrs, sizes, kinds, async); + + select_acc_device (device); + + thr = goacc_thread (); + acc_dev = thr->dev; + + /* Host fallback if "if" clause is false or if the current device is set to + the host. */ + if (!if_clause_condition_value) + { + goacc_save_and_set_bind (acc_device_host); + fn (hostaddrs); + goacc_restore_bind (); + return; + } + else if (acc_device_type (acc_dev->type) == acc_device_host) + { + fn (hostaddrs); + return; + } + + va_start (ap, num_waits); + + if (num_waits > 0) + goacc_wait (async, num_waits, ap); + + va_end (ap); + + acc_dev->openacc.async_set_async_func (async); + + if (!(acc_dev->capabilities & TARGET_CAP_NATIVE_EXEC)) + { + k.host_start = (uintptr_t) fn; + k.host_end = k.host_start + 1; + gomp_mutex_lock (&acc_dev->mem_map.lock); + tgt_fn_key = splay_tree_lookup (&acc_dev->mem_map.splay_tree, &k); + gomp_mutex_unlock (&acc_dev->mem_map.lock); + + if (tgt_fn_key == NULL) + gomp_fatal ("target function wasn't mapped: perhaps -fopenacc was " + "used without -flto?"); + + tgt_fn = (void (*)) tgt_fn_key->tgt->tgt_start; + } + else + tgt_fn = (void (*)) fn; + + tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs, NULL, sizes, kinds, true, + false); + + devaddrs = alloca (sizeof (void *) * mapnum); + for (i = 0; i < mapnum; i++) + devaddrs[i] = (void *) (tgt->list[i]->tgt->tgt_start + + tgt->list[i]->tgt_offset); + + acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs, sizes, kinds, + num_gangs, num_workers, vector_length, async, + tgt); + + /* If running synchronously, unmap immediately. */ + if (async < acc_async_noval) + gomp_unmap_vars (tgt, true); + else + { + gomp_copy_from_async (tgt); + acc_dev->openacc.register_async_cleanup_func (tgt); + } + + acc_dev->openacc.async_set_async_func (acc_async_sync); +} + +void +GOACC_data_start (int device, const void *openmp_target, size_t mapnum, + void **hostaddrs, size_t *sizes, unsigned short *kinds) +{ + bool if_clause_condition_value = device != GOMP_IF_CLAUSE_FALSE; + struct target_mem_desc *tgt; + + gomp_notify ("%s: mapnum=%zd, hostaddrs=%p, sizes=%p, kinds=%p\n", + __FUNCTION__, mapnum, hostaddrs, sizes, kinds); + + select_acc_device (device); + + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + + /* Host fallback or 'do nothing'. */ + if ((acc_dev->capabilities & TARGET_CAP_SHARED_MEM) + || !if_clause_condition_value) + { + tgt = gomp_map_vars (NULL, 0, NULL, NULL, NULL, NULL, true, false); + tgt->prev = thr->mapped_data; + thr->mapped_data = tgt; + + return; + } + + gomp_notify (" %s: prepare mappings\n", __FUNCTION__); + tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs, NULL, sizes, kinds, true, + false); + gomp_notify (" %s: mappings prepared\n", __FUNCTION__); + tgt->prev = thr->mapped_data; + thr->mapped_data = tgt; +} + +void +GOACC_data_end (void) +{ + struct goacc_thread *thr = goacc_thread (); + struct target_mem_desc *tgt = thr->mapped_data; + + gomp_notify (" %s: restore mappings\n", __FUNCTION__); + thr->mapped_data = tgt->prev; + gomp_unmap_vars (tgt, true); + gomp_notify (" %s: mappings restored\n", __FUNCTION__); +} + + +void +GOACC_kernels (int device, void (*fn) (void *), const void *openmp_target, + size_t mapnum, void **hostaddrs, size_t *sizes, + unsigned short *kinds, + int num_gangs, int num_workers, int vector_length, + int async, int num_waits, ...) +{ + gomp_notify ("%s: mapnum=%zd, hostaddrs=%p, sizes=%p, kinds=%p\n", + __FUNCTION__, mapnum, hostaddrs, sizes, kinds); + + va_list ap; + + select_acc_device (device); + + va_start (ap, num_waits); + + if (num_waits > 0) + goacc_wait (async, num_waits, ap); + + va_end (ap); + + GOACC_parallel (device, fn, openmp_target, mapnum, hostaddrs, sizes, kinds, + num_gangs, num_workers, vector_length, async, 0); +} + +void +goacc_wait (int async, int num_waits, va_list ap) +{ + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + int i; + + assert (num_waits >= 0); + + if (async == acc_async_sync && num_waits == 0) + { + acc_wait_all (); + return; + } + + if (async == acc_async_sync && num_waits) + { + for (i = 0; i < num_waits; i++) + { + int qid = va_arg (ap, int); + + if (acc_async_test (qid)) + continue; + + acc_wait (qid); + } + return; + } + + if (async == acc_async_noval && num_waits == 0) + { + acc_dev->openacc.async_wait_all_async_func (acc_async_noval); + return; + } + + for (i = 0; i < num_waits; i++) + { + int qid = va_arg (ap, int); + + if (acc_async_test (qid)) + continue; + + /* If we're waiting on the same asynchronous queue as we're launching on, + the queue itself will order work as required, so there's no need to + wait explicitly. */ + if (qid != async) + acc_dev->openacc.async_wait_async_func (qid, async); + } +} + +void +GOACC_update (int device, const void *openmp_target, size_t mapnum, + void **hostaddrs, size_t *sizes, unsigned short *kinds, + int async, int num_waits, ...) +{ + bool if_clause_condition_value = device != GOMP_IF_CLAUSE_FALSE; + size_t i; + + select_acc_device (device); + + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + + if ((acc_dev->capabilities & TARGET_CAP_SHARED_MEM) + || !if_clause_condition_value) + return; + + if (num_waits > 0) + { + va_list ap; + + va_start (ap, num_waits); + + goacc_wait (async, num_waits, ap); + + va_end (ap); + } + + acc_dev->openacc.async_set_async_func (async); + + for (i = 0; i < mapnum; ++i) + { + unsigned char kind = kinds[i] & 0xff; + + dump_var ("UPD", i, hostaddrs[i], sizes[i], kinds[i]); + + switch (kind) + { + case GOMP_MAP_POINTER: + break; + + case GOMP_MAP_FORCE_TO: + acc_update_device (hostaddrs[i], sizes[i]); + break; + + case GOMP_MAP_FORCE_FROM: + acc_update_self (hostaddrs[i], sizes[i]); + break; + + default: + gomp_fatal (">>>> GOACC_update UNHANDLED kind 0x%.2x", kind); + break; + } + } + + acc_dev->openacc.async_set_async_func (acc_async_sync); +} + +void +GOACC_wait (int async, int num_waits, ...) +{ + va_list ap; + + va_start (ap, num_waits); + + goacc_wait (async, num_waits, ap); + + va_end (ap); +} diff --git a/libgomp/oacc-plugin.c b/libgomp/oacc-plugin.c new file mode 100644 index 0000000..357cb5f --- /dev/null +++ b/libgomp/oacc-plugin.c @@ -0,0 +1,48 @@ +/* Copyright (C) 2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* Initialize and register OpenACC dispatch table from libgomp plugin. */ + +#include "libgomp.h" +#include "oacc-plugin.h" +#include "libgomp_target.h" +#include "oacc-int.h" + +void +GOMP_PLUGIN_async_unmap_vars (void *ptr) +{ + struct target_mem_desc *tgt = ptr; + + gomp_unmap_vars (tgt, false); +} + +/* Return the target-specific part of the TLS data for the current thread. */ + +void * +GOMP_PLUGIN_acc_thread (void) +{ + struct goacc_thread *thr = goacc_thread (); + return thr ? thr->target_tls : NULL; +} diff --git a/libgomp/oacc-plugin.h b/libgomp/oacc-plugin.h new file mode 100644 index 0000000..d05a28f --- /dev/null +++ b/libgomp/oacc-plugin.h @@ -0,0 +1,32 @@ +/* Copyright (C) 2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +#ifndef _OACC_PLUGIN_H +#define _OACC_PLUGIN_H 1 + +extern void GOMP_PLUGIN_async_unmap_vars (void *ptr); +extern void *GOMP_PLUGIN_acc_thread (void); + +#endif diff --git a/libgomp/openacc.f90 b/libgomp/openacc.f90 new file mode 100644 index 0000000..a344929 --- /dev/null +++ b/libgomp/openacc.f90 @@ -0,0 +1,803 @@ +! OpenACC Runtime Library Definitions. + +! Copyright (C) 2014 Free Software Foundation, Inc. + +! Contributed by Tobias Burnus +! and Mentor Embedded. + +! This file is part of the GNU OpenMP Library (libgomp). + +! Libgomp is free software; you can redistribute it and/or modify it +! under the terms of the GNU General Public License as published by +! the Free Software Foundation; either version 3, or (at your option) +! any later version. + +! Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY +! WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +! FOR A PARTICULAR PURPOSE. See the GNU General Public License for +! more details. + +! Under Section 7 of GPL version 3, you are granted additional +! permissions described in the GCC Runtime Library Exception, version +! 3.1, as published by the Free Software Foundation. + +! You should have received a copy of the GNU General Public License and +! a copy of the GCC Runtime Library Exception along with this program; +! see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +! . + +module openacc_kinds + use iso_fortran_env, only: int32 + implicit none + + private :: int32 + public :: acc_device_kind + + integer, parameter :: acc_device_kind = int32 + + public :: acc_device_none, acc_device_default, acc_device_host + public :: acc_device_not_host, acc_device_nvidia + + integer (acc_device_kind), parameter :: acc_device_none = 0 + integer (acc_device_kind), parameter :: acc_device_default = 1 + integer (acc_device_kind), parameter :: acc_device_host = 2 + integer (acc_device_kind), parameter :: acc_device_host_nonshm = 3 + integer (acc_device_kind), parameter :: acc_device_not_host = 4 + integer (acc_device_kind), parameter :: acc_device_nvidia = 5 + + public :: acc_handle_kind + + integer, parameter :: acc_handle_kind = int32 + + public :: acc_async_noval, acc_async_sync + + integer (acc_handle_kind), parameter :: acc_async_noval = -1 + integer (acc_handle_kind), parameter :: acc_async_sync = -2 + +end module + +module openacc_internal + use openacc_kinds + implicit none + + interface + function acc_async_test_h (a) + logical acc_async_test_h + integer a + end function + + function acc_async_test_all_h () + logical acc_async_test_all_h + end function + + function acc_on_device_h (d) + import + integer (acc_device_kind) d + logical acc_on_device_h + end function + + subroutine acc_copyin_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_copyin_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_copyin_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + + subroutine acc_present_or_copyin_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_present_or_copyin_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_present_or_copyin_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + + subroutine acc_create_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_create_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_create_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + + subroutine acc_present_or_create_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_present_or_create_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_present_or_create_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + + subroutine acc_copyout_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_copyout_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_copyout_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + + subroutine acc_delete_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_delete_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_delete_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + + subroutine acc_update_device_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_update_device_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_update_device_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + + subroutine acc_update_self_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_update_self_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_update_self_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + + function acc_is_present_32_h (a, len) + use iso_c_binding, only: c_int32_t + logical acc_is_present_32_h + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end function + + function acc_is_present_64_h (a, len) + use iso_c_binding, only: c_int64_t + logical acc_is_present_64_h + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end function + + function acc_is_present_array_h (a) + logical acc_is_present_array_h + type (*), dimension (..), contiguous :: a + end function + end interface + + interface + function acc_async_test_l (a) & + bind (C, name = "acc_async_test") + use iso_c_binding, only: c_int + integer (c_int) :: acc_async_test_l + integer (c_int), value :: a + end function + + function acc_async_test_all_l () & + bind (C, name = "acc_async_test_all") + use iso_c_binding, only: c_int + integer (c_int) :: acc_async_test_all_l + end function + + function acc_on_device_l (d) & + bind (C, name = "acc_on_device") + use iso_c_binding, only: c_int + integer (c_int) :: acc_on_device_l + integer (c_int), value :: d + end function + + subroutine acc_copyin_l (a, len) & + bind (C, name = "acc_copyin") + use iso_c_binding, only: c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end subroutine + + subroutine acc_present_or_copyin_l (a, len) & + bind (C, name = "acc_present_or_copyin") + use iso_c_binding, only: c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end subroutine + + subroutine acc_create_l (a, len) & + bind (C, name = "acc_create") + use iso_c_binding, only: c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end subroutine + + subroutine acc_present_or_create_l (a, len) & + bind (C, name = "acc_present_or_create") + use iso_c_binding, only: c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end subroutine + + subroutine acc_copyout_l (a, len) & + bind (C, name = "acc_copyout") + use iso_c_binding, only: c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end subroutine + + subroutine acc_delete_l (a, len) & + bind (C, name = "acc_delete") + use iso_c_binding, only: c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end subroutine + + subroutine acc_update_device_l (a, len) & + bind (C, name = "acc_update_device") + use iso_c_binding, only: c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end subroutine + + subroutine acc_update_self_l (a, len) & + bind (C, name = "acc_update_self") + use iso_c_binding, only: c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end subroutine + + function acc_is_present_l (a, len) & + bind (C, name = "acc_is_present") + use iso_c_binding, only: c_int32_t, c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + integer (c_int32_t) :: acc_is_present_l + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end function + end interface +end module + +module openacc + use openacc_kinds + use openacc_internal + implicit none + + public :: openacc_version + + public :: acc_get_num_devices, acc_set_device_type, acc_get_device_type + public :: acc_set_device_num, acc_get_device_num, acc_async_test + public :: acc_async_test_all, acc_wait, acc_wait_async, acc_wait_all + public :: acc_wait_all_async, acc_init, acc_shutdown, acc_on_device + public :: acc_copyin, acc_present_or_copyin, acc_pcopyin, acc_create + public :: acc_present_or_create, acc_pcreate, acc_copyout, acc_delete + public :: acc_update_device, acc_update_self, acc_is_present + + integer, parameter :: openacc_version = 201306 + + interface acc_get_num_devices + function acc_get_num_devices (d) & + bind (C, name = "acc_get_num_devices") + use iso_c_binding, only: c_int + integer (c_int) :: acc_get_num_devices + integer (c_int), value :: d + end function + end interface + + interface acc_set_device_type + subroutine acc_set_device_type (d) & + bind (C, name = "acc_set_device_type") + use iso_c_binding, only: c_int + integer (c_int), value :: d + end subroutine + end interface + + interface acc_get_device_type + function acc_get_device_type () & + bind (C, name = "acc_get_device_type") + use iso_c_binding, only: c_int + integer (c_int) :: acc_get_device_type + end function + end interface + + interface acc_set_device_num + subroutine acc_set_device_num (n, d) & + bind (C, name = "acc_set_device_num") + use iso_c_binding, only: c_int + integer (c_int), value :: n, d + end subroutine + end interface + + interface acc_get_device_num + function acc_get_device_num (d) & + bind (C, name = "acc_get_device_num") + use iso_c_binding, only: c_int + integer (c_int) :: acc_get_device_num + integer (c_int), value :: d + end function + end interface + + interface acc_async_test + procedure :: acc_async_test_h + end interface + + interface acc_async_test_all + procedure :: acc_async_test_all_h + end interface + + interface acc_wait + subroutine acc_wait (a) & + bind (C, name = "acc_wait") + use iso_c_binding, only: c_int + integer (c_int), value :: a + end subroutine + end interface + + interface acc_wait_async + subroutine acc_wait_async (a1, a2) & + bind (C, name = "acc_wait_async") + use iso_c_binding, only: c_int + integer (c_int), value :: a1, a2 + end subroutine + end interface + + interface acc_wait_all + subroutine acc_wait_all () & + bind (C, name = "acc_wait_all") + use iso_c_binding, only: c_int + end subroutine + end interface + + interface acc_wait_all_async + subroutine acc_wait_all_async (a) & + bind (C, name = "acc_wait_all_async") + use iso_c_binding, only: c_int + integer (c_int), value :: a + end subroutine + end interface + + interface acc_init + subroutine acc_init (d) & + bind (C, name = "acc_init") + use iso_c_binding, only: c_int + integer (c_int), value :: d + end subroutine + end interface + + interface acc_shutdown + subroutine acc_shutdown (d) & + bind (C, name = "acc_shutdown") + use iso_c_binding, only: c_int + integer (c_int), value :: d + end subroutine + end interface + + interface acc_on_device + procedure :: acc_on_device_h + end interface + + ! acc_malloc: Only available in C/C++ + ! acc_free: Only available in C/C++ + + ! As vendor extension, the following code supports both 32bit and 64bit + ! arguments for "size"; the OpenACC standard only permits default-kind + ! integers, which are of kind 4 (i.e. 32 bits). + ! Additionally, the two-argument version also takes arrays as argument. + ! and the one argument version also scalars. Note that the code assumes + ! that the arrays are contiguous. + + interface acc_copyin + procedure :: acc_copyin_32_h + procedure :: acc_copyin_64_h + procedure :: acc_copyin_array_h + end interface + + interface acc_present_or_copyin + procedure :: acc_present_or_copyin_32_h + procedure :: acc_present_or_copyin_64_h + procedure :: acc_present_or_copyin_array_h + end interface + + interface acc_pcopyin + procedure :: acc_present_or_copyin_32_h + procedure :: acc_present_or_copyin_64_h + procedure :: acc_present_or_copyin_array_h + end interface + + interface acc_create + procedure :: acc_create_32_h + procedure :: acc_create_64_h + procedure :: acc_create_array_h + end interface + + interface acc_present_or_create + procedure :: acc_present_or_create_32_h + procedure :: acc_present_or_create_64_h + procedure :: acc_present_or_create_array_h + end interface + + interface acc_pcreate + procedure :: acc_present_or_create_32_h + procedure :: acc_present_or_create_64_h + procedure :: acc_present_or_create_array_h + end interface + + interface acc_copyout + procedure :: acc_copyout_32_h + procedure :: acc_copyout_64_h + procedure :: acc_copyout_array_h + end interface + + interface acc_delete + procedure :: acc_delete_32_h + procedure :: acc_delete_64_h + procedure :: acc_delete_array_h + end interface + + interface acc_update_device + procedure :: acc_update_device_32_h + procedure :: acc_update_device_64_h + procedure :: acc_update_device_array_h + end interface + + interface acc_update_self + procedure :: acc_update_self_32_h + procedure :: acc_update_self_64_h + procedure :: acc_update_self_array_h + end interface + + ! acc_map_data: Only available in C/C++ + ! acc_unmap_data: Only available in C/C++ + ! acc_deviceptr: Only available in C/C++ + ! acc_hostptr: Only available in C/C++ + + interface acc_is_present + procedure :: acc_is_present_32_h + procedure :: acc_is_present_64_h + procedure :: acc_is_present_array_h + end interface + + ! acc_memcpy_to_device: Only available in C/C++ + ! acc_memcpy_from_device: Only available in C/C++ + +end module + +function acc_async_test_h (a) + use openacc_internal, only: acc_async_test_l + logical acc_async_test_h + integer a + if (acc_async_test_l (a) .eq. 1) then + acc_async_test_h = .TRUE. + else + acc_async_test_h = .FALSE. + end if +end function + +function acc_async_test_all_h () + use openacc_internal, only: acc_async_test_all_l + logical acc_async_test_all_h + if (acc_async_test_all_l () .eq. 1) then + acc_async_test_all_h = .TRUE. + else + acc_async_test_all_h = .FALSE. + end if +end function + +function acc_on_device_h (d) + use openacc_internal, only: acc_on_device_l + use openacc_kinds + integer (acc_device_kind) d + logical acc_on_device_h + if (acc_on_device_l (d) .eq. 1) then + acc_on_device_h = .TRUE. + else + acc_on_device_h = .FALSE. + end if +end function + +subroutine acc_copyin_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_copyin_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + call acc_copyin_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_copyin_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_copyin_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + call acc_copyin_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_copyin_array_h (a) + use openacc_internal, only: acc_copyin_l + type (*), dimension (..), contiguous :: a + call acc_copyin_l (a, sizeof (a)) +end subroutine + +subroutine acc_present_or_copyin_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_present_or_copyin_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + call acc_present_or_copyin_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_present_or_copyin_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_present_or_copyin_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + call acc_present_or_copyin_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_present_or_copyin_array_h (a) + use openacc_internal, only: acc_present_or_copyin_l + type (*), dimension (..), contiguous :: a + call acc_present_or_copyin_l (a, sizeof (a)) +end subroutine + +subroutine acc_create_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_create_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + call acc_create_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_create_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_create_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + call acc_create_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_create_array_h (a) + use openacc_internal, only: acc_create_l + type (*), dimension (..), contiguous :: a + call acc_create_l (a, sizeof (a)) +end subroutine + +subroutine acc_present_or_create_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_present_or_create_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + call acc_present_or_create_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_present_or_create_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_present_or_create_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + call acc_present_or_create_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_present_or_create_array_h (a) + use openacc_internal, only: acc_present_or_create_l + type (*), dimension (..), contiguous :: a + call acc_present_or_create_l (a, sizeof (a)) +end subroutine + +subroutine acc_copyout_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_copyout_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + call acc_copyout_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_copyout_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_copyout_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + call acc_copyout_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_copyout_array_h (a) + use openacc_internal, only: acc_copyout_l + type (*), dimension (..), contiguous :: a + call acc_copyout_l (a, sizeof (a)) +end subroutine + +subroutine acc_delete_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_delete_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + call acc_delete_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_delete_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_delete_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + call acc_delete_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_delete_array_h (a) + use openacc_internal, only: acc_delete_l + type (*), dimension (..), contiguous :: a + call acc_delete_l (a, sizeof (a)) +end subroutine + +subroutine acc_update_device_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_update_device_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + call acc_update_device_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_update_device_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_update_device_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + call acc_update_device_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_update_device_array_h (a) + use openacc_internal, only: acc_update_device_l + type (*), dimension (..), contiguous :: a + call acc_update_device_l (a, sizeof (a)) +end subroutine + +subroutine acc_update_self_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_update_self_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + call acc_update_self_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_update_self_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_update_self_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + call acc_update_self_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_update_self_array_h (a) + use openacc_internal, only: acc_update_self_l + type (*), dimension (..), contiguous :: a + call acc_update_self_l (a, sizeof (a)) +end subroutine + +function acc_is_present_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_is_present_l + logical acc_is_present_32_h + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + if (acc_is_present_l (a, int (len, kind = c_size_t)) .eq. 1) then + acc_is_present_32_h = .TRUE. + else + acc_is_present_32_h = .FALSE. + end if +end function + +function acc_is_present_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_is_present_l + logical acc_is_present_64_h + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + if (acc_is_present_l (a, int (len, kind = c_size_t)) .eq. 1) then + acc_is_present_64_h = .TRUE. + else + acc_is_present_64_h = .FALSE. + end if +end function + +function acc_is_present_array_h (a) + use openacc_internal, only: acc_is_present_l + logical acc_is_present_array_h + type (*), dimension (..), contiguous :: a + acc_is_present_array_h = acc_is_present_l (a, sizeof (a)) == 1 +end function diff --git a/libgomp/openacc.h b/libgomp/openacc.h new file mode 100644 index 0000000..01e0722 --- /dev/null +++ b/libgomp/openacc.h @@ -0,0 +1,127 @@ +/* OpenACC Runtime Library User-facing Declarations + + Copyright (C) 2013-2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +#ifndef _OPENACC_H +#define _OPENACC_H 1 + +#include "gomp-constants.h" + +/* The OpenACC std is silent on whether or not including openacc.h + might or must not include other header files. We chose to include + some. */ +#include + +#ifdef __cplusplus +extern "C" { +#endif + +#if __cplusplus >= 201103 +# define __GOACC_NOTHROW noexcept () +#elif __cplusplus +# define __GOACC_NOTHROW throw () +#else /* Not C++ */ +# define __GOACC_NOTHROW __attribute__ ((__nothrow__)) +#endif + + /* Types */ + typedef enum acc_device_t + { + acc_device_none = 0, + acc_device_default, /* This has to be a distinct value, as no + return value can match it. */ + acc_device_host = GOMP_TARGET_HOST, + acc_device_host_nonshm = GOMP_TARGET_HOST_NONSHM, + acc_device_not_host, + acc_device_nvidia = GOMP_TARGET_NVIDIA_PTX, + _ACC_device_hwm + } acc_device_t; + + typedef enum acc_async_t + { + acc_async_noval = -1, + acc_async_sync = -2 + } acc_async_t; + + int acc_get_num_devices (acc_device_t __dev) __GOACC_NOTHROW; + void acc_set_device_type (acc_device_t __dev) __GOACC_NOTHROW; + acc_device_t acc_get_device_type (void) __GOACC_NOTHROW; + void acc_set_device_num (int __num, acc_device_t __dev) __GOACC_NOTHROW; + int acc_get_device_num (acc_device_t __dev) __GOACC_NOTHROW; + int acc_async_test (int __async) __GOACC_NOTHROW; + int acc_async_test_all (void) __GOACC_NOTHROW; + void acc_wait (int __async) __GOACC_NOTHROW; + void acc_wait_async (int __async1, int __async2) __GOACC_NOTHROW; + void acc_wait_all (void) __GOACC_NOTHROW; + void acc_wait_all_async (int __async) __GOACC_NOTHROW; + void acc_init (acc_device_t __dev) __GOACC_NOTHROW; + void acc_shutdown (acc_device_t __dev) __GOACC_NOTHROW; + int acc_on_device (acc_device_t __dev) __GOACC_NOTHROW; + void *acc_malloc (size_t) __GOACC_NOTHROW; + void acc_free (void *) __GOACC_NOTHROW; + /* Some of these would be more correct with const qualifiers, but + the standard specifies otherwise. */ + void *acc_copyin (void *, size_t) __GOACC_NOTHROW; + void *acc_present_or_copyin (void *, size_t) __GOACC_NOTHROW; + void *acc_create (void *, size_t) __GOACC_NOTHROW; + void *acc_present_or_create (void *, size_t) __GOACC_NOTHROW; + void acc_copyout (void *, size_t) __GOACC_NOTHROW; + void acc_delete (void *, size_t) __GOACC_NOTHROW; + void acc_update_device (void *, size_t) __GOACC_NOTHROW; + void acc_update_self (void *, size_t) __GOACC_NOTHROW; + void acc_map_data (void *, void *, size_t) __GOACC_NOTHROW; + void acc_unmap_data (void *) __GOACC_NOTHROW; + void *acc_deviceptr (void *) __GOACC_NOTHROW; + void *acc_hostptr (void *) __GOACC_NOTHROW; + int acc_is_present (void *, size_t) __GOACC_NOTHROW; + void acc_memcpy_to_device (void *, void *, size_t) __GOACC_NOTHROW; + void acc_memcpy_from_device (void *, void *, size_t) __GOACC_NOTHROW; + + void ACC_target (int, void (*) (void *), const void *, + size_t, void **, size_t *, unsigned char *, int *) __GOACC_NOTHROW; + void ACC_parallel (int, void (*) (void *), const void *, + size_t, void **, size_t *, unsigned char *) __GOACC_NOTHROW; + void ACC_add_device_code (void const *, char const *) __GOACC_NOTHROW; + + void ACC_async_copy (int) __GOACC_NOTHROW; + void ACC_async_kern (int) __GOACC_NOTHROW; + + /* Old names. OpenACC does not specify whether these can or must + not be macros, inlines or aliases for the new names. */ + #define acc_pcreate acc_present_or_create + #define acc_pcopyin acc_present_or_copyin + + /* CUDA-specific routines. */ + void *acc_get_current_cuda_device (void) __GOACC_NOTHROW; + void *acc_get_current_cuda_context (void) __GOACC_NOTHROW; + void *acc_get_cuda_stream (int __async) __GOACC_NOTHROW; + int acc_set_cuda_stream (int __async, void *__stream) __GOACC_NOTHROW; + +#ifdef __cplusplus +} +#endif + +#endif /* _OPENACC_H */ diff --git a/libgomp/openacc_lib.h b/libgomp/openacc_lib.h new file mode 100644 index 0000000..13118a7 --- /dev/null +++ b/libgomp/openacc_lib.h @@ -0,0 +1,390 @@ +! OpenACC Runtime Library Definitions. -*- mode: fortran -*- + +! Copyright (C) 2014 Free Software Foundation, Inc. + +! Contributed by Tobias Burnus +! and Mentor Embedded. + +! This file is part of the GNU OpenMP Library (libgomp). + +! Libgomp is free software; you can redistribute it and/or modify it +! under the terms of the GNU General Public License as published by +! the Free Software Foundation; either version 3, or (at your option) +! any later version. + +! Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY +! WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +! FOR A PARTICULAR PURPOSE. See the GNU General Public License for +! more details. + +! Under Section 7 of GPL version 3, you are granted additional +! permissions described in the GCC Runtime Library Exception, version +! 3.1, as published by the Free Software Foundation. + +! You should have received a copy of the GNU General Public License and +! a copy of the GCC Runtime Library Exception along with this program; +! see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +! . + +! NOTE: Due to the use of dimension (..), the code only works when compiled +! with -std=f2008ts/gnu/legacy but not with other standard settings. +! Alternatively, the user can use the module version, which permits +! compilation with -std=f95. + + integer, parameter :: acc_device_kind = 4 + + integer (acc_device_kind), parameter :: acc_device_none = 0 + integer (acc_device_kind), parameter :: acc_device_default = 1 + integer (acc_device_kind), parameter :: acc_device_host = 2 + integer (acc_device_kind), parameter :: acc_device_host_nonshm = 3 + integer (acc_device_kind), parameter :: acc_device_not_host = 4 + integer (acc_device_kind), parameter :: acc_device_nvidia = 5 + + integer, parameter :: acc_handle_kind = 4 + + integer (acc_handle_kind), parameter :: acc_async_noval = -1 + integer (acc_handle_kind), parameter :: acc_async_sync = -2 + + integer, parameter :: openacc_version = 201306 + + interface + function acc_get_num_devices (d) + & bind (C, name = "acc_get_num_devices") + use iso_c_binding, only: c_int + integer (c_int) :: acc_get_num_devices + integer (c_int), value :: d + end function + end interface + + interface acc_set_device_type + subroutine acc_set_device_type (d) + & bind (C, name = "acc_set_device_type") + use iso_c_binding, only: c_int + integer (c_int), value :: d + end subroutine + end interface + + interface acc_get_device_type + function acc_get_device_type () + & bind (C, name = "acc_get_device_type") + use iso_c_binding, only: c_int + integer (c_int) :: acc_get_device_type + end function + end interface + + interface acc_set_device_num + subroutine acc_set_device_num (n, d) + & bind (C, name = "acc_set_device_num") + use iso_c_binding, only: c_int + integer (c_int), value :: n, d + end subroutine + end interface + + interface acc_get_device_num + function acc_get_device_num (d) + & bind (C, name = "acc_get_device_num") + use iso_c_binding, only: c_int + integer (c_int) :: acc_get_device_num + integer (c_int), value :: d + end function + end interface + + interface acc_async_test + function acc_async_test_h (a) + logical acc_async_test_h + integer a + end function + end interface + + interface acc_async_test_all + function acc_async_test_all_h () + logical acc_async_test_all_h + end function + end interface + + interface acc_wait + subroutine acc_wait (a) + & bind (C, name = "acc_wait") + use iso_c_binding, only: c_int + integer (c_int), value :: a + end subroutine + end interface + + interface acc_wait_async + subroutine acc_wait_async (a1, a2) + & bind (C, name = "acc_wait_async") + end subroutine + end interface + + interface acc_wait_all + subroutine acc_wait_all () + & bind (C, name = "acc_wait_all") + use iso_c_binding, only: c_int + end subroutine + end interface + + interface acc_wait_all_async + subroutine acc_wait_all_async (a) + & bind (C, name = "acc_wait_all_async") + use iso_c_binding, only: c_int + integer (c_int), value :: a + end subroutine + end interface + + interface acc_init + subroutine acc_init (d) + & bind (C, name = "acc_init") + use iso_c_binding, only: c_int + integer (c_int), value :: d + end subroutine + end interface + + interface acc_shutdown + subroutine acc_shutdown (d) + & bind (C, name = "acc_shutdown") + use iso_c_binding, only: c_int + integer (c_int), value :: d + end subroutine + end interface + + interface acc_on_device + function acc_on_device_h (devicetype) + import acc_device_kind + logical acc_on_device_h + integer (acc_device_kind) devicetype + end function + end interface + + ! acc_malloc: Only available in C/C++ + ! acc_free: Only available in C/C++ + + interface acc_copyin + subroutine acc_copyin_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_copyin_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_copyin_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_present_or_copyin + subroutine acc_present_or_copyin_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_present_or_copyin_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_present_or_copyin_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_pcopyin + subroutine acc_pcopyin_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_pcopyin_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_pcopyin_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_create + subroutine acc_create_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_create_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_create_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_present_or_create + subroutine acc_present_or_create_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_present_or_create_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_present_or_create_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_pcreate + subroutine acc_pcreate_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_pcreate_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_pcreate_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_copyout + subroutine acc_copyout_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_copyout_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_copyout_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_delete + subroutine acc_delete_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_delete_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_delete_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_update_device + subroutine acc_update_device_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_update_device_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_update_device_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_update_self + subroutine acc_update_self_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_update_self_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_update_self_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + ! acc_map_data: Only available in C/C++ + ! acc_unmap_data: Only available in C/C++ + ! acc_deviceptr: Only available in C/C++ + ! acc_ostptr: Only available in C/C++ + + interface acc_is_present + function acc_is_present_32_h (a, len) + use iso_c_binding, only: c_int32_t + logical acc_is_present_32_h + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end function + + function acc_is_present_64_h (a, len) + use iso_c_binding, only: c_int64_t + logical acc_is_present_64_h + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end function + + function acc_is_present_array_h (a) + logical acc_is_present_array_h + type (*), dimension (..), contiguous :: a + end function + end interface + + ! acc_memcpy_to_device: Only available in C/C++ + ! acc_memcpy_from_device: Only available in C/C++ diff --git a/libgomp/plugin/Makefrag.am b/libgomp/plugin/Makefrag.am new file mode 100644 index 0000000..d6642d9 --- /dev/null +++ b/libgomp/plugin/Makefrag.am @@ -0,0 +1,47 @@ +# Plugins for offload execution, Makefile.am fragment. +# +# Copyright (C) 2014 Free Software Foundation, Inc. +# +# Contributed by Mentor Embedded. +# +# This file is part of the GNU OpenMP Library (libgomp). +# +# Libgomp is free software; you can redistribute it and/or modify it +# under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 3, or (at your option) +# any later version. +# +# Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY +# WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +# FOR A PARTICULAR PURPOSE. See the GNU General Public License for +# more details. +# +# Under Section 7 of GPL version 3, you are granted additional +# permissions described in the GCC Runtime Library Exception, version +# 3.1, as published by the Free Software Foundation. +# +# You should have received a copy of the GNU General Public License and +# a copy of the GCC Runtime Library Exception along with this program; +# see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +# . + +if PLUGIN_NVPTX +# Nvidia PTX OpenACC plugin. +libgomp_plugin_nvptx_version_info = -version-info $(libtool_VERSION) +toolexeclib_LTLIBRARIES += libgomp-plugin-nvptx.la +libgomp_plugin_nvptx_la_SOURCES = plugin/plugin-nvptx.c +libgomp_plugin_nvptx_la_CPPFLAGS = $(AM_CPPFLAGS) $(PLUGIN_NVPTX_CPPFLAGS) +libgomp_plugin_nvptx_la_LDFLAGS = $(libgomp_plugin_nvptx_version_info) \ + $(lt_host_flags) +libgomp_plugin_nvptx_la_LDFLAGS += $(PLUGIN_NVPTX_LDFLAGS) +libgomp_plugin_nvptx_la_LIBADD = $(PLUGIN_NVPTX_LIBS) +libgomp_plugin_nvptx_la_LIBTOOLFLAGS = --tag=disable-static +endif + +libgomp_plugin_host_nonshm_version_info = -version-info $(libtool_VERSION) +toolexeclib_LTLIBRARIES += libgomp-plugin-host_nonshm.la +libgomp_plugin_host_nonshm_la_SOURCES = plugin/plugin-host.c +libgomp_plugin_host_nonshm_la_CPPFLAGS = $(AM_CPPFLAGS) -DHOST_NONSHM_PLUGIN +libgomp_plugin_host_nonshm_la_LDFLAGS = \ + $(libgomp_plugin_host_nonshm_version_info) $(lt_host_flags) +libgomp_plugin_host_nonshm_la_LIBTOOLFLAGS = --tag=disable-static diff --git a/libgomp/plugin/configfrag.ac b/libgomp/plugin/configfrag.ac new file mode 100644 index 0000000..68c7dc7 --- /dev/null +++ b/libgomp/plugin/configfrag.ac @@ -0,0 +1,107 @@ +# Plugins for offload execution, configure.ac fragment. +# +# Copyright (C) 2014 Free Software Foundation, Inc. +# +# Contributed by Mentor Embedded. +# +# This file is part of the GNU OpenMP Library (libgomp). +# +# Libgomp is free software; you can redistribute it and/or modify it +# under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 3, or (at your option) +# any later version. +# +# Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY +# WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +# FOR A PARTICULAR PURPOSE. See the GNU General Public License for +# more details. +# +# Under Section 7 of GPL version 3, you are granted additional +# permissions described in the GCC Runtime Library Exception, version +# 3.1, as published by the Free Software Foundation. +# +# You should have received a copy of the GNU General Public License and +# a copy of the GCC Runtime Library Exception along with this program; +# see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +# . + +# Look for the CUDA driver package. +CUDA_DRIVER_INCLUDE= +CUDA_DRIVER_LIB= +AC_SUBST(CUDA_DRIVER_INCLUDE) +AC_SUBST(CUDA_DRIVER_LIB) +CUDA_DRIVER_CPPFLAGS= +CUDA_DRIVER_LDFLAGS= +AC_ARG_WITH(cuda-driver, + [AS_HELP_STRING([--with-cuda-driver=PATH], + [specify prefix directory for installed CUDA driver package. + Equivalent to --with-cuda-driver-include=PATH/include + plus --with-cuda-driver-lib=PATH/lib])]) +AC_ARG_WITH(cuda-driver-include, + [AS_HELP_STRING([--with-cuda-driver-include=PATH], + [specify directory for installed CUDA driver include files])]) +AC_ARG_WITH(cuda-driver-lib, + [AS_HELP_STRING([--with-cuda-driver-lib=PATH], + [specify directory for the installed CUDA driver library])]) +if test "x$with_cuda_driver" != x; then + CUDA_DRIVER_INCLUDE=$with_cuda_driver/include + CUDA_DRIVER_LIB=$with_cuda_driver/lib +fi +if test "x$with_cuda_driver_include" != x; then + CUDA_DRIVER_INCLUDE=$with_cuda_driver_include +fi +if test "x$with_cuda_driver_lib" != x; then + CUDA_DRIVER_LIB=$with_cuda_driver_lib +fi +if test "x$CUDA_DRIVER_INCLUDE" != x; then + CUDA_DRIVER_CPPFLAGS=-I$CUDA_DRIVER_INCLUDE +fi +if test "x$CUDA_DRIVER_LIB" != x; then + CUDA_DRIVER_LDFLAGS=-L$CUDA_DRIVER_LIB +fi + +PLUGIN_NVPTX=0 +PLUGIN_NVPTX_CPPFLAGS= +PLUGIN_NVPTX_LDFLAGS= +PLUGIN_NVPTX_LIBS= +AC_SUBST(PLUGIN_NVPTX) +AC_SUBST(PLUGIN_NVPTX_CPPFLAGS) +AC_SUBST(PLUGIN_NVPTX_LDFLAGS) +AC_SUBST(PLUGIN_NVPTX_LIBS) + +for accel in `echo $enable_offload_targets | sed -e 's#,# #g'`; do + case "$accel" in + nvptx*) + PLUGIN_NVPTX=$accel + PLUGIN_NVPTX_CPPFLAGS=$CUDA_DRIVER_CPPFLAGS + PLUGIN_NVPTX_LDFLAGS=$CUDA_DRIVER_LDFLAGS + PLUGIN_NVPTX_LIBS='-lcuda' + + PLUGIN_NVPTX_save_CPPFLAGS=$CPPFLAGS + CPPFLAGS="$PLUGIN_NVPTX_CPPFLAGS $CPPFLAGS" + PLUGIN_NVPTX_save_LDFLAGS=$LDFLAGS + LDFLAGS="$PLUGIN_NVPTX_LDFLAGS $LDFLAGS" + PLUGIN_NVPTX_save_LIBS=$LIBS + LIBS="$PLUGIN_NVPTX_LIBS $LIBS" + AC_LINK_IFELSE( + [AC_LANG_PROGRAM( + [#include "cuda.h"], + [CUresult r = cuCtxPushCurrent (NULL);])], + [PLUGIN_NVPTX=1]) + CPPFLAGS=$PLUGIN_NVPTX_save_CPPFLAGS + LDFLAGS=$PLUGIN_NVPTX_save_LDFLAGS + LIBS=$PLUGIN_NVPTX_save_LIBS + case $PLUGIN_NVPTX in + nvptx*) + PLUGIN_NVPTX=0 + AC_MSG_ERROR([CUDA driver package required for nvptx support]) + ;; + esac + ;; + esac +done +AM_CONDITIONAL([PLUGIN_NVPTX], [test $PLUGIN_NVPTX = 1]) +AC_DEFINE_UNQUOTED([PLUGIN_NVPTX], [$PLUGIN_NVPTX], + [Define to 1 if the NVIDIA plugin is built, 0 if not.]) + +AC_OUTPUT diff --git a/libgomp/plugin/plugin-host.c b/libgomp/plugin/plugin-host.c new file mode 100644 index 0000000..aee3c4e --- /dev/null +++ b/libgomp/plugin/plugin-host.c @@ -0,0 +1,269 @@ +/* OpenACC Runtime Library: acc_device_host, acc_device_host_nonshm. + + Copyright (C) 2013 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* Simple implementation of support routines for a shared-memory + acc_device_host, and a non-shared memory acc_device_host_nonshm, with the + latter built as a plugin. */ + +#include "openacc.h" +#include "config.h" +#include "libgomp.h" +#include "libgomp_target.h" +#ifdef HOST_NONSHM_PLUGIN +#include "libgomp-plugin.h" +#include "oacc-plugin.h" +#else +#include "oacc-int.h" +#endif + +#include +#include +#include +#include + +#ifdef HOST_NONSHM_PLUGIN +#define STATIC +#define GOMP(X) GOMP_PLUGIN_##X +#define SELF "host_nonshm plugin: " +#else +#define STATIC static +#define GOMP(X) gomp_##X +#define SELF "host: " +#endif + +#ifndef HOST_NONSHM_PLUGIN +static struct gomp_device_descr host_dispatch; +#endif + +STATIC const char * +GOMP_OFFLOAD_get_name (void) +{ +#ifdef HOST_NONSHM_PLUGIN + return "host_nonshm"; +#else + return "host"; +#endif +} + +STATIC int +GOMP_OFFLOAD_get_type (void) +{ +#ifdef HOST_NONSHM_PLUGIN + return OFFLOAD_TARGET_TYPE_HOST_NONSHM; +#else + return OFFLOAD_TARGET_TYPE_HOST; +#endif +} + +STATIC unsigned int +GOMP_OFFLOAD_get_caps (void) +{ + unsigned int caps = TARGET_CAP_OPENACC_200 | TARGET_CAP_NATIVE_EXEC; + +#ifndef HOST_NONSHM_PLUGIN + caps |= TARGET_CAP_SHARED_MEM; +#endif + + return caps; +} + +STATIC int +GOMP_OFFLOAD_get_num_devices (void) +{ + return 1; +} + +STATIC void +GOMP_OFFLOAD_register_image (void *host_table __attribute__((unused)), + void *target_data __attribute__((unused))) +{ +} + +STATIC void +GOMP_OFFLOAD_init_device (int n __attribute__((unused))) +{ +} + +STATIC void +GOMP_OFFLOAD_fini_device (int n __attribute__((unused))) +{ +} + +STATIC int +GOMP_OFFLOAD_get_table (int n __attribute__((unused)), + struct mapping_table **table __attribute__((unused))) +{ + return 0; +} + +STATIC void * +GOMP_OFFLOAD_openacc_open_device (int n) +{ + return (void *) (intptr_t) n; +} + +STATIC int +GOMP_OFFLOAD_openacc_close_device (void *hnd) +{ + return 0; +} + +STATIC int +GOMP_OFFLOAD_openacc_get_device_num (void) +{ + return 0; +} + +STATIC void +GOMP_OFFLOAD_openacc_set_device_num (int n) +{ + if (n > 0) + GOMP(fatal) ("device number %u out of range for host execution", n); +} + +STATIC void * +GOMP_OFFLOAD_alloc (int n __attribute__((unused)), size_t s) +{ + return GOMP(malloc) (s); +} + +STATIC void +GOMP_OFFLOAD_free (int n __attribute__((unused)), void *p) +{ + free (p); +} + +STATIC void * +GOMP_OFFLOAD_host2dev (int n __attribute__((unused)), void *d, const void *h, + size_t s) +{ +#ifdef HOST_NONSHM_PLUGIN + memcpy (d, h, s); +#endif + + return 0; +} + +STATIC void * +GOMP_OFFLOAD_dev2host (int n __attribute__((unused)), void *h, const void *d, + size_t s) +{ +#ifdef HOST_NONSHM_PLUGIN + memcpy (h, d, s); +#endif + + return 0; +} + +STATIC void +GOMP_OFFLOAD_run (int n __attribute__((unused)), void *fn_ptr, void *vars) +{ + void (*fn)(void *) = (void (*)(void *)) fn_ptr; + + fn (vars); +} + +STATIC void +GOMP_OFFLOAD_openacc_parallel (void (*fn) (void *), + size_t mapnum __attribute__((unused)), + void **hostaddrs, + void **devaddrs __attribute__((unused)), + size_t *sizes __attribute__((unused)), + unsigned short *kinds __attribute__((unused)), + int num_gangs __attribute__((unused)), + int num_workers __attribute__((unused)), + int vector_length __attribute__((unused)), + int async __attribute__((unused)), + void *targ_mem_desc __attribute__((unused))) +{ +#ifdef HOST_NONSHM_PLUGIN + fn (devaddrs); +#else + fn (hostaddrs); +#endif +} + +STATIC void +GOMP_OFFLOAD_openacc_register_async_cleanup (void *targ_mem_desc) +{ +#ifdef HOST_NONSHM_PLUGIN + /* "Asynchronous" launches are executed synchronously on the (non-SHM) host, + so there's no point in delaying host-side cleanup -- just do it now. */ + GOMP_PLUGIN_async_unmap_vars (targ_mem_desc); +#endif +} + +STATIC void +GOMP_OFFLOAD_openacc_async_set_async (int async __attribute__((unused))) +{ +} + +STATIC int +GOMP_OFFLOAD_openacc_async_test (int async __attribute__((unused))) +{ + return 1; +} + +STATIC int +GOMP_OFFLOAD_openacc_async_test_all (void) +{ + return 1; +} + +STATIC void +GOMP_OFFLOAD_openacc_async_wait (int async __attribute__((unused))) +{ +} + +STATIC void +GOMP_OFFLOAD_openacc_async_wait_all (void) +{ +} + +STATIC void +GOMP_OFFLOAD_openacc_async_wait_async (int async1 __attribute__((unused)), + int async2 __attribute__((unused))) +{ +} + +STATIC void +GOMP_OFFLOAD_openacc_async_wait_all_async (int async __attribute__((unused))) +{ +} + +STATIC void * +GOMP_OFFLOAD_openacc_create_thread_data (void *targ_data + __attribute__((unused))) +{ + return NULL; +} + +STATIC void +GOMP_OFFLOAD_openacc_destroy_thread_data (void *tls_data + __attribute__((unused))) +{ +} diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c new file mode 100644 index 0000000..3d1b81b --- /dev/null +++ b/libgomp/plugin/plugin-nvptx.c @@ -0,0 +1,1852 @@ +/* Plugin for NVPTX execution. + + Copyright (C) 2013-2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* Nvidia PTX-specific parts of OpenACC support. The cuda driver + library appears to hold some implicit state, but the documentation + is not clear as to what that state might be. Or how one might + propagate it from one thread to another. */ + +#include "openacc.h" +#include "config.h" +#include "libgomp.h" +#include "libgomp_target.h" +#include "libgomp-plugin.h" +#include "oacc-plugin.h" + +#include +#include +#include +#include +#include +#include +#include + +#define ARRAYSIZE(X) (sizeof (X) / sizeof ((X)[0])) + +static struct +{ + CUresult r; + char *m; +} cuda_errlist[]= +{ + { CUDA_ERROR_INVALID_VALUE, "invalid value" }, + { CUDA_ERROR_OUT_OF_MEMORY, "out of memory" }, + { CUDA_ERROR_NOT_INITIALIZED, "not initialized" }, + { CUDA_ERROR_DEINITIALIZED, "deinitialized" }, + { CUDA_ERROR_PROFILER_DISABLED, "profiler disabled" }, + { CUDA_ERROR_PROFILER_NOT_INITIALIZED, "profiler not initialized" }, + { CUDA_ERROR_PROFILER_ALREADY_STARTED, "already started" }, + { CUDA_ERROR_PROFILER_ALREADY_STOPPED, "already stopped" }, + { CUDA_ERROR_NO_DEVICE, "no device" }, + { CUDA_ERROR_INVALID_DEVICE, "invalid device" }, + { CUDA_ERROR_INVALID_IMAGE, "invalid image" }, + { CUDA_ERROR_INVALID_CONTEXT, "invalid context" }, + { CUDA_ERROR_CONTEXT_ALREADY_CURRENT, "context already current" }, + { CUDA_ERROR_MAP_FAILED, "map error" }, + { CUDA_ERROR_UNMAP_FAILED, "unmap error" }, + { CUDA_ERROR_ARRAY_IS_MAPPED, "array is mapped" }, + { CUDA_ERROR_ALREADY_MAPPED, "already mapped" }, + { CUDA_ERROR_NO_BINARY_FOR_GPU, "no binary for gpu" }, + { CUDA_ERROR_ALREADY_ACQUIRED, "already acquired" }, + { CUDA_ERROR_NOT_MAPPED, "not mapped" }, + { CUDA_ERROR_NOT_MAPPED_AS_ARRAY, "not mapped as array" }, + { CUDA_ERROR_NOT_MAPPED_AS_POINTER, "not mapped as pointer" }, + { CUDA_ERROR_ECC_UNCORRECTABLE, "ecc uncorrectable" }, + { CUDA_ERROR_UNSUPPORTED_LIMIT, "unsupported limit" }, + { CUDA_ERROR_CONTEXT_ALREADY_IN_USE, "context already in use" }, + { CUDA_ERROR_PEER_ACCESS_UNSUPPORTED, "peer access unsupported" }, + { CUDA_ERROR_INVALID_SOURCE, "invalid source" }, + { CUDA_ERROR_FILE_NOT_FOUND, "file not found" }, + { CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND, + "shared object symbol not found" }, + { CUDA_ERROR_SHARED_OBJECT_INIT_FAILED, "shared object init error" }, + { CUDA_ERROR_OPERATING_SYSTEM, "operating system" }, + { CUDA_ERROR_INVALID_HANDLE, "invalid handle" }, + { CUDA_ERROR_NOT_FOUND, "not found" }, + { CUDA_ERROR_NOT_READY, "not ready" }, + { CUDA_ERROR_LAUNCH_FAILED, "launch error" }, + { CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES, "launch out of resources" }, + { CUDA_ERROR_LAUNCH_TIMEOUT, "launch timeout" }, + { CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING, + "launch incompatibe texturing" }, + { CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED, "peer access already enabled" }, + { CUDA_ERROR_PEER_ACCESS_NOT_ENABLED, "peer access not enabled " }, + { CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE, "primary cotext active" }, + { CUDA_ERROR_CONTEXT_IS_DESTROYED, "context is destroyed" }, + { CUDA_ERROR_ASSERT, "assert" }, + { CUDA_ERROR_TOO_MANY_PEERS, "too many peers" }, + { CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED, + "host memory already registered" }, + { CUDA_ERROR_HOST_MEMORY_NOT_REGISTERED, "host memory not registered" }, + { CUDA_ERROR_NOT_PERMITTED, "not permitted" }, + { CUDA_ERROR_NOT_SUPPORTED, "not supported" }, + { CUDA_ERROR_UNKNOWN, "unknown" } +}; + +static char errmsg[128]; + +static char * +cuda_error (CUresult r) +{ + int i; + + for (i = 0; i < ARRAYSIZE (cuda_errlist); i++) + { + if (cuda_errlist[i].r == r) + return &cuda_errlist[i].m[0]; + } + + sprintf (&errmsg[0], "unknown result code: %5d", r); + + return &errmsg[0]; +} + +struct targ_fn_descriptor +{ + CUfunction fn; + const char *name; +}; + +static bool ptx_inited = false; + +struct ptx_stream +{ + CUstream stream; + pthread_t host_thread; + bool multithreaded; + + CUdeviceptr d; + void *h; + void *h_begin; + void *h_end; + void *h_next; + void *h_prev; + void *h_tail; + + struct ptx_stream *next; +}; + +/* Thread-specific data for PTX. */ + +struct nvptx_thread +{ + struct ptx_stream *current_stream; + struct ptx_device *ptx_dev; +}; + +struct map +{ + int async; + size_t size; + char mappings[0]; +}; + +static void +map_init (struct ptx_stream *s) +{ + CUresult r; + + int size = getpagesize (); + + assert (s); + assert (!s->d); + assert (!s->h); + + r = cuMemAllocHost (&s->h, size); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemAllocHost error: %s", cuda_error (r)); + + r = cuMemHostGetDevicePointer (&s->d, s->h, 0); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemHostGetDevicePointer error: %s", cuda_error (r)); + + assert (s->h); + + s->h_begin = s->h; + s->h_end = s->h_begin + size; + s->h_next = s->h_prev = s->h_tail = s->h_begin; + + assert (s->h_next); + assert (s->h_end); +} + +static void +map_fini (struct ptx_stream *s) +{ + CUresult r; + + r = cuMemFreeHost (s->h); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemFreeHost error: %s", cuda_error (r)); +} + +static void +map_pop (struct ptx_stream *s) +{ + struct map *m; + + assert (s != NULL); + assert (s->h_next); + assert (s->h_prev); + assert (s->h_tail); + + m = s->h_tail; + + s->h_tail += m->size; + + if (s->h_tail >= s->h_end) + s->h_tail = s->h_begin + (int) (s->h_tail - s->h_end); + + if (s->h_next == s->h_tail) + s->h_prev = s->h_next; + + assert (s->h_next >= s->h_begin); + assert (s->h_tail >= s->h_begin); + assert (s->h_prev >= s->h_begin); + + assert (s->h_next <= s->h_end); + assert (s->h_tail <= s->h_end); + assert (s->h_prev <= s->h_end); +} + +static void +map_push (struct ptx_stream *s, int async, size_t size, void **h, void **d) +{ + int left; + int offset; + struct map *m; + + assert (s != NULL); + + left = s->h_end - s->h_next; + size += sizeof (struct map); + + assert (s->h_prev); + assert (s->h_next); + + if (size >= left) + { + m = s->h_prev; + m->size += left; + s->h_next = s->h_begin; + + if (s->h_next + size > s->h_end) + GOMP_PLUGIN_fatal ("unable to push map"); + } + + assert (s->h_next); + + m = s->h_next; + m->async = async; + m->size = size; + + offset = (void *)&m->mappings[0] - s->h; + + *d = (void *)(s->d + offset); + *h = (void *)(s->h + offset); + + s->h_prev = s->h_next; + s->h_next += size; + + assert (s->h_prev); + assert (s->h_next); + + assert (s->h_next >= s->h_begin); + assert (s->h_tail >= s->h_begin); + assert (s->h_prev >= s->h_begin); + assert (s->h_next <= s->h_end); + assert (s->h_tail <= s->h_end); + assert (s->h_prev <= s->h_end); + + return; +} + +struct ptx_device +{ + CUcontext ctx; + bool ctx_shared; + CUdevice dev; + struct ptx_stream *null_stream; + /* All non-null streams associated with this device (actually context), + either created implicitly or passed in from the user (via + acc_set_cuda_stream). */ + struct ptx_stream *active_streams; + struct { + struct ptx_stream **arr; + int size; + } async_streams; + /* A lock for use when manipulating the above stream list and array. */ + gomp_mutex_t stream_lock; + int ord; + bool overlap; + bool map; + bool concur; + int mode; + bool mkern; + + struct ptx_device *next; +}; + +enum PTX_event_type +{ + PTX_EVT_MEM, + PTX_EVT_KNL, + PTX_EVT_SYNC, + PTX_EVT_ASYNC_CLEANUP +}; + +struct PTX_event +{ + CUevent *evt; + int type; + void *addr; + int ord; + + struct PTX_event *next; +}; + +static gomp_mutex_t PTX_event_lock; +static struct PTX_event *PTX_events; + +#define _XSTR(s) _STR(s) +#define _STR(s) #s + +static struct _synames +{ + char *n; +} cuSymNames[] = +{ + { _XSTR (cuCtxCreate) }, + { _XSTR (cuCtxDestroy) }, + { _XSTR (cuCtxGetCurrent) }, + { _XSTR (cuCtxPushCurrent) }, + { _XSTR (cuCtxSynchronize) }, + { _XSTR (cuDeviceGet) }, + { _XSTR (cuDeviceGetAttribute) }, + { _XSTR (cuDeviceGetCount) }, + { _XSTR (cuEventCreate) }, + { _XSTR (cuEventDestroy) }, + { _XSTR (cuEventQuery) }, + { _XSTR (cuEventRecord) }, + { _XSTR (cuInit) }, + { _XSTR (cuLaunchKernel) }, + { _XSTR (cuLinkAddData) }, + { _XSTR (cuLinkComplete) }, + { _XSTR (cuLinkCreate) }, + { _XSTR (cuMemAlloc) }, + { _XSTR (cuMemAllocHost) }, + { _XSTR (cuMemcpy) }, + { _XSTR (cuMemcpyDtoH) }, + { _XSTR (cuMemcpyDtoHAsync) }, + { _XSTR (cuMemcpyHtoD) }, + { _XSTR (cuMemcpyHtoDAsync) }, + { _XSTR (cuMemFree) }, + { _XSTR (cuMemFreeHost) }, + { _XSTR (cuMemGetAddressRange) }, + { _XSTR (cuMemHostGetDevicePointer) }, + { _XSTR (cuMemHostRegister) }, + { _XSTR (cuMemHostUnregister) }, + { _XSTR (cuModuleGetFunction) }, + { _XSTR (cuModuleLoadData) }, + { _XSTR (cuStreamDestroy) }, + { _XSTR (cuStreamQuery) }, + { _XSTR (cuStreamSynchronize) }, + { _XSTR (cuStreamWaitEvent) } +}; + +static int +verify_device_library (void) +{ + int i; + void *dh, *ds; + + dh = dlopen ("libcuda.so", RTLD_LAZY); + if (!dh) + return -1; + + for (i = 0; i < ARRAYSIZE (cuSymNames); i++) + { + ds = dlsym (dh, cuSymNames[i].n); + if (!ds) + return -1; + } + + dlclose (dh); + + return 0; +} + +static inline struct nvptx_thread * +nvptx_thread (void) +{ + return (struct nvptx_thread *) GOMP_PLUGIN_acc_thread (); +} + +static void +init_streams_for_device (struct ptx_device *ptx_dev, int concurrency) +{ + int i; + struct ptx_stream *null_stream + = GOMP_PLUGIN_malloc (sizeof (struct ptx_stream)); + + null_stream->stream = NULL; + null_stream->host_thread = pthread_self (); + null_stream->multithreaded = true; + null_stream->d = (CUdeviceptr) NULL; + null_stream->h = NULL; + map_init (null_stream); + ptx_dev->null_stream = null_stream; + + ptx_dev->active_streams = NULL; + GOMP_PLUGIN_mutex_init (&ptx_dev->stream_lock); + + if (concurrency < 1) + concurrency = 1; + + /* This is just a guess -- make space for as many async streams as the + current device is capable of concurrently executing. This can grow + later as necessary. No streams are created yet. */ + ptx_dev->async_streams.arr + = GOMP_PLUGIN_malloc (concurrency * sizeof (struct ptx_stream *)); + ptx_dev->async_streams.size = concurrency; + + for (i = 0; i < concurrency; i++) + ptx_dev->async_streams.arr[i] = NULL; +} + +static void +fini_streams_for_device (struct ptx_device *ptx_dev) +{ + free (ptx_dev->async_streams.arr); + + while (ptx_dev->active_streams != NULL) + { + struct ptx_stream *s = ptx_dev->active_streams; + ptx_dev->active_streams = ptx_dev->active_streams->next; + + cuStreamDestroy (s->stream); + map_fini (s); + free (s); + } + + map_fini (ptx_dev->null_stream); + free (ptx_dev->null_stream); +} + +/* Select a stream for (OpenACC-semantics) ASYNC argument for the current + thread THREAD (and also current device/context). If CREATE is true, create + the stream if it does not exist (or use EXISTING if it is non-NULL), and + associate the stream with the same thread argument. Returns stream to use + as result. */ + +static struct ptx_stream * +select_stream_for_async (int async, pthread_t thread, bool create, + CUstream existing) +{ + struct nvptx_thread *nvthd = nvptx_thread (); + /* Local copy of TLS variable. */ + struct ptx_device *ptx_dev = nvthd->ptx_dev; + struct ptx_stream *stream = NULL; + int orig_async = async; + + /* The special value acc_async_noval (-1) maps (for now) to an + implicitly-created stream, which is then handled the same as any other + numbered async stream. Other options are available, e.g. using the null + stream for anonymous async operations, or choosing an idle stream from an + active set. But, stick with this for now. */ + if (async > acc_async_sync) + async++; + + if (create) + GOMP_PLUGIN_mutex_lock (&ptx_dev->stream_lock); + + /* NOTE: AFAICT there's no particular need for acc_async_sync to map to the + null stream, and in fact better performance may be obtainable if it doesn't + (because the null stream enforces overly-strict synchronisation with + respect to other streams for legacy reasons, and that's probably not + needed with OpenACC). Maybe investigate later. */ + if (async == acc_async_sync) + stream = ptx_dev->null_stream; + else if (async >= 0 && async < ptx_dev->async_streams.size + && ptx_dev->async_streams.arr[async] && !(create && existing)) + stream = ptx_dev->async_streams.arr[async]; + else if (async >= 0 && create) + { + if (async >= ptx_dev->async_streams.size) + { + int i, newsize = ptx_dev->async_streams.size * 2; + + if (async >= newsize) + newsize = async + 1; + + ptx_dev->async_streams.arr + = GOMP_PLUGIN_realloc (ptx_dev->async_streams.arr, + newsize * sizeof (struct ptx_stream *)); + + for (i = ptx_dev->async_streams.size; i < newsize; i++) + ptx_dev->async_streams.arr[i] = NULL; + + ptx_dev->async_streams.size = newsize; + } + + /* Create a new stream on-demand if there isn't one already, or if we're + setting a particular async value to an existing (externally-provided) + stream. */ + if (!ptx_dev->async_streams.arr[async] || existing) + { + CUresult r; + struct ptx_stream *s + = GOMP_PLUGIN_malloc (sizeof (struct ptx_stream)); + + if (existing) + s->stream = existing; + else + { + r = cuStreamCreate (&s->stream, CU_STREAM_DEFAULT); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuStreamCreate error: %s", cuda_error (r)); + } + + /* If CREATE is true, we're going to be queueing some work on this + stream. Associate it with the current host thread. */ + s->host_thread = thread; + s->multithreaded = false; + + s->d = (CUdeviceptr) NULL; + s->h = NULL; + map_init (s); + + s->next = ptx_dev->active_streams; + ptx_dev->active_streams = s; + ptx_dev->async_streams.arr[async] = s; + } + + stream = ptx_dev->async_streams.arr[async]; + } + else if (async < 0) + GOMP_PLUGIN_fatal ("bad async %d", async); + + if (create) + { + assert (stream != NULL); + + /* If we're trying to use the same stream from different threads + simultaneously, set stream->multithreaded to true. This affects the + behaviour of acc_async_test_all and acc_wait_all, which are supposed to + only wait for asynchronous launches from the same host thread they are + invoked on. If multiple threads use the same async value, we make note + of that here and fall back to testing/waiting for all threads in those + functions. */ + if (thread != stream->host_thread) + stream->multithreaded = true; + + GOMP_PLUGIN_mutex_unlock (&ptx_dev->stream_lock); + } + else if (stream && !stream->multithreaded + && !pthread_equal (stream->host_thread, thread)) + GOMP_PLUGIN_fatal ("async %d used on wrong thread", orig_async); + + return stream; +} + +static int PTX_get_num_devices (void); + +/* Initialize the device. */ +static int +PTX_init (void) +{ + CUresult r; + int rc; + + if (ptx_inited) + return PTX_get_num_devices (); + + rc = verify_device_library (); + if (rc < 0) + return -1; + + r = cuInit (0); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuInit error: %s", cuda_error (r)); + + PTX_events = NULL; + + GOMP_PLUGIN_mutex_init (&PTX_event_lock); + + ptx_inited = true; + + return PTX_get_num_devices (); +} + +static void +PTX_fini (void) +{ + ptx_inited = false; +} + +static void * +PTX_open_device (int n) +{ + struct ptx_device *ptx_dev; + CUdevice dev; + CUresult r; + int async_engines, pi; + + r = cuDeviceGet (&dev, n); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuDeviceGet error: %s", cuda_error (r)); + + ptx_dev = GOMP_PLUGIN_malloc (sizeof (struct ptx_device)); + + ptx_dev->ord = n; + ptx_dev->dev = dev; + ptx_dev->ctx_shared = false; + + r = cuCtxGetCurrent (&ptx_dev->ctx); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuCtxGetCurrent error: %s", cuda_error (r)); + + if (!ptx_dev->ctx) + { + r = cuCtxCreate (&ptx_dev->ctx, CU_CTX_SCHED_AUTO, dev); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuCtxCreate error: %s", cuda_error (r)); + } + else + ptx_dev->ctx_shared = true; + + r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_GPU_OVERLAP, dev); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuda_error (r)); + + ptx_dev->overlap = pi; + + r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY, dev); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuda_error (r)); + + ptx_dev->map = pi; + + r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS, dev); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuda_error (r)); + + ptx_dev->concur = pi; + + r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, dev); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuda_error (r)); + + ptx_dev->mode = pi; + + r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_INTEGRATED, dev); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuda_error (r)); + + ptx_dev->mkern = pi; + + r = cuDeviceGetAttribute (&async_engines, + CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev); + if (r != CUDA_SUCCESS) + async_engines = 1; + + init_streams_for_device (ptx_dev, async_engines); + + return (void *) ptx_dev; +} + +static int +PTX_close_device (void *targ_data) +{ + CUresult r; + struct ptx_device *ptx_dev = targ_data; + + if (!ptx_dev) + return 0; + + fini_streams_for_device (ptx_dev); + + if (!ptx_dev->ctx_shared) + { + r = cuCtxDestroy (ptx_dev->ctx); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuCtxDestroy error: %s", cuda_error (r)); + } + + free (ptx_dev); + + return 0; +} + +static int +PTX_get_num_devices (void) +{ + int n; + CUresult r; + + /* This function will be called before the plugin has been initialized in + order to enumerate available devices, but CUDA API routines can't be used + until cuInit has been called. Just call it now (but don't yet do any + further initialization). */ + if (!ptx_inited) + cuInit (0); + + r = cuDeviceGetCount (&n); + if (r!= CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuDeviceGetCount error: %s", cuda_error (r)); + + return n; +} + +#define ABORT_PTX \ + ".version 3.1\n" \ + ".target sm_30\n" \ + ".address_size 64\n" \ + ".visible .func abort;\n" \ + ".visible .func abort\n" \ + "{\n" \ + "trap;\n" \ + "ret;\n" \ + "}\n" \ + ".visible .func _gfortran_abort;\n" \ + ".visible .func _gfortran_abort\n" \ + "{\n" \ + "trap;\n" \ + "ret;\n" \ + "}\n" \ + +/* Generated with: + + $ echo 'int acc_on_device(int d) { return __builtin_acc_on_device(d); } int acc_on_device_(int *d) { return acc_on_device(*d); }' | accel-gcc/xgcc -Baccel-gcc -x c - -o - -S -m64 -O3 -fno-builtin-acc_on_device -fno-inline +*/ +#define ACC_ON_DEVICE_PTX \ + " .version 3.1\n" \ + " .target sm_30\n" \ + " .address_size 64\n" \ + ".visible .func (.param.u32 %out_retval)acc_on_device(.param.u32 %in_ar1);\n" \ + ".visible .func (.param.u32 %out_retval)acc_on_device(.param.u32 %in_ar1)\n" \ + "{\n" \ + " .reg.u32 %ar1;\n" \ + ".reg.u32 %retval;\n" \ + " .reg.u64 %hr10;\n" \ + " .reg.u32 %r24;\n" \ + " .reg.u32 %r25;\n" \ + " .reg.pred %r27;\n" \ + " .reg.u32 %r30;\n" \ + " ld.param.u32 %ar1, [%in_ar1];\n" \ + " mov.u32 %r24, %ar1;\n" \ + " setp.ne.u32 %r27,%r24,4;\n" \ + " set.u32.eq.u32 %r30,%r24,5;\n" \ + " neg.s32 %r25, %r30;\n" \ + " @%r27 bra $L3;\n" \ + " mov.u32 %r25, 1;\n" \ + "$L3:\n" \ + " mov.u32 %retval, %r25;\n" \ + " st.param.u32 [%out_retval], %retval;\n" \ + " ret;\n" \ + " }\n" \ + ".visible .func (.param.u32 %out_retval)acc_on_device_(.param.u64 %in_ar1);\n" \ + ".visible .func (.param.u32 %out_retval)acc_on_device_(.param.u64 %in_ar1)\n" \ + "{\n" \ + " .reg.u64 %ar1;\n" \ + ".reg.u32 %retval;\n" \ + " .reg.u64 %hr10;\n" \ + " .reg.u64 %r25;\n" \ + " .reg.u32 %r26;\n" \ + " .reg.u32 %r27;\n" \ + " ld.param.u64 %ar1, [%in_ar1];\n" \ + " mov.u64 %r25, %ar1;\n" \ + " ld.u32 %r26, [%r25];\n" \ + " {\n" \ + " .param.u32 %retval_in;\n" \ + " {\n" \ + " .param.u32 %out_arg0;\n" \ + " st.param.u32 [%out_arg0], %r26;\n" \ + " call (%retval_in), acc_on_device, (%out_arg0);\n" \ + " }\n" \ + " ld.param.u32 %r27, [%retval_in];\n" \ + "}\n" \ + " mov.u32 %retval, %r27;\n" \ + " st.param.u32 [%out_retval], %retval;\n" \ + " ret;\n" \ + " }" + +static void +link_ptx (CUmodule *module, char *ptx_code) +{ + CUjit_option opts[7]; + void *optvals[7]; + float elapsed = 0.0; +#define LOGSIZE 8192 + char elog[LOGSIZE]; + char ilog[LOGSIZE]; + unsigned long logsize = LOGSIZE; + CUlinkState linkstate; + CUresult r; + void *linkout; + size_t linkoutsize __attribute__((unused)); + + GOMP_PLUGIN_notify ("attempting to load:\n---\n%s\n---\n", ptx_code); + + opts[0] = CU_JIT_WALL_TIME; + optvals[0] = &elapsed; + + opts[1] = CU_JIT_INFO_LOG_BUFFER; + optvals[1] = &ilog[0]; + + opts[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES; + optvals[2] = (void *) logsize; + + opts[3] = CU_JIT_ERROR_LOG_BUFFER; + optvals[3] = &elog[0]; + + opts[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES; + optvals[4] = (void *) logsize; + + opts[5] = CU_JIT_LOG_VERBOSE; + optvals[5] = (void *) 1; + + opts[6] = CU_JIT_TARGET; + optvals[6] = (void *) CU_TARGET_COMPUTE_30; + + r = cuLinkCreate (7, opts, optvals, &linkstate); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuLinkCreate error: %s", cuda_error (r)); + + char *abort_ptx = ABORT_PTX; + r = cuLinkAddData (linkstate, CU_JIT_INPUT_PTX, abort_ptx, + strlen (abort_ptx) + 1, 0, 0, 0, 0); + if (r != CUDA_SUCCESS) + { + GOMP_PLUGIN_error ("Link error log %s\n", &elog[0]); + GOMP_PLUGIN_fatal ("cuLinkAddData (abort) error: %s", cuda_error (r)); + } + + char *acc_on_device_ptx = ACC_ON_DEVICE_PTX; + r = cuLinkAddData (linkstate, CU_JIT_INPUT_PTX, acc_on_device_ptx, + strlen (acc_on_device_ptx) + 1, 0, 0, 0, 0); + if (r != CUDA_SUCCESS) + { + GOMP_PLUGIN_error ("Link error log %s\n", &elog[0]); + GOMP_PLUGIN_fatal ("cuLinkAddData (acc_on_device) error: %s", + cuda_error (r)); + } + + r = cuLinkAddData (linkstate, CU_JIT_INPUT_PTX, ptx_code, + strlen (ptx_code) + 1, 0, 0, 0, 0); + if (r != CUDA_SUCCESS) + { + GOMP_PLUGIN_error ("Link error log %s\n", &elog[0]); + GOMP_PLUGIN_fatal ("cuLinkAddData (ptx_code) error: %s", cuda_error (r)); + } + + r = cuLinkComplete (linkstate, &linkout, &linkoutsize); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuLinkComplete error: %s", cuda_error (r)); + + GOMP_PLUGIN_notify ("Link complete: %fms\n", elapsed); + GOMP_PLUGIN_notify ("Link log %s\n", &ilog[0]); + + r = cuModuleLoadData (module, linkout); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuModuleLoadData error: %s", cuda_error (r)); +} + +static void +event_gc (bool memmap_lockable) +{ + struct PTX_event *e = PTX_events; + struct nvptx_thread *nvthd = nvptx_thread (); + + GOMP_PLUGIN_mutex_lock (&PTX_event_lock); + + while (e != NULL) + { + CUresult r; + + if (e->ord != nvthd->ptx_dev->ord) + { + e = e->next; + continue; + } + + r = cuEventQuery (*e->evt); + if (r == CUDA_SUCCESS) + { + CUevent *te; + + te = e->evt; + + switch (e->type) + { + case PTX_EVT_MEM: + case PTX_EVT_SYNC: + break; + + case PTX_EVT_KNL: + map_pop (e->addr); + break; + + case PTX_EVT_ASYNC_CLEANUP: + { + /* The function gomp_plugin_async_unmap_vars needs to claim the + memory-map splay tree lock for the current device, so we + can't call it when one of our callers has already claimed + the lock. In that case, just delay the GC for this event + until later. */ + if (!memmap_lockable) + { + e = e->next; + continue; + } + + GOMP_PLUGIN_async_unmap_vars (e->addr); + } + break; + } + + cuEventDestroy (*te); + free ((void *)te); + + struct PTX_event *next = e->next; + + if (PTX_events == e) + PTX_events = PTX_events->next; + else + { + struct PTX_event *e_ = PTX_events; + while (e_->next != e) + e_ = e_->next; + e_->next = e_->next->next; + } + + free (e); + e = next; + } + else + e = e->next; + } + + GOMP_PLUGIN_mutex_unlock (&PTX_event_lock); +} + +static void +event_add (enum PTX_event_type type, CUevent *e, void *h) +{ + struct PTX_event *ptx_event; + struct nvptx_thread *nvthd = nvptx_thread (); + + assert (type == PTX_EVT_MEM || type == PTX_EVT_KNL || type == PTX_EVT_SYNC + || type == PTX_EVT_ASYNC_CLEANUP); + + ptx_event = GOMP_PLUGIN_malloc (sizeof (struct PTX_event)); + ptx_event->type = type; + ptx_event->evt = e; + ptx_event->addr = h; + ptx_event->ord = nvthd->ptx_dev->ord; + + GOMP_PLUGIN_mutex_lock (&PTX_event_lock); + + ptx_event->next = PTX_events; + PTX_events = ptx_event; + + GOMP_PLUGIN_mutex_unlock (&PTX_event_lock); +} + +void +PTX_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs, + size_t *sizes, unsigned short *kinds, int num_gangs, int num_workers, + int vector_length, int async, void *targ_mem_desc) +{ + struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn; + CUfunction function; + CUresult r; + int i; + struct ptx_stream *dev_str; + void *kargs[1]; + void *hp, *dp; + unsigned int nthreads_in_block; + struct nvptx_thread *nvthd = nvptx_thread (); + + function = targ_fn->fn; + + dev_str = select_stream_for_async (async, pthread_self (), false, NULL); + assert (dev_str == nvthd->current_stream); + + /* This reserves a chunk of a pre-allocated page of memory mapped on both + the host and the device. HP is a host pointer to the new chunk, and DP is + the corresponding device pointer. */ + map_push (dev_str, async, mapnum * sizeof (void *), &hp, &dp); + + GOMP_PLUGIN_notify (" %s: prepare mappings\n", __FUNCTION__); + + /* Copy the array of arguments to the mapped page. */ + for (i = 0; i < mapnum; i++) + ((void **) hp)[i] = devaddrs[i]; + + /* Copy the (device) pointers to arguments to the device (dp and hp might in + fact have the same value on a unified-memory system). */ + r = cuMemcpy ((CUdeviceptr)dp, (CUdeviceptr)hp, mapnum * sizeof (void *)); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemcpy failed: %s", cuda_error (r)); + + GOMP_PLUGIN_notify (" %s: kernel %s: launch\n", __FUNCTION__, targ_fn->name); + + // XXX: possible geometry mappings?? + // + // OpenACC CUDA + // + // num_gangs blocks + // num_workers warps (where a warp is equivalent to 32 threads) + // vector length threads + // + + /* The openacc vector_length clause 'determines the vector length to use for + vector or SIMD operations'. The question is how to map this to CUDA. + + In CUDA, the warp size is the vector length of a CUDA device. However, the + CUDA interface abstracts away from that, and only shows us warp size + indirectly in maximum number of threads per block, which is a product of + warp size and the number of hyperthreads of a multiprocessor. + + We choose to map openacc vector_length directly onto the number of threads + in a block, in the x dimension. This is reflected in gcc code generation + that uses ThreadIdx.x to access vector elements. + + Attempting to use an openacc vector_length of more than the maximum number + of threads per block will result in a cuda error. */ + nthreads_in_block = vector_length; + + kargs[0] = &dp; + r = cuLaunchKernel (function, + 1, 1, 1, + nthreads_in_block, 1, 1, + 0, dev_str->stream, kargs, 0); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r)); + +#ifndef DISABLE_ASYNC + if (async < acc_async_noval) + { + r = cuStreamSynchronize (dev_str->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r)); + } + else + { + CUevent *e; + + e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent)); + + r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuda_error (r)); + + event_gc (true); + + r = cuEventRecord (*e, dev_str->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuda_error (r)); + + event_add (PTX_EVT_KNL, e, (void *)dev_str); + } +#else + r = cuCtxSynchronize (); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r)); +#endif + + GOMP_PLUGIN_notify (" %s: kernel %s: finished\n", __FUNCTION__, + targ_fn->name); + +#ifndef DISABLE_ASYNC + if (async < acc_async_noval) +#endif + map_pop (dev_str); +} + +void * openacc_get_current_cuda_context (void); + +static void * +PTX_alloc (size_t s) +{ + CUdeviceptr d; + CUresult r; + + r = cuMemAlloc (&d, s); + if (r == CUDA_ERROR_OUT_OF_MEMORY) + return 0; + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemAlloc error: %s", cuda_error (r)); + return (void *)d; +} + +static void +PTX_free (void *p) +{ + CUresult r; + CUdeviceptr pb; + size_t ps; + + r = cuMemGetAddressRange (&pb, &ps, (CUdeviceptr)p); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemGetAddressRange error: %s", cuda_error (r)); + + if ((CUdeviceptr)p != pb) + GOMP_PLUGIN_fatal ("invalid device address"); + + r = cuMemFree ((CUdeviceptr)p); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r)); +} + +static void * +PTX_host2dev (void *d, const void *h, size_t s) +{ + CUresult r; + CUdeviceptr pb; + size_t ps; + struct nvptx_thread *nvthd = nvptx_thread (); + + if (!s) + return 0; + + if (!d) + GOMP_PLUGIN_fatal ("invalid device address"); + + r = cuMemGetAddressRange (&pb, &ps, (CUdeviceptr)d); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemGetAddressRange error: %s", cuda_error (r)); + + if (!pb) + GOMP_PLUGIN_fatal ("invalid device address"); + + if (!h) + GOMP_PLUGIN_fatal ("invalid host address"); + + if (d == h) + GOMP_PLUGIN_fatal ("invalid host or device address"); + + if ((void *)(d + s) > (void *)(pb + ps)) + GOMP_PLUGIN_fatal ("invalid size"); + +#ifndef DISABLE_ASYNC + if (nvthd->current_stream != nvthd->ptx_dev->null_stream) + { + CUevent *e; + + e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent)); + + r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuda_error (r)); + + event_gc (false); + + r = cuMemcpyHtoDAsync ((CUdeviceptr)d, h, s, + nvthd->current_stream->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemcpyHtoDAsync error: %s", cuda_error (r)); + + r = cuEventRecord (*e, nvthd->current_stream->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuda_error (r)); + + event_add (PTX_EVT_MEM, e, (void *)h); + } + else +#endif + { + r = cuMemcpyHtoD ((CUdeviceptr)d, h, s); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r)); + } + + return 0; +} + +static void * +PTX_dev2host (void *h, const void *d, size_t s) +{ + CUresult r; + CUdeviceptr pb; + size_t ps; + struct nvptx_thread *nvthd = nvptx_thread (); + + if (!s) + return 0; + + if (!d) + GOMP_PLUGIN_fatal ("invalid device address"); + + r = cuMemGetAddressRange (&pb, &ps, (CUdeviceptr)d); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemGetAddressRange error: %s", cuda_error (r)); + + if (!pb) + GOMP_PLUGIN_fatal ("invalid device address"); + + if (!h) + GOMP_PLUGIN_fatal ("invalid host address"); + + if (d == h) + GOMP_PLUGIN_fatal ("invalid host or device address"); + + if ((void *)(d + s) > (void *)(pb + ps)) + GOMP_PLUGIN_fatal ("invalid size"); + +#ifndef DISABLE_ASYNC + if (nvthd->current_stream != nvthd->ptx_dev->null_stream) + { + CUevent *e; + + e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent)); + + r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventCreate error: %s\n", cuda_error (r)); + + event_gc (false); + + r = cuMemcpyDtoHAsync (h, (CUdeviceptr)d, s, + nvthd->current_stream->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemcpyDtoHAsync error: %s", cuda_error (r)); + + r = cuEventRecord (*e, nvthd->current_stream->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuda_error (r)); + + event_add (PTX_EVT_MEM, e, (void *)h); + } + else +#endif + { + r = cuMemcpyDtoH (h, (CUdeviceptr)d, s); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r)); + } + + return 0; +} + +static void +PTX_set_async (int async) +{ + struct nvptx_thread *nvthd = nvptx_thread (); + nvthd->current_stream + = select_stream_for_async (async, pthread_self (), true, NULL); +} + +static int +PTX_async_test (int async) +{ + CUresult r; + struct ptx_stream *s; + + s = select_stream_for_async (async, pthread_self (), false, NULL); + + if (!s) + GOMP_PLUGIN_fatal ("unknown async %d", async); + + r = cuStreamQuery (s->stream); + if (r == CUDA_SUCCESS) + { + /* The oacc-parallel.c:goacc_wait function calls this hook to determine + whether all work has completed on this stream, and if so omits the call + to the wait hook. If that happens, event_gc might not get called + (which prevents variables from getting unmapped and their associated + device storage freed), so call it here. */ + event_gc (true); + return 1; + } + else if (r == CUDA_ERROR_NOT_READY) + return 0; + + GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r)); + + return 0; +} + +static int +PTX_async_test_all (void) +{ + struct ptx_stream *s; + pthread_t self = pthread_self (); + struct nvptx_thread *nvthd = nvptx_thread (); + + GOMP_PLUGIN_mutex_lock (&nvthd->ptx_dev->stream_lock); + + for (s = nvthd->ptx_dev->active_streams; s != NULL; s = s->next) + { + if ((s->multithreaded || pthread_equal (s->host_thread, self)) + && cuStreamQuery (s->stream) == CUDA_ERROR_NOT_READY) + { + GOMP_PLUGIN_mutex_unlock (&nvthd->ptx_dev->stream_lock); + return 0; + } + } + + GOMP_PLUGIN_mutex_unlock (&nvthd->ptx_dev->stream_lock); + + event_gc (true); + + return 1; +} + +static void +PTX_wait (int async) +{ + CUresult r; + struct ptx_stream *s; + + s = select_stream_for_async (async, pthread_self (), false, NULL); + + if (!s) + GOMP_PLUGIN_fatal ("unknown async %d", async); + + r = cuStreamSynchronize (s->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r)); + + event_gc (true); +} + +static void +PTX_wait_async (int async1, int async2) +{ + CUresult r; + CUevent *e; + struct ptx_stream *s1, *s2; + pthread_t self = pthread_self (); + + /* The stream that is waiting (rather than being waited for) doesn't + necessarily have to exist already. */ + s2 = select_stream_for_async (async2, self, true, NULL); + + s1 = select_stream_for_async (async1, self, false, NULL); + if (!s1) + GOMP_PLUGIN_fatal ("invalid async 1\n"); + + if (s1 == s2) + GOMP_PLUGIN_fatal ("identical parameters"); + + e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent)); + + r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuda_error (r)); + + event_gc (true); + + r = cuEventRecord (*e, s1->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuda_error (r)); + + event_add (PTX_EVT_SYNC, e, NULL); + + r = cuStreamWaitEvent (s2->stream, *e, 0); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuStreamWaitEvent error: %s", cuda_error (r)); +} + +static void +PTX_wait_all (void) +{ + CUresult r; + struct ptx_stream *s; + pthread_t self = pthread_self (); + struct nvptx_thread *nvthd = nvptx_thread (); + + GOMP_PLUGIN_mutex_lock (&nvthd->ptx_dev->stream_lock); + + /* Wait for active streams initiated by this thread (or by multiple threads) + to complete. */ + for (s = nvthd->ptx_dev->active_streams; s != NULL; s = s->next) + { + if (s->multithreaded || pthread_equal (s->host_thread, self)) + { + r = cuStreamQuery (s->stream); + if (r == CUDA_SUCCESS) + continue; + else if (r != CUDA_ERROR_NOT_READY) + GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r)); + + r = cuStreamSynchronize (s->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r)); + } + } + + GOMP_PLUGIN_mutex_unlock (&nvthd->ptx_dev->stream_lock); + + event_gc (true); +} + +static void +PTX_wait_all_async (int async) +{ + CUresult r; + struct ptx_stream *waiting_stream, *other_stream; + CUevent *e; + struct nvptx_thread *nvthd = nvptx_thread (); + pthread_t self = pthread_self (); + + /* The stream doing the waiting. This could be the first mention of the + stream, so create it if necessary. */ + waiting_stream + = select_stream_for_async (async, pthread_self (), true, NULL); + + /* Launches on the null stream already block on other streams in the + context. */ + if (!waiting_stream || waiting_stream == nvthd->ptx_dev->null_stream) + return; + + event_gc (true); + + GOMP_PLUGIN_mutex_lock (&nvthd->ptx_dev->stream_lock); + + for (other_stream = nvthd->ptx_dev->active_streams; + other_stream != NULL; + other_stream = other_stream->next) + { + if (!other_stream->multithreaded + && !pthread_equal (other_stream->host_thread, self)) + continue; + + e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent)); + + r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuda_error (r)); + + /* Record an event on the waited-for stream. */ + r = cuEventRecord (*e, other_stream->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuda_error (r)); + + event_add (PTX_EVT_SYNC, e, NULL); + + r = cuStreamWaitEvent (waiting_stream->stream, *e, 0); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuStreamWaitEvent error: %s", cuda_error (r)); + } + + GOMP_PLUGIN_mutex_unlock (&nvthd->ptx_dev->stream_lock); +} + +static void * +PTX_get_current_cuda_device (void) +{ + struct nvptx_thread *nvthd = nvptx_thread (); + + if (!nvthd || !nvthd->ptx_dev) + return NULL; + + return &nvthd->ptx_dev->dev; +} + +static void * +PTX_get_current_cuda_context (void) +{ + struct nvptx_thread *nvthd = nvptx_thread (); + + if (!nvthd || !nvthd->ptx_dev) + return NULL; + + return nvthd->ptx_dev->ctx; +} + +static void * +PTX_get_cuda_stream (int async) +{ + struct ptx_stream *s; + struct nvptx_thread *nvthd = nvptx_thread (); + + if (!nvthd || !nvthd->ptx_dev) + return NULL; + + s = select_stream_for_async (async, pthread_self (), false, NULL); + + return s ? s->stream : NULL; +} + +static int +PTX_set_cuda_stream (int async, void *stream) +{ + struct ptx_stream *oldstream; + pthread_t self = pthread_self (); + struct nvptx_thread *nvthd = nvptx_thread (); + + GOMP_PLUGIN_mutex_lock (&nvthd->ptx_dev->stream_lock); + + if (async < 0) + GOMP_PLUGIN_fatal ("bad async %d", async); + + /* We have a list of active streams and an array mapping async values to + entries of that list. We need to take "ownership" of the passed-in stream, + and add it to our list, removing the previous entry also (if there was one) + in order to prevent resource leaks. Note the potential for surprise + here: maybe we should keep track of passed-in streams and leave it up to + the user to tidy those up, but that doesn't work for stream handles + returned from acc_get_cuda_stream above... */ + + oldstream = select_stream_for_async (async, self, false, NULL); + + if (oldstream) + { + if (nvthd->ptx_dev->active_streams == oldstream) + nvthd->ptx_dev->active_streams = nvthd->ptx_dev->active_streams->next; + else + { + struct ptx_stream *s = nvthd->ptx_dev->active_streams; + while (s->next != oldstream) + s = s->next; + s->next = s->next->next; + } + + cuStreamDestroy (oldstream->stream); + map_fini (oldstream); + free (oldstream); + } + + GOMP_PLUGIN_mutex_unlock (&nvthd->ptx_dev->stream_lock); + + (void) select_stream_for_async (async, self, true, (CUstream) stream); + + return 1; +} + +/* Plugin entry points. */ + + +int +GOMP_OFFLOAD_get_type (void) +{ + return OFFLOAD_TARGET_TYPE_NVIDIA_PTX; +} + +unsigned int +GOMP_OFFLOAD_get_caps (void) +{ + return TARGET_CAP_OPENACC_200; +} + +const char * +GOMP_OFFLOAD_get_name (void) +{ + return "nvidia"; +} + +int +GOMP_OFFLOAD_get_num_devices (void) +{ + return PTX_get_num_devices (); +} + +static void **kernel_target_data; +static void **kernel_host_table; + +void +GOMP_OFFLOAD_register_image (void *host_table, void *target_data) +{ + kernel_target_data = target_data; + kernel_host_table = host_table; +} + +void +GOMP_OFFLOAD_init_device (int n __attribute__((unused))) +{ + (void) PTX_init (); +} + +void +GOMP_OFFLOAD_fini_device (int n __attribute__((unused))) +{ + PTX_fini (); +} + +int +GOMP_OFFLOAD_get_table (int n __attribute__((unused)), + struct mapping_table **tablep) +{ + CUmodule module; + void **fn_table; + char **fn_names; + int fn_entries, i; + CUresult r; + struct targ_fn_descriptor *targ_fns; + + if (PTX_init () <= 0) + return 0; + + /* This isn't an error, because an image may legitimately have no offloaded + regions and so will not call GOMP_offload_register. */ + if (kernel_target_data == NULL) + return 0; + + link_ptx (&module, kernel_target_data[0]); + + /* kernel_target_data[0] -> ptx code + kernel_target_data[1] -> variable mappings + kernel_target_data[2] -> array of kernel names in ascii + + kernel_host_table[0] -> start of function addresses (_omp_func_table) + kernel_host_table[1] -> end of function addresses (_omp_funcs_end) + + The array of kernel names and the functions addresses form a + one-to-one correspondence. */ + + fn_table = kernel_host_table[0]; + fn_names = (char **) kernel_target_data[2]; + fn_entries = (kernel_host_table[1] - kernel_host_table[0]) / sizeof (void *); + + *tablep = GOMP_PLUGIN_malloc (sizeof (struct mapping_table) * fn_entries); + targ_fns = GOMP_PLUGIN_malloc (sizeof (struct targ_fn_descriptor) + * fn_entries); + + for (i = 0; i < fn_entries; i++) + { + CUfunction function; + + r = cuModuleGetFunction (&function, module, fn_names[i]); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuModuleGetFunction error: %s", cuda_error (r)); + + targ_fns[i].fn = function; + targ_fns[i].name = (const char *) fn_names[i]; + + (*tablep)[i].host_start = (uintptr_t) fn_table[i]; + (*tablep)[i].host_end = (*tablep)[i].host_start + 1; + (*tablep)[i].tgt_start = (uintptr_t) &targ_fns[i]; + (*tablep)[i].tgt_end = (*tablep)[i].tgt_start + 1; + } + + return fn_entries; +} + +void * +GOMP_OFFLOAD_alloc (int n __attribute__((unused)), size_t size) +{ + return PTX_alloc (size); +} + +void +GOMP_OFFLOAD_free (int n __attribute__((unused)), void *ptr) +{ + PTX_free (ptr); +} + +void * +GOMP_OFFLOAD_dev2host (int ord __attribute__((unused)), void *dst, + const void *src, size_t n) +{ + return PTX_dev2host (dst, src, n); +} + +void * +GOMP_OFFLOAD_host2dev (int ord __attribute__((unused)), void *dst, + const void *src, size_t n) +{ + return PTX_host2dev (dst, src, n); +} + +void (*device_run) (void *fn_ptr, void *vars) = NULL; + +void +GOMP_OFFLOAD_openacc_parallel (void (*fn) (void *), size_t mapnum, + void **hostaddrs, void **devaddrs, size_t *sizes, + unsigned short *kinds, int num_gangs, + int num_workers, int vector_length, int async, + void *targ_mem_desc) +{ + PTX_exec (fn, mapnum, hostaddrs, devaddrs, sizes, kinds, num_gangs, + num_workers, vector_length, async, targ_mem_desc); +} + +void * +GOMP_OFFLOAD_openacc_open_device (int n) +{ + return PTX_open_device (n); +} + +int +GOMP_OFFLOAD_openacc_close_device (void *h) +{ + return PTX_close_device (h); +} + +void +GOMP_OFFLOAD_openacc_set_device_num (int n) +{ + struct nvptx_thread *nvthd = nvptx_thread (); + + assert (n >= 0); + + if (!nvthd->ptx_dev || nvthd->ptx_dev->ord != n) + (void) PTX_open_device (n); +} + +/* This can be called before the device is "opened" for the current thread, in + which case we can't tell which device number should be returned. We don't + actually want to open the device here, so just return -1 and let the caller + (oacc-init.c:acc_get_device_num) handle it. */ + +int +GOMP_OFFLOAD_openacc_get_device_num (void) +{ + struct nvptx_thread *nvthd = nvptx_thread (); + + if (nvthd && nvthd->ptx_dev) + return nvthd->ptx_dev->ord; + else + return -1; +} + +void +GOMP_OFFLOAD_openacc_register_async_cleanup (void *targ_mem_desc) +{ + CUevent *e; + CUresult r; + struct nvptx_thread *nvthd = nvptx_thread (); + + e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent)); + + r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuda_error (r)); + + r = cuEventRecord (*e, nvthd->current_stream->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuda_error (r)); + + event_add (PTX_EVT_ASYNC_CLEANUP, e, targ_mem_desc); +} + +int +GOMP_OFFLOAD_openacc_async_test (int async) +{ + return PTX_async_test (async); +} + +int +GOMP_OFFLOAD_openacc_async_test_all (void) +{ + return PTX_async_test_all (); +} + +void +GOMP_OFFLOAD_openacc_async_wait (int async) +{ + PTX_wait (async); +} + +void +GOMP_OFFLOAD_openacc_async_wait_async (int async1, int async2) +{ + PTX_wait_async (async1, async2); +} + +void +GOMP_OFFLOAD_openacc_async_wait_all (void) +{ + PTX_wait_all (); +} + +void +GOMP_OFFLOAD_openacc_async_wait_all_async (int async) +{ + PTX_wait_all_async (async); +} + +void +GOMP_OFFLOAD_openacc_async_set_async (int async) +{ + PTX_set_async (async); +} + +void * +GOMP_OFFLOAD_openacc_create_thread_data (void *targ_data) +{ + struct ptx_device *ptx_dev = (struct ptx_device *) targ_data; + struct nvptx_thread *nvthd + = GOMP_PLUGIN_malloc (sizeof (struct nvptx_thread)); + CUresult r; + CUcontext thd_ctx; + + r = cuCtxGetCurrent (&thd_ctx); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuCtxGetCurrent error: %s", cuda_error (r)); + + assert (ptx_dev->ctx); + + if (!thd_ctx) + { + r = cuCtxPushCurrent (ptx_dev->ctx); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuCtxPushCurrent error: %s", cuda_error (r)); + } + + nvthd->current_stream = ptx_dev->null_stream; + nvthd->ptx_dev = ptx_dev; + + return (void *) nvthd; +} + +void +GOMP_OFFLOAD_openacc_destroy_thread_data (void *data) +{ + free (data); +} + +void * +GOMP_OFFLOAD_openacc_get_current_cuda_device (void) +{ + return PTX_get_current_cuda_device (); +} + +void * +GOMP_OFFLOAD_openacc_get_current_cuda_context (void) +{ + return PTX_get_current_cuda_context (); +} + +/* NOTE: This returns a CUstream, not a ptx_stream pointer. */ + +void * +GOMP_OFFLOAD_openacc_get_cuda_stream (int async) +{ + return PTX_get_cuda_stream (async); +} + +/* NOTE: This takes a CUstream, not a ptx_stream pointer. */ + +int +GOMP_OFFLOAD_openacc_set_cuda_stream (int async, void *stream) +{ + return PTX_set_cuda_stream (async, stream); +} diff --git a/libgomp/splay-tree.c b/libgomp/splay-tree.c new file mode 100644 index 0000000..14b03ac --- /dev/null +++ b/libgomp/splay-tree.c @@ -0,0 +1,224 @@ +/* A splay-tree datatype. + Copyright 1998-2013 + Free Software Foundation, Inc. + Contributed by Mark Mitchell (mark@markmitchell.com). + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* The splay tree code copied from include/splay-tree.h and adjusted, + so that all the data lives directly in splay_tree_node_s structure + and no extra allocations are needed. + + Files including this header should before including it add: +typedef struct splay_tree_node_s *splay_tree_node; +typedef struct splay_tree_s *splay_tree; +typedef struct splay_tree_key_s *splay_tree_key; + define splay_tree_key_s structure, and define + splay_compare inline function. */ + +/* For an easily readable description of splay-trees, see: + + Lewis, Harry R. and Denenberg, Larry. Data Structures and Their + Algorithms. Harper-Collins, Inc. 1991. + + The major feature of splay trees is that all basic tree operations + are amortized O(log n) time for a tree with n nodes. */ + +#include "libgomp.h" +#include "splay-tree.h" + +extern int splay_compare (splay_tree_key, splay_tree_key); + +/* Rotate the edge joining the left child N with its parent P. PP is the + grandparents' pointer to P. */ + +static inline void +rotate_left (splay_tree_node *pp, splay_tree_node p, splay_tree_node n) +{ + splay_tree_node tmp; + tmp = n->right; + n->right = p; + p->left = tmp; + *pp = n; +} + +/* Rotate the edge joining the right child N with its parent P. PP is the + grandparents' pointer to P. */ + +static inline void +rotate_right (splay_tree_node *pp, splay_tree_node p, splay_tree_node n) +{ + splay_tree_node tmp; + tmp = n->left; + n->left = p; + p->right = tmp; + *pp = n; +} + +/* Bottom up splay of KEY. */ + +static void +splay_tree_splay (splay_tree sp, splay_tree_key key) +{ + if (sp->root == NULL) + return; + + do { + int cmp1, cmp2; + splay_tree_node n, c; + + n = sp->root; + cmp1 = splay_compare (key, &n->key); + + /* Found. */ + if (cmp1 == 0) + return; + + /* Left or right? If no child, then we're done. */ + if (cmp1 < 0) + c = n->left; + else + c = n->right; + if (!c) + return; + + /* Next one left or right? If found or no child, we're done + after one rotation. */ + cmp2 = splay_compare (key, &c->key); + if (cmp2 == 0 + || (cmp2 < 0 && !c->left) + || (cmp2 > 0 && !c->right)) + { + if (cmp1 < 0) + rotate_left (&sp->root, n, c); + else + rotate_right (&sp->root, n, c); + return; + } + + /* Now we have the four cases of double-rotation. */ + if (cmp1 < 0 && cmp2 < 0) + { + rotate_left (&n->left, c, c->left); + rotate_left (&sp->root, n, n->left); + } + else if (cmp1 > 0 && cmp2 > 0) + { + rotate_right (&n->right, c, c->right); + rotate_right (&sp->root, n, n->right); + } + else if (cmp1 < 0 && cmp2 > 0) + { + rotate_right (&n->left, c, c->right); + rotate_left (&sp->root, n, n->left); + } + else if (cmp1 > 0 && cmp2 < 0) + { + rotate_left (&n->right, c, c->left); + rotate_right (&sp->root, n, n->right); + } + } while (1); +} + +/* Insert a new NODE into SP. The NODE shouldn't exist in the tree. */ + +attribute_hidden void +splay_tree_insert (splay_tree sp, splay_tree_node node) +{ + int comparison = 0; + + splay_tree_splay (sp, &node->key); + + if (sp->root) + comparison = splay_compare (&sp->root->key, &node->key); + + if (sp->root && comparison == 0) + gomp_fatal ("Duplicate node"); + else + { + /* Insert it at the root. */ + if (sp->root == NULL) + node->left = node->right = NULL; + else if (comparison < 0) + { + node->left = sp->root; + node->right = node->left->right; + node->left->right = NULL; + } + else + { + node->right = sp->root; + node->left = node->right->left; + node->right->left = NULL; + } + + sp->root = node; + } +} + +/* Remove node with KEY from SP. It is not an error if it did not exist. */ + +attribute_hidden void +splay_tree_remove (splay_tree sp, splay_tree_key key) +{ + splay_tree_splay (sp, key); + + if (sp->root && splay_compare (&sp->root->key, key) == 0) + { + splay_tree_node left, right; + + left = sp->root->left; + right = sp->root->right; + + /* One of the children is now the root. Doesn't matter much + which, so long as we preserve the properties of the tree. */ + if (left) + { + sp->root = left; + + /* If there was a right child as well, hang it off the + right-most leaf of the left child. */ + if (right) + { + while (left->right) + left = left->right; + left->right = right; + } + } + else + sp->root = right; + } +} + +/* Lookup KEY in SP, returning NODE if present, and NULL + otherwise. */ + +attribute_hidden splay_tree_key +splay_tree_lookup (splay_tree sp, splay_tree_key key) +{ + splay_tree_splay (sp, key); + + if (sp->root && splay_compare (&sp->root->key, key) == 0) + return &sp->root->key; + else + return NULL; +} diff --git a/libgomp/splay-tree.h b/libgomp/splay-tree.h index eb8011a..f29d437 100644 --- a/libgomp/splay-tree.h +++ b/libgomp/splay-tree.h @@ -43,6 +43,30 @@ typedef struct splay_tree_key_s *splay_tree_key; The major feature of splay trees is that all basic tree operations are amortized O(log n) time for a tree with n nodes. */ +#ifndef _SPLAY_TREE_H +#define _SPLAY_TREE_H 1 + +typedef struct splay_tree_node_s *splay_tree_node; +typedef struct splay_tree_s *splay_tree; +typedef struct splay_tree_key_s *splay_tree_key; + +struct splay_tree_key_s { + /* Address of the host object. */ + uintptr_t host_start; + /* Address immediately after the host object. */ + uintptr_t host_end; + /* Descriptor of the target memory. */ + struct target_mem_desc *tgt; + /* Offset from tgt->tgt_start to the start of the target object. */ + uintptr_t tgt_offset; + /* Reference count. */ + uintptr_t refcount; + /* Asynchronous reference count. */ + uintptr_t async_refcount; + /* True if data should be copied from device to host at the end. */ + bool copy_from; +}; + /* The nodes in the splay tree. */ struct splay_tree_node_s { struct splay_tree_key_s key; @@ -56,177 +80,8 @@ struct splay_tree_s { splay_tree_node root; }; -/* Rotate the edge joining the left child N with its parent P. PP is the - grandparents' pointer to P. */ - -static inline void -rotate_left (splay_tree_node *pp, splay_tree_node p, splay_tree_node n) -{ - splay_tree_node tmp; - tmp = n->right; - n->right = p; - p->left = tmp; - *pp = n; -} - -/* Rotate the edge joining the right child N with its parent P. PP is the - grandparents' pointer to P. */ - -static inline void -rotate_right (splay_tree_node *pp, splay_tree_node p, splay_tree_node n) -{ - splay_tree_node tmp; - tmp = n->left; - n->left = p; - p->right = tmp; - *pp = n; -} - -/* Bottom up splay of KEY. */ - -static void -splay_tree_splay (splay_tree sp, splay_tree_key key) -{ - if (sp->root == NULL) - return; - - do { - int cmp1, cmp2; - splay_tree_node n, c; - - n = sp->root; - cmp1 = splay_compare (key, &n->key); - - /* Found. */ - if (cmp1 == 0) - return; - - /* Left or right? If no child, then we're done. */ - if (cmp1 < 0) - c = n->left; - else - c = n->right; - if (!c) - return; - - /* Next one left or right? If found or no child, we're done - after one rotation. */ - cmp2 = splay_compare (key, &c->key); - if (cmp2 == 0 - || (cmp2 < 0 && !c->left) - || (cmp2 > 0 && !c->right)) - { - if (cmp1 < 0) - rotate_left (&sp->root, n, c); - else - rotate_right (&sp->root, n, c); - return; - } - - /* Now we have the four cases of double-rotation. */ - if (cmp1 < 0 && cmp2 < 0) - { - rotate_left (&n->left, c, c->left); - rotate_left (&sp->root, n, n->left); - } - else if (cmp1 > 0 && cmp2 > 0) - { - rotate_right (&n->right, c, c->right); - rotate_right (&sp->root, n, n->right); - } - else if (cmp1 < 0 && cmp2 > 0) - { - rotate_right (&n->left, c, c->right); - rotate_left (&sp->root, n, n->left); - } - else if (cmp1 > 0 && cmp2 < 0) - { - rotate_left (&n->right, c, c->left); - rotate_right (&sp->root, n, n->right); - } - } while (1); -} - -/* Insert a new NODE into SP. The NODE shouldn't exist in the tree. */ - -static void -splay_tree_insert (splay_tree sp, splay_tree_node node) -{ - int comparison = 0; - - splay_tree_splay (sp, &node->key); - - if (sp->root) - comparison = splay_compare (&sp->root->key, &node->key); - - if (sp->root && comparison == 0) - abort (); - else - { - /* Insert it at the root. */ - if (sp->root == NULL) - node->left = node->right = NULL; - else if (comparison < 0) - { - node->left = sp->root; - node->right = node->left->right; - node->left->right = NULL; - } - else - { - node->right = sp->root; - node->left = node->right->left; - node->right->left = NULL; - } - - sp->root = node; - } -} - -/* Remove node with KEY from SP. It is not an error if it did not exist. */ - -static void -splay_tree_remove (splay_tree sp, splay_tree_key key) -{ - splay_tree_splay (sp, key); - - if (sp->root && splay_compare (&sp->root->key, key) == 0) - { - splay_tree_node left, right; - - left = sp->root->left; - right = sp->root->right; - - /* One of the children is now the root. Doesn't matter much - which, so long as we preserve the properties of the tree. */ - if (left) - { - sp->root = left; - - /* If there was a right child as well, hang it off the - right-most leaf of the left child. */ - if (right) - { - while (left->right) - left = left->right; - left->right = right; - } - } - else - sp->root = right; - } -} - -/* Lookup KEY in SP, returning NODE if present, and NULL - otherwise. */ - -static splay_tree_key -splay_tree_lookup (splay_tree sp, splay_tree_key key) -{ - splay_tree_splay (sp, key); - - if (sp->root && splay_compare (&sp->root->key, key) == 0) - return &sp->root->key; - else - return NULL; -} +attribute_hidden splay_tree_key splay_tree_lookup (splay_tree, splay_tree_key); +attribute_hidden void splay_tree_insert (splay_tree, splay_tree_node); +attribute_hidden void splay_tree_remove (splay_tree, splay_tree_key); + +#endif /* _SPLAY_TREE_H */ diff --git a/libgomp/target.c b/libgomp/target.c index 5b4873b..9345ac2 100644 --- a/libgomp/target.c +++ b/libgomp/target.c @@ -30,7 +30,12 @@ #include #include #include +#include "oacc-plugin.h" +#include "gomp-constants.h" +#include "oacc-int.h" #include +#include +#include #ifdef PLUGIN_SUPPORT #include @@ -40,50 +45,6 @@ static void gomp_target_init (void); static pthread_once_t gomp_is_initialized = PTHREAD_ONCE_INIT; -/* Forward declaration for a node in the tree. */ -typedef struct splay_tree_node_s *splay_tree_node; -typedef struct splay_tree_s *splay_tree; -typedef struct splay_tree_key_s *splay_tree_key; - -struct target_mem_desc { - /* Reference count. */ - uintptr_t refcount; - /* All the splay nodes allocated together. */ - splay_tree_node array; - /* Start of the target region. */ - uintptr_t tgt_start; - /* End of the targer region. */ - uintptr_t tgt_end; - /* Handle to free. */ - void *to_free; - /* Previous target_mem_desc. */ - struct target_mem_desc *prev; - /* Number of items in following list. */ - size_t list_count; - - /* Corresponding target device descriptor. */ - struct gomp_device_descr *device_descr; - - /* List of splay keys to remove (or decrease refcount) - at the end of region. */ - splay_tree_key list[]; -}; - -struct splay_tree_key_s { - /* Address of the host object. */ - uintptr_t host_start; - /* Address immediately after the host object. */ - uintptr_t host_end; - /* Descriptor of the target memory. */ - struct target_mem_desc *tgt; - /* Offset from tgt->tgt_start to the start of the target object. */ - uintptr_t tgt_offset; - /* Reference count. */ - uintptr_t refcount; - /* True if data should be copied from device to host at the end. */ - bool copy_from; -}; - /* This structure describes an offload image. It contains type of the target device, pointer to host table descriptor, and pointer to target data. */ @@ -107,7 +68,7 @@ static int num_devices; /* The comparison function. */ -static int +attribute_hidden int splay_compare (splay_tree_key x, splay_tree_key y) { if (x->host_start == x->host_end @@ -122,47 +83,16 @@ splay_compare (splay_tree_key x, splay_tree_key y) #include "splay-tree.h" -/* This structure describes accelerator device. - It contains ID-number of the device, its type, function handlers for - interaction with the device, and information about mapped memory. */ -struct gomp_device_descr +attribute_hidden void +gomp_init_targets_once (void) { - /* This is the ID number of device. It could be specified in DEVICE-clause of - TARGET construct. */ - int id; - - /* This is the ID number of device among devices of the same type. */ - int target_id; - - /* This is the TYPE of device. */ - enum offload_target_type type; - - /* Set to true when device is initialized. */ - bool is_initialized; - - /* Function handlers. */ - int (*get_type_func) (void); - int (*get_num_devices_func) (void); - void (*register_image_func) (void *, void *); - void (*init_device_func) (int); - int (*get_table_func) (int, void *); - void *(*alloc_func) (int, size_t); - void (*free_func) (int, void *); - void *(*host2dev_func) (int, void *, const void *, size_t); - void *(*dev2host_func) (int, void *, const void *, size_t); - void (*run_func) (int, void *, void *); - - /* Splay tree containing information about mapped memory regions. */ - struct splay_tree_s dev_splay_tree; - - /* Mutex for operating with the splay tree and other shared structures. */ - gomp_mutex_t dev_env_lock; -}; + (void) pthread_once (&gomp_is_initialized, gomp_target_init); +} attribute_hidden int gomp_get_num_devices (void) { - (void) pthread_once (&gomp_is_initialized, gomp_target_init); + gomp_init_targets_once (); return num_devices; } @@ -198,18 +128,29 @@ gomp_map_vars_existing (splay_tree_key oldn, splay_tree_key newn, oldn->refcount++; } -static struct target_mem_desc * +static int +get_kind (bool is_openacc, void *kinds, int idx) +{ + return is_openacc ? ((unsigned short *) kinds)[idx] + : ((unsigned char *) kinds)[idx]; +} + +attribute_hidden struct target_mem_desc * gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, - void **hostaddrs, size_t *sizes, unsigned char *kinds, - bool is_target) + void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds, + bool is_openacc, bool is_target) { size_t i, tgt_align, tgt_size, not_found_cnt = 0; + const int rshift = is_openacc ? 8 : 3; + const int typemask = is_openacc ? 0xff : 0x7; + struct gomp_memory_mapping *mm = &devicep->mem_map; struct splay_tree_key_s cur_node; struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt) + sizeof (tgt->list[0]) * mapnum); tgt->list_count = mapnum; tgt->refcount = 1; tgt->device_descr = devicep; + tgt->mem_map = mm; if (mapnum == 0) return tgt; @@ -222,41 +163,41 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, tgt_align = align; tgt_size = mapnum * sizeof (void *); } - - gomp_mutex_lock (&devicep->dev_env_lock); + gomp_mutex_lock (&mm->lock); for (i = 0; i < mapnum; i++) { + int kind = get_kind (is_openacc, kinds, i); if (hostaddrs[i] == NULL) { tgt->list[i] = NULL; continue; } cur_node.host_start = (uintptr_t) hostaddrs[i]; - if ((kinds[i] & 7) != 4) + if (!GOMP_MAP_POINTER_P (kind & typemask)) cur_node.host_end = cur_node.host_start + sizes[i]; else cur_node.host_end = cur_node.host_start + sizeof (void *); - splay_tree_key n = splay_tree_lookup (&devicep->dev_splay_tree, - &cur_node); + splay_tree_key n = splay_tree_lookup (&mm->splay_tree, &cur_node); if (n) { tgt->list[i] = n; - gomp_map_vars_existing (n, &cur_node, kinds[i]); + gomp_map_vars_existing (n, &cur_node, kind); } else { - size_t align = (size_t) 1 << (kinds[i] >> 3); + size_t align = (size_t) 1 << (kind >> rshift); tgt->list[i] = NULL; not_found_cnt++; if (tgt_align < align) tgt_align = align; tgt_size = (tgt_size + align - 1) & ~(align - 1); tgt_size += cur_node.host_end - cur_node.host_start; - if ((kinds[i] & 7) == 5) + if ((kind & typemask) == GOMP_MAP_TO_PSET) { size_t j; for (j = i + 1; j < mapnum; j++) - if ((kinds[j] & 7) != 4) + if (!GOMP_MAP_POINTER_P (get_kind (is_openacc, kinds, j) + & typemask)) break; else if ((uintptr_t) hostaddrs[j] < cur_node.host_start || ((uintptr_t) hostaddrs[j] + sizeof (void *) @@ -271,7 +212,15 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, } } - if (not_found_cnt || is_target) + if (devaddrs) + { + if (mapnum != 1) + gomp_fatal ("unexpected aggregation"); + tgt->to_free = devaddrs[0]; + tgt->tgt_start = (uintptr_t) tgt->to_free; + tgt->tgt_end = tgt->tgt_start + sizes[0]; + } + else if (not_found_cnt || is_target) { /* Allocate tgt_align aligned tgt_size block of memory. */ /* FIXME: Perhaps change interface to allocate properly aligned @@ -303,44 +252,52 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, for (i = 0; i < mapnum; i++) if (tgt->list[i] == NULL) { + int kind = get_kind (is_openacc, kinds, i); if (hostaddrs[i] == NULL) continue; splay_tree_key k = &array->key; k->host_start = (uintptr_t) hostaddrs[i]; - if ((kinds[i] & 7) != 4) + if (!GOMP_MAP_POINTER_P (kind & typemask)) k->host_end = k->host_start + sizes[i]; else k->host_end = k->host_start + sizeof (void *); - splay_tree_key n - = splay_tree_lookup (&devicep->dev_splay_tree, k); + splay_tree_key n = splay_tree_lookup (&mm->splay_tree, k); if (n) { tgt->list[i] = n; - gomp_map_vars_existing (n, k, kinds[i]); + gomp_map_vars_existing (n, k, kind); } else { - size_t align = (size_t) 1 << (kinds[i] >> 3); + size_t align = (size_t) 1 << (kind >> rshift); tgt->list[i] = k; tgt_size = (tgt_size + align - 1) & ~(align - 1); k->tgt = tgt; k->tgt_offset = tgt_size; tgt_size += k->host_end - k->host_start; - k->copy_from = false; - if ((kinds[i] & 7) == 2 || (kinds[i] & 7) == 3) - k->copy_from = true; + k->copy_from = GOMP_MAP_COPYFROM_P (kind & typemask) + || GOMP_MAP_TOFROM_P (kind & typemask); k->refcount = 1; + k->async_refcount = 0; tgt->refcount++; array->left = NULL; array->right = NULL; - splay_tree_insert (&devicep->dev_splay_tree, array); - switch (kinds[i] & 7) + splay_tree_insert (&mm->splay_tree, array); + switch (kind & typemask) { - case 0: /* ALLOC */ - case 2: /* FROM */ + case GOMP_MAP_FORCE_ALLOC: + case GOMP_MAP_FORCE_FROM: + /* FIXME: No special handling (see comment in + oacc-parallel.c). */ + case GOMP_MAP_ALLOC: + case GOMP_MAP_ALLOC_FROM: break; - case 1: /* TO */ - case 3: /* TOFROM */ + case GOMP_MAP_FORCE_TO: + case GOMP_MAP_FORCE_TOFROM: + /* FIXME: No special handling, as above. */ + case GOMP_MAP_ALLOC_TO: + case GOMP_MAP_ALLOC_TOFROM: + /* Copy from host to device memory. */ /* FIXME: Perhaps add some smarts, like if copying several adjacent fields from host to target, use some host buffer to avoid sending each var individually. */ @@ -350,7 +307,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, (void *) k->host_start, k->host_end - k->host_start); break; - case 4: /* POINTER */ + case GOMP_MAP_POINTER: cur_node.host_start = (uintptr_t) *(void **) k->host_start; if (cur_node.host_start == (uintptr_t) NULL) @@ -366,19 +323,16 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, /* Add bias to the pointer value. */ cur_node.host_start += sizes[i]; cur_node.host_end = cur_node.host_start + 1; - n = splay_tree_lookup (&devicep->dev_splay_tree, - &cur_node); + n = splay_tree_lookup (&mm->splay_tree, &cur_node); if (n == NULL) { /* Could be possibly zero size array section. */ cur_node.host_end--; - n = splay_tree_lookup (&devicep->dev_splay_tree, - &cur_node); + n = splay_tree_lookup (&mm->splay_tree, &cur_node); if (n == NULL) { cur_node.host_start--; - n = splay_tree_lookup (&devicep->dev_splay_tree, - &cur_node); + n = splay_tree_lookup (&mm->splay_tree, &cur_node); cur_node.host_start++; } } @@ -398,14 +352,17 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, (void *) &cur_node.tgt_offset, sizeof (void *)); break; - case 5: /* TO_PSET */ - devicep->host2dev_func (devicep->target_id, - (void *) (tgt->tgt_start - + k->tgt_offset), - (void *) k->host_start, - k->host_end - k->host_start); + case GOMP_MAP_TO_PSET: + /* Copy from host to device memory. */ + /* FIXME: see above FIXME comment. */ + devicep->host2dev_func + (devicep->target_id, + (void *) (tgt->tgt_start + k->tgt_offset), + (void *) k->host_start, + (k->host_end - k->host_start)); for (j = i + 1; j < mapnum; j++) - if ((kinds[j] & 7) != 4) + if (!GOMP_MAP_POINTER_P (get_kind (is_openacc, kinds, j) + & typemask)) break; else if ((uintptr_t) hostaddrs[j] < k->host_start || ((uintptr_t) hostaddrs[j] + sizeof (void *) @@ -432,19 +389,18 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, /* Add bias to the pointer value. */ cur_node.host_start += sizes[j]; cur_node.host_end = cur_node.host_start + 1; - n = splay_tree_lookup (&devicep->dev_splay_tree, - &cur_node); + n = splay_tree_lookup (&mm->splay_tree, &cur_node); if (n == NULL) { /* Could be possibly zero size array section. */ cur_node.host_end--; - n = splay_tree_lookup (&devicep->dev_splay_tree, + n = splay_tree_lookup (&mm->splay_tree, &cur_node); if (n == NULL) { cur_node.host_start--; - n = splay_tree_lookup - (&devicep->dev_splay_tree, &cur_node); + n = splay_tree_lookup (&mm->splay_tree, + &cur_node); cur_node.host_start++; } } @@ -468,6 +424,32 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, i++; } break; + case GOMP_MAP_FORCE_PRESENT: + { + /* We already looked up the memory region above and it + was missing. */ + size_t size = k->host_end - k->host_start; + gomp_fatal ("present clause: !acc_is_present (%p, " + "%zd (0x%zx))", (void *) k->host_start, + size, size); + } + break; + case GOMP_MAP_FORCE_DEVICEPTR: + assert (k->host_end - k->host_start == sizeof (void *)); + + devicep->host2dev_func + (devicep->target_id, + (void *) (tgt->tgt_start + k->tgt_offset), + (void *) k->host_start, + sizeof (void *)); + break; + case GOMP_MAP_FORCE_PRIVATE: + abort (); + case GOMP_MAP_FORCE_FIRSTPRIVATE: + abort (); + default: + gomp_fatal ("%s: unhandled kind 0x%.2x", __FUNCTION__, + kind); } array++; } @@ -490,7 +472,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, } } - gomp_mutex_unlock (&devicep->dev_env_lock); + gomp_mutex_unlock (&mm->lock); return tgt; } @@ -505,10 +487,51 @@ gomp_unmap_tgt (struct target_mem_desc *tgt) free (tgt); } -static void -gomp_unmap_vars (struct target_mem_desc *tgt) +/* Decrease the refcount for a set of mapped variables, and queue asychronous + copies from the device back to the host after any work that has been issued. + Because the regions are still "live", increment an asynchronous reference + count to indicate that they should not be unmapped from host-side data + structures until the asynchronous copy has completed. */ + +attribute_hidden void +gomp_copy_from_async (struct target_mem_desc *tgt) +{ + struct gomp_device_descr *devicep = tgt->device_descr; + struct gomp_memory_mapping *mm = tgt->mem_map; + size_t i; + + gomp_mutex_lock (&mm->lock); + + for (i = 0; i < tgt->list_count; i++) + if (tgt->list[i] == NULL) + ; + else if (tgt->list[i]->refcount > 1) + { + tgt->list[i]->refcount--; + tgt->list[i]->async_refcount++; + } + else + { + splay_tree_key k = tgt->list[i]; + if (k->copy_from) + /* Copy from device to host memory. */ + devicep->dev2host_func (devicep->target_id, (void *) k->host_start, + (void *) (k->tgt->tgt_start + k->tgt_offset), + k->host_end - k->host_start); + } + + gomp_mutex_unlock (&mm->lock); +} + +/* Unmap variables described by TGT. If DO_COPYFROM is true, copy relevant + variables back from device to host: if it is false, it is assumed that this + has been done already, i.e. by gomp_copy_from_async above. */ + +attribute_hidden void +gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom) { struct gomp_device_descr *devicep = tgt->device_descr; + struct gomp_memory_mapping *mm = tgt->mem_map; if (tgt->list_count == 0) { @@ -517,20 +540,23 @@ gomp_unmap_vars (struct target_mem_desc *tgt) } size_t i; - gomp_mutex_lock (&devicep->dev_env_lock); + gomp_mutex_lock (&mm->lock); for (i = 0; i < tgt->list_count; i++) if (tgt->list[i] == NULL) ; else if (tgt->list[i]->refcount > 1) tgt->list[i]->refcount--; + else if (tgt->list[i]->async_refcount > 0) + tgt->list[i]->async_refcount--; else { splay_tree_key k = tgt->list[i]; - if (k->copy_from) + if (k->copy_from && do_copyfrom) + /* Copy from device to host memory. */ devicep->dev2host_func (devicep->target_id, (void *) k->host_start, (void *) (k->tgt->tgt_start + k->tgt_offset), k->host_end - k->host_start); - splay_tree_remove (&devicep->dev_splay_tree, k); + splay_tree_remove (&mm->splay_tree, k); if (k->tgt->refcount > 1) k->tgt->refcount--; else @@ -541,15 +567,17 @@ gomp_unmap_vars (struct target_mem_desc *tgt) tgt->refcount--; else gomp_unmap_tgt (tgt); - gomp_mutex_unlock (&devicep->dev_env_lock); + gomp_mutex_unlock (&mm->lock); } static void -gomp_update (struct gomp_device_descr *devicep, size_t mapnum, - void **hostaddrs, size_t *sizes, unsigned char *kinds) +gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm, + size_t mapnum, void **hostaddrs, size_t *sizes, void *kinds, + bool is_openacc) { size_t i; struct splay_tree_key_s cur_node; + const int typemask = is_openacc ? 0xff : 0x7; if (!devicep) return; @@ -557,16 +585,17 @@ gomp_update (struct gomp_device_descr *devicep, size_t mapnum, if (mapnum == 0) return; - gomp_mutex_lock (&devicep->dev_env_lock); + gomp_mutex_lock (&mm->lock); for (i = 0; i < mapnum; i++) if (sizes[i]) { cur_node.host_start = (uintptr_t) hostaddrs[i]; cur_node.host_end = cur_node.host_start + sizes[i]; - splay_tree_key n = splay_tree_lookup (&devicep->dev_splay_tree, + splay_tree_key n = splay_tree_lookup (&mm->splay_tree, &cur_node); if (n) { + int kind = get_kind (is_openacc, kinds, i); if (n->host_start > cur_node.host_start || n->host_end < cur_node.host_end) gomp_fatal ("Trying to update [%p..%p) object when" @@ -575,31 +604,38 @@ gomp_update (struct gomp_device_descr *devicep, size_t mapnum, (void *) cur_node.host_end, (void *) n->host_start, (void *) n->host_end); - if ((kinds[i] & 7) == 1) - devicep->host2dev_func (devicep->target_id, - (void *) (n->tgt->tgt_start - + n->tgt_offset - + cur_node.host_start - - n->host_start), - (void *) cur_node.host_start, - cur_node.host_end - cur_node.host_start); - else if ((kinds[i] & 7) == 2) - devicep->dev2host_func (devicep->target_id, - (void *) cur_node.host_start, - (void *) (n->tgt->tgt_start - + n->tgt_offset - + cur_node.host_start - - n->host_start), - cur_node.host_end - cur_node.host_start); + if (GOMP_MAP_COPYTO_P (kind & typemask)) + /* Copy from host to device memory. */ + devicep->host2dev_func + (devicep->target_id, + (void *) (n->tgt->tgt_start + + n->tgt_offset + + cur_node.host_start + - n->host_start), + (void *) cur_node.host_start, + cur_node.host_end - cur_node.host_start); + else if (GOMP_MAP_COPYFROM_P (kind & typemask)) + /* Copy from device to host memory. */ + devicep->dev2host_func + (devicep->target_id, + (void *) cur_node.host_start, + (void *) (n->tgt->tgt_start + + n->tgt_offset + + cur_node.host_start + - n->host_start), + cur_node.host_end - cur_node.host_start); } else gomp_fatal ("Trying to update [%p..%p) object that is not mapped", (void *) cur_node.host_start, (void *) cur_node.host_end); } - gomp_mutex_unlock (&devicep->dev_env_lock); + gomp_mutex_unlock (&mm->lock); } +static void gomp_register_image_for_device (struct gomp_device_descr *device, + struct offload_image_descr *image); + /* This function should be called from every offload image. It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of the target, and TARGET_DATA needed by target plugin. */ @@ -612,6 +648,9 @@ GOMP_offload_register (void *host_table, enum offload_target_type target_type, (num_offload_images + 1) * sizeof (struct offload_image_descr)); + if (offload_images == NULL) + return; + offload_images[num_offload_images].type = target_type; offload_images[num_offload_images].host_table = host_table; offload_images[num_offload_images].target_data = target_data; @@ -621,17 +660,24 @@ GOMP_offload_register (void *host_table, enum offload_target_type target_type, /* This function initializes the target device, specified by DEVICEP. */ -static void +attribute_hidden void gomp_init_device (struct gomp_device_descr *devicep) { + /* Initialize the target device. */ devicep->init_device_func (devicep->target_id); + + devicep->is_initialized = true; +} +attribute_hidden void +gomp_init_tables (const struct gomp_device_descr *devicep, + struct gomp_memory_mapping *mm) +{ /* Get address mapping table for device. */ struct mapping_table *table = NULL; - int num_entries = devicep->get_table_func (devicep->target_id, &table); + int i, num_entries = devicep->get_table_func (devicep->target_id, &table); /* Insert host-target address mapping into dev_splay_tree. */ - int i; for (i = 0; i < num_entries; i++) { struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt)); @@ -641,7 +687,7 @@ gomp_init_device (struct gomp_device_descr *devicep) tgt->tgt_end = table[i].tgt_end; tgt->to_free = NULL; tgt->list_count = 0; - tgt->device_descr = devicep; + tgt->device_descr = (struct gomp_device_descr *) devicep; splay_tree_node node = tgt->array; splay_tree_key k = &node->key; k->host_start = table[i].host_start; @@ -652,11 +698,45 @@ gomp_init_device (struct gomp_device_descr *devicep) k->tgt = tgt; node->left = NULL; node->right = NULL; - splay_tree_insert (&devicep->dev_splay_tree, node); + splay_tree_insert (&mm->splay_tree, node); } free (table); - devicep->is_initialized = true; + mm->is_initialized = true; +} + +static void +gomp_init_dev_tables (struct gomp_device_descr *devicep) +{ + gomp_init_device (devicep); + gomp_init_tables (devicep, &devicep->mem_map); +} + + +attribute_hidden void +gomp_free_memmap (struct gomp_device_descr *devicep) +{ + struct gomp_memory_mapping *mm = &devicep->mem_map; + + while (mm->splay_tree.root) + { + struct target_mem_desc *tgt = mm->splay_tree.root->key.tgt; + + splay_tree_remove (&mm->splay_tree, &mm->splay_tree.root->key); + free (tgt->array); + free (tgt); + } + + mm->is_initialized = false; +} + +attribute_hidden void +gomp_fini_device (struct gomp_device_descr *devicep) +{ + if (devicep->is_initialized) + devicep->fini_device_func (devicep->target_id); + + devicep->is_initialized = false; } /* Called when encountering a target directive. If DEVICE @@ -675,7 +755,12 @@ GOMP_target (int device, void (*fn) (void *), const void *openmp_target, unsigned char *kinds) { struct gomp_device_descr *devicep = resolve_device (device); - if (devicep == NULL) + struct gomp_memory_mapping *mm = &devicep->mem_map; + + if (devicep != NULL && !devicep->is_initialized) + gomp_init_dev_tables (devicep); + + if (devicep == NULL || !(devicep->capabilities & TARGET_CAP_OPENMP_400)) { /* Host fallback. */ struct gomp_thread old_thr, *thr = gomp_thread (); @@ -692,20 +777,30 @@ GOMP_target (int device, void (*fn) (void *), const void *openmp_target, return; } - gomp_mutex_lock (&devicep->dev_env_lock); - if (!devicep->is_initialized) - gomp_init_device (devicep); + void *fn_addr; - struct splay_tree_key_s k; - k.host_start = (uintptr_t) fn; - k.host_end = k.host_start + 1; - splay_tree_key tgt_fn = splay_tree_lookup (&devicep->dev_splay_tree, &k); - if (tgt_fn == NULL) - gomp_fatal ("Target function wasn't mapped"); - gomp_mutex_unlock (&devicep->dev_env_lock); + if (devicep->capabilities & TARGET_CAP_NATIVE_EXEC) + fn_addr = (void *) fn; + else + { + gomp_mutex_lock (&mm->lock); + if (!devicep->is_initialized) + gomp_init_dev_tables (devicep); + struct splay_tree_key_s k; + k.host_start = (uintptr_t) fn; + k.host_end = k.host_start + 1; + splay_tree_key tgt_fn = splay_tree_lookup (&devicep->mem_map.splay_tree, + &k); + if (tgt_fn == NULL) + gomp_fatal ("Target function wasn't mapped"); + gomp_mutex_unlock (&mm->lock); + + fn_addr = (void *) tgt_fn->tgt->tgt_start; + } struct target_mem_desc *tgt_vars - = gomp_map_vars (devicep, mapnum, hostaddrs, sizes, kinds, true); + = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false, + true); struct gomp_thread old_thr, *thr = gomp_thread (); old_thr = *thr; memset (thr, '\0', sizeof (*thr)); @@ -714,11 +809,10 @@ GOMP_target (int device, void (*fn) (void *), const void *openmp_target, thr->place = old_thr.place; thr->ts.place_partition_len = gomp_places_list_len; } - devicep->run_func (devicep->target_id, (void *) tgt_fn->tgt->tgt_start, - (void *) tgt_vars->tgt_start); + devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start); gomp_free_thread (thr); *thr = old_thr; - gomp_unmap_vars (tgt_vars); + gomp_unmap_vars (tgt_vars, true); } void @@ -726,7 +820,12 @@ GOMP_target_data (int device, const void *openmp_target, size_t mapnum, void **hostaddrs, size_t *sizes, unsigned char *kinds) { struct gomp_device_descr *devicep = resolve_device (device); - if (devicep == NULL) + struct gomp_memory_mapping *mm = &devicep->mem_map; + + if (devicep != NULL && !devicep->is_initialized) + gomp_init_dev_tables (devicep); + + if (devicep == NULL || !(devicep->capabilities & TARGET_CAP_OPENMP_400)) { /* Host fallback. */ struct gomp_task_icv *icv = gomp_icv (false); @@ -737,20 +836,21 @@ GOMP_target_data (int device, const void *openmp_target, size_t mapnum, new #pragma omp target data, otherwise GOMP_target_end_data would get out of sync. */ struct target_mem_desc *tgt - = gomp_map_vars (NULL, 0, NULL, NULL, NULL, false); + = gomp_map_vars (NULL, 0, NULL, NULL, NULL, NULL, false, false); tgt->prev = icv->target_data; icv->target_data = tgt; } return; } - gomp_mutex_lock (&devicep->dev_env_lock); + gomp_mutex_lock (&mm->lock); if (!devicep->is_initialized) - gomp_init_device (devicep); - gomp_mutex_unlock (&devicep->dev_env_lock); + gomp_init_dev_tables (devicep); + gomp_mutex_unlock (&mm->lock); struct target_mem_desc *tgt - = gomp_map_vars (devicep, mapnum, hostaddrs, sizes, kinds, false); + = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false, + false); struct gomp_task_icv *icv = gomp_icv (true); tgt->prev = icv->target_data; icv->target_data = tgt; @@ -764,7 +864,7 @@ GOMP_target_end_data (void) { struct target_mem_desc *tgt = icv->target_data; icv->target_data = tgt->prev; - gomp_unmap_vars (tgt); + gomp_unmap_vars (tgt, true); } } @@ -773,15 +873,18 @@ GOMP_target_update (int device, const void *openmp_target, size_t mapnum, void **hostaddrs, size_t *sizes, unsigned char *kinds) { struct gomp_device_descr *devicep = resolve_device (device); - if (devicep == NULL) - return; + struct gomp_memory_mapping *mm = &devicep->mem_map; - gomp_mutex_lock (&devicep->dev_env_lock); - if (!devicep->is_initialized) + gomp_mutex_lock (&mm->lock); + if (devicep != NULL && !devicep->is_initialized) gomp_init_device (devicep); - gomp_mutex_unlock (&devicep->dev_env_lock); + gomp_mutex_unlock (&mm->lock); - gomp_update (devicep, mapnum, hostaddrs, sizes, kinds); + if (devicep == NULL || !(devicep->capabilities & TARGET_CAP_OPENMP_400)) + return; + + gomp_update (devicep, &devicep->mem_map, mapnum, hostaddrs, sizes, kinds, + false); } void @@ -808,9 +911,22 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device, const char *plugin_name) { void *plugin_handle = dlopen (plugin_name, RTLD_LAZY); + char *err = NULL, *last_missing = NULL; + int optional_present, optional_total; + if (!plugin_handle) return false; + /* Clear any existing error. */ + dlerror (); + + device->plugin_handle = dlopen (plugin_name, RTLD_LAZY); + if (!device->plugin_handle) + { + err = dlerror (); + goto out; + } + /* Check if all required functions are available in the plugin and store their handlers. */ #define DLSYM(f) \ @@ -821,33 +937,104 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device, return false; \ } \ while (0) + /* Similar, but missing functions are not an error. */ +#define DLSYM_OPT(f,n) \ + do \ + { \ + char *tmp_err; \ + device->f##_func = dlsym (device->plugin_handle, \ + "GOMP_OFFLOAD_" #n); \ + tmp_err = dlerror (); \ + if (tmp_err == NULL) \ + optional_present++; \ + else \ + last_missing = #n; \ + optional_total++; \ + } \ + while (0) + + DLSYM (get_name); + DLSYM (get_caps); DLSYM (get_type); DLSYM (get_num_devices); DLSYM (register_image); DLSYM (init_device); + DLSYM (fini_device); DLSYM (get_table); DLSYM (alloc); DLSYM (free); DLSYM (dev2host); DLSYM (host2dev); - DLSYM (run); + device->capabilities = device->get_caps_func (); + if (device->capabilities & TARGET_CAP_OPENMP_400) + DLSYM (run); + if (device->capabilities & TARGET_CAP_OPENACC_200) + { + optional_present = optional_total = 0; + DLSYM_OPT (openacc.exec, openacc_parallel); + DLSYM_OPT (openacc.open_device, openacc_open_device); + DLSYM_OPT (openacc.close_device, openacc_close_device); + DLSYM_OPT (openacc.get_device_num, openacc_get_device_num); + DLSYM_OPT (openacc.set_device_num, openacc_set_device_num); + DLSYM_OPT (openacc.register_async_cleanup, + openacc_register_async_cleanup); + DLSYM_OPT (openacc.async_test, openacc_async_test); + DLSYM_OPT (openacc.async_test_all, openacc_async_test_all); + DLSYM_OPT (openacc.async_wait, openacc_async_wait); + DLSYM_OPT (openacc.async_wait_async, openacc_async_wait_async); + DLSYM_OPT (openacc.async_wait_all, openacc_async_wait_all); + DLSYM_OPT (openacc.async_wait_all_async, openacc_async_wait_all_async); + DLSYM_OPT (openacc.async_set_async, openacc_async_set_async); + DLSYM_OPT (openacc.create_thread_data, openacc_create_thread_data); + DLSYM_OPT (openacc.destroy_thread_data, openacc_destroy_thread_data); + /* Require all the OpenACC handlers if we have TARGET_CAP_OPENACC_200. */ + if (optional_present != optional_total) + { + err = "plugin missing OpenACC handler function"; + goto out; + } + optional_present = optional_total = 0; + DLSYM_OPT (openacc.cuda.get_current_device, + openacc_get_current_cuda_device); + DLSYM_OPT (openacc.cuda.get_current_context, + openacc_get_current_cuda_context); + DLSYM_OPT (openacc.cuda.get_stream, openacc_get_cuda_stream); + DLSYM_OPT (openacc.cuda.set_stream, openacc_set_cuda_stream); + /* Make sure all the CUDA functions are there if any of them are. */ + if (optional_present && optional_present != optional_total) + { + err = "plugin missing OpenACC CUDA handler function"; + goto out; + } + } #undef DLSYM +#undef DLSYM_OPT - return true; + out: + if (err != NULL) + { + gomp_error ("while loading %s: %s", plugin_name, err); + if (last_missing) + gomp_error ("missing function was %s", last_missing); + if (device->plugin_handle) + dlclose (device->plugin_handle); + } + return err == NULL; } -/* This function finds OFFLOAD_IMAGES corresponding to DEVICE type, and - registers them in the plugin. */ +/* This function adds a compatible offload image IMAGE to an accelerator device + DEVICE. */ static void -gomp_register_images_for_device (struct gomp_device_descr *device) +gomp_register_image_for_device (struct gomp_device_descr *device, + struct offload_image_descr *image) { - int i; - for (i = 0; i < num_offload_images; i++) + if (!device->offload_regions_registered + && (device->type == image->type + || device->type == OFFLOAD_TARGET_TYPE_HOST)) { - struct offload_image_descr *image = &offload_images[i]; - if (image->type == device->type) - device->register_image_func (image->host_table, image->target_data); + device->register_image_func (image->host_table, image->target_data); + device->offload_regions_registered = true; } } @@ -903,15 +1090,19 @@ gomp_target_init (void) } current_device.type = current_device.get_type_func (); + current_device.name = current_device.get_name_func (); current_device.is_initialized = false; - current_device.dev_splay_tree.root = NULL; - gomp_register_images_for_device (¤t_device); + current_device.offload_regions_registered = false; + current_device.mem_map.splay_tree.root = NULL; + current_device.mem_map.is_initialized = false; + current_device.target_data = NULL; + current_device.openacc.data_environ = NULL; for (i = 0; i < new_num_devices; i++) { current_device.id = num_devices + 1; current_device.target_id = i; devices[num_devices] = current_device; - gomp_mutex_init (&devices[num_devices].dev_env_lock); + gomp_mutex_init (&devices[num_devices].mem_map.lock); num_devices++; } } @@ -922,6 +1113,43 @@ gomp_target_init (void) } while (next); + /* Prefer a device with TARGET_CAP_OPENMP_400 for ICV default-device-var. */ + if (num_devices > 1) + { + int d = gomp_icv (false)->default_device_var; + + if (!(devices[d].capabilities & TARGET_CAP_OPENMP_400)) + { + for (i = 0; i < num_devices; i++) + { + if (devices[i].capabilities & TARGET_CAP_OPENMP_400) + { + struct gomp_device_descr device_tmp = devices[d]; + devices[d] = devices[i]; + devices[d].id = d + 1; + devices[i] = device_tmp; + devices[i].id = i + 1; + + break; + } + } + } + } + + for (i = 0; i < num_devices; i++) + { + int j; + + for (j = 0; j < num_offload_images; j++) + gomp_register_image_for_device (&devices[i], &offload_images[j]); + + /* The 'devices' array can be moved (by the realloc call) until we have + found all the plugins, so registering with the OpenACC runtime (which + takes a copy of the pointer argument) must be delayed until now. */ + if (devices[i].capabilities & TARGET_CAP_OPENACC_200) + goacc_register (&devices[i]); + } + free (offload_images); offload_images = NULL; num_offload_images = 0; diff --git a/libgomp/target.h b/libgomp/target.h new file mode 100644 index 0000000..e69de29 diff --git a/libgomp/testsuite/Makefile.in b/libgomp/testsuite/Makefile.in index 2f845f0..78b6351 100644 diff --git a/libgomp/testsuite/libgomp-test-support.exp.in b/libgomp/testsuite/libgomp-test-support.exp.in new file mode 100644 index 0000000..dcadad7 --- /dev/null +++ b/libgomp/testsuite/libgomp-test-support.exp.in @@ -0,0 +1,2 @@ +set cuda_driver_include "@CUDA_DRIVER_INCLUDE@" +set cuda_driver_lib "@CUDA_DRIVER_LIB@"