From patchwork Tue Nov 11 13:53:23 2014 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Julian Brown X-Patchwork-Id: 409456 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 311341400F1 for ; Wed, 12 Nov 2014 00:54:31 +1100 (AEDT) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:date :from:to:subject:message-id:in-reply-to:references:mime-version :content-type; q=dns; s=default; b=IXajmN7Sw+FfQuDUJqT64Nlid2833 JBEitM0iw+yvR4EMIwJF8EbO+S/nceVlh2Ebrlapg6Q/X3iMaxBHzYgpmYAVRV0Z GOAZALhT9YFRk9Fu43wsbyyPb4dzlphA77q1pZeuEbTo2PSOT8e5f7pvae9+wySR H+ECdqKAa8khG0= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:date :from:to:subject:message-id:in-reply-to:references:mime-version :content-type; s=default; bh=5NugywG4rvAHI4IJsJ353Drfksk=; b=CXu XCha2tVZBWW9TeeUXGkixoKCZznBRt04E62vUMTeX4jiltZkYF9Jzk8OIvpMbR3z t4lCz/wnkTDQEy7AUGeI1CO8XSQ7nMKP7K1jDhTxHp/LSBDPCipdeawpjsqCIS2X 8PcQuZWmN66QeiGpukFF3nhAQxU5H1VWJEISu7Fo= Received: (qmail 23190 invoked by alias); 11 Nov 2014 13:54:06 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 23052 invoked by uid 89); 11 Nov 2014 13:54:05 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=0.3 required=5.0 tests=AWL, BAYES_50, KAM_STOCKGEN, RCVD_IN_DNSWL_NONE autolearn=no version=3.3.2 X-HELO: relay1.mentorg.com Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 11 Nov 2014 13:53:45 +0000 Received: from nat-ies.mentorg.com ([192.94.31.2] helo=SVR-IES-FEM-01.mgc.mentorg.com) by relay1.mentorg.com with esmtp id 1XoBt4-0006GN-Bj from Julian_Brown@mentor.com ; Tue, 11 Nov 2014 05:53:40 -0800 Received: from octopus (137.202.0.76) by SVR-IES-FEM-01.mgc.mentorg.com (137.202.0.104) with Microsoft SMTP Server id 14.3.181.6; Tue, 11 Nov 2014 13:53:33 +0000 Date: Tue, 11 Nov 2014 13:53:23 +0000 From: Julian Brown To: , Jakub Jelinek , "Thomas Schwinge" , Ilya Verbin Subject: [PATCH 1/5] OpenACC 2.0 support for libgomp - OpenACC runtime, NVidia PTX/CUDA plugin (repost) Message-ID: <20141111135323.29e0f27b@octopus> In-Reply-To: <20140923191931.2177e60f@octopus> References: <20140923191931.2177e60f@octopus> MIME-Version: 1.0 X-IsSubscribed: yes On Tue, 23 Sep 2014 19:19:31 +0100 Julian Brown wrote: > This patch contains the bulk of the OpenACC 2.0 runtime support, > building around, or on top of, the OpenMP 4.0 support (as previously > posted or already extant upstream) where we could. [...] Here is a new version of the OpenACC support patch for libgomp, rebased on top of a version of Ilya Verbin's patches that I merged to a local clone of trunk, and tested as far as possible without the middle/front-end pieces, since those are not ready yet. This patch brings the OpenACC support in libgomp up-to-date with the various fixes that I have been making on the gomp4 branch, in particular I believe all of Jakub's earlier comments in the following email have been addressed: https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02095.html Since Ilya's most-recently posted patches, there is now somewhat of a mismatch in APIs between the OpenMP and OpenACC parts of libgomp with regard to handling of multiple devices of the same type. This is mostly handled by the "open" and "close" hooks for OpenACC (and per-thread state that tracks the active device number) but is now handled by the "init" hook for OpenMP (which OpenACC just uses for overall initialisation/shutdown), and explicit target_id parameters for several of the plugin hooks. This is only a problem for hypothetical plugins that support both multiple devices and both of OpenMP and OpenACC, and so far no such plugins exist, but we may need to think about how to unify the divergent approaches to multiple devices/multiple threads sooner or later. A few OpenMP tests fail with the new host_nonshm plugin (with failures of the form "libgomp: Trying to update [0x605820..0x605824) object that is not mapped"), probably because of middle-end bugs. I haven't investigated those in detail. OK for mainline? Thanks, Julian ChangeLog xxxx-xx-xx Nathan Sidwell James Norris Thomas Schwinge Tom de Vries Julian Brown Bernd Schmidt Cesar Philippidis include/ * gomp-constants.h: New file. libgomp/ * Makefile.am (search_path): Search in $(top_srcidr)/../include also. (libgomp_la_SOURCES): Add oacc-parallel.c, splay-tree.c, oacc-host.c, oacc-init.c, oacc-mem.c, oacc-async.c, oacc-plugin.c, oacc-cuda.c, libgomp-plugin.c. (Makefrag.am): Include. (libgomp_la_SOURCES): Add openacc.f90 if USE_FORTRAN is true. (nodist_libsubinclude_HEADERS): Add openacc.h, ../include/gomp-constants.h. (nodist_finclude_HEADERS): Add openacc_lib.h, openacc.f90, openacc.mod, openacc_kinds.mod. * configure.ac (plugin_support): Add check for accelerators if attempting to build plugins. (plugin/configfrag.ac): Include. (offload_targets): Add host_nonshm target by default, nvptx target conditionally if the corresponding offload target is enabled. (testsuite/libgomp-test-support.exp): Add to AC_CONFIG_FILES. * env.c (libgomp_target.h, oacc-int.h): Include. (goacc_notify_var, goacc_device_num, goacc_device_type): New globals. (goacc_parse_device_type): New functions. (initialize_env): Parse GCC_ACC_NOTIFY, ACC_DEVICE_TYPE, ACC_DEVICE_NUM environment variables. Call ACC_runtime_initialize. * error.c (gomp_verror): Make global. (gomp_vfatal, gomp_vnotify, gomp_notify): New functions. (gomp_fatal): Use gomp_vfatal instead of gomp_verror. * libgomp.h (stdarg.h): Include. (struct gomp_memory_mapping): Forward declaration. (goacc_notify_var, goacc_device_num, goacc_device_type): Add extern declarations. (gomp_vnotify, gomp_notify, gomp_verror, gomp_vfatal): Add prototypes. (gomp_init_targets_once): Add prototype. * libgomp.map (OACC_2.0): New symbol version. Add public acc_* interface functions. (PLUGIN_1.0): New symbol version. Add gomp plugin interface functions. * libgomp_g.h (GOACC_data_start, GOACC_data_end, GOACC_kernels) (GOACC_parallel, GOACC_wait): Add prototypes. * libgomp_target.h (gomp-constants.h, splay-tree.h): Include. (offload_target_type): Set enumeration values from constants in gomp-constants.h. Add OFFLOAD_TARGET_TYPE_HOST_NONSHM and OFFLOAD_TARGET_TYPE_NVIDIA_PTX. (struct target_mem_desc): Move to here. (TARGET_CAP_SHARED_MEM, TARGET_CAP_NATIVE_EXEC, TARGET_CAP_OPENMP_400) (TARGET_CAP_OPENACC_200): Define macros. (struct gomp_memory_mapping): New. (struct ACC_dispatch_t): New. (struct gomp_device_descr): Move here. Add offload_regions_registered, openacc dispatch functions, target_data. (gomp_map_vars, gomp_copy_from_async, gomp_unmap_vars, gomp_init_device) (gomp_init_tables, gomp_fini_device, gomp_free_memmap): Add prototypes. * target.c (oacc-plugin.h, gomp-constants.h, oacc-int.h, stdio.h) (assert.h): Include. (splay_tree_node, splay_tree, splay_tree_key, target_mem_desc) (splay_tree_key_s, gomp_device_descr): Don't declare here. (splay_compare): Change linkage to hidden not static. (gomp_init_targets_once): New function. (gomp_get_num_devices): Use above. (get_kind): New function. (gomp_map_vars): Add is_openacc parameter. Change KINDS to void *. Use lock from memory map not device. Use macros from gomp-constants.h instead of hard-coded values. Support OpenACC-specific mappings. (gomp_copy_from_async): New function. (gomp_unmap_vars): Add DO_COPYFROM argument. Only copy memory back from device if it is true. Use lock from memory map not device. (gomp_update): Add is_openacc parameter. Use lock from memory map not device. Use macros from gomp-constants.h instead of hard-coded values. (gomp_register_image_for_device): Add forward declaration. (GOMP_offload_register): Check realloc result. (gomp_init_device): Change linkage to hidden not static. (gomp_init_tables, gomp_init_dev_tables, gomp_free_memmap) (gomp_fini_device): New function. (GOMP_target): Adjust lazy initialization, check target capabilities for OpenMP 4.0 support. Update call to gomp_map_vars, gomp_unmap_vars. (GOMP_target_data): Adjust lazy initialization. Update call to gomp_map_vars. (GOMP_target_end_data): Update call to gomp_unmap_vars. (GOMP_target_update): Tweak lazy initialization. Add new args to gomp_update call. (gomp_load_plugin_for_device): Initialize get_name, get_caps, device_fini and OpenACC-specific plugin hooks. (gomp_register_images_for_device): Rename to... (gomp_register_image_for_device): This, and register a single device only, and only if it has not already had images registered. (gomp_find_available_plugins): Initialize OpenACC-specific bits, offload image registration, and other new device member data. Prefer device with TARGET_CAP_OPENMP_400 if more than one plugin is available. * libgomp-plugin.c: New file. * libgomp-plugin.h: New file. * oacc-async.c: New file. * oacc-cuda.c: New file. * oacc-host.c: New file. * oacc-init.c: New file. * oacc-int.h: New file. * oacc-mem.c: New file. * oacc-parallel.c: New file. * oacc-plugin.c: New file. * oacc-plugin.h: New file. * openacc.f90: New file. * openacc.h: New file. * openacc_lib.h: New file. * splay-tree.h: Move bulk of implementation to... * splay-tree.c: New file. * Makefile.in: Regenerate. * config.h.in: Regenerate. * configure: Regenerate. * plugin/Makefrag.am: New file. * plugin/configfrag.am: New file. * plugin/plugin-host.c: New file. * plugin/plugin-nvptx.c: New file. * testsuite/libgomp-test-support.exp.in: New file. commit ee628c1e4014164b184eabad53bace13472e0d19 Author: Julian Brown Date: Mon Sep 22 02:55:12 2014 -0700 OpenACC support for libgomp. xxxx-xx-xx Nathan Sidwell James Norris Thomas Schwinge Tom de Vries Julian Brown Bernd Schmidt Cesar Philippidis include/ * gomp-constants.h: New file. libgomp/ * Makefile.am (search_path): Search in $(top_srcidr)/../include also. (libgomp_la_SOURCES): Add oacc-parallel.c, splay-tree.c, oacc-fortran.c, oacc-host.c, oacc-init.c, oacc-mem.c, oacc-async.c, oacc-plugin.c, oacc-cuda.c, libgomp-plugin.c. (Makefrag.am): Include. (libgomp_la_SOURCES): Add openacc.f90 if USE_FORTRAN is true. (nodist_libsubinclude_HEADERS): Add openacc.h, ../include/gomp-constants.h. (nodist_finclude_HEADERS): Add openacc_lib.h, openacc.f90, openacc.mod, openacc_kinds.mod. * configure.ac (plugin_support): Add check for accelerators if attempting to build plugins. (plugin/configfrag.ac): Include. (offload_targets): Add host_nonshm target by default, nvptx target conditionally if the corresponding offload target is enabled. (testsuite/libgomp-test-support.exp): Add to AC_CONFIG_FILES. * env.c (libgomp_target.h, oacc-int.h): Include. (goacc_notify_var, goacc_device_num, goacc_device_type): New globals. (goacc_parse_device_type): New functions. (initialize_env): Parse GCC_ACC_NOTIFY, ACC_DEVICE_TYPE, ACC_DEVICE_NUM environment variables. Call ACC_runtime_initialize. * error.c (gomp_verror): Make global. (gomp_vfatal, gomp_vnotify, gomp_notify): New functions. (gomp_fatal): Use gomp_vfatal instead of gomp_verror. * libgomp.h (stdarg.h): Include. (struct gomp_memory_mapping): Forward declaration. (goacc_notify_var, goacc_device_num, goacc_device_type): Add extern declarations. (gomp_vnotify, gomp_notify, gomp_verror, gomp_vfatal): Add prototypes. (gomp_init_targets_once): Add prototype. * libgomp.map (OACC_2.0): New symbol version. Add public acc_* interface functions. (PLUGIN_1.0): New symbol version. Add gomp plugin interface functions. * libgomp_g.h (GOACC_data_start, GOACC_data_end, GOACC_kernels) (GOACC_parallel, GOACC_wait): Add prototypes. * libgomp_target.h (gomp-constants.h, splay-tree.h): Include. (offload_target_type): Set enumeration values from constants in gomp-constants.h. Add OFFLOAD_TARGET_TYPE_HOST_NONSHM and OFFLOAD_TARGET_TYPE_NVIDIA_PTX. (struct target_mem_desc): Move to here. (TARGET_CAP_SHARED_MEM, TARGET_CAP_NATIVE_EXEC, TARGET_CAP_OPENMP_400) (TARGET_CAP_OPENACC_200): Define macros. (struct gomp_memory_mapping): New. (struct ACC_dispatch_t): New. (struct gomp_device_descr): Move here. Add offload_regions_registered, openacc dispatch functions, target_data. (gomp_map_vars, gomp_copy_from_async, gomp_unmap_vars, gomp_init_device) (gomp_init_tables, gomp_fini_device, gomp_free_memmap): Add prototypes. * target.c (oacc-plugin.h, gomp-constants.h, oacc-int.h, stdio.h) (assert.h): Include. (splay_tree_node, splay_tree, splay_tree_key, target_mem_desc) (splay_tree_key_s, gomp_device_descr): Don't declare here. (splay_compare): Change linkage to hidden not static. (gomp_init_targets_once): New function. (gomp_get_num_devices): Use above. (get_kind): New function. (gomp_map_vars): Add is_openacc parameter. Change KINDS to void *. Use lock from memory map not device. Use macros from gomp-constants.h instead of hard-coded values. Support OpenACC-specific mappings. (gomp_copy_from_async): New function. (gomp_unmap_vars): Add DO_COPYFROM argument. Only copy memory back from device if it is true. Use lock from memory map not device. (gomp_update): Add is_openacc parameter. Use lock from memory map not device. Use macros from gomp-constants.h instead of hard-coded values. (gomp_register_image_for_device): Add forward declaration. (GOMP_offload_register): Check realloc result. (gomp_init_device): Change linkage to hidden not static. (gomp_init_tables, gomp_init_dev_tables, gomp_free_memmap) (gomp_fini_device): New function. (GOMP_target): Adjust lazy initialization, check target capabilities for OpenMP 4.0 support. Update call to gomp_map_vars, gomp_unmap_vars. (GOMP_target_data): Adjust lazy initialization. Update call to gomp_map_vars. (GOMP_target_end_data): Update call to gomp_unmap_vars. (GOMP_target_update): Tweak lazy initialization. Add new args to gomp_update call. (gomp_load_plugin_for_device): Initialize get_name, get_caps, device_fini and OpenACC-specific plugin hooks. (gomp_register_images_for_device): Rename to... (gomp_register_image_for_device): This, and register a single device only, and only if it has not already had images registered. (gomp_find_available_plugins): Initialize OpenACC-specific bits, offload image registration, and other new device member data. Prefer device with TARGET_CAP_OPENMP_400 if more than one plugin is available. * libgomp-plugin.c: New file. * libgomp-plugin.h: New file. * oacc-async.c: New file. * oacc-cuda.c: New file. * oacc-fortran.c: New file. * oacc-host.c: New file. * oacc-init.c: New file. * oacc-int.h: New file. * oacc-mem.c: New file. * oacc-parallel.c: New file. * oacc-plugin.c: New file. * oacc-plugin.h: New file. * openacc.f90: New file. * openacc.h: New file. * openacc_lib.h: New file. * splay-tree.h: Move bulk of implementation to... * splay-tree.c: New file. * Makefile.in: Regenerate. * config.h.in: Regenerate. * configure: Regenerate. * plugin/Makefrag.am: New file. * plugin/configfrag.am: New file. * plugin/plugin-host.c: New file. * plugin/plugin-nvptx.c: New file. * testsuite/libgomp-test-support.exp.in: New file. diff --git a/include/gomp-constants.h b/include/gomp-constants.h new file mode 100644 index 0000000..7ef5c88 --- /dev/null +++ b/include/gomp-constants.h @@ -0,0 +1,45 @@ +#ifndef GOMP_CONSTANTS_H +#define GOMP_CONSTANTS_H 1 + +/* Enumerated variable mapping types used to communicate between GCC and + libgomp. These values are used for both OpenMP and OpenACC. */ + +#define GOMP_MAP_ALLOC 0x00 +#define GOMP_MAP_ALLOC_TO 0x01 +#define GOMP_MAP_ALLOC_FROM 0x02 +#define GOMP_MAP_ALLOC_TOFROM 0x03 +#define GOMP_MAP_POINTER 0x04 +#define GOMP_MAP_TO_PSET 0x05 +#define GOMP_MAP_FORCE_ALLOC 0x08 +#define GOMP_MAP_FORCE_TO 0x09 +#define GOMP_MAP_FORCE_FROM 0x0a +#define GOMP_MAP_FORCE_TOFROM 0x0b +#define GOMP_MAP_FORCE_PRESENT 0x0c +#define GOMP_MAP_FORCE_DEALLOC 0x0d +#define GOMP_MAP_FORCE_DEVICEPTR 0x0e +#define GOMP_MAP_FORCE_PRIVATE 0x18 +#define GOMP_MAP_FORCE_FIRSTPRIVATE 0x19 + +#define GOMP_MAP_COPYTO_P(X) \ + ((X) == GOMP_MAP_ALLOC_TO || (X) == GOMP_MAP_FORCE_TO) + +#define GOMP_MAP_COPYFROM_P(X) \ + ((X) == GOMP_MAP_ALLOC_FROM || (X) == GOMP_MAP_FORCE_FROM) + +#define GOMP_MAP_TOFROM_P(X) \ + ((X) == GOMP_MAP_ALLOC_TOFROM || (X) == GOMP_MAP_FORCE_TOFROM) + +#define GOMP_MAP_POINTER_P(X) \ + ((X) == GOMP_MAP_POINTER) + +#define GOMP_IF_CLAUSE_FALSE -2 + +/* Canonical list of target type codes for OpenMP/OpenACC. */ +#define GOMP_TARGET_NONE 0 +#define GOMP_TARGET_HOST 2 +#define GOMP_TARGET_HOST_NONSHM 3 +#define GOMP_TARGET_NOT_HOST 4 +#define GOMP_TARGET_NVIDIA_PTX 5 +#define GOMP_TARGET_INTEL_MIC 6 + +#endif diff --git a/libgomp/Makefile.am b/libgomp/Makefile.am index 427415e..4c73c7a 100644 --- a/libgomp/Makefile.am +++ b/libgomp/Makefile.am @@ -7,7 +7,8 @@ SUBDIRS = testsuite gcc_version := $(shell cat $(top_srcdir)/../gcc/BASE-VER) config_path = @config_path@ -search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir) +search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir) \ + $(top_srcdir)/../include fincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/finclude libsubincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/include @@ -60,12 +61,21 @@ libgomp_la_LINK = $(LINK) $(libgomp_la_LDFLAGS) libgomp_la_SOURCES = alloc.c barrier.c critical.c env.c error.c iter.c \ iter_ull.c loop.c loop_ull.c ordered.c parallel.c sections.c single.c \ task.c team.c work.c lock.c mutex.c proc.c sem.c bar.c ptrlock.c \ - time.c fortran.c affinity.c target.c + time.c fortran.c affinity.c target.c oacc-parallel.c splay-tree.c \ + oacc-host.c oacc-init.c oacc-mem.c oacc-async.c oacc-plugin.c \ + oacc-cuda.c libgomp-plugin.c + +include $(top_srcdir)/plugin/Makefrag.am + +if USE_FORTRAN +libgomp_la_SOURCES += openacc.f90 +endif nodist_noinst_HEADERS = libgomp_f.h -nodist_libsubinclude_HEADERS = omp.h +nodist_libsubinclude_HEADERS = omp.h openacc.h ../include/gomp-constants.h if USE_FORTRAN -nodist_finclude_HEADERS = omp_lib.h omp_lib.f90 omp_lib.mod omp_lib_kinds.mod +nodist_finclude_HEADERS = omp_lib.h omp_lib.f90 omp_lib.mod omp_lib_kinds.mod \ + openacc_lib.h openacc.f90 openacc.mod openacc_kinds.mod endif LTLDFLAGS = $(shell $(SHELL) $(top_srcdir)/../libtool-ldflags $(LDFLAGS)) diff --git a/libgomp/Makefile.in b/libgomp/Makefile.in index 5cd666f..88a4f46 100644 diff --git a/libgomp/config.h.in b/libgomp/config.h.in index 94a2b3b..309962d 100644 diff --git a/libgomp/configure b/libgomp/configure index 97c9be6..d325306 100755 diff --git a/libgomp/configure.ac b/libgomp/configure.ac index 3f34ff8..2b701ca 100644 --- a/libgomp/configure.ac +++ b/libgomp/configure.ac @@ -2,6 +2,8 @@ # aclocal -I ../config && autoconf && autoheader && automake AC_PREREQ(2.64) +#TODO: Update for OpenACC? But then also have to update copyright notices in +#all source files... AC_INIT([GNU OpenMP Runtime Library], 1.0,,[libgomp]) AC_CONFIG_HEADER(config.h) @@ -28,7 +30,6 @@ LIBGOMP_ENABLE(generated-files-in-srcdir, no, , AC_MSG_RESULT($enable_generated_files_in_srcdir) AM_CONDITIONAL(GENINSRC, test "$enable_generated_files_in_srcdir" = yes) - # ------- # ------- @@ -198,8 +199,12 @@ AC_CHECK_LIB(dl, dlsym, , [plugin_support=no]) if test x"$plugin_support" = xyes; then AC_DEFINE(PLUGIN_SUPPORT, 1, [Define if all infrastructure, needed for plugins, is supported.]) +elif test "x$enable_accelerator" != xno; then + AC_MSG_ERROR([Can't have support for accelerators without support for plugins]) fi +m4_include([plugin/configfrag.ac]) + # Check for functions needed. AC_CHECK_FUNCS(getloadavg clock_gettime strtoull) @@ -280,13 +285,15 @@ else multilib_arg= fi -offload_targets= +offload_targets=host_nonshm if test x"$enable_offload_targets" != x; then for tgt in `echo $enable_offload_targets | sed -e 's#,# #g'`; do tgt=`echo $tgt | sed 's/=.*//'` case $tgt in *-intelmic-* | *-intelmicemul-*) tgt_name="intelmic" ;; + nvptx-*) + tgt_name="nvptx" ;; *) AC_MSG_ERROR([unknown offload target specified]) ;; esac @@ -374,4 +381,5 @@ CFLAGS="$save_CFLAGS" AC_CONFIG_FILES(omp.h omp_lib.h omp_lib.f90 libgomp_f.h) AC_CONFIG_FILES(Makefile testsuite/Makefile libgomp.spec) +AC_CONFIG_FILES([testsuite/libgomp-test-support.exp]) AC_OUTPUT diff --git a/libgomp/env.c b/libgomp/env.c index 94c72a3..26d2149 100644 --- a/libgomp/env.c +++ b/libgomp/env.c @@ -27,6 +27,8 @@ #include "libgomp.h" #include "libgomp_f.h" +#include "libgomp_target.h" +#include "oacc-int.h" #include #include #include @@ -77,6 +79,10 @@ unsigned long gomp_bind_var_list_len; void **gomp_places_list; unsigned long gomp_places_list_len; +int goacc_notify_var; +int goacc_device_num; +char* goacc_device_type; + /* Parse the OMP_SCHEDULE environment variable. */ static void @@ -1011,6 +1017,16 @@ parse_affinity (bool ignore) return false; } +static void +goacc_parse_device_type (void) +{ + const char *env = getenv ("ACC_DEVICE_TYPE"); + + if (env && *env != '\0') + goacc_device_type = strdup (env); + else + goacc_device_type = NULL; +} static void handle_omp_display_env (unsigned long stacksize, int wait_policy) @@ -1181,6 +1197,7 @@ initialize_env (void) gomp_global_icv.thread_limit_var = thread_limit_var > INT_MAX ? UINT_MAX : thread_limit_var; } + parse_int ("GCC_ACC_NOTIFY", &goacc_notify_var, true); #ifndef HAVE_SYNC_BUILTINS gomp_mutex_init (&gomp_managed_threads_lock); #endif @@ -1271,6 +1288,15 @@ initialize_env (void) } handle_omp_display_env (stacksize, wait_policy); + + /* Look for OpenACC-specific environment variables. */ + if (!parse_int ("ACC_DEVICE_NUM", &goacc_device_num, true)) + goacc_device_num = 0; + + goacc_parse_device_type (); + + /* Initialize OpenACC-specific internal state. */ + ACC_runtime_initialize (); } diff --git a/libgomp/error.c b/libgomp/error.c index d9b28f1..320b4d2 100644 --- a/libgomp/error.c +++ b/libgomp/error.c @@ -35,7 +35,7 @@ #include -static void +void gomp_verror (const char *fmt, va_list list) { fputs ("\nlibgomp: ", stderr); @@ -54,13 +54,39 @@ gomp_error (const char *fmt, ...) } void +gomp_vfatal (const char *fmt, va_list list) +{ + gomp_verror (fmt, list); + exit (EXIT_FAILURE); +} + +void gomp_fatal (const char *fmt, ...) { va_list list; va_start (list, fmt); - gomp_verror (fmt, list); + gomp_vfatal (fmt, list); va_end (list); - exit (EXIT_FAILURE); + /* Unreachable. */ + abort (); +} + +void +gomp_vnotify (const char *msg, va_list list) +{ + if (goacc_notify_var) + vfprintf (stderr, msg, list); +} + +void +gomp_notify (const char *msg, ...) +{ + va_list list; + + va_start (list, msg); + gomp_vnotify (msg, list); + va_end (list); } + diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c new file mode 100644 index 0000000..f0e35d6 --- /dev/null +++ b/libgomp/libgomp-plugin.c @@ -0,0 +1,107 @@ +/* Copyright (C) 2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* Exported (non-hidden) functions exposing libgomp interface for plugins. */ + +#include + +#include "libgomp.h" +#include "libgomp-plugin.h" +#include "target.h" + +void * +GOMP_PLUGIN_malloc (size_t size) +{ + return gomp_malloc (size); +} + +void * +GOMP_PLUGIN_malloc_cleared (size_t size) +{ + return gomp_malloc_cleared (size); +} + +void * +GOMP_PLUGIN_realloc (void *ptr, size_t size) +{ + return gomp_realloc (ptr, size); +} + +void +GOMP_PLUGIN_error (const char *msg, ...) +{ + va_list ap; + + va_start (ap, msg); + gomp_verror (msg, ap); + va_end (ap); +} + +void +GOMP_PLUGIN_notify (const char *msg, ...) +{ + va_list ap; + + va_start (ap, msg); + gomp_vnotify (msg, ap); + va_end (ap); +} + +void +GOMP_PLUGIN_fatal (const char *msg, ...) +{ + va_list ap; + + va_start (ap, msg); + gomp_vfatal (msg, ap); + va_end (ap); + + /* Unreachable. */ + abort (); +} + +void +GOMP_PLUGIN_mutex_init (gomp_mutex_t *mutex) +{ + gomp_mutex_init (mutex); +} + +void +GOMP_PLUGIN_mutex_destroy (gomp_mutex_t *mutex) +{ + gomp_mutex_destroy (mutex); +} + +void +GOMP_PLUGIN_mutex_lock (gomp_mutex_t *mutex) +{ + gomp_mutex_lock (mutex); +} + +void +GOMP_PLUGIN_mutex_unlock (gomp_mutex_t *mutex) +{ + gomp_mutex_unlock (mutex); +} diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h new file mode 100644 index 0000000..87367e3 --- /dev/null +++ b/libgomp/libgomp-plugin.h @@ -0,0 +1,54 @@ +/* Copyright (C) 2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* An interface to various libgomp-internal functions for use by plugins. */ + +#ifndef LIBGOMP_PLUGIN_H +#define LIBGOMP_PLUGIN_H 1 + +#include "mutex.h" + +/* alloc.c */ + +extern void *GOMP_PLUGIN_malloc (size_t) __attribute__((malloc)); +extern void *GOMP_PLUGIN_malloc_cleared (size_t) __attribute__((malloc)); +extern void *GOMP_PLUGIN_realloc (void *, size_t); + +/* error.c */ + +extern void GOMP_PLUGIN_notify(const char *msg, ...); +extern void GOMP_PLUGIN_error (const char *, ...) + __attribute__((format (printf, 1, 2))); +extern void GOMP_PLUGIN_fatal (const char *, ...) + __attribute__((noreturn, format (printf, 1, 2))); + +/* mutex.c */ + +extern void GOMP_PLUGIN_mutex_init (gomp_mutex_t *mutex); +extern void GOMP_PLUGIN_mutex_destroy (gomp_mutex_t *mutex); +extern void GOMP_PLUGIN_mutex_lock (gomp_mutex_t *mutex); +extern void GOMP_PLUGIN_mutex_unlock (gomp_mutex_t *mutex); + +#endif diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index a1482cc..251e61b 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -40,6 +40,7 @@ #include #include #include +#include #ifdef HAVE_ATTRIBUTE_VISIBILITY # pragma GCC visibility push(hidden) @@ -220,6 +221,7 @@ struct gomp_team_state }; struct target_mem_desc; +struct gomp_memory_mapping; /* These are the OpenMP 4.0 Internal Control Variables described in section 2.3.1. Those described as having one copy per task are @@ -254,6 +256,10 @@ extern unsigned long gomp_bind_var_list_len; extern void **gomp_places_list; extern unsigned long gomp_places_list_len; +extern int goacc_notify_var; +extern int goacc_device_num; +extern char* goacc_device_type; + enum gomp_task_kind { GOMP_TASK_IMPLICIT, @@ -532,8 +538,12 @@ extern void *gomp_realloc (void *, size_t); /* error.c */ +extern void gomp_vnotify (const char *, va_list); +extern void gomp_notify (const char *msg, ...); +extern void gomp_verror (const char *, va_list); extern void gomp_error (const char *, ...) __attribute__((format (printf, 1, 2))); +extern void gomp_vfatal (const char *, va_list); extern void gomp_fatal (const char *, ...) __attribute__((noreturn, format (printf, 1, 2))); @@ -606,6 +616,7 @@ extern void gomp_free_thread (void *); /* target.c */ +extern void gomp_init_targets_once (void); extern int gomp_get_num_devices (void); /* work.c */ diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map index f36df23..938f6bf 100644 --- a/libgomp/libgomp.map +++ b/libgomp/libgomp.map @@ -232,3 +232,115 @@ GOMP_4.0.1 { global: GOMP_offload_register; } GOMP_4.0; + +OACC_2.0 { + global: + acc_get_num_devices; + acc_get_num_devices_h_; + acc_set_device_type; + acc_set_device_type_h_; + acc_get_device_type; + acc_get_device_type_h_; + acc_set_device_num; + acc_set_device_num_h_; + acc_get_device_num; + acc_get_device_num_h_; + acc_async_test; + acc_async_test_h_; + acc_async_test_all; + acc_async_test_all_h_; + acc_wait; + acc_wait_h_; + acc_wait_async; + acc_wait_async_h_; + acc_wait_all; + acc_wait_all_h_; + acc_wait_all_async; + acc_wait_all_async_h_; + acc_init; + acc_init_h_; + acc_shutdown; + acc_shutdown_h_; + acc_on_device; + acc_on_device_h_; + acc_malloc; + acc_free; + acc_copyin; + acc_copyin_32_h_; + acc_copyin_64_h_; + acc_copyin_array_h_; + acc_present_or_copyin; + acc_present_or_copyin_32_h_; + acc_present_or_copyin_64_h_; + acc_present_or_copyin_array_h_; + acc_create; + acc_create_32_h_; + acc_create_64_h_; + acc_create_array_h_; + acc_present_or_create; + acc_present_or_create_32_h_; + acc_present_or_create_64_h_; + acc_present_or_create_array_h_; + acc_copyout; + acc_copyout_32_h_; + acc_copyout_64_h_; + acc_copyout_array_h_; + acc_delete; + acc_delete_32_h_; + acc_delete_64_h_; + acc_delete_array_h_; + acc_update_device; + acc_update_device_32_h_; + acc_update_device_64_h_; + acc_update_device_array_h_; + acc_update_self; + acc_update_self_32_h_; + acc_update_self_64_h_; + acc_update_self_array_h_; + acc_map_data; + acc_unmap_data; + acc_deviceptr; + acc_hostptr; + acc_is_present; + acc_is_present_32_h_; + acc_is_present_64_h_; + acc_is_present_array_h_; + acc_memcpy_to_device; + acc_memcpy_from_device; + acc_get_current_cuda_device; + acc_get_current_cuda_context; + acc_get_cuda_stream; + acc_set_cuda_stream; +}; + +GOACC_2.0 { + global: + GOACC_data_end; + GOACC_data_start; + GOACC_kernels; + GOACC_parallel; + GOACC_update; + GOACC_wait; +}; + +PLUGIN_1.0 { + global: + GOMP_PLUGIN_malloc; + GOMP_PLUGIN_malloc_cleared; + GOMP_PLUGIN_realloc; + GOMP_PLUGIN_error; + GOMP_PLUGIN_notify; + GOMP_PLUGIN_fatal; + GOMP_PLUGIN_mutex_init; + GOMP_PLUGIN_mutex_destroy; + GOMP_PLUGIN_mutex_lock; + GOMP_PLUGIN_mutex_unlock; + GOMP_PLUGIN_async_unmap_vars; + GOMP_PLUGIN_acc_thread; +}; + +# TODO. See testsuite/lib/libgomp.exp:libgomp_init. +INTERNAL { + global: + initialize_env; +}; diff --git a/libgomp/libgomp_g.h b/libgomp/libgomp_g.h index be0c6ea..44f200c 100644 --- a/libgomp/libgomp_g.h +++ b/libgomp/libgomp_g.h @@ -214,4 +214,17 @@ extern void GOMP_target_update (int, const void *, size_t, void **, size_t *, unsigned char *); extern void GOMP_teams (unsigned int, unsigned int); +/* oacc-parallel.c */ + +extern void GOACC_data_start (int, const void *, + size_t, void **, size_t *, unsigned short *); +extern void GOACC_data_end (void); +extern void GOACC_kernels (int, void (*) (void *), const void *, + size_t, void **, size_t *, unsigned short *, + int, int, int, int, int, ...); +extern void GOACC_parallel (int, void (*) (void *), const void *, + size_t, void **, size_t *, unsigned short *, + int, int, int, int, int, ...); +extern void GOACC_wait (int, int, ...); + #endif /* LIBGOMP_G_H */ diff --git a/libgomp/libgomp_target.h b/libgomp/libgomp_target.h index f7d19d0..9d6b7e4 100644 --- a/libgomp/libgomp_target.h +++ b/libgomp/libgomp_target.h @@ -24,11 +24,15 @@ #ifndef LIBGOMP_TARGET_H #define LIBGOMP_TARGET_H 1 -/* Type of offload target device. */ +#include "gomp-constants.h" + +/* Type of offload target device. Keep in sync with openacc.h:acc_device_t. */ enum offload_target_type { - OFFLOAD_TARGET_TYPE_HOST, - OFFLOAD_TARGET_TYPE_INTEL_MIC + OFFLOAD_TARGET_TYPE_HOST = GOMP_TARGET_HOST, + OFFLOAD_TARGET_TYPE_HOST_NONSHM = GOMP_TARGET_HOST_NONSHM, + OFFLOAD_TARGET_TYPE_NVIDIA_PTX = GOMP_TARGET_NVIDIA_PTX, + OFFLOAD_TARGET_TYPE_INTEL_MIC = GOMP_TARGET_INTEL_MIC }; /* Auxiliary struct, used for transferring a host-target address range mapping @@ -41,4 +45,177 @@ struct mapping_table uintptr_t tgt_end; }; +#include "splay-tree.h" + +struct target_mem_desc { + /* Reference count. */ + uintptr_t refcount; + /* All the splay nodes allocated together. */ + splay_tree_node array; + /* Start of the target region. */ + uintptr_t tgt_start; + /* End of the targer region. */ + uintptr_t tgt_end; + /* Handle to free. */ + void *to_free; + /* Previous target_mem_desc. */ + struct target_mem_desc *prev; + /* Number of items in following list. */ + size_t list_count; + + /* Corresponding target device descriptor. */ + struct gomp_device_descr *device_descr; + + /* Memory mapping info for the thread that created this descriptor. */ + struct gomp_memory_mapping *mem_map; + + /* List of splay keys to remove (or decrease refcount) + at the end of region. */ + splay_tree_key list[]; +}; + +#define TARGET_CAP_SHARED_MEM 1 +#define TARGET_CAP_NATIVE_EXEC 2 +#define TARGET_CAP_OPENMP_400 4 +#define TARGET_CAP_OPENACC_200 8 + +/* Information about mapped memory regions (per device/context). */ + +struct gomp_memory_mapping +{ + /* Splay tree containing information about mapped memory regions. */ + struct splay_tree_s splay_tree; + + /* Mutex for operating with the splay tree and other shared structures. */ + gomp_mutex_t lock; + + /* True when tables have been added to this memory map. */ + bool is_initialized; +}; + +typedef struct ACC_dispatch_t +{ + /* This is a linked list of data mapped using the + acc_map_data/acc_unmap_data or "acc enter data"/"acc exit data" pragmas + (TODO). Unlike mapped_data in the goacc_thread struct, unmapping can + happen out-of-order with respect to mapping. */ + struct target_mem_desc *data_environ; + + /* Open or close a device instance. */ + void *(*open_device_func) (int n); + int (*close_device_func) (void *h); + + /* Set or get the device number. */ + int (*get_device_num_func) (void); + void (*set_device_num_func) (int); + + /* Execute. */ + void (*exec_func) (void (*) (void *), size_t, void **, void **, size_t *, + unsigned short *, int, int, int, int, void *); + + /* Async cleanup callback registration. */ + void (*register_async_cleanup_func) (void *); + + /* Asynchronous routines. */ + int (*async_test_func) (int); + int (*async_test_all_func) (void); + void (*async_wait_func) (int); + void (*async_wait_async_func) (int, int); + void (*async_wait_all_func) (void); + void (*async_wait_all_async_func) (int); + void (*async_set_async_func) (int); + + /* Create/destroy TLS data. */ + void *(*create_thread_data_func) (void *); + void (*destroy_thread_data_func) (void *); + + /* NVIDIA target specific routines. */ + struct { + void *(*get_current_device_func) (void); + void *(*get_current_context_func) (void); + void *(*get_stream_func) (int); + int (*set_stream_func) (int, void *); + } cuda; +} ACC_dispatch_t; + +/* This structure describes accelerator device. + It contains name of the corresponding libgomp plugin, function handlers for + interaction with the device, ID-number of the device, and information about + mapped memory. */ +struct gomp_device_descr +{ + /* The name of the device. */ + const char *name; + + /* Capabilities of device (supports OpenACC, OpenMP). */ + unsigned int capabilities; + + /* This is the ID number of device. It could be specified in DEVICE-clause of + TARGET construct. */ + int id; + + /* This is the ID number of device among devices of the same type. */ + int target_id; + + /* This is the TYPE of device. */ + enum offload_target_type type; + + /* Set to true when device is initialized. */ + bool is_initialized; + + /* True when offload regions have been registered with this device. */ + bool offload_regions_registered; + + /* Plugin file handler. */ + void *plugin_handle; + + /* Function handlers. */ + const char *(*get_name_func) (void); + unsigned int (*get_caps_func) (void); + int (*get_type_func) (void); + int (*get_num_devices_func) (void); + void (*register_image_func) (void *, void *); + void (*init_device_func) (int); + void (*fini_device_func) (int); + int (*get_table_func) (int, struct mapping_table **); + void *(*alloc_func) (int, size_t); + void (*free_func) (int, void *); + void *(*dev2host_func) (int, void *, const void *, size_t); + void *(*host2dev_func) (int, void *, const void *, size_t); + void (*run_func) (int, void *, void *); + + /* OpenACC-specific functions. */ + ACC_dispatch_t openacc; + + /* Memory-mapping info for this device instance. */ + struct gomp_memory_mapping mem_map; + + /* Extra information required for a device instance by a given target. */ + void *target_data; +}; + +extern struct target_mem_desc * +gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, + void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds, + bool is_openacc, bool is_target); + +extern void +gomp_copy_from_async (struct target_mem_desc *tgt); + +extern void +gomp_unmap_vars (struct target_mem_desc *tgt, bool); + +extern attribute_hidden void +gomp_init_device (struct gomp_device_descr *devicep); + +extern attribute_hidden void +gomp_init_tables (const struct gomp_device_descr *devicep, + struct gomp_memory_mapping *mm); + +extern attribute_hidden void +gomp_fini_device (struct gomp_device_descr *devicep); + +extern attribute_hidden void +gomp_free_memmap (struct gomp_device_descr *devicep); + #endif /* LIBGOMP_TARGET_H */ diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c new file mode 100644 index 0000000..94c62d8 --- /dev/null +++ b/libgomp/oacc-async.c @@ -0,0 +1,77 @@ +/* OpenACC Runtime Library Definitions. + + Copyright (C) 2013-2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + + +#include "openacc.h" +#include "libgomp.h" +#include "libgomp_target.h" +#include "oacc-int.h" + +int +acc_async_test (int async) +{ + if (async < acc_async_sync) + gomp_fatal ("invalid async argument: %d", async); + + return base_dev->openacc.async_test_func (async); +} + +int +acc_async_test_all (void) +{ + return base_dev->openacc.async_test_all_func (); +} + +void +acc_wait (int async) +{ + if (async < acc_async_sync) + gomp_fatal ("invalid async argument: %d", async); + + base_dev->openacc.async_wait_func (async); +} + +void +acc_wait_async (int async1, int async2) +{ + base_dev->openacc.async_wait_async_func (async1, async2); +} + +void +acc_wait_all (void) +{ + base_dev->openacc.async_wait_all_func (); +} + +void +acc_wait_all_async (int async) +{ + if (async < acc_async_sync) + gomp_fatal ("invalid async argument: %d", async); + + base_dev->openacc.async_wait_all_async_func (async); +} diff --git a/libgomp/oacc-cuda.c b/libgomp/oacc-cuda.c new file mode 100644 index 0000000..9965d5c --- /dev/null +++ b/libgomp/oacc-cuda.c @@ -0,0 +1,84 @@ +/* OpenACC Runtime Library: CUDA support glue. + + Copyright (C) 2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +#include "openacc.h" +#include "config.h" +#include "libgomp.h" +#include "libgomp_target.h" +#include "oacc-int.h" + +void * +acc_get_current_cuda_device (void) +{ + void *p = NULL; + + if (base_dev && base_dev->openacc.cuda.get_current_device_func) + p = base_dev->openacc.cuda.get_current_device_func (); + + return p; +} + +void * +acc_get_current_cuda_context (void) +{ + void *p = NULL; + + if (base_dev && base_dev->openacc.cuda.get_current_context_func) + p = base_dev->openacc.cuda.get_current_context_func (); + + return p; +} + +void * +acc_get_cuda_stream (int async) +{ + void *p = NULL; + + if (async < 0) + return p; + + if (base_dev && base_dev->openacc.cuda.get_stream_func) + p = base_dev->openacc.cuda.get_stream_func (async); + + return p; +} + +int +acc_set_cuda_stream (int async, void *stream) +{ + int s = -1; + + if (async < 0 || stream == NULL) + return 0; + + ACC_lazy_initialize (); + + if (base_dev && base_dev->openacc.cuda.set_stream_func) + s = base_dev->openacc.cuda.set_stream_func (async, stream); + + return s; +} diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c new file mode 100644 index 0000000..079ba3c --- /dev/null +++ b/libgomp/oacc-host.c @@ -0,0 +1,30 @@ +/* OpenACC Runtime Library: acc_device_host, acc_device_host_nonshm. + + Copyright (C) 2013-2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* This shares much of the implementation of the plugin-host.c "host_nonshm" + plugin. */ +#include "plugin/plugin-host.c" diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c new file mode 100644 index 0000000..12d52e7 --- /dev/null +++ b/libgomp/oacc-init.c @@ -0,0 +1,613 @@ +/* OpenACC Runtime initialization routines + + Copyright (C) 2013-2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +#include "libgomp.h" +#include "libgomp_target.h" +#include "oacc-int.h" +#include "openacc.h" +#include +#include +#include +#include +#include + +static gomp_mutex_t acc_device_lock; + +/* The dispatch table for the current accelerator device. This is global, so + you can only have one type of device open at any given time in a program. + This is the "base" device in that several devices that use the same + dispatch table may be active concurrently: this one (the "zeroth") is used + for overall initialisation/shutdown, and other instances -- not necessarily + including this one -- may be opened and closed once the base device has + been initialized. */ +struct gomp_device_descr const *base_dev; + +#ifdef HAVE_TLS +__thread struct goacc_thread *goacc_tls_data; +#else +pthread_key_t goacc_tls_key; +#endif +static pthread_key_t goacc_cleanup_key; + +/* Current dispatcher, and how it was initialized */ +static acc_device_t init_key = _ACC_device_hwm; + +static struct goacc_thread *goacc_threads; +static gomp_mutex_t goacc_thread_lock; + +/* An array of dispatchers for device types, indexed by the type. This array + only references "base" devices, and other instances of the same type are + found by simply indexing from each such device (which are stored linearly, + grouped by device in target.c:devices). */ +static struct gomp_device_descr const *dispatchers[_ACC_device_hwm] = { 0 }; + +void +ACC_register (struct gomp_device_descr const *disp) +{ + /* Only register the 0th device here. */ + if (disp->target_id != 0) + return; + + gomp_mutex_lock (&acc_device_lock); + + assert (acc_device_type (disp->type) != acc_device_none + && acc_device_type (disp->type) != acc_device_default + && acc_device_type (disp->type) != acc_device_not_host); + assert (!dispatchers[disp->type]); + dispatchers[disp->type] = disp; + + gomp_mutex_unlock (&acc_device_lock); +} + +static struct gomp_device_descr const * +resolve_device (acc_device_t d) +{ + acc_device_t d_arg = d; + + switch (d) + { + case acc_device_default: + { + if (goacc_device_type) + { + /* Lookup the named device. */ + while (++d != _ACC_device_hwm) + if (dispatchers[d] + && !strcasecmp (goacc_device_type, dispatchers[d]->name) + && dispatchers[d]->get_num_devices_func () > 0) + goto found; + + gomp_fatal ("device type %s not supported", goacc_device_type); + } + + /* No default device specified, so start scanning for any non-host + device that is available. */ + d = acc_device_not_host; + } + /* FALLTHROUGH */ + + case acc_device_not_host: + /* Find the first available device after acc_device_not_host. */ + while (++d != _ACC_device_hwm) + if (dispatchers[d] && dispatchers[d]->get_num_devices_func () > 0) + goto found; + if (d_arg == acc_device_default) + { + d = acc_device_host; + goto found; + } + gomp_fatal ("no device found"); + break; + + case acc_device_host: + break; + + default: + if (d > _ACC_device_hwm) + gomp_fatal ("device %u out of range", (unsigned)d); + break; + } + found: + + assert (d != acc_device_none + && d != acc_device_default + && d != acc_device_not_host); + + return dispatchers[d]; +} + +/* This is called when plugins have been initialized, and serves to call + (indirectly) the target's device_init hook. Calling multiple times without + an intervening _acc_shutdown call is an error. */ + +static struct gomp_device_descr const * +_acc_init (acc_device_t d) +{ + struct gomp_device_descr const *acc_dev; + + acc_dev = resolve_device (d); + + if (!acc_dev || acc_dev->get_num_devices_func () <= 0) + gomp_fatal ("device %u not supported", (unsigned)d); + + if (acc_dev->is_initialized) + gomp_fatal ("device already active"); + + /* We need to remember what we were intialized as, to check shutdown etc. */ + init_key = d; + + gomp_init_device ((struct gomp_device_descr *) acc_dev); + + return acc_dev; +} + +static struct goacc_thread * +goacc_new_thread (void) +{ + struct goacc_thread *thr = gomp_malloc (sizeof (struct gomp_thread)); + +#ifdef HAVE_TLS + goacc_tls_data = thr; +#else + pthread_setspecific (goacc_tls_key, thr); +#endif + + pthread_setspecific (goacc_cleanup_key, thr); + + gomp_mutex_lock (&goacc_thread_lock); + thr->next = goacc_threads; + goacc_threads = thr; + gomp_mutex_unlock (&goacc_thread_lock); + + return thr; +} + +static void +goacc_destroy_thread (void *data) +{ + struct goacc_thread *thr = data, *walk, *prev; + + gomp_mutex_lock (&goacc_thread_lock); + + if (thr) + { + if (base_dev && thr->target_tls) + { + base_dev->openacc.destroy_thread_data_func (thr->target_tls); + thr->target_tls = NULL; + } + + assert (!thr->mapped_data); + + /* Remove from thread list. */ + for (prev = NULL, walk = goacc_threads; walk; + prev = walk, walk = walk->next) + if (walk == thr) + { + if (prev == NULL) + goacc_threads = walk->next; + else + prev->next = walk->next; + + free (thr); + + break; + } + + assert (walk); + } + + gomp_mutex_unlock (&goacc_thread_lock); +} + +/* Open the ORD'th device of the currently-active type (base_dev must be + initialised before calling). If ORD is < 0, open the default-numbered + device (set by the ACC_DEVICE_NUM environment variable or a call to + acc_set_device_num), or leave any currently-opened device as is. "Opening" + consists of calling the device's open_device_func hook, and setting up + thread-local data (maybe allocating, then initializing with information + pertaining to the newly-opened or previously-opened device). */ + +static void +lazy_open (int ord) +{ + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev; + + if (thr && thr->dev) + { + assert (ord < 0 || ord == thr->dev->target_id); + return; + } + + assert (base_dev); + + if (ord < 0) + ord = goacc_device_num; + + if (ord >= base_dev->get_num_devices_func ()) + gomp_fatal ("device %u does not exist", ord); + + if (!thr) + thr = goacc_new_thread (); + + acc_dev = thr->dev = (struct gomp_device_descr *) &base_dev[ord]; + + assert (acc_dev->target_id == ord); + + thr->saved_bound_dev = NULL; + thr->mapped_data = NULL; + + if (!acc_dev->target_data) + acc_dev->target_data = acc_dev->openacc.open_device_func (ord); + + thr->target_tls + = acc_dev->openacc.create_thread_data_func (acc_dev->target_data); + + acc_dev->openacc.async_set_async_func (acc_async_sync); + + if (!acc_dev->mem_map.is_initialized) + gomp_init_tables (acc_dev, &acc_dev->mem_map); +} + +/* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of + init/shutdown is per-process or per-thread. We choose per-process. */ + +void +acc_init (acc_device_t d) +{ + if (!base_dev) + gomp_init_targets_once (); + + gomp_mutex_lock (&acc_device_lock); + + base_dev = _acc_init (d); + + lazy_open (-1); + + gomp_mutex_unlock (&acc_device_lock); +} + +ialias (acc_init) + +void +_acc_shutdown (acc_device_t d) +{ + struct goacc_thread *walk; + + /* We don't check whether d matches the actual device found, because + OpenACC 2.0 (3.2.12) says the parameters to the init and this + call must match (for the shutdown call anyway, it's silent on + others). */ + + if (!base_dev) + gomp_fatal ("no device initialized"); + if (d != init_key) + gomp_fatal ("device %u(%u) is initialized", + (unsigned) init_key, (unsigned) base_dev->type); + + gomp_mutex_lock (&goacc_thread_lock); + + /* Free target-specific TLS data and close all devices. */ + for (walk = goacc_threads; walk != NULL; walk = walk->next) + { + if (walk->target_tls) + base_dev->openacc.destroy_thread_data_func (walk->target_tls); + + walk->target_tls = NULL; + + /* This would mean the user is shutting down OpenACC in the middle of an + "acc data" pragma. Likely not intentional. */ + if (walk->mapped_data) + gomp_fatal ("shutdown in 'acc data' region"); + + if (walk->dev) + { + if (walk->dev->openacc.close_device_func (walk->dev->target_data) < 0) + gomp_fatal ("failed to close device"); + + walk->dev->target_data = NULL; + + gomp_free_memmap (walk->dev); + + walk->dev = NULL; + } + } + + gomp_mutex_unlock (&goacc_thread_lock); + + gomp_fini_device ((struct gomp_device_descr *) base_dev); + + base_dev = NULL; +} + +void +acc_shutdown (acc_device_t d) +{ + gomp_mutex_lock (&acc_device_lock); + + _acc_shutdown (d); + + gomp_mutex_unlock (&acc_device_lock); +} + +ialias (acc_shutdown) + +/* This function is called after plugins have been initialized. It deals with + the "base" device, and is used to prepare the runtime for dealing with a + number of such devices (as implemented by some particular plugin). If the + argument device type D matches a previous call to the function, return the + current base device, else shut the old device down and re-initialize with + the new device type. */ + +static struct gomp_device_descr const * +lazy_init (acc_device_t d) +{ + if (base_dev) + { + /* Re-initializing the same device, do nothing. */ + if (d == init_key) + return base_dev; + + _acc_shutdown (init_key); + } + + assert (!base_dev); + + return _acc_init (d); +} + +/* Ensure that plugins are loaded, initialize and open the (default-numbered) + device. */ + +static void +lazy_init_and_open (acc_device_t d) +{ + if (!base_dev) + gomp_init_targets_once (); + + gomp_mutex_lock (&acc_device_lock); + + base_dev = lazy_init (d); + + lazy_open (-1); + + gomp_mutex_unlock (&acc_device_lock); +} + +int +acc_get_num_devices (acc_device_t d) +{ + int n = 0; + struct gomp_device_descr const *acc_dev; + + if (d == acc_device_none) + return 0; + + if (!base_dev) + gomp_init_targets_once (); + + acc_dev = resolve_device (d); + if (!acc_dev) + return 0; + + n = acc_dev->get_num_devices_func (); + if (n < 0) + n = 0; + + return n; +} + +ialias (acc_get_num_devices) + +void +acc_set_device_type (acc_device_t d) +{ + lazy_init_and_open (d); +} + +ialias (acc_set_device_type) + +acc_device_t +acc_get_device_type (void) +{ + acc_device_t res = acc_device_none; + const struct gomp_device_descr *dev; + + if (base_dev) + res = acc_device_type (base_dev->type); + else + { + gomp_init_targets_once (); + + dev = resolve_device (acc_device_default); + res = acc_device_type (dev->type); + } + + assert (res != acc_device_default + && res != acc_device_not_host); + + return res; +} + +ialias (acc_get_device_type) + +int +acc_get_device_num (acc_device_t d) +{ + const struct gomp_device_descr *dev; + int num; + + if (d >= _ACC_device_hwm) + gomp_fatal ("device %u out of range", (unsigned)d); + + if (!base_dev) + gomp_init_targets_once (); + + dev = resolve_device (d); + if (!dev) + gomp_fatal ("no devices of type %u", d); + + /* We might not have called lazy_open for this host thread yet, in which case + the get_device_num_func hook will return -1. */ + num = dev->openacc.get_device_num_func (); + if (num < 0) + num = goacc_device_num; + + return num; +} + +ialias (acc_get_device_num) + +void +acc_set_device_num (int n, acc_device_t d) +{ + const struct gomp_device_descr *dev; + int num_devices; + + if (!base_dev) + gomp_init_targets_once (); + + if ((int) d == 0) + { + int i; + + /* A device setting of zero sets all device types on the system to use + the Nth instance of that device type. Only attempt it for initialized + devices though. */ + for (i = acc_device_not_host + 1; i < _ACC_device_hwm; i++) + { + dev = resolve_device (d); + if (dev && dev->is_initialized) + dev->openacc.set_device_num_func (n); + } + + /* ...and for future calls to acc_init/acc_set_device_type, etc. */ + goacc_device_num = n; + } + else + { + struct goacc_thread *thr = goacc_thread (); + + gomp_mutex_lock (&acc_device_lock); + + base_dev = lazy_init (d); + + num_devices = base_dev->get_num_devices_func (); + + if (n >= num_devices) + gomp_fatal ("device %u out of range", n); + + /* If we're changing the device number, de-associate this thread with + the device (but don't close the device, since it may be in use by + other threads). */ + if (thr && thr->dev && n != thr->dev->target_id) + thr->dev = NULL; + + lazy_open (n); + + gomp_mutex_unlock (&acc_device_lock); + } +} + +ialias (acc_set_device_num) + +int +acc_on_device (acc_device_t dev) +{ + struct goacc_thread *thr = goacc_thread (); + + if (thr && thr->dev + && acc_device_type (thr->dev->type) == acc_device_host_nonshm) + return dev == acc_device_host_nonshm || dev == acc_device_not_host; + + /* Just rely on the compiler builtin. */ + return __builtin_acc_on_device (dev); +} +ialias (acc_on_device) + +attribute_hidden void +ACC_runtime_initialize (void) +{ + gomp_mutex_init (&acc_device_lock); + +#ifndef HAVE_TLS + pthread_key_create (&goacc_tls_key, NULL); +#endif + + pthread_key_create (&goacc_cleanup_key, goacc_destroy_thread); + + base_dev = NULL; + + goacc_threads = NULL; + gomp_mutex_init (&goacc_thread_lock); +} + +/* Compiler helper functions */ + +void +ACC_save_and_set_bind (acc_device_t d) +{ + struct goacc_thread *thr = goacc_thread (); + + assert (!thr->saved_bound_dev); + + thr->saved_bound_dev = thr->dev; + thr->dev = (struct gomp_device_descr *) dispatchers[d]; +} + +void +ACC_restore_bind (void) +{ + struct goacc_thread *thr = goacc_thread (); + + thr->dev = thr->saved_bound_dev; + thr->saved_bound_dev = NULL; +} + +/* This is called from any OpenACC support function that may need to implicitly + initialize the libgomp runtime. On exit all such initialization will have + been done, and both the global ACC_dev and the per-host-thread ACC_memmap + pointers will be valid. */ + +void +ACC_lazy_initialize (void) +{ + struct goacc_thread *thr = goacc_thread (); + + if (thr && thr->dev) + return; + + if (!base_dev) + lazy_init_and_open (acc_device_default); + else + { + gomp_mutex_lock (&acc_device_lock); + lazy_open (-1); + gomp_mutex_unlock (&acc_device_lock); + } +} diff --git a/libgomp/oacc-int.h b/libgomp/oacc-int.h new file mode 100644 index 0000000..aa955bd --- /dev/null +++ b/libgomp/oacc-int.h @@ -0,0 +1,106 @@ +/* OpenACC Runtime - internal declarations + + Copyright (C) 2005-2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* This file contains data types and function declarations that are not + part of the official OpenACC user interface. There are declarations + in here that are part of the GNU OpenACC ABI, in that the compiler is + required to know about them and use them. + + The convention is that the all caps prefix "GOACC" is used group items + that are part of the external ABI, and the lower case prefix "goacc" + is used group items that are completely private to the library. */ + +#ifndef _OACC_INT_H +#define _OACC_INT_H 1 + +#include "openacc.h" +#include "config.h" +#include +#include +#include + +#ifdef HAVE_ATTRIBUTE_VISIBILITY +# pragma GCC visibility push(hidden) +#endif + +static inline enum acc_device_t +acc_device_type (enum offload_target_type type) +{ + return (enum acc_device_t) type; +} + +struct goacc_thread +{ + /* The device for the current thread. */ + struct gomp_device_descr *dev; + + struct gomp_device_descr *saved_bound_dev; + + /* This is a linked list of data mapped by the "acc data" pragma, following + strictly push/pop semantics according to lexical scope. */ + struct target_mem_desc *mapped_data; + + /* These structures form a list: this is the next thread in that list. */ + struct goacc_thread *next; + + /* Target-specific data (used by plugin). */ + void *target_tls; +}; + +#ifdef HAVE_TLS +extern __thread struct goacc_thread *goacc_tls_data; +static inline struct goacc_thread * +goacc_thread (void) +{ + return goacc_tls_data; +} +#else +extern pthread_key_t goacc_tls_key; +static inline struct goacc_thread * +goacc_thread (void) +{ + return pthread_getspecific (goacc_tls_key); +} +#endif + +struct gomp_device_descr; + +void ACC_register (struct gomp_device_descr const *) __GOACC_NOTHROW; + +/* Current dispatcher. */ +extern struct gomp_device_descr const *base_dev; + +void ACC_runtime_initialize (void); +void ACC_save_and_set_bind (acc_device_t); +void ACC_restore_bind (void); +void ACC_lazy_initialize (void); + +#ifdef HAVE_ATTRIBUTE_VISIBILITY +# pragma GCC visibility pop +#endif + +#endif /* _OACC_INT_H */ diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c new file mode 100644 index 0000000..39e14a7 --- /dev/null +++ b/libgomp/oacc-mem.c @@ -0,0 +1,510 @@ +/* OpenACC Runtime initialization routines + + Copyright (C) 2013 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +#include "openacc.h" +#include "config.h" +#include "libgomp.h" +#include "gomp-constants.h" +#include "libgomp_target.h" +#include "oacc-int.h" +#include +#include +#include + +#include "splay-tree.h" + +/* Return block containing [H->S), or NULL if not contained. */ + +attribute_hidden splay_tree_key +lookup_host (struct gomp_memory_mapping *mem_map, void *h, size_t s) +{ + struct splay_tree_key_s node; + splay_tree_key key; + + node.host_start = (uintptr_t) h; + node.host_end = (uintptr_t) h + s; + + gomp_mutex_lock (&mem_map->lock); + + key = splay_tree_lookup (&mem_map->splay_tree, &node); + + gomp_mutex_unlock (&mem_map->lock); + + return key; +} + +/* Return block containing [D->S), or NULL if not contained. + The list isn't ordered by device address, so we have to iterate + over the whole array. This is not expected to be a common + operation. */ + +static splay_tree_key +lookup_dev (struct target_mem_desc *tgt, void *d, size_t s) +{ + int i; + struct target_mem_desc *t; + struct gomp_memory_mapping *mem_map; + + if (!tgt) + return NULL; + + mem_map = tgt->mem_map; + + gomp_mutex_lock (&mem_map->lock); + + for (t = tgt; t != NULL; t = t->prev) + { + if (t->tgt_start <= (uintptr_t) d && t->tgt_end >= (uintptr_t) d + s) + break; + } + + gomp_mutex_unlock (&mem_map->lock); + + if (!t) + return NULL; + + for (i = 0; i < t->list_count; i++) + { + void * offset; + + splay_tree_key k = &t->array[i].key; + offset = d - t->tgt_start + k->tgt_offset; + + if (k->host_start + offset <= (void *) k->host_end) + return k; + } + + return NULL; +} + +/* OpenACC is silent on how memory exhaustion is indicated. We return + NULL. */ + +void * +acc_malloc (size_t s) +{ + if (!s) + return NULL; + + ACC_lazy_initialize (); + + struct goacc_thread *thr = goacc_thread (); + + return base_dev->alloc_func (thr->dev->target_id, s); +} + +/* OpenACC 2.0a (3.2.16) doesn't specify what to do in the event + the device address is mapped. We choose to check if it mapped, + and if it is, to unmap it. */ +void +acc_free (void *d) +{ + splay_tree_key k; + struct goacc_thread *thr = goacc_thread (); + + if (!d) + return; + + /* We don't have to call lazy open here, as the ptr value must have + been returned by acc_malloc. It's not permitted to pass NULL in + (unless you got that null from acc_malloc). */ + if ((k = lookup_dev (thr->dev->openacc.data_environ, d, 1))) + { + void *offset; + + offset = d - k->tgt->tgt_start + k->tgt_offset; + + acc_unmap_data ((void *)(k->host_start + offset)); + } + + base_dev->free_func (thr->dev->target_id, d); +} + +void +acc_memcpy_to_device (void *d, void *h, size_t s) +{ + /* No need to call lazy open here, as the device pointer must have + been obtained from a routine that did that. */ + struct goacc_thread *thr = goacc_thread (); + + base_dev->host2dev_func (thr->dev->target_id, d, h, s); +} + +void +acc_memcpy_from_device (void *h, void *d, size_t s) +{ + /* No need to call lazy open here, as the device pointer must have + been obtained from a routine that did that. */ + struct goacc_thread *thr = goacc_thread (); + + base_dev->dev2host_func (thr->dev->target_id, h, d, s); +} + +/* Return the device pointer that corresponds to host data H. Or NULL + if no mapping. */ + +void * +acc_deviceptr (void *h) +{ + splay_tree_key n; + void *d; + void *offset; + + ACC_lazy_initialize (); + + struct goacc_thread *thr = goacc_thread (); + + n = lookup_host (&thr->dev->mem_map, h, 1); + + if (!n) + return NULL; + + offset = h - n->host_start; + + d = n->tgt->tgt_start + n->tgt_offset + offset; + + return d; +} + +/* Return the host pointer that corresponds to device data D. Or NULL + if no mapping. */ + +void * +acc_hostptr (void *d) +{ + splay_tree_key n; + void *h; + void *offset; + + ACC_lazy_initialize (); + + struct goacc_thread *thr = goacc_thread (); + + n = lookup_dev (thr->dev->openacc.data_environ, d, 1); + + if (!n) + return NULL; + + offset = d - n->tgt->tgt_start + n->tgt_offset; + + h = n->host_start + offset; + + return h; +} + +/* Return 1 if host data [H,+S] is present on the device. */ + +int +acc_is_present (void *h, size_t s) +{ + splay_tree_key n; + + if (!s || !h) + return 0; + + ACC_lazy_initialize (); + + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + + n = lookup_host (&acc_dev->mem_map, h, s); + + if (n && ((uintptr_t)h < n->host_start + || (uintptr_t)h + s > n->host_end + || s > n->host_end - n->host_start)) + n = NULL; + + return n != NULL; +} + +/* Create a mapping for host [H,+S] -> device [D,+S] */ + +void +acc_map_data (void *h, void *d, size_t s) +{ + struct target_mem_desc *tgt; + size_t mapnum = 1; + void *hostaddrs = h; + void *devaddrs = d; + size_t sizes = s; + unsigned short kinds = GOMP_MAP_ALLOC; + + ACC_lazy_initialize (); + + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + + if (acc_dev->capabilities & TARGET_CAP_SHARED_MEM) + { + if (d != h) + gomp_fatal ("cannot map data on shared-memory system"); + + tgt = gomp_map_vars (NULL, 0, NULL, NULL, NULL, NULL, true, false); + } + else + { + struct goacc_thread *thr = goacc_thread (); + + if (!d || !h || !s) + gomp_fatal ("[%p,+%d]->[%p,+%d] is a bad map", + (void *)h, (int)s, (void *)d, (int)s); + + if (lookup_host (&acc_dev->mem_map, h, s)) + gomp_fatal ("host address [%p, +%d] is already mapped", (void *)h, + (int)s); + + if (lookup_dev (thr->dev->openacc.data_environ, d, s)) + gomp_fatal ("device address [%p, +%d] is already mapped", (void *)d, + (int)s); + + tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, &devaddrs, &sizes, + &kinds, true, false); + } + + tgt->prev = acc_dev->openacc.data_environ; + acc_dev->openacc.data_environ = tgt; +} + +void +acc_unmap_data (void *h) +{ + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + + /* No need to call lazy open, as the address must have been mapped. */ + + size_t host_size; + splay_tree_key n = lookup_host (&acc_dev->mem_map, h, 1); + struct target_mem_desc *t; + + if (!n) + gomp_fatal ("%p is not a mapped block", (void *)h); + + host_size = n->host_end - n->host_start; + + if (n->host_start != (uintptr_t) h) + gomp_fatal ("[%p,%d] surrounds1 %p", + (void *) n->host_start, (int) host_size, (void *) h); + + t = n->tgt; + + if (t->refcount == 2) + { + struct target_mem_desc *tp; + + /* This is the last reference, so pull the descriptor off the + chain. This avoids gomp_unmap_vars via gomp_unmap_tgt from + freeing the device memory. */ + t->tgt_end = 0; + t->to_free = 0; + + gomp_mutex_lock (&acc_dev->mem_map.lock); + + for (tp = NULL, t = acc_dev->openacc.data_environ; t != NULL; + tp = t, t = t->prev) + if (n->tgt == t) + { + if (tp) + tp->prev = t->prev; + else + acc_dev->openacc.data_environ = t->prev; + + break; + } + + gomp_mutex_unlock (&acc_dev->mem_map.lock); + } + + gomp_unmap_vars (t, true); +} + +#define PCC_Present (1 << 0) +#define PCC_Create (1 << 1) +#define PCC_Copy (1 << 2) + +attribute_hidden void * +present_create_copy (unsigned f, void *h, size_t s) +{ + void *d; + splay_tree_key n; + + if (!h || !s) + gomp_fatal ("[%p,+%d] is a bad range", (void *)h, (int)s); + + ACC_lazy_initialize (); + + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + + n = lookup_host (&acc_dev->mem_map, h, s); + if (n) + { + /* Present. */ + d = (void *) (n->tgt->tgt_start + n->tgt_offset); + + if (!(f & PCC_Present)) + gomp_fatal ("[%p,+%d] already mapped to [%p,+%d]", + (void *)h, (int)s, (void *)d, (int)s); + if ((h + s) > (void *)n->host_end) + gomp_fatal ("[%p,+%d] not mapped", (void *)h, (int)s); + } + else if (!(f & PCC_Create)) + { + gomp_fatal ("[%p,+%d] not mapped", (void *)h, (int)s); + } + else + { + struct target_mem_desc *tgt; + size_t mapnum = 1; + unsigned short kinds; + void *hostaddrs = h; + + if (f & PCC_Copy) + kinds = GOMP_MAP_ALLOC_TO; + else + kinds = GOMP_MAP_ALLOC; + + tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true, + false); + + gomp_mutex_lock (&acc_dev->mem_map.lock); + + d = tgt->to_free; + tgt->prev = acc_dev->openacc.data_environ; + acc_dev->openacc.data_environ = tgt; + + gomp_mutex_unlock (&acc_dev->mem_map.lock); + } + + return d; +} + +void * +acc_create (void *h, size_t s) +{ + return present_create_copy (PCC_Create, h, s); +} + +void * +acc_copyin (void *h, size_t s) +{ + return present_create_copy (PCC_Create | PCC_Copy, h, s); +} + +void * +acc_present_or_create (void *h, size_t s) +{ + return present_create_copy (PCC_Present | PCC_Create, h, s); +} + +void * +acc_present_or_copyin (void *h, size_t s) +{ + return present_create_copy (PCC_Present | PCC_Create | PCC_Copy, h, s); +} + +#define DC_Copyout (1 << 0) + +static void +delete_copyout (unsigned f, void *h, size_t s) +{ + size_t host_size; + splay_tree_key n; + void *d; + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + + n = lookup_host (&acc_dev->mem_map, h, s); + + /* No need to call lazy open, as the data must already have been + mapped. */ + + if (!n) + gomp_fatal ("[%p,%d] is not mapped", (void *)h, (int)s); + + d = (void *) (n->tgt->tgt_start + n->tgt_offset); + + host_size = n->host_end - n->host_start; + + if (n->host_start != (uintptr_t) h || host_size != s) + gomp_fatal ("[%p,%d] surrounds2 [%p,+%d]", + (void *) n->host_start, (int) host_size, (void *) h, (int) s); + + if (f & DC_Copyout) + acc_dev->dev2host_func (acc_dev->target_id, h, d, s); + + acc_unmap_data (h); + + acc_dev->free_func (acc_dev->target_id, d); +} + +void +acc_delete (void *h , size_t s) +{ + delete_copyout (0, h, s); +} + +void acc_copyout (void *h, size_t s) +{ + delete_copyout (DC_Copyout, h, s); +} + +static void +update_dev_host (int is_dev, void *h, size_t s) +{ + splay_tree_key n; + void *d; + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + + n = lookup_host (&acc_dev->mem_map, h, s); + + /* No need to call lazy open, as the data must already have been + mapped. */ + + if (!n) + gomp_fatal ("[%p,%d] is not mapped", h, (int)s); + + d = (void *) (n->tgt->tgt_start + n->tgt_offset); + + if (is_dev) + acc_dev->host2dev_func (acc_dev->target_id, d, h, s); + else + acc_dev->dev2host_func (acc_dev->target_id, h, d, s); +} + +void +acc_update_device (void *h, size_t s) +{ + update_dev_host (1, h, s); +} + +void +acc_update_self (void *h, size_t s) +{ + update_dev_host (0, h, s); +} diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c new file mode 100644 index 0000000..0611362 --- /dev/null +++ b/libgomp/oacc-parallel.c @@ -0,0 +1,390 @@ +/* Copyright (C) 2013-2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* This file handles OpenACC constructs. */ + +#include "openacc.h" +#include "libgomp.h" +#include "libgomp_g.h" +#include "gomp-constants.h" +#include "libgomp_target.h" +#include "oacc-int.h" +#include +#include +#include +#include +#include + +static void +dump_var (char *s, size_t idx, void *hostaddr, size_t size, unsigned char kind) +{ + gomp_notify(" %2zi: %3s 0x%.2x -", idx, s, kind & 0xff); + + switch (kind & 0xff) + { + case 0x00: gomp_notify(" ALLOC "); break; + case 0x01: gomp_notify(" ALLOC TO "); break; + case 0x02: gomp_notify(" ALLOC FROM "); break; + case 0x03: gomp_notify(" ALLOC TOFROM "); break; + case 0x04: gomp_notify(" POINTER "); break; + case 0x05: gomp_notify(" TO_PSET "); break; + + case 0x08: gomp_notify(" FORCE_ALLOC "); break; + case 0x09: gomp_notify(" FORCE_TO "); break; + case 0x0a: gomp_notify(" FORCE_FROM "); break; + case 0x0b: gomp_notify(" FORCE_TOFROM "); break; + case 0x0c: gomp_notify(" FORCE_PRESENT "); break; + case 0x0d: gomp_notify(" FORCE_DEALLOC "); break; + case 0x0e: gomp_notify(" FORCE_DEVICEPTR "); break; + + case 0x18: gomp_notify(" FORCE_PRIVATE "); break; + case 0x19: gomp_notify(" FORCE_FIRSTPRIVATE "); break; + + case (unsigned char) -1: gomp_notify(" DUMMY "); break; + default: gomp_notify("UGH! 0x%x\n", kind); + } + + gomp_notify("- %d - %4d/0x%04x ", 1 << (kind >> 8), (int)size, (int)size); + gomp_notify("- %p\n", hostaddr); + + return; +} + +/* Ensure that the target device for DEVICE_TYPE is initialised (and that + plugins have been loaded if appropriate). The ACC_dev variable for the + current thread will be set appropriately for the given device type on + return. */ + +attribute_hidden void +select_acc_device (int device_type) +{ + ACC_lazy_initialize (); + + if (device_type == GOMP_IF_CLAUSE_FALSE) + return; + + if (device_type == acc_device_none) + device_type = acc_device_host; + + if (device_type >= 0) + { + /* NOTE: this will go badly if the surrounding data environment is set up + to use a different device type. We'll just have to trust that users + know what they're doing... */ + acc_set_device_type (device_type); + } +} + +void goacc_wait (int async, int num_waits, va_list ap); + +void +GOACC_parallel (int device, void (*fn) (void *), const void *openmp_target, + size_t mapnum, void **hostaddrs, size_t *sizes, + unsigned short *kinds, + int num_gangs, int num_workers, int vector_length, + int async, int num_waits, ...) +{ + bool if_clause_condition_value = device != GOMP_IF_CLAUSE_FALSE; + va_list ap; + struct goacc_thread *thr; + struct gomp_device_descr *acc_dev; + struct target_mem_desc *tgt; + void **devaddrs; + unsigned int i; + struct splay_tree_key_s k; + splay_tree_key tgt_fn_key; + void (*tgt_fn); + + if (num_gangs != 1) + gomp_fatal ("num_gangs (%d) different from one is not yet supported", + num_gangs); + if (num_workers != 1) + gomp_fatal ("num_workers (%d) different from one is not yet supported", + num_workers); + + gomp_notify ("%s: mapnum=%zd, hostaddrs=%p, sizes=%p, kinds=%p, async=%d\n", + __FUNCTION__, mapnum, hostaddrs, sizes, kinds, async); + + select_acc_device (device); + + thr = goacc_thread (); + acc_dev = thr->dev; + + /* Host fallback if "if" clause is false or if the current device is set to + the host. */ + if (!if_clause_condition_value) + { + ACC_save_and_set_bind (acc_device_host); + fn (hostaddrs); + ACC_restore_bind (); + return; + } + else if (acc_device_type (acc_dev->type) == acc_device_host) + { + fn (hostaddrs); + return; + } + + va_start (ap, num_waits); + + if (num_waits > 0) + goacc_wait (async, num_waits, ap); + + va_end (ap); + + acc_dev->openacc.async_set_async_func (async); + + if (!(acc_dev->capabilities & TARGET_CAP_NATIVE_EXEC)) + { + k.host_start = (uintptr_t) fn; + k.host_end = k.host_start + 1; + gomp_mutex_lock (&acc_dev->mem_map.lock); + tgt_fn_key = splay_tree_lookup (&acc_dev->mem_map.splay_tree, &k); + gomp_mutex_unlock (&acc_dev->mem_map.lock); + + if (tgt_fn_key == NULL) + gomp_fatal ("target function wasn't mapped: perhaps -fopenacc was " + "used without -flto?"); + + tgt_fn = (void (*)) tgt_fn_key->tgt->tgt_start; + } + else + tgt_fn = (void (*)) fn; + + tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs, NULL, sizes, kinds, true, + false); + + devaddrs = alloca (sizeof (void *) * mapnum); + for (i = 0; i < mapnum; i++) + devaddrs[i] = (void *) (tgt->list[i]->tgt->tgt_start + + tgt->list[i]->tgt_offset); + + acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs, sizes, kinds, + num_gangs, num_workers, vector_length, async, + tgt); + + /* If running synchronously, unmap immediately. */ + if (async < acc_async_noval) + gomp_unmap_vars (tgt, true); + else + { + gomp_copy_from_async (tgt); + acc_dev->openacc.register_async_cleanup_func (tgt); + } + + acc_dev->openacc.async_set_async_func (acc_async_sync); +} + +void +GOACC_data_start (int device, const void *openmp_target, size_t mapnum, + void **hostaddrs, size_t *sizes, unsigned short *kinds) +{ + bool if_clause_condition_value = device != GOMP_IF_CLAUSE_FALSE; + struct target_mem_desc *tgt; + + gomp_notify ("%s: mapnum=%zd, hostaddrs=%p, sizes=%p, kinds=%p\n", + __FUNCTION__, mapnum, hostaddrs, sizes, kinds); + + select_acc_device (device); + + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + + /* Host fallback or 'do nothing'. */ + if ((acc_dev->capabilities & TARGET_CAP_SHARED_MEM) + || !if_clause_condition_value) + { + tgt = gomp_map_vars (NULL, 0, NULL, NULL, NULL, NULL, true, false); + tgt->prev = thr->mapped_data; + thr->mapped_data = tgt; + + return; + } + + gomp_notify (" %s: prepare mappings\n", __FUNCTION__); + tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs, NULL, sizes, kinds, true, + false); + gomp_notify (" %s: mappings prepared\n", __FUNCTION__); + tgt->prev = thr->mapped_data; + thr->mapped_data = tgt; +} + +void +GOACC_data_end (void) +{ + struct goacc_thread *thr = goacc_thread (); + struct target_mem_desc *tgt = thr->mapped_data; + + gomp_notify (" %s: restore mappings\n", __FUNCTION__); + thr->mapped_data = tgt->prev; + gomp_unmap_vars (tgt, true); + gomp_notify (" %s: mappings restored\n", __FUNCTION__); +} + + +void +GOACC_kernels (int device, void (*fn) (void *), const void *openmp_target, + size_t mapnum, void **hostaddrs, size_t *sizes, + unsigned short *kinds, + int num_gangs, int num_workers, int vector_length, + int async, int num_waits, ...) +{ + gomp_notify ("%s: mapnum=%zd, hostaddrs=%p, sizes=%p, kinds=%p\n", + __FUNCTION__, mapnum, hostaddrs, sizes, kinds); + + va_list ap; + + select_acc_device (device); + + va_start (ap, num_waits); + + if (num_waits > 0) + goacc_wait (async, num_waits, ap); + + va_end (ap); + + GOACC_parallel (device, fn, openmp_target, mapnum, hostaddrs, sizes, kinds, + num_gangs, num_workers, vector_length, async, 0); +} + +void +goacc_wait (int async, int num_waits, va_list ap) +{ + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + int i; + + assert (num_waits >= 0); + + if (async == acc_async_sync && num_waits == 0) + { + acc_wait_all (); + return; + } + + if (async == acc_async_sync && num_waits) + { + for (i = 0; i < num_waits; i++) + { + int qid = va_arg (ap, int); + + if (acc_async_test (qid)) + continue; + + acc_wait (qid); + } + return; + } + + if (async == acc_async_noval && num_waits == 0) + { + acc_dev->openacc.async_wait_all_async_func (acc_async_noval); + return; + } + + for (i = 0; i < num_waits; i++) + { + int qid = va_arg (ap, int); + + if (acc_async_test (qid)) + continue; + + /* If we're waiting on the same asynchronous queue as we're launching on, + the queue itself will order work as required, so there's no need to + wait explicitly. */ + if (qid != async) + acc_dev->openacc.async_wait_async_func (qid, async); + } +} + +void +GOACC_update (int device, const void *openmp_target, size_t mapnum, + void **hostaddrs, size_t *sizes, unsigned short *kinds, + int async, int num_waits, ...) +{ + bool if_clause_condition_value = device != GOMP_IF_CLAUSE_FALSE; + size_t i; + + select_acc_device (device); + + struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr->dev; + + if ((acc_dev->capabilities & TARGET_CAP_SHARED_MEM) + || !if_clause_condition_value) + return; + + if (num_waits > 0) + { + va_list ap; + + va_start (ap, num_waits); + + goacc_wait (async, num_waits, ap); + + va_end (ap); + } + + acc_dev->openacc.async_set_async_func (async); + + for (i = 0; i < mapnum; ++i) + { + unsigned char kind = kinds[i] & 0xff; + + dump_var ("UPD", i, hostaddrs[i], sizes[i], kinds[i]); + + switch (kind) + { + case GOMP_MAP_POINTER: + break; + + case GOMP_MAP_FORCE_TO: + acc_update_device (hostaddrs[i], sizes[i]); + break; + + case GOMP_MAP_FORCE_FROM: + acc_update_self (hostaddrs[i], sizes[i]); + break; + + default: + gomp_fatal (">>>> GOACC_update UNHANDLED kind 0x%.2x", kind); + break; + } + } + + acc_dev->openacc.async_set_async_func (acc_async_sync); +} + +void +GOACC_wait (int async, int num_waits, ...) +{ + va_list ap; + + va_start (ap, num_waits); + + goacc_wait (async, num_waits, ap); + + va_end (ap); +} diff --git a/libgomp/oacc-plugin.c b/libgomp/oacc-plugin.c new file mode 100644 index 0000000..357cb5f --- /dev/null +++ b/libgomp/oacc-plugin.c @@ -0,0 +1,48 @@ +/* Copyright (C) 2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* Initialize and register OpenACC dispatch table from libgomp plugin. */ + +#include "libgomp.h" +#include "oacc-plugin.h" +#include "libgomp_target.h" +#include "oacc-int.h" + +void +GOMP_PLUGIN_async_unmap_vars (void *ptr) +{ + struct target_mem_desc *tgt = ptr; + + gomp_unmap_vars (tgt, false); +} + +/* Return the target-specific part of the TLS data for the current thread. */ + +void * +GOMP_PLUGIN_acc_thread (void) +{ + struct goacc_thread *thr = goacc_thread (); + return thr ? thr->target_tls : NULL; +} diff --git a/libgomp/oacc-plugin.h b/libgomp/oacc-plugin.h new file mode 100644 index 0000000..d05a28f --- /dev/null +++ b/libgomp/oacc-plugin.h @@ -0,0 +1,32 @@ +/* Copyright (C) 2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +#ifndef _OACC_PLUGIN_H +#define _OACC_PLUGIN_H 1 + +extern void GOMP_PLUGIN_async_unmap_vars (void *ptr); +extern void *GOMP_PLUGIN_acc_thread (void); + +#endif diff --git a/libgomp/openacc.f90 b/libgomp/openacc.f90 new file mode 100644 index 0000000..e4d4d8f --- /dev/null +++ b/libgomp/openacc.f90 @@ -0,0 +1,953 @@ +! OpenACC Runtime Library Definitions. + +! Copyright (C) 2014 Free Software Foundation, Inc. + +! Contributed by Tobias Burnus +! and Mentor Embedded. + +! This file is part of the GNU OpenMP Library (libgomp). + +! Libgomp is free software; you can redistribute it and/or modify it +! under the terms of the GNU General Public License as published by +! the Free Software Foundation; either version 3, or (at your option) +! any later version. + +! Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY +! WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +! FOR A PARTICULAR PURPOSE. See the GNU General Public License for +! more details. + +! Under Section 7 of GPL version 3, you are granted additional +! permissions described in the GCC Runtime Library Exception, version +! 3.1, as published by the Free Software Foundation. + +! You should have received a copy of the GNU General Public License and +! a copy of the GCC Runtime Library Exception along with this program; +! see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +! . + +module openacc_kinds + use iso_fortran_env, only: int32 + implicit none + + private :: int32 + public :: acc_device_kind + + integer, parameter :: acc_device_kind = int32 + + public :: acc_device_none, acc_device_default, acc_device_host + public :: acc_device_not_host, acc_device_nvidia + + integer (acc_device_kind), parameter :: acc_device_none = 0 + integer (acc_device_kind), parameter :: acc_device_default = 1 + integer (acc_device_kind), parameter :: acc_device_host = 2 + integer (acc_device_kind), parameter :: acc_device_host_nonshm = 3 + integer (acc_device_kind), parameter :: acc_device_not_host = 4 + integer (acc_device_kind), parameter :: acc_device_nvidia = 5 + + public :: acc_handle_kind + + integer, parameter :: acc_handle_kind = int32 + + public :: acc_async_noval, acc_async_sync + + integer (acc_handle_kind), parameter :: acc_async_noval = -1 + integer (acc_handle_kind), parameter :: acc_async_sync = -2 + +end module + +module openacc_internal + use openacc_kinds + implicit none + + interface + function acc_get_num_devices_h (d) + import + integer acc_get_num_devices_h + integer (acc_device_kind) d + end function + + subroutine acc_set_device_type_h (d) + import + integer (acc_device_kind) d + end subroutine + + function acc_get_device_type_h () + import + integer (acc_device_kind) acc_get_device_type_h + end function + + subroutine acc_set_device_num_h (n, d) + import + integer n + integer (acc_device_kind) d + end subroutine + + function acc_get_device_num_h (d) + import + integer acc_get_device_num_h + integer (acc_device_kind) d + end function + + function acc_async_test_h (a) + logical acc_async_test_h + integer a + end function + + function acc_async_test_all_h () + logical acc_async_test_all_h + end function + + subroutine acc_wait_h (a) + integer a + end subroutine + + subroutine acc_wait_async_h (a1, a2) + integer a1, a2 + end subroutine + + subroutine acc_wait_all_h () + end subroutine + + subroutine acc_wait_all_async_h (a) + integer a + end subroutine + + subroutine acc_init_h (d) + import + integer (acc_device_kind) d + end subroutine + + subroutine acc_shutdown_h (d) + import + integer (acc_device_kind) d + end subroutine + + function acc_on_device_h (d) + import + integer (acc_device_kind) d + logical acc_on_device_h + end function + + subroutine acc_copyin_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_copyin_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_copyin_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + + subroutine acc_present_or_copyin_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_present_or_copyin_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_present_or_copyin_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + + subroutine acc_create_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_create_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_create_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + + subroutine acc_present_or_create_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_present_or_create_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_present_or_create_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + + subroutine acc_copyout_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_copyout_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_copyout_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + + subroutine acc_delete_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_delete_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_delete_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + + subroutine acc_update_device_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_update_device_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_update_device_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + + subroutine acc_update_self_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_update_self_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_update_self_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + + function acc_is_present_32_h (a, len) + use iso_c_binding, only: c_int32_t + logical acc_is_present_32_h + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end function + + function acc_is_present_64_h (a, len) + use iso_c_binding, only: c_int64_t + logical acc_is_present_64_h + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end function + + function acc_is_present_array_h (a) + logical acc_is_present_array_h + type (*), dimension (..), contiguous :: a + end function + end interface + + interface + function acc_get_num_devices_l (d) & + bind (C, name = "acc_get_num_devices") + use iso_c_binding, only: c_int + integer (c_int) :: acc_get_num_devices_l + integer (c_int), value :: d + end function + + subroutine acc_set_device_type_l (d) & + bind (C, name = "acc_set_device_type") + use iso_c_binding, only: c_int + integer (c_int), value :: d + end subroutine + + function acc_get_device_type_l () & + bind (C, name = "acc_get_device_type") + use iso_c_binding, only: c_int + integer (c_int) :: acc_get_device_type_l + end function + + subroutine acc_set_device_num_l (n, d) & + bind (C, name = "acc_set_device_num") + use iso_c_binding, only: c_int + integer (c_int), value :: n, d + end subroutine + + function acc_get_device_num_l (d) & + bind (C, name = "acc_get_device_num") + use iso_c_binding, only: c_int + integer (c_int) :: acc_get_device_num_l + integer (c_int), value :: d + end function + + function acc_async_test_l (a) & + bind (C, name = "acc_async_test") + use iso_c_binding, only: c_int + integer (c_int) :: acc_async_test_l + integer (c_int), value :: a + end function + + function acc_async_test_all_l () & + bind (C, name = "acc_async_test_all") + use iso_c_binding, only: c_int + integer (c_int) :: acc_async_test_all_l + end function + + subroutine acc_wait_l (a) & + bind (C, name = "acc_wait") + use iso_c_binding, only: c_int + integer (c_int), value :: a + end subroutine + + subroutine acc_wait_async_l (a1, a2) & + bind (C, name = "acc_wait_async") + use iso_c_binding, only: c_int + integer (c_int), value :: a1, a2 + end subroutine + + subroutine acc_wait_all_l () & + bind (C, name = "acc_wait_all") + use iso_c_binding, only: c_int + end subroutine + + subroutine acc_wait_all_async_l (a) & + bind (C, name = "acc_wait_all_async") + use iso_c_binding, only: c_int + integer (c_int), value :: a + end subroutine + + subroutine acc_init_l (d) & + bind (C, name = "acc_init") + use iso_c_binding, only: c_int + integer (c_int), value :: d + end subroutine + + subroutine acc_shutdown_l (d) & + bind (C, name = "acc_shutdown") + use iso_c_binding, only: c_int + integer (c_int), value :: d + end subroutine + + function acc_on_device_l (d) & + bind (C, name = "acc_on_device") + use iso_c_binding, only: c_int + integer (c_int) :: acc_on_device_l + integer (c_int), value :: d + end function + + subroutine acc_copyin_l (a, len) & + bind (C, name = "acc_copyin") + use iso_c_binding, only: c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end subroutine + + subroutine acc_present_or_copyin_l (a, len) & + bind (C, name = "acc_present_or_copyin") + use iso_c_binding, only: c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end subroutine + + subroutine acc_create_l (a, len) & + bind (C, name = "acc_create") + use iso_c_binding, only: c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end subroutine + + subroutine acc_present_or_create_l (a, len) & + bind (C, name = "acc_present_or_create") + use iso_c_binding, only: c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end subroutine + + subroutine acc_copyout_l (a, len) & + bind (C, name = "acc_copyout") + use iso_c_binding, only: c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end subroutine + + subroutine acc_delete_l (a, len) & + bind (C, name = "acc_delete") + use iso_c_binding, only: c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end subroutine + + subroutine acc_update_device_l (a, len) & + bind (C, name = "acc_update_device") + use iso_c_binding, only: c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end subroutine + + subroutine acc_update_self_l (a, len) & + bind (C, name = "acc_update_self") + use iso_c_binding, only: c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end subroutine + + function acc_is_present_l (a, len) & + bind (C, name = "acc_is_present") + use iso_c_binding, only: c_int32_t, c_size_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + integer (c_int32_t) :: acc_is_present_l + type (*), dimension (*) :: a + integer (c_size_t), value :: len + end function + end interface +end module + +module openacc + use openacc_kinds + use openacc_internal + implicit none + + public :: openacc_version + + public :: acc_get_num_devices, acc_set_device_type, acc_get_device_type + public :: acc_set_device_num, acc_get_device_num, acc_async_test + public :: acc_async_test_all, acc_wait, acc_wait_async, acc_wait_all + public :: acc_wait_all_async, acc_init, acc_shutdown, acc_on_device + public :: acc_copyin, acc_present_or_copyin, acc_pcopyin, acc_create + public :: acc_present_or_create, acc_pcreate, acc_copyout, acc_delete + public :: acc_update_device, acc_update_self, acc_is_present + + integer, parameter :: openacc_version = 201306 + + interface acc_get_num_devices + procedure :: acc_get_num_devices_h + end interface + + interface acc_set_device_type + procedure :: acc_set_device_type_h + end interface + + interface acc_get_device_type + procedure :: acc_get_device_type_h + end interface + + interface acc_set_device_num + procedure :: acc_set_device_num_h + end interface + + interface acc_get_device_num + procedure :: acc_get_device_num_h + end interface + + interface acc_async_test + procedure :: acc_async_test_h + end interface + + interface acc_async_test_all + procedure :: acc_async_test_all_h + end interface + + interface acc_wait + procedure :: acc_wait_h + end interface + + interface acc_wait_async + procedure :: acc_wait_async_h + end interface + + interface acc_wait_all + procedure :: acc_wait_all_h + end interface + + interface acc_wait_all_async + procedure :: acc_wait_all_async_h + end interface + + interface acc_init + procedure :: acc_init_h + end interface + + interface acc_shutdown + procedure :: acc_shutdown_h + end interface + + interface acc_on_device + procedure :: acc_on_device_h + end interface + + ! acc_malloc: Only available in C/C++ + ! acc_free: Only available in C/C++ + + ! As vendor extension, the following code supports both 32bit and 64bit + ! arguments for "size"; the OpenACC standard only permits default-kind + ! integers, which are of kind 4 (i.e. 32 bits). + ! Additionally, the two-argument version also takes arrays as argument. + ! and the one argument version also scalars. Note that the code assumes + ! that the arrays are contiguous. + + interface acc_copyin + procedure :: acc_copyin_32_h + procedure :: acc_copyin_64_h + procedure :: acc_copyin_array_h + end interface + + interface acc_present_or_copyin + procedure :: acc_present_or_copyin_32_h + procedure :: acc_present_or_copyin_64_h + procedure :: acc_present_or_copyin_array_h + end interface + + interface acc_pcopyin + procedure :: acc_present_or_copyin_32_h + procedure :: acc_present_or_copyin_64_h + procedure :: acc_present_or_copyin_array_h + end interface + + interface acc_create + procedure :: acc_create_32_h + procedure :: acc_create_64_h + procedure :: acc_create_array_h + end interface + + interface acc_present_or_create + procedure :: acc_present_or_create_32_h + procedure :: acc_present_or_create_64_h + procedure :: acc_present_or_create_array_h + end interface + + interface acc_pcreate + procedure :: acc_present_or_create_32_h + procedure :: acc_present_or_create_64_h + procedure :: acc_present_or_create_array_h + end interface + + interface acc_copyout + procedure :: acc_copyout_32_h + procedure :: acc_copyout_64_h + procedure :: acc_copyout_array_h + end interface + + interface acc_delete + procedure :: acc_delete_32_h + procedure :: acc_delete_64_h + procedure :: acc_delete_array_h + end interface + + interface acc_update_device + procedure :: acc_update_device_32_h + procedure :: acc_update_device_64_h + procedure :: acc_update_device_array_h + end interface + + interface acc_update_self + procedure :: acc_update_self_32_h + procedure :: acc_update_self_64_h + procedure :: acc_update_self_array_h + end interface + + ! acc_map_data: Only available in C/C++ + ! acc_unmap_data: Only available in C/C++ + ! acc_deviceptr: Only available in C/C++ + ! acc_hostptr: Only available in C/C++ + + interface acc_is_present + procedure :: acc_is_present_32_h + procedure :: acc_is_present_64_h + procedure :: acc_is_present_array_h + end interface + + ! acc_memcpy_to_device: Only available in C/C++ + ! acc_memcpy_from_device: Only available in C/C++ + +end module + +function acc_get_num_devices_h (d) + use openacc_internal, only: acc_get_num_devices_l + use openacc_kinds + integer acc_get_num_devices_h + integer (acc_device_kind) d + acc_get_num_devices_h = acc_get_num_devices_l (d) +end function + +subroutine acc_set_device_type_h (d) + use openacc_internal, only: acc_set_device_type_l + use openacc_kinds + integer (acc_device_kind) d + call acc_set_device_type_l (d) +end subroutine + +function acc_get_device_type_h () + use openacc_internal, only: acc_get_device_type_l + use openacc_kinds + integer (acc_device_kind) acc_get_device_type_h + acc_get_device_type_h = acc_get_device_type_l () +end function + +subroutine acc_set_device_num_h (n, d) + use openacc_internal, only: acc_set_device_num_l + use openacc_kinds + integer n + integer (acc_device_kind) d + call acc_set_device_num_l (n, d) +end subroutine + +function acc_get_device_num_h (d) + use openacc_internal, only: acc_get_device_num_l + use openacc_kinds + integer acc_get_device_num_h + integer (acc_device_kind) d + acc_get_device_num_h = acc_get_device_num_l (d) +end function + +function acc_async_test_h (a) + use openacc_internal, only: acc_async_test_l + logical acc_async_test_h + integer a + if (acc_async_test_l (a) .eq. 1) then + acc_async_test_h = .TRUE. + else + acc_async_test_h = .FALSE. + end if +end function + +function acc_async_test_all_h () + use openacc_internal, only: acc_async_test_all_l + logical acc_async_test_all_h + if (acc_async_test_all_l () .eq. 1) then + acc_async_test_all_h = .TRUE. + else + acc_async_test_all_h = .FALSE. + end if +end function + +subroutine acc_wait_h (a) + use openacc_internal, only: acc_wait_l + integer a + call acc_wait_l (a) +end subroutine + +subroutine acc_wait_async_h (a1, a2) + use openacc_internal, only: acc_wait_async_l + integer a1, a2 + call acc_wait_async_l (a1, a2) +end subroutine + +subroutine acc_wait_all_h () + use openacc_internal, only: acc_wait_all_l + call acc_wait_all_l () +end subroutine + +subroutine acc_wait_all_async_h (a) + use openacc_internal, only: acc_wait_all_async_l + integer a + call acc_wait_all_async_l (a) +end subroutine + +subroutine acc_init_h (d) + use openacc_internal, only: acc_init_l + use openacc_kinds + integer (acc_device_kind) d + call acc_init_l (d) +end subroutine + +subroutine acc_shutdown_h (d) + use openacc_internal, only: acc_shutdown_l + use openacc_kinds + integer (acc_device_kind) d + call acc_shutdown_l (d) +end subroutine + +function acc_on_device_h (d) + use openacc_internal, only: acc_on_device_l + use openacc_kinds + integer (acc_device_kind) d + logical acc_on_device_h + if (acc_on_device_l (d) .eq. 1) then + acc_on_device_h = .TRUE. + else + acc_on_device_h = .FALSE. + end if +end function + +subroutine acc_copyin_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_copyin_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + call acc_copyin_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_copyin_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_copyin_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + call acc_copyin_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_copyin_array_h (a) + use openacc_internal, only: acc_copyin_l + type (*), dimension (..), contiguous :: a + call acc_copyin_l (a, sizeof (a)) +end subroutine + +subroutine acc_present_or_copyin_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_present_or_copyin_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + call acc_present_or_copyin_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_present_or_copyin_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_present_or_copyin_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + call acc_present_or_copyin_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_present_or_copyin_array_h (a) + use openacc_internal, only: acc_present_or_copyin_l + type (*), dimension (..), contiguous :: a + call acc_present_or_copyin_l (a, sizeof (a)) +end subroutine + +subroutine acc_create_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_create_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + call acc_create_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_create_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_create_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + call acc_create_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_create_array_h (a) + use openacc_internal, only: acc_create_l + type (*), dimension (..), contiguous :: a + call acc_create_l (a, sizeof (a)) +end subroutine + +subroutine acc_present_or_create_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_present_or_create_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + call acc_present_or_create_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_present_or_create_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_present_or_create_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + call acc_present_or_create_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_present_or_create_array_h (a) + use openacc_internal, only: acc_present_or_create_l + type (*), dimension (..), contiguous :: a + call acc_present_or_create_l (a, sizeof (a)) +end subroutine + +subroutine acc_copyout_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_copyout_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + call acc_copyout_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_copyout_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_copyout_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + call acc_copyout_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_copyout_array_h (a) + use openacc_internal, only: acc_copyout_l + type (*), dimension (..), contiguous :: a + call acc_copyout_l (a, sizeof (a)) +end subroutine + +subroutine acc_delete_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_delete_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + call acc_delete_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_delete_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_delete_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + call acc_delete_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_delete_array_h (a) + use openacc_internal, only: acc_delete_l + type (*), dimension (..), contiguous :: a + call acc_delete_l (a, sizeof (a)) +end subroutine + +subroutine acc_update_device_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_update_device_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + call acc_update_device_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_update_device_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_update_device_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + call acc_update_device_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_update_device_array_h (a) + use openacc_internal, only: acc_update_device_l + type (*), dimension (..), contiguous :: a + call acc_update_device_l (a, sizeof (a)) +end subroutine + +subroutine acc_update_self_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_update_self_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + call acc_update_self_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_update_self_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_update_self_l + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + call acc_update_self_l (a, int (len, kind = c_size_t)) +end subroutine + +subroutine acc_update_self_array_h (a) + use openacc_internal, only: acc_update_self_l + type (*), dimension (..), contiguous :: a + call acc_update_self_l (a, sizeof (a)) +end subroutine + +function acc_is_present_32_h (a, len) + use iso_c_binding, only: c_int32_t, c_size_t + use openacc_internal, only: acc_is_present_l + logical acc_is_present_32_h + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + if (acc_is_present_l (a, int (len, kind = c_size_t)) .eq. 1) then + acc_is_present_32_h = .TRUE. + else + acc_is_present_32_h = .FALSE. + end if +end function + +function acc_is_present_64_h (a, len) + use iso_c_binding, only: c_int64_t, c_size_t + use openacc_internal, only: acc_is_present_l + logical acc_is_present_64_h + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + if (acc_is_present_l (a, int (len, kind = c_size_t)) .eq. 1) then + acc_is_present_64_h = .TRUE. + else + acc_is_present_64_h = .FALSE. + end if +end function + +function acc_is_present_array_h (a) + use openacc_internal, only: acc_is_present_l + logical acc_is_present_array_h + type (*), dimension (..), contiguous :: a + acc_is_present_array_h = acc_is_present_l (a, sizeof (a)) == 1 +end function diff --git a/libgomp/openacc.h b/libgomp/openacc.h new file mode 100644 index 0000000..d43978f --- /dev/null +++ b/libgomp/openacc.h @@ -0,0 +1,127 @@ +/* OpenACC Runtime Library User-facing Declarations + + Copyright (C) 2013-2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +#ifndef _OPENACC_H +#define _OPENACC_H 1 + +#include "gomp-constants.h" + +/* The OpenACC std is silent on whether or not including openacc.h + might or must not include other header files. We chose to include + some. */ +#include + +#ifdef __cplusplus +extern "C" { +#endif + +#if __cplusplus >= 201103 +# define __GOACC_NOTHROW noexcept () +#elif __cplusplus +# define __GOACC_NOTHROW throw () +#else /* Not C++ */ +# define __GOACC_NOTHROW __attribute__ ((__nothrow__)) +#endif + + /* Types */ + typedef enum acc_device_t + { + acc_device_none = 0, + acc_device_default, /* This has to be a distinct value, as no + return value can match it. */ + acc_device_host = GOMP_TARGET_HOST, + acc_device_host_nonshm = GOMP_TARGET_HOST_NONSHM, + acc_device_not_host, + acc_device_nvidia = GOMP_TARGET_NVIDIA_PTX, + _ACC_device_hwm + } acc_device_t; + + typedef enum acc_async_t + { + acc_async_noval = -1, + acc_async_sync = -2 + } acc_async_t; + + int acc_get_num_devices (acc_device_t __dev) __GOACC_NOTHROW; + void acc_set_device_type (acc_device_t __dev) __GOACC_NOTHROW; + acc_device_t acc_get_device_type (void) __GOACC_NOTHROW; + void acc_set_device_num (int __num, acc_device_t __dev) __GOACC_NOTHROW; + int acc_get_device_num (acc_device_t __dev) __GOACC_NOTHROW; + int acc_async_test (int __async) __GOACC_NOTHROW; + int acc_async_test_all (void) __GOACC_NOTHROW; + void acc_wait (int __async) __GOACC_NOTHROW; + void acc_wait_async (int __async1, int __async2) __GOACC_NOTHROW; + void acc_wait_all (void) __GOACC_NOTHROW; + void acc_wait_all_async (int __async) __GOACC_NOTHROW; + void acc_init (acc_device_t __dev) __GOACC_NOTHROW; + void acc_shutdown (acc_device_t __dev) __GOACC_NOTHROW; + int acc_on_device (acc_device_t __dev) __GOACC_NOTHROW; + void *acc_malloc (size_t) __GOACC_NOTHROW; + void acc_free (void *) __GOACC_NOTHROW; + /* Some of these would be more correct with const qualifiers, but + the standard specifies otherwise. */ + void *acc_copyin (void *, size_t) __GOACC_NOTHROW; + void *acc_present_or_copyin (void *, size_t) __GOACC_NOTHROW; + void *acc_create (void *, size_t) __GOACC_NOTHROW; + void *acc_present_or_create (void *, size_t) __GOACC_NOTHROW; + void acc_copyout (void *, size_t) __GOACC_NOTHROW; + void acc_delete (void *, size_t) __GOACC_NOTHROW; + void acc_update_device (void *, size_t) __GOACC_NOTHROW; + void acc_update_self (void *, size_t) __GOACC_NOTHROW; + void acc_map_data (void *, void *, size_t) __GOACC_NOTHROW; + void acc_unmap_data (void *) __GOACC_NOTHROW; + void *acc_deviceptr (void *) __GOACC_NOTHROW; + void *acc_hostptr (void *) __GOACC_NOTHROW; + int acc_is_present (void *, size_t) __GOACC_NOTHROW; + void acc_memcpy_to_device (void *, void *, size_t) __GOACC_NOTHROW; + void acc_memcpy_from_device (void *, void *, size_t) __GOACC_NOTHROW; + + void ACC_target (int, void (*) (void *), const void *, + size_t, void **, size_t *, unsigned char *, int *) __GOACC_NOTHROW; + void ACC_parallel (int, void (*) (void *), const void *, + size_t, void **, size_t *, unsigned char *) __GOACC_NOTHROW; + void ACC_add_device_code (void const *, char const *) __GOACC_NOTHROW; + + void ACC_async_copy(int) __GOACC_NOTHROW; + void ACC_async_kern(int) __GOACC_NOTHROW; + + /* Old names. OpenACC does not specify whether these can or must + not be macros, inlines or aliases for the new names. */ + #define acc_pcreate acc_present_or_create + #define acc_pcopyin acc_present_or_copyin + + /* CUDA-specific routines. */ + void *acc_get_current_cuda_device (void) __GOACC_NOTHROW; + void *acc_get_current_cuda_context (void) __GOACC_NOTHROW; + void *acc_get_cuda_stream (int __async) __GOACC_NOTHROW; + int acc_set_cuda_stream (int __async, void *__stream) __GOACC_NOTHROW; + +#ifdef __cplusplus +} +#endif + +#endif /* _OPENACC_H */ diff --git a/libgomp/openacc_lib.h b/libgomp/openacc_lib.h new file mode 100644 index 0000000..4e335f2 --- /dev/null +++ b/libgomp/openacc_lib.h @@ -0,0 +1,378 @@ +! OpenACC Runtime Library Definitions. -*- mode: fortran -*- + +! Copyright (C) 2014 Free Software Foundation, Inc. + +! Contributed by Tobias Burnus +! and Mentor Embedded. + +! This file is part of the GNU OpenMP Library (libgomp). + +! Libgomp is free software; you can redistribute it and/or modify it +! under the terms of the GNU General Public License as published by +! the Free Software Foundation; either version 3, or (at your option) +! any later version. + +! Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY +! WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +! FOR A PARTICULAR PURPOSE. See the GNU General Public License for +! more details. + +! Under Section 7 of GPL version 3, you are granted additional +! permissions described in the GCC Runtime Library Exception, version +! 3.1, as published by the Free Software Foundation. + +! You should have received a copy of the GNU General Public License and +! a copy of the GCC Runtime Library Exception along with this program; +! see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +! . + +! NOTE: Due to the use of dimension (..), the code only works when compiled +! with -std=f2008ts/gnu/legacy but not with other standard settings. +! Alternatively, the user can use the module version, which permits +! compilation with -std=f95. + + integer, parameter :: acc_device_kind = 4 + + integer (acc_device_kind), parameter :: acc_device_none = 0 + integer (acc_device_kind), parameter :: acc_device_default = 1 + integer (acc_device_kind), parameter :: acc_device_host = 2 + integer (acc_device_kind), parameter :: acc_device_host_nonshm = 3 + integer (acc_device_kind), parameter :: acc_device_not_host = 4 + integer (acc_device_kind), parameter :: acc_device_nvidia = 5 + + integer, parameter :: acc_handle_kind = 4 + + integer (acc_handle_kind), parameter :: acc_async_noval = -1 + integer (acc_handle_kind), parameter :: acc_async_sync = -2 + + integer, parameter :: openacc_version = 201306 + + interface acc_get_num_devices + function acc_get_num_devices_h (d) + import acc_device_kind + integer acc_get_num_devices_h + integer (acc_device_kind) d + end function + end interface + + interface acc_set_device_type + subroutine acc_set_device_type_h (d) + import acc_device_kind + integer (acc_device_kind) d + end subroutine + end interface + + interface acc_get_device_type + function acc_get_device_type_h () + import acc_device_kind + integer (acc_device_kind) acc_get_device_type_h + end function + end interface + + interface acc_set_device_num + subroutine acc_set_device_num_h (n, d) + import acc_device_kind + integer n + integer (acc_device_kind) d + end subroutine + end interface + + interface acc_get_device_num + function acc_get_device_num_h (d) + import acc_device_kind + integer acc_get_device_num_h + integer (acc_device_kind) d + end function + end interface + + interface acc_async_test + function acc_async_test_h (a) + logical acc_async_test_h + integer a + end function + end interface + + interface acc_async_test_all + function acc_async_test_all_h () + logical acc_async_test_all_h + end function + end interface + + interface acc_wait + subroutine acc_wait_h (a) + integer a + end subroutine + end interface + + interface acc_wait_async + subroutine acc_wait_async_h (a1, a2) + integer a1, a2 + end subroutine + end interface + + interface acc_wait_all + subroutine acc_wait_all_h () + end subroutine + end interface + + interface acc_wait_all_async + subroutine acc_wait_all_async_h (a) + integer a + end subroutine + end interface + + interface acc_init + subroutine acc_init_h (devicetype) + import acc_device_kind + integer (acc_device_kind) devicetype + end subroutine + end interface + + interface acc_shutdown + subroutine acc_shutdown_h (devicetype) + import acc_device_kind + integer (acc_device_kind) devicetype + end subroutine + end interface + + interface acc_on_device + function acc_on_device_h (devicetype) + import acc_device_kind + logical acc_on_device_h + integer (acc_device_kind) devicetype + end function + end interface + + ! acc_malloc: Only available in C/C++ + ! acc_free: Only available in C/C++ + + interface acc_copyin + subroutine acc_copyin_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_copyin_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_copyin_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_present_or_copyin + subroutine acc_present_or_copyin_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_present_or_copyin_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_present_or_copyin_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_pcopyin + subroutine acc_pcopyin_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_pcopyin_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_pcopyin_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_create + subroutine acc_create_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_create_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_create_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_present_or_create + subroutine acc_present_or_create_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_present_or_create_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_present_or_create_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_pcreate + subroutine acc_pcreate_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_pcreate_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_pcreate_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_copyout + subroutine acc_copyout_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_copyout_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_copyout_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_delete + subroutine acc_delete_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_delete_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_delete_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_update_device + subroutine acc_update_device_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_update_device_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_update_device_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + interface acc_update_self + subroutine acc_update_self_32_h (a, len) + use iso_c_binding, only: c_int32_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end subroutine + + subroutine acc_update_self_64_h (a, len) + use iso_c_binding, only: c_int64_t + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end subroutine + + subroutine acc_update_self_array_h (a) + type (*), dimension (..), contiguous :: a + end subroutine + end interface + + ! acc_map_data: Only available in C/C++ + ! acc_unmap_data: Only available in C/C++ + ! acc_deviceptr: Only available in C/C++ + ! acc_ostptr: Only available in C/C++ + + interface acc_is_present + function acc_is_present_32_h (a, len) + use iso_c_binding, only: c_int32_t + logical acc_is_present_32_h + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int32_t) len + end function + + function acc_is_present_64_h (a, len) + use iso_c_binding, only: c_int64_t + logical acc_is_present_64_h + !GCC$ ATTRIBUTES NO_ARG_CHECK :: a + type (*), dimension (*) :: a + integer (c_int64_t) len + end function + + function acc_is_present_array_h (a) + logical acc_is_present_array_h + type (*), dimension (..), contiguous :: a + end function + end interface + + ! acc_memcpy_to_device: Only available in C/C++ + ! acc_memcpy_from_device: Only available in C/C++ diff --git a/libgomp/plugin/Makefrag.am b/libgomp/plugin/Makefrag.am new file mode 100644 index 0000000..d6642d9 --- /dev/null +++ b/libgomp/plugin/Makefrag.am @@ -0,0 +1,47 @@ +# Plugins for offload execution, Makefile.am fragment. +# +# Copyright (C) 2014 Free Software Foundation, Inc. +# +# Contributed by Mentor Embedded. +# +# This file is part of the GNU OpenMP Library (libgomp). +# +# Libgomp is free software; you can redistribute it and/or modify it +# under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 3, or (at your option) +# any later version. +# +# Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY +# WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +# FOR A PARTICULAR PURPOSE. See the GNU General Public License for +# more details. +# +# Under Section 7 of GPL version 3, you are granted additional +# permissions described in the GCC Runtime Library Exception, version +# 3.1, as published by the Free Software Foundation. +# +# You should have received a copy of the GNU General Public License and +# a copy of the GCC Runtime Library Exception along with this program; +# see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +# . + +if PLUGIN_NVPTX +# Nvidia PTX OpenACC plugin. +libgomp_plugin_nvptx_version_info = -version-info $(libtool_VERSION) +toolexeclib_LTLIBRARIES += libgomp-plugin-nvptx.la +libgomp_plugin_nvptx_la_SOURCES = plugin/plugin-nvptx.c +libgomp_plugin_nvptx_la_CPPFLAGS = $(AM_CPPFLAGS) $(PLUGIN_NVPTX_CPPFLAGS) +libgomp_plugin_nvptx_la_LDFLAGS = $(libgomp_plugin_nvptx_version_info) \ + $(lt_host_flags) +libgomp_plugin_nvptx_la_LDFLAGS += $(PLUGIN_NVPTX_LDFLAGS) +libgomp_plugin_nvptx_la_LIBADD = $(PLUGIN_NVPTX_LIBS) +libgomp_plugin_nvptx_la_LIBTOOLFLAGS = --tag=disable-static +endif + +libgomp_plugin_host_nonshm_version_info = -version-info $(libtool_VERSION) +toolexeclib_LTLIBRARIES += libgomp-plugin-host_nonshm.la +libgomp_plugin_host_nonshm_la_SOURCES = plugin/plugin-host.c +libgomp_plugin_host_nonshm_la_CPPFLAGS = $(AM_CPPFLAGS) -DHOST_NONSHM_PLUGIN +libgomp_plugin_host_nonshm_la_LDFLAGS = \ + $(libgomp_plugin_host_nonshm_version_info) $(lt_host_flags) +libgomp_plugin_host_nonshm_la_LIBTOOLFLAGS = --tag=disable-static diff --git a/libgomp/plugin/configfrag.ac b/libgomp/plugin/configfrag.ac new file mode 100644 index 0000000..68c7dc7 --- /dev/null +++ b/libgomp/plugin/configfrag.ac @@ -0,0 +1,107 @@ +# Plugins for offload execution, configure.ac fragment. +# +# Copyright (C) 2014 Free Software Foundation, Inc. +# +# Contributed by Mentor Embedded. +# +# This file is part of the GNU OpenMP Library (libgomp). +# +# Libgomp is free software; you can redistribute it and/or modify it +# under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 3, or (at your option) +# any later version. +# +# Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY +# WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +# FOR A PARTICULAR PURPOSE. See the GNU General Public License for +# more details. +# +# Under Section 7 of GPL version 3, you are granted additional +# permissions described in the GCC Runtime Library Exception, version +# 3.1, as published by the Free Software Foundation. +# +# You should have received a copy of the GNU General Public License and +# a copy of the GCC Runtime Library Exception along with this program; +# see the files COPYING3 and COPYING.RUNTIME respectively. If not, see +# . + +# Look for the CUDA driver package. +CUDA_DRIVER_INCLUDE= +CUDA_DRIVER_LIB= +AC_SUBST(CUDA_DRIVER_INCLUDE) +AC_SUBST(CUDA_DRIVER_LIB) +CUDA_DRIVER_CPPFLAGS= +CUDA_DRIVER_LDFLAGS= +AC_ARG_WITH(cuda-driver, + [AS_HELP_STRING([--with-cuda-driver=PATH], + [specify prefix directory for installed CUDA driver package. + Equivalent to --with-cuda-driver-include=PATH/include + plus --with-cuda-driver-lib=PATH/lib])]) +AC_ARG_WITH(cuda-driver-include, + [AS_HELP_STRING([--with-cuda-driver-include=PATH], + [specify directory for installed CUDA driver include files])]) +AC_ARG_WITH(cuda-driver-lib, + [AS_HELP_STRING([--with-cuda-driver-lib=PATH], + [specify directory for the installed CUDA driver library])]) +if test "x$with_cuda_driver" != x; then + CUDA_DRIVER_INCLUDE=$with_cuda_driver/include + CUDA_DRIVER_LIB=$with_cuda_driver/lib +fi +if test "x$with_cuda_driver_include" != x; then + CUDA_DRIVER_INCLUDE=$with_cuda_driver_include +fi +if test "x$with_cuda_driver_lib" != x; then + CUDA_DRIVER_LIB=$with_cuda_driver_lib +fi +if test "x$CUDA_DRIVER_INCLUDE" != x; then + CUDA_DRIVER_CPPFLAGS=-I$CUDA_DRIVER_INCLUDE +fi +if test "x$CUDA_DRIVER_LIB" != x; then + CUDA_DRIVER_LDFLAGS=-L$CUDA_DRIVER_LIB +fi + +PLUGIN_NVPTX=0 +PLUGIN_NVPTX_CPPFLAGS= +PLUGIN_NVPTX_LDFLAGS= +PLUGIN_NVPTX_LIBS= +AC_SUBST(PLUGIN_NVPTX) +AC_SUBST(PLUGIN_NVPTX_CPPFLAGS) +AC_SUBST(PLUGIN_NVPTX_LDFLAGS) +AC_SUBST(PLUGIN_NVPTX_LIBS) + +for accel in `echo $enable_offload_targets | sed -e 's#,# #g'`; do + case "$accel" in + nvptx*) + PLUGIN_NVPTX=$accel + PLUGIN_NVPTX_CPPFLAGS=$CUDA_DRIVER_CPPFLAGS + PLUGIN_NVPTX_LDFLAGS=$CUDA_DRIVER_LDFLAGS + PLUGIN_NVPTX_LIBS='-lcuda' + + PLUGIN_NVPTX_save_CPPFLAGS=$CPPFLAGS + CPPFLAGS="$PLUGIN_NVPTX_CPPFLAGS $CPPFLAGS" + PLUGIN_NVPTX_save_LDFLAGS=$LDFLAGS + LDFLAGS="$PLUGIN_NVPTX_LDFLAGS $LDFLAGS" + PLUGIN_NVPTX_save_LIBS=$LIBS + LIBS="$PLUGIN_NVPTX_LIBS $LIBS" + AC_LINK_IFELSE( + [AC_LANG_PROGRAM( + [#include "cuda.h"], + [CUresult r = cuCtxPushCurrent (NULL);])], + [PLUGIN_NVPTX=1]) + CPPFLAGS=$PLUGIN_NVPTX_save_CPPFLAGS + LDFLAGS=$PLUGIN_NVPTX_save_LDFLAGS + LIBS=$PLUGIN_NVPTX_save_LIBS + case $PLUGIN_NVPTX in + nvptx*) + PLUGIN_NVPTX=0 + AC_MSG_ERROR([CUDA driver package required for nvptx support]) + ;; + esac + ;; + esac +done +AM_CONDITIONAL([PLUGIN_NVPTX], [test $PLUGIN_NVPTX = 1]) +AC_DEFINE_UNQUOTED([PLUGIN_NVPTX], [$PLUGIN_NVPTX], + [Define to 1 if the NVIDIA plugin is built, 0 if not.]) + +AC_OUTPUT diff --git a/libgomp/plugin/plugin-host.c b/libgomp/plugin/plugin-host.c new file mode 100644 index 0000000..937fc7f --- /dev/null +++ b/libgomp/plugin/plugin-host.c @@ -0,0 +1,340 @@ +/* OpenACC Runtime Library: acc_device_host, acc_device_host_nonshm. + + Copyright (C) 2013 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* Simple implementation of support routines for a shared-memory + acc_device_host, and a non-shared memory acc_device_host_nonshm, with the + latter built as a plugin. */ + +#include "openacc.h" +#include "config.h" +#include "libgomp.h" +#include "libgomp_target.h" +#ifdef HOST_NONSHM_PLUGIN +#include "libgomp-plugin.h" +#include "oacc-plugin.h" +#else +#include "oacc-int.h" +#endif + +#include +#include +#include +#include + +#ifdef HOST_NONSHM_PLUGIN +#define STATIC +#define GOMP(X) GOMP_PLUGIN_##X +#define SELF "host_nonshm plugin: " +#else +#define STATIC static +#define GOMP(X) gomp_##X +#define SELF "host: " +#endif + +#ifndef HOST_NONSHM_PLUGIN +static struct gomp_device_descr host_dispatch; +#endif + +STATIC const char * +GOMP_OFFLOAD_get_name (void) +{ +#ifdef HOST_NONSHM_PLUGIN + return "host_nonshm"; +#else + return "host"; +#endif +} + +STATIC int +GOMP_OFFLOAD_get_type (void) +{ +#ifdef HOST_NONSHM_PLUGIN + return OFFLOAD_TARGET_TYPE_HOST_NONSHM; +#else + return OFFLOAD_TARGET_TYPE_HOST; +#endif +} + +STATIC unsigned int +GOMP_OFFLOAD_get_caps (void) +{ + unsigned int caps = TARGET_CAP_OPENACC_200 | TARGET_CAP_OPENMP_400 + | TARGET_CAP_NATIVE_EXEC; + +#ifndef HOST_NONSHM_PLUGIN + caps |= TARGET_CAP_SHARED_MEM; +#endif + + return caps; +} + +STATIC int +GOMP_OFFLOAD_get_num_devices (void) +{ + return 1; +} + +STATIC void +GOMP_OFFLOAD_register_image (void *host_table __attribute__((unused)), + void *target_data __attribute__((unused))) +{ +} + +STATIC void +GOMP_OFFLOAD_init_device (int n __attribute__((unused))) +{ +} + +STATIC void +GOMP_OFFLOAD_fini_device (int n __attribute__((unused))) +{ +} + +STATIC int +GOMP_OFFLOAD_get_table (int n __attribute__((unused)), + struct mapping_table **table __attribute__((unused))) +{ + return 0; +} + +STATIC void * +GOMP_OFFLOAD_openacc_open_device (int n) +{ + return (void *) (intptr_t) n; +} + +STATIC int +GOMP_OFFLOAD_openacc_close_device (void *hnd) +{ + return 0; +} + +STATIC int +GOMP_OFFLOAD_openacc_get_device_num (void) +{ + return 0; +} + +STATIC void +GOMP_OFFLOAD_openacc_set_device_num (int n) +{ + if (n > 0) + GOMP(fatal) ("device number %u out of range for host execution", n); +} + +STATIC void * +GOMP_OFFLOAD_alloc (int n __attribute__((unused)), size_t s) +{ + return GOMP(malloc) (s); +} + +STATIC void +GOMP_OFFLOAD_free (int n __attribute__((unused)), void *p) +{ + free (p); +} + +STATIC void * +GOMP_OFFLOAD_host2dev (int n __attribute__((unused)), void *d, const void *h, + size_t s) +{ +#ifdef HOST_NONSHM_PLUGIN + memcpy (d, h, s); +#endif + + return 0; +} + +STATIC void * +GOMP_OFFLOAD_dev2host (int n __attribute__((unused)), void *h, const void *d, + size_t s) +{ +#ifdef HOST_NONSHM_PLUGIN + memcpy (h, d, s); +#endif + + return 0; +} + +STATIC void +GOMP_OFFLOAD_run (int n __attribute__((unused)), void *fn_ptr, void *vars) +{ + void (*fn)(void *) = (void (*)(void *)) fn_ptr; + + fn (vars); +} + +STATIC void +GOMP_OFFLOAD_openacc_parallel (void (*fn) (void *), + size_t mapnum __attribute__((unused)), + void **hostaddrs, + void **devaddrs __attribute__((unused)), + size_t *sizes __attribute__((unused)), + unsigned short *kinds __attribute__((unused)), + int num_gangs __attribute__((unused)), + int num_workers __attribute__((unused)), + int vector_length __attribute__((unused)), + int async __attribute__((unused)), + void *targ_mem_desc __attribute__((unused))) +{ +#ifdef HOST_NONSHM_PLUGIN + fn (devaddrs); +#else + fn (hostaddrs); +#endif +} + +STATIC void +GOMP_OFFLOAD_openacc_register_async_cleanup (void *targ_mem_desc) +{ +#ifdef HOST_NONSHM_PLUGIN + /* "Asynchronous" launches are executed synchronously on the (non-SHM) host, + so there's no point in delaying host-side cleanup -- just do it now. */ + GOMP_PLUGIN_async_unmap_vars (targ_mem_desc); +#endif +} + +STATIC void +GOMP_OFFLOAD_openacc_async_set_async (int async __attribute__((unused))) +{ +} + +STATIC int +GOMP_OFFLOAD_openacc_async_test (int async __attribute__((unused))) +{ + return 1; +} + +STATIC int +GOMP_OFFLOAD_openacc_async_test_all (void) +{ + return 1; +} + +STATIC void +GOMP_OFFLOAD_openacc_async_wait (int async __attribute__((unused))) +{ +} + +STATIC void +GOMP_OFFLOAD_openacc_async_wait_all (void) +{ +} + +STATIC void +GOMP_OFFLOAD_openacc_async_wait_async (int async1 __attribute__((unused)), + int async2 __attribute__((unused))) +{ +} + +STATIC void +GOMP_OFFLOAD_openacc_async_wait_all_async (int async __attribute__((unused))) +{ +} + +STATIC void * +GOMP_OFFLOAD_openacc_create_thread_data (void *targ_data __attribute__((unused))) +{ + return NULL; +} + +STATIC void +GOMP_OFFLOAD_openacc_destroy_thread_data (void *tls_data __attribute__((unused))) +{ +} + +#ifndef HOST_NONSHM_PLUGIN +static struct gomp_device_descr host_dispatch = + { + .name = "host", + + .type = OFFLOAD_TARGET_TYPE_HOST, + .capabilities = TARGET_CAP_OPENACC_200 | TARGET_CAP_NATIVE_EXEC + | TARGET_CAP_SHARED_MEM, + .id = 0, + + .is_initialized = false, + .offload_regions_registered = false, + + .get_name_func = GOMP_OFFLOAD_get_name, + .get_type_func = GOMP_OFFLOAD_get_type, + .get_caps_func = GOMP_OFFLOAD_get_caps, + + .init_device_func = GOMP_OFFLOAD_init_device, + .fini_device_func = GOMP_OFFLOAD_fini_device, + .get_num_devices_func = GOMP_OFFLOAD_get_num_devices, + .register_image_func = GOMP_OFFLOAD_register_image, + .get_table_func = GOMP_OFFLOAD_get_table, + + .alloc_func = GOMP_OFFLOAD_alloc, + .free_func = GOMP_OFFLOAD_free, + .host2dev_func = GOMP_OFFLOAD_host2dev, + .dev2host_func = GOMP_OFFLOAD_dev2host, + + .run_func = GOMP_OFFLOAD_run, + + .openacc = { + .open_device_func = GOMP_OFFLOAD_openacc_open_device, + .close_device_func = GOMP_OFFLOAD_openacc_close_device, + + .get_device_num_func = GOMP_OFFLOAD_openacc_get_device_num, + .set_device_num_func = GOMP_OFFLOAD_openacc_set_device_num, + + .exec_func = GOMP_OFFLOAD_openacc_parallel, + + .register_async_cleanup_func + = GOMP_OFFLOAD_openacc_register_async_cleanup, + + .async_set_async_func = GOMP_OFFLOAD_openacc_async_set_async, + .async_test_func = GOMP_OFFLOAD_openacc_async_test, + .async_test_all_func = GOMP_OFFLOAD_openacc_async_test_all, + .async_wait_func = GOMP_OFFLOAD_openacc_async_wait, + .async_wait_async_func = GOMP_OFFLOAD_openacc_async_wait_async, + .async_wait_all_func = GOMP_OFFLOAD_openacc_async_wait_all, + .async_wait_all_async_func = GOMP_OFFLOAD_openacc_async_wait_all_async, + + .create_thread_data_func = GOMP_OFFLOAD_openacc_create_thread_data, + .destroy_thread_data_func = GOMP_OFFLOAD_openacc_destroy_thread_data, + + .cuda = { + .get_current_device_func = NULL, + .get_current_context_func = NULL, + .get_stream_func = NULL, + .set_stream_func = NULL, + } + } + }; + +/* Register this device type. */ +static __attribute__ ((constructor)) +void ACC_host_init (void) +{ + gomp_mutex_init (&host_dispatch.mem_map.lock); + ACC_register (&host_dispatch); +} +#endif + diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c new file mode 100644 index 0000000..f66633d --- /dev/null +++ b/libgomp/plugin/plugin-nvptx.c @@ -0,0 +1,1851 @@ +/* Plugin for NVPTX execution. + + Copyright (C) 2013-2014 Free Software Foundation, Inc. + + Contributed by Mentor Embedded. + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* Nvidia PTX-specific parts of OpenACC support. The cuda driver + library appears to hold some implicit state, but the documentation + is not clear as to what that state might be. Or how one might + propagate it from one thread to another. */ + +#include "openacc.h" +#include "config.h" +#include "libgomp.h" +#include "libgomp_target.h" +#include "libgomp-plugin.h" +#include "oacc-plugin.h" + +#include +#include +#include +#include +#include +#include +#include + +#define ARRAYSIZE(X) (sizeof (X) / sizeof ((X)[0])) + +static struct _errlist +{ + CUresult r; + char *m; +} cuErrorList[] = { + { CUDA_ERROR_INVALID_VALUE, "invalid value" }, + { CUDA_ERROR_OUT_OF_MEMORY, "out of memory" }, + { CUDA_ERROR_NOT_INITIALIZED, "not initialized" }, + { CUDA_ERROR_DEINITIALIZED, "deinitialized" }, + { CUDA_ERROR_PROFILER_DISABLED, "profiler disabled" }, + { CUDA_ERROR_PROFILER_NOT_INITIALIZED, "profiler not initialized" }, + { CUDA_ERROR_PROFILER_ALREADY_STARTED, "already started" }, + { CUDA_ERROR_PROFILER_ALREADY_STOPPED, "already stopped" }, + { CUDA_ERROR_NO_DEVICE, "no device" }, + { CUDA_ERROR_INVALID_DEVICE, "invalid device" }, + { CUDA_ERROR_INVALID_IMAGE, "invalid image" }, + { CUDA_ERROR_INVALID_CONTEXT, "invalid context" }, + { CUDA_ERROR_CONTEXT_ALREADY_CURRENT, "context already current" }, + { CUDA_ERROR_MAP_FAILED, "map error" }, + { CUDA_ERROR_UNMAP_FAILED, "unmap error" }, + { CUDA_ERROR_ARRAY_IS_MAPPED, "array is mapped" }, + { CUDA_ERROR_ALREADY_MAPPED, "already mapped" }, + { CUDA_ERROR_NO_BINARY_FOR_GPU, "no binary for gpu" }, + { CUDA_ERROR_ALREADY_ACQUIRED, "already acquired" }, + { CUDA_ERROR_NOT_MAPPED, "not mapped" }, + { CUDA_ERROR_NOT_MAPPED_AS_ARRAY, "not mapped as array" }, + { CUDA_ERROR_NOT_MAPPED_AS_POINTER, "not mapped as pointer" }, + { CUDA_ERROR_ECC_UNCORRECTABLE, "ecc uncorrectable" }, + { CUDA_ERROR_UNSUPPORTED_LIMIT, "unsupported limit" }, + { CUDA_ERROR_CONTEXT_ALREADY_IN_USE, "context already in use" }, + { CUDA_ERROR_PEER_ACCESS_UNSUPPORTED, "peer access unsupported" }, + { CUDA_ERROR_INVALID_SOURCE, "invalid source" }, + { CUDA_ERROR_FILE_NOT_FOUND, "file not found" }, + { CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND, + "shared object symbol not found" }, + { CUDA_ERROR_SHARED_OBJECT_INIT_FAILED, "shared object init error" }, + { CUDA_ERROR_OPERATING_SYSTEM, "operating system" }, + { CUDA_ERROR_INVALID_HANDLE, "invalid handle" }, + { CUDA_ERROR_NOT_FOUND, "not found" }, + { CUDA_ERROR_NOT_READY, "not ready" }, + { CUDA_ERROR_LAUNCH_FAILED, "launch error" }, + { CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES, "launch out of resources" }, + { CUDA_ERROR_LAUNCH_TIMEOUT, "launch timeout" }, + { CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING, + "launch incompatibe texturing" }, + { CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED, "peer access already enabled" }, + { CUDA_ERROR_PEER_ACCESS_NOT_ENABLED, "peer access not enabled " }, + { CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE, "primary cotext active" }, + { CUDA_ERROR_CONTEXT_IS_DESTROYED, "context is destroyed" }, + { CUDA_ERROR_ASSERT, "assert" }, + { CUDA_ERROR_TOO_MANY_PEERS, "too many peers" }, + { CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED, + "host memory already registered" }, + { CUDA_ERROR_HOST_MEMORY_NOT_REGISTERED, "host memory not registered" }, + { CUDA_ERROR_NOT_PERMITTED, "no permitted" }, + { CUDA_ERROR_NOT_SUPPORTED, "not supported" }, + { CUDA_ERROR_UNKNOWN, "unknown" } +}; + +static char errmsg[128]; + +static char * +cuErrorMsg (CUresult r) +{ + int i; + + for (i = 0; i < ARRAYSIZE (cuErrorList); i++) + { + if (cuErrorList[i].r == r) + return &cuErrorList[i].m[0]; + } + + sprintf (&errmsg[0], "unknown result code: %5d", r); + + return &errmsg[0]; +} + +struct targ_fn_descriptor +{ + CUfunction fn; + const char *name; +}; + +static bool PTX_inited = false; + +struct PTX_stream +{ + CUstream stream; + pthread_t host_thread; + bool multithreaded; + + CUdeviceptr d; + void *h; + void *h_begin; + void *h_end; + void *h_next; + void *h_prev; + void *h_tail; + + struct PTX_stream *next; +}; + +/* Thread-specific data for PTX. */ + +struct nvptx_thread +{ + struct PTX_stream *current_stream; + struct PTX_device *ptx_dev; +}; + +struct map +{ + int async; + size_t size; + char mappings[0]; +}; + +static void +map_init (struct PTX_stream *s) +{ + CUresult r; + + int size = getpagesize (); + + assert (s); + assert (!s->d); + assert (!s->h); + + r = cuMemAllocHost (&s->h, size); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemAllocHost error: %s", cuErrorMsg (r)); + + r = cuMemHostGetDevicePointer (&s->d, s->h, 0); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemHostGetDevicePointer error: %s", cuErrorMsg (r)); + + assert (s->h); + + s->h_begin = s->h; + s->h_end = s->h_begin + size; + s->h_next = s->h_prev = s->h_tail = s->h_begin; + + assert (s->h_next); + assert (s->h_end); +} + +static void +map_fini (struct PTX_stream *s) +{ + CUresult r; + + r = cuMemFreeHost (s->h); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemFreeHost error: %s", cuErrorMsg (r)); +} + +static void +map_pop (struct PTX_stream *s) +{ + struct map *m; + + assert (s != NULL); + assert (s->h_next); + assert (s->h_prev); + assert (s->h_tail); + + m = s->h_tail; + + s->h_tail += m->size; + + if (s->h_tail >= s->h_end) + s->h_tail = s->h_begin + (int) (s->h_tail - s->h_end); + + if (s->h_next == s->h_tail) + s->h_prev = s->h_next; + + assert (s->h_next >= s->h_begin); + assert (s->h_tail >= s->h_begin); + assert (s->h_prev >= s->h_begin); + + assert (s->h_next <= s->h_end); + assert (s->h_tail <= s->h_end); + assert (s->h_prev <= s->h_end); +} + +static void +map_push (struct PTX_stream *s, int async, size_t size, void **h, void **d) +{ + int left; + int offset; + struct map *m; + + assert (s != NULL); + + left = s->h_end - s->h_next; + size += sizeof (struct map); + + assert (s->h_prev); + assert (s->h_next); + + if (size >= left) + { + m = s->h_prev; + m->size += left; + s->h_next = s->h_begin; + + if (s->h_next + size > s->h_end) + GOMP_PLUGIN_fatal ("unable to push map"); + } + + assert (s->h_next); + + m = s->h_next; + m->async = async; + m->size = size; + + offset = (void *)&m->mappings[0] - s->h; + + *d = (void *)(s->d + offset); + *h = (void *)(s->h + offset); + + s->h_prev = s->h_next; + s->h_next += size; + + assert (s->h_prev); + assert (s->h_next); + + assert (s->h_next >= s->h_begin); + assert (s->h_tail >= s->h_begin); + assert (s->h_prev >= s->h_begin); + assert (s->h_next <= s->h_end); + assert (s->h_tail <= s->h_end); + assert (s->h_prev <= s->h_end); + + return; +} + +struct PTX_device +{ + CUcontext ctx; + bool ctx_shared; + CUdevice dev; + struct PTX_stream *null_stream; + /* All non-null streams associated with this device (actually context), + either created implicitly or passed in from the user (via + acc_set_cuda_stream). */ + struct PTX_stream *active_streams; + struct { + struct PTX_stream **arr; + int size; + } async_streams; + /* A lock for use when manipulating the above stream list and array. */ + gomp_mutex_t stream_lock; + int ord; + bool overlap; + bool map; + bool concur; + int mode; + bool mkern; + + struct PTX_device *next; +}; + +enum PTX_event_type +{ + PTX_EVT_MEM, + PTX_EVT_KNL, + PTX_EVT_SYNC, + PTX_EVT_ASYNC_CLEANUP +}; + +struct PTX_event +{ + CUevent *evt; + int type; + void *addr; + int ord; + + struct PTX_event *next; +}; + +static gomp_mutex_t PTX_event_lock; +static struct PTX_event *PTX_events; + +#define _XSTR(s) _STR(s) +#define _STR(s) #s + +static struct _synames +{ + char *n; +} cuSymNames[] = +{ + { _XSTR(cuCtxCreate) }, + { _XSTR(cuCtxDestroy) }, + { _XSTR(cuCtxGetCurrent) }, + { _XSTR(cuCtxPushCurrent) }, + { _XSTR(cuCtxSynchronize) }, + { _XSTR(cuDeviceGet) }, + { _XSTR(cuDeviceGetAttribute) }, + { _XSTR(cuDeviceGetCount) }, + { _XSTR(cuEventCreate) }, + { _XSTR(cuEventDestroy) }, + { _XSTR(cuEventQuery) }, + { _XSTR(cuEventRecord) }, + { _XSTR(cuInit) }, + { _XSTR(cuLaunchKernel) }, + { _XSTR(cuLinkAddData) }, + { _XSTR(cuLinkComplete) }, + { _XSTR(cuLinkCreate) }, + { _XSTR(cuMemAlloc) }, + { _XSTR(cuMemAllocHost) }, + { _XSTR(cuMemcpy) }, + { _XSTR(cuMemcpyDtoH) }, + { _XSTR(cuMemcpyDtoHAsync) }, + { _XSTR(cuMemcpyHtoD) }, + { _XSTR(cuMemcpyHtoDAsync) }, + { _XSTR(cuMemFree) }, + { _XSTR(cuMemFreeHost) }, + { _XSTR(cuMemGetAddressRange) }, + { _XSTR(cuMemHostGetDevicePointer) }, + { _XSTR(cuMemHostRegister) }, + { _XSTR(cuMemHostUnregister) }, + { _XSTR(cuModuleGetFunction) }, + { _XSTR(cuModuleLoadData) }, + { _XSTR(cuStreamDestroy) }, + { _XSTR(cuStreamQuery) }, + { _XSTR(cuStreamSynchronize) }, + { _XSTR(cuStreamWaitEvent) } +}; + +static int +verify_device_library (void) +{ + int i; + void *dh, *ds; + + dh = dlopen ("libcuda.so", RTLD_LAZY); + if (!dh) + return -1; + + for (i = 0; i < ARRAYSIZE (cuSymNames); i++) + { + ds = dlsym (dh, cuSymNames[i].n); + if (!ds) + return -1; + } + + dlclose (dh); + + return 0; +} + +static inline struct nvptx_thread * +nvptx_thread (void) +{ + return (struct nvptx_thread *) GOMP_PLUGIN_acc_thread (); +} + +static void +init_streams_for_device (struct PTX_device *ptx_dev, int concurrency) +{ + int i; + struct PTX_stream *null_stream + = GOMP_PLUGIN_malloc (sizeof (struct PTX_stream)); + + null_stream->stream = NULL; + null_stream->host_thread = pthread_self (); + null_stream->multithreaded = true; + null_stream->d = (CUdeviceptr) NULL; + null_stream->h = NULL; + map_init (null_stream); + ptx_dev->null_stream = null_stream; + + ptx_dev->active_streams = NULL; + GOMP_PLUGIN_mutex_init (&ptx_dev->stream_lock); + + if (concurrency < 1) + concurrency = 1; + + /* This is just a guess -- make space for as many async streams as the + current device is capable of concurrently executing. This can grow + later as necessary. No streams are created yet. */ + ptx_dev->async_streams.arr + = GOMP_PLUGIN_malloc (concurrency * sizeof (struct PTX_stream *)); + ptx_dev->async_streams.size = concurrency; + + for (i = 0; i < concurrency; i++) + ptx_dev->async_streams.arr[i] = NULL; +} + +static void +fini_streams_for_device (struct PTX_device *ptx_dev) +{ + free (ptx_dev->async_streams.arr); + + while (ptx_dev->active_streams != NULL) + { + struct PTX_stream *s = ptx_dev->active_streams; + ptx_dev->active_streams = ptx_dev->active_streams->next; + + cuStreamDestroy (s->stream); + map_fini (s); + free (s); + } + + map_fini (ptx_dev->null_stream); + free (ptx_dev->null_stream); +} + +/* Select a stream for (OpenACC-semantics) ASYNC argument for the current + thread THREAD (and also current device/context). If CREATE is true, create + the stream if it does not exist (or use EXISTING if it is non-NULL), and + associate the stream with the same thread argument. Returns stream to use + as result. */ + +static struct PTX_stream * +select_stream_for_async (int async, pthread_t thread, bool create, + CUstream existing) +{ + struct nvptx_thread *nvthd = nvptx_thread (); + /* Local copy of TLS variable. */ + struct PTX_device *ptx_dev = nvthd->ptx_dev; + struct PTX_stream *stream = NULL; + int orig_async = async; + + /* The special value acc_async_noval (-1) maps (for now) to an + implicitly-created stream, which is then handled the same as any other + numbered async stream. Other options are available, e.g. using the null + stream for anonymous async operations, or choosing an idle stream from an + active set. But, stick with this for now. */ + if (async > acc_async_sync) + async++; + + if (create) + GOMP_PLUGIN_mutex_lock (&ptx_dev->stream_lock); + + /* NOTE: AFAICT there's no particular need for acc_async_sync to map to the + null stream, and in fact better performance may be obtainable if it doesn't + (because the null stream enforces overly-strict synchronisation with + respect to other streams for legacy reasons, and that's probably not + needed with OpenACC). Maybe investigate later. */ + if (async == acc_async_sync) + stream = ptx_dev->null_stream; + else if (async >= 0 && async < ptx_dev->async_streams.size + && ptx_dev->async_streams.arr[async] && !(create && existing)) + stream = ptx_dev->async_streams.arr[async]; + else if (async >= 0 && create) + { + if (async >= ptx_dev->async_streams.size) + { + int i, newsize = ptx_dev->async_streams.size * 2; + + if (async >= newsize) + newsize = async + 1; + + ptx_dev->async_streams.arr + = GOMP_PLUGIN_realloc (ptx_dev->async_streams.arr, + newsize * sizeof (struct PTX_stream *)); + + for (i = ptx_dev->async_streams.size; i < newsize; i++) + ptx_dev->async_streams.arr[i] = NULL; + + ptx_dev->async_streams.size = newsize; + } + + /* Create a new stream on-demand if there isn't one already, or if we're + setting a particular async value to an existing (externally-provided) + stream. */ + if (!ptx_dev->async_streams.arr[async] || existing) + { + CUresult r; + struct PTX_stream *s + = GOMP_PLUGIN_malloc (sizeof (struct PTX_stream)); + + if (existing) + s->stream = existing; + else + { + r = cuStreamCreate (&s->stream, CU_STREAM_DEFAULT); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuStreamCreate error: %s", cuErrorMsg (r)); + } + + /* If CREATE is true, we're going to be queueing some work on this + stream. Associate it with the current host thread. */ + s->host_thread = thread; + s->multithreaded = false; + + s->d = (CUdeviceptr) NULL; + s->h = NULL; + map_init (s); + + s->next = ptx_dev->active_streams; + ptx_dev->active_streams = s; + ptx_dev->async_streams.arr[async] = s; + } + + stream = ptx_dev->async_streams.arr[async]; + } + else if (async < 0) + GOMP_PLUGIN_fatal ("bad async %d", async); + + if (create) + { + assert (stream != NULL); + + /* If we're trying to use the same stream from different threads + simultaneously, set stream->multithreaded to true. This affects the + behaviour of acc_async_test_all and acc_wait_all, which are supposed to + only wait for asynchronous launches from the same host thread they are + invoked on. If multiple threads use the same async value, we make note + of that here and fall back to testing/waiting for all threads in those + functions. */ + if (thread != stream->host_thread) + stream->multithreaded = true; + + GOMP_PLUGIN_mutex_unlock (&ptx_dev->stream_lock); + } + else if (stream && !stream->multithreaded + && !pthread_equal (stream->host_thread, thread)) + GOMP_PLUGIN_fatal ("async %d used on wrong thread", orig_async); + + return stream; +} + +static int PTX_get_num_devices (void); + +/* Initialize the device. */ +static int +PTX_init (void) +{ + CUresult r; + int rc; + + if (PTX_inited) + return PTX_get_num_devices (); + + rc = verify_device_library (); + if (rc < 0) + return -1; + + r = cuInit (0); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuInit error: %s", cuErrorMsg (r)); + + PTX_events = NULL; + + GOMP_PLUGIN_mutex_init (&PTX_event_lock); + + PTX_inited = true; + + return PTX_get_num_devices (); +} + +static void +PTX_fini (void) +{ + PTX_inited = false; +} + +static void * +PTX_open_device (int n) +{ + struct PTX_device *ptx_dev; + CUdevice dev; + CUresult r; + int async_engines, pi; + + r = cuDeviceGet (&dev, n); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuDeviceGet error: %s", cuErrorMsg (r)); + + ptx_dev = GOMP_PLUGIN_malloc (sizeof (struct PTX_device)); + + ptx_dev->ord = n; + ptx_dev->dev = dev; + ptx_dev->ctx_shared = false; + + r = cuCtxGetCurrent (&ptx_dev->ctx); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuCtxGetCurrent error: %s", cuErrorMsg (r)); + + if (!ptx_dev->ctx) + { + r = cuCtxCreate (&ptx_dev->ctx, CU_CTX_SCHED_AUTO, dev); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuCtxCreate error: %s", cuErrorMsg (r)); + } + else + ptx_dev->ctx_shared = true; + + r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_GPU_OVERLAP, dev); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuErrorMsg (r)); + + ptx_dev->overlap = pi; + + r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY, dev); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuErrorMsg (r)); + + ptx_dev->map = pi; + + r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS, dev); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuErrorMsg (r)); + + ptx_dev->concur = pi; + + r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, dev); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuErrorMsg (r)); + + ptx_dev->mode = pi; + + r = cuDeviceGetAttribute (&pi, CU_DEVICE_ATTRIBUTE_INTEGRATED, dev); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuErrorMsg (r)); + + ptx_dev->mkern = pi; + + r = cuDeviceGetAttribute (&async_engines, + CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev); + if (r != CUDA_SUCCESS) + async_engines = 1; + + init_streams_for_device (ptx_dev, async_engines); + + return (void *) ptx_dev; +} + +static int +PTX_close_device (void *targ_data) +{ + CUresult r; + struct PTX_device *ptx_dev = targ_data; + + if (!ptx_dev) + return 0; + + fini_streams_for_device (ptx_dev); + + if (!ptx_dev->ctx_shared) + { + r = cuCtxDestroy (ptx_dev->ctx); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuCtxDestroy error: %s", cuErrorMsg (r)); + } + + free (ptx_dev); + + return 0; +} + +static int +PTX_get_num_devices (void) +{ + int n; + CUresult r; + + /* This function will be called before the plugin has been initialized in + order to enumerate available devices, but CUDA API routines can't be used + until cuInit has been called. Just call it now (but don't yet do any + further initialization). */ + if (!PTX_inited) + cuInit (0); + + r = cuDeviceGetCount (&n); + if (r!= CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuDeviceGetCount error: %s", cuErrorMsg (r)); + + return n; +} + +#define ABORT_PTX \ + ".version 3.1\n" \ + ".target sm_30\n" \ + ".address_size 64\n" \ + ".visible .func abort;\n" \ + ".visible .func abort\n" \ + "{\n" \ + "trap;\n" \ + "ret;\n" \ + "}\n" \ + ".visible .func _gfortran_abort;\n" \ + ".visible .func _gfortran_abort\n" \ + "{\n" \ + "trap;\n" \ + "ret;\n" \ + "}\n" \ + +/* Generated with: + + $ echo 'int acc_on_device(int d) { return __builtin_acc_on_device(d); } int acc_on_device_(int *d) { return acc_on_device(*d); }' | accel-gcc/xgcc -Baccel-gcc -x c - -o - -S -m64 -O3 -fno-builtin-acc_on_device -fno-inline +*/ +#define ACC_ON_DEVICE_PTX \ + " .version 3.1\n" \ + " .target sm_30\n" \ + " .address_size 64\n" \ + ".visible .func (.param.u32 %out_retval)acc_on_device(.param.u32 %in_ar1);\n" \ + ".visible .func (.param.u32 %out_retval)acc_on_device(.param.u32 %in_ar1)\n" \ + "{\n" \ + " .reg.u32 %ar1;\n" \ + ".reg.u32 %retval;\n" \ + " .reg.u64 %hr10;\n" \ + " .reg.u32 %r24;\n" \ + " .reg.u32 %r25;\n" \ + " .reg.pred %r27;\n" \ + " .reg.u32 %r30;\n" \ + " ld.param.u32 %ar1, [%in_ar1];\n" \ + " mov.u32 %r24, %ar1;\n" \ + " setp.ne.u32 %r27,%r24,4;\n" \ + " set.u32.eq.u32 %r30,%r24,5;\n" \ + " neg.s32 %r25, %r30;\n" \ + " @%r27 bra $L3;\n" \ + " mov.u32 %r25, 1;\n" \ + "$L3:\n" \ + " mov.u32 %retval, %r25;\n" \ + " st.param.u32 [%out_retval], %retval;\n" \ + " ret;\n" \ + " }\n" \ + ".visible .func (.param.u32 %out_retval)acc_on_device_(.param.u64 %in_ar1);\n" \ + ".visible .func (.param.u32 %out_retval)acc_on_device_(.param.u64 %in_ar1)\n" \ + "{\n" \ + " .reg.u64 %ar1;\n" \ + ".reg.u32 %retval;\n" \ + " .reg.u64 %hr10;\n" \ + " .reg.u64 %r25;\n" \ + " .reg.u32 %r26;\n" \ + " .reg.u32 %r27;\n" \ + " ld.param.u64 %ar1, [%in_ar1];\n" \ + " mov.u64 %r25, %ar1;\n" \ + " ld.u32 %r26, [%r25];\n" \ + " {\n" \ + " .param.u32 %retval_in;\n" \ + " {\n" \ + " .param.u32 %out_arg0;\n" \ + " st.param.u32 [%out_arg0], %r26;\n" \ + " call (%retval_in), acc_on_device, (%out_arg0);\n" \ + " }\n" \ + " ld.param.u32 %r27, [%retval_in];\n" \ + "}\n" \ + " mov.u32 %retval, %r27;\n" \ + " st.param.u32 [%out_retval], %retval;\n" \ + " ret;\n" \ + " }" + +static void +link_ptx (CUmodule *module, char *ptx_code) +{ + CUjit_option opts[7]; + void *optvals[7]; + float elapsed = 0.0; +#define LOGSIZE 8192 + char elog[LOGSIZE]; + char ilog[LOGSIZE]; + unsigned long logsize = LOGSIZE; + CUlinkState linkstate; + CUresult r; + void *linkout; + size_t linkoutsize __attribute__((unused)); + + GOMP_PLUGIN_notify ("attempting to load:\n---\n%s\n---\n", ptx_code); + + opts[0] = CU_JIT_WALL_TIME; + optvals[0] = &elapsed; + + opts[1] = CU_JIT_INFO_LOG_BUFFER; + optvals[1] = &ilog[0]; + + opts[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES; + optvals[2] = (void *) logsize; + + opts[3] = CU_JIT_ERROR_LOG_BUFFER; + optvals[3] = &elog[0]; + + opts[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES; + optvals[4] = (void *) logsize; + + opts[5] = CU_JIT_LOG_VERBOSE; + optvals[5] = (void *) 1; + + opts[6] = CU_JIT_TARGET; + optvals[6] = (void *) CU_TARGET_COMPUTE_30; + + r = cuLinkCreate (7, opts, optvals, &linkstate); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuLinkCreate error: %s", cuErrorMsg (r)); + + char *abort_ptx = ABORT_PTX; + r = cuLinkAddData (linkstate, CU_JIT_INPUT_PTX, abort_ptx, + strlen (abort_ptx) + 1, 0, 0, 0, 0); + if (r != CUDA_SUCCESS) + { + GOMP_PLUGIN_error ("Link error log %s\n", &elog[0]); + GOMP_PLUGIN_fatal ("cuLinkAddData (abort) error: %s", cuErrorMsg (r)); + } + + char *acc_on_device_ptx = ACC_ON_DEVICE_PTX; + r = cuLinkAddData (linkstate, CU_JIT_INPUT_PTX, acc_on_device_ptx, + strlen (acc_on_device_ptx) + 1, 0, 0, 0, 0); + if (r != CUDA_SUCCESS) + { + GOMP_PLUGIN_error ("Link error log %s\n", &elog[0]); + GOMP_PLUGIN_fatal ("cuLinkAddData (acc_on_device) error: %s", + cuErrorMsg (r)); + } + + r = cuLinkAddData (linkstate, CU_JIT_INPUT_PTX, ptx_code, + strlen (ptx_code) + 1, 0, 0, 0, 0); + if (r != CUDA_SUCCESS) + { + GOMP_PLUGIN_error ("Link error log %s\n", &elog[0]); + GOMP_PLUGIN_fatal ("cuLinkAddData (ptx_code) error: %s", cuErrorMsg (r)); + } + + r = cuLinkComplete (linkstate, &linkout, &linkoutsize); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuLinkComplete error: %s", cuErrorMsg (r)); + + GOMP_PLUGIN_notify ("Link complete: %fms\n", elapsed); + GOMP_PLUGIN_notify ("Link log %s\n", &ilog[0]); + + r = cuModuleLoadData (module, linkout); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuModuleLoadData error: %s", cuErrorMsg (r)); +} + +static void +event_gc (bool memmap_lockable) +{ + struct PTX_event *e = PTX_events; + struct nvptx_thread *nvthd = nvptx_thread (); + + GOMP_PLUGIN_mutex_lock (&PTX_event_lock); + + while (e != NULL) + { + CUresult r; + + if (e->ord != nvthd->ptx_dev->ord) + { + e = e->next; + continue; + } + + r = cuEventQuery (*e->evt); + if (r == CUDA_SUCCESS) + { + CUevent *te; + + te = e->evt; + + switch (e->type) + { + case PTX_EVT_MEM: + case PTX_EVT_SYNC: + break; + + case PTX_EVT_KNL: + map_pop (e->addr); + break; + + case PTX_EVT_ASYNC_CLEANUP: + { + /* The function gomp_plugin_async_unmap_vars needs to claim the + memory-map splay tree lock for the current device, so we + can't call it when one of our callers has already claimed + the lock. In that case, just delay the GC for this event + until later. */ + if (!memmap_lockable) + { + e = e->next; + continue; + } + + GOMP_PLUGIN_async_unmap_vars (e->addr); + } + break; + } + + cuEventDestroy (*te); + free ((void *)te); + + struct PTX_event *next = e->next; + + if (PTX_events == e) + PTX_events = PTX_events->next; + else + { + struct PTX_event *e_ = PTX_events; + while (e_->next != e) + e_ = e_->next; + e_->next = e_->next->next; + } + + free (e); + e = next; + } + else + e = e->next; + } + + GOMP_PLUGIN_mutex_unlock (&PTX_event_lock); +} + +static void +event_add (enum PTX_event_type type, CUevent *e, void *h) +{ + struct PTX_event *ptx_event; + struct nvptx_thread *nvthd = nvptx_thread (); + + assert (type == PTX_EVT_MEM || type == PTX_EVT_KNL || type == PTX_EVT_SYNC + || type == PTX_EVT_ASYNC_CLEANUP); + + ptx_event = GOMP_PLUGIN_malloc (sizeof (struct PTX_event)); + ptx_event->type = type; + ptx_event->evt = e; + ptx_event->addr = h; + ptx_event->ord = nvthd->ptx_dev->ord; + + GOMP_PLUGIN_mutex_lock (&PTX_event_lock); + + ptx_event->next = PTX_events; + PTX_events = ptx_event; + + GOMP_PLUGIN_mutex_unlock (&PTX_event_lock); +} + +void +PTX_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs, + size_t *sizes, unsigned short *kinds, int num_gangs, int num_workers, + int vector_length, int async, void *targ_mem_desc) +{ + struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn; + CUfunction function; + CUresult r; + int i; + struct PTX_stream *dev_str; + void *kargs[1]; + void *hp, *dp; + unsigned int nthreads_in_block; + struct nvptx_thread *nvthd = nvptx_thread (); + + function = targ_fn->fn; + + dev_str = select_stream_for_async (async, pthread_self (), false, NULL); + assert (dev_str == nvthd->current_stream); + + /* This reserves a chunk of a pre-allocated page of memory mapped on both + the host and the device. HP is a host pointer to the new chunk, and DP is + the corresponding device pointer. */ + map_push (dev_str, async, mapnum * sizeof (void *), &hp, &dp); + + GOMP_PLUGIN_notify (" %s: prepare mappings\n", __FUNCTION__); + + /* Copy the array of arguments to the mapped page. */ + for (i = 0; i < mapnum; i++) + ((void **) hp)[i] = devaddrs[i]; + + /* Copy the (device) pointers to arguments to the device (dp and hp might in + fact have the same value on a unified-memory system). */ + r = cuMemcpy ((CUdeviceptr)dp, (CUdeviceptr)hp, mapnum * sizeof (void *)); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemcpy failed: %s", cuErrorMsg (r)); + + GOMP_PLUGIN_notify (" %s: kernel %s: launch\n", __FUNCTION__, targ_fn->name); + + // XXX: possible geometry mappings?? + // + // OpenACC CUDA + // + // num_gangs blocks + // num_workers warps (where a warp is equivalent to 32 threads) + // vector length threads + // + + /* The openacc vector_length clause 'determines the vector length to use for + vector or SIMD operations'. The question is how to map this to CUDA. + + In CUDA, the warp size is the vector length of a CUDA device. However, the + CUDA interface abstracts away from that, and only shows us warp size + indirectly in maximum number of threads per block, which is a product of + warp size and the number of hyperthreads of a multiprocessor. + + We choose to map openacc vector_length directly onto the number of threads + in a block, in the x dimension. This is reflected in gcc code generation + that uses ThreadIdx.x to access vector elements. + + Attempting to use an openacc vector_length of more than the maximum number + of threads per block will result in a cuda error. */ + nthreads_in_block = vector_length; + + kargs[0] = &dp; + r = cuLaunchKernel (function, + 1, 1, 1, + nthreads_in_block, 1, 1, + 0, dev_str->stream, kargs, 0); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuErrorMsg (r)); + +#ifndef DISABLE_ASYNC + if (async < acc_async_noval) + { + r = cuStreamSynchronize (dev_str->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuErrorMsg (r)); + } + else + { + CUevent *e; + + e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent)); + + r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuErrorMsg (r)); + + event_gc (true); + + r = cuEventRecord (*e, dev_str->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuErrorMsg (r)); + + event_add (PTX_EVT_KNL, e, (void *)dev_str); + } +#else + r = cuCtxSynchronize (); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuErrorMsg (r)); +#endif + + GOMP_PLUGIN_notify (" %s: kernel %s: finished\n", __FUNCTION__, + targ_fn->name); + +#ifndef DISABLE_ASYNC + if (async < acc_async_noval) +#endif + map_pop (dev_str); +} + +void * openacc_get_current_cuda_context (void); + +static void * +PTX_alloc (size_t s) +{ + CUdeviceptr d; + CUresult r; + + r = cuMemAlloc (&d, s); + if (r == CUDA_ERROR_OUT_OF_MEMORY) + return 0; + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemAlloc error: %s", cuErrorMsg (r)); + return (void *)d; +} + +static void +PTX_free (void *p) +{ + CUresult r; + CUdeviceptr pb; + size_t ps; + + r = cuMemGetAddressRange (&pb, &ps, (CUdeviceptr)p); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemGetAddressRange error: %s", cuErrorMsg (r)); + + if ((CUdeviceptr)p != pb) + GOMP_PLUGIN_fatal ("invalid device address"); + + r = cuMemFree ((CUdeviceptr)p); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuErrorMsg (r)); +} + +static void * +PTX_host2dev (void *d, const void *h, size_t s) +{ + CUresult r; + CUdeviceptr pb; + size_t ps; + struct nvptx_thread *nvthd = nvptx_thread (); + + if (!s) + return 0; + + if (!d) + GOMP_PLUGIN_fatal ("invalid device address"); + + r = cuMemGetAddressRange (&pb, &ps, (CUdeviceptr)d); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemGetAddressRange error: %s", cuErrorMsg (r)); + + if (!pb) + GOMP_PLUGIN_fatal ("invalid device address"); + + if (!h) + GOMP_PLUGIN_fatal ("invalid host address"); + + if (d == h) + GOMP_PLUGIN_fatal ("invalid host or device address"); + + if ((void *)(d + s) > (void *)(pb + ps)) + GOMP_PLUGIN_fatal ("invalid size"); + +#ifndef DISABLE_ASYNC + if (nvthd->current_stream != nvthd->ptx_dev->null_stream) + { + CUevent *e; + + e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent)); + + r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuErrorMsg (r)); + + event_gc (false); + + r = cuMemcpyHtoDAsync ((CUdeviceptr)d, h, s, + nvthd->current_stream->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemcpyHtoDAsync error: %s", cuErrorMsg (r)); + + r = cuEventRecord (*e, nvthd->current_stream->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuErrorMsg (r)); + + event_add (PTX_EVT_MEM, e, (void *)h); + } + else +#endif + { + r = cuMemcpyHtoD ((CUdeviceptr)d, h, s); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuErrorMsg (r)); + } + + return 0; +} + +static void * +PTX_dev2host (void *h, const void *d, size_t s) +{ + CUresult r; + CUdeviceptr pb; + size_t ps; + struct nvptx_thread *nvthd = nvptx_thread (); + + if (!s) + return 0; + + if (!d) + GOMP_PLUGIN_fatal ("invalid device address"); + + r = cuMemGetAddressRange (&pb, &ps, (CUdeviceptr)d); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemGetAddressRange error: %s", cuErrorMsg (r)); + + if (!pb) + GOMP_PLUGIN_fatal ("invalid device address"); + + if (!h) + GOMP_PLUGIN_fatal ("invalid host address"); + + if (d == h) + GOMP_PLUGIN_fatal ("invalid host or device address"); + + if ((void *)(d + s) > (void *)(pb + ps)) + GOMP_PLUGIN_fatal ("invalid size"); + +#ifndef DISABLE_ASYNC + if (nvthd->current_stream != nvthd->ptx_dev->null_stream) + { + CUevent *e; + + e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent)); + + r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventCreate error: %s\n", cuErrorMsg (r)); + + event_gc (false); + + r = cuMemcpyDtoHAsync (h, (CUdeviceptr)d, s, + nvthd->current_stream->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemcpyDtoHAsync error: %s", cuErrorMsg (r)); + + r = cuEventRecord (*e, nvthd->current_stream->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuErrorMsg (r)); + + event_add (PTX_EVT_MEM, e, (void *)h); + } + else +#endif + { + r = cuMemcpyDtoH (h, (CUdeviceptr)d, s); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuErrorMsg (r)); + } + + return 0; +} + +static void +PTX_set_async (int async) +{ + struct nvptx_thread *nvthd = nvptx_thread (); + nvthd->current_stream + = select_stream_for_async (async, pthread_self (), true, NULL); +} + +static int +PTX_async_test (int async) +{ + CUresult r; + struct PTX_stream *s; + + s = select_stream_for_async (async, pthread_self (), false, NULL); + + if (!s) + GOMP_PLUGIN_fatal ("unknown async %d", async); + + r = cuStreamQuery (s->stream); + if (r == CUDA_SUCCESS) + { + /* The oacc-parallel.c:goacc_wait function calls this hook to determine + whether all work has completed on this stream, and if so omits the call + to the wait hook. If that happens, event_gc might not get called + (which prevents variables from getting unmapped and their associated + device storage freed), so call it here. */ + event_gc (true); + return 1; + } + else if (r == CUDA_ERROR_NOT_READY) + return 0; + + GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuErrorMsg (r)); + + return 0; +} + +static int +PTX_async_test_all (void) +{ + struct PTX_stream *s; + pthread_t self = pthread_self (); + struct nvptx_thread *nvthd = nvptx_thread (); + + GOMP_PLUGIN_mutex_lock (&nvthd->ptx_dev->stream_lock); + + for (s = nvthd->ptx_dev->active_streams; s != NULL; s = s->next) + { + if ((s->multithreaded || pthread_equal (s->host_thread, self)) + && cuStreamQuery (s->stream) == CUDA_ERROR_NOT_READY) + { + GOMP_PLUGIN_mutex_unlock (&nvthd->ptx_dev->stream_lock); + return 0; + } + } + + GOMP_PLUGIN_mutex_unlock (&nvthd->ptx_dev->stream_lock); + + event_gc (true); + + return 1; +} + +static void +PTX_wait (int async) +{ + CUresult r; + struct PTX_stream *s; + + s = select_stream_for_async (async, pthread_self (), false, NULL); + + if (!s) + GOMP_PLUGIN_fatal ("unknown async %d", async); + + r = cuStreamSynchronize (s->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuErrorMsg (r)); + + event_gc (true); +} + +static void +PTX_wait_async (int async1, int async2) +{ + CUresult r; + CUevent *e; + struct PTX_stream *s1, *s2; + pthread_t self = pthread_self (); + + /* The stream that is waiting (rather than being waited for) doesn't + necessarily have to exist already. */ + s2 = select_stream_for_async (async2, self, true, NULL); + + s1 = select_stream_for_async (async1, self, false, NULL); + if (!s1) + GOMP_PLUGIN_fatal ("invalid async 1\n"); + + if (s1 == s2) + GOMP_PLUGIN_fatal ("identical parameters"); + + e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent)); + + r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuErrorMsg (r)); + + event_gc (true); + + r = cuEventRecord (*e, s1->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuErrorMsg (r)); + + event_add (PTX_EVT_SYNC, e, NULL); + + r = cuStreamWaitEvent (s2->stream, *e, 0); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuStreamWaitEvent error: %s", cuErrorMsg (r)); +} + +static void +PTX_wait_all (void) +{ + CUresult r; + struct PTX_stream *s; + pthread_t self = pthread_self (); + struct nvptx_thread *nvthd = nvptx_thread (); + + GOMP_PLUGIN_mutex_lock (&nvthd->ptx_dev->stream_lock); + + /* Wait for active streams initiated by this thread (or by multiple threads) + to complete. */ + for (s = nvthd->ptx_dev->active_streams; s != NULL; s = s->next) + { + if (s->multithreaded || pthread_equal (s->host_thread, self)) + { + r = cuStreamQuery (s->stream); + if (r == CUDA_SUCCESS) + continue; + else if (r != CUDA_ERROR_NOT_READY) + GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuErrorMsg (r)); + + r = cuStreamSynchronize (s->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuErrorMsg (r)); + } + } + + GOMP_PLUGIN_mutex_unlock (&nvthd->ptx_dev->stream_lock); + + event_gc (true); +} + +static void +PTX_wait_all_async (int async) +{ + CUresult r; + struct PTX_stream *waiting_stream, *other_stream; + CUevent *e; + struct nvptx_thread *nvthd = nvptx_thread (); + pthread_t self = pthread_self (); + + /* The stream doing the waiting. This could be the first mention of the + stream, so create it if necessary. */ + waiting_stream + = select_stream_for_async (async, pthread_self (), true, NULL); + + /* Launches on the null stream already block on other streams in the + context. */ + if (!waiting_stream || waiting_stream == nvthd->ptx_dev->null_stream) + return; + + event_gc (true); + + GOMP_PLUGIN_mutex_lock (&nvthd->ptx_dev->stream_lock); + + for (other_stream = nvthd->ptx_dev->active_streams; + other_stream != NULL; + other_stream = other_stream->next) + { + if (!other_stream->multithreaded + && !pthread_equal (other_stream->host_thread, self)) + continue; + + e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent)); + + r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuErrorMsg (r)); + + /* Record an event on the waited-for stream. */ + r = cuEventRecord (*e, other_stream->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuErrorMsg (r)); + + event_add (PTX_EVT_SYNC, e, NULL); + + r = cuStreamWaitEvent (waiting_stream->stream, *e, 0); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuStreamWaitEvent error: %s", cuErrorMsg (r)); + } + + GOMP_PLUGIN_mutex_unlock (&nvthd->ptx_dev->stream_lock); +} + +static void * +PTX_get_current_cuda_device (void) +{ + struct nvptx_thread *nvthd = nvptx_thread (); + + if (!nvthd || !nvthd->ptx_dev) + return NULL; + + return &nvthd->ptx_dev->dev; +} + +static void * +PTX_get_current_cuda_context (void) +{ + struct nvptx_thread *nvthd = nvptx_thread (); + + if (!nvthd || !nvthd->ptx_dev) + return NULL; + + return nvthd->ptx_dev->ctx; +} + +static void * +PTX_get_cuda_stream (int async) +{ + struct PTX_stream *s; + struct nvptx_thread *nvthd = nvptx_thread (); + + if (!nvthd || !nvthd->ptx_dev) + return NULL; + + s = select_stream_for_async (async, pthread_self (), false, NULL); + + return s ? s->stream : NULL; +} + +static int +PTX_set_cuda_stream (int async, void *stream) +{ + struct PTX_stream *oldstream; + pthread_t self = pthread_self (); + struct nvptx_thread *nvthd = nvptx_thread (); + + GOMP_PLUGIN_mutex_lock (&nvthd->ptx_dev->stream_lock); + + if (async < 0) + GOMP_PLUGIN_fatal ("bad async %d", async); + + /* We have a list of active streams and an array mapping async values to + entries of that list. We need to take "ownership" of the passed-in stream, + and add it to our list, removing the previous entry also (if there was one) + in order to prevent resource leaks. Note the potential for surprise + here: maybe we should keep track of passed-in streams and leave it up to + the user to tidy those up, but that doesn't work for stream handles + returned from acc_get_cuda_stream above... */ + + oldstream = select_stream_for_async (async, self, false, NULL); + + if (oldstream) + { + if (nvthd->ptx_dev->active_streams == oldstream) + nvthd->ptx_dev->active_streams = nvthd->ptx_dev->active_streams->next; + else + { + struct PTX_stream *s = nvthd->ptx_dev->active_streams; + while (s->next != oldstream) + s = s->next; + s->next = s->next->next; + } + + cuStreamDestroy (oldstream->stream); + map_fini (oldstream); + free (oldstream); + } + + GOMP_PLUGIN_mutex_unlock (&nvthd->ptx_dev->stream_lock); + + (void) select_stream_for_async (async, self, true, (CUstream) stream); + + return 1; +} + +/* Plugin entry points. */ + + +int +GOMP_OFFLOAD_get_type (void) +{ + return OFFLOAD_TARGET_TYPE_NVIDIA_PTX; +} + +unsigned int +GOMP_OFFLOAD_get_caps (void) +{ + return TARGET_CAP_OPENACC_200; +} + +const char * +GOMP_OFFLOAD_get_name (void) +{ + return "nvidia"; +} + +int +GOMP_OFFLOAD_get_num_devices (void) +{ + return PTX_get_num_devices (); +} + +static void **kernel_target_data; +static void **kernel_host_table; + +void +GOMP_OFFLOAD_register_image (void *host_table, void *target_data) +{ + kernel_target_data = target_data; + kernel_host_table = host_table; +} + +void +GOMP_OFFLOAD_init_device (int n __attribute__((unused))) +{ + (void) PTX_init (); +} + +void +GOMP_OFFLOAD_fini_device (int n __attribute__((unused))) +{ + PTX_fini (); +} + +int +GOMP_OFFLOAD_get_table (int n __attribute__((unused)), + struct mapping_table **tablep) +{ + CUmodule module; + void **fn_table; + char **fn_names; + int fn_entries, i; + CUresult r; + struct targ_fn_descriptor *targ_fns; + + if (PTX_init () <= 0) + return 0; + + /* This isn't an error, because an image may legitimately have no offloaded + regions and so will not call GOMP_offload_register. */ + if (kernel_target_data == NULL) + return 0; + + link_ptx (&module, kernel_target_data[0]); + + /* kernel_target_data[0] -> ptx code + kernel_target_data[1] -> variable mappings + kernel_target_data[2] -> array of kernel names in ascii + + kernel_host_table[0] -> start of function addresses (_omp_func_table) + kernel_host_table[1] -> end of function addresses (_omp_funcs_end) + + The array of kernel names and the functions addresses form a + one-to-one correspondence. */ + + fn_table = kernel_host_table[0]; + fn_names = (char **) kernel_target_data[2]; + fn_entries = (kernel_host_table[1] - kernel_host_table[0]) / sizeof (void *); + + *tablep = GOMP_PLUGIN_malloc (sizeof (struct mapping_table) * fn_entries); + targ_fns = GOMP_PLUGIN_malloc (sizeof (struct targ_fn_descriptor) + * fn_entries); + + for (i = 0; i < fn_entries; i++) + { + CUfunction function; + + r = cuModuleGetFunction (&function, module, fn_names[i]); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuModuleGetFunction error: %s", cuErrorMsg (r)); + + targ_fns[i].fn = function; + targ_fns[i].name = (const char *) fn_names[i]; + + (*tablep)[i].host_start = (uintptr_t) fn_table[i]; + (*tablep)[i].host_end = (*tablep)[i].host_start + 1; + (*tablep)[i].tgt_start = (uintptr_t) &targ_fns[i]; + (*tablep)[i].tgt_end = (*tablep)[i].tgt_start + 1; + } + + return fn_entries; +} + +void * +GOMP_OFFLOAD_alloc (int n __attribute__((unused)), size_t size) +{ + return PTX_alloc (size); +} + +void +GOMP_OFFLOAD_free (int n __attribute__((unused)), void *ptr) +{ + PTX_free (ptr); +} + +void * +GOMP_OFFLOAD_dev2host (int ord __attribute__((unused)), void *dst, + const void *src, size_t n) +{ + return PTX_dev2host (dst, src, n); +} + +void * +GOMP_OFFLOAD_host2dev (int ord __attribute__((unused)), void *dst, + const void *src, size_t n) +{ + return PTX_host2dev (dst, src, n); +} + +void (*device_run) (void *fn_ptr, void *vars) = NULL; + +void +GOMP_OFFLOAD_openacc_parallel (void (*fn) (void *), size_t mapnum, + void **hostaddrs, void **devaddrs, size_t *sizes, + unsigned short *kinds, int num_gangs, + int num_workers, int vector_length, int async, + void *targ_mem_desc) +{ + PTX_exec (fn, mapnum, hostaddrs, devaddrs, sizes, kinds, num_gangs, + num_workers, vector_length, async, targ_mem_desc); +} + +void * +GOMP_OFFLOAD_openacc_open_device (int n) +{ + return PTX_open_device (n); +} + +int +GOMP_OFFLOAD_openacc_close_device (void *h) +{ + return PTX_close_device (h); +} + +void +GOMP_OFFLOAD_openacc_set_device_num (int n) +{ + struct nvptx_thread *nvthd = nvptx_thread (); + + assert (n >= 0); + + if (!nvthd->ptx_dev || nvthd->ptx_dev->ord != n) + (void) PTX_open_device (n); +} + +/* This can be called before the device is "opened" for the current thread, in + which case we can't tell which device number should be returned. We don't + actually want to open the device here, so just return -1 and let the caller + (oacc-init.c:acc_get_device_num) handle it. */ + +int +GOMP_OFFLOAD_openacc_get_device_num (void) +{ + struct nvptx_thread *nvthd = nvptx_thread (); + + if (nvthd && nvthd->ptx_dev) + return nvthd->ptx_dev->ord; + else + return -1; +} + +void +GOMP_OFFLOAD_openacc_register_async_cleanup (void *targ_mem_desc) +{ + CUevent *e; + CUresult r; + struct nvptx_thread *nvthd = nvptx_thread (); + + e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent)); + + r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuErrorMsg (r)); + + r = cuEventRecord (*e, nvthd->current_stream->stream); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuEventRecord error: %s", cuErrorMsg (r)); + + event_add (PTX_EVT_ASYNC_CLEANUP, e, targ_mem_desc); +} + +int +GOMP_OFFLOAD_openacc_async_test (int async) +{ + return PTX_async_test (async); +} + +int +GOMP_OFFLOAD_openacc_async_test_all (void) +{ + return PTX_async_test_all (); +} + +void +GOMP_OFFLOAD_openacc_async_wait (int async) +{ + PTX_wait (async); +} + +void +GOMP_OFFLOAD_openacc_async_wait_async (int async1, int async2) +{ + PTX_wait_async (async1, async2); +} + +void +GOMP_OFFLOAD_openacc_async_wait_all (void) +{ + PTX_wait_all (); +} + +void +GOMP_OFFLOAD_openacc_async_wait_all_async (int async) +{ + PTX_wait_all_async (async); +} + +void +GOMP_OFFLOAD_openacc_async_set_async (int async) +{ + PTX_set_async (async); +} + +void * +GOMP_OFFLOAD_openacc_create_thread_data (void *targ_data) +{ + struct PTX_device *ptx_dev = (struct PTX_device *) targ_data; + struct nvptx_thread *nvthd + = GOMP_PLUGIN_malloc (sizeof (struct nvptx_thread)); + CUresult r; + CUcontext thd_ctx; + + r = cuCtxGetCurrent (&thd_ctx); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuCtxGetCurrent error: %s", cuErrorMsg (r)); + + assert (ptx_dev->ctx); + + if (!thd_ctx) + { + r = cuCtxPushCurrent (ptx_dev->ctx); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuCtxPushCurrent error: %s", cuErrorMsg (r)); + } + + nvthd->current_stream = ptx_dev->null_stream; + nvthd->ptx_dev = ptx_dev; + + return (void *) nvthd; +} + +void +GOMP_OFFLOAD_openacc_destroy_thread_data (void *data) +{ + free (data); +} + +void * +GOMP_OFFLOAD_openacc_get_current_cuda_device (void) +{ + return PTX_get_current_cuda_device (); +} + +void * +GOMP_OFFLOAD_openacc_get_current_cuda_context (void) +{ + return PTX_get_current_cuda_context (); +} + +/* NOTE: This returns a CUstream, not a PTX_stream pointer. */ + +void * +GOMP_OFFLOAD_openacc_get_cuda_stream (int async) +{ + return PTX_get_cuda_stream (async); +} + +/* NOTE: This takes a CUstream, not a PTX_stream pointer. */ + +int +GOMP_OFFLOAD_openacc_set_cuda_stream (int async, void *stream) +{ + return PTX_set_cuda_stream (async, stream); +} diff --git a/libgomp/splay-tree.c b/libgomp/splay-tree.c new file mode 100644 index 0000000..14b03ac --- /dev/null +++ b/libgomp/splay-tree.c @@ -0,0 +1,224 @@ +/* A splay-tree datatype. + Copyright 1998-2013 + Free Software Foundation, Inc. + Contributed by Mark Mitchell (mark@markmitchell.com). + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* The splay tree code copied from include/splay-tree.h and adjusted, + so that all the data lives directly in splay_tree_node_s structure + and no extra allocations are needed. + + Files including this header should before including it add: +typedef struct splay_tree_node_s *splay_tree_node; +typedef struct splay_tree_s *splay_tree; +typedef struct splay_tree_key_s *splay_tree_key; + define splay_tree_key_s structure, and define + splay_compare inline function. */ + +/* For an easily readable description of splay-trees, see: + + Lewis, Harry R. and Denenberg, Larry. Data Structures and Their + Algorithms. Harper-Collins, Inc. 1991. + + The major feature of splay trees is that all basic tree operations + are amortized O(log n) time for a tree with n nodes. */ + +#include "libgomp.h" +#include "splay-tree.h" + +extern int splay_compare (splay_tree_key, splay_tree_key); + +/* Rotate the edge joining the left child N with its parent P. PP is the + grandparents' pointer to P. */ + +static inline void +rotate_left (splay_tree_node *pp, splay_tree_node p, splay_tree_node n) +{ + splay_tree_node tmp; + tmp = n->right; + n->right = p; + p->left = tmp; + *pp = n; +} + +/* Rotate the edge joining the right child N with its parent P. PP is the + grandparents' pointer to P. */ + +static inline void +rotate_right (splay_tree_node *pp, splay_tree_node p, splay_tree_node n) +{ + splay_tree_node tmp; + tmp = n->left; + n->left = p; + p->right = tmp; + *pp = n; +} + +/* Bottom up splay of KEY. */ + +static void +splay_tree_splay (splay_tree sp, splay_tree_key key) +{ + if (sp->root == NULL) + return; + + do { + int cmp1, cmp2; + splay_tree_node n, c; + + n = sp->root; + cmp1 = splay_compare (key, &n->key); + + /* Found. */ + if (cmp1 == 0) + return; + + /* Left or right? If no child, then we're done. */ + if (cmp1 < 0) + c = n->left; + else + c = n->right; + if (!c) + return; + + /* Next one left or right? If found or no child, we're done + after one rotation. */ + cmp2 = splay_compare (key, &c->key); + if (cmp2 == 0 + || (cmp2 < 0 && !c->left) + || (cmp2 > 0 && !c->right)) + { + if (cmp1 < 0) + rotate_left (&sp->root, n, c); + else + rotate_right (&sp->root, n, c); + return; + } + + /* Now we have the four cases of double-rotation. */ + if (cmp1 < 0 && cmp2 < 0) + { + rotate_left (&n->left, c, c->left); + rotate_left (&sp->root, n, n->left); + } + else if (cmp1 > 0 && cmp2 > 0) + { + rotate_right (&n->right, c, c->right); + rotate_right (&sp->root, n, n->right); + } + else if (cmp1 < 0 && cmp2 > 0) + { + rotate_right (&n->left, c, c->right); + rotate_left (&sp->root, n, n->left); + } + else if (cmp1 > 0 && cmp2 < 0) + { + rotate_left (&n->right, c, c->left); + rotate_right (&sp->root, n, n->right); + } + } while (1); +} + +/* Insert a new NODE into SP. The NODE shouldn't exist in the tree. */ + +attribute_hidden void +splay_tree_insert (splay_tree sp, splay_tree_node node) +{ + int comparison = 0; + + splay_tree_splay (sp, &node->key); + + if (sp->root) + comparison = splay_compare (&sp->root->key, &node->key); + + if (sp->root && comparison == 0) + gomp_fatal ("Duplicate node"); + else + { + /* Insert it at the root. */ + if (sp->root == NULL) + node->left = node->right = NULL; + else if (comparison < 0) + { + node->left = sp->root; + node->right = node->left->right; + node->left->right = NULL; + } + else + { + node->right = sp->root; + node->left = node->right->left; + node->right->left = NULL; + } + + sp->root = node; + } +} + +/* Remove node with KEY from SP. It is not an error if it did not exist. */ + +attribute_hidden void +splay_tree_remove (splay_tree sp, splay_tree_key key) +{ + splay_tree_splay (sp, key); + + if (sp->root && splay_compare (&sp->root->key, key) == 0) + { + splay_tree_node left, right; + + left = sp->root->left; + right = sp->root->right; + + /* One of the children is now the root. Doesn't matter much + which, so long as we preserve the properties of the tree. */ + if (left) + { + sp->root = left; + + /* If there was a right child as well, hang it off the + right-most leaf of the left child. */ + if (right) + { + while (left->right) + left = left->right; + left->right = right; + } + } + else + sp->root = right; + } +} + +/* Lookup KEY in SP, returning NODE if present, and NULL + otherwise. */ + +attribute_hidden splay_tree_key +splay_tree_lookup (splay_tree sp, splay_tree_key key) +{ + splay_tree_splay (sp, key); + + if (sp->root && splay_compare (&sp->root->key, key) == 0) + return &sp->root->key; + else + return NULL; +} diff --git a/libgomp/splay-tree.h b/libgomp/splay-tree.h index eb8011a..f29d437 100644 --- a/libgomp/splay-tree.h +++ b/libgomp/splay-tree.h @@ -43,6 +43,30 @@ typedef struct splay_tree_key_s *splay_tree_key; The major feature of splay trees is that all basic tree operations are amortized O(log n) time for a tree with n nodes. */ +#ifndef _SPLAY_TREE_H +#define _SPLAY_TREE_H 1 + +typedef struct splay_tree_node_s *splay_tree_node; +typedef struct splay_tree_s *splay_tree; +typedef struct splay_tree_key_s *splay_tree_key; + +struct splay_tree_key_s { + /* Address of the host object. */ + uintptr_t host_start; + /* Address immediately after the host object. */ + uintptr_t host_end; + /* Descriptor of the target memory. */ + struct target_mem_desc *tgt; + /* Offset from tgt->tgt_start to the start of the target object. */ + uintptr_t tgt_offset; + /* Reference count. */ + uintptr_t refcount; + /* Asynchronous reference count. */ + uintptr_t async_refcount; + /* True if data should be copied from device to host at the end. */ + bool copy_from; +}; + /* The nodes in the splay tree. */ struct splay_tree_node_s { struct splay_tree_key_s key; @@ -56,177 +80,8 @@ struct splay_tree_s { splay_tree_node root; }; -/* Rotate the edge joining the left child N with its parent P. PP is the - grandparents' pointer to P. */ - -static inline void -rotate_left (splay_tree_node *pp, splay_tree_node p, splay_tree_node n) -{ - splay_tree_node tmp; - tmp = n->right; - n->right = p; - p->left = tmp; - *pp = n; -} - -/* Rotate the edge joining the right child N with its parent P. PP is the - grandparents' pointer to P. */ - -static inline void -rotate_right (splay_tree_node *pp, splay_tree_node p, splay_tree_node n) -{ - splay_tree_node tmp; - tmp = n->left; - n->left = p; - p->right = tmp; - *pp = n; -} - -/* Bottom up splay of KEY. */ - -static void -splay_tree_splay (splay_tree sp, splay_tree_key key) -{ - if (sp->root == NULL) - return; - - do { - int cmp1, cmp2; - splay_tree_node n, c; - - n = sp->root; - cmp1 = splay_compare (key, &n->key); - - /* Found. */ - if (cmp1 == 0) - return; - - /* Left or right? If no child, then we're done. */ - if (cmp1 < 0) - c = n->left; - else - c = n->right; - if (!c) - return; - - /* Next one left or right? If found or no child, we're done - after one rotation. */ - cmp2 = splay_compare (key, &c->key); - if (cmp2 == 0 - || (cmp2 < 0 && !c->left) - || (cmp2 > 0 && !c->right)) - { - if (cmp1 < 0) - rotate_left (&sp->root, n, c); - else - rotate_right (&sp->root, n, c); - return; - } - - /* Now we have the four cases of double-rotation. */ - if (cmp1 < 0 && cmp2 < 0) - { - rotate_left (&n->left, c, c->left); - rotate_left (&sp->root, n, n->left); - } - else if (cmp1 > 0 && cmp2 > 0) - { - rotate_right (&n->right, c, c->right); - rotate_right (&sp->root, n, n->right); - } - else if (cmp1 < 0 && cmp2 > 0) - { - rotate_right (&n->left, c, c->right); - rotate_left (&sp->root, n, n->left); - } - else if (cmp1 > 0 && cmp2 < 0) - { - rotate_left (&n->right, c, c->left); - rotate_right (&sp->root, n, n->right); - } - } while (1); -} - -/* Insert a new NODE into SP. The NODE shouldn't exist in the tree. */ - -static void -splay_tree_insert (splay_tree sp, splay_tree_node node) -{ - int comparison = 0; - - splay_tree_splay (sp, &node->key); - - if (sp->root) - comparison = splay_compare (&sp->root->key, &node->key); - - if (sp->root && comparison == 0) - abort (); - else - { - /* Insert it at the root. */ - if (sp->root == NULL) - node->left = node->right = NULL; - else if (comparison < 0) - { - node->left = sp->root; - node->right = node->left->right; - node->left->right = NULL; - } - else - { - node->right = sp->root; - node->left = node->right->left; - node->right->left = NULL; - } - - sp->root = node; - } -} - -/* Remove node with KEY from SP. It is not an error if it did not exist. */ - -static void -splay_tree_remove (splay_tree sp, splay_tree_key key) -{ - splay_tree_splay (sp, key); - - if (sp->root && splay_compare (&sp->root->key, key) == 0) - { - splay_tree_node left, right; - - left = sp->root->left; - right = sp->root->right; - - /* One of the children is now the root. Doesn't matter much - which, so long as we preserve the properties of the tree. */ - if (left) - { - sp->root = left; - - /* If there was a right child as well, hang it off the - right-most leaf of the left child. */ - if (right) - { - while (left->right) - left = left->right; - left->right = right; - } - } - else - sp->root = right; - } -} - -/* Lookup KEY in SP, returning NODE if present, and NULL - otherwise. */ - -static splay_tree_key -splay_tree_lookup (splay_tree sp, splay_tree_key key) -{ - splay_tree_splay (sp, key); - - if (sp->root && splay_compare (&sp->root->key, key) == 0) - return &sp->root->key; - else - return NULL; -} +attribute_hidden splay_tree_key splay_tree_lookup (splay_tree, splay_tree_key); +attribute_hidden void splay_tree_insert (splay_tree, splay_tree_node); +attribute_hidden void splay_tree_remove (splay_tree, splay_tree_key); + +#endif /* _SPLAY_TREE_H */ diff --git a/libgomp/target.c b/libgomp/target.c index 4ace170..a307239 100644 --- a/libgomp/target.c +++ b/libgomp/target.c @@ -30,7 +30,12 @@ #include #include #include +#include "oacc-plugin.h" +#include "gomp-constants.h" +#include "oacc-int.h" #include +#include +#include #ifdef PLUGIN_SUPPORT #include @@ -40,50 +45,6 @@ static void gomp_target_init (void); static pthread_once_t gomp_is_initialized = PTHREAD_ONCE_INIT; -/* Forward declaration for a node in the tree. */ -typedef struct splay_tree_node_s *splay_tree_node; -typedef struct splay_tree_s *splay_tree; -typedef struct splay_tree_key_s *splay_tree_key; - -struct target_mem_desc { - /* Reference count. */ - uintptr_t refcount; - /* All the splay nodes allocated together. */ - splay_tree_node array; - /* Start of the target region. */ - uintptr_t tgt_start; - /* End of the targer region. */ - uintptr_t tgt_end; - /* Handle to free. */ - void *to_free; - /* Previous target_mem_desc. */ - struct target_mem_desc *prev; - /* Number of items in following list. */ - size_t list_count; - - /* Corresponding target device descriptor. */ - struct gomp_device_descr *device_descr; - - /* List of splay keys to remove (or decrease refcount) - at the end of region. */ - splay_tree_key list[]; -}; - -struct splay_tree_key_s { - /* Address of the host object. */ - uintptr_t host_start; - /* Address immediately after the host object. */ - uintptr_t host_end; - /* Descriptor of the target memory. */ - struct target_mem_desc *tgt; - /* Offset from tgt->tgt_start to the start of the target object. */ - uintptr_t tgt_offset; - /* Reference count. */ - uintptr_t refcount; - /* True if data should be copied from device to host at the end. */ - bool copy_from; -}; - /* This structure describes an offload image. It contains type of the target device, pointer to host table descriptor, and pointer to target data. */ @@ -107,7 +68,7 @@ static int num_devices; /* The comparison function. */ -static int +attribute_hidden int splay_compare (splay_tree_key x, splay_tree_key y) { if (x->host_start == x->host_end @@ -122,47 +83,16 @@ splay_compare (splay_tree_key x, splay_tree_key y) #include "splay-tree.h" -/* This structure describes accelerator device. - It contains ID-number of the device, its type, function handlers for - interaction with the device, and information about mapped memory. */ -struct gomp_device_descr +attribute_hidden void +gomp_init_targets_once (void) { - /* This is the ID number of device. It could be specified in DEVICE-clause of - TARGET construct. */ - int id; - - /* This is the ID number of device among devices of the same type. */ - int target_id; - - /* This is the TYPE of device. */ - enum offload_target_type type; - - /* Set to true when device is initialized. */ - bool is_initialized; - - /* Function handlers. */ - int (*get_type_func) (void); - int (*get_num_devices_func) (void); - void (*register_image_func) (void *, void *); - void (*init_device_func) (int); - int (*get_table_func) (int, void *); - void *(*alloc_func) (int, size_t); - void (*free_func) (int, void *); - void *(*host2dev_func) (int, void *, const void *, size_t); - void *(*dev2host_func) (int, void *, const void *, size_t); - void (*run_func) (int, void *, void *); - - /* Splay tree containing information about mapped memory regions. */ - struct splay_tree_s dev_splay_tree; - - /* Mutex for operating with the splay tree and other shared structures. */ - gomp_mutex_t dev_env_lock; -}; + (void) pthread_once (&gomp_is_initialized, gomp_target_init); +} attribute_hidden int gomp_get_num_devices (void) { - (void) pthread_once (&gomp_is_initialized, gomp_target_init); + gomp_init_targets_once (); return num_devices; } @@ -198,18 +128,29 @@ gomp_map_vars_existing (splay_tree_key oldn, splay_tree_key newn, oldn->refcount++; } -static struct target_mem_desc * +static int +get_kind (bool is_openacc, void *kinds, int idx) +{ + return is_openacc ? ((unsigned short *) kinds)[idx] + : ((unsigned char *) kinds)[idx]; +} + +attribute_hidden struct target_mem_desc * gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, - void **hostaddrs, size_t *sizes, unsigned char *kinds, - bool is_target) + void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds, + bool is_openacc, bool is_target) { size_t i, tgt_align, tgt_size, not_found_cnt = 0; + const int rshift = is_openacc ? 8 : 3; + const int typemask = is_openacc ? 0xff : 0x7; + struct gomp_memory_mapping *mm = &devicep->mem_map; struct splay_tree_key_s cur_node; struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt) + sizeof (tgt->list[0]) * mapnum); tgt->list_count = mapnum; tgt->refcount = 1; tgt->device_descr = devicep; + tgt->mem_map = mm; if (mapnum == 0) return tgt; @@ -222,41 +163,41 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, tgt_align = align; tgt_size = mapnum * sizeof (void *); } - - gomp_mutex_lock (&devicep->dev_env_lock); + gomp_mutex_lock (&mm->lock); for (i = 0; i < mapnum; i++) { + int kind = get_kind (is_openacc, kinds, i); if (hostaddrs[i] == NULL) { tgt->list[i] = NULL; continue; } cur_node.host_start = (uintptr_t) hostaddrs[i]; - if ((kinds[i] & 7) != 4) + if (!GOMP_MAP_POINTER_P (kind & typemask)) cur_node.host_end = cur_node.host_start + sizes[i]; else cur_node.host_end = cur_node.host_start + sizeof (void *); - splay_tree_key n = splay_tree_lookup (&devicep->dev_splay_tree, - &cur_node); + splay_tree_key n = splay_tree_lookup (&mm->splay_tree, &cur_node); if (n) { tgt->list[i] = n; - gomp_map_vars_existing (n, &cur_node, kinds[i]); + gomp_map_vars_existing (n, &cur_node, kind); } else { - size_t align = (size_t) 1 << (kinds[i] >> 3); + size_t align = (size_t) 1 << (kind >> rshift); tgt->list[i] = NULL; not_found_cnt++; if (tgt_align < align) tgt_align = align; tgt_size = (tgt_size + align - 1) & ~(align - 1); tgt_size += cur_node.host_end - cur_node.host_start; - if ((kinds[i] & 7) == 5) + if ((kind & typemask) == GOMP_MAP_TO_PSET) { size_t j; for (j = i + 1; j < mapnum; j++) - if ((kinds[j] & 7) != 4) + if (!GOMP_MAP_POINTER_P (get_kind (is_openacc, kinds, j) + & typemask)) break; else if ((uintptr_t) hostaddrs[j] < cur_node.host_start || ((uintptr_t) hostaddrs[j] + sizeof (void *) @@ -271,7 +212,15 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, } } - if (not_found_cnt || is_target) + if (devaddrs) + { + if (mapnum != 1) + gomp_fatal ("unexpected aggregation"); + tgt->to_free = devaddrs[0]; + tgt->tgt_start = (uintptr_t) tgt->to_free; + tgt->tgt_end = tgt->tgt_start + sizes[0]; + } + else if (not_found_cnt || is_target) { /* Allocate tgt_align aligned tgt_size block of memory. */ /* FIXME: Perhaps change interface to allocate properly aligned @@ -303,44 +252,52 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, for (i = 0; i < mapnum; i++) if (tgt->list[i] == NULL) { + int kind = get_kind (is_openacc, kinds, i); if (hostaddrs[i] == NULL) continue; splay_tree_key k = &array->key; k->host_start = (uintptr_t) hostaddrs[i]; - if ((kinds[i] & 7) != 4) + if (!GOMP_MAP_POINTER_P (kind & typemask)) k->host_end = k->host_start + sizes[i]; else k->host_end = k->host_start + sizeof (void *); - splay_tree_key n - = splay_tree_lookup (&devicep->dev_splay_tree, k); + splay_tree_key n = splay_tree_lookup (&mm->splay_tree, k); if (n) { tgt->list[i] = n; - gomp_map_vars_existing (n, k, kinds[i]); + gomp_map_vars_existing (n, k, kind); } else { - size_t align = (size_t) 1 << (kinds[i] >> 3); + size_t align = (size_t) 1 << (kind >> rshift); tgt->list[i] = k; tgt_size = (tgt_size + align - 1) & ~(align - 1); k->tgt = tgt; k->tgt_offset = tgt_size; tgt_size += k->host_end - k->host_start; - k->copy_from = false; - if ((kinds[i] & 7) == 2 || (kinds[i] & 7) == 3) - k->copy_from = true; + k->copy_from = GOMP_MAP_COPYFROM_P (kind & typemask) + || GOMP_MAP_TOFROM_P (kind & typemask); k->refcount = 1; + k->async_refcount = 0; tgt->refcount++; array->left = NULL; array->right = NULL; - splay_tree_insert (&devicep->dev_splay_tree, array); - switch (kinds[i] & 7) + splay_tree_insert (&mm->splay_tree, array); + switch (kind & typemask) { - case 0: /* ALLOC */ - case 2: /* FROM */ + case GOMP_MAP_FORCE_ALLOC: + case GOMP_MAP_FORCE_FROM: + /* FIXME: No special handling (see comment in + oacc-parallel.c). */ + case GOMP_MAP_ALLOC: + case GOMP_MAP_ALLOC_FROM: break; - case 1: /* TO */ - case 3: /* TOFROM */ + case GOMP_MAP_FORCE_TO: + case GOMP_MAP_FORCE_TOFROM: + /* FIXME: No special handling, as above. */ + case GOMP_MAP_ALLOC_TO: + case GOMP_MAP_ALLOC_TOFROM: + /* Copy from host to device memory. */ /* FIXME: Perhaps add some smarts, like if copying several adjacent fields from host to target, use some host buffer to avoid sending each var individually. */ @@ -350,7 +307,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, (void *) k->host_start, k->host_end - k->host_start); break; - case 4: /* POINTER */ + case GOMP_MAP_POINTER: cur_node.host_start = (uintptr_t) *(void **) k->host_start; if (cur_node.host_start == (uintptr_t) NULL) @@ -366,19 +323,16 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, /* Add bias to the pointer value. */ cur_node.host_start += sizes[i]; cur_node.host_end = cur_node.host_start + 1; - n = splay_tree_lookup (&devicep->dev_splay_tree, - &cur_node); + n = splay_tree_lookup (&mm->splay_tree, &cur_node); if (n == NULL) { /* Could be possibly zero size array section. */ cur_node.host_end--; - n = splay_tree_lookup (&devicep->dev_splay_tree, - &cur_node); + n = splay_tree_lookup (&mm->splay_tree, &cur_node); if (n == NULL) { cur_node.host_start--; - n = splay_tree_lookup (&devicep->dev_splay_tree, - &cur_node); + n = splay_tree_lookup (&mm->splay_tree, &cur_node); cur_node.host_start++; } } @@ -398,14 +352,17 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, (void *) &cur_node.tgt_offset, sizeof (void *)); break; - case 5: /* TO_PSET */ - devicep->host2dev_func (devicep->target_id, - (void *) (tgt->tgt_start - + k->tgt_offset), - (void *) k->host_start, - k->host_end - k->host_start); + case GOMP_MAP_TO_PSET: + /* Copy from host to device memory. */ + /* FIXME: see above FIXME comment. */ + devicep->host2dev_func + (devicep->target_id, + (void *) (tgt->tgt_start + k->tgt_offset), + (void *) k->host_start, + (k->host_end - k->host_start)); for (j = i + 1; j < mapnum; j++) - if ((kinds[j] & 7) != 4) + if (!GOMP_MAP_POINTER_P (get_kind (is_openacc, kinds, j) + & typemask)) break; else if ((uintptr_t) hostaddrs[j] < k->host_start || ((uintptr_t) hostaddrs[j] + sizeof (void *) @@ -432,19 +389,18 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, /* Add bias to the pointer value. */ cur_node.host_start += sizes[j]; cur_node.host_end = cur_node.host_start + 1; - n = splay_tree_lookup (&devicep->dev_splay_tree, - &cur_node); + n = splay_tree_lookup (&mm->splay_tree, &cur_node); if (n == NULL) { /* Could be possibly zero size array section. */ cur_node.host_end--; - n = splay_tree_lookup (&devicep->dev_splay_tree, + n = splay_tree_lookup (&mm->splay_tree, &cur_node); if (n == NULL) { cur_node.host_start--; - n = splay_tree_lookup - (&devicep->dev_splay_tree, &cur_node); + n = splay_tree_lookup (&mm->splay_tree, + &cur_node); cur_node.host_start++; } } @@ -468,6 +424,32 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, i++; } break; + case GOMP_MAP_FORCE_PRESENT: + { + /* We already looked up the memory region above and it + was missing. */ + size_t size = k->host_end - k->host_start; + gomp_fatal ("present clause: !acc_is_present (%p, " + "%zd (0x%zx))", (void *) k->host_start, + size, size); + } + break; + case GOMP_MAP_FORCE_DEVICEPTR: + assert (k->host_end - k->host_start == sizeof (void *)); + + devicep->host2dev_func + (devicep->target_id, + (void *) (tgt->tgt_start + k->tgt_offset), + (void *) k->host_start, + sizeof (void *)); + break; + case GOMP_MAP_FORCE_PRIVATE: + abort (); + case GOMP_MAP_FORCE_FIRSTPRIVATE: + abort (); + default: + gomp_fatal ("%s: unhandled kind 0x%.2x", __FUNCTION__, + kind); } array++; } @@ -490,7 +472,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, } } - gomp_mutex_unlock (&devicep->dev_env_lock); + gomp_mutex_unlock (&mm->lock); return tgt; } @@ -505,10 +487,51 @@ gomp_unmap_tgt (struct target_mem_desc *tgt) free (tgt); } -static void -gomp_unmap_vars (struct target_mem_desc *tgt) +/* Decrease the refcount for a set of mapped variables, and queue asychronous + copies from the device back to the host after any work that has been issued. + Because the regions are still "live", increment an asynchronous reference + count to indicate that they should not be unmapped from host-side data + structures until the asynchronous copy has completed. */ + +attribute_hidden void +gomp_copy_from_async (struct target_mem_desc *tgt) +{ + struct gomp_device_descr *devicep = tgt->device_descr; + struct gomp_memory_mapping *mm = tgt->mem_map; + size_t i; + + gomp_mutex_lock (&mm->lock); + + for (i = 0; i < tgt->list_count; i++) + if (tgt->list[i] == NULL) + ; + else if (tgt->list[i]->refcount > 1) + { + tgt->list[i]->refcount--; + tgt->list[i]->async_refcount++; + } + else + { + splay_tree_key k = tgt->list[i]; + if (k->copy_from) + /* Copy from device to host memory. */ + devicep->dev2host_func (devicep->target_id, (void *) k->host_start, + (void *) (k->tgt->tgt_start + k->tgt_offset), + k->host_end - k->host_start); + } + + gomp_mutex_unlock (&mm->lock); +} + +/* Unmap variables described by TGT. If DO_COPYFROM is true, copy relevant + variables back from device to host: if it is false, it is assumed that this + has been done already, i.e. by gomp_copy_from_async above. */ + +attribute_hidden void +gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom) { struct gomp_device_descr *devicep = tgt->device_descr; + struct gomp_memory_mapping *mm = tgt->mem_map; if (tgt->list_count == 0) { @@ -517,20 +540,23 @@ gomp_unmap_vars (struct target_mem_desc *tgt) } size_t i; - gomp_mutex_lock (&devicep->dev_env_lock); + gomp_mutex_lock (&mm->lock); for (i = 0; i < tgt->list_count; i++) if (tgt->list[i] == NULL) ; else if (tgt->list[i]->refcount > 1) tgt->list[i]->refcount--; + else if (tgt->list[i]->async_refcount > 0) + tgt->list[i]->async_refcount--; else { splay_tree_key k = tgt->list[i]; - if (k->copy_from) + if (k->copy_from && do_copyfrom) + /* Copy from device to host memory. */ devicep->dev2host_func (devicep->target_id, (void *) k->host_start, (void *) (k->tgt->tgt_start + k->tgt_offset), k->host_end - k->host_start); - splay_tree_remove (&devicep->dev_splay_tree, k); + splay_tree_remove (&mm->splay_tree, k); if (k->tgt->refcount > 1) k->tgt->refcount--; else @@ -541,15 +567,17 @@ gomp_unmap_vars (struct target_mem_desc *tgt) tgt->refcount--; else gomp_unmap_tgt (tgt); - gomp_mutex_unlock (&devicep->dev_env_lock); + gomp_mutex_unlock (&mm->lock); } static void -gomp_update (struct gomp_device_descr *devicep, size_t mapnum, - void **hostaddrs, size_t *sizes, unsigned char *kinds) +gomp_update (struct gomp_device_descr *devicep, struct gomp_memory_mapping *mm, + size_t mapnum, void **hostaddrs, size_t *sizes, void *kinds, + bool is_openacc) { size_t i; struct splay_tree_key_s cur_node; + const int typemask = is_openacc ? 0xff : 0x7; if (!devicep) return; @@ -557,16 +585,17 @@ gomp_update (struct gomp_device_descr *devicep, size_t mapnum, if (mapnum == 0) return; - gomp_mutex_lock (&devicep->dev_env_lock); + gomp_mutex_lock (&mm->lock); for (i = 0; i < mapnum; i++) if (sizes[i]) { cur_node.host_start = (uintptr_t) hostaddrs[i]; cur_node.host_end = cur_node.host_start + sizes[i]; - splay_tree_key n = splay_tree_lookup (&devicep->dev_splay_tree, + splay_tree_key n = splay_tree_lookup (&mm->splay_tree, &cur_node); if (n) { + int kind = get_kind (is_openacc, kinds, i); if (n->host_start > cur_node.host_start || n->host_end < cur_node.host_end) gomp_fatal ("Trying to update [%p..%p) object when" @@ -575,31 +604,38 @@ gomp_update (struct gomp_device_descr *devicep, size_t mapnum, (void *) cur_node.host_end, (void *) n->host_start, (void *) n->host_end); - if ((kinds[i] & 7) == 1) - devicep->host2dev_func (devicep->target_id, - (void *) (n->tgt->tgt_start - + n->tgt_offset - + cur_node.host_start - - n->host_start), - (void *) cur_node.host_start, - cur_node.host_end - cur_node.host_start); - else if ((kinds[i] & 7) == 2) - devicep->dev2host_func (devicep->target_id, - (void *) cur_node.host_start, - (void *) (n->tgt->tgt_start - + n->tgt_offset - + cur_node.host_start - - n->host_start), - cur_node.host_end - cur_node.host_start); + if (GOMP_MAP_COPYTO_P (kind & typemask)) + /* Copy from host to device memory. */ + devicep->host2dev_func + (devicep->target_id, + (void *) (n->tgt->tgt_start + + n->tgt_offset + + cur_node.host_start + - n->host_start), + (void *) cur_node.host_start, + cur_node.host_end - cur_node.host_start); + else if (GOMP_MAP_COPYFROM_P (kind & typemask)) + /* Copy from device to host memory. */ + devicep->dev2host_func + (devicep->target_id, + (void *) cur_node.host_start, + (void *) (n->tgt->tgt_start + + n->tgt_offset + + cur_node.host_start + - n->host_start), + cur_node.host_end - cur_node.host_start); } else gomp_fatal ("Trying to update [%p..%p) object that is not mapped", (void *) cur_node.host_start, (void *) cur_node.host_end); } - gomp_mutex_unlock (&devicep->dev_env_lock); + gomp_mutex_unlock (&mm->lock); } +static void gomp_register_image_for_device (struct gomp_device_descr *device, + struct offload_image_descr *image); + /* This function should be called from every offload image. It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of the target, and TARGET_DATA needed by target plugin. */ @@ -612,6 +648,9 @@ GOMP_offload_register (void *host_table, enum offload_target_type target_type, (num_offload_images + 1) * sizeof (struct offload_image_descr)); + if (offload_images == NULL) + return; + offload_images[num_offload_images].type = target_type; offload_images[num_offload_images].host_table = host_table; offload_images[num_offload_images].target_data = target_data; @@ -621,17 +660,24 @@ GOMP_offload_register (void *host_table, enum offload_target_type target_type, /* This function initializes the target device, specified by DEVICEP. */ -static void +attribute_hidden void gomp_init_device (struct gomp_device_descr *devicep) { + /* Initialize the target device. */ devicep->init_device_func (devicep->target_id); + + devicep->is_initialized = true; +} +attribute_hidden void +gomp_init_tables (const struct gomp_device_descr *devicep, + struct gomp_memory_mapping *mm) +{ /* Get address mapping table for device. */ struct mapping_table *table = NULL; - int num_entries = devicep->get_table_func (devicep->target_id, &table); + int i, num_entries = devicep->get_table_func (devicep->target_id, &table); /* Insert host-target address mapping into dev_splay_tree. */ - int i; for (i = 0; i < num_entries; i++) { struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt)); @@ -641,7 +687,7 @@ gomp_init_device (struct gomp_device_descr *devicep) tgt->tgt_end = table[i].tgt_end; tgt->to_free = NULL; tgt->list_count = 0; - tgt->device_descr = devicep; + tgt->device_descr = (struct gomp_device_descr *) devicep; splay_tree_node node = tgt->array; splay_tree_key k = &node->key; k->host_start = table[i].host_start; @@ -650,11 +696,45 @@ gomp_init_device (struct gomp_device_descr *devicep) k->tgt = tgt; node->left = NULL; node->right = NULL; - splay_tree_insert (&devicep->dev_splay_tree, node); + splay_tree_insert (&mm->splay_tree, node); } free (table); - devicep->is_initialized = true; + mm->is_initialized = true; +} + +static void +gomp_init_dev_tables (struct gomp_device_descr *devicep) +{ + gomp_init_device (devicep); + gomp_init_tables (devicep, &devicep->mem_map); +} + + +attribute_hidden void +gomp_free_memmap (struct gomp_device_descr *devicep) +{ + struct gomp_memory_mapping *mm = &devicep->mem_map; + + while (mm->splay_tree.root) + { + struct target_mem_desc *tgt = mm->splay_tree.root->key.tgt; + + splay_tree_remove (&mm->splay_tree, &mm->splay_tree.root->key); + free (tgt->array); + free (tgt); + } + + mm->is_initialized = false; +} + +attribute_hidden void +gomp_fini_device (struct gomp_device_descr *devicep) +{ + if (devicep->is_initialized) + devicep->fini_device_func (devicep->target_id); + + devicep->is_initialized = false; } /* Called when encountering a target directive. If DEVICE @@ -673,7 +753,12 @@ GOMP_target (int device, void (*fn) (void *), const void *openmp_target, unsigned char *kinds) { struct gomp_device_descr *devicep = resolve_device (device); - if (devicep == NULL) + struct gomp_memory_mapping *mm = &devicep->mem_map; + + if (devicep != NULL && !devicep->is_initialized) + gomp_init_dev_tables (devicep); + + if (devicep == NULL || !(devicep->capabilities & TARGET_CAP_OPENMP_400)) { /* Host fallback. */ struct gomp_thread old_thr, *thr = gomp_thread (); @@ -690,20 +775,30 @@ GOMP_target (int device, void (*fn) (void *), const void *openmp_target, return; } - gomp_mutex_lock (&devicep->dev_env_lock); - if (!devicep->is_initialized) - gomp_init_device (devicep); + void *fn_addr; - struct splay_tree_key_s k; - k.host_start = (uintptr_t) fn; - k.host_end = k.host_start + 1; - splay_tree_key tgt_fn = splay_tree_lookup (&devicep->dev_splay_tree, &k); - if (tgt_fn == NULL) - gomp_fatal ("Target function wasn't mapped"); - gomp_mutex_unlock (&devicep->dev_env_lock); + if (devicep->capabilities & TARGET_CAP_NATIVE_EXEC) + fn_addr = (void *) fn; + else + { + gomp_mutex_lock (&mm->lock); + if (!devicep->is_initialized) + gomp_init_dev_tables (devicep); + struct splay_tree_key_s k; + k.host_start = (uintptr_t) fn; + k.host_end = k.host_start + 1; + splay_tree_key tgt_fn = splay_tree_lookup (&devicep->mem_map.splay_tree, + &k); + if (tgt_fn == NULL) + gomp_fatal ("Target function wasn't mapped"); + gomp_mutex_unlock (&mm->lock); + + fn_addr = (void *) tgt_fn->tgt->tgt_start; + } struct target_mem_desc *tgt_vars - = gomp_map_vars (devicep, mapnum, hostaddrs, sizes, kinds, true); + = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false, + true); struct gomp_thread old_thr, *thr = gomp_thread (); old_thr = *thr; memset (thr, '\0', sizeof (*thr)); @@ -712,11 +807,10 @@ GOMP_target (int device, void (*fn) (void *), const void *openmp_target, thr->place = old_thr.place; thr->ts.place_partition_len = gomp_places_list_len; } - devicep->run_func (devicep->target_id, (void *) tgt_fn->tgt->tgt_start, - (void *) tgt_vars->tgt_start); + devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start); gomp_free_thread (thr); *thr = old_thr; - gomp_unmap_vars (tgt_vars); + gomp_unmap_vars (tgt_vars, true); } void @@ -724,7 +818,12 @@ GOMP_target_data (int device, const void *openmp_target, size_t mapnum, void **hostaddrs, size_t *sizes, unsigned char *kinds) { struct gomp_device_descr *devicep = resolve_device (device); - if (devicep == NULL) + struct gomp_memory_mapping *mm = &devicep->mem_map; + + if (devicep != NULL && !devicep->is_initialized) + gomp_init_dev_tables (devicep); + + if (devicep == NULL || !(devicep->capabilities & TARGET_CAP_OPENMP_400)) { /* Host fallback. */ struct gomp_task_icv *icv = gomp_icv (false); @@ -735,20 +834,21 @@ GOMP_target_data (int device, const void *openmp_target, size_t mapnum, new #pragma omp target data, otherwise GOMP_target_end_data would get out of sync. */ struct target_mem_desc *tgt - = gomp_map_vars (NULL, 0, NULL, NULL, NULL, false); + = gomp_map_vars (NULL, 0, NULL, NULL, NULL, NULL, false, false); tgt->prev = icv->target_data; icv->target_data = tgt; } return; } - gomp_mutex_lock (&devicep->dev_env_lock); + gomp_mutex_lock (&mm->lock); if (!devicep->is_initialized) - gomp_init_device (devicep); - gomp_mutex_unlock (&devicep->dev_env_lock); + gomp_init_dev_tables (devicep); + gomp_mutex_unlock (&mm->lock); struct target_mem_desc *tgt - = gomp_map_vars (devicep, mapnum, hostaddrs, sizes, kinds, false); + = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false, + false); struct gomp_task_icv *icv = gomp_icv (true); tgt->prev = icv->target_data; icv->target_data = tgt; @@ -762,7 +862,7 @@ GOMP_target_end_data (void) { struct target_mem_desc *tgt = icv->target_data; icv->target_data = tgt->prev; - gomp_unmap_vars (tgt); + gomp_unmap_vars (tgt, true); } } @@ -771,15 +871,18 @@ GOMP_target_update (int device, const void *openmp_target, size_t mapnum, void **hostaddrs, size_t *sizes, unsigned char *kinds) { struct gomp_device_descr *devicep = resolve_device (device); - if (devicep == NULL) - return; + struct gomp_memory_mapping *mm = &devicep->mem_map; - gomp_mutex_lock (&devicep->dev_env_lock); - if (!devicep->is_initialized) + gomp_mutex_lock (&mm->lock); + if (devicep != NULL && !devicep->is_initialized) gomp_init_device (devicep); - gomp_mutex_unlock (&devicep->dev_env_lock); + gomp_mutex_unlock (&mm->lock); - gomp_update (devicep, mapnum, hostaddrs, sizes, kinds); + if (devicep == NULL || !(devicep->capabilities & TARGET_CAP_OPENMP_400)) + return; + + gomp_update (devicep, &devicep->mem_map, mapnum, hostaddrs, sizes, kinds, + false); } void @@ -806,9 +909,22 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device, const char *plugin_name) { void *plugin_handle = dlopen (plugin_name, RTLD_LAZY); + char *err = NULL, *last_missing = NULL; + int optional_present, optional_total; + if (!plugin_handle) return false; + /* Clear any existing error. */ + dlerror (); + + device->plugin_handle = dlopen (plugin_name, RTLD_LAZY); + if (!device->plugin_handle) + { + err = dlerror (); + goto out; + } + /* Check if all required functions are available in the plugin and store their handlers. */ #define DLSYM(f) \ @@ -819,33 +935,104 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device, return false; \ } \ while (0) + /* Similar, but missing functions are not an error. */ +#define DLSYM_OPT(f,n) \ + do \ + { \ + char *tmp_err; \ + device->f##_func = dlsym (device->plugin_handle, \ + "GOMP_OFFLOAD_" #n); \ + tmp_err = dlerror (); \ + if (tmp_err == NULL) \ + optional_present++; \ + else \ + last_missing = #n; \ + optional_total++; \ + } \ + while (0) + + DLSYM (get_name); + DLSYM (get_caps); DLSYM (get_type); DLSYM (get_num_devices); DLSYM (register_image); DLSYM (init_device); + DLSYM (fini_device); DLSYM (get_table); DLSYM (alloc); DLSYM (free); DLSYM (dev2host); DLSYM (host2dev); - DLSYM (run); + device->capabilities = device->get_caps_func (); + if (device->capabilities & TARGET_CAP_OPENMP_400) + DLSYM (run); + if (device->capabilities & TARGET_CAP_OPENACC_200) + { + optional_present = optional_total = 0; + DLSYM_OPT (openacc.exec, openacc_parallel); + DLSYM_OPT (openacc.open_device, openacc_open_device); + DLSYM_OPT (openacc.close_device, openacc_close_device); + DLSYM_OPT (openacc.get_device_num, openacc_get_device_num); + DLSYM_OPT (openacc.set_device_num, openacc_set_device_num); + DLSYM_OPT (openacc.register_async_cleanup, + openacc_register_async_cleanup); + DLSYM_OPT (openacc.async_test, openacc_async_test); + DLSYM_OPT (openacc.async_test_all, openacc_async_test_all); + DLSYM_OPT (openacc.async_wait, openacc_async_wait); + DLSYM_OPT (openacc.async_wait_async, openacc_async_wait_async); + DLSYM_OPT (openacc.async_wait_all, openacc_async_wait_all); + DLSYM_OPT (openacc.async_wait_all_async, openacc_async_wait_all_async); + DLSYM_OPT (openacc.async_set_async, openacc_async_set_async); + DLSYM_OPT (openacc.create_thread_data, openacc_create_thread_data); + DLSYM_OPT (openacc.destroy_thread_data, openacc_destroy_thread_data); + /* Require all the OpenACC handlers if we have TARGET_CAP_OPENACC_200. */ + if (optional_present != optional_total) + { + err = "plugin missing OpenACC handler function"; + goto out; + } + optional_present = optional_total = 0; + DLSYM_OPT (openacc.cuda.get_current_device, + openacc_get_current_cuda_device); + DLSYM_OPT (openacc.cuda.get_current_context, + openacc_get_current_cuda_context); + DLSYM_OPT (openacc.cuda.get_stream, openacc_get_cuda_stream); + DLSYM_OPT (openacc.cuda.set_stream, openacc_set_cuda_stream); + /* Make sure all the CUDA functions are there if any of them are. */ + if (optional_present && optional_present != optional_total) + { + err = "plugin missing OpenACC CUDA handler function"; + goto out; + } + } #undef DLSYM +#undef DLSYM_OPT - return true; + out: + if (err != NULL) + { + gomp_error ("while loading %s: %s", plugin_name, err); + if (last_missing) + gomp_error ("missing function was %s", last_missing); + if (device->plugin_handle) + dlclose (device->plugin_handle); + } + return err == NULL; } -/* This function finds OFFLOAD_IMAGES corresponding to DEVICE type, and - registers them in the plugin. */ +/* This function adds a compatible offload image IMAGE to an accelerator device + DEVICE. */ static void -gomp_register_images_for_device (struct gomp_device_descr *device) +gomp_register_image_for_device (struct gomp_device_descr *device, + struct offload_image_descr *image) { - int i; - for (i = 0; i < num_offload_images; i++) + if (!device->offload_regions_registered + && (device->type == image->type + || device->type == OFFLOAD_TARGET_TYPE_HOST)) { - struct offload_image_descr *image = &offload_images[i]; - if (image->type == device->type) - device->register_image_func (image->host_table, image->target_data); + device->register_image_func (image->host_table, image->target_data); + device->offload_regions_registered = true; } } @@ -901,15 +1088,19 @@ gomp_target_init (void) } current_device.type = current_device.get_type_func (); + current_device.name = current_device.get_name_func (); current_device.is_initialized = false; - current_device.dev_splay_tree.root = NULL; - gomp_register_images_for_device (¤t_device); + current_device.offload_regions_registered = false; + current_device.mem_map.splay_tree.root = NULL; + current_device.mem_map.is_initialized = false; + current_device.target_data = NULL; + current_device.openacc.data_environ = NULL; for (i = 0; i < new_num_devices; i++) { current_device.id = num_devices + 1; current_device.target_id = i; devices[num_devices] = current_device; - gomp_mutex_init (&devices[num_devices].dev_env_lock); + gomp_mutex_init (&devices[num_devices].mem_map.lock); num_devices++; } } @@ -920,6 +1111,43 @@ gomp_target_init (void) } while (next); + /* Prefer a device with TARGET_CAP_OPENMP_400 for ICV default-device-var. */ + if (num_devices > 1) + { + int d = gomp_icv (false)->default_device_var; + + if (!(devices[d].capabilities & TARGET_CAP_OPENMP_400)) + { + for (i = 0; i < num_devices; i++) + { + if (devices[i].capabilities & TARGET_CAP_OPENMP_400) + { + struct gomp_device_descr device_tmp = devices[d]; + devices[d] = devices[i]; + devices[d].id = d + 1; + devices[i] = device_tmp; + devices[i].id = i + 1; + + break; + } + } + } + } + + for (i = 0; i < num_devices; i++) + { + int j; + + for (j = 0; j < num_offload_images; j++) + gomp_register_image_for_device (&devices[i], &offload_images[j]); + + /* The 'devices' array can be moved (by the realloc call) until we have + found all the plugins, so registering with the OpenACC runtime (which + takes a copy of the pointer argument) must be delayed until now. */ + if (devices[i].capabilities & TARGET_CAP_OPENACC_200) + ACC_register (&devices[i]); + } + free (offload_images); offload_images = NULL; num_offload_images = 0; diff --git a/libgomp/target.h b/libgomp/target.h new file mode 100644 index 0000000..e69de29 diff --git a/libgomp/testsuite/Makefile.in b/libgomp/testsuite/Makefile.in index 5273eaa..634844c 100644 diff --git a/libgomp/testsuite/libgomp-test-support.exp.in b/libgomp/testsuite/libgomp-test-support.exp.in new file mode 100644 index 0000000..dcadad7 --- /dev/null +++ b/libgomp/testsuite/libgomp-test-support.exp.in @@ -0,0 +1,2 @@ +set cuda_driver_include "@CUDA_DRIVER_INCLUDE@" +set cuda_driver_lib "@CUDA_DRIVER_LIB@"