Message ID | 20151207100645.GA28286@physik.fu-berlin.de |
---|---|
State | New |
Headers | show |
On 07/12/15 10:06, Tobias Burnus wrote: > I wrote: >> I wonder whether using >> >> __asm__ __volatile__ ("":::"memory"); >> >> would be sufficient as it has a way lower overhead than >> __sync_synchronize(). > > Namely, something like the attached patch. > > Regarding the original patch submission: Is there a reason that you didn't > include the test case of Deepak from https://gcc.gnu.org/ml/fortran/2015-04/msg00062.html > It should work as -fcoarray=lib -lcaf_single "dg-do run" test. > > Tobias > I don't know anything about Fortran or coarrays and I'm curious whether this affects architectures with weak memory models. Is the barrier only needed to stop reordering by the compiler or is does it also need to stop reordering by the hardware? Matthew
On Mon, Dec 07, 2015 at 02:09:22PM +0000, Matthew Wahab wrote: > >>I wonder whether using > >>__asm__ __volatile__ ("":::"memory"); > >>would be sufficient as it has a way lower overhead than > >>__sync_synchronize(). > > I don't know anything about Fortran or coarrays and I'm curious > whether this affects architectures with weak memory models. Is the > barrier only needed to stop reordering by the compiler or is does it > also need to stop reordering by the hardware? Short answer: I think no mfence is needed as either the communication is local (to the thread/process) - in which case the hardware will act correctly - or the communication is remote (different thread, process, communication to different computer via interlink [ethernet, infiniband, ...]); and in the later case, the communication library has to deal with it. However, I think that except for SYNC using __asm__ __volatile__ ("":::"memory"); is the wrong solution - even though it might work as band aid. * * * To gether some suggestions, here is how coarrays work. They are based on: everything is a local variable - except it is a coarray. And all communication is explicit - but is often single sideded. That scheme works well (also) with distributed memory and is similar to MPI (Message Passing Interface). Using integer :: var[*] one declares a coarray scalar variable. Accessing it as var = 5 tmp = var acts on the variable on the current 'image' (process). Using var[idx] = 5 ! Set 'var' in image 'idx' to '5' tmp = var[idx] ! Obtain value of 'var' on image 'idx' one accesses that variable on a remote image - except if 'idx' matches the current image; in that case, it acts on the local variable. In one segment (which ends at explicit or implicit synchronizations, e.g. using SYNC ALL): If the value of a is changed - either locally or by a remote image - then only that image is permitted to 'read' the value (or to change it as that would be another possibility for race conditions). Very often, coarray variables are arrays and heavily accessed as local varable in hot loops. But on the other hand, the value of the variable can be changed by external means - up to having hardware support to write into the memory from another computer. Thus, there two cases were the local view might fail: (a) when the variable has been changed in a previous segment by a remote process, e.g. var = 5 ! assue 'var' is a coarray sync all ! end of segment ! ! var is changed by a remote image sync all ! end of segment tmp = var where "var" might or might not have the value 5. Or more fancy without locks and global barriers: type(event_type) :: var_is_set[*] var = 5 event post(have_set_var[remote_idx]) ! Remote waits for event, sets out 'var' to 42 and ! then posts an event that it is set event wait(have_set_var) tmp = var where one tells the remote process that "var = 5" has been set and waits until it has set the local variable "var". (b) when the variable is changed on the same image by two means, e.g. var = 5 var[this_image()] = 42 tmp = var The communication maps to function calls like: __gfortran_caf_send(var_token, idx, 5) // var[idx] = 5 __gfortran_caf_get (tmp, var_token, idx) // tmp = var[idx] [Real commands, see https://gcc.gnu.org/onlinedocs/gfortran/index.html#toc_Coarray-Programming] It is assumed that the library called takes care of the hardware part, i.e. locking, mfence etc. - such that in the compiler itself, one only has to deal with restricting compiler optimizations to the places where it is permitted. A coarray comes into existance via: var = _gfortran_caf_register (size, &var_token); similar to malloc - and var_token identifies the coarray; what the content is, is left to the library. - For the good or worse, _gfortran_caf_register doesn't tell the compiler that it has clobbered "var" as in a way the pointer address escaped. Nor is "var" marked as volatile. Coming back to item (a): After a segment has ended (SYNC ALL, EVENT WAIT or similar), the compiler can no longer assume that coarray variables have the same value. Those variables can be hidden as in call foo() sync all call foo() where 'foo' doesn't know about the 'sync all' and the caller doesn't know whether 'foo' has accesses to coarrays (and whether it accesses them). I think __asm__ __volatile__ ("":::"memory"); should be sufficient in that case, assuming that the communication library takes are of making all changes available (mfence, __sync_synchronize or whatever). The other question is item (b): Here, one has an alias problem within the segment - but the alias only rarely happens. Code of the form var[idx] = ... usually does not access the local memory; it happens only in corner cases like: do i = 1, num_images var[i] = 5 end do which sets 'var' to 5 on all images - including the local image. This will produce code like (full example below): *var = 5 _gfortran_caf_send(var_token, idx, 42); tmp = *var where the frontend knows that _gfortran_caf_send might modify 'var' iff idx matches the local image. The question is only how to tell this the compiler. Using __asm__ __volatile__ ("":::"memory"); seems to be appropriate, even though something more tailored would be nice, something like: __asm__ __volatile__ ("": : "+rm" (*var) : "") but I am not sure whether that's valid and works - additionally, it assumes that one reads "*var" (i.e. that it is defined before), which is not required. (No idea whether -Wundefined would trigger.) I there some better way to express this? Cheers, Tobias PS: Full example: integer :: tmp integer :: var[*] var = 5 var[this_image()] = 42 tmp = var end This will generate a static constructor, run at program startup: _caf_init.1 () { // v-- allocate 4 byte var = _gfortran_caf_register (4, 0, &caf_token.0, 0B, 0B, 0); } and the (main) program code (slightly trimmed): static void * restrict caf_token.0; static integer(kind=4) * restrict var; void _caf_init.1 (void); *var = 4; desc.3.data = 42; _gfortran_caf_send (caf_token.0, 0B /* offset */ var, _gfortran_caf_this_image (0), &desc.2, 0B, &desc.3, 4, 4, 0); __asm__ __volatile__("":::"memory"); // new tmp = *var; The problem is that in that case the compiler does not know that "_gfortran_caf_send (caf_token.0," can modify "*var".
On 08/12/15 09:25, Tobias Burnus wrote: > On Mon, Dec 07, 2015 at 02:09:22PM +0000, Matthew Wahab wrote: >>>> I wonder whether using >>>> __asm__ __volatile__ ("":::"memory"); >>>> would be sufficient as it has a way lower overhead than >>>> __sync_synchronize(). >> >> I don't know anything about Fortran or coarrays and I'm curious >> whether this affects architectures with weak memory models. Is the >> barrier only needed to stop reordering by the compiler or is does it >> also need to stop reordering by the hardware? > > Short answer: I think no mfence is needed as either the communication > is local (to the thread/process) - in which case the hardware will act > correctly - or the communication is remote (different thread, process, > communication to different computer via interlink [ethernet, infiniband, > ...]); and in the later case, the communication library has to deal with > it. Thanks for explaining this, it made things clear. Based on your description, I agree that hardware reordering shouldn't be a problem. > and the (main) program code (slightly trimmed): > > static void * restrict caf_token.0; > static integer(kind=4) * restrict var; > void _caf_init.1 (void); > > *var = 4; > > desc.3.data = 42; > _gfortran_caf_send (caf_token.0, 0B /* offset */ var, > _gfortran_caf_this_image (0), &desc.2, 0B, &desc.3, 4, 4, 0); > __asm__ __volatile__("":::"memory"); // new > tmp = *var; > > The problem is that in that case the compiler does not know that > "_gfortran_caf_send (caf_token.0," can modify "*var". > Is the restrict attribute on var correct? From what you say, it sounds like *var could be accessed through other pointers (assuming restrict has the same meaning as in C). Matthew
trans-intrinsic.c | 18 ++++++++++++++++++ trans-stmt.c | 29 +++++++++++++++++++++++++++++ trans.c | 16 ++++++++++++++++ 3 files changed, 63 insertions(+) diff --git a/gcc/fortran/trans-intrinsic.c b/gcc/fortran/trans-intrinsic.c index 21efe44..04ba3ea 100644 --- a/gcc/fortran/trans-intrinsic.c +++ b/gcc/fortran/trans-intrinsic.c @@ -1222,6 +1222,15 @@ gfc_conv_intrinsic_caf_get (gfc_se *se, gfc_expr *expr, tree lhs, tree lhs_kind, se->expr = res_var; if (array_expr->ts.type == BT_CHARACTER) se->string_length = argse.string_length; + + /* It guarantees memory consistency within the same segment */ + tmp = gfc_build_string_const (strlen ("memory")+1, "memory"), + tmp = build5_loc (input_location, ASM_EXPR, void_type_node, + gfc_build_string_const (1, ""), NULL_TREE, NULL_TREE, + tree_cons (NULL_TREE, tmp, NULL_TREE), NULL_TREE); + ASM_VOLATILE_P (tmp) = 1; + gfc_add_expr_to_block (&se->pre, tmp); + } @@ -1390,6 +1399,15 @@ conv_caf_send (gfc_code *code) { gfc_add_expr_to_block (&block, tmp); gfc_add_block_to_block (&block, &lhs_se.post); gfc_add_block_to_block (&block, &rhs_se.post); + + /* It guarantees memory consistency within the same segment */ + tmp = gfc_build_string_const (strlen ("memory")+1, "memory"), + tmp = build5_loc (input_location, ASM_EXPR, void_type_node, + gfc_build_string_const (1, ""), NULL_TREE, NULL_TREE, + tree_cons (NULL_TREE, tmp, NULL_TREE), NULL_TREE); + ASM_VOLATILE_P (tmp) = 1; + gfc_add_expr_to_block (&block, tmp); + return gfc_finish_block (&block); } diff --git a/gcc/fortran/trans-stmt.c b/gcc/fortran/trans-stmt.c index 3df483a..b7e1faa 100644 --- a/gcc/fortran/trans-stmt.c +++ b/gcc/fortran/trans-stmt.c @@ -818,6 +818,15 @@ gfc_trans_lock_unlock (gfc_code *code, gfc_exec_op op) errmsg, errmsg_len); gfc_add_expr_to_block (&se.pre, tmp); + /* It guarantees memory consistency within the same segment */ + tmp = gfc_build_string_const (strlen ("memory")+1, "memory"), + tmp = build5_loc (input_location, ASM_EXPR, void_type_node, + gfc_build_string_const (1, ""), NULL_TREE, NULL_TREE, + tree_cons (NULL_TREE, tmp, NULL_TREE), NULL_TREE); + ASM_VOLATILE_P (tmp) = 1; + + gfc_add_expr_to_block (&se.pre, tmp); + if (stat2 != NULL_TREE) gfc_add_modify (&se.pre, stat2, fold_convert (TREE_TYPE (stat2), stat)); @@ -995,6 +1004,14 @@ gfc_trans_event_post_wait (gfc_code *code, gfc_exec_op op) errmsg, errmsg_len); gfc_add_expr_to_block (&se.pre, tmp); + /* It guarantees memory consistency within the same segment */ + tmp = gfc_build_string_const (strlen ("memory")+1, "memory"), + tmp = build5_loc (input_location, ASM_EXPR, void_type_node, + gfc_build_string_const (1, ""), NULL_TREE, NULL_TREE, + tree_cons (NULL_TREE, tmp, NULL_TREE), NULL_TREE); + ASM_VOLATILE_P (tmp) = 1; + gfc_add_expr_to_block (&se.pre, tmp); + if (stat2 != NULL_TREE) gfc_add_modify (&se.pre, stat2, fold_convert (TREE_TYPE (stat2), stat)); @@ -1080,6 +1097,18 @@ gfc_trans_sync (gfc_code *code, gfc_exec_op type) fold_convert (integer_type_node, images)); } + /* Per F2008, 8.5.1, a SYNC MEMORY is implied by calling the + image control statements SYNC IMAGES and SYNC ALL. */ + if (flag_coarray == GFC_FCOARRAY_LIB) + { + tmp = gfc_build_string_const (strlen ("memory")+1, "memory"), + tmp = build5_loc (input_location, ASM_EXPR, void_type_node, + gfc_build_string_const (1, ""), NULL_TREE, NULL_TREE, + tree_cons (NULL_TREE, tmp, NULL_TREE), NULL_TREE); + ASM_VOLATILE_P (tmp) = 1; + gfc_add_expr_to_block (&se.pre, tmp); + } + if (flag_coarray != GFC_FCOARRAY_LIB) { /* Set STAT to zero. */ diff --git a/gcc/fortran/trans.c b/gcc/fortran/trans.c index 001db41..1993743 100644 --- a/gcc/fortran/trans.c +++ b/gcc/fortran/trans.c @@ -746,6 +746,14 @@ gfc_allocate_using_lib (stmtblock_t * block, tree pointer, tree size, TREE_TYPE (pointer), pointer, fold_convert ( TREE_TYPE (pointer), tmp)); gfc_add_expr_to_block (block, tmp); + + /* It guarantees memory consistency within the same segment */ + tmp = gfc_build_string_const (strlen ("memory")+1, "memory"), + tmp = build5_loc (input_location, ASM_EXPR, void_type_node, + gfc_build_string_const (1, ""), NULL_TREE, NULL_TREE, + tree_cons (NULL_TREE, tmp, NULL_TREE), NULL_TREE); + ASM_VOLATILE_P (tmp) = 1; + gfc_add_expr_to_block (block, tmp); } @@ -1356,6 +1364,14 @@ gfc_deallocate_with_status (tree pointer, tree status, tree errmsg, token, pstat, errmsg, errlen); gfc_add_expr_to_block (&non_null, tmp); + /* It guarantees memory consistency within the same segment */ + tmp = gfc_build_string_const (strlen ("memory")+1, "memory"), + tmp = build5_loc (input_location, ASM_EXPR, void_type_node, + gfc_build_string_const (1, ""), NULL_TREE, NULL_TREE, + tree_cons (NULL_TREE, tmp, NULL_TREE), NULL_TREE); + ASM_VOLATILE_P (tmp) = 1; + gfc_add_expr_to_block (&non_null, tmp); + if (status != NULL_TREE) { tree stat = build_fold_indirect_ref_loc (input_location, status);