Message ID | e8e22ccf-2f25-1a8a-a599-b4a0b245c14d@codesourcery.com |
---|---|
State | New |
Headers | show |
Series | [NVPTX] Fix PR83920 | expand |
On 01/18/2018 12:40 AM, Cesar Philippidis wrote: > In PR83920, I encountered a nvptx bug where live predicate variables > were clobbered before their value was broadcasted. Hi, I've managed to reproduce the problem based on the description in the PR. > Apparently, there > were problems in certain version of the CUDA driver where the JIT would > generate wrong code for shfl broadcasts. Correct. And there's a work around committed for the JIT problem, which you refer to in the next line (without introducing it first). > The attached patch teaches > nvptx_single not to apply that workaround if the predicate register is live. > > Tom, does this patch look sane to you? The fact that the cond register is live at the start of the from bb does not mean that the register can't be set inside the bb. Furthermore, the live info does not make a distinction between live-for-lane-0 and line-for-warp. So, if the condition reg is not set in the bb, but set only for lane-0 in a previous bb, then we still need to initialize lanes 1-31. So, I don't think this is the way to address this bug. > I'm not sure if it defeats the > purpose of your original patch. In test cases mentioned above, it does. > Regardless, the live predicate registers > shouldn't be clobbered before they are used. > There is a bug in the workaround, that's correct. I think the way to address it is using a tmp .pred reg like so: ... { .reg .u32 %x; mov.u32 %x,%tid.x; setp.ne.u32 %rnotvzero,%x,0; } { .reg .pred %rcond2; setp.eq.u32 %rcond2, 1, 0; // workaround @%rnotvzero bra Lskip; ... setp.<op>.<type> %rcond,op1,op2; // could be here, could be earlier mov.b1 %rcond2, %rcond; // used pseudo opcode mov.b1 for convenience Lskip: selp.u32 %rcondu32,1,0,%rcond2; shfl.idx.b32 %rcondu32,%rcondu32,0,31; setp.ne.u32 %rcond,%rcondu32,0; } ... > Unfortunately, I cannot reproduce the runtime failure with gemm example > in the PR, so I didn't include it in the patch. I'm managed to modify the test-case such that it reproduces the failure with trunk (see PR). So, that test-case should be included. Thanks, - Tom > However, this patch does > fix the failure with da-1.c in og7. > This patch does not cause any > regressions. > > Is it OK for trunk? > > Thanks, > Cesar > > > nvptx-jit-relax.diff > > > diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c > index 55c7e3c..698c574 100644 > --- a/gcc/config/nvptx/nvptx.c > +++ b/gcc/config/nvptx/nvptx.c > @@ -3957,6 +3957,7 @@ bb_first_real_insn (basic_block bb) > static void > nvptx_single (unsigned mask, basic_block from, basic_block to) > { > + bitmap live = DF_LIVE_IN (from); > rtx_insn *head = BB_HEAD (from); > rtx_insn *tail = BB_END (to); > unsigned skip_mask = mask; > @@ -4126,8 +4127,9 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) > There is nothing in the PTX spec to suggest that this is wrong, or > to explain why the extra initialization is needed. So, we classify > it as a JIT bug, and the extra initialization as workaround. */ > - emit_insn_before (gen_movbi (pvar, const0_rtx), > - bb_first_real_insn (from)); > + if (!bitmap_bit_p (live, REGNO (pvar))) > + emit_insn_before (gen_movbi (pvar, const0_rtx), > + bb_first_real_insn (from)); > #endif > emit_insn_before (nvptx_gen_vcast (pvar), tail); > } >
On 01/18/2018 02:27 PM, Tom de Vries wrote: > On 01/18/2018 12:40 AM, Cesar Philippidis wrote: >> In PR83920, I encountered a nvptx bug where live predicate variables >> were clobbered before their value was broadcasted. > > Hi, > > I've managed to reproduce the problem based on the description in the PR. > I think the way to address it is using a tmp .pred reg like so: > ... > { > .reg .u32 %x; > mov.u32 %x,%tid.x; > setp.ne.u32 %rnotvzero,%x,0; > } > > { > .reg .pred %rcond2; > setp.eq.u32 %rcond2, 1, 0; // workaround > > @%rnotvzero bra Lskip; > ... > setp.<op>.<type> %rcond,op1,op2; // could be here, could be earlier > mov.b1 %rcond2, %rcond; // used pseudo opcode mov.b1 for convenience > Lskip: > selp.u32 %rcondu32,1,0,%rcond2; > shfl.idx.b32 %rcondu32,%rcondu32,0,31; > setp.ne.u32 %rcond,%rcondu32,0; > } > ... > Hi, this is the fix that I plan to commit (similar to the scheme listed above, but modified to keep the selp.u32 using rcond, which is easier in code generation). Build and reg-tested on x86_64 with nvptx accelerator. Richard, this is an 8 regression for the nvptx target. OK for stage 4 or defer to stage1? Thanks, - Tom [nvptx] Fix bug in jit bug workaround 2018-01-19 Tom de Vries <tom@codesourcery.com> Cesar Philippidis <cesar@codesourcery.com> PR target/83920 * config/nvptx/nvptx.c (nvptx_single): Fix jit workaround. * testsuite/libgomp.oacc-c-c++-common/pr83920.c: New test. * testsuite/libgomp.oacc-fortran/pr83920.f90: New test. --- gcc/config/nvptx/nvptx.c | 28 +++++++++++++++++-- .../testsuite/libgomp.oacc-c-c++-common/pr83920.c | 32 ++++++++++++++++++++++ libgomp/testsuite/libgomp.oacc-fortran/pr83920.f90 | 28 +++++++++++++++++++ 3 files changed, 86 insertions(+), 2 deletions(-) diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 86fc13f4fc0..afb0e4dd185 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -4096,9 +4096,33 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) There is nothing in the PTX spec to suggest that this is wrong, or to explain why the extra initialization is needed. So, we classify - it as a JIT bug, and the extra initialization as workaround. */ - emit_insn_before (gen_movbi (pvar, const0_rtx), + it as a JIT bug, and the extra initialization as workaround: + + { + .reg .u32 %x; + mov.u32 %x,%tid.x; + setp.ne.u32 %rnotvzero,%x,0; + } + + +.reg .pred %rcond2; + +setp.eq.u32 %rcond2, 1, 0; + + @%rnotvzero bra Lskip; + setp.<op>.<type> %rcond,op1,op2; + +mov.pred %rcond2, %rcond; + Lskip: + +mov.pred %rcond, %rcond2; + selp.u32 %rcondu32,1,0,%rcond; + shfl.idx.b32 %rcondu32,%rcondu32,0,31; + setp.ne.u32 %rcond,%rcondu32,0; + */ + rtx_insn *label = PREV_INSN (tail); + gcc_assert (label && LABEL_P (label)); + rtx tmp = gen_reg_rtx (BImode); + emit_insn_before (gen_movbi (tmp, const0_rtx), bb_first_real_insn (from)); + emit_insn_before (gen_rtx_SET (tmp, pvar), label); + emit_insn_before (gen_rtx_SET (pvar, tmp), tail); #endif emit_insn_before (nvptx_gen_vcast (pvar), tail); } diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr83920.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr83920.c new file mode 100644 index 00000000000..6cd3b5d6f06 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr83920.c @@ -0,0 +1,32 @@ +/* { dg-do run } */ + +#include <stdlib.h> + +#define n 10 + +static void __attribute__((noinline)) __attribute__((noclone)) +foo (int beta, int *c) +{ + #pragma acc parallel copy(c[0:(n * n) - 1]) num_gangs(2) + #pragma acc loop gang + for (int j = 0; j < n; ++j) + if (beta != 1) + { + #pragma acc loop vector + for (int i = 0; i < n; ++i) + c[i + (j * n)] = 0; + } +} + +int +main (void) +{ + int c[n * n]; + + c[0] = 1; + foo (0, c); + if (c[0] != 0) + abort (); + + return 0; +} diff --git a/libgomp/testsuite/libgomp.oacc-fortran/pr83920.f90 b/libgomp/testsuite/libgomp.oacc-fortran/pr83920.f90 new file mode 100644 index 00000000000..34ad001abcd --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-fortran/pr83920.f90 @@ -0,0 +1,28 @@ +! { dg-do run } + +subroutine foo (BETA, C) + real :: C(100,100) + integer :: i, j, l + real, parameter :: one = 1.0 + real :: beta + + !$acc parallel copy(c(1:100,1:100)) num_gangs(2) + !$acc loop gang + do j = 1, 100 + if (beta /= one) then + !$acc loop vector + do i = 1, 100 + C(i,j) = 0.0 + end do + end if + end do + !$acc end parallel +end subroutine foo + +program test_foo + real :: c(100,100), beta + beta = 0.0 + c(:,:) = 1.0 + call foo (beta, c) + if (c(1,1) /= 0.0) call abort () +end program test_foo
On January 19, 2018 3:15:45 PM GMT+01:00, Tom de Vries <Tom_deVries@mentor.com> wrote: >On 01/18/2018 02:27 PM, Tom de Vries wrote: >> On 01/18/2018 12:40 AM, Cesar Philippidis wrote: >>> In PR83920, I encountered a nvptx bug where live predicate variables >>> were clobbered before their value was broadcasted. >> >> Hi, >> >> I've managed to reproduce the problem based on the description in the >PR. > >> I think the way to address it is using a tmp .pred reg like so: >> ... >> { >> .reg .u32 %x; >> mov.u32 %x,%tid.x; >> setp.ne.u32 %rnotvzero,%x,0; >> } >> >> { >> .reg .pred %rcond2; >> setp.eq.u32 %rcond2, 1, 0; // workaround >> >> @%rnotvzero bra Lskip; >> ... >> setp.<op>.<type> %rcond,op1,op2; // could be here, could be >earlier >> mov.b1 %rcond2, %rcond; // used pseudo opcode mov.b1 for >convenience >> Lskip: >> selp.u32 %rcondu32,1,0,%rcond2; >> shfl.idx.b32 %rcondu32,%rcondu32,0,31; >> setp.ne.u32 %rcond,%rcondu32,0; >> } >> ... >> > >Hi, > >this is the fix that I plan to commit (similar to the scheme listed >above, but modified to keep the selp.u32 using rcond, which is easier >in >code generation). > >Build and reg-tested on x86_64 with nvptx accelerator. > >Richard, this is an 8 regression for the nvptx target. OK for stage 4 >or >defer to stage1? OK for stage 4. Richard. >Thanks, >- Tom
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 55c7e3c..698c574 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -3957,6 +3957,7 @@ bb_first_real_insn (basic_block bb) static void nvptx_single (unsigned mask, basic_block from, basic_block to) { + bitmap live = DF_LIVE_IN (from); rtx_insn *head = BB_HEAD (from); rtx_insn *tail = BB_END (to); unsigned skip_mask = mask; @@ -4126,8 +4127,9 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) There is nothing in the PTX spec to suggest that this is wrong, or to explain why the extra initialization is needed. So, we classify it as a JIT bug, and the extra initialization as workaround. */ - emit_insn_before (gen_movbi (pvar, const0_rtx), - bb_first_real_insn (from)); + if (!bitmap_bit_p (live, REGNO (pvar))) + emit_insn_before (gen_movbi (pvar, const0_rtx), + bb_first_real_insn (from)); #endif emit_insn_before (nvptx_gen_vcast (pvar), tail); }