Message ID | 20120913155258.GF22619@tucnak.redhat.com |
---|---|
State | New |
Headers | show |
On Thu, Sep 13, 2012 at 5:52 PM, Jakub Jelinek <jakub@redhat.com> wrote: > The fma-*.c testcase show that these intrinsics probably mean to preserve > the high elements (other than the lowest) of the first argument of the > fmaintrin.h *_s{s,d} intrinsics in the destination (the HW insn preserve > there the destination register, but that varies - for 132 and 213 it is the > first one (but the negation performed for _mm_fnm*_s[sd] breaks it anyway), > for 231 it is the last one). What the expander did was to put there > an uninitialized pseudo, so we ended up with pretty random content, before > H.J's http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190492 it happened > to work by accident, but when things changed slightly and reload chose > different alternative, this broke. > > The following patch fixes it, by tweaking the header so that the first > argument is not negated (we negate the second one instead), as we don't want > to negate the high elements if e.g. for whatever reason combiner doesn't > match it. It fixes the expander to use a dup of the X operand as the high > element provider for the pattern, removes the 231 alternatives (because > those provide different destination high elements) and removes commutative > marker (again, that would mean different high elements). Can we introduce additional "*fmai_fmadd_<mode>_1" pattern (and others) that would cover missing 231 alternative? > 2012-09-13 Jakub Jelinek <jakub@redhat.com> > > PR target/54564 > * config/i386/sse.md (fmai_vmfmadd_<mode>): Use (match_dup 1) > instead of (match_dup 0) as second argument to vec_merge. > (*fmai_fmadd_<mode>, *fmai_fmsub_<mode>): Likewise. > Remove third alternative. > (*fmai_fnmadd_<mode>, *fmai_fnmsub_<mode>): Likewise. Negate > operand 2 instead of operand 1, but put it as first argument > of fma. > > * config/i386/fmaintrin.h (_mm_fnmadd_sd, _mm_fnmadd_ss, > _mm_fnmsub_sd, _mm_fnmsub_ss): Negate the second argument instead > of the first. OK, but header change should be also reviewed by H.J. Thanks, Uros.
On Thu, Sep 13, 2012 at 07:42:17PM +0200, Uros Bizjak wrote: > Can we introduce additional "*fmai_fmadd_<mode>_1" pattern (and > others) that would cover missing 231 alternative? Sure. Will cook up a patch soon. > > 2012-09-13 Jakub Jelinek <jakub@redhat.com> > > > > PR target/54564 > > * config/i386/sse.md (fmai_vmfmadd_<mode>): Use (match_dup 1) > > instead of (match_dup 0) as second argument to vec_merge. > > (*fmai_fmadd_<mode>, *fmai_fmsub_<mode>): Likewise. > > Remove third alternative. > > (*fmai_fnmadd_<mode>, *fmai_fnmsub_<mode>): Likewise. Negate > > operand 2 instead of operand 1, but put it as first argument > > of fma. > > > > * config/i386/fmaintrin.h (_mm_fnmadd_sd, _mm_fnmadd_ss, > > _mm_fnmsub_sd, _mm_fnmsub_ss): Negate the second argument instead > > of the first. > > OK, but header change should be also reviewed by H.J. H.J., are you ok with this? Jakub
On 09/13/2012 08:52 AM, Jakub Jelinek wrote: > The following patch fixes it, by tweaking the header so that the first > argument is not negated (we negate the second one instead), as we don't want > to negate the high elements if e.g. for whatever reason combiner doesn't > match it. It fixes the expander to use a dup of the X operand as the high > element provider for the pattern, removes the 231 alternatives (because > those provide different destination high elements) and removes commutative > marker (again, that would mean different high elements). I don't think this is the best way to fix this up. (1) Negating the second argument is arguably non-canonical rtl. (2) It's not the best match if we were to extend these builtins to FMA4. There we really do have 4 inputs. Thus (define_insn "*fmai_fmadd_<mode>_4" [(set (match_operand:VF_128 0 "register_operand" "=x,x") (vec_merge:VF_128 (fma:VF_128 (match_operand:VF_128 1 "nonimmediate_operand" "%x,x") (match_operand:VF_128 2 "nonimmediate_operand" " x,m") (match_operand:VF_128 3 "nonimmediate_operand" "xm,x")) (match_operand:VF_128 4 "register_operand" "0,0") (const_int 1)))] "TARGET_FMA4" "vfmadd<ssescalarmodesuffix>\t{%3, %2, %1, %0|%0, %1, %2, %3}" [(set_attr "type" "ssemuladd") (set_attr "mode" "<MODE>")]) which we can just as easily do by passing the appropriate input to the expander. It would be nice if Intel cleaned up their documentation for the builtin, explicitly saying which high bits to copy. I agree that the testcase is probably as normative as we'll get though. r~
On 09/13/2012 10:42 AM, Uros Bizjak wrote: > Can we introduce additional "*fmai_fmadd_<mode>_1" pattern (and > others) that would cover missing 231 alternative? I really don't think that's necessary. For that you'd need to be computing fma(x, y, x), at which point vfmadd213 matches as well. r~
On Thu, Sep 13, 2012 at 11:25:42AM -0700, Richard Henderson wrote: > On 09/13/2012 08:52 AM, Jakub Jelinek wrote: > > The following patch fixes it, by tweaking the header so that the first > > argument is not negated (we negate the second one instead), as we don't want > > to negate the high elements if e.g. for whatever reason combiner doesn't > > match it. It fixes the expander to use a dup of the X operand as the high > > element provider for the pattern, removes the 231 alternatives (because > > those provide different destination high elements) and removes commutative > > marker (again, that would mean different high elements). > > I don't think this is the best way to fix this up. > > (1) Negating the second argument is arguably non-canonical rtl. That is why I've put in the *fmai_fnm{add,sub}_<mode> patterns operands 2 with the neg as first operand of the FMA rtl. That way it is canonical (otherwise it didn't match in combine). The FMA rtl operand order doesn't need to imply the order of instruction operands. The reason why I don't want to negate in fmaintrin.h the first operand is that the higher elements of it shouldn't be negated. So, when the second argument to the builtin is negated, combiner substitutes ... (fma (reg1) (neg (reg2) (reg3))) (reg1) ..., then canonicalizes to ... (fma (neg (reg2) (reg1) (reg3))) (reg1) ... and it matches that way. > (2) It's not the best match if we were to extend these builtins to FMA4. > There we really do have 4 inputs. Thus The first thing I've tried was for the *fmai_fmadd_<mode> patterns to use match_operand:VF_128 4 instead of the match_dup I'm using now. But that way I've ended up with: (define_insn "*fmai_fmadd_<mode>" [(set (match_operand:VF_128 0 "register_operand" "=x,x") (vec_merge:VF_128 (fma:VF_128 (match_operand:VF_128 1 "nonimmediate_operand" " 0, 0") (match_operand:VF_128 2 "nonimmediate_operand" "xm, x") (match_operand:VF_128 3 "nonimmediate_operand" " x,xm")) (match_operand:VF_128 4 "nonimmediate_operand" " 0, 0") (const_int 1)))] "TARGET_FMA" "@ vfmadd132<ssescalarmodesuffix>\t{%2, %3, %0|%0, %3, %2} vfmadd213<ssescalarmodesuffix>\t{%3, %2, %0|%0, %2, %3}" [(set_attr "type" "ssemuladd") (set_attr "mode" "<MODE>")]) which was apparently too much for reload (supposedly the two "0" constraint operands, even when the expander used (match_dup 1)). > It would be nice if Intel cleaned up their documentation for the > builtin, explicitly saying which high bits to copy. I agree that > the testcase is probably as normative as we'll get though. Usually the docs just document instructions and list intrinsics. When there is non-obvious match between the intrinsics and instructions, we have a problem of insufficiently documented semantics. Jakub
On Thu, Sep 13, 2012 at 11:28:11AM -0700, Richard Henderson wrote: > On 09/13/2012 10:42 AM, Uros Bizjak wrote: > > Can we introduce additional "*fmai_fmadd_<mode>_1" pattern (and > > others) that would cover missing 231 alternative? > > I really don't think that's necessary. For that you'd need to > be computing fma(x, y, x), at which point vfmadd213 matches as well. I agree. I've posted the patch just because Uros requested it. Jakub
On Thu, Sep 13, 2012 at 11:25:42AM -0700, Richard Henderson wrote: > (2) It's not the best match if we were to extend these builtins to FMA4. > There we really do have 4 inputs. Thus How could you extend these builtins to FMA4 BTW? Doesn't FMA4 zero up the high elements? In that case you'd need to expand it as copy of the X operand register to DEST, doing vfmadd{ss,sd} to a temp register and followed by vmovss/vmovsd instruction. > (define_insn "*fmai_fmadd_<mode>_4" > [(set (match_operand:VF_128 0 "register_operand" "=x,x") > (vec_merge:VF_128 > (fma:VF_128 > (match_operand:VF_128 1 "nonimmediate_operand" "%x,x") > (match_operand:VF_128 2 "nonimmediate_operand" " x,m") > (match_operand:VF_128 3 "nonimmediate_operand" "xm,x")) > (match_operand:VF_128 4 "register_operand" "0,0") > (const_int 1)))] > "TARGET_FMA4" > "vfmadd<ssescalarmodesuffix>\t{%3, %2, %1, %0|%0, %1, %2, %3}" > [(set_attr "type" "ssemuladd") > (set_attr "mode" "<MODE>")]) Jakub
On 09/13/2012 11:49 AM, Jakub Jelinek wrote: > On Thu, Sep 13, 2012 at 11:25:42AM -0700, Richard Henderson wrote: >> (1) Negating the second argument is arguably non-canonical rtl. > > That is why I've put in the *fmai_fnm{add,sub}_<mode> patterns > operands 2 with the neg as first operand of the FMA rtl. That way it is > canonical (otherwise it didn't match in combine). The FMA rtl operand > order doesn't need to imply the order of instruction operands. Sorry, I didn't read the unidiff properly. > (fma:VF_128 > (match_operand:VF_128 1 "nonimmediate_operand" " 0, 0") > (match_operand:VF_128 2 "nonimmediate_operand" "xm, x") > (match_operand:VF_128 3 "nonimmediate_operand" " x,xm")) > (match_operand:VF_128 4 "nonimmediate_operand" " 0, 0") ... > which was apparently too much for reload (supposedly the two "0" constraint > operands, even when the expander used (match_dup 1)). Yes. We'd have to have two different patterns to "properly" support fma4. Though I suppose now that I think about it this is extremely similar to the vfmadd231 case, in that in order to want to generate vfmaddss %xmm3, %xmm2, %xmm1, %xmm0 given the semantics of the builtin we'd have had to emit a copy of %xmm1 or %xmm2 into %xmm0 anyway. So we might as well not support this and just do (define_insn "*fmai_fmadd_<mode>" [(set (match_operand:VF_128 0 "register_operand" "=x,x,x,x") (vec_merge:VF_128 (fma:VF_128 (match_operand:VF_128 1 "nonimmediate_operand" "%0, 0, 0,0") (match_operand:VF_128 2 "nonimmediate_operand" "xm, x, x,m") (match_operand:VF_128 3 "nonimmediate_operand" " x,xm,xm,x")) (match_dup 0) (const_int 1)))] "TARGET_FMA || TARGET_FMA4" "@ vfmadd132<ssescalarmodesuffix>\t{%2, %3, %0|%0, %3, %2} vfmadd213<ssescalarmodesuffix>\t{%3, %2, %0|%0, %2, %3} vfmadd<ssescalarmodesuffix>\t{%3, %2, %1, %0|%0, %1, %2, %3} vfmadd<ssescalarmodesuffix>\t{%3, %2, %1, %0|%0, %1, %2, %3}" [(set_attr "isa" "fma,fma,fma4,fma4") (set_attr "type" "ssemuladd") (set_attr "mode" "<MODE>")]) r~
On 09/13/2012 12:03 PM, Jakub Jelinek wrote: > How could you extend these builtins to FMA4 BTW? Doesn't FMA4 zero up the > high elements? Duh. You're absolutely right. I'd mis-read the document as clearing the high lane of the %ymm register only. r~
On Thu, Sep 13, 2012 at 10:42 AM, Uros Bizjak <ubizjak@gmail.com> wrote: > On Thu, Sep 13, 2012 at 5:52 PM, Jakub Jelinek <jakub@redhat.com> wrote: > >> The fma-*.c testcase show that these intrinsics probably mean to preserve >> the high elements (other than the lowest) of the first argument of the >> fmaintrin.h *_s{s,d} intrinsics in the destination (the HW insn preserve >> there the destination register, but that varies - for 132 and 213 it is the >> first one (but the negation performed for _mm_fnm*_s[sd] breaks it anyway), >> for 231 it is the last one). What the expander did was to put there >> an uninitialized pseudo, so we ended up with pretty random content, before >> H.J's http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190492 it happened >> to work by accident, but when things changed slightly and reload chose >> different alternative, this broke. >> >> The following patch fixes it, by tweaking the header so that the first >> argument is not negated (we negate the second one instead), as we don't want >> to negate the high elements if e.g. for whatever reason combiner doesn't >> match it. It fixes the expander to use a dup of the X operand as the high >> element provider for the pattern, removes the 231 alternatives (because >> those provide different destination high elements) and removes commutative >> marker (again, that would mean different high elements). > > Can we introduce additional "*fmai_fmadd_<mode>_1" pattern (and > others) that would cover missing 231 alternative? > >> 2012-09-13 Jakub Jelinek <jakub@redhat.com> >> >> PR target/54564 >> * config/i386/sse.md (fmai_vmfmadd_<mode>): Use (match_dup 1) >> instead of (match_dup 0) as second argument to vec_merge. >> (*fmai_fmadd_<mode>, *fmai_fmsub_<mode>): Likewise. >> Remove third alternative. >> (*fmai_fnmadd_<mode>, *fmai_fnmsub_<mode>): Likewise. Negate >> operand 2 instead of operand 1, but put it as first argument >> of fma. >> >> * config/i386/fmaintrin.h (_mm_fnmadd_sd, _mm_fnmadd_ss, >> _mm_fnmsub_sd, _mm_fnmsub_ss): Negate the second argument instead >> of the first. > > OK, but header change should be also reviewed by H.J. > It looks OK to me. Thanks.
--- gcc/config/i386/sse.md.jj 2012-09-05 18:27:03.000000000 +0200 +++ gcc/config/i386/sse.md 2012-09-13 13:49:49.504968716 +0200 @@ -2072,79 +2072,75 @@ (define_expand "fmai_vmfmadd_<mode>" (match_operand:VF_128 1 "nonimmediate_operand") (match_operand:VF_128 2 "nonimmediate_operand") (match_operand:VF_128 3 "nonimmediate_operand")) - (match_dup 0) + (match_dup 1) (const_int 1)))] "TARGET_FMA") (define_insn "*fmai_fmadd_<mode>" - [(set (match_operand:VF_128 0 "register_operand" "=x,x,x") + [(set (match_operand:VF_128 0 "register_operand" "=x,x") (vec_merge:VF_128 (fma:VF_128 - (match_operand:VF_128 1 "nonimmediate_operand" "%0, 0,x") - (match_operand:VF_128 2 "nonimmediate_operand" "xm, x,xm") - (match_operand:VF_128 3 "nonimmediate_operand" " x,xm,0")) - (match_dup 0) + (match_operand:VF_128 1 "nonimmediate_operand" " 0, 0") + (match_operand:VF_128 2 "nonimmediate_operand" "xm, x") + (match_operand:VF_128 3 "nonimmediate_operand" " x,xm")) + (match_dup 1) (const_int 1)))] "TARGET_FMA" "@ vfmadd132<ssescalarmodesuffix>\t{%2, %3, %0|%0, %3, %2} - vfmadd213<ssescalarmodesuffix>\t{%3, %2, %0|%0, %2, %3} - vfmadd231<ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}" + vfmadd213<ssescalarmodesuffix>\t{%3, %2, %0|%0, %2, %3}" [(set_attr "type" "ssemuladd") (set_attr "mode" "<MODE>")]) (define_insn "*fmai_fmsub_<mode>" - [(set (match_operand:VF_128 0 "register_operand" "=x,x,x") + [(set (match_operand:VF_128 0 "register_operand" "=x,x") (vec_merge:VF_128 (fma:VF_128 - (match_operand:VF_128 1 "nonimmediate_operand" "%0, 0,x") - (match_operand:VF_128 2 "nonimmediate_operand" "xm, x,xm") + (match_operand:VF_128 1 "nonimmediate_operand" " 0, 0") + (match_operand:VF_128 2 "nonimmediate_operand" "xm, x") (neg:VF_128 - (match_operand:VF_128 3 "nonimmediate_operand" " x,xm,0"))) - (match_dup 0) + (match_operand:VF_128 3 "nonimmediate_operand" " x,xm"))) + (match_dup 1) (const_int 1)))] "TARGET_FMA" "@ vfmsub132<ssescalarmodesuffix>\t{%2, %3, %0|%0, %3, %2} - vfmsub213<ssescalarmodesuffix>\t{%3, %2, %0|%0, %2, %3} - vfmsub231<ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}" + vfmsub213<ssescalarmodesuffix>\t{%3, %2, %0|%0, %2, %3}" [(set_attr "type" "ssemuladd") (set_attr "mode" "<MODE>")]) (define_insn "*fmai_fnmadd_<mode>" - [(set (match_operand:VF_128 0 "register_operand" "=x,x,x") + [(set (match_operand:VF_128 0 "register_operand" "=x,x") (vec_merge:VF_128 (fma:VF_128 (neg:VF_128 - (match_operand:VF_128 1 "nonimmediate_operand" "%0, 0,x")) - (match_operand:VF_128 2 "nonimmediate_operand" "xm, x,xm") - (match_operand:VF_128 3 "nonimmediate_operand" " x,xm,0")) - (match_dup 0) + (match_operand:VF_128 2 "nonimmediate_operand" "xm, x")) + (match_operand:VF_128 1 "nonimmediate_operand" " 0, 0") + (match_operand:VF_128 3 "nonimmediate_operand" " x,xm")) + (match_dup 1) (const_int 1)))] "TARGET_FMA" "@ vfnmadd132<ssescalarmodesuffix>\t{%2, %3, %0|%0, %3, %2} - vfnmadd213<ssescalarmodesuffix>\t{%3, %2, %0|%0, %2, %3} - vfnmadd231<ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}" + vfnmadd213<ssescalarmodesuffix>\t{%3, %2, %0|%0, %2, %3}" [(set_attr "type" "ssemuladd") (set_attr "mode" "<MODE>")]) (define_insn "*fmai_fnmsub_<mode>" - [(set (match_operand:VF_128 0 "register_operand" "=x,x,x") + [(set (match_operand:VF_128 0 "register_operand" "=x,x") (vec_merge:VF_128 (fma:VF_128 (neg:VF_128 - (match_operand:VF_128 1 "nonimmediate_operand" "%0, 0,x")) - (match_operand:VF_128 2 "nonimmediate_operand" "xm, x,xm") + (match_operand:VF_128 2 "nonimmediate_operand" "xm, x")) + (match_operand:VF_128 1 "nonimmediate_operand" " 0, 0") (neg:VF_128 - (match_operand:VF_128 3 "nonimmediate_operand" " x,xm,0"))) - (match_dup 0) + (match_operand:VF_128 3 "nonimmediate_operand" " x,xm"))) + (match_dup 1) (const_int 1)))] "TARGET_FMA" "@ vfnmsub132<ssescalarmodesuffix>\t{%2, %3, %0|%0, %3, %2} - vfnmsub213<ssescalarmodesuffix>\t{%3, %2, %0|%0, %2, %3} - vfnmsub231<ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}" + vfnmsub213<ssescalarmodesuffix>\t{%3, %2, %0|%0, %2, %3}" [(set_attr "type" "ssemuladd") (set_attr "mode" "<MODE>")]) --- gcc/config/i386/fmaintrin.h.jj 2011-09-02 16:29:38.000000000 +0200 +++ gcc/config/i386/fmaintrin.h 2012-09-13 13:32:20.162333244 +0200 @@ -1,4 +1,4 @@ -/* Copyright (C) 2011 Free Software Foundation, Inc. +/* Copyright (C) 2011, 2012 Free Software Foundation, Inc. This file is part of GCC. @@ -164,7 +164,7 @@ extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_fnmadd_sd (__m128d __A, __m128d __B, __m128d __C) { - return (__m128d)__builtin_ia32_vfmaddsd3 (-(__v2df)__A, (__v2df)__B, + return (__m128d)__builtin_ia32_vfmaddsd3 ((__v2df)__A, -(__v2df)__B, (__v2df)__C); } @@ -172,7 +172,7 @@ extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_fnmadd_ss (__m128 __A, __m128 __B, __m128 __C) { - return (__m128)__builtin_ia32_vfmaddss3 (-(__v4sf)__A, (__v4sf)__B, + return (__m128)__builtin_ia32_vfmaddss3 ((__v4sf)__A, -(__v4sf)__B, (__v4sf)__C); } @@ -212,7 +212,7 @@ extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_fnmsub_sd (__m128d __A, __m128d __B, __m128d __C) { - return (__m128d)__builtin_ia32_vfmaddsd3 (-(__v2df)__A, (__v2df)__B, + return (__m128d)__builtin_ia32_vfmaddsd3 ((__v2df)__A, -(__v2df)__B, -(__v2df)__C); } @@ -220,7 +220,7 @@ extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_fnmsub_ss (__m128 __A, __m128 __B, __m128 __C) { - return (__m128)__builtin_ia32_vfmaddss3 (-(__v4sf)__A, (__v4sf)__B, + return (__m128)__builtin_ia32_vfmaddss3 ((__v4sf)__A, -(__v4sf)__B, -(__v4sf)__C); }