diff mbox

Fix -O0 -ffloat-store -mavx ICE (PR target/79807)

Message ID 20170302214419.GT1849@tucnak
State New
Headers show

Commit Message

Jakub Jelinek March 2, 2017, 9:44 p.m. UTC
Hi!

For -O1 and above, we force all operands and targets of x86 builtins into
registers during expansion, because we expect combiner will do its job.
But for -O0, we try to keep them in MEM if the predicate allows it, we just
make sure that at most one input operand is a MEM.  We still allow
MEM destination and one MEM input operand, which is why we ICE - most
insns/expanders don't allow that in the predicates and it is fine, but
various insns/expanders have e.g. nonimmediate_operand on both target
and some input operand and just have a condition that requires that one
of those is a register or that both of them are not memory.  The expanders
don't check the condition though, it is checked only when the insn is being
recognized in vregs pass and that is too late.
The following patch fixes that by forcing all input operands at -O0 into
non-memory if the target is memory.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Or are there any define_insn/define_expand where it is desirable to have
both input and output operand a MEM (and does it have to be matching)?
For various scalar binary and unary expanders the backend already uses a helper
that will force something into memory if dest and src are both memory and
not rtx_equal_p, but do we have anything like that in anything these two
builtin expanders emit?

2017-03-02  Jakub Jelinek  <jakub@redhat.com>

	PR target/79807
	* config/i386/i386.c (ix86_expand_multi_arg_builtin): If target
	is a memory operand, increase num_memory.
	(ix86_expand_args_builtin): Likewise.

	* gcc.target/i386/pr79807.c: New test.


	Jakub

Comments

Uros Bizjak March 3, 2017, 8:10 a.m. UTC | #1
On Thu, Mar 2, 2017 at 10:44 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> Hi!
>
> For -O1 and above, we force all operands and targets of x86 builtins into
> registers during expansion, because we expect combiner will do its job.
> But for -O0, we try to keep them in MEM if the predicate allows it, we just
> make sure that at most one input operand is a MEM.  We still allow
> MEM destination and one MEM input operand, which is why we ICE - most
> insns/expanders don't allow that in the predicates and it is fine, but
> various insns/expanders have e.g. nonimmediate_operand on both target
> and some input operand and just have a condition that requires that one
> of those is a register or that both of them are not memory.  The expanders
> don't check the condition though, it is checked only when the insn is being
> recognized in vregs pass and that is too late.
> The following patch fixes that by forcing all input operands at -O0 into
> non-memory if the target is memory.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> Or are there any define_insn/define_expand where it is desirable to have
> both input and output operand a MEM (and does it have to be matching)?
> For various scalar binary and unary expanders the backend already uses a helper
> that will force something into memory if dest and src are both memory and
> not rtx_equal_p, but do we have anything like that in anything these two
> builtin expanders emit?

Any insn with matched memory (movlps, movhps and similar) can also
operate with matched register. To my knowledge, all insn patterns are
written in this way, since in the past we had plenty of problems with
matched memory operands.

Also, I really don't remember what benefit brings us to *not* force
input operand to a register at -O0 at expand time and leave to RA to
find matched memory. OTOH, with -O0 we already have so many
unnecessary moves that you get tears in the eyes...

> 2017-03-02  Jakub Jelinek  <jakub@redhat.com>
>
>         PR target/79807
>         * config/i386/i386.c (ix86_expand_multi_arg_builtin): If target
>         is a memory operand, increase num_memory.
>         (ix86_expand_args_builtin): Likewise.
>
>         * gcc.target/i386/pr79807.c: New test.

OK.

Thanks,
Uros.

>
> --- gcc/config/i386/i386.c.jj   2017-03-02 08:08:43.000000000 +0100
> +++ gcc/config/i386/i386.c      2017-03-02 17:58:28.396999033 +0100
> @@ -34249,6 +34249,8 @@ ix86_expand_multi_arg_builtin (enum insn
>        || GET_MODE (target) != tmode
>        || !insn_data[icode].operand[0].predicate (target, tmode))
>      target = gen_reg_rtx (tmode);
> +  else if (memory_operand (target, tmode))
> +    num_memory++;
>
>    gcc_assert (nargs <= 4);
>
> @@ -35534,6 +35536,8 @@ ix86_expand_args_builtin (const struct b
>           || GET_MODE (target) != tmode
>           || !insn_p->operand[0].predicate (target, tmode))
>         target = gen_reg_rtx (tmode);
> +      else if (memory_operand (target, tmode))
> +       num_memory++;
>        real_target = target;
>      }
>    else
> --- gcc/testsuite/gcc.target/i386/pr79807.c.jj  2017-03-02 18:03:30.032023436 +0100
> +++ gcc/testsuite/gcc.target/i386/pr79807.c     2017-03-02 18:02:53.000000000 +0100
> @@ -0,0 +1,12 @@
> +/* PR target/79807 */
> +/* { dg-do compile } */
> +/* { dg-options "-O0 -mavx -ffloat-store" } */
> +
> +typedef double __v2df __attribute__ ((__vector_size__ (16)));
> +typedef double __v4df __attribute__ ((__vector_size__ (32)));
> +
> +__v2df
> +foo (__v4df x)
> +{
> +  return __builtin_ia32_pd_pd256 (x);
> +}
>
>         Jakub
Jakub Jelinek March 3, 2017, 10:35 a.m. UTC | #2
On Fri, Mar 03, 2017 at 09:10:03AM +0100, Uros Bizjak wrote:
> > Or are there any define_insn/define_expand where it is desirable to have
> > both input and output operand a MEM (and does it have to be matching)?
> > For various scalar binary and unary expanders the backend already uses a helper
> > that will force something into memory if dest and src are both memory and
> > not rtx_equal_p, but do we have anything like that in anything these two
> > builtin expanders emit?
> 
> Any insn with matched memory (movlps, movhps and similar) can also
> operate with matched register. To my knowledge, all insn patterns are
> written in this way, since in the past we had plenty of problems with
> matched memory operands.
> 
> Also, I really don't remember what benefit brings us to *not* force
> input operand to a register at -O0 at expand time and leave to RA to
> find matched memory. OTOH, with -O0 we already have so many
> unnecessary moves that you get tears in the eyes...

So, I've tried:
make mddump
for i in `sed -n -e 's/^.*CODE_FOR_\([^,]*\),.*$/\1/p' ../../gcc/config/i386/i386-builtin.def`; do sed -n '/^(define_\(insn\|expand\) ("'$i'"/,/^$/p' tmp-mddump.md; done
and looked for "=.*[mo] in there.
Some insns might want that && !(MEM_P (operands[0]) && MEM_P (operands[1])),
e.g.
(define_insn_and_split "avx512f_<castmode><avxsizesuffix>_<castmode>"
  [(set (match_operand:AVX512MODE2P 0 "nonimmediate_operand" "=x,m")
        (unspec:AVX512MODE2P
          [(match_operand:<ssequartermode> 1 "nonimmediate_operand" "xm,x")]
          UNSPEC_CAST))]
  "TARGET_AVX512F"
  "#"
  "&& reload_completed"
  [(set (match_dup 0) (match_dup 1))]
as the constraint require that both operands aren't memory, shall I create a
patch for that?  This is the first category below.

The second category is where matching operand is ok, so my patch can
pessimize stuff.  I wonder if we couldn't handle this like:
          /* If we aren't optimizing, only allow one memory operand to be
             generated.  */
          if (memory_operand (op, mode))
	    {
	      const char *const constraint
		= insn_data[icode].operand[i + adjust + 1].constraint;
	      if (optimize
		  || num_memory != 1
		  || !rtx_equal_p (real_target, op))
		num_memory++;
	      /* sse2_movsd allows matching operand.  */
	      else if (icode == CODE_FOR_sse2_movsd)
		;
	      /* Various masked insns allow matching operand.  */
	      else if (insn_data[icode].operand[i + adjust + 1].predicate
		       == vector_move_operand
		       && (strcmp (constraint, "0C") == 0
			   || strcmp (constraint, "0C,0") == 0))
		;
	      else
		num_memory++;
	    }
(though perhaps sse2_movsd is still too simplistic, because it has just
=m v 0 and =o 0 v alternatives, so if i + adjust + 1 is 2, then it is
fine as is, if it is 1, then only if the memory is offsettable; though
perhaps LRA can just turn non-offsettable memory into offsettable one
through secondary reload).

And the last category is with "=m" destination (not allowing anything else)
and (match_dup 0) somewhere in the pattern, I believe those have to be
expanded specially, because otherwise one e.g. wouldn't get MEM_P target
if optimize at all (I think the builtins are supposed to pass pointer to the
result in that case) and the match_dup just doesn't appear as another operand.

missing !(MEM && MEM) ?
=======================
avx512f_pd512_256pd
avx512f_pd512_pd
avx512f_ps512_256ps
avx512f_ps512_ps
avx512f_si512_256si
avx512f_si512_si
avx_pd256_pd
avx_ps256_ps
avx_si256_si
sse_storehps
sse_storelps

valid matching mem
==================
avx512f_ss_truncatev16siv16hi2_mask
avx512f_ss_truncatev16siv16qi2_mask
avx512f_ss_truncatev8div8hi2_mask
avx512f_ss_truncatev8div8si2_mask
avx512f_truncatev16siv16hi2_mask
avx512f_truncatev16siv16qi2_mask
avx512f_truncatev8div8hi2_mask
avx512f_truncatev8div8si2_mask
avx512f_us_truncatev16siv16hi2_mask
avx512f_us_truncatev16siv16qi2_mask
avx512f_us_truncatev8div8hi2_mask
avx512f_us_truncatev8div8si2_mask
avx512f_vcvtps2ph512_mask
avx512vl_ss_truncatev16hiv16qi2_mask
avx512vl_ss_truncatev4div4si2_mask
avx512vl_ss_truncatev8siv8hi2_mask
avx512vl_truncatev16hiv16qi2_mask
avx512vl_truncatev4div4si2_mask
avx512vl_truncatev8siv8hi2_mask
avx512vl_us_truncatev16hiv16qi2_mask
avx512vl_us_truncatev4div4si2_mask
avx512vl_us_truncatev8siv8hi2_mask
sse2_movsd
vcvtps2ph256_mask

match_dup on mem target
=======================
avx512f_ss_truncatev8div16qi2_mask_store
avx512f_storev16sf_mask
avx512f_storev16si_mask
avx512f_storev8df_mask
avx512f_storev8di_mask
avx512f_truncatev8div16qi2_mask_store
avx512f_us_truncatev8div16qi2_mask_store
avx512vl_compressstorev2df_mask
avx512vl_compressstorev2di_mask
avx512vl_compressstorev4df_mask
avx512vl_compressstorev4di_mask
avx512vl_compressstorev4sf_mask
avx512vl_compressstorev4si_mask
avx512vl_compressstorev8sf_mask
avx512vl_compressstorev8si_mask
avx512vl_ss_truncatev2div2hi2_mask_store
avx512vl_ss_truncatev2div2qi2_mask_store
avx512vl_ss_truncatev2div2si2_mask_store
avx512vl_ss_truncatev4div4hi2_mask_store
avx512vl_ss_truncatev4div4qi2_mask_store
avx512vl_ss_truncatev4siv4hi2_mask_store
avx512vl_ss_truncatev4siv4qi2_mask_store
avx512vl_ss_truncatev8siv8qi2_mask_store
avx512vl_storev16hi_mask
avx512vl_storev16qi_mask
avx512vl_storev2df_mask
avx512vl_storev2di_mask
avx512vl_storev32qi_mask
avx512vl_storev4df_mask
avx512vl_storev4di_mask
avx512vl_storev4sf_mask
avx512vl_storev4si_mask
avx512vl_storev8hi_mask
avx512vl_storev8sf_mask
avx512vl_storev8si_mask
avx512vl_truncatev2div2hi2_mask_store
avx512vl_truncatev2div2qi2_mask_store
avx512vl_truncatev2div2si2_mask_store
avx512vl_truncatev4div4hi2_mask_store
avx512vl_truncatev4div4qi2_mask_store
avx512vl_truncatev4siv4hi2_mask_store
avx512vl_truncatev4siv4qi2_mask_store
avx512vl_truncatev8siv8qi2_mask_store
avx512vl_us_truncatev2div2hi2_mask_store
avx512vl_us_truncatev2div2qi2_mask_store
avx512vl_us_truncatev2div2si2_mask_store
avx512vl_us_truncatev4div4hi2_mask_store
avx512vl_us_truncatev4div4qi2_mask_store
avx512vl_us_truncatev4siv4hi2_mask_store
avx512vl_us_truncatev4siv4qi2_mask_store
avx512vl_us_truncatev8siv8qi2_mask_store

	Jakub
Uros Bizjak March 3, 2017, 11:18 a.m. UTC | #3
On Fri, Mar 3, 2017 at 11:35 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Fri, Mar 03, 2017 at 09:10:03AM +0100, Uros Bizjak wrote:
>> > Or are there any define_insn/define_expand where it is desirable to have
>> > both input and output operand a MEM (and does it have to be matching)?
>> > For various scalar binary and unary expanders the backend already uses a helper
>> > that will force something into memory if dest and src are both memory and
>> > not rtx_equal_p, but do we have anything like that in anything these two
>> > builtin expanders emit?
>>
>> Any insn with matched memory (movlps, movhps and similar) can also
>> operate with matched register. To my knowledge, all insn patterns are
>> written in this way, since in the past we had plenty of problems with
>> matched memory operands.
>>
>> Also, I really don't remember what benefit brings us to *not* force
>> input operand to a register at -O0 at expand time and leave to RA to
>> find matched memory. OTOH, with -O0 we already have so many
>> unnecessary moves that you get tears in the eyes...
>
> So, I've tried:
> make mddump
> for i in `sed -n -e 's/^.*CODE_FOR_\([^,]*\),.*$/\1/p' ../../gcc/config/i386/i386-builtin.def`; do sed -n '/^(define_\(insn\|expand\) ("'$i'"/,/^$/p' tmp-mddump.md; done
> and looked for "=.*[mo] in there.
> Some insns might want that && !(MEM_P (operands[0]) && MEM_P (operands[1])),
> e.g.
> (define_insn_and_split "avx512f_<castmode><avxsizesuffix>_<castmode>"
>   [(set (match_operand:AVX512MODE2P 0 "nonimmediate_operand" "=x,m")
>         (unspec:AVX512MODE2P
>           [(match_operand:<ssequartermode> 1 "nonimmediate_operand" "xm,x")]
>           UNSPEC_CAST))]
>   "TARGET_AVX512F"
>   "#"
>   "&& reload_completed"
>   [(set (match_dup 0) (match_dup 1))]
> as the constraint require that both operands aren't memory, shall I create a
> patch for that?  This is the first category below.

Yes. Although expander takes care not to generate two memory
references, combine can propagate memory to the other operand,
creating semi-invalid RTX that is later resolved by RA.

> The second category is where matching operand is ok, so my patch can
> pessimize stuff.  I wonder if we couldn't handle this like:
>           /* If we aren't optimizing, only allow one memory operand to be
>              generated.  */
>           if (memory_operand (op, mode))
>             {
>               const char *const constraint
>                 = insn_data[icode].operand[i + adjust + 1].constraint;
>               if (optimize
>                   || num_memory != 1
>                   || !rtx_equal_p (real_target, op))
>                 num_memory++;
>               /* sse2_movsd allows matching operand.  */
>               else if (icode == CODE_FOR_sse2_movsd)
>                 ;
>               /* Various masked insns allow matching operand.  */
>               else if (insn_data[icode].operand[i + adjust + 1].predicate
>                        == vector_move_operand
>                        && (strcmp (constraint, "0C") == 0
>                            || strcmp (constraint, "0C,0") == 0))
>                 ;
>               else
>                 num_memory++;
>             }
> (though perhaps sse2_movsd is still too simplistic, because it has just
> =m v 0 and =o 0 v alternatives, so if i + adjust + 1 is 2, then it is
> fine as is, if it is 1, then only if the memory is offsettable; though
> perhaps LRA can just turn non-offsettable memory into offsettable one
> through secondary reload).

There is similarity with arithmetic operations. Please note that arith
ops fixup their arguments using ix86_expand_binary_operator, and their
insn patterns have to satisfy ix86_binary_operator_ok insn predicate.
However, even using this infrastructure, I have noticed that the
compiler is quite sloppy with memory operands, and we mostly get
matched memory after combine pass (if we are lucky).

Regarding the implementation, I'd rather see another flag for
i386-builtin.def entries, where we can mark if matching memory operand
is allowed. The expander can then look at the flag, and do some
special magic.

> And the last category is with "=m" destination (not allowing anything else)
> and (match_dup 0) somewhere in the pattern, I believe those have to be
> expanded specially, because otherwise one e.g. wouldn't get MEM_P target
> if optimize at all (I think the builtins are supposed to pass pointer to the
> result in that case) and the match_dup just doesn't appear as another operand.

It looks to me, that the expander already takes care for matching,
where strictly required. OTOH, matching memory requirement would also
fit in the proposed implementation with flag in builtin entries, we
can mark if matching memory operand is required, not only allowed.

However, as said above, expanding with memory operands is somehow
neglected area, desperately looking for some love.

Uros.

> missing !(MEM && MEM) ?
> =======================
> avx512f_pd512_256pd
> avx512f_pd512_pd
> avx512f_ps512_256ps
> avx512f_ps512_ps
> avx512f_si512_256si
> avx512f_si512_si
> avx_pd256_pd
> avx_ps256_ps
> avx_si256_si
> sse_storehps
> sse_storelps
>
> valid matching mem
> ==================
> avx512f_ss_truncatev16siv16hi2_mask
> avx512f_ss_truncatev16siv16qi2_mask
> avx512f_ss_truncatev8div8hi2_mask
> avx512f_ss_truncatev8div8si2_mask
> avx512f_truncatev16siv16hi2_mask
> avx512f_truncatev16siv16qi2_mask
> avx512f_truncatev8div8hi2_mask
> avx512f_truncatev8div8si2_mask
> avx512f_us_truncatev16siv16hi2_mask
> avx512f_us_truncatev16siv16qi2_mask
> avx512f_us_truncatev8div8hi2_mask
> avx512f_us_truncatev8div8si2_mask
> avx512f_vcvtps2ph512_mask
> avx512vl_ss_truncatev16hiv16qi2_mask
> avx512vl_ss_truncatev4div4si2_mask
> avx512vl_ss_truncatev8siv8hi2_mask
> avx512vl_truncatev16hiv16qi2_mask
> avx512vl_truncatev4div4si2_mask
> avx512vl_truncatev8siv8hi2_mask
> avx512vl_us_truncatev16hiv16qi2_mask
> avx512vl_us_truncatev4div4si2_mask
> avx512vl_us_truncatev8siv8hi2_mask
> sse2_movsd
> vcvtps2ph256_mask
>
> match_dup on mem target
> =======================
> avx512f_ss_truncatev8div16qi2_mask_store
> avx512f_storev16sf_mask
> avx512f_storev16si_mask
> avx512f_storev8df_mask
> avx512f_storev8di_mask
> avx512f_truncatev8div16qi2_mask_store
> avx512f_us_truncatev8div16qi2_mask_store
> avx512vl_compressstorev2df_mask
> avx512vl_compressstorev2di_mask
> avx512vl_compressstorev4df_mask
> avx512vl_compressstorev4di_mask
> avx512vl_compressstorev4sf_mask
> avx512vl_compressstorev4si_mask
> avx512vl_compressstorev8sf_mask
> avx512vl_compressstorev8si_mask
> avx512vl_ss_truncatev2div2hi2_mask_store
> avx512vl_ss_truncatev2div2qi2_mask_store
> avx512vl_ss_truncatev2div2si2_mask_store
> avx512vl_ss_truncatev4div4hi2_mask_store
> avx512vl_ss_truncatev4div4qi2_mask_store
> avx512vl_ss_truncatev4siv4hi2_mask_store
> avx512vl_ss_truncatev4siv4qi2_mask_store
> avx512vl_ss_truncatev8siv8qi2_mask_store
> avx512vl_storev16hi_mask
> avx512vl_storev16qi_mask
> avx512vl_storev2df_mask
> avx512vl_storev2di_mask
> avx512vl_storev32qi_mask
> avx512vl_storev4df_mask
> avx512vl_storev4di_mask
> avx512vl_storev4sf_mask
> avx512vl_storev4si_mask
> avx512vl_storev8hi_mask
> avx512vl_storev8sf_mask
> avx512vl_storev8si_mask
> avx512vl_truncatev2div2hi2_mask_store
> avx512vl_truncatev2div2qi2_mask_store
> avx512vl_truncatev2div2si2_mask_store
> avx512vl_truncatev4div4hi2_mask_store
> avx512vl_truncatev4div4qi2_mask_store
> avx512vl_truncatev4siv4hi2_mask_store
> avx512vl_truncatev4siv4qi2_mask_store
> avx512vl_truncatev8siv8qi2_mask_store
> avx512vl_us_truncatev2div2hi2_mask_store
> avx512vl_us_truncatev2div2qi2_mask_store
> avx512vl_us_truncatev2div2si2_mask_store
> avx512vl_us_truncatev4div4hi2_mask_store
> avx512vl_us_truncatev4div4qi2_mask_store
> avx512vl_us_truncatev4siv4hi2_mask_store
> avx512vl_us_truncatev4siv4qi2_mask_store
> avx512vl_us_truncatev8siv8qi2_mask_store
>
>         Jakub
Jakub Jelinek March 3, 2017, 11:30 a.m. UTC | #4
On Fri, Mar 03, 2017 at 12:18:09PM +0100, Uros Bizjak wrote:
> > as the constraint require that both operands aren't memory, shall I create a
> > patch for that?  This is the first category below.
> 
> Yes. Although expander takes care not to generate two memory
> references, combine can propagate memory to the other operand,
> creating semi-invalid RTX that is later resolved by RA.

Ok, will work on a patch.

> > The second category is where matching operand is ok, so my patch can
> > pessimize stuff.  I wonder if we couldn't handle this like:
> >           /* If we aren't optimizing, only allow one memory operand to be
> >              generated.  */
> >           if (memory_operand (op, mode))
> >             {
> >               const char *const constraint
> >                 = insn_data[icode].operand[i + adjust + 1].constraint;
> >               if (optimize
> >                   || num_memory != 1
> >                   || !rtx_equal_p (real_target, op))
> >                 num_memory++;
> >               /* sse2_movsd allows matching operand.  */
> >               else if (icode == CODE_FOR_sse2_movsd)
> >                 ;
> >               /* Various masked insns allow matching operand.  */
> >               else if (insn_data[icode].operand[i + adjust + 1].predicate
> >                        == vector_move_operand
> >                        && (strcmp (constraint, "0C") == 0
> >                            || strcmp (constraint, "0C,0") == 0))
> >                 ;
> >               else
> >                 num_memory++;
> >             }
> > (though perhaps sse2_movsd is still too simplistic, because it has just
> > =m v 0 and =o 0 v alternatives, so if i + adjust + 1 is 2, then it is
> > fine as is, if it is 1, then only if the memory is offsettable; though
> > perhaps LRA can just turn non-offsettable memory into offsettable one
> > through secondary reload).
> 
> There is similarity with arithmetic operations. Please note that arith
> ops fixup their arguments using ix86_expand_binary_operator, and their
> insn patterns have to satisfy ix86_binary_operator_ok insn predicate.
> However, even using this infrastructure, I have noticed that the
> compiler is quite sloppy with memory operands, and we mostly get
> matched memory after combine pass (if we are lucky).
> 
> Regarding the implementation, I'd rather see another flag for
> i386-builtin.def entries, where we can mark if matching memory operand
> is allowed. The expander can then look at the flag, and do some
> special magic.

I think maintaining such a flag would be a nightmare, I think more insns
will need it in the future and it will be hard to remember.  Maybe it is better
not to do anything about those (i.e. apply my patch as is), after all -O0
is not optimized code.

	Jakub
Uros Bizjak March 3, 2017, 12:17 p.m. UTC | #5
On Fri, Mar 3, 2017 at 12:30 PM, Jakub Jelinek <jakub@redhat.com> wrote:

>> Regarding the implementation, I'd rather see another flag for
>> i386-builtin.def entries, where we can mark if matching memory operand
>> is allowed. The expander can then look at the flag, and do some
>> special magic.
>
> I think maintaining such a flag would be a nightmare, I think more insns
> will need it in the future and it will be hard to remember.  Maybe it is better
> not to do anything about those (i.e. apply my patch as is), after all -O0
> is not optimized code.

Yes, I agree.

Uros.
diff mbox

Patch

--- gcc/config/i386/i386.c.jj	2017-03-02 08:08:43.000000000 +0100
+++ gcc/config/i386/i386.c	2017-03-02 17:58:28.396999033 +0100
@@ -34249,6 +34249,8 @@  ix86_expand_multi_arg_builtin (enum insn
       || GET_MODE (target) != tmode
       || !insn_data[icode].operand[0].predicate (target, tmode))
     target = gen_reg_rtx (tmode);
+  else if (memory_operand (target, tmode))
+    num_memory++;
 
   gcc_assert (nargs <= 4);
 
@@ -35534,6 +35536,8 @@  ix86_expand_args_builtin (const struct b
 	  || GET_MODE (target) != tmode
 	  || !insn_p->operand[0].predicate (target, tmode))
 	target = gen_reg_rtx (tmode);
+      else if (memory_operand (target, tmode))
+	num_memory++;
       real_target = target;
     }
   else
--- gcc/testsuite/gcc.target/i386/pr79807.c.jj	2017-03-02 18:03:30.032023436 +0100
+++ gcc/testsuite/gcc.target/i386/pr79807.c	2017-03-02 18:02:53.000000000 +0100
@@ -0,0 +1,12 @@ 
+/* PR target/79807 */
+/* { dg-do compile } */
+/* { dg-options "-O0 -mavx -ffloat-store" } */
+
+typedef double __v2df __attribute__ ((__vector_size__ (16)));
+typedef double __v4df __attribute__ ((__vector_size__ (32)));
+
+__v2df
+foo (__v4df x)
+{
+  return __builtin_ia32_pd_pd256 (x);
+}