diff mbox

PR target/70155: Use SSE for TImode load/store

Message ID 20160425125145.GA5326@intel.com
State New
Headers show

Commit Message

H.J. Lu April 25, 2016, 12:51 p.m. UTC
Tested on Linux/x86-64.  OK for trunk?

BTW, I have a followup patch to use SSE for TImode bitwise operation.

H.J.
----
128-bit SSE load and store instructions can be used for load and store
of 128-bit integers if they are the only operations on 128-bit integers.
To convert load and store of 128-bit integers to 128-bit SSE load and
store, the original STV pass, which is designed to convert 64-bit integer
operations to SSE2 operations in 32-bit mode, is extended to 64-bit mode
in the following ways:

1. Class scalar_chain is turned into base class.  The 32-bit specific
member functions are moved to the new derived class, scalar_chain_32.
The new derived class, scalar_chain_64, is added to convert oad and
store of 128-bit integers to 128-bit SSE load and store.
2. Add the 64-bit version of scalar_to_vector_candidate_p and
remove_non_convertible_regs.  Only TImode load and store are allowed
for conversion.  If one instruction on the chain of dependent
instructions aren't TImode load or store, the chain of instructions
won't be converted.
3. In 64-bit, we only convert from TImode to V1TImode, which have the
same size.  The difference is only vector registers are allowed in
TImode so that 128-bit SSE load and store instructions will be used
for load and store of 128-bit integers.
4. Put the 64-bit STV pass before the CSE pass so that instructions
changed or generated by the STV pass can be CSEed.

gcc/

	PR target/70155
	* config/i386/i386.c (scalar_to_vector_candidate_p): Renamed
	to ...
	(scalar_to_vector_candidate_p_32): This.
	(scalar_to_vector_candidate_p_64): New function.
	(scalar_to_vector_candidate_p): Likewise.
	(check_non_convertible_regs_64): Likewise.
	(remove_non_convertible_regs_64): Likewise.
	(remove_non_convertible_regs): Likewise.
	(remove_non_convertible_regs): Renamed to ...
	(remove_non_convertible_regs_32): This.
	(scalar_chain::~scalar_chain): Make it virtual.
	(scalar_chain::compute_convert_gain): Make it pure virtual.
	(scalar_chain::convert_insn): Likewise.
	(scalar_chain::convert_registers): Likewise.
	(scalar_chain::analyze_register_chain): Likewise.
	(scalar_chain::add_to_queue): Make it protected.
	(scalar_chain::emit_conversion_insns): Likewise.
	(scalar_chain::mark_dual_mode_def): Moved to scalar_chain_32.
	(scalar_chain::replace_with_subreg): Likewise.
	(scalar_chain::replace_with_subreg_in_insn): Likewise.
	(scalar_chain::convert_op): Likewise.
	(scalar_chain::convert_reg): Likewise.
	(scalar_chain::make_vector_copies): Likewise.
	(scalar_chain::convert_registers): New pure virtual function.
	(class scalar_chain_32): New class.
	(class scalar_chain_64): Likewise.
	(scalar_chain::mark_dual_mode_def): Renamed to ...
	(scalar_chain_32::mark_dual_mode_def): This.
	(scalar_chain::analyze_register_chain): Renamed to ...
	(scalar_chain_32::analyze_register_chain ): This.
	(scalar_chain_64::analyze_register_chain): New function.
	(scalar_chain::compute_convert_gain): Renamed to ...
	(scalar_chain_32::compute_convert_gain): This.
	(scalar_chain::replace_with_subreg): Renamed to ...
	(scalar_chain_32::replace_with_subreg): This.
	(scalar_chain::replace_with_subreg_in_insn): Renamed to ...
	(scalar_chain_32::replace_with_subreg_in_insn): This.
	(scalar_chain::make_vector_copies): Renamed to ...
	(scalar_chain_32::make_vector_copies): This.
	(scalar_chain::convert_reg): Renamed to ...
	(scalar_chain_32::convert_reg ): This.
	(scalar_chain::convert_op): Renamed to ...
	(scalar_chain_32::convert_op): This.
	(scalar_chain::convert_insn): Renamed to ...
	(scalar_chain_32::convert_insn): This.
	(scalar_chain_64::convert_insn): New function.
	(scalar_chain_32::convert_registers): Likewise.
	(scalar_chain::convert): Call convert_registers.
	(convert_scalars_to_vector): Change to scalar_chain pointer to
	use scalar_chain_64 in 64-bit mode and scalar_chain_32 in 32-bit
	mode.  Delete scalar_chain pointer.  Call free_dominance_info in
	64-bit mode.
	(pass_stv::gate): Remove TARGET_64BIT check.
	(ix86_option_override): Put the 64-bit STV pass before the CSE
	pass.

gcc/testsuite/

	PR target/70155
	* gcc.target/i386/pr55247-2.c: Updated to check movti_internal
	and movv1ti_internal patterns
	* gcc.target/i386/pr70155-1.c: New test.
	* gcc.target/i386/pr70155-2.c: Likewise.
	* gcc.target/i386/pr70155-3.c: Likewise.
	* gcc.target/i386/pr70155-4.c: Likewise.
	* gcc.target/i386/pr70155-5.c: Likewise.
	* gcc.target/i386/pr70155-6.c: Likewise.
	* gcc.target/i386/pr70155-7.c: Likewise.
	* gcc.target/i386/pr70155-8.c: Likewise.
	* gcc.target/i386/pr70155-9.c: Likewise.
	* gcc.target/i386/pr70155-10.c: Likewise.
	* gcc.target/i386/pr70155-11.c: Likewise.
	* gcc.target/i386/pr70155-12.c: Likewise.
	* gcc.target/i386/pr70155-13.c: Likewise.
	* gcc.target/i386/pr70155-14.c: Likewise.
	* gcc.target/i386/pr70155-15.c: Likewise.
	* gcc.target/i386/pr70155-16.c: Likewise.
	* gcc.target/i386/pr70155-17.c: Likewise.
	* gcc.target/i386/pr70155-18.c: Likewise.
	* gcc.target/i386/pr70155-19.c: Likewise.
	* gcc.target/i386/pr70155-20.c: Likewise.
	* gcc.target/i386/pr70155-21.c: Likewise.
	* gcc.target/i386/pr70155-22.c: Likewise.
---
 gcc/config/i386/i386.c                     | 427 ++++++++++++++++++++++++++---
 gcc/testsuite/gcc.target/i386/pr55247-2.c  |   5 +-
 gcc/testsuite/gcc.target/i386/pr70155-1.c  |  13 +
 gcc/testsuite/gcc.target/i386/pr70155-10.c |  19 ++
 gcc/testsuite/gcc.target/i386/pr70155-11.c |  19 ++
 gcc/testsuite/gcc.target/i386/pr70155-12.c |  17 ++
 gcc/testsuite/gcc.target/i386/pr70155-13.c |  17 ++
 gcc/testsuite/gcc.target/i386/pr70155-14.c |  17 ++
 gcc/testsuite/gcc.target/i386/pr70155-15.c |  18 ++
 gcc/testsuite/gcc.target/i386/pr70155-16.c |  17 ++
 gcc/testsuite/gcc.target/i386/pr70155-17.c |  18 ++
 gcc/testsuite/gcc.target/i386/pr70155-18.c |  12 +
 gcc/testsuite/gcc.target/i386/pr70155-19.c |  12 +
 gcc/testsuite/gcc.target/i386/pr70155-2.c  |  18 ++
 gcc/testsuite/gcc.target/i386/pr70155-20.c |  13 +
 gcc/testsuite/gcc.target/i386/pr70155-21.c |  13 +
 gcc/testsuite/gcc.target/i386/pr70155-22.c |  14 +
 gcc/testsuite/gcc.target/i386/pr70155-3.c  |  20 ++
 gcc/testsuite/gcc.target/i386/pr70155-4.c  |  20 ++
 gcc/testsuite/gcc.target/i386/pr70155-5.c  |  13 +
 gcc/testsuite/gcc.target/i386/pr70155-6.c  |  13 +
 gcc/testsuite/gcc.target/i386/pr70155-7.c  |  18 ++
 gcc/testsuite/gcc.target/i386/pr70155-8.c  |  18 ++
 gcc/testsuite/gcc.target/i386/pr70155-9.c  |  17 ++
 24 files changed, 750 insertions(+), 38 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-10.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-11.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-12.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-13.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-14.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-15.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-16.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-17.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-18.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-19.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-20.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-21.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-22.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-6.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-8.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr70155-9.c

Comments

Uros Bizjak April 25, 2016, 2:18 p.m. UTC | #1
On Mon, Apr 25, 2016 at 2:51 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
> Tested on Linux/x86-64.  OK for trunk?

> +  /* FIXME: Since the CSE pass may change dominance info, which isn't
> +     expected by the fwprop pass, call free_dominance_info to
> +     invalidate dominance info.  Otherwise, the fwprop pass may crash
> +     when dominance info is changed.  */
> +  if (TARGET_64BIT)
> +    free_dominance_info (CDI_DOMINATORS);
> +

Please resolve the above problem first, target-dependent sources are
not the place to apply band-aids for middle-end problems. The thread
with the proposed fix died in [1].

[1] https://gcc.gnu.org/ml/gcc/2016-03/msg00143.html

Also, I find _32 and _64 suffixes confusing, maybe better would be to
use timode_ and dimode_ prefixes everywhere?

Uros.
H.J. Lu April 25, 2016, 2:47 p.m. UTC | #2
On Mon, Apr 25, 2016 at 7:18 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Mon, Apr 25, 2016 at 2:51 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>> Tested on Linux/x86-64.  OK for trunk?
>
>> +  /* FIXME: Since the CSE pass may change dominance info, which isn't
>> +     expected by the fwprop pass, call free_dominance_info to
>> +     invalidate dominance info.  Otherwise, the fwprop pass may crash
>> +     when dominance info is changed.  */
>> +  if (TARGET_64BIT)
>> +    free_dominance_info (CDI_DOMINATORS);
>> +
>
> Please resolve the above problem first, target-dependent sources are
> not the place to apply band-aids for middle-end problems. The thread
> with the proposed fix died in [1].
>
> [1] https://gcc.gnu.org/ml/gcc/2016-03/msg00143.html

free_dominance_info (CDI_DOMINATORS) has been called in other
places to avoid this middle-end issue.   I don't know when the middle-end
will be fixed.  I don't think this target optimization should be penalized by
the middle-end issue.

> Also, I find _32 and _64 suffixes confusing, maybe better would be to
> use timode_ and dimode_ prefixes everywhere?
>

I will make the change.
Uros Bizjak April 25, 2016, 3:10 p.m. UTC | #3
On Mon, Apr 25, 2016 at 4:47 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Mon, Apr 25, 2016 at 7:18 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>> On Mon, Apr 25, 2016 at 2:51 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>>> Tested on Linux/x86-64.  OK for trunk?
>>
>>> +  /* FIXME: Since the CSE pass may change dominance info, which isn't
>>> +     expected by the fwprop pass, call free_dominance_info to
>>> +     invalidate dominance info.  Otherwise, the fwprop pass may crash
>>> +     when dominance info is changed.  */
>>> +  if (TARGET_64BIT)
>>> +    free_dominance_info (CDI_DOMINATORS);
>>> +
>>
>> Please resolve the above problem first, target-dependent sources are
>> not the place to apply band-aids for middle-end problems. The thread
>> with the proposed fix died in [1].
>>
>> [1] https://gcc.gnu.org/ml/gcc/2016-03/msg00143.html
>
> free_dominance_info (CDI_DOMINATORS) has been called in other
> places to avoid this middle-end issue.   I don't know when the middle-end
> will be fixed.  I don't think this target optimization should be penalized by
> the middle-end issue.

Let's ask Richard if he is OK with the workaround...

One more thing:

@@ -3551,6 +3874,7 @@ convert_scalars_to_vector ()
   basic_block bb;
   bitmap candidates;
   int converted_insns = 0;
+  rtx zero, minus_one;

   bitmap_obstack_initialize (NULL);
   candidates = BITMAP_ALLOC (NULL);
@@ -3585,22 +3909,40 @@ convert_scalars_to_vector ()
     if (dump_file)
       fprintf (dump_file, "There are no candidates for optimization.\n");

+  if (TARGET_64BIT)
+    {
+      zero = gen_reg_rtx (V1TImode);
+      minus_one = gen_reg_rtx (V1TImode);
+    }
+  else
+    {
+      zero = NULL_RTX;
+      minus_one = NULL_RTX;
+    }
+

Do we *really* need to crate registers here? They are only used as a
temporary in scalar_chain_64::convert_insn, and I think that any CSE
worth its name should find out when the same immediate is loaded to
different temporaries.

BTW: I'd really like if Ilya can review the functionality of the
patch, he is an expert in this conversion stuff.

Uros.
H.J. Lu April 25, 2016, 3:27 p.m. UTC | #4
On Mon, Apr 25, 2016 at 8:10 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Mon, Apr 25, 2016 at 4:47 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Mon, Apr 25, 2016 at 7:18 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>> On Mon, Apr 25, 2016 at 2:51 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>>>> Tested on Linux/x86-64.  OK for trunk?
>>>
>>>> +  /* FIXME: Since the CSE pass may change dominance info, which isn't
>>>> +     expected by the fwprop pass, call free_dominance_info to
>>>> +     invalidate dominance info.  Otherwise, the fwprop pass may crash
>>>> +     when dominance info is changed.  */
>>>> +  if (TARGET_64BIT)
>>>> +    free_dominance_info (CDI_DOMINATORS);
>>>> +
>>>
>>> Please resolve the above problem first, target-dependent sources are
>>> not the place to apply band-aids for middle-end problems. The thread
>>> with the proposed fix died in [1].
>>>
>>> [1] https://gcc.gnu.org/ml/gcc/2016-03/msg00143.html
>>
>> free_dominance_info (CDI_DOMINATORS) has been called in other
>> places to avoid this middle-end issue.   I don't know when the middle-end
>> will be fixed.  I don't think this target optimization should be penalized by
>> the middle-end issue.
>
> Let's ask Richard if he is OK with the workaround...
>
> One more thing:
>
> @@ -3551,6 +3874,7 @@ convert_scalars_to_vector ()
>    basic_block bb;
>    bitmap candidates;
>    int converted_insns = 0;
> +  rtx zero, minus_one;
>
>    bitmap_obstack_initialize (NULL);
>    candidates = BITMAP_ALLOC (NULL);
> @@ -3585,22 +3909,40 @@ convert_scalars_to_vector ()
>      if (dump_file)
>        fprintf (dump_file, "There are no candidates for optimization.\n");
>
> +  if (TARGET_64BIT)
> +    {
> +      zero = gen_reg_rtx (V1TImode);
> +      minus_one = gen_reg_rtx (V1TImode);
> +    }
> +  else
> +    {
> +      zero = NULL_RTX;
> +      minus_one = NULL_RTX;
> +    }
> +
>
> Do we *really* need to crate registers here? They are only used as a
> temporary in scalar_chain_64::convert_insn, and I think that any CSE
> worth its name should find out when the same immediate is loaded to
> different temporaries.

I will double check. Last time when I tried,  CSE didn't work on this for a
reason.  For

[hjl@gnu-6 pr67400]$ cat z.i
extern void bar (void);

void *
foo (void)
{
  return &bar;
}
[hjl@gnu-6 pr67400]$ gcc -S -fPIC -O2 z.i
[hjl@gnu-6 pr67400]$

CSE sees:

(insn 5 2 6 2 (set (reg:DI 89)
        (mem/u/c:DI (const:DI (unspec:DI [
                        (symbol_ref:DI ("bar") [flags 0x41]
<function_decl 0x7f24362151b0 bar>)
                    ] UNSPEC_GOTPCREL)) [1  S8 A8])) z.i:6 89 {*movdi_internal}
     (nil))
(insn 6 5 10 2 (set (reg:DI 87 [ <retval> ])
        (reg:DI 89)) z.i:6 89 {*movdi_internal}
     (expr_list:REG_EQUAL (symbol_ref:DI ("bar") [flags 0x41]
<function_decl 0x7f24362151b0 bar>)
        (nil)))

CSE will not change it to

(insn 6 5 10 2 (set (reg:DI 89 [ <retval> ])
        (reg:DI 89)) z.i:6 89 {*movdi_internal}
     (expr_list:REG_EQUAL (symbol_ref:DI ("bar") [flags 0x41]
<function_decl 0x7f24362151b0 bar>)
        (nil)))

> BTW: I'd really like if Ilya can review the functionality of the
> patch, he is an expert in this conversion stuff.
>
> Uros.

Ilya, can you take a look?

Thanks.
H.J. Lu April 25, 2016, 3:29 p.m. UTC | #5
On Mon, Apr 25, 2016 at 8:27 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Mon, Apr 25, 2016 at 8:10 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>> On Mon, Apr 25, 2016 at 4:47 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> On Mon, Apr 25, 2016 at 7:18 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>>> On Mon, Apr 25, 2016 at 2:51 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>>>>> Tested on Linux/x86-64.  OK for trunk?
>>>>
>>>>> +  /* FIXME: Since the CSE pass may change dominance info, which isn't
>>>>> +     expected by the fwprop pass, call free_dominance_info to
>>>>> +     invalidate dominance info.  Otherwise, the fwprop pass may crash
>>>>> +     when dominance info is changed.  */
>>>>> +  if (TARGET_64BIT)
>>>>> +    free_dominance_info (CDI_DOMINATORS);
>>>>> +
>>>>
>>>> Please resolve the above problem first, target-dependent sources are
>>>> not the place to apply band-aids for middle-end problems. The thread
>>>> with the proposed fix died in [1].
>>>>
>>>> [1] https://gcc.gnu.org/ml/gcc/2016-03/msg00143.html
>>>
>>> free_dominance_info (CDI_DOMINATORS) has been called in other
>>> places to avoid this middle-end issue.   I don't know when the middle-end
>>> will be fixed.  I don't think this target optimization should be penalized by
>>> the middle-end issue.
>>
>> Let's ask Richard if he is OK with the workaround...
>>
>> One more thing:
>>
>> @@ -3551,6 +3874,7 @@ convert_scalars_to_vector ()
>>    basic_block bb;
>>    bitmap candidates;
>>    int converted_insns = 0;
>> +  rtx zero, minus_one;
>>
>>    bitmap_obstack_initialize (NULL);
>>    candidates = BITMAP_ALLOC (NULL);
>> @@ -3585,22 +3909,40 @@ convert_scalars_to_vector ()
>>      if (dump_file)
>>        fprintf (dump_file, "There are no candidates for optimization.\n");
>>
>> +  if (TARGET_64BIT)
>> +    {
>> +      zero = gen_reg_rtx (V1TImode);
>> +      minus_one = gen_reg_rtx (V1TImode);
>> +    }
>> +  else
>> +    {
>> +      zero = NULL_RTX;
>> +      minus_one = NULL_RTX;
>> +    }
>> +
>>
>> Do we *really* need to crate registers here? They are only used as a
>> temporary in scalar_chain_64::convert_insn, and I think that any CSE
>> worth its name should find out when the same immediate is loaded to
>> different temporaries.
>
> I will double check. Last time when I tried,  CSE didn't work on this for a
> reason.  For
>
> [hjl@gnu-6 pr67400]$ cat z.i
> extern void bar (void);
>
> void *
> foo (void)
> {
>   return &bar;
> }
> [hjl@gnu-6 pr67400]$ gcc -S -fPIC -O2 z.i
> [hjl@gnu-6 pr67400]$
>
> CSE sees:
>
> (insn 5 2 6 2 (set (reg:DI 89)
>         (mem/u/c:DI (const:DI (unspec:DI [
>                         (symbol_ref:DI ("bar") [flags 0x41]
> <function_decl 0x7f24362151b0 bar>)
>                     ] UNSPEC_GOTPCREL)) [1  S8 A8])) z.i:6 89 {*movdi_internal}
>      (nil))
> (insn 6 5 10 2 (set (reg:DI 87 [ <retval> ])
>         (reg:DI 89)) z.i:6 89 {*movdi_internal}
>      (expr_list:REG_EQUAL (symbol_ref:DI ("bar") [flags 0x41]
> <function_decl 0x7f24362151b0 bar>)
>         (nil)))
>
> CSE will not change it to
>
> (insn 6 5 10 2 (set (reg:DI 89 [ <retval> ])
>         (reg:DI 89)) z.i:6 89 {*movdi_internal}
>      (expr_list:REG_EQUAL (symbol_ref:DI ("bar") [flags 0x41]
> <function_decl 0x7f24362151b0 bar>)
>         (nil)))

I meant CSE won't change it to

insn 5 2 9 2 (set (reg:DI 87 [ <retval> ])
        (symbol_ref:DI ("bar") [flags 0x41]  <function_decl
0x7f7a734971b0 bar>)) z.i:6 89 {*movdi_internal}
     (nil))

>> BTW: I'd really like if Ilya can review the functionality of the
>> patch, he is an expert in this conversion stuff.
>>
>> Uros.
>
> Ilya, can you take a look?
>
> Thanks.
>
> --
> H.J.
Ilya Enkovich April 25, 2016, 4:13 p.m. UTC | #6
2016-04-25 18:27 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
>
> Ilya, can you take a look?
>
> Thanks.
>
> --
> H.J.

Hi,

Algorithmic part of the patch looks OK to me except the following piece of code.

+/* Check REF's chain to add new insns into a queue
+   and find registers requiring conversion.  */

Comment is wrong because you don't have any conversions required for
your candidates.

+
+void
+scalar_chain_64::analyze_register_chain (bitmap candidates, df_ref ref)
+{
+  df_link *chain;
+
+  gcc_assert (bitmap_bit_p (insns, DF_REF_INSN_UID (ref))
+             || bitmap_bit_p (candidates, DF_REF_INSN_UID (ref)));
+  add_to_queue (DF_REF_INSN_UID (ref));
+
+  for (chain = DF_REF_CHAIN (ref); chain; chain = chain->next)
+    {
+      unsigned uid = DF_REF_INSN_UID (chain->ref);
+
+      if (!NONDEBUG_INSN_P (DF_REF_INSN (chain->ref)))
+       continue;
+
+      if (!DF_REF_REG_MEM_P (chain->ref))
+       continue;

I believe here you wrongly jump to the next ref intead of actually adding it
to a queue.  You may just use

gcc_assert (!DF_REF_REG_MEM_P (chain->ref));

because you should'n have a candidate used in address operand.

+
+      if (bitmap_bit_p (insns, uid))
+       continue;
+
+      if (bitmap_bit_p (candidates, uid))
+       add_to_queue (uid);

Probably gcc_assert (bitmap_bit_p (candidates, uid)) since no uses and defs
out of candidates list are allowed?

+    }
+}

Thanks,
Ilya
Richard Biener April 26, 2016, 9:17 a.m. UTC | #7
On Mon, 25 Apr 2016, Uros Bizjak wrote:

> On Mon, Apr 25, 2016 at 4:47 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> > On Mon, Apr 25, 2016 at 7:18 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> >> On Mon, Apr 25, 2016 at 2:51 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
> >>> Tested on Linux/x86-64.  OK for trunk?
> >>
> >>> +  /* FIXME: Since the CSE pass may change dominance info, which isn't
> >>> +     expected by the fwprop pass, call free_dominance_info to
> >>> +     invalidate dominance info.  Otherwise, the fwprop pass may crash
> >>> +     when dominance info is changed.  */
> >>> +  if (TARGET_64BIT)
> >>> +    free_dominance_info (CDI_DOMINATORS);
> >>> +
> >>
> >> Please resolve the above problem first, target-dependent sources are
> >> not the place to apply band-aids for middle-end problems. The thread
> >> with the proposed fix died in [1].
> >>
> >> [1] https://gcc.gnu.org/ml/gcc/2016-03/msg00143.html
> >
> > free_dominance_info (CDI_DOMINATORS) has been called in other
> > places to avoid this middle-end issue.   I don't know when the middle-end
> > will be fixed.  I don't think this target optimization should be penalized by
> > the middle-end issue.
> 
> Let's ask Richard if he is OK with the workaround...

Well, it's ultimately your call (it's a workaround in the target).

Of course I'd like to see the underlying issue fixed and the 
workarounds in "other places" be removed.

Richard.

> One more thing:
> 
> @@ -3551,6 +3874,7 @@ convert_scalars_to_vector ()
>    basic_block bb;
>    bitmap candidates;
>    int converted_insns = 0;
> +  rtx zero, minus_one;
> 
>    bitmap_obstack_initialize (NULL);
>    candidates = BITMAP_ALLOC (NULL);
> @@ -3585,22 +3909,40 @@ convert_scalars_to_vector ()
>      if (dump_file)
>        fprintf (dump_file, "There are no candidates for optimization.\n");
> 
> +  if (TARGET_64BIT)
> +    {
> +      zero = gen_reg_rtx (V1TImode);
> +      minus_one = gen_reg_rtx (V1TImode);
> +    }
> +  else
> +    {
> +      zero = NULL_RTX;
> +      minus_one = NULL_RTX;
> +    }
> +
> 
> Do we *really* need to crate registers here? They are only used as a
> temporary in scalar_chain_64::convert_insn, and I think that any CSE
> worth its name should find out when the same immediate is loaded to
> different temporaries.
> 
> BTW: I'd really like if Ilya can review the functionality of the
> patch, he is an expert in this conversion stuff.
> 
> Uros.
> 
>
Uros Bizjak April 26, 2016, 9:33 a.m. UTC | #8
On Tue, Apr 26, 2016 at 11:17 AM, Richard Biener <rguenther@suse.de> wrote:
> On Mon, 25 Apr 2016, Uros Bizjak wrote:
>
>> On Mon, Apr 25, 2016 at 4:47 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> > On Mon, Apr 25, 2016 at 7:18 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>> >> On Mon, Apr 25, 2016 at 2:51 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>> >>> Tested on Linux/x86-64.  OK for trunk?
>> >>
>> >>> +  /* FIXME: Since the CSE pass may change dominance info, which isn't
>> >>> +     expected by the fwprop pass, call free_dominance_info to
>> >>> +     invalidate dominance info.  Otherwise, the fwprop pass may crash
>> >>> +     when dominance info is changed.  */
>> >>> +  if (TARGET_64BIT)
>> >>> +    free_dominance_info (CDI_DOMINATORS);
>> >>> +
>> >>
>> >> Please resolve the above problem first, target-dependent sources are
>> >> not the place to apply band-aids for middle-end problems. The thread
>> >> with the proposed fix died in [1].
>> >>
>> >> [1] https://gcc.gnu.org/ml/gcc/2016-03/msg00143.html
>> >
>> > free_dominance_info (CDI_DOMINATORS) has been called in other
>> > places to avoid this middle-end issue.   I don't know when the middle-end
>> > will be fixed.  I don't think this target optimization should be penalized by
>> > the middle-end issue.
>>
>> Let's ask Richard if he is OK with the workaround...
>
> Well, it's ultimately your call (it's a workaround in the target).

Oh well, ...

> Of course I'd like to see the underlying issue fixed and the
> workarounds in "other places" be removed.

... then at least a reference to a relevant PR should be added to a
FIXME comment.

Uros.
Richard Biener April 26, 2016, 9:35 a.m. UTC | #9
On Tue, 26 Apr 2016, Uros Bizjak wrote:

> On Tue, Apr 26, 2016 at 11:17 AM, Richard Biener <rguenther@suse.de> wrote:
> > On Mon, 25 Apr 2016, Uros Bizjak wrote:
> >
> >> On Mon, Apr 25, 2016 at 4:47 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> >> > On Mon, Apr 25, 2016 at 7:18 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> >> >> On Mon, Apr 25, 2016 at 2:51 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
> >> >>> Tested on Linux/x86-64.  OK for trunk?
> >> >>
> >> >>> +  /* FIXME: Since the CSE pass may change dominance info, which isn't
> >> >>> +     expected by the fwprop pass, call free_dominance_info to
> >> >>> +     invalidate dominance info.  Otherwise, the fwprop pass may crash
> >> >>> +     when dominance info is changed.  */
> >> >>> +  if (TARGET_64BIT)
> >> >>> +    free_dominance_info (CDI_DOMINATORS);
> >> >>> +
> >> >>
> >> >> Please resolve the above problem first, target-dependent sources are
> >> >> not the place to apply band-aids for middle-end problems. The thread
> >> >> with the proposed fix died in [1].
> >> >>
> >> >> [1] https://gcc.gnu.org/ml/gcc/2016-03/msg00143.html
> >> >
> >> > free_dominance_info (CDI_DOMINATORS) has been called in other
> >> > places to avoid this middle-end issue.   I don't know when the middle-end
> >> > will be fixed.  I don't think this target optimization should be penalized by
> >> > the middle-end issue.
> >>
> >> Let's ask Richard if he is OK with the workaround...
> >
> > Well, it's ultimately your call (it's a workaround in the target).
> 
> Oh well, ...
> 
> > Of course I'd like to see the underlying issue fixed and the
> > workarounds in "other places" be removed.
> 
> ... then at least a reference to a relevant PR should be added to a
> FIXME comment.

HJ, can you please open a bug with 1) a testcase, 2) a patch to revert
the workaround so it shows the ICE and 3) a pointer to the ml thread
with your preliminary analysis?

The advantage is that with the patch in we have a reproducer at least.

Thanks,
Richard.
H.J. Lu April 26, 2016, 2:07 p.m. UTC | #10
On Mon, Apr 25, 2016 at 9:13 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> 2016-04-25 18:27 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
>>
>> Ilya, can you take a look?
>>
>> Thanks.
>>
>> --
>> H.J.
>
> Hi,
>
> Algorithmic part of the patch looks OK to me except the following piece of code.
>
> +/* Check REF's chain to add new insns into a queue
> +   and find registers requiring conversion.  */
>
> Comment is wrong because you don't have any conversions required for
> your candidates.

I will fix it.

> +
> +void
> +scalar_chain_64::analyze_register_chain (bitmap candidates, df_ref ref)
> +{
> +  df_link *chain;
> +
> +  gcc_assert (bitmap_bit_p (insns, DF_REF_INSN_UID (ref))
> +             || bitmap_bit_p (candidates, DF_REF_INSN_UID (ref)));
> +  add_to_queue (DF_REF_INSN_UID (ref));
> +
> +  for (chain = DF_REF_CHAIN (ref); chain; chain = chain->next)
> +    {
> +      unsigned uid = DF_REF_INSN_UID (chain->ref);
> +
> +      if (!NONDEBUG_INSN_P (DF_REF_INSN (chain->ref)))
> +       continue;
> +
> +      if (!DF_REF_REG_MEM_P (chain->ref))
> +       continue;
>
> I believe here you wrongly jump to the next ref intead of actually adding it
> to a queue.  You may just use
>
> gcc_assert (!DF_REF_REG_MEM_P (chain->ref));
>
> because you should'n have a candidate used in address operand.

I will update.

> +
> +      if (bitmap_bit_p (insns, uid))
> +       continue;
> +
> +      if (bitmap_bit_p (candidates, uid))
> +       add_to_queue (uid);
>
> Probably gcc_assert (bitmap_bit_p (candidates, uid)) since no uses and defs
> out of candidates list are allowed?

That would be wrong since there are

 while (!bitmap_empty_p (queue))
    {
      insn_uid = bitmap_first_set_bit (queue);
      bitmap_clear_bit (queue, insn_uid);
      bitmap_clear_bit (candidates, insn_uid);
      add_insn (candidates, insn_uid);
    }

An instruction is a candidate and the bit is cleared when
analyze_register_chain is called.
Ilya Enkovich April 26, 2016, 2:15 p.m. UTC | #11
2016-04-26 17:07 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
> On Mon, Apr 25, 2016 at 9:13 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>> 2016-04-25 18:27 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
>>>
>>> Ilya, can you take a look?
>>>
>>> Thanks.
>>>
>>> --
>>> H.J.
>>
>> Hi,
>>
>> Algorithmic part of the patch looks OK to me except the following piece of code.
>>
>> +/* Check REF's chain to add new insns into a queue
>> +   and find registers requiring conversion.  */
>>
>> Comment is wrong because you don't have any conversions required for
>> your candidates.
>
> I will fix it.
>
>> +
>> +void
>> +scalar_chain_64::analyze_register_chain (bitmap candidates, df_ref ref)
>> +{
>> +  df_link *chain;
>> +
>> +  gcc_assert (bitmap_bit_p (insns, DF_REF_INSN_UID (ref))
>> +             || bitmap_bit_p (candidates, DF_REF_INSN_UID (ref)));
>> +  add_to_queue (DF_REF_INSN_UID (ref));
>> +
>> +  for (chain = DF_REF_CHAIN (ref); chain; chain = chain->next)
>> +    {
>> +      unsigned uid = DF_REF_INSN_UID (chain->ref);
>> +
>> +      if (!NONDEBUG_INSN_P (DF_REF_INSN (chain->ref)))
>> +       continue;
>> +
>> +      if (!DF_REF_REG_MEM_P (chain->ref))
>> +       continue;
>>
>> I believe here you wrongly jump to the next ref intead of actually adding it
>> to a queue.  You may just use
>>
>> gcc_assert (!DF_REF_REG_MEM_P (chain->ref));
>>
>> because you should'n have a candidate used in address operand.
>
> I will update.
>
>> +
>> +      if (bitmap_bit_p (insns, uid))
>> +       continue;
>> +
>> +      if (bitmap_bit_p (candidates, uid))
>> +       add_to_queue (uid);
>>
>> Probably gcc_assert (bitmap_bit_p (candidates, uid)) since no uses and defs
>> out of candidates list are allowed?
>
> That would be wrong since there are
>
>  while (!bitmap_empty_p (queue))
>     {
>       insn_uid = bitmap_first_set_bit (queue);
>       bitmap_clear_bit (queue, insn_uid);
>       bitmap_clear_bit (candidates, insn_uid);
>       add_insn (candidates, insn_uid);
>     }
>
> An instruction is a candidate and the bit is cleared when
> analyze_register_chain is called.

You clear candidates bit but the first thing you do in add_insn is set
insns bit.
Thus you should hit:

      if (bitmap_bit_p (insns, uid))
        continue;

For handled candidates.

Probably it would be more clear if we keep this clear/set pair
together?  E.g. move
bitmap_clear_bit (candidates, insn_uid) to scalar_chain::add_insn.

Thanks,
Ilya

>
> --
> H.J.
H.J. Lu April 26, 2016, 2:55 p.m. UTC | #12
On Tue, Apr 26, 2016 at 7:15 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> 2016-04-26 17:07 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
>> On Mon, Apr 25, 2016 at 9:13 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>>> 2016-04-25 18:27 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
>>>>
>>>> Ilya, can you take a look?
>>>>
>>>> Thanks.
>>>>
>>>> --
>>>> H.J.
>>>
>>> Hi,
>>>
>>> Algorithmic part of the patch looks OK to me except the following piece of code.
>>>
>>> +/* Check REF's chain to add new insns into a queue
>>> +   and find registers requiring conversion.  */
>>>
>>> Comment is wrong because you don't have any conversions required for
>>> your candidates.
>>
>> I will fix it.
>>
>>> +
>>> +void
>>> +scalar_chain_64::analyze_register_chain (bitmap candidates, df_ref ref)
>>> +{
>>> +  df_link *chain;
>>> +
>>> +  gcc_assert (bitmap_bit_p (insns, DF_REF_INSN_UID (ref))
>>> +             || bitmap_bit_p (candidates, DF_REF_INSN_UID (ref)));
>>> +  add_to_queue (DF_REF_INSN_UID (ref));
>>> +
>>> +  for (chain = DF_REF_CHAIN (ref); chain; chain = chain->next)
>>> +    {
>>> +      unsigned uid = DF_REF_INSN_UID (chain->ref);
>>> +
>>> +      if (!NONDEBUG_INSN_P (DF_REF_INSN (chain->ref)))
>>> +       continue;
>>> +
>>> +      if (!DF_REF_REG_MEM_P (chain->ref))
>>> +       continue;
>>>
>>> I believe here you wrongly jump to the next ref intead of actually adding it
>>> to a queue.  You may just use
>>>
>>> gcc_assert (!DF_REF_REG_MEM_P (chain->ref));
>>>
>>> because you should'n have a candidate used in address operand.
>>
>> I will update.
>>
>>> +
>>> +      if (bitmap_bit_p (insns, uid))
>>> +       continue;
>>> +
>>> +      if (bitmap_bit_p (candidates, uid))
>>> +       add_to_queue (uid);
>>>
>>> Probably gcc_assert (bitmap_bit_p (candidates, uid)) since no uses and defs
>>> out of candidates list are allowed?
>>
>> That would be wrong since there are
>>
>>  while (!bitmap_empty_p (queue))
>>     {
>>       insn_uid = bitmap_first_set_bit (queue);
>>       bitmap_clear_bit (queue, insn_uid);
>>       bitmap_clear_bit (candidates, insn_uid);
>>       add_insn (candidates, insn_uid);
>>     }
>>
>> An instruction is a candidate and the bit is cleared when
>> analyze_register_chain is called.
>
> You clear candidates bit but the first thing you do in add_insn is set
> insns bit.
> Thus you should hit:
>
>       if (bitmap_bit_p (insns, uid))
>         continue;
>
> For handled candidates.
>
> Probably it would be more clear if we keep this clear/set pair
> together?  E.g. move
> bitmap_clear_bit (candidates, insn_uid) to scalar_chain::add_insn.
>

After we started processing candidates, we only use candidates
to check if an instruction is a candidate, not to check if an
instruction is NOT a candidate.
Ilya Enkovich April 26, 2016, 3:05 p.m. UTC | #13
2016-04-26 17:55 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
> On Tue, Apr 26, 2016 at 7:15 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>> 2016-04-26 17:07 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
>>> On Mon, Apr 25, 2016 at 9:13 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>>>> 2016-04-25 18:27 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
>>>>>
>>>>> Ilya, can you take a look?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> --
>>>>> H.J.
>>>>
>>>> Hi,
>>>>
>>>> Algorithmic part of the patch looks OK to me except the following piece of code.
>>>>
>>>> +/* Check REF's chain to add new insns into a queue
>>>> +   and find registers requiring conversion.  */
>>>>
>>>> Comment is wrong because you don't have any conversions required for
>>>> your candidates.
>>>
>>> I will fix it.
>>>
>>>> +
>>>> +void
>>>> +scalar_chain_64::analyze_register_chain (bitmap candidates, df_ref ref)
>>>> +{
>>>> +  df_link *chain;
>>>> +
>>>> +  gcc_assert (bitmap_bit_p (insns, DF_REF_INSN_UID (ref))
>>>> +             || bitmap_bit_p (candidates, DF_REF_INSN_UID (ref)));
>>>> +  add_to_queue (DF_REF_INSN_UID (ref));
>>>> +
>>>> +  for (chain = DF_REF_CHAIN (ref); chain; chain = chain->next)
>>>> +    {
>>>> +      unsigned uid = DF_REF_INSN_UID (chain->ref);
>>>> +
>>>> +      if (!NONDEBUG_INSN_P (DF_REF_INSN (chain->ref)))
>>>> +       continue;
>>>> +
>>>> +      if (!DF_REF_REG_MEM_P (chain->ref))
>>>> +       continue;
>>>>
>>>> I believe here you wrongly jump to the next ref intead of actually adding it
>>>> to a queue.  You may just use
>>>>
>>>> gcc_assert (!DF_REF_REG_MEM_P (chain->ref));
>>>>
>>>> because you should'n have a candidate used in address operand.
>>>
>>> I will update.
>>>
>>>> +
>>>> +      if (bitmap_bit_p (insns, uid))
>>>> +       continue;
>>>> +
>>>> +      if (bitmap_bit_p (candidates, uid))
>>>> +       add_to_queue (uid);
>>>>
>>>> Probably gcc_assert (bitmap_bit_p (candidates, uid)) since no uses and defs
>>>> out of candidates list are allowed?
>>>
>>> That would be wrong since there are
>>>
>>>  while (!bitmap_empty_p (queue))
>>>     {
>>>       insn_uid = bitmap_first_set_bit (queue);
>>>       bitmap_clear_bit (queue, insn_uid);
>>>       bitmap_clear_bit (candidates, insn_uid);
>>>       add_insn (candidates, insn_uid);
>>>     }
>>>
>>> An instruction is a candidate and the bit is cleared when
>>> analyze_register_chain is called.
>>
>> You clear candidates bit but the first thing you do in add_insn is set
>> insns bit.
>> Thus you should hit:
>>
>>       if (bitmap_bit_p (insns, uid))
>>         continue;
>>
>> For handled candidates.
>>
>> Probably it would be more clear if we keep this clear/set pair
>> together?  E.g. move
>> bitmap_clear_bit (candidates, insn_uid) to scalar_chain::add_insn.
>>
>
> After we started processing candidates, we only use candidates
> to check if an instruction is a candidate, not to check if an
> instruction is NOT a candidate.

I don't see how it's related to what I said.  My point is that
when you analyze added insn you shouldn't reach insns which are both
not in candidates and not in current scalar_chain_64.  That's why I
think you miss an assert in scalar_chain_64::analyze_register_chain.

Thanks,
Ilya

>
> --
> H.J.
Jeff Law April 26, 2016, 3:08 p.m. UTC | #14
On 04/26/2016 03:17 AM, Richard Biener wrote:
> On Mon, 25 Apr 2016, Uros Bizjak wrote:
>
>> On Mon, Apr 25, 2016 at 4:47 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> On Mon, Apr 25, 2016 at 7:18 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>>> On Mon, Apr 25, 2016 at 2:51 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>>>>> Tested on Linux/x86-64.  OK for trunk?
>>>>
>>>>> +  /* FIXME: Since the CSE pass may change dominance info, which isn't
>>>>> +     expected by the fwprop pass, call free_dominance_info to
>>>>> +     invalidate dominance info.  Otherwise, the fwprop pass may crash
>>>>> +     when dominance info is changed.  */
>>>>> +  if (TARGET_64BIT)
>>>>> +    free_dominance_info (CDI_DOMINATORS);
>>>>> +
>>>>
>>>> Please resolve the above problem first, target-dependent sources are
>>>> not the place to apply band-aids for middle-end problems. The thread
>>>> with the proposed fix died in [1].
>>>>
>>>> [1] https://gcc.gnu.org/ml/gcc/2016-03/msg00143.html
>>>
>>> free_dominance_info (CDI_DOMINATORS) has been called in other
>>> places to avoid this middle-end issue.   I don't know when the middle-end
>>> will be fixed.  I don't think this target optimization should be penalized by
>>> the middle-end issue.
>>
>> Let's ask Richard if he is OK with the workaround...
>
> Well, it's ultimately your call (it's a workaround in the target).
>
> Of course I'd like to see the underlying issue fixed and the
> workarounds in "other places" be removed.
Agreed.  It's fundamentally papering over a problem elsewhere.

jeff
H.J. Lu April 26, 2016, 3:12 p.m. UTC | #15
On Tue, Apr 26, 2016 at 8:05 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> 2016-04-26 17:55 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
>> On Tue, Apr 26, 2016 at 7:15 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>>> 2016-04-26 17:07 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
>>>> On Mon, Apr 25, 2016 at 9:13 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>>>>> 2016-04-25 18:27 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
>>>>>>
>>>>>> Ilya, can you take a look?
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> --
>>>>>> H.J.
>>>>>
>>>>> Hi,
>>>>>
>>>>> Algorithmic part of the patch looks OK to me except the following piece of code.
>>>>>
>>>>> +/* Check REF's chain to add new insns into a queue
>>>>> +   and find registers requiring conversion.  */
>>>>>
>>>>> Comment is wrong because you don't have any conversions required for
>>>>> your candidates.
>>>>
>>>> I will fix it.
>>>>
>>>>> +
>>>>> +void
>>>>> +scalar_chain_64::analyze_register_chain (bitmap candidates, df_ref ref)
>>>>> +{
>>>>> +  df_link *chain;
>>>>> +
>>>>> +  gcc_assert (bitmap_bit_p (insns, DF_REF_INSN_UID (ref))
>>>>> +             || bitmap_bit_p (candidates, DF_REF_INSN_UID (ref)));
>>>>> +  add_to_queue (DF_REF_INSN_UID (ref));
>>>>> +
>>>>> +  for (chain = DF_REF_CHAIN (ref); chain; chain = chain->next)
>>>>> +    {
>>>>> +      unsigned uid = DF_REF_INSN_UID (chain->ref);
>>>>> +
>>>>> +      if (!NONDEBUG_INSN_P (DF_REF_INSN (chain->ref)))
>>>>> +       continue;
>>>>> +
>>>>> +      if (!DF_REF_REG_MEM_P (chain->ref))
>>>>> +       continue;
>>>>>
>>>>> I believe here you wrongly jump to the next ref intead of actually adding it
>>>>> to a queue.  You may just use
>>>>>
>>>>> gcc_assert (!DF_REF_REG_MEM_P (chain->ref));
>>>>>
>>>>> because you should'n have a candidate used in address operand.
>>>>
>>>> I will update.
>>>>
>>>>> +
>>>>> +      if (bitmap_bit_p (insns, uid))
>>>>> +       continue;
>>>>> +
>>>>> +      if (bitmap_bit_p (candidates, uid))
>>>>> +       add_to_queue (uid);
>>>>>
>>>>> Probably gcc_assert (bitmap_bit_p (candidates, uid)) since no uses and defs
>>>>> out of candidates list are allowed?
>>>>
>>>> That would be wrong since there are
>>>>
>>>>  while (!bitmap_empty_p (queue))
>>>>     {
>>>>       insn_uid = bitmap_first_set_bit (queue);
>>>>       bitmap_clear_bit (queue, insn_uid);
>>>>       bitmap_clear_bit (candidates, insn_uid);
>>>>       add_insn (candidates, insn_uid);
>>>>     }
>>>>
>>>> An instruction is a candidate and the bit is cleared when
>>>> analyze_register_chain is called.
>>>
>>> You clear candidates bit but the first thing you do in add_insn is set
>>> insns bit.
>>> Thus you should hit:
>>>
>>>       if (bitmap_bit_p (insns, uid))
>>>         continue;
>>>
>>> For handled candidates.
>>>
>>> Probably it would be more clear if we keep this clear/set pair
>>> together?  E.g. move
>>> bitmap_clear_bit (candidates, insn_uid) to scalar_chain::add_insn.
>>>
>>
>> After we started processing candidates, we only use candidates
>> to check if an instruction is a candidate, not to check if an
>> instruction is NOT a candidate.
>
> I don't see how it's related to what I said.  My point is that
> when you analyze added insn you shouldn't reach insns which are both
> not in candidates and not in current scalar_chain_64.  That's why I
> think you miss an assert in scalar_chain_64::analyze_register_chain.

Since all candidates will be processed by

 while (!bitmap_empty_p (queue))
    {
      insn_uid = bitmap_first_set_bit (queue);
      bitmap_clear_bit (queue, insn_uid);
      bitmap_clear_bit (candidates, insn_uid);
      add_insn (candidates, insn_uid);
    }

I will change to

/* Check REF's chain to add new insns into a queue.  */

void
timode_scalar_chain::analyze_register_chain (bitmap candidates,
                                             df_ref ref)
{
    gcc_assert (bitmap_bit_p (insns, DF_REF_INSN_UID (ref))
              || bitmap_bit_p (candidates, DF_REF_INSN_UID (ref)));
  add_to_queue (DF_REF_INSN_UID (ref));
}
H.J. Lu April 26, 2016, 3:21 p.m. UTC | #16
On Tue, Apr 26, 2016 at 2:35 AM, Richard Biener <rguenther@suse.de> wrote:
> On Tue, 26 Apr 2016, Uros Bizjak wrote:
>
>> On Tue, Apr 26, 2016 at 11:17 AM, Richard Biener <rguenther@suse.de> wrote:
>> > On Mon, 25 Apr 2016, Uros Bizjak wrote:
>> >
>> >> On Mon, Apr 25, 2016 at 4:47 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> >> > On Mon, Apr 25, 2016 at 7:18 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>> >> >> On Mon, Apr 25, 2016 at 2:51 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>> >> >>> Tested on Linux/x86-64.  OK for trunk?
>> >> >>
>> >> >>> +  /* FIXME: Since the CSE pass may change dominance info, which isn't
>> >> >>> +     expected by the fwprop pass, call free_dominance_info to
>> >> >>> +     invalidate dominance info.  Otherwise, the fwprop pass may crash
>> >> >>> +     when dominance info is changed.  */
>> >> >>> +  if (TARGET_64BIT)
>> >> >>> +    free_dominance_info (CDI_DOMINATORS);
>> >> >>> +
>> >> >>
>> >> >> Please resolve the above problem first, target-dependent sources are
>> >> >> not the place to apply band-aids for middle-end problems. The thread
>> >> >> with the proposed fix died in [1].
>> >> >>
>> >> >> [1] https://gcc.gnu.org/ml/gcc/2016-03/msg00143.html
>> >> >
>> >> > free_dominance_info (CDI_DOMINATORS) has been called in other
>> >> > places to avoid this middle-end issue.   I don't know when the middle-end
>> >> > will be fixed.  I don't think this target optimization should be penalized by
>> >> > the middle-end issue.
>> >>
>> >> Let's ask Richard if he is OK with the workaround...
>> >
>> > Well, it's ultimately your call (it's a workaround in the target).
>>
>> Oh well, ...
>>
>> > Of course I'd like to see the underlying issue fixed and the
>> > workarounds in "other places" be removed.
>>
>> ... then at least a reference to a relevant PR should be added to a
>> FIXME comment.
>
> HJ, can you please open a bug with 1) a testcase, 2) a patch to revert
> the workaround so it shows the ICE and 3) a pointer to the ml thread
> with your preliminary analysis?

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70807
Ilya Enkovich April 26, 2016, 3:21 p.m. UTC | #17
2016-04-26 18:12 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
> On Tue, Apr 26, 2016 at 8:05 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>> 2016-04-26 17:55 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
>>> On Tue, Apr 26, 2016 at 7:15 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>>>> 2016-04-26 17:07 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
>>>>> On Mon, Apr 25, 2016 at 9:13 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>>>>>> 2016-04-25 18:27 GMT+03:00 H.J. Lu <hjl.tools@gmail.com>:
>>>>>>>
>>>>>>> Ilya, can you take a look?
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> --
>>>>>>> H.J.
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Algorithmic part of the patch looks OK to me except the following piece of code.
>>>>>>
>>>>>> +/* Check REF's chain to add new insns into a queue
>>>>>> +   and find registers requiring conversion.  */
>>>>>>
>>>>>> Comment is wrong because you don't have any conversions required for
>>>>>> your candidates.
>>>>>
>>>>> I will fix it.
>>>>>
>>>>>> +
>>>>>> +void
>>>>>> +scalar_chain_64::analyze_register_chain (bitmap candidates, df_ref ref)
>>>>>> +{
>>>>>> +  df_link *chain;
>>>>>> +
>>>>>> +  gcc_assert (bitmap_bit_p (insns, DF_REF_INSN_UID (ref))
>>>>>> +             || bitmap_bit_p (candidates, DF_REF_INSN_UID (ref)));
>>>>>> +  add_to_queue (DF_REF_INSN_UID (ref));
>>>>>> +
>>>>>> +  for (chain = DF_REF_CHAIN (ref); chain; chain = chain->next)
>>>>>> +    {
>>>>>> +      unsigned uid = DF_REF_INSN_UID (chain->ref);
>>>>>> +
>>>>>> +      if (!NONDEBUG_INSN_P (DF_REF_INSN (chain->ref)))
>>>>>> +       continue;
>>>>>> +
>>>>>> +      if (!DF_REF_REG_MEM_P (chain->ref))
>>>>>> +       continue;
>>>>>>
>>>>>> I believe here you wrongly jump to the next ref intead of actually adding it
>>>>>> to a queue.  You may just use
>>>>>>
>>>>>> gcc_assert (!DF_REF_REG_MEM_P (chain->ref));
>>>>>>
>>>>>> because you should'n have a candidate used in address operand.
>>>>>
>>>>> I will update.
>>>>>
>>>>>> +
>>>>>> +      if (bitmap_bit_p (insns, uid))
>>>>>> +       continue;
>>>>>> +
>>>>>> +      if (bitmap_bit_p (candidates, uid))
>>>>>> +       add_to_queue (uid);
>>>>>>
>>>>>> Probably gcc_assert (bitmap_bit_p (candidates, uid)) since no uses and defs
>>>>>> out of candidates list are allowed?
>>>>>
>>>>> That would be wrong since there are
>>>>>
>>>>>  while (!bitmap_empty_p (queue))
>>>>>     {
>>>>>       insn_uid = bitmap_first_set_bit (queue);
>>>>>       bitmap_clear_bit (queue, insn_uid);
>>>>>       bitmap_clear_bit (candidates, insn_uid);
>>>>>       add_insn (candidates, insn_uid);
>>>>>     }
>>>>>
>>>>> An instruction is a candidate and the bit is cleared when
>>>>> analyze_register_chain is called.
>>>>
>>>> You clear candidates bit but the first thing you do in add_insn is set
>>>> insns bit.
>>>> Thus you should hit:
>>>>
>>>>       if (bitmap_bit_p (insns, uid))
>>>>         continue;
>>>>
>>>> For handled candidates.
>>>>
>>>> Probably it would be more clear if we keep this clear/set pair
>>>> together?  E.g. move
>>>> bitmap_clear_bit (candidates, insn_uid) to scalar_chain::add_insn.
>>>>
>>>
>>> After we started processing candidates, we only use candidates
>>> to check if an instruction is a candidate, not to check if an
>>> instruction is NOT a candidate.
>>
>> I don't see how it's related to what I said.  My point is that
>> when you analyze added insn you shouldn't reach insns which are both
>> not in candidates and not in current scalar_chain_64.  That's why I
>> think you miss an assert in scalar_chain_64::analyze_register_chain.
>
> Since all candidates will be processed by
>
>  while (!bitmap_empty_p (queue))
>     {
>       insn_uid = bitmap_first_set_bit (queue);
>       bitmap_clear_bit (queue, insn_uid);
>       bitmap_clear_bit (candidates, insn_uid);
>       add_insn (candidates, insn_uid);
>     }

This process only queue, not all candidates.  analyze_register_chain
fills queue bitmap to build a chain.

>
> I will change to
>
> /* Check REF's chain to add new insns into a queue.  */
>
> void
> timode_scalar_chain::analyze_register_chain (bitmap candidates,
>                                              df_ref ref)
> {
>     gcc_assert (bitmap_bit_p (insns, DF_REF_INSN_UID (ref))
>               || bitmap_bit_p (candidates, DF_REF_INSN_UID (ref)));
>   add_to_queue (DF_REF_INSN_UID (ref));
> }
>

The purpose of analyze_register_chain is to collect all register defs
or uses into chain's queue.  You can't just remove a loop over ref's
chain then.

Thanks,
Ilya

>
>
> --
> H.J.
diff mbox

Patch

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a061b2f..c5296ae 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2769,11 +2769,10 @@  convertible_comparison_p (rtx_insn *insn)
   return true;
 }
 
-/* Return 1 if INSN may be converted into vector
-   instruction.  */
+/* The 32-bit version of scalar_to_vector_candidate_p.  */
 
 static bool
-scalar_to_vector_candidate_p (rtx_insn *insn)
+scalar_to_vector_candidate_p_32 (rtx_insn *insn)
 {
   rtx def_set = single_set (insn);
 
@@ -2833,16 +2832,79 @@  scalar_to_vector_candidate_p (rtx_insn *insn)
   return true;
 }
 
-/* For a given bitmap of insn UIDs scans all instruction and
-   remove insn from CANDIDATES in case it has both convertible
-   and not convertible definitions.
+/* The 64-bit version of scalar_to_vector_candidate_p.  */
 
-   All insns in a bitmap are conversion candidates according to
-   scalar_to_vector_candidate_p.  Currently it implies all insns
-   are single_set.  */
+static bool
+scalar_to_vector_candidate_p_64 (rtx_insn *insn)
+{
+  rtx def_set = single_set (insn);
+
+  if (!def_set)
+    return false;
+
+  if (has_non_address_hard_reg (insn))
+    return false;
+
+  rtx src = SET_SRC (def_set);
+  rtx dst = SET_DEST (def_set);
+
+  /* Only TImode load and store are allowed.  */
+  if (GET_MODE (dst) != TImode)
+    return false;
+
+  if (MEM_P (dst))
+    {
+      /* Check for store.  Only support store from register or standard
+	 SSE constants.  */
+      switch (GET_CODE (src))
+	{
+	default:
+	  return false;
+
+	case REG:
+	  /* For store from register, memory must be aligned or both
+	     unaligned load and store are optimal.  */
+	  return (!misaligned_operand (dst, TImode)
+		  || (TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
+		      && TARGET_SSE_UNALIGNED_STORE_OPTIMAL));
+
+	case CONST_INT:
+	  /* For store from standard SSE constant, memory must be
+	     aligned or unaligned store is optimal.  */
+	  return (standard_sse_constant_p (src, TImode)
+		  && (!misaligned_operand (dst, TImode)
+		      || TARGET_SSE_UNALIGNED_STORE_OPTIMAL));
+	}
+    }
+  else if (MEM_P (src))
+    {
+      /* Check for load.  Memory must be aligned or both unaligned
+	 load and store are optimal.  */
+      return (GET_CODE (dst) == REG
+	      && (!misaligned_operand (src, TImode)
+		  || (TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
+		      && TARGET_SSE_UNALIGNED_STORE_OPTIMAL)));
+    }
+
+  return false;
+}
+
+/* Return 1 if INSN may be converted into vector
+   instruction.  */
+
+static bool
+scalar_to_vector_candidate_p (rtx_insn *insn)
+{
+  if (TARGET_64BIT)
+    return scalar_to_vector_candidate_p_64 (insn);
+  else
+    return scalar_to_vector_candidate_p_32 (insn);
+}
+
+/* The 32-bit version of remove_non_convertible_regs.  */
 
 static void
-remove_non_convertible_regs (bitmap candidates)
+remove_non_convertible_regs_32 (bitmap candidates)
 {
   bitmap_iterator bi;
   unsigned id;
@@ -2893,11 +2955,130 @@  remove_non_convertible_regs (bitmap candidates)
   BITMAP_FREE (regs);
 }
 
+/* For a register REGNO, scan instructions for its defs and uses.
+   Put REGNO in REGS if a def or use isn't in CANDIDATES.  */
+
+static void
+check_non_convertible_regs_64 (bitmap candidates, bitmap regs,
+			       unsigned int regno)
+{
+  for (df_ref def = DF_REG_DEF_CHAIN (regno);
+       def;
+       def = DF_REF_NEXT_REG (def))
+    {
+      if (!bitmap_bit_p (candidates, DF_REF_INSN_UID (def)))
+	{
+	  if (dump_file)
+	    fprintf (dump_file,
+		     "r%d has non convertible def in insn %d\n",
+		     regno, DF_REF_INSN_UID (def));
+
+	  bitmap_set_bit (regs, regno);
+	  break;
+	}
+    }
+
+  for (df_ref ref = DF_REG_USE_CHAIN (regno);
+       ref;
+       ref = DF_REF_NEXT_REG (ref))
+    {
+      /* Debug instructions are skipped.  */
+      if (NONDEBUG_INSN_P (DF_REF_INSN (ref))
+	  && !bitmap_bit_p (candidates, DF_REF_INSN_UID (ref)))
+	{
+	  if (dump_file)
+	    fprintf (dump_file,
+		     "r%d has non convertible use in insn %d\n",
+		     regno, DF_REF_INSN_UID (ref));
+
+	  bitmap_set_bit (regs, regno);
+	  break;
+	}
+    }
+}
+
+/* The 64-bit version of remove_non_convertible_regs.  */
+
+static void
+remove_non_convertible_regs_64 (bitmap candidates)
+{
+  bitmap_iterator bi;
+  unsigned id;
+  bitmap regs = BITMAP_ALLOC (NULL);
+
+  EXECUTE_IF_SET_IN_BITMAP (candidates, 0, id, bi)
+    {
+      rtx def_set = single_set (DF_INSN_UID_GET (id)->insn);
+      rtx dest = SET_DEST (def_set);
+      rtx src = SET_SRC (def_set);
+
+      if ((!REG_P (dest)
+	   || bitmap_bit_p (regs, REGNO (dest))
+	   || HARD_REGISTER_P (dest))
+	  && (!REG_P (src)
+	      || bitmap_bit_p (regs, REGNO (src))
+	      || HARD_REGISTER_P (src)))
+	continue;
+
+      if (REG_P (dest))
+	check_non_convertible_regs_64 (candidates, regs, REGNO (dest));
+
+      if (REG_P (src))
+	check_non_convertible_regs_64 (candidates, regs, REGNO (src));
+    }
+
+  EXECUTE_IF_SET_IN_BITMAP (regs, 0, id, bi)
+    {
+      for (df_ref def = DF_REG_DEF_CHAIN (id);
+	   def;
+	   def = DF_REF_NEXT_REG (def))
+	if (bitmap_bit_p (candidates, DF_REF_INSN_UID (def)))
+	  {
+	    if (dump_file)
+	      fprintf (dump_file, "Removing insn %d from candidates list\n",
+		       DF_REF_INSN_UID (def));
+
+	    bitmap_clear_bit (candidates, DF_REF_INSN_UID (def));
+	  }
+
+      for (df_ref ref = DF_REG_USE_CHAIN (id);
+	   ref;
+	   ref = DF_REF_NEXT_REG (ref))
+	if (bitmap_bit_p (candidates, DF_REF_INSN_UID (ref)))
+	  {
+	    if (dump_file)
+	      fprintf (dump_file, "Removing insn %d from candidates list\n",
+		       DF_REF_INSN_UID (ref));
+
+	    bitmap_clear_bit (candidates, DF_REF_INSN_UID (ref));
+	  }
+    }
+
+  BITMAP_FREE (regs);
+}
+
+/* For a given bitmap of insn UIDs scans all instruction and
+   remove insn from CANDIDATES in case it has both convertible
+   and not convertible definitions.
+
+   All insns in a bitmap are conversion candidates according to
+   scalar_to_vector_candidate_p.  Currently it implies all insns
+   are single_set.  */
+
+static void
+remove_non_convertible_regs (bitmap candidates)
+{
+  if (TARGET_64BIT)
+    remove_non_convertible_regs_64 (candidates);
+  else
+    remove_non_convertible_regs_32 (candidates);
+}
+
 class scalar_chain
 {
  public:
   scalar_chain ();
-  ~scalar_chain ();
+  virtual ~scalar_chain ();
 
   static unsigned max_id;
 
@@ -2913,21 +3094,55 @@  class scalar_chain
   bitmap defs_conv;
 
   void build (bitmap candidates, unsigned insn_uid);
-  int compute_convert_gain ();
+  virtual int compute_convert_gain () = 0;
   int convert ();
 
+ protected:
+  void add_to_queue (unsigned insn_uid);
+  void emit_conversion_insns (rtx insns, rtx_insn *pos);
+
  private:
   void add_insn (bitmap candidates, unsigned insn_uid);
-  void add_to_queue (unsigned insn_uid);
+  virtual void convert_insn (rtx_insn *insn) = 0;
+  virtual void convert_registers () = 0;
+  virtual void analyze_register_chain (bitmap candidates, df_ref ref) = 0;
+};
+
+class scalar_chain_32 : public scalar_chain
+{
+ public:
+  int compute_convert_gain ();
+ private:
   void mark_dual_mode_def (df_ref def);
   void analyze_register_chain (bitmap candidates, df_ref ref);
   rtx replace_with_subreg (rtx x, rtx reg, rtx subreg);
-  void emit_conversion_insns (rtx insns, rtx_insn *pos);
   void replace_with_subreg_in_insn (rtx_insn *insn, rtx reg, rtx subreg);
   void convert_insn (rtx_insn *insn);
   void convert_op (rtx *op, rtx_insn *insn);
   void convert_reg (unsigned regno);
   void make_vector_copies (unsigned regno);
+  void convert_registers ();
+};
+
+class scalar_chain_64 : public scalar_chain
+{
+ public:
+  scalar_chain_64 (rtx r0, rtx r1)
+    : scalar_chain (), zero (r0), minus_one (r1) { }
+
+  /* Convert from TImode to V1TImode is always faster.  */
+  int compute_convert_gain () { return 1; }
+
+ private:
+  void analyze_register_chain (bitmap candidates, df_ref ref);
+  void convert_insn (rtx_insn *insn);
+  /* We don't convert registers to difference size.  */
+  void convert_registers () {}
+
+  /* Use one scatch register for loading CONST0_RTX and one for loading
+     CONSTM1_RTX so that they can be CSEed.  */
+  rtx zero;
+  rtx minus_one;
 };
 
 unsigned scalar_chain::max_id = 0;
@@ -2976,7 +3191,7 @@  scalar_chain::add_to_queue (unsigned insn_uid)
 /* Mark register defined by DEF as requiring conversion.  */
 
 void
-scalar_chain::mark_dual_mode_def (df_ref def)
+scalar_chain_32::mark_dual_mode_def (df_ref def)
 {
   gcc_assert (DF_REF_REG_DEF_P (def));
 
@@ -2995,7 +3210,7 @@  scalar_chain::mark_dual_mode_def (df_ref def)
    and find registers requiring conversion.  */
 
 void
-scalar_chain::analyze_register_chain (bitmap candidates, df_ref ref)
+scalar_chain_32::analyze_register_chain (bitmap candidates, df_ref ref)
 {
   df_link *chain;
 
@@ -3039,6 +3254,36 @@  scalar_chain::analyze_register_chain (bitmap candidates, df_ref ref)
     }
 }
 
+/* Check REF's chain to add new insns into a queue
+   and find registers requiring conversion.  */
+
+void
+scalar_chain_64::analyze_register_chain (bitmap candidates, df_ref ref)
+{
+  df_link *chain;
+
+  gcc_assert (bitmap_bit_p (insns, DF_REF_INSN_UID (ref))
+	      || bitmap_bit_p (candidates, DF_REF_INSN_UID (ref)));
+  add_to_queue (DF_REF_INSN_UID (ref));
+
+  for (chain = DF_REF_CHAIN (ref); chain; chain = chain->next)
+    {
+      unsigned uid = DF_REF_INSN_UID (chain->ref);
+
+      if (!NONDEBUG_INSN_P (DF_REF_INSN (chain->ref)))
+	continue;
+
+      if (!DF_REF_REG_MEM_P (chain->ref))
+	continue;
+
+      if (bitmap_bit_p (insns, uid))
+	continue;
+
+      if (bitmap_bit_p (candidates, uid))
+	add_to_queue (uid);
+    }
+}
+
 /* Add instruction into a chain.  */
 
 void
@@ -3117,7 +3362,7 @@  scalar_chain::build (bitmap candidates, unsigned insn_uid)
 /* Compute a gain for chain conversion.  */
 
 int
-scalar_chain::compute_convert_gain ()
+scalar_chain_32::compute_convert_gain ()
 {
   bitmap_iterator bi;
   unsigned insn_uid;
@@ -3174,7 +3419,7 @@  scalar_chain::compute_convert_gain ()
 /* Replace REG in X with a V2DI subreg of NEW_REG.  */
 
 rtx
-scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
+scalar_chain_32::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
 {
   if (x == reg)
     return gen_rtx_SUBREG (V2DImode, new_reg, 0);
@@ -3197,7 +3442,7 @@  scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
 /* Replace REG in INSN with a V2DI subreg of NEW_REG.  */
 
 void
-scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, rtx reg, rtx new_reg)
+scalar_chain_32::replace_with_subreg_in_insn (rtx_insn *insn, rtx reg, rtx new_reg)
 {
   replace_with_subreg (single_set (insn), reg, new_reg);
 }
@@ -3227,7 +3472,7 @@  scalar_chain::emit_conversion_insns (rtx insns, rtx_insn *after)
    and replace its uses in a chain.  */
 
 void
-scalar_chain::make_vector_copies (unsigned regno)
+scalar_chain_32::make_vector_copies (unsigned regno)
 {
   rtx reg = regno_reg_rtx[regno];
   rtx vreg = gen_reg_rtx (DImode);
@@ -3298,7 +3543,7 @@  scalar_chain::make_vector_copies (unsigned regno)
    in case register is used in not convertible insn.  */
 
 void
-scalar_chain::convert_reg (unsigned regno)
+scalar_chain_32::convert_reg (unsigned regno)
 {
   bool scalar_copy = bitmap_bit_p (defs_conv, regno);
   rtx reg = regno_reg_rtx[regno];
@@ -3390,7 +3635,7 @@  scalar_chain::convert_reg (unsigned regno)
    registers conversion.  */
 
 void
-scalar_chain::convert_op (rtx *op, rtx_insn *insn)
+scalar_chain_32::convert_op (rtx *op, rtx_insn *insn)
 {
   *op = copy_rtx_if_shared (*op);
 
@@ -3434,7 +3679,7 @@  scalar_chain::convert_op (rtx *op, rtx_insn *insn)
 /* Convert INSN to vector mode.  */
 
 void
-scalar_chain::convert_insn (rtx_insn *insn)
+scalar_chain_32::convert_insn (rtx_insn *insn)
 {
   rtx def_set = single_set (insn);
   rtx src = SET_SRC (def_set);
@@ -3511,6 +3756,88 @@  scalar_chain::convert_insn (rtx_insn *insn)
   df_insn_rescan (insn);
 }
 
+/* Convert INSN from TImode to V1T1mode.  */
+
+void
+scalar_chain_64::convert_insn (rtx_insn *insn)
+{
+  rtx def_set = single_set (insn);
+  rtx src = SET_SRC (def_set);
+  rtx tmp;
+  rtx dst = SET_DEST (def_set);
+
+  switch (GET_CODE (dst))
+    {
+    case REG:
+      tmp = find_reg_equal_equiv_note (insn);
+      if (tmp)
+	PUT_MODE (XEXP (tmp, 0), V1TImode);
+    case MEM:
+      PUT_MODE (dst, V1TImode);
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+
+  switch (GET_CODE (src))
+    {
+    case REG:
+    case MEM:
+      PUT_MODE (src, V1TImode);
+      break;
+
+    case CONST_INT:
+      switch (standard_sse_constant_p (src, TImode))
+	{
+	case 1:
+	  src = CONST0_RTX (GET_MODE (dst));
+	  tmp = zero;
+	  break;
+	case 2:
+	  src = CONSTM1_RTX (GET_MODE (dst));
+	  tmp = minus_one;
+	  break;
+	default:
+	  gcc_unreachable ();
+	}
+      if (NONDEBUG_INSN_P (insn))
+	{
+	  /* Since there are no instructions to store standard SSE
+	     constant, temporary register usage is required.  */
+	  emit_conversion_insns (gen_rtx_SET (dst, tmp), insn);
+	  dst = tmp;
+	}
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+
+  SET_SRC (def_set) = src;
+  SET_DEST (def_set) = dst;
+
+  /* Drop possible dead definitions.  */
+  PATTERN (insn) = def_set;
+
+  INSN_CODE (insn) = -1;
+  recog_memoized (insn);
+  df_insn_rescan (insn);
+}
+
+void
+scalar_chain_32::convert_registers ()
+{
+  bitmap_iterator bi;
+  unsigned id;
+
+  EXECUTE_IF_SET_IN_BITMAP (defs, 0, id, bi)
+    convert_reg (id);
+
+  EXECUTE_IF_AND_COMPL_IN_BITMAP (defs_conv, defs, 0, id, bi)
+    make_vector_copies (id);
+}
+
 /* Convert whole chain creating required register
    conversions and copies.  */
 
@@ -3527,11 +3854,7 @@  scalar_chain::convert ()
   if (dump_file)
     fprintf (dump_file, "Converting chain #%d...\n", chain_id);
 
-  EXECUTE_IF_SET_IN_BITMAP (defs, 0, id, bi)
-    convert_reg (id);
-
-  EXECUTE_IF_AND_COMPL_IN_BITMAP (defs_conv, defs, 0, id, bi)
-    make_vector_copies (id);
+  convert_registers ();
 
   EXECUTE_IF_SET_IN_BITMAP (insns, 0, id, bi)
     {
@@ -3551,6 +3874,7 @@  convert_scalars_to_vector ()
   basic_block bb;
   bitmap candidates;
   int converted_insns = 0;
+  rtx zero, minus_one;
 
   bitmap_obstack_initialize (NULL);
   candidates = BITMAP_ALLOC (NULL);
@@ -3585,22 +3909,40 @@  convert_scalars_to_vector ()
     if (dump_file)
       fprintf (dump_file, "There are no candidates for optimization.\n");
 
+  if (TARGET_64BIT)
+    {
+      zero = gen_reg_rtx (V1TImode);
+      minus_one = gen_reg_rtx (V1TImode);
+    }
+  else
+    {
+      zero = NULL_RTX;
+      minus_one = NULL_RTX;
+    }
+
   while (!bitmap_empty_p (candidates))
     {
       unsigned uid = bitmap_first_set_bit (candidates);
-      scalar_chain chain;
+      scalar_chain *chain;
+
+      if (TARGET_64BIT)
+	chain = new scalar_chain_64 (zero, minus_one);
+      else
+	chain = new scalar_chain_32;
 
       /* Find instructions chain we want to convert to vector mode.
 	 Check all uses and definitions to estimate all required
 	 conversions.  */
-      chain.build (candidates, uid);
+      chain->build (candidates, uid);
 
-      if (chain.compute_convert_gain () > 0)
-	converted_insns += chain.convert ();
+      if (chain->compute_convert_gain () > 0)
+	converted_insns += chain->convert ();
       else
 	if (dump_file)
 	  fprintf (dump_file, "Chain #%d conversion is not profitable\n",
-		   chain.chain_id);
+		   chain->chain_id);
+
+      delete chain;
     }
 
   if (dump_file)
@@ -3610,6 +3952,13 @@  convert_scalars_to_vector ()
   bitmap_obstack_release (NULL);
   df_process_deferred_rescans ();
 
+  /* FIXME: Since the CSE pass may change dominance info, which isn't
+     expected by the fwprop pass, call free_dominance_info to
+     invalidate dominance info.  Otherwise, the fwprop pass may crash
+     when dominance info is changed.  */
+  if (TARGET_64BIT)
+    free_dominance_info (CDI_DOMINATORS);
+
   /* Conversion means we may have 128bit register spills/fills
      which require aligned stack.  */
   if (converted_insns)
@@ -3683,7 +4032,7 @@  public:
   /* opt_pass methods: */
   virtual bool gate (function *)
     {
-      return !TARGET_64BIT && TARGET_STV && TARGET_SSE2 && optimize > 1;
+      return TARGET_STV && TARGET_SSE2 && optimize > 1;
     }
 
   virtual unsigned int execute (function *)
@@ -5596,17 +5945,23 @@  ix86_option_override (void)
 	1, PASS_POS_INSERT_AFTER
       };
   opt_pass *pass_stv = make_pass_stv (g);
-  struct register_pass_info stv_info
+  struct register_pass_info stv_info_32
     = { pass_stv, "combine",
 	1, PASS_POS_INSERT_AFTER
       };
+  /* Run the 64-bit STV pass before the CSE pass so that CONST0_RTX and
+     CONSTM1_RTX generated by the STV pass can be CSEed.  */
+  struct register_pass_info stv_info_64
+    = { pass_stv, "cse2",
+	1, PASS_POS_INSERT_BEFORE
+      };
 
   ix86_option_override_internal (true, &global_options, &global_options_set);
 
 
   /* This needs to be done at start up.  It's convenient to do it here.  */
   register_pass (&insert_vzeroupper_info);
-  register_pass (&stv_info);
+  register_pass (TARGET_64BIT ? &stv_info_64 : &stv_info_32);
 }
 
 /* Implement the TARGET_OFFLOAD_OPTIONS hook.  */
diff --git a/gcc/testsuite/gcc.target/i386/pr55247-2.c b/gcc/testsuite/gcc.target/i386/pr55247-2.c
index 6b5b36d..77901af 100644
--- a/gcc/testsuite/gcc.target/i386/pr55247-2.c
+++ b/gcc/testsuite/gcc.target/i386/pr55247-2.c
@@ -1,6 +1,6 @@ 
 /* { dg-do compile { target { ! ia32 } } } */
 /* { dg-require-effective-target maybe_x32 } */
-/* { dg-options "-O2 -mx32 -mtune=generic -maddress-mode=long" } */
+/* { dg-options "-O2 -mx32 -mtune=generic -maddress-mode=long -dp" } */
 
 typedef unsigned int uint32_t;
 typedef uint32_t Elf32_Word;
@@ -34,4 +34,5 @@  _dl_profile_fixup (struct link_map *l, Elf32_Word reloc_arg)
   symbind32 (&sym);
 }
 
-/* { dg-final { scan-assembler-not "%xmm\[0-9\]" } } */
+/* { dg-final { scan-assembler-times "movv1ti_internal" 2 } } */
+/* { dg-final { scan-assembler-not "movti_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-1.c b/gcc/testsuite/gcc.target/i386/pr70155-1.c
new file mode 100644
index 0000000..3500364
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-1.c
@@ -0,0 +1,13 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -dp" } */
+
+extern __int128 a, b;
+
+void
+foo (void)
+{
+  a = b;
+}
+
+/* { dg-final { scan-assembler-times "movv1ti_internal" 2 } } */
+/* { dg-final { scan-assembler-not "\\*movdi_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-10.c b/gcc/testsuite/gcc.target/i386/pr70155-10.c
new file mode 100644
index 0000000..2d0b91f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-10.c
@@ -0,0 +1,19 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=core2 -dp" } */
+
+extern __int128 a;
+
+struct foo
+{
+  __int128 i;
+}__attribute__ ((packed));
+
+extern struct foo x;
+
+void
+foo (void)
+{
+  a = x.i;
+}
+
+/* { dg-final { scan-assembler-not "movv1ti_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-11.c b/gcc/testsuite/gcc.target/i386/pr70155-11.c
new file mode 100644
index 0000000..b00aa13
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-11.c
@@ -0,0 +1,19 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=core2 -dp" } */
+
+extern __int128 a;
+
+struct foo
+{
+  __int128 i;
+}__attribute__ ((packed));
+
+extern struct foo x;
+
+void
+foo (void)
+{
+  x.i = a;
+}
+
+/* { dg-final { scan-assembler-not "movv1ti_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-12.c b/gcc/testsuite/gcc.target/i386/pr70155-12.c
new file mode 100644
index 0000000..dd0edf0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-12.c
@@ -0,0 +1,17 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=core2 -dp" } */
+
+struct foo
+{
+  __int128 i;
+}__attribute__ ((packed));
+
+extern struct foo x;
+
+void
+foo (void)
+{
+  x.i = 0;
+}
+
+/* { dg-final { scan-assembler-not "movv1ti_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-13.c b/gcc/testsuite/gcc.target/i386/pr70155-13.c
new file mode 100644
index 0000000..6718290
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-13.c
@@ -0,0 +1,17 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=core2 -dp" } */
+
+struct foo
+{
+  __int128 i;
+}__attribute__ ((packed));
+
+extern struct foo x;
+
+void
+foo (void)
+{
+  x.i = -1;
+}
+
+/* { dg-final { scan-assembler-not "movv1ti_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-14.c b/gcc/testsuite/gcc.target/i386/pr70155-14.c
new file mode 100644
index 0000000..a43de2e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-14.c
@@ -0,0 +1,17 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -dp" } */
+
+struct foo
+{
+  __int128 i;
+}__attribute__ ((packed));
+
+extern struct foo x;
+
+void
+foo (void)
+{
+  x.i = 2;
+}
+
+/* { dg-final { scan-assembler-not "movv1ti_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-15.c b/gcc/testsuite/gcc.target/i386/pr70155-15.c
new file mode 100644
index 0000000..e9cafcc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-15.c
@@ -0,0 +1,18 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=core2 -mtune-ctrl=sse_unaligned_store_optimal -dp" } */
+
+struct foo
+{
+  __int128 i;
+}__attribute__ ((packed));
+
+extern struct foo x;
+
+void
+foo (void)
+{
+  x.i = 0;
+}
+
+/* { dg-final { scan-assembler-times "movv1ti_internal" 2 } } */
+/* { dg-final { scan-assembler-not "\\*movdi_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-16.c b/gcc/testsuite/gcc.target/i386/pr70155-16.c
new file mode 100644
index 0000000..7750b58
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-16.c
@@ -0,0 +1,17 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=core2 -mtune-ctrl=sse_unaligned_load_optimal -dp" } */
+
+struct foo
+{
+  __int128 i;
+}__attribute__ ((packed));
+
+extern struct foo x;
+
+void
+foo (void)
+{
+  x.i = 0;
+}
+
+/* { dg-final { scan-assembler-not "movv1ti_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-17.c b/gcc/testsuite/gcc.target/i386/pr70155-17.c
new file mode 100644
index 0000000..a9427e6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-17.c
@@ -0,0 +1,18 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -dp" } */
+
+extern __int128 a, b, c, d, e, f;
+
+void
+foo (void)
+{
+  a = 0;
+  b = -1;
+  c = 0;
+  d = -1;
+  e = 0;
+  f = -1;
+}
+
+/* { dg-final { scan-assembler-times "movv1ti_internal" 8 } } */
+/* { dg-final { scan-assembler-not "\\*movdi_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-18.c b/gcc/testsuite/gcc.target/i386/pr70155-18.c
new file mode 100644
index 0000000..eb9db68
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-18.c
@@ -0,0 +1,12 @@ 
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -dp" } */
+
+extern char *src, *dst;
+
+char *
+foo1 (void)
+{
+  return __builtin_memcpy (dst, src, 16);
+}
+
+/* { dg-final { scan-assembler-times "movv1ti_internal" 2 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-19.c b/gcc/testsuite/gcc.target/i386/pr70155-19.c
new file mode 100644
index 0000000..e2e73aa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-19.c
@@ -0,0 +1,12 @@ 
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -dp" } */
+
+extern char *src, *dst;
+
+char *
+foo1 (void)
+{
+  return __builtin_mempcpy (dst, src, 16);
+}
+
+/* { dg-final { scan-assembler-times "movv1ti_internal" 2 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-2.c b/gcc/testsuite/gcc.target/i386/pr70155-2.c
new file mode 100644
index 0000000..af2ddc6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-2.c
@@ -0,0 +1,18 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -dp" } */
+
+struct foo
+{
+  __int128 i;
+}__attribute__ ((packed));
+
+extern struct foo x, y;
+
+void
+foo (void)
+{
+  x = y;
+}
+
+/* { dg-final { scan-assembler-times "movv1ti_internal" 2 } } */
+/* { dg-final { scan-assembler-not "\\*movdi_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-20.c b/gcc/testsuite/gcc.target/i386/pr70155-20.c
new file mode 100644
index 0000000..10b8c45
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-20.c
@@ -0,0 +1,13 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -dp" } */
+
+extern __int128 a, b;
+
+__int128
+foo (void)
+{
+  a = b;
+  return b;
+}
+
+/* { dg-final { scan-assembler-not "movv1ti_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-21.c b/gcc/testsuite/gcc.target/i386/pr70155-21.c
new file mode 100644
index 0000000..be76e5f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-21.c
@@ -0,0 +1,13 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -dp" } */
+
+extern __int128 a, b, c;
+
+void
+foo (void)
+{
+  a = b;
+  c = a + 1;
+}
+
+/* { dg-final { scan-assembler-not "movv1ti_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-22.c b/gcc/testsuite/gcc.target/i386/pr70155-22.c
new file mode 100644
index 0000000..ff5cbce
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-22.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -dp" } */
+
+extern __int128 a, b, c;
+
+void
+foo (void)
+{
+  a = b;
+  c++;
+}
+
+/* { dg-final { scan-assembler-times "movv1ti_internal" 2 } } */
+/* { dg-final { scan-assembler-not "\\*movdi_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-3.c b/gcc/testsuite/gcc.target/i386/pr70155-3.c
new file mode 100644
index 0000000..01b38aa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-3.c
@@ -0,0 +1,20 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -dp" } */
+
+extern __int128 a;
+
+struct foo
+{
+  __int128 i;
+}__attribute__ ((packed));
+
+extern struct foo x;
+
+void
+foo (void)
+{
+  a = x.i;
+}
+
+/* { dg-final { scan-assembler-times "movv1ti_internal" 2 } } */
+/* { dg-final { scan-assembler-not "\\*movdi_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-4.c b/gcc/testsuite/gcc.target/i386/pr70155-4.c
new file mode 100644
index 0000000..31bc0a7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-4.c
@@ -0,0 +1,20 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -dp" } */
+
+extern __int128 a;
+
+struct foo
+{
+  __int128 i;
+}__attribute__ ((packed));
+
+extern struct foo x;
+
+void
+foo (void)
+{
+  x.i = a;
+}
+
+/* { dg-final { scan-assembler-times "movv1ti_internal" 2 } } */
+/* { dg-final { scan-assembler-not "\\*movdi_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-5.c b/gcc/testsuite/gcc.target/i386/pr70155-5.c
new file mode 100644
index 0000000..9647452
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-5.c
@@ -0,0 +1,13 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -dp" } */
+
+extern __int128 a;
+
+void
+foo (void)
+{
+  a = 0;
+}
+
+/* { dg-final { scan-assembler-times "movv1ti_internal" 2 } } */
+/* { dg-final { scan-assembler-not "\\*movdi_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-6.c b/gcc/testsuite/gcc.target/i386/pr70155-6.c
new file mode 100644
index 0000000..7e074a73
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-6.c
@@ -0,0 +1,13 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -dp" } */
+
+extern __int128 a;
+
+void
+foo (void)
+{
+  a = -1;
+}
+
+/* { dg-final { scan-assembler-times "movv1ti_internal" 2 } } */
+/* { dg-final { scan-assembler-not "\\*movdi_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-7.c b/gcc/testsuite/gcc.target/i386/pr70155-7.c
new file mode 100644
index 0000000..93c6fc0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-7.c
@@ -0,0 +1,18 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -dp" } */
+
+struct foo
+{
+  __int128 i;
+}__attribute__ ((packed));
+
+extern struct foo x;
+
+void
+foo (void)
+{
+  x.i = 0;
+}
+
+/* { dg-final { scan-assembler-times "movv1ti_internal" 2 } } */
+/* { dg-final { scan-assembler-not "\\*movdi_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-8.c b/gcc/testsuite/gcc.target/i386/pr70155-8.c
new file mode 100644
index 0000000..f304a4e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-8.c
@@ -0,0 +1,18 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=generic -dp" } */
+
+struct foo
+{
+  __int128 i;
+}__attribute__ ((packed));
+
+extern struct foo x;
+
+void
+foo (void)
+{
+  x.i = -1;
+}
+
+/* { dg-final { scan-assembler-times "movv1ti_internal" 2 } } */
+/* { dg-final { scan-assembler-not "\\*movdi_internal" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70155-9.c b/gcc/testsuite/gcc.target/i386/pr70155-9.c
new file mode 100644
index 0000000..5dc3a76
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70155-9.c
@@ -0,0 +1,17 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse2 -mtune=core2 -dp" } */
+
+struct foo
+{
+  __int128 i;
+}__attribute__ ((packed));
+
+extern struct foo x, y;
+
+void
+foo (void)
+{
+  x = y;
+}
+
+/* { dg-final { scan-assembler-not "movv1ti_internal" } } */