Fix PR90332 by extending half size vector mode

Message ID ca948c93-b697-f5ee-0387-73ed520ed0e3@linux.ibm.com
State New
Series Fix PR90332 by extending half size vector mode

Commit Message

Kewen.Lin March 18, 2020, 10:06 a.m. UTC
Hi,

As PR90332 shows, the current scalar epilogue peeling for gaps
elimination requires a vec_init optab that constructs a vector
from two half-size vector modes.  On Power, we don't support a
vector mode like V8QI, so we can't provide an optab such as
vec_initv16qiv8qi.  But we can leverage an existing scalar mode
like DI to init the desired vector mode.  This patch extends the
existing support for Power; as evaluated on Power9, we see the
expected 1.9% speedup on SPEC2017 525.x264_r.
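
For illustration, here is a minimal sketch of the kind of access pattern
involved (an assumed example, not the testcase from the PR): a grouped
load where only the first half of each 16-element group is used, so the
group has a gap of half the vector length and would otherwise need
peeling for gaps:

/* Illustrative only: load group size 16, 8 accessed elements, gap 8.  */
void
foo (unsigned char *restrict dst, unsigned char *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    for (int j = 0; j < 8; j++)
      dst[i * 8 + j] = src[i * 16 + j];
}

With V16QI vectors the used half of each group is 64 bits, so the idea
is to load it in DImode and compose the full vector via vec_initv2didi
(plus a VIEW_CONVERT_EXPR) instead of requiring the unsupported
vec_initv16qiv8qi.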

Bootstrapped/regtested on powerpc64le-linux-gnu (LE) P8 and P9.

Is it ok for trunk?  

BR,
Kewen
-----------

gcc/ChangeLog

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	PR tree-optimization/90332
	* gcc/tree-vectorizer.h (struct _stmt_vec_info): Add half_mode field.
	(DR_GROUP_HALF_MODE): New macro.
	* gcc/tree-vect-stmts.c (get_half_mode_for_vector): New function.
	(get_group_load_store_type): Call get_half_mode_for_vector to query
	whether the target supports a half-size mode, and update
	DR_GROUP_HALF_MODE if so.
	(vectorizable_load): Build appropriate vector type based on
	DR_GROUP_HALF_MODE.

Comments

Richard Biener March 18, 2020, 10:39 a.m. UTC | #1
On Wed, Mar 18, 2020 at 11:06 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi,
>
> As PR90332 shows, the current scalar epilogue peeling for gaps
> elimination requires expected vec_init optab with two half size
> vector mode.  On Power, we don't support vector mode like V8QI,
> so can't support optab like vec_initv16qiv8qi.  But we want to
> leverage existing scalar mode like DI to init the desirable
> vector mode.  This patch is to extend the existing support for
> Power, as evaluated on Power9 we can see expected 1.9% speed up
> on SPEC2017 525.x264_r.
>
> Bootstrapped/regtested on powerpc64le-linux-gnu (LE) P8 and P9.
>
> Is it ok for trunk?

There's already code exercising such a case in vectorizable_load
(VMAT_STRIDED_SLP) which you could have factored out.

 vectype, bool slp,
             than the alignment boundary B.  Every vector access will
             be a multiple of B and so we are guaranteed to access a
             non-gap element in the same B-sized block.  */
+         machine_mode half_mode;
          if (overrun_p
              && gap < (vect_known_alignment_in_bytes (first_dr_info)
                        / vect_get_scalar_dr_size (first_dr_info)))
-           overrun_p = false;
-
+           {
+             overrun_p = false;
+             if (known_eq (nunits, (group_size - gap) * 2)
+                 && known_eq (nunits, group_size)
+                 && get_half_mode_for_vector (vectype, &half_mode))
+               DR_GROUP_HALF_MODE (first_stmt_info) = half_mode;
+           }

why do you need to amend this case?

I don't like storing DR_GROUP_HALF_MODE very much, later
you need a vector type and it looks cheap enough to recompute
it where you need it?  Iff then it doesn't belong to DR_GROUP
but to the stmt-info.

I realize the original optimization was kind of a hack (and I was too
lazy to implement the integer mode construction path ...).

So, can you factor out the existing code into a function returning
the vector type for construction for a vector type and a
pieces size?  So for V16QI and a pieces-size of 4 we'd
get either V16QI back (then construction from V4QI pieces
should work) or V4SI (then construction from SImode pieces
should work)?  Eventually as secondary output provide that
piece type (SI / V4QI).

Thanks,
Richard.

> BR,
> Kewen
> -----------
>
> gcc/ChangeLog
>
> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
>
>         PR tree-optimization/90332
>         * gcc/tree-vectorizer.h (struct _stmt_vec_info): Add half_mode field.
>         (DR_GROUP_HALF_MODE): New macro.
>         * gcc/tree-vect-stmts.c (get_half_mode_for_vector): New function.
>         (get_group_load_store_type): Call get_half_mode_for_vector to query target
>         whether support half size mode and update DR_GROUP_HALF_MODE if yes.
>         (vectorizable_load): Build appropriate vector type based on
>         DR_GROUP_HALF_MODE.
Richard Biener March 18, 2020, 10:40 a.m. UTC | #2
On Wed, Mar 18, 2020 at 11:39 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Wed, Mar 18, 2020 at 11:06 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >
> > Hi,
> >
> > As PR90332 shows, the current scalar epilogue peeling for gaps
> > elimination requires expected vec_init optab with two half size
> > vector mode.  On Power, we don't support vector mode like V8QI,
> > so can't support optab like vec_initv16qiv8qi.  But we want to
> > leverage existing scalar mode like DI to init the desirable
> > vector mode.  This patch is to extend the existing support for
> > Power, as evaluated on Power9 we can see expected 1.9% speed up
> > on SPEC2017 525.x264_r.
> >
> > Bootstrapped/regtested on powerpc64le-linux-gnu (LE) P8 and P9.
> >
> > Is it ok for trunk?
>
> There's already code exercising such a case in vectorizable_load
> (VMAT_STRIDED_SLP) which you could have factored out.
>
>  vectype, bool slp,
>              than the alignment boundary B.  Every vector access will
>              be a multiple of B and so we are guaranteed to access a
>              non-gap element in the same B-sized block.  */
> +         machine_mode half_mode;
>           if (overrun_p
>               && gap < (vect_known_alignment_in_bytes (first_dr_info)
>                         / vect_get_scalar_dr_size (first_dr_info)))
> -           overrun_p = false;
> -
> +           {
> +             overrun_p = false;
> +             if (known_eq (nunits, (group_size - gap) * 2)
> +                 && known_eq (nunits, group_size)
> +                 && get_half_mode_for_vector (vectype, &half_mode))
> +               DR_GROUP_HALF_MODE (first_stmt_info) = half_mode;
> +           }
>
> why do you need to amend this case?
>
> I don't like storing DR_GROUP_HALF_MODE very much, later
> you need a vector type and it looks cheap enough to recompute
> it where you need it?  Iff then it doesn't belong to DR_GROUP
> but to the stmt-info.
>
> I realize the original optimization was kind of a hack (and I was too
> lazy to implement the integer mode construction path ...).
>
> So, can you factor out the existing code into a function returning
> the vector type for construction for a vector type and a
> pieces size?  So for V16QI and a pieces-size of 4 we'd
> get either V16QI back (then construction from V4QI pieces
> should work) or V4SI (then construction from SImode pieces
> should work)?  Eventually as secondary output provide that
> piece type (SI / V4QI).

Btw, why not implement the necessary vector init patterns?

> Thanks,
> Richard.
>
> > BR,
> > Kewen
> > -----------
> >
> > gcc/ChangeLog
> >
> > 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
> >
> >         PR tree-optimization/90332
> >         * gcc/tree-vectorizer.h (struct _stmt_vec_info): Add half_mode field.
> >         (DR_GROUP_HALF_MODE): New macro.
> >         * gcc/tree-vect-stmts.c (get_half_mode_for_vector): New function.
> >         (get_group_load_store_type): Call get_half_mode_for_vector to query target
> >         whether support half size mode and update DR_GROUP_HALF_MODE if yes.
> >         (vectorizable_load): Build appropriate vector type based on
> >         DR_GROUP_HALF_MODE.
Kewen.Lin March 18, 2020, 1:56 p.m. UTC | #3
Hi Richi,

Thanks for your comments.

on 2020/3/18 6:39 PM, Richard Biener wrote:
> On Wed, Mar 18, 2020 at 11:06 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi,
>>
>> As PR90332 shows, the current scalar epilogue peeling for gaps
>> elimination requires expected vec_init optab with two half size
>> vector mode.  On Power, we don't support vector mode like V8QI,
>> so can't support optab like vec_initv16qiv8qi.  But we want to
>> leverage existing scalar mode like DI to init the desirable
>> vector mode.  This patch is to extend the existing support for
>> Power, as evaluated on Power9 we can see expected 1.9% speed up
>> on SPEC2017 525.x264_r.
>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu (LE) P8 and P9.
>>
>> Is it ok for trunk?
> 
> There's already code exercising such a case in vectorizable_load
> (VMAT_STRIDED_SLP) which you could have factored out.
> 

Nice, will refer to and factor it.

>  vectype, bool slp,
>              than the alignment boundary B.  Every vector access will
>              be a multiple of B and so we are guaranteed to access a
>              non-gap element in the same B-sized block.  */
> +         machine_mode half_mode;
>           if (overrun_p
>               && gap < (vect_known_alignment_in_bytes (first_dr_info)
>                         / vect_get_scalar_dr_size (first_dr_info)))
> -           overrun_p = false;
> -
> +           {
> +             overrun_p = false;
> +             if (known_eq (nunits, (group_size - gap) * 2)
> +                 && known_eq (nunits, group_size)
> +                 && get_half_mode_for_vector (vectype, &half_mode))
> +               DR_GROUP_HALF_MODE (first_stmt_info) = half_mode;
> +           }
> 
> why do you need to amend this case?
> 

This path can set overrun_p to false, so some cases can fall into the
"no peeling for gaps" hunk in vectorizable_load.  Since I used
DR_GROUP_HALF_MODE to save the half mode, if a case matches this
condition, the vectorizable_load hunk can see an uninitialized
DR_GROUP_HALF_MODE.  But even with the proposed recomputing approach, I
think we still need to check the vec_init optab here if the known_eq
half-size conditions hold?


> I don't like storing DR_GROUP_HALF_MODE very much, later
> you need a vector type and it looks cheap enough to recompute
> it where you need it?  Iff then it doesn't belong to DR_GROUP
> but to the stmt-info.
> 

OK, my intention was to avoid recomputing it to save time; I will
throw it away.

> I realize the original optimization was kind of a hack (and I was too
> lazy to implement the integer mode construction path ...).
> 
> So, can you factor out the existing code into a function returning
> the vector type for construction for a vector type and a
> pieces size?  So for V16QI and a pieces-size of 4 we'd
> get either V16QI back (then construction from V4QI pieces
> should work) or V4SI (then construction from SImode pieces
> should work)?  Eventually as secondary output provide that
> piece type (SI / V4QI).

Sure.  I'm not good at picking function names; does the name
suitable_vector_and_pieces sound good?
  i.e. tree suitable_vector_and_pieces (tree vtype, tree *ptype);


BR,
Kewen
Kewen.Lin March 18, 2020, 2:12 p.m. UTC | #4
on 2020/3/18 6:40 PM, Richard Biener wrote:
> On Wed, Mar 18, 2020 at 11:39 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
>>
>> On Wed, Mar 18, 2020 at 11:06 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>
>>> Hi,
>>>
>>> As PR90332 shows, the current scalar epilogue peeling for gaps
>>> elimination requires expected vec_init optab with two half size
>>> vector mode.  On Power, we don't support vector mode like V8QI,
>>> so can't support optab like vec_initv16qiv8qi.  But we want to
>>> leverage existing scalar mode like DI to init the desirable
>>> vector mode.  This patch is to extend the existing support for
>>> Power, as evaluated on Power9 we can see expected 1.9% speed up
>>> on SPEC2017 525.x264_r.
>>>
>>> Bootstrapped/regtested on powerpc64le-linux-gnu (LE) P8 and P9.
>>>
>>> Is it ok for trunk?
>>
>> There's already code exercising such a case in vectorizable_load
>> (VMAT_STRIDED_SLP) which you could have factored out.
>>
>>  vectype, bool slp,
>>              than the alignment boundary B.  Every vector access will
>>              be a multiple of B and so we are guaranteed to access a
>>              non-gap element in the same B-sized block.  */
>> +         machine_mode half_mode;
>>           if (overrun_p
>>               && gap < (vect_known_alignment_in_bytes (first_dr_info)
>>                         / vect_get_scalar_dr_size (first_dr_info)))
>> -           overrun_p = false;
>> -
>> +           {
>> +             overrun_p = false;
>> +             if (known_eq (nunits, (group_size - gap) * 2)
>> +                 && known_eq (nunits, group_size)
>> +                 && get_half_mode_for_vector (vectype, &half_mode))
>> +               DR_GROUP_HALF_MODE (first_stmt_info) = half_mode;
>> +           }
>>
>> why do you need to amend this case?
>>
>> I don't like storing DR_GROUP_HALF_MODE very much, later
>> you need a vector type and it looks cheap enough to recompute
>> it where you need it?  Iff then it doesn't belong to DR_GROUP
>> but to the stmt-info.
>>
>> I realize the original optimization was kind of a hack (and I was too
>> lazy to implement the integer mode construction path ...).
>>
>> So, can you factor out the existing code into a function returning
>> the vector type for construction for a vector type and a
>> pieces size?  So for V16QI and a pieces-size of 4 we'd
>> get either V16QI back (then construction from V4QI pieces
>> should work) or V4SI (then construction from SImode pieces
>> should work)?  Eventually as secondary output provide that
>> piece type (SI / V4QI).
> 
> Btw, why not implement the necessary vector init patterns?
> 

Power doesn't support a 64-bit vector size; it looks a bit hacky and
confusing to introduce this kind of mode just for an optab requirement,
but I admit the optab hack could make it work immediately.  :)

BR,
Kewen
Richard Biener March 18, 2020, 3:10 p.m. UTC | #5
On Wed, Mar 18, 2020 at 2:56 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Richi,
>
> Thanks for your comments.
>
> on 2020/3/18 6:39 PM, Richard Biener wrote:
> > On Wed, Mar 18, 2020 at 11:06 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi,
> >>
> >> As PR90332 shows, the current scalar epilogue peeling for gaps
> >> elimination requires expected vec_init optab with two half size
> >> vector mode.  On Power, we don't support vector mode like V8QI,
> >> so can't support optab like vec_initv16qiv8qi.  But we want to
> >> leverage existing scalar mode like DI to init the desirable
> >> vector mode.  This patch is to extend the existing support for
> >> Power, as evaluated on Power9 we can see expected 1.9% speed up
> >> on SPEC2017 525.x264_r.
> >>
> >> Bootstrapped/regtested on powerpc64le-linux-gnu (LE) P8 and P9.
> >>
> >> Is it ok for trunk?
> >
> > There's already code exercising such a case in vectorizable_load
> > (VMAT_STRIDED_SLP) which you could have factored out.
> >
>
> Nice, will refer to and factor it.
>
> >  vectype, bool slp,
> >              than the alignment boundary B.  Every vector access will
> >              be a multiple of B and so we are guaranteed to access a
> >              non-gap element in the same B-sized block.  */
> > +         machine_mode half_mode;
> >           if (overrun_p
> >               && gap < (vect_known_alignment_in_bytes (first_dr_info)
> >                         / vect_get_scalar_dr_size (first_dr_info)))
> > -           overrun_p = false;
> > -
> > +           {
> > +             overrun_p = false;
> > +             if (known_eq (nunits, (group_size - gap) * 2)
> > +                 && known_eq (nunits, group_size)
> > +                 && get_half_mode_for_vector (vectype, &half_mode))
> > +               DR_GROUP_HALF_MODE (first_stmt_info) = half_mode;
> > +           }
> >
> > why do you need to amend this case?
> >
>
> This path can define overrun_p to false, some case can fall into
> "no peeling for gaps" hunk in vectorizable_load.  Since I used
> DR_GROUP_HALF_MODE to save the half mode, if some case matches
> this condition, vectorizable_load hunk can get uninitialized
> DR_GROUP_HALF_MODE.  But even with proposed recomputing way, I
> think we still need to check the vec_init optab here if the
> known_eq half size conditions hold?

Hmm, but for the above case it's fine to access the excess elements.

I guess the vectorizable_load code needs to be amended with
the alignment check or we do need to store somewhere our
decision to use smaller loads.

>
> > I don't like storing DR_GROUP_HALF_MODE very much, later
> > you need a vector type and it looks cheap enough to recompute
> > it where you need it?  Iff then it doesn't belong to DR_GROUP
> > but to the stmt-info.
> >
>
> OK, I was intended not to recompute it for time saving, will
> throw it away.
>
> > I realize the original optimization was kind of a hack (and I was too
> > lazy to implement the integer mode construction path ...).
> >
> > So, can you factor out the existing code into a function returning
> > the vector type for construction for a vector type and a
> > pieces size?  So for V16QI and a pieces-size of 4 we'd
> > get either V16QI back (then construction from V4QI pieces
> > should work) or V4SI (then construction from SImode pieces
> > should work)?  Eventually as secondary output provide that
> > piece type (SI / V4QI).
>
> Sure.  I'm very poor to get a function name, does function name
> suitable_vector_and_pieces sound good?
>   ie. tree suitable_vector_and_pieces (tree vtype, tree *ptype);

tree vector_vector_composition_type (tree vtype, poly_uint64 nelts,
tree *ptype);

where nelts specifies the number of vtype elements in a piece.
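
For illustration, a rough sketch of how such a helper might look,
reusing the checks from get_half_mode_for_vector in the posted patch
(the body below is an assumption, not the final implementation; only
the name and signature come from the suggestion above):

/* Sketch only.  VTYPE is the vector type to compose, NELTS the number
   of VTYPE elements in one piece.  On success return the type to build
   the CONSTRUCTOR in and record the piece type in *PTYPE; otherwise
   return NULL_TREE.  */

static tree
vector_vector_composition_type (tree vtype, poly_uint64 nelts, tree *ptype)
{
  gcc_assert (VECTOR_TYPE_P (vtype));
  machine_mode vmode = TYPE_MODE (vtype);
  if (!VECTOR_MODE_P (vmode))
    return NULL_TREE;

  /* Case 1: a vector mode with NELTS elements of the same element mode
     exists and vec_init can compose VTYPE from such pieces
     (e.g. V16QI from V4QI pieces).  */
  scalar_mode elmode = SCALAR_TYPE_MODE (TREE_TYPE (vtype));
  machine_mode pmode;
  if (related_vector_mode (vmode, elmode, nelts).exists (&pmode)
      && (convert_optab_handler (vec_init_optab, vmode, pmode)
          != CODE_FOR_nothing))
    {
      *ptype = build_vector_type (TREE_TYPE (vtype), nelts);
      return vtype;
    }

  /* Case 2: fall back to an integer mode of the piece size and a vector
     of those integers (e.g. SImode pieces composed into V4SI).  */
  poly_uint64 pbitsize = nelts * GET_MODE_BITSIZE (elmode);
  poly_uint64 npieces = exact_div (GET_MODE_NUNITS (vmode), nelts);
  scalar_int_mode imode;
  machine_mode new_vmode;
  if (int_mode_for_size (pbitsize, 0).exists (&imode)
      && related_vector_mode (vmode, imode, npieces).exists (&new_vmode)
      && (convert_optab_handler (vec_init_optab, new_vmode, imode)
          != CODE_FOR_nothing))
    {
      *ptype = build_nonstandard_integer_type (GET_MODE_BITSIZE (imode), 1);
      return build_vector_type (*ptype, npieces);
    }

  return NULL_TREE;
}

For V16QI composed from two 8-element pieces this would return V16QI
with *PTYPE V8QI where vec_initv16qiv8qi exists, or V2DI with *PTYPE a
64-bit integer type otherwise, which is the Power case from the cover
letter.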

Richard.

>
> BR,
> Kewen
>
Segher Boessenkool March 18, 2020, 7:34 p.m. UTC | #6
On Wed, Mar 18, 2020 at 10:12:00PM +0800, Kewen.Lin wrote:
> > Btw, why not implement the necessary vector init patterns?
> 
> Power doesn't support 64bit vector size, it looks a bit hacky and
> confusing to introduce this kind of mode just for some optab requirement,
> but I admit the optab hack can immediately make it work.  :)

But it opens up all kinds of other problems.  To begin with, how is a
short vector mapped to a "real" vector?

We don't have ops on short integer types, either, for similar reasons.


Segher
Richard Biener March 19, 2020, 8:18 a.m. UTC | #7
On Wed, Mar 18, 2020 at 8:34 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Wed, Mar 18, 2020 at 10:12:00PM +0800, Kewen.Lin wrote:
> > > Btw, why not implement the necessary vector init patterns?
> >
> > Power doesn't support 64bit vector size, it looks a bit hacky and
> > confusing to introduce this kind of mode just for some optab requirement,
> > but I admit the optab hack can immediately make it work.  :)
>
> But it opens up all kinds of other problems.  To begin with, how is a
> short vector mapped to a "real" vector?
>
> We don't have ops on short integer types, either, for similar reasons.

How do you represent two vector input shuffles?  The usual
way is (vec_select (vec_concat ...)) which requires a _larger_
vector mode for the concat.  Which you don't have ops on either.

It's also no different from those large integer modes you need
but do not have ops on.

So I think the argument is somewhat moot, but yes.

Richard.

>
> Segher
Segher Boessenkool March 19, 2020, 5:41 p.m. UTC | #8
Hi!

On Thu, Mar 19, 2020 at 09:18:06AM +0100, Richard Biener wrote:
> On Wed, Mar 18, 2020 at 8:34 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> > We don't have ops on short integer types, either, for similar reasons.
> 
> How do you represent two vector input shuffles?  The usual
> way is (vec_select (vec_concat ...)) which requires a _larger_
> vector mode for the concat.  Which you don't have ops on either.

Yes, we have double length modes for this, and it is painful as well.

And we also have a few half-length modes.

From rs6000-modes.def:

/* VMX/VSX.  */
VECTOR_MODES (INT, 16);       /* V16QI V8HI  V4SI V2DI */
VECTOR_MODE (INT, TI, 1);     /*                  V1TI */
VECTOR_MODES (FLOAT, 16);     /*       V8HF  V4SF V2DF */

/* Two VMX/VSX vectors (for permute, select, concat, etc.)  */
VECTOR_MODES (INT, 32);       /* V32QI V16HI V8SI V4DI */
VECTOR_MODES (FLOAT, 32);     /*       V16HF V8SF V4DF */

/* Half VMX/VSX vector (for internal use)  */
VECTOR_MODE (FLOAT, SF, 2);   /*                 V2SF  */
VECTOR_MODE (INT, SI, 2);     /*                 V2SI  */

> It's also not different to those large integer modes you need
> but do not have ops on.
> 
> So I think the argument is somewhat moot, but yes.

The point is that as soon as you allow some computations in any mode,
it snowballs to having to support all computations in those modes, but
even more importantly having to do it in movM as well.

Not good.


Segher

Patch

diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 2ca8e494680..24ec0d3759d 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -2220,6 +2220,52 @@  vect_get_store_rhs (stmt_vec_info stmt_info)
   gcc_unreachable ();
 }
 
+/* Function GET_HALF_MODE_FOR_VECTOR
+
+   If the target supports either of:
+     - A vector mode whose size is half of the given vector size, whose
+       element mode is the same as that of the given vector, and from two
+       of which the given vector can be initialized.
+     - A scalar mode whose size is half of the given vector size, for which
+       a vector mode with two such elements exists and can be initialized
+       from two of them.
+   return true and save the mode in HMODE.  Otherwise, return false.
+
+   VECTYPE is the type of the given vector.  */
+
+static bool
+get_half_mode_for_vector (tree vectype, machine_mode *hmode)
+{
+  gcc_assert (VECTOR_TYPE_P (vectype));
+  machine_mode vec_mode = TYPE_MODE (vectype);
+  scalar_mode elmode = SCALAR_TYPE_MODE (TREE_TYPE (vectype));
+
+  /* Check whether a half-size vector mode is supported.  */
+  gcc_assert (GET_MODE_NUNITS (vec_mode).is_constant ());
+  poly_uint64 n_half_units = exact_div (GET_MODE_NUNITS (vec_mode), 2);
+  if (related_vector_mode (vec_mode, elmode, n_half_units).exists (hmode)
+      && convert_optab_handler (vec_init_optab, vec_mode, *hmode)
+	   != CODE_FOR_nothing)
+    return true;
+
+  /* Check whether a half-size scalar mode is supported.  */
+  poly_uint64 half_size = exact_div (GET_MODE_BITSIZE (vec_mode), 2);
+  opt_machine_mode smode
+    = mode_for_size (half_size, GET_MODE_CLASS (elmode), 0);
+  if (!smode.exists ())
+    return false;
+  *hmode = smode.require ();
+
+  machine_mode new_vec_mode;
+  if (related_vector_mode (vec_mode, as_a<scalar_mode> (*hmode), 2)
+	.exists (&new_vec_mode)
+      && convert_optab_handler (vec_init_optab, new_vec_mode, *hmode)
+	   != CODE_FOR_nothing)
+    return true;
+
+  return false;
+}
+
 /* A subroutine of get_load_store_type, with a subset of the same
    arguments.  Handle the case where STMT_INFO is part of a grouped load
    or store.
@@ -2290,33 +2336,36 @@  get_group_load_store_type (stmt_vec_info stmt_info, tree vectype, bool slp,
 	     than the alignment boundary B.  Every vector access will
 	     be a multiple of B and so we are guaranteed to access a
 	     non-gap element in the same B-sized block.  */
+	  machine_mode half_mode;
 	  if (overrun_p
 	      && gap < (vect_known_alignment_in_bytes (first_dr_info)
 			/ vect_get_scalar_dr_size (first_dr_info)))
-	    overrun_p = false;
-
+	    {
+	      overrun_p = false;
+	      if (known_eq (nunits, (group_size - gap) * 2)
+		  && known_eq (nunits, group_size)
+		  && get_half_mode_for_vector (vectype, &half_mode))
+		DR_GROUP_HALF_MODE (first_stmt_info) = half_mode;
+	    }
 	  /* If the gap splits the vector in half and the target
 	     can do half-vector operations avoid the epilogue peeling
 	     by simply loading half of the vector only.  Usually
 	     the construction with an upper zero half will be elided.  */
 	  dr_alignment_support alignment_support_scheme;
-	  scalar_mode elmode = SCALAR_TYPE_MODE (TREE_TYPE (vectype));
-	  machine_mode vmode;
 	  if (overrun_p
 	      && !masked_p
 	      && (((alignment_support_scheme
-		      = vect_supportable_dr_alignment (first_dr_info, false)))
-		   == dr_aligned
+		    = vect_supportable_dr_alignment (first_dr_info, false)))
+		    == dr_aligned
 		  || alignment_support_scheme == dr_unaligned_supported)
 	      && known_eq (nunits, (group_size - gap) * 2)
 	      && known_eq (nunits, group_size)
 	      && VECTOR_MODE_P (TYPE_MODE (vectype))
-	      && related_vector_mode (TYPE_MODE (vectype), elmode,
-				      group_size - gap).exists (&vmode)
-	      && (convert_optab_handler (vec_init_optab,
-					 TYPE_MODE (vectype), vmode)
-		  != CODE_FOR_nothing))
-	    overrun_p = false;
+	      && get_half_mode_for_vector (vectype, &half_mode))
+	    {
+	      DR_GROUP_HALF_MODE (first_stmt_info) = half_mode;
+	      overrun_p = false;
+	    }
 
 	  if (overrun_p && !can_overrun_p)
 	    {
@@ -9541,6 +9590,7 @@  vectorizable_load (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 		    else
 		      {
 			tree ltype = vectype;
+			machine_mode half_mode = VOIDmode;
 			/* If there's no peeling for gaps but we have a gap
 			   with slp loads then load the lower half of the
 			   vector only.  See get_group_load_store_type for
@@ -9553,10 +9603,18 @@  vectorizable_load (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 					 (group_size
 					  - DR_GROUP_GAP (first_stmt_info)) * 2)
 			    && known_eq (nunits, group_size))
-			  ltype = build_vector_type (TREE_TYPE (vectype),
-						     (group_size
-						      - DR_GROUP_GAP
-						          (first_stmt_info)));
+			  {
+			    gcc_assert (DR_GROUP_HALF_MODE (first_stmt_info)
+					!= VOIDmode);
+			    half_mode = DR_GROUP_HALF_MODE (first_stmt_info);
+			    if (VECTOR_MODE_P (half_mode))
+			      ltype = build_vector_type (
+				TREE_TYPE (vectype),
+				(group_size - DR_GROUP_GAP (first_stmt_info)));
+			    else
+			      ltype
+				= lang_hooks.types.type_for_mode (half_mode, 1);
+			  }
 			data_ref
 			  = fold_build2 (MEM_REF, ltype, dataref_ptr,
 					 dataref_offset
@@ -9584,10 +9642,21 @@  vectorizable_load (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 			    CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, tem);
 			    CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
 						    build_zero_cst (ltype));
-			    new_stmt
-			      = gimple_build_assign (vec_dest,
-						     build_constructor
-						       (vectype, v));
+			    if (VECTOR_MODE_P (half_mode))
+			      new_stmt = gimple_build_assign (
+				vec_dest, build_constructor (vectype, v));
+			    else
+			      {
+				tree new_vtype = build_vector_type (ltype, 2);
+				tree new_vname = make_ssa_name (new_vtype);
+				new_stmt = gimple_build_assign (
+				  new_vname, build_constructor (new_vtype, v));
+				vect_finish_stmt_generation (stmt_info,
+							     new_stmt, gsi);
+				new_stmt = gimple_build_assign (
+				  vec_dest, build1 (VIEW_CONVERT_EXPR, vectype,
+						    new_vname));
+			      }
 			  }
 		      }
 		    break;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index f7becb34ab4..6fcbeb653d7 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1018,6 +1018,8 @@  public:
   /* For loads only, the gap from the previous load. For consecutive loads, GAP
      is 1.  */
   unsigned int gap;
+  /* For loads only, mode for halves of vector without peeling for gaps.  */
+  machine_mode half_mode;
 
   /* The minimum negative dependence distance this stmt participates in
      or zero if none.  */
@@ -1227,6 +1229,8 @@  STMT_VINFO_BB_VINFO (stmt_vec_info stmt_vinfo)
   (gcc_checking_assert ((S)->dr_aux.dr), (S)->store_count)
 #define DR_GROUP_GAP(S) \
   (gcc_checking_assert ((S)->dr_aux.dr), (S)->gap)
+#define DR_GROUP_HALF_MODE(S) \
+  (gcc_checking_assert ((S)->dr_aux.dr), (S)->half_mode)
 
 #define REDUC_GROUP_FIRST_ELEMENT(S) \
   (gcc_checking_assert (!(S)->dr_aux.dr), (S)->first_element)