[1/7,v5] ifn/optabs: Support vector load/store with length

Message ID 380e1418-f262-7a09-e24a-2eb14817662f@linux.ibm.com
State New

Commit Message

Kewen.Lin June 23, 2020, 3:54 a.m. UTC
Hi Richard,

Thanks for your comments!

on 2020/6/23 3:59 AM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> @@ -5167,6 +5167,24 @@ mode @var{n}.
>>  
>>  This pattern is not allowed to @code{FAIL}.
>>  
>> +@cindex @code{lenload@var{m}} instruction pattern
>> +@item @samp{lenload@var{m}}
>> +Perform a vector load with length from memory operand 1 of mode @var{m}
>> +into register operand 0.  Length is provided in register operand 2 with
>> +appropriate mode which should afford the maximal required precision of
>> +any available lengths.
> 
> I think we need to say in more detail what “load with length” actually
> means.  How about:
> 
>   Load the number of bytes specified by operand 2 from memory operand 1
>   into register operand 0, setting the other bytes of operand 0 to
>   undefined values.  Operands 0 and 1 have mode @var{m}.  Operand 2 has
>   whichever integer mode the target prefers.
> 

Thanks for the nice wording!  Updated.  For "... to undefined values", I changed
it to "... to undefined values or zeros", following Segher's comments, to match
the behavior on Power.

> Similarly here:
> 
>   Store the number of bytes specified by operand 2 from nonmemory operand 1
>   into memory operand 0, leaving the other bytes of operand 0 unchanged.
>   Operands 0 and 1 have mode @var{m}.  Operand 2 has whichever integer
>   mode the target prefers.
> 

Done.

>> @@ -2478,7 +2480,7 @@ expand_call_mem_ref (tree type, gcall *stmt, int index)
>>    return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
>>  }
>>  
>> -/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
>> +/* Expand MASK_LOAD{,_LANES} and LEN_LOAD call STMT using optab OPTAB.  */
> 
> s/and/or/.
> 

Done.

>>  
>>  static void
>>  expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
> 
> Think it would be worth generalising the name, e.g. to
> expand_partial_load_optab_fn, and adding a #define for
> expand_mask_load_optab_fn before the other two #defines.
> 
> Same comments for stores.
> 

Done.

>> +OPTAB_D (lenload_optab, "lenload$a")
>> +OPTAB_D (lenstore_optab, "lenstore$a")
> 
> Sorry, I should have picked up on this last time, but I think we should
> be consistent about whether there's an underscore after “len” or not.
> I realise this is just replicating what happens for IFN_MASK_LOAD/
> “maskload” and IFN_MASK_STORE/“maskstore”, but it's something I kept
> tripping over when implementing those for SVE.
> 
> Personally I think it is easier to read with the underscore, so this
> would be “len_load_optab” and “len_load$a” (or “len_load_$a”,
> there's no real consistency on that).  Same for stores.
> 

Good point!  I found two flavors of optab naming:

  OPTAB_CD(maskload_optab, "maskload$a$b")
  OPTAB_CD(gather_load_optab, "gather_load$a$b")
  ...
vs. 
  OPTAB_D (vec_realign_load_optab, "vec_realign_load_$a")

In the end I chose the flavor with two underscores ("len_load_$a"),
following Segher's comment on readability.

----------------------------

v5:
  - Updated lenload/lenstore optab to len_load/len_store and the docs.
  - Rename expand_mask_{load,store}_optab_fn to expand_partial_{load,store}_optab_fn
  - Added/updated macros for expand_mask_{load,store}_optab_fn
    and expand_len_{load,store}_optab_fn

v4: Update len_load_direct/len_store_direct to align with direct optab.

v3: Get rid of length mode hook.


BR,
Kewen
---
gcc/ChangeLog:

2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>

	* doc/md.texi (len_load_@var{m}): Document.
	(len_store_@var{m}): Likewise.
	* internal-fn.c (len_load_direct): New macro.
	(len_store_direct): Likewise.
	(expand_len_load_optab_fn): Likewise.
	(expand_len_store_optab_fn): Likewise.
	(direct_len_load_optab_supported_p): Likewise.
	(direct_len_store_optab_supported_p): Likewise.
	(expand_mask_load_optab_fn): New macro.  Original renamed to ...
	(expand_partial_load_optab_fn): ... here.  Add handlings for
	len_load_optab.
	(expand_mask_store_optab_fn): New macro.  Original renamed to ...
	(expand_partial_store_optab_fn): ... here. Add handlings for
	len_store_optab.
	(internal_load_fn_p): Handle IFN_LEN_LOAD.
	(internal_store_fn_p): Handle IFN_LEN_STORE.
	(internal_fn_stored_value_index): Handle IFN_LEN_STORE.
	* internal-fn.def (LEN_LOAD): New internal function.
	(LEN_STORE): Likewise.
	* optabs.def (len_load_optab, len_store_optab): New optab.
commit f6012656a8968f239ad781c2cd388a9210675e11
Author: Kewen Lin <linkw@gcc.gnu.org>
Date:   Mon May 25 10:55:16 2020 +0800

    IFN for vector load/store with length and related optabs V5

Comments

Richard Sandiford June 23, 2020, 9:52 a.m. UTC | #1
Things have moved on due to the IRC conversation, but…

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> on 2020/6/23 3:59 AM, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> @@ -5167,6 +5167,24 @@ mode @var{n}.
>>>  
>>>  This pattern is not allowed to @code{FAIL}.
>>>  
>>> +@cindex @code{lenload@var{m}} instruction pattern
>>> +@item @samp{lenload@var{m}}
>>> +Perform a vector load with length from memory operand 1 of mode @var{m}
>>> +into register operand 0.  Length is provided in register operand 2 with
>>> +appropriate mode which should afford the maximal required precision of
>>> +any available lengths.
>> 
>> I think we need to say in more detail what “load with length” actually
>> means.  How about:
>> 
>>   Load the number of bytes specified by operand 2 from memory operand 1
>>   into register operand 0, setting the other bytes of operand 0 to
>>   undefined values.  Operands 0 and 1 have mode @var{m}.  Operand 2 has
>>   whichever integer mode the target prefers.
>> 
>
> Thanks for nice wordings!  Updated, for "... to undefined values" I changed it
> to "... to undefined values or zeros" as Segher's comments to match the behavior
> on Power.

“set … to undefined values” means that the values are not defined by
the optab interface.  In other words, the target can set the bytes
to whatever it wants, and gimple code can't make any assumptions about
what the values of the bytes are.

So setting the bytes to zero (as Power does) would conform to the
interface.  So would leaving the bytes in operand 0 untouched.
So would using an instruction that really does leave the other
bytes with undefined values, etc.

So I think we should keep it as just “… to undefined values”.

The alternative would be to define the interface so that targets *must*
ensure that the other bytes are zeros.  But at the moment, the only
intended use of the optabs and ifns is for autovectorisation, and the
vectoriser won't care about the values of “inactive” bytes/lanes.
Forcing the target to set them to a specific value like zero would be
unnecessarily restrictive.
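
To make the "undefined values" contract concrete, here is a minimal C
sketch of the byte-length semantics discussed above.  It is only a model,
not GCC code: vec_t, VEC_BYTES, model_len_load, model_len_store and the
"fill" parameter are hypothetical names invented for illustration.  The
fill byte stands in for whatever a target happens to leave in the
inactive bytes (zero, all ones, stale data); a consumer that respects
the interface gets the same answer whichever value is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 16		/* Hypothetical vector size, e.g. V16QI.  */

typedef struct { uint8_t b[VEC_BYTES]; } vec_t;

/* Model of len_load: copy LEN bytes from MEM into the result.  The other
   bytes are set to FILL, which stands in for "undefined": callers must
   not rely on it.  LEN > VEC_BYTES is treated as a caller error here.  */
static vec_t
model_len_load (const uint8_t *mem, size_t len, uint8_t fill)
{
  vec_t v;
  assert (len <= VEC_BYTES);
  memset (v.b, fill, VEC_BYTES);
  memcpy (v.b, mem, len);
  return v;
}

/* Model of len_store: write the first LEN bytes of V to MEM and leave
   the remaining bytes of MEM unchanged.  */
static void
model_len_store (uint8_t *mem, vec_t v, size_t len)
{
  assert (len <= VEC_BYTES);
  memcpy (mem, v.b, len);
}

/* Return 1 if A and B agree on their first LEN (active) bytes.  */
static int
active_bytes_equal (vec_t a, vec_t b, size_t len)
{
  return memcmp (a.b, b.b, len) == 0;
}
```

Loading the same 5 bytes once with fill 0x00 and once with fill 0xff
gives vectors that differ in bytes 5..15 but compare equal under
active_bytes_equal (..., 5); both are conforming results under the
"undefined values" wording.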

Thanks,
Richard
Richard Biener June 23, 2020, 11:25 a.m. UTC | #2
On Tue, Jun 23, 2020 at 11:53 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Things have moved on due to the IRC conversation, but…
>
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
> > on 2020/6/23 3:59 AM, Richard Sandiford wrote:
> >> "Kewen.Lin" <linkw@linux.ibm.com> writes:
> >>> @@ -5167,6 +5167,24 @@ mode @var{n}.
> >>>
> >>>  This pattern is not allowed to @code{FAIL}.
> >>>
> >>> +@cindex @code{lenload@var{m}} instruction pattern
> >>> +@item @samp{lenload@var{m}}
> >>> +Perform a vector load with length from memory operand 1 of mode @var{m}
> >>> +into register operand 0.  Length is provided in register operand 2 with
> >>> +appropriate mode which should afford the maximal required precision of
> >>> +any available lengths.
> >>
> >> I think we need to say in more detail what “load with length” actually
> >> means.  How about:
> >>
> >>   Load the number of bytes specified by operand 2 from memory operand 1
> >>   into register operand 0, setting the other bytes of operand 0 to
> >>   undefined values.  Operands 0 and 1 have mode @var{m}.  Operand 2 has
> >>   whichever integer mode the target prefers.
> >>
> >
> > Thanks for nice wordings!  Updated, for "... to undefined values" I changed it
> > to "... to undefined values or zeros" as Segher's comments to match the behavior
> > on Power.
>
> “set … to undefined values” means that the values are not defined by
> the optab interface.  In other words, the target can set the bytes
> to whatever it wants, and gimple code can't make any assumptions about
> what the values of the bytes are.
>
> So setting the bytes to zero (as Power does) would conform to the
> interface.  So would leaving the bytes in operand 0 untouched.
> So would using an instruction that really does leave the other
> bytes with undefined values, etc.
>
> So I think we should keep it as just “… to undefined values”.
>
> The alternative would be to define the interface so that targets *must*
> ensure that the other bytes are zeros.  But at the moment, the only
> intended use of the optabs and ifns is for autovectorisation, and the
> vectoriser won't care about the values of “inactive” bytes/lanes.
> Forcing the target to set them to a specific value like zero would be
> unnecessarily restrictive.

Actually it _does_ care.  This is supposed to be used for fully masked
loops and 'unspecified values' would require us to explicitly zero
them for any FP op because of possible sNaN representations.  It
also precludes us from bitwise ORing in an appropriately masked
vector of 1s to make integer division happy (OK, no vector ISA supports
integer division).

So unless we have evidence that there exists an ISA that does _not_
zero the excess bits I'd rather specify it does.

Richard.

>
> Thanks,
> Richard
Richard Sandiford June 23, 2020, 12:20 p.m. UTC | #3
Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Jun 23, 2020 at 11:53 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Things have moved on due to the IRC conversation, but…
>>
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> > on 2020/6/23 3:59 AM, Richard Sandiford wrote:
>> >> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> >>> @@ -5167,6 +5167,24 @@ mode @var{n}.
>> >>>
>> >>>  This pattern is not allowed to @code{FAIL}.
>> >>>
>> >>> +@cindex @code{lenload@var{m}} instruction pattern
>> >>> +@item @samp{lenload@var{m}}
>> >>> +Perform a vector load with length from memory operand 1 of mode @var{m}
>> >>> +into register operand 0.  Length is provided in register operand 2 with
>> >>> +appropriate mode which should afford the maximal required precision of
>> >>> +any available lengths.
>> >>
>> >> I think we need to say in more detail what “load with length” actually
>> >> means.  How about:
>> >>
>> >>   Load the number of bytes specified by operand 2 from memory operand 1
>> >>   into register operand 0, setting the other bytes of operand 0 to
>> >>   undefined values.  Operands 0 and 1 have mode @var{m}.  Operand 2 has
>> >>   whichever integer mode the target prefers.
>> >>
>> >
>> > Thanks for nice wordings!  Updated, for "... to undefined values" I changed it
>> > to "... to undefined values or zeros" as Segher's comments to match the behavior
>> > on Power.
>>
>> “set … to undefined values” means that the values are not defined by
>> the optab interface.  In other words, the target can set the bytes
>> to whatever it wants, and gimple code can't make any assumptions about
>> what the values of the bytes are.
>>
>> So setting the bytes to zero (as Power does) would conform to the
>> interface.  So would leaving the bytes in operand 0 untouched.
>> So would using an instruction that really does leave the other
>> bytes with undefined values, etc.
>>
>> So I think we should keep it as just “… to undefined values”.
>>
>> The alternative would be to define the interface so that targets *must*
>> ensure that the other bytes are zeros.  But at the moment, the only
>> intended use of the optabs and ifns is for autovectorisation, and the
>> vectoriser won't care about the values of “inactive” bytes/lanes.
>> Forcing the target to set them to a specific value like zero would be
>> unnecessarily restrictive.
>
> Actually it _does_ care.

I'd argue it doesn't, but for essentially the same reasons :-)

> This is supposed to be used for fully masked
> loops and 'unspecified values' would require us to explicitly zero
> them for any FP op because of possible sNaN representations.  It
> also precludes us from bitwise ORing in an appropriately masked
> vector of 1s to make integer division happy (OK, no vector ISA supports
> integer division).

Zeros would be a problem for FP division too.  And even if we require
loads to set inactive lanes to zero, we couldn't infer from that that
any given FP addition (say) won't raise an exception.  E.g. the inputs
could be the result of converting integers and adding them could trigger
an inexact exception.  Or the values could be the result of simple
bitcasts, giving arbitrary FP values.  (AIUI, current bfloat code
works this way.)

The vectoriser currently only allows potentially-trapping FP operations
on partial vectors if the target provides an appropriate IFN_COND_*
function.  (That's one of the main use cases for those functions.)
In other cases it requires the loop to operate on full vectors.
This should be relaxed in future to support inbranch partial
vectorisation of simd calls.

This means that the current patch series will/should simply punt
for “length”-based loop control if the loop contains FP operations
that (as far as gimple is concerned) might trap.

If we're thinking about how to relax that, then IMO it will need
to be done either at the level of each FP operation or by some
kind of “global” vectorisation subpass that introduces known-safe
values for inactive lanes.  The first would be easier, the second
would be more optimal.

I don't think that's specific to “length” vectorisation though.
The same concerns apply to if-converted loops that operate on full
vectors.  I think the approach would be essentially the same for both.

In that scenario, removing zeroing of an IFN_LEN_LOAD would “just” be
an optimisation, and could potentially be left to RTL code if necessary.
(But see my main point below.)

SVE supports integer division btw. :-)

> So unless we have evidence that there exists an ISA that does _not_
> zero the excess bits I'd rather specify it does.

I think the known architectures that might use this are:

- MVE
- Power
- RVV

MVE and Power both set inactive lanes to zero.  But I'm not sure about RVV.
AIUI, for RVV the approach instead would be to reduce the effective vector
length for the final iteration of the vector loop, and I'm not sure
whether in that situation it makes sense to say that the other elements
still exist and are guaranteed to be zero.

I'm the last person who should be speculating on that though.  Let's see
whether Jim has any comments.

In summary, I'm not saying we should never define the inactive values
to be zero.  I just think that we should leave it until it matters.
And I don't think it does/should matter for the current patch series.

IFN_MASK_LOAD has been around for quite a long time now and we've never
had to define the values of inactive lanes there.

Thanks,
Richard
Jim Wilson June 24, 2020, 2:40 a.m. UTC | #4
On Tue, Jun 23, 2020 at 5:21 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
> MVE and Power both set inactive lanes to zero.  But I'm not sure about RVV.
> AIUI, for RVV the approach instead would be to reduce the effective vector
> length for the final iteration of the vector loop, and I'm not sure
> whether in that situation it makes sense to say that the other elements
> still exist and are guaranteed to be zero.
>
> I'm the last person who should be speculating on that though.  Let's see
> whether Jim has any comments.

The RVV spec supports two policies for tail elements, i.e. elements
beyond the current vector length.  They can be undisturbed or
agnostic.  In the undisturbed case, the tail elements retain their
old values.  In the agnostic case, the implementation can choose to
either retain their old values, or set them to all ones, and this
choice can be different from lane to lane.  The latter case is useful
because registers may be wider than the execution unit, and current
vector length may not be a multiple of the width of the execution
unit.  So for instance if the vector registers can hold 8 elements,
and the execution unit works on 4 elements at a time, and the current
vector length is 2, then it might make sense to leave the last four
elements unmodified to avoid an iteration across the registers, but
the third and fourth elements might be set to all ones because you
have to write to them anyways.  The choice is left up to the
implementation because we have multiple parties designing vector
units, some targeting the low-cost embedded market and some targeting
high performance, and they couldn't agree on a single best
way to implement this.  Software is expected to choose agnostic
only if it doesn't care about what happens to tail elements, and
undisturbed if it wants to preserve them.  The value of all ones was
chosen to discourage software developers from trying to use the values
in tail elements.  The choice of undisturbed or agnostic can be
changed every time you set the current vector length and type.

In most cases, I think RVV programs will use agnostic for tail
elements, since we can change the vector length at will, and it will
be rare that we will care about elements beyond the current vector
length.

Tail elements can't cause exceptions so there is no need to worry
about whether those elements hold valid values.
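
The 8-element example above can be modelled in a few lines of C.  This
is only a sketch of one behaviour the agnostic policy would permit, not
a description of any real implementation: model_vadd, VLMAX, EXEC_WIDTH
and the enum are hypothetical names, and the choice to overwrite only
the tail elements in the last chunk the execution unit touches is an
assumption taken from the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VLMAX 8			/* Elements per vector register.  */
#define EXEC_WIDTH 4		/* Elements processed per pass.  */

enum tail_policy { TAIL_UNDISTURBED, TAIL_AGNOSTIC };

/* Model an element-wise add with vector length VL.  Elements at or
   beyond VL are tail elements.  Under TAIL_UNDISTURBED they keep their
   old values.  Under TAIL_AGNOSTIC this hypothetical implementation
   writes all ones to the tail elements that share an execution-unit
   chunk with active elements and leaves untouched chunks alone; a real
   implementation may pick either behaviour per element.  */
static void
model_vadd (uint32_t dst[VLMAX], const uint32_t a[VLMAX],
	    const uint32_t b[VLMAX], size_t vl, enum tail_policy policy)
{
  assert (vl <= VLMAX);
  for (size_t i = 0; i < vl; i++)
    dst[i] = a[i] + b[i];

  if (policy == TAIL_AGNOSTIC)
    {
      /* Round VL up to the end of the last chunk that was written.  */
      size_t chunk_end = (vl + EXEC_WIDTH - 1) / EXEC_WIDTH * EXEC_WIDTH;
      for (size_t i = vl; i < chunk_end; i++)
	dst[i] = UINT32_MAX;
    }
}
```

With vl = 2 and the agnostic policy, elements 0 and 1 hold the sums,
elements 2 and 3 become all ones, and elements 4 to 7 keep their old
values.  With the undisturbed policy, elements 2 to 7 all keep their
old values.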

Jim
Richard Sandiford June 24, 2020, 7:34 a.m. UTC | #5
Jim Wilson <jimw@sifive.com> writes:
> On Tue, Jun 23, 2020 at 5:21 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>> MVE and Power both set inactive lanes to zero.  But I'm not sure about RVV.
>> AIUI, for RVV the approach instead would be to reduce the effective vector
>> length for the final iteration of the vector loop, and I'm not sure
>> whether in that situation it makes sense to say that the other elements
>> still exist and are guaranteed to be zero.
>>
>> I'm the last person who should be speculating on that though.  Let's see
>> whether Jim has any comments.
>
> The RVV spec supports two policies for tail elements, i.e. elements
> beyond the current vector length.  They can be undisturbed or
> agnostic.  In the undisturbed case, the tail elements retain their
> old values.  In the agnostic case, the implementation can choose to
> either retain their old values, or set them to all ones, and this
> choice can be different from lane to lane.  The latter case is useful
> because registers may be wider than the execution unit, and current
> vector length may not be a multiple of the width of the execution
> unit.  So for instance if the vector registers can hold 8 elements,
> and the execution unit works on 4 elements at a time, and the current
> vector length is 2, then it might make sense to leave the last four
> elements unmodified to avoid an iteration across the registers, but
> the third and fourth elements might be set to all ones because you
> have to write to them anyways.  The choice is left up to the
> implementation because we have multiple parties designing vector
> units, some targeting the low-cost embedded market and some targeting
> high performance, and they couldn't agree on a single best
> way to implement this.  Software is expected to choose agnostic
> only if it doesn't care about what happens to tail elements, and
> undisturbed if it wants to preserve them.  The value of all ones was
> chosen to discourage software developers from trying to use the values
> in tail elements.  The choice of undisturbed or agnostic can be
> changed every time you set the current vector length and type.
>
> In most cases, I think RVV programs will use agnostic for tail
> elements, since we can change the vector length at will, and it will
> be rare that we will care about elements beyond the current vector
> length.
>
> Tail elements can't cause exceptions so there is no need to worry
> about whether those elements hold valid values.

Thanks for the info.  Based on that, I guess GCC should leave the values
of extra inactive lanes undefined for now, so that the agnostic case
is supported.

Maybe in future we could have IFN_LEN_* versions of arithmetic
operations too, similar to the IFN_COND_* ones, so that they explicitly
ignore the inactive elements.

Richard
Segher Boessenkool June 24, 2020, 11:56 p.m. UTC | #6
Hi!

On Tue, Jun 23, 2020 at 01:20:53PM +0100, Richard Sandiford wrote:
> SVE supports integer division btw. :-)

So does Power (ISA 3.1, power10).

> In summary, I'm not saying we should never define the inactive values
> to be zero.  I just think that we should leave it until it matters.
> And I don't think it does/should matter for the current patch series.

I am perfectly happy with that.  Thanks for looking at it!

> IFN_MASK_LOAD has been around for quite a long time now and we've never
> had to define the values of inactive lanes there.

Yeah, but typically the insns that consume the values loaded will use
the same masks again, so that may not be such a strong point.


Segher
Richard Sandiford June 29, 2020, 10:07 a.m. UTC | #7
Thanks for the update.  I agree with the summary of the IRC discussion
except for…

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard S./Richi/Jim/Segher,
>
> Thanks a lot for your comments to make this patch more solid.
>
> Based on our discussion, for the vector load/store with length
> optab, the length unit would be measured in lanes by default.
> For the targets which support length measured in bytes like Power,
> they should only define VnQI modes to wrap the other same size
> vector modes.  If the length is larger than total lane/byte count
> of the given mode, it's taken to load all lanes/bytes implicitly.

…this last bit.  IMO the behaviour of the optab should be undefined
when the supplied length is greater than the number of lanes.

I think that also makes things better for the lxvl implementation,
which ignores the upper 56 bits of the length.  It sounds like the
above semantics would instead require Power to saturate the value
at 255 before shifting it.
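
A small C sketch of why "load everything when the length is too large"
would cost an extra saturation step.  It assumes, as I understand the
instruction, that the byte count is read from the most significant 8
bits of a 64-bit register and capped at the vector size, so the length
has to be shifted left by 56 first.  The function names and VEC_BYTES
are hypothetical; this models the arithmetic only, not the actual
rs6000 expansion.

```c
#include <stdint.h>

#define VEC_BYTES 16		/* Hypothetical vector size in bytes.  */

/* Assumed model of the count the instruction sees: the top 8 bits of
   RB, capped at VEC_BYTES.  The low 56 bits of RB are ignored.  */
static unsigned
model_lxvl_count (uint64_t rb)
{
  unsigned nb = (unsigned) (rb >> 56);
  return nb > VEC_BYTES ? VEC_BYTES : nb;
}

/* Plain expansion: shift the length into place.  Any bits of LEN above
   the low 8 are shifted out, so LEN == 256 behaves like 0.  */
static unsigned
count_plain_shift (uint64_t len)
{
  return model_lxvl_count (len << 56);
}

/* Expansion needed if oversized lengths must mean "all bytes": clamp
   LEN to 255 before shifting so large values cannot wrap to small.  */
static unsigned
count_saturating (uint64_t len)
{
  uint64_t clamped = len > 255 ? 255 : len;
  return model_lxvl_count (clamped << 56);
}
```

For in-range lengths the two functions agree.  For a length of 256 the
plain shift yields a count of 0 while the saturating form yields 16,
which is why leaving the out-of-range case undefined lets the target
skip the clamp.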

Richard

> The remaining lanes/bytes that aren't covered by the length
> are taken to have undefined values.  For a length in bytes,
> the byte count must be a multiple of the
> element size (wrapped vector); otherwise the behaviour is undefined.
>
> This patch has been updated as attached.
>
> 2/7 for rs6000 optab definition has been updated to use V16QI.
> 5/7 for vectorizer change has been updated accordingly.
>
> -----
>
> v6: Updated optab descriptions.
>
> v5:
>   - Updated lenload/lenstore optab to len_load/len_store and the docs.
>   - Rename expand_mask_{load,store}_optab_fn to expand_partial_{load,store}_optab_fn
>   - Added/updated macros for expand_mask_{load,store}_optab_fn
>     and expand_len_{load,store}_optab_fn
>
> v4: Update len_load_direct/len_store_direct to align with direct optab.
>
> v3: Get rid of length mode hook.
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
> 2020-MM-DD  Kewen Lin  <linkw@gcc.gnu.org>
>
> 	* doc/md.texi (len_load_@var{m}): Document.
> 	(len_store_@var{m}): Likewise.
> 	* internal-fn.c (len_load_direct): New macro.
> 	(len_store_direct): Likewise.
> 	(expand_len_load_optab_fn): Likewise.
> 	(expand_len_store_optab_fn): Likewise.
> 	(direct_len_load_optab_supported_p): Likewise.
> 	(direct_len_store_optab_supported_p): Likewise.
> 	(expand_mask_load_optab_fn): New macro.  Original renamed to ...
> 	(expand_partial_load_optab_fn): ... here.  Add handlings for
> 	len_load_optab.
> 	(expand_mask_store_optab_fn): New macro.  Original renamed to ...
> 	(expand_partial_store_optab_fn): ... here. Add handlings for
> 	len_store_optab.
> 	(internal_load_fn_p): Handle IFN_LEN_LOAD.
> 	(internal_store_fn_p): Handle IFN_LEN_STORE.
> 	(internal_fn_stored_value_index): Handle IFN_LEN_STORE.
> 	* internal-fn.def (LEN_LOAD): New internal function.
> 	(LEN_STORE): Likewise.
> 	* optabs.def (len_load_optab, len_store_optab): New optab.

Patch

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 2c67c818da5..23918136345 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5167,6 +5167,24 @@  mode @var{n}.
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{len_load_@var{m}} instruction pattern
+@item @samp{len_load_@var{m}}
+Load the number of bytes specified by operand 2 from memory operand 1
+into register operand 0, setting the other bytes of operand 0 to
+undefined values or zeros.  Operands 0 and 1 have mode @var{m}.
+Operand 2 has whichever integer mode the target prefers.
+
+This pattern is not allowed to @code{FAIL}.
+
+@cindex @code{len_store_@var{m}} instruction pattern
+@item @samp{len_store_@var{m}}
+Store the number of bytes specified by operand 2 from nonmemory operand 1
+into memory operand 0, leaving the other bytes of operand 0 unchanged.
+Operands 0 and 1 have mode @var{m}.  Operand 2 has whichever integer
+mode the target prefers.
+
+This pattern is not allowed to @code{FAIL}.
+
 @cindex @code{vec_perm@var{m}} instruction pattern
 @item @samp{vec_perm@var{m}}
 Output a (variable) vector permutation.  Operand 0 is the destination
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 5e9aa60721e..f9e851069a5 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -104,10 +104,12 @@  init_internal_fns ()
 #define load_lanes_direct { -1, -1, false }
 #define mask_load_lanes_direct { -1, -1, false }
 #define gather_load_direct { 3, 1, false }
+#define len_load_direct { -1, -1, false }
 #define mask_store_direct { 3, 2, false }
 #define store_lanes_direct { 0, 0, false }
 #define mask_store_lanes_direct { 0, 0, false }
 #define scatter_store_direct { 3, 1, false }
+#define len_store_direct { 3, 3, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
 #define ternary_direct { 0, 0, true }
@@ -2478,10 +2480,10 @@  expand_call_mem_ref (tree type, gcall *stmt, int index)
   return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
 }
 
-/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_LOAD{,_LANES} or LEN_LOAD call STMT using optab OPTAB.  */
 
 static void
-expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
+expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 {
   class expand_operand ops[3];
   tree type, lhs, rhs, maskt;
@@ -2497,6 +2499,8 @@  expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
   if (optab == vec_mask_load_lanes_optab)
     icode = get_multi_vector_move (type, optab);
+  else if (optab == len_load_optab)
+    icode = direct_optab_handler (optab, TYPE_MODE (type));
   else
     icode = convert_optab_handler (optab, TYPE_MODE (type),
 				   TYPE_MODE (TREE_TYPE (maskt)));
@@ -2507,18 +2511,24 @@  expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   create_output_operand (&ops[0], target, TYPE_MODE (type));
   create_fixed_operand (&ops[1], mem);
-  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  if (optab == len_load_optab)
+    create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
+				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
+  else
+    create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
   expand_insn (icode, 3, ops);
   if (!rtx_equal_p (target, ops[0].value))
     emit_move_insn (target, ops[0].value);
 }
 
+#define expand_mask_load_optab_fn expand_partial_load_optab_fn
 #define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
+#define expand_len_load_optab_fn expand_partial_load_optab_fn
 
-/* Expand MASK_STORE{,_LANES} call STMT using optab OPTAB.  */
+/* Expand MASK_STORE{,_LANES} or LEN_STORE call STMT using optab OPTAB.  */
 
 static void
-expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
+expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 {
   class expand_operand ops[3];
   tree type, lhs, rhs, maskt;
@@ -2532,6 +2542,8 @@  expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 
   if (optab == vec_mask_store_lanes_optab)
     icode = get_multi_vector_move (type, optab);
+  else if (optab == len_store_optab)
+    icode = direct_optab_handler (optab, TYPE_MODE (type));
   else
     icode = convert_optab_handler (optab, TYPE_MODE (type),
 				   TYPE_MODE (TREE_TYPE (maskt)));
@@ -2542,11 +2554,17 @@  expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   reg = expand_normal (rhs);
   create_fixed_operand (&ops[0], mem);
   create_input_operand (&ops[1], reg, TYPE_MODE (type));
-  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  if (optab == len_store_optab)
+    create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
+				 TYPE_UNSIGNED (TREE_TYPE (maskt)));
+  else
+    create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
   expand_insn (icode, 3, ops);
 }
 
+#define expand_mask_store_optab_fn expand_partial_store_optab_fn
 #define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
+#define expand_len_store_optab_fn expand_partial_store_optab_fn
 
 static void
 expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
@@ -3128,10 +3146,12 @@  multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
 #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_gather_load_optab_supported_p convert_optab_supported_p
+#define direct_len_load_optab_supported_p direct_optab_supported_p
 #define direct_mask_store_optab_supported_p direct_optab_supported_p
 #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_scatter_store_optab_supported_p convert_optab_supported_p
+#define direct_len_store_optab_supported_p direct_optab_supported_p
 #define direct_while_optab_supported_p convert_optab_supported_p
 #define direct_fold_extract_optab_supported_p direct_optab_supported_p
 #define direct_fold_left_optab_supported_p direct_optab_supported_p
@@ -3498,6 +3518,7 @@  internal_load_fn_p (internal_fn fn)
     case IFN_MASK_LOAD_LANES:
     case IFN_GATHER_LOAD:
     case IFN_MASK_GATHER_LOAD:
+    case IFN_LEN_LOAD:
       return true;
 
     default:
@@ -3517,6 +3538,7 @@  internal_store_fn_p (internal_fn fn)
     case IFN_MASK_STORE_LANES:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return true;
 
     default:
@@ -3577,6 +3599,7 @@  internal_fn_stored_value_index (internal_fn fn)
     case IFN_MASK_STORE:
     case IFN_SCATTER_STORE:
     case IFN_MASK_SCATTER_STORE:
+    case IFN_LEN_STORE:
       return 3;
 
     default:
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 1d190d492ff..17dac128e83 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -49,11 +49,13 @@  along with GCC; see the file COPYING3.  If not see
    - load_lanes: currently just vec_load_lanes
    - mask_load_lanes: currently just vec_mask_load_lanes
    - gather_load: used for {mask_,}gather_load
+   - len_load: currently just len_load
 
    - mask_store: currently just maskstore
    - store_lanes: currently just vec_store_lanes
    - mask_store_lanes: currently just vec_mask_store_lanes
    - scatter_store: used for {mask_,}scatter_store
+   - len_store: currently just len_store
 
    - unary: a normal unary optab, such as vec_reverse_<mode>
    - binary: a normal binary optab, such as vec_interleave_lo_<mode>
@@ -127,6 +129,8 @@  DEF_INTERNAL_OPTAB_FN (GATHER_LOAD, ECF_PURE, gather_load, gather_load)
 DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
 		       mask_gather_load, gather_load)
 
+DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
+
 DEF_INTERNAL_OPTAB_FN (SCATTER_STORE, 0, scatter_store, scatter_store)
 DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
 		       mask_scatter_store, scatter_store)
@@ -136,6 +140,8 @@  DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
 DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
 		       vec_mask_store_lanes, mask_store_lanes)
 
+DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
+
 DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
 DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
 		       check_raw_ptrs, check_ptrs)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 0c64eb52a8d..78409aa1453 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -435,3 +435,5 @@  OPTAB_D (check_war_ptrs_optab, "check_war_ptrs$a")
 OPTAB_DC (vec_duplicate_optab, "vec_duplicate$a", VEC_DUPLICATE)
 OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
 OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
+OPTAB_D (len_load_optab, "len_load_$a")
+OPTAB_D (len_store_optab, "len_store_$a")