
avoid calling memset et al. with excessively large sizes (PR 79095)

Message ID 497e7848-5690-2c4e-7277-cab674a60a35@gmail.com
State New

Commit Message

Martin Sebor Jan. 17, 2017, 12:06 a.m. UTC
The test case submitted in bug 79095 - [7 regression] spurious
stringop-overflow warning shows that GCC optimizes some loops
into calls to memset with size arguments in excess of the object
size limit.  Since such calls will unavoidably lead to a buffer
overflow and memory corruption the attached patch detects them
and replaces them with a trap.  That both prevents the buffer
overflow and eliminates the warning.
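
For illustration, a minimal sketch of the idea (not the patch itself;
the PTRDIFF_MAX limit and the names here are placeholders):

   #include <stddef.h>
   #include <stdint.h>

   /* Sketch: trap instead of calling memset with a size no object
      can have.  The real patch performs this on GIMPLE.  */
   void clear (char *p, size_t size)
   {
     if (size > PTRDIFF_MAX)      /* hypothetical object-size limit */
       __builtin_trap ();
     else
       __builtin_memset (p, 0, size);
   }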

Martin

Comments

Jakub Jelinek Jan. 17, 2017, 7:38 a.m. UTC | #1
On Mon, Jan 16, 2017 at 05:06:40PM -0700, Martin Sebor wrote:
> The test case submitted in bug 79095 - [7 regression] spurious
> stringop-overflow warning shows that GCC optimizes some loops
> into calls to memset with size arguments in excess of the object
> size limit.  Since such calls will unavoidably lead to a buffer
> overflow and memory corruption the attached patch detects them
> and replaces them with a trap.  That both prevents the buffer
> overflow and eliminates the warning.

I fear this is going to break various 32-bit database programs and similar
that mmap say 3GB of RAM and then work on that memory chunk as contiguous.
Some things don't work too well in that case (pointer differences), but it
is unlikely they would be using those, while your patch actively breaks it
even for loops that can be transformed into memset (memcpy of course isn't a
problem, because you need some virtual address space to copy it from).

And as written in the PR, IMNSHO the warning should not be enabled by
default at its current verboseness and false positive rate.

	Jakub
Jeff Law Jan. 17, 2017, 3:26 p.m. UTC | #2
On 01/16/2017 05:06 PM, Martin Sebor wrote:
> The test case submitted in bug 79095 - [7 regression] spurious
> stringop-overflow warning shows that GCC optimizes some loops
> into calls to memset with size arguments in excess of the object
> size limit.  Since such calls will unavoidably lead to a buffer
> overflow and memory corruption the attached patch detects them
> and replaces them with a trap.  That both prevents the buffer
> overflow and eliminates the warning.
But doesn't the creation of the bogus memset signal an invalid 
transformation in the loop optimizer?  ie, if we're going to convert a 
loop into a memset, then we'd damn well better be sure the loop bounds 
are reasonable.

Jeff
Martin Sebor Jan. 17, 2017, 4:12 p.m. UTC | #3
On 01/17/2017 08:26 AM, Jeff Law wrote:
> On 01/16/2017 05:06 PM, Martin Sebor wrote:
>> The test case submitted in bug 79095 - [7 regression] spurious
>> stringop-overflow warning shows that GCC optimizes some loops
>> into calls to memset with size arguments in excess of the object
>> size limit.  Since such calls will unavoidably lead to a buffer
>> overflow and memory corruption the attached patch detects them
>> and replaces them with a trap.  That both prevents the buffer
>> overflow and eliminates the warning.
> But doesn't the creation of the bogus memset signal an invalid
> transformation in the loop optimizer?  ie, if we're going to convert a
> loop into a memset, then we'd damn well better be sure the loop bounds
> are reasonable.

I'm not sure that emitting the memset call is necessarily a bug in
the loop optimizer (which in all likelihood wasn't written with
the goal of preventing or detecting possible buffer overflows).
The loop with the excessive bound is in the source code and can
be reached given the right inputs (calling v.resize(v.size() - 1)
on an empty vector).  It's a lurking bug in the program that, if
triggered, will overflow the vector and crash the program (or worse)
with or without the optimization.
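
Spelled out (a hypothetical fragment, not the submitted test case):

   #include <vector>

   void g (std::vector<int> &v)
   {
     // If v is empty, v.size () - 1 wraps around to SIZE_MAX,
     // so this asks resize () for SIZE_MAX elements.
     v.resize (v.size () - 1);
   }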

What else could the loop optimizer do in this instance?
I suppose it could just leave the loop alone and avoid emitting
the memset call.  That would avoid the warning but mask the
problem with the overflow.  In my mind, preventing the overflow
given that we have the opportunity is the right thing to do.
That is, after all, the goal of the warning.

As I mentioned privately yesterday, I'm actually pleasantly
surprised that it's helped identify this opportunity in GCC itself.
My hope was to eventually go and find the places where GCC emits
potentially out of bounds calls (based on user inputs) and fix them
to emit better code on the assumption that they can't be valid or
replace them with traps if they could happen in a running program.
It didn't occur to me that the warning itself would help find them.

Martin
Jeff Law Jan. 17, 2017, 5:57 p.m. UTC | #4
On 01/17/2017 09:12 AM, Martin Sebor wrote:
> On 01/17/2017 08:26 AM, Jeff Law wrote:
>> On 01/16/2017 05:06 PM, Martin Sebor wrote:
>>> The test case submitted in bug 79095 - [7 regression] spurious
>>> stringop-overflow warning shows that GCC optimizes some loops
>>> into calls to memset with size arguments in excess of the object
>>> size limit.  Since such calls will unavoidably lead to a buffer
>>> overflow and memory corruption the attached patch detects them
>>> and replaces them with a trap.  That both prevents the buffer
>>> overflow and eliminates the warning.
>> But doesn't the creation of the bogus memset signal an invalid
>> transformation in the loop optimizer?  ie, if we're going to convert a
>> loop into a memset, then we'd damn well better be sure the loop bounds
>> are reasonable.
>
> I'm not sure that emitting the memset call is necessarily a bug in
> the loop optimizer (which in all likelihood wasn't written with
> the goal of preventing or detecting possible buffer overflows).
> The loop with the excessive bound is in the source code and can
> be reached given the right inputs (calling v.resize(v.size() - 1)
> on an empty vector).  It's a lurking bug in the program that, if
> triggered, will overflow the vector and crash the program (or worse)
> with or without the optimization.
Right, but that doesn't mean that the loop optimizer can turn it into a 
memset.  If the bounds are such that we're going to invoke undefined 
behaviour from memset, then the loop optimizer must leave the loop alone.

>
> What else could the loop optimizer do in this instance?
> I suppose it could just leave the loop alone and avoid emitting
> the memset call.  That would avoid the warning but mask the
> problem with the overflow.  In my mind, preventing the overflow
> given that we have the opportunity is the right thing to do.
> That is, after all, the goal of the warning.
The right warning in this case is WRT the loop iteration space 
independent of mem*.


>
> As I mentioned privately yesterday, I'm actually pleasantly
> surprised that it's helped identify this opportunity in GCC itself.
> My hope was to eventually go and find the places where GCC emits
> potentially out of bounds calls (based on user inputs) and fix them
> to emit better code on the assumption that they can't be valid or
> replace them with traps if they could happen in a running program.
> It didn't occur to me that the warning itself would help find them.
>
> Martin
>
Martin Sebor Jan. 18, 2017, 3:04 a.m. UTC | #5
On 01/17/2017 10:57 AM, Jeff Law wrote:
> On 01/17/2017 09:12 AM, Martin Sebor wrote:
>> On 01/17/2017 08:26 AM, Jeff Law wrote:
>>> On 01/16/2017 05:06 PM, Martin Sebor wrote:
>>>> The test case submitted in bug 79095 - [7 regression] spurious
>>>> stringop-overflow warning shows that GCC optimizes some loops
>>>> into calls to memset with size arguments in excess of the object
>>>> size limit.  Since such calls will unavoidably lead to a buffer
>>>> overflow and memory corruption the attached patch detects them
>>>> and replaces them with a trap.  That both prevents the buffer
>>>> overflow and eliminates the warning.
>>> But doesn't the creation of the bogus memset signal an invalid
>>> transformation in the loop optimizer?  ie, if we're going to convert a
>>> loop into a memset, then we'd damn well better be sure the loop bounds
>>> are reasonable.
>>
>> I'm not sure that emitting the memset call is necessarily a bug in
>> the loop optimizer (which in all likelihood wasn't written with
>> the goal of preventing or detecting possible buffer overflows).
>> The loop with the excessive bound is in the source code and can
>> be reached given the right inputs (calling v.resize(v.size() - 1)
>> on an empty vector).  It's a lurking bug in the program that, if
>> triggered, will overflow the vector and crash the program (or worse)
>> with or without the optimization.
> Right, but that doesn't mean that the loop optimizer can turn it into a
> memset.  If the bounds are such that we're going to invoke undefined
> behaviour from memset, then the loop optimizer must leave the loop alone.
>
>>
>> What else could the loop optimizer do in this instance?
>> I suppose it could just leave the loop alone and avoid emitting
>> the memset call.  That would avoid the warning but mask the
>> problem with the overflow.  In my mind, preventing the overflow
>> given that we have the opportunity is the right thing to do.
>> That is, after all, the goal of the warning.
> The right warning in this case is WRT the loop iteration space
> independent of mem*.

I agree that warning for the user code would be appropriate if
the loop with the excessive bound were unavoidable or at least
reachable.  But in the submitted test case it isn't because
the call to vector::resize() is guarded.  In fact, the out-of-
bounds memset is also emitted with this modified test case:

   void f (std::vector<int> &v)
   {
     size_t n = v.size ();

     if (n > 1 && n < 5)
       v.resize (n - 1);
   }

I believe the root cause of the out-of-bounds memset is the
lack of support for pointer ranges or at least some notion of
their relationships and constraints.  A std::vector is defined
by three pointers that satisfy the following relation:

   start <= finish <= end_of_storage

Queries about the capacity and size of a vector are done in terms
of expressions involving these three pointers:

   size_type capacity () {
     return end_of_storage - start;
   }

   size_type size () {
     return finish - start;
   }

Space remaining is hand-coded as

   end_of_storage - finish

The trouble is that GCC has no idea about the constraints on
the three pointers and so it treats them essentially as unrelated
integers with no ranges (except for their scale that depends on
the type the pointer points to).
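
Spelling out what the invariant would let GCC conclude (a sketch of
the missing reasoning, not of any existing compiler logic):

   /* From  start <= finish <= end_of_storage  it would follow that
        size ()     == finish - start          <= capacity ()
        capacity () == end_of_storage - start
        space left  == end_of_storage - finish <= capacity ()
      so none of these differences could wrap around to a huge
      size_t value such as (size_t)-4.  */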

Absent support for pointer ranges, GCC would be able to generate
much better code with just some help from annotations telling it
about at least some of the constraints.  In this case, for example,
adding the following optimistic invariant:

    size_type space_left = end_of_storage - finish;

    if (space_left > SIZE_MAX / 2 / sizeof (T))
      __builtin_unreachable ();

to vector::_M_default_append() lets GCC avoid the memset and emit
significantly more efficient code.  On x86_64 it reduces the number
of instructions for the test case by 40%.
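
A self-contained sketch of the shape of such an annotation (the
names are illustrative; the real _M_default_append in libstdc++
differs):

   #include <cstdint>
   #include <cstring>

   struct vec { int *start, *finish, *end_of_storage; };

   void default_append (vec &v, std::size_t n)
   {
     std::size_t space_left = v.end_of_storage - v.finish;
     // Optimistic invariant: no object occupies more than half the
     // address space, so larger values cannot occur here.
     if (space_left > SIZE_MAX / 2 / sizeof (int))
       __builtin_unreachable ();
     if (space_left >= n)
       std::memset (v.finish, 0, n * sizeof (int));
   }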

Martin
Martin Sebor Jan. 18, 2017, 3:16 a.m. UTC | #6
On 01/17/2017 12:38 AM, Jakub Jelinek wrote:
> On Mon, Jan 16, 2017 at 05:06:40PM -0700, Martin Sebor wrote:
>> The test case submitted in bug 79095 - [7 regression] spurious
>> stringop-overflow warning shows that GCC optimizes some loops
>> into calls to memset with size arguments in excess of the object
>> size limit.  Since such calls will unavoidably lead to a buffer
>> overflow and memory corruption the attached patch detects them
>> and replaces them with a trap.  That both prevents the buffer
>> overflow and eliminates the warning.
>
> I fear this is going to break various 32-bit database programs and similar
> that mmap say 3GB of RAM and then work on that memory chunk as contiguous.
> Some things don't work too well in that case (pointer differences), but it
> is unlikely they would be using those, while your patch actively breaks it
> even for loops that can be transformed into memset (memcpy of course isn't a
> problem, because you need some virtual address space to copy it from).

I agree that breaking those applications would be bad.  It could
be dealt with by adding an option to let them disable the insertion
of the trap.  With the warning, programmers would get a heads up
that their (already dubious) code won't work otherwise.  I don't
think it's necessary or wise to have the default mode be the most
permissive (and most dangerous) and expect users to tweak options
to make it safe.  Rather, I would argue that it should be the other
way around.  Make the default safe and strict and let the advanced
users who know how to deal with the risks tweak those options.

Martin
Jeff Law Jan. 18, 2017, 5:59 a.m. UTC | #7
On 01/17/2017 08:16 PM, Martin Sebor wrote:
> On 01/17/2017 12:38 AM, Jakub Jelinek wrote:
>> On Mon, Jan 16, 2017 at 05:06:40PM -0700, Martin Sebor wrote:
>>> The test case submitted in bug 79095 - [7 regression] spurious
>>> stringop-overflow warning shows that GCC optimizes some loops
>>> into calls to memset with size arguments in excess of the object
>>> size limit.  Since such calls will unavoidably lead to a buffer
>>> overflow and memory corruption the attached patch detects them
>>> and replaces them with a trap.  That both prevents the buffer
>>> overflow and eliminates the warning.
>>
>> I fear this is going to break various 32-bit database programs and
>> similar
>> that mmap say 3GB of RAM and then work on that memory chunk as
>> contiguous.
>> Some things don't work too well in that case (pointer differences),
>> but it
>> is unlikely they would be using those, while your patch actively
>> breaks it
>> even for loops that can be transformed into memset (memcpy of course
>> isn't a
>> problem, because you need some virtual address space to copy it from).
>
> I agree that breaking those applications would be bad.  It could
> be dealt with by adding an option to let them disable the insertion
> of the trap.  With the warning, programmers would get a heads up
> that their (already dubious) code won't work otherwise.  I don't
> think it's necessary or wise to have the default mode be the most
> permissive (and most dangerous) and expect users to tweak options
> to make it safe.  Rather, I would argue that it should be the other
> way around.  Make the default safe and strict and let the advanced
> users who know how to deal with the risks tweak those options.
I still come back to the assertion that changing this loop to a mem* is 
fundamentally the wrong thing to do as it changes something that has 
well defined semantics to something that is invalid.

Thus the transformation into a mem* call is invalid.
jeff
Jakub Jelinek Jan. 18, 2017, 8:10 a.m. UTC | #8
On Tue, Jan 17, 2017 at 10:59:43PM -0700, Jeff Law wrote:
> > I agree that breaking those applications would be bad.  It could
> > be dealt with by adding an option to let them disable the insertion
> > of the trap.  With the warning, programmers would get a heads up
> > that their (already dubious) code won't work otherwise.  I don't
> > think it's necessary or wise to have the default mode be the most
> > permissive (and most dangerous) and expect users to tweak options
> > to make it safe.  Rather, I would argue that it should be the other
> > way around.  Make the default safe and strict and let the advanced
> > users who know how to deal with the risks tweak those options.
> I still come back to the assertion that changing this loop to a mem* is
> fundamentally the wrong thing to do as it changes something that has well
> defined semantics to something that is invalid.
> 
> Thus the transformation into a mem* call is invalid.

The mem* call is as valid as the loop, it will work exactly the same.
If you have say on 32-bit target
#include <stdlib.h>

int
main ()
{
  char *p = malloc (3U * 1024 * 1024 * 1024);
  if (p == NULL)
    return 0;
  size_t i;
  for (i = 0; i < 3U * 1024 * 1024 * 1024; i++)
    p[i] = 6;
  use (p);
  return 0;
}

then the loop does the same thing as will memset (p, 6, 3U * 1024 * 1024 * 1024);
do.  On such large objects some operations may not work properly, e.g.
&p[i] - &p[0] might be negative etc., but that is not something the above
loop does or memset will do internally.  If the loop doesn't use just 3/4 of
the address space, but much more, e.g. more than whole address space minus
one page, which is what happens in the testcase, it is indeed quite sure it
will crash if invoked, but the problem with the warning is the same with
many other late warnings or warnings excessively using VRP etc.
If we are given
void
foo (char *p, size_t len)
{
  memset (p, '\0', len);
}
then it also can invoke UB if somebody calls it with a huge len
(~(size_t)0), but we rightly don't want to warn because we can't prove it
will be called with that.  If some unrelated or related test somewhere else
makes the value range smaller on some path or turns it constant on some
path, the warning does warn, even when there is a sometimes low, sometimes
very high possibility the warning is on dead code.
Which is why IMHO it is a very bad idea to enable this warning by default
even without any -W* option, it just has too many false positives.
Turning extra large memcpy (as memcpy can't have overlap in between the
objects, already starting from half of address space is fine; for memset
I'd think only for something like -4096 or something similar that is just
unlikely in hosted environment to be a single object) into a trap or
unreachable might be a nice optimization and allow optimizing away code that
would follow the spot that would always crash.  But with warning above it
the problem is that it is hard to figure out if it is reachable or not.

	Jakub
Martin Sebor Jan. 18, 2017, 5:45 p.m. UTC | #9
On 01/18/2017 01:10 AM, Jakub Jelinek wrote:
> On Tue, Jan 17, 2017 at 10:59:43PM -0700, Jeff Law wrote:
>>> I agree that breaking those applications would be bad.  It could
>>> be dealt with by adding an option to let them disable the insertion
>>> of the trap.  With the warning, programmers would get a heads up
>>> that their (already dubious) code won't work otherwise.  I don't
>>> think it's necessary or wise to have the default mode be the most
>>> permissive (and most dangerous) and expect users to tweak options
>>> to make it safe.  Rather, I would argue that it should be the other
>>> way around.  Make the default safe and strict and let the advanced
>>> users who know how to deal with the risks tweak those options.
>> I still come back to the assertion that changing this loop to a mem* is
>> fundamentally the wrong thing to do as it changes something that has well
>> defined semantics to something that is invalid.
>>
>> Thus the transformation into a mem* call is invalid.
>
> The mem* call is as valid as the loop, it will work exactly the same.

The problem here isn't the loop itself or the memset call but the
bounds computed from the inputs.

In the test case submitted with the bug the inputs supplied by
the program are well within a valid range, yet the bounds of
the loop that are computed by GCC from those inputs are excessive
because they are based on impossible ranges (or an incomplete view
of the program).

There is no unbounded loop in

   void f(std::vector<int> &v)
   {
     size_t n = v.size ();

     if (n > 1 && n < 5)
       v.resize (n - 1);
   }

yet GCC synthesizes one:

   void f(std::vector<int>&) (struct vector & v)
   {
     ...
     <bb 2> [100.00%]:
     _7 = MEM[(int * *)v_5(D)];             // v._M_impl._M_start
     _8 = MEM[(int * *)v_5(D) + 8B];        // v._M_impl._M_finish
     _9 = (long int) _8;
     _10 = (long int) _7;
     _11 = _9 - _10;
     _12 = _11 /[ex] 4;
     _13 = (long unsigned int) _12;         // v.size()
     _1 = _13 + 18446744073709551614;       // v.size() - 2
     if (_1 <= 2)                           // if v.size() in [2, 4]
       goto <bb 3>; [36.64%]
     else
       goto <bb 9>; [63.36%]

     <bb 3> [36.64%]:
     _2 = _13 + 18446744073709551615;       // v.size() - 1
     if (_2 > _13)                          // if (v.size() == 0)
       goto <bb 4>; [29.56%]                //   this can't happen
     else
       goto <bb 7>; [70.44%]

     <bb 4> [5.85%]:
     _24 = v_5(D)->D.15828._M_impl._M_end_of_storage;
     _25 = (long int) _24;
     _28 = _25 - _9;                        // bytes left
     if (_28 == -4)                         // this can't happen
       goto <bb 5>; [67.00%]
     else
       goto <bb 6>; [33.00%]

     <bb 5> [3.92%]:
     __builtin_memset (_8, 0, 18446744073709551612);   // (size_t)-4
     ...

     <bb 6> [1.93%]:
     std::__throw_length_error ("vector::_M_default_append");

> If you have say on 32-bit target
> #include <stdlib.h>
>
> int
> main ()
> {
>   char *p = malloc (3U * 1024 * 1024 * 1024);
>   if (p == NULL)
>     return 0;
>   size_t i;
>   for (i = 0; i < 3U * 1024 * 1024 * 1024; i++)
>     p[i] = 6;
>   use (p);
>   return 0;
> }

Unlike in the test case submitted with the bug, here the loop is in
the source and warning on it would be justified for most programs.

Here's a test case that's closer to the one from the bug.  It also
ends up with the out of bounds memset even at -O1, during PRE.

typedef __SIZE_TYPE__ size_t;

struct S {
   int *p0, *p1, *p2;

   size_t size () const { return p1 - p0; }

   void f (size_t n) {
     if (n > size ())       // can't happen because
       foo (n - size ());   //   n is in [1, MIN(size() - 1, 3)]
     else if (n < size ())
       bar (p0 + n);
   }

   void foo (size_t n)
   {
     size_t left = (size_t)(p2 - p1);
     if (left >= n)
       __builtin_memset (p2, 0, n * sizeof *p2);
   }

   void bar (int*);
};

void f (S &s)
{
   size_t n = s.size ();
   if (n > 1 && n < 5)
     s.f (n - 1);
}

Martin
Jeff Law Jan. 20, 2017, 11:30 p.m. UTC | #10
On 01/18/2017 10:45 AM, Martin Sebor wrote:
> On 01/18/2017 01:10 AM, Jakub Jelinek wrote:
>> On Tue, Jan 17, 2017 at 10:59:43PM -0700, Jeff Law wrote:
>>>> I agree that breaking those applications would be bad.  It could
>>>> be dealt with by adding an option to let them disable the insertion
>>>> of the trap.  With the warning, programmers would get a heads up
>>>> that their (already dubious) code won't work otherwise.  I don't
>>>> think it's necessary or wise to have the default mode be the most
>>>> permissive (and most dangerous) and expect users to tweak options
>>>> to make it safe.  Rather, I would argue that it should be the other
>>>> way around.  Make the default safe and strict and let the advanced
>>>> users who know how to deal with the risks tweak those options.
>>> I still come back to the assertion that changing this loop to a mem* is
>>> fundamentally the wrong thing to do as it changes something that has
>>> well
>>> defined semantics to something that is invalid.
>>>
>>> Thus the transformation into a mem* call is invalid.
>>
>> The mem* call is as valid as the loop, it will work exactly the same.
>
> The problem here isn't the loop itself or the memset call but the
> bounds computed from the inputs.
Sorry.  My bad.  Only memcpy has the SIZE_MAX / 2 restriction.

[ ... ]
Going to work from the self-contained test...


>
> Here's a test case that's closer to the one from the bug.  It also
> ends up with the out of bounds memset even at -O1, during PRE.
>
> typedef __SIZE_TYPE__ size_t;
>
> struct S {
>   int *p0, *p1, *p2;
>
>   size_t size () const { return p1 - p0; }
>
>   void f (size_t n) {
>     if (n > size ())       // can't happen because
>       foo (n - size ());   //   n is in [1, MIN(size() - 1, 3)]
>     else if (n < size ())
>       bar (p0 + n);
>   }
>
>   void foo (size_t n)
>   {
>     size_t left = (size_t)(p2 - p1);
>     if (left >= n)
>       __builtin_memset (p2, 0, n * sizeof *p2);
>   }
>
>   void bar (int*);
> };
>
> void f (S &s)
> {
>   size_t n = s.size ();
>   if (n > 1 && n < 5)
>     s.f (n - 1);
> }


Looking at the .vrp1 dump for this test we find:

;;   basic block 3, loop depth 0, count 0, freq 3664, maybe hot
;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
;;    pred:       2 [36.6%]  (TRUE_VALUE,EXECUTABLE)
   _18 = ASSERT_EXPR <_13, _13 + 18446744073709551614 <= 2>;
   _2 = _18 + 18446744073709551615;
   if (_2 > _18)
     goto <bb 4>; [50.00%]
   else
     goto <bb 6>; [50.00%]

_2 > _18 is the same as

(_18 - 1) > _18

Which is only true if _18 is zero.  And since _18 has a computed range 
of [2..4], that can never happen.
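
The identity at work, for any unsigned x (a hypothetical helper,
shown only for clarity):

   int only_for_zero (unsigned long x)
   {
     /* x - 1 exceeds x only when the subtraction wraps,
        i.e. only for x == 0.  */
     return (x - 1) > x;
   }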

VRP doesn't currently handle this case, though we do have some code to 
turn relational comparisons into equality comparisons.  If I manually 
turn the relational into _18 == 0 at that point, then the next VRP pass 
is able to optimize the conditional away which in turn eliminates the 
problematical memset call & warning.

So one possible approach would be to teach simplify_cond_using_ranges to 
simplify that kind of condition.  I don't know if that would fix the 
reported testcase as I haven't looked at it directly under the debugger yet.



Jeff
Jeff Law Jan. 20, 2017, 11:32 p.m. UTC | #11
On 01/18/2017 01:10 AM, Jakub Jelinek wrote:
> On Tue, Jan 17, 2017 at 10:59:43PM -0700, Jeff Law wrote:
>>> I agree that breaking those applications would be bad.  It could
>>> be dealt with by adding an option to let them disable the insertion
>>> of the trap.  With the warning, programmers would get a heads up
>>> that their (already dubious) code won't work otherwise.  I don't
>>> think it's necessary or wise to have the default mode be the most
>>> permissive (and most dangerous) and expect users to tweak options
>>> to make it safe.  Rather, I would argue that it should be the other
>>> way around.  Make the default safe and strict and let the advanced
>>> users who know how to deal with the risks tweak those options.
>> I still come back to the assertion that changing this loop to a mem* is
>> fundamentally the wrong thing to do as it changes something that has well
>> defined semantics to something that is invalid.
>>
>> Thus the transformation into a mem* call is invalid.
>
> The mem* call is as valid as the loop, it will work exactly the same.
> If you have say on 32-bit target
> #include <stdlib.h>
>
> int
> main ()
> {
>   char *p = malloc (3U * 1024 * 1024 * 1024);
>   if (p == NULL)
>     return 0;
>   size_t i;
>   for (i = 0; i < 3U * 1024 * 1024 * 1024; i++)
>     p[i] = 6;
>   use (p);
>   return 0;
> }
>
> then the loop does the same thing as will memset (p, 6, 3U * 1024 * 1024 * 1024);
> do.  On such large objects some operations may not work properly, e.g.
> &p[i] - &p[0] might be negative etc., but that is not something the above
> loop does or memset will do internally.  If the loop doesn't use just 3/4 of
> the address space, but much more, e.g. more than whole address space minus
> one page, which is what happens in the testcase, it is indeed quite sure it
> will crash if invoked, but the problem with the warning is the same with
> many other late warnings or warnings excessively using VRP etc.
Not in my mind, it's different.  It's not triggered by path isolation. 
It's standard const propagation + simplification.

jeff
Jakub Jelinek Jan. 20, 2017, 11:34 p.m. UTC | #12
On Fri, Jan 20, 2017 at 04:32:19PM -0700, Jeff Law wrote:
> > then the loop does the same thing as will memset (p, 6, 3U * 1024 * 1024 * 1024);
> > do.  On such large objects some operations may not work properly, e.g.
> > &p[i] - &p[0] might be negative etc., but that is not something the above
> > loop does or memset will do internally.  If the loop doesn't use just 3/4 of
> > the address space, but much more, e.g. more than whole address space minus
> > one page, which is what happens in the testcase, it is indeed quite sure it
> > will crash if invoked, but the problem with the warning is the same with
> > many other late warnings or warnings excessively using VRP etc.
> Not in my mind, it's different.  It's not triggered by path isolation. It's
> standard const propagation + simplification.

So where does the constant -1 length appear there?  The test clearly just
attempts to clear some variable length - 1.  I admit I haven't looked at the
dumps in detail, I should...

	Jakub
Jeff Law Jan. 20, 2017, 11:56 p.m. UTC | #13
On 01/20/2017 04:34 PM, Jakub Jelinek wrote:
> On Fri, Jan 20, 2017 at 04:32:19PM -0700, Jeff Law wrote:
>>> then the loop does the same thing as will memset (p, 6, 3U * 1024 * 1024 * 1024);
>>> do.  On such large objects some operations may not work properly, e.g.
>>> &p[i] - &p[0] might be negative etc., but that is not something the above
>>> loop does or memset will do internally.  If the loop doesn't use just 3/4 of
>>> the address space, but much more, e.g. more than whole address space minus
>>> one page, which is what happens in the testcase, it is indeed quite sure it
>>> will crash if invoked, but the problem with the warning is the same with
>>> many other late warnings or warnings excessively using VRP etc.
>> Not in my mind, it's different.  It's not triggered by path isolation. It's
>> standard const propagation + simplification.
>
> So where does the constant -1 length appear there?  The test clearly just
> attempts to clear some variable length - 1.  I admit I haven't looked at the
> dumps in detail, I should...
At least in Martin's simplified test it's just a series of standard 
constant propagations and obvious simplifications.  No threading, no 
path isolation.

;;   basic block 2, loop depth 0, count 0, freq 10000, maybe hot
;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
;;    pred:       ENTRY [100.0%]  (FALLTHRU,EXECUTABLE)
   _7 = MEM[(int * *)s_5(D)];
   _8 = MEM[(int * *)s_5(D) + 8B];
   _9 = (long int) _8;
   _10 = (long int) _7;
   _11 = _9 - _10;
   _12 = _11 /[ex] 4;
   _13 = (long unsigned int) _12;
   _1 = _13 + 18446744073709551614;
   if (_1 <= 2)
     goto <bb 3>; [36.64%]
   else
     goto <bb 8>; [63.36%]
;;    succ:       3 [36.6%]  (TRUE_VALUE,EXECUTABLE)
;;                8 [63.4%]  (FALSE_VALUE,EXECUTABLE)

;;   basic block 3, loop depth 0, count 0, freq 3664, maybe hot
;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
;;    pred:       2 [36.6%]  (TRUE_VALUE,EXECUTABLE)
   _2 = _13 + 18446744073709551615;
   _14 = MEM[(int * *)s_5(D)];
   _15 = MEM[(int * *)s_5(D) + 8B];
   _16 = (long int) _15;
   _17 = (long int) _14;
   _18 = _16 - _17;
   _19 = _18 /[ex] 4;
   _20 = (long unsigned int) _19;
   if (_2 > _20)
     goto <bb 4>; [50.00%]
   else
     goto <bb 6>; [50.00%]
;;    succ:       4 [50.0%]  (TRUE_VALUE,EXECUTABLE)
;;                6 [50.0%]  (FALSE_VALUE,EXECUTABLE)

;;   basic block 4, loop depth 0, count 0, freq 1832, maybe hot
;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
;;    pred:       3 [50.0%]  (TRUE_VALUE,EXECUTABLE)
   _21 = _2 - _20;
   _22 = MEM[(int * *)s_5(D) + 16B];
   _23 = (long int) _22;
   _24 = _23 - _16;
   _25 = _24 /[ex] 4;
   left_26 = (size_t) _25;
   if (_21 <= left_26)
     goto <bb 5>; [33.00%]
   else
     goto <bb 8>; [67.00%]
;;    succ:       5 [33.0%]  (TRUE_VALUE,EXECUTABLE)
;;                8 [67.0%]  (FALSE_VALUE,EXECUTABLE)

;;   basic block 5, loop depth 0, count 0, freq 605, maybe hot
;;    prev block 4, next block 6, flags: (NEW, REACHABLE, VISITED)
;;    pred:       4 [33.0%]  (TRUE_VALUE,EXECUTABLE)
   _27 = _21 * 4;
   __builtin_memset (_22, 0, _27);
   goto <bb 8>; [100.00%]
;;    succ:       8 [100.0%]  (FALLTHRU,EXECUTABLE)


In particular look at _27, which is _21 * 4.

_21 is _2 - _20

If you follow things through the use-def chains and simplify you'll see 
that _2 - _20 is always -1.
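
Worked out with the names from the dump (illustrative):

   /* _20 repeats the size () computation, so _20 == _13.
      _2  == _13 + 18446744073709551615 == _13 - 1.
      _21 == _2 - _20 == (_13 - 1) - _13 == (size_t)-1.
      _27 == _21 * 4 == (size_t)-4 == 18446744073709551612,
      the constant that ends up as the memset length.  */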


Jeff
Jeff Law Jan. 21, 2017, 5:22 a.m. UTC | #14
On 01/20/2017 04:30 PM, Jeff Law wrote:
> Going to work from the self-contained test...
>
>
>>
>> Here's a test case that's closer to the one from the bug.  It also
>> ends up with the out of bounds memset even at -O1, during PRE.
>>
>> typedef __SIZE_TYPE__ size_t;
>>
>> struct S {
>>   int *p0, *p1, *p2;
>>
>>   size_t size () const { return p1 - p0; }
>>
>>   void f (size_t n) {
>>     if (n > size ())       // can't happen because
>>       foo (n - size ());   //   n is in [1, MIN(size() - 1, 3)]
>>     else if (n < size ())
>>       bar (p0 + n);
>>   }
>>
>>   void foo (size_t n)
>>   {
>>     size_t left = (size_t)(p2 - p1);
>>     if (left >= n)
>>       __builtin_memset (p2, 0, n * sizeof *p2);
>>   }
>>
>>   void bar (int*);
>> };
>>
>> void f (S &s)
>> {
>>   size_t n = s.size ();
>>   if (n > 1 && n < 5)
>>     s.f (n - 1);
>> }
>
>
> Looking at the .vrp1 dump for this test we find:
>
> ;;   basic block 3, loop depth 0, count 0, freq 3664, maybe hot
> ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       2 [36.6%]  (TRUE_VALUE,EXECUTABLE)
>   _18 = ASSERT_EXPR <_13, _13 + 18446744073709551614 <= 2>;
>   _2 = _18 + 18446744073709551615;
>   if (_2 > _18)
>     goto <bb 4>; [50.00%]
>   else
>     goto <bb 6>; [50.00%]
>
> _2 > _18 is the same as
>
> (_18 - 1) > _18
>
> Which is only true if _18 is zero.  And since _18 has a computed range
> of [2..4], that can never happen.
>
> VRP doesn't currently handle this case, though we do have some code to
> turn relational comparisons into equality comparisons.  If I manually
> turn the relational into _18 == 0 at that point, then the next VRP pass
> is able to optimize the conditional away which in turn eliminates the
> problematical memset call & warning.
>
> So one possible approach would be to treat simplify_cond_using_ranges to
> simplify that kind of condition.  I don't know if that would fix the
> reported testcase as I haven't looked at it directly under the debugger
> yet.
In fact, match.pd already has a pattern to handle a relational like
(X + -1) > X

The pattern doesn't fire because of a single_use guard.   This was 
discussed back in 2015 with Jakub advocating for a single_use guard, but 
Marc leaning the other direction, but not enough to argue too much about it.

In early 2016 Marc actually submitted the match.pd patch with the single 
use guard per Jakub's preference.

The single use guard checks the SSA_NAME holding the (X + CST) 
expression for single use.   It's not entirely clear why we check for 
single use.  If it's because of object lifetimes, then that's a total 
red herring here.

Given something like:
x = a + c
if (a > x)

The transformation into

x = a + c;
if (a < c')

Where x has multiple uses, the transformation will not inherently make 
the lifetime of "x" worse (if the conditional was the last use, then the 
lifetime of x may shorten somewhat).  If there were uses after the 
condition, the lifetime of "x" remains unchanged.

"a" was already live from the assignment to at least the conditional. 
But again, we haven't inherently changed its lifetime.

However, what can happen is after transformation, the assignment might 
sink to a later use point in the CFG, which might lengthen the lifetime 
of "a", but it's also shortening the lifetime of "x" by the same amount.

So, ultimately I don't buy the argument that we should inhibit this 
based on register lifetime issues.
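
A concrete instance of the rewrite under discussion (an illustrative
fragment; unsigned arithmetic wraps):

   unsigned f (unsigned a)
   {
     unsigned x = a + ~0u;   /* CST == -1, so x == a - 1          */
     if (x > a)              /* folds to a == 0, since            */
       return 0;             /* CST' == UINT_MAX - UINT_MAX == 0  */
     return x;               /* second use of x blocks the fold   */
   }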

Jakub, perhaps you remember why you wanted this restricted to cases 
where "x" has a single use?

Certainly the easiest thing to do is remove the single use restriction. 
If we ultimately don't want to do that, we can probably detect when the 
transformation will allow the conditional to be eliminated in VRP and I 
would claim that elimination of the conditional at compile time trumps 
the other potential concerns here.  That's going to be uglier and 
essentially duplicate parts of what match.pd does, but certainly doable 
-- I've even prototyped it.

Thoughts?

jeff
Marc Glisse Jan. 21, 2017, 8 a.m. UTC | #15
On Fri, 20 Jan 2017, Jeff Law wrote:

> On 01/20/2017 04:30 PM, Jeff Law wrote:
>> Going to work from the self-contained test...
>> 
>> 
>>> 
>>> Here's a test case that's closer to the one from the bug.  It also
>>> ends up with the out of bounds memset even at -O1, during PRE.
>>> 
>>> typedef __SIZE_TYPE__ size_t;
>>> 
>>> struct S {
>>>   int *p0, *p1, *p2;
>>>
>>>   size_t size () const { return p1 - p0; }
>>>
>>>   void f (size_t n) {
>>>     if (n > size ())       // can't happen because
>>>       foo (n - size ());   //   n is in [1, MIN(size() - 1, 3)]
>>>     else if (n < size ())
>>>       bar (p0 + n);
>>>   }
>>>
>>>   void foo (size_t n)
>>>   {
>>>     size_t left = (size_t)(p2 - p1);
>>>     if (left >= n)
>>>       __builtin_memset (p2, 0, n * sizeof *p2);
>>>   }
>>>
>>>   void bar (int*);
>>> };
>>> 
>>> void f (S &s)
>>> {
>>>   size_t n = s.size ();
>>>   if (n > 1 && n < 5)
>>>     s.f (n - 1);
>>> }
>> 
>> 
>> Looking at the .vrp1 dump for this test we find:
>> 
>> ;;   basic block 3, loop depth 0, count 0, freq 3664, maybe hot
>> ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
>> ;;    pred:       2 [36.6%]  (TRUE_VALUE,EXECUTABLE)
>>   _18 = ASSERT_EXPR <_13, _13 + 18446744073709551614 <= 2>;
>>   _2 = _18 + 18446744073709551615;
>>   if (_2 > _18)
>>     goto <bb 4>; [50.00%]
>>   else
>>     goto <bb 6>; [50.00%]
>> 
>> _2 > _18 is the same as
>> 
>> (_18 - 1) > _18
>> 
>> Which is only true if _18 is zero.  And since _18 has a computed range
>> of [2..4], that can never happen.
>> 
>> VRP doesn't currently handle this case, though we do have some code to
>> turn relational comparisons into equality comparisons.  If I manually
>> turn the relational into _18 == 0 at that point, then the next VRP pass
>> is able to optimize the conditional away which in turn eliminates the
>> problematical memset call & warning.
>> 
>> So one possible approach would be to teach simplify_cond_using_ranges to
>> simplify that kind of condition.  I don't know if that would fix the
>> reported testcase as I haven't looked at it directly under the debugger
>> yet.
> In fact, match.pd already has a pattern to handle a relational like
> (X + -1) > X
>
> The pattern doesn't fire because of a single_use guard.   This was discussed 
> back in 2015 with Jakub advocating for a single_use guard, but Marc leaning 
> the other direction, but not enough to argue too much about it.
>
> In early 2016 Marc actually submitted the match.pd patch with the single use 
> guard per Jakub's preference.
>
> The single use guard checks the SSA_NAME holding the (X + CST) expression for 
> single use.   It's not entirely clear why we check for single use.  If it's 
> because of object lifetimes, then that's a total red herring here.
>
> Given something like:
> x = a + c
> if (a > x)
>
> The transformation into
>
> x = a + c;
> if (a < c')
>
> Where x has multiple uses, the transformation will not inherently change the 
> lifetime of "x" worse (if the conditional was the last use, then the lifetime 
> of x may shorten somewhat).  If there were uses after the condition, the 
> lifetime of "x" remains unchanged.

I don't remember the conversation and I don't have experience with 
register pressure. My assumption would be that the issue is with the new 
constant c'. In the first case, during the comparison, only a and x are 
alive, while in the second case, c' is as well, which may use an extra 
register. Does that make sense?

(I don't know if that's bad enough to inhibit the simplification)

Or are we talking about the transformation that has this comment in 
match.pd:

    Currently restricted to single use so as not to interfere too much with
    ADD_OVERFLOW detection in tree-ssa-math-opts.c.

That detection logic needs to be improved anyway (see also a discussion 
with Vincent Lefevre yesterday on gcc-help; users do write the simplified 
form directly); the single_use is supposedly only there until that is done 
(really depends on how well the improved logic works...).

> "a" was already live from the assignment to at least the conditional. But 
> again, we haven't inherently changed its lifetime.
>
> However, what can happen is after transformation, the assignment might sink 
> to a later use point in the CFG, which might lengthen the lifetime of "a", 
> but it's also shortening the lifetime of "x" by the same amount.
>
> So, ultimately I don't buy the argument that we should inhibit this based on 
> register lifetime issues.
>
> Jakub, perhaps you remember why you wanted this restricted to cases where "x" 
> has a single use?
>
> Certainly the easiest thing to do is remove the single use restriction. If we 
> ultimately don't want to do that, we can probably detect when the 
> transformation will allow the conditional to be eliminated in VRP and I would 
> claim that elimination of the conditional at compile time trumps the other 
> potential concerns here.  That's going to be uglier and essentially duplicate 
> parts of what match.pd does, but certainly doable -- I've even prototyped 
> it.
>
> Thoughts?

I believe we are supposed to call match.pd from VRP eventually (though 
that may not have happened yet), so the test could probably also be done 
there, if that looks more convenient and we decide not to remove 
single_use completely for this transformation.
Richard Biener Jan. 23, 2017, 9:10 a.m. UTC | #16
On Sat, Jan 21, 2017 at 6:22 AM, Jeff Law <law@redhat.com> wrote:
> On 01/20/2017 04:30 PM, Jeff Law wrote:
>>
>> Going to work from the self-contained test...
>>
>>
>>>
>>> Here's a test case that's closer to the one from the bug.  It also
>>> ends up with the out of bounds memset even at -O1, during PRE.
>>>
>>> typedef __SIZE_TYPE__ size_t;
>>>
>>> struct S {
>>>   int *p0, *p1, *p2;
>>>
>>>   size_t size () const { return p1 - p0; }
>>>
>>>   void f (size_t n) {
>>>     if (n > size ())       // can't happen because
>>>       foo (n - size ());   //   n is in [1, MIN(size() - 1, 3)]
>>>     else if (n < size ())
>>>       bar (p0 + n);
>>>   }
>>>
>>>   void foo (size_t n)
>>>   {
>>>     size_t left = (size_t)(p2 - p1);
>>>     if (left >= n)
>>>       __builtin_memset (p2, 0, n * sizeof *p2);
>>>   }
>>>
>>>   void bar (int*);
>>> };
>>>
>>> void f (S &s)
>>> {
>>>   size_t n = s.size ();
>>>   if (n > 1 && n < 5)
>>>     s.f (n - 1);
>>> }
>>
>>
>>
>> Looking at the .vrp1 dump for this test we find:
>>
>> ;;   basic block 3, loop depth 0, count 0, freq 3664, maybe hot
>> ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
>> ;;    pred:       2 [36.6%]  (TRUE_VALUE,EXECUTABLE)
>>   _18 = ASSERT_EXPR <_13, _13 + 18446744073709551614 <= 2>;
>>   _2 = _18 + 18446744073709551615;
>>   if (_2 > _18)
>>     goto <bb 4>; [50.00%]
>>   else
>>     goto <bb 6>; [50.00%]
>>
>> _2 > _18 is the same as
>>
>> (_18 - 1) > _18
>>
>> Which is only true if _18 is zero.  And since _18 has a computed range
>> of [2..4], that can never happen.
>>
>> VRP doesn't currently handle this case, though we do have some code to
>> turn relational comparisons into equality comparisons.  If I manually
>> turn the relational into _18 == 0 at that point, then the next VRP pass
>> is able to optimize the conditional away which in turn eliminates the
>> problematical memset call & warning.
>>
>> So one possible approach would be to teach simplify_cond_using_ranges to
>> simplify that kind of condition.  I don't know if that would fix the
>> reported testcase as I haven't looked at it directly under the debugger
>> yet.
>
> In fact, match.pd already has a pattern to handle a relational like
> (X + -1) > X
>
> The pattern doesn't fire because of a single_use guard.   This was discussed
> back in 2015 with Jakub advocating for a single_use guard, but Marc leaning
> the other direction, but not enough to argue too much about it.
>
> In early 2016 Marc actually submitted the match.pd patch with the single use
> guard per Jakub's preference.
>
> The single use guard checks the SSA_NAME holding the (X + CST) expression
> for single use.   It's not entirely clear why we check for single use.  If
> it's because of object lifetimes, then that's a total red herring here.
>
> Given something like:
> x = a + c
> if (a > x)
>
> The transformation into
>
> x = a + c;
> if (a < c')
>
> Where x has multiple uses, the transformation will not inherently change the
> lifetime of "x" worse (if the conditional was the last use, then the
> lifetime of x may shorten somewhat).  If there were uses after the
> condition, the lifetime of "x" remains unchanged.
>
> "a" was already live from the assignment to at least the conditional. But
> again, we haven't inherently changed its lifetime.
>
> However, what can happen is after transformation, the assignment might sink
> to a later use point in the CFG, which might lengthen the lifetime of "a",
> but it's also shortening the lifetime of "x" by the same amount.
>
> So, ultimately I don't buy the argument that we should inhibit this based on
> register lifetime issues.
>
> Jakub, perhaps you remember why you wanted this restricted to cases where
> "x" has a single use?

I think it was added for the reason stated in the comment:

/* When one argument is a constant, overflow detection can be simplified.
   Currently restricted to single use so as not to interfere too much with
   ADD_OVERFLOW detection in tree-ssa-math-opts.c.
   A + CST CMP A  ->  A CMP' CST' */
(for cmp (lt le ge gt)
     out (gt gt le le)
 (simplify
  (cmp:c (plus@2 @0 INTEGER_CST@1) @0)
  (if (TYPE_UNSIGNED (TREE_TYPE (@0))
       && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
       && wi::ne_p (@1, 0)
       && single_use (@2))
   (out @0 { wide_int_to_tree (TREE_TYPE (@0), wi::max_value
               (TYPE_PRECISION (TREE_TYPE (@0)), UNSIGNED) - @1); }))))

> Certainly the easiest thing to do is remove the single use restriction. If
> we ultimately don't want to do that, we can probably detect when the
> transformation will allow the conditional to be eliminated in VRP and I
> would claim that elimination of the conditional at compile time trumps the
> other potential concerns here.  That's going to be uglier and essentially
> duplicate parts of what match.pd does, but certainly doable -- I've even
> prototyped it.

Yeah, duplicating that sounds ugly.

But undoing it for the sake of detecting overflow builtins is as well (OTOH
nothing does the reverse transform so if the user wrote the "simplified"
variant we don't detect overflow either).

Richard.

> Thoughts?
>
> jeff
>
>
Jeff Law Jan. 23, 2017, 9:10 p.m. UTC | #17
On 01/23/2017 02:10 AM, Richard Biener wrote:
>
> I think it was added for the reason stated in the comment:
>
> /* When one argument is a constant, overflow detection can be simplified.
>    Currently restricted to single use so as not to interfere too much with
>    ADD_OVERFLOW detection in tree-ssa-math-opts.c.
>    A + CST CMP A  ->  A CMP' CST' */
> (for cmp (lt le ge gt)
>      out (gt gt le le)
>  (simplify
>   (cmp:c (plus@2 @0 INTEGER_CST@1) @0)
>   (if (TYPE_UNSIGNED (TREE_TYPE (@0))
>        && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
>        && wi::ne_p (@1, 0)
>        && single_use (@2))
>    (out @0 { wide_int_to_tree (TREE_TYPE (@0), wi::max_value
>                (TYPE_PRECISION (TREE_TYPE (@0)), UNSIGNED) - @1); }))))
>
>> Certainly the easiest thing to do is remove the single use restriction. If
>> we ultimately don't want to do that, we can probably detect when the
>> transformation will allow the conditional to be eliminated in VRP and I
>> would claim that elimination of the conditional at compile time trumps the
>> other potential concerns here.  That's going to be uglier and essentially
>> duplicate parts of what match.pd does, but certainly doable -- I've even
>> prototyped it.
>
> Yeah, duplicating that sounds ugly.
So one possibility is to put an additional if case in the code above, 
without the single_use condition, which fires iff we're turning the 
two-operand relational into a zero/not-zero test.
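
For concreteness, one way that extra case might look, reusing the
quoted pattern (an untested sketch; the zero/not-zero case is the one
where CST is the all-ones value, so that CST' is 0):

   (for cmp (lt le ge gt)
        out (gt gt le le)
    (simplify
     (cmp:c (plus@2 @0 INTEGER_CST@1) @0)
     (if (TYPE_UNSIGNED (TREE_TYPE (@0))
          && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
          && wi::ne_p (@1, 0)
          /* Allow multiple uses when the result is a plain
             zero/not-zero test, i.e. when CST' would be 0.  */
          && (single_use (@2)
              || wi::eq_p (@1, wi::max_value
                   (TYPE_PRECISION (TREE_TYPE (@0)), UNSIGNED))))
      (out @0 { wide_int_to_tree (TREE_TYPE (@0), wi::max_value
                  (TYPE_PRECISION (TREE_TYPE (@0)), UNSIGNED) - @1); }))))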

The only problem is that this doesn't manage to clean things up until dom3. 
We could instead have them cleaned up by dom2 if the tweak is in VRP. 
This is because we don't call the gimple folders on the statement until 
after dom2 is already complete.

So there's 10 blocks of crud in the IL for the 40 passes between dom2 
and dom3.  And FWIW, this shows up a hundred times or so during a GCC 
bootstrap.  I haven't looked to see how many of those result in just a 
simplified test vs those with nice secondary optimizations like we see 
with the test in the BZ.


>
> But undoing it for the sake of detecting overflow builtins is as well (OTOH
> nothing does the reverse transform so if the user wrote the "simplified"
> variant we don't detect overflow either).
I'm not sure there'd be enough information left to undo the 
transformation, even if it were restricted to the resulting 
zero/not-zero test.

Jeff
Jeff Law Jan. 24, 2017, 12:03 a.m. UTC | #18
On 01/21/2017 01:00 AM, Marc Glisse wrote:
> On Fri, 20 Jan 2017, Jeff Law wrote:
>
>> On 01/20/2017 04:30 PM, Jeff Law wrote:
>>> Going to work from the self-contained test...
>>>
>>>
>>>>
>>>> Here's a test case that's closer to the one from the bug.  It also
>>>> ends up with the out of bounds memset even at -O1, during PRE.
>>>>
>>>> typedef __SIZE_TYPE__ size_t;
>>>>
>>>> struct S {
>>>>   int *p0, *p1, *p2;
>>>>
>>>>   size_t size () const { return p1 - p0; }
>>>>
>>>>   void f (size_t n) {
>>>>     if (n > size ())       // can't happen because
>>>>       foo (n - size ());   //   n is in [1, MIN(size() - 1, 3)]
>>>>     else if (n < size ())
>>>>       bar (p0 + n);
>>>>   }
>>>>
>>>>   void foo (size_t n)
>>>>   {
>>>>     size_t left = (size_t)(p2 - p1);
>>>>     if (left >= n)
>>>>       __builtin_memset (p2, 0, n * sizeof *p2);
>>>>   }
>>>>
>>>>   void bar (int*);
>>>> };
>>>>
>>>> void f (S &s)
>>>> {
>>>>   size_t n = s.size ();
>>>>   if (n > 1 && n < 5)
>>>>     s.f (n - 1);
>>>> }
>>>
>>>
>>> Looking at the .vrp1 dump for this test we find:
>>>
>>> ;;   basic block 3, loop depth 0, count 0, freq 3664, maybe hot
>>> ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       2 [36.6%]  (TRUE_VALUE,EXECUTABLE)
>>>   _18 = ASSERT_EXPR <_13, _13 + 18446744073709551614 <= 2>;
>>>   _2 = _18 + 18446744073709551615;
>>>   if (_2 > _18)
>>>     goto <bb 4>; [50.00%]
>>>   else
>>>     goto <bb 6>; [50.00%]
>>>
>>> _2 > _18 is the same as
>>>
>>> (_18 - 1) > _18
>>>
>>> Which is only true if _18 is zero.  And since _18 has a computed range
>>> of [2..4], that can never happen.
>>>
>>> VRP doesn't currently handle this case, though we do have some code to
>>> turn relational comparisons into equality comparisons.  If I manually
>>> turn the relational into _18 == 0 at that point, then the next VRP pass
>>> is able to optimize the conditional away which in turn eliminates the
>>> problematical memset call & warning.
>>>
>>> So one possible approach would be to teach simplify_cond_using_ranges to
>>> simplify that kind of condition.  I don't know if that would fix the
>>> reported testcase as I haven't looked at it directly under the debugger
>>> yet.
>> In fact, match.pd already has a pattern to handle a relational like
>> (X + -1) > X
>>
>> The pattern doesn't fire because of a single_use guard.   This was
>> discussed back in 2015 with Jakub advocating for a single_use guard,
>> but Marc leaning the other direction, but not enough to argue too much
>> about it.
>>
>> In early 2016 Marc actually submitted the match.pd patch with the
>> single use guard per Jakub's preference.
>>
>> The single use guard checks the SSA_NAME holding the (X + CST)
>> expression for single use.   It's not entirely clear why we check for
>> single use.  If it's because of object lifetimes, then that's a total
>> red herring here.
>>
>> Given something like:
>> x = a + c
>> if (a > x)
>>
>> The transformation into
>>
>> x = a + c;
>> if (a < c')
>>
>> Where x has multiple uses, the transformation will not inherently
>> change the lifetime of "x" worse (if the conditional was the last use,
>> then the lifetime of x may shorten somewhat).  If there were uses
>> after the condition, the lifetime of "x" remains unchanged.
>
> I don't remember the conversation and I don't have experience with
> register pressure. My assumption would be that the issue is with the new
> constant c'. In the first case, during the comparison, only a and x are
> alive, while in the second case, c' is as well, which may use an extra
> register. Does that make sense?
>
> (I don't know if that's bad enough to inhibit the simplification)
>
> Or are we talking about the transformation that has this comment in
> match.pd:
>
>    Currently restricted to single use so as not to interfere too much with
>    ADD_OVERFLOW detection in tree-ssa-math-opts.c.
>
> That detection logic needs to be improved anyway (see also a discussion
> with Vincent Lefevre yesterday on gcc-help, users do write the
> simplified form directly), the single_use is supposedly only there until
> that is done (really depends on how well the improved logic works...).
>
>> "a" was already live from the assignment to at least the conditional.
>> But again, we haven't inherently changed its lifetime.
>>
>> However, what can happen is after transformation, the assignment might
>> sink to a later use point in the CFG, which might lengthen the
>> lifetime of "a", but it's also shortening the lifetime of "x" by the
>> same amount.
>>
>> So, ultimately I don't buy the argument that we should inhibit this
>> based on register lifetime issues.
>>
>> Jakub, perhaps you remember why you wanted this restricted to cases
>> where "x" has a single use?
So I went and looked at the match_uaddsub_overflow bits for a while. 
AFAICT it appears to be trying to allow us to do the arithmetic and 
overflow status computation once and CSE the result.  At the single 
instance we would generate a uaddv4 insn which does the arithmetic and jump.

ISTM the real value here is in the initial RTL generation.  DOM ought to 
be simplifying the arithmetic CSEs and dominated tests.  Threading 
should pick up the most common non-dominated tests.

Converting into an EQ/NE test against zero for the special case should 
still preserve those optimizations (and in fact makes the threading 
easier to discover).

So it's really a matter of whether we do it directly in VRP or in match.pd. 
The former is more effective, the latter is cleaner.
>
> I believe we are supposed to call match.pd from VRP eventually (though
> that may not have happened yet), so the test could probably also be done
> there, if that looks more convenient and we decide not to remove
> single_use completely for this transformation.
That was my thought as well, but AFAICT we only call into match.pd from 
VRP if we changed the insn.   There's no systematic calling into 
match.pd from most passes.  Thus, we don't do the transformation until 
fairly late in the pipeline when it's in match.pd.

Jeff
Richard Biener Jan. 24, 2017, 10:36 a.m. UTC | #19
On Tue, Jan 24, 2017 at 1:03 AM, Jeff Law <law@redhat.com> wrote:
> On 01/21/2017 01:00 AM, Marc Glisse wrote:
>>
>> On Fri, 20 Jan 2017, Jeff Law wrote:
>>
>>> On 01/20/2017 04:30 PM, Jeff Law wrote:
>>>>
>>>> Going to work from the self-contained test...
>>>>
>>>>
>>>>>
>>>>> Here's a test case that's closer to the one from the bug.  It also
>>>>> ends up with the out of bounds memset even at -O1, during PRE.
>>>>>
>>>>> typedef __SIZE_TYPE__ size_t;
>>>>>
>>>>> struct S {
>>>>>   int *p0, *p1, *p2;
>>>>>
>>>>>   size_t size () const { return p1 - p0; }
>>>>>
>>>>>   void f (size_t n) {
>>>>>     if (n > size ())       // can't happen because
>>>>>       foo (n - size ());   //   n is in [1, MIN(size() - 1, 3)]
>>>>>     else if (n < size ())
>>>>>       bar (p0 + n);
>>>>>   }
>>>>>
>>>>>   void foo (size_t n)
>>>>>   {
>>>>>     size_t left = (size_t)(p2 - p1);
>>>>>     if (left >= n)
>>>>>       __builtin_memset (p2, 0, n * sizeof *p2);
>>>>>   }
>>>>>
>>>>>   void bar (int*);
>>>>> };
>>>>>
>>>>> void f (S &s)
>>>>> {
>>>>>   size_t n = s.size ();
>>>>>   if (n > 1 && n < 5)
>>>>>     s.f (n - 1);
>>>>> }
>>>>
>>>>
>>>>
>>>> Looking at the .vrp1 dump for this test we find:
>>>>
>>>> ;;   basic block 3, loop depth 0, count 0, freq 3664, maybe hot
>>>> ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
>>>> ;;    pred:       2 [36.6%]  (TRUE_VALUE,EXECUTABLE)
>>>>   _18 = ASSERT_EXPR <_13, _13 + 18446744073709551614 <= 2>;
>>>>   _2 = _18 + 18446744073709551615;
>>>>   if (_2 > _18)
>>>>     goto <bb 4>; [50.00%]
>>>>   else
>>>>     goto <bb 6>; [50.00%]
>>>>
>>>> _2 > _18 is the same as
>>>>
>>>> (_18 - 1) > _18
>>>>
>>>> Which is only true if _18 is zero.  And since _18 has a computed range
>>>> of [2..4], that can never happen.
>>>>
>>>> VRP doesn't currently handle this case, though we do have some code to
>>>> turn relational comparisons into equality comparisons.  If I manually
>>>> turn the relational into _18 == 0 at that point, then the next VRP pass
>>>> is able to optimize the conditional away which in turn eliminates the
>>>> problematical memset call & warning.
>>>>
>>>> So one possible approach would be to treat simplify_cond_using_ranges to
>>>> simplify that kind of condition.  I don't know if that would fix the
>>>> reported testcase as I haven't looked at it directly under the debugger
>>>> yet.
>>>
>>> In fact, match.pd already has a pattern to handle a relational like
>>> (X + -1) > X
>>>
>>> The pattern doesn't fire because of a single_use guard.   This was
>>> discussed back in 2015 with Jakub advocating for a single_use guard,
>>> but Marc leaning the other direction, but not enough to argue too much
>>> about it.
>>>
>>> In early 2016 Marc actually submitted the match.pd patch with the
>>> single use guard per Jakub's preference.
>>>
>>> The single use guard checks the SSA_NAME holding the (X + CST)
>>> expression for single use.   It's not entirely clear why we check for
>>> single use.  If it's because of object lifetimes, then that's a total
>>> red herring here.
>>>
>>> Given something like:
>>> x = a + c
>>> if (a > x)
>>>
>>> The transformation into
>>>
>>> x = a + c;
>>> if (a < c')
>>>
>>> Where x has multiple uses, the transformation will not inherently
>>> change the lifetime of "x" worse (if the conditional was the last use,
>>> then the lifetime of x may shorten somewhat).  If there were uses
>>> after the condition, the lifetime of "x" remains unchanged.
>>
>>
>> I don't remember the conversation and I don't have experience with
>> register pressure. My assumption would be that the issue is with the new
>> constant c'. In the first case, during the comparison, only a and x are
>> alive, while in the second case, c' is as well, which may use an extra
>> register. Does that make sense?
>>
>> (I don't know if that's bad enough to inhibit the simplification)
>>
>> Or are we talking about the transformation that has this comment in
>> match.pd:
>>
>>    Currently restricted to single use so as not to interfere too much with
>>    ADD_OVERFLOW detection in tree-ssa-math-opts.c.
>>
>> That detection logic needs to be improved anyway (see also a discussion
>> with Vincent Lefevre yesterday on gcc-help, users do write the
>> simplified form directly), the single_use is supposedly only there until
>> that is done (really depends on how well the improved logic works...).
>>
>>> "a" was already live from the assignment to at least the conditional.
>>> But again, we haven't inherently changed its lifetime.
>>>
>>> However, what can happen is after transformation, the assignment might
>>> sink to a later use point in the CFG, which might lengthen the
>>> lifetime of "a", but it's also shortening the lifetime of "x" by the
>>> same amount.
>>>
>>> So, ultimately I don't buy the argument that we should inhibit this
>>> based on register lifetime issues.
>>>
>>> Jakub, perhaps you remember why you wanted this restricted to cases
>>> where "x" has a single use?
>
> So I went and looked at the match_uaddsub_overflow bits for a while. AFAICT
> it appears to be trying to allow us to do the arithmetic and overflow status
> computation once and CSE the result.  At the single instance we would
> generate a uaddv4 insn which does the arithmetic and jump.
>
> ISTM the real value here is in the initial RTL generation.  DOM ought to be
> simplifying the arithmetic CSEs and dominated tests.  Threading should pick
> up the most common non-dominated tests.
>
> Converting into an EQ/NE test against zero for the special case should still
> preserve those optimizations (and in fact makes the threading easier to
> discover).
>
> So it's really a matter of whether we do it directly in VRP, or in match.pd.
> The former is more effective, the latter is cleaner.
>>
>>
>> I believe we are supposed to call match.pd from VRP eventually (though
>> that may not have happened yet), so the test could probably also be done
>> there, if that looks more convenient and we decide not to remove
>> single_use completely for this transformation.
>
> That was my thought as well, but AFAICT we only call into match.pd from VRP
> if we changed the insn.

Yes - there were thoughts to change that (but it comes at an expense).  Basically
we'd like to re-fold stmts that indirectly use stmts we changed.  We certainly
don't want to re-fold everything all the time.

>   There's no systematic calling into match.pd from
> most passes.  Thus, we don't do the transformation until fairly late in the
> pipeline when it's in match.pd.

Passes should fold stmts they change but with match.pd we now have that
"indirect" issue indeed.  Currently only forwprop unconditionally folds all
stmts.

I'm not sure if there's any heuristic we can apply for folding stmts that
is cheaper than simply folding everything...  The simplest thing would
be to fold stmts in the same BB after we've changed a stmt in it.  A
"full" solution would be to populate a bitmap with all defs of stmts
using the changed stmts defs, folding their definitions (and clearing the
bits).  Exclude virtual operands and PHIs.
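
Roughly like this, as a sketch of the bitmap-population step (the
function name is made up; the rest are the usual GCC internal APIs):

/* After CHANGED_STMT has been folded, record in TO_FOLD the SSA
   versions defined by its (non-PHI, non-virtual) use stmts; their
   defining stmts get re-folded later, clearing the bits as we go.  */
static void
queue_indirectly_affected (gimple *changed_stmt, bitmap to_fold)
{
  tree def = gimple_get_lhs (changed_stmt);
  if (!def || TREE_CODE (def) != SSA_NAME || virtual_operand_p (def))
    return;

  imm_use_iterator iter;
  gimple *use_stmt;
  FOR_EACH_IMM_USE_STMT (use_stmt, iter, def)
    {
      if (gimple_code (use_stmt) == GIMPLE_PHI)
        continue;
      tree lhs = gimple_get_lhs (use_stmt);
      if (lhs && TREE_CODE (lhs) == SSA_NAME && !virtual_operand_p (lhs))
        bitmap_set_bit (to_fold, SSA_NAME_VERSION (lhs));
    }
}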

Something to experiment with I guess.  The above could apply to
the SSA propagator substitute-and-fold phase as well as FRE
elimination and DOM stmt optimization.

Richard.

>
> Jeff
>
Marc Glisse Jan. 24, 2017, 2:29 p.m. UTC | #20
On Tue, 24 Jan 2017, Richard Biener wrote:

>> That was my thought as well, but AFAICT we only call into match.pd
>> from VRP if we changed the insn.
>
> Yes - there were thoughts to change that (but it comes at an expense).
> Basically we'd like to re-fold stmts that indirectly use stmts we
> changed.  We certainly don't want to re-fold everything all the time.

VRP is kind of a special case, every variable for which it finds a
new/improved range could be considered changed, since it may trigger
some extra transformation in match.pd (same for CCP and the nonzero
mask).
Jeff Law Jan. 24, 2017, 3:05 p.m. UTC | #21
On 01/24/2017 07:29 AM, Marc Glisse wrote:
> On Tue, 24 Jan 2017, Richard Biener wrote:
>
>>> That was my thought as well, but AFAICT we only call into match.pd
>>> from VRP if we changed the insn.
>>
>> Yes - there were thoughts to change that (but it comes at an expense).
>> Basically we'd like to re-fold stmts that indirectly use stmts we
>> changed.  We certainly don't want to re-fold everything all the time.
>
> VRP is kind of a special case, every variable for which it finds a
> new/improved range could be considered changed, since it may trigger
> some extra transformation in match.pd (same for CCP and the nonzero
> mask).
But that would assume that match.pd is relying on range information and 
could thus use the improved range information.  *If* match.pd is using 
the range information generated by VRP, it's not terribly pervasive.

But waiting until forwprop3 means we're leaving a ton of useless blocks 
and statements in the IL for this testcase, and likely other code using 
std::vector.

Perhaps rather than open-coding a fix in VRP I could have VRP call into 
match.pd slightly more aggressively (say just for gimple_cond).  That 
may be enough to capture the effects much earlier in the pipeline 
without trying to fold *everything*.
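
Something along these lines, hypothetically (the GIMPLE_COND filter is
the speculative part; fold_stmt is the existing folding entry point):

/* Fold hook restricted to conditionals: hand only GIMPLE_CONDs to
   fold_stmt/match.pd rather than re-folding every stmt VRP visits.  */
static bool
vrp_fold_conds_only (gimple_stmt_iterator *gsi)
{
  if (gimple_code (gsi_stmt (*gsi)) != GIMPLE_COND)
    return false;
  return fold_stmt (gsi);
}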

Jeff
Marc Glisse Jan. 24, 2017, 4:02 p.m. UTC | #22
On Tue, 24 Jan 2017, Jeff Law wrote:

> But that would assume that match.pd is relying on range information and could 
> thus use the improved range information.  *If* match.pd is using the range 
> information generated by VRP, it's not terribly pervasive.

Oh, I thought we already had some explicit calls to get_range_info in 
there, but apparently not. We can use VRP info through get_nonzero_bits, 
expr_not_equal_to and maybe one or two more like tree_expr_nonzero_p or 
tree_expr_nonnegative_p. If we called into match.pd from VRP, I assume 
several of the simplify_*_using_ranges could be moved to match.pd. But I 
agree that most transformations do not look at ranges at all.
Richard Biener Jan. 24, 2017, 4:27 p.m. UTC | #23
On January 24, 2017 5:02:39 PM GMT+01:00, Marc Glisse <marc.glisse@inria.fr> wrote:
>On Tue, 24 Jan 2017, Jeff Law wrote:
>
>> But that would assume that match.pd is relying on range information
>> and could thus use the improved range information.  *If* match.pd is
>> using the range information generated by VRP, it's not terribly
>> pervasive.
>
>Oh, I thought we already had some explicit calls to get_range_info in
>there, but apparently not. We can use VRP info through get_nonzero_bits,
>expr_not_equal_to and maybe one or two more like tree_expr_nonzero_p or

We have the latter already.

>tree_expr_nonnegative_p. If we called into match.pd from VRP, I assume
>several of the simplify_*_using_ranges could be moved to match.pd.

Yes, that will hopefully happen at some point but it also requires
folding more stmts to not regress (I'd like to remove the stmt folding
callback from the SSA propagator).

Richard.

>But I agree that most transformations do not look at ranges at all.
Richard Biener Jan. 25, 2017, 10:34 a.m. UTC | #24
On Tue, Jan 24, 2017 at 4:05 PM, Jeff Law <law@redhat.com> wrote:
> On 01/24/2017 07:29 AM, Marc Glisse wrote:
>>
>> On Tue, 24 Jan 2017, Richard Biener wrote:
>>
>>>> That was my thought as well, but AFAICT we only call into match.pd
>>>> from VRP if we changed the insn.
>>>
>>>
>>> Yes - there were thoughts to change that (but it comes at an expense).
>>> Basically we'd like to re-fold stmts that indirectly use stmts we
>>> changed.  We certainly don't want to re-fold everything all the time.
>>
>>
>> VRP is kind of a special case, every variable for which it finds a
>> new/improved range could be considered changed, since it may trigger
>> some extra transformation in match.pd (same for CCP and the nonzero
>> mask).
>
> But that would assume that match.pd is relying on range information and
> could thus use the improved range information.  *If* match.pd is using the
> range information generated by VRP, it's not terribly pervasive.
>
> But waiting until forwprop3 means we're leaving a ton of useless blocks and
> statements in the IL for this testcase, and likely other code using
> std::vector.
>
> Perhaps rather than open-coding a fix in VRP I could have VRP call into
> match.pd slightly more aggressively (say just for gimple_cond).  That may be
> enough to capture the effects much earlier in the pipeline without trying to
> fold *everything*.

Sure, the only disadvantage of doing it in VRP (in vrp_fold_stmt) is that you
may end up doing it twice.

Richard.

> Jeff
>
>
Jeff Law Jan. 25, 2017, 5:32 p.m. UTC | #25
On 01/25/2017 03:34 AM, Richard Biener wrote:
> On Tue, Jan 24, 2017 at 4:05 PM, Jeff Law <law@redhat.com> wrote:
>> On 01/24/2017 07:29 AM, Marc Glisse wrote:
>>>
>>> On Tue, 24 Jan 2017, Richard Biener wrote:
>>>
>>>>> That was my thought as well, but AFAICT we only call into match.pd
>>>>> from VRP if we changed the insn.
>>>>
>>>>
>>>> Yes - there were thoughts to change that (but it comes at an expense).
>>>> Basically we'd like to re-fold stmts that indirectly use stmts we
>>>> changed.  We certainly don't want to re-fold everything all the time.
>>>
>>>
>>> VRP is kind of a special case, every variable for which it finds a
>>> new/improved range could be considered changed, since it may trigger
>>> some extra transformation in match.pd (same for CCP and the nonzero
>>> mask).
>>
>> But that would assume that match.pd is relying on range information and
>> could thus use the improved range information.  *If* match.pd is using the
>> range information generated by VRP, it's not terribly pervasive.
>>
>> But waiting until forwprop3 means we're leaving a ton of useless blocks and
>> statements in the IL for this testcase, and likely other code using
>> std::vector.
>>
>> Perhaps rather than open-coding a fix in VRP I could have VRP call into
>> match.pd slightly more aggressively (say just for gimple_cond).  That may be
>> enough to capture the effects much earlier in the pipeline without trying to
>> fold *everything*.
>
> Sure, the only disadvantage of doing it in VRP (in vrp_fold_stmt) is that you
> may end up doing it twice.
Once per VRP pass doesn't seem excessive.

If we simplify in VRP with a valueizer that walks up the ASSERT_EXPRs, 
then VRP1 will simplify the two key conditionals.  The first DOM pass is 
then able to clean up the whole mess.  But that valueizer runs afoul of 
maybe_set_nonzero_bits's assumptions for an unrelated testcase (pr60482).
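
For reference, an ASSERT_EXPR-walking valueizer would look roughly like
this (a sketch of my reading, not the committed code):

/* Follow x_2 = ASSERT_EXPR <x_1, cond> chains back to the underlying
   name; handing that name to the folder is what effectively
   copy-propagates through the asserts.  */
static tree
valueize_through_asserts (tree name)
{
  while (TREE_CODE (name) == SSA_NAME)
    {
      gimple *def = SSA_NAME_DEF_STMT (name);
      if (!is_gimple_assign (def)
          || gimple_assign_rhs_code (def) != ASSERT_EXPR)
        break;
      name = ASSERT_EXPR_VAR (gimple_assign_rhs1 (def));
    }
  return name;
}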

maybe_set_nonzero_bits has restrictions on the number of uses of an
SSA_NAME.  Folding with a valueizer that walks the ASSERT_EXPR chain has
the side effect of copy-propagating through ASSERT_EXPRs.  So for the
pr60482 testcase we end up with 3 uses of "n_12" rather than the 
expected 2.  That in turn causes us to avoid aggressively clearing bits 
in the non-zero bitmask of n_12.  That in turn causes us to fail to 
eliminate a conditional, which in turn causes us to need a loop epilogue 
for vectorization.  Ugh.

If we fold in VRP1 without walking up the ASSERT_EXPRs, we transform 
just the first conditional in VRP1.  A goodly amount of simplification 
is still done in the first DOM pass, but not all of it.

forwprop3 then transforms the second conditional which PRE is then able 
to optimize away.  That's early enough to allow sinking of the arithmetic.

The first DOM pass still cleaned up most of the crud early so we're 
avoiding useless work.  The final result is the same as with the 
ASSERT_EXPR walking valueizer.  That seems like a reasonable compromise.

Spinning that version...

jeff
diff mbox

Patch

PR c++/79095 - [7 regression] spurious stringop-overflow warning

gcc/ChangeLog:

	PR c++/79095
	* tree-loop-distribution.c (maybe_emit_trap): New function.
	(generate_memset_builtin): Call it.
	(generate_memcpy_builtin): Same.

gcc/testsuite/ChangeLog:

	PR c++/79095
	* g++.dg/pr79095.C: New test.

Index: gcc/testsuite/g++.dg/pr79095.C
===================================================================
--- gcc/testsuite/g++.dg/pr79095.C	(revision 0)
+++ gcc/testsuite/g++.dg/pr79095.C	(working copy)
@@ -0,0 +1,41 @@ 
+/* PR c++/79095 - spurious stringop-overflow warning
+   { dg-do compile }
+   { dg-options "-O3 -Wall -fdump-tree-optimized" } */
+
+typedef long unsigned int size_t;
+
+inline void
+fill (int *p, size_t n, int)
+{
+  while (n--)
+    *p++ = 0;
+}
+
+struct B
+{
+  int* p0, *p1, *p2;
+
+  size_t size () const {
+    return size_t (p1 - p0);
+  }
+
+  void resize (size_t n) {
+    if (n > size())
+      append (n - size());
+  }
+
+  void append (size_t n)
+  {
+    if (size_t (p2 - p1) >= n) {
+      fill (p1, n, 0);
+    }
+  }
+};
+
+void foo (B &b)
+{
+  b.resize (b.size () - 1);
+}
+
+
+/* { dg-final { scan-tree-dump-not "memset" "optimized" } } */
Index: gcc/tree-loop-distribution.c
===================================================================
--- gcc/tree-loop-distribution.c	(revision 244508)
+++ gcc/tree-loop-distribution.c	(working copy)
@@ -796,6 +796,49 @@  const_with_all_bytes_same (tree val)
   return buf[0];
 }
 
+/* If NBYTES is greater than SSIZE_MAX emit a trap and return true.
+   Otherwise return false.  FNCODE identifies the built-in function
+   being generated.  */
+
+static bool
+maybe_emit_trap (gimple_stmt_iterator gsi, tree nbytes,
+		 built_in_function fncode)
+{
+  if (TREE_CODE (nbytes) != INTEGER_CST
+      || tree_int_cst_le (nbytes, TYPE_MAX_VALUE (ssizetype)))
+    return false;
+
+  tree fn = builtin_decl_implicit (BUILT_IN_TRAP);
+  gimple *fn_call = gimple_build_call (fn, 0);
+  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
+  split_block (gimple_bb (fn_call), fn_call);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      const char *fname = 0;
+
+      switch (fncode)
+	{
+	case BUILT_IN_MEMCPY:
+	  fname = "memcpy";
+	  break;
+	case BUILT_IN_MEMMOVE:
+	  fname = "memmove";
+	  break;
+	case BUILT_IN_MEMSET:
+	  fname = "memset";
+	  break;
+	default:
+	  gcc_unreachable ();
+	}
+
+      fprintf (dump_file, "generated trap for an out-of-bounds %s with "
+	       HOST_WIDE_INT_PRINT_UNSIGNED " size\n", fname, tree_to_uhwi (nbytes));
+    }
+
+  return true;
+}
+
 /* Generate a call to memset for PARTITION in LOOP.  */
 
 static void
@@ -817,6 +860,13 @@  generate_memset_builtin (struct loop *loop, partit
 				 partition->plus_one);
   nb_bytes = force_gimple_operand_gsi (&gsi, nb_bytes, true, NULL_TREE,
 				       false, GSI_CONTINUE_LINKING);
+
+  /* If the number of bytes is in excess of SSIZE_MAX avoid generating
+     the memset call that would certainly overflow and emit a trap
+     instead.  */
+  if (maybe_emit_trap (gsi, nb_bytes, BUILT_IN_MEMSET))
+    return;
+
   mem = build_addr_arg_loc (loc, partition->main_dr, nb_bytes);
   mem = force_gimple_operand_gsi (&gsi, mem, true, NULL_TREE,
 				  false, GSI_CONTINUE_LINKING);
@@ -873,6 +923,7 @@  generate_memcpy_builtin (struct loop *loop, partit
 				 partition->plus_one);
   nb_bytes = force_gimple_operand_gsi (&gsi, nb_bytes, true, NULL_TREE,
 				       false, GSI_CONTINUE_LINKING);
+
   dest = build_addr_arg_loc (loc, partition->main_dr, nb_bytes);
   src = build_addr_arg_loc (loc, partition->secondary_dr, nb_bytes);
   if (partition->kind == PKIND_MEMCPY
@@ -881,6 +932,12 @@  generate_memcpy_builtin (struct loop *loop, partit
   else
     kind = BUILT_IN_MEMMOVE;
 
+  /* If the number of bytes is in excess of SSIZE_MAX avoid generating
+     the mem(cpy|move) call that would certainly overflow and instead
+     emit a trap.  */
+  if (maybe_emit_trap (gsi, nb_bytes, kind))
+    return;
+
   dest = force_gimple_operand_gsi (&gsi, dest, true, NULL_TREE,
 				   false, GSI_CONTINUE_LINKING);
   src = force_gimple_operand_gsi (&gsi, src, true, NULL_TREE,