diff mbox series

[v4] Implement new RTL optimizations pass: fold-mem-offsets.

Message ID 20230807143324.656791-1-manolis.tsamis@vrull.eu
State New
Headers show
Series [v4] Implement new RTL optimizations pass: fold-mem-offsets. | expand

Commit Message

Manolis Tsamis Aug. 7, 2023, 2:33 p.m. UTC
This is a new RTL pass that tries to optimize memory offset calculations
by moving them from add immediate instructions to the memory loads/stores.
For example it can transform this:

  addi t4,sp,16
  add  t2,a6,t4
  shl  t3,t2,1
  ld   a2,0(t3)
  addi a2,1
  sd   a2,8(t2)

into the following (one instruction less):

  add  t2,a6,sp
  shl  t3,t2,1
  ld   a2,32(t3)
  addi a2,1
  sd   a2,24(t2)

Although there are places where this is done already, this pass is more
powerful and can handle the more difficult cases that are currently not
optimized. Also, it runs late enough and can optimize away unnecessary
stack pointer calculations.

gcc/ChangeLog:

	* Makefile.in: Add fold-mem-offsets.o.
	* passes.def: Schedule a new pass.
	* tree-pass.h (make_pass_fold_mem_offsets): Declare.
	* common.opt: New options.
	* doc/invoke.texi: Document new option.
	* fold-mem-offsets.cc: New file.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/fold-mem-offsets-1.c: New test.
	* gcc.target/riscv/fold-mem-offsets-2.c: New test.
	* gcc.target/riscv/fold-mem-offsets-3.c: New test.

Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu>
---

Changes in v4:
        - Add DF_EQ_NOTES flag to avoid incorrect state in notes.
        - Remove fold_mem_offsets_driver and enum fold_mem_phase.
        - Call recog when patching offsets in do_commit_offset.
        - Restore INSN_CODE after modifying insn in do_check_validity.

Changes in v3:
        - Added propagation for more codes:
          sub, neg, mul.
        - Added folding / elimination for sub and
          const int moves.
        - For the validity check of the generated addresses
          also test memory_address_addr_space_p.
        - Replaced GEN_INT with gen_int_mode.
        - Replaced some bitmap_head with auto_bitmap.
        - Refactor each phase into own function for readability.
        - Add dump details.
        - Replace rtx iteration with reg_mentioned_p.
        - Return early for codes that we can't propagate through.

Changes in v2:
        - Made the pass target-independant instead of RISCV specific.
        - Fixed a number of bugs.
        - Add code to handle more ADD patterns as found
          in other targets (x86, aarch64).
        - Improved naming and comments.
        - Fixed bitmap memory leak.

 gcc/Makefile.in                               |   1 +
 gcc/common.opt                                |   4 +
 gcc/doc/invoke.texi                           |   8 +
 gcc/fold-mem-offsets.cc                       | 725 ++++++++++++++++++
 gcc/passes.def                                |   1 +
 .../gcc.target/riscv/fold-mem-offsets-1.c     |  16 +
 .../gcc.target/riscv/fold-mem-offsets-2.c     |  24 +
 .../gcc.target/riscv/fold-mem-offsets-3.c     |  17 +
 gcc/tree-pass.h                               |   1 +
 9 files changed, 797 insertions(+)
 create mode 100644 gcc/fold-mem-offsets.cc
 create mode 100644 gcc/testsuite/gcc.target/riscv/fold-mem-offsets-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/fold-mem-offsets-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/fold-mem-offsets-3.c

Comments

Jeff Law Aug. 9, 2023, 4:22 p.m. UTC | #1
On 8/7/23 08:33, Manolis Tsamis wrote:
> This is a new RTL pass that tries to optimize memory offset calculations
> by moving them from add immediate instructions to the memory loads/stores.
> For example it can transform this:
> 
>    addi t4,sp,16
>    add  t2,a6,t4
>    shl  t3,t2,1
>    ld   a2,0(t3)
>    addi a2,1
>    sd   a2,8(t2)
> 
> into the following (one instruction less):
> 
>    add  t2,a6,sp
>    shl  t3,t2,1
>    ld   a2,32(t3)
>    addi a2,1
>    sd   a2,24(t2)
> 
> Although there are places where this is done already, this pass is more
> powerful and can handle the more difficult cases that are currently not
> optimized. Also, it runs late enough and can optimize away unnecessary
> stack pointer calculations.
> 
> gcc/ChangeLog:
> 
> 	* Makefile.in: Add fold-mem-offsets.o.
> 	* passes.def: Schedule a new pass.
> 	* tree-pass.h (make_pass_fold_mem_offsets): Declare.
> 	* common.opt: New options.
> 	* doc/invoke.texi: Document new option.
> 	* fold-mem-offsets.cc: New file.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/riscv/fold-mem-offsets-1.c: New test.
> 	* gcc.target/riscv/fold-mem-offsets-2.c: New test.
> 	* gcc.target/riscv/fold-mem-offsets-3.c: New test.
We still have the m68k issue to deal with.

We've got these key insns going into fold-mem-offsets:

> (insn 39 38 41 3 (set (reg:SI 13 %a5 [orig:36 _14 ] [36])
>         (plus:SI (reg:SI 10 %a2 [47])
>             (const_int 1 [0x1]))) "j.c":19:3 157 {*addsi3_internal}
>      (nil))
> (insn 41 39 42 3 (set (reg:SI 8 %a0 [61])
>         (plus:SI (reg/f:SI 12 %a4 [52])
>             (reg:SI 13 %a5 [orig:36 _14 ] [36]))) "j.c":19:3 discrim 1 157 {*addsi3_internal}
>      (nil))
> (insn 42 41 43 3 (set (mem:SI (reg:SI 8 %a0 [61]) [0 MEM <char[1:5]> [(void *)_15]+0 S4 A8])
>         (const_int 1633837924 [0x61626364])) "j.c":19:3 discrim 1 55 {*movsi_m68k2}
>      (nil))
> (insn 43 42 45 3 (set (mem:QI (plus:SI (reg:SI 8 %a0 [61])
>                 (const_int 4 [0x4])) [0 MEM <char[1:5]> [(void *)_15]+4 S1 A8])
>         (const_int 101 [0x65])) "j.c":19:3 discrim 1 62 {*m68k.md:1130}
>      (expr_list:REG_DEAD (reg:SI 8 %a0 [61])
>         (nil)))

[ ... ]
> (insn 58 57 59 3 (set (reg:SI 8 %a0 [72])
>         (plus:SI (plus:SI (reg:SI 13 %a5 [orig:36 _14 ] [36])
>                 (reg:SI 11 %a3 [49]))
>             (const_int 5 [0x5]))) "j.c":24:3 421 {*lea}
>      (expr_list:REG_DEAD (reg:SI 13 %a5 [orig:36 _14 ] [36])
>         (expr_list:REG_DEAD (reg:SI 11 %a3 [49])
>             (nil))))


f-m-o will propagate the (const_int 1) from insn 39 into insns 42 & 43 
and modifies insn 39 into a simple copy.  Note carefully that turning 
insn 39 into a simple copy changes the value in a5.

As a result insn 58 computes the wrong value and the test (strlenopt-2) 
fails on the m68k.

In fold_offsets we have this code:

>      /* We only fold through instructions that are transitively used as
>          memory addresses and do not have other uses.  Use the same logic
>          from offset calculation to visit instructions that can propagate
>          offsets and keep track of them in CAN_FOLD_INSNS.  */
>       bool success;
>       struct df_link *uses = get_uses (def, dest, &success), *ref_link;
> 
>       if (!success)
>         return 0;
> 
>       for (ref_link = uses; ref_link; ref_link = ref_link->next)
>         {
>           rtx_insn* use = DF_REF_INSN (ref_link->ref);
> 
>           if (DEBUG_INSN_P (use))
>             continue;
> 
>           /* Punt if the use is anything more complicated than a set
>              (clobber, use, etc).  */
>           if (!NONJUMP_INSN_P (use) || GET_CODE (PATTERN (use)) != SET)
>             return 0;
> 
>           /* This use affects instructions outside of CAN_FOLD_INSNS.  */
>           if (!bitmap_bit_p (&can_fold_insns, INSN_UID (use)))
>             return 0;
> 
>           rtx use_set = PATTERN (use);
> 
>           /* Special case: A foldable memory store is not foldable if it
>              mentions DEST outside of the address calculation.  */
>           if (use_set && MEM_P (SET_DEST (use_set))
>               && reg_mentioned_p (dest, SET_SRC (use_set)))
>             return 0;
>         }

AFAICT that code is supposed to detect if there are uses outside of the 
set of insns that can be optimized.  In our case DEF is a0 (the base of 
the memory references) and its def is insn 41.  The assignment of a0 in 
insn 41 only reaches insns 42 and 43 which are marked in can_fold_insns.

That's find and good, but insufficient for correctness.  ISTM we must 
look at the uses of any insn where we change the result of the 
computation, not just the def of the base register in the memory reference.

In this particular case we're going to modify insn 39, so we need to 
look at the uses of a5 (the value defined by insn 39).  If we did that 
we'd see the use of a5 at insn 38 and insn 38 and reject the optimization.



Testcase below.  You should be able to see the incorrect transformation 
with a m68k-linux-gnu cross compiler with -O2.



typedef unsigned int size_t;
extern void abort (void);
void *calloc (size_t, size_t);
void *malloc (size_t);
void free (void *);
char *strdup (const char *);
size_t strlen (const char *);
size_t strnlen (const char *, size_t);
void *memcpy (void *__restrict, const void *__restrict, size_t);
void *memmove (void *, const void *, size_t);
char *strcpy (char *__restrict, const char *__restrict);
char *strcat (char *__restrict, const char *__restrict);
char *strchr (const char *, int);
int strcmp (const char *, const char *);
int strncmp (const char *, const char *, size_t);
void *memset (void *, int, size_t);
int memcmp (const void *, const void *, size_t);
int strcmp (const char *, const char *);
int sprintf (char * __restrict, const char *__restrict, ...);
int snprintf (char * __restrict, size_t, const char *__restrict, ...);
__attribute__((noinline, noclone)) char *
foo (char *p, char *r)
{
   char buf[26];
   if (strlen (p) + strlen (r) + 9 > 26)
     return ((void *) 0);
   strcpy (buf, p);
   strcat (buf, "/");
   strcat (buf, "abcde");
   strcat (buf, r);
   strcat (buf, "fg");
   return strdup (buf);
}
int
main ()
{
   char *volatile p = "string1";
   char *volatile r = "string2";
   char *q = foo (p, r);
   if (q != ((void *) 0))
     {
       if (strcmp (q, "string1/abcdestring2fg"))
  abort ();
       free (q);
     }
   return 0;
}
~
Manolis Tsamis Aug. 10, 2023, 9:28 a.m. UTC | #2
Hi Jeff,

Thanks a lot for providing all this information and testcase! I have
been able to reproduce the issue with it.

I have investigated the cause of the issue and it's not what you
mention, the uses of all intermediate calculations are properly taken
into account.
In this case it would be fine to alter the runtime value of insn 58
because its uses are also memory accesses that can have a folded
offset. The real issue is that the offset for these two (insn 60/61)
is not updated.

I think this can be seen by using -fdump-rtl-fold_mem_offsets-all
which also shows root memory instructions and when instructions are
marked for propagation. Here's the relevant dump (a bit long but
should help with this):

Starting analysis from root: (insn 2 95 3 2 (set (reg/v/f:SI 3 %d3
[orig:44 pD.1867 ] [44])
        (mem/f/c:SI (plus:SI (reg/f:SI 15 %sp)
                (const_int 60 [0x3c])) [1 pD.1867+0 S4 A32]))
"/home/mtsamis/temp/fmo-bug-2/a.c":23:1 55 {*movsi_m68k2}
     (expr_list:REG_EQUIV (mem/f/c:SI (plus:SI (reg/f:SI 15 %sp)
                (const_int 60 [0x3c])) [1 pD.1867+0 S4 A32])
        (nil)))
Starting analysis from root: (insn 3 2 4 2 (set (reg/v/f:SI 2 %d2
[orig:45 rD.1868 ] [45])
        (mem/f/c:SI (plus:SI (reg/f:SI 15 %sp)
                (const_int 64 [0x40])) [1 rD.1868+0 S4 A32]))
"/home/mtsamis/temp/fmo-bug-2/a.c":23:1 55 {*movsi_m68k2}
     (expr_list:REG_EQUIV (mem/f/c:SI (plus:SI (reg/f:SI 15 %sp)
                (const_int 64 [0x40])) [1 rD.1868+0 S4 A32])
        (nil)))
Starting analysis from root: (insn 14 12 16 2 (set (mem/f:SI (reg/f:SI
15 %sp) [1  S4 A16])
        (reg/v/f:SI 2 %d2 [orig:45 rD.1868 ] [45]))
"/home/mtsamis/temp/fmo-bug-2/a.c":25:21 discrim 1 54 {*movsi_m68k}
     (expr_list:REG_ARGS_SIZE (const_int 4 [0x4])
        (nil)))
Starting analysis from root: (insn 42 41 43 3 (set (mem:SI (reg:SI 8
%a0 [61]) [0 MEM <charD.2[1:5]> [(voidD.37 *)_15]+0 S4 A8])
        (const_int 1633837924 [0x61626364]))
"/home/mtsamis/temp/fmo-bug-2/a.c":29:4 discrim 1 55 {*movsi_m68k2}
     (nil))
Starting analysis from root: (insn 43 42 45 3 (set (mem:QI (plus:SI
(reg:SI 8 %a0 [61])
                (const_int 4 [0x4])) [0 MEM <charD.2[1:5]> [(voidD.37
*)_15]+4 S1 A8])
        (const_int 101 [0x65]))
"/home/mtsamis/temp/fmo-bug-2/a.c":29:4 discrim 1 62 {*m68k.md:1130}
     (expr_list:REG_DEAD (reg:SI 8 %a0 [61])
        (nil)))
Instruction marked for propagation: (insn 41 39 42 3 (set (reg:SI 8 %a0 [61])
        (plus:SI (reg/f:SI 12 %a4 [52])
            (reg:SI 13 %a5 [orig:36 _14 ] [36])))
"/home/mtsamis/temp/fmo-bug-2/a.c":29:4 discrim 1 150
{*addsi3_internal}
     (nil))
Starting analysis from root: (insn 60 59 61 3 (set (mem:HI (reg:SI 8
%a0 [73]) [0 MEM <charD.2[1:3]> [(voidD.37 *)_19]+0 S2 A8])
        (const_int 26215 [0x6667]))
"/home/mtsamis/temp/fmo-bug-2/a.c":31:4 discrim 1 58 {*m68k.md:1084}
     (nil))
Starting analysis from root: (insn 61 60 63 3 (set (mem:QI (plus:SI
(reg:SI 8 %a0 [73])
                (const_int 2 [0x2])) [0 MEM <charD.2[1:3]> [(voidD.37
*)_19]+2 S1 A8])
        (const_int 0 [0])) "/home/mtsamis/temp/fmo-bug-2/a.c":31:4
discrim 1 62 {*m68k.md:1130}
     (expr_list:REG_DEAD (reg:SI 8 %a0 [73])
        (nil)))
Instruction marked for propagation: (insn 59 58 60 3 (set (reg:SI 8 %a0 [73])
        (plus:SI (reg/f:SI 12 %a4 [52])
            (reg:SI 8 %a0 [72])))
"/home/mtsamis/temp/fmo-bug-2/a.c":31:4 discrim 1 150
{*addsi3_internal}
     (nil))
Instruction marked for propagation: (insn 58 57 59 3 (set (reg:SI 8 %a0 [72])
        (plus:SI (plus:SI (reg:SI 13 %a5 [orig:36 _14 ] [36])
                (reg:SI 11 %a3 [49]))
            (const_int 5 [0x5])))
"/home/mtsamis/temp/fmo-bug-2/a.c":31:4 407 {*lea}
     (expr_list:REG_DEAD (reg:SI 13 %a5 [orig:36 _14 ] [36])
        (expr_list:REG_DEAD (reg:SI 11 %a3 [49])
            (nil))))
Instruction marked for propagation: (insn 39 38 41 3 (set (reg:SI 13
%a5 [orig:36 _14 ] [36])
        (plus:SI (reg:SI 10 %a2 [47])
            (const_int 1 [0x1])))
"/home/mtsamis/temp/fmo-bug-2/a.c":29:4 150 {*addsi3_internal}
     (nil))
Memory offset changed from 0 to 1 for instruction:
(insn 42 41 43 3 (set (mem:SI (reg:SI 8 %a0 [61]) [0 MEM
<charD.2[1:5]> [(voidD.37 *)_15]+0 S4 A8])
        (const_int 1633837924 [0x61626364]))
"/home/mtsamis/temp/fmo-bug-2/a.c":29:4 discrim 1 55 {*movsi_m68k2}
     (nil))
deferring rescan insn with uid = 42.
Memory offset changed from 4 to 5 for instruction:
(insn 43 42 45 3 (set (mem:QI (plus:SI (reg:SI 8 %a0 [61])
                (const_int 4 [0x4])) [0 MEM <charD.2[1:5]> [(voidD.37
*)_15]+4 S1 A8])
        (const_int 101 [0x65]))
"/home/mtsamis/temp/fmo-bug-2/a.c":29:4 discrim 1 62 {*m68k.md:1130}
     (expr_list:REG_DEAD (reg:SI 8 %a0 [61])
        (nil)))
deferring rescan insn with uid = 43.
Instruction folded:(insn 39 38 41 3 (set (reg:SI 13 %a5 [orig:36 _14 ] [36])
        (plus:SI (reg:SI 10 %a2 [47])
            (const_int 1 [0x1])))
"/home/mtsamis/temp/fmo-bug-2/a.c":29:4 150 {*addsi3_internal}
     (nil))

Here you can see that insn 39 is the last to be marked for propagation
and only after insn 58 is also marked i.e. it can propagate a constant
offset to its uses.

After that the question was why the constants were not actually
propagated in insn 60/61 and I added additional debug code to check
why. The culprit was this instruction:

(insn 58 57 59 3 (set (reg:SI 8 %a0 [72])
        (plus:SI (plus:SI (reg:SI 13 %a5 [orig:36 _14 ] [36])
                (reg:SI 11 %a3 [49]))
            (const_int 5 [0x5])))

Here the recursive call to fold_offsets for the first argument will
propagate the folded offset 1 from the change to register %a5.
But when it's time to handle the second argument (const_int 5 [0x5])
this code is used:

if (REG_P (arg2))
  offset += fold_offsets (def, arg2, analyze, foldable_insns);
else if (REG_P (arg1) && CONST_INT_P (arg2) && !analyze)
  {
    offset += INTVAL (arg2);
    /* This is a R1 = R2 + C instruction, candidate for folding. */
    bitmap_set_bit (foldable_insns, INSN_UID (def));
  }
else
  return 0;

return offset;

This instruction doesn't match any of these since the if with the
CONST_INT only accepts arg1 being a single REG (that could be
extended, but that's not the point now) and as a result we do `return
0;`
But return 0 at this point loses the offset 1 calculated from arg1
previously and which is stored in `offset`. And that's our bug :)

Changing that return 0 to return offset (i.e. return what we have up
to now) fixes this testcase with the insn being folded and all offsets
updating properly. But it got me thinking that this is a more general
issue that I need to address differently.
There are more `return 0;` cases in the code which say "We know how to
handle this rtx code, but don't know how to propagate", but returning
0 is not enough.
It's also not correct to propagate an offset from just one argument
and punt on the other because the argument we punt might contain
references to the other argument.

So the general solution that solves all the issues is: If we don't
fully understand how to handle an instruction and its arguments in
fold_offsets then we need to mark it in one of the bitsets (either set
in cannot_fold or don't set in can_fold), whereas currently a insn
that has transitively all uses as foldable is foldable.

I will prepare a newer version that should handle all these cases (and
with the proper changes the testcase will still fold insn 39).
Please let me know If you have any other feedback on this.

Thanks!

Manolis

On Wed, Aug 9, 2023 at 7:22 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
>
> On 8/7/23 08:33, Manolis Tsamis wrote:
> > This is a new RTL pass that tries to optimize memory offset calculations
> > by moving them from add immediate instructions to the memory loads/stores.
> > For example it can transform this:
> >
> >    addi t4,sp,16
> >    add  t2,a6,t4
> >    shl  t3,t2,1
> >    ld   a2,0(t3)
> >    addi a2,1
> >    sd   a2,8(t2)
> >
> > into the following (one instruction less):
> >
> >    add  t2,a6,sp
> >    shl  t3,t2,1
> >    ld   a2,32(t3)
> >    addi a2,1
> >    sd   a2,24(t2)
> >
> > Although there are places where this is done already, this pass is more
> > powerful and can handle the more difficult cases that are currently not
> > optimized. Also, it runs late enough and can optimize away unnecessary
> > stack pointer calculations.
> >
> > gcc/ChangeLog:
> >
> >       * Makefile.in: Add fold-mem-offsets.o.
> >       * passes.def: Schedule a new pass.
> >       * tree-pass.h (make_pass_fold_mem_offsets): Declare.
> >       * common.opt: New options.
> >       * doc/invoke.texi: Document new option.
> >       * fold-mem-offsets.cc: New file.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.target/riscv/fold-mem-offsets-1.c: New test.
> >       * gcc.target/riscv/fold-mem-offsets-2.c: New test.
> >       * gcc.target/riscv/fold-mem-offsets-3.c: New test.
> We still have the m68k issue to deal with.
>
> We've got these key insns going into fold-mem-offsets:
>
> > (insn 39 38 41 3 (set (reg:SI 13 %a5 [orig:36 _14 ] [36])
> >         (plus:SI (reg:SI 10 %a2 [47])
> >             (const_int 1 [0x1]))) "j.c":19:3 157 {*addsi3_internal}
> >      (nil))
> > (insn 41 39 42 3 (set (reg:SI 8 %a0 [61])
> >         (plus:SI (reg/f:SI 12 %a4 [52])
> >             (reg:SI 13 %a5 [orig:36 _14 ] [36]))) "j.c":19:3 discrim 1 157 {*addsi3_internal}
> >      (nil))
> > (insn 42 41 43 3 (set (mem:SI (reg:SI 8 %a0 [61]) [0 MEM <char[1:5]> [(void *)_15]+0 S4 A8])
> >         (const_int 1633837924 [0x61626364])) "j.c":19:3 discrim 1 55 {*movsi_m68k2}
> >      (nil))
> > (insn 43 42 45 3 (set (mem:QI (plus:SI (reg:SI 8 %a0 [61])
> >                 (const_int 4 [0x4])) [0 MEM <char[1:5]> [(void *)_15]+4 S1 A8])
> >         (const_int 101 [0x65])) "j.c":19:3 discrim 1 62 {*m68k.md:1130}
> >      (expr_list:REG_DEAD (reg:SI 8 %a0 [61])
> >         (nil)))
>
> [ ... ]
> > (insn 58 57 59 3 (set (reg:SI 8 %a0 [72])
> >         (plus:SI (plus:SI (reg:SI 13 %a5 [orig:36 _14 ] [36])
> >                 (reg:SI 11 %a3 [49]))
> >             (const_int 5 [0x5]))) "j.c":24:3 421 {*lea}
> >      (expr_list:REG_DEAD (reg:SI 13 %a5 [orig:36 _14 ] [36])
> >         (expr_list:REG_DEAD (reg:SI 11 %a3 [49])
> >             (nil))))
>
>
> f-m-o will propagate the (const_int 1) from insn 39 into insns 42 & 43
> and modifies insn 39 into a simple copy.  Note carefully that turning
> insn 39 into a simple copy changes the value in a5.
>
> As a result insn 58 computes the wrong value and the test (strlenopt-2)
> fails on the m68k.
>
> In fold_offsets we have this code:
>
> >      /* We only fold through instructions that are transitively used as
> >          memory addresses and do not have other uses.  Use the same logic
> >          from offset calculation to visit instructions that can propagate
> >          offsets and keep track of them in CAN_FOLD_INSNS.  */
> >       bool success;
> >       struct df_link *uses = get_uses (def, dest, &success), *ref_link;
> >
> >       if (!success)
> >         return 0;
> >
> >       for (ref_link = uses; ref_link; ref_link = ref_link->next)
> >         {
> >           rtx_insn* use = DF_REF_INSN (ref_link->ref);
> >
> >           if (DEBUG_INSN_P (use))
> >             continue;
> >
> >           /* Punt if the use is anything more complicated than a set
> >              (clobber, use, etc).  */
> >           if (!NONJUMP_INSN_P (use) || GET_CODE (PATTERN (use)) != SET)
> >             return 0;
> >
> >           /* This use affects instructions outside of CAN_FOLD_INSNS.  */
> >           if (!bitmap_bit_p (&can_fold_insns, INSN_UID (use)))
> >             return 0;
> >
> >           rtx use_set = PATTERN (use);
> >
> >           /* Special case: A foldable memory store is not foldable if it
> >              mentions DEST outside of the address calculation.  */
> >           if (use_set && MEM_P (SET_DEST (use_set))
> >               && reg_mentioned_p (dest, SET_SRC (use_set)))
> >             return 0;
> >         }
>
> AFAICT that code is supposed to detect if there are uses outside of the
> set of insns that can be optimized.  In our case DEF is a0 (the base of
> the memory references) and its def is insn 41.  The assignment of a0 in
> insn 41 only reaches insns 42 and 43 which are marked in can_fold_insns.
>
> That's find and good, but insufficient for correctness.  ISTM we must
> look at the uses of any insn where we change the result of the
> computation, not just the def of the base register in the memory reference.
>
> In this particular case we're going to modify insn 39, so we need to
> look at the uses of a5 (the value defined by insn 39).  If we did that
> we'd see the use of a5 at insn 38 and insn 38 and reject the optimization.
>
>
>
> Testcase below.  You should be able to see the incorrect transformation
> with a m68k-linux-gnu cross compiler with -O2.
>
>
>
> typedef unsigned int size_t;
> extern void abort (void);
> void *calloc (size_t, size_t);
> void *malloc (size_t);
> void free (void *);
> char *strdup (const char *);
> size_t strlen (const char *);
> size_t strnlen (const char *, size_t);
> void *memcpy (void *__restrict, const void *__restrict, size_t);
> void *memmove (void *, const void *, size_t);
> char *strcpy (char *__restrict, const char *__restrict);
> char *strcat (char *__restrict, const char *__restrict);
> char *strchr (const char *, int);
> int strcmp (const char *, const char *);
> int strncmp (const char *, const char *, size_t);
> void *memset (void *, int, size_t);
> int memcmp (const void *, const void *, size_t);
> int strcmp (const char *, const char *);
> int sprintf (char * __restrict, const char *__restrict, ...);
> int snprintf (char * __restrict, size_t, const char *__restrict, ...);
> __attribute__((noinline, noclone)) char *
> foo (char *p, char *r)
> {
>    char buf[26];
>    if (strlen (p) + strlen (r) + 9 > 26)
>      return ((void *) 0);
>    strcpy (buf, p);
>    strcat (buf, "/");
>    strcat (buf, "abcde");
>    strcat (buf, r);
>    strcat (buf, "fg");
>    return strdup (buf);
> }
> int
> main ()
> {
>    char *volatile p = "string1";
>    char *volatile r = "string2";
>    char *q = foo (p, r);
>    if (q != ((void *) 0))
>      {
>        if (strcmp (q, "string1/abcdestring2fg"))
>   abort ();
>        free (q);
>      }
>    return 0;
> }
> ~
>
>
>
Jeff Law Aug. 10, 2023, 3:23 p.m. UTC | #3
On 8/10/23 03:28, Manolis Tsamis wrote:
> Hi Jeff,
> 
> Thanks a lot for providing all this information and testcase! I have
> been able to reproduce the issue with it.
> 
> I have investigated the cause of the issue and it's not what you
> mention, the uses of all intermediate calculations are properly taken
> into account.
> In this case it would be fine to alter the runtime value of insn 58
> because its uses are also memory accesses that can have a folded
> offset. The real issue is that the offset for these two (insn 60/61)
> is not updated.
[ ... ]


> 
> This instruction doesn't match any of these since the if with the
> CONST_INT only accepts arg1 being a single REG (that could be
> extended, but that's not the point now) and as a result we do `return
> 0;`
> But return 0 at this point loses the offset 1 calculated from arg1
> previously and which is stored in `offset`. And that's our bug :)
> 
> Changing that return 0 to return offset (i.e. return what we have up
> to now) fixes this testcase with the insn being folded and all offsets
> updating properly. But it got me thinking that this is a more general
> issue that I need to address differently.
> There are more `return 0;` cases in the code which say "We know how to
> handle this rtx code, but don't know how to propagate", but returning
> 0 is not enough.
> It's also not correct to propagate an offset from just one argument
> and punt on the other because the argument we punt might contain
> references to the other argument.
Funny, I'd looked at that as well (return 0 signaling two different 
things), but from the standpoint of the analysis phase it didn't matter 
as we don't use the returned value.  So I set it aside.

> 
> So the general solution that solves all the issues is: If we don't
> fully understand how to handle an instruction and its arguments in
> fold_offsets then we need to mark it in one of the bitsets (either set
> in cannot_fold or don't set in can_fold), whereas currently a insn
> that has transitively all uses as foldable is foldable.
I'm still struggling a bit with using the transitive set as a global 
like we do.  I haven't come up with a case where it fails, but every 
time I look at it I wonder if it's going to go awry at some point.

Basically we'd be looking for a case where we have two MEMs which share 
some bit of address calculation, where one of the MEMs is foldable, but 
the other is not for some reason.

If we adjust the address calculations and the foldable MEM, then do we 
run the risk of needing to change the non-foldable MEM?

Jeff
diff mbox series

Patch

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index e99628cec07..d717e90bc44 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1431,6 +1431,7 @@  OBJS = \
 	fixed-value.o \
 	fold-const.o \
 	fold-const-call.o \
+	fold-mem-offsets.o \
 	function.o \
 	function-abi.o \
 	function-tests.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 0888c15b88f..5f3d3e9706f 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1248,6 +1248,10 @@  fcprop-registers
 Common Var(flag_cprop_registers) Optimization
 Perform a register copy-propagation optimization pass.
 
+ffold-mem-offsets
+Target Bool Var(flag_fold_mem_offsets) Init(1)
+Fold instructions calculating memory offsets to the memory access instruction if possible.
+
 fcrossjumping
 Common Var(flag_crossjumping) Optimization
 Perform cross-jumping optimization.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 674f956f4b8..30974309e9b 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -540,6 +540,7 @@  Objective-C and Objective-C++ Dialects}.
 -fauto-inc-dec  -fbranch-probabilities
 -fcaller-saves
 -fcombine-stack-adjustments  -fconserve-stack
+-ffold-mem-offsets
 -fcompare-elim  -fcprop-registers  -fcrossjumping
 -fcse-follow-jumps  -fcse-skip-blocks  -fcx-fortran-rules
 -fcx-limited-range
@@ -14314,6 +14315,13 @@  the comparison operation before register allocation is complete.
 
 Enabled at levels @option{-O1}, @option{-O2}, @option{-O3}, @option{-Os}.
 
+@opindex ffold-mem-offsets
+@item -ffold-mem-offsets
+@itemx -fno-fold-mem-offsets
+Try to eliminate add instructions by folding them in memory loads/stores.
+
+Enabled at levels @option{-O2}, @option{-O3}.
+
 @opindex fcprop-registers
 @item -fcprop-registers
 After register allocation and post-register allocation instruction splitting,
diff --git a/gcc/fold-mem-offsets.cc b/gcc/fold-mem-offsets.cc
new file mode 100644
index 00000000000..e500fa30002
--- /dev/null
+++ b/gcc/fold-mem-offsets.cc
@@ -0,0 +1,725 @@ 
+/* Late RTL pass to fold memory offsets.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "rtl.h"
+#include "tree.h"
+#include "expr.h"
+#include "backend.h"
+#include "regs.h"
+#include "target.h"
+#include "memmodel.h"
+#include "emit-rtl.h"
+#include "insn-config.h"
+#include "recog.h"
+#include "predict.h"
+#include "df.h"
+#include "tree-pass.h"
+#include "cfgrtl.h"
+
+/* This pass tries to optimize memory offset calculations by moving constants
+   from add instructions to the memory instructions (loads / stores).
+   For example it can transform code like this:
+
+     add  t4, sp, 16
+     add  t2, a6, t4
+     shl  t3, t2, 1
+     ld   a2, 0(t3)
+     add  a2, 1
+     sd   a2, 8(t2)
+
+   into the following (one instruction less):
+
+     add  t2, a6, sp
+     shl  t3, t2, 1
+     ld   a2, 32(t3)
+     add  a2, 1
+     sd   a2, 24(t2)
+
+   Although the previous passes try to emit efficient offset calculations
+   this pass is still beneficial because:
+
+    - The mechanisms that optimize memory offsets usually work with specific
+      patterns or have limitations.  This pass is designed to fold offsets
+      through complex calculations that affect multiple memory operations
+      and have partially overlapping calculations.
+
+    - There are cases where add instructions are introduced in late rtl passes
+      and the rest of the pipeline cannot eliminate them.  Arrays and structs
+      allocated on the stack can result in unwanted add instructions that
+      cannot be eliminated easily.
+
+   This pass works on a basic block level and consists of 4 phases:
+
+    - Phase 1 (Analysis): Find "foldable" instructions.
+      Foldable instructions are those that we know how to propagate
+      a constant addition through (add, shift, move, ...) and only have other
+      foldable instructions for uses.  In that phase a DFS traversal on the
+      definition tree is performed and foldable instructions are marked on
+      a bitmap.  The add immediate instructions that are reachable in this
+      DFS are candidates for folding since all the intermediate calculations
+      affected by them are also foldable.
+
+    - Phase 2 (Validity): Traverse and calculate the offsets that would result
+      from folding the add immediate instructions.  Check whether the
+      calculated offsets result in a valid instruction for the target.
+
+    - Phase 3 (Commit offsets): Traverse again.  It is now known which folds
+      are valid so at this point change the offsets in the memory instructions.
+
+    - Phase 4 (Commit instruction deletions): Scan all instructions and delete
+      or simplify (reduce to move) all add immediate instructions that were
+      folded.
+
+   This pass should run before hard register propagation because it creates
+   register moves that we expect to be eliminated.  */
+
+namespace {
+
+const pass_data pass_data_fold_mem =
+{
+  RTL_PASS, /* type */
+  "fold_mem_offsets", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_fold_mem_offsets : public rtl_opt_pass
+{
+public:
+  pass_fold_mem_offsets (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_fold_mem, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return flag_fold_mem_offsets && optimize >= 2;
+    }
+
+  virtual unsigned int execute (function *);
+}; // class pass_fold_mem_offsets
+
+/* Tracks which instructions can be reached through instructions that can
+   propagate offsets for folding.  */
+static bitmap_head can_fold_insns;
+
+/* Marks instructions that are currently eligible for folding.  */
+static bitmap_head candidate_fold_insns;
+
+/* Tracks instructions that cannot be folded because it turned out that
+   folding will result in creating an invalid memory instruction.
+   An instruction can be in both CANDIDATE_FOLD_INSNS and CANNOT_FOLD_INSNS
+   at the same time, in which case it is not legal to fold.  */
+static bitmap_head cannot_fold_insns;
+
+/* The number of instructions that were simplified or eliminated.  */
+static int stats_fold_count;
+
+/* Get the single reaching definition of an instruction inside a BB.
+   The definition is desired for REG used in INSN.
+   Return the definition insn or NULL if there's no definition with
+   the desired criteria.  */
+static rtx_insn*
+get_single_def_in_bb (rtx_insn *insn, rtx reg)
+{
+  df_ref use;
+  struct df_link *ref_chain, *ref_link;
+
+  FOR_EACH_INSN_USE (use, insn)
+    {
+      if (GET_CODE (DF_REF_REG (use)) == SUBREG)
+	return NULL;
+      if (REGNO (DF_REF_REG (use)) == REGNO (reg))
+	break;
+    }
+
+  if (!use)
+    return NULL;
+
+  ref_chain = DF_REF_CHAIN (use);
+
+  if (!ref_chain)
+    return NULL;
+
+  for (ref_link = ref_chain; ref_link; ref_link = ref_link->next)
+    {
+      /* Problem getting some definition for this instruction.  */
+      if (ref_link->ref == NULL)
+	return NULL;
+      if (DF_REF_INSN_INFO (ref_link->ref) == NULL)
+	return NULL;
+      if (global_regs[REGNO (reg)]
+	  && !set_of (reg, DF_REF_INSN (ref_link->ref)))
+	return NULL;
+    }
+
+  if (ref_chain->next)
+    return NULL;
+
+  rtx_insn* def = DF_REF_INSN (ref_chain->ref);
+
+  if (BLOCK_FOR_INSN (def) != BLOCK_FOR_INSN (insn))
+    return NULL;
+
+  if (DF_INSN_LUID (def) > DF_INSN_LUID (insn))
+    return NULL;
+
+  return def;
+}
+
+/* Get all uses of REG which is set in INSN.  Return the use list or NULL if a
+   use is missing / irregular.  If SUCCESS is not NULL then set it to false if
+   there are missing / irregular uses and true otherwise.  */
+static struct df_link*
+get_uses (rtx_insn *insn, rtx reg, bool* success)
+{
+  df_ref def;
+  struct df_link *ref_chain, *ref_link;
+
+  if (success)
+    *success = false;
+
+  FOR_EACH_INSN_DEF (def, insn)
+    if (REGNO (DF_REF_REG (def)) == REGNO (reg))
+      break;
+
+  if (!def)
+    return NULL;
+
+  ref_chain = DF_REF_CHAIN (def);
+
+  for (ref_link = ref_chain; ref_link; ref_link = ref_link->next)
+    {
+      /* Problem getting a use for this instruction.  */
+      if (ref_link->ref == NULL)
+	return NULL;
+      if (DF_REF_CLASS (ref_link->ref) != DF_REF_REGULAR)
+	return NULL;
+      /* We do not handle REG_EQUIV/REG_EQ notes for now.  */
+      if (DF_REF_FLAGS (ref_link->ref) & DF_REF_IN_NOTE)
+	return NULL;
+    }
+
+  if (success)
+    *success = true;
+
+  return ref_chain;
+}
+
+/* Function that computes the offset that would have to be added to all uses
+   of REG if the instructions marked in FOLDABLE_INSNS were to be eliminated.
+
+   If ANALYZE is true then mark in CAN_FOLD_INSNS which instructions
+   transitively only affect other instructions found in CAN_FOLD_INSNS.
+   If ANALYZE is false then compute the offset required for folding.  */
+static HOST_WIDE_INT
+fold_offsets (rtx_insn* insn, rtx reg, bool analyze, bitmap foldable_insns)
+{
+  rtx_insn* def = get_single_def_in_bb (insn, reg);
+
+  if (!def || GET_CODE (PATTERN (def)) != SET)
+    return 0;
+
+  rtx src = SET_SRC (PATTERN (def));
+  rtx dest = SET_DEST (PATTERN (def));
+
+  if (!REG_P (dest))
+    return 0;
+
+  enum rtx_code code = GET_CODE (src);
+
+  /* Keep these in sync with the switch below.  We need to early out because
+     otherwise a lot of unnecessary work is done in use analysis.  */
+  if (!(code == PLUS || code == MINUS || code == NEG || code == MULT
+	|| code == ASHIFT || code == REG || code == CONST_INT))
+    return 0;
+
+  unsigned int dest_regno = REGNO (dest);
+
+  /* We can only affect the values of GPR registers.  */
+  if (fixed_regs[dest_regno]
+      || !TEST_HARD_REG_BIT (reg_class_contents[GENERAL_REGS], dest_regno))
+    return 0;
+
+  if (analyze)
+    {
+      /* We only fold through instructions that are transitively used as
+	 memory addresses and do not have other uses.  Use the same logic
+	 from offset calculation to visit instructions that can propagate
+	 offsets and keep track of them in CAN_FOLD_INSNS.  */
+      bool success;
+      struct df_link *uses = get_uses (def, dest, &success), *ref_link;
+
+      if (!success)
+	return 0;
+
+      for (ref_link = uses; ref_link; ref_link = ref_link->next)
+	{
+	  rtx_insn* use = DF_REF_INSN (ref_link->ref);
+
+	  if (DEBUG_INSN_P (use))
+	    continue;
+
+	  /* Punt if the use is anything more complicated than a set
+	     (clobber, use, etc).  */
+	  if (!NONJUMP_INSN_P (use) || GET_CODE (PATTERN (use)) != SET)
+	    return 0;
+
+	  /* This use affects instructions outside of CAN_FOLD_INSNS.  */
+	  if (!bitmap_bit_p (&can_fold_insns, INSN_UID (use)))
+	    return 0;
+
+	  rtx use_set = PATTERN (use);
+
+	  /* Special case: A foldable memory store is not foldable if it
+	     mentions DEST outside of the address calculation.  */
+	  if (use_set && MEM_P (SET_DEST (use_set))
+	      && reg_mentioned_p (dest, SET_SRC (use_set)))
+	    return 0;
+	}
+
+      bitmap_set_bit (&can_fold_insns, INSN_UID (def));
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Instruction marked for propagation: ");
+	  print_rtl_single (dump_file, def);
+	}
+    }
+  else
+    {
+      /* We cannot propagate through this instruction.  */
+      if (!bitmap_bit_p (&can_fold_insns, INSN_UID (def)))
+	return 0;
+    }
+
+  switch (code)
+    {
+    case PLUS:
+      {
+	/* Propagate through add.  */
+	rtx arg1 = XEXP (src, 0);
+	rtx arg2 = XEXP (src, 1);
+	HOST_WIDE_INT offset = 0;
+
+	if (REG_P (arg1))
+	  offset += fold_offsets (def, arg1, analyze, foldable_insns);
+	else if (GET_CODE (arg1) == ASHIFT
+		 && REG_P (XEXP (arg1, 0))
+		 && CONST_INT_P (XEXP (arg1, 1)))
+	  {
+	    /* Handle R1 = (R2 << C) + ...  */
+	    HOST_WIDE_INT scale
+	      = (HOST_WIDE_INT_1U << INTVAL (XEXP (arg1, 1)));
+	    offset += scale * fold_offsets (def, XEXP (arg1, 0), analyze,
+					 foldable_insns);
+	  }
+	else if (GET_CODE (arg1) == PLUS
+		 && REG_P (XEXP (arg1, 0))
+		 && REG_P (XEXP (arg1, 1)))
+	  {
+	    /* Handle R1 = (R2 + R3) + ...  */
+	    offset += fold_offsets (def, XEXP (arg1, 0), analyze,
+				    foldable_insns);
+	    offset += fold_offsets (def, XEXP (arg1, 1), analyze,
+				    foldable_insns);
+	  }
+	else if (GET_CODE (arg1) == PLUS
+		 && GET_CODE (XEXP (arg1, 0)) == ASHIFT
+		 && REG_P (XEXP (XEXP (arg1, 0), 0))
+		 && CONST_INT_P (XEXP (XEXP (arg1, 0), 1))
+		 && REG_P (XEXP (arg1, 1)))
+	  {
+	    /* Handle R1 = ((R2 << C) + R3) + ...  */
+	    HOST_WIDE_INT scale
+	      = (HOST_WIDE_INT_1U << INTVAL (XEXP (XEXP (arg1, 0), 1)));
+	    offset += scale * fold_offsets (def, XEXP (XEXP (arg1, 0), 0),
+					    analyze, foldable_insns);
+	    offset += fold_offsets (def, XEXP (arg1, 1), analyze,
+				    foldable_insns);
+	  }
+	else
+	  return 0;
+
+	if (REG_P (arg2))
+	  offset += fold_offsets (def, arg2, analyze, foldable_insns);
+	else if (REG_P (arg1) && CONST_INT_P (arg2) && !analyze)
+	  {
+	    offset += INTVAL (arg2);
+	    /* This is a R1 = R2 + C instruction, candidate for folding.  */
+	    bitmap_set_bit (foldable_insns, INSN_UID (def));
+	  }
+	else
+	  return 0;
+
+	return offset;
+      }
+    case MINUS:
+      {
+	/* Propagate through minus.  */
+	rtx arg1 = XEXP (src, 0);
+	rtx arg2 = XEXP (src, 1);
+	HOST_WIDE_INT offset = 0;
+
+	if (REG_P (arg1))
+	  offset += fold_offsets (def, arg1, analyze, foldable_insns);
+	else
+	  return 0;
+
+	if (REG_P (arg2))
+	  offset -= fold_offsets (def, arg2, analyze, foldable_insns);
+	else if (REG_P (arg1) && CONST_INT_P (arg2) && !analyze)
+	  {
+	    offset -= INTVAL (arg2);
+	    /* This is a R1 = R2 - C instruction, candidate for folding.  */
+	    bitmap_set_bit (foldable_insns, INSN_UID (def));
+	  }
+	else
+	  return 0;
+
+	return offset;
+      }
+    case NEG:
+      {
+	/* Propagate through negation.  */
+	rtx arg1 = XEXP (src, 0);
+	if (REG_P (arg1))
+	  return -fold_offsets (def, arg1, analyze, foldable_insns);
+	else
+	  return 0;
+      }
+    case MULT:
+      {
+	/* Propagate through multiply by constant.  */
+	rtx arg1 = XEXP (src, 0);
+	rtx arg2 = XEXP (src, 1);
+
+	if (REG_P (arg1) && CONST_INT_P (arg2))
+	  {
+	    HOST_WIDE_INT scale = INTVAL (arg2);
+	    return scale * fold_offsets (def, arg1, analyze, foldable_insns);
+	  }
+
+	return 0;
+      }
+    case ASHIFT:
+      {
+	/* Propagate through shift left by constant.  */
+	rtx arg1 = XEXP (src, 0);
+	rtx arg2 = XEXP (src, 1);
+
+	if (REG_P (arg1) && CONST_INT_P (arg2))
+	  {
+	    HOST_WIDE_INT scale = (HOST_WIDE_INT_1U << INTVAL (arg2));
+	    return scale * fold_offsets (def, arg1, analyze, foldable_insns);
+	  }
+
+	return 0;
+      }
+    case REG:
+      /* Propagate through register move.  */
+      return fold_offsets (def, src, analyze, foldable_insns);
+    case CONST_INT:
+      {
+	/* R1 = C is candidate for folding.  */
+	if (!analyze)
+	  bitmap_set_bit (foldable_insns, INSN_UID (def));
+	return INTVAL (src);
+      }
+    default:
+      /* Cannot propagate.  */
+      return 0;
+    }
+}
+
+/* Test if INSN is a memory load / store that can have an offset folded to it.
+   Return true iff INSN is such an instruction and return through MEM_OUT,
+   REG_OUT and OFFSET_OUT the RTX that has a MEM code, the register that is
+   used as a base address and the offset accordingly.
+   All of the out pointers may be NULL in which case they will be ignored.  */
+bool
+get_fold_mem_root (rtx_insn* insn, rtx *mem_out, rtx *reg_out,
+		   HOST_WIDE_INT *offset_out)
+{
+  rtx set = single_set (insn);
+  rtx mem = NULL_RTX;
+
+  if (set != NULL_RTX)
+    {
+      rtx src = SET_SRC (set);
+      rtx dest = SET_DEST (set);
+
+      /* Don't fold when we have unspec / volatile.  */
+      if (GET_CODE (src) == UNSPEC
+	  || GET_CODE (src) == UNSPEC_VOLATILE
+	  || GET_CODE (dest) == UNSPEC
+	  || GET_CODE (dest) == UNSPEC_VOLATILE)
+	return false;
+
+      if (MEM_P (src))
+	mem = src;
+      else if (MEM_P (dest))
+	mem = dest;
+      else if ((GET_CODE (src) == SIGN_EXTEND
+		|| GET_CODE (src) == ZERO_EXTEND)
+	       && MEM_P (XEXP (src, 0)))
+	mem = XEXP (src, 0);
+    }
+
+  if (mem == NULL_RTX)
+    return false;
+
+  rtx mem_addr = XEXP (mem, 0);
+  rtx reg;
+  HOST_WIDE_INT offset;
+
+  if (REG_P (mem_addr))
+    {
+      reg = mem_addr;
+      offset = 0;
+    }
+  else if (GET_CODE (mem_addr) == PLUS
+	   && REG_P (XEXP (mem_addr, 0))
+	   && CONST_INT_P (XEXP (mem_addr, 1)))
+    {
+      reg = XEXP (mem_addr, 0);
+      offset = INTVAL (XEXP (mem_addr, 1));
+    }
+  else
+    return false;
+
+  if (mem_out)
+    *mem_out = mem;
+  if (reg_out)
+    *reg_out = reg;
+  if (offset_out)
+    *offset_out = offset;
+
+  return true;
+}
+
+/* If INSN is a root memory instruction then do a DFS traversal on its
+   definitions and find folding candidates.  */
+static void
+do_analysis (rtx_insn* insn)
+{
+  rtx reg;
+  if (!get_fold_mem_root (insn, NULL, &reg, NULL))
+    return;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "Starting analysis from root: ");
+      print_rtl_single (dump_file, insn);
+    }
+
+  /* Analyse folding opportunities for this memory instruction.  */
+  bitmap_set_bit (&can_fold_insns, INSN_UID (insn));
+  fold_offsets (insn, reg, true, NULL);
+}
+
+/* If INSN is a root memory instruction then compute a potentially new offset
+   for it and test if the resulting instruction is valid.  */
+static void
+do_check_validity (rtx_insn* insn)
+{
+  rtx mem, reg;
+  HOST_WIDE_INT cur_offset;
+  if (!get_fold_mem_root (insn, &mem, &reg, &cur_offset))
+    return;
+
+  auto_bitmap fold_insns;
+  HOST_WIDE_INT new_offset
+    = cur_offset + fold_offsets (insn, reg, false, fold_insns);
+
+  /* Test if it is valid to change MEM's address offset to NEW_OFFSET.  */
+  int icode = INSN_CODE (insn);
+  rtx mem_addr = XEXP (mem, 0);
+  machine_mode mode = GET_MODE (mem_addr);
+  XEXP (mem, 0) = gen_rtx_PLUS (mode, reg, gen_int_mode (new_offset, mode));
+
+  bool illegal = insn_invalid_p (insn, false)
+		 || !memory_address_addr_space_p (mode, XEXP (mem, 0),
+						  MEM_ADDR_SPACE (mem));
+
+  /* Restore the instruction.  */
+  XEXP (mem, 0) = mem_addr;
+  INSN_CODE (insn) = icode;
+
+  if (illegal)
+    bitmap_ior_into (&cannot_fold_insns, fold_insns);
+  else
+    bitmap_ior_into (&candidate_fold_insns, fold_insns);
+}
+
+/* If INSN is a root memory instruction that was affected by any folding
+   then update its offset as necessary.  */
+static void
+do_commit_offset (rtx_insn* insn)
+{
+  rtx mem, reg;
+  HOST_WIDE_INT cur_offset;
+  if (!get_fold_mem_root (insn, &mem, &reg, &cur_offset))
+    return;
+
+  auto_bitmap fold_insns;
+  HOST_WIDE_INT new_offset
+    = cur_offset + fold_offsets (insn, reg, false, fold_insns);
+
+  /* If an change turned out illegal in the previous phase then legal
+     transformations that share calculations also become illegal.  */
+  if (bitmap_intersect_p (&cannot_fold_insns, fold_insns))
+    {
+      bitmap_ior_into (&cannot_fold_insns, fold_insns);
+      return;
+    }
+
+  if (new_offset == cur_offset)
+    return;
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "Memory offset changed from "
+	       HOST_WIDE_INT_PRINT_DEC " to " HOST_WIDE_INT_PRINT_DEC
+	       " for instruction:\n", cur_offset, new_offset);
+      print_rtl_single (dump_file, insn);
+    }
+
+  gcc_assert (!bitmap_empty_p (fold_insns));
+  machine_mode mode = GET_MODE (XEXP (mem, 0));
+  XEXP (mem, 0) = gen_rtx_PLUS (mode, reg, gen_int_mode (new_offset, mode));
+  INSN_CODE (insn) = recog (PATTERN (insn), insn, 0);
+  df_insn_rescan (insn);
+}
+
+/* If INSN is a move / add instruction that was folded then replace its
+   constant part with zero.  */
+static void
+do_commit_insn (rtx_insn* insn)
+{
+  if (bitmap_bit_p (&candidate_fold_insns, INSN_UID (insn))
+      && !bitmap_bit_p (&cannot_fold_insns, INSN_UID (insn)))
+    {
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Instruction folded:");
+	  print_rtl_single (dump_file, insn);
+	}
+
+      stats_fold_count++;
+
+      rtx set = single_set (insn);
+      rtx dest = SET_DEST (set);
+      rtx src = SET_SRC (set);
+
+      /* Emit a move and let subsequent passes eliminate it if possible.  */
+      if (GET_CODE (src) == CONST_INT)
+	{
+	  /* INSN is R1 = C.
+	     Replace it with R1 = 0 because C was folded.  */
+	  rtx mov_rtx
+	    = gen_move_insn (dest, gen_int_mode (0, GET_MODE (dest)));
+	  df_insn_rescan (emit_insn_after (mov_rtx, insn));
+	}
+      else
+	{
+	  /* INSN is R1 = R2 + C.
+	     Replace it with R1 = R2 because C was folded.  */
+	  rtx arg1 = XEXP (src, 0);
+
+	  /* If the DEST == ARG1 then the move is a no-op.  */
+	  if (REGNO (dest) != REGNO (arg1))
+	    {
+	      gcc_checking_assert (GET_MODE (dest) == GET_MODE (arg1));
+	      rtx mov_rtx = gen_move_insn (dest, arg1);
+	      df_insn_rescan (emit_insn_after (mov_rtx, insn));
+	    }
+	}
+
+      /* Delete the original move / add instruction.  */
+      delete_insn (insn);
+    }
+}
+
+unsigned int
+pass_fold_mem_offsets::execute (function *fn)
+{
+  df_set_flags (DF_EQ_NOTES + DF_RD_PRUNE_DEAD_DEFS + DF_DEFER_INSN_RESCAN);
+  df_chain_add_problem (DF_UD_CHAIN + DF_DU_CHAIN);
+  df_analyze ();
+
+  bitmap_initialize (&can_fold_insns, NULL);
+  bitmap_initialize (&candidate_fold_insns, NULL);
+  bitmap_initialize (&cannot_fold_insns, NULL);
+
+  stats_fold_count = 0;
+
+  basic_block bb;
+  rtx_insn *insn;
+  FOR_ALL_BB_FN (bb, fn)
+    {
+      /* There is a conflict between this pass and RISCV's shorten-memrefs
+	  pass.  For now disable folding if optimizing for size because
+	  otherwise this cancels the effects of shorten-memrefs.  */
+      if (optimize_bb_for_size_p (bb))
+	continue;
+
+      bitmap_clear (&can_fold_insns);
+      bitmap_clear (&candidate_fold_insns);
+      bitmap_clear (&cannot_fold_insns);
+
+      FOR_BB_INSNS (bb, insn)
+	do_analysis (insn);
+
+      FOR_BB_INSNS (bb, insn)
+	do_check_validity (insn);
+
+      FOR_BB_INSNS (bb, insn)
+	do_commit_offset (insn);
+
+      FOR_BB_INSNS (bb, insn)
+	do_commit_insn (insn);
+    }
+
+  statistics_counter_event (cfun, "Number of folded instructions",
+			    stats_fold_count);
+
+  bitmap_release (&can_fold_insns);
+  bitmap_release (&candidate_fold_insns);
+  bitmap_release (&cannot_fold_insns);
+
+  return 0;
+}
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_fold_mem_offsets (gcc::context *ctxt)
+{
+  return new pass_fold_mem_offsets (ctxt);
+}
diff --git a/gcc/passes.def b/gcc/passes.def
index ef5a21afe49..4a0e9a1aee2 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -516,6 +516,7 @@  along with GCC; see the file COPYING3.  If not see
 	  NEXT_PASS (pass_peephole2);
 	  NEXT_PASS (pass_if_after_reload);
 	  NEXT_PASS (pass_regrename);
+	  NEXT_PASS (pass_fold_mem_offsets);
 	  NEXT_PASS (pass_cprop_hardreg);
 	  NEXT_PASS (pass_fast_rtl_dce);
 	  NEXT_PASS (pass_reorder_blocks);
diff --git a/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-1.c b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-1.c
new file mode 100644
index 00000000000..574cc92b6ab
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-1.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -mfold-mem-offsets" } */
+
+void sink(int arr[2]);
+
+void
+foo(int a, int b, int i)
+{
+  int arr[2] = {a, b};
+  arr[i]++;
+  sink(arr);
+}
+
+// Should compile without negative memory offsets when using -mfold-mem-offsets
+/* { dg-final { scan-assembler-not "lw\t.*,-.*\\(.*\\)" } } */
+/* { dg-final { scan-assembler-not "sw\t.*,-.*\\(.*\\)" } } */
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-2.c b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-2.c
new file mode 100644
index 00000000000..e6c251d3a3c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-2.c
@@ -0,0 +1,24 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -mfold-mem-offsets" } */
+
+void sink(int arr[3]);
+
+void
+foo(int a, int b, int c, int i)
+{
+  int arr1[3] = {a, b, c};
+  int arr2[3] = {a, c, b};
+  int arr3[3] = {c, b, a};
+
+  arr1[i]++;
+  arr2[i]++;
+  arr3[i]++;
+  
+  sink(arr1);
+  sink(arr2);
+  sink(arr3);
+}
+
+// Should compile without negative memory offsets when using -mfold-mem-offsets
+/* { dg-final { scan-assembler-not "lw\t.*,-.*\\(.*\\)" } } */
+/* { dg-final { scan-assembler-not "sw\t.*,-.*\\(.*\\)" } } */
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-3.c b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-3.c
new file mode 100644
index 00000000000..8586d3e3a29
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-3.c
@@ -0,0 +1,17 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -mfold-mem-offsets" } */
+
+void load(int arr[2]);
+
+int
+foo(long unsigned int i)
+{
+  int arr[2];
+  load(arr);
+
+  return arr[3 * i + 77];
+}
+
+// Should compile without negative memory offsets when using -mfold-mem-offsets
+/* { dg-final { scan-assembler-not "lw\t.*,-.*\\(.*\\)" } } */
+/* { dg-final { scan-assembler-not "addi\t.*,.*,77" } } */
\ No newline at end of file
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 57865cdfc42..69a8a6d5cb7 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -618,6 +618,7 @@  extern rtl_opt_pass *make_pass_sched_fusion (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_peephole2 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_if_after_reload (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_regrename (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_fold_mem_offsets (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_cprop_hardreg (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_reorder_blocks (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_leaf_regs (gcc::context *ctxt);