mbox series

[V3,0/7] ira/lra: Support subreg coalesce

Message ID 20231112120817.2635864-1-lehua.ding@rivai.ai
Headers show
Series ira/lra: Support subreg coalesce | expand

Message

Lehua Ding Nov. 12, 2023, 12:08 p.m. UTC
V3 Changes:
  1. fix three ICE.
  2. rebase

Hi,

These patchs try to support subreg coalesce feature in
register allocation passes (ira and lra).

Let's consider a RISC-V program (https://godbolt.org/z/ec51d91aT):

```
#include <riscv_vector.h>

void
foo (int32_t *in, int32_t *out, size_t m)
{
  vint32m2_t result = __riscv_vle32_v_i32m2 (in, 32);
  vint32m1_t v0 = __riscv_vget_v_i32m2_i32m1 (result, 0);
  vint32m1_t v1 = __riscv_vget_v_i32m2_i32m1 (result, 1);
  for (size_t i = 0; i < m; i++)
    {
      v0 = __riscv_vadd_vv_i32m1(v0, v0, 4);
      v1 = __riscv_vmul_vv_i32m1(v1, v1, 4);
    }
  *(vint32m1_t*)(out+4*0) = v0;
  *(vint32m1_t*)(out+4*1) = v1;
}
```

Before these patchs:

```
foo:
	li	a5,32
	vsetvli	zero,a5,e32,m2,ta,ma
	vle32.v	v4,0(a0)
	vmv1r.v	v2,v4
	vmv1r.v	v1,v5
	beq	a2,zero,.L2
	li	a5,0
	vsetivli	zero,4,e32,m1,ta,ma
.L3:
	addi	a5,a5,1
	vadd.vv	v2,v2,v2
	vmul.vv	v1,v1,v1
	bne	a2,a5,.L3
.L2:
	vs1r.v	v2,0(a1)
	addi	a1,a1,16
	vs1r.v	v1,0(a1)
	ret
```

After these patchs:

```
foo:
	li	a5,32
	vsetvli	zero,a5,e32,m2,ta,ma
	vle32.v	v2,0(a0)
	beq	a2,zero,.L2
	li	a5,0
	vsetivli	zero,4,e32,m1,ta,ma
.L3:
	addi	a5,a5,1
	vadd.vv	v2,v2,v2
	vmul.vv	v3,v3,v3
	bne	a2,a5,.L3
.L2:
	vs1r.v	v2,0(a1)
	addi	a1,a1,16
	vs1r.v	v3,0(a1)
	ret
```

As you can see, the two redundant vmv1r.v instructions were removed.
The reason for the two redundant vmv1r.v instructions is because
the current ira pass is being conservative in calculating the live
range of pseduo registers that occupy multil hardregs. As in the
following two RTL instructions. Where r134 occupies two physical
registers and r135 and r136 occupy one physical register.
At insn 12 point, ira considers the entire r134 pseudo register
to be live, so r135 is in conflict with r134, as shown in the ira
dump info. Then when the physical registers are allocated, r135 and
r134 are allocated first because they are inside the loop body and
have higher priority. This makes it difficult to assign r136 to
overlap with r134, i.e., to assign r136 to hr100, thus eliminating
the need for the vmv1r.v instruction. Thus two vmv1r.v instructions
appear.

If we refine the live information of r134 to the case of each subreg,
we can remove this conflict. We can then create copies of the set
with subreg reference, thus increasing the priority of the r134 allocation,
which allow registers with bigger alignment requirements to prioritize
the allocation of physical registers. In RVV, pseudo registers occupying
two physical registers need to be time-2 aligned.

```
(insn 11 10 12 2 (set (reg/v:RVVM1SI 135 [ v0 ])
        (subreg:RVVM1SI (reg/v:RVVM2SI 134 [ result ]) 0)) "/app/example.c":7:19 998 {*movrvvm1si_whole}
     (nil))
(insn 12 11 13 2 (set (reg/v:RVVM1SI 136 [ v1 ])
        (subreg:RVVM1SI (reg/v:RVVM2SI 134 [ result ]) [16, 16])) "/app/example.c":8:19 998 {*movrvvm1si_whole}
     (expr_list:REG_DEAD (reg/v:RVVM2SI 134 [ result ])
        (nil)))
```

ira dump:

;; a1(r136,l0) conflicts: a3(r135,l0)
;;     total conflict hard regs:
;;     conflict hard regs:
;; a3(r135,l0) conflicts: a1(r136,l0) a6(r134,l0)
;;     total conflict hard regs:
;;     conflict hard regs:
;; a6(r134,l0) conflicts: a3(r135,l0)
;;     total conflict hard regs:
;;     conflict hard regs:
;;
;; ...
      Popping a1(r135,l0)  --         assign reg 97
      Popping a3(r136,l0)  --         assign reg 98
      Popping a4(r137,l0)  --         assign reg 15
      Popping a5(r140,l0)  --         assign reg 12
      Popping a10(r145,l0)  --         assign reg 12
      Popping a2(r139,l0)  --         assign reg 11
      Popping a9(r144,l0)  --         assign reg 11
      Popping a0(r142,l0)  --         assign reg 11
      Popping a6(r134,l0)  --         assign reg 100
      Popping a7(r143,l0)  --         assign reg 10
      Popping a8(r141,l0)  --         assign reg 15

The AArch64 SVE has the same problem. Consider the following
code (https://godbolt.org/z/MYrK7Ghaj):

```
#include <arm_sve.h>

int bar (svbool_t pg, int64_t* base, int n, int64_t *in1, int64_t *in2, int64_t*out)
{
  svint64x4_t result = svld4_s64 (pg, base);
  svint64_t v0 = svget4_s64(result, 0);
  svint64_t v1 = svget4_s64(result, 1);
  svint64_t v2 = svget4_s64(result, 2);
  svint64_t v3 = svget4_s64(result, 3);

  for (int i = 0; i < n; i += 1)
    {
        svint64_t v18 = svld1_s64(pg, in1);
        svint64_t v19 = svld1_s64(pg, in2);
        v0 = svmad_s64_z(pg, v0, v18, v19);
        v1 = svmad_s64_z(pg, v1, v18, v19);
        v2 = svmad_s64_z(pg, v2, v18, v19);
        v3 = svmad_s64_z(pg, v3, v18, v19);
    }
  svst1_s64(pg, out+0,v0);
  svst1_s64(pg, out+1,v1);
  svst1_s64(pg, out+2,v2);
  svst1_s64(pg, out+3,v3);
}
```

Before these patchs:

```
bar:
	ld4d	{z4.d - z7.d}, p0/z, [x0]
	mov	z26.d, z4.d
	mov	z27.d, z5.d
	mov	z28.d, z6.d
	mov	z29.d, z7.d
	cmp	w1, 0
	...
```

After these patchs:

```
bar:
	ld4d	{z28.d - z31.d}, p0/z, [x0]
	cmp	w1, 0
	...
```

Lehua Ding (7):
  df: Add DF_LIVE_SUBREG problem
  ira: Switch to live_subreg data
  ira: Support subreg live range track
  ira: Support subreg copy
  ira: Add all nregs >= 2 pseudos to tracke subreg list
  lra: Switch to live_subreg data flow
  lra: Support subreg live range track and conflict detect

 gcc/Makefile.in          |   1 +
 gcc/df-problems.cc       | 889 ++++++++++++++++++++++++++++++++++++++-
 gcc/df.h                 |  67 +++
 gcc/hard-reg-set.h       |  33 ++
 gcc/ira-build.cc         | 456 ++++++++++++++++----
 gcc/ira-color.cc         | 851 ++++++++++++++++++++++++++-----------
 gcc/ira-conflicts.cc     | 221 +++++++---
 gcc/ira-emit.cc          |  24 +-
 gcc/ira-int.h            |  67 ++-
 gcc/ira-lives.cc         | 507 ++++++++++++++++------
 gcc/ira.cc               |  73 ++--
 gcc/lra-assigns.cc       | 111 ++++-
 gcc/lra-coalesce.cc      |  20 +-
 gcc/lra-constraints.cc   | 111 +++--
 gcc/lra-int.h            |  33 ++
 gcc/lra-lives.cc         | 660 ++++++++++++++++++++++++-----
 gcc/lra-remat.cc         |  13 +-
 gcc/lra-spills.cc        |  22 +-
 gcc/lra.cc               | 139 +++++-
 gcc/regs.h               |   7 +
 gcc/subreg-live-range.cc | 628 +++++++++++++++++++++++++++
 gcc/subreg-live-range.h  | 333 +++++++++++++++
 gcc/timevar.def          |   1 +
 23 files changed, 4490 insertions(+), 777 deletions(-)
 create mode 100644 gcc/subreg-live-range.cc
 create mode 100644 gcc/subreg-live-range.h

Comments

Dimitar Dimitrov Nov. 13, 2023, 4:43 p.m. UTC | #1
On Sun, Nov 12, 2023 at 08:08:10PM +0800, Lehua Ding wrote:
> V3 Changes:
>   1. fix three ICE.
>   2. rebase
> 
> Hi,
> 
> These patchs try to support subreg coalesce feature in
> register allocation passes (ira and lra).
> 

Hi Lehua,

V3 indeed fixes the arm-none-eabi build. It's also confirmed by Linaro CI:
  https://patchwork.sourceware.org/project/gcc/patch/20231112120817.2635864-8-lehua.ding@rivai.ai/

But avr and pru backends are still broken, albeit with different crash
signatures. Both targets are peculiar because they have
UNITS_PER_WORD=1. I'll try building some 16-bit target like msp430.

AVR fails when building libgcc:
/mnt/nvme/dinux/local-workspace/gcc/libgcc/config/avr/lib2funcs.c: In function '__roundlr':
/mnt/nvme/dinux/local-workspace/gcc/libgcc/config/avr/lib2funcs.c:115:3: internal compiler error: in check_allocation, at ira.cc:2673
  115 |   }
      |   ^
/mnt/nvme/dinux/local-workspace/gcc/libgcc/config/avr/lib2funcs.c:106:3: note: in expansion of macro 'ROUND2'
  106 |   ROUND2 (FX)
      |   ^~~~~~
/mnt/nvme/dinux/local-workspace/gcc/libgcc/config/avr/lib2funcs.c:117:1: note: in expansion of macro 'ROUND1'
  117 | ROUND1(L_LABEL)
      | ^~~~~~
0xc80b8d check_allocation
        /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:2673
0xc89451 ira
        /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:5873
0xc89451 execute
        /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:6104

Script I'm using to build avr: https://github.com/dinuxbg/gnupru/blob/master/testing/manual-build-avr.sh



PRU fails building newlib:
/mnt/nvme/dinux/local-workspace/newlib/newlib/libc/stdlib/gdtoa-gdtoa.c:835:9: internal compiler error: in lra_create_live_ranges, at lra-lives.cc:1933
  835 |         }
      |         ^
0x6b951c lra_create_live_ranges(bool, bool)
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra-lives.cc:1933
0xd9320c lra(_IO_FILE*)
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:2638
0xd3e519 do_reload
        /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:5960
0xd3e519 execute
        /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:6148

Script I'm using to build pru: https://github.com/dinuxbg/gnupru/blob/master/testing/manual-build-pru.sh

Regards,
Dimitar,
Vladimir Makarov Nov. 13, 2023, 7:37 p.m. UTC | #2
On 11/12/23 07:08, Lehua Ding wrote:
> V3 Changes:
>    1. fix three ICE.
>    2. rebase
>
> Hi,
>
> These patchs try to support subreg coalesce feature in
> register allocation passes (ira and lra).
>
I've started review of v3 patches and here is my initial general 
criticism of your patches:

   * Absence of comments for some functions, e.g. for `HARD_REG_SET 
operator>> (unsigned int shift_amount) const`.

   * Adding significant functionality to existing functions is not 
reflected in the function comment, e.g. in ira_set_allocno_class.

   * A lot of typos, e.g. `pesudo` or `reprensent`.  I think you need to 
check spelling of you comments (I myself do spell checking in emacs by 
ispell-region command).

   * Grammar mistakes, e.g `Flag means need track subreg live range for 
the allocno`.  I understand English is not your native languages (as for 
me).  In case of some doubts I'd recommend to check grammar in ChatGPT 
(Proofread: <english> text).

   * Some local variables use upper case letters (e.g. `int A`) which 
should be used for macros or enums according to GNU coding standard 
(https://www.gnu.org/prep/standards/standards.html) .

   * Sometimes you put one space at the end of sentence.  Please see GNU 
coding standard and GCC coding conventions 
(https://gcc.gnu.org/codingconventions.html)

   * There is no uniformity in your code, e.g. sometimes you use 'i++', 
sometimes `++i` or `i += 1`.  Although the uniformity is not necessary, 
it makes a better impression about the patches.


I also did not find what targets did you use for testing.  I am asking 
this because I see new testsuite failures (apx-spill_to_egprs-1.c) even 
on x86-64.  It might be nothing as the test expects a specific code 
generation.

Also besides testing major targets I'd recommend testing at least one 
big endian target (I'd recommend ppc64be. gcc110.fsfrance.org could be 
used for this).  Plenty RA issues occur because BE targets are not tested.
Lehua Ding Nov. 14, 2023, 5:37 a.m. UTC | #3
Hi Vladimir,

On 2023/11/14 3:37, Vladimir Makarov wrote:
> 
> On 11/12/23 07:08, Lehua Ding wrote:
>> V3 Changes:
>>    1. fix three ICE.
>>    2. rebase
>>
>> Hi,
>>
>> These patchs try to support subreg coalesce feature in
>> register allocation passes (ira and lra).
>>
> I've started review of v3 patches and here is my initial general 
> criticism of your patches:
> 
>    * Absence of comments for some functions, e.g. for `HARD_REG_SET 
> operator>> (unsigned int shift_amount) const`.
> 
>    * Adding significant functionality to existing functions is not 
> reflected in the function comment, e.g. in ira_set_allocno_class.
> 
>    * A lot of typos, e.g. `pesudo` or `reprensent`.  I think you need to 
> check spelling of you comments (I myself do spell checking in emacs by 
> ispell-region command).
> 
>    * Grammar mistakes, e.g `Flag means need track subreg live range for 
> the allocno`.  I understand English is not your native languages (as for 
> me).  In case of some doubts I'd recommend to check grammar in ChatGPT 
> (Proofread: <english> text).
> 
>    * Some local variables use upper case letters (e.g. `int A`) which 
> should be used for macros or enums according to GNU coding standard 
> (https://www.gnu.org/prep/standards/standards.html) .
> 
>    * Sometimes you put one space at the end of sentence.  Please see GNU 
> coding standard and GCC coding conventions 
> (https://gcc.gnu.org/codingconventions.html)
> 
>    * There is no uniformity in your code, e.g. sometimes you use 'i++', 
> sometimes `++i` or `i += 1`.  Although the uniformity is not necessary, 
> it makes a better impression about the patches.

Sorry for these issue, I'll address all those comments.

> I also did not find what targets did you use for testing.  I am asking 
> this because I see new testsuite failures (apx-spill_to_egprs-1.c) even 
> on x86-64.  It might be nothing as the test expects a specific code 
> generation.

There was testing x86, aarch64, riscv not long ago, but it looks like 
I'm missing something, I just locally tested with the latest code and 
also reproduced this fail you mentioned, along with a c++ fail 
(pr106877.C). I'll have a look at the cause.

> Also besides testing major targets I'd recommend testing at least one 
> big endian target (I'd recommend ppc64be. gcc110.fsfrance.org could be 
> used for this).  Plenty RA issues occur because BE targets are not tested.

You said the address looks a bit wrong, it should be this 
gcc110.fsffrance.org right? I looked for it and it looks like you have 
to go to portal.cfarm.net first to apply for an account on this site, 
I'll try that, thanks a lot.
Peter Bergner Nov. 14, 2023, 11:22 p.m. UTC | #4
On 11/12/23 6:08 AM, Lehua Ding wrote:
> V3 Changes:
>   1. fix three ICE.
>   2. rebase


I tested this on powerpc64le-linux and powerpc64-linux.  The LE build
bootstrapped fine and it looks like only one testsuite FAIL which I have
to look into why it's FAILing.

The BE build did bootstrap, but the 32-bit and 64-bit testsuite runs both
had lots of FAILs (over 100 between them both) which I have yet to look
into what is happening.

I'll also note I have done no performance testing yet until I have an
idea of what the testsuite failures are.  I think a patch like this that
can affect the performance of all architectures needs some performance
testing to ensure we don't have unintended performance degradations.
I'll have someone on my team kick off some builds once I have a handle
on the testsuite FAILs.

Peter
Peter Bergner Nov. 14, 2023, 11:33 p.m. UTC | #5
On 11/13/23 11:37 PM, Lehua Ding wrote:
> On 2023/11/14 3:37, Vladimir Makarov wrote:
>> Also besides testing major targets I'd recommend testing at least one big
>> endian target (I'd recommend ppc64be. gcc110.fsfrance.org could be used
>> for this).  Plenty RA issues occur because BE targets are not tested.
> 
> You said the address looks a bit wrong, it should be this gcc110.fsffrance.org
> right? I looked for it and it looks like you have to go to portal.cfarm.net
> first to apply for an account on this site, I'll try that, thanks a lot.


The compile farm just went through with a domain name change, so the Power7 BE
gcc110.fsffrance.org system is now reachable via cfarm110.cfarm.net.
You are correct on the address for requesting a cfarm account.

That said, I posted results using your V3 patches for both LE and BE Power
in my other reply. 

Peter
Lehua Ding Nov. 15, 2023, 2:10 a.m. UTC | #6
On 2023/11/14 0:43, Dimitar Dimitrov wrote:
> On Sun, Nov 12, 2023 at 08:08:10PM +0800, Lehua Ding wrote:
>> V3 Changes:
>>    1. fix three ICE.
>>    2. rebase
>>
>> Hi,
>>
>> These patchs try to support subreg coalesce feature in
>> register allocation passes (ira and lra).
>>
> 
> Hi Lehua,
> 
> V3 indeed fixes the arm-none-eabi build. It's also confirmed by Linaro CI:
>    https://patchwork.sourceware.org/project/gcc/patch/20231112120817.2635864-8-lehua.ding@rivai.ai/
> 
> But avr and pru backends are still broken, albeit with different crash
> signatures. Both targets are peculiar because they have
> UNITS_PER_WORD=1. I'll try building some 16-bit target like msp430.
> 
> AVR fails when building libgcc:
> /mnt/nvme/dinux/local-workspace/gcc/libgcc/config/avr/lib2funcs.c: In function '__roundlr':
> /mnt/nvme/dinux/local-workspace/gcc/libgcc/config/avr/lib2funcs.c:115:3: internal compiler error: in check_allocation, at ira.cc:2673
>    115 |   }
>        |   ^
> /mnt/nvme/dinux/local-workspace/gcc/libgcc/config/avr/lib2funcs.c:106:3: note: in expansion of macro 'ROUND2'
>    106 |   ROUND2 (FX)
>        |   ^~~~~~
> /mnt/nvme/dinux/local-workspace/gcc/libgcc/config/avr/lib2funcs.c:117:1: note: in expansion of macro 'ROUND1'
>    117 | ROUND1(L_LABEL)
>        | ^~~~~~
> 0xc80b8d check_allocation
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:2673
> 0xc89451 ira
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:5873
> 0xc89451 execute
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:6104
> 
> Script I'm using to build avr: https://github.com/dinuxbg/gnupru/blob/master/testing/manual-build-avr.sh
> 
> 
> 
> PRU fails building newlib:
> /mnt/nvme/dinux/local-workspace/newlib/newlib/libc/stdlib/gdtoa-gdtoa.c:835:9: internal compiler error: in lra_create_live_ranges, at lra-lives.cc:1933
>    835 |         }
>        |         ^
> 0x6b951c lra_create_live_ranges(bool, bool)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra-lives.cc:1933
> 0xd9320c lra(_IO_FILE*)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:2638
> 0xd3e519 do_reload
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:5960
> 0xd3e519 execute
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:6148
> 
> Script I'm using to build pru: https://github.com/dinuxbg/gnupru/blob/master/testing/manual-build-pru.sh

These ICE will fixed in the V4 patchs and both targets build 
successfully in my machine, thank you so much for the reported.
Lehua Ding Nov. 15, 2023, 3:12 a.m. UTC | #7
On 2023/11/15 7:22, Peter Bergner wrote:
> On 11/12/23 6:08 AM, Lehua Ding wrote:
>> V3 Changes:
>>    1. fix three ICE.
>>    2. rebase
> 
> 
> I tested this on powerpc64le-linux and powerpc64-linux.  The LE build
> bootstrapped fine and it looks like only one testsuite FAIL which I have
> to look into why it's FAILing.
> 
> The BE build did bootstrap, but the 32-bit and 64-bit testsuite runs both
> had lots of FAILs (over 100 between them both) which I have yet to look
> into what is happening.

I've applied for machine permissions on the compile farm, can you give 
me the way to compile and run tests on PPC64BE machine? I'll take a look 
at it too, thanks a lot.

> I'll also note I have done no performance testing yet until I have an
> idea of what the testsuite failures are.  I think a patch like this that
> can affect the performance of all architectures needs some performance
> testing to ensure we don't have unintended performance degradations.
> I'll have someone on my team kick off some builds once I have a handle
> on the testsuite FAILs.

This is really great, thanks for helping to test the performance.
Peter Bergner Nov. 15, 2023, 3:33 a.m. UTC | #8
On 11/14/23 9:12 PM, Lehua Ding wrote:
> I've applied for machine permissions on the compile farm, can you give
> me the way to compile and run tests on PPC64BE machine? I'll take a look
> at it too, thanks a lot.

That's an old system, with too old system libgmp, etc.  Let me attempt a
build there so I can give you correct build directions for that system.

That said, unfortunately, that system is currently almost out of available
disk space:

  [bergner@gcc1-power7 ~]$ df -h
  Filesystem      Size  Used Avail Use% Mounted on
  ...
  /dev/md4        1.6T  1.6T  9.0G 100% /home

Segher, can you please send out an admin note for people to clean up
unneeded space on cfarm110?  Thanks.

Peter