
Fix PR69274, 435.gromacs performance regression due to RA

Message ID alpine.LSU.2.11.1602051335390.31122@t29.fhfr.qr
State New

Commit Message

Richard Biener Feb. 5, 2016, 12:42 p.m. UTC
On Fri, 5 Feb 2016, Bernd Schmidt wrote:

> On 02/05/2016 01:10 PM, Richard Biener wrote:
> > It fails
> > 
> > FAIL: gcc.target/i386/addr-sel-1.c scan-assembler b\\\\+1
> > 
> > on i?86 (or x86_64 -m32) though, generating
> > 
> > f:
> > .LFB0:
> >          .cfi_startproc
> >          movl    4(%esp), %eax
> >          leal    1(%eax), %edx
> >          movsbl  a+1(%eax), %eax
> >          movsbl  b(%edx), %edx
> >          addl    %edx, %eax
> >          ret
> 
> Well, it looks like the first movsbl load clobbers the potentially better base
> register, so trivial propagation doesn't work.
> 
> It might be another case where allowing 2->2 in combine would help. Or
> enabling -frename-registers and rerunning reload_combine afterwards.

Before postreload we have

(insn 6 21 8 2 (parallel [
            (set (reg:SI 1 dx [orig:87 _2 ] [87])
                (plus:SI (reg:SI 0 ax [99])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ])
(insn 8 6 10 2 (set (reg:SI 0 ax [96])
        (sign_extend:SI (mem/j:QI (plus:SI (reg:SI 1 dx [orig:87 _2 ] 
[87])
                    (symbol_ref:SI ("a") [flags 0x2]  <var_decl 
0x7faa46f94cf0 a>)) [0 a S1 A8])))
(insn 10 8 11 2 (set (reg:SI 1 dx [98])
        (sign_extend:SI (mem/j:QI (plus:SI (reg:SI 1 dx [orig:87 _2 ] 
[87])
                    (symbol_ref:SI ("b") [flags 0x2]  <var_decl 
0x7faa46f94d80 b>)) [0 b S1 A8])))

so indeed the issue is not dx dying in insn 10 but ax dying in insn 8...

Maybe LRA can prefer to not do that if enough free registers are
available (that is, never re-use a register)?
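
For reference, the failing test exercises roughly the following
pattern (a sketch of the shape only, not the exact testcase source;
the array type and size are guesses, what matters is the
a[i + 1] + b[i + 1] form):

/* Ideally both loads share the incoming argument as base register and
   fold the +1 into the displacement, i.e. "movsbl a+1(%eax)" and
   "movsbl b+1(%eax)".  In the failing code above the first load
   clobbers %eax, so the second load has to go through the separately
   computed i+1 in %edx instead.  */
char a[32], b[32];

int
f (int i)
{
  return a[i + 1] + b[i + 1];
}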

With the above observation it seems less likely we can fix this
regression.  Should I continue looking for a less invasive fix, for
example by also computing 'commutative' as it was before Andreas'
patch, and swapping back only where the two disagree (that is, where
Andreas' patch introduced extra swapping in recog_data)?

The variant below fixes the testcase but I have to check if it
also fixes the 416.gamess regression (I would guess so).

The patch is of course quite arbitrary, basically reverting the
operand-swapping part of Andreas' patch.

Thanks,
Richard.

Comments

Bernd Schmidt Feb. 5, 2016, 1:01 p.m. UTC | #1
On 02/05/2016 01:42 PM, Richard Biener wrote:
> so indeed the issue is not dx dying in insn 10 but ax dying in insn 8...
>
> Maybe LRA can prefer to not do that if enough free registers are
> available (that is, never re-use a register)?

Maybe, but at this stage that will probably also have some unpredictable 
random effects. Essentially it sounds like the gromacs regression was 
one of these where we just got unlucky.
It might help to know exactly how the gromacs slowdown occurred, in case 
there's another way to fix it. Maybe you can add a dbgcnt to your patch 
to help pinpoint the area.
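
For the record, a debug-counter guard in GCC typically looks roughly
like the following; the counter name is made up here and where exactly
the guard would go depends on the patch:

  /* gcc/dbgcnt.def: declare a new counter.  */
  DEBUG_COUNTER (ira_swap_back)

  /* At the site the patch changes (gcc/ira.c), include "dbgcnt.h" and
     apply the new behaviour only while the counter has not run out, so
     that -fdbg-cnt=ira_swap_back:N can bisect which insn matters.  */
  if (dbg_cnt (ira_swap_back))
    std::swap (recog_data.operand[commutative],
	       recog_data.operand[commutative + 1]);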

> With the above observation it seems less likely we can fix this
> regression.  Should I continue looking for a less invasive fix, for
> example by also computing 'commutative' as it was before Andreas'
> patch, and swapping back only where the two disagree (that is, where
> Andreas' patch introduced extra swapping in recog_data)?

I wouldn't really worry about it all that much, but I also think it 
would be good to know more precisely what went wrong for gromacs.


Bernd
Richard Biener Feb. 5, 2016, 1:21 p.m. UTC | #2
On Fri, 5 Feb 2016, Bernd Schmidt wrote:

> On 02/05/2016 01:42 PM, Richard Biener wrote:
> > so indeed the issue is not dx dying in insn 10 but ax dying in insn 8...
> > 
> > Maybe LRA can prefer to not do that if enough free registers are
> > available (that is, never re-use a register)?
> 
> Maybe, but at this stage that will probably also have some unpredictable
> random effects. Essentially it sounds like the gromacs regression was one of
> these where we just got unlucky.
> It might help to know exactly how the gromacs slowdown occurred, in case
> there's another way to fix it. Maybe you can add a dbgcnt to your patch to
> help pinpoint the area.
> 
> > With the above observation it seems less likely we can fix this
> > regression.  Should I continue looking for a less invasive fix, for
> > example by also computing 'commutative' as it was before Andreas'
> > patch, and swapping back only where the two disagree (that is, where
> > Andreas' patch introduced extra swapping in recog_data)?
> 
> I wouldn't really worry about it all that much, but I also think it would be
> good to know more precisely what went wrong for gromacs.

Well, mostly noise.  For the hot function I attached the assembly
for -Ofast -march=haswell -fno-schedule-insns2 without the fix
and with the minimal (ugly) fix.  From a diff I can only spot
very few real changes; we even seem to do one more spill with the fix.

The vectorized loop body is unfortunately very large, but as the
issue reproduces on both AMD and Intel machines I doubt we are
hitting some CPU-specific hazard.

There's really nothing obvious, which is why I concentrated on
understanding why Andreas' patch made a difference and then just did
something that makes sense and happens to fix the issue.

I know that reverting Andreas' change would also fix the issue, but
his patch makes perfect sense.

Richard.
.file	"innerf.f"
	.text
	.p2align 4,,15
	.globl	inl1130_
	.type	inl1130_, @function
inl1130_:
.LFB0:
	.cfi_startproc
	pushq	%r15
	.cfi_def_cfa_offset 16
	.cfi_offset 15, -16
	pushq	%r14
	.cfi_def_cfa_offset 24
	.cfi_offset 14, -24
	pushq	%r13
	.cfi_def_cfa_offset 32
	.cfi_offset 13, -32
	pushq	%r12
	.cfi_def_cfa_offset 40
	.cfi_offset 12, -40
	pushq	%rbp
	.cfi_def_cfa_offset 48
	.cfi_offset 6, -48
	pushq	%rbx
	.cfi_def_cfa_offset 56
	.cfi_offset 3, -56
	subq	$160, %rsp
	.cfi_def_cfa_offset 216
	movq	%rsi, 120(%rsp)
	movq	%rdx, 128(%rsp)
	movq	%rcx, 144(%rsp)
	movq	%r8, 136(%rsp)
	movq	%r9, 152(%rsp)
	movq	232(%rsp), %r9
	movq	240(%rsp), %r10
	movslq	(%rsi), %rdx
	movq	%rdx, %rax
	movq	248(%rsp), %rcx
	leaq	(%rcx,%rdx,4), %rdx
	vmovss	(%rdx), %xmm3
	vmovss	4(%rdx), %xmm0
	movq	256(%rsp), %rdx
	vmovss	(%rdx), %xmm1
	vmulss	%xmm1, %xmm3, %xmm2
	vmulss	%xmm2, %xmm3, %xmm6
	vmovss	%xmm6, -4(%rsp)
	vmulss	%xmm2, %xmm0, %xmm6
	vmovss	%xmm6, -116(%rsp)
	vmulss	%xmm0, %xmm0, %xmm0
	vmulss	%xmm1, %xmm0, %xmm6
	vmovss	%xmm6, -104(%rsp)
	addl	$1, %eax
	cltq
	movq	272(%rsp), %rdx
	movl	-4(%rdx,%rax,4), %edx
	movq	280(%rsp), %rax
	movl	(%rax), %eax
	addl	%eax, %eax
	imull	%edx, %eax
	leal	(%rax,%rdx,2), %eax
	movl	(%rdi), %ebx
	movl	%ebx, 116(%rsp)
	testl	%ebx, %ebx
	jle	.L9
	cltq
	movq	288(%rsp), %rdx
	leaq	(%rdx,%rax,4), %rax
	vmovss	(%rax), %xmm6
	vmovss	%xmm6, 100(%rsp)
	vmovss	4(%rax), %xmm6
	vmovss	%xmm6, 112(%rsp)
	xorl	%r8d, %r8d
	movq	%r8, %rbp
	movq	%r10, %r11
	movq	%r9, %r10
	.p2align 4,,10
	.p2align 3
.L6:
	movq	136(%rsp), %rax
	movl	(%rax,%rbp,4), %eax
	leal	(%rax,%rax,2), %r8d
	leal	3(%r8), %eax
	movslq	%eax, %r9
	leaq	-1(%r9), %r13
	movq	120(%rsp), %rax
	movl	(%rax,%rbp,4), %eax
	leal	3(%rax,%rax,2), %esi
	movq	128(%rsp), %rax
	movl	(%rax,%rbp,4), %ecx
	leal	1(%rcx), %edx
	movl	4(%rax,%rbp,4), %eax
	movslq	%esi, %r12
	leaq	-1(%r12), %r15
	leal	3(%rsi), %edi
	movslq	%edi, %rbx
	leaq	-1(%rbx), %r14
	addl	$6, %esi
	movslq	%esi, %rdi
	leaq	-1(%rdi), %rsi
	movq	%rsi, -112(%rsp)
	cmpl	%eax, %edx
	jg	.L7
	movslq	%r8d, %r8
	movq	%r8, 104(%rsp)
	movq	152(%rsp), %rsi
	leaq	(%rsi,%r8,4), %r8
	vmovss	(%r8), %xmm2
	vmovss	4(%r8), %xmm1
	vmovss	(%rsi,%r13,4), %xmm0
	leaq	(%r10,%r12,4), %r13
	vaddss	-12(%r13), %xmm2, %xmm6
	vmovss	%xmm6, 80(%rsp)
	vaddss	-8(%r13), %xmm1, %xmm6
	vmovss	%xmm6, -100(%rsp)
	vaddss	(%r10,%r15,4), %xmm0, %xmm6
	vmovss	%xmm6, 84(%rsp)
	vaddss	0(%r13), %xmm2, %xmm6
	vmovss	%xmm6, -96(%rsp)
	vaddss	4(%r13), %xmm1, %xmm6
	vmovss	%xmm6, -92(%rsp)
	vaddss	(%r10,%r14,4), %xmm0, %xmm6
	vmovss	%xmm6, -8(%rsp)
	vaddss	12(%r13), %xmm2, %xmm6
	vmovss	%xmm6, 88(%rsp)
	vaddss	16(%r13), %xmm1, %xmm6
	vmovss	%xmm6, 92(%rsp)
	leaq	-1(%rdi), %rsi
	vaddss	(%r10,%rsi,4), %xmm0, %xmm6
	vmovss	%xmm6, 96(%rsp)
	movslq	%edx, %rdx
	movq	144(%rsp), %rsi
	leaq	-4(%rsi,%rdx,4), %r8
	subl	$1, %eax
	subl	%ecx, %eax
	addq	%rdx, %rax
	leaq	(%rsi,%rax,4), %r15
	movl	$0x00000000, -12(%rsp)
	movl	$0x00000000, -16(%rsp)
	movl	$0x00000000, -20(%rsp)
	movl	$0x00000000, -32(%rsp)
	movl	$0x00000000, -36(%rsp)
	movl	$0x00000000, -24(%rsp)
	movl	$0x00000000, -40(%rsp)
	movl	$0x00000000, -44(%rsp)
	movl	$0x00000000, -28(%rsp)
	movl	$0x00000000, -48(%rsp)
	movl	$0x00000000, -52(%rsp)
	movq	%rbp, %r14
	.p2align 4,,10
	.p2align 3
.L4:
	movl	(%r8), %eax
	leal	3(%rax,%rax,2), %ecx
	movslq	%ecx, %rsi
	leaq	0(,%rsi,4), %rax
	leaq	(%r10,%rax), %rdx
	vmovss	-12(%rdx), %xmm15
	vmovss	-8(%rdx), %xmm3
	vmovss	-4(%r10,%rsi,4), %xmm14
	vmovss	(%rdx), %xmm7
	vmovss	4(%rdx), %xmm6
	leal	3(%rcx), %ebp
	movslq	%ebp, %rbp
	vmovss	-4(%r10,%rbp,4), %xmm1
	vmovss	12(%rdx), %xmm13
	vmovss	16(%rdx), %xmm12
	addl	$6, %ecx
	movslq	%ecx, %rdx
	vmovss	-4(%r10,%rdx,4), %xmm11
	vmovss	80(%rsp), %xmm2
	vsubss	%xmm15, %xmm2, %xmm5
	vmovss	%xmm5, (%rsp)
	vmovss	-100(%rsp), %xmm4
	vsubss	%xmm3, %xmm4, %xmm9
	vmovss	84(%rsp), %xmm10
	vsubss	%xmm14, %xmm10, %xmm8
	vsubss	%xmm7, %xmm2, %xmm5
	vsubss	%xmm6, %xmm4, %xmm4
	vmovaps	%xmm4, %xmm0
	vmovss	%xmm1, -88(%rsp)
	vsubss	%xmm1, %xmm10, %xmm4
	vmovss	%xmm0, 8(%rsp)
	vmulss	%xmm0, %xmm0, %xmm0
	vmovss	%xmm5, 4(%rsp)
	vfmadd231ss	%xmm5, %xmm5, %xmm0
	vmovss	%xmm4, 12(%rsp)
	vmovaps	%xmm4, %xmm5
	vfmadd132ss	%xmm4, %xmm0, %xmm5
	vmovss	%xmm5, 16(%rsp)
	vsubss	%xmm13, %xmm2, %xmm1
	vmovss	-100(%rsp), %xmm2
	vsubss	%xmm12, %xmm2, %xmm0
	vsubss	%xmm11, %xmm10, %xmm2
	vmovaps	%xmm2, %xmm10
	vmovss	%xmm0, 24(%rsp)
	vmulss	%xmm0, %xmm0, %xmm2
	vmovss	%xmm1, 20(%rsp)
	vfmadd231ss	%xmm1, %xmm1, %xmm2
	vmovss	%xmm10, 28(%rsp)
	vfmadd231ss	%xmm10, %xmm10, %xmm2
	vmovss	%xmm2, 32(%rsp)
	vmovss	-96(%rsp), %xmm5
	vsubss	%xmm15, %xmm5, %xmm4
	vmovss	-92(%rsp), %xmm0
	vsubss	%xmm3, %xmm0, %xmm2
	vmovss	-8(%rsp), %xmm10
	vsubss	%xmm14, %xmm10, %xmm10
	vmovss	%xmm2, -80(%rsp)
	vmulss	%xmm2, %xmm2, %xmm2
	vmovss	%xmm4, -84(%rsp)
	vfmadd231ss	%xmm4, %xmm4, %xmm2
	vmovss	%xmm10, -76(%rsp)
	vfmadd231ss	%xmm10, %xmm10, %xmm2
	vsubss	%xmm7, %xmm5, %xmm1
	vsubss	%xmm6, %xmm0, %xmm0
	vmovss	-8(%rsp), %xmm10
	vsubss	-88(%rsp), %xmm10, %xmm5
	vmovaps	%xmm5, %xmm4
	vmovss	%xmm0, 40(%rsp)
	vmulss	%xmm0, %xmm0, %xmm5
	vmovss	%xmm1, 36(%rsp)
	vfmadd231ss	%xmm1, %xmm1, %xmm5
	vmovss	%xmm4, -72(%rsp)
	vfmadd231ss	%xmm4, %xmm4, %xmm5
	vmovss	-96(%rsp), %xmm4
	vsubss	%xmm13, %xmm4, %xmm4
	vmovaps	%xmm4, %xmm0
	vmovss	-92(%rsp), %xmm4
	vsubss	%xmm12, %xmm4, %xmm4
	vsubss	%xmm11, %xmm10, %xmm1
	vmovss	%xmm4, -64(%rsp)
	vmulss	%xmm4, %xmm4, %xmm4
	vmovss	%xmm0, -68(%rsp)
	vfmadd231ss	%xmm0, %xmm0, %xmm4
	vmovss	%xmm1, 44(%rsp)
	vfmadd231ss	%xmm1, %xmm1, %xmm4
	vmovss	88(%rsp), %xmm10
	vsubss	%xmm15, %xmm10, %xmm15
	vmovss	92(%rsp), %xmm0
	vsubss	%xmm3, %xmm0, %xmm3
	vmovss	96(%rsp), %xmm1
	vsubss	%xmm14, %xmm1, %xmm14
	vmovss	%xmm14, -112(%rsp)
	vmovss	%xmm3, -56(%rsp)
	vmulss	%xmm3, %xmm3, %xmm14
	vmovss	%xmm15, -60(%rsp)
	vfmadd231ss	%xmm15, %xmm15, %xmm14
	vmovss	-112(%rsp), %xmm3
	vfmadd132ss	%xmm3, %xmm14, %xmm3
	vmovaps	%xmm3, %xmm14
	vsubss	%xmm7, %xmm10, %xmm7
	vmovaps	%xmm7, %xmm15
	vsubss	%xmm6, %xmm0, %xmm6
	vmovaps	%xmm6, %xmm7
	vsubss	-88(%rsp), %xmm1, %xmm6
	vmovss	%xmm7, 52(%rsp)
	vmulss	%xmm7, %xmm7, %xmm7
	vmovss	%xmm15, 48(%rsp)
	vfmadd231ss	%xmm15, %xmm15, %xmm7
	vmovss	%xmm6, -88(%rsp)
	vfmadd231ss	%xmm6, %xmm6, %xmm7
	vsubss	%xmm13, %xmm10, %xmm13
	vsubss	%xmm12, %xmm0, %xmm0
	vsubss	%xmm11, %xmm1, %xmm11
	vmovaps	%xmm11, %xmm12
	vmovss	%xmm0, 60(%rsp)
	vmulss	%xmm0, %xmm0, %xmm11
	vmovss	%xmm13, 56(%rsp)
	vfmadd231ss	%xmm13, %xmm13, %xmm11
	vmovss	%xmm12, 64(%rsp)
	vfmadd231ss	%xmm12, %xmm12, %xmm11
	vmulss	%xmm9, %xmm9, %xmm1
	vmovss	(%rsp), %xmm10
	vfmadd231ss	%xmm10, %xmm10, %xmm1
	vfmadd231ss	%xmm8, %xmm8, %xmm1
	vrsqrtss	%xmm1, %xmm3, %xmm3
	vmulss	%xmm1, %xmm3, %xmm1
	vmulss	%xmm3, %xmm1, %xmm1
	vaddss	.LC1(%rip), %xmm1, %xmm1
	vmulss	.LC2(%rip), %xmm3, %xmm3
	vmulss	%xmm3, %xmm1, %xmm6
	vmovaps	%xmm6, %xmm13
	vrsqrtss	%xmm2, %xmm1, %xmm1
	vmulss	%xmm2, %xmm1, %xmm6
	vmulss	%xmm1, %xmm6, %xmm6
	vaddss	.LC1(%rip), %xmm6, %xmm6
	vmulss	.LC2(%rip), %xmm1, %xmm1
	vmulss	%xmm1, %xmm6, %xmm6
	vrsqrtss	%xmm14, %xmm1, %xmm1
	vmulss	%xmm14, %xmm1, %xmm15
	vmulss	%xmm1, %xmm15, %xmm15
	vaddss	.LC1(%rip), %xmm15, %xmm15
	vmulss	.LC2(%rip), %xmm1, %xmm1
	vmulss	%xmm1, %xmm15, %xmm15
	vmovss	16(%rsp), %xmm3
	vrsqrtss	%xmm3, %xmm1, %xmm1
	vmulss	%xmm3, %xmm1, %xmm3
	vmulss	%xmm1, %xmm3, %xmm3
	vaddss	.LC1(%rip), %xmm3, %xmm3
	vmulss	.LC2(%rip), %xmm1, %xmm1
	vmulss	%xmm1, %xmm3, %xmm3
	vrsqrtss	%xmm5, %xmm0, %xmm0
	vmulss	%xmm5, %xmm0, %xmm2
	vmulss	%xmm0, %xmm2, %xmm2
	vaddss	.LC1(%rip), %xmm2, %xmm2
	vmulss	.LC2(%rip), %xmm0, %xmm0
	vmulss	%xmm0, %xmm2, %xmm2
	vrsqrtss	%xmm7, %xmm0, %xmm0
	vmulss	%xmm7, %xmm0, %xmm1
	vmulss	%xmm0, %xmm1, %xmm1
	vaddss	.LC1(%rip), %xmm1, %xmm1
	vmulss	.LC2(%rip), %xmm0, %xmm0
	vmulss	%xmm0, %xmm1, %xmm1
	vmovss	32(%rsp), %xmm7
	vrsqrtss	%xmm7, %xmm0, %xmm0
	vmulss	%xmm7, %xmm0, %xmm5
	vmulss	%xmm0, %xmm5, %xmm5
	vaddss	.LC1(%rip), %xmm5, %xmm5
	vmulss	.LC2(%rip), %xmm0, %xmm0
	vmulss	%xmm0, %xmm5, %xmm5
	vrsqrtss	%xmm4, %xmm7, %xmm7
	vmulss	%xmm4, %xmm7, %xmm0
	vmulss	%xmm7, %xmm0, %xmm0
	vaddss	.LC1(%rip), %xmm0, %xmm0
	vmulss	.LC2(%rip), %xmm7, %xmm7
	vmulss	%xmm7, %xmm0, %xmm0
	vrsqrtss	%xmm11, %xmm4, %xmm4
	vmulss	%xmm11, %xmm4, %xmm11
	vmulss	%xmm4, %xmm11, %xmm11
	vaddss	.LC1(%rip), %xmm11, %xmm11
	vmulss	.LC2(%rip), %xmm4, %xmm4
	vmulss	%xmm4, %xmm11, %xmm14
	vmovss	%xmm14, 16(%rsp)
	vmulss	%xmm13, %xmm13, %xmm12
	vmulss	%xmm13, %xmm12, %xmm7
	vmulss	%xmm7, %xmm7, %xmm7
	vmulss	100(%rsp), %xmm7, %xmm11
	vmulss	%xmm7, %xmm7, %xmm7
	vmulss	112(%rsp), %xmm7, %xmm7
	vsubss	%xmm11, %xmm7, %xmm4
	vaddss	-12(%rsp), %xmm4, %xmm14
	vmovss	%xmm14, -12(%rsp)
	vmulss	.LC3(%rip), %xmm11, %xmm11
	vfmsub231ss	.LC4(%rip), %xmm7, %xmm11
	vmovss	-4(%rsp), %xmm14
	vmovss	%xmm13, (%rsp)
	vfmadd132ss	%xmm13, %xmm11, %xmm14
	vmulss	%xmm12, %xmm14, %xmm4
	vmovss	-52(%rsp), %xmm13
	vmovaps	%xmm10, %xmm11
	vfmadd231ss	%xmm10, %xmm4, %xmm13
	vmovss	-44(%rsp), %xmm12
	vfmadd231ss	%xmm4, %xmm9, %xmm12
	vmovss	-36(%rsp), %xmm7
	vfmadd231ss	%xmm4, %xmm8, %xmm7
	addq	%r11, %rax
	vfnmadd213ss	-12(%rax), %xmm4, %xmm11
	vmovss	%xmm11, 32(%rsp)
	vfnmadd213ss	-8(%rax), %xmm4, %xmm9
	vmovss	%xmm9, 68(%rsp)
	vmovaps	%xmm8, %xmm9
	vfnmadd213ss	-4(%rax), %xmm4, %xmm9
	vmovss	%xmm9, 72(%rsp)
	vmovss	-116(%rsp), %xmm8
	vmulss	%xmm8, %xmm3, %xmm9
	vmulss	%xmm3, %xmm3, %xmm3
	vmovss	%xmm9, 76(%rsp)
	vmulss	%xmm9, %xmm3, %xmm3
	vmovss	4(%rsp), %xmm14
	vmovaps	%xmm14, %xmm4
	vfnmadd213ss	(%rax), %xmm3, %xmm4
	vmovd	%xmm4, %esi
	vmovss	8(%rsp), %xmm10
	vmovaps	%xmm10, %xmm9
	vfnmadd213ss	4(%rax), %xmm3, %xmm9
	vmovd	%xmm9, %ecx
	leaq	(%r11,%rbp,4), %r13
	vmovss	12(%rsp), %xmm11
	vmovaps	%xmm11, %xmm9
	vfnmadd213ss	-4(%r13), %xmm3, %xmm9
	vmovss	%xmm9, 4(%rsp)
	vmulss	%xmm8, %xmm5, %xmm4
	vmulss	%xmm5, %xmm5, %xmm5
	vmulss	%xmm4, %xmm5, %xmm5
	vmulss	20(%rsp), %xmm5, %xmm9
	vmulss	24(%rsp), %xmm5, %xmm8
	vmulss	28(%rsp), %xmm5, %xmm5
	vfmadd132ss	%xmm3, %xmm9, %xmm14
	vaddss	%xmm13, %xmm14, %xmm13
	vmovss	%xmm13, -52(%rsp)
	vmovaps	%xmm10, %xmm13
	vfmadd132ss	%xmm3, %xmm8, %xmm13
	vaddss	%xmm12, %xmm13, %xmm13
	vmovss	%xmm13, -44(%rsp)
	vfmadd132ss	%xmm11, %xmm5, %xmm3
	vaddss	%xmm7, %xmm3, %xmm13
	vmovss	%xmm13, -36(%rsp)
	vmovss	12(%rax), %xmm3
	vsubss	%xmm9, %xmm3, %xmm13
	vmovss	%xmm13, 8(%rsp)
	vmovss	16(%rax), %xmm3
	vsubss	%xmm8, %xmm3, %xmm12
	vmovss	%xmm12, 12(%rsp)
	leaq	(%r11,%rdx,4), %rbp
	vmovss	-4(%rbp), %xmm3
	vsubss	%xmm5, %xmm3, %xmm5
	vmovd	%xmm5, %edx
	vmulss	-116(%rsp), %xmm6, %xmm14
	vmulss	%xmm6, %xmm6, %xmm6
	vmulss	%xmm14, %xmm6, %xmm6
	vmovss	-32(%rsp), %xmm5
	vfmadd231ss	-76(%rsp), %xmm6, %xmm5
	vmovaps	%xmm5, %xmm3
	vmovss	-104(%rsp), %xmm13
	vmulss	%xmm13, %xmm2, %xmm9
	vmulss	%xmm2, %xmm2, %xmm2
	vmulss	%xmm9, %xmm2, %xmm2
	vmulss	36(%rsp), %xmm2, %xmm11
	vmulss	40(%rsp), %xmm2, %xmm10
	vmulss	%xmm13, %xmm0, %xmm13
	vmulss	%xmm0, %xmm0, %xmm0
	vmulss	%xmm13, %xmm0, %xmm0
	vmovss	-48(%rsp), %xmm12
	vfmadd231ss	-68(%rsp), %xmm0, %xmm12
	vmovaps	%xmm12, %xmm8
	vmovss	-40(%rsp), %xmm5
	vfmadd231ss	-64(%rsp), %xmm0, %xmm5
	vmovaps	%xmm5, %xmm7
	vmulss	44(%rsp), %xmm0, %xmm5
	vmovss	-84(%rsp), %xmm12
	vfmadd132ss	%xmm6, %xmm11, %xmm12
	vaddss	%xmm8, %xmm12, %xmm12
	vmovss	%xmm12, -48(%rsp)
	vmovss	-80(%rsp), %xmm8
	vfmadd132ss	%xmm6, %xmm10, %xmm8
	vaddss	%xmm7, %xmm8, %xmm12
	vmovss	%xmm12, -40(%rsp)
	vmovss	-72(%rsp), %xmm12
	vfmadd132ss	%xmm2, %xmm5, %xmm12
	vaddss	%xmm3, %xmm12, %xmm3
	vmovss	%xmm3, -32(%rsp)
	vmulss	-116(%rsp), %xmm15, %xmm12
	vmulss	%xmm15, %xmm15, %xmm15
	vmulss	%xmm12, %xmm15, %xmm3
	vmovss	-20(%rsp), %xmm8
	vmovss	-112(%rsp), %xmm15
	vfmadd231ss	%xmm15, %xmm3, %xmm8
	vmovss	%xmm8, -112(%rsp)
	vmovss	-84(%rsp), %xmm7
	vfnmadd213ss	32(%rsp), %xmm6, %xmm7
	vfnmadd231ss	-60(%rsp), %xmm3, %xmm7
	vmovss	%xmm7, -12(%rax)
	vmovss	-80(%rsp), %xmm7
	vfnmadd213ss	68(%rsp), %xmm6, %xmm7
	vfnmadd231ss	-56(%rsp), %xmm3, %xmm7
	vmovss	%xmm7, -8(%rax)
	vmovss	-76(%rsp), %xmm8
	vfnmadd213ss	72(%rsp), %xmm8, %xmm6
	vfnmadd231ss	%xmm15, %xmm3, %xmm6
	vmovss	%xmm6, -4(%rax)
	vmulss	-104(%rsp), %xmm1, %xmm8
	vmulss	%xmm1, %xmm1, %xmm1
	vmulss	%xmm8, %xmm1, %xmm1
	vmulss	48(%rsp), %xmm1, %xmm7
	vmulss	52(%rsp), %xmm1, %xmm6
	vmovd	%esi, %xmm15
	vsubss	%xmm11, %xmm15, %xmm11
	vsubss	%xmm7, %xmm11, %xmm11
	vmovss	%xmm11, (%rax)
	vmovd	%ecx, %xmm15
	vsubss	%xmm10, %xmm15, %xmm10
	vsubss	%xmm6, %xmm10, %xmm10
	vmovss	%xmm10, 4(%rax)
	vmovss	4(%rsp), %xmm10
	vfnmadd132ss	-72(%rsp), %xmm10, %xmm2
	vfnmadd231ss	-88(%rsp), %xmm1, %xmm2
	vmovss	%xmm2, -4(%r13)
	vmovss	-104(%rsp), %xmm2
	vmovss	16(%rsp), %xmm11
	vmulss	%xmm11, %xmm2, %xmm2
	vaddss	%xmm2, %xmm8, %xmm8
	vaddss	-16(%rsp), %xmm8, %xmm8
	vmulss	%xmm11, %xmm11, %xmm10
	vmulss	%xmm2, %xmm10, %xmm2
	vaddss	%xmm14, %xmm4, %xmm4
	vmovss	(%rsp), %xmm14
	vmovss	76(%rsp), %xmm10
	vfmadd132ss	-4(%rsp), %xmm10, %xmm14
	vaddss	%xmm14, %xmm4, %xmm4
	vaddss	%xmm13, %xmm9, %xmm9
	vaddss	%xmm12, %xmm9, %xmm9
	vaddss	%xmm9, %xmm4, %xmm4
	vaddss	%xmm8, %xmm4, %xmm4
	vmovss	%xmm4, -16(%rsp)
	vmovss	-28(%rsp), %xmm9
	vmovss	56(%rsp), %xmm14
	vfmadd231ss	%xmm14, %xmm2, %xmm9
	vmovss	-24(%rsp), %xmm4
	vmovss	60(%rsp), %xmm10
	vfmadd231ss	%xmm10, %xmm2, %xmm4
	vmulss	64(%rsp), %xmm2, %xmm11
	vfmadd231ss	-60(%rsp), %xmm3, %xmm7
	vaddss	%xmm9, %xmm7, %xmm7
	vmovss	%xmm7, -28(%rsp)
	vfmadd231ss	-56(%rsp), %xmm3, %xmm6
	vaddss	%xmm4, %xmm6, %xmm3
	vmovss	%xmm3, -24(%rsp)
	vfmadd132ss	-88(%rsp), %xmm11, %xmm1
	vaddss	-112(%rsp), %xmm1, %xmm3
	vmovss	%xmm3, -20(%rsp)
	vmovss	-68(%rsp), %xmm9
	vfnmadd213ss	8(%rsp), %xmm0, %xmm9
	vfnmadd231ss	%xmm14, %xmm2, %xmm9
	vmovss	%xmm9, 12(%rax)
	vmovss	-64(%rsp), %xmm12
	vfnmadd213ss	12(%rsp), %xmm12, %xmm0
	vfnmadd132ss	%xmm10, %xmm0, %xmm2
	vmovss	%xmm2, 16(%rax)
	vmovd	%edx, %xmm6
	vsubss	%xmm5, %xmm6, %xmm5
	vsubss	%xmm11, %xmm5, %xmm11
	vmovss	%xmm11, -4(%rbp)
	addq	$4, %r8
	cmpq	%r15, %r8
	jne	.L4
	movq	%r14, %rbp
	vmovss	-52(%rsp), %xmm6
	vaddss	-48(%rsp), %xmm6, %xmm8
	vmovss	-44(%rsp), %xmm6
	vaddss	-40(%rsp), %xmm6, %xmm2
	vmovss	-36(%rsp), %xmm6
	vaddss	-32(%rsp), %xmm6, %xmm0
	vmovss	-52(%rsp), %xmm6
.L3:
	leaq	(%r11,%r12,4), %rax
	vaddss	-12(%rax), %xmm6, %xmm12
	vmovss	%xmm12, -12(%rax)
	vmovss	-44(%rsp), %xmm6
	vaddss	-8(%rax), %xmm6, %xmm11
	vmovss	%xmm11, -8(%rax)
	vmovss	-36(%rsp), %xmm6
	vaddss	-4(%rax), %xmm6, %xmm5
	vmovss	%xmm5, -4(%rax)
	vmovss	-48(%rsp), %xmm6
	vaddss	(%rax), %xmm6, %xmm11
	vmovss	%xmm11, (%rax)
	vmovss	-40(%rsp), %xmm6
	vaddss	4(%rax), %xmm6, %xmm11
	vmovss	%xmm11, 4(%rax)
	leaq	(%r11,%rbx,4), %rdx
	vmovss	-32(%rsp), %xmm6
	vaddss	-4(%rdx), %xmm6, %xmm11
	vmovss	%xmm11, -4(%rdx)
	vmovss	-28(%rsp), %xmm6
	vaddss	12(%rax), %xmm6, %xmm1
	vmovss	%xmm1, 12(%rax)
	vmovss	-24(%rsp), %xmm5
	vaddss	16(%rax), %xmm5, %xmm1
	vmovss	%xmm1, 16(%rax)
	leaq	(%r11,%rdi,4), %rax
	vaddss	-4(%rax), %xmm3, %xmm1
	vmovss	%xmm1, -4(%rax)
	movq	216(%rsp), %rax
	movq	104(%rsp), %rbx
	leaq	(%rax,%rbx,4), %rax
	vaddss	%xmm6, %xmm8, %xmm8
	vaddss	(%rax), %xmm8, %xmm8
	vmovss	%xmm8, (%rax)
	vaddss	%xmm5, %xmm2, %xmm2
	vaddss	4(%rax), %xmm2, %xmm2
	vmovss	%xmm2, 4(%rax)
	movq	216(%rsp), %rax
	leaq	(%rax,%r9,4), %rax
	vaddss	%xmm3, %xmm0, %xmm6
	vaddss	-4(%rax), %xmm6, %xmm6
	vmovss	%xmm6, -4(%rax)
	movq	224(%rsp), %rax
	movslq	(%rax,%rbp,4), %rax
	salq	$2, %rax
	movq	%rax, %rdx
	addq	264(%rsp), %rdx
	vmovss	-16(%rsp), %xmm6
	vaddss	(%rdx), %xmm6, %xmm7
	vmovss	%xmm7, (%rdx)
	addq	296(%rsp), %rax
	vmovss	-12(%rsp), %xmm6
	vaddss	(%rax), %xmm6, %xmm0
	vmovss	%xmm0, (%rax)
	addq	$1, %rbp
	cmpl	%ebp, 116(%rsp)
	jne	.L6
.L9:
	addq	$160, %rsp
	.cfi_remember_state
	.cfi_def_cfa_offset 56
	popq	%rbx
	.cfi_def_cfa_offset 48
	popq	%rbp
	.cfi_def_cfa_offset 40
	popq	%r12
	.cfi_def_cfa_offset 32
	popq	%r13
	.cfi_def_cfa_offset 24
	popq	%r14
	.cfi_def_cfa_offset 16
	popq	%r15
	.cfi_def_cfa_offset 8
	ret
	.p2align 4,,10
	.p2align 3
.L7:
	.cfi_restore_state
	vxorps	%xmm0, %xmm0, %xmm0
	vmovaps	%xmm0, %xmm2
	vmovaps	%xmm0, %xmm8
	vmovss	%xmm0, -12(%rsp)
	vmovss	%xmm0, -16(%rsp)
	vmovss	%xmm0, -20(%rsp)
	vmovss	%xmm0, -32(%rsp)
	vmovss	%xmm0, -36(%rsp)
	vmovss	%xmm0, -24(%rsp)
	vmovss	%xmm0, -40(%rsp)
	vmovss	%xmm0, -44(%rsp)
	vmovss	%xmm0, -28(%rsp)
	vmovss	%xmm0, -48(%rsp)
	vmovss	%xmm0, -52(%rsp)
	movslq	%r8d, %rax
	movq	%rax, 104(%rsp)
	vmovaps	%xmm0, %xmm6
	vmovaps	%xmm0, %xmm3
	jmp	.L3
	.cfi_endproc
.LFE0:
	.size	inl1130_, .-inl1130_
	.section	.rodata.cst4,"aM",@progbits,4
	.align 4
.LC1:
	.long	3225419776
	.align 4
.LC2:
	.long	3204448256
	.align 4
.LC3:
	.long	1086324736
	.align 4
.LC4:
	.long	1094713344
	.ident	"GCC: (GNU) 6.0.0 20160205 (experimental) [trunk revision 233136]"
	.section	.note.GNU-stack,"",@progbits
.file	"innerf.f"
	.text
	.p2align 4,,15
	.globl	inl1130_
	.type	inl1130_, @function
inl1130_:
.LFB0:
	.cfi_startproc
	pushq	%r15
	.cfi_def_cfa_offset 16
	.cfi_offset 15, -16
	pushq	%r14
	.cfi_def_cfa_offset 24
	.cfi_offset 14, -24
	pushq	%r13
	.cfi_def_cfa_offset 32
	.cfi_offset 13, -32
	pushq	%r12
	.cfi_def_cfa_offset 40
	.cfi_offset 12, -40
	pushq	%rbp
	.cfi_def_cfa_offset 48
	.cfi_offset 6, -48
	pushq	%rbx
	.cfi_def_cfa_offset 56
	.cfi_offset 3, -56
	subq	$160, %rsp
	.cfi_def_cfa_offset 216
	movq	%rsi, 120(%rsp)
	movq	%rdx, 128(%rsp)
	movq	%rcx, 144(%rsp)
	movq	%r8, 136(%rsp)
	movq	%r9, 152(%rsp)
	movq	232(%rsp), %r9
	movq	240(%rsp), %r10
	movslq	(%rsi), %rdx
	movq	%rdx, %rax
	movq	248(%rsp), %rcx
	leaq	(%rcx,%rdx,4), %rdx
	vmovss	(%rdx), %xmm3
	vmovss	4(%rdx), %xmm0
	movq	256(%rsp), %rdx
	vmovss	(%rdx), %xmm1
	vmulss	%xmm1, %xmm3, %xmm2
	vmulss	%xmm2, %xmm3, %xmm6
	vmovss	%xmm6, -4(%rsp)
	vmulss	%xmm2, %xmm0, %xmm6
	vmovss	%xmm6, -108(%rsp)
	vmulss	%xmm0, %xmm0, %xmm0
	vmulss	%xmm1, %xmm0, %xmm6
	vmovss	%xmm6, -104(%rsp)
	addl	$1, %eax
	cltq
	movq	272(%rsp), %rdx
	movl	-4(%rdx,%rax,4), %edx
	movq	280(%rsp), %rax
	movl	(%rax), %eax
	addl	%eax, %eax
	imull	%edx, %eax
	leal	(%rax,%rdx,2), %eax
	movl	(%rdi), %edi
	movl	%edi, 116(%rsp)
	testl	%edi, %edi
	jle	.L9
	cltq
	movq	288(%rsp), %rdx
	leaq	(%rdx,%rax,4), %rax
	vmovss	(%rax), %xmm6
	vmovss	%xmm6, 100(%rsp)
	vmovss	4(%rax), %xmm5
	vmovss	%xmm5, 112(%rsp)
	xorl	%r8d, %r8d
	movq	%r8, %rbp
	movq	%r10, %r11
	movq	%r9, %r10
	.p2align 4,,10
	.p2align 3
.L6:
	movq	136(%rsp), %rax
	movl	(%rax,%rbp,4), %eax
	leal	(%rax,%rax,2), %r8d
	leal	3(%r8), %eax
	movslq	%eax, %r9
	leaq	-1(%r9), %r13
	movq	120(%rsp), %rax
	movl	(%rax,%rbp,4), %eax
	leal	3(%rax,%rax,2), %esi
	movq	128(%rsp), %rbx
	movl	(%rbx,%rbp,4), %ecx
	leal	1(%rcx), %eax
	movl	4(%rbx,%rbp,4), %edx
	movslq	%esi, %r12
	leaq	-1(%r12), %r15
	leal	3(%rsi), %edi
	movslq	%edi, %rbx
	leaq	-1(%rbx), %r14
	addl	$6, %esi
	movslq	%esi, %rdi
	leaq	-1(%rdi), %rsi
	movq	%rsi, -120(%rsp)
	cmpl	%edx, %eax
	jg	.L7
	movslq	%r8d, %r8
	movq	%r8, 104(%rsp)
	movq	152(%rsp), %rsi
	leaq	(%rsi,%r8,4), %r8
	vmovss	(%r8), %xmm2
	vmovss	4(%r8), %xmm1
	vmovss	(%rsi,%r13,4), %xmm0
	leaq	(%r10,%r12,4), %r13
	vaddss	-12(%r13), %xmm2, %xmm6
	vmovss	%xmm6, 76(%rsp)
	vaddss	-8(%r13), %xmm1, %xmm6
	vmovss	%xmm6, -88(%rsp)
	vaddss	(%r10,%r15,4), %xmm0, %xmm6
	vmovss	%xmm6, 80(%rsp)
	vaddss	0(%r13), %xmm2, %xmm6
	vmovss	%xmm6, -84(%rsp)
	vaddss	4(%r13), %xmm1, %xmm6
	vmovss	%xmm6, -80(%rsp)
	vaddss	(%r10,%r14,4), %xmm0, %xmm6
	vmovss	%xmm6, 84(%rsp)
	vaddss	12(%r13), %xmm2, %xmm6
	vmovss	%xmm6, 88(%rsp)
	vaddss	16(%r13), %xmm1, %xmm6
	vmovss	%xmm6, 92(%rsp)
	leaq	-1(%rdi), %rsi
	vaddss	(%r10,%rsi,4), %xmm0, %xmm6
	vmovss	%xmm6, 96(%rsp)
	cltq
	movq	144(%rsp), %rsi
	leaq	-4(%rsi,%rax,4), %r8
	subl	$1, %edx
	subl	%ecx, %edx
	addq	%rax, %rdx
	leaq	(%rsi,%rdx,4), %r15
	movl	$0x00000000, -8(%rsp)
	movl	$0x00000000, -12(%rsp)
	movl	$0x00000000, -16(%rsp)
	movl	$0x00000000, -28(%rsp)
	movl	$0x00000000, -32(%rsp)
	movl	$0x00000000, -20(%rsp)
	movl	$0x00000000, -36(%rsp)
	movl	$0x00000000, -40(%rsp)
	movl	$0x00000000, -24(%rsp)
	movl	$0x00000000, -44(%rsp)
	movl	$0x00000000, -48(%rsp)
	movq	%rbp, %r14
	.p2align 4,,10
	.p2align 3
.L4:
	movl	(%r8), %eax
	leal	3(%rax,%rax,2), %ecx
	movslq	%ecx, %rsi
	leaq	0(,%rsi,4), %rax
	leaq	(%r10,%rax), %rdx
	vmovss	-12(%rdx), %xmm15
	vmovss	-8(%rdx), %xmm3
	vmovss	-4(%r10,%rsi,4), %xmm14
	vmovss	(%rdx), %xmm7
	vmovss	4(%rdx), %xmm6
	leal	3(%rcx), %ebp
	movslq	%ebp, %rbp
	vmovss	-4(%r10,%rbp,4), %xmm0
	vmovss	%xmm0, -100(%rsp)
	vmovss	12(%rdx), %xmm13
	vmovss	16(%rdx), %xmm12
	addl	$6, %ecx
	movslq	%ecx, %rdx
	vmovss	-4(%r10,%rdx,4), %xmm11
	vmovss	76(%rsp), %xmm1
	vsubss	%xmm15, %xmm1, %xmm5
	vmovss	%xmm5, -96(%rsp)
	vmovss	-88(%rsp), %xmm4
	vsubss	%xmm3, %xmm4, %xmm9
	vmovss	80(%rsp), %xmm10
	vsubss	%xmm14, %xmm10, %xmm8
	vsubss	%xmm7, %xmm1, %xmm5
	vsubss	%xmm6, %xmm4, %xmm2
	vsubss	-100(%rsp), %xmm10, %xmm0
	vmovaps	%xmm0, %xmm4
	vmovss	%xmm2, 4(%rsp)
	vmulss	%xmm2, %xmm2, %xmm0
	vmovss	%xmm5, (%rsp)
	vfmadd231ss	%xmm5, %xmm5, %xmm0
	vmovss	%xmm4, 8(%rsp)
	vmovaps	%xmm4, %xmm5
	vfmadd132ss	%xmm4, %xmm0, %xmm5
	vmovss	%xmm5, 12(%rsp)
	vsubss	%xmm13, %xmm1, %xmm2
	vmovss	-88(%rsp), %xmm1
	vsubss	%xmm12, %xmm1, %xmm0
	vsubss	%xmm11, %xmm10, %xmm1
	vmovaps	%xmm1, %xmm10
	vmovss	%xmm0, 20(%rsp)
	vmulss	%xmm0, %xmm0, %xmm1
	vmovss	%xmm2, 16(%rsp)
	vfmadd231ss	%xmm2, %xmm2, %xmm1
	vmovss	%xmm10, 24(%rsp)
	vfmadd231ss	%xmm10, %xmm10, %xmm1
	vmovss	%xmm1, 28(%rsp)
	vmovss	-84(%rsp), %xmm0
	vsubss	%xmm15, %xmm0, %xmm4
	vmovss	-80(%rsp), %xmm2
	vsubss	%xmm3, %xmm2, %xmm1
	vmovss	84(%rsp), %xmm10
	vsubss	%xmm14, %xmm10, %xmm5
	vmovss	%xmm5, -120(%rsp)
	vmovss	%xmm1, -72(%rsp)
	vmulss	%xmm1, %xmm1, %xmm5
	vmovss	%xmm4, -76(%rsp)
	vfmadd231ss	%xmm4, %xmm4, %xmm5
	vmovss	-120(%rsp), %xmm1
	vfmadd132ss	%xmm1, %xmm5, %xmm1
	vmovaps	%xmm1, %xmm5
	vsubss	%xmm7, %xmm0, %xmm0
	vsubss	%xmm6, %xmm2, %xmm2
	vsubss	-100(%rsp), %xmm10, %xmm1
	vmovaps	%xmm1, %xmm4
	vmovss	%xmm2, 36(%rsp)
	vmulss	%xmm2, %xmm2, %xmm1
	vmovss	%xmm0, 32(%rsp)
	vfmadd231ss	%xmm0, %xmm0, %xmm1
	vmovss	%xmm4, -68(%rsp)
	vfmadd231ss	%xmm4, %xmm4, %xmm1
	vmovss	-84(%rsp), %xmm4
	vsubss	%xmm13, %xmm4, %xmm4
	vmovaps	%xmm4, %xmm0
	vmovss	-80(%rsp), %xmm4
	vsubss	%xmm12, %xmm4, %xmm4
	vmovaps	%xmm4, %xmm2
	vsubss	%xmm11, %xmm10, %xmm10
	vmovss	%xmm2, -60(%rsp)
	vmulss	%xmm2, %xmm2, %xmm4
	vmovss	%xmm0, -64(%rsp)
	vfmadd231ss	%xmm0, %xmm0, %xmm4
	vmovss	%xmm10, 40(%rsp)
	vfmadd231ss	%xmm10, %xmm10, %xmm4
	vmovss	88(%rsp), %xmm10
	vsubss	%xmm15, %xmm10, %xmm15
	vmovss	92(%rsp), %xmm0
	vsubss	%xmm3, %xmm0, %xmm3
	vmovss	96(%rsp), %xmm2
	vsubss	%xmm14, %xmm2, %xmm14
	vmovss	%xmm14, -92(%rsp)
	vmovss	%xmm3, -52(%rsp)
	vmulss	%xmm3, %xmm3, %xmm14
	vmovss	%xmm15, -56(%rsp)
	vfmadd231ss	%xmm15, %xmm15, %xmm14
	vmovss	-92(%rsp), %xmm3
	vfmadd132ss	%xmm3, %xmm14, %xmm3
	vmovaps	%xmm3, %xmm14
	vsubss	%xmm7, %xmm10, %xmm7
	vmovaps	%xmm7, %xmm15
	vsubss	%xmm6, %xmm0, %xmm6
	vmovaps	%xmm6, %xmm7
	vsubss	-100(%rsp), %xmm2, %xmm6
	vmovss	%xmm7, 48(%rsp)
	vmulss	%xmm7, %xmm7, %xmm7
	vmovss	%xmm15, 44(%rsp)
	vfmadd231ss	%xmm15, %xmm15, %xmm7
	vmovss	%xmm6, -100(%rsp)
	vfmadd231ss	%xmm6, %xmm6, %xmm7
	vsubss	%xmm13, %xmm10, %xmm13
	vsubss	%xmm12, %xmm0, %xmm0
	vsubss	%xmm11, %xmm2, %xmm12
	vmovss	%xmm0, 56(%rsp)
	vmulss	%xmm0, %xmm0, %xmm11
	vmovss	%xmm13, 52(%rsp)
	vfmadd231ss	%xmm13, %xmm13, %xmm11
	vmovss	%xmm12, 60(%rsp)
	vfmadd231ss	%xmm12, %xmm12, %xmm11
	vmulss	%xmm9, %xmm9, %xmm2
	vmovss	-96(%rsp), %xmm6
	vfmadd132ss	%xmm6, %xmm2, %xmm6
	vmovaps	%xmm6, %xmm2
	vfmadd231ss	%xmm8, %xmm8, %xmm2
	vrsqrtss	%xmm2, %xmm3, %xmm3
	vmulss	%xmm2, %xmm3, %xmm2
	vmulss	%xmm3, %xmm2, %xmm2
	vaddss	.LC1(%rip), %xmm2, %xmm2
	vmulss	.LC2(%rip), %xmm3, %xmm3
	vmulss	%xmm3, %xmm2, %xmm6
	vmovaps	%xmm6, %xmm13
	vrsqrtss	%xmm5, %xmm2, %xmm2
	vmulss	%xmm5, %xmm2, %xmm6
	vmulss	%xmm2, %xmm6, %xmm6
	vaddss	.LC1(%rip), %xmm6, %xmm6
	vmulss	.LC2(%rip), %xmm2, %xmm2
	vmulss	%xmm2, %xmm6, %xmm6
	vrsqrtss	%xmm14, %xmm2, %xmm2
	vmulss	%xmm14, %xmm2, %xmm15
	vmulss	%xmm2, %xmm15, %xmm15
	vaddss	.LC1(%rip), %xmm15, %xmm15
	vmulss	.LC2(%rip), %xmm2, %xmm2
	vmulss	%xmm2, %xmm15, %xmm15
	vmovss	12(%rsp), %xmm5
	vrsqrtss	%xmm5, %xmm2, %xmm2
	vmulss	%xmm5, %xmm2, %xmm3
	vmulss	%xmm2, %xmm3, %xmm3
	vaddss	.LC1(%rip), %xmm3, %xmm3
	vmulss	.LC2(%rip), %xmm2, %xmm2
	vmulss	%xmm2, %xmm3, %xmm3
	vrsqrtss	%xmm1, %xmm2, %xmm2
	vmulss	%xmm1, %xmm2, %xmm0
	vmulss	%xmm2, %xmm0, %xmm0
	vaddss	.LC1(%rip), %xmm0, %xmm0
	vmulss	.LC2(%rip), %xmm2, %xmm2
	vmulss	%xmm2, %xmm0, %xmm0
	vrsqrtss	%xmm7, %xmm1, %xmm1
	vmulss	%xmm7, %xmm1, %xmm2
	vmulss	%xmm1, %xmm2, %xmm2
	vaddss	.LC1(%rip), %xmm2, %xmm2
	vmulss	.LC2(%rip), %xmm1, %xmm1
	vmulss	%xmm1, %xmm2, %xmm2
	vmovss	28(%rsp), %xmm7
	vrsqrtss	%xmm7, %xmm1, %xmm1
	vmulss	%xmm7, %xmm1, %xmm5
	vmulss	%xmm1, %xmm5, %xmm5
	vaddss	.LC1(%rip), %xmm5, %xmm5
	vmulss	.LC2(%rip), %xmm1, %xmm1
	vmulss	%xmm1, %xmm5, %xmm5
	vrsqrtss	%xmm4, %xmm7, %xmm7
	vmulss	%xmm4, %xmm7, %xmm1
	vmulss	%xmm7, %xmm1, %xmm1
	vaddss	.LC1(%rip), %xmm1, %xmm1
	vmulss	.LC2(%rip), %xmm7, %xmm7
	vmulss	%xmm7, %xmm1, %xmm1
	vrsqrtss	%xmm11, %xmm4, %xmm4
	vmulss	%xmm11, %xmm4, %xmm11
	vmulss	%xmm4, %xmm11, %xmm11
	vaddss	.LC1(%rip), %xmm11, %xmm11
	vmulss	.LC2(%rip), %xmm4, %xmm4
	vmulss	%xmm4, %xmm11, %xmm11
	vmovss	%xmm11, 28(%rsp)
	vmulss	%xmm13, %xmm13, %xmm12
	vmulss	%xmm13, %xmm12, %xmm7
	vmulss	%xmm7, %xmm7, %xmm7
	vmulss	100(%rsp), %xmm7, %xmm11
	vmulss	%xmm7, %xmm7, %xmm7
	vmulss	112(%rsp), %xmm7, %xmm7
	vsubss	%xmm11, %xmm7, %xmm4
	vaddss	-8(%rsp), %xmm4, %xmm14
	vmovss	%xmm14, -8(%rsp)
	vmulss	.LC3(%rip), %xmm11, %xmm11
	vfmsub231ss	.LC4(%rip), %xmm7, %xmm11
	vmovss	-4(%rsp), %xmm10
	vmovss	%xmm13, 12(%rsp)
	vfmadd132ss	%xmm13, %xmm11, %xmm10
	vmulss	%xmm12, %xmm10, %xmm4
	vmovss	-48(%rsp), %xmm13
	vmovss	-96(%rsp), %xmm11
	vfmadd231ss	%xmm11, %xmm4, %xmm13
	vmovss	-40(%rsp), %xmm12
	vfmadd231ss	%xmm4, %xmm9, %xmm12
	vmovss	-32(%rsp), %xmm7
	vfmadd231ss	%xmm4, %xmm8, %xmm7
	addq	%r11, %rax
	vfnmadd213ss	-12(%rax), %xmm4, %xmm11
	vmovss	%xmm11, -96(%rsp)
	vfnmadd213ss	-8(%rax), %xmm4, %xmm9
	vmovss	%xmm9, 64(%rsp)
	vmovaps	%xmm8, %xmm9
	vfnmadd213ss	-4(%rax), %xmm4, %xmm9
	vmovss	%xmm9, 68(%rsp)
	vmovss	-108(%rsp), %xmm8
	vmulss	%xmm8, %xmm3, %xmm9
	vmulss	%xmm3, %xmm3, %xmm3
	vmovss	%xmm9, 72(%rsp)
	vmulss	%xmm9, %xmm3, %xmm3
	vmovss	(%rsp), %xmm14
	vmovaps	%xmm14, %xmm4
	vfnmadd213ss	(%rax), %xmm3, %xmm4
	vmovd	%xmm4, %esi
	vmovss	4(%rsp), %xmm10
	vmovaps	%xmm10, %xmm9
	vfnmadd213ss	4(%rax), %xmm3, %xmm9
	vmovd	%xmm9, %ecx
	leaq	(%r11,%rbp,4), %r13
	vmovss	8(%rsp), %xmm11
	vmovaps	%xmm11, %xmm9
	vfnmadd213ss	-4(%r13), %xmm3, %xmm9
	vmovss	%xmm9, (%rsp)
	vmulss	%xmm8, %xmm5, %xmm4
	vmulss	%xmm5, %xmm5, %xmm5
	vmulss	%xmm4, %xmm5, %xmm5
	vmulss	16(%rsp), %xmm5, %xmm9
	vmulss	20(%rsp), %xmm5, %xmm8
	vmulss	24(%rsp), %xmm5, %xmm5
	vfmadd132ss	%xmm3, %xmm9, %xmm14
	vaddss	%xmm13, %xmm14, %xmm14
	vmovss	%xmm14, -48(%rsp)
	vmovaps	%xmm10, %xmm14
	vfmadd132ss	%xmm3, %xmm8, %xmm14
	vaddss	%xmm12, %xmm14, %xmm14
	vmovss	%xmm14, -40(%rsp)
	vfmadd132ss	%xmm11, %xmm5, %xmm3
	vaddss	%xmm7, %xmm3, %xmm14
	vmovss	%xmm14, -32(%rsp)
	vmovss	12(%rax), %xmm3
	vsubss	%xmm9, %xmm3, %xmm14
	vmovss	%xmm14, 4(%rsp)
	vmovss	16(%rax), %xmm3
	vsubss	%xmm8, %xmm3, %xmm12
	vmovss	%xmm12, 8(%rsp)
	leaq	(%r11,%rdx,4), %rbp
	vmovss	-4(%rbp), %xmm3
	vsubss	%xmm5, %xmm3, %xmm5
	vmovd	%xmm5, %edx
	vmulss	-108(%rsp), %xmm6, %xmm14
	vmulss	%xmm6, %xmm6, %xmm6
	vmulss	%xmm14, %xmm6, %xmm6
	vmovss	-28(%rsp), %xmm5
	vfmadd231ss	-120(%rsp), %xmm6, %xmm5
	vmovaps	%xmm5, %xmm3
	vmovss	-104(%rsp), %xmm13
	vmulss	%xmm13, %xmm0, %xmm9
	vmulss	%xmm0, %xmm0, %xmm0
	vmulss	%xmm9, %xmm0, %xmm0
	vmulss	32(%rsp), %xmm0, %xmm11
	vmulss	36(%rsp), %xmm0, %xmm10
	vmulss	%xmm13, %xmm1, %xmm13
	vmulss	%xmm1, %xmm1, %xmm1
	vmulss	%xmm13, %xmm1, %xmm1
	vmovss	-44(%rsp), %xmm12
	vfmadd231ss	-64(%rsp), %xmm1, %xmm12
	vmovaps	%xmm12, %xmm8
	vmovss	-36(%rsp), %xmm5
	vfmadd231ss	-60(%rsp), %xmm1, %xmm5
	vmovaps	%xmm5, %xmm7
	vmulss	40(%rsp), %xmm1, %xmm5
	vmovss	-76(%rsp), %xmm12
	vfmadd132ss	%xmm6, %xmm11, %xmm12
	vaddss	%xmm8, %xmm12, %xmm8
	vmovss	%xmm8, -44(%rsp)
	vmovss	-72(%rsp), %xmm8
	vfmadd132ss	%xmm6, %xmm10, %xmm8
	vaddss	%xmm7, %xmm8, %xmm8
	vmovss	%xmm8, -36(%rsp)
	vmovss	-68(%rsp), %xmm12
	vfmadd132ss	%xmm0, %xmm5, %xmm12
	vaddss	%xmm3, %xmm12, %xmm3
	vmovss	%xmm3, -28(%rsp)
	vmulss	-108(%rsp), %xmm15, %xmm12
	vmulss	%xmm15, %xmm15, %xmm15
	vmulss	%xmm12, %xmm15, %xmm3
	vmovss	-16(%rsp), %xmm8
	vmovss	-92(%rsp), %xmm15
	vfmadd231ss	%xmm15, %xmm3, %xmm8
	vmovss	%xmm8, -92(%rsp)
	vmovss	-76(%rsp), %xmm7
	vfnmadd213ss	-96(%rsp), %xmm6, %xmm7
	vfnmadd231ss	-56(%rsp), %xmm3, %xmm7
	vmovss	%xmm7, -12(%rax)
	vmovss	-72(%rsp), %xmm7
	vfnmadd213ss	64(%rsp), %xmm6, %xmm7
	vfnmadd231ss	-52(%rsp), %xmm3, %xmm7
	vmovss	%xmm7, -8(%rax)
	vmovss	-120(%rsp), %xmm8
	vfnmadd213ss	68(%rsp), %xmm8, %xmm6
	vfnmadd231ss	%xmm15, %xmm3, %xmm6
	vmovss	%xmm6, -4(%rax)
	vmulss	-104(%rsp), %xmm2, %xmm8
	vmulss	%xmm2, %xmm2, %xmm2
	vmulss	%xmm8, %xmm2, %xmm2
	vmulss	44(%rsp), %xmm2, %xmm7
	vmulss	48(%rsp), %xmm2, %xmm6
	vmovd	%esi, %xmm15
	vsubss	%xmm11, %xmm15, %xmm11
	vsubss	%xmm7, %xmm11, %xmm11
	vmovss	%xmm11, (%rax)
	vmovd	%ecx, %xmm15
	vsubss	%xmm10, %xmm15, %xmm10
	vsubss	%xmm6, %xmm10, %xmm10
	vmovss	%xmm10, 4(%rax)
	vmovss	(%rsp), %xmm10
	vfnmadd132ss	-68(%rsp), %xmm10, %xmm0
	vfnmadd231ss	-100(%rsp), %xmm2, %xmm0
	vmovss	%xmm0, -4(%r13)
	vmovss	-104(%rsp), %xmm0
	vmovss	28(%rsp), %xmm11
	vmulss	%xmm11, %xmm0, %xmm0
	vaddss	%xmm0, %xmm8, %xmm8
	vaddss	-12(%rsp), %xmm8, %xmm8
	vmulss	%xmm11, %xmm11, %xmm10
	vmulss	%xmm0, %xmm10, %xmm0
	vaddss	%xmm14, %xmm4, %xmm4
	vmovss	12(%rsp), %xmm14
	vmovss	72(%rsp), %xmm11
	vfmadd132ss	-4(%rsp), %xmm11, %xmm14
	vaddss	%xmm14, %xmm4, %xmm4
	vaddss	%xmm13, %xmm9, %xmm9
	vaddss	%xmm12, %xmm9, %xmm9
	vaddss	%xmm9, %xmm4, %xmm4
	vaddss	%xmm8, %xmm4, %xmm15
	vmovss	%xmm15, -12(%rsp)
	vmovss	-24(%rsp), %xmm4
	vmovss	52(%rsp), %xmm13
	vfmadd231ss	%xmm13, %xmm0, %xmm4
	vmovaps	%xmm4, %xmm8
	vmovss	-20(%rsp), %xmm4
	vmovss	56(%rsp), %xmm14
	vfmadd231ss	%xmm14, %xmm0, %xmm4
	vmulss	60(%rsp), %xmm0, %xmm11
	vfmadd231ss	-56(%rsp), %xmm3, %xmm7
	vaddss	%xmm8, %xmm7, %xmm7
	vmovss	%xmm7, -24(%rsp)
	vfmadd231ss	-52(%rsp), %xmm3, %xmm6
	vaddss	%xmm4, %xmm6, %xmm3
	vmovss	%xmm3, -20(%rsp)
	vfmadd132ss	-100(%rsp), %xmm11, %xmm2
	vaddss	-92(%rsp), %xmm2, %xmm3
	vmovss	%xmm3, -16(%rsp)
	vmovss	-64(%rsp), %xmm9
	vfnmadd213ss	4(%rsp), %xmm1, %xmm9
	vfnmadd231ss	%xmm13, %xmm0, %xmm9
	vmovss	%xmm9, 12(%rax)
	vmovss	-60(%rsp), %xmm12
	vfnmadd213ss	8(%rsp), %xmm12, %xmm1
	vfnmadd132ss	%xmm14, %xmm1, %xmm0
	vmovss	%xmm0, 16(%rax)
	vmovd	%edx, %xmm6
	vsubss	%xmm5, %xmm6, %xmm5
	vsubss	%xmm11, %xmm5, %xmm11
	vmovss	%xmm11, -4(%rbp)
	addq	$4, %r8
	cmpq	%r15, %r8
	jne	.L4
	movq	%r14, %rbp
	vmovss	-48(%rsp), %xmm6
	vaddss	-44(%rsp), %xmm6, %xmm8
	vmovss	-40(%rsp), %xmm6
	vaddss	-36(%rsp), %xmm6, %xmm2
	vmovss	-32(%rsp), %xmm6
	vaddss	-28(%rsp), %xmm6, %xmm0
	vmovss	-48(%rsp), %xmm6
.L3:
	leaq	(%r11,%r12,4), %rax
	vaddss	-12(%rax), %xmm6, %xmm12
	vmovss	%xmm12, -12(%rax)
	vmovss	-40(%rsp), %xmm6
	vaddss	-8(%rax), %xmm6, %xmm11
	vmovss	%xmm11, -8(%rax)
	vmovss	-32(%rsp), %xmm6
	vaddss	-4(%rax), %xmm6, %xmm5
	vmovss	%xmm5, -4(%rax)
	vmovss	-44(%rsp), %xmm6
	vaddss	(%rax), %xmm6, %xmm11
	vmovss	%xmm11, (%rax)
	vmovss	-36(%rsp), %xmm6
	vaddss	4(%rax), %xmm6, %xmm11
	vmovss	%xmm11, 4(%rax)
	leaq	(%r11,%rbx,4), %rdx
	vmovss	-28(%rsp), %xmm6
	vaddss	-4(%rdx), %xmm6, %xmm11
	vmovss	%xmm11, -4(%rdx)
	vmovss	-24(%rsp), %xmm6
	vaddss	12(%rax), %xmm6, %xmm1
	vmovss	%xmm1, 12(%rax)
	vmovss	-20(%rsp), %xmm5
	vaddss	16(%rax), %xmm5, %xmm1
	vmovss	%xmm1, 16(%rax)
	leaq	(%r11,%rdi,4), %rax
	vaddss	-4(%rax), %xmm3, %xmm1
	vmovss	%xmm1, -4(%rax)
	movq	216(%rsp), %rax
	movq	104(%rsp), %rdi
	leaq	(%rax,%rdi,4), %rax
	vaddss	%xmm6, %xmm8, %xmm8
	vaddss	(%rax), %xmm8, %xmm8
	vmovss	%xmm8, (%rax)
	vaddss	%xmm5, %xmm2, %xmm2
	vaddss	4(%rax), %xmm2, %xmm2
	vmovss	%xmm2, 4(%rax)
	movq	216(%rsp), %rax
	leaq	(%rax,%r9,4), %rax
	vaddss	%xmm3, %xmm0, %xmm6
	vaddss	-4(%rax), %xmm6, %xmm6
	vmovss	%xmm6, -4(%rax)
	movq	224(%rsp), %rax
	movslq	(%rax,%rbp,4), %rax
	salq	$2, %rax
	movq	%rax, %rdx
	addq	264(%rsp), %rdx
	vmovss	-12(%rsp), %xmm6
	vaddss	(%rdx), %xmm6, %xmm7
	vmovss	%xmm7, (%rdx)
	addq	296(%rsp), %rax
	vmovss	-8(%rsp), %xmm6
	vaddss	(%rax), %xmm6, %xmm0
	vmovss	%xmm0, (%rax)
	addq	$1, %rbp
	cmpl	%ebp, 116(%rsp)
	jne	.L6
.L9:
	addq	$160, %rsp
	.cfi_remember_state
	.cfi_def_cfa_offset 56
	popq	%rbx
	.cfi_def_cfa_offset 48
	popq	%rbp
	.cfi_def_cfa_offset 40
	popq	%r12
	.cfi_def_cfa_offset 32
	popq	%r13
	.cfi_def_cfa_offset 24
	popq	%r14
	.cfi_def_cfa_offset 16
	popq	%r15
	.cfi_def_cfa_offset 8
	ret
	.p2align 4,,10
	.p2align 3
.L7:
	.cfi_restore_state
	vxorps	%xmm0, %xmm0, %xmm0
	vmovaps	%xmm0, %xmm2
	vmovaps	%xmm0, %xmm8
	vmovss	%xmm0, -8(%rsp)
	vmovss	%xmm0, -12(%rsp)
	vmovss	%xmm0, -16(%rsp)
	vmovss	%xmm0, -28(%rsp)
	vmovss	%xmm0, -32(%rsp)
	vmovss	%xmm0, -20(%rsp)
	vmovss	%xmm0, -36(%rsp)
	vmovss	%xmm0, -40(%rsp)
	vmovss	%xmm0, -24(%rsp)
	vmovss	%xmm0, -44(%rsp)
	vmovss	%xmm0, -48(%rsp)
	movslq	%r8d, %rax
	movq	%rax, 104(%rsp)
	vmovaps	%xmm0, %xmm6
	vmovaps	%xmm0, %xmm3
	jmp	.L3
	.cfi_endproc
.LFE0:
	.size	inl1130_, .-inl1130_
	.section	.rodata.cst4,"aM",@progbits,4
	.align 4
.LC1:
	.long	3225419776
	.align 4
.LC2:
	.long	3204448256
	.align 4
.LC3:
	.long	1086324736
	.align 4
.LC4:
	.long	1094713344
	.ident	"GCC: (GNU) 6.0.0 20160205 (experimental) [trunk revision 233136]"
	.section	.note.GNU-stack,"",@progbits

Patch

Index: gcc/ira.c
===================================================================
--- gcc/ira.c	(revision 233172)
+++ gcc/ira.c	(working copy)
@@ -1774,7 +1774,7 @@  ira_setup_alts (rtx_insn *insn, HARD_REG
   int nop, nalt;
   bool curr_swapped;
   const char *p;
-  int commutative = -1;
+  int commutative = -1, alt_commutative = -1;
 
   extract_insn (insn);
   alternative_mask preferred = get_preferred_alternatives (insn);
@@ -1838,6 +1838,8 @@  ira_setup_alts (rtx_insn *insn, HARD_REG
 		  
 		  case '%':
 		    /* The commutative modifier is handled above.  */
+		    if (alt_commutative < 0)
+		      alt_commutative = nop;
 		    break;
 
 		  case '0':  case '1':  case '2':  case '3':  case '4':
@@ -1889,10 +1891,13 @@  ira_setup_alts (rtx_insn *insn, HARD_REG
 	}
       if (commutative < 0)
 	break;
+      /* Swap forth and back to avoid changing recog_data.  */
+      if (! curr_swapped
+	  || alt_commutative < 0)
+	std::swap (recog_data.operand[commutative],
+		   recog_data.operand[commutative + 1]);
       if (curr_swapped)
 	break;
-      std::swap (recog_data.operand[commutative],
-		 recog_data.operand[commutative + 1]);
     }
 }
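
To illustrate the swap-forth-and-back idea the patch implements, here
is a small self-contained sketch of the two-pass loop; names are
simplified and the boolean flag stands in for what the patch calls
alt_commutative (whether '%' was actually seen while scanning the
enabled alternatives), so this is not the real GCC code:

#include <cassert>
#include <string>
#include <utility>

struct toy_recog_data { std::string operand[2]; };

static void
toy_setup_alts (toy_recog_data &rd, bool alt_is_commutative)
{
  const int commutative = 0;   /* index of the '%' operand pair */
  int alt_commutative = alt_is_commutative ? commutative : -1;

  for (int pass = 0; pass < 2; ++pass)
    {
      bool curr_swapped = (pass == 1);
      /* ... per-alternative processing of rd.operand[] goes here ... */

      /* First pass: swap so the second pass sees the swapped order.
	 Second pass: swap back unless an enabled alternative really is
	 commutative, so rd is normally left unchanged.  */
      if (!curr_swapped || alt_commutative < 0)
	std::swap (rd.operand[commutative], rd.operand[commutative + 1]);
      if (curr_swapped)
	break;
    }
}

int
main ()
{
  toy_recog_data rd = { { "op0", "op1" } };

  toy_setup_alts (rd, /*alt_is_commutative=*/false);
  /* Without a commutative alternative the operands come back in their
     original order, i.e. recog_data is not changed behind our back.  */
  assert (rd.operand[0] == "op0" && rd.operand[1] == "op1");

  toy_setup_alts (rd, /*alt_is_commutative=*/true);
  /* With a commutative alternative the swapped order is kept.  */
  assert (rd.operand[0] == "op1" && rd.operand[1] == "op0");
  return 0;
}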