Message ID | 87bngfwxz9.fsf@samsung.com |
---|---|
State | New |
Headers | show |
Hi Yury [cc'ing the ARM maintainers] On 16/06/15 15:04, Yury Usishchev wrote: > Hello! > > Following patch fixes PR target/66433. > > As described in PR, cost of memory operation with autoincrement is > considered to be greater than same operation without autoincrement. This > causes auto-inc-dec pass not to optimize vector memory operations like > vld and vst. The autoincrement form may not always be as cheap as a simple memory op, since it does involve an implicit addition operation. I've tried out your patch and I do see the autoincrement forms being used more aggressively. Do you have any benchmark data for making this change? > > Bootstrapped and regtested on armv7l-linux-gnueabi on trunk. > OK for trunk? case MEM: /* A memory access costs 1 insn if the mode is small, or the address is a single register, otherwise it costs one insn per word. */ - if (REG_P (XEXP (x, 0))) + if (REG_P (XEXP (x, 0)) + || (GET_RTX_CLASS (GET_CODE (XEXP (x, 0))) == RTX_AUTOINC + && REG_P (XEXP (XEXP (x, 0), 0)))) *cost = COSTS_N_INSNS (1); else if (flag_pic && GET_CODE (XEXP (x, 0)) == PLUS I would have hoped that auto-inc-dec.c would be using address costs rather than rtx costs here, but I don't think it's well defined who is responsible for choosing preferences between these autoinc ops :( I note that in our arm_arm_address_cost we already consider the autoinc modes to be cheap. One situation that we want to avoid is for non-NEON memory ops sequences of the form: ldr ra, [rn, #4] ldr rb, [rn, #8] ldr rc, [rn, #12] add rn, rn, #16 being transformed into: ldr ra, [rn]! ldr rb, [rn]! ldr rc, [rn]! So I think at least for non-vector/FP modes where we can use offsets we should consider autoinc ops to be slightly more expensive (COSTS_N_INSNS (2) instead of COSTS_N_INSNS (1)). But when optimising for size, we should prefer the autoinc forms since they can save us on add/sub instructions. Thanks, Kyrill > > -- > BR, > Yury Usishchev >
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index f5050cb..a8dc0ed 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -9444,7 +9444,9 @@ arm_new_rtx_costs (rtx x, enum rtx_code code, enum rtx_code outer_code, case MEM: /* A memory access costs 1 insn if the mode is small, or the address is a single register, otherwise it costs one insn per word. */ - if (REG_P (XEXP (x, 0))) + if (REG_P (XEXP (x, 0)) + || (GET_RTX_CLASS (GET_CODE (XEXP (x, 0))) == RTX_AUTOINC + && REG_P (XEXP (XEXP (x, 0), 0)))) *cost = COSTS_N_INSNS (1); else if (flag_pic && GET_CODE (XEXP (x, 0)) == PLUS diff --git a/gcc/testsuite/gcc.target/arm/pr66433.c b/gcc/testsuite/gcc.target/arm/pr66433.c new file mode 100644 index 0000000..22ba158 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/pr66433.c @@ -0,0 +1,21 @@ +/* Test the optimization of `vld*' ARM NEON intrinsic with autoincrement. */ + +/* { dg-do compile } */ +/* { dg-require-effective-target arm_neon_ok } */ +/* { dg-options "-O2" } */ +/* { dg-add-options arm_neon } */ + +#include <arm_neon.h> + +void test_vld_autoinc (uint32_t *__restrict__ a, uint32_t *__restrict__ b) +{ + int i; + for(i = 0; i < 1000000; i++) { + vst1q_u32 (b, vld1q_u32 (a)); + a += 4; + b += 4; + } +} + +/* { dg-final { scan-assembler "vld1\.32.*!.*\n" } } */ +/* { dg-final { scan-assembler "vst1\.32.*!.*\n" } } */