Message ID: 20181217122405.18732-1-mark.cave-ayland@ilande.co.uk
Series: target/ppc: convert VMX instructions to use TCG vector operations
On 12/17/18 4:23 AM, Mark Cave-Ayland wrote:
> NOTE: there are a lot of instructions that cannot (yet) be optimised to use
> TCG vector operations, however it struck me that there may be some potential
> for converting saturating add/sub and cmp instructions if there were a
> mechanism to return a set of flags indicating the result of the
> saturation/comparison.

There are also a lot of instructions that can be converted, but aren't:

* vspltis[bhw] can use tcg_gen_gvec_dup{8,16,32}i.

* vsplt{b,h,w} can use tcg_gen_gvec_dup_mem.

  Note that you'll need something like vec_reg_offset from
  target/arm/translate-a64.h to compute the offset of the
  specific byte/word/long from which we are to splat.

* vmr should be handled by having tcg_gen_gvec_or notice aofs == bofs.
  For ARM, we do special case this during translation.
  But since tcg/tcg-op.c does these things for tcg_gen_or_i64,
  we should probably handle the same set of transformations.

* vnot would need to be handled by actually adding a tcg_gen_gvec_nor
  and then also noticing aofs == bofs.

For saturation, I think the easiest thing to do is represent SAT as a
ppc_avr_t.  We notice saturation by also computing normal arithmetic and
comparing to see if they differ.  E.g.

    tcg_gen_gvec_add(vece, offsetof_avr_tmp,
                     offsetof(ra), offsetof(rb), 16, 16);
    tcg_gen_gvec_ssadd(vece, offsetof(rt),
                       offsetof(ra), offsetof(rb), 16, 16);
    tcg_gen_gvec_cmp(TCG_COND_NE, vece, offsetof_avr_tmp,
                     offsetof_avr_tmp, offsetof(rt), 16, 16);
    tcg_gen_gvec_or(vece, offsetof_avr_sat, offsetof_avr_sat,
                    offsetof_avr_tmp, 16, 16);

You only need to convert the ppc_avr_t to a single bit when reading VSCR.

For comparisons... that's tricky.
I wonder if there's anything better than

    tcg_gen_gvec_cmp(TCG_COND_FOO, vece, offsetof(rt),
                     offsetof(ra), offsetof(rb), 16, 16);
    if (rc) {
        TCGv_i64 hi, lo, t, f;

        tcg_gen_ld_i64(hi, cpu_env, offsetof(rt));
        tcg_gen_ld_i64(lo, cpu_env, offsetof(rt) + 8);

        tcg_gen_and_i64(t, hi, lo);
        tcg_gen_or_i64(f, hi, lo);
        tcg_gen_setcondi_i64(TCG_COND_EQ, t, t, -1);
        tcg_gen_setcondi_i64(TCG_COND_EQ, f, f, 0);

        // truncate to i32, shift, or, and set to cr6.
    }

r~
On 17/12/2018 17:39, Richard Henderson wrote:
> On 12/17/18 4:23 AM, Mark Cave-Ayland wrote:
>> NOTE: there are a lot of instructions that cannot (yet) be optimised to use
>> TCG vector operations, however it struck me that there may be some potential
>> for converting saturating add/sub and cmp instructions if there were a
>> mechanism to return a set of flags indicating the result of the
>> saturation/comparison.
>
> There are also a lot of instructions that can be converted, but aren't:
>
> * vspltis[bhw] can use tcg_gen_gvec_dup{8,16,32}i.
>
> * vsplt{b,h,w} can use tcg_gen_gvec_dup_mem.
>
>   Note that you'll need something like vec_reg_offset from
>   target/arm/translate-a64.h to compute the offset of the
>   specific byte/word/long from which we are to splat.

Oh okay, thanks for the hints - I remember thinking that I couldn't do much
with splat, but I'll go and look again.

> * vmr should be handled by having tcg_gen_gvec_or notice aofs == bofs.
>   For ARM, we do special case this during translation.
>   But since tcg/tcg-op.c does these things for tcg_gen_or_i64,
>   we should probably handle the same set of transformations.
>
> * vnot would need to be handled by actually adding a tcg_gen_gvec_nor
>   and then also noticing aofs == bofs.

And I'll revisit these ones too.

> For saturation, I think the easiest thing to do is represent SAT as a
> ppc_avr_t.  We notice saturation by also computing normal arithmetic and
> comparing to see if they differ.  E.g.
>
>     tcg_gen_gvec_add(vece, offsetof_avr_tmp,
>                      offsetof(ra), offsetof(rb), 16, 16);
>     tcg_gen_gvec_ssadd(vece, offsetof(rt),
>                        offsetof(ra), offsetof(rb), 16, 16);
>     tcg_gen_gvec_cmp(TCG_COND_NE, vece, offsetof_avr_tmp,
>                      offsetof_avr_tmp, offsetof(rt), 16, 16);
>     tcg_gen_gvec_or(vece, offsetof_avr_sat, offsetof_avr_sat,
>                     offsetof_avr_tmp, 16, 16);
>
> You only need to convert the ppc_avr_t to a single bit when reading VSCR.
I actually had a PoC that looked somewhat similar to this, except that I
abandoned the idea as I thought that the penalty of doing the add twice (plus
the comparison) would slow everything down by several orders of magnitude for
backends that didn't support vector instructions. What's the best way to
handle this?

> For comparisons... that's tricky.  I wonder if there's anything better than
>
>     tcg_gen_gvec_cmp(TCG_COND_FOO, vece, offsetof(rt),
>                      offsetof(ra), offsetof(rb), 16, 16);
>     if (rc) {
>         TCGv_i64 hi, lo, t, f;
>
>         tcg_gen_ld_i64(hi, cpu_env, offsetof(rt));
>         tcg_gen_ld_i64(lo, cpu_env, offsetof(rt) + 8);
>
>         tcg_gen_and_i64(t, hi, lo);
>         tcg_gen_or_i64(f, hi, lo);
>         tcg_gen_setcondi_i64(TCG_COND_EQ, t, t, -1);
>         tcg_gen_setcondi_i64(TCG_COND_EQ, f, f, 0);
>
>         // truncate to i32, shift, or, and set to cr6.
>     }

Certainly I can look at this approach, but again my concern is that we end up
heavily penalising the backends without vector instruction support :/

ATB,

Mark.
This patchset is an attempt at improving VMX (Altivec) instruction performance
by making use of the new TCG vector operations where possible.

In order to use TCG vector operations, the registers must be accessible from
cpu_env, whilst currently they are accessed via arrays of static TCG globals.
Patches 1-3 are therefore mechanical patches which introduce access helpers
for the FPR, AVR and VSR registers using the supplied TCGv_i64 parameter.

Meanwhile patch 4 fixes a minor issue spotted by Richard during review to
ensure that AVR registers are not modified until after exceptions have been
processed during register load.

Once this is done, patch 5 enables us to remove the static TCG global arrays
and updates the access helpers to read/write the relevant fields in cpu_env
directly.

Patches 6 and 7 perform the legwork required to enable VSX instructions to be
converted to use TCG vector operations in future by rearranging the FP, VMX
and VSX registers into a single aligned VSR register array (the scope of this
patchset is VMX only).

The final patches 8 and 9 convert the VMX logical instructions and the
addition/subtraction instructions respectively over to the TCG vector
operations.

NOTE: there are a lot of instructions that cannot (yet) be optimised to use
TCG vector operations, however it struck me that there may be some potential
for converting saturating add/sub and cmp instructions if there were a
mechanism to return a set of flags indicating the result of the
saturation/comparison.

Finally thanks to Richard for taking the time to answer some of my (mostly
beginner) questions related to TCG.
Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>

v2:
- Rebase onto master
- Add comment explaining rationale for FPR helpers in description for patch 1
- Add R-B tags from Richard
- Add patch 3 to delay AVR register writeback as spotted by Richard
- Add patches 6 and 7 to merge FPR, VMX and VSX registers into the vsr array
  to facilitate conversion of VSX instructions to vector operations later
- Fix accidental bug whereby the conversion of get_vsr()/set_vsr() to access
  data from cpu_env was incorrectly squashed into patch 3
- Move set_fpr() further down in gen_fsqrts() and gen_frsqrtes() in patch 1

Mark Cave-Ayland (9):
  target/ppc: introduce get_fpr() and set_fpr() helpers for FP register
    access
  target/ppc: introduce get_avr64() and set_avr64() helpers for VMX
    register access
  target/ppc: introduce get_cpu_vsr{l,h}() and set_cpu_vsr{l,h}()
    helpers for VSR register access
  target/ppc: delay writeback of avr{l,h} during lvx instruction
  target/ppc: switch FPR, VMX and VSX helpers to access data directly
    from cpu_env
  target/ppc: merge ppc_vsr_t and ppc_avr_t union types
  target/ppc: move FP and VMX registers into aligned vsr register array
  target/ppc: convert VMX logical instructions to use vector operations
  target/ppc: convert vaddu[b,h,w,d] and vsubu[b,h,w,d] over to use
    vector operations

 linux-user/ppc/signal.c             |  24 +-
 target/ppc/arch_dump.c              |  12 +-
 target/ppc/cpu.h                    |  26 +-
 target/ppc/gdbstub.c                |   8 +-
 target/ppc/helper.h                 |   8 -
 target/ppc/int_helper.c             |  63 ++-
 target/ppc/internal.h               |  29 +-
 target/ppc/machine.c                |  72 +++-
 target/ppc/monitor.c                |   4 +-
 target/ppc/translate.c              |  74 ++--
 target/ppc/translate/dfp-impl.inc.c |   2 +-
 target/ppc/translate/fp-impl.inc.c  | 490 +++++++++++++++++-----
 target/ppc/translate/vmx-impl.inc.c | 186 ++++++---
 target/ppc/translate/vsx-impl.inc.c | 782 ++++++++++++++++++++++----------
 target/ppc/translate_init.inc.c     |  24 +-
 15 files changed, 1262 insertions(+), 542 deletions(-)