[v2,00/19] target/i386: decoder changes for 8.2

Message ID 20231019104648.389942-1-pbonzini@redhat.com

Message

Paolo Bonzini Oct. 19, 2023, 10:46 a.m. UTC
This includes:

- implementing SHA and CMPccXADD instruction extensions

- introducing a new mechanism for flags writeback that avoids a
  tricky failure

- converting the more orthogonal parts of the one-byte opcode
  map, as well as the CMOVcc and SETcc instructions.

Tested by booting several 32-bit and 64-bit guests.

The new decoder produces roughly 2% more ops, but after optimization there
are just 0.5% more and almost all of them come from cmp instructions.
For some reason that I have not investigated, these end up with an extra
mov even after optimization:

                                sub_i64 tmp0,rax,$0x33
 mov_i64 cc_src,$0x33           mov_i64 cc_dst,tmp0
 sub_i64 cc_dst,rax,$0x33       mov_i64 cc_src,$0x33
 discard cc_src2                discard cc_src2
 discard cc_op                  discard cc_op

It could be easily fixed by not reusing gen_SUB for cmp instructions,
or by debugging what goes on in the optimizer.  However, it does not
result in larger assembly.

Paolo

v1->v2: call set_cc_op from the delayed flags writeback
	preparation for CC_OP_DYNAMIC
	fix INC/DEC to use delayed flags writeback
	remove cc_srcT from delayed flags writeback
	annotate places that call set_cc_op() from emit functions
	rewrite IMUL expansion to avoid nowb and to commonize flags handling
	introduce tcg_gen_negsetcondi*
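
For readers unfamiliar with the negsetcond family mentioned in the last
changelog item, here is a minimal sketch of the semantics in plain C.  It
assumes the usual definition (the result is all-ones when the comparison
holds and zero otherwise, i.e. the negation of setcond); the model_* names
are hypothetical and are not QEMU functions.

```c
#include <stdint.h>

/* Hypothetical model of setcond for unsigned "less than": yields 1 or 0. */
uint32_t model_setcond_ltu(uint32_t a, uint32_t b)
{
    return a < b;
}

/*
 * Hypothetical model of negsetcond for the same condition: yields
 * 0xffffffff when a < b and 0 otherwise, i.e. -(a < b).
 */
uint32_t model_negsetcond_ltu(uint32_t a, uint32_t b)
{
    return -(uint32_t)(a < b);
}

/* An all-ones/zero mask can select between two values without a branch. */
uint32_t model_select(uint32_t mask, uint32_t x, uint32_t y)
{
    return (x & mask) | (y & ~mask);
}
```

The "i" suffix denotes a variant whose second comparison operand is an
immediate rather than a TCG temporary.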

Paolo Bonzini (19):
  target/i386: group common checks in the decoding phase
  target/i386: validate VEX.W for AVX instructions
  target/i386: implement SHA instructions
  tests/tcg/i386: initialize more registers in test-avx
  tests/tcg/i386: test-avx: add test cases for SHA new instructions
  target/i386: accept full MemOp in gen_ext_tl
  target/i386: introduce flags writeback mechanism
  target/i386: implement CMPccXADD
  target/i386: do not clobber A0 in POP translation
  target/i386: reintroduce debugging mechanism
  target/i386: move 00-5F opcodes to new decoder
  target/i386: adjust decoding of J operand
  target/i386: split eflags computation out of gen_compute_eflags
  tcg: add negsetcondi
  target/i386: move 60-BF opcodes to new decoder
  target/i386: move operand load and writeback out of gen_cmovcc1
  target/i386: move remaining conditional operations to new decoder
  target/i386: remove now converted opcodes from old decoder
  target/i386: remove gen_op

 include/tcg/tcg-op-common.h          |    4 +
 include/tcg/tcg-op.h                 |    2 +
 target/i386/cpu.c                    |    4 +-
 target/i386/cpu.h                    |    1 +
 target/i386/ops_sse.h                |  128 ++++
 target/i386/tcg/decode-new.c.inc     |  616 ++++++++++++++--
 target/i386/tcg/decode-new.h         |   43 +-
 target/i386/tcg/emit.c.inc           |  745 ++++++++++++++++++-
 target/i386/tcg/ops_sse_header.h.inc |   14 +
 target/i386/tcg/translate.c          | 1001 +++-----------------------
 tcg/tcg-op.c                         |   12 +
 tests/tcg/i386/Makefile.target       |    2 +-
 tests/tcg/i386/test-avx.c            |    8 +
 tests/tcg/i386/test-avx.py           |    3 +-
 tests/tcg/i386/test-flags.c          |   37 +
 15 files changed, 1644 insertions(+), 976 deletions(-)
 create mode 100644 tests/tcg/i386/test-flags.c

Comments

Paolo Bonzini Oct. 19, 2023, 11:39 a.m. UTC | #1
On 10/19/23 12:46, Paolo Bonzini wrote:
> This includes:
> 
> - implementing SHA and CMPccXADD instruction extensions
> 
> - introducing a new mechanism for flags writeback that avoids a
>    tricky failure
> 
> - converting the more orthogonal parts of the one-byte opcode
>    map, as well as the CMOVcc and SETcc instructions.
> 
> Tested by booting several 32-bit and 64-bit guests.
> 
> The new decoder produces roughly 2% more ops, but after optimization there
> are just 0.5% more and almost all of them come from cmp instructions.
> For some reason that I have not investigated, these end up with an extra
> mov even after optimization:
> 
>                                  sub_i64 tmp0,rax,$0x33
>   mov_i64 cc_src,$0x33           mov_i64 cc_dst,tmp0
>   sub_i64 cc_dst,rax,$0x33       mov_i64 cc_src,$0x33
>   discard cc_src2                discard cc_src2
>   discard cc_op                  discard cc_op
> 
> It could be easily fixed by not reusing gen_SUB for cmp instructions,
> or by debugging what goes on in the optimizer.  However, it does not
> result in larger assembly.

Oops, I missed Richard's newer reviews.  Will send v3 sometime next week.

Paolo

Richard Henderson Oct. 19, 2023, 3:44 p.m. UTC | #2
On 10/19/23 03:46, Paolo Bonzini wrote:
> This includes:
> 
> - implementing SHA and CMPccXADD instruction extensions
> 
> - introducing a new mechanism for flags writeback that avoids a
>    tricky failure
> 
> - converting the more orthogonal parts of the one-byte opcode
>    map, as well as the CMOVcc and SETcc instructions.
> 
> Tested by booting several 32-bit and 64-bit guests.
> 
> The new decoder produces roughly 2% more ops, but after optimization there
> are just 0.5% more and almost all of them come from cmp instructions.
> For some reason that I have not investigated, these end up with an extra
> mov even after optimization:
> 
>                                  sub_i64 tmp0,rax,$0x33
>   mov_i64 cc_src,$0x33           mov_i64 cc_dst,tmp0
>   sub_i64 cc_dst,rax,$0x33       mov_i64 cc_src,$0x33
>   discard cc_src2                discard cc_src2
>   discard cc_op                  discard cc_op
> 
> It could be easily fixed by not reusing gen_SUB for cmp instructions,
> or by debugging what goes on in the optimizer.  However, it does not
> result in larger assembly.

This is expected behaviour from the tcg optimizer.  We don't forward-propagate
outputs at that point.  But during register allocation of the
"mov cc_dst,tmp0" opcode, we will see that tmp0 is dead and re-assign the
register from tmp0 to cc_dst without emitting a host instruction.


r~
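
The register-allocation behaviour described in the last reply can be
illustrated with a toy model.  This is only a sketch under simplifying
assumptions: every non-mov op costs exactly one host instruction, registers
never spill, and liveness is passed in by hand.  All names (ToyAlloc, toy_*)
are hypothetical and are not part of QEMU's TCG.

```c
#include <assert.h>
#include <stdbool.h>

enum { TMP0, CC_DST, CC_SRC, RAX, NB_TEMPS };
enum { NO_REG = -1, NB_HOST_REGS = 8 };

typedef struct {
    int reg[NB_TEMPS];  /* host register holding each temp, or NO_REG */
    int next_free;      /* next unused host register */
    int emitted;        /* number of host instructions emitted so far */
} ToyAlloc;

static void toy_init(ToyAlloc *s)
{
    for (int i = 0; i < NB_TEMPS; i++) {
        s->reg[i] = NO_REG;
    }
    s->next_free = 0;
    s->emitted = 0;
}

/* Give the temp a host register if it does not have one yet. */
static int toy_get_reg(ToyAlloc *s, int temp)
{
    if (s->reg[temp] == NO_REG) {
        assert(s->next_free < NB_HOST_REGS);
        s->reg[temp] = s->next_free++;
    }
    return s->reg[temp];
}

/* "movi dst,$imm" and "sub dst,src,$imm" each cost one host instruction. */
static void toy_movi(ToyAlloc *s, int dst)
{
    toy_get_reg(s, dst);
    s->emitted++;
}

static void toy_sub_imm(ToyAlloc *s, int dst, int src)
{
    toy_get_reg(s, src);
    toy_get_reg(s, dst);
    s->emitted++;
}

/* "mov dst,src": if src dies here, hand its register to dst for free. */
static void toy_mov(ToyAlloc *s, int dst, int src, bool src_dead)
{
    if (src_dead && s->reg[src] != NO_REG) {
        s->reg[dst] = s->reg[src];
        s->reg[src] = NO_REG;
        return;
    }
    toy_get_reg(s, src);
    toy_get_reg(s, dst);
    s->emitted++;
}

/* Shorter sequence from the cover letter: movi cc_src; sub cc_dst,rax. */
int toy_count_old(void)
{
    ToyAlloc s;
    toy_init(&s);
    toy_movi(&s, CC_SRC);
    toy_sub_imm(&s, CC_DST, RAX);
    return s.emitted;
}

/* Longer sequence: sub tmp0,rax; mov cc_dst,tmp0; movi cc_src. */
int toy_count_new(bool tmp0_dead)
{
    ToyAlloc s;
    toy_init(&s);
    toy_sub_imm(&s, TMP0, RAX);
    toy_mov(&s, CC_DST, TMP0, tmp0_dead);
    toy_movi(&s, CC_SRC);
    return s.emitted;
}
```

In this model, toy_count_new(true) matches toy_count_old(), which mirrors the
observation that the extra mov op does not lead to larger host code.  If tmp0
were still live after the mov, the model would emit one more instruction.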