[01/18] tcg: add support for 128bit vector type

Message ID 1484644078-21312-2-git-send-email-batuzovk@ispras.ru
State New

Commit Message

Kirill Batuzov Jan. 17, 2017, 9:07 a.m. UTC
Introduce TCG_TYPE_V128 and corresponding TCGv_v128 for TCG temps. Add helper
functions that work with temps of this new type.

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/tcg-op.h | 24 ++++++++++++++++++++++++
 tcg/tcg.c    | 13 +++++++++++++
 tcg/tcg.h    | 34 ++++++++++++++++++++++++++++++++++
 3 files changed, 71 insertions(+)

Comments

Richard Henderson Jan. 18, 2017, 6:29 p.m. UTC | #1
On 01/17/2017 01:07 AM, Kirill Batuzov wrote:
> +static inline TCGv_v128 tcg_global_mem_new_v128(TCGv_ptr reg, intptr_t offset,
> +                                                const char *name)
> +{
> +    int idx = tcg_global_mem_new_internal(TCG_TYPE_V128, reg, offset, name);
> +    return MAKE_TCGV_V128(idx);
> +}

You shouldn't allow a v128 type to be created if the host doesn't support it.

You may want to treat v128 as a pair of v64 if the host supports that. 
Although there's limited applicability there, since only minor hosts (MIPS, 
Sparc, ia64) have 64-bit-only vector extensions.

That said, treating v128 as 2 x v64 scales nicely when we add v256.  Which, if 
we've already gone this far, is clearly how avx2 guest support should be 
implemented.

For hosts that have had no vector support added, you may want to represent v128 
as 2 x i64, for the purpose of intermediate expansion.


r~
Kirill Batuzov Jan. 19, 2017, 1:04 p.m. UTC | #2
On Wed, 18 Jan 2017, Richard Henderson wrote:

> On 01/17/2017 01:07 AM, Kirill Batuzov wrote:
> > +static inline TCGv_v128 tcg_global_mem_new_v128(TCGv_ptr reg, intptr_t
> > offset,
> > +                                                const char *name)
> > +{
> > +    int idx = tcg_global_mem_new_internal(TCG_TYPE_V128, reg, offset,
> > name);
> > +    return MAKE_TCGV_V128(idx);
> > +}
> 
> You shouldn't allow a v128 type to be created if the host doesn't support it.

The idea here was to create it either way, but make sure no operation
will ever be issued if host does not support it (tcg_gen_* wrappers take
care of this).

> 
> You may want to treat v128 as a pair of v64 if the host supports that.
> Although there's limited applicability there, since only minor hosts (MIPS,
> Sparc, ia64) have 64-bit-only vector extensions.
> 
> That said, treating v128 as 2 x v64 scales nicely when we add v256.  Which, if
> we've already gone this far, is clearly how avx2 guest support should be
> implemented.
> 
> For hosts that have had no vector support added, you may want to represent
> v128 as 2 x i64, for the purpose of intermediate expansion.
>

I'm not sure about this last part. The host may not have i64, so there
should be another case - 4 x i32. So we'll get 4 cases for v128:

v128
2 x v64
2 x i64
4 x i32

3 cases will need to be added to tcg_temp_new_internal and
tcg_global_mem_new_internal, two of which are rather useless (2 x i64, 4 x i32).
Introduction of v256 will add 4 more cases, two of which will be useless
again. This sounds like too much code that serves no purpose to me.

Maybe we can only handle the 2 x v64 case (and later 2 x v128 and maybe 4 x v64)
and just generate a v128 temp that'll never be used if none of these
work?
Richard Henderson Jan. 19, 2017, 3:09 p.m. UTC | #3
On 01/19/2017 05:04 AM, Kirill Batuzov wrote:
> On Wed, 18 Jan 2017, Richard Henderson wrote:
>
>> On 01/17/2017 01:07 AM, Kirill Batuzov wrote:
>>> +static inline TCGv_v128 tcg_global_mem_new_v128(TCGv_ptr reg, intptr_t
>>> offset,
>>> +                                                const char *name)
>>> +{
>>> +    int idx = tcg_global_mem_new_internal(TCG_TYPE_V128, reg, offset,
>>> name);
>>> +    return MAKE_TCGV_V128(idx);
>>> +}
>>
>> You shouldn't allow a v128 type to be created if the host doesn't support it.
>
> The idea here was to create it either way, but make sure no operation
> will ever be issued if host does not support it (tcg_gen_* wrappers take
> care of this).

Huh?  If you issue *no* operation, then how is the operation being emulated?

> I'm not sure about this last part. The host may not have i64, so there
> should be another case - 4 x i32. So we'll get 4 cases for v128:

Recursively you'd get 4 x i32, but at least they'll be tagged TCG_TYPE_I64, and 
be handled by the rest of the tcg code generator like it should be.

>
> v128
> 2 x v64
> 2 x i64
> 4 x i32
>
> 3 cases will need to be added to tcg_temp_new_internal and
> tcg_global_new_mem_internal, two of which are rather useless (2 x i64, 4 x i32).
> Introduction of v256 will add 4 more cases two of which will be useless
> again. This sounds like too much code that serves no purpose to me.

Useless?  Surely you mean "used by hosts that don't implement v128".

I think one of us is very confused about how you intend to generate fallback 
code.  Perhaps any future patchset revision should update tcg/README first.


r~
Kirill Batuzov Jan. 19, 2017, 4:54 p.m. UTC | #4
On 19.01.2017 18:09, Richard Henderson wrote:
> On 01/19/2017 05:04 AM, Kirill Batuzov wrote:
>> On Wed, 18 Jan 2017, Richard Henderson wrote:
>>
>>> On 01/17/2017 01:07 AM, Kirill Batuzov wrote:
>>>> +static inline TCGv_v128 tcg_global_mem_new_v128(TCGv_ptr reg, intptr_t
>>>> offset,
>>>> +                                                const char *name)
>>>> +{
>>>> +    int idx = tcg_global_mem_new_internal(TCG_TYPE_V128, reg, offset,
>>>> name);
>>>> +    return MAKE_TCGV_V128(idx);
>>>> +}
>>>
>>> You shouldn't allow a v128 type to be created if the host doesn't
>>> support it.
>>
>> The idea here was to create it either way, but make sure no operation
>> will ever be issued if host does not support it (tcg_gen_* wrappers take
>> care of this).
>
> Huh?  If you issue *no* operation, then how is the operation being
> emulated?

Wrappers issue emulation code instead of operation if it is not 
supported by host.

tcg_gen_add_i32x4 looks like this:

if (TCG_TARGET_HAS_add_i32x4) {
     tcg_gen_op3_v128(INDEX_op_add_i32x4, args[0], args[1], args[2]);
} else {
     for (i = 0; i < 4; i++) {
         tcg_gen_ld_i32(...);
         tcg_gen_ld_i32(...);
         tcg_gen_add_i32(...);
         tcg_gen_st_i32(...);
     }
}

So no operation working directly with a TCGv_v128 temp should appear
anywhere in the intermediate representation unless the host claims to
support it (in which case it must support the 128-bit type as well).
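
For concreteness, a minimal sketch of such a fallback wrapper (not the
patch's actual code: tcg_v128_base() and tcg_v128_offset() are hypothetical
accessors for a v128 temp's backing storage, and INDEX_op_add_i32x4 /
TCG_TARGET_HAS_add_i32x4 are assumed to come from later patches in the
series):

static inline void tcg_gen_add_i32x4(TCGv_v128 res, TCGv_v128 a, TCGv_v128 b)
{
    if (TCG_TARGET_HAS_add_i32x4) {
        /* Host supports the vector op: emit it directly.  */
        tcg_gen_op3_v128(INDEX_op_add_i32x4, res, a, b);
    } else {
        int i;
        for (i = 0; i < 4; i++) {
            TCGv_i32 t0 = tcg_temp_new_i32();
            TCGv_i32 t1 = tcg_temp_new_i32();
            /* Load the i-th 32-bit lane of each source from memory.  */
            tcg_gen_ld_i32(t0, tcg_v128_base(a), tcg_v128_offset(a) + i * 4);
            tcg_gen_ld_i32(t1, tcg_v128_base(b), tcg_v128_offset(b) + i * 4);
            tcg_gen_add_i32(t0, t0, t1);
            /* Store the lane back into the destination's memory location.  */
            tcg_gen_st_i32(t0, tcg_v128_base(res), tcg_v128_offset(res) + i * 4);
            tcg_temp_free_i32(t0);
            tcg_temp_free_i32(t1);
        }
    }
}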

>
>> I'm not sure about this last part. The host may not have i64, so there
>> should be another case - 4 x i32. So we'll get 4 cases for v128:
>
> Recursively you'd get 4 x i32, but at least they'll be tagged
> TCG_TYPE_I64, and be handled by the rest of the tcg code generator like
> it should be.
>
>>
>> v128
>> 2 x v64
>> 2 x i64
>> 4 x i32
>>
>> 3 cases will need to be added to tcg_temp_new_internal and
>> tcg_global_new_mem_internal, two of which are rather useless (2 x i64,
>> 4 x i32).
>> Introduction of v256 will add 4 more cases two of which will be useless
>> again. This sounds like too much code that serves no purpose to me.
>
> Useless?  Surely you mean "used by hosts that don't implement v128".

I meant that a host that doesn't support the v128 type will not use these
variables. It'll use their memory locations instead, so it does not
matter how we represent them. The only TCG code that'll see them is the
tcg_gen_<vector_op> wrappers, which know how to deal with them.

2 x v64 is a different story. We can generate much better emulation code
if we represent a v128 variable as a pair of v64 variables and work with
them as variables.

>
> I think one of us is very confused about how you intend to generate
> fallback code.  Perhaps any future patchset revision should update
> tcg/README first.
>

Sure, I'll do it in v2.
Richard Henderson Jan. 22, 2017, 7 a.m. UTC | #5
On 01/19/2017 08:54 AM, Kirill Batuzov wrote:
>
> Wrappers issue emulation code instead of operation if it is not supported by host.
>
> tcg_gen_add_i32x4 looks like this:
>
> if (TCG_TARGET_HAS_add_i32x4) {
>     tcg_gen_op3_v128(INDEX_op_add_i32x4, args[0], args[1], args[2]);
> } else {
>     for (i = 0; i < 4; i++) {
>         tcg_gen_ld_i32(...);
>         tcg_gen_ld_i32(...);
>         tcg_gen_add_i32(...);
>         tcg_gen_st_i32(...);
>     }
> }

To me that begs the question of why you wouldn't issue 4 adds on 4 i32 
registers instead.


r~
Kirill Batuzov Jan. 23, 2017, 10:30 a.m. UTC | #6
On Sat, 21 Jan 2017, Richard Henderson wrote:

> On 01/19/2017 08:54 AM, Kirill Batuzov wrote:
> > 
> > Wrappers issue emulation code instead of operation if it is not supported by
> > host.
> > 
> > tcg_gen_add_i32x4 looks like this:
> > 
> > if (TCG_TARGET_HAS_add_i32x4) {
> >     tcg_gen_op3_v128(INDEX_op_add_i32x4, args[0], args[1], args[2]);
> > } else {
> >     for (i = 0; i < 4; i++) {
> >         tcg_gen_ld_i32(...);
> >         tcg_gen_ld_i32(...);
> >         tcg_gen_add_i32(...);
> >         tcg_gen_st_i32(...);
> >     }
> > }
> 
> To me that begs the question of why you wouldn't issue 4 adds on 4 i32
> registers instead.
>

Because 4 adds on 4 i32 registers work well only when the size of the
vector elements matches the size of the scalar variables we use to
represent the vector. add_i16x8 will not be that great if we use
4 i32 variables: each will need to be split into two values, processed
independently and merged back afterwards. And when we create a variable we
do not know which operations will be performed on it.

Scalar variables lack primitives to work with them as vectors of shorter
values. This is one of the reasons I added a v64 type instead of using i64
for 64-bit vector operations. And this is the reason I'm so opposed to
using them to represent vector types if vector registers are not
supported by the host. Handling vector operations with an element size that
does not match the representation will be complicated, may require special
handling for different operations and will produce a lot of if-s in code.

The method I'm proposing can handle any operation regardless of
representation. This includes handling the situation where the host supports
vector registers but does not support the required operation (for example,
SSE/AVX does not support multiplication of vectors of 8-bit values).
Richard Henderson Jan. 23, 2017, 6:43 p.m. UTC | #7
On 01/23/2017 02:30 AM, Kirill Batuzov wrote:
> Because 4 adds on 4 i32 registers work good only when the size of
> vector elements matches the size of scalar variables we use for
> representation of a vector. add_i16x8 will not be that great if we use
> 4 i32 variables: each will need to be split into two values, processed
> independently and merged back afterwards.

Certainly.  But that's pretty much exactly how they are processed now.  Usually
via a helper function that accepts an i64 input as a pair of i32 arguments.

> Scalar variables lack primitives to work with them as vectors of shorter
> values. This is one of the reasons I added v64 type instead of using i64
> for 64-bit vector operations. And this is the reason I'm so opposed to
> using them to represent vector types if vector registers are not
> supported by host. Handling vector operations with element size that
> does not match representation will be complicated, may require special
> handling for different operations and will produce a lot of if-s in code.

A lot of if's?  I've no idea what you're talking about.

A v64 type makes sense because generally we're going to allocate them to a
different register set than i64.  That said, i64 is perfectly adequate for
implementing add_i8x8:

  t0  = in1 & 0x7f7f7f7f7f7f7f7f
  t1  = in0 + t0;
  t2  = in1 & 0x8080808080808080
  out = t1 ^ t2

This is less expensive than addition by pieces if there are at least 4 pieces.
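
In plain C, one common formulation of such a SWAR byte add (a sketch, not
code from the patchset) masks the high bit of every byte in both operands
so no carry can cross a byte boundary, then restores the high bits with an
xor:

#include <stdint.h>

static inline uint64_t add_i8x8(uint64_t a, uint64_t b)
{
    /* Per-byte sums of the low 7 bits stay below 0x100, so they cannot
       carry into the neighbouring byte.  */
    uint64_t low = (a & 0x7f7f7f7f7f7f7f7fULL) + (b & 0x7f7f7f7f7f7f7f7fULL);
    /* Fold the bytes' high bits back in without carry propagation.  */
    return low ^ ((a ^ b) & 0x8080808080808080ULL);
}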

> The method I'm proposing can handle any operation regardless of
> representation. This includes handling situation where host supports
> vector registers but does not support required operation (for example 
> SSE/AVX does not support multiplication of vectors of 8-bit values).

Not for nothing but it's trivial to expand with punpcklbw, punpckhbw, pmullw,
pand, packuswb.  That said, if an expansion gets too complicated, it's still
better to move it into a helper than expand 16 * (load, op, store).
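
For reference, a sketch of that expansion using SSE2 intrinsics (an assumed
lane-wise low-byte multiply, for illustration only, not code from this
patchset):

#include <emmintrin.h>

static inline __m128i mul_i8x16(__m128i a, __m128i b)
{
    __m128i zero = _mm_setzero_si128();
    __m128i mask = _mm_set1_epi16(0x00ff);
    /* Widen the low and high eight bytes of each operand to 16-bit lanes.  */
    __m128i alo = _mm_unpacklo_epi8(a, zero);
    __m128i ahi = _mm_unpackhi_epi8(a, zero);
    __m128i blo = _mm_unpacklo_epi8(b, zero);
    __m128i bhi = _mm_unpackhi_epi8(b, zero);
    /* 16-bit multiplies; keep only the low byte of each product.  */
    __m128i plo = _mm_and_si128(_mm_mullo_epi16(alo, blo), mask);
    __m128i phi = _mm_and_si128(_mm_mullo_epi16(ahi, bhi), mask);
    /* Narrow back to sixteen byte lanes.  */
    return _mm_packus_epi16(plo, phi);
}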


r~
Kirill Batuzov Jan. 24, 2017, 2:29 p.m. UTC | #8
On Mon, 23 Jan 2017, Richard Henderson wrote:

> On 01/23/2017 02:30 AM, Kirill Batuzov wrote:
> > Because 4 adds on 4 i32 registers work good only when the size of
> > vector elements matches the size of scalar variables we use for
> > representation of a vector. add_i16x8 will not be that great if we use
> > 4 i32 variables: each will need to be split into two values, processed
> > independently and merged back afterwards.
> 
> Certainly.  But that's pretty much exactly how they are processed now.  Usually
> via a helper function that accepts an i64 input as a pair of i32 arguments.
> 
> > Scalar variables lack primitives to work with them as vectors of shorter
> > values. This is one of the reasons I added v64 type instead of using i64
> > for 64-bit vector operations. And this is the reason I'm so opposed to
> > using them to represent vector types if vector registers are not
> > supported by host. Handling vector operations with element size that
> > does not match representation will be complicated, may require special
> > handling for different operations and will produce a lot of if-s in code.
> 
> A lot of if's?  I've no idea what you're talking about.
> 
> A v64 type makes sense because generally we're going to allocate them to a
> different register set than i64.  That said, i64 is perfectly adequate for
> implementing add_i8x8:
> 
>   t0  = in1 & 0x7f7f7f7f7f7f7f7f
>   t1  = in0 + t0;
>   t2  = in1 & 0x8080808080808080
>   out = t1 ^ t2
> 
> This is less expensive than addition by pieces if there are at least 4 pieces.
> 
> > The method I'm proposing can handle any operation regardless of
> > representation. This includes handling situation where host supports
> > vector registers but does not support required operation (for example 
> > SSE/AVX does not support multiplication of vectors of 8-bit values).
> 
> Not for nothing but it's trivial to expand with punpcklbw, punpckhbw, pmullw,
> pand, packuswb.  That said, if an expansion gets too complicated, it's still
> better to move it into a helper than expand 16 * (load, op, store).
>

I'm a bit lost in the discussion so let me try to summarise. As far as I
understand there is only one major point on which we disagree: is it
worth representing vector variables as sequences of scalar ones?

Pros:
1. We will not get phantom variables of an unsupported type like we do in
my current implementation.

2. If we manage to efficiently emulate a large enough number of vector
operations using scalar types we'll get some performance benefits. In
this case scalar variables can be allocated to registers and stay there
across several consecutive guest instructions.

I personally doubt that first "if": logical operations will be fine,
addition and subtraction can be implemented, maybe shifts, but
everything else will end up as helpers (and they are expensive
from a performance perspective).

Cons:
1. Additional cases for each possible representation in
tcg_global_mem_new_internal and tcg_temp_new_internal. I do not see how I
can reuse the existing "i64 as a pair of i32" mechanism recursively. TCG
supports only one level of indirection: there is a "type" of variable, and
a "base_type" it is used to represent. The i64 code does not check
"base_type" explicitly, so if I pass two consecutive i32 variables to these
functions they will work, but this sounds like a dirty hack to me.

2. Additional cases for each possible representation in the
tcg_gen_<vector_op> wrappers. We need to generate adequate expansion
code for each representation. That is, if we do not default to the memory
location every time (in which case, why bother with different
representations to begin with).

3. TCG variable exhaustion: to (potentially) represent the AVX-512
registers with 32-bit variables we'll need 512 of them (32 registers of
512 bits each). TCG_MAX_TEMPS is 512. Sure, it can be increased.

Making something a global variable is only beneficial when we can carry
its value in a register from one operation to another (so we'll get
ld+op1+op2+st instead of ld+op1+st+ld+op2+st). I'm not sure that the subset
of operations we can effectively emulate is large enough for this to
happen often, but my experience with vector operations is limited so it
might be.

Let's do the following: in v2 I'll add a representation of v128 as a pair
of v64 and update the tcg_gen_<vector_op> wrappers. We'll see how this works
out and decide whether it is worth following up with a representation of
v128 as a sequence of scalar types.

Patch

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 6d044b7..df077d6 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -248,6 +248,23 @@  static inline void tcg_gen_op6ii_i64(TCGOpcode opc, TCGv_i64 a1, TCGv_i64 a2,
                 GET_TCGV_I64(a3), GET_TCGV_I64(a4), a5, a6);
 }
 
+static inline void tcg_gen_op1_v128(TCGOpcode opc, TCGv_v128 a1)
+{
+    tcg_gen_op1(&tcg_ctx, opc, GET_TCGV_V128(a1));
+}
+
+static inline void tcg_gen_op2_v128(TCGOpcode opc, TCGv_v128 a1,
+                                    TCGv_v128 a2)
+{
+    tcg_gen_op2(&tcg_ctx, opc, GET_TCGV_V128(a1), GET_TCGV_V128(a2));
+}
+
+static inline void tcg_gen_op3_v128(TCGOpcode opc, TCGv_v128 a1,
+                                    TCGv_v128 a2, TCGv_v128 a3)
+{
+    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_V128(a1), GET_TCGV_V128(a2),
+                GET_TCGV_V128(a3));
+}
 
 /* Generic ops.  */
 
@@ -442,6 +459,13 @@  static inline void tcg_gen_not_i32(TCGv_i32 ret, TCGv_i32 arg)
     }
 }
 
+/* Vector ops */
+
+static inline void tcg_gen_discard_v128(TCGv_v128 arg)
+{
+    tcg_gen_op1_v128(INDEX_op_discard, arg);
+}
+
 /* 64 bit ops */
 
 void tcg_gen_addi_i64(TCGv_i64 ret, TCGv_i64 arg1, int64_t arg2);
diff --git a/tcg/tcg.c b/tcg/tcg.c
index aabf94f..b20a044 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -637,6 +637,14 @@  TCGv_i64 tcg_temp_new_internal_i64(int temp_local)
     return MAKE_TCGV_I64(idx);
 }
 
+TCGv_v128 tcg_temp_new_internal_v128(int temp_local)
+{
+    int idx;
+
+    idx = tcg_temp_new_internal(TCG_TYPE_V128, temp_local);
+    return MAKE_TCGV_V128(idx);
+}
+
 static void tcg_temp_free_internal(int idx)
 {
     TCGContext *s = &tcg_ctx;
@@ -669,6 +677,11 @@  void tcg_temp_free_i64(TCGv_i64 arg)
     tcg_temp_free_internal(GET_TCGV_I64(arg));
 }
 
+void tcg_temp_free_v128(TCGv_v128 arg)
+{
+    tcg_temp_free_internal(GET_TCGV_V128(arg));
+}
+
 TCGv_i32 tcg_const_i32(int32_t val)
 {
     TCGv_i32 t0;
diff --git a/tcg/tcg.h b/tcg/tcg.h
index a35e4c4..b9aa56b 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -235,6 +235,7 @@  typedef struct TCGPool {
 typedef enum TCGType {
     TCG_TYPE_I32,
     TCG_TYPE_I64,
+    TCG_TYPE_V128,
     TCG_TYPE_COUNT, /* number of different types */
 
     /* An alias for the size of the host register.  */
@@ -410,6 +411,7 @@  typedef tcg_target_ulong TCGArg;
 typedef struct TCGv_i32_d *TCGv_i32;
 typedef struct TCGv_i64_d *TCGv_i64;
 typedef struct TCGv_ptr_d *TCGv_ptr;
+typedef struct TCGv_v128_d *TCGv_v128;
 typedef TCGv_ptr TCGv_env;
 #if TARGET_LONG_BITS == 32
 #define TCGv TCGv_i32
@@ -434,6 +436,11 @@  static inline TCGv_ptr QEMU_ARTIFICIAL MAKE_TCGV_PTR(intptr_t i)
     return (TCGv_ptr)i;
 }
 
+static inline TCGv_v128 QEMU_ARTIFICIAL MAKE_TCGV_V128(intptr_t i)
+{
+    return (TCGv_v128)i;
+}
+
 static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_I32(TCGv_i32 t)
 {
     return (intptr_t)t;
@@ -449,6 +456,11 @@  static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_PTR(TCGv_ptr t)
     return (intptr_t)t;
 }
 
+static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_V128(TCGv_v128 t)
+{
+    return (intptr_t)t;
+}
+
 #if TCG_TARGET_REG_BITS == 32
 #define TCGV_LOW(t) MAKE_TCGV_I32(GET_TCGV_I64(t))
 #define TCGV_HIGH(t) MAKE_TCGV_I32(GET_TCGV_I64(t) + 1)
@@ -456,15 +468,18 @@  static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_PTR(TCGv_ptr t)
 
 #define TCGV_EQUAL_I32(a, b) (GET_TCGV_I32(a) == GET_TCGV_I32(b))
 #define TCGV_EQUAL_I64(a, b) (GET_TCGV_I64(a) == GET_TCGV_I64(b))
+#define TCGV_EQUAL_V128(a, b) (GET_TCGV_V128(a) == GET_TCGV_V128(b))
 #define TCGV_EQUAL_PTR(a, b) (GET_TCGV_PTR(a) == GET_TCGV_PTR(b))
 
 /* Dummy definition to avoid compiler warnings.  */
 #define TCGV_UNUSED_I32(x) x = MAKE_TCGV_I32(-1)
 #define TCGV_UNUSED_I64(x) x = MAKE_TCGV_I64(-1)
+#define TCGV_UNUSED_V128(x) x = MAKE_TCGV_V128(-1)
 #define TCGV_UNUSED_PTR(x) x = MAKE_TCGV_PTR(-1)
 
 #define TCGV_IS_UNUSED_I32(x) (GET_TCGV_I32(x) == -1)
 #define TCGV_IS_UNUSED_I64(x) (GET_TCGV_I64(x) == -1)
+#define TCGV_IS_UNUSED_V128(x) (GET_TCGV_V128(x) == -1)
 #define TCGV_IS_UNUSED_PTR(x) (GET_TCGV_PTR(x) == -1)
 
 /* call flags */
@@ -787,9 +802,11 @@  TCGv_i64 tcg_global_reg_new_i64(TCGReg reg, const char *name);
 
 TCGv_i32 tcg_temp_new_internal_i32(int temp_local);
 TCGv_i64 tcg_temp_new_internal_i64(int temp_local);
+TCGv_v128 tcg_temp_new_internal_v128(int temp_local);
 
 void tcg_temp_free_i32(TCGv_i32 arg);
 void tcg_temp_free_i64(TCGv_i64 arg);
+void tcg_temp_free_v128(TCGv_v128 arg);
 
 static inline TCGv_i32 tcg_global_mem_new_i32(TCGv_ptr reg, intptr_t offset,
                                               const char *name)
@@ -825,6 +842,23 @@  static inline TCGv_i64 tcg_temp_local_new_i64(void)
     return tcg_temp_new_internal_i64(1);
 }
 
+static inline TCGv_v128 tcg_global_mem_new_v128(TCGv_ptr reg, intptr_t offset,
+                                                const char *name)
+{
+    int idx = tcg_global_mem_new_internal(TCG_TYPE_V128, reg, offset, name);
+    return MAKE_TCGV_V128(idx);
+}
+
+static inline TCGv_v128 tcg_temp_new_v128(void)
+{
+    return tcg_temp_new_internal_v128(0);
+}
+
+static inline TCGv_v128 tcg_temp_local_new_v128(void)
+{
+    return tcg_temp_new_internal_v128(1);
+}
+
 #if defined(CONFIG_DEBUG_TCG)
 /* If you call tcg_clear_temp_count() at the start of a section of
  * code which is not supposed to leak any TCG temporaries, then