Patchwork target-arm: Report unimplemented opcodes (LOG_UNIMP)

login
register
mail settings
Submitter Stefan Weil
Date Aug. 28, 2013, 4:39 a.m.
Message ID <1377664796-11698-1-git-send-email-sw@weilnetz.de>
Download mbox | patch
Permalink /patch/270326/
State Accepted
Headers show

Comments

Stefan Weil - Aug. 28, 2013, 4:39 a.m.
These unimplemented opcodes are handled like illegal opcodes, but
they are used in existing code. We should at least report when they
are executed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
---

When running a QEMU system emulation of an ARM system
(Raspberry PI), Linux booted, but when I tried to run a
user session, it terminated without error message.

It took me some time to see that bash got an illegal
instruction exception. It was caused by ARM opcode 'setend'
which is not implemented in QEMU's ARM emulation.
The patch should help detecting similar scenarios in
the future.

Raspberry PI uses 'setend' in an optimized version of
memcmp, so lots of other executables also fail with QEMU.

As a workaround, the preloading of that optimized code
can be removed. Of course an improved QEMU emulation
would be better.

Regards,
Stefan

 target-arm/translate.c |    4 ++++
 1 file changed, 4 insertions(+)
Peter Maydell - Aug. 28, 2013, 8:15 a.m.
On 28 August 2013 05:39, Stefan Weil <sw@weilnetz.de> wrote:
> These unimplemented opcodes are handled like illegal opcodes, but
> they are used in existing code. We should at least report when they
> are executed.

Yeah, seems reasonable. (There might be other unimplemented
bits lurking too but we can add logging when we find them.)

> Signed-off-by: Stefan Weil <sw@weilnetz.de>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

> When running a QEMU system emulation of an ARM system
> (Raspberry PI), Linux booted, but when I tried to run a
> user session, it terminated without error message.
>
> It took me some time to see that bash got an illegal
> instruction exception. It was caused by ARM opcode 'setend'
> which is not implemented in QEMU's ARM emulation.
> The patch should help detecting similar scenarios in
> the future.
>
> Raspberry PI uses 'setend' in an optimized version of
> memcmp, so lots of other executables also fail with QEMU.
>
> As a workaround, the preloading of that optimized code
> can be removed. Of course an improved QEMU emulation
> would be better.

setend is kinda hard to emulate[*] and it's generally not used
by anything (except this weird memcmp implementation).
I'm inclined to think this "optimized" version of memcmp is a
bad idea -- I think setend is expensive on hardware, and I've
never seen it suggested by any of the toolchain folks who
do work on optimised versions of string routines. OTOH I
haven't run any benchmarks and hopefully whoever wrote
that rpi code did.

[*] not impossible, we already do something on the ppc
that's similar; however I'd really want to take the time to
figure out how to do endianness swapping "properly"
and what qemu does currently before messing with it.

-- PMM
Richard Henderson - Aug. 28, 2013, 2:31 p.m.
On 08/28/2013 01:15 AM, Peter Maydell wrote:
> [*] not impossible, we already do something on the ppc
> that's similar; however I'd really want to take the time to
> figure out how to do endianness swapping "properly"
> and what qemu does currently before messing with it.

I've got a loose plan in my head for how to clean up handling of
reverse-endian load/store instructions at both the translator and
tcg backend levels.

Hopefully it make the ppc, sparc, and s390 targets cleaner.
I would think that would equally apply to the arm setend.


r~
Peter Maydell - Aug. 28, 2013, 2:34 p.m.
On 28 August 2013 15:31, Richard Henderson <rth@twiddle.net> wrote:
> On 08/28/2013 01:15 AM, Peter Maydell wrote:
>> [*] not impossible, we already do something on the ppc
>> that's similar; however I'd really want to take the time to
>> figure out how to do endianness swapping "properly"
>> and what qemu does currently before messing with it.
>
> I've got a loose plan in my head for how to clean up handling of
> reverse-endian load/store instructions at both the translator and
> tcg backend levels.

Nice. Will it allow us to get rid of TARGET_WORDS_BIGENDIAN?

-- PMM
Richard Henderson - Aug. 28, 2013, 3:26 p.m.
On 08/28/2013 07:34 AM, Peter Maydell wrote:
> On 28 August 2013 15:31, Richard Henderson <rth@twiddle.net> wrote:
>> On 08/28/2013 01:15 AM, Peter Maydell wrote:
>>> [*] not impossible, we already do something on the ppc
>>> that's similar; however I'd really want to take the time to
>>> figure out how to do endianness swapping "properly"
>>> and what qemu does currently before messing with it.
>>
>> I've got a loose plan in my head for how to clean up handling of
>> reverse-endian load/store instructions at both the translator and
>> tcg backend levels.
> 
> Nice. Will it allow us to get rid of TARGET_WORDS_BIGENDIAN?

I don't know, as I don't know off-hand what all that implies.

Let me lay out my idea and see what you think:

Currently, at the TCG level we have 8 qemu_ld* opcodes, and 4 qemu_st* opcodes,
that always produce target_ulong sized results, and always in the guest
declared endianness.

There are several problems I want to address:

(1) I want explicit _i32 and _i64 sizes for the loads and stores.  This will
clean up a number of places in several translators where we have to load to _tl
and then truncate or extend to an explicit size.

(2) I want explicit endianness for the loads and stores.  E.g. when a sparc
guest does a byte-swapped store, there's little point in doing two offsetting
bswaps to make that happen.

(3) For hosts that do not support byte-swapped loads and stores themselves, the
need to allocate extra registers during the memory operation in order to  hold
the swapped results is an unnecessary burden.  Better to expose the bswap
operation at the tcg opcode level and let normal register allocation happen.

Now, naively implementing 1 and 2 would result in 32 opcodes for qemu_ld*. That
is obviously a non-starter.  However, the very first thing that each tcg
backend does is map the current 8 opcodes into a bitmask ("opc" and "s_bits"
in the source).  Let us make that official, and then extend it.

Therefore:

(A) Compress qemu_ld* into two qemu_ld_{i32,i64}, with an additional constant
argument that describes the actual load, exactly as "opc" does today.
Adjusting the translators to match can be done in stages, or we might decide to
leave the existing translator-level interface in place permanently.

(B) Add an additional bit to the "opc" to indicate which endianness is desired.
 E.g. 0 = LE, 8 = BE.  Expose the opc interface to the translators.  At which
point generating a load becomes more like

    tcg_gen_qemu_ld_tl(dest, addr, size | sign | dc->big_endian);

and the current endianness of the guest becomes a bit on the TB, to be copied
into the DisasContext at the beginning of translation.

(C) Examine the endian bit in the tcg-op.h expander, and check a
TCG_TARGET_HAS_foo flag to see if the tcg backend supports reverse endian
memory ops.  If not, break out the bswap into the opcode stream as a temporary.

The corollary here is that we must have a full set of bi-endian tcg helper
functions.  At the moment, the helper functions are all keyed to the hard-coded
guest endianness.  That means the typical LE/BE host/guest memory op looks like

	if (tlb hit) {
            t = bswap(data);
            store t;
        } else {
            helper_store_be(data);
        }

If we hoist the bswap it'll need to be

	t = bswap(data);
	if (tlb hit) {
	    store t;
	} else {
	    helper_store_le(t);
	}

(D) Profit!  I'm not sure what will be left of TARGET_WORDS_BIGENDIAN at this
point.  Possibly only if we leave the current translator interface in place in
step A.



r~
Peter Maydell - Aug. 28, 2013, 4:38 p.m.
On 28 August 2013 16:26, Richard Henderson <rth@twiddle.net> wrote:
> On 08/28/2013 07:34 AM, Peter Maydell wrote:
>> On 28 August 2013 15:31, Richard Henderson <rth@twiddle.net> wrote:
>>> On 08/28/2013 01:15 AM, Peter Maydell wrote:
>>>> [*] not impossible, we already do something on the ppc
>>>> that's similar; however I'd really want to take the time to
>>>> figure out how to do endianness swapping "properly"
>>>> and what qemu does currently before messing with it.
>>>
>>> I've got a loose plan in my head for how to clean up handling of
>>> reverse-endian load/store instructions at both the translator and
>>> tcg backend levels.
>>
>> Nice. Will it allow us to get rid of TARGET_WORDS_BIGENDIAN?
>
> I don't know, as I don't know off-hand what all that implies.
>
> Let me lay out my idea and see what you think:
>
> Currently, at the TCG level we have 8 qemu_ld* opcodes, and 4 qemu_st* opcodes,
> that always produce target_ulong sized results, and always in the guest
> declared endianness.
>
> There are several problems I want to address:
>
> (1) I want explicit _i32 and _i64 sizes for the loads and stores.  This will
> clean up a number of places in several translators where we have to load to _tl
> and then truncate or extend to an explicit size.
>
> (2) I want explicit endianness for the loads and stores.  E.g. when a sparc
> guest does a byte-swapped store, there's little point in doing two offsetting
> bswaps to make that happen.

I like both of these. (Do we have much code that relies on being able
to do "just load whatever the natural register width is " ? In any case,
we can rewrite that without too much trouble I think.)

> (3) For hosts that do not support byte-swapped loads and stores themselves, the
> need to allocate extra registers during the memory operation in order to  hold
> the swapped results is an unnecessary burden.  Better to expose the bswap
> operation at the tcg opcode level and let normal register allocation happen.
>
> Now, naively implementing 1 and 2 would result in 32 opcodes for qemu_ld*. That
> is obviously a non-starter.  However, the very first thing that each tcg
> backend does is map the current 8 opcodes into a bitmask ("opc" and "s_bits"
> in the source).  Let us make that official, and then extend it.
>
> Therefore:
>
> (A) Compress qemu_ld* into two qemu_ld_{i32,i64}, with an additional constant
> argument that describes the actual load, exactly as "opc" does today.
> Adjusting the translators to match can be done in stages, or we might decide to
> leave the existing translator-level interface in place permanently.

I think I'd prefer to remove the old interface; having multiple
ways to do something is a problem we have across the codebase.

> (B) Add an additional bit to the "opc" to indicate which endianness is desired.
>  E.g. 0 = LE, 8 = BE.  Expose the opc interface to the translators.  At which
> point generating a load becomes more like
>
>     tcg_gen_qemu_ld_tl(dest, addr, size | sign | dc->big_endian);
>
> and the current endianness of the guest becomes a bit on the TB, to be copied
> into the DisasContext at the beginning of translation.

I guess we deal with ARMv5-style BE32 by having the target
emit an explicit XOR TCG op?

> (C) Examine the endian bit in the tcg-op.h expander, and check a
> TCG_TARGET_HAS_foo flag to see if the tcg backend supports reverse endian
> memory ops.  If not, break out the bswap into the opcode stream as a temporary.
>
> The corollary here is that we must have a full set of bi-endian tcg helper
> functions.  At the moment, the helper functions are all keyed to the hard-coded
> guest endianness.  That means the typical LE/BE host/guest memory op looks like
>
>         if (tlb hit) {
>             t = bswap(data);
>             store t;
>         } else {
>             helper_store_be(data);
>         }
>
> If we hoist the bswap it'll need to be
>
>         t = bswap(data);
>         if (tlb hit) {
>             store t;
>         } else {
>             helper_store_le(t);
>         }

Do we need to overhaul the C interface to the
memory system too? (ie ldl_p and friends).

> (D) Profit!  I'm not sure what will be left of TARGET_WORDS_BIGENDIAN at this
> point.  Possibly only if we leave the current translator interface in place in
> step A.

I think there are a number of devices and boards which use it
as a convenient shortcut, but we can fix those -- the TCG
reliance on knowing about the target endianness is the
hard part of the problem, I think.

-- PMM
Richard Henderson - Aug. 28, 2013, 5:16 p.m.
On 08/28/2013 09:38 AM, Peter Maydell wrote:
>> (B) Add an additional bit to the "opc" to indicate which endianness is desired.
>>  E.g. 0 = LE, 8 = BE.  Expose the opc interface to the translators.  At which
>> point generating a load becomes more like
>>
>>     tcg_gen_qemu_ld_tl(dest, addr, size | sign | dc->big_endian);
>>
>> and the current endianness of the guest becomes a bit on the TB, to be copied
>> into the DisasContext at the beginning of translation.
> 
> I guess we deal with ARMv5-style BE32 by having the target
> emit an explicit XOR TCG op?

Yes.  I see no other way to implement that.

> Do we need to overhaul the C interface to the
> memory system too? (ie ldl_p and friends).

I don't think so, since we already have ldl_{le,be}_p.


r~
Peter Maydell - Aug. 28, 2013, 5:28 p.m.
On 28 August 2013 18:16, Richard Henderson <rth@twiddle.net> wrote:
> On 08/28/2013 09:38 AM, Peter Maydell wrote:
>>> (B) Add an additional bit to the "opc" to indicate which endianness is desired.
>>>  E.g. 0 = LE, 8 = BE.  Expose the opc interface to the translators.  At which
>>> point generating a load becomes more like
>>>
>>>     tcg_gen_qemu_ld_tl(dest, addr, size | sign | dc->big_endian);
>>>
>>> and the current endianness of the guest becomes a bit on the TB, to be copied
>>> into the DisasContext at the beginning of translation.
>>
>> I guess we deal with ARMv5-style BE32 by having the target
>> emit an explicit XOR TCG op?
>
> Yes.  I see no other way to implement that.
>
>> Do we need to overhaul the C interface to the
>> memory system too? (ie ldl_p and friends).
>
> I don't think so, since we already have ldl_{le,be}_p.

Well, what do ldl_p or ldl_phys or any of the other functions
without an le/be qualifier mean any more if "is this CPU in
big endian mode?" now requires you to have a CPUState
to ask? I guess we can tackle that separately from getting
the assumptions on endianness out of the tcg backends
though.

-- PMM
Stefan Weil - Aug. 28, 2013, 5:41 p.m.
Am 28.08.2013 18:38, schrieb Peter Maydell:
> On 28 August 2013 16:26, Richard Henderson <rth@twiddle.net> wrote:
>
>> (D) Profit!  I'm not sure what will be left of TARGET_WORDS_BIGENDIAN at this
>> point.  Possibly only if we leave the current translator interface in place in
>> step A.
> I think there are a number of devices and boards which use it
> as a convenient shortcut, but we can fix those -- the TCG
> reliance on knowing about the target endianness is the
> hard part of the problem, I think.
>
> -- PMM

In my personal code base (git://repo.or.cz/qemu/ar7.git), I had removed
lots of TARGET_WORDS_BIGENDIAN some years ago when I started
code changes to combine MIPS big and little endian in one executable
(still unfinished :-(, sorry).

I added a 'bool bigendian' to CPUArchState.

Richard, MIPS can be added to your list of architectures which might
profit from your proposed changes.

Regards,
Stefan
Richard Henderson - Aug. 28, 2013, 5:45 p.m.
On 08/28/2013 10:28 AM, Peter Maydell wrote:
> Well, what do ldl_p or ldl_phys or any of the other functions
> without an le/be qualifier mean any more if "is this CPU in
> big endian mode?" now requires you to have a CPUState
> to ask? I guess we can tackle that separately from getting
> the assumptions on endianness out of the tcg backends
> though.

ldl_p is the host native endian unaligned load primitive.

ldl_phys is the one that'll need to be adjusted or eliminated.

>From a quick glance, this doesn't look too bad.  Most of the uses are in the
target-* directories already, and mostly for emulating page tables.  Those
should be trivial to adjust as needed.

The troublesome offender appears to be virtio.  I know some discussion of
cleaning that up has happened in the context of LE PowerPC, but I didn't really
follow what went on there.


r~
Edgar Iglesias - Aug. 28, 2013, 8:42 p.m.
On Wed, Aug 28, 2013 at 08:26:43AM -0700, Richard Henderson wrote:
> On 08/28/2013 07:34 AM, Peter Maydell wrote:
> > On 28 August 2013 15:31, Richard Henderson <rth@twiddle.net> wrote:
> >> On 08/28/2013 01:15 AM, Peter Maydell wrote:
> >>> [*] not impossible, we already do something on the ppc
> >>> that's similar; however I'd really want to take the time to
> >>> figure out how to do endianness swapping "properly"
> >>> and what qemu does currently before messing with it.
> >>
> >> I've got a loose plan in my head for how to clean up handling of
> >> reverse-endian load/store instructions at both the translator and
> >> tcg backend levels.
> > 
> > Nice. Will it allow us to get rid of TARGET_WORDS_BIGENDIAN?
> 
> I don't know, as I don't know off-hand what all that implies.
> 
> Let me lay out my idea and see what you think:
> 
> Currently, at the TCG level we have 8 qemu_ld* opcodes, and 4 qemu_st* opcodes,
> that always produce target_ulong sized results, and always in the guest
> declared endianness.
> 
> There are several problems I want to address:
> 
> (1) I want explicit _i32 and _i64 sizes for the loads and stores.  This will
> clean up a number of places in several translators where we have to load to _tl
> and then truncate or extend to an explicit size.
> 
> (2) I want explicit endianness for the loads and stores.  E.g. when a sparc
> guest does a byte-swapped store, there's little point in doing two offsetting
> bswaps to make that happen.
> 
> (3) For hosts that do not support byte-swapped loads and stores themselves, the
> need to allocate extra registers during the memory operation in order to  hold
> the swapped results is an unnecessary burden.  Better to expose the bswap
> operation at the tcg opcode level and let normal register allocation happen.
> 
> Now, naively implementing 1 and 2 would result in 32 opcodes for qemu_ld*. That
> is obviously a non-starter.  However, the very first thing that each tcg
> backend does is map the current 8 opcodes into a bitmask ("opc" and "s_bits"
> in the source).  Let us make that official, and then extend it.


Hi,

I like what you propose aswell. A question, some archs have an endian swap
controlled via the MMU, e.g per page selectable (some PPC, microblaze and
maybe others). AFAIK the behaviour is implementable in QEMU today but not
very efficiently. Any thoughts/ideas on this?

Best regards,
Edgar
Peter Maydell - Aug. 28, 2013, 9:06 p.m.
On 28 August 2013 21:42, Edgar E. Iglesias <edgar.iglesias@gmail.com> wrote:
> I like what you propose aswell. A question, some archs have an endian swap
> controlled via the MMU, e.g per page selectable

It seems to be a rule that no matter how weird a concept in CPU architecture,
some CPU designer will have implemented a weirder variation on it.

-- PMM
Richard Henderson - Aug. 28, 2013, 9:23 p.m.
On 08/28/2013 01:42 PM, Edgar E. Iglesias wrote:
> A question, some archs have an endian swap
> controlled via the MMU, e.g per page selectable (some PPC, microblaze and
> maybe others). AFAIK the behaviour is implementable in QEMU today but not
> very efficiently. Any thoughts/ideas on this?

The only thing I could imagine doing on a page-by-page basis like
this is to bring this feature into the qemu page table.

We'd have to hang it under the same general scheme as an IO access,
where all reads/writes to the page are forced through the helper.

I'd imagine that it would get handled similarly to io_mem_rom, but
with different accessors.


r~
Michael Tokarev - Sept. 1, 2013, 3:35 p.m.
28.08.2013 08:39, Stefan Weil wrote:
> These unimplemented opcodes are handled like illegal opcodes, but
> they are used in existing code. We should at least report when they
> are executed.

Thanks, applied to the trivial-patches queue.

/mjt
Aurelien Jarno - Sept. 2, 2013, 11:42 p.m.
On Wed, Aug 28, 2013 at 08:26:43AM -0700, Richard Henderson wrote:
> On 08/28/2013 07:34 AM, Peter Maydell wrote:
> > On 28 August 2013 15:31, Richard Henderson <rth@twiddle.net> wrote:
> >> On 08/28/2013 01:15 AM, Peter Maydell wrote:
> >>> [*] not impossible, we already do something on the ppc
> >>> that's similar; however I'd really want to take the time to
> >>> figure out how to do endianness swapping "properly"
> >>> and what qemu does currently before messing with it.
> >>
> >> I've got a loose plan in my head for how to clean up handling of
> >> reverse-endian load/store instructions at both the translator and
> >> tcg backend levels.
> > 
> > Nice. Will it allow us to get rid of TARGET_WORDS_BIGENDIAN?
> 
> I don't know, as I don't know off-hand what all that implies.
> 
> Let me lay out my idea and see what you think:
> 
> Currently, at the TCG level we have 8 qemu_ld* opcodes, and 4 qemu_st* opcodes,
> that always produce target_ulong sized results, and always in the guest
> declared endianness.
> 
> There are several problems I want to address:
> 
> (1) I want explicit _i32 and _i64 sizes for the loads and stores.  This will
> clean up a number of places in several translators where we have to load to _tl
> and then truncate or extend to an explicit size.

I guess you mean there that it would still be possible to do a
qemu_ld32u for a _i64 size? 

Also it should be the moment to clean the big mess with qemu_ld32 for
32-bit guests vs qemu_ld32/qemu_ld32u/qemu_ld32s for 64-bit guests.

> (2) I want explicit endianness for the loads and stores.  E.g. when a sparc
> guest does a byte-swapped store, there's little point in doing two offsetting
> bswaps to make that happen.

That's indeed something which would be nice to fix. This is also the
case of powerpc which has a byte-swapped ld/st instruction.

> (3) For hosts that do not support byte-swapped loads and stores themselves, the
> need to allocate extra registers during the memory operation in order to  hold
> the swapped results is an unnecessary burden.  Better to expose the bswap
> operation at the tcg opcode level and let normal register allocation happen.

I don't fully agree with that point. For load ops, the byte swap is
basically done in place, not using any additional register. For store
ops, the bswap has to be done before, but if the value to be stored is
not used later in the TB, no additional register is used.

> Now, naively implementing 1 and 2 would result in 32 opcodes for qemu_ld*. That
> is obviously a non-starter.  However, the very first thing that each tcg
> backend does is map the current 8 opcodes into a bitmask ("opc" and "s_bits"
> in the source).  Let us make that official, and then extend it.

Agreed.

> Therefore:
> 
> (A) Compress qemu_ld* into two qemu_ld_{i32,i64}, with an additional constant
> argument that describes the actual load, exactly as "opc" does today.
> Adjusting the translators to match can be done in stages, or we might decide to
> leave the existing translator-level interface in place permanently.

I think this can be done quite quickly, as the conversion is basically a
matter of find and replace while you have identified if _tl is _i32 or
_i64. That would left a few non-optimized cases like loads with _i32 
later converted to _i64, but that's better than having two interfaces.

> (B) Add an additional bit to the "opc" to indicate which endianness is desired.
>  E.g. 0 = LE, 8 = BE.  Expose the opc interface to the translators.  At which
> point generating a load becomes more like
> 
>     tcg_gen_qemu_ld_tl(dest, addr, size | sign | dc->big_endian);
> 
> and the current endianness of the guest becomes a bit on the TB, to be copied
> into the DisasContext at the beginning of translation.

We should probably define short constants to define that, and why not
the 32 possible constants in a few letters.

Also the question is do we want to use little-endian/big-endian or
native-endian/cross-endian?

> (C) Examine the endian bit in the tcg-op.h expander, and check a
> TCG_TARGET_HAS_foo flag to see if the tcg backend supports reverse endian
> memory ops.  If not, break out the bswap into the opcode stream as a temporary.
> 
> The corollary here is that we must have a full set of bi-endian tcg helper
> functions.  At the moment, the helper functions are all keyed to the hard-coded
> guest endianness.  That means the typical LE/BE host/guest memory op looks like
> 
> 	if (tlb hit) {
>             t = bswap(data);
>             store t;
>         } else {
>             helper_store_be(data);
>         }
> 
> If we hoist the bswap it'll need to be
> 
> 	t = bswap(data);
> 	if (tlb hit) {
> 	    store t;
> 	} else {
> 	    helper_store_le(t);
> 	}

I am not sure it is worth adding additional complexity (and difference
between targets) there, it's basically writing the code corresponding
to qemu_ld* code in each TCG target.

All targets already support cross-endianness, as they can run both
little and big endian guests. Minimum tweaks have to be done to support
both endianness (basically converting a few #ifdef into if). The problem
is at the helper level, we need to either bswap the value before/after
calling the helper, or define more versions of the helper to handle
big/little endian.

I would personally go for more helpers, as it would keep the slow path
small, and also means less modification in the target code.

> (D) Profit!  I'm not sure what will be left of TARGET_WORDS_BIGENDIAN at this
> point.  Possibly only if we leave the current translator interface in place in
> step A.

At the TCG level, almost nothing, but probably still a lot at the
board/devices level.
Richard Henderson - Sept. 3, 2013, 3:11 p.m.
On 09/02/2013 04:42 PM, Aurelien Jarno wrote:
> On Wed, Aug 28, 2013 at 08:26:43AM -0700, Richard Henderson wrote:
>> (1) I want explicit _i32 and _i64 sizes for the loads and stores.  This will
>> clean up a number of places in several translators where we have to load to _tl
>> and then truncate or extend to an explicit size.
> 
> I guess you mean there that it would still be possible to do a
> qemu_ld32u for a _i64 size? 

Of course.

> Also it should be the moment to clean the big mess with qemu_ld32 for
> 32-bit guests vs qemu_ld32/qemu_ld32u/qemu_ld32s for 64-bit guests.

Yes, I would think that would happen more or less automatically by having
the separate _i32 and _i64 opcodes.

>> (2) I want explicit endianness for the loads and stores.  E.g. when a sparc
>> guest does a byte-swapped store, there's little point in doing two offsetting
>> bswaps to make that happen.
> 
> That's indeed something which would be nice to fix. This is also the
> case of powerpc which has a byte-swapped ld/st instruction.

And s390, and x86...  ;-)

>> (3) For hosts that do not support byte-swapped loads and stores themselves, the
>> need to allocate extra registers during the memory operation in order to  hold
>> the swapped results is an unnecessary burden.  Better to expose the bswap
>> operation at the tcg opcode level and let normal register allocation happen.
> 
> I don't fully agree with that point. For load ops, the byte swap is
> basically done in place, not using any additional register. For store
> ops, the bswap has to be done before, but if the value to be stored is
> not used later in the TB, no additional register is used.

And if split apart, the bswap would still generally be done in place, just
because that's what register allocation will tend to do.

Consider the register constraints on the store op.  E.g. arm:

    /* qemu_st address & data_reg */
    case 's':
        ct->ct |= TCG_CT_REG;
        tcg_regset_set32(ct->u.regs, 0, (1 << TCG_TARGET_NB_REGS) - 1);
        /* r0-r2 will be overwritten when reading the tlb entry (softmmu only)
           and r0-r1 doing the byte swapping, so don't use these. */
        tcg_regset_reset_reg(ct->u.regs, TCG_REG_R0);
        tcg_regset_reset_reg(ct->u.regs, TCG_REG_R1);

If we split out the bswap, then user mode doesn't have to reserve r0/r1,
the register allocator will just DTRT.  Similarly for i386.

> I think this can be done quite quickly, as the conversion is basically a
> matter of find and replace while you have identified if _tl is _i32 or
> _i64. That would left a few non-optimized cases like loads with _i32 
> later converted to _i64, but that's better than having two interfaces.

Fair enough.

> We should probably define short constants to define that, and why not
> the 32 possible constants in a few letters.

Sure.

> Also the question is do we want to use little-endian/big-endian or
> native-endian/cross-endian?

Both have their appeal.  Although perhaps with appropriate defines, we can have
the best of both.

#define LDST_BSWAP  8

#ifdef HOST_BIG_ENDIAN
# define LDST_LE  LDST_BSWAP
# define LDST_BE  0
#else
# define LDST_LE  0
# define LDST_BE  LDST_BSWAP
#endif

>> 	if (tlb hit) {
>>             t = bswap(data);
>>             store t;
>>         } else {
>>             helper_store_be(data);
>>         }
>>
>> If we hoist the bswap it'll need to be
>>
>> 	t = bswap(data);
>> 	if (tlb hit) {
>> 	    store t;
>> 	} else {
>> 	    helper_store_le(t);
>> 	}
> 
> I am not sure it is worth adding additional complexity (and difference
> between targets) there, it's basically writing the code corresponding
> to qemu_ld* code in each TCG target.

I was just illustrating the reason for more helpers.


r~

Patch

diff --git a/target-arm/translate.c b/target-arm/translate.c
index d1e8538..92d9f16 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -6715,6 +6715,7 @@  static void disas_arm_insn(CPUARMState * env, DisasContext *s)
             /* setend */
             if (((insn >> 9) & 1) != s->bswap_code) {
                 /* Dynamic endianness switching not implemented. */
+                qemu_log_mask(LOG_UNIMP, "arm: unimplemented setend\n");
                 goto illegal_op;
             }
             return;
@@ -8740,6 +8741,8 @@  static int disas_thumb2_insn(CPUARMState *env, DisasContext *s, uint16_t insn_hw
 
                 if (insn & (1 << 26)) {
                     /* Secure monitor call (v6Z) */
+                    qemu_log_mask(LOG_UNIMP,
+                                  "arm: unimplemented secure monitor call\n");
                     goto illegal_op; /* not implemented.  */
                 } else {
                     op = (insn >> 20) & 7;
@@ -9779,6 +9782,7 @@  static void disas_thumb_insn(CPUARMState *env, DisasContext *s)
                 ARCH(6);
                 if (((insn >> 3) & 1) != s->bswap_code) {
                     /* Dynamic endianness switching not implemented. */
+                    qemu_log_mask(LOG_UNIMP, "arm: unimplemented setend\n");
                     goto illegal_op;
                 }
                 break;