diff mbox

Re: sparc64 lazy conditional codes evaluation

Message ID AANLkTik0uFO9MKH5Rgn1xEEPGa51vlT_cJdEsn-bCsea@mail.gmail.com
State New
Headers show

Commit Message

Igor V. Kovalenko May 8, 2010, 6:41 a.m. UTC
On Thu, May 6, 2010 at 10:51 PM, Blue Swirl <blauwirbel@gmail.com> wrote:
> On 5/5/10, Igor Kovalenko <igor.v.kovalenko@gmail.com> wrote:
>> On Wed, May 5, 2010 at 12:21 AM, Blue Swirl <blauwirbel@gmail.com> wrote:
>>  > On 5/3/10, Igor Kovalenko <igor.v.kovalenko@gmail.com> wrote:
>>  >> On Mon, May 3, 2010 at 11:54 PM, Blue Swirl <blauwirbel@gmail.com> wrote:
>>  >>  > On 5/3/10, Igor Kovalenko <igor.v.kovalenko@gmail.com> wrote:
>>  >>  >> On Mon, May 3, 2010 at 11:24 PM, Blue Swirl <blauwirbel@gmail.com> wrote:
>>  >>  >>  > On 5/3/10, Igor Kovalenko <igor.v.kovalenko@gmail.com> wrote:
>>  >>  >>  >> Hi!
>>  >>  >>  >>
>>  >>  >>  >>  There is an issue with lazy conditional codes evaluation where
>>  >>  >>  >>  we return from trap handler with mismatching conditionals.
>>  >>  >>  >>
>>  >>  >>  >>  I seldom reproduce it here when dragging qemu window while
>>  >>  >>  >>  machine is working through silo initialization. I use gentoo minimal cd
>>  >>  >>  >>  install-sparc64-minimal-20100322.iso but I think anything with silo boot
>>  >>  >>  >>  would experience the same. Once in a while it would report crc error,
>>  >>  >>  >>  unable to open cd partition or it would fail to decompress image.
>>  >>  >>  >
>>  >>  >>  > I think I've also seen this.
>>  >>  >>  >
>>  >>  >>  >>  Pattern that fails appears to require a sequence of compare insn
>>  >>  >>  >>  possibly followed by a few instructions which do not touch conditionals,
>>  >>  >>  >>  then conditional branch insn. If it happens that we trap while processing
>>  >>  >>  >>  conditional branch insn so it is restarted after return from trap then
>>  >>  >>  >>  seldom conditional codes are calculated incorrectly.
>>  >>  >>  >>
>>  >>  >>  >>  I cannot point to exact cause but it appears that after trap return
>>  >>  >>  >>  we may have CC_OP and CC_SRC* mismatch somewhere,
>>  >>  >>  >>  since adding more cond evaluation flushes over the code helps.
>>  >>  >>  >>
>>  >>  >>  >>  We already tried doing flush more frequently and it is still not
>>  >>  >>  >>  complete, so the question is how to finally do this once and right :)
>>  >>  >>  >>
>>  >>  >>  >>  Obviously I do not get the design of lazy evaluation right, but
>>  >>  >>  >>  the following list appears to be good start. Plan is to prepare
>>  >>  >>  >>  a change to qemu and find a way to test it.
>>  >>  >>  >>
>>  >>  >>  >>  1. Since SPARC* is a RISC CPU it seems to be not profitable to
>>  >>  >>  >>    use DisasContext->cc_op to predict if flags should be not evaluated
>>  >>  >>  >>    due to overriding insn. Instead we can drop cc_op from disassembler
>>  >>  >>  >>    context and simplify code to only use cc_op from env.
>>  >>  >>  >
>>  >>  >>  > Not currently, but in the future we may use that to do even lazier
>>  >>  >>  > flags computation. For example the sequence 'cmp x, y; bne target'
>>  >>  >>  > could be much more optimal by changing the branch to do the
>>  >>  >>  > comparison. Here's an old unfinished patch to do some of this.
>>
>>
>> I wonder if it buys anything. Sparc RISC architecture means optimizing compiler
>>  would prevent any extra flags computation, right? So it is basically 1-to-1
>>  conditional computation and use. Or even worse, if we delay computation
>>  until there are two or more consumers, correct?
>
> For x86 target, that is the other part of savings from using lazy
> condition computation. It's true that it will not benefit RISC targets
> much.
>
> But I think you are missing the other part, where the actual flags are
> not calculated but instead the original comparison can be used.
>
> For example, consider this Sparc64 code:
> IN:
> 0x000001fff000be58:  cmp  %g2, 0x51
>
> OUT: [size=82]
> 0x40df16b0:  mov    0x10(%r14),%rbp
> 0x40df16b4:  mov    %rbp,%rbx
> 0x40df16b7:  sub    $0x51,%rbx
> 0x40df16bb:  mov    %rbx,%r12
> 0x40df16be:  mov    %r12,0x10a60(%r14)
> 0x40df16c5:  mov    %rbp,0x60(%r14)
> 0x40df16c9:  mov    $0x51,%ebp
> 0x40df16ce:  mov    %rbp,0x68(%r14)
> 0x40df16d2:  mov    %rbx,0x70(%r14)
> 0x40df16d6:  mov    $0x7,%ebp
> 0x40df16db:  mov    %ebp,0x78(%r14)
> 0x40df16df:  mov    $0x1fff000be5c,%rbp
> 0x40df16e9:  mov    %rbp,0x48(%r14)
> 0x40df16ed:  mov    $0x1fff000be60,%rbp
> 0x40df16f7:  mov    %rbp,0x50(%r14)
> 0x40df16fb:  xor    %eax,%eax
> 0x40df16fd:  jmpq   0xbfface
>
> --------------
> IN:
> 0x000001fff000be5c:  bne  0x1fff000c268
>
> OUT: [size=95]
> 0x40df1710:  callq  0x518260
> 0x40df1715:  mov    0x98(%r14),%ebp
> 0x40df171c:  mov    %ebp,%ebx
> 0x40df171e:  shr    $0x16,%rbx
> 0x40df1722:  and    $0x1,%ebx
> 0x40df1725:  xor    $0x1,%rbx
> 0x40df1729:  mov    %rbx,0x90(%r14)
> 0x40df1730:  mov    $0x1fff000be60,%rbp
> 0x40df173a:  mov    %rbp,0x48(%r14)
> 0x40df173e:  test   %rbx,%rbx
> 0x40df1741:  je     0x40df175a
> 0x40df1747:  mov    $0x1fff000c268,%rbp
> 0x40df1751:  mov    %rbp,0x50(%r14)
> 0x40df1755:  jmpq   0x40df1768
> 0x40df175a:  mov    $0x1fff000be64,%rbp
> 0x40df1764:  mov    %rbp,0x50(%r14)
> 0x40df1768:  xor    %eax,%eax
> 0x40df176a:  jmpq   0xbfface
>
> Instead of calculating any flags, we should generate for combined
> 'cmp/btest + branch' sequences something like:
> mov    0x10(%r14),%rbp
> mov    %rbp,%rbx
> cmp    $0x51,%rbx
> je     0x40ac275a
> mov    $0x1fff000c268,%rbp
> mov    %rbp,0x50(%r14)
> jmpq   0x40ac2768
> mov    $0x1fff000be64,%rbp
> mov    %rbp,0x50(%r14)
> xor    %eax,%eax
> jmpq   0xbfface
>
> But I fully agree that correct code generation is the most important issue.
>
>>  >>  >>  >
>>  >>  >>  >>    Another point is that we always write to env->cc_op when
>>  >>  >>  >>  translating *cc insns
>>  >>  >>  >>    This should solve any issue with dc->cc_op prediction going
>>  >>  >>  >>    out of sync with env->cc_op and cpu_cc_src*
>>  >>  >>  >
>>  >>  >>  > I think this is what is happening now.
>>  >>  >>  >
>>  >>  >>  >>  2. We must flush lazy evaluation back to CC_OP_FLAGS in a few cases when
>>  >>  >>  >>    a. conditional code is required by insn (like addc, cond branch etc.)
>>  >>  >>  >>       - here we can optimize by evaluating specific bits (carry?)
>>  >>  >>  >>       - not sure if it works in case we have two cond consuming insns,
>>  >>  >>  >>         where first needs carry another needs the rest of flags
>>  >>  >>  >
>>  >>  >>  > Here's another patch to optimize C flag handling. It doesn't pass my
>>  >>  >>  > tests though.
>>  >>  >>  >
>>  >>  >>  >>    b. CCR is read by rdccr (helper_rdccr)
>>  >>  >>  >>       - have to compute all flags
>>  >>  >>  >>    c. trap occurs and we prepare trap level context (saving pstate)
>>  >>  >>  >>       - have to compute all flags
>>  >>  >>  >>    d. control goes out of tcg runtime (so gdbstub reads correct value from env)
>>  >>  >>  >>       - have to compute all flags
>>  >>  >>  >
>>  >>  >>  > Fully agree.
>>  >>  >>
>>  >>  >>
>>  >>  >> Cool
>>  >>  >>
>>  >>  >>  Still I'd propose to kill dc->cc_op, find a reliable way to test it
>>  >>  >>  and then add it back possibly with more optimizations.
>>  >>  >>  I'm lost in the code up to the point where I believe we need to
>>  >>  >>  save/restore cc_op and cpu_cc* while switching trap levels.
>>  >>  >
>>  >>  > I'd think this should do the trick:
>>  >>  >
>>  >>  > diff --git a/target-sparc/op_helper.c b/target-sparc/op_helper.c
>>  >>  > index b27778b..94921cd 100644
>>  >>  > --- a/target-sparc/op_helper.c
>>  >>  > +++ b/target-sparc/op_helper.c
>>  >>  > @@ -3506,6 +3506,8 @@ void do_interrupt(CPUState *env)
>>  >>  >     }
>>  >>  >     tsptr = cpu_tsptr(env);
>>  >>  >
>>  >>  > +    helper_compute_psr();
>>  >>  > +
>>  >>  >     tsptr->tstate = ((uint64_t)GET_CCR(env) << 32) |
>>  >>  >         ((env->asi & 0xff) << 24) | ((env->pstate & 0xf3f) << 8) |
>>  >>  >         GET_CWP64(env);
>>  >>  >
>>  >>
>>  >>
>>  >> Thanks, this change seems to work here for silo issue.
>>  >>
>>  >>  Another change would be to flush for gdbstub use of GET_CCR and for
>>  >>  helper_rdccr.
>>  >>  I tried to embed flush into GET_CCR but the code looks ugly since we
>>  >>  need to proxy a call to helper_compute_psr from gdbstub passing
>>  >>  available env pointer.
>>  >>
>>  >>  Not really tested with your changes, but still what is the breakage you see?
>>  >
>>  > Aurora 2.0 (http://distro.ibiblio.org/pub/linux/distributions/aurora/build-2.0/sparc/iso/)
>>  > breaks.
>>  >
>>  > This is what I get with git HEAD, having pressed enter key twice:
>>  > Welcome to Aurora SPARC Linux
>>  >
>>  >
>>  >
>>  >
>>  >                   +--------------+ CD Found +--------------+
>>  >                   |                                        |
>>  >                   | To begin testing the CD media before   |
>>  >                   | installation press OK.                 |
>>  >                   |                                        |
>>  >                   | Choose Skip to skip the media test     |
>>  >                   | and start the installation.            |
>>  >                   |                                        |
>>  >                   |      +----+             +------+       |
>>  >                   |      | OK |             | Skip |       |
>>  >                   |      +----+             +------+       |
>>  >                   |                                        |
>>  >                   |                                        |
>>  >                   +----------------------------------------+
>>  >
>>  >
>>  >
>>  >
>>  >  <Tab>/<Alt-Tab> between elements  | <Space> selects | <F12> next screen
>>  >
>>  > This is what I get with the C flag patch applied:
>>  > Welcome to Aurora SPARC Linux
>>  >
>>  >
>>  >
>>  >
>>  >
>>  >                    +--------------+ Error +---------------+
>>  >                    |                                      |
>>  >                    | failed to read keymap information:   |
>>  >                    | Success                              |
>>  >                    |                                      |
>>  >                    |               +----+                 |
>>  >                    |               | OK |                 |
>>  >                    |               +----+                 |
>>  >                    |                                      |
>>  >                    |                                      |
>>  >                    +--------------------------------------+
>>  >
>>  >
>>  >
>>  >
>>  >
>>  >
>>  >  <Tab>/<Alt-Tab> between elements  | <Space> selects | <F12> next screen
>>  >
>>
>>
>> I do reproduce the issue here with 0001-Convert-C-flag-input-BROKEN.patch
>
> The patch seems harmless, so maybe it uncovers some hideous bug.
>

Right, the bug is inside the gen_op_*cc helpers - we need to compute C
using current flag source, whereas current implementation clobbers source
then computes C.

I think I now get the lazy evaluation design right :)

Something along these changes to your 0001-Convert-C-flag-input-BROKEN.patch
appears to be good for aurora 2.0 test:

     tcg_gen_mov_tl(dst, cpu_cc_dst);
@@ -344,9 +344,9 @@ static inline void gen_op_addxi_cc(TCGv dst, TCGv
src1, target_long src2)

 static inline void gen_op_addx_cc(TCGv dst, TCGv src1, TCGv src2)
 {
+    gen_helper_compute_C_icc(cpu_tmp0);
     tcg_gen_mov_tl(cpu_cc_src, src1);
     tcg_gen_mov_tl(cpu_cc_src2, src2);
-    gen_helper_compute_C_icc(cpu_tmp0);
     tcg_gen_add_tl(cpu_cc_dst, cpu_cc_src, cpu_tmp0);
     tcg_gen_add_tl(cpu_cc_dst, cpu_cc_dst, cpu_cc_src2);
     tcg_gen_mov_tl(dst, cpu_cc_dst);
@@ -417,9 +417,9 @@ static inline void gen_op_sub_cc(TCGv dst, TCGv
src1, TCGv src2)

 static inline void gen_op_subxi_cc(TCGv dst, TCGv src1, target_long src2)
 {
+    gen_helper_compute_C_icc(cpu_tmp0);
     tcg_gen_mov_tl(cpu_cc_src, src1);
     tcg_gen_movi_tl(cpu_cc_src2, src2);
-    gen_helper_compute_C_icc(cpu_tmp0);
     tcg_gen_sub_tl(cpu_cc_dst, cpu_cc_src, cpu_tmp0);
     tcg_gen_subi_tl(cpu_cc_dst, cpu_cc_dst, src2);
     tcg_gen_mov_tl(dst, cpu_cc_dst);
@@ -427,9 +427,9 @@ static inline void gen_op_subxi_cc(TCGv dst, TCGv
src1, target_long src2)

 static inline void gen_op_subx_cc(TCGv dst, TCGv src1, TCGv src2)
 {
+    gen_helper_compute_C_icc(cpu_tmp0);
     tcg_gen_mov_tl(cpu_cc_src, src1);
     tcg_gen_mov_tl(cpu_cc_src2, src2);
-    gen_helper_compute_C_icc(cpu_tmp0);
     tcg_gen_sub_tl(cpu_cc_dst, cpu_cc_src, cpu_tmp0);
     tcg_gen_sub_tl(cpu_cc_dst, cpu_cc_dst, cpu_cc_src2);
     tcg_gen_mov_tl(dst, cpu_cc_dst);

Comments

Blue Swirl May 9, 2010, 8:22 p.m. UTC | #1
On 5/8/10, Igor Kovalenko <igor.v.kovalenko@gmail.com> wrote:
> On Thu, May 6, 2010 at 10:51 PM, Blue Swirl <blauwirbel@gmail.com> wrote:
>  > On 5/5/10, Igor Kovalenko <igor.v.kovalenko@gmail.com> wrote:
>  >> On Wed, May 5, 2010 at 12:21 AM, Blue Swirl <blauwirbel@gmail.com> wrote:
>  >>  > On 5/3/10, Igor Kovalenko <igor.v.kovalenko@gmail.com> wrote:
>  >>  >> On Mon, May 3, 2010 at 11:54 PM, Blue Swirl <blauwirbel@gmail.com> wrote:
>  >>  >>  > On 5/3/10, Igor Kovalenko <igor.v.kovalenko@gmail.com> wrote:
>  >>  >>  >> On Mon, May 3, 2010 at 11:24 PM, Blue Swirl <blauwirbel@gmail.com> wrote:
>  >>  >>  >>  > On 5/3/10, Igor Kovalenko <igor.v.kovalenko@gmail.com> wrote:
>  >>  >>  >>  >> Hi!
>  >>  >>  >>  >>
>  >>  >>  >>  >>  There is an issue with lazy conditional codes evaluation where
>  >>  >>  >>  >>  we return from trap handler with mismatching conditionals.
>  >>  >>  >>  >>
>  >>  >>  >>  >>  I seldom reproduce it here when dragging qemu window while
>  >>  >>  >>  >>  machine is working through silo initialization. I use gentoo minimal cd
>  >>  >>  >>  >>  install-sparc64-minimal-20100322.iso but I think anything with silo boot
>  >>  >>  >>  >>  would experience the same. Once in a while it would report crc error,
>  >>  >>  >>  >>  unable to open cd partition or it would fail to decompress image.
>  >>  >>  >>  >
>  >>  >>  >>  > I think I've also seen this.
>  >>  >>  >>  >
>  >>  >>  >>  >>  Pattern that fails appears to require a sequence of compare insn
>  >>  >>  >>  >>  possibly followed by a few instructions which do not touch conditionals,
>  >>  >>  >>  >>  then conditional branch insn. If it happens that we trap while processing
>  >>  >>  >>  >>  conditional branch insn so it is restarted after return from trap then
>  >>  >>  >>  >>  seldom conditional codes are calculated incorrectly.
>  >>  >>  >>  >>
>  >>  >>  >>  >>  I cannot point to exact cause but it appears that after trap return
>  >>  >>  >>  >>  we may have CC_OP and CC_SRC* mismatch somewhere,
>  >>  >>  >>  >>  since adding more cond evaluation flushes over the code helps.
>  >>  >>  >>  >>
>  >>  >>  >>  >>  We already tried doing flush more frequently and it is still not
>  >>  >>  >>  >>  complete, so the question is how to finally do this once and right :)
>  >>  >>  >>  >>
>  >>  >>  >>  >>  Obviously I do not get the design of lazy evaluation right, but
>  >>  >>  >>  >>  the following list appears to be good start. Plan is to prepare
>  >>  >>  >>  >>  a change to qemu and find a way to test it.
>  >>  >>  >>  >>
>  >>  >>  >>  >>  1. Since SPARC* is a RISC CPU it seems to be not profitable to
>  >>  >>  >>  >>    use DisasContext->cc_op to predict if flags should be not evaluated
>  >>  >>  >>  >>    due to overriding insn. Instead we can drop cc_op from disassembler
>  >>  >>  >>  >>    context and simplify code to only use cc_op from env.
>  >>  >>  >>  >
>  >>  >>  >>  > Not currently, but in the future we may use that to do even lazier
>  >>  >>  >>  > flags computation. For example the sequence 'cmp x, y; bne target'
>  >>  >>  >>  > could be much more optimal by changing the branch to do the
>  >>  >>  >>  > comparison. Here's an old unfinished patch to do some of this.
>  >>
>  >>
>  >> I wonder if it buys anything. Sparc RISC architecture means optimizing compiler
>  >>  would prevent any extra flags computation, right? So it is basically 1-to-1
>  >>  conditional computation and use. Or even worse, if we delay computation
>  >>  until there are two or more consumers, correct?
>  >
>  > For x86 target, that is the other part of savings from using lazy
>  > condition computation. It's true that it will not benefit RISC targets
>  > much.
>  >
>  > But I think you are missing the other part, where the actual flags are
>  > not calculated but instead the original comparison can be used.
>  >
>  > For example, consider this Sparc64 code:
>  > IN:
>  > 0x000001fff000be58:  cmp  %g2, 0x51
>  >
>  > OUT: [size=82]
>  > 0x40df16b0:  mov    0x10(%r14),%rbp
>  > 0x40df16b4:  mov    %rbp,%rbx
>  > 0x40df16b7:  sub    $0x51,%rbx
>  > 0x40df16bb:  mov    %rbx,%r12
>  > 0x40df16be:  mov    %r12,0x10a60(%r14)
>  > 0x40df16c5:  mov    %rbp,0x60(%r14)
>  > 0x40df16c9:  mov    $0x51,%ebp
>  > 0x40df16ce:  mov    %rbp,0x68(%r14)
>  > 0x40df16d2:  mov    %rbx,0x70(%r14)
>  > 0x40df16d6:  mov    $0x7,%ebp
>  > 0x40df16db:  mov    %ebp,0x78(%r14)
>  > 0x40df16df:  mov    $0x1fff000be5c,%rbp
>  > 0x40df16e9:  mov    %rbp,0x48(%r14)
>  > 0x40df16ed:  mov    $0x1fff000be60,%rbp
>  > 0x40df16f7:  mov    %rbp,0x50(%r14)
>  > 0x40df16fb:  xor    %eax,%eax
>  > 0x40df16fd:  jmpq   0xbfface
>  >
>  > --------------
>  > IN:
>  > 0x000001fff000be5c:  bne  0x1fff000c268
>  >
>  > OUT: [size=95]
>  > 0x40df1710:  callq  0x518260
>  > 0x40df1715:  mov    0x98(%r14),%ebp
>  > 0x40df171c:  mov    %ebp,%ebx
>  > 0x40df171e:  shr    $0x16,%rbx
>  > 0x40df1722:  and    $0x1,%ebx
>  > 0x40df1725:  xor    $0x1,%rbx
>  > 0x40df1729:  mov    %rbx,0x90(%r14)
>  > 0x40df1730:  mov    $0x1fff000be60,%rbp
>  > 0x40df173a:  mov    %rbp,0x48(%r14)
>  > 0x40df173e:  test   %rbx,%rbx
>  > 0x40df1741:  je     0x40df175a
>  > 0x40df1747:  mov    $0x1fff000c268,%rbp
>  > 0x40df1751:  mov    %rbp,0x50(%r14)
>  > 0x40df1755:  jmpq   0x40df1768
>  > 0x40df175a:  mov    $0x1fff000be64,%rbp
>  > 0x40df1764:  mov    %rbp,0x50(%r14)
>  > 0x40df1768:  xor    %eax,%eax
>  > 0x40df176a:  jmpq   0xbfface
>  >
>  > Instead of calculating any flags, we should generate for combined
>  > 'cmp/btest + branch' sequences something like:
>  > mov    0x10(%r14),%rbp
>  > mov    %rbp,%rbx
>  > cmp    $0x51,%rbx
>  > je     0x40ac275a
>  > mov    $0x1fff000c268,%rbp
>  > mov    %rbp,0x50(%r14)
>  > jmpq   0x40ac2768
>  > mov    $0x1fff000be64,%rbp
>  > mov    %rbp,0x50(%r14)
>  > xor    %eax,%eax
>  > jmpq   0xbfface
>  >
>  > But I fully agree that correct code generation is the most important issue.
>  >
>  >>  >>  >>  >
>  >>  >>  >>  >>    Another point is that we always write to env->cc_op when
>  >>  >>  >>  >>  translating *cc insns
>  >>  >>  >>  >>    This should solve any issue with dc->cc_op prediction going
>  >>  >>  >>  >>    out of sync with env->cc_op and cpu_cc_src*
>  >>  >>  >>  >
>  >>  >>  >>  > I think this is what is happening now.
>  >>  >>  >>  >
>  >>  >>  >>  >>  2. We must flush lazy evaluation back to CC_OP_FLAGS in a few cases when
>  >>  >>  >>  >>    a. conditional code is required by insn (like addc, cond branch etc.)
>  >>  >>  >>  >>       - here we can optimize by evaluating specific bits (carry?)
>  >>  >>  >>  >>       - not sure if it works in case we have two cond consuming insns,
>  >>  >>  >>  >>         where first needs carry another needs the rest of flags
>  >>  >>  >>  >
>  >>  >>  >>  > Here's another patch to optimize C flag handling. It doesn't pass my
>  >>  >>  >>  > tests though.
>  >>  >>  >>  >
>  >>  >>  >>  >>    b. CCR is read by rdccr (helper_rdccr)
>  >>  >>  >>  >>       - have to compute all flags
>  >>  >>  >>  >>    c. trap occurs and we prepare trap level context (saving pstate)
>  >>  >>  >>  >>       - have to compute all flags
>  >>  >>  >>  >>    d. control goes out of tcg runtime (so gdbstub reads correct value from env)
>  >>  >>  >>  >>       - have to compute all flags
>  >>  >>  >>  >
>  >>  >>  >>  > Fully agree.
>  >>  >>  >>
>  >>  >>  >>
>  >>  >>  >> Cool
>  >>  >>  >>
>  >>  >>  >>  Still I'd propose to kill dc->cc_op, find a reliable way to test it
>  >>  >>  >>  and then add it back possibly with more optimizations.
>  >>  >>  >>  I'm lost in the code up to the point where I believe we need to
>  >>  >>  >>  save/restore cc_op and cpu_cc* while switching trap levels.
>  >>  >>  >
>  >>  >>  > I'd think this should do the trick:
>  >>  >>  >
>  >>  >>  > diff --git a/target-sparc/op_helper.c b/target-sparc/op_helper.c
>  >>  >>  > index b27778b..94921cd 100644
>  >>  >>  > --- a/target-sparc/op_helper.c
>  >>  >>  > +++ b/target-sparc/op_helper.c
>  >>  >>  > @@ -3506,6 +3506,8 @@ void do_interrupt(CPUState *env)
>  >>  >>  >     }
>  >>  >>  >     tsptr = cpu_tsptr(env);
>  >>  >>  >
>  >>  >>  > +    helper_compute_psr();
>  >>  >>  > +
>  >>  >>  >     tsptr->tstate = ((uint64_t)GET_CCR(env) << 32) |
>  >>  >>  >         ((env->asi & 0xff) << 24) | ((env->pstate & 0xf3f) << 8) |
>  >>  >>  >         GET_CWP64(env);
>  >>  >>  >
>  >>  >>
>  >>  >>
>  >>  >> Thanks, this change seems to work here for silo issue.
>  >>  >>
>  >>  >>  Another change would be to flush for gdbstub use of GET_CCR and for
>  >>  >>  helper_rdccr.
>  >>  >>  I tried to embed flush into GET_CCR but the code looks ugly since we
>  >>  >>  need to proxy a call to helper_compute_psr from gdbstub passing
>  >>  >>  available env pointer.
>  >>  >>
>  >>  >>  Not really tested with your changes, but still what is the breakage you see?
>  >>  >
>  >>  > Aurora 2.0 (http://distro.ibiblio.org/pub/linux/distributions/aurora/build-2.0/sparc/iso/)
>  >>  > breaks.
>  >>  >
>  >>  > This is what I get with git HEAD, having pressed enter key twice:
>  >>  > Welcome to Aurora SPARC Linux
>  >>  >
>  >>  >
>  >>  >
>  >>  >
>  >>  >                   +--------------+ CD Found +--------------+
>  >>  >                   |                                        |
>  >>  >                   | To begin testing the CD media before   |
>  >>  >                   | installation press OK.                 |
>  >>  >                   |                                        |
>  >>  >                   | Choose Skip to skip the media test     |
>  >>  >                   | and start the installation.            |
>  >>  >                   |                                        |
>  >>  >                   |      +----+             +------+       |
>  >>  >                   |      | OK |             | Skip |       |
>  >>  >                   |      +----+             +------+       |
>  >>  >                   |                                        |
>  >>  >                   |                                        |
>  >>  >                   +----------------------------------------+
>  >>  >
>  >>  >
>  >>  >
>  >>  >
>  >>  >  <Tab>/<Alt-Tab> between elements  | <Space> selects | <F12> next screen
>  >>  >
>  >>  > This is what I get with the C flag patch applied:
>  >>  > Welcome to Aurora SPARC Linux
>  >>  >
>  >>  >
>  >>  >
>  >>  >
>  >>  >
>  >>  >                    +--------------+ Error +---------------+
>  >>  >                    |                                      |
>  >>  >                    | failed to read keymap information:   |
>  >>  >                    | Success                              |
>  >>  >                    |                                      |
>  >>  >                    |               +----+                 |
>  >>  >                    |               | OK |                 |
>  >>  >                    |               +----+                 |
>  >>  >                    |                                      |
>  >>  >                    |                                      |
>  >>  >                    +--------------------------------------+
>  >>  >
>  >>  >
>  >>  >
>  >>  >
>  >>  >
>  >>  >
>  >>  >  <Tab>/<Alt-Tab> between elements  | <Space> selects | <F12> next screen
>  >>  >
>  >>
>  >>
>  >> I do reproduce the issue here with 0001-Convert-C-flag-input-BROKEN.patch
>  >
>  > The patch seems harmless, so maybe it uncovers some hideous bug.
>  >
>
>
> Right, the bug is inside the gen_op_*cc helpers - we need to compute C
>  using current flag source, whereas current implementation clobbers source
>  then computes C.
>
>  I think I now get the lazy evaluation design right :)
>
>  Something along these changes to your 0001-Convert-C-flag-input-BROKEN.patch
>  appears to be good for aurora 2.0 test:
>
>  diff --git a/target-sparc/translate.c b/target-sparc/translate.c
>  index 94c343d..ea7c71b 100644
>  --- a/target-sparc/translate.c
>  +++ b/target-sparc/translate.c
>  @@ -334,9 +334,9 @@ static inline void gen_op_add_cc(TCGv dst, TCGv
>  src1, TCGv src2)
>
>   static inline void gen_op_addxi_cc(TCGv dst, TCGv src1, target_long src2)
>   {
>  +    gen_helper_compute_C_icc(cpu_tmp0);
>      tcg_gen_mov_tl(cpu_cc_src, src1);
>      tcg_gen_movi_tl(cpu_cc_src2, src2);
>  -    gen_helper_compute_C_icc(cpu_tmp0);
>      tcg_gen_add_tl(cpu_cc_dst, cpu_cc_src, cpu_tmp0);
>      tcg_gen_addi_tl(cpu_cc_dst, cpu_cc_dst, src2);
>      tcg_gen_mov_tl(dst, cpu_cc_dst);
>  @@ -344,9 +344,9 @@ static inline void gen_op_addxi_cc(TCGv dst, TCGv
>  src1, target_long src2)
>
>   static inline void gen_op_addx_cc(TCGv dst, TCGv src1, TCGv src2)
>   {
>  +    gen_helper_compute_C_icc(cpu_tmp0);
>      tcg_gen_mov_tl(cpu_cc_src, src1);
>      tcg_gen_mov_tl(cpu_cc_src2, src2);
>  -    gen_helper_compute_C_icc(cpu_tmp0);
>      tcg_gen_add_tl(cpu_cc_dst, cpu_cc_src, cpu_tmp0);
>      tcg_gen_add_tl(cpu_cc_dst, cpu_cc_dst, cpu_cc_src2);
>      tcg_gen_mov_tl(dst, cpu_cc_dst);
>  @@ -417,9 +417,9 @@ static inline void gen_op_sub_cc(TCGv dst, TCGv
>  src1, TCGv src2)
>
>   static inline void gen_op_subxi_cc(TCGv dst, TCGv src1, target_long src2)
>   {
>  +    gen_helper_compute_C_icc(cpu_tmp0);
>      tcg_gen_mov_tl(cpu_cc_src, src1);
>      tcg_gen_movi_tl(cpu_cc_src2, src2);
>  -    gen_helper_compute_C_icc(cpu_tmp0);
>      tcg_gen_sub_tl(cpu_cc_dst, cpu_cc_src, cpu_tmp0);
>      tcg_gen_subi_tl(cpu_cc_dst, cpu_cc_dst, src2);
>      tcg_gen_mov_tl(dst, cpu_cc_dst);
>  @@ -427,9 +427,9 @@ static inline void gen_op_subxi_cc(TCGv dst, TCGv
>  src1, target_long src2)
>
>   static inline void gen_op_subx_cc(TCGv dst, TCGv src1, TCGv src2)
>   {
>  +    gen_helper_compute_C_icc(cpu_tmp0);
>      tcg_gen_mov_tl(cpu_cc_src, src1);
>      tcg_gen_mov_tl(cpu_cc_src2, src2);
>  -    gen_helper_compute_C_icc(cpu_tmp0);
>      tcg_gen_sub_tl(cpu_cc_dst, cpu_cc_src, cpu_tmp0);
>      tcg_gen_sub_tl(cpu_cc_dst, cpu_cc_dst, cpu_cc_src2);
>      tcg_gen_mov_tl(dst, cpu_cc_dst);

Thanks a lot, with this patch my tests passed! I applied the combined patch.

I also did a bit of refactoring to get the original Sparc64 issue fixed.
Mark Cave-Ayland May 10, 2010, 9:40 a.m. UTC | #2
Blue Swirl wrote:

> Thanks a lot, with this patch my tests passed! I applied the combined patch.

Yes, I definitely see an improvement with this patch - at least my 
Debian lenny SPARC boot cd doesn't randomly kernel panic any more. It 
looks as if it now just can't find /init which could just be due to an 
incorrect device mapping somewhere.

> I also did a bit of refactoring to get the original Sparc64 issue fixed.

However, one thing I did notice is that this does introduce a noticeable 
performance penalty. With OpenBIOS SVN head I see the following:

With commit 72139e83a98eba2bfed2dbc2db2818fb19e47ca0 (just before the 
changes):

[   59.225406] Failed to execute /init
[   59.304088] Kernel panic - not syncing: No init found.  Try passing 
init= option to kernel.
[   59.450313] Press Stop-A (L1-A) to return to the boot prom

With commit 5a834bb47c373e887de5210b7ceae96e1ef413f7 (just after the 
changes):

[   70.384466] Failed to execute /init
[   70.474804] Kernel panic - not syncing: No init found.  Try passing 
init= option to kernel.


So while it's technically correct, it seems to have added ~15% overhead 
to the emulation :(


ATB,

Mark.
Blue Swirl May 10, 2010, 6:33 p.m. UTC | #3
On 5/10/10, Mark Cave-Ayland <mark.cave-ayland@siriusit.co.uk> wrote:
> Blue Swirl wrote:
>
>
> > Thanks a lot, with this patch my tests passed! I applied the combined
> patch.
> >
>
>  Yes, I definitely see an improvement with this patch - at least my Debian
> lenny SPARC boot cd doesn't randomly kernel panic any more. It looks as if
> it now just can't find /init which could just be due to an incorrect device
> mapping somewhere.
>
>
> > I also did a bit of refactoring to get the original Sparc64 issue fixed.
> >
>
>  However, one thing I did notice is that this does introduce a noticeable
> performance penalty. With OpenBIOS SVN head I see the following:
>
>  With commit 72139e83a98eba2bfed2dbc2db2818fb19e47ca0 (just
> before the changes):
>
>  [   59.225406] Failed to execute /init
>  [   59.304088] Kernel panic - not syncing: No init found.  Try passing
> init= option to kernel.
>  [   59.450313] Press Stop-A (L1-A) to return to the boot prom
>
>  With commit 5a834bb47c373e887de5210b7ceae96e1ef413f7 (just
> after the changes):
>
>  [   70.384466] Failed to execute /init
>  [   70.474804] Kernel panic - not syncing: No init found.  Try passing
> init= option to kernel.
>
>
>  So while it's technically correct, it seems to have added ~15% overhead to
> the emulation :(

Guest time can be unreliable, it could also indicate that Linux
executes a lot more timer interrupts. Could you retest and measure the
wall clock time?

I think the C flag change should only increase performance. The next
commit may have negative effects because more work is done every
interrupt, but it's also more correct now.
Mark Cave-Ayland May 15, 2010, 12:56 p.m. UTC | #4
Blue Swirl wrote:

> Guest time can be unreliable, it could also indicate that Linux
> executes a lot more timer interrupts. Could you retest and measure the
> wall clock time?
> 
> I think the C flag change should only increase performance. The next
> commit may have negative effects because more work is done every
> interrupt, but it's also more correct now.

Hmmm looks like you're right. I did a few more tests measuring wall 
clock time and averaging across a set of runs gives roughly the same 
time (although there does seem to be quite a bit of variation in 
resulting times on my system here). So nothing to worry about here.


ATB,

Mark.
diff mbox

Patch

diff --git a/target-sparc/translate.c b/target-sparc/translate.c
index 94c343d..ea7c71b 100644
--- a/target-sparc/translate.c
+++ b/target-sparc/translate.c
@@ -334,9 +334,9 @@  static inline void gen_op_add_cc(TCGv dst, TCGv
src1, TCGv src2)

 static inline void gen_op_addxi_cc(TCGv dst, TCGv src1, target_long src2)
 {
+    gen_helper_compute_C_icc(cpu_tmp0);
     tcg_gen_mov_tl(cpu_cc_src, src1);
     tcg_gen_movi_tl(cpu_cc_src2, src2);
-    gen_helper_compute_C_icc(cpu_tmp0);
     tcg_gen_add_tl(cpu_cc_dst, cpu_cc_src, cpu_tmp0);
     tcg_gen_addi_tl(cpu_cc_dst, cpu_cc_dst, src2);