diff mbox series

firmware: fw_base: Improve exception stack setup in trap handler

Message ID 20200810135245.230600-1-anup.patel@wdc.com
State Superseded
Headers show
Series firmware: fw_base: Improve exception stack setup in trap handler | expand

Commit Message

Anup Patel Aug. 10, 2020, 1:52 p.m. UTC
Currently, the low-level trap handler (i.e. _trap_handler()) uses
branch instructions to conditionally setup exception stack based
on which mode trap occured.

This patch implements exception stack setup using xor instructions
which is 2 instructions less compared to previous approach and faster
due to lack of branch instructions.

The new exception stack setup approach can be best described by the
following pseudocode:

Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1
Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP))

Came_From_M_Mode = 0    ==>    Exception_Stack = TP
Came_From_M_Mode = 1    ==>    Exception_Stack = SP

Signed-off-by: Anup Patel <anup.patel@wdc.com>
---
 firmware/fw_base.S | 49 ++++++++++++++++++++--------------------------
 1 file changed, 21 insertions(+), 28 deletions(-)

Comments

Jessica Clarke Aug. 10, 2020, 2:05 p.m. UTC | #1
On 10 Aug 2020, at 14:52, Anup Patel <Anup.Patel@wdc.com> wrote:
> 
> Currently, the low-level trap handler (i.e. _trap_handler()) uses
> branch instructions to conditionally setup exception stack based
> on which mode trap occured.
> 
> This patch implements exception stack setup using xor instructions
> which is 2 instructions less compared to previous approach and faster
> due to lack of branch instructions.
> 
> The new exception stack setup approach can be best described by the
> following pseudocode:
> 
> Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1

This relies on PRIV 0b10 currently being unallocated, which doesn't
make me feel warm and fuzzy inside.

> Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP))

This relies on MUL with 0/1 being fast, but that may well be a different
pipeline in your processor (in an OoO core) with fewer available, or a
higher-latency path to an M-box in a simpler in-order pipeline. The MUL
is also not compressible. You probably do want a branch, but in a way
that macro-op fusion will hopefully turn it into a conditional move:

    mv Exception_Stack, tp
    beqz Came_From_M_Mode, 1f
    mv Exception_Stack, sp
1:

Same number of instructions, but all compressible, no multiplications
required and the lone branch can be macro-op fused.

Jess

> Came_From_M_Mode = 0    ==>    Exception_Stack = TP
> Came_From_M_Mode = 1    ==>    Exception_Stack = SP
> 
> Signed-off-by: Anup Patel <anup.patel@wdc.com>
> ---
> firmware/fw_base.S | 49 ++++++++++++++++++++--------------------------
> 1 file changed, 21 insertions(+), 28 deletions(-)
> 
> diff --git a/firmware/fw_base.S b/firmware/fw_base.S
> index b66ac41..53a9a48 100644
> --- a/firmware/fw_base.S
> +++ b/firmware/fw_base.S
> @@ -490,35 +490,28 @@ _trap_handler:
> 	/* Save T0 in scratch space */
> 	REG_S	t0, SBI_SCRATCH_TMP0_OFFSET(tp)
> 
> -	/* Check which mode we came from */
> +	/*
> +	 * Set T0 to appropriate exception stack
> +	 *
> +	 * Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1
> +	 * Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP))
> +	 *
> +	 * Came_From_M_Mode = 0    ==>    Exception_Stack = TP
> +	 * Came_From_M_Mode = 1    ==>    Exception_Stack = SP
> +	 */
> 	csrr	t0, CSR_MSTATUS
> -	srl	t0, t0, MSTATUS_MPP_SHIFT
> -	and	t0, t0, PRV_M
> -	xori	t0, t0, PRV_M
> -	beq	t0, zero, _trap_handler_m_mode
> -
> -	/* We came from S-mode or U-mode */
> -_trap_handler_s_mode:
> -	/* Set T0 to original SP */
> -	add	t0, sp, zero
> -
> -	/* Setup exception stack */
> -	add	sp, tp, -(SBI_TRAP_REGS_SIZE)
> -
> -	/* Jump to code common for all modes */
> -	j	_trap_handler_all_mode
> -
> -	/* We came from M-mode */
> -_trap_handler_m_mode:
> -	/* Set T0 to original SP */
> -	add	t0, sp, zero
> -
> -	/* Re-use current SP as exception stack */
> -	add	sp, sp, -(SBI_TRAP_REGS_SIZE)
> -
> -_trap_handler_all_mode:
> -	/* Save original SP (from T0) on stack */
> -	REG_S	t0, SBI_TRAP_REGS_OFFSET(sp)(sp)
> +	srl	t0, t0, (MSTATUS_MPP_SHIFT + 1)
> +	and	t0, t0, 0x1
> +	xor	sp, sp, tp
> +	mul	t0, t0, sp
> +	xor	sp, sp, tp
> +	xor	t0, tp, t0
> +
> +	/* Save original SP on exception stack */
> +	REG_S	sp, (SBI_TRAP_REGS_OFFSET(sp) - SBI_TRAP_REGS_SIZE)(t0)
> +
> +	/* Set SP to exception stack and make room for trap registers */
> +	add	sp, t0, -(SBI_TRAP_REGS_SIZE)
> 
> 	/* Restore T0 from scratch space */
> 	REG_L	t0, SBI_SCRATCH_TMP0_OFFSET(tp)
> -- 
> 2.25.1
> 
> 
> -- 
> opensbi mailing list
> opensbi@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/opensbi
Anup Patel Aug. 11, 2020, 3:10 a.m. UTC | #2
On Mon, Aug 10, 2020 at 7:35 PM Jessica Clarke <jrtc27@jrtc27.com> wrote:
>
> On 10 Aug 2020, at 14:52, Anup Patel <Anup.Patel@wdc.com> wrote:
> >
> > Currently, the low-level trap handler (i.e. _trap_handler()) uses
> > branch instructions to conditionally setup exception stack based
> > on which mode trap occured.
> >
> > This patch implements exception stack setup using xor instructions
> > which is 2 instructions less compared to previous approach and faster
> > due to lack of branch instructions.
> >
> > The new exception stack setup approach can be best described by the
> > following pseudocode:
> >
> > Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1
>
> This relies on PRIV 0b10 currently being unallocated, which doesn't
> make me feel warm and fuzzy inside.

The PRIV 0b10 mode is not used by any of the upcoming extensions. In future,
we will require really strong justification in future to use the
unused 0b10 mode.
Even H-extension has cleanly avoided using PRIV 0b10 mode.

In any case, we should document this assumption.

>
> > Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP))
>
> This relies on MUL with 0/1 being fast, but that may well be a different

The MUL can be replaced with AND instruction but it was taking one
additional instruction. See below.

    /*
     * Set T0 to appropriate exception stack
     *
     * Came_From_M_Mode = ((MSTATUS.MPP & 0x2) >> 1) - 1
     * Exception_Stack = SP ^ (Came_From_M_Mode & (TP ^ SP))
     *
     * Came_From_M_Mode = -1UL   ==>    Exception_Stack = TP
     * Came_From_M_Mode = 0UL    ==>    Exception_Stack = SP
     */
    csrr    t0, CSR_MSTATUS
    srl    t0, t0, (MSTATUS_MPP_SHIFT + 1)
    and    t0, t0, 0x1
    add    t0, t0, -1
    xor    tp, tp, sp
    and    t0, t0, tp
    xor    tp, tp, sp
    xor    t0, sp, t0

> pipeline in your processor (in an OoO core) with fewer available, or a
> higher-latency path to an M-box in a simpler in-order pipeline. The MUL
> is also not compressible. You probably do want a branch, but in a way
> that macro-op fusion will hopefully turn it into a conditional move:
>
>     mv Exception_Stack, tp
>     beqz Came_From_M_Mode, 1f
>     mv Exception_Stack, sp
> 1:

Actually, I had started off with branches but code without branches (this
patch) is performing better on QEMU (because fewer TCG blocks required
for translating _trap_handler()). This code (this patch) will be much faster
on real HW because branch misses can cause bubbles in the instruction
pipeline.

The code with branches is as follows:
    csrr    t0, CSR_MSTATUS
    srl    t0, t0, (MSTATUS_MPP_SHIFT + 1)
    and    t0, t0, 0x1
    beq    t0, zero, _trap_stack_s_mode
_trap_stack_m_mode:
    add    t0, sp, zero
    j    _trap_stack_done
_trap_stack_s_mode:
    add    t0, tp, zero
_trap_stack_done:

>
> Same number of instructions, but all compressible, no multiplications
> required and the lone branch can be macro-op fused.

Overall, the key thing is having no (or fewer) branches in _trap_handler()
with the same number of (or fewer) instructions.

Regards,
Anup

>
> Jess
>
> > Came_From_M_Mode = 0    ==>    Exception_Stack = TP
> > Came_From_M_Mode = 1    ==>    Exception_Stack = SP
> >
> > Signed-off-by: Anup Patel <anup.patel@wdc.com>
> > ---
> > firmware/fw_base.S | 49 ++++++++++++++++++++--------------------------
> > 1 file changed, 21 insertions(+), 28 deletions(-)
> >
> > diff --git a/firmware/fw_base.S b/firmware/fw_base.S
> > index b66ac41..53a9a48 100644
> > --- a/firmware/fw_base.S
> > +++ b/firmware/fw_base.S
> > @@ -490,35 +490,28 @@ _trap_handler:
> >       /* Save T0 in scratch space */
> >       REG_S   t0, SBI_SCRATCH_TMP0_OFFSET(tp)
> >
> > -     /* Check which mode we came from */
> > +     /*
> > +      * Set T0 to appropriate exception stack
> > +      *
> > +      * Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1
> > +      * Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP))
> > +      *
> > +      * Came_From_M_Mode = 0    ==>    Exception_Stack = TP
> > +      * Came_From_M_Mode = 1    ==>    Exception_Stack = SP
> > +      */
> >       csrr    t0, CSR_MSTATUS
> > -     srl     t0, t0, MSTATUS_MPP_SHIFT
> > -     and     t0, t0, PRV_M
> > -     xori    t0, t0, PRV_M
> > -     beq     t0, zero, _trap_handler_m_mode
> > -
> > -     /* We came from S-mode or U-mode */
> > -_trap_handler_s_mode:
> > -     /* Set T0 to original SP */
> > -     add     t0, sp, zero
> > -
> > -     /* Setup exception stack */
> > -     add     sp, tp, -(SBI_TRAP_REGS_SIZE)
> > -
> > -     /* Jump to code common for all modes */
> > -     j       _trap_handler_all_mode
> > -
> > -     /* We came from M-mode */
> > -_trap_handler_m_mode:
> > -     /* Set T0 to original SP */
> > -     add     t0, sp, zero
> > -
> > -     /* Re-use current SP as exception stack */
> > -     add     sp, sp, -(SBI_TRAP_REGS_SIZE)
> > -
> > -_trap_handler_all_mode:
> > -     /* Save original SP (from T0) on stack */
> > -     REG_S   t0, SBI_TRAP_REGS_OFFSET(sp)(sp)
> > +     srl     t0, t0, (MSTATUS_MPP_SHIFT + 1)
> > +     and     t0, t0, 0x1
> > +     xor     sp, sp, tp
> > +     mul     t0, t0, sp
> > +     xor     sp, sp, tp
> > +     xor     t0, tp, t0
> > +
> > +     /* Save original SP on exception stack */
> > +     REG_S   sp, (SBI_TRAP_REGS_OFFSET(sp) - SBI_TRAP_REGS_SIZE)(t0)
> > +
> > +     /* Set SP to exception stack and make room for trap registers */
> > +     add     sp, t0, -(SBI_TRAP_REGS_SIZE)
> >
> >       /* Restore T0 from scratch space */
> >       REG_L   t0, SBI_SCRATCH_TMP0_OFFSET(tp)
> > --
> > 2.25.1
> >
> >
> > --
> > opensbi mailing list
> > opensbi@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/opensbi
Anup Patel Aug. 11, 2020, 4:29 a.m. UTC | #3
On Tue, Aug 11, 2020 at 8:40 AM Anup Patel <anup@brainfault.org> wrote:
>
> On Mon, Aug 10, 2020 at 7:35 PM Jessica Clarke <jrtc27@jrtc27.com> wrote:
> >
> > On 10 Aug 2020, at 14:52, Anup Patel <Anup.Patel@wdc.com> wrote:
> > >
> > > Currently, the low-level trap handler (i.e. _trap_handler()) uses
> > > branch instructions to conditionally setup exception stack based
> > > on which mode trap occured.
> > >
> > > This patch implements exception stack setup using xor instructions
> > > which is 2 instructions less compared to previous approach and faster
> > > due to lack of branch instructions.
> > >
> > > The new exception stack setup approach can be best described by the
> > > following pseudocode:
> > >
> > > Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1
> >
> > This relies on PRIV 0b10 currently being unallocated, which doesn't
> > make me feel warm and fuzzy inside.
>
> The PRIV 0b10 mode is not used by any of the upcoming extensions. In future,
> we will require really strong justification in future to use the
> unused 0b10 mode.
> Even H-extension has cleanly avoided using PRIV 0b10 mode.
>
> In any case, we should document this assumption.
>
> >
> > > Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP))
> >
> > This relies on MUL with 0/1 being fast, but that may well be a different
>
> The MUL can be replaced with AND instruction but it was taking one
> additional instruction. See below.
>
>     /*
>      * Set T0 to appropriate exception stack
>      *
>      * Came_From_M_Mode = ((MSTATUS.MPP & 0x2) >> 1) - 1
>      * Exception_Stack = SP ^ (Came_From_M_Mode & (TP ^ SP))
>      *
>      * Came_From_M_Mode = -1UL   ==>    Exception_Stack = TP
>      * Came_From_M_Mode = 0UL    ==>    Exception_Stack = SP
>      */
>     csrr    t0, CSR_MSTATUS
>     srl    t0, t0, (MSTATUS_MPP_SHIFT + 1)
>     and    t0, t0, 0x1
>     add    t0, t0, -1
>     xor    tp, tp, sp
>     and    t0, t0, tp
>     xor    tp, tp, sp
>     xor    t0, sp, t0
>
> > pipeline in your processor (in an OoO core) with fewer available, or a
> > higher-latency path to an M-box in a simpler in-order pipeline. The MUL
> > is also not compressible. You probably do want a branch, but in a way
> > that macro-op fusion will hopefully turn it into a conditional move:
> >
> >     mv Exception_Stack, tp
> >     beqz Came_From_M_Mode, 1f
> >     mv Exception_Stack, sp
> > 1:
>
> Actually, I had started off with branches but code without branches (this
> patch) is performing better on QEMU (because fewer TCG blocks required
> for translating _trap_handler()). This code (this patch) will be much faster
> on real HW because branch misses can cause bubbles in the instruction
> pipeline.
>
> The code with branches is as follows:
>     csrr    t0, CSR_MSTATUS
>     srl    t0, t0, (MSTATUS_MPP_SHIFT + 1)
>     and    t0, t0, 0x1
>     beq    t0, zero, _trap_stack_s_mode
> _trap_stack_m_mode:
>     add    t0, sp, zero
>     j    _trap_stack_done
> _trap_stack_s_mode:
>     add    t0, tp, zero
> _trap_stack_done:
>
> >
> > Same number of instructions, but all compressible, no multiplications
> > required and the lone branch can be macro-op fused.
>
> Overall, the key thing is having no (or fewer) branches in _trap_handler()
> with the same number of (or fewer) instructions.

Here's another approach which has no assumption about PRIV 0b10 mode
and it uses AND operation in-place of MUL.

    /*
     * Set T0 to appropriate exception stack
     *
     * Came_From_M_Mode = ((MSTATUS.MPP < 0x3) ? 1 : 0) - 1;
     * Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP))
     *
     * Came_From_M_Mode = -1UL   ==>    Exception_Stack = SP
     * Came_From_M_Mode = 0UL    ==>    Exception_Stack = TP
     */
    csrr    t0, CSR_MSTATUS
    srl    t0, t0, MSTATUS_MPP_SHIFT
    and    t0, t0, PRV_M
    slti    t0, t0, PRV_M
    add    t0, t0, -1
    xor    sp, sp, tp
    and    t0, t0, sp
    xor    sp, sp, tp
    xor    t0, tp, t0

Overall, the number of instructions are the same but no branches. Even
this approach performs better on QEMU compared to code with branches.

Here's the code with branches..

    csrr    t0, CSR_MSTATUS
    srl    t0, t0, MSTATUS_MPP_SHIFT
    and    t0, t0, PRV_M
    slti    t0, t0, PRV_M
    beq    t0, zero, _trap_stack_m_mode
_trap_stack_s_mode:
    add    t0, tp, zero
    j    _trap_stack_done
_trap_stack_m_mode:
    add    t0, sp, zero
_trap_stack_done:

Regards,
Anup

>
> Regards,
> Anup
>
> >
> > Jess
> >
> > > Came_From_M_Mode = 0    ==>    Exception_Stack = TP
> > > Came_From_M_Mode = 1    ==>    Exception_Stack = SP
> > >
> > > Signed-off-by: Anup Patel <anup.patel@wdc.com>
> > > ---
> > > firmware/fw_base.S | 49 ++++++++++++++++++++--------------------------
> > > 1 file changed, 21 insertions(+), 28 deletions(-)
> > >
> > > diff --git a/firmware/fw_base.S b/firmware/fw_base.S
> > > index b66ac41..53a9a48 100644
> > > --- a/firmware/fw_base.S
> > > +++ b/firmware/fw_base.S
> > > @@ -490,35 +490,28 @@ _trap_handler:
> > >       /* Save T0 in scratch space */
> > >       REG_S   t0, SBI_SCRATCH_TMP0_OFFSET(tp)
> > >
> > > -     /* Check which mode we came from */
> > > +     /*
> > > +      * Set T0 to appropriate exception stack
> > > +      *
> > > +      * Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1
> > > +      * Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP))
> > > +      *
> > > +      * Came_From_M_Mode = 0    ==>    Exception_Stack = TP
> > > +      * Came_From_M_Mode = 1    ==>    Exception_Stack = SP
> > > +      */
> > >       csrr    t0, CSR_MSTATUS
> > > -     srl     t0, t0, MSTATUS_MPP_SHIFT
> > > -     and     t0, t0, PRV_M
> > > -     xori    t0, t0, PRV_M
> > > -     beq     t0, zero, _trap_handler_m_mode
> > > -
> > > -     /* We came from S-mode or U-mode */
> > > -_trap_handler_s_mode:
> > > -     /* Set T0 to original SP */
> > > -     add     t0, sp, zero
> > > -
> > > -     /* Setup exception stack */
> > > -     add     sp, tp, -(SBI_TRAP_REGS_SIZE)
> > > -
> > > -     /* Jump to code common for all modes */
> > > -     j       _trap_handler_all_mode
> > > -
> > > -     /* We came from M-mode */
> > > -_trap_handler_m_mode:
> > > -     /* Set T0 to original SP */
> > > -     add     t0, sp, zero
> > > -
> > > -     /* Re-use current SP as exception stack */
> > > -     add     sp, sp, -(SBI_TRAP_REGS_SIZE)
> > > -
> > > -_trap_handler_all_mode:
> > > -     /* Save original SP (from T0) on stack */
> > > -     REG_S   t0, SBI_TRAP_REGS_OFFSET(sp)(sp)
> > > +     srl     t0, t0, (MSTATUS_MPP_SHIFT + 1)
> > > +     and     t0, t0, 0x1
> > > +     xor     sp, sp, tp
> > > +     mul     t0, t0, sp
> > > +     xor     sp, sp, tp
> > > +     xor     t0, tp, t0
> > > +
> > > +     /* Save original SP on exception stack */
> > > +     REG_S   sp, (SBI_TRAP_REGS_OFFSET(sp) - SBI_TRAP_REGS_SIZE)(t0)
> > > +
> > > +     /* Set SP to exception stack and make room for trap registers */
> > > +     add     sp, t0, -(SBI_TRAP_REGS_SIZE)
> > >
> > >       /* Restore T0 from scratch space */
> > >       REG_L   t0, SBI_SCRATCH_TMP0_OFFSET(tp)
> > > --
> > > 2.25.1
> > >
> > >
> > > --
> > > opensbi mailing list
> > > opensbi@lists.infradead.org
> > > http://lists.infradead.org/mailman/listinfo/opensbi
Jessica Clarke Aug. 11, 2020, 12:35 p.m. UTC | #4
On 11 Aug 2020, at 04:10, Anup Patel <anup@brainfault.org> wrote:
> 
> On Mon, Aug 10, 2020 at 7:35 PM Jessica Clarke <jrtc27@jrtc27.com> wrote:
>> 
>> On 10 Aug 2020, at 14:52, Anup Patel <Anup.Patel@wdc.com> wrote:
>>> 
>>> Currently, the low-level trap handler (i.e. _trap_handler()) uses
>>> branch instructions to conditionally setup exception stack based
>>> on which mode trap occured.
>>> 
>>> This patch implements exception stack setup using xor instructions
>>> which is 2 instructions less compared to previous approach and faster
>>> due to lack of branch instructions.
>>> 
>>> The new exception stack setup approach can be best described by the
>>> following pseudocode:
>>> 
>>> Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1
>> 
>> This relies on PRIV 0b10 currently being unallocated, which doesn't
>> make me feel warm and fuzzy inside.
> 
> The PRIV 0b10 mode is not used by any of the upcoming extensions. In future,
> we will require really strong justification in future to use the
> unused 0b10 mode.
> Even H-extension has cleanly avoided using PRIV 0b10 mode.
> 
> In any case, we should document this assumption.
> 
>> 
>>> Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP))
>> 
>> This relies on MUL with 0/1 being fast, but that may well be a different
> 
> The MUL can be replaced with AND instruction but it was taking one
> additional instruction. See below.
> 
>    /*
>     * Set T0 to appropriate exception stack
>     *
>     * Came_From_M_Mode = ((MSTATUS.MPP & 0x2) >> 1) - 1
>     * Exception_Stack = SP ^ (Came_From_M_Mode & (TP ^ SP))
>     *
>     * Came_From_M_Mode = -1UL   ==>    Exception_Stack = TP
>     * Came_From_M_Mode = 0UL    ==>    Exception_Stack = SP
>     */
>    csrr    t0, CSR_MSTATUS
>    srl    t0, t0, (MSTATUS_MPP_SHIFT + 1)
>    and    t0, t0, 0x1
>    add    t0, t0, -1
>    xor    tp, tp, sp
>    and    t0, t0, tp
>    xor    tp, tp, sp
>    xor    t0, sp, t0
> 
>> pipeline in your processor (in an OoO core) with fewer available, or a
>> higher-latency path to an M-box in a simpler in-order pipeline. The MUL
>> is also not compressible. You probably do want a branch, but in a way
>> that macro-op fusion will hopefully turn it into a conditional move:
>> 
>>    mv Exception_Stack, tp
>>    beqz Came_From_M_Mode, 1f
>>    mv Exception_Stack, sp
>> 1:
> 
> Actually, I had started off with branches but code without branches (this
> patch) is performing better on QEMU (because fewer TCG blocks required
> for translating _trap_handler()). This code (this patch) will be much faster
> on real HW because branch misses can cause bubbles in the instruction
> pipeline.

I don't think you properly read what I wrote. I specifically made it so
there was a single branch to skip a single instruction and therefore a
macro-op fusion candidate. That means the branch+move is turned into a
conditional move, with no need to branch predict (and potentially
suffer from a miss). I don't know whether QEMU recognises the pattern,
but if not that's just a deficiency in QEMU itself, and fixing that
would suddenly make my proposed solution faster than the one involving
multiplication. You have also neglected to address the fact that my
proposed code is more compressible than yours.

Optimising for QEMU is also not the right approach in my opinion; we
should optimise for hardware first and foremost if we ever want RISC-V
to actually succeed in the real world. Real hardware does macro-op
fusion in many cases, and even in the cases it doesn't, multiplication
is not cheap. In a classic 5-stage pipeline, multiplication is done a
stage after execute, and may well be multi-cycle even for multiplying
by 0/1, which would cause the pipeline to *always* stall, rather than
*sometimes* in the case of a branch mispredict if macro-op fusion is
not implemented. Given it's a direct branch to a nearby instruction,
I'd expect both the misprediction and the corrected prediction to be L1
I-cache hits, so an in-order core should see very little latency. A
good out-of-order core like SonicBOOM will correctly decode it to a
conditional move (in fact I believe it decodes short conditional
branches as being a series of predicated instructions).

Also, if MUL with 0/1 were faster, do you not think compilers would
already be emitting that for RISC-V? Because both GCC and LLVM will use
(compressed) branch + (compressed) move to do a conditional move.

> The code with branches is as follows:
>    csrr    t0, CSR_MSTATUS
>    srl    t0, t0, (MSTATUS_MPP_SHIFT + 1)
>    and    t0, t0, 0x1
>    beq    t0, zero, _trap_stack_s_mode
> _trap_stack_m_mode:
>    add    t0, sp, zero
>    j    _trap_stack_done
> _trap_stack_s_mode:
>    add    t0, tp, zero
> _trap_stack_done:

Again, that's not what I wrote. Mine was shorter with only a single
branch.

Jess

>> 
>> Same number of instructions, but all compressible, no multiplications
>> required and the lone branch can be macro-op fused.
> 
> Overall, the key thing is having no (or fewer) branches in _trap_handler()
> with the same number of (or fewer) instructions.
> 
> Regards,
> Anup
> 
>> 
>> Jess
>> 
>>> Came_From_M_Mode = 0    ==>    Exception_Stack = TP
>>> Came_From_M_Mode = 1    ==>    Exception_Stack = SP
>>> 
>>> Signed-off-by: Anup Patel <anup.patel@wdc.com>
>>> ---
>>> firmware/fw_base.S | 49 ++++++++++++++++++++--------------------------
>>> 1 file changed, 21 insertions(+), 28 deletions(-)
>>> 
>>> diff --git a/firmware/fw_base.S b/firmware/fw_base.S
>>> index b66ac41..53a9a48 100644
>>> --- a/firmware/fw_base.S
>>> +++ b/firmware/fw_base.S
>>> @@ -490,35 +490,28 @@ _trap_handler:
>>>      /* Save T0 in scratch space */
>>>      REG_S   t0, SBI_SCRATCH_TMP0_OFFSET(tp)
>>> 
>>> -     /* Check which mode we came from */
>>> +     /*
>>> +      * Set T0 to appropriate exception stack
>>> +      *
>>> +      * Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1
>>> +      * Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP))
>>> +      *
>>> +      * Came_From_M_Mode = 0    ==>    Exception_Stack = TP
>>> +      * Came_From_M_Mode = 1    ==>    Exception_Stack = SP
>>> +      */
>>>      csrr    t0, CSR_MSTATUS
>>> -     srl     t0, t0, MSTATUS_MPP_SHIFT
>>> -     and     t0, t0, PRV_M
>>> -     xori    t0, t0, PRV_M
>>> -     beq     t0, zero, _trap_handler_m_mode
>>> -
>>> -     /* We came from S-mode or U-mode */
>>> -_trap_handler_s_mode:
>>> -     /* Set T0 to original SP */
>>> -     add     t0, sp, zero
>>> -
>>> -     /* Setup exception stack */
>>> -     add     sp, tp, -(SBI_TRAP_REGS_SIZE)
>>> -
>>> -     /* Jump to code common for all modes */
>>> -     j       _trap_handler_all_mode
>>> -
>>> -     /* We came from M-mode */
>>> -_trap_handler_m_mode:
>>> -     /* Set T0 to original SP */
>>> -     add     t0, sp, zero
>>> -
>>> -     /* Re-use current SP as exception stack */
>>> -     add     sp, sp, -(SBI_TRAP_REGS_SIZE)
>>> -
>>> -_trap_handler_all_mode:
>>> -     /* Save original SP (from T0) on stack */
>>> -     REG_S   t0, SBI_TRAP_REGS_OFFSET(sp)(sp)
>>> +     srl     t0, t0, (MSTATUS_MPP_SHIFT + 1)
>>> +     and     t0, t0, 0x1
>>> +     xor     sp, sp, tp
>>> +     mul     t0, t0, sp
>>> +     xor     sp, sp, tp
>>> +     xor     t0, tp, t0
>>> +
>>> +     /* Save original SP on exception stack */
>>> +     REG_S   sp, (SBI_TRAP_REGS_OFFSET(sp) - SBI_TRAP_REGS_SIZE)(t0)
>>> +
>>> +     /* Set SP to exception stack and make room for trap registers */
>>> +     add     sp, t0, -(SBI_TRAP_REGS_SIZE)
>>> 
>>>      /* Restore T0 from scratch space */
>>>      REG_L   t0, SBI_SCRATCH_TMP0_OFFSET(tp)
>>> --
>>> 2.25.1
>>> 
>>> 
>>> --
>>> opensbi mailing list
>>> opensbi@lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/opensbi
Anup Patel Aug. 11, 2020, 1:40 p.m. UTC | #5
On Tue, Aug 11, 2020 at 6:05 PM Jessica Clarke <jrtc27@jrtc27.com> wrote:
>
> On 11 Aug 2020, at 04:10, Anup Patel <anup@brainfault.org> wrote:
> >
> > On Mon, Aug 10, 2020 at 7:35 PM Jessica Clarke <jrtc27@jrtc27.com> wrote:
> >>
> >> On 10 Aug 2020, at 14:52, Anup Patel <Anup.Patel@wdc.com> wrote:
> >>>
> >>> Currently, the low-level trap handler (i.e. _trap_handler()) uses
> >>> branch instructions to conditionally setup exception stack based
> >>> on which mode trap occured.
> >>>
> >>> This patch implements exception stack setup using xor instructions
> >>> which is 2 instructions less compared to previous approach and faster
> >>> due to lack of branch instructions.
> >>>
> >>> The new exception stack setup approach can be best described by the
> >>> following pseudocode:
> >>>
> >>> Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1
> >>
> >> This relies on PRIV 0b10 currently being unallocated, which doesn't
> >> make me feel warm and fuzzy inside.
> >
> > The PRIV 0b10 mode is not used by any of the upcoming extensions. In future,
> > we will require really strong justification in future to use the
> > unused 0b10 mode.
> > Even H-extension has cleanly avoided using PRIV 0b10 mode.
> >
> > In any case, we should document this assumption.
> >
> >>
> >>> Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP))
> >>
> >> This relies on MUL with 0/1 being fast, but that may well be a different
> >
> > The MUL can be replaced with AND instruction but it was taking one
> > additional instruction. See below.
> >
> >    /*
> >     * Set T0 to appropriate exception stack
> >     *
> >     * Came_From_M_Mode = ((MSTATUS.MPP & 0x2) >> 1) - 1
> >     * Exception_Stack = SP ^ (Came_From_M_Mode & (TP ^ SP))
> >     *
> >     * Came_From_M_Mode = -1UL   ==>    Exception_Stack = TP
> >     * Came_From_M_Mode = 0UL    ==>    Exception_Stack = SP
> >     */
> >    csrr    t0, CSR_MSTATUS
> >    srl    t0, t0, (MSTATUS_MPP_SHIFT + 1)
> >    and    t0, t0, 0x1
> >    add    t0, t0, -1
> >    xor    tp, tp, sp
> >    and    t0, t0, tp
> >    xor    tp, tp, sp
> >    xor    t0, sp, t0
> >
> >> pipeline in your processor (in an OoO core) with fewer available, or a
> >> higher-latency path to an M-box in a simpler in-order pipeline. The MUL
> >> is also not compressible. You probably do want a branch, but in a way
> >> that macro-op fusion will hopefully turn it into a conditional move:
> >>
> >>    mv Exception_Stack, tp
> >>    beqz Came_From_M_Mode, 1f
> >>    mv Exception_Stack, sp
> >> 1:
> >
> > Actually, I had started off with branches but code without branches (this
> > patch) is performing better on QEMU (because fewer TCG blocks required
> > for translating _trap_handler()). This code (this patch) will be much faster
> > on real HW because branch misses can cause bubbles in the instruction
> > pipeline.
>
> I don't think you properly read what I wrote. I specifically made it so
> there was a single branch to skip a single instruction and therefore a
> macro-op fusion candidate. That means the branch+move is turned into a
> conditional move, with no need to branch predict (and potentially
> suffer from a miss). I don't know whether QEMU recognises the pattern,
> but if not that's just a deficiency in QEMU itself, and fixing that
> would suddenly make my proposed solution faster than the one involving
> multiplication. You have also neglected to address the fact that my
> proposed code is more compressible than yours.

I don't think you understand the problem here. We have to choose a
an appropriate exception stack without modifying current SP because
current SP will have to be first saved on exception stack. That's why
we:
1. Compute and set exception stack in T0
2. Save current SP on exception stack pointed by T0
3. Change current SP to T0

This patch is only trying to optimize point1.

Also, the place where the exception stack is chosen, we can only
clobber T0 for computation to preserve state of other general purpose
registers.

Your proposal if using branch+move won't work because you need
two general purpose registers one for Exception_Stack and second
for Came_From_M_Mode.

>
> Optimising for QEMU is also not the right approach in my opinion; we
> should optimise for hardware first and foremost if we ever want RISC-V
> to actually succeed in the real world. Real hardware does macro-op
> fusion in many cases, and even in the cases it doesn't, multiplication
> is not cheap. In a classic 5-stage pipeline, multiplication is done a
> stage after execute, and may well be multi-cycle even for multiplying
> by 0/1, which would cause the pipeline to *always* stall, rather than
> *sometimes* in the case of a branch mispredict if macro-op fusion is
> not implemented. Given it's a direct branch to a nearby instruction,
> I'd expect both the misprediction and the corrected prediction to be L1
> I-cache hits, so an in-order core should see very little latency. A
> good out-of-order core like SonicBOOM will correctly decode it to a
> conditional move (in fact I believe it decodes short conditional
> branches as being a series of predicated instructions).

Nope, we are not optimizing just for QEMU. Replacing branches
with simple sequential 1-cyle ALU instructions (not MUL instruction)
is faster on simple in-order CPUs as well. The biggest problem with
branches is the wrong path taken by branch predictor which causes
pipeline flush.

>
> Also, if MUL with 0/1 were faster, do you not think compilers would
> already be emitting that for RISC-V? Because both GCC and LLVM will use
> (compressed) branch + (compressed) move to do a conditional move

I am not pushing for MUL instruction here. I am very well aware of the
cost of MUL instructions. Please look at my alternate proposal where
I use AND instruction in-place of MUL with slight modifications.

Also, we can't compare the low-level trap_handler with general
C context in which GCC/LLVM optimizes code where stack pointer
is already set.

The challenging part here is to select exception stack using only
one temporary (T0) general purpose register.

>
> > The code with branches is as follows:
> >    csrr    t0, CSR_MSTATUS
> >    srl    t0, t0, (MSTATUS_MPP_SHIFT + 1)
> >    and    t0, t0, 0x1
> >    beq    t0, zero, _trap_stack_s_mode
> > _trap_stack_m_mode:
> >    add    t0, sp, zero
> >    j    _trap_stack_done
> > _trap_stack_s_mode:
> >    add    t0, tp, zero
> > _trap_stack_done:
>
> Again, that's not what I wrote. Mine was shorter with only a single
> branch.

I think you missed the context in-which this optimization is being done.

Please see above.

Let me send V2 with some of your comments addressed.

Regards,
Anup

>
> Jess
>
> >>
> >> Same number of instructions, but all compressible, no multiplications
> >> required and the lone branch can be macro-op fused.
> >
> > Overall, the key thing is having no (or fewer) branches in _trap_handler()
> > with the same number of (or fewer) instructions.
> >
> > Regards,
> > Anup
> >
> >>
> >> Jess
> >>
> >>> Came_From_M_Mode = 0    ==>    Exception_Stack = TP
> >>> Came_From_M_Mode = 1    ==>    Exception_Stack = SP
> >>>
> >>> Signed-off-by: Anup Patel <anup.patel@wdc.com>
> >>> ---
> >>> firmware/fw_base.S | 49 ++++++++++++++++++++--------------------------
> >>> 1 file changed, 21 insertions(+), 28 deletions(-)
> >>>
> >>> diff --git a/firmware/fw_base.S b/firmware/fw_base.S
> >>> index b66ac41..53a9a48 100644
> >>> --- a/firmware/fw_base.S
> >>> +++ b/firmware/fw_base.S
> >>> @@ -490,35 +490,28 @@ _trap_handler:
> >>>      /* Save T0 in scratch space */
> >>>      REG_S   t0, SBI_SCRATCH_TMP0_OFFSET(tp)
> >>>
> >>> -     /* Check which mode we came from */
> >>> +     /*
> >>> +      * Set T0 to appropriate exception stack
> >>> +      *
> >>> +      * Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1
> >>> +      * Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP))
> >>> +      *
> >>> +      * Came_From_M_Mode = 0    ==>    Exception_Stack = TP
> >>> +      * Came_From_M_Mode = 1    ==>    Exception_Stack = SP
> >>> +      */
> >>>      csrr    t0, CSR_MSTATUS
> >>> -     srl     t0, t0, MSTATUS_MPP_SHIFT
> >>> -     and     t0, t0, PRV_M
> >>> -     xori    t0, t0, PRV_M
> >>> -     beq     t0, zero, _trap_handler_m_mode
> >>> -
> >>> -     /* We came from S-mode or U-mode */
> >>> -_trap_handler_s_mode:
> >>> -     /* Set T0 to original SP */
> >>> -     add     t0, sp, zero
> >>> -
> >>> -     /* Setup exception stack */
> >>> -     add     sp, tp, -(SBI_TRAP_REGS_SIZE)
> >>> -
> >>> -     /* Jump to code common for all modes */
> >>> -     j       _trap_handler_all_mode
> >>> -
> >>> -     /* We came from M-mode */
> >>> -_trap_handler_m_mode:
> >>> -     /* Set T0 to original SP */
> >>> -     add     t0, sp, zero
> >>> -
> >>> -     /* Re-use current SP as exception stack */
> >>> -     add     sp, sp, -(SBI_TRAP_REGS_SIZE)
> >>> -
> >>> -_trap_handler_all_mode:
> >>> -     /* Save original SP (from T0) on stack */
> >>> -     REG_S   t0, SBI_TRAP_REGS_OFFSET(sp)(sp)
> >>> +     srl     t0, t0, (MSTATUS_MPP_SHIFT + 1)
> >>> +     and     t0, t0, 0x1
> >>> +     xor     sp, sp, tp
> >>> +     mul     t0, t0, sp
> >>> +     xor     sp, sp, tp
> >>> +     xor     t0, tp, t0
> >>> +
> >>> +     /* Save original SP on exception stack */
> >>> +     REG_S   sp, (SBI_TRAP_REGS_OFFSET(sp) - SBI_TRAP_REGS_SIZE)(t0)
> >>> +
> >>> +     /* Set SP to exception stack and make room for trap registers */
> >>> +     add     sp, t0, -(SBI_TRAP_REGS_SIZE)
> >>>
> >>>      /* Restore T0 from scratch space */
> >>>      REG_L   t0, SBI_SCRATCH_TMP0_OFFSET(tp)
> >>> --
> >>> 2.25.1
> >>>
> >>>
> >>> --
> >>> opensbi mailing list
> >>> opensbi@lists.infradead.org
> >>> http://lists.infradead.org/mailman/listinfo/opensbi
>
diff mbox series

Patch

diff --git a/firmware/fw_base.S b/firmware/fw_base.S
index b66ac41..53a9a48 100644
--- a/firmware/fw_base.S
+++ b/firmware/fw_base.S
@@ -490,35 +490,28 @@  _trap_handler:
 	/* Save T0 in scratch space */
 	REG_S	t0, SBI_SCRATCH_TMP0_OFFSET(tp)
 
-	/* Check which mode we came from */
+	/*
+	 * Set T0 to appropriate exception stack
+	 *
+	 * Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1
+	 * Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP))
+	 *
+	 * Came_From_M_Mode = 0    ==>    Exception_Stack = TP
+	 * Came_From_M_Mode = 1    ==>    Exception_Stack = SP
+	 */
 	csrr	t0, CSR_MSTATUS
-	srl	t0, t0, MSTATUS_MPP_SHIFT
-	and	t0, t0, PRV_M
-	xori	t0, t0, PRV_M
-	beq	t0, zero, _trap_handler_m_mode
-
-	/* We came from S-mode or U-mode */
-_trap_handler_s_mode:
-	/* Set T0 to original SP */
-	add	t0, sp, zero
-
-	/* Setup exception stack */
-	add	sp, tp, -(SBI_TRAP_REGS_SIZE)
-
-	/* Jump to code common for all modes */
-	j	_trap_handler_all_mode
-
-	/* We came from M-mode */
-_trap_handler_m_mode:
-	/* Set T0 to original SP */
-	add	t0, sp, zero
-
-	/* Re-use current SP as exception stack */
-	add	sp, sp, -(SBI_TRAP_REGS_SIZE)
-
-_trap_handler_all_mode:
-	/* Save original SP (from T0) on stack */
-	REG_S	t0, SBI_TRAP_REGS_OFFSET(sp)(sp)
+	srl	t0, t0, (MSTATUS_MPP_SHIFT + 1)
+	and	t0, t0, 0x1
+	xor	sp, sp, tp
+	mul	t0, t0, sp
+	xor	sp, sp, tp
+	xor	t0, tp, t0
+
+	/* Save original SP on exception stack */
+	REG_S	sp, (SBI_TRAP_REGS_OFFSET(sp) - SBI_TRAP_REGS_SIZE)(t0)
+
+	/* Set SP to exception stack and make room for trap registers */
+	add	sp, t0, -(SBI_TRAP_REGS_SIZE)
 
 	/* Restore T0 from scratch space */
 	REG_L	t0, SBI_SCRATCH_TMP0_OFFSET(tp)