diff mbox series

bpf: Tweak BPF jump table optimizations for objtool compatibility

Message ID b581438a16e78559b4cea28cf8bc74158791a9b3.1588273491.git.jpoimboe@redhat.com
State Changes Requested
Delegated to: BPF Maintainers
Headers show
Series bpf: Tweak BPF jump table optimizations for objtool compatibility | expand

Commit Message

Josh Poimboeuf April 30, 2020, 7:07 p.m. UTC
Objtool decodes instructions and follows all potential code branches
within a function.  But it's not an emulator, so it doesn't track
register values.  For that reason, it usually can't follow
intra-function indirect branches, unless they're using a jump table
which follows a certain format (e.g., GCC switch statement jump tables).

In most cases, the generated code for the BPF jump table looks a lot
like a GCC jump table, so objtool can follow it.  However, with
RETPOLINE=n, GCC keeps the jump table address in a register, and then
does 160+ indirect jumps with it.  When objtool encounters the indirect
jumps, it can't tell which jump table is being used (or even whether
they might be sibling calls instead).

This was fixed before by disabling an optimization in ___bpf_prog_run(),
using the "optimize" function attribute.  However, that attribute is bad
news.  It doesn't append options to the command-line arguments.  Instead
it starts from a blank slate.  And according to recent GCC documentation
it's not recommended for production use.  So revert the previous fix:

  3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")

With that reverted, solve the original problem in a different way by
getting rid of the "goto select_insn" indirection, and instead just goto
the jump table directly.  This simplifies the code a bit and helps GCC
generate saner code for the jump table branches, at least in the
RETPOLINE=n case.

But, in the RETPOLINE=y case, this simpler code actually causes GCC to
generate far worse code, ballooning the function text size by +40%.  So
leave that code the way it was.  In fact Alexei prefers to leave *all*
the code the way it was, except where needed by objtool.  So even
non-x86 RETPOLINE=n code will continue to have "goto select_insn".

This stuff is crazy voodoo, and far from ideal.  But it works for now.
Eventually, there's a plan to create a compiler plugin for annotating
jump tables.  That will make this a lot less fragile.

Fixes: 3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 include/linux/compiler-gcc.h   |  2 --
 include/linux/compiler_types.h |  4 ----
 kernel/bpf/core.c              | 10 +++++++---
 3 files changed, 7 insertions(+), 9 deletions(-)

Comments

Alexei Starovoitov May 1, 2020, 7:09 p.m. UTC | #1
On Thu, Apr 30, 2020 at 02:07:43PM -0500, Josh Poimboeuf wrote:
> Objtool decodes instructions and follows all potential code branches
> within a function.  But it's not an emulator, so it doesn't track
> register values.  For that reason, it usually can't follow
> intra-function indirect branches, unless they're using a jump table
> which follows a certain format (e.g., GCC switch statement jump tables).
> 
> In most cases, the generated code for the BPF jump table looks a lot
> like a GCC jump table, so objtool can follow it.  However, with
> RETPOLINE=n, GCC keeps the jump table address in a register, and then
> does 160+ indirect jumps with it.  When objtool encounters the indirect
> jumps, it can't tell which jump table is being used (or even whether
> they might be sibling calls instead).
> 
> This was fixed before by disabling an optimization in ___bpf_prog_run(),
> using the "optimize" function attribute.  However, that attribute is bad
> news.  It doesn't append options to the command-line arguments.  Instead
> it starts from a blank slate.  And according to recent GCC documentation
> it's not recommended for production use.  So revert the previous fix:
> 
>   3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
> 
> With that reverted, solve the original problem in a different way by
> getting rid of the "goto select_insn" indirection, and instead just goto
> the jump table directly.  This simplifies the code a bit and helps GCC
> generate saner code for the jump table branches, at least in the
> RETPOLINE=n case.
> 
> But, in the RETPOLINE=y case, this simpler code actually causes GCC to
> generate far worse code, ballooning the function text size by +40%.  So
> leave that code the way it was.  In fact Alexei prefers to leave *all*
> the code the way it was, except where needed by objtool.  So even
> non-x86 RETPOLINE=n code will continue to have "goto select_insn".
> 
> This stuff is crazy voodoo, and far from ideal.  But it works for now.
> Eventually, there's a plan to create a compiler plugin for annotating
> jump tables.  That will make this a lot less fragile.

I don't like this commit log.
Here you're saying that the code recognized by objtool is sane and good
whereas well optimized gcc code is somehow voodoo and bad.
That is just wrong.
goto select_insn; vs goto *jumptable[insn->code]; is not a contract that
compiler has to follow. The compiler is free to convert direct goto
into indirect and the other way around.
For all practical purposes this patch is a band aid for objtool that will fall
apart in the future. Just like the previous patch that survived less than a year.
It's not clear whether old one worked for clang.
It's not clear whether new one will work for clang.
retpoline=y causing code bloat is a different issue that can be investigated
separately. gcc/clang have different modes of generating retpoline thunks.
May be one of those flags can help.

In other words I'm ok with the patch, but commit log needs to be reworded.

> Fixes: 3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
> Reported-by: Randy Dunlap <rdunlap@infradead.org>
> Reported-by: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> ---
>  include/linux/compiler-gcc.h   |  2 --
>  include/linux/compiler_types.h |  4 ----
>  kernel/bpf/core.c              | 10 +++++++---
>  3 files changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
> index cf294faec2f8..2c8583eb5de8 100644
> --- a/include/linux/compiler-gcc.h
> +++ b/include/linux/compiler-gcc.h
> @@ -176,5 +176,3 @@
>  #else
>  #define __diag_GCC_8(s)
>  #endif
> -
> -#define __no_fgcse __attribute__((optimize("-fno-gcse")))
> diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
> index e970f97a7fcb..58105f1deb79 100644
> --- a/include/linux/compiler_types.h
> +++ b/include/linux/compiler_types.h
> @@ -203,10 +203,6 @@ struct ftrace_likely_data {
>  #define asm_inline asm
>  #endif
>  
> -#ifndef __no_fgcse
> -# define __no_fgcse
> -#endif
> -
>  /* Are two types/vars the same type (ignoring qualifiers)? */
>  #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
>  
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 916f5132a984..eec470c598ad 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -1364,7 +1364,7 @@ u64 __weak bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)
>   *
>   * Decode and execute eBPF instructions.
>   */
> -static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
> +static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
>  {
>  #define BPF_INSN_2_LBL(x, y)    [BPF_##x | BPF_##y] = &&x##_##y
>  #define BPF_INSN_3_LBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = &&x##_##y##_##z
> @@ -1384,11 +1384,15 @@ static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u6
>  #undef BPF_INSN_2_LBL
>  	u32 tail_call_cnt = 0;
>  
> +#if defined(CONFIG_X86_64) && !defined(CONFIG_RETPOLINE)
> +#define CONT	 ({ insn++; goto *jumptable[insn->code]; })
> +#define CONT_JMP ({ insn++; goto *jumptable[insn->code]; })
> +#else
>  #define CONT	 ({ insn++; goto select_insn; })
>  #define CONT_JMP ({ insn++; goto select_insn; })
> -
>  select_insn:
>  	goto *jumptable[insn->code];
> +#endif
>  
>  	/* ALU */
>  #define ALU(OPCODE, OP)			\
> @@ -1547,7 +1551,7 @@ static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u6
>  		 * where arg1_type is ARG_PTR_TO_CTX.
>  		 */
>  		insn = prog->insnsi;
> -		goto select_insn;
> +		CONT;

This is broken. I don't think you've run basic tests with this patch.
Josh Poimboeuf May 1, 2020, 7:22 p.m. UTC | #2
On Fri, May 01, 2020 at 12:09:30PM -0700, Alexei Starovoitov wrote:
> On Thu, Apr 30, 2020 at 02:07:43PM -0500, Josh Poimboeuf wrote:
> > Objtool decodes instructions and follows all potential code branches
> > within a function.  But it's not an emulator, so it doesn't track
> > register values.  For that reason, it usually can't follow
> > intra-function indirect branches, unless they're using a jump table
> > which follows a certain format (e.g., GCC switch statement jump tables).
> > 
> > In most cases, the generated code for the BPF jump table looks a lot
> > like a GCC jump table, so objtool can follow it.  However, with
> > RETPOLINE=n, GCC keeps the jump table address in a register, and then
> > does 160+ indirect jumps with it.  When objtool encounters the indirect
> > jumps, it can't tell which jump table is being used (or even whether
> > they might be sibling calls instead).
> > 
> > This was fixed before by disabling an optimization in ___bpf_prog_run(),
> > using the "optimize" function attribute.  However, that attribute is bad
> > news.  It doesn't append options to the command-line arguments.  Instead
> > it starts from a blank slate.  And according to recent GCC documentation
> > it's not recommended for production use.  So revert the previous fix:
> > 
> >   3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
> > 
> > With that reverted, solve the original problem in a different way by
> > getting rid of the "goto select_insn" indirection, and instead just goto
> > the jump table directly.  This simplifies the code a bit and helps GCC
> > generate saner code for the jump table branches, at least in the
> > RETPOLINE=n case.
> > 
> > But, in the RETPOLINE=y case, this simpler code actually causes GCC to
> > generate far worse code, ballooning the function text size by +40%.  So
> > leave that code the way it was.  In fact Alexei prefers to leave *all*
> > the code the way it was, except where needed by objtool.  So even
> > non-x86 RETPOLINE=n code will continue to have "goto select_insn".
> > 
> > This stuff is crazy voodoo, and far from ideal.  But it works for now.
> > Eventually, there's a plan to create a compiler plugin for annotating
> > jump tables.  That will make this a lot less fragile.
> 
> I don't like this commit log.
> Here you're saying that the code recognized by objtool is sane and good
> whereas well optimized gcc code is somehow voodoo and bad.
> That is just wrong.

I have no idea what you're talking about.

Are you saying that ballooning the function text size by 40% is well
optimized GCC code?  It seems like a bug to me.  That's the only place I
said anything bad about GCC code.

When I said "this stuff is crazy voodoo" I was referring to the patch
itself.  I agree it's horrible, it's only the best approach we're able
to come up with at the moment.

If any of that isn't clear then can you provide specifics about what
seems wrong?
Alexei Starovoitov May 1, 2020, 7:40 p.m. UTC | #3
On Fri, May 01, 2020 at 02:22:04PM -0500, Josh Poimboeuf wrote:
> On Fri, May 01, 2020 at 12:09:30PM -0700, Alexei Starovoitov wrote:
> > On Thu, Apr 30, 2020 at 02:07:43PM -0500, Josh Poimboeuf wrote:
> > > Objtool decodes instructions and follows all potential code branches
> > > within a function.  But it's not an emulator, so it doesn't track
> > > register values.  For that reason, it usually can't follow
> > > intra-function indirect branches, unless they're using a jump table
> > > which follows a certain format (e.g., GCC switch statement jump tables).
> > > 
> > > In most cases, the generated code for the BPF jump table looks a lot
> > > like a GCC jump table, so objtool can follow it.  However, with
> > > RETPOLINE=n, GCC keeps the jump table address in a register, and then
> > > does 160+ indirect jumps with it.  When objtool encounters the indirect
> > > jumps, it can't tell which jump table is being used (or even whether
> > > they might be sibling calls instead).
> > > 
> > > This was fixed before by disabling an optimization in ___bpf_prog_run(),
> > > using the "optimize" function attribute.  However, that attribute is bad
> > > news.  It doesn't append options to the command-line arguments.  Instead
> > > it starts from a blank slate.  And according to recent GCC documentation
> > > it's not recommended for production use.  So revert the previous fix:
> > > 
> > >   3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
> > > 
> > > With that reverted, solve the original problem in a different way by
> > > getting rid of the "goto select_insn" indirection, and instead just goto
> > > the jump table directly.  This simplifies the code a bit and helps GCC
> > > generate saner code for the jump table branches, at least in the
> > > RETPOLINE=n case.
> > > 
> > > But, in the RETPOLINE=y case, this simpler code actually causes GCC to
> > > generate far worse code, ballooning the function text size by +40%.  So
> > > leave that code the way it was.  In fact Alexei prefers to leave *all*
> > > the code the way it was, except where needed by objtool.  So even
> > > non-x86 RETPOLINE=n code will continue to have "goto select_insn".
> > > 
> > > This stuff is crazy voodoo, and far from ideal.  But it works for now.
> > > Eventually, there's a plan to create a compiler plugin for annotating
> > > jump tables.  That will make this a lot less fragile.
> > 
> > I don't like this commit log.
> > Here you're saying that the code recognized by objtool is sane and good
> > whereas well optimized gcc code is somehow voodoo and bad.
> > That is just wrong.
> 
> I have no idea what you're talking about.
> 
> Are you saying that ballooning the function text size by 40% is well
> optimized GCC code?  It seems like a bug to me.  That's the only place I
> said anything bad about GCC code.

It could be a bug, but did you benchmark the speed of interpreter ?
Is it faster or slower with 40% more code ?
Did you benchmark it on other archs ?

> When I said "this stuff is crazy voodoo" I was referring to the patch
> itself.  I agree it's horrible, it's only the best approach we're able
> to come up with at the moment.

please reword it then.
Josh Poimboeuf May 1, 2020, 7:56 p.m. UTC | #4
On Fri, May 01, 2020 at 12:40:53PM -0700, Alexei Starovoitov wrote:
> On Fri, May 01, 2020 at 02:22:04PM -0500, Josh Poimboeuf wrote:
> > On Fri, May 01, 2020 at 12:09:30PM -0700, Alexei Starovoitov wrote:
> > > On Thu, Apr 30, 2020 at 02:07:43PM -0500, Josh Poimboeuf wrote:
> > > > Objtool decodes instructions and follows all potential code branches
> > > > within a function.  But it's not an emulator, so it doesn't track
> > > > register values.  For that reason, it usually can't follow
> > > > intra-function indirect branches, unless they're using a jump table
> > > > which follows a certain format (e.g., GCC switch statement jump tables).
> > > > 
> > > > In most cases, the generated code for the BPF jump table looks a lot
> > > > like a GCC jump table, so objtool can follow it.  However, with
> > > > RETPOLINE=n, GCC keeps the jump table address in a register, and then
> > > > does 160+ indirect jumps with it.  When objtool encounters the indirect
> > > > jumps, it can't tell which jump table is being used (or even whether
> > > > they might be sibling calls instead).
> > > > 
> > > > This was fixed before by disabling an optimization in ___bpf_prog_run(),
> > > > using the "optimize" function attribute.  However, that attribute is bad
> > > > news.  It doesn't append options to the command-line arguments.  Instead
> > > > it starts from a blank slate.  And according to recent GCC documentation
> > > > it's not recommended for production use.  So revert the previous fix:
> > > > 
> > > >   3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
> > > > 
> > > > With that reverted, solve the original problem in a different way by
> > > > getting rid of the "goto select_insn" indirection, and instead just goto
> > > > the jump table directly.  This simplifies the code a bit and helps GCC
> > > > generate saner code for the jump table branches, at least in the
> > > > RETPOLINE=n case.
> > > > 
> > > > But, in the RETPOLINE=y case, this simpler code actually causes GCC to
> > > > generate far worse code, ballooning the function text size by +40%.  So
> > > > leave that code the way it was.  In fact Alexei prefers to leave *all*
> > > > the code the way it was, except where needed by objtool.  So even
> > > > non-x86 RETPOLINE=n code will continue to have "goto select_insn".
> > > > 
> > > > This stuff is crazy voodoo, and far from ideal.  But it works for now.
> > > > Eventually, there's a plan to create a compiler plugin for annotating
> > > > jump tables.  That will make this a lot less fragile.
> > > 
> > > I don't like this commit log.
> > > Here you're saying that the code recognized by objtool is sane and good
> > > whereas well optimized gcc code is somehow voodoo and bad.
> > > That is just wrong.
> > 
> > I have no idea what you're talking about.
> > 
> > Are you saying that ballooning the function text size by 40% is well
> > optimized GCC code?  It seems like a bug to me.  That's the only place I
> > said anything bad about GCC code.
> 
> It could be a bug, but did you benchmark the speed of interpreter ?
> Is it faster or slower with 40% more code ?
> Did you benchmark it on other archs ?

I thought we were in agreement that 40% text growth is bad.  Isn't that
why you wanted to keep 'goto select_insn' for the retpoline case?

If there's some other reason, let me know and I'll put it in the patch
description instead.

> > When I said "this stuff is crazy voodoo" I was referring to the patch
> > itself.  I agree it's horrible, it's only the best approach we're able
> > to come up with at the moment.
> 
> please reword it then.

Ok, so: This *patch* is crazy voodoo ?
Alexei Starovoitov May 2, 2020, 3:06 a.m. UTC | #5
On Fri, May 01, 2020 at 02:56:17PM -0500, Josh Poimboeuf wrote:
> On Fri, May 01, 2020 at 12:40:53PM -0700, Alexei Starovoitov wrote:
> > On Fri, May 01, 2020 at 02:22:04PM -0500, Josh Poimboeuf wrote:
> > > On Fri, May 01, 2020 at 12:09:30PM -0700, Alexei Starovoitov wrote:
> > > > On Thu, Apr 30, 2020 at 02:07:43PM -0500, Josh Poimboeuf wrote:
> > > > > Objtool decodes instructions and follows all potential code branches
> > > > > within a function.  But it's not an emulator, so it doesn't track
> > > > > register values.  For that reason, it usually can't follow
> > > > > intra-function indirect branches, unless they're using a jump table
> > > > > which follows a certain format (e.g., GCC switch statement jump tables).
> > > > > 
> > > > > In most cases, the generated code for the BPF jump table looks a lot
> > > > > like a GCC jump table, so objtool can follow it.  However, with
> > > > > RETPOLINE=n, GCC keeps the jump table address in a register, and then
> > > > > does 160+ indirect jumps with it.  When objtool encounters the indirect
> > > > > jumps, it can't tell which jump table is being used (or even whether
> > > > > they might be sibling calls instead).
> > > > > 
> > > > > This was fixed before by disabling an optimization in ___bpf_prog_run(),
> > > > > using the "optimize" function attribute.  However, that attribute is bad
> > > > > news.  It doesn't append options to the command-line arguments.  Instead
> > > > > it starts from a blank slate.  And according to recent GCC documentation
> > > > > it's not recommended for production use.  So revert the previous fix:
> > > > > 
> > > > >   3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
> > > > > 
> > > > > With that reverted, solve the original problem in a different way by
> > > > > getting rid of the "goto select_insn" indirection, and instead just goto
> > > > > the jump table directly.  This simplifies the code a bit and helps GCC
> > > > > generate saner code for the jump table branches, at least in the
> > > > > RETPOLINE=n case.
> > > > > 
> > > > > But, in the RETPOLINE=y case, this simpler code actually causes GCC to
> > > > > generate far worse code, ballooning the function text size by +40%.  So
> > > > > leave that code the way it was.  In fact Alexei prefers to leave *all*
> > > > > the code the way it was, except where needed by objtool.  So even
> > > > > non-x86 RETPOLINE=n code will continue to have "goto select_insn".
> > > > > 
> > > > > This stuff is crazy voodoo, and far from ideal.  But it works for now.
> > > > > Eventually, there's a plan to create a compiler plugin for annotating
> > > > > jump tables.  That will make this a lot less fragile.
> > > > 
> > > > I don't like this commit log.
> > > > Here you're saying that the code recognized by objtool is sane and good
> > > > whereas well optimized gcc code is somehow voodoo and bad.
> > > > That is just wrong.
> > > 
> > > I have no idea what you're talking about.
> > > 
> > > Are you saying that ballooning the function text size by 40% is well
> > > optimized GCC code?  It seems like a bug to me.  That's the only place I
> > > said anything bad about GCC code.
> > 
> > It could be a bug, but did you benchmark the speed of interpreter ?
> > Is it faster or slower with 40% more code ?
> > Did you benchmark it on other archs ?
> 
> I thought we were in agreement that 40% text growth is bad.  Isn't that
> why you wanted to keep 'goto select_insn' for the retpoline case?

Let me see whether I got this right.
In first the sentence above you're claiming that I've agreed that 
'goto select_insn' is bad for retpoline case and in the second sentence
you're saying that I wanted to keep it because it's bad?
In other words you're saying I wanted bad code for retpoline case for
some mischievous purpose?
Do you really think so or just trolling?

Let's look at the facts.
I've applied your patch and the kernel crashed on the very first test in
selftests/bpf which makes me believe that you only compile tested it.
Taking the question "is 40% text growth is bad?" out of context... Ohh yes.
but if 40% extra code gives 10% speedup to interpreter it's suddenly good, right?
Since you didn't run basic tests I don't think you've tested performance either.
So this direct->indirect patch might cause performance degradation to
architectures that don't have JIT.
On x86-64 JIT=y is the default, so I'm fine taking that performance risk
only for the case where that risk has to be taken. In other words to help
objtool understand the code and only for the case where objtool cannot do
it with existing code.
The 40% potential text decrease after direct->indirect transiton is
irrelevant here. It must be a separate patch after corresponding
performance benchmarking is done.
Just claiming in commit log that current code is obviously bad
is misleading to folks who will be reading it later.

Also as I explained earlier direct->indirect in C is not a contract
for the compiler. Currently there is single C line:
goto *jumptable[insn->code];
but gcc/clang may generate arbitrary number of indirect jumps
for this function.
Changing the macro from "goto select_insn" to "goto *jumptable"
messes with compiler optimizations and there is no guarantee
that the code is going to be better or worse.
Why do you think there are two identical macros there?
#define CONT     ({ insn++; goto select_insn; })
#define CONT_JMP ({ insn++; goto select_insn; })
Why not one?
The answer is in old patch from 2014:
https://patchwork.ozlabs.org/project/netdev/patch/1393910304-4004-2-git-send-email-ast@plumgrid.com/
+#define CONT ({insn++; LOAD_IMM; goto select_insn; })
+#define CONT_JMP ({insn++; LOAD_IMM; goto select_insn; })
+/* some compilers may need help:
+ * #define CONT_JMP ({insn++; LOAD_IMM; goto *jumptable[insn->code]; })
+ */

That was the patch after dozens of performance experiments
with different gcc versions on different cpus.
Six years ago the interpreter performance could be improved
if _one_ of these macros replaced direct with indirect
for certain versions of gcc. But not both macros.

To be honest I don't think you're interested in doing performance
analysis here. You just want to shut up that objtool warning and
move on, right? So please do so without making misleading statements
about goodness or badness of generated code.

Thanks
Josh Poimboeuf May 2, 2020, 7:21 p.m. UTC | #6
On Fri, May 01, 2020 at 08:06:22PM -0700, Alexei Starovoitov wrote:
> On Fri, May 01, 2020 at 02:56:17PM -0500, Josh Poimboeuf wrote:
> > On Fri, May 01, 2020 at 12:40:53PM -0700, Alexei Starovoitov wrote:
> > > On Fri, May 01, 2020 at 02:22:04PM -0500, Josh Poimboeuf wrote:
> > > > On Fri, May 01, 2020 at 12:09:30PM -0700, Alexei Starovoitov wrote:
> > > > > On Thu, Apr 30, 2020 at 02:07:43PM -0500, Josh Poimboeuf wrote:
> > > > > > Objtool decodes instructions and follows all potential code branches
> > > > > > within a function.  But it's not an emulator, so it doesn't track
> > > > > > register values.  For that reason, it usually can't follow
> > > > > > intra-function indirect branches, unless they're using a jump table
> > > > > > which follows a certain format (e.g., GCC switch statement jump tables).
> > > > > > 
> > > > > > In most cases, the generated code for the BPF jump table looks a lot
> > > > > > like a GCC jump table, so objtool can follow it.  However, with
> > > > > > RETPOLINE=n, GCC keeps the jump table address in a register, and then
> > > > > > does 160+ indirect jumps with it.  When objtool encounters the indirect
> > > > > > jumps, it can't tell which jump table is being used (or even whether
> > > > > > they might be sibling calls instead).
> > > > > > 
> > > > > > This was fixed before by disabling an optimization in ___bpf_prog_run(),
> > > > > > using the "optimize" function attribute.  However, that attribute is bad
> > > > > > news.  It doesn't append options to the command-line arguments.  Instead
> > > > > > it starts from a blank slate.  And according to recent GCC documentation
> > > > > > it's not recommended for production use.  So revert the previous fix:
> > > > > > 
> > > > > >   3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
> > > > > > 
> > > > > > With that reverted, solve the original problem in a different way by
> > > > > > getting rid of the "goto select_insn" indirection, and instead just goto
> > > > > > the jump table directly.  This simplifies the code a bit and helps GCC
> > > > > > generate saner code for the jump table branches, at least in the
> > > > > > RETPOLINE=n case.
> > > > > > 
> > > > > > But, in the RETPOLINE=y case, this simpler code actually causes GCC to
> > > > > > generate far worse code, ballooning the function text size by +40%.  So
> > > > > > leave that code the way it was.  In fact Alexei prefers to leave *all*
> > > > > > the code the way it was, except where needed by objtool.  So even
> > > > > > non-x86 RETPOLINE=n code will continue to have "goto select_insn".
> > > > > > 
> > > > > > This stuff is crazy voodoo, and far from ideal.  But it works for now.
> > > > > > Eventually, there's a plan to create a compiler plugin for annotating
> > > > > > jump tables.  That will make this a lot less fragile.
> > > > > 
> > > > > I don't like this commit log.
> > > > > Here you're saying that the code recognized by objtool is sane and good
> > > > > whereas well optimized gcc code is somehow voodoo and bad.
> > > > > That is just wrong.
> > > > 
> > > > I have no idea what you're talking about.
> > > > 
> > > > Are you saying that ballooning the function text size by 40% is well
> > > > optimized GCC code?  It seems like a bug to me.  That's the only place I
> > > > said anything bad about GCC code.
> > > 
> > > It could be a bug, but did you benchmark the speed of interpreter ?
> > > Is it faster or slower with 40% more code ?
> > > Did you benchmark it on other archs ?
> > 
> > I thought we were in agreement that 40% text growth is bad.  Isn't that
> > why you wanted to keep 'goto select_insn' for the retpoline case?
> 
> Let me see whether I got this right.
> In first the sentence above you're claiming that I've agreed that 
> 'goto select_insn' is bad for retpoline case and in the second sentence
> you're saying that I wanted to keep it because it's bad?
> In other words you're saying I wanted bad code for retpoline case for
> some mischievous purpose?
> Do you really think so or just trolling?

I *never* said anything about 'goto select_insn' being bad for the
retpoline case.

GETTING RID OF IT is bad for the retpoline case, i.e. text explosion.

That's why (I thought) you wanted to keep it for the retpoline case.

Go back and read the words I've written instead of accusing me of
trolling... WTF.

> Let's look at the facts.
> I've applied your patch and the kernel crashed on the very first test in
> selftests/bpf which makes me believe that you only compile tested it.

Yes, that was a dumb mistake.  But to be fair, I asked you about that
change, and you said it was ok:

  https://lkml.kernel.org/r/20200430042400.45vvqx4ocwwogp3j@ast-mbp.dhcp.thefacebook.com

Now I see it's obviously a bug.

> Taking the question "is 40% text growth is bad?" out of context... Ohh yes.
> but if 40% extra code gives 10% speedup to interpreter it's suddenly good, right?
> Since you didn't run basic tests I don't think you've tested performance either.
> So this direct->indirect patch might cause performance degradation to
> architectures that don't have JIT.
> On x86-64 JIT=y is the default, so I'm fine taking that performance risk
> only for the case where that risk has to be taken. In other words to help
> objtool understand the code and only for the case where objtool cannot do
> it with existing code.
> The 40% potential text decrease after direct->indirect transiton is
> irrelevant here. It must be a separate patch after corresponding
> performance benchmarking is done.
> Just claiming in commit log that current code is obviously bad
> is misleading to folks who will be reading it later.

No.  I didn't say the current code is obviously bad.  You seem to be
confused about what the problem is.

> Also as I explained earlier direct->indirect in C is not a contract
> for the compiler. Currently there is single C line:
> goto *jumptable[insn->code];
> but gcc/clang may generate arbitrary number of indirect jumps
> for this function.
> Changing the macro from "goto select_insn" to "goto *jumptable"
> messes with compiler optimizations and there is no guarantee
> that the code is going to be better or worse.
> Why do you think there are two identical macros there?
> #define CONT     ({ insn++; goto select_insn; })
> #define CONT_JMP ({ insn++; goto select_insn; })
> Why not one?
> The answer is in old patch from 2014:
> https://patchwork.ozlabs.org/project/netdev/patch/1393910304-4004-2-git-send-email-ast@plumgrid.com/
> +#define CONT ({insn++; LOAD_IMM; goto select_insn; })
> +#define CONT_JMP ({insn++; LOAD_IMM; goto select_insn; })
> +/* some compilers may need help:
> + * #define CONT_JMP ({insn++; LOAD_IMM; goto *jumptable[insn->code]; })
> + */
>
> That was the patch after dozens of performance experiments
> with different gcc versions on different cpus.
> Six years ago the interpreter performance could be improved
> if _one_ of these macros replaced direct with indirect
> for certain versions of gcc. But not both macros.

I'm not sure why you're explaining this to me again.  Obviously the
compiler is free to do whatever optimizations it wants.

> To be honest I don't think you're interested in doing performance
> analysis here. You just want to shut up that objtool warning and
> move on, right? So please do so without making misleading statements
> about goodness or badness of generated code.

Based on several of your comments above it sounds like you have a
fundamental misunderstanding of the problem.  It's a little complicated,
but I tried to explain it as clearly as I could in the patch
description.  Let me try again:

The existing 'goto select_insn' + RETPOLINE=n produces code which
objtool can't understand.  (Though it can handle RETPOLINE=y just fine.)

Ideally we would get rid of that label and just change all the 'goto
select_insn' to 'goto *jumptable[insn->code]'.  That allows objtool to
follow the code in both retpoline and non-retpoline cases.  It also
simplifies the code flow and (IMO) makes it easier for GCC to find
optimizations.

However, for the RETPOLINE=y case, that simplification actually would
cause GCC to grow the function text size by 40%.  I thought we were in
agreement that significant text growth would be universally bad,
presumably because of i-cache locality/pressure issues.  However I can
leave out any such judgements in the patch description, if that's what
you prefer.  But then I'd still need to give a justification for the
#ifdef.

If you're not worried about the possibility of the text growth being
bad, because as you mentioned there's no guarantee the code is going to
be better or worse, I'm fine with that.  Then the patch can become a lot
simpler (no weird #ifdefs) and we can just do 'goto *jumptable[insn->code]'
everywhere.

Or, if you want to minimize the patch's impact on other arches, and keep
the current patch the way it is (with bug fixed and changed patch
description), that's fine too.  I can change the patch description
accordingly.

Or if you want me to measure the performance impact of the +40% code
growth, and *then* decide what to do, that's also fine.  But you'd need
to tell me what tests to run.
Alexei Starovoitov May 5, 2020, 5:43 p.m. UTC | #7
On Sat, May 02, 2020 at 02:21:05PM -0500, Josh Poimboeuf wrote:
> 
> Ideally we would get rid of that label and just change all the 'goto
> select_insn' to 'goto *jumptable[insn->code]'.  That allows objtool to
> follow the code in both retpoline and non-retpoline cases.  It also
> simplifies the code flow and (IMO) makes it easier for GCC to find
> optimizations.

No. It's the opposite. It's not simplifying the code. It pessimizes
compilers.

> 
> However, for the RETPOLINE=y case, that simplification actually would
> cause GCC to grow the function text size by 40%.  

It pessimizes and causes text increase, since the version of gcc
you're testing with cannot combine indirect gotos back into direct.

> I thought we were in
> agreement that significant text growth would be universally bad,
> presumably because of i-cache locality/pressure issues.  

No. As I explained before the extra code could give performance
increase depending on how branch predictor is designed in HW.

> Or, if you want to minimize the patch's impact on other arches, and keep
> the current patch the way it is (with bug fixed and changed patch
> description), that's fine too.  I can change the patch description
> accordingly.
> 
> Or if you want me to measure the performance impact of the +40% code
> growth, and *then* decide what to do, that's also fine.  But you'd need
> to tell me what tests to run.

I'd like to minimize the risk and avoid code churn,
so how about we step back and debug it first?
Which version of gcc are you using and what .config?
I've tried:
Linux version 5.7.0-rc2 (gcc version 10.0.1 20200505 (prerelease) (GCC)
CONFIG_UNWINDER_ORC=y
# CONFIG_RETPOLINE is not set

and objtool didn't complain.
I would like to reproduce it first before making any changes.

Also since objtool cannot follow the optimizations compiler is doing
how about admit the design failure and teach objtool to build ORC
(and whatever else it needs to build) based on dwarf for the functions where
it cannot understand the assembly code ?
Otherwise objtool will forever be playing whackamole with compilers.
Randy Dunlap May 5, 2020, 5:52 p.m. UTC | #8
On 5/5/20 10:43 AM, Alexei Starovoitov wrote:
> On Sat, May 02, 2020 at 02:21:05PM -0500, Josh Poimboeuf wrote:
>>
>> Ideally we would get rid of that label and just change all the 'goto
>> select_insn' to 'goto *jumptable[insn->code]'.  That allows objtool to
>> follow the code in both retpoline and non-retpoline cases.  It also
>> simplifies the code flow and (IMO) makes it easier for GCC to find
>> optimizations.
> 
> No. It's the opposite. It's not simplifying the code. It pessimizes
> compilers.
> 
>>
>> However, for the RETPOLINE=y case, that simplification actually would
>> cause GCC to grow the function text size by 40%.  
> 
> It pessimizes and causes text increase, since the version of gcc
> you're testing with cannot combine indirect gotos back into direct.
> 
>> I thought we were in
>> agreement that significant text growth would be universally bad,
>> presumably because of i-cache locality/pressure issues.  
> 
> No. As I explained before the extra code could give performance
> increase depending on how branch predictor is designed in HW.
> 
>> Or, if you want to minimize the patch's impact on other arches, and keep
>> the current patch the way it is (with bug fixed and changed patch
>> description), that's fine too.  I can change the patch description
>> accordingly.
>>
>> Or if you want me to measure the performance impact of the +40% code
>> growth, and *then* decide what to do, that's also fine.  But you'd need
>> to tell me what tests to run.
> 
> I'd like to minimize the risk and avoid code churn,
> so how about we step back and debug it first?
> Which version of gcc are you using and what .config?
> I've tried:
> Linux version 5.7.0-rc2 (gcc version 10.0.1 20200505 (prerelease) (GCC)
> CONFIG_UNWINDER_ORC=y
> # CONFIG_RETPOLINE is not set
> 
> and objtool didn't complain.
> I would like to reproduce it first before making any changes.
> 
> Also since objtool cannot follow the optimizations compiler is doing
> how about admit the design failure and teach objtool to build ORC
> (and whatever else it needs to build) based on dwarf for the functions where
> it cannot understand the assembly code ?
> Otherwise objtool will forever be playing whackamole with compilers.
> 

Hi,

I see the objtool warning:
kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x33: call without frame pointer save/setup

when using:
gcc (SUSE Linux) 9.3.1 20200406 [revision 6db837a5288ee3ca5ec504fbd5a765817e556ac2]

with the attached config file.

thanks.
Josh Poimboeuf May 5, 2020, 6:11 p.m. UTC | #9
On Tue, May 05, 2020 at 10:43:00AM -0700, Alexei Starovoitov wrote:
> > Or, if you want to minimize the patch's impact on other arches, and keep
> > the current patch the way it is (with bug fixed and changed patch
> > description), that's fine too.  I can change the patch description
> > accordingly.
> > 
> > Or if you want me to measure the performance impact of the +40% code
> > growth, and *then* decide what to do, that's also fine.  But you'd need
> > to tell me what tests to run.
> 
> I'd like to minimize the risk and avoid code churn,
> so how about we step back and debug it first?
> Which version of gcc are you using and what .config?
> I've tried:
> Linux version 5.7.0-rc2 (gcc version 10.0.1 20200505 (prerelease) (GCC)
> CONFIG_UNWINDER_ORC=y
> # CONFIG_RETPOLINE is not set
> 
> and objtool didn't complain.
> I would like to reproduce it first before making any changes.

Revert

  3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")

and compile with retpolines off (and either ORC or FP, doesn't matter).

I'm using GCC 9.3.1:

  kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x8dc: sibling call from callable instruction with modified stack frame

That's the original issue described in that commit.

> Also since objtool cannot follow the optimizations compiler is doing
> how about admit the design failure and teach objtool to build ORC
> (and whatever else it needs to build) based on dwarf for the functions where
> it cannot understand the assembly code ?
> Otherwise objtool will forever be playing whackamole with compilers.

I agree it's not a good long term approach.  But DWARF has its own
issues and we can't rely on it for live patching.

As I mentioned we have a plan to use a compiler plugin to annotate jump
tables (including GCC switch tables).  But the approach taken by this
patch should be good enough for now.
Alexei Starovoitov May 5, 2020, 7:14 p.m. UTC | #10
> 
> Hi,
> 
> I see the objtool warning:
> kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x33: call without frame pointer save/setup
> 
> when using:
> gcc (SUSE Linux) 9.3.1 20200406 [revision 6db837a5288ee3ca5ec504fbd5a765817e556ac2]
> 
> with the attached config file.

Thanks Randy. I reproduced it.
Josh Poimboeuf May 5, 2020, 7:31 p.m. UTC | #11
On Tue, May 05, 2020 at 12:14:05PM -0700, Alexei Starovoitov wrote:
> > 
> > Hi,
> > 
> > I see the objtool warning:
> > kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x33: call without frame pointer save/setup
> > 
> > when using:
> > gcc (SUSE Linux) 9.3.1 20200406 [revision 6db837a5288ee3ca5ec504fbd5a765817e556ac2]
> > 
> > with the attached config file.
> 
> Thanks Randy. I reproduced it.

This problem isn't a mystery, it's caused by __attribute__((optimize)).

The only real solution is to revert

  3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")

Once you do that (and disable retpolines) then you should see the
problem described in my other email.
Alexei Starovoitov May 5, 2020, 7:53 p.m. UTC | #12
On Tue, May 05, 2020 at 01:11:08PM -0500, Josh Poimboeuf wrote:
> On Tue, May 05, 2020 at 10:43:00AM -0700, Alexei Starovoitov wrote:
> > > Or, if you want to minimize the patch's impact on other arches, and keep
> > > the current patch the way it is (with bug fixed and changed patch
> > > description), that's fine too.  I can change the patch description
> > > accordingly.
> > > 
> > > Or if you want me to measure the performance impact of the +40% code
> > > growth, and *then* decide what to do, that's also fine.  But you'd need
> > > to tell me what tests to run.
> > 
> > I'd like to minimize the risk and avoid code churn,
> > so how about we step back and debug it first?
> > Which version of gcc are you using and what .config?
> > I've tried:
> > Linux version 5.7.0-rc2 (gcc version 10.0.1 20200505 (prerelease) (GCC)
> > CONFIG_UNWINDER_ORC=y
> > # CONFIG_RETPOLINE is not set
> > 
> > and objtool didn't complain.
> > I would like to reproduce it first before making any changes.
> 
> Revert
> 
>   3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
> 
> and compile with retpolines off (and either ORC or FP, doesn't matter).
> 
> I'm using GCC 9.3.1:
> 
>   kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x8dc: sibling call from callable instruction with modified stack frame
> 
> That's the original issue described in that commit.

I see something different.
With gcc 8, 9, and 10 and CCONFIG_UNWINDER_FRAME_POINTER=y
I see:
kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x4837: call without frame pointer save/setup
and sure enough assembly code for ___bpf_prog_run does not countain frame setup
though -fno-omit-frame-pointer flag was passed at command line.
Then I did:
static u64 /*__no_fgcse*/ ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
and the assembly had proper frame, but objtool wasn't happy:
kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x480a: sibling call from callable instruction with modified stack frame

gcc 6.3 doesn't have objtool warning with and without -fno-gcse.

Looks like we have two issues here.
First gcc 8, 9 and 10 have a severe bug with __attribute__((optimize("")))
In this particular case passing -fno-gcse somehow overruled -fno-omit-frame-pointer
which is serious issue. powerpc is using __nostackprotector. I don't understand
how it can keep working with newer gcc-s. May be got lucky.
Plenty of other projects use various __attribute__((optimize("")))
they all have to double check that their vesion of GCC produces correct code.
Can somebody reach out to gcc folks for explanation?

The second objtool issue is imo minor one. It can be worked around for now
and fixed for real later.

> > Also since objtool cannot follow the optimizations compiler is doing
> > how about admit the design failure and teach objtool to build ORC
> > (and whatever else it needs to build) based on dwarf for the functions where
> > it cannot understand the assembly code ?
> > Otherwise objtool will forever be playing whackamole with compilers.
> 
> I agree it's not a good long term approach.  But DWARF has its own
> issues and we can't rely on it for live patching.

Curious what is the issue with dwarf and live patching ?
I'm sure dwarf is enough to build ORC tables.

> As I mentioned we have a plan to use a compiler plugin to annotate jump
> tables (including GCC switch tables).  But the approach taken by this
> patch should be good enough for now.

I don't have gcc 7 around. Could you please test the workaround with gcc 7,8,9,10
and several clang versions? With ORC and with FP ? and retpoline on/off ?
I don't see any issues with ORC=y. objtool complains with FP=y only for my configs.
I want to make sure the workaround is actually effective.
Josh Poimboeuf May 5, 2020, 8:28 p.m. UTC | #13
On Tue, May 05, 2020 at 12:53:20PM -0700, Alexei Starovoitov wrote:
> On Tue, May 05, 2020 at 01:11:08PM -0500, Josh Poimboeuf wrote:
> > On Tue, May 05, 2020 at 10:43:00AM -0700, Alexei Starovoitov wrote:
> > > > Or, if you want to minimize the patch's impact on other arches, and keep
> > > > the current patch the way it is (with bug fixed and changed patch
> > > > description), that's fine too.  I can change the patch description
> > > > accordingly.
> > > > 
> > > > Or if you want me to measure the performance impact of the +40% code
> > > > growth, and *then* decide what to do, that's also fine.  But you'd need
> > > > to tell me what tests to run.
> > > 
> > > I'd like to minimize the risk and avoid code churn,
> > > so how about we step back and debug it first?
> > > Which version of gcc are you using and what .config?
> > > I've tried:
> > > Linux version 5.7.0-rc2 (gcc version 10.0.1 20200505 (prerelease) (GCC)
> > > CONFIG_UNWINDER_ORC=y
> > > # CONFIG_RETPOLINE is not set
> > > 
> > > and objtool didn't complain.
> > > I would like to reproduce it first before making any changes.
> > 
> > Revert
> > 
> >   3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
> > 
> > and compile with retpolines off (and either ORC or FP, doesn't matter).
> > 
> > I'm using GCC 9.3.1:
> > 
> >   kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x8dc: sibling call from callable instruction with modified stack frame
> > 
> > That's the original issue described in that commit.
> 
> I see something different.
> With gcc 8, 9, and 10 and CCONFIG_UNWINDER_FRAME_POINTER=y
> I see:
> kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x4837: call without frame pointer save/setup
> and sure enough assembly code for ___bpf_prog_run does not countain frame setup
> though -fno-omit-frame-pointer flag was passed at command line.
> Then I did:
> static u64 /*__no_fgcse*/ ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
> and the assembly had proper frame, but objtool wasn't happy:
> kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x480a: sibling call from callable instruction with modified stack frame
> 
> gcc 6.3 doesn't have objtool warning with and without -fno-gcse.
> 
> Looks like we have two issues here.
> First gcc 8, 9 and 10 have a severe bug with __attribute__((optimize("")))
> In this particular case passing -fno-gcse somehow overruled -fno-omit-frame-pointer
> which is serious issue. powerpc is using __nostackprotector. I don't understand
> how it can keep working with newer gcc-s. May be got lucky.
> Plenty of other projects use various __attribute__((optimize("")))
> they all have to double check that their vesion of GCC produces correct code.
> Can somebody reach out to gcc folks for explanation?

Right.  I've mentioned this several times now.  That's why my patch
reverts 3193c0836f20.  I don't see any other way around it.  The GCC
manual even says this attribute should not be used in production code.

> The second objtool issue is imo minor one. It can be worked around for now
> and fixed for real later.

Ok, so keep the patch like v1 (but with the bug fixed)?  Or did you want
to get rid of 'goto select_insn' altogether?

> > > Also since objtool cannot follow the optimizations compiler is doing
> > > how about admit the design failure and teach objtool to build ORC
> > > (and whatever else it needs to build) based on dwarf for the functions where
> > > it cannot understand the assembly code ?
> > > Otherwise objtool will forever be playing whackamole with compilers.
> > 
> > I agree it's not a good long term approach.  But DWARF has its own
> > issues and we can't rely on it for live patching.
> 
> Curious what is the issue with dwarf and live patching ?
> I'm sure dwarf is enough to build ORC tables.

DWARF is a best-effort thing, but for live patching, unwinding has to be
100% accurate.

For assembly code it was impossible to keep all the DWARF CFI
annotations always up to date and to ensure they were correct.

Even for C code there were DWARF problems with inline asm, alternatives
patching, jump labels, etc.

> > As I mentioned we have a plan to use a compiler plugin to annotate jump
> > tables (including GCC switch tables).  But the approach taken by this
> > patch should be good enough for now.
> 
> I don't have gcc 7 around. Could you please test the workaround with gcc 7,8,9,10
> and several clang versions? With ORC and with FP ? and retpoline on/off ?
> I don't see any issues with ORC=y. objtool complains with FP=y only for my configs.
> I want to make sure the workaround is actually effective.

Again, if you revert 3193c0836f20, you will see the issue...

I can certainly test on the matrix of compilers/configs you suggested.
Alexei Starovoitov May 5, 2020, 11:59 p.m. UTC | #14
On Tue, May 05, 2020 at 03:28:23PM -0500, Josh Poimboeuf wrote:
> On Tue, May 05, 2020 at 12:53:20PM -0700, Alexei Starovoitov wrote:
> > On Tue, May 05, 2020 at 01:11:08PM -0500, Josh Poimboeuf wrote:
> > > On Tue, May 05, 2020 at 10:43:00AM -0700, Alexei Starovoitov wrote:
> > > > > Or, if you want to minimize the patch's impact on other arches, and keep
> > > > > the current patch the way it is (with bug fixed and changed patch
> > > > > description), that's fine too.  I can change the patch description
> > > > > accordingly.
> > > > > 
> > > > > Or if you want me to measure the performance impact of the +40% code
> > > > > growth, and *then* decide what to do, that's also fine.  But you'd need
> > > > > to tell me what tests to run.
> > > > 
> > > > I'd like to minimize the risk and avoid code churn,
> > > > so how about we step back and debug it first?
> > > > Which version of gcc are you using and what .config?
> > > > I've tried:
> > > > Linux version 5.7.0-rc2 (gcc version 10.0.1 20200505 (prerelease) (GCC)
> > > > CONFIG_UNWINDER_ORC=y
> > > > # CONFIG_RETPOLINE is not set
> > > > 
> > > > and objtool didn't complain.
> > > > I would like to reproduce it first before making any changes.
> > > 
> > > Revert
> > > 
> > >   3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
> > > 
> > > and compile with retpolines off (and either ORC or FP, doesn't matter).
> > > 
> > > I'm using GCC 9.3.1:
> > > 
> > >   kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x8dc: sibling call from callable instruction with modified stack frame
> > > 
> > > That's the original issue described in that commit.
> > 
> > I see something different.
> > With gcc 8, 9, and 10 and CCONFIG_UNWINDER_FRAME_POINTER=y
> > I see:
> > kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x4837: call without frame pointer save/setup
> > and sure enough assembly code for ___bpf_prog_run does not countain frame setup
> > though -fno-omit-frame-pointer flag was passed at command line.
> > Then I did:
> > static u64 /*__no_fgcse*/ ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
> > and the assembly had proper frame, but objtool wasn't happy:
> > kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x480a: sibling call from callable instruction with modified stack frame
> > 
> > gcc 6.3 doesn't have objtool warning with and without -fno-gcse.
> > 
> > Looks like we have two issues here.
> > First gcc 8, 9 and 10 have a severe bug with __attribute__((optimize("")))
> > In this particular case passing -fno-gcse somehow overruled -fno-omit-frame-pointer
> > which is serious issue. powerpc is using __nostackprotector. I don't understand
> > how it can keep working with newer gcc-s. May be got lucky.
> > Plenty of other projects use various __attribute__((optimize("")))
> > they all have to double check that their vesion of GCC produces correct code.
> > Can somebody reach out to gcc folks for explanation?
> 
> Right.  I've mentioned this several times now.  That's why my patch
> reverts 3193c0836f20.  I don't see any other way around it.  The GCC
> manual even says this attribute should not be used in production code.

What you mentioned in commit log is:
"It doesn't append options to the command-line arguments.  Instead
it starts from a blank slate.  And according to recent GCC documentation
it's not recommended for production use."

I don't think anyone could have guessed from such description that it kills
-fno-omit-frame-pointer but it doesn't reduce optimization level to -O0
and it doesn't kill -D, -m, -I, -std= and other flags.

As far as workaround I prefer the following:
From 94bbc27c5a70d78846a5cb675df4cf8732883564 Mon Sep 17 00:00:00 2001
From: Alexei Starovoitov <ast@kernel.org>
Date: Tue, 5 May 2020 16:52:41 -0700
Subject: [PATCH] bpf,objtool: tweak interpreter compilation flags to help objtool

tbd

Fixes: 3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/compiler-gcc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index d7ee4c6bad48..05104c3cc033 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -171,4 +171,4 @@
 #define __diag_GCC_8(s)
 #endif

-#define __no_fgcse __attribute__((optimize("-fno-gcse")))
+#define __no_fgcse __attribute__((optimize("-fno-gcse,-fno-omit-frame-pointer")))
--
2.23.0

I've tested it with gcc 8,9,10 and clang 11 with FP=y and with ORC=y.
All works.
I think it's safer to go with frame pointers even for ORC=y considering
all the pain this issue had caused. Even if objtool gets confused again
in the future __bpf_prog_run() will have frame pointers and kernel stack
unwinding can fall back from ORC to FP for that frame.
wdyt?
Josh Poimboeuf May 6, 2020, 3:53 p.m. UTC | #15
On Tue, May 05, 2020 at 04:59:39PM -0700, Alexei Starovoitov wrote:
> As far as workaround I prefer the following:
> From 94bbc27c5a70d78846a5cb675df4cf8732883564 Mon Sep 17 00:00:00 2001
> From: Alexei Starovoitov <ast@kernel.org>
> Date: Tue, 5 May 2020 16:52:41 -0700
> Subject: [PATCH] bpf,objtool: tweak interpreter compilation flags to help objtool
> 
> tbd
> 
> Fixes: 3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
> Reported-by: Randy Dunlap <rdunlap@infradead.org>
> Reported-by: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/compiler-gcc.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
> index d7ee4c6bad48..05104c3cc033 100644
> --- a/include/linux/compiler-gcc.h
> +++ b/include/linux/compiler-gcc.h
> @@ -171,4 +171,4 @@
>  #define __diag_GCC_8(s)
>  #endif
> 
> -#define __no_fgcse __attribute__((optimize("-fno-gcse")))
> +#define __no_fgcse __attribute__((optimize("-fno-gcse,-fno-omit-frame-pointer")))
> --
> 2.23.0
> 
> I've tested it with gcc 8,9,10 and clang 11 with FP=y and with ORC=y.
> All works.
> I think it's safer to go with frame pointers even for ORC=y considering
> all the pain this issue had caused. Even if objtool gets confused again
> in the future __bpf_prog_run() will have frame pointers and kernel stack
> unwinding can fall back from ORC to FP for that frame.
> wdyt?

It seems dangerous to me.  The GCC manual recommends against it.

And how do we know what other flags are getting removed for various
arches (now or in the future)?
Alexei Starovoitov May 6, 2020, 4:45 p.m. UTC | #16
On Wed, May 6, 2020 at 8:53 AM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> On Tue, May 05, 2020 at 04:59:39PM -0700, Alexei Starovoitov wrote:
> > As far as workaround I prefer the following:
> > From 94bbc27c5a70d78846a5cb675df4cf8732883564 Mon Sep 17 00:00:00 2001
> > From: Alexei Starovoitov <ast@kernel.org>
> > Date: Tue, 5 May 2020 16:52:41 -0700
> > Subject: [PATCH] bpf,objtool: tweak interpreter compilation flags to help objtool
> >
> > tbd
> >
> > Fixes: 3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
> > Reported-by: Randy Dunlap <rdunlap@infradead.org>
> > Reported-by: Arnd Bergmann <arnd@arndb.de>
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  include/linux/compiler-gcc.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
> > index d7ee4c6bad48..05104c3cc033 100644
> > --- a/include/linux/compiler-gcc.h
> > +++ b/include/linux/compiler-gcc.h
> > @@ -171,4 +171,4 @@
> >  #define __diag_GCC_8(s)
> >  #endif
> >
> > -#define __no_fgcse __attribute__((optimize("-fno-gcse")))
> > +#define __no_fgcse __attribute__((optimize("-fno-gcse,-fno-omit-frame-pointer")))
> > --
> > 2.23.0
> >
> > I've tested it with gcc 8,9,10 and clang 11 with FP=y and with ORC=y.
> > All works.
> > I think it's safer to go with frame pointers even for ORC=y considering
> > all the pain this issue had caused. Even if objtool gets confused again
> > in the future __bpf_prog_run() will have frame pointers and kernel stack
> > unwinding can fall back from ORC to FP for that frame.
> > wdyt?
>
> It seems dangerous to me.  The GCC manual recommends against it.

The manual can says that it's broken. That won't stop the world from using it.
Just google projects that are using it. For example: qt, lz4, unreal engine, etc
Telling compiler to disable gcse via flag is a guaranteed way to avoid
that optimization that breaks objtool whereas messing with C code is nothing
but guess work. gcc can still do gcse.
Josh Poimboeuf May 6, 2020, 9:19 p.m. UTC | #17
On Wed, May 06, 2020 at 09:45:01AM -0700, Alexei Starovoitov wrote:
> On Wed, May 6, 2020 at 8:53 AM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >
> > On Tue, May 05, 2020 at 04:59:39PM -0700, Alexei Starovoitov wrote:
> > > As far as workaround I prefer the following:
> > > From 94bbc27c5a70d78846a5cb675df4cf8732883564 Mon Sep 17 00:00:00 2001
> > > From: Alexei Starovoitov <ast@kernel.org>
> > > Date: Tue, 5 May 2020 16:52:41 -0700
> > > Subject: [PATCH] bpf,objtool: tweak interpreter compilation flags to help objtool
> > >
> > > tbd
> > >
> > > Fixes: 3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
> > > Reported-by: Randy Dunlap <rdunlap@infradead.org>
> > > Reported-by: Arnd Bergmann <arnd@arndb.de>
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > ---
> > >  include/linux/compiler-gcc.h | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
> > > index d7ee4c6bad48..05104c3cc033 100644
> > > --- a/include/linux/compiler-gcc.h
> > > +++ b/include/linux/compiler-gcc.h
> > > @@ -171,4 +171,4 @@
> > >  #define __diag_GCC_8(s)
> > >  #endif
> > >
> > > -#define __no_fgcse __attribute__((optimize("-fno-gcse")))
> > > +#define __no_fgcse __attribute__((optimize("-fno-gcse,-fno-omit-frame-pointer")))
> > > --
> > > 2.23.0
> > >
> > > I've tested it with gcc 8,9,10 and clang 11 with FP=y and with ORC=y.
> > > All works.
> > > I think it's safer to go with frame pointers even for ORC=y considering
> > > all the pain this issue had caused. Even if objtool gets confused again
> > > in the future __bpf_prog_run() will have frame pointers and kernel stack
> > > unwinding can fall back from ORC to FP for that frame.
> > > wdyt?
> >
> > It seems dangerous to me.  The GCC manual recommends against it.
> 
> The manual can says that it's broken. That won't stop the world from using it.
> Just google projects that are using it. For example: qt, lz4, unreal engine, etc
> Telling compiler to disable gcse via flag is a guaranteed way to avoid
> that optimization that breaks objtool whereas messing with C code is nothing
> but guess work. gcc can still do gcse.

But the manual's right, it is broken.  How do you know other important
flags won't also be stripped?
Alexei Starovoitov May 7, 2020, 12:03 a.m. UTC | #18
On Wed, May 06, 2020 at 04:19:45PM -0500, Josh Poimboeuf wrote:
> On Wed, May 06, 2020 at 09:45:01AM -0700, Alexei Starovoitov wrote:
> > On Wed, May 6, 2020 at 8:53 AM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > >
> > > On Tue, May 05, 2020 at 04:59:39PM -0700, Alexei Starovoitov wrote:
> > > > As far as workaround I prefer the following:
> > > > From 94bbc27c5a70d78846a5cb675df4cf8732883564 Mon Sep 17 00:00:00 2001
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > > Date: Tue, 5 May 2020 16:52:41 -0700
> > > > Subject: [PATCH] bpf,objtool: tweak interpreter compilation flags to help objtool
> > > >
> > > > tbd
> > > >
> > > > Fixes: 3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
> > > > Reported-by: Randy Dunlap <rdunlap@infradead.org>
> > > > Reported-by: Arnd Bergmann <arnd@arndb.de>
> > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > > ---
> > > >  include/linux/compiler-gcc.h | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
> > > > index d7ee4c6bad48..05104c3cc033 100644
> > > > --- a/include/linux/compiler-gcc.h
> > > > +++ b/include/linux/compiler-gcc.h
> > > > @@ -171,4 +171,4 @@
> > > >  #define __diag_GCC_8(s)
> > > >  #endif
> > > >
> > > > -#define __no_fgcse __attribute__((optimize("-fno-gcse")))
> > > > +#define __no_fgcse __attribute__((optimize("-fno-gcse,-fno-omit-frame-pointer")))
> > > > --
> > > > 2.23.0
> > > >
> > > > I've tested it with gcc 8,9,10 and clang 11 with FP=y and with ORC=y.
> > > > All works.
> > > > I think it's safer to go with frame pointers even for ORC=y considering
> > > > all the pain this issue had caused. Even if objtool gets confused again
> > > > in the future __bpf_prog_run() will have frame pointers and kernel stack
> > > > unwinding can fall back from ORC to FP for that frame.
> > > > wdyt?
> > >
> > > It seems dangerous to me.  The GCC manual recommends against it.
> > 
> > The manual can says that it's broken. That won't stop the world from using it.
> > Just google projects that are using it. For example: qt, lz4, unreal engine, etc
> > Telling compiler to disable gcse via flag is a guaranteed way to avoid
> > that optimization that breaks objtool whereas messing with C code is nothing
> > but guess work. gcc can still do gcse.
> 
> But the manual's right, it is broken.  How do you know other important
> flags won't also be stripped?

What flags are you worried about?
I've checked that important things like -mno-red-zone, -fsanitize are preserved.
Josh Poimboeuf May 7, 2020, 2:07 p.m. UTC | #19
On Wed, May 06, 2020 at 05:03:57PM -0700, Alexei Starovoitov wrote:
> > > > > diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
> > > > > index d7ee4c6bad48..05104c3cc033 100644
> > > > > --- a/include/linux/compiler-gcc.h
> > > > > +++ b/include/linux/compiler-gcc.h
> > > > > @@ -171,4 +171,4 @@
> > > > >  #define __diag_GCC_8(s)
> > > > >  #endif
> > > > >
> > > > > -#define __no_fgcse __attribute__((optimize("-fno-gcse")))
> > > > > +#define __no_fgcse __attribute__((optimize("-fno-gcse,-fno-omit-frame-pointer")))
> > > > > --
> > > > > 2.23.0
> > > > >
> > > > > I've tested it with gcc 8,9,10 and clang 11 with FP=y and with ORC=y.
> > > > > All works.
> > > > > I think it's safer to go with frame pointers even for ORC=y considering
> > > > > all the pain this issue had caused. Even if objtool gets confused again
> > > > > in the future __bpf_prog_run() will have frame pointers and kernel stack
> > > > > unwinding can fall back from ORC to FP for that frame.
> > > > > wdyt?
> > > >
> > > > It seems dangerous to me.  The GCC manual recommends against it.
> > > 
> > > The manual can says that it's broken. That won't stop the world from using it.
> > > Just google projects that are using it. For example: qt, lz4, unreal engine, etc
> > > Telling compiler to disable gcse via flag is a guaranteed way to avoid
> > > that optimization that breaks objtool whereas messing with C code is nothing
> > > but guess work. gcc can still do gcse.
> > 
> > But the manual's right, it is broken.  How do you know other important
> > flags won't also be stripped?
> 
> What flags are you worried about?
> I've checked that important things like -mno-red-zone, -fsanitize are preserved.

It's not any specific flags I'm worried about, it's all of them.  There
are a lot of possibilities, with all the different configs, and arches.
Flags are usually added for a good reason, so one randomly missing flag
could have unforeseen results.

And I don't have any visibility into how GCC decides which flags to
drop, and when.  But the docs aren't comforting.

Even if things seem to work now, that could (silently) change at any
point in time.  This time objtool warned about the missing frame
pointer, but that's not necessarily going to happen for other flags.

If we go this route, I would much rather do -fno-gcse on a file-wide
basis.
Alexei Starovoitov May 8, 2020, 10:18 p.m. UTC | #20
On Thu, May 07, 2020 at 09:07:33AM -0500, Josh Poimboeuf wrote:
> On Wed, May 06, 2020 at 05:03:57PM -0700, Alexei Starovoitov wrote:
> > > > > > diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
> > > > > > index d7ee4c6bad48..05104c3cc033 100644
> > > > > > --- a/include/linux/compiler-gcc.h
> > > > > > +++ b/include/linux/compiler-gcc.h
> > > > > > @@ -171,4 +171,4 @@
> > > > > >  #define __diag_GCC_8(s)
> > > > > >  #endif
> > > > > >
> > > > > > -#define __no_fgcse __attribute__((optimize("-fno-gcse")))
> > > > > > +#define __no_fgcse __attribute__((optimize("-fno-gcse,-fno-omit-frame-pointer")))
> > > > > > --
> > > > > > 2.23.0
> > > > > >
> > > > > > I've tested it with gcc 8,9,10 and clang 11 with FP=y and with ORC=y.
> > > > > > All works.
> > > > > > I think it's safer to go with frame pointers even for ORC=y considering
> > > > > > all the pain this issue had caused. Even if objtool gets confused again
> > > > > > in the future __bpf_prog_run() will have frame pointers and kernel stack
> > > > > > unwinding can fall back from ORC to FP for that frame.
> > > > > > wdyt?
> > > > >
> > > > > It seems dangerous to me.  The GCC manual recommends against it.
> > > > 
> > > > The manual can says that it's broken. That won't stop the world from using it.
> > > > Just google projects that are using it. For example: qt, lz4, unreal engine, etc
> > > > Telling compiler to disable gcse via flag is a guaranteed way to avoid
> > > > that optimization that breaks objtool whereas messing with C code is nothing
> > > > but guess work. gcc can still do gcse.
> > > 
> > > But the manual's right, it is broken.  How do you know other important
> > > flags won't also be stripped?
> > 
> > What flags are you worried about?
> > I've checked that important things like -mno-red-zone, -fsanitize are preserved.
> 
> It's not any specific flags I'm worried about, it's all of them.  There
> are a lot of possibilities, with all the different configs, and arches.
> Flags are usually added for a good reason, so one randomly missing flag
> could have unforeseen results.
> 
> And I don't have any visibility into how GCC decides which flags to
> drop, and when.  But the docs aren't comforting.

That doc change landed 5 years ago:
https://patchwork.ozlabs.org/project/gcc/patch/20151213081911.GA320@x4/
Sure it's 'broken' by whatever definition of broken.
Yet gcc has
$ git grep '__attribute__((optimize' gcc/testsuite/|wc -l
34 tests to make sure it stays working.
And gcc is using it to bootstrap itself. See LIBGCC2_UNWIND_ATTRIBUTE.
The doc is expressing desire and trying to discourage its use,
but that attribute is not going anywhere.

> Even if things seem to work now, that could (silently) change at any
> point in time.  This time objtool warned about the missing frame
> pointer, but that's not necessarily going to happen for other flags.
> 
> If we go this route, I would much rather do -fno-gcse on a file-wide
> basis.

The fix for broken commit 3193c0836f20 has to be backported all the way
to 5.3 release. I'd like to minimize conflicts.
For that very reason I'm not even renaming #define __no_fgcse.
diff mbox series

Patch

diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index cf294faec2f8..2c8583eb5de8 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -176,5 +176,3 @@ 
 #else
 #define __diag_GCC_8(s)
 #endif
-
-#define __no_fgcse __attribute__((optimize("-fno-gcse")))
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index e970f97a7fcb..58105f1deb79 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -203,10 +203,6 @@  struct ftrace_likely_data {
 #define asm_inline asm
 #endif
 
-#ifndef __no_fgcse
-# define __no_fgcse
-#endif
-
 /* Are two types/vars the same type (ignoring qualifiers)? */
 #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 916f5132a984..eec470c598ad 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1364,7 +1364,7 @@  u64 __weak bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)
  *
  * Decode and execute eBPF instructions.
  */
-static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
+static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
 {
 #define BPF_INSN_2_LBL(x, y)    [BPF_##x | BPF_##y] = &&x##_##y
 #define BPF_INSN_3_LBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = &&x##_##y##_##z
@@ -1384,11 +1384,15 @@  static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u6
 #undef BPF_INSN_2_LBL
 	u32 tail_call_cnt = 0;
 
+#if defined(CONFIG_X86_64) && !defined(CONFIG_RETPOLINE)
+#define CONT	 ({ insn++; goto *jumptable[insn->code]; })
+#define CONT_JMP ({ insn++; goto *jumptable[insn->code]; })
+#else
 #define CONT	 ({ insn++; goto select_insn; })
 #define CONT_JMP ({ insn++; goto select_insn; })
-
 select_insn:
 	goto *jumptable[insn->code];
+#endif
 
 	/* ALU */
 #define ALU(OPCODE, OP)			\
@@ -1547,7 +1551,7 @@  static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u6
 		 * where arg1_type is ARG_PTR_TO_CTX.
 		 */
 		insn = prog->insnsi;
-		goto select_insn;
+		CONT;
 out:
 		CONT;
 	}