mbox series

[v4,00/29] Add support for Clang LTO

Message ID 20200929214631.3516445-1-samitolvanen@google.com
Headers show
Series Add support for Clang LTO | expand

Message

Sami Tolvanen Sept. 29, 2020, 9:46 p.m. UTC
This patch series adds support for building x86_64 and arm64 kernels
with Clang's Link Time Optimization (LTO).

In addition to performance, the primary motivation for LTO is
to allow Clang's Control-Flow Integrity (CFI) to be used in the
kernel. Google has shipped millions of Pixel devices running three
major kernel versions with LTO+CFI since 2018.

Most of the patches are build system changes for handling LLVM
bitcode, which Clang produces with LTO instead of ELF object files,
postponing ELF processing until a later stage, and ensuring initcall
ordering.

Again, patches 1-3 are not directly related to LTO, but are needed
to compile LTO kernels with ToT Clang, so I'm including them in the
series for your convenience:

 - Patch 1 ("RAS/CEC: Fix cec_init() prototype") fixes an initcall
   type mismatch which breaks allmodconfig with LTO. This patch is
   in linux-next.
   
 - Patch 2 ("x86/asm: Replace __force_order with memory clobber")
   fixes x86 builds with LLVM's integrated assembler, which we use
   with LTO for inline assembly. This patch hasn't been picked up
   by maintainers yet.

 - Patch 3 ("kbuild: preprocess module linker script") is from
   Masahiro's kbuild tree and makes the LTO linker script changes
   much cleaner.

Furthermore, patches 4-8 include Peter's patch for generating
__mcount_loc with objtool, and build system changes to enable it on
x86. With these patches, we no longer need to annotate functions
that have non-call references to __fentry__ with LTO, which makes
supporting dynamic ftrace much simpler.

Patch 9 disables recordmcount for arm64 when patchable function
entry is used (enabled by default if the compiler supports the
feature), which removes thousands of unnecessary recordmcount
invocations from a defconfig build.

Note that you can also pull this series from

  https://github.com/samitolvanen/linux.git lto-v4


---
Changes in v4:

  - Fixed a typo in Makefile.lib to correctly pass --no-fp to objtool.

  - Moved ftrace configs related to generating __mcount_loc to Kconfig,
    so they are available also in Makefile.modfinal.

  - Dropped two prerequisite patches that were merged to Linus' tree.

Changes in v3:

  - Added a separate patch to remove the unused DISABLE_LTO treewide,
    as filtering out CC_FLAGS_LTO instead is preferred.

  - Updated the Kconfig help to explain why LTO is behind a choice
    and disabled by default.

  - Dropped CC_FLAGS_LTO_CLANG, compiler-specific LTO flags are now
    appended directly to CC_FLAGS_LTO.

  - Updated $(AR) flags as KBUILD_ARFLAGS was removed earlier.

  - Fixed ThinLTO cache handling for external module builds.

  - Rebased on top of Masahiro's patch for preprocessing modules.lds,
    and moved the contents of module-lto.lds to modules.lds.S.

  - Moved objtool_args to Makefile.lib to avoid duplication of the
    command line parameters in Makefile.modfinal.

  - Clarified in the commit message for the initcall ordering patch
    that the initcall order remains the same as without LTO.

  - Changed link-vmlinux.sh to use jobserver-exec to control the
    number of jobs started by generate_initcall_ordering.pl.

  - Dropped the x86/relocs patch to whitelist L4_PAGE_OFFSET as it's
    no longer needed with ToT kernel.

  - Disabled LTO for arch/x86/power/cpu.c to work around a Clang bug
    with stack protector attributes.

Changes in v2:

  - Fixed -Wmissing-prototypes warnings with W=1.

  - Dropped cc-option from -fsplit-lto-unit and added .thinlto-cache
    scrubbing to make distclean.

  - Added a comment about Clang >=11 being required.

  - Added a patch to disable LTO for the arm64 KVM nVHE code.

  - Disabled objtool's noinstr validation with LTO unless enabled.

  - Included Peter's proposed objtool mcount patch in the series
    and replaced recordmcount with the objtool pass to avoid
    whitelisting relocations that are not calls.

  - Updated several commit messages with better explanations.


Arvind Sankar (1):
  x86/asm: Replace __force_order with memory clobber

Luca Stefani (1):
  RAS/CEC: Fix cec_init() prototype

Masahiro Yamada (1):
  kbuild: preprocess module linker script

Peter Zijlstra (1):
  objtool: Add a pass for generating __mcount_loc

Sami Tolvanen (25):
  objtool: Don't autodetect vmlinux.o
  tracing: move function tracer options to Kconfig
  tracing: add support for objtool mcount
  x86, build: use objtool mcount
  arm64: disable recordmcount with DYNAMIC_FTRACE_WITH_REGS
  treewide: remove DISABLE_LTO
  kbuild: add support for Clang LTO
  kbuild: lto: fix module versioning
  kbuild: lto: postpone objtool
  kbuild: lto: limit inlining
  kbuild: lto: merge module sections
  kbuild: lto: remove duplicate dependencies from .mod files
  init: lto: ensure initcall ordering
  init: lto: fix PREL32 relocations
  PCI: Fix PREL32 relocations for LTO
  modpost: lto: strip .lto from module names
  scripts/mod: disable LTO for empty.c
  efi/libstub: disable LTO
  drivers/misc/lkdtm: disable LTO for rodata.o
  arm64: vdso: disable LTO
  KVM: arm64: disable LTO for the nVHE directory
  arm64: allow LTO_CLANG and THINLTO to be selected
  x86, vdso: disable LTO only for vDSO
  x86, cpu: disable LTO for cpu.c
  x86, build: allow LTO_CLANG and THINLTO to be selected

 .gitignore                                    |   1 +
 Makefile                                      |  68 +++--
 arch/Kconfig                                  |  68 +++++
 arch/arm/Makefile                             |   4 -
 .../module.lds => include/asm/module.lds.h}   |   2 +
 arch/arm64/Kconfig                            |   4 +
 arch/arm64/Makefile                           |   4 -
 .../module.lds => include/asm/module.lds.h}   |   2 +
 arch/arm64/kernel/vdso/Makefile               |   4 +-
 arch/arm64/kvm/hyp/nvhe/Makefile              |   4 +-
 arch/ia64/Makefile                            |   1 -
 .../{module.lds => include/asm/module.lds.h}  |   0
 arch/m68k/Makefile                            |   1 -
 .../module.lds => include/asm/module.lds.h}   |   0
 arch/powerpc/Makefile                         |   1 -
 .../module.lds => include/asm/module.lds.h}   |   0
 arch/riscv/Makefile                           |   3 -
 .../module.lds => include/asm/module.lds.h}   |   3 +-
 arch/sparc/vdso/Makefile                      |   2 -
 arch/um/include/asm/Kbuild                    |   1 +
 arch/x86/Kconfig                              |   3 +
 arch/x86/Makefile                             |   5 +
 arch/x86/boot/compressed/pgtable_64.c         |   9 -
 arch/x86/entry/vdso/Makefile                  |   5 +-
 arch/x86/include/asm/special_insns.h          |  28 +-
 arch/x86/kernel/cpu/common.c                  |   4 +-
 arch/x86/power/Makefile                       |   4 +
 drivers/firmware/efi/libstub/Makefile         |   2 +
 drivers/misc/lkdtm/Makefile                   |   1 +
 drivers/ras/cec.c                             |   9 +-
 include/asm-generic/Kbuild                    |   1 +
 include/asm-generic/module.lds.h              |  10 +
 include/asm-generic/vmlinux.lds.h             |  11 +-
 include/linux/init.h                          |  79 ++++-
 include/linux/pci.h                           |  19 +-
 kernel/Makefile                               |   3 -
 kernel/trace/Kconfig                          |  29 ++
 scripts/.gitignore                            |   1 +
 scripts/Makefile                              |   3 +
 scripts/Makefile.build                        |  69 +++--
 scripts/Makefile.lib                          |  17 +-
 scripts/Makefile.modfinal                     |  29 +-
 scripts/Makefile.modpost                      |  22 +-
 scripts/generate_initcall_order.pl            | 270 ++++++++++++++++++
 scripts/link-vmlinux.sh                       |  95 +++++-
 scripts/mod/Makefile                          |   1 +
 scripts/mod/modpost.c                         |  16 +-
 scripts/mod/modpost.h                         |   9 +
 scripts/mod/sumversion.c                      |   6 +-
 scripts/{module-common.lds => module.lds.S}   |  31 ++
 scripts/package/builddeb                      |   2 +-
 tools/objtool/builtin-check.c                 |  13 +-
 tools/objtool/builtin.h                       |   2 +-
 tools/objtool/check.c                         |  83 ++++++
 tools/objtool/check.h                         |   1 +
 tools/objtool/objtool.h                       |   1 +
 56 files changed, 906 insertions(+), 160 deletions(-)
 rename arch/arm/{kernel/module.lds => include/asm/module.lds.h} (72%)
 rename arch/arm64/{kernel/module.lds => include/asm/module.lds.h} (76%)
 rename arch/ia64/{module.lds => include/asm/module.lds.h} (100%)
 rename arch/m68k/{kernel/module.lds => include/asm/module.lds.h} (100%)
 rename arch/powerpc/{kernel/module.lds => include/asm/module.lds.h} (100%)
 rename arch/riscv/{kernel/module.lds => include/asm/module.lds.h} (84%)
 create mode 100644 include/asm-generic/module.lds.h
 create mode 100755 scripts/generate_initcall_order.pl
 rename scripts/{module-common.lds => module.lds.S} (59%)


base-commit: ccc1d052eff9f3cfe59d201263903fe1d46c79a5

Comments

Kees Cook Sept. 30, 2020, 9:10 p.m. UTC | #1
On Tue, Sep 29, 2020 at 02:46:02PM -0700, Sami Tolvanen wrote:
> Furthermore, patches 4-8 include Peter's patch for generating
> __mcount_loc with objtool, and build system changes to enable it on
> x86. With these patches, we no longer need to annotate functions
> that have non-call references to __fentry__ with LTO, which makes
> supporting dynamic ftrace much simpler.

Peter, can you take patches 4-8 into -tip? I think it'd make sense
to keep them together. Steven, it sounds like you're okay with the
changes (i.e. Sami showed the one concern you had was already handled)?
Getting these into v5.10 would be really really nice.

I'd really like to get the tree-spanning prerequisites nailed down
for this series. It's been difficult to coordinate given the multiple
maintainers. :)

Specifically these patches:

Peter Zijlstra (1):
  objtool: Add a pass for generating __mcount_loc

Sami Tolvanen (4):
  objtool: Don't autodetect vmlinux.o
  tracing: move function tracer options to Kconfig
  tracing: add support for objtool mcount
  x86, build: use objtool mcount

https://lore.kernel.org/lkml/20200929214631.3516445-5-samitolvanen@google.com/
https://lore.kernel.org/lkml/20200929214631.3516445-6-samitolvanen@google.com/
https://lore.kernel.org/lkml/20200929214631.3516445-7-samitolvanen@google.com/
https://lore.kernel.org/lkml/20200929214631.3516445-8-samitolvanen@google.com/
https://lore.kernel.org/lkml/20200929214631.3516445-9-samitolvanen@google.com/

Thanks!
Nick Desaulniers Sept. 30, 2020, 9:58 p.m. UTC | #2
On Tue, Sep 29, 2020 at 2:46 PM Sami Tolvanen <samitolvanen@google.com> wrote:
>
> This patch series adds support for building x86_64 and arm64 kernels
> with Clang's Link Time Optimization (LTO).
>
> In addition to performance, the primary motivation for LTO is
> to allow Clang's Control-Flow Integrity (CFI) to be used in the
> kernel. Google has shipped millions of Pixel devices running three
> major kernel versions with LTO+CFI since 2018.
>
> Most of the patches are build system changes for handling LLVM
> bitcode, which Clang produces with LTO instead of ELF object files,
> postponing ELF processing until a later stage, and ensuring initcall
> ordering.

Sami, thanks for continuing to drive the series. I encourage you to
keep resending with fixes accumulated or dropped on a weekly cadence.

The series worked well for me on arm64, but for x86_64 on mainline I
saw a stream of new objtool warnings:

testing your LTO series; x86_64 defconfig + CONFIG_THINLTO:
``` LTO vmlinux.o OBJTOOL vmlinux.o vmlinux.o: warning: objtool:
wakeup_long64()+0x61: indirect jump found in RETPOLINE build
vmlinux.o: warning: objtool: .text+0x308a: indirect jump found in
RETPOLINE build vmlinux.o: warning: objtool: .text+0x30c5: indirect
jump found in RETPOLINE build vmlinux.o: warning: objtool:
copy_user_enhanced_fast_string() falls through to next function
copy_user_generic_unrolled() vmlinux.o: warning: objtool:
__memcpy_mcsafe() falls through to next function mcsafe_handle_tail()
vmlinux.o: warning: objtool: memset() falls through to next function
memset_erms() vmlinux.o: warning: objtool: __memcpy() falls through to
next function memcpy_erms() vmlinux.o: warning: objtool:
__x86_indirect_thunk_rax() falls through to next function
__x86_retpoline_rax() vmlinux.o: warning: objtool:
__x86_indirect_thunk_rbx() falls through to next function
__x86_retpoline_rbx() vmlinux.o: warning: objtool:
__x86_indirect_thunk_rcx() falls through to next function
__x86_retpoline_rcx() vmlinux.o: warning: objtool:
__x86_indirect_thunk_rdx() falls through to next function
__x86_retpoline_rdx() vmlinux.o: warning: objtool:
__x86_indirect_thunk_rsi() falls through to next function
__x86_retpoline_rsi() vmlinux.o: warning: objtool:
__x86_indirect_thunk_rdi() falls through to next function
__x86_retpoline_rdi() vmlinux.o: warning: objtool:
__x86_indirect_thunk_rbp() falls through to next function
__x86_retpoline_rbp() vmlinux.o: warning: objtool:
__x86_indirect_thunk_r8() falls through to next function
__x86_retpoline_r8() vmlinux.o: warning: objtool:
__x86_indirect_thunk_r9() falls through to next function
__x86_retpoline_r9() vmlinux.o: warning: objtool:
__x86_indirect_thunk_r10() falls through to next function
__x86_retpoline_r10() vmlinux.o: warning: objtool:
__x86_indirect_thunk_r11() falls through to next function
__x86_retpoline_r11() vmlinux.o: warning: objtool:
__x86_indirect_thunk_r12() falls through to next function
__x86_retpoline_r12() vmlinux.o: warning: objtool:
__x86_indirect_thunk_r13() falls through to next function
__x86_retpoline_r13() vmlinux.o: warning: objtool:
__x86_indirect_thunk_r14() falls through to next function
__x86_retpoline_r14() vmlinux.o: warning: objtool:
__x86_indirect_thunk_r15() falls through to next function
__x86_retpoline_r15() ```

I think those should be resolved before I provide any kind of tested
by tag.  My other piece of feedback was that I like the default
ThinLTO, but I think the help text in the Kconfig which is visible
during menuconfig could be improved by informing the user the
tradeoffs.  For example, if CONFIG_THINLTO is disabled, it should be
noted that full LTO will be used instead.  Also, that full LTO may
produce slightly better optimized binaries than ThinLTO, at the cost
of not utilizing multiple cores when linking and thus significantly
slower to link.

Maybe explaining that setting it to "n" implies a full LTO build,
which will be much slower to link but possibly slightly faster would
be good?  It's not visible unless LTO_CLANG and ARCH_SUPPORTS_THINLTO
is enabled, so I don't think you need to explain that THINLTO without
those is *not* full LTO.  I'll leave the precise wording to you. WDYT?

Also, when I look at your treewide DISABLE_LTO patch, I think "does
that need to be a part of this series, or is it a cleanup that can
stand on its own?"  I think it may be the latter?  Maybe it would help
shed one more patch than to have to carry it to just send it?  Or did
I miss something as to why it should remain a part of this series?
Sami Tolvanen Sept. 30, 2020, 10:12 p.m. UTC | #3
On Wed, Sep 30, 2020 at 2:58 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Tue, Sep 29, 2020 at 2:46 PM Sami Tolvanen <samitolvanen@google.com> wrote:
> >
> > This patch series adds support for building x86_64 and arm64 kernels
> > with Clang's Link Time Optimization (LTO).
> >
> > In addition to performance, the primary motivation for LTO is
> > to allow Clang's Control-Flow Integrity (CFI) to be used in the
> > kernel. Google has shipped millions of Pixel devices running three
> > major kernel versions with LTO+CFI since 2018.
> >
> > Most of the patches are build system changes for handling LLVM
> > bitcode, which Clang produces with LTO instead of ELF object files,
> > postponing ELF processing until a later stage, and ensuring initcall
> > ordering.
>
> Sami, thanks for continuing to drive the series. I encourage you to
> keep resending with fixes accumulated or dropped on a weekly cadence.
>
> The series worked well for me on arm64, but for x86_64 on mainline I
> saw a stream of new objtool warnings:
[...]

Objtool normally won't print out these warnings when run on vmlinux.o,
but we can't pass --vmlinux to objtool as that also implies noinstr
validation right now. I think we'd have to split that from --vmlinux
to avoid these. I can include a patch to add a --noinstr flag in v5.
Peter, any thoughts about this?

> I think those should be resolved before I provide any kind of tested
> by tag.  My other piece of feedback was that I like the default
> ThinLTO, but I think the help text in the Kconfig which is visible
> during menuconfig could be improved by informing the user the
> tradeoffs.  For example, if CONFIG_THINLTO is disabled, it should be
> noted that full LTO will be used instead.  Also, that full LTO may
> produce slightly better optimized binaries than ThinLTO, at the cost
> of not utilizing multiple cores when linking and thus significantly
> slower to link.
>
> Maybe explaining that setting it to "n" implies a full LTO build,
> which will be much slower to link but possibly slightly faster would
> be good?  It's not visible unless LTO_CLANG and ARCH_SUPPORTS_THINLTO
> is enabled, so I don't think you need to explain that THINLTO without
> those is *not* full LTO.  I'll leave the precise wording to you. WDYT?

Sure, sounds good. I'll update the help text in the next version.

> Also, when I look at your treewide DISABLE_LTO patch, I think "does
> that need to be a part of this series, or is it a cleanup that can
> stand on its own?"  I think it may be the latter?  Maybe it would help
> shed one more patch than to have to carry it to just send it?  Or did
> I miss something as to why it should remain a part of this series?

I suppose it could be stand-alone, but as these patches are also
disabling LTO by filtering out flags in some of the same files,
removing the unused DISABLE_LTO flags first would reduce confusion.
But I'm fine with sending it separately too if that's preferred.

Sami