
[v2,0/2] gitlab-ci.yml: Add jobs to test CFI

Message ID 20210226152108.7848-1-dbuono@linux.vnet.ibm.com
Series gitlab-ci.yml: Add jobs to test CFI

Message

Daniele Buono Feb. 26, 2021, 3:21 p.m. UTC
For a few months now QEMU has had options to enable compiler-based
control-flow integrity if built with clang.

While this feature has a low maintenance burden, it's probably still better
to add tests to the CI environment to check that an update doesn't break it.

The patchset allows gitlab testing of:
* --enable-cfi: forward-edge cfi (function pointers)
* --enable-safe-stack: backward-edge cfi (return pointers)
As an added benefit, this also inherently tests LTO. 

The first patch allows a custom selection of linker parallelism.
Currently, make parallelism at build time is based on the number
of CPUs available.
This doesn't work well with LTO at link time, because the linker has to
load all the intermediate object files into memory for optimization.
If the gitlab runner happens to run two linking processes at the same
time, the job will fail with an out-of-memory error.
The patch keeps high parallelism at compile time, but limits the
number of linkers executed in parallel.

The second patch introduces the ci/cd jobs in the gitlab pipeline.
My original intention was to create a single chain of
build -> check -> acceptance, with all the targets compiled by default.
Unfortunately, the resulting artifact is too big to be uploaded.
So I split the tests into two chains, which should cover all non-deprecated
targets as of today.
Build jobs are on the longer side (about 2h and 20m), but I thought it
would be better to just have 6 large jobs than tens of smaller ones.
For the builds, we have to select --enable-slirp=git, because CFI needs a
statically linked version of slirp built with CFI information. More info on
this can be found in a comment in .gitlab-ci.yml.

Test runs of the full pipeline are here (cfi-ci-v2 branch):
https://gitlab.com/dbuono/qemu/-/pipelines/261311362

v2:
- More details in the code about the issue of using system-wide slirp
- Use meson to only limit linker parallelism instead of forcing no
  parallelism on the whole compilation process.

Daniele Buono (2):
  gitlab-ci.yml: Allow custom # of parallel linkers
  gitlab-ci.yml: Add jobs to test CFI flags

 .gitlab-ci.yml | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 97 insertions(+)

Comments

Daniel P. Berrangé March 1, 2021, 10:06 a.m. UTC | #1
On Fri, Feb 26, 2021 at 10:21:06AM -0500, Daniele Buono wrote:
> Build jobs are on the longer side (about 2h and 20m), but I thought it
> would be better to just have 6 large jobs than tens of smaller ones.

IMHO that is not viable.

Our longest job today is approx 60 minutes, and that is already
painfully long when developers are repeatedly testing their
patch series to find and fix bugs before posting them for review.
I can perhaps get through 5-6 test cycles in a day. If we have a
2 hour 20 min job, then I'll get 2-3 test cycles a day.

I don't want to see any new jobs added which increase the longest
job execution time. We want to reduce our max job time if anything.


Regards,
Daniel
Daniele Buono March 1, 2021, 2:59 p.m. UTC | #2
Hi Daniel,

On 3/1/2021 5:06 AM, Daniel P. Berrangé wrote:
> On Fri, Feb 26, 2021 at 10:21:06AM -0500, Daniele Buono wrote:
>> Build jobs are on the longer side (about 2h and 20m), but I thought it
>> would be better to just have 6 large jobs than tens of smaller ones.
> 
> IMHO that is not viable.
> 
> Our longest job today is approx 60 minutes, and that is already
> painfully long when developers are repeatedly testing their
> patch series to find and fix bugs before posting them for review.
> I can perhaps get through 5-6 test cycles in a day. If we have a
> 2 hour 20 min job, then I'll get 2-3 test cycles a day.
> 
> I don't want to see any new jobs added which increase the longest
> job execution time. We want to reduce our max job time if anything.
> 
> 

I totally understand the argument.

We could build two targets per job. That would create build jobs that
take 40 to 60-ish minutes. In that case, however, I would not
recommend testing all the possible targets, but would rather limit
them to a set of the most common targets. I have an example of the
resulting pipeline here:

https://gitlab.com/dbuono/qemu/-/pipelines/258983262

I selected intel, power, arm and s390 as "common" targets. Would
something like this be a viable alternative? Perhaps after giving
due thought to which targets should be tested?

> Regards,
> Daniel
>
Daniel P. Berrangé March 1, 2021, 3:08 p.m. UTC | #3
On Mon, Mar 01, 2021 at 09:59:22AM -0500, Daniele Buono wrote:
> Hi Daniel,
> 
> On 3/1/2021 5:06 AM, Daniel P. Berrangé wrote:
> > On Fri, Feb 26, 2021 at 10:21:06AM -0500, Daniele Buono wrote:
> > > Build jobs are on the longer side (about 2h and 20m), but I thought it
> > > would be better to just have 6 large jobs than tens of smaller ones.
> > 
> > IMHO that is not viable.
> > 
> > Our longest job today is approx 60 minutes, and that is already
> > painfully long when developers are repeatedly testing their
> > patch series to find and fix bugs before posting them for review.
> > I can perhaps get through 5-6 test cycles in a day. If we have a
> > 2 hour 20 min job, then I'll get 2-3 test cycles a day.
> > 
> > I don't want to see any new jobs added which increase the longest
> > job execution time. We want to reduce our max job time if anything.
> > 
> > 
> 
> I totally understand the argument.
> 
> We could build two targets per job. That would create build jobs that
> take 40 to 60-ish minutes. In that case, however, I would not
> recommend testing all the possible targets, but would rather limit
> them to a set of the most common targets. I have an example of the
> resulting pipeline here:
> 
> https://gitlab.com/dbuono/qemu/-/pipelines/258983262
> 
> I selected intel, power, arm and s390 as "common" targets. Would
> something like this be a viable alternative? Perhaps after giving
> due thought to which targets should be tested?

What are the unique failure scenarios for CFI that these jobs are
likely to expose ? Is it likely that we'll have cases where
CFI succeeds in say, x86_64 target, but fails in aarch64 target ?

If not, then it would be sufficient to just test a single target
to smoke out CFI specific bugs, and assume it covers other
targets implicitly.


Regards,
Daniel
Daniele Buono March 1, 2021, 8:39 p.m. UTC | #4
Hi Daniel,

On 3/1/2021 10:08 AM, Daniel P. Berrangé wrote:
> What are the unique failure scenarios for CFI that these jobs are
> likely to expose ? Is it likely that we'll have cases where
> CFI succeeds in say, x86_64 target, but fails in aarch64 target ?

For CFI to fail (even if it shouldn't) you'll need code that is calling 
a function pointer that was not well defined at compile time. Although
unlikely, that could happen everywhere in the code.

So by just testing one (or a few) targets we are not covering the code
in the TCG frontend used to compile the target ISA to tcg ops, which
should be in target/, or the architecture-dependent code in linux-user.

That code seems unlikely (at least to me) to cause a false positive with 
CFI. Examples that I've seen in QEMU so far are:
- Calling code that was JIT-ed at runtime
- Callbacks to functions that were loaded from shared libraries
- Signal Handlers
And none of those should happen there IMHO. But you know, corner cases 
are still possible, and it's quite difficult to predict what new code 
may bring.
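
For illustration only (this is not code from QEMU, and the library and
symbol names are just an example), the shared-library case above boils
down to something like the following, built with
clang -flto -fsanitize=cfi-icall -fuse-ld=lld (and -ldl where needed):

  #include <dlfcn.h>
  #include <stdio.h>

  typedef double (*unary_fn)(double);

  int main(void)
  {
      void *handle = dlopen("libm.so.6", RTLD_NOW);
      unary_fn fn;

      if (!handle) {
          return 1;
      }
      /* cos() lives in a DSO that was not part of this CFI-instrumented
       * link, so the compiler has no metadata for it: the indirect call
       * below is flagged even though it is perfectly legitimate (unless
       * cross-DSO CFI is enabled). */
      fn = (unary_fn)dlsym(handle, "cos");
      if (fn) {
          printf("%f\n", fn(0.0));
      }
      dlclose(handle);
      return 0;
  }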

We could also consider always testing one or two targets, and keeping an
optional job to test them all when deemed necessary. I'm thinking, for
example, of full testing when code in target/ or linux-user/ is involved,
or for testing pre-release code. It would be nice to have this automated,
but I am not sure that's possible.

Regards,
Daniele
Daniel P. Berrangé March 2, 2021, 10:30 a.m. UTC | #5
On Mon, Mar 01, 2021 at 03:39:42PM -0500, Daniele Buono wrote:
> Hi Daniel,
> 
> On 3/1/2021 10:08 AM, Daniel P. Berrangé wrote:
> > What are the unique failure scenarios for CFI that these jobs are
> > likely to expose ? Is it likely that we'll have cases where
> > CFI succeeds in say, x86_64 target, but fails in aarch64 target ?
> 
> For CFI to fail (even if it shouldn't) you'll need code that is calling a
> function pointer that was not well defined at compile time. Although
> unlikely, that could happen everywhere in the code.

What does "was not well defined" mean here ? 

> 
> So by just testing one (or a few) targets we are not covering the code in
> the TCG frontend used to compile the target ISA to tcg ops, which should be
> in target/, or the architecture-dependent code in linux-user.
> 
> That code seems unlikely (at least to me) to cause a false positive with
> CFI. Examples that I've seen in QEMU so far are:
> - Calling code that was JIT-ed at runtime
> - Callbacks to functions that were loaded from shared libraries
> - Signal Handlers
> And none of those should happen there IMHO. But you know, corner cases are
> still possible, and it's quite difficult to predict what new code may bring.
> 
> We could also consider always testing one or two targets, and keeping an
> optional job to test them all when deemed necessary. I'm thinking, for
> example, of full testing when code in target/ or linux-user/ is involved, or
> for testing pre-release code. It would be nice to have this automated, but I
> am not sure that's possible.
> 
> Regards,
> Daniele
> 

Regards,
Daniel
Daniele Buono March 2, 2021, 1:18 p.m. UTC | #6
On 3/2/2021 5:30 AM, Daniel P. Berrangé wrote:
> On Mon, Mar 01, 2021 at 03:39:42PM -0500, Daniele Buono wrote:
>> Hi Daniel,
>>
>> On 3/1/2021 10:08 AM, Daniel P. Berrangé wrote:
>>> What are the unique failure scenarios for CFI that these jobs are
>>> likely to expose ? Is it likely that we'll have cases where
>>> CFI succeeds in say, x86_64 target, but fails in aarch64 target ?
>> For CFI to fail (even if it shouldn't) you'll need code that is calling a
>> function pointer that was not well defined at compile time. Although
>> unlikely, that could happen everywhere in the code.
> What does "was not well defined" mean here ?
> 

At a high level, the compiler creates metadata for every function. Before
jumping to a function pointer, it makes sure that the pointer and the
pointee have matching types.
Not well defined means one of these two cases:
1. The function has a different type than the pointer -> Most likely an
error
2. The function was not available at compile time so the compiler could
not create the related metadata -> Most likely a false positive.
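
A tiny made-up example of case 1 (not from QEMU): built with
clang -flto -fsanitize=cfi-icall -fuse-ld=lld, the indirect call below is
rejected, because the pointer's type, long (*)(long), does not match the
callee's type, int (int):

  #include <stdio.h>

  static int add_one(int x)
  {
      return x + 1;
  }

  typedef long (*long_fn)(long);

  int main(void)
  {
      /* The explicit cast silences the compiler, but the CFI check at
       * the call site compares the pointer's function type with the
       * callee's metadata, sees a mismatch, and aborts. */
      long_fn fn = (long_fn)add_one;

      printf("%ld\n", fn(41));
      return 0;
  }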

Thanks,
Daniele
Daniel P. Berrangé March 2, 2021, 3:38 p.m. UTC | #7
On Tue, Mar 02, 2021 at 08:18:03AM -0500, Daniele Buono wrote:
> On 3/2/2021 5:30 AM, Daniel P. Berrangé wrote:
> > On Mon, Mar 01, 2021 at 03:39:42PM -0500, Daniele Buono wrote:
> > > Hi Daniel,
> > > 
> > > On 3/1/2021 10:08 AM, Daniel P. Berrangé wrote:
> > > > What are the unique failure scenarios for CFI that these jobs are
> > > > likely to expose ? Is it likely that we'll have cases where
> > > > CFI succeeds in say, x86_64 target, but fails in aarch64 target ?
> > > For CFI to fail (even if it shouldn't) you'll need code that is calling a
> > > function pointer that was not well defined at compile time. Although
> > > unlikely, that could happen everywhere in the code.
> > What does "was not well defined" mean here ?
> > 
> 
> At a high level, the compiler creates metadata for every function. Before
> jumping to a function pointer, it makes sure that the pointer and the
> pointee have matching types.
> Not well defined means one of these two cases:
> 1. The function has a different type than the pointer -> Most likely an
> error

How strictly is this checked ?  With GLib, function prototype mismatches
are not uncommon. For example, GLib might need to invoke a callback with
a signature:

   int foo(int somearg, void *opaque);

The API though will often declare the callback signature to be
generic

   void (*GCallback) (void);

The caller will implement a callback with

   int foo(int somearg, mytype *mydata);

and will use  G_CALLBACK() to do an intentional bad cast to GCallback

Before it invokes the callback, GLib would cast from GCallback back
to    int foo(int somearg, void *opaque);

Notice this last arg doesn't match the type of the actual implemented
callback.

Is this scenario going to upset  CFI, or is it happy that 'void *'
is compatible with 'mytype *', and ok with the intermediate casts
to/from GCallback ?

> 2. The function was not available at compile time so the compiler could
> not create the related metadata -> Most likely a false positive.

Regards,
Daniel
Daniele Buono March 2, 2021, 4:31 p.m. UTC | #8
On 3/2/2021 10:38 AM, Daniel P. Berrangé wrote:
> Is this scenario going to upset  CFI, or is it happy that 'void *'
> is compatible with 'mytype *', and ok with the intermediate casts
> to/from GCallback ?

This is a valid scenario. LLVM does offer the ability to consider all
pointer types compatible, and it is enabled in QEMU's build. So void* is
compatible with any type* and that would not be considered a fault.

Intermediate casts are also fine since you are just passing the pointer 
but not using it. The check will happen only when the function is 
called, at which point it was cast back to something compatible.
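
If it helps, the pattern can be reduced to the sketch below (plain C, no
GLib, invented names). Assuming the build uses -fsanitize=cfi-icall
together with clang's -fsanitize-cfi-icall-generalize-pointers flag, which
is what the "all pointer types compatible" option above refers to, the
final call passes: the check compares int (*)(int, void *) against
int foo(int, struct mytype *) with pointer arguments generalized, and the
intermediate GCallback-style cast is never checked because nothing is
called through that type:

  #include <stdio.h>

  typedef void (*generic_cb)(void);        /* stand-in for GCallback */
  typedef int (*invoke_cb)(int, void *);   /* signature used at the call */

  struct mytype {
      int base;
  };

  static int foo(int somearg, struct mytype *mydata)
  {
      return somearg + mydata->base;
  }

  static int run(generic_cb cb, int arg, void *opaque)
  {
      /* Cast back just before the call, as GLib does internally. */
      invoke_cb fn = (invoke_cb)cb;
      return fn(arg, opaque);
  }

  int main(void)
  {
      struct mytype data = { .base = 40 };

      /* The intentional "bad" cast, i.e. what G_CALLBACK() amounts to. */
      printf("%d\n", run((generic_cb)foo, 2, &data));
      return 0;
  }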
Daniel P. Berrangé March 2, 2021, 4:40 p.m. UTC | #9
On Tue, Mar 02, 2021 at 11:31:54AM -0500, Daniele Buono wrote:
> 
> On 3/2/2021 10:38 AM, Daniel P. Berrangé wrote:
> > Is this scenario going to upset  CFI, or is it happy that 'void *'
> > is compatible with 'mytype *', and ok with the intermediate casts
> > to/from GCallback ?
> 
> This is a valid scenario. LLVM does offer the ability to consider all
> pointer types compatible, and it is enabled in QEMU's build. So void* is
> compatible with any type* and that would not be considered a fault.

Ok that's good.

> Intermediate casts are also fine since you are just passing the pointer but
> not using it. The check will happen only when the function is called, at
> which point it was cast back to something compatible.

Makes sense.

So in general, it sounds like breadth of test coverage is fairly important
for the CFI jobs, at least if we're exercising different areas of
functionality.  So I think we do need to be testing more than just one
architecture target.

The CFI protection is something I'd say is relevant to virtualization
use cases, not to emulation use cases

   https://qemu-project.gitlab.io/qemu/system/security.html

IOW, the targets that are important to test are the ones where KVM
is available.

So that's  s390x, ppc, x86, mips, and arm.

I think we can probably ignore mips as that's fairly niche.
We can also reasonably limit ourselves to only test the 64-bit
variants of the target, on the basis that 32-bit is increasingly
legacy/niche too.

So that gives us  ppc64le, x86_64, aarch64 and s390x as the
targets we should get CI coverage for CFI.

Regards,
Daniel
Daniele Buono March 2, 2021, 9:01 p.m. UTC | #10
On 3/2/2021 11:40 AM, Daniel P. Berrangé wrote:
> The CFI protection is something I'd say is relevant to virtualization
> use cases, not to emulation use cases
> 
>     https://qemu-project.gitlab.io/qemu/system/security.html
> 
> IOW, the targets that are important to test are the ones where KVM
> is available.
> 
> So that's  s390x, ppc, x86, mips, and arm.
> 
> I think we can probably ignore mips as that's fairly niche.
> We can also reasonably limit ourselves to only test the 64-bit
> variants of the target, on the basis that 32-bit is increasingly
> legacy/niche too.
> 
> So that gives us  ppc64le, x86_64, aarch64 and s390x as the
> targets we should get CI coverage for CFI.

Thanks Daniel,
I'll start working on a V3 that only contains those 4 targets, probably 
in two sets of build/check/acceptance to maintain the jobs below the 
hour mark.

These would still be x86 binaries that are not testing KVM, however,
because of the capabilities of the shared gitlab runners.

I see that there's some work from Cleber Rosa to allow running custom 
jobs on aarch64 and s390x systems. I think that, when the infrastructure 
is ready, having a KVM-based CFI test there would help a lot in terms of 
coverage for those architectures.
Daniel P. Berrangé March 3, 2021, 10:04 a.m. UTC | #11
On Tue, Mar 02, 2021 at 04:01:17PM -0500, Daniele Buono wrote:
> On 3/2/2021 11:40 AM, Daniel P. Berrangé wrote:
> > The CFI protection is something I'd say is relevant to virtualization
> > use cases, not to emulation use cases
> > 
> >     https://qemu-project.gitlab.io/qemu/system/security.html
> > 
> > IOW, the targets that are important to test are the ones where KVM
> > is available.
> > 
> > So that's  s390x, ppc, x86, mips, and arm.
> > 
> > I think we can probably ignore mips as that's fairly niche.
> > We can also reasonably limit ourselves to only test the 64-bit
> > variants of the target, on the basis that 32-bit is increasingly
> > legacy/niche too.
> > 
> > So that gives us  ppc64le, x86_64, aarch64 and s390x as the
> > targets we should get CI coverage for CFI.
> 
> Thanks Daniel,
> I'll start working on a V3 that only contains those 4 targets, probably in
> two sets of build/check/acceptance to maintain the jobs below the hour mark.
> 
> These would still be x86 binaries that are not testing KVM, however,
> because of the capabilities of the shared gitlab runners.

Yes, that's fine.

> I see that there's some work from Cleber Rosa to allow running custom jobs
> on aarch64 and s390x systems. I think that, when the infrastructure is
> ready, having a KVM-based CFI test there would help a lot in terms of
> coverage for those architectures.

Yep, that should be possible.

Regards,
Daniel