
[v4,00/13] "Task_isolation" mode

Message ID 04be044c1bcd76b7438b7563edc35383417f12c8.camel@marvell.com


Alex Belits July 22, 2020, 2:44 p.m. UTC
This is a new version of task isolation implementation. Previous version is at
https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.camel@marvell.com/

Mostly this covers prevention of race conditions when breaking isolation. Early after kernel entry,
task_isolation_enter() is called to update flags visible to other CPU cores and to perform
synchronization if necessary. Before this call only "safe" operations happen, as long as
CONFIG_TRACE_IRQFLAGS is not enabled.
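
Roughly, the entry-side hook looks like this (a simplified sketch of the idea
rather than code from the series; the per-CPU flag name and its encoding here
are illustrative only):

/*
 * Simplified sketch of the early-entry synchronization described above.
 * The flag name and values are illustrative; the series spreads this
 * logic across architectures and interrupt controller drivers.
 */
#include <linux/percpu.h>
#include <linux/smp.h>
#include <asm/processor.h>      /* sync_core() on x86 */

static DEFINE_PER_CPU(int, ll_isol_state);      /* 0 = normal, 1 = isolated */

/* Called early on every kernel entry, before any code that could depend
 * on state left "stale" while the CPU was isolated. */
static inline void task_isolation_enter(void)
{
        if (this_cpu_read(ll_isol_state)) {
                /* Tell remote CPUs to stop deferring events for us... */
                this_cpu_write(ll_isol_state, 0);
                smp_mb();
                /* ...then catch up with whatever was deferred while we
                 * were isolated, e.g. instruction synchronization. */
                sync_core();    /* x86-specific in this sketch */
        }
}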

This is also intended for future TLB handling -- the idea is to also isolate those CPU cores from
TLB flushes while they are running an isolated task in userspace, and to do one flush on exit, before
calling any code that may touch anything updated.

The functionality and interface are unchanged, except for /sys/devices/system/cpu/isolation_running
containing the list of CPUs running isolated tasks. This should be useful for userspace helper
libraries.
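
A userspace helper can read it with plain stdio, for example (a minimal
sketch; only the sysfs path above is part of the proposed interface):

#include <stdio.h>

int main(void)
{
        char buf[256];
        FILE *f = fopen("/sys/devices/system/cpu/isolation_running", "r");

        if (!f) {
                perror("isolation_running");
                return 1;
        }
        /* The file contains a CPU list in the usual sysfs format, e.g. "3,5-7". */
        if (fgets(buf, sizeof(buf), f))
                printf("CPUs running isolated tasks: %s", buf);
        fclose(f);
        return 0;
}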

Comments

Thomas Gleixner July 23, 2020, 1:17 p.m. UTC | #1
Alex,

Alex Belits <abelits@marvell.com> writes:
> This is a new version of task isolation implementation. Previous version is at
> https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.camel@marvell.com/
>
> Mostly this covers prevention of race conditions when breaking isolation. Early after kernel entry,
> task_isolation_enter() is called to update flags visible to other CPU cores and to perform
> synchronization if necessary. Before this call only "safe" operations happen, as long as
> CONFIG_TRACE_IRQFLAGS is not enabled.

Without going into details of the individual patches, let me give you a
high level view of this series:

  1) Entry code handling:

     That's completely broken vs. the careful ordering and instrumentation
     protection of the entry code. You can't just slap stuff randomly
     into places which you think are safe w/o actually trying to understand
     why this code is ordered in the way it is.

     This clearly was never built and tested with any of the relevant
     debug options enabled. Both build and boot would have told you.

  2) Instruction synchronization

     Trying to do instruction synchronization delayed is a clear recipe
     for hard-to-diagnose failures. Just because it did not blow up in your
     face does not make it correct in any way. It's broken by design and
     violates _all_ rules of safe instruction patching and introduces a
     complete trainwreck in x86 NMI processing.

     If you really think that this is correct, then please have at least
     the courtesy to come up with a detailed and precise argumentation
     why this is a valid approach.

     While writing that up you surely will find out why it is not.

  3) Debug calls

     Sprinkling debug calls around the codebase randomly is not going to
     happen. That's an unmaintainable mess.

     Aside from that, none of these dmesg-based debug things is necessary.
     This can simply be monitored with tracing.

  4) Tons of undocumented smp barriers

     See Documentation/process/submit-checklist.rst #25

  5) Signal on page fault

     Why is this a magic task isolation feature instead of making it
     something which can be used in general? There are other legit
     reasons why a task might want a notification about an unexpected
     (resolved) page fault.

  6) Coding style violations all over the place

     Using checkpatch.pl is mandatory

  7) Not Cc'ed maintainers

     While your Cc list is huge, you completely fail to Cc the relevant
     maintainers of various files and subsystems as requested in
     Documentation/process/*

  8) Changelogs

     Most of the changelogs have something along the lines of:

     'task isolation does not want X, so do Y to make it not do X'

     without a single line of explanation why this approach was chosen
     and why it is correct under all circumstances and cannot have nasty
     side effects.

     It's not the job of the reviewers/maintainers to figure this out.

Please come up with a coherent design first and then address the
identified issues one by one in a way which is palatable and reviewable.

Throwing a big pile of completely undocumented 'works for me' mess over
the fence does not get you anywhere, not even to the point that people
are willing to review it in detail.

Thanks,

        tglx
Peter Zijlstra July 23, 2020, 2:26 p.m. UTC | #2
On Thu, Jul 23, 2020 at 03:17:04PM +0200, Thomas Gleixner wrote:

>   2) Instruction synchronization
> 
>      Trying to do instruction synchronization delayed is a clear recipe
>      for hard-to-diagnose failures. Just because it did not blow up in your
>      face does not make it correct in any way. It's broken by design and
>      violates _all_ rules of safe instruction patching and introduces a
>      complete trainwreck in x86 NMI processing.
> 
>      If you really think that this is correct, then please have at least
>      the courtesy to come up with a detailed and precise argumentation
>      why this is a valid approach.
> 
>      While writing that up you surely will find out why it is not.

So delaying the sync_core() IPIs for kernel text patching _might_ be
possible, but it very much wants to be a separate patchset and not
something hidden inside a 'gem' like this.
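
The shape such a patchset would have to argue for is roughly this (a
hypothetical sketch, not the posted code; everything except sync_core() and
the generic percpu/cpumask/smp helpers is made up):

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/smp.h>

static DEFINE_PER_CPU(bool, text_sync_pending);
static struct cpumask isolated_mask;    /* CPUs running isolated tasks */

static void ipi_sync_core(void *info)
{
        sync_core();
}

/* Patching side: IPI only non-isolated CPUs, record a debt for the rest. */
static void text_poke_sync_deferred(void)
{
        struct cpumask targets;
        int cpu;

        for_each_cpu(cpu, &isolated_mask)
                per_cpu(text_sync_pending, cpu) = true;
        smp_mb();       /* debts must be visible before deciding whom to IPI */
        cpumask_andnot(&targets, cpu_online_mask, &isolated_mask);
        on_each_cpu_mask(&targets, ipi_sync_core, NULL, true);
}

/* Entry side: pay the debt before running any possibly-patched text. */
static void task_isolation_kernel_enter(void)
{
        if (this_cpu_read(text_sync_pending)) {
                this_cpu_write(text_sync_pending, false);
                sync_core();
        }
}

And it would have to prove that nothing between the real kernel entry and
that hook -- NMIs included -- can execute patched text.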
Peter Zijlstra July 23, 2020, 2:29 p.m. UTC | #3
On Thu, Jul 23, 2020 at 03:17:04PM +0200, Thomas Gleixner wrote:
>   8) Changelogs
> 
>      Most of the changelogs have something along the lines of:
> 
>      'task isolation does not want X, so do Y to make it not do X'
> 
>      without a single line of explanation why this approach was chosen
>      and why it is correct under all circumstances and cannot have nasty
>      side effects.
> 
>      It's not the job of the reviewers/maintainers to figure this out.
> 
> Please come up with a coherent design first and then address the
> identified issues one by one in a way which is palatable and reviewable.
> 
> Throwing a big pile of completely undocumented 'works for me' mess over
> the fence does not get you anywhere, not even to the point that people
> are willing to review it in detail.

This.. as presented it is an absolutely unreviewable pile of junk. It
presents code without any coherent problem description and analysis. And
the patches are not split sanely either.
Thomas Gleixner July 23, 2020, 2:53 p.m. UTC | #4
Peter Zijlstra <peterz@infradead.org> writes:
> On Thu, Jul 23, 2020 at 03:17:04PM +0200, Thomas Gleixner wrote:
>
>>   2) Instruction synchronization
>> 
>>      Trying to do instruction synchronization delayed is a clear recipe
>>      for hard-to-diagnose failures. Just because it did not blow up in your
>>      face does not make it correct in any way. It's broken by design and
>>      violates _all_ rules of safe instruction patching and introduces a
>>      complete trainwreck in x86 NMI processing.
>> 
>>      If you really think that this is correct, then please have at least
>>      the courtesy to come up with a detailed and precise argumentation
>>      why this is a valid approach.
>> 
>>      While writing that up you surely will find out why it is not.
>
> So delaying the sync_core() IPIs for kernel text patching _might_ be
> possible, but it very much wants to be a separate patchset and not
> something hidden inside a 'gem' like this.

I'm not saying it's impossible, but the proposed hack is definitely
beyond broken and you really don't want to be the one who has to mop up
the pieces later.

Thanks,

        tglx
Alex Belits July 23, 2020, 3:18 p.m. UTC | #5
On Thu, 2020-07-23 at 15:17 +0200, Thomas Gleixner wrote:
> 
> Without going into details of the individual patches, let me give you a
> high level view of this series:
> 
>   1) Entry code handling:
> 
>      That's completely broken vs. the careful ordering and instrumentation
>      protection of the entry code. You can't just slap stuff randomly
>      into places which you think are safe w/o actually trying to understand
>      why this code is ordered in the way it is.
> 
>      This clearly was never built and tested with any of the relevant
>      debug options enabled. Both build and boot would have told you.

This is intended to avoid a race condition when entry to or exit from isolation
happens at the same time as an event that requires synchronization. The idea
is that it is possible to insulate the core from all events while it is running
an isolated task in userspace; it will receive those calls normally after
breaking isolation and entering the kernel, and it will synchronize itself on
kernel entry.

This has two potential problems that I am trying to solve:

1. Without careful ordering, there will be a race condition with events that
happen at the same time as kernel entry or exit.

2. CPU runs some kernel code after entering but before synchronization. This
code should be restricted to early entry that is not affected by the "stale"
state, similar to how IPI code that receives synchronization events does it
normally.
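
For problem 1, the exit side has to publish the isolated state and then
re-check for racing events; something like this (a sketch of the intent only,
reusing the illustrative per-CPU flag from the sketch in the cover letter):

/* Last step before returning to an isolated userspace task. */
static void task_isolation_exit_to_user(void)
{
        this_cpu_write(ll_isol_state, 1);
        /*
         * Pairs with a barrier on the remote, event-sending side: a
         * remote CPU must either observe ll_isol_state == 1 and defer
         * its event, or its IPI must reach us while we are still in the
         * kernel, where the next entry will synchronize normally.
         */
        smp_mb();
        /* Re-check for events that raced with the flag update before
         * actually returning to userspace. */
}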

I can't say that I am completely happy with the amount of kernel entry
handling that had to be added. The problem is, I am trying to introduce a
feature that allows CPU cores to go into "de-synchronized" state while running
isolated tasks and not receiving synchronization events that normally would
reach them. This means that some point must be established on kernel entry
when it is safe for the core to catch up with the rest of the kernel. It may be
useful for other purposes; however, at this point task isolation is the first
to need it, so I had to determine where such a point is for every supported
architecture and method of kernel entry.

I have found that each architecture has its own way of handling this,
and sometimes individual interrupt controller drivers vary in their
sequence of calls on early kernel entry. For x86 I also have an
implementation for kernel 5.6, before your changes to IDT macros.
That version is much less straightforward, so I am grateful for those
relatively recent improvements.

Nevertheless, I believe that the goal of finding those points and using
them for synchronization is valid. If you can recommend me a better way
for at least x86, I will be happy to follow your advice. I have tried to
cover kernel entry in a generic way while making the changes least
disruptive, and this is why it looks simple and spread over multiple
places. I also had to do the same for arm and arm64 (that I use for
development), and for each architecture I had to produce sequences of
entry points and function calls to determine the correct placement of
task_isolation_enter() calls in them. It is not random, however it does
reflect the complex nature of kernel entry code. I believe the RCU
implementation faced somewhat similar requirements for calls on kernel
entry; however, it is not completely unified, either.

>  2) Instruction synchronization
>     Trying to do instruction synchronization delayed is a clear recipe
>     for hard-to-diagnose failures. Just because it did not blow up in your
>     face does not make it correct in any way. It's broken by design and
>     violates _all_ rules of safe instruction patching and introduces a
>     complete trainwreck in x86 NMI processing.

The idea is that, just as synchronization events are handled by a regular IPI,
we already use some code with the assumption that it is safe to enter in a
"stale" state before synchronization. I have extended this to allow
synchronization points on all kernel entry points.

>     If you really think that this is correct, then please have at least
>     the courtesy to come up with a detailed and precise argumentation
>     why this is a valid approach.
>
>     While writing that up you surely will find out why it is not.
> 

I had to document a sequence of calls for every entry point on three supported
architectures, to determine the points for synchronization. It is possible that
I have somehow missed something, however I don't see a better approach, save
for establishing a kernel-wide infrastructure for this. And even if we did just
that, it would be possible to implement this kind of synchronization point
calls first, and convert them to something more generic later.

> 
>   3) Debug calls
> 
>      Sprinkling debug calls around the codebase randomly is not going to
>      happen. That's an unmaintainable mess.

Those report isolation breaking causes, and are intended for application and
system debugging.

> 
>      Aside from that, none of these dmesg-based debug things is necessary.
>      This can simply be monitored with tracing.

I think it would be better to make all that information available to the
userspace application; however, I have based this on the Chris Metcalf code,
and gradually updated the mechanisms and interfaces. The original reporting
of isolation breaking causes had far greater problems, so at first I wanted
to have something that produces easily visible and correct reporting, and
does not break things while doing so.

> 
>   4) Tons of undocumented smp barriers
> 
>      See Documentation/process/submit-checklist.rst #25
> 

That should be fixed.

>   5) Signal on page fault
> 
>      Why is this a magic task isolation feature instead of making it
>      something which can be used in general? There are other legit
>      reasons why a task might want a notification about an unexpected
>      (resolved) page fault.

A page fault causes isolation breaking. When a task runs in isolated mode, it
does so because it requires predictable timing, so causing page faults and
expecting them to be handled along the way would defeat the purpose of
isolation. So if a page fault did happen, it is important that the application
receive a notification that isolation was broken, and then it may decide to do
something about it, re-enter isolation, etc.
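
For example (a userspace sketch; the prctl() constants below come from
earlier out-of-tree postings of this series, not from mainline headers, so
they are defined locally here and their exact values may differ in this
version):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>

#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION          48      /* value from the series, not mainline */
#define PR_TASK_ISOLATION_ENABLE       (1 << 0)
#define PR_TASK_ISOLATION_USERSIG      (1 << 1)
#define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) << 8) | PR_TASK_ISOLATION_USERSIG)
#endif

static void isolation_broken(int sig)
{
        (void)sig;
        /* A page fault, IPI, etc. broke isolation; the task may clean
         * up and re-enter isolation, or give up and run normally. */
}

int main(void)
{
        signal(SIGUSR1, isolation_broken);
        /* The task must already run alone on a nohz_full CPU. */
        if (prctl(PR_SET_TASK_ISOLATION,
                  PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_SET_SIG(SIGUSR1),
                  0, 0, 0) < 0) {
                perror("prctl");
                exit(1);
        }
        for (;;)
                ;       /* latency-critical loop: no syscalls, no page faults */
}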

> 
>   6) Coding style violations all over the place
> 
>      Using checkpatch.pl is mandatory
> 
>   7) Not Cc'ed maintainers
> 
>      While your Cc list is huge, you completely fail to Cc the relevant
>      maintainers of various files and subsystems as requested in
>      Documentation/process/*

To be honest, I am not sure whom I have missed; I tried to include everyone
from my previous attempt.

> 
>   8) Changelogs
> 
>      Most of the changelogs have something along the lines of:
> 
>      'task isolation does not want X, so do Y to make it not do X'
> 
>      without a single line of explanation why this approach was chosen
>      and why it is correct under all circumstances and cannot have nasty
>      side effects.

This is the same as the previous version, except for the addition of kernel
entry handling. As far as I can tell, the rest was discussed before, and not
many questions remained except for the race condition on kernel entry. I
agree that kernel entry handling is a complex issue in itself, so I have
included an explanation of entry point / function call sequences for each
supported architecture. I have a longer call diagram that I used to track
each particular function; it probably should be included as a separate
document.

>      It's not the job of the reviewers/maintainers to figure this out.
> 
> Please come up with a coherent design first and then address the
> identified issues one by one in a way which is palatable and reviewable.
> 
> Throwing a big pile of completely undocumented 'works for me' mess over
> the fence does not get you anywhere, not even to the point that people
> are willing to review it in detail.

There is a design, and it is the result of careful tracking of calls in the
kernel source. It has multiple points where task_isolation_enter() is called
for a reason similar to why RCU-related functions are called in multiple
places.

If someone can recommend a better way to introduce a kernel entry
checkpoint for synchronization that did not exist before, I will be happy
to hear it.
Alex Belits July 23, 2020, 3:41 p.m. UTC | #6
On Thu, 2020-07-23 at 16:29 +0200, Peter Zijlstra wrote:
> 
> This.. as presented it is an absolutely unreviewable pile of junk. It
> presents code without any coherent problem description and analysis.
> And the patches are not split sanely either.

There is a more complete and slightly outdated description in the
previous version of the patch at 
https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.camel@marvell.com/

It allows a userspace application to take a CPU core for itself and run
completely isolated, with no disturbances. There is work in progress
that also disables and re-enables TLB flushes, and depending on the CPU it
may be possible to also pre-allocate cache, so the task would not be affected
by the rest of the system. Events that cause interaction with the isolated
task cause isolation breaking, turning the task into a regular
userspace task that can continue running normally and enter the isolated
state again if necessary.

To make this feature suitable for any practical use, many mechanisms
that normally would cause events on a CPU should exclude CPU cores in
this state, and synchronization should happen later, at the time of
isolation breaking.

Three architectures are supported -- x86, arm and arm64 -- and it
should be possible to extend this to others. Unfortunately, kernel entry
procedures are neither unified nor straightforward, so introducing a new
feature into them creates the appearance of a mess.
Peter Zijlstra July 23, 2020, 3:48 p.m. UTC | #7
On Thu, Jul 23, 2020 at 03:41:46PM +0000, Alex Belits wrote:
> 
> On Thu, 2020-07-23 at 16:29 +0200, Peter Zijlstra wrote:
> > 
> > This.. as presented it is an absolutely unreviewable pile of junk. It
> > presents code without any coherent problem description and analysis.
> > And the patches are not split sanely either.
> 
> There is a more complete and slightly outdated description in the
> previous version of the patch at 
> https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.camel@marvell.com/

Not the point, you're mixing far too many things in one go. You also
have the patches split like 'generic / arch-1 / arch-2' which is wrong
by definition, as patches should be split per change and not care about
silly boundaries.

Also, if you want generic entry code, there's patches for that here:

  https://lkml.kernel.org/r/20200722215954.464281930@linutronix.de
Peter Zijlstra July 23, 2020, 3:49 p.m. UTC | #8
On Thu, Jul 23, 2020 at 03:18:42PM +0000, Alex Belits wrote:
> On Thu, 2020-07-23 at 15:17 +0200, Thomas Gleixner wrote:
> > 
> > Without going into details of the individual patches, let me give you a
> > high level view of this series:
> > 
> >   1) Entry code handling:
> > 
> >      That's completely broken vs. the careful ordering and instrumentation
> >      protection of the entry code. You can't just slap stuff randomly
> >      into places which you think are safe w/o actually trying to understand
> >      why this code is ordered in the way it is.
> > 
> >      This clearly was never built and tested with any of the relevant
> >      debug options enabled. Both build and boot would have told you.
> 
> This is intended to avoid a race condition when entry or exit from isolation
> happens at the same time as an event that requires synchronization. The idea
> is, it is possible to insulate the core from all events while it is running
> isolated task in userspace, it will receive those calls normally after
> breaking isolation and entering kernel, and it will synchronize itself on
> kernel entry.

'What does noinstr mean? and why do we have it?' -- don't dare touch the
entry code until you can answer that.
Alex Belits July 23, 2020, 4:19 p.m. UTC | #9
On Thu, 2020-07-23 at 17:48 +0200, Peter Zijlstra wrote:
> On Thu, Jul 23, 2020 at 03:41:46PM +0000, Alex Belits wrote:
> > On Thu, 2020-07-23 at 16:29 +0200, Peter Zijlstra wrote:
> > > 
> > > This.. as presented it is an absolutely unreviewable pile of junk.
> > > It presents code without any coherent problem description and
> > > analysis. And the patches are not split sanely either.
> > 
> > There is a more complete and slightly outdated description in the
> > previous version of the patch at 
> > https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.camel@marvell.com/
> 
> Not the point, you're mixing far too many things in one go. You also
> have the patches split like 'generic / arch-1 / arch-2' which is wrong
> by definition, as patches should be split per change and not care
> about silly boundaries.

This follows the original patch by Chris Metcalf. There is a reason for
that -- per-architecture changes are independent of each other and
affect not just code but functionality that was implemented per
architecture. To support more architectures, it will be necessary to do
it separately for each, and mark them supported with
HAVE_ARCH_TASK_ISOLATION. Having only some architectures supported does
not break anything for the rest -- architectures that are not covered
simply would not have this functionality.

> 
> Also, if you want generic entry code, there's patches for that here:
> 
>   
> https://lkml.kernel.org/r/20200722215954.464281930@linutronix.de
>  
> 
> 

That looks useful. Why didn't Thomas Gleixner mention it in his
criticism of my approach if he already solved that exact problem, at
least for x86?
Alex Belits July 23, 2020, 4:50 p.m. UTC | #10
On Thu, 2020-07-23 at 17:49 +0200, Peter Zijlstra wrote:
> 
> 'What does noinstr mean? and why do we have it?' -- don't dare touch
> the entry code until you can answer that.

noinstr disables instrumentation, so there would not be calls and
dependencies on other parts of the kernel when it's not yet safe to
call them. Relevant functions already have it, and I add an inline call
to perform flags update and synchronization. Unless something else is
involved, those operations are safe, so I am not adding anything that
can break those.
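
For reference, the pattern in question looks roughly like this (a simplified
sketch of the kernel's noinstr convention, not code from my series):

#include <linux/compiler.h>     /* noinstr, instrumentation_begin/_end;
                                 * later kernels moved the latter to
                                 * <linux/instrumentation.h> */

noinstr void example_entry_func(void)
{
        /* Only noinstr-safe code here: no tracing, no kprobes, nothing
         * that depends on entry state that is not yet established. */

        instrumentation_begin();
        /* Ordinary, instrumentable kernel code is allowed here. */
        instrumentation_end();
}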
Thomas Gleixner July 23, 2020, 9:31 p.m. UTC | #11
Alex,

Alex Belits <abelits@marvell.com> writes:
> On Thu, 2020-07-23 at 15:17 +0200, Thomas Gleixner wrote:
>> 
>> Without going into details of the individual patches, let me give you a
>> high level view of this series:
>> 
>>   1) Entry code handling:
>> 
>>      That's completely broken vs. the careful ordering and instrumentation
>>      protection of the entry code. You can't just slap stuff randomly
>>      into places which you think are safe w/o actually trying to understand
>>      why this code is ordered in the way it is.
>> 
>>      This clearly was never built and tested with any of the relevant
>>      debug options enabled. Both build and boot would have told you.
>
> This is intended to avoid a race condition when entry to or exit from isolation
> happens at the same time as an event that requires synchronization. The idea
> is that it is possible to insulate the core from all events while it is running
> an isolated task in userspace; it will receive those calls normally after
> breaking isolation and entering the kernel, and it will synchronize itself on
> kernel entry.

It does not matter what your intention is. Fact is that you disrupt a
carefully designed entry code sequence without even trying to point out
that you did so because you don't know how to do it better. There is a
big fat comment above enter_from_user_mode() which should have made you
ask at least. Peter and I spent months on getting this correct
vs. RCU, instrumentation, code patching and some more things.

From someone who tries to fiddle with such a sensitive area of code, it's
not too much to ask that he follow or read up on these changes instead
of just making uninformed choices of placement, declaring that this new
stuff is the most important thing on the planet, or at least that he
document why this is correct and does not violate any of the existing
constraints.

> This has two potential problems that I am trying to solve:
>
> 1. Without careful ordering, there will be a race condition with events that
> happen at the same time as kernel entry or exit.

Entry code is all about ordering. News at 11.

> 2. CPU runs some kernel code after entering but before synchronization. This
> code should be restricted to early entry that is not affected by the "stale"
> state, similar to how IPI code that receives synchronization events does it
> normally.

And because of that you define that you can place anything you need
_before_ functionality which is essential for establishing kernel state
correctly without providing the minimum proof that this does not violate
any of the existing constraints.

> reach them. This means that some point must be established on kernel entry
> when it is safe for the core to catch up with the rest of the kernel. It may be
> useful for other purposes; however, at this point task isolation is the first
> to need it, so I had to determine where such a point is for every supported
> architecture and method of kernel entry.

You decided that your feature has to run first. Where is the analysis
that this is safe and correct vs. the existing ordering constraints?

Why does this trigger build and runtime warnings? (I neither built nor
ran it, but with full debug enabled it will for sure).

> Nevertheless, I believe that the goal of finding those points and using
> them for synchronization is valid.

The goal does not justify the means. 

> If you can recommend me a better way for at least x86, I will be happy
> to follow your advice. I have tried to cover kernel entry in a generic
> way while making the changes least disruptive, and this is why it
> looks simple and spread over multiple places.

It does not look simple. It looks random, like the outcome of trial and
error. Oh, here it explodes, let's slap another instance into it.

> I also had to do the same for arm and arm64 (that I use for
> development), and for each architecture I had to produce sequences of
> entry points and function calls to determine the correct placement of
> task_isolation_enter() calls in them. It is not random, however it does
> reflect the complex nature of kernel entry code. I believe the RCU
> implementation faced somewhat similar requirements for calls on kernel
> entry; however, it is not completely unified, either.

But RCU has a well-defined design and requirement list, and people are
working on making the entry sequence generic and converting architectures
over to it. And no, we don't try to do 5 architectures at once. We did
x86 with an eye on others. It's not perfect and it never will be because
of hardware.

>>  2) Instruction synchronization
>>     Trying to do instruction synchronization delayed is a clear recipe
>>     for hard-to-diagnose failures. Just because it did not blow up in your
>>     face does not make it correct in any way. It's broken by design and
>>     violates _all_ rules of safe instruction patching and introduces a
>>     complete trainwreck in x86 NMI processing.
>
> The idea is that, just as synchronization events are handled by a regular IPI,
> we already use some code with the assumption that it is safe to enter in a
> "stale" state before synchronization. I have extended this to allow
> synchronization points on all kernel entry points.

The idea is clear, just where is the analysis that this is safe?

Just from quickly skimming the code it's clear that this has never been
done. Experimental development on the basis of 'does not explode' is not
a valid approach in the kernel whatever your goal is.

>>     If you really think that this is correct, then please have at least
>>     the courtesy to come up with a detailed and precise argumentation
>>     why this is a valid approach.
>>
>>     While writing that up you surely will find out why it is not.
>
> I had to document a sequence of calls for every entry point on three supported
> architectures, to determine the points for synchronization.

Why is that documentation not part of the patches in the form of
documentation or proper changelogs?

> It is possible that I have somehow missed something, however I don't
> see a better approach, save for establishing a kernel-wide
> infrastructure for this. And even if we did just that, it would be
> possible to implement this kind of synchronization point calls first,
> and convert them to something more generic later.

You're putting the cart before the horse.

You want delayed instruction patching synchronization. So the right
approach is to:

   1) Analyze the constraints of instruction patching on a given
      architecture.

   2) Implement a scheme for this architecture to handle delayed
      patching as a stand alone feature with well documented and fine
      grained patches and proper proof that none of the constraints is
      violated.

      Find good arguments why such a feature is generally useful and not
      only for your personal pet peeve.

Once you've done that, then you'll find out that there is no need for
magic task isolation hackery simply because it's already there.

Code patching is very much architecture specific and the constraints
vary due to the different hardware requirements. The idea of making this
generic is laudable, but naive at best. Once you have done #1 above on
two architectures you will know why.

>>   3) Debug calls
>> 
>>      Sprinkling debug calls around the codebase randomly is not going to
>>      happen. That's an unmaintainable mess.
>
> Those report isolation breaking causes, and are intended for application and
> system debugging.

I don't care what they do as that does not make them more palatable or
maintainable.

>> 
>>      Aside from that, none of these dmesg-based debug things is necessary.
>>      This can simply be monitored with tracing.
>
> I think it would be better to make all that information available to the
> userspace application; however, I have based this on the Chris Metcalf code,
> and gradually updated the mechanisms and interfaces. The original reporting
> of isolation breaking causes had far greater problems, so at first I wanted
> to have something that produces easily visible and correct reporting, and
> does not break things while doing so.

Why are you exposing other people to these horrors? I don't care what
you use in your development branch and I don't care what you share with
your friends, but if you want maintainers and reviewers to look at that
stuff then ensure that what you present:

  - Makes sense
  - Is properly implemented
  - Is properly documented
  - Is properly argued as to why this is the right approach.

'I need', 'I want', 'this does' are non-arguments to begin with.

>>   5) Signal on page fault
>> 
>>      Why is this a magic task isolation feature instead of making it
>>      something which can be used in general? There are other legit
>>      reasons why a task might want a notification about an unexpected
>>      (resolved) page fault.
>
> A page fault causes isolation breaking. When a task runs in isolated mode, it
> does so because it requires predictable timing, so causing page faults and
> expecting them to be handled along the way would defeat the purpose of
> isolation. So if a page fault did happen, it is important that the application
> receive a notification that isolation was broken, and then it may decide to do
> something about it, re-enter isolation, etc.

Did you actually read what I wrote? I very much understood what you are
trying to do and why. Otherwise I wouldn't have written the above.

>>   6) Coding style violations all over the place
>> 
>>      Using checkpatch.pl is mandatory
>> 
>>   7) Not Cc'ed maintainers
>> 
>>      While your Cc list is huge, you completely fail to Cc the relevant
>>      maintainers of various files and subsystems as requested in
>>      Documentation/process/*
>
> To be honest, I am not sure whom I have missed; I tried to include everyone
> from my previous attempt.

May I ask you to read, understand and follow the documentation I pointed
you to?

>>   8) Changelogs
>> 
>>      Most of the changelogs have something along the lines of:
>> 
>>      'task isolation does not want X, so do Y to make it not do X'
>> 
>>      without a single line of explanation why this approach was chosen
>>      and why it is correct under all circumstances and cannot have nasty
>>      side effects.
>
> This is the same as the previous version, except for the addition of kernel
> entry handling. As far as I can tell, the rest was discussed before, and not
> many questions remained except for the race condition on kernel entry.

How is that related to changelogs which are useless? 

> agree that kernel entry handling is a complex issue in itself, so I have
> included an explanation of entry point / function call sequences for each
> supported architecture.

Which explanations? Let's talk about 7/13, the x86 part:

> In prepare_exit_to_usermode(), run cleanup for tasks that exited from
> isolation and call task_isolation_start() for tasks that entered
> TIF_TASK_ISOLATION.
>
> In syscall_trace_enter(), add the necessary support for reporting
> syscalls for task-isolation processes.
>
> Add task_isolation_remote() calls for the kernel exception types
> that do not result in signals, namely non-signalling page faults.
>
> Add task_isolation_kernel_enter() calls to interrupt and syscall
> entry handlers.
>
> This mechanism relies on calls to functions that call
> task_isolation_kernel_enter() early after entry into kernel. Those
> functions are:
>
> enter_from_user_mode()
>  called from do_syscall_64(), do_int80_syscall_32(),
>  do_fast_syscall_32(), idtentry_enter_user(),
>  idtentry_enter_cond_rcu()
> idtentry_enter_cond_rcu()
>  called from non-raw IDT macros and other entry points
> idtentry_enter_user()
> nmi_enter()
> xen_call_function_interrupt()
> xen_call_function_single_interrupt()
> xen_irq_work_interrupt()

Can you point me to a single word of explanation in this blurb?

It's a list of things WHAT the patch does without a single word of WHY
and without a single word of WHY any of this would be correct.

> I have a longer call diagram that I used to track each particular
> function; it probably should be included as a separate document.

Call diagrams are completely useless. The people who have to review this
know how that works. They want real explanations:

     - Why is this the right approach
     - Why does this not violate constraints A, B, C
     - What are the potential side effects
     - ...

All of this is asked for in Documentation/process/* for a reason.

>>      It's not the job of the reviewers/maintainers to figure this out.
>> 
>> Please come up with a coherent design first and then address the
>> identified issues one by one in a way which is palatable and reviewable.
>> 
>> Throwing a big pile of completely undocumented 'works for me' mess over
>> the fence does not get you anywhere, not even to the point that people
>> are willing to review it in detail.
>
> There is a design, and it is the result of careful tracking of calls in the
> kernel source. It has multiple points where task_isolation_enter() is called
> for a reason similar to why RCU-related functions are called in multiple
> places.

Design based on call tracking? That must be some newfangled method of
design which was not taught when I was in school.

You can do analysis with call tracking, but not design. 

Comparing this to RCU is beyond hilarious. RCU has design and
requirements documented and every single instance of RCU state
establishment has been argued in the changelogs and is most of the time
(except for the obvious places) extensively commented.

> If someone can recommend a better way to introduce a kernel entry
> checkpoint for synchronization that did not exist before, I will be happy
> to hear it.

Start with a coherent explanation of:

  - What you are trying to achieve

  - Which problems did you observe in your analysis including the
    impact of the problem on your goal.

  - A per problem conceptual approach to solve it along with cleanly
    implemented and independent RFC code for each particular problem
    without tons of debug hacks and the vain attempts to make everything
    generic. There might be common parts of it, but as explained with
    code patching and #PF signals they can be completely independent of
    each other.

Thanks,

        tglx
Thomas Gleixner July 23, 2020, 9:44 p.m. UTC | #12
Alex Belits <abelits@marvell.com> writes:
> On Thu, 2020-07-23 at 17:49 +0200, Peter Zijlstra wrote:
>> 
>> 'What does noinstr mean? and why do we have it" -- don't dare touch
>> the
>> entry code until you can answer that.
>
> noinstr disables instrumentation, so there would not be calls and
> dependencies on other parts of the kernel when it's not yet safe to
> call them. Relevant functions already have it, and I add an inline call
> to perform flags update and synchronization. Unless something else is
> involved, those operations are safe, so I am not adding anything that
> can break those.

Sure.

 1) That inline function can be put out of line by the compiler and
    placed into the regular text section which makes it subject to
    instrumentation

 2) That inline function invokes local_irq_save() which is subject to
    instrumentation _before_ the entry state for the instrumentation
    mechanisms is established.

 3) That inline function invokes sync_core() before important state has
    been established, which is especially interesting in NMI like
    exceptions.
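
To make 1) concrete (an illustrative fragment, not code from your series):

/* Plain "inline" is only a hint: if the compiler emits this out of
 * line, it lands in regular .text, is visible to tracers and kprobes,
 * and breaks any noinstr caller. */
static inline void isolation_hook(void)
{
        /* flags update, sync_core(), ... */
}

/* Entry code therefore needs __always_inline plus a noinstr-safe body,
 * so that the code stays within its .noinstr.text caller: */
static __always_inline void isolation_hook_fixed(void)
{
        /* Must also avoid the instrumented local_irq_save() (point 2)
         * and must not sync_core() before entry state exists (point 3). */
}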

As you clearly documented why all of the above is safe and does not
cause any problems, it's just me and Peter being silly, right?

Try again.

Thanks,

        tglx
Alex Belits July 24, 2020, 3 a.m. UTC | #13
On Thu, 2020-07-23 at 23:44 +0200, Thomas Gleixner wrote:
> Alex Belits <abelits@marvell.com> writes:
> > On Thu, 2020-07-23 at 17:49 +0200, Peter Zijlstra wrote:
> > > 'What does noinstr mean? and why do we have it?' -- don't dare
> > > touch the entry code until you can answer that.
> > 
> > noinstr disables instrumentation, so there would not be calls and
> > dependencies on other parts of the kernel when it's not yet safe to
> > call them. Relevant functions already have it, and I add an inline
> > call to perform flags update and synchronization. Unless something
> > else is involved, those operations are safe, so I am not adding
> > anything that can break those.
> 
> Sure.
> 
>  1) That inline function can be put out of line by the compiler and
>     placed into the regular text section which makes it subject to
>     instrumentation
> 
>  2) That inline function invokes local_irq_save() which is subject to
>     instrumentation _before_ the entry state for the instrumentation
>     mechanisms is established.
> 
>  3) That inline function invokes sync_core() before important state
>     has been established, which is especially interesting in NMI like
>     exceptions.
> 
> As you clearly documented why all of the above is safe and does not
> cause any problems, it's just me and Peter being silly, right?
> 
> Try again.

I don't think accusations and mockery are really necessary here.

I am trying to do the right thing here. In particular, I am trying to
port code that was developed on platforms that have not yet
implemented those useful instrumentation safety features of the x86
arch support. For most of the development time I had to figure out
where the synchronization can be safely inserted into kernel entry
code on three platforms and tens of interrupt controller drivers, with
some of those presenting unusual exceptions (forgive me the pun) to
platform-wide conventions. I really appreciate the work you did
cleaning up kernel entry procedures; my 5.6 version of this patch had
to follow a much more complex and, I would say, convoluted entry
handling on x86, and now I don't have to do that, thanks to you.

Unfortunately, most of my mental effort recently had to be spent on
three things:

1. (small): finding a way to safely enable events and synchronize state
on kernel entry, so there will not be a race condition between
isolation-breaking kernel entry and an event that was disabled while
the task was isolated.

2. (big): trying to derive any useful rules applicable to kernel entry
on various architectures, finding that there is very little consistency
across architectures, and that whatever consistency exists can be broken
by interrupt controller drivers that don't all follow the same rules as
the rest of the platform.

3. (medium): introducing calls to synchronization in all kernel entry
procedures, at places where it is guaranteed that no calls have yet been
made to parts of the kernel that may be affected by "stale" state, and
doing it in a manner as consistent and generalized as possible.

The current state of kernel entry handling on arm and arm64
architectures has significant differences from x86 and from each other.
There is also the matter of interrupt controllers. As can be seen in the
interrupt controller-specific patch, I had to accommodate some variety
of custom interrupt entry code. What cannot be seen is that I had to
check all other interrupt controller drivers and architecture-specific
entry procedures, and found that they _do_ follow some understandable
rules -- unfortunately architecture-specific ones, not documented in
any manner.

I have no valid reasons for complaining about it. I could not expect
that authors of all kernel entry procedures would have any
foreknowledge that someone at some point may have a reason to establish
any kind of synchronization point for CPU cores. And this is why I had
to do my research by manually drawing call trees and sequences,
separately for every entry on every supported architecture, and across
two or three versions of the kernel, as those were changing along the way.

The result of this may be not a "design" per se, but an understanding
of how things are implemented, and what rules are being followed, so I
could add my code in a manner consistent with what is done, and
document the whole thing. Then there will be some written rules to
check for when anything of this kind becomes necessary again (say,
with TLB, but considering how much now is done in userspace, possibly
to accommodate more exotic CPU features that may have state messed up
by userspace). I am afraid this task, kernel entry documentation,
would take me some time, and I did not want to delay my task isolation
patch for this reason.

As I followed whatever rules I determined to be applicable, I have
produced code that introduces hooks in multiple places that seem
unrelated to each other. Whenever there was a, forgive me the pun,
exception to those rules, another special case had to be handled.

So no, I did not just add entry hooks randomly, and your accusations of
having no design are unwarranted. My patches reflect what is already in
code and in its design; I have added one simple rule: the entry hook
runs at the point where no dependency on anything that requires
synchronization exists yet. The entry hook is small; you have already
seen all of it while listing things that are not compatible with
noinstr. Its mechanism and purpose are explained in the general
description of task isolation. I don't think I can provide a better
explanation.

I have to apologize for not taking into account all your carefully
built instrumentation safety support. That was one thing I missed.
However, at this point the only way for me to find out that I was
wrong about it, and that my code does not comply with the expectations
set by the advanced state of x86 architecture development, was to
present whatever I could do right, based on experience with other
platforms. I don't think this warrants such hostility.

Another issue that you have asked me to defend is the existence and
scope of task isolation itself. I have provided a long explanation in
the changelog and previous discussions of this patch, and before me so
did Chris Metcalf and occasionally Yuri Norov. While I understand that
this is an unusual feature and by its nature it affects the kernel in
multiple places, it does not deserve to be called a "mess" and other
synonyms of "mess". It's an attempt to introduce a feature that turns
Linux userspace into a superior replacement for an RTOS. Considering
the current state of CPU and SoC development, it is becoming very
difficult even for a vendor-specific RTOS to keep up with advanced
hardware features. Linux keeps up with them just fine; however, it
lacks the ability to truly leave a CPU core alone, to run the
performance-critical and latency-critical part of a task in the manner
an RTOS user would expect. Very close, but not yet. Task isolation
provides the last step for this RTOS replacement. It is implemented in
a manner that allows the user to combine Linux resource handling and
initialization with RTOS isolation and latency. The decision about
page faults is a part of this design, as are many other decisions
implemented in this code. Many may disagree with either those
decisions or the validity of the goal; some may even argue that it's a
bad thing to provide a reason to stop RTOS development (I think this
is a good thing, but that's not the point).

However, most definitely this is not a "mess", and I do not believe
that I have to defend the validity of this direction of development, or
be accused of general incompetence every time someone finds a
frustrating mistake in my code. As I said, I am trying to do the right
thing, and I want to bring my code not only to the state where x86
support is on par with other platforms (that is, working when
instrumentation is disabled), but also to make it fully compliant with
the current requirements of the x86 platform.
Thomas Gleixner July 24, 2020, 4:08 p.m. UTC | #14
Alex,

Alex Belits <abelits@marvell.com> writes:
> On Thu, 2020-07-23 at 23:44 +0200, Thomas Gleixner wrote:
>>  1) That inline function can be put out of line by the compiler and
>>     placed into the regular text section which makes it subject to
>>     instrumentation
>> 
>>  2) That inline function invokes local_irq_save() which is subject to
>>     instrumentation _before_ the entry state for the instrumentation
>>     mechanisms is established.
>> 
>>  3) That inline function invokes sync_core() before important state
>>     has been established, which is especially interesting in NMI like
>>     exceptions.
>> 
>> As you clearly documented why all of the above is safe and does not
>> cause any problems, it's just me and Peter being silly, right?
>> 
>> Try again.
>
> I don't think accusations and mockery are really necessary here.

Let's get some context to this.

  I told you in my first mail that this breaks noinstr and that
  building with full debug would have told you.

  Peter gave you a clear hint where to look.

Now it might be expected that you investigate that or at least ask
questions before making the bold claim:

>> > Unless something else is involved, those operations are safe, so I
>> > am not adding anything that can break those.

Surely I could have avoided the snide remark, but after you demonstrably
ignored technically valid concerns and suggestions in your other reply,
I was surely not in the mood to be overly careful in my choice of words.

> The result of this may be not a "design" per se, but an understanding
> of how things are implemented, and what rules are being followed, so I
> could add my code in a manner consistent with what is done, and
> document the whole thing.

Every other big and technically complex project which has to change the
very inner workings of the kernel started the same way. I'm not aware of
any of them getting accepted as is or in a big code dump.

What you have now qualifies as a proof of concept, and the big challenge is
to turn it into something which is acceptable and maintainable.

You talk at great length about how inconsistent stuff is all over the
place. Yes, it is indeed. You even call that inconsistency an existing
design:

> My patches reflect what is already in code and in its design.

I agree that you just work with the code as is, but you might have
noticed that quite a bit of this stuff is clearly not designed at all or
designed badly.

The solution is not to pile on top of the inconsistency; the solution is
to make it consistent in the first place.

You are surely going to say that's beyond the scope of your project. I
can tell you that it is in the scope of your project, simply because just
proliferating the status quo and piling new stuff on top is not an
option. And no, there are no people waiting in line to mop up after
you either.

Quite a few of the big technology projects have spent, and still spend, a
considerable amount of time doing exactly this kind of consolidation
work upfront in order to make their features acceptable in a
maintainable form.

All of these projects have been merged or are still being merged
piecewise in reviewable chunks.

We are talking about intrusive technology which requires a very careful
integration to prevent it from becoming a roadblock or a maintenance
headache. The approach and implementation has to be _agreed_ on by the
involved parties, i.e. submitters, reviewers and maintainers.

> While I understand that this is an unusual feature and by its nature
> it affects the kernel in multiple places, it does not deserve to be called
> a "mess" and other synonyms of "mess".

The feature is perfectly fine and I completely understand why you want
it. Guess who started to lay the groundwork for NOHZ_FULL more than a
decade ago and why?

The implementation is not acceptable: on technical grounds, for
maintainability reasons, and for lack of design and proper consolidated
integration.

> Another issue that you have asked me to defend is the existence and
> scope of task isolation itself.

I have not asked you to defend the existence. I asked you for coherent
explanations of how the implementation works and why the chosen approach is
correct and valid. That's a completely different thing.

> It's an attempt to introduce a feature that turns Linux userspace into
> a superior replacement for an RTOS.....

Can you please spare me the advertising and marketing? I'm very well
aware of what an RTOS is, and I'm also very well aware that there is no such
thing as a 'superior replacement' for an RTOS in general.

If your view of RTOS is limited to this particular feature, then I have
to tell you that this particular feature is only useful for a very small
portion of the overall RTOS use cases.

> However, most definitely this is not a "mess", and I do not believe
> that I have to defend the validity of this direction of development, or
> be accused of general incompetence every time someone finds a
> frustrating mistake in my code.

Nobody accuses you of incompetence, but you will have to defend the
validity of your approach and implementation and accept that things
might not be as shiny as you think they are. That's not hostility,
that's just how Linux kernel development works whether you like it or
not. 

I surely can understand your frustration over my view of this series,
but you might have noticed that aside from the criticism I gave you very clear
technical arguments and suggestions on how to proceed.

It's your decision what you make of that.

Thanks,

        tglx