[2/3] x86_64: Define 128-bit memory-mapped I/O operations

Message ID 1345598601.2659.76.camel@bwh-desktop.uk.solarflarecom.com
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Ben Hutchings Aug. 22, 2012, 1:23 a.m. UTC
Define reado(), writeo() and their raw counterparts using SSE.

Based on work by Stuart Hodgson <smhodgson@solarflare.com>.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 arch/x86/Kconfig.cpu      |    4 +++
 arch/x86/include/asm/io.h |   14 +++++++++
 arch/x86/lib/Makefile     |    1 +
 arch/x86/lib/oword_io.c   |   65 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 84 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/lib/oword_io.c

Comments

H. Peter Anvin Aug. 22, 2012, 1:37 a.m. UTC | #1
On 08/21/2012 06:23 PM, Ben Hutchings wrote:
> Define reado(), writeo() and their raw counterparts using SSE.
> 
> Based on work by Stuart Hodgson <smhodgson@solarflare.com>.

It would be vastly better if we explicitly controlled this with
kernel_fpu_begin()/kernel_fpu_end() rather than hiding it in primitives
that might tempt the user to do very much the wrong thing.

Also, it needs to be extremely clear to the user that these operations
use the FPU, and all the requirements there need to be met, including
not using them at interrupt time.

	-hpa

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Hutchings Aug. 22, 2012, 2:04 a.m. UTC | #2
On Tue, 2012-08-21 at 18:37 -0700, H. Peter Anvin wrote:
> On 08/21/2012 06:23 PM, Ben Hutchings wrote:
> > Define reado(), writeo() and their raw counterparts using SSE.
> > 
> > Based on work by Stuart Hodgson <smhodgson@solarflare.com>.
> 
> It would be vastly better if we explicitly controlled this with
> kernel_fpu_begin()/kernel_fpu_end() rather than hiding it in primitives
> that might tempt the user to do very much the wrong thing.
> 
> Also, it needs to be extremely clear to the user that these operations
> use the FPU, and all the requirements there need to be met, including
> not using them at interrupt time.

Well, we can sometimes use the FPU state at IRQ time, can't we
(irq_fpu_usable())?   So we might need, say, try_reado() and
try_writeo() with callers expected to fall back to alternatives.  (Which
they must have anyway for any architecture that doesn't support this.)

Ben.
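The try_writeo() fallback pattern suggested above can be sketched as follows. This is a userspace illustration only: kernel_fpu_begin(), kernel_fpu_end() and irq_fpu_usable() are real x86 kernel APIs but are merely stubbed here, and the "MMIO" target is an ordinary buffer, so the SSE store itself is stood in for by two 64-bit stores.

```c
#include <assert.h>
#include <stdint.h>

static int fpu_owned;                                  /* toy FPU bookkeeping */
static int irq_fpu_usable(void) { return 1; }          /* stub for the real check */
static void kernel_fpu_begin(void) { fpu_owned = 1; }  /* stub */
static void kernel_fpu_end(void)   { fpu_owned = 0; }  /* stub */

/* Non-atomic fallback: two 64-bit stores, which callers need anyway
 * on architectures without a 128-bit path. */
static void writeo_split(const uint64_t val[2], volatile uint64_t *addr)
{
	addr[0] = val[0];
	addr[1] = val[1];
}

/* Returns 1 if the "SSE" path was taken; 0 means the caller must use
 * the fallback itself. */
static int try_writeo(const uint64_t val[2], volatile uint64_t *addr)
{
	if (!irq_fpu_usable())
		return 0;
	kernel_fpu_begin();
	writeo_split(val, addr);  /* placeholder for one movaps to MMIO */
	kernel_fpu_end();
	return 1;
}
```

A caller would do `if (!try_writeo(v, addr)) writeo_split(v, addr);`, keeping the FPU bracketing explicit at the call site as hpa asks.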
David Miller Aug. 22, 2012, 2:34 a.m. UTC | #3
From: Ben Hutchings <bhutchings@solarflare.com>
Date: Wed, 22 Aug 2012 03:04:11 +0100

> On Tue, 2012-08-21 at 18:37 -0700, H. Peter Anvin wrote:
>> On 08/21/2012 06:23 PM, Ben Hutchings wrote:
>> > Define reado(), writeo() and their raw counterparts using SSE.
>> > 
>> > Based on work by Stuart Hodgson <smhodgson@solarflare.com>.
>> 
>> It would be vastly better if we explicitly controlled this with
>> kernel_fpu_begin()/kernel_fpu_end() rather than hiding it in primitives
>> that might tempt the user to do very much the wrong thing.
>> 
>> Also, it needs to be extremely clear to the user that these operations
>> use the FPU, and all the requirements there need to be met, including
>> not using them at interrupt time.
> 
> Well we can sometimes use the FPU state at IRQ time, can't we
> (irq_fpu_usable())?   So we might need, say, try_reado() and
> try_writeo() with callers expected to fall back to alternatives.  (Which
> they must have anyway for any architecture that doesn't support this.)

I really hope we eventually get rid of this ridiculous restriction the
x86 code has.

It really needs a proper stack of FPU state saves like sparc64 has.

Half of the code and complexity in arch/x86/crypto/ would just
disappear, because most of it has to do with handling this obtuse
FPU usage restriction which shouldn't even be an issue in the first
place.

I continually see more and more code that has to check this
irq_fpu_usable() thing and carry ugly fallback code, which is a sign
that this really needs to be fixed properly.
H. Peter Anvin Aug. 22, 2012, 3:24 a.m. UTC | #4
On 08/21/2012 07:34 PM, David Miller wrote:
> 
> I really hope we eventually get rid of this ridiculous restriction the
> x86 code has.
> 
> It really needs a proper stack of FPU state saves like sparc64 has.
> 
> Half of the code and complexity in arch/x86/crypto/ would just
> disappear, because most of it has to do with handling this obtuse
> FPU usage restriction which shouldn't even be an issue in the first
> place.
> 
> I continually see more and more code that has to check this
> irq_fpu_usable() thing and carry ugly fallback code, which is a sign
> that this really needs to be fixed properly.
> 

[Cc: Linus, since he has had very strong opinions on this in the past.]

I'm all ears... tell me how sparc64 deals with this, maybe we can
implement something similar.  At the same time, do keep in mind that on
x86 this is not just a matter of the FPU state, but the entire "extended
state" which can be very large.

Given the cost of state save/enable, however, nothing is going to change
the need for kernel_fpu_begin/end, nor the fact that those things will
want to bracket large regions.

	-hpa
David Miller Aug. 22, 2012, 3:29 a.m. UTC | #5
From: "H. Peter Anvin" <hpa@zytor.com>
Date: Tue, 21 Aug 2012 20:24:18 -0700

> I'm all ears... tell me how sparc64 deals with this, maybe we can
> implement something similar.  At the same time, do keep in mind that on
> x86 this is not just a matter of the FPU state, but the entire "extended
> state" which can be very large.

Sparc's state is pretty huge too.  256 bytes worth of FPU registers,
plus a set of 64-bit control registers.

What we do is we have an FPU stack that grows up from the end of the
thread_info struct, towards the bottom of the kernel stack.

Slot 0 is always the user FPU state.

Slot 1 and further are kernel FPU state save areas.

We keep a counter which tracks how deeply nested we are in the
stack.

Not for space saving, but for overhead reduction, we can sometimes
get away with saving only half of the FPU registers.  The chip
provides a pair of dirty bits, one for the lower
half of the FPU register file and one for the upper half.  We only
save the bits that are actually dirty.

Furthermore, when we have FPU using code in the kernel that only uses
the lower half of the registers, we only save away that part of the
state around the routine.
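The sparc64 scheme described here can be modelled as a small userspace toy: a per-thread array of save slots with slot 0 reserved for user state, plus a depth counter. Sizes and names below are illustrative only, not the real sparc64 layout, and the "register file" is just an array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FPU_SLOTS 4
#define FPU_REGS  32                 /* sparc64 has 256 bytes of FP registers */

struct fpu_slot { uint64_t regs[FPU_REGS]; };

struct thread_fpu {
	struct fpu_slot slot[FPU_SLOTS]; /* grows up from thread_info */
	int depth;                       /* how deeply nested we are */
};

static uint64_t live_fpu[FPU_REGS];  /* stand-in for the register file */

/* Save the live state into the next free slot before kernel FPU use. */
static void kernel_fpu_push(struct thread_fpu *t)
{
	assert(t->depth + 1 < FPU_SLOTS);
	memcpy(t->slot[++t->depth].regs, live_fpu, sizeof(live_fpu));
}

/* Restore the state saved by the matching push. */
static void kernel_fpu_pop(struct thread_fpu *t)
{
	assert(t->depth > 0);
	memcpy(live_fpu, t->slot[t->depth--].regs, sizeof(live_fpu));
}
```

Each kernel FPU user pushes before touching registers and pops after, so nesting (including from interrupts) just deepens the stack.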
H. Peter Anvin Aug. 22, 2012, 3:49 a.m. UTC | #6
On 08/21/2012 08:29 PM, David Miller wrote:
> 
> What we do is we have an FPU stack that grows up from the end of the
> thread_info struct, towards the bottom of the kernel stack.
> 

We have 8K of kernel stack, and an xstate which is pushing a kilobyte
already.  This seems like a nightmare.  Even if we allocate a larger
stack for this sole purpose, we'd have to put a pretty hard cap on how
far it could grow.

> Slot 0 is always the user FPU state.
> 
> Slot 1 and further are kernel FPU state save areas.
> 
> We keep a counter which tracks how deeply nested we are in the
> stack.
> 
> Not for the purpose of space saving, but for overhead reduction we
> sometimes can get away with only saving away half of the FPU
> registers.  The chip provides a pair of dirty bits, one for the lower
> half of the FPU register file and one for the upper half.  We only
> save the bits that are actually dirty.
> 
> Furthermore, when we have FPU using code in the kernel that only uses
> the lower half of the registers, we only save away that part of the
> state around the routine.

This is messy on x86; it is somewhat doable but it gets really hairy
because of the monolithic [f]xsave/[f]xrstor instructions.

	-hpa



Linus Torvalds Aug. 22, 2012, 3:52 a.m. UTC | #7
On Tue, Aug 21, 2012 at 8:24 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
>> I continually see more and more code that has to check this
>> irq_fpu_usable() thing and carry ugly fallback code, which is a sign
>> that this really needs to be fixed properly.
>>
>
> [Cc: Linus, since he has had very strong opinions on this in the past.]

I have strong opinions, because I don't think you *can* do it any
other way on x86.

Saving FPU state on kernel entry IS NOT GOING TO HAPPEN. End of story.
It's too f*cking slow.

And if we don't save it, we cannot use the FP state in kernel mode
without explicitly checking, because the FPU state may not be there,
and touching FP registers can cause exceptions etc. In fact, under
many loads the exception is going to be the common case.

Keeping the FP state live all the time (and then perhaps just saving a
single MMX register for kernel use) and switching it synchronously on
every task switch sounds like a really really bad idea too. It
apparently works better on modern CPU's, but it sucked horribly on
older ones. The FPU state save is horribly expensive, to the point
that we saw it very clearly on benchmarks for signal handling and task
switching.

So the normal operation is going to be "touching the FPU causes a fault".

That said, we might *try* to make the kernel FPU use be cheaper. We
could do something like:

 - if we need to restore FPU state for user mode, don't do it
synchronously in kernel_fpu_end(), just set a work-flag, and do it
just once before returning to user mode

 - that makes kernel_fpu_end() a no-op, and while we can't get rid of
kernel_fpu_begin(), we could at least make it cheap to do many times
consecutively (ie we'd have to check "who owns FPU state" and perhaps
save it, but the save would have to be done only the first time).

I haven't seen the patch being discussed, or the rationale for it. But
I doubt it makes sense to do 128-bit MMIO and expect any kind of
atomicity things anyway, and I very much doubt that using SSE would
make all that much sense. What's the background, and why would you
want to do this crap?

              Linus
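The deferred-restore idea outlined here can be simulated in plain C. Everything below stands in for the real task/FPU bookkeeping; it only illustrates the intended control flow, namely one state save for the first kernel_fpu_begin() and a single restore before returning to user mode, no matter how many begin/end pairs ran.

```c
#include <assert.h>
#include <stdbool.h>

static int  user_fpu_state = 7;   /* pretend saved user state */
static int  live_fpu = 7;         /* pretend register file */
static bool kernel_owns_fpu;      /* "who owns FPU state" */
static bool restore_pending;      /* the deferred-work flag */
static int  saves_done;           /* counts real state saves */

static void kernel_fpu_begin(void)
{
	if (!kernel_owns_fpu) {       /* save only the first time */
		user_fpu_state = live_fpu;
		saves_done++;
		kernel_owns_fpu = true;
	}
}

static void kernel_fpu_end(void)
{
	restore_pending = true;       /* cheap: just flag the work */
}

/* Runs once on the way back to user mode. */
static void return_to_user(void)
{
	if (restore_pending) {
		live_fpu = user_fpu_state;  /* one restore for N begin/end pairs */
		kernel_owns_fpu = false;
		restore_pending = false;
	}
}
```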
H. Peter Anvin Aug. 22, 2012, 3:59 a.m. UTC | #8
On 08/21/2012 08:52 PM, Linus Torvalds wrote:
> 
> That said, we might *try* to make the kernel FPU use be cheaper. We
> could do something like:
> 
>  - if we need to restore FPU state for user mode, don't do it
> synchronously in kernel_fpu_end(), just set a work-flag, and do it
> just once before returning to user mode
> 
>  - that makes kernel_fpu_end() a no-op, and while we can't get rid of
> kernel_fpu_begin(), we could at least make it cheap to do many times
> consecutively (ie we'd have to check "who owns FPU state" and perhaps
> save it, but the save would have to be done only the first time).
> 

The zeroth-order thing we should do is to make it nest; there are cases
where we do multiple kernel_fpu_begin/_end in sequence for no good reason.

kernel_fpu_end() would still have to re-enable preemption (and
preemption would have to check the work flag), but that should be cheap.

We could allow the FPU in the kernel to have preemption, if we allocated
space for two xstates per thread instead of one.  That is, however, a
fair hunk of memory.

	-hpa

David Miller Aug. 22, 2012, 4:14 a.m. UTC | #9
From: "H. Peter Anvin" <hpa@zytor.com>
Date: Tue, 21 Aug 2012 20:59:26 -0700

> kernel_fpu_end() would still have to re-enable preemption (and
> preemption would have to check the work flag), but that should be cheap.
> 
> We could allow the FPU in the kernel to have preemption, if we allocated
> space for two xstates per thread instead of one.  That is, however, a
> fair hunk of memory.

Once you have done the first FPU save for the sake of the kernel, you
can minimize what you save for any deeper nesting because the kernel
only cares about a very limited part of that FPU state, not the whole
1K thing.

Those bits you can save by hand with a bunch of explicit stores of the
XMM registers, or something like that.
Linus Torvalds Aug. 22, 2012, 4:35 a.m. UTC | #10
On Tue, Aug 21, 2012 at 8:59 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> kernel_fpu_end() would still have to re-enable preemption (and
> preemption would have to check the work flag), but that should be cheap.

No, done right, you don't even have to disable preemption. Sure, you
might disable preemption inside of kernel_fpu_begin() itself (around
the test for "do we need to save FPU state and set the flag to restore
it at return-to-user-mode time"), but I think that you should be able
to run the code afterwards with preemption enabled.

Why? Because you've saved the FPU state, and you've set the per-thread
flag saying "I will restore at return to user mode". And that, coupled
with teaching the scheduler about this case ("don't set TS, only save
xmm0 in a special kernel-xmm0-save-area") means that the thing can
preempt fine.

The nesting case would be about just saving (again) that xmm0 register
in any nested kernel_fpu_begin/end pair. But that doesn't need
preemption, that just needs local storage for the kernel_fpu_begin/end
(and it can do the xmm0 save/restore with regular "mov" instructions,
because the kernel owns the FPU at that point, since it's nested).

So it's doable. One question is how many registers to save for kernel
use. It might be that we'd need to make that a "actual FPU user needs
to save/restore the registers it uses", and not do the saving in
kernel_fpu_begin()/end() at all.

My biggest reason to question this all is that I don't think it's
worth it. Why would we ever care to do all this in the first place?
There's no really sane use for it.

Judging from the subject line, some crazy person wants to do 128-bit
MMIO. That's just crap. Nobody sane cares. Do it non-atomically with
regular accesses. If it's high-performance IO, it uses DMA. If it
doesn't use DMA and depends on CPU MMIO accesses, WHO THE F*CK CARES?
Nobody.

And if some crazy device depends on 128-bit atomic writes, the device
is shit. Forget about it. Tell people to stop digging themselves
deeper down a hole it's simply not worth going.

                Linus
Linus Torvalds Aug. 22, 2012, 4:42 a.m. UTC | #11
On Tue, Aug 21, 2012 at 8:52 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I haven't seen the patch being discussed, or the rationale for it. But
> I doubt it makes sense to do 128-bit MMIO and expect any kind of
> atomicity things anyway, and I very much doubt that using SSE would
> make all that much sense. What's the background, and why would you
> want to do this crap?

Btw, for the 64-bit case, we did have ordering issues, and maybe some
128-bit model has similar ordering issues. Fine. You can't rely on
128-bit atomic accesses anyway on 99% of all hardware - either the CPU
itself cannot do it, or it's too damn inconvenient with XMM only
registers, or the bus itself is limited to 64 bits at a time anyway,
so the CPU or the IO interface would split such an access *anyway*.

So the whole concept of "we rely on atomic 128-bit MMIO accesses"
seems terminally broken. Any driver that thinks it needs that is just
crazy.

And those issues have nothing to do with x86 kernel_fpu_begin/end()
what-so-ever.

So judging by that, I would say that some driver writer needs to take
a few pills, clear up their head, and take another look at their life.
Tell them to look at

    include/asm-generic/io-64-nonatomic-hi-lo.h

(and *-lo-hi.h) instead, and ponder the wisdom of just doing it that
way. Tell them to go f*ck themselves if they think they need XMM
registers. They are wrong, for one reason or another.

                 Linus
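The io-64-nonatomic headers pointed at above define writeq() for targets without a native 64-bit write as two 32-bit writes in a fixed order. Transposed to the hypothetical 128-bit case, the lo-hi variant would look roughly like this; the u128_t type is invented for illustration and the "MMIO" target here is just a buffer.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 128-bit value; not the kernel's u128. */
typedef struct { uint64_t lo, hi; } u128_t;

/* Non-atomic 128-bit write: low half first, as in the *-lo-hi.h
 * convention.  Which half a device wants written first is exactly
 * the ordering question the hi-lo/lo-hi header pair exists for. */
static void writeo_lo_hi(u128_t val, volatile uint64_t *addr)
{
	addr[0] = val.lo;
	addr[1] = val.hi;
}
```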
David Miller Aug. 22, 2012, 5 a.m. UTC | #12
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 21 Aug 2012 21:35:22 -0700

> My biggest reason to question this all is that I don't think it's
> worth it. Why would we ever care to do all this in the first place?
> There's no really sane use for it.

All the x86 crypto code hits this case all the time, easiest example
is doing a dm-crypt on a block device when an IPSEC packet arrives.

The crypto code has all of this special code and layering that is
there purely so it can fall back to the slow non-optimized version
of the crypto operation when it hits this can't-nest-fpu-saving
situation.
Ben Hutchings Aug. 22, 2012, 1:26 p.m. UTC | #13
On Tue, 2012-08-21 at 20:52 -0700, Linus Torvalds wrote:
[...]
> I haven't seen the patch being discussed, or the rationale for it. But
> I doubt it makes sense to do 128-bit MMIO and expect any kind of
> atomicity things anyway, and I very much doubt that using SSE would
> make all that much sense. What's the background, and why would you
> want to do this crap?

It's <1345598804.2659.78.camel@bwh-desktop.uk.solarflarecom.com>.
When updating a TX DMA ring pointer in sfc, we can push the first new
descriptor at the same time, so that for a linear packet only one DMA
read is then required before the packet goes out on the wire.  Currently
this requires 2-4 MMIO writes (each a separate PCIe transaction),
depending on the host word size.  There is a measurable reduction in
latency if we can reduce it to 1 PCIe transaction.  (But as previously
discussed, write-combining isn't suitable for this.)

Ben.
Linus Torvalds Aug. 22, 2012, 2:06 p.m. UTC | #14
On Tue, Aug 21, 2012 at 10:00 PM, David Miller <davem@davemloft.net> wrote:
>
> All the x86 crypto code hits this case all the time, easiest example
> is doing a dm-crypt on a block device when an IPSEC packet arrives.
>
> The crypto code has all of this special code and layering that is
> there purely so it can fall back to the slow non-optimized version
> of the crypto operation when it hits this can't-nest-fpu-saving
> situation.

Right. And it needs to be there. The current interface is fine and correct.

We can maybe optimize the current model (as outlined earlier), but no,
there's no way in hell we're doing lazy saving of FPU state from
interrupts etc. So all the "check if I can use FPU state at all", and
the explicit kernel_fpu_begin/end() interfaces are absolutely the
right model.

I realize that the people who write that code think that *their* code
is the most important thing in the world, and everything else should
revolve around them, and we should make everything else slower just to
make them happy. But they are wrong.

Deal with it, or do not. You can do the crypto entirely in software. I
think the current model is absolutely the right one, exactly because
it *allows* you to use the FPU for crypto, but it doesn't force the
rest of the kernel to make sure the FPU is always available to you.

I think your complaining about the interface is entirely bogus, and
comes from not appreciating that other areas and uses have other
concerns.

What I would suggest is trying to do profiling, and seeing if the
"let's try to save only once, and restore only when returning to user
space" approach helps. But that's an *implementation* change, not an
interface change. The interface is doing the right thing, the
implementation is just not perhaps optimal.

                  Linus
Linus Torvalds Aug. 22, 2012, 2:20 p.m. UTC | #15
On Wed, Aug 22, 2012 at 6:26 AM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
>
> It's <1345598804.2659.78.camel@bwh-desktop.uk.solarflarecom.com>.

That doesn't help me in the least. I don't *have* that email. It was
never sent to me. I don't know what list it went on, and googling the
ID doesn't get me anything. So sending me the mail ID is kind of
pointless.

So I still don't see the actual patch. But everything I heard about it
indirectly makes me go "That's just crazy".

> When updating a TX DMA ring pointer in sfc, we can push the first new
> descriptor at the same time, so that for a linear packet only one DMA
> read is then required before the packet goes out on the wire.  Currently
> this requires 2-4 MMIO writes (each a separate PCIe transactions),
> depending on the host word size.  There is a measurable reduction in
> latency if we can reduce it to 1 PCIe transaction.  (But as previously
> discussed, write-combining isn't suitable for this.)

Quite frankly, this isn't even *remotely* changing my mind about our
FPU model. I'm like "ok, some random driver is trying to be clever,
and it's almost certainly getting things entirely wrong and doing
things that only work on certain machines".

If you really think it's a big deal, do it in *your* driver, and make
it do the whole irq_fpu_usable() check together with
kernel_fpu_begin/end(). And make it very much x86-specific and with a
fallback to non-atomic accesses.

Any patch that exports some "atomic 128-bit MMIO writes" for general
use sounds totally and utterly broken. You can't do it. It's
*fundamentally* an operation that many CPU's cannot even do. 64-bit
buses (or even 32-bit ones) will make the 128-bit write be split up
*anyway*, regardless of any 128-bit register issues.

And nobody else sane cares about this, so it shouldn't even be a
x86-64 specific thing. It should be a driver hack, since that's what
it is. We don't want to support crazy stuff like this in general, that
is not just architecture-, but microarchitecture-specific.

I think you'd be crazy to even want to do this in the first place, but
if you do, keep it internal to your driver, and don't expose the crazy
to anybody else.

                 Linus
David Laight Aug. 22, 2012, 2:42 p.m. UTC | #16
> Any patch that exports some "atomic 128-bit MMIO writes" for general
> use sounds totally and utterly broken. You can't do it. It's
> *fundamentally* an operation that many CPU's cannot even do. 64-bit
> buses (or even 32-bit ones) will make the 128-bit write be split up
> *anyway*, regardless of any 128-bit register issues.
> 
> And nobody else sane cares about this, so it shouldn't even be a
> x86-64 specific thing...

There are several other processors that can generate long
PCIe transfers, sometimes by using a DMA engine associated
with the PCIe interface (eg freescale 83xx ppc).

Given the slow speed of PCIe transactions (think ISA bus
speeds - at least with some targets), it is important to
be able to request multi-word transfers in a generic way
on systems that can support them.

This support is a property of the PCIe interface block,
not that of the driver that wishes to use the function.

I don't know if XMM register transfers generate 16-byte TLPs
on any x86 CPUs - they might on some.

Perhaps claiming the function is atomic is the real
problem - otherwise a single TLP could be used on
systems (and in contexts) where it is possible, and
a slower multi-TLP transfer (possibly without
guaranteeing the transfer order) done where it is not.

	David



David Miller Aug. 22, 2012, 9:14 p.m. UTC | #17
From: David Miller <davem@davemloft.net>
Date: Tue, 21 Aug 2012 21:14:27 -0700 (PDT)

> From: "H. Peter Anvin" <hpa@zytor.com>
> Date: Tue, 21 Aug 2012 20:59:26 -0700
> 
>> kernel_fpu_end() would still have to re-enable preemption (and
>> preemption would have to check the work flag), but that should be cheap.
>> 
>> We could allow the FPU in the kernel to have preemption, if we allocated
>> space for two xstates per thread instead of one.  That is, however, a
>> fair hunk of memory.
> 
> Once you have done the first FPU save for the sake of the kernel, you
> can minimize what you save for any deeper nesting because the kernel
> only cares about a very limited part of that FPU state not the whole
> 1K thing.
> 
> Those bits you can save by hand with a bunch of explicit stores of the
> XMM registers, or something like that.

BTW, just to clarify, I'm not saying that we should save the FPU on
every trap where we find the FPU enabled or anything stupid like that.

Definitely keep the kern_fpu_begin()/kern_fpu_end() type markers
around FPU usage, but allow some kind of nesting facility.

Here's one idea.  Anyone using the existing kern_fpu_*() markers gets
the existing behavior.  Only one level of kernel FPU usage is allowed.

But a new interface allows specification of a state-save mask.  And it
is only users of this interface for which we allow nesting past the
first FPU user.

If this is the first kernel FPU user, we always do the full fxsave or
whatever to push out the full state.  For any level of kernel FPU
nesting we save only what is in the save-mask, by hand.
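That save-mask interface could be prototyped along these lines. This is a userspace simulation: the register file and the "full fxsave" are plain arrays, and all names are invented for illustration. The first user always gets a full save; nested users save, by hand, only the registers named in their mask.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NREGS 16                     /* think XMM0..XMM15 */
static uint64_t fpu[NREGS];          /* pretend register file */
static uint64_t full_save[NREGS];    /* slot for the full "fxsave" */
static int      nesting;

struct masked_save { uint32_t mask; uint64_t regs[NREGS]; };

static void fpu_begin_mask(struct masked_save *s, uint32_t mask)
{
	if (nesting++ == 0) {
		memcpy(full_save, fpu, sizeof(fpu));  /* first user: full save */
		s->mask = 0;                          /* mask irrelevant here */
		return;
	}
	s->mask = mask;                           /* nested: save by hand */
	for (int i = 0; i < NREGS; i++)
		if (mask & (1u << i))
			s->regs[i] = fpu[i];
}

static void fpu_end_mask(const struct masked_save *s)
{
	assert(nesting > 0);
	if (--nesting == 0) {
		memcpy(fpu, full_save, sizeof(fpu));  /* full restore */
		return;
	}
	for (int i = 0; i < NREGS; i++)
		if (s->mask & (1u << i))
			fpu[i] = s->regs[i];
}
```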
Linus Torvalds Aug. 22, 2012, 9:28 p.m. UTC | #18
On Wed, Aug 22, 2012 at 2:14 PM, David Miller <davem@davemloft.net> wrote:
>
> BTW, just to clarify, I'm not saying that we should save the FPU on
> every trap where we find the FPU enabled or anything stupid like that.
>
> Definitely keep the kern_fpu_begin()/kern_fpu_end() type markers
> around FPU usage, but allow some kind of nesting facility.

So nesting shouldn't be horrible, but the thing that really screws
with people like the crypto use is not nesting, but the fact that
sometimes you can't save at all, and the whole "kernel_fpu_possible()"
or whatever we call the checking function.

IOW, in [soft]irq context, to avoid races with the irq happening as
the process is going to do something with the FPU state, we don't
allow saving and changing state, because that would mean that the
normal FP state paths would have to be irq-safe, and they aren't.

And once you have to have that fpu possible check, if it happens to
also disallow nested use, I doubt that's going to really affect
anybody. The code has to take the case of "I'm not allowed to change
FPU state" case into account regardless.

                 Linus
David Miller Aug. 22, 2012, 9:38 p.m. UTC | #19
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 22 Aug 2012 14:28:50 -0700

> On Wed, Aug 22, 2012 at 2:14 PM, David Miller <davem@davemloft.net> wrote:
>>
>> BTW, just to clarify, I'm not saying that we should save the FPU on
>> every trap where we find the FPU enabled or anything stupid like that.
>>
>> Definitely keep the kern_fpu_begin()/kern_fpu_end() type markers
>> around FPU usage, but allow some kind of nesting facility.
> 
> So nesting shouldn't be horrible, but the thing that really screws
> with people like the crypto use is not nesting, but the fact that
> sometimes you can't save at all, and the whole "kernel_fpu_possible()"
> or whatever we call the checking function.
> 
> IOW, in [soft]irq context, to avoid races with the irq happening as
> the process is going to do something with the FPU state, we don't
> allow saving and changing state, because that would mean that the
> normal FP state paths would have to be irq-safe, and they aren't.
> 
> And once you have to have that fpu possible check, if it happens to
> also disallow nested use, I doubt that's going to really affect
> anybody. The code has to take the case of "I'm not allowed to change
> FPU state" case into account regardless.

I don't think you really have to do anything special to handle
interrupts properly.

Let's assume that we use some variable length save area at the end of
thread_info to do this nested saving.

When you are asked for FPU usage, you first figure out how much you're
going to save.

Then you advance the allocation pointer in the thread_info, and save
into the space you allocated.

If an interrupt wants to use the FPU, that should be fine as well.
Whether the interrupt FPU save does its save after you did, or
before, it should work out fine.

I suppose you might have some issues in determining whether we need to
do the full fxsave stuff or not.  There could be a state bit for that,
or similar.

Another idea, instead of doing this in thread_info, is to do it on the
local stack.  That way if we're in an interrupt, we'll use that
interrupt type's kernel stack.

You might be able to get away with always doing the full FPU
save/restore in that situation.
Patch

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 706e12e..802508e 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -372,6 +372,10 @@  config X86_USE_3DNOW
 	def_bool y
 	depends on (MCYRIXIII || MK7 || MGEODE_LX) && !UML
 
+config X86_USE_SSE
+	def_bool y
+	depends on X86_64
+
 config X86_OOSTORE
 	def_bool y
 	depends on (MWINCHIP3D || MWINCHIPC6) && MTRR
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d8e8eef..06b3e23 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -95,6 +95,20 @@  build_mmio_write(writeq, "q", unsigned long, "r", :"memory")
 
 #endif
 
+#ifdef CONFIG_X86_USE_SSE
+
+u128 reado(const volatile void __iomem *addr);
+void writeo(u128 val, volatile void __iomem *addr);
+
+#define __raw_reado(addr) reado(addr)
+#define __raw_writeo(val, addr)	writeo(val, addr)
+
+/* Let people know that we have them */
+#define reado			reado
+#define writeo			writeo
+
+#endif
+
 /**
  *	virt_to_phys	-	map virtual addresses to physical
  *	@address: address to remap
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index b00f678..1791198 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -25,6 +25,7 @@  lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o
 
 obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-$(CONFIG_X86_USE_SSE) += oword_io.o
 
 ifeq ($(CONFIG_X86_32),y)
         obj-y += atomic64_32.o
diff --git a/arch/x86/lib/oword_io.c b/arch/x86/lib/oword_io.c
new file mode 100644
index 0000000..8189bf3
--- /dev/null
+++ b/arch/x86/lib/oword_io.c
@@ -0,0 +1,65 @@ 
+/****************************************************************************
+ * 128-bit MMIO for x86
+ * Copyright 2012 Solarflare Communications Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation, incorporated herein by reference.
+ */
+
+#include <linux/export.h>
+#include <linux/preempt.h>
+#include <linux/types.h>
+#include <asm/io.h>
+
+/*
+ * We copy the data between a memory buffer and the MMIO address via
+ * register xmm0.  We have to save and restore the SSE state, and
+ * disable preemption.  Only the MMIO address is required to be
+ * 128-bit aligned, since the stack generally is not.
+ */
+
+u128 reado(const volatile void __iomem *addr)
+{
+	u128 ret;
+	u64 cr0;
+	u128 xmm0;
+
+	preempt_disable();
+	asm volatile (
+		"movq %%cr0,%0\n\t"
+		"clts\n\t"
+		"movups %%xmm0,%1\n\t"
+		"movaps %3,%%xmm0\n\t"
+		"movups %%xmm0,%2\n\t"
+		"sfence\n\t"
+		"movups %1,%%xmm0\n\t"
+		"movq %0,%%cr0\n\t"
+		: "=r"(cr0), "=m"(xmm0), "=m"(ret)
+		: "m"(*(const volatile u128 __iomem *)addr));
+	preempt_enable();
+
+	return ret;
+}
+EXPORT_SYMBOL(reado);
+
+void writeo(u128 val, volatile void __iomem *addr)
+{
+	u64 cr0;
+	u128 xmm0;
+
+	preempt_disable();
+	asm volatile (
+		"movq %%cr0,%0\n\t"
+		"clts\n\t"
+		"movups %%xmm0,%1\n\t"
+		"movups %3,%%xmm0\n\t"
+		"movaps %%xmm0,%2\n\t"
+		"sfence\n\t"
+		"movups %1,%%xmm0\n\t"
+		"movq %0,%%cr0\n\t"
+		: "=r"(cr0), "=m"(xmm0), "=m"(*(volatile u128 __iomem *)addr)
+		: "m"(val));
+	preempt_enable();
+}
+EXPORT_SYMBOL(writeo);