diff mbox

[tpmdd-devel,RFC,3/4] Implement driver for supporting multiple emulated TPMs

Message ID 1452787318-29610-4-git-send-email-stefanb@us.ibm.com
State New
Headers show

Commit Message

Stefan Berger Jan. 14, 2016, 4:01 p.m. UTC
From: Stefan Berger <stefanb@linux.vnet.ibm.com>

This patch implements a driver for supporting multiple emulated TPMs in a
system. An emulated TPM can for example be used by a Linux container or
later on by namespace-enabled IMA running inside a container.

The driver implements a device /dev/vtpmx that is only used to created
device pairs /dev/vtpmcX (e.g., /dev/vtpmc10) and /dev/vtpmsY
(e.g., /dev/vtpms11) using ioctls. The device /dev/vtpmcX is the usual TPM
device created by the core TPM driver. Applications or kernel subsystems
can send TPM commands to it and the corresponding /dev/vtpmY receives these
commands and delivers them to an emulated TPM. The device /dev/vtpmcX can
be moved into a container and appear there as /dev/tpm0.

Life-cycle management is implemented by ioctls of /dev/vtpmx for creating and
destroying them. The ioctl for creating a device pair returns the names of the
created devices so that applications know which devices to use. Other ioctls
help to find the corresponding 'other' device given the device name of one.

Signed-off-by; Stefan Berger <stefanb@linux.vnet.ibm.com>
---
 drivers/char/tpm/Kconfig    |  10 +
 drivers/char/tpm/Makefile   |   1 +
 drivers/char/tpm/tpm-vtpm.c | 855 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/char/tpm/tpm-vtpm.h |  58 +++
 include/uapi/linux/Kbuild   |   1 +
 include/uapi/linux/vtpm.h   |  52 +++
 6 files changed, 977 insertions(+)
 create mode 100644 drivers/char/tpm/tpm-vtpm.c
 create mode 100644 drivers/char/tpm/tpm-vtpm.h
 create mode 100644 include/uapi/linux/vtpm.h

Comments

Jason Gunthorpe Jan. 19, 2016, 11:51 p.m. UTC | #1
On Thu, Jan 14, 2016 at 11:01:57AM -0500, Stefan Berger wrote:
> +	pdev = platform_device_register_simple("tpm_vtpm", vtpm_dev->dev_num,
> +					       NULL, 0);

This seems strange, something like this should not be creating
platform_devices.

tpm's cannot be attached to platform_devices outside their probe
functions, I just fixed that bug in tpm_tis.

Attach the tpm to a device on the vtpm_class? Do it through a callback
that takes care of devm? IIRC there is one but the name escapes me now
:)

> +	vtpm_dev->chip = tpmm_chip_alloc_pattern(vtpm_dev->pdev,
> +				&vtpm_tpm_ops,
> +				VTPM_DEV_PREFIX_CLIENT"%d");

I agree with Jarkko I don't see why these deserve special names..

Presumably some namespace magic can be used to make them show up as
tpm0 in a container?

> +	vtpm_dev->chip->flags |= TPM_CHIP_FLAG_NO_SYSFS | TPM_CHIP_FLAG_NO_LOG;

Why no sysfs files?

> +	switch (ioctl) {
> +	case VTPM_NEW_DEV:
> +		if (!capable(CAP_SYS_ADMIN))
> +			return -EPERM;
> +		vtpm_new_pair_p = argp;
> +		if (copy_from_user(&vtpm_new_pair, vtpm_new_pair_p,
> +				   sizeof(vtpm_new_pair)))
> +			return -EFAULT;
> +		vtpm_dev = vtpm_create_device_pair(&vtpm_new_pair);
> +		if (IS_ERR(vtpm_dev))
> +			return PTR_ERR(vtpm_dev);
> +		if (copy_to_user(vtpm_new_pair_p, &vtpm_new_pair,
> +				 sizeof(vtpm_new_pair)))
> +			return -EFAULT;

Another way to handle this is to return the FD for the server side of
the new TPM here and then route accesses from the /dev/tpmX side to
that FD. Avoid the extra char device. There is some kernel
infrasturcture that helpd do this (anon fd or something, IIRC)

For this usage that might make more sense than trying to copy the ptmx
scheme, which is like that for other reasons..

> +	case VTPM_DEL_DEV:
> +		if (!capable(CAP_SYS_ADMIN))
> +			return -EPERM;

Then del becomes close() on the special fd

> +	case VTPM_GET_VTPMDEV:

And this wouldn't be needed

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Stefan Berger Jan. 20, 2016, 2:39 p.m. UTC | #2
Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote on 01/19/2016 
06:51:07 PM:

> 
> On Thu, Jan 14, 2016 at 11:01:57AM -0500, Stefan Berger wrote:
> > +   pdev = platform_device_register_simple("tpm_vtpm", 
vtpm_dev->dev_num,
> > +                      NULL, 0);
> 
> This seems strange, something like this should not be creating
> platform_devices.

Should it be a misc_device then ?

> 
> tpm's cannot be attached to platform_devices outside their probe
> functions, I just fixed that bug in tpm_tis.
> 
> Attach the tpm to a device on the vtpm_class? Do it through a callback
> that takes care of devm? IIRC there is one but the name escapes me now
> :)
> 
> > +   vtpm_dev->chip = tpmm_chip_alloc_pattern(vtpm_dev->pdev,
> > +            &vtpm_tpm_ops,
> > +            VTPM_DEV_PREFIX_CLIENT"%d");
> 
> I agree with Jarkko I don't see why these deserve special names..

I did that because it becomes confusing when there are many /dev/tpm%d 
type of devices. This way we can distinguish between a hardware /dev/tpm%d 
and all these types of devices.

> 
> Presumably some namespace magic can be used to make them show up as
> tpm0 in a container?

The magic is to have the container management stack create the device 
pair. From the ioctl it learns the name of the devices that were created 
and it then finds out about the major/minor number of the created device 
and have /dev/tpm0 with that major/minor created in the container's /dev 
directory.

> 
> > +   vtpm_dev->chip->flags |= TPM_CHIP_FLAG_NO_SYSFS | 
TPM_CHIP_FLAG_NO_LOG;
> 
> Why no sysfs files?

I have changed this now to the following

+       if (vtpm_dev->flags & VTPM_FLAG_NO_SYSFS)
+               vtpm_dev->chip->flags |= TPM_CHIP_FLAG_NO_SYSFS;
+
+       if (vtpm_dev->flags & VTPM_FLAG_NO_LOG)
+               vtpm_dev->chip->flags |= TPM_CHIP_FLAG_NO_LOG;
+

https://github.com/stefanberger/linux/commit/5dd68fef872f10261521c91c23b1d0dfbdf69996#diff-0d6b7dbf28fbfdd13909c2e1b5a6d7a3R541


The issue with sysfs is that sysfs is showing directory entries for all 
vTPMs in all containers. If we don't want users to see the PCRs of other 
containers we set this flag. The side effect of setting this flag is that 
the container will not be able to see its own TPM device, either, but 
presumably the user has commands to reads PCRs etc.

Regarding VTPM_FLAG_NO_LOG: Container's boot/start without firmware that 
would write an ACPI log. So in most cases one will choose the 
VTPM_FLAG_NO_LOG flag since there is no log. Maybe that's a flag the user 
shouldn't have control over.

> 
> > +   switch (ioctl) {
> > +   case VTPM_NEW_DEV:
> > +      if (!capable(CAP_SYS_ADMIN))
> > +         return -EPERM;
> > +      vtpm_new_pair_p = argp;
> > +      if (copy_from_user(&vtpm_new_pair, vtpm_new_pair_p,
> > +               sizeof(vtpm_new_pair)))
> > +         return -EFAULT;
> > +      vtpm_dev = vtpm_create_device_pair(&vtpm_new_pair);
> > +      if (IS_ERR(vtpm_dev))
> > +         return PTR_ERR(vtpm_dev);
> > +      if (copy_to_user(vtpm_new_pair_p, &vtpm_new_pair,
> > +             sizeof(vtpm_new_pair)))
> > +         return -EFAULT;
> 
> Another way to handle this is to return the FD for the server side of
> the new TPM here and then route accesses from the /dev/tpmX side to
> that FD. Avoid the extra char device. There is some kernel
> infrasturcture that helpd do this (anon fd or something, IIRC)

Hm, I would prefer to keep the server-side char device since that way one 
isn't tied to fd passing or so and the TPM could terminate and be 
restarted later on on top of that server side device while keeping the 
client side device usable beyond the server side shutting down.

> 
> For this usage that might make more sense than trying to copy the ptmx
> scheme, which is like that for other reasons..

:-) Yes, ptmx is where I took this from.

> 
> > +   case VTPM_DEL_DEV:
> > +      if (!capable(CAP_SYS_ADMIN))
> > +         return -EPERM;
> 
> Then del becomes close() on the special fd

And with that the client side device would *have* to close as well because 
the backend now is inaccessible?

> 
> > +   case VTPM_GET_VTPMDEV:
> 
> And this wouldn't be needed

Assuming we want to keep server side for longer time, this ioctl and 
VTPM_GET_TPMDEV help to determine the other side of device that found in 
/dev.

Thanks for looking at the patches. I would prefer to keep the two 
character devices rather than the server side only being represented with 
an fd.

   Stefan

> 
> Jason
>
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jason Gunthorpe Jan. 21, 2016, 1:17 a.m. UTC | #3
On Wed, Jan 20, 2016 at 09:39:09AM -0500, Stefan Berger wrote:
>    Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote on 01/19/2016
>    06:51:07 PM:
>    >
>    > On Thu, Jan 14, 2016 at 11:01:57AM -0500, Stefan Berger wrote:
>    > > +   pdev = platform_device_register_simple("tpm_vtpm",  vtpm_dev->dev_num,
>    > > +                      NULL, 0);
>    >
>    > This seems strange, something like this should not be creating
>    > platform_devices.
>    Should it be a misc_device then ?

No. Check what other virtual devices are doing..

Generally speaking, the tpm core should not insist on a parent device
the way it does. That is an artifact of the clean up I did of the
drivers that was never fully compelted. To make the driver migration
easier the TPM core is using devm in places it probably shouldn't, but
unwinding that requires more difficult all-driver changes.

If you get stuck here you will have to fix the core code.

>    > > +   vtpm_dev->chip = tpmm_chip_alloc_pattern(vtpm_dev->pdev,
>    > > +            &vtpm_tpm_ops,
>    > > +            VTPM_DEV_PREFIX_CLIENT"%d");
>    >
>    > I agree with Jarkko I don't see why these deserve special names..
>    I did that because it becomes confusing when there are many /dev/tpm%d
>    type of devices. This way we can distinguish between a hardware
>    /dev/tpm%d and all these types of devices.

That isn't really consistent with how linux kernel naming works these
days. Use a udev rule in userspace if you really want this.

>    > Presumably some namespace magic can be used to make them show up as
>    > tpm0 in a container?
>    The magic is to have the container management stack create the device
>    pair. From the ioctl it learns the name of the devices that were
>    created and it then finds out about the major/minor number of the
>    created device and have /dev/tpm0 with that major/minor created in the
>    container's /dev directory.

Except that isn't good enough - the IMA kernel side doesn't know that this
tpm is now acting as the 'main' 'default' TPM.

>    The issue with sysfs is that sysfs is showing directory entries for all
>    vTPMs in all containers.

That is not a problem to solve here. The sysfs will also show entires
for hardware tpms too. You need somekind of namespace related solution
that works for all tpms.

I don't know what became of it, but there used to be 'device
namespaces' patch series that delt with this kind of problem.

Again, I suspect strongly this will need some kind of TPM/IMA
namespace to actually do what you want.

>    If we don't want users to see the PCRs of other containers we set
>    this flag. The side effect of setting this flag is that the
>    container will not be able to see its own TPM device, either, but
>    presumably the user has commands to reads PCRs etc.

Erm, sysfs is just an obvious symptom, the container could also just
create the dev node and access any tpm it likes, or use some future
(?) IMA API with a tpm device number.

>    VTPM_FLAG_NO_LOG: Container's boot/start without firmware that
>    would write an ACPI log. So in most cases one will choose the
>    VTPM_FLAG_NO_LOG flag since there is no log. Maybe that's a flag
>    the user shouldn't have control over.

? But there is no ACPI handle so this should already be naturally
defeated by the core code. Why a flag?

>    > Another way to handle this is to return the FD for the server side of
>    > the new TPM here and then route accesses from the /dev/tpmX side to
>    > that FD. Avoid the extra char device. There is some kernel
>    > infrasturcture that helpd do this (anon fd or something, IIRC)

>    Hm, I would prefer to keep the server-side char device since that way
>    one isn't tied to fd passing or so and the TPM could terminate and be
>    restarted later on on top of that server side device while keeping the
>    client side device usable beyond the server side shutting down.

Don't be afraid of FD passing.

vTPM daemon restart isn't really feasible anyhow as TPMs are
stateful. The kernel is not prepared to survive a HW TPM reboot.  If
that problem were to be solved somehow then a scheme like systemd's
FDSTORE=1 to keep the fd alive during a daemon crash could be used.

The huge downside to using a master side dev node is that these things
will leak. Killing the vtpm daemon will not clean up the slave
interface and another command and sysadmin interaction will be needed
to sort out the inevitable mess.

>    > Then del becomes close() on the special fd
>    And with that the client side device would *have* to close as well
>    because the backend now is inaccessible?

The client side would see failures on all TPM commands it tries to

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Stefan Berger Jan. 21, 2016, 3:01 a.m. UTC | #4
Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote on 01/20/2016 
08:17:01 PM:

> 
> On Wed, Jan 20, 2016 at 09:39:09AM -0500, Stefan Berger wrote:
> >    Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote on 
01/19/2016
> >    06:51:07 PM:
> >    >
> >    > On Thu, Jan 14, 2016 at 11:01:57AM -0500, Stefan Berger wrote:
> >    > > +   pdev = platform_device_register_simple("tpm_vtpm", 
> vtpm_dev->dev_num,
> >    > > +                      NULL, 0);
> >    >
> >    > This seems strange, something like this should not be creating
> >    > platform_devices.
> >    Should it be a misc_device then ?
> 
> No. Check what other virtual devices are doing..

register_chrdev maybe?

> 
> Generally speaking, the tpm core should not insist on a parent device
> the way it does. That is an artifact of the clean up I did of the
> drivers that was never fully compelted. To make the driver migration
> easier the TPM core is using devm in places it probably shouldn't, but
> unwinding that requires more difficult all-driver changes.
> 
> If you get stuck here you will have to fix the core code.
> 
> >    > > +   vtpm_dev->chip = tpmm_chip_alloc_pattern(vtpm_dev->pdev,
> >    > > +            &vtpm_tpm_ops,
> >    > > +            VTPM_DEV_PREFIX_CLIENT"%d");
> >    >
> >    > I agree with Jarkko I don't see why these deserve special names..
> >    I did that because it becomes confusing when there are many 
/dev/tpm%d
> >    type of devices. This way we can distinguish between a hardware
> >    /dev/tpm%d and all these types of devices.
> 
> That isn't really consistent with how linux kernel naming works these
> days. Use a udev rule in userspace if you really want this.

Ok, then I will just try to call them /dev/tpm%d.

> 
> >    > Presumably some namespace magic can be used to make them show up 
as
> >    > tpm0 in a container?
> >    The magic is to have the container management stack create the 
device
> >    pair. From the ioctl it learns the name of the devices that were
> >    created and it then finds out about the major/minor number of the
> >    created device and have /dev/tpm0 with that major/minor created in 
the
> >    container's /dev directory.
> 
> Except that isn't good enough - the IMA kernel side doesn't know that 
this
> tpm is now acting as the 'main' 'default' TPM.

Hooking the vTPM to IMA requires another patch that I haven't shown since 
IMA namespacing isn't public yet. Basically we implement another ioctl() 
that is to be called before the clone() in order to 'reserve' a vtpm 
device pair for the calling process. During the clone() call IMA 
namespacing code can query the vTPM driver for a 'reserved' device pair. 
Hooking IMA up after the clone() may also work but in case of 
docker/golang it's better to do this before since the language libraries 
do a lot after the clone automatically.

> 
> >    The issue with sysfs is that sysfs is showing directory entries for 
all
> >    vTPMs in all containers.
> 
> That is not a problem to solve here. The sysfs will also show entires
> for hardware tpms too. You need somekind of namespace related solution
> that works for all tpms.
> 
> I don't know what became of it, but there used to be 'device
> namespaces' patch series that delt with this kind of problem.

Would be good to have now since I for me the best way is to work around 
the problem entirely -- not registering with sysfs.

> 
> Again, I suspect strongly this will need some kind of TPM/IMA
> namespace to actually do what you want.

So here things aren't so clear to me, either. Sysfs should really only 
show the devices that are relevant to a container, but it seems to show a 
lot more than that, serial bus for example, i8042 etc. , that don't 
necessarily have relevance inside the container. So if there's a hardware 
TPM on the system, it probably shouldn't show the sysfs entries in the 
container. If we add a vTPM for the container, it would ideally show up as 
tpm0 inside the container's sysfs entries, but it doesn't do that unless 
this is the first device created and there's no hardware TPM. So while all 
of this isn't possible with sysfs, I made the choice to hide the device 
from sysfs entirely. If hw tpm0 shows, fine, we can report BIOS 
measurements then. It seems like device subsystem namespacing would be 
needed to realize part of this so that multiple tpm0s could exist, one in 
each device namespace.


> 
> >    If we don't want users to see the PCRs of other containers we set
> >    this flag. The side effect of setting this flag is that the
> >    container will not be able to see its own TPM device, either, but
> >    presumably the user has commands to reads PCRs etc.
> 
> Erm, sysfs is just an obvious symptom, the container could also just
> create the dev node and access any tpm it likes, or use some future
> (?) IMA API with a tpm device number.

No, the device cgroup prevents that. Only the major/minor number of its 
dedicated TPM (client) chardev can be accessed.



> 
> >    VTPM_FLAG_NO_LOG: Container's boot/start without firmware that
> >    would write an ACPI log. So in most cases one will choose the
> >    VTPM_FLAG_NO_LOG flag since there is no log. Maybe that's a flag
> >    the user shouldn't have control over.
> 
> ? But there is no ACPI handle so this should already be naturally
> defeated by the core code. Why a flag?
> 
> >    > Another way to handle this is to return the FD for the server 
side of
> >    > the new TPM here and then route accesses from the /dev/tpmX side 
to
> >    > that FD. Avoid the extra char device. There is some kernel
> >    > infrasturcture that helpd do this (anon fd or something, IIRC)
> 
> >    Hm, I would prefer to keep the server-side char device since that 
way
> >    one isn't tied to fd passing or so and the TPM could terminate and 
be
> >    restarted later on on top of that server side device while keeping 
the
> >    client side device usable beyond the server side shutting down.
> 
> Don't be afraid of FD passing.
> 
> vTPM daemon restart isn't really feasible anyhow as TPMs are
> stateful. The kernel is not prepared to survive a HW TPM reboot.  If
> that problem were to be solved somehow then a scheme like systemd's
> FDSTORE=1 to keep the fd alive during a daemon crash could be used.
> 
> The huge downside to using a master side dev node is that these things
> will leak. Killing the vtpm daemon will not clean up the slave
> interface and another command and sysadmin interaction will be needed
> to sort out the inevitable mess.

This is a scenario that works, though.

The server side opened /dev/vtpms1. It closed it and the clients side 
disappeared. Both devices /dev/vtpms1 and /dev/vtpmc1 were removed. The 
client side then shows the following:

# ls -l /proc/self/fd
total 0
lrwx------. 1 root root 64 Jan 20 21:34 0 -> /dev/pts/1
lrwx------. 1 root root 64 Jan 20 21:34 1 -> /dev/pts/1
lrwx------. 1 root root 64 Jan 20 21:34 100 -> /dev/vtpmc1 (deleted)
lrwx------. 1 root root 64 Jan 20 21:34 2 -> /dev/pts/1
lr-x------. 1 root root 64 Jan 20 21:34 3 -> /proc/23266/fd

# echo -en '\x00\xc1\x00\x00\x00\x0a\x00\x00\x00\x00' >&100
-bash: echo: write error: Input/output error


Once the client did a 'exec 100>&-', so closed the fd, the /dev/vtpmc0 
could actually reappear.

There's a 'keepalive' flag in this vtpm driver that keeps the device pair 
around even if the server side shuts down. If that had been set, 
/dev/vtpmc1 wouldn't have been deleted upon server side closure.

  Stefan

> 
> >    > Then del becomes close() on the special fd
> >    And with that the client side device would *have* to close as well
> >    because the backend now is inaccessible?
> 
> The client side would see failures on all TPM commands it tries to
> 
> Jason
>
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jason Gunthorpe Jan. 21, 2016, 3:21 a.m. UTC | #5
On Wed, Jan 20, 2016 at 10:01:38PM -0500, Stefan Berger wrote:
> > On Wed, Jan 20, 2016 at 09:39:09AM -0500, Stefan Berger wrote:
> > >    Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote on 01/19/2016
> > >    06:51:07 PM:
> > >    >
> > >    > On Thu, Jan 14, 2016 at 11:01:57AM -0500, Stefan Berger wrote:
> > >    > > +   pdev = platform_device_register_simple("tpm_vtpm",  
> > vtpm_dev->dev_num,
> > >    > > +                      NULL, 0);
> > >    >
> > >    > This seems strange, something like this should not be creating
> > >    > platform_devices.
> > >    Should it be a misc_device then ?
> >
> > No. Check what other virtual devices are doing..
> 
> register_chrdev maybe?

?? that doesn't replace a platform_device

> > Except that isn't good enough - the IMA kernel side doesn't know that this
> > tpm is now acting as the 'main' 'default' TPM.
> 
> Hooking the vTPM to IMA requires another patch that I haven't shown since IMA
> namespacing isn't public yet. Basically we implement another ioctl() that is to
> be called before the clone() in order to 'reserve' a vtpm device pair for the
> calling process. During the clone() call IMA namespacing code can query the
> vTPM driver for a 'reserved' device pair. Hooking IMA up after the clone() may
> also work but in case of docker/golang it's better to do this before since the
> language libraries do a lot after the clone automatically.

That sounds very complicated, wouldn't you just specify the TPM index to use
in the IMA namespace when it is created?

> So here things aren't so clear to me, either. Sysfs should really only show the
> devices that are relevant to a container, but it seems to show a lot
> more than

I think what you are missing is that nobody uses mainline containers
for the kind of strong isolation you are thinking about. Out-of-tree
patches are used by those people and, as I understand it, they cover
all these issues.

So, in mainline, the correct thing to do is nothing, and realize that
people who would care about pcr isolation between containers won't run
mainline anyhow. Work with the out-of-tree people to make sure things
work properly until they get the other bits in mainline.

At least that is my impression of the state of affairs, someone more
involved may know better.

> > The huge downside to using a master side dev node is that these things
> > will leak. Killing the vtpm daemon will not clean up the slave
> > interface and another command and sysadmin interaction will be needed
> > to sort out the inevitable mess.
> 
> This is a scenario that works, though.

So you already implemented the same semantics as I talked about, a clean
up on close?

Then just return the fd like I said.

auto-delete a master char dev on close is a very strange API, don't do
that.

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Stefan Berger Jan. 21, 2016, 3:56 a.m. UTC | #6
Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote on 01/20/2016 
10:21:15 PM:

> 
> On Wed, Jan 20, 2016 at 10:01:38PM -0500, Stefan Berger wrote:
> > > On Wed, Jan 20, 2016 at 09:39:09AM -0500, Stefan Berger wrote:
> > > >    Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote on 
01/19/2016
> > > >    06:51:07 PM:
> > > >    >
> > > >    > On Thu, Jan 14, 2016 at 11:01:57AM -0500, Stefan Berger 
wrote:
> > > >    > > +   pdev = platform_device_register_simple("tpm_vtpm", 
> > > vtpm_dev->dev_num,
> > > >    > > +                      NULL, 0);
> > > >    >
> > > >    > This seems strange, something like this should not be 
creating
> > > >    > platform_devices.
> > > >    Should it be a misc_device then ?
> > >
> > > No. Check what other virtual devices are doing..
> > 
> > register_chrdev maybe?
> 
> ?? that doesn't replace a platform_device
> 
> > > Except that isn't good enough - the IMA kernel side doesn't knowthat 
this
> > > tpm is now acting as the 'main' 'default' TPM.
> > 
> > Hooking the vTPM to IMA requires another patch that I haven't 
> shown since IMA
> > namespacing isn't public yet. Basically we implement another ioctl
> () that is to
> > be called before the clone() in order to 'reserve' a vtpm device 
> pair for the
> > calling process. During the clone() call IMA namespacing code can 
query the
> > vTPM driver for a 'reserved' device pair. Hooking IMA up after the
> clone() may
> > also work but in case of docker/golang it's better to do this 
> before since the
> > language libraries do a lot after the clone automatically.
> 
> That sounds very complicated, wouldn't you just specify the TPM index to 
use
> in the IMA namespace when it is created?

The IMA namespace is created as part of clone(). You cannot pass anything 
via clone(). So you either have to do it before or immediately after. If 
after is too later for whatever reason, you have to do it before.

> 
> > So here things aren't so clear to me, either. Sysfs should really 
> only show the
> > devices that are relevant to a container, but it seems to show a lot
> > more than
> 
> I think what you are missing is that nobody uses mainline containers
> for the kind of strong isolation you are thinking about. Out-of-tree
> patches are used by those people and, as I understand it, they cover
> all these issues.

?? Out-of-tree patches?

> 
> So, in mainline, the correct thing to do is nothing, and realize that
> people who would care about pcr isolation between containers won't run
> mainline anyhow. Work with the out-of-tree people to make sure things
> work properly until they get the other bits in mainline.

Now one just needs to find those poeple.
Basically you suggest to ignore the potential leaking between containers. 
Just register with sysfs ?

> 
> At least that is my impression of the state of affairs, someone more
> involved may know better.
> 
> > > The huge downside to using a master side dev node is that these 
things
> > > will leak. Killing the vtpm daemon will not clean up the slave
> > > interface and another command and sysadmin interaction will be 
needed
> > > to sort out the inevitable mess.
> > 
> > This is a scenario that works, though.
> 
> So you already implemented the same semantics as I talked about, a clean
> up on close?

It's part of the existing patch, yes. A flag gives a choice to keep it 
around, or if not set and the server close()s, the pair disappears.

> 
> Then just return the fd like I said.

Any driver that can be used as an example ?

> 
> auto-delete a master char dev on close is a very strange API, don't do
> that.

What I called cleanup can be trigger by the vTPM closing /dev/vtpms%d, so 
the server-side. What is the master for you? /dev/vtpmx where we run the 
ioctls on?

  Stefan

> 
> Jason
>
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jason Gunthorpe Jan. 21, 2016, 5:42 p.m. UTC | #7
On Wed, Jan 20, 2016 at 10:56:21PM -0500, Stefan Berger wrote:
> The IMA namespace is created as part of clone(). You cannot pass anything via
> clone(). So you either have to do it before or immediately after. If after is
> too later for whatever reason, you have to do it before.

Follow the pattern of something like the net ns.

clone creates an 'empty' IMA namespace. After clone the child waits
for the parent to complete its work.

The parent creates a new vtpm /dev/tpm1 in it's namespace

The parent does a kernel call to make tpm1 visiable in the child's
IMA namespace as tpm0. (eg look at the 'ip link set .. netns ..' kind
of scheme for inspiration)

Hook the tpm core's id to struct tpm_chip code into the IMA namespace
so that /dev/tpm0 and all other places actually access tpm1 when used
in the child's IMA namespace

The parent signals the child to continue to run. The child has a
/dev/tpm0 that is actually routed to /dev/tpm1. The child is prevented
from accessing other tpms.

> > I think what you are missing is that nobody uses mainline containers
> > for the kind of strong isolation you are thinking about. Out-of-tree
> > patches are used by those people and, as I understand it, they cover
> > all these issues.
> 
> ?? Out-of-tree patches?

Sorry, I don't recall the names of al lthe involved parties. lxr, and
there is a notable company whose name eludes me right now. I had a
very interesting discussion on this with James Bottomley at a
conference once. Maybe look up device namespaces and look at the
threads/etc on that for a clue what the status is?

> Basically you suggest to ignore the potential leaking between containers. Just
> register with sysfs ?

Ignore in the sense that mainline doesn't have the tools to address
the issue. If it is important to you then build out general capability
in mainline, don't hack it like this :)

Another approach would be to try and copy what netns does and link tpm
visibility through a IMA namespace (as described above) directly to
sysfs visibility - I have never looked at how that is done though :|

> > Then just return the fd like I said.
> 
> Any driver that can be used as an example ?

The stuff in include/linux/anon_inodes.h is used to do this, eg
a sequence of anon_inode_getfile, get_unused_fd_flags, fd_install
will create a file descriptor in the calling process associated with a
struct file_operations

> > auto-delete a master char dev on close is a very strange API, don't do
> > that.
> 
> What I called cleanup can be trigger by the vTPM closing /dev/vtpms%d, so the
> server-side. What is the master for you? /dev/vtpmx where we run the ioctls on?

Sorry, master/slave is the ptx nomenclature for this scheme. Master
would be your server side I think. 

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Stefan Berger Jan. 21, 2016, 7:02 p.m. UTC | #8
Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote on 01/21/2016 
12:42:43 PM:

> 
> On Wed, Jan 20, 2016 at 10:56:21PM -0500, Stefan Berger wrote:
> > The IMA namespace is created as part of clone(). You cannot pass 
> anything via
> > clone(). So you either have to do it before or immediately after. 
> If after is
> > too later for whatever reason, you have to do it before.
> 
> Follow the pattern of something like the net ns.
> 
> clone creates an 'empty' IMA namespace. After clone the child waits
> for the parent to complete its work.
> 
> The parent creates a new vtpm /dev/tpm1 in it's namespace
> 
> The parent does a kernel call to make tpm1 visiable in the child's
> IMA namespace as tpm0. (eg look at the 'ip link set .. netns ..' kind
> of scheme for inspiration)

What is IMA namespace in relation to a device's name? The method is to 
read the major/minor numbers on the host and created /dev/tpm0 with the 
same major/minor numbers in the container's filesystem. The name doesn't 
matter I guess, but major/minor are important.

> 
> Hook the tpm core's id to struct tpm_chip code into the IMA namespace
> so that /dev/tpm0 and all other places actually access tpm1 when used
> in the child's IMA namespace
> 
> The parent signals the child to continue to run. The child has a
> /dev/tpm0 that is actually routed to /dev/tpm1. The child is prevented
> from accessing other tpms.


The problem I have run into in particular with Docker and golang is that 
Docker invokes the golang function to run an external program. The golang 
function does a clone(), a whole lot of other stuff after it, and in the 
end the execve().

The code is here:

https://golang.org/src/syscall/exec_linux.go

Look at the function forkAndExecInChild on line 56++.

The problem with that is that the execve() will trigger IMA measurements. 
IMA will refuse being hooked up with a vTPM driver if it couldn't put its 
first measurement(s) into a vTPM. I don't think we should queue PCR 
extensions until a device may eventually become available. So, the 
conclusion is, to accomodate golang (for example) we can create the device 
pair, sit the vTPM on top of the master, and reserve the device pair befor 
the next clone() so that IMA finds it and can hook up to it. 

What is wrong with this scheme? The ioctl for 'reservation' before the 
clone()?


> 
> 
> > > Then just return the fd like I said.
> > 
> > Any driver that can be used as an example ?
> 
> The stuff in include/linux/anon_inodes.h is used to do this, eg
> a sequence of anon_inode_getfile, get_unused_fd_flags, fd_install
> will create a file descriptor in the calling process associated with a
> struct file_operations
> 

So let me show how things work currently.

fd = open("/dev/vtpmx", ...);
ioctl(fd, 'create device pair; TPM will be a TPM 2 version device')
/* conveying that the device is a TPM 2 or 1.2 is important so that the 
kernel sends the right commands to the device and possibly also for sysfs 
handling */

/* at this point /dev/vtpmc0 and /dev/vtpms0 have been created */
clientfd = open("/dev/vtpmc0",...)
serverfd = open("/dev/vtpms0",...);

...

close(serverfd); /* /dev/vtpmc0 and /dev/vtpms0 disappeared */




Should it work like this?

serverfd = open("/dev/vtpmx", ...);
/* at this point /dev/vtpmc0 has been created and serverfd is used for the 
server side */

ioctl(fd, 'TPM will be a TPM 2 version device') /* setting flags on the 
chip->flags 'late' should be possible */

...

close(serverfd); /* /dev/vtpmc0 disappeared */


Regards,
   Stefan


> > > auto-delete a master char dev on close is a very strange API, don't 
do
> > > that.
> > 
> > What I called cleanup can be trigger by the vTPM closing /dev/
> vtpms%d, so the
> > server-side. What is the master for you? /dev/vtpmx where we run 
> the ioctls on?
> 
> Sorry, master/slave is the ptx nomenclature for this scheme. Master
> would be your server side I think. 
> 
> Jason
>
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jason Gunthorpe Jan. 21, 2016, 7:30 p.m. UTC | #9
On Thu, Jan 21, 2016 at 02:02:17PM -0500, Stefan Berger wrote:
>    What is IMA namespace in relation to a device's name? The method is to
>    read the major/minor numbers on the host and created /dev/tpm0 with the
>    same major/minor numbers in the container's filesystem. The name
>    doesn't matter I guess, but major/minor are important.

Ostensibly we number the /dev/tpmX's in relation to the tpm index
number.

Internally to the kernel the TPM access is done by that tpm index.

Today, IMA hard codes that index value to 0 (IIRC). I could see a
future IMA allowing user space to specify the index. The index is also
how to associate the /dev/tpm node with the /sysfs files.

So the index is important, we'd want to control it for namespaces.

In any case the tpm index is part of the contract, and it would be
ideal if the IMA namespace made tpm index 0 be the right vtpm.

>    The problem I have run into in particular with Docker and golang is
>    that Docker invokes the golang function to run an external program. The
>    golang function does a clone(), a whole lot of other stuff after it,
>    and in the end the execve().

Well, ultimately that is a docker problem, as you describe IMA has a
special new requirement where the IMA NS has to be setup quickly.

>    The code is here:
>    [1]https://golang.org/src/syscall/exec_linux.go
>    Look at the function forkAndExecInChildon line 56++.

Well, that is just bad API design, sorry. The unix model of fork()/exec()
is that the app gets a chance to adjust the environment between
fork/exec, and this design, while easy to use, locks the app into the
hard wired customization that forkAndExecInChild does.

Maybe add a callback to SysProcAttr or something? Can't help you here.

You can't let this influence the kernel UAPI design.

>    available. So, the conclusion is, to accomodate golang (for example) we
>    can create the device pair, sit the vTPM on top of the master, and
>    reserve the device pair befor the next clone() so that IMA finds it and
>    can hook up to it.
>    What is wrong with this scheme? The ioctl for 'reservation' before the
>    clone()?

Yes, how does that sort of thing even make sense in a complex
multi-threaded world?

>    Should it work like this?

Sort of like this:

    controlfd = open("/dev/vtpmx", ...);
    ioctl(controlfd, CREATE_VTPM, &inargs, &outargs);
    serverfd = outargs.fd;
    // /dev/tpmX exists. X is returned in outargs.tpm_index, maybe  return major/minor too

    child = clone(...)
    ioctl(??? , ASSIGN_VTPM_TO_NS, .. child->ima_ns .., to index = 0,
          from index = outargs.tpm_index);

    /* tpm index X is destroyed, kernel prevents reuse of index X
       until the NS is destroyed too. /dev/tpmX is removed by udev */
    close(severfd);

So, you'd probably make a vtpm daemon that took as execv args a
reference to the IMA namespace to create the tpm in, and have docker
launch it after the clone, but before the exec in the parent
namespace.

That is fairly similar to how net ns works, with the wrinkle you have
to do this before the exec, I guess.

It also allows hw tpms to be routed to the ns.

The docker container just has the normal /dev/tpm0 

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Stefan Berger Jan. 21, 2016, 9:51 p.m. UTC | #10
Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote on 01/21/2016 
02:30:49 PM:

> 
> On Thu, Jan 21, 2016 at 02:02:17PM -0500, Stefan Berger wrote:
> >    What is IMA namespace in relation to a device's name? The method is 
to
> >    read the major/minor numbers on the host and created /dev/tpm0 with 
the
> >    same major/minor numbers in the container's filesystem. The name
> >    doesn't matter I guess, but major/minor are important.
> 
> Ostensibly we number the /dev/tpmX's in relation to the tpm index
> number.
> 
> Internally to the kernel the TPM access is done by that tpm index.
> 
> Today, IMA hard codes that index value to 0 (IIRC). I could see a
> future IMA allowing user space to specify the index. The index is also
> how to associate the /dev/tpm node with the /sysfs files.
> 
> So the index is important, we'd want to control it for namespaces.
> 
> In any case the tpm index is part of the contract, and it would be
> ideal if the IMA namespace made tpm index 0 be the right vtpm.

Got it.

> 
> >    The problem I have run into in particular with Docker and golang is
> >    that Docker invokes the golang function to run an external program. 
The
> >    golang function does a clone(), a whole lot of other stuff after 
it,
> >    and in the end the execve().
> 
> Well, ultimately that is a docker problem, as you describe IMA has a
> special new requirement where the IMA NS has to be setup quickly.
> 
> >    The code is here:
> >    [1]https://golang.org/src/syscall/exec_linux.go
> >    Look at the function forkAndExecInChildon line 56++.
> 
> Well, that is just bad API design, sorry. The unix model of 
fork()/exec()
> is that the app gets a chance to adjust the environment between
> fork/exec, and this design, while easy to use, locks the app into the
> hard wired customization that forkAndExecInChild does.

>From a callers' perspective it's attractive to have it do all kinds of 
stuff IF one doesn't have to go in between. Unfortunately we have to go in 
between.

> 
> Maybe add a callback to SysProcAttr or something? Can't help you here.
> 
> You can't let this influence the kernel UAPI design.

The choice is between getting this working 'today' (even if just locally) 
or discussing this with golang designers, which in the ideal case would 
cause me waiting for the next version, dealing with that version 
dependency etc., plus the delay. So, clearly, an additional ioctl() and 
~50 lines of code make this work 'now'. Doesn't this seem worth it?

> 
> >    available. So, the conclusion is, to accomodate golang (for 
example) we
> >    can create the device pair, sit the vTPM on top of the master, and
> >    reserve the device pair befor the next clone() so that IMA finds it 
and
> >    can hook up to it.
> >    What is wrong with this scheme? The ioctl for 'reservation' before 
the
> >    clone()?
> 
> Yes, how does that sort of thing even make sense in a complex
> multi-threaded world?

controlfd = open("/dev/vtpmx, ...");
ioctl(controlfd, CREATE_VTPM, &inargs, &outargs);
[...]
ioctl(controlfd, RESERVE_VTPM_FOR_NS, ...params);
clone()

The possibility of passing that controlfd between threads is there if 
clone() was to happen in another thread ... (it currently doesn't). It 
shouldn't be a problem.

> 
> >    Should it work like this?
> 
> Sort of like this:
> 
>     controlfd = open("/dev/vtpmx", ...);
>     ioctl(controlfd, CREATE_VTPM, &inargs, &outargs);
>     serverfd = outargs.fd;
>     // /dev/tpmX exists. X is returned in outargs.tpm_index, maybe 
> return major/minor too
> 
>     child = clone(...)
>     ioctl(??? , ASSIGN_VTPM_TO_NS, .. child->ima_ns .., to index = 0,
>           from index = outargs.tpm_index);

after the clone() you are in that IMA namespace. So the only argument 
needed there is the index to the tpm_chip to hook up to the current IMA 
namespace.

> 
>     /* tpm index X is destroyed, kernel prevents reuse of index X
>        until the NS is destroyed too. /dev/tpmX is removed by udev */
>     close(severfd);
> 
> So, you'd probably make a vtpm daemon that took as execv args a
> reference to the IMA namespace to create the tpm in, and have docker
> launch it after the clone, but before the exec in the parent
> namespace.


I got that part with the fd, major & minor number. It seems to work.
I have one ioctl to reserve for before the clone and another ioctl to hook 
IMA-NS and vTPM together after the clone, but that patch is for later. So 
let's not just kill the ioctl for 'reservation' like that, please.

> 
> That is fairly similar to how net ns works, with the wrinkle you have
> to do this before the exec, I guess.
> 
> It also allows hw tpms to be routed to the ns.

How many hardware TPMs are going to be there ? One? That is to be used for 
the host, right? Presumably the containers will all get emulated TPMs. And 
sharing the single hardware TPM between multiple containers just isn't 
possible.

This will not be possible when going through the vTPM driver, but you have 
the ??? up there. I'd put the 'controlfd' in that place. The vTPM driver 
will only know about vtpm_dev->chip that it created and none of them is a 
hardware TPM.

    Stefan

> 
> The docker container just has the normal /dev/tpm0 
> 
> Jason
>
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jason Gunthorpe Jan. 21, 2016, 10:10 p.m. UTC | #11
On Thu, Jan 21, 2016 at 04:51:06PM -0500, Stefan Berger wrote:

>    > You can't let this influence the kernel UAPI design.

>    The choice is between getting this working 'today' (even if just
>    locally) or discussing this with golang designers, which in the ideal
>    case would cause me waiting for the next version, dealing with that
>    version dependency etc., plus the delay. So, clearly, an additional
>    ioctl() and ~50 lines of code make this work 'now'. Doesn't this seem
>    worth it?

Sorry, for mainline stuff like this reserve thing is not clean enough
to be acceptable.

Why can't you just open-code a modified forkAndExecInChild in docker?

Seriously, the golang code you showed already has special stuff to do
user namespaces before the exec, it is totally unreasonable insist
that other namespace stuff can't be done the same way.

>    >     child = clone(...)
>    >     ioctl(??? , ASSIGN_VTPM_TO_NS, .. child->ima_ns .., to index = 0,
>    >           from index = outargs.tpm_index);

>    after the clone() you are in that IMA namespace. So the only argument
>    needed there is the index to the tpm_chip to hook up to the current IMA
>    namespace.

No, all of the above was in the parent namespace, the clone creates
the child IMA namespace and launches the thread, but only the parent
would have enough permissions to actually share the TPM to the child -
ie the child cannot self-join.

You could also go the way of the mount namespace where the actions
isn't an 'add to child namespace' but a 'remove from my namespace',
but I don't think that makes as much sense for devices, the 'add'
approach in line with the net ns seems cleaner. Others more familiar
with namespaces may have other ideas, but I doubt you'd find anyone to
support a reserve scheme.

>    I got that part with the fd, major & minor number. It seems to work.
>    I have one ioctl to reserve for before the clone and another ioctl to
>    hook IMA-NS and vTPM together after the clone, but that patch is for
>    later. So let's not just kill the ioctl for 'reservation' like that,
>    please.

No, kill it now. It doesn't make sense as part of this series, it
should have been in the IMA namespace patch anyhow.

>    > That is fairly similar to how net ns works, with the wrinkle you have
>    > to do this before the exec, I guess.
>    >
>    > It also allows hw tpms to be routed to the ns.
>    How many hardware TPMs are going to be there ? One? That is to be used
>    for the host, right?

No idea. It depends on the application. The HW tpm is alot better than
this vtpm idea for many use cases. I could see applications where
you'd want to use the host PCRS because they cover the bios and kernel
the app is actually running under. I actually have a hard time
understanding what value fake software PCRS are to containers <shrug>

I certainly wouldn't want to be forced to store keys in a software tpm
just because I am using containers, for instance.

>    TPMs. And sharing the single hardware TPM between multiple containers
>    just isn't possible.

Of course it is, it just hasn't been done yet, and won't be a 100%
perfect emulation.

>    This will not be possible when going through the vTPM driver, but you
>    have the ??? up there. I'd put the 'controlfd' in that place.

No, it should not be controlfd, it should be what ever API the rest of
the IMA namespace stuff is using, I think.

>    The vTPM driver will only know about vtpm_dev->chip that it
>    created and none of them is a hardware TPM.

Right, controlfd implies that only vtpms could be shared to a IMA
namespace, which is a terrible API. This is another reason why
reserved is a terrible API.

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jarkko Sakkinen Jan. 22, 2016, 12:01 p.m. UTC | #12
On Thu, Jan 21, 2016 at 03:10:40PM -0700, Jason Gunthorpe wrote:
> >    The vTPM driver will only know about vtpm_dev->chip that it
> >    created and none of them is a hardware TPM.
> 
> Right, controlfd implies that only vtpms could be shared to a IMA
> namespace, which is a terrible API. This is another reason why
> reserved is a terrible API.

It's funny because for me the API looks robust, configurable and nice to
use [1]. I could use it outside of virtualization space (I already
mentioned two applications).

Your criticism is for me way too abstract. I have hard time following
it.

This is a nice incremental step. By no means it's perfect but something
that would satisfy a lot of users for things that they want do now (like
me).

[1] I've been still too busy to try this out in order to say the final
    word.

> Jason

/Jarkko

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Stefan Berger Jan. 22, 2016, 3:09 p.m. UTC | #13
On 01/21/2016 05:10 PM, Jason Gunthorpe wrote:
> On Thu, Jan 21, 2016 at 04:51:06PM -0500, Stefan Berger wrote:
>
>>     > You can't let this influence the kernel UAPI design.
>>     The choice is between getting this working 'today' (even if just
>>     locally) or discussing this with golang designers, which in the ideal
>>     case would cause me waiting for the next version, dealing with that
>>     version dependency etc., plus the delay. So, clearly, an additional
>>     ioctl() and ~50 lines of code make this work 'now'. Doesn't this seem
>>     worth it?
> Sorry, for mainline stuff like this reserve thing is not clean enough
> to be acceptable.
>
> Why can't you just open-code a modified forkAndExecInChild in docker?
>
> Seriously, the golang code you showed already has special stuff to do
> user namespaces before the exec, it is totally unreasonable insist
> that other namespace stuff can't be done the same way.

I have not 'insisted' that this cannot be done. I have merely shown a 
work-around.

It looks like the golan impl. would need some sort of (optional) 
synchronization between the child and the parent in order for the parent 
to hook the vTPM to the IMA namespace that the child just created if the 
parent is the only one allowed to do that. The hook-up would have to 
happen before the execve()...

>
>>     >     child = clone(...)
>>     >     ioctl(??? , ASSIGN_VTPM_TO_NS, .. child->ima_ns .., to index = 0,
>>     >           from index = outargs.tpm_index);
>>     after the clone() you are in that IMA namespace. So the only argument
>>     needed there is the index to the tpm_chip to hook up to the current IMA
>>     namespace.
> No, all of the above was in the parent namespace, the clone creates
> the child IMA namespace and launches the thread, but only the parent
> would have enough permissions to actually share the TPM to the child -
> ie the child cannot self-join.

Ok, so I guess the ioctl would take the PID of the created child as a 
parameter (how else to identify it?). The kernel would find that child, 
check that the calling parent is indeed the parent of the child, and if 
everything looks good do the hook-up.

The reservation works similar in so far as the 'reservation' would 
lookup the calling processe's PID and associate that with the vTPM 
device pair instance on which the ioctl was called. Once the child is 
created, IMA namspacing looks for a reserved vTPM device pair instance 
by matching up PIDs.

>
> You could also go the way of the mount namespace where the actions
> isn't an 'add to child namespace' but a 'remove from my namespace',
> but I don't think that makes as much sense for devices, the 'add'
> approach in line with the net ns seems cleaner. Others more familiar
> with namespaces may have other ideas, but I doubt you'd find anyone to
> support a reserve scheme.
>
>>     I got that part with the fd, major & minor number. It seems to work.
>>     I have one ioctl to reserve for before the clone and another ioctl to
>>     hook IMA-NS and vTPM together after the clone, but that patch is for
>>     later. So let's not just kill the ioctl for 'reservation' like that,
>>     please.
> No, kill it now. It doesn't make sense as part of this series, it
> should have been in the IMA namespace patch anyhow.
>
>>     > That is fairly similar to how net ns works, with the wrinkle you have
>>     > to do this before the exec, I guess.
>>     >
>>     > It also allows hw tpms to be routed to the ns.
>>     How many hardware TPMs are going to be there ? One? That is to be used
>>     for the host, right?
> No idea. It depends on the application. The HW tpm is alot better than
> this vtpm idea for many use cases. I could see applications where
> you'd want to use the host PCRS because they cover the bios and kernel
> the app is actually running under. I actually have a hard time
> understanding what value fake software PCRS are to containers <shrug>

I would just say that both use-cases exist, one with a hardware TPM and 
one with each container having a vTPM. The issue is that I don't see how 
well containers can share the single hardware TPM and its PCRs.

>
> I certainly wouldn't want to be forced to store keys in a software tpm
> just because I am using containers, for instance.

IMO, sharing the single hardware TPM between containers is an unsolved problem.


>
>>     TPMs. And sharing the single hardware TPM between multiple containers
>>     just isn't possible.
> Of course it is, it just hasn't been done yet, and won't be a 100%
> perfect emulation.

How do you handle access to PCRs? How do you handle sealing against PCRs?

>
>>     This will not be possible when going through the vTPM driver, but you
>>     have the ??? up there. I'd put the 'controlfd' in that place.
> No, it should not be controlfd, it should be what ever API the rest of
> the IMA namespace stuff is using, I think.
>
>>     The vTPM driver will only know about vtpm_dev->chip that it
>>     created and none of them is a hardware TPM.
> Right, controlfd implies that only vtpms could be shared to a IMA
> namespace, which is a terrible API. This is another reason why
> reserved is a terrible API.

Thanks for all feedback. I will eventually post v2. For now it's to be 
found here:

https://github.com/stefanberger/linux

The branch is easy to find.

      Stefan


------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jason Gunthorpe Jan. 25, 2016, 6:10 p.m. UTC | #14
On Fri, Jan 22, 2016 at 10:09:16AM -0500, Stefan Berger wrote:
> Ok, so I guess the ioctl would take the PID of the created child as a
> parameter (how else to identify it?). The kernel would find that child,
> check that the calling parent is indeed the parent of the child, and if
> everything looks good do the hook-up.

That is one choice, it is worth reviewing what, eg, the net folks did
for their API. They have already thought about this alot.

> >Right, controlfd implies that only vtpms could be shared to a IMA
> >namespace, which is a terrible API. This is another reason why
> >reserved is a terrible API.
> 
> Thanks for all feedback. I will eventually post v2. For now it's to be found
> here:
> 
> https://github.com/stefanberger/linux
> 
> The branch is easy to find.

That looks much nicer, yes.

I'd also merge the tpm-vtpm.h into the .c file

However, I'm still not sure this is right:

+	vtpm_dev->chip = tpmm_chip_alloc(&vtpm_dev->dev, &vtpm_tpm_ops);

The only issue is error unwind, the tpmm version assumes devm works
for that, but the vtpm_dev will not be destroyed if the tpm attach
fails, do devm is not appropriate.

As I said before the tpmm stuff was a hack I did to make it easier to
migrate the large number of drivers, core code should not be relying
on devm like that... One good option here would be unwind that a bit
and create a tpm_chip_alloc/tpm_chip_register flow that did not use
devm at all. That could be fairly straight forward..

Also, what is up with the _vtpm prefix on some functions? That doesn't
seem aligned with the kernel style..

And confused why there is a miscdev and a alloc_chrdev_region ?

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Stefan Berger Jan. 26, 2016, 1:05 a.m. UTC | #15
Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote on 01/25/2016 
01:10:46 PM:

> 
> On Fri, Jan 22, 2016 at 10:09:16AM -0500, Stefan Berger wrote:

> > 
> > Thanks for all feedback. I will eventually post v2. For now it's to be 
found
> > here:
> > 
> > https://github.com/stefanberger/linux
> > 
> > The branch is easy to find.
> 
> That looks much nicer, yes.
> 
> I'd also merge the tpm-vtpm.h into the .c file

Ok, I can do that.

> 
> However, I'm still not sure this is right:
> 
> +   vtpm_dev->chip = tpmm_chip_alloc(&vtpm_dev->dev, &vtpm_tpm_ops);
> 
> The only issue is error unwind, the tpmm version assumes devm works
> for that, but the vtpm_dev will not be destroyed if the tpm attach
> fails, do devm is not appropriate.
> 
> As I said before the tpmm stuff was a hack I did to make it easier to
> migrate the large number of drivers, core code should not be relying
> on devm like that... One good option here would be unwind that a bit
> and create a tpm_chip_alloc/tpm_chip_register flow that did not use
> devm at all. That could be fairly straight forward..
> 
> Also, what is up with the _vtpm prefix on some functions? That doesn't
> seem aligned with the kernel style..

Renamed them.

> 
> And confused why there is a miscdev and a alloc_chrdev_region ?

miscdev = /dev/vtpmx - that should be ok, no ?

So, I just removed alloc_chrdev_region and with that the assignment of a 
major+minor to the virtual device. This works fine on device creation but 
on device destruction something odd happens in sysfs.

With two 'vtpmctrl' test programs running sysfs looks like this:

# find /sys/devices/virtual/vtpm/
/sys/devices/virtual/vtpm/
/sys/devices/virtual/vtpm/vtpms0
/sys/devices/virtual/vtpm/vtpms0/dev
/sys/devices/virtual/vtpm/vtpms0/durations
...
/sys/devices/virtual/vtpm/vtpms0/tpm0
/sys/devices/virtual/vtpm/vtpms0/tpm0/dev
...
/sys/devices/virtual/vtpm/vtpms1
/sys/devices/virtual/vtpm/vtpms1/dev
/sys/devices/virtual/vtpm/vtpms1/durations
...
/sys/devices/virtual/vtpm/vtpms1/tpm1
/sys/devices/virtual/vtpm/vtpms1/tpm1/dev
...
/sys/devices/virtual/vtpm/vtpms1/tpm1/power
/sys/devices/virtual/vtpm/vtpms1/tpm1/power/control


Now one of the test process 'vtpmctrl' terminates:

/sys/devices/virtual/vtpm/
/sys/devices/virtual/vtpm/vtpms1
/sys/devices/virtual/vtpm/vtpms1/power
/sys/devices/virtual/vtpm/vtpms1/power/control
/sys/devices/virtual/vtpm/vtpms1/power/async
/sys/devices/virtual/vtpm/vtpms1/power/runtime_enabled
/sys/devices/virtual/vtpm/vtpms1/power/runtime_active_kids
/sys/devices/virtual/vtpm/vtpms1/power/runtime_active_time
/sys/devices/virtual/vtpm/vtpms1/power/autosuspend_delay_ms
/sys/devices/virtual/vtpm/vtpms1/power/runtime_status
/sys/devices/virtual/vtpm/vtpms1/power/runtime_usage
/sys/devices/virtual/vtpm/vtpms1/power/runtime_suspended_time
/sys/devices/virtual/vtpm/vtpms1/subsystem
/sys/devices/virtual/vtpm/vtpms1/uevent

That's all that is left now. .../vtpms1/tpm1/ has also already 
disappeared. Needless to say, the kernel is very unhappy once the other 
vtpmctrl terminates.

The virtual device needs a major/minor as well. Are there any other 
possible solutions?

   Stefan
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jarkko Sakkinen Jan. 26, 2016, 1:46 a.m. UTC | #16
Hi

Apologies if I'm asking things that have popped out already in this
discussion (and also being passive so far!). I'm still in the process of
reading the discussion about [3/4] of the patch set that Stefan sent.

I should have done that and read the source code properly before sending
any responses. Sorry for being unforgivable sloppy with this.

> > Thanks for all feedback. I will eventually post v2. For now it's to be found
> > here:
> > 
> > https://github.com/stefanberger/linux
> > 
> > The branch is easy to find.
> 
> That looks much nicer, yes.
> 
> I'd also merge the tpm-vtpm.h into the .c file
> 
> However, I'm still not sure this is right:
> 
> +	vtpm_dev->chip = tpmm_chip_alloc(&vtpm_dev->dev, &vtpm_tpm_ops);
> 
> The only issue is error unwind, the tpmm version assumes devm works
> for that, but the vtpm_dev will not be destroyed if the tpm attach
> fails, do devm is not appropriate.

At the moment tpmm_chip_alloc() does not use devm for anything. Are you
speaking about stuff that happens between alloc and register as some of
the drivers use devm to associate resources to pdev at that point?

What do you mean by "tpm attach" here? Are you talking about the ioctl
for the vtpmx device that creates the device pair? Doesn't the fallback
path call destroy_device()?

I'm not sure which devm associated resources you are talking about.

> As I said before the tpmm stuff was a hack I did to make it easier to
> migrate the large number of drivers, core code should not be relying
> on devm like that... One good option here would be unwind that a bit
> and create a tpm_chip_alloc/tpm_chip_register flow that did not use
> devm at all. That could be fairly straight forward..

AFAIK I implemented two phase initialization or are you talking about
something else?

> Also, what is up with the _vtpm prefix on some functions? That doesn't
> seem aligned with the kernel style..
> 
> And confused why there is a miscdev and a alloc_chrdev_region ?
> 
> Jason

/Jarkko

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jason Gunthorpe Jan. 26, 2016, 3:19 a.m. UTC | #17
On Mon, Jan 25, 2016 at 05:46:52PM -0800, Jarkko Sakkinen wrote:
> > The only issue is error unwind, the tpmm version assumes devm works
> > for that, but the vtpm_dev will not be destroyed if the tpm attach
> > fails, do devm is not appropriate.
> 
> At the moment tpmm_chip_alloc() does not use devm for anything. Are you
> speaking about stuff that happens between alloc and register as some of
> the drivers use devm to associate resources to pdev at that point?

Ugh, well I missed that when you made those patches.

You need to put the devm back ASAP, as it is all the drivers now have
broken error handling. eg

static int tpm_tis_init(struct device *dev, struct tpm_info *tpm_info,
                        acpi_handle acpi_dev_handle)
[..]
        chip = tpmm_chip_alloc(dev, &tpm_tis);
[..]
        if (IS_ERR(chip))
	        return PTR_ERR(chip);
			
Obviously leaks chip.

The function is called tpmm_ specifically because it is supposed to
use devm resource unwind, it was a big mistake to remove the devm and
not rename the function or update the comments.

With the change to use a struct device the devm action should have
been a device_put to pair with the device_initialize.

> > As I said before the tpmm stuff was a hack I did to make it easier to
> > migrate the large number of drivers, core code should not be relying
> > on devm like that... One good option here would be unwind that a bit
> > and create a tpm_chip_alloc/tpm_chip_register flow that did not use
> > devm at all. That could be fairly straight forward..
> 
> AFAIK I implemented two phase initialization or are you talking about
> something else?

Right, I was confused because I wrote a different devm thing for TPM
that I discarded when I did the first rework.

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jason Gunthorpe Jan. 26, 2016, 3:46 a.m. UTC | #18
On Mon, Jan 25, 2016 at 08:05:14PM -0500, Stefan Berger wrote:

> > And confused why there is a miscdev and a alloc_chrdev_region ?
> 
> miscdev = /dev/vtpmx - that should be ok, no ?

Yes

> So, I just removed alloc_chrdev_region and with that the assignment of a
> major+minor to the virtual device. This works fine on device creation but on
> device destruction something odd happens in sysfs.

> With two 'vtpmctrl' test programs running sysfs looks like this:
> 
> # find /sys/devices/virtual/vtpm/
> /sys/devices/virtual/vtpm/
> /sys/devices/virtual/vtpm/vtpms0
> /sys/devices/virtual/vtpm/vtpms0/dev

Ugh.

So, in an ideal world vtpms0 wouldn't even exist. The need for it is
more tpm core breakage. You could avoid this by having the tpm core
not create sysfs files on the parent for the vtpms. I think we are
very close to being able to do that.

The it would just be /sys/devices/virtual/tpm0/dev

If vtpms0 is kept, then it shouldn't have a 'dev':

> /sys/devices/virtual/vtpm/vtpms0/dev

The plus side is that it should be able to act as the devm container
for the tpmm_ function.

This should be the only dev:

> /sys/devices/virtual/vtpm/vtpms0/tpm0/dev

That is a big clue something is very wrong if the above exists but
the alloc_chrdev_region was removed.

> That's all that is left now. .../vtpms1/tpm1/ has also already disappeared.
> Needless to say, the kernel is very unhappy once the other vtpmctrl terminates.
> 
> The virtual device needs a major/minor as well. Are there any other possible
> solutions?

This bad behavior is some secondary bug.. Just looking around I think
all the lifetime stuff needs a good go over. I noticed at least...

This looks wrong:

+static int vtpm_fops_release(struct inode *inode, struct file *filp)

+    put_device(&vtpm_dev->dev);
+
+	_vtpm_delete_device_pair(vtpm_dev);

put_device should be last because it can kfree(vtpm_dev)

This looks questionable too:

+   device_destroy(vtpm_class, vtpm_dev->dev.devt);

The driver never calls device_create, it shouldn't call
device_destroy.

The device_initialize transfers kref ownership to the caller, so
vtpm_dev->dev needs a matching device_put on all error paths, not
device_destroy

IMHO, the put_device in the fops->release should be the match to
device_initialize, and it should be the last thing. Audit eveything
else to make that true

This looks wrong too:

+    /* device is needed for core TPM driver */
+    device_initialize(&vtpm_dev->dev);
+    vtpm_dev->dev.devt = devt;
+    vtpm_dev->dev.class = vtpm_class;
+    vtpm_dev->dev.release = vtpm_dev_release;

I usualle like to put device_initialize last in that sequence, drop
the devt stuff so you don't get a 'dev'

The RCU stuff looks off too - where is the synchronize_rcu ?

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jarkko Sakkinen Jan. 26, 2016, 1:56 p.m. UTC | #19
On Mon, Jan 25, 2016 at 08:19:19PM -0700, Jason Gunthorpe wrote:
> On Mon, Jan 25, 2016 at 05:46:52PM -0800, Jarkko Sakkinen wrote:
> > > The only issue is error unwind, the tpmm version assumes devm works
> > > for that, but the vtpm_dev will not be destroyed if the tpm attach
> > > fails, do devm is not appropriate.
> > 
> > At the moment tpmm_chip_alloc() does not use devm for anything. Are you
> > speaking about stuff that happens between alloc and register as some of
> > the drivers use devm to associate resources to pdev at that point?
> 
> Ugh, well I missed that when you made those patches.
> 
> You need to put the devm back ASAP, as it is all the drivers now have
> broken error handling. eg
> 
> static int tpm_tis_init(struct device *dev, struct tpm_info *tpm_info,
>                         acpi_handle acpi_dev_handle)
> [..]
>         chip = tpmm_chip_alloc(dev, &tpm_tis);
> [..]
>         if (IS_ERR(chip))
> 	        return PTR_ERR(chip);
> 			
> Obviously leaks chip.

Could you describe how it leaks a chip and what do you mean by verb
"leak"?

> The function is called tpmm_ specifically because it is supposed to
> use devm resource unwind, it was a big mistake to remove the devm and
> not rename the function or update the comments.

The function does not reserve any resources using devm. Should it?

> With the change to use a struct device the devm action should have
> been a device_put to pair with the device_initialize.

Well, I found a regression that is luckily a rare one.

It does not happen when tpm_chip_register() is called because then it
will be paired with tpm_chip_unregister(), which calls
device_unregister().

I injected error to tpm_crb initialization in-between alloc and
register and release-callback was never called. When I did insmod
with a working driver it got device name /dev/tpm1.

So now I do have regression that does show up. If you don't have it,
it does not make sense to "fix" the code even if it looks obviously
wrong.

I have to ask one question. In earlier response to you stated that one
should get rid of devm? How'd you fix this problem without devm and why
you want to get rid of it in this scenario?

Thanks.

/Jarkko

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Stefan Berger Jan. 26, 2016, 2:21 p.m. UTC | #20
Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote on 01/25/2016 
10:46:32 PM:

> 
> On Mon, Jan 25, 2016 at 08:05:14PM -0500, Stefan Berger wrote:
> 
> > > And confused why there is a miscdev and a alloc_chrdev_region ?
> > 
> > miscdev = /dev/vtpmx - that should be ok, no ?
> 
> Yes
> 
> > So, I just removed alloc_chrdev_region and with that the assignment of 
a
> > major+minor to the virtual device. This works fine on device creation 
but on
> > device destruction something odd happens in sysfs.
> 
> > With two 'vtpmctrl' test programs running sysfs looks like this:
> > 
> > # find /sys/devices/virtual/vtpm/
> > /sys/devices/virtual/vtpm/
> > /sys/devices/virtual/vtpm/vtpms0
> > /sys/devices/virtual/vtpm/vtpms0/dev
> 
> Ugh.
> 
> So, in an ideal world vtpms0 wouldn't even exist. The need for it is
> more tpm core breakage. You could avoid this by having the tpm core
> not create sysfs files on the parent for the vtpms. I think we are
> very close to being able to do that.

We're not explicitly registering the parent with sysfs, yet it appears 
there. How do you prevent it from appearing there? The only other solution 
I could think of is being able to pass a NULL for 'dev' into the existing 
tpmm_chip_alloc() and avoid the parent device altogether.

> 
> The it would just be /sys/devices/virtual/tpm0/dev
> 
> If vtpms0 is kept, then it shouldn't have a 'dev':

Now it doesn't have it anymore.

> 
> > /sys/devices/virtual/vtpm/vtpms0/dev
> 
> The plus side is that it should be able to act as the devm container
> for the tpmm_ function.
> 
> This should be the only dev:
> 
> > /sys/devices/virtual/vtpm/vtpms0/tpm0/dev
> 
> That is a big clue something is very wrong if the above exists but
> the alloc_chrdev_region was removed.
> 
> > That's all that is left now. .../vtpms1/tpm1/ has also already 
disappeared.
> > Needless to say, the kernel is very unhappy once the other 
> vtpmctrl terminates.
> > 
> > The virtual device needs a major/minor as well. Are there any other 
possible
> > solutions?
> 
> This bad behavior is some secondary bug.. Just looking around I think
> all the lifetime stuff needs a good go over. I noticed at least...
> 
> This looks wrong:
> 
> +static int vtpm_fops_release(struct inode *inode, struct file *filp)
> 
> +    put_device(&vtpm_dev->dev);
> +
> +   _vtpm_delete_device_pair(vtpm_dev);
> 
> put_device should be last because it can kfree(vtpm_dev)

Yes. Fixed.

> 
> This looks questionable too:
> 
> +   device_destroy(vtpm_class, vtpm_dev->dev.devt);
> 
> The driver never calls device_create, it shouldn't call
> device_destroy.

My bad! This now has to be device_del() + put_device() and then it works.

> 
> The device_initialize transfers kref ownership to the caller, so
> vtpm_dev->dev needs a matching device_put on all error paths, not
> device_destroy
> 
> IMHO, the put_device in the fops->release should be the match to
> device_initialize, and it should be the last thing. Audit eveything
> else to make that true
> 
> This looks wrong too:
> 
> +    /* device is needed for core TPM driver */
> +    device_initialize(&vtpm_dev->dev);
> +    vtpm_dev->dev.devt = devt;
> +    vtpm_dev->dev.class = vtpm_class;
> +    vtpm_dev->dev.release = vtpm_dev_release;
> 
> I usualle like to put device_initialize last in that sequence, drop
> the devt stuff so you don't get a 'dev'

I did that in the same order as what 'device_create' ends up doing :

1697         device_initialize(dev);
1698         dev->devt = devt;
1699         dev->class = class;
1700         dev->parent = parent;
1701         dev->groups = groups;
1702         dev->release = device_create_release;
1703         dev_set_drvdata(dev, drvdata);

http://lxr.free-electrons.com/source/drivers/base/core.c#L1680

Usage of 'devt' is gone.

I put the fixed version of the driver here:

https://github.com/stefanberger/linux/commits/vtpm-driver.v2


Thanks for the review.

   Stefan

> 
> The RCU stuff looks off too - where is the synchronize_rcu ?
> 
> Jason
>
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jason Gunthorpe Jan. 26, 2016, 5:58 p.m. UTC | #21
On Tue, Jan 26, 2016 at 05:56:59AM -0800, Jarkko Sakkinen wrote:
> On Mon, Jan 25, 2016 at 08:19:19PM -0700, Jason Gunthorpe wrote:
> > On Mon, Jan 25, 2016 at 05:46:52PM -0800, Jarkko Sakkinen wrote:
> > > > The only issue is error unwind, the tpmm version assumes devm works
> > > > for that, but the vtpm_dev will not be destroyed if the tpm attach
> > > > fails, do devm is not appropriate.
> > > 
> > > At the moment tpmm_chip_alloc() does not use devm for anything. Are you
> > > speaking about stuff that happens between alloc and register as some of
> > > the drivers use devm to associate resources to pdev at that point?
> > 
> > Ugh, well I missed that when you made those patches.
> > 
> > You need to put the devm back ASAP, as it is all the drivers now have
> > broken error handling. eg
> > 
> > static int tpm_tis_init(struct device *dev, struct tpm_info *tpm_info,
> >                         acpi_handle acpi_dev_handle)
> > [..]
> >         chip = tpmm_chip_alloc(dev, &tpm_tis);
> > [..]
> >         if (IS_ERR(chip))
> > 	        return PTR_ERR(chip);
> > 			
> > Obviously leaks chip.
> 
> Could you describe how it leaks a chip and what do you mean by verb
> "leak"?

It looks like you alread figured this out. release is never called, so
the kalloc for chip is never freed.

> > The function is called tpmm_ specifically because it is supposed to
> > use devm resource unwind, it was a big mistake to remove the devm and
> > not rename the function or update the comments.
> 
> The function does not reserve any resources using devm. Should it?

It originally did, but commit 313d21eeab9282e01fdcecd40e9ca87e0953627f
took it out and didn't change the function name.

The original version looked like:

struct tpm_chip *tpmm_chip_alloc(struct device *dev,
                                 const struct tpm_class_ops *ops)
{
[..]
        devm_add_action(dev, tpmm_chip_remove, chip);
        dev_set_drvdata(dev, chip);

        return chip;
}

> > With the change to use a struct device the devm action should have
> > been a device_put to pair with the device_initialize.
> 
> Well, I found a regression that is luckily a rare one.

It isn't rare, any error path will cause this.

> So now I do have regression that does show up. If you don't have it,
> it does not make sense to "fix" the code even if it looks obviously
> wrong.

That isn't the kernel way, this is very obviosly a bug, I don't need
to see 'proof' to know it needs fixing.

> I have to ask one question. In earlier response to you stated that one
> should get rid of devm? How'd you fix this problem without devm and why
> you want to get rid of it in this scenario?

To get rid of devm each driver would have to be updated to have the
right goto-error unwind to device_put the chip. It is alot of tricky
work to do this, which is why I suggested the tpmm_ scheme when you
did the original two step alloc patches.

For drivers like Stefan's non-devm might be needed, I'm still not
clear on that to be honest. There are rules when devm can be used and
not, and using it outside a probe() function or similar is something
I'm not sure works..

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jason Gunthorpe Jan. 26, 2016, 6:22 p.m. UTC | #22
On Tue, Jan 26, 2016 at 09:21:42AM -0500, Stefan Berger wrote:
>    > So, in an ideal world vtpms0 wouldn't even exist. The need for it is
>    > more tpm core breakage. You could avoid this by having the tpm core
>    > not create sysfs files on the parent for the vtpms. I think we are
>    > very close to being able to do that.

>    We're not explicitly registering the parent with sysfs, yet it appears
>    there. How do you prevent it from appearing there?

static struct kobject *get_device_parent(struct device *dev,
                                         struct device *parent)

                /*
                 * If we have no parent, we live in "virtual".
                 * Class-devices with a non class-device as parent, live
                 * in a "glue" directory to prevent namespace collisions.
                 */

The vtpm should be under virtual, the question is if we need vtpm and
and tpm devices, ie should the vtpm driver be calling class_create or
not.

I don't have an answer to that, are there other virtualization devices
you can reference as an example?

Ideally you'd get someone familiar with sysfs to confirm these details.

> The only other solution I could think of is being able to pass a
> NULL for 'dev' into the existing tpmm_chip_alloc() and avoid the
> parent device altogether.

This would drop the vtpm and put tpm directly under virtual, however I
think the tpm core currently will segfault if parent is null, that
could be fixed if that is the way this should go.

>> put_device should be last because it can kfree(vtpm_dev)
> Yes. Fixed.

Not on your github...

+static int vtpm_fops_release(struct inode *inode, struct file *filp)
+{
+	struct vtpm_dev *vtpm_dev = filp->private_data;
+
+	filp->private_data = NULL;
+
+	put_device(&vtpm_dev->dev);
+
+	vtpm_delete_device_pair(vtpm_dev);
+
+	return 0;
+}

Don't put something then continue to reference it after.

Also, I would probably drop the get_device in vtpm_fops_open, use the
implicit get_device of device_initialize instead. Then drop the
put_device in vtpm_destroy_vtpm_dev and use only one at the end of the
fops release.

So the flow is, device_initialize [...] create a flip, transfer the
kref from device_initialize to the flip [..]

Which makes the error flow very sane, once the flip is created,
deleting the flip will clean up everything.

>    I put the fixed version of the driver here:
>    https://github.com/stefanberger/linux/commits/vtpm-driver.v2

Please go through all the error handling, it still looks wrong:

+static long vtpmx_fops_ioctl(struct file *f, unsigned int ioctl,
+			    unsigned long arg)
+		vtpm_dev = vtpm_create_device_pair(&vtpm_new_pair);
[..]
+		if (copy_to_user(vtpm_new_pair_p, &vtpm_new_pair,
+				 sizeof(vtpm_new_pair)))
+			return -EFAULT;

Can't be right, that fails the ioctl but still creates the FD and TPM.

The goto unwind of vtpm_create_device_pair looks wrong too, double
calls vtpm_destroy_vtpm_dev and other. See above on how the flip
release should be relied upon to do that once the flip is created.

Also I'm wondering about the placement of tpm_chip_register, long
term we are going to be moving tpm setup commands under
tpm_chip_register, so the driver must be fully ready to handle
commands when it is called, so I think vtpm calls it too early.

For instance, get_timeouts is never called in vtpm, which is not
correct. The soft tpm should have an opportunity to tell the kernel
the timeout values.

I think you have to fix this by calling tpm_get_timeouts/chip_register
from a work queue thread created before the ioctl returns. That should
avoid deadlock..

This is also very wrong:

+struct vtpm_dev {
 +	struct kref kref;
 +
 +	struct device dev;

Structs should only ever have one kref. Get rid of 'kref' and use the
kref embedded in 'dev' via get_device/put_device.

Also, what is the point of 'vtpm_dev_get_by_chip' ? Just store the
vtpm_dev in the chip's private data pointer like every other driver.

No special locking is needed, tpm core guarentees no op is running or
will ever be called again once tpm_unregister returns.

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Stefan Berger Jan. 26, 2016, 11:22 p.m. UTC | #23
Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote on 01/26/2016 
01:22:48 PM:


> 
> On Tue, Jan 26, 2016 at 09:21:42AM -0500, Stefan Berger wrote:
> >    > So, in an ideal world vtpms0 wouldn't even exist. The need for it 
is
> >    > more tpm core breakage. You could avoid this by having the tpm 
core
> >    > not create sysfs files on the parent for the vtpms. I think we 
are
> >    > very close to being able to do that.
> 
> >    We're not explicitly registering the parent with sysfs, yet it 
appears
> >    there. How do you prevent it from appearing there?
> 
> static struct kobject *get_device_parent(struct device *dev,
>                                          struct device *parent)
> 
>                 /*
>                  * If we have no parent, we live in "virtual".
>                  * Class-devices with a non class-device as parent, live
>                  * in a "glue" directory to prevent namespace 
collisions.
>                  */
> 
> The vtpm should be under virtual, the question is if we need vtpm and
> and tpm devices, ie should the vtpm driver be calling class_create or
> not.
> 
> I don't have an answer to that, are there other virtualization devices
> you can reference as an example?

I don't have a virtual device example that doesn't have a parent. I had 
looked at virtio_console for, but that one also has a parent.

http://lxr.free-electrons.com/source/drivers/char/virtio_console.c#L1439

> 
> Ideally you'd get someone familiar with sysfs to confirm these details.
> 
> > The only other solution I could think of is being able to pass a
> > NULL for 'dev' into the existing tpmm_chip_alloc() and avoid the
> > parent device altogether.
> 
> This would drop the vtpm and put tpm directly under virtual, however I
> think the tpm core currently will segfault if parent is null, that
> could be fixed if that is the way this should go.

Likely the parent isn't needed if the vtpm driver calls it. So a couple of 
checks for dev being NULL could solve that. But I rather leave it as is 
for now before I tear  code out that I may have to re-add later on.

> 
> >> put_device should be last because it can kfree(vtpm_dev)
> > Yes. Fixed.
> 
> Not on your github...
> 
> +static int vtpm_fops_release(struct inode *inode, struct file *filp)
> +{
> +   struct vtpm_dev *vtpm_dev = filp->private_data;
> +
> +   filp->private_data = NULL;
> +
> +   put_device(&vtpm_dev->dev);
> +
> +   vtpm_delete_device_pair(vtpm_dev);
> +
> +   return 0;
> +}
> 
> Don't put something then continue to reference it after.
> 
> Also, I would probably drop the get_device in vtpm_fops_open, use the
> implicit get_device of device_initialize instead. Then drop the
> put_device in vtpm_destroy_vtpm_dev and use only one at the end of the
> fops release.

We have the implicit get_device in vtpm_create_vtpm_dev and would 
therefore try to keep the put_device in vtpm_destroy_vtpm_dev.

Besides that a device_initialize + device_add = device_register and 
device_del + put_device = device_unregister, which could go into 
vtpm_create_vtpm_dev and vtpm_destroy_vtpm_dev respectively.

> 
> So the flow is, device_initialize [...] create a flip, transfer the
> kref from device_initialize to the flip [..]
> 
> Which makes the error flow very sane, once the flip is created,
> deleting the flip will clean up everything.
> 
> >    I put the fixed version of the driver here:
> >    https://github.com/stefanberger/linux/commits/vtpm-driver.v2
> 
> Please go through all the error handling, it still looks wrong:
> 
> +static long vtpmx_fops_ioctl(struct file *f, unsigned int ioctl,
> +             unsigned long arg)
> +      vtpm_dev = vtpm_create_device_pair(&vtpm_new_pair);
> [..]
> +      if (copy_to_user(vtpm_new_pair_p, &vtpm_new_pair,
> +             sizeof(vtpm_new_pair)))
> +         return -EFAULT;
> 
> Can't be right, that fails the ioctl but still creates the FD and TPM.


True. I have not been able to undo fd_install so far and cannot find any 
other code that does that. So, I pulled that out and put it after the 
successful copy_to_user().


> 
> The goto unwind of vtpm_create_device_pair looks wrong too, double
> calls vtpm_destroy_vtpm_dev and other. See above on how the flip
> release should be relied upon to do that once the flip is created.
> 
> Also I'm wondering about the placement of tpm_chip_register, long
> term we are going to be moving tpm setup commands under
> tpm_chip_register, so the driver must be fully ready to handle
> commands when it is called, so I think vtpm calls it too early.

re-ordered

> 
> For instance, get_timeouts is never called in vtpm, which is not
> correct. The soft tpm should have an opportunity to tell the kernel
> the timeout values.

That's seems a chick-and-egg problem. The vTPM can only be started once 
the ioctl has provided the fd.

> 
> I think you have to fix this by calling tpm_get_timeouts/chip_register
> from a work queue thread created before the ioctl returns. That should
> avoid deadlock..

We'll only ever be able to call that once the vTPM has been started on the 
fd... otherwise the default timeout values will have to do it.

> 
> This is also very wrong:
> 
> +struct vtpm_dev {
>  +   struct kref kref;
>  +
>  +   struct device dev;
> 
> Structs should only ever have one kref. Get rid of 'kref' and use the
> kref embedded in 'dev' via get_device/put_device.

Yes, sir.

> 
> Also, what is the point of 'vtpm_dev_get_by_chip' ? Just store the
> vtpm_dev in the chip's private data pointer like every other driver.

Introduced chip->priv.

> 
> No special locking is needed, tpm core guarentees no op is running or
> will ever be called again once tpm_unregister returns.
> 
> Jason
> 

cheers!

Stefan
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jarkko Sakkinen Jan. 27, 2016, 2:06 a.m. UTC | #24
On Tue, Jan 26, 2016 at 10:58:16AM -0700, Jason Gunthorpe wrote:
> On Tue, Jan 26, 2016 at 05:56:59AM -0800, Jarkko Sakkinen wrote:
> > On Mon, Jan 25, 2016 at 08:19:19PM -0700, Jason Gunthorpe wrote:
> > > On Mon, Jan 25, 2016 at 05:46:52PM -0800, Jarkko Sakkinen wrote:
> > > > > The only issue is error unwind, the tpmm version assumes devm works
> > > > > for that, but the vtpm_dev will not be destroyed if the tpm attach
> > > > > fails, do devm is not appropriate.
> > > > 
> > > > At the moment tpmm_chip_alloc() does not use devm for anything. Are you
> > > > speaking about stuff that happens between alloc and register as some of
> > > > the drivers use devm to associate resources to pdev at that point?
> > > 
> > > Ugh, well I missed that when you made those patches.
> > > 
> > > You need to put the devm back ASAP, as it is all the drivers now have
> > > broken error handling. eg
> > > 
> > > static int tpm_tis_init(struct device *dev, struct tpm_info *tpm_info,
> > >                         acpi_handle acpi_dev_handle)
> > > [..]
> > >         chip = tpmm_chip_alloc(dev, &tpm_tis);
> > > [..]
> > >         if (IS_ERR(chip))
> > > 	        return PTR_ERR(chip);
> > > 			
> > > Obviously leaks chip.
> > 
> > Could you describe how it leaks a chip and what do you mean by verb
> > "leak"?
> 
> It looks like you alread figured this out. release is never called, so
> the kalloc for chip is never freed.
> 
> > > The function is called tpmm_ specifically because it is supposed to
> > > use devm resource unwind, it was a big mistake to remove the devm and
> > > not rename the function or update the comments.
> > 
> > The function does not reserve any resources using devm. Should it?
> 
> It originally did, but commit 313d21eeab9282e01fdcecd40e9ca87e0953627f
> took it out and didn't change the function name.
> 
> The original version looked like:
> 
> struct tpm_chip *tpmm_chip_alloc(struct device *dev,
>                                  const struct tpm_class_ops *ops)
> {
> [..]
>         devm_add_action(dev, tpmm_chip_remove, chip);
>         dev_set_drvdata(dev, chip);
> 
>         return chip;
> }
> 
> > > With the change to use a struct device the devm action should have
> > > been a device_put to pair with the device_initialize.
> > 
> > Well, I found a regression that is luckily a rare one.
> 
> It isn't rare, any error path will cause this.
> 
> > So now I do have regression that does show up. If you don't have it,
> > it does not make sense to "fix" the code even if it looks obviously
> > wrong.
> 
> That isn't the kernel way, this is very obviosly a bug, I don't need
> to see 'proof' to know it needs fixing.
> 
> > I have to ask one question. In earlier response to you stated that one
> > should get rid of devm? How'd you fix this problem without devm and why
> > you want to get rid of it in this scenario?
> 
> To get rid of devm each driver would have to be updated to have the
> right goto-error unwind to device_put the chip. It is alot of tricky
> work to do this, which is why I suggested the tpmm_ scheme when you
> did the original two step alloc patches.
> 
> For drivers like Stefan's non-devm might be needed, I'm still not
> clear on that to be honest. There are rules when devm can be used and
> not, and using it outside a probe() function or similar is something
> I'm not sure works..

OK, thanks for your comments! The important thing is that we are on the
same page in this and rest is cutting hairs.

I'll deliver a bug fix for this.

> Jason

/Jarkko

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jarkko Sakkinen Jan. 27, 2016, 2:36 a.m. UTC | #25
On Wed, Jan 20, 2016 at 09:39:09AM -0500, Stefan Berger wrote:
> > Presumably some namespace magic can be used to make them show up as
> > tpm0 in a container?
> 
> The magic is to have the container management stack create the device pair.
> From the ioctl it learns the name of the devices that were created and it then
> finds out about the major/minor number of the created device and have /dev/tpm0
> with that major/minor created in the container's /dev directory.

Is the device created before container launched? I would assume that
this would work user space accesses through /dev/tpm0.

I don't know how this would work for kernel clients.

/Jarkko

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jarkko Sakkinen Jan. 27, 2016, 2:50 a.m. UTC | #26
On Wed, Jan 20, 2016 at 10:01:38PM -0500, Stefan Berger wrote:
> > Except that isn't good enough - the IMA kernel side doesn't know that this
> > tpm is now acting as the 'main' 'default' TPM.
> 
> Hooking the vTPM to IMA requires another patch that I haven't shown since IMA
> namespacing isn't public yet. Basically we implement another ioctl() that is to
> be called before the clone() in order to 'reserve' a vtpm device pair for the
> calling process. During the clone() call IMA namespacing code can query the
> vTPM driver for a 'reserved' device pair. Hooking IMA up after the clone() may
> also work but in case of docker/golang it's better to do this before since the
> language libraries do a lot after the clone automatically.

Can we expect that "in the end" there will be a single patch set that
contains both TPM and IMA changes? Otherwise, I see it very hard to make
decision to apply TPM patches.

/Jarkko

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jarkko Sakkinen Jan. 27, 2016, 3:13 a.m. UTC | #27
On Thu, Jan 21, 2016 at 03:10:40PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 21, 2016 at 04:51:06PM -0500, Stefan Berger wrote:
> 
> >    > You can't let this influence the kernel UAPI design.
> 
> >    The choice is between getting this working 'today' (even if just
> >    locally) or discussing this with golang designers, which in the ideal
> >    case would cause me waiting for the next version, dealing with that
> >    version dependency etc., plus the delay. So, clearly, an additional
> >    ioctl() and ~50 lines of code make this work 'now'. Doesn't this seem
> >    worth it?
> 
> Sorry, for mainline stuff like this reserve thing is not clean enough
> to be acceptable.
> 
> Why can't you just open-code a modified forkAndExecInChild in docker?
> 
> Seriously, the golang code you showed already has special stuff to do
> user namespaces before the exec, it is totally unreasonable insist

I'm starting to be along the same lines with the Jason now that I've
actually tried to read through the discussion up to this email. I did
say something opposite in last week but will take my words back since I
had too weak understanding back then (you *never* should response to
dicussions like this when you are overloaded with something else).

I haven't yet locked whether the fd or major/minor passing is the right
choice but at least I do agree that we cannot make decission based on
some golang code in Docker. For the new mainline code we should pick the
most robust API, nothing more or less...

/Jarkko

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Stefan Berger Jan. 27, 2016, 12:17 p.m. UTC | #28
Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote on 01/26/2016 
09:36:03 PM:

> 
> On Wed, Jan 20, 2016 at 09:39:09AM -0500, Stefan Berger wrote:
> > > Presumably some namespace magic can be used to make them show up as
> > > tpm0 in a container?
> > 
> > The magic is to have the container management stack create the device 
pair.
> > From the ioctl it learns the name of the devices that were created
> and it then
> > finds out about the major/minor number of the created device and 
> have /dev/tpm0
> > with that major/minor created in the container's /dev directory.
> 
> Is the device created before container launched? I would assume that
> this would work user space accesses through /dev/tpm0.

Yes, device would be created before container is launched.

> 
> I don't know how this would work for kernel clients.

For IMA we have these additional ioctls to either 'reserve' a vTPM for a 
container before clone() or to hook the vTPM up to a IMA namespace after 
clone() -- you may have read the discussions about these in other emails. 
As for trusted and encrypted keys and the TPM based RNG, the kernel 
determines the current IMA namespace a process that wants to use the 
kernel service is associated with. It then uses the TPM associated with 
the IMA namespace or returns an error if there is none.

   Stefan

> 
> /Jarkko
>
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Stefan Berger Jan. 27, 2016, 12:20 p.m. UTC | #29
Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote on 01/26/2016 
09:50:57 PM:


> 
> On Wed, Jan 20, 2016 at 10:01:38PM -0500, Stefan Berger wrote:
> > > Except that isn't good enough - the IMA kernel side doesn't knowthat 
this
> > > tpm is now acting as the 'main' 'default' TPM.
> > 
> > Hooking the vTPM to IMA requires another patch that I haven't 
> shown since IMA
> > namespacing isn't public yet. Basically we implement another ioctl
> () that is to
> > be called before the clone() in order to 'reserve' a vtpm device 
> pair for the
> > calling process. During the clone() call IMA namespacing code can 
query the
> > vTPM driver for a 'reserved' device pair. Hooking IMA up after the
> clone() may
> > also work but in case of docker/golang it's better to do this 
> before since the
> > language libraries do a lot after the clone automatically.
> 
> Can we expect that "in the end" there will be a single patch set that
> contains both TPM and IMA changes? Otherwise, I see it very hard to make
> decision to apply TPM patches.

If this can be posted to the same lists, then 'yes'. I cannot set a 
timeframe for this, though. Nevertheless, the vTPM driver reviews were 
fruitful, I think.

The vTPM driver could be used standalone as well, though it may be more 
useful in conjunction with the namespacing of IMA.

Stefan

> 
> /Jarkko
>
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Stefan Berger Jan. 27, 2016, 12:42 p.m. UTC | #30
Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote on 01/26/2016 
10:13:20 PM:

> 
> On Thu, Jan 21, 2016 at 03:10:40PM -0700, Jason Gunthorpe wrote:
> > On Thu, Jan 21, 2016 at 04:51:06PM -0500, Stefan Berger wrote:
> > 
> > >    > You can't let this influence the kernel UAPI design.
> > 
> > >    The choice is between getting this working 'today' (even if just
> > >    locally) or discussing this with golang designers, which in the 
ideal
> > >    case would cause me waiting for the next version, dealing with 
that
> > >    version dependency etc., plus the delay. So, clearly, an 
additional
> > >    ioctl() and ~50 lines of code make this work 'now'. Doesn't this 
seem
> > >    worth it?
> > 
> > Sorry, for mainline stuff like this reserve thing is not clean enough
> > to be acceptable.
> > 
> > Why can't you just open-code a modified forkAndExecInChild in docker?
> > 
> > Seriously, the golang code you showed already has special stuff to do
> > user namespaces before the exec, it is totally unreasonable insist
> 
> I'm starting to be along the same lines with the Jason now that I've
> actually tried to read through the discussion up to this email. I did
> say something opposite in last week but will take my words back since I
> had too weak understanding back then (you *never* should response to
> dicussions like this when you are overloaded with something else).
> 
> I haven't yet locked whether the fd or major/minor passing is the right
> choice but at least I do agree that we cannot make decission based on
> some golang code in Docker. For the new mainline code we should pick the
> most robust API, nothing more or less...

So we can design it how it should work from user space. So far it's solved 
with an ioctl.

To pick up on Jason's suggestion to look into network namespaces and 
associated tools:

What we don't want from the IMA perspective is that someone creates an IMA 
namespace on the host similar to how one can create a network namespace 
(ip netns add ...), runs (malicious) programs under a different IMA 
policy, and then closes that IMA namespace. On the other hand, what we 
want are containers having their own IMA namespace with an independent 
policy. Since the argument is that containers are well isolated from the 
host and programs running there don't influence the host, their logs would 
be independent. The thing is that containers are made up of multiple 
namespaces. So, we likely won't give direct control over IMA namespace 
creation through tools like network namespacing does or even a dedicate 
clone flag but connect it to one or multiple existing clone flags that 
cause creation of a (mount, pid, etc.) namespace. So that could mean that 
if someone creates a mount + pid namespace, and thus is well isolated, he 
automatically gets an IMA namespace.

IMA namespacing would then provide less control for users compared to 
network namespacing and therefore an ioctl, issued from the clone()ing 
process, for IMA+vTPM driver hook-up would be sufficient.

   Stefan

> 
> /Jarkko
>
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jarkko Sakkinen Jan. 27, 2016, 2:22 p.m. UTC | #31
On Wed, Jan 27, 2016 at 07:17:17AM -0500, Stefan Berger wrote:
> Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote on 01/26/2016 09:36:03
> PM:
> 
> >
> > On Wed, Jan 20, 2016 at 09:39:09AM -0500, Stefan Berger wrote:
> > > > Presumably some namespace magic can be used to make them show up as
> > > > tpm0 in a container?
> > >
> > > The magic is to have the container management stack create the device pair.
> > > From the ioctl it learns the name of the devices that were created
> > and it then
> > > finds out about the major/minor number of the created device and
> > have /dev/tpm0
> > > with that major/minor created in the container's /dev directory.
> >
> > Is the device created before container launched? I would assume that
> > this would work user space accesses through /dev/tpm0.
> 
> Yes, device would be created before container is launched.
> 
> >
> > I don't know how this would work for kernel clients.
> 
> For IMA we have these additional ioctls to either 'reserve' a vTPM for a
> container before clone() or to hook the vTPM up to a IMA namespace after clone
> () -- you may have read the discussions about these in other emails. As for
> trusted and encrypted keys and the TPM based RNG, the kernel determines the
> current IMA namespace a process that wants to use the kernel service is
> associated with. It then uses the TPM associated with the IMA namespace or
> returns an error if there is none.

Yeah, I now have read the discussion but I still don't fully understand
this. I spent 2-3 hours yesterday reading it and frankly don't want do
it again. If I ask something that was already hidden somewhere there,
I just didn't get it.

If this applies to other kernel services than IMA, why this feature is
called IMA namespace?

I.e. as far as I can understand your description you could:

1. Use this feature without using IMA. Create namespace and apply TPM
   emulator or software implementation of TPM (you could do this for
   example with SGX).
2. Use this feature without TPM. Just create namespace and use IMA
   there.

"TPM namespace" that popped out in your and Jaosns discussion and also
"IMA namespace" both seem not to describe what is being developed.

Also, I'm wondering is it right to have this as a separate module or
should this be part of the core TPM infrastructure?

/Jarkko

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jarkko Sakkinen Jan. 27, 2016, 2:23 p.m. UTC | #32
On Wed, Jan 27, 2016 at 07:20:48AM -0500, Stefan Berger wrote:
> Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote on 01/26/2016 09:50:57
> PM:
> 
> 
> >
> > On Wed, Jan 20, 2016 at 10:01:38PM -0500, Stefan Berger wrote:
> > > > Except that isn't good enough - the IMA kernel side doesn't knowthat this
> > > > tpm is now acting as the 'main' 'default' TPM.
> > >
> > > Hooking the vTPM to IMA requires another patch that I haven't
> > shown since IMA
> > > namespacing isn't public yet. Basically we implement another ioctl
> > () that is to
> > > be called before the clone() in order to 'reserve' a vtpm device
> > pair for the
> > > calling process. During the clone() call IMA namespacing code can query the
> > > vTPM driver for a 'reserved' device pair. Hooking IMA up after the
> > clone() may
> > > also work but in case of docker/golang it's better to do this
> > before since the
> > > language libraries do a lot after the clone automatically.
> >
> > Can we expect that "in the end" there will be a single patch set that
> > contains both TPM and IMA changes? Otherwise, I see it very hard to make
> > decision to apply TPM patches.
> 
> If this can be posted to the same lists, then 'yes'. I cannot set a timeframe
> for this, though. Nevertheless, the vTPM driver reviews were fruitful, I think.

Thanks, this is good to hear.

> The vTPM driver could be used standalone as well, though it may be more useful
> in conjunction with the namespacing of IMA.
> 
> Stefan

/Jarkko

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jason Gunthorpe Jan. 27, 2016, 5:35 p.m. UTC | #33
On Wed, Jan 27, 2016 at 07:17:17AM -0500, Stefan Berger wrote:
>    after clone() -- you may have read the discussions about these in other
>    emails. As for trusted and encrypted keys and the TPM based RNG,
>    the

Oh that is a terrifying thought, the kernel should never use a vtpm
RNG :|

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jason Gunthorpe Jan. 27, 2016, 5:58 p.m. UTC | #34
On Wed, Jan 27, 2016 at 07:42:17AM -0500, Stefan Berger wrote:
>    What we don't want from the IMA perspective is that someone creates an
>    IMA namespace on the host similar to how one can create a network
>    namespace (ip netns add ...), runs (malicious) programs under a
>    different IMA policy, and then closes that IMA namespace.

Presumably who ever is implementing IMA namespaces has some kind of
solution for this though?

That seems like a very difficult problem.

>    hand, what we want are containers having their own IMA namespace with
>    an independent policy. Since the argument is that containers are well
>    isolated from the host and programs running there don't influence the
>    host, their logs would be independent. The thing is that containers are
>    made up of multiple namespaces. So, we likely won't give direct control
>    over IMA namespace creation through tools like network namespacing does
>    or even a dedicate clone flag but connect it to one or multiple
>    existing clone flags that cause creation of a (mount, pid, etc.)
>    namespace. So that could mean that if someone creates a mount + pid
>    namespace, and thus is well isolated, he automatically gets an IMA
>    namespace.

That isn't nearly good enough, mount namespaces don't mean you are
isolated. It isn't until another tool like docker goes through and
alters the child's mount table that some degree of isolation is
achieved..

I don't think there is a generic kernel side point where it could tell
the child is isolated enough. Whatever that means.

Doesn't selinux have the exact same problem? How does selinux handle
namespaces?

>    IMA namespacing would then provide less control for users compared to
>    network namespacing and therefore an ioctl, issued from the clone()ing
>    process, for IMA+vTPM driver hook-up would be sufficient.

I agree IMA namespaces probably don't need the whole 'ip setns' type
interface, but realistically if an ioctl is provided to install a TPM
in a child namespace it should let anyone with privilege in the parent
namespaces install a TPM in a child IMA namespace, that is just how
the namespace APIs seem to work.

That said, maybe looking at selinux namespaces interaction will give a
different idea..

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jason Gunthorpe Jan. 27, 2016, 6:21 p.m. UTC | #35
On Tue, Jan 26, 2016 at 06:22:46PM -0500, Stefan Berger wrote:
>    > I don't have an answer to that, are there other virtualization devices
>    > you can reference as an example?
>    I don't have a virtual device example that doesn't have a parent. I had
>    looked at virtio_console for, but that one also has a parent.
>    [1]http://lxr.free-electrons.com/source/drivers/char/virtio_console.c#L1439

virtio console is really a 'hardware' device under a VM setting, so
it's parent make some sense.

Are you aware of any devices used directly with containers? I only
know of netdevices, and I'm not sure that is relevant.

>    > Also, I would probably drop the get_device in vtpm_fops_open, use the
>    > implicit get_device of device_initialize instead. Then drop the
>    > put_device in vtpm_destroy_vtpm_dev and use only one at the end of the
>    > fops release.
>    We have the implicit get_device in vtpm_create_vtpm_dev and would
>    therefore try to keep the put_device in vtpm_destroy_vtpm_dev.

No, go back and re-read my last message.

vtpm_destroy_vtpm_dev shouldn't even exist, evey call to it looks
kinda weird..

vtpm_create_device_pair creates a flip and deleting the flip should
delete everything else. The only time there are special cases are
within vtpm_create_vtpm_dev itself, which should be handled in-line
with goto-unwinds.

Return the struct file from vtpm_create_device_pair not the struct
vtpm_dev to make this clear.

'vtpm_delete_device_pair' should just put the flip, and all that clean
up should be in the flip release. Doing any clean up while the file is
still active just creates complex scenarios for no real reason.

I'd also drop vtpm_cleanup, it doesn't look right to me... Ie this
doesn't make any sense:

+		vtpm_dev->file->f_op = &null_fops;

null_fops is still a pointer within this module, so the module still
cannot be unloaded. If the module cannot be unloaded there is no
reason to have vtpm_cleanup. I don't know if there is a solution to
that, but generally speaking module detach is hard, racy, and best
avoided. In a case like this killing all the vtpm daemons will allow
module unload, so I'd just go with that much simpler solution.

Just ensure the module lock is held while any vtpms exist. I actually
thought the file code did this for the module, so I'm susprised that
vtpm_cleanup would even be callable at all?

Once all that is done then the flip alone controls the lifetime of the
rest of the objects and it is very easy to reason about.

>    True. I have not been able to undo fd_install so far and cannot find
>    any other code that does that. So, I pulled that out and put it after
>    the successful copy_to_user().

That looks much better, yes.

>    > For instance, get_timeouts is never called in vtpm, which is not
>    > correct. The soft tpm should have an opportunity to tell the kernel
>    > the timeout values.

>    That's seems a chick-and-egg problem. The vTPM can only be started once
>    the ioctl has provided the fd.

Hence the work queue suggestion.

Start a work queue task/kernel thread/something right before
fd_install that does tpm_get_timeouts and tpm_register.

The ioctl will return and immediately read() on the fd will be
available for get_timeouts. The user space side must process the read
and return a response within the default tpm timeout period.

So, userspace should not call the ioctl to register the tpm until it
is completely prepared to process tpm commands and it should
immediately process commands once the ioctl returns.

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jason Gunthorpe Jan. 27, 2016, 6:24 p.m. UTC | #36
On Wed, Jan 27, 2016 at 06:22:39AM -0800, Jarkko Sakkinen wrote:

> Also, I'm wondering is it right to have this as a separate module or
> should this be part of the core TPM infrastructure?

The vtpm itself should be a module.

Any namespace support for TPM (whatever form that takes) should be in
the core infrastructure and apply equally to all TPMs.

I do not like the idea that only vtpm could participate in namespace
support.

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jarkko Sakkinen Jan. 27, 2016, 7:48 p.m. UTC | #37
On Tue, Jan 26, 2016 at 06:06:17PM -0800, Jarkko Sakkinen wrote:
> On Tue, Jan 26, 2016 at 10:58:16AM -0700, Jason Gunthorpe wrote:
> > On Tue, Jan 26, 2016 at 05:56:59AM -0800, Jarkko Sakkinen wrote:
> > > On Mon, Jan 25, 2016 at 08:19:19PM -0700, Jason Gunthorpe wrote:
> > > > On Mon, Jan 25, 2016 at 05:46:52PM -0800, Jarkko Sakkinen wrote:
> > > > > > The only issue is error unwind, the tpmm version assumes devm works
> > > > > > for that, but the vtpm_dev will not be destroyed if the tpm attach
> > > > > > fails, do devm is not appropriate.
> > > > > 
> > > > > At the moment tpmm_chip_alloc() does not use devm for anything. Are you
> > > > > speaking about stuff that happens between alloc and register as some of
> > > > > the drivers use devm to associate resources to pdev at that point?
> > > > 
> > > > Ugh, well I missed that when you made those patches.
> > > > 
> > > > You need to put the devm back ASAP, as it is all the drivers now have
> > > > broken error handling. eg
> > > > 
> > > > static int tpm_tis_init(struct device *dev, struct tpm_info *tpm_info,
> > > >                         acpi_handle acpi_dev_handle)
> > > > [..]
> > > >         chip = tpmm_chip_alloc(dev, &tpm_tis);
> > > > [..]
> > > >         if (IS_ERR(chip))
> > > > 	        return PTR_ERR(chip);
> > > > 			
> > > > Obviously leaks chip.
> > > 
> > > Could you describe how it leaks a chip and what do you mean by verb
> > > "leak"?
> > 
> > It looks like you alread figured this out. release is never called, so
> > the kalloc for chip is never freed.
> > 
> > > > The function is called tpmm_ specifically because it is supposed to
> > > > use devm resource unwind, it was a big mistake to remove the devm and
> > > > not rename the function or update the comments.
> > > 
> > > The function does not reserve any resources using devm. Should it?
> > 
> > It originally did, but commit 313d21eeab9282e01fdcecd40e9ca87e0953627f
> > took it out and didn't change the function name.
> > 
> > The original version looked like:
> > 
> > struct tpm_chip *tpmm_chip_alloc(struct device *dev,
> >                                  const struct tpm_class_ops *ops)
> > {
> > [..]
> >         devm_add_action(dev, tpmm_chip_remove, chip);
> >         dev_set_drvdata(dev, chip);
> > 
> >         return chip;
> > }
> > 
> > > > With the change to use a struct device the devm action should have
> > > > been a device_put to pair with the device_initialize.
> > > 
> > > Well, I found a regression that is luckily a rare one.
> > 
> > It isn't rare, any error path will cause this.
> > 
> > > So now I do have regression that does show up. If you don't have it,
> > > it does not make sense to "fix" the code even if it looks obviously
> > > wrong.
> > 
> > That isn't the kernel way, this is very obviosly a bug, I don't need
> > to see 'proof' to know it needs fixing.

I have to defend my position here. Of course you have to proof in a way
or another that there is a regresssion. Otherwise you cannot speak that
there is a bug. It's by definition like that. I need to see the
regression in order to apply a bug fix to the tpmdd tree.

/Jarkko

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jarkko Sakkinen Jan. 27, 2016, 9:13 p.m. UTC | #38
On Wed, Jan 27, 2016 at 11:24:48AM -0700, Jason Gunthorpe wrote:
> On Wed, Jan 27, 2016 at 06:22:39AM -0800, Jarkko Sakkinen wrote:
> 
> > Also, I'm wondering is it right to have this as a separate module or
> > should this be part of the core TPM infrastructure?
> 
> The vtpm itself should be a module.
> 
> Any namespace support for TPM (whatever form that takes) should be in
> the core infrastructure and apply equally to all TPMs.
> 
> I do not like the idea that only vtpm could participate in namespace
> support.

Given your discussion I already see it participating to the namespace support.

> Jason

/Jarkko

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Stefan Berger Jan. 27, 2016, 9:58 p.m. UTC | #39
Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote on 01/27/2016 
12:58:39 PM:

> 
> On Wed, Jan 27, 2016 at 07:42:17AM -0500, Stefan Berger wrote:
> >    What we don't want from the IMA perspective is that someone creates 
an
> >    IMA namespace on the host similar to how one can create a network
> >    namespace (ip netns add ...), runs (malicious) programs under a
> >    different IMA policy, and then closes that IMA namespace.
> 
> Presumably who ever is implementing IMA namespaces has some kind of
> solution for this though?
> 
> That seems like a very difficult problem.
> 
> >    hand, what we want are containers having their own IMA namespace 
with
> >    an independent policy. Since the argument is that containers are 
well
> >    isolated from the host and programs running there don't influence 
the
> >    host, their logs would be independent. The thing is that containers 
are
> >    made up of multiple namespaces. So, we likely won't give direct 
control
> >    over IMA namespace creation through tools like network namespacing 
does
> >    or even a dedicate clone flag but connect it to one or multiple
> >    existing clone flags that cause creation of a (mount, pid, etc.)
> >    namespace. So that could mean that if someone creates a mount + pid
> >    namespace, and thus is well isolated, he automatically gets an IMA
> >    namespace.
> 
> That isn't nearly good enough, mount namespaces don't mean you are
> isolated. It isn't until another tool like docker goes through and
> alters the child's mount table that some degree of isolation is
> achieved..
> 
> I don't think there is a generic kernel side point where it could tell
> the child is isolated enough. Whatever that means.

I agree. Which set of namespaces is enough for running any program in this 
set of namespaces (aka container) and being able to forget about the list 
of measurements once the set of namespaces goes away because whatever ran 
there couldn't have influenced the host. I would say mount, network, and 
user namespace is such a set. Probably also include PID namespace in that 
space assuming that in a shared PID namespace one process could signal 
other processes. Though this may be mitigated with user namespace mapping 
etc.. Not clear about UTS namespace, but may not be so important.

To be on the safe side, maybe all would be required and one gets an IMA 
namespace only then.

> 
> Doesn't selinux have the exact same problem? How does selinux handle
> namespaces?

They solve it by mounting with a context option, which enforces an sVirt 
SELinux label across all files that the container user then cannot change.

tmpfs /dev tmpfs rw,context=
"system_u:object_r:svirt_sandbox_file_t:s0:c322,c860",nosuid,mode=755 0 0
devpts /dev/pts devpts rw,context=
"system_u:object_r:svirt_sandbox_file_t:s0:c322,c860
",nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666 0 0
shm /dev/shm tmpfs 
rw,context="system_u:object_r:svirt_sandbox_file_t:s0:c322,c860",nosuid,nodev,noexec,relatime,size=65536k 
0 0
tmpfs /sys/fs/cgroup tmpfs 
rw,context="system_u:object_r:svirt_sandbox_file_t:s0:c322,c860",nosuid,nodev,noexec,relatime 
0 0
tmpfs /run/secrets tmpfs 
rw,context="system_u:object_r:svirt_sandbox_file_t:s0:c322,c860",nosuid,nodev,noexec,relatime 
0 0
tmpfs /proc/kcore tmpfs 
rw,context="system_u:object_r:svirt_sandbox_file_t:s0:c322,c860",nosuid,mode=755 
0 0
tmpfs /proc/latency_stats tmpfs 
rw,context="system_u:object_r:svirt_sandbox_file_t:s0:c322,c860",nosuid,mode=755 
0 0
tmpfs /proc/timer_stats tmpfs 
rw,context="system_u:object_r:svirt_sandbox_file_t:s0:c322,c860",nosuid,mode=755 
0 0


> 
> >    IMA namespacing would then provide less control for users compared 
to
> >    network namespacing and therefore an ioctl, issued from the 
clone()ing
> >    process, for IMA+vTPM driver hook-up would be sufficient.
> 
> I agree IMA namespaces probably don't need the whole 'ip setns' type
> interface, but realistically if an ioctl is provided to install a TPM
> in a child namespace it should let anyone with privilege in the parent
> namespaces install a TPM in a child IMA namespace, that is just how
> the namespace APIs seem to work.

Agree.

> 
> That said, maybe looking at selinux namespaces interaction will give a
> different idea..

See above. We cannot use the same trick.

   Stefan

> 
> Jason
>
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jason Gunthorpe Jan. 27, 2016, 10:25 p.m. UTC | #40
On Wed, Jan 27, 2016 at 04:58:51PM -0500, Stefan Berger wrote:

>    > I don't think there is a generic kernel side point where it could tell
>    > the child is isolated enough. Whatever that means.

>    I agree. Which set of namespaces is enough for running any program in
>    this set of namespaces (aka container) and being able to forget
>    about

It isn't just the presense of namespaces that matter, eg a net mount
namespace does not mean access is denied to the parent namespace, net
namespaces don't mean devices are isolated, etc.

This is not a good direction to go, access to an IMA namespace needs to
be very strongly controlled, 'enough namespaces' is not a sufficient
criteria!

Any flaw in the access criteria immediately destroys the security of
IMA in non-container contexts, so this needs to be done very
carefully.

>    > Doesn't selinux have the exact same problem? How does selinux handle
>    > namespaces?

>    They solve it by mounting with a context option, which enforces an
>    sVirt SELinux label across all files that the container user then
>    cannot change.

This sounds very sane.

>    > That said, maybe looking at selinux namespaces interaction will give a
>    > different idea..

>    See above. We cannot use the same trick.

Hmm, well, it certainly seems to be a lot of what is required, 
and like a much better direction than trying to use namespaces.

Arranging for an IMA namespace to only exists in association with a
SELinux label - and then rely on SELinux to provide the necessary
security isolation instead of trying to do the same thing with
namespaces sounds more likely to succeed..

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Stefan Berger Jan. 27, 2016, 10:38 p.m. UTC | #41
Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote on 01/27/2016 
01:24:48 PM:


> Subject: Re: [tpmdd-devel] [RFC PATCH 3/4] Implement driver for 
> supporting multiple emulated TPMs
> 
> On Wed, Jan 27, 2016 at 06:22:39AM -0800, Jarkko Sakkinen wrote:
> 
> > Also, I'm wondering is it right to have this as a separate module or
> > should this be part of the core TPM infrastructure?
> 
> The vtpm itself should be a module.
> 
> Any namespace support for TPM (whatever form that takes) should be in
> the core infrastructure and apply equally to all TPMs.
> 
> I do not like the idea that only vtpm could participate in namespace
> support.

How would the usage model of the TPM shared by multiple namespaces look 
like? What subset of functionality is available to each namespace and what 
works and what doesn't? Again, how is the sharing of PCRs accomplished? 

I think we should treat this sharing of the hardware TPM discussion 
orthogonal to the vTPM so that we don't get hung up with it.

   Stefan

> 
> Jason
>
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Stefan Berger Jan. 27, 2016, 10:55 p.m. UTC | #42
Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote on 01/27/2016 
05:25:34 PM:


> 
> On Wed, Jan 27, 2016 at 04:58:51PM -0500, Stefan Berger wrote:
> 
> >    > I don't think there is a generic kernel side point where it could 
tell
> >    > the child is isolated enough. Whatever that means.
> 
> >    I agree. Which set of namespaces is enough for running any program 
in
> >    this set of namespaces (aka container) and being able to forget
> >    about
> 
> It isn't just the presense of namespaces that matter, eg a net mount
> namespace does not mean access is denied to the parent namespace, net
> namespaces don't mean devices are isolated, etc.
> 
> This is not a good direction to go, access to an IMA namespace needs to
> be very strongly controlled, 'enough namespaces' is not a sufficient
> criteria!

Isolation should be a criteria and isolation becomes better with more 
namespaces enabled. That way one can run any program inside the set of 
namespaces and not harm the host or any other namespaces / containers. 


> 
> Any flaw in the access criteria immediately destroys the security of
> IMA in non-container contexts, so this needs to be done very
> carefully.
> 
> >    > Doesn't selinux have the exact same problem? How does selinux 
handle
> >    > namespaces?
> 
> >    They solve it by mounting with a context option, which enforces an
> >    sVirt SELinux label across all files that the container user then
> >    cannot change.
> 
> This sounds very sane.
> 
> >    > That said, maybe looking at selinux namespaces interaction will 
give a
> >    > different idea..
> 
> >    See above. We cannot use the same trick.
> 
> Hmm, well, it certainly seems to be a lot of what is required, 
> and like a much better direction than trying to use namespaces.

I don't agree. We want to allow users to run the own IMA appraisal policy. 
For that files need to be signed and the user's key passed to the IMA 
namespace. To enable that we need per-file file signatures, not some 
single label that works across all files in a filesystem. IMA's appraisal 
mode just doesn't work this way.

> 
> Arranging for an IMA namespace to only exists in association with a
> SELinux label - and then rely on SELinux to provide the necessary
> security isolation instead of trying to do the same thing with
> namespaces sounds more likely to succeed..

I am not sure whether SELinux labeling alone provides enough isolation. 
And it's likely not just the label that's important but all the rules that 
go with it to determine what a process can do and what not. How do you 
even evaluate that from inside the kernel that it's worthy an IMA 
namespace? 

   Stefan

> 
> Jason
>
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Jason Gunthorpe Jan. 27, 2016, 11:33 p.m. UTC | #43
On Wed, Jan 27, 2016 at 05:55:50PM -0500, Stefan Berger wrote:
>    > be very strongly controlled, 'enough namespaces' is not a sufficient
>    > criteria!

>    Isolation should be a criteria and isolation becomes better with more
>    namespaces enabled.

Yes, but IMA needs complete isolation, not just 'better' isolation.

> That way one can run any program inside the set of namespaces and
> not harm the host or any other namespaces / containers.

No, that is not how namespaces work at the kernel level, they don't
provide isolation until after userspace configures the isolation
features.

Or said another way, it is not better than CAP_ADMIN, because anyone
with CAP_ADMIN could create namespaces and manipulate them from user
space in a way that can substantially defeat IMA.

Do not assume 'docker' is the thing doing this. Assume
'hostile-docker' is involved and make a scheme that doesn't totally
break IMAs security promise.

For instance I could use hostile-docker to create all the namespaces,
including IMA, then drop in eth0 and bind mount /, and run software
that does something hostile to eth0 - and I think that is totally
contry to the guarentee a system running IMA should
provide. Specifically that CAP_ADMIN should not be able to opt out of
IMA.

>    > Hmm, well, it certainly seems to be a lot of what is required,
>    > and like a much better direction than trying to use namespaces.

>    I don't agree. We want to allow users to run the own IMA appraisal
>    policy. For that files need to be signed and the user's key passed to
>    the IMA namespace. To enable that we need per-file file signatures, not
>    some single label that works across all files in a filesystem. IMA's
>    appraisal mode just doesn't work this way.

Isn't that exactly what SElinux is trying to do? The SElinux problem
is 'allow users to run their own SELinux policy'. Just like IMA,
SELinux has per-file attributes that have a user security label. IMA
has a per-file user signature in the attribute instead.

So, for IMA you'd do

 .. make mount namespace ..
 mount --bind /.../container/fs -o ima_namespace=foo /

After which all accesses to / would use the IMA policy namespace foo,
and read the signature from the attribute in the filesystem.

I'm not sure it is better than doing it at clone, but it doesn't seem
out of line.

>    I am not sure whether SELinux labeling alone provides enough
>    isolation.

Nor am I, but maybe there is a way forward like that.

While with namespaces I can tell you right now, there can be no easy
solution that would be better than testing CAP_ADMIN :|

>    And it's likely not just the label that's important but all the rules
>    that go with it to determine what a process can do and what not. How do
>    you even evaluate that from inside the kernel that it's worthy an IMA
>    namespace?

When IMA is turned on user space tells the kernel these details?

So the restriction kernel side becomes
 1) The user space has told the kernel something about SELinux
     security labels for use with IMA
 2) The mount option contains an IMA namespace and the SELinux label compatible
    with #1
Then allow the IMA namespace to be used for that mount

At least that way it something strong that can be reasoned about.. And
is opt-out by default.

Anyhow just brainstorming for you..

Jason

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
Stefan Berger Feb. 2, 2016, 7:22 p.m. UTC | #44
Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote on 01/25/2016 
10:46:32 PM:

> 
> On Mon, Jan 25, 2016 at 08:05:14PM -0500, Stefan Berger wrote:
> 
> > > And confused why there is a miscdev and a alloc_chrdev_region ?
> > 
> > miscdev = /dev/vtpmx - that should be ok, no ?
> 
> Yes
> 
> > So, I just removed alloc_chrdev_region and with that the assignment of 
a
> > major+minor to the virtual device. This works fine on device creation 
but on
> > device destruction something odd happens in sysfs.
> 
> > With two 'vtpmctrl' test programs running sysfs looks like this:
> > 
> > # find /sys/devices/virtual/vtpm/
> > /sys/devices/virtual/vtpm/
> > /sys/devices/virtual/vtpm/vtpms0
> > /sys/devices/virtual/vtpm/vtpms0/dev
> 
> Ugh.


I have to dig out the sysfs topic again because I think sysfs seems to be 
the only remaining part that could be a problem.

The problem is information leakage between containers via sysfs due to 
sysfs entries, at least for TPM 1.2. For TPM 1.2 we have a lot of sysfs 
entries that show the current state of the TPM.

The good news is, there's the bind-mount trick that a container management 
stack can apply to hide everything under /sys/devices/virtual/vtpm. To 
achieve that it would do 'mount -o bind [...]/nulldir 
/sys/device/virtual/vtpm' inside a container and container users wouldn't 
be able to see those sysfs entries anymore. However, the bind mount trick 
only works if /sys/devices/virtual/vtpm exists and it only exists if a 
vTPM has been created, but the first started container won't necessarily 
have a vtpm, so that directory will not exists, but it will appear in that 
container once another container has a vtpm. Is there anything we can do 
in the driver to help user space ? Another trick would be that the 
container management stack creates a 'dummy vtpm' device pair every time 
and keeps it around just so it has that sysfs entry and can do the 
bind-mounting.

[mkdir under sysfs doesn't work]

   Stefan
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
diff mbox

Patch

diff --git a/drivers/char/tpm/Kconfig b/drivers/char/tpm/Kconfig
index 3b84a8b..d38b169 100644
--- a/drivers/char/tpm/Kconfig
+++ b/drivers/char/tpm/Kconfig
@@ -122,5 +122,15 @@  config TCG_CRB
 	  from within Linux.  To compile this driver as a module, choose
 	  M here; the module will be called tpm_crb.
 
+config TCG_VTPM
+	tristate "VTPM Interface"
+	depends on TCG_TPM
+	---help---
+	  This driver supports a vTPM running in userspace that receives
+	  request from /dev/vtpmsX. The requests are sent through a
+	  corrsponding /dev/tpmcY. The /dev/vtpmsX and /dev/tpmcY pair
+	  can be created dyanmically through ioctls of /dev/vtpmx.
+
+
 source "drivers/char/tpm/st33zp24/Kconfig"
 endif # TCG_TPM
diff --git a/drivers/char/tpm/Makefile b/drivers/char/tpm/Makefile
index 56e8f1f..d947db2 100644
--- a/drivers/char/tpm/Makefile
+++ b/drivers/char/tpm/Makefile
@@ -23,3 +23,4 @@  obj-$(CONFIG_TCG_IBMVTPM) += tpm_ibmvtpm.o
 obj-$(CONFIG_TCG_TIS_ST33ZP24) += st33zp24/
 obj-$(CONFIG_TCG_XEN) += xen-tpmfront.o
 obj-$(CONFIG_TCG_CRB) += tpm_crb.o
+obj-$(CONFIG_TCG_VTPM) += tpm-vtpm.o
diff --git a/drivers/char/tpm/tpm-vtpm.c b/drivers/char/tpm/tpm-vtpm.c
new file mode 100644
index 0000000..dc6f7f6
--- /dev/null
+++ b/drivers/char/tpm/tpm-vtpm.c
@@ -0,0 +1,855 @@ 
+/*
+ * Copyright (C) 2015 IBM Corporation
+ *
+ * Author: Stefan Berger <stefanb@us.ibm.com>
+ *
+ * Maintained by: <tpmdd-devel@lists.sourceforge.net>
+ *
+ * Device driver for vTPM.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ */
+
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/uaccess.h>
+#include <linux/wait.h>
+#include <linux/miscdevice.h>
+#include <linux/vtpm.h>
+#include <linux/poll.h>
+#include <asm/compat.h>
+
+#include "tpm-vtpm.h"
+
+static DECLARE_BITMAP(dev_mask, VTPM_NUM_DEVICES);
+static LIST_HEAD(vtpm_list);
+static DEFINE_SPINLOCK(driver_lock);
+
+static struct class *vtpm_class;
+static dev_t vtpm_devt;
+
+static int _vtpm_delete_device_pair(struct vtpm_dev *vtpm_dev);
+static void free_vtpm_dev(struct kref *kref);
+
+static void vtpm_dev_get(struct vtpm_dev *vtpm_dev)
+{
+	kref_get(&vtpm_dev->kref);
+}
+
+static void vtpm_dev_put(struct vtpm_dev *vtpm_dev)
+{
+	if (vtpm_dev)
+		kref_put(&vtpm_dev->kref, free_vtpm_dev);
+}
+
+static struct vtpm_dev *vtpm_dev_get_by_chip(struct tpm_chip *chip)
+{
+	struct vtpm_dev *pos, *vtpm_dev = NULL;
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(pos, &vtpm_list, list) {
+		if (pos->chip == chip) {
+			vtpm_dev = pos;
+			vtpm_dev_get(vtpm_dev);
+			break;
+		}
+	}
+
+	rcu_read_unlock();
+
+	return vtpm_dev;
+}
+
+static struct vtpm_dev *vtpm_dev_get_by_vtpm_devnum(u32 dev_num)
+{
+	struct vtpm_dev *pos, *vtpm_dev = NULL;
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(pos, &vtpm_list, list) {
+		if (pos->dev_num == dev_num) {
+			vtpm_dev = pos;
+			vtpm_dev_get(vtpm_dev);
+			break;
+		}
+	}
+
+	rcu_read_unlock();
+
+	return vtpm_dev;
+}
+
+static struct vtpm_dev *vtpm_dev_get_by_tpm_devnum(u32 dev_num)
+{
+	struct vtpm_dev *pos, *vtpm_dev = NULL;
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(pos, &vtpm_list, list) {
+		if (pos->chip->dev_num == dev_num) {
+			vtpm_dev = pos;
+			vtpm_dev_get(vtpm_dev);
+			break;
+		}
+	}
+
+	rcu_read_unlock();
+
+	return vtpm_dev;
+}
+
+/*
+ * Functions related to /dev/vtpms%d
+ */
+
+/**
+ * vtpm_fops_read - Read TPM commands from /dev/vtpms%d
+ *
+ * Return value:
+ *	Number of bytes read or negative error code
+ */
+static ssize_t vtpm_fops_read(struct file *filp, char __user *buf,
+			      size_t count, loff_t *off)
+{
+	struct file_priv *priv = filp->private_data;
+	struct vtpm_dev *vtpm_dev = priv->vtpm_dev;
+	size_t len;
+	int sig, rc;
+
+	sig = wait_event_interruptible(vtpm_dev->wq, vtpm_dev->req_len != 0);
+	if (sig)
+		return -EINTR;
+
+	len = vtpm_dev->req_len;
+
+	if (count < len) {
+		dev_err(&vtpm_dev->dev,
+			"Invalid size in recv: count=%zd, req_len=%zd\n",
+			count, len);
+		return -EIO;
+	}
+
+	spin_lock(&vtpm_dev->buf_lock);
+
+	rc = copy_to_user(buf, vtpm_dev->req_buf, len);
+	memset(vtpm_dev->req_buf, 0, len);
+	vtpm_dev->req_len = 0;
+
+	spin_unlock(&vtpm_dev->buf_lock);
+
+	if (rc)
+		return -EFAULT;
+
+	return len;
+}
+
+/**
+ * vtpm_fops_write - Write TPM responses to /dev/vtpms%d
+ *
+ * Return value:
+ *	Number of bytes read or negative error value
+ */
+static ssize_t vtpm_fops_write(struct file *filp, const char __user *buf,
+			       size_t count, loff_t *off)
+{
+	struct file_priv *priv = filp->private_data;
+	struct vtpm_dev *vtpm_dev = priv->vtpm_dev;
+
+	if (count > sizeof(vtpm_dev->resp_buf))
+		return -EIO;
+
+	spin_lock(&vtpm_dev->buf_lock);
+
+	vtpm_dev->req_len = 0;
+
+	if (copy_from_user(vtpm_dev->resp_buf, buf, count)) {
+		spin_unlock(&vtpm_dev->buf_lock);
+		return -EFAULT;
+	}
+
+	vtpm_dev->resp_len = count;
+
+	spin_unlock(&vtpm_dev->buf_lock);
+
+	wake_up_interruptible(&vtpm_dev->wq);
+
+	return count;
+}
+
+/*
+ * vtpm_fops_poll: Poll status of /dev/vtpms%d
+ *
+ * Return value:
+ *      Poll flags
+ */
+static unsigned int vtpm_fops_poll(struct file *filp, poll_table *wait)
+{
+	struct file_priv *priv = filp->private_data;
+	struct vtpm_dev *vtpm_dev = priv->vtpm_dev;
+	unsigned ret;
+
+	poll_wait(filp, &vtpm_dev->wq, wait);
+
+	ret = POLLOUT;
+	if (vtpm_dev->req_len)
+		ret |= POLLIN | POLLRDNORM;
+
+	return ret;
+}
+
+/**
+ * vtpm_fops_open - Open vTPM device /dev/vtpms%d
+ *
+ * Return value:
+ *	0 on success, error code otherwise
+ */
+static int vtpm_fops_open(struct inode *inode, struct file *filp)
+{
+	struct vtpm_dev *vtpm_dev =
+		container_of(inode->i_cdev, struct vtpm_dev, cdev);
+	struct file_priv *priv;
+
+	if (test_and_set_bit(STATE_OPENED_BIT, &vtpm_dev->state)) {
+		dev_dbg(vtpm_dev->pdev,
+		        "Another process owns this vTPM device\n");
+		return -EBUSY;
+	}
+
+	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	if (priv == NULL) {
+		clear_bit(STATE_OPENED_BIT, &vtpm_dev->state);
+		return -ENOMEM;
+	}
+
+	priv->vtpm_dev = vtpm_dev;
+
+	get_device(vtpm_dev->pdev);
+
+	filp->private_data = priv;
+
+	return 0;
+}
+
+/*
+ * vtpm_fops_release: Close /dev/vtpms%d
+ *
+ * If device pair is not in use anymore and flags permit, delete
+ * the device pair.
+ *
+ * Return value:
+ *      Always returns 0.
+ */
+static int vtpm_fops_release(struct inode *inode, struct file *filp)
+{
+	struct file_priv *priv = filp->private_data;
+	struct vtpm_dev *vtpm_dev = priv->vtpm_dev;
+
+	filp->private_data = NULL;
+	put_device(vtpm_dev->pdev);
+	kfree(priv);
+
+	if (!(vtpm_dev->flags & VTPM_FLAG_KEEP_DEVPAIR)) {
+		/*
+		 * device still marked as open; this prevents others from
+		 * opening it while we try to delete it
+		 */
+		if (_vtpm_delete_device_pair(vtpm_dev) == 0) {
+			/* vtpm_dev gone */
+			return 0;
+		}
+	}
+
+	clear_bit(STATE_OPENED_BIT, &vtpm_dev->state);
+	/* won't generate reponses anymore */
+	wake_up_interruptible(&vtpm_dev->wq);
+
+	return 0;
+}
+
+static const struct file_operations vtpm_fops = {
+	.owner = THIS_MODULE,
+	.llseek = no_llseek,
+	.open = vtpm_fops_open,
+	.read = vtpm_fops_read,
+	.write = vtpm_fops_write,
+	.poll = vtpm_fops_poll,
+	.release = vtpm_fops_release,
+};
+
+/*
+ * Functions invoked by the core TPM driver to send TPM commands to
+ * /dev/vtpms%d and receive responses from there.
+ */
+
+/*
+ * Called when core TPM driver reads TPM responses from /dev/vtpms%d.
+ *
+ * Return value:
+ *      Number of TPM response bytes read, negative error value otherwise
+ */
+static int vtpm_tpm_op_recv(struct tpm_chip *chip, u8 *buf, size_t count)
+{
+	struct vtpm_dev *vtpm_dev = vtpm_dev_get_by_chip(chip);
+	int sig;
+	size_t len;
+
+	if (!vtpm_dev)
+		return -EIO;
+
+	/* wait for response or responder gone */
+	sig = wait_event_interruptible(vtpm_dev->wq,
+		(vtpm_dev->resp_len != 0
+		|| !test_bit(STATE_OPENED_BIT, &vtpm_dev->state)));
+	if (sig) {
+		len = -EINTR;
+		goto err_exit;
+	}
+
+	/* process gone ? */
+	if (!test_bit(STATE_OPENED_BIT, &vtpm_dev->state))
+		return -EIO;
+
+	len = vtpm_dev->resp_len;
+	if (count < len) {
+		dev_err(&vtpm_dev->dev,
+			"Invalid size in recv: count=%zd, resp_len=%zd\n",
+			count, len);
+		len = -EIO;
+		goto err_exit;
+	}
+
+	spin_lock(&vtpm_dev->buf_lock);
+
+	memcpy(buf, vtpm_dev->resp_buf, len);
+	vtpm_dev->resp_len = 0;
+
+	spin_unlock(&vtpm_dev->buf_lock);
+
+err_exit:
+	vtpm_dev_put(vtpm_dev);
+
+	return len;
+}
+
+/*
+ * Called when core TPM driver forwards TPM requests to /dev/vtpms%d.
+ *
+ * Return value:
+ *      0 in case of success, negative error value otherwise.
+ */
+static int vtpm_tpm_op_send(struct tpm_chip *chip, u8 *buf, size_t count)
+{
+	struct vtpm_dev *vtpm_dev = vtpm_dev_get_by_chip(chip);
+	int rc = 0;
+
+	if (!vtpm_dev)
+		return -EIO;
+
+	if (!test_bit(STATE_OPENED_BIT, &vtpm_dev->state)) {
+		rc = -EINVAL;
+		goto err_exit;
+	}
+
+	if (count > sizeof(vtpm_dev->req_buf)) {
+		dev_err(&vtpm_dev->dev,
+			"Invalid size in send: count=%zd, buffer size=%zd\n",
+			count, sizeof(vtpm_dev->req_buf));
+		rc = -EIO;
+		goto err_exit;
+	}
+
+	spin_lock(&vtpm_dev->buf_lock);
+
+	vtpm_dev->resp_len = 0;
+
+	vtpm_dev->req_len = count;
+	memcpy(vtpm_dev->req_buf, buf, count);
+
+	spin_unlock(&vtpm_dev->buf_lock);
+
+	wake_up_interruptible(&vtpm_dev->wq);
+
+err_exit:
+	vtpm_dev_put(vtpm_dev);
+
+	return rc;
+}
+
+static void vtpm_tpm_op_cancel(struct tpm_chip *chip)
+{
+	/* not supported */
+}
+
+static u8 vtpm_tpm_op_status(struct tpm_chip *chip)
+{
+	return 0;
+}
+
+static bool vtpm_tpm_req_canceled(struct tpm_chip  *chip, u8 status)
+{
+	return (status == 0);
+}
+
+static const struct tpm_class_ops vtpm_tpm_ops = {
+	.recv = vtpm_tpm_op_recv,
+	.send = vtpm_tpm_op_send,
+	.cancel = vtpm_tpm_op_cancel,
+	.status = vtpm_tpm_op_status,
+	.req_complete_mask = 0,
+	.req_complete_val = 0,
+	.req_canceled = vtpm_tpm_req_canceled,
+};
+
+/*
+ * Code related to creation and deletion of device pairs
+ */
+
+static void vtpm_dev_release(struct device *dev)
+{
+	struct vtpm_dev *vtpm_dev = container_of(dev, struct vtpm_dev, dev);
+
+	vtpm_dev_put(vtpm_dev);
+}
+
+static void free_vtpm_dev(struct kref *kref)
+{
+	struct vtpm_dev *vtpm_dev = container_of(kref, struct vtpm_dev, kref);
+
+	spin_lock(&driver_lock);
+	clear_bit(vtpm_dev->dev_num, dev_mask);
+	spin_unlock(&driver_lock);
+
+	kfree(vtpm_dev);
+}
+
+struct vtpm_dev *vtpm_create_vtpm_dev(struct platform_device **ppdev)
+{
+	struct vtpm_dev *vtpm_dev;
+	struct platform_device *pdev;
+	struct device *dev;
+
+	vtpm_dev = kzalloc(sizeof(*vtpm_dev), GFP_KERNEL);
+	if (vtpm_dev == NULL)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&vtpm_dev->kref);
+	init_waitqueue_head(&vtpm_dev->wq);
+	spin_lock_init(&vtpm_dev->buf_lock);
+
+	spin_lock(&driver_lock);
+	vtpm_dev->dev_num = find_first_zero_bit(dev_mask, VTPM_NUM_DEVICES);
+
+	if (vtpm_dev->dev_num >= VTPM_NUM_DEVICES) {
+		spin_unlock(&driver_lock);
+		kfree(vtpm_dev);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	pdev = platform_device_register_simple("tpm_vtpm", vtpm_dev->dev_num,
+					       NULL, 0);
+	if (IS_ERR(pdev)) {
+		spin_unlock(&driver_lock);
+		kfree(vtpm_dev);
+		return ERR_PTR(PTR_ERR(pdev));
+	}
+	*ppdev = pdev;
+
+	set_bit(vtpm_dev->dev_num, dev_mask);
+	spin_unlock(&driver_lock);
+
+	dev = &pdev->dev;
+
+	scnprintf(vtpm_dev->devname, sizeof(vtpm_dev->devname),
+		  VTPM_DEV_PREFIX_SERVER"%d", vtpm_dev->dev_num);
+
+	vtpm_dev->pdev = dev;
+
+	dev_set_drvdata(dev, vtpm_dev);
+
+	vtpm_dev->dev.class = vtpm_class;
+	vtpm_dev->dev.release = vtpm_dev_release;
+	vtpm_dev->dev.parent = vtpm_dev->pdev;
+
+	vtpm_dev->dev.devt = MKDEV(MAJOR(vtpm_devt),vtpm_dev->dev_num);
+
+	dev_set_name(&vtpm_dev->dev, "%s", vtpm_dev->devname);
+
+	device_initialize(&vtpm_dev->dev);
+
+	cdev_init(&vtpm_dev->cdev, &vtpm_fops);
+	vtpm_dev->cdev.owner = vtpm_dev->pdev->driver->owner;
+	vtpm_dev->cdev.kobj.parent = &vtpm_dev->dev.kobj;
+
+	return vtpm_dev;
+}
+
+/*
+ * Create a /dev/vtpms%d and /dev/vtpms%d device pair.
+ *
+ * Return value:
+ *      Returns vtpm_dev pointer on success, an error value otherwise
+ */
+static struct vtpm_dev *vtpm_create_device_pair(
+                                       struct vtpm_new_pair *vtpm_new_pair)
+{
+	struct vtpm_dev *vtpm_dev;
+	struct platform_device *pdev = NULL;
+	int rc;
+
+	vtpm_dev = vtpm_create_vtpm_dev(&pdev);
+	if (IS_ERR(vtpm_dev))
+		return vtpm_dev;
+
+	vtpm_dev->flags = vtpm_new_pair->flags;
+
+	rc = device_add(&vtpm_dev->dev);
+	if (rc) {
+		kfree(vtpm_dev);
+		vtpm_dev = NULL;
+		goto err_device_add;
+	}
+
+	rc = cdev_add(&vtpm_dev->cdev, vtpm_dev->dev.devt, 1);
+	if (rc)
+		goto err_cdev_add;
+
+	vtpm_dev->chip = tpmm_chip_alloc_pattern(vtpm_dev->pdev,
+				&vtpm_tpm_ops,
+				VTPM_DEV_PREFIX_CLIENT"%d");
+	if (IS_ERR(vtpm_dev->chip)) {
+		rc = PTR_ERR(vtpm_dev->chip);
+		goto err_chip_alloc;
+	}
+
+	if (vtpm_dev->flags & VTPM_FLAG_TPM2)
+		vtpm_dev->chip->flags |= TPM_CHIP_FLAG_TPM2;
+
+	vtpm_dev->chip->flags |= TPM_CHIP_FLAG_NO_SYSFS | TPM_CHIP_FLAG_NO_LOG;
+
+	rc = tpm_chip_register(vtpm_dev->chip);
+	if (rc)
+		goto err_chip_register;
+
+	spin_lock(&driver_lock);
+	list_add_rcu(&vtpm_dev->list, &vtpm_list);
+	spin_unlock(&driver_lock);
+
+	vtpm_new_pair->tpm_dev_num = vtpm_dev->chip->dev_num;
+	vtpm_new_pair->vtpm_dev_num = vtpm_dev->dev_num;
+
+	return vtpm_dev;
+
+err_chip_register:
+err_chip_alloc:
+	cdev_del(&vtpm_dev->cdev);
+
+err_cdev_add:
+	device_unregister(&vtpm_dev->dev);
+
+err_device_add:
+	platform_device_unregister(pdev);
+
+	return ERR_PTR(rc);
+}
+
+/*
+ * Delete a /dev/vtpmc%d and /dev/vtpms%d device pair without checking
+ * whether it is still in use.
+ */
+static int _vtpm_delete_device_pair(struct vtpm_dev *vtpm_dev)
+{
+	struct device *dev = vtpm_dev->pdev;
+	struct platform_device *pdev =
+		container_of(dev, struct platform_device, dev);
+
+	tpm_chip_unregister(vtpm_dev->chip);
+
+	cdev_del(&vtpm_dev->cdev);
+	device_unregister(&vtpm_dev->dev);
+
+	platform_device_unregister(pdev);
+
+	spin_lock(&driver_lock);
+	list_del_rcu(&vtpm_dev->list);
+	spin_unlock(&driver_lock);
+
+	return 0;
+}
+
+/*
+ * Delete a /dev/vtpmc%d and /dev/vtpms%d device pair.
+ *
+ * Return value:
+ *      Returns 0 on success, -EBUSY of the device pair is still in use
+ */
+static int vtpm_delete_device_pair(struct vtpm_dev *vtpm_dev)
+{
+	if (test_bit(STATE_OPENED_BIT, &vtpm_dev->state)) {
+		dev_dbg(vtpm_dev->pdev, "Device is busy\n");
+		return -EBUSY;
+	}
+
+	return _vtpm_delete_device_pair(vtpm_dev);
+}
+
+/*
+ * Delete all /dev/tpm%d and /dev/vtpm%d device pairs.
+ * This function is only to be called when the module is removed and
+ * we are sure that there are no more users of any device.
+ */
+static int vtpm_delete_device_pairs(void)
+{
+	struct vtpm_dev *vtpm_dev;
+	int rc = 0;
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(vtpm_dev, &vtpm_list, list) {
+		rc = vtpm_delete_device_pair(vtpm_dev);
+		if (rc)
+			break;
+	}
+
+	rcu_read_unlock();
+
+	return rc;
+}
+
+static int vtpm_probe(struct platform_device *pdev)
+{
+	return 0;
+}
+
+static struct platform_driver vtpm_drv = {
+	.probe = vtpm_probe,
+	.driver = {
+		.name = "tpm_vtpm",
+	},
+};
+
+static int vtpm_init(void)
+{
+	return platform_driver_register(&vtpm_drv);
+}
+
+/*
+ * Called for module removal; no more module users
+ */
+static void vtpm_cleanup(void)
+{
+	vtpm_delete_device_pairs();
+	platform_driver_unregister(&vtpm_drv);
+}
+
+/*
+ * Code related to /dev/vtpmx
+ */
+
+struct vtpmx_dev {
+	struct device dev;
+	struct device *pdev;
+	struct cdev cdev;
+};
+
+/*
+ * vtpmx_fops_ioctl: ioctl on /dev/vtpmx
+ *
+ * Return value:
+ *      Returns 0 on success, -EINVAL or -EFAULT otherwise.
+ */
+static long vtpmx_fops_ioctl(struct file *f, unsigned int ioctl,
+			    unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	struct vtpm_new_pair *vtpm_new_pair_p;
+	struct vtpm_new_pair vtpm_new_pair;
+	struct vtpm_pair *vtpm_pair_p;
+	struct vtpm_pair vtpm_pair;
+	struct vtpm_dev *vtpm_dev;
+	int rc = 0;
+
+	switch (ioctl) {
+	case VTPM_NEW_DEV:
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+		vtpm_new_pair_p = argp;
+		if (copy_from_user(&vtpm_new_pair, vtpm_new_pair_p,
+				   sizeof(vtpm_new_pair)))
+			return -EFAULT;
+		vtpm_dev = vtpm_create_device_pair(&vtpm_new_pair);
+		if (IS_ERR(vtpm_dev))
+			return PTR_ERR(vtpm_dev);
+		if (copy_to_user(vtpm_new_pair_p, &vtpm_new_pair,
+				 sizeof(vtpm_new_pair)))
+			return -EFAULT;
+		return 0;
+
+	case VTPM_DEL_DEV:
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+		vtpm_pair_p = argp;
+		if (copy_from_user(&vtpm_pair, vtpm_pair_p, sizeof(vtpm_pair)))
+			return -EFAULT;
+
+		if (vtpm_pair.tpm_dev_num != VTPM_DEV_NUM_INVALID) {
+			vtpm_dev =
+			    vtpm_dev_get_by_tpm_devnum(vtpm_pair.tpm_dev_num);
+			if (!vtpm_dev || vtpm_delete_device_pair(vtpm_dev) < 0)
+				rc = -EINVAL;
+			vtpm_dev_put(vtpm_dev);
+			return rc;
+		}
+
+		if (vtpm_pair.vtpm_dev_num != VTPM_DEV_NUM_INVALID) {
+			vtpm_dev =
+			    vtpm_dev_get_by_vtpm_devnum(vtpm_pair.vtpm_dev_num);
+			if (!vtpm_dev || vtpm_delete_device_pair(vtpm_dev) < 0)
+				rc = -EINVAL;
+			vtpm_dev_put(vtpm_dev);
+			return rc;
+		}
+		return -EINVAL;
+
+	case VTPM_GET_VTPMDEV:
+		vtpm_pair_p = argp;
+		if (copy_from_user(&vtpm_pair, vtpm_pair_p, sizeof(vtpm_pair)))
+			return -EFAULT;
+
+		if (vtpm_pair.tpm_dev_num != VTPM_DEV_NUM_INVALID) {
+			vtpm_dev =
+			    vtpm_dev_get_by_tpm_devnum(vtpm_pair.tpm_dev_num);
+			if (!vtpm_dev)
+				return -EINVAL;
+
+			vtpm_pair.vtpm_dev_num = vtpm_dev->dev_num;
+
+			if (copy_to_user(vtpm_pair_p, &vtpm_pair,
+					 sizeof(vtpm_pair)))
+				rc = -EFAULT;
+			vtpm_dev_put(vtpm_dev);
+			return rc;
+		}
+		return -EINVAL;
+
+	case VTPM_GET_TPMDEV:
+		vtpm_pair_p = argp;
+		if (copy_from_user(&vtpm_pair, vtpm_pair_p, sizeof(vtpm_pair)))
+			return -EFAULT;
+
+		if (vtpm_pair.vtpm_dev_num != VTPM_DEV_NUM_INVALID) {
+			vtpm_dev =
+			    vtpm_dev_get_by_vtpm_devnum(vtpm_pair.vtpm_dev_num);
+			if (!vtpm_dev)
+				return -EINVAL;
+
+			vtpm_pair.tpm_dev_num = vtpm_dev->chip->dev_num;
+
+			if (copy_to_user(vtpm_pair_p, &vtpm_pair,
+					 sizeof(vtpm_pair)))
+				rc = -EFAULT;
+			vtpm_dev_put(vtpm_dev);
+			return rc;
+		}
+		return -EINVAL;
+
+	default:
+		return -EINVAL;
+	}
+}
+
+#ifdef CONFIG_COMPAT
+static long vtpmx_fops_compat_ioctl(struct file *f, unsigned int ioctl,
+				    unsigned long arg)
+{
+	return vtpmx_fops_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
+}
+#endif
+
+static const struct file_operations vtpmx_fops = {
+	.owner = THIS_MODULE,
+	.unlocked_ioctl = vtpmx_fops_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl = vtpmx_fops_compat_ioctl,
+#endif
+	.llseek = noop_llseek,
+};
+
+static struct miscdevice vtpmx_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "vtpmx",
+	.fops = &vtpmx_fops,
+};
+
+static int vtpmx_init(void)
+{
+	return misc_register(&vtpmx_miscdev);
+}
+
+static void vtpmx_cleanup(void)
+{
+	misc_deregister(&vtpmx_miscdev);
+}
+
+static int __init vtpm_module_init(void)
+{
+	int rc;
+
+	vtpm_class = class_create(THIS_MODULE, "vtpm");
+	if (IS_ERR(vtpm_class)) {
+		pr_err("couldn't create vtpm class\n");
+		return PTR_ERR(vtpm_class);
+	}
+
+	rc = alloc_chrdev_region(&vtpm_devt, 0, VTPM_NUM_DEVICES, "vtpm");
+	if (rc < 0) {
+		pr_err("failed to allocate char dev region\n");
+		goto err_alloc_reg;
+	}
+
+	rc = vtpmx_init();
+	if (rc) {
+		pr_err("couldn't create vtpmx device\n");
+		goto err_vtpmx;
+	}
+
+	rc = vtpm_init();
+	if (rc) {
+		pr_err("couldn't init vtpm layer\n");
+		goto err_vtpm;
+	}
+
+	return 0;
+
+err_vtpm:
+	vtpmx_cleanup();
+
+err_vtpmx:
+	unregister_chrdev_region(vtpm_devt, VTPM_NUM_DEVICES);
+
+err_alloc_reg:
+	class_destroy(vtpm_class);
+
+	return rc;
+}
+
+static void __exit vtpm_module_exit(void)
+{
+	vtpm_cleanup();
+	vtpmx_cleanup();
+	unregister_chrdev_region(vtpm_devt, VTPM_NUM_DEVICES);
+	class_destroy(vtpm_class);
+}
+
+subsys_initcall(vtpm_module_init);
+module_exit(vtpm_module_exit);
+
+MODULE_AUTHOR("Stefan Berger (stefanb@us.ibm.com)");
+MODULE_DESCRIPTION("vTPM Driver");
+MODULE_VERSION("0.1");
+MODULE_LICENSE("GPL");
diff --git a/drivers/char/tpm/tpm-vtpm.h b/drivers/char/tpm/tpm-vtpm.h
new file mode 100644
index 0000000..8649867
--- /dev/null
+++ b/drivers/char/tpm/tpm-vtpm.h
@@ -0,0 +1,58 @@ 
+/*
+ * Copyright (C) 2015 IBM Corporation
+ *
+ * Author: Stefan Berger <stefanb@us.ibm.com>
+ *
+ * Maintained by: <tpmdd-devel@lists.sourceforge.net>
+ *
+ * Device driver for vTPM.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ */
+
+#ifndef __TPM_VTPM_H
+#define __TPM_VTPM_H
+
+#include "tpm.h"
+
+#define VTPM_NUM_DEVICES TPM_NUM_DEVICES
+
+struct vtpm_dev {
+        struct kref kref;
+
+        struct device *pdev;
+        struct device dev;
+        struct cdev cdev;
+
+        struct tpm_chip *chip;
+
+        u32 flags;
+
+        int dev_num;
+        char devname[VTPM_DEVNAME_MAX];
+
+        long state;
+#define STATE_OPENED_BIT   0
+
+        spinlock_t buf_lock;         /* lock for buffers */
+
+        wait_queue_head_t wq;
+
+        size_t req_len;              /* length of queued TPM request */
+        u8 req_buf[TPM_BUFSIZE];     /* request buffer */
+
+        size_t resp_len;             /* length of queued TPM response */
+        u8 resp_buf[TPM_BUFSIZE];    /* request buffer */
+
+        struct list_head list;
+};
+
+struct file_priv {
+        struct vtpm_dev *vtpm_dev;
+};
+
+#endif
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index c2e5d6c..c194d61 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -449,6 +449,7 @@  header-y += virtio_scsi.h
 header-y += virtio_types.h
 header-y += vm_sockets.h
 header-y += vt.h
+header-y += vtpm.h
 header-y += wait.h
 header-y += wanrouter.h
 header-y += watchdog.h
diff --git a/include/uapi/linux/vtpm.h b/include/uapi/linux/vtpm.h
new file mode 100644
index 0000000..460282b
--- /dev/null
+++ b/include/uapi/linux/vtpm.h
@@ -0,0 +1,52 @@ 
+/*
+ * Definitions for the VTPM interface
+ * Copyright (c) 2015, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _UAPI_LINUX_VTPM_H
+#define _UAPI_LINUX_VTPM_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/* ioctls */
+#define VTPM_TPM 0xa0
+
+#define VTPM_DEVNAME_MAX   16
+
+#define VTPM_DEV_PREFIX_SERVER "vtpms"  /* server-side device name prefix */
+#define VTPM_DEV_PREFIX_CLIENT "vtpmc"  /* client-side device name prefix */
+
+#define VTPM_DEV_NUM_INVALID  ~0
+
+struct vtpm_new_pair {
+        __u32 flags;         /* input */
+        __u32 tpm_dev_num;   /* output */
+        __u32 vtpm_dev_num;  /* output */
+};
+
+struct vtpm_pair {
+        __u32 tpm_dev_num;   /* input or output */
+        __u32 vtpm_dev_num;  /* input or output */
+};
+
+/* above flags */
+#define VTPM_FLAG_TPM2           1  /* choose a TPM2; mainly for sysfs entries */
+#define VTPM_FLAG_KEEP_DEVPAIR   2  /* keep the device pair once vTPM closes */
+
+/* create new TPM device pair */
+#define VTPM_NEW_DEV         _IOW(VTPM_TPM, 0x00, struct vtpm_new_pair)
+#define VTPM_DEL_DEV         _IOW(VTPM_TPM, 0x01, struct vtpm_pair)
+#define VTPM_GET_TPMDEV      _IOW(VTPM_TPM, 0x02, struct vtpm_pair)
+#define VTPM_GET_VTPMDEV     _IOW(VTPM_TPM, 0x03, struct vtpm_pair)
+
+#endif /* _UAPI_LINUX_VTPM_H */