[RFC] qid path collision issues in 9pfs

Message ID 081955e1-84ec-4877-72d4-f4e8b46be350@huawei.com
State New
Headers show
Series
  • [RFC] qid path collision issues in 9pfs
Related show

Commit Message

Antonios Motakis Jan. 12, 2018, 11:32 a.m.
Hello all,

We have found an issue in the 9p implementation of QEMU, with how qid paths are generated, which can cause qid path collisions and several issues caused by them. In our use case (running containers under VMs) these have proven to be critical.

In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the inode number of the file as input. According to the 9p spec the path should be able to uniquely identify a file, distinct files should not share a path value.

The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.

How to demonstrate the issue:
1) Prepare a problematic share:
 - mount one partition under share/p1/ with some files inside
 - mount another one *with identical contents* under share/p2/
 - confirm that both partitions have files with same inode nr, size, etc
2) Demonstrate breakage:
 - start a VM with a virtio-9p pointing to the share
 - mount 9p share with FSCACHE on
 - keep open share/p1/file
 - open and write to share/p2/file

What should happen is, the guest will consider share/p1/file and share/p2/file to be the same file, and since we are using the cache it will not reopen it. We intended to write to partition 2, but we just wrote to partition 1. This is just one example on how the guest may rely on qid paths being unique.

In the use case of containers where we commonly have a few containers per VM, all based on similar images, these kind of qid path collisions are very common and they seem to cause all kinds of funny behavior (sometimes very subtle).

To avoid this situation, the device id of a file needs to be also taken as input for generating a qid path. Unfortunately, the size of both inode nr + device id together would be 96 bits, while we have only 64 bits for the qid path, so we can't just append them and call it a day :(

We have thought of a few approaches, but we would definitely like to hear what the upstream maintainers and community think:

* Full fix: Change the 9p protocol

We would need to support a longer qid path, based on a virtio feature flag. This would take reworking of host and guest parts of virtio-9p, so both QEMU and Linux for most users.

* Fallback and/or interim solutions

A virtio feature flag may be refused by the guest, so we think we still need to make collisions less likely even with 64 bit paths. E.g.
1. XOR the device id with inode nr to produce the qid path (we attach a proof of concept patch)
2. 64 bit hash of device id and inode nr
3. other ideas, such as allocating new qid paths and keep a look up table of some sorts (possibly too expensive)

With our proof of concept patch, the issues caused by qid path collisions go away, so it can be seen as an interim solution of sorts. However, the chance of collisions is not eliminated, we are just replacing the current strategy, which is almost guaranteed to cause collisions in certain use cases, with one that makes them more rare. We think that a virtio feature flag for longer qid paths is the only way to eliminate these issues completely.

This is the extent that we were able to analyze the issue from our side, we would like to hear if there are some better ideas, or which approach would be more appropriate for upstream.

Best regards,

Comments

Daniel P. Berrange Jan. 12, 2018, 2:27 p.m. | #1
On Fri, Jan 12, 2018 at 07:32:10PM +0800, Antonios Motakis wrote:
> Hello all,
> 
> We have found an issue in the 9p implementation of QEMU, with how qid paths
> are generated, which can cause qid path collisions and several issues caused
> by them. In our use case (running containers under VMs) these have proven to
> be critical.
> 
> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the
> inode number of the file as input. According to the 9p spec the path should
> be able to uniquely identify a file, distinct files should not share a path
> value.
> 
> The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.
> 
> How to demonstrate the issue:
> 1) Prepare a problematic share:
>  - mount one partition under share/p1/ with some files inside
>  - mount another one *with identical contents* under share/p2/
>  - confirm that both partitions have files with same inode nr, size, etc
> 2) Demonstrate breakage:
>  - start a VM with a virtio-9p pointing to the share
>  - mount 9p share with FSCACHE on
>  - keep open share/p1/file
>  - open and write to share/p2/file
> 
> What should happen is, the guest will consider share/p1/file and
> share/p2/file to be the same file, and since we are using the cache it will
> not reopen it. We intended to write to partition 2, but we just wrote to
> partition 1. This is just one example on how the guest may rely on qid paths
> being unique.

Ouch, this is a serious security bug. The guest user who has p1/file open may
have different privilege level to the guest user who tried to access p2/file.
So this flaw can be used for privilege escalation and/or confinement escape
by a malicious guest user.

> We have thought of a few approaches, but we would definitely like to hear
> what the upstream maintainers and community think:
> 
> * Full fix: Change the 9p protocol
> 
> We would need to support a longer qid path, based on a virtio feature flag.
> This would take reworking of host and guest parts of virtio-9p, so both QEMU
> and Linux for most users.

Requiring updated guest feels unpleasant because of the co-ordination required
to get the security fix applied. So even if we do that, IMHO, we must also
have a fix in QEMU that does not require guest update.

> 
> * Fallback and/or interim solutions
> 
> A virtio feature flag may be refused by the guest, so we think we still need
> to make collisions less likely even with 64 bit paths. E.g.
> 1. XOR the device id with inode nr to produce the qid path (we attach a proof
> of concept patch)

That just makes collision less likely, but doesnt guarantee its eliminated
and its probably practical for guests to bruteforce collisions depending on
how many different FS are passed through & number of files that can be
opened concurrently

> 2. 64 bit hash of device id and inode nr

This is probably better than (1), but a 64-bit hash algorithm is fairly weak
wrt collision resistance when you have a large number of files.

> 3. other ideas, such as allocating new qid paths and keep a look up table of
> some sorts (possibly too expensive)

I think QEMU will have to maintain a lookup table of files that are currently
open, and their associated qids, in order to provide a real guarantee that
the security flaw is addressed.

Regards,
Daniel
Veaceslav Falico Jan. 12, 2018, 3:05 p.m. | #2
(sorry for top posting and outlook, sigh...)
(adding v9fs-developer list to the loop)

The crucial part here is fscache, which stores inodes guest-side by key/aux
from qid (key = path, aux = version).

Without fscache, there would always be a directory traversal and we're safe
from *this kind* of situation.

With fscache, first inode to get in cache will always respond to inode number
and version pair, even with different device ids.

If we would be able to somehow get/provide device id from host to guest,
a simple modification of fscache aux verification handling would suffice.

However, I can't currently come up with an idea on how exactly could we get
that information guest-side. Maybe use dentry's qid->version (not used
currently) to transmit this, and populate v9fs inodes' caches during 
walk/readdir/etc.?

-----Original Message-----
From: Daniel P. Berrange [mailto:berrange@redhat.com] 

Sent: Friday, January 12, 2018 3:27 PM
To: Antonios Motakis (Tony)
Cc: qemu-devel@nongnu.org; Greg Kurz; zhangwei (CR); Veaceslav Falico; Eduard Shishkin; Wangguoli (Andy); Jiangyiwen; vfalico@gmail.com; Jani Kokkonen
Subject: Re: [Qemu-devel] [RFC] qid path collision issues in 9pfs

On Fri, Jan 12, 2018 at 07:32:10PM +0800, Antonios Motakis wrote:
> Hello all,

> 

> We have found an issue in the 9p implementation of QEMU, with how qid paths

> are generated, which can cause qid path collisions and several issues caused

> by them. In our use case (running containers under VMs) these have proven to

> be critical.

> 

> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the

> inode number of the file as input. According to the 9p spec the path should

> be able to uniquely identify a file, distinct files should not share a path

> value.

> 

> The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.

> 

> How to demonstrate the issue:

> 1) Prepare a problematic share:

>  - mount one partition under share/p1/ with some files inside

>  - mount another one *with identical contents* under share/p2/

>  - confirm that both partitions have files with same inode nr, size, etc

> 2) Demonstrate breakage:

>  - start a VM with a virtio-9p pointing to the share

>  - mount 9p share with FSCACHE on

>  - keep open share/p1/file

>  - open and write to share/p2/file

> 

> What should happen is, the guest will consider share/p1/file and

> share/p2/file to be the same file, and since we are using the cache it will

> not reopen it. We intended to write to partition 2, but we just wrote to

> partition 1. This is just one example on how the guest may rely on qid paths

> being unique.


Ouch, this is a serious security bug. The guest user who has p1/file open may
have different privilege level to the guest user who tried to access p2/file.
So this flaw can be used for privilege escalation and/or confinement escape
by a malicious guest user.

> We have thought of a few approaches, but we would definitely like to hear

> what the upstream maintainers and community think:

> 

> * Full fix: Change the 9p protocol

> 

> We would need to support a longer qid path, based on a virtio feature flag.

> This would take reworking of host and guest parts of virtio-9p, so both QEMU

> and Linux for most users.


Requiring updated guest feels unpleasant because of the co-ordination required
to get the security fix applied. So even if we do that, IMHO, we must also
have a fix in QEMU that does not require guest update.

> 

> * Fallback and/or interim solutions

> 

> A virtio feature flag may be refused by the guest, so we think we still need

> to make collisions less likely even with 64 bit paths. E.g.

> 1. XOR the device id with inode nr to produce the qid path (we attach a proof

> of concept patch)


That just makes collision less likely, but doesnt guarantee its eliminated
and its probably practical for guests to bruteforce collisions depending on
how many different FS are passed through & number of files that can be
opened concurrently

> 2. 64 bit hash of device id and inode nr


This is probably better than (1), but a 64-bit hash algorithm is fairly weak
wrt collision resistance when you have a large number of files.

> 3. other ideas, such as allocating new qid paths and keep a look up table of

> some sorts (possibly too expensive)


I think QEMU will have to maintain a lookup table of files that are currently
open, and their associated qids, in order to provide a real guarantee that
the security flaw is addressed.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Greg Kurz Jan. 12, 2018, 4:14 p.m. | #3
On Fri, 12 Jan 2018 19:32:10 +0800
Antonios Motakis <antonios.motakis@huawei.com> wrote:

> Hello all,
> 

Hi Antonios,

I see you have attached a patch to this email... this really isn't the preferred
way to do things since it prevents to comment the patch (at least with my mail
client). The appropriate way would have been to send the patch with a cover
letter, using git-send-email for example.

> We have found an issue in the 9p implementation of QEMU, with how qid paths are generated, which can cause qid path collisions and several issues caused by them. In our use case (running containers under VMs) these have proven to be critical.
> 

Ouch...

> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the inode number of the file as input. According to the 9p spec the path should be able to uniquely identify a file, distinct files should not share a path value.
> 
> The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.
> 
> How to demonstrate the issue:
> 1) Prepare a problematic share:
>  - mount one partition under share/p1/ with some files inside
>  - mount another one *with identical contents* under share/p2/
>  - confirm that both partitions have files with same inode nr, size, etc
> 2) Demonstrate breakage:
>  - start a VM with a virtio-9p pointing to the share
>  - mount 9p share with FSCACHE on
>  - keep open share/p1/file
>  - open and write to share/p2/file
> 
> What should happen is, the guest will consider share/p1/file and share/p2/file to be the same file, and since we are using the cache it will not reopen it. We intended to write to partition 2, but we just wrote to partition 1. This is just one example on how the guest may rely on qid paths being unique.
> 
> In the use case of containers where we commonly have a few containers per VM, all based on similar images, these kind of qid path collisions are very common and they seem to cause all kinds of funny behavior (sometimes very subtle).
> 
> To avoid this situation, the device id of a file needs to be also taken as input for generating a qid path. Unfortunately, the size of both inode nr + device id together would be 96 bits, while we have only 64 bits for the qid path, so we can't just append them and call it a day :(
> 
> We have thought of a few approaches, but we would definitely like to hear what the upstream maintainers and community think:
> 
> * Full fix: Change the 9p protocol
> 
> We would need to support a longer qid path, based on a virtio feature flag. This would take reworking of host and guest parts of virtio-9p, so both QEMU and Linux for most users.
> 

I agree for a longer qid path, but we shouldn't tie it to a virtio flag since
9p is transport agnostic. And it happens to be used with a variety of transports.
QEMU has both virtio-9p and a Xen backend for example.

> * Fallback and/or interim solutions
> 
> A virtio feature flag may be refused by the guest, so we think we still need to make collisions less likely even with 64 bit paths. E.g.

In all cases, we would need a fallback solution to support current
guest setups. Also there are several 9p server implementations out
there (ganesha, diod, kvmtool) that are currently used with linux
clients... it will take some time to get everyone in sync :-\

> 1. XOR the device id with inode nr to produce the qid path (we attach a proof of concept patch)

Hmm... this would still allow collisions. Not good for fallback.

> 2. 64 bit hash of device id and inode nr

Same here.

> 3. other ideas, such as allocating new qid paths and keep a look up table of some sorts (possibly too expensive)
> 

This would be acceptable for fallback.

> With our proof of concept patch, the issues caused by qid path collisions go away, so it can be seen as an interim solution of sorts. However, the chance of collisions is not eliminated, we are just replacing the current strategy, which is almost guaranteed to cause collisions in certain use cases, with one that makes them more rare. We think that a virtio feature flag for longer qid paths is the only way to eliminate these issues completely.
> 
> This is the extent that we were able to analyze the issue from our side, we would like to hear if there are some better ideas, or which approach would be more appropriate for upstream.
> 
> Best regards,
> 

Cheers,

--
Greg
Greg Kurz Jan. 12, 2018, 4:25 p.m. | #4
On Fri, 12 Jan 2018 14:27:22 +0000
"Daniel P. Berrange" <berrange@redhat.com> wrote:

> On Fri, Jan 12, 2018 at 07:32:10PM +0800, Antonios Motakis wrote:
> > Hello all,
> > 
> > We have found an issue in the 9p implementation of QEMU, with how qid paths
> > are generated, which can cause qid path collisions and several issues caused
> > by them. In our use case (running containers under VMs) these have proven to
> > be critical.
> > 
> > In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the
> > inode number of the file as input. According to the 9p spec the path should
> > be able to uniquely identify a file, distinct files should not share a path
> > value.
> > 
> > The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.
> > 
> > How to demonstrate the issue:
> > 1) Prepare a problematic share:
> >  - mount one partition under share/p1/ with some files inside
> >  - mount another one *with identical contents* under share/p2/
> >  - confirm that both partitions have files with same inode nr, size, etc
> > 2) Demonstrate breakage:
> >  - start a VM with a virtio-9p pointing to the share
> >  - mount 9p share with FSCACHE on
> >  - keep open share/p1/file
> >  - open and write to share/p2/file
> > 
> > What should happen is, the guest will consider share/p1/file and
> > share/p2/file to be the same file, and since we are using the cache it will
> > not reopen it. We intended to write to partition 2, but we just wrote to
> > partition 1. This is just one example on how the guest may rely on qid paths
> > being unique.  
> 
> Ouch, this is a serious security bug. The guest user who has p1/file open may
> have different privilege level to the guest user who tried to access p2/file.
> So this flaw can be used for privilege escalation and/or confinement escape
> by a malicious guest user.
> 

Yeah, you're right... Yet another New Year gift, like last year :-\

> > We have thought of a few approaches, but we would definitely like to hear
> > what the upstream maintainers and community think:
> > 
> > * Full fix: Change the 9p protocol
> > 
> > We would need to support a longer qid path, based on a virtio feature flag.
> > This would take reworking of host and guest parts of virtio-9p, so both QEMU
> > and Linux for most users.  
> 
> Requiring updated guest feels unpleasant because of the co-ordination required
> to get the security fix applied. So even if we do that, IMHO, we must also
> have a fix in QEMU that does not require guest update.
> 

Yes, since the linux 9p client is used with a variety of 9p servers, we
cannot wait for everyone to be in sync. We definitely need a standalone
fallback in QEMU.

> > 
> > * Fallback and/or interim solutions
> > 
> > A virtio feature flag may be refused by the guest, so we think we still need
> > to make collisions less likely even with 64 bit paths. E.g.
> > 1. XOR the device id with inode nr to produce the qid path (we attach a proof
> > of concept patch)  
> 
> That just makes collision less likely, but doesnt guarantee its eliminated
> and its probably practical for guests to bruteforce collisions depending on
> how many different FS are passed through & number of files that can be
> opened concurrently
> 
> > 2. 64 bit hash of device id and inode nr  
> 
> This is probably better than (1), but a 64-bit hash algorithm is fairly weak
> wrt collision resistance when you have a large number of files.
> 
> > 3. other ideas, such as allocating new qid paths and keep a look up table of
> > some sorts (possibly too expensive)  
> 
> I think QEMU will have to maintain a lookup table of files that are currently
> open, and their associated qids, in order to provide a real guarantee that
> the security flaw is addressed.
> 

I cannot think of a better solution right now.

> Regards,
> Daniel

Cheers,

--
Greg
Greg Kurz Jan. 12, 2018, 5 p.m. | #5
On Fri, 12 Jan 2018 15:05:24 +0000
Veaceslav Falico <veaceslav.falico@huawei.com> wrote:

> (sorry for top posting and outlook, sigh...)

Well, as long as you're not sending a patch, we can cope with it... but
maybe you could use your gmail account for subsequent posts ;)

> (adding v9fs-developer list to the loop)
> 

Good idea since the issue affects the protocol.

> The crucial part here is fscache, which stores inodes guest-side by key/aux
> from qid (key = path, aux = version).
> 
> Without fscache, there would always be a directory traversal and we're safe
> from *this kind* of situation.
> 

I'm not familiar with fscache, but the problem described by Antonios is a
protocol violation anyway. The 9p server shouldn't generate colliding
qid paths...

> With fscache, first inode to get in cache will always respond to inode number
> and version pair, even with different device ids.
> 

Are you saying that even if the 9p server provides unique qid paths,
no matter how it is achieved, the linux client would still be confused ?

It looks like a client side issue to me...

> If we would be able to somehow get/provide device id from host to guest,
> a simple modification of fscache aux verification handling would suffice.
> 
> However, I can't currently come up with an idea on how exactly could we get
> that information guest-side. Maybe use dentry's qid->version (not used
> currently) to transmit this, and populate v9fs inodes' caches during 
> walk/readdir/etc.?
> 

I'm a bit lost here :)


Cheers,

--
Greg

> -----Original Message-----
> From: Daniel P. Berrange [mailto:berrange@redhat.com] 
> Sent: Friday, January 12, 2018 3:27 PM
> To: Antonios Motakis (Tony)
> Cc: qemu-devel@nongnu.org; Greg Kurz; zhangwei (CR); Veaceslav Falico; Eduard Shishkin; Wangguoli (Andy); Jiangyiwen; vfalico@gmail.com; Jani Kokkonen
> Subject: Re: [Qemu-devel] [RFC] qid path collision issues in 9pfs
> 
> On Fri, Jan 12, 2018 at 07:32:10PM +0800, Antonios Motakis wrote:
> > Hello all,
> > 
> > We have found an issue in the 9p implementation of QEMU, with how qid paths
> > are generated, which can cause qid path collisions and several issues caused
> > by them. In our use case (running containers under VMs) these have proven to
> > be critical.
> > 
> > In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the
> > inode number of the file as input. According to the 9p spec the path should
> > be able to uniquely identify a file, distinct files should not share a path
> > value.
> > 
> > The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.
> > 
> > How to demonstrate the issue:
> > 1) Prepare a problematic share:
> >  - mount one partition under share/p1/ with some files inside
> >  - mount another one *with identical contents* under share/p2/
> >  - confirm that both partitions have files with same inode nr, size, etc
> > 2) Demonstrate breakage:
> >  - start a VM with a virtio-9p pointing to the share
> >  - mount 9p share with FSCACHE on
> >  - keep open share/p1/file
> >  - open and write to share/p2/file
> > 
> > What should happen is, the guest will consider share/p1/file and
> > share/p2/file to be the same file, and since we are using the cache it will
> > not reopen it. We intended to write to partition 2, but we just wrote to
> > partition 1. This is just one example on how the guest may rely on qid paths
> > being unique.  
> 
> Ouch, this is a serious security bug. The guest user who has p1/file open may
> have different privilege level to the guest user who tried to access p2/file.
> So this flaw can be used for privilege escalation and/or confinement escape
> by a malicious guest user.
> 
> > We have thought of a few approaches, but we would definitely like to hear
> > what the upstream maintainers and community think:
> > 
> > * Full fix: Change the 9p protocol
> > 
> > We would need to support a longer qid path, based on a virtio feature flag.
> > This would take reworking of host and guest parts of virtio-9p, so both QEMU
> > and Linux for most users.  
> 
> Requiring updated guest feels unpleasant because of the co-ordination required
> to get the security fix applied. So even if we do that, IMHO, we must also
> have a fix in QEMU that does not require guest update.
> 
> > 
> > * Fallback and/or interim solutions
> > 
> > A virtio feature flag may be refused by the guest, so we think we still need
> > to make collisions less likely even with 64 bit paths. E.g.
> > 1. XOR the device id with inode nr to produce the qid path (we attach a proof
> > of concept patch)  
> 
> That just makes collision less likely, but doesnt guarantee its eliminated
> and its probably practical for guests to bruteforce collisions depending on
> how many different FS are passed through & number of files that can be
> opened concurrently
> 
> > 2. 64 bit hash of device id and inode nr  
> 
> This is probably better than (1), but a 64-bit hash algorithm is fairly weak
> wrt collision resistance when you have a large number of files.
> 
> > 3. other ideas, such as allocating new qid paths and keep a look up table of
> > some sorts (possibly too expensive)  
> 
> I think QEMU will have to maintain a lookup table of files that are currently
> open, and their associated qids, in order to provide a real guarantee that
> the security flaw is addressed.
> 
> Regards,
> Daniel
Antonios Motakis Jan. 15, 2018, 3:49 a.m. | #6
On 13-Jan-18 00:14, Greg Kurz wrote:
> On Fri, 12 Jan 2018 19:32:10 +0800
> Antonios Motakis <antonios.motakis@huawei.com> wrote:
> 
>> Hello all,
>>
> 
> Hi Antonios,
> 
> I see you have attached a patch to this email... this really isn't the preferred
> way to do things since it prevents to comment the patch (at least with my mail
> client). The appropriate way would have been to send the patch with a cover
> letter, using git-send-email for example.

I apologize for attaching the patch, I should have known better!

> 
>> We have found an issue in the 9p implementation of QEMU, with how qid paths are generated, which can cause qid path collisions and several issues caused by them. In our use case (running containers under VMs) these have proven to be critical.
>>
> 
> Ouch...
> 
>> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the inode number of the file as input. According to the 9p spec the path should be able to uniquely identify a file, distinct files should not share a path value.
>>
>> The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.
>>
>> How to demonstrate the issue:
>> 1) Prepare a problematic share:
>>  - mount one partition under share/p1/ with some files inside
>>  - mount another one *with identical contents* under share/p2/
>>  - confirm that both partitions have files with same inode nr, size, etc
>> 2) Demonstrate breakage:
>>  - start a VM with a virtio-9p pointing to the share
>>  - mount 9p share with FSCACHE on
>>  - keep open share/p1/file
>>  - open and write to share/p2/file
>>
>> What should happen is, the guest will consider share/p1/file and share/p2/file to be the same file, and since we are using the cache it will not reopen it. We intended to write to partition 2, but we just wrote to partition 1. This is just one example on how the guest may rely on qid paths being unique.
>>
>> In the use case of containers where we commonly have a few containers per VM, all based on similar images, these kind of qid path collisions are very common and they seem to cause all kinds of funny behavior (sometimes very subtle).
>>
>> To avoid this situation, the device id of a file needs to be also taken as input for generating a qid path. Unfortunately, the size of both inode nr + device id together would be 96 bits, while we have only 64 bits for the qid path, so we can't just append them and call it a day :(
>>
>> We have thought of a few approaches, but we would definitely like to hear what the upstream maintainers and community think:
>>
>> * Full fix: Change the 9p protocol
>>
>> We would need to support a longer qid path, based on a virtio feature flag. This would take reworking of host and guest parts of virtio-9p, so both QEMU and Linux for most users.
>>
> 
> I agree for a longer qid path, but we shouldn't tie it to a virtio flag since
> 9p is transport agnostic. And it happens to be used with a variety of transports.
> QEMU has both virtio-9p and a Xen backend for example.
> 
>> * Fallback and/or interim solutions
>>
>> A virtio feature flag may be refused by the guest, so we think we still need to make collisions less likely even with 64 bit paths. E.g.
> 
> In all cases, we would need a fallback solution to support current
> guest setups. Also there are several 9p server implementations out
> there (ganesha, diod, kvmtool) that are currently used with linux
> clients... it will take some time to get everyone in sync :-\
> 
>> 1. XOR the device id with inode nr to produce the qid path (we attach a proof of concept patch)
> 
> Hmm... this would still allow collisions. Not good for fallback.
> 
>> 2. 64 bit hash of device id and inode nr
> 
> Same here.
> 
>> 3. other ideas, such as allocating new qid paths and keep a look up table of some sorts (possibly too expensive)
>>
> 
> This would be acceptable for fallback.

Maybe we can use the QEMU hash table (https://github.com/qemu/qemu/blob/master/util/qht.c) but I wonder if it scales to millions of entries. Do you think it is appropriate in this case?

I was thinking on how to implement something like this, without having to maintain millions of entries... One option we could consider is to split the bits into a directly-mapped part, and a lookup part. For example:

Inode =
[ 10 first bits ] + [ 54 lowest bits ]

A hash table maps [ inode 10 first bits ] + [ device id ] => [ 10 bit prefix ]
The prefix is uniquely allocated for each input.

Qid path = 
[ 10 bit prefix ] + [ inode 54 lowest bits ]

Since inodes are not completely random, and we usually have a handful of device IDs, we get a much smaller number of entries to track in the hash table.

So what this would give:
(1)	Would be faster and take less memory than mapping the full inode_nr,devi_id tuple to unique QID paths
(2)	Guaranteed not to run out of bits when inode numbers stay below the lowest 54 bits and we have less than 1024 devices.
(3)	When we get beyond this this limit, there is a chance we run out of bits to allocate new QID paths, but we can detect this and refuse to serve the offending files instead of allowing a collision.

We could tweak the prefix size to match the scenarios that we consider more likely, but I think close to 10-16 bits sounds reasonable enough. What do you think?

> 
>> With our proof of concept patch, the issues caused by qid path collisions go away, so it can be seen as an interim solution of sorts. However, the chance of collisions is not eliminated, we are just replacing the current strategy, which is almost guaranteed to cause collisions in certain use cases, with one that makes them more rare. We think that a virtio feature flag for longer qid paths is the only way to eliminate these issues completely.
>>
>> This is the extent that we were able to analyze the issue from our side, we would like to hear if there are some better ideas, or which approach would be more appropriate for upstream.
>>
>> Best regards,
>>
> 
> Cheers,
> 
> --
> Greg
>
Greg Kurz Jan. 19, 2018, 10:27 a.m. | #7
On Mon, 15 Jan 2018 11:49:31 +0800
Antonios Motakis <antonios.motakis@huawei.com> wrote:

> On 13-Jan-18 00:14, Greg Kurz wrote:
> > On Fri, 12 Jan 2018 19:32:10 +0800
> > Antonios Motakis <antonios.motakis@huawei.com> wrote:
> >   
> >> Hello all,
> >>  
> > 
> > Hi Antonios,
> > 
> > I see you have attached a patch to this email... this really isn't the preferred
> > way to do things since it prevents to comment the patch (at least with my mail
> > client). The appropriate way would have been to send the patch with a cover
> > letter, using git-send-email for example.  
> 
> I apologize for attaching the patch, I should have known better!
> 

np :)

> >   
> >> We have found an issue in the 9p implementation of QEMU, with how qid paths are generated, which can cause qid path collisions and several issues caused by them. In our use case (running containers under VMs) these have proven to be critical.
> >>  
> > 
> > Ouch...
> >   
> >> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the inode number of the file as input. According to the 9p spec the path should be able to uniquely identify a file, distinct files should not share a path value.
> >>
> >> The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.
> >>
> >> How to demonstrate the issue:
> >> 1) Prepare a problematic share:
> >>  - mount one partition under share/p1/ with some files inside
> >>  - mount another one *with identical contents* under share/p2/
> >>  - confirm that both partitions have files with same inode nr, size, etc
> >> 2) Demonstrate breakage:
> >>  - start a VM with a virtio-9p pointing to the share
> >>  - mount 9p share with FSCACHE on
> >>  - keep open share/p1/file
> >>  - open and write to share/p2/file
> >>
> >> What should happen is, the guest will consider share/p1/file and share/p2/file to be the same file, and since we are using the cache it will not reopen it. We intended to write to partition 2, but we just wrote to partition 1. This is just one example on how the guest may rely on qid paths being unique.
> >>
> >> In the use case of containers where we commonly have a few containers per VM, all based on similar images, these kind of qid path collisions are very common and they seem to cause all kinds of funny behavior (sometimes very subtle).
> >>
> >> To avoid this situation, the device id of a file needs to be also taken as input for generating a qid path. Unfortunately, the size of both inode nr + device id together would be 96 bits, while we have only 64 bits for the qid path, so we can't just append them and call it a day :(
> >>
> >> We have thought of a few approaches, but we would definitely like to hear what the upstream maintainers and community think:
> >>
> >> * Full fix: Change the 9p protocol
> >>
> >> We would need to support a longer qid path, based on a virtio feature flag. This would take reworking of host and guest parts of virtio-9p, so both QEMU and Linux for most users.
> >>  
> > 
> > I agree for a longer qid path, but we shouldn't tie it to a virtio flag since
> > 9p is transport agnostic. And it happens to be used with a variety of transports.
> > QEMU has both virtio-9p and a Xen backend for example.
> >   
> >> * Fallback and/or interim solutions
> >>
> >> A virtio feature flag may be refused by the guest, so we think we still need to make collisions less likely even with 64 bit paths. E.g.  
> > 
> > In all cases, we would need a fallback solution to support current
> > guest setups. Also there are several 9p server implementations out
> > there (ganesha, diod, kvmtool) that are currently used with linux
> > clients... it will take some time to get everyone in sync :-\
> >   
> >> 1. XOR the device id with inode nr to produce the qid path (we attach a proof of concept patch)  
> > 
> > Hmm... this would still allow collisions. Not good for fallback.
> >   
> >> 2. 64 bit hash of device id and inode nr  
> > 
> > Same here.
> >   
> >> 3. other ideas, such as allocating new qid paths and keep a look up table of some sorts (possibly too expensive)
> >>  
> > 
> > This would be acceptable for fallback.  
> 
> Maybe we can use the QEMU hash table (https://github.com/qemu/qemu/blob/master/util/qht.c) but I wonder if it scales to millions of entries. Do you think it is appropriate in this case?
> 

I don't know QHT, hence Cc'ing Emilio for insights.

> I was thinking on how to implement something like this, without having to maintain millions of entries... One option we could consider is to split the bits into a directly-mapped part, and a lookup part. For example:
> 
> Inode =
> [ 10 first bits ] + [ 54 lowest bits ]
> 
> A hash table maps [ inode 10 first bits ] + [ device id ] => [ 10 bit prefix ]
> The prefix is uniquely allocated for each input.
> 
> Qid path = 
> [ 10 bit prefix ] + [ inode 54 lowest bits ]
> 
> Since inodes are not completely random, and we usually have a handful of device IDs, we get a much smaller number of entries to track in the hash table.
> 
> So what this would give:
> (1)	Would be faster and take less memory than mapping the full inode_nr,devi_id tuple to unique QID paths
> (2)	Guaranteed not to run out of bits when inode numbers stay below the lowest 54 bits and we have less than 1024 devices.
> (3)	When we get beyond this this limit, there is a chance we run out of bits to allocate new QID paths, but we can detect this and refuse to serve the offending files instead of allowing a collision.
> 
> We could tweak the prefix size to match the scenarios that we consider more likely, but I think close to 10-16 bits sounds reasonable enough. What do you think?
> 

I think that should work. Please send a patchset :)

> >   
> >> With our proof of concept patch, the issues caused by qid path collisions go away, so it can be seen as an interim solution of sorts. However, the chance of collisions is not eliminated, we are just replacing the current strategy, which is almost guaranteed to cause collisions in certain use cases, with one that makes them more rare. We think that a virtio feature flag for longer qid paths is the only way to eliminate these issues completely.
> >>
> >> This is the extent that we were able to analyze the issue from our side, we would like to hear if there are some better ideas, or which approach would be more appropriate for upstream.
> >>
> >> Best regards,
> >>  
> > 
> > Cheers,
> > 
> > --
> > Greg
> >   
>
Eduard Shishkin Jan. 19, 2018, 3:52 p.m. | #8
On 1/19/2018 11:27 AM, Greg Kurz wrote:
> On Mon, 15 Jan 2018 11:49:31 +0800
> Antonios Motakis <antonios.motakis@huawei.com> wrote:
>
>> On 13-Jan-18 00:14, Greg Kurz wrote:
>>> On Fri, 12 Jan 2018 19:32:10 +0800
>>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
>>>
>>>> Hello all,
>>>>
>>>
>>> Hi Antonios,
>>>
>>> I see you have attached a patch to this email... this really isn't the preferred
>>> way to do things since it prevents to comment the patch (at least with my mail
>>> client). The appropriate way would have been to send the patch with a cover
>>> letter, using git-send-email for example.
>>
>> I apologize for attaching the patch, I should have known better!
>>
>
> np :)
>
>>>
>>>> We have found an issue in the 9p implementation of QEMU, with how qid paths are generated, which can cause qid path collisions and several issues caused by them. In our use case (running containers under VMs) these have proven to be critical.
>>>>
>>>
>>> Ouch...
>>>
>>>> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the inode number of the file as input. According to the 9p spec the path should be able to uniquely identify a file, distinct files should not share a path value.
>>>>
>>>> The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.
>>>>
>>>> How to demonstrate the issue:
>>>> 1) Prepare a problematic share:
>>>>  - mount one partition under share/p1/ with some files inside
>>>>  - mount another one *with identical contents* under share/p2/
>>>>  - confirm that both partitions have files with same inode nr, size, etc
>>>> 2) Demonstrate breakage:
>>>>  - start a VM with a virtio-9p pointing to the share
>>>>  - mount 9p share with FSCACHE on
>>>>  - keep open share/p1/file
>>>>  - open and write to share/p2/file
>>>>
>>>> What should happen is, the guest will consider share/p1/file and share/p2/file to be the same file, and since we are using the cache it will not reopen it. We intended to write to partition 2, but we just wrote to partition 1. This is just one example on how the guest may rely on qid paths being unique.
>>>>
>>>> In the use case of containers where we commonly have a few containers per VM, all based on similar images, these kind of qid path collisions are very common and they seem to cause all kinds of funny behavior (sometimes very subtle).
>>>>
>>>> To avoid this situation, the device id of a file needs to be also taken as input for generating a qid path. Unfortunately, the size of both inode nr + device id together would be 96 bits, while we have only 64 bits for the qid path, so we can't just append them and call it a day :(
>>>>
>>>> We have thought of a few approaches, but we would definitely like to hear what the upstream maintainers and community think:
>>>>
>>>> * Full fix: Change the 9p protocol
>>>>
>>>> We would need to support a longer qid path, based on a virtio feature flag. This would take reworking of host and guest parts of virtio-9p, so both QEMU and Linux for most users.
>>>>
>>>
>>> I agree for a longer qid path, but we shouldn't tie it to a virtio flag since
>>> 9p is transport agnostic. And it happens to be used with a variety of transports.
>>> QEMU has both virtio-9p and a Xen backend for example.
>>>
>>>> * Fallback and/or interim solutions
>>>>
>>>> A virtio feature flag may be refused by the guest, so we think we still need to make collisions less likely even with 64 bit paths. E.g.
>>>
>>> In all cases, we would need a fallback solution to support current
>>> guest setups. Also there are several 9p server implementations out
>>> there (ganesha, diod, kvmtool) that are currently used with linux
>>> clients... it will take some time to get everyone in sync :-\
>>>
>>>> 1. XOR the device id with inode nr to produce the qid path (we attach a proof of concept patch)
>>>
>>> Hmm... this would still allow collisions. Not good for fallback.
>>>
>>>> 2. 64 bit hash of device id and inode nr
>>>
>>> Same here.
>>>
>>>> 3. other ideas, such as allocating new qid paths and keep a look up table of some sorts (possibly too expensive)
>>>>
>>>
>>> This would be acceptable for fallback.
>>
>> Maybe we can use the QEMU hash table (https://github.com/qemu/qemu/blob/master/util/qht.c) but I wonder if it scales to millions of entries. Do you think it is appropriate in this case?
>>
>
> I don't know QHT, hence Cc'ing Emilio for insights.
>
>> I was thinking on how to implement something like this, without having to maintain millions of entries... One option we could consider is to split the bits into a directly-mapped part, and a lookup part. For example:
>>
>> Inode =
>> [ 10 first bits ] + [ 54 lowest bits ]
>>
>> A hash table maps [ inode 10 first bits ] + [ device id ] => [ 10 bit prefix ]
>> The prefix is uniquely allocated for each input.
>>
>> Qid path =
>> [ 10 bit prefix ] + [ inode 54 lowest bits ]
>>
>> Since inodes are not completely random, and we usually have a handful of device IDs, we get a much smaller number of entries to track in the hash table.
>>
>> So what this would give:
>> (1)	Would be faster and take less memory than mapping the full inode_nr,devi_id tuple to unique QID paths
>> (2)	Guaranteed not to run out of bits when inode numbers stay below the lowest 54 bits and we have less than 1024 devices.
>> (3)	When we get beyond this this limit, there is a chance we run out of bits to allocate new QID paths, but we can detect this and refuse to serve the offending files instead of allowing a collision.
>>
>> We could tweak the prefix size to match the scenarios that we consider more likely, but I think close to 10-16 bits sounds reasonable enough. What do you think?
>>
>
> I think that should work. Please send a patchset :)

Hi Greg,

This is based on the assumption that high bits of inode numbers are
always zero, which is unacceptable from my standpoint. Inode numbers
allocator is a per-volume file system feature, so there may appear
allocators, which don't use simple increment and assign inode number
with non-zero high bits even for the first file of a volume.

As I understand, the problem is that the guest doesn't have enough
information for proper files identification (in our case share/p1/file
and share/p2/file are wrongly identified as the same file). It seems
that we need to transmit device ID to the guest, and not in the high 
bits of the inode number.

AFAIK Slava has patches for such transmission, I hope that he will send
it eventually.

Thanks,
Eduard.

>
>>>
>>>> With our proof of concept patch, the issues caused by qid path collisions go away, so it can be seen as an interim solution of sorts. However, the chance of collisions is not eliminated, we are just replacing the current strategy, which is almost guaranteed to cause collisions in certain use cases, with one that makes them more rare. We think that a virtio feature flag for longer qid paths is the only way to eliminate these issues completely.
>>>>
>>>> This is the extent that we were able to analyze the issue from our side, we would like to hear if there are some better ideas, or which approach would be more appropriate for upstream.
>>>>
>>>> Best regards,
>>>>
>>>
>>> Cheers,
>>>
>>> --
>>> Greg
>>>
>>
>
Greg Kurz Jan. 19, 2018, 4:36 p.m. | #9
On Fri, 19 Jan 2018 16:52:44 +0100
Eduard Shishkin <eduard.shishkin@huawei.com> wrote:

> On 1/19/2018 11:27 AM, Greg Kurz wrote:
> > On Mon, 15 Jan 2018 11:49:31 +0800
> > Antonios Motakis <antonios.motakis@huawei.com> wrote:
> >  
> >> On 13-Jan-18 00:14, Greg Kurz wrote:  
> >>> On Fri, 12 Jan 2018 19:32:10 +0800
> >>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
> >>>  
> >>>> Hello all,
> >>>>  
> >>>
> >>> Hi Antonios,
> >>>
> >>> I see you have attached a patch to this email... this really isn't the preferred
> >>> way to do things since it prevents to comment the patch (at least with my mail
> >>> client). The appropriate way would have been to send the patch with a cover
> >>> letter, using git-send-email for example.  
> >>
> >> I apologize for attaching the patch, I should have known better!
> >>  
> >
> > np :)
> >  
> >>>  
> >>>> We have found an issue in the 9p implementation of QEMU, with how qid paths are generated, which can cause qid path collisions and several issues caused by them. In our use case (running containers under VMs) these have proven to be critical.
> >>>>  
> >>>
> >>> Ouch...
> >>>  
> >>>> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the inode number of the file as input. According to the 9p spec the path should be able to uniquely identify a file, distinct files should not share a path value.
> >>>>
> >>>> The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.
> >>>>
> >>>> How to demonstrate the issue:
> >>>> 1) Prepare a problematic share:
> >>>>  - mount one partition under share/p1/ with some files inside
> >>>>  - mount another one *with identical contents* under share/p2/
> >>>>  - confirm that both partitions have files with same inode nr, size, etc
> >>>> 2) Demonstrate breakage:
> >>>>  - start a VM with a virtio-9p pointing to the share
> >>>>  - mount 9p share with FSCACHE on
> >>>>  - keep open share/p1/file
> >>>>  - open and write to share/p2/file
> >>>>
> >>>> What should happen is, the guest will consider share/p1/file and share/p2/file to be the same file, and since we are using the cache it will not reopen it. We intended to write to partition 2, but we just wrote to partition 1. This is just one example on how the guest may rely on qid paths being unique.
> >>>>
> >>>> In the use case of containers where we commonly have a few containers per VM, all based on similar images, these kind of qid path collisions are very common and they seem to cause all kinds of funny behavior (sometimes very subtle).
> >>>>
> >>>> To avoid this situation, the device id of a file needs to be also taken as input for generating a qid path. Unfortunately, the size of both inode nr + device id together would be 96 bits, while we have only 64 bits for the qid path, so we can't just append them and call it a day :(
> >>>>
> >>>> We have thought of a few approaches, but we would definitely like to hear what the upstream maintainers and community think:
> >>>>
> >>>> * Full fix: Change the 9p protocol
> >>>>
> >>>> We would need to support a longer qid path, based on a virtio feature flag. This would take reworking of host and guest parts of virtio-9p, so both QEMU and Linux for most users.
> >>>>  
> >>>
> >>> I agree for a longer qid path, but we shouldn't tie it to a virtio flag since
> >>> 9p is transport agnostic. And it happens to be used with a variety of transports.
> >>> QEMU has both virtio-9p and a Xen backend for example.
> >>>  
> >>>> * Fallback and/or interim solutions
> >>>>
> >>>> A virtio feature flag may be refused by the guest, so we think we still need to make collisions less likely even with 64 bit paths. E.g.  
> >>>
> >>> In all cases, we would need a fallback solution to support current
> >>> guest setups. Also there are several 9p server implementations out
> >>> there (ganesha, diod, kvmtool) that are currently used with linux
> >>> clients... it will take some time to get everyone in sync :-\
> >>>  
> >>>> 1. XOR the device id with inode nr to produce the qid path (we attach a proof of concept patch)  
> >>>
> >>> Hmm... this would still allow collisions. Not good for fallback.
> >>>  
> >>>> 2. 64 bit hash of device id and inode nr  
> >>>
> >>> Same here.
> >>>  
> >>>> 3. other ideas, such as allocating new qid paths and keep a look up table of some sorts (possibly too expensive)
> >>>>  
> >>>
> >>> This would be acceptable for fallback.  
> >>
> >> Maybe we can use the QEMU hash table (https://github.com/qemu/qemu/blob/master/util/qht.c) but I wonder if it scales to millions of entries. Do you think it is appropriate in this case?
> >>  
> >
> > I don't know QHT, hence Cc'ing Emilio for insights.
> >  
> >> I was thinking on how to implement something like this, without having to maintain millions of entries... One option we could consider is to split the bits into a directly-mapped part, and a lookup part. For example:
> >>
> >> Inode =
> >> [ 10 first bits ] + [ 54 lowest bits ]
> >>
> >> A hash table maps [ inode 10 first bits ] + [ device id ] => [ 10 bit prefix ]
> >> The prefix is uniquely allocated for each input.
> >>
> >> Qid path =
> >> [ 10 bit prefix ] + [ inode 54 lowest bits ]
> >>
> >> Since inodes are not completely random, and we usually have a handful of device IDs, we get a much smaller number of entries to track in the hash table.
> >>
> >> So what this would give:
> >> (1)	Would be faster and take less memory than mapping the full inode_nr,devi_id tuple to unique QID paths
> >> (2)	Guaranteed not to run out of bits when inode numbers stay below the lowest 54 bits and we have less than 1024 devices.
> >> (3)	When we get beyond this this limit, there is a chance we run out of bits to allocate new QID paths, but we can detect this and refuse to serve the offending files instead of allowing a collision.
> >>
> >> We could tweak the prefix size to match the scenarios that we consider more likely, but I think close to 10-16 bits sounds reasonable enough. What do you think?
> >>  
> >
> > I think that should work. Please send a patchset :)  
> 
> Hi Greg,
> 
> This is based on the assumption that high bits of inode numbers are
> always zero, which is unacceptable from my standpoint. Inode numbers
> allocator is a per-volume file system feature, so there may appear
> allocators, which don't use simple increment and assign inode number
> with non-zero high bits even for the first file of a volume.
> 

I agree that inode numbers are random in theory, but this assumption
is likely true for a vast majority of cases.

> As I understand, the problem is that the guest doesn't have enough
> information for proper files identification (in our case share/p1/file
> and share/p2/file are wrongly identified as the same file). It seems
> that we need to transmit device ID to the guest, and not in the high 
> bits of the inode number.
> 

Yes this is what we should do in the long term, but this requires new spec
and coordination between all stakeholders (not only QEMU and linux). This
may take some time.

What we want now is a fallback solution that can be merged in QEMU
right away and works even with unmodified guests. Antonio's proposal
is likely to work for most users, and it will allow to detect collisions
and return an error to the guest in this case.

> AFAIK Slava has patches for such transmission, I hope that he will send
> it eventually.
> 

I'll gladly have a look.

Cheers,

--
Greg

> Thanks,
> Eduard.
> 
> >  
> >>>  
> >>>> With our proof of concept patch, the issues caused by qid path collisions go away, so it can be seen as an interim solution of sorts. However, the chance of collisions is not eliminated, we are just replacing the current strategy, which is almost guaranteed to cause collisions in certain use cases, with one that makes them more rare. We think that a virtio feature flag for longer qid paths is the only way to eliminate these issues completely.
> >>>>
> >>>> This is the extent that we were able to analyze the issue from our side, we would like to hear if there are some better ideas, or which approach would be more appropriate for upstream.
> >>>>
> >>>> Best regards,
> >>>>  
> >>>
> >>> Cheers,
> >>>
> >>> --
> >>> Greg
> >>>  
> >>  
> >  
>
Veaceslav Falico Jan. 19, 2018, 4:37 p.m. | #10
On 1/19/2018 4:52 PM, Eduard Shishkin wrote:
> 
> 
> On 1/19/2018 11:27 AM, Greg Kurz wrote:
>> On Mon, 15 Jan 2018 11:49:31 +0800
>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
>>
>>> On 13-Jan-18 00:14, Greg Kurz wrote:
>>>> On Fri, 12 Jan 2018 19:32:10 +0800
>>>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>
>>>> Hi Antonios,
>>>>
>>>> I see you have attached a patch to this email... this really isn't the preferred
>>>> way to do things since it prevents to comment the patch (at least with my mail
>>>> client). The appropriate way would have been to send the patch with a cover
>>>> letter, using git-send-email for example.
>>>
>>> I apologize for attaching the patch, I should have known better!
>>>
>>
>> np :)
>>
>>>>
>>>>> We have found an issue in the 9p implementation of QEMU, with how qid paths are generated, which can cause qid path collisions and several issues caused by them. In our use case (running containers under VMs) these have proven to be critical.
>>>>>
>>>>
>>>> Ouch...
>>>>
>>>>> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the inode number of the file as input. According to the 9p spec the path should be able to uniquely identify a file, distinct files should not share a path value.
>>>>>
>>>>> The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.
>>>>>
>>>>> How to demonstrate the issue:
>>>>> 1) Prepare a problematic share:
>>>>>  - mount one partition under share/p1/ with some files inside
>>>>>  - mount another one *with identical contents* under share/p2/
>>>>>  - confirm that both partitions have files with same inode nr, size, etc
>>>>> 2) Demonstrate breakage:
>>>>>  - start a VM with a virtio-9p pointing to the share
>>>>>  - mount 9p share with FSCACHE on
>>>>>  - keep open share/p1/file
>>>>>  - open and write to share/p2/file
>>>>>
>>>>> What should happen is, the guest will consider share/p1/file and share/p2/file to be the same file, and since we are using the cache it will not reopen it. We intended to write to partition 2, but we just wrote to partition 1. This is just one example on how the guest may rely on qid paths being unique.
>>>>>
>>>>> In the use case of containers where we commonly have a few containers per VM, all based on similar images, these kind of qid path collisions are very common and they seem to cause all kinds of funny behavior (sometimes very subtle).
>>>>>
>>>>> To avoid this situation, the device id of a file needs to be also taken as input for generating a qid path. Unfortunately, the size of both inode nr + device id together would be 96 bits, while we have only 64 bits for the qid path, so we can't just append them and call it a day :(
>>>>>
>>>>> We have thought of a few approaches, but we would definitely like to hear what the upstream maintainers and community think:
>>>>>
>>>>> * Full fix: Change the 9p protocol
>>>>>
>>>>> We would need to support a longer qid path, based on a virtio feature flag. This would take reworking of host and guest parts of virtio-9p, so both QEMU and Linux for most users.
>>>>>
>>>>
>>>> I agree for a longer qid path, but we shouldn't tie it to a virtio flag since
>>>> 9p is transport agnostic. And it happens to be used with a variety of transports.
>>>> QEMU has both virtio-9p and a Xen backend for example.
>>>>
>>>>> * Fallback and/or interim solutions
>>>>>
>>>>> A virtio feature flag may be refused by the guest, so we think we still need to make collisions less likely even with 64 bit paths. E.g.
>>>>
>>>> In all cases, we would need a fallback solution to support current
>>>> guest setups. Also there are several 9p server implementations out
>>>> there (ganesha, diod, kvmtool) that are currently used with linux
>>>> clients... it will take some time to get everyone in sync :-\
>>>>
>>>>> 1. XOR the device id with inode nr to produce the qid path (we attach a proof of concept patch)
>>>>
>>>> Hmm... this would still allow collisions. Not good for fallback.
>>>>
>>>>> 2. 64 bit hash of device id and inode nr
>>>>
>>>> Same here.
>>>>
>>>>> 3. other ideas, such as allocating new qid paths and keep a look up table of some sorts (possibly too expensive)
>>>>>
>>>>
>>>> This would be acceptable for fallback.
>>>
>>> Maybe we can use the QEMU hash table (https://github.com/qemu/qemu/blob/master/util/qht.c) but I wonder if it scales to millions of entries. Do you think it is appropriate in this case?
>>>
>>
>> I don't know QHT, hence Cc'ing Emilio for insights.
>>
>>> I was thinking on how to implement something like this, without having to maintain millions of entries... One option we could consider is to split the bits into a directly-mapped part, and a lookup part. For example:
>>>
>>> Inode =
>>> [ 10 first bits ] + [ 54 lowest bits ]
>>>
>>> A hash table maps [ inode 10 first bits ] + [ device id ] => [ 10 bit prefix ]
>>> The prefix is uniquely allocated for each input.
>>>
>>> Qid path =
>>> [ 10 bit prefix ] + [ inode 54 lowest bits ]
>>>
>>> Since inodes are not completely random, and we usually have a handful of device IDs, we get a much smaller number of entries to track in the hash table.
>>>
>>> So what this would give:
>>> (1)    Would be faster and take less memory than mapping the full inode_nr,devi_id tuple to unique QID paths
>>> (2)    Guaranteed not to run out of bits when inode numbers stay below the lowest 54 bits and we have less than 1024 devices.
>>> (3)    When we get beyond this this limit, there is a chance we run out of bits to allocate new QID paths, but we can detect this and refuse to serve the offending files instead of allowing a collision.
>>>
>>> We could tweak the prefix size to match the scenarios that we consider more likely, but I think close to 10-16 bits sounds reasonable enough. What do you think?
>>>
>>
>> I think that should work. Please send a patchset :)
> 
> Hi Greg,
> 
> This is based on the assumption that high bits of inode numbers are
> always zero, which is unacceptable from my standpoint. Inode numbers
> allocator is a per-volume file system feature, so there may appear
> allocators, which don't use simple increment and assign inode number
> with non-zero high bits even for the first file of a volume.
> 
> As I understand, the problem is that the guest doesn't have enough
> information for proper files identification (in our case share/p1/file
> and share/p2/file are wrongly identified as the same file). It seems
> that we need to transmit device ID to the guest, and not in the high bits of the inode number.
> 
> AFAIK Slava has patches for such transmission, I hope that he will send
> it eventually.

They're pretty trivial, however it'll require great effort and coordination
for all parties involved, as they break backwards compatibility (QID becomes
bigger and, thus, breaks "Q" transmission).

I'd like to raise another issue - it seems that st_dev and i_no aren't
enough:

mkdir -p /tmp/mounted
touch /tmp/mounted/file
mount -o bind /tmp/mounted t1
mount -o bind /tmp/mounted t2
stat t{1,2}/file  | grep Inode
Device: 805h/2053d      Inode: 42729487    Links: 3
Device: 805h/2053d      Inode: 42729487    Links: 3

In that case, even with the patch applied, we'll still have the issue of
colliding QIDs guest-side - so, for example, after umount t1, t2/file
becomes unaccessible, as the inode points to t1...

I'm not sure how to fix this one. Ideas?

> 
> Thanks,
> Eduard.
> 
>>
>>>>
>>>>> With our proof of concept patch, the issues caused by qid path collisions go away, so it can be seen as an interim solution of sorts. However, the chance of collisions is not eliminated, we are just replacing the current strategy, which is almost guaranteed to cause collisions in certain use cases, with one that makes them more rare. We think that a virtio feature flag for longer qid paths is the only way to eliminate these issues completely.
>>>>>
>>>>> This is the extent that we were able to analyze the issue from our side, we would like to hear if there are some better ideas, or which approach would be more appropriate for upstream.
>>>>>
>>>>> Best regards,
>>>>>
>>>>
>>>> Cheers,
>>>>
>>>> -- 
>>>> Greg
>>>>
>>>
>>
> 
> .
>
Greg Kurz Jan. 19, 2018, 6:05 p.m. | #11
On Fri, 19 Jan 2018 17:37:58 +0100
Veaceslav Falico <veaceslav.falico@huawei.com> wrote:

> On 1/19/2018 4:52 PM, Eduard Shishkin wrote:
> > 
> > 
> > On 1/19/2018 11:27 AM, Greg Kurz wrote:  
> >> On Mon, 15 Jan 2018 11:49:31 +0800
> >> Antonios Motakis <antonios.motakis@huawei.com> wrote:
> >>  
> >>> On 13-Jan-18 00:14, Greg Kurz wrote:  
> >>>> On Fri, 12 Jan 2018 19:32:10 +0800
> >>>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
> >>>>  
> >>>>> Hello all,
> >>>>>  
> >>>>
> >>>> Hi Antonios,
> >>>>
> >>>> I see you have attached a patch to this email... this really isn't the preferred
> >>>> way to do things since it prevents to comment the patch (at least with my mail
> >>>> client). The appropriate way would have been to send the patch with a cover
> >>>> letter, using git-send-email for example.  
> >>>
> >>> I apologize for attaching the patch, I should have known better!
> >>>  
> >>
> >> np :)
> >>  
> >>>>  
> >>>>> We have found an issue in the 9p implementation of QEMU, with how qid paths are generated, which can cause qid path collisions and several issues caused by them. In our use case (running containers under VMs) these have proven to be critical.
> >>>>>  
> >>>>
> >>>> Ouch...
> >>>>  
> >>>>> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the inode number of the file as input. According to the 9p spec the path should be able to uniquely identify a file, distinct files should not share a path value.
> >>>>>
> >>>>> The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.
> >>>>>
> >>>>> How to demonstrate the issue:
> >>>>> 1) Prepare a problematic share:
> >>>>>  - mount one partition under share/p1/ with some files inside
> >>>>>  - mount another one *with identical contents* under share/p2/
> >>>>>  - confirm that both partitions have files with same inode nr, size, etc
> >>>>> 2) Demonstrate breakage:
> >>>>>  - start a VM with a virtio-9p pointing to the share
> >>>>>  - mount 9p share with FSCACHE on
> >>>>>  - keep open share/p1/file
> >>>>>  - open and write to share/p2/file
> >>>>>
> >>>>> What should happen is, the guest will consider share/p1/file and share/p2/file to be the same file, and since we are using the cache it will not reopen it. We intended to write to partition 2, but we just wrote to partition 1. This is just one example on how the guest may rely on qid paths being unique.
> >>>>>
> >>>>> In the use case of containers where we commonly have a few containers per VM, all based on similar images, these kind of qid path collisions are very common and they seem to cause all kinds of funny behavior (sometimes very subtle).
> >>>>>
> >>>>> To avoid this situation, the device id of a file needs to be also taken as input for generating a qid path. Unfortunately, the size of both inode nr + device id together would be 96 bits, while we have only 64 bits for the qid path, so we can't just append them and call it a day :(
> >>>>>
> >>>>> We have thought of a few approaches, but we would definitely like to hear what the upstream maintainers and community think:
> >>>>>
> >>>>> * Full fix: Change the 9p protocol
> >>>>>
> >>>>> We would need to support a longer qid path, based on a virtio feature flag. This would take reworking of host and guest parts of virtio-9p, so both QEMU and Linux for most users.
> >>>>>  
> >>>>
> >>>> I agree for a longer qid path, but we shouldn't tie it to a virtio flag since
> >>>> 9p is transport agnostic. And it happens to be used with a variety of transports.
> >>>> QEMU has both virtio-9p and a Xen backend for example.
> >>>>  
> >>>>> * Fallback and/or interim solutions
> >>>>>
> >>>>> A virtio feature flag may be refused by the guest, so we think we still need to make collisions less likely even with 64 bit paths. E.g.  
> >>>>
> >>>> In all cases, we would need a fallback solution to support current
> >>>> guest setups. Also there are several 9p server implementations out
> >>>> there (ganesha, diod, kvmtool) that are currently used with linux
> >>>> clients... it will take some time to get everyone in sync :-\
> >>>>  
> >>>>> 1. XOR the device id with inode nr to produce the qid path (we attach a proof of concept patch)  
> >>>>
> >>>> Hmm... this would still allow collisions. Not good for fallback.
> >>>>  
> >>>>> 2. 64 bit hash of device id and inode nr  
> >>>>
> >>>> Same here.
> >>>>  
> >>>>> 3. other ideas, such as allocating new qid paths and keep a look up table of some sorts (possibly too expensive)
> >>>>>  
> >>>>
> >>>> This would be acceptable for fallback.  
> >>>
> >>> Maybe we can use the QEMU hash table (https://github.com/qemu/qemu/blob/master/util/qht.c) but I wonder if it scales to millions of entries. Do you think it is appropriate in this case?
> >>>  
> >>
> >> I don't know QHT, hence Cc'ing Emilio for insights.
> >>  
> >>> I was thinking on how to implement something like this, without having to maintain millions of entries... One option we could consider is to split the bits into a directly-mapped part, and a lookup part. For example:
> >>>
> >>> Inode =
> >>> [ 10 first bits ] + [ 54 lowest bits ]
> >>>
> >>> A hash table maps [ inode 10 first bits ] + [ device id ] => [ 10 bit prefix ]
> >>> The prefix is uniquely allocated for each input.
> >>>
> >>> Qid path =
> >>> [ 10 bit prefix ] + [ inode 54 lowest bits ]
> >>>
> >>> Since inodes are not completely random, and we usually have a handful of device IDs, we get a much smaller number of entries to track in the hash table.
> >>>
> >>> So what this would give:
> >>> (1)    Would be faster and take less memory than mapping the full inode_nr,devi_id tuple to unique QID paths
> >>> (2)    Guaranteed not to run out of bits when inode numbers stay below the lowest 54 bits and we have less than 1024 devices.
> >>> (3)    When we get beyond this this limit, there is a chance we run out of bits to allocate new QID paths, but we can detect this and refuse to serve the offending files instead of allowing a collision.
> >>>
> >>> We could tweak the prefix size to match the scenarios that we consider more likely, but I think close to 10-16 bits sounds reasonable enough. What do you think?
> >>>  
> >>
> >> I think that should work. Please send a patchset :)  
> > 
> > Hi Greg,
> > 
> > This is based on the assumption that high bits of inode numbers are
> > always zero, which is unacceptable from my standpoint. Inode numbers
> > allocator is a per-volume file system feature, so there may appear
> > allocators, which don't use simple increment and assign inode number
> > with non-zero high bits even for the first file of a volume.
> > 
> > As I understand, the problem is that the guest doesn't have enough
> > information for proper files identification (in our case share/p1/file
> > and share/p2/file are wrongly identified as the same file). It seems
> > that we need to transmit device ID to the guest, and not in the high bits of the inode number.
> > 
> > AFAIK Slava has patches for such transmission, I hope that he will send
> > it eventually.  
> 
> They're pretty trivial, however it'll require great effort and coordination
> for all parties involved, as they break backwards compatibility (QID becomes
> bigger and, thus, breaks "Q" transmission).
> 
> I'd like to raise another issue - it seems that st_dev and i_no aren't
> enough:
> 
> mkdir -p /tmp/mounted
> touch /tmp/mounted/file
> mount -o bind /tmp/mounted t1
> mount -o bind /tmp/mounted t2
> stat t{1,2}/file  | grep Inode
> Device: 805h/2053d      Inode: 42729487    Links: 3
> Device: 805h/2053d      Inode: 42729487    Links: 3
> 
> In that case, even with the patch applied, we'll still have the issue of
> colliding QIDs guest-side - so, for example, after umount t1, t2/file
> becomes unaccessible, as the inode points to t1...
> 

t1/file and t2/file really point to the same file on the server, so
it is expected they have the same QIDs.

> I'm not sure how to fix this one. Ideas?

fscache bug ?

> 
> > 
> > Thanks,
> > Eduard.
> >   
> >>  
> >>>>  
> >>>>> With our proof of concept patch, the issues caused by qid path collisions go away, so it can be seen as an interim solution of sorts. However, the chance of collisions is not eliminated, we are just replacing the current strategy, which is almost guaranteed to cause collisions in certain use cases, with one that makes them more rare. We think that a virtio feature flag for longer qid paths is the only way to eliminate these issues completely.
> >>>>>
> >>>>> This is the extent that we were able to analyze the issue from our side, we would like to hear if there are some better ideas, or which approach would be more appropriate for upstream.
> >>>>>
> >>>>> Best regards,
> >>>>>  
> >>>>
> >>>> Cheers,
> >>>>
> >>>> -- 
> >>>> Greg
> >>>>  
> >>>  
> >>  
> > 
> > .
> >   
> 
>
Eduard Shishkin Jan. 19, 2018, 6:51 p.m. | #12
On 1/19/2018 7:05 PM, Greg Kurz wrote:
> On Fri, 19 Jan 2018 17:37:58 +0100
> Veaceslav Falico <veaceslav.falico@huawei.com> wrote:
>
>> On 1/19/2018 4:52 PM, Eduard Shishkin wrote:
>>>
>>>
>>> On 1/19/2018 11:27 AM, Greg Kurz wrote:
>>>> On Mon, 15 Jan 2018 11:49:31 +0800
>>>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
>>>>
>>>>> On 13-Jan-18 00:14, Greg Kurz wrote:
>>>>>> On Fri, 12 Jan 2018 19:32:10 +0800
>>>>>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>
>>>>>> Hi Antonios,
>>>>>>
>>>>>> I see you have attached a patch to this email... this really isn't the preferred
>>>>>> way to do things since it prevents to comment the patch (at least with my mail
>>>>>> client). The appropriate way would have been to send the patch with a cover
>>>>>> letter, using git-send-email for example.
>>>>>
>>>>> I apologize for attaching the patch, I should have known better!
>>>>>
>>>>
>>>> np :)
>>>>
>>>>>>
>>>>>>> We have found an issue in the 9p implementation of QEMU, with how qid paths are generated, which can cause qid path collisions and several issues caused by them. In our use case (running containers under VMs) these have proven to be critical.
>>>>>>>
>>>>>>
>>>>>> Ouch...
>>>>>>
>>>>>>> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the inode number of the file as input. According to the 9p spec the path should be able to uniquely identify a file, distinct files should not share a path value.
>>>>>>>
>>>>>>> The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.
>>>>>>>
>>>>>>> How to demonstrate the issue:
>>>>>>> 1) Prepare a problematic share:
>>>>>>>  - mount one partition under share/p1/ with some files inside
>>>>>>>  - mount another one *with identical contents* under share/p2/
>>>>>>>  - confirm that both partitions have files with same inode nr, size, etc
>>>>>>> 2) Demonstrate breakage:
>>>>>>>  - start a VM with a virtio-9p pointing to the share
>>>>>>>  - mount 9p share with FSCACHE on
>>>>>>>  - keep open share/p1/file
>>>>>>>  - open and write to share/p2/file
>>>>>>>
>>>>>>> What should happen is, the guest will consider share/p1/file and share/p2/file to be the same file, and since we are using the cache it will not reopen it. We intended to write to partition 2, but we just wrote to partition 1. This is just one example on how the guest may rely on qid paths being unique.
>>>>>>>
>>>>>>> In the use case of containers where we commonly have a few containers per VM, all based on similar images, these kind of qid path collisions are very common and they seem to cause all kinds of funny behavior (sometimes very subtle).
>>>>>>>
>>>>>>> To avoid this situation, the device id of a file needs to be also taken as input for generating a qid path. Unfortunately, the size of both inode nr + device id together would be 96 bits, while we have only 64 bits for the qid path, so we can't just append them and call it a day :(
>>>>>>>
>>>>>>> We have thought of a few approaches, but we would definitely like to hear what the upstream maintainers and community think:
>>>>>>>
>>>>>>> * Full fix: Change the 9p protocol
>>>>>>>
>>>>>>> We would need to support a longer qid path, based on a virtio feature flag. This would take reworking of host and guest parts of virtio-9p, so both QEMU and Linux for most users.
>>>>>>>
>>>>>>
>>>>>> I agree for a longer qid path, but we shouldn't tie it to a virtio flag since
>>>>>> 9p is transport agnostic. And it happens to be used with a variety of transports.
>>>>>> QEMU has both virtio-9p and a Xen backend for example.
>>>>>>
>>>>>>> * Fallback and/or interim solutions
>>>>>>>
>>>>>>> A virtio feature flag may be refused by the guest, so we think we still need to make collisions less likely even with 64 bit paths. E.g.
>>>>>>
>>>>>> In all cases, we would need a fallback solution to support current
>>>>>> guest setups. Also there are several 9p server implementations out
>>>>>> there (ganesha, diod, kvmtool) that are currently used with linux
>>>>>> clients... it will take some time to get everyone in sync :-\
>>>>>>
>>>>>>> 1. XOR the device id with inode nr to produce the qid path (we attach a proof of concept patch)
>>>>>>
>>>>>> Hmm... this would still allow collisions. Not good for fallback.
>>>>>>
>>>>>>> 2. 64 bit hash of device id and inode nr
>>>>>>
>>>>>> Same here.
>>>>>>
>>>>>>> 3. other ideas, such as allocating new qid paths and keep a look up table of some sorts (possibly too expensive)
>>>>>>>
>>>>>>
>>>>>> This would be acceptable for fallback.
>>>>>
>>>>> Maybe we can use the QEMU hash table (https://github.com/qemu/qemu/blob/master/util/qht.c) but I wonder if it scales to millions of entries. Do you think it is appropriate in this case?
>>>>>
>>>>
>>>> I don't know QHT, hence Cc'ing Emilio for insights.
>>>>
>>>>> I was thinking on how to implement something like this, without having to maintain millions of entries... One option we could consider is to split the bits into a directly-mapped part, and a lookup part. For example:
>>>>>
>>>>> Inode =
>>>>> [ 10 first bits ] + [ 54 lowest bits ]
>>>>>
>>>>> A hash table maps [ inode 10 first bits ] + [ device id ] => [ 10 bit prefix ]
>>>>> The prefix is uniquely allocated for each input.
>>>>>
>>>>> Qid path =
>>>>> [ 10 bit prefix ] + [ inode 54 lowest bits ]
>>>>>
>>>>> Since inodes are not completely random, and we usually have a handful of device IDs, we get a much smaller number of entries to track in the hash table.
>>>>>
>>>>> So what this would give:
>>>>> (1)    Would be faster and take less memory than mapping the full inode_nr,devi_id tuple to unique QID paths
>>>>> (2)    Guaranteed not to run out of bits when inode numbers stay below the lowest 54 bits and we have less than 1024 devices.
>>>>> (3)    When we get beyond this this limit, there is a chance we run out of bits to allocate new QID paths, but we can detect this and refuse to serve the offending files instead of allowing a collision.
>>>>>
>>>>> We could tweak the prefix size to match the scenarios that we consider more likely, but I think close to 10-16 bits sounds reasonable enough. What do you think?
>>>>>
>>>>
>>>> I think that should work. Please send a patchset :)
>>>
>>> Hi Greg,
>>>
>>> This is based on the assumption that high bits of inode numbers are
>>> always zero, which is unacceptable from my standpoint. Inode numbers
>>> allocator is a per-volume file system feature, so there may appear
>>> allocators, which don't use simple increment and assign inode number
>>> with non-zero high bits even for the first file of a volume.
>>>
>>> As I understand, the problem is that the guest doesn't have enough
>>> information for proper files identification (in our case share/p1/file
>>> and share/p2/file are wrongly identified as the same file). It seems
>>> that we need to transmit device ID to the guest, and not in the high bits of the inode number.
>>>
>>> AFAIK Slava has patches for such transmission, I hope that he will send
>>> it eventually.
>>
>> They're pretty trivial, however it'll require great effort and coordination
>> for all parties involved, as they break backwards compatibility (QID becomes
>> bigger and, thus, breaks "Q" transmission).
>>
>> I'd like to raise another issue - it seems that st_dev and i_no aren't
>> enough:
>>
>> mkdir -p /tmp/mounted
>> touch /tmp/mounted/file
>> mount -o bind /tmp/mounted t1
>> mount -o bind /tmp/mounted t2
>> stat t{1,2}/file  | grep Inode
>> Device: 805h/2053d      Inode: 42729487    Links: 3
>> Device: 805h/2053d      Inode: 42729487    Links: 3
>>
>> In that case, even with the patch applied, we'll still have the issue of
>> colliding QIDs guest-side - so, for example, after umount t1, t2/file
>> becomes unaccessible, as the inode points to t1...
>>
>
> t1/file and t2/file really point to the same file on the server, so
> it is expected they have the same QIDs.
>
>> I'm not sure how to fix this one. Ideas?
>
> fscache bug ?


fscache also thinks that it is the same file.
But applications (docker in particular) want one-to-one correspondence
of semantic trees on the host and on guests. They think if t2/file is
accessible on the host, then it should be also accessible on the guest.
However, t2/file was ousted by fscache :)

>
>>
>>>
>>> Thanks,
>>> Eduard.
>>>
>>>>
>>>>>>
>>>>>>> With our proof of concept patch, the issues caused by qid path collisions go away, so it can be seen as an interim solution of sorts. However, the chance of collisions is not eliminated, we are just replacing the current strategy, which is almost guaranteed to cause collisions in certain use cases, with one that makes them more rare. We think that a virtio feature flag for longer qid paths is the only way to eliminate these issues completely.
>>>>>>>
>>>>>>> This is the extent that we were able to analyze the issue from our side, we would like to hear if there are some better ideas, or which approach would be more appropriate for upstream.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> --
>>>>>> Greg
>>>>>>
>>>>>
>>>>
>>>
>>> .
>>>
>>
>>
>
Emilio G. Cota Jan. 20, 2018, 12:05 a.m. | #13
On Fri, Jan 19, 2018 at 11:27:33 +0100, Greg Kurz wrote:
> On Mon, 15 Jan 2018 11:49:31 +0800
> Antonios Motakis <antonios.motakis@huawei.com> wrote:
> > On 13-Jan-18 00:14, Greg Kurz wrote:
> > > On Fri, 12 Jan 2018 19:32:10 +0800
> > > Antonios Motakis <antonios.motakis@huawei.com> wrote:
(snip)
> > >> To avoid this situation, the device id of a file needs to be also taken as
> > >> input for generating a qid path. Unfortunately, the size of both inode
> > >> nr + device id together would be 96 bits, while we have only 64 bits
> > >> for the qid path, so we can't just append them and call it a day :(
(snip)
> > >>
> > >> We have thought of a few approaches, but we would definitely like to hear
> > >> what the upstream maintainers and community think:
(snip)
> > >> * Fallback and/or interim solutions
(snip)
> > >> 3. other ideas, such as allocating new qid paths and keep a look up table of some sorts (possibly too expensive)
> > >>  
> > > 
> > > This would be acceptable for fallback.  
> > 
> > Maybe we can use the QEMU hash table (https://github.com/qemu/qemu/blob/master/util/qht.c) but I wonder
> > if it scales to millions of entries. Do you think it is appropriate in this case?
> 
> I don't know QHT, hence Cc'ing Emilio for insights.

QHT stores a 32-bit hash per entry, so with perfect hashing one wouldn't
see collisions (and the subsequent performance drop) below 2**32 entries.
So yes, millions of entries are supported.

Wrt memory overhead, each entry takes 12 bytes on 64-bit hosts
and 8 bytes on 32-bit hosts (pointer + hash). However, we have to account
for empty or partially filled buckets, as well as bucket overhead (header,
tail and lock), so on 64-bit we can approximate the overhead to 32 bytes
per entry. Given that inode sizes are >128 bytes, memory overhead
[size(ht) / size(index)] would be at most 25%. (See below for some
suggestions to lower this.)

> > I was thinking on how to implement something like this, without having to maintain
> > millions of entries... One option we could consider is to split the bits into a
> > directly-mapped part, and a lookup part. For example:
> > 
> > Inode =
> > [ 10 first bits ] + [ 54 lowest bits ]
> > 
> > A hash table maps [ inode 10 first bits ] + [ device id ] => [ 10 bit prefix ]
> > The prefix is uniquely allocated for each input.
> > 
> > Qid path = 
> > [ 10 bit prefix ] + [ inode 54 lowest bits ]
> > 
> > Since inodes are not completely random, and we usually have a handful of device IDs,
> > we get a much smaller number of entries to track in the hash table.
> > 
> > So what this would give:
> > (1)	Would be faster and take less memory than mapping the full inode_nr,devi_id
> > tuple to unique QID paths
> > (2)	Guaranteed not to run out of bits when inode numbers stay below the lowest
> > 54 bits and we have less than 1024 devices.
> > (3)	When we get beyond this this limit, there is a chance we run out of bits to
> > allocate new QID paths, but we can detect this and refuse to serve the offending
> > files instead of allowing a collision.
> > 
> > We could tweak the prefix size to match the scenarios that we consider more likely,
> > but I think close to 10-16 bits sounds reasonable enough. What do you think?

Assuming assumption (2) is very likely to be true, I'd suggest
dropping the intermediate hash table altogether, and simply refuse
to work with any files that do not meet (2).

That said, the naive solution of having a large hash table with all entries
in it might be worth a shot. With QHT and a good hash function
like xxhash, you'd support parallel lookups and at worst it'd add
an extra last-level cache miss per file operation (~300 cycles),
which on average would be dwarfed by the corresponding I/O latency.

This should be easy to implement and benchmark, so I think it is
worth trying -- you might not need to fiddle with the protocol
after all.

BTW since hashing here would be very simple, we could have a variant
of QHT where only pointers are kept in the buckets, thereby lowering
memory overhead. We could also play with the resizing algorithm to
make it less aggresive so that we end up with less (and fuller) buckets.

Thanks,

		E.
Emilio G. Cota Jan. 20, 2018, 10:03 p.m. | #14
On Fri, Jan 19, 2018 at 19:05:06 -0500, Emilio G. Cota wrote:
> > > > On Fri, 12 Jan 2018 19:32:10 +0800
> > > > Antonios Motakis <antonios.motakis@huawei.com> wrote:
> > > Since inodes are not completely random, and we usually have a handful of device IDs,
> > > we get a much smaller number of entries to track in the hash table.
> > > 
> > > So what this would give:
> > > (1)	Would be faster and take less memory than mapping the full inode_nr,devi_id
> > > tuple to unique QID paths
> > > (2)	Guaranteed not to run out of bits when inode numbers stay below the lowest
> > > 54 bits and we have less than 1024 devices.
> > > (3)	When we get beyond this this limit, there is a chance we run out of bits to
> > > allocate new QID paths, but we can detect this and refuse to serve the offending
> > > files instead of allowing a collision.
> > > 
> > > We could tweak the prefix size to match the scenarios that we consider more likely,
> > > but I think close to 10-16 bits sounds reasonable enough. What do you think?
> 
> Assuming assumption (2) is very likely to be true, I'd suggest
> dropping the intermediate hash table altogether, and simply refuse
> to work with any files that do not meet (2).
> 
> That said, the naive solution of having a large hash table with all entries
> in it might be worth a shot.

hmm but that would still take a lot of memory.

Given assumption (2), a good compromise would be the following,
taking into account that the number of total gids is unlikely to
reach even close to 2**64:
- bit 63: 0/1 determines "fast" or "slow" encoding
- 62-0:
  - fast (trivial) encoding: when assumption (2) is met
    - 62-53: device id (it fits because of (2))
    - 52-0: inode (it fits because of (2))
  - slow path: assumption (2) isn't met. Then, assign incremental
    IDs in the [0,2**63-1] range and track them in a hash table.

Choosing 10 or whatever else bits for the device id is of course TBD,
as Antonios you pointed out.

Something like this will give you great performance and 0 memory
overhead for the majority of cases if (2) indeed holds.

		Emilio
Eduard Shishkin Jan. 22, 2018, 12:40 p.m. | #15
On 1/19/2018 5:37 PM, Veaceslav Falico wrote:
> On 1/19/2018 4:52 PM, Eduard Shishkin wrote:
>>
>>
>> On 1/19/2018 11:27 AM, Greg Kurz wrote:
>>> On Mon, 15 Jan 2018 11:49:31 +0800
>>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
>>>
>>>> On 13-Jan-18 00:14, Greg Kurz wrote:
>>>>> On Fri, 12 Jan 2018 19:32:10 +0800
>>>>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>
>>>>> Hi Antonios,
>>>>>
>>>>> I see you have attached a patch to this email... this really isn't the preferred
>>>>> way to do things since it prevents to comment the patch (at least with my mail
>>>>> client). The appropriate way would have been to send the patch with a cover
>>>>> letter, using git-send-email for example.
>>>>
>>>> I apologize for attaching the patch, I should have known better!
>>>>
>>>
>>> np :)
>>>
>>>>>
>>>>>> We have found an issue in the 9p implementation of QEMU, with how qid paths are generated, which can cause qid path collisions and several issues caused by them. In our use case (running containers under VMs) these have proven to be critical.
>>>>>>
>>>>>
>>>>> Ouch...
>>>>>
>>>>>> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the inode number of the file as input. According to the 9p spec the path should be able to uniquely identify a file, distinct files should not share a path value.
>>>>>>
>>>>>> The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.
>>>>>>
>>>>>> How to demonstrate the issue:
>>>>>> 1) Prepare a problematic share:
>>>>>>  - mount one partition under share/p1/ with some files inside
>>>>>>  - mount another one *with identical contents* under share/p2/
>>>>>>  - confirm that both partitions have files with same inode nr, size, etc
>>>>>> 2) Demonstrate breakage:
>>>>>>  - start a VM with a virtio-9p pointing to the share
>>>>>>  - mount 9p share with FSCACHE on
>>>>>>  - keep open share/p1/file
>>>>>>  - open and write to share/p2/file
>>>>>>
>>>>>> What should happen is, the guest will consider share/p1/file and share/p2/file to be the same file, and since we are using the cache it will not reopen it. We intended to write to partition 2, but we just wrote to partition 1. This is just one example on how the guest may rely on qid paths being unique.
>>>>>>
>>>>>> In the use case of containers where we commonly have a few containers per VM, all based on similar images, these kind of qid path collisions are very common and they seem to cause all kinds of funny behavior (sometimes very subtle).
>>>>>>
>>>>>> To avoid this situation, the device id of a file needs to be also taken as input for generating a qid path. Unfortunately, the size of both inode nr + device id together would be 96 bits, while we have only 64 bits for the qid path, so we can't just append them and call it a day :(
>>>>>>
>>>>>> We have thought of a few approaches, but we would definitely like to hear what the upstream maintainers and community think:
>>>>>>
>>>>>> * Full fix: Change the 9p protocol
>>>>>>
>>>>>> We would need to support a longer qid path, based on a virtio feature flag. This would take reworking of host and guest parts of virtio-9p, so both QEMU and Linux for most users.
>>>>>>
>>>>>
>>>>> I agree for a longer qid path, but we shouldn't tie it to a virtio flag since
>>>>> 9p is transport agnostic. And it happens to be used with a variety of transports.
>>>>> QEMU has both virtio-9p and a Xen backend for example.
>>>>>
>>>>>> * Fallback and/or interim solutions
>>>>>>
>>>>>> A virtio feature flag may be refused by the guest, so we think we still need to make collisions less likely even with 64 bit paths. E.g.
>>>>>
>>>>> In all cases, we would need a fallback solution to support current
>>>>> guest setups. Also there are several 9p server implementations out
>>>>> there (ganesha, diod, kvmtool) that are currently used with linux
>>>>> clients... it will take some time to get everyone in sync :-\
>>>>>
>>>>>> 1. XOR the device id with inode nr to produce the qid path (we attach a proof of concept patch)
>>>>>
>>>>> Hmm... this would still allow collisions. Not good for fallback.
>>>>>
>>>>>> 2. 64 bit hash of device id and inode nr
>>>>>
>>>>> Same here.
>>>>>
>>>>>> 3. other ideas, such as allocating new qid paths and keep a look up table of some sorts (possibly too expensive)
>>>>>>
>>>>>
>>>>> This would be acceptable for fallback.
>>>>
>>>> Maybe we can use the QEMU hash table (https://github.com/qemu/qemu/blob/master/util/qht.c) but I wonder if it scales to millions of entries. Do you think it is appropriate in this case?
>>>>
>>>
>>> I don't know QHT, hence Cc'ing Emilio for insights.
>>>
>>>> I was thinking on how to implement something like this, without having to maintain millions of entries... One option we could consider is to split the bits into a directly-mapped part, and a lookup part. For example:
>>>>
>>>> Inode =
>>>> [ 10 first bits ] + [ 54 lowest bits ]
>>>>
>>>> A hash table maps [ inode 10 first bits ] + [ device id ] => [ 10 bit prefix ]
>>>> The prefix is uniquely allocated for each input.
>>>>
>>>> Qid path =
>>>> [ 10 bit prefix ] + [ inode 54 lowest bits ]
>>>>
>>>> Since inodes are not completely random, and we usually have a handful of device IDs, we get a much smaller number of entries to track in the hash table.
>>>>
>>>> So what this would give:
>>>> (1)    Would be faster and take less memory than mapping the full inode_nr,devi_id tuple to unique QID paths
>>>> (2)    Guaranteed not to run out of bits when inode numbers stay below the lowest 54 bits and we have less than 1024 devices.
>>>> (3)    When we get beyond this this limit, there is a chance we run out of bits to allocate new QID paths, but we can detect this and refuse to serve the offending files instead of allowing a collision.
>>>>
>>>> We could tweak the prefix size to match the scenarios that we consider more likely, but I think close to 10-16 bits sounds reasonable enough. What do you think?
>>>>
>>>
>>> I think that should work. Please send a patchset :)
>>
>> Hi Greg,
>>
>> This is based on the assumption that high bits of inode numbers are
>> always zero, which is unacceptable from my standpoint. Inode numbers
>> allocator is a per-volume file system feature, so there may appear
>> allocators, which don't use simple increment and assign inode number
>> with non-zero high bits even for the first file of a volume.
>>
>> As I understand, the problem is that the guest doesn't have enough
>> information for proper files identification (in our case share/p1/file
>> and share/p2/file are wrongly identified as the same file). It seems
>> that we need to transmit device ID to the guest, and not in the high bits of the inode number.
>>
>> AFAIK Slava has patches for such transmission, I hope that he will send
>> it eventually.
>
> They're pretty trivial, however it'll require great effort and coordination
> for all parties involved, as they break backwards compatibility (QID becomes
> bigger and, thus, breaks "Q" transmission).
>
> I'd like to raise another issue - it seems that st_dev and i_no aren't
> enough:
>
> mkdir -p /tmp/mounted
> touch /tmp/mounted/file
> mount -o bind /tmp/mounted t1
> mount -o bind /tmp/mounted t2
> stat t{1,2}/file  | grep Inode
> Device: 805h/2053d      Inode: 42729487    Links: 3
> Device: 805h/2053d      Inode: 42729487    Links: 3
>
> In that case, even with the patch applied, we'll still have the issue of
> colliding QIDs guest-side - so, for example, after umount t1, t2/file
> becomes unaccessible, as the inode points to t1...
>
> I'm not sure how to fix this one. Ideas?

t1/file and t2/file have different mount IDs,
which can be retrieved by name_to_handle_at(2).
So the idea is to transmit also that mount ID along with st_dev.

Thanks,
Eduard.

>
>>
>> Thanks,
>> Eduard.
>>
>>>
>>>>>
>>>>>> With our proof of concept patch, the issues caused by qid path collisions go away, so it can be seen as an interim solution of sorts. However, the chance of collisions is not eliminated, we are just replacing the current strategy, which is almost guaranteed to cause collisions in certain use cases, with one that makes them more rare. We think that a virtio feature flag for longer qid paths is the only way to eliminate these issues completely.
>>>>>>
>>>>>> This is the extent that we were able to analyze the issue from our side, we would like to hear if there are some better ideas, or which approach would be more appropriate for upstream.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>> --
>>>>> Greg
>>>>>
>>>>
>>>
>>
>> .
>>
>
>
Greg Kurz Jan. 24, 2018, 1:30 p.m. | #16
Thanks Emilio for providing these valuable suggestions ! :)

On Sat, 20 Jan 2018 17:03:49 -0500
"Emilio G. Cota" <cota@braap.org> wrote:

> On Fri, Jan 19, 2018 at 19:05:06 -0500, Emilio G. Cota wrote:
> > > > > On Fri, 12 Jan 2018 19:32:10 +0800
> > > > > Antonios Motakis <antonios.motakis@huawei.com> wrote:  
> > > > Since inodes are not completely random, and we usually have a handful of device IDs,
> > > > we get a much smaller number of entries to track in the hash table.
> > > > 
> > > > So what this would give:
> > > > (1)	Would be faster and take less memory than mapping the full inode_nr,devi_id
> > > > tuple to unique QID paths
> > > > (2)	Guaranteed not to run out of bits when inode numbers stay below the lowest
> > > > 54 bits and we have less than 1024 devices.
> > > > (3)	When we get beyond this this limit, there is a chance we run out of bits to
> > > > allocate new QID paths, but we can detect this and refuse to serve the offending
> > > > files instead of allowing a collision.
> > > > 
> > > > We could tweak the prefix size to match the scenarios that we consider more likely,
> > > > but I think close to 10-16 bits sounds reasonable enough. What do you think?  
> > 
> > Assuming assumption (2) is very likely to be true, I'd suggest
> > dropping the intermediate hash table altogether, and simply refuse
> > to work with any files that do not meet (2).
> > 
> > That said, the naive solution of having a large hash table with all entries
> > in it might be worth a shot.  
> 
> hmm but that would still take a lot of memory.
> 
> Given assumption (2), a good compromise would be the following,
> taking into account that the number of total gids is unlikely to
> reach even close to 2**64:
> - bit 63: 0/1 determines "fast" or "slow" encoding
> - 62-0:
>   - fast (trivial) encoding: when assumption (2) is met
>     - 62-53: device id (it fits because of (2))
>     - 52-0: inode (it fits because of (2))

And as pointed by Eduard, we may have to take the mount id into account
as well if we want to support the case where we have bind mounts in the
exported directory... My understanding is that mount ids are incremental
and reused when the associated fs gets unmounted: if we assume that the
host doesn't have more than 1024 mounts, we would need 10 bits to encode
it.

The fast encoding could be something like:

62-53: mount id
52-43: device id
42-0: inode

>   - slow path: assumption (2) isn't met. Then, assign incremental
>     IDs in the [0,2**63-1] range and track them in a hash table.
> 
> Choosing 10 or whatever else bits for the device id is of course TBD,
> as Antonios you pointed out.
> 

This is a best effort to have a fallback in QEMU. The right way to
address the issue would really be to extend the protocol to have
bigger qids (eg, 64 for inode, 32 for device and 32 for mount).

Cheers,

--
Greg

> Something like this will give you great performance and 0 memory
> overhead for the majority of cases if (2) indeed holds.
> 
> 		Emilio
Greg Kurz Jan. 24, 2018, 3:09 p.m. | #17
On Mon, 22 Jan 2018 13:40:00 +0100
Eduard Shishkin <eduard.shishkin@huawei.com> wrote:

> On 1/19/2018 5:37 PM, Veaceslav Falico wrote:
> > On 1/19/2018 4:52 PM, Eduard Shishkin wrote:  
> >>
> >>
> >> On 1/19/2018 11:27 AM, Greg Kurz wrote:  
> >>> On Mon, 15 Jan 2018 11:49:31 +0800
> >>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
> >>>  
> >>>> On 13-Jan-18 00:14, Greg Kurz wrote:  
> >>>>> On Fri, 12 Jan 2018 19:32:10 +0800
> >>>>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
> >>>>>  
> >>>>>> Hello all,
> >>>>>>  
> >>>>>
> >>>>> Hi Antonios,
> >>>>>
> >>>>> I see you have attached a patch to this email... this really isn't the preferred
> >>>>> way to do things since it prevents to comment the patch (at least with my mail
> >>>>> client). The appropriate way would have been to send the patch with a cover
> >>>>> letter, using git-send-email for example.  
> >>>>
> >>>> I apologize for attaching the patch, I should have known better!
> >>>>  
> >>>
> >>> np :)
> >>>  
> >>>>>  
> >>>>>> We have found an issue in the 9p implementation of QEMU, with how qid paths are generated, which can cause qid path collisions and several issues caused by them. In our use case (running containers under VMs) these have proven to be critical.
> >>>>>>  
> >>>>>
> >>>>> Ouch...
> >>>>>  
> >>>>>> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the inode number of the file as input. According to the 9p spec the path should be able to uniquely identify a file, distinct files should not share a path value.
> >>>>>>
> >>>>>> The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.
> >>>>>>
> >>>>>> How to demonstrate the issue:
> >>>>>> 1) Prepare a problematic share:
> >>>>>>  - mount one partition under share/p1/ with some files inside
> >>>>>>  - mount another one *with identical contents* under share/p2/
> >>>>>>  - confirm that both partitions have files with same inode nr, size, etc
> >>>>>> 2) Demonstrate breakage:
> >>>>>>  - start a VM with a virtio-9p pointing to the share
> >>>>>>  - mount 9p share with FSCACHE on
> >>>>>>  - keep open share/p1/file
> >>>>>>  - open and write to share/p2/file
> >>>>>>
> >>>>>> What should happen is, the guest will consider share/p1/file and share/p2/file to be the same file, and since we are using the cache it will not reopen it. We intended to write to partition 2, but we just wrote to partition 1. This is just one example on how the guest may rely on qid paths being unique.
> >>>>>>
> >>>>>> In the use case of containers where we commonly have a few containers per VM, all based on similar images, these kind of qid path collisions are very common and they seem to cause all kinds of funny behavior (sometimes very subtle).
> >>>>>>
> >>>>>> To avoid this situation, the device id of a file needs to be also taken as input for generating a qid path. Unfortunately, the size of both inode nr + device id together would be 96 bits, while we have only 64 bits for the qid path, so we can't just append them and call it a day :(
> >>>>>>
> >>>>>> We have thought of a few approaches, but we would definitely like to hear what the upstream maintainers and community think:
> >>>>>>
> >>>>>> * Full fix: Change the 9p protocol
> >>>>>>
> >>>>>> We would need to support a longer qid path, based on a virtio feature flag. This would take reworking of host and guest parts of virtio-9p, so both QEMU and Linux for most users.
> >>>>>>  
> >>>>>
> >>>>> I agree for a longer qid path, but we shouldn't tie it to a virtio flag since
> >>>>> 9p is transport agnostic. And it happens to be used with a variety of transports.
> >>>>> QEMU has both virtio-9p and a Xen backend for example.
> >>>>>  
> >>>>>> * Fallback and/or interim solutions
> >>>>>>
> >>>>>> A virtio feature flag may be refused by the guest, so we think we still need to make collisions less likely even with 64 bit paths. E.g.  
> >>>>>
> >>>>> In all cases, we would need a fallback solution to support current
> >>>>> guest setups. Also there are several 9p server implementations out
> >>>>> there (ganesha, diod, kvmtool) that are currently used with linux
> >>>>> clients... it will take some time to get everyone in sync :-\
> >>>>>  
> >>>>>> 1. XOR the device id with inode nr to produce the qid path (we attach a proof of concept patch)  
> >>>>>
> >>>>> Hmm... this would still allow collisions. Not good for fallback.
> >>>>>  
> >>>>>> 2. 64 bit hash of device id and inode nr  
> >>>>>
> >>>>> Same here.
> >>>>>  
> >>>>>> 3. other ideas, such as allocating new qid paths and keep a look up table of some sorts (possibly too expensive)
> >>>>>>  
> >>>>>
> >>>>> This would be acceptable for fallback.  
> >>>>
> >>>> Maybe we can use the QEMU hash table (https://github.com/qemu/qemu/blob/master/util/qht.c) but I wonder if it scales to millions of entries. Do you think it is appropriate in this case?
> >>>>  
> >>>
> >>> I don't know QHT, hence Cc'ing Emilio for insights.
> >>>  
> >>>> I was thinking on how to implement something like this, without having to maintain millions of entries... One option we could consider is to split the bits into a directly-mapped part, and a lookup part. For example:
> >>>>
> >>>> Inode =
> >>>> [ 10 first bits ] + [ 54 lowest bits ]
> >>>>
> >>>> A hash table maps [ inode 10 first bits ] + [ device id ] => [ 10 bit prefix ]
> >>>> The prefix is uniquely allocated for each input.
> >>>>
> >>>> Qid path =
> >>>> [ 10 bit prefix ] + [ inode 54 lowest bits ]
> >>>>
> >>>> Since inodes are not completely random, and we usually have a handful of device IDs, we get a much smaller number of entries to track in the hash table.
> >>>>
> >>>> So what this would give:
> >>>> (1)    Would be faster and take less memory than mapping the full inode_nr,devi_id tuple to unique QID paths
> >>>> (2)    Guaranteed not to run out of bits when inode numbers stay below the lowest 54 bits and we have less than 1024 devices.
> >>>> (3)    When we get beyond this this limit, there is a chance we run out of bits to allocate new QID paths, but we can detect this and refuse to serve the offending files instead of allowing a collision.
> >>>>
> >>>> We could tweak the prefix size to match the scenarios that we consider more likely, but I think close to 10-16 bits sounds reasonable enough. What do you think?
> >>>>  
> >>>
> >>> I think that should work. Please send a patchset :)  
> >>
> >> Hi Greg,
> >>
> >> This is based on the assumption that high bits of inode numbers are
> >> always zero, which is unacceptable from my standpoint. Inode numbers
> >> allocator is a per-volume file system feature, so there may appear
> >> allocators, which don't use simple increment and assign inode number
> >> with non-zero high bits even for the first file of a volume.
> >>
> >> As I understand, the problem is that the guest doesn't have enough
> >> information for proper files identification (in our case share/p1/file
> >> and share/p2/file are wrongly identified as the same file). It seems
> >> that we need to transmit device ID to the guest, and not in the high bits of the inode number.
> >>
> >> AFAIK Slava has patches for such transmission, I hope that he will send
> >> it eventually.  
> >
> > They're pretty trivial, however it'll require great effort and coordination
> > for all parties involved, as they break backwards compatibility (QID becomes
> > bigger and, thus, breaks "Q" transmission).
> >
> > I'd like to raise another issue - it seems that st_dev and i_no aren't
> > enough:
> >
> > mkdir -p /tmp/mounted
> > touch /tmp/mounted/file
> > mount -o bind /tmp/mounted t1
> > mount -o bind /tmp/mounted t2
> > stat t{1,2}/file  | grep Inode
> > Device: 805h/2053d      Inode: 42729487    Links: 3
> > Device: 805h/2053d      Inode: 42729487    Links: 3
> >
> > In that case, even with the patch applied, we'll still have the issue of
> > colliding QIDs guest-side - so, for example, after umount t1, t2/file
> > becomes unaccessible, as the inode points to t1...
> >
> > I'm not sure how to fix this one. Ideas?  
> 
> t1/file and t2/file have different mount IDs,
> which can be retrieved by name_to_handle_at(2).
> So the idea is to transmit also that mount ID along with st_dev.
> 

As for transmitting st_dev, this calls for a spec change and might take
some time... :-\

And anyway, we really want to have a fallback in QEMU. Maybe you can try
something like Emilio suggested in another mail ?

The first step would be to allow fsdev backends to return the mount id
along with the struct stat:
- add a uint32_t *mnt_id arg to the .fstat and .stat ops in file-op-9p.h
- adapt the 9p-local backend to fill out *mnt_id with name_to_handle_at()

If this works, then same thing should be done in 9p-proxy/virtfs-proxy-helper.
Other backends can be left behind.

Cheers,

--
Greg

> Thanks,
> Eduard.
> 
> >  
> >>
> >> Thanks,
> >> Eduard.
> >>  
> >>>  
> >>>>>  
> >>>>>> With our proof of concept patch, the issues caused by qid path collisions go away, so it can be seen as an interim solution of sorts. However, the chance of collisions is not eliminated, we are just replacing the current strategy, which is almost guaranteed to cause collisions in certain use cases, with one that makes them more rare. We think that a virtio feature flag for longer qid paths is the only way to eliminate these issues completely.
> >>>>>>
> >>>>>> This is the extent that we were able to analyze the issue from our side, we would like to hear if there are some better ideas, or which approach would be more appropriate for upstream.
> >>>>>>
> >>>>>> Best regards,
> >>>>>>  
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> --
> >>>>> Greg
> >>>>>  
> >>>>  
> >>>  
> >>
> >> .
> >>  
> >
> >  
>
Antonios Motakis Jan. 24, 2018, 4:40 p.m. | #18
On 01/24/2018 02:30 PM, Greg Kurz wrote:
> Thanks Emilio for providing these valuable suggestions ! :)
>
> On Sat, 20 Jan 2018 17:03:49 -0500
> "Emilio G. Cota" <cota@braap.org> wrote:
>
>> On Fri, Jan 19, 2018 at 19:05:06 -0500, Emilio G. Cota wrote:
>>>>>> On Fri, 12 Jan 2018 19:32:10 +0800
>>>>>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
>>>>> Since inodes are not completely random, and we usually have a handful of device IDs,
>>>>> we get a much smaller number of entries to track in the hash table.
>>>>>
>>>>> So what this would give:
>>>>> (1)	Would be faster and take less memory than mapping the full inode_nr,devi_id
>>>>> tuple to unique QID paths
>>>>> (2)	Guaranteed not to run out of bits when inode numbers stay below the lowest
>>>>> 54 bits and we have less than 1024 devices.
>>>>> (3)	When we get beyond this this limit, there is a chance we run out of bits to
>>>>> allocate new QID paths, but we can detect this and refuse to serve the offending
>>>>> files instead of allowing a collision.
>>>>>
>>>>> We could tweak the prefix size to match the scenarios that we consider more likely,
>>>>> but I think close to 10-16 bits sounds reasonable enough. What do you think?
>>> Assuming assumption (2) is very likely to be true, I'd suggest
>>> dropping the intermediate hash table altogether, and simply refuse
>>> to work with any files that do not meet (2).
>>>
>>> That said, the naive solution of having a large hash table with all entries
>>> in it might be worth a shot.
>> hmm but that would still take a lot of memory.
>>
>> Given assumption (2), a good compromise would be the following,
>> taking into account that the number of total gids is unlikely to
>> reach even close to 2**64:
>> - bit 63: 0/1 determines "fast" or "slow" encoding
>> - 62-0:
>>    - fast (trivial) encoding: when assumption (2) is met
>>      - 62-53: device id (it fits because of (2))
>>      - 52-0: inode (it fits because of (2))
> And as pointed by Eduard, we may have to take the mount id into account
> as well if we want to support the case where we have bind mounts in the
> exported directory... My understanding is that mount ids are incremental
> and reused when the associated fs gets unmounted: if we assume that the
> host doesn't have more than 1024 mounts, we would need 10 bits to encode
> it.
>
> The fast encoding could be something like:
>
> 62-53: mount id
> 52-43: device id
> 42-0: inode

I don't agree that we should take the mount id into account though.
The TL; DR: I think the issue about bind mounts is distinct from the QID 
path issue, and just happens to be worked around when we (falsely) 
advertise to the guest that 2 files are not the same (even though they 
are). Making unique 2 files that shouldn't be, will cause other issues.

The kernel's 9p client documentation states that with fscache enabled, 
there is no support for coherency when multiple users (i.e. guest and 
host) are reading and writing to the share. If this limitation is not 
taken into account, there are multiple issues with stale caches in the 
guest.

Disambiguating files using mount id might work around fscache 
limitations in this case, but will introduce a host of other bugs. For 
example:
(1) The user starts two containers sharing a directory (via host bind 
mounts) with data
(2) Container 1 writes something to a file in the data dir
(3) Container 2 reads from the file
(4) The guest kernel doesn't know the the file is one and the same, so 
it is twice in the cache. Container 2 might get stale data

The user, wrote the code running in containers 1 and 2, assuming they 
can share a file when running on the same system. For example, one 
container generating the configuration file for another. It doesn't 
matter if the user wrote the applications correctly, syncing data when 
needed. It only breaks because we lied to the guest 9p client, telling 
it that they are distinct files. 9p is supposed to support this.

This is why I think including the mount id in the QID path would be 
another bug, this time in the opposite direction.

In contrast the QID path issues:
(1) do not require touching files on the host, after the guest has 
already mounted the share, to trigger it.
(2) can be explained by the guest assuming that two or more distinct 
files are actually the same.

The bind mount issue:
(1) bind mounts have to be changed on the host after the guest has 
mounted the share. Already a no-no for fscache, and can be explained by 
stale caches in the guest.
(2) The guest is correctly identifying that they refer to the same file. 
There is no collision here.

>
>>    - slow path: assumption (2) isn't met. Then, assign incremental
>>      IDs in the [0,2**63-1] range and track them in a hash table.
>>
>> Choosing 10 or whatever else bits for the device id is of course TBD,
>> as Antonios you pointed out.
>>
> This is a best effort to have a fallback in QEMU. The right way to
> address the issue would really be to extend the protocol to have
> bigger qids (eg, 64 for inode, 32 for device and 32 for mount).

Does this mean we don't need the slow path for the fallback case? I have 
tested a glib hash table implementation of the "fast path", I will look 
into porting it to the QEMU hash table and will send it to this list.

Keep in mind, we still need a hash table for the device id, since it is 
32 bits, but we will try to reserve only 10-16 bits for it.

Cheers,
Tony
Eduard Shishkin Jan. 24, 2018, 6:05 p.m. | #19
On 1/24/2018 5:40 PM, Antonios Motakis wrote:
>
>
> On 01/24/2018 02:30 PM, Greg Kurz wrote:
>> Thanks Emilio for providing these valuable suggestions ! :)
>>
>> On Sat, 20 Jan 2018 17:03:49 -0500
>> "Emilio G. Cota" <cota@braap.org> wrote:
>>
>>> On Fri, Jan 19, 2018 at 19:05:06 -0500, Emilio G. Cota wrote:
>>>>>>> On Fri, 12 Jan 2018 19:32:10 +0800
>>>>>>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
>>>>>> Since inodes are not completely random, and we usually have a
>>>>>> handful of device IDs,
>>>>>> we get a much smaller number of entries to track in the hash table.
>>>>>>
>>>>>> So what this would give:
>>>>>> (1)    Would be faster and take less memory than mapping the full
>>>>>> inode_nr,devi_id
>>>>>> tuple to unique QID paths
>>>>>> (2)    Guaranteed not to run out of bits when inode numbers stay
>>>>>> below the lowest
>>>>>> 54 bits and we have less than 1024 devices.
>>>>>> (3)    When we get beyond this this limit, there is a chance we
>>>>>> run out of bits to
>>>>>> allocate new QID paths, but we can detect this and refuse to serve
>>>>>> the offending
>>>>>> files instead of allowing a collision.
>>>>>>
>>>>>> We could tweak the prefix size to match the scenarios that we
>>>>>> consider more likely,
>>>>>> but I think close to 10-16 bits sounds reasonable enough. What do
>>>>>> you think?
>>>> Assuming assumption (2) is very likely to be true, I'd suggest
>>>> dropping the intermediate hash table altogether, and simply refuse
>>>> to work with any files that do not meet (2).
>>>>
>>>> That said, the naive solution of having a large hash table with all
>>>> entries
>>>> in it might be worth a shot.
>>> hmm but that would still take a lot of memory.
>>>
>>> Given assumption (2), a good compromise would be the following,
>>> taking into account that the number of total gids is unlikely to
>>> reach even close to 2**64:
>>> - bit 63: 0/1 determines "fast" or "slow" encoding
>>> - 62-0:
>>>    - fast (trivial) encoding: when assumption (2) is met
>>>      - 62-53: device id (it fits because of (2))
>>>      - 52-0: inode (it fits because of (2))
>> And as pointed by Eduard, we may have to take the mount id into account
>> as well if we want to support the case where we have bind mounts in the
>> exported directory... My understanding is that mount ids are incremental
>> and reused when the associated fs gets unmounted: if we assume that the
>> host doesn't have more than 1024 mounts, we would need 10 bits to encode
>> it.
>>
>> The fast encoding could be something like:
>>
>> 62-53: mount id
>> 52-43: device id
>> 42-0: inode
>
> I don't agree that we should take the mount id into account though.
> The TL; DR: I think the issue about bind mounts is distinct from the QID
> path issue, and just happens to be worked around when we (falsely)
> advertise to the guest that 2 files are not the same (even though they
> are). Making unique 2 files that shouldn't be, will cause other issues.
>
> The kernel's 9p client documentation states that with fscache enabled,
> there is no support for coherency when multiple users (i.e. guest and
> host) are reading and writing to the share. If this limitation is not
> taken into account, there are multiple issues with stale caches in the
> guest.
>
> Disambiguating files using mount id might work around fscache
> limitations in this case, but will introduce a host of other bugs. For
> example:
> (1) The user starts two containers sharing a directory (via host bind
> mounts) with data
> (2) Container 1 writes something to a file in the data dir
> (3) Container 2 reads from the file
> (4) The guest kernel doesn't know the the file is one and the same, so
> it is twice in the cache. Container 2 might get stale data

It is only problem of the guest that he deceives himself.

>
> The user, wrote the code running in containers 1 and 2, assuming they
> can share a file when running on the same system. For example, one
> container generating the configuration file for another. It doesn't
> matter if the user wrote the applications correctly, syncing data when
> needed. It only breaks because we lied to the guest 9p client, telling
> it that they are distinct files.

Nope, we didn't lie. We passed objective information (st_ino, st_dev, 
st_mountid, etc).

Thanks,
Eduard.

  9p is supposed to support this.
>
> This is why I think including the mount id in the QID path would be
> another bug, this time in the opposite direction.
>
> In contrast the QID path issues:
> (1) do not require touching files on the host, after the guest has
> already mounted the share, to trigger it.
> (2) can be explained by the guest assuming that two or more distinct
> files are actually the same.
>
> The bind mount issue:
> (1) bind mounts have to be changed on the host after the guest has
> mounted the share. Already a no-no for fscache, and can be explained by
> stale caches in the guest.
> (2) The guest is correctly identifying that they refer to the same file.
> There is no collision here.
>
>>
>>>    - slow path: assumption (2) isn't met. Then, assign incremental
>>>      IDs in the [0,2**63-1] range and track them in a hash table.
>>>
>>> Choosing 10 or whatever else bits for the device id is of course TBD,
>>> as Antonios you pointed out.
>>>
>> This is a best effort to have a fallback in QEMU. The right way to
>> address the issue would really be to extend the protocol to have
>> bigger qids (eg, 64 for inode, 32 for device and 32 for mount).
>
> Does this mean we don't need the slow path for the fallback case? I have
> tested a glib hash table implementation of the "fast path", I will look
> into porting it to the QEMU hash table and will send it to this list.
>
> Keep in mind, we still need a hash table for the device id, since it is
> 32 bits, but we will try to reserve only 10-16 bits for it.
>
> Cheers,
> Tony
Veaceslav Falico Jan. 25, 2018, 2:46 p.m. | #20
Hi,

sorry for the late reply, we're acutally working on it internally...

On 1/19/2018 7:05 PM, Greg Kurz wrote:
> On Fri, 19 Jan 2018 17:37:58 +0100
> Veaceslav Falico <veaceslav.falico@huawei.com> wrote:
> 
>> On 1/19/2018 4:52 PM, Eduard Shishkin wrote:
>>>
>>>
>>> On 1/19/2018 11:27 AM, Greg Kurz wrote:  
>>>> On Mon, 15 Jan 2018 11:49:31 +0800
>>>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
>>>>  
>>>>> On 13-Jan-18 00:14, Greg Kurz wrote:  
>>>>>> On Fri, 12 Jan 2018 19:32:10 +0800
>>>>>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
>>>>>>  
>>>>>>> Hello all,
>>>>>>>  
>>>>>>
>>>>>> Hi Antonios,
>>>>>>
>>>>>> I see you have attached a patch to this email... this really isn't the preferred
>>>>>> way to do things since it prevents to comment the patch (at least with my mail
>>>>>> client). The appropriate way would have been to send the patch with a cover
>>>>>> letter, using git-send-email for example.  
>>>>>
>>>>> I apologize for attaching the patch, I should have known better!
>>>>>  
>>>>
>>>> np :)
>>>>  
>>>>>>  
>>>>>>> We have found an issue in the 9p implementation of QEMU, with how qid paths are generated, which can cause qid path collisions and several issues caused by them. In our use case (running containers under VMs) these have proven to be critical.
>>>>>>>  
>>>>>>
>>>>>> Ouch...
>>>>>>  
>>>>>>> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the inode number of the file as input. According to the 9p spec the path should be able to uniquely identify a file, distinct files should not share a path value.
>>>>>>>
>>>>>>> The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.
>>>>>>>
>>>>>>> How to demonstrate the issue:
>>>>>>> 1) Prepare a problematic share:
>>>>>>>  - mount one partition under share/p1/ with some files inside
>>>>>>>  - mount another one *with identical contents* under share/p2/
>>>>>>>  - confirm that both partitions have files with same inode nr, size, etc
>>>>>>> 2) Demonstrate breakage:
>>>>>>>  - start a VM with a virtio-9p pointing to the share
>>>>>>>  - mount 9p share with FSCACHE on
>>>>>>>  - keep open share/p1/file
>>>>>>>  - open and write to share/p2/file
>>>>>>>
>>>>>>> What should happen is, the guest will consider share/p1/file and share/p2/file to be the same file, and since we are using the cache it will not reopen it. We intended to write to partition 2, but we just wrote to partition 1. This is just one example on how the guest may rely on qid paths being unique.
>>>>>>>
>>>>>>> In the use case of containers where we commonly have a few containers per VM, all based on similar images, these kind of qid path collisions are very common and they seem to cause all kinds of funny behavior (sometimes very subtle).
>>>>>>>
>>>>>>> To avoid this situation, the device id of a file needs to be also taken as input for generating a qid path. Unfortunately, the size of both inode nr + device id together would be 96 bits, while we have only 64 bits for the qid path, so we can't just append them and call it a day :(
>>>>>>>
>>>>>>> We have thought of a few approaches, but we would definitely like to hear what the upstream maintainers and community think:
>>>>>>>
>>>>>>> * Full fix: Change the 9p protocol
>>>>>>>
>>>>>>> We would need to support a longer qid path, based on a virtio feature flag. This would take reworking of host and guest parts of virtio-9p, so both QEMU and Linux for most users.
>>>>>>>  
>>>>>>
>>>>>> I agree for a longer qid path, but we shouldn't tie it to a virtio flag since
>>>>>> 9p is transport agnostic. And it happens to be used with a variety of transports.
>>>>>> QEMU has both virtio-9p and a Xen backend for example.
>>>>>>  
>>>>>>> * Fallback and/or interim solutions
>>>>>>>
>>>>>>> A virtio feature flag may be refused by the guest, so we think we still need to make collisions less likely even with 64 bit paths. E.g.  
>>>>>>
>>>>>> In all cases, we would need a fallback solution to support current
>>>>>> guest setups. Also there are several 9p server implementations out
>>>>>> there (ganesha, diod, kvmtool) that are currently used with linux
>>>>>> clients... it will take some time to get everyone in sync :-\
>>>>>>  
>>>>>>> 1. XOR the device id with inode nr to produce the qid path (we attach a proof of concept patch)  
>>>>>>
>>>>>> Hmm... this would still allow collisions. Not good for fallback.
>>>>>>  
>>>>>>> 2. 64 bit hash of device id and inode nr  
>>>>>>
>>>>>> Same here.
>>>>>>  
>>>>>>> 3. other ideas, such as allocating new qid paths and keep a look up table of some sorts (possibly too expensive)
>>>>>>>  
>>>>>>
>>>>>> This would be acceptable for fallback.  
>>>>>
>>>>> Maybe we can use the QEMU hash table (https://github.com/qemu/qemu/blob/master/util/qht.c) but I wonder if it scales to millions of entries. Do you think it is appropriate in this case?
>>>>>  
>>>>
>>>> I don't know QHT, hence Cc'ing Emilio for insights.
>>>>  
>>>>> I was thinking on how to implement something like this, without having to maintain millions of entries... One option we could consider is to split the bits into a directly-mapped part, and a lookup part. For example:
>>>>>
>>>>> Inode =
>>>>> [ 10 first bits ] + [ 54 lowest bits ]
>>>>>
>>>>> A hash table maps [ inode 10 first bits ] + [ device id ] => [ 10 bit prefix ]
>>>>> The prefix is uniquely allocated for each input.
>>>>>
>>>>> Qid path =
>>>>> [ 10 bit prefix ] + [ inode 54 lowest bits ]
>>>>>
>>>>> Since inodes are not completely random, and we usually have a handful of device IDs, we get a much smaller number of entries to track in the hash table.
>>>>>
>>>>> So what this would give:
>>>>> (1)    Would be faster and take less memory than mapping the full inode_nr,devi_id tuple to unique QID paths
>>>>> (2)    Guaranteed not to run out of bits when inode numbers stay below the lowest 54 bits and we have less than 1024 devices.
>>>>> (3)    When we get beyond this this limit, there is a chance we run out of bits to allocate new QID paths, but we can detect this and refuse to serve the offending files instead of allowing a collision.
>>>>>
>>>>> We could tweak the prefix size to match the scenarios that we consider more likely, but I think close to 10-16 bits sounds reasonable enough. What do you think?
>>>>>  
>>>>
>>>> I think that should work. Please send a patchset :)  
>>>
>>> Hi Greg,
>>>
>>> This is based on the assumption that high bits of inode numbers are
>>> always zero, which is unacceptable from my standpoint. Inode numbers
>>> allocator is a per-volume file system feature, so there may appear
>>> allocators, which don't use simple increment and assign inode number
>>> with non-zero high bits even for the first file of a volume.
>>>
>>> As I understand, the problem is that the guest doesn't have enough
>>> information for proper files identification (in our case share/p1/file
>>> and share/p2/file are wrongly identified as the same file). It seems
>>> that we need to transmit device ID to the guest, and not in the high bits of the inode number.
>>>
>>> AFAIK Slava has patches for such transmission, I hope that he will send
>>> it eventually.  
>>
>> They're pretty trivial, however it'll require great effort and coordination
>> for all parties involved, as they break backwards compatibility (QID becomes
>> bigger and, thus, breaks "Q" transmission).
>>
>> I'd like to raise another issue - it seems that st_dev and i_no aren't
>> enough:
>>
>> mkdir -p /tmp/mounted
>> touch /tmp/mounted/file
>> mount -o bind /tmp/mounted t1
>> mount -o bind /tmp/mounted t2
>> stat t{1,2}/file  | grep Inode
>> Device: 805h/2053d      Inode: 42729487    Links: 3
>> Device: 805h/2053d      Inode: 42729487    Links: 3
>>
>> In that case, even with the patch applied, we'll still have the issue of
>> colliding QIDs guest-side - so, for example, after umount t1, t2/file
>> becomes unaccessible, as the inode points to t1...
>>
> 
> t1/file and t2/file really point to the same file on the server, so
> it is expected they have the same QIDs.
> 
>> I'm not sure how to fix this one. Ideas?
> 
> fscache bug ?

I've reproduced it today without fscache:

host:
mount -o bind /tmp/mounted t1
mount -o bind /tmp/mounted t2

guest:
/ # tail -f t1/file &
/ # 321

/ # cat t2/file
321
/ #

host:
mv t1/file t1/file_moved

guest:
/ # cat t2/file
cat: can't open 't2/file': No such file or directory
/ # mount | grep fscache
/ #

Also, per internal discussions, we're not sure if the guest side
is allowed to reuse the FIDs opened previously for same QID.paths.

QEMU holds FID.path for each FID, which contains the actual FS path,
i.e. "/path/to/file". 

So, if we (guest) have two "identical" (per QID.path and RFC) files
(f1 and f2), though in different directories (host and guest), and 
we've accessed f1 once (FID 10, for example) - are we allowed to
re-use FID 10 for f2, if f1's QID.path == f2's QID.path ?

> 
>>
>>>
>>> Thanks,
>>> Eduard.
>>>   
>>>>  
>>>>>>  
>>>>>>> With our proof of concept patch, the issues caused by qid path collisions go away, so it can be seen as an interim solution of sorts. However, the chance of collisions is not eliminated, we are just replacing the current strategy, which is almost guaranteed to cause collisions in certain use cases, with one that makes them more rare. We think that a virtio feature flag for longer qid paths is the only way to eliminate these issues completely.
>>>>>>>
>>>>>>> This is the extent that we were able to analyze the issue from our side, we would like to hear if there are some better ideas, or which approach would be more appropriate for upstream.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>  
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> -- 
>>>>>> Greg
>>>>>>  
>>>>>  
>>>>  
>>>
>>> .
>>>   
>>
>>
> 
> .
>
Veaceslav Falico Jan. 25, 2018, 4:08 p.m. | #21
On 1/25/2018 3:46 PM, Veaceslav Falico wrote:
> Hi,
> 
> sorry for the late reply, we're acutally working on it internally...
> 
> On 1/19/2018 7:05 PM, Greg Kurz wrote:
>> On Fri, 19 Jan 2018 17:37:58 +0100
>> Veaceslav Falico <veaceslav.falico@huawei.com> wrote:
>>
>>> On 1/19/2018 4:52 PM, Eduard Shishkin wrote:
>>>>
>>>>
>>>> On 1/19/2018 11:27 AM, Greg Kurz wrote:  
>>>>> On Mon, 15 Jan 2018 11:49:31 +0800
>>>>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
>>>>>  
>>>>>> On 13-Jan-18 00:14, Greg Kurz wrote:  
>>>>>>> On Fri, 12 Jan 2018 19:32:10 +0800
>>>>>>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
>>>>>>>  
>>>>>>>> Hello all,
>>>>>>>>  
>>>>>>>
>>>>>>> Hi Antonios,
>>>>>>>
>>>>>>> I see you have attached a patch to this email... this really isn't the preferred
>>>>>>> way to do things since it prevents to comment the patch (at least with my mail
>>>>>>> client). The appropriate way would have been to send the patch with a cover
>>>>>>> letter, using git-send-email for example.  
>>>>>>
>>>>>> I apologize for attaching the patch, I should have known better!
>>>>>>  
>>>>>
>>>>> np :)
>>>>>  
>>>>>>>  
>>>>>>>> We have found an issue in the 9p implementation of QEMU, with how qid paths are generated, which can cause qid path collisions and several issues caused by them. In our use case (running containers under VMs) these have proven to be critical.
>>>>>>>>  
>>>>>>>
>>>>>>> Ouch...
>>>>>>>  
>>>>>>>> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the inode number of the file as input. According to the 9p spec the path should be able to uniquely identify a file, distinct files should not share a path value.
>>>>>>>>
>>>>>>>> The current implementation that defines qid.path = inode nr works fine as long as there are not files from multiple partitions visible under the 9p share. In that case, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.
>>>>>>>>
>>>>>>>> How to demonstrate the issue:
>>>>>>>> 1) Prepare a problematic share:
>>>>>>>>  - mount one partition under share/p1/ with some files inside
>>>>>>>>  - mount another one *with identical contents* under share/p2/
>>>>>>>>  - confirm that both partitions have files with same inode nr, size, etc
>>>>>>>> 2) Demonstrate breakage:
>>>>>>>>  - start a VM with a virtio-9p pointing to the share
>>>>>>>>  - mount 9p share with FSCACHE on
>>>>>>>>  - keep open share/p1/file
>>>>>>>>  - open and write to share/p2/file
>>>>>>>>
>>>>>>>> What should happen is, the guest will consider share/p1/file and share/p2/file to be the same file, and since we are using the cache it will not reopen it. We intended to write to partition 2, but we just wrote to partition 1. This is just one example on how the guest may rely on qid paths being unique.
>>>>>>>>
>>>>>>>> In the use case of containers where we commonly have a few containers per VM, all based on similar images, these kind of qid path collisions are very common and they seem to cause all kinds of funny behavior (sometimes very subtle).
>>>>>>>>
>>>>>>>> To avoid this situation, the device id of a file needs to be also taken as input for generating a qid path. Unfortunately, the size of both inode nr + device id together would be 96 bits, while we have only 64 bits for the qid path, so we can't just append them and call it a day :(
>>>>>>>>
>>>>>>>> We have thought of a few approaches, but we would definitely like to hear what the upstream maintainers and community think:
>>>>>>>>
>>>>>>>> * Full fix: Change the 9p protocol
>>>>>>>>
>>>>>>>> We would need to support a longer qid path, based on a virtio feature flag. This would take reworking of host and guest parts of virtio-9p, so both QEMU and Linux for most users.
>>>>>>>>  
>>>>>>>
>>>>>>> I agree for a longer qid path, but we shouldn't tie it to a virtio flag since
>>>>>>> 9p is transport agnostic. And it happens to be used with a variety of transports.
>>>>>>> QEMU has both virtio-9p and a Xen backend for example.
>>>>>>>  
>>>>>>>> * Fallback and/or interim solutions
>>>>>>>>
>>>>>>>> A virtio feature flag may be refused by the guest, so we think we still need to make collisions less likely even with 64 bit paths. E.g.  
>>>>>>>
>>>>>>> In all cases, we would need a fallback solution to support current
>>>>>>> guest setups. Also there are several 9p server implementations out
>>>>>>> there (ganesha, diod, kvmtool) that are currently used with linux
>>>>>>> clients... it will take some time to get everyone in sync :-\
>>>>>>>  
>>>>>>>> 1. XOR the device id with inode nr to produce the qid path (we attach a proof of concept patch)  
>>>>>>>
>>>>>>> Hmm... this would still allow collisions. Not good for fallback.
>>>>>>>  
>>>>>>>> 2. 64 bit hash of device id and inode nr  
>>>>>>>
>>>>>>> Same here.
>>>>>>>  
>>>>>>>> 3. other ideas, such as allocating new qid paths and keep a look up table of some sorts (possibly too expensive)
>>>>>>>>  
>>>>>>>
>>>>>>> This would be acceptable for fallback.  
>>>>>>
>>>>>> Maybe we can use the QEMU hash table (https://github.com/qemu/qemu/blob/master/util/qht.c) but I wonder if it scales to millions of entries. Do you think it is appropriate in this case?
>>>>>>  
>>>>>
>>>>> I don't know QHT, hence Cc'ing Emilio for insights.
>>>>>  
>>>>>> I was thinking on how to implement something like this, without having to maintain millions of entries... One option we could consider is to split the bits into a directly-mapped part, and a lookup part. For example:
>>>>>>
>>>>>> Inode =
>>>>>> [ 10 first bits ] + [ 54 lowest bits ]
>>>>>>
>>>>>> A hash table maps [ inode 10 first bits ] + [ device id ] => [ 10 bit prefix ]
>>>>>> The prefix is uniquely allocated for each input.
>>>>>>
>>>>>> Qid path =
>>>>>> [ 10 bit prefix ] + [ inode 54 lowest bits ]
>>>>>>
>>>>>> Since inodes are not completely random, and we usually have a handful of device IDs, we get a much smaller number of entries to track in the hash table.
>>>>>>
>>>>>> So what this would give:
>>>>>> (1)    Would be faster and take less memory than mapping the full inode_nr,devi_id tuple to unique QID paths
>>>>>> (2)    Guaranteed not to run out of bits when inode numbers stay below the lowest 54 bits and we have less than 1024 devices.
>>>>>> (3)    When we get beyond this this limit, there is a chance we run out of bits to allocate new QID paths, but we can detect this and refuse to serve the offending files instead of allowing a collision.
>>>>>>
>>>>>> We could tweak the prefix size to match the scenarios that we consider more likely, but I think close to 10-16 bits sounds reasonable enough. What do you think?
>>>>>>  
>>>>>
>>>>> I think that should work. Please send a patchset :)  
>>>>
>>>> Hi Greg,
>>>>
>>>> This is based on the assumption that high bits of inode numbers are
>>>> always zero, which is unacceptable from my standpoint. Inode numbers
>>>> allocator is a per-volume file system feature, so there may appear
>>>> allocators, which don't use simple increment and assign inode number
>>>> with non-zero high bits even for the first file of a volume.
>>>>
>>>> As I understand, the problem is that the guest doesn't have enough
>>>> information for proper files identification (in our case share/p1/file
>>>> and share/p2/file are wrongly identified as the same file). It seems
>>>> that we need to transmit device ID to the guest, and not in the high bits of the inode number.
>>>>
>>>> AFAIK Slava has patches for such transmission, I hope that he will send
>>>> it eventually.  
>>>
>>> They're pretty trivial, however it'll require great effort and coordination
>>> for all parties involved, as they break backwards compatibility (QID becomes
>>> bigger and, thus, breaks "Q" transmission).
>>>
>>> I'd like to raise another issue - it seems that st_dev and i_no aren't
>>> enough:
>>>
>>> mkdir -p /tmp/mounted
>>> touch /tmp/mounted/file
>>> mount -o bind /tmp/mounted t1
>>> mount -o bind /tmp/mounted t2
>>> stat t{1,2}/file  | grep Inode
>>> Device: 805h/2053d      Inode: 42729487    Links: 3
>>> Device: 805h/2053d      Inode: 42729487    Links: 3
>>>
>>> In that case, even with the patch applied, we'll still have the issue of
>>> colliding QIDs guest-side - so, for example, after umount t1, t2/file
>>> becomes unaccessible, as the inode points to t1...
>>>
>>
>> t1/file and t2/file really point to the same file on the server, so
>> it is expected they have the same QIDs.
>>
>>> I'm not sure how to fix this one. Ideas?
>>
>> fscache bug ?
> 
> I've reproduced it today without fscache:
> 
> host:
> mount -o bind /tmp/mounted t1
> mount -o bind /tmp/mounted t2
> 
> guest:
> / # tail -f t1/file &
> / # 321
> 
> / # cat t2/file
> 321
> / #
> 
> host:
> mv t1/file t1/file_moved
> 
> guest:
> / # cat t2/file
> cat: can't open 't2/file': No such file or directory
> / # mount | grep fscache
> / #

Sorry, disregard this test case, it's operating normally -
when we move the t1/file, we also move the t2/file, as they're
the same directory... Sorry, it's a brainfart after a long
discussion about the issue :).

So, it's still not reproducible without (fs)cache guest-side.


Anyway, the question below still stands - is the guest
allowed to re-use the FIDs for the files with same QIDs?

> 
> Also, per internal discussions, we're not sure if the guest side
> is allowed to reuse the FIDs opened previously for same QID.paths.
> 
> QEMU holds FID.path for each FID, which contains the actual FS path,
> i.e. "/path/to/file". 
> 
> So, if we (guest) have two "identical" (per QID.path and RFC) files
> (f1 and f2), though in different directories (host and guest), and 
> we've accessed f1 once (FID 10, for example) - are we allowed to
> re-use FID 10 for f2, if f1's QID.path == f2's QID.path ?
> 
>>
>>>
>>>>
>>>> Thanks,
>>>> Eduard.
>>>>   
>>>>>  
>>>>>>>  
>>>>>>>> With our proof of concept patch, the issues caused by qid path collisions go away, so it can be seen as an interim solution of sorts. However, the chance of collisions is not eliminated, we are just replacing the current strategy, which is almost guaranteed to cause collisions in certain use cases, with one that makes them more rare. We think that a virtio feature flag for longer qid paths is the only way to eliminate these issues completely.
>>>>>>>>
>>>>>>>> This is the extent that we were able to analyze the issue from our side, we would like to hear if there are some better ideas, or which approach would be more appropriate for upstream.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>  
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> -- 
>>>>>>> Greg
>>>>>>>  
>>>>>>  
>>>>>  
>>>>
>>>> .
>>>>   
>>>
>>>
>>
>> .
>>
>
Greg Kurz Jan. 29, 2018, 5:05 p.m. | #22
On Thu, 25 Jan 2018 17:08:40 +0100
Veaceslav Falico <veaceslav.falico@huawei.com> wrote:

> On 1/25/2018 3:46 PM, Veaceslav Falico wrote:
[...]
> > 
> > I've reproduced it today without fscache:
> > 
> > host:
> > mount -o bind /tmp/mounted t1
> > mount -o bind /tmp/mounted t2
> > 
> > guest:
> > / # tail -f t1/file &
> > / # 321
> > 
> > / # cat t2/file
> > 321
> > / #
> > 
> > host:
> > mv t1/file t1/file_moved
> > 
> > guest:
> > / # cat t2/file
> > cat: can't open 't2/file': No such file or directory
> > / # mount | grep fscache
> > / #  
> 
> Sorry, disregard this test case, it's operating normally -
> when we move the t1/file, we also move the t2/file, as they're
> the same directory... Sorry, it's a brainfart after a long
> discussion about the issue :).
> 

No problem :)

> So, it's still not reproducible without (fs)cache guest-side.
> 
> 
> Anyway, the question below still stands - is the guest
> allowed to re-use the FIDs for the files with same QIDs?
> 

AFAIU FIDs have a 1:1 relation to either a path in the file hierarchy or
to an open file. I don't think they can be re-used.

> > 
> > Also, per internal discussions, we're not sure if the guest side
> > is allowed to reuse the FIDs opened previously for same QID.paths.
> > 
> > QEMU holds FID.path for each FID, which contains the actual FS path,
> > i.e. "/path/to/file". 
> > 
> > So, if we (guest) have two "identical" (per QID.path and RFC) files
> > (f1 and f2), though in different directories (host and guest), and 
> > we've accessed f1 once (FID 10, for example) - are we allowed to
> > re-use FID 10 for f2, if f1's QID.path == f2's QID.path ?
> >   
> >>  
> >>>  
> >>>>
> >>>> Thanks,
> >>>> Eduard.
> >>>>     
> >>>>>    
> >>>>>>>    
> >>>>>>>> With our proof of concept patch, the issues caused by qid path collisions go away, so it can be seen as an interim solution of sorts. However, the chance of collisions is not eliminated, we are just replacing the current strategy, which is almost guaranteed to cause collisions in certain use cases, with one that makes them more rare. We think that a virtio feature flag for longer qid paths is the only way to eliminate these issues completely.
> >>>>>>>>
> >>>>>>>> This is the extent that we were able to analyze the issue from our side, we would like to hear if there are some better ideas, or which approach would be more appropriate for upstream.
> >>>>>>>>
> >>>>>>>> Best regards,
> >>>>>>>>    
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> -- 
> >>>>>>> Greg
> >>>>>>>    
> >>>>>>    
> >>>>>    
> >>>>
> >>>> .
> >>>>     
> >>>
> >>>  
> >>
> >> .
> >>  
> >   
> 
>

Patch

diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
index 393a2b2..a810d13 100644
--- a/hw/9pfs/9p.c
+++ b/hw/9pfs/9p.c
@@ -583,11 +583,7 @@  static void virtfs_reset(V9fsPDU *pdu)
 /* This is the algorithm from ufs in spfs */
 static void stat_to_qid(const struct stat *stbuf, V9fsQID *qidp)
 {
-    size_t size;
-
-    memset(&qidp->path, 0, sizeof(qidp->path));
-    size = MIN(sizeof(stbuf->st_ino), sizeof(qidp->path));
-    memcpy(&qidp->path, &stbuf->st_ino, size);
+    qidp->path = stbuf->st_ino ^ ((int64_t)stbuf->st_dev << 16);
     qidp->version = stbuf->st_mtime ^ (stbuf->st_size << 8);
     qidp->type = 0;
     if (S_ISDIR(stbuf->st_mode)) {