Message ID | 20200518181802.3406919-1-christian.brauner@ubuntu.com |
---|---|
State | New |
Headers | show |
Series | [SRU,UNSTABLE/FOCAL/BIONIC,v2] UBUNTU: SAUCE: shiftfs: fix dentry revalidation | expand |
On 18.05.20 20:18, Christian Brauner wrote: > BugLink: https://bugs.launchpad.net/bugs/1879196 > > While fixing [1] we introduced a regression and triggered [2] whereby > the underlay and overlay go out of sync because they disagree about the > dcache. Such situation easily arise when the underlay is modified > directly. This means the overlay needs to revalidate the dentry in the > dcache to catchup with the underlay. > > Note, in general it's not advisable to directly modify the underlay > while a shiftfs mount is on top. In some way this means we need to keep > two caches in sync and it's hard enough to keep a single cache happy. > But shiftfs' use-case is inherently prone to be used for exactly that. > So this is something we have to navigate carefully and honestly we have > no full model upstream that does the same. Overlayfs has the copy-up > behavior which let's it get around most of the issues but we don't have > it and ecryptfs is broken in such scenarios which we verified quite a > while back. > In any case, I built a kernel with this patch and re-ran all regressions > that are related to this that we have so far (cf. [1], [2], and [3]). > None of them were reproducible with this patch here. So we still fix the > ESTALE issue but also keep underlay and overlay in sync. This test kernel should be provided to the bug reporter, so it can be confirmed to fix the issue there as well. But more importantly, you submitted this for Bionic and there is no shiftfs in that release. Only Eoan. I am not convinced you tried at least to apply the patch to all the releases you submit for. There is always chances that code had to be adjusted to be backported. Or that it applies but no longer compiles because added function calls do not exist there. -Stefan > > [1]: https://bugs.launchpad.net/bugs/1872757 > [2]: https://bugs.launchpad.net/bugs/1879196 > [3]: https://bugs.launchpad.net/bugs/1860041 > Cc: Seth Forshee <seth.forshee@canonical.com> > Cc: James Troup <james.troup@canonical.com> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> > --- > fs/shiftfs.c | 10 ++++++++-- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff --git a/fs/shiftfs.c b/fs/shiftfs.c > index 39b878a6f820..652fb67b0e0b 100644 > --- a/fs/shiftfs.c > +++ b/fs/shiftfs.c > @@ -240,6 +240,9 @@ static int shiftfs_d_weak_revalidate(struct dentry *dentry, unsigned int flags) > int err = 1; > struct dentry *lowerd = dentry->d_fsdata; > > + if (d_is_negative(lowerd) != d_is_negative(dentry)) > + return 0; > + > if (lowerd->d_flags & DCACHE_OP_WEAK_REVALIDATE) { > err = lowerd->d_op->d_weak_revalidate(lowerd, flags); > if (err < 0) > @@ -264,13 +267,16 @@ static int shiftfs_d_revalidate(struct dentry *dentry, unsigned int flags) > if (flags & LOOKUP_RCU) > return -ECHILD; > > + if (d_unhashed(lowerd) || > + d_is_negative(lowerd) != d_is_negative(dentry)) > + return 0; > + > if (lowerd->d_flags & DCACHE_OP_REVALIDATE) { > err = lowerd->d_op->d_revalidate(lowerd, flags); > if (err < 0) > return err; > if (!err) { > - if (!(flags & LOOKUP_RCU)) > - d_invalidate(lowerd); > + d_invalidate(lowerd); > return -ESTALE; > } > } > > base-commit: 9dd531622eff2baf657b906293745ca6dab9bb91 >
On Tue, May 19, 2020 at 07:43:21AM +0200, Stefan Bader wrote: > On 18.05.20 20:18, Christian Brauner wrote: > > BugLink: https://bugs.launchpad.net/bugs/1879196 > > > > While fixing [1] we introduced a regression and triggered [2] whereby > > the underlay and overlay go out of sync because they disagree about the > > dcache. Such situation easily arise when the underlay is modified > > directly. This means the overlay needs to revalidate the dentry in the > > dcache to catchup with the underlay. > > > > Note, in general it's not advisable to directly modify the underlay > > while a shiftfs mount is on top. In some way this means we need to keep > > two caches in sync and it's hard enough to keep a single cache happy. > > But shiftfs' use-case is inherently prone to be used for exactly that. > > So this is something we have to navigate carefully and honestly we have > > no full model upstream that does the same. Overlayfs has the copy-up > > behavior which let's it get around most of the issues but we don't have > > it and ecryptfs is broken in such scenarios which we verified quite a > > while back. > > In any case, I built a kernel with this patch and re-ran all regressions > > that are related to this that we have so far (cf. [1], [2], and [3]). > > None of them were reproducible with this patch here. So we still fix the > > ESTALE issue but also keep underlay and overlay in sync. > > This test kernel should be provided to the bug reporter, so it can be confirmed > to fix the issue there as well. See https://drive.google.com/open?id=19iTwaFSYNS95_I-gD_rvFoV9cMAfy6io Elmo is also Cced here. I don't have any other place I can upload this, I think. > But more importantly, you submitted this for Bionic and there is no shiftfs in > that release. Only Eoan. I am not convinced you tried at least to apply the Yeah, sorry that's always my thing mixing up that shiftfs is in Eoan and not in Bionic. Similar to all the other patches, Eoan doesn't need a separate patch I've verified this on top of Eoan master-next that has all the other shiftfs patches: ubuntu@amdgpu:~/linux/ubuntu/ubuntu-eoan$ git am ./0001-UBUNTU-SAUCE-shiftfs-fix-dentry-revalidation.patch Applying: UBUNTU: SAUCE: shiftfs: fix dentry revalidation on top of ubuntu@amdgpu:~/linux/ubuntu/ubuntu-eoan$ git am ./0001-UBUNTU-SAUCE-shiftfs-fix-dentry-revalidation.patch Applying: UBUNTU: SAUCE: shiftfs: fix dentry revalidation commit d53d4b6679bb89ae78787dca0fea951752e27d48 (upstream/master-next) Author: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Date: Fri May 15 13:03:42 2020 +0200 UBUNTU: Ubuntu-5.3.0-54.48 Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> I'm not letting shiftfs across kernel versions go out of sync precisely (at least until we absolutely have to) so that patches can be applied trivially across releases so it's easier for all of you. Thanks! Christian
On Tue, May 19, 2020 at 08:02:25AM +0200, Christian Brauner wrote: > On Tue, May 19, 2020 at 07:43:21AM +0200, Stefan Bader wrote: > > On 18.05.20 20:18, Christian Brauner wrote: > > > BugLink: https://bugs.launchpad.net/bugs/1879196 > > > > > > While fixing [1] we introduced a regression and triggered [2] whereby > > > the underlay and overlay go out of sync because they disagree about the > > > dcache. Such situation easily arise when the underlay is modified > > > directly. This means the overlay needs to revalidate the dentry in the > > > dcache to catchup with the underlay. > > > > > > Note, in general it's not advisable to directly modify the underlay > > > while a shiftfs mount is on top. In some way this means we need to keep > > > two caches in sync and it's hard enough to keep a single cache happy. > > > But shiftfs' use-case is inherently prone to be used for exactly that. > > > So this is something we have to navigate carefully and honestly we have > > > no full model upstream that does the same. Overlayfs has the copy-up > > > behavior which let's it get around most of the issues but we don't have > > > it and ecryptfs is broken in such scenarios which we verified quite a > > > while back. > > > In any case, I built a kernel with this patch and re-ran all regressions > > > that are related to this that we have so far (cf. [1], [2], and [3]). > > > None of them were reproducible with this patch here. So we still fix the > > > ESTALE issue but also keep underlay and overlay in sync. > > > > This test kernel should be provided to the bug reporter, so it can be confirmed > > to fix the issue there as well. > > See > https://drive.google.com/open?id=19iTwaFSYNS95_I-gD_rvFoV9cMAfy6io > Elmo is also Cced here. I don't have any other place I can upload this, > I think. Just so I don't get railed for this later in case there's an issue. Here's an asciinema recording of all passing tests: https://asciinema.org/a/331649 > > > But more importantly, you submitted this for Bionic and there is no shiftfs in > > that release. Only Eoan. I am not convinced you tried at least to apply the > > Yeah, sorry that's always my thing mixing up that shiftfs is in Eoan and > not in Bionic. > Similar to all the other patches, Eoan doesn't need a separate patch > I've verified this on top of Eoan master-next that has all the other > shiftfs patches: > > ubuntu@amdgpu:~/linux/ubuntu/ubuntu-eoan$ git am ./0001-UBUNTU-SAUCE-shiftfs-fix-dentry-revalidation.patch > Applying: UBUNTU: SAUCE: shiftfs: fix dentry revalidation > > on top of > > ubuntu@amdgpu:~/linux/ubuntu/ubuntu-eoan$ git am ./0001-UBUNTU-SAUCE-shiftfs-fix-dentry-revalidation.patch > Applying: UBUNTU: SAUCE: shiftfs: fix dentry revalidation > commit d53d4b6679bb89ae78787dca0fea951752e27d48 (upstream/master-next) > Author: Kleber Sacilotto de Souza <kleber.souza@canonical.com> > Date: Fri May 15 13:03:42 2020 +0200 > > UBUNTU: Ubuntu-5.3.0-54.48 > > Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> > > I'm not letting shiftfs across kernel versions go out of sync precisely > (at least until we absolutely have to) so that patches can be applied > trivially across releases so it's easier for all of you. Do you still want me to resend with s/BIONIC/EOAN/ in there? Thanks! Christian
On 19.05.20 08:02, Christian Brauner wrote: > On Tue, May 19, 2020 at 07:43:21AM +0200, Stefan Bader wrote: >> On 18.05.20 20:18, Christian Brauner wrote: >>> BugLink: https://bugs.launchpad.net/bugs/1879196 >>> >>> While fixing [1] we introduced a regression and triggered [2] whereby >>> the underlay and overlay go out of sync because they disagree about the >>> dcache. Such situation easily arise when the underlay is modified >>> directly. This means the overlay needs to revalidate the dentry in the >>> dcache to catchup with the underlay. >>> >>> Note, in general it's not advisable to directly modify the underlay >>> while a shiftfs mount is on top. In some way this means we need to keep >>> two caches in sync and it's hard enough to keep a single cache happy. >>> But shiftfs' use-case is inherently prone to be used for exactly that. >>> So this is something we have to navigate carefully and honestly we have >>> no full model upstream that does the same. Overlayfs has the copy-up >>> behavior which let's it get around most of the issues but we don't have >>> it and ecryptfs is broken in such scenarios which we verified quite a >>> while back. >>> In any case, I built a kernel with this patch and re-ran all regressions >>> that are related to this that we have so far (cf. [1], [2], and [3]). >>> None of them were reproducible with this patch here. So we still fix the >>> ESTALE issue but also keep underlay and overlay in sync. >> >> This test kernel should be provided to the bug reporter, so it can be confirmed >> to fix the issue there as well. > > See > https://drive.google.com/open?id=19iTwaFSYNS95_I-gD_rvFoV9cMAfy6io > Elmo is also Cced here. I don't have any other place I can upload this, > I think. Since it is James, you could use private-fileshare.c.c, otherwise people.c.c for providing test kernels. The reason I would like to see the reporter (did not look close enough to see its James initially, but even better) confirm the fix is that while you did regression test against the known issues, this might introduce something new. And there some unknown use pattern not directly aimed at something adds value. > >> But more importantly, you submitted this for Bionic and there is no shiftfs in >> that release. Only Eoan. I am not convinced you tried at least to apply the > > Yeah, sorry that's always my thing mixing up that shiftfs is in Eoan and > not in Bionic. > Similar to all the other patches, Eoan doesn't need a separate patch > I've verified this on top of Eoan master-next that has all the other > shiftfs patches: Problem there is that this causes confusion as the bug report has no nominations to cross-check either. And the SRU justification should go into the description there. Also one gets a bit doubtful of how much care was taken with the fix if the submission looks like it was hurried out. > > ubuntu@amdgpu:~/linux/ubuntu/ubuntu-eoan$ git am ./0001-UBUNTU-SAUCE-shiftfs-fix-dentry-revalidation.patch > Applying: UBUNTU: SAUCE: shiftfs: fix dentry revalidation > > on top of > > ubuntu@amdgpu:~/linux/ubuntu/ubuntu-eoan$ git am ./0001-UBUNTU-SAUCE-shiftfs-fix-dentry-revalidation.patch > Applying: UBUNTU: SAUCE: shiftfs: fix dentry revalidation > commit d53d4b6679bb89ae78787dca0fea951752e27d48 (upstream/master-next) > Author: Kleber Sacilotto de Souza <kleber.souza@canonical.com> > Date: Fri May 15 13:03:42 2020 +0200 > > UBUNTU: Ubuntu-5.3.0-54.48 > > Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> > > I'm not letting shiftfs across kernel versions go out of sync precisely > (at least until we absolutely have to) so that patches can be applied > trivially across releases so it's easier for all of you. I trust that this is true for shiftfs.c but you might be using kernel functions outside your driver domain which do not exist in older kernels. I would say once we get additional test results by James and you refreshed the bug report (I added nominations for focal and eoan, the importance I set to medium as I understood you as this might be an uncommon use-case which triggers it), then re-send your v2 for unstable,focal, and eoan. -Stefan > > Thanks! > Christian >
On Tue, May 19, 2020 at 10:52:10AM +0200, Stefan Bader wrote: > On 19.05.20 08:02, Christian Brauner wrote: > > On Tue, May 19, 2020 at 07:43:21AM +0200, Stefan Bader wrote: > >> On 18.05.20 20:18, Christian Brauner wrote: > >>> BugLink: https://bugs.launchpad.net/bugs/1879196 > >>> > >>> While fixing [1] we introduced a regression and triggered [2] whereby > >>> the underlay and overlay go out of sync because they disagree about the > >>> dcache. Such situation easily arise when the underlay is modified > >>> directly. This means the overlay needs to revalidate the dentry in the > >>> dcache to catchup with the underlay. > >>> > >>> Note, in general it's not advisable to directly modify the underlay > >>> while a shiftfs mount is on top. In some way this means we need to keep > >>> two caches in sync and it's hard enough to keep a single cache happy. > >>> But shiftfs' use-case is inherently prone to be used for exactly that. > >>> So this is something we have to navigate carefully and honestly we have > >>> no full model upstream that does the same. Overlayfs has the copy-up > >>> behavior which let's it get around most of the issues but we don't have > >>> it and ecryptfs is broken in such scenarios which we verified quite a > >>> while back. > >>> In any case, I built a kernel with this patch and re-ran all regressions > >>> that are related to this that we have so far (cf. [1], [2], and [3]). > >>> None of them were reproducible with this patch here. So we still fix the > >>> ESTALE issue but also keep underlay and overlay in sync. > >> > >> This test kernel should be provided to the bug reporter, so it can be confirmed > >> to fix the issue there as well. > > > > See > > https://drive.google.com/open?id=19iTwaFSYNS95_I-gD_rvFoV9cMAfy6io > > Elmo is also Cced here. I don't have any other place I can upload this, > > I think. > > Since it is James, you could use private-fileshare.c.c, otherwise people.c.c for > providing test kernels. The reason I would like to see the reporter (did not > look close enough to see its James initially, but even better) confirm the fix > is that while you did regression test against the known issues, this might > introduce something new. And there some unknown use pattern not directly aimed > at something adds value. It might be worth gathering all the regression tests I've posted into the launchpad bugs into an out of Ubuntu-kernel-tree test-suite somewhere but I frankly lack the time to do that at the moment. But I'll keep it in mind. The LXD test-suite is pretty good at detecting these regressions usually but all bugs here are use-cases that are very rare. The first bug was re-opening a deleted file through /proc/<pid>/fd/<nr>, the second was unlinking a device file and then stat()ing the fd. In both cases they take specific kernel paths. For the re-opening case it's the LOOKUP_JUMPED path that hits d_weak_revalidate and for device files the pinning works differently in the vfs as it does go through the device layer (in convoluted ways). We had Elmo's issue originally in one of the first test version of shiftfs so we introduced agressive revalidation which then led to scenarios where we caused the vfs to think a file had gone ESTALE. This specifically was true for temporary files. That ESTALE is what prevented re-openeing through /proc/<pid>/fd/<nr>. The original change back from this agressive style or revalidation was made under the assumption that we can essentially simply trust the underlay's cache but that presupposes that the underlay isn't modified directly but that squares with directory sharing. So the new version is aggressively revalidating but handles the d_weak_revalidate case which is hit for LOOKUP_JUMPED (and thus for going through proc magic links) by not invalidating a dentry simply because it's i_nlink count is 0. Where i_nlink count == 0 means that it was deleted while we still held it open which is a fine thing to do and backed by posix fs semantics of course. The problem with the original version was - if my analysis is correct - that d_weak_revalidate() for shiftfs returned 0 when i_nlink was 0. But that obviously doesn't matter since someone can still hold a reference to the inode. Christian
Christian Brauner <christian.brauner@ubuntu.com> writes: > See > https://drive.google.com/open?id=19iTwaFSYNS95_I-gD_rvFoV9cMAfy6io > Elmo is also Cced here. I don't have any other place I can upload this, > I think. Sorry for not replying earlier; the only place I can reproduce this is my laptop and I didn't have the space last week to reboot it. I did however finally try over the weekend. When I boot into this kernel, I get a degraded (low-res, laptop screen only) X session and lxd doesn't work: $ lxc list Error: Get "http://unix.socket/1.0": dial unix /var/snap/lxd/common/lxd/unix.socket: connect: connection refused $ The X session looks like the lack of i915 module - I don't see a linux-modules-extra package? Not sure what module/config is missing that is causing lxd to not work....
On Mon, May 25, 2020 at 03:33:03PM +0100, James Troup wrote: > Christian Brauner <christian.brauner@ubuntu.com> writes: > > > See > > https://drive.google.com/open?id=19iTwaFSYNS95_I-gD_rvFoV9cMAfy6io > > Elmo is also Cced here. I don't have any other place I can upload this, > > I think. > > Sorry for not replying earlier; the only place I can reproduce this is > my laptop and I didn't have the space last week to reboot it. I did > however finally try over the weekend. > > When I boot into this kernel, I get a degraded (low-res, laptop screen > only) X session and lxd doesn't work: > > $ lxc list > Error: Get "http://unix.socket/1.0": dial unix /var/snap/lxd/common/lxd/unix.socket: connect: connection refused > $ > > The X session looks like the lack of i915 module - I don't see a > linux-modules-extra package? Oh yeah, I don't use anything like that on my test machines, sorry. > > Not sure what module/config is missing that is causing lxd to not > work.... It's the zfs dkms. I'm not sure it can even be built with unstable but probably not. I'm building you a focal kernel with that patch right now and which should also include all the bits and pieces. Should take some time though. The download from launchpad alone. Christian
On Mon, May 25, 2020 at 04:37:40PM +0200, Christian Brauner wrote: > On Mon, May 25, 2020 at 03:33:03PM +0100, James Troup wrote: > > Christian Brauner <christian.brauner@ubuntu.com> writes: > > > > > See > > > https://drive.google.com/open?id=19iTwaFSYNS95_I-gD_rvFoV9cMAfy6io > > > Elmo is also Cced here. I don't have any other place I can upload this, > > > I think. > > > > Sorry for not replying earlier; the only place I can reproduce this is > > my laptop and I didn't have the space last week to reboot it. I did > > however finally try over the weekend. > > > > When I boot into this kernel, I get a degraded (low-res, laptop screen > > only) X session and lxd doesn't work: > > > > $ lxc list > > Error: Get "http://unix.socket/1.0": dial unix /var/snap/lxd/common/lxd/unix.socket: connect: connection refused > > $ > > > > The X session looks like the lack of i915 module - I don't see a > > linux-modules-extra package? > > Oh yeah, I don't use anything like that on my test machines, sorry. > > > > > Not sure what module/config is missing that is causing lxd to not > > work.... > > It's the zfs dkms. I'm not sure it can even be built with unstable but > probably not. I'm building you a focal kernel with that patch right now > and which should also include all the bits and pieces. > > Should take some time though. The download from launchpad alone. Ok, here's a focal kernel master-next. It should include zfs and hopefully i915 too: https://drive.google.com/drive/u/1/folders/19iTwaFSYNS95_I-gD_rvFoV9cMAfy6io Christian
Christian Brauner <christian.brauner@ubuntu.com> writes: > Ok, here's a focal kernel master-next. It should include zfs and > hopefully i915 too: > https://drive.google.com/drive/u/1/folders/19iTwaFSYNS95_I-gD_rvFoV9cMAfy6io OK, that's much better, thanks. I'm booted into it now and unfortunately it doesn't solve the original problem... After editing a file in emacs on the host: 🙂 james@malefic:~/projects/tasta$ md5sum xxx.foo 5d56b88fddb21d4134eca0294dd76c79 xxx.foo 🙂 james@malefic:~/projects/tasta$ And in the lxc with the shifted FS: ubuntu@portal:~/tasta$ md5sum xxx.foo 14758f1afd44c09b7992073ccf00b43d xxx.foo ubuntu@portal:~/tasta$ 🙂 james@malefic:~/projects/tasta$ uname -a Linux malefic 5.4.0-33-generic #37 SMP Mon May 25 14:40:20 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux 🙂 james@malefic:~/projects/tasta$
On Wed, May 27, 2020 at 07:41:39PM +0100, James Troup wrote: > Christian Brauner <christian.brauner@ubuntu.com> writes: > > > Ok, here's a focal kernel master-next. It should include zfs and > > hopefully i915 too: > > https://drive.google.com/drive/u/1/folders/19iTwaFSYNS95_I-gD_rvFoV9cMAfy6io > > OK, that's much better, thanks. I'm booted into it now and > unfortunately it doesn't solve the original problem... > > After editing a file in emacs on the host: > > 🙂 james@malefic:~/projects/tasta$ md5sum xxx.foo > 5d56b88fddb21d4134eca0294dd76c79 xxx.foo > 🙂 james@malefic:~/projects/tasta$ > > And in the lxc with the shifted FS: > > ubuntu@portal:~/tasta$ md5sum xxx.foo > 14758f1afd44c09b7992073ccf00b43d xxx.foo > ubuntu@portal:~/tasta$ > > 🙂 james@malefic:~/projects/tasta$ uname -a > Linux malefic 5.4.0-33-generic #37 SMP Mon May 25 14:40:20 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux > 🙂 james@malefic:~/projects/tasta$ I'm not sure what's happening but I can't reproduce this at all. See: https://asciinema.org/a/334491 https://asciinema.org/a/334492 What's the backing filesystem you're using, i.e. whats your host filesystem? Any other details that might help? Christian
Christian Brauner <christian.brauner@ubuntu.com> writes: > https://asciinema.org/a/334491 > https://asciinema.org/a/334492 Like I mentioned in the bug report, emacs is the key to reproducing it for me. I don't know why. But if I use vim or shell commands, I can't (reliably or easily) reproduce it either. However, I can everytime with emacs. > What's the backing filesystem you're using, i.e. whats your host > filesystem? Any other details that might help? The host is a FDE ext4 FS (so with LVM + DM crypt). The lxc is on zfs. Ah, and I'm using ZRAM (zram-config package is installed), I guess that could be relevant.
James Troup <james.troup@canonical.com> writes: > Christian Brauner <christian.brauner@ubuntu.com> writes: > >> https://asciinema.org/a/334491 >> https://asciinema.org/a/334492 > > Like I mentioned in the bug report, emacs is the key to reproducing it > for me. I don't know why. But if I use vim or shell commands, I > can't (reliably or easily) reproduce it either. However, I can > everytime with emacs. > >> What's the backing filesystem you're using, i.e. whats your host >> filesystem? Any other details that might help? > > The host is a FDE ext4 FS (so with LVM + DM crypt). The lxc is on > zfs. Ah, and I'm using ZRAM (zram-config package is installed), I > guess that could be relevant. Hey, any update on this or anything I can do to help move this forward? Chris Sanders reporting seeing the same problem, so I don't think it's something specific to my setup/laptop. FWIW, I can also apparently reproduce it reliably with 'bzr revert', so not just emacs either. I wonder if it has to do with updating a file in place vs. writing to a temporary file and rename()'ing into place?
On 24.08.20 05:29, James Troup wrote: > James Troup <james.troup@canonical.com> writes: > >> Christian Brauner <christian.brauner@ubuntu.com> writes: >> >>> https://asciinema.org/a/334491 >>> https://asciinema.org/a/334492 >> >> Like I mentioned in the bug report, emacs is the key to reproducing it >> for me. I don't know why. But if I use vim or shell commands, I >> can't (reliably or easily) reproduce it either. However, I can >> everytime with emacs. >> >>> What's the backing filesystem you're using, i.e. whats your host >>> filesystem? Any other details that might help? >> >> The host is a FDE ext4 FS (so with LVM + DM crypt). The lxc is on >> zfs. Ah, and I'm using ZRAM (zram-config package is installed), I >> guess that could be relevant. > > Hey, any update on this or anything I can do to help move this > forward? Chris Sanders reporting seeing the same problem, so I don't > think it's something specific to my setup/laptop. > > FWIW, I can also apparently reproduce it reliably with 'bzr revert', > so not just emacs either. I wonder if it has to do with updating a > file in place vs. writing to a temporary file and rename()'ing into > place? > Hi James, were you provided with a test kernel to ensure the change solves the problem and does not introduce any new problematic behavior? -Stefan
On Mon, Aug 24, 2020 at 09:13:11AM +0200, Stefan Bader wrote: > On 24.08.20 05:29, James Troup wrote: > > James Troup <james.troup@canonical.com> writes: > > > >> Christian Brauner <christian.brauner@ubuntu.com> writes: > >> > >>> https://asciinema.org/a/334491 > >>> https://asciinema.org/a/334492 > >> > >> Like I mentioned in the bug report, emacs is the key to reproducing it > >> for me. I don't know why. But if I use vim or shell commands, I > >> can't (reliably or easily) reproduce it either. However, I can > >> everytime with emacs. > >> > >>> What's the backing filesystem you're using, i.e. whats your host > >>> filesystem? Any other details that might help? > >> > >> The host is a FDE ext4 FS (so with LVM + DM crypt). The lxc is on > >> zfs. Ah, and I'm using ZRAM (zram-config package is installed), I > >> guess that could be relevant. > > > > Hey, any update on this or anything I can do to help move this > > forward? Chris Sanders reporting seeing the same problem, so I don't > > think it's something specific to my setup/laptop. > > > > FWIW, I can also apparently reproduce it reliably with 'bzr revert', > > so not just emacs either. I wonder if it has to do with updating a > > file in place vs. writing to a temporary file and rename()'ing into > > place? > > > Hi James, > > were you provided with a test kernel to ensure the change solves the problem and > does not introduce any new problematic behavior? The change here is not relevant anymore otherwise I would've resent. Once Plumbers is over I'll look into this again. Thanks! Christian
On Mon, Aug 24, 2020 at 04:29:06AM +0100, James Troup wrote: > James Troup <james.troup@canonical.com> writes: > > > Christian Brauner <christian.brauner@ubuntu.com> writes: > > > >> https://asciinema.org/a/334491 > >> https://asciinema.org/a/334492 > > > > Like I mentioned in the bug report, emacs is the key to reproducing it > > for me. I don't know why. But if I use vim or shell commands, I > > can't (reliably or easily) reproduce it either. However, I can > > everytime with emacs. > > > >> What's the backing filesystem you're using, i.e. whats your host > >> filesystem? Any other details that might help? > > > > The host is a FDE ext4 FS (so with LVM + DM crypt). The lxc is on > > zfs. Ah, and I'm using ZRAM (zram-config package is installed), I > > guess that could be relevant. > > Hey, any update on this or anything I can do to help move this > forward? Chris Sanders reporting seeing the same problem, so I don't > think it's something specific to my setup/laptop. > > FWIW, I can also apparently reproduce it reliably with 'bzr revert', > so not just emacs either. I wonder if it has to do with updating a > file in place vs. writing to a temporary file and rename()'ing into > place? Thanks James. I haven't forgotten. I simply didn't have time to look into it yet. I'm look a now but I'll likely need another week because of Plumbers. Sorry for the delay. Christian
diff --git a/fs/shiftfs.c b/fs/shiftfs.c index 39b878a6f820..652fb67b0e0b 100644 --- a/fs/shiftfs.c +++ b/fs/shiftfs.c @@ -240,6 +240,9 @@ static int shiftfs_d_weak_revalidate(struct dentry *dentry, unsigned int flags) int err = 1; struct dentry *lowerd = dentry->d_fsdata; + if (d_is_negative(lowerd) != d_is_negative(dentry)) + return 0; + if (lowerd->d_flags & DCACHE_OP_WEAK_REVALIDATE) { err = lowerd->d_op->d_weak_revalidate(lowerd, flags); if (err < 0) @@ -264,13 +267,16 @@ static int shiftfs_d_revalidate(struct dentry *dentry, unsigned int flags) if (flags & LOOKUP_RCU) return -ECHILD; + if (d_unhashed(lowerd) || + d_is_negative(lowerd) != d_is_negative(dentry)) + return 0; + if (lowerd->d_flags & DCACHE_OP_REVALIDATE) { err = lowerd->d_op->d_revalidate(lowerd, flags); if (err < 0) return err; if (!err) { - if (!(flags & LOOKUP_RCU)) - d_invalidate(lowerd); + d_invalidate(lowerd); return -ESTALE; } }
BugLink: https://bugs.launchpad.net/bugs/1879196 While fixing [1] we introduced a regression and triggered [2] whereby the underlay and overlay go out of sync because they disagree about the dcache. Such situation easily arise when the underlay is modified directly. This means the overlay needs to revalidate the dentry in the dcache to catchup with the underlay. Note, in general it's not advisable to directly modify the underlay while a shiftfs mount is on top. In some way this means we need to keep two caches in sync and it's hard enough to keep a single cache happy. But shiftfs' use-case is inherently prone to be used for exactly that. So this is something we have to navigate carefully and honestly we have no full model upstream that does the same. Overlayfs has the copy-up behavior which let's it get around most of the issues but we don't have it and ecryptfs is broken in such scenarios which we verified quite a while back. In any case, I built a kernel with this patch and re-ran all regressions that are related to this that we have so far (cf. [1], [2], and [3]). None of them were reproducible with this patch here. So we still fix the ESTALE issue but also keep underlay and overlay in sync. [1]: https://bugs.launchpad.net/bugs/1872757 [2]: https://bugs.launchpad.net/bugs/1879196 [3]: https://bugs.launchpad.net/bugs/1860041 Cc: Seth Forshee <seth.forshee@canonical.com> Cc: James Troup <james.troup@canonical.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> --- fs/shiftfs.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) base-commit: 9dd531622eff2baf657b906293745ca6dab9bb91