Message ID | 20170507155855.GD5970@birch.djwong.org |
---|---|
State | Not Applicable, archived |
On Sun, May 7, 2017 at 5:58 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote: > Document the new GETFSMAP ioctl that returns the physical layout of a > (disk-based) filesystem. [...] > +.B EPERM > +This query is not allowed. Please document the circumstances under which a query is allowed. Also: From a quick glance at the XFS implementation, I don't see any privilege checks. Am I missing something, or does this API permit an unprivileged user to determine the number of physical blocks allocated for any inode, even for inodes the user can't ordinarily see in any way?
On Mon, May 08, 2017 at 12:17:53AM +0200, Jann Horn wrote:
> On Sun, May 7, 2017 at 5:58 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > Document the new GETFSMAP ioctl that returns the physical layout of a
> > (disk-based) filesystem.
> [...]
> > +.B EPERM
> > +This query is not allowed.
>
> Please document the circumstances under which a query is allowed.

For the two current implementations, queries are always allowed. (The doc could be more explicit about this decision being left to the implementation.)

> Also: From a quick glance at the XFS implementation, I don't see any
> privilege checks. Am I missing something, or does this API permit an
> unprivileged user to determine the number of physical blocks allocated
> for any inode, even for inodes the user can't ordinarily see in any
> way?

Correct.

--D
On Mon, May 8, 2017 at 8:41 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote: > On Mon, May 08, 2017 at 12:17:53AM +0200, Jann Horn wrote: >> On Sun, May 7, 2017 at 5:58 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote: >> > Document the new GETFSMAP ioctl that returns the physical layout of a >> > (disk-based) filesystem. [...] >> Also: From a quick glance at the XFS implementation, I don't see any >> privilege checks. Am I missing something, or does this API permit an >> unprivileged user to determine the number of physical blocks allocated >> for any inode, even for inodes the user can't ordinarily see in any >> way? > > Correct. What's your reasoning for why this doesn't create any new potential security issues? For example, as far as I can tell, this would permit an unprivileged user to determine with high probability whether a set of large files with known sizes is stored anywhere in the filesystem, even across containers or so.
On Mon, May 08, 2017 at 08:47:56PM +0200, Jann Horn wrote:
> On Mon, May 8, 2017 at 8:41 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Mon, May 08, 2017 at 12:17:53AM +0200, Jann Horn wrote:
> >> On Sun, May 7, 2017 at 5:58 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >> > Document the new GETFSMAP ioctl that returns the physical layout of a
> >> > (disk-based) filesystem.
> [...]
> >> Also: From a quick glance at the XFS implementation, I don't see any
> >> privilege checks. Am I missing something, or does this API permit an
> >> unprivileged user to determine the number of physical blocks allocated
> >> for any inode, even for inodes the user can't ordinarily see in any
> >> way?
> >
> > Correct.
>
> What's your reasoning for why this doesn't create any new potential
> security issues? For example, as far as I can tell, this would permit

/Any/? That is a huge request to be dropping on me after the vfs patch gets merged, after a year-long review cycle, etc.

AFAIK there aren't any problems, but then that's part of why I let this thing hang out to dry for such a long time. Even possessing the inode number, an unprivileged process still cannot open files they wouldn't otherwise have access to, since that requires the generation number, and only bulkstat provides that (if you have CAP_SYS_ADMIN).

The whole reason for dropping the CAP_SYS_ADMIN check from GETFSMAP was (a) so that unprivileged users could compute free space information and (b) to allow dedupe tools to make better decisions about which file donates blocks and which file accepts blocks.

If you have specific complaints, then let's hear and address them. I'm not going to try to prove a broad negative theoretical statement.

Moving on...

> an unprivileged user to determine with high probability whether a set
> of large files with known sizes is stored anywhere in the filesystem, even
> across containers or so.

How large? How high?
Do you have a tool that analyzes a set of st_blocks values and compares the set to known profiles in order to guess what's on the filesystem? With what accuracy can it do that, especially without explicit path or stat data? The maximum resolution provided by the ioctl is fs block size, so it's not like you can guess that this 1268432 byte file is libclangAnalysis.a; all you know is that there are four 310-block files on this filesystem -- on this system that's the desktop wallpaper, a file from each of libclang and libgimp, and libc6 from my aarch64 guest.

The logical block map data could be more helpful for fingerprinting, but only if there are sparse files. Say our multi-tenant container hosts all the containers on the same fs. We now have a set of (inode, blockcount) data and a logical block map for every inode stored on that fs. We have no path or stat data, so how do you tell what a 340-block file with a hole at offset 17 is?

You could try to infer path structure using the (XFS) heuristic that file inodes are usually created in the same AG as the directory inode they're created in, but GETFSMAP doesn't distinguish file extents from directory extents and AGs can host many different directories, so I don't think this will help much. Even if you have a reasonably good idea which inodes are directories, you still don't know which other inodes have an entry in a particular directory.

Then again, once we throw reflink and dedupe between containers into the mix the extent maps become far more interesting, because dirs could potentially be identified by the lack of any shared blocks at all, and other containers with the same library files will tend to share the same blocks at the same offsets. But that's still somewhat imprecise -- btrfs directories can share blocks between snapshots, whereas xfs can't, and the existence of small unshared files with the same block count introduces a certain amount of noise into the directory inference process.
So maybe you'd be able to search for a reflinked .so file that you /can/ stat to infer that there are X containers running the same software as your container, though you still have to find them to mount an attack.

FWIW I don't oppose having a CAP_SYS_ADMIN check again (patches gladly accepted for review!), but I'm not yet convinced that this is a big enough threat to forbid the use case. Sure would be nice if we had finer-grained capabilities...

--D
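The block-granularity argument in the message above can be checked with a little arithmetic. The following is only a sketch and assumes a 4096-byte filesystem block size, which the thread does not state; it also ignores sparse files, preallocation, and per-file metadata blocks. The helper names are made up for illustration.

```python
def blocks_for_size(size_bytes: int, block_size: int = 4096) -> int:
    """Number of whole filesystem blocks needed to hold size_bytes."""
    return -(-size_bytes // block_size)  # ceiling division


def sizes_with_block_count(nblocks: int, block_size: int = 4096) -> range:
    """All byte sizes that round up to exactly nblocks blocks."""
    return range((nblocks - 1) * block_size + 1, nblocks * block_size + 1)


# The 1268432-byte file from the message occupies 310 blocks (assumed 4 KiB blocks).
print(blocks_for_size(1268432))

# Every size in this range is indistinguishable from it by block count alone.
candidates = sizes_with_block_count(310)
print(candidates.start, candidates.stop - 1, len(candidates))
```

Under that assumption a single block count is consistent with 4096 different byte sizes, which is why one small file is a weak fingerprint on its own.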
On Mon, May 8, 2017 at 10:47 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote: > On Mon, May 08, 2017 at 08:47:56PM +0200, Jann Horn wrote: >> On Mon, May 8, 2017 at 8:41 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote: >> > On Mon, May 08, 2017 at 12:17:53AM +0200, Jann Horn wrote: >> >> On Sun, May 7, 2017 at 5:58 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote: >> >> > Document the new GETFSMAP ioctl that returns the physical layout of a >> >> > (disk-based) filesystem. >> [...] >> >> Also: From a quick glance at the XFS implementation, I don't see any >> >> privilege checks. Am I missing something, or does this API permit an >> >> unprivileged user to determine the number of physical blocks allocated >> >> for any inode, even for inodes the user can't ordinarily see in any >> >> way? >> > >> > Correct. >> >> What's your reasoning for why this doesn't create any new potential >> security issues? For example, as far as I can tell, this would permit > > /Any/ ? That is a huge request to be dropping on me after the vfs patch > gets merged, after a year-long review cycle, etc. Fair point. >> an unprivileged user to determine with high probability whether a set >> of large files with known sizes is stored anywhere in the filesystem, even >> across containers or so. > > How large? How high? > > Do you have a tool that analyzes a set of st_blocks values and compares > the set to known profiles in order to guess what's on the filesystem? > With what accuracy can it do that, especially without explicit path or > stat data? The maximum resolution provided by the ioctl is fs block > size, so it's not like you can guess that this 1268432 byte file is > libclangAnalysis.a; all you know is that there are four 310-block files > on this filesystem -- on this system that's the desktop wallpaper, a > file from each of libclang and libgimp, and libc6 from my aarch64 guest. 
This would probably become more realistic for larger files, like conference recordings - with sizes like 200075, 48338, 155870, 134800 blocks -, although admittedly I don't have a specific scenario in mind in which someone knowing what conference recordings I have on my disk would be problematic. You're probably right about this not being a particularly important concern, and I recognize that if I had wanted a different API, I should have said so a year ago.
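To make the "set of large files with known sizes" scenario concrete, here is a minimal sketch of the matching step. It assumes the attacker has already collected a list of per-inode block counts; the function name and the observed list are hypothetical, and only the four recording sizes come from the message above. A match is suggestive at best, since unrelated files can have the same block count.

```python
from collections import Counter


def profile_present(observed_counts, profile):
    """True if every block count in profile occurs at least as often in observed_counts."""
    observed = Counter(observed_counts)
    needed = Counter(profile)
    return all(observed[n] >= k for n, k in needed.items())


# Block counts quoted in the message above.
recordings = [200075, 48338, 155870, 134800]

# Hypothetical per-inode block counts gathered from a filesystem.
observed = [310, 310, 200075, 12, 48338, 155870, 134800, 310]

print(profile_present(observed, recordings))
print(profile_present(observed[:3], recordings))
```

The larger and more numerous the files in the profile, the less likely it is that all of their block counts coincide by chance.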
On Tue, May 09, 2017 at 12:54:57AM +0200, Jann Horn wrote:
> On Mon, May 8, 2017 at 10:47 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
[...]
> > The maximum resolution provided by the ioctl is fs block
> > size, so it's not like you can guess that this 1268432 byte file is
> > libclangAnalysis.a; all you know is that there are four 310-block files
> > on this filesystem -- on this system that's the desktop wallpaper, a
> > file from each of libclang and libgimp, and libc6 from my aarch64 guest.
>
> This would probably become more realistic for larger files, like
> conference recordings - with sizes like 200075, 48338, 155870, 134800
> blocks -, although admittedly I don't have a specific scenario in mind in
> which someone knowing what conference recordings I have on my disk
> would be problematic.

TBH, I /did/ build a (crappy) tool that tries to construct fingerprints based on the fsmaps it finds for each inode number. It works reasonably well for identifying the existence (and number) of reflink clones of any part of the file tree that you can stat and FIEMAP. It also works (sort of) if you have multiple separate filesystems with roughly the same fs trees in them. Mixing things into one big file produces a lot of noise, and data files are harder to pick out unless they're deduped.

> You're probably right about this not being a particularly important
> concern, and I recognize that if I had wanted a different API, I should
> have said so a year ago.

I don't mind people bringing up specific concerns and making specific requests for the rest of the 4.12 cycle since it's relatively easy to make small changes to the interface (e.g. adding a capability check).

--D
On Mon, May 08, 2017 at 06:53:24PM -0700, Darrick J. Wong wrote: > > >> an unprivileged user to determine with high probability whether a set > > >> of large files with known sizes is stored anywhere in the filesystem, even > > >> across containers or so. > > > > > > How large? How high? > > > > > > Do you have a tool that analyzes a set of st_blocks values and compares > > > the set to known profiles in order to guess what's on the filesystem? > > > With what accuracy can it do that, especially without explicit path or > > > stat data? The maximum resolution provided by the ioctl is fs block > > > size, so it's not like you can guess that this 1268432 byte file is > > > libclangAnalysis.a; all you know is that there are four 310-block files > > > on this filesystem -- on this system that's the desktop wallpaper, a > > > file from each of libclang and libgimp, and libc6 from my aarch64 guest. > > > > This would probably become more realistic for larger files, like > > conference recordings - with sizes like 200075, 48338, 155870, 134800 > > blocks -, although admittedly I don't have a specific scenario in mind in > > which someone knowing what conference recordings I have on my disk > > would be problematic. > > TBH, I /did/ build a (crappy) tool that tries to construct fingerprints > based on the fsmaps it finds for each inode number. It works reasonably > well for identifying the existence (and number) of reflink clones of any > part of the file tree that you can stat and FIEMAP. It also works (sort > of) if you have multiple separate filesystems with roughly the same fs > trees in them. Mixing things into one big file produces a lot of noise, > and data files are harder to pick out unless they're deduped. > > > You're probably right about this not being a particularly important > > concern, and I recognize that if I had wanted a different API, I should > > have said so a year ago. 
> > I don't mind people bringing up specific concerns and making specific > requests for the rest of the 4.12 cycle since it's relatively easy to > make small changes to the interface (e.g. adding a capability check). > I was also surprised to see that there is no authorization check. Using this ioctl, *anyone* will be able to retrieve the precise list of physical blocks used by every inode on the filesystem, even ones that are only linked to in directories the user doesn't have read permission for, or aren't even visible in their mount namespace. I am most concerned about: 1.) Privacy implications. Say the filesystem is being shared between multiple users, and one user unpacks foo.tar.gz into their home directory, which they've set to mode 700 to hide from other users. Because of this new ioctl, all users will be able to see every (inode number, size in blocks) pair that was added to the filesystem, as well as the exact layout of the physical block allocations which might hint at how the files were created. If there is a known "fingerprint" for the unpacked foo.tar.gz in this regard, its presence on the filesystem will be revealed to all users. And if any filesystems happen to prefer allocating blocks near the containing directory, the directory the files are in would likely be revealed too. Also note that by repeatedly executing the ioctl, all users will be able to see at what time any arbitrary inode was added to the filesystem, as well as exactly when any arbitrary inode was truncated, or otherwise modified in a way that changed its extent mappings. More generally, all users will be able to follow the evolution of any arbitrary set of inodes over time. In a shared hosting environment this could allow anyone to determine many of the characteristics of other containers being hosted by the kernel, such as which software and software versions they're using (or at least, to a higher degree of confidence than other side channels that may be available currently). 2.) 
Abusing the ioctl as an information leak in combination with another security vulnerability. For example let's say that there's a vulnerability in xfs or ext4 that allows writing (but not reading) to an arbitrary physical disk block. Now obviously there are many ways this could be exploited, but let's say you're in a container, so just elevating to root in the container isn't enough, and you don't know where the critical system files are. Using the fsmap ioctl to get the extent mappings of *all* files on the filesystem, you may still be able to determine with a high degree of confidence which physical disk blocks hold the contents of files outside the container that could be backdoored, e.g. /etc/shadow, or some binary in /bin or /lib, or perhaps a kernel module in /lib/modules. So given that this ioctl operates on the global filesystem and not on a particular file, it really seems like more of an administrator-level thing (capable(CAP_SYS_ADMIN)), not something that any random user should be able to execute. - Eric
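The "repeatedly executing the ioctl" concern in point 1 amounts to diffing successive snapshots. The sketch below assumes each snapshot has already been reduced to a mapping from inode number to total block count; the function name and the sample data are hypothetical, and no actual ioctl call is made.

```python
def diff_snapshots(before, after):
    """Compare two {inode: block_count} snapshots taken at different times."""
    added = {ino: after[ino] for ino in after.keys() - before.keys()}
    removed = {ino: before[ino] for ino in before.keys() - after.keys()}
    changed = {
        ino: (before[ino], after[ino])
        for ino in before.keys() & after.keys()
        if before[ino] != after[ino]
    }
    return added, removed, changed


# Hypothetical snapshots taken a few seconds apart.
t0 = {131: 310, 132: 12, 133: 48338}
t1 = {131: 310, 132: 20, 134: 7}

added, removed, changed = diff_snapshots(t0, t1)
print(added)    # inode 134 appeared
print(removed)  # inode 133 went away
print(changed)  # inode 132 grew from 12 to 20 blocks
```

The time resolution of what an observer learns this way is bounded only by how often the snapshots are taken.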
On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
> 1.) Privacy implications. Say the filesystem is being shared between multiple
> users, and one user unpacks foo.tar.gz into their home directory, which
> they've set to mode 700 to hide from other users. Because of this new
> ioctl, all users will be able to see every (inode number, size in blocks)
> pair that was added to the filesystem, as well as the exact layout of the
> physical block allocations which might hint at how the files were created.
> If there is a known "fingerprint" for the unpacked foo.tar.gz in this
> regard, its presence on the filesystem will be revealed to all users. And
> if any filesystems happen to prefer allocating blocks near the containing
> directory, the directory the files are in would likely be revealed too.

Unix/Linux has historically not been terribly concerned about trying to protect this kind of privacy between users. So for example, in order to do this, you would have to call GETFSMAP continuously to track this sort of thing. Someone who wanted to do this could probably get this information (and much, much more) by continuously running "ps" to see what processes are running.

(I will note, wryly, that in the bad old days, when dozens of users were sharing a one MIPS Vax/780, it was considered a *good* thing that social pressure could be applied when it was found that someone was running a CPU or memory hogger on a time sharing system. The privacy right of someone running "xtrek" to be able to hide this from other users on the system was never considered important at all. :-)

Fortunately, the days of timesharing seem to be well behind us. For those people who think that containers are as secure as VMs (hah, hah, hah), it might be that the best way to handle this is to have a mount option that requires root access to this functionality. For those people who really care about this, they can disable access.

Cheers,

- Ted
Theodore Ts'o <tytso@mit.edu> writes: > On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote: >> 1.) Privacy implications. Say the filesystem is being shared between multiple >> users, and one user unpacks foo.tar.gz into their home directory, which >> they've set to mode 700 to hide from other users. Because of this new >> ioctl, all users will be able to see every (inode number, size in blocks) >> pair that was added to the filesystem, as well as the exact layout of the >> physical block allocations which might hint at how the files were created. >> If there is a known "fingerprint" for the unpacked foo.tar.gz in this >> regard, its presence on the filesystem will be revealed to all users. And >> if any filesystems happen to prefer allocating blocks near the containing >> directory, the directory the files are in would likely be revealed too. > > Unix/Linux has historically not been terribly concerned about trying > to protect this kind of privacy between users. So for example, in > order to do this, you would have to call GETFSMAP continously to track > this sort of thing. Someone who wanted to do this could probably get > this information (and much, much more) by continuously running "ps" to > see what processes are running. > > (I will note. wryly, that in the bad old days, when dozens of users > were sharing a one MIPS Vax/780, it was considered a *good* thing > that social pressure could be applied when it was found that someone > was running a CPU or memory hogger on a time sharing system. The > privacy right of someone running "xtrek" to be able to hide this from > other users on the system was never considered important at all. :-) > > Fortunately, the days of timesharing seem to well behind us. For > those people who think that containers are as secure as VM's (hah, > hah, hah), it might be that best way to handle this is to have a mount > option that requires root access to this functionality. 
For those
> people who really care about this, they can disable access.

What would be the reason for not putting this behind capable(CAP_SYS_ADMIN)?

What possible legitimate function could this functionality serve to users who don't own your filesystem?

I have seen several people speak up about how this is a concern. I don't see anyone saying "here is a legitimate use for a non-system administrator".

This doesn't seem like something where abuses of time-sharing systems can be observed.

Eric
[cc btrfs, since afaict that's where most of the dedupe tool authors hang out] On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote: > Theodore Ts'o <tytso@mit.edu> writes: > > > On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote: > >> 1.) Privacy implications. Say the filesystem is being shared between multiple > >> users, and one user unpacks foo.tar.gz into their home directory, which > >> they've set to mode 700 to hide from other users. Because of this new > >> ioctl, all users will be able to see every (inode number, size in blocks) > >> pair that was added to the filesystem, as well as the exact layout of the > >> physical block allocations which might hint at how the files were created. > >> If there is a known "fingerprint" for the unpacked foo.tar.gz in this > >> regard, its presence on the filesystem will be revealed to all users. And > >> if any filesystems happen to prefer allocating blocks near the containing > >> directory, the directory the files are in would likely be revealed too. Frankly, why are container users even allowed to make unrestricted ioctl calls? I thought we had a bunch of security infrastructure to constrain what userspace can do to a system, so why don't ioctls fall under these same protections? If your containers are really that adversarial, you ought to be blacklisting as much as you can. > > Unix/Linux has historically not been terribly concerned about trying > > to protect this kind of privacy between users. So for example, in > > order to do this, you would have to call GETFSMAP continously to track > > this sort of thing. Someone who wanted to do this could probably get > > this information (and much, much more) by continuously running "ps" to > > see what processes are running. > > > > (I will note. 
wryly, that in the bad old days, when dozens of users > > were sharing a one MIPS Vax/780, it was considered a *good* thing > > that social pressure could be applied when it was found that someone > > was running a CPU or memory hogger on a time sharing system. The > > privacy right of someone running "xtrek" to be able to hide this from > > other users on the system was never considered important at all. :-) Not to mention someone running GETFSMAP in a loop will be pretty obvious both from the high kernel cpu usage and the huge number of metadata operations. > > Fortunately, the days of timesharing seem to well behind us. For > > those people who think that containers are as secure as VM's (hah, > > hah, hah), it might be that best way to handle this is to have a mount > > option that requires root access to this functionality. For those > > people who really care about this, they can disable access. Or use separate filesystems for each container so that exploitable bugs that shut down the filesystem can't be used to kill the other containers. You could use a torrent of metadata-heavy operations (fallocate a huge file, punch every block, truncate file, repeat) to DoS the other containers. > What would be the reason for not putting this behind > capable(CAP_SYS_ADMIN)? > > What possible legitimate function could this functionality serve to > users who don't own your filesystem? As I've said before, it's to enable dedupe tools to decide, given a set of files with shareable blocks, roughly how many other times each of those shareable blocks are shared so that they can make better decisions about which file keeps its shareable blocks, and which file gets remapped. Dedupe is not a privileged operation, nor are any of the tools. > I have seen several people speak up how this is a concern I don't see > anyone saying here is a legitimate use for a non-system administrator. /I/ said that a few emails ago. 
--D > This doesn't seem like something where abuses of time-sharing systems > can be observed. > > Eric
On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote: > [cc btrfs, since afaict that's where most of the dedupe tool authors hang out] > > On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote: > > Theodore Ts'o <tytso@mit.edu> writes: > > > > > On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote: > > >> 1.) Privacy implications. Say the filesystem is being shared between multiple > > >> users, and one user unpacks foo.tar.gz into their home directory, which > > >> they've set to mode 700 to hide from other users. Because of this new > > >> ioctl, all users will be able to see every (inode number, size in blocks) > > >> pair that was added to the filesystem, as well as the exact layout of the > > >> physical block allocations which might hint at how the files were created. > > >> If there is a known "fingerprint" for the unpacked foo.tar.gz in this > > >> regard, its presence on the filesystem will be revealed to all users. And > > >> if any filesystems happen to prefer allocating blocks near the containing > > >> directory, the directory the files are in would likely be revealed too. > > Frankly, why are container users even allowed to make unrestricted ioctl > calls? I thought we had a bunch of security infrastructure to constrain > what userspace can do to a system, so why don't ioctls fall under these > same protections? If your containers are really that adversarial, you > ought to be blacklisting as much as you can. > Personally I don't find the presence of sandboxing features to be a very good excuse for introducing random insecure ioctls. Not everyone has everything perfectly "sandboxed" all the time, for obvious reasons. It's easy to forget about the filesystem ioctls, too, since they can be executed on any regular file, without having to open some device node in /dev. 
(And this actually does happen; the SELinux policy in Android, for example, still allows apps to call any ioctl on their data files, despite all the effort that has gone into whitelisting other types of ioctls. Which should be fixed, of course, but it shows that this kind of mistake is very easy to make.) > > > Unix/Linux has historically not been terribly concerned about trying > > > to protect this kind of privacy between users. So for example, in > > > order to do this, you would have to call GETFSMAP continously to track > > > this sort of thing. Someone who wanted to do this could probably get > > > this information (and much, much more) by continuously running "ps" to > > > see what processes are running. > > > > > > (I will note. wryly, that in the bad old days, when dozens of users > > > were sharing a one MIPS Vax/780, it was considered a *good* thing > > > that social pressure could be applied when it was found that someone > > > was running a CPU or memory hogger on a time sharing system. The > > > privacy right of someone running "xtrek" to be able to hide this from > > > other users on the system was never considered important at all. :-) > > Not to mention someone running GETFSMAP in a loop will be pretty obvious > both from the high kernel cpu usage and the huge number of metadata > operations. Well, only if that someone running GETFSMAP actually wants to watch things in real-time (it's not necessary for all scenarios that have been mentioned), *and* there is monitoring in place which actually detects it and can do something about it. Yes, PIDs have traditionally been global, but today we have PID namespaces, and many other isolation features such as mount namespaces. Nothing is perfect, of course, and containers are a lot worse than VMs, but it seems weird to use that as an excuse to knowingly make things worse... > > > > Fortunately, the days of timesharing seem to well behind us. 
For > > > those people who think that containers are as secure as VM's (hah, > > > hah, hah), it might be that best way to handle this is to have a mount > > > option that requires root access to this functionality. For those > > > people who really care about this, they can disable access. > > Or use separate filesystems for each container so that exploitable bugs > that shut down the filesystem can't be used to kill the other > containers. You could use a torrent of metadata-heavy operations > (fallocate a huge file, punch every block, truncate file, repeat) to DoS > the other containers. > > > What would be the reason for not putting this behind > > capable(CAP_SYS_ADMIN)? > > > > What possible legitimate function could this functionality serve to > > users who don't own your filesystem? > > As I've said before, it's to enable dedupe tools to decide, given a set > of files with shareable blocks, roughly how many other times each of > those shareable blocks are shared so that they can make better decisions > about which file keeps its shareable blocks, and which file gets > remapped. Dedupe is not a privileged operation, nor are any of the > tools. > So why does the ioctl need to return all extent mappings for the entire filesystem, instead of just the share count of each block in the file that the ioctl is called on? - Eric
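The narrower interface asked about above, a share count for each block of one file, can be sketched as follows. This is a toy model, not the GETFSMAP record format: extents are assumed to be simple (owner inode, physical start block, length) tuples, and the names are hypothetical. A kernel implementation would compute this internally and would not hand the global record list to the caller.

```python
from collections import defaultdict


def share_counts(records, target_owner):
    """For each physical block of target_owner, count how many other owners map it."""
    owners_by_block = defaultdict(set)
    for owner, start, length in records:
        for block in range(start, start + length):
            owners_by_block[block].add(owner)
    return {
        block: len(owners) - 1
        for block, owners in sorted(owners_by_block.items())
        if target_owner in owners
    }


# Hypothetical extents: inode 100 shares block 2 with 101, and block 3 with 101 and 102.
records = [(100, 0, 4), (101, 2, 4), (102, 3, 1)]

print(share_counts(records, 100))
```

The result mentions only blocks belonging to the file that was asked about, which is the information a dedupe tool needs to decide which copy keeps its blocks.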
On May 10, 2017, at 11:10 PM, Eric Biggers <ebiggers3@gmail.com> wrote: > > On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote: >> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out] >> >> On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote: >>> Theodore Ts'o <tytso@mit.edu> writes: >>> >>>> On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote: >>>>> 1.) Privacy implications. Say the filesystem is being shared between multiple >>>>> users, and one user unpacks foo.tar.gz into their home directory, which >>>>> they've set to mode 700 to hide from other users. Because of this new >>>>> ioctl, all users will be able to see every (inode number, size in blocks) >>>>> pair that was added to the filesystem, as well as the exact layout of the >>>>> physical block allocations which might hint at how the files were created. >>>>> If there is a known "fingerprint" for the unpacked foo.tar.gz in this >>>>> regard, its presence on the filesystem will be revealed to all users. And >>>>> if any filesystems happen to prefer allocating blocks near the containing >>>>> directory, the directory the files are in would likely be revealed too. >> >> Frankly, why are container users even allowed to make unrestricted ioctl >> calls? I thought we had a bunch of security infrastructure to constrain >> what userspace can do to a system, so why don't ioctls fall under these >> same protections? If your containers are really that adversarial, you >> ought to be blacklisting as much as you can. >> > > Personally I don't find the presence of sandboxing features to be a very good > excuse for introducing random insecure ioctls. Not everyone has everything > perfectly "sandboxed" all the time, for obvious reasons. It's easy to forget > about the filesystem ioctls, too, since they can be executed on any regular > file, without having to open some device node in /dev. 
>
> (And this actually does happen; the SELinux policy in Android, for example,
> still allows apps to call any ioctl on their data files, despite all the effort
> that has gone into whitelisting other types of ioctls. Which should be fixed,
> of course, but it shows that this kind of mistake is very easy to make.)
>
>>>> Unix/Linux has historically not been terribly concerned about trying
>>>> to protect this kind of privacy between users. So for example, in
>>>> order to do this, you would have to call GETFSMAP continously to track
>>>> this sort of thing. Someone who wanted to do this could probably get
>>>> this information (and much, much more) by continuously running "ps" to
>>>> see what processes are running.
>>>>
>>>> (I will note. wryly, that in the bad old days, when dozens of users
>>>> were sharing a one MIPS Vax/780, it was considered a *good* thing
>>>> that social pressure could be applied when it was found that someone
>>>> was running a CPU or memory hogger on a time sharing system. The
>>>> privacy right of someone running "xtrek" to be able to hide this from
>>>> other users on the system was never considered important at all. :-)
>>
>> Not to mention someone running GETFSMAP in a loop will be pretty obvious
>> both from the high kernel cpu usage and the huge number of metadata
>> operations.
>
> Well, only if that someone running GETFSMAP actually wants to watch things in
> real-time (it's not necessary for all scenarios that have been mentioned), *and*
> there is monitoring in place which actually detects it and can do something
> about it.
>
> Yes, PIDs have traditionally been global, but today we have PID namespaces, and
> many other isolation features such as mount namespaces. Nothing is perfect, of
> course, and containers are a lot worse than VMs, but it seems weird to use that
> as an excuse to knowingly make things worse...
>
>>
>>>> Fortunately, the days of timesharing seem to well behind us.
>>>> For those people who think that containers are as secure as VM's (hah,
>>>> hah, hah), it might be that best way to handle this is to have a mount
>>>> option that requires root access to this functionality. For those
>>>> people who really care about this, they can disable access.
>>
>> Or use separate filesystems for each container so that exploitable bugs
>> that shut down the filesystem can't be used to kill the other
>> containers. You could use a torrent of metadata-heavy operations
>> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
>> the other containers.
>>
>>> What would be the reason for not putting this behind
>>> capable(CAP_SYS_ADMIN)?
>>>
>>> What possible legitimate function could this functionality serve to
>>> users who don't own your filesystem?
>>
>> As I've said before, it's to enable dedupe tools to decide, given a set
>> of files with shareable blocks, roughly how many other times each of
>> those shareable blocks are shared so that they can make better decisions
>> about which file keeps its shareable blocks, and which file gets
>> remapped. Dedupe is not a privileged operation, nor are any of the
>> tools.
>>
>
> So why does the ioctl need to return all extent mappings for the entire
> filesystem, instead of just the share count of each block in the file that the
> ioctl is called on?

One possibility is that the ioctl() can return the mapping for all inodes
owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE,
or CAP_FOWNER is set), and return an "filesystem aggregate inode" (or more
than one if there is a reason to do so) with all the other allocated blocks
for inodes the user doesn't have permission to access?

IMHO, this would allow a non-root user the main benefit of GETFSMAP, which
is trying to determine how fragmented their files are and/or how fragmented
the free space is, without leaking any information about file sizes, location,
or other information the user can't already get today in a less efficient
manner.

I don't know how hard this is to implement, but seems not impossible.

Cheers, Andreas
On Sat, May 13, 2017 at 07:41:24PM -0600, Andreas Dilger wrote:
> On May 10, 2017, at 11:10 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> >
> > On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
> >> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]
> >>
> >> On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
> >>> Theodore Ts'o <tytso@mit.edu> writes:
> >>>
> >>>> On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
> >>>>> 1.) Privacy implications. Say the filesystem is being shared between multiple
> >>>>>     users, and one user unpacks foo.tar.gz into their home directory, which
> >>>>>     they've set to mode 700 to hide from other users. Because of this new
> >>>>>     ioctl, all users will be able to see every (inode number, size in blocks)
> >>>>>     pair that was added to the filesystem, as well as the exact layout of the
> >>>>>     physical block allocations which might hint at how the files were created.
> >>>>>     If there is a known "fingerprint" for the unpacked foo.tar.gz in this
> >>>>>     regard, its presence on the filesystem will be revealed to all users. And
> >>>>>     if any filesystems happen to prefer allocating blocks near the containing
> >>>>>     directory, the directory the files are in would likely be revealed too.
> >>
> >> Frankly, why are container users even allowed to make unrestricted ioctl
> >> calls? I thought we had a bunch of security infrastructure to constrain
> >> what userspace can do to a system, so why don't ioctls fall under these
> >> same protections? If your containers are really that adversarial, you
> >> ought to be blacklisting as much as you can.
> >>
> >
> > Personally I don't find the presence of sandboxing features to be a very good
> > excuse for introducing random insecure ioctls. Not everyone has everything
> > perfectly "sandboxed" all the time, for obvious reasons.
> > It's easy to forget
> > about the filesystem ioctls, too, since they can be executed on any regular
> > file, without having to open some device node in /dev.
> >
> > (And this actually does happen; the SELinux policy in Android, for example,
> > still allows apps to call any ioctl on their data files, despite all the effort
> > that has gone into whitelisting other types of ioctls. Which should be fixed,
> > of course, but it shows that this kind of mistake is very easy to make.)
> >
> >>>> Unix/Linux has historically not been terribly concerned about trying
> >>>> to protect this kind of privacy between users. So for example, in
> >>>> order to do this, you would have to call GETFSMAP continously to track
> >>>> this sort of thing. Someone who wanted to do this could probably get
> >>>> this information (and much, much more) by continuously running "ps" to
> >>>> see what processes are running.
> >>>>
> >>>> (I will note. wryly, that in the bad old days, when dozens of users
> >>>> were sharing a one MIPS Vax/780, it was considered a *good* thing
> >>>> that social pressure could be applied when it was found that someone
> >>>> was running a CPU or memory hogger on a time sharing system. The
> >>>> privacy right of someone running "xtrek" to be able to hide this from
> >>>> other users on the system was never considered important at all. :-)
> >>
> >> Not to mention someone running GETFSMAP in a loop will be pretty obvious
> >> both from the high kernel cpu usage and the huge number of metadata
> >> operations.
> >
> > Well, only if that someone running GETFSMAP actually wants to watch things in
> > real-time (it's not necessary for all scenarios that have been mentioned), *and*
> > there is monitoring in place which actually detects it and can do something
> > about it.
> >
> > Yes, PIDs have traditionally been global, but today we have PID namespaces, and
> > many other isolation features such as mount namespaces.
> > Nothing is perfect, of
> > course, and containers are a lot worse than VMs, but it seems weird to use that
> > as an excuse to knowingly make things worse...
> >
> >>
> >>>> Fortunately, the days of timesharing seem to well behind us. For
> >>>> those people who think that containers are as secure as VM's (hah,
> >>>> hah, hah), it might be that best way to handle this is to have a mount
> >>>> option that requires root access to this functionality. For those
> >>>> people who really care about this, they can disable access.
> >>
> >> Or use separate filesystems for each container so that exploitable bugs
> >> that shut down the filesystem can't be used to kill the other
> >> containers. You could use a torrent of metadata-heavy operations
> >> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
> >> the other containers.
> >>
> >>> What would be the reason for not putting this behind
> >>> capable(CAP_SYS_ADMIN)?
> >>>
> >>> What possible legitimate function could this functionality serve to
> >>> users who don't own your filesystem?
> >>
> >> As I've said before, it's to enable dedupe tools to decide, given a set
> >> of files with shareable blocks, roughly how many other times each of
> >> those shareable blocks are shared so that they can make better decisions
> >> about which file keeps its shareable blocks, and which file gets
> >> remapped. Dedupe is not a privileged operation, nor are any of the
> >> tools.
> >>
> >
> > So why does the ioctl need to return all extent mappings for the entire
> > filesystem, instead of just the share count of each block in the file that the
> > ioctl is called on?
>
> One possibility is that the ioctl() can return the mapping for all inodes
> owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE,
> or CAP_FOWNER is set), and return an "filesystem aggregate inode" (or more
> than one if there is a reason to do so) with all the other allocated blocks
> for inodes the user doesn't have permission to access?

Hmm, CAP_DAC_OVERRIDE/CAP_FOWNER? That might be a reasonable set of
capabilities to grant access...

> IMHO, this would allow a non-root user the main benefit of GETFSMAP, which
> is trying to determine how fragmented their files are and/or how fragmented
> the free space is, without leaking any information about file sizes, location,
> or other information the user can't already get today in a less efficient
> manner.
>
> I don't know how hard this is to implement, but seems not impossible.

It's already implemented in both XFS and ext4. <cough> File extents are
marked as "owned" by "unknown".

Now, I suppose one could devise a scheme such that files that the caller
can open actually do get inode numbers returned, but ... that's more
engineering work, let's see if anyone asks for that (vs. asks for any of
the magic capability bits).

--D

>
> Cheers, Andreas
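
The redaction idea discussed in this exchange (report real owners only for inodes the caller may see, and fold everything else into an anonymous owner) can be sketched in userspace C. This is only an illustration under stated assumptions, not the XFS or ext4 code: `struct demo_rec`, `demo_redact()`, `demo_owner_is()` and the `DEMO_*` constants are hypothetical names with placeholder values (the real flag and owner values live in `<linux/fsmap.h>`), and the visibility test is left to a caller-supplied callback because the thread has not settled what that check should be. Physically adjacent hidden extents are merged so that extent boundaries do not reveal individual file sizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values for illustration; the real ones come from <linux/fsmap.h>. */
#define DEMO_OF_SPECIAL_OWNER 0x10u
#define DEMO_OWN_UNKNOWN      2ull

/* Simplified, hypothetical mapping record (a subset of struct fsmap). */
struct demo_rec {
    uint64_t owner;     /* inode number, or a special owner code */
    uint64_t offset;    /* file offset of the extent, in bytes */
    uint64_t physical;  /* device offset of the extent, in bytes */
    uint64_t length;    /* extent length, in bytes */
    uint32_t flags;
};

/* Caller-supplied policy: may the caller learn about this inode? */
typedef bool (*demo_may_see_fn)(uint64_t ino, void *ctx);

/* Example policy for demonstrations: only one inode number is visible. */
static bool demo_owner_is(uint64_t ino, void *ctx)
{
    return ino == *(const uint64_t *)ctx;
}

static bool demo_is_unknown(const struct demo_rec *r)
{
    return (r->flags & DEMO_OF_SPECIAL_OWNER) && r->owner == DEMO_OWN_UNKNOWN;
}

/*
 * Rewrite recs[0..n) in place, which must be sorted by physical address.
 * Records for inodes the caller may not see become "unknown owner" records
 * with their file offset and state flags cleared; physically contiguous
 * unknown records are merged. Returns the new record count.
 */
static size_t demo_redact(struct demo_rec *recs, size_t n,
                          demo_may_see_fn may_see, void *ctx)
{
    size_t out = 0;

    assert(recs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct demo_rec r = recs[i];

        if (!(r.flags & DEMO_OF_SPECIAL_OWNER) && !may_see(r.owner, ctx)) {
            r.owner = DEMO_OWN_UNKNOWN;
            r.offset = 0;
            r.flags = DEMO_OF_SPECIAL_OWNER;
        }
        if (out > 0 && demo_is_unknown(&r) && demo_is_unknown(&recs[out - 1]) &&
            recs[out - 1].physical + recs[out - 1].length == r.physical) {
            recs[out - 1].length += r.length;  /* extend the previous record */
            continue;
        }
        recs[out++] = r;
    }
    return out;
}
```

A caller would pass its own predicate in place of `demo_owner_is()`; whatever the final rule is (ownership, capabilities, or the `kgid_has_mapping()` check raised later in the thread), it stays behind that callback.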
On Sat, May 13, 2017 at 6:41 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On May 10, 2017, at 11:10 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
>>
>> On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
>>> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]

>> Yes, PIDs have traditionally been global, but today we have PID namespaces, and
>> many other isolation features such as mount namespaces. Nothing is perfect, of
>> course, and containers are a lot worse than VMs, but it seems weird to use that
>> as an excuse to knowingly make things worse...
>>

Indeed. Not only PID namespaces -- we have hidepid and we can simply
unmount /proc. "There are other info leaks" is a poor excuse.

>>>
>>>>> Fortunately, the days of timesharing seem to well behind us. For
>>>>> those people who think that containers are as secure as VM's (hah,
>>>>> hah, hah), it might be that best way to handle this is to have a mount
>>>>> option that requires root access to this functionality. For those
>>>>> people who really care about this, they can disable access.
>>>
>>> Or use separate filesystems for each container so that exploitable bugs
>>> that shut down the filesystem can't be used to kill the other
>>> containers. You could use a torrent of metadata-heavy operations
>>> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
>>> the other containers.
>>>
>>>> What would be the reason for not putting this behind
>>>> capable(CAP_SYS_ADMIN)?
>>>>
>>>> What possible legitimate function could this functionality serve to
>>>> users who don't own your filesystem?
>>>
>>> As I've said before, it's to enable dedupe tools to decide, given a set
>>> of files with shareable blocks, roughly how many other times each of
>>> those shareable blocks are shared so that they can make better decisions
>>> about which file keeps its shareable blocks, and which file gets
>>> remapped.
>>> Dedupe is not a privileged operation, nor are any of the
>>> tools.
>>>
>>
>> So why does the ioctl need to return all extent mappings for the entire
>> filesystem, instead of just the share count of each block in the file that the
>> ioctl is called on?
>
> One possibility is that the ioctl() can return the mapping for all inodes
> owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE,
> or CAP_FOWNER is set), and return an "filesystem aggregate inode" (or more
> than one if there is a reason to do so) with all the other allocated blocks
> for inodes the user doesn't have permission to access?

Sounds like it could be reasonable. But you don't want "owned by the
calling PID" precisely -- you also need to check
kgid_has_mapping(current_user_ns(), inode->i_gid), I think.
On Sun, May 14, 2017 at 06:56:10AM -0700, Andy Lutomirski wrote:
> On Sat, May 13, 2017 at 6:41 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > On May 10, 2017, at 11:10 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> >>
> >> On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
> >>> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]
>
> >> Yes, PIDs have traditionally been global, but today we have PID namespaces, and
> >> many other isolation features such as mount namespaces. Nothing is perfect, of
> >> course, and containers are a lot worse than VMs, but it seems weird to use that
> >> as an excuse to knowingly make things worse...
> >>
>
> Indeed. Not only PID namespaces -- we have hidepid and we can simply
> unmount /proc. "There are other info leaks" is a poor excuse.

Eh. From the sounds of it I'm not all that impressed at the isolation
and leakproofness of any of these schemes. Regardless, I will rephrase
the manpage to emphasize more strongly that filesystems are under no
obligation to share inode numbers, privileged callers or otherwise.

> >>>
> >>>>> Fortunately, the days of timesharing seem to well behind us. For
> >>>>> those people who think that containers are as secure as VM's (hah,
> >>>>> hah, hah), it might be that best way to handle this is to have a mount
> >>>>> option that requires root access to this functionality. For those
> >>>>> people who really care about this, they can disable access.
> >>>
> >>> Or use separate filesystems for each container so that exploitable bugs
> >>> that shut down the filesystem can't be used to kill the other
> >>> containers. You could use a torrent of metadata-heavy operations
> >>> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
> >>> the other containers.
> >>>
> >>>> What would be the reason for not putting this behind
> >>>> capable(CAP_SYS_ADMIN)?
> >>>>
> >>>> What possible legitimate function could this functionality serve to
> >>>> users who don't own your filesystem?
> >>>
> >>> As I've said before, it's to enable dedupe tools to decide, given a set
> >>> of files with shareable blocks, roughly how many other times each of
> >>> those shareable blocks are shared so that they can make better decisions
> >>> about which file keeps its shareable blocks, and which file gets
> >>> remapped. Dedupe is not a privileged operation, nor are any of the
> >>> tools.
> >>>
> >>
> >> So why does the ioctl need to return all extent mappings for the entire
> >> filesystem, instead of just the share count of each block in the file that the
> >> ioctl is called on?
> >
> > One possibility is that the ioctl() can return the mapping for all inodes
> > owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE,
> > or CAP_FOWNER is set), and return an "filesystem aggregate inode" (or more
> > than one if there is a reason to do so) with all the other allocated blocks
> > for inodes the user doesn't have permission to access?
>
> Sounds like it could be reasonable. But you don't want "owned by the
> calling PID" precisely -- you also need to check
> kgid_has_mapping(current_user_ns(), inode->i_gid), I think.

Not to mention that I don't want to go xfs_igetting every inode across
the entire filesystem... :)

--D
diff --git a/man2/ioctl_getfsmap.2 b/man2/ioctl_getfsmap.2
new file mode 100644
index 0000000..ef9daef
--- /dev/null
+++ b/man2/ioctl_getfsmap.2
@@ -0,0 +1,362 @@
+.\" Copyright (c) 2017, Oracle. All rights reserved.
+.\"
+.\" %%%LICENSE_START(GPLv2+_DOC_FULL)
+.\" This is free documentation; you can redistribute it and/or
+.\" modify it under the terms of the GNU General Public License as
+.\" published by the Free Software Foundation; either version 2 of
+.\" the License, or (at your option) any later version.
+.\"
+.\" The GNU General Public License's references to "object code"
+.\" and "executables" are to be interpreted as the output of any
+.\" document formatting or typesetting system, including
+.\" intermediate and printed output.
+.\"
+.\" This manual is distributed in the hope that it will be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public
+.\" License along with this manual; if not, see
+.\" <http://www.gnu.org/licenses/>.
+.\" %%%LICENSE_END
+.TH IOCTL-GETFSMAP 2 2017-02-10 "Linux" "Linux Programmer's Manual"
+.SH NAME
+ioctl_getfsmap \- retrieve the physical layout of the filesystem
+.SH SYNOPSIS
+.br
+.B #include <sys/ioctl.h>
+.br
+.B #include <linux/fs.h>
+.br
+.B #include <linux/fsmap.h>
+.sp
+.BI "int ioctl(int " fd ", FS_IOC_GETFSMAP, struct fsmap_head * " arg );
+.SH DESCRIPTION
+This
+.BR ioctl (2)
+retrieves physical extent mappings for a filesystem.
+This information can be used to discover which files are mapped to a physical
+block, examine free space, or find known bad blocks, among other things.
+
+The sole argument to this ioctl should be a pointer to a single
+.BR "struct fsmap_head" ":"
+.in +4n
+.nf
+
+struct fsmap {
+	__u32 fmr_device;      /* device id */
+	__u32 fmr_flags;       /* mapping flags */
+	__u64 fmr_physical;    /* device offset of segment */
+	__u64 fmr_owner;       /* owner id */
+	__u64 fmr_offset;      /* file offset of segment */
+	__u64 fmr_length;      /* length of segment */
+	__u64 fmr_reserved[3]; /* must be zero */
+};
+
+struct fsmap_head {
+	__u32 fmh_iflags;      /* control flags */
+	__u32 fmh_oflags;      /* output flags */
+	__u32 fmh_count;       /* # of entries in array incl. input */
+	__u32 fmh_entries;     /* # of entries filled in (output). */
+	__u64 fmh_reserved[6]; /* must be zero */
+
+	struct fsmap fmh_keys[2]; /* low and high keys for the mapping search */
+	struct fsmap fmh_recs[];  /* returned records */
+};
+
+.fi
+.in
+The two
+.I fmh_keys
+array elements specify the lowest and highest reverse-mapping
+keys, respectively, for which userspace would like physical mapping
+information.
+A reverse mapping key consists of the tuple (device, block, owner, offset).
+The owner and offset fields are part of the key because some filesystems
+support sharing physical blocks between multiple files and
+therefore may return multiple mappings for a given physical block.
+.PP
+Filesystem mappings are copied into the
+.I fmh_recs
+array, which immediately follows the header data.
+.SS Fields of struct fsmap_head
+.PP
+The
+.I fmh_iflags
+field is a bitmask passed to the kernel to alter the output.
+There are no flags defined, so this value must be zero.
+
+.PP
+The
+.I fmh_oflags
+field is a bitmask of flags that concern all output mappings.
+If
+.B FMH_OF_DEV_T
+is set, then the
+.I fmr_device
+field represents a
+.B dev_t
+structure containing the major and minor numbers of the block device.
+
+.PP
+The
+.I fmh_count
+field contains the number of elements in the array being passed to the
+kernel.
+If this value is 0,
+.I fmh_entries
+will be set to the number of records that would have been returned had
+the array been large enough;
+no mapping information will be returned.
+
+.PP
+The
+.I fmh_entries
+field contains the number of elements in the
+.I fmh_recs
+array that contain useful information.
+
+.PP
+The
+.I fmh_reserved
+fields must be set to zero.
+
+.SS Keys
+.PP
+The two key records in
+.B fsmap_head.fmh_keys
+specify the lowest and highest extent records in the keyspace that the caller
+wants returned.
+A filesystem that can share blocks between files likely requires the tuple
+.RI "(" "device" ", " "physical" ", " "owner" ", " "offset" ", " "flags" ")"
+to uniquely index any filesystem mapping record.
+Classic non-sharing filesystems might be able to identify any record with only
+.RI "(" "device" ", " "physical" ", " "flags" ")."
+For example, if the low key is set to (8:0, 36864, 0, 0, 0), the filesystem will
+only return records for extents starting at or above 36KiB on disk.
+If the high key is set to (8:0, 1048576, 0, 0, 0), only records below 1MiB will
+be returned.
+The format of
+.B fmr_device
+in the keys must match the format of the same field in the output records,
+as defined below.
+By convention, the field
+.B fsmap_head.fmh_keys[0]
+must contain the low key and
+.B fsmap_head.fmh_keys[1]
+must contain the high key for the request.
+.PP
+For convenience, if
+.B fmr_length
+is set in the low key, it will be added to
+.IR fmr_block " or " fmr_offset
+as appropriate.
+The caller can take advantage of this subtlety to set up subsequent calls
+by copying
+.B fsmap_head.fmh_recs[fsmap_head.fmh_entries - 1]
+into the low key.
+The function
+.B fsmap_advance
+provides this functionality.
+
+.SS Fields of struct fsmap
+.PP
+The
+.I fmr_device
+field uniquely identifies the underlying storage device.
+If the
+.B FMH_OF_DEV_T
+flag is set in the header's
+.I fmh_oflags
+field, this field contains a
+.B dev_t
+from which major and minor numbers can be extracted.
+If the flag is not set, this field contains a value that must be unique
+for each unique storage device.
+
+.PP
+The
+.I fmr_physical
+field contains the disk address of the extent in bytes.
+
+.PP
+The
+.I fmr_owner
+field contains the owner of the extent.
+This is an inode number unless
+.B FMR_OF_SPECIAL_OWNER
+is set in the
+.I fmr_flags
+field, in which case the value is determined by the filesystem.
+See the section below about special owner values for more details.
+
+.PP
+The
+.I fmr_offset
+field contains the logical address in the mapping record in bytes.
+This field has no meaning if the
+.BR FMR_OF_SPECIAL_OWNER " or " FMR_OF_EXTENT_MAP
+flags are set in
+.IR fmr_flags "."
+
+.PP
+The
+.I fmr_length
+field contains the length of the extent in bytes.
+
+.PP
+The
+.I fmr_flags
+field is a bitmask of extent state flags.
+The bits are:
+.RS 0.4i
+.TP
+.B FMR_OF_PREALLOC
+The extent is allocated but not yet written.
+.TP
+.B FMR_OF_ATTR_FORK
+This extent contains extended attribute data.
+.TP
+.B FMR_OF_EXTENT_MAP
+This extent contains extent map information for the owner.
+.TP
+.B FMR_OF_SHARED
+Parts of this extent may be shared.
+.TP
+.B FMR_OF_SPECIAL_OWNER
+The
+.I fmr_owner
+field contains a special value instead of an inode number.
+.TP
+.B FMR_OF_LAST
+This is the last record in the filesystem.
+.RE
+
+.PP
+The
+.I fmr_reserved
+field will be set to zero.
+
+.SS Special Owner Values
+The following special owner values are generic to all filesystems:
+.RS 0.4i
+.TP
+.B FMR_OWN_FREE
+Free space.
+.TP
+.B FMR_OWN_UNKNOWN
+This extent is in use but its owner is not known.
+.TP
+.B FMR_OWN_METADATA
+This extent is filesystem metadata.
+.RE
+
+XFS can return the following special owner values:
+.RS 0.4i
+.TP
+.B XFS_FMR_OWN_FREE
+Free space.
+.TP
+.B XFS_FMR_OWN_UNKNOWN
+This extent is in use but its owner is not known.
+.TP
+.B XFS_FMR_OWN_FS
+Static filesystem metadata which exists at a fixed address.
+These are the AG superblock, the AGF, the AGFL, and the AGI headers.
+.TP
+.B XFS_FMR_OWN_LOG
+The filesystem journal.
+.TP
+.B XFS_FMR_OWN_AG
+Allocation group metadata, such as the free space btrees and the
+reverse mapping btrees.
+.TP
+.B XFS_FMR_OWN_INOBT
+The inode and free inode btrees.
+.TP
+.B XFS_FMR_OWN_INODES
+Inode records.
+.TP
+.B XFS_FMR_OWN_REFC
+Reference count information.
+.TP
+.B XFS_FMR_OWN_COW
+This extent is being used to stage a copy-on-write.
+.TP
+.B XFS_FMR_OWN_DEFECTIVE:
+This extent has been marked defective either by the filesystem or the
+underlying device.
+.RE
+
+ext4 can return the following special owner values:
+.RS 0.4i
+.TP
+.B EXT4_FMR_OWN_FREE
+Free space.
+.TP
+.B EXT4_FMR_OWN_UNKNOWN
+This extent is in use but its owner is not known.
+.TP
+.B EXT4_FMR_OWN_FS
+Static filesystem metadata which exists at a fixed address.
+This is the superblock and the group descriptors.
+.TP
+.B EXT4_FMR_OWN_LOG
+The filesystem journal.
+.TP
+.B EXT4_FMR_OWN_INODES
+Inode records.
+.TP
+.B EXT4_FMR_OWN_BLKBM
+Block bitmap.
+.TP
+.B EXT4_FMR_OWN_INOBM
+Inode bitmap.
+.RE
+
+.SH RETURN VALUE
+On error, \-1 is returned, and
+.I errno
+is set to indicate the error.
+.PP
+.SH ERRORS
+Error codes can be one of, but are not limited to, the following:
+.TP
+.B EINVAL
+The array is not long enough, or a non-zero value was passed in one of the
+fields that must be zero.
+.TP
+.B EFAULT
+The pointer passed in was not mapped to a valid memory address.
+.TP
+.B EBADF
+.IR fd
+is not open for reading.
+.TP
+.B EPERM
+This query is not allowed.
+.TP
+.B EOPNOTSUPP
+The filesystem does not support this command.
+.TP
+.B EUCLEAN
+The filesystem metadata is corrupt and needs repair.
+.TP
+.B EBADMSG
+The filesystem has detected a checksum error in the metadata.
+.TP
+.B ENOMEM
+Insufficient memory to process the request.
+
+.SH EXAMPLE
+.TP
+Please see io/fsmap.c in the xfsprogs distribution for a sample program.
+
+.SH CONFORMING TO
+This API is Linux-specific.
+Not all filesystems support it.
+.fi
+.in
+.SH SEE ALSO
+.BR ioctl (2)
Document the new GETFSMAP ioctl that returns the physical layout of a
(disk-based) filesystem.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 man2/ioctl_getfsmap.2 | 362 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 362 insertions(+)
 create mode 100644 man2/ioctl_getfsmap.2
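
For readers who want to see how the request buffer described in the man page fits together, here is a minimal C sketch of the buffer sizing and of the "copy the last record into the low key" step. It makes several assumptions: the `demo_` structures are local mirrors of the layout quoted in the patch (real code would include `<linux/fsmap.h>` and use `struct fsmap_head` directly), the helper names are hypothetical, and the all-ones high key is just one plausible way to ask for the whole keyspace. The `ioctl(fd, FS_IOC_GETFSMAP, head)` call itself is deliberately left out so the sketch runs on any filesystem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Local mirror of struct fsmap as shown in the patch (illustration only). */
struct demo_fsmap {
    uint32_t fmr_device;      /* device id */
    uint32_t fmr_flags;       /* mapping flags */
    uint64_t fmr_physical;    /* device offset of segment */
    uint64_t fmr_owner;       /* owner id */
    uint64_t fmr_offset;      /* file offset of segment */
    uint64_t fmr_length;      /* length of segment */
    uint64_t fmr_reserved[3]; /* must be zero */
};

/* Local mirror of struct fsmap_head as shown in the patch. */
struct demo_fsmap_head {
    uint32_t fmh_iflags;      /* control flags */
    uint32_t fmh_oflags;      /* output flags */
    uint32_t fmh_count;       /* # of entries in the record array */
    uint32_t fmh_entries;     /* # of entries filled in (output) */
    uint64_t fmh_reserved[6]; /* must be zero */
    struct demo_fsmap fmh_keys[2]; /* low and high keys */
    struct demo_fsmap fmh_recs[];  /* returned records */
};

/* Bytes needed for the header plus nr trailing records. */
static size_t demo_fsmap_sizeof(uint32_t nr)
{
    return sizeof(struct demo_fsmap_head) + (size_t)nr * sizeof(struct demo_fsmap);
}

/*
 * Allocate a zeroed request with room for nr records. The low key stays
 * all zeroes; the high key is set to the largest possible values, except
 * for its length and reserved fields, which are left at zero.
 */
static struct demo_fsmap_head *demo_fsmap_alloc(uint32_t nr)
{
    struct demo_fsmap_head *head = calloc(1, demo_fsmap_sizeof(nr));

    if (head == NULL)
        return NULL;
    head->fmh_count = nr;
    head->fmh_keys[1].fmr_device = UINT32_MAX;
    head->fmh_keys[1].fmr_flags = UINT32_MAX;
    head->fmh_keys[1].fmr_physical = UINT64_MAX;
    head->fmh_keys[1].fmr_owner = UINT64_MAX;
    head->fmh_keys[1].fmr_offset = UINT64_MAX;
    return head;
}

/*
 * Prepare the next call: copy the last returned record into the low key.
 * Per the man page text, the kernel then adds the key's fmr_length so the
 * next batch starts just past that record.
 */
static void demo_fsmap_advance(struct demo_fsmap_head *head)
{
    assert(head != NULL && head->fmh_entries > 0);
    head->fmh_keys[0] = head->fmh_recs[head->fmh_entries - 1];
}
```

A real caller would loop: issue the ioctl, process `fmh_entries` records, stop when the last record carries `FMR_OF_LAST` (or when no records come back), and otherwise advance the low key and repeat. Passing a count of zero first, as the man page describes, is a way to learn how many records to allocate.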