ioctl_getfsmap.2: document the GETFSMAP ioctl

Message ID 20170507155855.GD5970@birch.djwong.org
State Not Applicable, archived

Commit Message

Darrick Wong May 7, 2017, 3:58 p.m. UTC
Document the new GETFSMAP ioctl that returns the physical layout of a
(disk-based) filesystem.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 man2/ioctl_getfsmap.2 |  362 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 362 insertions(+)
 create mode 100644 man2/ioctl_getfsmap.2
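
For orientation, a minimal userspace sketch of how the ioctl under discussion might be called. It assumes the <linux/fsmap.h> UAPI header (struct fsmap_head, FS_IOC_GETFSMAP) is installed; the helper name count_fsmap_records is hypothetical and not part of the patch. It sets fmh_count to 0, which should ask the kernel only for the number of mapping records rather than the records themselves.

```c
#include <fcntl.h>
#include <limits.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fsmap.h>

/* Hypothetical helper: returns the number of fsmap records the filesystem
 * holding `path` would report, or -1 on any error (including filesystems
 * that do not implement GETFSMAP). */
long count_fsmap_records(const char *path)
{
	struct fsmap_head head;
	int fd, ret;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	memset(&head, 0, sizeof(head));
	/* The low key stays all zeroes; the high key is all ones, so the
	 * query spans every device, physical offset and owner. */
	head.fmh_keys[1].fmr_device = UINT_MAX;
	head.fmh_keys[1].fmr_flags = UINT_MAX;
	head.fmh_keys[1].fmr_physical = ULLONG_MAX;
	head.fmh_keys[1].fmr_owner = ULLONG_MAX;
	head.fmh_keys[1].fmr_offset = ULLONG_MAX;
	head.fmh_count = 0;	/* count-only query: no records are copied out */

	ret = ioctl(fd, FS_IOC_GETFSMAP, &head);
	close(fd);
	return ret < 0 ? -1 : (long)head.fmh_entries;
}
```

Note that nothing in this sketch needs elevated privileges, which is the property the comments below debate.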

Comments

Jann Horn May 7, 2017, 10:17 p.m. UTC | #1
On Sun, May 7, 2017 at 5:58 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> Document the new GETFSMAP ioctl that returns the physical layout of a
> (disk-based) filesystem.
[...]
> +.B EPERM
> +This query is not allowed.

Please document the circumstances under which a query is allowed.

Also: From a quick glance at the XFS implementation, I don't see any
privilege checks. Am I missing something, or does this API permit an
unprivileged user to determine the number of physical blocks allocated
for any inode, even for inodes the user can't ordinarily see in any
way?
Darrick Wong May 8, 2017, 6:41 p.m. UTC | #2
On Mon, May 08, 2017 at 12:17:53AM +0200, Jann Horn wrote:
> On Sun, May 7, 2017 at 5:58 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > Document the new GETFSMAP ioctl that returns the physical layout of a
> > (disk-based) filesystem.
> [...]
> > +.B EPERM
> > +This query is not allowed.
> 
> Please document the circumstances under which a query is allowed.

For the two current implementations, queries are always allowed.

(The doc could be more explicit about this decision being left to the
implementation.)

> Also: From a quick glance at the XFS implementation, I don't see any
> privilege checks. Am I missing something, or does this API permit an
> unprivileged user to determine the number of physical blocks allocated
> for any inode, even for inodes the user can't ordinarily see in any
> way?

Correct.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jann Horn May 8, 2017, 6:47 p.m. UTC | #3
On Mon, May 8, 2017 at 8:41 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Mon, May 08, 2017 at 12:17:53AM +0200, Jann Horn wrote:
>> On Sun, May 7, 2017 at 5:58 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
>> > Document the new GETFSMAP ioctl that returns the physical layout of a
>> > (disk-based) filesystem.
[...]
>> Also: From a quick glance at the XFS implementation, I don't see any
>> privilege checks. Am I missing something, or does this API permit an
>> unprivileged user to determine the number of physical blocks allocated
>> for any inode, even for inodes the user can't ordinarily see in any
>> way?
>
> Correct.

What's your reasoning for why this doesn't create any new potential
security issues? For example, as far as I can tell, this would permit
an unprivileged user to determine with high probability whether a set
of large files with known sizes is stored anywhere in the filesystem, even
across containers or so.
Darrick Wong May 8, 2017, 8:47 p.m. UTC | #4
On Mon, May 08, 2017 at 08:47:56PM +0200, Jann Horn wrote:
> On Mon, May 8, 2017 at 8:41 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Mon, May 08, 2017 at 12:17:53AM +0200, Jann Horn wrote:
> >> On Sun, May 7, 2017 at 5:58 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >> > Document the new GETFSMAP ioctl that returns the physical layout of a
> >> > (disk-based) filesystem.
> [...]
> >> Also: From a quick glance at the XFS implementation, I don't see any
> >> privilege checks. Am I missing something, or does this API permit an
> >> unprivileged user to determine the number of physical blocks allocated
> >> for any inode, even for inodes the user can't ordinarily see in any
> >> way?
> >
> > Correct.
> 
> What's your reasoning for why this doesn't create any new potential
> security issues? For example, as far as I can tell, this would permit

/Any/ ?  That is a huge request to be dropping on me after the vfs patch
gets merged, after a year-long review cycle, etc.  AFAIK there aren't
any problems, but then that's part of why I let this thing hang out to
dry for such a long time.  Even possessing the inode number, an
unprivileged process still cannot open files they wouldn't otherwise
have access to, since that requires the generation number, and only
bulkstat provides that (if you have CAP_SYS_ADMIN).

The whole reason for dropping the CAP_SYS_ADMIN check from GETFSMAP was
(a) so that unprivileged users could compute free space information and
(b) to allow dedupe tools to make better decisions about which file
donates blocks and which file accepts blocks.

If you have specific complaints, then let's hear and address them.  I'm
not going to try to prove a broad negative theoretical statement.

Moving on...

> an unprivileged user to determine with high probability whether a set
> of large files with known sizes is stored anywhere in the filesystem, even
> across containers or so.

How large?  How high?

Do you have a tool that analyzes a set of st_blocks values and compares
the set to known profiles in order to guess what's on the filesystem?
With what accuracy can it do that, especially without explicit path or
stat data?  The maximum resolution provided by the ioctl is fs block
size, so it's not like you can guess that this 1268432 byte file is
libclangAnalysis.a; all you know is that there are four 310-block files
on this filesystem -- on this system that's the desktop wallpaper, a
file from each of libclang and libgimp, and libc6 from my aarch64 guest.
The logical block map data could be more helpful for fingerprinting, but
only if there are sparse files.
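
The granularity argument can be checked with a little arithmetic. A minimal sketch, assuming 4096-byte fs blocks (the block size is not stated in the thread), a hypothetical helper name, and ignoring any metadata blocks a file may also own:

```c
#include <stdint.h>

/* Number of fs blocks needed to hold `bytes` of file data, rounding up.
 * `block_size` must be non-zero. */
uint64_t size_to_blocks(uint64_t bytes, uint64_t block_size)
{
	return (bytes + block_size - 1) / block_size;
}
```

Under that assumption, every file from 1265665 to 1269760 bytes occupies the same 310 blocks as the 1268432-byte example, so a block count alone cannot separate them.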

Say our multi-tenant container hosts all the containers on the same fs.
We now have a set of (inode, blockcount) data and a logical block map
for every inode stored on that fs.  We have no path or stat data, so how
do you tell what a 340-block file with a hole at offset 17 is?  You
could try to infer path structure using the (XFS) heuristic that file
inodes are usually created in the same AG as the directory inode they're
created in, but GETFSMAP doesn't distinguish file extents from directory
extents and AGs can host many different directories, so I don't think
this will help much.  Even if you have a reasonably good idea which
inodes are directories, you still don't know which other inodes have an
entry in a particular directory.

Then again once we throw reflink and dedupe between containers into the
mix the extent maps become far more interesting, because dirs could
potentially be identified by the lack of any shared blocks at all, and
other containers with the same library files will tend to share the same
blocks at the same offsets.  But that's still somewhat imprecise --
btrfs directories can share blocks between snapshots, whereas xfs can't,
and the existence of small unshared files with the same block count
introduces a certain amount of noise into the directory inference
process.  So maybe you'd be able to search for a reflinked .so file that
you /can/ stat to infer that there are X containers running the same
software as your container, though you still have to find them to mount
an attack.

FWIW I don't oppose having a CAP_SYS_ADMIN check again (patches gladly
accepted for review!), but I'm not yet convinced that this is a big
enough threat to forbid the use case.

Sure would be nice if we had finer-grained capabilities...

--D

Jann Horn May 8, 2017, 10:54 p.m. UTC | #5
On Mon, May 8, 2017 at 10:47 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Mon, May 08, 2017 at 08:47:56PM +0200, Jann Horn wrote:
>> On Mon, May 8, 2017 at 8:41 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
>> > On Mon, May 08, 2017 at 12:17:53AM +0200, Jann Horn wrote:
>> >> On Sun, May 7, 2017 at 5:58 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
>> >> > Document the new GETFSMAP ioctl that returns the physical layout of a
>> >> > (disk-based) filesystem.
>> [...]
>> >> Also: From a quick glance at the XFS implementation, I don't see any
>> >> privilege checks. Am I missing something, or does this API permit an
>> >> unprivileged user to determine the number of physical blocks allocated
>> >> for any inode, even for inodes the user can't ordinarily see in any
>> >> way?
>> >
>> > Correct.
>>
>> What's your reasoning for why this doesn't create any new potential
>> security issues? For example, as far as I can tell, this would permit
>
> /Any/ ?  That is a huge request to be dropping on me after the vfs patch
> gets merged, after a year-long review cycle, etc.

Fair point.

>> an unprivileged user to determine with high probability whether a set
>> of large files with known sizes is stored anywhere in the filesystem, even
>> across containers or so.
>
> How large?  How high?
>
> Do you have a tool that analyzes a set of st_blocks values and compares
> the set to known profiles in order to guess what's on the filesystem?
> With what accuracy can it do that, especially without explicit path or
> stat data?  The maximum resolution provided by the ioctl is fs block
> size, so it's not like you can guess that this 1268432 byte file is
> libclangAnalysis.a; all you know is that there are four 310-block files
> on this filesystem -- on this system that's the desktop wallpaper, a
> file from each of libclang and libgimp, and libc6 from my aarch64 guest.

This would probably become more realistic for larger files, like
conference recordings - with sizes like 200075, 48338, 155870, 134800
blocks -, although admittedly I don't have a specific scenario in mind in
which someone knowing what conference recordings I have on my disk
would be problematic.

You're probably right about this not being a particularly important
concern, and I recognize that if I had wanted a different API, I should
have said so a year ago.
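
The matching described above could look roughly like the following. This is a sketch only: the function name is hypothetical, and it assumes the per-inode block counts have already been gathered into a plain array, which is not something the ioctl hands over directly.

```c
#include <stddef.h>
#include <stdint.h>

/* Count how many entries of `known` (block counts of the files being
 * searched for) occur at least once in `seen` (per-inode block counts
 * observed on the filesystem). Quadratic, which is fine for a sketch. */
size_t count_fingerprint_hits(const uint64_t *known, size_t n_known,
			      const uint64_t *seen, size_t n_seen)
{
	size_t hits = 0;

	for (size_t i = 0; i < n_known; i++) {
		for (size_t j = 0; j < n_seen; j++) {
			if (known[i] == seen[j]) {
				hits++;
				break;	/* count each known size once */
			}
		}
	}
	return hits;
}
```

The more distinct large sizes in the known set that turn up, the less likely the match is a coincidence; a single 310-block hit says very little.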
Darrick Wong May 9, 2017, 1:53 a.m. UTC | #6
On Tue, May 09, 2017 at 12:54:57AM +0200, Jann Horn wrote:
> On Mon, May 8, 2017 at 10:47 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Mon, May 08, 2017 at 08:47:56PM +0200, Jann Horn wrote:
> >> On Mon, May 8, 2017 at 8:41 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >> > On Mon, May 08, 2017 at 12:17:53AM +0200, Jann Horn wrote:
> >> >> On Sun, May 7, 2017 at 5:58 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >> >> > Document the new GETFSMAP ioctl that returns the physical layout of a
> >> >> > (disk-based) filesystem.
> >> [...]
> >> >> Also: From a quick glance at the XFS implementation, I don't see any
> >> >> privilege checks. Am I missing something, or does this API permit an
> >> >> unprivileged user to determine the number of physical blocks allocated
> >> >> for any inode, even for inodes the user can't ordinarily see in any
> >> >> way?
> >> >
> >> > Correct.
> >>
> >> What's your reasoning for why this doesn't create any new potential
> >> security issues? For example, as far as I can tell, this would permit
> >
> > /Any/ ?  That is a huge request to be dropping on me after the vfs patch
> > gets merged, after a year-long review cycle, etc.
> 
> Fair point.
> 
> >> an unprivileged user to determine with high probability whether a set
> >> of large files with known sizes is stored anywhere in the filesystem, even
> >> across containers or so.
> >
> > How large?  How high?
> >
> > Do you have a tool that analyzes a set of st_blocks values and compares
> > the set to known profiles in order to guess what's on the filesystem?
> > With what accuracy can it do that, especially without explicit path or
> > stat data?  The maximum resolution provided by the ioctl is fs block
> > size, so it's not like you can guess that this 1268432 byte file is
> > libclangAnalysis.a; all you know is that there are four 310-block files
> > on this filesystem -- on this system that's the desktop wallpaper, a
> > file from each of libclang and libgimp, and libc6 from my aarch64 guest.
> 
> This would probably become more realistic for larger files, like
> conference recordings - with sizes like 200075, 48338, 155870, 134800
> blocks -, although admittedly I don't have a specific scenario in mind in
> which someone knowing what conference recordings I have on my disk
> would be problematic.

TBH, I /did/ build a (crappy) tool that tries to construct fingerprints
based on the fsmaps it finds for each inode number.  It works reasonably
well for identifying the existence (and number) of reflink clones of any
part of the file tree that you can stat and FIEMAP.  It also works (sort
of) if you have multiple separate filesystems with roughly the same fs
trees in them.  Mixing things into one big file produces a lot of noise,
and data files are harder to pick out unless they're deduped.

> You're probably right about this not being a particularly important
> concern, and I recognize that if I had wanted a different API, I should
> have said so a year ago.

I don't mind people bringing up specific concerns and making specific
requests for the rest of the 4.12 cycle since it's relatively easy to
make small changes to the interface (e.g. adding a capability check).

--D

Eric Biggers May 9, 2017, 9:17 p.m. UTC | #7
On Mon, May 08, 2017 at 06:53:24PM -0700, Darrick J. Wong wrote:
> > >> an unprivileged user to determine with high probability whether a set
> > >> of large files with known sizes is stored anywhere in the filesystem, even
> > >> across containers or so.
> > >
> > > How large?  How high?
> > >
> > > Do you have a tool that analyzes a set of st_blocks values and compares
> > > the set to known profiles in order to guess what's on the filesystem?
> > > With what accuracy can it do that, especially without explicit path or
> > > stat data?  The maximum resolution provided by the ioctl is fs block
> > > size, so it's not like you can guess that this 1268432 byte file is
> > > libclangAnalysis.a; all you know is that there are four 310-block files
> > > on this filesystem -- on this system that's the desktop wallpaper, a
> > > file from each of libclang and libgimp, and libc6 from my aarch64 guest.
> > 
> > This would probably become more realistic for larger files, like
> > conference recordings - with sizes like 200075, 48338, 155870, 134800
> > blocks -, although admittedly I don't have a specific scenario in mind in
> > which someone knowing what conference recordings I have on my disk
> > would be problematic.
> 
> TBH, I /did/ build a (crappy) tool that tries to construct fingerprints
> based on the fsmaps it finds for each inode number.  It works reasonably
> well for identifying the existence (and number) of reflink clones of any
> part of the file tree that you can stat and FIEMAP.  It also works (sort
> of) if you have multiple separate filesystems with roughly the same fs
> trees in them.  Mixing things into one big file produces a lot of noise,
> and data files are harder to pick out unless they're deduped.
> 
> > You're probably right about this not being a particularly important
> > concern, and I recognize that if I had wanted a different API, I should
> > have said so a year ago.
> 
> I don't mind people bringing up specific concerns and making specific
> requests for the rest of the 4.12 cycle since it's relatively easy to
> make small changes to the interface (e.g. adding a capability check).
> 

I was also surprised to see that there is no authorization check.  Using this
ioctl, *anyone* will be able to retrieve the precise list of physical blocks
used by every inode on the filesystem, even ones that are only linked to in
directories the user doesn't have read permission for, or aren't even visible in
their mount namespace.

I am most concerned about:

1.) Privacy implications.  Say the filesystem is being shared between multiple
    users, and one user unpacks foo.tar.gz into their home directory, which
    they've set to mode 700 to hide from other users.  Because of this new
    ioctl, all users will be able to see every (inode number, size in blocks)
    pair that was added to the filesystem, as well as the exact layout of the
    physical block allocations which might hint at how the files were created.
    If there is a known "fingerprint" for the unpacked foo.tar.gz in this
    regard, its presence on the filesystem will be revealed to all users.  And
    if any filesystems happen to prefer allocating blocks near the containing
    directory, the directory the files are in would likely be revealed too.
    
    Also note that by repeatedly executing the ioctl, all users will be able to
    see at what time any arbitrary inode was added to the filesystem, as well as
    exactly when any arbitrary inode was truncated, or otherwise modified in a
    way that changed its extent mappings.  More generally, all users will be
    able to follow the evolution of any arbitrary set of inodes over time.   In
    a shared hosting environment this could allow anyone to determine many of
    the characteristics of other containers being hosted by the kernel, such as
    which software and software versions they're using (or at least, to a higher
    degree of confidence than other side channels that may be available
    currently).

2.) Abusing the ioctl as an information leak in combination with another
    security vulnerability.  For example let's say that there's a vulnerability
    in xfs or ext4 that allows writing (but not reading) to an arbitrary
    physical disk block.  Now obviously there are many ways this could be
    exploited, but let's say you're in a container, so just elevating to root in
    the container isn't enough, and you don't know where the critical system
    files are.  Using the fsmap ioctl to get the extent mappings of *all* files
    on the filesystem, you may still be able to determine with a high degree of
    confidence which physical disk blocks hold the contents of files outside the
    container that could be backdoored, e.g. /etc/shadow, or some binary in /bin
    or /lib, or perhaps a kernel module in /lib/modules.

So given that this ioctl operates on the global filesystem and not on a
particular file, it really seems like more of an administrator-level thing
(capable(CAP_SYS_ADMIN)), not something that any random user should be able to
execute.

- Eric
Theodore Ts'o May 10, 2017, 4:38 p.m. UTC | #8
On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
> 1.) Privacy implications.  Say the filesystem is being shared between multiple
>     users, and one user unpacks foo.tar.gz into their home directory, which
>     they've set to mode 700 to hide from other users.  Because of this new
>     ioctl, all users will be able to see every (inode number, size in blocks)
>     pair that was added to the filesystem, as well as the exact layout of the
>     physical block allocations which might hint at how the files were created.
>     If there is a known "fingerprint" for the unpacked foo.tar.gz in this
>     regard, its presence on the filesystem will be revealed to all users.  And
>     if any filesystems happen to prefer allocating blocks near the containing
>     directory, the directory the files are in would likely be revealed too.

Unix/Linux has historically not been terribly concerned about trying
to protect this kind of privacy between users.  So for example, in
order to do this, you would have to call GETFSMAP continuously to track
this sort of thing.  Someone who wanted to do this could probably get
this information (and much, much more) by continuously running "ps" to
see what processes are running.

(I will note, wryly, that in the bad old days, when dozens of users
were sharing a one MIPS Vax/780, it was considered a *good* thing
that social pressure could be applied when it was found that someone
was running a CPU or memory hogger on a time sharing system.  The
privacy right of someone running "xtrek" to be able to hide this from
other users on the system was never considered important at all.  :-)

Fortunately, the days of timesharing seem to be well behind us.  For
those people who think that containers are as secure as VM's (hah,
hah, hah), it might be that the best way to handle this is to have a mount
option that requires root access to this functionality.  For those
people who really care about this, they can disable access.
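
As a rough model of that idea, and purely a sketch: the struct, the field and the function below are hypothetical, no such mount option exists in the patch, and the boolean argument stands in for the kernel's capable(CAP_SYS_ADMIN) check.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-mount policy knob. */
struct fsmap_policy {
	bool admin_only;	/* true: GETFSMAP requires CAP_SYS_ADMIN */
};

/* Returns 0 if the query may proceed, -EPERM otherwise. */
int fsmap_query_allowed(const struct fsmap_policy *policy, bool caller_is_admin)
{
	if (policy->admin_only && !caller_is_admin)
		return -EPERM;
	return 0;
}
```

With admin_only left false this matches the behaviour described for the two current implementations (queries always allowed); setting it gives the EPERM case the man page mentions.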

Cheers,

					- Ted
Eric W. Biederman May 10, 2017, 7:27 p.m. UTC | #9
Theodore Ts'o <tytso@mit.edu> writes:

> On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
>> 1.) Privacy implications.  Say the filesystem is being shared between multiple
>>     users, and one user unpacks foo.tar.gz into their home directory, which
>>     they've set to mode 700 to hide from other users.  Because of this new
>>     ioctl, all users will be able to see every (inode number, size in blocks)
>>     pair that was added to the filesystem, as well as the exact layout of the
>>     physical block allocations which might hint at how the files were created.
>>     If there is a known "fingerprint" for the unpacked foo.tar.gz in this
>>     regard, its presence on the filesystem will be revealed to all users.  And
>>     if any filesystems happen to prefer allocating blocks near the containing
>>     directory, the directory the files are in would likely be revealed too.
>
> Unix/Linux has historically not been terribly concerned about trying
> to protect this kind of privacy between users.  So for example, in
> order to do this, you would have to call GETFSMAP continuously to track
> this sort of thing.  Someone who wanted to do this could probably get
> this information (and much, much more) by continuously running "ps" to
> see what processes are running.
>
> (I will note, wryly, that in the bad old days, when dozens of users
> were sharing a one MIPS Vax/780, it was considered a *good* thing
> that social pressure could be applied when it was found that someone
> was running a CPU or memory hogger on a time sharing system.  The
> privacy right of someone running "xtrek" to be able to hide this from
> other users on the system was never considered important at all.  :-)
>
> Fortunately, the days of timesharing seem to be well behind us.  For
> those people who think that containers are as secure as VM's (hah,
> hah, hah), it might be that the best way to handle this is to have a mount
> option that requires root access to this functionality.  For those
> people who really care about this, they can disable access.

What would be the reason for not putting this behind
capable(CAP_SYS_ADMIN)?

What possible legitimate function could this functionality serve to
users who don't own your filesystem?

I have seen several people speak up about how this is a concern; I don't see
anyone saying here is a legitimate use for a non-system administrator.

This doesn't seem like something where abuses of time-sharing systems
can be observed.

Eric
Darrick Wong May 10, 2017, 8:14 p.m. UTC | #10
[cc btrfs, since afaict that's where most of the dedupe tool authors hang out]

On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
> Theodore Ts'o <tytso@mit.edu> writes:
> 
> > On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
> >> 1.) Privacy implications.  Say the filesystem is being shared between multiple
> >>     users, and one user unpacks foo.tar.gz into their home directory, which
> >>     they've set to mode 700 to hide from other users.  Because of this new
> >>     ioctl, all users will be able to see every (inode number, size in blocks)
> >>     pair that was added to the filesystem, as well as the exact layout of the
> >>     physical block allocations which might hint at how the files were created.
> >>     If there is a known "fingerprint" for the unpacked foo.tar.gz in this
> >>     regard, its presence on the filesystem will be revealed to all users.  And
> >>     if any filesystems happen to prefer allocating blocks near the containing
> >>     directory, the directory the files are in would likely be revealed too.

Frankly, why are container users even allowed to make unrestricted ioctl
calls?  I thought we had a bunch of security infrastructure to constrain
what userspace can do to a system, so why don't ioctls fall under these
same protections?  If your containers are really that adversarial, you
ought to be blacklisting as much as you can.

> > Unix/Linux has historically not been terribly concerned about trying
> > to protect this kind of privacy between users.  So for example, in
> > order to do this, you would have to call GETFSMAP continuously to track
> > this sort of thing.  Someone who wanted to do this could probably get
> > this information (and much, much more) by continuously running "ps" to
> > see what processes are running.
> >
> > (I will note, wryly, that in the bad old days, when dozens of users
> > were sharing a one MIPS Vax/780, it was considered a *good* thing
> > that social pressure could be applied when it was found that someone
> > was running a CPU or memory hogger on a time sharing system.  The
> > privacy right of someone running "xtrek" to be able to hide this from
> > other users on the system was never considered important at all.  :-)

Not to mention someone running GETFSMAP in a loop will be pretty obvious
both from the high kernel cpu usage and the huge number of metadata
operations.

> > Fortunately, the days of timesharing seem to be well behind us.  For
> > those people who think that containers are as secure as VM's (hah,
> > hah, hah), it might be that the best way to handle this is to have a mount
> > option that requires root access to this functionality.  For those
> > people who really care about this, they can disable access.

Or use separate filesystems for each container so that exploitable bugs
that shut down the filesystem can't be used to kill the other
containers.  You could use a torrent of metadata-heavy operations
(fallocate a huge file, punch every block, truncate file, repeat) to DoS
the other containers.

> What would be the reason for not putting this behind
> capable(CAP_SYS_ADMIN)?
> 
> What possible legitimate function could this functionality serve to
> users who don't own your filesystem?

As I've said before, it's to enable dedupe tools to decide, given a set
of files with shareable blocks, roughly how many other times each of
those shareable blocks is shared so that they can make better decisions
about which file keeps its shareable blocks, and which file gets
remapped.  Dedupe is not a privileged operation, nor are any of the
tools.
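
One way a dedupe tool might use such share counts, as a sketch under assumptions: the struct and function names are hypothetical, and the "keep the most-shared copy" rule is only one plausible heuristic, not something the patch or any particular tool prescribes.

```c
#include <stddef.h>
#include <stdint.h>

struct dedupe_candidate {
	uint64_t inode;		/* owner of this copy of the data */
	uint32_t share_count;	/* other owners already sharing its blocks */
};

/* Returns the index of the candidate that should keep its blocks (the
 * donor); every other candidate would be remapped onto it. Ties go to
 * the earliest entry. Requires n > 0. */
size_t pick_donor(const struct dedupe_candidate *c, size_t n)
{
	size_t best = 0;

	for (size_t i = 1; i < n; i++) {
		if (c[i].share_count > c[best].share_count)
			best = i;
	}
	return best;
}
```

Keeping the copy that is already shared most widely means fewer files need remapping; without filesystem-wide share counts the tool has to guess.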

> I have seen several people speak up about how this is a concern; I don't see
> anyone saying here is a legitimate use for a non-system administrator.

/I/ said that a few emails ago.

--D

> This doesn't seem like something where abuses of time-sharing systems
> can be observed.
> 
> Eric
Eric Biggers May 11, 2017, 5:10 a.m. UTC | #11
On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]
> 
> On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
> > Theodore Ts'o <tytso@mit.edu> writes:
> > 
> > > On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
> > >> 1.) Privacy implications.  Say the filesystem is being shared between multiple
> > >>     users, and one user unpacks foo.tar.gz into their home directory, which
> > >>     they've set to mode 700 to hide from other users.  Because of this new
> > >>     ioctl, all users will be able to see every (inode number, size in blocks)
> > >>     pair that was added to the filesystem, as well as the exact layout of the
> > >>     physical block allocations which might hint at how the files were created.
> > >>     If there is a known "fingerprint" for the unpacked foo.tar.gz in this
> > >>     regard, its presence on the filesystem will be revealed to all users.  And
> > >>     if any filesystems happen to prefer allocating blocks near the containing
> > >>     directory, the directory the files are in would likely be revealed too.
> 
> Frankly, why are container users even allowed to make unrestricted ioctl
> calls?  I thought we had a bunch of security infrastructure to constrain
> what userspace can do to a system, so why don't ioctls fall under these
> same protections?  If your containers are really that adversarial, you
> ought to be blacklisting as much as you can.
> 

Personally I don't find the presence of sandboxing features to be a very good
excuse for introducing random insecure ioctls.  Not everyone has everything
perfectly "sandboxed" all the time, for obvious reasons.  It's easy to forget
about the filesystem ioctls, too, since they can be executed on any regular
file, without having to open some device node in /dev.

(And this actually does happen; the SELinux policy in Android, for example,
still allows apps to call any ioctl on their data files, despite all the effort
that has gone into whitelisting other types of ioctls.  Which should be fixed,
of course, but it shows that this kind of mistake is very easy to make.)

> > > Unix/Linux has historically not been terribly concerned about trying
> > > to protect this kind of privacy between users.  So for example, in
> > > order to do this, you would have to call GETFSMAP continuously to track
> > > this sort of thing.  Someone who wanted to do this could probably get
> > > this information (and much, much more) by continuously running "ps" to
> > > see what processes are running.
> > >
> > > (I will note, wryly, that in the bad old days, when dozens of users
> > > were sharing a one MIPS Vax/780, it was considered a *good* thing
> > > that social pressure could be applied when it was found that someone
> > > was running a CPU or memory hogger on a time sharing system.  The
> > > privacy right of someone running "xtrek" to be able to hide this from
> > > other users on the system was never considered important at all.  :-)
> 
> Not to mention someone running GETFSMAP in a loop will be pretty obvious
> both from the high kernel cpu usage and the huge number of metadata
> operations.

Well, only if that someone running GETFSMAP actually wants to watch things in
real-time (it's not necessary for all scenarios that have been mentioned), *and*
there is monitoring in place which actually detects it and can do something
about it.

Yes, PIDs have traditionally been global, but today we have PID namespaces, and
many other isolation features such as mount namespaces.  Nothing is perfect, of
course, and containers are a lot worse than VMs, but it seems weird to use that
as an excuse to knowingly make things worse...

> 
> > > Fortunately, the days of timesharing seem to be well behind us.  For
> > > those people who think that containers are as secure as VM's (hah,
> > > hah, hah), it might be that the best way to handle this is to have a mount
> > > option that requires root access to this functionality.  For those
> > > people who really care about this, they can disable access.
> 
> Or use separate filesystems for each container so that exploitable bugs
> that shut down the filesystem can't be used to kill the other
> containers.  You could use a torrent of metadata-heavy operations
> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
> the other containers.
> 
> > What would be the reason for not putting this behind
> > capable(CAP_SYS_ADMIN)?
> > 
> > What possible legitimate function could this functionality serve to
> > users who don't own your filesystem?
> 
> As I've said before, it's to enable dedupe tools to decide, given a set
> of files with shareable blocks, roughly how many other times each of
> those shareable blocks are shared so that they can make better decisions
> about which file keeps its shareable blocks, and which file gets
> remapped.  Dedupe is not a privileged operation, nor are any of the
> tools.
> 

So why does the ioctl need to return all extent mappings for the entire
filesystem, instead of just the share count of each block in the file that the
ioctl is called on?

- Eric
Andreas Dilger May 14, 2017, 1:41 a.m. UTC | #12
On May 10, 2017, at 11:10 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> 
> On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
>> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]
>> 
>> On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
>>> Theodore Ts'o <tytso@mit.edu> writes:
>>> 
>>>> On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
>>>>> 1.) Privacy implications.  Say the filesystem is being shared between multiple
>>>>>    users, and one user unpacks foo.tar.gz into their home directory, which
>>>>>    they've set to mode 700 to hide from other users.  Because of this new
>>>>>    ioctl, all users will be able to see every (inode number, size in blocks)
>>>>>    pair that was added to the filesystem, as well as the exact layout of the
>>>>>    physical block allocations which might hint at how the files were created.
>>>>>    If there is a known "fingerprint" for the unpacked foo.tar.gz in this
>>>>>    regard, its presence on the filesystem will be revealed to all users.  And
>>>>>    if any filesystems happen to prefer allocating blocks near the containing
>>>>>    directory, the directory the files are in would likely be revealed too.
>> 
>> Frankly, why are container users even allowed to make unrestricted ioctl
>> calls?  I thought we had a bunch of security infrastructure to constrain
>> what userspace can do to a system, so why don't ioctls fall under these
>> same protections?  If your containers are really that adversarial, you
>> ought to be blacklisting as much as you can.
>> 
> 
> Personally I don't find the presence of sandboxing features to be a very good
> excuse for introducing random insecure ioctls.  Not everyone has everything
> perfectly "sandboxed" all the time, for obvious reasons.  It's easy to forget
> about the filesystem ioctls, too, since they can be executed on any regular
> file, without having to open some device node in /dev.
> 
> (And this actually does happen; the SELinux policy in Android, for example,
> still allows apps to call any ioctl on their data files, despite all the effort
> that has gone into whitelisting other types of ioctls.  Which should be fixed,
> of course, but it shows that this kind of mistake is very easy to make.)
> 
>>>> Unix/Linux has historically not been terribly concerned about trying
>>>> to protect this kind of privacy between users.  So for example, in
>>>> order to do this, you would have to call GETFSMAP continuously to track
>>>> this sort of thing.  Someone who wanted to do this could probably get
>>>> this information (and much, much more) by continuously running "ps" to
>>>> see what processes are running.
>>>> 
>>>> (I will note, wryly, that in the bad old days, when dozens of users
>>>> were sharing a one MIPS Vax/780, it was considered a *good* thing
>>>> that social pressure could be applied when it was found that someone
>>>> was running a CPU or memory hogger on a time sharing system.  The
>>>> privacy right of someone running "xtrek" to be able to hide this from
>>>> other users on the system was never considered important at all.  :-)
>> 
>> Not to mention someone running GETFSMAP in a loop will be pretty obvious
>> both from the high kernel cpu usage and the huge number of metadata
>> operations.
> 
> Well, only if that someone running GETFSMAP actually wants to watch things in
> real-time (it's not necessary for all scenarios that have been mentioned), *and*
> there is monitoring in place which actually detects it and can do something
> about it.
> 
> Yes, PIDs have traditionally been global, but today we have PID namespaces, and
> many other isolation features such as mount namespaces.  Nothing is perfect, of
> course, and containers are a lot worse than VMs, but it seems weird to use that
> as an excuse to knowingly make things worse...
> 
>> 
>>>> Fortunately, the days of timesharing seem to be well behind us.  For
>>>> those people who think that containers are as secure as VM's (hah,
>>>> hah, hah), it might be that the best way to handle this is to have a mount
>>>> option that requires root access to this functionality.  For those
>>>> people who really care about this, they can disable access.
>> 
>> Or use separate filesystems for each container so that exploitable bugs
>> that shut down the filesystem can't be used to kill the other
>> containers.  You could use a torrent of metadata-heavy operations
>> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
>> the other containers.
>> 
>>> What would be the reason for not putting this behind
>>> capable(CAP_SYS_ADMIN)?
>>> 
>>> What possible legitimate function could this functionality serve to
>>> users who don't own your filesystem?
>> 
>> As I've said before, it's to enable dedupe tools to decide, given a set
>> of files with shareable blocks, roughly how many other times each of
>> those shareable blocks are shared so that they can make better decisions
>> about which file keeps its shareable blocks, and which file gets
>> remapped.  Dedupe is not a privileged operation, nor are any of the
>> tools.
>> 
> 
> So why does the ioctl need to return all extent mappings for the entire
> filesystem, instead of just the share count of each block in the file that the
> ioctl is called on?

One possibility is that the ioctl() can return the mapping for all inodes
owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE,
or CAP_FOWNER is set), and return a "filesystem aggregate inode" (or more
than one if there is a reason to do so) with all the other allocated blocks
for inodes the user doesn't have permission to access?

IMHO, this would allow a non-root user the main benefit of GETFSMAP, which
is trying to determine how fragmented their files are and/or how fragmented
the free space is, without leaking any information about file sizes, location,
or other information the user can't already get today in a less efficient
manner.

I don't know how hard this is to implement, but seems not impossible.

Cheers, Andreas
Darrick Wong May 14, 2017, 4:25 a.m. UTC | #13
On Sat, May 13, 2017 at 07:41:24PM -0600, Andreas Dilger wrote:
> On May 10, 2017, at 11:10 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> > 
> > On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
> >> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]
> >> 
> >> On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
> >>> Theodore Ts'o <tytso@mit.edu> writes:
> >>> 
> >>>> On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
> >>>>> 1.) Privacy implications.  Say the filesystem is being shared between multiple
> >>>>>    users, and one user unpacks foo.tar.gz into their home directory, which
> >>>>>    they've set to mode 700 to hide from other users.  Because of this new
> >>>>>    ioctl, all users will be able to see every (inode number, size in blocks)
> >>>>>    pair that was added to the filesystem, as well as the exact layout of the
> >>>>>    physical block allocations which might hint at how the files were created.
> >>>>>    If there is a known "fingerprint" for the unpacked foo.tar.gz in this
> >>>>>    regard, its presence on the filesystem will be revealed to all users.  And
> >>>>>    if any filesystems happen to prefer allocating blocks near the containing
> >>>>>    directory, the directory the files are in would likely be revealed too.
> >> 
> >> Frankly, why are container users even allowed to make unrestricted ioctl
> >> calls?  I thought we had a bunch of security infrastructure to constrain
> >> what userspace can do to a system, so why don't ioctls fall under these
> >> same protections?  If your containers are really that adversarial, you
> >> ought to be blacklisting as much as you can.
> >> 
> > 
> > Personally I don't find the presence of sandboxing features to be a very good
> > excuse for introducing random insecure ioctls.  Not everyone has everything
> > perfectly "sandboxed" all the time, for obvious reasons.  It's easy to forget
> > about the filesystem ioctls, too, since they can be executed on any regular
> > file, without having to open some device node in /dev.
> > 
> > (And this actually does happen; the SELinux policy in Android, for example,
> > still allows apps to call any ioctl on their data files, despite all the effort
> > that has gone into whitelisting other types of ioctls.  Which should be fixed,
> > of course, but it shows that this kind of mistake is very easy to make.)
> > 
> >>>> Unix/Linux has historically not been terribly concerned about trying
> >>>> to protect this kind of privacy between users.  So for example, in
> >>>> order to do this, you would have to call GETFSMAP continuously to track
> >>>> this sort of thing.  Someone who wanted to do this could probably get
> >>>> this information (and much, much more) by continuously running "ps" to
> >>>> see what processes are running.
> >>>> 
> >>>> (I will note, wryly, that in the bad old days, when dozens of users
> >>>> were sharing a one MIPS Vax/780, it was considered a *good* thing
> >>>> that social pressure could be applied when it was found that someone
> >>>> was running a CPU or memory hogger on a time sharing system.  The
> >>>> privacy right of someone running "xtrek" to be able to hide this from
> >>>> other users on the system was never considered important at all.  :-)
> >> 
> >> Not to mention someone running GETFSMAP in a loop will be pretty obvious
> >> both from the high kernel cpu usage and the huge number of metadata
> >> operations.
> > 
> > Well, only if that someone running GETFSMAP actually wants to watch things in
> > real-time (it's not necessary for all scenarios that have been mentioned), *and*
> > there is monitoring in place which actually detects it and can do something
> > about it.
> > 
> > Yes, PIDs have traditionally been global, but today we have PID namespaces, and
> > many other isolation features such as mount namespaces.  Nothing is perfect, of
> > course, and containers are a lot worse than VMs, but it seems weird to use that
> > as an excuse to knowingly make things worse...
> > 
> >> 
> >>>> Fortunately, the days of timesharing seem to be well behind us.  For
> >>>> those people who think that containers are as secure as VM's (hah,
> >>>> hah, hah), it might be that the best way to handle this is to have a mount
> >>>> option that requires root access to this functionality.  For those
> >>>> people who really care about this, they can disable access.
> >> 
> >> Or use separate filesystems for each container so that exploitable bugs
> >> that shut down the filesystem can't be used to kill the other
> >> containers.  You could use a torrent of metadata-heavy operations
> >> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
> >> the other containers.
> >> 
> >>> What would be the reason for not putting this behind
> >>> capable(CAP_SYS_ADMIN)?
> >>> 
> >>> What possible legitimate function could this functionality serve to
> >>> users who don't own your filesystem?
> >> 
> >> As I've said before, it's to enable dedupe tools to decide, given a set
> >> of files with shareable blocks, roughly how many other times each of
> >> those shareable blocks are shared so that they can make better decisions
> >> about which file keeps its shareable blocks, and which file gets
> >> remapped.  Dedupe is not a privileged operation, nor are any of the
> >> tools.
> >> 
> > 
> > So why does the ioctl need to return all extent mappings for the entire
> > filesystem, instead of just the share count of each block in the file that the
> > ioctl is called on?
> 
> One possibility is that the ioctl() can return the mapping for all inodes
> owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE,
> or CAP_FOWNER is set), and return a "filesystem aggregate inode" (or more
> than one if there is a reason to do so) with all the other allocated blocks
> for inodes the user doesn't have permission to access?

Hmm, CAP_DAC_OVERRIDE/CAP_FOWNER?  That might be a reasonable set of
capabilities to grant access...

> IMHO, this would allow a non-root user the main benefit of GETFSMAP, which
> is trying to determine how fragmented their files are and/or how fragmented
> the free space is, without leaking any information about file sizes, location,
> or other information the user can't already get today in a less efficient
> manner.
> 
> I don't know how hard this is to implement, but seems not impossible.

It's already implemented in both XFS and ext4. <cough>

File extents are marked as "owned" by "unknown".
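
From the caller's side, spotting such a record might look like the sketch
below.  It assumes the uapi header <linux/fsmap.h> is installed and that
hidden owners come back with FMR_OF_SPECIAL_OWNER set and fmr_owner equal to
FMR_OWN_UNKNOWN, as the man page in this patch describes; owner_is_hidden()
is a made-up name for illustration, not an existing function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/fsmap.h>

/*
 * Hypothetical helper (not part of any library): returns true when a
 * record describes space that is in use but whose owning inode the
 * filesystem did not report to this caller.
 */
static bool owner_is_hidden(const struct fsmap *rec)
{
	assert(rec != NULL);

	/* fmr_owner is only a special value when this flag is set. */
	if (!(rec->fmr_flags & FMR_OF_SPECIAL_OWNER))
		return false;

	return rec->fmr_owner == FMR_OWN_UNKNOWN;
}
```

A dedupe or defrag tool could use a check like this to skip extents it
cannot attribute to a file, rather than treating the owner value as an inode
number.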

Now, I suppose one could devise a scheme such that files that the caller
can open actually do get inode numbers returned, but ... that's more
engineering work, let's see if anyone asks for that (vs. asks for any of
the magic capability bits).

--D

Andy Lutomirski May 14, 2017, 1:56 p.m. UTC | #14
On Sat, May 13, 2017 at 6:41 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On May 10, 2017, at 11:10 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
>>
>> On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
>>> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]

>> Yes, PIDs have traditionally been global, but today we have PID namespaces, and
>> many other isolation features such as mount namespaces.  Nothing is perfect, of
>> course, and containers are a lot worse than VMs, but it seems weird to use that
>> as an excuse to knowingly make things worse...
>>

Indeed.  Not only PID namespaces -- we have hidepid and we can simply
unmount /proc.  "There are other info leaks" is a poor excuse.

>>>
>>>>> Fortunately, the days of timesharing seem to be well behind us.  For
>>>>> those people who think that containers are as secure as VM's (hah,
>>>>> hah, hah), it might be that the best way to handle this is to have a mount
>>>>> option that requires root access to this functionality.  For those
>>>>> people who really care about this, they can disable access.
>>>
>>> Or use separate filesystems for each container so that exploitable bugs
>>> that shut down the filesystem can't be used to kill the other
>>> containers.  You could use a torrent of metadata-heavy operations
>>> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
>>> the other containers.
>>>
>>>> What would be the reason for not putting this behind
>>>> capable(CAP_SYS_ADMIN)?
>>>>
>>>> What possible legitimate function could this functionality serve to
>>>> users who don't own your filesystem?
>>>
>>> As I've said before, it's to enable dedupe tools to decide, given a set
>>> of files with shareable blocks, roughly how many other times each of
>>> those shareable blocks are shared so that they can make better decisions
>>> about which file keeps its shareable blocks, and which file gets
>>> remapped.  Dedupe is not a privileged operation, nor are any of the
>>> tools.
>>>
>>
>> So why does the ioctl need to return all extent mappings for the entire
>> filesystem, instead of just the share count of each block in the file that the
>> ioctl is called on?
>
> One possibility is that the ioctl() can return the mapping for all inodes
> owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE,
> or CAP_FOWNER is set), and return a "filesystem aggregate inode" (or more
> than one if there is a reason to do so) with all the other allocated blocks
> for inodes the user doesn't have permission to access?

Sounds like it could be reasonable.  But you don't want "owned by the
calling PID" precisely -- you also need to check
kgid_has_mapping(current_user_ns(), inode->i_gid), I think.
Darrick Wong May 18, 2017, 2:04 a.m. UTC | #15
On Sun, May 14, 2017 at 06:56:10AM -0700, Andy Lutomirski wrote:
> On Sat, May 13, 2017 at 6:41 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > On May 10, 2017, at 11:10 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> >>
> >> On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
> >>> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]
> 
> >> Yes, PIDs have traditionally been global, but today we have PID namespaces, and
> >> many other isolation features such as mount namespaces.  Nothing is perfect, of
> >> course, and containers are a lot worse than VMs, but it seems weird to use that
> >> as an excuse to knowingly make things worse...
> >>
> 
> Indeed.  Not only PID namespaces -- we have hidepid and we can simply
> unmount /proc.  "There are other info leaks" is a poor excuse.

Eh.  From the sounds of it I'm not all that impressed at the isolation
and leakproofness of any of these schemes.  Regardless, I will rephrase
the manpage to emphasize more strongly that filesystems are under no
obligation to share inode numbers, privileged callers or otherwise.

> >>>
> >>>>> Fortunately, the days of timesharing seem to be well behind us.  For
> >>>>> those people who think that containers are as secure as VM's (hah,
> >>>>> hah, hah), it might be that the best way to handle this is to have a mount
> >>>>> option that requires root access to this functionality.  For those
> >>>>> people who really care about this, they can disable access.
> >>>
> >>> Or use separate filesystems for each container so that exploitable bugs
> >>> that shut down the filesystem can't be used to kill the other
> >>> containers.  You could use a torrent of metadata-heavy operations
> >>> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
> >>> the other containers.
> >>>
> >>>> What would be the reason for not putting this behind
> >>>> capable(CAP_SYS_ADMIN)?
> >>>>
> >>>> What possible legitimate function could this functionality serve to
> >>>> users who don't own your filesystem?
> >>>
> >>> As I've said before, it's to enable dedupe tools to decide, given a set
> >>> of files with shareable blocks, roughly how many other times each of
> >>> those shareable blocks are shared so that they can make better decisions
> >>> about which file keeps its shareable blocks, and which file gets
> >>> remapped.  Dedupe is not a privileged operation, nor are any of the
> >>> tools.
> >>>
> >>
> >> So why does the ioctl need to return all extent mappings for the entire
> >> filesystem, instead of just the share count of each block in the file that the
> >> ioctl is called on?
> >
> > One possibility is that the ioctl() can return the mapping for all inodes
> > owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE,
> > or CAP_FOWNER is set), and return a "filesystem aggregate inode" (or more
> > than one if there is a reason to do so) with all the other allocated blocks
> > for inodes the user doesn't have permission to access?
> 
> Sounds like it could be reasonable.  But you don't want "owned by the
> calling PID" precisely -- you also need to check
> kgid_has_mapping(current_user_ns(), inode->i_gid), I think.

Not to mention that I don't want to go xfs_igetting every inode across
the entire filesystem... :)

--D


Patch

diff --git a/man2/ioctl_getfsmap.2 b/man2/ioctl_getfsmap.2
new file mode 100644
index 0000000..ef9daef
--- /dev/null
+++ b/man2/ioctl_getfsmap.2
@@ -0,0 +1,362 @@ 
+.\" Copyright (c) 2017, Oracle.  All rights reserved.
+.\"
+.\" %%%LICENSE_START(GPLv2+_DOC_FULL)
+.\" This is free documentation; you can redistribute it and/or
+.\" modify it under the terms of the GNU General Public License as
+.\" published by the Free Software Foundation; either version 2 of
+.\" the License, or (at your option) any later version.
+.\"
+.\" The GNU General Public License's references to "object code"
+.\" and "executables" are to be interpreted as the output of any
+.\" document formatting or typesetting system, including
+.\" intermediate and printed output.
+.\"
+.\" This manual is distributed in the hope that it will be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public
+.\" License along with this manual; if not, see
+.\" <http://www.gnu.org/licenses/>.
+.\" %%%LICENSE_END
+.TH IOCTL_GETFSMAP 2 2017-02-10 "Linux" "Linux Programmer's Manual"
+.SH NAME
+ioctl_getfsmap \- retrieve the physical layout of the filesystem
+.SH SYNOPSIS
+.br
+.B #include <sys/ioctl.h>
+.br
+.B #include <linux/fs.h>
+.br
+.B #include <linux/fsmap.h>
+.sp
+.BI "int ioctl(int " fd ", FS_IOC_GETFSMAP, struct fsmap_head * " arg );
+.SH DESCRIPTION
+This
+.BR ioctl (2)
+retrieves physical extent mappings for a filesystem.
+This information can be used to discover which files are mapped to a physical
+block, examine free space, or find known bad blocks, among other things.
+
+The sole argument to this ioctl should be a pointer to a single
+.BR "struct fsmap_head" ":"
+.in +4n
+.nf
+
+struct fsmap {
+	__u32		fmr_device;	/* device id */
+	__u32		fmr_flags;	/* mapping flags */
+	__u64		fmr_physical;	/* device offset of segment */
+	__u64		fmr_owner;	/* owner id */
+	__u64		fmr_offset;	/* file offset of segment */
+	__u64		fmr_length;	/* length of segment */
+	__u64		fmr_reserved[3];	/* must be zero */
+};
+
+struct fsmap_head {
+	__u32		fmh_iflags;	/* control flags */
+	__u32		fmh_oflags;	/* output flags */
+	__u32		fmh_count;	/* # of entries in array incl. input */
+	__u32		fmh_entries;	/* # of entries filled in (output). */
+	__u64		fmh_reserved[6];	/* must be zero */
+
+	struct fsmap	fmh_keys[2];	/* low and high keys for the mapping search */
+	struct fsmap	fmh_recs[];	/* returned records */
+};
+
+.fi
+.in
+The two
+.I fmh_keys
+array elements specify the lowest and highest reverse-mapping
+keys, respectively, for which userspace would like physical mapping
+information.
+A reverse mapping key consists of the tuple (device, block, owner, offset).
+The owner and offset fields are part of the key because some filesystems
+support sharing physical blocks between multiple files and
+therefore may return multiple mappings for a given physical block.
+.PP
+Filesystem mappings are copied into the
+.I fmh_recs
+array, which immediately follows the header data.
+.SS Fields of struct fsmap_head
+.PP
+The
+.I fmh_iflags
+field is a bitmask passed to the kernel to alter the output.
+There are no flags defined, so this value must be zero.
+
+.PP
+The
+.I fmh_oflags
+field is a bitmask of flags that concern all output mappings.
+If
+.B FMH_OF_DEV_T
+is set, then the
+.I fmr_device
+field represents a
+.B dev_t
+structure containing the major and minor numbers of the block device.
+
+.PP
+The
+.I fmh_count
+field contains the number of elements in the array being passed to the
+kernel.
+If this value is 0,
+.I fmh_entries
+will be set to the number of records that would have been returned had
+the array been large enough;
+no mapping information will be returned.
+
+.PP
+The
+.I fmh_entries
+field contains the number of elements in the
+.I fmh_recs
+array that contain useful information.
+
+.PP
+The
+.I fmh_reserved
+fields must be set to zero.
+
+.SS Keys
+.PP
+The two key records in
+.B fsmap_head.fmh_keys
+specify the lowest and highest extent records in the keyspace that the caller
+wants returned.
+A filesystem that can share blocks between files likely requires the tuple
+.RI "(" "device" ", " "physical" ", " "owner" ", " "offset" ", " "flags" ")"
+to uniquely index any filesystem mapping record.
+Classic non-sharing filesystems might be able to identify any record with only
+.RI "(" "device" ", " "physical" ", " "flags" ")."
+For example, if the low key is set to (8:0, 36864, 0, 0, 0), the filesystem will
+only return records for extents starting at or above 36KiB on disk.
+If the high key is set to (8:0, 1048576, 0, 0, 0), only records below 1MiB will
+be returned.
+The format of
+.B fmr_device
+in the keys must match the format of the same field in the output records,
+as defined below.
+By convention, the field
+.B fsmap_head.fmh_keys[0]
+must contain the low key and
+.B fsmap_head.fmh_keys[1]
+must contain the high key for the request.
+.PP
+For convenience, if
+.B fmr_length
+is set in the low key, it will be added to
+.IR fmr_physical " or " fmr_offset
+as appropriate.
+The caller can take advantage of this subtlety to set up subsequent calls
+by copying
+.B fsmap_head.fmh_recs[fsmap_head.fmh_entries - 1]
+into the low key.
+The function
+.B fsmap_advance
+provides this functionality.
+
+.SS Fields of struct fsmap
+.PP
+The
+.I fmr_device
+field uniquely identifies the underlying storage device.
+If the
+.B FMH_OF_DEV_T
+flag is set in the header's
+.I fmh_oflags
+field, this field contains a
+.B dev_t
+from which major and minor numbers can be extracted.
+If the flag is not set, this field contains a value that must be unique
+for each unique storage device.
+
+.PP
+The
+.I fmr_physical
+field contains the disk address of the extent in bytes.
+
+.PP
+The
+.I fmr_owner
+field contains the owner of the extent.
+This is an inode number unless
+.B FMR_OF_SPECIAL_OWNER
+is set in the
+.I fmr_flags
+field, in which case the value is determined by the filesystem.
+See the section below about special owner values for more details.
+
+.PP
+The
+.I fmr_offset
+field contains the logical address in the mapping record in bytes.
+This field has no meaning if the
+.BR FMR_OF_SPECIAL_OWNER " or " FMR_OF_EXTENT_MAP
+flags are set in
+.IR fmr_flags "."
+
+.PP
+The
+.I fmr_length
+field contains the length of the extent in bytes.
+
+.PP
+The
+.I fmr_flags
+field is a bitmask of extent state flags.
+The bits are:
+.RS 0.4i
+.TP
+.B FMR_OF_PREALLOC
+The extent is allocated but not yet written.
+.TP
+.B FMR_OF_ATTR_FORK
+This extent contains extended attribute data.
+.TP
+.B FMR_OF_EXTENT_MAP
+This extent contains extent map information for the owner.
+.TP
+.B FMR_OF_SHARED
+Parts of this extent may be shared.
+.TP
+.B FMR_OF_SPECIAL_OWNER
+The
+.I fmr_owner
+field contains a special value instead of an inode number.
+.TP
+.B FMR_OF_LAST
+This is the last record in the filesystem.
+.RE
+
+.PP
+The
+.I fmr_reserved
+field will be set to zero.
+
+.SS Special Owner Values
+The following special owner values are generic to all filesystems:
+.RS 0.4i
+.TP
+.B FMR_OWN_FREE
+Free space.
+.TP
+.B FMR_OWN_UNKNOWN
+This extent is in use but its owner is not known.
+.TP
+.B FMR_OWN_METADATA
+This extent is filesystem metadata.
+.RE
+
+XFS can return the following special owner values:
+.RS 0.4i
+.TP
+.B XFS_FMR_OWN_FREE
+Free space.
+.TP
+.B XFS_FMR_OWN_UNKNOWN
+This extent is in use but its owner is not known.
+.TP
+.B XFS_FMR_OWN_FS
+Static filesystem metadata which exists at a fixed address.
+These are the AG superblock, the AGF, the AGFL, and the AGI headers.
+.TP
+.B XFS_FMR_OWN_LOG
+The filesystem journal.
+.TP
+.B XFS_FMR_OWN_AG
+Allocation group metadata, such as the free space btrees and the
+reverse mapping btrees.
+.TP
+.B XFS_FMR_OWN_INOBT
+The inode and free inode btrees.
+.TP
+.B XFS_FMR_OWN_INODES
+Inode records.
+.TP
+.B XFS_FMR_OWN_REFC
+Reference count information.
+.TP
+.B XFS_FMR_OWN_COW
+This extent is being used to stage a copy-on-write.
+.TP
+.B XFS_FMR_OWN_DEFECTIVE
+This extent has been marked defective either by the filesystem or the
+underlying device.
+.RE
+
+ext4 can return the following special owner values:
+.RS 0.4i
+.TP
+.B EXT4_FMR_OWN_FREE
+Free space.
+.TP
+.B EXT4_FMR_OWN_UNKNOWN
+This extent is in use but its owner is not known.
+.TP
+.B EXT4_FMR_OWN_FS
+Static filesystem metadata which exists at a fixed address.
+This is the superblock and the group descriptors.
+.TP
+.B EXT4_FMR_OWN_LOG
+The filesystem journal.
+.TP
+.B EXT4_FMR_OWN_INODES
+Inode records.
+.TP
+.B EXT4_FMR_OWN_BLKBM
+Block bitmap.
+.TP
+.B EXT4_FMR_OWN_INOBM
+Inode bitmap.
+.RE
+
+.SH RETURN VALUE
+On error, \-1 is returned, and
+.I errno
+is set to indicate the error.
+.PP
+.SH ERRORS
+Error codes can be one of, but are not limited to, the following:
+.TP
+.B EINVAL
+The array is not long enough, or a non-zero value was passed in one of the
+fields that must be zero.
+.TP
+.B EFAULT
+The pointer passed in was not mapped to a valid memory address.
+.TP
+.B EBADF
+.IR fd
+is not open for reading.
+.TP
+.B EPERM
+This query is not allowed; whether a query is permitted is left to the filesystem implementation.
+.TP
+.B EOPNOTSUPP
+The filesystem does not support this command.
+.TP
+.B EUCLEAN
+The filesystem metadata is corrupt and needs repair.
+.TP
+.B EBADMSG
+The filesystem has detected a checksum error in the metadata.
+.TP
+.B ENOMEM
+Insufficient memory to process the request.
+
+.SH EXAMPLE
+.PP
+Please see io/fsmap.c in the xfsprogs distribution for a sample program.
+
+.SH CONFORMING TO
+This API is Linux-specific.
+Not all filesystems support it.
+.fi
+.in
+.SH SEE ALSO
+.BR ioctl (2)
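
As a rough illustration of the interface documented in the patch above (and
not a substitute for the xfsprogs sample that the EXAMPLE section points to),
the sketch below sets up a whole-filesystem query and uses the documented
fmh_count == 0 behaviour to ask how many records would be returned.  The
helper names getfsmap_alloc() and getfsmap_count() are hypothetical, and the
all-ones high key is an assumption about how to cover the entire keyspace;
the man page text itself does not spell that out.  It also assumes the uapi
header <linux/fsmap.h> is installed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fsmap.h>

/*
 * Hypothetical helper: allocate a zeroed fsmap_head with room for nr
 * records and keys spanning the whole keyspace.  The all-ones high key
 * is an assumption made for this sketch.
 */
static struct fsmap_head *getfsmap_alloc(unsigned int nr)
{
	struct fsmap_head *head;

	/* calloc() also zeroes the fields documented as "must be zero". */
	head = calloc(1, sizeof(*head) + (size_t)nr * sizeof(struct fsmap));
	if (head == NULL)
		return NULL;

	head->fmh_count = nr;
	/* fmh_keys[0], the low key, stays all zeroes. */
	head->fmh_keys[1].fmr_device = UINT32_MAX;
	head->fmh_keys[1].fmr_flags = UINT32_MAX;
	head->fmh_keys[1].fmr_physical = UINT64_MAX;
	head->fmh_keys[1].fmr_owner = UINT64_MAX;
	head->fmh_keys[1].fmr_offset = UINT64_MAX;
	return head;
}

/*
 * Hypothetical helper: ask how many records a whole-filesystem query
 * would return by passing fmh_count == 0.  Returns 0 and stores the
 * count in *nr on success, or -1 with errno set on failure.
 */
static int getfsmap_count(int fd, unsigned int *nr)
{
	struct fsmap_head *head;
	int ret, saved_errno;

	assert(nr != NULL);

	head = getfsmap_alloc(0);
	if (head == NULL)
		return -1;

	ret = ioctl(fd, FS_IOC_GETFSMAP, head);
	saved_errno = errno;
	if (ret == 0)
		*nr = head->fmh_entries;

	free(head);
	errno = saved_errno;	/* keep the ioctl's errno for the caller */
	return ret;
}
```

A caller would presumably follow this with getfsmap_alloc(nr), repeat the
ioctl, and walk fmh_recs[0] through fmh_recs[fmh_entries - 1].  Since the
filesystem can change between the two calls, the count is best treated as a
hint rather than a guarantee.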