[v2,2/4] ext4: Add XIP functionality

Message ID 1386273769-12828-3-git-send-email-ross.zwisler@linux.intel.com
State Superseded, archived
Commit Message

Ross Zwisler Dec. 5, 2013, 8:02 p.m. UTC
This is a port of the XIP functionality found in the current version of
ext2.  This patch set is intended to achieve feature parity with XIP in
ext2 rather than non-XIP in ext4.  In particular, it lacks support for
splice and AIO.  We'll be submitting patches in the future to add that
functionality, but we think this is a good start.

The motivation behind this work is that we believe that the XIP feature
will begin to find new uses as various persistent memory devices and
technologies come on to the market.  Having direct, byte-addressable
access to persistent memory without having an additional copy in the
page cache can be a win in terms of I/O latency and overall memory
usage.

This patch applies cleanly to v3.13-rc2, and was tested using brd as our
block driver.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
---
 Documentation/filesystems/ext4.txt |    2 +
 Documentation/filesystems/xip.txt  |    3 ++
 fs/Kconfig                         |    2 +-
 fs/ext4/Kconfig                    |   11 +++++
 fs/ext4/Makefile                   |    1 +
 fs/ext4/ext4.h                     |    2 +
 fs/ext4/file.c                     |   17 ++++++++
 fs/ext4/inode.c                    |   42 +++++++++++++++++--
 fs/ext4/namei.c                    |   11 ++++-
 fs/ext4/super.c                    |   36 ++++++++++++++++-
 fs/ext4/xip.c                      |   78 ++++++++++++++++++++++++++++++++++++
 fs/ext4/xip.h                      |   24 +++++++++++
 12 files changed, 221 insertions(+), 8 deletions(-)
 create mode 100644 fs/ext4/xip.c
 create mode 100644 fs/ext4/xip.h

Comments

Dave Chinner Dec. 6, 2013, 3:13 a.m. UTC | #1
On Thu, Dec 05, 2013 at 01:02:46PM -0700, Ross Zwisler wrote:
> This is a port of the XIP functionality found in the current version of
> ext2.  This patch set is intended to achieve feature parity with XIP in
> ext2 rather than non-XIP in ext4.  In particular, it lacks support for
> splice and AIO.  We'll be submitting patches in the future to add that
> functionality, but we think this is a good start.
> 
> The motivation behind this work is that we believe that the XIP feature
> will begin to find new uses as various persistent memory devices and
> technologies come on to the market.  Having direct, byte-addressable
> access to persistent memory without having an additional copy in the
> page cache can be a win in terms of I/O latency and overall memory
> usage.
> 
> This patch applies cleanly to v3.13-rc2, and was tested using brd as our
> block driver.

I think I see a significant problem here with XIP write support:
unwritten extents.

xip_file_write() has no concept of post IO completion processing -
it assumes that all that is necessary is to memcpy() the data into
the backing memory obtained by ->get_xip_mem(), and that's all it
needs to do.

For ext4 (and other filesystems that use unwritten extents) they
need a callback - normally done from bio completion - to run
transactions to convert extent status from unwritten to written, or
run other post-IO completion operations.

I don't see any hooks into ext4 to turn off preallocation (e.g.
fallocate is explicitly hooked up for XIP) when XIP is in use, so I
can't see how XIP can work with such filesystem requirements without
further infrastructure being added. i.e. bypassing the need for the
page cache does not remove the need for post-IO completion
notification to the filesystem....

Indeed, for making filesystems like XFS be able to use XIP, we're
going to need such facilities to be provided by the XIP
infrastructure....
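
To make the failure mode concrete, here is a minimal userspace sketch
(not kernel code, and not part of the patch).  The extent states,
struct extent and every model_*/fs_* name are hypothetical.  It only
shows that data can land in the backing memory while reads keep
returning zeros, because nothing converted the extent after the
memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical extent states; a real filesystem keeps these in its extent tree. */
enum extent_state { EXTENT_UNWRITTEN, EXTENT_WRITTEN };

struct extent {
    enum extent_state state;
    char data[64];              /* stands in for the persistent-memory backing store */
};

/* Model of an XIP-style write: a plain memcpy() into backing memory.
 * Nothing runs afterwards, so the extent state is left untouched. */
static size_t model_xip_write(struct extent *ext, const char *buf, size_t len)
{
    assert(len <= sizeof(ext->data));
    memcpy(ext->data, buf, len);
    return len;
}

/* The step a post-IO completion would normally perform. */
static void fs_convert_unwritten(struct extent *ext)
{
    ext->state = EXTENT_WRITTEN;
}

/* Reads of an unwritten extent return zeros, whatever the memory holds. */
static size_t model_read(const struct extent *ext, char *buf, size_t len)
{
    assert(len <= sizeof(ext->data));
    if (ext->state == EXTENT_UNWRITTEN)
        memset(buf, 0, len);
    else
        memcpy(buf, ext->data, len);
    return len;
}
```

Calling model_xip_write() and then model_read() on an unwritten extent
yields zeros; the written data only becomes visible once
fs_convert_unwritten() has run.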

Cheers,

Dave.
Matthew Wilcox Dec. 6, 2013, 4:07 a.m. UTC | #2
On Fri, Dec 06, 2013 at 02:13:54PM +1100, Dave Chinner wrote:
> I think I see a significant problem here with XIP write support:
> unwritten extents.
> 
> xip_file_write() has no concept of post IO completion processing -
> it assumes that all that is necessary is to memcpy() the data into
> the backing memory obtained by ->get_xip_mem(), and that's all it
> needs to do.
> 
> For ext4 (and other filesystems that use unwritten extents) they
> need a callback - normally done from bio completion - to run
> transactions to convert extent status from unwritten to written, or
> run other post-IO completion operations.
> 
> I don't see any hooks into ext4 to turn off preallocation (e.g.
> fallocate is explicitly hooked up for XIP) when XIP is in use, so I
> can't see how XIP can work with such filesystem requirements without
> further infrastructure being added. i.e. bypassing the need for the
> page cache does not remove the need for post-IO completion
> notification to the filesystem....

The two are mutually exclusive:

        if (ext4_use_xip(inode->i_sb))
                inode->i_mapping->a_ops = &ext4_xip_aops;
        else if (test_opt(inode->i_sb, DELALLOC))
                inode->i_mapping->a_ops = &ext4_da_aops;
        else
                inode->i_mapping->a_ops = &ext4_aops;

Is it worth implementing delayed allocation support on top of XIP?  Indeed,
what would that *mean*?  Assuming that the backing store is close to DRAM
speeds, we don't want to cache in DRAM first, then copy to the backing
store, we just want to write to the backing store.

> Indeed, for making filesystems like XFS be able to use XIP, we're
> going to need such facilities to be provided by the XIP
> infrastructure....

I have a patch in my development tree right now which changes the
create argument to get_xip_mem into a flags argument, with 'GXM_CREATE'
and 'GXM_HINT' as the first two flags.  Adding a GXM_ALLOC flag would
presumably be enough of a hint to the filesystem that it's time to commit
this range to disk.  Admittedly, it's pre-write and not post-write,
but does that matter when the write is a memcpy?  I must admit to not
quite understanding all 100k+ lines of XFS, so maybe you really do need
to know when the memcpy has finished.

I also don't see a problem with the filesystem either having a wrapper
around xip_file_write or providing its own entire implementation of
->write.  Equally, I'm sure we could add some other callback in, say,
address_space_operations that the XIP code could call after the memcpy
if that's what XFS needs.
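
A rough userspace sketch of how such a flags argument might behave.
The flag names are the ones mentioned above; the bit values, the block
states and model_get_xip_mem() are assumptions made for illustration,
and the semantics of GXM_HINT are not described in this thread, so it
is left unused.

```c
#include <assert.h>
#include <stddef.h>

/* Flag names are from the message above; the bit values are assumptions. */
#define GXM_CREATE (1u << 0)   /* allocate backing store if the range is a hole */
#define GXM_HINT   (1u << 1)   /* semantics not described in the thread; unused here */
#define GXM_ALLOC  (1u << 2)   /* proposed: commit the range before the write */

/* Hypothetical per-block state. */
enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_WRITTEN };

/* Hypothetical stand-in for a filesystem's get_xip_mem().  Returns 0 on
 * success, or -1 if the block is a hole and creation was not requested.
 * Assumption: a freshly created block starts out unwritten. */
static int model_get_xip_mem(enum blk_state *state, unsigned int flags)
{
    assert(state != NULL);
    if (*state == BLK_HOLE) {
        if (!(flags & GXM_CREATE))
            return -1;
        *state = BLK_UNWRITTEN;
    }
    /* Conversion happens here, before the caller's memcpy(). */
    if (*state == BLK_UNWRITTEN && (flags & GXM_ALLOC))
        *state = BLK_WRITTEN;
    return 0;
}
```

Note that the conversion in this model happens before the copy, which
is exactly the pre-write versus post-write caveat raised above.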
Dave Chinner Dec. 6, 2013, 5:28 a.m. UTC | #3
On Thu, Dec 05, 2013 at 09:07:22PM -0700, Matthew Wilcox wrote:
> On Fri, Dec 06, 2013 at 02:13:54PM +1100, Dave Chinner wrote:
> > I think I see a significant problem here with XIP write support:
> > unwritten extents.
> > 
> > xip_file_write() has no concept of post IO completion processing -
> > it assumes that all that is necessary is to memcpy() the data into
> > the backing memory obtained by ->get_xip_mem(), and that's all it
> > needs to do.
> > 
> > For ext4 (and other filesystems that use unwritten extents) they
> > need a callback - normally done from bio completion - to run
> > transactions to convert extent status from unwritten to written, or
> > run other post-IO completion operations.
> > 
> > I don't see any hooks into ext4 to turn off preallocation (e.g.
> > fallocate is explicitly hooked up for XIP) when XIP is in use, so I
> > can't see how XIP can work with such filesystem requirements without
> > further infrastructure being added. i.e. bypassing the need for the
> > page cache does not remove the need for post-IO completion
> > notification to the filesystem....
> 
> The two are mutually exclusive:
> 
>         if (ext4_use_xip(inode->i_sb))
>                 inode->i_mapping->a_ops = &ext4_xip_aops;
>         else if (test_opt(inode->i_sb, DELALLOC))
>                 inode->i_mapping->a_ops = &ext4_da_aops;
>         else
>                 inode->i_mapping->a_ops = &ext4_aops;
> 
> Is it worth implementing delayed allocation support on top of XIP?

That's delayed allocation, not preallocation and unwritten extents.

> Indeed,
> what would that *mean*?  Assuming that the backing store is close to DRAM
> speeds, we don't want to cache in DRAM first, then copy to the backing
> store, we just want to write to the backing store.

Just because retrieving data is fast, it doesn't mean we can just
fragment the shit out of the block mapping. A GB file made up of 4k
chunks is going to be much, much slower to work with than a GB file
that can be mapped into a single TLB entry....
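
For a sense of scale, a small worked calculation, assuming the x86-64
page sizes of 4 KiB, 2 MiB and 1 GiB.  mappings_needed() is a
hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Number of mappings needed to cover file_size bytes using pages of
 * page_size bytes, rounding up. */
static uint64_t mappings_needed(uint64_t file_size, uint64_t page_size)
{
    assert(page_size > 0);
    return (file_size + page_size - 1) / page_size;
}
```

For a 1 GiB file this gives 262144 mappings at 4 KiB, 512 at 2 MiB, and
1 at 1 GiB, and the larger sizes are only usable if the file's blocks
are contiguous and suitably aligned.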

> > Indeed, for making filesystems like XFS be able to use XIP, we're
> > going to need such facilities to be provided by the XIP
> > infrastructure....
> 
> I have a patch in my development tree right now which changes the
> create argument to get_xip_mem into a flags argument, with 'GXM_CREATE'
> and 'GXM_HINT' as the first two flags.  Adding a GXM_ALLOC flag would
> presumably be enough of a hint to the filesystem that it's time to commit
> this range to disk.  Admittedly, it's pre-write and not post-write,
> but does that matter when the write is a memcpy?  I must admit to not
> quite understanding all 100k+ lines of XFS, so maybe you really do need
> to know when the memcpy has finished.

If you want an idea of how to do generic allocation, go back and
look at the discussion that Nick Piggin and I had years ago about
generic multi-page writes, and what a filesystem requires in terms
of transactional and write failure guarantees. It isn't simple - it
involves a reserve/commit/undo style of interface.

In fact, I think it would probably map to XIP usage just as well as
for multi-page writes through the page cache....
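
The thread does not spell that interface out, so the following is only
a guess at its general shape, written as a userspace model.  The
struct and the write_reserve()/write_commit()/write_undo() names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical space accounting for one file. */
struct space_acct {
    size_t free_blocks;      /* not yet promised to anyone */
    size_t reserved_blocks;  /* promised to in-flight writes */
    size_t used_blocks;      /* committed to the file */
};

/* Reserve: fail early, before any data is copied. */
static int write_reserve(struct space_acct *a, size_t nblocks)
{
    if (a->free_blocks < nblocks)
        return -1;
    a->free_blocks -= nblocks;
    a->reserved_blocks += nblocks;
    return 0;
}

/* Commit: the copy succeeded for ncopied blocks, so make them permanent. */
static void write_commit(struct space_acct *a, size_t ncopied)
{
    assert(ncopied <= a->reserved_blocks);
    a->reserved_blocks -= ncopied;
    a->used_blocks += ncopied;
}

/* Undo: hand back the part of the reservation that was never committed. */
static void write_undo(struct space_acct *a, size_t nblocks)
{
    assert(nblocks <= a->reserved_blocks);
    a->reserved_blocks -= nblocks;
    a->free_blocks += nblocks;
}
```

A short write then becomes reserve(8), commit(5), undo(3): the caller
never sees a failure after data has already been copied, and no
reserved space leaks.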

> I also don't see a problem with the filesystem either having a wrapper
> around xip_file_write or providing its own entire implementation of
> ->write.  Equally, I'm sure we could add some other callback in, say,
> address_space_operations that the XIP code could call after the memcpy
> if that's what XFS needs.

I suspect that we shouldn't even attempt to use a generic
implementation at first - do what is necessary for the different
filesystems, then try to work out common infrastructure....

Cheers,

Dave.
Andreas Dilger Dec. 6, 2013, 8:58 p.m. UTC | #4
On 2013/12/05 8:13 PM, "Dave Chinner" <david@fromorbit.com> wrote:

>On Thu, Dec 05, 2013 at 01:02:46PM -0700, Ross Zwisler wrote:
>> This is a port of the XIP functionality found in the current version of
>> ext2.  This patch set is intended to achieve feature parity with XIP in
>> ext2 rather than non-XIP in ext4.  In particular, it lacks support for
>> splice and AIO.  We'll be submitting patches in the future to add that
>> functionality, but we think this is a good start.
>> 
>> The motivation behind this work is that we believe that the XIP feature
>> will begin to find new uses as various persistent memory devices and
>> technologies come on to the market.  Having direct, byte-addressable
>> access to persistent memory without having an additional copy in the
>> page cache can be a win in terms of I/O latency and overall memory
>> usage.
>> 
>> This patch applies cleanly to v3.13-rc2, and was tested using brd as our
>> block driver.
>
>I think I see a significant problem here with XIP write support:
>unwritten extents.
>
>xip_file_write() has no concept of post IO completion processing -
>it assumes that all that is necessary is to memcpy() the data into
>the backing memory obtained by ->get_xip_mem(), and that's all it
>needs to do.
>
>For ext4 (and other filesystems that use unwritten extents) they
>need a callback - normally done from bio completion - to run
>transactions to convert extent status from unwritten to written, or
>run other post-IO completion operations.
>
>I don't see any hooks into ext4 to turn off preallocation (e.g.
>fallocate is explicitly hooked up for XIP) when XIP is in use, so I
>can't see how XIP can work with such filesystem requirements without
>further infrastructure being added. i.e. bypassing the need for the
>page cache does not remove the need for post-IO completion
>notification to the filesystem...

In the short term (at least until it is possible to convert the extent
after it is modified) should it be an error to try and map an unwritten
extent?  That would still allow the ext4 XIP patch to land and be safely
used for regular files while the mechanism for doing the conversion is
worked out.

>Indeed, for making filesystems like XFS be able to use XIP, we're
>going to need such facilities to be provided by the XIP
>infrastructure....
>
>Cheers,
>
>Dave.
>-- 
>Dave Chinner
>david@fromorbit.com
>


Cheers, Andreas
Ross Zwisler Dec. 9, 2013, 3:16 a.m. UTC | #5
On Fri, 2013-12-06 at 14:13 +1100, Dave Chinner wrote:
> On Thu, Dec 05, 2013 at 01:02:46PM -0700, Ross Zwisler wrote:
> > This is a port of the XIP functionality found in the current version of
> > ext2.  This patch set is intended to achieve feature parity with XIP in
> > ext2 rather than non-XIP in ext4.  In particular, it lacks support for
> > splice and AIO.  We'll be submitting patches in the future to add that
> > functionality, but we think this is a good start.
> > 
> > The motivation behind this work is that we believe that the XIP feature
> > will begin to find new uses as various persistent memory devices and
> > technologies come on to the market.  Having direct, byte-addressable
> > access to persistent memory without having an additional copy in the
> > page cache can be a win in terms of I/O latency and overall memory
> > usage.
> > 
> > This patch applies cleanly to v3.13-rc2, and was tested using brd as our
> > block driver.
> 
> I think I see a significant problem here with XIP write support:
> unwritten extents.
> 
> xip_file_write() has no concept of post IO completion processing -
> it assumes that all that is necessary is to memcpy() the data into
> the backing memory obtained by ->get_xip_mem(), and that's all it
> needs to do.
> 
> For ext4 (and other filesystems that use unwritten extents) they
> need a callback - normally done from bio completion - to run
> transactions to convert extent status from unwritten to written, or
> run other post-IO completion operations.
> 
> I don't see any hooks into ext4 to turn off preallocation (e.g.
> fallocate is explicitly hooked up for XIP) when XIP is in use, so I
> can't see how XIP can work with such filesystem requirements without
> further infrastructure being added. i.e. bypassing the need for the
> page cache does not remove the need for post-IO completion
> notification to the filesystem....
> 
> Indeed, for making filesystems like XFS be able to use XIP, we're
> going to need such facilities to be provided by the XIP
> infrastructure....
> 
> Cheers,
> 
> Dave.

Hi Dave,

You're absolutely correct, unwritten extents are an issue that was
overlooked.  Thank you very much for pointing this out!

My best guess on how to fix this (as proposed by Matthew) is to wrap the
generic code in ext4 specific code that deals with unwritten extents.

For writes, I think that we need to potentially split the unwritten
extent into up to three extents (two unwritten, one written), in the
spirit of the ext4_split_unwritten_extents().

For reads, I think we will probably have to zero the extent, mark it as
written, and then return the data normally.

For mmap, we can probably add code to the page fault handler which will
zero the unwritten extent and mark it as written, similar to what is
done for read.

My hope is that we can do this all inline in the XIP wrappers for ext4,
and avoid having to deal with callbacks.

Does this all sound generally correct?  I'll start work on an example 
implementation.

Regarding fragmentation on XIP, yep, this is also an issue, but one I
was hoping to address in a future patch set.

Thanks,
- Ross


Dave Chinner Dec. 9, 2013, 8:19 a.m. UTC | #6
On Sun, Dec 08, 2013 at 08:16:04PM -0700, Ross Zwisler wrote:
> On Fri, 2013-12-06 at 14:13 +1100, Dave Chinner wrote:
> > On Thu, Dec 05, 2013 at 01:02:46PM -0700, Ross Zwisler wrote:
> > > This is a port of the XIP functionality found in the current version of
> > > ext2.  This patch set is intended to achieve feature parity with XIP in
> > > ext2 rather than non-XIP in ext4.  In particular, it lacks support for
> > > splice and AIO.  We'll be submitting patches in the future to add that
> > > functionality, but we think this is a good start.
> > > 
> > > The motivation behind this work is that we believe that the XIP feature
> > > will begin to find new uses as various persistent memory devices and
> > > technologies come on to the market.  Having direct, byte-addressable
> > > access to persistent memory without having an additional copy in the
> > > page cache can be a win in terms of I/O latency and overall memory
> > > usage.
> > > 
> > > This patch applies cleanly to v3.13-rc2, and was tested using brd as our
> > > block driver.
> > 
> > I think I see a significant problem here with XIP write support:
> > unwritten extents.
> > 
> > xip_file_write() has no concept of post IO completion processing -
> > it assumes that all that is necessary is to memcpy() the data into
> > the backing memory obtained by ->get_xip_mem(), and that's all it
> > needs to do.
> > 
> > For ext4 (and other filesystems that use unwritten extents) they
> > need a callback - normally done from bio completion - to run
> > transactions to convert extent status from unwritten to written, or
> > run other post-IO completion operations.
> > 
> > I don't see any hooks into ext4 to turn off preallocation (e.g.
> > fallocate is explicitly hooked up for XIP) when XIP is in use, so I
> > can't see how XIP can work with such filesystem requirements without
> > further infrastructure being added. i.e. bypassing the need for the
> > page cache does not remove the need for post-IO completion
> > notification to the filesystem....
> > 
> > Indeed, for making filesystems like XFS be able to use XIP, we're
> > going to need such facilities to be provided by the XIP
> > infrastructure....
> > 
> > Cheers,
> > 
> > Dave.
> 
> Hi Dave,
> 
> You're absolutely correct, unwritten extents are an issue that was
> overlooked.  Thank you very much for pointing this out!
> 
> My best guess on how to fix this (as proposed by Matthew) is to wrap the
> generic code in ext4 specific code that deals with unwritten extents.

I completely disagree.

We already have a generic method in the filesystems for handling
post-IO completion processing, and we most definitely do not want to
have to implement it again in every filesystem that wants to support
XIP.

Set up the generic XIP infrastructure in a way that allows the
filesystem to set up post-IO callbacks at submission time and call
them on IO completion.  We manage to do this for both buffered data
IO and direct IO, and I don't see how XIP IO is any different from
this perspective. XIP still costs time and latency to execute, and
if we start to think about hardware offload of large memcpy()s (say
like the SGI Altix machines could do years ago) asynchronous
processing in the XIP IO path is quite likely to be used in the
near future.

So, it's pretty clear to me that XIP needs to look like a normal IO
path from a filesystem perspective - it's not necessarily
synchronous, we need concurrent read and write support (i.e. the
equivalent of current direct IO capabilities on XFS where we can
already do millions of mixed read and write IOPS to the same file
on a ram based block device), and so on. XIP doesn't fundamentally
change the way filesystems work, and so we should be treating XIP in
a similar fashion to how we treat buffered and direct IO.

Indeed, the direct IO model is probably the best one to use here -
it allows the filesystem to attach its own private data structure
to the kiocb, and it gets an IO completion callback with the kiocb,
the offset and size of the IO, and we can pull the filesystem
private data off the iocb and then pass it into existing normal IO
completion paths.
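
A minimal userspace sketch of that pattern.  struct model_iocb is a
simplified stand-in for the kernel's struct kiocb, and struct fs_ioend,
fs_end_io(), generic_xip_write() and fs_xip_write() are hypothetical
names; none of this is code from the patch or from any filesystem.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct model_iocb;
typedef void (*end_io_fn)(struct model_iocb *iocb, size_t offset, size_t size);

/* Simplified stand-in for struct kiocb. */
struct model_iocb {
    void *fs_private;    /* filesystem-owned data attached at submission */
    end_io_fn end_io;    /* completion callback chosen by the filesystem */
};

/* Hypothetical per-IO filesystem state, playing the role of an ioend. */
struct fs_ioend {
    size_t converted_offset;
    size_t converted_size;
    int completed;
};

/* Filesystem completion: pull the private data off the iocb and record
 * the range that would be converted from unwritten to written. */
static void fs_end_io(struct model_iocb *iocb, size_t offset, size_t size)
{
    struct fs_ioend *ioend = iocb->fs_private;

    ioend->converted_offset = offset;
    ioend->converted_size = size;
    ioend->completed = 1;
    iocb->fs_private = NULL;
}

/* Generic layer: copy the data, then notify whoever set up the iocb. */
static size_t generic_xip_write(struct model_iocb *iocb, char *backing,
                                size_t backing_len, size_t offset,
                                const char *buf, size_t size)
{
    assert(offset <= backing_len && size <= backing_len - offset);
    memcpy(backing + offset, buf, size);
    if (iocb->end_io)
        iocb->end_io(iocb, offset, size);
    return size;
}

/* Filesystem entry point: attach private state, then call the generic code. */
static size_t fs_xip_write(char *backing, size_t backing_len, size_t offset,
                           const char *buf, size_t size, struct fs_ioend *ioend)
{
    struct model_iocb iocb = { .fs_private = ioend, .end_io = fs_end_io };

    return generic_xip_write(&iocb, backing, backing_len, offset, buf, size);
}
```

The generic code knows nothing about extents; it only promises to call
end_io with the offset and size once the copy is done, which is also
the natural place to defer to if the copy ever becomes asynchronous.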

> For writes, I think that we need to potentially split the unwritten
> extent into up to three extents (two unwritten, one written), in the
> spirit of the ext4_split_unwritten_extents().

You don't need to touch anything that deep in ext4 to make this
work. What you need to do is make the XIP infrastructure allow ext4
to track its own IO (as it already does for direct IO) and call
ext4_put_io_end() appropriately on IO completion. XFS will use
exactly the same mechanism, so will btrfs and every other filesystem
we might want to add support for XIP to...

> For reads, I think we will probably have to zero the extent, mark it as
> written, and then return the data normally.

Right now we have a "buffer_unwritten(bh)" flag that makes all the
code treat it like a hole. You don't need to convert it to written
until someone actually writes to it - all you need to do is
guarantee reads return zero for that page. IOWs, for users of
read(2) system calls, you can just zero their pages if the
underlying region spans a hole or unwritten extent.

Again, this is infrastructure we already have in the page cache - we
should not be using a different mechanism for XIP.

> For mmap, we can probably add code to the page fault handler which will
> zero the unwritten extent and mark it as written, similar to what is
> done for read.

Have you looked at how ->page_mkwrite handles the first page
fault into an unwritten region? Both XFS and ext4 end up in
__block_write_begin() with a map that says buffer_unwritten(), so it
zeros the page and marks it dirty.

So, at the completion of page_mkwrite, the page is zeroed but still
marked unwritten, so what XIP needs to do is then run an IO
completion....

> My hope is that we can do this all inline in the XIP wrappers for ext4,
> and avoid having to deal with callbacks.

We need to solve these problems by providing generic infrastructure
that executes existing code that handles these problems, not layer
on hacks to make a single filesystem work.

> Does this all sound generally correct?  I'll start work on an example 
> implementation.

IMO, no.

> Regarding fragmentation on XIP, yep, this is also an issue, but one I
> was hoping to address in a future patch set.

XFS has already solved that problem - it has the ability
to set a file's allocation granularity (so you can match it to the
page sizes supported by the machine) and all allocations get aligned
and sized to that hint. It even turns off delayed allocation, which
makes it perfect for XIP, and it is inheritable from the parent
directory so it can be a "set at mkfs time and forget" configuration
item. But, it requires unwritten extent support and IO completions
to work, so we need XIP to support this infrastructure.
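
As a rough model of what "aligned and sized to that hint" means
arithmetically (this is not XFS's allocator; struct range and
align_to_hint() are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

struct range {
    uint64_t start;
    uint64_t len;
};

/* Expand [offset, offset + len) outward so that both ends fall on
 * multiples of hint bytes. */
static struct range align_to_hint(uint64_t offset, uint64_t len, uint64_t hint)
{
    assert(hint > 0 && len > 0);

    uint64_t start = offset - (offset % hint);   /* round start down */
    uint64_t end = offset + len;
    uint64_t rem = end % hint;

    if (rem)
        end += hint - rem;                       /* round end up */

    struct range r = { start, end - start };
    return r;
}
```

With a 2 MiB hint, a 4 KiB write at offset 3 MiB becomes an allocation
covering 2 MiB to 4 MiB.  Everything in that range beyond the 4 KiB
actually written has to read back as zeros, which is why the scheme
depends on unwritten extents.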

If you make XIP work just like filesystems expect, then you don't
have to reinvent the wheel.  We know how to build generic filesystem
infrastructure and it's not that hard to do. So let's do it the
right way the first time and not force everyone to reinvent the wheel
repeatedly...

Cheers,

Dave.
Matthew Wilcox Dec. 10, 2013, 4:22 p.m. UTC | #7
On Mon, Dec 09, 2013 at 07:19:40PM +1100, Dave Chinner wrote:
> Set up the generic XIP infrastructure in a way that allows the
> filesystem to set up post-IO callbacks at submission time and call
> them on IO completion.  We manage to do this for both buffered data
> IO and direct IO, and I don't see how XIP IO is any different from
> this perspective. XIP still costs time and latency to execute, and
> if we start to think about hardware offload of large memcpy()s (say
> like the SGI Altix machines could do years ago) asynchronous
> processing in the XIP IO path is quite likely to be used in the
> near future.

While I agree there's nothing inherently synchronous about the XIP
path, I don't know that there's a real advantage to a hardware offload.
These days, memory controllers are in the CPUs, so the putative hardware
is also going to have to be in the CPU and it's going to have to bring
cachelines in from one memory location and write them out to another
location.  Add in setup costs and it's going to have to be a pretty
damn large write() / read() to get any kind of advantage out of it.
I might try to con somebody into estimating where the break-even point
would be on a current CPU.  I bet it's large ... and if it's past 2GB,
we run into Linus' rule about not permitting I/Os larger than that.

I would bet our hardware people would just say something like "would
you like this hardware or two more completely generic cores?"  And I
know what the answer to that is.

> So, it's pretty clear to me that XIP needs to look like a normal IO
> path from a filesystem perspective - it's not necessarily
> synchronous, we need concurrent read and write support (i.e. the
> equivalent of current direct IO capabilities on XFS where we can
> already do millions of mixed read and write IOPS to the same file
> on a ram based block device), and so on. XIP doesn't fundamentally
> change the way filesystems work, and so we should be treating XIP in
> a similar fashion to how we treat buffered and direct IO.

I don't disagree with any of that.

> Indeed, the direct IO model is probably the best one to use here -
> it allows the filesystem to attach its own private data structure
> to the kiocb, and it gets an IO completion callback with the kiocb,
> the offset and size of the IO, and we can pull the filesystem
> private data off the iocb and then pass it into existing normal IO
> completion paths.

Um, you're joking, right?  The direct IO model is pretty universally
hated.  It's ridiculously complex.  Maybe you meant "this aspect" of
direct IO, but I would never point anybody at the direct IO path as an
example of good programming practice.

> > For writes, I think that we need to potentially split the unwritten
> > extent into up to three extents (two unwritten, one written), in the
> > spirit of the ext4_split_unwritten_extents().
> 
> You don't need to touch anything that deep in ext4 to make this
> work. What you need to do is make the XIP infrastructure allow ext4
> to track its own IO (as it already does for direct IO) and call
> ext4_put_io_end() appropriately on IO completion. XFS will use
> exactly the same mechanism, so will btrfs and every other filesystem
> we might want to add support for XIP to...
> 
> > For reads, I think we will probably have to zero the extent, mark it as
> > written, and then return the data normally.
> 
> Right now we have a "buffer_unwritten(bh)" flag that makes all the
> code treat it like a hole. You don't need to convert it to written
> until someone actually writes to it - all you need to do is
> guarantee reads return zero for that page. IOWs, for users of
> read(2) system calls, you can just zero their pages if the
> underlying region spans a hole or unwritten extent.
> 
> Again, this is infrastructure we already have in the page cache - we
> should not be using a different mechanism for XIP.

The XIP code already handles holes just fine.  Reads call __clear_user()
if they find a hole.  Mmap load faults do some bizarre stuff to map in
a zero page that I think needs fixing, but that'll be the subject of a
future fight.

I don't actually understand what the problem is here.  ext4_get_xip_mem()
calls ext4_get_block() with the 'create' flag set or clear, depending
on whether it needs the page to be instantiated or can live with the hole.
It seems that ext4_get_xip_mem() needs to check BH_Unwritten, but other
than that things should be working the way you seem to want them to.
Dave Chinner Dec. 10, 2013, 11:09 p.m. UTC | #8
On Tue, Dec 10, 2013 at 09:22:31AM -0700, Matthew Wilcox wrote:
> On Mon, Dec 09, 2013 at 07:19:40PM +1100, Dave Chinner wrote:
> > Set up the generic XIP infrastructure in a way that allows the
> > filesystem to set up post-IO callbacks at submission time and call
> > them on IO completion.  We manage to do this for both buffered data
> > IO and direct IO, and I don't see how XIP IO is any different from
> > this perspective. XIP still costs time and latency to execute, and
> > if we start to think about hardware offload of large memcpy()s (say
> > like the SGI Altix machines could do years ago) asynchronous
> > processing in the XIP IO path is quite likely to be used in the
> > near future.
> 
> While I agree there's nothing inherently synchronous about the XIP
> path, I don't know that there's a real advantage to a hardware offload.
> These days, memory controllers are in the CPUs, so the putative hardware
> is also going to have to be in the CPU and it's going to have to bring
> cachelines in from one memory location and write them out to another
> location.  Add in setup costs and it's going to have to be a pretty
> damn large write() / read() to get any kind of advantage out of it.
> I might try to con somebody into estimating where the break-even point
> would be on a current CPU.  I bet it's large ... and if it's past 2GB,
> we run into Linus' rule about not permitting I/Os larger than that.
> I would bet our hardware people would just say something like "would
> you like this hardware or two more completely generic cores?"  And I
> know what the answer to that is.

You're not thinking about what I'm saying - you're just taking the
literal interpretation of the example I gave and arguing about why
it's not a relevant example. You have not
considered the wider implications of what it means.

For example, replace memcpy() with a crypto offload so that when your
laptop gets stolen nobody can read your important data in persistent
memory without the decryption key...

i.e. if you think the sorts of things like encryption, snapshots,
compression, thin provisioning, etc are not going to be part of a
persistent memory *IO path*, then I think you are being very naive.
Persistent memory may be fast and directly accessed, but it doesn't
change the fact that it is *storage* and so needs to support all
those neat things people like to do with their persistent data....

Step outside your little Intel coloured box full of blue men,
Willy...

> > So, it's pretty clear to me that XIP needs to look like a normal IO
> > path from a filesystem perspective - it's not necessarily
> > synchronous, we need concurrent read and write support (i.e. the
> > equivalent of current direct IO capabilities on XFS where we can
> > already do millions of mixed read and write IOPS to the same file
> > on a ram based block device), and so on. XIP doesn't fundamentally
> > change the way filesystems work, and so we should be treating XIP in
> > a similar fashion to how we treat buffered and direct IO.
> 
> I don't disagree with any of that.
> 
> > Indeed, the direct IO model is probably the best one to use here -
> > it allows the filesystem to attach its own private data structure
> > to the kiocb, and it gets an IO completion callback with the kiocb,
> > the offset and size of the IO, and we can pull the filesystem
> > private data off the iocb and then pass it into existing normal IO
> > completion paths.
> 
> Um, you're joking, right?  The direct IO model is pretty universally
> hated.  It's ridiculously complex.  Maybe you meant "this aspect" of
> direct IO, but I would never point anybody at the direct IO path as an
> example of good programming practice.

Again, you're not thinking about what I'm saying - you stopped
reading at "direct IO" and started ranting instead.  I'm talking
about the design pattern (i.e. the model) used to abstract the
direct IO code from the filesystems to provide generic
infrastructure.  i.e:

aio_write
  xfs_file_aio_write
    generic_file_direct_write
      xfs_vm_direct_IO
        attach xfs_ioend to kiocb
        __blockdev_direct_IO
          <generic direct IO code>
.....
        <generic direct io completes>
        xfs_end_io_direct_write
          pulls xfs_ioend off kiocb
          xfs_finish_ioend_sync()
            does file size updates, unwritten extent conversion.

At every layer, filesystems that support direct IO can set up
infrastructure to behave like they need it to.  Filesystem specific
locking and sub-block IO synchronisation is handled at .aio_write,
filesystem IO completions are set up in .direct_IO, etc.

IOWs, XIP should look something like this:

aio_write
  xfs_file_aio_write
    generic_file_xip_write
      xfs_vm_xip_IO
        attach xfs_ioend to kiocb
	xip_file_write
	  <generic XIP write code>

.....
	<generic XIP io completes>
	xfs_end_io_xip_write
	  pulls xfs_ioend off kiocb
	  xfs_finish_ioend_sync()
	    does file size updates, unwritten extent conversion.

See what I mean? We already have a model for handling a special,
non-page cache IO path, and XIP fits it exactly with very little
extra support needed in the filesystems for it. We do not need to
reinvent a new IO model and infrastructure for XIP.
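
To make the pattern in the call chains above concrete, here is a minimal
user-space sketch. Everything in it is hypothetical: mock_iocb stands in
for the kiocb, mock_ioend for a filesystem completion context such as
xfs_ioend, and a plain char buffer for the persistent memory. It completes
the IO synchronously for simplicity and elides all locking, so it only
illustrates how the generic layer and the filesystem hook into each other,
not how any kernel code actually does it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's kiocb: carries fs-private data. */
struct mock_iocb {
	void *private;	/* filesystem-owned completion context */
	void (*end_io)(struct mock_iocb *iocb, long offset, size_t size);
};

/* Hypothetical per-filesystem completion context (cf. xfs_ioend). */
struct mock_ioend {
	long new_size;	/* file size to apply once the IO is done */
	int completed;	/* set by the completion callback */
};

/* Generic layer: copy straight into "persistent memory", then call back. */
static size_t generic_xip_write(struct mock_iocb *iocb, char *pmem,
				const char *buf, size_t len, long offset)
{
	memcpy(pmem + offset, buf, len);
	if (iocb->end_io)
		iocb->end_io(iocb, offset, len);
	return len;
}

/* Filesystem completion: pull the private context off the iocb. */
static void fs_end_io_xip_write(struct mock_iocb *iocb, long offset,
				size_t size)
{
	struct mock_ioend *ioend = iocb->private;

	assert(ioend != NULL);
	if (offset + (long)size > ioend->new_size)
		ioend->new_size = offset + (long)size;
	ioend->completed = 1;
}

/* Filesystem entry point: (locking elided) attach context, call generic. */
static size_t fs_file_xip_write(char *pmem, long *i_size,
				const char *buf, size_t len, long offset)
{
	struct mock_ioend ioend = { .new_size = *i_size, .completed = 0 };
	struct mock_iocb iocb = {
		.private = &ioend,
		.end_io = fs_end_io_xip_write,
	};
	size_t written = generic_xip_write(&iocb, pmem, buf, len, offset);

	if (ioend.completed)
		*i_size = ioend.new_size; /* size update happens at completion */
	return written;
}
```

The generic function knows nothing about file sizes or extents; the
filesystem decides what completion means by choosing the callback and the
context it attaches.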

> > > For writes, I think that we need to potentially split the unwritten
> > > extent in to up to three extents (two unwritten, one written), in the
> > > spirit of the ext4_split_unwritten_extents().
> > 
> > You don't need to touch anything that deep in ext4 to make this
> > work. What you need to do is make the XIP infrastructure allow ext4
> > to track its own IO (as it already does for direct IO) and call
> > ext4_put_io_end() appropriately on IO completion. XFS will use
> > exactly the same mechanism, so will btrfs and every other filesystem
> > we might want to add support for XIP to...
> > 
> > > For reads, I think we will probably have to zero the extent, mark it as
> > > written, and then return the data normally.
> > 
> > Right now we have a "buffer_unwritten(bh)" flag that makes all the
> > code treat it like a hole. You don't need to convert it to written
> > until someone actually writes to it - all you need to do is
> > guarantee reads return zero for that page. IOWs, for users of
> > read(2) system calls, you can just zero their pages if the
> > underlying region spans a hole or unwritten extent.
> > 
> > Again, this is infrastructure we already have in the page cache - we
> > should not be using a different mechanism for XIP.
> 
> The XIP code already handles holes just fine.  Reads call __clear_user()
> if they find a hole.  Mmap load faults do some bizarre stuff to map in
> a zero page that I think needs fixing, but that'll be the subject of a
> future fight.

Yes, I know that XIP has code to do this. Read what I said:

> > Again, this is infrastructure we already have in the page cache - we
> > should not be using a different mechanism for XIP.
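
The read behaviour described in the quoted text (return zeros for a hole
or unwritten extent without converting it) can be sketched in user space
as follows. The mock_extent type and mock_xip_read() are hypothetical
names for illustration only and do not correspond to any kernel
interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical extent states, loosely mirroring hole/unwritten/written. */
enum mock_extent_state { EXT_HOLE, EXT_UNWRITTEN, EXT_WRITTEN };

struct mock_extent {
	enum mock_extent_state state;
	const char *data;	/* backing memory; may hold stale bytes */
	size_t len;
};

/*
 * Copy up to count bytes starting at off into buf. Only written extents
 * expose their backing memory; holes and unwritten extents read as zeros
 * and keep their state, so no conversion happens on the read path.
 */
static size_t mock_xip_read(const struct mock_extent *ext, size_t off,
			    char *buf, size_t count)
{
	assert(ext != NULL && buf != NULL);
	if (off >= ext->len)
		return 0;
	if (count > ext->len - off)
		count = ext->len - off;

	if (ext->state == EXT_WRITTEN)
		memcpy(buf, ext->data + off, count);
	else
		memset(buf, 0, count);
	return count;
}
```

The point of the sketch is that stale bytes sitting in an unwritten
extent never reach the reader, and the extent is only converted when
something actually writes to it.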

Willy, I'm coming from the position of having taken a look at the
XIP code with an eye to adding support to XFS. What I've found is a
*toy* that people played with years ago that relied on a specific
block device implementation.  It simply wasn't architected for ext4
or XFS or btrfs to be implemented on top of it.

It wasn't designed with the consideration that we might need
buffering for mmap writes because we use data transformations in the
IO path (the encryption example, again).

It wasn't designed to allow filesystems that require locking other than
i_mutex in the io path to work or do other operations prior to
the write IO that might be needed to avoid stale data exposure on
extending writes.

It doesn't even serialise xip_file_read() against any other IO
operation at all, and so can race with writes, truncates, hole
punches or any other operations that might modify the underlying
file or block map. That's not just downright nasty, that's a major
security issue. The direct IO design pattern allows filesystems to
put their own locking in place to prevent these sorts of problems.
e.g. XFS holds the IOLOCK in shared mode across reads, so does not
need to rely on page locks to avoid read racing with truncate, etc.

Quite frankly the XIP infrastructure as it stands is simply
inadequate for modern filesystems like ext4 or XFS - it is so full
of holes it makes swiss cheese look positively solid.  If we are
going to make XIP a first class citizen - and we need to for
persistent memory support - then we need to architect a proper
solution for it.  When faced with the choice of "reimplement every
filesystem with custom solutions" or "add generic infrastructure for
the existing filesystem hooks", the answer is a no-brainer: generic
infrastructure improvements win every time. 

Willy, the "XIP as an IO path" infrastructure change is the critical
one that needs to be made. It's not a huge amount of work; it'd take
me a week to do it and to port XFS to support XIP, but I don't have
a week I can spare right now. Intel clearly have resources to throw
at this problem, so I'd be really happy to only have to worry about
the day it would take to do the "port XFS" part of the work.

Cheers,

Dave.
Patch

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 919a329..c32c398 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -386,6 +386,8 @@  max_dir_size_kb=n	This limits the size of directories so that any
 i_version		Enable 64-bit inode version support. This option is
 			off by default.
 
+xip			Use execute in place (no caching) if possible.
+
 Data Mode
 =========
 There are 3 different data modes:
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index 0466ee5..54baa05 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -38,6 +38,8 @@  alternative, memory technology devices can be used for this.
 The block device operation is optional, these block devices support it as of
 today:
 - dcssblk: s390 dcss block device driver
+- brd: Ram backed block device driver
+- axonram: Axon DDR2 device driver
 
 An address space operation named get_xip_mem is used to retrieve references
 to a page frame number and a kernel address. To obtain these values a reference
@@ -49,6 +51,7 @@  This address space operation is mutually exclusive with readpage&writepage that
 do page cache read/write operations.
 The following filesystems support it as of today:
 - ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt
 
 A set of file operations that do utilize get_xip_page can be found in
 mm/filemap_xip.c . The following file operation implementations are provided:
diff --git a/fs/Kconfig b/fs/Kconfig
index c229f82..595cc00 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -17,7 +17,7 @@  source "fs/ext4/Kconfig"
 config FS_XIP
 # execute in place
 	bool
-	depends on EXT2_FS_XIP
+	depends on EXT2_FS_XIP || EXT4_FS_XIP
 	default y
 
 source "fs/jbd/Kconfig"
diff --git a/fs/ext4/Kconfig b/fs/ext4/Kconfig
index efea5d5..62952cb 100644
--- a/fs/ext4/Kconfig
+++ b/fs/ext4/Kconfig
@@ -73,3 +73,14 @@  config EXT4_DEBUG
 	  If you select Y here, then you will be able to turn on debugging
 	  with a command such as:
 		echo 1 > /sys/module/ext4/parameters/mballoc_debug
+
+config EXT4_FS_XIP
+	bool "Ext4 execute in place support"
+	depends on EXT4_FS && MMU
+	help
+	  Execute in place can be used on memory-backed block devices. If you
+	  enable this option, you can select to mount block devices which are
+	  capable of this feature without using the page cache.
+
+	  If you do not use a block device that is capable of using this,
+	  or if unsure, say N.
diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index 0310fec..3f1ec56 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -12,3 +12,4 @@  ext4-y	:= balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o page-io.o \
 
 ext4-$(CONFIG_EXT4_FS_POSIX_ACL)	+= acl.o
 ext4-$(CONFIG_EXT4_FS_SECURITY)		+= xattr_security.o
+ext4-$(CONFIG_EXT4_FS_XIP)	 	+= xip.o
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e618503..9b509a0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -954,6 +954,7 @@  struct ext4_inode_info {
 #define EXT4_MOUNT_ERRORS_MASK		0x00070
 #define EXT4_MOUNT_MINIX_DF		0x00080	/* Mimics the Minix statfs */
 #define EXT4_MOUNT_NOLOAD		0x00100	/* Don't use existing journal*/
+#define EXT4_MOUNT_XIP			0x00200 /* Execute in place */
 #define EXT4_MOUNT_DATA_FLAGS		0x00C00	/* Mode for data writes: */
 #define EXT4_MOUNT_JOURNAL_DATA		0x00400	/* Write data to journal */
 #define EXT4_MOUNT_ORDERED_DATA		0x00800	/* Flush data before commit */
@@ -2571,6 +2572,7 @@  extern const struct file_operations ext4_dir_operations;
 /* file.c */
 extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
+extern const struct file_operations ext4_xip_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
 extern void ext4_unwritten_wait(struct inode *inode);
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 3da2194..b9499b2 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -609,6 +609,23 @@  const struct file_operations ext4_file_operations = {
 	.fallocate	= ext4_fallocate,
 };
 
+#ifdef CONFIG_EXT4_FS_XIP
+const struct file_operations ext4_xip_file_operations = {
+	.llseek		= ext4_llseek,
+	.read		= xip_file_read,
+	.write		= xip_file_write,
+	.unlocked_ioctl = ext4_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= ext4_compat_ioctl,
+#endif
+	.mmap		= xip_file_mmap,
+	.open		= ext4_file_open,
+	.release	= ext4_release_file,
+	.fsync		= ext4_sync_file,
+	.fallocate	= ext4_fallocate,
+};
+#endif
+
 const struct inode_operations ext4_file_inode_operations = {
 	.setattr	= ext4_setattr,
 	.getattr	= ext4_getattr,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0757634..18d027f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -43,6 +43,7 @@ 
 #include "xattr.h"
 #include "acl.h"
 #include "truncate.h"
+#include "xip.h"
 
 #include <trace/events/ext4.h>
 
@@ -663,6 +664,23 @@  found:
 			WARN_ON(1);
 		}
 
+		if (ext4_use_xip(inode->i_sb)) {
+			ext4_fsblk_t fs_blk;
+
+			for (fs_blk = map->m_pblk;
+			     fs_blk < map->m_pblk + map->m_len; fs_blk++) {
+				/*
+				 * we need to clear the block
+				 */
+				ret = ext4_clear_xip_target(inode, fs_blk);
+
+				if (ret) {
+					retval = ret;
+					goto has_zeroout;
+				}
+			}
+		}
+
 		/*
 		 * If the extent has been zeroed out, we don't need to update
 		 * extent status tree.
@@ -3270,6 +3288,11 @@  static const struct address_space_operations ext4_aops = {
 	.error_remove_page	= generic_error_remove_page,
 };
 
+const struct address_space_operations ext4_xip_aops = {
+	.bmap			= ext4_bmap,
+	.get_xip_mem		= ext4_get_xip_mem,
+};
+
 static const struct address_space_operations ext4_journalled_aops = {
 	.readpage		= ext4_readpage,
 	.readpages		= ext4_readpages,
@@ -3317,7 +3340,9 @@  void ext4_set_aops(struct inode *inode)
 	default:
 		BUG();
 	}
-	if (test_opt(inode->i_sb, DELALLOC))
+	if (ext4_use_xip(inode->i_sb))
+		inode->i_mapping->a_ops = &ext4_xip_aops;
+	else if (test_opt(inode->i_sb, DELALLOC))
 		inode->i_mapping->a_ops = &ext4_da_aops;
 	else
 		inode->i_mapping->a_ops = &ext4_aops;
@@ -3738,8 +3763,14 @@  void ext4_truncate(struct inode *inode)
 		return;
 	}
 
-	if (inode->i_size & (inode->i_sb->s_blocksize - 1))
-		ext4_block_truncate_page(handle, mapping, inode->i_size);
+	if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
+		if (mapping_is_xip(inode->i_mapping)) {
+			if (xip_truncate_page(inode->i_mapping, inode->i_size))
+				goto out_stop;
+		} else
+			ext4_block_truncate_page(handle, mapping,
+						 inode->i_size);
+	}
 
 	/*
 	 * We add the inode to the orphan list, so that if this
@@ -4201,7 +4232,10 @@  struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (ext4_use_xip(inode->i_sb))
+			inode->i_fop = &ext4_xip_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext4_dir_inode_operations;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 5a0408d..20a9cf8 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -39,6 +39,7 @@ 
 
 #include "xattr.h"
 #include "acl.h"
+#include "xip.h"
 
 #include <trace/events/ext4.h>
 /*
@@ -2250,7 +2251,10 @@  retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (ext4_use_xip(inode->i_sb))
+			inode->i_fop = &ext4_xip_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		err = ext4_add_nondir(handle, dentry, inode);
 		if (!err && IS_DIRSYNC(dir))
@@ -2314,7 +2318,10 @@  retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (ext4_use_xip(inode->i_sb))
+			inode->i_fop = &ext4_xip_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		d_tmpfile(dentry, inode);
 		err = ext4_orphan_add(handle, inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c977f4e..144dfd5 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -50,6 +50,7 @@ 
 #include "xattr.h"
 #include "acl.h"
 #include "mballoc.h"
+#include "xip.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/ext4.h>
@@ -1162,7 +1163,7 @@  enum {
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
-	Opt_max_dir_size_kb,
+	Opt_max_dir_size_kb, Opt_xip,
 };
 
 static const match_table_t tokens = {
@@ -1243,6 +1244,7 @@  static const match_table_t tokens = {
 	{Opt_removed, "reservation"},	/* mount option from ext2/3 */
 	{Opt_removed, "noreservation"}, /* mount option from ext2/3 */
 	{Opt_removed, "journal=%u"},	/* mount option from ext2/3 */
+	{Opt_xip, "xip"},
 	{Opt_err, NULL},
 };
 
@@ -1436,6 +1438,7 @@  static const struct mount_opts {
 	{Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
 	{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
 	{Opt_max_dir_size_kb, 0, MOPT_GTE0},
+	{Opt_xip, EXT4_MOUNT_XIP, MOPT_SET},
 	{Opt_err, 0, 0}
 };
 
@@ -1638,6 +1641,11 @@  static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 		}
 		sbi->s_jquota_fmt = m->mount_opt;
 #endif
+#ifndef CONFIG_EXT4_FS_XIP
+	} else if (token == Opt_xip) {
+		ext4_msg(sb, KERN_INFO, "xip option not supported");
+		return -1;
+#endif
 	} else {
 		if (!args->from)
 			arg = 1;
@@ -3553,11 +3561,23 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		}
 		if (test_opt(sb, DELALLOC))
 			clear_opt(sb, DELALLOC);
+		if (test_opt(sb, XIP)) {
+			ext4_msg(sb, KERN_ERR, "can't mount with "
+				 "both data=journal and xip");
+			goto failed_mount;
+		}
 	}
 
 	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
 		(test_opt(sb, POSIX_ACL) ? MS_POSIXACL : 0);
 
+	if ((sbi->s_mount_opt & EXT4_MOUNT_XIP) &&
+	    !sb->s_bdev->bd_disk->fops->direct_access) {
+		ext4_msg(sb, KERN_ERR, "can't mount with xip - "
+				       "not supported by bdev");
+		goto failed_mount;
+	}
+
 	if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV &&
 	    (EXT4_HAS_COMPAT_FEATURE(sb, ~0U) ||
 	     EXT4_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -3604,6 +3624,12 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		goto failed_mount;
 	}
 
+	if (ext4_use_xip(sb) && blocksize != PAGE_SIZE) {
+		ext4_msg(sb, KERN_ERR, "Unsupported blocksize %d for xip",
+				blocksize);
+		goto failed_mount;
+	}
+
 	if (sb->s_blocksize != blocksize) {
 		/* Validate the filesystem blocksize */
 		if (!sb_set_blocksize(sb, blocksize)) {
@@ -4740,6 +4766,7 @@  static int ext4_remount(struct super_block *sb, int *flags, char *data)
 	struct ext4_super_block *es;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	unsigned long old_sb_flags;
+	unsigned long old_mount_opt = sbi->s_mount_opt;
 	struct ext4_mount_options old_opts;
 	int enable_quota = 0;
 	ext4_group_t g;
@@ -4808,6 +4835,13 @@  static int ext4_remount(struct super_block *sb, int *flags, char *data)
 
 	es = sbi->s_es;
 
+	if ((sbi->s_mount_opt ^ old_mount_opt) & EXT4_MOUNT_XIP) {
+		ext4_msg(sb, KERN_WARNING, "warning: refusing change of "
+			 "xip flag while remounting");
+		sbi->s_mount_opt &= ~EXT4_MOUNT_XIP;
+		sbi->s_mount_opt |= old_mount_opt & EXT4_MOUNT_XIP;
+	}
+
 	if (sbi->s_journal) {
 		ext4_init_journal_params(sb, sbi->s_journal);
 		set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
diff --git a/fs/ext4/xip.c b/fs/ext4/xip.c
new file mode 100644
index 0000000..21dd166
--- /dev/null
+++ b/fs/ext4/xip.c
@@ -0,0 +1,78 @@ 
+/*
+ *  linux/fs/ext4/xip.c
+ *
+ * Copyright (C) 2005 IBM Corporation
+ * Author: Carsten Otte (cotte@de.ibm.com)
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/buffer_head.h>
+#include <linux/blkdev.h>
+#include "ext4.h"
+#include "xip.h"
+
+static inline int
+__inode_direct_access(struct inode *inode, sector_t block,
+		      void **kaddr, unsigned long *pfn)
+{
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+	sector_t sector;
+
+	sector = block * (PAGE_SIZE / 512); /* ext4 block to bdev sector */
+
+	BUG_ON(!ops->direct_access);
+	return ops->direct_access(bdev, sector, kaddr, pfn);
+}
+
+static inline int
+__ext4_get_block(struct inode *inode, pgoff_t pgoff, int create,
+		   sector_t *result)
+{
+	struct buffer_head tmp;
+	int rc;
+
+	memset(&tmp, 0, sizeof(struct buffer_head));
+	tmp.b_size = inode->i_sb->s_blocksize;
+	rc = ext4_get_block(inode, pgoff, &tmp, create);
+	*result = tmp.b_blocknr;
+
+	/* did we get a sparse block (hole in the file)? */
+	if (!tmp.b_blocknr && !rc) {
+		BUG_ON(create);
+		rc = -ENODATA;
+	}
+
+	return rc;
+}
+
+int
+ext4_clear_xip_target(struct inode *inode, sector_t block)
+{
+	void *kaddr;
+	unsigned long pfn;
+	int rc;
+
+	rc = __inode_direct_access(inode, block, &kaddr, &pfn);
+	if (!rc)
+		clear_page(kaddr);
+	return rc;
+}
+
+int ext4_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
+				void **kmem, unsigned long *pfn)
+{
+	int rc;
+	sector_t block;
+
+	/* first, retrieve the sector number */
+	rc = __ext4_get_block(mapping->host, pgoff, create, &block);
+	if (rc)
+		return rc;
+
+	/* retrieve address of the target data */
+	rc = __inode_direct_access(mapping->host, block, kmem, pfn);
+	return rc;
+}
diff --git a/fs/ext4/xip.h b/fs/ext4/xip.h
new file mode 100644
index 0000000..af0d553
--- /dev/null
+++ b/fs/ext4/xip.h
@@ -0,0 +1,24 @@ 
+/*
+ *  linux/fs/ext4/xip.h
+ *
+ * Copyright (C) 2005 IBM Corporation
+ * Author: Carsten Otte (cotte@de.ibm.com)
+ */
+
+#ifdef CONFIG_EXT4_FS_XIP
+extern int ext4_clear_xip_target(struct inode *, sector_t);
+
+static inline int ext4_use_xip(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	return sbi->s_mount_opt & EXT4_MOUNT_XIP;
+}
+int ext4_get_xip_mem(struct address_space *, pgoff_t, int,
+				void **, unsigned long *);
+#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem)
+#else
+#define mapping_is_xip(map)			0
+#define ext4_use_xip(sb)			0
+#define ext4_clear_xip_target(inode, chain)	0
+#define ext4_get_xip_mem			NULL
+#endif