Patchwork [RFC,1/1] ext4: Try to better reuse recently freed space

Submitter Lukas Czerner
Date Dec. 2, 2013, 4:32 p.m.
Message ID <alpine.LFD.2.00.1312021727150.2284@localhost.localdomain>
Permalink /patch/295956/
State New

Comments

Lukas Czerner - Dec. 2, 2013, 4:32 p.m.
Hi all,

this is the patch I sent a while ago to fix the issue I've seen with
a global allocation goal. It might no longer apply to the current
kernel and it might not be the best approach, but I'm using it as an
example just to start a discussion about those allocation goals and
how to use, or change, them.

I think we agree that the long-term fix would be to have a free
extent map. But we might be able to do something quickly, unless
someone commits to making the free extent map a reality :)

Thanks!
-Lukas


Currently, if the block allocator cannot allocate at the requested goal,
it falls back to the global goal for stream allocation. However, the
global goal (s_mb_last_group and s_mb_last_start) moves forward every
time such an allocation happens and never moves backwards.

This causes several problems in certain scenarios:

- The goal moves further and further, preventing us from reusing space
  which might have been freed since then. This is OK from the file
  system's point of view because we will reuse that space eventually,
  but we end up allocating blocks from slower parts of the spinning
  disk even though it might not be necessary.
- The above also causes a more serious problem for thinly provisioned
  storage (and storage backed by sparse images), because instead of
  reusing blocks which are already provisioned we would try to use new
  blocks. This unnecessarily drains the storage's free block pool.
- Blocks will also be allocated further from the given goal than
  necessary. Consider, for example, truncating, or removing and
  rewriting, a file in a loop. This workload will never reuse freed
  blocks until we have claimed and freed all the blocks in the file
  system.

Note that file systems like xfs, ext3, or btrfs do not have this
problem. It is simply caused by the notion of a global pool.

Fix this by changing the global goal to a per-inode goal. This
allows us to invalidate the goal every time the inode is truncated
or newly created, so in those cases we would try to use the proper,
more specific goal which is based on the inode position.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/ext4.h    |  7 ++++---
 fs/ext4/inode.c   |  8 ++++++++
 fs/ext4/mballoc.c | 20 ++++++++------------
 3 files changed, 20 insertions(+), 15 deletions(-)
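
For illustration only, here is a minimal user-space sketch of the behaviour the commit message describes. It is not the mballoc code: the toy file system, the first-fit scan, and every name in it (toy_fs, toy_inode, toy_rewrite_loop, NO_GOAL) are hypothetical, and it assumes the fallback goal is consulted on every allocation, which is a simplification. It runs a write-then-truncate loop under a global goal and under a per-inode goal that is invalidated on truncate, and reports the highest block ever allocated.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define NBLOCKS 1024u
#define NO_GOAL UINT_MAX	/* same sentinel idea as the patch's UINT_MAX */

struct toy_fs {
	bool used[NBLOCKS];
	unsigned int global_goal;	/* stands in for s_mb_last_* */
	unsigned int high_water;	/* highest block ever allocated */
};

struct toy_inode {
	unsigned int home;		/* position derived from the inode */
	unsigned int last_goal;		/* stands in for i_last_*; NO_GOAL if invalid */
};

/* First-fit scan from 'start', wrapping around; NO_GOAL if the disk is full. */
static unsigned int toy_scan(struct toy_fs *fs, unsigned int start)
{
	for (unsigned int i = 0; i < NBLOCKS; i++) {
		unsigned int b = (start + i) % NBLOCKS;

		if (!fs->used[b]) {
			fs->used[b] = true;
			if (b > fs->high_water)
				fs->high_water = b;
			return b;
		}
	}
	return NO_GOAL;
}

/* Allocate one block using either the global or the per-inode goal. */
static unsigned int toy_alloc(struct toy_fs *fs, struct toy_inode *ino,
			      bool per_inode)
{
	unsigned int start, b;

	if (per_inode)
		start = (ino->last_goal != NO_GOAL) ? ino->last_goal : ino->home;
	else
		start = fs->global_goal;

	b = toy_scan(fs, start);
	if (b == NO_GOAL)
		return b;

	/* Remember where the next stream allocation should continue. */
	if (per_inode)
		ino->last_goal = (b + 1) % NBLOCKS;
	else
		fs->global_goal = (b + 1) % NBLOCKS;
	return b;
}

/* Write 'len' blocks, truncate, repeat; returns the high-water mark. */
unsigned int toy_rewrite_loop(bool per_inode, unsigned int len,
			      unsigned int rounds)
{
	struct toy_fs fs;
	struct toy_inode ino = { .home = 0, .last_goal = NO_GOAL };
	unsigned int blocks[64];

	assert(len <= 64);
	memset(&fs, 0, sizeof(fs));

	for (unsigned int r = 0; r < rounds; r++) {
		for (unsigned int i = 0; i < len; i++) {
			blocks[i] = toy_alloc(&fs, &ino, per_inode);
			assert(blocks[i] != NO_GOAL);
		}
		for (unsigned int i = 0; i < len; i++)
			fs.used[blocks[i]] = false;	/* truncate frees the blocks */
		ino.last_goal = NO_GOAL;		/* and invalidates the goal */
	}
	return fs.high_water;
}
```

In this toy model, toy_rewrite_loop(false, 8, 10) ends with a high-water mark of 79 because the global goal only moves forward, while toy_rewrite_loop(true, 8, 10) stays at 7 because the same eight blocks are reused each round.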
Theodore Ts'o - Dec. 4, 2013, 5:21 a.m.
On Mon, Dec 02, 2013 at 05:32:06PM +0100, Lukáš Czerner wrote:
> Hi all,
> 
> this is the patch I send a while ago to fix the issue I've seen with
> a global allocation goal. This might no longer apply to the current
> kernel and it might not be the best approach, but I use this example
> just to start a discussion about those allocation goals and how to
> use, or change them.
> 
> I think that we agree that the long term fix would be to have free
> extent map. But we might be able to do something quickly, unless
> someone commits to make the free extent map reality :)

There was discussion from the previous thread here:

      http://patchwork.ozlabs.org/patch/257476/


Looking at this, I think the patch makes sense, but I think it would
be good if we started thinking about what would be some good
benchmarks so we can measure and compare various different allocation
strategies.

Part of it would certainly be how long it takes to write or fallocate
the files, but also how fragmented the files are, plus some indication
of how friendly the file system is to flash and thin-provisioned
systems (where re-using blocks sooner rather than later might be
helpful if the user is only running fstrim once a week or some such).
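
As a rough sketch of the "how fragmented are the files" part, one simple number is the count of physically contiguous runs per file. The helper below is hypothetical (count_extents is not an existing API) and assumes the caller already has the file's physical block numbers in logical order, for example gathered via the FIEMAP ioctl that filefrag uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count the contiguous physical runs ("extents") in a file whose
 * physical block numbers are listed in logical order.
 * Hypothetical helper for illustration; not part of any library.
 */
size_t count_extents(const uint64_t *pblk, size_t n)
{
	size_t extents = 0;

	assert(pblk != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		/* A new extent starts wherever the physical run breaks. */
		if (i == 0 || pblk[i] != pblk[i - 1] + 1)
			extents++;
	}
	return extents;
}
```

A file mapped to blocks 10, 11, 12, 40, 41, 7 would count as three extents. Comparing this number for the same workload under different allocation strategies is one possible input to a benchmark, not a complete metric.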

	       	       	    	    	   	- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - April 1, 2014, 5:30 a.m.
On Mon, Dec 02, 2013 at 05:32:06PM +0100, Lukáš Czerner wrote:
> Hi all,
> 
> this is the patch I send a while ago to fix the issue I've seen with
> a global allocation goal. This might no longer apply to the current
> kernel and it might not be the best approach, but I use this example
> just to start a discussion about those allocation goals and how to
> use, or change them.
> 
> I think that we agree that the long term fix would be to have free
> extent map. But we might be able to do something quickly, unless
> someone commits to make the free extent map reality :)

Hi Andreas,

We discussed possibly applying this patch last week at the ext4
workshop, possibly as early as for the 3.15 merge window:

	http://patchwork.ozlabs.org/patch/295956/

However, I'm guessing that Lustre has the workload which is most
likely to regress if we were to simply apply this patch.  But, it's
likely it will improve things for many/most other ext4 workloads.

We did talk about trying to assemble some block allocation performance
tests so we can better measure proposed changes to the block
allocator, but that's not something we have yet.  However, this global
goal is definitely causing problems for a number of use cases,
including thinp and being flash friendly.

Would you be willing to apply this patch and then run some benchmarks
to see if Lustre would be impacted negatively if we were to apply this
patch for the next development cycle (i.e., not for 3.15, but for the
next merge window)?

Thanks,

							- Ted
Dmitry Monakhov - April 1, 2014, 11:44 a.m.
On Tue, 1 Apr 2014 01:30:18 -0400, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Dec 02, 2013 at 05:32:06PM +0100, Lukáš Czerner wrote:
> > Hi all,
> > 
> > this is the patch I send a while ago to fix the issue I've seen with
> > a global allocation goal. This might no longer apply to the current
> > kernel and it might not be the best approach, but I use this example
> > just to start a discussion about those allocation goals and how to
> > use, or change them.
> > 
> > I think that we agree that the long term fix would be to have free
> > extent map. But we might be able to do something quickly, unless
> > someone commits to make the free extent map reality :)
> 
> Hi Andreas,
> 
> We discussed possibly applying this patch last week at the ext4
> workshop, possibly as early as for the 3.15 merge window:
> 
> 	http://patchwork.ozlabs.org/patch/295956/
> 
> However, I'm guessing that Lustre has the workload which is most
> likely to regress if we were to simply apply this patch.  But, it's
> likely it will improve things for many/most other ext4 workloads.
> 
> We did talk about trying to assemble some block allocation performance
> tests so we can better measure proposed changes to the block
> allocator, but that's not something we have yet.  However, this global
BTW, where can I find this discussion? I would like to cooperate
in that activity. Please CC me the next time you discuss allocation
performance measurements. At Parallels we run https://oss.oracle.com/~mason/compilebench/
as a load simulator.
> goal is definitely causing problems for a number of use cases,
> including thinp and being flash friendly.
> 
> Would you be willing to apply this patch and then run some benchmarks
> to see if Lustre would be impacted negatively if we were to apply this
> patch for the next development cycle (i.e., not for 3.15, but for the
> next merge window)?
> 
> Thanks,
> 
> 							- Ted
Theodore Ts'o - April 1, 2014, 2:47 p.m.
On Tue, Apr 01, 2014 at 03:44:53PM +0400, Dmitry Monakhov wrote:
> BTW where this can I find this discussion? I would like to cooperate
> this that activity. Please CC me next time you will disscuss allocation
> performance mesurments. At Parallels we run https://oss.oracle.com/~mason/compilebench/
> as load simulator.

The discussion happened at the ext4 developers' get-together in Napa,
California, colocated with LSF/MM and the Collaboration Summit.
You should go next year; it was a huge amount of fun, and there were a
bunch of other Parallels people there who can tell you about the
reception at the Jacuzzi Family Winery, etc.  :-)

I suspect there will be some future conversations at our weekly
conference calls.  Typically design stuff will happen there, but
technical low-level details about things like patches will happen on
the mailing list, so you'll be alerted when we start having specific
patches to evaluate and as we start putting together a set of
allocation benchmarks.

If you are interested in participating on the conference calls,
contact me off-line.  If the current time (8AM US/Pacific ; 11 AM
US/Eastern) isn't good for you, we can try to see if another time
works for everyone.

One of the discussion points that came up last week is that it would
be good if we can come up with allocation tests that are fast to run.
That might mean (for example) taking a workload such as compilebench,
and changing it to use fallocate() or having a mount option which
causes the actual data path writes to be skipped for files.  We would
then need to have some kind of metric to evaluate how "good" a
particular file system layout ends up being at the end of the
workload.  Not just for a specific file, but for all of the files in
some kind of holistic measurement of "goodness", as well as looking at
how fragmented the free space ended up being.  Exactly how we do this
is still something that we need to figure out; if you have any
suggestions, they would be most welcome!
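
One possible ingredient for the free-space side of such a metric, sketched under stated assumptions: walk an allocation map at the end of the workload and summarize the free extents. The struct and function names here are hypothetical, and the map uses one byte per block for readability rather than a real on-disk bitmap. The e2freefrag tool in e2fsprogs reports similar information for a real file system.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of free space in a toy allocation map (hypothetical type). */
struct free_stats {
	size_t free_blocks;	/* total number of free blocks */
	size_t free_extents;	/* number of contiguous free runs */
	size_t largest_extent;	/* length of the longest free run */
};

/*
 * 'used' holds one byte per block: nonzero means allocated.
 * A lower free_extents and a higher largest_extent for the same
 * free_blocks indicate less fragmented free space.
 */
struct free_stats free_space_stats(const unsigned char *used, size_t nblocks)
{
	struct free_stats st = { 0, 0, 0 };
	size_t run = 0;

	assert(used != NULL || nblocks == 0);

	for (size_t i = 0; i < nblocks; i++) {
		if (used[i]) {
			run = 0;	/* an allocated block ends the run */
			continue;
		}
		run++;
		st.free_blocks++;
		if (run == 1)
			st.free_extents++;	/* first block of a new run */
		if (run > st.largest_extent)
			st.largest_extent = run;
	}
	return st;
}
```

The average free extent size is then free_blocks / free_extents (when free_extents is nonzero), which could be tracked alongside per-file extent counts.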

Cheers,

					- Ted
Dmitry Monakhov - April 1, 2014, 4:35 p.m.
On Tue, 1 Apr 2014 10:47:08 -0400, "Theodore Ts'o" <tytso@mit.edu> wrote:
> On Tue, Apr 01, 2014 at 03:44:53PM +0400, Dmitry Monakhov wrote:
> > BTW where this can I find this discussion? I would like to cooperate
> > this that activity. Please CC me next time you will disscuss allocation
> > performance mesurments. At Parallels we run https://oss.oracle.com/~mason/compilebench/
> > as load simulator.
> 
> The discussion happened at the Ext4 developer's get together in Napa,
> California, colocated with the LSF/MM and the Collaboration Summit.
> You should go next year; it was a huge amount of fun, and there were a
> bunch of other Parallels people there who can tell you about the
> reception at the Jacuzzi Family Winery, etc.  :-)
Hm... the truth is that I was there. I am the man who asked your
opinion about mfsync (multi-file fsync), remember? :)
But I probably simply missed the allocation topic.
> 
> I suspect there will be some future conversations at our weekly
> conference calls.  Typically design stuff will happen there, but
> technical low-level details about things like patches will happen on
> the mailing list, so you'll be alerted when we start having specific
> patches to evaluate and as we start putting together a set of
> allocation benchmarks.
> 
> If you are interested in participating on the conference calls,
> contact me off-line.  If the current time (8AM US/Pacific ; 11 AM
> US/Eastern) isn't good for you, we can try to see if another time
> works for everyone.
Yes, it would be nice. Please invite me to the next call.
> 
> One of the discussion points that came up last week is that it would
> be good if we can come up with allocation tests that are fast to run.
> That might mean (for example) taking a workload such as compilebench,
> and changing it to use fallocate() or having a mount option which
> causes the actual data path writes to be skipped for files.  We would
> then need to have some kind of metric to evaluate how "good" a
> particular file system layout ends up being at the end of the
> workload.  Not just for a specific file, but for all of the files in
> some kind of holistic measurement of "goodness", as well as looking at
> how fragmented the free space ended up being.  Exactly how we do this
> is still something that we need to figure out; if you have any
> suggestions, they would be most welcome!
> 
> Cheers,
> 
> 					- Ted
Andreas Dilger - April 7, 2014, 6:22 p.m.
On Mar 31, 2014, at 11:30 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Dec 02, 2013 at 05:32:06PM +0100, Lukáš Czerner wrote:
>> Hi all,
>> 
>> this is the patch I send a while ago to fix the issue I've seen with
>> a global allocation goal. This might no longer apply to the current
>> kernel and it might not be the best approach, but I use this example
>> just to start a discussion about those allocation goals and how to
>> use, or change them.
>> 
>> I think that we agree that the long term fix would be to have free
>> extent map. But we might be able to do something quickly, unless
>> someone commits to make the free extent map reality :)
> 
> Hi Andreas,
> 
> We discussed possibly applying this patch last week at the ext4
> workshop, possibly as early as for the 3.15 merge window:
> 
> 	http://patchwork.ozlabs.org/patch/295956/

Sorry for the delay in replying, I was away the past couple of weeks
on vacation, and am heading this week to the Lustre conference.

> However, I'm guessing that Lustre has the workload which is most
> likely to regress if we were to simply apply this patch.  But, it's
> likely it will improve things for many/most other ext4 workloads.

I definitely agree that this will help with SSD storage, and thinp
workloads that end up doing random IO to the underlying storage.
The main reason for using the global goal is to maximize the chance
of getting large contiguous allocations, since it does a round-robin
traversal of the groups and maximizes the time that other blocks can
be freed in each group.  It also minimizes the churn of allocating
large files in groups that may not have large contiguous extents (e.g.
if small files/extents have just been freed in a group, but not all
of the blocks in that group are freed).

The global goal not only reduces the chance of ending up with small free
chunks for the file, but also avoids the CPU overhead of re-scanning
partially full groups to find large free extents.

> We did talk about trying to assemble some block allocation performance
> tests so we can better measure proposed changes to the block
> allocator, but that's not something we have yet.  However, this global
> goal is definitely causing problems for a number of use cases,
> including thinp and being flash friendly.

An important question is whether thinp and flash are the main uses for ext4.
I could imagine ignoring the global goal if rotational == 0, or if some
large fraction of the filesystem was just deleted, but this change means
that each inode will have to re-scan the free groups to find large extents
and I think this will have a long-term negative impact on file extent size.

Note that the global goal is only used if the per-inode goal is already
full (i.e. ext4_mb_regular_allocator->ext4_mb_find_by_goal() fails), so
it will need to find a new allocation every time.

I think typical benchmarks that allocate and then free all space in a new
filesystem will not exercise this code properly.  A better test would be
something that allocates space but then only frees some of it, preferably
multi-threaded.

> Would you be willing to apply this patch and then run some benchmarks
> to see if Lustre would be impacted negatively if we were to apply this
> patch for the next development cycle (i.e., not for 3.15, but for the
> next merge window)?

Since most of the Lustre developers are travelling this week, it will be
hard to get this patch suitably tested quickly.  Also, Lustre servers are
not running on 3.14 kernels yet, so the patch would need to be backported
to a kernel that the Lustre server is currently running on.

Some minor notes about the patch itself:
- if there is a desire to land this patch quickly, it would be great to
  have this behaviour at least selectable via mount option and/or /sys/fs
  tuneable that turns this off and on.  That would allow the patch to land
  and simplify testing, and disable it if it later shows regressions.
- the i_last_group and i_last_start are unsigned long, but only need to be
  unsigned int like fe_group (ext4_group_t) and fe_start (ext4_grpblk_t).
  That saves 8 bytes per inode on 64-bit systems; the extra size didn't
  matter while the fields were in the superblock.
- there is no locking for the update of i_last_{group,start} if there are
  parallel writers on the same inode
- rather than dropping the global goal entirely, it would be better to have
  a list of groups that have relatively free space.  I know we also discussed
  using an rbtree to locate free extents instead of per-group buddy bitmaps.
  That would probably avoid the need for the global goal.

Cheers, Andreas
Andreas Dilger - April 8, 2014, 1:14 a.m.
On Mar 31, 2014, at 11:30 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Dec 02, 2013 at 05:32:06PM +0100, Lukáš Czerner wrote:
>> Hi all,
>> 
>> this is the patch I send a while ago to fix the issue I've seen with
>> a global allocation goal. This might no longer apply to the current
>> kernel and it might not be the best approach, but I use this example
>> just to start a discussion about those allocation goals and how to
>> use, or change them.
>> 
>> I think that we agree that the long term fix would be to have free
>> extent map. But we might be able to do something quickly, unless
>> someone commits to make the free extent map reality :)
> 
> Hi Andreas,
> 
> We discussed possibly applying this patch last week at the ext4
> workshop, possibly as early as for the 3.15 merge window:
> 
> 	http://patchwork.ozlabs.org/patch/295956/

Sorry for the delay in replying, I was away the past couple of weeks
on vacation, and am heading this week to the Lustre conference.

> However, I'm guessing that Lustre has the workload which is most
> likely to regress if we were to simply apply this patch.  But, it's
> likely it will improve things for many/most other ext4 workloads.

I definitely agree that this will help with SSD storage, and thinp
workloads that end up doing random IO to the underlying storage.
The main reason for using the global goal is to maximize the chance
of getting large contiguous allocations, since it does a round-robin
traversal of the groups and maximizes the time that other blocks can
be freed in each group.  It also minimizes the churn of allocating
large files in groups that may not have large contiguous extents (e.g.
if small files/extents have just been freed in a group, but not all
of the blocks in that group are freed).

The global goal not only reduces the chance of ending up with small free
chunks for the file, but also avoids the CPU overhead of re-scanning
partially full groups to find large free extents.

> We did talk about trying to assemble some block allocation performance
> tests so we can better measure proposed changes to the block
> allocator, but that's not something we have yet.  However, this global
> goal is definitely causing problems for a number of use cases,
> including thinp and being flash friendly.

An important question is whether thinp and flash are the main uses for ext4.
I could imagine ignoring the global goal if rotational == 0, or if some
large fraction of the filesystem was just deleted, but this change means
that each inode will have to re-scan the free groups to find large extents
and I think this will have a long-term negative impact on file extent size.

Note that the global goal is only used if the per-inode goal is already
full (i.e. ext4_mb_regular_allocator->ext4_mb_find_by_goal() fails), so
it will need to find a new allocation every time.

I think typical benchmarks that allocate and then free all space in a new
filesystem will not exercise this code properly.  A better test would be
something that allocates space but then only frees some of it, preferably
multi-threaded.

> Would you be willing to apply this patch and then run some benchmarks
> to see if Lustre would be impacted negatively if we were to apply this
> patch for the next development cycle (i.e., not for 3.15, but for the
> next merge window)?

Since most of the Lustre developers are travelling this week, it will be
hard to get this patch suitably tested quickly.  Also, Lustre servers are
not running on 3.14 kernels yet, so the patch would need to be backported
to a kernel that the Lustre server is currently running on.

Some minor notes about the patch itself:
- if there is a desire to land this patch quickly, it would be great to
 have this behaviour at least selectable via mount option and/or /sys/fs
 tuneable that turns this off and on.  That would allow the patch to land
 even if we haven't finished testing, or it later shows regressions.
- the i_last_group and i_last_start are unsigned long, but only need to be
 unsigned int like fe_group (ext4_group_t) and fe_start (ext4_grpblk_t).
 That saves 8 bytes per inode on 64-bit systems; the extra size didn't
 matter while the fields were in the superblock.
- there is no locking for the update of i_last_{group,start}
- rather than dropping the global goal entirely, it would be better to have
 a list of groups that have relatively free space.  I know we also discussed
 using an rbtree to locate the free extents.


Cheers, Andreas

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6ed348d..4dffa92 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -917,6 +917,10 @@  struct ext4_inode_info {
 
 	/* Precomputed uuid+inum+igen checksum for seeding inode checksums */
 	__u32 i_csum_seed;
+
+	/* where last allocation was done - for stream allocation */
+	unsigned long i_last_group;
+	unsigned long i_last_start;
 };
 
 /*
@@ -1242,9 +1246,6 @@  struct ext4_sb_info {
 	unsigned int s_mb_order2_reqs;
 	unsigned int s_mb_group_prealloc;
 	unsigned int s_max_dir_size_kb;
-	/* where last allocation was done - for stream allocation */
-	unsigned long s_mb_last_group;
-	unsigned long s_mb_last_start;
 
 	/* stats for buddy allocator */
 	atomic_t s_bal_reqs;	/* number of reqs with len > 1 */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0188e65..07d0434 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3702,6 +3702,10 @@  void ext4_truncate(struct inode *inode)
 	else
 		ext4_ind_truncate(handle, inode);
 
+	/* Invalidate last allocation counters */
+	ei->i_last_group = UINT_MAX;
+	ei->i_last_start = UINT_MAX;
+
 	up_write(&ei->i_data_sem);
 
 	if (IS_SYNC(inode))
@@ -4060,6 +4064,10 @@  struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 	inode->i_generation = le32_to_cpu(raw_inode->i_generation);
 	ei->i_block_group = iloc.block_group;
 	ei->i_last_alloc_group = ~0;
+
+	/* Invalidate last allocation counters */
+	ei->i_last_group = UINT_MAX;
+	ei->i_last_start = UINT_MAX;
 	/*
 	 * NOTE! The in-memory inode i_data array is in little-endian order
 	 * even on big-endian machines: we do NOT byteswap the block numbers!
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a9ff5e5..6c23666 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1591,7 +1591,6 @@  static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
 static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 					struct ext4_buddy *e4b)
 {
-	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int ret;
 
 	BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group);
@@ -1622,10 +1621,8 @@  static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	get_page(ac->ac_buddy_page);
 	/* store last allocated for subsequent stream allocation */
 	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		spin_lock(&sbi->s_md_lock);
-		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
-		sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
-		spin_unlock(&sbi->s_md_lock);
+		EXT4_I(ac->ac_inode)->i_last_group = ac->ac_f_ex.fe_group;
+		EXT4_I(ac->ac_inode)->i_last_start = ac->ac_f_ex.fe_start;
 	}
 }
 
@@ -2080,13 +2077,12 @@  ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			ac->ac_2order = i - 1;
 	}
 
-	/* if stream allocation is enabled, use global goal */
-	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		/* TBD: may be hot point */
-		spin_lock(&sbi->s_md_lock);
-		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
-		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
-		spin_unlock(&sbi->s_md_lock);
+	/* if stream allocation is enabled and per inode goal is
+	 * set, use it */
+	if ((ac->ac_flags & EXT4_MB_STREAM_ALLOC) &&
+	   (EXT4_I(ac->ac_inode)->i_last_start != UINT_MAX)) {
+		ac->ac_g_ex.fe_group = EXT4_I(ac->ac_inode)->i_last_group;
+		ac->ac_g_ex.fe_start = EXT4_I(ac->ac_inode)->i_last_start;
 	}
 
 	/* Let's just scan groups to find more-less suitable blocks */