Patchwork [RFC,1/7] ext4: Add EXT4_IOC_ADD_GLOBAL_ALLOC_RULE restricts block allocation

Submitter Akira Fujita
Date June 23, 2009, 8:25 a.m.
Message ID <4A409168.3020404@rs.jp.nec.com>
Permalink /patch/29033/
State New

Comments

Akira Fujita - June 23, 2009, 8:25 a.m.
ext4: Add EXT4_IOC_ADD_GLOBAL_ALLOC_RULE restricts block allocation

From: Akira Fujita <a-fujita@rs.jp.nec.com>

EXT4_IOC_ADD_GLOBAL_ALLOC_RULE adds a block allocation restriction to the
filesystem on which fd is located.

  #define EXT4_IOC_ADD_GLOBAL_ALLOC_RULE _IOW('f', 16, struct ext4_alloc_rule)

  struct ext4_alloc_rule {
        __u64 start;            /* first physical block this rule covers */
        __u64 len;              /* number of blocks covered by this rule */
        __u32 alloc_flag;       /* 0:mandatory, 1:advisory */
  };

  Note: Multiple block allocation restriction rules cannot be set
        on the same block.
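
A minimal user-space sketch of how a caller might fill in the rule and issue
the ioctl. It assumes a kernel with this RFC applied (on any other kernel the
call is expected to fail), mirrors the structure with fixed-width types from
<stdint.h> in place of __u64/__u32, and uses a hypothetical helper name,
add_global_alloc_rule().

```c
#include <stdint.h>
#include <sys/ioctl.h>

/* User-space mirror of the structure proposed in this patch. */
struct ext4_alloc_rule {
	uint64_t start;		/* first physical block this rule covers */
	uint64_t len;		/* number of blocks covered by this rule */
	uint32_t alloc_flag;	/* 0: mandatory, 1: advisory */
};

#define EXT4_MB_ALLOC_RULE_MANDATORY	0
#define EXT4_MB_ALLOC_RULE_ADVISORY	1
#define EXT4_IOC_ADD_GLOBAL_ALLOC_RULE	_IOW('f', 16, struct ext4_alloc_rule)

/*
 * Hypothetical helper: restrict blocks [start, start + len) on the
 * filesystem that holds fd.  Returns the ioctl() result: 0 on success,
 * -1 with errno set on failure (for example a bad fd, or a kernel
 * without this patch).
 */
static int add_global_alloc_rule(int fd, uint64_t start, uint64_t len,
				 uint32_t alloc_flag)
{
	struct ext4_alloc_rule rule = {
		.start = start,
		.len = len,
		.alloc_flag = alloc_flag,
	};

	return ioctl(fd, EXT4_IOC_ADD_GLOBAL_ALLOC_RULE, &rule);
}
```

A caller would open any file or directory on the target filesystem and pass
that descriptor, e.g. add_global_alloc_rule(fd, 0, 4000,
EXT4_MB_ALLOC_RULE_MANDATORY).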

The alloc_flag field of struct ext4_alloc_rule is set to "mandatory" or
"advisory".  Blocks restricted as "mandatory" are never used by the block
allocator.  In the "advisory" case, the block allocator is allowed to use
restricted blocks when there are no other free blocks on the FS.
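
The two modes can be summarised as a small decision function. This is only
an illustrative sketch, not the allocator code from the patch; the enum and
the function name may_allocate() are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical classification of a candidate block. */
enum rule_kind {
	RULE_NONE = -1,		/* block is not covered by any rule */
	RULE_MANDATORY = 0,	/* same value as alloc_flag "mandatory" */
	RULE_ADVISORY = 1,	/* same value as alloc_flag "advisory" */
};

/*
 * Decide whether the allocator may hand out a free block covered by a
 * rule of the given kind.  unrestricted_free_left says whether the FS
 * still has free blocks that are not covered by any rule.
 */
static bool may_allocate(enum rule_kind kind, bool unrestricted_free_left)
{
	switch (kind) {
	case RULE_MANDATORY:
		return false;			/* never used */
	case RULE_ADVISORY:
		return !unrestricted_free_left;	/* last resort only */
	default:
		return true;			/* unrestricted block */
	}
}
```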
 	 	
How block allocation restrictions are managed:
ext4_sb_info->s_bg_arule_list is a list of ext4_bg_alloc_rule_list
structures, each of which holds the restricted block ranges of one
block group.
 	
  ext4_sb_info
        |
  s_bg_arule_list ->[   bg1   ]->[   bg5   ] (struct ext4_bg_alloc_rule_list)
                        |             |
                    start:end    start:end
                    [   0:2000]  [8000:9000] (struct ext4_bg_alloc_rule)
                    [3000:3500]
                    [6000:8000]                  blocksize is 4096
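
The two-level layout in the diagram can be sketched in user space as
follows. This is a simplified stand-in, not the kernel structures: it uses
plain arrays and a next pointer instead of struct list_head, treats "end"
as inclusive, and marks every range as mandatory; all three choices are
assumptions, since the diagram does not state them.

```c
#include <stddef.h>

/* One restricted range inside a block group (group-relative blocks). */
struct bg_rule {
	unsigned int start;
	unsigned int end;	/* assumed inclusive */
	int alloc_flag;		/* 0 (mandatory) or 1 (advisory) */
};

/* All rules of one block group, chained to the next group with rules. */
struct bg_rule_list {
	unsigned int bg_num;
	const struct bg_rule *rules;
	size_t nr_rules;
	const struct bg_rule_list *next;
};

/* The ranges from the diagram above; alloc_flag 0 is an assumption. */
static const struct bg_rule bg1_rules[] = {
	{ 0, 2000, 0 }, { 3000, 3500, 0 }, { 6000, 8000, 0 },
};
static const struct bg_rule bg5_rules[] = {
	{ 8000, 9000, 0 },
};
static const struct bg_rule_list bg5 = { 5, bg5_rules, 1, NULL };
static const struct bg_rule_list bg1 = { 1, bg1_rules, 3, &bg5 };

/* Return the rule covering block blk of group bg, or NULL if none. */
static const struct bg_rule *find_rule(const struct bg_rule_list *head,
				       unsigned int bg, unsigned int blk)
{
	size_t i;

	for (; head; head = head->next) {
		if (head->bg_num != bg)
			continue;
		for (i = 0; i < head->nr_rules; i++)
			if (blk >= head->rules[i].start &&
			    blk <= head->rules[i].end)
				return &head->rules[i];
	}
	return NULL;
}
```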

Used block count with block allocation restriction:
ext4_bg_alloc_rule_list has two counters (mand_restricted_blks and
adv_restricted_blks) that hold how many free blocks are restricted by
mandatory and advisory rules respectively.  Only the mand_restricted_blks
counter reduces the FS free block count, since advisory blocks are still
used when there are no other free blocks.

  For example, when block allocation restrictions are set as follows,
  mand_restricted_blks is 3000 and adv_restricted_blks is 2000,
  so the FS's free block count becomes 2192 (6000-8191) in this case.


  blocks_per_group:                8192
  used blocks:                     3000-5999
  free blocks:                     0-2999, 6000-8191
  allocation rule (mandatory):     0-3999
  allocation rule (advisory):      5000-7999

  BG n |------------------------------|
       0                              8191
  used |         |-----------|        |
       0         3000     5999
  free |---------|           |--------|
       0      2999           6000     8191
  mand |**********----|
       0           3999
  adv                     |--+++++++++|
                          5000     7999

  mand_restricted_blks (*) is where "free" and "mand" overlap.
  adv_restricted_blks (+) is where "free" and "adv" overlap.
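
The arithmetic of this example can be checked with a short sketch that
computes the overlap between the free ranges and a rule. Ranges are treated
as inclusive, as in the table above; the type and function names are
hypothetical and are not taken from the patch.

```c
#include <stddef.h>

struct blk_range {
	unsigned int start;
	unsigned int end;	/* inclusive */
};

/* Number of blocks that two inclusive ranges have in common. */
static unsigned int overlap(struct blk_range a, struct blk_range b)
{
	unsigned int lo = a.start > b.start ? a.start : b.start;
	unsigned int hi = a.end < b.end ? a.end : b.end;

	return lo <= hi ? hi - lo + 1 : 0;
}

/* Free blocks of the group that are covered by the given rule. */
static unsigned int restricted_blks(const struct blk_range *free_ranges,
				    size_t nr_free, struct blk_range rule)
{
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < nr_free; i++)
		total += overlap(free_ranges[i], rule);
	return total;
}

/* Values from the example above. */
static const struct blk_range example_free[] = { { 0, 2999 }, { 6000, 8191 } };
static const struct blk_range example_mand = { 0, 3999 };
static const struct blk_range example_adv = { 5000, 7999 };
```

With these values, restricted_blks() gives 3000 for the mandatory rule and
2000 for the advisory rule. The group has 3000 + 2192 = 5192 free blocks,
and subtracting only the mandatory count leaves 2192.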


Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
---
 fs/ext4/balloc.c  |   22 ++-
 fs/ext4/ext4.h    |   37 +++++-
 fs/ext4/ioctl.c   |   16 ++
 fs/ext4/mballoc.c |  438 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/ext4/super.c   |    9 +-
 5 files changed, 512 insertions(+), 10 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger - June 23, 2009, 11:19 p.m.
On Jun 23, 2009  17:25 +0900, Akira Fujita wrote:
> alloc_flag of ext4_alloc_rule structure is set as "mandatory" or "advisory".
> Restricted blocks with "mandatory" are never used by block allocator.
> But in "advisory" case, block allocator is allowed to use restricted blocks
> when there are no free blocks on FS.

Would it make more sense to implement the range protections via the
existing preallocation ranges (PA)?  An inode can have multiple
PAs attached to it to have it prefer allocations from that range.

We could also attach PAs to the superblock to prevent other files from
allocating out of those ranges.  This would work better with the existing
allocation code instead of creating a second similar mechanism.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Greg Freemyer - June 24, 2009, 12:02 a.m.
On Tue, Jun 23, 2009 at 7:19 PM, Andreas Dilger<adilger@sun.com> wrote:
> On Jun 23, 2009  17:25 +0900, Akira Fujita wrote:
>> alloc_flag of ext4_alloc_rule structure is set as "mandatory" or "advisory".
>> Restricted blocks with "mandatory" are never used by block allocator.
>> But in "advisory" case, block allocator is allowed to use restricted blocks
>> when there are no free blocks on FS.
>
> Would it make more sense to implement the range protections via the
> existing preallocation ranges (PA)?  An inode can have multiple
> PAs attached to it to have it prefer allocations from that range.
>
> We could also attach PAs to the superblock to prevent other files from
> allocating out of those ranges.  This would work better with the existing
> allocation code instead of creating a second similar mechanism.
>
> Cheers, Andreas

Andreas,

Where can I find documentation about how PA works?  Or is it just in
the source?  If so, what are one or two calls that cause the PA ranges
to be set, etc.

Thanks
Greg
Andreas Dilger - June 24, 2009, 12:11 a.m.
On Jun 23, 2009  20:02 -0400, Greg Freemyer wrote:
> On Tue, Jun 23, 2009 at 7:19 PM, Andreas Dilger<adilger@sun.com> wrote:
> > On Jun 23, 2009  17:25 +0900, Akira Fujita wrote:
> >> alloc_flag of ext4_alloc_rule structure is set as "mandatory" or "advisory".
> >> Restricted blocks with "mandatory" are never used by block allocator.
> >> But in "advisory" case, block allocator is allowed to use restricted blocks
> >> when there are no free blocks on FS.
> >
> > Would it make more sense to implement the range protections via the
> > existing preallocation ranges (PA)?  An inode can have multiple
> > PAs attached to it to have it prefer allocations from that range.
> >
> > We could also attach PAs to the superblock to prevent other files from
> > allocating out of those ranges.  This would work better with the existing
> > allocation code instead of creating a second similar mechanism.
> 
> Where can I find documentation about how PA works?  Or is it just in
> the source?  If so, what are one or two calls that cause the PA ranges
> to be set, etc.

Aneesh is the expert on the preallocation code.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Greg Freemyer - June 25, 2009, 12:47 p.m.
Aneesh,

Can you provide the below requested info?

Thanks
Greg

On Tue, Jun 23, 2009 at 8:11 PM, Andreas Dilger<adilger@sun.com> wrote:
> On Jun 23, 2009  20:02 -0400, Greg Freemyer wrote:
>> On Tue, Jun 23, 2009 at 7:19 PM, Andreas Dilger<adilger@sun.com> wrote:
>> > On Jun 23, 2009  17:25 +0900, Akira Fujita wrote:
>> >> alloc_flag of ext4_alloc_rule structure is set as "mandatory" or "advisory".
>> >> Restricted blocks with "mandatory" are never used by block allocator.
>> >> But in "advisory" case, block allocator is allowed to use restricted blocks
>> >> when there are no free blocks on FS.
>> >
>> > Would it make more sense to implement the range protections via the
>> > existing preallocation ranges (PA)?  An inode can have multiple
>> > PAs attached to it to have it prefer allocations from that range.
>> >
>> > We could also attach PAs to the superblock to prevent other files from
>> > allocating out of those ranges.  This would work better with the existing
>> > allocation code instead of creating a second similar mechanism.
>>
>> Where can I find documentation about how PA works?  Or is it just in
>> the source?  If so, what are one or two calls that cause the PA ranges
>> to be set, etc.
>
> Aneesh is the expert on the preallocation code.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
Aneesh Kumar K.V - June 27, 2009, 4:54 p.m.
On Thu, Jun 25, 2009 at 08:47:59AM -0400, Greg Freemyer wrote:
> Aneesh,
> 
> Can you provide the below requested info?
> 
> Thanks
> Greg
> 
> On Tue, Jun 23, 2009 at 8:11 PM, Andreas Dilger<adilger@sun.com> wrote:
> > On Jun 23, 2009  20:02 -0400, Greg Freemyer wrote:
> >> On Tue, Jun 23, 2009 at 7:19 PM, Andreas Dilger<adilger@sun.com> wrote:
> >> > On Jun 23, 2009  17:25 +0900, Akira Fujita wrote:
> >> >> alloc_flag of ext4_alloc_rule structure is set as "mandatory" or "advisory".
> >> >> Restricted blocks with "mandatory" are never used by block allocator.
> >> >> But in "advisory" case, block allocator is allowed to use restricted blocks
> >> >> when there are no free blocks on FS.
> >> >
> >> > Would it make more sense to implement the range protections via the
> >> > existing preallocation ranges (PA)?  An inode can have multiple
> >> > PAs attached to it to have it prefer allocations from that range.
> >> >
> >> > We could also attach PAs to the superblock to prevent other files from
> >> > allocating out of those ranges.  This would work better with the existing
> >> > allocation code instead of creating a second similar mechanism.
> >>
> >> Where can I find documentation about how PA works?  Or is it just in
> >> the source?  If so, what are one or two calls that cause the PA ranges
> >> to be set, etc.
> >

Mostly the source. Some of the mballoc details are documented in the OLS
2008 paper. As for the functions:

ext4_mb_use_preallocated -> allocate from PA
ext4_mb_new_preallocation -> create a new PA

The source code also has some documentation that explains how mballoc uses
PA.

-aneesh
Akira Fujita - Aug. 7, 2009, 6:42 a.m.
Hi Andreas,

Andreas Dilger wrote:
> On Jun 23, 2009  17:25 +0900, Akira Fujita wrote:
>> alloc_flag of ext4_alloc_rule structure is set as "mandatory" or "advisory".
>> Restricted blocks with "mandatory" are never used by block allocator.
>> But in "advisory" case, block allocator is allowed to use restricted blocks
>> when there are no free blocks on FS.
> 
> Would it make more sense to implement the range protections via the
> existing preallocation ranges (PA)?  An inode can have multiple
> PAs attached to it to have it prefer allocations from that range.
> 
> We could also attach PAs to the superblock to prevent other files from
> allocating out of those ranges.  This would work better with the existing
> allocation code instead of creating a second similar mechanism.

Thank you for your comments.

I have considered block allocation control with preallocation (PA).
This is my new implementation idea.

a. Block allocation restriction (balloc restriction)
   The redesigned balloc restriction ioctl (EXT4_IOC_BALLOC_CONTROL) can set
   and clear protected ranges depending on a flag.
   Balloc restriction uses a new PA type (MB_RESTRICT_PA),
   distinct from inode PA (MB_INODE_PA) and group PA (MB_GROUP_PA).

   My previous patch set implemented two restriction types: mandatory
   (never used by the block allocator) and advisory (used if there are
   no other free blocks to allocate).
   To keep things simple, I implement only mandatory mode this time.

   With the "SET_BALLOC_RESTRICTION" flag, this ioctl sets an MB_RESTRICT_PA,
   and the blocks this PA covers are protected from other block allocation.
   If you want to use these ranges again, call the ioctl with the
   "CLR_BALLOC_RESTRICTION" flag.

   EXT4_IOC_BALLOC_CONTROL calls ext4_mb_new_blocks(), which uses the mballoc
   routines to check whether the blocks in the specified range are free.
   If they are free, ext4_mb_new_blocks() marks them used in the in-memory
   block bitmap (same as for an ext4 PA), and then adds this information to
   the restriction PA.  It does *not* mark them used in the on-disk block
   bitmap, because these blocks are part of a PA.

   ext4_prealloc_space has a new list "pa_restrict_list" which holds
   the restriction PAs passed from user-space.
   ext4_group_info also has a new list "bb_restrict_request" which holds
   the restriction ranges related to that block group.
   This list is used to calculate the number of blocks that are free
   but cannot be used because of restriction PAs.


b. Preferred block allocation for inode (preferred balloc)
   EXT4_IOC_ADD_PREALLOC adds the specified blocks to the inode PA.
   You can add arbitrary block ranges to the inode PA;
   this is the difference from fallocate.


   An ext4 inode PA is removed when the file is closed, so it is not
   necessary to implement clearing of the inode PA.

The ioctl interfaces are as follows.

a. EXT4_IOC_BALLOC_CONTROL (Set or clear balloc restriction)

     #define EXT4_IOC_BALLOC_CONTROL \
	_IOW('f', 16, struct ext4_balloc_control)

     struct ext4_balloc_control {
       __u64 start; /* first physical block of the restricted range */
       __u64 len;   /* length in blocks */
       __u32 flags; /* set or clear */
     };

    "flags" takes one of the following two values.
    - SET_BALLOC_RESTRICTION
        Add the blocks in the range to the balloc restriction list.
    - CLR_BALLOC_RESTRICTION
        Remove the blocks from the balloc restriction list.
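
A toy user-space model of the set/clear behaviour described above. The flag
values, the fixed-size table, the error codes, and the rules that a new range
must not overlap an existing one and that a clear must match a previously set
range exactly are all assumptions made for this sketch; the proposal does not
specify them.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SET_BALLOC_RESTRICTION	0x1	/* hypothetical value */
#define CLR_BALLOC_RESTRICTION	0x2	/* hypothetical value */

/* User-space mirror of the proposed structure. */
struct ext4_balloc_control {
	uint64_t start;
	uint64_t len;
	uint32_t flags;
};

#define MAX_RESTRICT	16

/* Toy stand-in for the balloc restriction list. */
struct restrict_list {
	struct ext4_balloc_control r[MAX_RESTRICT];
	size_t nr;
};

static int balloc_control(struct restrict_list *l,
			  const struct ext4_balloc_control *c)
{
	size_t i;

	if (c->len == 0)
		return -EINVAL;

	switch (c->flags) {
	case SET_BALLOC_RESTRICTION:
		/* Refuse a range that overlaps an existing restriction. */
		for (i = 0; i < l->nr; i++)
			if (c->start < l->r[i].start + l->r[i].len &&
			    l->r[i].start < c->start + c->len)
				return -EEXIST;
		if (l->nr == MAX_RESTRICT)
			return -ENOSPC;
		l->r[l->nr++] = *c;
		return 0;
	case CLR_BALLOC_RESTRICTION:
		/* Drop the entry that matches the range exactly. */
		for (i = 0; i < l->nr; i++)
			if (l->r[i].start == c->start &&
			    l->r[i].len == c->len) {
				l->r[i] = l->r[--l->nr];
				return 0;
			}
		return -ENOENT;
	default:
		return -EINVAL;
	}
}
```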

b. EXT4_IOC_ADD_PREALLOC (Add inode preferred range)

     #define EXT4_IOC_ADD_PREALLOC _IOW('f', 18, struct ext4_balloc_control)

     struct ext4_balloc_control {
       __u64 start; /* first physical block */
       __u64 len;   /* length in blocks */
       __u32 flags; /* create and add mode for inode PA */
     };

    "flags" must include one of the following create modes
    (MANDATORY or ADVISORY).  In addition, one of the control modes must also
    be set (REPLACE_INODE_PREALLOC or ADD_INODE_PREALLOC).
     Create modes:
     - MANDATORY
         Find a free extent which satisfies "start" and "len" completely.
     - ADVISORY
         Try to find a free extent from the "start" and "len" blocks.
     Control modes:
     - REPLACE_INODE_PREALLOC
         Remove the existing inode PA first, and then add the specified
         range to the inode PA list.
     - ADD_INODE_PREALLOC
         Add the specified range to the inode PA list.

     e.g.  flags = MANDATORY | ADD_INODE_PREALLOC
           Find a free extent which fulfills the requirements completely,
           and if that succeeds, add this extent to the inode PA.
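
The rule that "flags" carries exactly one create mode and exactly one control
mode could be checked as in the sketch below. The bit values and the function
name are hypothetical; the proposal names the flags but does not assign
values to them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments for the proposed flags. */
#define MANDATORY		0x1
#define ADVISORY		0x2
#define REPLACE_INODE_PREALLOC	0x4
#define ADD_INODE_PREALLOC	0x8

/* True if exactly one bit of x is set. */
static bool one_bit(uint32_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/* Validate the "flags" field of an EXT4_IOC_ADD_PREALLOC request. */
static bool add_prealloc_flags_valid(uint32_t flags)
{
	uint32_t create = flags & (MANDATORY | ADVISORY);
	uint32_t control = flags &
			   (REPLACE_INODE_PREALLOC | ADD_INODE_PREALLOC);

	/* Reject bits that this sketch does not define. */
	if (flags & ~(uint32_t)(MANDATORY | ADVISORY |
				REPLACE_INODE_PREALLOC | ADD_INODE_PREALLOC))
		return false;

	return one_bit(create) && one_bit(control);
}
```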

Regards,
Akira Fujita

Greg Freemyer - Aug. 9, 2009, 6:18 p.m.
Akira-san,

I joined the ohsm project team a couple of weeks ago and we hope to build
on your patches / features.  Below is our feedback as it relates
to ohsm, as well as my personal feedback.

2009/8/7 Akira Fujita <a-fujita@rs.jp.nec.com>:
> Hi Andreas,
>
> Andreas Dilger wrote:
>> On Jun 23, 2009  17:25 +0900, Akira Fujita wrote:
>>> alloc_flag of ext4_alloc_rule structure is set as "mandatory" or "advisory".
>>> Restricted blocks with "mandatory" are never used by block allocator.
>>> But in "advisory" case, block allocator is allowed to use restricted blocks
>>> when there are no free blocks on FS.
>>
>> Would it make more sense to implement the range protections via the
>> existing preallocation ranges (PA)?  An inode can have multiple
>> PAs attached to it to have it prefer allocations from that range.
>>
>> We could also attach PAs to the superblock to prevent other files from
>> allocating out of those ranges.  This would work better with the existing
>> allocation code instead of creating a second similar mechanism.
>
> Thank you for comments.
>
> I have considered about the block allocation control with preallocation (PA).
> This is my new implementation idea.
>
> a. Block allocation restriction (balloc restriction)
>   Redesigned balloc restriction ioctl (EXT4_IOC_BALOC_CONTROL) can set
>   and clear protected ranges with flag.
>   And balloc restriction used a new type PA (MB_RESTRICT_PA),
>   not inode PA (MB_INODE_PA) and group PA (MB_GROUP_PA).
>
>   Previous my patch set has implemented two restriction types: mandatory
>   (never used by block allocator) and advisory (used if there is
>   no other free blocks to allocate).
>   But, to make more simple, I implement only mandatory mode.

The ohsm team has no current specific plan to use "Block allocation
restriction", but if we did it would be in the advisory role.  We
agree this functionality can be added later when there is an actual
user.

>   With "SET_BALLOC_RESTRICTION" flag, this ioctl sets MB_RESTRIT_PA,
>   and blocks in this PA covers are protected from other block allocator.
>   If you want to use these ranges, call with "CLR_BALLOC_RESTRICTIOIN" flag.
>
>   EXT4_IOC_BALLOC_CONTROL calls ext4_mb_new_blocks().  It tries to check
>   whether specified range blocks are free or not with mballoc routine.
>   If range blocks are free, ext4_mb_new_blocks() sets memory block bitmap
>   used (same as ext4 PA), and then adds this information to restriction PA.
>   But it does *not* set disk block bitmap used, because these blocks are part of PA.
>
>   ext4_prealloc_space has a new list structure "pa_restrict_list" which holds
>   restriction PA passed from user-space.
>   ext4_group_info also has a new list structure "bb_restrict_request" which holds
>   block group related restriction range.
>   This list is used, when we calculate blocks count which are free
>   but can not use because of restriction PA.

Can't say I know enough to comment on the implementation details.

>
> b. Preferred block allocation for inode (preferred balloc)
>   EXT4_IOC_ADD_PREALLOC adds specified blocks to the inode PA.
>   You can set arbitrary blocks ranges to inode PA,
>   this is the different from fallocate.
>

This function is the core functionality that ohsm still needs from
ext4, and we look forward to seeing actual functioning patches, and in
turn those eventually getting pushed to Linus.

>
>   Ext4 inode PA is removed when file is closed, therefore it is not
>   necessary to implement to clear inode PA.

That is fine from ohsm perspective.  Possibly there are other use
cases that need a longer lifetime?

> Ioctl interfaces are as follows.
>
> a. EXT4_IOC_BALLOC_CONTROL (Set or clear balloc restriction)
>
>     EXT4_IOC_BALLOC_CONTROL
>        _IOW('f', 16, struct ext4_balloc_control balloc_control)
>
>     struct ext4_balloc_control {
>       __u64 start; /* start physical block offset balloc rest */
>       __u64 len;   /* block length */
>       __u32 flags; /* set or clear */
>     }
>
>    "flags" can be set following 2 types.
>    - SET_BALLOC_RESTRICTION
>        Set blocks in range to the balloc restriction list.
>    - CLR_BALLOC_RESTRICTION
>        Clear blocks from the balloc restriction list.

ohsm will be an in-kernel user of the above, so we hope a kernel API
is also provided.  I assume that would be a simple export, plus
documenting it in Documentation/filesystems/ext4.

It seems you need to add 3 flags to the above:
- mandatory
    Have a future block allocate request return ENO_SPACE_PA
    if the blocks cannot be found within the restricted range.
- advisory
    Attempt future block allocate requests from the restricted
    range, but use entire unrestricted block range if that fails.
- mandatory_with_fallback (Not Implemented)
    If block allocate from restricted range fails, fallback to an
    alternate block range.  API and implementation details not yet
    agreed on.

As to mandatory_with_fallback, we (the ohsm team) are looking for
feedback on the below proposal:

The ohsm team envisions submitting subsequent patches to enhance the
ext4 block allocator function such that it makes a callout to ohsm if
a block allocation fails from the current restricted block range.
Possibly by adding an init routine that would allow ohsm to register a
callout routine for the ENO_SPACE_PA condition. This can be thought of
as an inotify-type situation for that one case.

After making the callout (to ohsm or other registered kernel user), we
would like to see the block allocation re-attempted.

This would allow ohsm to eventually have multiple tiers of preferred
storage. And if one tier is not able to provide the requested blocks,
an alternate block range could be set.  We envision the ohsm callout
function in turn calling the EXT4_IOC_BALLOC_CONTROL kernel API to set
the alternate block range.  Thus the block allocator function would
need to be made aware of this possibility.
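
A rough user-space sketch of the callout-and-retry idea described above.
None of these names exist in ext4: register_nospace_callout(),
alloc_with_callout() and the numeric value of ENO_SPACE_PA are made up for
illustration, and the "tiers" are a toy stand-in for block ranges.

```c
#include <stddef.h>

#define ENO_SPACE_PA	1000	/* hypothetical error number */

typedef void (*nospace_callout_t)(void *data);

static nospace_callout_t nospace_callout;
static void *nospace_data;

/* Register the routine to call when the restricted range is exhausted. */
static void register_nospace_callout(nospace_callout_t fn, void *data)
{
	nospace_callout = fn;
	nospace_data = data;
}

/*
 * Try an allocation; on -ENO_SPACE_PA give the registered user one
 * chance to set an alternate range, then re-attempt once.
 */
static int alloc_with_callout(int (*try_alloc)(void *), void *arg)
{
	int err = try_alloc(arg);

	if (err == -ENO_SPACE_PA && nospace_callout) {
		nospace_callout(nospace_data);
		err = try_alloc(arg);
	}
	return err;
}

/* Toy model: two storage tiers, each with a free block count. */
struct tier_state {
	int tier;
	int free_blks[2];
};

/* Stand-in for one allocation attempt from the current tier. */
static int demo_try_alloc(void *arg)
{
	struct tier_state *s = arg;

	if (s->free_blks[s->tier] == 0)
		return -ENO_SPACE_PA;
	s->free_blks[s->tier]--;
	return 0;
}

/* Stand-in for the ohsm callout: move from tier 0 to tier 1. */
static void demo_switch_tier(void *data)
{
	struct tier_state *s = data;

	if (s->tier == 0)
		s->tier = 1;
}
```

Retrying only once keeps the sketch free of loops between the allocator and
the callout; whether a real implementation would retry more often is left
open in the proposal.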

Again, the above is mostly our future plans / enhancements to the
initial primary patch and is provided just to let everyone keep ohsm's
needs in mind as the patch is written / reviewed.  i.e. ohsm is the only
known use case for this routine other than defrag at present, so we
thought it useful to explain how ohsm would utilize / enhance this
function.

> b. EXT4_IOC_ADD_PREALLOC (Add inode preferred range)
>
>     EXT4_IOC_ADD_PREALLOC _IOW('f', 18, struct ext4_balloc_control)
>
>     struct ext4_balloc_control {
>       __u64 start; /* start physical block offset */
>       __u64 len;   /* block length */
>       __u32 flags;  /* create and add mode for inode PA  */
>     }
>
>    "flags" must include one of the following create modes
>    (MANDATORY or ADVISORY).  In addition, one of the control modes also must
>    be set (REPLACER_INODE_PREALLOC or ADD_INODE_PREALLOC).
>     Create modes:
>     - MANDATORY
>         Find free extent which satisfies "start" and "len" completely.
>     - ADVISORY
>         Try to find free extent from "start" and "len" blocks.
>     Control modes:
>     - REPLACE_INODE_PREALLOC
>         Remove existed inode PA first, and then add specified range to
>         the inode PA list newly.
>     - ADD_INODE_PREALLOC
>         Add specified range to the inode PA list.
>
>     e.g.  flag = MANDATORY | ADD_INODE_PREALLOC
>           Find free extent which fulfills the requirements completely,
>           and if succeed, add this extent to the inode PA.

I am unsure how the above relates to EXT4_IOC_BALLOC_CONTROL.  It
appears to be totally independent which I don't think is a good idea.
Nor do I understand the use case of the advisory flag and
add_inode_prealloc flag.

I would prefer if the above API were simplified to:

b. EXT4_IOC_RESET_PREALLOC (Ensure inode prealloc range is within
preferred block alloc range)

    EXT4_IOC_ADD_PREALLOC _IOW('f', 18, struct ext4_balloc_control)

    struct ext4_balloc_control {
      __u32 flags;  /* Currently unused  */
    }

Find an appropriate free prealloc block extent within the range set for
the inode via EXT4_IOC_BALLOC_CONTROL.

If unable to do so, a prealloc block is set via the default logic and
an error is returned to show that the prealloc block is not within the
restricted block range.

This seems far simpler to code, understand, and use.

> Regards,
> Akira Fujita

Thanks
Greg
--
Greg Freemyer
Member of OHSM devel team
http://sourceforge.net/projects/ohsm/
Kazuya Mio - Oct. 16, 2009, 6:47 a.m.
Hi Greg,
I'm sorry for the late reply.

2009/08/10 3:18, Greg Freemyer wrote:
> Akira-san,
> 
> I joined the project ohsm team a couple weeks ago and we hope to use
> your patches / features to build on.  Below is our feedback as relates
> to ohsm as well as my personal feedback.
> 
[snip]
>> > Ioctl interfaces are as follows.
>> >
>> > a. EXT4_IOC_BALLOC_CONTROL (Set or clear balloc restriction)
>> >
>> >     EXT4_IOC_BALLOC_CONTROL
>> >        _IOW('f', 16, struct ext4_balloc_control balloc_control)
>> >
>> >     struct ext4_balloc_control {
>> >       __u64 start; /* start physical block offset balloc rest */
>> >       __u64 len;   /* block length */
>> >       __u32 flags; /* set or clear */
>> >     }
>> >
>> >    "flags" can be set following 2 types.
>> >    - SET_BALLOC_RESTRICTION
>> >        Set blocks in range to the balloc restriction list.
>> >    - CLR_BALLOC_RESTRICTION
>> >        Clear blocks from the balloc restriction list.
> 
> ohsm will be an in kernel user of the above, so we hope a kernel API
> is also provided.  I assume that would be a simple export and
> documenting it in Documentation/filesystems/ext4.
> 

When we implement this ioctl, we will consider a kernel API.
However, we currently have no plan to provide one.

We will also consider the documentation once the details of the
implementation are decided.

> It seems you need to add 3 flags to the above:
> mandatory - Have a future block allocate request return ENO_SPACE_PA
> if the blocks cannot be found within the restricted range.
> advisory - Attempt future block allocate requests from the restricted
> range, but use entire unrestricted block range if that fails.
> mandatory_with_fallback - Not Implemented - If block allocate from
> restricted range fails, fallback to an alternate block range.  API and
> implementation details not yet agreed on.

I'm worried that you have misunderstood the restriction PA.
If you set a restriction PA, the block allocator cannot allocate
blocks from that range.  Specifying a block range to be allocated
from first is the role of the inode PA.

> 
[snip]
> 
>> > b. EXT4_IOC_ADD_PREALLOC (Add inode preferred range)
>> >
>> >     EXT4_IOC_ADD_PREALLOC _IOW('f', 18, struct ext4_balloc_control)
>> >
>> >     struct ext4_balloc_control {
>> >       __u64 start; /* start physical block offset */
>> >       __u64 len;   /* block length */
>> >       __u32 flags;  /* create and add mode for inode PA  */
>> >     }
>> >
>> >    "flags" must include one of the following create modes
>> >    (MANDATORY or ADVISORY).  In addition, one of the control modes also must
>> >    be set (REPLACER_INODE_PREALLOC or ADD_INODE_PREALLOC).
>> >     Create modes:
>> >     - MANDATORY
>> >         Find free extent which satisfies "start" and "len" completely.
>> >     - ADVISORY
>> >         Try to find free extent from "start" and "len" blocks.
>> >     Control modes:
>> >     - REPLACE_INODE_PREALLOC
>> >         Remove existed inode PA first, and then add specified range to
>> >         the inode PA list newly.
>> >     - ADD_INODE_PREALLOC
>> >         Add specified range to the inode PA list.
>> >
>> >     e.g.  flag = MANDATORY | ADD_INODE_PREALLOC
>> >           Find free extent which fulfills the requirements completely,
>> >           and if succeed, add this extent to the inode PA.
> 
> I am unsure how the above relates to EXT4_IOC_BALLOC_CONTROL.  It
> appears to be totally independent which I don't think is a good idea.
> Nor do I understand the use case of the advisory flag and
> add_inode_prealloc flag.
> 
> I would prefer if the above API were simplified to:
> 
> b. EXT4_IOC_RESET_PREALLOC (Ensure inode prealloc range is withing
> preferred block alloc range)
> 
>     EXT4_IOC_ADD_PREALLOC _IOW('f', 18, struct ext4_balloc_control)
> 
>     struct ext4_balloc_control {
>       __u32 flags;  /* Currently unused  */
>     }
> 
> Find appropriate free prealloc block extent within range set of inode
> via EXT4_IOC_BALLOC_CONTROL.
> 
> If unable to do so, a preallock block is set via the default logic and
> a error is returned to show that the prealloc block is not within the
> restricted block range.
> 
> This seems far simpler to code, understand, and use.
> 

In the advisory mode of the inode PA, the ioctl tries to get an inode PA
that satisfies "start" and "len" completely. If that fails, the ioctl
gets an inode PA from somewhere else. Your suggestion seems to be like
this mode. However, the mandatory mode of the inode PA is necessary for
e4defrag, so struct ext4_balloc_control has a flags field that can select
either mode.

In the ADD_INODE_PREALLOC mode of the inode PA, the ioctl appends an inode PA
without changing any existing inode PAs. In hindsight, this flag
is not necessary for e4defrag, so I will remove the flags
ADD_INODE_PREALLOC and REPLACE_INODE_PREALLOC.
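
The difference between the two create modes can be illustrated with a toy
free-block map. This is a sketch only: the names are hypothetical, and
"first fit from block 0" as the fallback search is an assumption, since the
thread only says the advisory mode gets an inode PA from somewhere.

```c
#include <stdbool.h>
#include <stddef.h>

enum pa_mode { PA_MANDATORY, PA_ADVISORY };

/* True if blocks [start, start + len) exist and are all free. */
static bool range_free(const bool *is_free, size_t nblocks,
		       size_t start, size_t len)
{
	size_t i;

	if (len == 0 || start > nblocks || len > nblocks - start)
		return false;
	for (i = start; i < start + len; i++)
		if (!is_free[i])
			return false;
	return true;
}

/*
 * Pick the extent for a new inode PA.  Returns its first block, or -1
 * if no extent can be used.  Both modes first try the exact range;
 * only advisory mode falls back to another free extent of the same
 * length.
 */
static long pick_inode_pa(const bool *is_free, size_t nblocks,
			  size_t start, size_t len, enum pa_mode mode)
{
	size_t s;

	if (range_free(is_free, nblocks, start, len))
		return (long)start;
	if (mode == PA_MANDATORY)
		return -1;
	for (s = 0; s + len <= nblocks; s++)
		if (range_free(is_free, nblocks, s, len))
			return (long)s;
	return -1;
}

/* Toy map: blocks 0-1 and 4-6 are free, the rest are in use. */
static const bool demo_free[8] = {
	true, true, false, false, true, true, true, false,
};
```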

Regards,
Kazuya Mio


Patch

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index e2126d7..7f08069 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -546,35 +546,45 @@  void ext4_free_blocks(handle_t *handle, struct inode *inode,
  */
 int ext4_has_free_blocks(struct ext4_sb_info *sbi, s64 nblocks)
 {
-	s64 free_blocks, dirty_blocks, root_blocks;
+	s64 free_blocks, dirty_blocks, root_blocks, restricted_blocks;
 	struct percpu_counter *fbc = &sbi->s_freeblocks_counter;
 	struct percpu_counter *dbc = &sbi->s_dirtyblocks_counter;
+	struct percpu_counter *rbc = &sbi->s_restricted_blocks_counter;

 	free_blocks  = percpu_counter_read_positive(fbc);
 	dirty_blocks = percpu_counter_read_positive(dbc);
+	restricted_blocks = percpu_counter_read_positive(rbc);
 	root_blocks = ext4_r_blocks_count(sbi->s_es);

-	if (free_blocks - (nblocks + root_blocks + dirty_blocks) <
-						EXT4_FREEBLOCKS_WATERMARK) {
+	if (free_blocks - (nblocks + root_blocks + dirty_blocks +
+	    restricted_blocks) < EXT4_FREEBLOCKS_WATERMARK) {
 		free_blocks  = percpu_counter_sum_positive(fbc);
 		dirty_blocks = percpu_counter_sum_positive(dbc);
+		restricted_blocks = percpu_counter_sum_positive(rbc);
 		if (dirty_blocks < 0) {
 			printk(KERN_CRIT "Dirty block accounting "
 					"went wrong %lld\n",
 					(long long)dirty_blocks);
 		}
+		if (restricted_blocks < 0) {
+			printk(KERN_CRIT "Restricted block accounting "
+					"went wrong %lld\n",
+					(long long)restricted_blocks);
+		}
 	}
 	/* Check whether we have space after
-	 * accounting for current dirty blocks & root reserved blocks.
+	 * accounting for current dirty blocks, root reserved blocks
+	 * and allocation-restricted blocks.
 	 */
-	if (free_blocks >= ((root_blocks + nblocks) + dirty_blocks))
+	if (free_blocks >= ((root_blocks + nblocks) +
+	    dirty_blocks + restricted_blocks))
 		return 1;

 	/* Hm, nope.  Are (enough) root reserved blocks available? */
 	if (sbi->s_resuid == current_fsuid() ||
 	    ((sbi->s_resgid != 0) && in_group_p(sbi->s_resgid)) ||
 	    capable(CAP_SYS_RESOURCE)) {
-		if (free_blocks >= (nblocks + dirty_blocks))
+		if (free_blocks >= (nblocks + dirty_blocks + restricted_blocks))
 			return 1;
 	}

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0ca49f4..1d2d550 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -88,7 +88,10 @@  typedef unsigned int ext4_group_t;
 #define EXT4_MB_HINT_TRY_GOAL		512
 /* blocks already pre-reserved by delayed allocation */
 #define EXT4_MB_DELALLOC_RESERVED      1024
-
+/* some part of filesystems are in-allocatable */
+#define EXT4_MB_BLOCKS_RESTRICTED      2048
+/* use in-allocatable blocks that have advisory flag */
+#define EXT4_MB_ALLOC_ADVISORY	       4096

 struct ext4_allocation_request {
 	/* target inode for block we're allocating */
@@ -353,6 +356,7 @@  struct ext4_new_group_data {
  /* note ioctl 11 reserved for filesystem-independent FIEMAP ioctl */
 #define EXT4_IOC_ALLOC_DA_BLKS		_IO('f', 12)
 #define EXT4_IOC_MOVE_EXT		_IOWR('f', 15, struct move_extent)
+#define EXT4_IOC_ADD_GLOBAL_ALLOC_RULE	_IOW('f', 16, struct ext4_alloc_rule)

 /*
  * ioctl commands in 32 bit emulation
@@ -373,6 +377,31 @@  struct ext4_new_group_data {

 #define EXT4_IOC_DEBUG_DELALLOC		_IO('f', 42)

+#define EXT4_MB_ALLOC_RULE_MANDATORY	0
+#define EXT4_MB_ALLOC_RULE_ADVISORY	1
+
+struct ext4_alloc_rule {
+	__u64 start;		/* first physical block this rule covers */
+	__u64 len;		/* number of blocks covered by this rule */
+	__u32 alloc_flag;	/* 0: mandatory, 1: advisory */
+};
+
+struct ext4_bg_alloc_rule_list {
+	struct list_head bg_arule_list;	/* blockgroup list */
+	ext4_group_t bg_num;		/* blockgroup number */
+	ext4_grpblk_t mand_restricted_blks; /* number of blocks restricted by mandatory allocation rules */
+	ext4_grpblk_t adv_restricted_blks;  /* number of blocks restricted by advisory allocation rules */
+	struct list_head arule_list;	/* the range in the blockgroup */
+};
+
+struct ext4_bg_alloc_rule {
+	struct list_head arule_list;	/* the range in the blockgroup */
+	struct list_head tmp_list;	/* to add ext4_alloc_rule */
+	ext4_grpblk_t start;		/* first restricted block in group */
+	ext4_grpblk_t end;		/* last restricted block in group */
+	int alloc_flag;			/* 0(mandatory) or 1(advisory) */
+};
+
 /*
  *  Mount options
  */
@@ -877,6 +906,7 @@  struct ext4_sb_info {
 	struct percpu_counter s_dirs_counter;
 	struct percpu_counter s_dirtyblocks_counter;
 	struct blockgroup_lock *s_blockgroup_lock;
+	struct percpu_counter s_restricted_blocks_counter;
 	struct proc_dir_entry *s_proc;
 	struct kobject s_kobj;
 	struct completion s_kobj_unregister;
@@ -967,6 +997,9 @@  struct ext4_sb_info {

 	unsigned int s_log_groups_per_flex;
 	struct flex_groups *s_flex_groups;
+
+	rwlock_t s_bg_arule_lock;
+	struct list_head s_bg_arule_list;
 };

 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
@@ -1352,6 +1385,8 @@  extern void ext4_mb_update_group_info(struct ext4_group_info *grp,
 extern int ext4_mb_get_buddy_cache_lock(struct super_block *, ext4_group_t);
 extern void ext4_mb_put_buddy_cache_lock(struct super_block *,
 						ext4_group_t, int);
+extern int ext4_mb_add_global_arule(struct inode *, struct ext4_alloc_rule *);
+
 /* inode.c */
 int ext4_forget(handle_t *handle, int is_metadata, struct inode *inode,
 		struct buffer_head *bh, ext4_fsblk_t blocknr);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index a0edf61..8505e3a 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -260,6 +260,22 @@  setversion_out:
 		return err;
 	}

+	case EXT4_IOC_ADD_GLOBAL_ALLOC_RULE: {
+		struct ext4_alloc_rule arule;
+		int err;
+
+		if (copy_from_user(&arule,
+				(struct ext4_alloc_rule __user *)arg,
+				sizeof(arule)))
+			return -EFAULT;
+
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+
+		err = ext4_mb_add_global_arule(inode, &arule);
+		return err;
+	}
+
 	case EXT4_IOC_GROUP_ADD: {
 		struct ext4_new_group_data input;
 		struct super_block *sb = inode->i_sb;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 519a0a6..0719900 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1188,6 +1188,120 @@  static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
 	mb_check_buddy(e4b);
 }

+/*
+ * Find the ext4_bg_alloc_rule_list entry for block group @bg
+ * and store a pointer to it in @ret_list.
+ * If there is no such entry, @ret_list is set to NULL.
+ */
+static void mb_arule_find_group(struct ext4_sb_info *sbi,
+		struct ext4_bg_alloc_rule_list **ret_list, ext4_group_t bg)
+{
+	struct list_head *arule_head = &sbi->s_bg_arule_list;
+	struct ext4_bg_alloc_rule_list *pos, *n;
+
+	*ret_list = NULL;
+	list_for_each_entry_safe(pos, n, arule_head, bg_arule_list) {
+		if (pos->bg_num > bg)
+			return;
+		else if (pos->bg_num == bg) {
+			*ret_list = pos;
+			return;
+		}
+	}
+}
+
+static ext4_grpblk_t
+ext4_mb_count_unused_blocks(void *bd_bitmap, ext4_grpblk_t start,
+				ext4_grpblk_t end) {
+	ext4_grpblk_t i, blocks = 0;
+
+	for (i = start; i <= end; i++) {
+		if (!mb_test_bit(i, bd_bitmap))
+			blocks++;
+	}
+	return blocks;
+}
+
+static void
+__mb_arule_count_overlap(struct ext4_buddy *e4b,
+			struct ext4_bg_alloc_rule_list *arule,
+			ext4_grpblk_t start, int len,
+			ext4_grpblk_t *mand_blocks, ext4_grpblk_t *adv_blocks)
+{
+	struct list_head *arule_head = &arule->arule_list;
+	struct ext4_bg_alloc_rule *pos, *n;
+	ext4_grpblk_t end, search_start, search_end, overlap;
+
+	*mand_blocks = 0;
+	*adv_blocks = 0;
+
+	search_start = start;
+	search_end = start + len - 1;
+	end = search_end;
+	list_for_each_entry_safe(pos, n, arule_head, arule_list) {
+
+		if (pos->start > end)
+			return;
+
+		if (pos->start <= end && pos->end >= start) {
+			search_start = start < pos->start ? pos->start :
+								   start;
+			search_end = end > pos->end ? pos->end : end;
+			overlap = search_end - search_start + 1 -
+				  ext4_mb_count_unused_blocks(e4b->bd_bitmap,
+						search_start, search_end);
+			if (pos->alloc_flag == EXT4_MB_ALLOC_RULE_ADVISORY)
+				*adv_blocks += overlap;
+			else
+				*mand_blocks += overlap;
+		}
+	}
+}
+
+/*
+ * Count the in-use blocks in the range @start to @start + @len - 1 that
+ * lie inside unallocatable ranges of block group @bg.
+ * If there is overlap, @mand_blocks and/or @adv_blocks are updated.
+ */
+static void
+mb_arule_count_overlap(struct ext4_allocation_context *ac,
+			struct ext4_buddy *e4b, ext4_group_t bg,
+			ext4_grpblk_t start, int len,
+			ext4_grpblk_t *mand_blocks, ext4_grpblk_t *adv_blocks)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_inode->i_sb);
+	struct ext4_bg_alloc_rule_list *target_bg_list = NULL;
+
+	if (!(ac->ac_flags & EXT4_MB_BLOCKS_RESTRICTED))
+		return;
+
+	read_lock(&sbi->s_bg_arule_lock);
+	mb_arule_find_group(sbi, &target_bg_list, bg);
+	if (target_bg_list == NULL) {
+		read_unlock(&sbi->s_bg_arule_lock);
+		return;
+	}
+
+	__mb_arule_count_overlap(e4b, target_bg_list, start, len,
+				mand_blocks, adv_blocks);
+	read_unlock(&sbi->s_bg_arule_lock);
+}
+
+static void
+ext4_mb_calc_restricted(struct ext4_sb_info *sbi,
+			struct ext4_bg_alloc_rule_list **arule_list,
+			int alloc_flag,
+			ext4_grpblk_t restricted)
+{
+	if (alloc_flag == EXT4_MB_ALLOC_RULE_ADVISORY)
+		(*arule_list)->adv_restricted_blks += restricted;
+	else {
+		(*arule_list)->mand_restricted_blks += restricted;
+		percpu_counter_add(&sbi->s_restricted_blocks_counter,
+				(s64)restricted);
+	}
+}
+
 static int mb_find_extent(struct ext4_buddy *e4b, int order, int block,
 				int needed, struct ext4_free_extent *ex)
 {
@@ -2733,6 +2847,9 @@  int ext4_mb_init(struct super_block *sb, int needs_recovery)
 	if (sbi->s_journal)
 		sbi->s_journal->j_commit_callback = release_blocks_on_commit;

+	rwlock_init(&sbi->s_bg_arule_lock);
+	INIT_LIST_HEAD(&sbi->s_bg_arule_list);
+
 	printk(KERN_INFO "EXT4-fs: mballoc enabled\n");
 	return 0;
 }
@@ -4698,13 +4815,14 @@  void ext4_mb_free_blocks(handle_t *handle, struct inode *inode,
 	struct ext4_group_desc *gdp;
 	struct ext4_super_block *es;
 	unsigned int overflow;
-	ext4_grpblk_t bit;
+	ext4_grpblk_t bit, mand_blocks = 0, adv_blocks = 0;
 	struct buffer_head *gd_bh;
 	ext4_group_t block_group;
 	struct ext4_sb_info *sbi;
 	struct ext4_buddy e4b;
 	int err = 0;
 	int ret;
+	struct ext4_bg_alloc_rule_list *ret_list = NULL;

 	*freed = 0;

@@ -4797,6 +4915,19 @@  do_more:
 	err = ext4_mb_load_buddy(sb, block_group, &e4b);
 	if (err)
 		goto error_return;
+
+	ext4_lock_group(sb, block_group);
+	if (!list_empty(&EXT4_SB(sb)->s_bg_arule_list)) {
+		/*
+		 * Count the blocks that overlap with the
+		 * unallocatable ranges.
+		 */
+		ac->ac_flags |= EXT4_MB_BLOCKS_RESTRICTED;
+		mb_arule_count_overlap(ac, &e4b, block_group, bit, count,
+					&mand_blocks, &adv_blocks);
+	}
+	ext4_unlock_group(sb, block_group);
+
 	if (metadata && ext4_handle_valid(handle)) {
 		struct ext4_free_data *new_entry;
 		/*
@@ -4823,11 +4954,28 @@  do_more:
 		ext4_mb_return_to_preallocation(inode, &e4b, block, count);
 	}

+	/* Modify the number of restricted blocks */
+	if (mand_blocks || adv_blocks) {
+		read_lock(&EXT4_SB(sb)->s_bg_arule_lock);
+		mb_arule_find_group(EXT4_SB(sb), &ret_list, e4b.bd_group);
+		if (ret_list != NULL) {
+			if (mand_blocks)
+				ext4_mb_calc_restricted(sbi, &ret_list,
+						EXT4_MB_ALLOC_RULE_MANDATORY,
+						(s64)mand_blocks);
+			if (adv_blocks)
+				ext4_mb_calc_restricted(sbi, &ret_list,
+						EXT4_MB_ALLOC_RULE_ADVISORY,
+						(s64)adv_blocks);
+		}
+		read_unlock(&EXT4_SB(sb)->s_bg_arule_lock);
+	}
+
 	ret = ext4_free_blks_count(sb, gdp) + count;
 	ext4_free_blks_set(sb, gdp, ret);
 	gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
 	ext4_unlock_group(sb, block_group);
-	percpu_counter_add(&sbi->s_freeblocks_counter, count);
+	percpu_counter_add(&sbi->s_freeblocks_counter, (s64)count);

 	if (sbi->s_log_groups_per_flex) {
 		ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
@@ -4862,3 +5010,289 @@  error_return:
 		kmem_cache_free(ext4_ac_cachep, ac);
 	return;
 }
+
+static void ext4_mb_release_tmp_list(struct list_head *list,
+						struct ext4_sb_info *sbi)
+{
+	struct ext4_bg_alloc_rule_list *bg_arule_list, *tmp_arule_list;
+	struct ext4_bg_alloc_rule *bg_arule, *tmp_arule;
+
+	list_for_each_entry_safe(bg_arule, tmp_arule, list, tmp_list) {
+		list_del(&bg_arule->arule_list);
+		list_del(&bg_arule->tmp_list);
+		kfree(bg_arule);
+		bg_arule = NULL;
+	}
+
+	list_for_each_entry_safe(bg_arule_list, tmp_arule_list,
+				&sbi->s_bg_arule_list, bg_arule_list) {
+		if (list_empty(&bg_arule_list->arule_list)) {
+			list_del(&bg_arule_list->bg_arule_list);
+			kfree(bg_arule_list);
+			bg_arule_list = NULL;
+		}
+	}
+
+	return;
+}
+
+static int ext4_mb_check_arule(struct inode *inode,
+						struct ext4_alloc_rule *arule)
+{
+	struct super_block *sb = inode->i_sb;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	ext4_fsblk_t blocks_count = ext4_blocks_count(sbi->s_es);
+	ext4_fsblk_t first_data_block;
+
+	/* FIXME: indirect block is not supported */
+	if (!EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS)) {
+		ext4_debug("Not extent based file or filesystem\n");
+		return -EOPNOTSUPP;
+	}
+
+	if (arule->len == 0) {
+		ext4_debug("Can't set len=0\n");
+		return -EINVAL;
+	}
+
+	if (arule->start >= blocks_count || arule->len > blocks_count) {
+		ext4_debug("Can't set more than %llu blocks: "
+				"start=%llu, len=%llu\n", blocks_count+1,
+				(ext4_fsblk_t)arule->start,
+				(ext4_fsblk_t)arule->len);
+		return -EINVAL;
+	}
+
+	first_data_block = le32_to_cpu(sbi->s_es->s_first_data_block);
+	if (arule->start + arule->len > blocks_count ||
+	    arule->start + arule->len - 1 < first_data_block) {
+		ext4_debug("alloc_rule shows out of FS: start+len=%llu\n",
+				(ext4_fsblk_t)(arule->start + arule->len));
+		return -EINVAL;
+	}
+
+	if (arule->alloc_flag != EXT4_MB_ALLOC_RULE_MANDATORY &&
+			arule->alloc_flag != EXT4_MB_ALLOC_RULE_ADVISORY) {
+		ext4_debug("alloc_flag should be 0 or 1: alloc_flag=%u\n",
+				(unsigned)arule->alloc_flag);
+		return -EINVAL;
+	}
+
+	/* Account for the boot block when the block size is 1k */
+	if (arule->start < first_data_block) {
+		ext4_fsblk_t diff;
+
+		printk(KERN_INFO "%s: alloc_rule->start isn't in data block."
+				 "\nThe argument is modified to cover "
+				 "[%llu:%llu]\n", __func__, first_data_block,
+						arule->start + arule->len - 1);
+		diff = first_data_block - arule->start;
+		arule->start += diff;
+		arule->len -= diff;
+	}
+
+	return 0;
+}
+
+static int
+ext4_mb_add_bg_arule(struct super_block *sb,
+			struct ext4_bg_alloc_rule_list *bg_arule_list,
+			struct ext4_bg_alloc_rule *add_arule)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_bg_alloc_rule *bg_arule, *tmp_arule;
+	struct ext4_buddy e4b;
+	ext4_group_t group_no = bg_arule_list->bg_num;
+	ext4_grpblk_t restricted_blocks;
+	int err;
+
+	err = ext4_mb_load_buddy(sb, group_no, &e4b);
+	if (err)
+		return err;
+
+	ext4_lock_group(sb, group_no);
+
+	restricted_blocks = ext4_mb_count_unused_blocks(e4b.bd_bitmap,
+					add_arule->start, add_arule->end);
+	ext4_mb_calc_restricted(sbi, &bg_arule_list, add_arule->alloc_flag,
+				restricted_blocks);
+
+	list_for_each_entry_safe(bg_arule, tmp_arule,
+				&bg_arule_list->arule_list, arule_list) {
+		if (add_arule->start == bg_arule->end + 1 &&
+		    add_arule->alloc_flag == bg_arule->alloc_flag) {
+			add_arule->start = bg_arule->start;
+			list_del(&bg_arule->arule_list);
+		} else if (add_arule->end == bg_arule->start - 1 &&
+			   add_arule->alloc_flag == bg_arule->alloc_flag) {
+			add_arule->end = bg_arule->end;
+			list_del(&bg_arule->arule_list);
+		}
+	}
+
+	list_for_each_entry_safe(bg_arule, tmp_arule,
+				&bg_arule_list->arule_list, arule_list) {
+		if (bg_arule->start > add_arule->start) {
+			list_add_tail(&add_arule->arule_list,
+						&bg_arule->arule_list);
+			break;
+		}
+		/* if bg_arule is the last entry, call list_add */
+		if (list_is_last(&bg_arule->arule_list,
+						&bg_arule_list->arule_list))
+			list_add(&add_arule->arule_list,
+						&bg_arule->arule_list);
+	}
+
+	if (list_empty(&bg_arule_list->arule_list))
+		list_add(&add_arule->arule_list, &bg_arule_list->arule_list);
+
+	ext4_unlock_group(sb, group_no);
+	ext4_mb_release_desc(&e4b);
+
+	return 0;
+}
+
+int ext4_mb_add_global_arule(struct inode *inode,
+					struct ext4_alloc_rule *arule)
+{
+	struct super_block *sb = inode->i_sb;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct percpu_counter *dbc = &sbi->s_dirtyblocks_counter;
+	struct ext4_bg_alloc_rule_list *bg_arule_list, *tmp_arule_list;
+	struct ext4_bg_alloc_rule_list *new_arule_list = NULL;
+	struct ext4_bg_alloc_rule *bg_arule, *tmp_arule, *add_arule = NULL;
+	struct list_head add_arule_list;
+	ext4_fsblk_t start, end;
+	ext4_grpblk_t bg_start = 0, bg_end = 0;
+	ext4_group_t start_bgnum, end_bgnum;
+	s64 dirty_blocks;
+	int ret, add_flag = 0;
+	unsigned long bg_size = EXT4_BLOCKS_PER_GROUP(sb);
+
+	ret = ext4_mb_check_arule(inode, arule);
+	if (ret < 0)
+		return ret;
+
+	dirty_blocks = percpu_counter_read_positive(dbc);
+	/* Confirm this arule does not violate the reserved delalloc space */
+	if (!ext4_has_free_blocks(sbi, (s64)arule->len + dirty_blocks))
+		return -ENOSPC;
+
+	start = arule->start;
+	end = arule->start + arule->len - 1;
+
+	ext4_get_group_no_and_offset(sb, start, &start_bgnum, NULL);
+	ext4_get_group_no_and_offset(sb, end, &end_bgnum, NULL);
+
+	write_lock(&sbi->s_bg_arule_lock);
+	list_for_each_entry_safe(bg_arule_list, tmp_arule_list,
+					&sbi->s_bg_arule_list, bg_arule_list) {
+		if (bg_arule_list->bg_num < start_bgnum ||
+				bg_arule_list->bg_num > end_bgnum)
+			continue;
+
+		if (start_bgnum < bg_arule_list->bg_num)
+			bg_start = 0;
+		else
+			ext4_get_group_no_and_offset(sb, start, NULL,
+							&bg_start);
+
+		if (end_bgnum > bg_arule_list->bg_num)
+			bg_end = bg_size - 1;
+		else
+			ext4_get_group_no_and_offset(sb, end, NULL, &bg_end);
+
+		list_for_each_entry_safe(bg_arule, tmp_arule,
+				&bg_arule_list->arule_list, arule_list) {
+
+			if (bg_start <= bg_arule->end &&
+						bg_end >= bg_arule->start) {
+				ext4_debug("Overlapping blocks\n");
+				ret = -EINVAL;
+				goto out;
+			} else if (bg_end < bg_arule->start)
+				break;
+		}
+	}
+
+	/* divide alloc rules per blockgroup */
+	INIT_LIST_HEAD(&add_arule_list);
+	while (start <= end) {
+		add_arule = kmalloc(sizeof(struct ext4_bg_alloc_rule),
+								GFP_KERNEL);
+		if (add_arule == NULL) {
+			ext4_mb_release_tmp_list(&add_arule_list, sbi);
+			ret = -ENOMEM;
+			goto out;
+		}
+		INIT_LIST_HEAD(&add_arule->arule_list);
+		INIT_LIST_HEAD(&add_arule->tmp_list);
+		ext4_get_group_no_and_offset(sb, start, NULL,
+						&add_arule->start);
+		add_arule->alloc_flag = arule->alloc_flag;
+		/* if end lies beyond this block group, clamp it */
+		if (end_bgnum > start_bgnum)
+			add_arule->end = (ext4_grpblk_t)(bg_size - 1);
+		else
+			ext4_get_group_no_and_offset(sb, end, NULL,
+							&add_arule->end);
+
+		list_add(&add_arule->tmp_list, &add_arule_list);
+
+		list_for_each_entry_safe(bg_arule_list, tmp_arule_list,
+					&sbi->s_bg_arule_list, bg_arule_list) {
+			if (bg_arule_list->bg_num < start_bgnum)
+				continue;
+			else if (bg_arule_list->bg_num == start_bgnum) {
+				ret = ext4_mb_add_bg_arule(sb, bg_arule_list,
+						add_arule);
+				if (ret < 0)
+					goto out;
+
+				add_flag = 1;
+			}
+
+			break;
+		}
+
+		/*
+		 * If there is no matching bg_arule_list, create a new
+		 * bg_arule_list.
+		 */
+		if (!add_flag) {
+			new_arule_list = kmalloc(
+					sizeof(struct ext4_bg_alloc_rule_list),
+					GFP_KERNEL);
+			if (new_arule_list == NULL) {
+				ext4_mb_release_tmp_list(&add_arule_list, sbi);
+				ret = -ENOMEM;
+				goto out;
+			}
+			/* init new bg_alloc_rule_list */
+			INIT_LIST_HEAD(&new_arule_list->bg_arule_list);
+			INIT_LIST_HEAD(&new_arule_list->arule_list);
+			new_arule_list->bg_num = start_bgnum;
+			new_arule_list->mand_restricted_blks = 0;
+			new_arule_list->adv_restricted_blks = 0;
+
+			ret = ext4_mb_add_bg_arule(sb,
+						new_arule_list, add_arule);
+			if (ret < 0)
+				goto out;
+
+			/* add new bg_alloc_rule_list to sbi */
+			list_add_tail(&new_arule_list->bg_arule_list,
+					&bg_arule_list->bg_arule_list);
+		}
+
+		/* set next bg's start block number */
+		start_bgnum++;
+		start = ext4_group_first_block_no(sb, start_bgnum);
+		add_flag = 0;
+	}
+
+out:
+	write_unlock(&sbi->s_bg_arule_lock);
+	return ret;
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c110cd1..e5fe18a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -615,6 +615,7 @@  static void ext4_put_super(struct super_block *sb)
 	percpu_counter_destroy(&sbi->s_freeinodes_counter);
 	percpu_counter_destroy(&sbi->s_dirs_counter);
 	percpu_counter_destroy(&sbi->s_dirtyblocks_counter);
+	percpu_counter_destroy(&sbi->s_restricted_blocks_counter);
 	brelse(sbi->s_sbh);
 #ifdef CONFIG_QUOTA
 	for (i = 0; i < MAXQUOTAS; i++)
@@ -2672,6 +2673,9 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	if (!err) {
 		err = percpu_counter_init(&sbi->s_dirtyblocks_counter, 0);
 	}
+	if (!err) {
+		err = percpu_counter_init(&sbi->s_restricted_blocks_counter, 0);
+	}
 	if (err) {
 		ext4_msg(sb, KERN_ERR, "insufficient memory");
 		goto failed_mount3;
@@ -3649,7 +3653,10 @@  static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
 	buf->f_bsize = sb->s_blocksize;
 	buf->f_blocks = ext4_blocks_count(es) - sbi->s_overhead_last;
 	buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter) -
-		       percpu_counter_sum_positive(&sbi->s_dirtyblocks_counter);
+		       percpu_counter_sum_positive(
+				&sbi->s_dirtyblocks_counter) -
+		       percpu_counter_sum_positive(
+				&sbi->s_restricted_blocks_counter);
 	ext4_free_blocks_count_set(es, buf->f_bfree);
 	buf->f_bavail = buf->f_bfree - ext4_r_blocks_count(es);
 	if (buf->f_bfree < ext4_r_blocks_count(es))