[1/1] ext4: fix potential negative array index in do_split()

Message ID f53e246b-647c-64bb-16ec-135383c70ad7@redhat.com
State Awaiting Upstream
Series ext4: fix potential negative array index in do_split

Commit Message

Eric Sandeen June 17, 2020, 7:19 p.m. UTC
If for any reason a directory passed to do_split() does not have enough
active entries to exceed half the size of the block, we can end up
iterating over all "count" entries without finding a split point.

In this case, count == move, and split will be zero, and we will
attempt a negative index into map[].

Guard against this by detecting this case, and falling back to
split-to-half-of-count instead; in this case we will still have
plenty of space (> half blocksize) in each split block.

Fixes: ef2b02d3e617 ("ext34: ensure do_split leaves enough free space in both blocks")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---
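
For readers following along, the failure mode can be modelled outside the kernel. The sketch below is a user-space illustration only, not the kernel code: struct map_entry, count_moved(), split_before_fix() and split_after_fix() are hypothetical names, and the loop body is copied from the hunk in this patch with the rest of do_split() left out.

```c
/* Hypothetical stand-in for the kernel's struct dx_map_entry. */
struct map_entry {
	unsigned hash;
	unsigned size;
};

/*
 * Walk the sorted map from the top, as the loop in do_split() does.
 * Returns how many entries would move; *stopped_at receives the final
 * value of i (-1 when the loop ran over every entry without breaking).
 */
static int count_moved(const struct map_entry *map, int count,
		       unsigned blocksize, int *stopped_at)
{
	unsigned size = 0;
	int move = 0;
	int i;

	for (i = count - 1; i >= 0; i--) {
		/* is more than half of this entry in 2nd half of the block? */
		if (size + map[i].size / 2 > blocksize / 2)
			break;
		size += map[i].size;
		move++;
	}
	*stopped_at = i;
	return move;
}

/* Old calculation: can return 0, so map[split - 1] would be map[-1]. */
static int split_before_fix(const struct map_entry *map, int count,
			    unsigned blocksize)
{
	int i;

	return count - count_moved(map, count, blocksize, &i);
}

/* Patched calculation: fall back to count/2 when no split point was found. */
static int split_after_fix(const struct map_entry *map, int count,
			   unsigned blocksize)
{
	int i;
	int move = count_moved(map, count, blocksize, &i);

	return i > 0 ? count - move : count / 2;
}
```

With four 40-byte entries in a 1024-byte block the loop never breaks, so the old calculation gives split == 0 while the patched one gives 2.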

Comments

Andreas Dilger June 19, 2020, 12:33 a.m. UTC | #1
On Jun 17, 2020, at 1:19 PM, Eric Sandeen <sandeen@redhat.com> wrote:
> 
> If for any reason a directory passed to do_split() does not have enough
> active entries to exceed half the size of the block, we can end up
> iterating over all "count" entries without finding a split point.
> 
> In this case, count == move, and split will be zero, and we will
> attempt a negative index into map[].
> 
> Guard against this by detecting this case, and falling back to
> split-to-half-of-count instead; in this case we will still have
> plenty of space (> half blocksize) in each split block.
> 
> Fixes: ef2b02d3e617 ("ext34: ensure do_split leaves enough free space in both blocks")
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>

Reviewed-by: Andreas Dilger <adilger@dilger.ca>

> ---
> 
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index a8aca4772aaa..8b60881f07ee 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -1858,7 +1858,7 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
> 			     blocksize, hinfo, map);
> 	map -= count;
> 	dx_sort_map(map, count);
> -	/* Split the existing block in the middle, size-wise */
> +	/* Ensure that neither split block is over half full */
> 	size = 0;
> 	move = 0;
> 	for (i = count-1; i >= 0; i--) {
> @@ -1868,8 +1868,18 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
> 		size += map[i].size;
> 		move++;
> 	}
> -	/* map index at which we will split */
> -	split = count - move;
> +	/*
> +	 * map index at which we will split
> +	 *
> +	 * If the sum of active entries didn't exceed half the block size, just
> +	 * split it in half by count; each resulting block will have at least
> +	 * half the space free.
> +	 */
> +	if (i > 0)
> +		split = count - move;
> +	else
> +		split = count/2;
> +
> 	hash2 = map[split].hash;
> 	continued = hash2 == map[split - 1].hash;
> 	dxtrace(printk(KERN_INFO "Split block %lu at %x, %i/%i\n",
> 
> 


Cheers, Andreas
Lukas Czerner June 19, 2020, 6:41 a.m. UTC | #2
On Wed, Jun 17, 2020 at 02:19:04PM -0500, Eric Sandeen wrote:
> If for any reason a directory passed to do_split() does not have enough
> active entries to exceed half the size of the block, we can end up
> iterating over all "count" entries without finding a split point.
> 
> In this case, count == move, and split will be zero, and we will
> attempt a negative index into map[].
> 
> Guard against this by detecting this case, and falling back to
> split-to-half-of-count instead; in this case we will still have
> plenty of space (> half blocksize) in each split block.
> 
> Fixes: ef2b02d3e617 ("ext34: ensure do_split leaves enough free space in both blocks")
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> ---
> 
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index a8aca4772aaa..8b60881f07ee 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -1858,7 +1858,7 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
>  			     blocksize, hinfo, map);
>  	map -= count;
>  	dx_sort_map(map, count);
> -	/* Split the existing block in the middle, size-wise */
> +	/* Ensure that neither split block is over half full */
>  	size = 0;
>  	move = 0;
>  	for (i = count-1; i >= 0; i--) {
> @@ -1868,8 +1868,18 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
>  		size += map[i].size;
>  		move++;
>  	}
> -	/* map index at which we will split */
> -	split = count - move;
> +	/*
> +	 * map index at which we will split
> +	 *
> +	 * If the sum of active entries didn't exceed half the block size, just
> +	 * split it in half by count; each resulting block will have at least
> +	 * half the space free.
> +	 */
> +	if (i > 0)
> +		split = count - move;
> +	else
> +		split = count/2;

Won't we have exactly the same problem as we did before your commit
ef2b02d3e617cb0400eedf2668f86215e1b0e6af ? Since we do not know how much
space we actually moved we might have not made enough space for the new
entry ?

Also since we have the move == count when the problem appears then it's
clear that we never hit the condition

		/* is more than half of this entry in 2nd half of the block? */
		if (size + map[i].size/2 > blocksize/2)
			break;

in the loop. This is surprising but it means the entries must have
gaps between them that are small enough that we can't fit the entry
right in ? Should not we try to compact it before splitting, or is it
the case that this should have been done somewhere else ?

If we really want to be fair and we want to split it right in the middle
of the entries size-wise then we need to keep track of the sum of the
entries and decide based on that, not blocksize/2. But maybe the problem
could be solved by compacting the entries together because the condition
seems to rely on that.

-Lukas

> +
>  	hash2 = map[split].hash;
>  	continued = hash2 == map[split - 1].hash;
>  	dxtrace(printk(KERN_INFO "Split block %lu at %x, %i/%i\n",
> 
>
Lukas Czerner June 19, 2020, 7:08 a.m. UTC | #3
On Fri, Jun 19, 2020 at 08:41:22AM +0200, Lukas Czerner wrote:
> On Wed, Jun 17, 2020 at 02:19:04PM -0500, Eric Sandeen wrote:
> > If for any reason a directory passed to do_split() does not have enough
> > active entries to exceed half the size of the block, we can end up
> > iterating over all "count" entries without finding a split point.
> > 
> > In this case, count == move, and split will be zero, and we will
> > attempt a negative index into map[].
> > 
> > Guard against this by detecting this case, and falling back to
> > split-to-half-of-count instead; in this case we will still have
> > plenty of space (> half blocksize) in each split block.
> > 
> > Fixes: ef2b02d3e617 ("ext34: ensure do_split leaves enough free space in both blocks")
> > Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> > ---
> > 
> > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> > index a8aca4772aaa..8b60881f07ee 100644
> > --- a/fs/ext4/namei.c
> > +++ b/fs/ext4/namei.c
> > @@ -1858,7 +1858,7 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
> >  			     blocksize, hinfo, map);
> >  	map -= count;
> >  	dx_sort_map(map, count);
> > -	/* Split the existing block in the middle, size-wise */
> > +	/* Ensure that neither split block is over half full */
> >  	size = 0;
> >  	move = 0;
> >  	for (i = count-1; i >= 0; i--) {
> > @@ -1868,8 +1868,18 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
> >  		size += map[i].size;
> >  		move++;
> >  	}
> > -	/* map index at which we will split */
> > -	split = count - move;
> > +	/*
> > +	 * map index at which we will split
> > +	 *
> > +	 * If the sum of active entries didn't exceed half the block size, just
> > +	 * split it in half by count; each resulting block will have at least
> > +	 * half the space free.
> > +	 */
> > +	if (i > 0)
> > +		split = count - move;
> > +	else
> > +		split = count/2;
> 
> Won't we have exactly the same problem as we did before your commit
> ef2b02d3e617cb0400eedf2668f86215e1b0e6af ? Since we do not know how much
> space we actually moved we might have not made enough space for the new
> entry ?
> 
> Also since we have the move == count when the problem appears then it's
> clear that we never hit the condition
> 
> 		/* is more than half of this entry in 2nd half of the block? */
> 		if (size + map[i].size/2 > blocksize/2)
> 			break;
> 
> in the loop. This is surprising but it means the entries must have
> gaps between them that are small enough that we can't fit the entry
> right in ? Should not we try to compact it before splitting, or is it
> the case that this should have been done somewhere else ?

The other possibility is that map[i].size is not right and indeed there
seems to be a bug in dx_make_map()

map_tail->size = le16_to_cpu(de->rec_len);

should be

map_tail->size = ext4_rec_len_from_disk(de->rec_len, blocksize);

right ? Otherwise with large enough records the size will be smaller
than it really is.

A quick look at fs/ext4/namei.c reveals a couple of places where rec_len
is used without the conversion and we should check whether it needs
fixing.

-Lukas



> 
> If we really want to be fair and we want to split it right in the middle
> of the entries size-wise then we need to keep track of the sum of the
> entries and decide based on that, not blocksize/2. But maybe the problem
> could be solved by compacting the entries together because the condition
> seems to rely on that.
> 
> -Lukas
> 
> > +
> >  	hash2 = map[split].hash;
> >  	continued = hash2 == map[split - 1].hash;
> >  	dxtrace(printk(KERN_INFO "Split block %lu at %x, %i/%i\n",
> > 
> > 
>
Lukas Czerner June 19, 2020, 11:16 a.m. UTC | #4
On Fri, Jun 19, 2020 at 09:08:54AM +0200, Lukas Czerner wrote:
> On Fri, Jun 19, 2020 at 08:41:22AM +0200, Lukas Czerner wrote:
> > On Wed, Jun 17, 2020 at 02:19:04PM -0500, Eric Sandeen wrote:
> > > If for any reason a directory passed to do_split() does not have enough
> > > active entries to exceed half the size of the block, we can end up
> > > iterating over all "count" entries without finding a split point.
> > > 
> > > In this case, count == move, and split will be zero, and we will
> > > attempt a negative index into map[].
> > > 
> > > Guard against this by detecting this case, and falling back to
> > > split-to-half-of-count instead; in this case we will still have
> > > plenty of space (> half blocksize) in each split block.
> > > 
> > > Fixes: ef2b02d3e617 ("ext34: ensure do_split leaves enough free space in both blocks")
> > > Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> > > ---
> > > 
> > > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> > > index a8aca4772aaa..8b60881f07ee 100644
> > > --- a/fs/ext4/namei.c
> > > +++ b/fs/ext4/namei.c
> > > @@ -1858,7 +1858,7 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
> > >  			     blocksize, hinfo, map);
> > >  	map -= count;
> > >  	dx_sort_map(map, count);
> > > -	/* Split the existing block in the middle, size-wise */
> > > +	/* Ensure that neither split block is over half full */
> > >  	size = 0;
> > >  	move = 0;
> > >  	for (i = count-1; i >= 0; i--) {
> > > @@ -1868,8 +1868,18 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
> > >  		size += map[i].size;
> > >  		move++;
> > >  	}
> > > -	/* map index at which we will split */
> > > -	split = count - move;
> > > +	/*
> > > +	 * map index at which we will split
> > > +	 *
> > > +	 * If the sum of active entries didn't exceed half the block size, just
> > > +	 * split it in half by count; each resulting block will have at least
> > > +	 * half the space free.
> > > +	 */
> > > +	if (i > 0)
> > > +		split = count - move;
> > > +	else
> > > +		split = count/2;
> > 
> > Won't we have exactly the same problem as we did before your commit
> > ef2b02d3e617cb0400eedf2668f86215e1b0e6af ? Since we do not know how much
> > space we actually moved we might have not made enough space for the new
> > entry ?
> > 
> > Also since we have the move == count when the problem appears then it's
> > clear that we never hit the condition
> > 
> > 		/* is more than half of this entry in 2nd half of the block? */
> > 		if (size + map[i].size/2 > blocksize/2)
> > 			break;
> > 
> > in the loop. This is surprising but it means the entries must have
> > gaps between them that are small enough that we can't fit the entry
> > right in ? Should not we try to compact it before splitting, or is it
> > the case that this should have been done somewhere else ?
> 
> The other possibility is that map[i].size is not right and indeed there
> seems to be a bug in dx_make_map()
> 
> map_tail->size = le16_to_cpu(de->rec_len);
> 
> should be
> 
> map_tail->size = ext4_rec_len_from_disk(de->rec_len, blocksize);
> 
> right ? Otherwise with large enough records the size will be smaller
> than it really is.
> 
> A quick look at fs/ext4/namei.c reveals a couple of places where rec_len
> is used without the conversion and we should check whether it needs
> fixing.
> 
> -Lukas

And indeed the following patch seems to have fixed the issue we were
seeing. Eric I think that this might be a proper fix. But we still need
to check the other uses of rec_len to make sure it's ok as well.

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 94ec882..5509fdc 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1068,7 +1068,7 @@ static int dx_make_map(struct ext4_dir_entry_2 *de, unsigned blocksize,
                        map_tail--;
                        map_tail->hash = h.hash;
                        map_tail->offs = ((char *) de - base)>>2;
-                       map_tail->size = le16_to_cpu(de->rec_len);
+                       map_tail->size = ext4_rec_len_from_disk(le16_to_cpu(de->rec_len), blocksize);
                        count++;
                        cond_resched();
                }


> 
> 
> 
> > 
> > If we really want to be fair and we want to split it right in the middle
> > of the entries size-wise then we need to keep track of the sum of the
> > entries and decide based on that, not blocksize/2. But maybe the problem
> > could be solved by compacting the entries together because the condition
> > seems to rely on that.
> > 
> > -Lukas
> > 
> > > +
> > >  	hash2 = map[split].hash;
> > >  	continued = hash2 == map[split - 1].hash;
> > >  	dxtrace(printk(KERN_INFO "Split block %lu at %x, %i/%i\n",
> > > 
> > > 
> > 
>
Eric Sandeen June 19, 2020, 1:39 p.m. UTC | #5
On 6/19/20 1:41 AM, Lukas Czerner wrote:
> On Wed, Jun 17, 2020 at 02:19:04PM -0500, Eric Sandeen wrote:
>> If for any reason a directory passed to do_split() does not have enough
>> active entries to exceed half the size of the block, we can end up
>> iterating over all "count" entries without finding a split point.
>>
>> In this case, count == move, and split will be zero, and we will
>> attempt a negative index into map[].
>>
>> Guard against this by detecting this case, and falling back to
>> split-to-half-of-count instead; in this case we will still have
>> plenty of space (> half blocksize) in each split block.

...

>> +	/*
>> +	 * map index at which we will split
>> +	 *
>> +	 * If the sum of active entries didn't exceed half the block size, just
>> +	 * split it in half by count; each resulting block will have at least
>> +	 * half the space free.
>> +	 */
>> +	if (i > 0)
>> +		split = count - move;
>> +	else
>> +		split = count/2;
> 
> Won't we have exactly the same problem as we did before your commit
> ef2b02d3e617cb0400eedf2668f86215e1b0e6af ? Since we do not know how much
> space we actually moved we might have not made enough space for the new
> entry ?

I don't think so - while we don't have the original reproducer, I assume that
it was the case where the block was very full, and splitting by count left us
with one of the split blocks still over half full (because ensuring that we
split in half by size seemed to fix it)

In this case, the sum of the active entries was <= half the block size.
So if we split by count, we're guaranteed to have >= half the block size free
in each side of the split.
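
To make that arithmetic concrete, here is a small user-space sketch. It is only an illustration under stated assumptions: min_free_after_count_split() is a hypothetical helper, it sums the map sizes and nothing else (any per-block overhead is ignored), and it assumes the total does not exceed blocksize.

```c
/*
 * Hypothetical helper: split `count` entry sizes at index count/2 and
 * return the smaller of the two resulting blocks' free space, in bytes.
 */
static unsigned min_free_after_count_split(const unsigned *sizes, int count,
					   unsigned blocksize)
{
	unsigned lower = 0, upper = 0;
	int split = count / 2;
	int i;

	for (i = 0; i < count; i++) {
		if (i < split)
			lower += sizes[i];
		else
			upper += sizes[i];
	}
	/*
	 * Each side holds at most the total, so if the total is
	 * <= blocksize/2 then each side has >= blocksize/2 free.
	 */
	return blocksize - (lower > upper ? lower : upper);
}
```

For five 40-byte entries in a 1024-byte block, the fuller side holds 120 bytes, leaving 904 bytes free, well above the 512-byte half-block mark.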
 
> Also since we have the move == count when the problem appears then it's
> clear that we never hit the condition
> 
> 		/* is more than half of this entry in 2nd half of the block? */
> 		if (size + map[i].size/2 > blocksize/2)
> 			break;
> 
> in the loop. This is surprising but it means the entries must have
> gaps between them that are small enough that we can't fit the entry
> right in ? Should not we try to compact it before splitting, or is it
> the case that this should have been done somewhere else ?

Yes, that's exactly what happened - see my 0/1 cover letter.  Maybe that should
be in the patch description itself.  Also, yes compaction would help but I was
unclear as to whether that should be done here, is the side effect of some other
bug, etc.  In general, we do seem to do compaction elsewhere and I don't know
how we got to this point.

> If we really want to be fair and we want to split it right in the middle
> of the entries size-wise then we need to keep track of the sum of the
> entries and decide based on that, not blocksize/2. But maybe the problem
> could be solved by compacting the entries together because the condition
> seems to rely on that.

I thought about that as well, but it took a bit more code to do; we could make
make_map() return both count and total size, for example.  But based on my
theory above that both sides of the split will have >= half block free, it
didn't seem necessary, particularly since this seems like an edge case?

-Eric

> -Lukas
> 
>> +
>>  	hash2 = map[split].hash;
>>  	continued = hash2 == map[split - 1].hash;
>>  	dxtrace(printk(KERN_INFO "Split block %lu at %x, %i/%i\n",
>>
>>
>
Eric Sandeen June 19, 2020, 1:42 p.m. UTC | #6
On 6/19/20 2:08 AM, Lukas Czerner wrote:
> On Fri, Jun 19, 2020 at 08:41:22AM +0200, Lukas Czerner wrote:
>> On Wed, Jun 17, 2020 at 02:19:04PM -0500, Eric Sandeen wrote:
>>> If for any reason a directory passed to do_split() does not have enough
>>> active entries to exceed half the size of the block, we can end up
>>> iterating over all "count" entries without finding a split point.
>>>
>>> In this case, count == move, and split will be zero, and we will
>>> attempt a negative index into map[].
>>>
>>> Guard against this by detecting this case, and falling back to
>>> split-to-half-of-count instead; in this case we will still have
>>> plenty of space (> half blocksize) in each split block.
>>>
>>> Fixes: ef2b02d3e617 ("ext34: ensure do_split leaves enough free space in both blocks")
>>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>>> ---
>>>
>>> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
>>> index a8aca4772aaa..8b60881f07ee 100644
>>> --- a/fs/ext4/namei.c
>>> +++ b/fs/ext4/namei.c
>>> @@ -1858,7 +1858,7 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
>>>  			     blocksize, hinfo, map);
>>>  	map -= count;
>>>  	dx_sort_map(map, count);
>>> -	/* Split the existing block in the middle, size-wise */
>>> +	/* Ensure that neither split block is over half full */
>>>  	size = 0;
>>>  	move = 0;
>>>  	for (i = count-1; i >= 0; i--) {
>>> @@ -1868,8 +1868,18 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
>>>  		size += map[i].size;
>>>  		move++;
>>>  	}
>>> -	/* map index at which we will split */
>>> -	split = count - move;
>>> +	/*
>>> +	 * map index at which we will split
>>> +	 *
>>> +	 * If the sum of active entries didn't exceed half the block size, just
>>> +	 * split it in half by count; each resulting block will have at least
>>> +	 * half the space free.
>>> +	 */
>>> +	if (i > 0)
>>> +		split = count - move;
>>> +	else
>>> +		split = count/2;
>>
>> Won't we have exactly the same problem as we did before your commit
>> ef2b02d3e617cb0400eedf2668f86215e1b0e6af ? Since we do not know how much
>> space we actually moved we might have not made enough space for the new
>> entry ?
>>
>> Also since we have the move == count when the problem appears then it's
>> clear that we never hit the condition
>>
>> 		/* is more than half of this entry in 2nd half of the block? */
>> 		if (size + map[i].size/2 > blocksize/2)
>> 			break;
>>
>> in the loop. This is surprising but it means the entries must have
>> gaps between them that are small enough that we can't fit the entry
>> right in ? Should not we try to compact it before splitting, or is it
>> the case that this should have been done somewhere else ?
> 
> The other possibility is that map[i].size is not right and indeed there
> seems to be a bug in dx_make_map()
> 
> map_tail->size = le16_to_cpu(de->rec_len);
> 
> should be
> 
> map_tail->size = ext4_rec_len_from_disk(de->rec_len, blocksize);
> 
> right ? Otherwise with large enough records the size will be smaller
> than it really is.

well, those are the same thing unless (PAGE_SIZE >= 65536) so I don't
think that's the issue here.

static inline unsigned int
ext4_rec_len_from_disk(__le16 dlen, unsigned blocksize)
{
        unsigned len = le16_to_cpu(dlen);

#if (PAGE_SIZE >= 65536)
...
#else
        return len;
#endif
}

Should be fixed for consistency, but seems to not be a root cause here.

> A quick look at fs/ext4/namei.c reveals a couple of places where rec_len
> is used without the conversion and we should check whether it needs
> fixing.

...
Eric Sandeen June 19, 2020, 1:44 p.m. UTC | #7
On 6/19/20 6:16 AM, Lukas Czerner wrote:

>> The other possibility is that map[i].size is not right and indeed there
>> seems to be a bug in dx_make_map()
>>
>> map_tail->size = le16_to_cpu(de->rec_len);
>>
>> should be
>>
>> map_tail->size = ext4_rec_len_from_disk(de->rec_len, blocksize);
>>
>> right ? Otherwise with large enough records the size will be smaller
>> than it really is.
>>
>> A quick look at fs/ext4/namei.c reveals a couple of places where rec_len
>> is used without the conversion and we should check whether it needs
>> fixing.
>>
>> -Lukas
> 
> And indeed the following patch seems to have fixed the issue we were
> seeing. Eric I think that this might be a proper fix. But we still need
> to check the other uses of rec_len to make sure it's ok as well.
> 
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 94ec882..5509fdc 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -1068,7 +1068,7 @@ static int dx_make_map(struct ext4_dir_entry_2 *de, unsigned blocksize,
>                         map_tail--;
>                         map_tail->hash = h.hash;
>                         map_tail->offs = ((char *) de - base)>>2;
> -                       map_tail->size = le16_to_cpu(de->rec_len);
> +                       map_tail->size = ext4_rec_len_from_disk(le16_to_cpu(de->rec_len), blocksize);

That isn't right, ext4_rec_len_from_disk /takes/ an __le16 :)

-                       map_tail->size = le16_to_cpu(de->rec_len);
+                       map_tail->size = ext4_rec_len_from_disk(de->rec_len, blocksize);

would be more correct, but won't matter for PAGE_SIZE < 65536 right?
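
To illustrate why the argument type matters: the definition of ext4_rec_len_from_disk() begins with `unsigned len = le16_to_cpu(dlen);`, so converting before the call converts twice. On little-endian that is a no-op, but on big-endian each conversion is a byte swap. The sketch below is a user-space model that simulates a big-endian CPU; every name in it is hypothetical, and it models only the small-page branch that returns len unchanged.

```c
#include <stdint.h>

/* Hypothetical model of le16_to_cpu() on a big-endian CPU: a byte swap. */
static uint16_t model_le16_to_cpu_on_be(uint16_t raw)
{
	return (uint16_t)((raw << 8) | (raw >> 8));
}

/*
 * Hypothetical model of the small-page branch of ext4_rec_len_from_disk():
 * it performs the endian conversion itself and returns the result.
 */
static unsigned model_rec_len_from_disk(uint16_t dlen)
{
	return model_le16_to_cpu_on_be(dlen);
}

/* Pass the raw on-disk value: converted exactly once. */
static unsigned decode_once(uint16_t raw)
{
	return model_rec_len_from_disk(raw);
}

/* Convert before the call: the two swaps cancel and the raw value leaks out. */
static unsigned decode_twice(uint16_t raw)
{
	return model_rec_len_from_disk(model_le16_to_cpu_on_be(raw));
}
```

An on-disk rec_len of 12 appears as the raw value 0x0c00 to the modelled big-endian CPU; decode_once() recovers 12, while decode_twice() returns 3072.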

-Eric
Lukas Czerner June 19, 2020, 1:49 p.m. UTC | #8
On Fri, Jun 19, 2020 at 08:42:23AM -0500, Eric Sandeen wrote:
> On 6/19/20 2:08 AM, Lukas Czerner wrote:
> > On Fri, Jun 19, 2020 at 08:41:22AM +0200, Lukas Czerner wrote:
> >> On Wed, Jun 17, 2020 at 02:19:04PM -0500, Eric Sandeen wrote:
> >>> If for any reason a directory passed to do_split() does not have enough
> >>> active entries to exceed half the size of the block, we can end up
> >>> iterating over all "count" entries without finding a split point.
> >>>
> >>> In this case, count == move, and split will be zero, and we will
> >>> attempt a negative index into map[].
> >>>
> >>> Guard against this by detecting this case, and falling back to
> >>> split-to-half-of-count instead; in this case we will still have
> >>> plenty of space (> half blocksize) in each split block.
> >>>
> >>> Fixes: ef2b02d3e617 ("ext34: ensure do_split leaves enough free space in both blocks")
> >>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> >>> ---
> >>>
> >>> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> >>> index a8aca4772aaa..8b60881f07ee 100644
> >>> --- a/fs/ext4/namei.c
> >>> +++ b/fs/ext4/namei.c
> >>> @@ -1858,7 +1858,7 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
> >>>  			     blocksize, hinfo, map);
> >>>  	map -= count;
> >>>  	dx_sort_map(map, count);
> >>> -	/* Split the existing block in the middle, size-wise */
> >>> +	/* Ensure that neither split block is over half full */
> >>>  	size = 0;
> >>>  	move = 0;
> >>>  	for (i = count-1; i >= 0; i--) {
> >>> @@ -1868,8 +1868,18 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
> >>>  		size += map[i].size;
> >>>  		move++;
> >>>  	}
> >>> -	/* map index at which we will split */
> >>> -	split = count - move;
> >>> +	/*
> >>> +	 * map index at which we will split
> >>> +	 *
> >>> +	 * If the sum of active entries didn't exceed half the block size, just
> >>> +	 * split it in half by count; each resulting block will have at least
> >>> +	 * half the space free.
> >>> +	 */
> >>> +	if (i > 0)
> >>> +		split = count - move;
> >>> +	else
> >>> +		split = count/2;
> >>
> >> Won't we have exactly the same problem as we did before your commit
> >> ef2b02d3e617cb0400eedf2668f86215e1b0e6af ? Since we do not know how much
> >> space we actually moved we might have not made enough space for the new
> >> entry ?
> >>
> >> Also since we have the move == count when the problem appears then it's
> >> clear that we never hit the condition
> >>
> >> 		/* is more than half of this entry in 2nd half of the block? */
> >> 		if (size + map[i].size/2 > blocksize/2)
> >> 			break;
> >>
> >> in the loop. This is surprising but it means the entries must have
> >> gaps between them that are small enough that we can't fit the entry
> >> right in ? Should not we try to compact it before splitting, or is it
> >> the case that this should have been done somewhere else ?
> > 
> > The other possibility is that map[i].size is not right and indeed there
> > seems to be a bug in dx_make_map()
> > 
> > map_tail->size = le16_to_cpu(de->rec_len);
> > 
> > should be
> > 
> > map_tail->size = ext4_rec_len_from_disk(de->rec_len, blocksize);
> > 
> > right ? Otherwise with large enough records the size will be smaller
> > than it really is.
> 
> well, those are the same thing unless (PAGE_SIZE >= 65536) so I don't
> think that's the issue here.
> 
> static inline unsigned int
> ext4_rec_len_from_disk(__le16 dlen, unsigned blocksize)
> {
>         unsigned len = le16_to_cpu(dlen);
> 
> #if (PAGE_SIZE >= 65536)
> ...
> #else
>         return len;
> #endif
> }

Ah you're right. The reproducer for this is kind of unreliable as well
so that's why it looked to be fixed with this I guess.

> 
> Should be fixed for consistency, but seems to not be a root cause here.

Agreed.

-Lukas

> 
> > A quick look at fs/ext4/namei.c reveals a couple of places where rec_len
> > is used without the conversion and we should check whether it needs
> > fixing.
> 
> ...
>
Lukas Czerner June 19, 2020, 1:53 p.m. UTC | #9
On Fri, Jun 19, 2020 at 08:44:19AM -0500, Eric Sandeen wrote:
> On 6/19/20 6:16 AM, Lukas Czerner wrote:
> 
> >> The other possibility is that map[i].size is not right and indeed there
> >> seems to be a bug in dx_make_map()
> >>
> >> map_tail->size = le16_to_cpu(de->rec_len);
> >>
> >> should be
> >>
> >> map_tail->size = ext4_rec_len_from_disk(de->rec_len, blocksize);
> >>
> >> right ? Otherwise with large enough records the size will be smaller
> >> than it really is.
> >>
> >> A quick look at fs/ext4/namei.c reveals a couple of places where rec_len
> >> is used without the conversion and we should check whether it needs
> >> fixing.
> >>
> >> -Lukas
> > 
> > And indeed the following patch seems to have fixed the issue we were
> > seeing. Eric I think that this might be a proper fix. But we still need
> > to check the other uses of rec_len to make sure it's ok as well.
> > 
> > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> > index 94ec882..5509fdc 100644
> > --- a/fs/ext4/namei.c
> > +++ b/fs/ext4/namei.c
> > @@ -1068,7 +1068,7 @@ static int dx_make_map(struct ext4_dir_entry_2 *de, unsigned blocksize,
> >                         map_tail--;
> >                         map_tail->hash = h.hash;
> >                         map_tail->offs = ((char *) de - base)>>2;
> > -                       map_tail->size = le16_to_cpu(de->rec_len);
> > +                       map_tail->size = ext4_rec_len_from_disk(le16_to_cpu(de->rec_len), blocksize);
> 
> That isn't right, ext4_rec_len_from_disk /takes/ an __le16 :)
> 
> -                       map_tail->size = le16_to_cpu(de->rec_len);
> +                       map_tail->size = ext4_rec_len_from_disk(de->rec_len, blocksize);

Yep, my bad.

> 
> would be more correct, but won't matter for PAGE_SIZE < 65536 right?

True, it's not the problem we're seeing.

-Lukas

> 
> -Eric
>
Jan Kara July 8, 2020, 4:09 p.m. UTC | #10
On Fri 19-06-20 08:39:53, Eric Sandeen wrote:
> On 6/19/20 1:41 AM, Lukas Czerner wrote:
> > On Wed, Jun 17, 2020 at 02:19:04PM -0500, Eric Sandeen wrote:
> >> If for any reason a directory passed to do_split() does not have enough
> >> active entries to exceed half the size of the block, we can end up
> >> iterating over all "count" entries without finding a split point.
> >>
> >> In this case, count == move, and split will be zero, and we will
> >> attempt a negative index into map[].
> >>
> >> Guard against this by detecting this case, and falling back to
> >> split-to-half-of-count instead; in this case we will still have
> >> plenty of space (> half blocksize) in each split block.
> 
> ...
> 
> >> +	/*
> >> +	 * map index at which we will split
> >> +	 *
> >> +	 * If the sum of active entries didn't exceed half the block size, just
> >> +	 * split it in half by count; each resulting block will have at least
> >> +	 * half the space free.
> >> +	 */
> >> +	if (i > 0)
> >> +		split = count - move;
> >> +	else
> >> +		split = count/2;
> > 
> > Won't we have exactly the same problem as we did before your commit
> > ef2b02d3e617cb0400eedf2668f86215e1b0e6af ? Since we do not know how much
> > space we actually moved we might have not made enough space for the new
> > entry ?
> 
> I don't think so - while we don't have the original reproducer, I assume that
> it was the case where the block was very full, and splitting by count left us
> with one of the split blocks still over half full (because ensuring that we
> split in half by size seemed to fix it)
> 
> In this case, the sum of the active entries was <= half the block size.
> So if we split by count, we're guaranteed to have >= half the block size free
> in each side of the split.
>  
> > Also since we have the move == count when the problem appears then it's
> > clear that we never hit the condition
> > 
> > 		/* is more than half of this entry in 2nd half of the block? */
> > 		if (size + map[i].size/2 > blocksize/2)
> > 			break;
> > 
> > in the loop. This is surprising but it means the entries must have
> > gaps between them that are small enough that we can't fit the entry
> > right in ? Should not we try to compact it before splitting, or is it
> > the case that this should have been done somewhere else ?
> 
> Yes, that's exactly what happened - see my 0/1 cover letter.  Maybe that should
> be in the patch description itself.  Also, yes compaction would help but I was
> unclear as to whether that should be done here, is the side effect of some other
> bug, etc.  In general, we do seem to do compaction elsewhere and I don't know
> how we got to this point.
> 
> > If we really want to be fair and we want to split it right in the middle
> > of the entries size-wise then we need to keep track of the sum of the
> > entries and decide based on that, not blocksize/2. But maybe the problem
> > could be solved by compacting the entries together because the condition
> > seems to rely on that.
> 
> I thought about that as well, but it took a bit more code to do; we could make
> make_map() return both count and total size, for example.  But based on my
> theory above that both sides of the split will have >= half block free, it
> didn't seem necessary, particularly since this seems like an edge case?

This didn't seem to conclude in any way? The patch looks good to me FWIW so
feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

Ted, can you please pick this patch up? Thanks!

								Honza
Theodore Ts'o July 30, 2020, 1:48 a.m. UTC | #11
On Wed, Jun 17, 2020 at 02:19:04PM -0500, Eric Sandeen wrote:
> If for any reason a directory passed to do_split() does not have enough
> active entries to exceed half the size of the block, we can end up
> iterating over all "count" entries without finding a split point.
> 
> In this case, count == move, and split will be zero, and we will
> attempt a negative index into map[].
> 
> Guard against this by detecting this case, and falling back to
> split-to-half-of-count instead; in this case we will still have
> plenty of space (> half blocksize) in each split block.
> 
> Fixes: ef2b02d3e617 ("ext34: ensure do_split leaves enough free space in both blocks")
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>

Thanks, applied.

						- Ted

Patch

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index a8aca4772aaa..8b60881f07ee 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1858,7 +1858,7 @@  static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
 			     blocksize, hinfo, map);
 	map -= count;
 	dx_sort_map(map, count);
-	/* Split the existing block in the middle, size-wise */
+	/* Ensure that neither split block is over half full */
 	size = 0;
 	move = 0;
 	for (i = count-1; i >= 0; i--) {
@@ -1868,8 +1868,18 @@  static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
 		size += map[i].size;
 		move++;
 	}
-	/* map index at which we will split */
-	split = count - move;
+	/*
+	 * map index at which we will split
+	 *
+	 * If the sum of active entries didn't exceed half the block size, just
+	 * split it in half by count; each resulting block will have at least
+	 * half the space free.
+	 */
+	if (i > 0)
+		split = count - move;
+	else
+		split = count/2;
+
 	hash2 = map[split].hash;
 	continued = hash2 == map[split - 1].hash;
 	dxtrace(printk(KERN_INFO "Split block %lu at %x, %i/%i\n",