Patchwork mtd: kernel BUG at arch/x86/mm/pat.c:279!

login
register
mail settings
Submitter Suresh Siddha
Date Sept. 7, 2012, 10:42 p.m.
Message ID <1347057778.26695.68.camel@sbsiddha-desk.sc.intel.com>
Download mbox | patch
Permalink /patch/182476/
State New
Headers show

Comments

Suresh Siddha - Sept. 7, 2012, 10:42 p.m.
On Fri, 2012-09-07 at 11:14 -0700, Linus Torvalds wrote:
> Guys, this looks like a MTD and/or io_remap_pfn_range() bug, and it's
> not getting any traction.
> 
> What the f*ck is mtd_mmap() doing, and why? The problem seems to be an
> overflow condition, because reserve_pfn_range() does
> 
>     reserve_memtype(paddr, paddr + size, want_flags, &flags);
> 
> and then the BUG_ON() in reserve_memtype is
> 
>     BUG_ON(start >= end);
> 
> so it very much looks like a paddr+size overflow. However, that makes
> little sense too, since we're working in "u64", so I suspect the
> overflow has happened somewhere earlier.
> 
> I really don't see where, though. Could somebody please take a look?
> The mtdchar_mmap() types seem insane (why "u32" for len, for example?
> And that whole
> 
>   off = vma->vm_pgoff << PAGE_SHIFT;
> 
> thing looks like it would overflow, since the whole point of pgoff is
> that if you shift it up by PAGE_SHIFT you need to also extend to
> 64-bit etc.
> 
> So I would *guess* that it's the mtdchar_mmap() stuff that overflows
> due to bad types, but maybe it does deeper than that?
> 

I started to look into this to see if this is a PAT issue but it does
indeed appear to be a mtd mmap issue.

Sasha, Does the appended fix the issue for you?

--8<--
From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: mtd: check the starting offset to be mmap'd

We need to check if both the starting offset aswell the total length
being mmap'd are with in the limits. With a large starting offset,
offset + (length-to-be-mapped) can wrap and appear smaller than the
limit. Need to check both start and end.

Also fix the types of the variables start, off, len.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 drivers/mtd/mtdchar.c |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)
Linus Torvalds - Sept. 7, 2012, 11:09 p.m.
On Fri, Sep 7, 2012 at 3:42 PM, Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> -       unsigned long start;
> -       unsigned long off;
> -       u32 len;
> +       resource_size_t start, off;
> +       unsigned long len;

So since the oops is on x86-64, I don't think it's the "unsigned long"
-> "resource_size_t" part (which can be an issue on 32-bit
architectures, though).

The "u32 len" -> "unsigned long len" thing *might* make a difference, though.

I also think your patch is incomplete even on 32-bit, because this:

>         if (mtd->type == MTD_RAM || mtd->type == MTD_ROM) {
>                 off = vma->vm_pgoff << PAGE_SHIFT;

is still wrong. It probably should be

    off = vma->vm_pgoff;
    off <<= PAGE_SHIFT;

because vm_pgoff may be a 32-bit type, while "resource_size_t" may be
64-bit. Shifting the 32-bit type without a cast (implicit or explicit)
isn't going to help.

That said, we have absolutely *tons* of bugs with this particular
pattern. Just do

    git grep 'vm_pgoff.*<<.*PAGE_SHIFT'

and there are distressingly few casts in there (there's a few, mainly
in fs/proc).

Now, I suspect many of them are fine just because most users probably
are size-limited anyway, but it's a bit distressing stuff. And I
suspect it means we might want to introduce a helper function like

    static inline u64 vm_offset(struct vm_area_struct *vma)
    {
        return (u64)vma->vm_pgoff << PAGE_SHIFT;
    }

or something. Maybe add the "vm_length()" helper while at it too,
since the whole "vma->vm_end - vma->vm_start" thing is so common.

Anyway, since Sasha's oops is clearly not 32-bit, the above issues
don't matter, and it would be interesting to hear if it's the 32-bit
'len' thing that triggers this problem. Still, I can't see how it
would - as far as I can tell, a truncated 'len' would at most result
in spurious early "return -EINVAL", not any real problem.

What are we missing?

Sasha, since you can apparently reproduce it, can you replace the
"BUG_ON()" with just a

 if (start >= end) {
    printf("bogus range %llx - %llx\n", start, end);
    return -EINVAL;
  }

or something.

I'm starting to suspect that maybe it's actually that the length is
*zero*, and start == end, and that we should just return zero for that
case. But let's see what Sasha finds..

              Linus
Suresh Siddha - Sept. 7, 2012, 11:54 p.m.
On Fri, 2012-09-07 at 16:09 -0700, Linus Torvalds wrote:
> The "u32 len" -> "unsigned long len" thing *might* make a difference, though.

This I believe doesn't fix the reported BUG. I was trying to address
your previous comment about broken types.

> 
> I also think your patch is incomplete even on 32-bit, because this:
> 
> >         if (mtd->type == MTD_RAM || mtd->type == MTD_ROM) {
> >                 off = vma->vm_pgoff << PAGE_SHIFT;
> 
> is still wrong. It probably should be
> 
>     off = vma->vm_pgoff;
>     off <<= PAGE_SHIFT;
> 
> because vm_pgoff may be a 32-bit type, while "resource_size_t" may be
> 64-bit. Shifting the 32-bit type without a cast (implicit or explicit)
> isn't going to help.

Agree.

> That said, we have absolutely *tons* of bugs with this particular
> pattern. Just do
> 
>     git grep 'vm_pgoff.*<<.*PAGE_SHIFT'
> 
> and there are distressingly few casts in there (there's a few, mainly
> in fs/proc).
> 
> Now, I suspect many of them are fine just because most users probably
> are size-limited anyway, but it's a bit distressing stuff. And I
> suspect it means we might want to introduce a helper function like
> 
>     static inline u64 vm_offset(struct vm_area_struct *vma)
>     {
>         return (u64)vma->vm_pgoff << PAGE_SHIFT;
>     }
> 
> or something. Maybe add the "vm_length()" helper while at it too,
> since the whole "vma->vm_end - vma->vm_start" thing is so common.

Agree.

> Anyway, since Sasha's oops is clearly not 32-bit, the above issues
> don't matter, and it would be interesting to hear if it's the 32-bit
> 'len' thing that triggers this problem. Still, I can't see how it
> would - as far as I can tell, a truncated 'len' would at most result
> in spurious early "return -EINVAL", not any real problem.
> 
> What are we missing?
> 

On Fri, 2012-09-07 at 15:42 -0700, Suresh Siddha wrote:
> -               if ((vma->vm_end - vma->vm_start + off) > len)
> +               if (off >= len || (vma->vm_end - vma->vm_start + off) > len)
>                         return -EINVAL; 

This is the relevant portion that I am thinking will address the BUG.

Essentially the user is trying to mmap at a very large offset (from the
oops it appears "vma->vm_pgoff << PAGE_SHIFT + start" ends up to
"0xfffffffffffff000").

So it appears that the condition "(vma->vm_end - vma->vm_start + off) >
len" might be false because of the wraparound? and doesn't return
-EINVAL.

Let's see what Sasha finds. Anyways the patch does indeed require your
above mentioned vm_pgoff fix for the 32-bit case.

thanks,
suresh
Sasha Levin - Sept. 8, 2012, 8:10 a.m.
On 09/08/2012 01:09 AM, Linus Torvalds wrote:
> Sasha, since you can apparently reproduce it, can you replace the
> "BUG_ON()" with just a
> 
>  if (start >= end) {
>     printf("bogus range %llx - %llx\n", start, end);
>     return -EINVAL;
>   }

Replacing it gives me the following:

[   36.231736] bogus range fffffffffffff000 - 0


Thanks,
Sasha
Linus Torvalds - Sept. 8, 2012, 7:57 p.m.
On Fri, Sep 7, 2012 at 4:54 PM, Suresh Siddha <suresh.b.siddha@intel.com> wrote:
>
> Essentially the user is trying to mmap at a very large offset (from the
> oops it appears "vma->vm_pgoff << PAGE_SHIFT + start" ends up to
> "0xfffffffffffff000").

Ok, Sasha confirmed that.

> So it appears that the condition "(vma->vm_end - vma->vm_start + off) >
> len" might be false because of the wraparound? and doesn't return
> -EINVAL.

Ack.

Anyway, that means that the BUG_ON() is likely bogus, but so is the
whole calling convention.

The 4kB range starting at 0xfffffffffffff000 sounds like a *valid*
range, but that requires that we fix the calling convention to not
have that "end" (exclusive) thing. It should either be "end"
(inclusive), or just "len".

So it should either be start=0xfffffffffffff000 end=0xffffffffffffffff
or it should be start=0xfffffffffffff000 len=0x1000.

Or we need to say that we must never accept things at the end of the
64-bit range.

Whatever. Something like this (TOTALLY UNTESTED) attached patch should
get the mtdchar overflows to go away, but it does *not* fix the fact
that the MTRR start/end model is broken. It really is technically
valid to have a resource_size_t range of 0xfffffffffffff000+0x1000,
and right now it causes a BUG_ON() in pat.c.

Suresh?

                    Linus
Suresh Siddha - Sept. 9, 2012, 2:56 p.m.
On Sat, 2012-09-08 at 12:57 -0700, Linus Torvalds wrote:
> Whatever. Something like this (TOTALLY UNTESTED) attached patch should
> get the mtdchar overflows to go away, 

It looks good to me. Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>

Sasha, can you please give this a try?

> but it does *not* fix the fact
> that the MTRR start/end model is broken. It really is technically
> valid to have a resource_size_t range of 0xfffffffffffff000+0x1000,
> and right now it causes a BUG_ON() in pat.c.
> 
> Suresh?

yes but that is not a valid range I think because of the supported
physical address bit limits of the processor and also the max
architecture limit of 52 address bits.

I guess we should be checking for those limits in pat.c, especially bits
above 52 are ignored by the HW and they can easily cause conflicting
aliases with other valid regions. I will get back with a different patch
to fix this.

thanks,
suresh
Linus Torvalds - Sept. 9, 2012, 3:31 p.m.
On Sun, Sep 9, 2012 at 7:56 AM, Suresh Siddha <suresh.b.siddha@intel.com> wrote:
>
> yes but that is not a valid range I think because of the supported
> physical address bit limits of the processor and also the max
> architecture limit of 52 address bits.

But how could the caller possibly know that? None of those internal
PAT limits are exposed anywhere.

So doing the BUG_ON() is wrong. I'd suggest changing it to an EINVAL.

In fact, BUG_ON() is *always* wrong, unless it's a "my internal data
structures are so messed up that I cannot continue".

                Linus
H. Peter Anvin - Sept. 9, 2012, 4:56 p.m.
On 09/08/2012 12:57 PM, Linus Torvalds wrote:
>
> Ack.
>
> Anyway, that means that the BUG_ON() is likely bogus, but so is the
> whole calling convention.
>
> The 4kB range starting at 0xfffffffffffff000 sounds like a *valid*
> range, but that requires that we fix the calling convention to not
> have that "end" (exclusive) thing. It should either be "end"
> (inclusive), or just "len".
>

On x86, it is definitely NOT a valid range.  There is no physical 
addresses there, and there will never be any.

> So it should either be start=0xfffffffffffff000 end=0xffffffffffffffff
> or it should be start=0xfffffffffffff000 len=0x1000.

I would strongly object to the former; that kind of inclusive ranges 
breed a whole class of bugs by themselves.

	-hpa
H. Peter Anvin - Sept. 9, 2012, 5:01 p.m.
On 09/09/2012 08:31 AM, Linus Torvalds wrote:
> On Sun, Sep 9, 2012 at 7:56 AM, Suresh Siddha <suresh.b.siddha@intel.com> wrote:
>>
>> yes but that is not a valid range I think because of the supported
>> physical address bit limits of the processor and also the max
>> architecture limit of 52 address bits.
>
> But how could the caller possibly know that? None of those internal
> PAT limits are exposed anywhere.
>
> So doing the BUG_ON() is wrong. I'd suggest changing it to an EINVAL.
>
> In fact, BUG_ON() is *always* wrong, unless it's a "my internal data
> structures are so messed up that I cannot continue".
>

I suspect the right answer is doing something like:

	u64 max_phys = 1ULL << boot_cpu_data.x86_phys_bits;

	if (start >= max_phys || end > max_phys || start >= end)
		return -EINVAL;

... although max_phys perhaps should be precalculated and stored in 
struct cpuinfo_x86 instead of being generated de novo.

	-hpa
David Woodhouse - Sept. 9, 2012, 7:04 p.m.
On Sun, 2012-09-09 at 09:56 -0700, H. Peter Anvin wrote:
> 
> > So it should either be start=0xfffffffffffff000 end=0xffffffffffffffff
> > or it should be start=0xfffffffffffff000 len=0x1000.
> 
> I would strongly object to the former; that kind of inclusive ranges 
> breed a whole class of bugs by themselves.

Another alternative that avoids overflow issues is to use a PFN rather
than a byte address.
H. Peter Anvin - Sept. 9, 2012, 8:33 p.m.
On 09/09/2012 12:04 PM, David Woodhouse wrote:
> On Sun, 2012-09-09 at 09:56 -0700, H. Peter Anvin wrote:
>>
>>> So it should either be start=0xfffffffffffff000 end=0xffffffffffffffff
>>> or it should be start=0xfffffffffffff000 len=0x1000.
>>
>> I would strongly object to the former; that kind of inclusive ranges
>> breed a whole class of bugs by themselves.
>
> Another alternative that avoids overflow issues is to use a PFN rather
> than a byte address.
>

Except as a result of that logic have a bunch of places which either 
have rounding errors in how they calculate PFNs, or they think they can 
stick PFNs into 32-bit numbers.  :(

	-hpa
Sasha Levin - Sept. 10, 2012, 5:17 a.m.
On 09/09/2012 06:56 PM, H. Peter Anvin wrote:
>>
>> Anyway, that means that the BUG_ON() is likely bogus, but so is the
>> whole calling convention.
>>
>> The 4kB range starting at 0xfffffffffffff000 sounds like a *valid*
>> range, but that requires that we fix the calling convention to not
>> have that "end" (exclusive) thing. It should either be "end"
>> (inclusive), or just "len".
>>
> 
> On x86, it is definitely NOT a valid range.  There is no physical addresses
> there, and there will never be any.

This reminds me a similar issue: If you try to mmap /dev/kmem at an offset which
is not kernel owned (such as 0), you'll get all the way to __pa() before getting
a BUG() about addresses not making sense.

How come there's no arch-specific validation of attempts to access
virtual/physical addresses? In the kmem example I'd assume that something very
early on should be yelling at me about doing something like that, but for some
reason I get all the way to __pa() before getting a BUG() (!).


Thanks,
Sasha
Sasha Levin - Sept. 12, 2012, 10:50 a.m.
On 09/09/2012 04:56 PM, Suresh Siddha wrote:
> On Sat, 2012-09-08 at 12:57 -0700, Linus Torvalds wrote:
>> > Whatever. Something like this (TOTALLY UNTESTED) attached patch should
>> > get the mtdchar overflows to go away, 
> It looks good to me. Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
> 
> Sasha, can you please give this a try?

Sorry for the delay. It looks good here.


Thanks,
Sasha
Sasha Levin - Sept. 12, 2012, 10:56 a.m.
On 09/12/2012 12:50 PM, Sasha Levin wrote:
> On 09/09/2012 04:56 PM, Suresh Siddha wrote:
>> On Sat, 2012-09-08 at 12:57 -0700, Linus Torvalds wrote:
>>>> Whatever. Something like this (TOTALLY UNTESTED) attached patch should
>>>> get the mtdchar overflows to go away, 
>> It looks good to me. Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
>>
>> Sasha, can you please give this a try?
> 
> Sorry for the delay. It looks good here.
> 
> 
> Thanks,
> Sasha
> 

Uh... sorry again, I obviously tested the second patch sent by Linus but
mistakingly replied to the wrong mail in the thread.
Sasha Levin - Sept. 28, 2012, 9 a.m.
On 09/12/2012 12:56 PM, Sasha Levin wrote:
> On 09/12/2012 12:50 PM, Sasha Levin wrote:
>> On 09/09/2012 04:56 PM, Suresh Siddha wrote:
>>> On Sat, 2012-09-08 at 12:57 -0700, Linus Torvalds wrote:
>>>>> Whatever. Something like this (TOTALLY UNTESTED) attached patch should
>>>>> get the mtdchar overflows to go away, 
>>> It looks good to me. Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
>>>
>>> Sasha, can you please give this a try?
>>
>> Sorry for the delay. It looks good here.
>>
>>
>> Thanks,
>> Sasha
>>
> 
> Uh... sorry again, I obviously tested the second patch sent by Linus but
> mistakingly replied to the wrong mail in the thread.

Is anyone planning on picking up Linus' patch? This is still not in -next even.


Thanks,
Sasha
Linus Torvalds - Sept. 28, 2012, 4:44 p.m.
On Fri, Sep 28, 2012 at 2:00 AM, Sasha Levin <levinsasha928@gmail.com> wrote:
>
> Is anyone planning on picking up Linus' patch? This is still not in -next even.

I was really hoping it would go through the regular channels and come
back to me that way, since I can't really test it, and it's bigger
than the trivial obvious one-liners that I'm happy to commit.

Hmm.

           Linus
Artem Bityutskiy - Sept. 28, 2012, 6:05 p.m.
On Fri, 2012-09-28 at 09:44 -0700, Linus Torvalds wrote:
> On Fri, Sep 28, 2012 at 2:00 AM, Sasha Levin <levinsasha928@gmail.com> wrote:
> >
> > Is anyone planning on picking up Linus' patch? This is still not in -next even.
> 
> I was really hoping it would go through the regular channels and come
> back to me that way, since I can't really test it, and it's bigger
> than the trivial obvious one-liners that I'm happy to commit.

Hi Linus,

I am not the maintainer, but please, go ahead an push your fix. I do not
have time to test it myself and it does not look like anyone else in the
small mtd community does.
David Woodhouse - Sept. 28, 2012, 7:04 p.m.
On Fri, 2012-09-28 at 09:44 -0700, Linus Torvalds wrote:
> On Fri, Sep 28, 2012 at 2:00 AM, Sasha Levin <levinsasha928@gmail.com> wrote:
> >
> > Is anyone planning on picking up Linus' patch? This is still not in -next even.
> 
> I was really hoping it would go through the regular channels and come
> back to me that way, since I can't really test it, and it's bigger
> than the trivial obvious one-liners that I'm happy to commit.

I can't test it on real hardware here either, but I can pull it through
my tree.
Linus Torvalds - Sept. 28, 2012, 7:13 p.m.
On Fri, Sep 28, 2012 at 11:05 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> I am not the maintainer, but please, go ahead an push your fix. I do not
> have time to test it myself and it does not look like anyone else in the
> small mtd community does.

Grr. I told people that patch wasn't tested. I hadn't even *compiled*
it. It has a typo ("inlint" instead of "inline" - so close).

Sasha said he had tested it, but nobody even mentioned this thing. Now
I'm nervous. I had committed it in my tree and was just about to push
it out when I decided that I should at least compile it despite the
"tested-by".

Hmm? Now I really *really* want to know that it's been tested on
actual hardware too. Sasha, what patch did you actually test? Did you
just fix the "inlint" thing, or was there something else entirely?

          Linus

PS. Artem: your "Reply-to" is broken, and doesn't have your real name.
Please fix.
richard -rw- weinberger - Sept. 28, 2012, 7:15 p.m.
On Fri, Sep 28, 2012 at 9:04 PM, David Woodhouse <dwmw2@infradead.org> wrote:
> On Fri, 2012-09-28 at 09:44 -0700, Linus Torvalds wrote:
>> On Fri, Sep 28, 2012 at 2:00 AM, Sasha Levin <levinsasha928@gmail.com> wrote:
>> >
>> > Is anyone planning on picking up Linus' patch? This is still not in -next even.
>>
>> I was really hoping it would go through the regular channels and come
>> back to me that way, since I can't really test it, and it's bigger
>> than the trivial obvious one-liners that I'm happy to commit.
>
> I can't test it on real hardware here either, but I can pull it through
> my tree.

If you can a easy test-case I can test it on real hardware.
richard -rw- weinberger - Sept. 28, 2012, 7:18 p.m.
On Fri, Sep 28, 2012 at 9:15 PM, richard -rw- weinberger
<richard.weinberger@gmail.com> wrote:
> On Fri, Sep 28, 2012 at 9:04 PM, David Woodhouse <dwmw2@infradead.org> wrote:
>> On Fri, 2012-09-28 at 09:44 -0700, Linus Torvalds wrote:
>>> On Fri, Sep 28, 2012 at 2:00 AM, Sasha Levin <levinsasha928@gmail.com> wrote:
>>> >
>>> > Is anyone planning on picking up Linus' patch? This is still not in -next even.
>>>
>>> I was really hoping it would go through the regular channels and come
>>> back to me that way, since I can't really test it, and it's bigger
>>> than the trivial obvious one-liners that I'm happy to commit.
>>
>> I can't test it on real hardware here either, but I can pull it through
>> my tree.
>
> If you can a easy test-case I can test it on real hardware.

Bah.
If you have an easy test case I can run it on real hardware. :)
Sasha Levin - Sept. 28, 2012, 7:44 p.m.
On 09/28/2012 09:13 PM, Linus Torvalds wrote:
> On Fri, Sep 28, 2012 at 11:05 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>>
>> I am not the maintainer, but please, go ahead an push your fix. I do not
>> have time to test it myself and it does not look like anyone else in the
>> small mtd community does.
> 
> Grr. I told people that patch wasn't tested. I hadn't even *compiled*
> it. It has a typo ("inlint" instead of "inline" - so close).
> 
> Sasha said he had tested it, but nobody even mentioned this thing. Now
> I'm nervous. I had committed it in my tree and was just about to push
> it out when I decided that I should at least compile it despite the
> "tested-by".
> 
> Hmm? Now I really *really* want to know that it's been tested on
> actual hardware too. Sasha, what patch did you actually test? Did you
> just fix the "inlint" thing, or was there something else entirely?

I've just fixed the inlint thing since it was pretty trivial, I didn't bother commenting on it since I figured it would turn into
an actual patch from someone who could test it on actual hardware first.

Note that I've tested it on a KVM guest, and not on real hardware - so I'm not sure how much of a test that is.


Thanks,
Sasha

Patch

diff --git a/drivers/mtd/mtdchar.c b/drivers/mtd/mtdchar.c
index f2f482b..f79c0fa 100644
--- a/drivers/mtd/mtdchar.c
+++ b/drivers/mtd/mtdchar.c
@@ -1132,16 +1132,15 @@  static int mtdchar_mmap(struct file *file, struct vm_area_struct *vma)
 	struct mtd_file_info *mfi = file->private_data;
 	struct mtd_info *mtd = mfi->mtd;
 	struct map_info *map = mtd->priv;
-	unsigned long start;
-	unsigned long off;
-	u32 len;
+	resource_size_t start, off;
+	unsigned long len;
 
 	if (mtd->type == MTD_RAM || mtd->type == MTD_ROM) {
 		off = vma->vm_pgoff << PAGE_SHIFT;
 		start = map->phys;
 		len = PAGE_ALIGN((start & ~PAGE_MASK) + map->size);
 		start &= PAGE_MASK;
-		if ((vma->vm_end - vma->vm_start + off) > len)
+		if (off >= len || (vma->vm_end - vma->vm_start + off) > len)
 			return -EINVAL;
 
 		off += start;