
block devices: validate block device capacity

Message ID alpine.LRH.2.02.1401301531040.29912@file01.intranet.prod.int.rdu2.redhat.com
State Not Applicable
Delegated to: David Miller

Commit Message

Mikulas Patocka Jan. 30, 2014, 8:40 p.m. UTC
When running the LVM2 testsuite on a 32-bit kernel, there are unkillable
processes stuck in the kernel consuming 100% CPU:
blkid           R running      0  2005   1409 0x00000004
ce009d00 00000082 ffffffcf c11280ba 00000060 560b5dfd 00003111 00fe41cb
00000000 ce009d00 00000000 d51cfeb0 00000000 0000001e 00000002 ffffffff
00000002 c10748c1 00000002 c106cca4 00000000 00000000 ffffffff 00000000
Call Trace:
[<c11280ba>] ? radix_tree_next_chunk+0xda/0x2c0
[<c10748c1>] ? release_pages+0x61/0x160
[<c106cca4>] ? find_get_pages+0x84/0x100
[<c1251fbe>] ? _cond_resched+0x1e/0x40
[<c10758cb>] ? truncate_inode_pages_range+0x12b/0x440
[<c1075cb7>] ? truncate_inode_pages+0x17/0x20
[<c10cf2ba>] ? __blkdev_put+0x3a/0x140
[<c10d02db>] ? blkdev_close+0x1b/0x40
[<c10a60b2>] ? __fput+0x72/0x1c0
[<c1039461>] ? task_work_run+0x61/0xa0
[<c1253b6f>] ? work_notifysig+0x24/0x35

This is caused by the fact that the LVM2 testsuite creates a 64TB device.
The kernel uses "unsigned long" to index pages in files and block devices;
on a 64TB device the "unsigned long" page index overflows (it can address
at most 16TB with 4k pages), causing the infinite loop.
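
To make the failure mode concrete, here is a minimal user-space sketch of
the wraparound (illustrative only; this is not the kernel's actual loop,
but pgoff_t is an unsigned long, and this truncation is what turns a
bounded page scan into an endless one):

/* sketch.c - illustrative only, not kernel code */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t bytes     = 64ULL << 40;         /* 64TB device */
	uint64_t last_page = bytes / 4096 - 1;    /* 2^34 - 1 with 4k pages */
	uint32_t end       = (uint32_t)last_page; /* 32-bit index: wraps to 2^32 - 1 */

	/* a loop such as "for (i = 0; i <= end; i++)" never terminates
	   when end == 0xffffffff, because i <= end is always true */
	printf("page index after truncation: %u\n", end);
	return 0;
}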

On 32-bit architectures, we must limit block device size to
PAGE_SIZE*(2^32-1).

The bug with untested device size is pervasive across the whole kernel:
some drivers test that the device size fits in sector_t, but this test is
not sufficient on 32-bit architectures. This patch introduces a new
function, validate_disk_capacity, that tests whether the disk capacity is
acceptable for the current kernel, and modifies the brd, ide-gd, dm and
sd drivers to use it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 block/genhd.c                 |   23 +++++++++++++++++++++++
 drivers/block/brd.c           |   15 +++++++++++----
 drivers/ide/ide-gd.c          |    8 ++++++++
 drivers/md/dm-ioctl.c         |    3 +--
 drivers/md/dm-table.c         |   14 +++++++++++++-
 drivers/scsi/sd.c             |   20 +++++++++++---------
 include/linux/device-mapper.h |    2 +-
 include/linux/genhd.h         |    2 ++
 8 files changed, 70 insertions(+), 17 deletions(-)


Comments

James Bottomley Jan. 30, 2014, 10:49 p.m. UTC | #1
On Thu, 2014-01-30 at 15:40 -0500, Mikulas Patocka wrote:
> When running the LVM2 testsuite on 32-bit kernel, there are unkillable
> processes stuck in the kernel consuming 100% CPU:
> blkid           R running      0  2005   1409 0x00000004
> ce009d00 00000082 ffffffcf c11280ba 00000060 560b5dfd 00003111 00fe41cb
> 00000000 ce009d00 00000000 d51cfeb0 00000000 0000001e 00000002 ffffffff
> 00000002 c10748c1 00000002 c106cca4 00000000 00000000 ffffffff 00000000
> Call Trace:
> [<c11280ba>] ? radix_tree_next_chunk+0xda/0x2c0
> [<c10748c1>] ? release_pages+0x61/0x160
> [<c106cca4>] ? find_get_pages+0x84/0x100
> [<c1251fbe>] ? _cond_resched+0x1e/0x40
> [<c10758cb>] ? truncate_inode_pages_range+0x12b/0x440
> [<c1075cb7>] ? truncate_inode_pages+0x17/0x20
> [<c10cf2ba>] ? __blkdev_put+0x3a/0x140
> [<c10d02db>] ? blkdev_close+0x1b/0x40
> [<c10a60b2>] ? __fput+0x72/0x1c0
> [<c1039461>] ? task_work_run+0x61/0xa0
> [<c1253b6f>] ? work_notifysig+0x24/0x35
> 
> This is caused by the fact that the LVM2 testsuite creates 64TB device.
> The kernel uses "unsigned long" to index pages in files and block devices,
> on 64TB device "unsigned long" overflows (it can address up to 16TB with
> 4k pages), causing the infinite loop.

Why is this?  The whole reason for CONFIG_LBDAF is supposed to be to
allow 64-bit offsets for block devices on 32 bit.  It sounds like
there's somewhere not using sector_t ... or using it wrongly ... which
needs fixing.

> On 32-bit architectures, we must limit block device size to
> PAGE_SIZE*(2^32-1).

So you're saying CONFIG_LBDAF can never work, why?

James



Mikulas Patocka Jan. 30, 2014, 11:10 p.m. UTC | #2
On Thu, 30 Jan 2014, James Bottomley wrote:

> Why is this?  the whole reason for CONFIG_LBDAF is supposed to be to
> allow 64 bit offsets for block devices on 32 bit.  It sounds like
> there's somewhere not using sector_t ... or using it wrongly which needs
> fixing.

The page cache uses unsigned long as a page index. Therefore, if unsigned 
long is 32-bit, the block device may have at most 2^32-1 pages.

> > On 32-bit architectures, we must limit block device size to
> > PAGE_SIZE*(2^32-1).
> 
> So you're saying CONFIG_LBDAF can never work, why?
> 
> James

CONFIG_LBDAF works, but it doesn't allow unlimited capacity: on x86, 
without CONFIG_LBDAF, the limit is 2TiB. With CONFIG_LBDAF, the limit is 
16TiB (4096*2^32).
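
For reference, a quick back-of-the-envelope sketch of where those two
limits come from (assuming 512-byte sectors and 4k pages):

/* limits.c - derivation of the x86-32 capacity limits */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* without CONFIG_LBDAF: 32-bit sector_t, 512-byte sectors */
	uint64_t no_lbdaf = (1ULL << 32) * 512;   /* = 2TiB  */
	/* with CONFIG_LBDAF: 64-bit sector_t, but pgoff_t is still a
	   32-bit unsigned long indexing 4k pages */
	uint64_t lbdaf    = (1ULL << 32) * 4096;  /* = 16TiB */

	printf("%llu %llu\n", (unsigned long long)no_lbdaf,
	       (unsigned long long)lbdaf);
	return 0;
}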

Mikulas
James Bottomley Jan. 30, 2014, 11:37 p.m. UTC | #3
On Thu, 2014-01-30 at 18:10 -0500, Mikulas Patocka wrote:
> 
> On Thu, 30 Jan 2014, James Bottomley wrote:
> 
> > Why is this?  the whole reason for CONFIG_LBDAF is supposed to be to
> > allow 64 bit offsets for block devices on 32 bit.  It sounds like
> > there's somewhere not using sector_t ... or using it wrongly which needs
> > fixing.
> 
> The page cache uses unsigned long as a page index. Therefore, if unsigned 
> long is 32-bit, the block device may have at most 2^32-1 pages.

Um, that's the index into the mapping, not the device; a device can have
multiple mappings and each mapping has a radix tree of pages.  For most
filesystems a mapping is equivalent to a file, so we can have large
filesystems, but they can't actually have files over 4GB on 32 bits,
otherwise mmap fails.

Are we running into a problem with struct address_space where we've
assumed the inode belongs to the file, and lvm is doing something where
it's the whole device?

> > > On 32-bit architectures, we must limit block device size to
> > > PAGE_SIZE*(2^32-1).
> > 
> > So you're saying CONFIG_LBDAF can never work, why?
> > 
> > James
> 
> CONFIG_LBDAF works, but it doesn't allow unlimited capacity: on x86, 
> without CONFIG_LBDAF, the limit is 2TiB. With CONFIG_LBDAF, the limit is 
> 16TiB (4096*2^32).

I don't think the people who did the large block device work expected to
gain only 3 bits for all their pain.

James



Mikulas Patocka Jan. 31, 2014, 12:20 a.m. UTC | #4
On Thu, 30 Jan 2014, James Bottomley wrote:

> On Thu, 2014-01-30 at 18:10 -0500, Mikulas Patocka wrote:
> > 
> > On Thu, 30 Jan 2014, James Bottomley wrote:
> > 
> > > Why is this?  the whole reason for CONFIG_LBDAF is supposed to be to
> > > allow 64 bit offsets for block devices on 32 bit.  It sounds like
> > > there's somewhere not using sector_t ... or using it wrongly which needs
> > > fixing.
> > 
> > The page cache uses unsigned long as a page index. Therefore, if unsigned 
> > long is 32-bit, the block device may have at most 2^32-1 pages.
> 
> Um, that's the index into the mapping, not the device; a device can have
> multiple mappings and each mapping has a radix tree of pages.  For most
> filesystems a mapping is equivalent to a file, so we can have large
> filesystems, but they can't have files over actually 4GB on 32 bits
> otherwise mmap fails.

A device may be accessed directly (by opening /dev/sdX) and it creates a 
mapping too - thus, the size of a mapping limits the size of a block 
device.

The main problem is that pgoff_t has 4 bytes - changing it to 8 bytes may 
fix it - but there may be some hidden places where pgoff is converted to 
unsigned long - who knows if they exist or not?
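
A minimal illustration of that hazard in plain C (hypothetical code, not
taken from the kernel; it assumes pgoff_t were widened while an overlooked
callee kept an unsigned long parameter):

#include <stdint.h>
#include <stdio.h>

typedef uint64_t pgoff_t;               /* hypothetically widened pgoff_t */

static void lookup(unsigned long index) /* overlooked 32-bit call site */
{
	printf("looking up page %lu\n", index);
}

int main(void)
{
	pgoff_t idx = 1ULL << 33; /* a page beyond the 16TiB boundary */
	lookup(idx);              /* on 32-bit: silently truncated to 0 */
	return 0;
}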

> Are we running into a problem with struct address_space where we've
> assumed the inode belongs to the file and lvm is doing something where
> it's the whole device?

lvm creates a 64TiB device, udev runs blkid on that device and blkid opens 
the device and gets stuck because of unsigned long overflow.

> > > > On 32-bit architectures, we must limit block device size to
> > > > PAGE_SIZE*(2^32-1).
> > > 
> > > So you're saying CONFIG_LBDAF can never work, why?
> > > 
> > > James
> > 
> > CONFIG_LBDAF works, but it doesn't allow unlimited capacity: on x86, 
> > without CONFIG_LBDAF, the limit is 2TiB. With CONFIG_LBDAF, the limit is 
> > 16TiB (4096*2^32).
> 
> I don't think the people who did the large block device work expected to
> gain only 3 bits for all their pain.
> 
> James

One could change it to have three choices:
2TiB limit - 32-bit sector_t and 32-bit pgoff_t
16TiB limit - 64-bit sector_t and 32-bit pgoff_t
32PiB limit - 64-bit sector_t and 64-bit pgoff_t

Though, we need to know if the people who designed memory management agree 
with changing pgoff_t to 64 bits.

Mikulas
James Bottomley Jan. 31, 2014, 1:43 a.m. UTC | #5
On Thu, 2014-01-30 at 19:20 -0500, Mikulas Patocka wrote:
> 
> On Thu, 30 Jan 2014, James Bottomley wrote:
> 
> > On Thu, 2014-01-30 at 18:10 -0500, Mikulas Patocka wrote:
> > > 
> > > On Thu, 30 Jan 2014, James Bottomley wrote:
> > > 
> > > > Why is this?  the whole reason for CONFIG_LBDAF is supposed to be to
> > > > allow 64 bit offsets for block devices on 32 bit.  It sounds like
> > > > there's somewhere not using sector_t ... or using it wrongly which needs
> > > > fixing.
> > > 
> > > The page cache uses unsigned long as a page index. Therefore, if unsigned 
> > > long is 32-bit, the block device may have at most 2^32-1 pages.
> > 
> > Um, that's the index into the mapping, not the device; a device can have
> > multiple mappings and each mapping has a radix tree of pages.  For most
> > filesystems a mapping is equivalent to a file, so we can have large
> > filesystems, but they can't have files over actually 4GB on 32 bits
> > otherwise mmap fails.
> 
> A device may be accessed directly (by opening /dev/sdX) and it creates a 
> mapping too - thus, the size of a mapping limits the size of a block 
> device.

Right, that's what I suspected below.  We can't damage large block
support on filesystems just because of this corner case.

> The main problem is that pgoff_t has 4 bytes - changing it to 8 bytes may 
> fix it - but there may be some hidden places where pgoff is converted to 
> unsigned long - who knows, if they exist or not?

I don't think we want to do that ... it will make struct page fatter and
have knock-on impacts in the radix tree code.  To fix this, we need to
make the corner case (i.e. opening large block devices without a
filesystem) bear the pain.  It sort of looks like we want to do a linear
array of mappings of 64TB for the device so the page cache calculations
don't overflow.

> > Are we running into a problem with struct address_space where we've
> > assumed the inode belongs to the file and lvm is doing something where
> > it's the whole device?
> 
> lvm creates a 64TiB device, udev runs blkid on that device and blkid opens 
> the device and gets stuck because of unsigned long overflow.

Well, a simple open won't cause this ... it must be trying to read the
end of the device for some reason.  But anyway, the way to fix this is
to fix the large block open as a corner case.

> > > > > On 32-bit architectures, we must limit block device size to
> > > > > PAGE_SIZE*(2^32-1).
> > > > 
> > > > So you're saying CONFIG_LBDAF can never work, why?
> > > > 
> > > > James
> > > 
> > > CONFIG_LBDAF works, but it doesn't allow unlimited capacity: on x86, 
> > > without CONFIG_LBDAF, the limit is 2TiB. With CONFIG_LBDAF, the limit is 
> > > 16TiB (4096*2^32).
> > 
> > I don't think the people who did the large block device work expected to
> > gain only 3 bits for all their pain.
> > 
> > James
> 
> One could change it to have three choices:
> 2TiB limit - 32-bit sector_t and 32-bit pgoff_t
> 16TiB limit - 64-bit sector_t and 32-bit pgoff_t
> 32PiB limit - 64-bit sector_t and 64-bit pgoff_t
> 
> Though, we need to know if the people who designed memory management agree 
> with changing pgoff_t to 64 bits.

I don't think we can change the size of pgoff_t ... because it won't
just be that, there will be other problems like the radix tree.

However, you also have to bear in mind that truncating large block
device support to 64TB on 32 bits is a technical ABI break.  Hopefully
it is only technical because I don't know of any current consumer block
device that is 64TB yet, but anyone who'd created a filesystem >64TB
would find it no longer mounted on 32 bits.
James

Mikulas Patocka Jan. 31, 2014, 2:43 a.m. UTC | #6
On Thu, 30 Jan 2014, James Bottomley wrote:

> > A device may be accessed directly (by opening /dev/sdX) and it creates a 
> > mapping too - thus, the size of a mapping limits the size of a block 
> > device.
> 
> Right, that's what I suspected below.  We can't damage large block
> support on filesystems just because of this corner case.

Devices larger than 16TiB never worked on a 32-bit kernel, so this patch 
isn't damaging anything.

Note that even if you attach a block device larger than 16TiB and mount 
it without ever opening it directly, it still won't work, because the 
buffer cache uses the page cache (see the function __find_get_block_slow 
and the variable "pgoff_t index" - that variable would overflow if the 
filesystem accessed a buffer beyond 16TiB).
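
For context, the relevant lines, abridged from fs/buffer.c of this era
(the exact code may differ slightly): "block" is a sector_t, which can be
64-bit with CONFIG_LBDAF, but "index" is a pgoff_t, so the result of the
shift is truncated to 32 bits before the page cache is ever consulted:

static struct buffer_head *
__find_get_block_slow(struct block_device *bdev, sector_t block)
{
	struct inode *bd_inode = bdev->bd_inode;
	pgoff_t index;	/* unsigned long: the overflow point */
	...
	index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits);
	page = find_get_page(bd_inode->i_mapping, index);
	...
}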

> > The main problem is that pgoff_t has 4 bytes - changing it to 8 bytes may 
> > fix it - but there may be some hidden places where pgoff is converted to 
> > unsigned long - who knows, if they exist or not?
> 
> I don't think we want to do that ... it will make struct page fatter and
> have knock on impacts in the radix tree code.  To fix this, we need to
> make the corner case (i.e. opening large block devices without a
> filesystem) bear the pain.  It sort of looks like we want to do a linear
> array of mappings of 64TB for the device so the page cache calculations
> don't overflow.

The code that reads and writes data to block devices and files is shared - 
the functions in mm/filemap.c work for both files and block devices.
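
That sharing is visible in the block device's file_operations; abridged
from fs/block_dev.c of this era (details vary by version), reads and mmap
on /dev/sdX go through the same mm/filemap.c helpers as regular files:

const struct file_operations def_blk_fops = {
	.read		= do_sync_read,
	.aio_read	= generic_file_aio_read,	/* mm/filemap.c */
	.write		= do_sync_write,
	.mmap		= generic_file_mmap,
	...
};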

So, if you want 64-bit page offsets, you need to increase pgoff_t size, 
and that will increase the limit for both files and block devices.

You shouldn't have separate functions for managing pages on files and 
separate functions for managing pages on block devices - that would 
increase code size and cause maintenance problems.

> > Though, we need to know if the people who designed memory management agree 
> > with changing pgoff_t to 64 bits.
> 
> I don't think we can change the size of pgoff_t ... because it won't
> just be that, it will be other problems like the radix tree.

If we can't change it, then we must stay with the current 16TiB limit. 
There's no other way.

> However, you also have to bear in mind that truncating large block
> device support to 64TB on 32 bits is a technical ABI break.  Hopefully
> it is only technical because I don't know of any current consumer block
> device that is 64TB yet, but anyone who'd created a filesystem >64TB
> would find it no-longer mounted on 32 bits.
> James

It is not an ABI break, because block devices larger than 16TiB never 
worked on 32-bit architectures. So it's better to refuse them outright 
than to cause subtle lockups or data corruption.

Mikulas
James Bottomley Jan. 31, 2014, 5:45 a.m. UTC | #7
On Thu, 2014-01-30 at 21:43 -0500, Mikulas Patocka wrote:
> 
> On Thu, 30 Jan 2014, James Bottomley wrote:
> 
> > > A device may be accessed directly (by opening /dev/sdX) and it creates a 
> > > mapping too - thus, the size of a mapping limits the size of a block 
> > > device.
> > 
> > Right, that's what I suspected below.  We can't damage large block
> > support on filesystems just because of this corner case.
> 
> Devices larger than 16TiB never worked on 32-bit kernel, so this patch 
> isn't damaging anything.

It damages expectations: 32 bit with CONFIG_LBDAF is supposed to be able
to do almost everything 64 bits can.

> Note that if you attach a 16TiB block device, don't open it and mount it, 
> it still won't work, because the buffer cache uses the page cache (see the 
> function __find_get_block_slow and the variable "pgoff_t index" - that 
> variable would overflow if the filesystem accessed a buffer beyond 16TiB).

That depends on the layout of the fs metadata.

> > > The main problem is that pgoff_t has 4 bytes - changing it to 8 bytes may 
> > > fix it - but there may be some hidden places where pgoff is converted to 
> > > unsigned long - who knows, if they exist or not?
> > 
> > I don't think we want to do that ... it will make struct page fatter and
> > have knock on impacts in the radix tree code.  To fix this, we need to
> > make the corner case (i.e. opening large block devices without a
> > filesystem) bear the pain.  It sort of looks like we want to do a linear
> > array of mappings of 64TB for the device so the page cache calculations
> > don't overflow.
> 
> The code that reads and writes data to block devices and files is shared - 
> the functions in mm/filemap.c work for both files and block devices.

Yes.

> So, if you want 64-bit page offsets, you need to increase pgoff_t size, 
> and that will increase the limit for both files and block devices.

No.  The point is the page cache mapping of the device uses a
manufactured inode saved in the backing device. It looks fixable in the
buffer code before the page cache gets involved.

> You shouldn't have separate functions for managing pages on files and 
> separate functions for managing pages on block devices - that would 
> increase code size and cause maintenance problems.

It wouldn't; it would add structure to the buffer cache for large
devices.

> > > Though, we need to know if the people who designed memory management agree 
> > > with changing pgoff_t to 64 bits.
> > 
> > I don't think we can change the size of pgoff_t ... because it won't
> > just be that, it will be other problems like the radix tree.
> 
> If we can't change it, then we must stay with the current 16TiB limit. 
> There's no other way.
> 
> > However, you also have to bear in mind that truncating large block
> > device support to 64TB on 32 bits is a technical ABI break.  Hopefully
> > it is only technical because I don't know of any current consumer block
> > device that is 64TB yet, but anyone who'd created a filesystem >64TB
> > would find it no-longer mounted on 32 bits.
> > James
> 
> It is not ABI break, because block devices larger than 16TiB never worked 
> on 32-bit architectures. So it's better to refuse them outright, than to 
> cause subtle lockups or data corruption.

An ABI is a contract between userspace and the kernel.  Saying we
can remove a clause in the contract because no-one ever exercised it, and
not call it changing the contract, is sophistry.  The correct thing to do
would be to call it a bug and fix it.

In a couple of short years we'll be over 16TB for hard drives.  I don't
really want to be the one explaining to the personal storage people that
the only way to install a 16+TB drive in their ARM (or Quark) based
Linux systems is a processor upgrade.

I suppose there are a couple of possibilities: pgoff_t + radix tree
expansion or double radix tree in the buffer code.  This should probably
be taken to fsdevel where they might have better ideas.

James


Mikulas Patocka Jan. 31, 2014, 8:20 a.m. UTC | #8
On Thu, 30 Jan 2014, James Bottomley wrote:

> > So, if you want 64-bit page offsets, you need to increase pgoff_t size, 
> > and that will increase the limit for both files and block devices.
> 
> No.  The point is the page cache mapping of the device uses a
> manufactured inode saved in the backing device. It looks fixable in the
> buffer code before the page cache gets involved.

So if you think you can support 16TiB devices and leave pgoff_t 32-bit, 
send a patch that does it.

Until you make it, you should apply the patch that I sent, which prevents 
kernel lockups or data corruption when the user uses a 16TiB device on a 
32-bit kernel.

Mikulas
Christoph Hellwig Feb. 3, 2014, 8:15 a.m. UTC | #9
On Fri, Jan 31, 2014 at 03:20:17AM -0500, Mikulas Patocka wrote:
> So if you think you can support 16TiB devices and leave pgoff_t 32-bit, 
> send a patch that does it.
> 
> Until you make it, you should apply the patch that I sent, that prevents 
> kernel lockups or data corruption when the user uses 16TiB device on 
> 32-bit kernel.

Exactly.  I had actually looked into support for > 16TiB devices for
a NAS use case a while ago, but once the effort involved was explained,
the idea was dropped quickly.  The Linux block device is too deeply
tied to the pagecache to make it easily feasible.

Mikulas Patocka Feb. 3, 2014, 8:22 p.m. UTC | #10
On Mon, 3 Feb 2014, Christoph Hellwig wrote:

> On Fri, Jan 31, 2014 at 03:20:17AM -0500, Mikulas Patocka wrote:
> > So if you think you can support 16TiB devices and leave pgoff_t 32-bit, 
> > send a patch that does it.
> > 
> > Until you make it, you should apply the patch that I sent, that prevents 
> > kernel lockups or data corruption when the user uses 16TiB device on 
> > 32-bit kernel.
> 
> Exactly.  I had actually looked into support for > 16TiB devices for
> a NAS use case a while ago, but when explaining the effort involves
> the idea was dropped quickly.  The Linux block device is too deeply
> tied to the pagecache to make it easily feasible.

The memory management routines use pgoff_t, so we could define pgoff_t to 
be a 64-bit type. But lib/radix-tree.c uses unsigned long as 
an index into the radix tree - and pgoff_t is cast to unsigned long when 
calling the radix-tree routines - so we'd need to change lib/radix-tree.c 
to use pgoff_t.
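
For reference, the index type in the radix tree API; declarations abridged
from include/linux/radix-tree.h of this era:

int   radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
void *radix_tree_delete(struct radix_tree_root *, unsigned long);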

Then, there may be other places where pgoff_t is cast to unsigned long, 
and they are not trivial to find (one could enable some extra compiler 
warnings about truncating values when casting them, but I suppose this 
would trigger a lot of false positives). This needs some deep review by 
people who designed the memory management code.
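
As a sketch of the kind of warning meant here: gcc's -Wconversion flags
implicit narrowing such as the function below, but it says nothing about
explicit casts - which is exactly how pgoff_t reaches the radix-tree
routines - and it is noisy on a code base of this size:

/* build with: gcc -m32 -Wconversion -c example.c */
typedef unsigned long long wide_pgoff_t; /* hypothetical 64-bit pgoff_t */

unsigned long narrow(wide_pgoff_t index)
{
	return index; /* -Wconversion: conversion may change value */
}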

Mikulas

Patch

Index: linux-2.6-compile/block/genhd.c
===================================================================
--- linux-2.6-compile.orig/block/genhd.c	2014-01-30 17:23:15.000000000 +0100
+++ linux-2.6-compile/block/genhd.c	2014-01-30 19:28:42.000000000 +0100
@@ -1835,3 +1835,26 @@  static void disk_release_events(struct g
 	WARN_ON_ONCE(disk->ev && disk->ev->block != 1);
 	kfree(disk->ev);
 }
+
+int validate_disk_capacity(u64 n_sectors, const char **reason)
+{
+	u64 n_pages;
+	if (n_sectors << 9 >> 9 != n_sectors) {
+		if (reason)
+			*reason = "The number of bytes is greater than 2^64.";
+		return -EOVERFLOW;
+	}
+	n_pages = (n_sectors + (1 << (PAGE_SHIFT - 9)) - 1) >> (PAGE_SHIFT - 9);
+	if (n_pages > ULONG_MAX) {
+		if (reason)
+			*reason = "Use 64-bit kernel.";
+		return -EFBIG;
+	}
+	if (n_sectors != (sector_t)n_sectors) {
+		if (reason)
+			*reason = "Use a kernel compiled with support for large block devices.";
+		return -ENOSPC;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(validate_disk_capacity);
Index: linux-2.6-compile/drivers/block/brd.c
===================================================================
--- linux-2.6-compile.orig/drivers/block/brd.c	2014-01-30 17:23:15.000000000 +0100
+++ linux-2.6-compile/drivers/block/brd.c	2014-01-30 19:26:51.000000000 +0100
@@ -429,12 +429,12 @@  static const struct block_device_operati
  * And now the modules code and kernel interface.
  */
 static int rd_nr;
-int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
+static unsigned rd_size = CONFIG_BLK_DEV_RAM_SIZE;
 static int max_part;
 static int part_shift;
 module_param(rd_nr, int, S_IRUGO);
 MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices");
-module_param(rd_size, int, S_IRUGO);
+module_param(rd_size, uint, S_IRUGO);
 MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");
 module_param(max_part, int, S_IRUGO);
 MODULE_PARM_DESC(max_part, "Maximum number of partitions per RAM disk");
@@ -446,7 +446,7 @@  MODULE_ALIAS("rd");
 /* Legacy boot options - nonmodular */
 static int __init ramdisk_size(char *str)
 {
-	rd_size = simple_strtol(str, NULL, 0);
+	rd_size = simple_strtoul(str, NULL, 0);
 	return 1;
 }
 __setup("ramdisk_size=", ramdisk_size);
@@ -463,6 +463,13 @@  static struct brd_device *brd_alloc(int
 {
 	struct brd_device *brd;
 	struct gendisk *disk;
+	u64 capacity = (u64)rd_size * 2;
+	const char *reason;
+
+	if (validate_disk_capacity(capacity, &reason)) {
+		printk(KERN_ERR "brd: disk is too big: %s\n", reason);
+		goto out;
+	}
 
 	brd = kzalloc(sizeof(*brd), GFP_KERNEL);
 	if (!brd)
@@ -493,7 +500,7 @@  static struct brd_device *brd_alloc(int
 	disk->queue		= brd->brd_queue;
 	disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;
 	sprintf(disk->disk_name, "ram%d", i);
-	set_capacity(disk, rd_size * 2);
+	set_capacity(disk, capacity);
 
 	return brd;
 
Index: linux-2.6-compile/drivers/ide/ide-gd.c
===================================================================
--- linux-2.6-compile.orig/drivers/ide/ide-gd.c	2014-01-30 17:23:17.000000000 +0100
+++ linux-2.6-compile/drivers/ide/ide-gd.c	2014-01-30 19:26:51.000000000 +0100
@@ -58,6 +58,14 @@  static void ide_disk_put(struct ide_disk
 
 sector_t ide_gd_capacity(ide_drive_t *drive)
 {
+	int v;
+	const char *reason;
+	v = validate_disk_capacity(drive->capacity64, &reason);
+	if (v) {
+		printk(KERN_ERR "%s: The disk is too big. %s\n",
+			drive->name, reason);
+		return 0;
+	}
 	return drive->capacity64;
 }
 
Index: linux-2.6-compile/drivers/scsi/sd.c
===================================================================
--- linux-2.6-compile.orig/drivers/scsi/sd.c	2014-01-30 17:23:24.000000000 +0100
+++ linux-2.6-compile/drivers/scsi/sd.c	2014-01-30 19:26:51.000000000 +0100
@@ -1960,6 +1960,8 @@  static int read_capacity_16(struct scsi_
 	unsigned int alignment;
 	unsigned long long lba;
 	unsigned sector_size;
+	int v;
+	const char *reason;
 
 	if (sdp->no_read_capacity_16)
 		return -EINVAL;
@@ -2014,10 +2016,9 @@  static int read_capacity_16(struct scsi_
 		return -ENODEV;
 	}
 
-	if ((sizeof(sdkp->capacity) == 4) && (lba >= 0xffffffffULL)) {
-		sd_printk(KERN_ERR, sdkp, "Too big for this kernel. Use a "
-			"kernel compiled with support for large block "
-			"devices.\n");
+	v = validate_disk_capacity(lba + (lba != ULLONG_MAX), &reason);
+	if (v) {
+		sd_printk(KERN_ERR, sdkp, "The disk is too big. %s\n", reason);
 		sdkp->capacity = 0;
 		return -EOVERFLOW;
 	}
@@ -2053,8 +2054,10 @@  static int read_capacity_10(struct scsi_
 	int sense_valid = 0;
 	int the_result;
 	int retries = 3, reset_retries = READ_CAPACITY_RETRIES_ON_RESET;
-	sector_t lba;
+	unsigned long long lba;
 	unsigned sector_size;
+	int v;
+	const char *reason;
 
 	do {
 		cmd[0] = READ_CAPACITY;
@@ -2100,10 +2103,9 @@  static int read_capacity_10(struct scsi_
 		return sector_size;
 	}
 
-	if ((sizeof(sdkp->capacity) == 4) && (lba == 0xffffffff)) {
-		sd_printk(KERN_ERR, sdkp, "Too big for this kernel. Use a "
-			"kernel compiled with support for large block "
-			"devices.\n");
+	v = validate_disk_capacity(lba + 1, &reason);
+	if (v) {
+		sd_printk(KERN_ERR, sdkp, "The disk is too big. %s\n", reason);
 		sdkp->capacity = 0;
 		return -EOVERFLOW;
 	}
Index: linux-2.6-compile/include/linux/genhd.h
===================================================================
--- linux-2.6-compile.orig/include/linux/genhd.h	2014-01-30 17:23:29.000000000 +0100
+++ linux-2.6-compile/include/linux/genhd.h	2014-01-30 19:26:51.000000000 +0100
@@ -451,6 +451,8 @@  static inline void set_capacity(struct g
 	disk->part0.nr_sects = size;
 }
 
+extern int validate_disk_capacity(u64 n_sectors, const char **reason);
+
 #ifdef CONFIG_SOLARIS_X86_PARTITION
 
 #define SOLARIS_X86_NUMSLICE	16
Index: linux-2.6-compile/drivers/md/dm-ioctl.c
===================================================================
--- linux-2.6-compile.orig/drivers/md/dm-ioctl.c	2014-01-30 17:23:17.000000000 +0100
+++ linux-2.6-compile/drivers/md/dm-ioctl.c	2014-01-30 19:26:51.000000000 +0100
@@ -1250,8 +1250,7 @@  static int populate_table(struct dm_tabl
 		}
 
 		r = dm_table_add_target(table, spec->target_type,
-					(sector_t) spec->sector_start,
-					(sector_t) spec->length,
+					spec->sector_start, spec->length,
 					target_params);
 		if (r) {
 			DMWARN("error adding target to table");
Index: linux-2.6-compile/drivers/md/dm-table.c
===================================================================
--- linux-2.6-compile.orig/drivers/md/dm-table.c	2014-01-30 17:23:17.000000000 +0100
+++ linux-2.6-compile/drivers/md/dm-table.c	2014-01-30 19:26:51.000000000 +0100
@@ -702,11 +702,12 @@  static int validate_hardware_logical_blo
 }
 
 int dm_table_add_target(struct dm_table *t, const char *type,
-			sector_t start, sector_t len, char *params)
+			u64 start, u64 len, char *params)
 {
 	int r = -EINVAL, argc;
 	char **argv;
 	struct dm_target *tgt;
+	const char *reason;
 
 	if (t->singleton) {
 		DMERR("%s: target type %s must appear alone in table",
@@ -724,6 +725,17 @@  int dm_table_add_target(struct dm_table
 		return -EINVAL;
 	}
 
+	if (start + len < start) {
+		DMERR("%s: target length overflow", dm_device_name(t->md));
+		return -EOVERFLOW;
+	}
+
+	r = validate_disk_capacity(start + len, &reason);
+	if (r) {
+		DMERR("%s: device is too big: %s", dm_device_name(t->md), reason);
+		return r;
+	}
+
 	tgt->type = dm_get_target_type(type);
 	if (!tgt->type) {
 		DMERR("%s: %s: unknown target type", dm_device_name(t->md),
Index: linux-2.6-compile/include/linux/device-mapper.h
===================================================================
--- linux-2.6-compile.orig/include/linux/device-mapper.h	2014-01-30 17:23:29.000000000 +0100
+++ linux-2.6-compile/include/linux/device-mapper.h	2014-01-30 19:26:51.000000000 +0100
@@ -428,7 +428,7 @@  int dm_table_create(struct dm_table **re
  * Then call this once for each target.
  */
 int dm_table_add_target(struct dm_table *t, const char *type,
-			sector_t start, sector_t len, char *params);
+			u64 start, u64 len, char *params);
 
 /*
  * Target_ctr should call this if it needs to add any callbacks.