[RFC] Advertise IDE physical block size as 4K

Message ID 1262081278-1858-1-git-send-email-avi@redhat.com
State New
Commit Message

Avi Kivity Dec. 29, 2009, 10:07 a.m. UTC
Guests use this number as a hint for alignment and I/O request sizes.  Given
that modern disks have 4K block sizes, and cached file-backed images also
have 4K block sizes, this hint can improve guest performance.

We probably need to make this configurable depending on machine type.  It
should be the default for -M 0.13 only as it can affect guest code paths.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 hw/ide/core.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

Comments

Jamie Lokier Dec. 29, 2009, 1:21 p.m. UTC | #1
Avi Kivity wrote:
> Guests use this number as a hint for alignment and I/O request sizes.

It's not just a hint.  It is also the "radius of corruption on failed
write" - important for journalling filesystems and databases.

> Given
> that modern disks have 4K block sizes,

Do they, yet?

> and cached file-backed images also have 4K block sizes, this hint
> can improve guest performance.

Agreed - but see below.

> We probably need to make this configurable depending on machine type.  It
> should be the default for -M 0.13 only as it can affect guest code paths.

What about that Windows/Linux 4k sectors incompatibility thing, where
disks with 4k sectors have to sense whether the first partition starts
at 512-byte sector 63 (Linux) or 512-byte sector 1024 (or something;
Windows), and then adjust their 512-byte sector to 4k-sector mapping
so that 4k blocks within the partition are aligned to 4k sectors?

Iirc, Linux (and old but not current Windows) tends to place the first
partition starting at sector 63, which means 4k filesystem blocks will
_not_ align to 4k blocks in the cached file-backed images with Qemu.

It has been discussed for hardware disk design with 4k sectors, and
somehow there were plans to map sectors so that the Linux partition
scheme results in nicely aligned filesystem blocks - so Qemu's IDE
(and SCSI) emulation should do the same.  Or should it?  I don't know
how the 4k sector thing worked out in the end, or if it's still in
discussion.

-- Jamie
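
[Editor's note] The sector-63 concern above comes down to simple arithmetic. The following is a minimal sketch, assuming 512-byte logical sectors and a 4096-byte physical block as discussed in the thread; the function name `misalignment_bytes` is hypothetical and not taken from QEMU or any partitioning tool.

```c
#include <assert.h>
#include <stdint.h>

#define LOGICAL_SECTOR_SIZE 512u /* bytes per logical sector, as assumed in the thread */

/*
 * Byte offset of a partition start within its physical block.
 * A result of 0 means the partition starts on a physical block boundary.
 * Hypothetical helper for illustration only.
 */
static uint64_t misalignment_bytes(uint64_t start_lba, uint32_t physical_block_size)
{
    assert(physical_block_size != 0);
    return (start_lba * LOGICAL_SECTOR_SIZE) % physical_block_size;
}
```

With these assumptions, a partition starting at sector 63 begins 3584 bytes into a 4K block, so every 4K filesystem block straddles two physical blocks, while a start at sector 64 is aligned.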
Luca Tettamanti Dec. 29, 2009, 1:39 p.m. UTC | #2
On Tue, Dec 29, 2009 at 2:21 PM, Jamie Lokier <jamie@shareable.org> wrote:
> Avi Kivity wrote:
>> Guests use this number as a hint for alignment and I/O request sizes.
>
> It's not just a hint.  It is also the "radius of corruption on failed
> write" - important for journalling filesystems and databases.
>
>> Given
>> that modern disks have 4K block sizes,
>
> Do they, yet?

Yes, there are WD disks in the wild with 4k blocks, although in this
first transition phase the firmware hides the fact and emulates the
old 512b sector.

>> We probably need to make this configurable depending on machine type.  It
>> should be the default for -M 0.13 only as it can affect guest code paths.
>
> What about that Windows/Linux 4k sectors incompatibility thing, where
> disks with 4k sectors have to sense whether the first partition starts
> at 512-byte sector 63 (Linux) or 512-byte sector 1024 (or something;
> Windows), and then adjust their 512-byte sector to 4k-sector mapping
> so that 4k blocks within the partition are aligned to 4k sectors?

Linux tools put the first partition at sector 63 (512-byte) to retain
compatibility with Windows; Linux itself does not have any problem
with different layouts. See e.g. [1]
The problem seems to be limited to Win 5.x (XP, 2k3) and WD has a
utility[2] to re-align partitions in this case, so I guess that they
do cope fine with a 4k-aligned partition table, they just create it
unaligned by default.

> It has been discussed for hardware disk design with 4k sectors, and
> somehow there were plans to map sectors so that the Linux partition
> scheme results in nicely aligned filesystem blocks

Ugh, I hope you're wrong ;-) AFAICS remapping will lead only to
headaches... Linux does not have any problem with aligned partitions.

Luca
[1] http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase-block-size/
[2] http://support.wdc.com/product/download.asp?groupid=805&sid=123&lang=en
Avi Kivity Dec. 29, 2009, 1:42 p.m. UTC | #3
On 12/29/2009 03:39 PM, Luca Tettamanti wrote:
>
> Ugh, I hope you're wrong ;-) AFAICS remapping will lead only to
> headaches... Linux does not have any problem with aligned partitions.
>
>    

And in fact, that was the motivation for this patch, as parted will 
align based on the physical block size.
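
[Editor's note] As a rough illustration of what aligning on the physical block size means for a partitioning tool, the sketch below rounds a starting LBA up to the next physical block boundary. This is not parted's actual algorithm; the function name `align_up_lba` and its parameter are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round a starting LBA up to the next multiple of the number of
 * logical sectors per physical block (8 for 512-byte logical sectors
 * on a 4K physical block). Hypothetical helper for illustration only.
 */
static uint64_t align_up_lba(uint64_t lba, uint32_t logical_per_physical)
{
    assert(logical_per_physical != 0);
    uint64_t rem = lba % logical_per_physical;
    return rem == 0 ? lba : lba + (logical_per_physical - rem);
}
```

Under these assumptions a tool that would otherwise start the first partition at sector 63 moves it to sector 64 once the device reports 8 logical sectors per physical sector.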
Christoph Hellwig Jan. 4, 2010, 8:34 a.m. UTC | #4
On Tue, Dec 29, 2009 at 02:39:38PM +0100, Luca Tettamanti wrote:
> Linux tools put the first partition at sector 63 (512-byte) to retain
> compatibility with Windows;

Well, some of them, and depending on the exact disks.  It's all rather
complicated.

> > It has been discussed for hardware disk design with 4k sectors, and
> > somehow there were plans to map sectors so that the Linux partition
> > scheme results in nicely aligned filesystem blocks
> 
> Ugh, I hope you're wrong ;-) AFAICS remapping will lead only to
> headaches... Linux does not have any problem with aligned partitions.

Linux doesn't care, and neither does Windows.  But performance on mis-aligned
partitions will suck badly on 4k sector drives, SSDs, and probably
various copy-on-write layers in virtualization once you hit the worst
case.  Fortunately the block topology information present in recent
ATA and SCSI standards allows the storage hardware to tell about the
required alignment, and Linux now has a topology API to expose it, which
is used by the most recent versions of the partitioning tools and
filesystem creation tools.
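
[Editor's note] On Linux, the topology values mentioned above are visible to userspace through sysfs attributes such as /sys/block/<dev>/queue/physical_block_size, /sys/block/<dev>/queue/logical_block_size and /sys/block/<dev>/alignment_offset. The sketch below reads one such attribute; the helper name `read_sysfs_ulong` is hypothetical, and the device name in the usage note is only an example.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a single unsigned decimal value from a sysfs attribute file.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 * Hypothetical helper for illustration only.
 */
static int read_sysfs_ulong(const char *path, unsigned long *out)
{
    assert(path != NULL && out != NULL);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    int parsed = fscanf(f, "%lu", out);
    fclose(f);
    return parsed == 1 ? 0 : -1;
}
```

A guest could call, for example, read_sysfs_ulong("/sys/block/sda/queue/physical_block_size", &v) to see what the emulated disk advertises, assuming the disk shows up as sda.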
Christoph Hellwig Jan. 4, 2010, 8:36 a.m. UTC | #5
On Tue, Dec 29, 2009 at 12:07:58PM +0200, Avi Kivity wrote:
> Guests use this number as a hint for alignment and I/O request sizes.  Given
> that modern disks have 4K block sizes, and cached file-backed images also
> have 4K block sizes, this hint can improve guest performance.
> 
> We probably need to make this configurable depending on machine type.  It
> should be the default for -M 0.13 only as it can affect guest code paths.

The information is correct per the ATA spec, but:

 (a) as mentioned above it should not be used for old machine types
 (b) we need to sort out passing through the first block alignment bits
     that are also in IDENTIFY word 106 if using a raw block device
     underneath
 (c) probably need to adjust the physical block size depending on the
     underlying storage topology.

I have a patch in my queue for a while now dealing with (b) and parts of
(c), but it's been preempted by more urgent work.
Patch

diff --git a/hw/ide/core.c b/hw/ide/core.c
index 76c3820..89fd3ce 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -164,6 +164,7 @@  static void ide_identify(IDEState *s)
     put_le16(p + 101, s->nb_sectors >> 16);
     put_le16(p + 102, s->nb_sectors >> 32);
     put_le16(p + 103, s->nb_sectors >> 48);
+    put_le16(p + 106, 0x6000 | 3); /* 8 logical sectors per physical sector */
 
     memcpy(s->identify_data, p, sizeof(s->identify_data));
     s->identify_set = 1;
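
[Editor's note] The value 0x6000 | 3 written by the patch can be read back as follows. This is a minimal sketch of decoding IDENTIFY DEVICE word 106, assuming the ATA8-ACS layout: bit 15 clear and bit 14 set mark the word as valid, bit 13 indicates multiple logical sectors per physical sector, and bits 3:0 hold the exponent X in 2^X logical sectors per physical sector. The struct and function names are hypothetical and not part of QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type, not part of QEMU. */
struct sector_topology {
    bool valid;                    /* word 106 carries valid information */
    unsigned logical_per_physical; /* logical sectors per physical sector */
};

static struct sector_topology decode_word106(uint16_t word)
{
    struct sector_topology t = { false, 1 };

    /* Bit 15 must be 0 and bit 14 must be 1 for the word to be valid. */
    if ((word & 0xC000) != 0x4000)
        return t;
    t.valid = true;

    /* Bit 13: multiple logical sectors per physical sector.
     * Bits 3:0 then hold the exponent X in 2^X. */
    if (word & 0x2000)
        t.logical_per_physical = 1u << (word & 0x000F);

    return t;
}

/* Self-check using the value written by the patch. */
static void check_patch_value(void)
{
    struct sector_topology t = decode_word106(0x6000 | 3);
    assert(t.valid && t.logical_per_physical == 8);
}
```

With 512-byte logical sectors, 8 logical sectors per physical sector corresponds to the 4K physical block size named in the subject line.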