Patchwork Add tar container format

Submitter Alexander Graf
Date Aug. 5, 2009, 3:33 p.m.
Message ID <1249486402-10824-1-git-send-email-agraf@suse.de>
Permalink /patch/30800/
State Superseded

Comments

Alexander Graf - Aug. 5, 2009, 3:33 p.m.
Tar is a very widely used format to store data in. Sometimes people even put
virtual machine images in there.

So it makes sense for qemu to be able to read from tar files. I implemented a
reader, written from scratch, that also understands the GNU sparse format,
which is what pigz creates.

This version checks for filenames that end in well-known extensions. The logic
could be changed to search for filenames given on the command line, but that
would require changes to more parts of qemu.

The tar reader in conjunction with dzip gives us the chance to download
tar'ed up virtual machine images (even via http) and instantly make use of
them.

For that we still need to enable the qemu blockery to support stacking though.

Signed-off-by: Alexander Graf <agraf@suse.de>
---
 Makefile    |    2 +-
 block/tar.c |  326 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 327 insertions(+), 1 deletions(-)
 create mode 100644 block/tar.c
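
For readers unfamiliar with sparse tar members, here is a minimal sketch of
the lookup such a reader needs. It is not the code from the patch. It assumes
the sparse map has already been parsed into a sorted list of data extents and
that the extents are stored back to back in the archive; DataExtent and
sparse_lookup are names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One data extent of a sparse file: `len` bytes of real data that belong at
 * logical offset `offset`. Extents are assumed sorted and non-overlapping. */
typedef struct DataExtent {
    uint64_t offset;
    uint64_t len;
} DataExtent;

/* Translate a logical file offset into an offset inside the packed data area.
 * Returns 1 and stores the packed offset in *packed if the byte is backed by
 * stored data, or 0 if it falls into a hole (and therefore reads as zero). */
static int sparse_lookup(const DataExtent *ext, size_t n,
                         uint64_t logical, uint64_t *packed)
{
    uint64_t stored = 0;    /* bytes of data stored before extent i */
    size_t i;

    assert(packed != NULL);
    for (i = 0; i < n; i++) {
        if (logical < ext[i].offset) {
            return 0;       /* in the hole before this extent */
        }
        if (logical - ext[i].offset < ext[i].len) {
            *packed = stored + (logical - ext[i].offset);
            return 1;
        }
        stored += ext[i].len;
    }
    return 0;               /* past the last extent: trailing hole */
}
```

A read that lies entirely inside a hole can be answered with zeros without
touching the archive; only reads that straddle a hole boundary need to be
split up.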
Anthony Liguori - Aug. 13, 2009, 10:57 p.m.
Alexander Graf wrote:
> For that we still need to enable the qemu blockery to support stacking though.
>
> Signed-off-by: Alexander Graf <agraf@suse.de>
>   

This feels like a bit too much of a one-off to me.  I'm concerned that 
it's something we'd have to support long term that wouldn't have very 
many users beyond your particular use-case.

So far, we only support image formats that are dedicated to virtual 
machines.  We support http and nbd but those are generic protocols, not 
necessarily special formats.  tar and dzip seem arbitrary.  Why not zip, 
rar, bzip2, or any other format?

You could do all of this by making use of a fuse filesystem.

I'd almost rather see something like gio integration so that this sort 
of generic filesystem stuff could live somewhere else.  I'm curious what 
others think though.  Does it seem reasonable to include this type of 
functionality?
Alexander Graf - Aug. 13, 2009, 11:08 p.m.
On 14.08.2009, at 00:57, Anthony Liguori wrote:

> Alexander Graf wrote:
>> For that we still need to enable the qemu blockery to support  
>> stacking though.
>>
>> Signed-off-by: Alexander Graf <agraf@suse.de>
>>
>
> This feels like a bit too much of a one-off to me.  I'm concerned  
> that it's something we'd have to support long term that wouldn't  
> have very many users beyond your particular use-case.

Well, the same goes for "bochs", "parallels", "dmg" and all the other  
fun image format backends nobody uses, right? How do one or two more  
hurt you there?

Though I'm not saying that I wouldn't like to see users of this. It's  
great to have this option for archiving virtual machines (incl. config  
file) without the need to worry about readability of it. Just tar xzf  
it and you're done.

Also, we will start using this in SUSE Studio. So all appliances  
you'll download from there will be in tar.dzip format. That means that  
if this was in upstream, even fedora or ubuntu users could start those  
appliances without extracting or even downloading them first.

> So far, we only support image formats that are dedicated to virtual  
> machines.  We support http and nbd but those are generic protocols,  
> not necessarily special formats.  tar and dzip seem arbitrary.  Why  
> not zip, rar, bzip2, or any other format?

Simply because this was the only real standard I've found. The dict  
project has been using dictzip for a couple of years now and it seems  
to be pretty stable (only a single version exists).

Bzip2 is supposed to be chunkable, but I haven't found anyone who did  
this yet and my knowledge in compression is pretty small. I don't know  
what rar does, but pkzip has a 2 GB file size limit (which is pretty  
bad for VM images) and I'm not aware of any seek extensions to it.
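
To illustrate why a chunked format matters for random access, here is a
minimal sketch of the seek step. It assumes the compressor restarts at fixed
uncompressed chunk boundaries and records each chunk's compressed size. The
ChunkIndex layout and chunk_locate are hypothetical and are not the dictzip
on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical index of a chunk-compressed file: every chunk holds
 * `chunk_len` bytes of uncompressed data (the last one may hold less) and
 * `comp_len[i]` is the size of chunk i once compressed. */
typedef struct ChunkIndex {
    uint32_t chunk_len;
    size_t chunk_count;
    const uint32_t *comp_len;
} ChunkIndex;

/* Find the chunk that holds uncompressed offset `pos`.
 * On success returns 0, stores the chunk's start within the compressed data
 * area in *comp_offset and the number of bytes to skip after decompressing
 * that chunk in *skip. Returns -1 if `pos` is beyond the last chunk. */
static int chunk_locate(const ChunkIndex *idx, uint64_t pos,
                        uint64_t *comp_offset, uint32_t *skip)
{
    uint64_t chunk, off = 0;
    size_t i;

    assert(idx != NULL && idx->chunk_len > 0);
    chunk = pos / idx->chunk_len;
    if (chunk >= idx->chunk_count) {
        return -1;
    }
    /* Sum the compressed sizes of all chunks in front of the target. */
    for (i = 0; i < chunk; i++) {
        off += idx->comp_len[i];
    }
    *comp_offset = off;
    *skip = (uint32_t)(pos % idx->chunk_len);
    return 0;
}
```

Only the one chunk that contains the requested offset has to be decompressed,
which is what makes booting from a compressed image practical.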

If you have good suggestions, please go ahead. We need a VM format  
container that

   a) everyone can extract with existing tools
   b) can be used as input for qemu

As a sidenote, the OVF specification states that OVF Packages are  
supposed to be TAR files.

> You could do all of this by making use of a fuse filesystem.

Right, but I really don't want to have a server serving a virtual  
machine rely on anything even close to fuse. What if the fuse client  
starts hanging? The whole box goes down? No thanks.

> I'd almost rather see something like gio integration so that this  
> sort of generic filesystem stuff could live somewhere else.  I'm  
> curious what others think though.  Does it seem reasonable to  
> include this type of functionality?


A plugin infrastructure sounds nice at first, but doesn't do all the  
fedoras and ubuntus out there too much good ;-).

Alex
Anthony Liguori - Aug. 13, 2009, 11:13 p.m.
Alexander Graf wrote:
>
>> I'd almost rather see something like gio integration so that this 
>> sort of generic filesystem stuff could live somewhere else.  I'm 
>> curious what others think though.  Does it seem reasonable to include 
>> this type of functionality?
>
>
> A plugin infrastructure sounds nice at first, but doesn't do all the 
> fedoras and ubuntus out there too much good ;-).

If we did plugins, it would not be a GPL-safe boundary so all plugins 
would be forced to be GPL.

What's attractive about doing plugins for the block layer is that we 
have a relatively stable interface for block drivers.  The current AIO 
ops should be good for a very long time so API churn shouldn't be a 
major issue.  The code is all pretty well isolated today.

As part of the longer term refactoring, I think it also makes sense to 
split the block layer into a library that can be consumed independent of 
QEMU.  Obviously, folks want to make use of our block code who don't 
care at all about QEMU.  A lot of people use qemu-img for vmdk 
manipulation, for instance.  It also makes tools like qemu-iotest able 
to consume the block layer in a saner way.

If others agree, I think we should start going down this road.  
block-tar/block-dictzip seem like obvious candidates for plugins.

Regards,

Anthony Liguori

> Alex
Anthony Liguori - Aug. 13, 2009, 11:15 p.m.
Anthony Liguori wrote:
> If others agree, I think we should start going down this road.  
> block-tar/block-dictzip seem like obvious candidates for plugins.

And at some point in the distant future, I think it would also make 
sense for the block layer, once librarized and such, to be split out 
into a separate sub-project.

Regards,

Anthony Liguori
Christoph Hellwig - Aug. 15, 2009, 8:36 p.m.
On Thu, Aug 13, 2009 at 06:13:27PM -0500, Anthony Liguori wrote:
> What's attractive about doing plugins for the block layer is that we 
> have a relatively stable interface for block drivers.  The current AIO 
> ops should be good for a very long time so API churn shouldn't be a 
> major issue.  The code is all pretty well isolated today.
> 
> As part of the longer term refactoring, I think it also makes sense to 
> split the block layer into a library that can be consumed independent of 
> QEMU.  Obviously, folks want to make use of our block code who don't 
> care at all about QEMU.  A lot of people use qemu-img for vmdk 
> manipulation, for instance.  It also makes tools like qemu-iotest able 
> to consume the block layer in a saner way.
> 
> If others agree, I think we should start going down this road.  
> block-tar/block-dictzip seem like obvious candidates for plugins.

Splitting drivers from the core is a total disaster, you'll end up with
the same crap as the X drivers vs core X server versioning mess.  After
a long stabilization period we might be able to split the block layer
_including_ the drivers from qemu if we really want.  Splitting the
drivers into tiny subpackages would be plain stupid.

As far as additional image formats are concerned I'm personally not a
fan at all of any of that image format crap we have right now.  Anything
like qcow and friends or other formats that do complex metadata
manipulation in userspace is a really bad idea for data integrity.
Supporting simple containers with static metadata is absolutely fine,
even more so if it's read-only like the tar container here.

But one problem with tar is that there are many slightly or even totally
different tar formats and extensions around.  It's not necessarily a
format I would personally choose for a product.  That's something the
SUSE Studio people should think about, but nothing that should prevent
us from including the format.
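
As a small illustration of how the variants differ at the header level, here
is a minimal sketch, not taken from the patch, that tells a POSIX ustar header
from an old GNU one and decodes a numeric field in either the classic octal
form or the GNU base-256 extension. The names tar_flavor and tar_number are
made up for this example, and negative base-256 values are not handled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TAR_BLOCK   512
#define TAR_MAGIC   257     /* offset of the magic field (0x101) */

enum tar_flavor { TAR_V7_OR_UNKNOWN, TAR_POSIX_USTAR, TAR_GNU_OLD };

/* Classify a 512-byte header block by its magic and version fields. */
static enum tar_flavor tar_flavor(const unsigned char *hdr)
{
    /* POSIX: magic "ustar" plus NUL, followed by version "00" */
    if (!memcmp(hdr + TAR_MAGIC, "ustar\0" "00", 8)) {
        return TAR_POSIX_USTAR;
    }
    /* old GNU: "ustar" followed by two spaces and a NUL */
    if (!memcmp(hdr + TAR_MAGIC, "ustar  \0", 8)) {
        return TAR_GNU_OLD;
    }
    return TAR_V7_OR_UNKNOWN;
}

/* Decode a numeric header field of `len` bytes. Values wider than 64 bits
 * are silently truncated to their low 64 bits. */
static uint64_t tar_number(const unsigned char *field, size_t len)
{
    uint64_t val = 0;
    size_t i;

    assert(len > 0);
    if (field[0] & 0x80) {
        /* GNU base-256: flag bit in the first byte, rest is big-endian. */
        val = field[0] & 0x7f;
        for (i = 1; i < len; i++) {
            val = (val << 8) | field[i];
        }
        return val;
    }
    /* Classic form: optional leading spaces, then octal ASCII digits. */
    for (i = 0; i < len && field[i] == ' '; i++) {
    }
    for (; i < len && field[i] >= '0' && field[i] <= '7'; i++) {
        val = (val << 3) | (uint64_t)(field[i] - '0');
    }
    return val;
}
```

A probe that compares only the first five bytes of the magic accepts both
flavors, which is convenient but means the reader has to cope with the
extensions of each.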
Paolo Bonzini - Aug. 17, 2009, 12:05 p.m.
> Bzip2 is supposed to be chunkable, but I haven't found anyone who did
> this yet and my knowledge in compression is pretty small.

Bzip2 proceeds in chunks of approximately 100-900 KB size.  However, the 
size is taken after an initial RLE compression pass.  So bzip2 is only 
chunkable if you remove the initial RLE compression---this means hacking 
the bzip2 executable (see ADD_CHAR_TO_BLOCK and add_pair_to_block) in 
bzip2's bzlib.c.

Using GIO may have been a good idea a few years ago, but it seems a bit 
overengineered given that the block layer of qemu works well and it is 
pretty complex.

Paolo

Patch

diff --git a/Makefile b/Makefile
index 288190d..3183e71 100644
--- a/Makefile
+++ b/Makefile
@@ -73,7 +73,7 @@  block-obj-$(CONFIG_AIO) += posix-aio-compat.o
 
 block-nested-y += cow.o qcow.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
-block-nested-y += parallels.o nbd.o dictzip.o
+block-nested-y += parallels.o nbd.o dictzip.o tar.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/block/tar.c b/block/tar.c
new file mode 100644
index 0000000..2c965cc
--- /dev/null
+++ b/block/tar.c
@@ -0,0 +1,326 @@ 
+/*
+ * Tar block driver
+ *
+ * Copyright (c) 2009 Alexander Graf <agraf@suse.de>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "qemu-common.h"
+#include "block_int.h"
+
+// #define DEBUG
+
+#ifdef DEBUG
+#define dprintf(fmt, ...) do { printf("tar: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define dprintf(fmt, ...) do { } while (0)
+#endif
+
+#define SECTOR_SIZE      512
+
+#define POSIX_TAR_MAGIC  "ustar"
+#define OFFS_LENGTH      0x7c
+#define OFFS_TYPE        0x9c
+#define OFFS_MAGIC       0x101
+
+#define OFFS_S_SP        0x182
+#define OFFS_S_EXT       0x1e2
+#define OFFS_S_LENGTH    0x1e3
+#define OFFS_SX_EXT      0x1f8
+
+typedef struct SparseCache {
+    uint64_t start;
+    uint64_t end;
+} SparseCache;
+
+typedef struct BDRVTarState {
+    BlockDriverState *hd;
+    size_t file_sec;
+    uint64_t file_len;
+    SparseCache *sparse;
+    int sparse_num;
+    uint64_t last_end;
+} BDRVTarState;
+
+static int tar_probe(const uint8_t *buf, int buf_size, const char *filename)
+{
+    if (buf_size < OFFS_MAGIC + 5)
+        return 0;
+
+    /* we only support newer tar */
+    if (!strncmp((char*)buf + OFFS_MAGIC, POSIX_TAR_MAGIC, 5))
+        return 100;
+
+    return 0;
+}
+
+static int str_ends(char *str, const char *end)
+{
+    int end_len = strlen(end);
+    int str_len = strlen(str);
+
+    if (str_len < end_len)
+        return 0;
+
+    return !strncmp(str + str_len - end_len, end, end_len);
+}
+
+static int is_target_file(BlockDriverState *bs, char *filename)
+{
+    int retval = 0;
+
+    if (str_ends(filename, ".raw"))
+        retval = 1;
+
+    if (str_ends(filename, ".qcow"))
+        retval = 1;
+
+    if (str_ends(filename, ".qcow2"))
+        retval = 1;
+
+    if (str_ends(filename, ".vmdk"))
+        retval = 1;
+
+    dprintf("does filename %s match? %s\n", filename, retval ? "yes" : "no");
+    return retval;
+}
+
+static uint64_t tar2u64(char *ptr)
+{
+    uint64_t retval;
+    char oldend = ptr[12];
+
+    ptr[12] = '\0';
+    if (*ptr & 0x80) /* base-256: value is in the low bytes of the field */
+        retval = be64_to_cpu(*(uint64_t *)(ptr + 4));
+    else
+        retval = strtoull(ptr, NULL, 8);
+
+    ptr[12] = oldend;
+
+    dprintf("Convert %s -> %#lx\n", ptr, retval);
+    return retval;
+}
+
+static void tar_sparse(BDRVTarState *s, uint64_t offs, uint64_t len)
+{
+    SparseCache *sparse;
+
+    if (!len)
+        return;
+    if (!(offs - s->last_end)) {
+        s->last_end += len;
+        return;
+    }
+    if (s->last_end > offs)
+        return;
+
+    dprintf("Last chunk until %lx new chunk at %lx\n", s->last_end, offs);
+
+    s->sparse = qemu_realloc(s->sparse, (s->sparse_num + 1) * sizeof(SparseCache));
+    sparse = &s->sparse[s->sparse_num];
+    sparse->start = s->last_end;
+    sparse->end = offs;
+    s->last_end = offs + len;
+    s->sparse_num++;
+    dprintf("Sparse at %lx end=%lx\n", sparse->start,
+                                       sparse->end);
+}
+
+static int tar_open(BlockDriverState *bs, const char *filename, int flags)
+{
+    BDRVTarState *s = bs->opaque;
+    char header[SECTOR_SIZE];
+    char *magic;
+    size_t header_offs = 0;
+    int ret;
+
+    ret = bdrv_file_open(&s->hd, filename, flags);
+    if (ret < 0)
+        return ret;
+
+    /* Search the file for an image */
+
+    do {
+        /* tar header */
+        if (bdrv_pread(s->hd, header_offs, header, SECTOR_SIZE) != SECTOR_SIZE)
+            goto fail;
+
+        if ((header_offs > 1) && !header[0]) {
+            fprintf(stderr, "Tar: No image file found in archive\n");
+            goto fail;
+        }
+
+        magic = &header[OFFS_MAGIC];
+        if (strncmp(magic, POSIX_TAR_MAGIC, 5)) {
+            fprintf(stderr, "Tar: Invalid magic: %s\n", magic);
+            goto fail;
+        }
+
+        dprintf("file type: %c\n", header[OFFS_TYPE]);
+
+        /* file length*/
+        s->file_len = (tar2u64(&header[OFFS_LENGTH]) + (SECTOR_SIZE - 1)) &
+                      ~(SECTOR_SIZE - 1);
+        s->file_sec = (header_offs / SECTOR_SIZE) + 1;
+
+        header_offs += s->file_len + SECTOR_SIZE;
+    } while(!is_target_file(bs, header));
+
+    /* We found an image! */
+
+    if (header[OFFS_TYPE] == 'S') {
+        uint8_t isextended;
+        int i;
+
+        for (i = OFFS_S_SP; i < (OFFS_S_SP + (4 * 24)); i += 24)
+            tar_sparse(s, tar2u64(&header[i]), tar2u64(&header[i+12]));
+
+        s->file_len = tar2u64(&header[OFFS_S_LENGTH]);
+        isextended = header[OFFS_S_EXT];
+
+        while (isextended) {
+            if (bdrv_pread(s->hd, s->file_sec * SECTOR_SIZE, header,
+                           SECTOR_SIZE) != SECTOR_SIZE)
+                goto fail;
+
+            for (i = 0; i < (21 * 24); i += 24)
+                tar_sparse(s, tar2u64(&header[i]), tar2u64(&header[i+12]));
+            isextended = header[OFFS_SX_EXT];
+            s->file_sec++;
+        }
+        tar_sparse(s, s->file_len, 1);
+    }
+
+    return 0;
+
+fail:
+    fprintf(stderr, "Tar: Error opening file\n");
+    bdrv_delete(s->hd);
+    return -EINVAL;
+}
+
+typedef struct TarAIOCB {
+    BlockDriverAIOCB common;
+    QEMUBH *bh;
+} TarAIOCB;
+
+static AIOPool tar_aio_pool = {
+    .aiocb_size         = sizeof(TarAIOCB),
+};
+
+/* This callback gets invoked when we have pure sparseness */
+static void tar_sparse_cb(void *opaque)
+{
+    TarAIOCB *acb = (TarAIOCB *)opaque;
+
+    acb->common.cb(acb->common.opaque, 0);
+    qemu_bh_delete(acb->bh);
+    qemu_aio_release(acb);
+}
+
+/* This is where we get a request from a caller to read something */
+static BlockDriverAIOCB *tar_aio_readv(BlockDriverState *bs,
+        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
+        BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BDRVTarState *s = bs->opaque;
+    SparseCache *sparse;
+    int64_t sec_file = sector_num + s->file_sec;
+    int64_t start = sector_num * SECTOR_SIZE;
+    int64_t end = start + (nb_sectors * SECTOR_SIZE);
+    int i;
+    TarAIOCB *acb;
+
+    for (i = 0; i < s->sparse_num; i++) {
+        sparse = &s->sparse[i];
+        if (sparse->start > end) {
+            /* We expect the cache to be sorted by increasing start */
+            break;
+        } else if ((sparse->start < start) && (sparse->end <= start)) {
+            /* sparse before our offset */
+            sec_file -= (sparse->end - sparse->start) / SECTOR_SIZE;
+        } else if ((sparse->start <= start) && (sparse->end >= end)) {
+            /* all our sectors are sparse */
+            char *buf = qemu_mallocz(nb_sectors * SECTOR_SIZE);
+
+            acb = qemu_aio_get(&tar_aio_pool, bs, cb, opaque);
+            qemu_iovec_from_buffer(qiov, buf, nb_sectors * SECTOR_SIZE);
+            qemu_free(buf);
+            acb->bh = qemu_bh_new(tar_sparse_cb, acb);
+            qemu_bh_schedule(acb->bh);
+
+            return &acb->common;
+        } else if (((sparse->start >= start) && (sparse->start < end)) ||
+                   ((sparse->end >= start) && (sparse->end < end))) {
+            /* semi-sparse (worst case): go synchronous, sector by sector */
+            char *buf = qemu_malloc(nb_sectors * SECTOR_SIZE);
+            uint64_t offs;
+
+            for (offs = 0; offs < (nb_sectors * SECTOR_SIZE);
+                 offs += SECTOR_SIZE) {
+                bdrv_pread(bs, (sector_num * SECTOR_SIZE) + offs,
+                           buf + offs, SECTOR_SIZE);
+            }
+
+            qemu_iovec_from_buffer(qiov, buf, nb_sectors * SECTOR_SIZE);
+            qemu_free(buf);
+            acb = qemu_aio_get(&tar_aio_pool, bs, cb, opaque);
+            acb->bh = qemu_bh_new(tar_sparse_cb, acb);
+            qemu_bh_schedule(acb->bh);
+
+            return &acb->common;
+        }
+    }
+
+    return bdrv_aio_readv(s->hd, sec_file, qiov, nb_sectors,
+                          cb, opaque);
+}
+
+static void tar_close(BlockDriverState *bs)
+{
+    dprintf("Close\n");
+}
+
+static int64_t tar_getlength(BlockDriverState *bs)
+{
+    BDRVTarState *s = bs->opaque;
+    dprintf("getlength -> %ld\n", s->file_len);
+    return s->file_len;
+}
+
+static BlockDriver bdrv_tar = {
+    .format_name     = "tar",
+
+    .instance_size   = sizeof(BDRVTarState),
+    .bdrv_open       = tar_open,
+    .bdrv_close      = tar_close,
+    .bdrv_getlength  = tar_getlength,
+    .bdrv_probe      = tar_probe,
+
+    .bdrv_aio_readv  = tar_aio_readv,
+};
+
+static void tar_block_init(void)
+{
+    bdrv_register(&bdrv_tar);
+}
+
+block_init(tar_block_init);