HLFS driver for QEMU

Message ID 1363589264-31606-1-git-send-email-harryxiyou@gmail.com
State New

Commit Message

harryxiyou@gmail.com March 18, 2013, 6:47 a.m. UTC
From: Harry Wei <harryxiyou@gmail.com>

HLFS is an HDFS-based (Hadoop Distributed File System) Log-Structured File
System. Strictly speaking, HLFS is currently not a file system but a
block-storage system: we simplified the LFS design to fit block-level
storage, so you could also call HLFS HLBS (HDFS-based Log-Structured
Block-storage System). HLFS has two modes, local mode and HDFS mode. HDFS
is write-once/read-many, so HLFS implements LBS (Log-Structured
Block-storage System) to support random reads and writes. LBS builds on
the basic ideas of LFS but differs from it in that LBS fits block storage
better. See http://code.google.com/p/cloudxy/wiki/WHAT_IS_CLOUDXY for
details about HLFS.

Currently, HLFS supports the following features:

1. Portions of POSIX --- only the interfaces a VM image needs.
2. Random reads and writes.
3. Large file storage (TB scale).
4. Snapshots (linear and tree snapshots), clones, block compression,
caching, etc.
5. Multiple replicas of the data.
6. Dynamic cluster expansion.
...

Signed-off-by: Harry Wei <harryxiyou@gmail.com>

---
 block/Makefile.objs |    2 +-
 block/hlfs.c        |  515 +++++++++++++++++++++++++++++++++++++++++++++++++++
 configure           |   51 +++++
 3 files changed, 567 insertions(+), 1 deletion(-)
 create mode 100644 block/hlfs.c

Comments

Stefan Hajnoczi March 18, 2013, 11:10 a.m. UTC | #1
On Mon, Mar 18, 2013 at 02:47:44PM +0800, harryxiyou@gmail.com wrote:
> From: Harry Wei <harryxiyou@gmail.com>
> 
> HLFS is an HDFS-based (Hadoop Distributed File System) Log-Structured File
> System. Strictly speaking, HLFS is currently not a file system but a
> block-storage system: we simplified the LFS design to fit block-level
> storage, so you could also call HLFS HLBS (HDFS-based Log-Structured
> Block-storage System). HLFS has two modes, local mode and HDFS mode. HDFS
> is write-once/read-many, so HLFS implements LBS (Log-Structured
> Block-storage System) to support random reads and writes. LBS builds on
> the basic ideas of LFS but differs from it in that LBS fits block storage
> better. See http://code.google.com/p/cloudxy/wiki/WHAT_IS_CLOUDXY for
> details about HLFS.
> 
> Currently, HLFS supports the following features:
> 
> 1. Portions of POSIX --- only the interfaces a VM image needs.
> 2. Random reads and writes.
> 3. Large file storage (TB scale).
> 4. Snapshots (linear and tree snapshots), clones, block compression,
> caching, etc.
> 5. Multiple replicas of the data.
> 6. Dynamic cluster expansion.
> ...
> 
> Signed-off-by: Harry Wei <harryxiyou@gmail.com>

Is HLFS making releases that distros can package?  I don't see packages
in Debian or Fedora.

Block drivers in qemu.git should have active and sustainable communities
behind them.  For example, if this is a research project where the team
will move on within a year, then it may not be appropriate to merge the
code into QEMU.  Can you share some background on the HLFS community and
where the project is heading?

> ---
>  block/Makefile.objs |    2 +-
>  block/hlfs.c        |  515 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  configure           |   51 +++++
>  3 files changed, 567 insertions(+), 1 deletion(-)
>  create mode 100644 block/hlfs.c
> 
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index c067f38..723c7a5 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -8,7 +8,7 @@ block-obj-$(CONFIG_POSIX) += raw-posix.o
>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>  
>  ifeq ($(CONFIG_POSIX),y)
> -block-obj-y += nbd.o sheepdog.o
> +block-obj-y += nbd.o sheepdog.o hlfs.o

Missing CONFIG_HLFS to compile out hlfs.o.

> +static int hlbs_write(BlockDriverState *bs, int64_t sector_num, const uint8_t
> +        *buf, int nb_sectors)
> +{
> +    int ret;
> +    BDRVHLBSState *s = bs->opaque;
> +    uint32_t write_size = nb_sectors * 512;
> +    uint64_t offset = sector_num * 512;
> +    ret = hlfs_write(s->hctrl, (char *)buf, write_size, offset);
> +    if (ret != write_size) {
> +        return -EIO;
> +    }
> +    return 0;
> +}
> +
> +static int hlbs_read(BlockDriverState *bs, int64_t sector_num, uint8_t *buf,
> +        int nb_sectors)
> +{
> +    int ret;
> +    BDRVHLBSState *s = bs->opaque;
> +    uint32_t read_size = nb_sectors * 512;
> +    uint64_t offset = sector_num * 512;
> +    ret = hlfs_read(s->hctrl, (char *)buf, read_size, offset);
> +    if (ret != read_size) {
> +        return -EIO;
> +    }
> +    return 0;
> +}
> +
> +static int hlbs_flush(BlockDriverState *bs)
> +{
> +    int ret;
> +    BDRVHLBSState *s = bs->opaque;
> +    ret = hlfs_flush(s->hctrl);
> +    return ret;
> +}

read/write/flush should be either .bdrv_co_* or .bdrv_aio_*.

The current code pauses the guest while I/O is in progress!  Try running
disk I/O benchmarks inside the guest and you'll see that performance and
interactivity are poor.

QEMU has two ways of implementing efficient block drivers:

1. Coroutines - .bdrv_co_*

Good for image formats or block drivers that have internal logic.

Each request runs inside a coroutine - see include/block/coroutine.h.
In order to wait for I/O, submit an asynchronous request or worker
thread and yield the coroutine.  When the request completes, re-enter
the coroutine and continue.

Examples: block/sheepdog.c or block/qcow2.c.

2. Asynchronous I/O - .bdrv_aio_*

Good for low-level block drivers that have little logic.

The request processing code is split into callbacks.  I/O requests are
submitted and then the code returns back to QEMU's main loop.  When the
I/O request completes, a callback is invoked.

Examples: block/rbd.c or block/qed.c.
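
As a rough illustration of approach 1, a coroutine-based read could look
like the sketch below.  hlfs_aio_read() here is a hypothetical
asynchronous libhlfs entry point, not an existing API (the posted library
only offers the blocking hlfs_read()); the rest uses QEMU's coroutine and
bottom-half primitives:

typedef struct HLBSAIOCB {
    Coroutine *co;   /* coroutine to resume when the request completes */
    QEMUBH *bh;      /* bottom half so re-entry happens in the main loop */
    int ret;
} HLBSAIOCB;

static void hlbs_aio_bh(void *opaque)
{
    HLBSAIOCB *acb = opaque;

    qemu_bh_delete(acb->bh);
    acb->bh = NULL;
    qemu_coroutine_enter(acb->co, NULL);   /* resume hlbs_co_readv() */
}

/* Completion callback; it may run in a libhlfs thread, so defer the
 * coroutine re-entry to QEMU's main loop via a bottom half. */
static void hlbs_aio_cb(int ret, void *opaque)
{
    HLBSAIOCB *acb = opaque;

    acb->ret = ret;
    acb->bh = qemu_bh_new(hlbs_aio_bh, acb);
    qemu_bh_schedule(acb->bh);
}

static coroutine_fn int hlbs_co_readv(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
{
    BDRVHLBSState *s = bs->opaque;
    void *buf = qemu_blockalign(bs, nb_sectors * 512);
    HLBSAIOCB acb = {
        .co = qemu_coroutine_self(),
    };

    /* hlfs_aio_read() is an assumed async API, not part of the posted
     * libhlfs. */
    hlfs_aio_read(s->hctrl, buf, nb_sectors * 512, sector_num * 512,
                  hlbs_aio_cb, &acb);
    qemu_coroutine_yield();   /* the main loop keeps running meanwhile */

    qemu_iovec_from_buf(qiov, 0, buf, nb_sectors * 512);
    qemu_vfree(buf);
    return acb.ret;
}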

> +static int hlbs_snapshot_delete(BlockDriverState *bs, const char *snapshot)
> +{
> +    /* FIXME: Delete specified snapshot id.  */

Outdated comment?

> +    BDRVHLBSState *s = bs->opaque;
> +    int ret = 0;
> +    ret = hlfs_rm_snapshot(s->hctrl->storage->uri, snapshot);
> +    return ret;
> +}
> +
> +static int hlbs_snapshot_list(BlockDriverState *bs, QEMUSnapshotInfo **psn_tab)
> +{
> +    BDRVHLBSState *s = bs->opaque;
> +    /*Fixed: num_entries must be inited 0*/

Comment can be dropped?

> +    int num_entries = 0;
> +    struct snapshot *snapshots = hlfs_get_all_snapshots(s->hctrl->storage->uri,

I suggest using "hlfs_" as a prefix for libhlfs structs.  Otherwise you
risk name collisions.

> +        &num_entries);
> +    dprintf("snapshot count:%d\n", num_entries);
> +    /*Fixed: snapshots is NULL condition*/

Comment can be dropped?

> +    if (NULL == snapshots) {
> +        dprintf("snapshots is NULL, may be no snapshots, check it please!\n");
> +        return num_entries;
> +    }
> +    QEMUSnapshotInfo *sn_tab = NULL;
> +    struct snapshot *snapshot = snapshots;
> +    sn_tab = g_malloc0(num_entries * sizeof(*sn_tab));
> +    int i;
> +    for (i = 0; i < num_entries; i++) {
> +        printf("---snapshot:%s----", snapshot->sname);

Debugging code?

> +        sn_tab[i].date_sec = snapshot->timestamp * 1000;
> +        sn_tab[i].date_nsec = 0;
> +        sn_tab[i].vm_state_size = 0;
> +        sn_tab[i].vm_clock_nsec = 0;
> +        snprintf(sn_tab[i].id_str, sizeof(sn_tab[i].id_str), "%u", 0);
> +        strncpy(sn_tab[i].name, snapshot->sname, MIN(sizeof(sn_tab[i].name),
> +                    strlen(snapshot->sname)));

Please use snprintf() instead.  strncpy() does not NUL-terminate when
strlen(input) >= bufsize.
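
For example, the copy above could be written as:

    snprintf(sn_tab[i].name, sizeof(sn_tab[i].name), "%s",
             snapshot->sname);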

> +        snapshot++;
> +    }
> +    *psn_tab = sn_tab;
> +    if (snapshots != NULL) {
> +        g_free(snapshots);
> +    }

g_free(NULL) is a nop, the if is not needed.
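The cleanup can simply be:

    g_free(snapshots);   /* g_free() ignores NULL */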

> +    return num_entries;
> +}
> +
> +static QEMUOptionParameter hlbs_create_options[] = {
> +    {
> +        .name = BLOCK_OPT_SIZE,
> +        .type = OPT_SIZE,
> +        .help = "Virtual disk size"
> +    }, {
> +        .name = BLOCK_OPT_BACKING_FILE,
> +        .type = OPT_STRING,
> +        .help = "File name of a base image"
> +    }, {
> +        .name = BLOCK_OPT_PREALLOC,
> +        .type = OPT_STRING,
> +        .help = "Preallocation mode (allowed values: off, full)"
> +    }, { NULL }
> +};
> +
> +static BlockDriver bdrv_hlbs = {
> +    .format_name    = "hlfs",
> +    .protocol_name  = "hlfs",
> +    .instance_size  = sizeof(BDRVHLBSState),
> +    .bdrv_file_open = hlbs_open,
> +    .bdrv_close     = hlbs_close,
> +    .bdrv_create    = hlbs_create,
> +    .bdrv_getlength = hlbs_getlength,
> +    .bdrv_read  = hlbs_read,
> +    .bdrv_write = hlbs_write,

.bdrv_flush is missing.
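hlbs_flush() is defined but never wired up, so the driver needs something
like this (or .bdrv_co_flush/.bdrv_aio_flush, per the comment above):

    .bdrv_flush = hlbs_flush,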

> +    .bdrv_get_allocated_file_size  = hlbs_get_allocated_file_size,
> +    .bdrv_snapshot_create   = hlbs_snapshot_create,
> +    .bdrv_snapshot_goto     = hlbs_snapshot_goto,
> +    .bdrv_snapshot_delete   = hlbs_snapshot_delete,
> +    .bdrv_snapshot_list     = hlbs_snapshot_list,
> +    .create_options = hlbs_create_options,
> +};
> +
> +static void bdrv_hlbs_init(void)
> +{
> +    if (log4c_init()) {
> +        g_message("log4c_init error!");
> +    }

This seems to be for libhlfs since the QEMU code doesn't use log4c.  If
it's safe to call this function more than once, please put it into
libhlfs instead of requiring the program that links to your library to
call it for you.
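
If log4c_init() is not idempotent, a once-only guard inside libhlfs would
make it safe for any caller.  A sketch using GLib (hlfs_log_init() is a
hypothetical libhlfs-internal helper, not an existing function):

static void hlfs_log_init(void)
{
    static gsize initialized;

    /* g_once_init_enter() returns TRUE for exactly one caller */
    if (g_once_init_enter(&initialized)) {
        if (log4c_init()) {
            g_message("log4c_init error!");
        }
        g_once_init_leave(&initialized, 1);
    }
}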

> @@ -2813,6 +2820,45 @@ if compile_prog "" "" ; then
>  fi
>  
>  ##########################################
> +# hlfs probe
> +if test "$hlfs" != "no" ; then
> +    cat > $TMPC <<EOF
> +#include <stdio.h>
> +#include <api/hlfs.h>
> +    int main(void) {
> +            return 0;
> +    }
> +EOF
> +    INC_GLIB_DIR=`pkg-config --cflags glib-2.0`

QEMU uses gthread-2.0 instead of the single-threaded glib-2.0.

You can reuse glib_cflags instead of probing again here.

> +    HLFS_DIR=`echo $HLFS_INSTALL_PATH`
> +    JVM_DIR=`echo $JAVA_HOME`
> +
> +    if [ `getconf LONG_BIT` -eq "64" ];then
> +    CLIBS="-L$JVM_DIR/jre/lib/amd64/server/ $CLIBS"
> +    fi
> +    if [ `getconf LONG_BIT` -eq "32" ];then
> +    CLIBS="-L$JVM_DIR/jre/lib/i386/server/ $CLIBS"
> +    fi

Hopefully there is a cleaner and portable way to do this (e.g.
pkg-config).

> +    CLIBS="-L$HLFS_DIR/lib/ $CLIBS"
> +    CFLAGS="-I$HLFS_DIR/include $CFLAGS"
> +    CFLAGS="$INC_GLIB_DIR $CFLAGS"
> +
> +    hlfs_libs="$CLIBS -lhlfs -llog4c -lglib-2.0 -lgthread-2.0 -lrt -lhdfs -ljvm"

libhlfs should support pkg-config so that QEMU does not need to know
about -llog4c -lhdfs -ljvm.

> +    echo "999 CLIBS is $CLIBS\n"
> +    echo "999 CFLAGS is $CFLAGS\n"

Debugging?
harryxiyou@gmail.com March 18, 2013, 1:36 p.m. UTC | #2
On Mon, Mar 18, 2013 at 7:10 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
Hi Stefan,

> Is HLFS making releases that distros can package?  I don't see packages
> in Debian or Fedora.

We will make packages for Debian and Fedora.

>
> Block drivers in qemu.git should have active and sustainable communities
> behind them.

Cloudxy is an active and sustainable community, and HLFS is a
sub-project of it.

>  For example, if this is a research project where the team
> will move on within a year, then it may not be appropriate to merge the
> code into QEMU.

HLFS has been in development for two years.

> Can you share some background on the HLFS community and
> where the project is heading?

See http://code.google.com/p/cloudxy/ (our main website) for
details (in Chinese).
See http://code.google.com/p/cloudxy/wiki/WHAT_IS_CLOUDXY for
details (in English).

>
>> ---
>>  block/Makefile.objs |    2 +-
>>  block/hlfs.c        |  515 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  configure           |   51 +++++
>>  3 files changed, 567 insertions(+), 1 deletion(-)
>>  create mode 100644 block/hlfs.c
>>
>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>> index c067f38..723c7a5 100644
>> --- a/block/Makefile.objs
>> +++ b/block/Makefile.objs
>> @@ -8,7 +8,7 @@ block-obj-$(CONFIG_POSIX) += raw-posix.o
>>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>>
>>  ifeq ($(CONFIG_POSIX),y)
>> -block-obj-y += nbd.o sheepdog.o
>> +block-obj-y += nbd.o sheepdog.o hlfs.o
>
> Missing CONFIG_HLFS to compile out hlfs.o.

I modeled this on sheepdog; I cannot see any CONFIG_SHEEPDOG.

[...]
>
>> +    echo "999 CLIBS is $CLIBS\n"
>> +    echo "999 CFLAGS is $CFLAGS\n"
>
> Debugging?

I will fix the problems you pointed out. Thanks very much.
Stefan Hajnoczi March 18, 2013, 3:16 p.m. UTC | #3
On Mon, Mar 18, 2013 at 2:36 PM, harryxiyou <harryxiyou@gmail.com> wrote:
> On Mon, Mar 18, 2013 at 7:10 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> Hi Stefan,
>
>> Is HLFS making releases that distros can package?  I don't see packages
>> in Debian or Fedora.
>
> We will make packages for Debian and Fedora.
>
>>
>> Block drivers in qemu.git should have active and sustainable communities
>> behind them.
>
> Cloudxy is an active and sustainable community, and HLFS is a
> sub-project of it.
>
>>  For example, if this is a research project where the team
>> will move on within a year, then it may not be appropriate to merge the
>> code into QEMU.
>
> HLFS has been in development for two years.
>
>> Can you share some background on the HLFS community and
>> where the project is heading?
>
> See http://code.google.com/p/cloudxy/ (our main website) for
> details (in Chinese).
> See http://code.google.com/p/cloudxy/wiki/WHAT_IS_CLOUDXY for
> details (in English).

I looked at the Google Code project before, it looks like a repo that
a few people are hacking on.  The site is developer-focussed and there
is no evidence of users.  This is why I asked about the background of
the community.

There are references to "workshop3" and "linux lab" in the source
code.  "Xiyou" and "XUPT" refer to http://www.xiyou.edu.cn/.

What I'm concerned about is that there are no users and the developers
leave when they graduate.  In that case integrating the patches into
QEMU is wasted effort.

So what *is* the background of cloudxy (HLFS)?

>
>>
>>> ---
>>>  block/Makefile.objs |    2 +-
>>>  block/hlfs.c        |  515 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  configure           |   51 +++++
>>>  3 files changed, 567 insertions(+), 1 deletion(-)
>>>  create mode 100644 block/hlfs.c
>>>
>>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>>> index c067f38..723c7a5 100644
>>> --- a/block/Makefile.objs
>>> +++ b/block/Makefile.objs
>>> @@ -8,7 +8,7 @@ block-obj-$(CONFIG_POSIX) += raw-posix.o
>>>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>>>
>>>  ifeq ($(CONFIG_POSIX),y)
>>> -block-obj-y += nbd.o sheepdog.o
>>> +block-obj-y += nbd.o sheepdog.o hlfs.o
>>
>> Missing CONFIG_HLFS to compile out hlfs.o.
>
> I modeled this on sheepdog; I cannot see any CONFIG_SHEEPDOG.

Sheepdog has no dependencies; it can always be built.

Stefan
harryxiyou@gmail.com March 20, 2013, 8:43 a.m. UTC | #4
On Mon, Mar 18, 2013 at 11:16 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
[...]
> I looked at the Google Code project before, it looks like a repo that
> a few people are hacking on.  The site is developer-focussed and there
> is no evidence of users.  This is why I asked about the background of
> the community.
>

Cloudxy is under active development, and we will popularize it.

> There are references to "workshop3" and "linux lab" in the source
> code.  Xiyou and XUPT is http://www.xiyou.edu.cn/.
>

Some developers are from XUPT.

> What I'm concerned about is that there are no users and the developers
> leave when they graduate.  In that case integrating the patches into
> QEMU is wasted effort.
>

Some of the developers are from XUPT, but not *all*. Rest assured that our
key developers will maintain this project for the long term, e.g. Kang Hua
<kanghua151@gmail.com>, Zhang Dianbo <littlesmartsmart@gmail.com>,
Wang Sen <kelvin.xupt@gmail.com> and Harry Wei <harryxiyou@gmail.com>.

We will promote our project so that it attracts more users.

> So what *is* the background of cloudxy (HLFS)?

Rest assured that we have enough developers. There will be more users as
Cloudxy's influence grows.
Stefan Hajnoczi March 20, 2013, 10:04 a.m. UTC | #5
On Wed, Mar 20, 2013 at 04:43:17PM +0800, harryxiyou wrote:
> On Mon, Mar 18, 2013 at 11:16 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> [...]
> > I looked at the Google Code project before, it looks like a repo that
> > a few people are hacking on.  The site is developer-focussed and there
> > is no evidence of users.  This is why I asked about the background of
> > the community.
> >
> 
> Cloudxy is under active development, and we will popularize it.

Most important for QEMU is easy building and testing, so let's focus on
that.

Ceph and GlusterFS have packages available in popular distros.  This
allows us to hook them into QEMU's buildbot
(http://buildbot.b1-systems.de/qemu/grid) so their code does not bitrot.

Sheepdog's QEMU block driver has no external dependencies so it's even
easier to build.

I'd like to see packages for popular distros and a "Getting Started"
page for users instead of developers.  Then I can follow the steps to
set up HLFS using the packages and the QEMU buildbot can build the HLFS
block driver.

Stefan
harryxiyou@gmail.com March 20, 2013, 11:17 a.m. UTC | #6
On Wed, Mar 20, 2013 at 6:04 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Wed, Mar 20, 2013 at 04:43:17PM +0800, harryxiyou wrote:
>> On Mon, Mar 18, 2013 at 11:16 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> [...]
>> > I looked at the Google Code project before, it looks like a repo that
>> > a few people are hacking on.  The site is developer-focussed and there
>> > is no evidence of users.  This is why I asked about the background of
>> > the community.
>> >
>>
>> Cloudxy is under active development, and we will popularize it.
>
> Most important for QEMU is easy building and testing, so let's focus on
> that.

Yup, this is wonderful ;-)

>
> Ceph and GlusterFS have packages available in popular distros.  This
> allows us to hook them into QEMU's buildbot
> (http://buildbot.b1-systems.de/qemu/grid) so their code does not bitrot.

We will make Debian and RPM packages.

>
> Sheepdog's QEMU block driver has no external dependencies so it's even
> easier to build.

Yup.
>
> I'd like to see packages for popular distros and a "Getting Started"
> page for users instead of developers.  Then I can follow the steps to
> set up HLFS using the packages and the QEMU buildbot can build the HLFS
> block driver.
>

Actually, we have "Getting Started" wiki pages in Chinese, and I will
provide English versions. You can visit the Chinese pages via the following link.
http://code.google.com/p/cloudxy/wiki/HlfsUserManual

We also have a "one-key install" script, which is located here.
http://cloudxy.googlecode.com/svn/branches/hlfs/person/zhangdianbo/scripts/

All the wiki pages for Cloudxy are here (in Chinese).
http://code.google.com/p/cloudxy/w/list
harryxiyou@gmail.com March 28, 2013, 7:36 a.m. UTC | #7
On Mon, Mar 18, 2013 at 7:10 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
[...]
> read/write/flush should be either .bdrv_co_* or .bdrv_aio_*.
>
> The current code pauses the guest while I/O is in progress!  Try running
> disk I/O benchmarks inside the guest and you'll see that performance and
> interactivity are poor.
>
> QEMU has two ways of implementing efficient block drivers:
>
> 1. Coroutines - .bdrv_co_*
>
> Good for image formats or block drivers that have internal logic.
>
> Each request runs inside a coroutine - see include/block/coroutine.h.
> In order to wait for I/O, submit an asynchronous request or worker
> thread and yield the coroutine.  When the request completes, re-enter
> the coroutine and continue.
>
> Examples: block/sheepdog.c or block/qcow2.c.
>
> 2. Asynchronous I/O - .bdrv_aio_*
>
> Good for low-level block drivers that have little logic.
>
> The request processing code is split into callbacks.  I/O requests are
> submitted and then the code returns back to QEMU's main loop.  When the
> I/O request completes, a callback is invoked.
>
> Examples: block/rbd.c or block/qed.c.
>

HLFS does not use QEMU's AIO mechanisms; HLFS itself implements its own internal AIO.
Stefan Hajnoczi March 28, 2013, 1:18 p.m. UTC | #8
On Thu, Mar 28, 2013 at 03:36:23PM +0800, harryxiyou wrote:
> On Mon, Mar 18, 2013 at 7:10 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> [...]
> > read/write/flush should be either .bdrv_co_* or .bdrv_aio_*.
> >
> > The current code pauses the guest while I/O is in progress!  Try running
> > disk I/O benchmarks inside the guest and you'll see that performance and
> > interactivity are poor.
> >
> > QEMU has two ways of implementing efficient block drivers:
> >
> > 1. Coroutines - .bdrv_co_*
> >
> > Good for image formats or block drivers that have internal logic.
> >
> > Each request runs inside a coroutine - see include/block/coroutine.h.
> > In order to wait for I/O, submit an asynchronous request or worker
> > thread and yield the coroutine.  When the request completes, re-enter
> > the coroutine and continue.
> >
> > Examples: block/sheepdog.c or block/qcow2.c.
> >
> > 2. Asynchronous I/O - .bdrv_aio_*
> >
> > Good for low-level block drivers that have little logic.
> >
> > The request processing code is split into callbacks.  I/O requests are
> > submitted and then the code returns back to QEMU's main loop.  When the
> > I/O request completes, a callback is invoked.
> >
> > Examples: block/rbd.c or block/qed.c.
> >
> 
> HLFS does not use QEMU's AIO mechanisms; HLFS itself implements its own internal AIO.

You need to perform I/O in a way that does not block QEMU's main loop.
The code you posted blocks the caller until I/O completes, and therefore
blocks QEMU's main loop.

Either you need to use a worker thread approach or add async APIs to
HLFS so it can be integrated into the application (QEMU) main loop
properly.
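
For the worker thread approach you can reuse the blocking hlfs_read()
from this patch together with QEMU's thread pool (see
include/block/thread-pool.h).  A minimal sketch (note that depending on
the QEMU version, thread_pool_submit_co() may take an extra ThreadPool
argument):

#include "block/thread-pool.h"

typedef struct HLBSWorkRequest {
    BDRVHLBSState *s;
    uint8_t *buf;
    uint32_t size;
    uint64_t offset;
} HLBSWorkRequest;

/* Runs in a worker thread, so the blocking libhlfs call no longer
 * stalls QEMU's main loop. */
static int hlbs_read_worker(void *opaque)
{
    HLBSWorkRequest *req = opaque;
    int ret = hlfs_read(req->s->hctrl, (char *)req->buf, req->size,
                        req->offset);

    return ret == req->size ? 0 : -EIO;
}

static coroutine_fn int hlbs_co_read(BlockDriverState *bs,
        int64_t sector_num, uint8_t *buf, int nb_sectors)
{
    BDRVHLBSState *s = bs->opaque;
    HLBSWorkRequest req = {
        .s      = s,
        .buf    = buf,
        .size   = nb_sectors * 512,
        .offset = sector_num * 512,
    };

    /* Yields only this coroutine; other guest I/O keeps going. */
    return thread_pool_submit_co(hlbs_read_worker, &req);
}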

Stefan

Patch

diff --git a/block/Makefile.objs b/block/Makefile.objs
index c067f38..723c7a5 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -8,7 +8,7 @@  block-obj-$(CONFIG_POSIX) += raw-posix.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 
 ifeq ($(CONFIG_POSIX),y)
-block-obj-y += nbd.o sheepdog.o
+block-obj-y += nbd.o sheepdog.o hlfs.o
 block-obj-$(CONFIG_LIBISCSI) += iscsi.o
 block-obj-$(CONFIG_CURL) += curl.o
 block-obj-$(CONFIG_RBD) += rbd.o
diff --git a/block/hlfs.c b/block/hlfs.c
new file mode 100644
index 0000000..331feae
--- /dev/null
+++ b/block/hlfs.c
@@ -0,0 +1,514 @@ 
+/*
+ * Block driver for HLFS(HDFS-based Log-structured File System)
+ *
+ * Copyright (c) 2013, Kang Hua <kanghua151@gmail.com>
+ * Copyright (c) 2013, Wang Sen <kelvin.xupt@gmail.com>
+ * Copyright (c) 2013, Harry Wei <harryxiyou@gmail.com>
+ *
+ * This program is free software. You can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * Reference:
+ * http://code.google.com/p/cloudxy
+ */
+
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "qemu/sockets.h"
+#include "block/block_int.h"
+#include "qemu/bitops.h"
+#include "api/hlfs.h"
+#include "storage_helper.h"
+#include "comm_define.h"
+#include "snapshot_helper.h"
+#include "address.h"
+
+#define DEBUG_HLBS
+#undef dprintf
+#ifdef DEBUG_HLBS
+#define dprintf(fmt, args...) \
+    do {    \
+        fprintf(stdout, "%s %d: " fmt, __func__, __LINE__, ##args); \
+    } while (0)
+#else
+#define dprintf(fmt, args...)
+#endif
+
+#define HLBS_MAX_VDI_SIZE (8192ULL*8192ULL*8192ULL*8192ULL)
+#define SECTOR_SIZE 512
+
+typedef struct BDRVHLBSState {
+    struct hlfs_ctrl *hctrl;
+    char *snapshot;
+    char *uri;
+} BDRVHLBSState;
+
+/*
+ * Parse a filename.
+ *
+ * file name format must be one of the following:
+ *    1. [vdiname]
+ *    2. [vdiname]%[snapshot]
+ *    * vdiname format --
+ *    * local:///tmp/testenv/testfs
+ *    * hdfs:///tmp/testenv/testfs
+ *    * hdfs://localhost:8020/tmp/testenv/testfs
+ *    * hdfs://localhost/tmp/testenv/testfs
+ *    * hdfs://192.168.0.1:8020/tmp/testenv/testfs
+ */
+
+static int parse_vdiname(BDRVHLBSState *s, const char *filename, char *vdi,
+        char *snapshot)
+{
+    if (!filename) {
+        return -1;
+    }
+
+    gchar **v = g_strsplit(filename, "%", 2);
+    if (g_strv_length(v) == 1) {
+        strcpy(vdi, v[0]);
+        s->uri = g_strdup(vdi);
+    } else if (g_strv_length(v) == 2) {
+        strcpy(vdi, v[0]);
+        strcpy(snapshot, v[1]);
+        s->uri = g_strdup(vdi);
+        s->snapshot = g_strdup(snapshot);
+    } else {
+        goto out;
+    }
+
+    return 0;
+out:
+    g_strfreev(v);
+    return -1;
+}
+
+static int hlbs_open(BlockDriverState *bs, const char *filename, int flags)
+{
+    int ret = 0;
+    BDRVHLBSState *s = bs->opaque;
+    char vdi[256];
+    char snapshot[HLFS_FILE_NAME_MAX];
+
+    strstart(filename, "hlfs:", (const char **)&filename);
+    memset(snapshot, 0, sizeof(snapshot));
+    memset(vdi, 0, sizeof(vdi));
+
+    if (parse_vdiname(s, filename, vdi, snapshot) < 0) {
+        goto out;
+    }
+
+    HLFS_CTRL *ctrl = init_hlfs(vdi);
+    if (strlen(snapshot)) {
+        dprintf("snapshot:%s was open.\n", snapshot);
+        ret = hlfs_open_by_snapshot(ctrl, snapshot, 1);
+    } else {
+        ret = hlfs_open(ctrl, 1);
+    }
+    g_assert(ret == 0);
+    s->hctrl = ctrl;
+    bs->total_sectors = ctrl->sb.max_fs_size * 1024 * 1024 / SECTOR_SIZE;
+    return 0;
+out:
+    if (s->hctrl) {
+        hlfs_close(s->hctrl);
+        deinit_hlfs(s->hctrl);
+    }
+    return -1;
+}
+
+static int hlbs_create(const char *filename, QEMUOptionParameter *options)
+{
+    int64_t vdi_size = 0;
+    char *backing_file = NULL;
+    BDRVHLBSState s;
+    char vdi[256];
+    char snapshot[256];
+    const char *vdiname;
+    int prealloc = 0;
+    uint32_t is_compress = 0;
+    strstart(filename, "hlfs:", &vdiname);
+
+    memset(&s, 0, sizeof(s));
+    memset(vdi, 0, sizeof(vdi));
+    memset(snapshot, 0, sizeof(snapshot));
+
+    if (parse_vdiname(&s, vdiname, vdi, (char *)&snapshot) < 0) {
+        error_report("invailid filename");
+        return -EINVAL;
+    }
+
+    while (options && options->name) {
+        if (!strcmp(options->name, BLOCK_OPT_SIZE)) {
+            vdi_size = options->value.n;
+        } else if (!strcmp(options->name, BLOCK_OPT_BACKING_FILE)) {
+            backing_file = options->value.s;
+        } else if (!strcmp(options->name, BLOCK_OPT_PREALLOC)) {
+            if (!options->value.s || !strcmp(options->value.s, "off")) {
+                prealloc = 0;
+            } else if (!strcmp(options->value.s, "full")) {
+                prealloc = 1;
+            } else {
+                error_report("Invalid preallocation mode: '%s'",
+                                               options->value.s);
+                return -EINVAL;
+            }
+        }
+        options++;
+    }
+
+    if (vdi_size > HLBS_MAX_VDI_SIZE) {
+        g_message("vdi_size: %llu  Max: %llu", vdi_size, HLBS_MAX_VDI_SIZE);
+        error_report("too big image size");
+        return -EINVAL;
+    }
+
+    if (backing_file) {
+        const char *father_uri_with_snapshot;
+        strstart(backing_file, "hlfs:", &father_uri_with_snapshot);
+        char *son_uri = vdi;
+        char *father_uri = NULL;
+        char *father_snapshot = NULL;
+        gchar **v = NULL;
+        v = g_strsplit(father_uri_with_snapshot, "%", 2);
+        if (g_strv_length(v) != 2) {
+            g_strfreev(v);
+            return -1;
+        }
+        father_uri = g_strdup(v[0]);
+        father_snapshot = g_strdup(v[1]);
+        g_strfreev(v);
+        if (0 == strcmp(father_uri, son_uri)) {
+            g_message("father uri can not equal son uri");
+            return -1;
+        }
+        struct back_storage *father_storage = init_storage_handler(father_uri);
+        if (!father_storage) {
+            g_message("can not get storage handler for father_uri:%s",
+                                                             father_uri);
+            return -1;
+        }
+        struct back_storage *son_storage = init_storage_handler(son_uri);
+        if (!son_storage) {
+            g_message("can not get storage handler for son_uri:%s", son_uri);
+            return -1;
+        }
+        /*  check son is exist and empty must  */
+        if ((0 != son_storage->bs_file_is_exist(son_storage, NULL)) || (0 !=
+                    son_storage->bs_file_is_exist(son_storage, "superblock"))) {
+            g_message("hlfs with uri:%s has not exist,please mkfs it first!",
+                                                                       son_uri);
+            return -1;
+        }
+
+        uint32_t segno = 0;
+        uint32_t offset = 0;
+        if (0 != get_cur_latest_segment_info(son_storage, &segno, &offset)) {
+            g_message("can not get latest seg info for son ");
+        } else {
+        if (segno != 0 || offset != 0) {
+            g_message("son hlfs must empty");
+            return -1;
+        }
+        }
+       /* check son is not clone already */
+        char *content = NULL;
+        uint32_t size = 0;
+        if (0 != file_get_contents(son_storage, "superblock",
+                                                 &content, &size)) {
+            g_message("can not read superblock");
+        }
+        GKeyFile *sb_keyfile = g_key_file_new();
+        if (FALSE == g_key_file_load_from_data(sb_keyfile, content, size,
+                    G_KEY_FILE_NONE, NULL)) {
+            g_message("superblock file format is not key value pairs");
+            return -1;
+        }
+        gchar *_father_uri = g_key_file_get_string(sb_keyfile, "METADATA",
+                "father_uri", NULL);
+        printf("father uri: %s\n", _father_uri);
+        if (_father_uri != NULL) {
+            g_message("uri:%s has clone :%s", son_uri, _father_uri);
+            return -1;
+        }
+        g_free(content);
+        /*  read father's snapshot 's inode */
+        uint64_t inode_addr;
+        struct snapshot *ss = NULL;
+        if (0 > load_snapshot_by_name(father_storage, SNAPSHOT_FILE, &ss, \
+                    father_snapshot)) {
+            g_message("load uri:%s ss by name:%s error", father_uri, \
+                    father_snapshot);
+            g_free(ss);
+            return -1;
+        }
+        inode_addr = ss->inode_addr;
+        g_free(ss);
+
+        uint32_t father_seg_size = 0;
+        uint32_t father_block_size = 0;
+        uint64_t father_max_fs_size = 0;
+
+        if (0 != read_fs_meta(father_storage, &father_seg_size,
+                    &father_block_size, &father_max_fs_size, &is_compress)) {
+            g_message("can not read father uri meta");
+            return -1;
+        }
+        segno = get_segno(inode_addr);
+        uint32_t son_block_size = g_key_file_get_integer(sb_keyfile,
+                "METADATA", "block_size", NULL);
+        uint32_t son_seg_size = g_key_file_get_integer(sb_keyfile, "METADATA",
+                "segment_size", NULL);
+        if (son_block_size != father_block_size || father_seg_size !=
+                son_seg_size) {
+            g_message("sorry , now father segsize and block sizee must same as \
+                    son!!!");
+            return -1;
+        }
+        g_key_file_set_uint64(sb_keyfile, "METADATA", "from_segno", segno + 1);
+        g_key_file_set_string(sb_keyfile, "METADATA", "father_uri", father_uri);
+        g_key_file_set_integer(sb_keyfile, "METADATA",
+                                        "is_compress", is_compress);
+        g_key_file_set_string(sb_keyfile, "METADATA", "father_ss",
+                father_snapshot);
+        g_key_file_set_uint64(sb_keyfile, "METADATA", "snapshot_inode",
+                inode_addr);
+        g_key_file_set_uint64(sb_keyfile, "METADATA", "max_fs_size",
+                father_max_fs_size);
+        gchar *data = g_key_file_to_data(sb_keyfile, NULL, NULL);
+        if (0 != son_storage->bs_file_delete(son_storage, "superblock")) {
+            g_message("can not delete old superblock file");
+            return -1;
+        }
+        if (0 != file_append_contents(son_storage, "superblock", (char *)data,
+                    strlen(data) + 1)) {
+            g_message("can not write superblock file");
+            return -1;
+        }
+        deinit_storage_handler(son_storage);
+        deinit_storage_handler(father_storage);
+    } else {
+        struct back_storage *storage = init_storage_handler(vdi);
+        if (NULL == storage) {
+            g_message("can not get storage handler for uri:%s", vdi);
+            return -1;
+        }
+        if ((0 == storage->bs_file_is_exist(storage, NULL)) && (0 ==
+                    storage->bs_file_is_exist(storage, "superblock"))) {
+            g_message("hlfs with uri:%s has exist", vdi);
+            return 1;
+        }
+        if (0 != storage->bs_file_mkdir(storage, NULL)) {
+            g_message("can not mkdir for our fs %s", vdi);
+        }
+        GKeyFile *sb_keyfile = g_key_file_new();
+        g_key_file_set_string(sb_keyfile, "METADATA", "uri", vdi);
+        g_key_file_set_integer(sb_keyfile, "METADATA", "block_size", 8192);
+        g_key_file_set_integer(sb_keyfile, "METADATA",
+                                          "is_compress", is_compress);
+        g_key_file_set_integer(sb_keyfile, "METADATA", "segment_size",
+                67108864);
+        g_key_file_set_integer(sb_keyfile, "METADATA", "max_fs_size",
+                vdi_size/(1024 * 1024));
+        gchar *data = g_key_file_to_data(sb_keyfile, NULL, NULL);
+        char *head, *hostname, *dir, *fs_name;
+        int port;
+        parse_from_uri(vdi, &head, &hostname, &dir, &fs_name, &port);
+        char *sb_file_path = g_build_filename(dir, fs_name, "superblock", NULL);
+        bs_file_t file = storage->bs_file_create(storage, "superblock");
+        if (NULL == file) {
+            g_message("open file :superblock failed");
+            g_free(sb_file_path);
+            return -1;
+        }
+        int size = storage->bs_file_append(storage, file, (char *)data,
+                strlen(data) + 1);
+        if (size != strlen(data) + 1) {
+            g_message("can not write superblock file");
+            g_free(sb_file_path);
+        }
+        storage->bs_file_flush(storage, file);
+        storage->bs_file_close(storage, file);
+        deinit_storage_handler(storage);
+    }
+    return 0;
+}
+
+static void hlbs_close(BlockDriverState *bs)
+{
+    BDRVHLBSState *s = bs->opaque;
+    if (s->hctrl) {
+        hlfs_close(s->hctrl);
+        deinit_hlfs(s->hctrl);
+    }
+}
+
+static int64_t hlbs_getlength(BlockDriverState *bs)
+{
+    BDRVHLBSState *s = bs->opaque;
+    return s->hctrl->sb.max_fs_size*1024*1024;
+}
+
+static int64_t hlbs_get_allocated_file_size(BlockDriverState *bs)
+{
+    BDRVHLBSState *s = bs->opaque;
+    return s->hctrl->inode.length;
+}
+
+static int hlbs_write(BlockDriverState *bs, int64_t sector_num, const uint8_t
+        *buf, int nb_sectors)
+{
+    int ret;
+    BDRVHLBSState *s = bs->opaque;
+    uint32_t write_size = nb_sectors * 512;
+    uint64_t offset = sector_num * 512;
+    ret = hlfs_write(s->hctrl, (char *)buf, write_size, offset);
+    if (ret != write_size) {
+        return -EIO;
+    }
+    return 0;
+}
+
+static int hlbs_read(BlockDriverState *bs, int64_t sector_num, uint8_t *buf,
+        int nb_sectors)
+{
+    int ret;
+    BDRVHLBSState *s = bs->opaque;
+    uint32_t read_size = nb_sectors * 512;
+    uint64_t offset = sector_num * 512;
+    ret = hlfs_read(s->hctrl, (char *)buf, read_size, offset);
+    if (ret != read_size) {
+        return -EIO;
+    }
+    return 0;
+}
+
+static int hlbs_flush(BlockDriverState *bs)
+{
+    int ret;
+    BDRVHLBSState *s = bs->opaque;
+    ret = hlfs_flush(s->hctrl);
+    return ret;
+}
+
+static int hlbs_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info)
+{
+    int ret = 0;
+    BDRVHLBSState *s = bs->opaque;
+    dprintf("sn_info: name %s id_str %s vm_state_size %llu\n", sn_info->name,
+            sn_info->id_str, sn_info->vm_state_size);
+    dprintf("%s %s\n", sn_info->name, sn_info->id_str);
+    ret = hlfs_take_snapshot(s->hctrl, sn_info->name);
+    return ret;
+}
+
+static int hlbs_snapshot_goto(BlockDriverState *bs, const char *snapshot)
+{
+    int ret;
+    BDRVHLBSState *s = bs->opaque;
+    char vdi[265];
+
+    memset(vdi, 0, sizeof(vdi));
+    strncpy(vdi, s->hctrl->storage->uri, sizeof(vdi));
+    hlfs_close(s->hctrl);
+    deinit_hlfs(s->hctrl);
+    HLFS_CTRL *ctrl = init_hlfs(vdi);
+    s->hctrl = ctrl;
+    ret = hlfs_open_by_snapshot(ctrl, snapshot, 1);
+    if (ret != 0) {
+        goto out;
+    }
+    s->hctrl = ctrl;
+    s->uri = g_strdup(vdi);
+    s->snapshot = g_strdup(vdi);
+    bs->total_sectors = ctrl->sb.max_fs_size * 1024 * 1024 / SECTOR_SIZE;
+    return 0;
+out:
+    if (s->hctrl != NULL) {
+        hlfs_close(s->hctrl);
+        deinit_hlfs(s->hctrl);
+    }
+    return -1;
+}
+
+static int hlbs_snapshot_delete(BlockDriverState *bs, const char *snapshot)
+{
+    /* FIXME: Delete specified snapshot id.  */
+    BDRVHLBSState *s = bs->opaque;
+    int ret = 0;
+    ret = hlfs_rm_snapshot(s->hctrl->storage->uri, snapshot);
+    return ret;
+}
+
+static int hlbs_snapshot_list(BlockDriverState *bs, QEMUSnapshotInfo **psn_tab)
+{
+    BDRVHLBSState *s = bs->opaque;
+    /*Fixed: num_entries must be inited 0*/
+    int num_entries = 0;
+    struct snapshot *snapshots = hlfs_get_all_snapshots(s->hctrl->storage->uri,
+        &num_entries);
+    dprintf("snapshot count:%d\n", num_entries);
+    /*Fixed: snapshots is NULL condition*/
+    if (NULL == snapshots) {
+        dprintf("snapshots is NULL, may be no snapshots, check it please!\n");
+        return num_entries;
+    }
+    QEMUSnapshotInfo *sn_tab = NULL;
+    struct snapshot *snapshot = snapshots;
+    sn_tab = g_malloc0(num_entries * sizeof(*sn_tab));
+    int i;
+    for (i = 0; i < num_entries; i++) {
+        printf("---snapshot:%s----", snapshot->sname);
+        sn_tab[i].date_sec = snapshot->timestamp * 1000;
+        sn_tab[i].date_nsec = 0;
+        sn_tab[i].vm_state_size = 0;
+        sn_tab[i].vm_clock_nsec = 0;
+        snprintf(sn_tab[i].id_str, sizeof(sn_tab[i].id_str), "%u", 0);
+        strncpy(sn_tab[i].name, snapshot->sname, MIN(sizeof(sn_tab[i].name),
+                    strlen(snapshot->sname)));
+        snapshot++;
+    }
+    *psn_tab = sn_tab;
+    if (snapshots != NULL) {
+        g_free(snapshots);
+    }
+    return num_entries;
+}
+
+static QEMUOptionParameter hlbs_create_options[] = {
+    {
+        .name = BLOCK_OPT_SIZE,
+        .type = OPT_SIZE,
+        .help = "Virtual disk size"
+    }, {
+        .name = BLOCK_OPT_BACKING_FILE,
+        .type = OPT_STRING,
+        .help = "File name of a base image"
+    }, {
+        .name = BLOCK_OPT_PREALLOC,
+        .type = OPT_STRING,
+        .help = "Preallocation mode (allowed values: off, full)"
+    }, { NULL }
+};
+
+static BlockDriver bdrv_hlbs = {
+    .format_name    = "hlfs",
+    .protocol_name  = "hlfs",
+    .instance_size  = sizeof(BDRVHLBSState),
+    .bdrv_file_open = hlbs_open,
+    .bdrv_close     = hlbs_close,
+    .bdrv_create    = hlbs_create,
+    .bdrv_getlength = hlbs_getlength,
+    .bdrv_read  = hlbs_read,
+    .bdrv_write = hlbs_write,
+    .bdrv_get_allocated_file_size  = hlbs_get_allocated_file_size,
+    .bdrv_snapshot_create   = hlbs_snapshot_create,
+    .bdrv_snapshot_goto     = hlbs_snapshot_goto,
+    .bdrv_snapshot_delete   = hlbs_snapshot_delete,
+    .bdrv_snapshot_list     = hlbs_snapshot_list,
+    .create_options = hlbs_create_options,
+};
+
+static void bdrv_hlbs_init(void)
+{
+    if (log4c_init()) {
+        g_message("log4c_init error!");
+    }
+    bdrv_register(&bdrv_hlbs);
+}
+block_init(bdrv_hlbs_init);
diff --git a/configure b/configure
index 46a7594..8afd5ac 100755
--- a/configure
+++ b/configure
@@ -229,6 +229,7 @@  virtio_blk_data_plane=""
 gtk=""
 gtkabi="2.0"
 tpm="no"
+hlfs=""
 
 # parse CC options first
 for opt do
@@ -896,6 +897,10 @@  for opt do
   ;;
   --enable-glusterfs) glusterfs="yes"
   ;;
+  --disable-hlfs) hlfs="no"
+  ;;
+  --enable-hlfs) hlfs="yes"
+  ;;
   --disable-virtio-blk-data-plane) virtio_blk_data_plane="no"
   ;;
   --enable-virtio-blk-data-plane) virtio_blk_data_plane="yes"
@@ -1161,6 +1166,8 @@  echo "  --with-coroutine=BACKEND coroutine backend. Supported options:"
 echo "                           gthread, ucontext, sigaltstack, windows"
 echo "  --enable-glusterfs       enable GlusterFS backend"
 echo "  --disable-glusterfs      disable GlusterFS backend"
+echo "  --enable-hlfs            enable HLFS backend"
+echo "  --disable-hlfs           disable HLFS backend"
 echo "  --enable-gcov            enable test coverage analysis with gcov"
 echo "  --gcov=GCOV              use specified gcov [$gcov_tool]"
 echo "  --enable-tpm             enable TPM support"
@@ -2813,6 +2820,45 @@  if compile_prog "" "" ; then
 fi
 
 ##########################################
+# hlfs probe
+if test "$hlfs" != "no" ; then
+    cat > $TMPC <<EOF
+#include <stdio.h>
+#include <api/hlfs.h>
+    int main(void) {
+            return 0;
+    }
+EOF
+    INC_GLIB_DIR=`pkg-config --cflags glib-2.0`
+    HLFS_DIR=`echo $HLFS_INSTALL_PATH`
+    JVM_DIR=`echo $JAVA_HOME`
+
+    if [ `getconf LONG_BIT` -eq "64" ];then
+    CLIBS="-L$JVM_DIR/jre/lib/amd64/server/ $CLIBS"
+    fi
+    if [ `getconf LONG_BIT` -eq "32" ];then
+    CLIBS="-L$JVM_DIR/jre/lib/i386/server/ $CLIBS"
+    fi
+    CLIBS="-L$HLFS_DIR/lib/ $CLIBS"
+    CFLAGS="-I$HLFS_DIR/include $CFLAGS"
+    CFLAGS="$INC_GLIB_DIR $CFLAGS"
+
+    hlfs_libs="$CLIBS -lhlfs -llog4c -lglib-2.0 -lgthread-2.0 -lrt -lhdfs -ljvm"
+    echo "999 CLIBS is $CLIBS\n"
+    echo "999 CFLAGS is $CFLAGS\n"
+    if compile_prog "$CFLAGS" "$CLIBS $hlfs_libs"; then
+        hlfs=yes
+        libs_tools="$hlfs_libs $libs_tools"
+        libs_softmmu="$hlfs_libs $libs_softmmu"
+    else
+        if test "$hlfs" = "yes"; then
+        feature_not_found "hlfs block device"
+    fi
+        hlfs=no
+    fi
+fi
+
+##########################################
 # Do we have libiscsi
 # We check for iscsi_unmap_sync() to make sure we have a
 # recent enough version of libiscsi.
@@ -3435,6 +3481,7 @@  echo "build guest agent $guest_agent"
 echo "seccomp support   $seccomp"
 echo "coroutine backend $coroutine_backend"
 echo "GlusterFS support $glusterfs"
+echo "HLFS support      $hlfs"
 echo "virtio-blk-data-plane $virtio_blk_data_plane"
 echo "gcov              $gcov_tool"
 echo "gcov enabled      $gcov"
@@ -4351,6 +4398,10 @@  if test "$gprof" = "yes" ; then
   fi
 fi
 
+if test "$hlfs" = "yes" ; then
+  echo "CONFIG_HLFS=yes" >> $config_target_mak
+fi
+
 if test "$tpm" = "yes"; then
   if test "$target_softmmu" = "yes" ; then
     echo "CONFIG_TPM=y" >> $config_host_mak