{"id":1954403,"url":"http://patchwork.ozlabs.org/api/covers/1954403/?format=json","web_url":"http://patchwork.ozlabs.org/project/qemu-devel/cover/1719776434-435013-1-git-send-email-steven.sistare@oracle.com/","project":{"id":14,"url":"http://patchwork.ozlabs.org/api/projects/14/?format=json","name":"QEMU Development","link_name":"qemu-devel","list_id":"qemu-devel.nongnu.org","list_email":"qemu-devel@nongnu.org","web_url":"","scm_url":"","webscm_url":"","list_archive_url":"","list_archive_url_format":"","commit_url_format":""},"msgid":"<1719776434-435013-1-git-send-email-steven.sistare@oracle.com>","list_archive_url":null,"date":"2024-06-30T19:40:23","name":"[V2,00/11] Live update: cpr-exec","submitter":{"id":71906,"url":"http://patchwork.ozlabs.org/api/people/71906/?format=json","name":"Steve Sistare","email":"steven.sistare@oracle.com"},"mbox":"http://patchwork.ozlabs.org/project/qemu-devel/cover/1719776434-435013-1-git-send-email-steven.sistare@oracle.com/mbox/","series":[{"id":413181,"url":"http://patchwork.ozlabs.org/api/series/413181/?format=json","web_url":"http://patchwork.ozlabs.org/project/qemu-devel/list/?series=413181","date":"2024-06-30T19:40:26","name":"Live update: cpr-exec","version":2,"mbox":"http://patchwork.ozlabs.org/series/413181/mbox/"}],"comments":"http://patchwork.ozlabs.org/api/covers/1954403/comments/","headers":{"Return-Path":"<qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>","X-Original-To":"incoming@patchwork.ozlabs.org","Delivered-To":"patchwork-incoming@legolas.ozlabs.org","Authentication-Results":["legolas.ozlabs.org;\n\tdkim=pass (2048-bit key;\n unprotected) header.d=oracle.com header.i=@oracle.com header.a=rsa-sha256\n header.s=corp-2023-11-20 header.b=XNCBUJVl;\n\tdkim-atps=neutral","legolas.ozlabs.org;\n spf=pass (sender SPF authorized) smtp.mailfrom=nongnu.org\n (client-ip=209.51.188.17; helo=lists.gnu.org;\n envelope-from=qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org;\n receiver=patchwork.ozlabs.org)"],"Received":["from lists.gnu.org (lists.gnu.org [209.51.188.17])\n\t(using TLSv1.2 with cipher ECDHE-ECDSA-AES256-GCM-SHA384 (256/256 bits))\n\t(No client certificate requested)\n\tby legolas.ozlabs.org (Postfix) with ESMTPS id 4WC01R1SwSz1xpN\n\tfor <incoming@patchwork.ozlabs.org>; Mon,  1 Jul 2024 05:42:03 +1000 (AEST)","from localhost ([::1] helo=lists1p.gnu.org)\n\tby lists.gnu.org with esmtp (Exim 4.90_1)\n\t(envelope-from <qemu-devel-bounces@nongnu.org>)\n\tid 1sO0Py-0008VZ-AD; Sun, 30 Jun 2024 15:40:58 -0400","from eggs.gnu.org ([2001:470:142:3::10])\n by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)\n (Exim 4.90_1) (envelope-from <steven.sistare@oracle.com>)\n id 1sO0Pp-0008R2-SR\n for qemu-devel@nongnu.org; Sun, 30 Jun 2024 15:40:49 -0400","from mx0a-00069f02.pphosted.com ([205.220.165.32])\n by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)\n (Exim 4.90_1) (envelope-from <steven.sistare@oracle.com>)\n id 1sO0Pn-0004Nv-Jk\n for qemu-devel@nongnu.org; Sun, 30 Jun 2024 15:40:49 -0400","from pps.filterd (m0246629.ppops.net [127.0.0.1])\n by mx0b-00069f02.pphosted.com (8.18.1.2/8.18.1.2) with ESMTP id\n 45UJecEg030630;\n Sun, 30 Jun 2024 19:40:38 GMT","from iadpaimrmta02.imrmtpd1.prodappiadaev1.oraclevcn.com\n (iadpaimrmta02.appoci.oracle.com [147.154.18.20])\n by mx0b-00069f02.pphosted.com (PPS) with ESMTPS id 402a591era-1\n (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);\n Sun, 30 Jun 2024 19:40:38 +0000 (GMT)","from pps.filterd\n (iadpaimrmta02.imrmtpd1.prodappiadaev1.oraclevcn.com [127.0.0.1])\n by iadpaimrmta02.imrmtpd1.prodappiadaev1.oraclevcn.com (8.17.1.19/8.17.1.19)\n with ESMTP id 45UE60vx018396; Sun, 30 Jun 2024 19:40:37 GMT","from pps.reinject (localhost [127.0.0.1])\n by iadpaimrmta02.imrmtpd1.prodappiadaev1.oraclevcn.com (PPS) with ESMTPS id\n 4028qc16cp-1\n (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);\n Sun, 30 Jun 2024 19:40:36 +0000","from iadpaimrmta02.imrmtpd1.prodappiadaev1.oraclevcn.com\n (iadpaimrmta02.imrmtpd1.prodappiadaev1.oraclevcn.com [127.0.0.1])\n by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 45UJeaSO014044;\n Sun, 30 Jun 2024 19:40:36 GMT","from ca-dev63.us.oracle.com (ca-dev63.us.oracle.com [10.211.8.221])\n by iadpaimrmta02.imrmtpd1.prodappiadaev1.oraclevcn.com (PPS) with\n ESMTP id 4028qc16cc-1; Sun, 30 Jun 2024 19:40:36 +0000"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=\n from:to:cc:subject:date:message-id:mime-version:content-type\n :content-transfer-encoding; s=corp-2023-11-20; bh=2tpyLTAebinn+o\n njq2RWmyvXCR7OKwgCjRq62rWDVbs=; b=XNCBUJVldb31YktsqMtyhnxGdYgbGz\n FW8mD0lhTO0Nrr6OWny06QtySQt8BkBFrm4NfGTjxQyfJ3I5nRx8GafZmw3lVBTn\n /l8uvruhEHUqo8SbaujfYor9aQVzKhZqk/5Hd3Fq8ex60/Q1tAx1Ne4iWM06rNM7\n 7XPUY144G/qCuLGnN2h4Vcd59Nltj4/SiQRHIxTPjMVW7CKZskoUscF2vhpFoOoL\n z0CNvz7waelrvT0xsWd7h5MmTS1mheQm1PNyvb++LBo81tE9xkj+kFOWCpYxNSSq\n DaREkmw4VSISUYGH7ZUBhWlHt0zIXhhXv43yeUxLzkKBJnjvIwC9ctzA==","From":"Steve Sistare <steven.sistare@oracle.com>","To":"qemu-devel@nongnu.org","Cc":"Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>,\n David Hildenbrand <david@redhat.com>,\n Marcel Apfelbaum <marcel.apfelbaum@gmail.com>,\n Eduardo Habkost <eduardo@habkost.net>,\n Philippe Mathieu-Daude <philmd@linaro.org>,\n Paolo Bonzini <pbonzini@redhat.com>,\n \"Daniel P. Berrange\" <berrange@redhat.com>,\n Markus Armbruster <armbru@redhat.com>,\n Steve Sistare <steven.sistare@oracle.com>","Subject":"[PATCH V2 00/11] Live update: cpr-exec","Date":"Sun, 30 Jun 2024 12:40:23 -0700","Message-Id":"<1719776434-435013-1-git-send-email-steven.sistare@oracle.com>","X-Mailer":"git-send-email 1.8.3.1","MIME-Version":"1.0","Content-Type":"text/plain; charset=UTF-8","Content-Transfer-Encoding":"8bit","X-Proofpoint-Virus-Version":"vendor=baseguard\n engine=ICAP:2.0.293,Aquarius:18.0.1039,Hydra:6.0.680,FMLib:17.12.28.16\n definitions=2024-06-30_16,2024-06-28_01,2024-05-17_01","X-Proofpoint-Spam-Details":"rule=notspam policy=default score=0 suspectscore=0\n malwarescore=0\n adultscore=0 bulkscore=0 mlxlogscore=999 phishscore=0 spamscore=0\n mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1\n engine=8.12.0-2406180000 definitions=main-2406300157","X-Proofpoint-GUID":"TLFcZrcx6DS6DsczQaJF9xhHYPtksruH","X-Proofpoint-ORIG-GUID":"TLFcZrcx6DS6DsczQaJF9xhHYPtksruH","Received-SPF":"pass client-ip=205.220.165.32;\n envelope-from=steven.sistare@oracle.com; helo=mx0a-00069f02.pphosted.com","X-Spam_score_int":"-20","X-Spam_score":"-2.1","X-Spam_bar":"--","X-Spam_report":"(-2.1 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_MED=-0.001,\n DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1,\n RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_HELO_NONE=0.001,\n SPF_PASS=-0.001 autolearn=ham autolearn_force=no","X-Spam_action":"no action","X-BeenThere":"qemu-devel@nongnu.org","X-Mailman-Version":"2.1.29","Precedence":"list","List-Id":"<qemu-devel.nongnu.org>","List-Unsubscribe":"<https://lists.nongnu.org/mailman/options/qemu-devel>,\n <mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>","List-Archive":"<https://lists.nongnu.org/archive/html/qemu-devel>","List-Post":"<mailto:qemu-devel@nongnu.org>","List-Help":"<mailto:qemu-devel-request@nongnu.org?subject=help>","List-Subscribe":"<https://lists.nongnu.org/mailman/listinfo/qemu-devel>,\n <mailto:qemu-devel-request@nongnu.org?subject=subscribe>","Errors-To":"qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org","Sender":"qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org"},"content":"What?\n\nThis patch series adds the live migration cpr-exec mode, which allows\nthe user to update QEMU with minimal guest pause time, by preserving\nguest RAM in place, albeit with new virtual addresses in new QEMU, and\nby preserving device file descriptors.\n\nThe new user-visible interfaces are:\n  * cpr-exec (MigMode migration parameter)\n  * cpr-exec-command (migration parameter)\n  * anon-alloc (command-line option for -machine)\n\nThe user sets the mode parameter before invoking the migrate command.\nIn this mode, the user issues the migrate command to old QEMU, which\nstops the VM and saves state to the migration channels.  Old QEMU then\nexec's new QEMU, replacing the original process while retaining its PID.\nThe user specifies the command to exec new QEMU in the migration parameter\ncpr-exec-command.  The command must pass all old QEMU arguments to new\nQEMU, plus the -incoming option.  Execution resumes in new QEMU.\n\nMemory-backend objects must have the share=on attribute, but\nmemory-backend-epc is not supported.  The VM must be started\nwith the '-machine anon-alloc=memfd' option, which allows anonymous\nmemory to be transferred in place to the new process.\n\nWhy?\n\nThis mode has less impact on the guest than any other method of updating\nin place.  The pause time is much lower, because devices need not be torn\ndown and recreated, DMA does not need to be drained and quiesced, and minimal\nstate is copied to new QEMU.  Further, there are no constraints on the guest.\nBy contrast, cpr-reboot mode requires the guest to support S3 suspend-to-ram,\nand suspending plus resuming vfio devices adds multiple seconds to the\nguest pause time.  Lastly, there is no loss of connectivity to the guest,\nbecause chardev descriptors remain open and connected.\n\nThese benefits all derive from the core design principle of this mode,\nwhich is preserving open descriptors.  This approach is very general and\ncan be used to support a wide variety of devices that do not have hardware\nsupport for live migration, including but not limited to: vfio, chardev,\nvhost, vdpa, and iommufd.  Some devices need new kernel software interfaces\nto allow a descriptor to be used in a process that did not originally open it.\n\nIn a containerized QEMU environment, cpr-exec reuses an existing QEMU\ncontainer and its assigned resources.  By contrast, consider a design in\nwhich a new container is created on the same host as the target of the\nCPR operation.  Resources must be reserved for the new container, while\nthe old container still reserves resources until the operation completes.\nAvoiding over commitment requires extra work in the management layer.\nThis is one reason why a cloud provider may prefer cpr-exec.  A second reason\nis that the container may include agents with their own connections to the\noutside world, and such connections remain intact if the container is reused.\n\nHow?\n\nAll memory that is mapped by the guest is preserved in place.  Indeed,\nit must be, because it may be the target of DMA requests, which are not\nquiesced during cpr-exec.  All such memory must be mmap'able in new QEMU.\nThis is easy for named memory-backend objects, as long as they are mapped\nshared, because they are visible in the file system in both old and new QEMU.\nAnonymous memory must be allocated using memfd_create rather than MAP_ANON,\nso the memfd's can be sent to new QEMU.  Pages that were locked in memory\nfor DMA in old QEMU remain locked in new QEMU, because the descriptor of\nthe device that locked them remains open.\n\ncpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,\nand by sending the unique name and value of each descriptor to new QEMU\nvia CPR state.\n\nFor device descriptors, new QEMU reuses the descriptor when creating the\ndevice, rather than opening it again.  The same holds for chardevs.  For\nmemfd descriptors, new QEMU mmap's the preserved memfd when a ramblock\nis created.\n\nCPR state cannot be sent over the normal migration channel, because devices\nand backends are created prior to reading the channel, so this mode sends\nCPR state over a second migration channel that is not visible to the user.\nNew QEMU reads the second channel prior to creating devices or backends.\n\nThe exec itself is trivial.  After writing to the migration channels, the\nmigration code calls a new main-loop hook to perform the exec.\n\nExample:\n\nIn this example, we simply restart the same version of QEMU, but in\na real scenario one would use a new QEMU binary path in cpr-exec-command.\n\n  # qemu-kvm -monitor stdio -object\n  memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on\n  -m 4G -machine anon-alloc=memfd ...\n\n  QEMU 9.1.50 monitor - type 'help' for more information\n  (qemu) info status\n  VM status: running\n  (qemu) migrate_set_parameter mode cpr-exec\n  (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state\n  (qemu) migrate -d file:vm.state\n  (qemu) QEMU 9.1.50 monitor - type 'help' for more information\n  (qemu) info status\n  VM status: running\n\nThis patch series implements a minimal version of cpr-exec.  Additional\nseries are ready to be posted to deliver the complete vision described\nabove, including\n  * vfio\n  * chardev\n  * vhost\n  * blockers\n  * hostmem-memfd\n  * migration-test cases\n\nWorks in progress include:\n  * vdpa\n  * iommufd\n  * cpr-transfer mode\n\nChanges since V1:\n  * Dropped precreate and factory patches.  Added CPR state instead.\n  * Dropped patches that refactor ramblock allocation\n  * Dropped vmstate_info_void patch (peter)\n  * Dropped patch \"seccomp: cpr-exec blocker\" (Daniel)\n  * Redefined memfd-alloc option as anon-alloc\n  * No longer preserve ramblock fields, except for fd (peter)\n  * Added fd preservation functions in CPR state\n  * Hoisted cpr code out of migrate_fd_cleanup (fabiano)\n  * Revised migration.json docs (markus)\n  * Fixed qtest failures (fabiano)\n  * Renamed SAVEVM_FOREACH macros (fabiano)\n  * Renamed cpr-exec-args as cpr-exec-command (markus)\n\nThe first 6 patches below are foundational and are needed for both cpr-exec\nmode and cpr-transfer mode.  The last 5 patches are specific to cpr-exec\nand implement the mechanisms for sharing state across exec.\n\nSteve Sistare (11):\n  machine: alloc-anon option\n  migration: cpr-state\n  migration: save cpr mode\n  migration: stop vm earlier for cpr\n  physmem: preserve ram blocks for cpr\n  migration: fix mismatched GPAs during cpr\n  oslib: qemu_clear_cloexec\n  vl: helper to request exec\n  migration: cpr-exec-command parameter\n  migration: cpr-exec save and load\n  migration: cpr-exec mode\n\n hmp-commands.hx                |   2 +-\n hw/core/machine.c              |  24 +++++\n include/exec/memory.h          |  12 +++\n include/hw/boards.h            |   1 +\n include/migration/cpr.h        |  35 ++++++\n include/qemu/osdep.h           |   9 ++\n include/sysemu/runstate.h      |   3 +\n migration/cpr-exec.c           | 180 +++++++++++++++++++++++++++++++\n migration/cpr.c                | 238 +++++++++++++++++++++++++++++++++++++++++\n migration/meson.build          |   2 +\n migration/migration-hmp-cmds.c |  25 +++++\n migration/migration.c          |  43 ++++++--\n migration/options.c            |  23 +++-\n migration/ram.c                |  17 +--\n migration/trace-events         |   5 +\n qapi/machine.json              |  14 +++\n qapi/migration.json            |  45 +++++++-\n qemu-options.hx                |  13 +++\n system/memory.c                |  22 +++-\n system/physmem.c               |  61 ++++++++++-\n system/runstate.c              |  29 +++++\n system/trace-events            |   3 +\n system/vl.c                    |   3 +\n util/oslib-posix.c             |   9 ++\n util/oslib-win32.c             |   4 +\n 25 files changed, 792 insertions(+), 30 deletions(-)\n create mode 100644 include/migration/cpr.h\n create mode 100644 migration/cpr-exec.c\n create mode 100644 migration/cpr.c"}