
[v17,00/35] Drivers for Gunyah hypervisor

Message ID 20240222-gunyah-v17-0-1e9da6763d38@quicinc.com


Elliot Berman Feb. 22, 2024, 11:16 p.m. UTC
Gunyah is a Type-1 hypervisor independent of any high-level OS kernel,
and runs in a higher CPU privilege level. It does not depend on any
lower-privileged OS kernel/code for its core functionality. This
increases its security and can support a much smaller trusted computing
base than a Type-2 hypervisor. Gunyah is designed for isolated virtual
machine use cases and to support launching trusted+isolated virtual
machines from a relatively less trusted host virtual machine.

Gunyah is an open source hypervisor. The source repo is available at
https://github.com/quic/gunyah-hypervisor.

The diagram below shows the architecture for AArch64.

::

         VM A                    VM B
     +-----+ +-----+  | +-----+ +-----+ +-----+
     |     | |     |  | |     | |     | |     |
 EL0 | APP | | APP |  | | APP | | APP | | APP |
     |     | |     |  | |     | |     | |     |
     +-----+ +-----+  | +-----+ +-----+ +-----+
 ---------------------|-------------------------
     +--------------+ | +----------------------+
     |              | | |                      |
 EL1 | Linux Kernel | | |Linux kernel/Other OS |   ...
     |              | | |                      |
     +--------------+ | +----------------------+
 --------hvc/smc------|------hvc/smc------------

     +----------------------------------------+
     |                                        |
 EL2 |            Gunyah Hypervisor           |
     |                                        |
     +----------------------------------------+

Gunyah provides the following features:

- Threads and Scheduling: The scheduler schedules virtual CPUs (VCPUs)
  on physical CPUs and enables time-sharing of the CPUs.
- Memory Management: Gunyah tracks memory ownership and use of all
  memory under its control. It provides low-level dynamic memory
  management APIs on top of which higher-level donation, lending, and
  sharing are built. Gunyah provides strong VM memory isolation for
  trusted VMs.
- Interrupt Virtualization: Interrupts are managed by the hypervisor
  and are routed directly to the assigned VM.
- Inter-VM Communication: There are several different mechanisms
  provided for communicating between VMs.
- Device Virtualization: Para-virtualization of devices is supported
  using inter-VM communication and virtio primitives. Low-level
  architectural features and devices such as CPU timers and interrupt
  controllers are supported with hardware virtualization and emulation
  where required.
- Resource Manager: Gunyah supports a "root" VM that initially owns all
  VM memory and IO resources. The Gunyah Resource Manager is the default
  bundled root VM and provides high-level services including dynamic VM
  management and secure memory donation, lending and sharing.

This series adds the basic framework for detecting that Linux is running
under Gunyah as a virtual machine, communication with the Gunyah
Resource Manager, and a sample virtual machine manager capable of
launching virtual machines.

Changes in v17:
 - Replace RM's irq_chip with irq_create_fwspec_mapping
 - Unmap lent memory from kernel logical map
 - Small optimization to unlock folio earlier to allow vCPUs racing for
   the same folio to run sooner
 - Add missed "safe to lend" checks for a folio when constructing mem
   parcel

Changes in v16:
https://lore.kernel.org/r/20240109-gunyah-v16-0-634904bf4ce9@quicinc.com
 - Fleshed out memory reclaim while VM is running
 - Documentation and comments

Changes in v15:
https://lore.kernel.org/r/20231215-gunyah-v15-0-192a5d872a30@quicinc.com
 - First implementation of virtual machines backed by guestmemfd and
   using demand paging to provide memory instead of all up front.
 - Use message queue hypercalls directly instead of traversing through
   mailbox framework.

Changes in v14: https://lore.kernel.org/all/20230613172054.3959700-1-quic_eberman@quicinc.com/
 - Coding/cosmetic tweaks suggested by Alex
 - Mark IRQs as wake-up capable

Changes in v13:
https://lore.kernel.org/all/20230509204801.2824351-1-quic_eberman@quicinc.com/
 - Tweaks to message queue driver to address race condition between IRQ
   and mailbox registration
 - Allow removal of VM functions by function-specific comparison --
   specifically to allow removing irqfd by label only and not requiring
   original FD to be provided.

Changes in v12:
https://lore.kernel.org/all/20230424231558.70911-1-quic_eberman@quicinc.com/
 - Stylistic/cosmetic tweaks suggested by Alex
 - Remove patch "virt: gunyah: Identify hypervisor version" and squash
   the check that we're running under a reasonable Gunyah hypervisor
   into RM driver
 - Refactor platform hooks into a separate module per suggestion from
   Srini
 - GFP_KERNEL_ACCOUNT and account_locked_vm() for page pinning
 - enum-ify related constants

Changes in v11:
https://lore.kernel.org/all/20230304010632.2127470-1-quic_eberman@quicinc.com/
 - Rename struct gh_vm_dtb_config:gpa -> guest_phys_addr & overflow
   checks for this
 - More docstrings throughout
 - Make resp_buf and resp_buf_size optional
 - Replace deprecated idr with xarray
 - Refcounting on misc device instead of RM's platform device
 - Renaming variables, structs, etc. from gunyah_ -> gh_
 - Drop removal of user mem regions
 - Drop mem_lend functionality; to converge with restricted_memfd later

Changes in v10:
https://lore.kernel.org/all/20230214211229.3239350-1-quic_eberman@quicinc.com/
 - Fix bisectability (end result of series is same, --fixups applied to
   wrong commits)
 - Convert GH_ERROR_* and GH_RM_ERROR_* to enums
 - Correct race condition between allocating/freeing user memory
 - Replace offsetof with struct_size
 - Series-wide renaming of functions to be more consistent
 - VM shutdown & restart support added in vCPU and VM Manager patches
 - Convert VM function name (string) to type (number)
 - Convert VM function argument to value (which could be a pointer) to
   remove memory wastage for arguments
 - Remove defensive checks of hypervisor correctness
 - Clean ups to ioeventfd as suggested by Srivatsa

Changes in v9:
https://lore.kernel.org/all/20230120224627.4053418-1-quic_eberman@quicinc.com/
 - Refactor Gunyah API flags to be exposed as feature flags at kernel
   level
 - Move mbox client cleanup into gunyah_msgq_remove()
 - Simplify gh_rm_call return value and response payload
 - Missing clean-up/error handling/little endian fixes as suggested by
   Srivatsa and Alex in v8 series

Changes in v8:
https://lore.kernel.org/all/20221219225850.2397345-1-quic_eberman@quicinc.com/
 - Treat VM manager as a library of RM
 - Add patches 21-28 as RFC to support proxy-scheduled vCPUs and
   necessary bits to support virtio from Gunyah userspace

Changes in v7:
https://lore.kernel.org/all/20221121140009.2353512-1-quic_eberman@quicinc.com/
 - Refactor to remove gunyah RM bus
 - Refactor to allow multiple RM device instances
 - Bump UAPI to start at 0x0
 - Refactor QCOM SCM's platform hooks to allow
   CONFIG_QCOM_SCM=Y/CONFIG_GUNYAH=M combinations

Changes in v6:
https://lore.kernel.org/all/20221026185846.3983888-1-quic_eberman@quicinc.com/
 - *Replace gunyah-console with gunyah VM Manager*
 - Move include/asm-generic/gunyah.h into include/linux/gunyah.h
 - s/gunyah_msgq/gh_msgq/
 - Minor tweaks and documentation tidying based on comments from Jiri,
   Greg, Arnd, Dmitry, and Bagas.

Changes in v5
https://lore.kernel.org/all/20221011000840.289033-1-quic_eberman@quicinc.com/
 - Dropped sysfs nodes
 - Switch from aux bus to Gunyah RM bus for the subdevices
 - Cleaning up RM console

Changes in v4:
https://lore.kernel.org/all/20220928195633.2348848-1-quic_eberman@quicinc.com/
 - Tidied up documentation throughout based on questions/feedback received
 - Switched message queue implementation to use mailboxes
 - Renamed "gunyah_device" as "gunyah_resource"

Changes in v3:
https://lore.kernel.org/all/20220811214107.1074343-1-quic_eberman@quicinc.com/
 - s/Maintained/Supported/ in MAINTAINERS
 - Tidied up documentation throughout based on questions/feedback received
 - Moved hypercalls into arch/arm64/gunyah/; following hyper-v's implementation
 - Drop opaque typedefs
 - Move sysfs nodes under /sys/hypervisor/gunyah/
 - Moved Gunyah console driver to drivers/tty/
 - Reworked gh_device design to drop the Gunyah bus.

Changes in v2: https://lore.kernel.org/all/20220801211240.597859-1-quic_eberman@quicinc.com/
 - DT bindings clean up
 - Switch hypercalls to follow SMCCC 

v1: https://lore.kernel.org/all/20220223233729.1571114-1-quic_eberman@quicinc.com/

Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
---
Elliot Berman (35):
      docs: gunyah: Introduce Gunyah Hypervisor
      dt-bindings: Add binding for gunyah hypervisor
      gunyah: Common types and error codes for Gunyah hypercalls
      virt: gunyah: Add hypercalls to identify Gunyah
      virt: gunyah: Add hypervisor driver
      virt: gunyah: msgq: Add hypercalls to send and receive messages
      gunyah: rsc_mgr: Add resource manager RPC core
      gunyah: vm_mgr: Introduce basic VM Manager
      gunyah: rsc_mgr: Add VM lifecycle RPC
      gunyah: vm_mgr: Add VM start/stop
      virt: gunyah: Translate gh_rm_hyp_resource into gunyah_resource
      virt: gunyah: Add resource tickets
      gunyah: vm_mgr: Add framework for VM Functions
      virt: gunyah: Add hypercalls for running a vCPU
      virt: gunyah: Add proxy-scheduled vCPUs
      gunyah: Add hypercalls for demand paging
      gunyah: rsc_mgr: Add memory parcel RPC
      mm/interval_tree: Export iter_first/iter_next
      arch/mm: Export direct {un,}map functions
      virt: gunyah: Add interfaces to map memory into guest address space
      gunyah: rsc_mgr: Add platform ops on mem_lend/mem_reclaim
      virt: gunyah: Add Qualcomm Gunyah platform ops
      virt: gunyah: Implement guestmemfd
      virt: gunyah: Add ioctl to bind guestmem to VMs
      virt: gunyah: guestmem: Initialize RM mem parcels from guestmem
      virt: gunyah: Share guest VM dtb configuration to Gunyah
      gunyah: rsc_mgr: Add RPC to enable demand paging
      virt: gunyah: Enable demand paging
      gunyah: rsc_mgr: Add RPC to set VM boot context
      virt: gunyah: Allow userspace to initialize context of primary vCPU
      virt: gunyah: Add hypercalls for sending doorbell
      virt: gunyah: Add irqfd interface
      virt: gunyah: Add IO handlers
      virt: gunyah: Add ioeventfd
      MAINTAINERS: Add Gunyah hypervisor drivers section

 .../bindings/firmware/gunyah-hypervisor.yaml       |  82 ++
 Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
 Documentation/virt/gunyah/index.rst                | 135 +++
 Documentation/virt/gunyah/message-queue.rst        |  68 ++
 Documentation/virt/index.rst                       |   1 +
 MAINTAINERS                                        |  12 +
 arch/arm64/Kbuild                                  |   1 +
 arch/arm64/gunyah/Makefile                         |   3 +
 arch/arm64/gunyah/gunyah_hypercall.c               | 279 ++++++
 arch/arm64/include/asm/gunyah.h                    |  57 ++
 arch/arm64/mm/pageattr.c                           |   3 +
 drivers/virt/Kconfig                               |   2 +
 drivers/virt/Makefile                              |   1 +
 drivers/virt/gunyah/Kconfig                        |  47 +
 drivers/virt/gunyah/Makefile                       |   9 +
 drivers/virt/gunyah/guest_memfd.c                  | 987 ++++++++++++++++++++
 drivers/virt/gunyah/gunyah.c                       |  52 ++
 drivers/virt/gunyah/gunyah_ioeventfd.c             | 139 +++
 drivers/virt/gunyah/gunyah_irqfd.c                 | 187 ++++
 drivers/virt/gunyah/gunyah_platform_hooks.c        | 117 +++
 drivers/virt/gunyah/gunyah_qcom.c                  | 220 +++++
 drivers/virt/gunyah/gunyah_vcpu.c                  | 590 ++++++++++++
 drivers/virt/gunyah/rsc_mgr.c                      | 836 +++++++++++++++++
 drivers/virt/gunyah/rsc_mgr.h                      | 144 +++
 drivers/virt/gunyah/rsc_mgr_rpc.c                  | 602 +++++++++++++
 drivers/virt/gunyah/vm_mgr.c                       | 993 +++++++++++++++++++++
 drivers/virt/gunyah/vm_mgr.h                       | 215 +++++
 drivers/virt/gunyah/vm_mgr_mem.c                   | 356 ++++++++
 include/linux/gunyah.h                             | 483 ++++++++++
 include/uapi/linux/gunyah.h                        | 378 ++++++++
 mm/interval_tree.c                                 |   3 +
 31 files changed, 7003 insertions(+)
---
base-commit: ffd2cb6b718e189e7e2d5d0c19c25611f92e061a
change-id: 20231208-gunyah-952aca7668e0

Best regards,

Comments

Christoph Hellwig Feb. 23, 2024, 7:09 a.m. UTC | #1
On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote:
> Firmware and hypervisor drivers can donate system heap memory to their
> respective firmware/hypervisor entities. Those drivers should unmap the
> pages from the kernel's logical map before doing so.
> 
> Export can_set_direct_map, set_direct_map_invalid_noflush, and
> set_direct_map_default_noflush.

Err, no they should not.  And not using such super low-level interfaces
from modular code.
Konrad Dybcio Feb. 23, 2024, 9:10 p.m. UTC | #2
On 23.02.2024 00:16, Elliot Berman wrote:
> Add driver to detect when running under Gunyah. It performs basic
> identification hypercall and populates the platform bus for resource
> manager to probe.
> 
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
> ---

[...]

> +
> +	/* Might move this out to individual drivers if there's ever an API version bump */
> +	if (gunyah_api_version(&gunyah_api) != GUNYAH_API_V1) {
> +		pr_info("Unsupported Gunyah version: %u\n",
> +			gunyah_api_version(&gunyah_api));

Weird for this not to be an error, but it's probably not worth resending
over if it's the only thing

Konrad
Elliot Berman Feb. 23, 2024, 10:58 p.m. UTC | #3
On Fri, Feb 23, 2024 at 10:10:47PM +0100, Konrad Dybcio wrote:
> On 23.02.2024 00:16, Elliot Berman wrote:
> > Add driver to detect when running under Gunyah. It performs basic
> > identification hypercall and populates the platform bus for resource
> > manager to probe.
> > 
> > Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
> > ---
> 
> [...]
> 
> > +
> > +	/* Might move this out to individual drivers if there's ever an API version bump */
> > +	if (gunyah_api_version(&gunyah_api) != GUNYAH_API_V1) {
> > +		pr_info("Unsupported Gunyah version: %u\n",
> > +			gunyah_api_version(&gunyah_api));
> 
> Weird for this not to be an error, but it's probably not worth resending
> over if it's the only thing

It is an error, but maybe I misunderstood:

> > +	/* Might move this out to individual drivers if there's ever an API version bump */
> > +	if (gunyah_api_version(&gunyah_api) != GUNYAH_API_V1) {
> > +		pr_info("Unsupported Gunyah version: %u\n",
> > +			gunyah_api_version(&gunyah_api));
> > +		return -ENODEV;
> > +	}
Konrad Dybcio Feb. 23, 2024, 11:46 p.m. UTC | #4
On 23.02.2024 23:58, Elliot Berman wrote:
> On Fri, Feb 23, 2024 at 10:10:47PM +0100, Konrad Dybcio wrote:
>> On 23.02.2024 00:16, Elliot Berman wrote:
>>> Add driver to detect when running under Gunyah. It performs basic
>>> identification hypercall and populates the platform bus for resource
>>> manager to probe.
>>>
>>> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
>>> ---
>>
>> [...]
>>
>>> +
>>> +	/* Might move this out to individual drivers if there's ever an API version bump */
>>> +	if (gunyah_api_version(&gunyah_api) != GUNYAH_API_V1) {
>>> +		pr_info("Unsupported Gunyah version: %u\n",
>>> +			gunyah_api_version(&gunyah_api));
>>
>> Weird for this not to be an error, but it's probably not worth resending
>> over if it's the only thing
> 
> It is an error, but maybe I misunderstood:

Sorry, I meant "pr_info might have been pr_err"

Konrad

> 
>>> +	/* Might move this out to individual drivers if there's ever an API version bump */
>>> +	if (gunyah_api_version(&gunyah_api) != GUNYAH_API_V1) {
>>> +		pr_info("Unsupported Gunyah version: %u\n",
>>> +			gunyah_api_version(&gunyah_api));
>>> +		return -ENODEV;
>>> +	}
>
Elliot Berman Feb. 24, 2024, 12:37 a.m. UTC | #5
On Thu, Feb 22, 2024 at 11:09:40PM -0800, Christoph Hellwig wrote:
> On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote:
> > Firmware and hypervisor drivers can donate system heap memory to their
> > respective firmware/hypervisor entities. Those drivers should unmap the
> > pages from the kernel's logical map before doing so.
> > 
> > Export can_set_direct_map, set_direct_map_invalid_noflush, and
> > set_direct_map_default_noflush.
> 
> Err, not they should not.  And not using such super low-level interfaces
> from modular code.

Hi Christoph,
 
We've observed a few times that Linux can unintentionally access a page
we've unmapped from the host's stage 2 page table via an unaligned load
from an adjacent page. The stage 2 is managed by Gunyah. There are a few
scenarios where, even though we allocate and own a page from buddy,
someone else could try to access the page without going through the
hypervisor driver. One such instance we know about is
load_unaligned_zeropad() via path_lookupat() [1].
 
load_unaligned_zeropad() could be called near the end of a page. If the
next page isn't mapped by the kernel in the stage 1 page tables, then
the access to the unmapped page from load_unaligned_zeropad() will
land in __do_kernel_fault(), call fixup_exception(), and fill the
remainder of the load with zeroes. If the page in question is mapped in
stage 1 but was unmapped from stage 2, then the access lands back in
Linux in do_sea(), leading to a panic().
 
Our preference would be to add fixup_exception() to S2 PTW errors for
two reasons:
1. It's cheaper performance-wise: we've already manipulated the S2
   page table, and intentional access to the page is prevented because
   the pKVM/Gunyah drivers know that access to the page has been lost.
2. Page-granular S1 mappings only happen on arm64 with rodata=full.
 
In an off-list discussion with the Android pkvm folks, their preference
was to have the pages unmapped from stage 1. I've gone with that
approach to get started but welcome discussion on the best approach.
 
The Android (downstream) implementation of arm64 pkvm is currently
implementing a hack where s2 ptw faults are given back to the host as s1
ptw faults (i.e. __do_kernel_fault() gets called and not do_sea()) --
allowing the kernel to fixup the exception.
 
arm64 pKVM will also face this issue when implementing guest_memfd or
when donating more memory to the hyp for s2 page tables, etc. As far as
I can tell, this isn't an issue for arm64 pKVM today because memory
isn't being dynamically donated to the hypervisor.
 
Thanks,
Elliot

[1]:
path_lookupat+0x340/0x3228
filename_lookup+0xbc/0x1c0
__arm64_sys_newfstatat+0xb0/0x4a0
invoke_syscall+0x58/0x118
Christoph Hellwig Feb. 26, 2024, 11:06 a.m. UTC | #6
The point is that we can't just allow modules to unmap data from
the kernel mapping, no matter how noble your intentions are.
David Hildenbrand Feb. 26, 2024, 11:53 a.m. UTC | #7
On 26.02.24 12:06, Christoph Hellwig wrote:
> The point is that we can't we just allow modules to unmap data from
> the kernel mapping, no matter how noble your intentions are.

I absolutely agree.
Elliot Berman Feb. 26, 2024, 5:27 p.m. UTC | #8
On Mon, Feb 26, 2024 at 12:53:48PM +0100, David Hildenbrand wrote:
> On 26.02.24 12:06, Christoph Hellwig wrote:
> > The point is that we can't we just allow modules to unmap data from
> > the kernel mapping, no matter how noble your intentions are.
> 
> I absolutely agree.
> 

Hi David and Christoph,

Are your preferences that we should make Gunyah builtin only, or that we
should add fixup handling for S2 PTW errors (or something else)?

Also, do you extend that preference to modifying S2 mappings? This would
require any hypervisor driver that supports confidential compute
usecases to only ever be builtin.

Is your concern about unmapping data from kernel mapping, then module
being unloaded, and then having no way to recover the mapping? Would a
permanent module be better? The primary reason we were wanting to have
it as module was to avoid having driver in memory if you're not a Gunyah
guest.

Thanks,
Elliot
David Hildenbrand Feb. 27, 2024, 9:49 a.m. UTC | #9
On 26.02.24 18:27, Elliot Berman wrote:
> On Mon, Feb 26, 2024 at 12:53:48PM +0100, David Hildenbrand wrote:
>> On 26.02.24 12:06, Christoph Hellwig wrote:
>>> The point is that we can't we just allow modules to unmap data from
>>> the kernel mapping, no matter how noble your intentions are.
>>
>> I absolutely agree.
>>
> 
> Hi David and Chirstoph,
> 
> Are your preferences that we should make Gunyah builtin only or should add
> fixing up S2 PTW errors (or something else)?

Having that built into the kernel certainly does sound better than 
exposing that functionality to arbitrary OOT modules. But still, this 
feels like it is using a "too-low-level" interface.

> 
> Also, do you extend that preference to modifying S2 mappings? This would
> require any hypervisor driver that supports confidential compute
> usecases to only ever be builtin.
> 
> Is your concern about unmapping data from kernel mapping, then module
> being unloaded, and then having no way to recover the mapping? Would a
> permanent module be better? The primary reason we were wanting to have
> it as module was to avoid having driver in memory if you're not a Gunyah
> guest.

What I didn't grasp from this patch description: is the area where a 
driver would unmap/remap that memory somehow known ahead of time and 
limited?

How would the driver obtain that memory it would try to unmap/remap the 
direct map of? Simply allocate some pages and then unmap the direct map?

For example, we do have mm/secretmem.c, where we unmap the directmap on 
allocation and remap when freeing a page. A nice abstraction on 
alloc/free, so one cannot really do a lot of harm.

Further, we enlightened the remainder of the system about secretmem, 
such that we can detect that the directmap is no longer there. As one 
example, see the secretmem_active() check in kernel/power/hibernate.c.

A similar abstraction would make sense (I remember a discussion about 
having secretmem functionality in guest_memfd, would that help?), but 
the question is "which" memory you want to unmap the direct map of, and 
how the driver became "owner" of that memory such that it would really 
be allowed to mess with the directmap.
Elliot Berman March 1, 2024, 1:35 a.m. UTC | #10
On Tue, Feb 27, 2024 at 10:49:32AM +0100, David Hildenbrand wrote:
> On 26.02.24 18:27, Elliot Berman wrote:
> > On Mon, Feb 26, 2024 at 12:53:48PM +0100, David Hildenbrand wrote:
> > > On 26.02.24 12:06, Christoph Hellwig wrote:
> > > > The point is that we can't we just allow modules to unmap data from
> > > > the kernel mapping, no matter how noble your intentions are.
> > > 
> > > I absolutely agree.
> > > 
> > 
> > Hi David and Chirstoph,
> > 
> > Are your preferences that we should make Gunyah builtin only or should add
> > fixing up S2 PTW errors (or something else)?
> 
> Having that built into the kernel certainly does sound better than exposing
> that functionality to arbitrary OOT modules. But still, this feels like it
> is using a "too-low-level" interface.
> 

What are your thoughts about fixing up the stage-2 fault instead? I
think this gives mmu-based isolation a slight speed boost because we
avoid modifying kernel mapping. The hypervisor driver (KVM or Gunyah)
knows that the page isn't mapped. Whether we get S2 or S1 fault, the
kernel is likely going to crash, except in the rare case where we want
to fix the exception. In that case, we can modify the S2 fault handler
to call fixup_exception() when appropriate.

> > 
> > Also, do you extend that preference to modifying S2 mappings? This would
> > require any hypervisor driver that supports confidential compute
> > usecases to only ever be builtin.
> > 
> > Is your concern about unmapping data from kernel mapping, then module
> > being unloaded, and then having no way to recover the mapping? Would a
> > permanent module be better? The primary reason we were wanting to have
> > it as module was to avoid having driver in memory if you're not a Gunyah
> > guest.
> 
> What I didn't grasp from this patch description: is the area where a driver
> would unmap/remap that memory somehow known ahead of time and limited?
> 
> How would the driver obtain that memory it would try to unmap/remap the
> direct map of? Simply allocate some pages and then unmap the direct map?

That's correct.

> 
> For example, we do have mm/secretmem.c, where we unmap the directmap on
> allocation and remap when freeing a page. A nice abstraction on alloc/free,
> so one cannot really do a lot of harm.
> 
> Further, we enlightened the remainder of the system about secretmem, such
> that we can detect that the directmap is no longer there. As one example,
> see the secretmem_active() check in kernel/power/hibernate.c.
> 

I'll take a look at this. guest_memfd might be able to use PM notifiers here
instead, but I'll dig in the archives to see why secretmem isn't using that.

> A similar abstraction would make sense (I remember a discussion about having
> secretmem functionality in guest_memfd, would that help?), but the question
> is "which" memory you want to unmap the direct map of, and how the driver
> became "owner" of that memory such that it would really be allowed to mess
> with the directmap.
Quentin Perret March 4, 2024, 1:10 p.m. UTC | #11
On Friday 23 Feb 2024 at 16:37:23 (-0800), Elliot Berman wrote:
> On Thu, Feb 22, 2024 at 11:09:40PM -0800, Christoph Hellwig wrote:
> > On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote:
> > > Firmware and hypervisor drivers can donate system heap memory to their
> > > respective firmware/hypervisor entities. Those drivers should unmap the
> > > pages from the kernel's logical map before doing so.
> > > 
> > > Export can_set_direct_map, set_direct_map_invalid_noflush, and
> > > set_direct_map_default_noflush.
> > 
> > Err, not they should not.  And not using such super low-level interfaces
> > from modular code.
> 
> Hi Cristoph,
>  
> We've observed a few times that Linux can unintentionally access a page
> we've unmapped from host's stage 2 page table via an unaligned load from
> an adjacent page. The stage 2 is managed by Gunyah. There are few
> scenarios where even though we allocate and own a page from buddy,
> someone else could try to access the page without going through the
> hypervisor driver. One such instance we know about is
> load_unaligned_zeropad() via pathlookup_at() [1].
>  
> load_unaligned_zeropad() could be called near the end of a page. If the
> next page isn't mapped by the kernel in the stage one page tables, then
> the access from to the unmapped page from load_unaligned_zeropad() will
> land in __do_kernel_fault(), call fixup_exception(), and fill the
> remainder of the load with zeroes. If the page in question is mapped in
> stage 1 but was unmapped from stage 2, then the access lands back in
> Linux in do_sea(), leading to a panic().
>  
> Our preference would be to add fixup_exception() to S2 PTW errors for
> two reasons:
> 1. It's cheaper to do performance wise: we've already manipulated S2
>    page table and prevent intentional access to the page because
>    pKVM/Gunyah drivers know that access to the page has been lost.
> 2. Page-granular S1 mappings only happen on arm64 with rodata=full.
>  
> In an off-list discussion with the Android pkvm folks, their preference
> was to have the pages unmapped from stage 1. I've gone with that
> approach to get started but welcome discussion on the best approach.
>  
> The Android (downstream) implementation of arm64 pkvm is currently
> implementing a hack where s2 ptw faults are given back to the host as s1
> ptw faults (i.e. __do_kernel_fault() gets called and not do_sea()) --
> allowing the kernel to fixup the exception.
>  
> arm64 pKVM will also face this issue when implementing guest_memfd or
> when donating more memory to the hyp for s2 page tables, etc. As far as
> I can tell, this isn't an issue for arm64 pKVM today because memory
> isn't being dynamically donated to the hypervisor.

FWIW pKVM already donates memory dynamically to the hypervisor, to store
e.g. guest VM metadata and page-tables, and we've never seen that
problem as far as I can recall.

A key difference is that pKVM injects a data abort back into the kernel
in case of a stage-2 fault, so the whole EXTABLE trick/hack in
load_unaligned_zeropad() should work fine out of the box.

As discussed offline, Gunyah injecting an SEA into the kernel is
questionable, but I understand that the architecture is a bit lacking in
this department, and that's probably the next best thing.

Could the Gunyah driver allocate from a CMA region instead? That would
surely simplify unmapping from EL1 stage-1 (similar to how drivers
usually donate memory to TZ).

Thanks,
Quentin
Elliot Berman March 4, 2024, 11:37 p.m. UTC | #12
On Mon, Mar 04, 2024 at 01:10:48PM +0000, Quentin Perret wrote:
> On Friday 23 Feb 2024 at 16:37:23 (-0800), Elliot Berman wrote:
> > On Thu, Feb 22, 2024 at 11:09:40PM -0800, Christoph Hellwig wrote:
> > > On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote:
> > > > Firmware and hypervisor drivers can donate system heap memory to their
> > > > respective firmware/hypervisor entities. Those drivers should unmap the
> > > > pages from the kernel's logical map before doing so.
> > > > 
> > > > Export can_set_direct_map, set_direct_map_invalid_noflush, and
> > > > set_direct_map_default_noflush.
> > > 
> > > Err, not they should not.  And not using such super low-level interfaces
> > > from modular code.
> > 
> > Hi Cristoph,
> >  
> > We've observed a few times that Linux can unintentionally access a page
> > we've unmapped from host's stage 2 page table via an unaligned load from
> > an adjacent page. The stage 2 is managed by Gunyah. There are few
> > scenarios where even though we allocate and own a page from buddy,
> > someone else could try to access the page without going through the
> > hypervisor driver. One such instance we know about is
> > load_unaligned_zeropad() via pathlookup_at() [1].
> >  
> > load_unaligned_zeropad() could be called near the end of a page. If the
> > next page isn't mapped by the kernel in the stage one page tables, then
> > the access from to the unmapped page from load_unaligned_zeropad() will
> > land in __do_kernel_fault(), call fixup_exception(), and fill the
> > remainder of the load with zeroes. If the page in question is mapped in
> > stage 1 but was unmapped from stage 2, then the access lands back in
> > Linux in do_sea(), leading to a panic().
> >  
> > Our preference would be to add fixup_exception() to S2 PTW errors for
> > two reasons:
> > 1. It's cheaper to do performance wise: we've already manipulated S2
> >    page table and prevent intentional access to the page because
> >    pKVM/Gunyah drivers know that access to the page has been lost.
> > 2. Page-granular S1 mappings only happen on arm64 with rodata=full.
> >  
> > In an off-list discussion with the Android pkvm folks, their preference
> > was to have the pages unmapped from stage 1. I've gone with that
> > approach to get started but welcome discussion on the best approach.
> >  
> > The Android (downstream) implementation of arm64 pkvm is currently
> > implementing a hack where s2 ptw faults are given back to the host as s1
> > ptw faults (i.e. __do_kernel_fault() gets called and not do_sea()) --
> > allowing the kernel to fixup the exception.
> >  
> > arm64 pKVM will also face this issue when implementing guest_memfd or
> > when donating more memory to the hyp for s2 page tables, etc. As far as
> > I can tell, this isn't an issue for arm64 pKVM today because memory
> > isn't being dynamically donated to the hypervisor.
> 
> FWIW pKVM already donates memory dynamically to the hypervisor, to store
> e.g. guest VM metadata and page-tables, and we've never seen that
> problem as far as I can recall.
> 
> A key difference is that pKVM injects a data abort back into the kernel
> in case of a stage-2 fault, so the whole EXTABLE trick/hack in
> load_unaligned_zeropad() should work fine out of the box.
> 
> As discussed offline, Gunyah injecting an SEA into the kernel is
> questionable, but I understand that the architecture is a bit lacking in
> this department, and that's probably the next best thing.
>
> Could the Gunyah driver allocate from a CMA region instead? That would
> surely simplify unmapping from EL1 stage-1 (similar to how drivers
> usually donate memory to TZ).

In my opinion, CMA is overly restrictive because we'd have to define the
region up front, and we don't know how much memory the virtual machines
the user will want to launch will need.

Thanks,
Elliot
Pavan Kondeti March 5, 2024, 10:53 a.m. UTC | #13
On Thu, Feb 22, 2024 at 03:16:24PM -0800, Elliot Berman wrote:
> Gunyah is an open-source Type-1 hypervisor developed by Qualcomm. It
> does not depend on any lower-privileged OS/kernel code for its core
> functionality. This increases its security and can support a smaller
> trusted computing based when compared to Type-2 hypervisors.

%s/based/base

> 
> Add documentation describing the Gunyah hypervisor and the main
> components of the Gunyah hypervisor which are of interest to Linux
> virtualization development.
> 
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
> +- Virtual platform:
> +
> +  Architectural devices such as interrupt controllers and CPU timers are
> +  directly provided by the hypervisor as well as core virtual platform devices
> +  and system APIs such as ARM PSCI.
> +
> +- Device Virtualization:
> +
> +  Para-virtualization of devices is supported using inter-VM communication and
> +  virtio transport support. Select stage 2 faults by virtual machines that use

%s/Select/Selected

> +  proxy-scheduled vCPUs can be handled directly by Linux to provide Type-2
> +  hypervisor style on-demand paging and/or device emulation.
> +


The doc patch looks good to me.

Thanks,
Pavan
Quentin Perret March 5, 2024, 3:30 p.m. UTC | #14
On Monday 04 Mar 2024 at 15:37:41 (-0800), Elliot Berman wrote:
> On Mon, Mar 04, 2024 at 01:10:48PM +0000, Quentin Perret wrote:
> > On Friday 23 Feb 2024 at 16:37:23 (-0800), Elliot Berman wrote:
> > > On Thu, Feb 22, 2024 at 11:09:40PM -0800, Christoph Hellwig wrote:
> > > > On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote:
> > > > > Firmware and hypervisor drivers can donate system heap memory to their
> > > > > respective firmware/hypervisor entities. Those drivers should unmap the
> > > > > pages from the kernel's logical map before doing so.
> > > > > 
> > > > > Export can_set_direct_map, set_direct_map_invalid_noflush, and
> > > > > set_direct_map_default_noflush.
> > > > 
> > > > Err, not they should not.  And not using such super low-level interfaces
> > > > from modular code.
> > > 
> > > Hi Christoph,
> > >  
> > > We've observed a few times that Linux can unintentionally access a page
> > > we've unmapped from host's stage 2 page table via an unaligned load from
> > > an adjacent page. [...]
> >
> > Could the Gunyah driver allocate from a CMA region instead? That would
> > surely simplify unmapping from EL1 stage-1 (similar to how drivers
> > usually donate memory to TZ).
> 
> In my opinion, CMA is overly restrictive because we'd have to define the
> region up front, and we don't know how much memory the virtual machines
> the user will want to launch will need.

I was thinking of using CMA to allocate pages needed to store guest
metadata and such at EL2, but not to back the actual guest pages
themselves. That still means overallocating somehow, but that should
hopefully be much smaller and be less of a problem?

For the actual guest pages, the gunyah variant of guestmem will have to
unmap the pages from the direct map itself, but I'd be personally happy
with making that part non-modular to avoid the issue Christoph and
others have raised.

Thanks,
Quentin
Elliot Berman March 5, 2024, 8:26 p.m. UTC | #15
On Tue, Mar 05, 2024 at 03:30:58PM +0000, Quentin Perret wrote:
> On Monday 04 Mar 2024 at 15:37:41 (-0800), Elliot Berman wrote:
> > On Mon, Mar 04, 2024 at 01:10:48PM +0000, Quentin Perret wrote:
> > > [...]
> > 
> > In my opinion, CMA is overly restrictive because we'd have to define the
> > region up front, and we don't know how much memory the virtual machines
> > the user will want to launch will need.
> 
> I was thinking of using CMA to allocate pages needed to store guest
> metadata and such at EL2, but not to back the actual guest pages
> themselves. That still means overallocating somehow, but that should
> hopefully be much smaller and be less of a problem?

Ah, I understand the context now. Yes, we might need to use a CMA region
when donating memory to Gunyah if we have to ensure the memory is
unmapped from stage 1, since we wouldn't use guest_memfd for that.

> 
> For the actual guest pages, the gunyah variant of guestmem will have to
> unmap the pages from the direct map itself, but I'd be personally happy

I still disagree that this is a Gunyah-specific problem. As far as we
can tell, Arm doesn't specify how EL2 can tell EL1 that its S2 page
tables couldn't give a valid translation of the IPA from stage 1. IMO,
downstream/Android pKVM is violating the spec for ESR_EL1 by using the
S1PTW bit (which is res0 for everyone except EL2 [1]), and this means
that guests need to be pKVM-enlightened. If we are adding pKVM
enlightenment in the exception handlers, can we add Gunyah enlightenment
to handle the same?

Thanks,
Elliot 

[1]: https://developer.arm.com/documentation/ddi0601/2023-12/AArch64-Registers/ESR-EL1--Exception-Syndrome-Register--EL1-?lang=en#fieldset_0-24_0_16-7_7
Quentin Perret March 6, 2024, 12:05 p.m. UTC | #16
On Tuesday 05 Mar 2024 at 12:26:59 (-0800), Elliot Berman wrote:
> I still disagree that this is a Gunyah-specific problem. As far as we
> can tell, Arm doesn't specify how EL2 can tell EL1 that its S2 page
> tables couldn't give a valid translation of the IPA from stage 1. IMO,
> downstream/Android pKVM is violating the spec for ESR_EL1 by using the
> S1PTW bit (which is res0 for everyone except EL2 [1]), and this means
> that guests need to be pKVM-enlightened.

Not really, in pKVM we have a very clear distinction between host Linux
and guests, and only the host needs to be enlightened. But luckily,
since pKVM is part of Linux, this is pretty much an internal kernel
thing, so we're very flexible and if the S1PTW trick ever conflicts
with something else (e.g. NV) we can fairly easily switch to another
approach. We can tolerate non-architectural tricks like that between
pKVM and host Linux because that is not ABI, but we certainly can't do
that for guests.

> If we are adding pKVM
> enlightment in the exception handlers, can we add Gunyah enlightment to
> handle the same?

If you mean extending the Linux SEA handler so it does what Gunyah
wants, then I'm personally not supportive of that idea since the
'contract' between Linux and Gunyah _is_ the architecture.

The only ways I could see Gunyah delegate stage-2 fault handling to
Linux cleanly is:

 - either talk to Arm to introduce a new ESR specifically for this,
   which doesn't sound entirely crazy to me;

 - or have Gunyah and Linux negotiate in software the location of the
   handlers. That probably means SDEI or equivalent, which is a can of
   worms in itself I presume, and I'm not sure how feasible it would be
   for this handler to live in the Gunyah driver (that too probably
   requires exporting kernel symbols we don't want to export).

Thanks,
Quentin
Srivatsa Vaddagiri March 7, 2024, 3:37 p.m. UTC | #17
* Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:24]:

> Gunyah is an open-source Type-1 hypervisor developed by Qualcomm. It
> does not depend on any lower-privileged OS/kernel code for its core
> functionality. This increases its security and can support a smaller
> trusted computing based when compared to Type-2 hypervisors.
> 
> Add documentation describing the Gunyah hypervisor and the main
> components of the Gunyah hypervisor which are of interest to Linux
> virtualization development.
> 
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>

LGTM.

Reviewed-by: Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>

- vatsa
Srivatsa Vaddagiri March 7, 2024, 3:38 p.m. UTC | #18
* Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:28]:

> Add driver to detect when running under Gunyah. It performs basic
> identification hypercall and populates the platform bus for resource
> manager to probe.
> 
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>

LGTM

Reviewed-by: Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>

- vatsa
Srivatsa Vaddagiri March 7, 2024, 3:38 p.m. UTC | #19
* Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:30]:

> The resource manager is a special virtual machine which is always
> running on a Gunyah system. It provides APIs for creating and destroying
> VMs, secure memory management, sharing/lending of memory between VMs,
> and setup of inter-VM communication. Calls to the resource manager are
> made via message queues.
> 
> This patch implements the basic probing and RPC mechanism to make those
> API calls. Request/response calls can be made with gh_rm_call.
> Drivers can also register for notifications pushed by RM via
> gh_rm_register_notifier.
> 
> Specific API calls that resource manager supports will be implemented in
> subsequent patches.
> 
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>

Left a minor comment below. LGTM otherwise.

Reviewed-by: Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>

> +static irqreturn_t gunyah_rm_rx(int irq, void *data)
> +{
> +	enum gunyah_error gunyah_error;
> +	struct gunyah_rm_rpc_hdr *hdr;
> +	struct gunyah_rm *rm = data;
> +	void *msg = &rm->recv_msg[0];
> +	size_t len;
> +	bool ready;
> +
> +	do {
> +		gunyah_error = gunyah_hypercall_msgq_recv(rm->rx_ghrsc.capid,
> +							  msg,
> +							  sizeof(rm->recv_msg),
> +							  &len, &ready);
> +		if (gunyah_error != GUNYAH_ERROR_OK) {
> +			if (gunyah_error != GUNYAH_ERROR_MSGQUEUE_EMPTY)
> +				dev_warn(rm->dev,
> +					 "Failed to receive data: %d\n",
> +					 gunyah_error);
> +			return IRQ_HANDLED;
> +		}
> +
> +		if (len < sizeof(*hdr)) {
> +			dev_err_ratelimited(
> +				rm->dev,
> +				"Too small message received. size=%ld\n", len);
> +			continue;

In practice we should never hit this condition, but in case we do, do you
see a reason why continue is preferred over simply breaking the loop?

> +		}
> +
> +		hdr = msg;
> +		if (hdr->api != RM_RPC_API) {
> +			dev_err(rm->dev, "Unknown RM RPC API version: %x\n",
> +				hdr->api);
> +			return IRQ_HANDLED;
> +		}
> +
> +		switch (FIELD_GET(RM_RPC_TYPE_MASK, hdr->type)) {
> +		case RM_RPC_TYPE_NOTIF:
> +			gunyah_rm_process_notif(rm, msg, len);
> +			break;
> +		case RM_RPC_TYPE_REPLY:
> +			gunyah_rm_process_reply(rm, msg, len);
> +			break;
> +		case RM_RPC_TYPE_CONTINUATION:
> +			gunyah_rm_process_cont(rm, rm->active_rx_message, msg,
> +					       len);
> +			break;
> +		default:
> +			dev_err(rm->dev,
> +				"Invalid message type (%lu) received\n",
> +				FIELD_GET(RM_RPC_TYPE_MASK, hdr->type));
> +			return IRQ_HANDLED;
> +		}
> +	} while (ready);
> +
> +	return IRQ_HANDLED;
> +}
Srivatsa Vaddagiri March 7, 2024, 3:39 p.m. UTC | #20
* Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:31]:

> Gunyah VM manager is a kernel module which exposes an interface to
> userspace to load, run, and interact with other Gunyah virtual machines.
> The interface is a character device at /dev/gunyah.
> 
> Add a basic VM manager driver. Upcoming patches will add more ioctls
> into this driver.
> 
> Reviewed-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
> Reviewed-by: Alex Elder <elder@linaro.org>
> Co-developed-by: Prakruthi Deepak Heragu <quic_pheragu@quicinc.com>
> Signed-off-by: Prakruthi Deepak Heragu <quic_pheragu@quicinc.com>
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>

LGTM

Reviewed-by: Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>

- vatsa
Srivatsa Vaddagiri March 7, 2024, 3:39 p.m. UTC | #21
* Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:32]:

> Add Gunyah Resource Manager RPC interfaces to launch an unauthenticated
> virtual machine.
> 
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>

LGTM

Reviewed-by: Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>

- vatsa
Elliot Berman March 7, 2024, 4:41 p.m. UTC | #22
On Thu, Mar 07, 2024 at 09:08:43PM +0530, Srivatsa Vaddagiri wrote:
> * Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:30]:
> 
> > [...]
> 
> Left a minor comment below. LGTM otherwise.
> 
> Reviewed-by: Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>
> 
> > +static irqreturn_t gunyah_rm_rx(int irq, void *data)
> > +{
> > +	enum gunyah_error gunyah_error;
> > +	struct gunyah_rm_rpc_hdr *hdr;
> > +	struct gunyah_rm *rm = data;
> > +	void *msg = &rm->recv_msg[0];
> > +	size_t len;
> > +	bool ready;
> > +
> > +	do {
> > +		gunyah_error = gunyah_hypercall_msgq_recv(rm->rx_ghrsc.capid,
> > +							  msg,
> > +							  sizeof(rm->recv_msg),
> > +							  &len, &ready);
> > +		if (gunyah_error != GUNYAH_ERROR_OK) {
> > +			if (gunyah_error != GUNYAH_ERROR_MSGQUEUE_EMPTY)
> > +				dev_warn(rm->dev,
> > +					 "Failed to receive data: %d\n",
> > +					 gunyah_error);
> > +			return IRQ_HANDLED;
> > +		}
> > +
> > +		if (len < sizeof(*hdr)) {
> > +			dev_err_ratelimited(
> > +				rm->dev,
> > +				"Too small message received. size=%ld\n", len);
> > +			continue;
> 
> In practice we should never hit this condition, but in case we do, do you
> see a reason why continue is preferred over simply breaking the loop?
> 

There might be more messages to read, which we would not otherwise read.
Since those messages might be parseable, I'd rather try to recover than
break communication with RM.

As you mention, we should never encounter this condition. The guard is
to avoid reading garbage values.

> > +		}
> > [...]
Elliot Berman March 8, 2024, 7:55 p.m. UTC | #23
On Wed, Mar 06, 2024 at 12:05:38PM +0000, Quentin Perret wrote:
> On Tuesday 05 Mar 2024 at 12:26:59 (-0800), Elliot Berman wrote:
> [...]
> 
> > If we are adding pKVM
> > enlightenment in the exception handlers, can we add Gunyah enlightenment to
> > handle the same?
> 
> If you mean extending the Linux SEA handler so it does what Gunyah
> wants, then I'm personally not supportive of that idea since the
> 'contract' between Linux and Gunyah _is_ the architecture.

Fair enough. We're building out more use cases where we want to allocate
memory from buddy and donate it to some entity which unmaps it from
Linux (some entity = Gunyah or Qualcomm firmware). Video DRM is an
example we're working on. I imagine OP-TEE might eventually have
use-cases as well since pKVM is doing same. David expressed concerns
about exporting the direct unmap functions. What kind of
framework/restrictions do we want to have instead? I don't think making
drivers like Gunyah a builtin-only module [1] (even a refactored/small
portion) is the best approach, but maybe that is what we want to do.

Thanks,
Elliot

[1]: qcom_scm_assign_mem (d/firmware/qcom/qcom_scm.ko) is an example of
a module that would have to become builtin as we upstream use cases that
lend buddy-allocated memory to firmware
Srivatsa Vaddagiri March 11, 2024, 5:38 a.m. UTC | #24
* Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:35]:

> Gunyah doesn't process the label and that makes it
> possible for userspace to create multiple resources with the same label.

I think that description conflicts with what is implemented in this patch?

int gunyah_vm_add_resource_ticket(struct gunyah_vm *ghvm,
				  struct gunyah_vm_resource_ticket *ticket)
{
	mutex_lock(&ghvm->resources_lock);
	list_for_each_entry(iter, &ghvm->resource_tickets, vm_list) {
		if (iter->resource_type == ticket->resource_type &&
		    iter->label == ticket->label) {
			ret = -EEXIST;
			goto out;
		}
	}


//

> @@ -134,6 +246,25 @@ static int gunyah_vm_start(struct gunyah_vm *ghvm)
>  	}
>  	ghvm->vm_status = GUNYAH_RM_VM_STATUS_READY;
>  
> +	ret = gunyah_rm_get_hyp_resources(ghvm->rm, ghvm->vmid, &resources);
> +	if (ret) {
> +		dev_warn(ghvm->parent,
> +			 "Failed to get hypervisor resources for VM: %d\n",
> +			 ret);
> +		goto err;
> +	}

Where do we free the memory pointed to by the 'resources' ptr?

> +
> +	for (i = 0, n = le32_to_cpu(resources->n_entries); i < n; i++) {
> +		ghrsc = gunyah_rm_alloc_resource(ghvm->rm,
> +						 &resources->entries[i]);
> +		if (!ghrsc) {
> +			ret = -ENOMEM;
> +			goto err;
> +		}
> +
> +		gunyah_vm_add_resource(ghvm, ghrsc);
> +	}
Srivatsa Vaddagiri March 11, 2024, 5:38 a.m. UTC | #25
* Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:33]:

> Add ioctl to trigger the start of a Gunyah virtual machine. Subsequent
> commits will provide memory to the virtual machine and add ability to
> interact with the resources (capabilities) of the virtual machine.
> Although start of the virtual machine can be done implicitly on the
> first vCPU run for proxy-scheduled virtual machines, there is a
> non-trivial number of calls to Gunyah: a more precise error can be given
> to userspace which calls VM_START without looking at kernel logs because
> userspace can detect that the VM start failed instead of "couldn't run
> the vCPU".
> 
> Co-developed-by: Prakruthi Deepak Heragu <quic_pheragu@quicinc.com>
> Signed-off-by: Prakruthi Deepak Heragu <quic_pheragu@quicinc.com>
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>

Minor nit below. LGTM otherwise.

Reviewed-by: Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>

> +static int gunyah_vm_start(struct gunyah_vm *ghvm)
> +{

..

> +err:
> +	/* gunyah_vm_free will handle releasing resources and reclaiming memory */

s/gunyah_vm_free/gunyah_vm_release

> +	up_write(&ghvm->status_lock);
> +	return ret;
> +}
Srivatsa Vaddagiri March 11, 2024, 5:39 a.m. UTC | #26
* Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:34]:

> When booting a Gunyah virtual machine, the host VM may gain capabilities
> to interact with resources for the guest virtual machine. Examples of
> such resources are vCPUs or message queues. To use those resources, we
> need to translate the RM response into a gunyah_resource structure which
> is useful to Linux drivers. Presently, Linux drivers need only to know
> the type of resource, the capability ID, and an interrupt.
> 
> On ARM64 systems, the interrupt reported by Gunyah is the GIC interrupt
> ID number and is always an SPI or extended SPI.
> 
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>

Minor nit below. LGTM otherwise

Reviewed-by: Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>

> +struct gunyah_resource *
> +gunyah_rm_alloc_resource(struct gunyah_rm *rm,
> +			 struct gunyah_rm_hyp_resource *hyp_resource)
> +{
> +	struct gunyah_resource *ghrsc;
> +	int ret;
> +
> +	ghrsc = kzalloc(sizeof(*ghrsc), GFP_KERNEL);
> +	if (!ghrsc)
> +		return NULL;
> +
> +	ghrsc->type = hyp_resource->type;
> +	ghrsc->capid = le64_to_cpu(hyp_resource->cap_id);
> +	ghrsc->irq = IRQ_NOTCONNECTED;
> +	ghrsc->rm_label = le32_to_cpu(hyp_resource->resource_label);
> +	if (hyp_resource->virq) {
> +		struct irq_fwspec fwspec;
> +
> +
> +		fwspec.fwnode = rm->parent_fwnode;
> +		ret = arch_gunyah_fill_irq_fwspec_params(le32_to_cpu(hyp_resource->virq), &fwspec);
> +		if (ret) {
> +			dev_err(rm->dev,
> +				"Failed to translate interrupt for resource %d label: %d: %d\n",
> +				ghrsc->type, ghrsc->rm_label, ret);

Not bailing on error here appears wrong. Can you check?

> +		}
> +
> +		ret = irq_create_fwspec_mapping(&fwspec);
> +		if (ret < 0) {
> +			dev_err(rm->dev,
> +				"Failed to allocate interrupt for resource %d label: %d: %d\n",
> +				ghrsc->type, ghrsc->rm_label, ret);
> +			kfree(ghrsc);
> +			return NULL;
> +		}
> +		ghrsc->irq = ret;
> +	}
> +
> +	return ghrsc;
> +}
Elliot Berman March 11, 2024, 5:13 p.m. UTC | #27
On Mon, Mar 11, 2024 at 11:08:06AM +0530, Srivatsa Vaddagiri wrote:
> * Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:35]:
> 
> > Gunyah doesn't process the label and that makes it
> > possible for userspace to create multiple resources with the same label.
> 
> I think that description conflicts with what is implemented in this patch?
> 
> int gunyah_vm_add_resource_ticket(struct gunyah_vm *ghvm,
> 				  struct gunyah_vm_resource_ticket *ticket)
> {
> 	mutex_lock(&ghvm->resources_lock);
> 	list_for_each_entry(iter, &ghvm->resource_tickets, vm_list) {
> 		if (iter->resource_type == ticket->resource_type &&
> 		    iter->label == ticket->label) {
> 			ret = -EEXIST;
> 			goto out;
> 		}
> 	}
> 
> 
> //
> 

It's a justification for the next sentence in the commit text:

    Resource ticket owners need to be prepared for populate to be called
    multiple times if userspace created multiple resources with the same
    label.


The VM manager can make sure that only one entity is going to receive
the resources (the check you highlighted above), but I can't make sure
that it maps to exactly one resource. We don't currently have a scenario
where we need/want multiple resources with the same label; we might have
that in the future, and I didn't want to add that restriction in common
code.

> > @@ -134,6 +246,25 @@ static int gunyah_vm_start(struct gunyah_vm *ghvm)
> >  	}
> >  	ghvm->vm_status = GUNYAH_RM_VM_STATUS_READY;
> >  
> > +	ret = gunyah_rm_get_hyp_resources(ghvm->rm, ghvm->vmid, &resources);
> > +	if (ret) {
> > +		dev_warn(ghvm->parent,
> > +			 "Failed to get hypervisor resources for VM: %d\n",
> > +			 ret);
> > +		goto err;
> > +	}
> 
> Where do we free the memory pointed to by the 'resources' ptr?
> 

Ah, I meant to add the __free(kfree) annotation. I've added it.

> > [...]
On Mon, Mar 11, 2024 at 11:09:05AM +0530, Srivatsa Vaddagiri wrote:
> * Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:34]:
> 
> > [...]
> 
> Minor nit below. LGTM otherwise
> 
> Reviewed-by: Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>
> 
> > +struct gunyah_resource *
> > +gunyah_rm_alloc_resource(struct gunyah_rm *rm,
> > +			 struct gunyah_rm_hyp_resource *hyp_resource)
> > +{
> > +	struct gunyah_resource *ghrsc;
> > +	int ret;
> > +
> > +	ghrsc = kzalloc(sizeof(*ghrsc), GFP_KERNEL);
> > +	if (!ghrsc)
> > +		return NULL;
> > +
> > +	ghrsc->type = hyp_resource->type;
> > +	ghrsc->capid = le64_to_cpu(hyp_resource->cap_id);
> > +	ghrsc->irq = IRQ_NOTCONNECTED;
> > +	ghrsc->rm_label = le32_to_cpu(hyp_resource->resource_label);
> > +	if (hyp_resource->virq) {
> > +		struct irq_fwspec fwspec;
> > +
> > +
> > +		fwspec.fwnode = rm->parent_fwnode;
> > +		ret = arch_gunyah_fill_irq_fwspec_params(le32_to_cpu(hyp_resource->virq), &fwspec);
> > +		if (ret) {
> > +			dev_err(rm->dev,
> > +				"Failed to translate interrupt for resource %d label: %d: %d\n",
> > +				ghrsc->type, ghrsc->rm_label, ret);
> 
> Not bailing on error here appears wrong. Can you check?
> 

Ah, yes. I'll return ghrsc here. I think that's better than returning
NULL because the user of the resource might be able to cope without the
interrupt, and it gives us more helpful kernel log messages because
the higher-level VM function will complain.

> > +		}
> > +
> > +		ret = irq_create_fwspec_mapping(&fwspec);
> > +		if (ret < 0) {
> > +			dev_err(rm->dev,
> > +				"Failed to allocate interrupt for resource %d label: %d: %d\n",
> > +				ghrsc->type, ghrsc->rm_label, ret);
> > +			kfree(ghrsc);
> > +			return NULL;
> > +		}
> > +		ghrsc->irq = ret;
> > +	}
> > +
> > +	return ghrsc;
> > +}
Srivatsa Vaddagiri March 13, 2024, 9:20 a.m. UTC | #29
* Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:36]:

> Introduce a framework for Gunyah userspace to install VM functions. VM
> functions are optional interfaces to the virtual machine. vCPUs,
> ioeventfds, and irqfds are examples of such VM functions and are
> implemented in subsequent patches.
> 
> A generic framework is implemented instead of individual ioctls to
> create vCPUs, irqfds, etc., in order to simplify the VM manager core
> implementation and allow dynamic loading of VM function modules.
>
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>

Minor nit below. LGTM otherwise

Reviewed-by: Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>

> +static struct gunyah_vm_function *gunyah_vm_get_function(u32 type)
> +{
> +	struct gunyah_vm_function *fn;
> +
> +	fn = xa_load(&gunyah_vm_functions, type);
> +	if (!fn) {
> +		request_module("ghfunc:%d", type);

s/%d/%u

> +
> +		fn = xa_load(&gunyah_vm_functions, type);
> +	}
> +
> +	if (!fn || !try_module_get(fn->mod))
> +		fn = ERR_PTR(-ENOENT);
> +
> +	return fn;
> +}
Srivatsa Vaddagiri March 13, 2024, 9:21 a.m. UTC | #30
* Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:37]:

> Add hypercall to donate CPU time to a vCPU.
> 
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>

LGTM

Reviewed-by: Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>

- vatsa
Srivatsa Vaddagiri March 13, 2024, 9:21 a.m. UTC | #31
* Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:38]:

> Gunyah allows vCPUs that are configured as proxy-scheduled to be scheduled by
> another virtual machine (host) that holds capabilities to those vCPUs with
> suitable rights.
> 
> Gunyah also supports configuring regions of a proxy-scheduled VM's address
> space to be virtualized by the host VM. This permits a host VMM to emulate MMIO
> devices in the proxy-scheduled VM.
> 
> vCPUs are presented to the host as a Gunyah resource and represented to
> userspace as a Gunyah VM function.
> 
> Creating the vcpu function on the VM will create a file descriptor that:
>  - can handle an ioctl to run the vCPU. When called, Gunyah will directly
>    context-switch to the selected vCPU and run it until one of the following
>    events occurs:
>     * the host vcpu's time slice ends
>     * the host vcpu receives an interrupt or would have been pre-empted
>       by the hypervisor
>     * a fault occurs in the proxy-scheduled vcpu
>     * a power management event, such as idle or cpu-off call in the vcpu
>  - can be mmap'd to share the gunyah_vcpu_run structure with userspace. This
>    allows the vcpu_run result codes to be accessed, and for arguments to
>    vcpu_run to be passed, e.g. for resuming the vcpu when handling certain fault
>    and exit cases.
> 
> Co-developed-by: Prakruthi Deepak Heragu <quic_pheragu@quicinc.com>
> Signed-off-by: Prakruthi Deepak Heragu <quic_pheragu@quicinc.com>
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>

LGTM

Reviewed-by: Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>

- vatsa
Srivatsa Vaddagiri March 13, 2024, 3:35 p.m. UTC | #32
* Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:39]:

> Three hypercalls are needed to support demand paging.
> To create page mappings for a virtual machine's address space, memory
> must first be moved to a memory extent that is allowed to be mapped into that
> address space. Memory extents are Gunyah's implementation of access
> control. Once the memory is moved to the proper memory extent, the
> memory can be mapped into the VM's address space. Implement the
> bindings to perform those hypercalls.
> 
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>

Reviewed-by: Srivatsa Vaddagiri <quic_svaddagiri@quicinc.com>
Srivatsa Vaddagiri March 14, 2024, 2:02 p.m. UTC | #33
* Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:40]:

> In a Gunyah hypervisor system using the Gunyah Resource Manager, the
> "standard" unit of donating, lending and sharing memory is called a
> memory parcel (memparcel).  A memparcel is an abstraction used by the
> resource manager for securely managing donating, lending and sharing
> memory, which may be physically and virtually fragmented, without
> dealing directly with physical memory addresses.
> 
> Memparcels are created and managed through the RM RPC functions for
> lending, sharing and reclaiming memory from VMs.
> 
> When creating a new VM the initial VM memory containing the VM image and
> the VM's device tree blob must be provided as a memparcel. The memparcel
> must be created using the RM RPC for lending and mapping the memory to
> the VM.
> 
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>

Reviewed-by: Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>

- vatsa
Srivatsa Vaddagiri March 14, 2024, 2:02 p.m. UTC | #34
* Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:52]:

> The initial context of the primary vCPU can be initialized by
> performing RM RPC calls.
> 
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>

Reviewed-by: Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>

- vatsa
Srivatsa Vaddagiri March 14, 2024, 2:03 p.m. UTC | #35
* Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:53]:

> RM provides APIs to fill the boot context (the initial registers) upon
> starting the vCPU. Most importantly, this allows userspace to set the
> initial PC for the primary vCPU when the VM starts to run.
> 
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>

Reviewed-by: Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>
Pavan Kondeti April 5, 2024, 3:10 a.m. UTC | #36
On Thu, Feb 22, 2024 at 03:16:34PM -0800, Elliot Berman wrote:
> When booting a Gunyah virtual machine, the host VM may gain capabilities
> to interact with resources for the guest virtual machine. Examples of
> such resources are vCPUs or message queues. To use those resources, we
> need to translate the RM response into a gunyah_resource structure which
> is useful to Linux drivers. Presently, Linux drivers need only to know
> the type of resource, the capability ID, and an interrupt.
> 
> On ARM64 systems, the interrupt reported by Gunyah is the GIC interrupt
> ID number and always a SPI or extended SPI.
> 
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
> ---
>  arch/arm64/include/asm/gunyah.h | 36 ++++++++++++++++++++++
>  drivers/virt/gunyah/rsc_mgr.c   | 67 +++++++++++++++++++++++++++++++++++++++++
>  drivers/virt/gunyah/rsc_mgr.h   |  5 +++
>  include/linux/gunyah.h          |  2 ++
>  4 files changed, 110 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/gunyah.h b/arch/arm64/include/asm/gunyah.h
> new file mode 100644
> index 0000000000000..0cd3debe22b64
> --- /dev/null
> +++ b/arch/arm64/include/asm/gunyah.h
> @@ -0,0 +1,36 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2022-2023 Qualcomm Innovation Center, Inc. All rights reserved.
> + */
> +#ifndef _ASM_GUNYAH_H
> +#define _ASM_GUNYAH_H
> +
> +#include <linux/irq.h>
> +#include <linux/irqdomain.h>
> +
> +static inline int arch_gunyah_fill_irq_fwspec_params(u32 virq,
> +						 struct irq_fwspec *fwspec)
> +{
> +	/* Assume that Gunyah gave us an SPI or ESPI; defensively check it */
> +	if (WARN(virq < 32, "Unexpected virq: %d\n", virq)) {
> +		return -EINVAL;
> +	} else if (virq <= 1019) {
> +		fwspec->param_count = 3;
> +		fwspec->param[0] = 0; /* GIC_SPI */
> +		fwspec->param[1] = virq - 32; /* virq 32 -> SPI 0 */
> +		fwspec->param[2] = IRQ_TYPE_EDGE_RISING;
> +	} else if (WARN(virq < 4096, "Unexpected virq: %d\n", virq)) {
> +		return -EINVAL;
> +	} else if (virq < 5120) {
> +		fwspec->param_count = 3;
> +		fwspec->param[0] = 2; /* GIC_ESPI */
> +		fwspec->param[1] = virq - 4096; /* virq 4096 -> ESPI 0 */
> +		fwspec->param[2] = IRQ_TYPE_EDGE_RISING;
> +	} else {
> +		WARN(1, "Unexpected virq: %d\n", virq);
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +

__get_intid_range() in gic-v3 driver looks more pleasing. Other than
that the logic for the translation looks good to me.

Thanks,
Pavan
Elliot Berman April 5, 2024, 3:18 p.m. UTC | #37
On Fri, Apr 05, 2024 at 08:40:40AM +0530, Pavan Kondeti wrote:
> On Thu, Feb 22, 2024 at 03:16:34PM -0800, Elliot Berman wrote:
> > When booting a Gunyah virtual machine, the host VM may gain capabilities
> > to interact with resources for the guest virtual machine. Examples of
> > such resources are vCPUs or message queues. To use those resources, we
> > need to translate the RM response into a gunyah_resource structure which
> > is useful to Linux drivers. Presently, Linux drivers need only to know
> > the type of resource, the capability ID, and an interrupt.
> > 
> > On ARM64 systems, the interrupt reported by Gunyah is the GIC interrupt
> > ID number and always a SPI or extended SPI.
> > 
> > Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
> > ---
> >  arch/arm64/include/asm/gunyah.h | 36 ++++++++++++++++++++++
> >  drivers/virt/gunyah/rsc_mgr.c   | 67 +++++++++++++++++++++++++++++++++++++++++
> >  drivers/virt/gunyah/rsc_mgr.h   |  5 +++
> >  include/linux/gunyah.h          |  2 ++
> >  4 files changed, 110 insertions(+)
> > 
> > diff --git a/arch/arm64/include/asm/gunyah.h b/arch/arm64/include/asm/gunyah.h
> > new file mode 100644
> > index 0000000000000..0cd3debe22b64
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/gunyah.h
> > @@ -0,0 +1,36 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright (c) 2022-2023 Qualcomm Innovation Center, Inc. All rights reserved.
> > + */
> > +#ifndef _ASM_GUNYAH_H
> > +#define _ASM_GUNYAH_H
> > +
> > +#include <linux/irq.h>
> > +#include <linux/irqdomain.h>
> > +
> > +static inline int arch_gunyah_fill_irq_fwspec_params(u32 virq,
> > +						 struct irq_fwspec *fwspec)
> > +{
> > +	/* Assume that Gunyah gave us an SPI or ESPI; defensively check it */
> > +	if (WARN(virq < 32, "Unexpected virq: %d\n", virq)) {
> > +		return -EINVAL;
> > +	} else if (virq <= 1019) {
> > +		fwspec->param_count = 3;
> > +		fwspec->param[0] = 0; /* GIC_SPI */
> > +		fwspec->param[1] = virq - 32; /* virq 32 -> SPI 0 */
> > +		fwspec->param[2] = IRQ_TYPE_EDGE_RISING;
> > +	} else if (WARN(virq < 4096, "Unexpected virq: %d\n", virq)) {
> > +		return -EINVAL;
> > +	} else if (virq < 5120) {
> > +		fwspec->param_count = 3;
> > +		fwspec->param[0] = 2; /* GIC_ESPI */
> > +		fwspec->param[1] = virq - 4096; /* virq 4096 -> ESPI 0 */
> > +		fwspec->param[2] = IRQ_TYPE_EDGE_RISING;
> > +	} else {
> > +		WARN(1, "Unexpected virq: %d\n", virq);
> > +		return -EINVAL;
> > +	}
> > +	return 0;
> > +}
> > +
> 
> __get_intid_range() in gic-v3 driver looks more pleasing. Other than
> that the logic for the translation looks good to me.

Agreed, updated for v18.

static inline int arch_gunyah_fill_irq_fwspec_params(u32 virq,
						 struct irq_fwspec *fwspec)
{
	/* Assume that Gunyah gave us an SPI or ESPI; defensively check it */
	switch (virq) {
	case 32 ... 1019:
		fwspec->param_count = 3;
		fwspec->param[0] = 0; /* GIC_SPI */
		fwspec->param[1] = virq - 32; /* virq 32 -> SPI 0 */
		fwspec->param[2] = IRQ_TYPE_EDGE_RISING;
		break;
	case 4096 ... 5119:
		fwspec->param_count = 3;
		fwspec->param[0] = 2; /* GIC_ESPI */
		fwspec->param[1] = virq - 4096; /* virq 4096 -> ESPI 0 */
		fwspec->param[2] = IRQ_TYPE_EDGE_RISING;
		break;
	default:
		WARN(1, "Unexpected virq: %d\n", virq);
		return -EINVAL;
	}
	return 0;
}
Srivatsa Vaddagiri April 24, 2024, 9:39 a.m. UTC | #38
* Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:38]:

> +/**
> + * struct gunyah_vm_exit_info - Reason for VM exit as reported by Gunyah
> + * See Gunyah documentation for values.
> + * @type: Describes how VM exited
> + * @padding: padding bytes
> + * @reason_size: Number of bytes valid for `reason`
> + * @reason: See Gunyah documentation for interpretation. Note: these values are
> + *          not interpreted by Linux and need to be converted from little-endian
> + *          as applicable.
> + */
> +struct gunyah_vm_exit_info {
> +	__u16 type;

Pls add an enum to describe the various exit types.

> +	__u16 padding;
> +	__u32 reason_size;
> +	__u8 reason[GUNYAH_VM_MAX_EXIT_REASON_SIZE];
> +};

- vatsa
Elliot Berman April 24, 2024, 5:01 p.m. UTC | #39
On Wed, Apr 24, 2024 at 03:09:57PM +0530, Srivatsa Vaddagiri wrote:
> * Elliot Berman <quic_eberman@quicinc.com> [2024-02-22 15:16:38]:
> 
> > +/**
> > + * struct gunyah_vm_exit_info - Reason for VM exit as reported by Gunyah
> > + * See Gunyah documentation for values.
> > + * @type: Describes how VM exited
> > + * @padding: padding bytes
> > + * @reason_size: Number of bytes valid for `reason`
> > + * @reason: See Gunyah documentation for interpretation. Note: these values are
> > + *          not interpreted by Linux and need to be converted from little-endian
> > + *          as applicable.
> > + */
> > +struct gunyah_vm_exit_info {
> > +	__u16 type;
> 
> Pls add an enum to describe the various exit types.
> 

When I first added the exit_info, I thought a lot about whether to add
the enum. If we add it to the UAPI, the kernel needs to parse and
validate the values from RM. In the current scheme, I can leave it to
userspace to interpret the value from RM. I could be convinced to go
the other way, though. Currently this is less code :) Let me know your
thoughts!

> > +	__u16 padding;
> > +	__u32 reason_size;
> > +	__u8 reason[GUNYAH_VM_MAX_EXIT_REASON_SIZE];

If we start interpreting the @type, then maybe we should also start
interpreting the @reason, and that will also add debt to the kernel
driver.

Thanks,
Elliot