
[v5,0/6] Add Tegra241 (Grace) CMDQV Support (part 1/2)

Message ID cover.1712977210.git.nicolinc@nvidia.com

Message

Nicolin Chen April 13, 2024, 3:43 a.m. UTC
NVIDIA's Tegra241 (Grace) SoC has a CMDQ-Virtualization (CMDQV) hardware
unit that extends the standard ARM SMMUv3 to support multiple command
queues with virtualization capabilities. Though this is similar to the
ECMDQ in SMMU v3.3, CMDQV additionally provides Virtual Interfaces
(VINTFs), allowing VMs to have their own VINTFs and Virtual Command
Queues (VCMDQs). Compared to the standard SMMUv3 CMDQ, the VCMDQs can
only execute a limited set of commands, mainly invalidation commands,
when they are exclusively used by the VMs.
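
For orientation, below is a minimal sketch of how a driver might model
this topology. It assumes the usual SMMUv3 driver context (kernel types
and struct arm_smmu_cmdq); the structure and field names are
illustrative assumptions, not the definitions used by this series:

/* Hypothetical sketch of the CMDQV topology described above. */
struct example_vcmdq {
        u16                     idx;     /* global VCMDQ index in the CMDQV unit */
        u16                     lidx;    /* logical index within the owning VINTF */
        void __iomem            *page0;  /* per-queue MMIO control page */
        struct arm_smmu_cmdq    cmdq;    /* reuses the SMMUv3 command queue code */
};

struct example_vintf {
        u16                     idx;      /* VINTF index; VINTF0 is owned by the host */
        bool                    hyp_own;  /* full command set vs. guest-restricted set */
        void __iomem            *base;    /* VINTF MMIO registers */
        struct example_vcmdq    **vcmdqs; /* VCMDQs allocated to this interface */
};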

Thus, the support is split into two patch series: the basic in-kernel
support as part 1, and the user-space support as part 2.

The in-kernel support detects/configures the CMDQV hardware and then
allocates a VINTF with some VCMDQs for the kernel/hypervisor to use.
Like ECMDQ, CMDQV allows the kernel to use multiple VCMDQs, giving a
limited performance improvement: up to a 20% reduction in TLB
invalidation time was measured with a multi-threaded DMA unmap
benchmark, compared to a single queue.
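
As a rough illustration of how multiple VCMDQs reduce contention, the
issuing path could hand out a queue per CPU. This is only a sketch with
assumed names (example_cmdqv_get_cmdq and the example_vintf layout
above), not the selection logic in the patches:

/*
 * Illustrative only: spread issuing CPUs across the VINTF's VCMDQs so
 * that concurrent invalidations do not all serialize on a single queue.
 */
static struct arm_smmu_cmdq *
example_cmdqv_get_cmdq(struct example_vintf *vintf, unsigned int nr_vcmdqs)
{
        unsigned int qidx = raw_smp_processor_id() % nr_vcmdqs;

        return &vintf->vcmdqs[qidx]->cmdq;
}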

The user-space support provides uAPIs (via IOMMUFD) for hypervisors
in user space to pass VCMDQs through to VMs, allowing these VMs to
access the VCMDQs directly without trapping, i.e. no VM exits. This
gives huge performance improvements: 70% to 90% reductions in TLB
invalidation time were measured by various DMA unmap tests running in
a guest OS, compared to a nested SMMU CMDQ (with trapping).

This is the part-1 series:
 - Preparatory changes to share the existing SMMU functions
 - A new CMDQV driver, plus extensions to the SMMUv3 driver to interact
   with the new driver
 - Limiting the commands available to a guest-owned VINTF (see the
   sketch below)
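
To give an idea of what the last item means, a guest-owned VINTF would
only be allowed to issue invalidation-type commands on its VCMDQs. The
check below is a hedged sketch (the opcode list and the helper name are
assumptions), not the exact filter implemented by the patch:

/*
 * Hedged sketch: a hypervisor-owned VINTF may issue the full command
 * set, while a guest-owned VINTF is limited to invalidation commands.
 */
static bool example_vcmdq_supports_cmd(struct example_vintf *vintf, u8 opcode)
{
        if (vintf->hyp_own)
                return true;

        switch (opcode) {
        case CMDQ_OP_TLBI_NH_ASID:
        case CMDQ_OP_TLBI_NH_VA:
        case CMDQ_OP_ATC_INV:
                return true;    /* invalidations are safe on a guest queue */
        default:
                return false;   /* e.g. CFGI_* must go through the main CMDQ */
        }
}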

It's available on Github:
https://github.com/nicolinc/iommufd/commits/vcmdq_in_kernel-v5

And the part-2 RFC series is also prepared and will be sent soon:
https://github.com/nicolinc/iommufd/commits/vcmdq_user_space-rfc-v1/

Note that this in-kernel support isn't confined to host kernels running
on Grace-powered servers; it is also used by guest kernels running in
VMs virtualized on those servers. So those VMs should install the
driver, ideally before part 2 is merged, so that later the servers only
need to upgrade their host kernels without bothering the VMs.

Thank you!

Changelog
v5:
 * Improved print/mmio helpers
 * Added proper register reset routines
 * Reorganized init/deinit functions to share with VIOMMU callbacks in
   the upcoming part-2 user-space series (RFC)
v4:
 https://lore.kernel.org/all/cover.1711690673.git.nicolinc@nvidia.com/
 * Rebased on v6.9-rc1
 * Renamed to "tegra241-cmdqv", following other Grace kernel patches
 * Added a set of print and MMIO helpers
 * Reworked the guest limitation patch
v3:
 https://lore.kernel.org/all/20211119071959.16706-1-nicolinc@nvidia.com/
 * Dropped VMID and mdev patches to redesign later based on IOMMUFD
 * Separated HYP_OWN part for guest support into a new patch
 * Added new preparatory changes
v2:
 https://lore.kernel.org/all/20210831025923.15812-1-nicolinc@nvidia.com/
 * Added mdev interface support for hypervisor and VMs
 * Added preparatory changes for mdev interface implementation
 * PATCH-12 Changed ->issue_cmdlist() to ->get_cmdq() for a better
   integration with recently merged ECMDQ-related changes
v1:
 https://lore.kernel.org/all/20210723193140.9690-1-nicolinc@nvidia.com/

Nate Watterson (1):
  iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace)
    CMDQV

Nicolin Chen (5):
  iommu/arm-smmu-v3: Add CS_NONE quirk
  iommu/arm-smmu-v3: Make arm_smmu_cmdq_init reusable
  iommu/arm-smmu-v3: Make __arm_smmu_cmdq_skip_err reusable
  iommu/arm-smmu-v3: Pass in cmdq pointer to
    arm_smmu_cmdq_issue_cmdlist()
  iommu/tegra241-cmdqv: Limit CMDs for guest owned VINTF

 MAINTAINERS                                   |   1 +
 drivers/iommu/Kconfig                         |  12 +
 drivers/iommu/arm/arm-smmu-v3/Makefile        |   1 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  74 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |  46 +
 .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c    | 845 ++++++++++++++++++
 6 files changed, 952 insertions(+), 27 deletions(-)
 create mode 100644 drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c

Comments

Jason Gunthorpe April 15, 2024, 5:14 p.m. UTC | #1
On Fri, Apr 12, 2024 at 08:43:48PM -0700, Nicolin Chen wrote:

> The user-space support provides uAPIs (via IOMMUFD) for hypervisors
> in user space to pass VCMDQs through to VMs, allowing these VMs to
> access the VCMDQs directly without trapping, i.e. no VM exits. This
> gives huge performance improvements: 70% to 90% reductions in TLB
> invalidation time were measured by various DMA unmap tests running in
> a guest OS, compared to a nested SMMU CMDQ (with trapping).

So everyone is on the same page, this is the primary point of this
series. The huge speed-up of in-VM performance is necessary for the
workloads this chip is expected to be running. This series is unique
from all the rest because it runs inside a VM, often in the form of a
distro release.

It doesn't need the other series or its own part 2, as it entirely
stands alone on bare-metal hardware or on top of commercial VM cloud
instances running who-knows-what in their hypervisors.

The other parts are substantially about enabling QEMU and the open
ecosystem to have fully functional vSMMUv3 virtualization.

Jason
Shameerali Kolothum Thodi April 17, 2024, 8:01 a.m. UTC | #2
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, April 15, 2024 6:14 PM
> To: Nicolin Chen <nicolinc@nvidia.com>
> Cc: will@kernel.org; robin.murphy@arm.com; joro@8bytes.org;
> thierry.reding@gmail.com; vdumpa@nvidia.com; jonathanh@nvidia.com;
> linux-kernel@vger.kernel.org; iommu@lists.linux.dev; linux-arm-
> kernel@lists.infradead.org; linux-tegra@vger.kernel.org; Jerry Snitselaar
> <jsnitsel@redhat.com>
> Subject: Re: [PATCH v5 0/6] Add Tegra241 (Grace) CMDQV Support (part 1/2)
> 
> On Fri, Apr 12, 2024 at 08:43:48PM -0700, Nicolin Chen wrote:
> 
> > The user-space support provides uAPIs (via IOMMUFD) for hypervisors
> > in user space to pass VCMDQs through to VMs, allowing these VMs to
> > access the VCMDQs directly without trapping, i.e. no VM exits. This
> > gives huge performance improvements: 70% to 90% reductions in TLB
> > invalidation time were measured by various DMA unmap tests running
> > in a guest OS, compared to a nested SMMU CMDQ (with trapping).
> 
> So everyone is on the same page, this is the primary point of this
> series. The huge speed-up of in-VM performance is necessary for the
> workloads this chip is expected to be running. This series is unique
> from all the rest because it runs inside a VM, often in the form of a
> distro release.
> 
> It doesn't need the other series or its own part 2, as it entirely
> stands alone on bare-metal hardware or on top of commercial VM cloud
> instances running who-knows-what in their hypervisors.
> 
> The other parts are substantially about enabling QEMU and the open
> ecosystem to have fully functional vSMMUv3 virtualization.

Hi,

We do have plans to revive the SMMUv3 ECMDQ series posted a while back[0],
and looking at this series, I am just wondering whether it makes sense to have
a similar one for ECMDQ as well. I see that the NVIDIA VCMDQ has a special bit
to restrict the commands that can be issued from user space. If we end up
assigning an ECMDQ to user space, is there any potential risk in doing so?

The SMMUv3 spec does say:
"Arm expects that the Non-secure Stream table, Command queue, Event queue and
PRI queue are controlled by the most privileged Non-secure system software."

It is not clear to me what the major concerns are here; maybe we can come up
with something to address them in the kernel.

Please let me know if you have any thoughts on this.

Thanks,
Shameer
[0] https://lore.kernel.org/lkml/20230809131303.1355-1-thunder.leizhen@huaweicloud.com/
Shameerali Kolothum Thodi April 17, 2024, 9:45 a.m. UTC | #3
> -----Original Message-----
> From: Shameerali Kolothum Thodi
> Sent: Wednesday, April 17, 2024 9:01 AM
> To: 'Jason Gunthorpe' <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>
> Cc: will@kernel.org; robin.murphy@arm.com; joro@8bytes.org;
> thierry.reding@gmail.com; vdumpa@nvidia.com; jonathanh@nvidia.com; linux-
> kernel@vger.kernel.org; iommu@lists.linux.dev; linux-arm-
> kernel@lists.infradead.org; linux-tegra@vger.kernel.org; Jerry Snitselaar
> <jsnitsel@redhat.com>
> Subject: RE: [PATCH v5 0/6] Add Tegra241 (Grace) CMDQV Support (part 1/2)
> 
> 
> 
> > -----Original Message-----
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Monday, April 15, 2024 6:14 PM
> > To: Nicolin Chen <nicolinc@nvidia.com>
> > Cc: will@kernel.org; robin.murphy@arm.com; joro@8bytes.org;
> > thierry.reding@gmail.com; vdumpa@nvidia.com; jonathanh@nvidia.com;
> > linux-kernel@vger.kernel.org; iommu@lists.linux.dev; linux-arm-
> > kernel@lists.infradead.org; linux-tegra@vger.kernel.org; Jerry Snitselaar
> > <jsnitsel@redhat.com>
> > Subject: Re: [PATCH v5 0/6] Add Tegra241 (Grace) CMDQV Support (part 1/2)
> >
> > On Fri, Apr 12, 2024 at 08:43:48PM -0700, Nicolin Chen wrote:
> >
> > > The user-space support provides uAPIs (via IOMMUFD) for hypervisors
> > > in user space to pass VCMDQs through to VMs, allowing these VMs to
> > > access the VCMDQs directly without trapping, i.e. no VM exits. This
> > > gives huge performance improvements: 70% to 90% reductions in TLB
> > > invalidation time were measured by various DMA unmap tests running
> > > in a guest OS, compared to a nested SMMU CMDQ (with trapping).
> >
> > So everyone is on the same page, this is the primary point of this
> > series. The huge speed-up of in-VM performance is necessary for the
> > workloads this chip is expected to be running. This series is unique
> > from all the rest because it runs inside a VM, often in the form of a
> > distro release.
> >
> > It doesn't need the other series or its own part 2, as it entirely
> > stands alone on bare-metal hardware or on top of commercial VM cloud
> > instances running who-knows-what in their hypervisors.
> >
> > The other parts are substantially about enabling QEMU and the open
> > ecosystem to have fully functional vSMMUv3 virtualization.
> 
> Hi,
> 
> We do have plans to revive the SMMUv3 ECMDQ series posted a while back[0],
> and looking at this series, I am just wondering whether it makes sense to have
> a similar one for ECMDQ as well. I see that the NVIDIA VCMDQ has a special bit
> to restrict the commands that can be issued from user space. If we end up
> assigning an ECMDQ to user space, is there any potential risk in doing so?
> 
> The SMMUv3 spec does say:
> "Arm expects that the Non-secure Stream table, Command queue, Event queue and
> PRI queue are controlled by the most privileged Non-secure system software."
> 
> It is not clear to me what the major concerns are here; maybe we can come up
> with something to address them in the kernel.

Just to add to that: one idea could be that, when ECMDQs are detected, we use
them for issuing a limited set of commands (like stage-1 TLBIs) and use the
normal CMDQ for the rest. Since we use stage 1 for both the host and the guest
nested cases, and TLBIs are the bottleneck in most cases, I think this should
give performance benefits.
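
A rough sketch of that routing idea, with made-up helper names
(example_has_ecmdq()/example_this_cpu_ecmdq(); nothing below is from the
posted ECMDQ series):

/*
 * Hypothetical routing: send stage-1 TLBI commands to a per-CPU ECMDQ
 * when ECMDQs were detected, and everything else to the normal CMDQ.
 */
static struct arm_smmu_cmdq *
example_pick_cmdq(struct arm_smmu_device *smmu, u8 opcode)
{
        bool s1_tlbi = opcode == CMDQ_OP_TLBI_NH_VA ||
                       opcode == CMDQ_OP_TLBI_NH_ASID;

        if (example_has_ecmdq(smmu) && s1_tlbi)
                return example_this_cpu_ecmdq(smmu);    /* made-up helpers */

        return &smmu->cmdq;     /* the standard SMMUv3 command queue */
}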

Thanks,
Shameer
Jason Gunthorpe April 17, 2024, 12:24 p.m. UTC | #4
On Wed, Apr 17, 2024 at 08:01:10AM +0000, Shameerali Kolothum Thodi wrote:
> We do have plans to revive the SMMUv3 ECMDQ series posted a while back[0],
> and looking at this series, I am just wondering whether it makes sense to have
> a similar one for ECMDQ as well. I see that the NVIDIA VCMDQ has a special bit
> to restrict the commands that can be issued from user space. If we end up
> assigning an ECMDQ to user space, is there any potential risk in doing so?

I think there is some risk/trouble: ECMDQ needs some enhancements before
it can really be safe to use from less privileged software, and it
wasn't designed to have an isolated doorbell page either.

> It is not clear to me what the major concerns are here; maybe we can come up
> with something to address them in the kernel.

I haven't looked deeply, but my impression has been that ECMDQ is not
workable for supporting virtualization. At a minimum it has no way to
constrain the command flow to a VMID or to do VSID -> PSID translation.

I suggest you talk directly to ARM about this if you are interested.

Jason
Jason Gunthorpe April 17, 2024, 12:29 p.m. UTC | #5
On Wed, Apr 17, 2024 at 09:45:34AM +0000, Shameerali Kolothum Thodi wrote:

> Just to add to that: one idea could be that, when ECMDQs are detected, we use
> them for issuing a limited set of commands (like stage-1 TLBIs) and use the
> normal CMDQ for the rest. Since we use stage 1 for both the host and the guest
> nested cases, and TLBIs are the bottleneck in most cases, I think this should
> give performance benefits.

There are definitely options to look at to improve the performance
here.

IMHO the design of the ECMDQ largely seems to expect one queue per CPU,
and then we move to a lock-less design where each CPU uses its own
private per-CPU queue. In this case a VMM calling the kernel to do
invalidation would often naturally use a thread originating on a pCPU
bound to a vCPU which is substantially exclusive to the VM.

Jason
Shameerali Kolothum Thodi April 17, 2024, 3:13 p.m. UTC | #6
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 17, 2024 1:25 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Nicolin Chen <nicolinc@nvidia.com>; will@kernel.org;
> robin.murphy@arm.com; joro@8bytes.org; thierry.reding@gmail.com;
> vdumpa@nvidia.com; jonathanh@nvidia.com; linux-kernel@vger.kernel.org;
> iommu@lists.linux.dev; linux-arm-kernel@lists.infradead.org; linux-
> tegra@vger.kernel.org; Jerry Snitselaar <jsnitsel@redhat.com>
> Subject: Re: [PATCH v5 0/6] Add Tegra241 (Grace) CMDQV Support (part 1/2)
> 
> On Wed, Apr 17, 2024 at 08:01:10AM +0000, Shameerali Kolothum Thodi wrote:
> > We do have plans to revive the SMMUv3 ECMDQ series posted a while back[0],
> > and looking at this series, I am just wondering whether it makes sense to have
> > a similar one for ECMDQ as well. I see that the NVIDIA VCMDQ has a special bit
> > to restrict the commands that can be issued from user space. If we end up
> > assigning an ECMDQ to user space, is there any potential risk in doing so?
> 
> I think there is some risk/trouble: ECMDQ needs some enhancements before
> it can really be safe to use from less privileged software, and it
> wasn't designed to have an isolated doorbell page either.
> 
> > It is not clear to me what the major concerns are here; maybe we can come
> > up with something to address them in the kernel.
> 
> I haven't looked deeply, but my impression has been that ECMDQ is not
> workable for supporting virtualization. At a minimum it has no way to
> constrain the command flow to a VMID or to do VSID -> PSID translation.

Ok. That makes sense.

> 
> I suggest you talk directly to ARM about this if you are interested.
> 

Sure. Will check.

Thanks,
Shameer