diff mbox series

iommu: Relax ACS requirement for RCiEP devices.

Message ID 1588653736-10835-1-git-send-email-ashok.raj@intel.com
State New
Headers show
Series iommu: Relax ACS requirement for RCiEP devices. | expand

Commit Message

Ashok Raj May 5, 2020, 4:42 a.m. UTC
PCIe Spec recommends we can relax ACS requirement for RCIEP devices.

PCIe 5.0 Specification.
6.12 Access Control Services (ACS)
Implementation of ACS in RCiEPs is permitted but not required. It is
explicitly permitted that, within a single Root Complex, some RCiEPs
implement ACS and some do not. It is strongly recommended that Root Complex
implementations ensure that all accesses originating from RCiEPs
(PFs and VFs) without ACS capability are first subjected to processing by
the Translation Agent (TA) in the Root Complex before further decoding and
processing. The details of such Root Complex handling are outside the scope
of this specification.

Since Linux didn't give special treatment to allow this exception, certain
RCiEP MFD devices are getting grouped in a single iommu group. This
doesn't permit a single device to be assigned to a guest for instance.

In one vendor system: Device 14.x were grouped in a single IOMMU group.

/sys/kernel/iommu_groups/5/devices/0000:00:14.0
/sys/kernel/iommu_groups/5/devices/0000:00:14.2
/sys/kernel/iommu_groups/5/devices/0000:00:14.3

After the patch:
/sys/kernel/iommu_groups/5/devices/0000:00:14.0
/sys/kernel/iommu_groups/5/devices/0000:00:14.2
/sys/kernel/iommu_groups/6/devices/0000:00:14.3 <<< new group

14.0 and 14.2 are integrated devices, but legacy end points.
Whereas 14.3 was a PCIe compliant RCiEP.

00:14.3 Network controller: Intel Corporation Device 9df0 (rev 30)
Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00

This permits assigning this device to a guest VM.

Fixes: f096c061f552 ("iommu: Rework iommu_group_get_for_pci_dev()")
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
To: Joerg Roedel <joro@8bytes.org>
To: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
Cc: iommu@lists.linux-foundation.org
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Darrel Goeddel <DGoeddel@forcepoint.com>
Cc: Mark Scott <mscott@forcepoint.com>,
Cc: Romil Sharma <rsharma@forcepoint.com>
Cc: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/iommu.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Comments

Alex Williamson May 5, 2020, 5:19 a.m. UTC | #1
On Mon,  4 May 2020 21:42:16 -0700
Ashok Raj <ashok.raj@intel.com> wrote:

> PCIe Spec recommends we can relax ACS requirement for RCIEP devices.
> 
> PCIe 5.0 Specification.
> 6.12 Access Control Services (ACS)
> Implementation of ACS in RCiEPs is permitted but not required. It is
> explicitly permitted that, within a single Root Complex, some RCiEPs
> implement ACS and some do not. It is strongly recommended that Root Complex
> implementations ensure that all accesses originating from RCiEPs
> (PFs and VFs) without ACS capability are first subjected to processing by
> the Translation Agent (TA) in the Root Complex before further decoding and
> processing. The details of such Root Complex handling are outside the scope
> of this specification.
> 
> Since Linux didn't give special treatment to allow this exception, certain
> RCiEP MFD devices are getting grouped in a single iommu group. This
> doesn't permit a single device to be assigned to a guest for instance.
> 
> In one vendor system: Device 14.x were grouped in a single IOMMU group.
> 
> /sys/kernel/iommu_groups/5/devices/0000:00:14.0
> /sys/kernel/iommu_groups/5/devices/0000:00:14.2
> /sys/kernel/iommu_groups/5/devices/0000:00:14.3
> 
> After the patch:
> /sys/kernel/iommu_groups/5/devices/0000:00:14.0
> /sys/kernel/iommu_groups/5/devices/0000:00:14.2
> /sys/kernel/iommu_groups/6/devices/0000:00:14.3 <<< new group
> 
> 14.0 and 14.2 are integrated devices, but legacy end points.
> Whereas 14.3 was a PCIe compliant RCiEP.
> 
> 00:14.3 Network controller: Intel Corporation Device 9df0 (rev 30)
> Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
> 
> This permits assigning this device to a guest VM.
> 
> Fixes: f096c061f552 ("iommu: Rework iommu_group_get_for_pci_dev()")
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> To: Joerg Roedel <joro@8bytes.org>
> To: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: iommu@lists.linux-foundation.org
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Darrel Goeddel <DGoeddel@forcepoint.com>
> Cc: Mark Scott <mscott@forcepoint.com>,
> Cc: Romil Sharma <rsharma@forcepoint.com>
> Cc: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/iommu.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 2b471419e26c..5744bd65f3e2 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1187,7 +1187,20 @@ static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
>  	struct pci_dev *tmp = NULL;
>  	struct iommu_group *group;
>  
> -	if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> +	/*
> +	 * PCI Spec 5.0, Section 6.12 Access Control Service
> +	 * Implementation of ACS in RCiEPs is permitted but not required.
> +	 * It is explicitly permitted that, within a single Root
> +	 * Complex, some RCiEPs implement ACS and some do not. It is
> +	 * strongly recommended that Root Complex implementations ensure
> +	 * that all accesses originating from RCiEPs (PFs and VFs) without
> +	 * ACS capability are first subjected to processing by the Translation
> +	 * Agent (TA) in the Root Complex before further decoding and
> +	 * processing.
> +	 */

Is the language here really strong enough to make this change?  ACS is
an optional feature, so being permitted but not required is rather
meaningless.  The spec is also specifically avoiding the words "must"
or "shall" and even when emphasized with "strongly", we still only have
a recommendation that may or may not be honored.  This seems like a
weak basis for assuming that RCiEPs universally honor this
recommendation.  Thanks,

Alex

> +	if (!pdev->multifunction ||
> +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END) ||
> +	     pci_acs_enabled(pdev, REQ_ACS_FLAGS))
>  		return NULL;
>  
>  	for_each_pci_dev(tmp) {
Ashok Raj May 5, 2020, 6:11 a.m. UTC | #2
Hi Alex

+ Joerg, accidently missed in the Cc.

On Mon, May 04, 2020 at 11:19:36PM -0600, Alex Williamson wrote:
> On Mon,  4 May 2020 21:42:16 -0700
> Ashok Raj <ashok.raj@intel.com> wrote:
> 
> > PCIe Spec recommends we can relax ACS requirement for RCIEP devices.
> > 
> > PCIe 5.0 Specification.
> > 6.12 Access Control Services (ACS)
> > Implementation of ACS in RCiEPs is permitted but not required. It is
> > explicitly permitted that, within a single Root Complex, some RCiEPs
> > implement ACS and some do not. It is strongly recommended that Root Complex
> > implementations ensure that all accesses originating from RCiEPs
> > (PFs and VFs) without ACS capability are first subjected to processing by
> > the Translation Agent (TA) in the Root Complex before further decoding and
> > processing. The details of such Root Complex handling are outside the scope
> > of this specification.
> > 
> > Since Linux didn't give special treatment to allow this exception, certain
> > RCiEP MFD devices are getting grouped in a single iommu group. This
> > doesn't permit a single device to be assigned to a guest for instance.
> > 
> > In one vendor system: Device 14.x were grouped in a single IOMMU group.
> > 
> > /sys/kernel/iommu_groups/5/devices/0000:00:14.0
> > /sys/kernel/iommu_groups/5/devices/0000:00:14.2
> > /sys/kernel/iommu_groups/5/devices/0000:00:14.3
> > 
> > After the patch:
> > /sys/kernel/iommu_groups/5/devices/0000:00:14.0
> > /sys/kernel/iommu_groups/5/devices/0000:00:14.2
> > /sys/kernel/iommu_groups/6/devices/0000:00:14.3 <<< new group
> > 
> > 14.0 and 14.2 are integrated devices, but legacy end points.
> > Whereas 14.3 was a PCIe compliant RCiEP.
> > 
> > 00:14.3 Network controller: Intel Corporation Device 9df0 (rev 30)
> > Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > 
> > This permits assigning this device to a guest VM.
> > 
> > Fixes: f096c061f552 ("iommu: Rework iommu_group_get_for_pci_dev()")
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > To: Joerg Roedel <joro@8bytes.org>
> > To: Bjorn Helgaas <bhelgaas@google.com>
> > Cc: linux-kernel@vger.kernel.org
> > Cc: iommu@lists.linux-foundation.org
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Darrel Goeddel <DGoeddel@forcepoint.com>
> > Cc: Mark Scott <mscott@forcepoint.com>,
> > Cc: Romil Sharma <rsharma@forcepoint.com>
> > Cc: Ashok Raj <ashok.raj@intel.com>
> > ---
> >  drivers/iommu/iommu.c | 15 ++++++++++++++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 2b471419e26c..5744bd65f3e2 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1187,7 +1187,20 @@ static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
> >  	struct pci_dev *tmp = NULL;
> >  	struct iommu_group *group;
> >  
> > -	if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> > +	/*
> > +	 * PCI Spec 5.0, Section 6.12 Access Control Service
> > +	 * Implementation of ACS in RCiEPs is permitted but not required.
> > +	 * It is explicitly permitted that, within a single Root
> > +	 * Complex, some RCiEPs implement ACS and some do not. It is
> > +	 * strongly recommended that Root Complex implementations ensure
> > +	 * that all accesses originating from RCiEPs (PFs and VFs) without
> > +	 * ACS capability are first subjected to processing by the Translation
> > +	 * Agent (TA) in the Root Complex before further decoding and
> > +	 * processing.
> > +	 */
> 
> Is the language here really strong enough to make this change?  ACS is
> an optional feature, so being permitted but not required is rather
> meaningless.  The spec is also specifically avoiding the words "must"
> or "shall" and even when emphasized with "strongly", we still only have
> a recommendation that may or may not be honored.  This seems like a
> weak basis for assuming that RCiEPs universally honor this
> recommendation.  Thanks,
> 

We are speaking about PCIe spec, where people write it about 5 years ahead
and every vendor tries to massage their product behavior with vague
words like this..  :)

But honestly for any any RCiEP, or even integrated endpoints, there 
is no way to send them except up north. These aren't behind a RP.

I did check with couple folks who are part of the SIG, and seem to agree
that ACS treatment for RCiEP's doesn't mean much. 

I understand the language isn't strong, but it doesn't seem like ACS should
be a strong requirement for RCiEP's and reasonable to relax.

What are your thoughts? 

Cheers,
Ashok
> 
> > +	if (!pdev->multifunction ||
> > +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END) ||
> > +	     pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> >  		return NULL;
> >  
> >  	for_each_pci_dev(tmp) {
>
Alex Williamson May 5, 2020, 2:05 p.m. UTC | #3
On Mon, 4 May 2020 23:11:07 -0700
"Raj, Ashok" <ashok.raj@intel.com> wrote:

> Hi Alex
> 
> + Joerg, accidently missed in the Cc.
> 
> On Mon, May 04, 2020 at 11:19:36PM -0600, Alex Williamson wrote:
> > On Mon,  4 May 2020 21:42:16 -0700
> > Ashok Raj <ashok.raj@intel.com> wrote:
> >   
> > > PCIe Spec recommends we can relax ACS requirement for RCIEP devices.
> > > 
> > > PCIe 5.0 Specification.
> > > 6.12 Access Control Services (ACS)
> > > Implementation of ACS in RCiEPs is permitted but not required. It is
> > > explicitly permitted that, within a single Root Complex, some RCiEPs
> > > implement ACS and some do not. It is strongly recommended that Root Complex
> > > implementations ensure that all accesses originating from RCiEPs
> > > (PFs and VFs) without ACS capability are first subjected to processing by
> > > the Translation Agent (TA) in the Root Complex before further decoding and
> > > processing. The details of such Root Complex handling are outside the scope
> > > of this specification.
> > > 
> > > Since Linux didn't give special treatment to allow this exception, certain
> > > RCiEP MFD devices are getting grouped in a single iommu group. This
> > > doesn't permit a single device to be assigned to a guest for instance.
> > > 
> > > In one vendor system: Device 14.x were grouped in a single IOMMU group.
> > > 
> > > /sys/kernel/iommu_groups/5/devices/0000:00:14.0
> > > /sys/kernel/iommu_groups/5/devices/0000:00:14.2
> > > /sys/kernel/iommu_groups/5/devices/0000:00:14.3
> > > 
> > > After the patch:
> > > /sys/kernel/iommu_groups/5/devices/0000:00:14.0
> > > /sys/kernel/iommu_groups/5/devices/0000:00:14.2
> > > /sys/kernel/iommu_groups/6/devices/0000:00:14.3 <<< new group
> > > 
> > > 14.0 and 14.2 are integrated devices, but legacy end points.
> > > Whereas 14.3 was a PCIe compliant RCiEP.
> > > 
> > > 00:14.3 Network controller: Intel Corporation Device 9df0 (rev 30)
> > > Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > > 
> > > This permits assigning this device to a guest VM.
> > > 
> > > Fixes: f096c061f552 ("iommu: Rework iommu_group_get_for_pci_dev()")
> > > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > > To: Joerg Roedel <joro@8bytes.org>
> > > To: Bjorn Helgaas <bhelgaas@google.com>
> > > Cc: linux-kernel@vger.kernel.org
> > > Cc: iommu@lists.linux-foundation.org
> > > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > Cc: Darrel Goeddel <DGoeddel@forcepoint.com>
> > > Cc: Mark Scott <mscott@forcepoint.com>,
> > > Cc: Romil Sharma <rsharma@forcepoint.com>
> > > Cc: Ashok Raj <ashok.raj@intel.com>
> > > ---
> > >  drivers/iommu/iommu.c | 15 ++++++++++++++-
> > >  1 file changed, 14 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > index 2b471419e26c..5744bd65f3e2 100644
> > > --- a/drivers/iommu/iommu.c
> > > +++ b/drivers/iommu/iommu.c
> > > @@ -1187,7 +1187,20 @@ static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
> > >  	struct pci_dev *tmp = NULL;
> > >  	struct iommu_group *group;
> > >  
> > > -	if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> > > +	/*
> > > +	 * PCI Spec 5.0, Section 6.12 Access Control Service
> > > +	 * Implementation of ACS in RCiEPs is permitted but not required.
> > > +	 * It is explicitly permitted that, within a single Root
> > > +	 * Complex, some RCiEPs implement ACS and some do not. It is
> > > +	 * strongly recommended that Root Complex implementations ensure
> > > +	 * that all accesses originating from RCiEPs (PFs and VFs) without
> > > +	 * ACS capability are first subjected to processing by the Translation
> > > +	 * Agent (TA) in the Root Complex before further decoding and
> > > +	 * processing.
> > > +	 */  
> > 
> > Is the language here really strong enough to make this change?  ACS is
> > an optional feature, so being permitted but not required is rather
> > meaningless.  The spec is also specifically avoiding the words "must"
> > or "shall" and even when emphasized with "strongly", we still only have
> > a recommendation that may or may not be honored.  This seems like a
> > weak basis for assuming that RCiEPs universally honor this
> > recommendation.  Thanks,
> >   
> 
> We are speaking about PCIe spec, where people write it about 5 years ahead
> and every vendor tries to massage their product behavior with vague
> words like this..  :)
> 
> But honestly for any any RCiEP, or even integrated endpoints, there 
> is no way to send them except up north. These aren't behind a RP.

But they are multi-function devices and the spec doesn't define routing
within multifunction packages.  A single function RCiEP will already be
assumed isolated within its own group.
 
> I did check with couple folks who are part of the SIG, and seem to agree
> that ACS treatment for RCiEP's doesn't mean much. 
> 
> I understand the language isn't strong, but it doesn't seem like ACS should
> be a strong requirement for RCiEP's and reasonable to relax.
> 
> What are your thoughts? 

I think hardware vendors have ACS at their disposal to clarify when
isolation is provided, otherwise vendors can submit quirks, but I don't
see that the "strongly recommended" phrasing is sufficient to assume
isolation between multifunction RCiEPs.  Thanks,

Alex
Ashok Raj May 5, 2020, 2:56 p.m. UTC | #4
On Tue, May 05, 2020 at 08:05:14AM -0600, Alex Williamson wrote:
> On Mon, 4 May 2020 23:11:07 -0700
> "Raj, Ashok" <ashok.raj@intel.com> wrote:
> 
> > Hi Alex
> > 
> > + Joerg, accidently missed in the Cc.
> > 
> > On Mon, May 04, 2020 at 11:19:36PM -0600, Alex Williamson wrote:
> > > On Mon,  4 May 2020 21:42:16 -0700
> > > Ashok Raj <ashok.raj@intel.com> wrote:
> > >   
> > > > PCIe Spec recommends we can relax ACS requirement for RCIEP devices.
> > > > 
> > > > PCIe 5.0 Specification.
> > > > 6.12 Access Control Services (ACS)
> > > > Implementation of ACS in RCiEPs is permitted but not required. It is
> > > > explicitly permitted that, within a single Root Complex, some RCiEPs
> > > > implement ACS and some do not. It is strongly recommended that Root Complex
> > > > implementations ensure that all accesses originating from RCiEPs
> > > > (PFs and VFs) without ACS capability are first subjected to processing by
> > > > the Translation Agent (TA) in the Root Complex before further decoding and
> > > > processing. The details of such Root Complex handling are outside the scope
> > > > of this specification.
> > > > 
> > > 
> > > Is the language here really strong enough to make this change?  ACS is
> > > an optional feature, so being permitted but not required is rather
> > > meaningless.  The spec is also specifically avoiding the words "must"
> > > or "shall" and even when emphasized with "strongly", we still only have
> > > a recommendation that may or may not be honored.  This seems like a
> > > weak basis for assuming that RCiEPs universally honor this
> > > recommendation.  Thanks,
> > >   
> > 
> > We are speaking about PCIe spec, where people write it about 5 years ahead
> > and every vendor tries to massage their product behavior with vague
> > words like this..  :)
> > 
> > But honestly for any any RCiEP, or even integrated endpoints, there 
> > is no way to send them except up north. These aren't behind a RP.
> 
> But they are multi-function devices and the spec doesn't define routing
> within multifunction packages.  A single function RCiEP will already be
> assumed isolated within its own group.

That's right. The other two devices only have legacy PCI headers. So 
they can't claim to be RCiEP's but just integrated endpoints. The legacy
devices don't even have a PCIe header.

I honestly don't know why these are groped as MFD's in the first place.

>  
> > I did check with couple folks who are part of the SIG, and seem to agree
> > that ACS treatment for RCiEP's doesn't mean much. 
> > 
> > I understand the language isn't strong, but it doesn't seem like ACS should
> > be a strong requirement for RCiEP's and reasonable to relax.
> > 
> > What are your thoughts? 
> 
> I think hardware vendors have ACS at their disposal to clarify when
> isolation is provided, otherwise vendors can submit quirks, but I don't
> see that the "strongly recommended" phrasing is sufficient to assume
> isolation between multifunction RCiEPs.  Thanks,

You point is that integrated MFD endpoints, without ACS, there is no 
gaurantee to SW that they are isolated.

As far as a quirk, do you think:
	- a cmdline optput for integrated endpoints, and RCiEP's suffice?
	  along with a compile time default that is strict enforcement
	- typical vid/did type exception list?

A more generic way to ask for exception would be scalable until we can stop
those type of integrated devices. Or we need to maintain these device lists
for eternity. 

Cheers,
Ashok
Alex Williamson May 5, 2020, 3:34 p.m. UTC | #5
On Tue, 5 May 2020 07:56:06 -0700
"Raj, Ashok" <ashok.raj@intel.com> wrote:

> On Tue, May 05, 2020 at 08:05:14AM -0600, Alex Williamson wrote:
> > On Mon, 4 May 2020 23:11:07 -0700
> > "Raj, Ashok" <ashok.raj@intel.com> wrote:
> >   
> > > Hi Alex
> > > 
> > > + Joerg, accidently missed in the Cc.
> > > 
> > > On Mon, May 04, 2020 at 11:19:36PM -0600, Alex Williamson wrote:  
> > > > On Mon,  4 May 2020 21:42:16 -0700
> > > > Ashok Raj <ashok.raj@intel.com> wrote:
> > > >     
> > > > > PCIe Spec recommends we can relax ACS requirement for RCIEP devices.
> > > > > 
> > > > > PCIe 5.0 Specification.
> > > > > 6.12 Access Control Services (ACS)
> > > > > Implementation of ACS in RCiEPs is permitted but not required. It is
> > > > > explicitly permitted that, within a single Root Complex, some RCiEPs
> > > > > implement ACS and some do not. It is strongly recommended that Root Complex
> > > > > implementations ensure that all accesses originating from RCiEPs
> > > > > (PFs and VFs) without ACS capability are first subjected to processing by
> > > > > the Translation Agent (TA) in the Root Complex before further decoding and
> > > > > processing. The details of such Root Complex handling are outside the scope
> > > > > of this specification.
> > > > >   
> > > > 
> > > > Is the language here really strong enough to make this change?  ACS is
> > > > an optional feature, so being permitted but not required is rather
> > > > meaningless.  The spec is also specifically avoiding the words "must"
> > > > or "shall" and even when emphasized with "strongly", we still only have
> > > > a recommendation that may or may not be honored.  This seems like a
> > > > weak basis for assuming that RCiEPs universally honor this
> > > > recommendation.  Thanks,
> > > >     
> > > 
> > > We are speaking about PCIe spec, where people write it about 5 years ahead
> > > and every vendor tries to massage their product behavior with vague
> > > words like this..  :)
> > > 
> > > But honestly for any any RCiEP, or even integrated endpoints, there 
> > > is no way to send them except up north. These aren't behind a RP.  
> > 
> > But they are multi-function devices and the spec doesn't define routing
> > within multifunction packages.  A single function RCiEP will already be
> > assumed isolated within its own group.  
> 
> That's right. The other two devices only have legacy PCI headers. So 
> they can't claim to be RCiEP's but just integrated endpoints. The legacy
> devices don't even have a PCIe header.
> 
> I honestly don't know why these are groped as MFD's in the first place.
> 
> >    
> > > I did check with couple folks who are part of the SIG, and seem to agree
> > > that ACS treatment for RCiEP's doesn't mean much. 
> > > 
> > > I understand the language isn't strong, but it doesn't seem like ACS should
> > > be a strong requirement for RCiEP's and reasonable to relax.
> > > 
> > > What are your thoughts?   
> > 
> > I think hardware vendors have ACS at their disposal to clarify when
> > isolation is provided, otherwise vendors can submit quirks, but I don't
> > see that the "strongly recommended" phrasing is sufficient to assume
> > isolation between multifunction RCiEPs.  Thanks,  
> 
> You point is that integrated MFD endpoints, without ACS, there is no 
> gaurantee to SW that they are isolated.
> 
> As far as a quirk, do you think:
> 	- a cmdline optput for integrated endpoints, and RCiEP's suffice?
> 	  along with a compile time default that is strict enforcement
> 	- typical vid/did type exception list?
> 
> A more generic way to ask for exception would be scalable until we can stop
> those type of integrated devices. Or we need to maintain these device lists
> for eternity. 

I don't think the language in the spec is anything sufficient to handle
RCiEP uniquely.  We've previously rejected kernel command line opt-outs
for ACS, and the extent to which those patches still float around the
user community and are blindly used to separate IOMMU groups are a
testament to the failure of this approach.  Users do not have a basis
for enabling this sort of opt-out.  The benefit is obvious in the IOMMU
grouping, but the risk is entirely unknown.  A kconfig option is even
worse as that means if you consume a downstream kernel, the downstream
maintainers might have decided universally that isolation is less
important than functionality.

I think the only solution is that the hardware vendors need to step up
to indicate where devices are isolated.  The hardware can do this
itself by implementing ACS, otherwise we need quirks.  I think we've
also generally been reluctant to accept quirks that provide a blanket
opt-out for a vendor because doing so is akin to trying to predict the
future (determining the behavior of all current and previous hardware
is generally a sufficiently impossible task already).  Perhaps if a
vendor has a published internal policy regarding RCiEP isolation and is
willing to stand by a quirk, there might be room to negotiate.  Thanks,

Alex
Ashok Raj May 26, 2020, 6:06 p.m. UTC | #6
Hi Alex,

I was able to find better language in the IOMMU spec that gaurantees 
the behavior we need. See below.


On Tue, May 05, 2020 at 09:34:14AM -0600, Alex Williamson wrote:
> On Tue, 5 May 2020 07:56:06 -0700
> "Raj, Ashok" <ashok.raj@intel.com> wrote:
> 
> > On Tue, May 05, 2020 at 08:05:14AM -0600, Alex Williamson wrote:
> > > On Mon, 4 May 2020 23:11:07 -0700
> > > "Raj, Ashok" <ashok.raj@intel.com> wrote:
> > >   
> > > > Hi Alex
> > > > 
> > > > + Joerg, accidently missed in the Cc.
> > > > 
> > > > On Mon, May 04, 2020 at 11:19:36PM -0600, Alex Williamson wrote:  
> > > > > On Mon,  4 May 2020 21:42:16 -0700
> > > > > Ashok Raj <ashok.raj@intel.com> wrote:
> > > > >     
> > > > > > PCIe Spec recommends we can relax ACS requirement for RCIEP devices.
> > > > > > 
> > > > > > PCIe 5.0 Specification.
> > > > > > 6.12 Access Control Services (ACS)
> > > > > > Implementation of ACS in RCiEPs is permitted but not required. It is
> > > > > > explicitly permitted that, within a single Root Complex, some RCiEPs
> > > > > > implement ACS and some do not. It is strongly recommended that Root Complex
> > > > > > implementations ensure that all accesses originating from RCiEPs
> > > > > > (PFs and VFs) without ACS capability are first subjected to processing by
> > > > > > the Translation Agent (TA) in the Root Complex before further decoding and
> > > > > > processing. The details of such Root Complex handling are outside the scope
> > > > > > of this specification.
> > > > > >   
> > > > > 
> > > > > Is the language here really strong enough to make this change?  ACS is
> > > > > an optional feature, so being permitted but not required is rather
> > > > > meaningless.  The spec is also specifically avoiding the words "must"
> > > > > or "shall" and even when emphasized with "strongly", we still only have
> > > > > a recommendation that may or may not be honored.  This seems like a
> > > > > weak basis for assuming that RCiEPs universally honor this
> > > > > recommendation.  Thanks,
> > > > >     
> > > > 
> > > > We are speaking about PCIe spec, where people write it about 5 years ahead
> > > > and every vendor tries to massage their product behavior with vague
> > > > words like this..  :)
> > > > 
> > > > But honestly for any any RCiEP, or even integrated endpoints, there 
> > > > is no way to send them except up north. These aren't behind a RP.  
> > > 
> > > But they are multi-function devices and the spec doesn't define routing
> > > within multifunction packages.  A single function RCiEP will already be
> > > assumed isolated within its own group.  
> > 
> > That's right. The other two devices only have legacy PCI headers. So 
> > they can't claim to be RCiEP's but just integrated endpoints. The legacy
> > devices don't even have a PCIe header.
> > 
> > I honestly don't know why these are groped as MFD's in the first place.
> > 
> > >    
> > > > I did check with couple folks who are part of the SIG, and seem to agree
> > > > that ACS treatment for RCiEP's doesn't mean much. 
> > > > 
> > > > I understand the language isn't strong, but it doesn't seem like ACS should
> > > > be a strong requirement for RCiEP's and reasonable to relax.
> > > > 
> > > > What are your thoughts?   
> > > 
> > > I think hardware vendors have ACS at their disposal to clarify when
> > > isolation is provided, otherwise vendors can submit quirks, but I don't
> > > see that the "strongly recommended" phrasing is sufficient to assume
> > > isolation between multifunction RCiEPs.  Thanks,  
> > 
> > You point is that integrated MFD endpoints, without ACS, there is no 
> > gaurantee to SW that they are isolated.
> > 
> > As far as a quirk, do you think:
> > 	- a cmdline optput for integrated endpoints, and RCiEP's suffice?
> > 	  along with a compile time default that is strict enforcement
> > 	- typical vid/did type exception list?
> > 
> > A more generic way to ask for exception would be scalable until we can stop
> > those type of integrated devices. Or we need to maintain these device lists
> > for eternity. 
> 
> I don't think the language in the spec is anything sufficient to handle
> RCiEP uniquely.  We've previously rejected kernel command line opt-outs
> for ACS, and the extent to which those patches still float around the
> user community and are blindly used to separate IOMMU groups are a
> testament to the failure of this approach.  Users do not have a basis
> for enabling this sort of opt-out.  The benefit is obvious in the IOMMU
> grouping, but the risk is entirely unknown.  A kconfig option is even
> worse as that means if you consume a downstream kernel, the downstream
> maintainers might have decided universally that isolation is less
> important than functionality.

We discussed this internally, and Intel vt-d spec does spell out clearly 
in Section 3.16 Root-Complex Peer to Peer Considerations. The spec clearly
calls out that all p2p must be done on translated addresses and therefore
must go through the IOMMU.

I suppose they should also have some similar platform gauranteed behavior
for RCiEP's or MFD's *Must* behave as follows. The language is strict and
when IOMMU is enabled in the platform, everything is sent up north to the
IOMMU agent.

3.16 Root-Complex Peer to Peer Considerations
When DMA remapping is enabled, peer-to-peer requests through the
Root-Complex must be handled
as follows:
• The input address in the request is translated (through first-level,
  second-level or nested translation) to a host physical address (HPA).
  The address decoding for peer addresses must be done only on the 
  translated HPA. Hardware implementations are free to further limit 
  peer-to-peer accesses to specific host physical address regions 
  (or to completely disallow peer-forwarding of translated requests).
• Since address translation changes the contents (address field) of the PCI
  Express Transaction Layer Packet (TLP), for PCI Express peer-to-peer 
  requests with ECRC, the Root-Complex hardware must use the new ECRC 
  (re-computed with the translated address) if it decides to forward 
  the TLP as a peer request.
• Root-ports, and multi-function root-complex integrated endpoints, may
  support additional peerto-peer control features by supporting PCI Express
  Access Control Services (ACS) capability. Refer to ACS capability in 
  PCI Express specifications for details.

> to indicate where devices are isolated.  The hardware can do this
> itself by implementing ACS, otherwise we need quirks.  I think we've
> also generally been reluctant to accept quirks that provide a blanket
> opt-out for a vendor because doing so is akin to trying to predict the
> future (determining the behavior of all current and previous hardware
> is generally a sufficiently impossible task already).  Perhaps if a
> vendor has a published internal policy regarding RCiEP isolation and is
> willing to stand by a quirk, there might be room to negotiate.  Thanks,
> 
> Alex
>
Alex Williamson May 26, 2020, 6:26 p.m. UTC | #7
On Tue, 26 May 2020 11:06:48 -0700
"Raj, Ashok" <ashok.raj@intel.com> wrote:

> Hi Alex,
> 
> I was able to find better language in the IOMMU spec that gaurantees 
> the behavior we need. See below.
> 
> 
> On Tue, May 05, 2020 at 09:34:14AM -0600, Alex Williamson wrote:
> > On Tue, 5 May 2020 07:56:06 -0700
> > "Raj, Ashok" <ashok.raj@intel.com> wrote:
> >   
> > > On Tue, May 05, 2020 at 08:05:14AM -0600, Alex Williamson wrote:  
> > > > On Mon, 4 May 2020 23:11:07 -0700
> > > > "Raj, Ashok" <ashok.raj@intel.com> wrote:
> > > >     
> > > > > Hi Alex
> > > > > 
> > > > > + Joerg, accidently missed in the Cc.
> > > > > 
> > > > > On Mon, May 04, 2020 at 11:19:36PM -0600, Alex Williamson wrote:    
> > > > > > On Mon,  4 May 2020 21:42:16 -0700
> > > > > > Ashok Raj <ashok.raj@intel.com> wrote:
> > > > > >       
> > > > > > > PCIe Spec recommends we can relax ACS requirement for RCIEP devices.
> > > > > > > 
> > > > > > > PCIe 5.0 Specification.
> > > > > > > 6.12 Access Control Services (ACS)
> > > > > > > Implementation of ACS in RCiEPs is permitted but not required. It is
> > > > > > > explicitly permitted that, within a single Root Complex, some RCiEPs
> > > > > > > implement ACS and some do not. It is strongly recommended that Root Complex
> > > > > > > implementations ensure that all accesses originating from RCiEPs
> > > > > > > (PFs and VFs) without ACS capability are first subjected to processing by
> > > > > > > the Translation Agent (TA) in the Root Complex before further decoding and
> > > > > > > processing. The details of such Root Complex handling are outside the scope
> > > > > > > of this specification.
> > > > > > >     
> > > > > > 
> > > > > > Is the language here really strong enough to make this change?  ACS is
> > > > > > an optional feature, so being permitted but not required is rather
> > > > > > meaningless.  The spec is also specifically avoiding the words "must"
> > > > > > or "shall" and even when emphasized with "strongly", we still only have
> > > > > > a recommendation that may or may not be honored.  This seems like a
> > > > > > weak basis for assuming that RCiEPs universally honor this
> > > > > > recommendation.  Thanks,
> > > > > >       
> > > > > 
> > > > > We are speaking about PCIe spec, where people write it about 5 years ahead
> > > > > and every vendor tries to massage their product behavior with vague
> > > > > words like this..  :)
> > > > > 
> > > > > But honestly for any any RCiEP, or even integrated endpoints, there 
> > > > > is no way to send them except up north. These aren't behind a RP.    
> > > > 
> > > > But they are multi-function devices and the spec doesn't define routing
> > > > within multifunction packages.  A single function RCiEP will already be
> > > > assumed isolated within its own group.    
> > > 
> > > That's right. The other two devices only have legacy PCI headers. So 
> > > they can't claim to be RCiEP's but just integrated endpoints. The legacy
> > > devices don't even have a PCIe header.
> > > 
> > > I honestly don't know why these are groped as MFD's in the first place.
> > >   
> > > >      
> > > > > I did check with couple folks who are part of the SIG, and seem to agree
> > > > > that ACS treatment for RCiEP's doesn't mean much. 
> > > > > 
> > > > > I understand the language isn't strong, but it doesn't seem like ACS should
> > > > > be a strong requirement for RCiEP's and reasonable to relax.
> > > > > 
> > > > > What are your thoughts?     
> > > > 
> > > > I think hardware vendors have ACS at their disposal to clarify when
> > > > isolation is provided, otherwise vendors can submit quirks, but I don't
> > > > see that the "strongly recommended" phrasing is sufficient to assume
> > > > isolation between multifunction RCiEPs.  Thanks,    
> > > 
> > > You point is that integrated MFD endpoints, without ACS, there is no 
> > > gaurantee to SW that they are isolated.
> > > 
> > > As far as a quirk, do you think:
> > > 	- a cmdline optput for integrated endpoints, and RCiEP's suffice?
> > > 	  along with a compile time default that is strict enforcement
> > > 	- typical vid/did type exception list?
> > > 
> > > A more generic way to ask for exception would be scalable until we can stop
> > > those type of integrated devices. Or we need to maintain these device lists
> > > for eternity.   
> > 
> > I don't think the language in the spec is anything sufficient to handle
> > RCiEP uniquely.  We've previously rejected kernel command line opt-outs
> > for ACS, and the extent to which those patches still float around the
> > user community and are blindly used to separate IOMMU groups are a
> > testament to the failure of this approach.  Users do not have a basis
> > for enabling this sort of opt-out.  The benefit is obvious in the IOMMU
> > grouping, but the risk is entirely unknown.  A kconfig option is even
> > worse as that means if you consume a downstream kernel, the downstream
> > maintainers might have decided universally that isolation is less
> > important than functionality.  
> 
> We discussed this internally, and Intel vt-d spec does spell out clearly 
> in Section 3.16 Root-Complex Peer to Peer Considerations. The spec clearly
> calls out that all p2p must be done on translated addresses and therefore
> must go through the IOMMU.
> 
> I suppose they should also have some similar platform gauranteed behavior
> for RCiEP's or MFD's *Must* behave as follows. The language is strict and
> when IOMMU is enabled in the platform, everything is sent up north to the
> IOMMU agent.
> 
> 3.16 Root-Complex Peer to Peer Considerations
> When DMA remapping is enabled, peer-to-peer requests through the
> Root-Complex must be handled
> as follows:
> • The input address in the request is translated (through first-level,
>   second-level or nested translation) to a host physical address (HPA).
>   The address decoding for peer addresses must be done only on the 
>   translated HPA. Hardware implementations are free to further limit 
>   peer-to-peer accesses to specific host physical address regions 
>   (or to completely disallow peer-forwarding of translated requests).
> • Since address translation changes the contents (address field) of the PCI
>   Express Transaction Layer Packet (TLP), for PCI Express peer-to-peer 
>   requests with ECRC, the Root-Complex hardware must use the new ECRC 
>   (re-computed with the translated address) if it decides to forward 
>   the TLP as a peer request.
> • Root-ports, and multi-function root-complex integrated endpoints, may
>   support additional peerto-peer control features by supporting PCI Express
>   Access Control Services (ACS) capability. Refer to ACS capability in 
>   PCI Express specifications for details.

That sounds like it might be a reasonable basis for quirking all RCiEPs
on VT-d platforms if Intel is willing to stand behind it.  Thanks,

Alex
Ashok Raj May 26, 2020, 6:34 p.m. UTC | #8
On Tue, May 26, 2020 at 12:26:54PM -0600, Alex Williamson wrote:
> > > 
> > > I don't think the language in the spec is anything sufficient to handle
> > > RCiEP uniquely.  We've previously rejected kernel command line opt-outs
> > > for ACS, and the extent to which those patches still float around the
> > > user community and are blindly used to separate IOMMU groups are a
> > > testament to the failure of this approach.  Users do not have a basis
> > > for enabling this sort of opt-out.  The benefit is obvious in the IOMMU
> > > grouping, but the risk is entirely unknown.  A kconfig option is even
> > > worse as that means if you consume a downstream kernel, the downstream
> > > maintainers might have decided universally that isolation is less
> > > important than functionality.  
> > 
> > We discussed this internally, and Intel vt-d spec does spell out clearly 
> > in Section 3.16 Root-Complex Peer to Peer Considerations. The spec clearly
> > calls out that all p2p must be done on translated addresses and therefore
> > must go through the IOMMU.
> > 
> > I suppose they should also have some similar platform gauranteed behavior
> > for RCiEP's or MFD's *Must* behave as follows. The language is strict and
> > when IOMMU is enabled in the platform, everything is sent up north to the
> > IOMMU agent.
> > 
> > 3.16 Root-Complex Peer to Peer Considerations
> > When DMA remapping is enabled, peer-to-peer requests through the
> > Root-Complex must be handled
> > as follows:
> > • The input address in the request is translated (through first-level,
> >   second-level or nested translation) to a host physical address (HPA).
> >   The address decoding for peer addresses must be done only on the 
> >   translated HPA. Hardware implementations are free to further limit 
> >   peer-to-peer accesses to specific host physical address regions 
> >   (or to completely disallow peer-forwarding of translated requests).
> > • Since address translation changes the contents (address field) of the PCI
> >   Express Transaction Layer Packet (TLP), for PCI Express peer-to-peer 
> >   requests with ECRC, the Root-Complex hardware must use the new ECRC 
> >   (re-computed with the translated address) if it decides to forward 
> >   the TLP as a peer request.
> > • Root-ports, and multi-function root-complex integrated endpoints, may
> >   support additional peerto-peer control features by supporting PCI Express
> >   Access Control Services (ACS) capability. Refer to ACS capability in 
> >   PCI Express specifications for details.
> 
> That sounds like it might be a reasonable basis for quirking all RCiEPs
> on VT-d platforms if Intel is willing to stand behind it.  Thanks,
> 

Sounds good.. that's what i hear from our platform teams. If there is a
violation it would be a bug in silicon.
diff mbox series

Patch

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 2b471419e26c..5744bd65f3e2 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1187,7 +1187,20 @@  static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
 	struct pci_dev *tmp = NULL;
 	struct iommu_group *group;
 
-	if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
+	/*
+	 * PCI Spec 5.0, Section 6.12 Access Control Service
+	 * Implementation of ACS in RCiEPs is permitted but not required.
+	 * It is explicitly permitted that, within a single Root
+	 * Complex, some RCiEPs implement ACS and some do not. It is
+	 * strongly recommended that Root Complex implementations ensure
+	 * that all accesses originating from RCiEPs (PFs and VFs) without
+	 * ACS capability are first subjected to processing by the Translation
+	 * Agent (TA) in the Root Complex before further decoding and
+	 * processing.
+	 */
+	if (!pdev->multifunction ||
+	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END) ||
+	     pci_acs_enabled(pdev, REQ_ACS_FLAGS))
 		return NULL;
 
 	for_each_pci_dev(tmp) {