mbox series

[v3,0/8] arm64: Default to 32-bit wide ZONE_DMA

Message ID 20201014191211.27029-1-nsaenzjulienne@suse.de
Headers show
Series arm64: Default to 32-bit wide ZONE_DMA | expand

Message

Nicolas Saenz Julienne Oct. 14, 2020, 7:12 p.m. UTC
Using two distinct DMA zones turned out to be problematic. Here's an
attempt go back to a saner default.

I tested this on both a RPi4 and QEMU.

---

Changes since v2:
 - Introduce Ard's patch
 - Improve OF dma-ranges parsing function
 - Add unit test for OF function
 - Address small changes
 - Move crashkernel reservation later in boot process

Changes since v1:
 - Parse dma-ranges instead of using machine compatible string

Ard Biesheuvel (1):
  arm64: mm: Set ZONE_DMA size based on early IORT scan

Nicolas Saenz Julienne (7):
  arm64: mm: Move reserve_crashkernel() into mem_init()
  arm64: mm: Move zone_dma_bits initialization into zone_sizes_init()
  of/address: Introduce of_dma_get_max_cpu_address()
  of: unittest: Add test for of_dma_get_max_cpu_address()
  dma-direct: Turn zone_dma_bits default value into a define
  arm64: mm: Set ZONE_DMA size based on devicetree's dma-ranges
  mm: Update DMA zones description

 arch/arm64/include/asm/processor.h |  1 +
 arch/arm64/mm/init.c               | 20 ++++++------
 drivers/acpi/arm64/iort.c          | 51 ++++++++++++++++++++++++++++++
 drivers/of/address.c               | 42 ++++++++++++++++++++++++
 drivers/of/unittest.c              | 20 ++++++++++++
 include/linux/acpi_iort.h          |  4 +++
 include/linux/dma-direct.h         |  3 ++
 include/linux/mmzone.h             |  5 +--
 include/linux/of.h                 |  7 ++++
 kernel/dma/direct.c                |  2 +-
 10 files changed, 143 insertions(+), 12 deletions(-)

Comments

Christoph Hellwig Oct. 15, 2020, 5:38 a.m. UTC | #1
On Wed, Oct 14, 2020 at 09:12:07PM +0200, Nicolas Saenz Julienne wrote:
> Set zone_dma_bits default value through a define so as for architectures
> to be able to override it with their default value.

Architectures can do that already by assigning a value to zone_dma_bits
at runtime.  I really do not want to add the extra clutter.
Christoph Hellwig Oct. 15, 2020, 5:39 a.m. UTC | #2
On Wed, Oct 14, 2020 at 09:12:08PM +0200, Nicolas Saenz Julienne wrote:
> +	zone_dma_bits = min(zone_dma_bits,
> +			    (unsigned int)ilog2(of_dma_get_max_cpu_address(NULL)));

Plase avoid pointlessly long lines.  Especially if it is completely trivial
by using either min_t or not overindenting like here.
Christoph Hellwig Oct. 15, 2020, 5:40 a.m. UTC | #3
On Wed, Oct 14, 2020 at 09:12:10PM +0200, Nicolas Saenz Julienne wrote:
> The default behavior for arm64 changed, so reflect that.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
> ---
>  include/linux/mmzone.h | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fb3bf696c05e..4ee2306351b9 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -363,8 +363,9 @@ enum zone_type {
>  	 *  - arm only uses ZONE_DMA, the size, up to 4G, may vary depending on
>  	 *    the specific device.
>  	 *
> -	 *  - arm64 has a fixed 1G ZONE_DMA and ZONE_DMA32 for the rest of the
> -	 *    lower 4G.
> +	 *  - arm64 uses a single 4GB ZONE_DMA, except on the Raspberry Pi 4,
> +	 *    in which ZONE_DMA covers the first GB and ZONE_DMA32 the rest of
> +	 *    the lower 4GB.

Honestly I think this comment just needs to go away.  We can't really list
every setup in a comment in common code.
Will Deacon Oct. 15, 2020, 8:40 a.m. UTC | #4
On Wed, Oct 14, 2020 at 09:12:03PM +0200, Nicolas Saenz Julienne wrote:
> crashkernel might reserve memory located in ZONE_DMA. We plan to delay
> ZONE_DMA's initialization after unflattening the devicetree and ACPI's
> boot table initialization, so move it later in the boot process.
> Specifically into mem_init(), this is the last place crashkernel will be
> able to reserve the memory before the page allocator kicks in and there is
> no need to do it earlier.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
> ---
>  arch/arm64/mm/init.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Please can you cc me on the whole series next time? I know different
maintainers have different preferences here, but I find it much easier to
figure out what's happening when I can see all of the changes together.

Thanks,

Will
Nicolas Saenz Julienne Oct. 15, 2020, 8:55 a.m. UTC | #5
On Thu, 2020-10-15 at 09:40 +0100, Will Deacon wrote:
> On Wed, Oct 14, 2020 at 09:12:03PM +0200, Nicolas Saenz Julienne wrote:
> > crashkernel might reserve memory located in ZONE_DMA. We plan to delay
> > ZONE_DMA's initialization after unflattening the devicetree and ACPI's
> > boot table initialization, so move it later in the boot process.
> > Specifically into mem_init(), this is the last place crashkernel will be
> > able to reserve the memory before the page allocator kicks in and there is
> > no need to do it earlier.
> > 
> > Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
> > ---
> >  arch/arm64/mm/init.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Please can you cc me on the whole series next time? I know different
> maintainers have different preferences here, but I find it much easier to
> figure out what's happening when I can see all of the changes together.

Will do.

Regards,
Nicolas
Nicolas Saenz Julienne Oct. 15, 2020, 10:05 a.m. UTC | #6
On Thu, 2020-10-15 at 07:38 +0200, Christoph Hellwig wrote:
> On Wed, Oct 14, 2020 at 09:12:07PM +0200, Nicolas Saenz Julienne wrote:
> > Set zone_dma_bits default value through a define so as for architectures
> > to be able to override it with their default value.
> 
> Architectures can do that already by assigning a value to zone_dma_bits
> at runtime.  I really do not want to add the extra clutter.

I'll remove it then.
Nicolas Saenz Julienne Oct. 15, 2020, 10:05 a.m. UTC | #7
On Thu, 2020-10-15 at 07:39 +0200, Christoph Hellwig wrote:
> On Wed, Oct 14, 2020 at 09:12:08PM +0200, Nicolas Saenz Julienne wrote:
> > +	zone_dma_bits = min(zone_dma_bits,
> > +			    (unsigned int)ilog2(of_dma_get_max_cpu_address(NULL)));
> 
> Plase avoid pointlessly long lines.  Especially if it is completely trivial
> by using either min_t or not overindenting like here.

Noted
Lorenzo Pieralisi Oct. 15, 2020, 10:31 a.m. UTC | #8
On Wed, Oct 14, 2020 at 09:12:09PM +0200, Nicolas Saenz Julienne wrote:

[...]

> +unsigned int __init acpi_iort_get_zone_dma_size(void)
> +{
> +	struct acpi_table_iort *iort;
> +	struct acpi_iort_node *node, *end;
> +	acpi_status status;
> +	u8 limit = 32;
> +	int i;
> +
> +	if (acpi_disabled)
> +		return limit;
> +
> +	status = acpi_get_table(ACPI_SIG_IORT, 0,
> +				(struct acpi_table_header **)&iort);
> +	if (ACPI_FAILURE(status))
> +		return limit;
> +
> +	node = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->node_offset);
> +	end = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->header.length);
> +
> +	for (i = 0; i < iort->node_count; i++) {
> +		if (node >= end)
> +			break;
> +
> +		switch (node->type) {
> +			struct acpi_iort_named_component *ncomp;
> +			struct acpi_iort_root_complex *rc;
> +
> +		case ACPI_IORT_NODE_NAMED_COMPONENT:
> +			ncomp = (struct acpi_iort_named_component *)node->node_data;
> +			if (ncomp->memory_address_limit)
> +				limit = min(limit, ncomp->memory_address_limit);
> +			break;
> +
> +		case ACPI_IORT_NODE_PCI_ROOT_COMPLEX:
> +			rc = (struct acpi_iort_root_complex *)node->node_data;
> +			if (rc->memory_address_limit)

You need to add a node revision check here, see rc_dma_get_range() in
drivers/acpi/arm64/iort.c, otherwise we may be reading junk data
in older IORT tables - acpica structures are always referring to the
latest specs.

Thanks,
Lorenzo

> +				limit = min(limit, rc->memory_address_limit);
> +			break;
> +		}
> +		node = ACPI_ADD_PTR(struct acpi_iort_node, node, node->length);
> +	}
> +	acpi_put_table(&iort->header);
> +	return limit;
> +}
> +#endif
> diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
> index 20a32120bb88..7d2e184f0d4d 100644
> --- a/include/linux/acpi_iort.h
> +++ b/include/linux/acpi_iort.h
> @@ -38,6 +38,7 @@ void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *size);
>  const struct iommu_ops *iort_iommu_configure_id(struct device *dev,
>  						const u32 *id_in);
>  int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head);
> +unsigned int acpi_iort_get_zone_dma_size(void);
>  #else
>  static inline void acpi_iort_init(void) { }
>  static inline u32 iort_msi_map_id(struct device *dev, u32 id)
> @@ -55,6 +56,9 @@ static inline const struct iommu_ops *iort_iommu_configure_id(
>  static inline
>  int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head)
>  { return 0; }
> +
> +static inline unsigned int acpi_iort_get_zone_dma_size(void)
> +{ return 32; }
>  #endif
>  
>  #endif /* __ACPI_IORT_H__ */
> -- 
> 2.28.0
>
Hanjun Guo Oct. 15, 2020, 2:26 p.m. UTC | #9
On 2020/10/15 3:12, Nicolas Saenz Julienne wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
> 
> We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
> incorporating masters that can address less than 32 bits of DMA, in
> particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
> peripherals that can only address up to 1 GB (and its PCIe host
> bridge can only access the bottom 3 GB)
> 
> Instructing the DMA layer about these limitations is straight-forward,
> even though we had to fix some issues regarding memory limits set in
> the IORT for named components, and regarding the handling of ACPI _DMA
> methods. However, the DMA layer also needs to be able to allocate
> memory that is guaranteed to meet those DMA constraints, for bounce
> buffering as well as allocating the backing for consistent mappings.
> 
> This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,
> it turns out the having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes
> problems with kdump, and potentially in other places where allocations
> cannot cross zone boundaries. Therefore, we should avoid having two
> separate DMA zones when possible.
> 
> So let's do an early scan of the IORT, and only create the ZONE_DMA
> if we encounter any devices that need it. This puts the burden on
> the firmware to describe such limitations in the IORT, which may be
> redundant (and less precise) if _DMA methods are also being provided.
> However, it should be noted that this situation is highly unusual for
> arm64 ACPI machines. Also, the DMA subsystem still gives precedence to
> the _DMA method if implemented, and so we will not lose the ability to
> perform streaming DMA outside the ZONE_DMA if the _DMA method permits
> it.

Sorry, I'm still a little bit confused. With this patch, if we have
a device which set the right _DMA method (DMA size >= 32), but with the
wrong DMA size in IORT, we still have the ZONE_DMA created which
is actually not needed?

> 
> Cc: Jeremy Linton <jeremy.linton@arm.com>
> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
> Cc: Rob Herring <robh+dt@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Hanjun Guo <guohanjun@huawei.com>
> Cc: Sudeep Holla <sudeep.holla@arm.com>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> [nsaenz: Rebased, removed documentation change, warnings and add
> declaration in acpi_iort.h]
> Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
> ---
>   arch/arm64/mm/init.c      |  6 +++++
>   drivers/acpi/arm64/iort.c | 51 +++++++++++++++++++++++++++++++++++++++
>   include/linux/acpi_iort.h |  4 +++
>   3 files changed, 61 insertions(+)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 97b0d2768349..f321761eedb2 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -29,6 +29,7 @@
>   #include <linux/kexec.h>
>   #include <linux/crash_dump.h>
>   #include <linux/hugetlb.h>
> +#include <linux/acpi_iort.h>
>   
>   #include <asm/boot.h>
>   #include <asm/fixmap.h>
> @@ -196,6 +197,11 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
>   #ifdef CONFIG_ZONE_DMA
>   	zone_dma_bits = min(zone_dma_bits,
>   			    (unsigned int)ilog2(of_dma_get_max_cpu_address(NULL)));
> +
> +	if (IS_ENABLED(CONFIG_ACPI))
> +		zone_dma_bits = min(zone_dma_bits,
> +				    acpi_iort_get_zone_dma_size());
> +
>   	arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
>   	max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
>   #endif
> diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
> index 9929ff50c0c0..8f530bf3c03b 100644
> --- a/drivers/acpi/arm64/iort.c
> +++ b/drivers/acpi/arm64/iort.c
> @@ -1718,3 +1718,54 @@ void __init acpi_iort_init(void)
>   
>   	iort_init_platform_devices();
>   }
> +
> +#ifdef CONFIG_ZONE_DMA
> +/*
> + * Check the IORT whether any devices exist whose DMA mask is < 32 bits.
> + * If so, return the smallest value encountered, or 32 otherwise.
> + */
> +unsigned int __init acpi_iort_get_zone_dma_size(void)
> +{
> +	struct acpi_table_iort *iort;
> +	struct acpi_iort_node *node, *end;
> +	acpi_status status;
> +	u8 limit = 32;
> +	int i;
> +
> +	if (acpi_disabled)
> +		return limit;
> +
> +	status = acpi_get_table(ACPI_SIG_IORT, 0,
> +				(struct acpi_table_header **)&iort);
> +	if (ACPI_FAILURE(status))
> +		return limit;
> +
> +	node = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->node_offset);
> +	end = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->header.length);
> +
> +	for (i = 0; i < iort->node_count; i++) {
> +		if (node >= end)
> +			break;
> +
> +		switch (node->type) {
> +			struct acpi_iort_named_component *ncomp;
> +			struct acpi_iort_root_complex *rc;
> +
> +		case ACPI_IORT_NODE_NAMED_COMPONENT:
> +			ncomp = (struct acpi_iort_named_component *)node->node_data;
> +			if (ncomp->memory_address_limit)
> +				limit = min(limit, ncomp->memory_address_limit);
> +			break;
> +
> +		case ACPI_IORT_NODE_PCI_ROOT_COMPLEX:
> +			rc = (struct acpi_iort_root_complex *)node->node_data;
> +			if (rc->memory_address_limit)
> +				limit = min(limit, rc->memory_address_limit);

There is no "Memory address size limit" field in revision 0 table, so as
Lorenzo reminded, please add a revision check here.

Thanks
Hanjun


> +			break;
> +		}
> +		node = ACPI_ADD_PTR(struct acpi_iort_node, node, node->length);
> +	}
> +	acpi_put_table(&iort->header);
> +	return limit;
> +}
> +#endif
> diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
> index 20a32120bb88..7d2e184f0d4d 100644
> --- a/include/linux/acpi_iort.h
> +++ b/include/linux/acpi_iort.h
> @@ -38,6 +38,7 @@ void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *size);
>   const struct iommu_ops *iort_iommu_configure_id(struct device *dev,
>   						const u32 *id_in);
>   int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head);
> +unsigned int acpi_iort_get_zone_dma_size(void);
>   #else
>   static inline void acpi_iort_init(void) { }
>   static inline u32 iort_msi_map_id(struct device *dev, u32 id)
> @@ -55,6 +56,9 @@ static inline const struct iommu_ops *iort_iommu_configure_id(
>   static inline
>   int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head)
>   { return 0; }
> +
> +static inline unsigned int acpi_iort_get_zone_dma_size(void)
> +{ return 32; }
>   #endif
>   
>   #endif /* __ACPI_IORT_H__ */
>
Nicolas Saenz Julienne Oct. 15, 2020, 3:15 p.m. UTC | #10
On Thu, 2020-10-15 at 22:26 +0800, Hanjun Guo wrote:
> On 2020/10/15 3:12, Nicolas Saenz Julienne wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> > 
> > We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
> > incorporating masters that can address less than 32 bits of DMA, in
> > particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
> > peripherals that can only address up to 1 GB (and its PCIe host
> > bridge can only access the bottom 3 GB)
> > 
> > Instructing the DMA layer about these limitations is straight-forward,
> > even though we had to fix some issues regarding memory limits set in
> > the IORT for named components, and regarding the handling of ACPI _DMA
> > methods. However, the DMA layer also needs to be able to allocate
> > memory that is guaranteed to meet those DMA constraints, for bounce
> > buffering as well as allocating the backing for consistent mappings.
> > 
> > This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,
> > it turns out the having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes
> > problems with kdump, and potentially in other places where allocations
> > cannot cross zone boundaries. Therefore, we should avoid having two
> > separate DMA zones when possible.
> > 
> > So let's do an early scan of the IORT, and only create the ZONE_DMA
> > if we encounter any devices that need it. This puts the burden on
> > the firmware to describe such limitations in the IORT, which may be
> > redundant (and less precise) if _DMA methods are also being provided.
> > However, it should be noted that this situation is highly unusual for
> > arm64 ACPI machines. Also, the DMA subsystem still gives precedence to
> > the _DMA method if implemented, and so we will not lose the ability to
> > perform streaming DMA outside the ZONE_DMA if the _DMA method permits
> > it.
> 
> Sorry, I'm still a little bit confused. With this patch, if we have
> a device which set the right _DMA method (DMA size >= 32), but with the
> wrong DMA size in IORT, we still have the ZONE_DMA created which
> is actually not needed?

Yes, that would be the case.

Regards,
Nicolas
Catalin Marinas Oct. 15, 2020, 6:03 p.m. UTC | #11
On Thu, Oct 15, 2020 at 10:26:18PM +0800, Hanjun Guo wrote:
> On 2020/10/15 3:12, Nicolas Saenz Julienne wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> > 
> > We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
> > incorporating masters that can address less than 32 bits of DMA, in
> > particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
> > peripherals that can only address up to 1 GB (and its PCIe host
> > bridge can only access the bottom 3 GB)
> > 
> > Instructing the DMA layer about these limitations is straight-forward,
> > even though we had to fix some issues regarding memory limits set in
> > the IORT for named components, and regarding the handling of ACPI _DMA
> > methods. However, the DMA layer also needs to be able to allocate
> > memory that is guaranteed to meet those DMA constraints, for bounce
> > buffering as well as allocating the backing for consistent mappings.
> > 
> > This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,
> > it turns out the having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes
> > problems with kdump, and potentially in other places where allocations
> > cannot cross zone boundaries. Therefore, we should avoid having two
> > separate DMA zones when possible.
> > 
> > So let's do an early scan of the IORT, and only create the ZONE_DMA
> > if we encounter any devices that need it. This puts the burden on
> > the firmware to describe such limitations in the IORT, which may be
> > redundant (and less precise) if _DMA methods are also being provided.
> > However, it should be noted that this situation is highly unusual for
> > arm64 ACPI machines. Also, the DMA subsystem still gives precedence to
> > the _DMA method if implemented, and so we will not lose the ability to
> > perform streaming DMA outside the ZONE_DMA if the _DMA method permits
> > it.
> 
> Sorry, I'm still a little bit confused. With this patch, if we have
> a device which set the right _DMA method (DMA size >= 32), but with the
> wrong DMA size in IORT, we still have the ZONE_DMA created which
> is actually not needed?

With the current kernel, we get a ZONE_DMA already with an arbitrary
size of 1GB that matches what RPi4 needs. We are trying to eliminate
such unnecessary ZONE_DMA based on some heuristics (well, something that
looks "better" than a OEM ID based quirk). Now, if we learn that IORT
for platforms in the field is that broken as to describe few bits-wide
DMA masks, we may have to go back to the OEM ID quirk.
Hanjun Guo Oct. 16, 2020, 6:51 a.m. UTC | #12
On 2020/10/16 2:03, Catalin Marinas wrote:
> On Thu, Oct 15, 2020 at 10:26:18PM +0800, Hanjun Guo wrote:
>> On 2020/10/15 3:12, Nicolas Saenz Julienne wrote:
>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>
>>> We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
>>> incorporating masters that can address less than 32 bits of DMA, in
>>> particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
>>> peripherals that can only address up to 1 GB (and its PCIe host
>>> bridge can only access the bottom 3 GB)
>>>
>>> Instructing the DMA layer about these limitations is straight-forward,
>>> even though we had to fix some issues regarding memory limits set in
>>> the IORT for named components, and regarding the handling of ACPI _DMA
>>> methods. However, the DMA layer also needs to be able to allocate
>>> memory that is guaranteed to meet those DMA constraints, for bounce
>>> buffering as well as allocating the backing for consistent mappings.
>>>
>>> This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,
>>> it turns out the having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes
>>> problems with kdump, and potentially in other places where allocations
>>> cannot cross zone boundaries. Therefore, we should avoid having two
>>> separate DMA zones when possible.
>>>
>>> So let's do an early scan of the IORT, and only create the ZONE_DMA
>>> if we encounter any devices that need it. This puts the burden on
>>> the firmware to describe such limitations in the IORT, which may be
>>> redundant (and less precise) if _DMA methods are also being provided.
>>> However, it should be noted that this situation is highly unusual for
>>> arm64 ACPI machines. Also, the DMA subsystem still gives precedence to
>>> the _DMA method if implemented, and so we will not lose the ability to
>>> perform streaming DMA outside the ZONE_DMA if the _DMA method permits
>>> it.
>>
>> Sorry, I'm still a little bit confused. With this patch, if we have
>> a device which set the right _DMA method (DMA size >= 32), but with the
>> wrong DMA size in IORT, we still have the ZONE_DMA created which
>> is actually not needed?
> 
> With the current kernel, we get a ZONE_DMA already with an arbitrary
> size of 1GB that matches what RPi4 needs. We are trying to eliminate
> such unnecessary ZONE_DMA based on some heuristics (well, something that
> looks "better" than a OEM ID based quirk). Now, if we learn that IORT
> for platforms in the field is that broken as to describe few bits-wide
> DMA masks, we may have to go back to the OEM ID quirk.

Some platforms using 0 as the memory size limit, for example D05 [0] and
D06 [1], I think we need to go back to the OEM ID quirk.

For D05/D06, there are multi interrupt controllers named as mbigen,
mbigen is using the named component to describe the mappings with
the ITS controller, and mbigen is using 0 as the memory size limit.

Also since the memory size limit for PCI RC was introduced by later
IORT revision, so firmware people may think it's fine to set that
as 0 because the system works without it.

Thanks
Hanjun

[0]:
https://github.com/tianocore/edk2-platforms/blob/master/Silicon/Hisilicon/Hi1616/D05AcpiTables/D05Iort.asl
[1]:
https://github.com/tianocore/edk2-platforms/blob/master/Silicon/Hisilicon/Hi1620/Hi1620AcpiTables/Hi1620Iort.asl
Ard Biesheuvel Oct. 16, 2020, 6:54 a.m. UTC | #13
On Fri, 16 Oct 2020 at 08:51, Hanjun Guo <guohanjun@huawei.com> wrote:
>
> On 2020/10/16 2:03, Catalin Marinas wrote:
> > On Thu, Oct 15, 2020 at 10:26:18PM +0800, Hanjun Guo wrote:
> >> On 2020/10/15 3:12, Nicolas Saenz Julienne wrote:
> >>> From: Ard Biesheuvel <ardb@kernel.org>
> >>>
> >>> We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
> >>> incorporating masters that can address less than 32 bits of DMA, in
> >>> particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
> >>> peripherals that can only address up to 1 GB (and its PCIe host
> >>> bridge can only access the bottom 3 GB)
> >>>
> >>> Instructing the DMA layer about these limitations is straight-forward,
> >>> even though we had to fix some issues regarding memory limits set in
> >>> the IORT for named components, and regarding the handling of ACPI _DMA
> >>> methods. However, the DMA layer also needs to be able to allocate
> >>> memory that is guaranteed to meet those DMA constraints, for bounce
> >>> buffering as well as allocating the backing for consistent mappings.
> >>>
> >>> This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,
> >>> it turns out the having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes
> >>> problems with kdump, and potentially in other places where allocations
> >>> cannot cross zone boundaries. Therefore, we should avoid having two
> >>> separate DMA zones when possible.
> >>>
> >>> So let's do an early scan of the IORT, and only create the ZONE_DMA
> >>> if we encounter any devices that need it. This puts the burden on
> >>> the firmware to describe such limitations in the IORT, which may be
> >>> redundant (and less precise) if _DMA methods are also being provided.
> >>> However, it should be noted that this situation is highly unusual for
> >>> arm64 ACPI machines. Also, the DMA subsystem still gives precedence to
> >>> the _DMA method if implemented, and so we will not lose the ability to
> >>> perform streaming DMA outside the ZONE_DMA if the _DMA method permits
> >>> it.
> >>
> >> Sorry, I'm still a little bit confused. With this patch, if we have
> >> a device which set the right _DMA method (DMA size >= 32), but with the
> >> wrong DMA size in IORT, we still have the ZONE_DMA created which
> >> is actually not needed?
> >
> > With the current kernel, we get a ZONE_DMA already with an arbitrary
> > size of 1GB that matches what RPi4 needs. We are trying to eliminate
> > such unnecessary ZONE_DMA based on some heuristics (well, something that
> > looks "better" than a OEM ID based quirk). Now, if we learn that IORT
> > for platforms in the field is that broken as to describe few bits-wide
> > DMA masks, we may have to go back to the OEM ID quirk.
>
> Some platforms using 0 as the memory size limit, for example D05 [0] and
> D06 [1], I think we need to go back to the OEM ID quirk.
>
> For D05/D06, there are multi interrupt controllers named as mbigen,
> mbigen is using the named component to describe the mappings with
> the ITS controller, and mbigen is using 0 as the memory size limit.
>
> Also since the memory size limit for PCI RC was introduced by later
> IORT revision, so firmware people may think it's fine to set that
> as 0 because the system works without it.
>

Hello Hanjun,

The patch only takes the address limit field into account if its value > 0.

Also, before commit 7fb89e1d44cb6aec ("ACPI/IORT: take _DMA methods
into account for named components"), the _DMA method was not taken
into account for named components at all, and only the IORT limit was
used, so I do not anticipate any problems with that.
Ard Biesheuvel Oct. 16, 2020, 6:56 a.m. UTC | #14
On Thu, 15 Oct 2020 at 12:31, Lorenzo Pieralisi
<lorenzo.pieralisi@arm.com> wrote:
>
> On Wed, Oct 14, 2020 at 09:12:09PM +0200, Nicolas Saenz Julienne wrote:
>
> [...]
>
> > +unsigned int __init acpi_iort_get_zone_dma_size(void)
> > +{
> > +     struct acpi_table_iort *iort;
> > +     struct acpi_iort_node *node, *end;
> > +     acpi_status status;
> > +     u8 limit = 32;
> > +     int i;
> > +
> > +     if (acpi_disabled)
> > +             return limit;
> > +
> > +     status = acpi_get_table(ACPI_SIG_IORT, 0,
> > +                             (struct acpi_table_header **)&iort);
> > +     if (ACPI_FAILURE(status))
> > +             return limit;
> > +
> > +     node = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->node_offset);
> > +     end = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->header.length);
> > +
> > +     for (i = 0; i < iort->node_count; i++) {
> > +             if (node >= end)
> > +                     break;
> > +
> > +             switch (node->type) {
> > +                     struct acpi_iort_named_component *ncomp;
> > +                     struct acpi_iort_root_complex *rc;
> > +
> > +             case ACPI_IORT_NODE_NAMED_COMPONENT:
> > +                     ncomp = (struct acpi_iort_named_component *)node->node_data;
> > +                     if (ncomp->memory_address_limit)
> > +                             limit = min(limit, ncomp->memory_address_limit);
> > +                     break;
> > +
> > +             case ACPI_IORT_NODE_PCI_ROOT_COMPLEX:
> > +                     rc = (struct acpi_iort_root_complex *)node->node_data;
> > +                     if (rc->memory_address_limit)
>
> You need to add a node revision check here, see rc_dma_get_range() in
> drivers/acpi/arm64/iort.c, otherwise we may be reading junk data
> in older IORT tables - acpica structures are always referring to the
> latest specs.
>

Indeed - apologies for not mentioning that when handing over the patch.

Also, we could use min_not_zero() here instead of the if ()
Hanjun Guo Oct. 16, 2020, 7:27 a.m. UTC | #15
Hi Ard,

On 2020/10/16 14:54, Ard Biesheuvel wrote:
> On Fri, 16 Oct 2020 at 08:51, Hanjun Guo <guohanjun@huawei.com> wrote:
>>
>> On 2020/10/16 2:03, Catalin Marinas wrote:
>>> On Thu, Oct 15, 2020 at 10:26:18PM +0800, Hanjun Guo wrote:
>>>> On 2020/10/15 3:12, Nicolas Saenz Julienne wrote:
>>>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>>>
>>>>> We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
>>>>> incorporating masters that can address less than 32 bits of DMA, in
>>>>> particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
>>>>> peripherals that can only address up to 1 GB (and its PCIe host
>>>>> bridge can only access the bottom 3 GB)
>>>>>
>>>>> Instructing the DMA layer about these limitations is straight-forward,
>>>>> even though we had to fix some issues regarding memory limits set in
>>>>> the IORT for named components, and regarding the handling of ACPI _DMA
>>>>> methods. However, the DMA layer also needs to be able to allocate
>>>>> memory that is guaranteed to meet those DMA constraints, for bounce
>>>>> buffering as well as allocating the backing for consistent mappings.
>>>>>
>>>>> This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,
>>>>> it turns out the having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes
>>>>> problems with kdump, and potentially in other places where allocations
>>>>> cannot cross zone boundaries. Therefore, we should avoid having two
>>>>> separate DMA zones when possible.
>>>>>
>>>>> So let's do an early scan of the IORT, and only create the ZONE_DMA
>>>>> if we encounter any devices that need it. This puts the burden on
>>>>> the firmware to describe such limitations in the IORT, which may be
>>>>> redundant (and less precise) if _DMA methods are also being provided.
>>>>> However, it should be noted that this situation is highly unusual for
>>>>> arm64 ACPI machines. Also, the DMA subsystem still gives precedence to
>>>>> the _DMA method if implemented, and so we will not lose the ability to
>>>>> perform streaming DMA outside the ZONE_DMA if the _DMA method permits
>>>>> it.
>>>>
>>>> Sorry, I'm still a little bit confused. With this patch, if we have
>>>> a device which set the right _DMA method (DMA size >= 32), but with the
>>>> wrong DMA size in IORT, we still have the ZONE_DMA created which
>>>> is actually not needed?
>>>
>>> With the current kernel, we get a ZONE_DMA already with an arbitrary
>>> size of 1GB that matches what RPi4 needs. We are trying to eliminate
>>> such unnecessary ZONE_DMA based on some heuristics (well, something that
>>> looks "better" than a OEM ID based quirk). Now, if we learn that IORT
>>> for platforms in the field is that broken as to describe few bits-wide
>>> DMA masks, we may have to go back to the OEM ID quirk.
>>
>> Some platforms using 0 as the memory size limit, for example D05 [0] and
>> D06 [1], I think we need to go back to the OEM ID quirk.
>>
>> For D05/D06, there are multi interrupt controllers named as mbigen,
>> mbigen is using the named component to describe the mappings with
>> the ITS controller, and mbigen is using 0 as the memory size limit.
>>
>> Also since the memory size limit for PCI RC was introduced by later
>> IORT revision, so firmware people may think it's fine to set that
>> as 0 because the system works without it.
>>
> 
> Hello Hanjun,
> 
> The patch only takes the address limit field into account if its value > 0.

Sorry I missed the if (*->memory_address_limit) check, thanks
for the reminding.

> 
> Also, before commit 7fb89e1d44cb6aec ("ACPI/IORT: take _DMA methods
> into account for named components"), the _DMA method was not taken
> into account for named components at all, and only the IORT limit was
> used, so I do not anticipate any problems with that.

Then this patch is fine to me.

Thanks
Hanjun
Hanjun Guo Oct. 16, 2020, 7:34 a.m. UTC | #16
On 2020/10/16 15:27, Hanjun Guo wrote:
>> The patch only takes the address limit field into account if its value 
>> > 0.
> 
> Sorry I missed the if (*->memory_address_limit) check, thanks
> for the reminding.
> 
>>
>> Also, before commit 7fb89e1d44cb6aec ("ACPI/IORT: take _DMA methods
>> into account for named components"), the _DMA method was not taken
>> into account for named components at all, and only the IORT limit was
>> used, so I do not anticipate any problems with that.
> 
> Then this patch is fine to me.

Certainly we need to address Lorenzo's comments.

Thanks
Hanjun