Patchwork [05/14] lib: Add I/O map cache implementation

login
register
mail settings
Submitter Thierry Reding
Date Jan. 9, 2013, 8:43 p.m.
Message ID <1357764194-12677-6-git-send-email-thierry.reding@avionic-design.de>
Download mbox | patch
Permalink /patch/210860/
State Not Applicable
Headers show

Comments

Thierry Reding - Jan. 9, 2013, 8:43 p.m.
The I/O map cache is used to map large regions of physical memory in
smaller chunks to avoid running out of vmalloc()/ioremap() space.

Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
---
 include/linux/io.h |  12 +++
 lib/ioremap.c      | 266 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 278 insertions(+)
Arnd Bergmann - Jan. 9, 2013, 9:19 p.m.
On Wednesday 09 January 2013, Thierry Reding wrote:
> The I/O map cache is used to map large regions of physical memory in
> smaller chunks to avoid running out of vmalloc()/ioremap() space.
> 
> Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>

Can you explain why this is necessary and which problem it solves?
The implementation looks reasonable, but it's not clear to me if
we really need this new interface.

In what cases does it actually save ioremap space?

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Russell King - ARM Linux - Jan. 9, 2013, 9:28 p.m.
On Wed, Jan 09, 2013 at 09:43:05PM +0100, Thierry Reding wrote:
> The I/O map cache is used to map large regions of physical memory in
> smaller chunks to avoid running out of vmalloc()/ioremap() space.
> 
> Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>

We already have a means where we record the mappings which ioremap()
creates.  If you look at /proc/vmallocinfo, you'll notice lines such
as:

0xf7b72000-0xf7b74000    8192 e1000_probe+0x291/0xa68 [e1000e] phys=fc025000 ioremap

which gives you the virtual address range, physical address and type
of the mapping.  Why do we need a duplicated data structure?

Moreover, you seem to suggest that you want to break up a large
ioremap() mapping into several smaller mappings.  Why?  The idea
behind ioremap() is that this relationship holds true:

	ptr = ioremap(cookie + n, size);

For any 'n' in the range 0 .. size, the location shall be accessible
via ptr + n when using the IO space accessors.  If you're going to
break up a mapping into several smaller ones, this no longer holds
true.

If the problem is that you're ioremapping huge address ranges because
you're passing larger-than-required resources to devices, then that's
part of the problem too.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding - Jan. 9, 2013, 9:54 p.m.
On Wed, Jan 09, 2013 at 09:19:56PM +0000, Arnd Bergmann wrote:
> On Wednesday 09 January 2013, Thierry Reding wrote:
> > The I/O map cache is used to map large regions of physical memory in
> > smaller chunks to avoid running out of vmalloc()/ioremap() space.
> > 
> > Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
> 
> Can you explain why this is necessary and which problem it solves?
> The implementation looks reasonable, but it's not clear to me if
> we really need this new interface.
> 
> In what cases does it actually save ioremap space?

I briefly described this in the cover letter of this series and didn't
realize I hadn't put more information in this patch's commit message.

What happens on Tegra is that we need to map 256 MiB of physical memory
to access all the PCIe extended configuration space. However, ioremap()
on such a large region fails if not enough vmalloc() space is available.

This was observed when somebody tested this on CardHu which has a 1 GiB
of RAM and therefore remapping the full 256 MiB fails.

Thierry
Thierry Reding - Jan. 9, 2013, 9:57 p.m.
On Wed, Jan 09, 2013 at 09:28:47PM +0000, Russell King - ARM Linux wrote:
> On Wed, Jan 09, 2013 at 09:43:05PM +0100, Thierry Reding wrote:
> > The I/O map cache is used to map large regions of physical memory in
> > smaller chunks to avoid running out of vmalloc()/ioremap() space.
> > 
> > Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
> 
> We already have a means where we record the mappings which ioremap()
> creates.  If you look at /proc/vmallocinfo, you'll notice lines such
> as:
> 
> 0xf7b72000-0xf7b74000    8192 e1000_probe+0x291/0xa68 [e1000e] phys=fc025000 ioremap
> 
> which gives you the virtual address range, physical address and type
> of the mapping.  Why do we need a duplicated data structure?
> 
> Moreover, you seem to suggest that you want to break up a large
> ioremap() mapping into several smaller mappings.  Why?  The idea
> behind ioremap() is that this relationship holds true:
> 
> 	ptr = ioremap(cookie + n, size);
> 
> For any 'n' in the range 0 .. size, the location shall be accessible
> via ptr + n when using the IO space accessors.  If you're going to
> break up a mapping into several smaller ones, this no longer holds
> true.
> 
> If the problem is that you're ioremapping huge address ranges because
> you're passing larger-than-required resources to devices, then that's
> part of the problem too.

The resource is not larger than required. It's just that in order to
address the extended configuration space of all PCI devices we require a
full 256 MiB window.

What this patch tries to achieve is allow a driver to take a large
region and ioremap() it page by page on an as needed basis.

Thierry
Arnd Bergmann - Jan. 9, 2013, 10:10 p.m.
On Wednesday 09 January 2013, Thierry Reding wrote:
> What happens on Tegra is that we need to map 256 MiB of physical memory
> to access all the PCIe extended configuration space. However, ioremap()
> on such a large region fails if not enough vmalloc() space is available.
> 
> This was observed when somebody tested this on CardHu which has a 1 GiB
> of RAM and therefore remapping the full 256 MiB fails.

Hmm, config space accesses are fairly rare and generally not expected
to be fast, and 256 MB is really a huge waste of virtual address space,
so I agree that just ioremapping the entire space is not a good
solution.

However, it's not clear that a cache is necessary. Have you measured
a significant performance benefit of this implementation over just
iorempping and unmapping a single page for every config space access?
Have you checked if the hardware supports an alternative config
space access mechanism that does not depend on a huge address range?
A lot of them provide an index/data register pair somewhere, as the
original PC implementation did.

Even if we actually want a cache, how about a private implementation
that just remembers a single page in LRU? I doubt that there are
more drivers that would benefit from a generalized version that you
provide.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stephen Warren - Jan. 9, 2013, 11:12 p.m.
On 01/09/2013 03:10 PM, Arnd Bergmann wrote:
> On Wednesday 09 January 2013, Thierry Reding wrote:
>> What happens on Tegra is that we need to map 256 MiB of physical memory
>> to access all the PCIe extended configuration space. However, ioremap()
>> on such a large region fails if not enough vmalloc() space is available.
>>
>> This was observed when somebody tested this on CardHu which has a 1 GiB
>> of RAM and therefore remapping the full 256 MiB fails.
...
> Have you checked if the hardware supports an alternative config
> space access mechanism that does not depend on a huge address range?
> A lot of them provide an index/data register pair somewhere, as the
> original PC implementation did.

That would be nice, but I've talked to the HW engineers, and there's no
indication that any alternative mechanism exists.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe - Jan. 9, 2013, 11:17 p.m.
On Wed, Jan 09, 2013 at 04:12:31PM -0700, Stephen Warren wrote:
> On 01/09/2013 03:10 PM, Arnd Bergmann wrote:
> > On Wednesday 09 January 2013, Thierry Reding wrote:
> >> What happens on Tegra is that we need to map 256 MiB of physical memory
> >> to access all the PCIe extended configuration space. However, ioremap()
> >> on such a large region fails if not enough vmalloc() space is available.
> >>
> >> This was observed when somebody tested this on CardHu which has a 1 GiB
> >> of RAM and therefore remapping the full 256 MiB fails.
> ...
> > Have you checked if the hardware supports an alternative config
> > space access mechanism that does not depend on a huge address range?
> > A lot of them provide an index/data register pair somewhere, as the
> > original PC implementation did.
> 
> That would be nice, but I've talked to the HW engineers, and there's no
> indication that any alternative mechanism exists.

It seems to be convention that extended config space is often only
accessible through mmio space, that was true on x86 last I checked
too..

You could decrease the size of the mapping to only span the bus
numbers that are configured for use via DT.

Are there any concerns about these config registers being accessed
from a context where a new mapping can't be made? Interrupt? Machine
Check? PCI-E Advanced Error Reporting?

Cheers,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding - Jan. 10, 2013, 7:10 a.m.
On Wed, Jan 09, 2013 at 10:10:49PM +0000, Arnd Bergmann wrote:
> On Wednesday 09 January 2013, Thierry Reding wrote:
> > What happens on Tegra is that we need to map 256 MiB of physical memory
> > to access all the PCIe extended configuration space. However, ioremap()
> > on such a large region fails if not enough vmalloc() space is available.
> > 
> > This was observed when somebody tested this on CardHu which has a 1 GiB
> > of RAM and therefore remapping the full 256 MiB fails.
> 
> Hmm, config space accesses are fairly rare and generally not expected
> to be fast, and 256 MB is really a huge waste of virtual address space,
> so I agree that just ioremapping the entire space is not a good
> solution.
> 
> However, it's not clear that a cache is necessary. Have you measured
> a significant performance benefit of this implementation over just
> iorempping and unmapping a single page for every config space access?

No, I had not actually implemented it that way because I figured I might
just as well implement something generic with the added benefit that
most remapping operations would be cached automatically since the PCI
enumeration algorithms usually access the configuration space of a
single device at a time, so it actually maps to the best case for an LRU
based cache approach.

> Even if we actually want a cache, how about a private implementation
> that just remembers a single page in LRU? I doubt that there are
> more drivers that would benefit from a generalized version that you
> provide.

I can move the code to the Tegra PCIe driver, but there's quite a bit of
code that isn't actually related to the PCI functionality and I'd really
like to avoid cluttering the driver with this implementation. Keeping it
in a more central location will certainly increase the code's visibility
and make it easier for other potential users to find.

Also I just noticed that I hadn't actually added a parameter to the
iomap_cache_create() function to specify the maximum number of pages, so
currently the code only uses a single page anyway. It should be trivial
to change. I guess performance was good enough with a single page that I
didn't have a reason to increase the maximum number of pages.

Thierry
Thierry Reding - Jan. 10, 2013, 7:19 a.m.
On Wed, Jan 09, 2013 at 04:17:58PM -0700, Jason Gunthorpe wrote:
> On Wed, Jan 09, 2013 at 04:12:31PM -0700, Stephen Warren wrote:
> > On 01/09/2013 03:10 PM, Arnd Bergmann wrote:
> > > On Wednesday 09 January 2013, Thierry Reding wrote:
> > >> What happens on Tegra is that we need to map 256 MiB of physical memory
> > >> to access all the PCIe extended configuration space. However, ioremap()
> > >> on such a large region fails if not enough vmalloc() space is available.
> > >>
> > >> This was observed when somebody tested this on CardHu which has a 1 GiB
> > >> of RAM and therefore remapping the full 256 MiB fails.
> > ...
> > > Have you checked if the hardware supports an alternative config
> > > space access mechanism that does not depend on a huge address range?
> > > A lot of them provide an index/data register pair somewhere, as the
> > > original PC implementation did.
> > 
> > That would be nice, but I've talked to the HW engineers, and there's no
> > indication that any alternative mechanism exists.
> 
> It seems to be convention that extended config space is often only
> accessible through mmio space, that was true on x86 last I checked
> too..
> 
> You could decrease the size of the mapping to only span the bus
> numbers that are configured for use via DT.

That won't work, unfortunately. The mapping is such that the bus number
is not encoded in the uppermost bits, the extended register number is.
So the only thing that we could do is decrease the size of the extended
register space for *all* devices.

> Are there any concerns about these config registers being accessed
> from a context where a new mapping can't be made? Interrupt? Machine
> Check? PCI-E Advanced Error Reporting?

I haven't checked but I would expect configuration space accesses to not
happen in interrupt context. Usually they are limited to enumeration and
driver probe.

Thierry
Arnd Bergmann - Jan. 10, 2013, 9:17 a.m.
On Thursday 10 January 2013, Thierry Reding wrote:
> On Wed, Jan 09, 2013 at 04:17:58PM -0700, Jason Gunthorpe wrote:
> > On Wed, Jan 09, 2013 at 04:12:31PM -0700, Stephen Warren wrote:
> > You could decrease the size of the mapping to only span the bus
> > numbers that are configured for use via DT.
> 
> That won't work, unfortunately. The mapping is such that the bus number
> is not encoded in the uppermost bits, the extended register number is.
> So the only thing that we could do is decrease the size of the extended
> register space for *all* devices.

But you could still a method to map 16 separate areas per bus, each 65536
bytes long, which results in 1MB per bus. That is probably ok, since
very few systems have more than a handful of buses in practice.

In theory, doing static mappings on a per-page base would let you
do 16 devices at a time, but it's probably worth doing at this fine
granularity.

> > Are there any concerns about these config registers being accessed
> > from a context where a new mapping can't be made? Interrupt? Machine
> > Check? PCI-E Advanced Error Reporting?
> 
> I haven't checked but I would expect configuration space accesses to not
> happen in interrupt context. Usually they are limited to enumeration and
> driver probe.

Actually, AER probably needs this, and I believe some broken devices 
need to mask interrupts using the PCI command word in the config space,
it it can happen.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding - Jan. 10, 2013, 10:25 a.m.
On Thu, Jan 10, 2013 at 09:17:19AM +0000, Arnd Bergmann wrote:
> On Thursday 10 January 2013, Thierry Reding wrote:
> > On Wed, Jan 09, 2013 at 04:17:58PM -0700, Jason Gunthorpe wrote:
> > > On Wed, Jan 09, 2013 at 04:12:31PM -0700, Stephen Warren wrote:
> > > You could decrease the size of the mapping to only span the bus
> > > numbers that are configured for use via DT.
> > 
> > That won't work, unfortunately. The mapping is such that the bus number
> > is not encoded in the uppermost bits, the extended register number is.
> > So the only thing that we could do is decrease the size of the extended
> > register space for *all* devices.
> 
> But you could still a method to map 16 separate areas per bus, each 65536
> bytes long, which results in 1MB per bus. That is probably ok, since
> very few systems have more than a handful of buses in practice.
> 
> In theory, doing static mappings on a per-page base would let you
> do 16 devices at a time, but it's probably worth doing at this fine
> granularity.

I don't understand how this would help. The encoding is like this:

	[27:24] extended register number
	[23:16] bus number
	[15:11] device number
	[10: 8] function number
	[ 7: 0] register number

So it doesn't matter whether I use separate areas per bus or not. As
soon as the whole extended configuration space needs to be accessed a
whopping 28 bits (256 MiB) are required.

What you propose would work if only regular configuration space is
supported. I'm not sure if that's an option.

> > > Are there any concerns about these config registers being accessed
> > > from a context where a new mapping can't be made? Interrupt? Machine
> > > Check? PCI-E Advanced Error Reporting?
> > 
> > I haven't checked but I would expect configuration space accesses to not
> > happen in interrupt context. Usually they are limited to enumeration and
> > driver probe.
> 
> Actually, AER probably needs this, and I believe some broken devices 
> need to mask interrupts using the PCI command word in the config space,
> it it can happen.

Ugh... that would kill any such dynamic mapping approach. Perhaps if we
could mark a device as requiring a static mapping we could pin that
cache entry. But that doesn't sound very encouraging.

Thierry
Jason Gunthorpe - Jan. 10, 2013, 6:20 p.m.
On Thu, Jan 10, 2013 at 11:25:44AM +0100, Thierry Reding wrote:
> On Thu, Jan 10, 2013 at 09:17:19AM +0000, Arnd Bergmann wrote:
> > On Thursday 10 January 2013, Thierry Reding wrote:
> > > On Wed, Jan 09, 2013 at 04:17:58PM -0700, Jason Gunthorpe wrote:
> > > > On Wed, Jan 09, 2013 at 04:12:31PM -0700, Stephen Warren wrote:
> > > > You could decrease the size of the mapping to only span the bus
> > > > numbers that are configured for use via DT.
> > > 
> > > That won't work, unfortunately. The mapping is such that the bus number
> > > is not encoded in the uppermost bits, the extended register number is.
> > > So the only thing that we could do is decrease the size of the extended
> > > register space for *all* devices.
> > 
> > But you could still a method to map 16 separate areas per bus, each 65536
> > bytes long, which results in 1MB per bus. That is probably ok, since
> > very few systems have more than a handful of buses in practice.
> > 
> > In theory, doing static mappings on a per-page base would let you
> > do 16 devices at a time, but it's probably worth doing at this fine
> > granularity.
> 
> I don't understand how this would help. The encoding is like this:
> 
> 	[27:24] extended register number
> 	[23:16] bus number
> 	[15:11] device number
> 	[10: 8] function number
> 	[ 7: 0] register number
> 
> So it doesn't matter whether I use separate areas per bus or not. As
> soon as the whole extended configuration space needs to be accessed a
> whopping 28 bits (256 MiB) are required.

You'd piece a mapping together, each bus requires 16 64k mappings, a
simple 2d array of busnr*16 of pointers would do the trick. A more
clever solution would be to allocate contiguous virtual memory and
split that up..

> > Actually, AER probably needs this, and I believe some broken devices 
> > need to mask interrupts using the PCI command word in the config space,
> > it it can happen.
> 
> Ugh... that would kill any such dynamic mapping approach. Perhaps if we
> could mark a device as requiring a static mapping we could pin that
> cache entry. But that doesn't sound very encouraging.

AER applies to pretty much every PCI-E device these days.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann - Jan. 10, 2013, 6:26 p.m.
On Thursday 10 January 2013, Thierry Reding wrote:
> I don't understand how this would help. The encoding is like this:
> 
>         [27:24] extended register number
>         [23:16] bus number
>         [15:11] device number
>         [10: 8] function number
>         [ 7: 0] register number
> 
> So it doesn't matter whether I use separate areas per bus or not. As
> soon as the whole extended configuration space needs to be accessed a
> whopping 28 bits (256 MiB) are required.
> 
> What you propose would work if only regular configuration space is
> supported. I'm not sure if that's an option.

I mean something like:

struct tegra_bus_private {
	...
	void __iomem *config_space[16];
};


void tegra_scan_bus(struct pci_bus *bus)
{
	int i;
	struct tegra_bus_private *priv = bus->dev->private;

	for (i=0; i<16; i++)
		priv->config_space[i] = ioremap(config_space_phys +
				 65536 * bus->primary + i * SZ_1M, 65536);

	...
}

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding - Jan. 10, 2013, 6:55 p.m.
On Thu, Jan 10, 2013 at 11:20:07AM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 10, 2013 at 11:25:44AM +0100, Thierry Reding wrote:
> > On Thu, Jan 10, 2013 at 09:17:19AM +0000, Arnd Bergmann wrote:
> > > On Thursday 10 January 2013, Thierry Reding wrote:
> > > > On Wed, Jan 09, 2013 at 04:17:58PM -0700, Jason Gunthorpe wrote:
> > > > > On Wed, Jan 09, 2013 at 04:12:31PM -0700, Stephen Warren wrote:
> > > > > You could decrease the size of the mapping to only span the bus
> > > > > numbers that are configured for use via DT.
> > > > 
> > > > That won't work, unfortunately. The mapping is such that the bus number
> > > > is not encoded in the uppermost bits, the extended register number is.
> > > > So the only thing that we could do is decrease the size of the extended
> > > > register space for *all* devices.
> > > 
> > > But you could still a method to map 16 separate areas per bus, each 65536
> > > bytes long, which results in 1MB per bus. That is probably ok, since
> > > very few systems have more than a handful of buses in practice.
> > > 
> > > In theory, doing static mappings on a per-page base would let you
> > > do 16 devices at a time, but it's probably worth doing at this fine
> > > granularity.
> > 
> > I don't understand how this would help. The encoding is like this:
> > 
> > 	[27:24] extended register number
> > 	[23:16] bus number
> > 	[15:11] device number
> > 	[10: 8] function number
> > 	[ 7: 0] register number
> > 
> > So it doesn't matter whether I use separate areas per bus or not. As
> > soon as the whole extended configuration space needs to be accessed a
> > whopping 28 bits (256 MiB) are required.
> 
> You'd piece a mapping together, each bus requires 16 64k mappings, a
> simple 2d array of busnr*16 of pointers would do the trick. A more
> clever solution would be to allocate contiguous virtual memory and
> split that up..

Oh, I see. I'm not very familiar with the internals of remapping, so
I'll need to do some more reading. Thanks for the hints.

> > > Actually, AER probably needs this, and I believe some broken devices 
> > > need to mask interrupts using the PCI command word in the config space,
> > > it it can happen.
> > 
> > Ugh... that would kill any such dynamic mapping approach. Perhaps if we
> > could mark a device as requiring a static mapping we could pin that
> > cache entry. But that doesn't sound very encouraging.
> 
> AER applies to pretty much every PCI-E device these days.

So given there's no way around a segmented static mapping as you
suggested, right?

Thierry
Thierry Reding - Jan. 10, 2013, 6:57 p.m.
On Thu, Jan 10, 2013 at 06:26:55PM +0000, Arnd Bergmann wrote:
> On Thursday 10 January 2013, Thierry Reding wrote:
> > I don't understand how this would help. The encoding is like this:
> > 
> >         [27:24] extended register number
> >         [23:16] bus number
> >         [15:11] device number
> >         [10: 8] function number
> >         [ 7: 0] register number
> > 
> > So it doesn't matter whether I use separate areas per bus or not. As
> > soon as the whole extended configuration space needs to be accessed a
> > whopping 28 bits (256 MiB) are required.
> > 
> > What you propose would work if only regular configuration space is
> > supported. I'm not sure if that's an option.
> 
> I mean something like:
> 
> struct tegra_bus_private {
> 	...
> 	void __iomem *config_space[16];
> };
> 
> 
> void tegra_scan_bus(struct pci_bus *bus)
> {
> 	int i;
> 	struct tegra_bus_private *priv = bus->dev->private;
> 
> 	for (i=0; i<16; i++)
> 		priv->config_space[i] = ioremap(config_space_phys +
> 				 65536 * bus->primary + i * SZ_1M, 65536);
> 
> 	...
> }

Okay, I see. It's a bit kludgy, but I guess so was the I/O map cache.
It'll take some more time to work this out and test, but I'll give it
a shot.

Thierry
Thierry Reding - Jan. 10, 2013, 7:03 p.m.
On Thu, Jan 10, 2013 at 07:55:05PM +0100, Thierry Reding wrote:
> On Thu, Jan 10, 2013 at 11:20:07AM -0700, Jason Gunthorpe wrote:
> > On Thu, Jan 10, 2013 at 11:25:44AM +0100, Thierry Reding wrote:
> > > On Thu, Jan 10, 2013 at 09:17:19AM +0000, Arnd Bergmann wrote:
> > > > On Thursday 10 January 2013, Thierry Reding wrote:
> > > > > On Wed, Jan 09, 2013 at 04:17:58PM -0700, Jason Gunthorpe wrote:
> > > > > > On Wed, Jan 09, 2013 at 04:12:31PM -0700, Stephen Warren wrote:
> > > > > > You could decrease the size of the mapping to only span the bus
> > > > > > numbers that are configured for use via DT.
> > > > > 
> > > > > That won't work, unfortunately. The mapping is such that the bus number
> > > > > is not encoded in the uppermost bits, the extended register number is.
> > > > > So the only thing that we could do is decrease the size of the extended
> > > > > register space for *all* devices.
> > > > 
> > > > But you could still a method to map 16 separate areas per bus, each 65536
> > > > bytes long, which results in 1MB per bus. That is probably ok, since
> > > > very few systems have more than a handful of buses in practice.
> > > > 
> > > > In theory, doing static mappings on a per-page base would let you
> > > > do 16 devices at a time, but it's probably worth doing at this fine
> > > > granularity.
> > > 
> > > I don't understand how this would help. The encoding is like this:
> > > 
> > > 	[27:24] extended register number
> > > 	[23:16] bus number
> > > 	[15:11] device number
> > > 	[10: 8] function number
> > > 	[ 7: 0] register number
> > > 
> > > So it doesn't matter whether I use separate areas per bus or not. As
> > > soon as the whole extended configuration space needs to be accessed a
> > > whopping 28 bits (256 MiB) are required.
> > 
> > You'd piece a mapping together, each bus requires 16 64k mappings, a
> > simple 2d array of busnr*16 of pointers would do the trick. A more
> > clever solution would be to allocate contiguous virtual memory and
> > split that up..
> 
> Oh, I see. I'm not very familiar with the internals of remapping, so
> I'll need to do some more reading. Thanks for the hints.

I forgot to ask. What's the advantage of having a contiguous virtual
memory area and splitting it up versus remapping each chunk separately?

Thierry
Jason Gunthorpe - Jan. 10, 2013, 7:24 p.m.
On Thu, Jan 10, 2013 at 08:03:27PM +0100, Thierry Reding wrote:

> > > You'd piece a mapping together, each bus requires 16 64k mappings, a
> > > simple 2d array of busnr*16 of pointers would do the trick. A more
> > > clever solution would be to allocate contiguous virtual memory and
> > > split that up..
 
> > Oh, I see. I'm not very familiar with the internals of remapping, so
> > I'll need to do some more reading. Thanks for the hints.
> 
> I forgot to ask. What's the advantage of having a contiguous virtual
> memory area and splitting it up versus remapping each chunk separately?

Not alot, really, but it saves you from the pointer array and
associated overhead. IIRC it is fairly easy to do in the kernel.

Arnd's version is good too, but you would be restricted to aligned
powers of two for the bus number range in the DT, which is probably
not that big a deal either?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding - Jan. 10, 2013, 8:20 p.m.
On Thu, Jan 10, 2013 at 12:24:17PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 10, 2013 at 08:03:27PM +0100, Thierry Reding wrote:
> 
> > > > You'd piece a mapping together, each bus requires 16 64k mappings, a
> > > > simple 2d array of busnr*16 of pointers would do the trick. A more
> > > > clever solution would be to allocate contiguous virtual memory and
> > > > split that up..
>  
> > > Oh, I see. I'm not very familiar with the internals of remapping, so
> > > I'll need to do some more reading. Thanks for the hints.
> > 
> > I forgot to ask. What's the advantage of having a contiguous virtual
> > memory area and splitting it up versus remapping each chunk separately?
> 
> Not alot, really, but it saves you from the pointer array and
> associated overhead. IIRC it is fairly easy to do in the kernel.

I've been investigating this a bit, and one problem is that it will
prevent the driver from ever building as a module because the necessary
functions aren't exported and I'm not sure exporting them would be
acceptable. Currently PCI host controller drivers with MSI support can't
be built as modules because the MSI infrastructure requires it, but I
briefly discussed this with Bjorn at some point and it should be easy to
remove that requirement.

> Arnd's version is good too, but you would be restricted to aligned
> powers of two for the bus number range in the DT, which is probably
> not that big a deal either?

Stephen suggested on IRC that we could try to keep a bit of dynamicity
in the allocation scheme if we create the bus mapping when the first
device on the bus is probed and discard the mapping if no devices are
found.

Sounds like a good plan to me. Does anybody see any potential pitfalls?

Thierry
Jason Gunthorpe - Jan. 10, 2013, 9:06 p.m.
On Thu, Jan 10, 2013 at 09:20:07PM +0100, Thierry Reding wrote:

> > Arnd's version is good too, but you would be restricted to aligned
> > powers of two for the bus number range in the DT, which is probably
> > not that big a deal either?
> 
> Stephen suggested on IRC that we could try to keep a bit of dynamicity
> in the allocation scheme if we create the bus mapping when the first
> device on the bus is probed and discard the mapping if no devices are
> found.

You probably don't need to mess around with 'discard on empty' the
kernel should only access bus numbers that are in the range of the
subordinate busses of the bridges. So if you establish a mapping on a
bus-by-bus basis at first access, it should be fine and very close to
minimal..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding - Jan. 16, 2013, 10:18 a.m.
On Thu, Jan 10, 2013 at 12:24:17PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 10, 2013 at 08:03:27PM +0100, Thierry Reding wrote:
> 
> > > > You'd piece a mapping together, each bus requires 16 64k mappings, a
> > > > simple 2d array of busnr*16 of pointers would do the trick. A more
> > > > clever solution would be to allocate contiguous virtual memory and
> > > > split that up..
>  
> > > Oh, I see. I'm not very familiar with the internals of remapping, so
> > > I'll need to do some more reading. Thanks for the hints.
> > 
> > I forgot to ask. What's the advantage of having a contiguous virtual
> > memory area and splitting it up versus remapping each chunk separately?
> 
> Not alot, really, but it saves you from the pointer array and
> associated overhead. IIRC it is fairly easy to do in the kernel.
> 
> Arnd's version is good too, but you would be restricted to aligned
> powers of two for the bus number range in the DT, which is probably
> not that big a deal either?

I've been trying to make this work, but this implementation always
triggers a BUG_ON() in lib/ioremap.c, line 27:

	27 		BUG_ON(!pte_none(*pte));

which seems to indicate that the page is already mapped, right?

Below is the relevant code:

struct tegra_pcie_bus {
	struct vm_struct *area;
	struct list_head list;
	unsigned int nr;
};

static struct tegra_pcie_bus *tegra_pcie_bus_alloc(struct tegra_pcie *pcie,
						   unsigned int busnr)
{
	unsigned long flags = VM_READ | VM_WRITE | VM_IO | VM_PFNMAP |
			      VM_DONTEXPAND | VM_DONTDUMP;
	phys_addr_t cs = pcie->cs->start;
	struct tegra_pcie_bus *bus;
	struct vm_struct *vm;
	unsigned int i;
	int err;

	bus = devm_kzalloc(pcie->dev, sizeof(*bus), GFP_KERNEL);
	if (!bus)
		return ERR_PTR(-ENOMEM);

	INIT_LIST_HEAD(&bus->list);
	bus->nr = busnr;

	bus->area = get_vm_area(SZ_1M, VM_IOREMAP);
	if (!bus->area) {
		err = -ENOMEM;
		goto free;
	}

	for (i = 0; i < 16; i++) {
		unsigned long virt = (unsigned long)bus->area->addr +
				     i * SZ_64K;
		phys_addr_t phys = cs + busnr * SZ_64K + i * SZ_1M;

		err = ioremap_page_range(virt, virt + SZ_64K - 1, phys,
					 vm_get_page_prot(flags));
		if (err < 0) {
			dev_err(pcie->dev, "ioremap_page_range() failed: %d\n",
				err);
			goto unmap;
		}
	}

	return bus;

unmap:
	vunmap(bus->area->addr);
	free_vm_area(bus->area);
free:
	devm_kfree(pcie->dev, bus);
	return ERR_PTR(err);
}

Anybody see what's wrong with that?
Russell King - ARM Linux - Jan. 16, 2013, 11:25 a.m.
On Wed, Jan 16, 2013 at 11:18:22AM +0100, Thierry Reding wrote:
> 		err = ioremap_page_range(virt, virt + SZ_64K - 1, phys,

Why -1 here?
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding - Jan. 16, 2013, 11:52 a.m.
On Wed, Jan 16, 2013 at 11:25:56AM +0000, Russell King - ARM Linux wrote:
> On Wed, Jan 16, 2013 at 11:18:22AM +0100, Thierry Reding wrote:
> > 		err = ioremap_page_range(virt, virt + SZ_64K - 1, phys,
> 
> Why -1 here?

Right, I forgot that end in these functions always means one byte after
the end. Removing the -1 seems to get past the remapping at least.
Reading the configuration space through the mapping doesn't though. I'll
investigate some more.

Thanks for pointing this out Russell.

Thierry

Patch

diff --git a/include/linux/io.h b/include/linux/io.h
index 069e407..c5d296c 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -76,4 +76,16 @@  void devm_ioremap_release(struct device *dev, void *res);
 #define arch_has_dev_port()     (1)
 #endif
 
+struct iomap_cache;
+struct resource;
+
+struct iomap_cache *iomap_cache_create(const struct resource *region);
+void iomap_cache_free(struct iomap_cache *cache);
+void __iomem *iomap_cache_map(struct iomap_cache *cache, unsigned long offset);
+void iomap_cache_unmap(struct iomap_cache *cache, void __iomem *addr);
+
+struct iomap_cache *devm_iomap_cache_create(struct device *dev,
+					    const struct resource *region);
+void devm_iomap_cache_free(struct device *dev, struct iomap_cache *cache);
+
 #endif /* _LINUX_IO_H */
diff --git a/lib/ioremap.c b/lib/ioremap.c
index 0c9216c..8a13d97 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -5,11 +5,16 @@ 
  *
  * (C) Copyright 1995 1996 Linus Torvalds
  */
+
+#include <linux/device.h>
+#include <linux/err.h>
 #include <linux/vmalloc.h>
 #include <linux/mm.h>
 #include <linux/sched.h>
 #include <linux/io.h>
+#include <linux/ioport.h>
 #include <linux/export.h>
+#include <linux/slab.h>
 #include <asm/cacheflush.h>
 #include <asm/pgtable.h>
 
@@ -92,3 +97,264 @@  int ioremap_page_range(unsigned long addr,
 	return err;
 }
 EXPORT_SYMBOL_GPL(ioremap_page_range);
+
+/**
+ * struct iomap_cache_page - page in an I/O map cache
+ * @region: subregion mapped by the page
+ * @list: chain in cache list
+ * @virt: virtual address of mapped region
+ */
+struct iomap_cache_page {
+	struct resource *region;
+	struct list_head list;
+	void __iomem *virt;
+};
+
+static struct iomap_cache_page *iomap_cache_page_create(void)
+{
+	struct iomap_cache_page *page;
+
+	page = kzalloc(sizeof(*page), GFP_KERNEL);
+	if (!page)
+		return NULL;
+
+	INIT_LIST_HEAD(&page->list);
+
+	return page;
+}
+
+static void iomap_cache_page_unmap(struct iomap_cache_page *page)
+{
+	release_resource(page->region);
+	page->region = NULL;
+
+	iounmap(page->virt);
+	page->virt = NULL;
+}
+
+static void iomap_cache_page_free(struct iomap_cache_page *page)
+{
+	iomap_cache_page_unmap(page);
+	list_del(&page->list);
+	kfree(page);
+}
+
+/**
+ * struct iomap_cache - cache of I/O mapped pages
+ * @region: region mapped by the cache
+ * @pages: list of pages in the cache
+ * @num_pages: number of pages in the cache
+ * @max_pages: maximum number of pages that the cache can map simultaneously
+ */
+struct iomap_cache {
+	struct resource region;
+	struct list_head pages;
+	unsigned int num_pages;
+	unsigned int max_pages;
+};
+
+/**
+ * iomap_cache_create() - create an I/O map cache
+ * @region: memory region to map
+ *
+ * Returns a new I/O map cache that can be used to map the given region on a
+ * page by page basis. On failure, a negative error code is returned.
+ */
+struct iomap_cache *iomap_cache_create(const struct resource *region)
+{
+	struct iomap_cache *cache;
+
+	cache = kzalloc(sizeof(*cache), GFP_KERNEL);
+	if (!cache)
+		return ERR_PTR(-ENOMEM);
+
+	memcpy(&cache->region, region, sizeof(*region));
+	INIT_LIST_HEAD(&cache->pages);
+	cache->num_pages = 0;
+	cache->max_pages = 1;
+
+	return cache;
+}
+
+/**
+ * iomap_cache_free() - free an I/O map cache
+ * @cache: I/O map cache
+ */
+void iomap_cache_free(struct iomap_cache *cache)
+{
+	struct iomap_cache_page *page, *tmp;
+
+	if (!cache)
+		return;
+
+	list_for_each_entry_safe(page, tmp, &cache->pages, list)
+		iomap_cache_page_free(page);
+
+	kfree(cache);
+}
+
+/**
+ * iomap_cache_map() - map a given offset in the cache's region
+ * @cache: I/O map cache
+ * @offset: offset into the cache's region of the address to map
+ *
+ * Returns the virtual address of mapped offset into the cache's region or
+ * NULL if the offset is outside of the region or if not enough memory is
+ * available to map the page.
+ */
+void __iomem *iomap_cache_map(struct iomap_cache *cache, unsigned long offset)
+{
+	struct iomap_cache_page *page;
+	struct resource *region;
+	unsigned long phys;
+
+	if (!cache || offset >= resource_size(&cache->region))
+		return NULL;
+
+	phys = cache->region.start + (offset & PAGE_MASK);
+
+	list_for_each_entry(page, &cache->pages, list) {
+		resource_size_t start, end;
+
+		if (!page->region || !page->virt)
+			continue;
+
+		start = page->region->start - cache->region.start;
+		end = page->region->end - cache->region.start;
+
+		/* address is within an already mapped page */
+		if (offset >= start && offset <= end) {
+			/* move page to end of the LRU list */
+			list_del_init(&page->list);
+			list_add_tail(&page->list, &cache->pages);
+			goto out;
+		}
+	}
+
+	/* find an unmapped page */
+	list_for_each_entry(page, &cache->pages, list) {
+		if (!page->region || !page->virt) {
+			list_del_init(&page->list);
+			break;
+		}
+	}
+
+	/* no unmapped page found */
+	if (&page->list == &cache->pages) {
+		/* add a new page if more space is available */
+		if (cache->num_pages < cache->max_pages) {
+			page = iomap_cache_page_create();
+			if (!page)
+				return NULL;
+
+			cache->num_pages++;
+		} else {
+			/*
+			 * If all pages are in use and there's no space left
+			 * for a new one, evict the first page in the list.
+			 */
+			page = list_first_entry(&cache->pages,
+						struct iomap_cache_page,
+						list);
+			iomap_cache_page_unmap(page);
+			list_del_init(&page->list);
+		}
+	}
+
+	/* insert page at the end of the LRU list */
+	list_add_tail(&page->list, &cache->pages);
+
+	region = __request_region(&cache->region, phys, PAGE_SIZE, NULL,
+				  cache->region.flags);
+	if (!region)
+		return NULL;
+
+	page->virt = ioremap(region->start, resource_size(region));
+	if (!page->virt) {
+		release_resource(region);
+		return NULL;
+	}
+
+	page->region = region;
+
+out:
+	return page->virt + (offset & ~PAGE_MASK);
+}
+
+/**
+ * iomap_cache_unmap() - remove a mapping from the cache
+ * @cache: I/O map cache
+ * @addr: virtual address of the mapping to remove
+ */
+void iomap_cache_unmap(struct iomap_cache *cache, void __iomem *addr)
+{
+	struct iomap_cache_page *page;
+
+	if (!cache)
+		return;
+
+	list_for_each_entry(page, &cache->pages, list) {
+		if (page->virt == addr) {
+			iomap_cache_page_unmap(page);
+			break;
+		}
+	}
+}
+
+static void devm_iomap_cache_release(struct device *dev, void *res)
+{
+	iomap_cache_free(*(struct iomap_cache **)res);
+}
+
+static int devm_iomap_cache_match(struct device *dev, void *res, void *data)
+{
+	struct iomap_cache **p = res;
+
+	if (WARN_ON(!p || !*p))
+		return 0;
+
+	return *p == data;
+}
+
+/**
+ * devm_iomap_cache_create() - create an I/O map cache
+ * @dev: device to attach this I/O map cache to
+ * @region: memory region to map
+ *
+ * Returns a new I/O map cache that can be used to map the given region on a
+ * page by page basis. On failure, a negative error code is returned.
+ *
+ * This function is a device-managed version of iomap_cache_create() which
+ * will automatically be freed when the device disappears.
+ */
+struct iomap_cache *devm_iomap_cache_create(struct device *dev,
+					    const struct resource *region)
+{
+	struct iomap_cache **ptr, *cache;
+
+	ptr = devres_alloc(devm_iomap_cache_release, sizeof(**ptr), GFP_KERNEL);
+	if (!ptr)
+		return ERR_PTR(-ENOMEM);
+
+	cache = iomap_cache_create(region);
+	if (IS_ERR(cache)) {
+		devres_free(ptr);
+		return cache;
+	}
+
+	*ptr = cache;
+	devres_add(dev, ptr);
+
+	return cache;
+}
+
+/**
+ * devm_iomap_cache_free() - free an I/O map cache
+ * @dev: device that this I/O map cached was attached to
+ * @cache: I/O map cache
+ */
+void devm_iomap_cache_free(struct device *dev, struct iomap_cache *cache)
+{
+	WARN_ON(devres_release(dev, devm_iomap_cache_release,
+			       devm_iomap_cache_match, cache));
+}