diff mbox

PCI: Clear bridge MEM_64 flag if one child does not support it

Message ID 1418075552-26495-1-git-send-email-yinghai@kernel.org
State Rejected
Headers show

Commit Message

Yinghai Lu Dec. 8, 2014, 9:52 p.m. UTC
Zermond and Marek reported 3.16 and later kernel broke the ati radeon support.
And it points to commit 5b28541552ef ("PCI: Restrict 64-bit prefetchable bridge windows to 64-bit resources").

The root causes:
BIOS does not assign resource for the bridge correctly.
[    0.113672] pci 0000:01:00.0: reg 0x10: [mem 0xc0000000-0xcfffffff pref]
[    0.113683] pci 0000:01:00.0: reg 0x14: [io  0xa000-0xa0ff]
[    0.113695] pci 0000:01:00.0: reg 0x18: [mem 0xfdff0000-0xfdffffff]
[    0.113729] pci 0000:01:00.0: reg 0x30: [mem 0xfdfc0000-0xfdfdffff pref]
[    0.113776] pci 0000:01:00.0: supports D1 D2
[    0.115016] pci 0000:00:01.0: PCI bridge to [bus 01]
[    0.115022] pci 0000:00:01.0:   bridge window [io  0x8000-0xafff]
[    0.115027] pci 0000:00:01.0:   bridge window [mem 0xfdf00000-0xfdffffff]
[    0.115034] pci 0000:00:01.0:   bridge window [mem 0xbdf00000-0xddefffff 64bit pref]
but
[    0.158601] pci 0000:00:01.0: address space collision: [mem
0xbdf00000-0xddefffff 64bit pref] conflicts with System RAM [mem
0x00100000-0xbff9ffff]
so kernel reject them, and try to allocate the resource to it.

Current code after the commit
 1. when bridge does not support 64bit mmio pref, child mmio non 64bit will be
allocated from bridge mmio pref.
2. when bridge does support 64bit mmio pref, child mmio non 64bit will be
 allocated from bridge mmio.

In this case, mmio bar is not big enough and realloc is not triggerd and
it will not be increased, so child device 01:00.0 will not get allocated
mmio pref range.

Solution would be:
1. specify pci=realloc to scratch the bridge mmio register and get bigger
   range for it and then allocate to child mmio pref.
2. or scan the children resource other than ROM to clear bridge MEM_64
   for mmio pref.

The patch is using second way so will keep child mmio pref into bridge
mmio pref range.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=85491
Reported-by: zermond@gmail.com
Reported-and-tested-by: Marek Kordik <kordikmarek@gmail.com>
Fixes: 5b28541552ef ("PCI: Restrict 64-bit prefetchable bridge windows to 64-bit resources")
Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 drivers/pci/setup-bus.c |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Benjamin Herrenschmidt Dec. 8, 2014, 10:56 p.m. UTC | #1
On Mon, 2014-12-08 at 13:52 -0800, Yinghai Lu wrote:
> 2. or scan the children resource other than ROM to clear bridge MEM_64
>    for mmio pref.
> 
> The patch is using second way so will keep child mmio pref into bridge
> mmio pref range.

That means that having a single ROM BAR that is 32-bit and prefetchable
will downgrade the entire window to 32-bit ? That's not going to work
either.

I have GPUs with 16G BARs for example... suddenly they don't fit
anaymore because we downgraded the window to 32 bit because somewhere
there's a 32-bit pref resource ?

That will break more than it fixes...

Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt Dec. 8, 2014, 11 p.m. UTC | #2
On Tue, 2014-12-09 at 09:56 +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2014-12-08 at 13:52 -0800, Yinghai Lu wrote:
> > 2. or scan the children resource other than ROM to clear bridge MEM_64
> >    for mmio pref.
> > 
> > The patch is using second way so will keep child mmio pref into bridge
> > mmio pref range.
> 
> That means that having a single ROM BAR that is 32-bit and prefetchable
> will downgrade the entire window to 32-bit ? That's not going to work
> either.
> 
> I have GPUs with 16G BARs for example... suddenly they don't fit
> anaymore because we downgraded the window to 32 bit because somewhere
> there's a 32-bit pref resource ?
> 
> That will break more than it fixes...

I think an option would be to keep track of whether the PCI host bridge
can do 64-bit pref above 4G.

If it can't then we know 64-bit pref is always fair game for 32-bit
resources and we can thus downgrade all the bridges 64-pref to 32-perf.

If it can, then we probably do need to leave it there do real 64-bit
pref, and shoot anything that is 32-bit only down the non-pref windows.

We *could* try to be smart and scan first to check if anything under the
bridge actually has large/64-bit BARs but that's going to be a heuristic
at best and isn't going to do any good with hotplug.

I tend to think that treating anything 32-bit pref as non-pref is in
fact the best solution unless we know that the platform doesn't do
>32-bit pref anyway in which case we leave them alone.

Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu Dec. 8, 2014, 11:59 p.m. UTC | #3
On Mon, Dec 8, 2014 at 2:56 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Mon, 2014-12-08 at 13:52 -0800, Yinghai Lu wrote:
>> 2. or scan the children resource other than ROM to clear bridge MEM_64
>>    for mmio pref.
>>
>> The patch is using second way so will keep child mmio pref into bridge
>> mmio pref range.
>
> That means that having a single ROM BAR that is 32-bit and prefetchable
> will downgrade the entire window to 32-bit ? That's not going to work
> either.

the ROM BAR get skipped during the checking.

+                       if (i != PCI_ROM_RESOURCE)
+                               mem64_mask &= r->flags & IORESOURCE_MEM_64;
+               }

>
> I have GPUs with 16G BARs for example... suddenly they don't fit
> anaymore because we downgraded the window to 32 bit because somewhere
> there's a 32-bit pref resource ?
>
> That will break more than it fixes...

Please check if this patch break your platform. I tried on my setup on
x86. and it is still
working on 64 bit resource allocation.

Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu Dec. 9, 2014, 12:09 a.m. UTC | #4
On Mon, Dec 8, 2014 at 3:00 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:

>
> I think an option would be to keep track of whether the PCI host bridge
> can do 64-bit pref above 4G.

Yes, that could be enhancement.

That should fix the problem Marek's

But not on Zermond's, as his system is not using _CRS.

Or we can enforce new rule that no _CRS then only use under 4G ?

>
> If it can't then we know 64-bit pref is always fair game for 32-bit
> resources and we can thus downgrade all the bridges 64-pref to 32-perf.
>
> If it can, then we probably do need to leave it there do real 64-bit
> pref, and shoot anything that is 32-bit only down the non-pref windows.
>
> We *could* try to be smart and scan first to check if anything under the
> bridge actually has large/64-bit BARs but that's going to be a heuristic
> at best and isn't going to do any good with hotplug.
>
> I tend to think that treating anything 32-bit pref as non-pref is in
> fact the best solution unless we know that the platform doesn't do
>>32-bit pref anyway in which case we leave them alone.

Yes, but will need to enable pci=realloc automatically, so we can take
back control
of bridge mmio bar.

Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gavin Shan Dec. 9, 2014, 12:11 a.m. UTC | #5
On Mon, Dec 08, 2014 at 03:59:54PM -0800, Yinghai Lu wrote:
>On Mon, Dec 8, 2014 at 2:56 PM, Benjamin Herrenschmidt
><benh@kernel.crashing.org> wrote:
>> On Mon, 2014-12-08 at 13:52 -0800, Yinghai Lu wrote:
>>> 2. or scan the children resource other than ROM to clear bridge MEM_64
>>>    for mmio pref.
>>>
>>> The patch is using second way so will keep child mmio pref into bridge
>>> mmio pref range.
>>
>> That means that having a single ROM BAR that is 32-bit and prefetchable
>> will downgrade the entire window to 32-bit ? That's not going to work
>> either.
>
>the ROM BAR get skipped during the checking.
>
>+                       if (i != PCI_ROM_RESOURCE)
>+                               mem64_mask &= r->flags & IORESOURCE_MEM_64;
>+               }
>
>>
>> I have GPUs with 16G BARs for example... suddenly they don't fit
>> anaymore because we downgraded the window to 32 bit because somewhere
>> there's a 32-bit pref resource ?
>>
>> That will break more than it fixes...
>
>Please check if this patch break your platform. I tried on my setup on
>x86. and it is still
>working on 64 bit resource allocation.
>

I'm going to give it a spin and Richard, could you please apply Yinghai's
patch to see if your SRIOV code can work properly?

Thanks,
Gavin

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gavin Shan Dec. 9, 2014, 1:08 a.m. UTC | #6
On Mon, Dec 08, 2014 at 03:59:54PM -0800, Yinghai Lu wrote:
>On Mon, Dec 8, 2014 at 2:56 PM, Benjamin Herrenschmidt
><benh@kernel.crashing.org> wrote:
>> On Mon, 2014-12-08 at 13:52 -0800, Yinghai Lu wrote:
>>> 2. or scan the children resource other than ROM to clear bridge MEM_64
>>>    for mmio pref.
>>>
>>> The patch is using second way so will keep child mmio pref into bridge
>>> mmio pref range.
>>
>> That means that having a single ROM BAR that is 32-bit and prefetchable
>> will downgrade the entire window to 32-bit ? That's not going to work
>> either.
>
>the ROM BAR get skipped during the checking.
>
>+                       if (i != PCI_ROM_RESOURCE)
>+                               mem64_mask &= r->flags & IORESOURCE_MEM_64;
>+               }
>
>>
>> I have GPUs with 16G BARs for example... suddenly they don't fit
>> anaymore because we downgraded the window to 32 bit because somewhere
>> there's a 32-bit pref resource ?
>>
>> That will break more than it fixes...
>
>Please check if this patch break your platform. I tried on my setup on
>x86. and it is still
>working on 64 bit resource allocation.
>

Yinghai, With the patch, the Power8 machine boots up and have resources
assigned properly: 64-bits pref BARs are covered by 64-bits pref windows
and left memory BARs are covered by 32-bits non-pref windows. Note that
I didn't find an 32-bits-pref BAR on the system.

For SRIOV case, Richard still need test.

Thanks,
Gavin

>Thanks
>
>Yinghai
>

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu Dec. 9, 2014, 1:38 a.m. UTC | #7
On Mon, Dec 8, 2014 at 5:08 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
> On Mon, Dec 08, 2014 at 03:59:54PM -0800, Yinghai Lu wrote:
>
> Yinghai, With the patch, the Power8 machine boots up and have resources
> assigned properly: 64-bits pref BARs are covered by 64-bits pref windows
> and left memory BARs are covered by 32-bits non-pref windows. Note that
> I didn't find an 32-bits-pref BAR on the system.
>
> For SRIOV case, Richard still need test.

Good, Thanks for the testing.

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Wei Yang Dec. 9, 2014, 2:26 a.m. UTC | #8
On Tue, Dec 09, 2014 at 11:11:19AM +1100, Gavin Shan wrote:
>On Mon, Dec 08, 2014 at 03:59:54PM -0800, Yinghai Lu wrote:
>>On Mon, Dec 8, 2014 at 2:56 PM, Benjamin Herrenschmidt
>><benh@kernel.crashing.org> wrote:
>>> On Mon, 2014-12-08 at 13:52 -0800, Yinghai Lu wrote:
>>>> 2. or scan the children resource other than ROM to clear bridge MEM_64
>>>>    for mmio pref.
>>>>
>>>> The patch is using second way so will keep child mmio pref into bridge
>>>> mmio pref range.
>>>
>>> That means that having a single ROM BAR that is 32-bit and prefetchable
>>> will downgrade the entire window to 32-bit ? That's not going to work
>>> either.
>>
>>the ROM BAR get skipped during the checking.
>>
>>+                       if (i != PCI_ROM_RESOURCE)
>>+                               mem64_mask &= r->flags & IORESOURCE_MEM_64;
>>+               }
>>
>>>
>>> I have GPUs with 16G BARs for example... suddenly they don't fit
>>> anaymore because we downgraded the window to 32 bit because somewhere
>>> there's a 32-bit pref resource ?
>>>
>>> That will break more than it fixes...
>>
>>Please check if this patch break your platform. I tried on my setup on
>>x86. and it is still
>>working on 64 bit resource allocation.
>>
>
>I'm going to give it a spin and Richard, could you please apply Yinghai's
>patch to see if your SRIOV code can work properly?
>

Yinghai & Gavin,

I did a quick test on my machine. This patch doesn't affect the MMIO
allocation on out platform, so SRIOV works fine.

I will spend more time to read the patch to get more understanding about the
problem.

>Thanks,
>Gavin
Yinghai Lu Dec. 9, 2014, 7:20 a.m. UTC | #9
On Mon, Dec 8, 2014 at 6:26 PM, Wei Yang <weiyang@linux.vnet.ibm.com> wrote:
> On Tue, Dec 09, 2014 at 11:11:19AM +1100, Gavin Shan wrote:
>>
>>I'm going to give it a spin and Richard, could you please apply Yinghai's
>>patch to see if your SRIOV code can work properly?
>>
>
> I did a quick test on my machine. This patch doesn't affect the MMIO
> allocation on out platform, so SRIOV works fine.
>
> I will spend more time to read the patch to get more understanding about the
> problem.

Hi Marek,

Can you boot following branch with "debug ignore_loglevel pci=realloc"
on your setup?

git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-pci-allocate-fit-3.18

it has
8f74af9: PCI, x86: Allocate mmio near end of free range
1ff91c1: PCI: Don't allocate small resource in big free space.
108b43b: resources: Add allocate_resource_fit()
b6a22f0: resources: Make find_resource could return just fit resource
cf50e16: resources: Split out __allocate_resource()

on top the v3.18, and it will allocate resource in fit way.

Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Wei Yang Dec. 9, 2014, 7:56 a.m. UTC | #10
On Mon, Dec 08, 2014 at 11:20:59PM -0800, Yinghai Lu wrote:
>On Mon, Dec 8, 2014 at 6:26 PM, Wei Yang <weiyang@linux.vnet.ibm.com> wrote:
>> On Tue, Dec 09, 2014 at 11:11:19AM +1100, Gavin Shan wrote:
>>>
>>>I'm going to give it a spin and Richard, could you please apply Yinghai's
>>>patch to see if your SRIOV code can work properly?
>>>
>>
>> I did a quick test on my machine. This patch doesn't affect the MMIO
>> allocation on out platform, so SRIOV works fine.
>>
>> I will spend more time to read the patch to get more understanding about the
>> problem.
>
>Hi Marek,
>
>Can you boot following branch with "debug ignore_loglevel pci=realloc"
>on your setup?
>
>git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
>for-pci-allocate-fit-3.18
>
>it has
>8f74af9: PCI, x86: Allocate mmio near end of free range
>1ff91c1: PCI: Don't allocate small resource in big free space.
>108b43b: resources: Add allocate_resource_fit()
>b6a22f0: resources: Make find_resource could return just fit resource
>cf50e16: resources: Split out __allocate_resource()
>
>on top the v3.18, and it will allocate resource in fit way.

Hi, Yinghai

Just spent some time to understand the problem and your patch.

Two comments to this patch:
1. realloc_head is not necessary here
2. In case we have two MEM BAR under a bus, one is MEM64|PREF and one is PREF,
after this patch, the MEM64 flag will be cleared, right? If so, this may fall
back to the same situation in bug 74151.

I am looking at the bugzilla and try to understand the root cause for this
allocation failure. Currently I don't get the reason. In my mind, the resource
will be sized in the non-pref bridge window. But seems not. I have asked more
info in the bugzilla, so that I could understand the problem. (Yeah, maybe I
am not that smart, if one of you would help me understand the problem, I am
happy to offer you a cup of coffee :-) )

Bjorin,

As you mentioned in another thread, "5b28541552ef is taking the wrong
approach". (http://www.spinics.net/lists/linux-pci/msg37374.html) Maybe I
don't catch it clearly. Put a 32bit prefetchable resource in a 32bit
non-prefetchable bridge window is a bad idea? But in my mind, if the bridge
prefetchable window is 64bit, we can't put a 32bit prefetchable resource in
it.

I am trying to catch up with the updates of the bug, if I missed something
just let me know.

>
>Thanks
>
>Yinghai
Marek Kordík Dec. 9, 2014, 6:07 p.m. UTC | #11
On 12/09/2014 08:20 AM, Yinghai Lu wrote:
> On Mon, Dec 8, 2014 at 6:26 PM, Wei Yang <weiyang@linux.vnet.ibm.com> wrote:
>> On Tue, Dec 09, 2014 at 11:11:19AM +1100, Gavin Shan wrote:
>>>
>>> I'm going to give it a spin and Richard, could you please apply Yinghai's
>>> patch to see if your SRIOV code can work properly?
>>>
>>
>> I did a quick test on my machine. This patch doesn't affect the MMIO
>> allocation on out platform, so SRIOV works fine.
>>
>> I will spend more time to read the patch to get more understanding about the
>> problem.
>
> Hi Marek,
>
> Can you boot following branch with "debug ignore_loglevel pci=realloc"
> on your setup?
>
> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
> for-pci-allocate-fit-3.18
>
> it has
> 8f74af9: PCI, x86: Allocate mmio near end of free range
> 1ff91c1: PCI: Don't allocate small resource in big free space.
> 108b43b: resources: Add allocate_resource_fit()
> b6a22f0: resources: Make find_resource could return just fit resource
> cf50e16: resources: Split out __allocate_resource()
>
> on top the v3.18, and it will allocate resource in fit way.
>
> Thanks
>
> Yinghai
>
Hi Yinghai,

I have built and booted your branch and it works well. Do you want me to 
attach some logs?
(I am new here and I have read http://www.tux.org/lkml/ and I don't want 
to break some rules for attachment size)

Marek
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas Dec. 9, 2014, 6:21 p.m. UTC | #12
On Tue, Dec 9, 2014 at 12:56 AM, Wei Yang <weiyang@linux.vnet.ibm.com> wrote:

> As you mentioned in another thread, "5b28541552ef is taking the wrong
> approach". (http://www.spinics.net/lists/linux-pci/msg37374.html) Maybe I
> don't catch it clearly. Put a 32bit prefetchable resource in a 32bit
> non-prefetchable bridge window is a bad idea?

A 32-bit prefetchable resource *can* be put in a 32-bit
non-prefetchable window, but the device won't perform as well as it
would if the resource were in a prefetchable window.

What I object to is the fact that we put a 32-bit prefetchable
resource in the non-prefetchable window and leave the 64-bit
prefetchable window unused.  This gives up performance for no benefit.

> But in my mind, if the bridge
> prefetchable window is 64bit, we can't put a 32bit prefetchable resource in
> it.

If the window is programmed to be above 4GB, of course we can't put a
32-bit resource in it.  My point is that if the bridge *supports* a
64-bit prefetchable window, we can decide where to place it.  If we
put the window below 4GB, we can put a 32-bit prefetchable resource in
it.

I think maybe you're thinking of "64-bit window" as "a window
programmed to be above 4GB."  I'm using "64-bit window" to mean "a
window that supports 64-bit addressing," i.e., one where
PCI_PREF_BASE_UPPER32 and PCI_PREF_LIMIT_UPPER32 are implemented.
That's analogous to the way we talk about 64-bit BARs.  A 64-bit BAR
is still a 64-bit BAR even if it is currently programmed to be below
4GB.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas Dec. 9, 2014, 6:22 p.m. UTC | #13
On Tue, Dec 9, 2014 at 11:07 AM, Marek Kordík <kordikmarek@gmail.com> wrote:
> On 12/09/2014 08:20 AM, Yinghai Lu wrote:
>>
>> On Mon, Dec 8, 2014 at 6:26 PM, Wei Yang <weiyang@linux.vnet.ibm.com>
>> wrote:
>>>
>>> On Tue, Dec 09, 2014 at 11:11:19AM +1100, Gavin Shan wrote:
>>>>
>>>>
>>>> I'm going to give it a spin and Richard, could you please apply
>>>> Yinghai's
>>>> patch to see if your SRIOV code can work properly?
>>>>
>>>
>>> I did a quick test on my machine. This patch doesn't affect the MMIO
>>> allocation on out platform, so SRIOV works fine.
>>>
>>> I will spend more time to read the patch to get more understanding about
>>> the
>>> problem.
>>
>>
>> Hi Marek,
>>
>> Can you boot following branch with "debug ignore_loglevel pci=realloc"
>> on your setup?
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
>> for-pci-allocate-fit-3.18
>>
>> it has
>> 8f74af9: PCI, x86: Allocate mmio near end of free range
>> 1ff91c1: PCI: Don't allocate small resource in big free space.
>> 108b43b: resources: Add allocate_resource_fit()
>> b6a22f0: resources: Make find_resource could return just fit resource
>> cf50e16: resources: Split out __allocate_resource()
>>
>> on top the v3.18, and it will allocate resource in fit way.
>>
>> Thanks
>>
>> Yinghai
>>
> Hi Yinghai,
>
> I have built and booted your branch and it works well. Do you want me to
> attach some logs?
> (I am new here and I have read http://www.tux.org/lkml/ and I don't want to
> break some rules for attachment size)

Please attach logs to the bugzilla and make their type "text/plain".
That's the easiest to browse.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu Dec. 9, 2014, 7:42 p.m. UTC | #14
On Tue, Dec 9, 2014 at 10:22 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Tue, Dec 9, 2014 at 11:07 AM, Marek Kordík <kordikmarek@gmail.com> wrote:
>> I have built and booted your branch and it works well. Do you want me to
>> attach some logs?
>> (I am new here and I have read http://www.tux.org/lkml/ and I don't want to
>> break some rules for attachment size)

Hi Marek,

Can you run some graphics benchmark program to check the performance
between
1.

git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
 branch: for-pci-allocate-fit-3.18

2.
v 3.18 + clear mmio64 flags when children device does not support it

Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu Dec. 9, 2014, 7:49 p.m. UTC | #15
On Mon, Dec 8, 2014 at 11:56 PM, Wei Yang <weiyang@linux.vnet.ibm.com> wrote:

> Two comments to this patch:
> 1. realloc_head is not necessary here

why ?

want to align that with pbus_size_mem.

> 2. In case we have two MEM BAR under a bus, one is MEM64|PREF and one is PREF,
> after this patch, the MEM64 flag will be cleared, right? If so, this may fall
> back to the same situation in bug 74151.

Yes.

Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu Dec. 9, 2014, 9:39 p.m. UTC | #16
On Mon, Dec 8, 2014 at 4:09 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> On Mon, Dec 8, 2014 at 3:00 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
>
>>
>> I think an option would be to keep track of whether the PCI host bridge
>> can do 64-bit pref above 4G.
>
> Yes, that could be enhancement.
>
> That should fix the problem Marek's
>
> But not on Zermond's, as his system is not using _CRS.
>

http://patchwork.ozlabs.org/patch/419295/

PCI: Clear all bridge res MEM_64 if host bridge has non mem64
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marek Kordík Dec. 9, 2014, 10:15 p.m. UTC | #17
On 12/09/2014 08:42 PM, Yinghai Lu wrote:
> On Tue, Dec 9, 2014 at 10:22 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> On Tue, Dec 9, 2014 at 11:07 AM, Marek Kordík <kordikmarek@gmail.com> wrote:
>>> I have built and booted your branch and it works well. Do you want me to
>>> attach some logs?
>>> (I am new here and I have read http://www.tux.org/lkml/ and I don't want to
>>> break some rules for attachment size)
>
> Hi Marek,
>
> Can you run some graphics benchmark program to check the performance
> between
> 1.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
>   branch: for-pci-allocate-fit-3.18
>
> 2.
> v 3.18 + clear mmio64 flags when children device does not support it
>
> Thanks
>
> Yinghai
>
I have run Unigine Heaven benchmark on both versions (I tried version 2. 
with and also without kernel parameters "debug ignore_loglevel 
pci=realloc") and the performance of each version was the same (70-71 
points). I tried to run this benchmark also with kernel 3.15.10 and the 
result was 67 points. Tomorrow I can try to run some more benchmarks, 
today I didn't have much time.

Marek
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu Dec. 9, 2014, 11:16 p.m. UTC | #18
On Tue, Dec 9, 2014 at 2:15 PM, Marek Kordík <kordikmarek@gmail.com> wrote:
> On 12/09/2014 08:42 PM, Yinghai Lu wrote:
>>
>> On Tue, Dec 9, 2014 at 10:22 AM, Bjorn Helgaas <bhelgaas@google.com>
>> wrote:
>>>
>>> On Tue, Dec 9, 2014 at 11:07 AM, Marek Kordík <kordikmarek@gmail.com>
>>> wrote:
>>>>
>>>> I have built and booted your branch and it works well. Do you want me to
>>>> attach some logs?
>>>> (I am new here and I have read http://www.tux.org/lkml/ and I don't want
>>
>> Can you run some graphics benchmark program to check the performance
>> between
>> 1.
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
>>   branch: for-pci-allocate-fit-3.18
>>
>> 2.
>> v 3.18 + clear mmio64 flags when children device does not support it
>>
>>
> I have run Unigine Heaven benchmark on both versions (I tried version 2.
> with and also without kernel parameters "debug ignore_loglevel pci=realloc")
> and the performance of each version was the same (70-71 points). I tried to
> run this benchmark also with kernel 3.15.10 and the result was 67 points.
> Tomorrow I can try to run some more benchmarks, today I didn't have much
> time.

so putting  mem pref under bridge mem does not cause performance loss?

Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Wei Yang Dec. 10, 2014, 9:42 a.m. UTC | #19
On Tue, Dec 09, 2014 at 11:21:11AM -0700, Bjorn Helgaas wrote:
>On Tue, Dec 9, 2014 at 12:56 AM, Wei Yang <weiyang@linux.vnet.ibm.com> wrote:
>
>> As you mentioned in another thread, "5b28541552ef is taking the wrong
>> approach". (http://www.spinics.net/lists/linux-pci/msg37374.html) Maybe I
>> don't catch it clearly. Put a 32bit prefetchable resource in a 32bit
>> non-prefetchable bridge window is a bad idea?
>
>A 32-bit prefetchable resource *can* be put in a 32-bit
>non-prefetchable window, but the device won't perform as well as it
>would if the resource were in a prefetchable window.
>
>What I object to is the fact that we put a 32-bit prefetchable
>resource in the non-prefetchable window and leave the 64-bit
>prefetchable window unused.  This gives up performance for no benefit.
>

OK, I get your point. This is really a waste to put a 32-bit prefetchable
resource in non-prefetchable window and no one use the 64-bit prefetchable
window. (That's very interesting even no performance lost by doing so as
mentioned in reply from Marek. Maybe the reason is the most important IO is
performed in DMA.)

Below is what's in my mind, if not correct please let me know.

1. Yinghai's patch needs a little modification

Now back to Yinghai's patch. If this patch wants to apply this logic it should
check all prefetchable resource, including M64 and non-M64 

  if *none* of them has M64 prefetchable resource, then clear the M64 flag in
  bridge.

But the logic in patch is

  if *any* of them has non-M64 prefetchable resource, then clear the M64 flag 
  in bridge.


2. Yes, the strategy will improve the performance in some case, but with
limited case.

Then suppose we use the strategy, clear the M64 flag in bridge when none of
the child resource need it. 

I imagined this scenario:

                       +--------------+
                       |B1            |
                       +------+-------+
                              |  Bus#1
            ---+--------------+-------------------+----
               |  Bus#2                           |  Bus#3
        +------+------+                    +------+-------+
        |B2           |                    |B3            |
        +-------------+                    +--------------+
	res[1] 32-bit non-pref             res[1] 32-bit non-pref
	res[2] 64-bit pref                 res[2] 32-bit pref

Suppose we have a PCI sub-tree like this. And suppose all the bus->resource[2]
is with M64 flag before sizing. (The initial value is retrieved in enumeration
stage by pci_read_bridge_bases(). This happens before sizing.)

We will first size Bus#2, then Bus#3, at last Bus#1. As shown in the chart,
    For Bus#2, there is a 64-bit pref, so Bus#2->resource[2] with M64 flag set.
    For Bus#3, there is no 64-bit pref, so Bus#3->resource[2] with M64 flag
    cleared.
    Then up to Bus#1, since there is a 64-bit prefetchable resource, the
    Bus#1->resource[2] is with M64 flag set.

The final B1's window will look like this:

         +--------------------------------------+
         | B1                                   |
         |                                      |
	 | res[1] 32-bit non-pref               |
         | +---------+----------+----------+    |
	 | |B2.res[1]| B3.res[1]| B3.res[2]|    |
         | +---------+----------+----------+    |
	 |                                      |
         | res[2] 64-bit pref                   |
         | +----------------+                   |
         | |B2.res[2]       |                   |
         | +----------------+                   |
         +--------------------------------------+

From the performance perspective: 

  B3.res[2] will bring some performance improvement to those devices under it.
  But from its grand parent's point of view, B3.res[2] still sits under a
  non-prefetchable window. So the performance improvement will be limited. Hope
  my understanding is correct.
  
  So the best case for the performance improvement is all this PCI tree has no
  64-bit prefetchable resource.

From the resource sizing/allocation perspective:
  
  B3.res[2] still contribute to the root's non-prefetchable window, once
  there is a 64-bit prefetchable in the PCI tree. If we don't have enough
  non-prefetchable window in root, we would fail.

Bjorn, I agree to apply the logic you mentioned to clear the M64 flag when no
child need it. But the benefits is limited. 

I did a grep on the lspci -vvv output and see there is no 64-bit prefetchable
BAR in the system.

$ grep Region lspci | grep Me
	Region 0: Memory at febff400 (32-bit, non-prefetchable) [size=1K]
	Region 0: Memory at febf8000 (64-bit, non-prefetchable) [size=16K]
	Region 0: Memory at febff000 (32-bit, non-prefetchable) [size=1K]
	Region 5: Memory at febff800 (32-bit, non-prefetchable) [size=2K]
	Region 0: Memory at c0000000 (32-bit, prefetchable) [size=256M]
	Region 2: Memory at fdff0000 (32-bit, non-prefetchable) [size=64K]
	Region 0: Memory at fe0c0000 (64-bit, non-prefetchable) [size=256K]
	Region 0: Memory at fe1fe000 (64-bit, non-prefetchable) [size=8K]
	Region 0: Memory at feaffc00 (32-bit, non-prefetchable) [size=1K]

I think this is the reason why this patch could help on this case.

3. Not sure why the non-prefetchable window becomes bigger on 3.16

I past the fragment of /proc/iomem from Marek.

3.15 good version

c0000000-ffffffff : PCI Bus 0000:00
  c0000000-cfffffff : PCI Bus 0000:01
    c0000000-cfffffff : 0000:01:00.0
  d0000000-d01fffff : PCI Bus 0000:08
  ddf00000-dfefffff : PCI Bus 0000:06
  e0000000-efffffff : PCI MMCONFIG 0000 [bus 00-ff]
    e0000000-efffffff : pnp 00:0b
  fdf00000-fdffffff : PCI Bus 0000:01
    fdfc0000-fdfdffff : 0000:01:00.0
    fdff0000-fdffffff : 0000:01:00.0

3.16 bad version

c0000000-ffffffff : PCI Bus 0000:00
  c0000000-c01fffff : PCI Bus 0000:01
  c0200000-c03fffff : PCI Bus 0000:08
  ddf00000-dfefffff : PCI Bus 0000:06
  e0000000-efffffff : PCI MMCONFIG 0000 [bus 00-ff]
    e0000000-efffffff : pnp 00:07
  fdf00000-fdffffff : PCI Bus 0000:01
    fdfc0000-fdfdffff : 0000:01:00.0
    fdff0000-fdffffff : 0000:01:00.0

On 3.15, one BAR in device 01:00.0 is allocated from c0000000-cffffffff. While
on 3.16, we need to allocate the same BAR from fdf0000000 range. But we see
this range keeps the same size as in 3.15. I don't catch the reason for this.

My thoughts is if we could have a bigger range in fdf00000, the non-prefetch
window in Bus#01, the Graphic card could work too.

>> But in my mind, if the bridge           
>> prefetchable window is 64bit, we can't put a 32bit prefetchable resource in
>> it.
>
>If the window is programmed to be above 4GB, of course we can't put a
>32-bit resource in it.  My point is that if the bridge *supports* a
>64-bit prefetchable window, we can decide where to place it.  If we
>put the window below 4GB, we can put a 32-bit prefetchable resource in
>it.

Yes, agree.

If the 64-bit prefetchable window contains some space under 4G, we could put
some 32-bit prefetchable resource in it. But this would bring more complexity
to the sizing algorithm. We need a separate value to track the size below 4G
in 64-bit prefetchable window, then to see who would have the luck to fit in.
And those can't the ticket would be sit in non-prefetchable window.

>
>I think maybe you're thinking of "64-bit window" as "a window
>programmed to be above 4GB."  I'm using "64-bit window" to mean "a
>window that supports 64-bit addressing," i.e., one where
>PCI_PREF_BASE_UPPER32 and PCI_PREF_LIMIT_UPPER32 are implemented.
>That's analogous to the way we talk about 64-bit BARs.  A 64-bit BAR
>is still a 64-bit BAR even if it is currently programmed to be below
>4GB.

Thanks for reminding. It makes me more clear.

Let me make my statement more clear.

If the 64-bit window is enabled, like the
PCI_PREF_BASE_UPPER32/PCI_PREF_BASE_UPPER32 is set. And this address is set
the M64 flag, which means the range contains space above 4G. In this case, we
can't put a 32-bit resource in this window.

If my understanding is not correct, please let me know.

>
>Bjorn
diff mbox

Patch

Index: linux-2.6/drivers/pci/setup-bus.c
===================================================================
--- linux-2.6.orig/drivers/pci/setup-bus.c
+++ linux-2.6/drivers/pci/setup-bus.c
@@ -1118,6 +1118,52 @@  handle_done:
 	;
 }
 
+static void pbus_check_mem64(struct pci_bus *bus, unsigned long mask,
+			 unsigned long type, struct list_head *realloc_head)
+{
+	struct pci_dev *dev;
+	resource_size_t align;
+	resource_size_t aligns[18];	/* Alignments from 1Mb to 128Gb */
+	int order;
+	unsigned int mem64_mask = 0;
+	struct resource *b_res = find_free_bus_resource(bus, mask, type);
+
+	if (!b_res || !(b_res->flags & IORESOURCE_MEM_64))
+		return;
+
+	memset(aligns, 0, sizeof(aligns));
+
+	mem64_mask = IORESOURCE_MEM_64;
+	b_res->flags &= ~IORESOURCE_MEM_64;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		int i;
+
+		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
+			struct resource *r = &dev->resource[i];
+
+			if (r->parent || (r->flags & mask) != type)
+				continue;
+#ifdef CONFIG_PCI_IOV
+			if (realloc_head && i >= PCI_IOV_RESOURCES &&
+					i <= PCI_IOV_RESOURCE_END)
+				continue;
+#endif
+			align = pci_resource_alignment(dev, r);
+			order = __ffs(align) - 20;
+			if (order < 0)
+				order = 0;
+			if (order >= ARRAY_SIZE(aligns))
+				continue;
+
+			if (i != PCI_ROM_RESOURCE)
+				mem64_mask &= r->flags & IORESOURCE_MEM_64;
+		}
+	}
+
+	b_res->flags |= mem64_mask;
+}
+
 void __pci_bus_size_bridges(struct pci_bus *bus, struct list_head *realloc_head)
 {
 	struct pci_dev *dev;
@@ -1171,6 +1217,7 @@  void __pci_bus_size_bridges(struct pci_b
 		b_res = &bus->self->resource[PCI_BRIDGE_RESOURCES];
 		mask = IORESOURCE_MEM;
 		prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH;
+		pbus_check_mem64(bus, prefmask, prefmask, realloc_head);
 		if (b_res[2].flags & IORESOURCE_MEM_64) {
 			prefmask |= IORESOURCE_MEM_64;
 			ret = pbus_size_mem(bus, prefmask, prefmask,