Message ID | 20201125101013.14953-5-nicoleotsuka@gmail.com |
---|---|
State | Deferred |
Headers | show |
Series | iommu/tegra-smmu: Some pending reviewed changes | expand |
Hi Nicolin, A regression was detected by kernelci.org in IGT's drm_read tests on mainline, it was first seen on 17th December 2020. You can find some details here: https://kernelci.org/test/case/id/600b82dc1e3208f123d3dffc/ Then an automated bisection was run and it landed on this patch (v5.10-rc3-4-g25938c73cd79 on mainline). Normally, an email is generated automatically but I had to start this one by hand as there were issues getting it to complete. You can see the failing test cases with this patch: https://lava.collabora.co.uk/results/3126405/0_igt-kms-tegra Some errors are seen around this point in the log: https://lava.collabora.co.uk/scheduler/job/3126405#L1005 [ 3.029729] tegra-mc 70019000.memory-controller: display0a: read @0xfe000000: EMEM address decode error (SMMU translation error [--S]) [ 3.042058] tegra-mc 70019000.memory-controller: display0a: read @0xfe000000: Page fault (SMMU translation error [--S]) Here's the same test passing with this patch reverted: https://lava.collabora.co.uk/results/3126570/0_igt-kms-tegra For completeness, you can see all the test jobs run by the automated bisection here: https://lava.collabora.co.uk/scheduler/device_type/tegra124-nyan-big?dt_length=25&dt_search=bisection-gtucker-12#dt_ Please let us know if you need any help debugging this issue or to try a fix on this platform. Best wishes, Guillaume On 25/11/2020 10:10, Nicolin Chen wrote: > The bus_set_iommu() in tegra_smmu_probe() enumerates all clients > to call in tegra_smmu_probe_device() where each client searches > its DT node for smmu pointer and swgroup ID, so as to configure > an fwspec. But this requires a valid smmu pointer even before mc > and smmu drivers are probed. So in tegra_smmu_probe() we added a > line of code to fill mc->smmu, marking "a bit of a hack". > > This works for most of clients in the DTB, however, doesn't work > for a client that doesn't exist in DTB, a PCI device for example. > > Actually, if we return ERR_PTR(-ENODEV) in ->probe_device() when > it's called from bus_set_iommu(), iommu core will let everything > carry on. Then when a client gets probed, of_iommu_configure() in > iommu core will search DTB for swgroup ID and call ->of_xlate() > to prepare an fwspec, similar to tegra_smmu_probe_device() and > tegra_smmu_configure(). Then it'll call tegra_smmu_probe_device() > again, and this time we shall return smmu->iommu pointer properly. > > So we can get rid of tegra_smmu_find() and tegra_smmu_configure() > along with DT polling code by letting the iommu core handle every > thing, except a problem that we search iommus property in DTB not > only for swgroup ID but also for mc node to get mc->smmu pointer > to call dev_iommu_priv_set() and return the smmu->iommu pointer. > So we'll need to find another way to get smmu pointer. > > Referencing the implementation of sun50i-iommu driver, of_xlate() > has client's dev pointer, mc node and swgroup ID. This means that > we can call dev_iommu_priv_set() in of_xlate() instead, so we can > simply get smmu pointer in ->probe_device(). > > This patch reworks tegra_smmu_probe_device() by: > 1) Removing mc->smmu hack in tegra_smmu_probe() so as to return > ERR_PTR(-ENODEV) in tegra_smmu_probe_device() during stage of > tegra_smmu_probe/tegra_mc_probe(). > 2) Moving dev_iommu_priv_set() to of_xlate() so we can get smmu > pointer in tegra_smmu_probe_device() to replace DTB polling. > 3) Removing tegra_smmu_configure() accordingly since iommu core > takes care of it. > > This also fixes a problem that previously we could add clients to > iommu groups before iommu core initializes its default domain: > ubuntu@jetson:~$ dmesg | grep iommu > platform 50000000.host1x: Adding to iommu group 1 > platform 57000000.gpu: Adding to iommu group 2 > iommu: Default domain type: Translated > platform 54200000.dc: Adding to iommu group 3 > platform 54240000.dc: Adding to iommu group 3 > platform 54340000.vic: Adding to iommu group 4 > > Though it works fine with IOMMU_DOMAIN_UNMANAGED, but will have > warnings if switching to IOMMU_DOMAIN_DMA: > iommu: Failed to allocate default IOMMU domain of type 0 for > group (null) - Falling back to IOMMU_DOMAIN_DMA > iommu: Failed to allocate default IOMMU domain of type 0 for > group (null) - Falling back to IOMMU_DOMAIN_DMA > > Now, bypassing the first probe_device() call from bus_set_iommu() > fixes the sequence: > ubuntu@jetson:~$ dmesg | grep iommu > iommu: Default domain type: Translated > tegra-host1x 50000000.host1x: Adding to iommu group 0 > tegra-dc 54200000.dc: Adding to iommu group 1 > tegra-dc 54240000.dc: Adding to iommu group 1 > tegra-vic 54340000.vic: Adding to iommu group 2 > nouveau 57000000.gpu: Adding to iommu group 3 > > Note that dmesg log above is testing with IOMMU_DOMAIN_UNMANAGED. > > Reviewed-by: Dmitry Osipenko <digetx@gmail.com> > Tested-by: Dmitry Osipenko <digetx@gmail.com> > Acked-by: Thierry Reding <treding@nvidia.com> > Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com> > --- > drivers/iommu/tegra-smmu.c | 96 ++++++-------------------------------- > 1 file changed, 15 insertions(+), 81 deletions(-) > > diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c > index 297d49f3f80e..f45ed43cf8db 100644 > --- a/drivers/iommu/tegra-smmu.c > +++ b/drivers/iommu/tegra-smmu.c > @@ -797,75 +797,9 @@ static phys_addr_t tegra_smmu_iova_to_phys(struct iommu_domain *domain, > return SMMU_PFN_PHYS(pfn) + SMMU_OFFSET_IN_PAGE(iova); > } > > -static struct tegra_smmu *tegra_smmu_find(struct device_node *np) > -{ > - struct platform_device *pdev; > - struct tegra_mc *mc; > - > - pdev = of_find_device_by_node(np); > - if (!pdev) > - return NULL; > - > - mc = platform_get_drvdata(pdev); > - if (!mc) > - return NULL; > - > - return mc->smmu; > -} > - > -static int tegra_smmu_configure(struct tegra_smmu *smmu, struct device *dev, > - struct of_phandle_args *args) > -{ > - const struct iommu_ops *ops = smmu->iommu.ops; > - int err; > - > - err = iommu_fwspec_init(dev, &dev->of_node->fwnode, ops); > - if (err < 0) { > - dev_err(dev, "failed to initialize fwspec: %d\n", err); > - return err; > - } > - > - err = ops->of_xlate(dev, args); > - if (err < 0) { > - dev_err(dev, "failed to parse SW group ID: %d\n", err); > - iommu_fwspec_free(dev); > - return err; > - } > - > - return 0; > -} > - > static struct iommu_device *tegra_smmu_probe_device(struct device *dev) > { > - struct device_node *np = dev->of_node; > - struct tegra_smmu *smmu = NULL; > - struct of_phandle_args args; > - unsigned int index = 0; > - int err; > - > - while (of_parse_phandle_with_args(np, "iommus", "#iommu-cells", index, > - &args) == 0) { > - smmu = tegra_smmu_find(args.np); > - if (smmu) { > - err = tegra_smmu_configure(smmu, dev, &args); > - of_node_put(args.np); > - > - if (err < 0) > - return ERR_PTR(err); > - > - /* > - * Only a single IOMMU master interface is currently > - * supported by the Linux kernel, so abort after the > - * first match. > - */ > - dev_iommu_priv_set(dev, smmu); > - > - break; > - } > - > - of_node_put(args.np); > - index++; > - } > + struct tegra_smmu *smmu = dev_iommu_priv_get(dev); > > if (!smmu) > return ERR_PTR(-ENODEV); > @@ -873,10 +807,7 @@ static struct iommu_device *tegra_smmu_probe_device(struct device *dev) > return &smmu->iommu; > } > > -static void tegra_smmu_release_device(struct device *dev) > -{ > - dev_iommu_priv_set(dev, NULL); > -} > +static void tegra_smmu_release_device(struct device *dev) {} > > static const struct tegra_smmu_group_soc * > tegra_smmu_find_group(struct tegra_smmu *smmu, unsigned int swgroup) > @@ -953,8 +884,21 @@ static struct iommu_group *tegra_smmu_device_group(struct device *dev) > static int tegra_smmu_of_xlate(struct device *dev, > struct of_phandle_args *args) > { > + struct platform_device *iommu_pdev = of_find_device_by_node(args->np); > + struct tegra_mc *mc = platform_get_drvdata(iommu_pdev); > u32 id = args->args[0]; > > + /* > + * Note: we are here releasing the reference of &iommu_pdev->dev, which > + * is mc->dev. Although some functions in tegra_smmu_ops may keep using > + * its private data beyond this point, it's still safe to do so because > + * the SMMU parent device is the same as the MC, so the reference count > + * isn't strictly necessary. > + */ > + put_device(&iommu_pdev->dev); > + > + dev_iommu_priv_set(dev, mc->smmu); > + > return iommu_fwspec_add_ids(dev, &id, 1); > } > > @@ -1079,16 +1023,6 @@ struct tegra_smmu *tegra_smmu_probe(struct device *dev, > if (!smmu) > return ERR_PTR(-ENOMEM); > > - /* > - * This is a bit of a hack. Ideally we'd want to simply return this > - * value. However the IOMMU registration process will attempt to add > - * all devices to the IOMMU when bus_set_iommu() is called. In order > - * not to rely on global variables to track the IOMMU instance, we > - * set it here so that it can be looked up from the .probe_device() > - * callback via the IOMMU device's .drvdata field. > - */ > - mc->smmu = smmu; > - > size = BITS_TO_LONGS(soc->num_asids) * sizeof(long); > > smmu->asids = devm_kzalloc(dev, size, GFP_KERNEL); >
Hi Guillaume, On Thu, Feb 04, 2021 at 11:10:15AM +0000, Guillaume Tucker wrote: > Hi Nicolin, > > A regression was detected by kernelci.org in IGT's drm_read tests > on mainline, it was first seen on 17th December 2020. You can > find some details here: > > https://kernelci.org/test/case/id/600b82dc1e3208f123d3dffc/ Thanks for reporting the issue. We did test on Tegra210 and Tegra30 yet not on Tegra124. I am wondering what could go wrong... > Please let us know if you need any help debugging this issue or > to try a fix on this platform. Yes, I don't have any Tegra124 platform to run. It'd be very nice if you can run some debugging patch (I can provide you) and a fix after I root cause the issue. Thanks Nicolin
Hi Guillaume, On Thu, Feb 04, 2021 at 09:24:23PM -0800, Nicolin Chen wrote: > > Please let us know if you need any help debugging this issue or > > to try a fix on this platform. > > Yes, I don't have any Tegra124 platform to run. It'd be very nice > if you can run some debugging patch (I can provide you) and a fix > after I root cause the issue. Would it be possible for you to run with the given debugging patch? It'd be nicer if I can get both logs of the vanilla kernel (failing) and the commit-reverted version (passing), each applying this patch. Thanks in advance! Nicolin
On 05/02/2021 09:45, Nicolin Chen wrote: > Hi Guillaume, > > On Thu, Feb 04, 2021 at 09:24:23PM -0800, Nicolin Chen wrote: >>> Please let us know if you need any help debugging this issue or >>> to try a fix on this platform. >> >> Yes, I don't have any Tegra124 platform to run. It'd be very nice >> if you can run some debugging patch (I can provide you) and a fix >> after I root cause the issue. > > Would it be possible for you to run with the given debugging patch? > > It'd be nicer if I can get both logs of the vanilla kernel (failing) > and the commit-reverted version (passing), each applying this patch. Sure, I've run 3 jobs: * v5.11-rc6 as a reference, to see the original issue: https://lava.collabora.co.uk/scheduler/job/3187848 * + your debug patch: https://lava.collabora.co.uk/scheduler/job/3187849 * + the "breaking" commit reverted, passing the tests: https://lava.collabora.co.uk/scheduler/job/3187851 You can see the history of the test branch I'm using here, with the 3 revisions mentioned above: https://gitlab.collabora.com/gtucker/linux/-/commits/linux-5.11-rc6-nyan-big-drm-read/ Hope that helps, Guillaume
Hi Guillaume, On Sat, Feb 06, 2021 at 01:40:13PM +0000, Guillaume Tucker wrote: > > It'd be nicer if I can get both logs of the vanilla kernel (failing) > > and the commit-reverted version (passing), each applying this patch. > > Sure, I've run 3 jobs: > > * v5.11-rc6 as a reference, to see the original issue: > https://lava.collabora.co.uk/scheduler/job/3187848 > > * + your debug patch: > https://lava.collabora.co.uk/scheduler/job/3187849 > > * + the "breaking" commit reverted, passing the tests: > https://lava.collabora.co.uk/scheduler/job/3187851 Thanks for the help! I am able to figure out what's probably wrong, yet not so sure about the best solution at this point. Would it be possible for you to run one more time with another debugging patch? I'd like to see the same logs as previous: 1. Vanilla kernel + debug patch 2. Vanilla kernel + Reverted + debug patch Thank you Nicolin
On 10/02/2021 08:20, Nicolin Chen wrote: > Hi Guillaume, > > On Sat, Feb 06, 2021 at 01:40:13PM +0000, Guillaume Tucker wrote: >>> It'd be nicer if I can get both logs of the vanilla kernel (failing) >>> and the commit-reverted version (passing), each applying this patch. >> >> Sure, I've run 3 jobs: >> >> * v5.11-rc6 as a reference, to see the original issue: >> https://lava.collabora.co.uk/scheduler/job/3187848 >> >> * + your debug patch: >> https://lava.collabora.co.uk/scheduler/job/3187849 >> >> * + the "breaking" commit reverted, passing the tests: >> https://lava.collabora.co.uk/scheduler/job/3187851 > > Thanks for the help! > > I am able to figure out what's probably wrong, yet not so sure > about the best solution at this point. > > Would it be possible for you to run one more time with another > debugging patch? I'd like to see the same logs as previous: > 1. Vanilla kernel + debug patch > 2. Vanilla kernel + Reverted + debug patch As it turns out, next-20210210 is passing all the tests again so it looks like this got fixed in the meantime: https://lava.collabora.co.uk/scheduler/job/3210192 https://lava.collabora.co.uk/results/3210192/0_igt-kms-tegra And here's a more extensive list of IGT tests on next-20210211, all the regressions have been fixed: https://kernelci.org/test/plan/id/60254c42f51df36be53abe62/ I haven't run a reversed bisection to find the fix, but I guess it wouldn't be too hard to find out what happened by hand anyway. I see the drm/tegra/for-5.12-rc1 tag has been merged into linux-next, maybe that solved the issue? FYI I've also run some jobs with your debug patch and with the breaking patch reverted: https://lava.collabora.co.uk/scheduler/job/3210245 https://lava.collabora.co.uk/scheduler/job/3210596 Meanwhile I'll see what can be done to improve the automated bisection so if there are new IGT regressions they would get reported earlier. I guess it would have saved us all some time if it had been bisected in December. Thanks, Guillaume
Hi Guillaume, Thank you for the test results! And sorry for my belated reply. On Thu, Feb 11, 2021 at 03:50:05PM +0000, Guillaume Tucker wrote: > > On Sat, Feb 06, 2021 at 01:40:13PM +0000, Guillaume Tucker wrote: > >>> It'd be nicer if I can get both logs of the vanilla kernel (failing) > >>> and the commit-reverted version (passing), each applying this patch. > >> > >> Sure, I've run 3 jobs: > >> > >> * v5.11-rc6 as a reference, to see the original issue: > >> https://lava.collabora.co.uk/scheduler/job/3187848 > >> > >> * + your debug patch: > >> https://lava.collabora.co.uk/scheduler/job/3187849 > >> > >> * + the "breaking" commit reverted, passing the tests: > >> https://lava.collabora.co.uk/scheduler/job/3187851 > > > > Thanks for the help! > > > > I am able to figure out what's probably wrong, yet not so sure > > about the best solution at this point. > > > > Would it be possible for you to run one more time with another > > debugging patch? I'd like to see the same logs as previous: > > 1. Vanilla kernel + debug patch > > 2. Vanilla kernel + Reverted + debug patch > > As it turns out, next-20210210 is passing all the tests again so > it looks like this got fixed in the meantime: > > https://lava.collabora.co.uk/scheduler/job/3210192 I checked this passing log, however, found that the regression is still there though test passed, as the prints below aren't normal: tegra-mc 70019000.memory-controller: display0a: read @0xfe056b40: EMEM address decode error (SMMU translation error [--S]) tegra-mc 70019000.memory-controller: display0a: read @0xfe056b40: Page fault (SMMU translation error [--S]) I was trying to think of a simpler solution than a revert. However, given the fact that the callback sequence could change -- guessing likely a recent change in iommu core, I feel it safer to revert my previous change, not necessarily being a complete revert though. I attached my partial reverting change in this email. Would it be possible for you to run one more test for me to confirm it? It'd keep the tests passing while eliminating all error prints above. If the fix works, I'll re-send it to mail list by adding a commit message. Thanks! Nicolin
On 18/02/2021 10:35, Nicolin Chen wrote: > Hi Guillaume, > > Thank you for the test results! And sorry for my belated reply. No worries :) > On Thu, Feb 11, 2021 at 03:50:05PM +0000, Guillaume Tucker wrote: >>> On Sat, Feb 06, 2021 at 01:40:13PM +0000, Guillaume Tucker wrote: >>>>> It'd be nicer if I can get both logs of the vanilla kernel (failing) >>>>> and the commit-reverted version (passing), each applying this patch. >>>> >>>> Sure, I've run 3 jobs: >>>> >>>> * v5.11-rc6 as a reference, to see the original issue: >>>> https://lava.collabora.co.uk/scheduler/job/3187848 >>>> >>>> * + your debug patch: >>>> https://lava.collabora.co.uk/scheduler/job/3187849 >>>> >>>> * + the "breaking" commit reverted, passing the tests: >>>> https://lava.collabora.co.uk/scheduler/job/3187851 >>> >>> Thanks for the help! >>> >>> I am able to figure out what's probably wrong, yet not so sure >>> about the best solution at this point. >>> >>> Would it be possible for you to run one more time with another >>> debugging patch? I'd like to see the same logs as previous: >>> 1. Vanilla kernel + debug patch >>> 2. Vanilla kernel + Reverted + debug patch >> >> As it turns out, next-20210210 is passing all the tests again so >> it looks like this got fixed in the meantime: >> >> https://lava.collabora.co.uk/scheduler/job/3210192 > > I checked this passing log, however, found that the regression is > still there though test passed, as the prints below aren't normal: > tegra-mc 70019000.memory-controller: display0a: read @0xfe056b40: > EMEM address decode error (SMMU translation error [--S]) > tegra-mc 70019000.memory-controller: display0a: read @0xfe056b40: > Page fault (SMMU translation error [--S]) Ah yes sorry, there are other KernelCI checks for kernel errors but that wasn't enabled in the bisection so I didn't notice them. > I was trying to think of a simpler solution than a revert. However, > given the fact that the callback sequence could change -- guessing > likely a recent change in iommu core, I feel it safer to revert my > previous change, not necessarily being a complete revert though. > > I attached my partial reverting change in this email. Would it be > possible for you to run one more test for me to confirm it? It'd > keep the tests passing while eliminating all error prints above. > > If the fix works, I'll re-send it to mail list by adding a commit > message. Sure, here's next-20210218 as a reference: https://lava.collabora.co.uk/scheduler/job/3241236 and here with your patch applied on top of it: https://lava.collabora.co.uk/scheduler/job/3241246 The git branch I've used where your patch is applied: https://gitlab.collabora.com/gtucker/linux/-/commits/next-20210218-nyan-big-drm-read/ The errors seem to have disappeared but I'll let you double check that things are all back to a working state. BTW: This thread is a good example of how having an "on-demand" KernelCI service to let developers re-run tests with extra patches would allow them to fix issues independently. We'll keep that in mind for the future. Best wishes, Guillaume
diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c index 297d49f3f80e..f45ed43cf8db 100644 --- a/drivers/iommu/tegra-smmu.c +++ b/drivers/iommu/tegra-smmu.c @@ -797,75 +797,9 @@ static phys_addr_t tegra_smmu_iova_to_phys(struct iommu_domain *domain, return SMMU_PFN_PHYS(pfn) + SMMU_OFFSET_IN_PAGE(iova); } -static struct tegra_smmu *tegra_smmu_find(struct device_node *np) -{ - struct platform_device *pdev; - struct tegra_mc *mc; - - pdev = of_find_device_by_node(np); - if (!pdev) - return NULL; - - mc = platform_get_drvdata(pdev); - if (!mc) - return NULL; - - return mc->smmu; -} - -static int tegra_smmu_configure(struct tegra_smmu *smmu, struct device *dev, - struct of_phandle_args *args) -{ - const struct iommu_ops *ops = smmu->iommu.ops; - int err; - - err = iommu_fwspec_init(dev, &dev->of_node->fwnode, ops); - if (err < 0) { - dev_err(dev, "failed to initialize fwspec: %d\n", err); - return err; - } - - err = ops->of_xlate(dev, args); - if (err < 0) { - dev_err(dev, "failed to parse SW group ID: %d\n", err); - iommu_fwspec_free(dev); - return err; - } - - return 0; -} - static struct iommu_device *tegra_smmu_probe_device(struct device *dev) { - struct device_node *np = dev->of_node; - struct tegra_smmu *smmu = NULL; - struct of_phandle_args args; - unsigned int index = 0; - int err; - - while (of_parse_phandle_with_args(np, "iommus", "#iommu-cells", index, - &args) == 0) { - smmu = tegra_smmu_find(args.np); - if (smmu) { - err = tegra_smmu_configure(smmu, dev, &args); - of_node_put(args.np); - - if (err < 0) - return ERR_PTR(err); - - /* - * Only a single IOMMU master interface is currently - * supported by the Linux kernel, so abort after the - * first match. - */ - dev_iommu_priv_set(dev, smmu); - - break; - } - - of_node_put(args.np); - index++; - } + struct tegra_smmu *smmu = dev_iommu_priv_get(dev); if (!smmu) return ERR_PTR(-ENODEV); @@ -873,10 +807,7 @@ static struct iommu_device *tegra_smmu_probe_device(struct device *dev) return &smmu->iommu; } -static void tegra_smmu_release_device(struct device *dev) -{ - dev_iommu_priv_set(dev, NULL); -} +static void tegra_smmu_release_device(struct device *dev) {} static const struct tegra_smmu_group_soc * tegra_smmu_find_group(struct tegra_smmu *smmu, unsigned int swgroup) @@ -953,8 +884,21 @@ static struct iommu_group *tegra_smmu_device_group(struct device *dev) static int tegra_smmu_of_xlate(struct device *dev, struct of_phandle_args *args) { + struct platform_device *iommu_pdev = of_find_device_by_node(args->np); + struct tegra_mc *mc = platform_get_drvdata(iommu_pdev); u32 id = args->args[0]; + /* + * Note: we are here releasing the reference of &iommu_pdev->dev, which + * is mc->dev. Although some functions in tegra_smmu_ops may keep using + * its private data beyond this point, it's still safe to do so because + * the SMMU parent device is the same as the MC, so the reference count + * isn't strictly necessary. + */ + put_device(&iommu_pdev->dev); + + dev_iommu_priv_set(dev, mc->smmu); + return iommu_fwspec_add_ids(dev, &id, 1); } @@ -1079,16 +1023,6 @@ struct tegra_smmu *tegra_smmu_probe(struct device *dev, if (!smmu) return ERR_PTR(-ENOMEM); - /* - * This is a bit of a hack. Ideally we'd want to simply return this - * value. However the IOMMU registration process will attempt to add - * all devices to the IOMMU when bus_set_iommu() is called. In order - * not to rely on global variables to track the IOMMU instance, we - * set it here so that it can be looked up from the .probe_device() - * callback via the IOMMU device's .drvdata field. - */ - mc->smmu = smmu; - size = BITS_TO_LONGS(soc->num_asids) * sizeof(long); smmu->asids = devm_kzalloc(dev, size, GFP_KERNEL);