[6/9] drm/tegra: gem: Add a clarifying comment

Message ID	20210323155437.513497-7-thierry.reding@gmail.com
State	Changes Requested
Headers	show Return-Path: <linux-tegra-owner@vger.kernel.org> From: Thierry Reding <thierry.reding@gmail.com> To: Thierry Reding <thierry.reding@gmail.com> Cc: David Airlie <airlied@linux.ie>, Daniel Vetter <daniel@ffwll.ch>, Maarten Lankhorst <maarten.lankhorst@linux.intel.com>, Maxime Ripard <mripard@kernel.org>, Thomas Zimmermann <tzimmermann@suse.de>, James Jones <jajones@nvidia.com>, dri-devel@lists.freedesktop.org, linux-tegra@vger.kernel.org Subject: [PATCH 6/9] drm/tegra: gem: Add a clarifying comment Date: Tue, 23 Mar 2021 16:54:34 +0100 Message-Id: <20210323155437.513497-7-thierry.reding@gmail.com> In-Reply-To: <20210323155437.513497-1-thierry.reding@gmail.com> References: <20210323155437.513497-1-thierry.reding@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	drm/tegra: Various improvements \| expand [0/9] drm/tegra: Various improvements [1/9] drm/fourcc: Add macro to check for the modifier vendor [2/9] drm/tegra: dc: Inherit DMA mask [3/9] drm/tegra: dc: Parameterize maximum resolution [4/9] drm/tegra: dc: Implement hardware cursor on Tegra186 and later [5/9] drm/tegra: fb: Add diagnostics for framebuffer modifiers [6/9] drm/tegra: gem: Add a clarifying comment [7/9] gpu: host1x: Add early init and late exit callbacks [8/9] drm/tegra: Count number of display controllers at runtime [9/9] drm/tegra: Support sector layout on Tegra194

Thierry Reding March 23, 2021, 3:54 p.m. UTC

From: Thierry Reding <treding@nvidia.com>

Clarify when a fixed IOV address can be used and when a buffer has to
be mapped before the IOVA can be used.

Signed-off-by: Thierry Reding <treding@nvidia.com>
---
 drivers/gpu/drm/tegra/plane.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Dmitry Osipenko March 24, 2021, 2:41 p.m. UTC | #1

23.03.2021 18:54, Thierry Reding пишет:
> From: Thierry Reding <treding@nvidia.com>
> 
> Clarify when a fixed IOV address can be used and when a buffer has to
> be mapped before the IOVA can be used.
> 
> Signed-off-by: Thierry Reding <treding@nvidia.com>
> ---
>  drivers/gpu/drm/tegra/plane.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
> index 19e8847a164b..793da5d675d2 100644
> --- a/drivers/gpu/drm/tegra/plane.c
> +++ b/drivers/gpu/drm/tegra/plane.c
> @@ -119,6 +119,14 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
>  		dma_addr_t phys_addr, *phys;
>  		struct sg_table *sgt;
>  
> +		/*
> +		 * If we're not attached to a domain, we already stored the
> +		 * physical address when the buffer was allocated. If we're
> +		 * part of a group that's shared between all display
> +		 * controllers, we've also already mapped the framebuffer
> +		 * through the SMMU. In both cases we can short-circuit the
> +		 * code below and retrieve the stored IOV address.
> +		 */
>  		if (!domain || dc->client.group)
>  			phys = &phys_addr;
>  		else
> 

This comment is correct, but the logic feels a bit lame because it
should be wasteful to re-map DMA on each FB flip. Personally I don't
care much about this since older Tegras use pinned buffers by default,
but this shouldn't be good for T124+ users.

Perhaps dumb buffers should be pinned to display by default and then we
should extend the Tegra UAPI to support BO mapping to display client(?).

Thierry Reding March 24, 2021, 3:02 p.m. UTC | #2

On Wed, Mar 24, 2021 at 05:41:08PM +0300, Dmitry Osipenko wrote:
> 23.03.2021 18:54, Thierry Reding пишет:
> > From: Thierry Reding <treding@nvidia.com>
> > 
> > Clarify when a fixed IOV address can be used and when a buffer has to
> > be mapped before the IOVA can be used.
> > 
> > Signed-off-by: Thierry Reding <treding@nvidia.com>
> > ---
> >  drivers/gpu/drm/tegra/plane.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
> > index 19e8847a164b..793da5d675d2 100644
> > --- a/drivers/gpu/drm/tegra/plane.c
> > +++ b/drivers/gpu/drm/tegra/plane.c
> > @@ -119,6 +119,14 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
> >  		dma_addr_t phys_addr, *phys;
> >  		struct sg_table *sgt;
> >  
> > +		/*
> > +		 * If we're not attached to a domain, we already stored the
> > +		 * physical address when the buffer was allocated. If we're
> > +		 * part of a group that's shared between all display
> > +		 * controllers, we've also already mapped the framebuffer
> > +		 * through the SMMU. In both cases we can short-circuit the
> > +		 * code below and retrieve the stored IOV address.
> > +		 */
> >  		if (!domain || dc->client.group)
> >  			phys = &phys_addr;
> >  		else
> > 
> 
> This comment is correct, but the logic feels a bit lame because it
> should be wasteful to re-map DMA on each FB flip. Personally I don't
> care much about this since older Tegras use pinned buffers by default,
> but this shouldn't be good for T124+ users.

I'm not terribly thrilled by this either, but it's the only way to do
this when using the DMA API because we don't know at allocation time (or
import time for that matter) which of the (up to) 4 display controllers
a framebuffer will be shown on. tegra_dc_pin() is the earliest where
this is known and worst case that's called once per flip.

When the IOMMU API is used explicitly, we always map framebuffers into
the IOMMU domain shared by all display controllers at allocation or
import time and then we don't need to pin at flip time anymore.

I do have a work-in-progress patch somewhere that creates a mapping
cache to mitigate this problem to some degree. I need to dig that up and
do a few measurements because I vaguely recall this speeding up flips by
quite a bit (well, except for the very first mapping, obviously).

> Perhaps dumb buffers should be pinned to display by default and then we
> should extend the Tegra UAPI to support BO mapping to display client(?).

That would kind of defeat the purpose of a generic KMS UAPI.

Thierry

Dmitry Osipenko March 24, 2021, 3:45 p.m. UTC | #3

24.03.2021 18:02, Thierry Reding пишет:
> On Wed, Mar 24, 2021 at 05:41:08PM +0300, Dmitry Osipenko wrote:
>> 23.03.2021 18:54, Thierry Reding пишет:
>>> From: Thierry Reding <treding@nvidia.com>
>>>
>>> Clarify when a fixed IOV address can be used and when a buffer has to
>>> be mapped before the IOVA can be used.
>>>
>>> Signed-off-by: Thierry Reding <treding@nvidia.com>
>>> ---
>>>  drivers/gpu/drm/tegra/plane.c | 8 ++++++++
>>>  1 file changed, 8 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
>>> index 19e8847a164b..793da5d675d2 100644
>>> --- a/drivers/gpu/drm/tegra/plane.c
>>> +++ b/drivers/gpu/drm/tegra/plane.c
>>> @@ -119,6 +119,14 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
>>>  		dma_addr_t phys_addr, *phys;
>>>  		struct sg_table *sgt;
>>>  
>>> +		/*
>>> +		 * If we're not attached to a domain, we already stored the
>>> +		 * physical address when the buffer was allocated. If we're
>>> +		 * part of a group that's shared between all display
>>> +		 * controllers, we've also already mapped the framebuffer
>>> +		 * through the SMMU. In both cases we can short-circuit the
>>> +		 * code below and retrieve the stored IOV address.
>>> +		 */
>>>  		if (!domain || dc->client.group)
>>>  			phys = &phys_addr;
>>>  		else
>>>
>>
>> This comment is correct, but the logic feels a bit lame because it
>> should be wasteful to re-map DMA on each FB flip. Personally I don't
>> care much about this since older Tegras use pinned buffers by default,
>> but this shouldn't be good for T124+ users.
> 
> I'm not terribly thrilled by this either, but it's the only way to do
> this when using the DMA API because we don't know at allocation time (or
> import time for that matter) which of the (up to) 4 display controllers
> a framebuffer will be shown on. tegra_dc_pin() is the earliest where
> this is known and worst case that's called once per flip.
> 
> When the IOMMU API is used explicitly, we always map framebuffers into
> the IOMMU domain shared by all display controllers at allocation or
> import time and then we don't need to pin at flip time anymore.
> 
> I do have a work-in-progress patch somewhere that creates a mapping
> cache to mitigate this problem to some degree. I need to dig that up and
> do a few measurements because I vaguely recall this speeding up flips by
> quite a bit (well, except for the very first mapping, obviously).
> 
>> Perhaps dumb buffers should be pinned to display by default and then we
>> should extend the Tegra UAPI to support BO mapping to display client(?).
> 
> That would kind of defeat the purpose of a generic KMS UAPI.

Couldn't the BOs be mapped when FB is created, i.e. by tegra_fb_create?

Thierry Reding March 24, 2021, 4:42 p.m. UTC | #4

On Wed, Mar 24, 2021 at 06:45:30PM +0300, Dmitry Osipenko wrote:
> 24.03.2021 18:02, Thierry Reding пишет:
> > On Wed, Mar 24, 2021 at 05:41:08PM +0300, Dmitry Osipenko wrote:
> >> 23.03.2021 18:54, Thierry Reding пишет:
> >>> From: Thierry Reding <treding@nvidia.com>
> >>>
> >>> Clarify when a fixed IOV address can be used and when a buffer has to
> >>> be mapped before the IOVA can be used.
> >>>
> >>> Signed-off-by: Thierry Reding <treding@nvidia.com>
> >>> ---
> >>>  drivers/gpu/drm/tegra/plane.c | 8 ++++++++
> >>>  1 file changed, 8 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
> >>> index 19e8847a164b..793da5d675d2 100644
> >>> --- a/drivers/gpu/drm/tegra/plane.c
> >>> +++ b/drivers/gpu/drm/tegra/plane.c
> >>> @@ -119,6 +119,14 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
> >>>  		dma_addr_t phys_addr, *phys;
> >>>  		struct sg_table *sgt;
> >>>  
> >>> +		/*
> >>> +		 * If we're not attached to a domain, we already stored the
> >>> +		 * physical address when the buffer was allocated. If we're
> >>> +		 * part of a group that's shared between all display
> >>> +		 * controllers, we've also already mapped the framebuffer
> >>> +		 * through the SMMU. In both cases we can short-circuit the
> >>> +		 * code below and retrieve the stored IOV address.
> >>> +		 */
> >>>  		if (!domain || dc->client.group)
> >>>  			phys = &phys_addr;
> >>>  		else
> >>>
> >>
> >> This comment is correct, but the logic feels a bit lame because it
> >> should be wasteful to re-map DMA on each FB flip. Personally I don't
> >> care much about this since older Tegras use pinned buffers by default,
> >> but this shouldn't be good for T124+ users.
> > 
> > I'm not terribly thrilled by this either, but it's the only way to do
> > this when using the DMA API because we don't know at allocation time (or
> > import time for that matter) which of the (up to) 4 display controllers
> > a framebuffer will be shown on. tegra_dc_pin() is the earliest where
> > this is known and worst case that's called once per flip.
> > 
> > When the IOMMU API is used explicitly, we always map framebuffers into
> > the IOMMU domain shared by all display controllers at allocation or
> > import time and then we don't need to pin at flip time anymore.
> > 
> > I do have a work-in-progress patch somewhere that creates a mapping
> > cache to mitigate this problem to some degree. I need to dig that up and
> > do a few measurements because I vaguely recall this speeding up flips by
> > quite a bit (well, except for the very first mapping, obviously).
> > 
> >> Perhaps dumb buffers should be pinned to display by default and then we
> >> should extend the Tegra UAPI to support BO mapping to display client(?).
> > 
> > That would kind of defeat the purpose of a generic KMS UAPI.
> 
> Couldn't the BOs be mapped when FB is created, i.e. by tegra_fb_create?

I suppose that would be possible. However, tegra_fb_create() doesn't
know a thing about display controllers, so we'd have to add extra code
to it to iterate over all display controllers and do a dma_map_sg() of
the GEM object for each of them.

It's also somewhat wasteful because now we get a mapping for each
framebuffer for each display controller. So if you've got, say, a four
UHD screen setup (which is something that Tegra194 supports), you could
end up with 8 UHD framebuffers (two for each display, for double-
buffering) at 32 MiB each for a whopping 256 MiB of memory that needs to
be mapped for each of the four display controllers. That 1 GiB worth of
page table updates, whereas you really only need one fourth of that.

Granted, this will make flipping a bit faster, and IOVA space isn't
really a problem on Tegra194. It would still waste a bit of RAM for all
those page table entries that we don't really need, though.

A mapping cache seems like a much better compromise because the cache
lookup should be quite fast compared to a mapping operation and we waste
just a couple dozen bytes per mapping perhaps as opposed to a few
megabytes for the gratuitous, preemptive mappings.

Thierry

Dmitry Osipenko March 24, 2021, 4:50 p.m. UTC | #5

24.03.2021 19:42, Thierry Reding пишет:
> On Wed, Mar 24, 2021 at 06:45:30PM +0300, Dmitry Osipenko wrote:
>> 24.03.2021 18:02, Thierry Reding пишет:
>>> On Wed, Mar 24, 2021 at 05:41:08PM +0300, Dmitry Osipenko wrote:
>>>> 23.03.2021 18:54, Thierry Reding пишет:
>>>>> From: Thierry Reding <treding@nvidia.com>
>>>>>
>>>>> Clarify when a fixed IOV address can be used and when a buffer has to
>>>>> be mapped before the IOVA can be used.
>>>>>
>>>>> Signed-off-by: Thierry Reding <treding@nvidia.com>
>>>>> ---
>>>>>  drivers/gpu/drm/tegra/plane.c | 8 ++++++++
>>>>>  1 file changed, 8 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
>>>>> index 19e8847a164b..793da5d675d2 100644
>>>>> --- a/drivers/gpu/drm/tegra/plane.c
>>>>> +++ b/drivers/gpu/drm/tegra/plane.c
>>>>> @@ -119,6 +119,14 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
>>>>>  		dma_addr_t phys_addr, *phys;
>>>>>  		struct sg_table *sgt;
>>>>>  
>>>>> +		/*
>>>>> +		 * If we're not attached to a domain, we already stored the
>>>>> +		 * physical address when the buffer was allocated. If we're
>>>>> +		 * part of a group that's shared between all display
>>>>> +		 * controllers, we've also already mapped the framebuffer
>>>>> +		 * through the SMMU. In both cases we can short-circuit the
>>>>> +		 * code below and retrieve the stored IOV address.
>>>>> +		 */
>>>>>  		if (!domain || dc->client.group)
>>>>>  			phys = &phys_addr;
>>>>>  		else
>>>>>
>>>>
>>>> This comment is correct, but the logic feels a bit lame because it
>>>> should be wasteful to re-map DMA on each FB flip. Personally I don't
>>>> care much about this since older Tegras use pinned buffers by default,
>>>> but this shouldn't be good for T124+ users.
>>>
>>> I'm not terribly thrilled by this either, but it's the only way to do
>>> this when using the DMA API because we don't know at allocation time (or
>>> import time for that matter) which of the (up to) 4 display controllers
>>> a framebuffer will be shown on. tegra_dc_pin() is the earliest where
>>> this is known and worst case that's called once per flip.
>>>
>>> When the IOMMU API is used explicitly, we always map framebuffers into
>>> the IOMMU domain shared by all display controllers at allocation or
>>> import time and then we don't need to pin at flip time anymore.
>>>
>>> I do have a work-in-progress patch somewhere that creates a mapping
>>> cache to mitigate this problem to some degree. I need to dig that up and
>>> do a few measurements because I vaguely recall this speeding up flips by
>>> quite a bit (well, except for the very first mapping, obviously).
>>>
>>>> Perhaps dumb buffers should be pinned to display by default and then we
>>>> should extend the Tegra UAPI to support BO mapping to display client(?).
>>>
>>> That would kind of defeat the purpose of a generic KMS UAPI.
>>
>> Couldn't the BOs be mapped when FB is created, i.e. by tegra_fb_create?
> 
> I suppose that would be possible. However, tegra_fb_create() doesn't
> know a thing about display controllers, so we'd have to add extra code
> to it to iterate over all display controllers and do a dma_map_sg() of
> the GEM object for each of them.
> 
> It's also somewhat wasteful because now we get a mapping for each
> framebuffer for each display controller. So if you've got, say, a four
> UHD screen setup (which is something that Tegra194 supports), you could
> end up with 8 UHD framebuffers (two for each display, for double-
> buffering) at 32 MiB each for a whopping 256 MiB of memory that needs to
> be mapped for each of the four display controllers. That 1 GiB worth of
> page table updates, whereas you really only need one fourth of that.
> 
> Granted, this will make flipping a bit faster, and IOVA space isn't
> really a problem on Tegra194. It would still waste a bit of RAM for all
> those page table entries that we don't really need, though.
> 
> A mapping cache seems like a much better compromise because the cache
> lookup should be quite fast compared to a mapping operation and we waste
> just a couple dozen bytes per mapping perhaps as opposed to a few
> megabytes for the gratuitous, preemptive mappings.

Isn't it really possible to put displays into the same IOMMU group on
T194? It doesn't make much sense to have them in a separate groups on Linux.

Thierry Reding March 26, 2021, 4:37 p.m. UTC | #6

On Wed, Mar 24, 2021 at 07:50:01PM +0300, Dmitry Osipenko wrote:
> 24.03.2021 19:42, Thierry Reding пишет:
> > On Wed, Mar 24, 2021 at 06:45:30PM +0300, Dmitry Osipenko wrote:
> >> 24.03.2021 18:02, Thierry Reding пишет:
> >>> On Wed, Mar 24, 2021 at 05:41:08PM +0300, Dmitry Osipenko wrote:
> >>>> 23.03.2021 18:54, Thierry Reding пишет:
> >>>>> From: Thierry Reding <treding@nvidia.com>
> >>>>>
> >>>>> Clarify when a fixed IOV address can be used and when a buffer has to
> >>>>> be mapped before the IOVA can be used.
> >>>>>
> >>>>> Signed-off-by: Thierry Reding <treding@nvidia.com>
> >>>>> ---
> >>>>>  drivers/gpu/drm/tegra/plane.c | 8 ++++++++
> >>>>>  1 file changed, 8 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
> >>>>> index 19e8847a164b..793da5d675d2 100644
> >>>>> --- a/drivers/gpu/drm/tegra/plane.c
> >>>>> +++ b/drivers/gpu/drm/tegra/plane.c
> >>>>> @@ -119,6 +119,14 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
> >>>>>  		dma_addr_t phys_addr, *phys;
> >>>>>  		struct sg_table *sgt;
> >>>>>  
> >>>>> +		/*
> >>>>> +		 * If we're not attached to a domain, we already stored the
> >>>>> +		 * physical address when the buffer was allocated. If we're
> >>>>> +		 * part of a group that's shared between all display
> >>>>> +		 * controllers, we've also already mapped the framebuffer
> >>>>> +		 * through the SMMU. In both cases we can short-circuit the
> >>>>> +		 * code below and retrieve the stored IOV address.
> >>>>> +		 */
> >>>>>  		if (!domain || dc->client.group)
> >>>>>  			phys = &phys_addr;
> >>>>>  		else
> >>>>>
> >>>>
> >>>> This comment is correct, but the logic feels a bit lame because it
> >>>> should be wasteful to re-map DMA on each FB flip. Personally I don't
> >>>> care much about this since older Tegras use pinned buffers by default,
> >>>> but this shouldn't be good for T124+ users.
> >>>
> >>> I'm not terribly thrilled by this either, but it's the only way to do
> >>> this when using the DMA API because we don't know at allocation time (or
> >>> import time for that matter) which of the (up to) 4 display controllers
> >>> a framebuffer will be shown on. tegra_dc_pin() is the earliest where
> >>> this is known and worst case that's called once per flip.
> >>>
> >>> When the IOMMU API is used explicitly, we always map framebuffers into
> >>> the IOMMU domain shared by all display controllers at allocation or
> >>> import time and then we don't need to pin at flip time anymore.
> >>>
> >>> I do have a work-in-progress patch somewhere that creates a mapping
> >>> cache to mitigate this problem to some degree. I need to dig that up and
> >>> do a few measurements because I vaguely recall this speeding up flips by
> >>> quite a bit (well, except for the very first mapping, obviously).
> >>>
> >>>> Perhaps dumb buffers should be pinned to display by default and then we
> >>>> should extend the Tegra UAPI to support BO mapping to display client(?).
> >>>
> >>> That would kind of defeat the purpose of a generic KMS UAPI.
> >>
> >> Couldn't the BOs be mapped when FB is created, i.e. by tegra_fb_create?
> > 
> > I suppose that would be possible. However, tegra_fb_create() doesn't
> > know a thing about display controllers, so we'd have to add extra code
> > to it to iterate over all display controllers and do a dma_map_sg() of
> > the GEM object for each of them.
> > 
> > It's also somewhat wasteful because now we get a mapping for each
> > framebuffer for each display controller. So if you've got, say, a four
> > UHD screen setup (which is something that Tegra194 supports), you could
> > end up with 8 UHD framebuffers (two for each display, for double-
> > buffering) at 32 MiB each for a whopping 256 MiB of memory that needs to
> > be mapped for each of the four display controllers. That 1 GiB worth of
> > page table updates, whereas you really only need one fourth of that.
> > 
> > Granted, this will make flipping a bit faster, and IOVA space isn't
> > really a problem on Tegra194. It would still waste a bit of RAM for all
> > those page table entries that we don't really need, though.
> > 
> > A mapping cache seems like a much better compromise because the cache
> > lookup should be quite fast compared to a mapping operation and we waste
> > just a couple dozen bytes per mapping perhaps as opposed to a few
> > megabytes for the gratuitous, preemptive mappings.
> 
> Isn't it really possible to put displays into the same IOMMU group on
> T194? It doesn't make much sense to have them in a separate groups on Linux.

It is possible and in fact that's what's already happening. However, the
problem isn't that these devices are not in the same group, the problem
is that the DMA API doesn't know anything about groups. It works on
struct device and if you've got DMA API debugging enabled it may even
flag incorrect usage as errors.

So from a DMA API point of view, if a device wants to use a buffer, that
buffer first has to be mapped for that device, even if it was already
mapped for a different device that happens to be in the same IOMMU group
and hence share an IOMMU domain.

Thierry

Dmitry Osipenko March 28, 2021, 2:29 p.m. UTC | #7

26.03.2021 19:37, Thierry Reding пишет:
> On Wed, Mar 24, 2021 at 07:50:01PM +0300, Dmitry Osipenko wrote:
>> 24.03.2021 19:42, Thierry Reding пишет:
>>> On Wed, Mar 24, 2021 at 06:45:30PM +0300, Dmitry Osipenko wrote:
>>>> 24.03.2021 18:02, Thierry Reding пишет:
>>>>> On Wed, Mar 24, 2021 at 05:41:08PM +0300, Dmitry Osipenko wrote:
>>>>>> 23.03.2021 18:54, Thierry Reding пишет:
>>>>>>> From: Thierry Reding <treding@nvidia.com>
>>>>>>>
>>>>>>> Clarify when a fixed IOV address can be used and when a buffer has to
>>>>>>> be mapped before the IOVA can be used.
>>>>>>>
>>>>>>> Signed-off-by: Thierry Reding <treding@nvidia.com>
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/tegra/plane.c | 8 ++++++++
>>>>>>>  1 file changed, 8 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
>>>>>>> index 19e8847a164b..793da5d675d2 100644
>>>>>>> --- a/drivers/gpu/drm/tegra/plane.c
>>>>>>> +++ b/drivers/gpu/drm/tegra/plane.c
>>>>>>> @@ -119,6 +119,14 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
>>>>>>>  		dma_addr_t phys_addr, *phys;
>>>>>>>  		struct sg_table *sgt;
>>>>>>>  
>>>>>>> +		/*
>>>>>>> +		 * If we're not attached to a domain, we already stored the
>>>>>>> +		 * physical address when the buffer was allocated. If we're
>>>>>>> +		 * part of a group that's shared between all display
>>>>>>> +		 * controllers, we've also already mapped the framebuffer
>>>>>>> +		 * through the SMMU. In both cases we can short-circuit the
>>>>>>> +		 * code below and retrieve the stored IOV address.
>>>>>>> +		 */
>>>>>>>  		if (!domain || dc->client.group)
>>>>>>>  			phys = &phys_addr;
>>>>>>>  		else
>>>>>>>
>>>>>>
>>>>>> This comment is correct, but the logic feels a bit lame because it
>>>>>> should be wasteful to re-map DMA on each FB flip. Personally I don't
>>>>>> care much about this since older Tegras use pinned buffers by default,
>>>>>> but this shouldn't be good for T124+ users.
>>>>>
>>>>> I'm not terribly thrilled by this either, but it's the only way to do
>>>>> this when using the DMA API because we don't know at allocation time (or
>>>>> import time for that matter) which of the (up to) 4 display controllers
>>>>> a framebuffer will be shown on. tegra_dc_pin() is the earliest where
>>>>> this is known and worst case that's called once per flip.
>>>>>
>>>>> When the IOMMU API is used explicitly, we always map framebuffers into
>>>>> the IOMMU domain shared by all display controllers at allocation or
>>>>> import time and then we don't need to pin at flip time anymore.
>>>>>
>>>>> I do have a work-in-progress patch somewhere that creates a mapping
>>>>> cache to mitigate this problem to some degree. I need to dig that up and
>>>>> do a few measurements because I vaguely recall this speeding up flips by
>>>>> quite a bit (well, except for the very first mapping, obviously).
>>>>>
>>>>>> Perhaps dumb buffers should be pinned to display by default and then we
>>>>>> should extend the Tegra UAPI to support BO mapping to display client(?).
>>>>>
>>>>> That would kind of defeat the purpose of a generic KMS UAPI.
>>>>
>>>> Couldn't the BOs be mapped when FB is created, i.e. by tegra_fb_create?
>>>
>>> I suppose that would be possible. However, tegra_fb_create() doesn't
>>> know a thing about display controllers, so we'd have to add extra code
>>> to it to iterate over all display controllers and do a dma_map_sg() of
>>> the GEM object for each of them.
>>>
>>> It's also somewhat wasteful because now we get a mapping for each
>>> framebuffer for each display controller. So if you've got, say, a four
>>> UHD screen setup (which is something that Tegra194 supports), you could
>>> end up with 8 UHD framebuffers (two for each display, for double-
>>> buffering) at 32 MiB each for a whopping 256 MiB of memory that needs to
>>> be mapped for each of the four display controllers. That 1 GiB worth of
>>> page table updates, whereas you really only need one fourth of that.
>>>
>>> Granted, this will make flipping a bit faster, and IOVA space isn't
>>> really a problem on Tegra194. It would still waste a bit of RAM for all
>>> those page table entries that we don't really need, though.
>>>
>>> A mapping cache seems like a much better compromise because the cache
>>> lookup should be quite fast compared to a mapping operation and we waste
>>> just a couple dozen bytes per mapping perhaps as opposed to a few
>>> megabytes for the gratuitous, preemptive mappings.
>>
>> Isn't it really possible to put displays into the same IOMMU group on
>> T194? It doesn't make much sense to have them in a separate groups on Linux.
> 
> It is possible and in fact that's what's already happening. However, the
> problem isn't that these devices are not in the same group, the problem
> is that the DMA API doesn't know anything about groups. It works on
> struct device and if you've got DMA API debugging enabled it may even
> flag incorrect usage as errors.
> 
> So from a DMA API point of view, if a device wants to use a buffer, that
> buffer first has to be mapped for that device, even if it was already
> mapped for a different device that happens to be in the same IOMMU group
> and hence share an IOMMU domain.

This sounds to me like something which needs to be addressed first, i.e.
to make DMA API aware that it's okay to re-use mappings by sibling
devices within the same IOMMU group. Although, I assume that you already
considered this variant, didn't you?

[6/9] drm/tegra: gem: Add a clarifying comment

Commit Message

Comments

Patch