[v5,10/13] video: Only dcache flush damaged lines

Message ID 20230821135111.3558478-11-alpernebiyasak@gmail.com
State Under Review
Delegated to: Anatolij Gustschin
Series: Add video damage tracking

Commit Message

Alper Nebi Yasak Aug. 21, 2023, 1:51 p.m. UTC
From: Alexander Graf <agraf@csgraf.de>

Now that we have a damage area that tells us which parts of the frame
buffer actually need updating, let's only dcache flush those on
video_sync() calls. With this optimization in place, frame buffer updates -
especially on large screens such as 4K displays - speed up significantly.

Signed-off-by: Alexander Graf <agraf@csgraf.de>
Reported-by: Da Xue <da@libre.computer>
[Alper: Use damage.xstart/yend, IS_ENABLED()]
Co-developed-by: Alper Nebi Yasak <alpernebiyasak@gmail.com>
Signed-off-by: Alper Nebi Yasak <alpernebiyasak@gmail.com>
---

Changes in v5:
- Use xstart, ystart, xend, yend as names for damage region
- Use IS_ENABLED() instead of CONFIG_IS_ENABLED()

Changes in v2:
- Fix dcache range; we were flushing too much before
- Remove ifdefs

 drivers/video/video-uclass.c | 41 +++++++++++++++++++++++++++++++-----
 1 file changed, 36 insertions(+), 5 deletions(-)
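
To get a rough feel for the scale of the saving, the sketch below counts the cache lines covered by a full frame buffer flush versus the per-line damage flush this patch performs. It is an illustration only, not code from the patch: the 64-byte CACHELINE, the struct damage layout and both function names are assumptions made for the example. The per-line arithmetic mirrors what video_flush_dcache() does with ALIGN_DOWN() and ALIGN().

```c
#include <stdint.h>

#define CACHELINE 64u	/* assumed; the patch uses CONFIG_SYS_CACHELINE_SIZE */

/* Damage rectangle, named after the fields used in the patch */
struct damage {
	int xstart, ystart, xend, yend;
};

static uint64_t align_down(uint64_t v, uint64_t a)
{
	return v & ~(a - 1);
}

static uint64_t align_up(uint64_t v, uint64_t a)
{
	return (v + a - 1) & ~(a - 1);
}

/* Cache lines covered when the whole frame buffer is flushed */
static uint64_t full_flush_lines(uint64_t fb, uint64_t fb_size)
{
	return (align_up(fb + fb_size, CACHELINE) -
		align_down(fb, CACHELINE)) / CACHELINE;
}

/* Cache lines covered when only the damaged span of each line is flushed */
static uint64_t damage_flush_lines(uint64_t fb, uint32_t line_length,
				   uint32_t bytes_per_pixel,
				   const struct damage *d)
{
	uint64_t lstart = (uint64_t)d->xstart * bytes_per_pixel;
	uint64_t lend = (uint64_t)d->xend * bytes_per_pixel;
	uint64_t total = 0;

	if (!d->xend || !d->yend)
		return 0;	/* nothing damaged */

	for (int y = d->ystart; y < d->yend; y++) {
		uint64_t start = fb + (uint64_t)y * line_length + lstart;
		uint64_t end = start + lend - lstart;

		total += (align_up(end, CACHELINE) -
			  align_down(start, CACHELINE)) / CACHELINE;
	}

	return total;
}
```

Under these assumptions, a 3840x2160 frame buffer at 4 bytes per pixel (line length 15360 bytes, cache-line-aligned base) covers 518400 cache lines, whereas a single 8x16 glyph drawn at the top-left corner covers 16.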

Comments

Simon Glass Aug. 21, 2023, 7:11 p.m. UTC | #1
Hi Alper,

On Mon, 21 Aug 2023 at 07:51, Alper Nebi Yasak <alpernebiyasak@gmail.com> wrote:
>
> From: Alexander Graf <agraf@csgraf.de>
>
> Now that we have a damage area tells us which parts of the frame buffer
> actually need updating, let's only dcache flush those on video_sync()
> calls. With this optimization in place, frame buffer updates - especially
> on large screen such as 4k displays - speed up significantly.
>
> Signed-off-by: Alexander Graf <agraf@csgraf.de>
> Reported-by: Da Xue <da@libre.computer>
> [Alper: Use damage.xstart/yend, IS_ENABLED()]
> Co-developed-by: Alper Nebi Yasak <alpernebiyasak@gmail.com>
> Signed-off-by: Alper Nebi Yasak <alpernebiyasak@gmail.com>
> ---
>
> Changes in v5:
> - Use xstart, ystart, xend, yend as names for damage region
> - Use IS_ENABLED() instead of CONFIG_IS_ENABLED()
>
> Changes in v2:
> - Fix dcache range; we were flushing too much before
> - Remove ifdefs
>
>  drivers/video/video-uclass.c | 41 +++++++++++++++++++++++++++++++-----
>  1 file changed, 36 insertions(+), 5 deletions(-)

This is a little strange, since flushing the whole cache will only
write out data that was actually written (to the display). Is
there a benefit to this patch, in terms of performance?

Regards,
Simon
Alexander Graf Aug. 21, 2023, 7:59 p.m. UTC | #2
On 21.08.23 21:11, Simon Glass wrote:
> Hi Alper,
>
> On Mon, 21 Aug 2023 at 07:51, Alper Nebi Yasak <alpernebiyasak@gmail.com> wrote:
>> From: Alexander Graf <agraf@csgraf.de>
>>
>> Now that we have a damage area tells us which parts of the frame buffer
>> actually need updating, let's only dcache flush those on video_sync()
>> calls. With this optimization in place, frame buffer updates - especially
>> on large screen such as 4k displays - speed up significantly.
>>
>> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>> Reported-by: Da Xue <da@libre.computer>
>> [Alper: Use damage.xstart/yend, IS_ENABLED()]
>> Co-developed-by: Alper Nebi Yasak <alpernebiyasak@gmail.com>
>> Signed-off-by: Alper Nebi Yasak <alpernebiyasak@gmail.com>
>> ---
>>
>> Changes in v5:
>> - Use xstart, ystart, xend, yend as names for damage region
>> - Use IS_ENABLED() instead of CONFIG_IS_ENABLED()
>>
>> Changes in v2:
>> - Fix dcache range; we were flushing too much before
>> - Remove ifdefs
>>
>>   drivers/video/video-uclass.c | 41 +++++++++++++++++++++++++++++++-----
>>   1 file changed, 36 insertions(+), 5 deletions(-)
> This is a little strange, since flushing the whole cache will only
> actually write out data that was actually written (to the display). Is
> there a benefit to this patch, in terms of performance?


I'm happy to see you go through the same thought process I went through 
when writing these: "This surely can't be the problem, can it?". The 
answer is "simple" in hindsight:

Have a look at the ARMv8 cache flush function. It does the only "safe" 
thing you can expect it to do: Clean+Invalidate to POC because we use it 
for multiple things, clearing modified code among others:

ENTRY(__asm_flush_dcache_range)
         mrs     x3, ctr_el0
         ubfx    x3, x3, #16, #4
         mov     x2, #4
         lsl     x2, x2, x3              /* cache line size */

         /* x2 <- minimal cache line size in cache system */
         sub     x3, x2, #1
         bic     x0, x0, x3
1:      dc      civac, x0       /* clean & invalidate data or unified cache */
         add     x0, x0, x2
         cmp     x0, x1
         b.lo    1b
         dsb     sy
         ret
ENDPROC(__asm_flush_dcache_range)
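
For readers less used to AArch64 assembly, here is a minimal C model of the loop above. It only counts how many "dc civac" operations would be issued for a range and does not touch any cache. The function names are made up for this sketch, and the CTR_EL0 value is passed in as a parameter rather than read from the system register.

```c
#include <stdint.h>

/*
 * Decode CTR_EL0.DminLine (bits [19:16]): log2 of the smallest data cache
 * line size, counted in 4-byte words.
 */
static uint64_t dcache_line_size(uint64_t ctr_el0)
{
	uint64_t dminline = (ctr_el0 >> 16) & 0xf;	/* ubfx x3, x3, #16, #4 */

	return 4ULL << dminline;	/* mov x2, #4; lsl x2, x2, x3 */
}

/*
 * Model of the flush loop: returns the number of "dc civac" operations
 * issued for the range [start, end). Like the assembly, it always runs at
 * least once.
 */
static uint64_t count_civac_ops(uint64_t start, uint64_t end, uint64_t ctr_el0)
{
	uint64_t line = dcache_line_size(ctr_el0);
	uint64_t addr = start & ~(line - 1);	/* sub x3, x2, #1; bic x0, x0, x3 */
	uint64_t ops = 0;

	do {
		ops++;			/* dc civac, x0 */
		addr += line;		/* add x0, x0, x2 */
	} while (addr < end);		/* cmp x0, x1; b.lo 1b */

	return ops;
}
```

For example, with DminLine = 4 (64-byte lines), count_civac_ops(0x1010, 0x1090, 4ULL << 16) returns 3: the start is rounded down to 0x1000, and the lines at 0x1000, 0x1040 and 0x1080 are each cleaned and invalidated.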


Looking at the "dc civac" call, we find this documentation page from 
ARM: 
https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Instructions/DC-CIVAC--Data-or-unified-Cache-line-Clean-and-Invalidate-by-VA-to-PoC

This says we're writing any dirtiness of this cache line up to the POC
and then invalidating (removing) the cache line, also up to the POC. That
means when you look at a typical SBC, this will either be L2 or system
level cache. Every read afterwards needs to go and pull it all the way
back to L1 to modify it (or not) on the next character write and then
flush it again.

Even worse: Because of the invalidate, we may even evict it from caches 
that the display controller uses to read the frame buffer. So depending 
on the SoC's cache topology and implementation, we may force the display 
controller to refetch the full FB content on its next screen refresh cycle.

I faintly remember that I tried to experiment with CVAC instead to only
flush without invalidating. I don't fully remember the results anymore
though. I believe CVAC just behaved identically to CIVAC on the A53
platform I was working on. And then I looked at Cortex-A53 errata like
[1] and just accepted that doing anything but restricting the flushing
range is a waste of time :)


Alex


[1] 
https://patchwork.kernel.org/project/xen-devel/patch/1462466065-30212-14-git-send-email-julien.grall@arm.com/
Simon Glass Aug. 21, 2023, 10:10 p.m. UTC | #3
Hi Alex,

On Mon, 21 Aug 2023 at 13:59, Alexander Graf <agraf@csgraf.de> wrote:
>
>
> On 21.08.23 21:11, Simon Glass wrote:
> > Hi Alper,
> >
> > On Mon, 21 Aug 2023 at 07:51, Alper Nebi Yasak <alpernebiyasak@gmail.com> wrote:
> >> From: Alexander Graf <agraf@csgraf.de>
> >>
> >> Now that we have a damage area tells us which parts of the frame buffer
> >> actually need updating, let's only dcache flush those on video_sync()
> >> calls. With this optimization in place, frame buffer updates - especially
> >> on large screen such as 4k displays - speed up significantly.
> >>
> >> Signed-off-by: Alexander Graf <agraf@csgraf.de>
> >> Reported-by: Da Xue <da@libre.computer>
> >> [Alper: Use damage.xstart/yend, IS_ENABLED()]
> >> Co-developed-by: Alper Nebi Yasak <alpernebiyasak@gmail.com>
> >> Signed-off-by: Alper Nebi Yasak <alpernebiyasak@gmail.com>
> >> ---
> >>
> >> Changes in v5:
> >> - Use xstart, ystart, xend, yend as names for damage region
> >> - Use IS_ENABLED() instead of CONFIG_IS_ENABLED()
> >>
> >> Changes in v2:
> >> - Fix dcache range; we were flushing too much before
> >> - Remove ifdefs
> >>
> >>   drivers/video/video-uclass.c | 41 +++++++++++++++++++++++++++++++-----
> >>   1 file changed, 36 insertions(+), 5 deletions(-)
> > This is a little strange, since flushing the whole cache will only
> > actually write out data that was actually written (to the display). Is
> > there a benefit to this patch, in terms of performance?
>
>
> I'm happy to see you go through the same thought process I went through
> when writing these: "This surely can't be the problem, can it?". The
> answer is "simple" in hindsight:
>
> Have a look at the ARMv8 cache flush function. It does the only "safe"
> thing you can expect it to do: Clean+Invalidate to POC because we use it
> for multiple things, clearing modified code among others:
>
> ENTRY(__asm_flush_dcache_range)
>          mrs     x3, ctr_el0
>          ubfx    x3, x3, #16, #4
>          mov     x2, #4
>          lsl     x2, x2, x3              /* cache line size */
>
>          /* x2 <- minimal cache line size in cache system */
>          sub     x3, x2, #1
>          bic     x0, x0, x3
> 1:      dc      civac, x0       /* clean & invalidate data or unified
> cache */
>          add     x0, x0, x2
>          cmp     x0, x1
>          b.lo    1b
>          dsb     sy
>          ret
> ENDPROC(__asm_flush_dcache_range)
>
>
> Looking at the "dc civac" call, we find this documentation page from
> ARM:
> https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Instructions/DC-CIVAC--Data-or-unified-Cache-line-Clean-and-Invalidate-by-VA-to-PoC
>
> This says we're writing any dirtyness of this cache line up to the POC
> and then invalidate (remove the cache line) also up to POC. That means
> when you look at a typical SBC, this will either be L2 or system level
> cache. Every read afterwards needs to go and pull it all the way back to
> L1 to modify it (or not) on the next character write and then flush it
> again.
>
> Even worse: Because of the invalidate, we may even evict it from caches
> that the display controller uses to read the frame buffer. So depending
> on the SoC's cache topology and implementation, we may force the display
> controller to refetch the full FB content on its next screen refresh cycle.
>
> I faintly remember that I tried to experiment with CVAC instead to only
> flush without invalidating. I don't fully remember the results anymore
> though. I believe CVAC just behaved identical to CIVAC on the A53
> platform I was working on. And then I looked at Cortex-A53 errata like
> [1] and just accepted that doing anything but restricting the flushing
> range is a waste of time :)

Yuck, I didn't know it was invalidating too. That is horrible. Is there
no way to fix it?

Regards,
Simon

>
>
> Alex
>
>
> [1]
> https://patchwork.kernel.org/project/xen-devel/patch/1462466065-30212-14-git-send-email-julien.grall@arm.com/
>
>
Alexander Graf Aug. 21, 2023, 10:44 p.m. UTC | #4
On 22.08.23 00:10, Simon Glass wrote:
> Hi Alex,
>
> On Mon, 21 Aug 2023 at 13:59, Alexander Graf <agraf@csgraf.de> wrote:
>>
>> On 21.08.23 21:11, Simon Glass wrote:
>>> Hi Alper,
>>>
>>> On Mon, 21 Aug 2023 at 07:51, Alper Nebi Yasak <alpernebiyasak@gmail.com> wrote:
>>>> From: Alexander Graf <agraf@csgraf.de>
>>>>
>>>> Now that we have a damage area tells us which parts of the frame buffer
>>>> actually need updating, let's only dcache flush those on video_sync()
>>>> calls. With this optimization in place, frame buffer updates - especially
>>>> on large screen such as 4k displays - speed up significantly.
>>>>
>>>> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>>>> Reported-by: Da Xue <da@libre.computer>
>>>> [Alper: Use damage.xstart/yend, IS_ENABLED()]
>>>> Co-developed-by: Alper Nebi Yasak <alpernebiyasak@gmail.com>
>>>> Signed-off-by: Alper Nebi Yasak <alpernebiyasak@gmail.com>
>>>> ---
>>>>
>>>> Changes in v5:
>>>> - Use xstart, ystart, xend, yend as names for damage region
>>>> - Use IS_ENABLED() instead of CONFIG_IS_ENABLED()
>>>>
>>>> Changes in v2:
>>>> - Fix dcache range; we were flushing too much before
>>>> - Remove ifdefs
>>>>
>>>>    drivers/video/video-uclass.c | 41 +++++++++++++++++++++++++++++++-----
>>>>    1 file changed, 36 insertions(+), 5 deletions(-)
>>> This is a little strange, since flushing the whole cache will only
>>> actually write out data that was actually written (to the display). Is
>>> there a benefit to this patch, in terms of performance?
>>
>> I'm happy to see you go through the same thought process I went through
>> when writing these: "This surely can't be the problem, can it?". The
>> answer is "simple" in hindsight:
>>
>> Have a look at the ARMv8 cache flush function. It does the only "safe"
>> thing you can expect it to do: Clean+Invalidate to POC because we use it
>> for multiple things, clearing modified code among others:
>>
>> ENTRY(__asm_flush_dcache_range)
>>           mrs     x3, ctr_el0
>>           ubfx    x3, x3, #16, #4
>>           mov     x2, #4
>>           lsl     x2, x2, x3              /* cache line size */
>>
>>           /* x2 <- minimal cache line size in cache system */
>>           sub     x3, x2, #1
>>           bic     x0, x0, x3
>> 1:      dc      civac, x0       /* clean & invalidate data or unified
>> cache */
>>           add     x0, x0, x2
>>           cmp     x0, x1
>>           b.lo    1b
>>           dsb     sy
>>           ret
>> ENDPROC(__asm_flush_dcache_range)
>>
>>
>> Looking at the "dc civac" call, we find this documentation page from
>> ARM:
>> https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Instructions/DC-CIVAC--Data-or-unified-Cache-line-Clean-and-Invalidate-by-VA-to-PoC
>>
>> This says we're writing any dirtyness of this cache line up to the POC
>> and then invalidate (remove the cache line) also up to POC. That means
>> when you look at a typical SBC, this will either be L2 or system level
>> cache. Every read afterwards needs to go and pull it all the way back to
>> L1 to modify it (or not) on the next character write and then flush it
>> again.
>>
>> Even worse: Because of the invalidate, we may even evict it from caches
>> that the display controller uses to read the frame buffer. So depending
>> on the SoC's cache topology and implementation, we may force the display
>> controller to refetch the full FB content on its next screen refresh cycle.
>>
>> I faintly remember that I tried to experiment with CVAC instead to only
>> flush without invalidating. I don't fully remember the results anymore
>> though. I believe CVAC just behaved identical to CIVAC on the A53
>> platform I was working on. And then I looked at Cortex-A53 errata like
>> [1] and just accepted that doing anything but restricting the flushing
>> range is a waste of time :)
> Yuck I didn't know it was invalidating too. That is horrible. Is there
> no way to fix it?


Before building all of this damage logic, I tried, but failed. I'd
welcome anyone else to try again :). I'm not even convinced yet that it
is actually fixable: depending on the SoC's internal cache logic, it may
opt to always invalidate, I think.

That said, this patch set really also makes sense outside of the 
particular invalidate problem. It creates a generic abstraction between 
the copy and non-copy code path and allows us to reduce the amount of 
work spent for both, generically for any video sync operation.


Alex
Simon Glass Aug. 21, 2023, 11:03 p.m. UTC | #5
Hi Alex,

On Mon, 21 Aug 2023 at 16:44, Alexander Graf <agraf@csgraf.de> wrote:
>
>
> On 22.08.23 00:10, Simon Glass wrote:
> > Hi Alex,
> >
> > On Mon, 21 Aug 2023 at 13:59, Alexander Graf <agraf@csgraf.de> wrote:
> >>
> >> On 21.08.23 21:11, Simon Glass wrote:
> >>> Hi Alper,
> >>>
> >>> On Mon, 21 Aug 2023 at 07:51, Alper Nebi Yasak <alpernebiyasak@gmail.com> wrote:
> >>>> From: Alexander Graf <agraf@csgraf.de>
> >>>>
> >>>> Now that we have a damage area tells us which parts of the frame buffer
> >>>> actually need updating, let's only dcache flush those on video_sync()
> >>>> calls. With this optimization in place, frame buffer updates - especially
> >>>> on large screen such as 4k displays - speed up significantly.
> >>>>
> >>>> Signed-off-by: Alexander Graf <agraf@csgraf.de>
> >>>> Reported-by: Da Xue <da@libre.computer>
> >>>> [Alper: Use damage.xstart/yend, IS_ENABLED()]
> >>>> Co-developed-by: Alper Nebi Yasak <alpernebiyasak@gmail.com>
> >>>> Signed-off-by: Alper Nebi Yasak <alpernebiyasak@gmail.com>
> >>>> ---
> >>>>
> >>>> Changes in v5:
> >>>> - Use xstart, ystart, xend, yend as names for damage region
> >>>> - Use IS_ENABLED() instead of CONFIG_IS_ENABLED()
> >>>>
> >>>> Changes in v2:
> >>>> - Fix dcache range; we were flushing too much before
> >>>> - Remove ifdefs
> >>>>
> >>>>    drivers/video/video-uclass.c | 41 +++++++++++++++++++++++++++++++-----
> >>>>    1 file changed, 36 insertions(+), 5 deletions(-)
> >>> This is a little strange, since flushing the whole cache will only
> >>> actually write out data that was actually written (to the display). Is
> >>> there a benefit to this patch, in terms of performance?
> >>
> >> I'm happy to see you go through the same thought process I went through
> >> when writing these: "This surely can't be the problem, can it?". The
> >> answer is "simple" in hindsight:
> >>
> >> Have a look at the ARMv8 cache flush function. It does the only "safe"
> >> thing you can expect it to do: Clean+Invalidate to POC because we use it
> >> for multiple things, clearing modified code among others:
> >>
> >> ENTRY(__asm_flush_dcache_range)
> >>           mrs     x3, ctr_el0
> >>           ubfx    x3, x3, #16, #4
> >>           mov     x2, #4
> >>           lsl     x2, x2, x3              /* cache line size */
> >>
> >>           /* x2 <- minimal cache line size in cache system */
> >>           sub     x3, x2, #1
> >>           bic     x0, x0, x3
> >> 1:      dc      civac, x0       /* clean & invalidate data or unified
> >> cache */
> >>           add     x0, x0, x2
> >>           cmp     x0, x1
> >>           b.lo    1b
> >>           dsb     sy
> >>           ret
> >> ENDPROC(__asm_flush_dcache_range)
> >>
> >>
> >> Looking at the "dc civac" call, we find this documentation page from
> >> ARM:
> >> https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Instructions/DC-CIVAC--Data-or-unified-Cache-line-Clean-and-Invalidate-by-VA-to-PoC
> >>
> >> This says we're writing any dirtyness of this cache line up to the POC
> >> and then invalidate (remove the cache line) also up to POC. That means
> >> when you look at a typical SBC, this will either be L2 or system level
> >> cache. Every read afterwards needs to go and pull it all the way back to
> >> L1 to modify it (or not) on the next character write and then flush it
> >> again.
> >>
> >> Even worse: Because of the invalidate, we may even evict it from caches
> >> that the display controller uses to read the frame buffer. So depending
> >> on the SoC's cache topology and implementation, we may force the display
> >> controller to refetch the full FB content on its next screen refresh cycle.
> >>
> >> I faintly remember that I tried to experiment with CVAC instead to only
> >> flush without invalidating. I don't fully remember the results anymore
> >> though. I believe CVAC just behaved identical to CIVAC on the A53
> >> platform I was working on. And then I looked at Cortex-A53 errata like
> >> [1] and just accepted that doing anything but restricting the flushing
> >> range is a waste of time :)
> > Yuck I didn't know it was invalidating too. That is horrible. Is there
> > no way to fix it?
>
>
> Before building all of this damage logic, I tried, but failed. I'd
> welcome anyone else to try again :). I'm not even convinced yet that it
> is actually fixable: Depending on the SoC's internal cache logic, it may
> opt to always invalidate I think.

Wow, that is crazy! How is anyone supposed to make the system run well
with logic like that??!

>
> That said, this patch set really also makes sense outside of the
> particular invalidate problem. It creates a generic abstraction between
> the copy and non-copy code path and allows us to reduce the amount of
> work spent for both, generically for any video sync operation.

Sure...my question was really why it helps so much, given what I
understood the caches to be doing. If they are invalidating, then it
is amazing anything gets done...

Regards,
Simon
Alper Nebi Yasak Aug. 30, 2023, 7:12 p.m. UTC | #6
On 2023-08-22 02:03 +03:00, Simon Glass wrote:
> Hi Alex,
> 
> On Mon, 21 Aug 2023 at 16:44, Alexander Graf <agraf@csgraf.de> wrote:
>>
>>
>> On 22.08.23 00:10, Simon Glass wrote:
>>> Hi Alex,
>>>
>>> On Mon, 21 Aug 2023 at 13:59, Alexander Graf <agraf@csgraf.de> wrote:
>>>>
>>>> On 21.08.23 21:11, Simon Glass wrote:
>>>>> Hi Alper,
>>>>>
>>>>> On Mon, 21 Aug 2023 at 07:51, Alper Nebi Yasak <alpernebiyasak@gmail.com> wrote:
>>>>>> From: Alexander Graf <agraf@csgraf.de>
>>>>>>
>>>>>> Now that we have a damage area tells us which parts of the frame buffer
>>>>>> actually need updating, let's only dcache flush those on video_sync()
>>>>>> calls. With this optimization in place, frame buffer updates - especially
>>>>>> on large screen such as 4k displays - speed up significantly.
>>>>>>
>>>>>> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>>>>>> Reported-by: Da Xue <da@libre.computer>
>>>>>> [Alper: Use damage.xstart/yend, IS_ENABLED()]
>>>>>> Co-developed-by: Alper Nebi Yasak <alpernebiyasak@gmail.com>
>>>>>> Signed-off-by: Alper Nebi Yasak <alpernebiyasak@gmail.com>
>>>>>> ---
>>>>>>
>>>>>> Changes in v5:
>>>>>> - Use xstart, ystart, xend, yend as names for damage region
>>>>>> - Use IS_ENABLED() instead of CONFIG_IS_ENABLED()
>>>>>>
>>>>>> Changes in v2:
>>>>>> - Fix dcache range; we were flushing too much before
>>>>>> - Remove ifdefs
>>>>>>
>>>>>>    drivers/video/video-uclass.c | 41 +++++++++++++++++++++++++++++++-----
>>>>>>    1 file changed, 36 insertions(+), 5 deletions(-)
>>>>> This is a little strange, since flushing the whole cache will only
>>>>> actually write out data that was actually written (to the display). Is
>>>>> there a benefit to this patch, in terms of performance?
>>>>
>>>> I'm happy to see you go through the same thought process I went through
>>>> when writing these: "This surely can't be the problem, can it?". The
>>>> answer is "simple" in hindsight:
>>>>
>>>> Have a look at the ARMv8 cache flush function. It does the only "safe"
>>>> thing you can expect it to do: Clean+Invalidate to POC because we use it
>>>> for multiple things, clearing modified code among others:
>>>>
>>>> ENTRY(__asm_flush_dcache_range)
>>>>           mrs     x3, ctr_el0
>>>>           ubfx    x3, x3, #16, #4
>>>>           mov     x2, #4
>>>>           lsl     x2, x2, x3              /* cache line size */
>>>>
>>>>           /* x2 <- minimal cache line size in cache system */
>>>>           sub     x3, x2, #1
>>>>           bic     x0, x0, x3
>>>> 1:      dc      civac, x0       /* clean & invalidate data or unified
>>>> cache */
>>>>           add     x0, x0, x2
>>>>           cmp     x0, x1
>>>>           b.lo    1b
>>>>           dsb     sy
>>>>           ret
>>>> ENDPROC(__asm_flush_dcache_range)
>>>>
>>>>
>>>> Looking at the "dc civac" call, we find this documentation page from
>>>> ARM:
>>>> https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Instructions/DC-CIVAC--Data-or-unified-Cache-line-Clean-and-Invalidate-by-VA-to-PoC
>>>>
>>>> This says we're writing any dirtyness of this cache line up to the POC
>>>> and then invalidate (remove the cache line) also up to POC. That means
>>>> when you look at a typical SBC, this will either be L2 or system level
>>>> cache. Every read afterwards needs to go and pull it all the way back to
>>>> L1 to modify it (or not) on the next character write and then flush it
>>>> again.
>>>>
>>>> Even worse: Because of the invalidate, we may even evict it from caches
>>>> that the display controller uses to read the frame buffer. So depending
>>>> on the SoC's cache topology and implementation, we may force the display
>>>> controller to refetch the full FB content on its next screen refresh cycle.
>>>>
>>>> I faintly remember that I tried to experiment with CVAC instead to only
>>>> flush without invalidating. I don't fully remember the results anymore
>>>> though. I believe CVAC just behaved identical to CIVAC on the A53
>>>> platform I was working on. And then I looked at Cortex-A53 errata like
>>>> [1] and just accepted that doing anything but restricting the flushing
>>>> range is a waste of time :)
>>> Yuck I didn't know it was invalidating too. That is horrible. Is there
>>> no way to fix it?
>>
>>
>> Before building all of this damage logic, I tried, but failed. I'd
>> welcome anyone else to try again :). I'm not even convinced yet that it
>> is actually fixable: Depending on the SoC's internal cache logic, it may
>> opt to always invalidate I think.
> 
> Wow, that is crazy! How is anyone supposed to make the system run well
> with logic like that??!
> 
>>
>> That said, this patch set really also makes sense outside of the
>> particular invalidate problem. It creates a generic abstraction between
>> the copy and non-copy code path and allows us to reduce the amount of
>> work spent for both, generically for any video sync operation.
> 
> Sure...my question was really why it helps so much, given what I
> understood the caches to be doing. If they are invalidating, then it
> is amazing anything gets done...

I don't really know cache mechanisms and terminology, but AFAIU there's
nothing actionable for this patch regarding this discussion, right?

Meanwhile, I noticed this patch only flushes priv->fb, and I think it also
needs to flush priv->copy_fb if VIDEO_COPY is enabled.
Alexander Graf Aug. 30, 2023, 7:57 p.m. UTC | #7
On 30.08.23 21:12, Alper Nebi Yasak wrote:
> On 2023-08-22 02:03 +03:00, Simon Glass wrote:
>> Hi Alex,
>>
>> On Mon, 21 Aug 2023 at 16:44, Alexander Graf <agraf@csgraf.de> wrote:
>>>
>>> On 22.08.23 00:10, Simon Glass wrote:
>>>> Hi Alex,
>>>>
>>>> On Mon, 21 Aug 2023 at 13:59, Alexander Graf <agraf@csgraf.de> wrote:
>>>>> On 21.08.23 21:11, Simon Glass wrote:
>>>>>> Hi Alper,
>>>>>>
>>>>>> On Mon, 21 Aug 2023 at 07:51, Alper Nebi Yasak <alpernebiyasak@gmail.com> wrote:
>>>>>>> From: Alexander Graf <agraf@csgraf.de>
>>>>>>>
>>>>>>> Now that we have a damage area tells us which parts of the frame buffer
>>>>>>> actually need updating, let's only dcache flush those on video_sync()
>>>>>>> calls. With this optimization in place, frame buffer updates - especially
>>>>>>> on large screen such as 4k displays - speed up significantly.
>>>>>>>
>>>>>>> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>>>>>>> Reported-by: Da Xue <da@libre.computer>
>>>>>>> [Alper: Use damage.xstart/yend, IS_ENABLED()]
>>>>>>> Co-developed-by: Alper Nebi Yasak <alpernebiyasak@gmail.com>
>>>>>>> Signed-off-by: Alper Nebi Yasak <alpernebiyasak@gmail.com>
>>>>>>> ---
>>>>>>>
>>>>>>> Changes in v5:
>>>>>>> - Use xstart, ystart, xend, yend as names for damage region
>>>>>>> - Use IS_ENABLED() instead of CONFIG_IS_ENABLED()
>>>>>>>
>>>>>>> Changes in v2:
>>>>>>> - Fix dcache range; we were flushing too much before
>>>>>>> - Remove ifdefs
>>>>>>>
>>>>>>>     drivers/video/video-uclass.c | 41 +++++++++++++++++++++++++++++++-----
>>>>>>>     1 file changed, 36 insertions(+), 5 deletions(-)
>>>>>> This is a little strange, since flushing the whole cache will only
>>>>>> actually write out data that was actually written (to the display). Is
>>>>>> there a benefit to this patch, in terms of performance?
>>>>> I'm happy to see you go through the same thought process I went through
>>>>> when writing these: "This surely can't be the problem, can it?". The
>>>>> answer is "simple" in hindsight:
>>>>>
>>>>> Have a look at the ARMv8 cache flush function. It does the only "safe"
>>>>> thing you can expect it to do: Clean+Invalidate to POC because we use it
>>>>> for multiple things, clearing modified code among others:
>>>>>
>>>>> ENTRY(__asm_flush_dcache_range)
>>>>>            mrs     x3, ctr_el0
>>>>>            ubfx    x3, x3, #16, #4
>>>>>            mov     x2, #4
>>>>>            lsl     x2, x2, x3              /* cache line size */
>>>>>
>>>>>            /* x2 <- minimal cache line size in cache system */
>>>>>            sub     x3, x2, #1
>>>>>            bic     x0, x0, x3
>>>>> 1:      dc      civac, x0       /* clean & invalidate data or unified
>>>>> cache */
>>>>>            add     x0, x0, x2
>>>>>            cmp     x0, x1
>>>>>            b.lo    1b
>>>>>            dsb     sy
>>>>>            ret
>>>>> ENDPROC(__asm_flush_dcache_range)
>>>>>
>>>>>
>>>>> Looking at the "dc civac" call, we find this documentation page from
>>>>> ARM:
>>>>> https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Instructions/DC-CIVAC--Data-or-unified-Cache-line-Clean-and-Invalidate-by-VA-to-PoC
>>>>>
>>>>> This says we're writing any dirtyness of this cache line up to the POC
>>>>> and then invalidate (remove the cache line) also up to POC. That means
>>>>> when you look at a typical SBC, this will either be L2 or system level
>>>>> cache. Every read afterwards needs to go and pull it all the way back to
>>>>> L1 to modify it (or not) on the next character write and then flush it
>>>>> again.
>>>>>
>>>>> Even worse: Because of the invalidate, we may even evict it from caches
>>>>> that the display controller uses to read the frame buffer. So depending
>>>>> on the SoC's cache topology and implementation, we may force the display
>>>>> controller to refetch the full FB content on its next screen refresh cycle.
>>>>>
>>>>> I faintly remember that I tried to experiment with CVAC instead to only
>>>>> flush without invalidating. I don't fully remember the results anymore
>>>>> though. I believe CVAC just behaved identical to CIVAC on the A53
>>>>> platform I was working on. And then I looked at Cortex-A53 errata like
>>>>> [1] and just accepted that doing anything but restricting the flushing
>>>>> range is a waste of time :)
>>>> Yuck I didn't know it was invalidating too. That is horrible. Is there
>>>> no way to fix it?
>>>
>>> Before building all of this damage logic, I tried, but failed. I'd
>>> welcome anyone else to try again :). I'm not even convinced yet that it
>>> is actually fixable: Depending on the SoC's internal cache logic, it may
>>> opt to always invalidate I think.
>> Wow, that is crazy! How is anyone supposed to make the system run well
>> with logic like that??!
>>
>>> That said, this patch set really also makes sense outside of the
>>> particular invalidate problem. It creates a generic abstraction between
>>> the copy and non-copy code path and allows us to reduce the amount of
>>> work spent for both, generically for any video sync operation.
>> Sure...my question was really why it helps so much, given what I
>> understood the caches to be doing. If they are invalidating, then it
>> is amazing anything gets done...
> I don't really know cache mechanisms and terminology, but AFAIU there's
> nothing actionable for this patch regarding this discussion, right?
>
> Meanwhile I noticed this patch only flushes priv->fb, and think it also
> needs to flush priv->copy_fb if VIDEO_COPY.


The reason was mostly that copy_fb is really only used on x86 where we 
don't have the cache flush problem/code :). So nobody bothered to add 
flushing to that code path.


Alex
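
As a what-if for the copy_fb question raised above, the sketch below shows one way the per-line flush could be factored so that the same damage rectangle is applied to both buffers. This is not part of the patch, and per the reply above it may be unnecessary where copy_fb is only used on x86. All names here (flush_fn, count_flush, flush_damaged_lines, flush_damage_all) are hypothetical, the 64-byte cache line is assumed, and a callback stands in for flush_dcache_range() so the code can run on any host.

```c
#include <stdint.h>

#define CACHELINE UINT64_C(64)	/* assumed; stands in for CONFIG_SYS_CACHELINE_SIZE */

/* Damage rectangle, named after the fields used in the patch */
struct damage {
	int xstart, ystart, xend, yend;
};

/* Stand-in for flush_dcache_range() so the sketch can run anywhere */
typedef void (*flush_fn)(uint64_t start, uint64_t end, void *ctx);

/* Example callback: counts calls instead of touching any cache */
static void count_flush(uint64_t start, uint64_t end, void *ctx)
{
	(void)start;
	(void)end;
	(*(unsigned int *)ctx)++;
}

/* Flush the damaged span of every damaged line of one buffer */
static void flush_damaged_lines(uint64_t base, uint32_t line_length,
				uint32_t bytes_per_pixel,
				const struct damage *d,
				flush_fn flush, void *ctx)
{
	uint64_t lstart = (uint64_t)d->xstart * bytes_per_pixel;
	uint64_t lend = (uint64_t)d->xend * bytes_per_pixel;

	if (!d->xend || !d->yend)
		return;		/* nothing damaged */

	for (int y = d->ystart; y < d->yend; y++) {
		uint64_t start = base + (uint64_t)y * line_length + lstart;
		uint64_t end = start + lend - lstart;

		start &= ~(CACHELINE - 1);			/* round down */
		end = (end + CACHELINE - 1) & ~(CACHELINE - 1);	/* round up */
		flush(start, end, ctx);
	}
}

/* Hypothetical wrapper: copy_fb == 0 means no copy buffer is in use */
static void flush_damage_all(uint64_t fb, uint64_t copy_fb,
			     uint32_t line_length, uint32_t bytes_per_pixel,
			     const struct damage *d,
			     flush_fn flush, void *ctx)
{
	flush_damaged_lines(fb, line_length, bytes_per_pixel, d, flush, ctx);
	if (copy_fb)
		flush_damaged_lines(copy_fb, line_length, bytes_per_pixel,
				    d, flush, ctx);
}
```

With a 16-line damage rectangle and count_flush as the callback, the wrapper issues 16 flush calls when copy_fb is 0 and 32 when a copy buffer address is given.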

Patch

diff --git a/drivers/video/video-uclass.c b/drivers/video/video-uclass.c
index 8bfcbc88dda7..a50220bcc684 100644
--- a/drivers/video/video-uclass.c
+++ b/drivers/video/video-uclass.c
@@ -385,6 +385,41 @@  void video_damage(struct udevice *vid, int x, int y, int width, int height)
 	priv->damage.yend = max(yend, priv->damage.yend);
 }
 
+#if defined(CONFIG_ARM) && !CONFIG_IS_ENABLED(SYS_DCACHE_OFF)
+static void video_flush_dcache(struct udevice *vid)
+{
+	struct video_priv *priv = dev_get_uclass_priv(vid);
+
+	if (!priv->flush_dcache)
+		return;
+
+	if (!IS_ENABLED(CONFIG_VIDEO_DAMAGE)) {
+		flush_dcache_range((ulong)priv->fb,
+				   ALIGN((ulong)priv->fb + priv->fb_size,
+					 CONFIG_SYS_CACHELINE_SIZE));
+
+		return;
+	}
+
+	if (priv->damage.xend && priv->damage.yend) {
+		int lstart = priv->damage.xstart * VNBYTES(priv->bpix);
+		int lend = priv->damage.xend * VNBYTES(priv->bpix);
+		int y;
+
+		for (y = priv->damage.ystart; y < priv->damage.yend; y++) {
+			ulong fb = (ulong)priv->fb;
+			ulong start = fb + (y * priv->line_length) + lstart;
+			ulong end = start + lend - lstart;
+
+			start = ALIGN_DOWN(start, CONFIG_SYS_CACHELINE_SIZE);
+			end = ALIGN(end, CONFIG_SYS_CACHELINE_SIZE);
+
+			flush_dcache_range(start, end);
+		}
+	}
+}
+#endif
+
 /* Flush video activity to the caches */
 int video_sync(struct udevice *vid, bool force)
 {
@@ -404,11 +439,7 @@  int video_sync(struct udevice *vid, bool force)
 	 * out whether it exists? For now, ARM is safe.
 	 */
 #if defined(CONFIG_ARM) && !CONFIG_IS_ENABLED(SYS_DCACHE_OFF)
-	if (priv->flush_dcache) {
-		flush_dcache_range((ulong)priv->fb,
-				   ALIGN((ulong)priv->fb + priv->fb_size,
-					 CONFIG_SYS_CACHELINE_SIZE));
-	}
+	video_flush_dcache(vid);
 #elif defined(CONFIG_VIDEO_SANDBOX_SDL)
 	static ulong last_sync;