mbox series

[00/28] Use the best FPU strategies on 32-bits Arm Cortex

Message ID 20220118104338.2081259-1-giulio.benetti@benettiengineering.com
Headers show
Series Use the best FPU strategies on 32-bits Arm Cortex | expand

Message

Giulio Benetti Jan. 18, 2022, 10:43 a.m. UTC
Hi All,

this patchset aims to enable the best FPU strategy for every board with
32-bits Arm Cortex actually present in Buildroot. I don't own these boards
so I can't test these changes. What is about Allwinner doesn't worry me
because I've tested a lot of cases with Olimex boards, but the other
changes are still to be tested.

So I ask to the board maintainers to test the patches that involve their
boards if possible. Anyway I've checked well all the SoCs Datasheet and I
think I've made it correctly, even if in some of them it's not specified
if VFPv4 means -D32. I assume it like that because otherwise -D16 is
usually specified.

Best regards
---
Giulio Benetti
Benetti Engineering sas

Giulio Benetti (28):
  configs/acmesystems_acqua_a5_256mb_defconfig: enable NEON/VFPV4 FPU
    strategy
  configs/acmesystems_acqua_a5_512mb_defconfig: enable NEON/VFPV4 FPU
    strategy
  configs/bananapi_m2_plus_defconfig: enable NEON/VFPV4 FPU strategy
  configs/bananapi_m2_plus_defconfig: enable NEON/VFPV4 FPU strategy
  configs/beagleboardx15_defconfig: enable NEON/VFPV4 FPU strategy
  configs/beagleboneai_defconfig: enable NEON/VFPV4 FPU strategy
  configs/chromebook_snow_defconfig: enable NEON/VFPV4 FPU strategy
  configs/chromebook_snow_defconfig: enable NEON/VFPV4 FPU strategy
  configs/friendlyarm_nanopi_m1_defconfig: enable NEON/VFPV4 FPU
    strategy
  configs/friendlyarm_nanopi_m1_plus_defconfig: enable NEON/VFPV4 FPU
    strategy
  configs/friendlyarm_nanopi_r1_defconfig: enable NEON/VFPV4 FPU
    strategy
  configs/friendlyarm_nanopi_r1_defconfig: enable NEON/VFPV4 FPU
    strategy
  configs/grinn_liteboard_defconfig: enable NEON/VFPV4 FPU strategy
  configs/licheepi_zero_defconfig: enable NEON/VFPV4 FPU strategy
  configs/microchip_sama5d27_wlsom1_ek_mmc_defconfig: enable NEON/VFPV4
    FPU strategy
  configs/microchip_sama5d27_wlsom1_ek_mmc_dev_defconfig: enable
    NEON/VFPV4 FPU strategy
  configs/microchip_sama5d2_icp_mmc_dev_defconfig: enable NEON/VFPV4 FPU
    strategy
  configs/microchip_sama5d2_icp_mmc_dev_defconfig: enable NEON/VFPV4 FPU
    strategy
  configs/orangepi_lite_defconfig: enable NEON/VFPV4 FPU strategy
  configs/orangepi_one_defconfig: enable NEON/VFPV4 FPU strategy
  configs/orangepi_pc_defconfig: enable NEON/VFPV4 FPU strategy
  configs/orangepi_pc_plus_defconfig: enable NEON/VFPV4 FPU strategy
  configs/orangepi_plus_defconfig: enable NEON/VFPV4 FPU strategy
  configs/orangepi_r1_defconfig: enable NEON/VFPV4 FPU strategy
  configs/orangepi_zero_defconfig: enable NEON/VFPV4 FPU strategy
  configs/stm32mp157a_dk1_defconfig: enable NEON/VFPV4 FPU strategy
  configs/stm32mp157c_dk2_defconfig: enable NEON/VFPV4 FPU strategy
  configs/stm32mp157c_odyssey_defconfig: enable NEON/VFPV4 FPU strategy

 configs/acmesystems_acqua_a5_256mb_defconfig           | 1 +
 configs/acmesystems_acqua_a5_512mb_defconfig           | 1 +
 configs/bananapi_m2_plus_defconfig                     | 1 +
 configs/bananapi_m2_ultra_defconfig                    | 1 +
 configs/beagleboardx15_defconfig                       | 1 +
 configs/beagleboneai_defconfig                         | 1 +
 configs/chromebook_snow_defconfig                      | 1 +
 configs/freescale_imx7dsabresd_defconfig               | 1 +
 configs/friendlyarm_nanopi_m1_defconfig                | 2 +-
 configs/friendlyarm_nanopi_m1_plus_defconfig           | 2 +-
 configs/friendlyarm_nanopi_neo_defconfig               | 2 +-
 configs/friendlyarm_nanopi_r1_defconfig                | 2 +-
 configs/grinn_liteboard_defconfig                      | 1 +
 configs/licheepi_zero_defconfig                        | 2 +-
 configs/microchip_sama5d27_wlsom1_ek_mmc_defconfig     | 1 +
 configs/microchip_sama5d27_wlsom1_ek_mmc_dev_defconfig | 1 +
 configs/microchip_sama5d2_icp_mmc_defconfig            | 1 +
 configs/microchip_sama5d2_icp_mmc_dev_defconfig        | 1 +
 configs/orangepi_lite_defconfig                        | 2 +-
 configs/orangepi_one_defconfig                         | 2 +-
 configs/orangepi_pc_defconfig                          | 2 +-
 configs/orangepi_pc_plus_defconfig                     | 2 +-
 configs/orangepi_plus_defconfig                        | 2 +-
 configs/orangepi_r1_defconfig                          | 2 +-
 configs/orangepi_zero_defconfig                        | 2 +-
 configs/stm32mp157a_dk1_defconfig                      | 1 +
 configs/stm32mp157c_dk2_defconfig                      | 1 +
 configs/stm32mp157c_odyssey_defconfig                  | 1 +
 28 files changed, 28 insertions(+), 12 deletions(-)

Comments

Thomas Petazzoni Jan. 22, 2022, 2:32 p.m. UTC | #1
Hello,

On Tue, 18 Jan 2022 11:43:10 +0100
Giulio Benetti <giulio.benetti@benettiengineering.com> wrote:

> this patchset aims to enable the best FPU strategy for every board with
> 32-bits Arm Cortex actually present in Buildroot. I don't own these boards
> so I can't test these changes. What is about Allwinner doesn't worry me
> because I've tested a lot of cases with Olimex boards, but the other
> changes are still to be tested.
> 
> So I ask to the board maintainers to test the patches that involve their
> boards if possible. Anyway I've checked well all the SoCs Datasheet and I
> think I've made it correctly, even if in some of them it's not specified
> if VFPv4 means -D32. I assume it like that because otherwise -D16 is
> usually specified.

Do we have a good understand of what -mfpu=neon-vfpv4 does? Your patch
series basically converts many defconfigs to use this -mfpu value, but
it's not clear to me how it works. What does it mean to combine NEON
and VFPv4 instructions?

Regarding VFPv4 D16 vs. D32,
https://developer.arm.com/documentation/den0018/a/Compiling-NEON-Instructions/GCC-command-line-options/Option-to-specify-the-FPU
tells us:

  VFPv3 and VFPv4 implementations provide 32 double-precision
  registers. However, when NEON unit is not present, the top sixteen
  registers (D16-D31) become optional. This is shown by the -d16 in the
  option name, which means that the top sixteen D registers are not
  available.

So, my understanding is that when NEON is available, the VFPv4 is
guaranteed to have the 32 double-precision registers (D32).

Thomas
Yann E. MORIN Jan. 25, 2022, 8:55 p.m. UTC | #2
Thomas, Giulio, All,

On 2022-01-22 15:32 +0100, Thomas Petazzoni spake thusly:
> On Tue, 18 Jan 2022 11:43:10 +0100
> Giulio Benetti <giulio.benetti@benettiengineering.com> wrote:
> 
> > this patchset aims to enable the best FPU strategy for every board with
> > 32-bits Arm Cortex actually present in Buildroot. I don't own these boards
> > so I can't test these changes. What is about Allwinner doesn't worry me
> > because I've tested a lot of cases with Olimex boards, but the other
> > changes are still to be tested.
> > 
> > So I ask to the board maintainers to test the patches that involve their
> > boards if possible. Anyway I've checked well all the SoCs Datasheet and I
> > think I've made it correctly, even if in some of them it's not specified
> > if VFPv4 means -D32. I assume it like that because otherwise -D16 is
> > usually specified.
> 
> Do we have a good understand of what -mfpu=neon-vfpv4 does? Your patch
> series basically converts many defconfigs to use this -mfpu value, but
> it's not clear to me how it works. What does it mean to combine NEON
> and VFPv4 instructions?

The gcc man page states that specifying Neon as part of the fpu setting
has no effect, unless the -funsafe-math-optimizations is also specified,
because Neon is not compliant zith IEEE 754:

    If the selected floating-point hardware includes the NEON extension
    (e.g. -mfpu=neon), note that floating-point operations are not
    generated by GCC's auto-vectorization pass unless
    -funsafe-math-optimizations is also specified. This is because NEON
    hardware does not fully implement the IEEE 754 standard for
    floating-point arithmetic (in particular denormal values are treated
    as zero), so the use of NEON instructions may lead to a loss of
    precision.

So it is my understanding that using Neon is not a good idea overall. It
should only be requested on-demand by people who know what they are
doing, and most probably, be setting appropriate CFLAGS on a per-package
basis.

Additionally, changing the default FPU setting on those defconfigs is
not really interesting. Indeed, the base system that (most of) those
defconfig build are probably not exercising the FPU setting much (the
busybox login and shell are probably not using much FPU insns).

So, for me, this series is a no-go, first because it has not been tsted
on actual hardware, second because some of the changes introduce a
dubious feature which is actually a no-op at best, or worse will
generate incorrect code.

Regards,
Yann E. MORIN.

> Regarding VFPv4 D16 vs. D32,
> https://developer.arm.com/documentation/den0018/a/Compiling-NEON-Instructions/GCC-command-line-options/Option-to-specify-the-FPU
> tells us:
> 
>   VFPv3 and VFPv4 implementations provide 32 double-precision
>   registers. However, when NEON unit is not present, the top sixteen
>   registers (D16-D31) become optional. This is shown by the -d16 in the
>   option name, which means that the top sixteen D registers are not
>   available.
> 
> So, my understanding is that when NEON is available, the VFPv4 is
> guaranteed to have the 32 double-precision registers (D32).
> 
> Thomas
> -- 
> Thomas Petazzoni, co-owner and CEO, Bootlin
> Embedded Linux and Kernel engineering and training
> https://bootlin.com
Giulio Benetti Jan. 25, 2022, 9:48 p.m. UTC | #3
Hi Yann, Thomas, All,

On 25/01/22 21:55, Yann E. MORIN wrote:
> Thomas, Giulio, All,
> 
> On 2022-01-22 15:32 +0100, Thomas Petazzoni spake thusly:
>> On Tue, 18 Jan 2022 11:43:10 +0100
>> Giulio Benetti <giulio.benetti@benettiengineering.com> wrote:
>>
>>> this patchset aims to enable the best FPU strategy for every board with
>>> 32-bits Arm Cortex actually present in Buildroot. I don't own these boards
>>> so I can't test these changes. What is about Allwinner doesn't worry me
>>> because I've tested a lot of cases with Olimex boards, but the other
>>> changes are still to be tested.
>>>
>>> So I ask to the board maintainers to test the patches that involve their
>>> boards if possible. Anyway I've checked well all the SoCs Datasheet and I
>>> think I've made it correctly, even if in some of them it's not specified
>>> if VFPv4 means -D32. I assume it like that because otherwise -D16 is
>>> usually specified.
>>
>> Do we have a good understand of what -mfpu=neon-vfpv4 does? Your patch
>> series basically converts many defconfigs to use this -mfpu value, but
>> it's not clear to me how it works. What does it mean to combine NEON
>> and VFPv4 instructions?
> 
> The gcc man page states that specifying Neon as part of the fpu setting
> has no effect, unless the -funsafe-math-optimizations is also specified,
> because Neon is not compliant zith IEEE 754:
> 
>      If the selected floating-point hardware includes the NEON extension
>      (e.g. -mfpu=neon), note that floating-point operations are not
>      generated by GCC's auto-vectorization pass unless
>      -funsafe-math-optimizations is also specified. This is because NEON
>      hardware does not fully implement the IEEE 754 standard for
>      floating-point arithmetic (in particular denormal values are treated
>      as zero), so the use of NEON instructions may lead to a loss of
>      precision.
> 
> So it is my understanding that using Neon is not a good idea overall. It
> should only be requested on-demand by people who know what they are
> doing, and most probably, be setting appropriate CFLAGS on a per-package
> basis.
> 
> Additionally, changing the default FPU setting on those defconfigs is
> not really interesting. Indeed, the base system that (most of) those
> defconfig build are probably not exercising the FPU setting much (the
> busybox login and shell are probably not using much FPU insns).
> 
> So, for me, this series is a no-go, first because it has not been tsted
> on actual hardware, second because some of the changes introduce a
> dubious feature which is actually a no-op at best, or worse will
> generate incorrect code.

I agree with you, it's too dangerous. I mean, I've tested on all 
Allwinner Axx series, but haven't check that in the end the assembly 
code corresponding. What I plan to do later, so now you can reject this 
patchset, is to use the smallest package that uses fpu instructions for 
sure, then enable for example:
-mfpu=vfpv4
check
AND
-mfpu=neon-vfpv4

This way I will have my real answer before eventually re-submitting 
while explaining this in detail.

Also, once I will have cleared this topic and it comes out that fpu is 
not use with neon-vfpv4, then I'll send patches to remove that fpu 
strategy on the boards I've set it to it.

I thought it was easier, instead this is an insidious topic.

So yes, let's drop this and thank you all for helping me to clarify.

Kind regards!