
[9/17,ARM] Add NEON FP16 arithmetic instructions.

Message ID 573B2CA9.5060703@foss.arm.com
State New

Commit Message

Matthew Wahab May 17, 2016, 2:37 p.m. UTC
The ARMv8.2-A FP16 extension adds a number of arithmetic instructions
to the NEON instruction set. This patch adds support for these
instructions to the ARM backend.

As with the VFP FP16 arithmetic instructions, operations on __fp16
values are done by conversion to single-precision. Any new optimization
supported by the instruction descriptions can only apply to code
generated using intrinsics added in this patch series.
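
To make this concrete: a minimal sketch, not part of the patch, assuming
an ARM target with __fp16 support (for example -mfp16-format=ieee or
ARMv8.2-A FP16).  Plain __fp16 arithmetic like the function below is
widened to single precision, so only code written with the new
intrinsics can use the f16 instructions added here.

  /* Hedged example, not from the patch: this compiles to a widening
     convert to single precision, a vadd.f32 and a convert back to
     __fp16, rather than a vadd.f16.  The function name is made up.  */
  __fp16
  add_halfs (__fp16 a, __fp16 b)
  {
    return a + b;
  }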

A number of the instructions are modelled as two variants, one using
UNSPEC and the other using RTL operations, with the choice between them
decided by the -funsafe-math-optimizations flag. This follows the
single-precision instructions and is due to the half-precision
operations having the same conditions and restrictions on their use in
optimizations (when they are enabled).

Tested the series for arm-none-linux-gnueabihf with native bootstrap and
make check and for arm-none-eabi and armeb-none-eabi with make check on
an ARMv8.2-A emulator.

Ok for trunk?
Matthew

2016-05-17  Matthew Wahab  <matthew.wahab@arm.com>

	* config/arm/iterators.md (VCVTHI): New.
	(NEON_VCMP): Add UNSPEC_VCLT and UNSPEC_VCLE.  Fix a long line.
	(NEON_VAGLTE): New.
	(VFM_LANE_AS): New.
	(VH_CVTTO): New.
	(V_reg): Add HF, V4HF and V8HF.  Fix white-space.
	(V_HALF): Add V4HF.  Fix white-space.
	(V_if_elem): Add HF, V4HF and V8HF.  Fix white-space.
	(V_s_elem): Likewise.
	(V_sz_elem): Fix white-space.
	(V_elem_ch): Likewise.
	(VH_elem_ch): New.
	(scalar_mul_constraint): Add V8HF and V4HF.
	(Is_float_mode): Fix white-space.
	(Is_d_reg): Likewise.
	(q): Add HF.  Fix white-space.
	(float_sup): New.
	(float_SUP): New.
	(cmp_op_unsp): Add UNSPEC_VCALE and UNSPEC_VCALT.
	(neon_vfm_lane_as): New.
	* config/arm/neon.md (add<mode>3_fp16): New.
	(sub<mode>3_fp16): New.
	(mul<mode>3add<mode>_neon): New.
	(*fma<VH:mode>4): New.
	(fma<VH:mode>4_intrinsic): New.
	(fmsub<VCVTF:mode>4_intrinsic): Fix white-space.
	(*fmsub<VH:mode>4): New.
	(fmsub<VH:mode>4_intrinsic): New.
	(<absneg_str><mode>2_fp16): New.
	(neon_v<absneg_str><mode>): New.
	(neon_v<fp16_rnd_str><mode>): New.
	(neon_vsqrte<mode>): New.
	(neon_vpaddv4hf): New.
	(neon_vadd<mode>): New.
	(neon_vsub<mode>): New.
	(neon_vadd<mode>_unspec): New.
	(neon_vsub<mode>_unspec): New.
	(neon_vmulf<mode>): New.
	(neon_vfma<VH:mode>): New.
	(neon_vfms<VH:mode>): New.
	(neon_vc<cmp_op><mode>): New.
	(neon_vc<cmp_op><mode>_fp16insn): New.
	(neon_vc<cmp_op_unsp><mode>_fp16insn_unspec): New.
	(neon_vca<cmp_op><mode>): New.
	(neon_vca<cmp_op><mode>_fp16insn): New.
	(neon_vca<cmp_op_unsp><mode>_fp16insn_unspec): New.
	(neon_vc<cmp_op>z<mode>): New.
	(neon_vabd<mode>): New.
	(neon_v<maxmin>f<mode>): New.
	(neon_vp<maxmin>fv4hf): New.
	(neon_<fmaxmin_op><mode>): New.
	(neon_vrecps<mode>): New.
	(neon_vrsqrts<mode>): New.
	(neon_vrecpe<mode>): New (VH variant).
	(neon_vcvt<sup><mode>): New (VCVTHI variant).
	(neon_vcvt<sup><mode>): New (VH variant).
	(neon_vcvt<sup>_n<mode>): New (VH variant).
	(neon_vcvt<sup>_n<mode>): New (VCVTHI variant).
	(neon_vcvt<vcvth_op><sup><mode>): New (VH variant).
	(neon_vmul_lane<mode>): New.
	(neon_vmul_n<mode>): New.
	* config/arm/unspecs.md (UNSPEC_VCALE): New.
	(UNSPEC_VCALT): New.
	(UNSPEC_VFMA_LANE): New.
	(UNSPEC_VFMS_LANE): New.
	(UNSPEC_VSQRTE): New.

testsuite/
2016-05-17  Matthew Wahab  <matthew.wahab@arm.com>

	* gcc.target/arm/armv8_2-fp16-arith-1.c: Add tests for float16x4_t
	and float16x8_t.

Comments

Joseph Myers May 18, 2016, 12:58 a.m. UTC | #1
On Tue, 17 May 2016, Matthew Wahab wrote:

> As with the VFP FP16 arithmetic instructions, operations on __fp16
> values are done by conversion to single-precision. Any new optimization
> supported by the instruction descriptions can only apply to code
> generated using intrinsics added in this patch series.

As with the scalar instructions, I think it is legitimate in most cases to 
optimize arithmetic via single precision to work direct on __fp16 values 
(and this would be natural for vectorization of __fp16 arithmetic).
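
For illustration only (a hedged sketch, not something this patch adds or
changes; the function name is invented): with standard-named HF patterns
available, a loop like the following could in principle be vectorized to
f16 vector arithmetic instead of widening each element to single
precision.

  void
  add_fp16_arrays (__fp16 *restrict r, const __fp16 *restrict a,
                   const __fp16 *restrict b, int n)
  {
    for (int i = 0; i < n; i++)
      r[i] = a[i] + b[i];  /* A candidate for vadd.f16.  */
  }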

> A number of the instructions are modelled as two variants, one using
> UNSPEC and the other using RTL operations, with the choice between them
> decided by the -funsafe-math-optimizations flag. This follows the
> single-precision instructions and is due to the half-precision
> operations having the same conditions and restrictions on their use in
> optimizations (when they are enabled).

(Of course, these restrictions still apply.)
Jiong Wang May 19, 2016, 5:01 p.m. UTC | #2
On 18/05/16 01:58, Joseph Myers wrote:
> On Tue, 17 May 2016, Matthew Wahab wrote:
>
>> As with the VFP FP16 arithmetic instructions, operations on __fp16
>> values are done by conversion to single-precision. Any new optimization
>> supported by the instruction descriptions can only apply to code
>> generated using intrinsics added in this patch series.
> As with the scalar instructions, I think it is legitimate in most cases to
> optimize arithmetic via single precision to work direct on __fp16 values
> (and this would be natural for vectorization of __fp16 arithmetic).

Hi Joseph,

   Currently, for vector types like V4HF, there is no type promotion:
the vector operation survives until it reaches the vector lowering
pass, where it is split into HF operations, and those HF operations are
then widened into SF operations during RTL expansion because we don't
have scalar HF support in the standard patterns.

Then,

   * if we add scalar HF modes to the standard patterns, vector HF
     operations will be turned into scalar HF operations instead of
     scalar SF operations.

   * if we add vector HF modes to the standard patterns, vector HF
     operations will generate vector HF instructions directly.

   Will this still cause precision inconsistencies with old GCC when
   there are cascaded vector float operations?

   Thanks

Regards,
Jiong
Joseph Myers May 19, 2016, 5:29 p.m. UTC | #3
On Thu, 19 May 2016, Jiong Wang wrote:

> Then,
> 
>    * if we add scalar HF modes to the standard patterns, vector HF
>      operations will be turned into scalar HF operations instead of
>      scalar SF operations.
> 
>    * if we add vector HF modes to the standard patterns, vector HF
>      operations will generate vector HF instructions directly.
> 
>    Will this still cause precision inconsistencies with old GCC when
>    there are cascaded vector float operations?

I'm not sure inconsistency with old GCC is what's relevant here.

Standard-named RTL patterns have particular semantics.  Those semantics do 
not depend on the target architecture (except where there are target 
macros / hooks to define such dependence).  If you have an instruction 
that matches those target-independent semantics, it should be available 
for the standard-named pattern.  I believe that is the case here, for both 
the scalar and the vector instructions - they have the standard semantics, 
so should be available for the standard patterns.

It is the responsibility of the target-independent parts of the compiler 
to ensure that the RTL generated matches the source code semantics, so 
that providing a standard pattern for an instruction that matches the 
pattern's semantics does not cause any problems regarding source code 
semantics.

That said: if the expander in old GCC is converting a vector HF operation 
into scalar SF operations, I'd expect it also to include a conversion from 
SFmode back to HFmode after those operations, since it will be producing a 
vector HF result.  And that would apply for each individual operation 
expanded.  So I would not expect inconsistency to arise from making direct 
HFmode operations available (given that the semantics of scalar + - * / 
are the same whether you do them directly on HFmode or promote to SFmode, 
do the operation there and then convert the result back to HFmode before 
doing any further operations on it).
James Greenhalgh June 8, 2016, 8:45 a.m. UTC | #4
On Thu, May 19, 2016 at 05:29:16PM +0000, Joseph Myers wrote:
> On Thu, 19 May 2016, Jiong Wang wrote:
> 
> > Then,
> > 
> >    * if we add scalar HF modes to the standard patterns, vector HF
> >      operations will be turned into scalar HF operations instead of
> >      scalar SF operations.
> > 
> >    * if we add vector HF modes to the standard patterns, vector HF
> >      operations will generate vector HF instructions directly.
> > 
> >    Will this still cause precision inconsistencies with old GCC when
> >    there are cascaded vector float operations?
> 
> I'm not sure inconsistency with old GCC is what's relevant here.
> 
> Standard-named RTL patterns have particular semantics.  Those semantics do 
> not depend on the target architecture (except where there are target 
> macros / hooks to define such dependence).  If you have an instruction 
> that matches those target-independent semantics, it should be available 
> for the standard-named pattern.  I believe that is the case here, for both 
> the scalar and the vector instructions - they have the standard semantics, 
> so should be available for the standard patterns.
> 
> It is the responsibility of the target-independent parts of the compiler 
> to ensure that the RTL generated matches the source code semantics, so 
> that providing a standard pattern for an instruction that matches the 
> pattern's semantics does not cause any problems regarding source code 
> semantics.
> 
> That said: if the expander in old GCC is converting a vector HF operation 
> into scalar SF operations, I'd expect it also to include a conversion from 
> SFmode back to HFmode after those operations, since it will be producing a 
> vector HF result.  And that would apply for each individual operation 
> expanded.  So I would not expect inconsistency to arise from making direct 
> HFmode operations available (given that the semantics of scalar + - * / 
> are the same whether you do them directly on HFmode or promote to SFmode, 
> do the operation there and then convert the result back to HFmode before 
> doing any further operations on it).

I think the confusion here is that these two functions:

  float16x8_t
  __attribute__ ((noinline)) 
  foo (float16x8_t a, float16x8_t b, float16x8_t c)
  {
    return a * b / c;
  }

  float16_t
  __attribute__ ((noinline)) 
  bar (float16_t a, float16_t b, float16_t c)
  {
    return a * b / c;
  }

Have different behaviours in terms of when they extend and truncate between
floating-point precisions.

A full testcase calling these functions is attached.

Compile with

  `gcc -O3`
     for AArch64 ARMv8-A
  `gcc -O3 -mfloat-abi=hard -mfpu=neon-fp16 -mfp16-format=ieee -march=armv7-a`
     for ARMv7-A 

This prints:

  Fail:
	Scalar Input	256.000000
	Scalar Output	256.000000
	Vector input	256.000000
	Vector output	inf
  Fail:
	Scalar Input	3.300781
	Scalar Output	3.300781
	Vector input	3.300781
	Vector output	3.302734
  Fail:
	Scalar Input	10000.000000
	Scalar Output	10000.000000
	Vector input	10000.000000
	Vector output	inf
  Fail:
	Scalar Input	0.000003
	Scalar Output	0.000003
	Vector input	0.000003
	Vector output	0.000000
  Fail:
	Scalar Input	0.000400
	Scalar Output	0.000400
	Vector input	0.000400
	Vector output	0.000447

foo, operating on vectors, remains in 16-bit precision throughout gimple,
will scalarise during veclower, and will add float_extend and float_truncate
around each operation during expand to preserve the 16-bit rounding
behaviour. For this testcase, that means two truncates per vector element:
one after the multiply, one after the divide.

bar, operating on scalars, adds promotions early due to TARGET_PROMOTED_TYPE.
In gimple we stay in 32-bit precision for the two operations, and we
truncate only after both operations. That means one truncate, taking place
after the divide.

The extra intermediate truncation is what the first failing case shows:
256.0 * 256.0 = 65536.0 exceeds the largest half-precision value (65504),
so truncating after the multiply gives infinity and the divide cannot
recover it, while the scalar path keeps 65536.0f in single precision and
only truncates the in-range final quotient.

However, I find this surprising at a language level, though I see
that Clang 3.8 has the same behaviour.  ACLE doesn't mention the GCC
vector extensions, so doesn't specify the behaviour of the arithmetic
operators on vector-of-float16_t types. GCC's vector extension documentation
gives this definition for arithmetic operations:

  The types defined in this manner can be used with a subset of normal
  C operations. Currently, GCC allows using the following operators on
  these types: +, -, *, /, unary minus, ^, |, &, ~, %.

  The operations behave like C++ valarrays. Addition is defined as
  the addition of the corresponding elements of the operands. For
  example, in the code below, each of the 4 elements in a is added to
  the corresponding 4 elements in b and the resulting vector is stored
  in c.

  Subtraction, multiplication, division, and the logical operations
  operate in a similar manner. Likewise, the result of using the unary
  minus or complement operators on a vector type is a vector whose
  elements are the negative or complemented values of the corresponding
  elements in the operand. 

Without digging into the compiler code, I would have expected the vector
implementation to give equivalent results to the scalar one.

My question is whether you consider the different behaviour between scalar
float16_t and vector-of-float16_t types to be a bug? I can think of some
ways to fix the vector behaviour if it is buggy, but they would of course
be a change in behaviour from current releases (and from clang 3.8).

Clearly, this makes no difference to your comment that we should implement
these using standard pattern names. Either this is a bug, in which case
the front-end will arrange for the promotion to vector-of-float32_t
types, and implementing the vector standard pattern names would potentially
allow for some optimisation back to vector-of-float16_t type, or this
is not a bug, in which case the vector-of-float16_t standard pattern names
match the expected semantics perfectly.

Thanks,
James
Joseph Myers June 8, 2016, 8:02 p.m. UTC | #5
On Wed, 8 Jun 2016, James Greenhalgh wrote:

> My question is whether you consider the different behaviour between scalar
> float16_t and vector-of-float16_t types to be a bug? I can think of some

No, because it matches how things work for vectors of integer types.  
E.g.:

typedef unsigned char vuc __attribute__((vector_size(8)));

vuc a = { 128, 128, 128, 128, 128, 128, 128, 128 }, b;

int
main (void)
{
  b = a / (a + a);
  return 0;
}

(Does a divide-by-zero, because (a + a) is evaluated without promotion to 
vector of int.)

It's a general rule for vector operations that there are no promotions 
that change the bit-size of the vectors, so arithmetic is done directly on 
unsigned char in this case, even though it normally would not be.  
Conversions when the types match apart from signedness are, as the comment 
in c_common_type notes, not fully defined.

  /* If one type is a vector type, return that type.  (How the usual
     arithmetic conversions apply to the vector types extension is not
     precisely specified.)  */

Patch

From 623f36632cc2848f16ba1c75f400198a72dc6ea4 Mon Sep 17 00:00:00 2001
From: Matthew Wahab <matthew.wahab@arm.com>
Date: Thu, 7 Apr 2016 16:19:57 +0100
Subject: [PATCH 09/17] [PATCH 9/17][ARM] Add NEON FP16 arithmetic
 instructions.

2016-05-17  Matthew Wahab  <matthew.wahab@arm.com>

	* config/arm/iterators.md (VCVTHI): New.
	(NEON_VCMP): Add UNSPEC_VCLT and UNSPEC_VCLE.  Fix a long line.
	(NEON_VAGLTE): New.
	(VFM_LANE_AS): New.
	(VH_CVTTO): New.
	(V_reg): Add HF, V4HF and V8HF.  Fix white-space.
	(V_HALF): Add V4HF.  Fix white-space.
	(V_if_elem): Add HF, V4HF and V8HF.  Fix white-space.
	(V_s_elem): Likewise.
	(V_sz_elem): Fix white-space.
	(V_elem_ch): Likewise.
	(VH_elem_ch): New.
	(scalar_mul_constraint): Add V8HF and V4HF.
	(Is_float_mode): Fix white-space.
	(Is_d_reg): Add V4HF and V8HF.  Fix white-space.
	(q): Add HF.  Fix white-space.
	(float_sup): New.
	(float_SUP): New.
	(cmp_op_unsp): Add UNSPEC_VCALE and UNSPEC_VCALT.
	(neon_vfm_lane_as): New.
	* config/arm/neon.md (add<mode>3_fp16): New.
	(sub<mode>3_fp16): New.
	(mul<mode>3add<mode>_neon): New.
	(*fma<VH:mode>4): New.
	(fma<VH:mode>4_intrinsic): New.
	(fmsub<VCVTF:mode>4_intrinsic): Fix white-space.
	(*fmsub<VH:mode>4): New.
	(fmsub<VH:mode>4_intrinsic): New.
	(<absneg_str><mode>2_fp16): New.
	(neon_v<absneg_str><mode>): New.
	(neon_v<fp16_rnd_str><mode>): New.
	(neon_vsqrte<mode>): New.
	(neon_vpaddv4hf): New.
	(neon_vadd<mode>): New.
	(neon_vsub<mode>): New.
	(neon_vadd<mode>_unspec): New.
	(neon_vsub<mode>_unspec): New.
	(neon_vmulf<mode>): New.
	(neon_vfma<VH:mode>): New.
	(neon_vfms<VH:mode>): New.
	(neon_vc<cmp_op><mode>): New.
	(neon_vc<cmp_op><mode>_fp16insn): New.
	(neon_vc<cmp_op_unsp><mode>_fp16insn_unspec): New.
	(neon_vca<cmp_op><mode>): New.
	(neon_vca<cmp_op><mode>_fp16insn): New.
	(neon_vca<cmp_op_unsp><mode>_fp16insn_unspec): New.
	(neon_vc<cmp_op>z<mode>): New.
	(neon_vabd<mode>): New.
	(neon_v<maxmin>f<mode>): New.
	(neon_vp<maxmin>fv4hf): New.
	(neon_<fmaxmin_op><mode>): New.
	(neon_vrecps<mode>): New.
	(neon_vrsqrts<mode>): New.
	(neon_vrecpe<mode>): New (VH variant).
	(neon_vdup_lane<mode>_internal): New.
	(neon_vdup_lane<mode>): New.
	(neon_vcvt<sup><mode>): New (VCVTHI variant).
	(neon_vcvt<sup><mode>): New (VH variant).
	(neon_vcvt<sup>_n<mode>): New (VH variant).
	(neon_vcvt<sup>_n<mode>): New (VCVTHI variant).
	(neon_vcvt<vcvth_op><sup><mode>): New (VH variant).
	(neon_vmul_lane<mode>): New.
	(neon_vmul_n<mode>): New.
	* config/arm/unspecs.md (UNSPEC_VCALE): New.
	(UNSPEC_VCALT): New.
	(UNSPEC_VFMA_LANE): New.
	(UNSPEC_VFMS_LANE): New.
	(UNSPEC_VSQRTE): New.

testsuite/
2016-05-17  Matthew Wahab  <matthew.wahab@arm.com>

	* gcc.target/arm/armv8_2-fp16-arith-1.c: Add tests for float16x4_t
	and float16x8_t.
---
 gcc/config/arm/iterators.md                        | 121 +++--
 gcc/config/arm/neon.md                             | 503 ++++++++++++++++++++-
 gcc/config/arm/unspecs.md                          |   6 +-
 .../gcc.target/arm/armv8_2-fp16-arith-1.c          |  49 +-
 4 files changed, 621 insertions(+), 58 deletions(-)

diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 9371b6a..be39e4a 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -145,6 +145,9 @@ 
 ;; Vector modes form int->float conversions.
 (define_mode_iterator VCVTI [V2SI V4SI])
 
+;; Vector modes for int->half conversions.
+(define_mode_iterator VCVTHI [V4HI V8HI])
+
 ;; Vector modes for doubleword multiply-accumulate, etc. insns.
 (define_mode_iterator VMD [V4HI V2SI V2SF])
 
@@ -267,10 +270,14 @@ 
 (define_int_iterator VRINT [UNSPEC_VRINTZ UNSPEC_VRINTP UNSPEC_VRINTM
                             UNSPEC_VRINTR UNSPEC_VRINTX UNSPEC_VRINTA])
 
-(define_int_iterator NEON_VCMP [UNSPEC_VCEQ UNSPEC_VCGT UNSPEC_VCGE UNSPEC_VCLT UNSPEC_VCLE])
+(define_int_iterator NEON_VCMP [UNSPEC_VCEQ UNSPEC_VCGT UNSPEC_VCGE
+				UNSPEC_VCLT UNSPEC_VCLE])
 
 (define_int_iterator NEON_VACMP [UNSPEC_VCAGE UNSPEC_VCAGT])
 
+(define_int_iterator NEON_VAGLTE [UNSPEC_VCAGE UNSPEC_VCAGT
+				  UNSPEC_VCALE UNSPEC_VCALT])
+
 (define_int_iterator VCVT [UNSPEC_VRINTP UNSPEC_VRINTM UNSPEC_VRINTA])
 
 (define_int_iterator NEON_VRINT [UNSPEC_NVRINTP UNSPEC_NVRINTZ UNSPEC_NVRINTM
@@ -398,6 +405,8 @@ 
 
 (define_int_iterator VQRDMLH_AS [UNSPEC_VQRDMLAH UNSPEC_VQRDMLSH])
 
+(define_int_iterator VFM_LANE_AS [UNSPEC_VFMA_LANE UNSPEC_VFMS_LANE])
+
 ;;----------------------------------------------------------------------------
 ;; Mode attributes
 ;;----------------------------------------------------------------------------
@@ -416,6 +425,10 @@ 
 (define_mode_attr V_cvtto [(V2SI "v2sf") (V2SF "v2si")
                            (V4SI "v4sf") (V4SF "v4si")])
 
+;; (Opposite) mode to convert to/from for vector-half mode conversions.
+(define_mode_attr VH_CVTTO [(V4HI "V4HF") (V4HF "V4HI")
+			    (V8HI "V8HF") (V8HF "V8HI")])
+
 ;; Define element mode for each vector mode.
 (define_mode_attr V_elem [(V8QI "QI") (V16QI "QI")
 			  (V4HI "HI") (V8HI "HI")
@@ -459,12 +472,13 @@ 
 
 ;; Register width from element mode
 (define_mode_attr V_reg [(V8QI "P") (V16QI "q")
-                         (V4HI "P") (V8HI  "q")
-                         (V4HF "P") (V8HF  "q")
-                         (V2SI "P") (V4SI  "q")
-                         (V2SF "P") (V4SF  "q")
-                         (DI   "P") (V2DI  "q")
-                         (SF   "")  (DF    "P")])
+			 (V4HI "P") (V8HI  "q")
+			 (V4HF "P") (V8HF  "q")
+			 (V2SI "P") (V4SI  "q")
+			 (V2SF "P") (V4SF  "q")
+			 (DI   "P") (V2DI  "q")
+			 (SF   "")  (DF    "P")
+			 (HF   "")])
 
 ;; Wider modes with the same number of elements.
 (define_mode_attr V_widen [(V8QI "V8HI") (V4HI "V4SI") (V2SI "V2DI")])
@@ -480,7 +494,7 @@ 
 (define_mode_attr V_HALF [(V16QI "V8QI") (V8HI "V4HI")
 			  (V8HF "V4HF") (V4SI  "V2SI")
 			  (V4SF "V2SF") (V2DF "DF")
-                          (V2DI "DI")])
+			  (V2DI "DI") (V4HF "HF")])
 
 ;; Same, but lower-case.
 (define_mode_attr V_half [(V16QI "v8qi") (V8HI "v4hi")
@@ -529,18 +543,22 @@ 
 ;; Get element type from double-width mode, for operations where we 
 ;; don't care about signedness.
 (define_mode_attr V_if_elem [(V8QI "i8")  (V16QI "i8")
-                 (V4HI "i16") (V8HI  "i16")
-                             (V2SI "i32") (V4SI  "i32")
-                             (DI   "i64") (V2DI  "i64")
-                 (V2SF "f32") (V4SF  "f32")
-                 (SF "f32") (DF "f64")])
+			     (V4HI "i16") (V8HI  "i16")
+			     (V2SI "i32") (V4SI  "i32")
+			     (DI   "i64") (V2DI  "i64")
+			     (V2SF "f32") (V4SF  "f32")
+			     (SF   "f32") (DF    "f64")
+			     (HF   "f16") (V4HF  "f16")
+			     (V8HF "f16")])
 
 ;; Same, but for operations which work on signed values.
 (define_mode_attr V_s_elem [(V8QI "s8")  (V16QI "s8")
-                (V4HI "s16") (V8HI  "s16")
-                            (V2SI "s32") (V4SI  "s32")
-                            (DI   "s64") (V2DI  "s64")
-                (V2SF "f32") (V4SF  "f32")])
+			    (V4HI "s16") (V8HI  "s16")
+			    (V2SI "s32") (V4SI  "s32")
+			    (DI   "s64") (V2DI  "s64")
+			    (V2SF "f32") (V4SF  "f32")
+			    (HF   "f16") (V4HF  "f16")
+			    (V8HF "f16")])
 
 ;; Same, but for operations which work on unsigned values.
 (define_mode_attr V_u_elem [(V8QI "u8")  (V16QI "u8")
@@ -557,17 +575,22 @@ 
                              (V2SF "32") (V4SF "32")])
 
 (define_mode_attr V_sz_elem [(V8QI "8")  (V16QI "8")
-                 (V4HI "16") (V8HI  "16")
-                             (V2SI "32") (V4SI  "32")
-                             (DI   "64") (V2DI  "64")
+			     (V4HI "16") (V8HI  "16")
+			     (V2SI "32") (V4SI  "32")
+			     (DI   "64") (V2DI  "64")
 			     (V4HF "16") (V8HF "16")
-                 (V2SF "32") (V4SF  "32")])
+			     (V2SF "32") (V4SF  "32")])
 
 (define_mode_attr V_elem_ch [(V8QI "b")  (V16QI "b")
-                             (V4HI "h") (V8HI  "h")
-                             (V2SI "s") (V4SI  "s")
-                             (DI   "d") (V2DI  "d")
-                             (V2SF "s") (V4SF  "s")])
+			     (V4HI "h") (V8HI  "h")
+			     (V2SI "s") (V4SI  "s")
+			     (DI   "d") (V2DI  "d")
+			     (V2SF "s") (V4SF  "s")
+			     (V2SF "s") (V4SF  "s")])
+
+(define_mode_attr VH_elem_ch [(V4HI "s") (V8HI  "s")
+			      (V4HF "s") (V8HF  "s")
+			      (HF "s")])
 
 ;; Element sizes for duplicating ARM registers to all elements of a vector.
 (define_mode_attr VD_dup [(V8QI "8") (V4HI "16") (V2SI "32") (V2SF "32")])
@@ -603,16 +626,17 @@ 
 ;; This mode attribute is used to obtain the correct register constraints.
 
 (define_mode_attr scalar_mul_constraint [(V4HI "x") (V2SI "t") (V2SF "t")
-                                         (V8HI "x") (V4SI "t") (V4SF "t")])
+					 (V8HI "x") (V4SI "t") (V4SF "t")
+					 (V8HF "x") (V4HF "x")])
 
 ;; Predicates used for setting type for neon instructions
 
 (define_mode_attr Is_float_mode [(V8QI "false") (V16QI "false")
-                 (V4HI "false") (V8HI "false")
-                 (V2SI "false") (V4SI "false")
-                 (V4HF "true") (V8HF "true")
-                 (V2SF "true") (V4SF "true")
-                 (DI "false") (V2DI "false")])
+				 (V4HI "false") (V8HI "false")
+				 (V2SI "false") (V4SI "false")
+				 (V4HF "true") (V8HF "true")
+				 (V2SF "true") (V4SF "true")
+				 (DI "false") (V2DI "false")])
 
 (define_mode_attr Scalar_mul_8_16 [(V8QI "true") (V16QI "true")
 				   (V4HI "true") (V8HI "true")
@@ -621,10 +645,10 @@ 
 				   (DI "false") (V2DI "false")])
 
 (define_mode_attr Is_d_reg [(V8QI "true") (V16QI "false")
-                            (V4HI "true") (V8HI  "false")
-                            (V2SI "true") (V4SI  "false")
-                            (V2SF "true") (V4SF  "false")
-                            (DI   "true") (V2DI  "false")
+			    (V4HI "true") (V8HI  "false")
+			    (V2SI "true") (V4SI  "false")
+			    (V2SF "true") (V4SF  "false")
+			    (DI   "true") (V2DI  "false")
 			    (V4HF "true") (V8HF  "false")])
 
 (define_mode_attr V_mode_nunits [(V8QI "8") (V16QI "16")
@@ -670,12 +694,14 @@ 
 
 ;; Mode attribute used to build the "type" attribute.
 (define_mode_attr q [(V8QI "") (V16QI "_q")
-                     (V4HI "") (V8HI "_q")
-                     (V2SI "") (V4SI "_q")
+		     (V4HI "") (V8HI "_q")
+		     (V2SI "") (V4SI "_q")
 		     (V4HF "") (V8HF "_q")
-                     (V2SF "") (V4SF "_q")
-                     (DI "")   (V2DI "_q")
-                     (DF "")   (V2DF "_q")])
+		     (V2SF "") (V4SF "_q")
+		     (V4HF "") (V8HF "_q")
+		     (DI "")   (V2DI "_q")
+		     (DF "")   (V2DF "_q")
+		     (HF "")])
 
 (define_mode_attr pf [(V8QI "p") (V16QI "p") (V2SF "f") (V4SF "f")])
 
@@ -718,6 +744,10 @@ 
 ;; Conversions.
 (define_code_attr FCVTI32typename [(unsigned_float "u32") (float "s32")])
 
+(define_code_attr float_sup [(unsigned_float "u") (float "s")])
+
+(define_code_attr float_SUP [(unsigned_float "U") (float "S")])
+
 ;;----------------------------------------------------------------------------
 ;; Int attributes
 ;;----------------------------------------------------------------------------
@@ -790,9 +820,10 @@ 
    (UNSPEC_VRNDP "vrintp") (UNSPEC_VRNDX "vrintx")])
 
 (define_int_attr cmp_op_unsp [(UNSPEC_VCEQ "eq") (UNSPEC_VCGT "gt")
-                              (UNSPEC_VCGE "ge") (UNSPEC_VCLE "le")
-                              (UNSPEC_VCLT "lt") (UNSPEC_VCAGE "ge")
-                              (UNSPEC_VCAGT "gt")])
+			      (UNSPEC_VCGE "ge") (UNSPEC_VCLE "le")
+			      (UNSPEC_VCLT "lt") (UNSPEC_VCAGE "ge")
+			      (UNSPEC_VCAGT "gt") (UNSPEC_VCALE "le")
+			      (UNSPEC_VCALT "lt")])
 
 (define_int_attr r [
   (UNSPEC_VRHADD_S "r") (UNSPEC_VRHADD_U "r")
@@ -908,3 +939,7 @@ 
 
 ;; Attributes for VQRDMLAH/VQRDMLSH
 (define_int_attr neon_rdma_as [(UNSPEC_VQRDMLAH "a") (UNSPEC_VQRDMLSH "s")])
+
+;; Attributes for VFMA_LANE/ VFMS_LANE
+(define_int_attr neon_vfm_lane_as
+ [(UNSPEC_VFMA_LANE "a") (UNSPEC_VFMS_LANE "s")])
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 5fcc991..7a44f5f 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -505,6 +505,20 @@ 
                     (const_string "neon_add<q>")))]
 )
 
+(define_insn "add<mode>3_fp16"
+  [(set
+    (match_operand:VH 0 "s_register_operand" "=w")
+    (plus:VH
+     (match_operand:VH 1 "s_register_operand" "w")
+     (match_operand:VH 2 "s_register_operand" "w")))]
+ "TARGET_NEON_FP16INST"
+ "vadd.<V_if_elem>\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+ [(set (attr "type")
+   (if_then_else (match_test "<Is_float_mode>")
+    (const_string "neon_fp_addsub_s<q>")
+    (const_string "neon_add<q>")))]
+)
+
 (define_insn "adddi3_neon"
   [(set (match_operand:DI 0 "s_register_operand" "=w,?&r,?&r,?w,?&r,?&r,?&r")
         (plus:DI (match_operand:DI 1 "s_register_operand" "%w,0,0,w,r,0,r")
@@ -543,6 +557,17 @@ 
                     (const_string "neon_sub<q>")))]
 )
 
+(define_insn "sub<mode>3_fp16"
+ [(set
+   (match_operand:VH 0 "s_register_operand" "=w")
+   (minus:VH
+    (match_operand:VH 1 "s_register_operand" "w")
+    (match_operand:VH 2 "s_register_operand" "w")))]
+ "TARGET_NEON_FP16INST"
+ "vsub.<V_if_elem>\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+ [(set_attr "type" "neon_sub<q>")]
+)
+
 (define_insn "subdi3_neon"
   [(set (match_operand:DI 0 "s_register_operand" "=w,?&r,?&r,?&r,?w")
         (minus:DI (match_operand:DI 1 "s_register_operand" "w,0,r,0,w")
@@ -591,6 +616,16 @@ 
 		    (const_string "neon_mla_<V_elem_ch><q>")))]
 )
 
+(define_insn "mul<mode>3add<mode>_neon"
+  [(set (match_operand:VH 0 "s_register_operand" "=w")
+	(plus:VH (mult:VH (match_operand:VH 2 "s_register_operand" "w")
+			  (match_operand:VH 3 "s_register_operand" "w"))
+		  (match_operand:VH 1 "s_register_operand" "0")))]
+  "TARGET_NEON_FP16INST && (!<Is_float_mode> || flag_unsafe_math_optimizations)"
+  "vmla.f16\t%<V_reg>0, %<V_reg>2, %<V_reg>3"
+  [(set_attr "type" "neon_fp_mla_s<q>")]
+)
+
 (define_insn "mul<mode>3neg<mode>add<mode>_neon"
   [(set (match_operand:VDQW 0 "s_register_operand" "=w")
         (minus:VDQW (match_operand:VDQW 1 "s_register_operand" "0")
@@ -629,6 +664,28 @@ 
   [(set_attr "type" "neon_fp_mla_s<q>")]
 )
 
+(define_insn "*fma<VH:mode>4"
+  [(set (match_operand:VH 0 "register_operand" "=w")
+    (fma:VH
+     (match_operand:VH 1 "register_operand" "w")
+     (match_operand:VH 2 "register_operand" "w")
+     (match_operand:VH 3 "register_operand" "0")))]
+ "TARGET_NEON_FP16INST && flag_unsafe_math_optimizations"
+ "vfma.<V_if_elem>\\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+ [(set_attr "type" "neon_fp_mla_s<q>")]
+)
+
+(define_insn "fma<VH:mode>4_intrinsic"
+ [(set (match_operand:VH 0 "register_operand" "=w")
+   (fma:VH
+    (match_operand:VH 1 "register_operand" "w")
+    (match_operand:VH 2 "register_operand" "w")
+    (match_operand:VH 3 "register_operand" "0")))]
+ "TARGET_NEON_FP16INST"
+ "vfma.<V_if_elem>\\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+ [(set_attr "type" "neon_fp_mla_s<q>")]
+)
+
 (define_insn "*fmsub<VCVTF:mode>4"
   [(set (match_operand:VCVTF 0 "register_operand" "=w")
         (fma:VCVTF (neg:VCVTF (match_operand:VCVTF 1 "register_operand" "w"))
@@ -640,13 +697,36 @@ 
 )
 
 (define_insn "fmsub<VCVTF:mode>4_intrinsic"
-  [(set (match_operand:VCVTF 0 "register_operand" "=w")
-        (fma:VCVTF (neg:VCVTF (match_operand:VCVTF 1 "register_operand" "w"))
-		   (match_operand:VCVTF 2 "register_operand" "w")
-		   (match_operand:VCVTF 3 "register_operand" "0")))]
-  "TARGET_NEON && TARGET_FMA"
-  "vfms%?.<V_if_elem>\\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
-  [(set_attr "type" "neon_fp_mla_s<q>")]
+ [(set (match_operand:VCVTF 0 "register_operand" "=w")
+   (fma:VCVTF
+    (neg:VCVTF (match_operand:VCVTF 1 "register_operand" "w"))
+    (match_operand:VCVTF 2 "register_operand" "w")
+    (match_operand:VCVTF 3 "register_operand" "0")))]
+ "TARGET_NEON && TARGET_FMA"
+ "vfms%?.<V_if_elem>\\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+ [(set_attr "type" "neon_fp_mla_s<q>")]
+)
+
+(define_insn "*fmsub<VH:mode>4"
+ [(set (match_operand:VH 0 "register_operand" "=w")
+   (fma:VH
+    (neg:VH (match_operand:VH 1 "register_operand" "w"))
+    (match_operand:VH 2 "register_operand" "w")
+    (match_operand:VH 3 "register_operand" "0")))]
+ "TARGET_NEON_FP16INST && flag_unsafe_math_optimizations"
+ "vfms.<V_if_elem>\\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+ [(set_attr "type" "neon_fp_mla_s<q>")]
+)
+
+(define_insn "fmsub<VH:mode>4_intrinsic"
+ [(set (match_operand:VH 0 "register_operand" "=w")
+   (fma:VH
+    (neg:VH (match_operand:VH 1 "register_operand" "w"))
+    (match_operand:VH 2 "register_operand" "w")
+    (match_operand:VH 3 "register_operand" "0")))]
+ "TARGET_NEON_FP16INST"
+ "vfms.<V_if_elem>\\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+ [(set_attr "type" "neon_fp_mla_s<q>")]
 )
 
 (define_insn "neon_vrint<NEON_VRINT:nvrint_variant><VCVTF:mode>"
@@ -860,6 +940,44 @@ 
   ""
 )
 
+(define_insn "<absneg_str><mode>2_fp16"
+  [(set (match_operand:VH 0 "s_register_operand" "=w")
+    (ABSNEG:VH (match_operand:VH 1 "s_register_operand" "w")))]
+ "TARGET_NEON_FP16INST"
+ "v<absneg_str>.<V_s_elem>\t%<V_reg>0, %<V_reg>1"
+ [(set_attr "type" "neon_abs<q>")]
+)
+
+(define_expand "neon_v<absneg_str><mode>"
+ [(set
+   (match_operand:VH 0 "s_register_operand")
+   (ABSNEG:VH (match_operand:VH 1 "s_register_operand")))]
+ "TARGET_NEON_FP16INST"
+{
+  emit_insn (gen_<absneg_str><mode>2_fp16 (operands[0], operands[1]));
+  DONE;
+})
+
+(define_insn "neon_v<fp16_rnd_str><mode>"
+  [(set (match_operand:VH 0 "s_register_operand" "=w")
+    (unspec:VH
+     [(match_operand:VH 1 "s_register_operand" "w")]
+     FP16_RND))]
+ "TARGET_NEON_FP16INST"
+ "<fp16_rnd_insn>.<V_s_elem>\t%<V_reg>0, %<V_reg>1"
+ [(set_attr "type" "neon_fp_round_s<q>")]
+)
+
+(define_insn "neon_vsqrte<mode>"
+  [(set (match_operand:VH 0 "s_register_operand" "=w")
+    (unspec:VH
+     [(match_operand:VH 1 "s_register_operand" "w")]
+     UNSPEC_VSQRTE))]
+  "TARGET_NEON_FP16INST"
+  "vsqrte.f16\t%<V_reg>0, %<V_reg>1"
+ [(set_attr "type" "neon_fp_rsqrte_s<q>")]
+)
+
 (define_insn "*umin<mode>3_neon"
   [(set (match_operand:VDQIW 0 "s_register_operand" "=w")
 	(umin:VDQIW (match_operand:VDQIW 1 "s_register_operand" "w")
@@ -1601,6 +1719,17 @@ 
                     (const_string "neon_reduc_add<q>")))]
 )
 
+(define_insn "neon_vpaddv4hf"
+ [(set
+   (match_operand:V4HF 0 "s_register_operand" "=w")
+   (unspec:V4HF [(match_operand:V4HF 1 "s_register_operand" "w")
+		 (match_operand:V4HF 2 "s_register_operand" "w")]
+    UNSPEC_VPADD))]
+ "TARGET_NEON_FP16INST"
+ "vpadd.f16\t%P0, %P1, %P2"
+ [(set_attr "type" "neon_reduc_add")]
+)
+
 (define_insn "neon_vpsmin<mode>"
   [(set (match_operand:VD 0 "s_register_operand" "=w")
 	(unspec:VD [(match_operand:VD 1 "s_register_operand" "w")
@@ -1949,6 +2078,26 @@ 
   DONE;
 })
 
+(define_expand "neon_vadd<mode>"
+  [(match_operand:VH 0 "s_register_operand")
+   (match_operand:VH 1 "s_register_operand")
+   (match_operand:VH 2 "s_register_operand")]
+  "TARGET_NEON_FP16INST"
+{
+  emit_insn (gen_add<mode>3_fp16 (operands[0], operands[1], operands[2]));
+  DONE;
+})
+
+(define_expand "neon_vsub<mode>"
+  [(match_operand:VH 0 "s_register_operand")
+   (match_operand:VH 1 "s_register_operand")
+   (match_operand:VH 2 "s_register_operand")]
+  "TARGET_NEON_FP16INST"
+{
+  emit_insn (gen_sub<mode>3_fp16 (operands[0], operands[1], operands[2]));
+  DONE;
+})
+
 ; Note that NEON operations don't support the full IEEE 754 standard: in
 ; particular, denormal values are flushed to zero.  This means that GCC cannot
 ; use those instructions for autovectorization, etc. unless
@@ -1974,6 +2123,30 @@ 
                     (const_string "neon_add<q>")))]
 )
 
+(define_insn "neon_vadd<mode>_unspec"
+  [(set
+    (match_operand:VH 0 "s_register_operand" "=w")
+    (unspec:VH
+     [(match_operand:VH 1 "s_register_operand" "w")
+      (match_operand:VH 2 "s_register_operand" "w")]
+     UNSPEC_VADD))]
+ "TARGET_NEON_FP16INST"
+ "vadd.f16\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+ [(set_attr "type" "neon_add<q>")]
+)
+
+(define_insn "neon_vsub<mode>_unspec"
+  [(set
+    (match_operand:VH 0 "s_register_operand" "=w")
+    (unspec:VH
+     [(match_operand:VH 1 "s_register_operand" "w")
+      (match_operand:VH 2 "s_register_operand" "w")]
+     UNSPEC_VSUB))]
+ "TARGET_NEON_FP16INST"
+ "vsub.f16\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+ [(set_attr "type" "neon_sub<q>")]
+)
+
 (define_insn "neon_vaddl<sup><mode>"
   [(set (match_operand:<V_widen> 0 "s_register_operand" "=w")
         (unspec:<V_widen> [(match_operand:VDI 1 "s_register_operand" "w")
@@ -2040,6 +2213,17 @@ 
                     (const_string "neon_mul_<V_elem_ch><q>")))]
 )
 
+(define_insn "neon_vmulf<mode>"
+ [(set
+   (match_operand:VH 0 "s_register_operand" "=w")
+   (mult:VH
+    (match_operand:VH 1 "s_register_operand" "w")
+    (match_operand:VH 2 "s_register_operand" "w")))]
+  "TARGET_NEON_FP16INST"
+  "vmul.f16\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+ [(set_attr "type" "neon_mul_<VH_elem_ch><q>")]
+)
+
 (define_expand "neon_vmla<mode>"
   [(match_operand:VDQW 0 "s_register_operand" "=w")
    (match_operand:VDQW 1 "s_register_operand" "0")
@@ -2068,6 +2252,18 @@ 
   DONE;
 })
 
+(define_expand "neon_vfma<VH:mode>"
+  [(match_operand:VH 0 "s_register_operand")
+   (match_operand:VH 1 "s_register_operand")
+   (match_operand:VH 2 "s_register_operand")
+   (match_operand:VH 3 "s_register_operand")]
+  "TARGET_NEON_FP16INST"
+{
+  emit_insn (gen_fma<mode>4_intrinsic (operands[0], operands[2], operands[3],
+				       operands[1]));
+  DONE;
+})
+
 (define_expand "neon_vfms<VCVTF:mode>"
   [(match_operand:VCVTF 0 "s_register_operand")
    (match_operand:VCVTF 1 "s_register_operand")
@@ -2080,6 +2276,18 @@ 
   DONE;
 })
 
+(define_expand "neon_vfms<VH:mode>"
+  [(match_operand:VH 0 "s_register_operand")
+   (match_operand:VH 1 "s_register_operand")
+   (match_operand:VH 2 "s_register_operand")
+   (match_operand:VH 3 "s_register_operand")]
+  "TARGET_NEON_FP16INST"
+{
+  emit_insn (gen_fmsub<mode>4_intrinsic (operands[0], operands[2], operands[3],
+					 operands[1]));
+  DONE;
+})
+
 ; Used for intrinsics when flag_unsafe_math_optimizations is false.
 
 (define_insn "neon_vmla<mode>_unspec"
@@ -2380,6 +2588,72 @@ 
   [(set_attr "type" "neon_fp_compare_s<q>")]
 )
 
+(define_expand "neon_vc<cmp_op><mode>"
+ [(match_operand:<V_cmp_result> 0 "s_register_operand")
+  (neg:<V_cmp_result>
+   (COMPARISONS:VH
+    (match_operand:VH 1 "s_register_operand")
+    (match_operand:VH 2 "reg_or_zero_operand")))]
+ "TARGET_NEON_FP16INST"
+{
+  /* For FP comparisons use UNSPECS unless -funsafe-math-optimizations
+     are enabled.  */
+  if (GET_MODE_CLASS (<MODE>mode) == MODE_VECTOR_FLOAT
+      && !flag_unsafe_math_optimizations)
+    emit_insn
+      (gen_neon_vc<cmp_op><mode>_fp16insn_unspec
+       (operands[0], operands[1], operands[2]));
+  else
+    emit_insn
+      (gen_neon_vc<cmp_op><mode>_fp16insn
+       (operands[0], operands[1], operands[2]));
+  DONE;
+})
+
+(define_insn "neon_vc<cmp_op><mode>_fp16insn"
+ [(set (match_operand:<V_cmp_result> 0 "s_register_operand" "=w,w")
+   (neg:<V_cmp_result>
+    (COMPARISONS:<V_cmp_result>
+     (match_operand:VH 1 "s_register_operand" "w,w")
+     (match_operand:VH 2 "reg_or_zero_operand" "w,Dz"))))]
+ "TARGET_NEON_FP16INST
+  && !(GET_MODE_CLASS (<MODE>mode) == MODE_VECTOR_FLOAT
+  && !flag_unsafe_math_optimizations)"
+{
+  char pattern[100];
+  sprintf (pattern, "vc<cmp_op>.%s%%#<V_sz_elem>\t%%<V_reg>0,"
+	   " %%<V_reg>1, %s",
+	   GET_MODE_CLASS (<MODE>mode) == MODE_VECTOR_FLOAT
+	   ? "f" : "<cmp_type>",
+	   which_alternative == 0
+	   ? "%<V_reg>2" : "#0");
+  output_asm_insn (pattern, operands);
+  return "";
+}
+ [(set (attr "type")
+   (if_then_else (match_operand 2 "zero_operand")
+    (const_string "neon_compare_zero<q>")
+    (const_string "neon_compare<q>")))])
+
+(define_insn "neon_vc<cmp_op_unsp><mode>_fp16insn_unspec"
+ [(set
+   (match_operand:<V_cmp_result> 0 "s_register_operand" "=w,w")
+   (unspec:<V_cmp_result>
+    [(match_operand:VH 1 "s_register_operand" "w,w")
+     (match_operand:VH 2 "reg_or_zero_operand" "w,Dz")]
+    NEON_VCMP))]
+ "TARGET_NEON_FP16INST"
+{
+  char pattern[100];
+  sprintf (pattern, "vc<cmp_op_unsp>.f%%#<V_sz_elem>\t%%<V_reg>0,"
+	   " %%<V_reg>1, %s",
+	   which_alternative == 0
+	   ? "%<V_reg>2" : "#0");
+  output_asm_insn (pattern, operands);
+  return "";
+}
+ [(set_attr "type" "neon_fp_compare_s<q>")])
+
 (define_insn "neon_vc<cmp_op>u<mode>"
   [(set (match_operand:<V_cmp_result> 0 "s_register_operand" "=w")
         (neg:<V_cmp_result>
@@ -2431,6 +2705,60 @@ 
   [(set_attr "type" "neon_fp_compare_s<q>")]
 )
 
+(define_expand "neon_vca<cmp_op><mode>"
+  [(set
+    (match_operand:<V_cmp_result> 0 "s_register_operand")
+    (neg:<V_cmp_result>
+     (GLTE:<V_cmp_result>
+      (abs:VH (match_operand:VH 1 "s_register_operand"))
+      (abs:VH (match_operand:VH 2 "s_register_operand")))))]
+ "TARGET_NEON_FP16INST"
+{
+  if (flag_unsafe_math_optimizations)
+    emit_insn (gen_neon_vca<cmp_op><mode>_fp16insn
+	       (operands[0], operands[1], operands[2]));
+  else
+    emit_insn (gen_neon_vca<cmp_op><mode>_fp16insn_unspec
+	       (operands[0], operands[1], operands[2]));
+  DONE;
+})
+
+(define_insn "neon_vca<cmp_op><mode>_fp16insn"
+  [(set
+    (match_operand:<V_cmp_result> 0 "s_register_operand" "=w")
+    (neg:<V_cmp_result>
+     (GLTE:<V_cmp_result>
+      (abs:VH (match_operand:VH 1 "s_register_operand" "w"))
+      (abs:VH (match_operand:VH 2 "s_register_operand" "w")))))]
+ "TARGET_NEON_FP16INST && flag_unsafe_math_optimizations"
+ "vac<cmp_op>.<V_if_elem>\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+ [(set_attr "type" "neon_fp_compare_s<q>")]
+)
+
+(define_insn "neon_vca<cmp_op_unsp><mode>_fp16insn_unspec"
+ [(set (match_operand:<V_cmp_result> 0 "s_register_operand" "=w")
+   (unspec:<V_cmp_result>
+    [(match_operand:VH 1 "s_register_operand" "w")
+     (match_operand:VH 2 "s_register_operand" "w")]
+    NEON_VAGLTE))]
+ "TARGET_NEON"
+ "vac<cmp_op_unsp>.<V_if_elem>\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+ [(set_attr "type" "neon_fp_compare_s<q>")]
+)
+
+(define_expand "neon_vc<cmp_op>z<mode>"
+ [(set
+   (match_operand:<V_cmp_result> 0 "s_register_operand")
+   (COMPARISONS:<V_cmp_result>
+    (match_operand:VH 1 "s_register_operand")
+    (const_int 0)))]
+ "TARGET_NEON_FP16INST"
+ {
+  emit_insn (gen_neon_vc<cmp_op><mode> (operands[0], operands[1],
+					CONST0_RTX (<MODE>mode)));
+  DONE;
+})
+
 (define_insn "neon_vtst<mode>"
   [(set (match_operand:VDQIW 0 "s_register_operand" "=w")
         (unspec:VDQIW [(match_operand:VDQIW 1 "s_register_operand" "w")
@@ -2451,6 +2779,16 @@ 
   [(set_attr "type" "neon_abd<q>")]
 )
 
+(define_insn "neon_vabd<mode>"
+  [(set (match_operand:VH 0 "s_register_operand" "=w")
+    (unspec:VH [(match_operand:VH 1 "s_register_operand" "w")
+		(match_operand:VH 2 "s_register_operand" "w")]
+     UNSPEC_VABD_F))]
+ "TARGET_NEON_FP16INST"
+ "vabd.<V_s_elem>\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+  [(set_attr "type" "neon_abd<q>")]
+)
+
 (define_insn "neon_vabdf<mode>"
   [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
         (unspec:VCVTF [(match_operand:VCVTF 1 "s_register_operand" "w")
@@ -2513,6 +2851,40 @@ 
   [(set_attr "type" "neon_fp_minmax_s<q>")]
 )
 
+(define_insn "neon_v<maxmin>f<mode>"
+ [(set (match_operand:VH 0 "s_register_operand" "=w")
+   (unspec:VH
+    [(match_operand:VH 1 "s_register_operand" "w")
+     (match_operand:VH 2 "s_register_operand" "w")]
+    VMAXMINF))]
+ "TARGET_NEON_FP16INST"
+ "v<maxmin>.<V_s_elem>\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+ [(set_attr "type" "neon_fp_minmax_s<q>")]
+)
+
+(define_insn "neon_vp<maxmin>fv4hf"
+ [(set (match_operand:V4HF 0 "s_register_operand" "=w")
+   (unspec:V4HF
+    [(match_operand:V4HF 1 "s_register_operand" "w")
+     (match_operand:V4HF 2 "s_register_operand" "w")]
+    VPMAXMINF))]
+ "TARGET_NEON_FP16INST"
+ "vp<maxmin>.f16\t%P0, %P1, %P2"
+  [(set_attr "type" "neon_reduc_minmax")]
+)
+
+(define_insn "neon_<fmaxmin_op><mode>"
+ [(set
+   (match_operand:VH 0 "s_register_operand" "=w")
+   (unspec:VH
+    [(match_operand:VH 1 "s_register_operand" "w")
+     (match_operand:VH 2 "s_register_operand" "w")]
+    VMAXMINFNM))]
+ "TARGET_NEON_FP16INST"
+ "<fmaxmin_op>.<V_s_elem>\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+ [(set_attr "type" "neon_fp_minmax_s<q>")]
+)
+
 ;; Vector forms for the IEEE-754 fmax()/fmin() functions
 (define_insn "<fmaxmin><mode>3"
   [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
@@ -2584,6 +2956,17 @@ 
   [(set_attr "type" "neon_fp_recps_s<q>")]
 )
 
+(define_insn "neon_vrecps<mode>"
+  [(set
+    (match_operand:VH 0 "s_register_operand" "=w")
+    (unspec:VH [(match_operand:VH 1 "s_register_operand" "w")
+		(match_operand:VH 2 "s_register_operand" "w")]
+     UNSPEC_VRECPS))]
+  "TARGET_NEON_FP16INST"
+  "vrecps.<V_if_elem>\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+  [(set_attr "type" "neon_fp_recps_s<q>")]
+)
+
 (define_insn "neon_vrsqrts<mode>"
   [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
         (unspec:VCVTF [(match_operand:VCVTF 1 "s_register_operand" "w")
@@ -2594,6 +2977,17 @@ 
   [(set_attr "type" "neon_fp_rsqrts_s<q>")]
 )
 
+(define_insn "neon_vrsqrts<mode>"
+  [(set
+    (match_operand:VH 0 "s_register_operand" "=w")
+    (unspec:VH [(match_operand:VH 1 "s_register_operand" "w")
+		 (match_operand:VH 2 "s_register_operand" "w")]
+     UNSPEC_VRSQRTS))]
+ "TARGET_NEON_FP16INST"
+ "vrsqrts.<V_if_elem>\t%<V_reg>0, %<V_reg>1, %<V_reg>2"
+ [(set_attr "type" "neon_fp_rsqrts_s<q>")]
+)
+
 (define_expand "neon_vabs<mode>"
   [(match_operand:VDQW 0 "s_register_operand" "")
    (match_operand:VDQW 1 "s_register_operand" "")]
@@ -2709,6 +3103,15 @@ 
 })
 
 (define_insn "neon_vrecpe<mode>"
+  [(set (match_operand:VH 0 "s_register_operand" "=w")
+	(unspec:VH [(match_operand:VH 1 "s_register_operand" "w")]
+		   UNSPEC_VRECPE))]
+  "TARGET_NEON_FP16INST"
+  "vrecpe.f16\t%<V_reg>0, %<V_reg>1"
+  [(set_attr "type" "neon_fp_recpe_s<q>")]
+)
+
+(define_insn "neon_vrecpe<mode>"
   [(set (match_operand:V32 0 "s_register_operand" "=w")
 	(unspec:V32 [(match_operand:V32 1 "s_register_operand" "w")]
                     UNSPEC_VRECPE))]
@@ -3251,6 +3654,28 @@  if (BYTES_BIG_ENDIAN)
   [(set_attr "type" "neon_fp_cvt_narrow_s_q")]
 )
 
+(define_insn "neon_vcvt<sup><mode>"
+ [(set
+   (match_operand:<VH_CVTTO> 0 "s_register_operand" "=w")
+   (unspec:<VH_CVTTO>
+    [(match_operand:VCVTHI 1 "s_register_operand" "w")]
+    VCVT_US))]
+ "TARGET_NEON_FP16INST"
+ "vcvt.f16.<sup>%#16\t%<V_reg>0, %<V_reg>1"
+  [(set_attr "type" "neon_int_to_fp_<VH_elem_ch><q>")]
+)
+
+(define_insn "neon_vcvt<sup><mode>"
+ [(set
+   (match_operand:<VH_CVTTO> 0 "s_register_operand" "=w")
+   (unspec:<VH_CVTTO>
+    [(match_operand:VH 1 "s_register_operand" "w")]
+    VCVT_US))]
+ "TARGET_NEON_FP16INST"
+ "vcvt.<sup>%#16.f16\t%<V_reg>0, %<V_reg>1"
+  [(set_attr "type" "neon_fp_to_int_<VH_elem_ch><q>")]
+)
+
 (define_insn "neon_vcvt<sup>_n<mode>"
   [(set (match_operand:<V_CVTTO> 0 "s_register_operand" "=w")
 	(unspec:<V_CVTTO> [(match_operand:VCVTF 1 "s_register_operand" "w")
@@ -3265,6 +3690,20 @@  if (BYTES_BIG_ENDIAN)
 )
 
 (define_insn "neon_vcvt<sup>_n<mode>"
+ [(set (match_operand:<VH_CVTTO> 0 "s_register_operand" "=w")
+   (unspec:<VH_CVTTO>
+    [(match_operand:VH 1 "s_register_operand" "w")
+     (match_operand:SI 2 "immediate_operand" "i")]
+    VCVT_US_N))]
+  "TARGET_NEON_FP16INST"
+{
+  neon_const_bounds (operands[2], 1, 33);
+  return "vcvt.<sup>%#16.f16\t%<V_reg>0, %<V_reg>1, %2";
+}
+ [(set_attr "type" "neon_fp_to_int_<VH_elem_ch><q>")]
+)
+
+(define_insn "neon_vcvt<sup>_n<mode>"
   [(set (match_operand:<V_CVTTO> 0 "s_register_operand" "=w")
 	(unspec:<V_CVTTO> [(match_operand:VCVTI 1 "s_register_operand" "w")
 			   (match_operand:SI 2 "immediate_operand" "i")]
@@ -3277,6 +3716,31 @@  if (BYTES_BIG_ENDIAN)
   [(set_attr "type" "neon_int_to_fp_<V_elem_ch><q>")]
 )
 
+(define_insn "neon_vcvt<sup>_n<mode>"
+ [(set (match_operand:<VH_CVTTO> 0 "s_register_operand" "=w")
+   (unspec:<VH_CVTTO>
+    [(match_operand:VCVTHI 1 "s_register_operand" "w")
+     (match_operand:SI 2 "immediate_operand" "i")]
+    VCVT_US_N))]
+ "TARGET_NEON_FP16INST"
+{
+  neon_const_bounds (operands[2], 1, 33);
+  return "vcvt.f16.<sup>%#16\t%<V_reg>0, %<V_reg>1, %2";
+}
+ [(set_attr "type" "neon_int_to_fp_<VH_elem_ch><q>")]
+)
+
+(define_insn "neon_vcvt<vcvth_op><sup><mode>"
+ [(set
+   (match_operand:<VH_CVTTO> 0 "s_register_operand" "=w")
+   (unspec:<VH_CVTTO>
+    [(match_operand:VH 1 "s_register_operand" "w")]
+    VCVT_HF_US))]
+ "TARGET_NEON_FP16INST"
+ "vcvt<vcvth_op>.<sup>%#16.f16\t%<V_reg>0, %<V_reg>1"
+  [(set_attr "type" "neon_fp_to_int_<VH_elem_ch><q>")]
+)
+
 (define_insn "neon_vmovn<mode>"
   [(set (match_operand:<V_narrow> 0 "s_register_operand" "=w")
 	(unspec:<V_narrow> [(match_operand:VN 1 "s_register_operand" "w")]
@@ -3347,6 +3811,18 @@  if (BYTES_BIG_ENDIAN)
                    (const_string "neon_mul_<V_elem_ch>_scalar<q>")))]
 )
 
+(define_insn "neon_vmul_lane<mode>"
+  [(set (match_operand:VH 0 "s_register_operand" "=w")
+	(unspec:VH [(match_operand:VH 1 "s_register_operand" "w")
+		    (match_operand:V4HF 2 "s_register_operand"
+		     "<scalar_mul_constraint>")
+		     (match_operand:SI 3 "immediate_operand" "i")]
+		     UNSPEC_VMUL_LANE))]
+  "TARGET_NEON_FP16INST"
+  "vmul.f16\t%<V_reg>0, %<V_reg>1, %P2[%c3]"
+  [(set_attr "type" "neon_fp_mul_s_scalar<q>")]
+)
+
 (define_insn "neon_vmull<sup>_lane<mode>"
   [(set (match_operand:<V_widen> 0 "s_register_operand" "=w")
 	(unspec:<V_widen> [(match_operand:VMDI 1 "s_register_operand" "w")
@@ -3601,6 +4077,19 @@  if (BYTES_BIG_ENDIAN)
   DONE;
 })
 
+(define_expand "neon_vmul_n<mode>"
+  [(match_operand:VH 0 "s_register_operand")
+   (match_operand:VH 1 "s_register_operand")
+   (match_operand:<V_elem> 2 "s_register_operand")]
+  "TARGET_NEON_FP16INST"
+{
+  rtx tmp = gen_reg_rtx (V4HFmode);
+  emit_insn (gen_neon_vset_lanev4hf (tmp, operands[2], tmp, const0_rtx));
+  emit_insn (gen_neon_vmul_lane<mode> (operands[0], operands[1], tmp,
+				       const0_rtx));
+  DONE;
+})
+
 (define_expand "neon_vmulls_n<mode>"
   [(match_operand:<V_widen> 0 "s_register_operand" "")
    (match_operand:VMDI 1 "s_register_operand" "")
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index 57a47ff..cc5a16a 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -191,6 +191,8 @@ 
   UNSPEC_VBSL
   UNSPEC_VCAGE
   UNSPEC_VCAGT
+  UNSPEC_VCALE
+  UNSPEC_VCALT
   UNSPEC_VCEQ
   UNSPEC_VCGE
   UNSPEC_VCGEU
@@ -258,6 +260,8 @@ 
   UNSPEC_VMLSL_S_LANE
   UNSPEC_VMLSL_U_LANE
   UNSPEC_VMLSL_LANE
+  UNSPEC_VFMA_LANE
+  UNSPEC_VFMS_LANE
   UNSPEC_VMOVL_S
   UNSPEC_VMOVL_U
   UNSPEC_VMOVN
@@ -386,5 +390,5 @@ 
   UNSPEC_VRNDN
   UNSPEC_VRNDP
   UNSPEC_VRNDX
+  UNSPEC_VSQRTE
 ])
-
diff --git a/gcc/testsuite/gcc.target/arm/armv8_2-fp16-arith-1.c b/gcc/testsuite/gcc.target/arm/armv8_2-fp16-arith-1.c
index 8399288..029d13c 100644
--- a/gcc/testsuite/gcc.target/arm/armv8_2-fp16-arith-1.c
+++ b/gcc/testsuite/gcc.target/arm/armv8_2-fp16-arith-1.c
@@ -9,6 +9,9 @@  typedef __fp16 float16_t;
 typedef __simd64_float16_t float16x4_t;
 typedef __simd128_float16_t float16x8_t;
 
+typedef short int16x4_t __attribute__ ((vector_size (8)));
+typedef short int int16x8_t  __attribute__ ((vector_size (16)));
+
 float16_t
 fp16_abs (float16_t a)
 {
@@ -50,15 +53,47 @@  TEST_CMP (greaterthan, >, int, float16_t)
 TEST_CMP (lessthanequal, <=, int, float16_t)
 TEST_CMP (greaterthanqual, >=, int, float16_t)
 
-/* { dg-final { scan-assembler-times {vneg\.f16\ts[0-9]+, s[0-9]+} 1 } }  */
+/* Vectors of size 4.  */
+
+TEST_UNOP (neg, -, float16x4_t)
+
+TEST_BINOP (add, +, float16x4_t)
+TEST_BINOP (sub, -, float16x4_t)
+TEST_BINOP (mult, *, float16x4_t)
+TEST_BINOP (div, /, float16x4_t)
+
+TEST_CMP (equal, ==, int16x4_t, float16x4_t)
+TEST_CMP (unequal, !=, int16x4_t, float16x4_t)
+TEST_CMP (lessthan, <, int16x4_t, float16x4_t)
+TEST_CMP (greaterthan, >, int16x4_t, float16x4_t)
+TEST_CMP (lessthanequal, <=, int16x4_t, float16x4_t)
+TEST_CMP (greaterthanqual, >=, int16x4_t, float16x4_t)
+
+/* Vectors of size 8.  */
+
+TEST_UNOP (neg, -, float16x8_t)
+
+TEST_BINOP (add, +, float16x8_t)
+TEST_BINOP (sub, -, float16x8_t)
+TEST_BINOP (mult, *, float16x8_t)
+TEST_BINOP (div, /, float16x8_t)
+
+TEST_CMP (equal, ==, int16x8_t, float16x8_t)
+TEST_CMP (unequal, !=, int16x8_t, float16x8_t)
+TEST_CMP (lessthan, <, int16x8_t, float16x8_t)
+TEST_CMP (greaterthan, >, int16x8_t, float16x8_t)
+TEST_CMP (lessthanequal, <=, int16x8_t, float16x8_t)
+TEST_CMP (greaterthanqual, >=, int16x8_t, float16x8_t)
+
+/* { dg-final { scan-assembler-times {vneg\.f16\ts[0-9]+, s[0-9]+} 13 } }  */
 /* { dg-final { scan-assembler-times {vabs\.f16\ts[0-9]+, s[0-9]+} 2 } }  */
 
-/* { dg-final { scan-assembler-times {vadd\.f32\ts[0-9]+, s[0-9]+, s[0-9]+} 1 } }  */
-/* { dg-final { scan-assembler-times {vsub\.f32\ts[0-9]+, s[0-9]+, s[0-9]+} 1 } }  */
-/* { dg-final { scan-assembler-times {vmul\.f32\ts[0-9]+, s[0-9]+, s[0-9]+} 1 } }  */
-/* { dg-final { scan-assembler-times {vdiv\.f32\ts[0-9]+, s[0-9]+, s[0-9]+} 1 } }  */
-/* { dg-final { scan-assembler-times {vcmp\.f32\ts[0-9]+, s[0-9]+} 2 } }  */
-/* { dg-final { scan-assembler-times {vcmpe\.f32\ts[0-9]+, s[0-9]+} 4 } }  */
+/* { dg-final { scan-assembler-times {vadd\.f32\ts[0-9]+, s[0-9]+, s[0-9]+} 13 } }  */
+/* { dg-final { scan-assembler-times {vsub\.f32\ts[0-9]+, s[0-9]+, s[0-9]+} 13 } }  */
+/* { dg-final { scan-assembler-times {vmul\.f32\ts[0-9]+, s[0-9]+, s[0-9]+} 13 } }  */
+/* { dg-final { scan-assembler-times {vdiv\.f32\ts[0-9]+, s[0-9]+, s[0-9]+} 13 } }  */
+/* { dg-final { scan-assembler-times {vcmp\.f32\ts[0-9]+, s[0-9]+} 26 } }  */
+/* { dg-final { scan-assembler-times {vcmpe\.f32\ts[0-9]+, s[0-9]+} 52 } }  */
 
 /* { dg-final { scan-assembler-not {vadd\.f16} } }  */
 /* { dg-final { scan-assembler-not {vsub\.f16} } }  */
-- 
2.1.4