Patchwork [1/4] core/conv: Add optimized Viterbi decoding

login
register
mail settings
Submitter Thomas Tsou
Date April 29, 2014, 4:12 a.m.
Message ID <20140429041209.GA5135@phenom.hsd1.va.comcast.net>
Download mbox | patch
Permalink /patch/343653/
State New
Headers show

Comments

Thomas Tsou - April 29, 2014, 4:12 a.m.
Add a separate, faster convolution decoding implementation for rates
up to N=4 and constraint lengths of K=5 and K=7, which covers the
most GSM code uses. The decoding algorithm exploits the symmetric
structure of the Viterbi add-compare-select (ACS) operation - commonly
known as the ACS butterfly. This shift-register optimization can be
found in the well-known text by Dave Forney.

Forney, G.D., "The Viterbi Algorithm," Proc. of the IEEE, March 1973.

Implementation is non-architecture specific and improves performance on
x86 as well as ARM processors. Existing API is unchanged with optimized
code being called internally for supported codes.

Signed-off-by: Thomas Tsou <tom@tsou.cc>
---
 src/Makefile.am   |   3 +-
 src/conv.c        |   9 +
 src/viterbi.c     | 603 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 src/viterbi_gen.c | 182 ++++++++++++++++
 4 files changed, 796 insertions(+), 1 deletion(-)
 create mode 100644 src/viterbi.c
 create mode 100644 src/viterbi_gen.c
Holger Freyther - April 29, 2014, 7:08 a.m.
On Tue, Apr 29, 2014 at 12:12:09AM -0400, Thomas Tsou wrote:

Hi,

> Add a separate, faster convolution decoding implementation for rates
> up to N=4 and constraint lengths of K=5 and K=7, which covers the
> most GSM code uses. The decoding algorithm exploits the symmetric
> structure of the Viterbi add-compare-select (ACS) operation - commonly
> known as the ACS butterfly. This shift-register optimization can be
> found in the well-known text by Dave Forney.

I am not knowledgable enough to comment on the actual viterbi things
so I will focus on the things around it.

> +/* Aligned Memory Allocator
> + *     SSE requires 16-byte memory alignment. We store relevant trellis values
> + *     (accumulated sums, outputs, and path decisions) as 16 bit signed integers
> + *     so the allocated memory is casted as such.
> + */
> +#define SSE_ALIGN	16
> +
> +static int16_t *vdec_malloc(size_t n)
> +{
> +#ifdef HAVE_SSE3
> +	return (int16_t *) memalign(SSE_ALIGN, sizeof(int16_t) * n);
> +#else
> +	return (int16_t *) malloc(sizeof(int16_t) * n);
> +#endif
> +}

argh, it would be nice if you could use talloc here but then we would
need to play games and align pointers ourselves. Maybe change the API
to at least have a 'ctx' similar to other talloc API?

> +static void free_trellis(struct vtrellis *trellis)
> +{
> +	if (!trellis)
> +		return;
> +
> +	free(trellis->vals);
> +	free(trellis->outputs);
> +	free(trellis->sums);
> +	free(trellis);


Can you use talloc here?

> +		_traceback_rec(dec, state, out, len);

_ is reserved for the system. We might want to avoid using that.

cheers

	holger
Thomas Tsou - May 1, 2014, 5:51 p.m.
On Tue, Apr 29, 2014 at 3:08 AM, Holger Hans Peter Freyther
<holger@freyther.de> wrote:
> On Tue, Apr 29, 2014 at 12:12:09AM -0400, Thomas Tsou wrote:
>> +#define SSE_ALIGN    16
>> +
>> +static int16_t *vdec_malloc(size_t n)
>> +{
>> +#ifdef HAVE_SSE3
>> +     return (int16_t *) memalign(SSE_ALIGN, sizeof(int16_t) * n);
>> +#else
>> +     return (int16_t *) malloc(sizeof(int16_t) * n);
>> +#endif
>> +}
>
> argh, it would be nice if you could use talloc here but then we would
> need to play games and align pointers ourselves. Maybe change the API
> to at least have a 'ctx' similar to other talloc API?

A beneficial change would be to modify the API to allow persistent
decoder objects instead of performing an allocation and tear-down on
every decoding call. From a performance standpoint, the current
implementation is unnecessarily restricted by memory access and page
faults for that reason.

>> +static void free_trellis(struct vtrellis *trellis)
>> +{
>> +     if (!trellis)
>> +             return;
>> +
>> +     free(trellis->vals);
>> +     free(trellis->outputs);
>> +     free(trellis->sums);
>> +     free(trellis);
>
> Can you use talloc here?

Yes. I think that makes sense. Memory alignment is the only issue that
needs to be looked into.

>> +             _traceback_rec(dec, state, out, len);
>
> _ is reserved for the system. We might want to avoid using that.

I thought only double underscore and single underscore with capital
letter were reserved?

  -TT
Sylvain Munaut - May 7, 2014, 4:25 p.m.
Hi Thomas,


I've just only seen this. I'll try to test and comment on them ASAP.

Some quick things I saw from a quick glance though:
 - Some things that should be 'static' are not and they pollute the
global namespace
 - Anything that is in the global namespace needs the osmo_ (or even
osmo_conv_ here) prefix
 - Did you check (and test in make test) that it works for all codes
that matches
   "if ((code->N <= 4) && ((code->K == 5) || (code->K == 7)))" ?
 - The viterbi_gen file seems fairly easy to generate, maybe in the
future it'd be worth autogenerating it
   Also, to avoid exporting those internal symbols, I'm wondering if
#including the viterbi_gen.c from viterbi.c instead of compiling
separately wouldn't be better. But that's just a random idea ...


> A beneficial change would be to modify the API to allow persistent
> decoder objects instead of performing an allocation and tear-down on
> every decoding call. From a performance standpoint, the current
> implementation is unnecessarily restricted by memory access and page
> faults for that reason.

You can't change the existing API without breaking existing stuff, but you can
add a new all-in-one function that allows to give a persistent decoder object.

like osmo_conv_decode_persist or whatever you want to call it.


Cheers,

    Sylvain



On Tue, Apr 29, 2014 at 6:12 AM, Thomas Tsou <tom@tsou.cc> wrote:
> Add a separate, faster convolution decoding implementation for rates
> up to N=4 and constraint lengths of K=5 and K=7, which covers the
> most GSM code uses. The decoding algorithm exploits the symmetric
> structure of the Viterbi add-compare-select (ACS) operation - commonly
> known as the ACS butterfly. This shift-register optimization can be
> found in the well-known text by Dave Forney.
>
> Forney, G.D., "The Viterbi Algorithm," Proc. of the IEEE, March 1973.
>
> Implementation is non-architecture specific and improves performance on
> x86 as well as ARM processors. Existing API is unchanged with optimized
> code being called internally for supported codes.
>
> Signed-off-by: Thomas Tsou <tom@tsou.cc>
> ---
>  src/Makefile.am   |   3 +-
>  src/conv.c        |   9 +
>  src/viterbi.c     | 603 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  src/viterbi_gen.c | 182 ++++++++++++++++
>  4 files changed, 796 insertions(+), 1 deletion(-)
>  create mode 100644 src/viterbi.c
>  create mode 100644 src/viterbi_gen.c
>
> diff --git a/src/Makefile.am b/src/Makefile.am
> index e68c29a..262a4e6 100644
> --- a/src/Makefile.am
> +++ b/src/Makefile.am
> @@ -13,7 +13,8 @@ libosmocore_la_SOURCES = timer.c select.c signal.c msgb.c bits.c \
>                          logging.c logging_syslog.c rate_ctr.c \
>                          gsmtap_util.c crc16.c panic.c backtrace.c \
>                          conv.c application.c rbtree.c strrb.c \
> -                        loggingrb.c crc8gen.c crc16gen.c crc32gen.c crc64gen.c
> +                        loggingrb.c crc8gen.c crc16gen.c crc32gen.c crc64gen.c \
> +                        viterbi.c viterbi_gen.c
>
>  BUILT_SOURCES = crc8gen.c crc16gen.c crc32gen.c crc64gen.c
>
> diff --git a/src/conv.c b/src/conv.c
> index f13deef..79b3a7c 100644
> --- a/src/conv.c
> +++ b/src/conv.c
> @@ -238,6 +238,11 @@ osmo_conv_encode(const struct osmo_conv_code *code,
>
>  #define MAX_AE 0x00ffffff
>
> +/* Forward declaration for accerlated decoding with certain codes */
> +int
> +osmo_conv_decode_acc(const struct osmo_conv_code *code,
> +                     const sbit_t *input, ubit_t *output);
> +
>  void
>  osmo_conv_decode_init(struct osmo_conv_decoder *decoder,
>                        const struct osmo_conv_code *code, int len, int start_state)
> @@ -606,6 +611,10 @@ osmo_conv_decode(const struct osmo_conv_code *code,
>         struct osmo_conv_decoder decoder;
>         int rv, l;
>
> +       /* Use accelerated implementation for supported codes */
> +       if ((code->N <= 4) && ((code->K == 5) || (code->K == 7)))
> +               return osmo_conv_decode_acc(code, input, output);
> +
>         osmo_conv_decode_init(&decoder, code, 0, 0);
>
>         if (code->term == CONV_TERM_TAIL_BITING) {
> diff --git a/src/viterbi.c b/src/viterbi.c
> new file mode 100644
> index 0000000..db8d5c8
> --- /dev/null
> +++ b/src/viterbi.c
> @@ -0,0 +1,603 @@
> +/*
> + * Viterbi decoder
> + * Copyright (C) 2013, 2014 Thomas Tsou <tom@tsou.cc>
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "config.h"
> +
> +#include <stdlib.h>
> +#include <malloc.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <osmocom/core/conv.h>
> +
> +/* Forward Metric Units */
> +void gen_metrics_k5_n2(const int8_t *seq, const int16_t *out,
> +                      int16_t *sums, int16_t *paths, int norm);
> +void gen_metrics_k5_n3(const int8_t *seq, const int16_t *out,
> +                      int16_t *sums, int16_t *paths, int norm);
> +void gen_metrics_k5_n4(const int8_t *seq, const int16_t *out,
> +                      int16_t *sums, int16_t *paths, int norm);
> +void gen_metrics_k7_n2(const int8_t *seq, const int16_t *out,
> +                      int16_t *sums, int16_t *paths, int norm);
> +void gen_metrics_k7_n3(const int8_t *seq, const int16_t *out,
> +                      int16_t *sums, int16_t *paths, int norm);
> +void gen_metrics_k7_n4(const int8_t *seq, const int16_t *out,
> +                      int16_t *sums, int16_t *paths, int norm);
> +/* Trellis State
> + *     state - Internal lshift register value
> + *     prev  - Register values of previous 0 and 1 states
> + */
> +struct vstate {
> +       unsigned state;
> +       unsigned prev[2];
> +};
> +
> +/* Trellis Object
> + *     num_states - Number of states in the trellis
> + *     sums       - Accumulated path metrics
> + *     outputs    - Trellis ouput values
> + *     vals       - Input value that led to each state
> + */
> +struct vtrellis {
> +       int num_states;
> +       int16_t *sums;
> +       int16_t *outputs;
> +       uint8_t *vals;
> +};
> +
> +/* Viterbi Decoder
> + *     n         - Code order
> + *     k         - Constraint length
> + *     len       - Horizontal length of trellis
> + *     recursive - Set to '1' if the code is recursive
> + *     intrvl    - Normalization interval
> + *     trellis   - Trellis object
> + *     punc      - Puncturing sequence
> + *     paths     - Trellis paths
> + */
> +struct vdecoder {
> +       int n;
> +       int k;
> +       int len;
> +       int recursive;
> +       int intrvl;
> +       struct vtrellis *trellis;
> +       int *punc;
> +       int16_t **paths;
> +
> +       void (*metric_func)(const int8_t *, const int16_t *,
> +                           int16_t *, int16_t *, int);
> +};
> +
> +/* Aligned Memory Allocator
> + *     SSE requires 16-byte memory alignment. We store relevant trellis values
> + *     (accumulated sums, outputs, and path decisions) as 16 bit signed integers
> + *     so the allocated memory is casted as such.
> + */
> +#define SSE_ALIGN      16
> +
> +static int16_t *vdec_malloc(size_t n)
> +{
> +#ifdef HAVE_SSE3
> +       return (int16_t *) memalign(SSE_ALIGN, sizeof(int16_t) * n);
> +#else
> +       return (int16_t *) malloc(sizeof(int16_t) * n);
> +#endif
> +}
> +
> +/* Accessor calls */
> +inline int conv_code_recursive(const struct osmo_conv_code *code)
> +{
> +       return code->next_term_output ? 1 : 0;
> +}
> +
> +/* Left shift and mask for finding the previous state */
> +static unsigned vstate_lshift(unsigned reg, int k, int val)
> +{
> +       unsigned mask;
> +
> +       if (k == 5)
> +               mask = 0x0e;
> +       else if (k == 7)
> +               mask = 0x3e;
> +       else
> +               mask = 0;
> +
> +       return ((reg << 1) & mask) | val;
> +}
> +
> +/* Bit endian manipulators */
> +inline unsigned bitswap2(unsigned v)
> +{
> +       return ((v & 0x02) >> 1) | ((v & 0x01) << 1);
> +}
> +
> +inline unsigned bitswap3(unsigned v)
> +{
> +       return ((v & 0x04) >> 2) | ((v & 0x02) >> 0) |
> +              ((v & 0x01) << 2);
> +}
> +
> +inline unsigned bitswap4(unsigned v)
> +{
> +       return ((v & 0x08) >> 3) | ((v & 0x04) >> 1) |
> +              ((v & 0x02) << 1) | ((v & 0x01) << 3);
> +}
> +
> +inline unsigned bitswap5(unsigned v)
> +{
> +       return ((v & 0x10) >> 4) | ((v & 0x08) >> 2) | ((v & 0x04) >> 0) |
> +              ((v & 0x02) << 2) | ((v & 0x01) << 4);
> +}
> +
> +inline unsigned bitswap6(unsigned v)
> +{
> +       return ((v & 0x20) >> 5) | ((v & 0x10) >> 3) | ((v & 0x08) >> 1) |
> +              ((v & 0x04) << 1) | ((v & 0x02) << 3) | ((v & 0x01) << 5);
> +}
> +
> +static unsigned bitswap(unsigned v, unsigned n)
> +{
> +       switch (n) {
> +       case 1:
> +               return v;
> +       case 2:
> +               return bitswap2(v);
> +       case 3:
> +               return bitswap3(v);
> +       case 4:
> +               return bitswap4(v);
> +       case 5:
> +               return bitswap5(v);
> +       case 6:
> +               return bitswap6(v);
> +       default:
> +               break;
> +       }
> +
> +       return 0;
> +}
> +
> +/* Generate non-recursive state output from generator state table
> + *     Note that the shift register moves right (i.e. the most recent bit is
> + *     shifted into the register at k-1 bit of the register), which is typical
> + *     textbook representation. The API transition table expects the most recent
> + *     bit in the low order bit, or left shift. A bitswap operation is required
> + *     to accommodate the difference.
> + */
> +static unsigned gen_output(struct vstate *state, int val,
> +                          const struct osmo_conv_code *code)
> +{
> +       unsigned out, prev;
> +
> +       prev = bitswap(state->prev[0], code->K - 1);
> +       out = code->next_output[prev][val];
> +       out = bitswap(out, code->N);
> +
> +       return out;
> +}
> +
> +#define BIT2NRZ(REG,N) (((REG >> N) & 0x01) * 2 - 1) * -1
> +
> +/* Populate non-recursive trellis state
> + *     For a given state defined by the k-1 length shift register, find the
> + *     value of the input bit that drove the trellis to that state. Also
> + *     generate the N outputs of the generator polynomial at that state.
> + */
> +static int gen_state_info(uint8_t *val, unsigned reg,
> +                         int16_t *output, const struct osmo_conv_code *code)
> +{
> +       int i;
> +       unsigned out;
> +       struct vstate state;
> +
> +       /* Previous '0' state */
> +       state.state = reg;
> +       state.prev[0] = vstate_lshift(reg, code->K, 0);
> +       state.prev[1] = vstate_lshift(reg, code->K, 1);
> +
> +       *val = (reg >> (code->K - 2)) & 0x01;
> +
> +       /* Transition output */
> +       out = gen_output(&state, *val, code);
> +
> +       /* Unpack to NRZ */
> +       for (i = 0; i < code->N; i++)
> +               output[i] = BIT2NRZ(out, i);
> +
> +       return 0;
> +}
> +
> +/* Generate recursive state output from generator state table */
> +static unsigned gen_recursive_output(struct vstate *state,
> +                                    uint8_t *val, unsigned reg,
> +                                    const struct osmo_conv_code *code, int pos)
> +{
> +       int val0, val1;
> +       unsigned out, prev;
> +
> +       /* Previous '0' state */
> +       prev = vstate_lshift(reg, code->K, 0);
> +       prev = bitswap(prev, code->K - 1);
> +
> +       /* Input value */
> +       val0 = (reg >> (code->K - 2)) & 0x01;
> +       val1 = (code->next_term_output[prev] >> pos) & 0x01;
> +       *val = val0 == val1 ? 0 : 1;
> +
> +       /* Wrapper for osmocom state access */
> +       prev = bitswap(state->prev[0], code->K - 1);
> +
> +       /* Compute the transition output */
> +       out = code->next_output[prev][*val];
> +       out = bitswap(out, code->N);
> +
> +       return out;
> +}
> +
> +#define NUM_STATES(K) (K == 7 ? 64 : 16)
> +
> +/* Populate recursive trellis state
> + *     The bit position of the systematic bit is not explicitly marked by the
> + *     API, so it must be extracted from the generator table. Otherwise,
> + *     populate the trellis similar to the non-recursive version.
> + *     Non-systematic recursive codes are not supported.
> + */
> +static int gen_recursive_state_info(uint8_t *val,
> +                                   unsigned reg,
> +                                   int16_t *output,
> +                                   const struct osmo_conv_code *code)
> +{
> +       int i, j, pos = -1;
> +       int ns = NUM_STATES(code->K);
> +       unsigned out;
> +       struct vstate state;
> +
> +       /* Previous '0' and '1' states */
> +       state.state = reg;
> +       state.prev[0] = vstate_lshift(reg, code->K, 0);
> +       state.prev[1] = vstate_lshift(reg, code->K, 1);
> +
> +       /* Find recursive bit location */
> +       for (i = 0; i < code->N; i++) {
> +               for (j = 0; j < ns; j++) {
> +                       if ((code->next_output[j][0] >> i) & 0x01)
> +                               break;
> +               }
> +               if (j == ns) {
> +                       pos = i;
> +                       break;
> +               }
> +       }
> +
> +       /* Non-systematic recursive code not supported */
> +       if (pos < 0)
> +               return -EPROTO;
> +
> +       /* Transition output */
> +       out = gen_recursive_output(&state, val, reg, code, pos);
> +
> +       /* Unpack to NRZ */
> +       for (i = 0; i < code->N; i++)
> +               output[i] = BIT2NRZ(out, i);
> +
> +       return 0;
> +}
> +
> +/* Release the trellis */
> +static void free_trellis(struct vtrellis *trellis)
> +{
> +       if (!trellis)
> +               return;
> +
> +       free(trellis->vals);
> +       free(trellis->outputs);
> +       free(trellis->sums);
> +       free(trellis);
> +}
> +
> +/* Allocate and initialize the trellis object
> + *     Initialization consists of generating the outputs and output value of a
> + *     given state. Due to trellis symmetry and anti-symmetry, only one of the
> + *     transition paths is utilized by the butterfly operation in the forward
> + *     recursion, so only one set of N outputs is required per state variable.
> + */
> +static struct vtrellis *generate_trellis(const struct osmo_conv_code *code)
> +{
> +       int i, rc = -1;
> +       struct vtrellis *trellis;
> +       int16_t *outputs;
> +
> +       int ns = NUM_STATES(code->K);
> +       int recursive = conv_code_recursive(code);
> +       int olen = (code->N == 2) ? 2 : 4;
> +
> +       trellis = (struct vtrellis *) calloc(1, sizeof(struct vtrellis));
> +       trellis->num_states = ns;
> +       trellis->sums = vdec_malloc(ns);
> +       trellis->outputs = vdec_malloc(ns * olen);
> +       trellis->vals = (uint8_t *) malloc(ns * sizeof(uint8_t));
> +
> +       if (!trellis->sums || !trellis->outputs)
> +               goto fail;
> +
> +       /* Populate the trellis state objects */
> +       for (i = 0; i < ns; i++) {
> +               outputs = &trellis->outputs[olen * i];
> +
> +               if (recursive)
> +                       rc = gen_recursive_state_info(&trellis->vals[i],
> +                                                     i, outputs, code);
> +               else
> +                       rc = gen_state_info(&trellis->vals[i],
> +                                           i, outputs, code);
> +       }
> +
> +       if (rc < 0)
> +               goto fail;
> +
> +       return trellis;
> +fail:
> +       free_trellis(trellis);
> +       return NULL;
> +}
> +
> +/* Reset decoder
> + *     Set accumulated path metrics to zero. For termination other than
> + *     tail-biting, initialize the zero state as the encoder starting state.
> + *     Intialize with the maximum accumulated sum at length equal to the
> + *     constraint length.
> + */
> +static void reset_decoder(struct vdecoder *dec, int term)
> +{
> +       int ns = dec->trellis->num_states;
> +
> +       memset(dec->trellis->sums, 0, sizeof(int16_t) * ns);
> +
> +       if (term != CONV_TERM_TAIL_BITING)
> +               dec->trellis->sums[0] = INT8_MAX * dec->n * dec->k;
> +}
> +
> +static void _traceback(struct vdecoder *dec,
> +                      unsigned state, uint8_t *out, int len)
> +{
> +       int i;
> +       unsigned path;
> +
> +       for (i = len - 1; i >= 0; i--) {
> +               path = dec->paths[i][state] + 1;
> +               out[i] = dec->trellis->vals[state];
> +               state = vstate_lshift(state, dec->k, path);
> +       }
> +}
> +
> +static void _traceback_rec(struct vdecoder *dec,
> +                          unsigned state, uint8_t *out, int len)
> +{
> +       int i;
> +       unsigned path;
> +
> +       for (i = len - 1; i >= 0; i--) {
> +               path = dec->paths[i][state] + 1;
> +               out[i] = path ^ dec->trellis->vals[state];
> +               state = vstate_lshift(state, dec->k, path);
> +       }
> +}
> +
> +/* Traceback and generate decoded output
> + *     Find the largest accumulated path metric at the final state except for
> + *     the zero terminated case, where we assume the final state is always zero.
> + */
> +static int traceback(struct vdecoder *dec, uint8_t *out, int term, int len)
> +{
> +       int i, sum, max = -1;
> +       unsigned path, state = 0;
> +
> +       if (term != CONV_TERM_FLUSH) {
> +               for (i = 0; i < dec->trellis->num_states; i++) {
> +                       sum = dec->trellis->sums[i];
> +                       if (sum > max) {
> +                               max = sum;
> +                               state = i;
> +                       }
> +               }
> +
> +               if (max < 0)
> +                       return -EPROTO;
> +       }
> +
> +       for (i = dec->len - 1; i >= len; i--) {
> +               path = dec->paths[i][state] + 1;
> +               state = vstate_lshift(state, dec->k, path);
> +       }
> +
> +       if (dec->recursive)
> +               _traceback_rec(dec, state, out, len);
> +       else
> +               _traceback(dec, state, out, len);
> +
> +       return 0;
> +}
> +
> +/* Release decoder object */
> +static void free_vdec(struct vdecoder *dec)
> +{
> +       if (!dec)
> +               return;
> +
> +       free(dec->paths[0]);
> +       free(dec->paths);
> +       free_trellis(dec->trellis);
> +       free(dec);
> +}
> +
> +/* Allocate decoder object
> + *     Subtract the constraint length K on the normalization interval to
> + *     accommodate the initialization path metric at state zero.
> + */
> +static struct vdecoder *alloc_vdec(const struct osmo_conv_code *code)
> +{
> +       int i, ns;
> +       struct vdecoder *dec;
> +
> +       ns = NUM_STATES(code->K);
> +
> +       dec = (struct vdecoder *) calloc(1, sizeof(struct vdecoder));
> +       dec->n = code->N;
> +       dec->k = code->K;
> +       dec->recursive = conv_code_recursive(code);
> +       dec->intrvl = INT16_MAX / (dec->n * INT8_MAX) - dec->k;
> +
> +       if (dec->k == 5) {
> +               switch (dec->n) {
> +               case 2:
> +                       dec->metric_func = gen_metrics_k5_n2;
> +                       break;
> +               case 3:
> +                       dec->metric_func = gen_metrics_k5_n3;
> +                       break;
> +               case 4:
> +                       dec->metric_func = gen_metrics_k5_n4;
> +                       break;
> +               default:
> +                       goto fail;
> +               }
> +       } else if (dec->k == 7) {
> +               switch (dec->n) {
> +               case 2:
> +                       dec->metric_func = gen_metrics_k7_n2;
> +                       break;
> +               case 3:
> +                       dec->metric_func = gen_metrics_k7_n3;
> +                       break;
> +               case 4:
> +                       dec->metric_func = gen_metrics_k7_n4;
> +                       break;
> +               default:
> +                       goto fail;
> +               }
> +       } else {
> +               goto fail;
> +       }
> +
> +       if (code->term == CONV_TERM_FLUSH)
> +               dec->len = code->len + code->K - 1;
> +       else
> +               dec->len = code->len;
> +
> +       dec->trellis = generate_trellis(code);
> +       if (!dec->trellis)
> +               goto fail;
> +
> +       dec->paths = (int16_t **) malloc(sizeof(int16_t *) * dec->len);
> +       dec->paths[0] = vdec_malloc(ns * dec->len);
> +       for (i = 1; i < dec->len; i++)
> +               dec->paths[i] = &dec->paths[0][i * ns];
> +
> +       return dec;
> +fail:
> +       free_vdec(dec);
> +       return NULL;
> +}
> +
> +/* Depuncture sequence with nagative value terminated puncturing matrix */
> +static int depuncture(const int8_t *in, const int *punc, int8_t *out, int len)
> +{
> +       int i, n = 0, m = 0;
> +
> +       for (i = 0; i < len; i++) {
> +               if (i == punc[n]) {
> +                       out[i] = 0;
> +                       n++;
> +                       continue;
> +               }
> +
> +               out[i] = in[m++];
> +       }
> +
> +       return 0;
> +}
> +
> +/* Forward trellis recursion
> + *     Generate branch metrics and path metrics with a combined function. Only
> + *     accumulated path metric sums and path selections are stored. Normalize on
> + *     the interval specified by the decoder.
> + */
> +static void _conv_decode(struct vdecoder *dec, const int8_t *seq, int _cnt)
> +{
> +       int i, len = dec->len;
> +       struct vtrellis *trellis = dec->trellis;
> +
> +       for (i = 0; i < len; i++) {
> +               dec->metric_func(&seq[dec->n * i],
> +                                trellis->outputs,
> +                                trellis->sums,
> +                                dec->paths[i],
> +                                !(i % dec->intrvl));
> +       }
> +}
> +
> +/* Convolutional decode with a decoder object
> + *     Initial puncturing run if necessary followed by the forward recursion.
> + *     For tail-biting perform a second pass before running the backward
> + *     traceback operation.
> + */
> +static int conv_decode(struct vdecoder *dec, const int8_t *seq,
> +                      const int *punc, uint8_t *out, int len, int term)
> +{
> +       int cnt = 0;
> +       int8_t depunc[dec->len * dec->n];
> +
> +       reset_decoder(dec, term);
> +
> +       if (punc) {
> +               depuncture(seq, punc, depunc, dec->len * dec->n);
> +               seq = depunc;
> +       }
> +
> +       /* Propagate through the trellis with interval normalization */
> +       _conv_decode(dec, seq, cnt);
> +
> +       if (term == CONV_TERM_TAIL_BITING)
> +               _conv_decode(dec, seq, cnt);
> +
> +       return traceback(dec, out, term, len);
> +}
> +
> +/* All-in-one viterbi decoding  */
> +int osmo_conv_decode_acc(const struct osmo_conv_code *code,
> +                        const sbit_t *input, ubit_t *output)
> +{
> +       int rc;
> +       struct vdecoder *vdec;
> +
> +       if ((code->N < 2) || (code->N > 4) || (code->len < 1) ||
> +           ((code->K != 5) && (code->K != 7)))
> +               return -EINVAL;
> +
> +       vdec = alloc_vdec(code);
> +       if (!vdec)
> +               return -EFAULT;
> +
> +       rc = conv_decode(vdec, input, code->puncture,
> +                        output, code->len, code->term);
> +
> +       free_vdec(vdec);
> +
> +       return rc;
> +}
> diff --git a/src/viterbi_gen.c b/src/viterbi_gen.c
> new file mode 100644
> index 0000000..894d5ae
> --- /dev/null
> +++ b/src/viterbi_gen.c
> @@ -0,0 +1,182 @@
> +/*
> + * Viterbi decoder
> + * Copyright (C) 2013, 2014 Thomas Tsou <tom@tsou.cc>
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include <stdint.h>
> +#include <string.h>
> +
> +/* Add-Compare-Select (ACS-Butterfly)
> + *     Compute 4 accumulated path metrics and 4 path selections. Note that path
> + *     selections are store as -1 and 0 rather than 0 and 1. This is to match
> + *     the output format of the SSE packed compare instruction 'pmaxuw'.
> + */
> +void acs_butterfly(int state, int num_states,
> +                  int16_t metric, int16_t *sum,
> +                  int16_t *new_sum, int16_t *path)
> +{
> +       int state0, state1;
> +       int sum0, sum1, sum2, sum3;
> +
> +       state0 = *(sum + (2 * state + 0));
> +       state1 = *(sum + (2 * state + 1));
> +
> +       sum0 = state0 + metric;
> +       sum1 = state1 - metric;
> +       sum2 = state0 - metric;
> +       sum3 = state1 + metric;
> +
> +       if (sum0 > sum1) {
> +               *new_sum = sum0;
> +               *path = -1;
> +       } else {
> +               *new_sum = sum1;
> +               *path = 0;
> +       }
> +
> +       if (sum2 > sum3) {
> +               *(new_sum + num_states / 2) = sum2;
> +               *(path + num_states / 2) = -1;
> +       } else {
> +               *(new_sum + num_states / 2) = sum3;
> +               *(path + num_states / 2) = 0;
> +       }
> +}
> +
> +/* Branch metrics unit N=2 */
> +void _gen_branch_metrics_n2(int num_states, const int8_t *seq,
> +                           const int16_t *out, int16_t *metrics)
> +{
> +       int i;
> +
> +       for (i = 0; i < num_states / 2; i++) {
> +               metrics[i] = seq[0] * out[2 * i + 0] +
> +                            seq[1] * out[2 * i + 1];
> +       }
> +}
> +
> +/* Branch metrics unit N=3 */
> +void _gen_branch_metrics_n3(int num_states, const int8_t *seq,
> +                           const int16_t *out, int16_t *metrics)
> +{
> +       int i;
> +
> +       for (i = 0; i < num_states / 2; i++)
> +               metrics[i] = seq[0] * out[4 * i + 0] +
> +                            seq[1] * out[4 * i + 1] +
> +                            seq[2] * out[4 * i + 2];
> +}
> +
> +/* Branch metrics unit N=4 */
> +void _gen_branch_metrics_n4(int num_states, const int8_t *seq,
> +                           const int16_t *out, int16_t *metrics)
> +{
> +       int i;
> +
> +       for (i = 0; i < num_states / 2; i++)
> +               metrics[i] = seq[0] * out[4 * i + 0] +
> +                            seq[1] * out[4 * i + 1] +
> +                            seq[2] * out[4 * i + 2] +
> +                            seq[3] * out[4 * i + 3];
> +}
> +
> +/* Path metric unit */
> +void _gen_path_metrics(int num_states, int16_t *sums,
> +                      int16_t *metrics, int16_t *paths, int norm)
> +{
> +       int i;
> +       int16_t min;
> +       int16_t new_sums[num_states];
> +
> +       for (i = 0; i < num_states / 2; i++) {
> +               acs_butterfly(i, num_states, metrics[i],
> +                             sums, &new_sums[i], &paths[i]);
> +       }
> +
> +       if (norm) {
> +               min = new_sums[0];
> +               for (i = 1; i < num_states; i++) {
> +                       if (new_sums[i] < min)
> +                               min = new_sums[i];
> +               }
> +
> +               for (i = 0; i < num_states; i++)
> +                       new_sums[i] -= min;
> +       }
> +
> +       memcpy(sums, new_sums, num_states * sizeof(int16_t));
> +}
> +
> +/* 16-state branch-path metrics units (K=5) */
> +void gen_metrics_k5_n2(const int8_t *seq, const int16_t *out,
> +                      int16_t *sums, int16_t *paths, int norm)
> +{
> +       int16_t metrics[8];
> +
> +       _gen_branch_metrics_n2(16, seq, out, metrics);
> +       _gen_path_metrics(16, sums, metrics, paths, norm);
> +}
> +
> +void gen_metrics_k5_n3(const int8_t *seq, const int16_t *out,
> +                      int16_t *sums, int16_t *paths, int norm)
> +{
> +       int16_t metrics[8];
> +
> +       _gen_branch_metrics_n3(16, seq, out, metrics);
> +       _gen_path_metrics(16, sums, metrics, paths, norm);
> +
> +}
> +
> +void gen_metrics_k5_n4(const int8_t *seq, const int16_t *out,
> +                      int16_t *sums, int16_t *paths, int norm)
> +{
> +       int16_t metrics[8];
> +
> +       _gen_branch_metrics_n4(16, seq, out, metrics);
> +       _gen_path_metrics(16, sums, metrics, paths, norm);
> +
> +}
> +
> +/* 64-state branch-path metrics units (K=7) */
> +void gen_metrics_k7_n2(const int8_t *seq, const int16_t *out,
> +                      int16_t *sums, int16_t *paths, int norm)
> +{
> +       int16_t metrics[32];
> +
> +       _gen_branch_metrics_n2(64, seq, out, metrics);
> +       _gen_path_metrics(64, sums, metrics, paths, norm);
> +
> +}
> +
> +void gen_metrics_k7_n3(const int8_t *seq, const int16_t *out,
> +                      int16_t *sums, int16_t *paths, int norm)
> +{
> +       int16_t metrics[32];
> +
> +       _gen_branch_metrics_n3(64, seq, out, metrics);
> +       _gen_path_metrics(64, sums, metrics, paths, norm);
> +
> +}
> +
> +void gen_metrics_k7_n4(const int8_t *seq, const int16_t *out,
> +                      int16_t *sums, int16_t *paths, int norm)
> +{
> +       int16_t metrics[32];
> +
> +       _gen_branch_metrics_n4(64, seq, out, metrics);
> +       _gen_path_metrics(64, sums, metrics, paths, norm);
> +}
> --
> 1.9.0
>
Jacob Erlbeck - May 8, 2014, 8:06 a.m.
On 01.05.2014 19:51, Tom Tsou wrote:
> 
> A beneficial change would be to modify the API to allow persistent
> decoder objects instead of performing an allocation and tear-down on
> every decoding call. From a performance standpoint, the current
> implementation is unnecessarily restricted by memory access and page
> faults for that reason.

I'd appreciate that very much, since I'm planning to port it to the
epiphany where I definitely want neither to malloc nor to talloc.

Jacob
Thomas Tsou - May 8, 2014, 6:15 p.m.
On Wed, May 7, 2014 at 12:25 PM, Sylvain Munaut <246tnt@gmail.com> wrote:
> Some quick things I saw from a quick glance though:
>  - Some things that should be 'static' are not and they pollute the
> global namespace

Indeed. Internal calls in viterbi_gen.c should be static with leading
underscores stripped since that seems to be the preferred style. If
you were referring to others, please let me know.

>  - Anything that is in the global namespace needs the osmo_ (or even
> osmo_conv_ here) prefix

Ok. I purposely kept the internal struct definitions out of the public
interface, but prefixing and exposing is fine.

>  - Did you check (and test in make test) that it works for all codes
> that matches
>    "if ((code->N <= 4) && ((code->K == 5) || (code->K == 7)))" ?

Almost all codes used by osmo-bts and a few others are covered by make
check. Rates N=5 are skipped and TCH/HR passes with the posted
puncturing fix for osmo-bts. There is an additional set of out-of-tree
tests for Guassian noise, benchmarks, and some other things that
probably don't apply to libosmocore.

https://github.com/ttsou/osmo-conv-test

>    Also, to avoid exporting those internal symbols, I'm wondering if
> #including the viterbi_gen.c from viterbi.c instead of compiling
> separately wouldn't be better. But that's just a random idea ...

I see three approaches:

1. Single file for generic and architecture specific implementations
with large ifdef blocks.
2. Separate files with exported symbols (current patch).
3. Separate files with restricted visibility with
__attribute__((__visibility__("hidden"))) or other combinations with
the gcc -fvisibility flag.

>> A beneficial change would be to modify the API to allow persistent
>> decoder objects instead of performing an allocation and tear-down on
>> every decoding call. From a performance standpoint, the current
>> implementation is unnecessarily restricted by memory access and page
>> faults for that reason.
>
> You can't change the existing API without breaking existing stuff, but you can
> add a new all-in-one function that allows to give a persistent decoder object.

I was just noting performance observations. For example, the SSE K=5
case on my machine is memory bound and not limited by compute
resources. Given the effort to squeeze the current version behind the
existing API, I won't be making breaking changes anytime soon.

 -TT
Sylvain Munaut - May 9, 2014, 8:22 a.m.
Hi Tom,

>>  - Some things that should be 'static' are not and they pollute the
>> global namespace
>
> Indeed. Internal calls in viterbi_gen.c should be static with leading
> underscores stripped since that seems to be the preferred style. If
> you were referring to others, please let me know.

I don't mind the leading underscore honestly, especially in static, but if
Holger prefers it that way, it's fine with me too.

When I do a full review of the patch, I will look in more details into
each call.



>>  - Anything that is in the global namespace needs the osmo_ (or even
>> osmo_conv_ here) prefix
>
> Ok. I purposely kept the internal struct definitions out of the public
> interface, but prefixing and exposing is fine.

Not what I meant.  'structs' don't end up in the global namespace as long
as they're not used in global function signatures, so you can keep them as
is. I was thinking about functions.

Among all those not marked 'static' and don't have an osmo_ prefix, I didn't
yet do an exhaustive check to see what was an oversight and what's really
needed to be global.



>>  - Did you check (and test in make test) that it works for all codes
>> that matches
>>    "if ((code->N <= 4) && ((code->K == 5) || (code->K == 7)))" ?
>
> Almost all codes used by osmo-bts and a few others are covered by make
> check. Rates N=5 are skipped and TCH/HR passes with the posted
> puncturing fix for osmo-bts.

I was more thinking about random codes. Like if someone invents a random
code with N=4 and K=5, will it work ? When implementing it, you didnt see
any other limitations that those right ?


>>    Also, to avoid exporting those internal symbols, I'm wondering if
>> #including the viterbi_gen.c from viterbi.c instead of compiling
>> separately wouldn't be better. But that's just a random idea ...
>
> I see three approaches:
>
> 1. Single file for generic and architecture specific implementations
> with large ifdef blocks.
> 2. Separate files with exported symbols (current patch).
> 3. Separate files with restricted visibility with
> __attribute__((__visibility__("hidden"))) or other combinations with
> the gcc -fvisibility flag.

I think we'll need to play with 3. Still probably need a prefix (not
the full "osmo_conv_", but something less conflicty thatn gen_)



> I was just noting performance observations. For example, the SSE K=5
> case on my machine is memory bound and not limited by compute
> resources. Given the effort to squeeze the current version behind the
> existing API, I won't be making breaking changes anytime soon.

When I designed the API, I did do benchmark and just repeated them, and
there is virtually no difference between the two versions. ( < 1% )

But I guess that's because in the current version, the 'decoder' struct
stores virtually nothing that can be re-used between different bursts.
(and malloc itself is pretty fast nowadays)

In your case there is the treillis derived from the 'code' struct. And I
guess this is where the performance hits comes from. An option would be
to have an internal 'cache' of codes that have been used in the past
so those data would be re-used.

Basic flow of a decoder_init would be:

{
  decoder = calloc(sizeof(decoder));

  treillis = get_from_cache(code)
  if (!treillis)
    treillis = put_in_cache(gen_treillis(code));

  decoder->treillis = treillis;
}

Since the number of different codes used in a single software is not
big, this should be
fairly easy to implement.


Cheers,

    Sylvain
Thomas Tsou - May 9, 2014, 6:23 p.m.
On Fri, May 9, 2014 at 4:22 AM, Sylvain Munaut <246tnt@gmail.com> wrote:
> Among all those not marked 'static' and don't have an osmo_ prefix, I didn't
> yet do an exhaustive check to see what was an oversight and what's really
> needed to be global.

Overall, I don't think there should be _any_ non-static or global
variable as that was the original intent. Admittedly, there were some
oversights in the architecture specific BMU/PMU calls.

>>>  - Did you check (and test in make test) that it works for all codes
>>> that matches
>>>    "if ((code->N <= 4) && ((code->K == 5) || (code->K == 7)))" ?
>>
>> Almost all codes used by osmo-bts and a few others are covered by make
>> check. Rates N=5 are skipped and TCH/HR passes with the posted
>> puncturing fix for osmo-bts.
>
> I was more thinking about random codes. Like if someone invents a random
> code with N=4 and K=5, will it work ? When implementing it, you didnt see
> any other limitations that those right ?

The original implementation used octal based generator polynomials, so
all non-degenerate cases are handled. By degenerate case I mean
something like a K=7 code with generator polynomial of 0104, which is
really a shifted K=5 code. No practical implementation will handle
those cases of bad codes because the symmetric/anti-symmetric ACS
properties of the trellis vanish.

Similarly for recursive codes, the register feedback is based on the
generator polynomial with the allowance that the systematic bit can be
on any output. The AFS codes require that, which was really annoying.

>> 1. Single file for generic and architecture specific implementations
>> with large ifdef blocks.
>> 2. Separate files with exported symbols (current patch).
>> 3. Separate files with restricted visibility with
>> __attribute__((__visibility__("hidden"))) or other combinations with
>> the gcc -fvisibility flag.
>
> I think we'll need to play with 3. Still probably need a prefix (not
> the full "osmo_conv_", but something less conflicty thatn gen_)

Agreed on internal prefixing.

Currently, libosmocore checks for the availability of hidden symbol
visibility, sets SYMBOL_VISIBILITY="-fvisibility=hidden", but ignores
use the result. Without making library wide changes or using ifdef
blocks, the only approach is to mark the osmo_conv_gen calls with
hidden visibility

> When I designed the API, I did do benchmark and just repeated them, and
> there is virtually no difference between the two versions. ( < 1% )

The allocation effects are case specific and more significant as the
compute load drops. Also, you may have noticed the bitswap operations,
which only exist because the trellis and API have conflicting trellis
endian formats. Those calls are only used during initialization.
Parallel decoding threads also seems affected.

Overall though, for GSM where the data rates are quite low, the
allocate and tear-down effects, while measurable, are probably not
significant enough to justify an API change. There may be a case on
ARM or Epiphany, but I have not tested those platforms enough to
comment.

> But I guess that's because in the current version, the 'decoder' struct
> stores virtually nothing that can be re-used between different bursts.
> (and malloc itself is pretty fast nowadays)

The main memory allocation is the path metric storage, which is
unavoidable, and similar in both cases. The current version does store
paths with 8-bits instead of 16-bits, so that allocation is halved.
The SSE outputs are 16-bit so converting to 8-bit at every stage
wasn't worth the overhead. Still, cache performance on the submitted
code codes out far ahead.

$ perf stat -e cache-misses,page-faults ./conv_test -c 3 -b -i 500000 -o

=================================================
[+] Testing: GPRS CS3
[.] Specs: (N=2, K=5, non-recursive, flushed, not punctured)
[.] Input length  : ret = 334  exp = 334 -> OK
[.] Output length : ret = 676  exp = 676 -> OK

[.] Performance benchmark:
[..] Encoding / Decoding 500000 bursts on 1 thread(s):
[..] Testing base:
[..] Elapsed time....................... 18.189684 secs
[..] Rate............................... 9.181028 Mbps

 Performance counter stats for './conv_test -c 3 -b -i 500000 -o':

            65,116      cache-misses
               284      page-faults

      18.191569647 seconds time elapsed


$ perf stat -e cache-misses,page-faults ./conv_test -c 3 -b -i 500000 -s

=================================================
[+] Testing: GPRS CS3
[.] Specs: (N=2, K=5, non-recursive, flushed, not punctured)
[.] Input length  : ret = 334  exp = 334 -> OK
[.] Output length : ret = 676  exp = 676 -> OK

[.] Performance benchmark:
[..] Encoding / Decoding 500000 bursts on 1 thread(s):
[..] Testing SIMD:
[..] Elapsed time....................... 1.820236 secs
[..] Rate............................... 91.746345 Mbps

 Performance counter stats for './conv_test -c 3 -b -i 500000 -s':

            11,508      cache-misses
               288      page-faults

       1.822036339 seconds time elapsed

> In your case there is the treillis derived from the 'code' struct. And I
> guess this is where the performance hits comes from. An option would be
> to have an internal 'cache' of codes that have been used in the past
> so those data would be re-used.

Internal caching was in the original implementation, but stripped from
the submitted version mainly for simplicity and avoiding the need for
global variables, though we seem to be having that discussion anyway
;-) The trellis values can be cached based on pointer or hashed code.
That works well until threading is involved and cache access needs to
be locked. Those are features I need, but can probably be ignored in
this case.

Again, I think the API should be kept intact. Internal caching, can be
a topic for later discussion.

 -TT

Patch

diff --git a/src/Makefile.am b/src/Makefile.am
index e68c29a..262a4e6 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -13,7 +13,8 @@  libosmocore_la_SOURCES = timer.c select.c signal.c msgb.c bits.c \
 			 logging.c logging_syslog.c rate_ctr.c \
 			 gsmtap_util.c crc16.c panic.c backtrace.c \
 			 conv.c application.c rbtree.c strrb.c \
-			 loggingrb.c crc8gen.c crc16gen.c crc32gen.c crc64gen.c
+			 loggingrb.c crc8gen.c crc16gen.c crc32gen.c crc64gen.c \
+			 viterbi.c viterbi_gen.c
 
 BUILT_SOURCES = crc8gen.c crc16gen.c crc32gen.c crc64gen.c
 
diff --git a/src/conv.c b/src/conv.c
index f13deef..79b3a7c 100644
--- a/src/conv.c
+++ b/src/conv.c
@@ -238,6 +238,11 @@  osmo_conv_encode(const struct osmo_conv_code *code,
 
 #define MAX_AE 0x00ffffff
 
+/* Forward declaration for accerlated decoding with certain codes */
+int
+osmo_conv_decode_acc(const struct osmo_conv_code *code,
+                     const sbit_t *input, ubit_t *output);
+
 void
 osmo_conv_decode_init(struct osmo_conv_decoder *decoder,
                       const struct osmo_conv_code *code, int len, int start_state)
@@ -606,6 +611,10 @@  osmo_conv_decode(const struct osmo_conv_code *code,
 	struct osmo_conv_decoder decoder;
 	int rv, l;
 
+	/* Use accelerated implementation for supported codes */
+	if ((code->N <= 4) && ((code->K == 5) || (code->K == 7)))
+		return osmo_conv_decode_acc(code, input, output);
+
 	osmo_conv_decode_init(&decoder, code, 0, 0);
 
 	if (code->term == CONV_TERM_TAIL_BITING) {
diff --git a/src/viterbi.c b/src/viterbi.c
new file mode 100644
index 0000000..db8d5c8
--- /dev/null
+++ b/src/viterbi.c
@@ -0,0 +1,603 @@ 
+/*
+ * Viterbi decoder
+ * Copyright (C) 2013, 2014 Thomas Tsou <tom@tsou.cc>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "config.h"
+
+#include <stdlib.h>
+#include <malloc.h>
+#include <string.h>
+#include <errno.h>
+#include <osmocom/core/conv.h>
+
+/* Forward Metric Units */
+void gen_metrics_k5_n2(const int8_t *seq, const int16_t *out,
+		       int16_t *sums, int16_t *paths, int norm);
+void gen_metrics_k5_n3(const int8_t *seq, const int16_t *out,
+		       int16_t *sums, int16_t *paths, int norm);
+void gen_metrics_k5_n4(const int8_t *seq, const int16_t *out,
+		       int16_t *sums, int16_t *paths, int norm);
+void gen_metrics_k7_n2(const int8_t *seq, const int16_t *out,
+		       int16_t *sums, int16_t *paths, int norm);
+void gen_metrics_k7_n3(const int8_t *seq, const int16_t *out,
+		       int16_t *sums, int16_t *paths, int norm);
+void gen_metrics_k7_n4(const int8_t *seq, const int16_t *out,
+		       int16_t *sums, int16_t *paths, int norm);
+/* Trellis State
+ *     state - Internal lshift register value
+ *     prev  - Register values of previous 0 and 1 states
+ */
+struct vstate {
+	unsigned state;
+	unsigned prev[2];
+};
+
+/* Trellis Object
+ *     num_states - Number of states in the trellis
+ *     sums       - Accumulated path metrics
+ *     outputs    - Trellis ouput values
+ *     vals       - Input value that led to each state
+ */
+struct vtrellis {
+	int num_states;
+	int16_t *sums;
+	int16_t *outputs;
+	uint8_t *vals;
+};
+
+/* Viterbi Decoder
+ *     n         - Code order
+ *     k         - Constraint length
+ *     len       - Horizontal length of trellis
+ *     recursive - Set to '1' if the code is recursive
+ *     intrvl    - Normalization interval
+ *     trellis   - Trellis object
+ *     punc      - Puncturing sequence
+ *     paths     - Trellis paths
+ */
+struct vdecoder {
+	int n;
+	int k;
+	int len;
+	int recursive;
+	int intrvl;
+	struct vtrellis *trellis;
+	int *punc;
+	int16_t **paths;
+
+	void (*metric_func)(const int8_t *, const int16_t *,
+			    int16_t *, int16_t *, int);
+};
+
+/* Aligned Memory Allocator
+ *     SSE requires 16-byte memory alignment. We store relevant trellis values
+ *     (accumulated sums, outputs, and path decisions) as 16 bit signed integers
+ *     so the allocated memory is casted as such.
+ */
+#define SSE_ALIGN	16
+
+static int16_t *vdec_malloc(size_t n)
+{
+#ifdef HAVE_SSE3
+	return (int16_t *) memalign(SSE_ALIGN, sizeof(int16_t) * n);
+#else
+	return (int16_t *) malloc(sizeof(int16_t) * n);
+#endif
+}
+
+/* Accessor calls */
+inline int conv_code_recursive(const struct osmo_conv_code *code)
+{
+	return code->next_term_output ? 1 : 0;
+}
+
+/* Left shift and mask for finding the previous state */
+static unsigned vstate_lshift(unsigned reg, int k, int val)
+{
+	unsigned mask;
+
+	if (k == 5)
+		mask = 0x0e;
+	else if (k == 7)
+		mask = 0x3e;
+	else
+		mask = 0;
+
+	return ((reg << 1) & mask) | val;
+}
+
+/* Bit endian manipulators */
+inline unsigned bitswap2(unsigned v)
+{
+	return ((v & 0x02) >> 1) | ((v & 0x01) << 1);
+}
+
+inline unsigned bitswap3(unsigned v)
+{
+	return ((v & 0x04) >> 2) | ((v & 0x02) >> 0) |
+	       ((v & 0x01) << 2);
+}
+
+inline unsigned bitswap4(unsigned v)
+{
+	return ((v & 0x08) >> 3) | ((v & 0x04) >> 1) |
+	       ((v & 0x02) << 1) | ((v & 0x01) << 3);
+}
+
+inline unsigned bitswap5(unsigned v)
+{
+	return ((v & 0x10) >> 4) | ((v & 0x08) >> 2) | ((v & 0x04) >> 0) |
+	       ((v & 0x02) << 2) | ((v & 0x01) << 4);
+}
+
+inline unsigned bitswap6(unsigned v)
+{
+	return ((v & 0x20) >> 5) | ((v & 0x10) >> 3) | ((v & 0x08) >> 1) |
+	       ((v & 0x04) << 1) | ((v & 0x02) << 3) | ((v & 0x01) << 5);
+}
+
+static unsigned bitswap(unsigned v, unsigned n)
+{
+	switch (n) {
+	case 1:
+		return v;
+	case 2:
+		return bitswap2(v);
+	case 3:
+		return bitswap3(v);
+	case 4:
+		return bitswap4(v);
+	case 5:
+		return bitswap5(v);
+	case 6:
+		return bitswap6(v);
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+/* Generate non-recursive state output from generator state table
+ *     Note that the shift register moves right (i.e. the most recent bit is
+ *     shifted into the register at k-1 bit of the register), which is typical
+ *     textbook representation. The API transition table expects the most recent
+ *     bit in the low order bit, or left shift. A bitswap operation is required
+ *     to accommodate the difference.
+ */
+static unsigned gen_output(struct vstate *state, int val,
+			   const struct osmo_conv_code *code)
+{
+	unsigned out, prev;
+
+	prev = bitswap(state->prev[0], code->K - 1);
+	out = code->next_output[prev][val];
+	out = bitswap(out, code->N);
+
+	return out;
+}
+
+#define BIT2NRZ(REG,N) (((REG >> N) & 0x01) * 2 - 1) * -1
+
+/* Populate non-recursive trellis state
+ *     For a given state defined by the k-1 length shift register, find the
+ *     value of the input bit that drove the trellis to that state. Also
+ *     generate the N outputs of the generator polynomial at that state.
+ */
+static int gen_state_info(uint8_t *val, unsigned reg,
+			  int16_t *output, const struct osmo_conv_code *code)
+{
+	int i;
+	unsigned out;
+	struct vstate state;
+
+	/* Previous '0' state */
+	state.state = reg;
+	state.prev[0] = vstate_lshift(reg, code->K, 0);
+	state.prev[1] = vstate_lshift(reg, code->K, 1);
+
+	*val = (reg >> (code->K - 2)) & 0x01;
+
+	/* Transition output */
+	out = gen_output(&state, *val, code);
+
+	/* Unpack to NRZ */
+	for (i = 0; i < code->N; i++)
+		output[i] = BIT2NRZ(out, i);
+
+	return 0;
+}
+
+/* Generate recursive state output from generator state table */
+static unsigned gen_recursive_output(struct vstate *state,
+				     uint8_t *val, unsigned reg,
+				     const struct osmo_conv_code *code, int pos)
+{
+	int val0, val1;
+	unsigned out, prev;
+
+	/* Previous '0' state */
+	prev = vstate_lshift(reg, code->K, 0);
+	prev = bitswap(prev, code->K - 1);
+
+	/* Input value */
+	val0 = (reg >> (code->K - 2)) & 0x01;
+	val1 = (code->next_term_output[prev] >> pos) & 0x01;
+	*val = val0 == val1 ? 0 : 1;
+
+	/* Wrapper for osmocom state access */
+	prev = bitswap(state->prev[0], code->K - 1);
+
+	/* Compute the transition output */
+	out = code->next_output[prev][*val];
+	out = bitswap(out, code->N);
+
+	return out;
+}
+
+#define NUM_STATES(K) (K == 7 ? 64 : 16)
+
+/* Populate recursive trellis state
+ *     The bit position of the systematic bit is not explicitly marked by the
+ *     API, so it must be extracted from the generator table. Otherwise,
+ *     populate the trellis similar to the non-recursive version.
+ *     Non-systematic recursive codes are not supported.
+ */
+static int gen_recursive_state_info(uint8_t *val,
+				    unsigned reg,
+				    int16_t *output,
+				    const struct osmo_conv_code *code)
+{
+	int i, j, pos = -1;
+	int ns = NUM_STATES(code->K);
+	unsigned out;
+	struct vstate state;
+
+	/* Previous '0' and '1' states */
+	state.state = reg;
+	state.prev[0] = vstate_lshift(reg, code->K, 0);
+	state.prev[1] = vstate_lshift(reg, code->K, 1);
+
+	/* Find recursive bit location */
+	for (i = 0; i < code->N; i++) {
+		for (j = 0; j < ns; j++) {
+			if ((code->next_output[j][0] >> i) & 0x01)
+				break;
+		}
+		if (j == ns) {
+			pos = i;
+			break;
+		}
+	}
+
+	/* Non-systematic recursive code not supported */
+	if (pos < 0)
+		return -EPROTO;
+
+	/* Transition output */
+	out = gen_recursive_output(&state, val, reg, code, pos);
+
+	/* Unpack to NRZ */
+	for (i = 0; i < code->N; i++)
+		output[i] = BIT2NRZ(out, i);
+
+	return 0;
+}
+
+/* Release the trellis */
+static void free_trellis(struct vtrellis *trellis)
+{
+	if (!trellis)
+		return;
+
+	free(trellis->vals);
+	free(trellis->outputs);
+	free(trellis->sums);
+	free(trellis);
+}
+
+/* Allocate and initialize the trellis object
+ *     Initialization consists of generating the outputs and output value of a
+ *     given state. Due to trellis symmetry and anti-symmetry, only one of the
+ *     transition paths is utilized by the butterfly operation in the forward
+ *     recursion, so only one set of N outputs is required per state variable.
+ */
+static struct vtrellis *generate_trellis(const struct osmo_conv_code *code)
+{
+	int i, rc = -1;
+	struct vtrellis *trellis;
+	int16_t *outputs;
+
+	int ns = NUM_STATES(code->K);
+	int recursive = conv_code_recursive(code);
+	int olen = (code->N == 2) ? 2 : 4;
+
+	trellis = (struct vtrellis *) calloc(1, sizeof(struct vtrellis));
+	trellis->num_states = ns;
+	trellis->sums =	vdec_malloc(ns);
+	trellis->outputs = vdec_malloc(ns * olen);
+	trellis->vals = (uint8_t *) malloc(ns * sizeof(uint8_t));
+
+	if (!trellis->sums || !trellis->outputs)
+		goto fail;
+
+	/* Populate the trellis state objects */
+	for (i = 0; i < ns; i++) {
+		outputs = &trellis->outputs[olen * i];
+
+		if (recursive)
+			rc = gen_recursive_state_info(&trellis->vals[i],
+						      i, outputs, code);
+		else
+			rc = gen_state_info(&trellis->vals[i],
+					    i, outputs, code);
+	}
+
+	if (rc < 0)
+		goto fail;
+
+	return trellis;
+fail:
+	free_trellis(trellis);
+	return NULL;
+}
+
+/* Reset decoder
+ *     Set accumulated path metrics to zero. For termination other than
+ *     tail-biting, initialize the zero state as the encoder starting state.
+ *     Intialize with the maximum accumulated sum at length equal to the
+ *     constraint length.
+ */
+static void reset_decoder(struct vdecoder *dec, int term)
+{
+	int ns = dec->trellis->num_states;
+
+	memset(dec->trellis->sums, 0, sizeof(int16_t) * ns);
+
+	if (term != CONV_TERM_TAIL_BITING)
+		dec->trellis->sums[0] = INT8_MAX * dec->n * dec->k;
+}
+
+static void _traceback(struct vdecoder *dec,
+		       unsigned state, uint8_t *out, int len)
+{
+	int i;
+	unsigned path;
+
+	for (i = len - 1; i >= 0; i--) {
+		path = dec->paths[i][state] + 1;
+		out[i] = dec->trellis->vals[state];
+		state = vstate_lshift(state, dec->k, path);
+	}
+}
+
+static void _traceback_rec(struct vdecoder *dec,
+			   unsigned state, uint8_t *out, int len)
+{
+	int i;
+	unsigned path;
+
+	for (i = len - 1; i >= 0; i--) {
+		path = dec->paths[i][state] + 1;
+		out[i] = path ^ dec->trellis->vals[state];
+		state = vstate_lshift(state, dec->k, path);
+	}
+}
+
+/* Traceback and generate decoded output
+ *     Find the largest accumulated path metric at the final state except for
+ *     the zero terminated case, where we assume the final state is always zero.
+ */
+static int traceback(struct vdecoder *dec, uint8_t *out, int term, int len)
+{
+	int i, sum, max = -1;
+	unsigned path, state = 0;
+
+	if (term != CONV_TERM_FLUSH) {
+		for (i = 0; i < dec->trellis->num_states; i++) {
+			sum = dec->trellis->sums[i];
+			if (sum > max) {
+				max = sum;
+				state = i;
+			}
+		}
+
+		if (max < 0)
+			return -EPROTO;
+	}
+
+	for (i = dec->len - 1; i >= len; i--) {
+		path = dec->paths[i][state] + 1;
+		state = vstate_lshift(state, dec->k, path);
+	}
+
+	if (dec->recursive)
+		_traceback_rec(dec, state, out, len);
+	else
+		_traceback(dec, state, out, len);
+
+	return 0;
+}
+
+/* Release decoder object */
+static void free_vdec(struct vdecoder *dec)
+{
+	if (!dec)
+		return;
+
+	free(dec->paths[0]);
+	free(dec->paths);
+	free_trellis(dec->trellis);
+	free(dec);
+}
+
+/* Allocate decoder object
+ *     Subtract the constraint length K on the normalization interval to
+ *     accommodate the initialization path metric at state zero.
+ */
+static struct vdecoder *alloc_vdec(const struct osmo_conv_code *code)
+{
+	int i, ns;
+	struct vdecoder *dec;
+
+	ns = NUM_STATES(code->K);
+
+	dec = (struct vdecoder *) calloc(1, sizeof(struct vdecoder));
+	dec->n = code->N;
+	dec->k = code->K;
+	dec->recursive = conv_code_recursive(code);
+	dec->intrvl = INT16_MAX / (dec->n * INT8_MAX) - dec->k;
+
+	if (dec->k == 5) {
+		switch (dec->n) {
+		case 2:
+			dec->metric_func = gen_metrics_k5_n2;
+			break;
+		case 3:
+			dec->metric_func = gen_metrics_k5_n3;
+			break;
+		case 4:
+			dec->metric_func = gen_metrics_k5_n4;
+			break;
+		default:
+			goto fail;
+		}
+	} else if (dec->k == 7) {
+		switch (dec->n) {
+		case 2:
+			dec->metric_func = gen_metrics_k7_n2;
+			break;
+		case 3:
+			dec->metric_func = gen_metrics_k7_n3;
+			break;
+		case 4:
+			dec->metric_func = gen_metrics_k7_n4;
+			break;
+		default:
+			goto fail;
+		}
+	} else {
+		goto fail;
+	}
+
+	if (code->term == CONV_TERM_FLUSH)
+		dec->len = code->len + code->K - 1;
+	else
+		dec->len = code->len;
+
+	dec->trellis = generate_trellis(code);
+	if (!dec->trellis)
+		goto fail;
+
+	dec->paths = (int16_t **) malloc(sizeof(int16_t *) * dec->len);
+	dec->paths[0] = vdec_malloc(ns * dec->len);
+	for (i = 1; i < dec->len; i++)
+		dec->paths[i] = &dec->paths[0][i * ns];
+
+	return dec;
+fail:
+	free_vdec(dec);
+	return NULL;
+}
+
+/* Depuncture sequence with nagative value terminated puncturing matrix */
+static int depuncture(const int8_t *in, const int *punc, int8_t *out, int len)
+{
+	int i, n = 0, m = 0;
+
+	for (i = 0; i < len; i++) {
+		if (i == punc[n]) {
+			out[i] = 0;
+			n++;
+			continue;
+		}
+
+		out[i] = in[m++];
+	}
+
+	return 0;
+}
+
+/* Forward trellis recursion
+ *     Generate branch metrics and path metrics with a combined function. Only
+ *     accumulated path metric sums and path selections are stored. Normalize on
+ *     the interval specified by the decoder.
+ */
+static void _conv_decode(struct vdecoder *dec, const int8_t *seq, int _cnt)
+{
+	int i, len = dec->len;
+	struct vtrellis *trellis = dec->trellis;
+
+	for (i = 0; i < len; i++) {
+		dec->metric_func(&seq[dec->n * i],
+				 trellis->outputs,
+				 trellis->sums,
+				 dec->paths[i],
+				 !(i % dec->intrvl));
+	}
+}
+
+/* Convolutional decode with a decoder object
+ *     Initial puncturing run if necessary followed by the forward recursion.
+ *     For tail-biting perform a second pass before running the backward
+ *     traceback operation.
+ */
+static int conv_decode(struct vdecoder *dec, const int8_t *seq,
+		       const int *punc, uint8_t *out, int len, int term)
+{
+	int cnt = 0;
+	int8_t depunc[dec->len * dec->n];
+
+	reset_decoder(dec, term);
+
+	if (punc) {
+		depuncture(seq, punc, depunc, dec->len * dec->n);
+		seq = depunc;
+	}
+
+	/* Propagate through the trellis with interval normalization */
+	_conv_decode(dec, seq, cnt);
+
+	if (term == CONV_TERM_TAIL_BITING)
+		_conv_decode(dec, seq, cnt);
+
+	return traceback(dec, out, term, len);
+}
+
+/* All-in-one viterbi decoding  */
+int osmo_conv_decode_acc(const struct osmo_conv_code *code,
+			 const sbit_t *input, ubit_t *output)
+{
+	int rc;
+	struct vdecoder *vdec;
+
+	if ((code->N < 2) || (code->N > 4) || (code->len < 1) ||
+	    ((code->K != 5) && (code->K != 7)))
+		return -EINVAL;
+
+	vdec = alloc_vdec(code);
+	if (!vdec)
+		return -EFAULT;
+
+	rc = conv_decode(vdec, input, code->puncture,
+			 output, code->len, code->term);
+
+	free_vdec(vdec);
+
+	return rc;
+}
diff --git a/src/viterbi_gen.c b/src/viterbi_gen.c
new file mode 100644
index 0000000..894d5ae
--- /dev/null
+++ b/src/viterbi_gen.c
@@ -0,0 +1,182 @@ 
+/*
+ * Viterbi decoder
+ * Copyright (C) 2013, 2014 Thomas Tsou <tom@tsou.cc>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <stdint.h>
+#include <string.h>
+
+/* Add-Compare-Select (ACS-Butterfly)
+ *     Compute 4 accumulated path metrics and 4 path selections. Note that path
+ *     selections are store as -1 and 0 rather than 0 and 1. This is to match
+ *     the output format of the SSE packed compare instruction 'pmaxuw'.
+ */
+void acs_butterfly(int state, int num_states,
+		   int16_t metric, int16_t *sum,
+		   int16_t *new_sum, int16_t *path)
+{
+	int state0, state1;
+	int sum0, sum1, sum2, sum3;
+
+	state0 = *(sum + (2 * state + 0));
+	state1 = *(sum + (2 * state + 1));
+
+	sum0 = state0 + metric;
+	sum1 = state1 - metric;
+	sum2 = state0 - metric;
+	sum3 = state1 + metric;
+
+	if (sum0 > sum1) {
+		*new_sum = sum0;
+		*path = -1;
+	} else {
+		*new_sum = sum1;
+		*path = 0;
+	}
+
+	if (sum2 > sum3) {
+		*(new_sum + num_states / 2) = sum2;
+		*(path + num_states / 2) = -1;
+	} else {
+		*(new_sum + num_states / 2) = sum3;
+		*(path + num_states / 2) = 0;
+	}
+}
+
+/* Branch metrics unit N=2 */
+void _gen_branch_metrics_n2(int num_states, const int8_t *seq,
+			    const int16_t *out, int16_t *metrics)
+{
+	int i;
+
+	for (i = 0; i < num_states / 2; i++) {
+		metrics[i] = seq[0] * out[2 * i + 0] +
+			     seq[1] * out[2 * i + 1];
+	}
+}
+
+/* Branch metrics unit N=3 */
+void _gen_branch_metrics_n3(int num_states, const int8_t *seq,
+			    const int16_t *out, int16_t *metrics)
+{
+	int i;
+
+	for (i = 0; i < num_states / 2; i++)
+		metrics[i] = seq[0] * out[4 * i + 0] +
+			     seq[1] * out[4 * i + 1] +
+			     seq[2] * out[4 * i + 2];
+}
+
+/* Branch metrics unit N=4 */
+void _gen_branch_metrics_n4(int num_states, const int8_t *seq,
+			    const int16_t *out, int16_t *metrics)
+{
+	int i;
+
+	for (i = 0; i < num_states / 2; i++)
+		metrics[i] = seq[0] * out[4 * i + 0] +
+			     seq[1] * out[4 * i + 1] +
+			     seq[2] * out[4 * i + 2] +
+			     seq[3] * out[4 * i + 3];
+}
+
+/* Path metric unit */
+void _gen_path_metrics(int num_states, int16_t *sums,
+		       int16_t *metrics, int16_t *paths, int norm)
+{
+	int i;
+	int16_t min;
+	int16_t new_sums[num_states];
+
+	for (i = 0; i < num_states / 2; i++) {
+		acs_butterfly(i, num_states, metrics[i],
+			      sums, &new_sums[i], &paths[i]);
+	}
+
+	if (norm) {
+		min = new_sums[0];
+		for (i = 1; i < num_states; i++) {
+			if (new_sums[i] < min)
+				min = new_sums[i];
+		}
+
+		for (i = 0; i < num_states; i++)
+			new_sums[i] -= min;
+	}
+
+	memcpy(sums, new_sums, num_states * sizeof(int16_t));
+}
+
+/* 16-state branch-path metrics units (K=5) */
+void gen_metrics_k5_n2(const int8_t *seq, const int16_t *out,
+		       int16_t *sums, int16_t *paths, int norm)
+{
+	int16_t metrics[8];
+
+	_gen_branch_metrics_n2(16, seq, out, metrics);
+	_gen_path_metrics(16, sums, metrics, paths, norm);
+}
+
+void gen_metrics_k5_n3(const int8_t *seq, const int16_t *out,
+		       int16_t *sums, int16_t *paths, int norm)
+{
+	int16_t metrics[8];
+
+	_gen_branch_metrics_n3(16, seq, out, metrics);
+	_gen_path_metrics(16, sums, metrics, paths, norm);
+
+}
+
+void gen_metrics_k5_n4(const int8_t *seq, const int16_t *out,
+		       int16_t *sums, int16_t *paths, int norm)
+{
+	int16_t metrics[8];
+
+	_gen_branch_metrics_n4(16, seq, out, metrics);
+	_gen_path_metrics(16, sums, metrics, paths, norm);
+
+}
+
+/* 64-state branch-path metrics units (K=7) */
+void gen_metrics_k7_n2(const int8_t *seq, const int16_t *out,
+		       int16_t *sums, int16_t *paths, int norm)
+{
+	int16_t metrics[32];
+
+	_gen_branch_metrics_n2(64, seq, out, metrics);
+	_gen_path_metrics(64, sums, metrics, paths, norm);
+
+}
+
+void gen_metrics_k7_n3(const int8_t *seq, const int16_t *out,
+		       int16_t *sums, int16_t *paths, int norm)
+{
+	int16_t metrics[32];
+
+	_gen_branch_metrics_n3(64, seq, out, metrics);
+	_gen_path_metrics(64, sums, metrics, paths, norm);
+
+}
+
+void gen_metrics_k7_n4(const int8_t *seq, const int16_t *out,
+		       int16_t *sums, int16_t *paths, int norm)
+{
+	int16_t metrics[32];
+
+	_gen_branch_metrics_n4(64, seq, out, metrics);
+	_gen_path_metrics(64, sums, metrics, paths, norm);
+}