Using huffman coding and trees to compress and decompress data

You are the new intern at a company that collects sensor data from remote places

Your company pays a lot of money per byte to transfer the sensor data into your datacenter

To impress your new boss you want to reduce the amount of data that has to be transfered

Huffman coding is a method that uses a known probability distribution of symbols to encode messages with as few bits as possible

To generate the encodings of a huffman code a binary tree structure is used

A Binary tree is a tree where each node has exactly two children

On the next slide you will see the general algorithm to generate a huffman tree

Sort the symbols by the probability of occurrence

─(17.4%)─╼ 'E' ─(9.78%)─╼ 'N' ─(4.76%)─╼ 'H' ─(4.35%)─╼ 'U'

Take the the two lowest ranking symbols and combine them into a tree node

─(17.4%)─╼ 'E' ─(9.78%)─╼ 'N' ╭─(4.76%)─╼ 'H' ─(9.11%)─┤ ╰─(4.35%)─╼ 'U'

Threat the node like a single symbol, resort the list and create a new node

╭─(9.78%)─╼ 'N' ─(18.89%)─┤ │ ╭─(4.76%)─╼ 'H' ╰─(9.11%)─┤ ╰─(4.35%)─╼ 'U' ─(17.4%)─╼ 'E'

Continue combining the lowest ranking nodes

╭─(9.78%)─╼ 'N' ╭─(18.89%)─┤ │ │ ╭─(4.76%)─╼ 'H' │ ╰─(9.11%)─┤ │ ╰─(4.35%)─╼ 'U' ─(36.29%)─┤ ╰─(17.4%)─╼ 'E'

Until you are left with a single root node

╭─(9.78%)─╼ 'N' ╭─(18.89%)─┤ │ │ ╭─(4.76%)─╼ 'H' │ ╰─(9.11%)─┤ │ ╰─(4.35%)─╼ 'U' ╭─(36.29%)─┤ │ ╰─(17.4%)─╼ 'E' ╭─(65.47%)─┤ │ │ ╭─(7.55%)─╼ 'I' │ │ ╭─(14.91%)─┤ │ │ │ │ ╭─(1.13%)─╼ 'Z' │ │ │ │ ╭─(1.92%)─┤ │ │ │ │ │ ╰─(0.79%)─╼ 'P' │ │ │ │ ╭─(3.81%)─┤ │ │ │ │ │ ╰─(1.89%)─╼ 'W' │ │ │ ╰─(7.36%)─┤ │ │ │ │ ╭─(1.89%)─╼ 'B' │ │ │ ╰─(3.55%)─┤ │ │ │ ╰─(1.66%)─╼ 'F' │ ╰─(29.18%)─┤ │ │ ╭─(7.27%)─╼ 'S' │ ╰─(14.27%)─┤ │ ╰─(7.0%)─╼ 'R' ─(110.31%)─┤ │ ╭─(6.51%)─╼ 'A' │ ╭─(13.01%)─┤ │ │ │ ╭─(3.44%)─╼ 'L' │ │ ╰─(6.50%)─┤ │ │ ╰─(3.06%)─╼ 'C' │ ╭─(24.72%)─┤ │ │ │ ╭─(6.15%)─╼ 'T' │ │ ╰─(11.71%)─┤ │ │ │ ╭─(3.01%)─╼ 'G' │ │ ╰─(5.56%)─┤ │ │ │ ╭─(0.67%)─╼ 'V' │ │ │ ╭─(1.34%)─┤ │ │ │ │ │ ╭─(0.27%)─╼ 'J' │ │ │ │ │ ╭─(0.36%)─┤ │ │ │ │ │ │ │ ╭─(0.03%)─╼ 'X' │ │ │ │ │ │ │ ╭─(0.05%)─┤ │ │ │ │ │ │ │ │ ╰─(0.02%)─╼ 'Q' │ │ │ │ │ │ ╰─(0.09%)─┤ │ │ │ │ │ │ ╰─(0.04%)─╼ 'Y' │ │ │ │ ╰─(0.67%)─┤ │ │ │ │ ╰─(0.31%)─╼ 'ß' │ │ ╰─(2.55%)─┤ │ │ ╰─(1.21%)─╼ 'K' ╰─(44.84%)─┤ │ ╭─(5.08%)─╼ 'D' │ ╭─(10.12%)─┤ │ │ │ ╭─(2.53%)─╼ 'M' │ │ ╰─(5.04%)─┤ │ │ ╰─(2.51%)─╼ 'O' ╰─(20.12%)─┤ ╰─(10.0%)─╼ ' '

To get a binary encoding each side of a branch is assigned a '0' or '1'

╭─(1)─╼ 'N' ╭─(1)─┤ │ │ ╭─(1)─╼ 'H' │ ╰─(0)─┤ │ ╰─(0)─╼ 'U' ─┤ ╰─(0)─╼ 'E'

In the example above the encodings would be:

```
'E': 0, 'N': 11, 'H': 101, 'U': 100
```

*Observation:* No encoding is the prefix
of another encoding!

Sensor data that is sampled often enough is often mostly constant

To store the values above we would usually use a 16 bit wide variable per sample

We want to reduce that for transfer

We only transfer the difference between the current and the last value

Now most values are very close to zero

We know that values close to zero are more likely to occur

We can use huffman coding to reduce the average number of bits we have to transfer!

Assume that the difference values between two subsequent samples are known to have the following probability distribution:

```
Value │ Probability
──────┼────────────
-2 │ 9%
-1 │ 19%
0 │ 40%
1 │ 18%
2 │ 13%
```

Construct a huffman tree from the table above (on paper)

╭─(19%)─╼ -1 ╭─(37.00%)─┤ │ ╰─(18%)─╼ 1 ╭─(59.00%)─┤ │ │ ╭─(13%)─╼ 2 │ ╰─(22.00%)─┤ │ ╰─(9%)─╼ -2 ─(99.00%)─┤ ╰─(40%)─╼ 0

╭─(1)─╼ -1 ╭─(1)─┤ │ ╰─(0)─╼ 1 ╭─(1)─┤ │ │ ╭─(1)─╼ 2 │ ╰─(0)─┤ │ ╰─(0)─╼ -2 ─┤ ╰─(0)─╼ 0

*Observation:* For every measured value that does not differ
from the previous value (difference zero)
only a single bit has to be transfered

To encode a tree structure in C
a `struct`

like the following can be used

```
1 struct hm_node_t {
2 int16_t value;
3
4 struct hm_node_t *child_0;
5 struct hm_node_t *child_1;
6 };
```

Each node can either have two children (branch node) or a value (leaf node)

```
1 struct hm_node_t node_ch0 =
2 {.value= 0,
3 .child_0= NULL, .child_1= NULL};
4
5 struct hm_node_t node_root =
6 {.value= HM_NOVAL,
7 .child_0= &node_ch0, .child_1= &node_ch1};
```

The tree structure is then encoded using references to other nodes

Download the example program from the previous slide

Upload it to your arduino and look at its output on the serial monitor

Edit the `sequence`

variable at the top
of the program so that the following output
is produced on the serial monitor:

```
Start walking
Accumulator: 0
Accumulator: 2
Accumulator: 4
Accumulator: 3
Accumulator: 2
```

*Hint:* The program uses the encoding
developed on the previous slides