Using huffman coding and trees to compress and decompress data
You are the new intern at a company that collects sensor data from remote places
Your company pays a lot of money per byte to transfer the sensor data into your datacenter
To impress your new boss you want to reduce the amount of data that has to be transfered
Huffman coding is a method that uses a known probability distribution of symbols to encode messages with as few bits as possible
To generate the encodings of a huffman code a binary tree structure is used
A Binary tree is a tree where each node has exactly two children
On the next slide you will see the general algorithm to generate a huffman tree
Sort the symbols by the probability of occurrence
─(17.4%)─╼ 'E' ─(9.78%)─╼ 'N' ─(4.76%)─╼ 'H' ─(4.35%)─╼ 'U'
Take the the two lowest ranking symbols and combine them into a tree node
─(17.4%)─╼ 'E' ─(9.78%)─╼ 'N' ╭─(4.76%)─╼ 'H' ─(9.11%)─┤ ╰─(4.35%)─╼ 'U'
Threat the node like a single symbol, resort the list and create a new node
╭─(9.78%)─╼ 'N' ─(18.89%)─┤ │ ╭─(4.76%)─╼ 'H' ╰─(9.11%)─┤ ╰─(4.35%)─╼ 'U' ─(17.4%)─╼ 'E'
Continue combining the lowest ranking nodes
╭─(9.78%)─╼ 'N' ╭─(18.89%)─┤ │ │ ╭─(4.76%)─╼ 'H' │ ╰─(9.11%)─┤ │ ╰─(4.35%)─╼ 'U' ─(36.29%)─┤ ╰─(17.4%)─╼ 'E'
Until you are left with a single root node
╭─(9.78%)─╼ 'N' ╭─(18.89%)─┤ │ │ ╭─(4.76%)─╼ 'H' │ ╰─(9.11%)─┤ │ ╰─(4.35%)─╼ 'U' ╭─(36.29%)─┤ │ ╰─(17.4%)─╼ 'E' ╭─(65.47%)─┤ │ │ ╭─(7.55%)─╼ 'I' │ │ ╭─(14.91%)─┤ │ │ │ │ ╭─(1.13%)─╼ 'Z' │ │ │ │ ╭─(1.92%)─┤ │ │ │ │ │ ╰─(0.79%)─╼ 'P' │ │ │ │ ╭─(3.81%)─┤ │ │ │ │ │ ╰─(1.89%)─╼ 'W' │ │ │ ╰─(7.36%)─┤ │ │ │ │ ╭─(1.89%)─╼ 'B' │ │ │ ╰─(3.55%)─┤ │ │ │ ╰─(1.66%)─╼ 'F' │ ╰─(29.18%)─┤ │ │ ╭─(7.27%)─╼ 'S' │ ╰─(14.27%)─┤ │ ╰─(7.0%)─╼ 'R' ─(110.31%)─┤ │ ╭─(6.51%)─╼ 'A' │ ╭─(13.01%)─┤ │ │ │ ╭─(3.44%)─╼ 'L' │ │ ╰─(6.50%)─┤ │ │ ╰─(3.06%)─╼ 'C' │ ╭─(24.72%)─┤ │ │ │ ╭─(6.15%)─╼ 'T' │ │ ╰─(11.71%)─┤ │ │ │ ╭─(3.01%)─╼ 'G' │ │ ╰─(5.56%)─┤ │ │ │ ╭─(0.67%)─╼ 'V' │ │ │ ╭─(1.34%)─┤ │ │ │ │ │ ╭─(0.27%)─╼ 'J' │ │ │ │ │ ╭─(0.36%)─┤ │ │ │ │ │ │ │ ╭─(0.03%)─╼ 'X' │ │ │ │ │ │ │ ╭─(0.05%)─┤ │ │ │ │ │ │ │ │ ╰─(0.02%)─╼ 'Q' │ │ │ │ │ │ ╰─(0.09%)─┤ │ │ │ │ │ │ ╰─(0.04%)─╼ 'Y' │ │ │ │ ╰─(0.67%)─┤ │ │ │ │ ╰─(0.31%)─╼ 'ß' │ │ ╰─(2.55%)─┤ │ │ ╰─(1.21%)─╼ 'K' ╰─(44.84%)─┤ │ ╭─(5.08%)─╼ 'D' │ ╭─(10.12%)─┤ │ │ │ ╭─(2.53%)─╼ 'M' │ │ ╰─(5.04%)─┤ │ │ ╰─(2.51%)─╼ 'O' ╰─(20.12%)─┤ ╰─(10.0%)─╼ ' '
To get a binary encoding each side of a branch is assigned a '0' or '1'
╭─(1)─╼ 'N' ╭─(1)─┤ │ │ ╭─(1)─╼ 'H' │ ╰─(0)─┤ │ ╰─(0)─╼ 'U' ─┤ ╰─(0)─╼ 'E'
In the example above the encodings would be:
'E': 0, 'N': 11, 'H': 101, 'U': 100
Observation: No encoding is the prefix of another encoding!
Sensor data that is sampled often enough is often mostly constant
To store the values above we would usually use a 16 bit wide variable per sample
We want to reduce that for transfer
We only transfer the difference between the current and the last value
Now most values are very close to zero
We know that values close to zero are more likely to occur
We can use huffman coding to reduce the average number of bits we have to transfer!
Assume that the difference values between two subsequent samples are known to have the following probability distribution:
Value │ Probability
──────┼────────────
-2 │ 9%
-1 │ 19%
0 │ 40%
1 │ 18%
2 │ 13%
Construct a huffman tree from the table above (on paper)
╭─(19%)─╼ -1 ╭─(37.00%)─┤ │ ╰─(18%)─╼ 1 ╭─(59.00%)─┤ │ │ ╭─(13%)─╼ 2 │ ╰─(22.00%)─┤ │ ╰─(9%)─╼ -2 ─(99.00%)─┤ ╰─(40%)─╼ 0
╭─(1)─╼ -1 ╭─(1)─┤ │ ╰─(0)─╼ 1 ╭─(1)─┤ │ │ ╭─(1)─╼ 2 │ ╰─(0)─┤ │ ╰─(0)─╼ -2 ─┤ ╰─(0)─╼ 0
Observation: For every measured value that does not differ from the previous value (difference zero) only a single bit has to be transfered
To encode a tree structure in C
a struct
like the following can be used
1 struct hm_node_t {
2 int16_t value;
3
4 struct hm_node_t *child_0;
5 struct hm_node_t *child_1;
6 };
Each node can either have two children (branch node) or a value (leaf node)
1 struct hm_node_t node_ch0 =
2 {.value= 0,
3 .child_0= NULL, .child_1= NULL};
4
5 struct hm_node_t node_root =
6 {.value= HM_NOVAL,
7 .child_0= &node_ch0, .child_1= &node_ch1};
The tree structure is then encoded using references to other nodes
Download the example program from the previous slide
Upload it to your arduino and look at its output on the serial monitor
Edit the sequence
variable at the top
of the program so that the following output
is produced on the serial monitor:
Start walking
Accumulator: 0
Accumulator: 2
Accumulator: 4
Accumulator: 3
Accumulator: 2
Hint: The program uses the encoding developed on the previous slides