# Implementing the Mysterion block cipher

Last semester, Jordi Riemens and I have built an implementation of the Mysterion block cipher for the Cortex M4 microarchitecture. This block cipher has a 128-bit state and claims similar security to the AES block cipher. Its design is called an “XLS design”, which is essentially the same as traditional substitution permutation networks.

The original paper gives a rough outline of how the block cipher works, and the original authors have kindly provided us with a reference implementation of the cipher for AVR. This reference implementation is however fully lookup-based, so for numerous reasons we cannot use this in practical applications.

## Summary ¶

The block cipher state is separated into 4 blocks of each 32 bytes. These 4
blocks are each divided into 4 rows of each 1 byte long. The following table
illustrates this. I have annotated where the input *bytes* end up in the state
(e.g. input byte `5`

ends up at `55555555`

).

```
000000000 44444444 88888888 ........
111111111 55555555 99999999 ........
222222222 66666666 ........ ........
333333333 77777777 ........ ........
^~~~~~~~~ one 32-bit `sub-block`
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ one 128-bit `block`
```

The cipher does 12 rounds in which it performs 4 operations: The S-box, the L-box, ShiftColumns and key/constant addition.

### S-box ¶

The first operation during a round is the S-box. In these kinds of ciphers (SPNs), S-boxes make sure that the cipher is nonlinear. In the Mysterion cipher, the S-box is defined by a simple function of 8 bitwise operations:

```
def sbox(block):
"""Returns a new block after applying the Mysterion S-box"""
a = (block[0] & block[1]) ^ block[2]
c = (block[1] | block[2]) ^ block[3]
d = (a & block[3]) ^ block[0]
b = (c & block[0]) ^ block[1]
return [a,b,c,d]
```

### L-box ¶

The L-box is the next operation in the round. Its purpose is to diffuse changes
in the state. In other words, this operation makes sure that if some bits in the
input state is changed, *many* bits in the output state are changed. Because of
the L-box, a difference of *one* bit will already have propagated to the
complete state after just two rounds (recall that Mysterion does 12).

This is implemented by multiplying the state with an MDS matrix made up of elements in a finite field of order 16. These MDS matrices are specially crafted matrices which optimise the diffusion. In the special case of Mysterion, this matrix is the eighth power of a transposed companion matrix. This reduces the amount of registers that are needed during the application of the L-box which came in handy in the implementation.

### ShiftColumns ¶

The next operation is the ShiftColumns operation, as found in all other SPN networks. This operation scrambles the locations of bits in the state, so that differences in one sub-block propagate to all the others.

### Key and constant addition ¶

Up until now, we have done nothing to make the ciphertext dependent on the key. So to do this, we add the key to the state. Because we want every round to be different, we also add a round-specific constant value.

The Mysterion block cipher has no key schedule. In every round, the same key is added to the state.

## Implementation strategies ¶

This section goes into the nitty gritty details of the implementation of the block cipher. If you haven’t read the paper yet, you may want to read it now, if you really want to understand what we are doing.

We have implemented the cipher in ARM assembly, but for our own overview we
have also implemented the cipher in Python. In this part I’ll try to use the
assembly product when possible. However—because of the optimisations—the
assembly is not always very readable, so I will sometimes fall back to
Python. You can always find the assembly code in `mysterion.s`

.

When you look closely at the cipher’s state, you see 4 rows of 32 bits. For us—working on a 32-bit platform—it was obvious to use this and bitslice the rows over four registers. The first step here is to get the right bits from the state (which are grouped by block) into the right registers (which are by row). For this we have written the byteslice subroutine. In Python, this function looks like this:

```
def byteslice(message):
registers = [0] * 4
for i in range(4):
for j in range(4):
registers[i] |= message[4*j + i] << ((3 - j) * 8)
return registers
```

We execute this function for both the message and the key. You might expect
that, because we are working in assembly, we will have to do a masking operation
in each loop iteration. But we only have to do the `byteslice`

and `unbyteslice`

operations once per block encryption. So we can combine the byteslicing with
reading and writing in memory. Now we can use the `ldrb`

and `strb`

instructions. Instead of working on 32 bit words, these instructions operate on
one byte each, allowing us to skip the masking operation.

### S-box ¶

Remember that the S-box was fully defined by bitwise operations. Now that we have our state in a nice bitsliced representation, we can exploit this and calculate the S-box on all the bits at the same time. This computes the S-box transformation super efficiently. The resulting code looks like this:

```
.macro sbox
/*
* r2..r5: state
* r0..r1: temporary registers
*/
and r0, r2, r3
eor r0, r0, r4
orr r4, r4, r3
eor r4, r4, r5
and r5, r5, r0
eor r5, r5, r2
and r1, r4, r2
eor r3, r3, r1
mov r2, r0
.endm
```

### L-box ¶

Of course, calculating the S-box was the easy part. The L-box is a whole lot more complex. Our first intuition was that it would probably be easier, and thus more efficient, to compute the L-box by exploiting the fact that the multiplying matrix was its companion matrix to the power eight. This means that we could replace one matrix multiplication by eight vector multiplications. We expected this to be at least as fast as the matrix multiplication method, but needing fewer registers. So we built this function in Python, and then in assembly:

```
def lbox(state):
# state is a list of 4 32-bit numbers (in bitsliced form!)
for clock in range(8):
x = _gf16_mul2(poly(clock), state)
for reg in range(4): # reg for register
acc = 0
for i in range(8-clock):
acc ^= x[reg] << i
for i in range(1, clock+1):
acc ^= x[reg] >> i
state[reg] ^= acc & (0x80808080 >> clock)
return state
```

We were fairly happy with this implementation. It uses register rotation, an
optimisation method in which we slide through the values, instead of having
to shift them on every loop iteration. To still add to the correct polynomial in
the state, we accumulate the new values inside the registers and (while adding
them) we shift them to the exact right spot in the accumulator, so that the
value in the accumulator is aligned with the vector term that is being
calculated. Note that we now have to do only 8 bitsliced multiplications in
`GF(2^4)`

.

There is a downside to this, and that is that we can still only calculate one
value at a time in the resulting state. After implementing this in assembly,
we found that the accumulation step took *a lot* of cycles (see for
yourself).

Later we argued that this could obviously done more efficiently. So we also implemented the L-box by “just” doing the matrix multiplication. All matrix elements could be precomputed anyway, so chances are that this implementation would be a lot faster.

```
def lbox(state):
mat = [
[0b0001, 0b1000, 0b0011, 0b1111, 0b0101, 0b1111, 0b0011, 0b1000],
[0b1000, 0b1101, 0b0011, 0b0010, 0b0001, 0b0100, 0b0100, 0b1111],
[0b1111, 0b1001, 0b1111, 0b1001, 0b0100, 0b1011, 0b0110, 0b0101],
[0b0101, 0b0001, 0b0110, 0b1001, 0b1011, 0b0010, 0b0100, 0b1000],
[0b1000, 0b1001, 0b1010, 0b0111, 0b0111, 0b1010, 0b1001, 0b1000],
[0b1000, 0b0100, 0b0010, 0b1011, 0b1001, 0b0110, 0b0001, 0b0101],
[0b0101, 0b0110, 0b1011, 0b0100, 0b1001, 0b1111, 0b1001, 0b1111],
[0b1111, 0b0100, 0b0100, 0b0001, 0b0010, 0b0011, 0b1101, 0b1000]
]
for i in range(8):
# Rotate for rotated registers
mat[i] = mat[i][i:] + mat[i][:i]
mat = zip(*mat) # Transpose the matrix
mat = [bitslice_32x4(row * 4) for row in mat] # Bitslice all matrix columns
# ^ In the assembly code, these values are all precomputed, so there is no
# computational overhead. :)
acc = [0] * 4
for i in range(8):
for j, x in enumerate(_gf16_mul2(mat[i][:], state[:])):
acc[j] ^= x
# rotate state left inside bytes by 1
state = [((x << 1) & 0xfefefefe) | (x >> 7) & 0x01010101 for x in state]
return acc
```

This immediately turns out to be an much cleaner piece of code than the previous
one. After measuring our new method it turns out to be *much* faster. (“Yay!”)
Also, our assembly code looked so much cleaner. One round of the L-box now
looked like this:

```
# All rounds look the same, except for the polynomials
gf16_mul_lit r1,r10,r11,r12, 0xbbbbbbbb, 0x03030303, 0x5b5b5b5b, 0x77777777
eor r6, r6, r1
eor r7, r7, r10
eor r8, r8, r11
eor r9, r9, r12
lbox_rotate_left /* Rotate the state towards the left inside byte
(macro of 14 cycles) */
```

It turned out that we actually needed those extra registers (exactly 4 of
them). Because of this, we spill the values in `r6-r9`

(the bitsliced key) to
the stack at the beginning of the L-box, and pop them from the stack at the end.
This adds about 14 cycles to each L-box computation, but this is nothing
compared to the speedup that we had achieved by using the new method.

### ShiftColumns ¶

ShiftColumns was easy. For each register, we do 4 shifted `and`

operations. Then
we add (bitwise OR) the results:

```
.macro shiftcolumns_inner reg
and r0, \reg, #0xc0c0c0c0
and r1, \reg, #0x30303030
orr r0, r0, r1, ror #8
and r1, \reg, #0x0c0c0c0c
orr r0, r0, r1, ror #16
and r1, \reg, #0x03030303
orr \reg, r0, r1, ror #24
.endm
```

### Key and constant addition ¶

This is just trivial.

```
.macro add_key
/*
* r2..r5: state
* r6..r9: key
*/
eor r2, r2, r6
eor r3, r3, r7
eor r4, r4, r8
eor r5, r5, r9
.endm
```

### Stitching it all together ¶

After having made all the building blocks, there was only the task left of stitching them together, which—apart from some register allocation—is not too much of a hassle:

```
.macro mysterion_round round_number
/*
* Do one round of the Mysterion block cipher
* r2..r5: state
* r0, r1, r6..r12: temporary registers
*/
sbox
lbox
shiftcolumns
add_const \round_number
add_key
.endm
mysterion_encrypt:
/*
* Do Mysterion block encryption on msg pointed to by r0, with the values
* pointed to by r1 as key.
*/
/* Spill the pointer to the output buffer to the stack, because we need a
register for temporary values. Other temporary registers are pushed
according to the C calling convention. */
push {r4-r12}
push {r0}
byteslice_state
byteslice_key
/* From this point [r2..r9] are in use */
add_key
mysterion_round 1
mysterion_round 2
mysterion_round 3
mysterion_round 4
mysterion_round 5
mysterion_round 6
mysterion_round 7
mysterion_round 8
mysterion_round 9
mysterion_round 10
mysterion_round 11
mysterion_round 12
/* Put the ciphertext back in the input buffer */
pop {r0}
unbyteslice_state
pop {r4-r12}
bx lr
```

## Decryption ¶

Every encryption algorithm of course also need a decryption algorithm. Luckily,
the algorithm is very symmetric in all aspects. So much that there is not a lot
to write about. The inverse S-box is just another sequence of
operations. The L-box operation is exactly the same, with the slight
difference that we are multiplying the *inverse* matrix with the state. The
ShiftColumns operation is easily inverted. The key and
constant addition are XOR operations, so they are exactly the same as in the
encryption algorithm. We only have to remember to add the round constants in
reversed order.

## Result ¶

The result is an encryption (and decryption) algorithm that encrypts one 128-bit
block in 5201 cycles. To our knowledge there exist no similar implementations of
the Mysterion block cipher, so I think we are safe to claim that we have the
*fastest* (and only) implementation of Mysterion for the ARM architecture.

Without any other implementations, we still don’t know if our implementation is
reasonably fast, or just a piece of garbage. So we looked up the performance of
AES (which is also an SPN) on the same platform. The current best
implementation of AES takes 2651 cycles. I should
mention that that this implementation bitslices *over blocks*, which gains the
implementers a large advantage which we do not have. Having built the algorithm
with a performance that is in the same league as a highly optimised academic
implementation of AES, we are pretty satisfied with the result.

## Discussion ¶

In the end, we like the design of the Mysterion block cipher. We would however argue that the paper and the reference implementation are all a bit vague and hard to grasp. With respect to other papers, this paper could have been a lot clearer in our opinion. It almost feels like the cipher was not proposed to be actually implemented and used, as it was sometimes ambiguous and left out certain details that were crucial for implementing (like the composition of the state when applying the S-box or the definition of the round constants).

Furthermore, the cipher’s design chooses an L-box that works on 32 bits at
the same time. I myself am not a cryptanalyst, so I do not know how much this
extra diffusion adds to the security of the block cipher. One thing that I *can*
say is that it is *the* bottleneck of the cipher in terms of performance.
Because the Lbox has a complexity of `O(n^2)`

, this has the most impact on the
speed of the computation. When thinking *only* about performance it would be
preferable to have an S-box with a larger width, and a diffusion layer with a
smaller width. (Here again, AES seems to have figured it out.)

We hope our code may be a good guideline for future (aspiring) crypto-engineers. Not only for implementing this cipher on other platforms, but also for making custom implementations of one of the many ciphers and hash functions out there. As for the Mysterion block cipher, we think that the techniques that we are using can easily be extended to other platforms. On the architectures that support vector operations, a larger bitsliced approach is possible. On smaller platforms like AVR and other microcontrollers, we expect that bitslicing over 8 bits—the width of one sub-block—will result in fast code as well.

## Credits ¶

We built this implementation for the Cryptographic Engineering course of the Radboud University. Joost Rijneveld has guided us during the project, and figured out how to reverse assemble the inverse S-box using a SAT solver.

And of course we thank Anthony Journault for having published the block cipher, providing us with a reference implementation and kindly answering our emails when we bugged him with questions.