URMs or: Why I Can’t Live Without ADTs (Sep 11th, 2021)

In a class at university I recently learned about computable functions, which are functions whose return values can be computed by a Turing machine. Our teacher, however, chose to define them as functions that can be computed by a unlimited register machine (URM). For those of you who don’t know it already, a URM is a theoretic, register-based machine architecture. Apparently URMs are Turing complete, so his choice shouldn’t make much of a difference.

A URM has an infinite register set $\left({R}_{0},{R}_{2},\dots \right)$ and the instructions set is as follows:

1. $T\left(n,m\right)$ transfers the contents of ${R}_{n}$ to ${R}_{m}$
2. $Z\left(n\right)$ transfers 0 to ${R}_{n}$
3. $S\left(n\right)$ increments the value of ${R}_{n}$
4. $J\left(n,m,i\right)$ jumps to the $i$-th instruction if the values of ${R}_{n}$ and ${R}_{m}$ coincide

There’s this cool URM simulator if you’re interested in playing with this. Here’s an example of a URM program that computes ${R}_{1}\mathrm{mod}{R}_{2}$ and stores it in $R3$.

Z(3)
Z(4)
J(2, 3, 8)
J(1, 4, 10)
S(3)
S(4)
J(1, 1, 3)
Z(3)
J(1, 1, 3)

This implementation was obtained from the following implementation in x64 assembly

mod:
mov 0, rcx
mov 0, rdx
f:
cmp rbx, rcx
je  f_first_case
cmp rax, rdx
je  f_second_case
f_third_case:
inc rcx
inc rdx
jmp f
f_first_case:
mov 0, rcx
jmp f
f_second_case:
halt

which was itself obtained from the following Haskell implementation.

mod :: Natural -> Natural -> Natural
mod a n = modTail 0 0
where
modTail acc c
| acc == n = modTail 0 c
| c == a = acc
| otherwise = modTail (acc + 1) (c + 1)

Anyway, as you can see I took this as an opportunity to experiment with some of the languages mentioned in a previous article of mine. One of the experiments I did was to implement a URM virtual machine in OCaml, and this is what I came up with:

open Array

type nat = Zero | Succ of nat

type instruction =
| T of nat * nat
| Z of nat
| S of nat
| J of nat * nat * int

(* A function nat -> nat is used to represent the sequence of all registers *)
type machine =
{ instruction_pointer : int;
registers : nat -> nat;
}

(* Executes the next instruction of a program *)
let step (program : instruction array) (m : machine) : machine =
if m.instruction_pointer >= length program then m
else
match program.(m.instruction_pointer) with
| T (r1, r2) ->
{ registers =
(fun n -> if n = r2 then m.registers r1 else m.registers n);
instruction_pointer = m.instruction_pointer + 1;
}
| Z r ->
{ registers = (fun n -> if n = r then Zero else m.registers n);
instruction_pointer = m.instruction_pointer + 1;
}
| S r ->
{ registers =
(fun n -> if n = r then Succ (m.registers n) else m.registers n);
instruction_pointer = m.instruction_pointer + 1;
}
| J (r1, r2, i) when m.registers r1 = m.registers r2 ->
{ m with instruction_pointer = i }
| J _ ->
{ m with instruction_pointer = m.instruction_pointer + 1 }
For a detailed description of this implementation see git.pablopie.xyz/caml-urm/README.html!

Implementing this was a lot of fun, so I naturally wondered if any more fun could be derived from implementing it in different languages or paradigms. Unfortunately, as far as I can tell this is not the case 😢

Notice that most of the program consists of a single match statement. This is a very natural way to describe this problem: switch through all the possible instructions and update the machine accordingly. Indeed, this is the only way to describe the problem I was able to come up with!

I tried to write equivalent programs in multiple languages such as Racket and C, but I invariably ended up trying to emulate the match statement above: algebraic data types (ADTs) are the only sane way to describe the type of instructions as far as I can tell. Hence OCaml — or any other language from the ML family — is a natural choice, and in some sense the only possible choice.

I guess what I am trying to say is that records and C-style enums are cool, but nothing beats ADTs when it comes to domain-modeling. Languages that do not support ADTs are necessarily less expressive than the ones that do. For instance compare

type instruction =
| T of nat * nat
| Z of nat
| S of nat
| J of nat * nat * int

with the equivalent tagged union in C:

struct instruction {
enum { T, Z, S, J } tag;
union {
struct { unsigned int r1; unsigned int r2; } t;
unsigned int z;
unsigned int s;
struct { unsigned int r1; unsigned int r2; int i; } j;
} value;
}

Anyway, if you do know a different way of solving this problem — in any language — please let me know! I guess this is all I had to say…​