URMs or: Why I Can’t Live Without ADTs

In a class at university I recently learned about computable functions, which are functions whose return values can be computed by a Turing machine. Our teacher, however, chose to define them as functions that can be computed by a unlimited register machine (URM). For those of you who don’t know it already, a URM is a theoretic, register-based machine architecture. Apparently URMs are Turing complete, so his choice shouldn’t make much of a difference.

A URM has an infinite register set (R0,R2,) and the instructions set is as follows:

  1. T(n,m) transfers the contents of Rn to Rm
  2. Z(n) transfers 0 to Rn
  3. S(n) increments the value of Rn
  4. J(n,m,i) jumps to the i-th instruction if the values of Rn and Rm coincide

There’s this cool URM simulator if you’re interested in playing with this. Here’s an example of a URM program that computes R1modR2 and stores it in R3.

J(2, 3, 8)
J(1, 4, 10)
J(1, 1, 3)
J(1, 1, 3)

This implementation was obtained from the following implementation in x64 assembly

  mov 0, rcx
  mov 0, rdx
  cmp rbx, rcx
  je  f_first_case
  cmp rax, rdx
  je  f_second_case
  inc rcx
  inc rdx
  jmp f
  mov 0, rcx
  jmp f

which was itself obtained from the following Haskell implementation.

mod :: Natural -> Natural -> Natural
mod a n = modTail 0 0
    modTail acc c
      | acc == n = modTail 0 c
      | c == a = acc
      | otherwise = modTail (acc + 1) (c + 1)

Anyway, as you can see I took this as an opportunity to experiment with some of the languages mentioned in a previous article of mine. One of the experiments I did was to implement a URM virtual machine in OCaml, and this is what I came up with:

open Array

type nat = Zero | Succ of nat

type instruction =
  | T of nat * nat
  | Z of nat
  | S of nat
  | J of nat * nat * int

(* A function nat -> nat is used to represent the sequence of all registers *)
type machine =
  { instruction_pointer : int;
    registers : nat -> nat;

(* Executes the next instruction of a program *)
let step (program : instruction array) (m : machine) : machine =
  if m.instruction_pointer >= length program then m
    match program.(m.instruction_pointer) with
    | T (r1, r2) ->
      { registers =
          (fun n -> if n = r2 then m.registers r1 else m.registers n);
        instruction_pointer = m.instruction_pointer + 1;
    | Z r ->
      { registers = (fun n -> if n = r then Zero else m.registers n);
        instruction_pointer = m.instruction_pointer + 1;
    | S r ->
      { registers =
          (fun n -> if n = r then Succ (m.registers n) else m.registers n);
        instruction_pointer = m.instruction_pointer + 1;
    | J (r1, r2, i) when m.registers r1 = m.registers r2 ->
      { m with instruction_pointer = i }
    | J _ ->
      { m with instruction_pointer = m.instruction_pointer + 1 }
For a detailed description of this implementation see git.pablopie.xyz/caml-urm/README.html!

Implementing this was a lot of fun, so I naturally wondered if any more fun could be derived from implementing it in different languages or paradigms. Unfortunately, as far as I can tell this is not the case 😢

Notice that most of the program consists of a single match statement. This is a very natural way to describe this problem: switch through all the possible instructions and update the machine accordingly. Indeed, this is the only way to describe the problem I was able to come up with!

I tried to write equivalent programs in multiple languages such as Racket and C, but I invariably ended up trying to emulate the match statement above: algebraic data types (ADTs) are the only sane way to describe the type of instructions as far as I can tell. Hence OCaml — or any other language from the ML family — is a natural choice, and in some sense the only possible choice.

I guess what I am trying to say is that records and C-style enums are cool, but nothing beats ADTs when it comes to domain-modeling. Languages that do not support ADTs are necessarily less expressive than the ones that do. For instance compare

type instruction =
  | T of nat * nat
  | Z of nat
  | S of nat
  | J of nat * nat * int

with the equivalent tagged union in C:

struct instruction {
  enum { T, Z, S, J } tag;
  union {
    struct { unsigned int r1; unsigned int r2; } t;
    unsigned int z;
    unsigned int s;
    struct { unsigned int r1; unsigned int r2; int i; } j;
  } value;

Anyway, if you do know a different way of solving this problem — in any language — please let me know! I guess this is all I had to say…​