URMs or: Why I Can’t Live Without ADTs

Sep 11th, 2021

In a class at university I recently learned about computable functions, which are functions whose return values can be computed by a Turing machine. Our teacher, however, chose to define them as functions that can be computed by a unlimited register machine (URM). For those of you who don’t know it already, a URM is a theoretic, register-based machine architecture. Apparently URMs are Turing complete, so his choice shouldn’t make much of a difference.

A URM has an infinite register set $(R_0, R_1, \ldots)$ and the instructions set is as follows:

$T(n, m)$ transfers the contents of $R_n$ to $R_m$
$Z(n)$ transfers 0 to $R_n$
$S(n)$ increments the value of $R_n$
$J(n, m, i)$ jumps to the $i$ -th instruction if the values of $R_n$ and $R_m$ coincide

There’s this cool URM simulator if you’re interested in playing with this. Here’s an example of a URM program that computes $R_1 \; \operatorname{mod} \; R_2$ and stores it in $R_3$ .

Z(3)
Z(4)
J(2, 3, 8)
J(1, 4, 10)
S(3)
S(4)
J(1, 1, 3)
Z(3)
J(1, 1, 3)

This implementation was obtained from the following implementation in x64 assembly

mod:
  mov 0, rcx
  mov 0, rdx
f:
  cmp rbx, rcx
  je  f_first_case
  cmp rax, rdx
  je  f_second_case
f_third_case:
  inc rcx
  inc rdx
  jmp f
f_first_case:
  mov 0, rcx
  jmp f
f_second_case:
  halt

which was itself obtained from the following Haskell implementation.

mod :: Natural -> Natural -> Natural
mod a n = modTail 0 0
  where
    modTail acc c
      | acc == n = modTail 0 c
      | c == a = acc
      | otherwise = modTail (acc + 1) (c + 1)

Anyway, as you can see I took this as an opportunity to experiment with some of the languages mentioned in a previous article of mine. One of the experiments I did was to implement a URM virtual machine in OCaml, and this is what I came up with:

open Array

type nat = Zero | Succ of nat

type instruction =
  | T of nat * nat
  | Z of nat
  | S of nat
  | J of nat * nat * int

(* A function nat -> nat is used to represent the sequence of all registers *)
type machine =
  { instruction_pointer : int;
    registers : nat -> nat;
  }

(* Executes the next instruction of a program *)
let step (program : instruction array) (m : machine) : machine =
  if m.instruction_pointer >= length program then m
  else
    match program.(m.instruction_pointer) with
    | T (r1, r2) ->
      { registers =
          (fun n -> if n = r2 then m.registers r1 else m.registers n);
        instruction_pointer = m.instruction_pointer + 1;
      }
    | Z r ->
      { registers = (fun n -> if n = r then Zero else m.registers n);
        instruction_pointer = m.instruction_pointer + 1;
      }
    | S r ->
      { registers =
          (fun n -> if n = r then Succ (m.registers n) else m.registers n);
        instruction_pointer = m.instruction_pointer + 1;
      }
    | J (r1, r2, i) when m.registers r1 = m.registers r2 ->
      { m with instruction_pointer = i }
    | J _ ->
      { m with instruction_pointer = m.instruction_pointer + 1 }

For a detailed description of this implementation see git.pablopie.xyz/caml-urm/!

Implementing this was a lot of fun, so I naturally wondered if any more fun could be derived from implementing it in different languages or paradigms. Unfortunately, as far as I can tell this is not the case 😢

Notice that most of the program consists of a single match statement. This is a very natural way to describe this problem: switch through all the possible instructions and update the machine accordingly. Indeed, this is the only way to describe the problem I was able to come up with!

I tried to write equivalent programs in multiple languages such as Racket and C, but I invariably ended up trying to emulate the match statement above: algebraic data types (ADTs) are the only sane way to describe the type of instructions as far as I can tell. Hence OCaml — or any other language from the ML family — is a natural choice, and in some sense the only possible choice.

I guess what I am trying to say is that records and C-style enums are cool, but nothing beats ADTs when it comes to domain-modeling. Languages that do not support ADTs are necessarily less expressive than the ones that do. For instance compare

type instruction =
  | T of nat * nat
  | Z of nat
  | S of nat
  | J of nat * nat * int

with the equivalent tagged union in C:

struct instruction {
  enum { T, Z, S, J } tag;
  union {
    struct { unsigned int r1; unsigned int r2; } t;
    unsigned int z;
    unsigned int s;
    struct { unsigned int r1; unsigned int r2; int i; } j;
  } value;
}

Anyway, if you do know a different way of solving this problem — in any language — please let me know! I guess this is all I had to say…