I am a researcher in machine learning /
at the University of Cambridge.
I work in the
at the Department of Engineering.
I am interested in information theory, data compression,
Bayesian inference, probabilistic programming,
sampling methods, language design, automated question answering
and other topics.
I am a member of King's College.
I can best be reached by email:
Some recent work
Compressing combinatorial objects
Most of the world's digital data is currently encoded in a
sequential form, and compression methods for sequences have
been studied extensively. However, there are many types of
non-sequential data for which good compression techniques are
still largely unexplored. This paper contributes insights and
concrete techniques for compressing various kinds of
non-sequential data via arithmetic coding, and derives
re-usable probabilistic data models from fairly generic
structural assumptions. Near-optimal compression methods are
described for certain types of permutations, combinations and
multisets; and the conditions for optimality are made explicit
for each method.
Proceedings of the Data Compression Conference, DCC 2016
Improving PPM with dynamic parameter updates
This article makes several improvements to the classic PPM algorithm,
resulting in a new algorithm with superior compression effectiveness
on human text. The key differences of our algorithm to classic PPM are
that: Ⓐ rather than the original escape mechanism, we use a
generalised blending method with explicit hyper-parameters that
control the way symbol counts are combined to form predictions;
Ⓑ different hyper-parameters are used for classes of different contexts;
and Ⓒ these hyper-parameters are updated dynamically using gradient
information. The resulting algorithm (PPM-DP) compresses human text
better than all currently published variants of PPM, CTW, DMC, LZ, CSE
and BWT, with runtime only slightly slower than classic PPM.
Proceedings of the Data Compression Conference, DCC 2015
Compressing Sets and Multisets of Sequences
This article describes lossless compression algorithms for multisets
of sequences, taking advantage of the multiset's unordered structure.
Multisets are a generalisation of sets where members are allowed to
occur multiple times. A multiset can be encoded naïvely by simply
storing its elements in some sequential order, but then information is
wasted on the ordering. We propose a technique that transforms the
multiset into an order-invariant tree representation, and derive an
arithmetic code that optimally compresses the tree. Our method
achieves compression even if the sequences in the multiset are
individually incompressible (such as cryptographic hash sums). The
algorithm is demonstrated practically by compressing collections of
SHA-1 hash sums, and multisets of arbitrary, individually encodable
IEEE Transactions on Information Theory
, Vol 61, No. 3, March 2015.
Lossless Data Compression
This thesis makes several contributions to the field of data
Compression algorithms are derived for a variety of
probabilistic modelling, Bayesian inference, and arithmetic coding;
and making the underlying probability distributions explicit
A general compression toolbox is described, consisting of
practical algorithms for compressing data distributed by various
fundamental probability distributions, and mechanisms for combining
these algorithms in a principled way.
New mathematical theory is introduced for compressing objects
with an underlying combinatorial structure, such as permutations,
combinations, and multisets.
For text compression, a novel unifying construction is developed for a
family of context-sensitive compression algorithms, special cases of
which include the PPM algorithm and the Sequence Memoizer.
The work concludes with experimental results, example applications,
and a brief discussion on cost-sensitive compression and
SAT-solving: Performance analysis of Survey Propagation and DPLL
The Boolean Satisfiability Problem (SAT) belongs to the class of
NP-complete problems, meaning that there is no known deterministic
algorithm that can solve an arbitrary problem instance in less than
exponential time (parametrized on the length of the input). There is
great industrial demand for solving SAT, motivating the need for
algorithms which perform well. I present a comparison of two
approaches for solving SAT instances: DPLL (an exact algorithm from
classical computer science) and Survey Propagation (a probabilistic
algorithm from statistical physics). The two algorithms were compared
on randomly generated 3-SAT problems with varying clause to variable
Cavendish Laboratory, University of Cambridge.
Quantum computation on non-quantum computers
This dissertation outlines the the design of a statically typed
programming language for quantum computers,
and describes a working implementation of it
for classical computer systems.
Computer Laboratory, University of Cambridge.
I have taught various undergraduate courses at Cambridge University,
ML, Prolog, Java,
and Foundations of Computer Science.
I also supervised a few undergraduate
I made a handcrafted collection
of stationery paper for easy viewing and printing, directly
off the web. So whenever you need any kind of stationery when
stuck in the middle of nowhere (with nothing but internet and
a laserprinter, say) you should visit the
and print off what you need. You'll find almost anything
ranging from different types of graph paper to stationery suitable
for writing, drawing, calligraphy, science and games. There are also
Patchwork Templates, useful for
home production of differently sized cardboard hexagons and
You can download or print some sudoku sheets,
created with a basic sudoku generator that attempts to model
Difficulties range from easy to hard, and there are many different
kinds of symmetries.
If you need sudokus in large quantities or with
special properties, contact me! :)
The Alchemy Game
The Alchemy Game, the creation of which took several years,
is available on the Alchemy Pages.
A multi-platform redesign of the game engine is underway and will
be available here soon. In particular, there'll be an
enhanced version of the original Alchemy, and an online version.
(PS. The online version is basically working, but still in stealth
mode. If you would like to have a preview before its official
release, email me.)
Amino Acid Search Engine
My amino acid and DNA codon search engine.
You can either use the online version,
or download the shell script - both
suitable for command line use, or for use on a web server (simply
copy it into your cgi-bin directory!)
Are you good at minesweeper?
I made a bunch of very small mine puzzles,
which you might enjoy. These are cute rather than serious,
but be assured that bigger versions can be added if needed.
AI music tracks
I have a collection of music tracks generated by a simple, hierarchical probabilistic model.
Here's a sample from the prior.
The model supports real-time music generation, and can be trained on existing music.
CRH Reference Handler
I coded up a simple and lightweight
citation and reference management system,
for use in academic writing and publishing. It can be used on its own
or in conjunction with BibTEX, but all references are stored in
Unicode (rather than BibTEX format), which makes it much easier to
export citations to other formats, e.g. HTML, plain text or RIS.
Computer Science Miscellania
Combinator Logic Playground
I have written a graphical
combinator logic playground,
featuring SK-combinators which can be dragged around and
dropped on each other to perform computations.
Combinatory logic is a functional programming language
consisting of two
instructions: S and K.
PS: Many known functions will be detected if you build and click them,
e.g. those of other combinators, Church numerals, tuples, lists,
basic arithmetic operations, etc.
Logic Minimizer / Tautology Checker
My small OBDD-based logic minimizer for propositional logic
and uncertainty logic.
It can be used online as a Java Applet.
Alternatively, it can be run locally by
downloading logic.jar and
running java -jar logic.jar from your OS command line.
(An optional alternative for
Unix/Linux-based systems: shell script
rlogic.sh, which fetches
and runs the latest version of the software straight from the
Custom 3D engine with cubes, worm holes, and a tea pot.
Joint work with Vilhelm Sjöberg.
Screenshots, binaries and source code can be found here:
Algebraic Structures, as ML functors
Due to frequent demand, I've put my ML definitions
of algebraic structures online. This file,
struc.ml, contains module
signatures for algebraic structures like groups, monoids, rings,
fields and lattices; it also implements functors for algebraic
constructions (such as e.g. fraction fields).
Axioms are handled by the typesystem at compile time.
There's a diagram (PDF)/(PS)
providing a graphical view of what struc.ml defines.
The language used is Ocaml.
Feedback is welcome! :-)
Other Ocaml Tools
A few people have been asking me for
sblock.ml, my library of
stringblock functions. This is a small collection of handy
tools to create, manipulate and print rectangular blocks
of characters. This is useful for printing matrices or
sudokus to a terminal or text file, e.g. side by side,
in big brackets, or in boxes.
Please submit additions or corrections, so they can be
included in the next version.
A collection of tools
for CVS checkouts and CVS repositories. Includes some useful scripts
for the offline use of CVS checkouts (e.g. when travelling with a
laptop, away from the central CVS server).
I have now added my collection
of unicode material to this website,
including the plain text code charts
and the UTF-8 sample text files.
My xterm start scripts and
multilingual keyboard maps for
various languages / character sets are also available.
GRUB boot floppy
Just for convenience, I made a universal
bootfloppy for x86 computers,
using the GRUB boot loader. This is nice to have
when things have gone wrong, and the machine doesn't start.
(or just download the floppy image).
3D photography tools
A simple tool for splitting MPO files (e.g. stereo-photos)
into their component images mposplit.