Thursday, July 5, 2018

Algebraic Data Types for Imperative Programmers

Algebraic data types are a fantastic way to make code clearer. They're used all the time in Haskell, because they're a fundamental part of the language, easy, and lightweight. But it turns out you can use them in imperative programming languages like Python and C++, too, with just a bit of hackery.

What's an algebraic data type?

Having "algebraic data types" means that you can create a new type by doing AND or OR on two existing types.

In most languages, doing an AND is easy. In this context, AND just means that "to get an item of this new type, you need an item of a building block type AND another item of a second building block type." For example, in C++, you can use a pair or a struct:
std::pair<int, string> person_description = { 50, 'Walter White' };
"To get a person description, you need an integer (age) and a name (string)."
In Python, we can use a tuple:
person = (50, 'Walter White')
In both of these cases, to fill out the data type, we need an integer AND a string.

However, most languages don't make it easy to represent an OR. Suppose you want to represent an animal that could be of multiple types, and would have different properties depending on the particular animal. In Haskell this looks like this:
data Cat = Cat { whiskerLength :: Float }
data Dog = Dog { tailLength :: Float, noseWetness :: Float }
data Pet = Feline Cat | Canine Dog
We're declaring that there's a thing called a Cat, and a Pet can be Feline, in which case it has all the properties of a Cat, or it can be Canine, in which case it has the properties of a Dog.

In C or C++, declaring the Cat and Dog structures is easy:
struct Cat {
  float whisker_length;
}
struct Dog {
  float tail_length;
  float nose_wetness;
}
But declaring a type that could be "a cat OR a dog" is not. Most popular programming languages don't natively support the concept of an OR type, so you can't create any arbitrary algebraic data type with them.

In a case like this, we would often just create a Pet structure that has both a Cat and Dog, and try to enforce in the functions we write for Pet that only one at a time is ever filled out.
struct Pet {
  // Only one should be non-null.
  Cat* cat;
  Dog* dog;
}
But if we don't make a bunch of tedious if Cat checks, then we might make a mistake:
void totalLength(const Pet& pet) {
  // Cats are 20 cm long on average, plus whiskers.
  return 20.0 + pet.cat->whisker_length;
  // BOOM! If it's a Dog, we done goofed.
}

Whereas if we had an algebraic data type, the compiler would object to us doing this.
totalLength :: Pet -> Float
totalLength pet = 20.0 + (whiskerLength pet)
-- ERROR: whiskerLength works on a Cat, not on a Pet.

Instead we have to use pattern matching:
totalLength (Cat cat) = 20.0 + (whiskerLength cat)
-- This obviously only works with a Cat.
... and if you have warnings turned on, the compiler will warn you that this is a partial function that crashes on Pets that are a Dog.
Another solution to this might be object inheritance.
class Pet {
 public:
  virtual float length() = 0;
}
class Cat : public Pet {
  float whiskerLength;
  override float length() {
    return 20.0 + whiskerLength;
  }
}
class Dog : public Pet {
  float tailLength;
  float noseWetness;
}

This approach also has the benefit of being compiler-checked; if you try to create a Dog object, it will fail saying that you forgot to implement length() for Dog, and it's therefore a pure virtual class. On the minus side, if you have a Pet object in a function, it's hackier to determine whether it's actually a Dog or a Cat under the hood. In fact, you have no way to enumerate all the possible types of Pet and handle them differently in a later function; all the differing logic has to be encoded in the class itself.

Why should you use algebraic data types?

Algebraic data types make it clearer to express lots of concepts, and as a bonus, they can help guarantee that your code does what it should.

Everyone uses AND data types - packaging data into named structures is seen as common sense. But while OR data types are used less commonly, they also can make code much clearer.

Here's an example. Suppose we have a program for interpreting a handwritten phone number from an image of a scanned form. If we get back text from the form, we want to try to turn it into someone's phone number. But it's possible someone filled out the form incorrectly, with their name in the "Phone Number" field. So the text may be invalid.
So we need to be able to represent three cases:
A phone number is an integer.
data PhoneNumber = PhoneNumber Integer
InvalidText is an object with a the original text and an error message.
data InvalidText = InvalidText { originalText :: String, errorMessage :: String }
An image error is an error that results from finding no text in the image.
data ImageError = ImageError
And to represent "one of these things," we create a new algebraic data type. A PhoneNumberOrError is either a phone number, invalid text, or an image error.
data PhoneNumberOrError = PhoneNumber | InvalidText | ImageError
We have a function to get a PhoneNumberOrError from an Image.
getPhoneNumber :: Image -> PhoneNumberOrError

This is nice, because now we know we have to handle all the resulting cases in downstream code. If we try to write a function that doesn't handle those cases, we get an error.
callPhoneNumber :: PhoneNumber -> IO ()
callPhoneNumber = ...

callWrittenPhoneNumber :: Image -> IO ()
callWrittenPhoneNumber image = do
    callPhoneNumber (getPhoneNumber image)
    -- COMPILE ERROR: PhoneNumberOrError is not a PhoneNumber.
Instead we define this function, and use pattern matching to check the different cases:
maybeCallPhoneNumber :: PhoneNumberOrError -> IO ()
"If it's a phone number, call it."
maybeCallPhoneNumber (PhoneNumber num) = callPhoneNumber num
"If it's an image error, print that out."
maybeCallPhoneNumber ImageError = printStrLn "Error processing text from image."
"If it's invalid text, print it out, and also the error message."
maybeCallPhoneNumber invalid_text = printStrLn (
    "Found invalid text: '" ++ originalText invalid_text ++
    "'. Error message: " ++ errorMessage invalid_text)
And now we can define our callWrittenPhoneNumber function:
callWrittenPhoneNumber :: Image -> IO ()
callWrittenPhoneNumber image = maybeCallPhoneNumber (getPhoneNumber image)

If we were doing this in a traditional imperative way, this would look very different, depending on the programming language. Perhaps, in C++, we'd return an integer that could be a phone number or an error code:
// -1 means no phone number was found; see error message for details.
int getPhoneNumber(const Image& image, string* error_message) { ... }
We left a nice comment on our function, but what if the next engineer was in a hurry and forgot to read it?
void callPhoneNumberInImage(const Image& image) {
    string error_message;
    int phone_number = getPhoneNumber(image, error_message);
    callPhoneNumber(phone_number);
}
... oops! Now we've tried to call the number -1, because it wasn't enforced that we should handle all the possible cases.

And if we've got to write down what our function returns anyway, why not make it machine-readable, too? Plus, the error message is opaque, and can't tell us which type of thing went wrong.

Why type safety?

Why do you need union types when you can just return whatever you want, like in Python? Well...
class Dog:
  def __init__(self, tail_length, nose_wetness):
    self.tail_length = tail_length
    self.nose_wetness = nose_wetness

class Cat:
  def __init__(self, whisker_length):
    self.whisker_length = whisker_length

def length(pet):
  return pet.whisker_length + 20.0  # OOPS. Doesn't work for Dog.


  • The return type of the function isn't checked by the compiler. Anyone can add a return statement with any old type, and break your assumptions about the thing you got back from that function.
  • With duck typing, it can be hard to track down exactly where the method and attribute you're trying to use gets populated, or whether it applies in each case.
In contrast, with ADTs, the return type is well-specified, and you know which category of thing you got back from the function.


How do you use them?

In Haskell, writing an ADT is simple. But how do you use them in languages that don't support them natively?

There are a few different monkey patches available at this point.

Protocol Buffers and friends

Here's one way: Protocol Buffers. While they are mainly used as an encoding and RPC format, because of their support for the oneof keyword, protocol buffers are actually algebraic data types as well!

Here's how we could write the exact same code in C++ and proto, using protos to represent our ADT:
phone_number.proto
message InterpretedPhoneNumber {
  oneof result {
    int64 phone_number = 1; // That's the right data type for phone numbers, right?
    string invalid_text_error = 2;
    string image_error = 3;
  }
}

call.cpp
#include "phone_number.proto.h"

InterpretedPhoneNumber getPhoneNumber(const Image& image) { ... }

void callPhoneNumberInImage(const Image& image) {
  auto maybe_phone_number = getPhoneNumber(image);
  switch (maybe_phone_number.result_case()) {
    case InterpretedPhoneNumber::kPhoneNumber:
      callPhoneNumber(maybe_phone_number.phone_number());
      break;
    case InterpretedPhoneNumber::kInvalidTextError:
      cout << "Error in phone number text: "
           << maybe_phone_number.invalid_text_error() << endl;
      break;
    case InterpretedPhoneNumber::kImageError:
      cout << "Error in phone number image: "
           << maybe_phone_number.image_error() << endl;
      break;
    default:
      // This should never happen. Must be enforced by 'getPhoneNumber'.
      cout << "Invalid interpreted result" << endl;
  }
}

With this approach, you can get the compiler to warn you if you haven't included all the cases, because the oneof generates an enum that can be checked in a switch statement like any other. It's also clear what the possible types of result are; it's encoded in the type names, rather than a comment.

Caveat: Protocol buffers unfortunately may have significant performance overhead, so in speed-critical applications, they may not be appropriate if all you need is an ADT.

Similarly to Google's Protocol Buffers, Cap'n Proto can also support ADTs, using named unions.

Other ways to get ADTs in non-Haskell languages

I'm not as familiar with these tools, but here's a list of things I've heard can help you use ADTs in languages that don't support them natively:

  • C++17 supports std::variant, a template that can be used for creating un-tagged union types. 
  • TypeScript, a wrapper on Javascript, supports ADTs.
  • Pytype, a Python library for adding type checking, has Union[] and Optional[] types that you can return.
  • C does have a feature called "unions." However, they're not great, because you can't tell which data member is filled out by looking at the union. You have to just try and hope that you got it right.
Are there other tools I'm missing? Or other solutions to this kind of problem?

1 comment :