Is your compiler throwing onions in the varnish?

In a by now well-known anecdote, Primo Levi, an Italian chemist-become-writer from the early twentieth century, was reviewing how varnish was produced in an industrial facility. The recipe, involving many chemical components, included instructions to throw in a raw, peeled onion in the varnish. This was puzzling to him: he couldn’t figure out why the onion could be needed, or even how such a small amount of material could affect the resulting varnish. Nobody he asked at the factory knew why it was needed. Seemingly, it had been done like that forever.

Intrigued, he researched where this came from, and after some heavy digging, he discovered the reason: in earlier times, when varnish was produced in a less industrialized fashion, varnish-makers would throw in the onion, they would look at it, and when it become brown from being fried, they’d know the varnish was in the right temperature for the next step. When the production system had gained thermometers and automation, someone had just kept on throwing the onion out of habit, and thus it had become an ingredient of the varnish.

Compilers?

As an example of a modern varnish recipe, let’s have a look at the canonical MAX function: let’s see how you would find the maximum out of a list of numbers in any modern imperative language:

This loop finds the maximum value in the input array and returns it.

And you may ask, what is the problem with the above code? Isn’t it just pretty conventional code to walk an array and extract some information from it? Isn’t that how programming is supposed to be done?

The problem is the fact that we are writing the detailed steps on how to do something, at a very low level, instead of describing what we are trying to attain. And this has three key negative properties:

  • The actual goal of the code is lost. The code does not contain or embody it, it only lists the steps we concluded we needed. Someone reading it needs to deduce the thinking and the goal, the “spec”, is not actually there. The runtime (whatever it may be: interpreter, compiler or just-in-time trace-compiler) understands too little of what it’s doing. What it’s doing is like following step-by-step directions without a map, or like a blind man taking each step without any knowledge of the environment. Given this, it has limited possibilities to find shortcuts or improvements.
  • Code becomes very repetitive, verbose, and takes a long time to write. Current programming practice involves reinventing the wheel too often. After programming for a few years, it often seems as if you are rewriting the same code over and over again. Libraries and frameworks help somewhat, but they become a kind of “straitjacket”, you can benefit from them, but it comes with a huge extra cost in conventions and other conditions you have to adhere to.
  • Writing code this way is very prone to errors. It’s very easy to think you are writing the steps for a given result, while you are making a mistake and getting something else. Too often, this only apparent when running the code, and even worse, only in edge-cases when unforeseen circumstances take place.

When the compiler is generating the code to compare two values, for example, it is doing exactly the same as the worker who threw an onion in the varnish. The compiler doesn’t know what it is for, it is just blindly following instructions, which may or may not be correct for the desired goal. There are none in the example above, but actually, you often find “onions” in real code out there: pieces of code that serve no purpose, but that are being run because they’ve been there since some time they were useful. And nobody, not the developers, and especially not the compiler, has bothered to review their value in light of the current state.

I believe there is a better way to express computation, a better way to describe the function to compute the maximum, which can sidestep the drawbacks listed above. And this requires getting to the bottom of the issue, understanding the essence of what the “maximum” function above is supposed to do. Let me give it a try.

The essence

This is what I think the core of the MAX function is: the maximum of a set of numbers is the number out of them that is larger or equal than all the others in the set. This is the essence of MAX. We are not saying how it can be calculated, we are defining what we understand as the maximum value in a list or set: just the value that is larger or equal than all others in the set.

When you see the imperative code above, you see a sequence of steps that will allow you to find the maximum value as defined above. But it’s a specific method, a recipe, and it doesn’t describe either what the final result is, or why we took those specific steps. There are actually other ways to find the maximum, but the definition is unique.

Of course, we, as the developers who wrote the code, know why the steps are there. But by the time the compiler/interpreter/runtime-system reads the steps, the reasons are lost, and only dumb instructions remain, without the reasons why they should be followed, and of course without any possibility of adjusting further.

If we could find a way to embody this essence, we could have something much more valuable than the imperative code above.

Don’t imperative and OO programming cover this with their libraries and frameworks?

Imperative programming is the canonical “all information is lost” approach. State is stored in variables or objects (object-oriented programming is a kind of imperative programming), and each step is free to alter any state it wants. Actually, not only can any statement in the example above botch the calculation, but it can crash the program too. I believe this is the reason programming sometimes feels to me like walking barefoot amid pieces of broken glass.

Even more importantly, there is a brick wall that this is hitting: due to the loss of information, it’s nearly impossible to reorganize this kind of code automatically at all. One big loss here, very relevant for the coming years, is that it’s nearly impossible to automatically parallelize imperative code. Those 64-core processors that are coming out soon? You’re out of luck if you want the code your team wrote during the last 5 years to benefit from them.

Many a developer will just say: “duh, the standard library in $MY_FAVORITE_LANGUAGE already has a MAX function, so I don’t even have to think about it – and I’m sure it will parallelize things the day my desktop has 64 cores”. And the answer is: yes, your current language may help with this, but it doesn’t help with the many variations of the MAX function you are going to have to write for your code. Of course, if your favorite environment has a standard call to do something, you will do well in using it. But pretty soon, you will deviate from the exact needs the environment covers: instead of the MAX function, you may need the largest value below a certain threshold, or the first value after some given one, or whatever variation. And right then and there, you will have to start writing code like the above, which won’t benefit from the 64 cores unless you put in serious extra work.

We need to find something more powerful than imperative calls and standard class libraries.

Does the current trend towards functional programming cover this?

Functional programming is touted as one of the possible solutions to the parallelism problem. Functional programming definitely has some benefits, but I think it has a big problem, as that is the fact that is based on lists.

Lists in functional programming are a weird creature. In all common functional programming languages, they are defined recursively, with a head that is the first element of the list, and a tail that is another list, the “rest” of the original list. This seems completely normal to functional programmers. If you are not familiar with it, it definitely looks asymmetric and weird. I’d say it actually is weird, this is not how us humans think about sets and sequences. The advantage is that such lists provide a simple and formal way to structure and think about computation, which is why they’re so widespread.

Here is the canonical definition of MAX in most functional programming languages:

There are variations of this, but they are all basically the same. As in imperative programming, we are just walking down the list, recursively, and doing the comparisons that we know will let us find the maximum value. To begin with, its recursive nature makes it more difficult to understand. I’d say our brains are better wired for iteration than recursion.

But apart from that, it is still missing the same information as the imperative version. This doesn’t show anywhere the core property that we understand as being the max value, that is, being larger than all other values in the set or list!

How about Prolog and declarative programming?

Prolog takes an initial promising approach, with declarative code, but it very quickly gets off-track. The language lacks expressiveness for really strong declarative power, things like “cut” turn off the guaranteed validity of programs, and the canonical definition of MAX in Prolog is as poor as those in imperative and functional programs:

This is tied to the fact that sets in Prolog are represented as lists, and lists are defined recursively as a head that is an element, and a tail that is a list. This all but forces the definition of MAX above.

So how would it look?

Here is a pseudocode draft of how I believe the code should look in order to contain the essence of the “maximum” concept:

This is a radically different way of definining the maximum function. It is not saying how to calculate the maximum value: it is saying what the maximum value of a set is. It is describing what we understand when we think about the maximum. And it is doing so in formal terms, just a set of properties and constraints that describe the solution:

  • The maximum of a set is one of the elements of that set
  • This element, which we call “max”, fulfills some properties
  • The properties are related to the other elements of the set: for every element in the set other than this one we call “max”, their value is less than or equal than the value of max.

As you see, we are doing exactly the opposite to all the other examples above. We’re defining the WHAT, and leaving out the HOW.

As far as I know, there isn’t a programming language out there that works like this. If you know of one, I’m interested, be sure to let me know!

The bad news

The bad news is that, even if the sample code above has all these beautiful properties about describing the solution, there is something that it is missing: it is not describing how to reach that solution. And thus, the next challenge is born: devising a runtime (compiler, interpreter, interactive dashboard, whatever) that can take code like the above, and turn it into something that can actually be run.

This is a tough and large problem, but one I believe is very worth attacking.

And why is this important?

Say you write the MAX function in a way as described above. Describing its essence. And the runtime environment has the brains to run that code, that is, to compute its results. What would that mean?

  1. The code is guaranteed correct with regards to the spec! There is no possibility of an off-by-one error leaving you lost in those rare cases where the last value in the list is the maximum and it is missed! If you defined the properties correctly, the runtime will reach the right result.
  2. The runtime or compiler understands much more of what it’s doing. It can cache partial results, it can decide what is safe to cache and what not. It can restart computations. It has much more information to manage how it runs the code.
  3. Parallelism! While a dumber runtime will decide to iterate over the list to find the highest value, a smarter runtime may decide to break the list in chunks, calculate the maximum of each chunk in a separate core, and then calculate the maximum of the partial results. With only the definition above, and knowing the transitive property of the >= operator, the runtime could reach this conclusion in a formal way. The possibilities that this opens are endless.
  4. Even if it won’t be easy to write a really smart runtime, we could write a dumber runtime that could take “hints” in order to get good efficiency, and then write the hints separately from the code itself. This type of runtime is an easier target, and at the very least, this approach allows much better separation between defining WHAT we are after and HOW we are going to get it. Similarly to SQL databases, it would separate the problem of defining the data model with tables and queries, and optimizing execution deciding the indexes to keep, how to keep them, etc… ensuring that the optimization will never introduce any bugs, and possibly allowing much better separation of work between different people with different skill sets.

Help is welcome!

I’m trying to address this somehow. I’ve thought a lot about it. I’ve written a bit about it. I’ve even written a bit of code. I have some insights on how to tackle the issues that come up defining such a beast. And there are many reasons I would be grateful for your help:

  • First things first: I think I’m following some valuable insight, while going after some practical, realistic, attainable goal. But I may be delusional? So, be sure to let me know if I’m missing some key point that invalidates the whole enchilada!
  • I have an eminently practical background, not theoretical or academic. I think I’ve done my homework, but I may be missing something out there. I know there are researchers in the different related areas: Coq for theorem proving, constraint-system solving, compiler research, static code analysis, etc… if you have a background in this area, and you can point me in the required directions, I will be grateful. Especially, if someone is already working in a system like this, or pieces of a system like this, I’d definitely like to know!
  • If some of these things are already done, and the whole project turns out to be a matter of putting existing pieces together, let me know! It will take less effort that way.
  • At least to me, this is a fully new approach to computing and programming. It requires expertise in compilers, languages, type systems, optimizers, constraint-system solving, etc…. how could I not need all possible help?
  • If you are practically minded, as I am, you can help me make sure we create something that is useful for day-to-day work and can make developers’ lives less miserable, instead of a research project (much respect to those, but that’s not my goal here).
  • I have little time and I have to dedicate a lot of my time to mundane things like earning a living and paying off debt. So of course, my resources are limited, and all help is appreciated.
  • In general: I’m sure I’ve missed things. Let me know which ones!

This is a long-term project. I doubt much advance can be done in the short term. But hopefully, given enough years, eyeballs, neurons and hands, we will be able to program in much nicer ways in the future.

And we will do away with a lot of onions!