February 8, 2016

Standard Intermediate Assembly For CPUs

Over in GPU land, a magical thing is happening. All the graphics card vendors and big companies got together and came up with SPIR-V as the technological underpinning of Vulkan, the as-of-yet unreleased new graphics API. SPIR-V is a cross-platform binary format for compiled shaders, which allows developers to use any language that can compile to SPIR-V to write shaders, and to run those compiled shaders on any architecture that supports SPIR-V. This is big news, and if it works as well as everyone's hoping it does, it will set the stage for a major change in how shaders are compiled in graphics engine toolchains.

How is this possible? The SPIR-V specification tells us that it is essentially a cross-platform intermediate assembly language. It's higher level than conventional assembly, but lower than an actual language. The specifics of the language are fine-tuned towards modern graphics hardware, so that the instructions can encode sufficient metadata about what they're doing to enable hardware optimizations, while still allowing those instructions to be efficiently decoded by the hardware and implemented by the chip's microcode.

While SPIR-V is specifically designed for GPUs, the specification bears some resemblance to another intermediate assembly language - LLVM IR. This move by GPU vendors towards an intermediate assembly representation mirrors how modern language design is moving towards a standardized intermediate representation that many different languages compile to, which can then itself be compiled to any CPU architecture required. LLVM IR is used for C, C++, Haskell, Rust, and many others. This intermediate representation decouples the underlying hardware from the high level languages, allowing any language that compiles down to LLVM IR to compile to any of its supported CPU architectures - even asm.js.
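To make the decoupling concrete, here is a minimal sketch. It uses a made-up three-address IR, not real LLVM IR, and every name in it (`lower_infix`, `lower_postfix`, `emit_x86_like`, `emit_arm_like`) is hypothetical. The point is only that two front ends and two back ends meet at one shared representation, so adding a language or a target means writing one new piece rather than one per pairing.

```python
from typing import List, Tuple

# Toy IR instruction: (opcode, destination, left operand, right operand).
# This is an invented format for illustration, not LLVM IR.
Instr = Tuple[str, str, str, str]

OPS = {"+": "add", "*": "mul"}


def lower_infix(src: str) -> List[Instr]:
    """Hypothetical front end A: parses 'x = a + b'."""
    dest, _, lhs, op, rhs = src.split()
    return [(OPS[op], dest, lhs, rhs)]


def lower_postfix(src: str) -> List[Instr]:
    """Hypothetical front end B: parses 'a b + x' (operands, operator, destination)."""
    lhs, rhs, op, dest = src.split()
    return [(OPS[op], dest, lhs, rhs)]


def emit_x86_like(ir: List[Instr]) -> List[str]:
    """Back end 1: two-operand style, so copy lhs into dest first.

    Toy only: it assumes dest and rhs are different names.
    """
    out = []
    for op, dest, lhs, rhs in ir:
        mnemonic = "imul" if op == "mul" else op
        out.append(f"mov {dest}, {lhs}")
        out.append(f"{mnemonic} {dest}, {rhs}")
    return out


def emit_arm_like(ir: List[Instr]) -> List[str]:
    """Back end 2: three-operand style maps one-to-one onto the toy IR."""
    return [f"{op} {dest}, {lhs}, {rhs}" for op, dest, lhs, rhs in ir]


if __name__ == "__main__":
    ir = lower_infix("x = a * b")
    print(ir)
    print(emit_x86_like(ir))
    print(emit_arm_like(ir))
```

Both front ends produce the same IR for the same computation, and neither back end needs to know which source language it came from.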

However, we have a serious problem looming over our heads in CPU land. Did you know that the x86 mov instruction is Turing complete? In fact, even the page fault handler is Turing complete, so you can run programs on x86 without actually executing any instructions! The x86 architecture is so convoluted and bloated that it no longer has any predictable running time for any given set of instructions. Inserting random useless mov instructions can increase speed by 5% or more, false dependencies destroy performance, and Intel keeps introducing more and more complex instruction sets that don't even work properly. As a result, it's extremely difficult for any compiler to produce properly optimized assembly code, even when it's targeting a specific architecture.

One way to attack this problem is to advocate for RISC - Reduced Instruction Set Computer. The argument is that fewer instructions will be easier to implement, reducing the chance of errors and making it easier for compilers to actually optimize the code in a meaningful way. Unfortunately, RISC has a serious problem: the laws of physics. A modern CPU is so fast that it can process an instruction faster than the electrical signal can get to the other side of the chip. Consequently, it spends the majority of its time just waiting for memory. Both pipelining and branch prediction were created to deal with the memory latency problem, and it turns out that having complex instructions gives you a distinct advantage. The more complex your instruction is, the more the CPU has to do before it needs to fetch things from memory. This was the core observation of the Itanium instruction set, which relies on the compiler to determine which instructions can be executed in parallel in an attempt to remove the need for pipelining. Unfortunately, it turns out that removing dependency calculations is not enough - this is why many of Intel's new instructions are about encapsulating complex behaviors into single instructions instead of simply adding more parallel operators.
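The idea of having the compiler decide what can run in parallel can be sketched in a few lines. This is a heavily simplified illustration, not how an actual Itanium compiler works: instructions are reduced to a destination register plus the registers they read, and the hypothetical `group_independent` function greedily starts a new group whenever an instruction reads or overwrites a register that was written earlier in the current group.

```python
from typing import List, Set, Tuple

# Simplified instruction: (destination register, registers it reads).
Instr = Tuple[str, Tuple[str, ...]]


def group_independent(instrs: List[Instr]) -> List[List[Instr]]:
    """Greedily split a straight-line program into groups whose members
    do not depend on each other, preserving program order.

    Assumption for this sketch: only registers written inside the current
    group create a conflict (read-after-write or write-after-write).
    """
    groups: List[List[Instr]] = []
    current: List[Instr] = []
    written: Set[str] = set()

    for dest, sources in instrs:
        conflict = dest in written or any(s in written for s in sources)
        if conflict:
            # Close the current group; this instruction must wait for it.
            groups.append(current)
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)

    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    program = [
        ("r1", ("a", "b")),    # r1 = a op b
        ("r2", ("c", "d")),    # r2 = c op d, independent of r1
        ("r3", ("r1", "r2")),  # needs both results, so a new group starts
        ("r4", ("a", "c")),    # independent of r3, shares its group
    ]
    for i, group in enumerate(group_independent(program)):
        print(i, [dest for dest, _ in group])
```

In this toy model the hardware could issue everything inside one group at once without checking dependencies itself, which is the work the paragraph above describes moving into the compiler.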

Of course, creating hardware that supports increasingly complex operations is unsustainable, which is why modern CPUs don't execute assembly instructions directly. Instead, they use microcode, which is the raw machine code that actually implements the "low-level" x86 assembly. At this point, x86 is so far removed from the underlying hardware that it might as well be a (very crude) high level language all by itself. For example, a register-to-register mov instruction usually doesn't actually move anything; it just renames the internal register being used. Because of this, the modern language stack looks something like this:

Modern Language Stack
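As an aside on the mov example: the renaming trick can be modeled with a small lookup table. The sketch below is a toy, and the `RenameTable` class and its method names are invented for illustration. Architectural register names such as `rax` are just keys that point at physical registers, so a register-to-register mov only has to repoint a key. Real hardware also has to track when a physical register is no longer referenced so it can be reused, which this sketch ignores.

```python
from typing import Dict, List


class RenameTable:
    """Toy model of register renaming (hypothetical, heavily simplified)."""

    def __init__(self, arch_regs: List[str]) -> None:
        self._next_phys = 0
        # Each architectural register starts with its own physical register.
        self.mapping: Dict[str, str] = {r: self._alloc() for r in arch_regs}

    def _alloc(self) -> str:
        name = f"p{self._next_phys}"
        self._next_phys += 1
        return name

    def write(self, reg: str) -> str:
        """An instruction that produces a new value gets a fresh physical register."""
        self.mapping[reg] = self._alloc()
        return self.mapping[reg]

    def mov(self, dest: str, src: str) -> None:
        """Register-to-register mov: copy no data, just alias dest to src's
        physical register."""
        self.mapping[dest] = self.mapping[src]


if __name__ == "__main__":
    table = RenameTable(["rax", "rbx"])
    table.mov("rbx", "rax")  # rbx now names the same physical register as rax
    table.write("rax")       # a later write to rax does not disturb rbx
    print(table.mapping)
```

Because a later write to `rax` allocates a fresh physical register, `rbx` keeps seeing the old value even though no data was ever copied.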

Even though we're talking about four CPU architectures, what we really have is four competing intermediate layers. x86, x86-64, ARM and Itanium are all just crude abstractions above the CPU itself, which has its own architecture-dependent microcode that actually figures out how to run things. Since our CPUs will inevitably have complex microcode no matter what we do, why not implement something else with it? What if the CPUs just executed LLVM IR directly? Then we would have this:

LLVM IR Microcode Stack

Instead of implementing x86-64 with microcode, implement the LLVM intermediate assembly code with microcode. This would make writing platform-independent code trivial, and would allow for way more flexibility for hardware designers to experiment with their CPU architecture. The high-level nature of the instructions would allow the CPU to load large chunks of data into registers for complex operations and perform more efficient optimizations with the additional contextual information.

Realistically, this will probably never happen. For one, directly executing LLVM IR is probably a bad idea, because it was never developed with this in mind. Instead, Intel, AMD and ARM would have to cooperate to create something like SPIR-V that could be efficiently decoded and implemented by the hardware. Getting these competitors to actually cooperate with each other is the biggest obstacle to implementing something like this, and I don't see it happening anytime soon. Even then, a new standard architecture wouldn't replace LLVM IR, so you'd still have to compile to it.

In addition, an entirely new CPU architecture is extraordinarily unlikely to be widely adopted. One of the primary reasons x86-64 won out over Itanium was that it was capable of running x86 code at native speed, while Itanium's x86 emulation was notoriously bad. Even if we somehow moved to an industry-wide standard assembly language, the vast majority of the world's programs are still built for x86, so an efficient translation between x86 and our new intermediate representation would be paramount. That's without even considering that you'd have to recompile your OS to take advantage of the new assembly language, and modern OSes still have some platform-specific hand-written assembly in them.

Sadly, as much as I like this concept, it will probably remain nothing more than a thought experiment. Perhaps as we move past the age of silicon and look towards new materials, we might get some new CPU architectures out of it. Maybe if we keep things like this in mind, next time we can do a better job than x86.

1 comment:

  1. The GPUs will be compiling SPIR down to the native machine code rather than executing it directly. So arguably the GPU model is closer to the specializer design proposed for the Mill CPU, https://millcomputing.com/topic/compiler/, than to having hardware execute SPIR directly.