How C Programmers Can Accelerate Algos on FPGAs

C developers and FPGA hardware programmers don't necessarily get along, unless you can find ways to adapt code for hardware acceleration.

Brian Durwood, Impulse Accelerated Technology

At a financial show recently, I talked with a dozen or more bank algorithm teams and a common theme seemed to emerge:

C programmers are not going away.
C programmers outnumber hardware programmers 10 to 1.
C programmers understand that they need to figure out hardware acceleration.

To help make the topic more understandable, this article starts with the basics of hardware acceleration and gradually ramps up and explains some of the intricacies of C programming and hardware accelerated computing.

Primer: Hardware Accelerated Computing
Most software runs on microprocessors via an operating system. The microprocessor is a shared resource, and a limited one. When it reaches its processing capacity, work simply runs slower or waits in memory. In this world of microsecond trading, that is no longer acceptable. A volume spike can overwhelm the real-time processing capability of the CPU and push some data into memory to wait its turn, thereby delaying the transaction. If this happens too frequently, someone will get fired.

Enter hardware accelerated computing. An analogy is the chip in your calculator. It has no operating system; it just does its functions without having to refer to anything outside its little silicon domain. Scale that way up and you have custom or semi-custom processors, configured for a specific trade analytic or function, running without an operating system. The most common is a category of semi-custom processors called Field Programmable Gate Arrays (FPGAs), most of which are made by Xilinx or Altera.

But you also have your C, C++ or other financial algorithm that represents your analytics. The algo might need to change several times a day. Changing the underlying C code isn't rocket science. But if that code is running on an FPGA and you are a software developer, making the same change all of a sudden does seem like rocket science.

So, let's jump to rocket scientists for a moment. In the 1990s, Maya Gokhale's team at Los Alamos National Labs came up with one of the first stream-processing compilers, called Streams-C. Basically, it took C code intended for roaring fast but single-core microprocessors and "parallelized" it into multiple streaming processes. Obviously those processes had to be ones that could run in parallel (that is, free of step-by-step dependencies on each other's results). Fortunately, most financial logic is largely of that kind. So the C algorithms, designed for a microprocessor, get split into multiple streaming processes and compiled to VHDL, a register-transfer-level (RTL) hardware description, which then gets "synthesized" into the configuration (the FPGA's rough equivalent of machine code) that runs on the FPGA. The FPGA, clocked slower (and therefore running cooler) than a microprocessor, ends up with throughput 10 to 100x higher. The military funded much of this initial research for parallel processing of high bandwidth math such as trajectory, encryption and image processing.
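To make the idea of "streaming processes" concrete, here is a minimal desktop sketch in plain C. It is not Streams-C or any vendor's API: the `stream_t` FIFO, the three stage functions and `count_large_moves` are all hypothetical names invented for illustration. Each stage does at most one unit of work per call, and the loop in `count_large_moves` plays the role of a clock, advancing every stage once per pass, which is roughly how independent hardware processes would all make progress at the same time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STREAM_DEPTH 16  /* hypothetical FIFO depth between stages */

/* A tiny ring buffer standing in for a hardware stream. */
typedef struct {
    int32_t data[STREAM_DEPTH];
    size_t head, tail, count;
} stream_t;

static void stream_init(stream_t *s) { s->head = s->tail = s->count = 0; }

/* Returns 1 on success, 0 if the stream is full (the writer must stall). */
static int stream_write(stream_t *s, int32_t v) {
    if (s->count == STREAM_DEPTH) return 0;
    s->data[s->tail] = v;
    s->tail = (s->tail + 1) % STREAM_DEPTH;
    s->count++;
    return 1;
}

/* Returns 1 on success, 0 if the stream is empty (the reader must wait). */
static int stream_read(stream_t *s, int32_t *v) {
    if (s->count == 0) return 0;
    *v = s->data[s->head];
    s->head = (s->head + 1) % STREAM_DEPTH;
    s->count--;
    return 1;
}

typedef struct { const int32_t *ticks; size_t n, next; } source_t;
typedef struct { int32_t prev; int have_prev; } delta_t;
typedef struct { int32_t threshold; int hits; } detect_t;

/* Stage 1: push the next price tick into the stream, if there is room. */
static void source_step(source_t *p, stream_t *out) {
    if (p->next < p->n && stream_write(out, p->ticks[p->next])) p->next++;
}

/* Stage 2: turn prices into tick-to-tick differences. */
static void delta_step(delta_t *p, stream_t *in, stream_t *out) {
    int32_t price;
    if (out->count == STREAM_DEPTH) return;   /* downstream full: stall */
    if (!stream_read(in, &price)) return;     /* nothing to do yet */
    if (p->have_prev) stream_write(out, price - p->prev);
    p->prev = price;
    p->have_prev = 1;
}

/* Stage 3: count differences larger than the threshold in either direction. */
static void detect_step(detect_t *p, stream_t *in) {
    int32_t d;
    if (stream_read(in, &d) && (d > p->threshold || d < -p->threshold)) p->hits++;
}

/* Desktop "clock": every pass lets all three stages advance by one step. */
int count_large_moves(const int32_t *ticks, size_t n, int32_t threshold) {
    stream_t a, b;
    source_t src = { ticks, n, 0 };
    delta_t dl = { 0, 0 };
    detect_t det = { threshold, 0 };
    assert(threshold >= 0);
    stream_init(&a);
    stream_init(&b);
    while (src.next < n || a.count > 0 || b.count > 0) {
        detect_step(&det, &b);
        delta_step(&dl, &a, &b);
        source_step(&src, &a);
    }
    return det.hits;
}
```

For the ticks 100, 101, 105, 104, 98, 99 and a threshold of 3, the differences are 1, 4, -1, -6, 1, so `count_large_moves` returns 2. On a desktop the stages take turns; the point of the structure is that nothing in one stage depends on anything but its input stream, which is what lets a compiler place each stage in its own piece of hardware.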

Leap ahead 20 years and this hits the bull's eye for financial computation. It's too tempting an analogy to point out that landing a missile on a bunker has some creepy similarity to nailing a buy price in milliseconds. It just isn't as easy as using standard C compilation. Because FPGAs are configured for a specific computation, they are "laid out", i.e. there is an intervening step where the logic is synthesized and mapped to the physical device. This can take hours (vs. seconds or minutes in the microprocessor world). But, there is nothing else out there that can essentially shove your target algorithms into custom hardware several times a day.

Reconfiguring FPGAs
Some trading houses have the intense atmosphere of an air traffic control floor, with programmers modifying strategies on urgent schedules. For them, FPGAs offer partial reconfigurability, such as the ability to keep 90% of the logic intact and change a value or two. This stems from the objective of keeping as much of the dataflow logic as possible on one piece of silicon. So, key elements of the TCP/IP offload engine (a common network interface element of a high-speed trading "stack") might co-reside on the same chip with the trading logic, but only the trading logic gets reconfigured several times a day. The older methods of programming FPGAs, hand-written Verilog or VHDL, are not suited to this type of tweak by the larger population of C programmers.
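On the C side, one way to prepare for "keep most of the logic, change a value or two" is to hold the values that change during the day apart from the dataflow that does not. The sketch below only illustrates that separation; `strategy_params_t` and `should_buy` are hypothetical names, and how such a split would map onto a reconfigurable region of a particular FPGA depends on the toolchain and is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values a desk might retune several times a day (hypothetical fields). */
typedef struct {
    int32_t max_buy_price;  /* highest acceptable ask, in ticks */
    int32_t min_quantity;   /* smallest lot worth taking */
} strategy_params_t;

/* Fixed decision logic: its structure stays the same when values change. */
int should_buy(const strategy_params_t *p, int32_t ask_price, int32_t ask_qty) {
    assert(p != NULL);
    return ask_price <= p->max_buy_price && ask_qty >= p->min_quantity;
}
```

With `max_buy_price` at 100 and `min_quantity` at 50, an ask of 101 for 60 units is rejected; raising `max_buy_price` to 105 accepts it without touching `should_buy` itself.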

The other key factor in the acceptance of FPGAs as "C accelerators" is the plumbing. The downside of programming to hardware is that access to memory, I/O and things like registers is not well established. That total flexibility becomes a hurdle for software developers, who are used to the ease of accessing microprocessor peripherals. Using automatic compilation from C to FPGA hardware, about 80% of the methodology stays the same. There is some re-learning, but trades happen much faster and everyone gets to keep their job.

C Meets FPGA
So, how do C programmers actually work with FPGAs as accelerators? Previously, a C programmer froze their code and passed it off to the hardware team. Someone on the hardware team wrote the equivalent of the algorithm in Verilog or VHDL, which was then synthesized down to the actual FPGA. This is not a quick process, and it is losing out to what is called hardware/software co-design. The hardware/software co-design process includes many steps: importing the C algorithm more or less as-is into the C-to-FPGA compiler, analyzing for bottlenecks, refactoring the C code for parallelism, compiling the C into synthesizable VHDL, and finally testing. There are a couple of other steps along the way as well, so as you can see, it isn't an easy one, two, three process.
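As an illustration of the "refactoring the C code for parallelism" step, the sketch below shows one common kind of rewrite: breaking a single running total into several independent partial totals. The function names and the price-times-quantity example are assumptions made for illustration, and whether a given C-to-FPGA compiler actually schedules the four partial sums in parallel depends on the tool. In the first version every addition has to wait for the previous one; in the second, the four accumulators do not depend on each other inside the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Original form: one accumulator, so each iteration depends on the last. */
int64_t notional_serial(const int32_t *price, const int32_t *qty, size_t n) {
    int64_t acc = 0;
    assert(n == 0 || (price != NULL && qty != NULL));
    for (size_t i = 0; i < n; i++)
        acc += (int64_t)price[i] * qty[i];
    return acc;
}

/* Refactored form: four independent accumulators, combined at the end. */
int64_t notional_split4(const int32_t *price, const int32_t *qty, size_t n) {
    int64_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;
    assert(n == 0 || (price != NULL && qty != NULL));
    for (; i + 4 <= n; i += 4) {
        acc0 += (int64_t)price[i]     * qty[i];
        acc1 += (int64_t)price[i + 1] * qty[i + 1];
        acc2 += (int64_t)price[i + 2] * qty[i + 2];
        acc3 += (int64_t)price[i + 3] * qty[i + 3];
    }
    for (; i < n; i++)                 /* leftover elements when n % 4 != 0 */
        acc0 += (int64_t)price[i] * qty[i];
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Because the arithmetic is on integers, both versions return the same result, which makes the original function a convenient reference to test the refactored one against on the desktop before any hardware is involved.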

Development time ends up being about half of what it takes a VHDL writer to code the same design by hand. More importantly, iterating in C takes about one-eighth the time of iterating on the hardware description. Quality of results is way beyond the un-parallelized code running on a microprocessor.

First attempts at this technique can be done fairly modestly using Windows-based tools (see diagram 1 below) and FPGA-based development cards. If time is of the essence, manufacturers offer "turnkey" systems with pre-optimized interface and business logic code, into which you can splice your code. Many manufacturers also offer on-site training and/or algorithm refactoring services.

C-based FPGA design tools look pretty much like normal C debuggers, emphasizing iterative methods of programming and using standard C tools for desktop simulation. One trick of the tools is the ability to visualize the stage delay (the “tree” window on the bottom left) so developers can quickly see the effect of their refactoring on how well the code parallelizes.

However, these techniques are still early in their lifecycle. As an analogy, think of hot rods. They are capable of incredibly fast runs, but also prone to breaking. When it comes to algos and hardware acceleration, as the processes mature, we expect them to come down in price and increase in reliability. For many groups, authorizing internal research to prove the concept in their own trading systems may be a prudent first step.

About The Author: Brian Durwood graduated from Brown and Wharton and helped build software divisions at Data I/O, Applied Voice Technology and Impulse Accelerated. Mr. Durwood is an active blogger and technology marketer. He also volunteers towards increasing practical engineering in public schools. He can be reached at [email protected].