Wall Street & Technology is part of the Informa Tech Division of Informa PLC



Brian Durwood, Nick Granny

A Guide to the HFT Arms Race

Sub-microsecond high-frequency algorithmic trading becomes practical

Brian Durwood, Impulse Accelerated
High-frequency trading took a leap in the last 18 months with the introduction of extremely competent FPGA-based hardware from nearly a dozen manufacturers. What seems to be emerging as the industry standard is a PCIe board with two to four bidirectional 10 Gb/s ports, fitted in an inexpensive 2U chassis and deployed in co-located trading sites worldwide. These boards run from $5,000 to $15,000 and are in production from Taiwan to the US to France. But the real story, and the real cost (≈ $100,000), is optimizing software/firmware for the new hardware, creating the interconnects (the “plumbing”), and the little things that board vendors don’t always remember to tell you.

The big picture is that you’re jumping into the “gunslinger” business and will need to keep current (no, forget current; you need to keep ahead) to stay competitive. Reports from clients indicate that this tends to be a winner-take-all business that rewards the fastest almost exclusively. Hence the winners jump to the latest FPGA hardware as it emerges every 18 to 24 months.

How the hardware works

Many existing automated trading systems work in milliseconds, strive for microseconds, and are subject to lags under spike conditions. This lag is sometimes called “non-determinism”: the inability of the system to handle loads linearly. Anyone with a PC knows that CPUs can slow dramatically under peak loads. Shifting the bottleneck logic to an FPGA (Field Programmable Gate Array) is a means of creating a predictable, lower-latency system.

An FPGA is simply a “box of gates” with a programmable interconnect. While FPGAs operate at clock speeds an order of magnitude below contemporary microprocessors, they do hundreds, often thousands, of things in parallel. With competent engineering, FPGAs are many times faster than microprocessors, especially when handling streaming data feeds. Accordingly, the FPGA is where you offload the bottleneck logic that can be parallelized. Communication between the FPGA and the CPU goes over the PCIe bus using vendor-supplied APIs.
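To make the offload intuition concrete, here is a minimal sketch (ours, not from any vendor; the function name and the 1/8 smoothing constant are invented) of the kind of bottleneck arithmetic that maps well to gates: a fixed-point moving average using only integer shifts and adds. An HLS compiler can turn an update like this into a short, fully pipelined datapath that produces one result per clock, with none of the load-dependent jitter of a CPU.

```c
#include <stdint.h>

/* Fixed-point exponential moving average over a price stream, in
 * integer ticks. alpha = 1/8 is implemented as an arithmetic shift,
 * so the whole update is one subtract, one shift, one add --
 * exactly the sort of logic that pipelines cleanly in FPGA fabric. */
int32_t ema_update(int32_t ema_ticks, int32_t price_ticks)
{
    return ema_ticks + ((price_ticks - ema_ticks) >> 3);
}
```

On hardware, one instance of this datapath handles every tick of the feed back-to-back; on a CPU the same loop competes with everything else for the core.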

Architectural considerations

Many teams have working trading systems on CPU-based hardware. Shifting the system to new hardware involves systematically moving modules over, and steps into a field called “heterogeneous processing”: assigning different types of tasks to different hardware. It narrows down to a few concepts:

  • Keep anything non-critical on the CPU. CPUs are natural “traffic cops” and the easiest place to handle control and other non-latency-critical logic. While FPGAs advertise millions of gates, resources are still finite and are consumed by non-obvious logic such as data buses and memory controllers. Routing (interconnection) of resources, just as much as gate count, can be a limiting element. Conserve FPGA resources by keeping all non-critical logic on the CPU.
  • Keep as much as possible on a single piece of silicon. Any major access to external memory or the CPU will slow things down. Many coders say that understanding memory access is one of the larger tasks in architecting efficient systems. Remember that the computers that got man to the moon and back had only a few kilobytes of main memory: you can do a LOT with what is on the FPGA itself.
  • Profiling is limited. Why has no one invented a magic big red button yet? (Actually, one of the authors is working on it.) Basically you’re hunting for compute-bound or memory-bound CPU code to offload to an FPGA. Nothing magical happens in the offload process: loops are unrolled, processes are parallelized. But there is no automatic way to identify which portions of code to shift to the FPGA.

  • Tools like Impulse C have decent visualization utilities. You can see overloaded buffers and do some pretty cool expansions (pictured) of stage delay analysis. But refinement is largely iterative and requires making intelligent choices: a) refactoring code into coarse-grained logic (think individual processes) that the compiler can efficiently parallelize, and b) experimenting with breakpoints (an easy one-line function in most software-to-hardware compilers). Many compilers come with extensive reference designs that, realistically, are often where new coders start, splicing their own code in-line or learning from the examples. In finance, many users purchase turnkey trading systems and splice in their “secret sauce” (i.e., their specific trading algorithms).
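The “coarse-grained logic (think individual processes)” refactoring above can be modeled in plain C as producer and consumer processes joined by a FIFO stream. The sketch below is purely illustrative (the type names and depth are ours); it stands in for the stream primitives an HLS tool would provide, where each process becomes its own hardware pipeline and the FIFO becomes an on-chip buffer.

```c
#include <stdint.h>

/* A tiny single-producer/single-consumer ring buffer modeling the
 * stream between two coarse-grained processes. In hardware the
 * compiler infers a FIFO here; full/empty returns model the
 * stalling behavior of the generated pipelines. */
#define FIFO_DEPTH 16

typedef struct {
    int32_t  buf[FIFO_DEPTH];
    unsigned head, tail;
} fifo_t;

/* Producer side: returns 0 (stall) when the FIFO is full. */
int fifo_put(fifo_t *f, int32_t v)
{
    unsigned next = (f->head + 1) % FIFO_DEPTH;
    if (next == f->tail)
        return 0;
    f->buf[f->head] = v;
    f->head = next;
    return 1;
}

/* Consumer side: returns 0 (stall) when the FIFO is empty. */
int fifo_get(fifo_t *f, int32_t *v)
{
    if (f->tail == f->head)
        return 0;
    *v = f->buf[f->tail];
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    return 1;
}
```

Code factored this way tends to compile well because each process has a single job and all communication is explicit.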

How the software works

  • C to VHDL to synthesis to FPGA to four hours shot to heck. Since compilers cover the car payments on my Toyota, perhaps I should be more loyal. But we have to be honest and explain that, unlike a CPU or GPU compile, the full compile to FPGA hardware can take hours and lacks some (OK, most) of the back-annotation detail that software coders had in the 1990s. This is intrinsic to the process: you are essentially causing a tiny circuit board (the FPGA) to be laid out in a manner optimal for your software logic. It takes a few hours to route several million gates.

  • A Visual Studio-based interface for C-to-FPGA compilation can also automatically generate stage delay analysis and compile to an FPGA running on a PCIe board.

  • Hand coding vs. machine coding and QoR (quality of results). In the 1980s the old dogs said HDLs (hardware description languages) would never replace manual tape-out in terms of quality. Nonetheless, as HDLs matured and the time-to-market advantages became obvious, they largely replaced hand layout. Flash forward and it’s the same argument, but on steroids: HLLs like C, OpenCL, or Vivado HLS are rattling the cages of HDLs like Verilog and VHDL. The same arguments apply.

    If time is not a concern, and you have experienced hardware coders, then hand-coded HDL can be superior to machine-generated code from HLLs. But remember this: in the 1980s, if you wanted performance you used assembly language; in the 1990s, if you wanted performance you used C; in the 21st century, few even remember what assembly language looks like and C is considered “quaint.”

    Nick Granny, Impulse Accelerated
    The current state of this argument seems to be that it is wise to leave alone mature HDL modules, such as encryptors, that have been refined over the years, and to create new code in an HLL, compile to HDL (the intermediate file format), and have someone knowledgeable check whether there is significant upside in re-isolating some HDL processes for hand polishing. Unfortunately, hand polishing disconnects those modules from the C files (much as editing generated assembly does) and makes the application harder to maintain, so it is to be avoided if possible.
  • Falling off the cliff when resources get over-run. Another caution when compiling to an FPGA is the presence of special-purpose blocks within the device. Experienced FPGA designers can work magic with hard DSP blocks and hard-core or soft-core processors. However, if you don’t know how to use them and your design evolves to require one more DSP block than is available, your performance falls off a cliff until you find the resource mismatch.
  • If you are in HFT, you are in an arms race. The more you tie yourself to a specific FPGA vendor’s embedded IP, the more likely you are to have your lunch eaten by a competitor who can jump to another (bigger, faster) platform faster than you. Keep it generic, stay flexible! HLL designs remain the most portable and dramatically lower your time to migrate to newer hardware. Remember: in this model the software is the long-term investment and the hardware becomes an expense item, to be replaced when newer versions become available.

Example: a sub-microsecond trading appliance (pictured, and by the way, successfully deployed). This mounts a $10,000 PCIe board (carrying a mid-sized FPGA from the current generation) in an off-the-shelf 2U rack-mount chassis. It uses in-band event detection technologies, originally developed for the US Department of Defense, to monitor two 10 Gb/s market data feeds. It parses messages in real time and passes the critical content to the business rules engine.

The business rules engine is written in C and is customizable. It applies pattern matching methods to real-time market feeds, producing one or more triggers for the response logic. Based on the trigger signal payload, the response logic can select one or more TCP response message templates filling in the variable data (such as FIX sequence number, quantities, prices, TCP SEQ/ACK and other overhead) as they are pushed onto the wire.
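As a hedged sketch of the template-fill step described above (the helper names are ours, not the product’s): the body of a FIX response message is pre-built, and on a trigger only the variable fields and the standard FIX checksum (tag 10: the sum of all preceding bytes modulo 256) are rewritten before the frame goes on the wire.

```c
#include <string.h>

/* Standard FIX checksum: sum of all bytes in the message up to
 * (but not including) the tag-10 field, modulo 256. */
unsigned fix_checksum(const char *msg, size_t len)
{
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += (unsigned char)msg[i];
    return sum % 256;
}

/* Overwrite a zero-padded numeric field (e.g., a sequence number or
 * quantity) in place, leaving the rest of the pre-built template
 * untouched -- the only per-trigger work besides the checksum. */
void fill_field(char *field, size_t width, unsigned value)
{
    for (size_t i = width; i-- > 0; value /= 10)
        field[i] = (char)('0' + value % 10);
}
```

Because the template never changes shape at run time, this per-trigger work is a handful of byte writes, which is what makes pushing it into hardware practical.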

The system is engineered to be a trading appliance.

A trading system works transparently but there are many components of the software that must be configured perfectly. It is often quicker to work with known good code or a service provider.

While it necessarily requires some customization to correctly realize the client’s trading strategy in FPGA logic, the nature and scope of that customization is limited, which keeps the system affordable. Variable data is provided by the client’s existing trading platform and by a simple boot-time configuration file. An optional embedded MiniFIX housekeeping engine lets the entire system operate autonomously while the host trading platform concentrates on strategy synthesis, providing only updates to the message templates as strategy changes. The whole system, with customer logic, typically runs $100k or more.
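A plausible, purely illustrative flavor of that boot-time configuration file (the keys shown are invented, not the product’s actual schema) is simple key=value lines parsed once at startup, before the feed handlers come up:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical boot-time parameters supplied by the host platform.
 * Real appliances would carry many more; two suffice to sketch it. */
typedef struct {
    long max_order_qty;
    long price_limit_ticks;
} cfg_t;

/* Parse one "key=value" line; returns 1 on success, 0 on a
 * malformed line or unknown key. */
int cfg_parse_line(cfg_t *cfg, const char *line)
{
    char key[64];
    long val;
    if (sscanf(line, "%63[^=]=%ld", key, &val) != 2)
        return 0;
    if (strcmp(key, "max_order_qty") == 0)
        cfg->max_order_qty = val;
    else if (strcmp(key, "price_limit_ticks") == 0)
        cfg->price_limit_ticks = val;
    else
        return 0;
    return 1;
}
```

Keeping the configuration this dumb is deliberate: nothing latency-critical ever re-reads it, so it can live entirely on the CPU side.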

The Network Ecosystem

Good gateways and bad: why the fastest will not always win. OK, so you’ve constructed the perfect weapon in the HFT arms race and you’re still getting beaten by those b***ards down the street at the 1st National Bank of Kazakhstan. What’s happening?

Well, all things being equal, the winners are playing in sub-microsecond trades. But all things are not equal. Flash back a few years and traders were renting office space as close as possible to the actual exchange feeds in order to minimize wire length to the point of trade. Today they rent space on the exchange’s networking “floor.” This is called co-lo, or colocation.

Zooming down inside the networking hardware, we’re finding that millisecond differences can occur depending on the gateway or port you get. In some countries this is dynamic: you don’t know what you’ve got until you log in. In others, traders are allowed to “squat” on a gateway, and anecdotally we’re hearing that the lucky ones hold on to a “hot” gateway and win by milliseconds. While it is possible to temporally load-balance (i.e., level the “time” playing field), we don’t see this happening. It remains a trading risk.

Other software elements in the trading system are also being accelerated. We are working with teams accelerating in-line functions such as compliance or pre-trade risk checking (e.g., “does a guest trader have an adequate balance?”). Intrusion detection is another in-line function that should ideally run at low or zero latency (i.e., wire speed). Off-line functions such as Monte Carlo valuation or options analyses can run 4x faster in hardware. In the non-financial world, governments are using similar technology to execute wire-speed packet filtering with front-end protocol conversion.
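A minimal sketch of the balance check mentioned above, assuming prices are held in integer ticks so the whole test reduces to a multiply and a compare (a handful of gates at wire speed); the struct and field names are hypothetical:

```c
#include <stdint.h>

/* Hypothetical account record kept in on-chip memory. */
typedef struct {
    int64_t free_balance_ticks;  /* uncommitted funds, in ticks */
} account_t;

/* Pre-trade risk check: can the free balance cover the order?
 * Returns 1 (pass) or 0 (reject). Integer arithmetic only, so the
 * check can sit in-line on the order path without adding jitter. */
int risk_check(const account_t *a, int64_t price_ticks, int64_t qty)
{
    return price_ticks * qty <= a->free_balance_ticks;
}
```

A production version would also guard against overflow and check per-symbol limits; the point here is only that the hot-path decision is branch-free arithmetic.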

Development methodology recap. So we generally recommend: stay in an HLL as much as possible. It is quicker, easier for C programmers, and migrates to new hardware faster. Realistically, keep an eye open for bits that can be further hand-refined for speed in HDL. And overall: iterate, iterate, iterate. This is one area that is truly responsive to continual improvement.

About the authors: Brian Durwood and Nick Granny are 10-year collaborators who have created and deployed solutions for everyone from the Air Force to the Chicago Mercantile Exchange. Using Impulse Tools and IP, and MercuryMinerva Trading Systems software, the two provide turnkey systems to mid-size trading firms that are jumping into the HFT arena. The collaboration focuses on IP that clients own, and training such that clients can make their own modifications and extensions. The authors wish to thank Ed Trexel and Richio Aikawa for their help on this article.
