For the past four years, a team of 70 engineers at IBM's T. J. Watson Research Center in Hawthorne, N.Y., has been working on an ultra-powerful, large-scale (as in petabytes of data) stream processing system, currently running on 800 x86 computers with embedded Cell processors, that can analyze massive volumes of market data and news in real time (as well as medical, seismic, astronomical or any other type of data). At the SIFMA show this week, IBM talked about this "mature prototype," called System S, which Wall Street firms will one day be able to use to create a no-holds-barred environment in which their quants can roam free, testing ideas, finding correlations and refining algorithms against a huge pipeline of streaming data, with instant results. A government agency is already using System S, and IBM has filed 400 patents for it. IBM talked the project up at SIFMA because it is interested in working with capital markets firms on pilots to see what System S could do for them.
"Some of our most sophisticated clients on Wall Street are jumping all over this," says Kevin Pleiter, director of financial services at IBM. "The power of this is it's able to correlate events from disparate data sources." For instance, a market data event such as a plunge in the price of certain stocks might trigger an algorithmic trading program to buy some of the stock. But if the price drop were caused by a calamity such as a terrorist attack, such a purchase would be unwise. While it watches the market data, System S could also be taking in video feeds from television networks, analyzing the news, and sending a recommendation to put the trading system into crisis mode.
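The crisis-mode scenario above amounts to joining two event streams across a time window. The minimal sketch below is purely illustrative: the class name, the keyword heuristic and the 60-second window are assumptions for the example, not anything IBM has described.

```python
from collections import deque

class CrisisCorrelator:
    """Illustrative sketch: suppress buy signals when a price plunge
    coincides with recent crisis-related news (names are hypothetical)."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.crisis_times = deque()  # timestamps of recent crisis headlines

    def ingest_news(self, timestamp, headline):
        # Stand-in for real news/video analytics: a crude keyword check
        if any(k in headline.lower() for k in ("attack", "explosion", "disaster")):
            self.crisis_times.append(timestamp)

    def on_price_plunge(self, timestamp, symbol):
        # Expire crisis events that fell out of the correlation window
        while self.crisis_times and timestamp - self.crisis_times[0] > self.window:
            self.crisis_times.popleft()
        if self.crisis_times:
            return ("CRISIS_MODE", symbol)  # plunge likely news-driven: hold off
        return ("BUY", symbol)              # routine dip: let the algo trade
```

The point of the sketch is the windowed join, not the keyword matching; in the article's telling, the news side would be fed by video and text analytics rather than a string search.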
In another example, a buy-side firm looking at a company could correlate its road show, analyst call, and fundamentals such as earnings per share with other data sources. "If the CEO is saying that orders are strong, imagine being able to correlate that with satellite imagery that tells you whether or not the parking lot is full and whether or not trucks are going to and from the distribution facility," says Pleiter. "If the parking lot is half empty, the system would recognize that this guy is trying to talk his stock up."
Pleiter acknowledges that several complex event processing products exist on the market today. "But today it's an environment where people are taking in structured data, putting it into a fixed format, and events are triggered off that stream," he says. "System S takes this four generations forward, to a highly distributed, highly scalable stream processing technology that can take in any type of structured or unstructured data without requiring reformatting and allowing anything to be done with it." For instance, video streams from CNN, Al Jazeera, and BBC News could be analyzed alongside market data feeds from Reuters, Thomson and Bloomberg as well as archived phone calls, emails, HTML pages, research reports, purchase orders, invoices, satellite images and more. System S is said to have parsers and semantic annotation to help analyze each of these streams.
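The "parsers and semantic annotation" idea is that each raw item in a stream gets tagged with the entities and facts analytics can act on. As a toy illustration (the annotation schema and the regex are assumptions, not System S internals), a text annotator might look like this:

```python
import re

# Crude stand-in for a semantic annotator: tag ticker-like uppercase tokens.
TICKER_PATTERN = re.compile(r"\b[A-Z]{1,4}\b")

def annotate(doc):
    """Attach simple semantic annotations to one raw text item in a stream."""
    annotations = [
        {"type": "TICKER", "text": m.group(), "span": m.span()}
        for m in TICKER_PATTERN.finditer(doc)
    ]
    return {"text": doc, "annotations": annotations}
```

A real system would run many such annotators per stream type (speech-to-text for calls, entity extraction for news, image analysis for video), all emitting annotations in a common format so downstream operators can correlate them.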
IBM already had many of the pieces required to do this. It's had an enterprise platform for managing structured and unstructured information together for several years, and a couple of years ago it introduced an architecture for mixing and matching various types of search and text analytics technology (this is called UIMA and works with most of the best-in-class search products). It has video parsing and searching. It has speech recognition software.
System S contains a brand-new technology layer that Pleiter describes as "an artificial intelligence-like scheduling technology. This is intelligent scheduling, looking at the information streams and steering the hardware when major changes occur, because something important must have happened," he says.
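Pleiter's description, stripped of the AI framing, is load-aware scheduling: when a stream's volume spikes far above its baseline, something is probably happening there, so compute shifts toward it. A toy sketch of that policy follows; the threshold, the spike weighting and the function name are illustrative assumptions.

```python
def rebalance(nodes_total, stream_rates, baseline_rates, spike_factor=3.0):
    """Allocate nodes roughly in proportion to stream rate, over-weighting
    streams running well above their historical baseline."""
    weights = {}
    for name, rate in stream_rates.items():
        base = baseline_rates.get(name, rate)
        # A stream far above baseline gets extra weight: "something
        # important must have happened" there
        weights[name] = rate * (spike_factor if rate > 2 * base else 1.0)
    total = sum(weights.values())
    return {name: round(nodes_total * w / total) for name, w in weights.items()}
```

For example, if a video-news stream normally trickles along but suddenly matches the market-data feed in volume, this policy steers most of the cluster toward it.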
In the Watson lab, the System S computers are connected with 20-gigabit InfiniBand, but researchers are experimenting with optical switches and optical networking, aiming for a super-fast 100-gigabit network.
The System S user environment comes in three versions. There's a simple user interface that lets users query the system the way they would a database, using predefined SQL calls. There's an intermediate interface that's similar to using Excel macros. For power users, there's an Eclipse-based development environment for writing custom applications.
The system is meant to be flexible in its use of hardware. "The conceptual picture is that the design of the software control programs is specifically intended to allow aggregation and exploitation of all kinds of hardware," says Nagui Halim, director of high performance stream processing at IBM. "My thinking was customers get big installations of hardware, they make changes, they buy specialized accelerators. We allow the segmentation of specialized functions to accelerators." So far the Watson lab is using embedded Cell processors on IBM BladeCenters running Linux, and it is testing FPGAs. System S can run on a system as small as a laptop and scale up to a 100,000-node cluster.
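Halim's "segmentation of specialized functions to accelerators" is, at its simplest, an operator-placement decision: route each operator to an accelerator that can run it, otherwise fall back to general-purpose CPUs. The capability map and device names below are assumptions made up for the sketch.

```python
# Hypothetical capability map: which operator kinds each accelerator handles.
ACCELERATOR_CAPS = {
    "cell": {"video_decode", "signal_processing"},
    "fpga": {"packet_parse", "pattern_match"},
}

def place(operator_kind, available_accelerators):
    """Pick a device for an operator: first capable accelerator, else x86."""
    for accel in available_accelerators:
        if operator_kind in ACCELERATOR_CAPS.get(accel, set()):
            return accel
    return "x86"  # general-purpose fallback
```

The appeal of this scheme is that adding a new accelerator type only means extending the capability map, not rewriting the applications.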