December 11, 2013

Many organizations have realized significant competitive advantages with complex analytics. But Big Data is stressing the limits of these techniques for both data access and analysis. When the database is distributed over multiple servers, certain types of queries become too slow. When data is too big to fit in memory on one server, complex math operations fail. Yes, some access and analytics tasks are “embarrassingly” parallel, and for these, organizations can easily sidestep the problems. But many tasks do not scale this way. Moreover, many organizations separate data management from analytics, forcing them to move massive data from one software package to another. If data is big, it shouldn't be moved.

Bill Kantor, Paradigm4

These challenges have driven several workarounds: working with subsets of data, buying expensive servers or appliances, replicating alternative views of data to support faster retrieval, or developing custom software that explicitly manages data distribution and parallel computation. Each of these has its problems, as each slows down data-driven discovery and drives up its cost. What is needed is a new software paradigm that enables analytics to just work and work fast, without having to move the data or worry about size.

The financial industry in particular experiences these challenges because it depends on complex analytics—particularly matrix math, multidimensional selects, and moving window aggregates—which most Big Data architectures cannot accommodate readily. Here’s why:

• Extract, Transform, Load (ETL) gets in the way of interactive, exploratory analytics. Analytics solutions that separate the storage engine from the analytics engine are impractical for Big Data because they force you to move your data and transform it into the analytical package’s format. ETL tools are great at lessening the pain of moving the data, but they do not address the fundamental issue: separating data management from the math slows analysts down. Interactive, exploratory, “big math” ought to be painless.

• In-memory solutions don’t scale for complex analytics. Big Data datasets exceed a single machine’s memory. Although some “embarrassingly parallel” problems decompose into multiple smaller independent problems that can be distributed across a cluster, many complex analyses needed by financial institutions don’t. Subsetting data produces less accurate models. Even if your data does fit on one machine, performance is limited by the number of cores you have. Analytics ought to scale past limitations of a machine’s memory or number of cores—up to as much computing power as you have available.

• Hadoop doesn’t do complex math well. Hadoop, SQL-on-HDFS, and databases with embedded MapReduce are challenged by complex analytics that are not embarrassingly parallel. For these problems, such architectures can require a lot of low-level coding, turning data scientists into computer scientists. Big Data chores ought to be invisible and automatic.

• Quant-friendly languages are demoted. Typical Big Data solutions don’t let quants and data scientists develop analytical solutions in the languages they prefer, such as R and Python. Analytics solutions should promote collaboration and capitalize on contemporary programming languages and analytical tools.
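To make the "embarrassingly parallel" distinction concrete, here is a small illustrative sketch in Python. It is not taken from any particular product; the function names and the sample numbers are made up for the example. A mean can be assembled exactly from per-chunk partial results, so it distributes trivially. A median cannot be assembled from per-chunk medians, so the naive split gives the wrong answer.

```python
from statistics import median


def mean_from_chunks(chunks):
    """Combine per-chunk (sum, count) pairs into the exact global mean.

    Each chunk stands in for the data held on one machine (an assumption
    made for this illustration).
    """
    partials = [(sum(c), len(c)) for c in chunks]  # independent work per chunk
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


def median_of_chunk_medians(chunks):
    """A tempting shortcut that is NOT the global median in general."""
    return median(median(c) for c in chunks)


# Hypothetical sample data, split unevenly across three "machines".
chunks = [[1, 2, 3], [4, 5, 100], [6, 7, 8, 9]]
everything = [x for c in chunks for x in c]

print("mean from chunks:   ", mean_from_chunks(chunks))
print("mean of all data:   ", sum(everything) / len(everything))
print("median of medians:  ", median_of_chunk_medians(chunks))
print("median of all data: ", median(everything))
```

The two means agree, while the two medians do not. Getting the second kind of calculation right on distributed data takes coordination between machines, which is the low-level work analysts should not have to write by hand.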

Analysts want to explore data regardless of its size, iterate rapidly to build models using complex analytic approaches and based on all available data, and deploy them. Ask your Big Data vendors if their infrastructure supports these objectives. Here are some awesome things you should be able to do with a Big Data exploratory analytics database.

1. Build the Arca NBBO for one day of all exchange-traded US equities (186 million quotes) in 80 seconds on a four-node (eight cores per node) commodity hardware cluster. Run it in about half the time on a cluster twice as large.

2. Use Principal Components Analysis (PCA) to analyze variance among asset classes and the individual securities within those asset classes. With an array database, it’s possible to run a PCA on a 50M x 50M sparse matrix with 4B non-zero elements in minutes.

3. Select data sets (based on complex criteria) in constant time—irrespective of how big your dataset gets.

4. Perform window operations in parallel on distributed data; express these operations easily without worrying about the programming details of parallelizing the work. (A simple and common example is calculating volume-weighted average price, which in many databases works only if all the data inside your window is on one machine.)
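The window example in item 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any specific database implements it: the function names are hypothetical, trades are plain (price, volume) pairs in time order, volumes are assumed positive, and each "chunk" stands in for the slice of data held on one machine. The point is that a chunk cannot compute its first few window values on its own; it needs the last `window - 1` trades from the chunk before it.

```python
from collections import deque


def rolling_vwap(trades, window):
    """Yield the VWAP over the last `window` trades, one value per trade."""
    buf = deque()
    notional = 0.0  # running sum of price * volume inside the window
    volume = 0.0    # running sum of volume inside the window
    for price, vol in trades:
        buf.append((price, vol))
        notional += price * vol
        volume += vol
        if len(buf) > window:
            old_price, old_vol = buf.popleft()
            notional -= old_price * old_vol
            volume -= old_vol
        yield notional / volume


def chunked_rolling_vwap(chunks, window):
    """Compute the same values chunk by chunk.

    Each chunk is prefixed with the trailing `window - 1` trades seen so far
    (the overlap a distributed system would have to fetch from a neighbor).
    """
    out = []
    overlap = []
    for chunk in chunks:
        padded = overlap + list(chunk)
        values = list(rolling_vwap(padded, window))
        out.extend(values[len(overlap):])  # drop values for the borrowed rows
        overlap = padded[-(window - 1):] if window > 1 else []
    return out


# Hypothetical trades: (price, volume)
trades = [(10.0, 100), (11.0, 200), (12.0, 100), (10.0, 300), (13.0, 100)]
direct = list(rolling_vwap(trades, 3))
split = chunked_rolling_vwap([trades[:2], trades[2:]], 3)
print(direct)
print(split)
```

Both lists match because the second chunk borrowed two trades from the first. Without that overlap, the windows that straddle the boundary would be computed on incomplete data, which is why this kind of bookkeeping should be handled by the database rather than by the analyst.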

Computationally intensive matrix math algorithms underpin many pricing, arbitrage and risk calculations used in computational finance. What is needed is a scalable database with native complex analytics, integrated with R and Python. With this infrastructure financial institutions can rapidly implement proprietary algorithms at Big Data scale.

—By Bill Kantor VP Sales and Marketing at Paradigm4