October 24, 2008

Successful alpha generation depends on cutting through rising noise in market data, whose volumes double every year. This is especially true in today’s market, where increased volatility creates further peaks in data levels. High frequency trading strategies rely on relationships between market variables that change frequently, spanning intraday quotes, multiple securities and multiple markets, so quantitative analysis is becoming an ever greater challenge. Access to highly granular data, where intervalized data previously sufficed, is imperative for developing the increasingly sophisticated strategies that deliver competitive advantage.

These challenges can be addressed by using best-of-breed data management and analysis tools together. Data management systems provide access to data and handle gathering, normalization and retrieval, but may not be well suited to more advanced methods such as stochastic simulation or hypothesis testing. Analysis is therefore often done externally, using tools as simple as Excel or as sophisticated as customized code, depending on the analyst’s preference and experience. Writing customized code, such as a custom DLL in C++, is time consuming and difficult. This has led the quantitative analysis community to explore new options and has contributed to an explosion of R usage over the past two years.

R is first and foremost a statistical package, in contrast to Matlab, a commercial, general-purpose analytical toolkit. R is an open source implementation of the S programming language, which is also available in the commercial package S-Plus. It is freely available under the GNU General Public License on multiple platforms and can be downloaded at http://www.r-project.org. R is very similar in function to S-Plus (the syntax is nearly identical), contains built-in tools for time series analysis, regression and other statistical methods, has graphing and visualization capabilities, and can be used as a programming environment.
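As a rough illustration of those built-in tools, the sketch below uses only base R. The data are simulated, and the names `benchmark` and `instrument` and the coefficient 0.8 are assumptions chosen for the example, not figures from any real market.

```r
set.seed(1)  # simulated data only; nothing here comes from real markets

# Hypothetical daily returns for a benchmark and a correlated instrument.
n <- 250
benchmark <- rnorm(n, mean = 0, sd = 0.01)
instrument <- 0.8 * benchmark + rnorm(n, mean = 0, sd = 0.005)

# Built-in linear regression: estimate the instrument's sensitivity (beta).
fit <- lm(instrument ~ benchmark)
print(coef(fit))

# Built-in time series tools: autocorrelation of the regression residuals.
# acf() stores lag 0 first, so element 2 is the lag-1 autocorrelation.
res_acf <- acf(residuals(fit), lag.max = 5, plot = FALSE)
print(res_acf$acf[2])
```

The estimated slope should land near the 0.8 used to generate the data; calling `plot(benchmark, instrument)` followed by `abline(fit)` would add a scatter plot with the fitted line.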

The R programming language is rich and well suited to data operations. The argument for it resembles the argument for programming in a high-level language or for using complex event processing (CEP): the infrastructure work is done for you, so practitioners can focus on what they are trying to accomplish rather than the mechanics of how to do it. Ease of access and cost also matter greatly to the investment industry in today’s cost-conscious environment. Finally, R can be extended for specific purposes, and the R community provides support through discussion forums, mailing lists and public documentation.

Given the open source nature of R and the community around it, usage continues to climb rapidly. There is a very large contributor base, and new add-on packages are being created all the time. The cost makes R extremely compelling for universities to use in Financial Engineering courses, so it is now widely used in academia, and graduates are entering the workforce with extensive R experience. When these people join firms, whether on the sell-side or the buy-side, their first instinct is to use the statistical analysis package they already know, so use of R will continue to expand in the coming years.

The following simple example merges two trade time series for different symbols and extracts the observations of the second series at the times where the spread between the two instruments exceeds 0.5:

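library(zoo)  # assumes x and y are zoo series of trade prices
xy <- na.locf(merge(x, y))         # align timestamps, carry the last price forward
xy[abs(xy$x - xy$y) > 0.5, "y"]    # y where the absolute spread exceeds 0.5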

While this may not be intuitive to everyone on first reading, it is immediately familiar to members of the R community doing time series analysis. No learning curve for proprietary languages is required!
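For readers who want to run the merge-and-filter example above, here is a minimal self-contained sketch. It assumes the series are `zoo` objects (the article does not name the package, though `merge` and `na.locf` suggest it), and the timestamps and prices are made up for illustration.

```r
library(zoo)  # assumption: x and y are zoo series; the package is not named in the text

# Hypothetical trade prices for two symbols, stamped at irregular times.
t0 <- as.POSIXct("2008-10-24 09:30:00", tz = "UTC")
x <- zoo(c(10.0, 10.2, 10.9), t0 + c(0, 2, 5))
y <- zoo(c(10.1, 10.3, 10.2), t0 + c(0, 3, 4))

# merge() takes the union of the timestamps and fills the gaps with NA;
# na.locf() then carries each symbol's last trade price forward.
xy <- na.locf(merge(x, y))

# Keep y only at the times where the absolute spread exceeds 0.5.
wide <- xy[abs(xy$x - xy$y) > 0.5, "y"]
print(wide)
```

With these made-up prices, only the final timestamp qualifies: x has moved to 10.9 while the last trade in y is still 10.2, a spread of 0.7. Both series start at the same instant here, so no leading missing values remain after the carry-forward step.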

R is not the only statistics package available, but it compares well with its commercial competitors, is easy to use and familiar to many practitioners, has a large and diverse supporting community, and neither incurs product licensing costs nor requires proprietary language expertise. Together, these qualities address the two-fold challenge of data volume and analytical complexity in today’s data analysis world. The ultimate advantage is the ability to combine R packages and the R language with a high performance data management platform, making quantitative analysis easier instead of harder.