How to Make Big Data Consumable
The central problem of the "big data" era is consumption. The sense that data is being sucked up everywhere by ubiquitous vacuum-like devices and spat out by any number of virtual fire hoses -- along with the sense that it is increasingly difficult to make sense of it all -- has never been more palpable. For data architects and managers, the main struggle (in this era or any other) is providing the right data to the right people at the right time and in the right format. In other words, these shepherds must establish a data management platform that allows end users to harvest information more quickly and easily.
When we think of data management this way, we realize that quantity need not enter the equation, at least not at first. The mission is to extract as much value from data as possible. Full stop. Whether data is "big" or not is almost irrelevant, particularly since the notion of bigness today varies significantly from business to business and most certainly will be redefined in the months and years to come. In the not-too-distant future, the ease with which our perception migrated from gigabytes to terabytes will give way to a similar familiarity with petabytes and exabytes. Does this mean that we will eventually embark on a "huge data" era? "Colossal"?
Strip away the superlatives around data's size or speed (which are largely hardware challenges), and you will find that the key to better consumption starts with improving how data is organized. If anything, this era of big data is unique not only for its unprecedented scale, but also for the diversity of data sources. As a result of technology, every consumer is now a publisher of data — hence, the explosion in data sources. Specific to capital markets, the pursuit of alpha clearly has become a multi-asset and multi-regional game, thus naturally adding to the number of potential data sources to be integrated into trading platforms. Overall, the convergence of these drivers has created new complexities in organizing data, which in turn have placed new pressures on the central tool for keeping track: reference data.
Reference Data: The Glue That Binds
Reference data is like the glue that binds disparate data together. Without it, data can take no shape. There are no graphs, dashboards, heat maps or any other smart chart candy without reference data. Prices are pointless without the ability to chronologically organize them based on timestamps. Aggregate credit exposure, in some cases, can take days or weeks to calculate without entity identifiers that facilitate accurate output in a matter of minutes (and eventually seconds).
Exploration into the unknown through rich analytics is virtually impossible without cross-referencing hitherto seemingly unrelated datasets in new ways. In short, big data is dumb data without some glue. To mix metaphors, this reference data is the key to unlocking untold treasures in your data. Making data more consumable means paying closer attention to foundational elements.
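To make the "glue" idea concrete, here is a minimal sketch of how a shared entity identifier lets trade records be joined to counterparty reference data so that aggregate credit exposure becomes a simple keyed aggregation. All identifiers, names and notional figures below are hypothetical, for illustration only.

```python
# Illustrative sketch: reference data as the glue joining otherwise
# unrelated datasets. Identifiers and figures are hypothetical.

# Trade records keyed by a counterparty entity identifier (e.g., an LEI).
trades = [
    {"entity_id": "LEI-001", "notional": 5_000_000},
    {"entity_id": "LEI-002", "notional": 2_500_000},
    {"entity_id": "LEI-001", "notional": 1_200_000},
]

# Reference data: entity identifiers mapped to legal counterparty names.
entities = {
    "LEI-001": "Alpha Bank PLC",
    "LEI-002": "Beta Capital LLC",
}

# With the glue in place, aggregate credit exposure per counterparty
# collapses to a trivial keyed sum.
exposure = {}
for trade in trades:
    name = entities[trade["entity_id"]]
    exposure[name] = exposure.get(name, 0) + trade["notional"]

print(exposure)
# {'Alpha Bank PLC': 6200000, 'Beta Capital LLC': 2500000}
```

Without the `entities` mapping, the same calculation would require reconciling counterparty names across systems by hand — the days-or-weeks scenario the article describes.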
Ironically, there is nothing decidedly big about reference data, which is why it is perhaps bizarre that it would play such a critical role in harvesting value from its cousins: time-series, audio, video and any other frequently changing data stream. Reference data is relatively static by nature -- the primary reason it rarely gets big in and of itself. In capital markets, it falls into three main categories: security, counterparty and client. That said, the full scope of reference data can go far beyond these groupings to include a comprehensive library of metadata items, or data that describes other data.
The primary challenges in managing reference data include standards and processing. Though far from complete and prone to eruptions of spirited debate, security ID standards are fairly well-established, with the exception of OTC derivatives (which are normally handled with proprietary IDs). In fact, system integrations are made more difficult by competing security ID standards — including CUSIPs, ISINs, SEDOLs and proprietary standards from data vendors such as Bloomberg and Thomson Reuters — rather than by the lack of one.
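The integration problem the competing standards create is usually solved with a cross-reference table that resolves every scheme to a single internal master identifier. The sketch below assumes a hypothetical internal ID (`SEC-0001`); the CUSIP/ISIN/SEDOL values are shown purely to illustrate the shape of such a table.

```python
# Hypothetical cross-reference table resolving competing security ID
# standards to one internal master identifier. The internal ID scheme
# ("SEC-0001") is an assumption for illustration.

xref = {
    ("CUSIP", "037833100"):    "SEC-0001",
    ("ISIN",  "US0378331005"): "SEC-0001",
    ("SEDOL", "2046251"):      "SEC-0001",
}

def resolve(scheme, identifier):
    """Map any standard or vendor ID to the internal master ID,
    or None if the identifier is unknown."""
    return xref.get((scheme, identifier))

# Two systems quoting the same security under different schemes
# now agree on a single key.
assert resolve("ISIN", "US0378331005") == resolve("CUSIP", "037833100")
```

In practice this table is the product vendors and in-house EDM teams maintain at great cost — and the reason a single counterparty analogue (the LEI, discussed next) has attracted so much regulatory attention.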
A global counterparty standard — known as the legal entity identifier (LEI) — is far less developed. But due to strong regulatory backing and the general fervor around improving risk measurement capabilities, it is developing quickly and is likely to be adopted quickly as well. Not to minimize the significant challenges that lie ahead for the development and refinement of standards, but the industry has galvanized awareness of their importance and attention is being paid accordingly.
More Art Than Science
Where reference data management, in particular, and enterprise data management (EDM), in general, fall short is in processing (which includes data governance policies). Moreover, between the technical and cultural components of processing, the bigger challenge is cultural, or management-oriented. Yes, firms that have yet to rethink their IT architectures and consolidate legacy applications are inviting existential threats to their business; but the bigger gap between where most firms are today and where they need to be in terms of data consumption is much more about art than science.
Data fluency — a critical precursor to data consumability — means that data flows more easily through the enterprise, which in turn means that end users must be able to find it. And finding data requires meticulous attention to standards, labels and other metadata — no matter how imperfect they may be now or in the future. That way, no matter how big or complex the data gets, end users will have a much better shot at harvesting value from it.
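One way to picture the findability point is a minimal metadata catalog: datasets labeled with standard tags so that end users can locate them by attribute rather than by tribal knowledge. The dataset names and tag vocabulary below are hypothetical, a sketch of the idea rather than any particular product.

```python
# Sketch of a minimal metadata catalog. Dataset names and the tag
# vocabulary (asset_class, region, type) are hypothetical.

catalog = {
    "eq_prices_us": {"asset_class": "equity", "region": "US",     "type": "time-series"},
    "fx_spot_emea": {"asset_class": "fx",     "region": "EMEA",   "type": "time-series"},
    "lei_master":   {"asset_class": "all",    "region": "global", "type": "reference"},
}

def find(**criteria):
    """Return names of datasets whose metadata matches every criterion."""
    return [name for name, meta in catalog.items()
            if all(meta.get(key) == value for key, value in criteria.items())]

print(find(type="time-series", region="US"))  # ['eq_prices_us']
print(find(type="reference"))                 # ['lei_master']
```

The labels need not be perfect — as the article notes — but without some such layer, end users are left grepping through an enterprise's datasets by hand, and fluency never materializes.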
Paul Rowady is a senior analyst with Tabb Group. He has 20 years of capital markets experience, with a background in research, risk management, trading technology, software development, hedge fund operations, derivatives and enterprise data management. [email protected]