While industries such as healthcare, e-commerce and even retail financial services are being rapidly transformed by the technology ecosystem of "big data," leading capital markets firms have been in the uncharacteristic position of lagging the innovation curve.
Capital markets technologists give many reasons for the lack of viable implementations: data sets that are not sufficiently "big," difficulty integrating another data solution into an already crowded portfolio, and even a desire to keep their activities confidential to preserve their market advantage. But the biggest reason these firms have failed to engage aggressively with big data has been limitations in the architecture of the core big data technology.
Apache Hadoop has been optimized for Map/Reduce, an offline ("batch") processing architecture that does not readily support use cases demanding real-time or near real-time results, such as Value-at-Risk (VaR) and algorithmic trading. Although other options beyond Map/Reduce are available, they are layered onto a core technology in which Hadoop closely couples resource management with data processing.
As a result, it is not possible to run multiple applications simultaneously (e.g., SQL, in-memory analytics, and Map/Reduce) with any level of control over how those applications are prioritized, and hence there is no guarantee of timely results.
Spinning The YARN
That is, until now. The introduction of Hadoop 2.0 and, in particular, the new YARN resource manager component promises to provide the capital markets with a loosely coupled architecture that will allow all of the necessary processing modes -- batch, interactive, online and streaming -- to run simultaneously on Hadoop with defined quality of service via resource allocation.
To understand how YARN works, it is important to first understand the current Hadoop architecture. The current Hadoop Map/Reduce system is composed of the JobTracker, which is the master scheduler, and TaskTrackers associated with each of the nodes (instances) of the application. The JobTracker is responsible for all resource management tasks, including managing the TaskTrackers, tracking resource consumption and availability, and job life-cycle management. The JobTracker views the cluster as composed of nodes managed by individual TaskTrackers with distinct "map" slots and "reduce" slots. These slots cannot be reassigned: a map slot cannot run a reduce task, or vice versa. It is this locked-in hierarchy that prevents the optimal execution of non-Map/Reduce workloads on Hadoop.
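The cost of fixed slots is easiest to see with numbers. The following is a minimal Python sketch, not Hadoop code: the class `TaskTrackerNode`, the function `schedule_fixed_slots`, and the slot counts are all hypothetical, chosen only to illustrate how typed slots can sit idle while work of the other type waits.

```python
from dataclasses import dataclass


@dataclass
class TaskTrackerNode:
    """Toy model of one node with fixed, typed slots (hypothetical, not a Hadoop class)."""
    map_slots: int
    reduce_slots: int


def schedule_fixed_slots(nodes, pending_map, pending_reduce):
    """Fill typed slots; a map task can never occupy a reduce slot, or vice versa."""
    total_map = sum(n.map_slots for n in nodes)
    total_reduce = sum(n.reduce_slots for n in nodes)
    running_map = min(pending_map, total_map)
    running_reduce = min(pending_reduce, total_reduce)
    return {
        "running_map": running_map,
        "running_reduce": running_reduce,
        "waiting_map": pending_map - running_map,
        "waiting_reduce": pending_reduce - running_reduce,
        "idle_slots": (total_map - running_map) + (total_reduce - running_reduce),
    }


# Assumed cluster: three nodes, each with 4 map slots and 2 reduce slots.
cluster = [TaskTrackerNode(map_slots=4, reduce_slots=2) for _ in range(3)]

# A map-heavy phase: 20 map tasks are pending and no reduce tasks are ready yet.
result = schedule_fixed_slots(cluster, pending_map=20, pending_reduce=0)
print(result)
```

In this toy run, eight map tasks wait while six reduce slots stand idle, which is the kind of waste the slot model makes unavoidable.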
YARN breaks this limitation by splitting the two major responsibilities of the JobTracker, resource management and job scheduling/monitoring, into separate functions -- a global ResourceManager and a per-application ApplicationMaster. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The ApplicationMaster negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the component tasks.
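The division of labor can be sketched in a few lines of Python. This is an illustrative model only and does not use the real YARN API: the class bodies, the `allocate` and `request` methods, the first-fit placement, the node names and the memory sizes are all assumptions made for the example.

```python
class ResourceManager:
    """Toy global arbiter: tracks free memory per node and grants generic containers."""

    def __init__(self, node_memory_mb):
        self.free_mb = dict(node_memory_mb)  # node name -> free memory in MB

    def allocate(self, app_name, container_mb, count):
        """Grant up to `count` containers of `container_mb`, first-fit over nodes."""
        granted = []
        for node in sorted(self.free_mb):
            while len(granted) < count and self.free_mb[node] >= container_mb:
                self.free_mb[node] -= container_mb
                granted.append((app_name, node, container_mb))
        return granted


class ApplicationMaster:
    """Toy per-application master: asks the ResourceManager for what its job needs."""

    def __init__(self, name, container_mb, tasks):
        self.name = name
        self.container_mb = container_mb
        self.tasks = tasks
        self.containers = []

    def request(self, rm):
        """Negotiate containers and return how many were granted."""
        self.containers = rm.allocate(self.name, self.container_mb, self.tasks)
        return len(self.containers)


# Assumed two-node cluster with 8 GB of schedulable memory per node.
rm = ResourceManager({"node1": 8192, "node2": 8192})

# Two different workload types share the same pool; there are no typed slots.
batch = ApplicationMaster("mapreduce-batch", container_mb=1024, tasks=6)
risk = ApplicationMaster("in-memory-var", container_mb=4096, tasks=3)

batch_granted = batch.request(rm)
risk_granted = risk.request(rm)
print(batch_granted, risk_granted, rm.free_mb)
```

The point of the sketch is that the arbiter hands out generic containers sized by the application, so a batch job and an in-memory analytics job can draw on the same nodes. In this naive model the second application receives only two of its three containers, which is why allocation policy matters.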
This separation of responsibilities allows the arbitration of resources among the competing applications and provides the flexibility necessary for more optimal resource management and service level guarantees. This in turn should allow more business-driven capital markets use cases to be realized in the big data paradigm.
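One way such guarantees can be expressed is as capacity shares per workload queue, an idea similar in spirit to the queue capacities of YARN's CapacityScheduler. The Python sketch below is a simplified illustration under assumed numbers, not the scheduler's actual algorithm: the function `allocate_with_guarantees`, the queue names and the percentages are hypothetical.

```python
def allocate_with_guarantees(total_mb, capacity_pct, demand_mb):
    """Give each queue min(demand, guaranteed share), then lend leftover capacity
    to queues with unmet demand, in the order the queues are listed."""
    if sum(capacity_pct.values()) != 100:
        raise ValueError("queue capacities must sum to 100 percent")

    # First pass: every queue is served up to its guaranteed share.
    allocation = {}
    for queue, pct in capacity_pct.items():
        guarantee = total_mb * pct // 100
        allocation[queue] = min(demand_mb.get(queue, 0), guarantee)

    # Second pass: capacity a queue did not use is lent to queues that want more.
    leftover = total_mb - sum(allocation.values())
    for queue in capacity_pct:
        unmet = demand_mb.get(queue, 0) - allocation[queue]
        extra = min(unmet, leftover)
        allocation[queue] += extra
        leftover -= extra
    return allocation


# Assumed shares: risk calculations are guaranteed half the cluster.
shares = {"risk": 50, "trading": 30, "batch": 20}
demand = {"risk": 30000, "trading": 30000, "batch": 90000}

allocation = allocate_with_guarantees(100000, shares, demand)
print(allocation)
```

Here the batch queue borrows the capacity the risk queue is not using, yet it cannot push the risk or trading workloads below their guaranteed shares. A production scheduler would also need to reclaim lent capacity when demand in the lending queue rises, which this sketch does not model.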
Whether or not Hadoop 2.0 proves to be the ultimate "dragon slayer" for the big data roadblocks facing capital markets, it will surely be a major breakthrough in eliminating some of the most critical obstacles to success.
About The Author: Jennifer L. Costley, Ph.D. is a scientifically-trained technologist with broad multidisciplinary experience in enterprise architecture, software development, line management and infrastructure operations, primarily (although not exclusively) in capital markets. She is also a non-profit board leader recognized for talent in building strong governance and process. Her current focus is in helping companies, organizations and individuals with opportunities related to data, analysis and sustainability. She can be reached at www.ashokanadvisors.com.