Wall Street & Technology recently held its first Market Data Management Forum, a half-day event focusing on the increasingly difficult task of managing vast amounts of data in financial-services firms. The forum, sponsored by Iverson Financial Systems, attracted over 100 attendees from the market-data community. Panelists gathered to discuss and share their experiences regarding some of the major issues they face when dealing with market data.
The panelists included: John Matero, vice president, Goldman Sachs Asset Management, where he oversees data management within the Pace Group, a team that provides data-quantitative research and portfolio analysis for the asset-management group; Andrew Madoff, head of Nasdaq trading at Bernard L. Madoff Investment Securities, where he oversees market data for the group's trading activities; and Mike Atkin, vice president and director of the Financial Information Services Division, Software and Information Industry Association. Wall Street & Technology's Editor-in-Chief Kerry Massaro and Senior Associate Editor Cristina McEachern moderated the discussion. The following is a brief sample of the questions and answers covered during the event. The full transcript of the day's discussion can be found at www.wallstreetandtech.com/story/WST20020626S0001.
QUESTION: Managing market data internally is a large task, as firms can take in up to 30 different vendor feeds, often in different formats, which then have to be normalized to feed back out to applications. The information must also be scrubbed and checked for gaps or errors. John - approximately how many different vendor feeds does your group take in?
JOHN MATERO: We're involved in mostly decision support - not operations or trading - so all my comments are coming from a decision-support mindset. We're primarily focused on equity research and portfolio analysis, and the number of data vendors we use right now is in the neighborhood of 15. These vendors provide everything from market data, corporate actions and index data to earnings forecasts, balance-sheet data and income-statement-type data.
QUESTION: So how do you ensure the integrity of the data?
MATERO: The group is pretty small, and pretty much the entire group has been doing this type of work for quite some time. Regardless of their focus, everyone in the group is responsible for some part of data integrity. So the entire team participates. Every morning or night we're all involved in data integrity. Another aspect of how we make it all work is the fact that we're bringing in so many feeds. Oftentimes one vendor will provide not only market data but also index data or fundamental data, so we're getting the same data twice. It helps us triangulate on what the correct price is, for example.
Another aspect is - and this is kind of what I evangelize within our group and within our division - our data model. The data model that we use is very important and, though it's probably not obvious as I'm speaking to maintaining data integrity, it allows us to keep the same data from multiple vendors. Based upon our Web-based user interface, it allows us to write new diagnostics rather quickly and enhance current ones to manage the data efficiently. So it really boils down to human intervention, data modeling and multiple vendors. That's kind of how we manage our data integrity.
QUESTION: By data model you mean the mapping of the meaning of information and taxonomy?
MATERO: Yes, exactly. For example, it's nothing really new in financial-market-data circles, but we've come up with a time-invariant, unique identifier for any security that we maintain. We have multiple sources of data that might have their own identifier, such as the Reuters RIC code, and we map all of these various things into our data model. These other identifiers allow us to integrate data into our application from Reuters, from Bloomberg, from ... you name the vendor, and we could identify and map it. We store all these guys' data separately and we create a magical so-called golden copy.
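The mapping Matero describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the firm's actual data model: the `SecurityMaster` class, its method names, the vendor labels and the sample prices are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class SecurityMaster:
    """Maps vendor-specific identifiers to one internal, time-invariant ID (hypothetical)."""
    _next_id: count = field(default_factory=lambda: count(1))
    _by_vendor_id: dict = field(default_factory=dict)  # (vendor, vendor_id) -> internal ID

    def register(self, aliases):
        """Create a new internal ID and attach every known vendor identifier to it."""
        internal_id = next(self._next_id)
        for vendor, vendor_id in aliases.items():
            self._by_vendor_id[(vendor, vendor_id)] = internal_id
        return internal_id

    def add_alias(self, internal_id, vendor, vendor_id):
        """Attach a further vendor identifier; the internal ID itself never changes."""
        self._by_vendor_id[(vendor, vendor_id)] = internal_id

    def resolve(self, vendor, vendor_id):
        """Return the internal ID for a vendor identifier, or None if it is unmapped."""
        return self._by_vendor_id.get((vendor, vendor_id))


master = SecurityMaster()
ibm = master.register({"reuters": "IBM.N", "bloomberg": "IBM US Equity"})

# Each vendor's records stay in their own store, keyed by the shared internal ID.
# The prices are made-up sample values.
vendor_store = {"reuters": {}, "bloomberg": {}}
vendor_store["reuters"][master.resolve("reuters", "IBM.N")] = {"close": 71.50}
vendor_store["bloomberg"][master.resolve("bloomberg", "IBM US Equity")] = {"close": 71.50}
```

Because every vendor record is keyed by the same internal ID, the same data item from two vendors can be compared directly without translating between identifier schemes at comparison time.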
QUESTION: What exactly is a "golden copy"? How does it work and how is it implemented at your firm?
MATERO: Essentially what we do is we pick a primary vendor for a particular data item: let's say market data (price, volume, shares outstanding) versus corporate actions (CUSIP changes, splits, name changes, things like that) versus fundamental data. So we choose a primary vendor, and this choice is based on some research. And we use all the other vendors of the same data as secondary, as a way to triangulate on the correct answer as best we can. We pick a primary vendor, use other vendors to triangulate, and we have human beings interact with a suite of diagnostics to check data integrity. The diagnostics are applied to the primary vendor, as well as to the other vendors.
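A minimal sketch of the primary-plus-secondary approach follows. It assumes the golden value is simply the primary vendor's value, flagged for human review whenever a secondary vendor disagrees beyond a tolerance. The function name, the vendor labels and the 0.1 percent tolerance are hypothetical choices for illustration, not details from the panel.

```python
def golden_price(quotes, primary, rel_tol=0.001):
    """Return (value, needs_review) for one security on one date.

    quotes: dict of vendor name -> price (None if the vendor supplied no value).
    primary: name of the vendor chosen as primary for this data item.
    rel_tol: relative difference above which a secondary vendor counts as disagreeing.
    """
    primary_value = quotes.get(primary)
    if primary_value is None:
        # Nothing from the primary vendor: always route to a person.
        return None, True

    # Secondary vendors that actually supplied a value.
    others = [v for name, v in quotes.items() if name != primary and v is not None]

    # Any secondary outside the tolerance band triggers a review flag.
    disagreements = [
        v for v in others if abs(v - primary_value) > rel_tol * abs(primary_value)
    ]
    return primary_value, bool(disagreements)


# Made-up sample values: vendor_c differs by 0.05, inside the 0.1 percent band.
value, needs_review = golden_price(
    {"vendor_a": 100.0, "vendor_b": 100.0, "vendor_c": 100.05}, primary="vendor_a"
)
```

The sketch deliberately does not pick a winner automatically when vendors disagree; it only raises a flag, which matches the description of people working through diagnostics rather than a fully automated rule.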
QUESTION: Andrew, how many feeds does your firm take in and how do you ensure data integrity?
ANDREW MADOFF: I would agree with John's number of probably about 15 different vendors. We take in all of the big market vendors, ILX, Bridge, Bloomberg, direct feeds from Nasdaq, New York and all of the ECNs that deliver last-sale data and order books. We keep it all normalized and store it, and we use a lot of the same techniques that John mentioned. Triangulation between multiple sources giving you overlapping pieces of the same information is really the key to everything. And we do that not just in terms of storing the data and comparing it post-delivery; trader screens are also highlighted any time there's any sort of inconsistency. So almost every single piece of data that's displayed on a trader screen is coming from two different sources at the same point. And any time the two pieces of information differ there will be some sort of a visual cue to the trader to indicate that something is wrong. So if we've got the inside market in a Nasdaq stock coming directly from Nasdaq Level I and then the same inside quote coming from ILX, any time the two differ they turn red and flash, so a human market maker who's making a decision based on that quote knows that one of them may be stale and then has to use judgment as to which one is the accurate quote. And then in all of the mechanisms where we store the data we apply the same sort of ideas, so that we generate reports that flag inconsistencies in the data, and then we do our scrubbing based on that.
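The two-source check behind the flashing quote can be sketched roughly as below. The `Quote` type, the function names, the text-based "ALERT" style and the sample prices are assumptions made for illustration; a real trading screen would drive a display attribute rather than return a string.

```python
from typing import NamedTuple


class Quote(NamedTuple):
    """Inside market for one symbol as reported by one source."""
    bid: float
    ask: float


def quotes_disagree(direct: Quote, vendor: Quote) -> bool:
    """True when the two sources report a different inside market."""
    return direct != vendor


def render_cell(symbol: str, direct: Quote, vendor: Quote) -> str:
    """Format a screen cell, marking it ALERT when the sources differ."""
    style = "ALERT" if quotes_disagree(direct, vendor) else "normal"
    return f"{symbol} {direct.bid:.2f} x {direct.ask:.2f} [{style}]"


# Made-up sample quotes: the vendor copy has a different bid, so it may be stale.
cell = render_cell("XYZ", Quote(20.10, 20.12), Quote(20.09, 20.12))
```

The check only tells the trader that the sources differ, not which one is right; as described above, that judgment stays with the market maker.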
We have order data coming in from probably 300 different firms and there you've really just got one source, so if we've got broker/dealers from all across the globe sending us order data, we have specific interfaces to them, that data gets stored in its raw form. It also gets stored in the normalized formats that we take into our trading system so that when we have to generate either compliance or regulatory reports of any kind we've got the order data which is critical to our business and then the market data to cross reference it against.
QUESTION: So it sounds like there's quite a bit of manual intervention as well?
MADOFF: There's really no getting around that. You can try to automate to the extent that it's possible, but a lot of the data is dirty, especially when you get away from the big vendors. ILX, Reuters, Bridge: their data is pretty good. The exchange data is okay. ECN data is terrible, and we spend a lot of time and energy trying to fill gaps when the vendors themselves and the ECNs have no standardized mechanisms for doing that. With some of them, if you've got a gap at 2:30 in the afternoon, they'll replay the whole day for you, and if that's a vendor that's sending you six or eight million messages a day, it's sort of a disaster. But they don't consider data delivery to be one of their primary businesses, in spite of the fact that proprietary traders or quantitative traders love the richness of the information that's coming in. They want it stored and they want to be able to use it, which requires us to manually intervene to either fill gaps or erase blocks of data that are just clearly erroneous.
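One way to locate gaps before asking for a replay is sketched below. It assumes each feed message carries a monotonically increasing sequence number, which the panel does not say; the function name and the sample numbers are likewise hypothetical.

```python
def find_gaps(sequence_numbers):
    """Return a list of (first_missing, last_missing) ranges in a feed.

    Assumes every message has an integer sequence number that increases
    by one per message; duplicates and out-of-order arrival are tolerated.
    """
    gaps = []
    ordered = sorted(set(sequence_numbers))
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            # Everything strictly between prev and cur never arrived.
            gaps.append((prev + 1, cur - 1))
    return gaps


# Messages 4-6 and 9 are missing from this made-up sample.
missing = find_gaps([1, 2, 3, 7, 8, 10])
```

Knowing the exact missing ranges makes it possible to request only those messages where a source supports it, or to discard the rest of a full-day replay where it does not.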
QUESTION: How about this idea of golden copy? Does your firm create a standard duplicate as well?
MADOFF: I guess we do. We don't refer to it as the golden copy, but we do have the ultimate piece. However we chose to get to what we felt was the ultimately accurate piece of information, we do store that separately from the raw data. What would be the golden copy is stored in the most readily available format so that we've got stuff that's on tape that can be restored over the course of days or we've got stuff that's either in memory databases or just more or less real-time accessible.