August 07, 2010

If you want to understand the challenges of the Big Data era, hang around Catalina Marketing, a global marketing firm that works with a who's who of consumer packaged goods companies and retailers.

Catalina's data warehousing environment shot past the petabyte mark seven years ago and today stands at 2.5 PB. Its single largest database contains three years' worth of purchase history for 195 million U.S. customer loyalty program members at supermarkets, pharmacies, and other retailers. With 600 billion rows of data in a single table, it's the largest loyalty database in the world, Catalina maintains.

At the cash registers of Catalina's retail customers, real-time analysis of that data triggers printouts of coupons that shoppers are handed with their receipt at checkout. Each coupon is unique--two shoppers checking out one after the other, with identical items in their carts, will get different coupons based on their buying histories, combined with third-party demographic data.

Few companies operate at Catalina's scale, but almost every company is living in its own version of the Big Data era. Two forces define this era: size and speed. And those forces are driving companies to consider new choices for how they deal with data.

Size is relative--by some estimates, 90% of data warehouses hold less than 5 TB. But it's the pace of growth that has companies rethinking their options. Nearly half (46%) of organizations surveyed last year by the Data Warehousing Institute said they'll replace their primary data warehousing platform by 2012.

Speed is sometimes about pure performance, as in how quickly a system answers a query, but more important is the broad notion of "speed to insight." That's about how much time people--often statistician-analyst-type people--must spend loading data and tuning for performance. The pressure is on IT to get insights out of ever-larger data sets--faster.

This Big Data era got rolling way back in the dot-com days. Since then, a number of alternatives have emerged to challenge the conventional relational databases from Oracle, IBM, and Microsoft. Those options fall into two camps: systems supporting massively parallel processing (MPP), and those harnessing column-store databases.

InformationWeek: Aug. 9, 2010 Issue
ABOUT THE AUTHOR
Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in ...