atmos Blog


Big Data Is Really About Small Data

  
  
  

1/30/12

"Big data" is deceptively self-descriptive. Big data is not simply about storing large quantities of data. It is about what you do with large quantities of data and how you manage them. That's a subtle but important distinction. Let me explain.

Big data sets are difficult to manage and understand because the data is usually stored raw and unfiltered.  The process of sifting through these data sets usually produces much smaller data sets that serve as summaries that are easier to consume.

Big data is about making meaning out of large, unwieldy data sets. It's about the insight made possible by fundamental changes in the technology used to store and manage content, as well as by dramatically reduced costs.

Big data places a bet against the high cost of compute and storage resources because critical insight across disparate data isn't practical until the costs of storing and managing data approach zero. 

One of the most common big data use cases is related to measuring website page views by collecting raw web server log files and processing them to aggregate the log data into more meaningful results.  The input is a "big data" set of log files and the output is a "small data" set that consists of a summary.  The output in this case is the aggregated set of results that describes the actual number of page views, unique visitors, etc. That's the classic Hadoop use case.
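To make the page view example concrete, here is a minimal sketch in Python of the "big data in, small data out" step. It is not a Hadoop job. It runs in a single process and assumes the logs are in Common Log Format, that only successful GET requests count as page views, and that a unique visitor can be approximated by client address. The names LOG_LINE and summarize are made up for this illustration.

```python
import re
from collections import Counter

# Assumption: each line follows Common Log Format, for example
# 10.0.0.1 - - [30/Jan/2012:10:00:00 -0500] "GET /index.html HTTP/1.1" 200 1043
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+'
)


def summarize(lines):
    """Reduce raw log lines to a small summary of views and visitors."""
    page_views = Counter()
    visitors = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like log entries
        # Assumption: only successful GET requests count as page views.
        if match.group("method") != "GET" or match.group("status") != "200":
            continue
        page_views[match.group("path")] += 1
        # Assumption: one client address stands in for one visitor.
        visitors.add(match.group("host"))
    return {
        "page_views": dict(page_views),
        "total_views": sum(page_views.values()),
        "unique_visitors": len(visitors),
    }


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [30/Jan/2012:10:00:00 -0500] "GET /index.html HTTP/1.1" 200 1043',
        '10.0.0.2 - - [30/Jan/2012:10:00:05 -0500] "GET /index.html HTTP/1.1" 200 1043',
        '10.0.0.1 - - [30/Jan/2012:10:01:00 -0500] "GET /about.html HTTP/1.1" 200 512',
    ]
    print(summarize(sample))
```

However many log lines go in, what comes out is a handful of counters. A Hadoop job would spread the same counting across many machines, but the shape of the result is the same.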

Big data for the masses wasn't possible when the storage and compute costs required to gain insight were exorbitantly high. As those costs came down, the software was maturing and becoming more broadly accessible. More importantly, it was developing around the notion of squeezing efficiencies out of the now lower-cost hardware. It was a perfect storm of sorts, leading to the ability to do more with less.

So, yes:  big data is really about small data.  And the costs of storing and managing big data finally make it feasible to create small data. 

 

About Atmos Online
Atmosonline.com lets us share deep insight about EMC Atmos. We will cover application development, high scale architectures, and other topics around the design and use of cloud storage, with as many actual real world scenarios as possible. Atmosonline.com is also a portal to our Atmos Online storage as a service test and dev environment.

Disclaimer: "The opinions expressed in our blog are the personal opinions of the authors. Content published here is not read or approved in advance by EMC and does not necessarily reflect the views and opinions of EMC nor does it constitute any official communication of EMC."