There’s quite a lot of excitement around data science these days, with its reputation for being remunerative and future-oriented. But it’s often confused with related terms, like ‘big data’.
Both of these concepts are notoriously difficult to pin down, but we’re going to do our best to provide some clarity on the topic.
What Is Data Science?
As the image suggests, data science is defined more as the intersection of a number of other fields than as a stand-alone discipline. Most agree that it involves applying statistics and mathematics to problems in specific domains while keeping some of the insights from software engineering best practices in mind. It’s equally valid to conceptualize it as being like statistics with more coding or coding with more statistics.
Domain knowledge is extremely important, however. The kinds of data, models, techniques, and results you can expect vary widely depending on the field you’re in. You won’t be doing the same things in a startup looking to revolutionize advertising as you will be in a startup in the cryptoasset space. In my experience as a new data scientist, approximately 101% of my waking hours are spent learning the absurdly labyrinthine internals of bitcoin, the blockchain, and related technologies. (If you’re wondering how I spend more than 100% of my waking hours thinking about this stuff, it’s because I also dream about it.)
What Is Big Data?
Big data is also quite difficult to define, but at Galvanize we used the following informal definition: if you have more data than can fit on your local machine, you’re probably working with big data.
Another fairly common rule is that big data starts at 1 terabyte and goes up from there.
Of course, tying the definition to hardware makes ‘big’ data a moving target: if tomorrow’s desktops come equipped with 10-terabyte hard drives, then the threshold for big data will move up to that level.
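To make the two rules of thumb concrete, here is a minimal Python sketch. It is only an illustration: the function names, the decimal definition of a terabyte, and the example dataset size are all assumptions of mine, and I’m using total disk size as a rough stand-in for “what fits on your local machine.”

```python
import shutil

TERABYTE = 10**12  # decimal terabyte in bytes (an assumption; some people use 2**40)


def exceeds_local_capacity(dataset_bytes, local_capacity_bytes):
    """Informal rule: the data is 'big' if it does not fit on the local machine."""
    return dataset_bytes > local_capacity_bytes


def exceeds_fixed_threshold(dataset_bytes, threshold_bytes=TERABYTE):
    """Fixed rule: the data is 'big' once it reaches the threshold (1 TB by default)."""
    return dataset_bytes >= threshold_bytes


if __name__ == "__main__":
    # Total size of the disk that holds the current directory.
    local_capacity = shutil.disk_usage(".").total
    dataset_size = 3 * TERABYTE  # hypothetical dataset size, for illustration only
    print(exceeds_local_capacity(dataset_size, local_capacity))
    print(exceeds_fixed_threshold(dataset_size))
```

Notice that the first function gives a different answer on different machines for the same dataset, which is exactly the moving-target problem. The second stays put until you decide to raise `threshold_bytes`.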
But leaving aside the ambiguity and semantic quibbles, big data has become such an important part of the modern data science landscape that a whole suite of new tools has been developed specifically to deal with it, including everything from Spark to cloud computing. Some of my favorite Galvanize classes were oriented toward these topics, as I think they’re going to become an ever larger portion of the data scientist’s workload.
Hopefully that clears things up a bit.
Drop us a comment below with your own thoughts on how to define data science and distinguish it from big data!