Data science, machine learning, and AI are three of the most attractive fields today. Between them, they account for a sizeable fraction of new breakthroughs, powering innovations like robotic surgeons, chatbot virtual assistants, and self-driving cars, and utterly dominating humans at strategy games like Go.
You can’t skim the headlines any more without seeing some team curing a disease or deciphering a 2,500-year-old language with a new algorithm. But excitement rarely breeds clarity, and over time a good deal of confusion has sprung up about the similarities and differences between data science, machine learning, and AI.
In order to better guide you on your path to a new career, we decided to research these fields in an attempt to set the record straight.
What Is Data Science?
While the term ‘data science’ is relatively new, the key ideas driving the field are fairly old. People have been gathering data as a means of better understanding the world for a long time.
And when you get down to it, that’s basically what data science is. Sitting at the intersection of several different fields, data science aims to derive actionable insights from data.
A lot of the data science toolkit is based in probability and statistics. According to one definition, probability is a branch of theoretical mathematics that deals with predicting the likelihood of future events, while statistics is a branch of applied math that tries to understand and quantify past events.
A good data scientist needs to understand both. They need to be able to understand which distribution best describes a dataset and what that means for understanding new data points. They need to be able to use statistics to quantify just how likely or unlikely an outcome is.
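To make that concrete, here is a minimal sketch using only Python's standard library. It assumes a normal distribution is a reasonable description of the data, which you would want to check on a real dataset. The `daily_orders` numbers and the `tail_probability` helper are invented for illustration.

```python
from statistics import NormalDist

# Hypothetical sample: daily order counts (made-up numbers for illustration).
daily_orders = [98, 102, 95, 110, 105, 99, 101, 97, 104, 100]

# "Which distribution describes the data?" Here we simply assume a normal
# distribution and estimate its mean and standard deviation from the sample.
fitted = NormalDist.from_samples(daily_orders)


def tail_probability(dist: NormalDist, value: float) -> float:
    """Probability of landing at least this far from the mean, in either direction."""
    distance = abs(value - dist.mean)
    return 2 * (1 - dist.cdf(dist.mean + distance))


# "How likely or unlikely is this outcome?" under the fitted distribution.
new_observation = 125
p = tail_probability(fitted, new_observation)
print(f"mean={fitted.mean:.1f}, stdev={fitted.stdev:.1f}, tail probability={p:.2g}")
```

A very small tail probability suggests the new point is unusual relative to the data seen so far, or that the normal assumption itself is a poor fit.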
Data scientists are also often responsible for communicating the results of their investigations. This can be accomplished with written descriptions of the tests performed on data, and it can also require building out charts and visualizations which make the core insights more accessible to people without extensive training in the field.
Though they’re often lumped together, data science, data engineering, and data management are not the same thing. Data scientists, therefore, need to be able to work closely with database management and data engineering teams to figure out exactly what data are needed for a project and how to format them correctly.
In my job as a data scientist, figuring out what we need, where to get it, and how to format it is a pretty large fraction of the work.
A good data scientist also needs to have a general understanding of programming best practices. Being able to organize and maintain code, work with APIs, and build software that is tested, reusable, and concise is a major contribution to a data-based organization.
It’s hard to overstate how important the non-math parts of data science can be. Being a statistics prodigy doesn’t do much good if you can barely code and can’t communicate your requirements or results.
What Is Machine Learning?
Machine learning is the art and science of getting machines to learn from data. Machine learning work can range from simple algorithms like linear regression to spectacularly complicated structures like long short-term memory networks, depending on the job. A good machine learning engineer will at least understand the basics of the major approaches and when they are best used.
In my experience, a lot of machine learning boils down to understanding the kind of problem you’re trying to solve, choosing the best potential solutions, and evaluating the results. Outside of a PhD program, it’s pretty rare for a machine learning engineer to build their own algorithms. Instead, they tend to spend their time adapting existing tools to their specific application.
In recent months, for example, I’ve done a lot of analysis of time series data. Time series data are ordered in time, and they have a number of unique properties that have to be taken into consideration before using them to build predictive models.
So, one of the first things I had to do was check whether the data had the properties, like stationarity, that many time series models assume. If they did, I could use them to build models and try to forecast future trends.
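As a rough illustration of what checking for stationarity can mean, the sketch below compares the mean and variance of the first and second halves of a series, before and after differencing. This is only an eyeball heuristic, not a formal statistical test, and it is not necessarily the check I used in practice. The `trending` series and both helper functions are made up for the example.

```python
from statistics import mean, pvariance


def split_half_summary(series):
    """Return (mean, variance) for the first half and for the second half."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return (mean(first), pvariance(first)), (mean(second), pvariance(second))


def difference(series):
    """First difference: the change from each point to the next."""
    return [b - a for a, b in zip(series, series[1:])]


# Hypothetical series: a steady upward trend plus a small repeating wobble.
trending = [2 * t + (t % 3) for t in range(24)]

(raw_m1, raw_v1), (raw_m2, raw_v2) = split_half_summary(trending)
(diff_m1, diff_v1), (diff_m2, diff_v2) = split_half_summary(difference(trending))

# A stationary series should have a similar mean and variance in both halves.
print(f"raw means:         {raw_m1:.2f} vs {raw_m2:.2f}")
print(f"differenced means: {diff_m1:.2f} vs {diff_m2:.2f}")
```

The raw series drifts upward, so its two halves have very different means. After differencing, the halves look much more alike, which is one common reason to model the changes rather than the levels.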
But there are lots of ways to do this, and here’s where a knowledge of machine learning comes in. One thing I could do is build a univariate (‘one-variable’) model which predicts what the next day, week, month, or year of the time series data will look like. To do this well, I would need to draw on my machine learning training to understand how various models work, what they mean, and whether they’re appropriate for this task.
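One of the simplest univariate models is a first-order autoregression, which predicts each value from the one before it. The sketch below fits one by ordinary least squares. It is a toy under stated assumptions: `fit_ar1` and `forecast` are hypothetical helper names, and the `history` series is synthetic and noiseless so the fit is easy to verify.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = intercept + slope * y[t-1]."""
    x, y = series[:-1], series[1:]
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def forecast(series, steps, intercept, slope):
    """Roll the fitted model forward, feeding each prediction back in."""
    predictions, last = [], series[-1]
    for _ in range(steps):
        last = intercept + slope * last
        predictions.append(last)
    return predictions


# Synthetic series generated by y[t] = 10 + 0.5 * y[t-1], starting at 4.
history = [4.0]
for _ in range(9):
    history.append(10 + 0.5 * history[-1])

intercept, slope = fit_ar1(history)
next_three = forecast(history, 3, intercept, slope)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
print("next three values:", [round(v, 3) for v in next_three])
```

Because the toy data contain no noise, the fit recovers the generating rule almost exactly. Real data would not be this tidy, and deciding whether a model this simple is appropriate is exactly where the machine learning training comes in.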
Or I could build a multivariate (‘multiple-variable’) model which uses data on two or more variables to understand their relationship. This is done in the hopes that we can use one of the variables (like the number of times a stock is mentioned on Twitter) to predict another (like the way that stock’s price will change).
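Before building a full multivariate model, a quick first look is to ask whether one variable leads another. The sketch below computes a lagged Pearson correlation between two series. The `mentions` and `price_change` numbers are invented, and `price_change` is deliberately constructed to echo the previous day's mentions, so the strong result here says nothing about real markets.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    var_y = sum((b - y_bar) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def lagged_correlation(predictor, target, lag):
    """Correlate predictor[t - lag] with target[t]."""
    if lag == 0:
        return pearson(predictor, target)
    return pearson(predictor[:-lag], target[lag:])


# Hypothetical daily counts of how often a stock is mentioned online.
mentions = [5, 9, 4, 12, 7, 15, 6, 10]
# Made-up price changes, built so each day echoes the previous day's mentions.
price_change = [0.0] + [0.1 * m - 0.8 for m in mentions[:-1]]

same_day = lagged_correlation(mentions, price_change, lag=0)
one_day_lag = lagged_correlation(mentions, price_change, lag=1)
print(f"same-day correlation: {same_day:.2f}")
print(f"one-day-lag correlation: {one_day_lag:.2f}")
```

A lagged correlation that is much stronger than the same-day one hints that the first variable might help predict the second, though correlation alone does not establish that the relationship will hold on new data.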
All of this requires that I be able to use many of the same tools in data science to construct and evaluate models. How can I tell when one model is better than another? How can I tune a model to get it to perform better? The answer usually lies in being able to read and interpret the right statistical metrics.
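For forecasting models, two of the most common metrics are mean absolute error (MAE) and root mean squared error (RMSE), computed on data the model did not see during training. Here is a minimal sketch. The `actual` values and the two sets of predictions are made up, and `model_a` and `model_b` stand in for whatever models you are comparing.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error: the average size of the misses."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: like MAE, but penalizes large misses more."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical held-out values and two competing sets of predictions.
actual = [102, 98, 105, 110]
model_a = [100, 100, 104, 108]
model_b = [101, 99, 109, 103]

scores = {
    "model_a": (mae(actual, model_a), rmse(actual, model_a)),
    "model_b": (mae(actual, model_b), rmse(actual, model_b)),
}
for name, (m, r) in scores.items():
    print(f"{name}: MAE={m:.2f}, RMSE={r:.2f}")

# Pick the model with the lowest RMSE on the held-out data.
best = min(scores, key=lambda name: scores[name][1])
print("lower RMSE:", best)
```

Which metric to trust depends on the application: RMSE is the better guide when occasional large errors are costly, while MAE is easier to explain to a non-technical audience.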
So, to oversimplify a bit, we can say that both data science and machine learning rely on mathematics to do useful things with data. They differ mostly in how they approach this task and what their day-to-day responsibilities are.
This is true even when using something really complicated, like a neural network. When I’ve built neural networks in the past, one part of the task required constructing the architecture with a framework like TensorFlow or Keras. The other involved interpreting the statistics that tell me how good or bad the model is at predicting new data points.
This isn’t so different from the data scientist’s job of using statistical tests to interpret experiment results.
What Is AI?
Of the three terms, AI is probably the hardest to define. There are those who take AI to refer specifically to human-level machine intelligence or to algorithms that work in a way similar to human thought.
We can get a little clarity by making an important distinction between narrow and general AI. A narrow AI is one that excels in a specific domain, like categorizing images. A general AI is one that is able to perform well across many different domains.
The question of whether or not AIs should resemble humans is as old as the field itself. On the one hand, humans are the smartest general intelligences we know, so it makes sense that we would try to model smart machines after them. On the other hand, critics rightly ask whether engineers should try to make aircraft that fly like birds or submarines that swim like fish.
Personally, I tend to take the view that it doesn’t really matter how a machine achieves intelligence. It still counts as AI even if its software is completely different from human intelligence.
So How Are Data Science, Machine Learning, and AI Different?
With the above context, I think we’re prepared to give a general answer to this question.
All three fields tend to draw on the same mathematical foundations, but they’re not the same. AI is an umbrella term for many different approaches to making smart machines. Machine learning is one particular, statistics-based way of doing this. It’s distinct from other approaches, which include trying to evolve an algorithm with random mutations, building an intelligence from the bottom up with one of the major logical systems, and other strategies.
Data science is a field whose practitioners use data to better understand and predict things. They may or may not use tools from machine learning to do this, but data science work and machine learning work can look pretty similar day-to-day.