Everyone is so excited about big data and data science because of the enormous potential data has to transform, for the better, the way business, education, healthcare, and many other fields operate. Coaxing insights out of huge volumes of information, however, is a well-paying skill that only a small number of people will be able to devote themselves to acquiring.
In my experience, data visualization is part of every stage of the data science workflow, from start to finish. Data scientists therefore need to master the art of data visualization. With the appropriate charts and graphs, a data scientist can understand the structure of the data they have, identify potential issues, craft analytical strategies, and summarize their results for others.
These projects will set you on the path to learning this subtle and important craft.
The Importance of Data Visualization
As noted in the introduction, data visualization runs through every stage of data science.
Building a good visual representation of data is important not just for the non-technical consumers of a data scientist’s work, but for the data scientist herself.
One of the first things a data scientist does with new data is perform ‘exploratory data analysis’, or EDA. This involves familiarizing yourself with the overall patterns and characteristics of a dataset. One of the best ways to approach this is to put together some quick charts in matplotlib or Altair and just see what’s there.
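To make that concrete, here is a minimal sketch of a first look at a dataset using pandas and matplotlib. The data is synthetic, and the column names (height_cm, weight_kg, group) and the quick_look helper are made up for illustration; swap in your own DataFrame and columns.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for interactive windows
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data with hypothetical columns, not a real dataset.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "height_cm": rng.normal(170, 10, size=200),
    "group": rng.choice(["a", "b"], size=200),
})
df["weight_kg"] = 0.9 * df["height_cm"] - 85 + rng.normal(0, 8, size=200)


def quick_look(frame):
    """Print summary tables and return a figure with three quick charts."""
    print(frame.describe())    # ranges and spread of the numeric columns
    print(frame.isna().sum())  # count of missing values per column

    fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

    # Histogram: what does the distribution of one variable look like?
    axes[0].hist(frame["height_cm"], bins=20)
    axes[0].set_title("Distribution of height")
    axes[0].set_xlabel("Height (cm)")
    axes[0].set_ylabel("Count")

    # Scatter plot: how do two variables relate to each other?
    axes[1].scatter(frame["height_cm"], frame["weight_kg"], s=10)
    axes[1].set_title("Weight vs. height")
    axes[1].set_xlabel("Height (cm)")
    axes[1].set_ylabel("Weight (kg)")

    # Box plot: does a numeric variable differ between categories?
    groups = {name: g["weight_kg"].to_numpy() for name, g in frame.groupby("group")}
    axes[2].boxplot(list(groups.values()))
    axes[2].set_xticklabels(list(groups.keys()))
    axes[2].set_title("Weight by group")
    axes[2].set_xlabel("Group")
    axes[2].set_ylabel("Weight (kg)")

    fig.tight_layout()
    return fig


fig = quick_look(df)
```

Call fig.savefig("quick_look.png") to write the charts to disk, or drop the "Agg" line to view them interactively. None of these charts is meant to be polished; the point is to spot skew, outliers, and missing values before deciding what to analyze.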
As with statistics, however, it’s very easy to make a visualization that creates more confusion than it removes. Mislabeled plots, axes on the wrong scales, variables that aren’t properly formatted, and other little problems can work silently to undermine your attempts at clarity.
When doing these projects, try to cultivate the habit of being extremely careful in assessing your visualizations. Where I work, we have extensive sets of checklists to make sure we’ve given our charts accurate titles (harder than you think) and included units of measurement in our axis labels.
Whether or not you go to this much effort, do whatever it takes to be similarly careful.
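If you like the checklist idea, part of it can be automated. The sketch below is one possible approach, not the checklist we actually use: chart_problems is a hypothetical helper that inspects a matplotlib Axes and reports a missing title or missing axis labels. It also assumes a convention where units appear in parentheses, as in "Temperature (°C)"; pass require_units=False for charts where that convention doesn't apply, such as ones with a date axis.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for interactive windows
import matplotlib.pyplot as plt


def chart_problems(ax, require_units=True):
    """Return a list of human-readable problems found on a matplotlib Axes.

    Assumed convention (not a matplotlib rule): units are written in
    parentheses inside the axis label, e.g. "Time (s)".
    """
    problems = []
    if not ax.get_title().strip():
        problems.append("missing title")
    for name, label in (("x", ax.get_xlabel()), ("y", ax.get_ylabel())):
        if not label.strip():
            problems.append(f"missing {name}-axis label")
        elif require_units and not ("(" in label and ")" in label):
            problems.append(f"{name}-axis label has no units in parentheses")
    return problems


# A quick chart with the usual omissions.
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [10.0, 12.5, 11.0])
ax.set_xlabel("Time")
before = chart_problems(ax)

# The same chart after fixing what the check reported.
ax.set_title("Temperature over time")
ax.set_xlabel("Time (s)")
ax.set_ylabel("Temperature (°C)")
after = chart_problems(ax)

print(before)
print(after)
```

A check like this can only tell you that a title exists, not that it is accurate, so it complements a careful look at the chart rather than replacing it.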
Projects for Data Visualization Beginners
Since most data scientists are going to be using either Python or R, I’d recommend using a mix of the standard visualization tools for completing these projects. These include matplotlib, plotly, Altair, and the built-in plotting functions of pandas and R, such as the .plot() method on pandas objects and plot() in R.
- Learn to quickly make the fundamental charts. You should know how to produce a scatter plot, histogram, bar chart, box plot, heat map, and correlogram (a grid showing correlations between variables). Just about any basic dataset should be good for practicing most of these charts. Python libraries such as seaborn and scikit-learn provide sample datasets you could use.
- Get a good time series dataset like Tesla and Apple stock prices or literacy rates over time and make a time series plot. This is really just a line graph, but you have to pay special attention to the dates on the x-axis, especially when your chart contains multiple lines with different start and end times. Also make sure your y-axis has the correct units, and that both your lines are in the same units.
- Being able to make attractive data maps is a huge bonus. See if you can take a map of the U.S. and chart anything (corn production, number of high schools, Google searches for comic book characters) by state.
- Word clouds are surprisingly informative, particularly in natural language processing work. Using the Large Movie Review Dataset, see if you can generate one. What does it communicate to you?
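As a starting point for the time series project, here is a rough sketch of a two-line chart where the series start and end on different dates. The prices are randomly generated (this is not real Tesla or Apple data), and the fake_prices helper and the "Company A" and "Company B" names are placeholders I made up. The parts worth copying are aligning both series on one shared date index and labeling the y-axis with a single unit.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for interactive windows
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)


def fake_prices(start, end, first_price):
    """Build a made-up daily price series (in USD) on weekdays only."""
    dates = pd.bdate_range(start, end)
    steps = rng.normal(0, 1, size=len(dates))
    return pd.Series(first_price + np.cumsum(steps), index=dates)


# Two series with different start and end dates.
series_a = fake_prices("2021-01-04", "2021-12-31", 100.0)
series_b = fake_prices("2021-06-01", "2022-03-31", 150.0)

# Combining on axis=1 aligns both series on the union of their dates;
# dates where a series has no data become NaN instead of shifting the line.
prices = pd.concat({"Company A": series_a, "Company B": series_b}, axis=1).sort_index()

fig, ax = plt.subplots(figsize=(9, 4))
for name in prices.columns:
    ax.plot(prices.index, prices[name], label=name)  # NaN values leave gaps

# One tick every three months, formatted like "Jan 2021".
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=3))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b %Y"))

ax.set_title("Daily closing price (synthetic data)")
ax.set_xlabel("Date")
ax.set_ylabel("Price (USD)")  # both lines must share this unit
ax.legend()
fig.autofmt_xdate()  # rotate date labels so they don't overlap
```

With real data, check before plotting that both series are in the same currency and on the same basis (for example, both split-adjusted), since the chart itself won't warn you if they aren't.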
If you carry out each of these projects you’ll be well-positioned to accomplish most basic data visualization tasks. This will be a significant step in your journey to becoming a data scientist!