So, it has finally happened. You’ve spent weeks or months thinking about what you wanted your next move to be, settling on data science. And you’ve spent weeks or months researching the best ways to get into that field, settling on a bootcamp. Then you got your acceptance letter, and you’re not sure what you should do to get ready. Data science is a huge field, and preparing for a data science bootcamp can be at least as daunting as preparing for a web development program.
While data science and web development bootcamps may look similar on the surface, there are some important differences. Data science has far more math, for example, and which set of tools you need to learn varies wildly between different applications.
It’s perfectly understandable if you’re intimidated, or if you don’t know where to start. Both were true for me. But with the right approach, you can succeed.
Most of what follows is based on my personal experience attending and graduating from Galvanize’s Data Science Immersive bootcamp. Having compared notes with graduates of other programs, I think all of my advice should still apply to your situation. Thinkful is a program that’s structured very differently from Galvanize’s, but I’ve gone to local Thinkful meetups, met Thinkful instructors, and have colleagues who’ve gone through the Thinkful program.
I would tell a prospective Thinkful student the same things I would tell my former self. Still, it’s always possible that the program you’re going to attend prioritizes different things than mine did. You should do your homework and use what you learn to supplement my recommendations.
Also, bear in mind that this is to help you prepare for the data science bootcamp. This isn’t the same thing as telling you how to learn data science. There are definitely things data scientists need to understand, like data structures and graph theory, which may or may not appear at all in the accelerated context of a bootcamp.
Details on how to learn those things will have to wait for another article. I just want to help you avoid the hit-in-the-face-with-a-shovel feeling that so many of us experience on day one of a bootcamp.
The sections below are ranked approximately in order of importance.
1. Coding for Data Science
The coding side of data science is often given short shrift in favor of discussing Bayes’ rule or statistical testing, but it’s an extremely important part of the whole learning experience. The algorithms used to implement data science operations are all written in code, pretty much every major tool for data science or machine learning requires you to code, and when you get a job as a data scientist, you’ll probably spend most of your time writing code.
We use code in data science for the same reason we do everywhere else: it makes us vastly more efficient. Knowing how to do matrix algebra by hand is impressive and good for interviews or teaching, but in the real world, you’re almost always going to use code for these tasks.
My advice on what coding to learn falls into two broad categories: ‘actual coding’ and coding with data science tools.
I consider ‘actual coding’ to be similar to what software engineers do, which means it sits in the overlap between software engineering and data science. You might be writing the code to solve a data science problem, but you still need to understand how to write functions, make modular code, write unit tests, and generally follow software engineering best practices.
There are lots of resources for learning how to do this. Since a huge amount of data science work is done in Python, I’d recommend you start with the 5 best books for learning Python.
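To make the ‘actual coding’ side a little more concrete, here is a minimal sketch of a small, reusable function paired with a unit test. The function name `normalize` and the test values are my own invention for illustration, not something taken from a particular bootcamp curriculum.

```python
import unittest


def normalize(values):
    """Rescale a list of numbers to the range [0, 1] (min-max scaling)."""
    lowest, highest = min(values), max(values)
    if lowest == highest:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


class NormalizeTest(unittest.TestCase):
    """Unit tests that document how normalize is expected to behave."""

    def test_endpoints(self):
        self.assertEqual(normalize([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input(self):
        self.assertEqual(normalize([3, 3]), [0.0, 0.0])
```

If you saved this in a file, you could run the tests from the command line with Python’s built-in `unittest` runner. The point is less the function itself than the habit: small pieces of code, each with a test that says what it should do.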
Coding for Data Science Tools
Coding for data science tools is rather different. Arguably, the most important data science tool is Pandas. People sometimes think of Pandas as the Python version of Excel, but the truth is that it’s far more powerful than that.
It can be tough to learn to use Pandas well. Despite using it every day for most of a year, I still routinely struggle to do basic things with it. If I could give my former self advice on how to prepare for the data science bootcamp, topping the list would be learning Pandas. A good place to start is Mastering Pandas For Finance, by Michael Heydt.
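As a rough taste of what everyday Pandas work looks like, here is a short sketch that builds a small table, adds a derived column, filters rows, and aggregates by group. The data and column names are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical sales records, invented purely for illustration.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south"],
        "units": [10, 4, 6, 8],
        "price": [2.5, 3.0, 2.5, 3.0],
    }
)

# Add a derived column, much as you would with a spreadsheet formula.
sales["revenue"] = sales["units"] * sales["price"]

# Keep only the rows with at least 6 units sold.
big_orders = sales[sales["units"] >= 6]

# Total revenue per region.
revenue_by_region = sales.groupby("region")["revenue"].sum()
print(revenue_by_region)
```

Creating columns, filtering with a boolean condition, and grouping are the kinds of operations you’ll likely repeat constantly, so they’re a sensible place to start practicing.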
2. ‘Nerd Stuff’
I once encountered a mathematician telling the story of his journey into Deep Learning. Since he’d studied calculus, linear algebra, and probability while getting a mathematics degree, this part of deep learning was unproblematic for him.
What turned out to be harder, and a struggle to master, was what he called ‘nerd stuff.’ This includes working with the command line to find and manipulate files, downloading and configuring tricky software packages, using version control tools like Git and GitHub, altering software to work with Amazon Web Services, and lots of other little things that people think of as ‘being good with computers’.
This fits in pretty well with my experience. In addition to all the new math I was learning, I had to simultaneously deepen my understanding of how to use computers well. For one of my capstone projects, I used neural networks to generate text, a task which could be done with either PyTorch or TensorFlow. I opted to experiment with both, because they’re popular frameworks, and because I had no idea how much work I was creating for myself.
It took me almost one whole day to get PyTorch working properly, and another whole day for TensorFlow. A third day went to getting a tool called Jupyter Notebook to connect to a machine running on the Amazon Cloud.
And for the most part, all of this effort boiled down to pulling the code from the Internet and getting it to run on my local machine by troubleshooting installation issues and occasionally making small alterations to files. This doesn’t sound hard, but it was one of the most difficult parts of my whole capstone.
Unfortunately, there isn’t a general way to ‘get good with computers’, but there are a few skills you can target that’ll point you in the right direction. I recommend getting as comfortable as possible with the command line before you start the bootcamp. YouTube features some great crash courses to get you started.
You should also learn how to use Git and GitHub for both solo and pair programming. This is crucial not just for the bootcamp, but for pretty much any position you’ll find once you’re done. You could start with Codecademy’s course, and then move on to one of the longer books available on the topic.
3. What Statistics and Probability Are Required for Data Science?
One way of looking at data science is as a branch of applied statistics. While this oversimplifies a bit, it’s not necessarily wrong. The truth is, a lot of data science comes down to being able to use the right statistical metric in the right situation. These days, there are so many tools available for analyzing data that the most important skill the average data scientist needs is just being able to interpret what the statistics say.
As true as this is, though, I don’t think you should spend too much time on statistics in preparation for a bootcamp. I made the mistake of reading several stats books, when that time would’ve been better spent learning Pandas, GitHub, and the command line.
That said, you should absolutely be comfortable with basic statistical concepts like mean, median, variance, and standard deviation. It’s good to at least be acquainted with p-values and null hypothesis significance testing, as well as interpreting confusion matrices.
My advice is to try to deeply understand the basics instead of learning anything flashy or more advanced.
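For a sense of how little code the basics require, here is a small sketch using Python’s standard `statistics` module, followed by a hand-rolled confusion matrix count. The sample values and labels are made up for illustration, and I’m using the population versions of variance and standard deviation; the sample versions (`statistics.variance` and `statistics.stdev`) divide by n - 1 instead.

```python
import statistics

# Hypothetical sample, invented for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)           # arithmetic average
median = statistics.median(data)       # middle value of the sorted data
variance = statistics.pvariance(data)  # population variance
std_dev = statistics.pstdev(data)      # population standard deviation

# Hypothetical true labels and model predictions (1 = positive, 0 = negative).
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# The four cells of a binary confusion matrix.
pairs = list(zip(actual, predicted))
true_pos = pairs.count((1, 1))
true_neg = pairs.count((0, 0))
false_pos = pairs.count((0, 1))  # predicted positive, actually negative
false_neg = pairs.count((1, 0))  # predicted negative, actually positive

# Two common metrics read off the confusion matrix.
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

print(mean, median, variance, std_dev)
print(true_pos, true_neg, false_pos, false_neg, precision, recall)
```

Being able to say in plain words what each of these numbers means is worth more at this stage than knowing a more advanced technique.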
4. General Mathematics
Finally, there’s some general mathematics that you can expect to encounter.
By far, the most important is linear algebra. Unless you’ve progressed pretty far in mathematics, you probably won’t have encountered linear algebra, but it’s one of the foundations of data science and machine learning.
In these contexts, linear algebra mostly comes down to understanding how lists of numbers called ‘vectors’ are processed and changed. When you start training a text generation neural network, for example, you have to transform the text into ‘vectors,’ because neural networks can’t actually read. There’s no way for you to understand what your model is doing internally without understanding how matrices are multiplied and added.
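Here is a deliberately simplified sketch of those two ideas: turning text into vectors, and then multiplying and adding matrices. It uses plain Python lists so every step is visible; in practice you would almost certainly use a library such as NumPy. The one-hot encoding, the tiny vocabulary, and the weight matrix are all assumptions chosen for illustration, not how any particular network represents text.

```python
def one_hot(char, vocabulary):
    """Return a vector with a 1 at the character's position and 0 elsewhere."""
    return [1 if char == symbol else 0 for symbol in vocabulary]


def mat_add(a, b):
    """Add two matrices of the same shape, element by element."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def mat_mul(a, b):
    """Multiply a (n x m) by b (m x p) using row-by-column dot products."""
    columns_of_b = list(zip(*b))
    return [
        [sum(x * y for x, y in zip(row, col)) for col in columns_of_b]
        for row in a
    ]


# Turn the text "cab" into one vector per character.
vocabulary = ["a", "b", "c"]
text_vectors = [one_hot(ch, vocabulary) for ch in "cab"]

# A hypothetical 3 x 2 weight matrix, standing in for one layer of a network.
weights = [[1, 2], [3, 4], [5, 6]]

# Each one-hot row picks out the matching row of the weight matrix.
projected = mat_mul(text_vectors, weights)
print(projected)
```

Notice that multiplying a one-hot vector by the weight matrix simply selects one of its rows, which is a small example of the kind of intuition that makes a model’s internals less mysterious.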
Linear algebra also shows up in feature elimination, feature engineering, and just about everywhere else. As with probability, it’s more important that you deeply understand the foundations than that you spend time on flashier, more advanced material. I recommend you at least study matrix addition and multiplication, linear combinations, linear maps, eigenvectors, eigenvalues, and principal component analysis.
One of the most widely recommended beginner’s overviews is ‘Essence of Linear Algebra,’ by Grant Sanderson, which can be found on YouTube. It provides an excellent, intuitive way to think about linear algebra that’ll serve you well when you move on to a more serious course like Gilbert Strang’s linear algebra course at MIT, which I also recommend.
This is the advice I wish I’d received when I was preparing for a data science bootcamp, and I hope you now have some ideas on what to study. It’s my view that if you master these skills, you’ll have a much easier time at your data science bootcamp, and in your subsequent career.