Welcome, Young Padawan
If you're looking for a comprehensive guide to building a data science portfolio, look no further. This blog trilogy covers everything from conceptualizing a portfolio and knowing what to look out for while building it to presenting it in a way that will make recruiters (hopefully) flood your inbox. In contrast to the information overload on popular blogs, which leaves readers to puzzle together the bigger picture themselves, this series acts both as a starting point for lost souls and as a reference for more advanced enthusiasts.
This series will guide you through a three-legged journey with the following stopovers:
Planning Your Portfolio 📍
I know, I know. You want to get started as fast as possible. However, sometimes it's helpful to take a step back and evaluate the bigger picture. Carefully planning your portfolio will make your results seem more cohesive and save you plenty of time during the last two stages.
Building Your Portfolio
Presenting Your Portfolio
Feel free to jump off here if you have already completed some of these or only need pointers for a specific stage. You'll find this outline in each article of this series, so don't worry about finding your way back.
Start With Why
Forget everything you have learned about data science portfolios so far. Turn off your computer (after reading this article, obviously). Better yet, go for a hike and clear your mind. It might sound strange coming from a guide, but no one can tell you which projects you should do. No blog post, online course, or degree program will be able to hand you the perfect set of projects to build your awesome data science portfolio. Unless you're a piece of software analyzing this text, you are a unique product of a myriad of factors, and now is the time to start taking advantage of that.
I talked to Ken Jee, adjunct professor at DePaul University and host of the Ken’s Nearest Neighbors podcast, about building data science portfolios. One of his greatest pet peeves is students not leveraging their backgrounds to build a cohesive story. Ken stated that, for him, there is no such thing as “jumping” or “switching” to data science from another field. Instead, he recommends taking it slow and building on your existing skills to craft a unique story.
The main reason for building a portfolio is to impress potential employers. Sparrows sing, seahorses dance, and data science students build a portfolio. In essence, it's your very own pitch for why anyone should invest money into hiring you. As with any company out there, your products (or data science projects) should be rooted in a more profound mission. Reflect on your passions, experiences, and interests to find yours. Want to become mayor one day? Write it down. Interested in tennis? Write it down. Don't think about what data sets are on Kaggle right now. Instead, build out a mood board of your interests and how they could tie back to data. Take a look at an example of a mood board I created below:
I created this mood board on Notion; feel free to duplicate it and turn it into your own. As you can see, my mood board contains a mix of facts about my educational and professional background, interests, and other experiences. Once you've collected these building blocks, think about how you can connect them with data. For example, I could combine my business background with my passion for music by analyzing which song characteristics are most influential in determining the commercial success of a new release. During my conversation with Ken, he brought up another great example. Imagine you are passionate about golf and, until now, have kept your scores on a piece of paper. What if you could combine your interest in golf with data science by building a computer vision app that reads manually recorded scorecards and uploads them to a leaderboard or stores them for further analysis? As illustrated in these examples, your ideas don’t have to be revolutionary. What matters more is that there is a clear use case that’s unique to you and your story.
Don't worry about these initial ideas being perfect. Instead, start with quick wins and take it step by step.
If you've ever created a CV, you'll know what I'm talking about when I say "accomplished X as measured by Y, by doing Z." This paragraph is about the Y: the metrics you're going to use to quantify your project's impact. Depending on your background, you will most likely already be familiar with some of them, but be warned: there are many more out there. Here are some guidelines you can use to identify which metrics matter to you and your portfolio projects:
- What metrics are commonly used in the industries you're most interested in?
  - Metrics vary significantly between industries. For example, you probably wouldn't want to use accuracy for a fraud detection use case, because fraudulent transactions are typically rare and a model that never flags anything would still score well. Research which metrics are commonly used before starting your project.
- What's the goal of your project?
  - Think about the problem you are trying to solve. Going back to the fraud detection use case above: are you trying to identify as many fraudulent transactions as possible (possibly at the expense of incorrectly flagging non-fraudulent ones as well), or is your goal to make the customer experience as smooth as possible (while not catching all fraudulent transactions)? Depending on the answer to that question, the metrics that matter most change.
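To make the fraud detection example a bit more tangible, here is a minimal sketch in plain Python. The data is entirely made up for illustration (1,000 transactions, 10 of them fraudulent), and the helper names `confusion_counts` and `summarize` are my own, not part of any library. It compares a "model" that never flags anything with one that catches most fraud at the cost of some false alarms.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means fraud."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Compute accuracy, precision, and recall (0.0 when undefined)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical toy data: 10 fraudulent transactions followed by 990 legitimate ones.
y_true = [1] * 10 + [0] * 990

# "Model" A never flags anything.
never_flags = [0] * 1000

# "Model" B catches 8 of the 10 frauds and raises 20 false alarms.
flags_some = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

baseline = summarize(y_true, never_flags)
detector = summarize(y_true, flags_some)

print("never flags:", baseline)
print("flags some: ", detector)
```

On this toy data, the model that never flags anything reaches 99% accuracy while catching zero fraud, whereas the second model has slightly lower accuracy (97.8%) but a recall of 80%. Which trade-off between precision and recall is acceptable depends on the project goal discussed above.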
Generally, it might be helpful to distance yourself from the question you are trying to answer. Why would someone hire you to answer that question? What would they do with that answer? Keeping these questions in mind while defining metrics could also come in handy later when crafting a narrative to present your projects.
Whataya Want From Me?
Working in data science means working in an ever-changing environment. New tools and models are published, best practices change, and employer requirements change along with them. Take a look at some of the preferred skills listed in two LinkedIn job postings for a data scientist:
Data Scientist - Company A
Familiarity with algorithmic trading & backtesting using Python (Zipline, PyFolio, AlphaLens, etc.)
Data Scientist - Company B
Experienced in manipulating and analyzing large scale health administrative claims (Commercial and/or Medicare), billing, operational, and clinical/EHR datasets
One job title, two completely different jobs. To maximize your hiring chances, make sure to look up industry-specific skills that will help you stand out from the crowd. During our conversation, Ken Jee also elaborated that, too often, students try to demonstrate a broad range of skills by attempting to cover the whole gamut of potential data science project topics. According to Ken, however, this strategy will only lead to applicants doing “okay” in most interviews when they should be aiming to ace the ones they really care about and paying less attention to all others.
I would recommend the following three approaches to gain a better understanding of more specialized requirements in your industry of interest:
- People in your network. The best sources of reliable information regarding the tools actually being used on a day-to-day basis are individuals in your network working in that particular industry. Ask them if they'd like to have a coffee chat with you and discuss their work. Who knows, maybe they'll even give you some helpful pointers beyond important skills.
- Alumni from your school or former employers. These are individuals you may not know yet, but they can be of great help to you. Having attended the same school or worked for the same company provides a great starting point for conversations. Alumni networks are a great way to connect with them in a more meaningful way than through a LinkedIn message.
- Job boards. Should the first two options not work for you, don't worry. Job boards will also help you identify high-level industry requirement trends. However, keep in mind that job descriptions might differ significantly from the actual job.
(Data) Hunter & Gatherer
Woohoo, we've finally arrived at the data-gathering stage. Before discussing potential avenues for finding data, one thing to keep in mind for all of them is to understand the data-generating process. One of the issues with blindly pulling data from Kaggle is that the process behind the data is rarely investigated in much depth. That is not how data science works in the "real" world. Failing to understand how the data came about could render any analytical efforts absolutely meaningless. You've probably heard of Abraham Wald and his approach to identifying which aircraft areas to provide with additional armor during WW2. If not, here's a great read.
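If the aircraft story sounds abstract, here is a small simulation sketch of the underlying idea, often called survivorship bias. It is a heavy simplification with invented numbers: every plane takes exactly one hit in a random section, and the loss probabilities in `LOSS_PROB` are hypothetical values I picked for illustration, not historical data. The point is only to show how data that passes through a filter (planes that made it back) can look very different from the full picture.

```python
import random

SECTIONS = ["engine", "cockpit", "fuselage", "wings"]

# Hypothetical probability that a single hit in a section downs the plane.
LOSS_PROB = {"engine": 0.6, "cockpit": 0.4, "fuselage": 0.1, "wings": 0.1}


def simulate(n_planes=10_000, seed=0):
    """Count hits per section for all planes and for returning planes only."""
    rng = random.Random(seed)
    all_hits = {section: 0 for section in SECTIONS}
    survivor_hits = {section: 0 for section in SECTIONS}
    for _ in range(n_planes):
        section = rng.choice(SECTIONS)  # each plane takes exactly one hit
        all_hits[section] += 1
        if rng.random() >= LOSS_PROB[section]:  # the plane makes it back
            survivor_hits[section] += 1
    return all_hits, survivor_hits


def shares(counts):
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    return {section: count / total for section, count in counts.items()}


all_hits, survivor_hits = simulate()
all_share = shares(all_hits)
survivor_share = shares(survivor_hits)

for section in SECTIONS:
    print(f"{section:9s} all: {all_share[section]:.2f}  "
          f"survivors: {survivor_share[section]:.2f}")
```

In this setup, engine hits make up roughly a quarter of all hits but a noticeably smaller share of the hits observed on returning planes. Someone looking only at the survivors would conclude that engines rarely get hit, which is exactly the wrong takeaway.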
Keeping the data-generating process in mind, here are some more and less creative ways to find data that fits your portfolio.
- Find a real-world data set. At first, finding a real-world data set to work with might seem like a stretch. However, you're in luck. Your skills are in demand, and there is much work to do. Smaller companies, in particular, probably cannot afford to hire their own data science teams, and hiring external consultants for all potential data projects is, you guessed it, expensive. Reach out to people in your network or local community and see if anyone needs help.
- BYOData. No rule says you have to use an already existing data set. Other ways of creating a data set include, for instance, scraping websites or collecting data yourself. It's never been easier to collect data than today; why not take advantage?
- Combine & conquer. Yet another way of coming up with a unique data set is simply finding new ways of combining existing data. This approach involves letting your creativity do the heavy lifting and extracting new insights from data in a way that other people haven't thought of before.
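As a rough illustration of the "combine & conquer" idea, the sketch below joins two small data sets on a shared key using only plain Python. Everything here is hypothetical: the records, the column names (`song_id`, `tempo_bpm`, `peak_position`), and the helper `inner_join` are invented for this example and loosely follow the music idea from my mood board. In a real project you would probably reach for a data frame library instead.

```python
# Hypothetical toy records describing song characteristics.
song_features = [
    {"song_id": "s1", "tempo_bpm": 120, "duration_s": 215},
    {"song_id": "s2", "tempo_bpm": 92, "duration_s": 187},
    {"song_id": "s3", "tempo_bpm": 140, "duration_s": 243},
]

# Hypothetical toy records describing commercial success.
chart_results = [
    {"song_id": "s1", "peak_position": 4},
    {"song_id": "s3", "peak_position": 57},
    {"song_id": "s4", "peak_position": 12},
]


def inner_join(left, right, key):
    """Keep only rows whose key appears in both data sets.

    Assumes each key value occurs at most once in `right`.
    """
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


combined = inner_join(song_features, chart_results, "song_id")

for row in combined:
    print(row)
```

Only the songs present in both sources (`s1` and `s3`) survive the join. It is worth checking how many rows get dropped in a step like this, since that again says something about how your data came about.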
I Got 99 Problems and Time Is One
The devil is in the details. When working on a portfolio project, you'll probably uncover new research questions in addition to your initial one. You might even consider abandoning your initial idea in favor of another one. While there's nothing wrong with either of these, it is crucial to keep yourself accountable. After all, you probably do not have an unlimited amount of time to build the portfolio of your dreams. Here are some best practices I find helpful:
- Plan ahead. Keeping yourself accountable only works when you decide how much time you want to spend on which project. Don't worry about figuring out whether you should do two, three, or four projects. Just plan for whatever you can accommodate and commit to a certain amount of time for each project.
- Track your work. There are many great tools out there for tracking what you accomplished during each work session and how much time you spent on it.
- Take notes. Taking notes while working on a project can seem bothersome, but having details on why you did what at each stage will be invaluable when presenting your work. This involves everything from adding comments to stay in the zone while coding to taking 30 minutes at the end of your session to jot down a few bullet points. Trust me. You won't regret it.
Getting To Work
Congrats, you've already made it 33.33% through this blog trilogy. I would recommend taking a step back and thinking about how you can combine the recommendations in this article with your unique goals and work style. I've also gathered all of the Notion pages I used for the planning stage and turned them into templates that you can find here.