Exploring Astronaut Activities in SQL
About this case study
This is a notebook derived from the Datacamp Live Webinar where Justin Saddlemyer taught us the basics of regular expressions and how they can be used to categorize data among other cool things using SQL.
In this notebook, we will use string cleaning to process and analyze a database containing information on astronaut extra-vehicular activities (EVAs).
We will start by exploring the data, identifying potential issues with the columns that contain text. Next, we will take move forward to the necessary steps to clean up the data so you can draw insights the types and participants of EVAs!
Data: Extravehicular astronaut activities, including crew, durations, and purpose.
Primary Goal: Learning exploratory data analysis techniques, including data preparation and manipulation.
Secondary Goal: Get some insights into what astronauts do in space!
Why? Data will not always come pre-processed, or easy to visualize. Extracting information and restructuring it is essential to analytics.
Let's start by looking at the table we will be working with.
Let's inspect the purpose column in greater detail.
What are the most common types of EVAs?
Let's start to get a rough idea of the most popular types of EVAs astronauts take by using CASE expressions.
We are now ready to build this into a final query!
How much material has been extracted from EVAs?
Skimming through the purpose column, we also saw numerous references to extracting rock/dust or geological material. In this case, it will be difficult to extract the total quantity across the columns. Regular expressions to the rescue!
We will define a pattern to extract the total pounds extracted per EVA, and then sum them up. Let's first do a sense check of the data.
Okay, we now know that the format of the pounds extracted is always number lbs of rock/geologic. We can construct a pattern to detect this and extract the number!
To do so, we will make use of:
Let's put this into action, using SUBSTRING() to extract our pattern!
Now we can use a CTE (Common Table Expression) to calculate the total amount!
Which astronauts have the most time in EVAs?
We also have information on how much time each EVA took, as well as the astronauts who participated. Let's use this information to try and calculate totals for each astronaut!
First let's see what the maximum count of astronauts in an EVA has been by splitting the crew column. We can also use TRIM() to remove any extra whitespace from the column.
Now we know how we can extract the time for each astronaut! Let's create a CTE we can then use to piece together duration information for each astronaut.
Now it's just a matter of splitting apart this table and appending each set of results to one table.
What is the cumulative amount of time spent in EVAs over time?
Finally, let's take a look at the cumulative time spent in EVAs by year and space program. To do so, we will need to use a window function in combination with a subquery.