Exploring Astronaut Activities in SQL
About this case study
This is a notebook derived from the Datacamp Live Webinar where Justin Saddlemyer taught us the basics of regular expressions and how they can be used to categorize data among other cool things using SQL.
In this notebook, we will use string cleaning to process and analyze a database containing information on astronaut extra-vehicular activities (EVAs).
We will start by exploring the data, identifying potential issues with the columns that contain text. Next, we will take move forward to the necessary steps to clean up the data so you can draw insights the types and participants of EVAs!
Data: Extravehicular astronaut activities, including crew, durations, and purpose.
Primary Goal: Learning exploratory data analysis techniques, including data preparation and manipulation.
Secondary Goal: Get some insights into what astronauts do in space!
Why? Data will not always come pre-processed, or easy to visualize. Extracting information and restructuring it is essential to analytics.
Let's start by looking at the table we will be working with.
0
3/10/2001
USA
1
5/13/1992
USA
2
8/30/2012
USA
3
12/22/1999
USA
4
12/23/1999
USA
5
12/24/1999
USA
6
5/22/2011
USA
7
8/7/2010
USA
8
5/17/2009
USA
9
2/11/2008
USA
Let's inspect the purpose column in greater detail.
0
First U.S. EVA. Used HHMU and took photos. Gas flow cooling of 25ft umbilical overwhelmed by vehicle ingress work and helmet fogged. Lost overglove. Jettisoned thermal gloves and helmet sun visor
1
HHMU EVA cancelled before starting by stuck on vehicle thruster that ended mission early
2
Inadequate restraints, stiff 25ft umbilical and high workloads exceeded suit vent loop cooling capacity and caused fogging. Demo called off of tethered astronaut maneuvering unit
3
Standup EVA. UV photos of stars. Ended by eye irritation that impaired vision
4
Retrieved MMOD experiment from docked Agena. Used HHMU. Lost camera and retrieved experiment. EVA ended early by unrelated spacecraft problem
5
Attached tether between Agena and Gemini. EVA ended early due to fatigue, overheating & eye sweat
6
Standup EVA. Took star photos. Agena tether ops
7
Standup EVA. Science tasks. Took star photos
8
Attached tether between Agena and Gemini. UV photos of stars. Waist tether and Dutch shoe eval
9
Standup EVA. Jettisoned equipment. Took photos
What are the most common types of EVAs?
Let's start to get a rough idea of the most popular types of EVAs astronauts take by using CASE expressions.
60
LEASAT/SYNCOM IV repairs completed. Accidently reused depleted LiOH cans from 1st EVA.
0
61
EASE/ACCESS truss assembly and stowage demos. Deployed radar target
0
62
EASE/ACCESS truss assembly, stowage and handling demos. Simulated cable routing with rope reel and heat pipe handling with free EASE beams.
0
63
In suits in airlock at 5psi awaiting GO to manually deploy stuck solar array. Waived off when auto sys worked
0
64
Freed stuck GRO high gain antenna. Started Space Station demos (recorded crew loads, variable tension rope translation, dogbone handrails).
0
65
More Space Station demos (evals of 3 different carts and a tether shuttle, RMS translation rates and loads). Time short for all RMS/EVA tests
0
66
Capture bar/grapple fixture attach failed 3 times
0
67
Though more careful to induce less loads, capture bar/grapple fixture attach failed 5 times
0
68
3 man EVA. Manually capture/repair INTELSAT
0
69
ASEM Space Station assembly demos (module to truss mating, updated HHMU self rescue device). Aligned gimbals of Ku antenna for IV commanded locking
0
We are now ready to build this into a final query!
0
191
installation
1
129
repair
2
16
collection
3
13
photography
How much material has been extracted from EVAs?
Skimming through the purpose column, we also saw numerous references to extracting rock/dust or geological material. In this case, it will be difficult to extract the total quantity across the columns. Regular expressions to the rescue!
We will define a pattern to extract the total pounds extracted per EVA, and then sum them up. Let's first do a sense check of the data.
0
First to walk on the moon. Some trouble getting out small hatch. 46.3 lb of geologic material collected. EASEP seismograph and laser reflector exp deployed. Solar wind exp deployed & retrieved. 400 ft (120m) circuit on foot. Dust issue post EVA
1
Collected 75.6 lb of geologic material. ALSEP exp deployed. 6000 ft (1800m) circuit on foot
2
Collected 94.4 lb of geologic material. ALSEP and laser reflector exp deployed. Hiked up to 0.9 miles (1.5km) from lunar module. Used MET rickshaw
3
Collected 169 lb of geologic material. ALSEP exp deployed. First use of the lunar rover. Covered 6.2 mile (10.3 km) circuit
4
Collected 208 lb of rock/dust (41lb this day). ALSEP & solar wind experiments deployed. Assembled and used lunar rover (4.2 km).
5
Collected 82 lb of rock/dust. Drove rover 11.5 km
6
Collected 90 lb of rock/dust. Drove rover 27.1 km
7
Collected 243 lb of geologic material. ALSEP exp deployed. 8 seismic explosive charges placed for later use. Used lunar rover to travel 11.4 miles (19 km) circuit with max range of 4.9 miles from lunar module.
Okay, we now know that the format of the pounds extracted is always number lbs of rock/geologic. We can construct a pattern to detect this and extract the number!
To do so, we will make use of:
\d+ to match one or more digits.
\.? to match zero or more periods.\
* to match zero or more digits following the optional decimal place.
() to specify we only want the digits.
\s to match the whitespace (i.e., spaces).
[ ] and | to specify we either want to match "geologic" or "rock".
Let's put this into action, using SUBSTRING() to extract our pattern!
0
First U.S. EVA. Used HHMU and took photos. Gas flow cooling of 25ft umbilical overwhelmed by vehicle ingress work and helmet fogged. Lost overglove. Jettisoned thermal gloves and helmet sun visor
None
1
HHMU EVA cancelled before starting by stuck on vehicle thruster that ended mission early
None
2
Inadequate restraints, stiff 25ft umbilical and high workloads exceeded suit vent loop cooling capacity and caused fogging. Demo called off of tethered astronaut maneuvering unit
None
3
Standup EVA. UV photos of stars. Ended by eye irritation that impaired vision
None
4
Retrieved MMOD experiment from docked Agena. Used HHMU. Lost camera and retrieved experiment. EVA ended early by unrelated spacecraft problem
None
5
Attached tether between Agena and Gemini. EVA ended early due to fatigue, overheating & eye sweat
None
6
Standup EVA. Took star photos. Agena tether ops
None
7
Standup EVA. Science tasks. Took star photos
None
8
Attached tether between Agena and Gemini. UV photos of stars. Waist tether and Dutch shoe eval
None
9
Standup EVA. Jettisoned equipment. Took photos
None
Now we can use a CTE (Common Table Expression) to calculate the total amount!
0
1008.3
Which astronauts have the most time in EVAs?
We also have information on how much time each EVA took, as well as the astronauts who participated. Let's use this information to try and calculate totals for each astronaut!
First let's see what the maximum count of astronauts in an EVA has been by splitting the crew column. We can also use TRIM() to remove any extra whitespace from the column.
10
David Scott
David Scott
11
Russ Schweickart
Russ Schweickart
12
Neil Armstrong, Buzz Aldrin
Neil Armstrong
13
Neil Armstrong, Buzz Aldrin
Neil Armstrong
14
Allen Bean, Pete Conrad
Allen Bean
15
Allen Bean, Pete Conrad
Allen Bean
16
Allen Bean, Pete Conrad
Allen Bean
17
Ed Mitchell, Alan Shepard
Ed Mitchell
18
Ed Mitchell, Alan Shepard
Ed Mitchell
19
Ed Mitchell, Alan Shepard
Ed Mitchell
0
Paul Weitz, Joe Kerwin, Pete Conrad
Pete Conrad
Now we know how we can extract the time for each astronaut! Let's create a CTE we can then use to piece together duration information for each astronaut.
0
Ed White
Ed White
1
David Scott
David Scott
2
Eugene Cernan
Eugene Cernan
3
Mike Collins
Mike Collins
4
Mike Collins
Mike Collins
5
Richard Gordon
Richard Gordon
6
Richard Gordon
Richard Gordon
7
Buzz Aldrin
Buzz Aldrin
8
Buzz Aldrin
Buzz Aldrin
9
Buzz Aldrin
Buzz Aldrin
Now it's just a matter of splitting apart this table and appending each set of results to one table.
0
Ed White
36
1
David Scott
0
2
Eugene Cernan
127
3
Mike Collins
50
4
Mike Collins
39
5
Richard Gordon
44
6
Richard Gordon
130
7
Buzz Aldrin
149
8
Buzz Aldrin
126
9
Buzz Aldrin
55
0
Jerry Ross
3501
1
Anatoly Solovyev
3086
2
Scott Parazynski
2825
3
Nikola Budarin
2672
4
John Grunsfeld
2527
5
Mike Lopez-Alegria
2501
6
Mike Fincke
2472
7
Dan Tani
2351
8
Victor Afanasyev
2314
9
Rick Mastracchio
2311
What is the cumulative amount of time spent in EVAs over time?
Finally, let's take a look at the cumulative time spent in EVAs by year and space program. To do so, we will need to use a window function in combination with a subquery.
0
1965-01-01
Gemini
1
1965-01-01
Voskhod
2
1966-01-01
Gemini
3
1969-01-01
Apollo
4
1969-01-01
Soyuz
5
1971-01-01
Apollo
6
1972-01-01
Apollo
7
1973-01-01
Skylab
8
1973-01-01
Soyuz
9
1974-01-01
Skylab