Real Python - Exploring a Dataset
https://realpython.com/pandas-python-explore-dataset/
df.shape
You use the Python built-in function len() to determine the number of rows. You also use the .shape attribute of the DataFrame to see its dimensionality. The result is a tuple containing the number of rows and columns.
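A minimal sketch, assuming the tutorial's NBA CSV has been loaded into a DataFrame named df (the file name follows the tutorial and is an assumption here):

import pandas as pd

df = pd.read_csv("nba_all_elo.csv")   # load the dataset (assumed path)
len(df)     # 126314 -> number of rows
df.shape    # (126314, 23) -> (rows, columns)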
df.head()
(first columns of the df.head() output shown: index, gameorder, game_id)

   gameorder  game_id
0  1          194611010TRH
1  1          194611010TRH
2  2          194611020CHS
3  2          194611020CHS
4  3          194611020DTF
Getting to Know Your Data
You’ve imported a CSV file with the Pandas Python library and had a first look at the contents of your dataset. So far, you’ve only seen the size of your dataset and its first and last few rows. Next, you’ll learn how to examine your data more systematically.

Displaying Data Types

The first step in getting to know your data is to discover the different data types it contains. While you can put anything into a list, the columns of a DataFrame contain values of a specific data type. When you compare Pandas and Python data structures, you’ll see that this behavior makes Pandas much faster! You can display all columns and their data types with .info():
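Assuming, as above, that the DataFrame is named df, the listing below comes from:

df.info()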
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gameorder 126314 non-null int64
1 game_id 126314 non-null object
2 lg_id 126314 non-null object
3 _iscopy 126314 non-null int64
4 year_id 126314 non-null int64
5 date_game 126314 non-null object
6 seasongame 126314 non-null int64
7 is_playoffs 126314 non-null int64
8 team_id 126314 non-null object
9 fran_id 126314 non-null object
10 pts 126314 non-null int64
11 elo_i 126314 non-null float64
12 elo_n 126314 non-null float64
13 win_equiv 126314 non-null float64
14 opp_id 126314 non-null object
15 opp_fran 126314 non-null object
16 opp_pts 126314 non-null int64
17 opp_elo_i 126314 non-null float64
18 opp_elo_n 126314 non-null float64
19 game_location 126314 non-null object
20 game_result 126314 non-null object
21 forecast 126314 non-null float64
22 notes 5424 non-null object
dtypes: float64(6), int64(7), object(10)
memory usage: 22.2+ MB
Showing Basic Statistics
The .describe() method shows you some basic descriptive statistics for all numeric columns:
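A hedged sketch of the calls behind the two tables below, assuming df as before; the second call produces the statistics for the object columns:

df.describe()                # numeric columns, e.g. gameorder and _iscopy
df.describe(include=object)  # object (string) columns, e.g. game_id and lg_id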
        gameorder           _iscopy
count   126314              126314
mean    31579.0             0.5
std     18231.927642531373  0.500001979206457
min     1                   0
25%     15790               0
50%     31579               0.5
75%     47368               1
max     63157               1

.describe() won’t try to calculate a mean or a standard deviation for the object columns, since they mostly include text strings. However, it will still display some descriptive statistics:

        game_id       lg_id
count   126314        126314
unique  63157         2
top     197211040ATL  NBA
freq    2             118016
Take a look at the team_id and fran_id columns. Your dataset contains 104 different team IDs, but only 53 different franchise IDs. Furthermore, the most frequent team ID is BOS, but the most frequent franchise ID is Lakers. How is that possible? You’ll need to explore your dataset a bit more to answer this question.
Exploring Your Dataset
Exploratory data analysis can help you answer questions about your dataset. For example, you can examine how often specific values occur in a column:
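For example, counting how often each franchise appears (a sketch, assuming df):

df["fran_id"].value_counts()   # how often each fran_id occurs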
It seems that a team named "Lakers" played 6024 games, but only 5078 of those were played by the Los Angeles Lakers. Find out who the other "Lakers" team is:
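A hedged sketch of that lookup:

df.loc[df["fran_id"] == "Lakers", "team_id"].value_counts()   # e.g. LAL 5078, MNL 946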
Indeed, the Minneapolis Lakers ("MNL") played 946 games. You can even find out when they played those games. For that, you’ll first define a new column that converts the value of date_game to the datetime data type. Then you can use the min and max aggregate functions to find the first and last games of the Minneapolis Lakers:
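A sketch of those steps, assuming df; the date_played column also shows up in the .info() listings further below:

df["date_played"] = pd.to_datetime(df["date_game"])    # new datetime column
df.loc[df["team_id"] == "MNL", "date_played"].min()    # first MNL game
df.loc[df["team_id"] == "MNL", "date_played"].max()    # last MNL game
df.loc[df["team_id"] == "MNL", "date_played"].agg(("min", "max"))  # both at once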
You’ve also found out why the Boston Celtics team "BOS" played the most games in the dataset. Let’s also analyze their history a little. Find out how many points the Boston Celtics have scored during all matches contained in this dataset. A solution sketch follows:
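One possible solution, as a hedged sketch:

df.loc[df["team_id"] == "BOS", "pts"].sum()   # total Celtics points in the dataset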
Getting to Know Pandas’ Data Structures
While a DataFrame provides functions that can feel quite intuitive, the underlying concepts are a bit trickier to understand. For this reason, you’ll set aside the vast NBA DataFrame and build some smaller Pandas objects from scratch.
You’ve used the list [5555, 7000, 1980] to create a Series object called revenues. A Series object wraps two components: a sequence of values, and a sequence of identifiers, which is the index. You can access these components with .values and .index, respectively:
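A minimal sketch of that Series:

import pandas as pd

revenues = pd.Series([5555, 7000, 1980])
revenues.values   # array([5555, 7000, 1980])
revenues.index    # RangeIndex(start=0, stop=3, step=1)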
Understanding DataFrame Objects
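The table below shows a small DataFrame; here is a hedged reconstruction of how the tutorial builds it (the names city_revenues, city_employee_count, and city_data follow the tutorial):

city_revenues = pd.Series({"Amsterdam": 4200, "Tokyo": 6500, "Toronto": 8000})
city_employee_count = pd.Series({"Amsterdam": 5, "Tokyo": 8})  # no Toronto entry
city_data = pd.DataFrame({
    "revenue": city_revenues,
    "employee_count": city_employee_count,   # Toronto becomes NaN on alignment
})
city_data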
           revenue  employee_count
Amsterdam  4200     5.0
Tokyo      6500     8.0
Toronto    8000     NaN
Axis - important
The axis marked with 0 is the row index, and the axis marked with 1 is the column index. This terminology is important to know because you’ll encounter several DataFrame methods that accept an axis parameter.
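A quick illustration with city_data from above (a sketch):

city_data.sum(axis=0)   # collapse the rows: one sum per column
city_data.sum(axis=1)   # collapse the columns: one sum per row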
Accessing Series Elements
Using .loc and .iloc
In the section above, you’ve created a Pandas Series based on a Python list and compared the two data structures. You’ve seen how a Series object is similar to lists and dictionaries in several ways. A further similarity is that you can use the indexing operator ([]) for Series as well. You’ll also learn how to use two Pandas-specific access methods, .loc and .iloc. You’ll see that these data access methods can be much more readable than the indexing operator.
.loc refers to the label index.
.iloc refers to the positional index.
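A small sketch contrasting the two (the colors Series follows the tutorial's example):

colors = pd.Series(["red", "purple", "blue", "green", "yellow"], index=[1, 2, 3, 5, 8])
colors.loc[1]    # "red": looks up the label 1
colors.iloc[1]   # "purple": looks up the second position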
Accessing DataFrame Elements
Since a DataFrame consists of Series objects, you can use the very same tools to access its elements. The crucial difference is the additional dimension of the DataFrame. You’ll use the indexing operator for the columns and the access methods .loc and .iloc on the rows.
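For example, with city_data from above (a sketch):

city_data["revenue"]        # indexing operator selects a column
city_data.loc["Amsterdam"]  # .loc selects a row by label
city_data.iloc[1]           # .iloc selects a row by position (Tokyo)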
Alright, you’ve used .loc and .iloc on small data structures. Now, it’s time to practice with something bigger! Use a data access method to display the second-to-last row of the nba dataset. Then check the solution sketch below:
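One hedged solution sketch, assuming the dataset lives in df:

df.iloc[-2]   # second-to-last row by position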
For a DataFrame, the data access methods .loc and .iloc also accept a second parameter. While the first parameter selects rows based on the indices, the second parameter selects the columns. You can use these parameters together to select a subset of rows and columns from your DataFrame:
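A sketch of the call behind the table below; .loc slices include both endpoints:

df.loc[5555:5559, ["fran_id", "opp_fran"]]   # rows 5555-5559, two columns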
      fran_id  opp_fran
5555  Pistons  Warriors
5556  Celtics  Knicks
5557  Knicks   Celtics
5558  Kings    Sixers
5559  Sixers   Kings
Querying Your Dataset
You’ve seen how to access subsets of a huge dataset based on its indices. Now, you’ll select rows based on the values in your dataset’s columns to query your data. For example, you can create a new DataFrame that contains only games played after 2010:
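A hedged sketch, assuming df:

current_decade = df[df["year_id"] > 2010]   # boolean mask on a column
current_decade.shape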
You can also select the rows where a specific field is not null:
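For example (a sketch):

games_with_notes = df[df["notes"].notnull()]   # keep rows where notes is set
games_with_notes.shape   # only 5424 rows have notes, per .info() above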
Important: Select Rows on Conditions
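The first result below looks like the tutorial's Baltimore Bullets example, combining several conditions with &; the exact conditions here are assumptions:

df[
    (df["_iscopy"] == 0)
    & (df["pts"] > 100)
    & (df["opp_pts"] > 100)
    & (df["team_id"] == "BLB")
]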
Two example query results (index, gameorder, and game_id columns shown):

       gameorder  game_id
1726   864        194902260BLB
4890   2446       195301100BLB
4909   2455       195301140BLB
5208   2605       195303110BLB
5825   2913       195402220BLB

       gameorder  game_id
66638  33320      199111010HOU
66649  33325      199111010SAC
66659  33330      199111020DAL
66693  33347      199111050LAL
66742  33372      199111080PHO
66765  33383      199111090UTA
66812  33407      199111130SAS
66816  33409      199111140DAL
66818  33410      199111140GSW
66881  33441      199111190POR
Grouping and Aggregating Your Data
Group by Multiple Columns
See Pandas Group by Tutorial https://realpython.com/pandas-groupby/
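Two hedged sketches of grouping, assuming df:

df.groupby("fran_id", sort=False)["pts"].sum()   # total points per franchise

# group by multiple columns: results per season for one franchise
df[(df["fran_id"] == "Spurs") & (df["year_id"] > 2010)].groupby(
    ["year_id", "game_result"]
)["game_id"].count()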
Manipulating Columns
You’ll need to know how to manipulate your dataset’s columns in different phases of the data analysis process. You can add and drop columns as part of the initial data cleaning phase, or later based on the insights of your analysis.
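A sketch consistent with the .info() listings further below, which show a new difference column and no more Elo columns (assuming df):

df["difference"] = df["pts"] - df["opp_pts"]   # add a derived column
elo_columns = ["elo_i", "elo_n", "opp_elo_i", "opp_elo_n"]
df.drop(elo_columns, inplace=True, axis=1)     # drop columns no longer needed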
Specifying Data Types
Categorical Type
Categorical data has a few advantages over unstructured text. When you specify the categorical data type, you make validation easier and save a ton of memory, as Pandas will only use the unique values internally. The higher the ratio of total values to unique values, the more space savings you’ll get.

Run df.info() again. You should see that changing the game_location data type from object to categorical has decreased the memory usage.

Note: The categorical data type also gives you access to additional methods through the .cat accessor. To learn more, check out the official docs.
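One common way to perform the conversion described above (a sketch, assuming df):

df["game_location"] = df["game_location"].astype("category")
df["game_location"].dtype   # now a categorical dtype
df.info()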
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gameorder 126314 non-null int64
1 game_id 126314 non-null object
2 lg_id 126314 non-null object
3 _iscopy 126314 non-null int64
4 year_id 126314 non-null int64
5 date_game 126314 non-null datetime64[ns]
6 seasongame 126314 non-null int64
7 is_playoffs 126314 non-null int64
8 team_id 126314 non-null object
9 fran_id 126314 non-null object
10 pts 126314 non-null int64
11 win_equiv 126314 non-null float64
12 opp_id 126314 non-null object
13 opp_fran 126314 non-null object
14 opp_pts 126314 non-null int64
15 game_location 126314 non-null category
16 game_result 126314 non-null object
17 forecast 126314 non-null float64
18 notes 5424 non-null object
19 date_played 126314 non-null datetime64[ns]
20 difference 126314 non-null int64
dtypes: category(1), datetime64[ns](2), float64(2), int64(8), object(8)
memory usage: 19.4+ MB
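The second listing below differs only in game_result: the same conversion applied to another column (a sketch):

df["game_result"] = df["game_result"].astype("category")
df.info()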
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gameorder 126314 non-null int64
1 game_id 126314 non-null object
2 lg_id 126314 non-null object
3 _iscopy 126314 non-null int64
4 year_id 126314 non-null int64
5 date_game 126314 non-null datetime64[ns]
6 seasongame 126314 non-null int64
7 is_playoffs 126314 non-null int64
8 team_id 126314 non-null object
9 fran_id 126314 non-null object
10 pts 126314 non-null int64
11 win_equiv 126314 non-null float64
12 opp_id 126314 non-null object
13 opp_fran 126314 non-null object
14 opp_pts 126314 non-null int64
15 game_location 126314 non-null category
16 game_result 126314 non-null category
17 forecast 126314 non-null float64
18 notes 5424 non-null object
19 date_played 126314 non-null datetime64[ns]
20 difference 126314 non-null int64
dtypes: category(2), datetime64[ns](2), float64(2), int64(8), object(7)
memory usage: 18.6+ MB
Cleaning Data
Missing Data .dropna()
Sometimes, the easiest way to deal with records containing missing values is to ignore them. You can remove all the rows with missing values using .dropna():
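A sketch, assuming df:

rows_without_missing_data = df.dropna()   # drop rows containing any NaN
rows_without_missing_data.shape           # only the 5424 rows with notes survive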
You can also drop problematic columns if they’re not relevant for your analysis. To do this, use .dropna() again and provide the axis=1 parameter:
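For example (a sketch):

data_without_missing_columns = df.dropna(axis=1)   # drop columns containing any NaN
data_without_missing_columns.shape                 # the notes column is gone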
Fill empty cells with .fillna(value='my text', inplace=True)
If there’s a meaningful default value for your use case, then you can also replace the missing values with that:
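A sketch using the default value from the tutorial; the assignment form below is equivalent to the inplace=True variant noted above:

data_with_default_notes = df.copy()
data_with_default_notes["notes"] = data_with_default_notes["notes"].fillna(
    "no notes at all"
)
data_with_default_notes["notes"].describe()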
Invalid Values
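The row below can be found by searching for impossible scores, such as zero points (a hedged sketch, assuming df):

df[df["pts"] == 0]   # a game where a team scored no points at all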
       gameorder  game_id
26684  13343      197210260VIR
Inconsistent Values
Sometimes a value would be entirely realistic in and of itself, but it doesn’t fit with the values in the other columns. You can define some query criteria that are mutually exclusive and verify that these don’t occur together.

In the NBA dataset, the values of the fields pts, opp_pts, and game_result should be consistent with each other. You can check this using the .empty attribute:
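Two hedged consistency checks; each query selects contradictory rows, so .empty should be True for a clean dataset:

df[(df["pts"] > df["opp_pts"]) & (df["game_result"] != "W")].empty   # True
df[(df["pts"] < df["opp_pts"]) & (df["game_result"] != "L")].empty   # True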