Deepnote Pandas puzzle: A comprehensive guide to mastering Pandas

Inspired by the popular “100 Numpy Exercises,” the Deepnote Pandas Puzzle collection is designed to test and enhance your understanding of pandas, one of the most powerful data manipulation libraries in Python. This set of exercises focuses on the fundamental aspects of manipulating data, such as indexing, grouping, aggregating, and cleaning, utilizing the core dataframe and Series objects.

While pandas is a vast library with numerous specialized features, these exercises primarily concentrate on the core functionalities. The solutions to these puzzles are generally straightforward, often requiring just a few lines of code in pandas or NumPy, and are designed to encourage the use of best practices and efficient methods.

Getting started

Before diving into the exercises, ensure you have pandas and NumPy installed and properly set up. If you’re new to pandas, consider exploring the official documentation for a comprehensive overview.

!pip install pandas numpy

Importing Pandas

Difficulty: Easy

1. Import pandas under the name pd.

import pandas as pd

2. Print the version of pandas that has been imported.

print(pd.__version__)

3. Print out all the version information of the libraries required by pandas.

pd.show_versions()

Dataframe basics

Difficulty: Easy

Let’s start with a basic dataframe. Consider the following Python dictionary data and the list labels:

import numpy as np

data = {
    'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
    'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
    'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']
}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

1. Create a dataframe df from this dictionary data which has the index labels.

df = pd.DataFrame(data, index=labels)

2 .Display a summary of the basic information about this dtaframe and its data.

df.info()
# or:
df.describe()

3. Return the first 3 rows of the dataframe df.

df.head(3)

4. Select just the ‘animal’ and ‘age’ columns from the dataframe df.

df[['animal', 'age']]

4. Select the data in rows [3, 4, 8] and in columns [‘animal’, ‘age’].

df.loc[df.index[[3, 4, 8]], ['animal', 'age']]

5. Select only the rows where the number of visits is greater than 3.

df[df['visits'] > 3]

6. Select the rows where the age is missing, i.e., is NaN.

df[df['age'].isnull()]

7. Select the rows where the animal is a cat and the age is less than 3.

df[(df['animal'] == 'cat') & (df['age'] < 3)]

8. Select the rows where the age is between 2 and 4 (inclusive).

df[df['age'].between(2, 4)]

9. Change the age in row ‘f’ to 1.5.

df.loc['f', 'age'] = 1.5

10. Calculate the sum of all visits (the total number of visits).

df['visits'].sum()

11. Calculate the mean age for each different animal in df.

df.groupby('animal')['age'].mean()

12. Append a new row ‘k’ to df with your choice of values for each column. Then delete that row to return the original dataframe.

df.loc['k'] = [5.5, 'dog', 'no', 2]
df = df.drop('k')

13. Count the number of each type of animal in df.

df['animal'].value_counts()

14. Sort df first by the values in the ‘age’ in descending order, then by the value in the ‘visits’ column in ascending order.

df.sort_values(by=['age', 'visits'], ascending=[False, True])

15. The ‘priority’ column contains the values ‘yes’ and ‘no’. Replace this column with a column of boolean values: ‘yes’ should be True and ’noshould beFalse`.

df['priority'] = df['priority'].map({'yes': True, 'no': False})

16. In the ‘animal’ column, change the ‘snake’ entries to ‘python’.

df['animal'] = df['animal'].replace('snake', 'python')

17. For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits, and the values are the mean ages (hint: use a pivot table).

df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')

Dataframes: Beyond the basics

Difficulty: Medium

The previous section covered basic but essential DataFrame operations. The following exercises require a combination of methods to achieve the desired result.

1. Filter out rows which contain the same integer as the row immediately above in the dataframe df with a column ‘A’ of integers.

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df = df.loc[df['A'].shift() != df['A']]

2. Subtract the row mean from each element in the row for a dataframe of numeric values.

df = pd.DataFrame(np.random.random(size=(5, 3)))
df = df.sub(df.mean(axis=1), axis=0)

3. Find the column with the smallest sum in a dataframe with 10 columns of real numbers.

df = pd.DataFrame(np.random.random(size=(5, 0)), columns=list('abcdefghij'))
smallest_sum_col = df.sum().idxmin()

4. Count how many unique rows a dataframe has (i.e., ignore all rows that are duplicates).

unique_rows_count = len(df.drop_duplicates(keep=False))

5. Find the column which contains the third NaN value for each row of a dataframe with 10 columns of floating-point numbers, where exactly 5 entries in each row are NaN values.

third_nan_column = (df.isnull().cumsum(axis=1) == 3).idxmax(axis=1)

6. Find the sum of the three greatest values for each group in a DataFrame with columns ‘grps’ and ‘vals’.

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
df.groupby('grps')['vals'].nlargest(3).sum(level=0)

7. Calculate the sum of the corresponding values in column ‘B’ for each group of 10 consecutive integers in column ‘A’.

df = pd.DataFrame({'A': np.random.randint(1, 101, size=100), 'B': np.random.randint(0, 100, size=100)})
df.groupby(pd.cut(df['A'], np.arange(0, 101, 10)))['B'].sum()

Dataframes: Harder problems

Difficulty: Hard

These exercises might require some creative thinking and a deeper understanding of pandas and NumPy.

1. Count the difference back to the previous zero (or the start of the Series, whichever is closer) for each value in an integer column ‘X’ of a DataFrame.

df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
ero = np.r_[-1, (df[‘X’] == 0).to_numpy().nonzero()[0]] # indices of zeros
df[‘Y’] = np.arange(len(df)) - izero[np.searchsorted(izero, np.arange(len(df))) - 1]

2. Create a dataframe of 100 rows, 10 columns of random integers. Then compute the correlation between the first column and each other column.

df = pd.DataFrame(np.random.randint(1, 100, size=(100, 10)))
correlation = df.corrwith(df[0])

3. Create a dataframe of 100 rows, 10 columns of random integers. Replace elements in the DataFrame with elements from a list such that each value from the list is placed at the index of the original value’s position in the sorted DataFrame.

df = pd.DataFrame(np.random.randint(1, 100, size=(100, 10)))
elements = np.arange(1, 1001)
df.values.flat[np.argsort(df.values, axis=None)] = elements

Conclusion

These exercises should provide a comprehensive overview of the fundamental and advanced capabilities of pandas. Completing these tasks will not only help solidify your understanding of pandas but also enhance your problem-solving skills with real-world data manipulation scenarios. If you encounter any issues, please get in touch with our support. Happy learning in Deepnote! 🐍

Getting started

Importing Pandas

Dataframe basics

Dataframes: Beyond the basics

Dataframes: Harder problems

That’s it, time to try Deepnote