Project 2, MA346, Fall 2025

Q1: What are the observational units in the first data set?

A1: The observational units are individual Oscar nominations. Each row represents one nomination for a specific person, for a specific film, in a specific category, in a specific year.

Q2: What are the observational units in the second data set?

A2: For the box-office dataset, the observational units are films ranked by box-office revenue within their release year. Each row represents one film and its financial performance metrics. There are no people-level observations here.

For the demographics dataset (used in the second merge), the observational units are unique person–film pairs with associated demographic attributes such as birthplace, gender, and other personal characteristics.

Q3: What will the observational units be in the merged data set you expect to have as a result of this assignment?

A3: After both merges, the observational units remain individual Oscar nominations. The merges only add information to some nomination:

From the box-office dataset, some nominations gain film-level financial information (only when the film appears in the box-office list).

From the demographics dataset, some nominations gain person-level demographic attributes (only when the person–film pair exists in the demographics table).

The key point: the unit does not change [one row = one nomination] , but each row becomes richer with film-level and person-level data attached.

Q4: Will you be doing a left, right, inner, or outer join?

A4: For both merges, we perform a left join, and for the same core reason: We want to preserve every Oscar nomination from the main dataset.

In the box-office merge, a left join ensures that even films that did not appear among the top box-office performers remain in the dataset. Only films with a match receive box-office information.

In the demographics merge, a left join ensures that every nomination stays in the final dataset even if some individuals lack demographic records. Demographics are added where possible, but never cause removals.

A left join is the only option consistent with the goal of maintaining the full structure of the Oscars dataset while enriching it with optional external attributes.

Because our goal is to analyze how box-office success and demographic attributes relate to Oscar nominations, it is essential to keep every Oscar entry, even for films that never ranked highly at the box office or individuals with missing demographic data. Using a left join preserves the full Oscars dataset while enriching it with additional information wherever available, ensuring that no nomination is lost in the process.

Q5: How many rows and columns do you expect to have in the merged data set, and what is your reason for each of those numbers?

A5: We expect the final merged dataset to have 10,751 rows and 43 columns. The row count remains at 10,751 because we use the Oscars dataset as the left table, and our goal is to keep all valid Oscar nominations. The only rows removed are those where the film title is missing (NaN), since that data wouldn't be useful to our analysis. For the columns: the Oscars dataset contributes 8 variables, the box-office merge adds 7 new variables, and the demographics merge adds 28 variables, for a total of 43 columns in the final dataset.

Importing Data Sets

Project 2 Report:

https://docs.google.com/document/d/1CaOrgIjTXkckUyxUVAIvki8FnkDr5xY4/edit?usp=sharing&ouid=112847349817133872263&rtpof=true&sd=true

oscars_df = pd.read_csv('the_oscar_award_project.csv') oscars_df

Run to view results

box_office_data = pd.read_excel('box_office_data.xlsx') box_office_data

Run to view results

oscars_demographics = pd.read_csv("Oscars-demographics-DFE.csv", encoding="latin-1") oscars_demographics

Run to view results

Cleaning Nan Values

First, we check if the box office data has any empty titles, because those observations would be useless for the merge

box_office_data['title'].isna().sum()

Run to view results

After confirming there are no nan values, we move on to the next dataset

Now we do the same thing for the demographic dataset

oscars_demographics['movie'].isna().sum()

Run to view results

After confirming there are no nan values, we move on to the next dataset

Now we check the Oscars dataset

oscars_df['film'].isna().sum()

Run to view results

We found some nan values here, and since an observation that doesn't correspond to a movie would be useless to our analysis, we can delete these values

oscars_df.shape

Run to view results

Here we execute the cleaning

oscars_df = oscars_df[oscars_df['film'].notna()]

Run to view results

And now we check how the shape of the dataset changed

oscars_df['film'].isna().sum()

Run to view results

oscars_df.shape

Run to view results

Here, we can see in the shape of the dataframe how the 359 invalid rows were deleted, and how no nan values remain

Fixing Data Types

After checking the dataset, we see some columns that appear to have conflicting data types. They should be numeric but are formatted as objects

cols = ['domestic', 'domestic_pct', 'foreign', 'foreign_pct']

Run to view results

First, we check what non-numeric values they have

for col in cols: print(f"\nColumn: {col}") bad_values = box_office_data[col][ pd.to_numeric(box_office_data[col], errors='coerce').isna() ].unique() print(bad_values)

Run to view results

And now we replace them with numerical values

box_office_data['foreign_pct'] = box_office_data['foreign_pct'].replace('<0.1%', '0.09')

Run to view results

Here we replace all the '-' with null values

for col in cols: box_office_data[col] = box_office_data[col].replace('-', np.nan) for col in cols: box_office_data[col] = pd.to_numeric(box_office_data[col]) box_office_data

Run to view results

Now we check for bad dates in the dates column

bad_dates = [] for d in oscars_demographics['date_of_birth'].dropna(): try: pd.to_datetime(d, dayfirst=True) except: bad_dates.append(d) bad_dates = sorted(set(bad_dates)) bad_dates

Run to view results

And now that we have found the issue, we can take out the '[1]' and format it as a date

oscars_demographics['date_of_birth'] = oscars_demographics['date_of_birth'].replace('18-Aug-1936 [1]', '18-Aug-1936')

Run to view results

Here, we apply the date data type

oscars_demographics['date_of_birth'] = pd.to_datetime( oscars_demographics['date_of_birth'], dayfirst=True, format= 'mixed' ) oscars_demographics['date_of_birth']

Run to view results

And now all columns have been correctly formatted

Exploring Matching Observations

Box Office Dataset

Now we need to see how many observations match without making any changes

matching_films = [] for film in box_office_data['title']: if film in oscars_df['film'].values: matching_films.append(film) len(matching_films)

Run to view results

And now how many dont match

non_matching_films = [] for film in box_office_data['title']: if film not in oscars_df['film'].values: non_matching_films.append(film) len(non_matching_films)

Run to view results

Now we use get close matches to get possible matches, and we can store that in a dictionary

matches = {} for film in non_matching_films: close = get_close_matches(film, oscars_df['film'].values) matches[film] = close

Run to view results

Now we display the dictionary in a format in which it is easy to spot positive matches

for name, close_matches in matches.items(): print(f"{name} --> {close_matches}")

Run to view results

After checking all of the fuzzy-match results, none of them produced any new films that needed to be manually merged. A few fuzzy matches correctly pointed to older films that were being re-released, but these do not represent new Oscar overlaps. Overall, the fuzzy matching did not add any valid matches beyond the 19 exact title matches already identified.

Oscar's Demographic Dataset

Now we do the same for the demographics dataset

matching_films_demographics = [] for film in oscars_demographics['movie']: if film in oscars_df['film'].values: matching_films_demographics.append(film) len(matching_films_demographics)

Run to view results

non_matching_films_demographics = [] for film in oscars_demographics['movie']: if film not in oscars_df['film'].values: non_matching_films_demographics.append(film) len(non_matching_films_demographics)

Run to view results

Now we use get close matches to get possible matches, and we can store that in a dictionary

matches_demographics = {} for film in non_matching_films_demographics: close = get_close_matches(film, oscars_df['film'].values) matches_demographics[film] = close

Run to view results

Now we display the dictionary in a format in which it is easy to spot positive matches

matches_demographics

Run to view results

After reading through the matches, we found some movies that were incorrectly matched so we created a new column to store the names of the movies that were correctly matched

oscars_demographics['movie_clean'] = oscars_demographics['movie'] skip = ['Beginning', 'The Passionate Life of Edith Piaf', 'Precious'] for wrong, matches in matches_demographics.items(): if wrong not in skip: correct_value = matches[0] oscars_demographics.loc[ oscars_demographics['movie_clean'] == wrong, 'movie_clean' ] = correct_value

Run to view results

Now we compare the changes made to the original dataset, just to make sure no mistakes were made

changed = oscars_demographics[ oscars_demographics['movie'] != oscars_demographics['movie_clean'] ] changed[['movie', 'movie_clean']]

Run to view results

Now we test how many matches we get with the updated film titles

matching_films_demographics = [] for film in oscars_demographics['movie_clean']: if film in oscars_df['film'].values: matching_films_demographics.append(film) len(matching_films_demographics)

Run to view results

We see how the movies we changed now match the Oscars dataset

for film in oscars_demographics['movie_clean']: if film not in oscars_df['film'].values: close = get_close_matches(film, oscars_df['film'].values) print(f"{film} --> {close}")

Run to view results

After checking every cleaned movie title against the full Oscars dataset, we found that only four titles did not have a valid match: Beginning, The Passionate Life of Edith Piaf, and Precious (which appears twice). These films remain unmatched even after manually searching for possible matches in oscars_df.

oscars_demographics.duplicated(subset=['movie_clean', 'person']).sum()

Run to view results

oscars_demographics = oscars_demographics.drop_duplicates( subset=['movie_clean', 'person'], keep='first' )

Run to view results

Since the demographics dataset can list the same person–movie combination multiple times, we remove these duplicates before merging. We only need one copy of the demographic information for each individual and film, so keeping the first occurrence ensures cleaner data and prevents unnecessary row multiplication during the merge.

Merging the data sets

first_merged = oscars_df.merge( box_office_data, left_on='film', right_on='title', how='left' )

Run to view results

Since each film in the Oscars dataset can receive multiple nominations, we performed the merge solely on the film title. Using a left join ensures that every Oscar nomination row is preserved while attaching box office information only where a film title match exists.

Verifying the First Merge

To verify that the merge worked correctly, I first compared the shapes of the box office dataset, the Oscars dataset, and the merged dataset. Looking at the row and column counts helps confirm whether the merge preserved the expected number of records and added the correct number of variables.

oscars_df.shape, box_office_data.shape, first_merged.shape

Run to view results

The Oscars dataset has 10751 rows and 9 columns, and the merged dataset also has 10751 rows, which confirms that the left join kept every Oscars record and did not accidentally drop or duplicate rows. The number of columns increases from 9 to 17, exactly adding the 8 variables from the 200-row box office dataset, which shows that the box office information was successfully joined to the matching Oscar films.

Next, I check how many rows in the merged dataset actually received a match from the box office table. A non-null value in the title column indicates that a film from the Oscars dataset successfully matched a title in the box office data.

first_merged['title'].notna().sum()

Run to view results

This gives a total of 74 matched rows. However, earlier we found that only 19 unique film titles appear in both datasets. The reason the count is higher here (74 vs. 19) is that Oscar films appear multiple times; each film can receive several nominations across different award categories. Every nomination row for a matched film produces its own matched row in the merged table, which is why the number of matches is larger than the number of unique films.

To confirm that there are no unexpected duplicates, I next count how many unique films appear among the matched rows.

first_merged.loc[first_merged['title'].notna(), 'film'].nunique()

Run to view results

The result shows exactly 19 unique films, which matches the number we previously identified before merging. This confirms that the merge correctly matched only the intended 19 films, and the larger value of 74 reflects multiple nomination rows for those same films.

Merging and Verifying Second Merge

merged = first_merged.merge( oscars_demographics, left_on=['film', 'name'], right_on=['movie_clean', 'person'], how='left' )

Run to view results

To verify that the second merge was performed correctly, I count how many rows in the merged dataset successfully received matching Oscars Demographics information by looking at the number of non-null values in the film column.

oscars_demographics.shape, oscars_df.shape, merged.shape

Run to view results

The shapes confirm that the merge behaved exactly as expected. The Oscars dataset remains at 10,751 rows before and after the merge, which verifies that the left join preserved every nomination record without creating or removing rows. The demographics table contributes 416 unique person–film combinations, and after merging, the final dataset expands to 43 columns, showing that the demographic attributes were successfully attached wherever a matching person–film pair existed. Since the row count did not increase, we know no duplication occurred, and all demographic information merged cleanly into the Oscars records.

Exporting the Merged data set

merged.to_parquet("merged_boxoffice_oscars.parquet") merged

Run to view results

Do Box-Office Performance and Nominee Demographics Affect the Odds of Winning an Oscar? (Amongst Nominees)

In this analysis, we examine whether commercial performance (box-office metrics such as worldwide and domestic revenue) and nominee demographics (e.g., age and other available attributes) are associated with the probability that a nomination becomes a win.

Before testing whether any external factors relate to Oscar outcomes, we first calculate the overall percentage of nominations that win. This baseline win rate serves as our reference point for evaluating whether nominations linked to box-office or demographic information show higher or lower winning rates.

df = merged.copy() df["winner_y"] = df["winner"].astype(int) df["winner_y"].value_counts()

Run to view results

df["winner_y"].mean()

Run to view results

The dataset contains 10,751 nominations, of which 2,239 are winners and 8,512 are non-winners. This implies an overall win rate of 20.83% (2,239 ÷ 10,751). We treat this as the baseline probability that a nomination wins across the full dataset, before considering any external film-level (box-office) or person-level (demographic) factors.

Next, we compare win rates across three groups: (1) all nominations, (2) nominations linked to box-office performance metrics, and (3) nominations linked to demographic attributes. This establishes whether the groups for which we have external factors show different Oscar outcomes, which motivates deeper modeling of the factors themselves.

df["has_box"] = df["worldwide"].notna() df["has_demo"] = df["_unit_id"].notna() win_all = df["winner_y"].mean() win_box = df.loc[df["has_box"], "winner_y"].mean() n_box = df["has_box"].sum() win_demo = df.loc[df["has_demo"], "winner_y"].mean() n_demo = df["has_demo"].sum() win_all, (n_box, win_box), (n_demo, win_demo)

Run to view results

Across the full dataset, the win rate is 20.83%. For nominations linked to box-office data (worldwide not missing), the subset contains 74 nominations and shows a win rate of 22.97%, which is slightly higher than the overall baseline. For nominations linked to demographic data (_unit_id not missing), the subset contains 430 nominations and shows a win rate of 96.97%, meaning this subset is overwhelmingly composed of winners. This is a major red flag for inference: it strongly suggests the demographics dataset is not a representative sample of nominees, but instead is heavily selection-biased toward winners (or was constructed mostly from winners). As a result, demographics variables cannot be used to reliably estimate “what predicts winning among nominees,” because there is almost no variation in the outcome within that subset.

Because the demographics-linked subset is overwhelmingly composed of winners, it cannot support a fair ‘what predicts winning’ model. We therefore focus on the box-office dataset, which includes both winners and non-winners among matched nominations. To keep the analysis interpretable despite small sample size, we group films by box-office rank and compare win rates across rank tiers.

tmp = df[["winner_y", "rank"]].copy() def revenue_group(rank): if pd.isna(rank): return "Not in top 200" elif rank <= 10: return "Top 10" elif rank <= 50: return "11–50" elif rank <= 200: return "51–200" else: return "Other (rank > 200)" tmp["revenue_group"] = tmp["rank"].apply(revenue_group) win_rates = ( tmp.groupby("revenue_group")["winner_y"] .agg(win_rate="mean", num_nominations="count") .reset_index() ) order = ["Top 10", "11–50", "51–200", "Not in top 200"] win_rates["revenue_group"] = pd.Categorical(win_rates["revenue_group"], categories=order, ordered=True) win_rates = win_rates.sort_values("revenue_group") win_rates

Run to view results

The win rates vary noticeably across box-office rank tiers. Nominations associated with Top 10 films have a win rate of 25.0% (16 nominations), nominations tied to films ranked 11–50 have a much lower win rate of 3.85% (26 nominations), and nominations tied to films ranked 51–200 show the highest win rate of 37.5% (32 nominations). Meanwhile, nominations for films not in the top 200 have a win rate of 20.81% (10,677 nominations), which is close to the overall baseline win rate in the full dataset. Although the matched box-office groups are relatively small, these differences suggest that box-office rank may relate to Oscar success in a non-linear way—top-grossing status does not automatically translate into higher win rates, and mid-ranked high performers (51–200) appear to be associated with stronger Oscar outcomes in this sample.

To visualize the relationship between box-office performance and Oscar outcomes, we plot the Oscar win rate for nominations grouped by the film’s box-office rank tier (Top 10, 11–50, 51–200, and Not in Top 200). Each bar shows the share of nominations in that group that ultimately win (winner = 1), and the label above each bar reports both the win rate (%) and the number of nominations (n) in that group. This visualization helps us compare groups at a glance and contextualize differences using the size of each group.

sns.set_theme(style="whitegrid") plt.figure(figsize=(8, 6)) ax = sns.barplot( data=win_rates, x="revenue_group", y="win_rate" ) ax.yaxis.set_major_formatter(lambda x, pos: f"{x*100:.0f}%") plt.xlabel("Box Office Performance Group (Worldwide)") plt.ylabel("Oscar Win Rate (Share of Nominations that Win)") plt.title("Are Top-Grossing Films More Likely to Win Oscars Once Nominated?") for i, row in enumerate(win_rates.itertuples(index=False)): ax.text( i, row.win_rate + 0.005, f"{row.win_rate*100:.1f}%\n(n={int(row.num_nominations)})", ha="center", va="bottom", fontsize=9 ) ax.set_ylim(0, win_rates["win_rate"].max() + 0.07) plt.tight_layout() plt.show()

Run to view results

The bar chart shows that Oscar win rates differ across box-office rank tiers in our sample. Nominations tied to films ranked 51–200 have the highest win rate (about 37.5%), while nominations tied to films ranked 11–50 have the lowest (about 3.8%). The Top 10 group sits around 25%, and the large group of films not in the top 200 is close to the overall baseline win rate (about 20.8%). Overall, the pattern suggests that the relationship between box-office success and winning is not simply “higher gross = higher win rate,” and appears to vary by rank tier.

Now that we’ve visualized win rates by box-office rank tier, we test whether the differences are statistically associated with winning. We do this using a chi-square test of independence on a contingency table of winner vs. revenue group, and we report both the p-value and an effect size (Cramér’s V).

ct = pd.crosstab(tmp["revenue_group"], tmp["winner_y"]) ct

Run to view results

chi2, p, dof, expected = chi2_contingency(ct) n = ct.to_numpy().sum() r, k = ct.shape cramers_v = np.sqrt((chi2 / n) / (min(r-1, k-1))) chi2, dof, p, cramers_v

Run to view results

These results suggest there is an association between box-office rank tier and the probability of winning, but the relationship is weak. The chi-square test returns χ² = 10.11 with 3 degrees of freedom and p = 0.0176, meaning that—within this dataset—we would reject the idea that win/loss is independent of the box-office group at the 5% significance level. However, the effect size is small: Cramér’s V = 0.031, which indicates that while differences in win rates across tiers exist, box-office rank tier explains only a very small portion of the variation in Oscar winning outcomes.

.css-15w88e5{color:var(--chakra-colors-fg-neutral-primary);font-weight:inherit;letter-spacing:-0.09px;}Project 2, MA346, Fall 2025

Importing Data Sets

Project 2 Report:

Cleaning Nan Values

Fixing Data Types

Exploring Matching Observations

Merging the data sets

Verifying the First Merge

Merging and Verifying Second Merge

Exporting the Merged data set

Do Box-Office Performance and Nominee Demographics Affect the Odds of Winning an Oscar? (Amongst Nominees)

Project 2, MA346, Fall 2025