Data Analysis Quick Start
From the Street Drug Analysis Lab at the University of North Carolina at Chapel Hill.
Let us know. We would love your feedback on what you think of our data products and this tutorial. Please email us at opioiddatalab@unc.edu with suggestions, examples, or bugs.
This notebook provides examples of artificial intelligence (AI) prompts and Python 3.7 code for conducting basic analyses of data from the UNC Street Drug Analysis Lab. These analyses are conducted in a Deepnote.com distributed environment; free accounts are available. The data are the public North Carolina drug checking dataset. Because our data are public and anonymous, we feel comfortable using AI-based tools like Deepnote, which allow anyone to analyze the data. For extra credit, you can also turn this kind of code into an app-like live dashboard within Deepnote. See? This is why we insist on making the data public, so y'all can make cool inferences.
If you're new to Deepnote and/or Python, this is a simple tutorial.
Below are just example analyses, and some are simpler than we would recommend for scientific publications. There's a lot more that could be done! But these examples are some of the core functions that you can mix and match for your own needs.
Please refer to Technical Details for codebook and dataset schema.
Google Sheets
Blocks below with the purple sidebar are the AI prompts you can enter to generate the code and results. The text below the blocks of results is machine-generated interpretation; some of it is helpful and some is throwaway. You should interpret your own data yourself, duh. Your code may not look exactly the same, but it should function similarly to the examples below. You can also copy and paste the Python code from this Deepnote into your own and modify it as needed. All of the code in this notebook was generated by AI; not a single line was typed by human fingers. Wild world, eh?
AI can make mistakes. Use your human intuition. Do the results feel right? Does the sample size change abruptly? We strongly recommend opening the data in Google Sheets to see the rows of data so you can do some manual counting and eyeballing.
Create a new Google Sheets document.
Create 2 separate sheets using the tabs at the bottom of the screen by clicking on the + sign to the left of Sheet1. Rename them "card" and "lab" to stay consistent with the code below: click on the down arrow/triangle next to "Sheet1" and go to "Rename."
In the 'card' tab, make sure cell A1 is selected, with a blue box highlighting the empty rectangle. In the formula bar to the right of fx, paste the following:
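The original formula block isn't reproduced here, but it is an IMPORTDATA call along these lines; the raw-file URL below is our assumption based on the repository paths used in the import steps later in this notebook, so substitute your own link if you have one:

```
=IMPORTDATA("https://raw.githubusercontent.com/opioiddatalab/drugchecking/main/datasets/nc/nc_analysis_dataset.csv")
```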
You may get an alert message saying "Warning: Some formulas are trying to send and receive data from external parties." Click "Allow Access."
Now go to the 'lab' tab and repeat with this code:
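Again, the exact formula isn't reproduced here, but it is the same IMPORTDATA pattern pointed at the lab detail file; the raw-file URL is our assumption based on the repository paths used later in this notebook:

```
=IMPORTDATA("https://raw.githubusercontent.com/opioiddatalab/drugchecking/main/datasets/nc/nc_lab_detail.csv")
```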
Alternatively, if you have received a different URL for your own program's data repositories, you can substitute those links in the code above and in the 2 data import steps below.
Seriously. Go open the datasets in Google Sheets before progressing. It should look like this.
It's good practice to always note when the analysis was run.
Report a date and time stamp in the local time zone of when this notebook was run
The notebook was run on April 1, 2024, at 17:26:05 EDT.
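If you prefer to type the code yourself, a minimal sketch of the timestamp step looks like this (the output format is our choice, not the AI's exact code):

```python
from datetime import datetime

# Local-timezone timestamp of when the notebook was run
now = datetime.now().astimezone()
stamp = now.strftime("%B %d, %Y, at %H:%M:%S %Z")
print(f"The notebook was run on {stamp}.")
```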
Import Data
First, import the card data, where one row is an individual sample.
Import https://github.com/opioiddatalab/drugchecking/blob/main/datasets/nc/nc_analysis_dataset.csv as "card" df
The dataset has been successfully imported as a DataFrame named `card_df` and contains 967 rows and 89 columns.
Now import the lab data, which is a row for each substance detected, and thus multiple rows per sample. The two dataframes are linked by 'sampleid' field.
import https://github.com/opioiddatalab/drugchecking/blob/main/datasets/nc/nc_lab_detail.csv as 'lab' df
The lab dataset has been successfully imported as a DataFrame named `lab_df` and contains 3881 rows and 47 columns.
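A hand-written version of the two imports might look like the sketch below. Note that the github.com ".../blob/..." links above are HTML pages; pandas needs the raw-file form, so we use a small helper (the helper name and the commented-out fetch lines are our own additions, and the fetches are commented out because they need network access):

```python
import pandas as pd

def raw_github_url(blob_url):
    """Convert a github.com '.../blob/...' page URL to its raw-file form."""
    return (blob_url
            .replace("https://github.com/", "https://raw.githubusercontent.com/")
            .replace("/blob/", "/"))

card_url = raw_github_url(
    "https://github.com/opioiddatalab/drugchecking/blob/main/datasets/nc/nc_analysis_dataset.csv")
lab_url = raw_github_url(
    "https://github.com/opioiddatalab/drugchecking/blob/main/datasets/nc/nc_lab_detail.csv")

# card = pd.read_csv(card_url)  # one row per sample
# lab = pd.read_csv(lab_url)    # one row per substance detected
```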
Basic Analysis
Using 'card' df, tell me the earliest and latest date_complete, which is the date the sample was analyzed in the lab. And to show off, tell me what day of the week those were.
The earliest date on which a sample was analyzed in the lab is January 26, 2022, which was a Wednesday. The latest date of analysis is March 28, 2024, which falls on a Thursday.
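The equivalent pandas is short. Here is a sketch using a tiny toy stand-in for 'card' (the real dates come from the imported data):

```python
import pandas as pd

# Toy stand-in for the real 'card' df imported above
card = pd.DataFrame({"date_complete": ["2022-01-26", "2023-06-15", "2024-03-28"]})

dates = pd.to_datetime(card["date_complete"])
earliest, latest = dates.min(), dates.max()
print(earliest.strftime("%A, %B %d, %Y"))  # Wednesday, January 26, 2022
print(latest.strftime("%A, %B %d, %Y"))    # Thursday, March 28, 2024
```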
Samples
Say we want to know how many samples have been analyzed.
Using 'card' df, tell me the number of unique 'sampleid'
Now, let's say we want to know how many samples per month have been analyzed.
Show me a line graph with shaded fill of how many samples have been analyzed by month. Use 'card' df, and 'date_complete' to generate.
The line graph above displays the number of samples analyzed by month, with a shaded fill to visually represent the data over time.
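Both the sample count and the monthly aggregation boil down to a couple of pandas calls. A sketch on toy data (the plotting line is left as a comment since it only makes sense interactively):

```python
import pandas as pd

# Toy 'card' df standing in for the real data
card = pd.DataFrame({
    "sampleid": [1, 2, 3, 4, 5],
    "date_complete": ["2024-01-05", "2024-01-20", "2024-02-02",
                      "2024-02-14", "2024-03-01"],
})

n_samples = card["sampleid"].nunique()          # number of unique samples
monthly = (pd.to_datetime(card["date_complete"])
             .dt.to_period("M")                 # bucket dates into months
             .value_counts()
             .sort_index())
print(n_samples)  # 5
print(monthly)
# monthly.plot.area() would draw the shaded line graph
```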
Lab Results
Say we want to know the most common substance detected, in all of NC for all dates.
Using 'lab' df, tell me the unique number of substances detected using 'substance'
The dataset contains 125 unique substances detected.
Show me a table of the top 25 most common substances in 'lab'
The table above lists the top 25 most common substances detected in the lab dataset, along with their counts. Fentanyl is the most common substance detected, followed by 4-ANPP and methamphetamine.
Show me a horizontal bar graph of the 25 most common substances detected, with enough room on the label axis to show the full name in 'substance'
The horizontal bar graph above displays the top 25 most common substances detected, with the substance names on the y-axis and their frequencies on the x-axis. The graph is organized so that the substances with the highest frequency are at the top.
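The unique count, the top-25 table, and the bar chart all come from `value_counts`. A sketch on toy 'lab' rows:

```python
import pandas as pd

# Toy 'lab' df; each row is one substance detected in one sample
lab = pd.DataFrame({"substance": ["fentanyl", "fentanyl", "fentanyl",
                                  "4-ANPP", "4-ANPP", "methamphetamine"]})

n_unique = lab["substance"].nunique()
top = lab["substance"].value_counts().head(25)
print(n_unique)  # 3
print(top)
# top.sort_values().plot.barh() would draw the horizontal bar chart;
# plt.tight_layout() leaves room for long substance names on the label axis
```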
Primary vs. Trace Abundance. The results above include substances in both primary and trace abundance. Say we want to look only at primary abundance, which is usually recommended for public health settings because these are the substances that cause most of the biological effects. "Trace" is defined as less than 5% of the chromatogram peak height of the most abundant substance. This is not the same as "less than 5% purity" -- instead, it is a relative measure that roughly identifies things detected in very low quantity but still present. In other areas of forensic chemistry, trace amounts are not routinely reported, but they can sometimes cause side effects in humans. So, we chose to include them in our results.
The 'abundance' field in 'lab' describes if the substance was found in primary or trace abundance. If it is NaN or missing, it should be converted to primary. Show me the same graph as in the block above, but instead only for substances found in primary abundance.
This is similar to the first graph, but shows just the primary substances.
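The key data-cleaning step here is treating missing abundance as primary before filtering. A sketch on toy data:

```python
import pandas as pd

lab = pd.DataFrame({
    "substance": ["fentanyl", "caffeine", "xylazine"],
    "abundance": ["primary", "trace", None],  # NaN abundance means primary
})

# Missing abundance is treated as primary, per the codebook note above
lab["abundance"] = lab["abundance"].fillna("primary")
primary = lab[lab["abundance"] == "primary"]
print(primary["substance"].tolist())  # ['fentanyl', 'xylazine']
```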
Geographic Analyses
Say we want to see a map of all the counties from which we have sample data.
Tell me how many samples have missing county information ('county' = "No county" or location is NaN), and as a percent of the total
One observation is missing county information.
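The missing-county check is a one-liner once you combine the two conditions. A sketch on toy data:

```python
import pandas as pd

card = pd.DataFrame({"county": ["Durham County", "No county", None, "Wake County"]})

missing = card["county"].isna() | card["county"].eq("No county")
n_missing = int(missing.sum())
pct = round(100 * n_missing / len(card), 1)
print(n_missing, pct)  # 2 50.0
```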
Create a categorized choropleth county map of NC using 'card' that shows how many samples have been collected from each county. To do this, start with 'card' and use 'sampleid' to count the number of rows by 'county'. For NaN convert to 0. Call this df 'nc_county_samples'. Categorize the counties in 4 categories, with the lowest category only containing counties with zero samples, and the rest of the counties divided into 3 categories: low, medium, high -- low samples are less than 10, high samples are more than 35, and medium is in between. County names in 'county' also have " County" in the names, and those need to be trimmed. In the plot, remove x and y axis labels. In the legend, rename the zero sample color to "No samples", and display the count ranges for the other 3 categories. Use plotly and geojson, and import nc_counties.geojson from GitHub https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json using the URL. Geopandas is already installed.
The categorized choropleth map of North Carolina has been created, showing the number of samples collected from each county. The counties are categorized into four groups: "No samples" (light grey), "Low" (yellow), "Medium" (orange), and "High" (red), based on the number of samples collected. The map provides a visual representation of the distribution of sample collection across the state. This version can be downloaded.
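The mapping code itself is long, but the categorization logic in the prompt is the reusable part. A sketch of the binning step on toy per-county counts (the real counts come from counting 'sampleid' by trimmed 'county'):

```python
import pandas as pd

# Toy per-county sample counts (county names already trimmed of " County")
counts = pd.Series({"Durham": 40, "Wake": 12, "Clay": 3, "Tyrrell": 0})

# Zero gets its own category; the rest split at <10 and >35 per the prompt
bins = [-1, 0, 9, 35, float("inf")]
labels = ["No samples", "Low (1-9)", "Medium (10-35)", "High (36+)"]
categories = pd.cut(counts, bins=bins, labels=labels)
print(categories)
```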
Can you make the map look nicer using a different package?
I've created a more visually appealing map using the Folium package, which provides a detailed and interactive visualization of the number of samples collected from each county in North Carolina. This map uses the 'YlOrRd' color scheme for the choropleth layer, enhancing the visual distinction between different levels of sample collection across counties.
Say we want the list of top 10 substances found in NC Medicaid Region 1, which is the western part of the state. Deepnote doesn't have access to that list of counties, so we have to provide it in the prompt. We also have to teach the AI to link the 'card' and 'lab' dataframes because the location is found in 'card' and the substances are in 'lab'.
Tell me the top 10 substances found in the following counties: Cherokee, Clay, Graham, Swain, Macon, Jackson, Haywood, Transylvania, Buncombe, Yancey, Madison, Mitchell, Avery, Caldwell, Burke, McDowell, Rutherford, Polk, Henderson. In the 'card' df, the county name is 'county' field, but it also has " County" after each county name, so that will need to be accounted for. Ignore any samples that are missing county. In order to do this analysis, link sampleid in 'card' and 'lab' df because county is in 'card' and the list of substances is in 'lab'.
The top 10 substances found in the specified counties are as follows: 1. Fentanyl: 118 occurrences 2. Methamphetamine: 114 occurrences 3. 4-ANPP: 111 occurrences 4. Cocaine: 38 occurrences 5. Phenethyl 4-ANPP: 35 occurrences 6. Ethyl-4-ANPP: 31 occurrences 7. Xylazine: 29 occurrences 8. P-Fluorofentanyl: 25 occurrences 9. N-Phenylpropanamide: 22 occurrences 10. Despropionyl P-Fluorofentanyl: 17 occurrences
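The merge-then-filter pattern the prompt describes looks like this in pandas, sketched on toy rows (the full region list is in the prompt above; only a few counties appear in the toy data):

```python
import pandas as pd

card = pd.DataFrame({
    "sampleid": [1, 2, 3],
    "county": ["Buncombe County", "Durham County", "Haywood County"],
})
lab = pd.DataFrame({
    "sampleid": [1, 1, 2, 3],
    "substance": ["fentanyl", "xylazine", "cocaine", "fentanyl"],
})

region1 = {"Cherokee", "Clay", "Graham", "Swain", "Macon", "Jackson", "Haywood",
           "Transylvania", "Buncombe", "Yancey", "Madison", "Mitchell", "Avery",
           "Caldwell", "Burke", "McDowell", "Rutherford", "Polk", "Henderson"}

# Trim " County" so names match the region list, then link card -> lab on sampleid
card["county_clean"] = card["county"].str.replace(" County", "", regex=False)
merged = lab.merge(card[["sampleid", "county_clean"]], on="sampleid")
top10 = (merged[merged["county_clean"].isin(region1)]["substance"]
         .value_counts().head(10))
print(top10)
```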
For these filtered counties, I want to know how many samples (unique sampleid) that had substance=fentanyl also contained the substance xylazine. So, use the number of unique samples in a county that contain fentanyl as the denominator, and the numerator is the number of samples with xylazine. Show me a table, by county, with the following columns: 1) 'All samples' as the total number of samples regardless of substance; 2) the number of samples with fentanyl as 'Fentanyl Samples'; 3) number of samples with xylazine AND fentanyl as "Xylazine + Fentanyl"; 4) percent of samples dividing column 3 by column 2, rounding to 1 decimal place, and adding a % symbol. Make NaN zero. Sort by percent.
The table has been successfully generated, but it seems there was an issue with the calculation of the 'All Samples' and 'Fentanyl Samples' columns, as well as the sorting by percent, which resulted in 'nan%' values for some counties. This discrepancy may be due to incorrect aggregation or filtering logic. Let's correct the approach and ensure the table meets the specified requirements.
The corrected table now accurately represents the data as requested: - **Henderson County** and **Jackson County** both have 50.0% of their Fentanyl samples also containing Xylazine, the highest percentages among the counties. - **Buncombe County** follows with 23.1% of Fentanyl samples also containing Xylazine. - **Macon County** has 21.4% of its Fentanyl samples also containing Xylazine. - The remaining counties either have 0% or did not have any samples with Fentanyl, thus showing 0.0%.
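The crucial step, and the one the AI initially got wrong, is collapsing 'lab' to one row per sample with boolean flags before counting by county. A sketch on toy data (column names here are our own shorthand, not the display names in the prompt):

```python
import pandas as pd

card = pd.DataFrame({
    "sampleid": [1, 2, 3, 4],
    "county": ["Buncombe County", "Buncombe County",
               "Haywood County", "Haywood County"],
})
lab = pd.DataFrame({
    "sampleid": [1, 1, 2, 3, 4],
    "substance": ["fentanyl", "xylazine", "fentanyl", "fentanyl", "cocaine"],
})

# One row per sample with boolean substance flags, joined to its county
flags = (lab.assign(fent=lab["substance"].eq("fentanyl"),
                    xyl=lab["substance"].eq("xylazine"))
            .groupby("sampleid")[["fent", "xyl"]].any()
            .join(card.set_index("sampleid")["county"]))

table = flags.groupby("county").agg(
    all_samples=("fent", "size"),
    fentanyl=("fent", "sum"),
)
table["xyl_and_fent"] = (flags["fent"] & flags["xyl"]).groupby(flags["county"]).sum()
table["pct"] = (100 * table["xyl_and_fent"] / table["fentanyl"]).fillna(0).round(1)
print(table.sort_values("pct", ascending=False))
```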
Analysis with Data Cleaning
Starting with the 'card' df, the 'color' variable is a list of colors reported for the drug, delimited with a semi-colon (;). Each sample is a row, and a sample could have more than one color. These will need to be parsed. I would like a frequency table by color of the number of samples containing each color, knowing that samples can have more than one color and the total will be higher than the total number of samples. There may be trailing spaces after parsing that need to be removed for appropriate binning.
The frequency table of colors reported for the drug samples has been successfully created. The table lists each color along with the number of samples containing that color. The color "white" is the most frequently reported, appearing in 424 samples, followed by "clear" in 87 samples, and "purple" in 63 samples. Other colors appear with varying frequencies down to several colors that only appear once.
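The split-explode-strip pattern the prompt asks for is idiomatic pandas. A sketch on toy rows:

```python
import pandas as pd

card = pd.DataFrame({"color": ["white", "white; blue", "blue ; clear", None]})

# Split the ;-delimited lists into one color per row, trimming stray spaces
colors = card["color"].dropna().str.split(";").explode().str.strip()
freq = colors.value_counts()
print(freq)  # white: 2, blue: 2, clear: 1
```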
Can you simplify the list by consolidating similar colors like "light" and "dark"
The list of colors has been simplified by consolidating similar colors, such as combining "light" and "dark" variations into their base colors, and treating "gray" and "clear" as "grey" and "colorless" respectively. The consolidated list now provides a more generalized view of the color frequencies in the dataset.
Expected vs. Actual
I want to compare expected versus actual lab results for methamphetamine. First, in 'card' see if 'expectedsubstance' has "methamphetamine" -- there could be more than one expected substance so use regex match. Then compare with the 'lab_meth' variable in 'card' as the gold standard. Show a 2x2 of expected vs. actual results, and then calculate sensitivity, specificity, positive predictive value, negative predictive value, false positives, and false negatives. Report those to two decimal places.
The comparison between expected and actual lab results for methamphetamine shows the following metrics: - Sensitivity: 0.91 - Specificity: 0.91 - Positive Predictive Value: 0.71 - Negative Predictive Value: 0.98 - False Positives: 68 - False Negatives: 16
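The 2x2 logic is worth seeing spelled out. A minimal sketch on toy rows, treating 'lab_meth' as the gold standard and a regex match on 'expectedsubstance' as the "test" (the toy numbers are illustrative, not the real results above):

```python
import pandas as pd

card = pd.DataFrame({
    "expectedsubstance": ["methamphetamine", "heroin",
                          "methamphetamine; fentanyl", "fentanyl"],
    "lab_meth": [1, 0, 0, 1],
})

expected = card["expectedsubstance"].str.contains("methamphetamine",
                                                  case=False, na=False)
actual = card["lab_meth"].eq(1)

tp = int((expected & actual).sum())    # expected meth, lab confirmed
fp = int((expected & ~actual).sum())   # expected meth, lab negative
fn = int((~expected & actual).sum())   # not expected, lab positive
tn = int((~expected & ~actual).sum())  # not expected, lab negative

sensitivity = round(tp / (tp + fn), 2)
specificity = round(tn / (tn + fp), 2)
ppv = round(tp / (tp + fp), 2)
npv = round(tn / (tn + fn), 2)
print(sensitivity, specificity, ppv, npv)
```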
Creating Drug Identifying Flags
We have created some variables for you in the analysis_dataset.* (or 'card' df) that show 1/0 if a particular substance was detected. For example, lab_fentanyl shows 1 if fentanyl was detected in primary abundance (only) in the lab, and 0 if it wasn't. Similarly, lab_fentanyl_any shows 1 or 0 if fentanyl was detected in primary OR trace abundance. But what if you want to add a column to analysis_dataset.* for a new substance? Let's look at something simple like heroin. We will create a new lab_heroin variable that shows its presence in primary abundance.
I want to create a new variable in 'card' df called 'lab_heroin' to designate if 'lab' df 'substance' contains "heroin" AND abundance is "primary" or NaN, linking by sampleid. Create this and tell me the count of heroin-containing samples, and the percent of total samples. Then, show me the updated df with the column added.
The new variable `lab_heroin` has been successfully created in the `card` DataFrame, indicating if a sample contains heroin in primary abundance. There are 74 heroin-containing samples, which is approximately 7.65% of the total samples. The updated DataFrame now includes this new column.
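The flag-creation pattern, collect the qualifying sampleids from 'lab', then mark them in 'card', looks like this on toy data:

```python
import pandas as pd

card = pd.DataFrame({"sampleid": [1, 2, 3]})
lab = pd.DataFrame({
    "sampleid": [1, 2, 3],
    "substance": ["heroin", "fentanyl", "heroin"],
    "abundance": ["primary", "primary", "trace"],
})

# Samples with heroin in primary abundance (NaN abundance counts as primary)
heroin_ids = lab.loc[
    lab["substance"].str.contains("heroin", case=False, na=False)
    & lab["abundance"].fillna("primary").eq("primary"),
    "sampleid",
].unique()

card["lab_heroin"] = card["sampleid"].isin(heroin_ids).astype(int)
n = int(card["lab_heroin"].sum())
pct = round(100 * n / card["sampleid"].nunique(), 2)
print(n, pct)  # 1 33.33
```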
Now, we can see how many fentanyl-containing samples also contained heroin.
How many and what percent of fentanyl-containing samples (lab_fentanyl) also contained heroin (lab_heroin)?
Out of the total 522 fentanyl-containing samples, 65 also contained heroin, which represents approximately 12.45% of the fentanyl-containing samples.
Now, let's see how this has changed over time.
Show a timeline line graph of how this percent has changed over time. Use 3-month calendar quarters to aggregate
The line graph above shows the percentage of fentanyl samples containing heroin aggregated by 3-month calendar quarters over time.
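The quarterly aggregation reduces to a groupby on a period column; because lab_heroin is 1/0, its mean within each quarter is the fraction you want. A sketch on toy rows:

```python
import pandas as pd

card = pd.DataFrame({
    "date_complete": ["2023-01-10", "2023-02-20", "2023-04-05", "2023-05-15"],
    "lab_fentanyl": [1, 1, 1, 0],
    "lab_heroin":   [1, 0, 1, 1],
})

# Restrict to fentanyl samples, then take the heroin rate per calendar quarter
fent = card[card["lab_fentanyl"] == 1].copy()
fent["quarter"] = pd.to_datetime(fent["date_complete"]).dt.to_period("Q")
pct_by_quarter = (fent.groupby("quarter")["lab_heroin"].mean() * 100).round(1)
print(pct_by_quarter)  # 2023Q1: 50.0, 2023Q2: 100.0
# pct_by_quarter.plot() would draw the timeline
```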
Say we wanted to see how much overlap there was between lab_heroin and lab_meth and lab_fentanyl.
generate a venn diagram of lab_heroin, lab_meth, lab_fentanyl, showing the counts for each area of overlap using 'card'
The Venn diagram above visualizes the overlap between samples containing heroin, methamphetamine, and fentanyl. Each circle represents one of these substances, and the overlaps between circles indicate the number of samples that contain combinations of these substances.
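A Venn diagram is really seven region counts. A sketch that computes them from the 1/0 flags, in the subset order the matplotlib_venn package's venn3 function expects (the plotting call is commented out since that package may need installing; the toy flags are illustrative):

```python
import pandas as pd

card = pd.DataFrame({
    "lab_heroin":   [1, 1, 0, 0, 1],
    "lab_meth":     [0, 1, 1, 0, 1],
    "lab_fentanyl": [1, 0, 1, 1, 1],
})

h = card["lab_heroin"].eq(1)
m = card["lab_meth"].eq(1)
f = card["lab_fentanyl"].eq(1)

# Region counts in venn3's order:
# (A only, B only, A&B only, C only, A&C only, B&C only, A&B&C)
subsets = (
    int((h & ~m & ~f).sum()), int((~h & m & ~f).sum()), int((h & m & ~f).sum()),
    int((~h & ~m & f).sum()), int((h & ~m & f).sum()), int((~h & m & f).sum()),
    int((h & m & f).sum()),
)
print(subsets)
# from matplotlib_venn import venn3  # if installed
# venn3(subsets=subsets, set_labels=("Heroin", "Meth", "Fentanyl"))
```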