Data Analysis Quick Start
From the Street Drug Analysis Lab at the University of North Carolina at Chapel Hill.
Let us know. We would love your feedback on what you think of our data products and this tutorial. Please email us at opioiddatalab@unc.edu with suggestions, examples, or bugs.
This notebook provides examples of artificial intelligence (AI) prompts and Python 3.7 code for conducting basic analyses of data from the UNC Street Drug Analysis Lab. These analyses are conducted in a Deepnote.com distributed environment; free accounts are available. The data are the public North Carolina drug checking dataset. Because our data are public and anonymous, we feel comfortable using AI-based tools like Deepnote, which allow anyone to analyze the data. For extra credit, you can also turn this kind of code into an app-like live dashboard within Deepnote. See? This is why we insist on making the data public, so y'all can make cool inferences.
If you're new to Deepnote and/or Python, this is a simple tutorial.
Below are just example analyses, and some are simpler than we would recommend for scientific publications. There's a lot more that could be done! But these examples are some of the core functions that you can mix and match for your own needs.
Please refer to Technical Details for codebook and dataset schema.
Google Sheets
Blocks below with the purple sidebar are the AI prompts you can enter to generate the code and results. The text below the blocks of results is machine-generated interpretation; some of it is helpful and some is throwaway. You should interpret your own data yourself, duh. Your code may not look exactly the same, but it should function similarly to the examples below. You can also copy and paste the Python code from this Deepnote into your own and modify it as needed. All of the code in this notebook was generated by AI; not a single line was typed by human fingers. Wild world, eh?
AI can make mistakes. Use your human intuition. Do the results feel right? Does the sample size change abruptly? We strongly recommend opening the data in Google Sheets to see the rows of data so you can do some manual counting and eyeballing.
Create a new Google Sheets document.
Create 2 separate sheets using the tabs at the bottom of the screen by clicking on the + sign to the left of Sheet1. Rename them "card" and "lab" to stay consistent with the code below: click on the down arrow/triangle next to "Sheet1" and go to "Rename."
In the 'card' tab, make sure cell A1 is selected, with a blue box highlighting the empty rectangle. In the formula bar to the right of fx, paste the following:
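The original formula block isn't reproduced here, but it is an IMPORTDATA call along these lines; the raw-file URL below is our assumption based on the repository paths used in the import steps later in this notebook, so substitute your own link if you have one:

```
=IMPORTDATA("https://raw.githubusercontent.com/opioiddatalab/drugchecking/main/datasets/nc/nc_analysis_dataset.csv")
```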
You may get an alert message saying "Warning: Some formulas are trying to send and receive data from external parties." Click "Allow Access."
Now go to the 'lab' tab and repeat with this code:
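Again, the exact formula isn't reproduced here, but it is the same IMPORTDATA pattern pointed at the lab detail file; the raw-file URL is our assumption based on the repository paths used later in this notebook:

```
=IMPORTDATA("https://raw.githubusercontent.com/opioiddatalab/drugchecking/main/datasets/nc/nc_lab_detail.csv")
```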
Alternatively, if you have received a different URL for your own program's data repositories, you can substitute those links in the code above and in the 2 data import steps below.
Seriously. Go open the datasets in Google Sheets before progressing. It should look like this.
It's good practice to always note when the analysis was run.
Report a date and time stamp in the local time zone of when this notebook was run
The notebook was run on April 1, 2024, at 17:26:05 EDT.
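If you prefer to type the code yourself, a minimal sketch of the timestamp step looks like this (the output format is our choice, not the AI's exact code):

```python
from datetime import datetime

# Local-timezone timestamp of when the notebook was run
now = datetime.now().astimezone()
stamp = now.strftime("%B %d, %Y, at %H:%M:%S %Z")
print(f"The notebook was run on {stamp}.")
```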
Import Data
First, import the card data, where one row is an individual sample.
Import https://github.com/opioiddatalab/drugchecking/blob/main/datasets/nc/nc_analysis_dataset.csv as "card" df
The dataset has been successfully imported as a DataFrame named `card_df` and contains 967 rows and 89 columns.
Now import the lab data, which is a row for each substance detected, and thus multiple rows per sample. The two dataframes are linked by 'sampleid' field.
import https://github.com/opioiddatalab/drugchecking/blob/main/datasets/nc/nc_lab_detail.csv as 'lab' df
The lab dataset has been successfully imported as a DataFrame named `lab_df` and contains 3881 rows and 47 columns.
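A hand-written version of the two imports might look like the sketch below. Note that the github.com ".../blob/..." links above are HTML pages; pandas needs the raw-file form, so we use a small helper (the helper name and the commented-out fetch lines are our own additions, and the fetches are commented out because they need network access):

```python
import pandas as pd

def raw_github_url(blob_url):
    """Convert a github.com '.../blob/...' page URL to its raw-file form."""
    return (blob_url
            .replace("https://github.com/", "https://raw.githubusercontent.com/")
            .replace("/blob/", "/"))

card_url = raw_github_url(
    "https://github.com/opioiddatalab/drugchecking/blob/main/datasets/nc/nc_analysis_dataset.csv")
lab_url = raw_github_url(
    "https://github.com/opioiddatalab/drugchecking/blob/main/datasets/nc/nc_lab_detail.csv")

# card = pd.read_csv(card_url)  # one row per sample
# lab = pd.read_csv(lab_url)    # one row per substance detected
```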
Basic Analysis
Using 'card' df, tell me the earliest and latest date_complete, which is the date the sample was analyzed in the lab. And to show off, tell me what day of the week those were.
The earliest date on which a sample was analyzed in the lab is January 26, 2022, which was a Wednesday. The latest date of analysis is March 28, 2024, which falls on a Thursday.
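The equivalent pandas is short. Here is a sketch using a tiny toy stand-in for 'card' (the real dates come from the imported data):

```python
import pandas as pd

# Toy stand-in for the real 'card' df imported above
card = pd.DataFrame({"date_complete": ["2022-01-26", "2023-06-15", "2024-03-28"]})

dates = pd.to_datetime(card["date_complete"])
earliest, latest = dates.min(), dates.max()
print(earliest.strftime("%A, %B %d, %Y"))  # Wednesday, January 26, 2022
print(latest.strftime("%A, %B %d, %Y"))    # Thursday, March 28, 2024
```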
Samples
Say we want to know how many samples have been analyzed.
Using 'card' df, tell me the number of unique 'sampleid'
Now, let's say we want to know how many samples per month have been analyzed.
Show me a line graph with shaded fill of how many samples have been analyzed by month. Use 'card' df, and 'date_complete' to generate.
The line graph above displays the number of samples analyzed by month, with a shaded fill to visually represent the data over time.
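Both the sample count and the monthly aggregation boil down to a couple of pandas calls. A sketch on toy data (the plotting line is left as a comment since it only makes sense interactively):

```python
import pandas as pd

# Toy 'card' df standing in for the real data
card = pd.DataFrame({
    "sampleid": [1, 2, 3, 4, 5],
    "date_complete": ["2024-01-05", "2024-01-20", "2024-02-02",
                      "2024-02-14", "2024-03-01"],
})

n_samples = card["sampleid"].nunique()          # number of unique samples
monthly = (pd.to_datetime(card["date_complete"])
             .dt.to_period("M")                 # bucket dates into months
             .value_counts()
             .sort_index())
print(n_samples)  # 5
print(monthly)
# monthly.plot.area() would draw the shaded line graph
```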
Lab Results
Say we want to know the most common substance detected, in all of NC for all dates.
Using 'lab' df, tell me the unique number of substances detected using 'substance'
The dataset contains 125 unique substances detected.
Show me a table of the top 25 most common substances in 'lab'
The table above lists the top 25 most common substances detected in the lab dataset, along with their counts. Fentanyl is the most common substance detected, followed by 4-ANPP and methamphetamine.
Show me a horizontal bar graph of the 25 most common substances detected, with enough room on the label axis to show the full name in 'substance'
The horizontal bar graph above displays the top 25 most common substances detected, with the substance names on the y-axis and their frequencies on the x-axis. The graph is organized so that the substances with the highest frequency are at the top.
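The unique count, the top-25 table, and the bar chart all come from `value_counts`. A sketch on toy 'lab' rows:

```python
import pandas as pd

# Toy 'lab' df; each row is one substance detected in one sample
lab = pd.DataFrame({"substance": ["fentanyl", "fentanyl", "fentanyl",
                                  "4-ANPP", "4-ANPP", "methamphetamine"]})

n_unique = lab["substance"].nunique()
top = lab["substance"].value_counts().head(25)
print(n_unique)  # 3
print(top)
# top.sort_values().plot.barh() would draw the horizontal bar chart;
# plt.tight_layout() leaves room for long substance names on the label axis
```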
Primary vs. Trace Abundance. The results above include substances in both primary and trace abundance. Say we want to look only at primary abundance, which is usually recommended for public health settings because these are the substances that cause most of the biological effects. "Trace" is defined as less than 5% of the chromatogram peak height of the most abundant substance. This is not the same as "less than 5% purity" -- instead, it is a relative measure that roughly identifies things detected in very low quantity but still present. In other areas of forensic chemistry, trace amounts are not routinely reported, but they can sometimes cause side effects in humans. So, we chose to include them in our results.
The 'abundance' field in 'lab' describes if the substance was found in primary or trace abundance. If it is NaN or missing, it should be converted to primary. Show me the same graph as in the block above, but instead only for substances found in primary abundance.
This is similar to the first graph, but shows just the primary substances.
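The key data-cleaning step here is treating missing abundance as primary before filtering. A sketch on toy data:

```python
import pandas as pd

lab = pd.DataFrame({
    "substance": ["fentanyl", "caffeine", "xylazine"],
    "abundance": ["primary", "trace", None],  # NaN abundance means primary
})

# Missing abundance is treated as primary, per the codebook note above
lab["abundance"] = lab["abundance"].fillna("primary")
primary = lab[lab["abundance"] == "primary"]
print(primary["substance"].tolist())  # ['fentanyl', 'xylazine']
```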
Geographic Analyses
Say we want to see a map of all the counties from which we have sample data.
Tell me how many samples have missing county information ('county' = "No county" or location is NaN), and as a percent of the total
One observation is missing county information.
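The missing-county check is a one-liner once you combine the two conditions. A sketch on toy data:

```python
import pandas as pd

card = pd.DataFrame({"county": ["Durham County", "No county", None, "Wake County"]})

missing = card["county"].isna() | card["county"].eq("No county")
n_missing = int(missing.sum())
pct = round(100 * n_missing / len(card), 1)
print(n_missing, pct)  # 2 50.0
```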
Create a categorized choropleth county map of NC using 'card' that shows how many samples have been collected from each county. To do this, start with 'card' and use 'sampleid' to count the number of rows by 'county'. For NaN convert to 0. Call this df 'nc_county_samples'. Categorize the counties in 4 categories, with the lowest category only containing counties with zero samples, and the rest of the counties divided into 3 categories: low, medium, high -- low samples are less than 10, high samples are more than 35, and medium is in between. County names in 'county' also have " County" in the names, and those need to be trimmed. In the plot, remove x and y axis labels. In the legend, rename the zero sample color to "No samples", and display the count ranges for the other 3 categories. Use plotly and geojson, and import nc_counties.geojson from GitHub https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json using the URL. Geopandas is already installed.
The categorized choropleth map of North Carolina has been created, showing the number of samples collected from each county. The counties are categorized into four groups: "No samples" (light grey), "Low" (yellow), "Medium" (orange), and "High" (red), based on the number of samples collected. The map provides a visual representation of the distribution of sample collection across the state. This version can be downloaded.
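The mapping code itself is long, but the categorization logic in the prompt is the reusable part. A sketch of the binning step on toy per-county counts (the real counts come from counting 'sampleid' by trimmed 'county'):

```python
import pandas as pd

# Toy per-county sample counts (county names already trimmed of " County")
counts = pd.Series({"Durham": 40, "Wake": 12, "Clay": 3, "Tyrrell": 0})

# Zero gets its own category; the rest split at <10 and >35 per the prompt
bins = [-1, 0, 9, 35, float("inf")]
labels = ["No samples", "Low (1-9)", "Medium (10-35)", "High (36+)"]
categories = pd.cut(counts, bins=bins, labels=labels)
print(categories)
```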
Can you make the map look nicer using a different package?
I've created a more visually appealing map using the Folium package, which provides a detailed and interactive visualization of the number of samples collected from each county in North Carolina. This map uses the 'YlOrRd' color scheme for the choropleth layer, enhancing the visual distinction between different levels of sample collection across counties.
Say we want the list of top 10 substances found in NC Medicaid Region 1, which is the western part of the state. Deepnote doesn't have access to that list of counties, so we have to provide it in the prompt. We also have to teach the AI to link the 'card' and 'lab' dataframes because the location is found in 'card' and the substances are in 'lab'.
Tell me the top 10 substances found in the following counties: Cherokee, Clay, Graham, Swain, Macon, Jackson, Haywood, Transylvania, Buncombe, Yancey, Madison, Mitchell, Avery, Caldwell, Burke, McDowell, Rutherford, Polk, Henderson. In the 'card' df, the county name is 'county' field, but it also has " County" after each county name, so that will need to be accounted for. Ignore any samples that are missing county. In order to do this analysis, link sampleid in 'card' and 'lab' df because county is in 'card' and the list of substances is in 'lab'.
The top 10 substances found in the specified counties are as follows: 1. Fentanyl: 118 occurrences 2. Methamphetamine: 114 occurrences 3. 4-ANPP: 111 occurrences 4. Cocaine: 38 occurrences 5. Phenethyl 4-ANPP: 35 occurrences 6. Ethyl-4-ANPP: 31 occurrences 7. Xylazine: 29 occurrences 8. P-Fluorofentanyl: 25 occurrences 9. N-Phenylpropanamide: 22 occurrences 10. Despropionyl P-Fluorofentanyl: 17 occurrences
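The merge-then-filter pattern the prompt describes looks like this in pandas, sketched on toy rows (the full region list is in the prompt above; only a few counties appear in the toy data):

```python
import pandas as pd

card = pd.DataFrame({
    "sampleid": [1, 2, 3],
    "county": ["Buncombe County", "Durham County", "Haywood County"],
})
lab = pd.DataFrame({
    "sampleid": [1, 1, 2, 3],
    "substance": ["fentanyl", "xylazine", "cocaine", "fentanyl"],
})

region1 = {"Cherokee", "Clay", "Graham", "Swain", "Macon", "Jackson", "Haywood",
           "Transylvania", "Buncombe", "Yancey", "Madison", "Mitchell", "Avery",
           "Caldwell", "Burke", "McDowell", "Rutherford", "Polk", "Henderson"}

# Trim " County" so names match the region list, then link card -> lab on sampleid
card["county_clean"] = card["county"].str.replace(" County", "", regex=False)
merged = lab.merge(card[["sampleid", "county_clean"]], on="sampleid")
top10 = (merged[merged["county_clean"].isin(region1)]["substance"]
         .value_counts().head(10))
print(top10)
```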
For these filtered counties, I want to know how many samples (unique sampleid) that had substance=fentanyl also contained the substance xylazine. So, use the number of unique samples in a county that contain fentanyl as the denominator, and the numerator is the number of samples with xylazine. Show me a table, by county, with the following columns: 1) 'All samples' as the total number of samples regardless of substance; 2) the number of samples with fentanyl as 'Fentanyl Samples'; 3) number of samples with xylazine AND fentanyl as "Xylazine + Fentanyl"; 4) percent of samples dividing column 3 by column 2, rounding to 1 decimal place, and adding a % symbol. Make NaN zero. Sort by percent.
The table has been successfully generated, but it seems there was an issue with the calculation of the 'All Samples' and 'Fentanyl Samples' columns, as well as the sorting by percent, which resulted in 'nan%' values for some counties. This discrepancy may be due to incorrect aggregation or filtering logic. Let's correct the approach and ensure the table meets the specified requirements.
The corrected table now accurately represents the data as requested: - **Henderson County** and **Jackson County** both have 50.0% of their Fentanyl samples also containing Xylazine, the highest percentages among the counties. - **Buncombe County** follows with 23.1% of Fentanyl samples also containing Xylazine. - **Macon County** has 21.4% of its Fentanyl samples also containing Xylazine. - The remaining counties either have 0% or did not have any samples with Fentanyl, thus showing 0.0%.
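The crucial step, and the one the AI initially got wrong, is collapsing 'lab' to one row per sample with boolean flags before counting by county. A sketch on toy data (column names here are our own shorthand, not the display names in the prompt):

```python
import pandas as pd

card = pd.DataFrame({
    "sampleid": [1, 2, 3, 4],
    "county": ["Buncombe County", "Buncombe County",
               "Haywood County", "Haywood County"],
})
lab = pd.DataFrame({
    "sampleid": [1, 1, 2, 3, 4],
    "substance": ["fentanyl", "xylazine", "fentanyl", "fentanyl", "cocaine"],
})

# One row per sample with boolean substance flags, joined to its county
flags = (lab.assign(fent=lab["substance"].eq("fentanyl"),
                    xyl=lab["substance"].eq("xylazine"))
            .groupby("sampleid")[["fent", "xyl"]].any()
            .join(card.set_index("sampleid")["county"]))

table = flags.groupby("county").agg(
    all_samples=("fent", "size"),
    fentanyl=("fent", "sum"),
)
table["xyl_and_fent"] = (flags["fent"] & flags["xyl"]).groupby(flags["county"]).sum()
table["pct"] = (100 * table["xyl_and_fent"] / table["fentanyl"]).fillna(0).round(1)
print(table.sort_values("pct", ascending=False))
```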
Analysis with Data Cleaning
Starting with the 'card' df, the 'color' variable is a list of colors reported for the drug, delimited with a semi-colon (;). Each sample is a row, and a sample could have more than one color. These will need to be parsed. I would like a frequency table by color of the number of samples containing each color, knowing that samples can have more than one color and the total will be higher than the total number of samples. There may be trailing spaces after parsing that need to be removed for appropriate binning.
The frequency table of colors reported for the drug samples has been successfully created. The table lists each color along with the number of samples containing that color. The color "white" is the most frequently reported, appearing in 424 samples, followed by "clear" in 87 samples, and "purple" in 63 samples. Other colors appear with varying frequencies down to several colors that only appear once.
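The split-explode-strip pattern the prompt asks for is idiomatic pandas. A sketch on toy rows:

```python
import pandas as pd

card = pd.DataFrame({"color": ["white", "white; blue", "blue ; clear", None]})

# Split the ;-delimited lists into one color per row, trimming stray spaces
colors = card["color"].dropna().str.split(";").explode().str.strip()
freq = colors.value_counts()
print(freq)  # white: 2, blue: 2, clear: 1
```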
Can you simplify the list by consolidating similar colors like "light" and "dark"
The list of colors has been simplified by consolidating similar colors, such as combining "light" and "dark" variations into their base colors, and treating "gray" and "clear" as "grey" and "colorless" respectively. The consolidated list now provides a more generalized view of the color frequencies in the dataset.
Expected vs. Actual
I want to compare expected versus actual lab results for methamphetamine. First, in 'card' see if 'expectedsubstance' has "methamphetamine" -- there could be more than one expected substance so use regex match. Then compare with the 'lab_meth' variable in 'card' as the gold standard. Show a 2x2 of expected vs. actual results, and then calculate sensitivity, specificity, positive predictive value, negative predictive value, false positives, and false negatives. Report those to two decimal places.
The comparison between expected and actual lab results for methamphetamine shows the following metrics: - Sensitivity: 0.91 - Specificity: 0.91 - Positive Predictive Value: 0.71 - Negative Predictive Value: 0.98 - False Positives: 68 - False Negatives: 16
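The 2x2 logic is worth seeing spelled out. A minimal sketch on toy rows, treating 'lab_meth' as the gold standard and a regex match on 'expectedsubstance' as the "test" (the toy numbers are illustrative, not the real results above):

```python
import pandas as pd

card = pd.DataFrame({
    "expectedsubstance": ["methamphetamine", "heroin",
                          "methamphetamine; fentanyl", "fentanyl"],
    "lab_meth": [1, 0, 0, 1],
})

expected = card["expectedsubstance"].str.contains("methamphetamine",
                                                  case=False, na=False)
actual = card["lab_meth"].eq(1)

tp = int((expected & actual).sum())    # expected meth, lab confirmed
fp = int((expected & ~actual).sum())   # expected meth, lab negative
fn = int((~expected & actual).sum())   # not expected, lab positive
tn = int((~expected & ~actual).sum())  # not expected, lab negative

sensitivity = round(tp / (tp + fn), 2)
specificity = round(tn / (tn + fp), 2)
ppv = round(tp / (tp + fp), 2)
npv = round(tn / (tn + fn), 2)
print(sensitivity, specificity, ppv, npv)
```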
Creating Drug Identifying Flags
We have created some variables for you in the analysis_dataset.* (or 'card' df) that show 1/0 if a particular substance was detected. For example, lab_fentanyl shows 1 if fentanyl was detected in primary abundance (only) in the lab, and 0 if it wasn't. Similarly, lab_fentanyl_any shows 1 or 0 if fentanyl was detected in primary OR trace abundance. But what if you want to add a column to analysis_dataset.* for a new substance? Let's look at something simple like heroin. We will create a new lab_heroin variable that shows its presence in primary abundance.
I want to create a new variable in 'card' df called 'lab_heroin' to designate if 'lab' df 'substance' contains "heroin" AND abundance is "primary" or NaN, linking by sampleid. Create this and tell me the count of heroin-containing samples, and the percent of total samples. Then, show me the updated df with the column added.
The new variable `lab_heroin` has been successfully created in the `card` DataFrame, indicating if a sample contains heroin in primary abundance. There are 74 heroin-containing samples, which is approximately 7.65% of the total samples. The updated DataFrame now includes this new column.
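The flag-creation pattern, collect the qualifying sampleids from 'lab', then mark them in 'card', looks like this on toy data:

```python
import pandas as pd

card = pd.DataFrame({"sampleid": [1, 2, 3]})
lab = pd.DataFrame({
    "sampleid": [1, 2, 3],
    "substance": ["heroin", "fentanyl", "heroin"],
    "abundance": ["primary", "primary", "trace"],
})

# Samples with heroin in primary abundance (NaN abundance counts as primary)
heroin_ids = lab.loc[
    lab["substance"].str.contains("heroin", case=False, na=False)
    & lab["abundance"].fillna("primary").eq("primary"),
    "sampleid",
].unique()

card["lab_heroin"] = card["sampleid"].isin(heroin_ids).astype(int)
n = int(card["lab_heroin"].sum())
pct = round(100 * n / card["sampleid"].nunique(), 2)
print(n, pct)  # 1 33.33
```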
Now, we can see how many fentanyl-containing samples also contained heroin.
How many and what percent of fentanyl-containing samples (lab_fentanyl) also contained heroin (lab_heroin)?
Out of the total 522 fentanyl-containing samples, 65 also contained heroin, which represents approximately 12.45% of the fentanyl-containing samples.
Now, let's see how this has changed over time.
Show a timeline line graph of how this percent has changed over time. Use 3-month calendar quarters to aggregate
The line graph above shows the percentage of fentanyl samples containing heroin aggregated by 3-month calendar quarters over time.
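The quarterly aggregation reduces to a groupby on a period column; because lab_heroin is 1/0, its mean within each quarter is the fraction you want. A sketch on toy rows:

```python
import pandas as pd

card = pd.DataFrame({
    "date_complete": ["2023-01-10", "2023-02-20", "2023-04-05", "2023-05-15"],
    "lab_fentanyl": [1, 1, 1, 0],
    "lab_heroin":   [1, 0, 1, 1],
})

# Restrict to fentanyl samples, then take the heroin rate per calendar quarter
fent = card[card["lab_fentanyl"] == 1].copy()
fent["quarter"] = pd.to_datetime(fent["date_complete"]).dt.to_period("Q")
pct_by_quarter = (fent.groupby("quarter")["lab_heroin"].mean() * 100).round(1)
print(pct_by_quarter)  # 2023Q1: 50.0, 2023Q2: 100.0
# pct_by_quarter.plot() would draw the timeline
```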
Say we wanted to see how much overlap there was between lab_heroin and lab_meth and lab_fentanyl.
generate a venn diagram of lab_heroin, lab_meth, lab_fentanyl, showing the counts for each area of overlap using 'card'
The Venn diagram above visualizes the overlap between samples containing heroin, methamphetamine, and fentanyl. Each circle represents one of these substances, and the overlaps between circles indicate the number of samples that contain combinations of these substances.
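A Venn diagram is really seven region counts. A sketch that computes them from the 1/0 flags, in the subset order the matplotlib_venn package's venn3 function expects (the plotting call is commented out since that package may need installing; the toy flags are illustrative):

```python
import pandas as pd

card = pd.DataFrame({
    "lab_heroin":   [1, 1, 0, 0, 1],
    "lab_meth":     [0, 1, 1, 0, 1],
    "lab_fentanyl": [1, 0, 1, 1, 1],
})

h = card["lab_heroin"].eq(1)
m = card["lab_meth"].eq(1)
f = card["lab_fentanyl"].eq(1)

# Region counts in venn3's order:
# (A only, B only, A&B only, C only, A&C only, B&C only, A&B&C)
subsets = (
    int((h & ~m & ~f).sum()), int((~h & m & ~f).sum()), int((h & m & ~f).sum()),
    int((~h & ~m & f).sum()), int((h & ~m & f).sum()), int((~h & m & f).sum()),
    int((h & m & f).sum()),
)
print(subsets)
# from matplotlib_venn import venn3  # if installed
# venn3(subsets=subsets, set_labels=("Heroin", "Meth", "Fentanyl"))
```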