Data science notebooks | 2020 Review & Key trends
2020 was a roller coaster, but the data science community is going strong. Interest in data science grew yet again over the past year. We dug into the data to learn more about the current state of a vital part of the data science ecosystem: notebooks.
We break down this analysis into 3 sections:
- First, we explore key stats and trends of Jupyter notebooks on GitHub.
- Second, we double down on popular Python libraries and show you what libraries to add to your toolkit for plotting, ML, NLP, and other use cases.
- Last, we look at search trends from Google & YouTube.
Where does the data come from?
We created a representative dataset of 700 Jupyter notebooks that favors faster processing and maps key notebook trends. We mined new insights from a dataset gathered by the Datalore team and looked at other datasets for comparison. You can also open up this project as a Deepnote notebook to see all the code and source data.
To summarize the current state of notebooks, we've tapped into a couple of data sources:
- GitHub API, Deepnote 2020 mini dataset (dn2020mini)
- Google and YouTube search trends
- Datalore 10M Dataset (datalore10M)
1. Notebooks on GitHub
First, we analyzed repositories containing Jupyter notebooks on GitHub. Here are some general stats for 2020:
- Number of created repositories containing Jupyter Notebooks: 10,176
- Number of commits: 131,753
- Number of issues: 51,887
- Number of discussions: 101
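A repository count like the first one can be approximated with GitHub's repository search endpoint. The sketch below only builds the request URL and makes no network call. The `language:"Jupyter Notebook"` qualifier, the date range, and the function name `build_search_url` are our assumptions for illustration, not necessarily the query behind the numbers above. Note that the language qualifier matches repositories whose primary language is Jupyter Notebook, which is narrower than "containing notebooks".

```python
from urllib.parse import urlencode

# Assumption: GitHub's repository search endpoint is used for counting.
SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(year: int, per_page: int = 100) -> str:
    """Build a search URL for Jupyter Notebook repos created in a given year."""
    # Search qualifiers: primary language and a creation date range.
    query = f'language:"Jupyter Notebook" created:{year}-01-01..{year}-12-31'
    params = {"q": query, "sort": "stars", "order": "desc", "per_page": per_page}
    return f"{SEARCH_URL}?{urlencode(params)}"


url = build_search_url(2020)
print(url)
```

Sending a GET request to this URL returns JSON whose `total_count` field gives the number of matching repositories, and sorting by stars surfaces the most popular ones first. Unauthenticated requests are rate limited, so a real run would likely need a token.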
Most popular notebook repos
Love is love - hearts or stars, it's all the same. The most starred repository of 2020 is Fast AI's Fastbook (https://github.com/fastai/fastbook) with 11k stars, 39 contributors, and 3.4k forks. The repo covers an introduction to deep learning, fastai, and PyTorch. Fast AI takes second place too with FastPages, a blogging platform (2k stars, 91 contributors, 409 forks).
The all-time favorite repository, with 27.3k stars, is the Python Data Science Handbook, created in 2017. It contains the entire book in the form of Jupyter notebooks.
Most used Python versions
Python is the most popular language in Jupyter notebooks, and we found a variety of versions in use during our analysis. Python 3.6 is the most used version with over 55% of users, followed by Python 3.7 at 36.5%. Python 3.5 and 2.7 each account for only around 0.51% of users.
Here's the distribution of open-source licenses for the smaller dataset. The MIT License is the most common, followed by Apache 2.0.
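Version shares like these can be read from each notebook's metadata. The sketch below assumes the notebooks have already been parsed from their .ipynb JSON into dicts, and that the version sits under `metadata.language_info.version`, where the Jupyter notebook format stores it. The function name and the sample data are hypothetical and not taken from the original analysis.

```python
from collections import Counter


def python_version_shares(notebooks):
    """Return {"major.minor": percent} for Python notebooks that record a version."""
    counts = Counter()
    for nb in notebooks:
        info = nb.get("metadata", {}).get("language_info", {})
        version = info.get("version")
        if info.get("name") != "python" or not version:
            continue  # skip non-Python notebooks and those without a version
        # Collapse patch releases: "3.6.9" -> "3.6"
        counts[".".join(version.split(".")[:2])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {v: round(100 * n / total, 1) for v, n in counts.most_common()}


# Hypothetical sample standing in for notebooks loaded with json.load().
sample = [
    {"metadata": {"language_info": {"name": "python", "version": "3.6.9"}}},
    {"metadata": {"language_info": {"name": "python", "version": "3.6.5"}}},
    {"metadata": {"language_info": {"name": "python", "version": "3.7.4"}}},
    {"metadata": {}},  # no language info recorded
]
shares = python_version_shares(sample)
print(shares)
```

Notebooks without recorded language info are left out of the denominator here. Counting them as "unknown" instead would lower every share, so the choice is worth stating alongside any published percentages.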
Most common kernel names
This one is a bit technical. Unsurprisingly, the most common kernels are variations of python3. We also see heavy use of conda.
2. Doubling down on popular Python libraries
We've also looked at library popularity. In our small GitHub 2020 dataset, we found pandas to be the most popular.
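One way to measure library popularity is to parse the import statements in each notebook's code cells. The sketch below uses Python's `ast` module and counts each library once per notebook. It assumes notebooks are already loaded as dicts in the standard .ipynb layout; the helper names and sample data are hypothetical, and this is not necessarily how the original analysis extracted imports.

```python
import ast
from collections import Counter


def imported_libraries(source: str) -> set:
    """Return the top-level package names imported in one code cell."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return set()  # e.g. cells containing IPython magics like %matplotlib
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])  # skip relative imports
    return names


def count_imports(notebooks) -> Counter:
    """Count how many notebooks import each library."""
    counts = Counter()
    for nb in notebooks:
        libs = set()
        for cell in nb.get("cells", []):
            if cell.get("cell_type") == "code":
                # The notebook format stores source as a string or a list of lines.
                libs |= imported_libraries("".join(cell.get("source", "")))
        counts.update(libs)
    return counts


# Hypothetical sample notebooks, for illustration only.
sample = [
    {"cells": [{"cell_type": "code",
                "source": ["import numpy as np\n", "import pandas as pd"]}]},
    {"cells": [{"cell_type": "code",
                "source": "import pandas as pd\nfrom matplotlib import pyplot as plt"},
               {"cell_type": "markdown", "source": "import os"}]},
]
counts = count_imports(sample)
print(counts.most_common(3))
```

Parsing with `ast` avoids false matches in comments and strings that a regular expression would pick up. The cost is that a cell with any non-Python syntax, such as a magic command, is skipped entirely.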
To provide a more in-depth view, we've used the larger Datalore 10M dataset and found very similar results. No surprise here: numpy, pandas, and matplotlib are the top 3 most-imported libraries. Take a look at the top 20 libraries overall:
In the sections below, we categorize the most used Python libraries by subject area. Which of these will you add to your toolkit in 2021?
Matplotlib is the most popular plotting library, with a clear lead over seaborn and plotly.
In machine learning, tensorflow has been the most popular library with 40.0% of users importing it for their ML tasks, closely followed by keras at 34.1%.
For natural language processing, nltk is the clear #1 with 63.0% of imports.
Geospatial analysis libraries
For geospatial analyses, folium has been the most popular library, followed by geopandas and shapely.
Zipfile takes #1 with 48.4% of users importing the library for their compression tasks.
Here's a look at a couple of other subject-specific libraries gaining popularity in notebooks:
- Chemistry: pymatgen
- Medical imaging: nibabel
- Astronomy: astropy
It's important to note that many researchers from similar domains are still not using notebooks, or Python for that matter.
3. Notebooks in search
What does Google search reveal?
We've also had a look at what people have been searching for in relation to notebooks in 2020. The top 10 search queries ask about Jupyter notebooks, Python, and .ipynb file manipulation.
Scoring of search terms in this section is relative. A value of 100 is assigned to the most commonly searched query, a value of 50 to a query searched half as often, and so on.
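The relative scoring described above amounts to dividing every count by the largest one and scaling to 100. A minimal sketch, using made-up search volumes since the underlying raw counts are not published:

```python
def relative_scores(raw_counts):
    """Scale raw counts so that the most searched query scores 100."""
    top = max(raw_counts.values(), default=0)
    if top == 0:
        return {query: 0 for query in raw_counts}
    return {query: round(100 * n / top) for query, n in raw_counts.items()}


# Hypothetical raw search volumes, for illustration only.
raw = {"jupyter notebook": 8000, "python notebook": 4000, "open ipynb": 2000}
scores = relative_scores(raw)
print(scores)
```

Because the scores are normalized within one set of queries, they compare queries inside a single chart and should not be compared across the Google and YouTube results.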
What's trending on YouTube?
Notebook-related queries on YouTube look very similar to those on Google: viewers have been asking about Jupyter and Python, installation, and Anaconda setup.
We have mentioned 3 different data sources that we used for this article. Here's how they look and how you can build on top of them.
In December, Datalore published a blog post called "We Downloaded 10,000,000 Jupyter Notebooks From Github – This Is What We Learned", with an accompanying notebook. They curated a dataset of 10M notebooks (5TB) and provide a simplified 3GB version with filtered CSVs. The filtered data include notebook names, imports, versions, and text stats. The authors also measure the consistency of notebooks by re-executing cells and comparing the outputs. You can access all 10M notebooks directly in the github-notebooks-update1 S3 bucket. You can also access the smaller dataset right in our supporting Deepnote project.
Feel free to duplicate our project, and look for more insights in the data.
Conclusion & Future research
We've seen that it's very easy to analyze code from GitHub, and we were able to build a representative sample with only hundreds of notebooks. The Datalore folks have already shown a unique consistency analysis, but we think there is a lot more to find in the notebooks. Feel free to adjust and extend our analysis, and show us when you do at @DeepnoteHQ on Twitter. We will reward our favourites with some snazzy swag.
Made with 💙 by the Deepnote team. Deepnote is a Jupyter-compatible notebook with real-time collaboration, running in the cloud. Try it out for free.