–– by Allan on January 6, 2022
"Test your data", they said.
Great. How do I get started?
"Follow the tutorial", they said.
At the risk of already reaching my meme quota, let me say plainly that this article describes my bumpy journey into data testing with Great Expectations. This is not a critique of Great Expectations. It is a reminder that, as tools become more composable, difficulties emerge with onboarding because tools don't exist in a vacuum. For example, if tool B depends on setting up tools A and C, we quickly get to a "how to draw an owl" situation, hence the memetic title.
Since Deepnote is designed to bring tools, teams, and workflows together, it has become clear to me that, in the context of learning, we can be much more than a compendium for demonstrating scientific tools. Instead, we can promote learning by allowing scientists to observe tools in their natural habitat, plugged neatly into their associated technologies. In other words, Deepnote embodies context-based learning.
By way of example, let's take my recent foray into learning Great Expectations. For those who don't know, Great Expectations is the leading tool for validating, documenting, and profiling your data. Great Expectations brings the software development discipline of automated testing to data science teams.
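To make "automated testing for data" concrete, here is a minimal sketch in plain Python. It is not the Great Expectations API: the helper `expect_values_in_set`, its result dictionary, and the sample data are all hypothetical, loosely modeled on the style of the library's `expect_column_values_to_be_in_set` expectation.

```python
from typing import Any, Dict, Iterable, List


def expect_values_in_set(values: Iterable[Any], allowed: Iterable[Any]) -> Dict[str, Any]:
    """Hypothetical helper: check that every value belongs to an allowed set."""
    allowed_set = set(allowed)
    # Collect every value that violates the expectation, in original order.
    unexpected: List[Any] = [v for v in values if v not in allowed_set]
    return {
        "success": not unexpected,        # True only when nothing is out of place
        "unexpected_values": unexpected,  # evidence to show when the test fails
    }


# Made-up column of payment types with one misspelled entry.
payment_types = ["cash", "card", "card", "chekc"]
result = expect_values_in_set(payment_types, {"cash", "card", "check"})
print(result["success"], result["unexpected_values"])
```

The point of the sketch is the shape of the idea: an expectation is an assertion about data that returns a structured result rather than just raising an error, so the result can be logged, reported, or acted on later.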
Their docs clearly state what they are and what they're not; however, to truly grok their value, data scientists will have to interact with "what they're not" sooner or later, and therein lies the rub. Great Expectations naturally shines when observed within a larger software ecosystem, as do many other tools (e.g., dbt, Airflow, Git). Let's take a look at how Deepnote puts the pieces together for you when learning Great Expectations.
The getting started tutorial for Great Expectations is very well done as far as tutorials go — human-readable CLI commands, and automatically created and narrated notebooks. I must admit though that before long, I couldn't help but notice a great deal of context switching — notebook to notebook, to CLI, to docs, and back. All of this while trying to internalize the Great Expectations parlance. Context switching is cognitively costly and an enemy of learning, similar to multi-tasking.
Surely beginners would benefit from an all-in-one-place demonstration of the basics. Deepnote makes learning Great Expectations "cheaper" for the mind. It spins up a complete, runnable workflow that does not require the terminal, environment setup/installation, or multiple notebooks. Everything that is related to the learning experience is presented in the same place. No context switching needed.
The very first item in what Great Expectations doesn't do relates to pipeline execution. They are not a pipeline execution framework. Makes perfect sense. The only problem with this is that we end up at the "how to draw an owl" problem again. Data testing of any real value will have to end up in a pipeline at some point. While Deepnote is not going to set up Airflow for you, it does provide a GUI for scheduling your notebooks. Scheduling puts your data testing into a production-level pipeline without any additional learning or peripheral setup.
One of the best features of Great Expectations is their data docs. Every time you validate your data against your expectations, Great Expectations builds a human-readable documentation site. These HTML docs describe your validation results and more. They are a continuously updated data quality report. In the image below, you can see a page from the data docs showing a failed set validation. The data docs are amazing!
(Image source here)
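For a rough feel of what "validation results rendered as HTML" means, here is a toy sketch using only the standard library. It is in no way how Great Expectations builds its data docs; the function `render_report`, the result fields, and the sample results are hypothetical.

```python
from html import escape
from typing import Dict, List


def render_report(results: List[Dict[str, object]]) -> str:
    """Hypothetical renderer: turn validation results into one HTML table."""
    rows = []
    for r in results:
        status = "passed" if r["success"] else "failed"
        # Escape the description so characters like "<" cannot break the markup.
        rows.append(
            f"<tr><td>{escape(str(r['expectation']))}</td><td>{status}</td></tr>"
        )
    return (
        "<html><body><h1>Data quality report</h1><table>"
        "<tr><th>Expectation</th><th>Status</th></tr>"
        + "".join(rows)
        + "</table></body></html>"
    )


# Made-up results, in the same spirit as a validation run.
results = [
    {"expectation": "passenger_count is not null", "success": True},
    {"expectation": "payment_type in {cash, card}", "success": False},
]
report_html = render_report(results)
```

Regenerating such a page on every validation run is what makes it a continuously updated report rather than a one-off document.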
Unfortunately, we're back at the owl drawing issue again: Now we have to host these docs on the web so that our team can access them. There are likely plenty of data experts who don't want to deal with hosting sites at all. Now, Deepnote is not designed to host your personal website; however, we do allow incoming connections from the web to your cloud machine. This means that Great Expectations learners can spin up the data docs, and even share them publicly, without having to draw the whole owl, so to speak.
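As a sketch of what "spinning up the data docs" can look like on a machine that accepts incoming connections, the snippet below serves a folder of static HTML with Python's built-in `http.server`. The function name `make_docs_server`, port 8080, and the folder path in the usage comment are assumptions for illustration; check which port your environment exposes and where your data docs were actually written.

```python
import functools
import http.server


def make_docs_server(directory: str, port: int = 8080) -> http.server.ThreadingHTTPServer:
    """Hypothetical helper: build a static file server for a folder of HTML docs."""
    # Pin the request handler to the docs folder instead of the working directory.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Bind to all interfaces so requests from outside the machine can reach it.
    return http.server.ThreadingHTTPServer(("0.0.0.0", port), handler)


# Usage (blocks until interrupted). The path is an assumed default location:
#   server = make_docs_server("great_expectations/uncommitted/data_docs/local_site")
#   server.serve_forever()
```

This is a learning convenience, not a hardened hosting setup: anything in the served folder becomes readable by whoever can reach the port.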
There is a proliferation of tools that are capable of being integrated with other tools. On one hand, this is helpful, but it also comes at a cost. Learners are often smacked with a list of prerequisites so long and complex that observing tools in the wild, let alone adopting them, is too heavy a burden to bear.
When it came to learning Great Expectations, I couldn't help but reflect on the snowballing effect of learning new technologies in general. The role Deepnote plays with regard to learning is significant: it provides an instantly spun-up, "terraformed" world, where related tools can be seen truly living together. No more stale, out-of-context, non-interactive, habitat-blind learning guides, please.
See for yourself — click the link to the notebook to observe Great Expectations in the wild and enjoy learning.
I love to cook, play music, and write software! My background is in cognitive neuroscience. I have developed peer-reviewed statistical software libraries and given lectures on the Python language, interactive data visualization, robust statistics, and original research.