The AI gold rush: not all that glitters
Over the past six months, our team has embarked on an exciting venture, integrating generative AI capabilities into Deepnote. This journey, however, isn't unique to us. Earlier this year, the data world was swept up in the fervor surrounding Large Language Models (LLMs), leading to a widespread rush among data and analytics tools to implement various AI assistant features. Yet, as the initial excitement of AI innovation begins to fade, it's becoming evident that not all of these initiatives have achieved their intended outcomes.
Some user problems simply turned out to be pretty hard to solve effectively, despite the power of LLMs. This is particularly evident in numerous text-to-SQL solutions. Even when equipped with the full schema context, LLMs can sometimes misinterpret data, leading to the generation of incorrect filter values or the selection of the wrong table due to issues like data duplication or insufficient documentation.
At the other end of the spectrum, some companies addressed problems that didn’t really need tackling in the first place. In these cases, the integration of LLMs was purely marketing-driven, producing little more than demo candy that users may look at once and then never use again. It is then no surprise that many AI products see high growth coupled with very high churn.
So, what's the real score with Deepnote AI? Are people really using it? Has it changed lives like we hoped?
Our answers: It’s off to a great start, but we have lots to improve; some rely on it all the time, others not as much; it's life-changing for some, less so for others.
In this two-article series, we'll dive deeper. You'll learn how we built Deepnote AI, how we track its success, and some of the key lessons we learnt along the way. Plus, we'll share a peek at what's next.
Picking a solid problem
From the beginning, we were convinced that generative AI would be a game-changer for Deepnote. This new technology held two major promises:
- It could make our existing users vastly more productive;
- It could make the product much more accessible, opening up notebooks to a whole new audience.
The main question was where to start mining this potential. Mesmerized by the runaway success of ChatGPT, many players in the market jumped straight into creating chatbots, which often acted as nothing more than simple wrappers around OpenAI’s API. The main advantage these integrations offered was the convenience of not having to alternate between their product and ChatGPT, a benefit we considered rather limited.
We were tempted to create a quick marketing gimmick, especially since we had a prototype of a conversational code generator from a hackathon last year, but we resisted. Convinced that we are going to be in the AI game for the long term, we decided to take a different route.
We returned to the fundamentals of product development, focusing on a problem that:
- our users genuinely needed solving;
- we were confident we could address effectively with LLMs.
This problem was code completion.
Back then, GitHub Copilot was the leading example of LLMs in action. It rapidly gained popularity among software developers, with its 'ghost-text' suggestions becoming essential tools in their daily work. Our Deepnote users, many of whom are skilled Python programmers, started requesting similar AI assistance, something they had grown to rely on in their IDEs.
This was a strongly validated and proven use case for generative AI, so we started here. Initially, we considered leveraging OpenAI models, but their cost was prohibitive for the volume of real-time suggestions we needed. We contemplated limiting API calls by making the feature activation manual (e.g., using a hotkey for suggestions), but this seemed like a significant compromise in user experience, so we rejected that idea. Performance was also a major concern. Depending on the OpenAI model, response times for suggestions ranged from one to several seconds, which was far from ideal for our needs.
In the end, we looked elsewhere and found an ideal partner in Codeium. They could give us high-quality, lightning-fast suggestions with a pricing structure that was a good match for the level of user experience we aimed to provide.
Did it work?
Selecting appropriate success metrics for this project proved difficult. With limited concrete benchmarks available, we referred to GitHub's published reports for guidance (like this one). Inspired by these, we established two highly ambitious targets.
🎯 20% of suggestions are accepted by users.
🎯 80% of participants report higher productivity.
In our Beta program, the average acceptance rate for code suggestions was about 9%, but this varied widely among our customers, from as low as 5% to as high as 19%. Measured against our 20% target, this outcome initially seemed somewhat underwhelming.
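As a rough illustration of how an acceptance rate like this can be computed, here is a minimal sketch. The `SuggestionEvent` record and both function names are hypothetical and are not taken from Deepnote's actual telemetry; the sketch simply assumes one logged event per suggestion shown, with a flag for whether the user accepted it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SuggestionEvent:
    """Hypothetical log record: one suggestion shown to one customer."""
    customer: str
    accepted: bool


def acceptance_rate_by_customer(events):
    """Return {customer: accepted / shown} for every customer with events."""
    shown = defaultdict(int)
    accepted = defaultdict(int)
    for event in events:
        shown[event.customer] += 1
        if event.accepted:
            accepted[event.customer] += 1
    return {customer: accepted[customer] / shown[customer] for customer in shown}


def overall_acceptance_rate(events):
    """Return accepted / shown across all events, or 0.0 if there are none."""
    events = list(events)
    if not events:
        return 0.0
    return sum(1 for event in events if event.accepted) / len(events)
```

Looking at the per-customer breakdown alongside the overall figure is what exposes a spread like the one we saw; a single average hides it.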
However, when we combined these figures with qualitative feedback, the picture changed dramatically. Our customers were thrilled with the feature! We began receiving consistent feedback about significantly enhanced productivity. Many of our power users even reported that their coding experience with our tool matched or surpassed their favourite IDEs.
It is IMPRESSIVE!! Code suggestions were spot on!!. It felt like someone was reading my mind and typing in my code for me!! wow!!
This enthusiasm was echoed in our survey results: 75% of respondents reported increased productivity.
With all this positive feedback, we adjusted our initial expectations, declared the Beta a qualified success, and rolled out the feature into full production.
This was a big moment for us - we were on to something.
From Beta to better: continuous improvements
However, not all feedback was glowing. Some customers expressed frustration that the completions didn't fully comprehend their working context, like notebook variables and file names. Others appreciated the feature but wanted the flexibility to toggle it on and off, depending on the project stage. A very small minority of users simply found it frustrating. We also encountered minor UX issues, like overly aggressive suggestions after just a few characters. Additionally, the tool's limitation to Python code blocks was a constraint for users who also wanted assistance with SQL queries.
We listened closely and made several iterative improvements to address the various issues raised. If you’re interested in the technical aspects of this work, check out this excellent article written by my engineer colleague.
Adding more of the right context to the LLM was the most impactful change we made. Together with the notebook’s content and title, we started sending information about the runtime variables and their schemas, along with the names of files in the project. The richer contextual awareness made an immediate difference to our customers’ experience.
The autocomplete kicks ass, now that it also uses variable explorer… feels sometimes like magic.
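To make the idea of enriched context more concrete, here is a minimal sketch of assembling the pieces described above (notebook title and content, runtime variables with their schemas, project file names) into a single text context. It is an illustration only: `NotebookContext`, `build_completion_context`, the comment-style header, and the character budget are all assumptions, and they do not describe how Deepnote or Codeium actually build their requests.

```python
from dataclasses import dataclass, field


@dataclass
class NotebookContext:
    """Hypothetical container for what the editor knows about a notebook."""
    title: str
    cells: list[str]
    variables: dict[str, str] = field(default_factory=dict)  # name -> schema summary
    file_names: list[str] = field(default_factory=list)


def build_completion_context(ctx: NotebookContext, max_chars: int = 4000) -> str:
    """Build a text context: a metadata header followed by the notebook code.

    The header is always kept whole; if the result would exceed max_chars,
    the oldest part of the notebook code is dropped first (an assumed policy).
    """
    header_lines = [f"# Notebook: {ctx.title}"]
    if ctx.variables:
        header_lines.append("# Runtime variables:")
        header_lines.extend(
            f"#   {name}: {schema}" for name, schema in sorted(ctx.variables.items())
        )
    if ctx.file_names:
        header_lines.append("# Project files: " + ", ".join(sorted(ctx.file_names)))
    header = "\n".join(header_lines)

    body = "\n\n".join(ctx.cells)
    budget = max(max_chars - len(header) - 1, 0)  # 1 char for the joining newline
    if len(body) > budget:
        body = body[len(body) - budget:]  # keep the most recent code
    return f"{header}\n{body}" if body else header
```

Keeping the most recent code when trimming is one possible choice, made here on the assumption that the cells closest to the cursor matter most for a completion.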
We also began to incorporate workspace-level data into the context. With the introduction of AI code completion for SQL blocks, we experimented by integrating relevant SQL block content from other notebooks in the same workspace. Our internal tests showed promise, but we were initially uncertain about its effectiveness with customer data. This approach, however, proved to be quite successful in many customer use cases. As one of our SQL power-users exclaimed:
I have spent more time with the SQL AI now and I am getting increasingly more impressed actually…. How is it knowing that these are actually the exact join conditions I’m looking for?
The data corroborated these positive responses, showing that the acceptance rate for SQL suggestions was nearly on par with that for Python blocks.
This enriched context is proving to be incredibly beneficial. The more our users create content, with or without AI assistance, the more precise the suggestions get. Thanks to these continuous enhancements, the acceptance rate has increased by over 31% since our initial Beta release.
We're aiming to push this number even higher, and there's certainly more work ahead to fully optimize our code suggestions. Yet, even the early outcomes significantly boosted our confidence in the value of LLMs for Deepnote. Encouraged by the results, we then set our sights on exploring a new dimension of AI assistance.
In Part 2 of this series, we will look at how we went about building conversational AI help in the notebook style! We will discuss some of the key product decisions when integrating with OpenAI and the impact of the resulting solutions, such as Auto AI and Edit & Fix. Stay tuned!