The next generation SQL? (An interview with Stack Overflow) By Ollie Hughes
The Stack Overflow annual developer survey provides the most comprehensive global overview of trends in software development and people who code. Last month, Count helped Stack Overflow re-publish their 2019 survey results to promote the survey findings and to provide an easier way for millions of their community members to dive into the data for themselves.
You can explore all the insights people have created or explore the data yourself in Count here.
Alongside the launch, Stack Overflow interviewed me to find out more about Count’s vision and how we help individuals and organisations gain more value from the data they publish online.
The full article was published on Stack Overflow’s blog on 31st July (which you can read here) or check out the excerpt below.
A bit of background context
Firstly, let’s start with something which you probably already know. Using data has become increasingly popular. Even a cursory look at the Stack Overflow Trends page will quickly show you the indomitable rise of Python along with other specialist analytical languages as more people learn how to use data better.
This growth is not surprising. The value of data within business has been well documented and the amount of publicly available data has increased at least 10 fold in the last decade. Add into the mix organisations like FiveThirtyEightand The Pudding who engage millions of people with their interactive data journalism and online communities like Kaggle who help people learn new skills and it’s easy to see why being a “data person” has never been more exciting.
So, what’s the problem?
With all this backdrop it’s a fair question to ask what does a new organisation like Count think it can bring to the table and even more understandable to ask why we believe it’s necessary to develop a completely new programming language which no one knows?
Well, even with all the resources and tools mentioned above, we believe exploring data online — and we are specifically focusing on relational data for the moment — is still not that easy.
Despite its value, and in contrast to other media types like video and images, relational data remains fragmented and hard to find — dotted around the web in different locations and formats. To analyse it in this state takes quite a bit of time and no small amount of skill, particularly if you’re combining data from different sources.
And though there has been a large increase in people learning query languages like Python and R, it is still only a relatively small proportion of people who know them well. In December 2017 (when we were getting stuck into this problem properly) we did a survey of 6000 adults across the UK to understand their level of comfort using data. The results showed that though the majority of respondents felt comfortable reading statistics, 84% didn’t feel comfortable doing basic data analysis. This puts free-form analysis of the data on the web out of the reach of most people.
If you’re an individual or organisation who wants to release data publicly online, this situation leaves you with limited options to properly show the value of your data. Either you publish the data:
- In its raw form (such as a csv, json or through an API), which gives your users complete flexibility in how they use it but also significantly limits your audience because of the time and skill it takes to explore it or,
- in an aggregated, visual form that lets users instantly digest some information but only for a small subset of pre-selected queries.
So what’s different about Count?
When the Count team started to tackle this problem, we set out to find a solution that would offer the flexibility of a query language but that could appeal to as wide an audience as possible.
We analysed thousands of queries and interviewed hundreds of people all using different tools and datasets. We focused our research on descriptive data queries because they form a fundamental step in everyone’s workflow.
Our research provided a number of key insights, but two were most fundamental in shaping how we optimised our language’s design:
- The vast majority of queries were made up of a core set of functionality which was far smaller than the full functionality most analytical tools offered.
- The most valuable insights came from queries that involved multiple stages or joins across different datasets.
These insights were pretty consistent across different sectors, a user’s level of expertise and most data structures. A good example of what we found can be seen browsing the queries written in Stack Exchange’s data explorer.
The data explorer provides access to 29 tables which collectively give full visibility of the activity on the Stack Overflow site. Despite the range of questions, this data can answer (and the no doubt higher than the average capability of Stack Overflow users) when you look at what functionality is being used, often it is a very small fraction of the full SQL language. Additionally, over half the queries involve joining multiple tables together to drill down into various aspects of the platform’s usage.
A simple but powerful way to explore data
Based on our research we designed our language to have the following key features:
- It is a declarative language (like SQL) — allowing users to just ask for what they want.
- To further enhance the declarative nature, the language will join datasets automatically if users request data from multiple tables, no matter the number of intermediate tables between the requested tables.
- Unnecessary syntax has been removed to optimise the speed and simplicity of common analytical tasks.
The features provide users with two main benefits:
- Firstly it allows datasets to be combined fluidly regardless of the complexity of the data model. This, along with the simplified syntax, makes the language much faster to write.
- Secondly, the interface is inherently modular, allowing users to build up complex queries using a series of simple steps which (because of the automatic joining) they can quickly join together.
Below are some example insights gleaned from the 2019 Developer Survey by members of the Stack Overflow community. I’ve included the query in both our language and in SQL below each chart to highlight the difference. Alternatively, you can explore the language and the data yourself here.
The average salary of developers by the language they work with
The query in Count and the equivalent in SQL:
The post The next generation SQL? (An interview with Stack Overflow) By Ollie Hughes appeared first on Data Science Festival.