Analytics craft

The art of being an analytics practitioner.

Surrogate keys in dbt: Integers or hashes?

August 24, 2022 · 12 min read

Staff Developer Experience Advocate at dbt Labs

Those who have been building data warehouses for a long time have undoubtedly encountered the challenge of building surrogate keys on their data models. Having a column that uniquely represents each entity helps ensure your data model is complete, does not contain duplicates, and can be joined across different data models in your warehouse.

Sometimes, we are lucky enough to have data sources with these keys built right in — Shopify data synced via their API, for example, has easy-to-use keys on all the tables written to your warehouse. If this is not the case, or if you build a data model with a compound key (aka the data is unique across multiple dimensions), you will have to rely on some strategy for creating and maintaining these keys yourself. How can you do this with dbt? Let’s dive in.
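As a rough illustration of the hash option for a compound key, here is a minimal Python sketch. It is not taken from the post: the separator, the null placeholder, the `surrogate_key` helper, and the column names are all assumptions made for this example.

```python
import hashlib

NULL_PLACEHOLDER = "_null_"  # assumed stand-in so missing values hash consistently
SEPARATOR = "-"              # assumed delimiter between the parts of the key


def surrogate_key(*parts):
    """Return an MD5 hex digest built from the parts of a compound key."""
    normalized = [NULL_PLACEHOLDER if p is None else str(p) for p in parts]
    return hashlib.md5(SEPARATOR.join(normalized).encode("utf-8")).hexdigest()


# Hypothetical rows that are unique on (order_id, line_number).
rows = [
    {"order_id": 1001, "line_number": 1},
    {"order_id": 1001, "line_number": 2},
    {"order_id": 1002, "line_number": None},
]
keys = [surrogate_key(r["order_id"], r["line_number"]) for r in rows]
```

The same inputs always produce the same key, so the key can be rebuilt on every run without a lookup table. The separator keeps `(1, 11)` and `(11, 1)` from collapsing into the same string before hashing.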

Narrative modeling: How structure can tell a story

August 22, 2022 · 15 min read

Ian Fahey

Analytics Engineer at dbt Labs

The larger a data ecosystem gets, the more its users and stakeholders expect consistency. As the ratio of data models to team members (to say nothing of stakeholders to team members) skyrockets, an agreed-upon modeling pattern often acts as scaffolding around that growth.

The biggest tool in the toolbox today, dimensional modeling, offers enough consistency to make it the dominant approach in the space, but what might be possible if we shut that toolbox, took a break from our workbench, and instead strolled over to our bookshelf?

In other words, what if we told a story?

How we shaved 90 minutes off our longest running model

August 18, 2022 · 15 min read

Bennie Regenold

Analytics Engineer at dbt Labs

Barr Yaron

Product Manager at dbt Labs

When running a job that has over 1,700 models, how do you know what a “good” runtime is? If the total process takes 3 hours, is that fantastic or terrible? While there are many possible answers depending on dataset size, complexity of modeling, and historical run times, the crux of the matter is normally “did you hit your SLAs”? However, in the cloud computing world where bills are based on usage, the question is really “did you hit your SLAs and stay within budget”?

Here at dbt Labs, we used the Model Timing tab in our internal analytics dbt project to help us identify inefficiencies in our incremental dbt Cloud job that eventually led to major financial savings, and a path forward for periodic improvement checks.

Enforcing rules at scale with pre-commit-dbt

August 3, 2022 · 13 min read

Benoit Perigaud

Staff Analytics Engineer at dbt Labs

Editor's note — since the creation of this post, the package pre-commit-dbt's ownership has moved to another team and it has been renamed to dbt-checkpoint. A redirect has been set up, meaning that the code example below will still work. It is also possible to replace repo: https://github.com/offbi/pre-commit-dbt with repo: https://github.com/dbt-checkpoint/dbt-checkpoint in your .pre-commit-config.yaml file.

At dbt Labs, we have best practices we like to follow for the development of dbt projects. One of them, for example, is that all models should have at least unique and not_null tests on their primary key. But how can we enforce rules like this?

That question becomes difficult to answer in large dbt projects. Developers might not follow the same conventions. They might not be aware of past decisions, and reviewing pull requests in git can become more complex. When dbt projects have hundreds of models, it's hard to know which models do not have any tests defined and aren't enforcing your conventions.

Migrating from Stored Procedures to dbt

July 20, 2022 · 11 min read

Matt Winkler

Senior Solutions Architect at dbt Labs

Stored procedures are widely used throughout the data warehousing world. They’re great for encapsulating complex transformations into units that can be scheduled and respond to conditional logic via parameters. However, as teams continue building their transformation logic using the stored procedure approach, we see more data downtime, increased data warehouse costs, and incorrect / unavailable data in production. All of this leads to more stressed and unhappy developers, and consumers who have a hard time trusting their data.

If your team works heavily with stored procedures, and you ever find yourself with the following or related issues:

Dashboards that aren’t refreshed on time
It feels too slow and risky to modify pipeline code based on requests from your data consumers
It’s hard to trace the origins of data in your production reporting

It’s worth considering if an alternative approach with dbt might help.

Strategies for change data capture in dbt

July 14, 2022 · 15 min read

Grace Goheen

Analytics Engineer at dbt Labs

There are many reasons you, as an analytics engineer, may want to capture the complete version history of data:

You’re in an industry with a very high standard for data governance
You need to track big OKRs over time to report back to your stakeholders
You want to build a window to view history with both forward and backward compatibility

These are often high-stakes situations! So accuracy in tracking changes in your data is key.

Tackling the complexity of joining snapshots

May 26, 2022 · 16 min read

Lauren Benezra

Analytics Engineer at dbt Labs

Let’s set the scene. You are an analytics engineer at your company. You have several relational datasets flowing through your warehouse, and, of course, you can easily access and transform these tables through dbt. You’ve joined together the tables appropriately and have near-real time reporting on the relationships for each entity_id as it currently exists.

But, at some point, your stakeholder wants to know how each entity is changing over time. Perhaps, it is important to understand the trend of a product throughout its lifetime. You need the history of each entity_id across all of your datasets, because each related table is updated on its own timeline.

What is your first thought? Well, you’re a seasoned analytics engineer and you know the good people of dbt Labs have a solution for you. And then it hits you — the answer is snapshots!

Optimizing dbt Models with Redshift Configurations

May 19, 2022 · 16 min read

Christine Berger

Resident Architect at dbt Labs

If you're reading this article, it looks like you're wondering how you can better optimize your Redshift queries - and you're probably wondering how you can do that in conjunction with dbt.

In order to properly optimize, we need to understand why we might be seeing issues with our performance and how we can fix these with dbt sort and dist configurations.

Stakeholder-friendly model names: Model naming conventions that give context

May 17, 2022 · 13 min read

Pat Kearns

Senior Analytics Engineer at dbt Labs

Analytics engineers (AEs) are constantly navigating through the names of the models in their project, so naming matters for how maintainable the project is and for how you access and work within it. By default, dbt will use your model file name as the view or table name in the database. But this means the name has a life outside of dbt and supports the many end users who may never know about dbt or where this data came from, but who still access the database objects in the database or business intelligence (BI) tool.

Model naming conventions are usually made by AEs, for AEs. While that’s useful for maintainability, it leaves out the people who model naming is supposed to primarily benefit: the end users. Good model naming conventions should be created with one thing in mind: Assume your end-user will have no other context than the model name. Folders, schema, and documentation can add additional context, but they may not always be present. Your model names will always be shown in the database.

How we remove partial duplicates: Complex deduplication to refine your models' grain

May 12, 2022 · 11 min read

Lauren Benezra

Analytics Engineer at dbt Labs

Hey data champion — so glad you’re here! Sometimes datasets need a team of engineers to tackle their deduplification (totz a real word), and that’s why we wrote this down. For you, friend, we wrote it down for you. You’re welcome!

Let’s get rid of these dupes and send you on your way to do the rest of the super-fun-analytics-engineering that you want to be doing, on top of super-sparkly-clean data. But first, let’s make sure we’re all on the same page.

From the Slack Archives: When Backend Devs Spark Joy for Data Folks

April 5, 2022 · 5 min read

Kira Furuichi

Technical Writer at dbt Labs

“I forgot to mention we dropped that column and created a new one for it!”

“Hmm, I’m actually not super sure why customer_id is passed as an int and not a string.”

“The primary key for that table is actually the order_id, not the id field.”

I think many analytics engineers, including myself, have been on the receiving end of some of these comments from their backend application developers.

Backend developers work incredibly hard. They create the database and tables that drive the heart of many businesses. In their efforts, they can sometimes overlook, forget, or not understand their impact on analytics work. However, when backend developers do understand and implement the technical and logistical requirements from data teams, they can spark joy.

So what makes strong collaboration possible between analytics engineers and backend application developers?

Founding an Analytics Engineering Team

March 2, 2022 · 18 min read

Nate Sooter

Manager of BI Operations at Smartsheet

Executive Summary:

If your company is struggling to leverage analytics, dealing with an overgrown ecosystem of dashboards/databases, or simply wants to avoid the mistakes of others, this story is for you. In this article, I will walk through forming the first analytics engineering team at Smartsheet, including how momentum built around forming the team, the challenges we faced, and the solutions we developed within the first year.

Introduction

Most writing about analytics engineering, or AE for short, assumes a team already exists. It’s about operating as an AE team or managing stakeholders or leveraging tools more effectively. But what about the prologue? What initial problems do AEs solve? How does an AE team even start? What do the early days look like?

The JaffleGaggle Story: Data Modeling for a Customer 360 View

February 8, 2022 · 16 min read

Donny Flynn

Customer Data Architect at Census

Editor's note: In this tutorial, Donny walks through the fictional story of a SaaS company called JaffleGaggle, which needs to group its freemium individual users into company accounts (aka a customer 360 view) in order to drive its product-led growth efforts.

You can follow along with Donny's data modeling technique for identity resolution in this dbt project repo. It includes a set of demo CSV files, which you can use as dbt seeds to test Donny's project for yourself.

How We Calculate Time on Task, the Business Hours Between Two Dates

February 3, 2022 · 10 min read

Dave Connors

Staff Developer Experience Advocate at dbt Labs

Measuring the number of business hours between two dates using SQL is one of those classic problems that sounds simple yet has plagued analysts since time immemorial.

This comes up in a couple places at dbt Labs:

Calculating the time it takes for a support ticket to be solved
Measuring team performance against response time SLAs

We internally refer to this as "Time on Task," and it can be a critical data point for customer or client facing teams. Thankfully our tools for calculating Time on Task have improved just a little bit since 2006.

Even still, you've got to do some pretty gnarly SQL or dbt gymnastics to get this right, including:

Figuring out how to exclude nights and weekends from your SQL calculations
Accounting for holidays using a custom holiday calendar
Accommodating for changes in business hour schedules

This piece will provide an overview of how and, critically, why to calculate Time on Task, and how we use it here at dbt Labs.
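To show the shape of the problem outside SQL, here is a minimal Python sketch of the calculation. It is not the approach from the post. The 9:00 to 17:00 Monday to Friday schedule, the `business_hours_between` helper, and the holiday set are assumptions for illustration, and it ignores time zones and changing schedules.

```python
from datetime import date, datetime, time, timedelta

BUSINESS_START = time(9, 0)   # assumed opening time
BUSINESS_END = time(17, 0)    # assumed closing time


def business_hours_between(start, end, holidays=frozenset()):
    """Sum the hours between start and end that fall inside business hours."""
    if end <= start:
        return 0.0
    total = timedelta()
    day = start.date()
    while day <= end.date():
        # Skip Saturdays (5), Sundays (6), and any listed holidays.
        if day.weekday() < 5 and day not in holidays:
            window_open = max(start, datetime.combine(day, BUSINESS_START))
            window_close = min(end, datetime.combine(day, BUSINESS_END))
            if window_close > window_open:
                total += window_close - window_open
        day += timedelta(days=1)
    return total.total_seconds() / 3600


# Hypothetical ticket opened Friday afternoon and solved Monday morning.
opened = datetime(2022, 2, 4, 16, 0)
solved = datetime(2022, 2, 7, 10, 0)
hours = business_hours_between(opened, solved)
hours_with_holiday = business_hours_between(opened, solved, holidays={date(2022, 2, 7)})
```

The ticket was open for 66 clock hours, but only the last hour of Friday and the first hour of Monday count, so `hours` is 2.0. Treating that Monday as a holiday drops it to 1.0.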

Welcome to the dbt Developer Blog

November 29, 2021 · 3 min read

Jason Ganz

Developer Experience at dbt Labs

David Krevitt

Marketing at dbt Labs

Doing analytics is hard. Doing analytics right is even harder.

There are a massive number of factors to consider: Is data missing? How do we make this insight discoverable? Why is my database locked? Are we even asking the right questions?

Compounding this is the fact that analytics can sometimes feel like a lonely pursuit.

Sure, our data is generally proprietary and therefore we can’t talk much about it. But we certainly can share what we’ve learned about working with that data.

So let’s all commit to sharing our hard won knowledge with each other—and in doing so pave the path for the next generations of analytics practitioners.

How I Study Open Source Community Growth with dbt

November 29, 2021 · 20 min read

Ross Turk

VP Marketing at Datakin

Most organizations spend at least some of their time contributing to an open source project. 100% of them, though, depend in some way on the output of open source communities.

The (Missing) Role of Design in Analytics

November 29, 2021 · 6 min read

Seth Rosen

Co-Founder & CEO at TopCoat Data

If you’ve spoken to me lately, follow me on Twitter, or have taken my order at Wendy’s, you probably know how much I hate traditional dashboards. My dad, a psychotherapist, has been working with me to get to the root of my upbringing that led to this deep-rooted feeling.

On the Importance of Naming: Model Naming Conventions (Part 1)

November 29, 2021 · 8 min read

Pat Kearns

Senior Analytics Engineer at dbt Labs

💾 This article is for anyone who has ever questioned the sanity of a date not in ISO 8601 format

Have you ever been assigned to add new fields or concepts to an existing set of models and wondered:

Why are there multiple models named almost the same but slightly different?
Which model has the fields I need?
Which model is upstream or downstream from which?
