A super helpful data engineering handbook.
It lists resources from:
- Certification Courses
- Communities
- Conferences
- Data Engineering Whitepapers
- Great Podcasts
- Great YouTube Channels
- Great books
- Newsletters
- People from LinkedIn, Twitter
Building a Data Engineering Project in 20 min: you'll learn web scraping of real-estate listings, uploading them to S3, processing with Spark and Delta Lake, adding data science with Jupyter, ingesting into Druid, dataviz with Superset, and managing everything with Dagster.
We've open-sourced an "Open Enterprise Data Platform", integrating the Modern Data Stack into a single portal.
It features state-of-the-art tools like dbt for SQL data modeling, Airflow for task orchestration, and Superset for BI dashboards, all on a Postgres database.
Big update to my Practical Data Engineering project on GitHub. Three years on, this hands-on guide remains a key resource, now refreshed with the latest from Dagster and Delta Lake. Goodbye, Spark (locally) - hello, delta-rs. And much more.
GH or the YT 👇🏻
I'm upgrading my practical data engineering project. Interestingly, the tools I used three years ago are still valid today. Except I'm ditching Spark locally—it's such a nightmare—and using delta-rs.
Getting there... 🙃.
Announcement: I'm writing a book! ✨ But wait... it's not your usual IT book.
1. Debuting as a digital book & website.
2. It does *not* come finished. I will steadily release new chapters and carefully listen to all feedback.
But the topic? 👉🏻 Data Engineering Design Patterns
🎉 Celebrating the release of Pandas 2.0. With Apache Arrow as its backbone, Pandas is now faster and more powerful than ever before. We'll explore why and compare it to alternatives (Polars, Vaex, Koalas, or even DuckDB), all of which consolidate their in-memory format around an open standard.
Found a new glossary for data engineering. Check it out.
Some terms explained extensively:
* Fan-Out:
* Partition:
* Wrangle:
Well done,
@dagster
team 👏🏻. Next up, backlinks? :)
My Journey through Data Modeling: Navigating the Levels.
Reflecting on my 20-year data modeling journey, I'm amazed by the evolution of approaches and levels in the field. No longer limited to Inmon and Kimball, we now have diverse techniques, each with its value.
Great quick start list to start your data engineering journey with various templates for domains like Marketing, Product, Finance, Operations, and more.
➡️
Great work by Thalia and the Airbyte team.
> It's scary how versatile/productive your terminal, and specifically Neovim, can get.
In one screen:
1. data integration/dbt code
2. analysis of SQL queries
3. db connections/browser
4. result of queries
5. docker build
6. dbt run
7. postgres
8. more windows/sessions (tmux)
«Data Engineering Vault»
> More than a mere collection of terms, it’s a curated network of data engineering knowledge to facilitate exploration and discovery.
Like a digital garden, 100+ interconnected terms are a gateway to deeper insights.
Rill Developer and Dagster are still my favorite tools; running on top of DuckDB, they're a blast to use.
Currently building my personal finance dashboard, reading from exported CSV, and categorizing groups of transactions in main and subcategories.
GH:
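As an illustration of the categorization step, here is a stdlib-only sketch; the rules, merchants, and CSV columns are all made up, not taken from the actual project:

```python
import csv
import io

# Hypothetical keyword rules: (keyword, main category, subcategory).
RULES = [
    ("migros", "Living", "Groceries"),
    ("sbb", "Transport", "Train"),
    ("netflix", "Leisure", "Streaming"),
]

def categorize(description: str) -> tuple[str, str]:
    """Assign a main and subcategory based on simple keyword matching."""
    desc = description.lower()
    for keyword, main, sub in RULES:
        if keyword in desc:
            return main, sub
    return "Uncategorized", "Uncategorized"

# Made-up export in the shape of a typical bank CSV.
export = (
    "date,description,amount\n"
    "2024-01-03,MIGROS BASEL,-54.20\n"
    "2024-01-05,SBB TICKET,-12.80\n"
)

rows = []
for row in csv.DictReader(io.StringIO(export)):
    row["main"], row["sub"] = categorize(row["description"])
    rows.append(row)

print(rows[0]["main"], rows[0]["sub"])
```

A real setup would likely push these categorized rows into DuckDB or a dashboard tool, but the tagging logic stays this simple.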
Data lakes consist of mainly three parts:
1. Storage-Layer (S3, google/azure blob)
2. File Formats (Parquet, ORC, Avro, "Arrow")
3. Table Formats (Delta, Iceberg, Hudi)
Recapping my data lake/lakehouse guide article shows that there are favorites in each category.
Data engineering (DE) is still not well defined; it's a discipline that shifted from the DBA, ETL developer, and BI specialist roles and merged with software engineering into the data engineer. If you are like me and confused about the latest terms, I started a DE concept page.
Amazed by how
@Readwiseio
improves my workflow whenever I check the latest. Replaced my RSS feeder with Readwise Reader, which combines Instapaper, RSS, web highlights, tweets, books, email, PDF into one. All highlights/notes are automatically synced into
@obsdmd
, my
#secondbrain
> The history of SQL
SQL -> Data Mart -> Materialized View -> Business Intelligence Dashboard -> OLAP Cube -> dbt tables -> One Big/Wide/Super Table -> Semantic Layer
Nice illustration on the different data modeling techniques:
> Enterprise Data Warehouse (Inmon)
> Star Schema (Kimball)
> Data Vault
> One Big Table (OBT)
Source:
🌟 Data Modeling: The Unsung Hero of Data Engineering. In my upcoming blog post, I'll explore the significance of data modeling, its various approaches, and its role in the broader context of data engineering.
#datamodeling
#dataengineering
#dataarchitecture
If you use SQL IDEs (DBeaver but in the terminal), you might enjoy .
Supports: BigQuery, ClickHouse, Impala, jq, MongoDB, MySQL, Oracle, osquery, PostgreSQL, Presto, Redis, SQL Server, SQLite, DuckDB (on the way).
Or Harlequin, if vim is not a thing.
Please create your own website. Don't give away all your content to social media. That was always my philosophy; therefore, my website has a lot of content. I created knowledge for myself, not for other huge companies.
Check out Eric's explanations: .
Interesting
#ModernDataStack
: «PRQL + DuckDB + Dagster».
> I evaluated the space for work at my current company (handling ~300 sources, ~1k downstream dbt tables + hundreds of dashboards)
Found on HN () about the PRQL as a DuckDB extension announcement.
Just released a new chapter in my DEDP book, exploring the evolution of SQL. Dive into concepts like Materialized View, OLAP Cube, dbt Table, Traditional OLAP, and DWA.
Discover common patterns such as reusability, caching, and business transformations.
As a data professional with 20 years of experience, I've seen repeated terms in tech over and over again. Today, I discovered "Personalized API", yet another new term for something that already existed.
A fantastic presentation about `dagster-embedded-elt` with Dagster.
Talking about:
> Types of data ingestion
> What makes data integration difficult
> Lessons from DuckDB ("smol" is better)
> Ingesting from API and a database are inherently different
👉🏻
Many asked me how to get started with data engineering. I suggest solving a problem or something you are passionate about with an actual project.
I collected a list of projects if you need help—get inspired and choose according to your skills.
With the release of my book, I added 60+ more terms to my second brain. To make these terms more discoverable, I added a map of content dedicated to
#dataengineering
.
All notes are interconnected, similar to our brain, making learning new terms easy.
Some 🔮 for 2023
> DuckDB standard for working with data
> Rust will be more mainstream (and spark will compete with it)
> MDS will be renamed and become better known outside the US
> Semantic layers will gain adoption
> Orchestration is seen as a key component
> Open standards everywhere
It's best to keep an updated CV—even if not searching. I do not like this process. Everything I do is online already. But as in Europe, CVs are still a thing, so I converted mine into Markdown and keep it updated on .
Not perfect, but it's a start.
📘 Just released the next chapter in my Data Engineering Design Pattern book. It covers the evolutionary journey of
#ETL
and dives into the realms of Data Warehouses, Master Data Management, Data Lakes, Reverse ETL, and CDPs.
📊🔨 Launching the final part of our series: "Data Modeling: The Unsung Hero of Data Engineering." Delving into data architecture patterns, their influence on data modeling, & the importance of strategic decisions.
#DataModeling
#DataEngineering
I'm exploring the evolution of orchestration, comparing different CEs: From Bash scripts and Cron to stored procedures and Python's modern frameworks. How did we transition from basic scripts to complex, data-aware orchestration?
Any anecdotes or specifics I should include?
Quick Update: I'm no longer at
@AirbyteHQ
. Tremendously thankful to Michel & John, who believed in me and created a unique position as a writer and data engineer. Also, huge thanks to Ari, who worked behind the scenes.
Some learning 👇🏻
TIL— Instead of doing some dbt magic, I can use
@RillData
to analyze my exported transactions and build an analytics dashboard without any extra steps 🤯
Beautiful how Rill visualizes time/number data automatically, playfully, and interactively.
I've recently replaced DBeaver (for the most part) with an extension for my IDE of choice, Neovim. It works surprisingly well.
🪄 Check out a short demo: .
Extension on GitHub: kristijanhusak/vim-dadbod-ui
#DuckDB
is hot these days, but what are its use cases? Here are 3:
* Ultra-fast analytical use cases locally
* SQL wrapper with zero copies (e.g., on top of Parquet files in S3)
* Bring your data to the users instead of incurring big round trips and latency through REST calls
What else?
It's always a pleasure to listen to
@schrockn
and Tobias.
TIL—
✅ MLOps is mostly data engineering
🤔 SQLMesh is a better dbt
💯 Orchestration shouldn't be an afterthought; instead, the first thing when starting a data project
WASM and DuckDB to get the Parquet schema by hovering in BigQuery 🆒.
Usually, you need to download parts of the metadata to read it, typically in notebooks or similar, but WASM runs entirely inside the browser, which makes this an excellent use case. Thanks for sharing
@_Blef
.
Ready to unleash the power of
#DataModeling
? Dive into the dynamic world of data modeling techniques! In Part 1, we explored the importance of data modeling & its role in unlocking the value of your organization's data. In Part 2 (), let's delve deeper.
The moment I discovered the efficiency of Vim’s modal editing, my journey has been about finding clarity in my work.
Going from Notepad++ and SSMS to embracing Vim represented a significant shift in how I approach tasks in data engineering and writing.
Vim is a popular text editor that relies heavily on keyboard shortcuts to get stuff done fast.
Once
@sspaeti
started learning its language, he was hooked.
Here he explains why Vim is more than just an editor & discusses its language, motions, & modes.
Learning
#rust
with
#duckdb
🧐.
So far: Converting exported transactions in XLSs to CSV and importing them into DuckDB.
Next step: Find missing rows 🙈.
🔗
Why do I always end up justifying Facts and Dimensions instead of directly creating a one-big-table with the fixed group when implementing dimensional modeling? How do you argue against it despite the added complexity no business user ever understands? 🫠
Hey everyone! Have you tried Ballista, a distributed compute platform primarily built in Rust and powered by Apache Arrow and Datafusion? It competes with Apache Spark for distributed SQL query processing. I'd be curious to hear anyone's thoughts 🤔.
Diving deep into "Business Intelligence, Semantic Layer, Modern OLAP, and Data Virtualization." Each has unique attributes but with intersecting goals. A thread👇
It takes a lot of work to keep up with the latest data engineering. We (
@AirbyteHQ
) created a survey to keep up with yearly trends.
Check out the exciting results from 886 participants, concluding the largest data engineering survey.
Explore HelloDATA BE on GitHub () and try the docker-compose on your local machine. We're early and value community feedback.
For a deeper understanding, our documentation covers the data stack, components, architecture, and infrastructure.
Some Personal News: I joined
@AirbyteHQ
as Data Engineer & Technical Author! 🎉 So proud and honoured to work with such talented people and a fantastic
#dataintegration
tool. Stay tuned if you are following me for
#datacontent
. I will spend more time writing than I could before.
A fascinating podcast about CRDTs and Automerge with
@martinkl
. I am looking forward to when we have a
@obsdmd
sync with CRDT for collaboration on top of Markdown.
I'm adding paid chapters to . It was a tough decision. But I have to try. I'm working hard to consolidate my decade of experience into one book, introducing new data engineering patterns and insights. I'm confident some will appreciate & pay a small amount.
Understanding semantic vs transformation layer:
> The transformation logic (e.g., dbt) and the logic hosted in metrics are different. A semantic layer transforms/joins data at query time, whereas the transformation layer does so during the transform step (T).
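To make the distinction concrete, here is a toy sketch using sqlite3 as a stand-in warehouse; the table, metric name, and helper function are all hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("EU", 100.0), ("EU", 50.0), ("US", 70.0)])

# Transformation layer: a dbt-style model, materialized at transform time (T).
con.execute("""CREATE TABLE revenue_by_region AS
               SELECT region, SUM(amount) AS revenue
               FROM orders GROUP BY region""")

# Semantic layer: the metric is defined once and compiled to SQL at query time.
METRICS = {"revenue": "SUM(amount)"}

def query_metric(metric: str, dimension: str) -> list:
    sql = (f"SELECT {dimension}, {METRICS[metric]} FROM orders "
           f"GROUP BY {dimension} ORDER BY {dimension}")
    return con.execute(sql).fetchall()

print(con.execute("SELECT * FROM revenue_by_region ORDER BY region").fetchall())
print(query_metric("revenue", "region"))
```

Both paths return the same numbers; the difference is when the aggregation runs, and that the semantic layer can serve any dimension without a pre-built table.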
Exploring the history of SQL reveals a fascinating evolution of data management. SQL's journey has been groundbreaking since its 1970s inception as SEQUEL to today's advanced Natural Language Queries.
What's your favorite SQL evolution with all these different SQL flavors today?
Built a free, open-source pipeline that fetches data from public APIs, ingests from a Postgres database, reads 17-million-row CSV files, writes Parquet files, reads them into DuckDB, aggregates, joins, sorts, runs dbt, and completes on my laptop in a minute, with materialization and…
🎉 Today, I'm sharing something different - A post where I share my heart, struggles, and triumphs. It's raw, honest, and real.
I took the
@0xFoster
course and read
@p_millerd
's Pathless Path book, which inspired me to dive deep into my personal journey.
I created a «Technical Writers' Collective» for anyone who is passionate about writing and interested in efficiency, PKM, workflow, and tools like Obsidian, Vim motions, etc. An intersection between tech and a love for writing.
@zulip
invite:
I can't stop thinking of CRDTs (Conflict-free Replicated Data Types).
> General-purpose data structures, like hash maps and lists, built for multi-user use from the ground up.
I'm looking forward to seeing how they'll be integrated into our daily tools for local-first collaboration.
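Since the quote describes CRDTs abstractly, here is a minimal grow-only counter (G-Counter) sketched in plain Python: each replica increments only its own slot, and merge takes the element-wise max, so replicas converge no matter the merge order.

```python
class GCounter:
    """Grow-only counter CRDT: one increment slot per replica."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict = {}

    def increment(self, n: int = 1) -> None:
        # A replica only ever bumps its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # which is exactly what makes merges conflict-free.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

# Two replicas update concurrently, then sync; both converge to 5.
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 5 5
```

Libraries like Automerge generalize this idea from counters to maps, lists, and text.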
Lots of insights from the Airflow migration day by
@dagster
. It is fantastic to see these customer examples. Here is a short recap of what I found most interesting. Thanks for sharing.
Nine months ago, I posted my latest blog post, but hey, here is a new one! This time it's all about getting your hands dirty with a real-estate
#dataengineering
project, including common challenges explained along the way:
What is a semantic layer?
> A semantic layer is something we use every day. We build dashboards with yearly and monthly aggregations and design dimensions for drilling down reports by region and product. What has changed is that we no longer use a singular BI tool; teams use different visualizations.
#movedata
: There are so many great speakers from the data engineering space! I loved the insights from everyone, all bundled into short lightning talks in one single YouTube playlist.
This is the future of blogging. Notes are updated constantly, evolving, and hosted on plain Markdown files. No lock-in to a platform that will be gone in 2-3 years, as I've experienced many times throughout my career. 👉🏻 Check out the first movers:
@anna__geller
@imrobertyi
@matsonj
@pdrmnvd
@AirbyteHQ
Thanks for sharing, Anna. Exactly, the data glossary is built on top of the digital garden/second brain analogy. Instead of single levels, it lets you go inwards. You can learn and go deeper into each connection with an interactive graph and backlinks. .
Have you missed the
#dbtcoalesce
keynote yesterday?
The most significant update is the dbt Semantic Layer and its robust integrations with other platforms. dbt Python is still early, and there is not much news there.
As Tristan Harris said before: "A handful of tech companies control billions of minds every day" by creating the most sophisticated technology to make us addicted. It's great to see a counterforce at least doing it for a good cause. :)
At age 37, I realized that my most productive days are when I sleep enough and let my brain wander. Instead of checking social media, drinking coffee, and watching yet another YouTube video for research, aka overstimulation, I do nothing.
Later in the day, I will have an insight I wouldn't otherwise have.
Breaking: the pathless path is now free, instantly
i asked myself, what would be the most fun thing to do with my book?
the recent Smart Friends podcast with
@EricJorgenson
convinced me to do this
(i am not tracking downloads)
👉
Don't specialize, hybridize.
> T-shaped hybrid path: engineering and design, or singing and dancing.
> U-shaped: engineering and dancing, or singing and design. Skills that are not often found together.
By becoming a hybrid, you can become greater than the sum of your skills.
As a seasoned computer scientist, I've learned the power of a Personal Knowledge Management (PKM) system for a deeper life. Imagine capturing every fleeting thought, every piece of knowledge, and interlinking them. It's more than productivity; it's crafting a deeper existence.
Quality software deserves your hard‑earned cash
Quality software from independent makers is like quality food from the farmer's market. A jar of handmade organic jam is not the same as mass-produced corn-syrup-laden jam from the supermarket.
Industrial fruit jam is filled with…
Btw, I'm trying out dlt, and the Postgres-to-Postgres sync was so slow for a biggish table that I exported it to DuckDB (leveraging the performance gains of ConnectorX and Parquet) and used the Postgres extension to import it back into Postgres.
Hacky? Yes totally! But so far, brutally fast.
What is a Data Catalog?
> A data catalog is a centralized store where all the metadata about your data is made searchable. Think of a Google search for your internal metadata.
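A toy stdlib sketch of that "search your metadata" idea; the entries and the helper function are all made up:

```python
# Hypothetical catalog entries: metadata about tables, not the data itself.
catalog = [
    {"table": "orders", "column": "amount", "description": "Order total in CHF"},
    {"table": "customers", "column": "region", "description": "Sales region code"},
]

def search(term: str) -> list:
    """Keyword search over table names, column names, and descriptions."""
    term = term.lower()
    return [e for e in catalog
            if term in e["table"]
            or term in e["column"]
            or term in e["description"].lower()]

print([e["table"] for e in search("region")])  # ['customers']
```

Real catalogs (DataHub, Amundsen, OpenMetadata) add lineage, ownership, and ranking on top, but the core is exactly this: an index over metadata.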