Bank: We're going to need a wet signature on this form. Please print, sign, scan, and send back to us.
Me:
% brew install imagemagick
% convert -density 90 bank_form.pdf -rotate 0.5 -attenuate 0.2 +noise Multiplicative -colorspace Gray scanned_form.pdf
"Have you ever tried multiplying Roman numerals? It's incredibly, ridiculously difficult. That's why, before the 14th century, everyone thought that multiplication was a difficult concept, reserved for the mathematical elite."
"Then Arabic numerals came along, with their nice place values, and we discovered that even seven-year-olds can handle multiplication just fine. There was nothing difficult about the concept of multiplication: the problem was that numbers, at the time, had a bad user interface."
Today @Motherduck, the company behind @duckdb, announced that they've raised $100mm.
Yesterday, @tabulario, the team behind Apache Iceberg, announced a fresh $26mm round.
And last week Databricks added $500mm to its coffers.
What is happening?!
What I believe we are now
This morning I was lucky enough to catch up with @duckdb creator @hfmuehleisen. I asked him about the most surprising pattern he's observed for DuckDB in the wild.
He shared that it's not just interactive use cases that are driving adoption, such as powering faster data applications
The "structured versus unstructured" taxonomy is outdated and unhelpful, a shibboleth of the Hadoop era.
Data infrastructure today is dominated by *structured* data: JSON events, Kafka payloads, Postgres CDC logs, Hive-partitioned buckets of Parquet.
"Modeled versus
Today @ClickHouseDB announced they're moving into the embedded OLAP engine space with their acquisition of @chdb_io, directly competing with @duckdb.
Why is this a big deal?
Because @chdb_io, like @duckdb, provides a cheaper, faster, and SQL-ier alternative to Spark for
Could not be more excited to have @Auxten join forces with us to focus on @chdb_io (in-process version of @ClickHouseDB) full time!
How are you using chDB? What do you want us to focus on next? Share your ideas here as we embark on this journey together:
"SQL + YAML = dashboard" officially launched today by @rilldata. A BI-as-code stack controllable via git, powered by @duckdb. Ask and ye shall receive, @josh_wills.
While I wait for @TopcoatData to be open-sourced, what is my best declarative, SQL+YAML way of creating a simple dashboard against DuckDB? @RillData, do y'all have a way to help me out here?
So @DuckDB's native format is 4x faster than Parquet, but 2x larger on disk. CSV is a poor performer across the board. Blog post:
(Also discussed in our Discord channel here)
The database market is bifurcating into two broad battlefields, with 100s of firms competing for a few trillion dollars of market share, backed by staggering amounts of capital:
* A cost war for SQL-at-scale, playing out between warehouses (Snowflake, Oracle) & lakehouses
Apache Arrow began with faster analytics, but is now the core of a new breed of infrastructure (Iceberg, Parquet, Polars, @duckdb), says @buckymoore:
"A new, unbundled OLAP architecture [where] data is stored directly in object storage like S3 or GCS."
Data lake architectures continue to rise in prominence with Tabular’s $26m funding announcement.
Apache Iceberg, Tabular’s core tech, was forged at Netflix by eng teams wanting to layer a better table API on top of unstructured data lakes.
Iceberg is a foundational pillar,
Exciting news! We closed a $26M round of funding from Altimeter, @a16z, and Zetta Venture Partners to build our independent data platform based on #ApacheIceberg.
We've also added #GoogleCloud and Amazon Athena support.
Read more here:
The obsession with @DuckDB has at times resembled a cult following. 🦆🦆🦆
So why did we build @rilldata's data profiling & dashboard building tool with @duckdb?
A substantive argument deserves more than one tweet. In the following 🧵, I discuss why we chose it in 2021.
👇👇
Yes, we got rid of the run button in our SQL editor, thanks to @duckdb.
No more tapping out SQL, clicking "run", watching spinners, and waiting for results.
Just type "SELECT * FROM foo" and the results appear before your finger lifts from the keyboard.
@hamiltonulmer talks
Big news in the data world:
Lloyd Tabb, after selling Looker to Google for $2.6B, is now leaving to join... Meta.
At first glance, it's an unusual move. Why Meta?
Lloyd's latest creation is a "better SQL" called @MalloyDev.
Meta offers a crucible for shaping Malloy into
“Necessity is the mother of invention”. This coming week I’m starting a new job at Meta to work on Malloy and to bring Malloy into Meta’s internal data tooling.
dbt has taught an entire generation of analysts that GitHub, command-line interfaces, and codeful workflows are to be embraced, not feared.
BI-as-code is a natural extension of this philosophy further up the stack to data applications and dashboards.
Is anyone building a serverless time-series database?
The architecture would be a one-dimensional version of what cloud optimized GeoTIFFs use: tiles across time, with different aggregation levels (minutely, hourly, daily, etc), all backed by an object store.
You wouldn't
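The tiled layout above can be sketched in a few lines. Everything here is hypothetical: the key scheme, the tile spans, and the `tile_key` helper are illustrative, not an existing system's API.

```python
from datetime import datetime, timezone

# Seconds of data covered by one object at each aggregation level
# (hypothetical values; coarser levels span longer windows).
TILE_SPANS = {"minute": 3600, "hour": 86400, "day": 86400 * 30}

def tile_key(metric: str, level: str, ts: datetime) -> str:
    """Object-store key for the tile containing `ts` at a given level."""
    epoch = int(ts.replace(tzinfo=timezone.utc).timestamp())
    tile_start = epoch - (epoch % TILE_SPANS[level])
    return f"{metric}/{level}/{tile_start}.parquet"

# A wide time-range query reads a few coarse "day" tiles; a narrow one
# reads fine-grained "minute" tiles, like GeoTIFF zoom-level overviews.
print(tile_key("cpu", "hour", datetime(2023, 6, 1, 12, 34)))
```

Because each tile is an immutable object with a deterministic key, any serverless function can resolve a query to a handful of GETs against the object store.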
A smart friend recently asked me: Are vector databases a product or a feature?
@Pinecone, @qdrant_engine, @trychroma, @weaviate_io, and @milvusio have raised hundreds of millions and are collectively worth billions. They represent the fastest-growing segment in data
If you live in NYC, I'm hosting a salon this Tuesday night with ~25 founder/CEOs & CTOs to talk data infra + AI at Manhattan's oldest distillery. We'll be hosting a discussion over drinks led by:
* Edo Liberty, founder of @pinecone
* Erik @Bernhardsson, founder of @modal_labs
Today, I'm excited to share that @rilldata has raised $12mm of capital from a data supergroup of investors to re-imagine how business dashboards are built and used.
Sometimes as a data person, you just need to know "What's in that S3 bucket?"
No README file, no schema, no Slack discussion can substitute for a few minutes of just 👀 looking at the data. 👀
So here's a demo of going from Parquet file to Pivot table in less than 60
A killer feature for GitHub would be a more fully integrated object store offering.
Many projects reference data in S3. This requires managing a second set of credentials.
I want a GitHub storage service, with ghs:// style pointers, and have it just work.
Data lakes are increasingly the foundation of companies' data infrastructure, so it's great to see orchestrators like @RudderStack, @bobsled, @AirbyteHQ, and now @fivetran making them first class destinations (with help from @ApacheIceberg, formats like Parquet, and engines like
Data lake support is one of the most technically challenging things we've ever delivered. Writing updates to S3 requires building a quasi-DWH inside Fivetran. We use @DuckDB to rewrite the Parquet files and built a BigQuery-style scale-out service to deal with large tables.
The @RillData team was in Brussels for #fosdem, and we rode the train back to Amsterdam with Alexey Milovidov, creator of @ClickHouse. He asked us:
"So why don't you scale up Rill on ClickHouse?"
So here we are: interactively exploring a decade's worth of Wikipedia traffic.
.@RillData's operational BI vision really resonates with me: beautiful, immersive, SQL-driven BI-as-code. 😍
Sneak peek into a PR in the works. In the picture: instant queries on a dataset of 480,933,298,381 (half a trillion) records in @ClickHouseDB 🚀
The line between localhost and cloud is blurring.
Apps like Figma & Notion leverage local compute but with transient local state, backed by cloud, for a fast, responsive UX.
WASM unleashes this same power for web apps.
An M1 chip is a terrible thing to waste.
With @duckdb, we can skip the data warehouse and build blazing-fast analytics *directly* on Apache Parquet files in S3 or GCS, no ETL necessary.
Go from 10GB Parquet --> interactive pivot table in the browser, in seconds. Cloud scale meets MacBook M2 speed.
Jeff Bezos famously quipped, "Your margin is my opportunity."
In data infrastructure, the profit margins of Snowflake and Databricks are an opportunity for the SQL-on-object-store insurgents: Tabular, MotherDuck, and others.
So here's a prediction...
Tomorrow night inside the Twitt(er, X) HQ in SF, @rilldata is hosting a salon with dozens of founders, builders, & innovators in the most exciting niche in data: serverless infrastructure. (DM me if you'd like to join us.)
Why is this creative surge in data infra happening?
Spotlight vs lantern intelligence in analytics.
One of the root causes of dashboard sprawl in companies is applying a "spotlight" philosophy to business questions.
What were our top-selling products last week? Here's a Top-Selling Products dashboard.
Which cohorts are
If you're a startup, the ability to make sense of your unit economics at a customer-level is a superpower.
But it's a deceptively hard problem to solve. Why?
The heart of the matter is the mismatch between your pricing model and your cost model. You price on one set of axes
Here's an idea I wish someone would build: Mixpanel-style analytics for customer unit economics.
At @monzo, we had an event-driven data architecture that allowed us to assign cost or revenue to every action a customer might take.
Spend £120 on a debit card in NYC? £2 of
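The idea reads as: every event carries its own revenue and cost, so customer-level margin is just an aggregation. A toy sketch, with a schema and numbers invented purely for illustration:

```python
from collections import defaultdict

# Hypothetical event stream: each customer action is tagged at
# ingestion time with the revenue it earns and the cost it incurs.
events = [
    {"customer": "c1", "type": "card_payment",   "revenue": 1.20, "cost": 0.35},
    {"customer": "c1", "type": "atm_withdrawal", "revenue": 0.00, "cost": 0.90},
    {"customer": "c2", "type": "card_payment",   "revenue": 2.40, "cost": 0.70},
]

# Unit economics per customer falls out as a simple group-by.
margin = defaultdict(float)
for e in events:
    margin[e["customer"]] += e["revenue"] - e["cost"]

for customer, m in sorted(margin.items()):
    print(customer, round(m, 2))
```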
Data startups love building demos on publicly available data sets.
But in my experience, no one uses these demos.
Demo data is to data tools what Lorem Ipsum is to publishing tools: nearly useless.
If you want to get someone excited about your tool, they need to see how
.@hamiltonulmer recently gave the first ever live demo of @rilldata at @BrowserTech SF. He shows what's possible when you combine modern browsers + @duckdb + BI-as-code philosophy... with Midjourney slides to boot!
Fascinating tour of the modern data stack running at AirBnB, through the lens of Minerva, their data modeling middleware.
“Define metrics once, use them everywhere.”
Fast databases like @ClickHouseDB deserve fast dashboards.
We're excited to announce that as of our 0.41 release, @RillData dashboards can now run on @ClickHouseDB.
With this live connector, ClickHouse users can instantly transform any table into an exploratory dashboard,
1/ Bulk open data is best served as statically hosted Parquet files, with CSV equivalents. It's faster, easier to use, and cheaper to host than alternatives such as custom APIs.
New blog:
Am I missing something? So you agree? Interested in views!
WYSIWYF (What You See Is What You Fetch) is my new preferred acronym for lazy query evaluation / deferred loading of data in scrollable, interactive tables in @rilldata...
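A minimal sketch of the WYSIWYF idea: the table widget translates the visible scroll window into a LIMIT/OFFSET query and fetches nothing else. `run_query` is a stand-in for whatever executes SQL against the backing store; the names are illustrative, not Rill's actual API.

```python
def fetch_window(run_query, table: str, first_row: int, page_size: int):
    """Fetch only the rows currently visible in the viewport."""
    return run_query(f"SELECT * FROM {table} LIMIT {page_size} OFFSET {first_row}")

# Scrolling to row 200 in a 25-row viewport fetches exactly those 25 rows.
issued = []
fetch_window(issued.append, "events", first_row=200, page_size=25)
print(issued[0])  # SELECT * FROM events LIMIT 25 OFFSET 200
```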
(About to go on stage at #fosdem2024 in Brussels to talk about this + other strategies for keeping data
Not just S3, but other object stores such as Azure Blob Storage, Google Cloud Storage, and increasingly Cloudflare R2, all offer cheap, reliable, vast storage, attachable to serverless compute metered by the millisecond.
This will reshape cloud infrastructure.
S3 is increasingly becoming the default storage layer for cloud infrastructure. I wrote notes on this trend, its benefits, its challenges, its early adopters, and the opportunity it presents for new startups to disrupt large infrastructure categories
Postmodern Data Stack: a serverless approach that builds at run-time a pipeline, database, and data application. No stateful infrastructure required.
(HT to @matsonj, whose epic MDS-in-a-box thread yesterday inspired this work by David.)
Spent some time simplifying a PoC I did a while back stitching together @getdbt, @duckdb, and @RillData.
It works on Codespaces/Devcontainer! More info in the README. 👇
Where @DuckDB SQL goes, others follow: now Snowflake has adopted the superior ergonomics of DuckDB's GROUP BY ALL expression.
Dare I call it imprinting...
New in @SnowflakeDB's SQL: GROUP BY ALL
This saves time and prevents errors, as the compiler figures out which columns need aggregation.
E.g., in the pic, GROUP BY ALL takes care of the query, instead of "GROUP BY tag, answered, year" or the more obscure "GROUP BY 1, 3, 4".
@villi
For five years my wife asked if she could DocuSign the IRB (medical research) approvals at her hospital, and was told no. Last week, they introduced DocuSign.
Code-first products present an unreasonably effective interface to AI models, because AIs excel at generating the very code these products run on.
Today's release of Rill 0.41 introduces a related, powerful side effect of our BI-as-code philosophy: You can now create an
Cloud data warehouses like Snowflake offer cheap storage but expensive (and at scale, slow) access.
Data stores like Druid, Pinot, and ClickHouse offer expensive storage but cheap (and fast!) access.
Choose the right database for your application.
To kick off our sponsorship of DuckCon #4 tomorrow in Amsterdam, we created this ~45 second video showing how the blazing speed of @duckdb powers a radically new kind of BI experience via @RillData.
Instant slicing & dicing, automatic visualizations, interactive pivot tables, &
Machine learning algorithms are best suited to replace humans in systems with decisions that are fast, frequent and -- most importantly -- inconsequential if wrong.
So we should stop obsessing about high-death-potential autonomous driving until we've nailed autonomous vacuuming.
I recently joined @ericdodds and @KostasPardalis on The Data Stack Show to chat OLAP engines and BI.
Here's the tl;dr of what I said, to save you 57 minutes of your life:
* Long live OLAP - fast OLAP engines make dashboards awesome; but scaling up OLAP is hard (c.f.
Open-source projects stand on the shoulders of giants, or in our case, 🦆🦆🦆. The @rilldata dashboard tool is powered by @sveltejs and @duckdb. Try it yourself:
curl -s | bash
Today's release of @RillData Developer includes the ability to search for dimension values and choose whether to exclude or include them. These new features make a powerful addition to the dashboard experience and will help you find the right insights faster than ever.
"The data science community is reinventing DBMSs... poorly." - @hfmuehleisen
@duckdb changes that, with a state-of-the-art analytical DBMS that runs blazingly fast on your M1 laptop.
Making data fast makes humans happy.
The video from our lightning talk "Flying Fast at Scale with DuckDB" at #FOSDEM last weekend in Brussels is here! We speed-run through how we've optimized @RillData and @duckdb to deliver an operational BI tool that sparks joy.
We do a live demo, discuss our 3-in-1