This week I'm doing an internal talk at Amazon about an approach to system design that I use a lot, and think would be useful to a lot of people: simulation. This thread is a summary of the talk 1/
Tomorrow, the DynamoDB team is going to be presenting "Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service" () at ATC. This is a super exciting paper that covers a real-world big system, and how it has evolved.
15 years! I started at AWS, with the EC2 team in Cape Town, on the 1st of August 2008. It's been a real pleasure to have a front row seat for the growth of cloud, to be involved in the genesis of serverless, and to have exciting problems to work on every day. Some memories:
Histograms are rightfully a popular tool for visualizing and thinking about latency. But I believe that empirical distribution functions (eCDFs) are almost always a better choice. Let's look at an example to understand why. This highly bimodal distribution:
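To make the idea concrete: an eCDF is just the sorted sample plotted against cumulative fraction, and percentiles read straight off it. A minimal sketch (the bimodal data below is synthetic, invented for illustration):

```python
# Sketch: an eCDF of a bimodal latency sample. Data is made up.
import random

random.seed(42)
# Bimodal: most requests fast (~10ms), a slow mode around ~200ms
latencies = [random.gauss(10, 2) for _ in range(900)] + \
            [random.gauss(200, 20) for _ in range(100)]

def ecdf(samples):
    """Return (x, y) points of the empirical CDF."""
    xs = sorted(samples)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def percentile(xs, ys, p):
    """Read a percentile straight off the eCDF points."""
    for x, y in zip(xs, ys):
        if y >= p:
            return x

xs, ys = ecdf(latencies)
print("p50 ~", round(percentile(xs, ys, 0.50)))
print("p99 ~", round(percentile(xs, ys, 0.99)))
```

Unlike a histogram, there's no bin width to choose, and both modes (and every percentile between them) are visible on one curve.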
If we draw database rows as points, and add edges between rows that appear in the same transaction, the resulting graph is a great way to think about potential scalability. The more you can cut the graph up without crossing edges, the easier the workload is to scale.
A couple weeks back, I did a talk titled "Distributed Systems Solve Only Half My Problems (and I have a lot of problems)" at HPTS'22. Talks at HPTS aren't recorded, so here's a summary of what I said.
In distributed systems, especially deep SoA and microservice architectures, retries are mostly bad, despite being considered by many to be a "best practice". Specifically, doing more when you're overloaded is bad for availability, stability, and efficiency.
The internet is talking about retries and backoff! As we've seen over the last day or so, simply retrying without backoff leads to an unbounded increase in work. This, in turn, tends to make overloaded systems even more overloaded. What can we do instead?
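One option is capped exponential backoff with "full jitter": each retry waits a random time up to an exponentially growing cap, spreading retries out instead of synchronizing them. A sketch (constants are illustrative, not recommendations):

```python
# Sketch: capped exponential backoff with full jitter.
# BASE and CAP are illustrative values, not recommendations.
import random

BASE = 0.1   # seconds
CAP = 20.0   # seconds

def backoff_with_full_jitter(attempt):
    """Sleep time (seconds) before retry number `attempt` (0-based)."""
    return random.uniform(0, min(CAP, BASE * 2 ** attempt))

for attempt in range(5):
    delay = backoff_with_full_jitter(attempt)
    print(f"attempt {attempt}: sleeping {delay:.3f}s before retrying")
```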
If you write your own database, durable storage, or compute isolation, your team will eventually become a database, durable storage, or compute isolation team. These problems tend to be all-consuming. Do you want your team to specialize in these problems? Can you afford it?
That's it. That's the talk.
"Never write a database. Even if you want to, even if you think you should. Resist. Never write a database. Unless you have to write a database. But you don't."
I will present this talk at any conference of your choosing.
A common problem with long queues in distributed systems is that they make recovery time worse: by the time a system recovers, it has built up a long backlog of work that needs to be done before new work succeeds. LIFO queues (stacks) are sometimes a good way to avoid that, but..
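A toy simulation of the effect (timeout, tick counts, and backlog size are all invented): after an outage, a FIFO queue spends its capacity serving old requests whose clients have already given up, while a LIFO stack serves the freshest work first.

```python
# Sketch: FIFO vs LIFO service during recovery from a backlog.
from collections import deque

TIMEOUT = 5  # client gives up after 5 ticks (illustrative)

def drain_one(queue, lifo, now):
    """Serve one item; True if the client was still waiting for it."""
    enqueued_at = queue.pop() if lifo else queue.popleft()
    return (now - enqueued_at) <= TIMEOUT

def useful_responses(lifo):
    """Backlog enqueued at ticks 0..9, service resumes at tick 10."""
    backlog = deque(range(10))
    return sum(drain_one(backlog, lifo, now=10 + i) for i in range(10))

print("FIFO useful responses:", useful_responses(lifo=False))
print("LIFO useful responses:", useful_responses(lifo=True))
```

In this toy, FIFO serves nothing but already-abandoned requests; LIFO gets a few live ones through, at the cost of fairness and of possibly never serving the oldest work.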
Sad to learn of the passing of Richard Cook
@ri_cook
. Richard's talk, and article, "How Complex Systems Fail" is an absolute classic in the field. Absolutely worth your time:
The declarative nature of SQL is a major strength, but also a common source of operational problems. This is because SQL obscures one of the most important practical questions about running a program: how much work are we asking the computer to do?
What many people don't understand with SQL is that it is declarative. When you say ORDER BY it doesn't tell the DB to sort the data. It declares that you want an ordered result. Only the execution plan will tell you if there's a sort operation or not
On Wednesday, I'm presenting a tech talk titled "Gigabytes in milliseconds: Bringing container support to AWS Lambda without adding latency" about how we added container support to AWS Lambda, and some of the technical challenges we faced along the way:
"The Amazon Time Sync Service now gives you a way to synchronize time within microseconds of UTC on Amazon EC2 instances.... customers can now access local, GPS-disciplined reference clocks on supported EC2 Instances."
Our new paper "On-demand Container Loading in AWS Lambda" is now up (). New blog post highlighting some of what's interesting in the paper: Featuring erasure coding, deduplication, lazy loading, FUSE, and more.
New blog post, looking at the great new paper from the Amazon MemoryDB folks: Much of the discussion on distributed systems is about scalability, but this paper shows how availability, durability, cost, and performance are equally important.
This video, seemingly about FastPass at DisneyLand, is a fantastic introduction to queues and quality of service as complex interacting systems of technology, people, and incentives:
Deterministic testing (simulation testing) is a super powerful tool for building correct distributed systems. Write unit tests that test packet loss, network partitions, and more. Excited to see Turmoil 0.3 released:
This talk is now available on YouTube:
I cover a lot of ground in 15 minutes, including container flattening, erasure coding, convergent encryption, and an overview of the Lambda architecture.
On Wednesday, I'm presenting a tech talk titled "Gigabytes in milliseconds: Bringing container support to AWS Lambda without adding latency" about how we added container support to AWS Lambda, and some of the technical challenges we faced along the way:
Formal methods are widely used in many software systems you're using today, including many of the most important parts of the Internet's infrastructure.
E.g. cloud systems like S3 (), EBS (), and others ()
Formal methods for building provable software systems have never shown themselves to be useful or successful for anything but a tiny sliver of any complex software-intensive system.
It will similarly fail for any attempts to build a so-called AGI.
The way Corey Quinn treats my colleagues and friends online is despicable, and the fact that AWS continues to enable him is an embarrassment to all of us.
Meta's "Cache Made Consistent" paper covers what seems like some cool work on observability and correctness. But I think they're understating what it is that fundamentally makes caches difficult.
Erlang's work on telephone systems in the early 20th century is foundational to how we think about, and build, distributed and cloud systems 100 years later. How can this work, done before modern computing was even a field, be so important?
"FoundationDB: A Distributed Key-Value Store", from this month's CACM, is a great read. Well worth checking out for anybody who works on the architecture of large systems:
New blog post: "Invariants: A Better Debugger?" about the power of invariants as a technique for testing and debugging algorithms and systems (and why I tend not to reach for debuggers or printf as my go-to way to debug):
New blog post: "Formal Methods Only Solve Half My Problems" about the need for tools that allow us to reason quickly, and quantitatively, about distributed systems at the design stage.
New blog post about why circuit breakers may not solve your problems: (and why they're hard to make compatible with modern distributed systems design practices).
When do you want backoff and jitter, and when do you want adaptive retries? Are they just two ways to do the same thing, or is there something different about them? New blog post:
New small blog post, on latency, utilization, a bit of queue theory, and how the latency gains from efficiency work sometimes don't last as long as we'd like:
Joe's right about this. But why do caches lead to long outages? Let's explore one reason with a small simulation, starting with a really simple two-tier system, and seeing what happens when a cache gets emptied.
Both of these are true statements:
• Caches are responsible for more outage minutes than most other design patterns in modern computing.
• Caches are an integral part of modern computing, without which computing like we know it wouldn't exist.
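Here's the kind of small simulation I mean: a frontend with a cache in front of a backend that can only absorb a limited number of misses per tick. All numbers are made up for illustration.

```python
# Sketch: what happens to a two-tier system when the cache empties.
# A successful miss refills the cache; an overloaded backend fails
# the request, which keeps the cache from refilling.
import random

random.seed(1)
KEYS = 100
DEMAND = 50            # requests per tick
BACKEND_CAPACITY = 10  # misses the backend can absorb per tick

def tick(cache):
    backend_load = failures = 0
    for _ in range(DEMAND):
        key = random.randrange(KEYS)
        if key in cache:
            continue                 # cache hit: cheap
        if backend_load < BACKEND_CAPACITY:
            backend_load += 1
            cache.add(key)           # successful miss refills the cache
        else:
            failures += 1            # backend overloaded; request fails
    return failures

cache = set(range(KEYS))             # warm cache: everything is a hit
print("warm cache failures:", tick(cache))
cache.clear()                        # the cache gets emptied...
print("cold cache failures per tick:", [tick(cache) for _ in range(5)])
```

The warm system looks healthy because the backend is sized for the miss rate, not the demand. Empty the cache and the backend can't serve enough misses to refill it, so the system can stay degraded long after the triggering event.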
Interested in hearing more about how S3 works on its 18th birthday?
Check out Andy Warfield's OSDI'23 keynote ()
or this great talk by Amy and Seth at Reinvent'22:
We’re celebrating AWS
#PiDay
AND the 18th birthday of our first generally available service, Amazon S3! Since it launched, S3 has grown to become the world’s most popular cloud data store with more than 350 trillion objects and now 1 million+ data lakes running on
@awscloud
. S3
Then we can fetch N+1 in parallel, and immediately be done when the first N comes back. That makes the system completely resilient to one deterministically slow server, and strongly resistant to long outlier tail latencies.
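A minimal sketch of that idea with threads (`fetch` is a stand-in for a network call; the replica names and timings are invented):

```python
# Sketch: issue N+1 fetches in parallel, return when the first N
# complete, and abandon the straggler.
import concurrent.futures as cf
import time

def fetch(replica):
    """Stand-in network call: replica 0 is deterministically slow."""
    time.sleep(2.0 if replica == 0 else 0.05)
    return f"chunk-from-{replica}"

def fetch_first_n(n, replicas):
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(fetch, r) for r in replicas]
    results = []
    for fut in cf.as_completed(futures):
        results.append(fut.result())
        if len(results) == n:
            break
    pool.shutdown(wait=False)   # don't block on the straggler
    return results

start = time.time()
chunks = fetch_first_n(4, replicas=range(5))
print(f"got {len(chunks)} chunks in {time.time() - start:.2f}s")
```

Even though one replica takes two seconds, the caller completes in tens of milliseconds, because any 4 of the 5 responses are enough.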
The story of MemDS in the DynamoDB paper is a fascinating one. DynamoDB used to use a metadata cache with a very high hit rate ("cache hit rate was approximately 99.75 percent"). What's not to love about a cache with a 99.75% hit rate?
For several years I read every COE (essentially postmortem) that was written at AWS. I don't do it anymore, but still read many. AWS's culture of writing quality COEs, then having a lot of people read and discuss them (at every level) is great.
New blog post, following up on circuit breakers and retries. Using simulation, I compare the "token bucket" retry strategy, circuit breaker retry strategy, and some classic approaches:
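As a rough illustration of the token bucket idea (class name and constants are my own, for illustration): retries spend tokens from a shared bucket, and successes pay a little back, so when most calls are failing the retries throttle themselves away.

```python
# Sketch of a token-bucket retry limiter (in the spirit of the
# strategy used by some AWS SDKs; constants are illustrative).
class RetryTokenBucket:
    def __init__(self, capacity=10.0, retry_cost=5.0, refund=1.0):
        self.capacity = capacity
        self.tokens = capacity
        self.retry_cost = retry_cost
        self.refund = refund

    def try_acquire_retry(self):
        """Spend tokens for a retry; False means: don't retry."""
        if self.tokens >= self.retry_cost:
            self.tokens -= self.retry_cost
            return True
        return False

    def on_success(self):
        """Successful calls slowly refill the bucket."""
        self.tokens = min(self.capacity, self.tokens + self.refund)

bucket = RetryTokenBucket()
# During a total outage every call fails: only the first couple of
# retries are allowed, then the bucket is empty and retries stop.
allowed = sum(bucket.try_acquire_retry() for _ in range(10))
print("retries allowed during outage:", allowed)
```

The key property: retry traffic is bounded by the success rate, so a widespread failure can't turn into a retry storm.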
Atomic commitment - the fundamental mechanism behind scale-out databases - has some really surprising scaling behaviors. New blog post: "Atomic Commitment: The Unscalability Protocol"
You can now run your AWS Lambda functions on Graviton 2 processors! "Lambda functions using the Arm/Graviton2 architecture provide up to 34 percent price performance improvement."
If you're interested in correctness of distributed systems, you'll likely enjoy "Demystifying and Checking Silent Semantic Violations in Large Distributed Systems" from folks at JHU.
A few interesting trends from chatting to folks at OSDI/ATC today.
1/ Rust seems to have become the default language for systems work across a lot of academia and industry (quite suddenly, because it definitely wasn’t that way in early 2020).
New blog post, on the assumptions that distributed systems make, and how thinking about those assumptions as 'optimistic' or 'pessimistic' can lead to better designs:
Cool visualization!
One thing that's particularly cool about this technique is that it's robust to stale load data. The higher you make 'k' in best-of-k, the better the ideal load balancing but the worse the effect of stale load data.
A favorite load balancing technique at AWS is "the power of two random choices"
On the left, nodes are chosen and used at random
On the right, 2 nodes are chosen at random, but only the minimum is used
This simple technique balances load very well
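A small sketch of the comparison (node and request counts are arbitrary):

```python
# Sketch: random placement (k=1) vs best-of-2 random choices (k=2).
import random

random.seed(7)
N_NODES, N_REQUESTS = 100, 10_000

def place(k):
    """Place each request on the least-loaded of k random nodes."""
    loads = [0] * N_NODES
    for _ in range(N_REQUESTS):
        candidates = random.sample(range(N_NODES), k)
        target = min(candidates, key=lambda n: loads[n])
        loads[target] += 1
    return loads

for k in (1, 2):
    loads = place(k)
    print(f"k={k}: max load {max(loads)} (mean {N_REQUESTS // N_NODES})")
```

The max load with k=2 ends up much closer to the mean than with pure random placement, which is the whole trick.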
New blog post on some rules for effective writing:
Bottom line: write for somebody. Have somebody (or a kind of person) in mind that you're trying to communicate with. Think about what they know, what you want them to know, and what you want them to do.
The video of my ATC'23 talk on "On-Demand Container Loading in AWS Lambda" is now available:
I tried to take a high-level view of the paper, focusing on why we made the decisions we did, and where we spent our complexity budget.
I've been spending a good bit of time recently picking up P, a language for specifying and modelling distributed systems: Some initial impressions, especially compared to TLA+:
Stop using non-cryptographic PRNGs. Stop using them for simulations. Stop using them for crypto. Stop using them for jitter. Stop using them in a box. Stop using them with a fox. Insist your OS and hardware can give you high quality randomness at the rate you need.
Very cool work from Anand et al at SOSP'23: "Blueprint: A Toolchain for Highly-Reconfigurable Microservices" (). I have a lot to say about this paper, but my favorite part is the treatment of metastable failures.
Completely unsurprisingly, the effect of isolation levels on latency in PostgreSQL is very sensitive to concurrency (and therefore frequency of conflicts).
All good ideas! But what are their downsides? First: arrays. Memory safety issues. Leaky abstraction over true cost of random vs sequential accesses. Common operations (push front, delete, grow, insert-in-place, etc) expensive. Requires fixed sizes. 8/10
Interested in adopting formal methods, or exploring how they can help you and your team move faster? Check out this talk from
@ankushpd
and Bikash Behera from
#reinvent2023
:
Ever heard people say that transactions don’t scale? Curious about how DynamoDB does transactions at massive scale with low latency? Interested in the tradeoffs between different ways of doing distributed transactions? Check out this new paper to see how it's done in
@dynamodb
Excited to share our published paper at USENIX ATC 23 on how distributed transactions were implemented in
@dynamodb
using a timestamp ordering protocol without sacrificing high scalability, high availability, and predictable performance at scale
Fun article about the early days of EC2 and the cloud in Cape Town. It was great working with all the folks in the photo, and I think most are still
@awscloud
The thing most discussions of simplicity vs complexity miss is that simplicity is a property of a system. It's always easy to simplify a component by pushing the complexity elsewhere in the system.
This seems fun. Here's my try.
A 'vector' is a fancy name for a place in space. Like a pin on a map. A vector database is good at storing these places, and being asked questions like "give me a few other places near this one".
I've heard a *lot* of people take a stab at explaining how a vector database works in simple terms.
I'd love to hear how others explain them at the "101" level.
The beauty of distributed systems is that you can build a system that is up 99.99% of the time, even when it runs on EC2 instances which each have an up time of 99.5%.
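The arithmetic behind that claim, assuming independent failures and perfect failover (which real systems only approximate):

```python
# Sketch: availability of n redundant replicas, assuming failures are
# independent and failover is perfect. The system is down only when
# every replica is down at once.
def system_availability(per_node, n):
    return 1 - (1 - per_node) ** n

for n in (1, 2, 3):
    print(f"{n} nodes at 99.5%: system is up "
          f"{system_availability(0.995, n):.6%} of the time")
```

Two 99.5% instances give 1 - 0.005² = 99.9975%, already past the four-nines target. The catch, of course, is the independence assumption.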
New blog post, of the "long email that became a post" genre. This one has something like career advice in it, about being explicit about how you spend your time:
"Today we are happy to announce Snapchange, a new open source project to make snapshot-based fuzzing much easier. Snapchange enables a target binary to be fuzzed with minimal modifications, providing useful introspection that aids in fuzzing."
Some highlights from "Achieving scale with Amazon Aurora Limitless Database" with David Wein and Christopher Heim, diving into the new Aurora Limitless Database.
SUPER small sample. As in sample of 1. But cold start for a
@rustlang
@awscloud
Lambda deserializing a Kinesis record stored in base64. Build output is 2MB as well. Color me impressed. Onward I will press ...
"Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3" (), the SOSP'21 paper from the S3 team at AWS, is a good read. A couple highlights (in my non-official opinion):
New blog post, on the relationship between multitenancy and scalability in the cloud (and how serverless enables scalability in a fundamentally different way to traditional architectures):
The scale challenge of building cloud services isn't just because they're big, but because they span such a huge range of sizes.
E.g. the 300,000,000,000x scale difference between the DynamoDB behind my blog, and Amazon's Prime Day use-case.
Somewhere at Google there's a vast ML model that looked through my decades of search and usage history, and decided that what I really need is a push notification about Jennifer Aniston's new haircut.
Another related thing: in the graph, nodes with high indegree tend to become "hot keys" even if the external access patterns are uniform. A ton of DB tuning best-practices are about avoiding these high-degree nodes.
If we draw database rows as points, and add edges between rows that appear in the same transaction, the resulting graph is a great way to think about potential scalability. The more you can cut the graph up without crossing edges, the easier the workload is to scale.
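A toy sketch of building that graph and finding its independently scalable pieces (the workload below is invented):

```python
# Sketch: rows as nodes, an edge between rows touched by the same
# transaction. Connected components never need to coordinate, so
# they can live on different shards.
from collections import defaultdict
from itertools import combinations

transactions = [
    {"user:1", "acct:1"}, {"user:2", "acct:2"},
    {"user:3", "acct:3"}, {"user:1", "acct:1"},
]

adj = defaultdict(set)
for txn in transactions:
    for a, b in combinations(txn, 2):
        adj[a].add(b)
        adj[b].add(a)

def components(adj):
    """Connected components via depth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node])
        seen |= comp
        comps.append(comp)
    return comps

print("independently scalable shards:", components(adj))
```

Every transaction here stays inside one component, so each component can be a shard with no cross-shard coordination. A transaction spanning two components would add an edge, merging them.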
New blog post, on what the word "scalable" means in my head, and how thinking about marginal costs of adding work makes a lot of the debates about scalability go away:
This whole thing is worth reading. My hot take is the ability to do efficient parallel IO (net and storage) without significant extra programmer effort is the most important thing a system language can offer. Latency will continue to lag bandwidth. Parallelism is king.
Check out this article posted on USENIX's ;login: online: 'Investigating Managed Language Runtime Performance' by David Lion, Adrian Chiu, Michael Stumm, and Ding Yuan:
#OpenAccess
Very cool look inside the labs at Annapurna Labs, designing and building custom silicon for AWS: I'm personally especially excited about the power and efficiency improvements that Graviton has brought to AWS.
One absolutely does need a model of failures to discuss reliability.
For example, one of our most powerful HA tools (redundancy) is exponentially powerful when failures are independent, but useless when failures are correlated.
You ever see the same term pop up in a few places and think "hmmm"?
@TigerBeetleDB
often talks of "fault models". Wonder if there's more to this idea than I think.
(Paper is "A Transaction Model", Jim Gray, IBM Research Laboratory 1980)
New blog post: Writing is Magic
From the genre of emails that got out of hand, my thoughts on why I think most people should spend more time writing (writing prose for humans, that is).
This is a fascinating spot in the trade-off space of commit algorithms. It's less fault tolerant than 2PC-over-Paxos (needs to wait for reconfiguration to make progress after even one failure), but saves a whole round-trip in the deal.
Wait… what!?
Fault tolerant 2PC that is simple and commits in 1RTT? 🤯
If you missed
@ChrisJe34211511
's fantastic talk at
#eurosys24
(
#papoc24
) definitely check the paper
I wrote a blog post (or thread that really got out-of-hand) on deployment safety, and why online discussions of things like "should we deploy on Friday?" are so seldom productive:
The torn pages problem is interesting. Who would have thought that a mismatch between database page size and file system block size could cause this?
So if the fs block is 4K and the DB page is 8K, we need to write two fs blocks for each page, and that needs to happen atomically. If you wrote half a page