This week I'm doing an internal talk at Amazon about an approach to system design that I use a lot, and think would be useful to a lot of people: simulation. This thread is a summary of the talk 1/
Tomorrow, the DynamoDB team is going to be presenting "Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service" () at ATC. This is a super exciting paper that covers a real-world big system, and how it has evolved.
15 years! I started at AWS, with the EC2 team in Cape Town, on the 1st of August 2008. It's been a real pleasure to have a front row seat for the growth of cloud, to be involved in the genesis of serverless, and to have exciting problems to work on every day. Some memories:
Histograms are rightfully a popular tool for visualizing and thinking about latency. But I believe that empirical distribution functions (eCDFs) are almost always a better choice. Let's look at an example to understand why. This highly bimodal distribution:
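To make the idea concrete: an eCDF is just the sorted sample plotted against cumulative fraction, and percentiles read straight off it. A minimal sketch (the bimodal data below is synthetic, invented for illustration):

```python
# Sketch: an eCDF of a bimodal latency sample. Data is made up.
import random

random.seed(42)
# Bimodal: most requests fast (~10ms), a slow mode around ~200ms
latencies = [random.gauss(10, 2) for _ in range(900)] + \
            [random.gauss(200, 20) for _ in range(100)]

def ecdf(samples):
    """Return (x, y) points of the empirical CDF."""
    xs = sorted(samples)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def percentile(xs, ys, p):
    """Read a percentile straight off the eCDF points."""
    for x, y in zip(xs, ys):
        if y >= p:
            return x

xs, ys = ecdf(latencies)
print("p50 ~", round(percentile(xs, ys, 0.50)))
print("p99 ~", round(percentile(xs, ys, 0.99)))
```

Unlike a histogram, there's no bin width to choose, and both modes (and every percentile between them) are visible on one curve.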
If we draw database rows as points, and add edges between rows that appear in the same transaction, the resulting graph is a great way to think about potential scalability. The more you can cut the graph up without crossing edges, the easier the workload is to scale.
A couple weeks back, I did a talk titled "Distributed Systems Solve Only Half My Problems (and I have a lot of problems)" at HPTS'22. Talks at HPTS aren't recorded, so here's a summary of what I said.
In distributed systems, especially deep SoA and microservice architectures, retries are mostly bad, despite being considered by many to be a "best practice". Specifically, doing more when you're overloaded is bad for availability, stability, and efficiency.
The internet is talking about retries and backoff! As we've seen over the last day or so, simply retrying without backoff leads to an unbounded increase in work. This, in turn, tends to make overloaded systems even more overloaded. What can we do instead?
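One option is capped exponential backoff with "full jitter": each retry waits a random time up to an exponentially growing cap, spreading retries out instead of synchronizing them. A sketch (constants are illustrative, not recommendations):

```python
# Sketch: capped exponential backoff with full jitter.
# BASE and CAP are illustrative values, not recommendations.
import random

BASE = 0.1   # seconds
CAP = 20.0   # seconds

def backoff_with_full_jitter(attempt):
    """Sleep time (seconds) before retry number `attempt` (0-based)."""
    return random.uniform(0, min(CAP, BASE * 2 ** attempt))

for attempt in range(5):
    delay = backoff_with_full_jitter(attempt)
    print(f"attempt {attempt}: sleeping {delay:.3f}s before retrying")
```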
If you write your own database, durable storage, or compute isolation, your team will eventually become a database, durable storage, or compute isolation team. These problems tend to be all-consuming. Do you want your team to specialize in these problems? Can you afford it?
That's it. That's the talk.
"Never write a database. Even if you want to, even if you think you should. Resist. Never write a database. Unless you have to write a database. But you don't."
I will present this talk at any conference of your choosing.
A common problem with long queues in distributed systems is that they make recovery time worse: by the time a system recovers, it has built up a long backlog of work that needs to be done before new work succeeds. LIFO queues (stacks) are sometimes a good way to avoid that, but..
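A toy simulation of the effect (timeout, tick counts, and backlog size are all invented): after an outage, a FIFO queue spends its capacity serving old requests whose clients have already given up, while a LIFO stack serves the freshest work first.

```python
# Sketch: FIFO vs LIFO service during recovery from a backlog.
from collections import deque

TIMEOUT = 5  # client gives up after 5 ticks (illustrative)

def drain_one(queue, lifo, now):
    """Serve one item; True if the client was still waiting for it."""
    enqueued_at = queue.pop() if lifo else queue.popleft()
    return (now - enqueued_at) <= TIMEOUT

def useful_responses(lifo):
    """Backlog enqueued at ticks 0..9, service resumes at tick 10."""
    backlog = deque(range(10))
    return sum(drain_one(backlog, lifo, now=10 + i) for i in range(10))

print("FIFO useful responses:", useful_responses(lifo=False))
print("LIFO useful responses:", useful_responses(lifo=True))
```

In this toy, FIFO serves nothing but already-abandoned requests; LIFO gets a few live ones through, at the cost of fairness and of possibly never serving the oldest work.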
Sad to learn of the passing of Richard Cook
@ri_cook
. Richard's talk, and article, "How Complex Systems Fail" is an absolute classic in the field. Absolutely worth your time:
The declarative nature of SQL is a major strength, but also a common source of operational problems. This is because SQL obscures one of the most important practical questions about running a program: how much work are we asking the computer to do?
What many people don't understand with SQL is that it is declarative. When you say ORDER BY it doesn't tell the DB to sort the data. It declares that you want an ordered result. Only the execution plan will tell you if there's a sort operation or not
On Wednesday, I'm presenting a tech talk titled "Gigabytes in milliseconds: Bringing container support to AWS Lambda without adding latency" about how we added container support to AWS Lambda, and some of the technical challenges we faced along the way:
"The Amazon Time Sync Service now gives you a way to synchronize time within microseconds of UTC on Amazon EC2 instances.... customers can now access local, GPS-disciplined reference clocks on supported EC2 Instances."
Our new paper "On-demand Container Loading in AWS Lambda" is now up (). New blog post highlighting some of what's interesting in the paper: Featuring erasure coding, deduplication, lazy loading, FUSE, and more.
New blog post, looking at the great new paper from the Amazon MemoryDB folks: Much of the discussion on distributed systems is about scalability, but this paper shows how availability, durability, cost, and performance are equally important.
This video, seemingly about FastPass at DisneyLand, is a fantastic introduction to queues and quality of service as complex interacting systems of technology, people, and incentives:
Deterministic testing (simulation testing) is a super powerful tool for building correct distributed systems. Write unit tests that test packet loss, network partitions, and more. Excited to see Turmoil 0.3 released:
This talk is now available on YouTube:
I cover a lot of ground in 15 minutes, including container flattening, erasure coding, convergent encryption, and an overview of the Lambda architecture.
On Wednesday, I'm presenting a tech talk titled "Gigabytes in milliseconds: Bringing container support to AWS Lambda without adding latency" about how we added container support to AWS Lambda, and some of the technical challenges we faced along the way:
Formal methods are widely used in many software systems you're using today, including many of the most important parts of the Internet's infrastructure.
E.g. cloud systems like S3 (), EBS (), and others ()
Formal methods for building provable software systems have never shown themselves to be useful or successful for anything but a tiny sliver of any complex software-intensive system.
It will similarly fail for any attempts to build a so-called AGI.
The way Corey Quinn treats my colleagues and friends online is despicable, and the fact that AWS continues to enable him is an embarrassment to all of us.
Meta's "Cache Made Consistent" paper covers what seems like some cool work on observability and correctness. But I think they're understating what it is that fundamentally makes caches difficult.
Erlang's work on telephone systems in the early 20th century is foundational to how we think about, and build, distributed and cloud systems 100 years later. How can this work, done before modern computing was even a field, be so important?
"FoundationDB: A Distributed Key-Value Store", from this month's CACM, is a great read. Well worth checking out for anybody who works on the architecture of large systems:
New blog post: "Invariants: A Better Debugger?" about the power of invariants as a technique for testing and debugging algorithms and systems (and why I tend not to reach for debuggers or printf as my go-to way to debug):
New blog post: "Formal Methods Only Solve Half My Problems" about the need for tools that allow us to reason quickly, and quantitatively, about distributed systems at the design stage.
New blog post about why circuit breakers may not solve your problems: (and why they're hard to make compatible with modern distributed systems design practices).
When do you want backoff and jitter, and when do you want adaptive retries? Are they just two ways to do the same thing, or is there something different about them? New blog post:
New small blog post, on latency, utilization, a bit of queue theory, and how the latency gains from efficiency work sometimes don't last as long as we'd like:
Joe's right about this. But why do caches lead to long outages? Let's explore one reason with a small simulation, starting with a really simple two-tier system, and seeing what happens when a cache gets emptied.
Both of these are true statements:
• Caches are responsible for more outage minutes than most other design patterns in modern computing.
• Caches are an integral part of modern computing, without which computing like we know it wouldn't exist.
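Here's the kind of small simulation I mean: a frontend with a cache in front of a backend that can only absorb a limited number of misses per tick. All numbers are made up for illustration.

```python
# Sketch: what happens to a two-tier system when the cache empties.
# A successful miss refills the cache; an overloaded backend fails
# the request, which keeps the cache from refilling.
import random

random.seed(1)
KEYS = 100
DEMAND = 50            # requests per tick
BACKEND_CAPACITY = 10  # misses the backend can absorb per tick

def tick(cache):
    backend_load = failures = 0
    for _ in range(DEMAND):
        key = random.randrange(KEYS)
        if key in cache:
            continue                 # cache hit: cheap
        if backend_load < BACKEND_CAPACITY:
            backend_load += 1
            cache.add(key)           # successful miss refills the cache
        else:
            failures += 1            # backend overloaded; request fails
    return failures

cache = set(range(KEYS))             # warm cache: everything is a hit
print("warm cache failures:", tick(cache))
cache.clear()                        # the cache gets emptied...
print("cold cache failures per tick:", [tick(cache) for _ in range(5)])
```

The warm system looks healthy because the backend is sized for the miss rate, not the demand. Empty the cache and the backend can't serve enough misses to refill it, so the system can stay degraded long after the triggering event.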
Interested in hearing more about how S3 works on its 18th birthday?
Check out Andy Warfield's OSDI'23 keynote ()
or this great talk by Amy and Seth at Reinvent'22:
We’re celebrating AWS
#PiDay
AND the 18th birthday of our first generally available service, Amazon S3! Since it launched, S3 has grown to become the world’s most popular cloud data store with more than 350 trillion objects and now 1 million+ data lakes running on
@awscloud
. S3
Then we can fetch N+1 in parallel, and immediately be done when the first N comes back. That makes the system completely resilient to one deterministically slow server, and strongly resistant to long outlier tail latencies.
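A minimal sketch of that idea with threads (`fetch` is a stand-in for a network call; the replica names and timings are invented):

```python
# Sketch: issue N+1 fetches in parallel, return when the first N
# complete, and abandon the straggler.
import concurrent.futures as cf
import time

def fetch(replica):
    """Stand-in network call: replica 0 is deterministically slow."""
    time.sleep(2.0 if replica == 0 else 0.05)
    return f"chunk-from-{replica}"

def fetch_first_n(n, replicas):
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(fetch, r) for r in replicas]
    results = []
    for fut in cf.as_completed(futures):
        results.append(fut.result())
        if len(results) == n:
            break
    pool.shutdown(wait=False)   # don't block on the straggler
    return results

start = time.time()
chunks = fetch_first_n(4, replicas=range(5))
print(f"got {len(chunks)} chunks in {time.time() - start:.2f}s")
```

Even though one replica takes two seconds, the caller completes in tens of milliseconds, because any 4 of the 5 responses are enough.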
The story of MemDS in the DynamoDB paper is a fascinating one. DynamoDB used to use a metadata cache with a very high hit rate ("cache hit rate was approximately 99.75 percent"). What's not to love about a cache with a 99.75% hit rate?
For several years I read every COE (essentially postmortem) that was written at AWS. I don't do it anymore, but still read many. AWS's culture of writing quality COEs, then having a lot of people read and discuss them (at every level) is great.
New blog post, following up on circuit breakers and retries. Using simulation, I compare the "token bucket" retry strategy, circuit breaker retry strategy, and some classic approaches:
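As a rough illustration of the token bucket idea (class name and constants are my own, for illustration): retries spend tokens from a shared bucket, and successes pay a little back, so when most calls are failing the retries throttle themselves away.

```python
# Sketch of a token-bucket retry limiter (in the spirit of the
# strategy used by some AWS SDKs; constants are illustrative).
class RetryTokenBucket:
    def __init__(self, capacity=10.0, retry_cost=5.0, refund=1.0):
        self.capacity = capacity
        self.tokens = capacity
        self.retry_cost = retry_cost
        self.refund = refund

    def try_acquire_retry(self):
        """Spend tokens for a retry; False means: don't retry."""
        if self.tokens >= self.retry_cost:
            self.tokens -= self.retry_cost
            return True
        return False

    def on_success(self):
        """Successful calls slowly refill the bucket."""
        self.tokens = min(self.capacity, self.tokens + self.refund)

bucket = RetryTokenBucket()
# During a total outage every call fails: only the first couple of
# retries are allowed, then the bucket is empty and retries stop.
allowed = sum(bucket.try_acquire_retry() for _ in range(10))
print("retries allowed during outage:", allowed)
```

The key property: retry traffic is bounded by the success rate, so a widespread failure can't turn into a retry storm.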
Atomic commitment - the fundamental mechanism behind scale-out databases - has some really surprising scaling behaviors. New blog post: "Atomic Commitment: The Unscalability Protocol"
You can now run your AWS Lambda functions on Graviton 2 processors! "Lambda functions using the Arm/Graviton2 architecture provide up to 34 percent price performance improvement."
If you're interested in correctness of distributed systems, you'll likely enjoy "Demystifying and Checking Silent Semantic Violations in Large Distributed Systems" from folks at JHU.
A few interesting trends from chatting to folks at OSDI/ATC today.
1/ Rust seems to have become the default language for systems work across a lot of academia and industry (quite suddenly, because it definitely wasn’t that way in early 2020).
New blog post, on the assumptions that distributed systems make, and how thinking about those assumptions as 'optimistic' or 'pessimistic' can lead to better designs:
Cool visualization!
One thing that's particularly cool about this technique is that it's robust to stale load data. The higher you make 'k' in best-of-k, the better the ideal load balancing but the worse the effect of stale load data.
A favorite load balancing technique at AWS is "the power of two random choices"
On the left, nodes are chosen and used at random
On the right, 2 nodes are chosen at random, but only the minimum is used
This simple technique balances load very well
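A small sketch of the comparison (node and request counts are arbitrary):

```python
# Sketch: random placement (k=1) vs best-of-2 random choices (k=2).
import random

random.seed(7)
N_NODES, N_REQUESTS = 100, 10_000

def place(k):
    """Place each request on the least-loaded of k random nodes."""
    loads = [0] * N_NODES
    for _ in range(N_REQUESTS):
        candidates = random.sample(range(N_NODES), k)
        target = min(candidates, key=lambda n: loads[n])
        loads[target] += 1
    return loads

for k in (1, 2):
    loads = place(k)
    print(f"k={k}: max load {max(loads)} (mean {N_REQUESTS // N_NODES})")
```

The max load with k=2 ends up much closer to the mean than with pure random placement, which is the whole trick.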
New blog post on some rules for effective writing:
Bottom line: write for somebody. Have somebody (or a kind of person) in mind that you're trying to communicate with. Think about what they know, what you want them to know, and what you want them to do.
The video of my ATC'23 talk on "On-Demand Container Loading in AWS Lambda" is now available:
I tried to take a high-level view of the paper, focusing on why we made the decisions we did, and where we spent our complexity budget.
I've been spending a good bit of time recently picking up P, a language for specifying and modelling distributed systems: Some initial impressions, especially compared to TLA+:
Stop using non-cryptographic PRNGs. Stop using them for simulations. Stop using them for crypto. Stop using them for jitter. Stop using them in a box. Stop using them with a fox. Insist your OS and hardware can give you high quality randomness at the rate you need.
Very cool work from Anand et al at SOSP'23: "Blueprint: A Toolchain for Highly-Reconfigurable Microservices" (). I have a lot to say about this paper, but my favorite part is the treatment of metastable failures.
Completely unsurprisingly, the effect of isolation levels on latency in PostgreSQL is very sensitive to concurrency (and therefore frequency of conflicts).
All good ideas! But what are their downsides? First: arrays. Memory safety issues. Leaky abstraction over true cost of random vs sequential accesses. Common operations (push front, delete, grow, insert-in-place, etc) expensive. Requires fixed sizes. 8/10
Interested in adopting formal methods, or exploring how they can help you and your team move faster? Check out this talk from
@ankushpd
and Bikash Behera from
#reinvent2023
:
Ever heard people say that transactions don’t scale? Curious about how DynamoDB does transactions at massive scale with low latency? Interested in the tradeoffs between different ways of doing distributed transactions? Check out this new paper to see how it's done in
@dynamodb
Excited to share our published paper at USENIX ATC 23 on how distributed transactions were implemented in
@dynamodb
using a timestamp ordering protocol without sacrificing high scalability, high availability, and predictable performance at scale
Fun article about the early days of EC2 and the cloud in Cape Town. It was great working with all the folks in the photo, and I think most are still
@awscloud
The thing most discussions of simplicity vs complexity miss is that simplicity is a property of a system. It's always easy to simplify a component by pushing the complexity elsewhere in the system.
This seems fun. Here's my try.
A 'vector' is a fancy name for a place in space. Like a pin on a map. A vector database is good at storing these places, and being asked questions like "give me a few other places near this one".
I've heard a *lot* of people take a stab at explaining how a vector database works in simple terms.
I'd love to hear how others explain them at the "101" level.
The beauty of distributed systems is that you can build a system that is up 99.99% of the time, even when it runs on EC2 instances which each have an up time of 99.5%.
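The arithmetic behind that claim, assuming independent failures and perfect failover (which real systems only approximate):

```python
# Sketch: availability of n redundant replicas, assuming failures are
# independent and failover is perfect. The system is down only when
# every replica is down at once.
def system_availability(per_node, n):
    return 1 - (1 - per_node) ** n

for n in (1, 2, 3):
    print(f"{n} nodes at 99.5%: system is up "
          f"{system_availability(0.995, n):.6%} of the time")
```

Two 99.5% instances give 1 - 0.005² = 99.9975%, already past the four-nines target. The catch, of course, is the independence assumption.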
New blog post, of the "long email that became a post" genre. This one has something like career advice in it, about being explicit about how you spend your time:
"Today we are happy to announce Snapchange, a new open source project to make snapshot-based fuzzing much easier. Snapchange enables a target binary to be fuzzed with minimal modifications, providing useful introspection that aids in fuzzing."
Some highlights from "Achieving scale with Amazon Aurora Limitless Database" with David Wein and Christopher Heim, diving into the new Aurora Limitless Database.
SUPER small sample. As in sample of 1. But cold start for a
@rustlang
@awscloud
Lambda deserializing a Kinesis record stored in base64. Build output is 2MB as well. Color me impressed. Onward I will press ...
"Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3" (), the SOSP'21 paper from the S3 team at AWS, is a good read. A couple highlights (in my non-official opinion):
New blog post, on the relationship between multitenancy and scalability in the cloud (and how serverless enables scalability in a fundamentally different way to traditional architectures):
The scale challenge of building cloud services isn't just because they're big, but because they span such a huge range of sizes.
E.g. the 300,000,000,000x scale difference between the DynamoDB behind my blog, and Amazon's Prime Day use-case.
Somewhere at Google there's a vast ML model that looked through my decades of search and usage history, and decided that what I really need is a push notification about Jennifer Aniston's new haircut.
Another related thing: in the graph, nodes with high indegree tend to become "hot keys" even if the external access patterns are uniform. A ton of DB tuning best-practices are about avoiding these high-degree nodes.
If we draw database rows as points, and add edges between rows that appear in the same transaction, the resulting graph is a great way to think about potential scalability. The more you can cut the graph up without crossing edges, the easier the workload is to scale.
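A toy sketch of building that graph and finding its independently scalable pieces (the workload below is invented):

```python
# Sketch: rows as nodes, an edge between rows touched by the same
# transaction. Connected components never need to coordinate, so
# they can live on different shards.
from collections import defaultdict
from itertools import combinations

transactions = [
    {"user:1", "acct:1"}, {"user:2", "acct:2"},
    {"user:3", "acct:3"}, {"user:1", "acct:1"},
]

adj = defaultdict(set)
for txn in transactions:
    for a, b in combinations(txn, 2):
        adj[a].add(b)
        adj[b].add(a)

def components(adj):
    """Connected components via depth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node])
        seen |= comp
        comps.append(comp)
    return comps

print("independently scalable shards:", components(adj))
```

Every transaction here stays inside one component, so each component can be a shard with no cross-shard coordination. A transaction spanning two components would add an edge, merging them.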
New blog post, on what the word "scalable" means in my head, and how thinking about marginal costs of adding work makes a lot of the debates about scalability go away:
This whole thing is worth reading. My hot take is the ability to do efficient parallel IO (net and storage) without significant extra programmer effort is the most important thing a system language can offer. Latency will continue to lag bandwidth. Parallelism is king.
Check out this article posted on USENIX's ;login: online: 'Investigating Managed Language Runtime Performance' by David Lion, Adrian Chiu, Michael Stumm, and Ding Yuan:
#OpenAccess
Very cool look inside the labs at Annapurna Labs, designing and building custom silicon for AWS: I'm personally especially excited about the power and efficiency improvements that Graviton has brought to AWS.
One absolutely does need a model of failures to discuss reliability.
For example, one of our most powerful HA tools (redundancy) is exponentially powerful when failures are independent, but useless when failures are correlated.
You ever see the same term pop up in a few places and think "hmmm"?
@TigerBeetleDB
often talks of "fault models". Wonder if there's more to this idea than I think.
(Paper is "A Transaction Model", Jim Gray, IBM Research Laboratory 1980)
New blog post: Writing is Magic
From the genre of emails that got out of hand, my thoughts on why I think most people should spend more time writing (writing prose for humans, that is).
This is a fascinating spot in the trade-off space of commit algorithms. It's less fault tolerant than 2PC-over-Paxos (needs to wait for reconfiguration to make progress after even one failure), but saves a whole round-trip in the deal.
Wait… what!?
Fault tolerant 2PC that is simple and commits in 1RTT? 🤯
If you missed
@ChrisJe34211511
's fantastic talk at
#eurosys24
(
#papoc24
) definitely check the paper
I wrote a blog post (or thread that really got out-of-hand) on deployment safety, and why online discussions of things like "should we deploy on Friday?" are so seldom productive:
The torn pages problem is interesting. Who would have thought that a mismatch between database page size and file system block size could cause this?
So if the fs block is 4K and the DB page is 8K, we need to write two fs blocks for each page, and that needs to happen atomically. If you wrote half a page