Marc Brooker Profile
Marc Brooker

@MarcJBrooker

17,368
Followers
748
Following
255
Media
2,005
Statuses

AI, databases, and serverless at AWS. Views are my own. On Mastodon: @marcbrooker@fediscience.org

Joined October 2013
Pinned Tweet
@MarcJBrooker
Marc Brooker
1 year
I'm stepping away from Twitter for now. You can find me on my blog: And (for now) on Mastodon:
5
8
87
@MarcJBrooker
Marc Brooker
2 years
This week I'm doing an internal talk at Amazon about an approach to system design that I use a lot, and think would be useful to a lot of people: simulation. This thread is a summary of the talk 1/
22
187
1K
@MarcJBrooker
Marc Brooker
2 years
About a decade ago, my late grandfather asked me "if computers are deterministic, why isn't debugging easy?" I think about that a lot.
92
81
878
@MarcJBrooker
Marc Brooker
2 years
Tomorrow, the DynamoDB team is going to be presenting "Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service" () at ATC. This is a super exciting paper that covers a real-world big system, and how it has evolved.
10
116
757
@MarcJBrooker
Marc Brooker
2 years
Erasure coding really is a great, and under-used, technique for reducing tail latency in systems that fetch data.
24
119
731
@MarcJBrooker
Marc Brooker
10 months
15 years! I started at AWS, with the EC2 team in Cape Town, on the 1st of August 2008. It's been a real pleasure to have a front row seat for the growth of cloud, to be involved in the genesis of serverless, and to have exciting problems to work on every day. Some memories:
18
44
711
@MarcJBrooker
Marc Brooker
3 years
New blog post, on why caches may be bad in distributed systems, despite them being a "best practice":
15
107
485
@MarcJBrooker
Marc Brooker
2 years
Histograms are rightfully a popular tool for visualizing and thinking about latency. But I believe that empirical distribution functions (eCDFs) are almost always a better choice. Let's look at an example to understand why. This highly bimodal distribution:
Tweet media one
10
68
479
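As a rough illustration of the eCDF idea in the tweet above, here is a minimal sketch in Python. The sample latencies are hypothetical (not the data from the tweet's image), and the helper names `ecdf` and `fraction_at_or_below` are made up for this example.

```python
from bisect import bisect_right


def ecdf(samples):
    """Return sorted (value, fraction of samples <= value) pairs."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    n = len(ordered)
    # One point per distinct value; bisect_right counts samples <= value.
    return [(v, bisect_right(ordered, v) / n) for v in sorted(set(ordered))]


def fraction_at_or_below(samples, threshold):
    """Read the eCDF at an arbitrary threshold (e.g. a latency target)."""
    ordered = sorted(samples)
    return bisect_right(ordered, threshold) / len(ordered)


# Hypothetical bimodal latencies in milliseconds: a fast mode and a slow mode.
latencies_ms = [1, 1, 2, 2, 2, 100, 101, 102]
points = ecdf(latencies_ms)
```

With this data, `fraction_at_or_below(latencies_ms, 50)` reads the share of requests in the fast mode directly off the curve, which a histogram only shows indirectly through bar areas.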
@MarcJBrooker
Marc Brooker
3 months
If we draw database rows as points, and add edges between rows that appear in the same transaction, the resulting graph is a great way to think about potential scalability. The more you can cut the graph up without crossing edges, the easier the workload is to scale.
Tweet media one
17
63
467
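One simple way to make the graph picture above concrete is to count connected components: rows in different components never share a transaction, so they could in principle live on different machines. The sketch below is an assumption-laden simplification (it only finds cuts that cross zero edges, and `count_partitions` is a hypothetical helper name), not a full partitioning algorithm.

```python
def count_partitions(transactions):
    """Count connected components of the row graph.

    Nodes are row keys; two rows are connected if they appear in the
    same transaction. Uses a small union-find structure.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    for rows in transactions:
        rows = list(rows)
        if not rows:
            continue
        find(rows[0])  # register rows touched by single-row transactions
        for other in rows[1:]:
            union(rows[0], other)

    return len({find(row) for row in list(parent)})


# Hypothetical workload: three independent groups of rows.
workload = [["a", "b"], ["b", "c"], ["d", "e"], ["f"]]
groups = count_partitions(workload)
```

A workload whose component count stays close to its row count is easy to spread out; one where every transaction touches a shared row collapses into a single component.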
@MarcJBrooker
Marc Brooker
2 years
A couple weeks back, I did a talk titled "Distributed Systems Solve Only Half My Problems (and I have a lot of problems)" at HPTS'22. Talks at HPTS aren't recorded, so here's a summary of what I said.
6
86
459
@MarcJBrooker
Marc Brooker
2 years
In distributed systems, especially deep SoA and microservice architectures, retries are mostly bad, despite being considered by many to be a "best practice". Specifically, doing more when you're overloaded is bad for availability, stability, and efficiency.
Tweet media one
34
58
444
@MarcJBrooker
Marc Brooker
11 months
The internet is talking about retries and backoff! As we've seen over the last day or so, simply retrying without backoff leads to an unbounded increase in work. This, in turn, tends to make overloaded systems even more overloaded. What can we do instead?
15
62
428
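One commonly used answer to the question in the tweet above is capped exponential backoff with jitter. The sketch below shows the variant often called "full jitter", where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function name and the parameter values are assumptions chosen for illustration.

```python
import random


def backoff_delays(base_s, cap_s, attempts, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    The ceiling doubles on each attempt, up to cap_s; the actual delay
    is uniform in [0, ceiling] so that clients do not retry in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


# Hypothetical settings: 100 ms base, 2 s cap, 8 attempts, seeded for repeatability.
delays = backoff_delays(0.1, 2.0, 8, random.Random(1))
```

Backoff spreads retries out in time, but it does not by itself reduce the total number of retries a client is willing to send.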
@MarcJBrooker
Marc Brooker
8 months
If you write your own database, durable storage, or compute isolation, your team will eventually become a database, durable storage, or compute isolation team. These problems tend to be all-consuming. Do you want your team to specialize in these problems? Can you afford it?
@mipsytipsy
Charity Majors
8 months
That's it. That's the talk. "Never write a database. Even if you want to, even if you think you should. Resist. Never write a database. Unless you have to write a database. But you don't." I will present this talk at any conference of your choosing.
65
52
732
16
36
427
@MarcJBrooker
Marc Brooker
2 years
A common problem with long queues in distributed systems is that they make recovery time worse: by the time a system recovers, it has built up a long backlog of work that needs to be done before new work succeeds. LIFO queues (stacks) are sometimes a good way to avoid that, but..
17
54
422
@MarcJBrooker
Marc Brooker
3 years
Millions of IOPS per core on commodity hardware is going to do very interesting things to the database market over the next decade.
@axboe
Jens Axboe
3 years
That's it. 10M IOPS, one physical core. #io_uring #linux
Tweet media one
32
200
1K
7
63
424
@MarcJBrooker
Marc Brooker
2 years
Sad to learn of the passing of Richard Cook @ri_cook. Richard's talk, and article, "How Complex Systems Fail" is an absolute classic in the field. Absolutely worth your time:
4
74
350
@MarcJBrooker
Marc Brooker
2 years
I wrote a small blog post in honor of DynamoDB's tenth birthday, about my favorite feature:
11
61
336
@MarcJBrooker
Marc Brooker
8 months
The declarative nature of SQL is a major strength, but also a common source of operational problems. This is because SQL obscures one of the most important practical questions about running a program: how much work are we asking the computer to do?
@FranckPachot
Franck Pachot ✈️ JCON Europe 2024
8 months
What many people don't understand with SQL is that it is declarative. When you say ORDER BY it doesn't tell the DB to sort the data. It declares that you want an ordered result. Only the execution plan will tell you if there's a sort operation or not
3
5
68
8
31
306
@MarcJBrooker
Marc Brooker
3 years
On Wednesday, I'm presenting a tech talk titled "Gigabytes in milliseconds: Bringing container support to AWS Lambda without adding latency" about how we added container support to AWS Lambda, and some of the technical challenges we faced along the way:
3
64
304
@MarcJBrooker
Marc Brooker
6 months
"The Amazon Time Sync Service now gives you a way to synchronize time within microseconds of UTC on Amazon EC2 instances.... customers can now access local, GPS-disciplined reference clocks on supported EC2 Instances."
8
57
301
@MarcJBrooker
Marc Brooker
6 months
New blog post, on the role of time in distributed systems, and how to think about using physical time:
10
40
285
@MarcJBrooker
Marc Brooker
2 years
Very cool to see TLA+, P, and Dafny code on the keynote stage at AWS re:Invent.
Tweet media one
4
45
267
@MarcJBrooker
Marc Brooker
1 year
Our new paper "On-demand Container Loading in AWS Lambda" is now up (). New blog post highlighting some of what's interesting in the paper: Featuring erasure coding, deduplication, lazy loading, FUSE, and more.
6
53
270
@MarcJBrooker
Marc Brooker
5 months
New blog post, on cache eviction, the SIEVE algorithm, and some variant with interesting properties:
3
40
255
@MarcJBrooker
Marc Brooker
20 days
New blog post, looking at the great new paper from the Amazon MemoryDB folks: Much of the discussion on distributed systems is about scalability, but this paper shows how availability, durability, cost, and performance are equally important.
4
47
243
@MarcJBrooker
Marc Brooker
6 months
Some database tech highlights from Peter DeSantis's keynote at re:Invent last night. Starting with my favorite point: the log is the database.
Tweet media one
6
38
238
@MarcJBrooker
Marc Brooker
2 years
This video, seemingly about FastPass at DisneyLand, is a fantastic introduction to queues and quality of service as complex interacting systems of technology, people, and incentives:
8
21
232
@MarcJBrooker
Marc Brooker
2 years
Deterministic testing (simulation testing) is a super powerful tool for building correct distributed systems. Write unit tests that test packet loss, network partitions, and more. Excited to see Turmoil 0.3 released:
5
35
229
@MarcJBrooker
Marc Brooker
3 years
This talk is now available on YouTube: I cover a lot of ground in 15 minutes, including container flattening, erasure coding, convergent encryption, and an overview of the Lambda architecture.
@MarcJBrooker
Marc Brooker
3 years
On Wednesday, I'm presenting a tech talk titled "Gigabytes in milliseconds: Bringing container support to AWS Lambda without adding latency" about how we added container support to AWS Lambda, and some of the technical challenges we faced along the way:
3
64
304
8
45
230
@MarcJBrooker
Marc Brooker
1 year
Systems programming
Tweet media one
8
33
218
@MarcJBrooker
Marc Brooker
8 months
Formal methods are widely used in many software systems you're using today, including many of the most important parts of the Internet's infrastructure. E.g. cloud systems like S3 (), EBS (), and others ()
@Grady_Booch
Grady Booch
8 months
Formal methods for building provable software systems have never shown themselves to be useful or successful for anything but a tiny sliver of any complex software-intensive system. It will similarly fail for any attempts to build a so-called AGI.
30
27
206
6
30
224
@MarcJBrooker
Marc Brooker
2 years
The way Corey Quinn treats my colleagues and friends online is despicable, and the fact that AWS continues to enable him is an embarrassment to all of us.
34
10
216
@MarcJBrooker
Marc Brooker
2 years
Meta's "Cache Made Consistent" paper covers what seems like some cool work on observability and correctness. But I think they're understating what it is that fundamentally makes caches difficult.
3
44
217
@MarcJBrooker
Marc Brooker
2 years
Erlang's work on telephone systems in the early 20th century is foundational to how we think about, and build, distributed and cloud systems 100 years later. How can this work, done before modern computing was even a field, be so important?
@MarcJBrooker
Marc Brooker
2 years
The life and crimes of A.K. Erlang.
Tweet media one
2
3
15
4
52
215
@MarcJBrooker
Marc Brooker
11 months
"FoundationDB: A Distributed Key-Value Store", from this month's CACM, is a great read. Well worth checking out for anybody who works on the architecture of large systems:
5
39
213
@MarcJBrooker
Marc Brooker
16 days
It's always TCP_NODELAY. Every damn time.
10
17
204
@MarcJBrooker
Marc Brooker
10 months
New blog post: "Invariants: A Better Debugger?" about the power of invariants as a technique for testing and debugging algorithms and systems (and why I tend not to reach for debuggers or printf as my go-to way to debug):
4
34
195
@MarcJBrooker
Marc Brooker
1 year
New blog post, with some of my thoughts on Lambda Snapstart, and some open research areas that MicroVM snapshots open up:
4
52
197
@MarcJBrooker
Marc Brooker
2 years
Super important trend to understand for database and system builders (from "Jurassic cloud"):
Tweet media one
5
39
185
@MarcJBrooker
Marc Brooker
2 years
New blog post: "Formal Methods Only Solve Half My Problems" about the need for tools that allow us to reason quickly, and quantitatively, about distributed systems at the design stage.
4
33
187
@MarcJBrooker
Marc Brooker
2 years
New blog post about why circuit breakers may not solve your problems: (and why they're hard to make compatible with modern distributed systems design practices).
9
33
184
@MarcJBrooker
Marc Brooker
2 years
When do you want backoff and jitter, and when do you want adaptive retries? Are they just two ways to do the same thing, or is there something different about them? New blog post:
4
28
181
@MarcJBrooker
Marc Brooker
3 years
New small blog post, on latency, utilization, a bit of queue theory, and how the latency gains from efficiency work sometimes don't last as long as we'd like:
5
52
178
@MarcJBrooker
Marc Brooker
2 years
Lamport's "State The Problem Before Describing the Solution" is great writing advice:
Tweet media one
7
23
172
@MarcJBrooker
Marc Brooker
2 years
Joe's right about this. But why do caches lead to long outages? Let's explore one reason with a small simulation, starting with a really simple two-tier system, and seeing what happens when a cache gets emptied.
@_joemag_
Joe Magerramov
2 years
Both of these are true statements: • Caches are responsible for more outage minutes than most other design patterns in modern computing. • Caches are an integral part of modern computing, without which computing like we know it wouldn't exist.
8
22
207
3
45
171
@MarcJBrooker
Marc Brooker
1 year
New blog post: Amazon's Distributed Computing Manifesto
1
27
161
@MarcJBrooker
Marc Brooker
2 months
Microsecond-accurate time is now available in EC2 US East. So many cool things this makes possible:
6
21
159
@MarcJBrooker
Marc Brooker
2 months
Interested in hearing more about how S3 works on its 18th birthday? Check out Andy Warfield's OSDI'23 keynote () or this great talk by Amy and Seth at Reinvent'22:
@aselipsky
Adam Selipsky
2 months
We’re celebrating AWS #PiDay AND the 18th birthday of our first generally available service, Amazon S3! Since it launched, S3 has grown to become the world’s most popular cloud data store with more than 350 trillion objects and now 1 million+ data lakes running on @awscloud . S3
Tweet media one
6
37
199
1
14
160
@MarcJBrooker
Marc Brooker
2 years
Then we can fetch N+1 in parallel, and immediately be done when the first N comes back. That makes the system completely resilient to one deterministically slow server, and strongly resistant to long outlier tail latencies.
8
3
160
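The effect described above can be sketched with a few lines of Python. The sketch assumes an erasure code in which any N of the N+1 fragments are enough to reconstruct the data, and it uses hypothetical per-server latencies; `time_to_first_n` is an illustrative helper, not an API from any library.

```python
def time_to_first_n(latencies_ms, n):
    """Time at which the n fastest parallel fetches have all returned."""
    if not 1 <= n <= len(latencies_ms):
        raise ValueError("n must be between 1 and the number of fetches")
    return sorted(latencies_ms)[n - 1]


# Hypothetical latencies: four fragments are needed, one server is slow.
plain = [10, 12, 11, 500]      # fetch exactly N=4, so the slow server is on the critical path
coded = [10, 12, 11, 500, 13]  # fetch N+1=5, any 4 fragments suffice

plain_ms = time_to_first_n(plain, 4)
coded_ms = time_to_first_n(coded, 4)
```

In the plain case the request completes only when the slowest server responds; with one extra fragment, the slow server's response is simply not needed.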
@MarcJBrooker
Marc Brooker
2 years
The story of MemDS in the DynamoDB paper is a fascinating one. DynamoDB used to use a metadata cache with a very high hit rate ("cache hit rate was approximately 99.75 percent"). What's not to love about a cache with a 99.75% hit rate?
2
25
159
@MarcJBrooker
Marc Brooker
2 years
For several years I read every COE (essentially postmortem) that was written at AWS. I don't do it anymore, but still read many. AWS's culture of writing quality COEs, then having a lot of people read and discuss them (at every level) is great.
@altluu
Dan Luu
2 years
How many postmortems have you read?
3
0
7
5
13
153
@MarcJBrooker
Marc Brooker
2 years
New blog post, following up on circuit breakers and retries. Using simulation, I compare the "token bucket" retry strategy, circuit breaker retry strategy, and some classic approaches:
4
27
154
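For readers unfamiliar with the "token bucket" retry strategy mentioned above, here is a minimal sketch of the general idea: retries spend tokens, successes slowly earn them back, and a client with an empty bucket stops retrying. The class name, the default parameters, and the exact refill rule are assumptions for illustration and may differ from what the blog post simulates.

```python
class RetryTokenBucket:
    """Client-side limiter that bounds retries relative to successes."""

    def __init__(self, capacity=10.0, refill_per_success=0.1, cost_per_retry=1.0):
        self.capacity = capacity
        self.refill_per_success = refill_per_success
        self.cost_per_retry = cost_per_retry
        self.tokens = capacity  # start full so occasional failures can be retried

    def record_success(self):
        """Each successful call earns back a fraction of a token."""
        self.tokens = min(self.capacity, self.tokens + self.refill_per_success)

    def try_acquire_retry(self):
        """Return True and spend a token if a retry is currently allowed."""
        if self.tokens >= self.cost_per_retry:
            self.tokens -= self.cost_per_retry
            return True
        return False


# Hypothetical small bucket: two retries available, then the client must stop.
bucket = RetryTokenBucket(capacity=2.0, refill_per_success=0.5, cost_per_retry=1.0)
first, second, third = (bucket.try_acquire_retry() for _ in range(3))
```

With these numbers, a client whose calls are all failing sends at most two retries, and afterwards needs two successes per additional retry, so the retry rate stays proportional to the success rate.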
@MarcJBrooker
Marc Brooker
2 years
Atomic commitment - the fundamental mechanism behind scale-out databases - has some really surprising scaling behaviors. New blog post: "Atomic Commitment: The Unscalability Protocol"
8
25
153
@MarcJBrooker
Marc Brooker
2 years
If you're interested in correctness of distributed systems, you'll likely enjoy "Demystifying and Checking Silent Semantic Violations in Large Distributed Systems" from folks at JHU.
4
34
150
@MarcJBrooker
Marc Brooker
10 months
A few interesting trends from chatting to folks at OSDI/ATC today. 1/ Rust seems to have become the default language for systems work across a lot of academia and industry (quite suddenly, because it definitely wasn’t that way in early 2020).
5
17
151
@MarcJBrooker
Marc Brooker
7 months
New blog post, on the assumptions that distributed systems make, and how thinking about those assumptions as 'optimistic' or 'pessimistic' can lead to better designs:
5
24
148
@MarcJBrooker
Marc Brooker
3 months
Cool visualization! One thing that's particularly cool about this technique is that it's robust to stale load data. The higher you make 'k' in best-of-k, the better the ideal load balancing but the worse the effect of stale load data.
@GrantSlatton
Grant Slatton
3 months
A favorite load balancing technique at AWS is "the power of two random choices" On the left, nodes are chosen and used at random On the right, 2 nodes are chosen at random, but only the minimum is used This simple technique balances load very well
37
323
3K
2
8
148
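A small simulation makes the best-of-k comparison above easy to try out. This sketch assumes perfectly fresh load data (it does not model the staleness effect mentioned in the reply), requests that never complete, and candidates sampled with replacement; the node and request counts are arbitrary.

```python
import random


def max_load(num_nodes, num_requests, k, rng):
    """Assign requests using best-of-k and return the busiest node's load.

    For each request, k nodes are sampled uniformly at random (with
    replacement) and the request goes to the least loaded of them.
    k=1 is plain random assignment.
    """
    loads = [0] * num_nodes
    for _ in range(num_requests):
        candidates = [rng.randrange(num_nodes) for _ in range(k)]
        target = min(candidates, key=lambda i: loads[i])
        loads[target] += 1
    return max(loads)


rng = random.Random(42)
random_max = max_load(100, 10_000, k=1, rng=rng)
best_of_two_max = max_load(100, 10_000, k=2, rng=rng)
```

Comparing `random_max` with `best_of_two_max` shows how much closer to the mean of 100 requests per node the two-choice variant keeps the busiest node.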
@MarcJBrooker
Marc Brooker
7 months
One day I hope to write a paper conclusion this clear. (From Gray and Lamport, "Consensus on Transaction Commit"):
Tweet media one
2
9
147
@MarcJBrooker
Marc Brooker
8 months
New blog post on some rules for effective writing: Bottom line: write for somebody. Have somebody (or a kind of person) in mind that you're trying to communicate with. Think about what they know, what you want them to know, and what you want them to do.
5
13
145
@MarcJBrooker
Marc Brooker
7 months
The video of my ATC'23 talk on "On-Demand Container Loading in AWS Lambda" is now available: I tried to take a high-level view of the paper, focusing on why we made the decisions we did, and where we spent our complexity budget.
4
25
145
@MarcJBrooker
Marc Brooker
3 years
I've been spending a good bit of time recently picking up P, a language for specifying and modelling distributed systems: Some initial impressions, especially compared to TLA+:
4
24
139
@MarcJBrooker
Marc Brooker
2 years
Stop using non-cryptographic PRNGs. Stop using them for simulations. Stop using them for crypto. Stop using them for jitter. Stop using them in a box. Stop using them with a fox. Insist your OS and hardware can give you high quality randomness at the rate you need.
@matthew_d_green
Matthew Green
2 years
This paper on Monte Carlo simulations absolutely blows my mind. h/t @inf_0_
Tweet media one
Tweet media two
82
512
3K
7
15
128
@MarcJBrooker
Marc Brooker
6 months
Very cool work from Anand et al at SOSP'23: "Blueprint: A Toolchain for Highly-Reconfigurable Microservices" (). I have a lot to say about this paper, but my favorite part is the treatment of metastable failures.
Tweet media one
2
18
127
@MarcJBrooker
Marc Brooker
1 year
Completely unsurprisingly, the effect of isolation levels on latency in PostgreSQL is very sensitive to concurrency (and therefore frequency of conflicts).
Tweet media one
8
21
126
@MarcJBrooker
Marc Brooker
1 month
All good ideas! But what are their downsides? First: arrays. Memory safety issues. Leaky abstraction over true cost of random vs sequential accesses. Common operations (push front, delete, grow, insert-in-place, etc) expensive. Requires fixed sizes. 8/10
@DanielcHooper
Daniel Hooper
1 month
What ideas in computer science are universally considered good? My list: - Arrays (1942?) - Functions (1947) - Hashmaps (1953) - The stack (1957) - Processes (1958) - Virtual memory (1959) - TCP/IP (1974)
342
74
3K
1
10
124
@MarcJBrooker
Marc Brooker
10 months
Ever heard people say that transactions don’t scale? Curious about how DynamoDB does transactions at massive scale with low latency? Interested in the tradeoffs between different ways of doing distributed transactions? Check out this new paper to see how it's done in @dynamodb
@akshatvig
Akshat Vig
10 months
Excited to share our published paper at USENIX ATC 23 on how distributed transactions were implemented in @dynamodb using a timestamp ordering protocol without sacrificing high scalability, high availability, and predictable performance at scale
2
38
158
2
19
117
@MarcJBrooker
Marc Brooker
3 years
Fun article about the early days of EC2 and the cloud in Cape Town. It was great working with all the folks in the photo, and I think most are still @awscloud
5
25
117
@MarcJBrooker
Marc Brooker
1 year
The thing most discussions of simplicity vs complexity miss is that simplicity is a property of a system. It's always easy to simplify a component by pushing the complexity elsewhere in the system.
7
15
114
@MarcJBrooker
Marc Brooker
8 months
This seems fun. Here's my try. A 'vector' is a fancy name for a place in space. Like a pin on a map. A vector database is good at storing these places, and being asked questions like "give me a few other places near this one".
@elithrar
Matt Silverlock 🐀
8 months
I've heard a *lot* of people take a stab at explaining how a vector database works in simple terms. I'd love to hear how others explain them at the "101" level.
17
8
36
5
16
114
@MarcJBrooker
Marc Brooker
2 months
I wonder why the Paxos Commit paper (by Jim Gray and Leslie Lamport!) isn't better known. Even various known-broken 3PC protocols are better known ()
Tweet media one
4
15
113
@MarcJBrooker
Marc Brooker
3 months
To add to the beauty, distributed systems can get exponentially better availability at linear cost. Not a lot of deals like that in computing.
@coltmcnealy
Colt McNealy
3 months
The beauty of distributed systems is that you can build a system that is up 99.99% of the time, even when it runs on EC2 instances which each have an up time of 99.5%.
1
5
31
2
5
114
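The "exponential availability at linear cost" point can be checked with a short calculation. The sketch assumes node failures are independent and that the system is up whenever at least one node is up; both are simplifying assumptions, and `availability_any_of` is a hypothetical helper name.

```python
def availability_any_of(n, per_node):
    """Availability of a system that is up if any of n independent nodes is up."""
    if n < 1 or not 0.0 <= per_node <= 1.0:
        raise ValueError("need n >= 1 and per_node in [0, 1]")
    # The system is down only when all n nodes are down at the same time.
    return 1.0 - (1.0 - per_node) ** n


# Per-node availability of 99.5%, as in the quoted tweet.
one_node = availability_any_of(1, 0.995)
two_nodes = availability_any_of(2, 0.995)
three_nodes = availability_any_of(3, 0.995)
```

Each added node multiplies the unavailability by 0.005 under these assumptions, so two nodes already exceed 99.99%. If failures are correlated, the independence assumption breaks and the benefit shrinks accordingly.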
@MarcJBrooker
Marc Brooker
3 months
New blog post, of the "long email that became a post" genre. This one has something like career advice in it, about being explicit about how you spend your time:
4
15
111
@MarcJBrooker
Marc Brooker
1 year
"Today we are happy to announce Snapchange, a new open source project to make snapshot-based fuzzing much easier. Snapchange enables a target binary to be fuzzed with minimal modifications, providing useful introspection that aids in fuzzing."
3
31
109
@MarcJBrooker
Marc Brooker
2 years
This list of testing resources from @asatarin is truly excellent: Bookmark it!
4
32
111
@MarcJBrooker
Marc Brooker
2 months
New blog post, on making technical "build vs adapt" decisions:
2
10
109
@MarcJBrooker
Marc Brooker
6 months
Some highlights from "Achieving scale with Amazon Aurora Limitless Database" with David Wein and Christopher Heim, diving into the new Aurora Limitless Database.
2
16
108
@MarcJBrooker
Marc Brooker
7 months
Rust is a great fit with AWS Lambda. Super easy to deploy, fast cold starts, low memory requirements, simple dependency management.
@benjamenpyle
Benjamen Pyle
7 months
SUPER small sample. As in sample of 1. But cold start for a @rustlang @awscloud Lambda deserializing a Kinesis record stored in base64. Build output is 2MB as well. Color me impressed. Onward I will press ...
Tweet media one
2
1
31
3
9
105
@MarcJBrooker
Marc Brooker
3 years
"Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3" (), the SOSP'21 paper from the S3 team at AWS, is a good read. A couple highlights (in my, non-official, opinion):
2
31
103
@MarcJBrooker
Marc Brooker
1 year
New blog post, on the relationship between multitenancy and scalability in the cloud (and how serverless enables scalability in a fundamentally different way to traditional architectures):
1
22
100
@MarcJBrooker
Marc Brooker
11 months
The scale challenge of building cloud services isn't just because they're big, but because they span such a huge range of sizes. E.g. the 300,000,000,000x scale difference between the DynamoDB behind my blog, and Amazon's Prime Day use-case.
Tweet media one
2
13
103
@MarcJBrooker
Marc Brooker
2 years
I've read "A Read-Only Transaction Anomaly Under Snapshot Isolation" many times, and go through all five stages of grief every time.
4
10
102
@MarcJBrooker
Marc Brooker
3 months
Somewhere at Google there's a vast ML model that looked through my decades of search and usage history, and decided that what I really need is a push notification about Jennifer Aniston's new haircut.
1
1
98
@MarcJBrooker
Marc Brooker
3 months
Another related thing: in the graph, nodes with high indegree tend to become "hot keys" even if the external access patterns are uniform. A ton of DB tuning best-practices are about avoiding these high-degree nodes.
Tweet media one
@MarcJBrooker
Marc Brooker
3 months
If we draw database rows as points, and add edges between rows that appear in the same transaction, the resulting graph is a great way to think about potential scalability. The more you can cut the graph up without crossing edges, the easier the workload is to scale.
Tweet media one
17
63
467
1
5
95
@MarcJBrooker
Marc Brooker
4 months
New blog post, on what the word "scalable" means in my head, and how thinking about marginal costs of adding work makes a lot of the debates about scalability go away:
1
14
97
@MarcJBrooker
Marc Brooker
2 years
This whole thing is worth reading. My hot take is the ability to do efficient parallel IO (net and storage) without significant extra programmer effort is the most important thing a system language can offer. Latency will continue to lag bandwidth. Parallelism is king.
@usenix
USENIX Association
2 years
Check out this article posted on USENIX's ;login: online: 'Investigating Managed Language Runtime Performance' by David Lion, Adrian Chiu, Michael Stumm, and Ding Yuan: #OpenAccess
0
11
38
3
12
97
@MarcJBrooker
Marc Brooker
2 years
Very cool look inside the labs at Annapurna Labs, designing and building custom silicon for AWS: I'm personally especially excited about the power and efficiency improvements that Graviton has brought to AWS.
1
25
93
@MarcJBrooker
Marc Brooker
4 months
One absolutely does need a model of failures to discuss reliability. For example, one of our most powerful HA tools (redundancy) is exponentially powerful when failures are independent, but useless when failures are correlated.
@LewisCTech
Lewis Campbell
4 months
You ever see the same term pop up in a few places and think "hmmm"? @TigerBeetleDB often talks of "fault models". Wonder if there's more to this idea than I think. (Paper is "A Transaction Model", Jim Gray, IBM Research Laboratory 1980)
Tweet media one
0
2
32
3
13
91
@MarcJBrooker
Marc Brooker
11 months
Simplicity is a property of systems. It's always possible to make components simpler by pushing the complexity elsewhere in the system.
5
7
91
@MarcJBrooker
Marc Brooker
28 days
New blog post: "Formal Methods: Just Good Engineering Practice?"
2
10
92
@MarcJBrooker
Marc Brooker
2 years
I wrote a quick new blog post, on the bug/ambiguity in Paxos Made Simple:
3
22
92
@MarcJBrooker
Marc Brooker
2 years
New blog post: Writing is Magic From the genre of emails that got out of hand, my thoughts on why I think most people should spend more time writing (writing prose for humans, that is).
4
17
90
@MarcJBrooker
Marc Brooker
23 days
This is a fascinating spot in the trade-off space of commit algorithms. It's less fault tolerant than 2PC-over-Paxos (needs to wait for reconfiguration to make progress after even one failure), but saves a whole round-trip in the deal.
@akatsarakis
Antonis Katsarakis
23 days
Wait… what!? Fault tolerant 2PC that is simple and commits in 1RTT? 🤯 If you missed @ChrisJe34211511 fantastic talk in #eurosys24 ( #papoc24 ) definitely check the paper
Tweet media one
11
32
174
3
9
91
@MarcJBrooker
Marc Brooker
2 years
I wrote a blog post (or thread that really got out-of-hand) on deployment safety, and why online discussions of things like "should we deploy on Friday?" are so seldom productive:
1
16
89
@MarcJBrooker
Marc Brooker
3 months
Torn write protection is indeed a big issue for databases. EC2 and EBS support atomic writes up to 16kB:
@hnasr
Hussein Nasser
3 months
torn pages problem is interesting, who thought a mismatch between database page size and file system block size can cause this. So if the fs block is 4K and DB page is 8K, we need to write two fs blocks for each page and it needs to happen atomically. If you wrote half a page
6
8
64
1
5
89