Aman Sanger Profile
Aman Sanger

@amanrsanger

15,229
Followers
656
Following
93
Media
876
Statuses

building @cursor_ai at @anysphere |

San Francisco, CA
Joined April 2021
Pinned Tweet
@amanrsanger
Aman Sanger
7 months
Coding just got a little more delightful. We've raised an $8M seed round, led by the OpenAI Startup Fund, to build Cursor! Read more here:
84
91
2K
@amanrsanger
Aman Sanger
1 year
Introducing Cursor!! () A brand new IDE built from the ground up with LLMs. Watch us use Cursor to ship new features blazingly fast.
117
358
3K
@amanrsanger
Aman Sanger
5 months
At @cursor_ai, we've scaled throughput on GPT-4 to 2-3x over baseline without access to knobs in OpenAI's dedicated instances. [1] We did this by reverse-engineering expected GPT-4 latency and memory usage from first principles. Here's how... (1/10)
17
90
1K
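A rough sketch of the kind of first-principles estimate the thread describes: decode latency for a large dense transformer is roughly the bytes that must stream through memory per step divided by memory bandwidth. All model sizes and hardware numbers below are illustrative assumptions, not OpenAI's actual figures.

```python
# Back-of-envelope decode-latency model (illustrative numbers only).
def decode_latency_per_step(n_params, kv_cache_bytes_per_req, mem_bandwidth, batch_size):
    """Memory-bound estimate: each decoding step streams the fp16 weights once,
    plus every in-flight request's KV cache, through HBM."""
    weight_bytes = n_params * 2                      # fp16 weights
    bytes_moved = weight_bytes + batch_size * kv_cache_bytes_per_req
    return bytes_moved / mem_bandwidth               # seconds per step

# Hypothetical 175B-param model, ~4 GB of KV cache per request,
# ~16 TB/s aggregate bandwidth (e.g. 8 tensor-parallel GPUs x ~2 TB/s), batch size 8:
step = decode_latency_per_step(175e9, 4e9, 16e12, 8)
print(f"~{step * 1000:.0f} ms per decoding step, ~{8 / step:.0f} tokens/s of batch throughput")
```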
@amanrsanger
Aman Sanger
1 year
Want to code using GPT-4? We made an IDE built for programming alongside it. Try out the public beta here:
46
222
1K
@amanrsanger
Aman Sanger
1 year
No one is talking about the actual best open-source language model today. It isn't Bloom or OPT, not even GLM-130B. It's an 11B instruction fine-tuned model open-sourced by Google themselves: Flan-T5 11B. And the second best is Flan-T5 3B...
18
92
1K
@amanrsanger
Aman Sanger
5 months
At Cursor, we've built very high-quality retrieval datasets (for training embeddings/rerankers). To do this, we use GPT-4 grading and the Trueskill ratings system (a better version of Elo) Here’s how.. (1/10)
21
66
1K
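A minimal sketch of the rating side of this pipeline, using the open-source `trueskill` package; the block IDs, grading setup, and conservative-score formula are assumptions for illustration, not Cursor's actual pipeline.

```python
import trueskill  # pip install trueskill

env = trueskill.TrueSkill(draw_probability=0.0)
ratings = {b: env.create_rating() for b in ["blk_a", "blk_b", "blk_c", "blk_d"]}

def update_from_ordering(ordered_ids):
    """ordered_ids: code-block ids as ranked by the LLM grader, most relevant first."""
    teams = [(ratings[i],) for i in ordered_ids]                 # one "player" per block
    new_teams = env.rate(teams, ranks=list(range(len(teams))))   # free-for-all update
    for block_id, (new_rating,) in zip(ordered_ids, new_teams):
        ratings[block_id] = new_rating

# Suppose GPT-4 judged relevance for one query as b > a > d > c:
update_from_ordering(["blk_b", "blk_a", "blk_d", "blk_c"])

# A conservative skill estimate (mu - 3*sigma) can then serve as the retrieval label.
labels = {b: r.mu - 3 * r.sigma for b, r in ratings.items()}
```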
@amanrsanger
Aman Sanger
5 months
Try out GPT-4V in Cursor! It's pretty good for building/modifying components!
80
117
1K
@amanrsanger
Aman Sanger
3 months
At Cursor, we’re fascinated by the problem of deeply understanding codebases. One useful primitive we’ve been focused on is code graph construction and traversal. Here's how/why we're tackling this... (1/12)
22
68
868
@amanrsanger
Aman Sanger
1 year
Llama and many recent open-source models have a significant architectural limitation: They use multi-head attention instead of multi-query attention (which is used by PaLM and probs Claude 100K). This can result in slowdowns of up to 30x. Here's the math behind why (1/n)
Tweet media one
17
163
847
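The core of the argument is KV-cache size: with multi-head attention every head stores its own K and V, so the cache (and the memory traffic per generated token) is n_heads times larger than with multi-query attention. A small sketch with LLaMA-65B-like shapes (the exact shapes are assumptions):

```python
# KV cache bytes = 2 (K and V) * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes/elt
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elt=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elt / 1e9

mha = kv_cache_gb(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=2048, batch=16)  # ~86 GB
mqa = kv_cache_gb(n_layers=80, n_kv_heads=1,  head_dim=128, seq_len=2048, batch=16)  # ~1.3 GB
print(mha, mqa, mha / mqa)  # the cache, and its per-token memory traffic, shrinks by n_heads = 64x
```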
@amanrsanger
Aman Sanger
5 months
Though @cursor_ai is powered by standard retrieval pipelines today, we've been working on something much better called Deep Context. After @walden_yan built an early version for our vscode fork, Q&A accuracy skyrocketed. Soon, we're bringing this to everyone (1/6)
28
42
765
@amanrsanger
Aman Sanger
10 months
the masculine urge to build a vector db startup
21
32
696
@amanrsanger
Aman Sanger
2 months
Introducing Copilot++: The first and only copilot that suggests edits to your code:
51
45
699
@amanrsanger
Aman Sanger
3 months
An underrated part of Cursor is our codebase indexing system. It provides efficient indexing/updating without storing any code on our servers. (1/9)
11
33
688
@amanrsanger
Aman Sanger
4 months
One magical part of Cursor’s internal tech stack is a prompt compilation library called priompt () Here's why it works so well... (1/12)
15
57
666
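priompt itself is a TypeScript library; the snippet below is not its API, just a sketch of the underlying idea: every prompt piece declares a priority, and the renderer keeps the highest-priority pieces that fit the token budget, in their original order.

```python
from dataclasses import dataclass

@dataclass
class Piece:
    text: str
    priority: int                      # higher = more important to keep

def count_tokens(text: str) -> int:
    return len(text.split())           # stand-in for a real tokenizer

def render(pieces, budget):
    kept, used = [], 0
    for p in sorted(pieces, key=lambda p: -p.priority):    # fill the budget by priority
        cost = count_tokens(p.text)
        if used + cost <= budget:
            kept.append(p)
            used += cost
    order = {id(p): i for i, p in enumerate(pieces)}        # re-emit in source order
    return "\n".join(p.text for p in sorted(kept, key=lambda p: order[id(p)]))

prompt = render(
    [Piece("system instructions", 100),
     Piece("currently open file", 80),
     Piece("older chat history", 10)],
    budget=6,
)   # keeps the two high-priority pieces, drops the old history
```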
@amanrsanger
Aman Sanger
17 days
SWE-bench is probably contaminated for frontier models (gpt-4/claude-3-opus). Given only the name of a pull request in the dataset, Claude-3-opus already knows the correct function to modify.
Tweet media one
Tweet media two
14
59
606
@amanrsanger
Aman Sanger
5 months
Another LLM inference trick that is surprisingly missing in most inference engines, but powers Cursor: Request-level memory-based KV caching. This can bring time to the first token (TTFT) down by an order of magnitude and dramatically increases generation throughput. (1/6)
10
40
595
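A sketch of the idea (not any particular engine's API): key KV caches by a hash of the token prefix, look up the longest cached prefix on each request, and only prefill the suffix.

```python
import hashlib

class PrefixKVCache:
    def __init__(self):
        self._store = {}                              # prefix-hash -> KV tensors

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def lookup(self, tokens):
        """Return (cached_kv, n_cached_tokens) for the longest cached prefix."""
        for cut in range(len(tokens), 0, -1):         # a trie/block scheme is used in practice
            kv = self._store.get(self._key(tokens[:cut]))
            if kv is not None:
                return kv, cut
        return None, 0

    def insert(self, tokens, kv):
        self._store[self._key(tokens)] = kv

# On a chat request, reuse the KV for the shared system-prompt/history prefix and
# prefill only the new suffix: TTFT then scales with the suffix, not the full prompt.
```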
@amanrsanger
Aman Sanger
2 years
My AI prediction: Training will look like researchers/practitioners offloading large-scale training jobs to specialized “training” companies: a state of the world that resembles chip design & fabrication. (1/n)
27
65
549
@amanrsanger
Aman Sanger
9 months
Sub 600ms latency speech conversational AI is completely possible today, surprised I haven’t seen anyone that does this. The key is hosting a model (like llama), streaming from whisper, and every few tokens, prefilling more of the kv cache - without evicting from memory (1/4)
33
35
552
@amanrsanger
Aman Sanger
1 year
The size of all code/history on Github public repos is 92TB. The size of Google's monorepo in 2015 was 86TB (of much higher quality code). If Google were willing to deploy code models trained on their own data, they'd have a noticeable advantage over everyone else.
Tweet media one
Tweet media two
38
37
529
@amanrsanger
Aman Sanger
1 year
There are times and places for training your own models... With the release of OpenAI's ChatGPT API, coding is looking less like one of them. The HumanEval pass@1 rate of ChatGPT is as good as the best open-source model's pass@100 rate. And this is still just GPT-3.5...
Tweet media one
20
65
533
@amanrsanger
Aman Sanger
4 months
Working on a new version of copilot that can suggest *edits* to your codebase
18
24
505
@amanrsanger
Aman Sanger
1 month
Long context models with massive custom prompts (~2M) may soon replace fine-tuning for new knowledge! Let’s explore why: (1/10)
12
75
503
@amanrsanger
Aman Sanger
1 year
GPT-4 is waaay better at programming than given credit for. HumanEval is a benchmark of python programming problems. With some prompt engineering, GPT-4 scores ~85%, destroying Codex's 29% from just 2 years ago, and performing much better than OpenAI's publicized accuracy.
Tweet media one
9
53
427
@amanrsanger
Aman Sanger
1 year
New feature just dropped... You can now generate a whole project with Cursor: Coming this week, multifile generation + codebase-wide understanding 👀
13
45
419
@amanrsanger
Aman Sanger
1 year
My bet is that in the long run, reading and writing to external memory is key for much more capable models that can continually learn. Someone will make the Neural Turing Machine work with a transformer backbone. (1/4)
18
23
416
@amanrsanger
Aman Sanger
1 year
is now powered by GPT-4! Since partnering with @openai in December, we’ve completely redesigned the IDE to incorporate the power of next-gen models like GPT-4. Soon, we’ll be fully opening up the beta. Retweet this, and we’ll give you access today 😉
40
266
340
@amanrsanger
Aman Sanger
9 months
Despite Cursor’s recent insane growth, the current version is just 0.1% of what we have in store. We’re a small, very strong team and are looking for fantastic SWEs and designers to help shape the future of software development. Read more here -
19
18
357
@amanrsanger
Aman Sanger
5 months
There are some interesting optimizations to consider when running retrieval at scale (in @cursor_ai's case, hundreds of thousands of codebases). For example, reranking 500K tokens per query. With blob-storage KV-caching and pipelining, it's possible to make this 20x cheaper (1/8)
8
18
349
@amanrsanger
Aman Sanger
1 year
Despite the recent hype around Replit's new model, it isn't actually the best open-source code model out there. In fact, it's not even the best 3-billion parameter code model. That title belongs to Microsoft's MIM-2.7B... And it was trained on 2x fewer tokens!
Tweet media one
10
36
340
@amanrsanger
Aman Sanger
1 year
Cursor just got a massive upgrade () ...and we're now compatible with most of your VSCode plugins ;)
Tweet media one
11
25
329
@amanrsanger
Aman Sanger
11 months
gpt-3.5-turbo is criminally underrated at coding. When using it with Azure's completion endpoint instead of OpenAI's chat endpoint, you can get a jump in HumanEval performance from <50% to 74%! This blows claude v1.3 out of the water, which sits just below 60% perf. [1]
23
41
328
@amanrsanger
Aman Sanger
5 months
People claim LLM knowledge distillation is trivial with logprobs, but that's not quite right... It's very tricky to distill between different tokenizers. [1] Internally, we've solved this with a clever algorithm we called tokenization transfer (1/7)
8
20
300
@amanrsanger
Aman Sanger
9 months
Surprisingly, fp16 inference is cheaper than most 4-bit quantization (i.e. GPTQ/exllama, bitsandbytes, likely llama.cpp) when running inference at scale! After profiling these methods with llama2-7b, we see fp16 vllm is the cheapest! [1] Here’s the math behind why... (1/6)
Tweet media one
9
42
302
@amanrsanger
Aman Sanger
5 months
After switching our vector db to @turbopuffer , we're saving an order of magnitude in costs and dealing with far less complexity! Here's why... (1/10)
@Sirupsen
Simon Eskildsen
5 months
we're very much in prod with @turbopuffer:
1. 600m+ vectors
2. 100k+ indexes
3. 250+ RPS
thrilled to be working with @cursor_ai — now we're ready for your vectors too
9
9
172
5
16
288
@amanrsanger
Aman Sanger
1 year
probably nothing...
Tweet media one
12
21
283
@amanrsanger
Aman Sanger
9 months
More people should be training their own embeddings. Why? Because it costs <$1000 to train a SOTA embeddings model. GTE-base is a 110M param model, which beat text-embedding-ada-002 on basically everything but code, and it likely costs < $921 (1/2)
Tweet media one
8
38
285
@amanrsanger
Aman Sanger
10 months
Huggingface / Megatron
Vanilla pytorch
Deepspeed
Tweet media one
Tweet media two
5
23
275
@amanrsanger
Aman Sanger
2 months
With a 256K token prompt, a 7b model can generate tokens as quickly as codellama-7b with an 8K prompt. How? The model must use multi-query attention. Here's why... (1/10)
7
39
272
@amanrsanger
Aman Sanger
5 months
Large dense-model training often requires fancy parallelism strategies (tensor/3d) instead of just FSDP because of a non-obvious constraint: global batch sizes (1/14)
4
28
267
@amanrsanger
Aman Sanger
1 month
"Token Counts" for long context models are a deceiving measure of content length. For code: 100K Claude Tokens ~ 85K gpt-4 Tokens 100K Gemini Tokens ~ 81K gpt-4 Tokens 100K Llama Tokens ~ 75K gpt-4 Tokens OpenAI's 128K context window goes farther than it would appear.
9
17
257
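A quick way to reproduce this kind of comparison on your own code, using `tiktoken` for the GPT-4 tokenizer and a Llama-family tokenizer from Hugging Face (the checkpoint name below is just one ungated mirror; any Llama tokenizer works):

```python
import tiktoken
from transformers import AutoTokenizer

code = "def add(a, b):\n    return a + b\n" * 200      # substitute any source file

gpt4 = tiktoken.encoding_for_model("gpt-4")            # cl100k_base
llama = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

n_gpt4 = len(gpt4.encode(code))
n_llama = len(llama.encode(code, add_special_tokens=False))
print(f"gpt-4: {n_gpt4} tokens, llama: {n_llama} tokens ({n_llama / n_gpt4:.2f}x)")
```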
@amanrsanger
Aman Sanger
1 year
Crazy results... 120B open-source model that is second only to Minerva in all kinds of STEM related tasks. Big takeaway: multi-epoch training works. No degradation of performance for 4 epochs of training. And with 3-4x less train data than BLOOM and OPT, it beats both.
@paperswithcode
Papers with Code
1 year
🪐 Introducing Galactica. A large language model for science. Can summarize academic literature, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more. Explore and get weights:
283
2K
8K
6
20
223
@amanrsanger
Aman Sanger
1 year
Flash attention is fantastic, but there are misconceptions that it is a silver bullet for all LM workloads. At inference time, it is quite fast at ingesting prompts, but... Flash attention offers minimal (if any) speedups for completions. Let's explore why... (1/7)
Tweet media one
9
28
219
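The intuition in one back-of-envelope: at decode time the query is a single token, so the score matrix flash attention avoids materializing is only 1 x seq_len, while the K/V cache has to be streamed either way. Shapes below are illustrative.

```python
# Bytes touched per layer, per head, per generated token (fp16, illustrative shapes).
seq_len, head_dim = 8192, 128
kv_bytes     = 2 * seq_len * head_dim * 2   # K and V must be read regardless
scores_bytes = 1 * seq_len * 2              # the 1 x seq_len row flash attention avoids writing
print(kv_bytes / scores_bytes)              # ~256x: KV traffic dominates, so there's little to save
```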
@amanrsanger
Aman Sanger
9 months
A very simple trick and a very hard trick for sub 300ms latency speech to speech. Simple: Ask the language model to always preface a response with a VERY believable filler word, then a pause: Um… Well… Really… Interesting… Maybe… Hard: Speculatively sample different user…
@amanrsanger
Aman Sanger
9 months
Sub 600ms latency speech conversational AI is completely possible today, surprised I haven’t seen anyone that does this. The key is hosting a model (like llama), streaming from whisper, and every few tokens, prefilling more of the kv cache - without evicting from memory (1/4)
33
35
552
20
13
218
@amanrsanger
Aman Sanger
1 year
As fantastic as OpenAI's models are, it's hard to justify using them for finetuning. I have a few reasons... First, Instruct Models are unavailable for finetuning. The best model you can finetune is davinci, which is probs worse than open-source equivalents like GLM-130B. (1/3)
9
13
216
@amanrsanger
Aman Sanger
1 year
Palm2 has been leaked to be 340B params and trained on 3.6T tokens (7.4e24 FLOPs). Someone out there could feasibly reproduce a similar quality model... for under $6M! But that price tag largely depends on H100s... [1/6]
Tweet media one
7
29
193
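The arithmetic behind a figure like this (hardware cost and utilization below are my assumptions, not numbers from the thread): training FLOPs ≈ 6 x params x tokens, then divide by what an H100 can actually sustain.

```python
train_flops = 6 * 340e9 * 3.6e12              # ~7.3e24, matching the quoted estimate
h100_bf16   = 1.0e15                          # ~1 PFLOP/s dense BF16 (assumed)
mfu         = 0.5                             # optimistic utilization (assumed)
price       = 1.5                             # assumed $/H100-hour

gpu_hours = train_flops / (h100_bf16 * mfu) / 3600
print(f"~{gpu_hours/1e6:.1f}M H100-hours, ~${gpu_hours * price / 1e6:.1f}M")  # same ballpark as the quoted $6M
```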
@amanrsanger
Aman Sanger
1 year
For those using open-source models like CodeGen instead of OpenAI's Codex, I have some bad news about its "comparable" performance. It isn’t even close anymore. code-davinci 1-shot is competitive with CodeGen 10-shot. (1/7) (bottom results computed by me with OpenAI API)
Tweet media one
11
17
192
@amanrsanger
Aman Sanger
5 months
Tricks for LLM inference are very underexplored For example, @cursor_ai ’s “/edit” and cmd-k are powered by a similar trick to speculative decoding, which we call speculative edits. We get 5x lower latency for full-file edits with the same quality as rewriting a selection!
4
7
192
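A highly simplified sketch of the idea (not Cursor's implementation): when rewriting a full file, most output tokens equal the original, so treat the original file as the draft and let one forward pass verify a whole run of tokens at a time. `verify` and `generate_one` are hypothetical helpers.

```python
def speculative_edit(model, prompt_tokens, original_tokens, chunk=32):
    out, i = [], 0
    while i < len(original_tokens):
        draft = original_tokens[i:i + chunk]
        # One forward pass greedily checks all draft positions at once.
        accepted = model.verify(prompt_tokens + out, draft)      # hypothetical helper
        out += draft[:accepted]
        i += accepted
        if accepted < len(draft):                                # the model wants an edit here
            out.append(model.generate_one(prompt_tokens + out))  # hypothetical helper
            i += 1   # naive resync; re-aligning to the original file is the hard part in practice
    return out
```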
@amanrsanger
Aman Sanger
1 year
Palm2 just dropped, and there are claims that the largest model is just 14.7B params. In reality, the model is probably closer to 100B parameters But why... (1/5)
Tweet media one
7
25
188
@amanrsanger
Aman Sanger
2 months
Groq looks very good. I’d suspect it’s possible to achieve this speed with bs=1, 4-bit weights, and speculative decoding on 4-8 H100s. But even on bs=4 H100 pricing, that would cost at least $2.5/1M tokens. For Groq it’s $0.8…
Tweet media one
11
10
186
@amanrsanger
Aman Sanger
9 months
I’m bearish on the future of on-device LM inference. Why? Because Mixture of Experts (MOE) shifts the balance in favor of datacenter inference (1/7)
20
11
178
@amanrsanger
Aman Sanger
11 months
Cursor can now answer questions about your entire repo! Powered by several layers of retrieval, press cmd+enter to:
* Find a missing block of code
* Plan out a full PR
* Write a method with scattered dependencies
Tweet media one
Tweet media two
9
21
175
@amanrsanger
Aman Sanger
1 month
Try out a better and much faster model today! and more crazy improvements coming soon ;)
@cursor_ai
Cursor
1 month
Copilot++ is now ~2x faster! This speedup comes from inference optimizations + our best model yet. We really like it and hope you do too ☺️
22
33
503
8
7
170
@amanrsanger
Aman Sanger
1 month
2024 is the year that long-context gets commoditized I'd bet we see several 1m+ token models in oss and closed-source by end of year
5
9
168
@amanrsanger
Aman Sanger
9 months
4090s are 5x cheaper per FLOP than A100s (and can even serve more peak flops)! This means they're exceptionally underrated at serving embedding models. If you wanted to serve SOTA embeddings (GTE) on 4090s, you could do it for < $1 / 1 Billion tokens [1]
11
9
160
@amanrsanger
Aman Sanger
1 year
My favorite new feature in ... Toolformer. We give the model access to everything a human uses in their IDE. LSP errors, goto definition, terminal outputs, rg search. Watch Cursor answer complex queries just using rg search... for a >1M line codebase
6
10
150
@amanrsanger
Aman Sanger
3 months
For some pure-prompt (prefill) workloads that require log probabilities, we’ve built a lightweight inference server that outperforms vllm by a factor of 2. It’s MUCH easier/simpler than it sounds and just uses pytorch + transformers. Here's why... (1/9)
7
9
146
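A bare-bones version of the core operation such a server performs: one prefill pass over the prompt, returning a log-probability for every prompt token. (gpt2 is just a placeholder model; a real server adds batching, caching, etc.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.inference_mode()
def prompt_logprobs(text):
    ids = tok(text, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1]                   # position t predicts token t+1
    logps = torch.log_softmax(logits, dim=-1)
    targets = ids[:, 1:]
    return logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # one logprob per prompt token

print(prompt_logprobs("def add(a, b):\n    return a + b"))
```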
@amanrsanger
Aman Sanger
10 months
Llama-2 is more expensive than you'd think. It can be up to 33x more expensive than gpt-3.5, without large batches. But for some workloads, it can actually be 3x cheaper! We delve deep into the math and measured perf here:
7
21
130
@amanrsanger
Aman Sanger
5 months
Most OSS inference frameworks combine pre-filling (prompt tokens) with decoding (generation tokens) per device/process. But separating the stages should give you better perf! (1/14)
2
9
127
@amanrsanger
Aman Sanger
3 months
This year, we intend to solve the problem of having Cursor completely “understand” a codebase. If you have clever solutions for chipping away at this problem, would love to talk ;) (12/12)
14
8
122
@amanrsanger
Aman Sanger
9 months
A surprising fact about transformers is that training is often cheaper per token than inference. But only on small batch sizes. Here's why...(1/5)
4
7
118
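One way to see it, with illustrative A100-like numbers (all assumptions): training is compute-bound at ~6N FLOPs per token, while batch-1 decoding is memory-bound and must stream all 2N bytes of fp16 weights for every generated token.

```python
N          = 7e9        # parameters (assumed)
peak_flops = 312e12     # A100 BF16 peak (assumed)
bandwidth  = 2e12       # A100 HBM bytes/s (assumed)
mfu        = 0.4        # training utilization (assumed)

train_s_per_tok  = 6 * N / (peak_flops * mfu)   # ~3.4e-4 GPU-seconds per training token
decode_s_per_tok = 2 * N / bandwidth            # ~7.0e-3 GPU-seconds per generated token at batch 1
print(decode_s_per_tok / train_s_per_tok)       # ~20x more GPU-time per token for batch-1 inference
```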
@amanrsanger
Aman Sanger
5 months
[1] Learning codebase tactics is motivated in large part by the Voyager paper!
9
2
116
@amanrsanger
Aman Sanger
1 year
SOTA for retrieval in academia uses an interesting technique called multi-vector retrieval. This requires separately embedding each token of the query and documents. It actually uses similar model compute to vanilla retrieval [1], but significantly more vector db usage.
Tweet media one
1
11
116
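The scoring rule behind ColBERT-style multi-vector retrieval is MaxSim: every query token takes its best match over the document's token vectors, and the per-token maxima are summed. A small torch sketch with made-up shapes:

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_vecs, doc_vecs):
    """query_vecs: [Q, d], doc_vecs: [D, d] -- one (normalized) vector per token."""
    sims = query_vecs @ doc_vecs.T          # [Q, D] token-token similarities
    return sims.max(dim=1).values.sum()     # best doc token per query token, summed

q = F.normalize(torch.randn(16, 128), dim=-1)    # 16-token query
d = F.normalize(torch.randn(300, 128), dim=-1)   # 300-token document
print(maxsim_score(q, d))
```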
@amanrsanger
Aman Sanger
3 months
A very rough draft of a new UX for making your code more readable/bug-free. Heavily inspired by @JoelEinbinder : (1/3)
8
5
115
@amanrsanger
Aman Sanger
1 year
Cursor is hiring! We’re looking for talented engineers, researchers, and designers to join us in making code editing drastically more efficient (and fun!) If you're excited to redesign the experience of building software, please reach out at hiring@cursor.so
8
5
111
@amanrsanger
Aman Sanger
11 months
that’s like a $100M supercomputer. It has more peak FLOPs than even Meta's compute cluster!!
@natfriedman
Nat Friedman
11 months
Daniel and I have setup a cluster for startups:
Tweet media one
189
398
4K
2
4
110
@amanrsanger
Aman Sanger
1 year
Instruction fine-tuning is crazy
Tweet media one
4
4
101
@amanrsanger
Aman Sanger
2 months
Surprising that it works to finetune well past overfitting on validation data. Done by both OpenAI’s InstructGPT and LIMA:
Tweet media one
Tweet media two
7
18
96
@amanrsanger
Aman Sanger
5 months
Decent chance we get AGI before full self-driving. The most intelligent models may demand more inference compute than a reasonably-priced car can provide. And a GPT-6 likely won’t satisfy the inference speed constraints for real-time driving.
12
4
95
@amanrsanger
Aman Sanger
5 months
I do find it interesting that despite this project dealing just with external inference systems and OpenAI APIs, it required a deep understanding of how transformer inference works to actually get right.
8
4
94
@amanrsanger
Aman Sanger
10 months
My new favorite feature on : Pulling in docs for inline edits. It grounds the model, giving hallucination-free edits while preserving flow.
5
4
94
@amanrsanger
Aman Sanger
10 months
Interestingly, just one subtle detail added to this model makes codegen 2.5 substantially faster than codegen 2. All it required was increasing the number of attention heads from 16 to 32... (1/4)
@SFResearch
Salesforce AI Research
10 months
Releasing 🚀 CodeGen2.5 🚀, a small but mighty LLM for code.
- On par with models twice its size
- Trained on 1.5T tokens
- Features fast infill sampling
Blog: Paper: Code: Model:
Tweet media one
8
105
345
2
7
90
@amanrsanger
Aman Sanger
1 month
Jamba actually requires more memory for long context generation than (some) similar-sized transformers! In particular, multi-query attention (mqa) models (like Palm) are slightly more memory efficient.
5
2
91
@amanrsanger
Aman Sanger
9 months
LoRA may be more helpful for closed AI providers than for serving OSS models. With LoRA they could offer fine-tuning as a service, but serve the finetuned model at small marginal cost. They’d just need an inference engine that serves multiple LoRA adapters in a single batch (1/4)
5
9
84
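A sketch of what "multiple LoRA adapters in a single batch" means at the matmul level: the base projection is shared across the batch, and each row adds its own low-rank correction. The shapes and einsum formulation are mine, not any particular engine's.

```python
import torch

def lora_batched_forward(W, x, A, B, adapter_ids):
    """W: [d_out, d_in] frozen base weight; x: [batch, d_in];
    A: [n_adapters, r, d_in]; B: [n_adapters, d_out, r]; adapter_ids: [batch]."""
    base = x @ W.T                                      # shared base-model compute
    Ai, Bi = A[adapter_ids], B[adapter_ids]             # gather each request's adapter
    delta = torch.einsum("bor,bri,bi->bo", Bi, Ai, x)   # B_i (A_i x_i) per batch row
    return base + delta

W = torch.randn(4096, 4096)
A = torch.randn(8, 16, 4096) * 0.01                     # 8 adapters, rank 16
B = torch.zeros(8, 4096, 16)                            # B starts at zero, as in LoRA
x = torch.randn(4, 4096)
y = lora_batched_forward(W, x, A, B, torch.tensor([0, 3, 3, 7]))
```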
@amanrsanger
Aman Sanger
5 months
The next step is letting the model reflect on PRs and local developer changes, storing this in its repository of deep context. Combine this new context engine with Cursor, and you get game-changing experimental results. (6/6)
2
1
80
@amanrsanger
Aman Sanger
1 year
Google's answer to GPT-4, Palm2, just dropped! It is competitive with GPT-4, and even beats it on Math! [1] But on code, the smallest variant of Palm2 significantly underperforms GPT-4 on human eval... Still, it already blows past SOTA for non-OpenAI general coding models [2]
Tweet media one
4
8
81
@amanrsanger
Aman Sanger
11 months
Once fp8 utilization is figured out, you could probs train llama on this cluster in... 3 days! [1]
Tweet media one
@natfriedman
Nat Friedman
11 months
Daniel and I have setup a cluster for startups:
Tweet media one
189
398
4K
7
3
79
@amanrsanger
Aman Sanger
1 year
Hate debugging your code? Cursor will literally do it for you! Thanks to @walden_yan , when you hit an error, press cmd+d to have Cursor automatically fix it.
5
9
81
@amanrsanger
Aman Sanger
10 months
Cursor is releasing a nightly build! Here we'll prototype experimental features before wide release. These include: - Agents for making codebase-wide edits - An AI-powered bug finder Comment down below if you’d like to get access!
54
6
81
@amanrsanger
Aman Sanger
6 months
@AravSrinivas Same reason Bing didn't build something like Perplexity 😉
3
0
80
@amanrsanger
Aman Sanger
1 year
Inter-temporal Bradley Terry (IBT) reward modeling is the most important concept from the recent DeepMind paper on multimodal RLHF. I believe it will be key for getting language models to perform long-term complex tasks... (1/10)
2
4
78
@amanrsanger
Aman Sanger
5 months
The fundamental issue with RAG is that it encourages shallow answers Even excellent engineers would struggle to answer a question with only RAG-like context about a codebase they've never seen. (2/6)
2
2
76
@amanrsanger
Aman Sanger
1 year
Lots of talk about Llama being the new dominant "open-source" model But, Meta hasn't even open-sourced the best model from that paper! Llama-I, an instruction-fine-tuned variant of Llama-65B, is the best. And it isn't available to download...
Tweet media one
3
4
77
@amanrsanger
Aman Sanger
1 year
Wait, so text-davinci-002 is Codex with instruction finetuning… Huh
Tweet media one
9
3
76
@amanrsanger
Aman Sanger
7 months
We're hiring across the board in software engineering and ML. If you're as excited about the future of AI-assisted coding as we are, please reach out at hiring@anysphere.co!
3
0
72
@amanrsanger
Aman Sanger
5 months
A good engineer would first read the code. They'd follow the breadcrumbs. Go to relevant files, functions, classes, goto definitions/references, to build understanding. What happens when we let GPT-4 do this... It builds a scratchpad of deep understanding (3/6)
3
1
71
@amanrsanger
Aman Sanger
2 years
Tweet media one
0
2
72
@amanrsanger
Aman Sanger
5 months
GPT-4 zero/few-shot is poorly grounded, so asking it to predict a raw score for each code block will give inaccurate results. Instead, let's take 4 code blocks for a given query. If we ask gpt-4 to order them based on relevance, its accuracy at this task is almost 100%! (4/10)
2
2
71
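A hypothetical version of that ordering prompt (not Cursor's actual prompt), using the OpenAI Python SDK; the instruction wording and the brittle comma parsing are purely illustrative.

```python
from openai import OpenAI

client = OpenAI()

def rank_blocks(query, blocks):
    numbered = "\n\n".join(f"[{i}]\n{b}" for i, b in enumerate(blocks))
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Query: {query}\n\nCode blocks:\n{numbered}\n\n"
                "Order the blocks from most to least relevant to the query. "
                "Answer with the indices only, e.g. 2,0,3,1."
            ),
        }],
    )
    return [int(i) for i in resp.choices[0].message.content.split(",")]
```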
@amanrsanger
Aman Sanger
2 months
Copilot++ was built to predict the next edit given the sequence of your previous edits. This makes it much smarter at predicting your next change and inferring your intent. Try it out today in Cursor:
3
0
70
@amanrsanger
Aman Sanger
5 months
First, all of this is only possible with OpenAI's dedicated capacity. For large enough orgs with high usage, this is a no-brainer for cost reasons. Dedicated capacity lets you commit to some usage for an extended period of time for reduced pricing. (2/10)
1
0
66
@amanrsanger
Aman Sanger
1 year
And it's a pretty good IDE. With intellisense support, split panes, vim mode, multi-file search etc... We even support Copilot ;)
Tweet media one
4
0
66
@amanrsanger
Aman Sanger
1 year
We're going to be rolling out invites soon. Sign up for the waitlist at
10
0
66
@amanrsanger
Aman Sanger
17 days
It still hallucinates the rest of the PR, but finding the correct edit location (a hard task!) would be impossible if the source and/or PR weren't in the train set. I suspect a 15-20% score on contaminated repos would translate to something like <5% on unseen/non-public repos.
2
0
65
@amanrsanger
Aman Sanger
1 year
The OG open-source code model is back! CodeGen2 is out, and they've released (unfinished) checkpoints up to 16B. Interestingly, they find FIM isn't free - it incurs a drop in human eval perf. And, they find that PrefixLM doesn't improve perf vs causal attention (1/2)
4
7
64
@amanrsanger
Aman Sanger
1 year
This unification of capabilities is insane. Before, the best code models needed to be rigorously fine-tuned on code, forgetting much of general language modeling. Now, ChatGPT is the best coding model while remaining the best generalist model.
3
4
64
@amanrsanger
Aman Sanger
3 months
And in this whole process, no code gets stored on our servers! Just the sha256 hashes of the chunks, file names, and 500 token embeddings. (9/9)
2
0
62
@amanrsanger
Aman Sanger
5 months
If I wanted to know how the vscode text model worked, RAG-based results would fail. With deep context, Cursor can pull from the pre-computed scratchpad to fully answer the question! But we can take it a step further (4/6)
1
0
59
@amanrsanger
Aman Sanger
5 months
More importantly, it offers a different abstraction: Your models now run on “instances”. [2] Each instance can be treated as a machine (or group of machines) running some large transformer. (3/10)
1
0
57
@amanrsanger
Aman Sanger
17 days
This does not mean SWE-bench is a bad benchmark. It is a fantastic measure of progress! But a high score on it will likely translate to a much lower score on uncontaminated (i.e. most) repos.
1
0
59
@amanrsanger
Aman Sanger
3 months
The first step is constructing a merkle tree on the client. We use napi to hook our rust implementation into our typescript frontend. We choose rust to maximize speed and minimize resource use for merkle tree construction. Why a merkle tree? (2/9)
1
0
59
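A minimal Python sketch of the data structure (the real client builds it in Rust via napi): hash every chunk, then hash pairs of hashes up to a single root. After an edit, only the hashes along one root-to-leaf path change, so client and server can find stale chunks by comparing O(log n) nodes instead of re-hashing everything.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaf_hashes):
    """leaf_hashes: list of chunk hashes (bytes). Returns the root hash."""
    level = list(leaf_hashes) or [h(b"")]
    while len(level) > 1:
        if len(level) % 2:                        # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

files = {"src/main.rs": b"fn main() {}", "README.md": b"# demo"}
leaves = [h(path.encode() + b"\0" + content) for path, content in sorted(files.items())]
root = merkle_root(leaves)
```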
@amanrsanger
Aman Sanger
8 months
Tough… The best code-llama model drastically underperforms gpt-4 prompted (87%) and gpt-3.5-turbo prompted (75%) on HumanEval. OSS still has some ways to go on code.
Tweet media one
18
1
56