I wrote a blog post on going from knowing nothing about deep learning last year to training state-of-the-art OSS models - .
Hope it helps you.
tl;dr: read the deep learning book, implemented papers + taught, built open source tools
I'm excited to ship marker - a pdf to markdown converter that is 10x faster than nougat, more accurate outside arXiv, and has low hallucination risk. Marker is optimized for throughput-heavy jobs, like converting LLM pretraining data.
Find it here - .
Cool to see a 500M param model I trained myself do better than Google Cloud Vision, Claude, and GPT-4V on this task (see the thread for the results).
It's a relatively narrow one (OCR), but feels nice to see that small open source models still have a place.
It's weird how we live in an age of miracles with respect to AI/ML, and yet when I want to extract some text from a screenshot the best (very bad) option is tesseract, last updated ~7 years ago.
Better data = better AI. That's why I've spent the last 3 months on:
- Marker - fast, accurate PDF to markdown (5k GH ⭐️s)
- Texify - SOTA math to LaTeX OCR
- Libgen to txt - get 3TB of HQ data
- Textbook quality - HQ synth data
Find them at .
I'm training a text line detection model for a document OCR pipeline.
It could also be useful on its own, but I'm not sure. Is anyone interested in a standalone release?
It works for every language I tried - it detects text bboxes and column breaks. ~2 second inference per
I'm tweaking my line detection model to get it ready for a Github release. This was a fun test case - it's not really designed for newspapers, so I was surprised this worked.
I've shipped most of the models + libraries I wanted in the last few months:
- PDF to markdown - marker
- Text line detection, OCR in 93 languages, layout analysis, reading order - surya
- Equation to LaTeX
- PDF text extraction
Find them on Github - .
Announcing surya reading order! It predicts the order that a human would read a document in.
It's useful for RAG, accessibility, and text extraction. It works on a variety of documents, layouts, and languages.
Announcing surya layout! It detects tables, images, figures, section headers, and more. It works with any language, and a variety of document types.
Find it here - .
Thanks @LambdaAPI for sponsoring compute.
I can't get over @ylecun tweeting that surya was nice. Lifetime achievement unlocked.
My next steps are:
- Improving old/scanned doc performance
- Seeing if I can do anything about rotations
Then on to the next recognition part! Here's the repo - .
Announcing texify - an OCR model that turns inline and block equations into markdown/LaTeX. It's more accurate at this than nougat and pix2tex.
Find it here - .
The biggest barrier to GPT-quality open source LLMs is data.
If you want 1TB of quality data, here's my repo that will convert libgen nonfiction to txt format - .
I made pdftext, a small tool that extracts text like pymupdf, but with an Apache license (mupdf is AGPL). It can pull out blocks and lines or plain text.
Find it here - .
Marker v2 is out! The main new features:
- Extracts images/figures
- Better table parsing
- Pip package install
- Can be used commercially
- Improved OCR with more languages
- Better ordering for complex docs
Get it here - .
Surya () has been updated with a new model checkpoint that is far better on scanned/old docs.
It works even with blurry/rotated complex layouts, like this one:
Surya () didn't work well on scanned/rotated docs, so I decided to spend a couple of days on it this week.
I'm making good progress. It's still training, hopefully will have something out tomorrow.
I'm going to release my reading order model next week. I had to change the architecture to perform better with complex layouts.
It seems to be working, though (see the image). There are mistakes, but it's only 20% trained, and still improving.
Textbooks generated with finetuned mistral + search and wikipedia RAG are surprisingly good. They seem close to GPT-3.5.
See samples here - , and here - .
Working on a bigger set now! Please let me know if you can sponsor.
I've generated 70M tokens of extremely high quality synthetic textbooks - , using retrieval and gpt-3.5.
Seriously, the quality is 💯.
I'm generating 1B tokens, but will use llama for $$ reasons. Please DM if you can sponsor compute or credits.
My reading order model is getting close to being release-ready. (it may not be immediately obvious, but this is a hard doc to order properly)
Working on fixing just a few remaining issues.
I released marker last week - .
Within 72 hours, marker got to #1 on HN with 700 votes, and was starred 3.4k times on Github.
I didn't expect this kind of response - thank you so much for the support!
An update on surya text recognition - I'm happy with the data/architecture, and I'm ready to scale up training.
Here are some results from a (very) early checkpoint. Left is original, right is OCR (Malayalam)
I'm building a dataset of high quality synthetic textbooks for pretraining. Here's a 4M token preview - . The quality is incredibly high (it really surprised me).
I've been generating additional textbooks! is up to 115M high quality tokens, and is up to 85M.
I'm seeing promising humaneval results with models trained on this data.
As @jeremyphoward shared yesterday, I'll be joining @answerdotai! I'm excited to work with such a strong team.
Before I start, I'm going to finish some in-progress work:
- Integrate surya with marker
- Commercial version of marker
- Launch an API for both
Libgen to txt now supports marker for pdf -> markdown.
Turn libgen rs nonfiction into 3TB of high quality markdown. AI labs are using this data to train LLMs - now you can, too.
Full instructions and usage are here - .
I built a dataset of every package on pypi. The quality of code is high, and I'm finding it great for finetuning and pretraining - .
I cleaned extra leading comments, and rendered notebooks, so this data should be ready to use.
I'm excited to release a 400m token synthetic programming textbook dataset - .
This is a mix of GPT-3.5 (great quality), and finetuned llama (good quality).
It was generated with the textbook quality repo - .
A timeline of @DataCamp, 2017-2020:
- CEO sexually harassed an employee
- The company covered it up
- After years of community pressure, the CEO stepped down
- They just BROUGHT THE CEO BACK 🤦🏾♀️
This is a repeated and ongoing failure of leadership and ethics.
Expectation: Data science is all about ML and deep learning.
Reality: It's 80% storytelling and data acquisition + cleaning. And these parts are actually quite interesting (I promise!)
I'm amazed by the quality of RAG-augmented books from finetuned mistral. The writing is higher quality than 34b codellama, but it does make subtle mistakes (see math below).
Mistral -
Codellama -
I've improved my synthetic textbook generator in collaboration with @ocolegro - . The books are now longer and a lot more detailed!
Here's a preview - . (the programming books were generated with this technique)
@Yampeleg Thank you! I have a finetuned model that can generate similar quality to GPT-3.5. Just need compute credits to scale to 1B+ tokens 🙏🏾.
LLM credits (OpenAI or other) are also nice!
Dataset is here, btw -
Excited to ship classified - a quality rater for LLM pretraining and instruct data - .
It can stream datasets from HF hub, or from disk.
It uses GPT-4/3.5 for now, but custom classifier training and dataset filtering are coming soon.
I have a very early commercial usage preview of marker on the dev branch.
This removes layoutlm and pymupdf, and swaps in new models I trained.
I'd love some help testing it. You can find it here - .
Surya was trained on a diverse set of documents, including scientific papers. It works with every language that I've tried.
It should work with good quality scanned documents as well due to image augmentation.
If you're learning data science, it can be exciting to jump straight to machine learning. But data cleaning, data visualization, and SQL will take up most of your time in entry-level roles. Don't neglect those skills.
@sterlingcrispin @peterthiel Too many people are fine-tuning generalist models, and too few are building pipelines of models for specific tasks. I think niche data + pipelines will beat generalist models.
Text detection is step 1 in building a GPU-accelerated OCR model that is more accurate than tesseract. Step 2 is to build the text recognition system - I'll be working on that in the next couple of weeks.
Ok - looks like I'll be releasing this one standalone. Note that this is just text detection (drawing bboxes around the text). I'll be working on text recognition (turning the bboxes into text) next week.
At @dataquestio, we aren't flashy. We don't raise $$ from investors. What we do instead is build the best way to learn data science.
Students who finish >10 courses see an avg $16.6k salary boost, and we've created $103.9M in total salary gains. And all it costs is $49 a month.
I'm a self-taught data scientist. When I looked for jobs, I got rejected many times for not having credentials.
It was crushing. But I realized that the rejections only mattered if they stopped me from trying. Don't let them stop you.
When I first got into data science, I had impostor syndrome, and I dealt with insecurity by not engaging with people, or acting like I knew everything. This was a mistake. The best way through it is to humbly engage with people - I've learned a lot more this way!
Benchmarking was a little tricky, since surya generates line-level bboxes, and tesseract generates word-level ones. Most datasets are also word-level. I decided to benchmark using doclaynet.
I used to work in a UPS hub. I once thought I'd work there my whole career (until my boss told me they wouldn't promote me).
The fact that I've been able to find my own path, and that I'm able to help others do the same with @dataquestio, is something I never take for granted.
@kevinsxu This is a good thing - most architectural changes don't make a big difference (the training data does). This makes Yi compatible with all the existing llama inference tools. They also acknowledged the issue and will rename - .
Last year, I built Endless Academy - - a site for AI-generated personalized courses.
It has potential, and I'd love to see it grow, but I don't have the time. I'm looking for someone who's interested in taking it over.
Surya is built on some amazing open source work, including:
- transformers from @huggingface
- segformer from @nvidia
- CRAFT from the @official_naver team - an amazing paper and team
Thank you to everyone who makes open source AI great.
I'm also planning to work on other PDF-related projects soon, like table/image detection/extraction, and reading order detection.
I will be porting all of these into marker (), my pdf to markdown converter, to improve accuracy.
1/ In this thread, I'll discuss @LambdaSchool, a bootcamp that charges 17% of your pre-tax income for up to 2 years (ISA).
tl;dr Lambda is much more expensive than the average bootcamp, and has similar outcomes. 75% of Lambda students could pay an avg of $9k less elsewhere.
A summary of 90% of management books:
1. Build trust
2. Build culture
3. Share context
4. Create process, but not too much
5. Give honest, caring, feedback
6. Delegate, but don't micromanage
7. Set actionable goals
8. Hold people accountable
9. Be a mentor
10. Solicit feedback
I'm excited to start shipping again tomorrow. Stay tuned for:
- General purpose OCR model
- Open version of layoutlmv3 (or vgt)
- Commercial version of marker
- Better support for non-European languages
Surya has limitations, including:
- It is specialized for document OCR. It will likely not work on photos or other images. It will also not work on handwritten text.
- Performance on scanned documents can be hit or miss.
- It doesn't work well with images that look like ads or
Find it here - .
By combining reading order with OCR and text detection in surya, it's easy to turn entire documents into readable plain text. Even complex ones like newspapers or magazines.
I hope you find this useful! Please join the Discord - - if you'd like to discuss surya.
If you do try surya out, please let me know how it went for you. I've tried it across a range of images, but there are so many edge cases.
Surya uses a modified segformer architecture from @nvidia. I found that by changing some of the shapes in the decoder, I could cut inference RAM usage to 1/4 of the original without a performance degradation.
We announced scholarships for underrepresented groups at @dataquestio. Here's why:
- Data skills unlock economic opportunity + widely distributing them keeps the field ethical
- Some groups have been excluded due to systemic bias
- Scholarships help level the playing field
Surya () already supports line detection, and I'm excited to have it do full end-to-end OCR.
The final model should support ~90 languages (all major languages in use today).
I want to spend the rest of my career working towards a world where only what you can do matters - not the logo on your degree, who you know, or what you look like.
I just uploaded a new model checkpoint for texify, a math OCR tool.
The recognition quality is incredibly good. Left is the selected region of a PDF page, right is detected and rendered Markdown/LaTeX.
One lesson that was hard for me to learn is that the success of those around me doesn't diminish my own.
It actually enhances it by building a stronger network.
Don't hoard knowledge. Help the people around you. Not only is it the right thing to do, it also helps you.
The benchmark is calculated by % coverage of predicted bboxes by references (precision), and vice versa (recall). Anything over a .5 threshold is a hit. There is a small penalty for overlapping multiple reference boxes in precision.
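To make the metric concrete, here's a toy sketch in Python (my own illustration, not the actual benchmark code - note that summing pairwise intersections over-counts overlapping boxes, which is why the real metric applies a small overlap penalty):

```python
# Toy coverage-based precision/recall for bbox benchmarks.
# A predicted box is a "hit" if reference boxes cover >50% of its
# area (precision), and vice versa for recall.

def area(box):
    # box = (x1, y1, x2, y2)
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])

def intersection(a, b):
    # Area of overlap between two boxes (0 if they don't overlap).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return area((x1, y1, x2, y2))

def coverage(box, others):
    # Fraction of `box` covered by `others`. Summing intersections
    # double-counts overlapping boxes, hence the overlap penalty in
    # the real benchmark; capping at 1.0 keeps the toy version sane.
    if area(box) == 0:
        return 0.0
    covered = sum(intersection(box, o) for o in others)
    return min(covered / area(box), 1.0)

def precision_recall(preds, refs, threshold=0.5):
    precision = sum(coverage(p, refs) > threshold for p in preds) / len(preds)
    recall = sum(coverage(r, preds) > threshold for r in refs) / len(refs)
    return precision, recall
```

For example, one correct prediction against two reference boxes gives precision 1.0 and recall 0.5.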
Based on my experiences as a solo technical founder growing @dataquestio to 30+ people, I wrote a guide on quickly improving your management skills - .
This is how I went from having no idea what I was doing to kind of knowing what I'm doing :)
After I start, I'm planning to continue working on OSS data tools/models.
Early ideas are:
- Decode images from any language and doc type into markdown (like nougat, but faster/more general)
- A single chat model that can do OCR, layout analysis, reading order, etc
Time in the benchmark is not apples to apples, since tesseract uses CPU, and I used GPU for surya. I used batch size 32 for surya and 32 cores for tesseract (8 processes * 4 cores each) to try to compensate.
Surya could be sped up further with quantization and compilation to ONNX.
My next project is reading order detection. I will then be porting all of these into marker (), my pdf to markdown converter, to improve accuracy, and allow commercial usage.
I wouldn't have believed this a few years ago, but life has been better in my 30s than in my 20s. I think the primary reason for this is that learning and personal growth pay compound interest.
To benchmark, I sampled multilingual pdfs from common crawl, then filtered out the ones with bad OCR. Some PDFs still have bad text, so the absolute values can be worse than real-world performance. I couldn't find PDFs for 30 languages, so I made synthetic ones.
@_casey_bates @DataCamp Hi Casey, we might be able to help with some of the lost income. If you email me at vik@dataquest.io, I can put you in touch with the right folks.
There's data science / ML work that creates new systems (ex self driving cars), and there's work that improves existing systems (ex telling farmers when to plant). The former can seem more exciting, but imo, the latter is just as impactful.
As a student, it's hard to evaluate if a course will help you reach your goals until after you've done it.
Many edtech companies exploit this info asymmetry by focusing on marketing and engagement over depth.
One heuristic - can you apply what you learned in the real world?
Surya uses donut (swin transformer encoder and mbart decoder), with a lot of modifications. These include an MoE layer to store language-specific information, GQA for faster decoding, and UTF-16 decoding (adjacent bytes can be combined).
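The UTF-16 decoding idea can be illustrated in a few lines (a toy sketch of byte-level UTF-16 tokenization, not surya's actual tokenizer - the function names are mine): each character becomes a fixed number of byte tokens, so a tiny vocabulary of 256 byte values covers all languages, and adjacent bytes are recombined into UTF-16 code units on decode.

```python
# Toy byte-level UTF-16 tokenization: encode text as little-endian
# UTF-16 bytes (two byte tokens per BMP character), then combine
# adjacent bytes back into code units when decoding.

def encode_bytes(text):
    # Little-endian UTF-16, no BOM: each code unit -> two byte tokens.
    return list(text.encode("utf-16-le"))

def decode_bytes(byte_tokens):
    # Combine adjacent byte pairs back into UTF-16 code units.
    return bytes(byte_tokens).decode("utf-16-le")

tokens = encode_bytes("OCR")  # [79, 0, 67, 0, 82, 0]
```

The upside is that the output vocabulary stays small regardless of how many scripts the model supports; non-Latin scripts like Malayalam just use different byte pairs.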