I wrote a blog post on going from knowing nothing about deep learning last year to training state-of-the-art OSS models - .
Hope it helps you.
tl;dr: read the deep learning book, implemented papers + taught, built open source tools
I'm excited to ship marker - a pdf to markdown converter that is 10x faster than nougat, more accurate outside arXiv, and has low hallucination risk. Marker is optimized for throughput-heavy jobs, like converting LLM pretraining data.
Find it here - .
Cool to see a 500M param model I trained myself do better than Google Cloud Vision, Claude, and GPT-4V on this task (see the thread for the results).
It's a relatively narrow one (OCR), but feels nice to see that small open source models still have a place.
It's weird how we live in an age of miracles with respect to AI/ML, and yet when I want to extract some text from a screenshot the best (very bad) option is tesseract, last updated ~7 years ago.
Better data = better AI. That's why I've spent the last 3 months on:
- Marker - fast, accurate PDF to markdown (5k GH ⭐️s)
- Texify - SOTA math to LaTeX OCR
- Libgen to txt - get 3TB of HQ data
- Textbook quality - HQ synth data
Find them at .
I'm training a text line detection model for a document OCR pipeline.
It could also be useful on its own, but I'm not sure. Is anyone interested in a standalone release?
It works for every language I tried - it detects text bboxes and column breaks. ~2 second inference per
I'm tweaking my line detection model to get it ready for a Github release. This was a fun test case - it's not really designed for newspapers, so I was surprised this worked.
I've shipped most of the models + libraries I wanted in the last few months:
- PDF to markdown - marker
- Text line detection, OCR in 93 languages, layout analysis, reading order - surya
- Equation to LaTeX
- PDF text extraction
Find them on Github - .
Announcing surya reading order! It predicts the order that a human would read a document in.
It's useful for RAG, accessibility, and text extraction. It works on a variety of documents, layouts, and languages.
Announcing surya layout! It detects tables, images, figures, section headers, and more. It works with any language, and a variety of document types.
Find it here - .
Thanks @LambdaAPI for sponsoring compute.
I can't get over @ylecun tweeting that surya was nice. Lifetime achievement unlocked.
My next steps are:
- Improving old/scanned doc performance
- Seeing if I can do anything about rotations
Then on to the next recognition part! Here's the repo - .
Announcing texify - an OCR model that turns inline and block equations into markdown/LaTeX. It's more accurate at this than nougat and pix2tex.
Find it here - .
The biggest barrier to GPT-quality open source LLMs is data.
If you want 1TB of quality data, here's my repo that will convert libgen nonfiction to txt format - .
I made pdftext, a small tool that extracts text like pymupdf, but with an Apache license (mupdf is AGPL). It can pull out blocks and lines or plain text.
Find it here - .
Marker v2 is out! The main new features:
- Extracts images/figures
- Better table parsing
- Pip package install
- Can be used commercially
- Improved OCR with more languages
- Better ordering for complex docs
Get it here - .
Surya () has been updated with a new model checkpoint that is far better on scanned/old docs.
It works even with blurry/rotated complex layouts, like this one:
Surya () didn't work well on scanned/rotated docs, so I decided to spend a couple of days on it this week.
I'm making good progress. It's still training, hopefully will have something out tomorrow.
I'm going to release my reading order model next week. I had to change the architecture to perform better with complex layouts.
It seems to be working, though (see the image). There are mistakes, but it's only 20% trained, and still improving.
Textbooks generated with finetuned mistral + search and wikipedia RAG are surprisingly good. They seem close to GPT-3.5.
See samples here - , and here - .
Working on a bigger set now! Please let me know if you can sponsor.
I've generated 70M tokens of extremely high quality synthetic textbooks - , using retrieval and gpt-3.5.
Seriously, the quality is 💯.
I'm generating 1B tokens, but will use llama for $$ reasons. Please DM if you can sponsor compute or credits.
My reading order model is getting close to being release-ready. (it may not be immediately obvious, but this is a hard doc to order properly)
Working on fixing just a few remaining issues.
I released marker last week - .
Within 72 hours, marker got to #1 on HN with 700 votes, and was starred 3.4k times on Github.
I didn't expect this kind of response - thank you so much for the support!
An update on surya text recognition - I'm happy with the data/architecture, and I'm ready to scale up training.
Here are some results from a (very) early checkpoint. Left is original, right is OCR (Malayalam)
I'm building a dataset of high quality synthetic textbooks for pretraining. Here's a 4M token preview - . The quality is incredibly high (it really surprised me).
I've been generating additional textbooks! is up to 115M high quality tokens, and is up to 85M.
I'm seeing promising humaneval results with models trained on this data.
As @jeremyphoward shared yesterday, I'll be joining @answerdotai! I'm excited to work with such a strong team.
Before I start, I'm going to finish some in-progress work:
- Integrate surya with marker
- Commercial version of marker
- Launch an API for both
Libgen to txt now supports marker for pdf -> markdown.
Turn libgen rs nonfiction into 3TB of high quality markdown. AI labs are using this data to train LLMs - now you can, too.
Full instructions and usage are here - .
I built a dataset of every package on pypi. The quality of code is high, and I'm finding it great for finetuning and pretraining - .
I cleaned extra leading comments, and rendered notebooks, so this data should be ready to use.
I'm excited to release a 400m token synthetic programming textbook dataset - .
This is a mix of GPT-3.5 (great quality), and finetuned llama (good quality).
It was generated with the textbook quality repo - .
A timeline of @DataCamp, 2017-2020:
- CEO sexually harassed an employee
- The company covered it up
- After years of community pressure, the CEO stepped down
- They just BROUGHT THE CEO BACK 🤦🏾♀️
This is a repeated and ongoing failure of leadership and ethics.
Expectation: Data science is all about ML and deep learning.
Reality: It's 80% storytelling and data acquisition + cleaning. And these parts are actually quite interesting (I promise!)
I'm amazed by the quality of RAG-augmented books from finetuned mistral. The writing is higher quality than 34b codellama, but it does make subtle mistakes (see math below).
Mistral -
Codellama -
I've improved my synthetic textbook generator in collaboration with @ocolegro - . The books are now longer and a lot more detailed!
Here's a preview - . (the programming books were generated with this technique)
@Yampeleg Thank you! I have a finetuned model that can generate similar quality to GPT-3.5. Just need compute credits to scale to 1B+ tokens 🙏🏾.
LLM credits (OpenAI or other) are also nice!
Dataset is here, btw -
Excited to ship classified - a quality rater for LLM pretraining and instruct data - .
It can stream datasets from HF hub, or from disk.
It uses GPT-4/3.5 for now, but custom classifier training and dataset filtering are coming soon.
I have a very early commercial usage preview of marker on the dev branch.
This removes layoutlm and pymupdf, and swaps in new models I trained.
I'd love some help testing it. You can find it here - .
Surya was trained on a diverse set of documents, including scientific papers. It works with every language that I've tried.
It should work with good quality scanned documents as well due to image augmentation.
If you're learning data science, it can be exciting to jump straight to machine learning. But data cleaning, data visualization, and SQL will take up most of your time in entry-level roles. Don't neglect those skills.
@sterlingcrispin @peterthiel Too many people are fine-tuning generalist models, and too few are building pipelines of models for specific tasks. I think niche data + pipelines will beat generalist models.
Text detection is step 1 in building a GPU-accelerated OCR model that is more accurate than tesseract. Step 2 is to build the text recognition system - I'll be working on that in the next couple of weeks.
Ok - looks like I'll be releasing this one standalone. Note that this is just text detection (drawing bboxes around the text). I'll be working on text recognition (turning the bboxes into text) next week.
At @dataquestio, we aren't flashy. We don't raise $$ from investors. What we do instead is build the best way to learn data science.
Students who finish >10 courses see an avg $16.6k salary boost, and we've created $103.9M in total salary gains. And all it costs is $49 a month.
I'm a self-taught data scientist. When I looked for jobs, I got rejected many times for not having credentials.
It was crushing. But I realized that the rejections only mattered if they stopped me from trying. Don't let them stop you.
When I first got into data science, I had impostor syndrome, and I dealt with insecurity by not engaging with people, or acting like I knew everything. This was a mistake. The best way through it is to humbly engage with people - I've learned a lot more this way!
Benchmarking was a little tricky, since surya generates line-level bboxes, and tesseract generates word-level ones. Most datasets are also word-level. I decided to benchmark using doclaynet.
I used to work in a UPS hub. I once thought I'd work there my whole career (until my boss told me they wouldn't promote me).
The fact that I've been able to find my own path, and that I'm able to help others do the same with @dataquestio, is something I never take for granted.
@kevinsxu This is a good thing - most architectural changes don't make a big difference (the training data does). This makes Yi compatible with all the existing llama inference tools. They also acknowledged the issue and will rename - .
Last year, I built Endless Academy - - a site for AI-generated personalized courses.
It has potential, and I'd love to see it grow, but I don't have the time. I'm looking for someone who's interested in taking it over.
Surya is built on some amazing open source work, including:
- transformers from @huggingface
- segformer from @nvidia
- CRAFT from the @official_naver team - an amazing paper and team
Thank you to everyone who makes open source AI great.
I'm also planning to work on other PDF-related projects soon, like table/image detection/extraction, and reading order detection.
I will be porting all of these into marker (), my pdf to markdown converter, to improve accuracy.
1/ In this thread, I'll discuss @LambdaSchool, a bootcamp that charges 17% of your pre-tax income for up to 2 years (ISA).
tl;dr Lambda is much more expensive than the average bootcamp, and has similar outcomes. 75% of Lambda students could pay an avg of $9k less elsewhere.
A summary of 90% of management books:
1. Build trust
2. Build culture
3. Share context
4. Create process, but not too much
5. Give honest, caring, feedback
6. Delegate, but don't micromanage
7. Set actionable goals
8. Hold people accountable
9. Be a mentor
10. Solicit feedback
I'm excited to start shipping again tomorrow. Stay tuned for:
- General purpose OCR model
- Open version of layoutlmv3 (or vgt)
- Commercial version of marker
- Better support for non-European languages
Surya has limitations, including:
- It is specialized for document OCR. It will likely not work on photos or other images. It will also not work on handwritten text.
- Performance on scanned documents can be hit or miss.
- It doesn't work well with images that look like ads or
Find it here - .
By combining reading order with OCR and text detection in surya, it's easy to turn entire documents into readable plain text. Even complex ones like newspapers or magazines.
I hope you find this useful! Please join the Discord - - if you'd like to discuss surya.
If you do try surya out, please let me know how it went for you. I've tried it across a range of images, but there are so many edge cases.
Surya uses a modified segformer architecture from @nvidia. I found that by changing some of the shapes in the decoder, I could cut inference RAM usage to 1/4 of the original without a performance degradation.
We announced scholarships for underrepresented groups at @dataquestio. Here's why:
- Data skills unlock economic opportunity + widely distributing them keeps the field ethical
- Some groups have been excluded due to systemic bias
- Scholarships help level the playing field
Surya () already supports line detection, and I'm excited to have it do full end-to-end OCR.
The final model should support ~90 languages (all major languages in use today).
I want to spend the rest of my career working towards a world where only what you can do matters - not the logo on your degree, who you know, or what you look like.
I just uploaded a new model checkpoint for texify, a math OCR tool.
The recognition quality is incredibly good. Left is the selected region of a PDF page, right is detected and rendered Markdown/LaTeX.
One lesson that was hard for me to learn is that the success of those around me doesn't diminish my own.
It actually enhances it by building a stronger network.
Don't hoard knowledge. Help the people around you. Not only is it the right thing to do, it also helps you.
The benchmark is calculated by % coverage of predicted bboxes by references (precision), and vice versa (recall). Anything over a .5 threshold is a hit. There is a small penalty for overlapping multiple reference boxes in precision.
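To make the metric concrete, here's a toy sketch in Python (my own illustration, not the actual benchmark code - note that summing pairwise intersections over-counts overlapping boxes, which is why the real metric applies a small overlap penalty):

```python
# Toy coverage-based precision/recall for bbox benchmarks.
# A predicted box is a "hit" if reference boxes cover >50% of its
# area (precision), and vice versa for recall.

def area(box):
    # box = (x1, y1, x2, y2)
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])

def intersection(a, b):
    # Area of overlap between two boxes (0 if they don't overlap).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return area((x1, y1, x2, y2))

def coverage(box, others):
    # Fraction of `box` covered by `others`. Summing intersections
    # double-counts overlapping boxes, hence the overlap penalty in
    # the real benchmark; capping at 1.0 keeps the toy version sane.
    if area(box) == 0:
        return 0.0
    covered = sum(intersection(box, o) for o in others)
    return min(covered / area(box), 1.0)

def precision_recall(preds, refs, threshold=0.5):
    precision = sum(coverage(p, refs) > threshold for p in preds) / len(preds)
    recall = sum(coverage(r, preds) > threshold for r in refs) / len(refs)
    return precision, recall
```

For example, one correct prediction against two reference boxes gives precision 1.0 and recall 0.5.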
Based on my experiences as a solo technical founder growing @dataquestio to 30+ people, I wrote a guide on quickly improving your management skills - .
This is how I went from having no idea what I was doing to kind of knowing what I'm doing :)
After I start, I'm planning to continue working on OSS data tools/models.
Early ideas are:
- Decode images from any language and doc type into markdown (like nougat, but faster/more general)
- A single chat model that can do OCR, layout analysis, reading order, etc
Time in the benchmark is not apples to apples, since tesseract uses CPU, and I used GPU for surya. I used batch size 32 for surya and 32 cores for tesseract (8 processes * 4 cores each) to try to compensate.
Surya could be sped up further with quantization and compilation to ONNX.
My next project is reading order detection. I will then be porting all of these into marker (), my pdf to markdown converter, to improve accuracy, and allow commercial usage.
I wouldn't have believed this a few years ago, but life has been better in my 30s than in my 20s. I think the primary reason for this is that learning and personal growth pay compound interest.
To benchmark, I sampled multilingual pdfs from common crawl, then filtered out the ones with bad OCR. Some PDFs still have bad text, so the absolute values can be worse than real-world performance. I couldn't find PDFs for 30 languages, so I made synthetic ones.
@_casey_bates @DataCamp Hi Casey, we might be able to help with some of the lost income. If you email me at vik@dataquest.io, I can put you in touch with the right folks.
There's data science / ML work that creates new systems (ex self driving cars), and there's work that improves existing systems (ex telling farmers when to plant). The former can seem more exciting, but imo, the latter is just as impactful.
As a student, it's hard to evaluate if a course will help you reach your goals until after you've done it.
Many edtech companies exploit this info asymmetry by focusing on marketing and engagement over depth.
One heuristic - can you apply what you learned in the real world?
Surya uses donut (swin transformer encoder and mbart decoder), with a lot of modifications. These include an MoE layer to store language-specific information, GQA for faster decoding, and UTF-16 decoding (adjacent bytes can be combined).
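The UTF-16 decoding idea can be illustrated in a few lines (a toy sketch of byte-level UTF-16 tokenization, not surya's actual tokenizer - the function names are mine): each character becomes a fixed number of byte tokens, so a tiny vocabulary of 256 byte values covers all languages, and adjacent bytes are recombined into UTF-16 code units on decode.

```python
# Toy byte-level UTF-16 tokenization: encode text as little-endian
# UTF-16 bytes (two byte tokens per BMP character), then combine
# adjacent bytes back into code units when decoding.

def encode_bytes(text):
    # Little-endian UTF-16, no BOM: each code unit -> two byte tokens.
    return list(text.encode("utf-16-le"))

def decode_bytes(byte_tokens):
    # Combine adjacent byte pairs back into UTF-16 code units.
    return bytes(byte_tokens).decode("utf-16-le")

tokens = encode_bytes("OCR")  # [79, 0, 67, 0, 82, 0]
```

The upside is that the output vocabulary stays small regardless of how many scripts the model supports; non-Latin scripts like Malayalam just use different byte pairs.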