Melissa Dell Profile
Melissa Dell

@MelissaLDell

12,188 Followers · 11 Following · 32 Media · 107 Statuses

Economics Professor @Harvard. Development economics, political economy, economic history, deep learning methods for data curation.

Cambridge, MA
Joined March 2021
@MelissaLDell
Melissa Dell
3 years
(1/n) Social science research often relies on scans of documents such as statistical tables, newspapers, firm level reports, etc. #EconTwitter
41
905
3K
@MelissaLDell
Melissa Dell
9 months
I’m excited to share American Stories, a new billion-scale dataset of structured texts/layouts from public domain newspapers (1780-1960) that we’ve built using our deep learning packages. #EconTwitter (1/13) Paper: Dataset:
15
435
2K
@MelissaLDell
Melissa Dell
3 years
(3/n) We are releasing an open-source deep-learning powered library, Layout Parser, that provides a variety of tools for automatically processing document image data at scale. Webpage: Arxiv: Github:
12
299
1K
@MelissaLDell
Melissa Dell
5 months
I’m excited to share News Déjà Vu (), which uses a custom large language model to retrieve historical news articles that are the most similar to modern news articles. (1/4)
Tweet media one
15
145
774
@MelissaLDell
Melissa Dell
3 years
(1/2) Knowledge base on deep learning methods for data curation is up: Covers methods from computer vision and NLP. I found it overwhelming at first to tackle the vast DL lit, hope links to resources for getting started will be of potential use to others
7
129
650
@MelissaLDell
Melissa Dell
3 years
The Harvard Economics department has an opening for a tenured position in development economics: This is a senior search, specific to development economics, that requires application through JOE. Please spread the word! #EconTwitter
2
111
386
@MelissaLDell
Melissa Dell
2 years
I'm hiring summer undergrad RAs; build deep learning pipelines for econ dev/pol econ (no DL experience required). $15/hr; can be remote; US work auth. required; undergrads only. Send abigailpowers@fas.harvard.edu CV/transcript to apply. Specify FT/PT interest. #EconTwitter
7
119
297
@MelissaLDell
Melissa Dell
9 months
Introducing LinkTransformer: LT brings the advantages of AI to standard data frame manipulation tasks like merges, deduplication, and clustering, making it easy to use large language models in a standard data wrangling workflow. #EconTwitter (1/10)
Tweet media one
4
57
285
@MelissaLDell
Melissa Dell
3 years
(18/n) If Layout-Parser seems relevant to your work, please consider taking less than a minute to visit our website: . If you are on Github, take two seconds to star our repo: . This will help us demonstrate crucial community support.
4
44
272
@MelissaLDell
Melissa Dell
3 years
(15/n) No background in deep learning? I’m teaching a new course this semester on deep learning for data curation at scale. I’ll be putting the course material into a public knowledgebase. I’ll post here when this is released (sometime in the next 1-2 months).
3
16
262
@MelissaLDell
Melissa Dell
3 years
Thanks to @pquerubo, @qlquanle, @krishna_econ for convincing me to join. Looking forward to sharing more about our research and open-source projects!
12
23
247
@MelissaLDell
Melissa Dell
3 years
(4/n) Contrast the off-the-shelf OCR with the layout detection results we achieve through Layout Parser’s deep learning powered pipelines.
Tweet media one
2
24
239
@MelissaLDell
Melissa Dell
3 years
I currently have two open predoc positions for next academic year: and Great opportunity for gaining hands on experience applying both deep learning and econometric methods to novel research
6
76
235
@MelissaLDell
Melissa Dell
1 year
Harvard is hosting the NEUDC econ development conference this fall; we particularly encourage PhD students and recent graduates to submit a paper. Huge thanks to CID's fantastic staff and dozens of reviewers who are making the conference possible. Please spread the word!
@HarvardCID
Harvard CID
1 year
📣Spread the word - We are hosting the 2023 convening of the North East Universities Development Consortium (NEUDC) on Nov. 4–5, 2023! 🖇 Accepting papers June 17 - August 17, 2023 #NEUDC
Tweet media one
1
57
83
3
88
218
@MelissaLDell
Melissa Dell
3 years
(2/n) Unfortunately, OCR often fails to detect layouts in such documents. These figures show off-the-shelf OCRed bounding boxes. Much of the text is not detected, and some is detected twice or scrambled. The OCR cannot distinguish different text types, i.e., headlines vs. captions vs. articles.
Tweet media one
Tweet media two
2
32
196
@MelissaLDell
Melissa Dell
1 year
We have a new string matching package – supporting Simplified and Traditional Chinese, Japanese, and Korean. HomoglyphsCJK available here: . Paper here: . With Xinmei Yang, Abhishek Arora, and Shao-Yu Jheng (1/8)
2
58
182
@MelissaLDell
Melissa Dell
3 years
(6/n) Layout Parser is not just for English. Here’s another example, a complex historical table from Japan
Tweet media one
3
28
177
@MelissaLDell
Melissa Dell
3 years
(5/n) We are currently using Layout Parser to process millions of such documents
Tweet media one
4
20
178
@MelissaLDell
Melissa Dell
24 days
I have had a pre-doc opportunity open up: . For those who may have applied to a past position prior to March 1 and are interested, please resubmit your materials. Position combines social science questions with big data and deep learning.
3
68
146
@MelissaLDell
Melissa Dell
5 months
I'm accepting applications for a pre-doctoral fellow position on a rolling basis: The position involves working at the intersection of deep learning and economics with a fantastic group of collaborators! #econtwitter
0
49
140
@MelissaLDell
Melissa Dell
3 years
(19/n) Layout Parser contributors: @_shannon_shen, @ruochenxD, @MelissaLDell, @lee_bcg, @J_S_Carlson, Weining Li. Currently working with @qlquanle, @pquerubo, @LeanderHeldring, @krishna_econ, Sahar Parsa, and awesome RAs on additional models that will be added when complete.
16
7
136
@MelissaLDell
Melissa Dell
9 months
If you think American Stories () or Headlines () may be useful for you, please like or download. It is challenging to fund dataset/open-source projects, and we need to show that people find our work useful so we can do more! (12/13)
2
11
128
@MelissaLDell
Melissa Dell
3 years
What's next? We've been working lately on custom OCR pipelines (post layout detection), as off-the-shelf products often fail at accurate character/number detection with historical documents. We hope to have some helpful insights to share later this fall... #EconTwitter
2
5
125
@MelissaLDell
Melissa Dell
3 years
Harvard Academy Scholars program is accepting applications. Deadline Oct 1. This is a great post-doc for economists/other social scientists, providing an opportunity to be very integrated within the economics and broader scholarly community in Cambridge
0
46
118
@MelissaLDell
Melissa Dell
3 years
(17/n) Building this takes a ton of work and financial resources. We’ve been invited to the final round of a large grant competition that would significantly expand Layout Parser, but we need to show there is demand for this from the social science community.
1
9
105
@MelissaLDell
Melissa Dell
3 years
(7/n) These are the Layout Parser functionalities
Tweet media one
1
10
106
@MelissaLDell
Melissa Dell
11 months
@EmilySilcock1 and I have recently released HEADLINES, a massive-scale dataset containing nearly 400 million positive semantic similarity pairs, drawn from historical U.S. newspapers. Dataset: Paper: (1/3)
Tweet media one
1
21
100
@MelissaLDell
Melissa Dell
3 years
(10/n) Don’t have labeled data? Layout Parser incorporates a data annotation toolkit that makes it more efficient to create labeled data.
Tweet media one
2
13
92
@MelissaLDell
Melissa Dell
9 months
I remember being told by a colleague: “Economic history isn’t and can't be science because there are no data points." So much great work over the past decade proving the contrary! I’m pretty pumped to have a billion+ observations in a historical dataset description table (11/13)
Tweet media one
2
7
89
@MelissaLDell
Melissa Dell
2 years
I'm hiring RAs for projects about Japanese development and political economy. Remote ok, pt or summer ft opps, US work authoriz. req. Pre-doc (in person, w visa sponsorship) an option as well. Japanese fluency, python or R experience req. #EconTwitter
3
39
84
@MelissaLDell
Melissa Dell
3 years
(9/n) With Layout Parser, you can train your own customized DL-based layout models. Because our pre-trained model zoo is currently small, right now Layout Parser is mostly useful for designing your own customized models
1
4
84
@MelissaLDell
Melissa Dell
2 years
Exciting news - JPE Micro and JPE Macro now live - A huge shout out to all the work John List and Greg Kaplan have done to make this a reality!
1
11
75
@MelissaLDell
Melissa Dell
3 years
(8/n) Layout Parser currently has some pre-trained models, and the pipelines for the above examples will be integrated when finalized. We are working to expand the types of documents it can process off-the-shelf
1
4
75
@MelissaLDell
Melissa Dell
3 years
(2/2) Also includes links to slides and videos from my course on the topic. Obviously, student interactions are edited out so videos are just me talking to zoom. But I promise some of the linked resources are more interesting!
0
3
73
@MelissaLDell
Melissa Dell
3 years
(16/n) We hope to make substantial innovations. With more resources we can expand the pre-trained model zoo significantly. Ultimately, we hope to convert the library into a user-friendly online platform that can be used by anyone, regardless of Python literacy or hardware.
2
3
72
@MelissaLDell
Melissa Dell
5 months
NYT sues OpenAI for copyright infringement, showing that GPT exactly reproduces articles. This is often driven by duplicates in training data. Our ICLR paper develops robust duplicate detection, finding far more duplicates in news than the method used for GPT-3
1
14
72
@MelissaLDell
Melissa Dell
3 years
(14/n) Layout Parser is implemented with simple APIs and can perform off-the-shelf layout analysis with four lines of Python code
1
9
72
@MelissaLDell
Melissa Dell
3 years
(13/n) Layout Parser provides a flexible output structure to facilitate diverse downstream analyses.
Tweet media one
1
7
64
@MelissaLDell
Melissa Dell
3 years
(1/n) I'm organizing a couple of sessions at the North American Winter Meeting of the Econometric Society (Jan 6-9, 2022; held concurrently with ASSA) Would love to see your submissions! Due April 21.
1
10
60
@MelissaLDell
Melissa Dell
9 months
The pipeline is highly efficient to deploy and has been open-sourced (). We’ve also created open-source packages – LayoutParser and EfficientOCR – to help researchers develop similar pipelines for their own document collections. (4/13)
1
6
57
@MelissaLDell
Melissa Dell
3 years
(12/n) Layout Parser builds wrappers to call OCR engines and comes with a DL-based CNN-RNN
1
3
55
@MelissaLDell
Melissa Dell
9 months
We detect 1.14 billion individual content regions in around 20M newspaper scans from Library of Congress’s Chronicling America collection. Headlines, articles, bylines, and captions are custom-OCRed. The dataset contains 438 million structured article texts. (2/13)
Tweet media one
1
7
51
@MelissaLDell
Melissa Dell
9 months
Our team: @pquerubo, @J_S_Carlson, Tom Bryan, @EmilySilcock1, @96abhishekarora, Luca D’Amico Wong, @shannonzshen, @qlquanle, @LeanderHeldring, and fantastic undergrad RAs. Funding: Harvard Data Science Initiative, Catalyst, and Griffin Fund and MS Azure (13/13)
2
1
46
@MelissaLDell
Melissa Dell
1 year
On a different note, the knowledge base from my redesigned PhD course on Deep Learning Methods for Processing Unstructured Data in Economics is now live: . Covering language models, computer vision, and more. (8/8)
0
7
45
@MelissaLDell
Melissa Dell
9 months
Structured article texts also support analyses that are impossible with existing page level texts. We detect the biggest stories of the year, using a custom trained large language model to embed texts and then applying clustering to group articles into coherent stories (6/13)
Tweet media one
1
5
42
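The embed-then-cluster step described in (6/13) can be illustrated with a minimal sketch. The thread does not say which clustering algorithm is used, so single-linkage with a cosine threshold is an assumption here, and the two-dimensional vectors are made-up stand-ins for language model embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_stories(vectors, threshold=0.9):
    """Single-linkage clustering (an assumed choice): link any two articles
    whose cosine similarity is at least `threshold`, then return the
    connected groups, largest first."""
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent pointers to the group's root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values(), key=len, reverse=True)

# Hypothetical embeddings: articles 0, 1, and 3 cover the same story.
toy_vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.98, 0.15]]
stories = cluster_stories(toy_vectors)
```

Ranking the groups by size gives a rough proxy for the biggest stories in a given year.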
@MelissaLDell
Melissa Dell
9 months
It covers all 50 states, with content concentrated pre-1920. (3/13)
Tweet media one
1
3
35
@MelissaLDell
Melissa Dell
5 months
We first mask out all named entities (e.g. people, locations, organizations). The language model, trained to capture semantic similarity, then maps each news article to a vector. For a given modern news article, we choose the closest historical article in this vector space. (2/4)
Tweet media one
1
1
35
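The three steps in (2/4), masking entities, mapping each article to a vector, and picking the closest historical article, can be sketched as follows. This is a toy illustration only: the entity list is hypothetical (a real pipeline would use a named-entity recognizer), and a bag-of-words count vector stands in for the custom language model.

```python
import math
import re
from collections import Counter

# Hypothetical entity list; a stand-in for a named-entity recognizer.
ENTITIES = ["Clark Kerr", "Claudine Gay", "Berkeley", "Harvard"]

def mask_entities(text, entities=ENTITIES, token="[MASK]"):
    """Replace each listed entity with a placeholder, longest names first."""
    for name in sorted(entities, key=len, reverse=True):
        text = re.sub(re.escape(name), token, text)
    return text

def embed(text):
    """Toy stand-in for the language model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0

def closest_historical(modern, historical):
    """Return (index, similarity) of the historical article nearest to `modern`."""
    query = embed(mask_entities(modern))
    scores = [cosine(query, embed(mask_entities(h))) for h in historical]
    best = max(range(len(historical)), key=scores.__getitem__)
    return best, scores[best]

# Made-up example texts.
historical = [
    "Clark Kerr resigns as university president after campus protests",
    "Wheat prices rise after a dry summer in Kansas",
]
modern = "Claudine Gay resigns as Harvard president after weeks of criticism"
best_index, best_score = closest_historical(modern, historical)
```

Masking first means the match is driven by how events are described rather than by shared names.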
@MelissaLDell
Melissa Dell
3 years
v0.2 of Layout Parser is out! Amazing work by @_shannon_shen_ incorporating lots of useful updates
@layoutparser
layoutparser
3 years
(1/n) Layout Parser v0.2 is out! New models, better API support, and much more! ✨Highlights✨ - Add support for loading and saving with JSON and CSV. - New shape operations between blocks (union and intersection) are available. - Table detection models are up for grabs!
1
13
36
1
5
33
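For readers unfamiliar with the union and intersection operations mentioned in the v0.2 highlights, here is a rough sketch of what they mean for rectangular layout blocks. It uses plain (x1, y1, x2, y2) tuples and hypothetical helper functions; it is not the layoutparser API.

```python
def union(a, b):
    """Smallest rectangle that contains both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def intersection(a, b):
    """Overlapping rectangle of two boxes, or None if they do not overlap."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None

# Made-up coordinates for a headline block and an article block.
headline = (10, 10, 200, 40)
article = (10, 30, 120, 300)
merged = union(headline, article)
overlap = intersection(headline, article)
```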
@MelissaLDell
Melissa Dell
5 months
Thanks @pquerubo for suggesting we use AI to query historical articles most similar to 2024 predictions. The model pulled celebrity psychic Jeane Dixon (1969) on Vietnam and a 1931 article on the folly of gloomy prophecies.
Tweet media one
0
9
33
@MelissaLDell
Melissa Dell
9 months
Interested in a later period? See our massive scale headlines dataset (1920s-80s) - and paper - , consisting of locally written headlines from news wire articles (10/13)
Tweet media one
3
4
32
@MelissaLDell
Melissa Dell
9 months
This is important because there are lots of illegible scans, with illegibility varying across space and time. Illegibility could bias analyses if researchers include illegible content in the denominator when measuring the presence of different terms or textual features. (8/13)
Tweet media one
2
3
28
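A small worked example of the denominator issue raised in (8/13), using made-up regions. Illegible regions can never match a search term, so leaving them in the denominator pushes the measured share downward; the field names below are hypothetical.

```python
def term_share(regions, term, include_illegible):
    """Share of text regions that mention `term`.

    Illegible regions never count as hits, so including them in the
    denominator lowers the share.
    """
    pool = regions if include_illegible else [r for r in regions if r["legible"]]
    hits = sum(1 for r in pool if r["legible"] and term in r["text"].lower())
    return hits / len(pool) if pool else 0.0

# Made-up regions: two legible, two illegible.
regions = [
    {"text": "Tariff bill passes the Senate", "legible": True},
    {"text": "Local fair opens Saturday", "legible": True},
    {"text": "T#r!ff b1ll d?b@te", "legible": False},
    {"text": "~~~ ::: ...", "legible": False},
]
biased = term_share(regions, "tariff", include_illegible=True)      # 1 hit / 4 regions
corrected = term_share(regions, "tariff", include_illegible=False)  # 1 hit / 2 regions
```

If illegibility varies across places or years, the size of this gap varies with it, which is how the bias enters comparisons.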
@MelissaLDell
Melissa Dell
9 months
The existing Chronicling America OCR from LoC doesn’t recognize layouts, scrambling articles, headlines, ads, etc. American Stories significantly improves accuracy on text classification (allowing it at the article level) and on detecting reproduced content (5/13)
Tweet media one
1
2
28
@MelissaLDell
Melissa Dell
5 months
Huge shout out to our team: @96abhishekarora, Brevin Franklin, Andrew Lu, and @EmilySilcock1. We will be doing weekly drops. Please let us know if there are particular modern stories you'd like to see included. (4/4)
0
1
27
@MelissaLDell
Melissa Dell
5 months
Disclaimer: the language model captures similarities in the semantics of how things are described, which may or may not reflect similarities in the underlying events or situations being described. News articles may have biases or inaccuracies. (3/4)
1
1
27
@MelissaLDell
Melissa Dell
3 years
More on LayoutParser updates:
@layoutparser
layoutparser
3 years
1. Multi Deep Learning Backend Support: DL models are constantly updating, and how can we enable easy access to the latest advances? In v0.3, #layoutparser starts to support different DL model backends and layout models.
1
0
3
0
1
25
@MelissaLDell
Melissa Dell
9 months
Another distinguishing feature of American Stories: after detecting text regions (articles, headlines, captions, bylines), we classify whether they are legible (7/13)
Tweet media one
1
2
24
@MelissaLDell
Melissa Dell
5 months
What does AI pull as most similar to the Claudine Gay resignation coverage? UC President Clark Kerr and Berkeley protests involving “a four-letter obscenity.” More articles here.
Tweet media one
4
2
24
@MelissaLDell
Melissa Dell
3 years
(2/n) Deep learning based methods for data curation in economics: This session will examine how deep learning‐based methods have been integrated into economic research, with a particular focus on how DL/ML can be used to convert novel sources of information into computable data
1
4
23
@MelissaLDell
Melissa Dell
9 months
Our texts are high quality. This figure compares the non-word rate in our custom OCR to the Library of Congress OCR. Differences are due to a combination of our high-quality custom-OCR and filtering of illegible content and ads before OCRing. (9/13)
Tweet media one
1
2
21
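The thread does not define the non-word rate precisely. One plausible reading, assumed here, is the share of alphabetic tokens that do not appear in a reference vocabulary. The tiny vocabulary and the sample strings below are made up.

```python
import re

def non_word_rate(text, vocabulary):
    """Fraction of alphabetic tokens not found in `vocabulary` (assumed definition)."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    misses = sum(1 for t in tokens if t.lower() not in vocabulary)
    return misses / len(tokens)

# Hypothetical vocabulary and OCR outputs for the same line of text.
vocabulary = {"the", "senate", "passed", "bill", "today"}
clean_rate = non_word_rate("The Senate passed the bill today", vocabulary)
noisy_rate = non_word_rate("Tbe Senate passcd the bi11 today", vocabulary)
```

In the noisy string, "Tbe", "passcd", and the fragment "bi" (from "bi11") miss the vocabulary, so half of the six tokens are non-words.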
@MelissaLDell
Melissa Dell
3 years
We know that you all will find lots of ways in which Layout Parser can be improved. If so, please consider contributing to the library and making these improvements publicly available!
@layoutparser
layoutparser
3 years
(6/6) And please consider joining us and contributing to the library - any help would make a huge difference!
2
1
5
0
3
21
@MelissaLDell
Melissa Dell
2 years
RE pay rates for undergrad RAs, this is set by the National Science Foundation (Research Experience for Undergrads Program). I agree these programs would ideally have more generous stipends, compensated by lower rates of overhead #EconTwitter
1
2
18
@MelissaLDell
Melissa Dell
3 years
(3/n) Economic growth and structural transformation: This session welcomes papers that examine the determinants of economic growth, as well as those that seek to understand the causes and consequences of structural transformation
0
4
19
@MelissaLDell
Melissa Dell
9 months
If you find LT useful, please cite it and consider starring our repo . We funded LT out of the PI’s very limited unrestricted funds, and to maintain/expand we need to show potential funders that it is having a positive impact on the community! (9/10)
2
3
16
@MelissaLDell
Melissa Dell
9 months
Merge with transformer language models like you would in Pandas. The API is designed to be as simple as possible and very familiar to practitioners coming from other environments like R and Stata. Demo notebook: Paper: (2/10)
Tweet media one
1
2
16
@MelissaLDell
Melissa Dell
9 months
A few years ago, I started working on deep learning methods to liberate data at scale, with Layout Parser , and now EffOCR (more coming soon!) and LinkTransformer. We have more exciting DL-based packages for social science research in the pipeline. (8/10)
1
3
14
@MelissaLDell
Melissa Dell
11 months
Many interesting questions require a language model trained on semantic similarity data. Existing training data are largely from web texts; we are excited to release a massive-scale historical texts dataset. Headlines are also fascinating from a social science perspective (3/3)
1
1
14
@MelissaLDell
Melissa Dell
3 years
Tomorrow is the deadline for submitting to the Econometric Society's North American Winter Meetings (part of the 2022 ASSA Meetings): I'd love to see your submissions to sessions on growth/structural change and methods for curating data
0
1
14
@MelissaLDell
Melissa Dell
9 months
Huge shout out to @96abhishekarora, a pre-doc in our lab, for his phenomenal work on LinkTransformer, and our summer RA Sam Jones; we welcome other open-source contributors! (10/10)
0
1
12
@MelissaLDell
Melissa Dell
11 months
Around half of content in historical local newspapers came from newswires (e.g., the AP) but local papers wrote their own headlines. Headlines corresponding to the same wire article capture semantic similarity. (2/3)
2
0
10
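The pairing idea in (2/3) can be sketched in a few lines: every two distinct headlines written for the same wire article form one positive pair. The article IDs and headlines are invented for illustration, and the dataset's actual construction involves more steps than shown here.

```python
from itertools import combinations

def positive_pairs(headlines_by_article):
    """Pair up distinct headlines that were written for the same wire article."""
    pairs = []
    for headlines in headlines_by_article.values():
        unique = sorted(set(headlines))  # drop exact duplicate headlines
        pairs.extend(combinations(unique, 2))
    return pairs

# Hypothetical wire articles and the local headlines written for them.
headlines_by_article = {
    "wire_001": [
        "Senate Passes Tariff Bill",
        "Tariff Measure Clears Senate",
        "Upper House Approves New Tariffs",
    ],
    "wire_002": ["Storm Hits Coast"],
}
pairs = positive_pairs(headlines_by_article)
```

An article with n distinct headlines yields n(n-1)/2 pairs, which is why the pair count grows much faster than the article count.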
@MelissaLDell
Melissa Dell
1 year
Also multimodal methods for record linkage (), which avoid the OCR information bottleneck and leverage the power of a large language model. (7/8)
Tweet media one
1
0
9
@MelissaLDell
Melissa Dell
5 months
Also, for hundreds of millions of off-copyright historical news articles, see
1
2
9
@MelissaLDell
Melissa Dell
9 months
LinkTransformer supports all models on the Hugging Face Hub and OpenAI Embedding models. We’ve also trained our own collection of over 20 open-source language models for different languages and tasks. A guide to selecting models is here: (3/10)
1
1
9
@MelissaLDell
Melissa Dell
9 months
Training your own models is as easy as one line of code, with most of the heavy lifting done behind the scenes. You can fine-tune any pretrained model from Hugging Face. Learn more at our repo and demo notebook (5/10)
Tweet media one
1
2
9
@MelissaLDell
Melissa Dell
1 year
This is part of an agenda developing methods that make curating data in lower-resource settings easier, e.g. EfficientOCR () – which makes customized OCR easier, cheaper, and more extensible (with Jake Carlson and Tom Bryan) (6/7)
Tweet media one
1
1
9
@MelissaLDell
Melissa Dell
9 months
LT aims to create a community for deep record linkage, streamlining the distribution of record linkage models and promoting the reusability and reproducibility of pipelines. Users can tag and share their models on the Hugging Face hub with a single line of code. (6/10)
Tweet media one
1
1
9
@MelissaLDell
Melissa Dell
9 months
This is our initial release, and we welcome feedback via Github. Planned features for the next release include integrating vision transformer models for visual record linkage (forgo OCR altogether!) and FAISS GPU support. (7/10)
1
1
8
@MelissaLDell
Melissa Dell
9 months
LT supports a wide range of data wrangling tasks with transformers: standard merging, merge with blocking or multiple keys, cross-lingual merges (no need to translate), 1-m and m-m merges, aggregation/classification, clustering and de-duplication. (4/10)
1
2
8
@MelissaLDell
Melissa Dell
5 months
@banreportcards OCR errors from digitizing the historical content. There's more about this pipeline in our recent NeurIPS paper:
0
0
7
@MelissaLDell
Melissa Dell
1 year
The basic idea is to use vision transformers – a deep neural network – to quantify character similarity. The model is contrastively trained to learn a metric space where different augmentations of the same character (e.g., from different fonts) are represented nearby (2/8)
Tweet media one
1
1
6
@MelissaLDell
Melissa Dell
5 months
Historical note: Kerr withdrew his resignation a few days later and was reinstated, but was ultimately fired in 1967. Fascinating to learn about
0
0
5
@MelissaLDell
Melissa Dell
5 months
Our NEWS-COPY dataset, with 122,876 noisily duplicated news article pairs, is publicly available, as is our trained duplicate detection model @EmilySilcock1
1
1
4
@MelissaLDell
Melissa Dell
1 year
The Homoglyphs model can be used to measure character similarity, which we show can significantly improve string matching accuracy for OCR’ed databases. Example homoglyphs: (3/8)
Tweet media one
1
0
4
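To show how character similarity can feed into string matching, here is a minimal sketch of an edit distance in which substituting visually similar characters costs less than substituting unrelated ones. The similar pairs are hand-specified Latin look-alikes and the cost of 0.2 is arbitrary; both are assumptions for illustration, whereas the thread describes learning similarity with a vision transformer for CJK scripts.

```python
def homoglyph_distance(a, b, similar_pairs, homoglyph_cost=0.2):
    """Levenshtein distance with cheaper substitutions for look-alike characters."""

    def sub_cost(x, y):
        if x == y:
            return 0.0
        if frozenset((x, y)) in similar_pairs:
            return homoglyph_cost  # likely an OCR confusion
        return 1.0

    # Standard dynamic program, keeping only the previous row.
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [float(i)]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1.0,                # delete ca
                curr[j - 1] + 1.0,            # insert cb
                prev[j - 1] + sub_cost(ca, cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]

# Hypothetical look-alike pairs: l/1 and O/0.
similar_pairs = {frozenset("l1"), frozenset("O0")}
ocr_like = homoglyph_distance("bill", "bi11", similar_pairs)
unrelated = homoglyph_distance("bill", "bixx", similar_pairs)
```

With these costs, an OCR-style confusion such as "bi11" stays close to "bill", while a string with two unrelated substitutions is a full two edits away.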
@MelissaLDell
Melissa Dell
5 months
@ProfLHunter Yes, we are hoping to get the funding to do this soon!
0
0
3
@MelissaLDell
Melissa Dell
1 year
When homoglyphic matching fails, it is often because OCR has destroyed too much information for string matching to be a realistic solution (4/8)
Tweet media one
1
0
3
@MelissaLDell
Melissa Dell
1 year
Because the method for quantifying character similarity is purely self-supervised, it can be cheaply extended to any character set. Homoglyphs for ancient Chinese characters from over 3,000 years ago capture related abstract concepts noted in the archaeological literature (5/8)
Tweet media one
1
0
3
@MelissaLDell
Melissa Dell
9 months
@wite_wall Thanks for the question. No. Only 0.2% of articles in a labeled random sample spanned multiple pages (it was less common in this period than later), so wouldn't have justified the significant extra resources required for this linking.
1
0
2
@MelissaLDell
Melissa Dell
3 years
@dandekadt Amazon Textract? This is pretty different... We have some pre-trained models but yeah, prob won't work for your specific applications. But you can build a customized pipeline that prob will (it's a lot more work than off-the-shelf, LP aims to make it easier)
0
0
1
@MelissaLDell
Melissa Dell
3 years
@J_S_Carlson Have you seen this?
@gneubig
Graham Neubig
3 years
Wow, this is great 😁 I joked a while ago when we were working on representing text visually () that this would be a really robust way to translate noisy text like that on Twitter, but apparently it actually is!
3
16
104
1
0
1