Melissa Dell @MelissaLDell Twitter profile

Melissa Dell

@MelissaLDell

3 years

(1/n) Social science research often relies on scans of documents such as statistical tables, newspapers, firm level reports, etc. #EconTwitter

41

905

3K

Melissa Dell

@MelissaLDell

9 months

I’m excited to share American Stories, a new billion-scale dataset of structured texts/layouts from public domain newspapers (1780-1960) that we’ve built using our deep learning packages. #EconTwitter (1/13) Paper: Dataset:

dell-research-harvard/AmericanStories · Datasets at Hugging Face

huggingface.co

15

435

2K

Melissa Dell

@MelissaLDell

3 years

(3/n) We are releasing an open-source deep-learning powered library, Layout Parser, that provides a variety of tools for automatically processing document image data at scale. Webpage: Arxiv: Github:

GitHub - Layout-Parser/layout-parser: A Unified Toolkit for Deep Learning Based Document Image...

A Unified Toolkit for Deep Learning Based Document Image Analysis - Layout-Parser/layout-parser

github.com

12

299

1K

Melissa Dell

@MelissaLDell

5 months

I’m excited to share News Déjà Vu (), which uses a custom large language model to retrieve historical news articles that are the most similar to modern news articles. (1/4)

15

145

774

Melissa Dell

@MelissaLDell

3 years

(1/2) Knowledge base on deep learning methods for data curation is up: Covers methods from computer vision and NLP. I found it overwhelming at first to tackle the vast DL lit, hope links to resources for getting started will be of potential use to others

7

129

650

Melissa Dell

@MelissaLDell

3 years

The Harvard Economics department has an opening for a tenured position in development economics: This is a senior search, specific to development economics, that requires application through JOE. Please spread the word! #EconTwitter

2

111

386

Melissa Dell

@MelissaLDell

2 years

I'm hiring summer undergrad RAs; build deep learning pipelines for econ dev/pol econ (no DL experience required). $15/hr; can be remote; US work auth. required; undergrads only. Send abigailpowers @fas .harvard.edu CV/transcript to apply. Specify FT/PT interest. #EconTwitter

7

119

297

Melissa Dell

@MelissaLDell

9 months

Introducing LinkTransformer: LT brings the advantages of AI to standard data frame manipulation tasks like merges, deduplication, and clustering, making it easy to use large language models in a standard data wrangling workflow. #EconTwitter (1/10)

4

57

285

Melissa Dell

@MelissaLDell

3 years

(18/n) If Layout-Parser seems relevant to your work, please consider taking less than a minute to visit our website: . If you are on Github, take two seconds to star our repo: . This will help us demonstrate crucial community support.

GitHub - Layout-Parser/layout-parser: A Unified Toolkit for Deep Learning Based Document Image...

A Unified Toolkit for Deep Learning Based Document Image Analysis - Layout-Parser/layout-parser

github.com

4

44

272

Melissa Dell

@MelissaLDell

3 years

(15/n) No background in deep learning? I’m teaching a new course this semester on deep learning for data curation at scale. I’ll be putting the course material into a public knowledgebase. I’ll post here when this is released (sometime in the next 1-2 months).

3

16

262

Melissa Dell

@MelissaLDell

3 years

Thanks to @pquerubo , @qlquanle , @krishna_econ for convincing me to join. Looking forward to sharing more about our research and open-source projects!

12

23

247

Melissa Dell

@MelissaLDell

3 years

(4/n) Contrast the off-the-shelf OCR with the layout detection results we achieve through Layout Parser’s deep learning powered pipelines.

2

24

239

Melissa Dell

@MelissaLDell

3 years

I currently have two open predoc positions for next academic year: and Great opportunity for gaining hands on experience applying both deep learning and econometric methods to novel research

Team

dell-research-harvard.github.io

6

76

235

Melissa Dell

@MelissaLDell

1 year

Harvard is hosting the NEUDC econ development conference this fall; we particularly encourage PhD students and recent graduates to submit a paper. Huge thanks to CID's fantastic staff and dozens of reviewers who are making the conference possible. Please spread the word!

Harvard CID

@HarvardCID

1 year

📣Spread the word - We are hosting the 2023 convening of the North East Universities Development Consortium (NEUDC) on Nov. 4–5, 2023! 🖇 Accepting papers June 17 - August 17, 2023 #NEUDC

1

57

83

3

88

218

Melissa Dell

@MelissaLDell

3 years

(2/n) Unfortunately, OCR often fails to detect layouts in such documents. These figures show off-the-shelf OCRed bounding boxes. Much of the text is not detected, and some is detected twice or scrambled. The OCR cannot distinguish different text types, i.e., headlines vs. captions vs. articles.

2

32

196

Melissa Dell

@MelissaLDell

1 year

We have a new string matching package – supporting Simplified and Traditional Chinese, Japanese, and Korean. HomoglyphsCJK available here: . Paper here: . With Xinmei Yang, Abhishek Arora, and Shao-Yu Jheng (1/8)

HomoglyphsCJK

An easy Python package for fuzzy matching Chinese (simplified and traditional), Japanese and Korean, using character similarity trained from ViT transformer

pypi.org

2

58

182

Melissa Dell

@MelissaLDell

3 years

(6/n) Layout Parser is not just for English. Here’s another example, a complex historical table from Japan

3

28

177

Melissa Dell

@MelissaLDell

3 years

(5/n) We are currently using Layout Parser to process millions of such documents

4

20

178

Melissa Dell

@MelissaLDell

24 days

I have had a pre-doc opportunity open up: . For those who may have applied to a past position prior to March 1 and are interested, please resubmit your materials. Position combines social science questions with big data and deep learning.

Pre-Doctoral Fellowship (Prof Dell)

Professor Melissa Dell is seeking a predoctoral fellow. This position will be for one year, from July 2024 until June 2025. The fellow will be an active participant in the Harvard research community...

academicpositions.harvard.edu

3

68

146

Melissa Dell

@MelissaLDell

5 months

I'm accepting applications for a pre-doctoral fellow position on a rolling basis: The position involves working at the intersection of deep learning and economics with a fantastic group of collaborators! #econtwitter

Pre-Doctoral Fellowship (Prof Dell)

Professor Melissa Dell is seeking a predoctoral fellow. This position will be for one year, from July 2024 until June 2025. The fellow will be an active participant in the Harvard research community...

academicpositions.harvard.edu

0

49

140

Melissa Dell

@MelissaLDell

3 years

(19/n) Layout Parser contributors: @_shannon_shen , @ruochenxD , @MelissaLDell , @lee_bcg , @J_S_Carlson , Weining Li. Currently working with @qlquanle , @pquerubo , @LeanderHeldring , @krishna_econ , Sahar Parsa, and awesome RAs on additional models that will be added when complete.

16

7

136

Melissa Dell

@MelissaLDell

9 months

If you think American Stories () or Headlines () may be useful for you, please like or download. It is challenging to fund dataset/open-source projects, and we need to show that people find our work useful so we can do more! (12/13)

dell-research-harvard/headlines-semantic-similarity · Datasets at Hugging Face

huggingface.co

2

11

128

Melissa Dell

@MelissaLDell

3 years

What's next? We've been working lately on custom OCR pipelines (post layout detection), as off-the-shelf products often fail at accurate character/number detection with historical documents. We hope to have some helpful insights to share later this fall... #EconTwitter

2

5

125

Melissa Dell

@MelissaLDell

3 years

Harvard Academy Scholars program is accepting applications. Deadline Oct 1. This is a great post-doc for economists/other social scientists, providing an opportunity to be very integrated within the economics and broader scholarly community in Cambridge

Academy Scholars Program

This program is open to recent PhD recipients and doctoral candidates in the social sciences.

academy.wcfia.harvard.edu

0

46

118

Melissa Dell

@MelissaLDell

3 years

(17/n) Building this takes a ton of work and financial resources. We’ve been invited to the final round of a large grant competition that would significantly expand Layout Parser, but we need to show there is demand for this from the social science community.

1

9

105

Melissa Dell

@MelissaLDell

3 years

(7/n) These are the Layout Parser functionalities

1

10

106

Melissa Dell

@MelissaLDell

11 months

@EmilySilcock1 and I have recently released HEADLINES, a massive-scale dataset containing nearly 400 million positive semantic similarity pairs, drawn from historical U.S. newspapers. Dataset: Paper: (1/3)

1

21

100

Melissa Dell

@MelissaLDell

3 years

(10/n) Don’t have labeled data? Layout Parser incorporates a data annotation toolkit that makes it more efficient to create labeled data.

2

13

92

Melissa Dell

@MelissaLDell

9 months

I remember being told by a colleague: “Economic history isn’t and can't be science because there are no data points." So much great work over the past decade proving the contrary! I’m pretty pumped to have a billion+ observations in a historical dataset description table (11/13)

2

7

89

Melissa Dell

@MelissaLDell

2 years

I'm hiring RAs for projects about Japanese development and political economy. Remote ok, pt or summer ft opps, US work authoriz. req. Pre-doc (in person, w visa sponsorship) an option as well. Japanese fluency, python or R experience req. #EconTwitter

Team

dell-research-harvard.github.io

3

39

84

Melissa Dell

@MelissaLDell

3 years

(9/n) With Layout Parser, you can train your own customized DL-based layout models. Because our pre-trained model zoo is currently small, right now Layout Parser is mostly useful for designing your own customized models

1

4

84

Melissa Dell

@MelissaLDell

2 years

Exciting news - JPE Micro and JPE Macro now live - A huge shout out to all the work John List and Greg Kaplan have done to make this a reality!

Journal of Political Economy: New journals

www.journals.uchicago.edu

1

11

75

Melissa Dell

@MelissaLDell

3 years

(8/n) Layout Parser currently has some pre-trained models, and the pipelines for the above examples will be integrated when finalized. We are working to expand the types of documents it can process off-the-shelf

1

4

75

Melissa Dell

@MelissaLDell

3 years

(2/2) Also includes links to slides and videos from my course on the topic. Obviously, student interactions are edited out so videos are just me talking to zoom. But I promise some of the linked resources are more interesting!

0

3

73

Melissa Dell

@MelissaLDell

3 years

(16/n) We hope to make substantial innovations. With more resources we can expand the pre-trained model zoo significantly. Ultimately, we hope to convert the library into a user-friendly online platform that can be used by anyone, regardless of Python literacy or hardware.

2

3

72

Melissa Dell

@MelissaLDell

5 months

NYT sues OpenAI for copyright infringement showing that GPT exactly reproduces articles. This is often driven by duplicates in training data. Our ICLR paper develops robust duplicate detection, finding far more duplicates in news than method used for GPT3

1

14

72
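
A rough sketch of why noisy duplicates are hard to catch with n-gram overlap. This is not the method from the ICLR paper mentioned above; `word_ngrams`, `ngram_jaccard`, and the sample sentences are hypothetical, and word-trigram Jaccard overlap is used only as a generic stand-in for overlap-based deduplication. A single OCR error changes every trigram that touches it, so the score drops sharply even though both texts are the same article.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of word n-grams (as tuples) in a lowercased text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(a, b, n=3):
    """Jaccard overlap between the word n-gram sets of two texts."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical example: the second text has one OCR error ("passcd").
clean = "The Senate passed the farm bill on Tuesday"
noisy = "The Senate passcd the farm bill on Tuesday"

same_score = ngram_jaccard(clean, clean)
noisy_score = ngram_jaccard(clean, noisy)
```

With these inputs, one misread word out of eight removes half of the shared trigrams, which is the kind of gap a more robust duplicate detector has to close.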

Melissa Dell

@MelissaLDell

3 years

(14/n) Layout Parser is implemented with simple APIs and can perform off-the-shelf layout analysis with four lines of Python code

1

9

72

Melissa Dell

@MelissaLDell

3 years

(11/n) Amongst its varied functionalities is a perturbation-based scoring method to select the most informative samples to label

OLALA: Object-Level Active Learning for Efficient Document Layout...

Document images often have intricate layout structures, with numerous content regions (e.g. texts, figures, tables) densely arranged on each page. This makes the manual annotation of layout...

arxiv.org

1

8

72

Melissa Dell

@MelissaLDell

3 years

(13/n) Layout Parser provides a flexible output structure to facilitate diverse downstream analyses.

1

7

64

Melissa Dell

@MelissaLDell

3 years

(1/n) I'm organizing a couple of sessions at the North American Winter Meeting of the Econometric Society (Jan 6-9, 2022; held concurrently with ASSA) Would love to see your submissions! Due April 21.

1

10

60

Melissa Dell

@MelissaLDell

9 months

The pipeline is highly efficient to deploy and has been open-sourced (). We’ve also created open-source packages – LayoutParser and EfficientOCR – to help researchers develop similar pipelines for their own document collections. (4/13)

GitHub - dell-research-harvard/AmericanStories: The official Github for the American Stories...

The official Github for the American Stories dataset as in {link} - dell-research-harvard/AmericanStories

github.com

1

6

57

Melissa Dell

@MelissaLDell

3 years

(12/n) Layout Parser builds wrappers to call OCR engines and comes with a DL-based CNN-RNN

1

3

55

Melissa Dell

@MelissaLDell

9 months

We detect 1.14 billion individual content regions in around 20M newspaper scans from Library of Congress’s Chronicling America collection. Headlines, articles, bylines, and captions are custom-OCRed. The dataset contains 438 million structured article texts. (2/13)

1

7

51

Melissa Dell

@MelissaLDell

9 months

Our team: @pquerubo , @J_S_Carlson , Tom Bryan, @EmilySilcock1 , @96abhishekarora , Luca D’Amico Wong, @shannonzshen , @qlquanle , @LeanderHeldring , and fantastic undergrad RAs. Funding: Harvard Data Science Initiative, Catalyst, and Griffin Fund and MS Azure (13/13)

2

1

46

Melissa Dell

@MelissaLDell

1 year

On a different note, the knowledge base from my redesigned PhD course on Deep Learning Methods for Processing Unstructured Data in Economics is now live: . Covering language models, computer vision, and more. (8/8)

Blog

dell-research-harvard.github.io

0

7

45

Melissa Dell

@MelissaLDell

9 months

Structured article texts also support analyses that are impossible with existing page level texts. We detect the biggest stories of the year, using a custom trained large language model to embed texts and then applying clustering to group articles into coherent stories (6/13)

1

5

42
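
A minimal sketch of the "embed, then cluster" step described in (6/13). The tweet does not say which clustering algorithm is used, so this assumes single-linkage clustering over cosine similarity with a fixed threshold. The function names, the threshold, and the tiny 2-d vectors are all hypothetical stand-ins for real language model embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_stories(embeddings, threshold=0.9):
    """Group article indices whose embeddings are linked by similarity >= threshold."""
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical embeddings: articles 0 and 1 cover one story, 2 and 3 another.
article_vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.98]]
stories = cluster_stories(article_vectors)
```

The pairwise loop is quadratic, so at the scale of hundreds of millions of articles an approximate nearest-neighbor index would be needed in place of it.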

Melissa Dell

@MelissaLDell

3 years

Version 0.3 of Layout Parser is live, with various updates to streamline document image analysis workflows: . #EconTwitter #MachineLearning

Release v0.3.0: Multi-backend Support, Additional Models, Better Visualizations, and many more ·...

We are excited to release LayoutParser v0.3.0, with a lot of exciting updates and functional improvements. New Features The biggest change in this version is that LayoutParser now supports multipl...

github.com

0

14

38

Melissa Dell

@MelissaLDell

9 months

It covers all 50 states, with content concentrated pre-1920. (3/13)

1

3

35

Melissa Dell

@MelissaLDell

5 months

We first mask out all named entities (e.g. people, locations, organizations). The language model, trained to capture semantic similarity, then maps each news article to a vector. For a given modern news article, we choose the closest historical article in this vector space. (2/4)

1

35
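
The three steps in (2/4) can be sketched in a few lines of Python. This is a toy under loud assumptions: a hand-written entity list replaces a real named entity recognizer, a bag-of-words count vector replaces the custom language model, and every name and sample sentence here is hypothetical.

```python
import math
import re
from collections import Counter


def mask_entities(text, entities):
    """Blank out each listed entity (a stand-in for masking with an NER model)."""
    for entity in sorted(entities, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(entity) + r"\b", " ", text, flags=re.IGNORECASE)
    return text


def embed(text):
    """Toy embedding: word counts. A trained language model would go here."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def closest_historical(modern, historical, entities):
    """Return the index of the historical article nearest to the modern one."""
    query = embed(mask_entities(modern, entities))
    scores = [cosine(query, embed(mask_entities(h, entities))) for h in historical]
    return max(range(len(historical)), key=scores.__getitem__)


# Hypothetical articles and entities.
entities = {"Rivers", "Northfield", "Hale", "Eastbrook", "Kansas"}
modern = "Dean Rivers resigns from Northfield after protests"
historical = [
    "Hale resigns from Eastbrook after student protests",
    "Wheat prices rise in Kansas after drought",
]
best = closest_historical(modern, historical, entities)
```

Masking first means the match is driven by how the event is described ("resigns", "protests") and not by shared names, which is the point of the first step.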

Melissa Dell

@MelissaLDell

3 years

v0.2 of Layout Parser is out! Amazing work by @_shannon_shen_ incorporating lots of useful updates

layoutparser

@layoutparser

3 years

(1/n) Layout Parser v0.2 is out! New models, better API support, and much more! ✨Highlights✨ - Add support for loading and saving with JSON and CSV. - New shape operations between blocks (union and intersection) are available. - Table detection models are up for grabs!

1

13

36

1

5

33
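
For readers unfamiliar with the "shape operations between blocks" mentioned in the v0.2 highlights, here is a minimal standalone sketch of union and intersection on axis-aligned rectangles. It does not use Layout Parser itself; the `Box` class and both functions are hypothetical illustrations of the geometry, not the library's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle given by its top-left and bottom-right corners."""
    x1: float
    y1: float
    x2: float
    y2: float


def union(a: Box, b: Box) -> Box:
    """Smallest rectangle that covers both boxes."""
    return Box(min(a.x1, b.x1), min(a.y1, b.y1), max(a.x2, b.x2), max(a.y2, b.y2))


def intersection(a: Box, b: Box) -> Optional[Box]:
    """Overlapping region of two boxes, or None if they do not overlap."""
    x1, y1 = max(a.x1, b.x1), max(a.y1, b.y1)
    x2, y2 = min(a.x2, b.x2), min(a.y2, b.y2)
    if x1 >= x2 or y1 >= y2:
        return None
    return Box(x1, y1, x2, y2)


# Hypothetical detected regions on a page.
headline = Box(0, 0, 100, 20)
article = Box(50, 10, 150, 60)
merged = union(headline, article)
overlap = intersection(headline, article)
```

Union is handy for merging fragments of one region, and intersection for checking whether a detected text line falls inside a detected article block.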

Melissa Dell

@MelissaLDell

5 months

Thanks @pquerubo for suggesting we use AI to query historical articles most similar to 2024 predictions. The model pulled celebrity psychic Jeane Dixon (1969) on Vietnam and a 1931 article on the folly of gloomy prophecies.

0

9

33

Melissa Dell

@MelissaLDell

9 months

Interested in a later period? See our massive scale headlines dataset (1920s-80s) - and paper - , consisting of locally written headlines from news wire articles (10/13)

3

4

32

Melissa Dell

@MelissaLDell

9 months

This is important because there are lots of illegible scans, with illegibility varying across space and time. Illegibility could bias analyses if researchers include illegible content in the denominator when measuring the presence of different terms or textual features. (8/13)

2

3

28
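
A small worked example of the denominator issue in (8/13). The counts are invented for illustration and `term_share` is a hypothetical helper. The assumption is that a term cannot be found in an illegible scan, so counting those scans in the denominator pushes the measured share down.

```python
def term_share(scans, term, include_illegible):
    """Share of scans mentioning `term`, with or without illegible scans in the denominator."""
    pool = scans if include_illegible else [s for s in scans if s["legible"]]
    if not pool:
        return 0.0
    hits = sum(1 for s in pool if s["legible"] and term in s["text"].lower())
    return hits / len(pool)


# Invented sample: 6 legible scans (3 mention "strike") and 4 illegible scans.
scans = (
    [{"text": "Mill workers strike over wages", "legible": True}] * 3
    + [{"text": "County fair opens Saturday", "legible": True}] * 3
    + [{"text": "##~~ ,,;; ..", "legible": False}] * 4
)

biased_share = term_share(scans, "strike", include_illegible=True)     # 3 of 10 scans
corrected_share = term_share(scans, "strike", include_illegible=False)  # 3 of 6 scans
```

If the illegible fraction differs across places or decades, the gap between the two shares differs too, so a comparison across space or time can be distorted even when true coverage is the same.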

Melissa Dell

@MelissaLDell

9 months

The existing Chronicling America OCR from LoC doesn’t recognize layouts, scrambling articles, headlines, ads, etc. American Stories significantly improves accuracy on text classification (allowing it at the article level) and on detecting reproduced content (5/13)

1

2

28

Melissa Dell

@MelissaLDell

5 months

Huge shout out to our team: @96abhishekarora , Brevin Franklin, Andrew Lu, and @EmilySilcock1 . We will be doing weekly drops. Please let us know if there are particular modern stories you'd like to see included. (4/4)

0

1

27

Melissa Dell

@MelissaLDell

5 months

Disclaimer: the language model captures similarities in the semantics of how things are described, which may or may not reflect similarities in the underlying events or situations being described. News articles may have biases or inaccuracies. (3/4)

1

27

Melissa Dell

@MelissaLDell

3 years

More on LayoutParser updates:

layoutparser

@layoutparser

3 years

1 Multi Deep Learning Backend Support DL models are constantly updating, and how can we enable easy access to the latest advances? In v0.3, #layoutparser starts to support different DL model backends and layout models.

1

0

3

0

1

25

Melissa Dell

@MelissaLDell

9 months

Another distinguishing feature of American Stories: after detecting text regions (articles, headlines, captions, bylines), we classify whether they are legible (7/13)

1

2

24

Melissa Dell

@MelissaLDell

5 months

What does AI pull as most similar to the Claudine Gay resignation coverage? UC President Clark Kerr and Berkeley protests involving “a four-letter obscenity.” More articles here.

4

2

24

Melissa Dell

@MelissaLDell

3 years

(2/n) Deep learning based methods for data curation in economics: This session will examine how deep learning‐based methods have been integrated into economic research, with a particular focus on how DL/ML can be used to convert novel sources of information into computable data

1

4

23

Melissa Dell

@MelissaLDell

9 months

Our texts are high quality. This figure compares the non-word rate in our custom OCR to the Library of Congress OCR. Differences are due to a combination of our high-quality custom-OCR and filtering of illegible content and ads before OCRing. (9/13)

1

2

21
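
One way to read the non-word rate in (9/13) is as the share of tokens that do not appear in a reference vocabulary. The tweet does not give the exact definition, so the sketch below is an assumption; `non_word_rate`, the tiny vocabulary, and the sample strings are hypothetical.

```python
import re


def non_word_rate(text, vocabulary):
    """Fraction of alphabetic tokens in `text` that are missing from `vocabulary`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    misses = sum(1 for token in tokens if token not in vocabulary)
    return misses / len(tokens)


# Hypothetical vocabulary and OCR outputs for the same line of text.
vocabulary = {"the", "senate", "passed", "bill"}
clean_ocr = "The Senate passed the bill"
noisy_ocr = "Tbe Senate passcd the bill"

clean_rate = non_word_rate(clean_ocr, vocabulary)
noisy_rate = non_word_rate(noisy_ocr, vocabulary)
```

A real measurement would use a full dictionary, and proper nouns or archaic spellings would count as non-words unless the vocabulary includes them.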

Melissa Dell

@MelissaLDell

3 years

We know that you all will find lots of ways in which Layout Parser can be improved. If so, please consider contributing to the library and making these improvements publicly available!

layoutparser

@layoutparser

3 years

(6/6) And please consider joining us and contribute to the library - any help would make a huge difference! .

2

1

5

0

3

21

Melissa Dell

@MelissaLDell

3 years

Also check out this helpful Layout Parser overview that the phenomenally talented @ZejiangS put together for the ICDAR conference. #EconTwitter #MachineLearning

Layout Parser Main Presentation

0:00 Introduction 0:08 Motivation 2:24 Demo 3:20 Design & Implementation 3:40 Design & Implementation - Deep Learning Models for Layout Detection 5:35 Design & I...

www.youtube.com

0

1

20

Melissa Dell

@MelissaLDell

2 years

RE pay rates for undergrad RAs, this is set by the National Science Foundation (Research Experience for Undergrads Program). I agree these programs would ideally have more generous stipends, compensated by lower rates of overhead #EconTwitter

1

2

18

Melissa Dell

@MelissaLDell

3 years

(3/n) Economic growth and structural transformation: This session welcomes papers that examine the determinants of economic growth, as well as those that seek to understand the causes and consequences of structural transformation

0

4

19

Melissa Dell

@MelissaLDell

9 months

If you find LT useful, please cite it and consider starring our repo . We funded LT out of the PI’s very limited unrestricted funds, and to maintain/expand we need to show potential funders that it is having a positive impact on the community! (9/10)

GitHub - dell-research-harvard/linktransformer: A convenient way to link, deduplicate, aggregate...

A convenient way to link, deduplicate, aggregate and cluster data(frames) in Python using deep learning - dell-research-harvard/linktransformer

github.com

2

3

16

Melissa Dell

@MelissaLDell

9 months

Merge with transformer language models like you would in Pandas. The API is designed to be as simple as possible and very familiar to practitioners coming from other environments like R and Stata. Demo notebook: Paper: (2/10)

1

2

16

Melissa Dell

@MelissaLDell

9 months

A few years ago, I started working on deep learning methods to liberate data at scale, with Layout Parser , and now EffOCR (more coming soon!) and LinkTransformer. We have more exciting DL-based packages for social science research in the pipeline. (8/10)

1

3

14

Melissa Dell

@MelissaLDell

11 months

Many interesting questions require a language model trained on semantic similarity data. Existing training data are largely from web texts; we are excited to release a massive-scale historical texts dataset. Headlines are also fascinating from a social science perspective (3/3)

1

14

Melissa Dell

@MelissaLDell

3 years

Tomorrow is the deadline for submitting to the Econometric Society's North American Winter Meetings (part of the 2022 ASSA Meetings): I'd love to see your submissions to sessions on growth/structural change and methods for curating data

0

1

14

Melissa Dell

@MelissaLDell

9 months

Huge shout out to @96abhishekarora , a pre-doc in our lab, for his phenomenal work on LinkTransformer, and our summer RA Sam Jones; we welcome other open-source contributors! (10/10)

0

1

12

Melissa Dell

@MelissaLDell

11 months

Around half of content in historical local newspapers came from newswires (e.g., the AP) but local papers wrote their own headlines. Headlines corresponding to the same wire article capture semantic similarity. (2/3)

2

0

10
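
The pairing idea in (2/3) can be sketched as follows: if several local papers wrote their own headline for the same wire article, every pair of those headlines is a positive semantic similarity pair. The grouping keys and headlines below are invented, and `positive_pairs` is a hypothetical helper. How headlines are matched to wire articles in the first place is outside this sketch.

```python
from itertools import combinations


def positive_pairs(headlines_by_article):
    """Return all unordered headline pairs that share the same wire article."""
    pairs = []
    for headlines in headlines_by_article.values():
        pairs.extend(combinations(headlines, 2))
    return pairs


# Invented example: three local headlines for one wire story, one for another.
headlines_by_article = {
    "wire-001": [
        "Senate Approves Farm Relief",
        "Farm Aid Bill Clears Senate",
        "Relief for Farmers Voted",
    ],
    "wire-002": ["Storm Lashes Gulf Coast"],
}
pairs = positive_pairs(headlines_by_article)
```

An article with k headlines yields k(k-1)/2 pairs, which helps explain how a corpus of newspaper headlines can produce hundreds of millions of pairs.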

Melissa Dell

@MelissaLDell

1 year

Also multimodal methods for record linkage (), which avoid the OCR information bottleneck and leverage the power of a large language model. (7/8)

1

0

9

Melissa Dell

@MelissaLDell

5 months

Also, for hundreds of millions of off-copyright historical news articles, see

dell-research-harvard/AmericanStories · Datasets at Hugging Face

huggingface.co

1

2

9

Melissa Dell

@MelissaLDell

9 months

LinkTransformer supports all models on the Hugging Face Hub and OpenAI Embedding models. We’ve also trained our own collection of over 20 open-source language models for different languages and tasks. A guide to selecting models is here: (3/10)

Introducing LinkTransformer.ipynb

Colaboratory notebook

colab.research.google.com

1

9

Melissa Dell

@MelissaLDell

9 months

Training your own models is as easy as one line of code, with most of the heavy lifting done behind the scenes. You can fine-tune any pretrained model from Hugging Face. Learn more at our repo and demo notebook (5/10)

1

2

9

Melissa Dell

@MelissaLDell

1 year

This is part of an agenda developing methods that make curating data in lower-resource settings easier, e.g. EfficientOCR () – which makes customized OCR easier, cheaper, and more extensible (with Jake Carlson and Tom Bryan) (6/7)

1

9

Melissa Dell

@MelissaLDell

9 months

LT aims to create a community for deep record linkage, streamlining the distribution of record linkage models and promoting the reusability and reproducibility of pipelines. Users can tag and share their models on the Hugging Face hub with a single line of code. (6/10)

1

9

Melissa Dell

@MelissaLDell

9 months

This is our initial release, and we welcome feedback via Github. Planned features for the next release include integrating vision transformer models for visual record linkage (forgo OCR altogether!) and FAISS GPU support. (7/10)

1

8

Melissa Dell

@MelissaLDell

9 months

LT supports a wide range of data wrangling tasks with transformers: standard merging, merge with blocking or multiple keys, cross-lingual merges (no need to translate), 1-m and m-m merges, aggregation/classification, clustering and de-duplication. (4/10)

Link Records with LinkTransformer.ipynb

Colab notebook

colab.research.google.com

1

2

8

Melissa Dell

@MelissaLDell

5 months

@banreportcards OCR errors from digitizing the historical content. There's more about this pipeline in our recent NeurIPS paper:

0

7

Melissa Dell

@MelissaLDell

1 year

The basic idea is to use vision transformers – a deep neural network – to quantify character similarity. The model is contrastively trained to learn a metric space where different augmentations of the same character (e.g., from different fonts) are represented nearby (2/8)

1

6

Melissa Dell

@MelissaLDell

5 months

Historical note: Kerr withdrew his resignation a few days later and was reinstated, but was ultimately fired in 1967. Fascinating to learn about

Clark Kerr - Wikipedia

en.wikipedia.org

0

5

Melissa Dell

@MelissaLDell

5 months

Our NEWS-COPY dataset, with 122,876 noisily duplicated news article pairs, is publicly available, as is our trained duplicate detection model @EmilySilcock1

1

4

Melissa Dell

@MelissaLDell

1 year

The Homoglyphs model can be used to measure character similarity, which we show can significantly improve string matching accuracy for OCR’ed databases. Example homoglyphs: (3/8)

1

0

4
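
To illustrate how character similarity can feed into string matching, here is a sketch of an edit distance where substituting visually similar characters is cheap. It does not use the HomoglyphsCJK package; the similarity values in `SIMILARITY` are invented by hand, whereas the thread describes learning them with a vision transformer, and both function names are hypothetical.

```python
# Hand-written, made-up similarity scores for two visually confusable pairs.
SIMILARITY = {
    frozenset(("土", "士")): 0.9,
    frozenset(("未", "末")): 0.9,
}


def char_similarity(a, b, table=SIMILARITY):
    """Similarity in [0, 1]; identical characters score 1, unknown pairs 0."""
    if a == b:
        return 1.0
    return table.get(frozenset((a, b)), 0.0)


def homoglyph_distance(s, t, table=SIMILARITY):
    """Levenshtein distance where substitution costs 1 minus character similarity."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            substitute = prev[j - 1] + (1.0 - char_similarity(a, b, table))
            delete = prev[j] + 1.0
            insert = curr[j - 1] + 1.0
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


# An OCR confusion between look-alike characters costs far less than a real difference.
ocr_confusion = homoglyph_distance("土地", "士地")
real_difference = homoglyph_distance("土地", "大地")
```

Under plain edit distance both comparisons would cost 1, so a matcher could not tell a likely OCR slip from a different string.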

Melissa Dell

@MelissaLDell

5 months

@ProfLHunter Yes, we are hoping to get the funding to do this soon!

0

3

Melissa Dell

@MelissaLDell

1 year

When homoglyphic matching fails, it is often because OCR has destroyed too much information for string matching to be a realistic solution (4/8)

1

0

3

Melissa Dell

@MelissaLDell

1 year

Because the method for quantifying character similarity is purely self-supervised, it can be cheaply extended to any character set. Homoglyphs for ancient Chinese characters from over 3,000 years ago capture related abstract concepts noted in the archaeological literature (5/8)

1

0

3

Melissa Dell

@MelissaLDell

9 months

@wite_wall Thanks for the question. No. Only 0.2% of articles in a labeled random sample spanned multiple pages (it was less common in this period than later), so wouldn't have justified the significant extra resources required for this linking.

1

0

2

Melissa Dell

@MelissaLDell

3 years

@dandekadt Amazon Textract? This is pretty different... We have some pre-trained models but yeah, prob won't work for your specific applications. But you can build a customized pipeline that prob will (it's a lot more work than off-the-shelf, LP aims to make it easier)

0

1

Melissa Dell

@MelissaLDell

3 years

@J_S_Carlson Have you seen this?

Graham Neubig

@gneubig

3 years

Wow, this is great 😁 I joked a while ago when we were working on representing text visually () that this would be a really robust way to translate noisy text like that on Twitter, but apparently it actually is!

3

16

104

1

0

1