I really like this work by
@akshitajha
and team! Check it out :)
I’ve been feeling overwhelmed in keeping up with the great work being done in cross-cultural NLP so thought of starting this list of awesome resources😌
GitHub:
It is by no means
Ecstatic to share that I'll be joining
@SCSatCMU
for my PhD at LTI this Fall! I'll be working with
@gneubig
and
@dan_fried
among many others! I've really enjoyed talking to students and faculty at CMU and am very excited to embark on this journey✨ (1/n)
How would you choose the best data instances to label, that maximize the performance of a model on target data? What if your target data is multilingual and you have no annotators in those languages?
Our new work, DeMuX, addresses this problem.
(1/n)
Ever noticed how Pixar adapts movies for international markets? The beloved newscaster in Zootopia is a jaguar in Brazil, a panda in China, a koala in Australia …
While machine translation (MT) has only dealt with language in speech/text thus far, we extend the scope of MT to
Grateful to have received the best paper award at SLT 2022 for FLEURS!
FLEURS is a multi-lingual (102 languages), multi-modal (speech-text), n-way parallel dataset, built on top of Flores-101. (1/n)
Our FLEURS paper won the best paper award at SLT 2022!
@ieee_slt
SLT:
arXiv:
Thanks to the organizers! Grateful for the collaboration with many great colleagues 🙂
It was great meeting with undergrad students passionate about research at BITS-Goa! In the last couple of years, they've successfully set up research groups like LRG, SAiDL etc., significantly enhancing the research culture on campus. (1/2)
Since multilingual LMs cannot equitably represent 100+ languages, we have recently witnessed the growth of a language/domain-specific pre-trained model universe. In our ACL 2021 Findings paper, we make a first attempt at merging multiple pre-trained LMs using KD. (1/2)
I recently gave a lecture on Image-Text Modeling for Multilingual NLP at CMU and thought I'd share my slides in case interested folks find them useful!
Here are a few things covered in the slides. (1/n)
Greetings all :) Today,
@BigAmeya
and I will be conducting a TF tutorial session at the CVIT IIIT Summer School at 7PM IST. The session will be a gentle introduction to TF 2.0 with two interesting applications in NLP and GNNs! (1/2)
Check out their amazing work on fairness in the Indian context!
P.S.: The first author, Shaily, is applying for a PhD this year and is a passionate young researcher! Do keep an eye out for her application :)
I’m in the Bay Area for the summer ☀️and attending NAACL in Mexico City from June 13-23! Please feel free to DM for a coffee chat if y’all are around :) Would love to meet up with fellow researchers/friends 💕
🔍📊Came across three works today, all benchmarking and evaluating the multilingual capabilities of LLMs. All consistently show how LLMs are significantly outperformed by smaller fine-tuned LMs for (most) tasks! (1/n)
Excited to attend
#ACL2023NLP
in person this week! Feel free to reach out if anyone wants to catch up :) A few things I’ve been interested in these days: a) multilingual, low (text) resource NLP (as always 😌); b) sample efficiency in training/fine-tuning; (1/3)
It was so fun to work on this with everyone! Literal translations of metaphors in other languages never failed to make us laugh 😅 Refer to thread for more details on our work!
"आज-कल NLP Research के साथ बने रहना उतना ही आसान है जितना कि मानसून में भीगने से बचे रहना!" (roughly: "Keeping up with NLP research these days is about as easy as staying dry in the monsoon!") Did you understand? How about LMs? Our
#ACL2023
Findings paper explores multilingual models' cultural understanding through figurative language in 7 langs 🌎(1/9)
I'll be attending
@eaclmeeting
(May 1st-7th) in person! Would love to catch up with those attending :) Happy to chat about all-things-research (especially multilingual NLP), life
@LTIatCMU
as a grad student, or anything else :)
Congratulations! I'm so excited to see Shuyan's lab and research grow! Potential applicants: Shuyan is one of the kindest people I know, and a great researcher ofc <3
I am thrilled to announce that I will be joining
@DukeU
@dukecompsci
as an Assistant Professor in summer 2025. Super excited for the next chapter! Stay tuned for the launch of my lab 🧠🤖
Excited to be in our
@GoogleAI
research India office in Bangalore after two years! Awesome to meet with lab director
@ManishGuptaMG1
, with our
#AIforSocialgood
team, and other labmates in the office; the most rewarding part was a Q&A session with the predocs.
📣 Presenting this work on performance disparities across non-standard dialects at EMNLP's poster session 7! Come say hi 11-12:30 Dec 10 (today) in the East Foyer!
It was great fun to host
@JeffDean
in a Fireside Chat on his virtual India visit! We had a small quiz titled, "Two truths (and a lie?)" centred around "true facts" :) He then answered several questions from fellow Googlers. Thanks for participating in this!
Lastly, but most importantly, I'd like to take this opportunity to thank my parents for supporting my dreams, despite their limited knowledge of research as a career. I can never thank them enough :)
Can your NLP model handle noooisy mEsSy
#realworldtext
?
ByT5 works on raw UTF-8 bytes (no tokenization!), beats SoTA models on many popular tasks, and is more robust to noise.
📜 Preprint:
💾 Code/Models:
Summary thread ⬇️ (1/9)
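Concretely, "no tokenization" means the model's input ids are just the UTF-8 bytes of the text. A minimal sketch of the byte-to-id mapping (the +3 offset, which reserves ids for pad/eos/unk, follows the released ByT5 vocabulary; treat the exact offset as an assumption for other checkpoints):

```python
def byte_ids(text: str, offset: int = 3) -> list[int]:
    """ByT5-style input ids: raw UTF-8 bytes, shifted past special tokens (pad/eos/unk)."""
    return [b + offset for b in text.encode("utf-8")]

print(byte_ids("hi"))                 # [107, 108]: one id per ASCII character
print(len("नमस्ते".encode("utf-8")))  # 18: Devanagari takes 3 bytes per code point
```

The trade-off: byte sequences are longer than subword sequences (especially for non-Latin scripts), so the robustness to noisy text comes with extra sequence length.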
Kalamang Translation
One of the most exciting examples in the report involves translation of Kalamang. Kalamang is a language spoken by fewer than 200 speakers in western New Guinea in the east of Indonesian Papua (). Kalamang has almost no online
I wouldn't be here without the constant guidance and support of my brilliant mentors
@partha_p_t
,
@monojitchou
, Sunayana Sitaram and
@seb_ruder
. I'm immensely grateful to each one of them, all of whom have enabled me to learn and grow so much in the past few years. (2/n)
Applications are open for our pre-doc researcher program. This is one of my favorite programs, where we provide exciting research opportunities to recent graduates with an undergraduate (or Masters) degree, infecting them with the "research bug" 🙂
IKDD has opened up their networking sessions for all :) Great opportunity to interact with like-minded people. Starting at 12:15PM IST. I'll be hosting NLP (2) where
@monojitchou
is our guest speaker! Do join :) The corresponding meeting rooms are here
Great study showing that the common practice of training VLMs on english-filtered image-text pairs harms communities of lower socioeconomic status! Train on all of your data for improved cultural understanding of images (even if performance on western-centric benchmarks takes a
Want your VLM to reflect the world's rich diversity 🌍? We’re very excited to share our recent research on this topic. TLDR: to build truly inclusive models that work for everyone, don’t filter by English, and check out our recommended evaluation benchmarks. (1/7)
Would highly encourage alumni to reach out and contribute in any capacity :) All thanks to efforts of faculty and students like
@baths_veeky
@RajaswaPatil
and others
🎉🌐 Join us on April Fools' Day for an exciting event at the
@LTIatCMU
at
@CMU
on April 1st & 2nd! We're diving deep into the fascinating world of large language models and their transformative impact on academia, industry, and society at large. 🌐🎉
As the next generation of smartphone users is expected to permeate several strata of society (many of whom may not know how to read/write), voice assistants will definitely play a very important role in building inclusive technology. Highly impactful research!
@gneubig
LTI Prof.
@gneubig
chatted with
@905wesa
today about his work in expanding the reach of spoken-language translation systems and being named a finalist for the Blavatnik Award for Young Scientists. See what he had to say here:
Subha and team at LJL have been doing great work on language preservation and supporting speakers of low (digital) resource languages. I'm so excited to work alongside passionate volunteers from diverse backgrounds, towards the common goal of advancing language technologies for
We've received an overwhelming response from both mentors and mentees. We're excited to start our LJL's Research Labs ☀ Summer Cohort ☀ on June 2nd!
Ranging from startups to nonprofits to research, 🌅 they're all working on making natural language processing more accessible.
"Taking the fun out of YouTube" won🥇 by creating a Chrome extension that makes YT video titles non-clickbait-y! They obtain pairs of titles by prompting GPT and fine-tune LLaMA on the resulting data, deploying it for inference. Team:
@AthiyaD
, Abuzar Khan, Alex Li (2/n)
MuRIL is a multilingual model specifically built for Indian languages. Work done w/
@partha_p_t
at
@GoogleAI
. Please mail your queries/feedback to muril-contact@google.com.
"ChatHuman" won🥈 by prompting LLMs to ask for human help when completing tasks, to better understand human intent and produce a refined output! Team:
@_Hao_Zhu
@nlpxuhui
@prakhariitr
@jimin__sun
and Kaixin Ma (3/n)
We have also released the encoder on TFHub with an additional pre-processing module that processes raw text into the expected input format for the encoder.
Delighted to share that our
@emnlpmeeting
work on background summarization received an outstanding paper award in the summarization track 🏆
w/ Kevin Small and
@markusdr
Paper:
Github:
Highlights below, (1/4)
#EMNLP2023
All model outputs can be found here:
We have several discussion points in Sections 7 and 8 of the paper on:
a) Why we categorize culture based on country
b) How a one-to-one mapping may never exist
c) The tradeoff between relatability and stereotyping
🚨 📢 Preprint Alert
After more than a year of hard work, we are pleased to introduce IndicTrans2, the first machine translation system supporting all 22 scheduled Indic languages.
📎:
💻:
▶️:
Thread👇[1/n]
Since time immemorial, translators have advocated the need for cultural adaptation in translation. With increasing multimodal content online, translating all modes is essential for complete transfer of meaning. In translation studies, people use the term transcreation to
I'm in the Baltimore/DC area this week to share some cool stuff we've been cooking!
I'll give this talk at UMBC on Monday, Georgetown on Tuesday, UMD on Wednesday, and as a poster at MASC at JHU on Friday.
If you're around and want to chat, please reach out! Happy to come say hi! 😃
💡Additionally, XTREME-UP shows that byte-based models are especially helpful for morphologically rich languages, compared to sub-word models.
⌛️An important time to be working on making our NLP systems more inclusive and equitable🤝 through data, modeling, and evaluation efforts :) (Fin)
Next, we construct a two-part evaluation dataset to test these pipelines:
a) concept – where we aggregate 600 images from seven countries. Each country has 85 commonly occurring concepts across 17 categories (like food, celebrations etc.). We follow the data annotation protocol
This work was done with amazing collaborators
@derylucio
, Srinivas, and
@gneubig
. We’ve added support for MT and custom data in our codebase. Please e-mail or raise an issue in case of questions or concerns!
We hope you find our work useful!
Code: (n/n)
Indian students interested in NLP, looking to do some solid research before starting graduate school, should consider:
1. Pre-doctoral program at Google w/
@partha_p_t
and team.
2. MSR RF program w/
@monojitchou
@kalikabali
, Sunayana and others. ()
I specifically contributed to cross-modal retrieval since I was (and still am) pretty excited by its potential to broaden information access across modalities and languages! (3/n)
We had many more interesting explorations, all of which we will shortly update on our GitHub repo ()! Shoutout to our high school student team that attempted to develop a set of metrics to evaluate the accuracy and consistency of LLM references! (5/n)
We construct three pipelines for this unprecedented task, leveraging state-of-the-art generative models. End-to-end image-editing models simply paste the flag or culturally specific entities (like sakura blossoms or Mt. Fuji for Japan) to increase cultural relevance. Hence we
Great talk by Michael, very cool work :)
I really liked your idea on obtaining varied contextual embeddings of the concepts in question. This paper () explores a very similar idea for image-editing, where they generate multiple sentences of the concept in
NAACL video and NeurIPS submission deadlines all in the same week🙃
Anyway, here's my prerecorded video for our NAACL oral "Lost in Translation? ... (long title)" where we analyze how translation errors challenge multilinguality assessment in T2I models!
@partha_p_t
@aryaman2020
@daanvanesch
Thanks
@aryaman2020
. We shall release improved versions of MuRIL soon, which will hopefully address some of the gaps present right now :) We are also considering distilled and large versions to serve all use cases.
Results? The best pipelines can only successfully transcreate 5% of images for some countries (Nigeria) in the easier concept dataset, while no transcreation is successful for some countries (Portugal) in the harder application dataset: (6/n)
While LLM performance on English is rising spectacularly📈, LLMs are only widening disparities amongst languages 📉 The three works are: MEGA (), BUFFET (), and XTREME-UP () (2/n)
Our evaluation highlights that there is lots of headroom left to improve performance on under-represented languages.
Byte-based methods have a clear edge on such morphologically rich languages while small fine-tuned models significantly outperform large in-context models.
This highlights the challenging nature of this task! Image-editing models simply don’t understand the meaning of “making something culturally relevant” and do funny (at times borderline offensive or stereotypical) edits. Some examples below. Can you guess the target country for
This has been an exciting exploration and I’m eagerly looking forward to working on the many open problems in this space! Please feel free to reach out if this work interests you and you’d like to engage further! Work done with amazing supportive collaborators:
@gneubig
,
Finally, we transcreate collected images using all three pipelines, and conduct a multi-faceted human evaluation. For successful transcreation, the edited image should be more culturally relevant than the original image. Further, for:
a) concept – it should belong to the same
For the same target languages, the selected data varies across tasks. When Urdu is the target, for instance, Hindi data is chosen for most tasks, except for NER, where Farsi/Arabic, which share the same script as Urdu, are preferred. (5/n)
I start with motivating how image-text modeling can benefit multilingual NLP. Briefly, the visual modality can: a) bridge culturally consistent concepts; b) encode cultural diversity within a concept (below); c) enrich the representation of cross-culturally unique concepts. (2/n)
"MindfulComm" won🥉 by proposing to create a plug-in that paraphrases text to adhere to non-violent communication principles! Team: Ya-Fang Lin, Benjamin Panny (4/n)
Past works have either a) focused on one target language; b) assumed annotator availability in target languages; c) not prescribed data points to label; or d) relied on past model performances, which are expensive to obtain. Our e2e framework removes the need for such assumptions. (2/n)
FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Cross-Modal Retrieval. (2/n)
Finally, we test the applicability of DeMuX in low-budget settings, where data is acquired in a single active learning round. We observe substantial gains at lower budgets, with diminishing returns as the budget increases. (6/n)
Personally, this has been a great learning experience doing multiple data annotation rounds, human evaluation, and experimenting with image generation models. So glad to share what I've been up to the past year :) (fin)
Our strategies outperform baselines in 84% of test cases, including multilingual target pools. We observe that a hybrid strategy combining both a) and b) above performs best for token-level tasks, but picking globally uncertain points takes precedence for NLI and QA. (4/n)
Assuming access to small amounts of unlabelled target data, we develop strategies to pick points that: a) lie in the neighborhood of target points; and b) the model is most uncertain about, so that labelling them would be most beneficial to the model. (3/n)
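As a rough sketch of what such a hybrid acquisition score could look like (the function names, distance metric, and normalization below are illustrative assumptions, not DeMuX's actual implementation):

```python
import numpy as np

def hybrid_scores(cand_emb, cand_probs, target_emb, alpha=0.5):
    """Toy hybrid acquisition: favor unlabeled candidates that are
    (a) close to the unlabeled target pool in embedding space, and
    (b) ones the model is uncertain about (high predictive entropy)."""
    # (a) closeness: negative distance to the nearest target embedding
    dists = np.linalg.norm(cand_emb[:, None, :] - target_emb[None, :, :], axis=-1)
    closeness = -dists.min(axis=1)
    # (b) uncertainty: entropy of the model's predicted label distribution
    entropy = -(cand_probs * np.log(cand_probs + 1e-12)).sum(axis=1)
    # normalize each term to [0, 1] before mixing
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-12)
    return alpha * norm(closeness) + (1 - alpha) * norm(entropy)

rng = np.random.default_rng(0)
scores = hybrid_scores(rng.random((100, 8)),            # candidate embeddings
                       rng.dirichlet(np.ones(4), 100),  # predicted label distributions
                       rng.random((10, 8)))             # unlabeled target embeddings
budget = np.argsort(scores)[-5:]                        # label the top-5 under a budget of 5
```

Sweeping alpha trades off target similarity against uncertainty; the best mix is generally task-dependent.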
Next, I cover important downstream tasks that multilingual multimodal models can solve. These include cross-lingual, cross-modal a) VQA; b) NLI and reasoning; c) retrieval; d) image/caption generation. (3/n)
c) leveraging multimodal signals to improve NLU for non-English languages and build culturally inclusive systems; d) how we can crowdsource high-quality data in the LLM era; e) better pretraining architectures for multilingual models (2/3)
Next we discuss biases in image generation models across languages and cultures. Finally, my favorite part of this was revisiting the motivation in context of prior work, to think about the *multiple* open questions and challenges yet to be solved in this exciting space! (5/n)