Simran Khanuja Profile
Simran Khanuja

@simi_97k

2,299 Followers
938 Following
21 Media
304 Statuses

NLP | PhD Student @LTIatCMU | Predoctoral Researcher @Google | Microsoft Research | BITS Pilani, Goa

Joined April 2018
Pinned Tweet
@simi_97k
Simran Khanuja
24 days
I really like this work by @akshitajha and team! Check it out :) I’ve been feeling overwhelmed in keeping up with the great work being done in cross-cultural NLP so thought of starting this list of awesome resources😌 GitHub: It is in no means
@akshitajha
Akshita Jha
28 days
New #ACL2024 paper alert📢: Introducing ViSAGe: "Visual Stereotypes Around the Globe" - a dataset enabling evaluation of known nationality-based stereotypes in Text-to-Image models. Work w/ @sunipa17 , @vinodkpg , @cephaloponderer , Sarah, @shachi_dave , @qadrida , @chandakreddy 1/n
Tweet media one
1
10
68
3
10
53
@simi_97k
Simran Khanuja
2 years
Ecstatic to share that I'll be joining @SCSatCMU for my PhD at LTI this Fall! I'll be working with @gneubig and @dan_fried among many others! I've really enjoyed talking to students and faculty at CMU and am very excited to embark on this journey✨ (1/n)
56
6
390
@simi_97k
Simran Khanuja
7 months
How would you choose the best data instances to label, that maximize the performance of a model on target data? What if your target data is multilingual and you have no annotators in those languages? Our new work, DeMuX, addresses this problem. (1/n)
4
34
241
@simi_97k
Simran Khanuja
2 months
Ever noticed how Pixar adapts movies for international markets? The beloved newscaster in Zootopia is a jaguar in Brazil, a panda in China, a koala in Australia … While machine translation (MT) has only dealt with language in speech/text thus far, we extend the scope of MT to
Tweet media one
8
35
222
@simi_97k
Simran Khanuja
1 year
Grateful to have received the best paper award at SLT 2022 for FLEURS! FLEURS is a multi-lingual (102 languages), multi-modal (speech-text), n-way parallel dataset, built on top of Flores-101. (1/n)
@alex_conneau
Alexis Conneau
1 year
Our FLEURS paper won the best paper award at SLT 2022! @ieee_slt SLT: arXiv: Thanks to the organizers! Grateful for the collaboration with many great colleagues 🙂
6
19
116
15
7
139
@simi_97k
Simran Khanuja
2 years
It was great meeting with undergrad students passionate about research at BITS-Goa! In the last couple of years, they've successfully set up research groups like LRG, SAiDL etc., significantly enhancing the research culture on campus. (1/2)
Tweet media one
2
6
138
@simi_97k
Simran Khanuja
3 months
Now accepted to NAACL 2024 ❤️ Excited to present this in Mexico City and continue building upon this work🎊
@simi_97k
Simran Khanuja
7 months
How would you choose the best data instances to label, that maximize the performance of a model on target data? What if your target data is multilingual and you have no annotators in those languages? Our new work, DeMuX, addresses this problem. (1/n)
4
34
241
8
7
121
@simi_97k
Simran Khanuja
3 years
Since multilingual LMs cannot equitably represent 100+ languages, we have recently witnessed the growth of a language/domain specific pre-trained model universe. In our ACL 2021 Findings paper, we make a first attempt at merging multiple pre-trained LMs using knowledge distillation (KD). (1/2)
4
3
111
@simi_97k
Simran Khanuja
3 years
Excited to share a technical write-up on MuRIL, now available on arxiv!
2
13
110
@simi_97k
Simran Khanuja
6 months
I recently gave a lecture on Image-Text Modeling for Multilingual NLP at CMU and thought I'd share my slides in case interested folks may find it useful! Here are a few things covered in the slides. (1/n)
2
17
107
@simi_97k
Simran Khanuja
3 years
Greetings all :) Today, @BigAmeya and I will be conducting a TF tutorial session at the CVIT IIIT Summer School at 7PM IST. The session will be a gentle introduction to TF 2.0 with two interesting applications in NLP and GNNs! (1/2)
6
5
82
@simi_97k
Simran Khanuja
2 years
Check out their amazing work on fairness in the Indian context! P.S: The first author, Shaily, is applying for a PhD this year and is a passionate young researcher! Do keep an eye out for her application :)
@shaily99
Shaily
2 years
Check out our @aaclmeeting paper “Re-contextualizing Fairness in NLP: The Case of India”. We analyze India-specific biases in #NLProc and propose a research agenda for meaningful fairness interventions. w/ @sunipadev @partha_p_t @shachi_dave @vinodkpg 🧵
Tweet media one
6
31
229
1
3
76
@simi_97k
Simran Khanuja
3 years
We have released the pre-trained model (with the MLM layer for masked word predictions) on HuggingFace.
2
14
71
@simi_97k
Simran Khanuja
30 days
I’m in the Bay Area for the summer ☀️and attending NAACL in Mexico City from June 13-23! Please feel free to DM for a coffee chat if y’all are around :) Would love to meet up with fellow researchers/friends 💕
8
0
69
@simi_97k
Simran Khanuja
2 years
Come join us at #DecodewithGoogle 2022 where @ManishGuptaMG1 and I will be sharing about how our Research team is tackling unique Indian challenges with simple, local and pathbreaking solutions! Register now: #DecodeWithGoogle #Google #GoogleIndia
0
10
65
@simi_97k
Simran Khanuja
1 year
🔍📊Came across three works today, all benchmarking and evaluating the multilingual capabilities of LLMs. All consistently show how LLMs are significantly outperformed by smaller fine-tuned LMs for (most) tasks! (1/n)
2
9
63
@simi_97k
Simran Khanuja
1 year
And it's a wrap🎉We had an exciting hackathon @LTIatCMU where we had 13 teams working with LLMs on a diverse range of topics! (1/n)
Tweet media one
1
2
60
@simi_97k
Simran Khanuja
11 months
Excited to attend #ACL2023NLP in person this week! Feel free to reach out if anyone wants to catch up :) A few things I’ve been interested in these days: a) multilingual, low (text) resource NLP (as always 😌); b) sample efficiency in training/fine-tuning; (1/3)
3
1
59
@simi_97k
Simran Khanuja
1 year
It escapes me why every official US document terms foreign nationals as "aliens". Has anyone ever called for this to change?
4
1
53
@simi_97k
Simran Khanuja
1 year
It was so fun to work on this with everyone! Literal translations of metaphors in other languages never failed to make us laugh 😅 Refer to thread for more details on our work!
@_emliu
Emmy Liu
1 year
"आज-कल NLP Research के साथ बने रहना उतना ही आसान है जितना कि मानसून मॆं भीगने से बचे रहना!" . Did you understand? How about LMs? Our #ACL2023 Findings paper explores multilingual models' cultural understanding through figurative language in 7 langs 🌎(1/9)
Tweet media one
5
39
204
2
3
51
@simi_97k
Simran Khanuja
1 year
I'll be attending @eaclmeeting (May 1st-7th) in person! Would love to catch up with those attending :) Happy to chat about all-things-research (especially multilingual NLP), life @LTIatCMU as a grad student, or anything else :)
2
0
45
@simi_97k
Simran Khanuja
3 years
Great opportunity for students to contribute to the progress of Indian NLP. Please do participate!
@partha_p_t
Partha Talukdar
3 years
#chaii2021 , a QA research challenge in Hindi and Tamil, is now live. You can participate even if you didn't pre-register. Build models and data, win prizes, and advance #IndicNLP ! #NLPforAll @GoogleAI @GoogleIndia
1
32
152
0
4
39
@simi_97k
Simran Khanuja
1 month
Congratulations! I'm so excited to see Shuyan's lab and research grow! Potential applicants: Shuyan is one of the kindest people I know, and a great researcher ofc <3
@shuyanzhxyc
Shuyan Zhou
1 month
I am thrilled to announce that I will be joining @DukeU @dukecompsci as an Assistant Professor in summer 2025. Super excited for the next chapter! Stay tuned for the launch of my lab 🧠🤖
Tweet media one
110
29
545
1
2
37
@simi_97k
Simran Khanuja
2 years
It was a humbling experience to be able to interact with @MilindTambe_AI ! Very insightful AMA :)
@MilindTambe_AI
Milind Tambe
2 years
Excited to be in our @GoogleAI research India office in Bangalore after two years! Awesome to meet with lab director @ManishGuptaMG1 , with our #AIforSocialgood team, other labmates in the office; most rewarding was Q&A session with predocs.
Tweet media one
Tweet media two
Tweet media three
3
1
167
0
1
34
@simi_97k
Simran Khanuja
6 months
Go talk to Anjali as she presents her cool work! She’s really passionate about advancing dialectal research :)
@anjali_ruban
Anjali Kantharuban
6 months
📣 Presenting this work on performance disparities across non-standard dialects at EMNLP's poster session 7! Come say hi 11-12:30 Dec 10 (today) in the East Foyer!
Tweet media one
1
9
60
2
2
32
@simi_97k
Simran Khanuja
4 years
It was great fun to host @JeffDean in a Fireside Chat on his virtual India visit! We had a small quiz titled, "Two truths (and a lie?)" centred around "true facts" :) He then answered several questions from fellow Googlers. Thanks for participating in this!
1
0
31
@simi_97k
Simran Khanuja
3 years
Great work and a brilliant contribution to Indian NLP! Congrats :)
@NasscomR
nasscom insights
3 years
NASSCOM congratulates @iitmadras for winning the #AIGamechangers 2021 award in the #NLP category for their use case: Samanantar - Translation corpora and models for the next billion users #NASSCOMXperienceAI @MicrosoftIndia @DeloitteIndia #AI #ML #translation #languages
2
14
49
0
0
28
@simi_97k
Simran Khanuja
2 years
Lastly, but most importantly, I'd like to take this opportunity to thank my parents, for supporting my dreams despite them having limited knowledge about research as a career. I can never thank them enough :)
0
0
28
@simi_97k
Simran Khanuja
2 years
I'd also like to thank my wonderful peers and collaborators at @MSFTResearch and @GoogleAI including @shachi_dave , @rajiv120kb , @shaily99 , @sanad_maker , @SebastinSanty , @melvinjohnsonp , @alex_conneau , @ankurbpn , who have been very supportive and I've learnt so much from all!(3/n)
1
1
27
@simi_97k
Simran Khanuja
3 years
Great work on **much needed** token-free models!
@colinraffel
Colin Raffel
3 years
Can your NLP model handle noooisy mEsSy #realworldtext ? ByT5 works on raw UTF-8 bytes (no tokenization!), beats SoTA models on many popular tasks, and is more robust to noise. 📜 Preprint: 💾 Code/Models: Summary thread ⬇️ (1/9)
Tweet media one
6
149
647
0
0
27
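To make the "raw UTF-8 bytes (no tokenization!)" idea in the quoted tweet concrete, here is a minimal Python sketch of byte-level encoding. It is an illustration only, not ByT5's actual tokenizer (which, among other things, also handles special tokens); the function names `to_byte_ids` and `from_byte_ids` are hypothetical.

```python
def to_byte_ids(text):
    """Encode text as a list of UTF-8 byte values (0-255), with no vocabulary lookup."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids):
    """Decode byte values back to text; malformed byte sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


# Noisy casing and non-Latin scripts need no special handling:
# every string maps to some sequence of bytes.
noisy = "noooisy mEsSy"
ids = to_byte_ids(noisy)
roundtrip = from_byte_ids(ids)
```

One consequence visible in the sketch: non-ASCII characters expand to several IDs each (a Devanagari letter takes three bytes), so byte-level inputs are longer than subword inputs for the same text.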
@simi_97k
Simran Khanuja
4 months
This is so cool!
@JeffDean
Jeff Dean (@🏡)
4 months
Kalamang Translation One of the most exciting examples in the report involves translation of Kalamang. Kalamang is a language spoken by fewer than 200 speakers in western New Guinea in the east of Indonesian Papua (). Kalamang has almost no online
Tweet media one
Tweet media two
13
122
730
0
2
27
@simi_97k
Simran Khanuja
2 years
I wouldn't be here without the constant guidance and support of my brilliant mentors @partha_p_t , @monojitchou , Sunayana Sitaram and @seb_ruder . I'm immensely grateful to each one of them, all of whom have enabled me to learn and grow so much in the past few years. (2/n)
1
0
26
@simi_97k
Simran Khanuja
3 years
Accurate description✨Do apply and don't hesitate to reach out for more information :)
@ManishGuptaMG1
Manish Gupta
3 years
Applications are open for our pre-doc researcher program. This is one of my favorite programs, where we provide exciting research opportunities to recent graduates with an undergraduate (or Masters) degree, infecting them with the "research bug" 🙂
1
53
182
0
2
24
@simi_97k
Simran Khanuja
3 years
IKDD has opened up their networking sessions for all :) Great opportunity to interact with like-minded people. Starting at 12:15PM IST. I'll be hosting NLP (2) where @monojitchou is our guest speaker! Do join :) The corresponding meeting rooms are here
1
0
24
@simi_97k
Simran Khanuja
17 days
Great study showing that the common practice of training VLMs on English-filtered image-text pairs harms communities of lower socioeconomic status! Train on all of your data for improved cultural understanding of images (even if performance on Western-centric benchmarks takes a
@ibomohsin
Ibrahim Alabdulmohsin | إبراهيم العبدالمحسن
21 days
Want your VLM to reflect the world's rich diversity 🌍? We’re very excited to share our recent research on this topic. TLDR: to build truly inclusive models that work for everyone, don’t filter by English, and check out our recommended evaluation benchmarks. (1/7)
Tweet media one
3
27
136
0
3
22
@simi_97k
Simran Khanuja
2 years
Would highly encourage alumni to reach out and contribute in any capacity :) All thanks to efforts of faculty and students like @baths_veeky @RajaswaPatil and others
0
0
23
@simi_97k
Simran Khanuja
1 year
Join our exciting seminar on LLMs! We have talks, a panel discussion, and a hackathon lined up :)
@LTIatCMU
Language Technologies Institute | @CarnegieMellon
1 year
🎉🌐 Join us on April fool’s day for an exciting event at the @LTIatCMU at @CMU on April 1st & 2nd! We're diving deep into the fascinating world of large language models and their transformative impact on academia, industry, and society at large. 🌐🎉
1
17
82
0
1
23
@simi_97k
Simran Khanuja
2 months
So glad you liked it ❤️
@lina_conti
Lina Conti
2 months
Pick of the week @fbk_mt : Very original (and fun!) work by @simi_97k et al. on "translating" images to make them more relevant to different cultures
Tweet media one
1
4
21
1
0
22
@simi_97k
Simran Khanuja
3 years
As the next generation of smartphone users is expected to permeate through several strata of society (many of whom may not know how to read/write), voice assistants will definitely play a very important role in building inclusive technology. Highly impactful research! @gneubig
@LTIatCMU
Language Technologies Institute | @CarnegieMellon
3 years
LTI Prof. @gneubig chatted with @905wesa today about his work in expanding the reach of spoken-language translation systems and being named a finalist for the Blavatnik Award for Young Scientists. See what he had to say here:
0
5
19
1
0
21
@simi_97k
Simran Khanuja
3 years
Check out MuRIL repackaged for transformers as well :)
0
1
21
@simi_97k
Simran Khanuja
26 days
Subha and team at LJL have been doing great work on language preservation and supporting speakers of low (digital) resource languages. I'm so excited to work alongside passionate volunteers from diverse backgrounds, towards the common goal of advancing language technologies for
@Linguistics_LJL
OpenNLP
26 days
We've received an overwhelming response from both mentors and mentees. We're excited to start our LJL's Research Labs ☀ Summer Cohort ☀ on June 2nd! Ranging from startups to nonprofits to research, 🌅 they're all working on making natural language processing more accessible.
Tweet media one
1
1
3
0
0
17
@simi_97k
Simran Khanuja
3 years
Awesome work! Excited about its potential and creative applications :)
@krishna2
Krishna Srinivasan
3 years
We are very happy to announce WIT: Wikipedia-Based Image Text Dataset, a large multimodal multilingual dataset. @GoogleAI
1
35
126
1
0
17
@simi_97k
Simran Khanuja
1 year
"Taking the fun out of YouTube" won🥇 by creating a chrome extension to make YT video titles non-clickbait-y! They obtain pairs of titles by prompting GPT and fine-tune LLaMa with the data obtained, to deploy for inference. Team: @AthiyaD , Abuzar Khan, Alex Li (2/n)
Tweet media one
2
1
15
@simi_97k
Simran Khanuja
3 years
MuRIL is a multilingual model specifically built for Indian languages. Work done w/ @partha_p_t at @GoogleAI . Please mail your queries/feedback to muril-contact@google.com.
1
0
10
@simi_97k
Simran Khanuja
1 year
"ChatHuman" won🥈 by prompting LLMs to ask for human help when completing tasks, to better understand human intent and produce a refined output! Team: @_Hao_Zhu @nlpxuhui @prakhariitr @jimin__sun and Kaixin Ma (3/n)
Tweet media one
1
1
10
@simi_97k
Simran Khanuja
3 years
We have also released the encoder on TFHub with an additional pre-processing module, that processes raw text into the expected input format for the encoder.
1
0
10
@simi_97k
Simran Khanuja
6 months
Congratulations! Well deserved🎉
@AdithyaPratapa
Adithya Pratapa
6 months
Delighted to share that our @emnlpmeeting work on background summarization received an outstanding paper award in the summarization track 🏆 w/ Kevin Small and @markusdr Paper: Github: Highlights below, (1/4) #EMNLP2023
12
5
64
0
0
10
@simi_97k
Simran Khanuja
2 months
All model outputs can be found here: We have several discussion points in Sections 7 and 8 of the paper on: a) Why we categorize culture based on country b) How a one-to-one mapping may never exist c) The tradeoff between relatability vs. stereotyping
1
1
9
@simi_97k
Simran Khanuja
1 year
This is very impactful work for Indian languages by a superb interdisciplinary team! Congratulations to all involved✨
@prajdabre1
Raj Dabre
1 year
🚨 📢 Preprint Alert After more than a year of hard work, we are pleased to introduce IndicTrans2, the first machine translation system supporting all 22 scheduled Indic languages. 📎: 💻: ▶️: Thread👇[1/n]
8
92
388
0
0
9
@simi_97k
Simran Khanuja
2 months
Since time immemorial, translators have advocated the need for cultural adaptation in translation. With increasing multimodal content online, translating all modes is essential for complete transfer of meaning. In translation studies, people use the term transcreation to
1
0
9
@simi_97k
Simran Khanuja
2 months
Michael has been doing some amazing work! Go watch his talk if y’all are around :)
@m2saxon
Michael Saxon @ NAACL🇲🇽
2 months
I'm in the Baltimore/DC area this week to share some cool stuff we've been cooking! I'll give this talk at UMBC on Monday, Georgetown on Tuesday, UMD on Wednesday, and as a poster at MASC at JHU on Friday. If you're around and want to chat please ask! Happy to come say hi! 😃
Tweet media one
3
3
41
0
1
8
@simi_97k
Simran Khanuja
1 year
💡Additionally, XTREME-UP shows how byte-based models especially help morphologically rich languages, as compared to sub-word models. ⌛️An important time to be working on making our NLP systems more inclusive and equitable🤝 through data, modeling, and evaluation efforts :) (Fin)
0
0
7
@simi_97k
Simran Khanuja
2 months
Next, we construct a two-part evaluation dataset to test these pipelines: a) concept – where we aggregate 600 images from seven countries. Each country has 85 commonly occurring concepts across 17 categories (like food, celebrations etc.). We follow the data annotation protocol
Tweet media one
1
0
8
@simi_97k
Simran Khanuja
3 years
Do let us know if anyone is interested in attending the same. Thanks @divy93t for the opportunity! (2/2)
2
0
8
@simi_97k
Simran Khanuja
7 months
This work was done with amazing collaborators @derylucio , Srinivas, and @gneubig . We’ve added support for MT and custom data in our codebase. Please e-mail or raise an issue in case of questions or concerns! We hope you find our work useful! Code: (n/n)
0
0
7
@simi_97k
Simran Khanuja
4 years
Great opportunities!
@danish037
Danish Pruthi
4 years
Indian students interested in NLP, looking to do some solid research before starting graduate school, should consider: 1. Pre-doctoral program at Google w/ @partha_p_t and team. 2. MSR RF program w/ @monojitchou @kalikabali , Sunayana and others. ()
5
13
106
1
0
7
@simi_97k
Simran Khanuja
1 year
I specifically contributed to cross-modal retrieval since I was (and still am) pretty excited by its potential to broaden information access across modalities and languages! (3/n)
1
0
7
@simi_97k
Simran Khanuja
1 year
We had many more interesting explorations, all of which we will shortly update on our GitHub repo ()! Shoutout to our high school student team that attempted to develop a metric system to evaluate the accuracy and consistency of LLM references! (5/n)
Tweet media one
1
0
7
@simi_97k
Simran Khanuja
2 months
We construct three pipelines for this unprecedented task, leveraging state-of-the-art generative models. End-to-end image-editing models simply paste the flag or culturally specific entities (like sakura blossoms or Mt. Fuji for Japan) to increase cultural relevance. Hence we
Tweet media one
1
0
7
@simi_97k
Simran Khanuja
24 days
Great talk by Michael, very cool work :) I really liked your idea on obtaining varied contextual embeddings of the concepts in question. This paper () explores a very similar idea for image-editing, where they generate multiple sentences of the concept in
@m2saxon
Michael Saxon @ NAACL🇲🇽
25 days
NAACL video and NeurIPS submission deadlines all in the same week🙃 Anyway, here's my prerecorded video for our NAACL oral "Lost in Translation? ... (long title)" where we analyze how translation errors challenge multilinguality assessment in T2I models!
1
1
14
1
0
7
@simi_97k
Simran Khanuja
3 years
@partha_p_t @aryaman2020 @daanvanesch Thanks @aryaman2020 . We shall release improved versions of MuRIL soon which hopefully address some gaps present right now :) We are also considering distilled and large versions to serve all use-cases.
2
0
7
@simi_97k
Simran Khanuja
6 months
slides inspired by works of @delliott @ebugliarello @quAVTum @m2saxon @GregorGeigle @zengyan97 @danish037 and many others! Thank you for your incredible work :) This is in no way exhaustive but please feel free to reach out if I've missed relevant work! (Fin.)
0
0
5
@simi_97k
Simran Khanuja
3 years
Well, it is a tough problem, but highly applicable and useful! Work done w/ @partha_p_t and @melvinjohnsonp . Pre-print out soon.
0
0
6
@simi_97k
Simran Khanuja
2 months
Results? The best pipelines can only successfully transcreate 5% of images for some countries (Nigeria) in the easier concept dataset, while no transcreation is successful for some countries (Portugal) in the harder application dataset: (6/n)
Tweet media one
Tweet media two
1
0
6
@simi_97k
Simran Khanuja
1 year
While LLM performances on English are on a spectacular rise📈, they are only widening disparities amongst languages 📉 The three works are : MEGA (); BUFFET () and XTREME-UP () (2/n)
@seb_ruder
Sebastian Ruder
1 year
Our evaluation highlights that there is lots of headroom left to improve performance on under-represented languages. Byte-based methods have a clear edge on such morphologically rich languages while small fine-tuned models significantly outperform large in-context models.
1
0
8
1
0
6
@simi_97k
Simran Khanuja
1 year
Congratulations 🎉
@divy93t
Divy Thakkar
1 year
Congratulations, @partha_p_t on winning the ACM India Early Career Researcher Award 2022! It is an honour to have you as a friend and a colleague!
9
17
234
0
0
6
@simi_97k
Simran Khanuja
11 months
Does anyone know the hyperparameters used to train mbart-50? I may have missed it in the paper () if it's there, but a pointer would be helpful, TIA!
1
1
6
@simi_97k
Simran Khanuja
2 months
This highlights the challenging nature of this task! Image-editing models simply don’t understand the meaning of “making something culturally relevant” and do funny (at times borderline offensive or stereotypical) edits. Some examples below. Can you guess the target country for
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
0
6
@simi_97k
Simran Khanuja
2 months
This has been an exciting exploration and I’m eagerly looking forward to working on the many open problems in this space! Please feel free to reach out if this work interests you and you’d like to engage further! Work done with amazing supportive collaborators: @gneubig ,
1
0
6
@simi_97k
Simran Khanuja
2 months
Finally, we transcreate collected images using all three pipelines, and conduct a multi-faceted human evaluation. For successful transcreation, the edited image should be more culturally relevant than the original image. Further, for: a) concept – it should belong to the same
Tweet media one
1
0
6
@simi_97k
Simran Khanuja
7 months
For the same target languages, the selected data varies across tasks. When Urdu is the target for instance, Hindi data is chosen for most tasks, except for NER, where Farsi/Arabic, which share a script with Urdu, are preferred. (5/n)
1
0
6
@simi_97k
Simran Khanuja
6 months
I start with motivating how image-text modeling can benefit multilingual NLP. Briefly, the visual modality can: a) bridge culturally consistent concepts; b) encode cultural diversity within a concept (below); c) enrich the representation of cross-culturally unique concepts. (2/n)
Tweet media one
1
0
6
@simi_97k
Simran Khanuja
1 year
Finally, thanks to everyone here at CMU LTI who made all of this possible in 10 days! Until next time :) (Fin)
0
0
6
@simi_97k
Simran Khanuja
6 months
😢😢
@aralalobo
Aral Lobo
6 months
English signboards destroyed on Lavelle Road. Where are the cops!! At least 20 shops!!
469
607
2K
1
0
5
@simi_97k
Simran Khanuja
1 year
"MindfulComm" won🥉 by proposing to create a plug-in that paraphrases text to adhere to non-violent communication principles! Team: Ya-Fang Lin, Benjamin Panny (4/n)
Tweet media one
1
1
5
@simi_97k
Simran Khanuja
7 months
Past works have either a) focused on one target language; b) assumed annotator availability in target languages; c) not prescribed data-points to label; d) relied on past model performance, which is expensive to obtain. Our e2e framework removes the need for such assumptions. (2/n)
Tweet media one
1
0
5
@simi_97k
Simran Khanuja
1 year
FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Cross-Modal Retrieval. (2/n)
1
0
5
@simi_97k
Simran Khanuja
7 months
Finally, we test the applicability of DeMuX in low-budget settings, acquired in one active learning round. We observe substantial gains for lower budgets, with a trend of diminishing returns as the budget increases. (6/n)
Tweet media one
1
0
4
@simi_97k
Simran Khanuja
2 months
Personally, this has been a great learning experience doing multiple data annotation rounds, human evaluation, and experimenting with image generation models. So glad to share what I've been up to the past year :) (fin)
0
0
4
@simi_97k
Simran Khanuja
7 months
Our strategies outperform baselines in 84% of test cases, including multilingual target pools. We observe that a hybrid strategy combining both a) and b) above, performs best for token-level tasks, but picking globally uncertain points gains precedence for NLI and QA. (4/n)
Tweet media one
1
0
4
@simi_97k
Simran Khanuja
1 year
Excited to see how FLEURS catalyzes research in low-resource speech understanding :) (n/n).
0
0
4
@simi_97k
Simran Khanuja
7 months
Assuming access to small amounts of unlabelled target data, we develop strategies to pick points that: a) lie in the neighborhood of target points; and b) the model is most uncertain about, so that labelling them would be most beneficial to the model. (3/n)
Tweet media one
1
0
4
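The two selection criteria in the tweet above (closeness to target points, and model uncertainty) can be illustrated with a small Python sketch. This is not the DeMuX implementation: the entropy-based uncertainty, the Euclidean nearest-target distance, the linear scoring rule, and the names `select` and `alpha` are all assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def min_distance(point, targets):
    """Euclidean distance from a candidate embedding to its nearest target embedding."""
    return min(math.dist(point, t) for t in targets)


def select(candidates, targets, budget, alpha=1.0):
    """Return indices of the `budget` best unlabelled candidates to annotate.

    candidates: list of (embedding, predicted_probs) pairs (hypothetical format).
    targets: list of embeddings of unlabelled target-domain points.
    alpha: hypothetical weight trading off proximity against uncertainty.
    """
    scored = [
        (entropy(probs) - alpha * min_distance(emb, targets), i)
        for i, (emb, probs) in enumerate(candidates)
    ]
    scored.sort(reverse=True)  # highest score first: uncertain and close to targets
    return [i for _, i in scored[:budget]]


# Toy pool: candidate 0 is uncertain and near the target, 1 is confident,
# and 2 is uncertain but far away.
pool = [([0.0, 0.0], [0.5, 0.5]), ([0.0, 0.0], [0.99, 0.01]), ([10.0, 10.0], [0.5, 0.5])]
picked = select(pool, targets=[[0.0, 0.0]], budget=2)
```

In this toy pool the far-away candidate is ranked last despite being uncertain, which is the intuition behind combining the two criteria rather than using uncertainty alone.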
@simi_97k
Simran Khanuja
6 months
Next, I cover important downstream tasks that multilingual multimodal models can solve. These include cross-lingual, cross-modal a) VQA; b) NLI and reasoning; c) retrieval; d) image/caption generation. (3/n)
Tweet media one
1
0
3
@simi_97k
Simran Khanuja
11 months
c) leveraging multimodal signals to improve NLU for non-English languages and build culturally inclusive systems; d) how can we crowdsource high quality data in the LLM era; e) also been thinking about better pretraining architectures for multilingual models (2/3)
1
0
3
@simi_97k
Simran Khanuja
1 year
@eaclmeeting @LTIatCMU Feel free to DM if someone would like to catch up!
0
0
2
@simi_97k
Simran Khanuja
2 years
@mdredze If only the disclaimer was generated by the model too :’)
1
0
3
@simi_97k
Simran Khanuja
2 years
@AnshKhurana11 @Stanford Congrats, so happy for you, you deserve all the✨✨
0
0
2
@simi_97k
Simran Khanuja
2 years
@adityaasinha @GoogleAI @UTCompSci @uwcse Congratulations! Well deserved ✨✨
0
0
2
@simi_97k
Simran Khanuja
2 years
@shaily99 @aaclmeeting Yay yay you totally deserve it❤️❤️🔥🔥
0
0
2
@simi_97k
Simran Khanuja
2 years
@saujasv @SCSatCMU @dan_fried @LTIatCMU Congrats! Looking forward to exciting research discussions and collaborations :)
0
0
2
@simi_97k
Simran Khanuja
6 months
Next we discuss biases in image generation models across languages and cultures. Finally, my favorite part of this was revisiting the motivation in context of prior work, to think about the *multiple* open questions and challenges yet to be solved in this exciting space! (5/n)
Tweet media one
1
0
2
@simi_97k
Simran Khanuja
4 years
@adityaasinha *fortunate ones 🙃
0
0
2