I really like this work by
@akshitajha
and team! Check it out :)
I’ve been feeling overwhelmed in keeping up with the great work being done in cross-cultural NLP so thought of starting this list of awesome resources😌
GitHub:
It is by no means
Ecstatic to share that I'll be joining
@SCSatCMU
for my PhD at LTI this Fall! I'll be working with
@gneubig
and
@dan_fried
among many others! I've really enjoyed talking to students and faculty at CMU and am very excited to embark on this journey✨ (1/n)
How would you choose the best data instances to label, that maximize the performance of a model on target data? What if your target data is multilingual and you have no annotators in those languages?
Our new work, DeMuX, addresses this problem.
(1/n)
Ever noticed how Pixar adapts movies for international markets? The beloved newscaster in Zootopia is a jaguar in Brazil, a panda in China, a koala in Australia …
While machine translation (MT) has only dealt with language in speech/text thus far, we extend the scope of MT to
Grateful to have received the best paper award at SLT 2022 for FLEURS!
FLEURS is a multi-lingual (102 languages), multi-modal (speech-text), n-way parallel dataset, built on top of Flores-101. (1/n)
Our FLEURS paper won the best paper award at SLT 2022!
@ieee_slt
SLT:
arXiv:
Thanks to the organizers! Grateful for the collaboration with many great colleagues 🙂
It was great meeting with undergrad students passionate about research at BITS-Goa! In the last couple of years, they've successfully set up research groups like LRG, SAiDL etc., significantly enhancing the research culture on campus. (1/2)
Since multilingual LMs cannot equitably represent 100+ languages, we have recently witnessed the growth of a language/domain-specific pre-trained model universe. In our ACL 2021 Findings paper, we make a first attempt at merging multiple pre-trained LMs using KD. (1/2)
I recently gave a lecture on Image-Text Modeling for Multilingual NLP at CMU and thought I'd share my slides in case interested folks find them useful!
Here are a few things covered in the slides. (1/n)
Greetings all :) Today,
@BigAmeya
and I will be conducting a TF tutorial session at the CVIT IIIT Summer School at 7PM IST. The session will be a gentle introduction to TF 2.0 with two interesting applications in NLP and GNNs! (1/2)
Check out their amazing work on fairness in the Indian context!
P.S.: The first author, Shaily, is applying for a PhD this year and is a passionate young researcher! Do keep an eye out for her application :)
I’m in the Bay Area for the summer ☀️and attending NAACL in Mexico City from June 13-23! Please feel free to DM for a coffee chat if y’all are around :) Would love to meet up with fellow researchers/friends 💕
🔍📊Came across three works today, all benchmarking and evaluating the multilingual capabilities of LLMs. All consistently show how LLMs are significantly outperformed by smaller fine-tuned LMs for (most) tasks! (1/n)
Excited to attend
#ACL2023NLP
in person this week! Feel free to reach out if anyone wants to catch up :) A few things I’ve been interested in these days: a) multilingual, low (text) resource NLP (as always 😌); b) sample efficiency in training/fine-tuning; (1/3)
It was so fun to work on this with everyone! Literal translations of metaphors in other languages never failed to make us laugh 😅 Refer to thread for more details on our work!
"आज-कल NLP Research के साथ बने रहना उतना ही आसान है जितना कि मानसून में भीगने से बचे रहना!" (roughly: "Keeping up with NLP research these days is about as easy as staying dry in the monsoon!") Did you understand? How about LMs? Our
#ACL2023
Findings paper explores multilingual models' cultural understanding through figurative language in 7 langs 🌎(1/9)
I'll be attending
@eaclmeeting
(May 1st-7th) in person! Would love to catch up with those attending :) Happy to chat about all-things-research (especially multilingual NLP), life
@LTIatCMU
as a grad student, or anything else :)
Congratulations! I'm so excited to see Shuyan's lab and research grow! Potential applicants: Shuyan is one of the kindest people I know, and a great researcher ofc <3
I am thrilled to announce that I will be joining
@DukeU
@dukecompsci
as an Assistant Professor in summer 2025. Super excited for the next chapter! Stay tuned for the launch of my lab 🧠🤖
Excited to be in our
@GoogleAI
research India office in Bangalore after two years! Awesome to meet with lab director
@ManishGuptaMG1
, with our
#AIforSocialgood
team, and other labmates in the office; the most rewarding part was a Q&A session with the predocs.
📣 Presenting this work on performance disparities across non-standard dialects at EMNLP's poster session 7! Come say hi 11-12:30 Dec 10 (today) in the East Foyer!
It was great fun to host
@JeffDean
in a Fireside Chat on his virtual India visit! We had a small quiz titled, "Two truths (and a lie?)" centred around "true facts" :) He then answered several questions from fellow Googlers. Thanks for participating in this!
Lastly, but most importantly, I'd like to take this opportunity to thank my parents for supporting my dreams, despite their limited knowledge of research as a career. I can never thank them enough :)
Can your NLP model handle noooisy mEsSy
#realworldtext
?
ByT5 works on raw UTF-8 bytes (no tokenization!), beats SoTA models on many popular tasks, and is more robust to noise.
📜 Preprint:
💾 Code/Models:
Summary thread ⬇️ (1/9)
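Concretely, "no tokenization" means the model's input ids are just the UTF-8 bytes of the text. A minimal sketch of the byte-to-id mapping (the +3 offset, which reserves ids for pad/eos/unk, follows the released ByT5 vocabulary; treat the exact offset as an assumption for other checkpoints):

```python
def byte_ids(text: str, offset: int = 3) -> list[int]:
    """ByT5-style input ids: raw UTF-8 bytes, shifted past special tokens (pad/eos/unk)."""
    return [b + offset for b in text.encode("utf-8")]

print(byte_ids("hi"))                 # [107, 108]: one id per ASCII character
print(len("नमस्ते".encode("utf-8")))  # 18: Devanagari takes 3 bytes per code point
```

The trade-off: byte sequences are longer than subword sequences (especially for non-Latin scripts), so the robustness to noisy text comes with extra sequence length.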
Kalamang Translation
One of the most exciting examples in the report involves translation of Kalamang. Kalamang is a language spoken by fewer than 200 speakers in western New Guinea in the east of Indonesian Papua (). Kalamang has almost no online
I wouldn't be here without the constant guidance and support of my brilliant mentors
@partha_p_t
,
@monojitchou
, Sunayana Sitaram and
@seb_ruder
. I'm immensely grateful to each one of them, all of whom have enabled me to learn and grow so much in the past few years. (2/n)
Applications are open for our pre-doc researcher program. This is one of my favorite programs, where we provide exciting research opportunities to recent graduates with an undergraduate (or Masters) degree, infecting them with the "research bug" 🙂
IKDD has opened up their networking sessions for all :) Great opportunity to interact with like-minded people. Starting at 12:15PM IST. I'll be hosting NLP (2) where
@monojitchou
is our guest speaker! Do join :) The corresponding meeting rooms are here
Great study showing that the common practice of training VLMs on english-filtered image-text pairs harms communities of lower socioeconomic status! Train on all of your data for improved cultural understanding of images (even if performance on western-centric benchmarks takes a
Want your VLM to reflect the world's rich diversity 🌍? We’re very excited to share our recent research on this topic. TLDR: to build truly inclusive models that work for everyone, don’t filter by English, and check out our recommended evaluation benchmarks. (1/7)
Would highly encourage alumni to reach out and contribute in any capacity :) All thanks to efforts of faculty and students like
@baths_veeky
@RajaswaPatil
and others
🎉🌐 Join us on April Fools' Day for an exciting event at the
@LTIatCMU
at
@CMU
on April 1st & 2nd! We're diving deep into the fascinating world of large language models and their transformative impact on academia, industry, and society at large. 🌐🎉
As the next generation of smartphone users is expected to permeate several strata of society (many of whom may not know how to read/write), voice assistants will definitely play a very important role in building inclusive technology. Highly impactful research!
@gneubig
LTI Prof.
@gneubig
chatted with
@905wesa
today about his work in expanding the reach of spoken-language translation systems and being named a finalist for the Blavatnik Award for Young Scientists. See what he had to say here:
Subha and team at LJL have been doing great work on language preservation and supporting speakers of low (digital) resource languages. I'm so excited to work alongside passionate volunteers from diverse backgrounds, towards the common goal of advancing language technologies for
We've received an overwhelming response from both mentors and mentees. We're excited to start our LJL's Research Labs ☀ Summer Cohort ☀ on June 2nd!
Ranging from startups to nonprofits to research, 🌅 they're all working on making natural language processing more accessible.
"Taking the fun out of YouTube" won🥇 by creating a Chrome extension that makes YT video titles non-clickbait-y! They obtain pairs of titles by prompting GPT and fine-tune LLaMA on the resulting data, deploying it for inference. Team:
@AthiyaD
, Abuzar Khan, Alex Li (2/n)
MuRIL is a multilingual model specifically built for Indian languages. Work done w/
@partha_p_t
at
@GoogleAI
. Please mail your queries/feedback to muril-contact@google.com.
"ChatHuman" won🥈 by prompting LLMs to ask for human help when completing tasks, to better understand human intent and produce a refined output! Team:
@_Hao_Zhu
@nlpxuhui
@prakhariitr
@jimin__sun
and Kaixin Ma (3/n)
We have also released the encoder on TFHub with an additional pre-processing module that processes raw text into the expected input format for the encoder.
Delighted to share that our
@emnlpmeeting
work on background summarization received an outstanding paper award in the summarization track 🏆
w/ Kevin Small and
@markusdr
Paper:
Github:
Highlights below, (1/4)
#EMNLP2023
All model outputs can be found here:
We have several discussion points in Sections 7 and 8 of the paper on:
a) Why we categorize culture based on country
b) How a one-to-one mapping may never exist
c) The tradeoff between relatability and stereotyping
🚨 📢 Preprint Alert
After more than a year of hard work, we are pleased to introduce IndicTrans2, the first machine translation system supporting all 22 scheduled Indic languages.
📎:
💻:
▶️:
Thread👇[1/n]
Since time immemorial, translators have advocated the need for cultural adaptation in translation. With increasing multimodal content online, translating all modes is essential for complete transfer of meaning. In translation studies, people use the term transcreation to
I'm in the Baltimore/DC area this week to share some cool stuff we've been cooking!
I'll give this talk at UMBC on Monday, Georgetown on Tuesday, UMD on Wednesday, and as a poster at MASC at JHU on Friday.
If you're around and want to chat, please reach out! Happy to come say hi! 😃
💡Additionally, XTREME-UP shows that byte-based models are especially helpful for morphologically rich languages, compared to sub-word models.
⌛️An important time to be working on making our NLP systems more inclusive and equitable🤝 through data, modeling, and evaluation efforts :) (Fin)
Next, we construct a two-part evaluation dataset to test these pipelines:
a) concept – where we aggregate 600 images from seven countries. Each country has 85 commonly occurring concepts across 17 categories (like food, celebrations etc.). We follow the data annotation protocol
This work was done with amazing collaborators
@derylucio
, Srinivas, and
@gneubig
. We’ve added support for MT and custom data in our codebase. Please e-mail or raise an issue in case of questions or concerns!
We hope you find our work useful!
Code: (n/n)
Indian students interested in NLP, looking to do some solid research before starting graduate school, should consider:
1. Pre-doctoral program at Google w/
@partha_p_t
and team.
2. MSR RF program w/
@monojitchou
@kalikabali
, Sunayana and others. ()
I specifically contributed to cross-modal retrieval since I was (and still am) pretty excited by its potential to broaden information access across modalities and languages! (3/n)
We had many more interesting explorations, all of which we will shortly update on our GitHub repo ()! Shoutout to our high school student team that attempted to develop a set of metrics to evaluate the accuracy and consistency of LLM references! (5/n)
We construct three pipelines for this unprecedented task, leveraging state-of-the-art generative models. End-to-end image-editing models simply paste the flag or culturally specific entities (like sakura blossoms or Mt. Fuji for Japan) to increase cultural relevance. Hence we
Great talk by Michael, very cool work :)
I really liked your idea on obtaining varied contextual embeddings of the concepts in question. This paper () explores a very similar idea for image-editing, where they generate multiple sentences of the concept in
NAACL video and NeurIPS submission deadlines all in the same week🙃
Anyway, here's my prerecorded video for our NAACL oral "Lost in Translation? ... (long title)" where we analyze how translation errors challenge multilinguality assessment in T2I models!
@partha_p_t
@aryaman2020
@daanvanesch
Thanks
@aryaman2020
. We shall release improved versions of MuRIL soon, which will hopefully address some of the gaps present right now :) We are also considering distilled and large versions to serve all use cases.
Results? The best pipelines can only successfully transcreate 5% of images for some countries (Nigeria) in the easier concept dataset, while no transcreation is successful for some countries (Portugal) in the harder application dataset: (6/n)
While LLM performance on English is rising spectacularly📈, LLMs are only widening disparities amongst languages 📉 The three works are: MEGA (), BUFFET (), and XTREME-UP () (2/n)
Our evaluation highlights that there is lots of headroom left to improve performance on under-represented languages.
Byte-based methods have a clear edge on such morphologically rich languages while small fine-tuned models significantly outperform large in-context models.
This highlights the challenging nature of this task! Image-editing models simply don’t understand the meaning of “making something culturally relevant” and do funny (at times borderline offensive or stereotypical) edits. Some examples below. Can you guess the target country for
This has been an exciting exploration and I’m eagerly looking forward to working on the many open problems in this space! Please feel free to reach out if this work interests you and you’d like to engage further! Work done with amazing supportive collaborators:
@gneubig
,
Finally, we transcreate collected images using all three pipelines, and conduct a multi-faceted human evaluation. For successful transcreation, the edited image should be more culturally relevant than the original image. Further, for:
a) concept – it should belong to the same
For the same target languages, the selected data varies across tasks. When Urdu is the target, for instance, Hindi data is chosen for most tasks, except for NER, where Farsi/Arabic, which share the same script as Urdu, are preferred. (5/n)
I start with motivating how image-text modeling can benefit multilingual NLP. Briefly, the visual modality can: a) bridge culturally consistent concepts; b) encode cultural diversity within a concept (below); c) enrich the representation of cross-culturally unique concepts. (2/n)
"MindfulComm" won🥉 by proposing to create a plug-in that paraphrases text to adhere to non-violent communication principles! Team: Ya-Fang Lin, Benjamin Panny (4/n)
Past works have either a) focused on one target language; b) assumed annotator availability in target languages; c) not prescribed data points to label; or d) relied on past model performances, which are expensive to obtain. Our e2e framework removes the need for such assumptions. (2/n)
FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Cross-Modal Retrieval. (2/n)
Finally, we test the applicability of DeMuX in low-budget settings, where data is acquired in a single active learning round. We observe substantial gains at lower budgets, with diminishing returns as the budget increases. (6/n)
Personally, this has been a great learning experience doing multiple data annotation rounds, human evaluation, and experimenting with image generation models. So glad to share what I've been up to the past year :) (fin)
Our strategies outperform baselines in 84% of test cases, including multilingual target pools. We observe that a hybrid strategy combining both a) and b) above performs best for token-level tasks, but picking globally uncertain points takes precedence for NLI and QA. (4/n)
Assuming access to small amounts of unlabelled target data, we develop strategies to pick points that: a) lie in the neighborhood of target points; and b) the model is most uncertain about, so that labelling them would be most beneficial to the model. (3/n)
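As a rough sketch of what such a hybrid acquisition score could look like (the function names, distance metric, and normalization below are illustrative assumptions, not DeMuX's actual implementation):

```python
import numpy as np

def hybrid_scores(cand_emb, cand_probs, target_emb, alpha=0.5):
    """Toy hybrid acquisition: favor unlabeled candidates that are
    (a) close to the unlabeled target pool in embedding space, and
    (b) ones the model is uncertain about (high predictive entropy)."""
    # (a) closeness: negative distance to the nearest target embedding
    dists = np.linalg.norm(cand_emb[:, None, :] - target_emb[None, :, :], axis=-1)
    closeness = -dists.min(axis=1)
    # (b) uncertainty: entropy of the model's predicted label distribution
    entropy = -(cand_probs * np.log(cand_probs + 1e-12)).sum(axis=1)
    # normalize each term to [0, 1] before mixing
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-12)
    return alpha * norm(closeness) + (1 - alpha) * norm(entropy)

rng = np.random.default_rng(0)
scores = hybrid_scores(rng.random((100, 8)),            # candidate embeddings
                       rng.dirichlet(np.ones(4), 100),  # predicted label distributions
                       rng.random((10, 8)))             # unlabeled target embeddings
budget = np.argsort(scores)[-5:]                        # label the top-5 under a budget of 5
```

Sweeping alpha trades off target similarity against uncertainty; the best mix is generally task-dependent.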
Next, I cover important downstream tasks that multilingual multimodal models can solve. These include cross-lingual, cross-modal a) VQA; b) NLI and reasoning; c) retrieval; d) image/caption generation. (3/n)
c) leveraging multimodal signals to improve NLU for non-English languages and build culturally inclusive systems; d) how we can crowdsource high-quality data in the LLM era; e) better pretraining architectures for multilingual models (2/3)
Next we discuss biases in image generation models across languages and cultures. Finally, my favorite part of this was revisiting the motivation in context of prior work, to think about the *multiple* open questions and challenges yet to be solved in this exciting space! (5/n)