I’m recruiting PhD students @Duke for fall 2024! Consider applying if you’re interested in reimagining healthcare by developing novel ML/NLP methodology. I can advise students through the CS dept and the Biostats & Bioinformatics dept.
Info here:
Absolutely thrilled to announce that I'll be joining @DukeU next year as an assistant professor, joint between Biostats/Bioinformatics and Computer Science!
Releasing our flexible open-source clinical text annotation tool, built by our amazing undergraduate researcher Ariel Levy. It provides decision aid (recommendations, pre-filled annotations) to make annotation faster and less painful!
How can we pre-train good representations of longitudinal health data?
Scramble a patient’s data (e.g., radiology notes), and ask your model to unscramble it! For chronic conditions, the model learns to attend to features corresponding to disease progression.
Super excited to release this work bringing autocomplete to the EHR, deployed live! The goal is to make clinical notes faster and cleaner to write, all while creating structured data for downstream research. See the full video or the paper () for details!
Introducing Layer Health! We're thrilled to unveil our AI healthcare company spun out of @MIT, backed by $4M from @GVteam, @generalcatalyst, and @inceptionhealth. We’re on a mission to solve the information problem in healthcare. Want to learn more?
Clinicians have significant + shifting information needs over the course of a patient visit. In our new paper, we characterize + predict the notes relevant for clinicians to read, based on the current clinical context. Presenting this work, led by Sharon Jiang, today at #MLHC2023.
Excited to be spending the day at #ML4H2023!
(I’m also recruiting PhD students this cycle at @DukeU, and prospective students should definitely feel free to grab me to chat!)
Clinical notes are hard to understand, for both people and computers. In our paper () presented at #MLHC2020, we dive into the current performance of clinical entity extraction algorithms (hint: they could be better) and propose a path forward for clinical NLP.
Many variables needed to construct timelines for clinical research are trapped in notes 🗒️ Manual extraction can be expensive, and ML is still error-prone. In our #MLHC2021 paper, we explore a hybrid approach that can extract clinical events accurately with minimal oversight!
Getting labeled data for clinical NLP can be prohibitively difficult. In work at #emnlp2022, we find that GPT-3 models perform very well at clinical tasks in the few-shot setting, indicating a new paradigm for transforming EHR notes into actionable data.
Tfw Georgia goes blue (by voting in the daughter of an Indian immigrant!) as a first in my lifetime :') growing up in GA, I constantly felt othered: by questions about my family’s “funny” clothing/food, by a mattress salesman who thought my family slept on coconuts (???) (1/2)
EHRs can be...suboptimal. Our new EHR tool proactively surfaces information (e.g. relevant notes + labs) in a side panel based on what's being typed in the note, so that clinicians can spend a little less of their day clicking. Work led by my HCI colleague @lukesmurray!
Presenting MedKnowts... a new way of writing clinical notes for electronic medical records, developed by AI and HCI researchers at @MIT and clinicians at @BIDMChealth.
Bopping around #ICML2022 this week! Excited to finally make the mythical “conference friends” I’ve heard about. DM me if you want to meet up to talk about NLP/ML for healthcare (or to help me find the best dessert in Baltimore)!
!!! Very excited for this course I TA'd in the spring to finally be publicly available 🩺💻 This graduate course covers topics particularly relevant to machine learning in healthcare, like uncertainty, dataset shift, causal inference, and time-series data in the clinical context.
MIT's class on Machine Learning in Healthcare is now available for free on MIT's OpenCourseWare! All videos, slides, and lecture notes can be found here:
Our new #AISTATS2021 paper enables users to encode domain knowledge for flexible data cleaning of diverse errors. It was selfishly developed since I have spent approximately 90% of my PhD cleaning data (the other 10% has been spent making elaborate research gifs for Twitter).
Excited to share PClean, our domain-specific PPL for scalable, Bayesian data cleaning! Short PClean programs can fix errors, impute missing values, & link duplicates in datasets w/ up to millions of rows. #AISTATS2021 w/ @MonicaNAgrawal, @david_sontag, @vmansinghka
In other news, I’m still trying to find a good lab name. My top contender is currently ML4ML4H (Monica’s Lab for Machine Learning for Healthcare), so please help me find better options.
Our workflow chair @MonicaNAgrawal spent a lot of time sifting through this year's wonderful submissions. Now, she's talking about 5 reasons why she's excited to attend ML4H in person this year!
Register for ML4H
I vividly remember a summer camp where the class decided I shouldn’t have been born a citizen, since my true loyalty could lie with India. I kept waiting for a defense that never came. So for today, I won’t stop smiling at the significance of Kamala Harris’s election (2/2)
After a whirlwind day organizing ML4H yesterday, excited to be spending the rest of the week at #NeurIPS2022! DM if you want to chat about NLP/ML for healthcare or sample French bakeries with me!
My research takes a human-centered approach to clinical NLP, smarter electronic health records, and the design and evaluation of human-in-the-loop systems.
I’m incredibly excited by the avenues at Duke for thoughtful clinical translation and impact (e.g., @DukeAIHealth, @DIHI).
However, we wanted to understand the interactions between domain experts, decision aid, and the resulting annotations. This became a #CHI2021 paper, presented today at 11AM EDT. Joint with @david_sontag and @arvindsatya1.
📢 Calling all PhD students! We're excited to announce that the Doctoral Symposium will be returning for CHIL 2024. Find more details at the link below and submit by March 15th! 📅 #CHIL2024
Vanilla word vectors don’t naturally represent all relations (e.g., hierarchies). Andrew McCallum presenting interesting work on representing concepts as high-dimensional boxes in vector space. #KR2ML
Excited to see this new Science Translational Medicine paper out from my lab (). It outlines a ML algorithm for smarter recommendation of antibiotics that can help reduce antibiotic resistance while simultaneously improving patient outcomes.
MIMIC-IV was published this week. The core dataset has been out for a while, but we've just published the deidentified free-text clinical notes: 300,000+ discharge summaries and 2.5 million radiology reports!
We investigate extraction of date of metastasis and medication start. Joint work with @j4sonzhao, Dr. Pedram Razavi, and @david_sontag.
Video:
Paper:
Joint work with @david_sontag and several wonderful Twitter-less coauthors at @MIT_CSAIL and @MGHMedicine. Learn more here (), and reach out if you're interested in helping create an open-sourced dataset!
Empirically, we find that initializing models with “order-contrastive pre-training” improves disease progression classification in both linear and deep settings. Theoretically, in a simplified setting, we investigate why, proving a finite-sample guarantee for downstream error.
We benchmark 5 different tasks with diverse output spaces (e.g., disambiguation, clinical trial parsing, medication relation extraction). This involved the creation of new annotations, released here: . Paper:
@danofer Re: ICD codes, that would be interesting! We only tested on text data. Challenge with ICD codes is you need a technique for ignoring the shortcuts caused by global nonstationarities in ICD coding. Those make it easy to cheat on the self-supervised objective.
@danofer @david_sontag It’s a different self-supervised objective. We compare to masked language modeling (of clinical text) in our experiments. Intuitively, this objective focuses on time-irreversible features, which are useful for downstream tasks like classifying progression.
@danofer Can’t share the pretrained model due to a data-use agreement, unfortunately. Besides pre-processing, the code is pretty much out-of-the-box. For the linear feature selector, we used sklearn + an L1 penalty to predict the order. For the deep model, we used NextSentencePrediction from HuggingFace.
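The linear setup described above could be sketched roughly like this (a toy illustration, not the paper's code: the data, feature dimensions, and "time-irreversible" signal are all invented for the example). The idea is to label pairs of note feature vectors by whether they appear in correct temporal order, and let an L1-penalized logistic regression select the few features that actually change irreversibly over time.

```python
# Toy sketch of order-contrastive pre-training with a linear model.
# All data here is synthetic; only feature 0 drifts over time
# (a stand-in for a disease-progression signal), the rest is noise.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_pair(correct_order: bool) -> np.ndarray:
    """Concatenate an (early, late) pair of note features, possibly swapped."""
    early = rng.normal(0.0, 1.0, size=8)
    late = early.copy()
    late[0] += 2.0  # time-irreversible feature: only increases over time
    parts = [early, late] if correct_order else [late, early]
    return np.concatenate(parts)

# Half the pairs are in the correct order (label 1), half swapped (label 0).
X = np.stack([make_pair(i % 2 == 0) for i in range(400)])
y = np.array([1 if i % 2 == 0 else 0 for i in range(400)])

# The L1 penalty pushes weight onto the few order-revealing features,
# so the learned model acts as a sparse feature selector.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X, y)
print(f"train accuracy: {clf.score(X, y):.2f}")
```

In the actual work, the "features" would come from representations of real clinical notes rather than synthetic vectors, and the deep variant replaces the linear classifier with a next-sentence-prediction-style transformer head.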
@he_why_zhi_yanm @danofer Not a link, but for example, there was the conversion from ICD-9 to ICD-10 in the early 2010s, and there are new ICD codes introduced over time, e.g., for COVID.
@NoClosedForm The full dissertation is still undergoing a few final finishing touches. Happy to send you the names of the papers that formed the backbone of the dissertation, if you'd like!
@_jennylai_ And we are so grateful to you all for participating :) It was so crucial for us to understand how decision aid affected clinical users, before we scaled up our efforts.