
Interpretable deep learning to model immune receptors

October 28, 2024

Transcript

  • 00:00 We have Maria Martinez from the Department of Biomedical Informatics and Data Science. Maria just joined Yale recently, a few months ago. She originally came to biology from physics and completed postdoctoral training in computational biology at the Weizmann Institute and then at Columbia University. Before coming to Yale she led a group for a decade at IBM Research in Switzerland, where she focused on developing machine learning approaches for personalized cancer medicine. I know Maria has now been working on the immune system, focusing on molecular recognition and TCR specificity prediction.
  • 00:42 Thank you, John. It's great to be here. As John said, I just joined Yale five months ago, so it's a great opportunity to be here and tell you a little bit about the work my group has been doing. The part I'm going to talk about is the work we have been doing over the last few years to model the binding of T cell receptors.
  • 01:01 I don't think I need to give a big introduction to T cells here. There is huge interest now in predicting the binding of T cell receptors because of the biomedical applications: developing better cancer immunotherapies, understanding the many autoimmune diseases driven by autoreactive T cells, vaccine development, and so on.
  • 01:24 So a lot of work has been done. This plot is from a recent review we wrote surveying recent computational models for predicting T cell receptor binding. As you can see, in 2023 alone there were more than twenty papers on T cell receptor binding prediction. There are interesting trends, and the trend is not slowing down; there have still been a lot of papers published this year.
  • 01:50 (Is there a pointer, or should I use the mouse? Okay, the mouse works. Thank you.) The other interesting trend we can see in this plot is that there seems to have been a switch. Until maybe 2021, most models were using some sort of supervised machine learning, but very recently we have seen the emergence of protein language models. Naveep already introduced these a little bit, and I will tell you a bit more shortly. As you can see, there is a lot of interest now in applying the latest deep learning technologies.
  • 02:28 The other thing to tell you is that most of these models, at least the ones in this plot, were based only on sequence. That means we take the sequence of the T cell receptor and the sequence of the epitope and try to predict binding, with more or less success. At this point most people were still neglecting structure, and certainly that is a big oversimplification: structure is important for predicting binding.
  • 02:49 The reason most people have been leaving structure aside until now is simply that T cell receptors are extremely difficult proteins to model. The part that binds the epitope is a flexible loop, and flexible loops and other disordered regions cannot be modeled effectively with AlphaFold, not even with the latest version or similar models. So people are neglecting structure not because it is unimportant, but because it is challenging. Still, we are starting to see a new wave of papers that somehow try to look into that.
  • 03:21 For the time being, let me take one step back and tell you chronologically about the work we have been doing on this topic, starting from the sequence-based work we did when we first started thinking about this problem. This was a few years ago, I think four years ago, when we started thinking about predicting T cell receptor binding. We have expertise in machine learning, and we wanted to develop a multimodal deep learning approach: you give the model the sequence of the T cell receptor and the sequence of the epitope, and you train it to predict binding.
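To make that concrete, here is a minimal sketch, in PyTorch, of what such a bimodal setup can look like: one encoder for the TCR, one for the epitope, and a small classifier on top of the concatenated representations. This is an illustration only, not the architecture we actually used; the encoder type, layer sizes, and example sequences are placeholders.

```python
# Minimal sketch of a bimodal TCR-epitope binding classifier (illustrative only).
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"                               # 20 standard amino acids
AA_TO_IDX = {a: i + 1 for i, a in enumerate(AA)}          # index 0 is padding

def encode(seq, max_len):
    """Map an amino-acid string to a fixed-length tensor of integer indices."""
    idx = [AA_TO_IDX.get(a, 0) for a in seq[:max_len]]
    return torch.tensor(idx + [0] * (max_len - len(idx)))

class SeqEncoder(nn.Module):
    """Embeds a sequence and summarizes it with a GRU into one vector."""
    def __init__(self, emb_dim=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(21, emb_dim, padding_idx=0)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)

    def forward(self, x):                                  # x: (batch, max_len)
        _, h = self.rnn(self.emb(x))                       # h: (1, batch, hidden)
        return h.squeeze(0)

class BindingModel(nn.Module):
    """Concatenates TCR and epitope representations and predicts binding."""
    def __init__(self):
        super().__init__()
        self.tcr_enc, self.epi_enc = SeqEncoder(), SeqEncoder()
        self.head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, tcr, epi):
        z = torch.cat([self.tcr_enc(tcr), self.epi_enc(epi)], dim=-1)
        return torch.sigmoid(self.head(z)).squeeze(-1)     # binding probability

model = BindingModel()
p = model(encode("CASSLGQAYEQYF", 40).unsqueeze(0), encode("GILGFVFTL", 15).unsqueeze(0))
```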
  • 03:54 Another topic I will tell you a little bit about: my team at IBM was very interested in interpretability. That means we wanted not just to build a model, but also to get some sort of insight into why the model predicts binding or non-binding.
  • 04:09 Now, the challenges. One of the main challenges of this problem, as with most problems in biology, is that models are typically underpowered, and especially so in this particular regime. The theoretical diversity of potential T cell receptors that a person can harbor is astronomical. We do have quite a lot of data in public databases, but if you compare these two numbers you see quite a few orders of magnitude of difference. These numbers, by the way, were the numbers available when we trained this model; by now they have increased substantially, but we are still not sampling even a fraction of the huge potential diversity. So this is something to keep in mind: models are still underpowered.
  • 04:52 This is especially true on the epitope side. Again, the number has increased by now, but I think there is still data for fewer than about one thousand epitopes, which is still quite a small amount of data. So, as you can see already, we might be able to predict something about T cell receptors, but we will certainly not have a lot of power to predict binding to new epitopes, which is one of the most interesting biomedical applications.
  • 05:17 Nevertheless, with this caveat, let me tell you a little bit about what we did. Here we focus first on the side where we can predict: we fix the epitope and build models that predict its binding to new, different T cell receptors.
  • 05:33 With these constraints, we trained four models that are different variations, because there are different choices you can make in what you encode, the full receptor sequence or only the variable part, and also in how you encode the epitope. As Naveep already explained, an epitope is a string of amino acids, nine to ten, up to fifteen amino acids, so it is relatively small. It is typically encoded as a sequence of amino acids, but there are some tricks you can do here, and I'll tell you about them in a second.
  • 06:01 So first we trained these four models. Of course, you need to compare with a baseline, so as a baseline we developed a very simplistic classifier based on k-nearest neighbors, one of the simplest machine learning approaches you can build. And we were disappointed, but not completely surprised, to see that this simple classifier did as well as, or better than, the deep learning models. Again, this is a problem of data limitation: we do not have enough data.
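For illustration, a baseline of this kind could be as simple as the following scikit-learn sketch; the 3-mer featurization and the toy sequences are assumptions made for the example, not necessarily what was used in our work.

```python
# Minimal sketch of a k-nearest-neighbours baseline for TCR binding (illustrative).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy CDR3 sequences labelled 1 (binds the fixed epitope) or 0 (does not).
tcrs   = ["CASSLGQAYEQYF", "CASSIRSSYEQYF", "CASSPGTGGYGYTF", "CASSQDRGQPQHF"]
labels = [1, 1, 0, 0]

# Represent each receptor by its 3-mer composition, then classify with k-NN.
featurizer = CountVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=False)
knn = make_pipeline(featurizer, KNeighborsClassifier(n_neighbors=3))
knn.fit(tcrs, labels)

print(knn.predict(["CASSLGQGYEQYF"]))   # predicted label for an unseen TCR
```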
  • 06:30 Now, the trick that was really interesting here comes back to the representation of epitopes. The most common representation would be to represent each amino acid and treat the epitope as a string of amino acids. However, if you consider that an epitope is a rather small molecule, you can encode it as if it were a chemical compound, using the SMILES representations Naveep was talking about. Again, an epitope is not a typical chemical compound, but from a computational point of view they admit similar representations.
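As a small illustration of that idea, a peptide epitope can be converted into a SMILES string, for instance with RDKit; this is just one possible way to do the conversion, not necessarily the preprocessing used in our pipeline.

```python
# Illustrative only: turn a short epitope sequence into a SMILES string with RDKit,
# so it can be fed to models that expect compound-style inputs.
from rdkit import Chem

epitope = "GILGFVFTL"                   # a 9-mer peptide epitope
mol = Chem.MolFromSequence(epitope)     # build the peptide as an RDKit molecule
print(Chem.MolToSmiles(mol))            # canonical SMILES of the peptide
```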
  • 07:01 And if you do that, you can now use transfer learning, because you can pretrain your model on a much larger database. BindingDB is a database of protein-compound interactions with around one million pairs, so this is a substantially larger amount of data on which to pretrain our model. Basically, what we did here was pretrain the model on BindingDB and then, of course, fine-tune it on the T cell receptor-specific data. And as you can see, this transfer learning exercise led to a substantial boost in performance. This, by the way, is done all the time in other areas of machine learning; in biology we are still getting used to it. But you can see here that we could really boost the performance.
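Schematically, the recipe is the usual two-stage one. Below is a minimal, generic sketch in PyTorch in which random placeholder features stand in for the featurized BindingDB and TCR-epitope data; the real model and training setup differ.

```python
# Minimal sketch of the pretrain / fine-tune recipe (not the actual training code):
# train on a large, related dataset first, then keep training the same weights
# on the small TCR-epitope dataset.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.BCEWithLogitsLoss()

def train(model, features, labels, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features).squeeze(-1), labels)
        loss.backward()
        opt.step()

# Stage 1: pretrain on the large dataset (placeholders for BindingDB pairs).
pretrain_x = torch.randn(1000, 128)
pretrain_y = torch.randint(0, 2, (1000,)).float()
train(model, pretrain_x, pretrain_y, epochs=20, lr=1e-3)

# Stage 2: fine-tune the same weights on the small TCR-epitope dataset,
# typically with a lower learning rate so the pretrained knowledge is preserved.
finetune_x = torch.randn(100, 128)
finetune_y = torch.randint(0, 2, (100,)).float()
train(model, finetune_x, finetune_y, epochs=10, lr=1e-4)
```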
  • 07:45 Now let me talk a little bit about interpretability, another concept that Naveep already introduced. When we were working on this, we wanted not only to develop the model, but also to get insights into which amino acids the model thought were important for predicting binding or non-binding. Here we used attention mechanisms, which are one of the most common, though of course not the only, approaches to interpretability. They are used in transformers and in language models, so they have become very popular. The idea is that with attention layers, after you train your model and get your prediction, you can produce heat maps like these, where a darker color means the model paid more attention to that feature, in this case to that amino acid.
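As a rough sketch of the mechanism, an attention layer assigns one weight per input position, and those per-position weights are what gets plotted as the heat map. The toy module below only illustrates the idea; it is not the attention used in our model, and the embeddings are random placeholders.

```python
# Minimal sketch of reading out attention weights for interpretability.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Scores each position, pools the sequence into one vector, and returns the
    softmax scores, which double as a per-amino-acid 'importance' heat map."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                                   # x: (batch, seq_len, dim)
        weights = torch.softmax(self.score(x).squeeze(-1), dim=-1)
        pooled = torch.einsum("bs,bsd->bd", weights, x)
        return pooled, weights

seq = "CASSLGQAYEQYF"
x = torch.randn(1, len(seq), 32)                            # placeholder embeddings
pooled, attn = AttentionPooling(32)(x)

# One attention value per amino acid; these numbers are what the heat map shows.
for aa, w in zip(seq, attn[0].tolist()):
    print(f"{aa}: {w:.3f}")
```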
  • 08:30 For instance, these three sequences correspond to the variable parts of three receptor sequences that were predicted to bind the same epitope. The problem here is that we don't have the ground truth, because we would need the crystallographic structures to see whether these amino acids are actually involved in the binding, and we don't have them. But we can do some sort of sanity check: if they bind the same epitope, it is logical to expect that they share a similar binding motif. And as you can see here, the amino acids highlighted with high attention are indeed conserved across the three sequences. Again, this is not a validation, only a sanity check, but from this point of view the attention prediction makes sense.
  • 09:14 Now I want to show you a negative example, a case where attention actually tells you that something is wrong in your model. I was telling you before that we have very little data to try to predict binding to unseen epitopes; this is an unsolved problem even today, four years later. We trained our models in this direction as well. The performance was quite bad, as expected, but we wanted to look at the attention maps. This is what we got, and we could not understand these maps for quite some time. Eventually we realized that, for all two hundred epitopes in our database, the model was always highlighting, with the same attention, three amino acids in the same positions. That was quite puzzling. But eventually we realized that the model figures out quite early that there is not sufficient data to make predictions for unseen epitopes. So instead it selects three positions, highly variable positions, but essentially random ones.
  • 10:13 And this is enough to give a label to each epitope; it is a way of memorizing the data. So instead of training to generalize, the model has transformed this into a classification problem over the epitopes it has already seen. Of course, it is then not able to generalize to new epitopes, which we already knew.
  • 10:28 I'm showing this because it highlights why interpretability is essential. First, it can tell you when things are working: it can tell you which amino acids might be involved in the binding prediction and give you ideas for building new hypotheses. But, equally important, it tells you when things are not working. And especially in biology, where we very often work with really underpowered datasets and we are overfitting, it is good to get hints when things are not working.
  • 10:56 Okay. After saying that, let me tell you what we did after we published this paper, which was more or less at the time when protein language models started to become a hit in biology. I think the first protein language model was published in 2021. They were already introduced by Naveep, but in case some of you came in recently and have not heard about protein language models: the idea is that they use the same type of models as natural language processing, transformer-based language models, trained on protein sequences, for example by masking amino acids and learning to predict them from their context. If you do that, you are learning the language of proteins, and this is what is called a protein language model.
  • 11:40 By now there are quite a few of these models. This is one of the early ones, published in Nature Biotech, and it was shown to be able to predict secondary and tertiary structure. Later on, other models were able to design novel proteins with good folding capabilities, which was quite impressive.
  • 12:01 Also, as Steve mentioned, by now we have a bunch of different protein language models. Some of them are general, like ESM; this is one of the earlier models, and we are already at the third version, but it was one of the first to appear. The particularity of these models is that they are trained on probably the largest collections of protein sequences, super large but also very heterogeneous: you have proteins from all sorts of organisms, tissues, and conditions. As opposed to that, you now also start to see models that are specific to particular protein families, like TCR-BERT, trained only on TCR sequences, and AbLang, trained on antibody sequences.
  • 12:45 Thinking about this, we did not want to train yet another protein language model, because there were quite a few already; rather, we wanted to test how good these models are at representing immune receptors, which, as I said before, are quite particular and different from regular proteins. We looked at both T cell receptors and B cell receptors. Let me show you some of the things we tested, starting with T cell receptors.
  • 13:09 For TCRs, we exploited ESM, this general model trained on the largest, most heterogeneous collection of proteins. And here we trained a super simple model: we took the ESM embedded representation of the receptors and added a very simple two-layer neural network on top. It doesn't get simpler than that.
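A minimal version of that pipeline, assuming the public fair-esm package and using a small ESM-2 checkpoint as a stand-in for the model actually used, could look like this; the pooling strategy and head sizes are also placeholders.

```python
# Minimal sketch: frozen ESM embeddings plus a tiny two-layer head for binding prediction.
import torch
import torch.nn as nn
import esm  # fair-esm package

esm_model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()   # small ESM-2 checkpoint
batch_converter = alphabet.get_batch_converter()
esm_model.eval()

tcrs = [("tcr1", "CASSLGQAYEQYF"), ("tcr2", "CASSIRSSYEQYF")]
_, _, tokens = batch_converter(tcrs)

with torch.no_grad():                                      # ESM stays frozen
    out = esm_model(tokens, repr_layers=[6])
    reps = out["representations"][6]                       # (batch, length, 320)
    embeddings = reps[:, 1:-1].mean(dim=1)                 # mean-pool over residues

# The only trainable part: a two-layer neural network on top of the embeddings.
head = nn.Sequential(nn.Linear(320, 64), nn.ReLU(), nn.Linear(64, 1))
binding_logits = head(embeddings)
```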
  • 13:32 To test the capabilities of this model, we used a recent benchmark of T cell receptor binding prediction models that was published last year in an informatics journal; the dot plot here comes from that benchmark. Basically, the plot shows two different predictive tasks. On the x axis, the task was classification: given a pair, an epitope and a TCR, does it bind, yes or no? Good models will be close to this side of the plot. On the other axis, the task was epitope ranking: for each T cell receptor you have seventeen different epitopes, and you have to rank them according to the likelihood that each one is the real epitope that binds.
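To make the ranking task concrete, here is a small sketch of the evaluation loop, with a placeholder scoring function standing in for a trained model and hypothetical candidate epitopes.

```python
# Minimal sketch of the epitope-ranking evaluation: for each TCR, score all candidate
# epitopes and record where the true binder lands in the ranking (1 = perfect).
import random

def score(tcr: str, epitope: str) -> float:
    """Placeholder binding score; in practice this is the trained model."""
    return random.random()

def rank_of_true_epitope(tcr, true_epitope, candidates):
    scores = {epi: score(tcr, epi) for epi in candidates}
    ranking = sorted(scores, key=scores.get, reverse=True)  # best score first
    return ranking.index(true_epitope) + 1

candidates = [f"EPITOPE_{i}" for i in range(17)]             # 17 candidate epitopes
ranks = [rank_of_true_epitope("CASSLGQAYEQYF", candidates[0], candidates)
         for _ in range(100)]
print("mean rank of the true epitope:", sum(ranks) / len(ranks))
```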
  • 14:21 To tell you a little bit about the benchmark: the colors depend on whether the models were trained on single-cell data, for which we have the alpha-beta pairing, or on bulk data, for which you typically have only the beta chain and sometimes also the alpha chain. Not surprisingly, the models trained on single-cell data perform better; that's expected. The nice thing is that this model we trained, the purple star, starting from the ESM embeddings with, again, a very simple neural network on top, was performing close to the best-performing models with very little effort. Again, this is just one particular task, but it gives you an idea of how much accuracy you can gain by exploiting these pretrained language models.
  • 15:09 We have also looked at B cell receptors in the context of protein language models. Here, too, we wanted to look at the representational capabilities of these models. We used AbLang, a protein language model trained on B cell receptor sequences. Steve already introduced clonal evolution within B cells, so we are going to use this exercise to compare how well these models represent that clonal evolution.
  • 15:39 To explain this plot: we took data from an experiment in which some collaborators sampled B cell receptors from individual germinal centers, so we could characterize the germinal center of origin of each B cell receptor. We then had to infer the clonal families of each receptor. In a few cases we observed, and this has been reported before, that the same clone was split across different germinal centers. This has been observed to happen when a cell is evolving in one germinal center, migrates to another germinal center, and keeps evolving there. That is what this phylogenetic tree is showing you: these three clones are clonally related, so they probably descend from the same ancestor BCR. You can see that the blue clone is probably the one that is farthest away, it is more separated than the others, while the purple and the green clones are a bit intermixed over here.
  • 16:39 Compared to this standard phylogenetic inference, we simply embedded the same data with AbLang and visualized it. This is only a visualization exercise; there is no inference here. And you can see that, at a qualitative level, the two plots have a lot of similarities. AbLang can also recapitulate the three clonal families quite well: the blue one seems a bit more separated, and there is the same intermixing between the purple and the green families. Again, this is not an inference tool, but as a representation and visualization it seems we are capturing a lot of the aspects of clonal evolution that we would capture with traditional phylogenetic inference tools.
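A minimal sketch of this kind of visualization, assuming the public ablang and umap-learn packages and using placeholder sequences and clone labels in place of the germinal-center data, might look like the following.

```python
# Minimal sketch: embed heavy-chain BCR sequences with AbLang and visualize them in 2D,
# coloring by the clonal family assigned by standard phylogenetic inference.
import ablang
import umap
import matplotlib.pyplot as plt

heavy_model = ablang.pretrained("heavy")    # AbLang heavy-chain model
heavy_model.freeze()

# Placeholder sequences and clone colors; the real data come from the experiment above.
sequences = [
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFS",
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFN",
    "QVQLQESGPGLVKPSETLSLTCTVSGGSIS",
    "QVQLQESGPGLVKPSETLSLTCTVSGGSIA",
]
clone_colors = ["tab:blue", "tab:blue", "tab:green", "tab:green"]

embeddings = heavy_model(sequences, mode="seqcoding")   # one vector per sequence
coords = umap.UMAP(n_neighbors=2, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=clone_colors)
plt.title("AbLang embeddings of BCR clones (UMAP)")
plt.show()
```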
  • 17:20 The last thing I would like to mention is the question with which I started this section: which model is better? If you are interested in predicting, for instance, BCR binding affinity, what would be better, a general model or a BCR-specific model? This is an open question. I see many papers in each direction, so I'm not going to give the final answer here; probably there is no final answer. I can only tell you about the particular task we did here.
  • 17:47 Here we took a dataset of seventy thousand antibodies with measured binding affinity to a SARS-CoV-2 peptide. We again trained a simple model, using embeddings from both ESM, the general model, and AbLang, the BCR-specific model. What we wanted to test was the accuracy of each model with different amounts of data, so we trained on pieces of the dataset ranging from very little data up to the largest amount. And as you can see here, when the data is small, the difference between the models is not significant; but as the amount of data increases, in this particular task we found ESM to be better.
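The experimental design can be sketched as a simple learning-curve comparison. In the sketch below the embeddings are random placeholders and the regressor is an arbitrary choice, so it only illustrates the protocol, not our actual models or results.

```python
# Minimal sketch of the data-size comparison: train the same simple regressor on
# ESM-style vs. AbLang-style embeddings using growing subsets of the training data,
# and compare held-out accuracy at each size.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 5000
affinity = rng.normal(size=n)                            # measured binding affinities
embeddings = {"ESM": rng.normal(size=(n, 320)),          # placeholder embeddings
              "AbLang": rng.normal(size=(n, 768))}

for name, X in embeddings.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, affinity, random_state=0)
    for size in (100, 500, 2000, len(X_tr)):
        reg = MLPRegressor(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
        reg.fit(X_tr[:size], y_tr[:size])
        print(f"{name:7s} n={size:5d}  R2={r2_score(y_te, reg.predict(X_te)):.3f}")
```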
  • 18:31 Again, this is an open debate, and I've seen papers in both directions; I don't claim this is the final answer. What I would take from here is that the answer probably depends on the task: highly specific tasks probably benefit from family-specific language models. But in any case, the more data you have to fine-tune the general model, the more likely you are to extract whatever information is hidden in that large, heterogeneous collection of proteins. There were many other variables that impacted the accuracy, but this is the message we took from this experiment.
  • 19:06 Okay, I think we're running out of time, so maybe I will skip this part. The last part was about interpretability; let me leave it for future discussions and conclude here.
  • 19:20 Basically, the message I would like you to take home is that there are a lot of exciting opportunities to develop both traditional computational models and machine learning and deep learning models for properties of immune receptors. We have been working a lot in this area. We developed models such as TITAN, a multimodal, more traditional deep learning approach to predict specificity, with interpretability. Then we have been exploring the capabilities of protein language models; as I said, there is a lot of work we can still do in this area. And, this is the part I did not have time to tell you about, we have also developed interpretability pipelines to extract more human-understandable explanations of why a model predicts what it predicts.
  • 20:04 And maybe the last slide: I wanted to tell you a little bit about future work. I started this talk by telling you that most models until now neglect structure. We are now seeing a saturation in accuracy; it is difficult to go further. We have reached models that predict very well what they can predict, but they cannot predict binding to unseen epitopes. One possibility for moving forward is to integrate structure, but how to do that is tricky, because it is not only the lack of structures, it is the flexibility of the loop that makes this difficult. I have certainly seen papers that try to adapt AlphaFold in different ways, but I think we have to go beyond AlphaFold and integrate models that account for flexibility, and that requires a lot more thinking. Interpretability is another area I am very interested in. If you are interested, I am certainly looking for new collaborations, so please contact me.
  • 20:54 And with that, I would like to thank the people: most of this work was done at IBM, so thank you to the people who did the work, and to the funding. I'm happy to take questions if there are any.