Interpretable deep learning to model immune receptors
October 28, 2024
Transcript
- 00:00 We have Maria Martinez, from the Department of Biomedical Informatics and Data Science. Maria just joined Yale recently, a few months ago. She originally came to biology from physics and completed a postdoc in computational biology at the Weizmann Institute and then at Columbia University. Before coming to Yale, she led a group for a decade at IBM Research in Switzerland, where she focused on developing machine learning approaches for personalized cancer medicine. More recently, Maria has been working on the immune system, focusing on molecular recognition and TCR specificity prediction.
- 00:42 Thank you, John. It's great to be here. As John said, I just joined Yale five months ago, so it's a great opportunity to be here and tell you a little bit about the work my group has been doing.
- 00:53 The part I'm going to tell you about here is the work we have been doing over the last few years to model the binding of T cell receptors. I don't think I need to make a big introduction about T cells here. There is huge interest now in predicting the binding of T cell receptors because of the enormous biomedical applications, from developing better cancer immunotherapies, to understanding the many autoimmune diseases driven by autoreactive T cells, to vaccine development, and so on.
- 01:24 So there has been a lot of work done. This plot is from a recent review we wrote surveying recent computational models for predicting T cell receptor binding. As you can see, in 2023 alone there were more than twenty papers on T cell receptor binding prediction. There are interesting trends here, and the trend is not slowing down; there are still a lot of papers being published this year.
- 01:50 So, is there a pointer, or not? Or the mouse? Okay, the mouse works. Thank you.
- 02:00 The other interesting trend we can see in this plot is that there seems to be a switch. Until maybe 2021, most models used some sort of supervised machine learning, but very recently we have seen the emergence of protein language models. Naveep already introduced them a little, and I will tell you a bit more shortly. As you can see, there is now a lot of interest in applying the latest deep learning technologies.
- 02:28 Now, the other thing to note is that most of these models, at least the ones in this plot, were based only on sequence. That means we take the sequence of the T cell receptor and the sequence of the epitope and try to predict binding, with more or less success. At this point, most people were still neglecting structure, and that is certainly a big oversimplification: structure is important for predicting binding.
- 02:49 The reason most people have so far left structure aside is simply that T cell receptors are extremely difficult proteins to model. The part that binds the epitope is a flexible loop, and such flexible regions cannot be effectively modeled with AlphaFold, not even the latest version, or with models of a similar kind. So people neglect structure not because it is unimportant, but because it is challenging. We are now starting to see a new wave of papers that try to look into that.
- 03:21 For the time being, let me take one step back and walk you chronologically through the work we have been doing on this topic, starting from the sequence-based work we did when we first started thinking about this problem. That was a few years ago, I think four years ago, when we started thinking about predicting T cell receptor binding. We had expertise in machine learning, and we wanted to develop a multimodal deep learning approach: you give the model the sequence of the T cell receptor and the sequence of the epitope, and you train it to predict binding.
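To make the bimodal setup concrete, here is a minimal sketch of a two-branch model of this kind: one encoder per input sequence and a joint head that predicts binding. This is an illustration, not the model from the paper; the architecture, layer sizes, and the `encode` helper are all assumptions.

```python
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"
AA_IDX = {a: i + 1 for i, a in enumerate(AA)}  # 0 is reserved for padding

def encode(seq, max_len=40):
    """Map an amino acid string to a fixed-length tensor of indices."""
    idx = [AA_IDX[a] for a in seq[:max_len]]
    return torch.tensor(idx + [0] * (max_len - len(idx)))

class BimodalBindingModel(nn.Module):
    def __init__(self, emb_dim=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(len(AA) + 1, emb_dim, padding_idx=0)
        # One convolutional encoder per modality (TCR and epitope)
        self.tcr_enc = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        self.epi_enc = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def _branch(self, enc, x):
        h = enc(self.emb(x).transpose(1, 2))  # (batch, hidden, length)
        return h.max(dim=2).values            # global max-pool over positions

    def forward(self, tcr, epitope):
        joint = torch.cat([self._branch(self.tcr_enc, tcr),
                           self._branch(self.epi_enc, epitope)], dim=1)
        return self.head(joint).squeeze(-1)    # logit: binds / does not bind

model = BimodalBindingModel()
logit = model(encode("CASSLGQAYEQYF").unsqueeze(0),
              encode("GILGFVFTL").unsqueeze(0))
```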
- 03:54 And, as another topic I will tell you a little about, my team at IBM was very interested in interpretability. That means we wanted not just to build a model, but also to get some sort of insight into why the model predicts binding or non-binding.
- 04:09 Now, the challenges. One of the main challenges of this problem is that in biology most models are typically underpowered, and especially in this particular regime. The theoretical diversity of potential T cell receptors that a person can harbor is astronomical. We do have quite a lot of data in public databases, but if you compare these two numbers, you see quite a few orders of magnitude of difference. These numbers are the ones we used when we trained this model, and by now they have increased substantially, but we are still not sampling even a fraction of the huge potential diversity. This is something to keep in mind: models are still underpowered.
- 04:52 And it is especially limited on the epitope side. Again, the numbers have increased by now, but I think there are still fewer than a thousand epitopes in the databases, which is still a very small amount of data. So I don't need to tell you that we may be able to predict something about T cell receptors, but we will certainly not have much power to predict binding to new epitopes, which is where some of the most interesting biomedical applications lie.
- 05:17 Nevertheless, with this caveat, let me tell you a little about what we did. Here we focus first on the side where we can predict: we fix the epitope, and we build models that predict whether different T cell receptors bind to it.
- 05:33 With these constraints, we trained four models that are different variations, because there are different choices you can make in what you encode, the full sequence or only the variable part, and also in how you encode the epitope. As Naveep already explained, an epitope is a string of amino acids, nine to ten, up to fifteen amino acids, so it is relatively small. Epitopes are typically encoded as amino acids, but there are some tricks you can do here, which I will show you in a second.
- 06:01 So first we trained these four models. Of course, you need to compare with a baseline, and as a baseline we developed a very simplistic classifier based on k-nearest neighbors, one of the simplest machine learning approaches you can build. We were disappointed, though not completely surprised, to see that this simple classifier performed much better than the deep learning models. Again, this is a problem of data limitation: we do not have enough data.
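For concreteness, here is a sketch of a k-NN baseline of this kind, assuming a simple 2-mer composition featurization of the receptor sequence (the featurization used in the actual baseline may differ):

```python
from collections import Counter
from itertools import product

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

AA = "ACDEFGHIKLMNPQRSTVWY"
KMERS = ["".join(p) for p in product(AA, repeat=2)]  # all 400 2-mers

def kmer_features(seq):
    """Represent a TCR sequence by its 2-mer composition."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return np.array([counts[k] for k in KMERS], dtype=float)

def knn_baseline(tcrs, labels, k=5):
    """tcrs: list of CDR3 strings; labels: 1 = binds the fixed epitope.
    Returns a fitted classifier; score held-out receptors with predict_proba."""
    X = np.stack([kmer_features(s) for s in tcrs])
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X, labels)
    return clf
```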
- 06:30 Now, the trick that was really interesting here comes back to the representation of epitopes. The normal representation of an epitope would be to represent each amino acid and simply have a string of amino acids. However, if you consider that an epitope is a rather small molecule, you can encode it as if it were a chemical compound, using the SMILES representations Naveep was talking about. Again, an epitope is not a chemical compound in the usual sense, but from a computational point of view they admit similar representations.
- 07:01 And if you do that, you can now use transfer learning, because you can pretrain your model on a much larger database. BindingDB is a database of protein-compound interactions with around one million pairs, a substantially larger amount of data on which to pretrain our model. So what we did here is pretrain the model on BindingDB and then, of course, fine-tune it on the T cell receptor specific data. And as you can see, this transfer learning exercise led to a substantial boost in performance. This, by the way, is done all the time in other areas of machine learning; in biology we are still getting used to it. But as you can see here, we could really boost the performance.
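To illustrate the trick: a peptide string can be parsed as a molecule and serialized to SMILES, which can then be fed to a compound encoder pretrained on BindingDB. A minimal sketch using RDKit's peptide parser (the epitope shown is just an example 9-mer):

```python
from rdkit import Chem

def epitope_to_smiles(peptide):
    """Interpret a peptide (one-letter amino acid codes) as a small molecule
    and return its SMILES string, so it can be fed to a compound encoder
    pretrained on protein-compound pairs such as BindingDB."""
    mol = Chem.MolFromSequence(peptide)
    return Chem.MolToSmiles(mol)

print(epitope_to_smiles("GILGFVFTL"))  # SMILES string of a 9-mer epitope
```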
- 07:45 Now let me talk a little about interpretability, another concept that Naveep already introduced. When we were working on this, we wanted not only to develop the model, but also to get insights about which amino acids the model thought were important for predicting binding or non-binding.
- 08:04 Here we used attention mechanisms, one of the most common approaches for interpretability, not the only one, of course, but one of the most popular. They are used in transformers and in language models, so they have become very popular. The idea with attention layers is that after you train your model and get your prediction, you can produce heat maps like these, where a darker color means the model paid more attention to that feature, in this case to that amino acid.
- 08:31sequences correspond to three the
- 08:33variable part of three sequences.
- 08:35These are receptor sequences that
- 08:36were predicted to be to
- 08:38bind the same epitope.
- 08:40And, again, the the the
- 08:41the trick the problem here
- 08:42is we don't have the
- 08:43ground truth because, I mean,
- 08:44we need to have the
- 08:45crystallographical
- 08:46structures to see whether these
- 08:47amino acids are involved in
- 08:48the binding, which we don't
- 08:49have.
- 08:50But we can do some
- 08:52sort of sanity check. So
- 08:54if they bind the SNP
- 08:55top, it's logical to expect
- 08:57that they share a similar
- 08:58binding motif. And as you
- 08:59can see here, actually, the
- 09:01amino acids that were highlighted
- 09:02with high attention are conserved
- 09:04on the three sequences.
- 09:05So again, this is not
- 09:06a validation. It's all only
- 09:08a sanity check. But from
- 09:09this point of view, this
- 09:10prediction, this attention prediction makes
- 09:12sense.
- 09:14 Now I want to show you a negative example, a case where attention actually tells you that something is wrong in your model. I told you before that we have very little data for predicting binding to unseen epitopes; this is an unsolved problem even today, four years later. So we trained our models in this direction as well. The performance was quite bad, as expected, but we wanted to look at the attention maps. This is what we got, and we were puzzled by these maps for quite some time; we could not understand them.
- 09:47 Eventually, we realized that for all two hundred epitopes in our database, the model was always putting its attention on the same three positions of the epitope. That was quite puzzling. But eventually we understood that the model realizes quite early that there is not enough data to make predictions for unseen epitopes. So instead, it selects three positions, highly variable positions, but otherwise arbitrary, and that is enough to give a label to each epitope. This is a way of memorizing the data: instead of training to generalize, the model has transformed this into a classification problem.
- 10:24 Of course, then you are not able to generalize to new epitopes, which we already knew. I'm showing this because it highlights why interpretability is essential. First, it can tell you when things are working: it can tell you which amino acids might be involved in the binding prediction and give you ideas to build new hypotheses. But equally important, it tells you when things are not working. Especially in biology, where we are very often working with really underpowered datasets and we are overfitting, it is good to get hints when things are not working.
- 10:56 Okay. So after saying that, let me tell you what we did after we published this paper. It was more or less at the time when protein language models started to become a hit in biology; I think the first protein language model was published in 2021. They were already introduced by Naveep, but for those of you who came recently and have not heard about protein language models: the idea is that they use the same type of models [...]
- 11:34 And if you do that, then you are learning the language of proteins, and this is what is called a protein language model.
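For intuition, here is a toy sketch of the masked-language-modeling objective over amino acids: hide a residue and train a transformer to predict it from the surrounding context. Everything here (sizes, layers, the example sequence) is illustrative; real protein language models are vastly larger.

```python
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"
MASK = len(AA)  # extra token id used for masking

class TinyProteinLM(nn.Module):
    """A toy masked language model over amino acids: mask a residue,
    train the model to predict it from the surrounding context."""
    def __init__(self, dim=64):
        super().__init__()
        self.emb = nn.Embedding(len(AA) + 1, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(dim, len(AA))

    def forward(self, tokens):
        return self.out(self.encoder(self.emb(tokens)))

seq = torch.tensor([[AA.index(a) for a in "MKTAYIAKQR"]])
masked = seq.clone()
masked[0, 4] = MASK                         # hide one residue
logits = TinyProteinLM()(masked)
loss = nn.functional.cross_entropy(logits[0, 4:5], seq[0, 4:5])  # predict it back
```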
- 11:40 By now there are quite a few of these models. This is one of the early ones, published in Nature Biotech, which was shown to be able to predict secondary and tertiary structure. Later on, other models were shown to be able to design novel proteins with good folding capabilities, which was quite impressive. And, as Steve mentioned, by now we have a bunch of different protein language models. Some of them are general, like ESM; we are in the third version already, but this was one of the first models to appear. The particularity of these models is that they are trained on probably the largest collections of amino acid sequences, super large but also very heterogeneous: you have proteins from all sorts of organisms, tissues, and conditions. Opposed to that, you now also start to see models that are specific to particular families, like TCR-BERT, trained only on TCR sequences, and AbLang, trained on antibody sequences.
- 12:45 Thinking about it, we did not want to train just another protein language model, because there were quite a few already; we wanted to test how good these models were at representing immune receptors, which, as I said before, are quite particular and different from regular proteins. We looked at both T cell receptors and B cell receptors. Let me show you some of the things we tested, starting with T cell receptors.
- 13:09 For TCRs, we exploited ESM, the general model trained on the largest, most heterogeneous collection of proteins. And here we trained a super simple model: we take the embedded representation of the receptors from ESM, and then we add a super simple two-layer neural network on top. It doesn't get simpler than that.
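A sketch of that setup using the fair-esm package; the calls follow that package's documented interface as I understand it, and the mean-pooling choice is an assumption:

```python
import esm
import torch
import torch.nn as nn

# Frozen pretrained ESM-2 model, used only as a feature extractor
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

def embed(seqs):
    """Mean-pooled last-layer ESM embedding for each receptor sequence
    (for simplicity, assumes the sequences have equal length)."""
    _, _, tokens = batch_converter([(str(i), s) for i, s in enumerate(seqs)])
    with torch.no_grad():
        reps = model(tokens, repr_layers=[33])["representations"][33]
    return reps[:, 1:-1].mean(dim=1)  # drop BOS/EOS, average over residues

# The trainable part: a two-layer network on top of the frozen embeddings
head = nn.Sequential(nn.Linear(1280, 128), nn.ReLU(), nn.Linear(128, 1))
logits = head(embed(["CASSLGQAYEQYF", "CASSIRSSYEQYF"]))
```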
- 13:32 To test the capabilities of this model, we used a recent benchmark of T cell receptor binding prediction models that was published last year; the dot plot here comes from that benchmark. The plot shows two different predictive tasks. On the x axis, the task is classification: given an epitope-TCR pair, does it bind, yes or no? Good models will be close to this side of the plot. On the other axis, the task is epitope ranking: for each T cell receptor you have seventeen different epitopes, and you have to rank them according to the likelihood that each one is the real epitope that binds. So good models sit in this corner of the plot. To tell you a little more about the benchmark: the three colors indicate whether the models were trained on single-cell data, for which we have the paired alpha and beta chains, or on bulk data, for which you typically have the beta chain and sometimes also the alpha chain.
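As a concrete reading of the ranking task: score a TCR against its candidate epitopes and check where the true one lands. A minimal sketch, with `score(tcr, epitope)` standing in for any trained model's scoring function (a hypothetical placeholder):

```python
def epitope_rank(score, tcr, candidates, true_epitope):
    """Rank the candidate epitopes for one TCR by model score (higher =
    more likely to bind) and return the 1-based rank of the true epitope."""
    ranked = sorted(candidates, key=lambda e: score(tcr, e), reverse=True)
    return ranked.index(true_epitope) + 1

# With 17 candidates per receptor, a perfect model always returns rank 1,
# while a random model returns rank 9 on average.
```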
- 14:37 Not surprisingly, the models trained on single-cell data perform better; that is expected. The nice thing is that the model we trained, the purple star here, starting from the ESM embeddings with a very simple neural network on top, performed close to the best performing models with very little effort. Again, this is just one particular task, but it gives you an idea of how much accuracy you can gain by exploiting these pretrained language models.
- 15:09 We have also looked at B cell receptors in the context of protein language models; here, too, we wanted to look at the representation capabilities of these models. We used AbLang, a B cell protein language model trained on B cell receptor sequences. Steve already introduced clonal evolution within B cells, so we are going to use this exercise to compare how well the model represents clonal evolution.
- 15:39 To explain this plot: we took data from an experiment where some collaborators sampled B cell receptors from individual germinal centers, so we could characterize the germinal center of origin of each B cell receptor.
- 15:55 We then had to infer the clonal families of each receptor. In a few cases we observed, and this has been reported before, that the same clone was split across different germinal centers. This has been observed to happen when a cell evolving in one germinal center migrates to another germinal center and keeps evolving there. That is what this phylogenetic tree is showing: these three clones are clonally related, so they probably descend from the same ancestor BCR. You can see that the blue clone is probably the farthest away, more separated than the others, while the purple and the green are a bit intermixed over here.
- 16:41flow genetic inference, we just
- 16:43plot the this data with
- 16:45uplank and we just want
- 16:46it to visualize. So this
- 16:47is only a visualization exercise.
- 16:48There's no inference here.
- 16:50And you can see actually
- 16:51that in a at the
- 16:52qualitative level, this this plots
- 16:54have a lot of similarities.
- 16:55So the upland also can
- 16:57recapitulate quite well the three
- 16:58clonal families. The blue one
- 17:00seems to be a bit
- 17:01more separated
- 17:02and there seem to be
- 17:03this good intermixing
- 17:05between the purple and the
- 17:06green family. So again, this
- 17:07is not an inference tool,
- 17:09but as a representation
- 17:10test visualization,
- 17:11this seems that we are
- 17:12capturing
- 17:13a lot of the aspects
- 17:14of the clonal evolution that
- 17:15we'll capture with traditional phylogenetic
- 17:17inference tools.
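A sketch of this visualization exercise, assuming the ablang package's pretrained/seqcoding interface and UMAP for the 2D projection (the actual projection method used may differ):

```python
import ablang
import matplotlib.pyplot as plt
import umap

# Frozen AbLang model for heavy chains, used purely as an embedder
heavy = ablang.pretrained("heavy")
heavy.freeze()

def plot_clones(seqs, families):
    """Embed BCR heavy-chain sequences with AbLang, project to 2D with UMAP,
    and color the points by their inferred clonal family."""
    emb = heavy(seqs, mode="seqcoding")          # one vector per sequence
    xy = umap.UMAP(random_state=0).fit_transform(emb)
    for fam in set(families):
        pts = xy[[i for i, f in enumerate(families) if f == fam]]
        plt.scatter(pts[:, 0], pts[:, 1], label=fam, s=10)
    plt.legend()
    plt.show()
```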
- 17:20 The last thing I would like to mention is the question with which I started this section: which model is better? If you are interested in predicting, for instance, BCR binding affinity, what would be better, a general model or a BCR-specific model? This is an open question. I see many papers in each direction, so I'm not going to give the final answer here; probably there is no final answer. I can only tell you about the particular task we did here.
- 17:47 Here we took a dataset of seventy thousand antibodies with measured binding affinity to a SARS-CoV-2 peptide. We then trained, again, a simple model using embeddings from both ESM, the general model, and AbLang, the BCR-specific model. And the point here is that we wanted to test the accuracy of each model with different amounts of data, so we took subsets of the data, going from very little data up to the largest amount.
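A sketch of the comparison: precompute embeddings of the same antibodies with each model, fit the same simple regressor on growing training subsets, and compare held-out performance. The regressor choice and subset sizes here are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def learning_curve(X, y, sizes=(500, 2000, 10000, 50000), seed=0):
    """Fit the same simple regressor on growing subsets of one embedding
    (X as a numpy array) and return held-out R^2 at each training size."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    rng = np.random.default_rng(seed)
    scores = {}
    for n in sizes:
        idx = rng.choice(len(X_tr), size=min(n, len(X_tr)), replace=False)
        scores[n] = Ridge().fit(X_tr[idx], y_tr[idx]).score(X_te, y_te)
    return scores

# X_esm, X_ablang: precomputed embeddings of the same antibodies; y: affinities
# compare learning_curve(X_esm, y) against learning_curve(X_ablang, y)
```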
- 18:19 As you can see, when the data is small, the difference between the models is not significant. But as the amount of data increases, in this particular task we found ESM to be better. Again, this is an open debate, and I have seen papers in both directions; I don't claim this is the final answer. What I would take from here is that the answer probably depends on the task: high-specificity tasks probably benefit from specific language models. But in any case, the more data you have to fine-tune the general model, the more likely you are to extract whatever information is hidden in that large heterogeneous collection of proteins. There were many other variables that we found to impact the accuracy, but this is the message we took from here.
- 19:06 Okay, I think we're running out of time, so maybe I will skip this and conclude. The last part was about interpretability, but let me leave it for future discussions and conclude here.
- 19:20 Basically, the message I would like you to take home is that there are a lot of exciting opportunities to develop both traditional computational models and machine learning and deep learning models to predict properties of immune receptors. We have been working a lot in this area. We developed models such as TITAN, a multimodal, more traditional deep learning approach to predict TCR specificity, with interpretability. We have since been exploring the capabilities of protein language models; as I said, there is a lot of work we can still do in this area. And, in the part I did not have time to tell you about, we have also developed interpretability pipelines to extract more human-understandable explanations of why a model predicts what it predicts.
- 20:04 And maybe the last slide: a little about future work. I started this talk by telling you that most models until now neglect structure. Thinking about the way forward: we are now seeing a saturation in accuracy, and it is difficult to go further. We have models that predict very well what they can predict, but they cannot predict binding to unseen epitopes. One possibility for moving forward is to integrate structure. But how to do that is tricky, because the problem is not the lack of a structure; it is the flexibility of the loop that is difficult. I have seen papers that try to adapt AlphaFold in different ways, but I think we have to go beyond AlphaFold and integrate models that account for flexibility. And this requires a lot more thinking.
- 20:45 And certainly, interpretability is another area I am very interested in. If you are interested, I am certainly looking for new collaborations, so please contact me. And with that, I would like to thank the people who did the work, most of which was done at IBM, and our funding. I'm happy to take questions if there are any.