
Interpretable deep learning to model immune receptors

October 28, 2024

Transcript

  • 00:00 We have Maria Martinez from the Department of Biomedical Informatics and Data Science. Maria just joined Yale recently, a few months ago. She originally came to biology from physics and completed postdoctoral training in computational biology at the Weizmann Institute and then at Columbia University. Before coming to Yale she led a group for a decade at IBM Research in Switzerland, where she focused on developing machine learning approaches for personalized cancer medicine. I know Maria has now been working on the immune system, focusing on molecular recognition and TCR specificity prediction.
  • 00:42 Thank you, John. It's great to be here. As John said, I just joined Yale five months ago, so it's a great opportunity to be here and tell you a little bit about the work my group has been doing. The part I'm going to talk about is the work we have been doing over the last few years to model the binding of T cell receptors.
  • 01:01 I don't think I need to give a big introduction to T cells here. There is huge interest now in predicting the binding of T cell receptors because of the biomedical applications: developing better cancer immunotherapies, understanding the many autoimmune diseases driven by autoreactive T cells, vaccine development, and so on.
  • 01:24 So a lot of work has been done. This plot is from a recent review we wrote surveying recent computational models for predicting T cell receptor binding. As you can see, in 2023 alone there were more than twenty papers on T cell receptor binding prediction. There are interesting trends, and the trend is not slowing down; there have still been a lot of papers published this year.
  • 01:50 (Is there a pointer, or should I use the mouse? Okay, the mouse works. Thank you.) The other interesting trend we can see in this plot is that there seems to have been a switch. Until maybe 2021, most models were using some sort of supervised machine learning, but very recently we have seen the emergence of protein language models. Naveep already introduced these a little bit, and I will tell you a bit more shortly. As you can see, there is a lot of interest now in applying the latest deep learning technologies.
  • 02:28 The other thing to tell you is that most of these models, at least the ones in this plot, were based only on sequence. That means we take the sequence of the T cell receptor and the sequence of the epitope and try to predict binding, with more or less success. At this point most people were still neglecting structure, and certainly that is a big oversimplification: structure is important for predicting binding.
  • 02:49 The reason most people have been leaving structure aside until now is simply that T cell receptors are extremely difficult proteins to model. The part that binds the epitope is a flexible loop, and flexible loops and other disordered regions cannot be modeled effectively with AlphaFold, not even with the latest version or similar models. So people are neglecting structure not because it is unimportant, but because it is challenging. Still, we are starting to see a new wave of papers that somehow try to look into that.
  • 03:21 For the time being, let me take one step back and tell you chronologically about the work we have been doing on this topic, starting from the sequence-based work we did when we first started thinking about this problem. This was a few years ago, I think four years ago, when we started thinking about predicting T cell receptor binding. We have expertise in machine learning, and we wanted to develop a multimodal deep learning approach: you give the model the sequence of the T cell receptor and the sequence of the epitope, and you train it to predict binding.
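To make that concrete, here is a minimal sketch, in PyTorch, of what such a bimodal setup can look like: one encoder for the TCR, one for the epitope, and a small classifier on top of the concatenated representations. This is an illustration only, not the architecture we actually used; the encoder type, layer sizes, and example sequences are placeholders.

```python
# Minimal sketch of a bimodal TCR-epitope binding classifier (illustrative only).
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"                               # 20 standard amino acids
AA_TO_IDX = {a: i + 1 for i, a in enumerate(AA)}          # index 0 is padding

def encode(seq, max_len):
    """Map an amino-acid string to a fixed-length tensor of integer indices."""
    idx = [AA_TO_IDX.get(a, 0) for a in seq[:max_len]]
    return torch.tensor(idx + [0] * (max_len - len(idx)))

class SeqEncoder(nn.Module):
    """Embeds a sequence and summarizes it with a GRU into one vector."""
    def __init__(self, emb_dim=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(21, emb_dim, padding_idx=0)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)

    def forward(self, x):                                  # x: (batch, max_len)
        _, h = self.rnn(self.emb(x))                       # h: (1, batch, hidden)
        return h.squeeze(0)

class BindingModel(nn.Module):
    """Concatenates TCR and epitope representations and predicts binding."""
    def __init__(self):
        super().__init__()
        self.tcr_enc, self.epi_enc = SeqEncoder(), SeqEncoder()
        self.head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, tcr, epi):
        z = torch.cat([self.tcr_enc(tcr), self.epi_enc(epi)], dim=-1)
        return torch.sigmoid(self.head(z)).squeeze(-1)     # binding probability

model = BindingModel()
p = model(encode("CASSLGQAYEQYF", 40).unsqueeze(0), encode("GILGFVFTL", 15).unsqueeze(0))
```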
  • 03:54 Another topic I will tell you a little bit about: my team at IBM was very interested in interpretability. That means we wanted not just to build a model, but also to get some sort of insight into why the model predicts binding or non-binding.
  • 04:09 Now, the challenges. One of the main challenges of this problem, as with most problems in biology, is that models are typically underpowered, and especially so in this particular regime. The theoretical diversity of potential T cell receptors that a person can harbor is astronomical. We do have quite a lot of data in public databases, but if you compare these two numbers you see quite a few orders of magnitude of difference. These numbers, by the way, were the numbers available when we trained this model; by now they have increased substantially, but we are still not sampling even a fraction of the huge potential diversity. So this is something to keep in mind: models are still underpowered.
  • 04:52 This is especially true on the epitope side. Again, the number has increased by now, but I think there is still data for fewer than about one thousand epitopes, which is still quite a small amount of data. So, as you can see already, we might be able to predict something about T cell receptors, but we will certainly not have a lot of power to predict binding to new epitopes, which is one of the most interesting biomedical applications.
  • 05:17 Nevertheless, with this caveat, let me tell you a little bit about what we did. Here we focus first on the side where we can predict: we fix the epitope and build models that predict its binding to new, different T cell receptors.
  • 05:33 With these constraints, we trained four models that are different variations, because there are different choices you can make in what you encode, the full receptor sequence or only the variable part, and also in how you encode the epitope. As Naveep already explained, an epitope is a string of amino acids, nine to ten, up to fifteen amino acids, so it is relatively small. It is typically encoded as a sequence of amino acids, but there are some tricks you can do here, and I'll tell you about them in a second.
  • 06:01 So first we trained these four models. Of course, you need to compare with a baseline, so as a baseline we developed a very simplistic classifier based on k-nearest neighbors, one of the simplest machine learning approaches you can build. And we were disappointed, but not completely surprised, to see that this simple classifier did as well as, or better than, the deep learning models. Again, this is a problem of data limitation: we do not have enough data.
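For illustration, a baseline of this kind could be as simple as the following scikit-learn sketch; the 3-mer featurization and the toy sequences are assumptions made for the example, not necessarily what was used in our work.

```python
# Minimal sketch of a k-nearest-neighbours baseline for TCR binding (illustrative).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy CDR3 sequences labelled 1 (binds the fixed epitope) or 0 (does not).
tcrs   = ["CASSLGQAYEQYF", "CASSIRSSYEQYF", "CASSPGTGGYGYTF", "CASSQDRGQPQHF"]
labels = [1, 1, 0, 0]

# Represent each receptor by its 3-mer composition, then classify with k-NN.
featurizer = CountVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=False)
knn = make_pipeline(featurizer, KNeighborsClassifier(n_neighbors=3))
knn.fit(tcrs, labels)

print(knn.predict(["CASSLGQGYEQYF"]))   # predicted label for an unseen TCR
```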
  • 06:30 Now, the trick that was really interesting here comes back to the representation of epitopes. The most common representation would be to represent each amino acid and treat the epitope as a string of amino acids. However, if you consider that an epitope is a rather small molecule, you can encode it as if it were a chemical compound, using the SMILES representations Naveep was talking about. Again, an epitope is not a typical chemical compound, but from a computational point of view they admit similar representations.
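As a small illustration of that idea, a peptide epitope can be converted into a SMILES string, for instance with RDKit; this is just one possible way to do the conversion, not necessarily the preprocessing used in our pipeline.

```python
# Illustrative only: turn a short epitope sequence into a SMILES string with RDKit,
# so it can be fed to models that expect compound-style inputs.
from rdkit import Chem

epitope = "GILGFVFTL"                   # a 9-mer peptide epitope
mol = Chem.MolFromSequence(epitope)     # build the peptide as an RDKit molecule
print(Chem.MolToSmiles(mol))            # canonical SMILES of the peptide
```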
  • 07:01 And if you do that, you can now use transfer learning, because you can pretrain your model on a much larger database. BindingDB is a database of protein-compound interactions with around one million pairs, so this is a substantially larger amount of data on which to pretrain our model. Basically, what we did here was pretrain the model on BindingDB and then, of course, fine-tune it on the T cell receptor-specific data. And as you can see, this transfer learning exercise led to a substantial boost in performance. This, by the way, is done all the time in other areas of machine learning; in biology we are still getting used to it. But you can see here that we could really boost the performance.
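Schematically, the recipe is the usual two-stage one. Below is a minimal, generic sketch in PyTorch in which random placeholder features stand in for the featurized BindingDB and TCR-epitope data; the real model and training setup differ.

```python
# Minimal sketch of the pretrain / fine-tune recipe (not the actual training code):
# train on a large, related dataset first, then keep training the same weights
# on the small TCR-epitope dataset.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.BCEWithLogitsLoss()

def train(model, features, labels, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features).squeeze(-1), labels)
        loss.backward()
        opt.step()

# Stage 1: pretrain on the large dataset (placeholders for BindingDB pairs).
pretrain_x = torch.randn(1000, 128)
pretrain_y = torch.randint(0, 2, (1000,)).float()
train(model, pretrain_x, pretrain_y, epochs=20, lr=1e-3)

# Stage 2: fine-tune the same weights on the small TCR-epitope dataset,
# typically with a lower learning rate so the pretrained knowledge is preserved.
finetune_x = torch.randn(100, 128)
finetune_y = torch.randint(0, 2, (100,)).float()
train(model, finetune_x, finetune_y, epochs=10, lr=1e-4)
```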
  • 07:45 Now let me talk a little bit about interpretability, another concept that Naveep already introduced. When we were working on this, we wanted not only to develop the model, but also to get insights into which amino acids the model thought were important for predicting binding or non-binding. Here we used attention mechanisms, which are one of the most common, though of course not the only, approaches to interpretability. They are used in transformers and in language models, so they have become very popular. The idea is that with attention layers, after you train your model and get your prediction, you can produce heat maps like these, where a darker color means the model paid more attention to that feature, in this case to that amino acid.
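As a rough sketch of the mechanism, an attention layer assigns one weight per input position, and those per-position weights are what gets plotted as the heat map. The toy module below only illustrates the idea; it is not the attention used in our model, and the embeddings are random placeholders.

```python
# Minimal sketch of reading out attention weights for interpretability.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Scores each position, pools the sequence into one vector, and returns the
    softmax scores, which double as a per-amino-acid 'importance' heat map."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                                   # x: (batch, seq_len, dim)
        weights = torch.softmax(self.score(x).squeeze(-1), dim=-1)
        pooled = torch.einsum("bs,bsd->bd", weights, x)
        return pooled, weights

seq = "CASSLGQAYEQYF"
x = torch.randn(1, len(seq), 32)                            # placeholder embeddings
pooled, attn = AttentionPooling(32)(x)

# One attention value per amino acid; these numbers are what the heat map shows.
for aa, w in zip(seq, attn[0].tolist()):
    print(f"{aa}: {w:.3f}")
```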
  • 08:30 For instance, these three sequences correspond to the variable parts of three receptor sequences that were predicted to bind the same epitope. The problem here is that we don't have the ground truth, because we would need the crystallographic structures to see whether these amino acids are actually involved in the binding, and we don't have them. But we can do some sort of sanity check: if they bind the same epitope, it is logical to expect that they share a similar binding motif. And as you can see here, the amino acids highlighted with high attention are indeed conserved across the three sequences. Again, this is not a validation, only a sanity check, but from this point of view the attention prediction makes sense.
  • 09:14 Now I want to show you a negative example, a case where attention actually tells you that something is wrong in your model. I was telling you before that we have very little data to try to predict binding to unseen epitopes; this is an unsolved problem even today, four years later. We trained our models in this direction as well. The performance was quite bad, as expected, but we wanted to look at the attention maps. This is what we got, and we could not understand these maps for quite some time. Eventually we realized that, for all two hundred epitopes in our database, the model was always highlighting, with the same attention, three amino acids in the same positions. That was quite puzzling. But eventually we realized that the model figures out quite early that there is not sufficient data to make predictions for unseen epitopes. So instead it selects three positions, highly variable positions, but essentially random ones.
  • 10:13 And this is enough to give a label to each epitope; it is a way of memorizing the data. So instead of training to generalize, the model has transformed this into a classification problem over the epitopes it has already seen. Of course, it is then not able to generalize to new epitopes, which we already knew.
  • 10:28 I'm showing this because it highlights why interpretability is essential. First, it can tell you when things are working: it can tell you which amino acids might be involved in the binding prediction and give you ideas for building new hypotheses. But, equally important, it tells you when things are not working. And especially in biology, where we very often work with really underpowered datasets and we are overfitting, it is good to get hints when things are not working.
  • 10:56 Okay. After saying that, let me tell you what we did after we published this paper, which was more or less at the time when protein language models started to become a hit in biology. I think the first protein language model was published in 2021. They were already introduced by Naveep, but in case some of you came in recently and have not heard about protein language models: the idea is that they use the same type of models as natural language processing, transformer-based language models, trained on protein sequences, for example by masking amino acids and learning to predict them from their context. If you do that, you are learning the language of proteins, and this is what is called a protein language model.
  • 11:40 By now there are quite a few of these models. This is one of the early ones, published in Nature Biotech, and it was shown to be able to predict secondary and tertiary structure. Later on, other models were able to design novel proteins with good folding capabilities, which was quite impressive.
  • 12:01 Also, as Steve mentioned, by now we have a bunch of different protein language models. Some of them are general, like ESM; this is one of the earlier models, and we are already at the third version, but it was one of the first to appear. The particularity of these models is that they are trained on probably the largest collections of protein sequences, super large but also very heterogeneous: you have proteins from all sorts of organisms, tissues, and conditions. As opposed to that, you now also start to see models that are specific to particular protein families, like TCR-BERT, trained only on TCR sequences, and AbLang, trained on antibody sequences.
  • 12:45 Thinking about this, we did not want to train yet another protein language model, because there were quite a few already; rather, we wanted to test how good these models are at representing immune receptors, which, as I said before, are quite particular and different from regular proteins. We looked at both T cell receptors and B cell receptors. Let me show you some of the things we tested, starting with T cell receptors.
  • 13:09 For TCRs, we exploited ESM, this general model trained on the largest, most heterogeneous collection of proteins. And here we trained a super simple model: we took the ESM embedded representation of the receptors and added a very simple two-layer neural network on top. It doesn't get simpler than that.
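A minimal version of that pipeline, assuming the public fair-esm package and using a small ESM-2 checkpoint as a stand-in for the model actually used, could look like this; the pooling strategy and head sizes are also placeholders.

```python
# Minimal sketch: frozen ESM embeddings plus a tiny two-layer head for binding prediction.
import torch
import torch.nn as nn
import esm  # fair-esm package

esm_model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()   # small ESM-2 checkpoint
batch_converter = alphabet.get_batch_converter()
esm_model.eval()

tcrs = [("tcr1", "CASSLGQAYEQYF"), ("tcr2", "CASSIRSSYEQYF")]
_, _, tokens = batch_converter(tcrs)

with torch.no_grad():                                      # ESM stays frozen
    out = esm_model(tokens, repr_layers=[6])
    reps = out["representations"][6]                       # (batch, length, 320)
    embeddings = reps[:, 1:-1].mean(dim=1)                 # mean-pool over residues

# The only trainable part: a two-layer neural network on top of the embeddings.
head = nn.Sequential(nn.Linear(320, 64), nn.ReLU(), nn.Linear(64, 1))
binding_logits = head(embeddings)
```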
  • 13:32 To test the capabilities of this model, we used a recent benchmark of T cell receptor binding prediction models that was published last year in an informatics journal; the dot plot here comes from that benchmark. Basically, the plot shows two different predictive tasks. On the x axis, the task was classification: given a pair, an epitope and a TCR, does it bind, yes or no? Good models will be close to this side of the plot. On the other axis, the task was epitope ranking: for each T cell receptor you have seventeen different epitopes, and you have to rank them according to the likelihood that each one is the real epitope that binds.
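To make the ranking task concrete, here is a small sketch of the evaluation loop, with a placeholder scoring function standing in for a trained model and hypothetical candidate epitopes.

```python
# Minimal sketch of the epitope-ranking evaluation: for each TCR, score all candidate
# epitopes and record where the true binder lands in the ranking (1 = perfect).
import random

def score(tcr: str, epitope: str) -> float:
    """Placeholder binding score; in practice this is the trained model."""
    return random.random()

def rank_of_true_epitope(tcr, true_epitope, candidates):
    scores = {epi: score(tcr, epi) for epi in candidates}
    ranking = sorted(scores, key=scores.get, reverse=True)  # best score first
    return ranking.index(true_epitope) + 1

candidates = [f"EPITOPE_{i}" for i in range(17)]             # 17 candidate epitopes
ranks = [rank_of_true_epitope("CASSLGQAYEQYF", candidates[0], candidates)
         for _ in range(100)]
print("mean rank of the true epitope:", sum(ranks) / len(ranks))
```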
  • 14:21 To tell you a little bit about the benchmark: the colors depend on whether the models were trained on single-cell data, for which we have the alpha-beta pairing, or on bulk data, for which you typically have only the beta chain and sometimes also the alpha chain. Not surprisingly, the models trained on single-cell data perform better; that's expected. The nice thing is that this model we trained, the purple star, starting from the ESM embeddings with, again, a very simple neural network on top, was performing close to the best-performing models with very little effort. Again, this is just one particular task, but it gives you an idea of how much accuracy you can gain by exploiting these pretrained language models.
  • 15:09 We have also looked at B cell receptors in the context of protein language models. Here, too, we wanted to look at the representational capabilities of these models. We used AbLang, a protein language model trained on B cell receptor sequences. Steve already introduced clonal evolution within B cells, so we are going to use this exercise to compare how well these models represent that clonal evolution.
  • 15:39 To explain this plot: we took data from an experiment in which some collaborators sampled B cell receptors from individual germinal centers, so we could characterize the germinal center of origin of each B cell receptor. We then had to infer the clonal families of each receptor. In a few cases we observed, and this has been reported before, that the same clone was split across different germinal centers. This has been observed to happen when a cell is evolving in one germinal center, migrates to another germinal center, and keeps evolving there. That is what this phylogenetic tree is showing you: these three clones are clonally related, so they probably descend from the same ancestor BCR. You can see that the blue clone is probably the one that is farthest away, it is more separated than the others, while the purple and the green clones are a bit intermixed over here.
  • 16:39 Compared to this standard phylogenetic inference, we simply embedded the same data with AbLang and visualized it. This is only a visualization exercise; there is no inference here. And you can see that, at a qualitative level, the two plots have a lot of similarities. AbLang can also recapitulate the three clonal families quite well: the blue one seems a bit more separated, and there is the same intermixing between the purple and the green families. Again, this is not an inference tool, but as a representation and visualization it seems we are capturing a lot of the aspects of clonal evolution that we would capture with traditional phylogenetic inference tools.
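A minimal sketch of this kind of visualization, assuming the public ablang and umap-learn packages and using placeholder sequences and clone labels in place of the germinal-center data, might look like the following.

```python
# Minimal sketch: embed heavy-chain BCR sequences with AbLang and visualize them in 2D,
# coloring by the clonal family assigned by standard phylogenetic inference.
import ablang
import umap
import matplotlib.pyplot as plt

heavy_model = ablang.pretrained("heavy")    # AbLang heavy-chain model
heavy_model.freeze()

# Placeholder sequences and clone colors; the real data come from the experiment above.
sequences = [
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFS",
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFN",
    "QVQLQESGPGLVKPSETLSLTCTVSGGSIS",
    "QVQLQESGPGLVKPSETLSLTCTVSGGSIA",
]
clone_colors = ["tab:blue", "tab:blue", "tab:green", "tab:green"]

embeddings = heavy_model(sequences, mode="seqcoding")   # one vector per sequence
coords = umap.UMAP(n_neighbors=2, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=clone_colors)
plt.title("AbLang embeddings of BCR clones (UMAP)")
plt.show()
```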
  • 17:20 The last thing I would like to mention is the question with which I started this section: which model is better? If you are interested in predicting, for instance, BCR binding affinity, what would be better, a general model or a BCR-specific model? This is an open question. I see many papers in each direction, so I'm not going to give the final answer here; probably there is no final answer. I can only tell you about the particular task we did here.
  • 17:47 Here we took a dataset of seventy thousand antibodies with measured binding affinity to a SARS-CoV-2 peptide. We again trained a simple model, using embeddings from both ESM, the general model, and AbLang, the BCR-specific model. What we wanted to test was the accuracy of each model with different amounts of data, so we trained on pieces of the dataset ranging from very little data up to the largest amount. And as you can see here, when the data is small, the difference between the models is not significant; but as the amount of data increases, in this particular task we found ESM to be better.
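The experimental design can be sketched as a simple learning-curve comparison. In the sketch below the embeddings are random placeholders and the regressor is an arbitrary choice, so it only illustrates the protocol, not our actual models or results.

```python
# Minimal sketch of the data-size comparison: train the same simple regressor on
# ESM-style vs. AbLang-style embeddings using growing subsets of the training data,
# and compare held-out accuracy at each size.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 5000
affinity = rng.normal(size=n)                            # measured binding affinities
embeddings = {"ESM": rng.normal(size=(n, 320)),          # placeholder embeddings
              "AbLang": rng.normal(size=(n, 768))}

for name, X in embeddings.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, affinity, random_state=0)
    for size in (100, 500, 2000, len(X_tr)):
        reg = MLPRegressor(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
        reg.fit(X_tr[:size], y_tr[:size])
        print(f"{name:7s} n={size:5d}  R2={r2_score(y_te, reg.predict(X_te)):.3f}")
```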
  • 18:31 Again, this is an open debate, and I've seen papers in both directions; I don't claim this is the final answer. What I would take from here is that the answer probably depends on the task: highly specific tasks probably benefit from family-specific language models. But in any case, the more data you have to fine-tune the general model, the more likely you are to extract whatever information is hidden in that large, heterogeneous collection of proteins. There were many other variables that impacted the accuracy, but this is the message we took from this experiment.
  • 19:06 Okay, I think we're running out of time, so maybe I will skip this part. The last part was about interpretability; let me leave it for future discussions and conclude here.
  • 19:20 Basically, the message I would like you to take home is that there are a lot of exciting opportunities to develop both traditional computational models and machine learning and deep learning models for properties of immune receptors. We have been working a lot in this area. We developed models such as TITAN, a multimodal, more traditional deep learning approach to predict specificity, with interpretability. Then we have been exploring the capabilities of protein language models; as I said, there is a lot of work we can still do in this area. And, this is the part I did not have time to tell you about, we have also developed interpretability pipelines to extract more human-understandable explanations of why a model predicts what it predicts.
  • 20:04 And maybe the last slide: I wanted to tell you a little bit about future work. I started this talk by telling you that most models until now neglect structure. We are now seeing a saturation in accuracy; it is difficult to go further. We have reached models that predict very well what they can predict, but they cannot predict binding to unseen epitopes. One possibility for moving forward is to integrate structure, but how to do that is tricky, because it is not only the lack of structures, it is the flexibility of the loop that makes this difficult. I have certainly seen papers that try to adapt AlphaFold in different ways, but I think we have to go beyond AlphaFold and integrate models that account for flexibility, and that requires a lot more thinking. Interpretability is another area I am very interested in. If you are interested, I am certainly looking for new collaborations, so please contact me.
  • 20:54 And with that, I would like to thank the people: most of this work was done at IBM, so thank you to the people who did the work, and to the funding. I'm happy to take questions if there are any.