
Progress in Generative Models in Machine Learning

October 28, 2024

Transcript

  • 00:00 So next up, we have Dr. Deep Jaitly, who we're lucky to have from Apple machine learning, where he leads a team of researchers working on fundamental techniques for machine learning. I've known Deep for some time, and I know that he was actually working on deep learning before deep learning was cool, a long time ago. He did his PhD under the supervision of Geoffrey Hinton, who, as you may have heard, just got a Nobel Prize, in the foundational days of deep learning. He then joined Google Brain, where he worked on deep learning models for sequences, and he also worked at various places such as Nvidia and Google Brain Robotics. Then somehow he got to finance for a bit, at D. E. Shaw, and also, before that, national labs. So we're going to hear from Deep, I think, the latest in generative models.
  • 00:56 Thank you, John. Well, I'll be presenting a lot of work with my colleagues at Apple. I have to say, this is a little different from my usual talk, where I get into the nitty-gritty of machine learning and why one slight variation is more important than another. Instead, for today, I thought I would touch on what I think are three essential things that people working in the life sciences should think about in terms of advances in machine learning, as they would be relevant to them when they're looking at data.
  • 01:32 Okay. So what are these three things that I have in mind? Well, the first one is that neural networks have recently gotten really good at embedding various kinds of data into a vector representation. If you get some data, you first need to convert it into a form that you can work with in statistical models, so this is a requirement. In the past, people had some ways to do it, and recently there has been a lot of progress here, so I want to touch upon that a little bit. I should also say: we can fit these representations very well. No one really knows why, but we can.
  • 02:13 And recently, generative models themselves have become really powerful. We now have a truly uncanny ability to generate data, one that surprises me every day, from generative models of text to models of images. In that regard, I'll touch on the two main techniques today, autoregressive models and diffusion models, and how they work.
  • 02:42 And I should also highlight that we can now do this across modalities. It's not only just a model for text or just a model for images; instead, we can build models that work across all of them. I should say everything is going to be at a really high level, but if you want to get into the nitty-gritty, we can talk after the talk.
  • 03:03 And I will end with one little vignette on doing conformer prediction with diffusion models. You're given a compound, and you want to predict what the structure of that compound is. Everybody's seen AlphaFold; there are specific methodologies the early AlphaFold methods used, which rely on a lot of information like multiple sequence alignments and so on. But now our techniques are getting powerful enough that you can do these things without that much information. So I think this is an interesting approach to highlight, and the latest AlphaFold 3 also works on diffusion, so there's some commonality there.
  • 03:48 Okay. So how does this embedding of data into a vector space work? Traditionally, we think of this as representation learning. You're given some data, and before you can do anything with it, you want to first convert it into a usable form by embedding it into some vector space with n dimensions. Then you plug that into a statistical model, and you can do things like make predictions on it, or maybe do unsupervised learning, like the clustering that we just saw previously.
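To make that workflow concrete, here is a minimal Python sketch of the embed-then-model pattern; embed() is a hypothetical stand-in for whatever encoder you use, and the random vectors and labels are placeholders so the sketch runs end to end.

```python
# A minimal sketch of the embed-then-model workflow described above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def embed(items):
    # Hypothetical encoder: maps each raw item to an n-dimensional vector.
    # Here we fake it with random vectors just so the sketch runs end to end.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(items), 128))

data = ["sample_a", "sample_b", "sample_c", "sample_d"]
X = embed(data)                        # (num_items, 128) vector representations

# Downstream statistical models operate on the embeddings, not the raw data.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)   # unsupervised
labels = np.array([0, 1, 0, 1])                             # placeholder labels
clf = LogisticRegression().fit(X, labels)                   # supervised
```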
  • 04:21 And so what people used to do in the past was have a specific technique for every modality. If you had images, you would use 2D convolutions, your convolutional models. For text, you would embed each word into a little descriptor for it. For waveforms, you might convert the waveform into spectral representations and then embed those into a fixed-size vector space. And because of this limitation, what happened is that everybody had models for different kinds of data, and they were all separate.
  • 05:01 And then over time, people decided: well, let's try to embed different modalities into the same space. And once they're in the same space, we'll just combine them with little things, like adding the representations from the different spaces together, or maybe even putting a small neural net on top of that.
  • 05:19 But that was really quite inflexible, in that the kinds of changes you could make to the representation were limited, and what you did at the time you wanted to use the model was pretty much fixed by how you trained it. If you had two things going in during training, you could only use those two things during evaluation. So if you had images and text, you could just use images and text and no other combinations.
  • 05:45 And so in twenty seventeen, all this changed with the attention models paper, "Attention Is All You Need." It was really a breakthrough paper, which has had implications in various forms.
  • 05:57 The basic idea is that you can take embeddings of different data, and then you can combine these embeddings by choosing what's important. So there's this notion of attention. I won't necessarily go into the details, but there's this notion of looking at your data, looking at different parts of it, and choosing those parts if they seem relevant to the model itself. And this is all learned during the training of the model, so the attention is not baked in beforehand; the model just learns how to do it as part of training.
  • 06:34 What this also offers is a really interesting ability to change how you embed data into your model. So instead of just using the traditional approach to embeddings, where you put your data through some pre-baked model, you can apply these attention models to compress your data into a fixed-size representation. Here's an example for images: you can take an image, split it into patches, and then learn an attention model on top, which compresses the whole image down to a single vector, and now you can use that vector for anything else you want to do with it.
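As a rough illustration of that idea, here is a minimal numpy sketch of attention pooling over image patches; the patch grid, dimensions, and the single learned query are illustrative assumptions, not a specific model's recipe.

```python
# A minimal sketch of attention pooling: many patch embeddings -> one vector.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 64
patches = rng.normal(size=(196, d))   # e.g. a 14x14 grid of patch embeddings

# In a trained model these would all be learned parameters.
query = rng.normal(size=(d,))         # one learned "summary" query
W_k = rng.normal(size=(d, d))         # key projection
W_v = rng.normal(size=(d, d))         # value projection

keys = patches @ W_k
values = patches @ W_v
attn = softmax(query @ keys.T / np.sqrt(d))  # (196,) weights over patches
image_vector = attn @ values                 # single d-dim summary vector
```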
  • 07:16 Furthermore, what's really interesting is that now you can do this for various modalities across time. You can have images, text, videos, sound; they all get embedded into the same space, and you can compress them down to the same format. So you can do things like apply it to different sentences of different lengths. You don't have to worry about periodicity, or the fact that everything has to be the same length, with this device, and you can do this across different data types.
  • 07:47 So it's a really powerful tool, and I wanted to highlight it today because I think, if you're dealing with multivariate data, you can, over time, think about clever techniques for how to combine things together. A lot of the ingenuity that's gone into things like AlphaFold is about how you combine the various data that go in there. It requires some experience, but I think, with just a little tweaking, you get pretty good at it.
  • 08:16 Okay. So switching to generative models. Once you have an embedding, you can build generative models of the data. What's a generative model? A generative model is a model that, by definition, allows you to generate new data of that modality. Additionally, it can help you quantify whether something you're seeing has high probability or low probability, so that you can do other things with that probability, such as build tools on top of those measures.
  • 08:48 There's a wide variety of techniques for generative models that the machine learning community has built over time, but I'll basically just be talking about autoregressive models and diffusion models, which are really the mainstay of the models today, and you're likely quite familiar with them. An example of an autoregressive model would be ChatGPT; for diffusion models, something like Stable Diffusion for image generation is an example.
  • 09:22 Okay. So with autoregressive models, the goal is to build a model that gives you a probability for any data point. The way we do this with autoregressive models is to convert high-dimensional data into a sequence and then measure the probability of the sequence using the chain rule of conditional probability, which basically says to multiply the probabilities of each variable given the variables that came before it.
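In symbols, the chain-rule factorization being described is

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}),$$

i.e., the probability of a whole sequence is the product of each next-step probability given everything before it.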
  • 09:52 I think the details are not too important; I'll try to highlight it with this sort of example here. Let's say you have the web and you want to build a generative model of text. What you would do is take the entire dataset of text and convert it into input-output pairs of the type x and y. You're given some data x, and you want to predict y. You're familiar with regression or logistic regression; it's the same sort of technique. You're basically just trying to predict some target given some input.
  • 10:26 And so with the web, what you would do is just take all the prefixes. You've got a Shakespearean verse here, say "to be or not to be," and you convert that into data examples: an empty start with the first word being "to"; then x being "to" and the next word being "be," which is another data example; and then "to be" is the input for another one, with "or" as the target word. And so you can convert the entire dataset, the entire web, into such a database, and you are now learning a model that learns to predict the next word given whatever context it has.
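Here is a minimal Python sketch of that prefix-to-pairs conversion, using the same verse as a toy dataset:

```python
# Turn text into next-word (x, y) training pairs, as described above.
text = "to be or not to be"
words = text.split()

pairs = []
for i in range(len(words)):
    context = words[:i]        # the prefix seen so far (empty at the start)
    target = words[i]          # the next word to predict
    pairs.append((context, target))

for context, target in pairs:
    print(f"x = {context!r:30} -> y = {target!r}")
# x = []                         -> y = 'to'
# x = ['to']                     -> y = 'be'
# x = ['to', 'be']               -> y = 'or'
# ...
```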
  • 11:06 And so, you know, this is the workhorse of how current models like ChatGPT work. You just take the web and do next-token prediction, as it were. And then, when you want to run the model, what you do is feed in some context, such as "why is the sky blue," and you let the model generate the next word, which it has already learned from its training. Then you take that word, in this case "because," and feed it back in as the next input, "why is the sky blue because," and have it predict the next word: "because of Rayleigh scattering," and so on. So you basically run the model during inference, and it just generates text, and that generated output is what you see.
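And here is a minimal sketch of that generate-and-feed-back inference loop; next_token_probs() is a hypothetical stand-in for a trained language model, with placeholder scores just so the loop runs.

```python
# A minimal sketch of the autoregressive inference loop described above.
import numpy as np

VOCAB = ["why", "is", "the", "sky", "blue", "because", "of", "rayleigh",
         "scattering", "<eos>"]

def next_token_probs(context):
    # Hypothetical model call: returns a probability over VOCAB given the
    # context. A real model would be a trained neural network.
    rng = np.random.default_rng(len(context))
    p = rng.random(len(VOCAB))
    return p / p.sum()

context = ["why", "is", "the", "sky", "blue"]
for _ in range(10):
    probs = next_token_probs(context)
    token = VOCAB[int(np.argmax(probs))]   # greedy: take the most likely word
    if token == "<eos>":
        break
    context.append(token)                  # feed the word back in and repeat

print(" ".join(context))
```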
  • 12:00 So how do you apply this to modalities other than text? It's quite clear for text: it's discrete data, and these models work really well for discrete data. They don't really work so well in continuous regression space. So it's easy to apply to things like protein sequences, amino acids, and so on, which are naturally discrete. It's a little trickier for high-dimensional data that is not string-like.
  • 12:28 To highlight how people do this: what you end up doing is building a model that first encodes your data into a sequence of discrete tokens, and then you learn another model, called a reconstruction model, which takes in the output tokens and converts them back to the data itself. Typically, you actually learn both of these models together. And once you have this, you can convert your entire dataset into sequences of tokens, and then learn the autoregressive model on those sequences by just predicting the next token given the history of tokens. You can then generate new data of this kind by running the autoregressive model, generating some sequences, and then converting those back to real data.
  • 13:23 Here's an example of how you might apply this to a modality like speech. Speech is really just waveforms, and if you really wanted to, you could just model speech directly; people have done that. But it's harder to deal with, because speech happens at a very fast rate, so the data would just be too much. So, typically, what people do now is convert speech into a spectral representation, by taking windows of speech and computing a Fourier spectrum in each. The original waveform is converted into a frequency diagram over time, showing how the sound distributes its energy over the different frequencies on the y-axis.
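As a rough sketch, that windowed Fourier computation can look like this in numpy; the window and hop sizes are illustrative choices.

```python
# A minimal sketch of computing a magnitude spectrogram from a waveform.
import numpy as np

def spectrogram(waveform, window=256, hop=128):
    frames = []
    for start in range(0, len(waveform) - window + 1, hop):
        frame = waveform[start:start + window] * np.hanning(window)
        frames.append(np.abs(np.fft.rfft(frame)))  # magnitude spectrum
    return np.stack(frames)                        # (time, frequency)

t = np.linspace(0, 1, 16000)            # 1 second at 16 kHz
wave = np.sin(2 * np.pi * 440 * t)      # a 440 Hz tone as a toy waveform
spec = spectrogram(wave)
print(spec.shape)                       # (num_frames, window // 2 + 1)
```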
  • 14:12 And once you've converted it into this format, you can learn an inverse model, called a vocoder, which will just generate the raw waveform back from this encoded speech. Now, unfortunately, the spectrum on the right-hand side is still continuous; it's not discrete, and so it's hard to feed it into an autoregressive model.
  • 14:37 So what you can do is simply take that data and tokenize it by discretizing it: take off the minimum, divide by the maximum, and round, so the values are converted into a range of integer bins between zero and some maximum bin. And now you have a discretized version. You can convert back to the original as well, by just mapping the discrete values back to continuous codes. And so now you have this machinery by which you can take continuous data, convert it to tokens, and then convert it back to real data, so you can feed it straight into an autoregressive model.
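Here is a minimal numpy sketch of that min-max discretization and its inverse; the bin count and the data are illustrative.

```python
# A minimal sketch of min-max tokenization of continuous values and back.
import numpy as np

def tokenize(x, num_bins=256):
    lo, hi = x.min(), x.max()
    scaled = (x - lo) / (hi - lo)                       # map into [0, 1]
    tokens = np.clip((scaled * num_bins).astype(int), 0, num_bins - 1)
    return tokens, (lo, hi)

def detokenize(tokens, lo_hi, num_bins=256):
    lo, hi = lo_hi
    centers = (tokens + 0.5) / num_bins                 # bin centers in [0, 1]
    return centers * (hi - lo) + lo                     # back to original range

spec_frame = np.random.default_rng(0).normal(size=129)  # toy spectral frame
tokens, lo_hi = tokenize(spec_frame)
approx = detokenize(tokens, lo_hi)
print(np.max(np.abs(spec_frame - approx)))              # small quantization error
```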
  • 15:17 And so you take the waveform, you compute its spectral representation, you take each spectral frame and convert it into a sequence of discrete tokens, and voila, you can do next-step prediction. You feed in your history of tokens, and then you can predict the next token. So it's basically a recipe that's repeated all over: you just learn how to discretize your data into some discrete bins, and then you learn the autoregressive model. This has already been applied to various things like speech, videos, images, and so on, and it's a pretty powerful technique that can be applied to other modalities as well.
  • 16:03 And now I want to talk a little bit about diffusion models. This is a set of newer techniques that allows you to morph one probability distribution into another. There are methods called optimal transport, flow matching, and diffusion models; they're all trying to map one distribution onto another.
  • 16:25 It might seem like a very arcane idea, but it's really a powerful methodology when you want to think about how to generate data from noise. In the case of diffusion models, you morph a Gaussian distribution, which is something that people know how to handle, into a real data distribution, which is really hard to handle. If you give me images or speech, I don't know what the data distribution itself is, or how to model that distribution. Or multi-omics data: what is the actual distribution of the data? One doesn't know. And so the ability to generate and sample from that is quite useful, and mapping to a simple distribution allows us to do that.
  • 17:12 And so how does this actually work in practice? I'll show you an example with images. You have some image, x0, on the right-hand side. You take that image, and what you can do is scale it down in magnitude by multiplying it by some compression term, and then add some noise, which expands the data up again. So you started with some data, and you can generate a whole bunch of data at different noise levels. And what you really want to do is learn a function that takes data at one noise level and cleans it up slightly, to a slightly less noisy level. You can then apply that model: you start with noisy data, you clean it up a little bit, and then a little bit more, and you do this over and over again, until you are back at the cleanest level, which is where the data itself lies.
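As a rough illustration, here is a minimal numpy sketch of that forward noising and iterative cleanup loop; the noise schedule is an illustrative assumption, and denoise() is a placeholder for the learned model.

```python
# A minimal sketch of diffusion-style noising and iterative denoising.
import numpy as np

rng = np.random.default_rng(0)
T = 100
alphas = np.linspace(0.999, 0.95, T)       # per-step "compression" terms
alpha_bar = np.cumprod(alphas)             # cumulative product up to step t

def add_noise(x0, t):
    # Forward process: scale the clean data down, then add Gaussian noise.
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def denoise(x_t, t):
    # Placeholder for the learned model that cleans data up by one level.
    return x_t

x0 = rng.normal(size=(32, 32))             # stand-in for a clean image
x_noisy = add_noise(x0, T - 1)             # data at the noisiest level

# Sampling: start from pure noise and clean up level by level.
x = rng.normal(size=(32, 32))
for t in reversed(range(T)):
    x = denoise(x, t)
```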
  • 18:28 And so that's a very simplistic explanation of diffusion models. There's a whole range of possibilities within this scheme: how do you add the noise, and how do you convert the noisy data back to clean data? There's a whole bunch of techniques that factor in different trade-offs in these choices. There are also variants of diffusion that don't look at it as a sequence of discrete steps but instead deal with it in continuous time, almost like a true diffusion process.
  • 19:06 And there are also techniques that apply this to discrete data. I've been showing you continuous data, but even discrete data can work through diffusion models, where you have categorical choices, kind of like mutations during evolution: things mutate from signal down to noise, and then you learn a model that goes backwards to generate the data for real sequences. And there's even a continuous-time version of this discrete diffusion process, if you can believe me.
  • 19:41 Okay. So these models work well, but I don't want to leave you with the impression that everything just works right off the bat. So I want to highlight an example, just to leave you with a vignette of the kinds of innovations you need to make things work when you take on a new challenge. Diffusion models work well, but if you have really large data, like high-resolution images, it's a lot trickier to make them work right off the bat.
  • 20:11 And so what people end up doing, and I'm highlighting two different techniques from the literature here: on the left-hand side, you first learn an encoding of your data itself. For high-resolution images, you can learn a compression that compresses them into smaller images or smaller feature vectors, then learn a diffusion model in that smaller space and generate everything in that compressed space. And from that compressed space, you can come back to the real data using the model you learned first. On the right-hand side is something called cascaded diffusion: if you want to generate high-resolution images, you generate things at a lower resolution and then use that as a seed for something at a higher resolution, and you expand upwards to the full resolution.
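As a rough sketch of the left-hand idea (diffusing in a learned compressed space), here is a minimal outline; the encoder, decoder, and denoiser would all be trained models, and toy stand-ins are substituted so the sketch runs.

```python
# A minimal sketch of "latent" diffusion: compress, diffuse, then decode.
import numpy as np

rng = np.random.default_rng(0)

def encoder(image):                 # learned compression: image -> latent
    return image.reshape(64, -1).mean(axis=1)        # toy stand-in

def decoder(latent):                # learned reconstruction: latent -> image
    return np.repeat(latent, 256).reshape(128, 128)  # toy stand-in

def denoise(z, t):                  # learned denoiser in the latent space
    return z                        # placeholder

# Training-time direction: compress real images, learn diffusion on latents.
image = rng.normal(size=(128, 128))
latent = encoder(image)             # much smaller than the image

# Sampling-time direction: diffuse in latent space, then decode once.
z = rng.normal(size=latent.shape)
for t in reversed(range(50)):
    z = denoise(z, t)
generated_image = decoder(z)
```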
  • 21:06 Okay. So I think I'm really going to run out of time, so I'm going to skip right to the end of my talk, because I have only one minute, and I think this might be interesting.
  • 21:18 Okay. So I want to talk quickly about how you might use this for predicting the structure of molecules. As I mentioned, with diffusion models you have something you've learned, the denoising model, that takes in some noisy data and tries to clean it up. And you're given some features that describe the data as well, which help in the cleanup process.
  • 21:42 So you can do the same thing with molecules. You can give it a SMILES representation of your molecule, which, for those who don't know, is a way of representing a compound as a sequence that's used in cheminformatics packages. You can take the SMILES feature and convert it into features for the molecule, and then you have a denoising model that takes in the noisy coordinates for each of the atoms in the molecule and cleans up those coordinates. So, really, there's no real chemistry information used; you don't bake in any sort of information about bond angles or any of that at all. You just train a model: you're given the compound, you noise up its structure, and then you learn how to denoise the structure back.
  • 22:37 And the way the features are computed is you take a compound, and then you label all the atoms, and from the atoms you can compute a graph. This graph basically represents which atoms are connected to which other atoms. I guess the detail is not important, but you can represent the structure of a molecule as a graph, and from that graph you can compute something called the graph Laplacian, which allows you to compute features for each of the atoms in the graph. You can then add on some descriptors for each of the atoms in the graph: obvious but very basic things like the atom type, the degree, the valence, and so on.
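Here is a minimal numpy sketch of computing graph-Laplacian features for atoms; the four-atom adjacency matrix is an illustrative toy molecule.

```python
# A minimal sketch of graph-Laplacian positional features for atoms.
import numpy as np

# Adjacency matrix: which atoms are bonded to which (a toy 4-atom chain).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))          # degree matrix
L = D - A                           # (unnormalized) graph Laplacian

# Eigenvectors of the Laplacian give each atom a descriptor that reflects
# its place in the graph's connectivity.
eigvals, eigvecs = np.linalg.eigh(L)
k = 3
atom_features = eigvecs[:, 1:k + 1]  # skip the constant first eigenvector
print(atom_features.shape)           # (num_atoms, k)
```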
  • 23:23 And then you just run a diffusion model. I won't go into the details of it, but, essentially, you get features for the atoms based on connectivity, plus some extra descriptors, together with the noisy 3D coordinates, and then you just learn to predict the coordinates at the next, less noisy step. During inference, you can then just run that model: you start with the structure and some random 3D assignments for the positions of each atom, and as you run it, the model starts to clean up the 3D positions. And at the end, it'll give you a full structure for the molecule. Same for another one.
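And a minimal sketch of that conditional cleanup loop, where denoise_step() is a hypothetical stand-in for the trained model and the feature dimensions are illustrative:

```python
# A minimal sketch of conditional denoising for conformer generation:
# coordinates are cleaned up step by step, conditioned on per-atom features.
import numpy as np

rng = np.random.default_rng(0)
num_atoms, T = 12, 100
atom_features = rng.normal(size=(num_atoms, 16))   # Laplacian + descriptors

def denoise_step(coords, features, t):
    # Hypothetical learned model: predicts slightly cleaner coordinates
    # given the noisy ones and the per-atom features.
    return coords  # placeholder

coords = rng.normal(size=(num_atoms, 3))           # random 3D starting positions
for t in reversed(range(T)):
    coords = denoise_step(coords, atom_features, t)
# With a real trained model, coords now holds the predicted conformer.
```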
  • 24:10 And, yeah, the TL;DR is that this works quite well. We got state-of-the-art results on predicting structures compared to prior work, although we didn't actually use any information about chemistry for it; all of that was just learned by the model on its own.
  • 24:27 So, to conclude, I hope you'll take away that there are interesting ways to embed all kinds of data into a vector representation that you can use for your statistical models, and that generative models will allow you to build a model where you can generate new data, which can be used for infilling or even for finding correlations that you may not have expected. And, yeah, we're a group at Apple just doing fundamental machine learning research. So, with that.