
Progress in Generative Models in Machine Learning

October 28, 2024

Transcript

  • 00:00 So next up, we have Dr. Deep Jaitly, who we're lucky to have from Apple machine learning, where he leads a team of researchers working on fundamental techniques for machine learning. I've known Deep for some time, and I know that he was actually working on deep learning before deep learning was cool, a long time ago. He did his PhD under the supervision of Geoffrey Hinton, who, as you may have heard, just got a Nobel Prize, in the foundational days of deep learning. He then joined Google Brain, where he worked on deep learning models for sequences, and he also worked at various places such as Nvidia and Google Brain Robotics. Then somehow he got to finance for a bit, at D. E. Shaw, and also, before that, national labs. So we're going to hear from Deep, I think, the latest in generative models.
  • 00:56 Thank you, John. Well, I'll be presenting a lot of work with my colleagues at Apple. I have to say, this is a little different from my usual talk, where I get into the nitty-gritty of machine learning and why one slight variation is more important than another. Instead, for today, I thought I would touch on what I think are three essential things that people working in the life sciences should think about in terms of advances in machine learning, as they would be relevant to them when they're looking at data.
  • 01:32 Okay. So what are these three things that I have in mind? Well, the first one is that neural networks have recently gotten really good at embedding various kinds of data into a vector representation. If you get some data, you first need to convert it into a form that you can work with in statistical models, so this is a requirement. In the past, people had some ways to do it, and recently there has been a lot of progress here, so I want to touch upon that a little bit. I should also say: we can fit these representations very well. No one really knows why, but we can.
  • 02:13 And recently, generative models themselves have become really powerful. We now have a truly uncanny ability to generate data, one that surprises me every day, from generative models of text to models of images. In that regard, I'll touch on the two main techniques today, autoregressive models and diffusion models, and how they work.
  • 02:42 And I should also highlight that we can now do this across modalities. It's not only just a model for text or just a model for images; instead, we can build models that work across all of them. I should say everything is going to be at a really high level, but if you want to get into the nitty-gritty, we can talk after the talk.
  • 03:03 And I will end with one little vignette on doing conformer prediction with diffusion models. You're given a compound, and you want to predict what the structure of that compound is. Everybody's seen AlphaFold; there are specific methodologies the early AlphaFold methods used, which rely on a lot of information like multiple sequence alignments and so on. But now our techniques are getting powerful enough that you can do these things without that much information. So I think this is an interesting approach to highlight, and the latest AlphaFold 3 also works on diffusion, so there's some commonality there.
  • 03:48 Okay. So how does this embedding of data into a vector space work? Traditionally, we think of this as representation learning. You're given some data, and before you can do anything with it, you want to first convert it into a usable form by embedding it into some vector space with n dimensions. Then you plug that into a statistical model, and you can do things like make predictions on it, or maybe do unsupervised learning, like the clustering that we just saw previously.
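To make that workflow concrete, here is a minimal Python sketch of the embed-then-model pattern; embed() is a hypothetical stand-in for whatever encoder you use, and the random vectors and labels are placeholders so the sketch runs end to end.

```python
# A minimal sketch of the embed-then-model workflow described above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def embed(items):
    # Hypothetical encoder: maps each raw item to an n-dimensional vector.
    # Here we fake it with random vectors just so the sketch runs end to end.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(items), 128))

data = ["sample_a", "sample_b", "sample_c", "sample_d"]
X = embed(data)                        # (num_items, 128) vector representations

# Downstream statistical models operate on the embeddings, not the raw data.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)   # unsupervised
labels = np.array([0, 1, 0, 1])                             # placeholder labels
clf = LogisticRegression().fit(X, labels)                   # supervised
```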
  • 04:21 And so what people used to do in the past was have a specific technique for every modality. If you had images, you would use 2D convolutions, your convolutional models. For text, you would embed each word into a little descriptor for it. For waveforms, you might convert the waveform into spectral representations and then embed those into a fixed-size vector space. And because of this limitation, what happened is that everybody had models for different kinds of data, and they were all separate.
  • 05:01 And then over time, people decided: well, let's try to embed different modalities into the same space. And once they're in the same space, we'll just combine them with little things, like adding the representations from the different spaces together, or maybe even putting a small neural net on top of that.
  • 05:19 But that was really quite inflexible, in that the kinds of changes you could make to the representation were limited, and what you did at the time you wanted to use the model was pretty much fixed by how you trained it. If you had two things going in during training, you could only use those two things during evaluation. So if you had images and text, you could just use images and text and no other combinations.
  • 05:45 And so in twenty seventeen, all this changed with the attention models paper, "Attention Is All You Need." It was really a breakthrough paper, which has had implications in various forms.
  • 05:57 The basic idea is that you can take embeddings of different data, and then you can combine these embeddings by choosing what's important. So there's this notion of attention. I won't necessarily go into the details, but there's this notion of looking at your data, looking at different parts of it, and choosing those parts if they seem relevant to the model itself. And this is all learned during the training of the model, so the attention is not baked in beforehand; the model just learns how to do it as part of training.
  • 06:34 What this also offers is a really interesting ability to change how you embed data into your model. So instead of just using the traditional approach to embeddings, where you put your data through some pre-baked model, you can apply these attention models to compress your data into a fixed-size representation. Here's an example for images: you can take an image, split it into patches, and then learn an attention model on top, which compresses the whole image down to a single vector, and now you can use that vector for anything else you want to do with it.
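As a rough illustration of that idea, here is a minimal numpy sketch of attention pooling over image patches; the patch grid, dimensions, and the single learned query are illustrative assumptions, not a specific model's recipe.

```python
# A minimal sketch of attention pooling: many patch embeddings -> one vector.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 64
patches = rng.normal(size=(196, d))   # e.g. a 14x14 grid of patch embeddings

# In a trained model these would all be learned parameters.
query = rng.normal(size=(d,))         # one learned "summary" query
W_k = rng.normal(size=(d, d))         # key projection
W_v = rng.normal(size=(d, d))         # value projection

keys = patches @ W_k
values = patches @ W_v
attn = softmax(query @ keys.T / np.sqrt(d))  # (196,) weights over patches
image_vector = attn @ values                 # single d-dim summary vector
```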
  • 07:16 Furthermore, what's really interesting is that now you can do this for various modalities across time. You can have images, text, videos, sound; they all get embedded into the same space, and you can compress them down to the same format. So you can do things like apply it to different sentences of different lengths. You don't have to worry about periodicity, or the fact that everything has to be the same length, with this device, and you can do this across different data types.
  • 07:47 So it's a really powerful tool, and I wanted to highlight it today because I think, if you're dealing with multivariate data, you can, over time, think about clever techniques for how to combine things together. A lot of the ingenuity that's gone into things like AlphaFold is about how you combine the various data that go in there. It requires some experience, but I think, with just a little tweaking, you get pretty good at it.
  • 08:16 Okay. So switching to generative models. Once you have an embedding, you can build generative models of the data. What's a generative model? A generative model is a model that, by definition, allows you to generate new data of that modality. Additionally, it can help you quantify whether something you're seeing has high probability or low probability, so that you can do other things with that probability, such as build tools on top of those measures.
  • 08:48 There's a wide variety of techniques for generative models that the machine learning community has built over time, but I'll basically just be talking about autoregressive models and diffusion models, which are really the mainstay of the models today, and you're likely quite familiar with them. An example of an autoregressive model would be ChatGPT; for diffusion models, something like Stable Diffusion for image generation is an example.
  • 09:22 Okay. So with autoregressive models, the goal is to build a model that gives you a probability for any data point. The way we do this with autoregressive models is to convert high-dimensional data into a sequence and then measure the probability of the sequence using the chain rule of conditional probability, which basically says to multiply the probabilities of each variable given the variables that came before it.
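In symbols, the chain-rule factorization being described is

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}),$$

i.e., the probability of a whole sequence is the product of each next-step probability given everything before it.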
  • 09:52 I think the details are not too important; I'll try to highlight it with this sort of example here. Let's say you have the web and you want to build a generative model of text. What you would do is take the entire dataset of text and convert it into input-output pairs of the type x and y. You're given some data x, and you want to predict y. You're familiar with regression or logistic regression; it's the same sort of technique. You're basically just trying to predict some target given some input.
  • 10:26 And so with the web, what you would do is just take all the prefixes. You've got a Shakespearean verse here, say "to be or not to be," and you convert that into data examples: an empty start with the first word being "to"; then x being "to" and the next word being "be," which is another data example; and then "to be" is the input for another one, with "or" as the target word. And so you can convert the entire dataset, the entire web, into such a database, and you are now learning a model that learns to predict the next word given whatever context it has.
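Here is a minimal Python sketch of that prefix-to-pairs conversion, using the same verse as a toy dataset:

```python
# Turn text into next-word (x, y) training pairs, as described above.
text = "to be or not to be"
words = text.split()

pairs = []
for i in range(len(words)):
    context = words[:i]        # the prefix seen so far (empty at the start)
    target = words[i]          # the next word to predict
    pairs.append((context, target))

for context, target in pairs:
    print(f"x = {context!r:30} -> y = {target!r}")
# x = []                         -> y = 'to'
# x = ['to']                     -> y = 'be'
# x = ['to', 'be']               -> y = 'or'
# ...
```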
  • 11:06 And so, you know, this is the workhorse of how current models like ChatGPT work. You just take the web and do next-token prediction, as it were. And then, when you want to run the model, what you do is feed in some context, such as "why is the sky blue," and you let the model generate the next word, which it has already learned from its training. Then you take that word, in this case "because," and feed it back in as the next input, "why is the sky blue because," and have it predict the next word: "because of Rayleigh scattering," and so on. So you basically run the model during inference, and it just generates text, and that generated output is what you see.
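And here is a minimal sketch of that generate-and-feed-back inference loop; next_token_probs() is a hypothetical stand-in for a trained language model, with placeholder scores just so the loop runs.

```python
# A minimal sketch of the autoregressive inference loop described above.
import numpy as np

VOCAB = ["why", "is", "the", "sky", "blue", "because", "of", "rayleigh",
         "scattering", "<eos>"]

def next_token_probs(context):
    # Hypothetical model call: returns a probability over VOCAB given the
    # context. A real model would be a trained neural network.
    rng = np.random.default_rng(len(context))
    p = rng.random(len(VOCAB))
    return p / p.sum()

context = ["why", "is", "the", "sky", "blue"]
for _ in range(10):
    probs = next_token_probs(context)
    token = VOCAB[int(np.argmax(probs))]   # greedy: take the most likely word
    if token == "<eos>":
        break
    context.append(token)                  # feed the word back in and repeat

print(" ".join(context))
```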
  • 12:00 So how do you apply this to modalities other than text? It's quite clear for text: it's discrete data, and these models work really well for discrete data. They don't really work so well in continuous regression space. So it's easy to apply to things like protein sequences, amino acids, and so on, which are naturally discrete. It's a little trickier for high-dimensional data that is not string-like.
  • 12:28 To highlight how people do this: what you end up doing is building a model that first encodes your data into a sequence of discrete tokens, and then you learn another model, called a reconstruction model, which takes in the output tokens and converts them back to the data itself. Typically, you actually learn both of these models together. And once you have this, you can convert your entire dataset into sequences of tokens, and then learn the autoregressive model on those sequences by just predicting the next token given the history of tokens. You can then generate new data of this kind by running the autoregressive model, generating some sequences, and then converting those back to real data.
  • 13:23 Here's an example of how you might apply this to a modality like speech. Speech is really just waveforms, and if you really wanted to, you could just model speech directly; people have done that. But it's harder to deal with, because speech happens at a very fast rate, so the data would just be too much. So, typically, what people do now is convert speech into a spectral representation, by taking windows of speech and computing a Fourier spectrum in each. The original waveform is converted into a frequency diagram over time, showing how the sound distributes its energy over the different frequencies on the y-axis.
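As a rough sketch, that windowed Fourier computation can look like this in numpy; the window and hop sizes are illustrative choices.

```python
# A minimal sketch of computing a magnitude spectrogram from a waveform.
import numpy as np

def spectrogram(waveform, window=256, hop=128):
    frames = []
    for start in range(0, len(waveform) - window + 1, hop):
        frame = waveform[start:start + window] * np.hanning(window)
        frames.append(np.abs(np.fft.rfft(frame)))  # magnitude spectrum
    return np.stack(frames)                        # (time, frequency)

t = np.linspace(0, 1, 16000)            # 1 second at 16 kHz
wave = np.sin(2 * np.pi * 440 * t)      # a 440 Hz tone as a toy waveform
spec = spectrogram(wave)
print(spec.shape)                       # (num_frames, window // 2 + 1)
```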
  • 14:12 And once you've converted it into this format, you can learn an inverse model, called a vocoder, which will just generate the raw waveform back from this encoded speech. Now, unfortunately, the spectrum on the right-hand side is still continuous; it's not discrete, and so it's hard to feed it into an autoregressive model.
  • 14:37 So what you can do is simply take that data and tokenize it by discretizing it: take off the minimum, divide by the maximum, and round, so the values are converted into a range of integer bins between zero and some maximum bin. And now you have a discretized version. You can convert back to the original as well, by just mapping the discrete values back to continuous codes. And so now you have this machinery by which you can take continuous data, convert it to tokens, and then convert it back to real data, so you can feed it straight into an autoregressive model.
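Here is a minimal numpy sketch of that min-max discretization and its inverse; the bin count and the data are illustrative.

```python
# A minimal sketch of min-max tokenization of continuous values and back.
import numpy as np

def tokenize(x, num_bins=256):
    lo, hi = x.min(), x.max()
    scaled = (x - lo) / (hi - lo)                       # map into [0, 1]
    tokens = np.clip((scaled * num_bins).astype(int), 0, num_bins - 1)
    return tokens, (lo, hi)

def detokenize(tokens, lo_hi, num_bins=256):
    lo, hi = lo_hi
    centers = (tokens + 0.5) / num_bins                 # bin centers in [0, 1]
    return centers * (hi - lo) + lo                     # back to original range

spec_frame = np.random.default_rng(0).normal(size=129)  # toy spectral frame
tokens, lo_hi = tokenize(spec_frame)
approx = detokenize(tokens, lo_hi)
print(np.max(np.abs(spec_frame - approx)))              # small quantization error
```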
  • 15:17 And so you take the waveform, you compute its spectral representation, you take each spectral frame and convert it into a sequence of discrete tokens, and voila, you can do next-step prediction. You feed in your history of tokens, and then you can predict the next token. So it's basically a recipe that's repeated all over: you just learn how to discretize your data into some discrete bins, and then you learn the autoregressive model. This has already been applied to various things like speech, videos, images, and so on, and it's a pretty powerful technique that can be applied to other modalities as well.
  • 16:03 And now I want to talk a little bit about diffusion models. This is a set of newer techniques that allows you to morph one probability distribution into another. There are methods called optimal transport, flow matching, and diffusion models; they're all trying to map one distribution onto another.
  • 16:25 It might seem like a very arcane idea, but it's really a powerful methodology when you want to think about how to generate data from noise. In the case of diffusion models, you morph a Gaussian distribution, which is something that people know how to handle, into a real data distribution, which is really hard to handle. If you give me images or speech, I don't know what the data distribution itself is, or how to model that distribution. Or multi-omics data: what is the actual distribution of the data? One doesn't know. And so the ability to generate and sample from that is quite useful, and mapping to a simple distribution allows us to do that.
  • 17:12 And so how does this actually work in practice? I'll show you an example with images. You have some image, x0, on the right-hand side. You take that image, and what you can do is scale it down in magnitude by multiplying it by some compression term, and then add some noise, which expands the data up again. So you started with some data, and you can generate a whole bunch of data at different noise levels. And what you really want to do is learn a function that takes data at one noise level and cleans it up slightly, to a slightly less noisy level. You can then apply that model: you start with noisy data, you clean it up a little bit, and then a little bit more, and you do this over and over again, until you are back at the cleanest level, which is where the data itself lies.
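As a rough illustration, here is a minimal numpy sketch of that forward noising and iterative cleanup loop; the noise schedule is an illustrative assumption, and denoise() is a placeholder for the learned model.

```python
# A minimal sketch of diffusion-style noising and iterative denoising.
import numpy as np

rng = np.random.default_rng(0)
T = 100
alphas = np.linspace(0.999, 0.95, T)       # per-step "compression" terms
alpha_bar = np.cumprod(alphas)             # cumulative product up to step t

def add_noise(x0, t):
    # Forward process: scale the clean data down, then add Gaussian noise.
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def denoise(x_t, t):
    # Placeholder for the learned model that cleans data up by one level.
    return x_t

x0 = rng.normal(size=(32, 32))             # stand-in for a clean image
x_noisy = add_noise(x0, T - 1)             # data at the noisiest level

# Sampling: start from pure noise and clean up level by level.
x = rng.normal(size=(32, 32))
for t in reversed(range(T)):
    x = denoise(x, t)
```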
  • 18:28 And so that's a very simplistic explanation of diffusion models. There's a whole range of possibilities within this scheme: how do you add the noise, and how do you convert the noisy data back to clean data? There's a whole bunch of techniques that factor in different trade-offs in these choices. There are also variants of diffusion that don't look at it as a sequence of discrete steps but instead deal with it in continuous time, almost like a true diffusion process.
  • 19:06 And there are also techniques that apply this to discrete data. I've been showing you continuous data, but even discrete data can work through diffusion models, where you have categorical choices, kind of like mutations during evolution: things mutate from signal down to noise, and then you learn a model that goes backwards to generate the data for real sequences. And there's even a continuous-time version of this discrete diffusion process, if you can believe me.
  • 19:41 Okay. So these models work well, but I don't want to leave you with the impression that everything just works right off the bat. So I want to highlight an example, just to leave you with a vignette of the kinds of innovations you need to make things work when you take on a new challenge. Diffusion models work well, but if you have really large data, like high-resolution images, it's a lot trickier to make them work right off the bat.
  • 20:11 And so what people end up doing, and I'm highlighting two different techniques from the literature here: on the left-hand side, you first learn an encoding of your data itself. For high-resolution images, you can learn a compression that compresses them into smaller images or smaller feature vectors, then learn a diffusion model in that smaller space and generate everything in that compressed space. And from that compressed space, you can come back to the real data using the model you learned first. On the right-hand side is something called cascaded diffusion: if you want to generate high-resolution images, you generate things at a lower resolution and then use that as a seed for something at a higher resolution, and you expand upwards to the full resolution.
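As a rough sketch of the left-hand idea (diffusing in a learned compressed space), here is a minimal outline; the encoder, decoder, and denoiser would all be trained models, and toy stand-ins are substituted so the sketch runs.

```python
# A minimal sketch of "latent" diffusion: compress, diffuse, then decode.
import numpy as np

rng = np.random.default_rng(0)

def encoder(image):                 # learned compression: image -> latent
    return image.reshape(64, -1).mean(axis=1)        # toy stand-in

def decoder(latent):                # learned reconstruction: latent -> image
    return np.repeat(latent, 256).reshape(128, 128)  # toy stand-in

def denoise(z, t):                  # learned denoiser in the latent space
    return z                        # placeholder

# Training-time direction: compress real images, learn diffusion on latents.
image = rng.normal(size=(128, 128))
latent = encoder(image)             # much smaller than the image

# Sampling-time direction: diffuse in latent space, then decode once.
z = rng.normal(size=latent.shape)
for t in reversed(range(50)):
    z = denoise(z, t)
generated_image = decoder(z)
```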
  • 21:06 Okay. So I think I'm really going to run out of time, so I'm going to skip right to the end of my talk, because I have only one minute, and I think this might be interesting.
  • 21:18 Okay. So I want to talk quickly about how you might use this for predicting the structure of molecules. As I mentioned, with diffusion models you have something you've learned, the denoising model, that takes in some noisy data and tries to clean it up. And you're given some features that describe the data as well, which help in the cleanup process.
  • 21:42 So you can do the same thing with molecules. You can give it a SMILES representation of your molecule, which, for those who don't know, is a way of representing a compound as a sequence that's used in cheminformatics packages. You can take the SMILES feature and convert it into features for the molecule, and then you have a denoising model that takes in the noisy coordinates for each of the atoms in the molecule and cleans up those coordinates. So, really, there's no real chemistry information used; you don't bake in any sort of information about bond angles or any of that at all. You just train a model: you're given the compound, you noise up its structure, and then you learn how to denoise the structure back.
  • 22:37 And the way the features are computed is you take a compound, and then you label all the atoms, and from the atoms you can compute a graph. This graph basically represents which atoms are connected to which other atoms. I guess the detail is not important, but you can represent the structure of a molecule as a graph, and from that graph you can compute something called the graph Laplacian, which allows you to compute features for each of the atoms in the graph. You can then add on some descriptors for each of the atoms in the graph: obvious but very basic things like the atom type, the degree, the valence, and so on.
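Here is a minimal numpy sketch of computing graph-Laplacian features for atoms; the four-atom adjacency matrix is an illustrative toy molecule.

```python
# A minimal sketch of graph-Laplacian positional features for atoms.
import numpy as np

# Adjacency matrix: which atoms are bonded to which (a toy 4-atom chain).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))          # degree matrix
L = D - A                           # (unnormalized) graph Laplacian

# Eigenvectors of the Laplacian give each atom a descriptor that reflects
# its place in the graph's connectivity.
eigvals, eigvecs = np.linalg.eigh(L)
k = 3
atom_features = eigvecs[:, 1:k + 1]  # skip the constant first eigenvector
print(atom_features.shape)           # (num_atoms, k)
```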
  • 23:23 And then you just run a diffusion model. I won't go into the details of it, but, essentially, you get features for the atoms based on connectivity, plus some extra descriptors, together with the noisy 3D coordinates, and then you just learn to predict the coordinates at the next, less noisy step. During inference, you can then just run that model: you start with the structure and some random 3D assignments for the positions of each atom, and as you run it, the model starts to clean up the 3D positions. And at the end, it'll give you a full structure for the molecule. Same for another one.
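And a minimal sketch of that conditional cleanup loop, where denoise_step() is a hypothetical stand-in for the trained model and the feature dimensions are illustrative:

```python
# A minimal sketch of conditional denoising for conformer generation:
# coordinates are cleaned up step by step, conditioned on per-atom features.
import numpy as np

rng = np.random.default_rng(0)
num_atoms, T = 12, 100
atom_features = rng.normal(size=(num_atoms, 16))   # Laplacian + descriptors

def denoise_step(coords, features, t):
    # Hypothetical learned model: predicts slightly cleaner coordinates
    # given the noisy ones and the per-atom features.
    return coords  # placeholder

coords = rng.normal(size=(num_atoms, 3))           # random 3D starting positions
for t in reversed(range(T)):
    coords = denoise_step(coords, atom_features, t)
# With a real trained model, coords now holds the predicted conformer.
```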
  • 24:10 And, yeah, the TL;DR is that this works quite well. We got state-of-the-art results on predicting structures compared to prior work, although we didn't actually use any information about chemistry for it; all of that was just learned by the model on its own.
  • 24:27 So, to conclude, I hope you'll take away that there are interesting ways to embed all kinds of data into a vector representation that you can use for your statistical models, and that generative models will allow you to build a model where you can generate new data, which can be used for infilling or even for finding correlations that you may not have expected. And, yeah, we're a group at Apple just doing fundamental machine learning research. So, with that.