
The Power of Deep Learning at Facebook | Distinguished Lecture Series on AI | J.P. Morgan

October 26, 2019


YANN LECUN: Yeah, I've spent about half my career in industry and half in academia, and now I have one foot in each. I guess it's more like half a foot in academia, because I'm spending most of my time at Facebook, but things change. I basically started the first research lab at Facebook, FAIR. There are a bunch of research labs at Facebook now, but this was the first. And it was a bit of a culture shock for Facebook, which was a very engineering-driven, short-term-oriented company that had to invent a new culture for research for itself. That's what I liked about it: the fact that we could start from scratch and basically establish this culture. So what we're doing at FAIR is open research. It's really outward focused: all the research we do is published, and almost all the code we write is open sourced. With that we hope to make progress on problems that we think are interesting and to steer the community towards working on problems that we think are important. The point here is not whether Facebook's technology is ahead of Google's or Microsoft's or whoever's, but rather that the products we want to build are not possible today. We don't have the science, or even the basic principles, to build the things we want to build. And since we don't have a monopoly on good ideas, the best we can do is accelerate the progress of the entire community; that's one of our goals. Of course we have a big impact on the company — in fact, a much bigger impact than Mark Zuckerberg thought we would have five years ago when FAIR was created. Today Facebook is really built around deep learning. If you take deep learning out of Facebook today, you get dust, essentially. Not entirely, but you know what I mean. OK, so machine learning has had a huge impact in various areas of business, society, and science, certainly. But most applications of machine learning today use supervised learning, which is one of the three main paradigms of learning. There is supervised learning; there is reinforcement learning, which people have been talking about a lot in the last few years; and then there is another thing that's not very well defined called unsupervised learning, or self-supervised learning, which I'll talk about later. Supervised learning is this idea by which, if you want to train a machine, for example, to distinguish images of cars from airplanes, you show it an image of a car and run it through the machine, and the machine has adjustable knobs on it. If it says car, you don't do anything to the knobs. If it doesn't say car, you adjust the knobs so that the answer the machine produces gets closer to the answer you want. There is a desired answer that you give to the machine, and you can measure the discrepancy between the answer you want and the answer the machine produces. Then you show it the image of the airplane and do the same. By tweaking the parameters with thousands of examples, eventually the knobs will, perhaps, converge to a configuration where all the cars are correctly classified and all the airplanes are correctly classified. The magic of this is that it may even work for airplanes and cars it has never seen before. That's the whole purpose of learning: you learn the concept without having to memorize every example.
So this type of learning, supervised learning, works really well if you want to do speech recognition — that is, speech to words — images to categories, face recognition, generating captions for photos, figuring out the topic of a text, translating from one language to another. That's all supervised learning. The tradition starts with models from the late '50s and early '60s, the Perceptron and the Adaline, which, interestingly, at the time were really hardware devices. They were not programs on a computer; they were actually analog computers that were built. At the bottom here is a Perceptron, and what you see here is Bernie Widrow reviving one of his old Adaline systems at Stanford. That created the standard model of pattern recognition, which was prevalent until fairly recently: you take a raw signal and feed it to what's called a feature extractor, which is hand engineered. It's designed by people to extract relevant information from the raw signal. Then you feed the result, the feature vector, to a classifier — something like a linear classifier, a nearest-neighbor method, a tree, or whatever. There are a lot of techniques that people have come up with over the last 50 or 60 years — more like 60, actually — that they used to do this. That was the standard way. What deep learning changed is the idea that you can learn this feature extractor. Instead of having to spend a lot of time, expertise, and money engineering those things for every new problem, you can train the feature extractor as part of the entire process. You build the machine as a cascade of trainable modules, which you can call layers; each module transforms the representation, and you train the whole thing using supervised learning. And you hope that the machine will learn not just to classify, but also to figure out what relevant features need to be extracted at every layer for the system to do a good job. So the next question you can ask is, what do we put in those boxes? The answer is not recent: it's the idea of artificial neural networks. In an artificial neural net, the layers are essentially of two types. One type is just a linear operator. Imagine that the signal is represented as a vector, a list of numbers — pixel values, or signal values, whether it's audio or a financial time series or whatever. You represent this as a vector and multiply it by a matrix. When you compute the product of a matrix with a vector, you're computing the dot product of the vector with every row of the matrix, which is like computing a weighted sum of the input features. That's what's represented here: you have a vector, and with the components of this vector you compute a weighted sum where the weights are the coefficients in the matrix, and that gives you an output. Then there is another type of function, the pointwise non-linearity: you take the resulting vector and apply a nonlinear function to every component of it independently. In this case it's what's called a ReLU, which is really just a half-wave rectifier — a function that is the identity for positive arguments and zero for negative arguments. A very simple non-linearity.
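To make that concrete, here is a minimal sketch of the two kinds of layers just described — a linear operator (a matrix multiply, i.e. a bank of weighted sums) followed by a pointwise ReLU, then another linear layer. The sizes and the random weights are arbitrary placeholders, not anything from the talk:

```python
import numpy as np

def relu(x):
    # Pointwise non-linearity: identity for positive values, zero for negative ones.
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 4))   # first linear layer: each row holds the weights of one weighted sum
W2 = rng.standard_normal((3, 8))   # second linear layer mapping to 3 output scores

x = rng.standard_normal(4)         # input vector (pixels, audio samples, prices, ...)

h = relu(W1 @ x)                   # weighted sums, then half-wave rectification
y = W2 @ h                         # another set of weighted sums gives the output
print(y)
```

In a real system the weights are not random, of course; they are adjusted by the training procedure described below.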
And there are theorems showing that with only two such layers — linear, nonlinear, linear — you can approximate any function you want, as closely as you want, as long as the dimension of the middle vector is sufficiently large, possibly infinite. Beyond that there aren't many strong theoretical results, but what we know empirically and intuitively is that by stacking lots of these layers, you can represent many functions very efficiently. That's the whole motivation for deep learning: by stacking multiple alternating linear and nonlinear operators, you can approximate a lot of useful functions very efficiently. AUDIENCE: [INAUDIBLE] theory proved that the deep learning net can approximate the simple function. So if we can go to the simple function [INAUDIBLE]. YANN LECUN: What do you mean by simple? AUDIENCE: A simple function is a special term saying basically [INAUDIBLE]. YANN LECUN: Yeah, yeah. So there are theorems from the late '80s showing that with just two layers — with just one layer of non-linearity — you can approximate any function you want, but there is no limit on the dimension of the middle layer. For the longest time, because of the limitations of computing power and because data sets were small, the number of applications we could apply these approaches to was very limited. The basic techniques are from the late '80s, but the number of problems for which we had enough data to train those systems was very small. We could use them for, maybe, handwriting recognition and speech recognition and a few other applications, but it was kind of limited. People at the time, in the '80s, were really interested in hardware implementations, and this is coming back. There is a whole industry now, restarted over the last three to five years, building special-purpose chips to run those neural nets efficiently, particularly for embedded devices. So probably within the next two years, every smartphone will have a neural net accelerator in it. Within five years, it'll be in every car, and shortly after that, basically every electronic device you buy will have a neural net accelerator in it. Your vacuum cleaner will have smart computer vision in it, because there will be a $3 chip that does neural net acceleration. So how do we train those things? It's basically large-scale optimization. In supervised learning, you measure the discrepancy between the answer the machine produces and the answer you want through some sort of objective function that measures a distance of some kind. Then you average this over a training set of input-output pairs. And the process by which you tune the parameters of the system is just gradient descent: figure out in which direction to change all the knobs so that the objective function goes down, then take a step in that direction, and keep doing this until you reach some sort of minimum. What people actually use is something called stochastic gradient descent, where you estimate the gradient on the basis of a single sample, or maybe a small batch of samples. You show a single example, measure the error, compute the gradient of that error with respect to all the parameters, tweak the parameters, and then go to the next sample.
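Here is a minimal sketch of that stochastic gradient descent loop in PyTorch (the framework mentioned later in the talk). The toy model, the random data, and the hyperparameters are placeholders; the gradient itself is computed by the framework's automatic differentiation, which is what the backpropagation discussion below is about:

```python
import torch
import torch.nn as nn

# Toy stand-in for a labeled training set: 256 examples, 20 features, 2 classes.
X = torch.randn(256, 20)
y = torch.randint(0, 2, (256,))

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()              # measures the discrepancy with the desired answer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):
    for i in range(0, len(X), 32):           # small mini-batches, hence "stochastic"
        xb, yb = X[i:i+32], y[i:i+32]
        loss = loss_fn(model(xb), yb)        # discrepancy between output and desired answer
        optimizer.zero_grad()
        loss.backward()                      # gradient of the loss w.r.t. every parameter
        optimizer.step()                     # small step in the negative gradient direction
```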
It's called stochastic because you get a noisy estimate of the gradient from a single sample or a small batch. So the next question you might ask is, how do we compute this gradient? That's where backpropagation comes in, and I'm sure many of you are familiar with it. Don't attempt to understand the formula — you don't need to. The idea is that the gradient, which is really the sensitivity of the cost function to all the coefficients in the system, all the matrices in the weighted sums, can be computed with a backward pass, which is basically just a practical application of the chain rule. You know that tweaking a parameter in this block here will affect the output in a particular way, and you know how tweaking that output affects the overall cost. So it's very easy, using this backpropagation method, to compute all the terms of the gradient. Now, for every parameter in the system, you have a quantity that indicates by how much the cost will increase or decrease if you tweak that parameter by some given delta. That gives you the gradient; take a step in the negative gradient direction. There are various tricks to make this fast. What made deep learning possible, and what makes it easy to use, is that you don't have to figure any of this out yourself. In modern deep learning frameworks, you build a network either by writing a program in Python or your favorite language, or by assembling predefined blocks into a graph, and the system automatically knows how to compute the gradient. You tell it how to compute the output, it keeps track of all the operations performed during that computation, and then it traces back through those operations so that the gradient of whatever you're computing, with respect to all the parameters, is computed automatically. It's a very simple concept — automatic differentiation — but that's really what makes deep learning so easy to deploy and use. OK, now, here is a problem. If you want to apply deep learning, or neural nets as I've described them, to images, it's not really practical to view an image as a vector whose components are the pixels. If you have an image that's, say, 200 by 200 pixels, that's 40,000 pixels, and if you multiply this by a matrix, that matrix will be 40,000 by something. It's going to be large — too large. So you have to figure out how to specialize the connections between the neurons, basically how to build sparse matrices, in such a way that the computation becomes practical. That's the idea of convolutional networks, which is something my name is associated with. The inspiration goes back to classic work in neuroscience from the '60s — actually, Nobel Prize-winning work by Hubel and Wiesel on the architecture of the visual cortex. Various people tried to make computer models of this, but they didn't have things like backprop. So what's the idea behind a convolutional net?
You take an image, and the linear operation you're going to do is not a full matrix but what's called a discrete convolution. It consists of taking a little patch of the image — 5 by 5, in this case — computing the weighted sum of those pixels with a set of 25 weights, and putting the result in the corresponding pixel of the output. Then you shift that window over by one pixel and do the same thing: compute the dot product, the weighted sum of the pixels with those coefficients, and record the result next to the previous one. By sweeping this window over the whole image, you get an image at the output, which is the result of convolving the input image with this so-called convolution kernel. So the number of free parameters in your "matrix" is very small — only 25 in this case — and the amount of computation is relatively small. The advantage of this kind of computation is in situations where the signal comes to you as an array, one-dimensional or multidimensional, in such a way that the statistics are more or less stationary, and also such that neighboring values tend to be highly correlated whereas faraway values are less so. That's the case for financial time series, for example, and people have been using convolutional nets on financial time series. Yes? AUDIENCE: Can it also do edge detection? YANN LECUN: Say again? AUDIENCE: [INAUDIBLE] spanning tree and things like that. YANN LECUN: There are certain configurations of those coefficients that will produce edge detection, yes. But we're not going to hard-wire those coefficients. They're going to be the result of learning; we're not going to build them by hand. We just initialize them randomly and then train the entire thing end to end, supervised to produce the right answer at the end, on millions of examples — or thousands of examples — and then look at the result. OK, so that's the first layer. And we're going to have multiple filters of this type — in this case, four. Each of those four filters produces a so-called feature map. Then there is a second type of operation, called pooling, which consists of taking the results of those filters over a small neighborhood and pooling them, which means computing an average or a max of the values, and then subsampling the image. Subsampling means that this image is half the resolution of that one — the pixels are twice as big, if you want. The reason for this is to throw away a little bit of information about the exact location of features in the image, and that's important if you want a system that is robust to small deformations of the input. So this is a convolutional net in action. It has been trained to recognize handwritten digits; I'm not showing the output here. This is the input, the first layer, the result after pooling, the third layer, another layer of pooling, then yet another layer. By the time you get here, the representation is very distributed and kind of abstract, but every unit here is essentially influenced by the entire input, and the representation of the input is the list of those values. You can get those things to recognize not just single characters but multiple characters, and do simultaneous segmentation. This is very important because, eventually, you want to use those systems on natural images.
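Here is a minimal sketch of the two operations just described — a 5-by-5 convolution that slides a small window of learned weights over the image, and a pooling step that summarizes small neighborhoods and subsamples the result. The channel counts and image size are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

image = torch.randn(1, 1, 32, 32)       # one grayscale image, 32 x 32 pixels

conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=5)  # four 5x5 filters -> four feature maps
pool = nn.MaxPool2d(kernel_size=2)      # pool 2x2 neighborhoods and subsample by 2

features = torch.relu(conv(image))      # weighted sums over each 5x5 patch, then the non-linearity
print(features.shape)                   # torch.Size([1, 4, 28, 28])

pooled = pool(features)                 # half the resolution: less sensitive to exact feature positions
print(pooled.shape)                     # torch.Size([1, 4, 14, 14])
```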
So this is a vintage, early-'90s convolutional net that was built when I was at Bell Labs. Eventually, at Bell Labs, we built a check-reading system based on those convolutional nets and various other tricks. It was deployed in the mid '90s, and by the end of the '90s it was reading somewhere between 10% and 20% of all the checks in the US — so, a big success. But by that time the machine learning community had lost interest in neural nets. Essentially nobody was working on neural nets from the late '90s until roughly the mid 2000s. I left industry in 2003 and joined NYU, as I was mentioning, and I wanted to reignite the community's interest in those methods, because I knew they worked. But they had the reputation of being very finicky. We had our own framework for deep learning, but nobody was interested, so nobody was using our code. And neural nets had the reputation of being so hard to train that only I and a few people working with me were able to train them — which, of course, was not true; it's just that people are lazy. I'm being facetious here. So around 2003, 2004, just when I joined NYU, I got together with my friends Yoshua Bengio at the University of Montreal and Geoff Hinton at the University of Toronto, where I had done my postdoc many years before, and we decided to start a conspiracy to renew the community's interest in neural nets. We started with various algorithms that we thought would enable backprop to train very, very deep networks — not networks with just three or four or five layers, but networks with perhaps 20 layers or so. We started working with unsupervised learning algorithms, which were only partially successful, but successful enough to attract interest, and a community started building itself. Around 2007, there was enough of a community that our papers started to actually get accepted at NIPS. Before that, we could essentially never publish a paper on neural nets at any conference. Then we started getting really good results on standard benchmarks, but they were still dismissed to some extent. That changed around 2009, 2010 in speech recognition, where the results were so much better that people really started switching to neural nets, and then around 2013 in computer vision. That history is well known. In the meantime, starting in the mid 2000s, I started working in robotics — something that [INAUDIBLE] is very familiar with, a project that Tucker Balch was involved in as well. He is now at J.P. Morgan; he was at Georgia Tech at the time, and is still at Georgia Tech. This was a project to use machine learning to get robots to drive themselves in nature, and it took place roughly between 2004-2005 and 2009. The idea was to use a neural net to do what's called semantic segmentation, which means labeling every pixel in an image with the category of the object it belongs to. It uses a convolutional net that sees a band around the horizon of the image, and it's trained to produce another image with essentially three categories: here is something I can drive over, which I'm going to label green, or here is something that is an obstacle. And my video is not working for some reason. All right, here we go.
All right, this one is working. So this is another example of semantic segmentation, from a couple of years later, around 2009 or 2010, when there were data sets with a few thousand images in which people had painfully labeled every pixel with the category of the object it belongs to — things like road, sidewalk, cars, pedestrians, trees, et cetera. We trained this convolutional net to be applied to the entire image, and it labels every pixel with a category. It makes mistakes: it labels this as desert, and this is the middle of Washington Square Park. [LAUGHING] There is no beach there that I'm aware of. But at the time, that was the state of the art — in fact, quite a bit better than the state of the art. This was 2010, and it was also 50 times faster than the best competing technique. So we submitted a paper to CVPR, the big computer vision conference, at the end of 2010, pretty sure the paper would be accepted because it was faster and better than everything people had done before. And it was rejected by all three reviewers, who said, basically: what the hell is a convolutional net, and we don't believe that a method we've never heard of could do so well, so it has to be wrong. That's essentially what the reviewers said. It's funny, because now you basically can't get a paper accepted at CVPR unless you use convolutional nets. Oops, that's not what I wanted to do — bear with me for a second. OK. So convolutional nets are used for a lot of things today: every self-driving-car project has a convolutional net in it, and they're used for all kinds of other things. I gave a talk in 2013 that gave some ideas to people at Mobileye, which now belongs to Intel, and also to Nvidia, and they're using convolutional nets for all their self-driving-car projects. In fact, there is a self-driving-car project taking place in the Holmdel building — the building where I used to work at Bell Labs — run by a group from Nvidia, and the guy running that project is a former colleague from Bell Labs who worked with us on the robotics project that Tucker was involved in. OK, so deep learning today. There was a revolution in 2013 in computer vision because our friends in Geoff Hinton's group at the University of Toronto figured out how to implement convolutional nets on GPUs in a very efficient manner. They were not the first to do this — it was done at Microsoft in the mid 2000s — but they applied it to ImageNet and got results that were so much better than what people were doing before that it created a bit of a revolution. This was the error rate people were getting on ImageNet in 2011, and in 2012, with the so-called AlexNet system from Toronto, the error rate went down by a huge amount. Over the last few years it has gone down to levels so low that this benchmark is not interesting anymore — it's better than human performance on this particular data set. What we've seen simultaneously is an inflation in the number of layers in those networks. The vintage convolutional net from the '90s I showed you earlier had seven layers. One of the best ones from 2013 had 20 layers, and now the best ones have anywhere between 50 and 150 layers.
And Facebook uses those convolutional nets very widely, for a lot of different things. One of the most popular ones used in production is something called ResNet-50. ResNet is the particular architecture shown here, with layers of convolutions, pooling, and non-linearities, but also skip connections that allow the system to fail gracefully: if some layers don't learn appropriately, they essentially become transparent. That's what makes it possible to train very, very deep networks. This is an idea that came from Kaiming He, who was at Microsoft Research Asia at the time and is now at Facebook. This is a graph put together by Alfredo Canziani, who is a postdoc with me at NYU — he did this before he came. On the y-axis you have accuracy; on the x-axis, the number of operations, in billions, needed to compute one output. What people in industry have been trying to do is bring everything down toward this corner, where you get the best accuracy on ImageNet or similar benchmarks for the minimum amount of computation. ResNet-50 is right here; there are better results now. The size of each bubble is the memory footprint, the number of parameters required. There's a lot of work on optimizing those networks to run on regular or specialized processors to save power. And the reason this is important is that, to give you an idea, Facebook users upload somewhere between two and three billion photos every day — and this is just on the main Facebook app, not counting Instagram or anything else. Every single one of those photos goes through roughly half a dozen convolutional nets within two seconds of being uploaded. Those do things like turn the image into a feature vector that can be used for all kinds of purposes — retrieval, search, indexing, generic feature vectors. One does face detection and recognition. Another generates captions that describe the image for the visually impaired. And there are a couple that detect objectionable content — nudity, violence, things like that. The advantage of deep learning is that the system spontaneously learns to represent images in a hierarchical way, from low-level features like edges up to parts of objects, motifs, and so on. One trend over the last few years is the use of weakly supervised or semi-supervised learning. This is weakly supervised learning: an experiment at Facebook by one of the applied computer vision groups, which consisted of taking 3.5 billion images from Instagram and training a convolutional net to predict the hashtags that people tag images with. They settled on about 17,000 different hashtags that correspond to physical concepts, if you want, and then ran the 3.5 billion images through the convolutional net, asking it to predict which of those 17,000 hashtags are present. Then you take this network, chop off the last layer that predicts the hashtags, and just use the second-to-last layer as a feature vector, which becomes the input to a classifier that you train on another task — say, ImageNet. And you can actually beat the record on ImageNet this way.
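The "chop off the last layer and reuse the features" recipe looks roughly like the sketch below. It uses the pretrained ResNet-50 that ships with torchvision as a stand-in for the hashtag-pretrained network from the talk (that exact model is not assumed to be available here), and a made-up 100-class downstream task:

```python
import torch
import torch.nn as nn
from torchvision import models

# A publicly available pretrained ResNet-50 stands in for the hashtag-pretrained network.
backbone = models.resnet50(pretrained=True)
backbone.fc = nn.Identity()              # chop off the final classification layer

for p in backbone.parameters():          # keep the learned feature extractor frozen
    p.requires_grad = False

classifier = nn.Linear(2048, 100)        # new head for the actual task (here, 100 made-up classes)

images = torch.randn(8, 3, 224, 224)     # a toy batch of images
with torch.no_grad():
    features = backbone(images)          # 2048-dimensional vectors from the second-to-last layer
logits = classifier(features)            # only this small classifier is trained on the new task
```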
OK, so that record has since been beaten by another team at Facebook, but until fairly recently the record on ImageNet was held by this system, which was trained on a different task than the one you finally evaluate it on. That points toward something that is going to become more and more important in the future: the idea that you pre-train with lots of data in a relatively task-independent way, and then use a relatively small amount of data to train the system on the task you actually want to solve. I'll come back to this afterwards. So, a lot of progress over the last few years in computer vision using convolutional nets. I'm not going to go into the details of how this is built, but you can get results like this, where every object in an image is outlined and identified — that's called instance segmentation. You can detect wine glasses and backpacks and count sheep. And it's optimized — my videos aren't running for some reason — it's optimized to the point that you can run those things in real time on smartphones. This is, unfortunately, a video that you can't see; it shows people being tracked on a smartphone in real time, at something like 10 frames per second. A lot of work has gone into those optimizations to run on small platforms, and on the iPhone you also have acceleration libraries. This is all open source, so if you want to play with the latest computer vision systems, you can just download this. It uses the PyTorch framework, which was also developed at Facebook. And there are similar things for tracking body poses [INAUDIBLE]. ConvNets are used for all kinds of things in medical imaging. It's actually one of the hottest topics in radiology right now: how to use deep learning to analyze medical images. This is a project I'm not involved in, but colleagues at NYU, at the medical school and in the computer science department, have been developing these architectures for analyzing MRI images of hips and getting really good results. So this is a really hot topic, and it's probably going to have a big effect on radiology in the future. OK, but I don't want to do a laundry list of applications of convolutional nets. This is one that was also developed at Facebook, for translation. It's a little complicated to explain here — it's a so-called gated convolutional net. Basically, the input is a sequence of words and the output is a sequence of words in a different language, a translation. It goes through convolutions that include something called attention, and there is a module in the middle that tries to warp the sequence so that words appear in the right place in the output. This held the record on some data set for a short time; it has since been overtaken. You can also use these for sound generation, or sequence generation more generally — this is generating synthetic sounds by specifying what type of sound you want, a project that was done at Facebook in Paris. And there are interesting projects in unsupervised learning for translation.
So this is a project that was done mostly in Paris and partly in New York, using a so-called unsupervised embedding system: you can learn vector representations for words in a language by figuring out in which contexts they appear. There is a very classic technique for this called [INAUDIBLE] — this project uses something a bit different, but it's very similar. With this technique, in a completely unsupervised manner, you give a big corpus of text in one language to a system, and it figures out a vector representation for each word such that similar vectors correspond to similar words, based on the contexts in which they appear. You do this for several languages, and then you ask: is there a simple mapping that will take the cloud of points corresponding to all the word vectors in one language and transform it into the cloud of points of another language? If you can find such a mapping, there is some chance you've found a mapping between the two languages. And this actually works. What this allows you to do is build a translation system from one language to another without having any parallel text in those two languages, which is dumbfounding to me. But it works. It doesn't give you record-breaking results compared to what you'd get with parallel data, but it's amazing. And it's very important for Facebook, because people use thousands of different languages on Facebook. In fact, we just open-sourced something — not this particular project, although this project is open source too — where we provide embeddings for words and sentences in 92 different languages. That's open source. Question answering — I'm going to skip this. OK, so, lots of applications of deep learning and convolutional nets. A whole new set of potential applications is starting to pop up, enabled by a new type of neural net: instead of being applied to multidimensional array data, things like images, you can now apply neural nets to graph data — data that comes to you in the form of a graph with values on it, a function on a graph, if you want. And the graph doesn't need to be static. I want to point you to a review paper that I'm a distant co-author on, "Geometric Deep Learning: Going Beyond Euclidean Data." The idea is, how can you define things like convolutional nets on data that is not an array but a function on a graph? The cool thing is that you can apply this to social networks, regulatory networks, networks of — I don't know — financial instruments, let's say, 3D shapes, functional networks in biology, things like that. There are essentially three types. There are classical ConvNets, where the domain is known and it's a grid — an image, for example, can be thought of as a function on a grid. Then there are cases where the graph is fixed — for example, the graph of interactions between different areas of the brain — but the function on the graph is not fixed, and you'd like to apply convolutional nets to domains of this type: how do you define a convolution on such a funny graph? And then there are applications where the graph changes with every new data point — for example, the data point could be a molecule.
A molecule is best represented as a graph. Can we run a neural net on a graph? The answer is yes, and I think this whole area opens an entire Pandora's box of new applications of neural nets that were heretofore unforeseen. So I think it's really cool. Last year I co-organized a workshop at IPAM, the Institute for Pure & Applied Mathematics at UCLA, on new techniques in deep learning, and there were a lot of talks about this. If you want to learn about this area, that's a good way to get started. OK, now, there's been a lot of excitement about reinforcement learning, particularly deep reinforcement learning, in the last few years — everybody has heard of AlphaGo and things like that. But reinforcement learning really only works well for things like games. If you want to train a machine to play Doom, or Go, or chess — StarCraft not so much yet — reinforcement learning works really well. Reinforcement learning is the idea that you don't tell the machine the correct answer; you only tell it whether it did good or bad. You let the machine produce an answer — in this case an action, or an action sequence — and then you tell it: you won, or you lost; you gained points, or you didn't. It works amazingly well, except that it requires many, many, many interactions with the environment. It works really well for Go, for example. This is a Go system that was produced at Facebook, similar to AlphaGo and AlphaZero, which plays at a superhuman level and everything. We're also working on a similar project with StarCraft, where we train a StarCraft agent to win battles. The big problem I was just mentioning is that reinforcement learning is very inefficient in terms of samples. This is a figure from a recent DeepMind paper where they measure performance on an Atari game — the classic Atari games from the 1980s — as a function of the number of millions of frames the system has seen. Using the best algorithms, it takes roughly seven million frames to reach a performance that humans reach in a few minutes. That corresponds to something like 100 hours of play if you translate it into real time. So these systems are much, much slower than humans — or animals, for that matter — at learning new skills. And that's why they are not really practical for real-world applications, where there is no gigantic amount of interaction available. If you want to use reinforcement learning to train a car to drive itself, it's basically not going to work in its current form: the machine would have to drive off a cliff several thousand times before it figures out how not to do it. Now, how is it that humans are able to learn to drive a car with about 20 hours of training, without crashing? It's kind of amazing. Reinforcement learning would require hundreds of thousands, if not millions, of hours of training to get a car to drive itself. You could do this in simulation, but simulations are not accurate, and people are working on how to transfer from a simulated environment to the real world. This, just in passing, is a list of the major open source projects that Facebook research has put out. PyTorch is the environment we use for deep learning. Faiss is a very fast similarity search library for nearest-neighbor search; it's very useful and is used everywhere within Facebook.
This one is for dialogue. There's a reinforcement learning framework for Go — OpenGo is the system I just mentioned. FastText is for natural language understanding, fairseq for sequence processing, things like translation. There's a whole bunch more coming; you can get them all from this GitHub: github.com/facebookresearch. OK, so obviously we can't get our machines to learn as fast as humans, so we're missing something really essential to get to real AI. In my opinion, we're missing three things. One thing we're missing is the ability of learning machines to reason. Right now, all the applications I've shown you are about perception, and for perception deep learning works amazingly well — it can learn to represent the perceptual world really well. But learning to reason is more difficult. There are a lot of ideas and some work on it, but I don't think we have the answer. The second problem is learning models of the world. The reason, perhaps, that we are able to learn to drive a car with 20 hours of training without crashing is that we can predict the effect of our actions — we can predict what's going to happen in the world before it happens. The whole front part of our brain, basically, is a prediction engine. Our machines don't really have that ability to predict — or rather, we can train them to predict in certain ways, but there are technical difficulties that I'll come to in a minute. And the last thing, which I'm not going to talk about, is the ability to learn not just hierarchical representations of the perceptual world, but hierarchical representations of the action world. When we decide to go from here to Atlanta, we have to decompose that task into sub-tasks, all the way down to millisecond-by-millisecond control of our muscles. We have hierarchical representations of action sequences, and we don't really know how to learn them automatically with machine learning today. But I'm not going to talk about this. So it's a big problem, because we can have all those cool things that we can build with deep learning, but we can't have the things we really want. We'd like machines with common sense — a dialogue system we can talk to that doesn't have just a very narrow set of things it can do for us, like playing music and giving us the weather and the traffic. We'd like machines to help us in our daily lives the way a human assistant would. So we want to build things like intelligent personal assistants, and we won't have that until we have machines with some level of common sense. We'd like household robots that are agile and dexterous; we don't have that. We don't have robots that are nearly as agile, or have nearly as much common sense, as a house cat — for all their superhuman performance at Go and everything. So that's what we need to think about: what's the next step? OK, so on reasoning, there is an avenue that is interesting because it might lead to a new way of doing computer science, which is the idea of differentiable programming. When you build a deep learning system in frameworks like PyTorch, you don't actually build a graph of modules anymore — you just write a program.
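Here is a minimal sketch of what "just write a program" can mean in PyTorch: ordinary Python control flow (a data-dependent loop, in this toy example) decides what gets computed, and the framework still differentiates through whatever actually ran. The module and the stopping rule are arbitrary placeholder choices, not anything from the talk:

```python
import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(16, 16)

    def forward(self, x):
        # Ordinary Python control flow: how many times the layer is applied
        # depends on the data itself, so the "graph" differs from input to input.
        steps = 0
        while x.norm() > 1.0 and steps < 10:
            x = torch.relu(self.layer(x))
            steps += 1
        return x.sum()

net = DynamicNet()
out = net(torch.randn(16))
out.backward()           # gradients flow through whatever path was actually executed
```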
The purpose of that program is just to compute the output of your neural net, and every function call in the program is like a module that you can differentiate. So essentially it's a new way of writing software where, when you write the program, the function of each instruction is not entirely specified until you train the program to do the right thing from examples. It's a weakly specified program, essentially. It's called differentiable programming because of this idea that a neural net architecture is really a program, an algorithm, whose function is not completely finalized until you train it. And there's lots of really interesting work that can be viewed in this context. For example, the idea of memory-augmented neural nets: you have a neural net and you attach to it something that works like an associative memory, which the network can use as a working memory to do things like long chains of reasoning. Or it can use the memory to store factual knowledge — knowledge bases of objects and the relations between them, things like that. There's quite a bit of work on this; again, I don't think we have the complete answer, but it's interesting. Here's another example, an interesting project where you'd like a system to answer questions like this one. You show it an image of this type and you tell it: there is a shiny object to the right of the gray metallic cylinder; does it have the same size as the large rubber sphere? For us to answer that question, we have to configure a visual system — we have a strategy: detect the gray metallic cylinder, then look for nearby objects that are shiny, then compare sizes, right? The idea behind this project, which is at Facebook in Menlo Park, is that you have a neural net that reads the sentence, and what it does is generate another neural net whose only purpose is to answer that particular question from an image. So the modules are dynamically wired, if you want, depending on the input. It's one of those examples of a dynamic neural net whose structure is [INAUDIBLE]. That's the essence of differentiable programming — "Software 2.0," some people have called it. PyTorch was really designed from the start with this idea that you could have dynamic neural nets. That was not quite the case with TensorFlow, the Google framework, but TensorFlow is catching up; they're trying to do the same thing. So, how do humans and animals learn? Look at babies in the first few days, weeks, and months of life: they learn an amazing amount of background knowledge about the world just by observation. Babies are kind of helpless — their actions are very limited — but they can observe, and they learn a lot by observing. If you play a trick on a baby before the age of six months or so — you put a toy on a platform and push the toy off, and there's a trick that makes it so the toy doesn't fall — before six months, the baby doesn't pay attention. As far as she's concerned, that's how the world works, no problem. After eight or nine months, you show this scenario to a baby and she goes like this.
Because in the meantime she has learned that an object is not supposed to float in the air; it's supposed to fall if it's not supported. She has learned the concept of gravity in between — intuitive physics, inertia, things like that. In fact, there's a chart put together by Emmanuel Dupoux, a cognitive neuroscientist in Paris who spends part of his time at Facebook, of when babies learn basic concepts of this type. Gravity and inertia come in around seven or eight months. Object permanence is an important one that pops up very early, and the difference between animate and inanimate objects also appears quite early. We learn those things just by observation, not in a task-dependent way, and this is what allows us to predict what's going to happen in the world. We have a very good model of the world that we learn from birth, just by observation. And we're not the only ones — animals also have good models of the world. Here a baby orangutan is being shown a magic trick. There was an object in the cup; the object was removed, but he didn't see it, so now the cup is empty — and he's rolling on the floor laughing. His model of the world was violated. When your model of the world is violated, it causes you to do one of two or three things: you laugh, or you get scared because maybe something dangerous is going to happen that you didn't predict. In any case, you pay attention. All right, so I think the way to attack that problem is through what I call self-supervised learning. It's basically the idea that, for the system to learn from raw data just by observation, you feed a piece of data to the system — let's say a video clip — and you tell the system: pretend you know this piece of the input, pretend you don't know that piece, and try to predict the piece you're pretending you don't know. Then I'm going to show you that piece, and you can correct your internal parameters so your prediction gets closer to what actually occurred. So, for example, I show you a piece of a video clip and ask you to predict how the clip is going to continue, the next few frames of the video. Then I show you those frames and [INAUDIBLE]. But it's not just predicting the future: it could be predicting the past, or predicting the top from the bottom — any piece of the input from any other piece. So there are really those three types of learning. Reinforcement learning, where the feedback to the machine is informationally very weak: just one scalar value that tells the machine whether it did good or bad, once in a while. Supervised learning, where you give the machine more information — you tell it what the correct answer is — but it's still not very much, because all that data has to be curated by humans, so it's limited in quantity. And then there is this self-supervised, predictive learning idea, where the amount of data the machine is asked to predict, and the amount of data it's given to train on, is absolutely enormous. Just an hour of video is a ridiculously large amount of data if you ask the machine to predict every future frame from every past frame, for example.
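A minimal sketch of that self-supervised recipe: hide a piece of the input, ask the model to predict it, and use the hidden piece itself as the training target — no human labels involved. Here the "input" is just a toy sequence and the split point is an arbitrary choice:

```python
import torch
import torch.nn as nn

# Toy data: 64 sequences of 20 values; pretend the first 15 are "the past"
# and the last 5 are "the future" we ask the model to predict.
data = torch.randn(64, 20)
past, future = data[:, :15], data[:, 15:]

predictor = nn.Sequential(nn.Linear(15, 64), nn.ReLU(), nn.Linear(64, 5))
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)

for step in range(100):
    pred = predictor(past)                    # predict the part we pretended not to know
    loss = ((pred - future) ** 2).mean()      # the target is the data itself, not a human label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```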
So Geoff Hinton made this argument many years ago: if you have a very large learning system — say a brain, with on the order of 10^14 free parameters in its synaptic connections — you need a lot of data to constrain the system to learn anything useful. And predicting everything from everything else is pretty much the only way to get that much training signal. We're not going to get there with supervised learning or reinforcement learning. That led me to this admittedly obnoxious analogy: if the stuff we can learn — our intelligence — is a cake, the bulk of the cake is self-supervised learning. Almost everything we learn, we learn in a self-supervised fashion. We learn a little bit with supervised learning, and we learn a tiny amount through reinforcement learning — that would be the cherry on the cake. People working in reinforcement learning get a little upset when I show this, but it's become a bit of a meme in the machine learning community. It doesn't mean reinforcement learning isn't interesting — it's necessary. This is a Black Forest cake, and a Black Forest cake has to have a cherry; actually, it even has cherries inside. But it's really not where we learn most of our knowledge. Things like image in-painting, for example, are a form of self-supervised learning, and people are working on this in computer vision. So the next revolution in AI is not going to be supervised, that's for sure. OK, so let's say we want to build predictive models of the world. This is a very classical setup in optimal control, and I'm sure some of you have a background in this kind of thing. There's a system you want to control, which optimal control people call the "plant," and you have an objective you want to minimize — or maximize, in your case. You can run your simulator forward and figure out an optimal sequence of commands that will optimize your objective, given your predictive model. That's a classical thing in optimal control, and in fact it should be a standard component in the architecture of an intelligent system: an intelligent system should have a way of predicting what's going to happen before it happens, to avoid doing stupid things like running off a cliff. We don't run off cliffs, even if we don't know how to drive, mostly because we have this ability to predict the consequences of our actions. So we need this world simulator in an intelligent agent, along with other components I'm not going to talk about. So how do we learn predictive models of the world? We can observe the state of the world, at least partially, and train a function to predict what the next state is going to be; then we observe what the next state actually is and train the system in a supervised manner. This is something some of my colleagues at Facebook tried a few years ago. You have scenarios where you set up a stack of cubes and let the physics operate, and the cubes fall. The predictions you get — these are predictions produced by a convolutional net — are blurry, because the system cannot exactly predict what's going to happen; there is uncertainty about what will happen to the tower, so you get blurry predictions. If I take a pen, put it on the table, and let it go, you can predict that it's going to fall.
But you probably can't predict in which direction it's going to fall, and that's a problem, because we have to get machines to learn in the presence of large uncertainties. That's the pen example. The only way we can do this is with models that have latent variables. Basically, we observe the past — the clip where I put the pen on the table — and we're going to make a prediction, and what we'd like is to be able to make multiple predictions depending on the circumstances. So we need a set of extra variables, latent variables, that we don't observe, and as we vary this latent vector, the prediction should vary over all the plausible outcomes that may occur. Call it a latent variable model. A good example of how to train one is adversarial training. Adversarial training says: I'm going to sample this latent variable randomly, and now what I need to train the predictor is something that tells me whether my prediction lies on the set of plausible futures or outside it. Of course, I don't have any characterization of the set of plausible futures, so I'm going to train a second neural net to tell me whether I'm on that manifold or off it. That's called a discriminator in the context of adversarial training, and you can think of it as a trainable loss function, a trainable objective function: it tells you how far you are from the manifold, and its gradient points you toward the manifold. So that's the idea of adversarial training. Say you want to do video prediction. You show the system a piece of video, and of course, in your data set, you know how the video continues — that's the real data. Then you run the clip through your generator, which, from a source of random vectors, tries to predict what's going to happen. Initially it's not trained properly, so it makes a bad, blurry prediction or something like that. Now you train your discriminator — the function that tells you whether you are on the manifold of data or not — to produce low values for the real data and high values for the generated data. That's a representation of what the discriminator is doing: for the green points that come from the generator and are not on the data manifold, it pushes its output up, and for the real ones, the blue spheres, it pushes its output down, so the function takes that shape. Then you use the gradient of that function with respect to its input to train the generator to produce images that the discriminator can't tell are fake. So now the discriminator acts as an objective function that can tell the generator whether it is on the manifold or outside it, and its gradient, backpropagated through the generator, trains the generator to do the right thing. Eventually, it makes decent predictions. These kinds of techniques have now pretty much taken over the field; a lot of people are working on them for all kinds of things, like generating synthetic images. This is work from a few years ago — these are fake faces — from Nvidia in Finland. They trained a system to transform a bunch of random numbers into a face image. They trained it on a database of photos of celebrities, and at the end, you feed it a bunch of random numbers and out comes an image of a face.
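A minimal sketch of the adversarial training loop just described, with toy fully-connected stand-ins for the generator and the discriminator (real systems like the face generator use deep convolutional architectures and many extra tricks):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2                      # toy sizes; real generators map noise to images
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))

bce = nn.BCEWithLogitsLoss()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

real_data = torch.randn(512, data_dim) * 0.5 + 2.0   # stand-in for samples from the true data manifold

for step in range(1000):
    real = real_data[torch.randint(0, 512, (64,))]
    fake = G(torch.randn(64, latent_dim))            # sample a latent vector, generate a candidate

    # Train the discriminator: push real samples toward "real", generated ones toward "fake".
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Train the generator: use the discriminator's gradient to make fakes it can't tell apart.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```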
And these are synthetic faces at high resolution. You can't tell they're fake, but none of those people exist. At Facebook we've been working on similar techniques to do things like generating fashion designs — it's France, so, you know. We got a big data set from a very famous designer house and trained one of those generative networks on it, and these are examples of what it generates. These are not textures that humans would have come up with, essentially. OK, I'm going to talk a little bit about video prediction. Video prediction is interesting because, in the context of self-driving cars, for example, you'd like the car to be able to predict what the cars around it are doing. I realize I'm out of time, so– AUDIENCE: [INAUDIBLE] YANN LECUN: This is a project we've done with people at Nvidia. We're trying to predict what the cars around us are going to do, and then use this predictive model to learn a good driving policy. Basically, we feed the system a few frames of what the environment of cars around us looks like, and we train it to predict what those cars are going to do; we use data from an overhead camera for this. These are examples of predictions: this is what you get with a completely deterministic system that doesn't have any latent variables and makes those blurry predictions, and these are the predictions you get from a system that has latent variables in it. I don't have time to go into the details of how it's built. Then you can use this to train a driving policy. You start from a real state, run your predictive model forward, and compute a cost — how close you are to other cars, whether you're in the lane or not — and by backpropagation you can learn a policy network that produces actions that minimize the probability of collision over the predicted trajectory. If you do just that, it doesn't work. But if you add a term to the cost that indicates how certain the system is of its predictions, then it works. I'm just going to end with a cute video. The blue car is driving itself, and the white dot indicates whether the car is accelerating, decelerating, or turning. The other cars are real cars that were simply recorded, so our own car is invisible to them — it's as if we're driving on a highway but nobody sees us, so we can get squeezed between two cars and there's nothing we can do. But this thing learns how to merge onto the highway and things like that. OK, so I'm going to end here. Just to remind you: there are interesting areas of research in deep learning in things like graph-structured data, reasoning, self-supervised learning, and learning hierarchical representations of the control space. We need more theory. And maybe there is a new type of computer science emerging through differentiable programming. Thank you very much.
