
Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 15 – Natural Language Generation

February 13, 2020


So today we’re gonna be learning about Natural Language Generation. And uh, this is probably going to be a little different to my previous lectures because this is going to be much more of a kind of survey of lots of cutting-edge, uh, research topics that are happening in NLG right now. So before we get to that, uh, we’ve got a few announcements. Uh, so I guess the main announcement is just, thank you all so much for your hard work. I know, um, the last week or two have been pretty tough. Uh, assignment five was really quite difficult, I think, and it was a challenge to do it in eight days. So we just really appreciate all the hard work you’ve put in. Um, we also understand the expectations for the project proposal were, uh, sometimes a bit difficult to understand for some people. Um, so, yeah, these are both new components of the class this year that were not present last year. Um, so you know, we as the teaching staff have to go through some learning curves as well. So we really want to say thank you so much, uh, for putting everything into this class. And please do continue to give us your feedback, both right now and in the end-of-quarter feedback survey. Okay, so here’s the overview for what we’re going to be doing today. So today we’re going to learn about what’s happening in the world of neural approaches for Natural Language Generation. Uh, that is a super, super broad title, Natural Language Generation. Um, NLG encompasses a huge variety of research areas and pretty much each of those could have had its own lecture; we could have taught a whole quarter’s worth of classes on, ah, NLG. Uh, but we’re going to try to cover a selection of things today. And, um, uh, it’s mostly going to be guided by the things which, uh, I’ve seen that I think are cool or interesting or exciting. So it’s by no means going to be comprehensive, but I hope you’re going to enjoy some of the stuff we’re going to learn about. Okay, so in particular we’re going to start off by having a recap of what we already know about Natural Language Generation to make sure we’re on the same page. And we’re also going to learn a little bit extra about decoding algorithms. So we learned a bit before about, uh, greedy decoding and beam search decoding, but today we’re going to learn some extra information about those and some other types of decoding algorithms. After that we’re going to go through, um, a pretty quick tour of lots of different NLG tasks and a selection of neural approaches to them. And then after that we’re gonna talk about probably the biggest problem in NLG research, which is NLG evaluation and why it is such a tricky situation. And then lastly, we’re going to have some concluding thoughts on NLG research: what are the current trends and where are we going in the future? Okay. So, uh, section one, let’s do a recap. Okay, so Natural Language Generation, to define it, just refers to any setting in which we are generating some kind of text. So for example, NLG is an important sub-component of lots of different tasks such as, uh, machine translation, which we’ve already met; uh, abstractive summarization, which we’ll learn a bit more about later; um, dialogue, both chit-chat and task-based; uh, also creative writing tasks such as writing stories and even writing poems. NLG is also a sub-component of, uh, free-form question answering.
So I know a lot of you are doing the SQuAD project right now, uh, that is not an NLG task because you’re just extracting the answer from the, uh, the source document. But there are other question answering tasks that do have a Natural Language Generation component. Uh, image captioning is another example of, uh, a task that has an NLG sub-component. So NLG is a pretty cool component of a lot of different NLP tasks. All right, let’s go into our recap. So the first thing I want to recap is, uh, what is language modeling? Um, I’ve noticed that some people are a little bit confused about this. I think it, uh, might be because the name language modeling sounds like it might mean just simply encoding language, like representing language using embeddings or something. So as a reminder, language modeling, uh, has a more precise meaning. Language modeling is the task of predicting the next word given the words so far. So any system which produces this conditional probability distribution, that does this task, is called a Language Model. And if that language model, uh, system is an RNN, then we often abbreviate it as RNN-Language Model. Okay, so I hope, uh, you’ll remember that. Uh, the next thing we’re going to recap is, do you remember what a Conditional Language Model is? Uh, the task of Conditional Language Modeling is when you’re predicting, uh, what word’s going to come next, but you’re also conditioning on some other input x as well as all of your words so far. So to recap, some examples of conditional language modeling include, uh, machine translation, where you’re conditioning on the source sentence x; uh, summarization, where you’re conditioning on the input text that you’re trying to summarize; dialogue, where you’re conditioning on your dialogue history; and so on. Okay, uh, next we’re going to quickly recap how you train an RNN-Language Model. I guess it could also be a Transformer-based language model or a CNN-based language model, now that you know about those, uh, and it could be conditional or not. So the main thing I want to remind you about is that when you are training the system, you feed in the target sequence that you’re trying to generate. So where it says target sentence from corpus, uh, that’s saying that you have some sequence that you’re trying to generate and you feed that into the decoder, the RNN-Language Model. And then it predicts what words are going to come next. So the super important thing is that during training, we’re feeding the gold, that is the reference target sentence, into the decoder, regardless of what the decoder is predicting. So even if, let’s say, this is a very bad decoder that isn’t predicting the correct words, uh, it’s not, you know, assigning them high probability at all, um, that doesn’t matter, we still just, um, input the gold target sequence into the decoder. And, um, I’m emphasizing this because it’s going to come up later: uh, this training method is called Teacher Forcing, which might be a phrase that you have come across elsewhere. So, yeah, it refers to the fact that the teacher, that is kind of like the gold input, is forcing, uh, the language model to use that on every step instead of using its own predictions on each step. So that’s how you train an RNN-Language Model, which might be conditional. Uh, okay. So now a recap on decoding algorithms. So, uh, you’ve got your trained language model which might be conditional. The question is, how do you use it to generate text? So the answer is you need a decoding algorithm.
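(To pin down the recap, the objects involved look roughly like this. This is just a sketch of the standard formulation; the exact notation on the lecture slides may differ.)

```latex
% Language model: probability of the next word given the words so far
P(y_t \mid y_1, \dots, y_{t-1})

% Conditional language model: also condition on some other input x
% (x = source sentence for MT, source document for summarization,
%  dialogue history for dialogue, ...)
P(y_t \mid y_1, \dots, y_{t-1}, x)

% Teacher forcing: at training time we feed in the gold target words
% y^*_1, ..., y^*_{t-1} regardless of what the model predicted,
% and minimize the per-step cross-entropy
J = -\sum_{t=1}^{T} \log P\big(y^*_t \mid y^*_1, \dots, y^*_{t-1}, x\big)
```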
A decoding algorithm is an algorithm you use to generate text from your trained language model. So, uh, in the NMT lecture a few weeks ago we learned about two different decoding algorithms. We learned about greedy decoding and beam search. So let’s quickly recap those. Uh, greedy decoding is a pretty simple algorithm. On each step you just take whatever is the most probable word according to the language model. That is, you take the argmax and use that as the next word, and you feed it in as the input on the next step. And you just keep going until you produce some kind of END token or maybe when you reach some maximum length. And I think you’re all quite familiar with this because you did it in assignment five. So uh, yes, this diagram shows how greedy decoding would work to generate a sentence. So as we learned before, due to the lack of backtracking and the inability to go back if you made a wrong choice, uh, the output from greedy decoding is generally, uh, pretty poor: it can be ungrammatical, or it can be unnatural, kind of nonsensical. Okay, let’s recap beam search decoding. So beam search is a search algorithm which aims to find a high-probability sequence, so if we’re doing translation that sequence is the sequence of translation words, um, by tracking multiple possible sequences at once. So the core idea is that on each step of the decoder, you’re going to be keeping track of the K most probable partial sequences, which we call hypotheses. And here K is some hyperparameter called the beam size. So the idea is that by, um, considering lots of different hypotheses, we’re going to try to search effectively for a high-probability sequence, but there is no guarantee that this is going to be the optimal, highest-probability sequence. So, uh, at the end of beam search, uh, you reach some kind of stopping criterion, which we talked about before but I won’t cover in detail again. Uh, and once you’ve reached your stopping criterion, you choose the sequence with the highest probability, um, factoring in some adjustments for length, and then that’s your output. So just to do this one more time, here’s the diagram that we saw in the NMT lecture of beam search decoding, um, once it’s completed, and in this scenario we have a beam size of two. So this is what it looks like after we’ve done this exploration process: this shows the full tree that we explored, and then we’ve come to some kind of stopping criterion and we identify the top, uh, hypothesis and, uh, that’s highlighted in green. So on the subject of beam search decoding, I was watching TV the other day, and I noticed something in Westworld. I think the hosts- [LAUGHTER] the AI hosts in Westworld maybe use beam search. Which is something I wasn’t expecting to see on TV. [LAUGHTER] So there’s this scene, uh, Westworld is, by the way, a sci-fi series that has these, um, very convincing humanoid AI systems. Um, and there’s a scene where one of the AI systems is confronted with the reality of the fact that, um, she, I suppose, is, um, not human, because she can see the generation system producing her words as she says them, and I was looking at the TV and I thought, is that beam search? Because that diagram looks a lot like this diagram here, um, but maybe with a bigger beam size. So I thought that was pretty cool because, you know, AI has hit the mainstream when you see beam search on TV.
And then if you zoom in really hard you can see some other exciting words in this screenshot like knowledge base, forward chaining and backward chaining, which are not the same thing as forward prop and backward prop, um, and also fuzzy logic algorithms and neural net. Um, so yeah, beam search, I think, has hit the mainstream now, um, so if it’s good enough for Westworld, maybe it’s good enough for us. Uh, so with beam search, right? We’ve talked about how you have this hyperparameter k, the beam size. And one thing we didn’t talk about in the last lecture, so now we’re leaving the recap portion, um, is what’s the effect of changing that beam size k. So, uh, if you have a really small k, then you’re gonna have similar problems to greedy decoding. And in fact, if k equals one, then you are actually just doing greedy decoding. So those are the same problems: you know, ungrammatical, maybe unnatural, nonsensical, just kind of plain incorrect output. So if we use a larger k, if you have a larger beam size, then you’re running your search algorithm but considering more hypotheses, right? You have a larger search space and you’re considering more different possibilities. So if you do that, then we often find that this reduces some of the problems above. So you’re much less likely to have this ungrammatical, uh, you know, disjointed output. But there are some downsides to raising k. So of course, larger k is more computationally expensive, and that can get pretty bad if you’re trying to, um, for example, generate your, uh, outputs for a large, you know, test set of NMT examples. Um, but more seriously than that, increasing k can introduce some other problems. So for example, it’s been shown that in NMT, increasing the beam size too much actually decreases the BLEU score. And this is kind of counter-intuitive, right? Because we were thinking of beam search as this algorithm that tries to find the optimal solution. So surely, if you increase k, then you’re only going to find a better solution, right? Um, so I think maybe the key here is the difference between optimality in terms of the search problem, that is finding a high-probability sequence, and BLEU score, which are two separate things, and there’s no guarantee that they actually, um, correspond, right? And I mean, there’s a difference, again, between BLEU score and actual translation, uh, quality, as we know. So if you look at the two papers which I’ve linked to here, which are the ones that show that, uh, increasing beam size too much decreases the BLEU score, they explain it by saying that the main reason why this happens is that when you increase the beam size too much, then you end up producing translations that are too short. So I mean, that kind of explains it to a degree: the translations are too short, therefore they have low BLEU because they’re probably missing words that they should contain. But the question is, why does a large beam size give you short translations? I think that’s harder to answer. However, in these two papers, I didn’t see an explicit explanation of why. Um, I think it’s possibly part of a larger pattern we see sometimes with beam search, which is that when you really increase your, uh, search space and make the search much more powerful so that it can consider lots of different alternatives, it can end up finding these high-probability, um, sequences which aren’t actually the thing that you want. Sure, they’re high probability, but they’re not actually the thing that you wanted.
Um, so another example of that is that in open-ended tasks, like for example chit-chat dialogue where you’re trying to just, um, say something interesting back to your conversational partner, if we use beam search with a large beam size, we find that it can give you output that is really generic. Um, and I’ll give you an example here to show you what I mean. So these are examples from a chit-chat, uh, dialogue project that I was doing. So here you’ve got, uh, your human chit-chat partner said something like “I mostly eat a fresh and raw diet, so I save on groceries.” And then here’s what the chatbot said back depending on the beam size. I will let you read that. So I would say that this is fairly characteristic of what you see happening when you raise and lower the beam size [NOISE]. When you have a low beam size, um, it might be more kind of on topic. Like here, we can see that “eat healthy”, “eat healthy”, “I am a nurse so I do not eat raw food” and so on, that kind of relates to what the user said, uh, but it’s kind of bad English, right? There’s some repetition and, uh, it doesn’t always make that much sense, right? Um, [NOISE] but then, when you raise the beam size, then it kind of converges to a safe, so-called correct response, but it’s kind of generic and less relevant, right? And it’s kind of applicable in all scenarios: “what do you do for a living”. Um, so the, the particular dataset I was using here is, uh, one called Persona-Chat, that I’ll tell you more about later. Um, but it’s a chit-chat dialogue dataset where each, uh, conversational partner has a persona, which is a set of traits. Um, so the reason it keeps talking about being a nurse, I think, is because that was in the persona. [NOISE] But the main point here is that, um, we kind of have an unfortunate trade-off with no obvious Goldilocks zone. I mean, there’s, yeah, kind of an unfortunate trade-off between having bad output, bad English, and just having something very boring. So this is one of the problems that we get with beam search. Okay. So we’ve talked about, uh, greedy decoding and beam search. Yes? [A student asks:] So, an adaptive beam size depending on the [inaudible]? The question is, can we have an adaptive beam size dependent on the position that you’re in? You mean like in the sequence? [Student:] Yeah, that is in [inaudible]. Yeah. I mean, I think I might have heard of a research paper that does that, that adaptively raises the capacity of the hypothesis space. I mean, it sounds awkward to implement, uh, because of, you know, things fitting into a fixed space in your GPU. Um, but I think that might be possible; I suppose you’d have to learn the criterion on which you increase the beam size, yeah. Seems possible. Okay. So we’ve talked about, uh, beam search and greedy decoding. So here’s a new family of decoding algorithms which are pretty simple: uh, sampling-based decoding. So, something which I’m calling pure sampling, because I didn’t know what else to call it. Um, this is just the simple sampling method that says that on each, uh, timestep t of your decoder, you just randomly sample from the probability distribution, uh, to obtain your next word. So this is very simple. It’s just like greedy decoding, but instead of taking the top word, you just sample from that distribution. So the reason I call this pure sampling was to differentiate it from top-n sampling.
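(Before getting to top-n sampling, here’s a minimal sketch of greedy decoding versus pure sampling, just to make the contrast concrete. It assumes a hypothetical `model.step` function that returns the next-word probability distribution; it isn’t the implementation from any particular system.)

```python
import torch

def decode(model, max_len, bos_id, eos_id, sample=False):
    """Greedy decoding (sample=False) vs. pure sampling (sample=True).

    Assumes a hypothetical `model.step(tokens)` that returns a 1-D
    [vocab_size] tensor of probabilities for the next word.
    """
    tokens = [bos_id]
    for _ in range(max_len):
        probs = model.step(tokens)  # P(next word | words so far)
        if sample:
            # pure sampling: draw the next word from the full distribution
            next_id = torch.multinomial(probs, num_samples=1).item()
        else:
            # greedy decoding: take the single most probable word
            next_id = torch.argmax(probs).item()
        tokens.append(next_id)
        if next_id == eos_id:  # stop once we produce the END token
            break
    return tokens
```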
And again, this is actually usually called top-k sampling, but I already called k the beam size and I didn’t want to be confusing, so I’m gonna call it top-n sampling for now. Um, so the idea here is also pretty simple. On each step t, you want to randomly sample from your probability distribution, but you’re gonna restrict it to just the top n most probable words. So this is saying that it’s like the simple, you know, pure sampling method, but you truncate your probability distribution to just the top most probable words. So, uh, the idea here is that, kind of like how beam search, um, gave you a hyperparameter to go between greedy decoding and, you know, uh, a very exhaustive search, in the same way here you’ve got a hyperparameter n which can take you between greedy search and pure sampling. If you think about this for a moment, if n is one, then you truncate to just the top one, so you’re just taking the argmax, which is greedy. And if n is the vocab size, then you don’t truncate at all. You’re sampling from everything; that’s just the pure sampling method. So here, um, it should be clear, I hope, if you think about it, that if you increase n, then you’re gonna get more diverse and risky output, right? Because you’re, uh, giving it more to choose from and you’re going lower into the probability distribution, lower into less likely things. And then, if you decrease n, then you’re gonna get more kind of generic, safe output, because you’re restricting more to the most high-probability options. So both of these are more efficient than beam search, which I think is something important to note, uh, because there are no multiple hypotheses to track, right? Because in beam search, on every step t of the decoder, you’ve got k, that is beam-size-many, different hypotheses to track. Uh, whereas here, at least if you’re only generating one sample, there’s only one thing to track. So it’s a very simple algorithm. So that is one advantage of these sampling-based algorithms over beam search. Okay. So, the last thing I want to tell you that’s kind of related to decoding is, uh, softmax [NOISE] temperature. So, if you recall, on timestep t of your decoder, your language model computes some kind of probability distribution P_t, uh, by applying the softmax function to a vector of scores that you got from somewhere, like from your Transformer or from your RNN or something. So, there’s the softmax function again. It’s saying that the probability of a word w is this softmax function, uh, given the scores. So, the idea of a temperature on the softmax is that you have some kind of temperature hyperparameter tau and you’re going to apply that to this, uh, softmax. So, all that we’re doing is dividing all of the scores, or logits you might call them, by the temperature hyperparameter. So again, if you just think about this a little bit, you’ll see that raising the temperature, that is increasing, uh, the hyperparameter, is going to make your probability distribution more uniform. And this kind of comes down to the question of, when you multiply all of your scores by a constant, um, how does that affect the softmax, right? So, do things get more far apart or less far apart once you take the exponential?
So, this is something you can just work out by yourself on paper, but as a, uh, kind of memory shortcut, a good way to think about it is that if you raise the temperature, then the distribution kind of melts and goes soft and mushy and uniform. And if you, uh, lower the temperature, like make it cold, then the probability distribution becomes more spiky, right? So, the things which are rated as high probability become even more, uh, disproportionately high probability compared to the other things. Um, I think that’s an easy way to remember it. Today I had to work it out on paper, and then, uh, I realized that the temperature visualization thing usually gets me there quicker. So, um, one thing I want to note is that softmax temperature is not a decoding algorithm. I know that I put it in the decoding algorithm section, uh, that was just because it’s a simple thing that you can do at test time to change how the decoding happens, right? You don’t need to train, uh, with the softmax temperature. So, it’s not a decoding algorithm itself. It’s a technique that you can apply at test time in conjunction with a decoding algorithm. So, for example, if you’re doing beam search or you’re doing some kind of sampling, then you can also apply a softmax temperature, um, to change, you know, this kind of risky-versus-safe, um, trade-off. Any questions on this? Okay. So, here’s a summary of what we just learned about decoding algorithms. Um, greedy decoding is a simple method. It gives kind of low-quality output in comparison to the others, at least beam search. Beam search, especially when you’ve got a high beam size, uh, searches through lots of different hypotheses for high-probability outputs. And this is generally gonna deliver better quality than greedy search, uh, but if the beam size is too high, then you can have these, uh, kind of counter-intuitive problems we talked about before, where you’ve retrieved some kind of high-probability but unsuitable output: say, something that is too generic or something that is too short. And we’re gonna talk about that more later. Uh, sampling methods are a way to get more diversity, uh, via randomness. Uh, well, getting randomness might be your goal in itself. Um, so, this is good if you’re in some kind of, for example, open-ended or creative generation setting, like, uh, generating poetry or stories; then sampling is probably a better idea than beam search, because you want to have a kind of source of randomness to, uh, write different things creatively. And top-n sampling allows you to control the diversity by, uh, changing n. And then lastly, softmax temperature is another way to control diversity. So there are quite a few different knobs you can turn here. And it’s not a decoding algorithm, it’s just a technique that you can apply alongside any decoding algorithm. Although it wouldn’t make sense to apply it with greedy decoding, because even if you make the distribution more spiky or more flat, the argmax is still the argmax. Okay. Cool. I’m going to move on to section two. So, uh, section two is NLG tasks and neural approaches to them. Uh, as mentioned before, this is not going to be an overview of all of NLG. That would be quite impossible. This is gonna be some selected highlights. So, in particular, I’m gonna start off with a fairly deep dive into a particular NLG task that I’m a bit more familiar with, and that is, uh, summarization. So, let’s start off with a task definition for summarization.
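(One quick aside before the summarization deep dive: here’s a small sketch tying together top-n sampling and softmax temperature from the summary above. It assumes hypothetical raw scores, i.e. logits, from your decoder; it’s illustrative, not the implementation from any particular paper.)

```python
import torch
import torch.nn.functional as F

def sample_next_word(logits, n=10, temperature=1.0):
    """Top-n sampling with a softmax temperature.

    logits: 1-D [vocab_size] tensor of raw decoder scores (hypothetical).
    n=1 recovers greedy decoding; n=vocab_size recovers pure sampling.
    temperature > 1 flattens the distribution; temperature < 1 makes it spikier.
    """
    scaled = logits / temperature                     # divide the scores by tau
    top_scores, top_ids = torch.topk(scaled, k=n)     # keep the n most probable words
    probs = F.softmax(top_scores, dim=-1)             # renormalize over the truncated set
    choice = torch.multinomial(probs, num_samples=1)  # sample within the top n
    return top_ids[choice].item()
```

With n and the temperature as knobs, you can move between safe and risky output at test time without retraining anything. Okay, on to summarization.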
Um, one sensible definition would be: given some kind of input text x, you want to write a summary y which is shorter than x and contains the main information of x. So, summarization can be single-document or multi-document. Uh, single-document means that you just have a summary y of a single document x. In multi-document summarization, you want to write a single summary y of multiple documents x_1 up to x_n. And here typically x_1 up to x_n will have some kind of overlapping content. So, for example, they might all be different news articles from different newspapers about the same event, right? Because it kind of makes sense to write a single summary that draws from all of those. Um, it makes less sense to summarize things that are about different topics. There is further, uh, subdivision of, uh, task definitions in summarization. So, I’m gonna describe it via some datasets. Uh, here are some different really common datasets, especially in, uh, neural summarization, um, and they kind of correspond to different lengths and different styles of text. So, a common one is, uh, the Gigaword dataset. And the task here is that you want to map from the first one or two sentences of a news article to the headline. [NOISE] And you could think of this as sentence compression, especially if it’s kind of one sentence to headline, because you’re going from a longish sentence to a shortish headline-style sentence. Uh, the next one that I, um, wanted to tell you about is this, uh, Chinese summarization dataset that I see people using a lot. And it’s, uh, from a micro-blogging, um, website where people write summaries of their posts. So, the actual summarization task is you’ve got some paragraph of text and then you want to, uh, summarize that into, I think, a single-sentence summary. Uh, another one, uh, two actually, are the New York Times and CNN/Daily Mail, uh, datasets. So, these are both of the form: you’ve got a whole news article, which is actually pretty long, like hundreds of words, and then you want to summarize that into, uh, maybe a single-sentence or multi-sentence summary. Uh, the New York Times ones are written by, I think, uh, librarians, or people who, um, write summaries for library purposes. Uh, and then, uh, one I just spotted today when I was writing this list: there’s a fairly new dataset, from the last six months or so, from wikiHow. So, from what I can tell, this seems to be: you’ve got a full how-to article from wikiHow and then you want to boil it down to the summary sentences, which are kind of cleverly extracted from throughout the wikiHow article. They are kind of like headings. So, um, I looked at this paper and it seems that, um, this is kind of interesting because it’s a different type of text. As you might have noticed, most of the other ones are news-based and this is, uh, not, so that kind of poses different challenges. Uh, another kind of division of summarization is sentence simplification. So, this is a related but actually different task. In summarization, you want to write something which is shorter and contains the main information but is still maybe written in just as complex language, whereas in sentence simplification you want to rewrite the source text using simpler, uh, language, right? So, simpler word choices and simpler sentence structure. That might mean it’s shorter, but not necessarily. So, for example, uh, Simple Wikipedia is a standard dataset for this.
And the idea is you’ve got, um, you know, standard Wikipedia and you’ve got a Simple Wikipedia version. And they mostly line up, so you want to map from some sentence in one to the equivalent sentence in the [NOISE] other. Another source of data for this is Newsela, which is a website that, uh, rewrites news for children. Actually, at different learning levels, I think. So, you have multiple options for how much it’s simplified. Okay. So, um, so that’s the definition, or the many definitions, of summarization as different tasks. So, now I’m gonna give an overview of, like, what are the main, uh, techniques for doing summarization. So, there are two main strategies for summarization. Uh, you can call them extractive summarization and abstractive summarization. And the main idea, as I hinted at earlier, is that in extractive summarization you’re just selecting parts of the original text to form a summary. And often this will be whole sentences, but maybe it’ll be more granular than that; maybe, uh, phrases or words. Whereas in abstractive summarization, you’re going to be generating some new text using NLG techniques. So the idea is that it’s, you know, generation from scratch. And my visual metaphor for this is that it’s kind of like the difference between highlighting the parts with a highlighter and writing the summary yourself with a pen. I think the high-level things to know about these two techniques are that extractive summarization is basically easier, at least to get a decent system started, because selecting things is probably easier than writing text from scratch. Um, but extractive summarization is pretty restrictive, right? Because you can’t really paraphrase anything, you can’t really do any powerful sentence compression if you can only select sentences. Um, and, of course, abstractive summarization as a paradigm is more flexible and it’s more how humans might summarize, uh, but as noted, it’s pretty difficult. So, I’m gonna give you a very quick view of what pre-neural summarization looks like. And here we’ve got, uh, this is a diagram from the, uh, Speech and Language Processing book. So, uh, pre-neural summarization systems were mostly extractive. And like pre-neural NMT, which we learnt about in the NMT lecture, they typically had a pipeline, which is what this picture is showing. So, a typical pipeline might have three parts. First, you have content selection, which is, uh, essentially choosing some of the sentences from the source document to include. And then secondly, you’re going to do some kind of information ordering, which means choosing what order you should put these sentences in. And this is a particularly nontrivial question if you’re doing multi-document summarization, because your sentences might come from different documents. Uh, and then lastly, you’re going to do sentence realization, that is, actually, um, turning your selected sentences into your actual summary. So, although we’re not doing, kind of, free-form text generation, uh, there might be some kind of editing, for example, uh, simplifying, editing, or removing parts that are redundant, or fixing continuity issues. So for example, you can’t refer to a person as she if you never introduced them in the first place. So maybe you need to change that she to the name of the person. So in particular, [NOISE] uh, these pre-neural summarization systems, uh, have some pretty sophisticated algorithms for content selection. Um, so, for example, uh, you would have some sentence scoring functions.
This is the most simple, uh, way you might do it: you might score all of the sentences individually, and you could score them based on features such as, um, are there, you know, topic keywords in the sentence? If so, maybe it’s an important sentence that we should include. Um, and you could compute those, uh, keywords using, uh, statistics such as tf-idf, for example. [NOISE] You can also use pretty basic but powerful features such as, uh, where does the sentence appear in the document? If it’s near the top of the document, then it’s more likely to be important. Uh, there are also some more complex content selection algorithms. For example, uh, there are these graph-based algorithms which kind of view the document as a set of sentences, and those sentences are the nodes of the graph, and you imagine that all sentence pairs have an edge between them, and the weight of the edge is kind of how similar the sentences are. So, then, if you think about the graph in that sense, now you can try to identify which sentences are important by finding which sentences are central in the graph. So you can apply some kind of general-purpose graph algorithms to figure out which [NOISE] nodes are central, and this is a way to find central sentences. Okay. So um, [NOISE] back to summarization as a task. Um, I can’t remember if we’ve talked about ROUGE already. We’ve certainly talked about BLEU. But I’m gonna tell you about ROUGE now, which is the main automatic metric for summarization. So ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. I’m not sure if that was the first thing they came up with or if they made it like that to match BLEU. Um, and here’s the equation, uh, for, well, I suppose, one of the ROUGE metrics. I’ll tell you more about what that means later, and you can read more in the original paper which is linked at the bottom. So, uh, the overall idea is that ROUGE is actually pretty similar to BLEU. It’s based on n-gram overlap. So, some main differences with BLEU are: ROUGE doesn’t have a brevity penalty. Um, I’ll talk more about that in a minute. Uh, the other big one is that ROUGE is based on recall, while BLEU is based on precision. So you can see it’s there in the title. [NOISE] Um, so, if you think about this a little bit, I think you can say arguably precision is more important for machine translation. That is, you only want to generate text that appears in one of your reference, uh, translations; and then, to avoid the system taking a really conservative strategy where it only generates really safe things in a really short translation, that’s why you add the brevity penalty, to make sure that [NOISE] it tries to write something long enough. And then by contrast, recall is more important for summarization, because you want to include all the important information in your summary, right? So the information that’s in the reference summary is, uh, assumed to be the important information. So recall means that you captured all of that. Um, and I suppose if you assume that you have a maximum length constraint for your summarization system, then those two kind of give a trade-off, right? Where you want to include all the information, but your summary can’t be too long. So I think that’s the kind of justification for why you have recall and precision for these two different tasks.
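(The equation from the slide isn’t reproduced in this transcript, but the ROUGE-N recall from the Lin (2004) paper has roughly this shape, contrasted schematically with BLEU’s modified n-gram precision.)

```latex
% ROUGE-N: n-gram recall against the reference summaries (Lin, 2004)
\mathrm{ROUGE\text{-}N} =
  \frac{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \mathrm{Count_{match}}(\text{gram}_n)}
       {\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \mathrm{Count}(\text{gram}_n)}

% Contrast with BLEU's modified n-gram precision: the denominator counts
% n-grams in the system output rather than in the references
p_n =
  \frac{\sum_{\text{gram}_n \in \text{output}} \mathrm{Count_{match}}(\text{gram}_n)}
       {\sum_{\text{gram}_n \in \text{output}} \mathrm{Count}(\text{gram}_n)}
```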
However, confusingly, often an F1 version of ROUGE, that is, a combination of precision and recall, is reported anyway in the summarization literature. And to be honest, I’m not entirely sure why this is; uh, maybe it’s because of the lack of an explicit max length constraint. Um, anyway, I tried to search for that but couldn’t find an answer. So here’s some more information on ROUGE. Um, if you remember, BLEU is reported as a single number, right? BLEU is just a single number and it is a combination of the precisions for the different n-grams, usually 1 through 4, whereas ROUGE scores are usually reported separately for each n-gram. So, the most commonly reported ROUGE scores are ROUGE-1, ROUGE-2, and ROUGE-L. So, ROUGE-1, not to be confused with Rogue One: A Star Wars Story. Um, I feel like since that film came out, I see so many people mistyping this, and I think it’s related. Um, so, ROUGE-1 is, uh, based on unigram overlap, um, [NOISE] and ROUGE-2 is based on bigram overlap. It’s kind of analogous to BLEU really, except, uh, recall-based, not precision-based. The more interesting one is ROUGE-L, which is longest common subsequence overlap. Um, so, the idea here is that you are interested not only in, uh, particular n-grams matching up, but in, you know, how long a sequence of words you can find that appears in both. So you can, uh, read more about these metrics in the paper that was linked on the previous page. And another really important thing to note is that there’s [NOISE] now a convenient Python implementation of ROUGE, and um, maybe it is not apparent why that’s exciting, but it’s actually pretty exciting because for a long time, there was just this Perl script, um, that was quite hard to run and quite hard to set up and understand. So um, someone out there has been a hero and has, uh, implemented a pure Python version of ROUGE and checked that it really does match up with the Perl script that people were using before. So if any of you are using ROUGE or doing summarization for your projects, uh, make sure that you, uh, go use that, because it will probably save you some time. [NOISE] Okay. So we’re gonna return to ROUGE a little bit later. Um, I know that in assignment 4 you thought about the shortcomings of BLEU as a metric, and um, for sure ROUGE has some shortcomings as well as a metric for summarization. Um, we’re gonna come back to that later. Okay. So, we’re gonna move on to neural approaches for summarization. [NOISE] So uh, going back to 2015, I don’t have another dramatic reenactment, I’m afraid. [NOISE] Um, Rush et al. published the first seq2seq summarization paper. [NOISE] So uh, their view was, you know, NMT has recently been super successful, so why don’t we view abstractive summarization as a translation task and therefore apply standard seq2seq translation methods to it? So that’s exactly what they did, and they applied, uh, a standard attention model, and then they did a pretty good job at, uh, Gigaword summarization. That’s the one where you’re, um, going from the first sentence of the news article to the headline. So it’s kind of like, uh, sentence compression. So crucially, this is kind of the same order of magnitude of length as NMT, right? Because NMT is sentence to sentence and this is kind of sentence to sentence, maybe at most two sentences to one sentence. So this works pretty well, and you can get pretty decent, um, headline generation or sentence compression using this kind of method. [NOISE] Okay.
So after that, since 2015, there have been lots more developments in neural abstractive summarization. And you can kind of, um, group these developments into, uh, a collection of themes. So one theme is: make it easier to copy. Uh, this seems pretty obvious, because in summarization, you know, you’re gonna want to copy, you know, quite a few words and even phrases. But then: don’t copy too much. Uh, the other thing is that if you make it too easy to copy, then you copy too much. So then there’s other research showing how to prevent too much copying. [NOISE] Uh, the next theme is some kind of hierarchical or multi-level attention. So as I just showed, attention has been pretty key to, um, abstractive neural summarization so far. So there’s been some work looking at, you know, can we make this attention work at multiple levels, a kind of coarse high-level version and a fine low-level version, so that we can maybe do our selection at the high level and at the low level. Another theme, which is kind of related, is having some more kind of global content selection. So if you remember, when we were talking about the pipelines in pre-neural summarization, they had these different content selection algorithms. And I think you can say that, um, kind of naive attention-based seq2seq is not necessarily the best way to do content selection for summarization; maybe you want a more kind of global strategy where you choose what’s important. It’s not so apparent when you’re doing small-scale summarization, but if you imagine that you’re summarizing a whole news article and you’re choosing which information to include, kind of deciding on each decoder step what to choose doesn’t seem like the most global strategy. Er, what else have we got? Uh, there’s using, uh, Reinforcement Learning to directly maximize ROUGE or other discrete goals you might care about, such as maybe the length of the summary. Um, and I say discrete here because ROUGE is a non-differentiable, uh, function of your generated outputs. There’s no, you know, easy way to learn that differentiably during training in the usual way. Uh, my last point on this list is the kind of theme of resurrecting pre-neural ideas, such as those graph algorithms that I mentioned earlier, and working them into these new abstractive neural seq2seq systems, and I’m sure there’s more as well. So, I’m gonna show you a few of these, um, especially because even if you’re not particularly interested in summarization, a lot of the ideas that we’re gonna explore here are actually applicable to other areas of NLG, or just other areas of NLP deep learning. So, the first thing on the list is making it easier to copy, which seems like probably the first thing you want to fix if you’ve just got basic seq2seq with attention. So, um, a copy mechanism, which can exist outside of summarization. The reason why you want this is that basic seq2seq with attention is good at writing fluent output, as we know, but pretty bad at copying over details like rare words correctly. So a copy mechanism is just the kind of sensible idea of saying, um, let’s have an explicit mechanism to just copy over words. So for example, you could use the attention distribution to kind of select what you’re going to copy. Um, so, if you are allowing both copying over words and generating words in the usual way with your language model, then now you’ve got a kind of hybrid extractive/abstractive approach to summarization.
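(To make the hybrid concrete before getting to the specific papers: most copy mechanisms compute a mixture along roughly these lines. This is essentially the pointer-generator formulation discussed next, though the exact notation differs across papers.)

```latex
% One common copy-mechanism formulation (a pointer-generator style mixture):
% p_gen is the probability of generating rather than copying on this step,
% P_vocab is the usual output distribution, and a_i are the attention weights
% over the source positions x_i.
P(w) = p_{\text{gen}} \, P_{\text{vocab}}(w)
     + (1 - p_{\text{gen}}) \sum_{i \,:\, x_i = w} a_i
```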
So, there are several papers which propose some kind of copy mechanism variant, and I think the reason why there are multiple is because there are a few different choices you can make about how to implement this, and that means there are a few different versions of how to implement a copy mechanism. So, uh, yeah, there are several papers here which you can look at. I’m going to show you a diagram from a paper that, um, I did a few years ago with Chris. So, this is just one example of how you can do a copying mechanism. So, the way we did it is we said that on each decoder step, you’re going to calculate this probability Pgen, and that’s the probability of generating the next word rather than copying it, and the idea is that this is computed based on your current kind of context, your current decoder hidden state. So, once you’ve done that, the idea is you’ve got your attention distribution as normal and you’ve got your, you know, output generation distribution as normal, and you’re going to use this Pgen, which is just a scalar, to kind of, uh, combine, mix together these two probability distributions. So, what this equation is telling you is that the, uh, final output distribution for, uh, what word is gonna come next is the probability of generating times your probability distribution over what you would generate, plus the probability of copying times the attention distribution at that time. So, the main thing is, you’re using attention as your copying mechanism. So, attention is kind of doing double duty here. It’s both, uh, being useful for the generator to, you know, uh, maybe choose to rephrase things, but it is also being useful as a copying mechanism. And I think that’s one of the several things that these different papers do differently. I think I’ve seen a paper that maybe has, like, two separate, uh, attention distributions, one for the copying and one for the attending. Um, other choices you can make differently are, for example: do you want Pgen to be this kind of soft thing that’s between zero and one, or do you want it to be a hard thing that has to be either zero or one? Um, you can also make decisions about, like, do you want the Pgen to have supervision during training? Do you want to annotate your dataset saying these things are copied and these things are not, or do you want to just learn it end-to-end? So there are multiple ways you can do this, and um, this has now become pretty, pretty standard. Okay, so copy mechanisms seem like a sensible idea, but there’s a big problem with them, which is what I mentioned earlier, and that problem is that they copy too much. Um, so, when you run these kinds of systems on summarization, you find that they end up copying a lot of long phrases and sometimes even whole sentences, and uh, unfortunately your dream of having an abstractive summarization system isn’t going to work out, because your, um, you know, copy-augmented seq2seq system has just collapsed into a mostly extractive system, which is unfortunate. Another problem with these, uh, copy mechanism models is that they are bad at overall content selection, especially if the input document is long, and this is what I was hinting at earlier. Um, let’s suppose that you are summarizing something that’s quite long, like a news article that’s hundreds of words long, and you want to write a several-sentence summary.
It doesn’t seem like the smartest choice to, on every step of writing your several-sentence summary, be choosing again what to attend to, what to select, what to summarize. It seems better to kind of make a global decision at the beginning and then summarize. So, yeah, the problem is there’s no overall strategy for selecting the content. So, uh, here’s a paper that I found. Nope, not yet. Okay. So, how might you do better content selection for neural summarization? So, if you remember, in the pre-neural summarization we looked at, you had completely separate stages in the pipeline, right? You had the content selection stage and you had surface realization, that is, the text generation stage. But in our seq2seq-with-attention systems, these two stages are just completely mixed together, right? You’re doing your step-by-step surface realization, that is, text generation, and on each of those steps you’re also doing content selection. So, yeah, this doesn’t make sense. So, I found a paper, which was, uh, published I think last year, which gives a quite nice, kind of simple solution to this problem, and it’s called bottom-up summarization. So, in this paper, if you look at the figure, uh, the main idea is pretty simple. It says that first you’re going to have a content selection stage, and this is just, uh, thought of as a neural sequence-tagging problem, right? You run through your source document and you tag every word as include or don’t-include. So, you’re just deciding, like, what seems important, what seems like it should make it into the summary and what doesn’t. And then the bottom-up attention stage says that now you run a seq2seq-with-attention system, which is gonna generate the summary, but you’re gonna kind of apply a mask. You know, apply a hard constraint that says that you can’t attend to words that were tagged don’t-include. So, this turns out to be pretty simple but effective, um, because it’s a better overall content selection strategy: by doing this first content selection stage by sequence tagging, you’re just doing the selection thing without also, at the same time, doing the generation thing, which I think turns out to be a better way to make decisions about what to include. And then separately, this also means, as a great side effect, that you have less copying of long sequences in the generation model. Um, because if you are not allowed to attend to things which you shouldn’t be including, then it’s kind of hard to copy a really long sequence, right? Like, if you want to copy a whole sentence but the sentence has plenty of don’t-include words in it, then you can’t really copy a long sequence; you have to break it up. So, what the model ends up doing is it kind of has to skip around to the parts it’s meant to include, and then it’s forced to be more abstractive to put the parts together. Yep? [A student asks:] How did they backpropagate the masking decision? Because it seems like- because during training [inaudible] masking decision. Yeah, I think it might be trained separately. I mean, you can go and check the paper. I’ve read a lot of papers in the last few days, I can’t quite remember. I think it might be trained separately, but they might have tried training it together and it didn’t work as well. I am not sure. You can check it out. Okay. So, another paper I want to tell you about is a paper which, uh, used reinforcement learning to directly maximize ROUGE for neural summarization.
So this was a paper from two years ago. And the main idea is that they use RL to directly optimize, in this case, ROUGE-L, the metric. So by contrast, the standard maximum likelihood training, that is, the training objective we’ve been talking about for the whole class so far for language models, uh, can’t directly optimize ROUGE-L because it’s a non-differentiable function. So they, uh, use this RL technique to compute the ROUGE score during training and then, uh, use reinforcement learning to backprop that to the model. So, the interesting finding from this paper is that if they just use the RL objective, then they do indeed get higher ROUGE scores. So they can successfully optimize this ROUGE-L metric that they were aiming to optimize, but the problem is that when you do that, you get lower human judgment scores. So, on the right we’re seeing that the RL-only model actually has pretty bad readability and relevance human judgment scores. It’s worse than just the maximum likelihood supervised training system. So, this is a quote from their blog post that says, “We have observed that our models with the highest ROUGE scores also generated barely readable summaries.” So, this is, um, I suppose, a problem, right? If you try to directly optimize for the metric, then you might start finding that you’re gaming the metric and not optimizing for the true task. Right, because just as we know BLEU is not really a perfect proxy for actual translation quality, so ROUGE is not a perfect proxy for, uh, summarization quality. But they did do something cool, which is that they found that if you combine the two objectives, so they kind of, uh, you know, keep the predict-the-next-word language model objective and also the produce-an-overall-summary-with-a-high-ROUGE-score objective, and you combine them together, then you can get a better human, uh, judgment score, which in the end is the closest thing we have to, uh, a measure of actual summarization quality. [NOISE] Okay. So, I’m gonna move on to, uh, dialogue, which is, um, a different family of NLG tasks. Uh, so, really, dialogue encompasses a really large variety of settings. And we are not going to cover them all, but here is a kind of overview of all the different kinds of tasks that people might mean when they say dialogue. Um, so, there’s task-oriented dialogue, and this refers to any setting where you’re trying to get something done in the conversation. So, for example, you’ve got kind of assistive tasks, where it’s assumed that, you know, the, uh, dialogue agent is trying to help a human user do something, like maybe giving customer service or recommendations, answering questions, helping a user, you know, accomplish a task like buying or booking something. Uh, these are the kinds of tasks which the virtual assistants on your phone can do, or can kind of do. Um, another family of task-oriented dialogue tasks are cooperative tasks. So, this is anything where you’ve got two agents who are trying to solve a task together via dialogue. Um, and the opposite of that would be adversarial: anything where you have two agents who are trying to compete in a task, and that, uh, competition is conducted through dialogue. [NOISE] So, uh, the opposite of task-oriented dialogue is, uh, social dialogue. So that’s anything where there is no explicit task other than to, I suppose, socialize.
So chit-chat dialogue, um, is just dialogue where you’re doing it for social fun or for company. Um, I’ve also seen some work on kind of, like, therapy or mental well-being dialogue. I’m not sure if this should go under task or social, it’s kind of a mix, uh, but I suppose these are the ones where the goal is to offer kind of emotional support to the human user. Um, so as a very brief overview of how, uh, the deep learning, uh, renaissance has changed dialogue research, um, I think you can say that in the kind of pre-deep-learning era, uh, the difficulty of open-ended, free-form natural language generation meant that, uh, dialogue systems were often, uh, not doing free-form NLG. They might use predefined templates, meaning that you have a template where you just fill in some slots with the content, uh, or maybe you retrieve an appropriate response from a corpus of responses that you have, in order to find, you know, an appropriate response for the user. And these are by no means simple systems; they had some very complex things going on, like deciding, you know, what the dialogue state is and what template you should use and so on, and all the natural language understanding components for understanding the context so far. But, uh, one effect that deep learning had is that, uh, since again around 2015, which is when NMT, uh, became standard, there have been, uh, just like in summarization, lots of papers applying seq2seq methods to dialogue. And this has led to a renewed interest in open-ended, free-form dialogue systems. So, uh, if you wanna have a look at what those early seq2seq dialogue papers looked like, um, here are two kind of early ones, maybe the first ones to apply seq2seq. Okay. So, uh, people quickly applied seq2seq, uh, NMT methods to dialogue, but it quickly became very apparent that this kind of naive application of standard NMT methods has some serious pervasive deficiencies when applied to a task like chit-chat dialogue. And this is even more true than it was for summarization. So what are some examples of these serious pervasive deficiencies? Uh, one would be genericness, or boring responses, and I’ll go into more detail about these in a moment. Another one is irrelevant responses. So that’s when, uh, the dialogue agent says something back that’s just kind of unrelated to what the user said. Um, another one is repetition. This is pretty basic, but it, uh, happens a lot. Um, and that’s both repetition within the utterance and maybe repetition across utterances. Ah, another difficulty is, uh, a kind of lack of context, like not remembering the conversation history. Obviously, if you do not condition on the whole conversation history, there’s no way your dialogue agent can use it, but it is a challenge, especially if you have a very long dialogue history, to figure out how to condition on it effectively. Another problem is the lack of a consistent persona. So if you kind of, uh, naively, as in maybe those two papers that I referenced on the previous slide, train a standard seq2seq model to take the, uh, you know, the user’s last utterance and then say something back, or maybe even the whole dialogue history and say something back, often your dialogue agent will have this completely inconsistent persona, like one moment it will say that it lives in Europe and then it’ll say it lives in, I don’t know, China or something, and it just doesn’t make sense.
So I’m gonna go through, uh, some of these problems and give you a bit more detail on them. So first, the irrelevant response problem. So in a bit more detail, the problem is that seq2seq often generates a response that’s kind of unrelated to the user’s utterance. So it can be unrelated because it’s simply generic, which means this kind of overlaps with the generic response problem, or it can be unrelated because the model is choosing to change the subject to something unrelated. So one solution of many, there are a lot of different papers which, uh, kind of attack this irrelevant response problem, uh, but one, for example, says that you should change the training objective. So instead of trying to optimize, um, the mapping from input S to response T such that you’re maximizing the conditional probability of T given S, you should instead optimize for Maximum Mutual Information. So that’s what this is here. So with maximum mutual information, uh, you can rewrite the objective like this, and if you want to see more detail you can go look at this paper here; there’s also a rough version of the objective written out below. But the idea is that you’re trying to find the response T that, uh, maximizes this thing, which is kind of like saying it needs to be probable given the input, but kind of as a ratio of its probability on its own. So if T is very high likelihood by itself, then it gets penalized; it’s about the ratio between the probability given the input and the standalone probability. So the idea is that this is meant to discourage, um, just saying generic things that have a high P(T) by themselves. Um, so that’s the irrelevant response problem. And as I just hinted at, there’s, uh, definitely a strong link between the irrelevant response problem and the kind of generic or boring response problem. So let’s look at the genericness or boring response problem. [NOISE] So I think there are some pretty easy fixes that you can make to, to a degree, ameliorate the boring response problem. Whether you’re really getting to the heart of the issue is a different question. But some kind of easy test-time fixes that you can certainly do are, for example: you can just directly up-weight rare words during beam search. So you can say all rare words get a boost to their, uh, log probabilities, and now we’re more likely to produce them during beam search. Another thing you could do is use, for example, a sampling decoding algorithm rather than beam search, and we talked about that earlier, um, or you could use, oh yeah, you could use softmax temperature as well. That’s another thing. So those are kind of test-time fixes, and you could regard those as a kind of late intervention, right? So an earlier intervention would be maybe training your model differently. So I’m calling these kind of conditioning fixes, because these fixes relate to, uh, conditioning your model on something that’s gonna help it be less boring. So one example is maybe you should condition the decoder on some kind of additional context. Uh, so for example, there’s some work showing that, you know, if you’re doing chit-chat dialogue, then maybe you should, uh, go and sample some words that are related to what the user said and then just kind of attend to them when you generate, and then you’re more likely to say something that’s contentful and interesting compared to the boring things you were saying before.
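(For reference, the MMI-style objective from that line of work can be written roughly like this; see the Li et al. paper for the exact variants they use.)

```latex
% Maximum Mutual Information objective for response generation:
% instead of argmax_T log P(T | S), pick
\hat{T} = \arg\max_{T} \big\{ \log P(T \mid S) - \lambda \log P(T) \big\}
% The -\lambda \log P(T) term penalizes responses that are high-probability
% on their own (i.e. generic), which is the "ratio" intuition above.
```

Okay, back to the fixes for boring responses.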
Another option is you could train a retrieve-and-refine model rather than a generate-from-scratch model. By retrieve-and-refine, I mean: suppose you have some corpus of general utterances, things that you could say, and then maybe you sample one from that training set and edit it to fit the current situation. It turns out that this is a pretty strong method for producing much more diverse, human-like and interesting utterances, because you can get all of that fine-grained detail from the sampled utterance and then edit it as necessary to fit your current situation. There are downsides to these kinds of methods, like it can be hard to edit the retrieved utterance so it actually fits the situation, but it's certainly a way to get more diversity and interest into the output. So, on the subject of the repetition problem: that was another major problem we noticed when applying seq2seq to chit-chat. Again, there are simple solutions and more complex solutions. A simple solution is you could just block repeating n-grams during beam search, and this is usually really quite effective. What I mean by that is, during beam search, when you're considering what your K hypotheses are, which is just the top K in the probability distribution, you say: anything that would constitute a repeating n-gram just gets thrown out. When I say constitutes a repeating n-gram, I mean: if you did take that word, would you now be creating a repeating, let's say, 2-gram, a bigram? And if we've decided that we're banning all repeating bigrams or trigrams or whatever, then you essentially just have to check, for every possible word that you might be considering in beam search, whether it would create a repeating n-gram. So this works pretty well. I mean, it's by no means a principled solution, right? It feels like we should have a better way to learn not to repeat, but as an effective hack, I think it works pretty well. The more complex solutions are, for example, you can train something called a coverage mechanism. In seq2seq, and this is mostly inspired by the machine translation setting, a coverage mechanism is a kind of objective that prevents the attention mechanism from attending to the same words multiple times, or too many times. The intuition here is that maybe repetition is caused by repeated attention: if you attend to the same things many times, then maybe you're going to repeat the same output many times. So if you prevent the repeated attention, you prevent the repeated output. This does work pretty well, but it's definitely a more complex thing to implement, it's less convenient, and in some settings it does seem like the simple solution is easier and works just as well. Other complex solutions might be that you could define a training objective to discourage repetition. You could try to define something differentiable, but one of the difficulties there is that because you're training with teacher forcing, where you're always looking at the gold inputs so far, you never really do the thing where you generate your own output and start repeating yourself. So it's kind of hard to define the penalty in that situation.
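Here is a minimal sketch of the n-gram blocking check described above; the code and helper name are my own illustration, not from the lecture:

```python
def creates_repeating_ngram(hypothesis, next_word, n=3):
    """Return True if appending next_word to the partial hypothesis (a list of
    tokens) would create an n-gram that already occurs in the hypothesis."""
    if len(hypothesis) < n - 1:
        return False
    candidate_ngram = tuple(hypothesis[-(n - 1):] + [next_word])
    seen = {tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1)}
    return candidate_ngram in seen

# During beam search, candidate extensions that would repeat an n-gram are
# simply dropped from the top-K list, e.g.:
#   allowed = [w for w in top_k_words if not creates_repeating_ngram(hyp, w)]
```

And for the coverage mechanism, one common formulation from the summarization literature (offered here as a representative example, not necessarily the exact objective on the slide) keeps a running sum of past attention distributions and penalizes attending again to positions that already received a lot of attention:

```latex
c^{t} = \sum_{t' < t} a^{t'}, \qquad
\text{covloss}_{t} = \sum_{i} \min\bigl(a_i^{t},\, c_i^{t}\bigr)
```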
So maybe this repetition penalty needs to be a kind of non-differentiable function. So, kind of like how the Paulus et al. paper was optimizing for ROUGE, maybe we optimize for not repeating, which is a discrete function of the output. I'm going to skip ahead to storytelling. In storytelling, there's a lot of interesting neural storytelling work going on right now, and most of it uses some kind of prompt to write a story. For example: writing a story given an image, or given a writing prompt, or writing the next sentence of the story given the story so far. So here's an example of generating a story from an image. What's interesting here is that we have this image, which is a picture of what appears to be an explosion, and then here you have a story about the image but written in the style of Taylor Swift lyrics. It says, "You have to be the only light bulb in the night sky I thought, oh god, it's so dark out of me that I missed you, I promise." And what's interesting is that there wasn't any straightforward, supervised image-captioning data set of explosions and Taylor Swift lyrics; the pieces were learned separately. How they did this is that they used a kind of common sentence-encoding space. They used a particular kind of sentence encoding called skip-thought vectors, and then they trained this COCO image-captioning system to go from the image to the encoding of the sentence, and separately they also trained a conditional language model to go from the sentence encoding to the Taylor Swift lyrics. And because you have this shared encoding space, you can put the two together and go from the picture, to the embedding, to the Taylor Swift style output, which I think is pretty amazing. Wow, I've really lost track of the time, so I think I have to hurry up quite a lot. So, we've had some really impressive story generation systems recently, and this is an example of a system which introduces a new data set where you write a story given a prompt, and they made this very impressive, very beefed-up convolutional seq2seq language model that generates the story given the input. I'm not going to go through all these details, but if you want to check out what the state of the art in story generation is, I encourage you to check this out. There are a lot of interesting things going on with very fancy attention and convolutions and so on, and they managed to generate some really interesting, impressive stories. So here, if you look at this example, we've got some really interesting story generation that's diverse, non-generic, stylistically dramatic, which is good, and related to the prompt. But I think you can also see here the limits of what the state-of-the-art story generation system can do, which is that, although it's in style, it's mostly atmospheric and descriptive. It's not really moving the plot forward; there are no events here, right? And the problem gets even worse when you generate for longer: when you generate a long text, then it will mostly just stay on the same idea without moving forward with new ideas. Okay. So I'm going to skip forward a lot and, sorry, I ought to have planned better. There's a lot of information here which you'll want to check out about poetry generation and other things.
I'm going to skip ahead because I want to get to the NLG evaluation section, because that's pretty important. So, we've talked about automatic evaluation metrics for NLG, and we know that these word overlap-based metrics, such as BLEU, ROUGE, and METEOR, are not ideal for machine translation. They're even worse for summarization, mostly because summarization is even more open-ended than machine translation, and that means that having this rigid notion that you've got to match the n-grams is even less useful. And then for something even more open-ended like dialogue, it's just kind of a disaster. It's not even a metric that gives you a good signal at all, and this also applies to anything else open-ended, like story generation. It's been shown, and you can check out the paper at the bottom, that word overlap metrics are just not a good fit for dialogue. The orange box is showing you some plots of the correlation between human scores on a dialogue task and BLEU-2, a variation of BLEU. And the problem here is that you're not seeing much of a correlation at all, right? It seems that, particularly in this dialogue setting, the correlation between the BLEU metric and the human judgment of whether it's a good dialogue response looks kind of non-existent; it's at least very weak. So that's pretty unfortunate, and there are some other papers that show much the same thing. So you might think, "Well, what other automatic metrics can we use? What about perplexity?" Perplexity certainly captures how powerful your language model is, but it doesn't tell you anything about generation. For example, if your decoding algorithm is bad in some way, then perplexity is not going to tell you anything about that, right? Because decoding is something you apply to your trained language model. Perplexity can tell you if you've got a strong language model or not, but it's not necessarily going to tell you how good your generation is. Some other thoughts you might have about automatic evaluation are: well, what about word embedding-based metrics? The main idea with word embedding-based metrics is that you compute the similarity of the word embeddings, or maybe the average of the word embeddings across a sentence, not just the overlap of the words themselves. So rather than being very strict and saying only the exact same word counts, you say, "Well, if the words are similar in word embedding space, then they count." This is certainly more flexible, but unfortunately, the same paper I showed before shows that this doesn't correlate well with human judgments of quality either, at least for the dialogue task they were looking at. So here, the middle column is showing the correlation between human judgments and an average-word-embedding-based metric. Yeah, that doesn't look great either; not a great correlation. So if we have no automatic metrics that adequately capture overall quality for natural language generation, what can we do instead? I think often the strategy is that you end up defining some more focused automatic metrics to capture the particular aspects of the generated text that you might be interested in.
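As a small, hedged illustration of the embedding-average idea just described, plus one of the simple "focused" statistics discussed next (distinct-n diversity), here is a minimal sketch; the function names are mine, and `embeddings` is assumed to be a dict mapping words to vectors, for example pre-trained GloVe vectors loaded into numpy arrays:

```python
import numpy as np

def embedding_average_similarity(candidate, reference, embeddings):
    """Cosine similarity between the averaged word vectors of a generated
    response and a reference response (words missing from `embeddings` are skipped)."""
    def avg_vec(text):
        vecs = [embeddings[w] for w in text.split() if w in embeddings]
        return np.mean(vecs, axis=0) if vecs else None
    c, r = avg_vec(candidate), avg_vec(reference)
    if c is None or r is None:
        return 0.0
    return float(np.dot(c, r) / (np.linalg.norm(c) * np.linalg.norm(r)))

def distinct_n(responses, n=2):
    """Fraction of n-grams that are distinct across a set of generated responses;
    a crude statistic for tracking diversity versus genericness."""
    ngrams = []
    for response in responses:
        tokens = response.split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)
```

Neither of these is an overall quality metric, of course; they're the kind of cheap, targeted signal the focused metrics below are about.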
So, for example, you might be interested in fluency, and you can compute that by just running a well-trained language model over your text and taking the probability it assigns, which is a kind of proxy for whether it's well-written, fluent, grammatical text. If you're particularly interested in generating text in a particular style, then you could take a language model trained on a corpus representing that style, and now the probability tells you not only whether it's good text, but whether it's in the right style. There are some other things as well, like diversity, and you can measure that pretty easily by just having some statistics about how much you're using rare words. For relevance to the input, you can compute a similarity score with the input. And there are some simple things like length and repetition that you can certainly count. Yes, none of this tells you the overall quality, but these things are worth measuring. So I think my main point is that, yes, we have a really difficult situation with NLG evaluation. There's no single overall metric that truly captures overall quality. But if you measure lots of these things, then they certainly can help you track some important things that you should know. So we talked about how automatic evaluation metrics for NLG are really tough. Let's talk about human evaluation. Human judgments are regarded as the gold standard, right? But we already know that human evaluation is slow and expensive. Are those the only problems with human eval? Let's suppose that you do have access to the time or money you need to do human evaluations. Does that solve all your problems? Suppose you have unlimited human eval; does that actually solve your problems? My answer is no, and this is from personal experience. Conducting human evaluation in itself is very difficult to get right. It's not easy at all, and this is partially because humans do a lot of weird things. Humans, unlike an automatic metric, are inconsistent, they can be illogical, sometimes they just get bored of your task and don't really pay attention anymore, they can misinterpret the question you asked, and sometimes they do things they can't really explain. So, as a kind of case study of this, I'm going to tell you about a project I did where I was building some chatbots, and it turned out that the human evaluation was kind of the hardest part of the project. I was trying to build these chatbots for the Persona-Chat data set, and in particular investigating controllability. So we were trying to control aspects of the generated text, such as whether it repeats itself and how generic it is, the same problems that we noted before. We built these models that control the specificity of what we're saying and how related what we're saying is to what the user said. So here you can see that our partner said something like, "Yes, I'm studying law at the moment," and we can turn this control knob so that we say something very generic like, "Oh," or something like 20 dots, something completely bonkers that's just all rare words. And there's a sweet spot in between, where you say, "That sounds like a lot of fun.
How long have you been studying?" And then similarly, we have a knob we can turn to determine how semantically related what we say is to what they said. So, you know, that's kind of interesting; it's a way to control the output of the NLG system. But actually, I want to tell you about how the human evaluation was so difficult. So we have these systems that we wanted to evaluate using human eval, and the question is, how do you ask for the human quality judgments here? You can ask simple overall-quality questions, like: how well did the conversation go? Was the user engaging? Or maybe comparative ones: which of these users gave the best response? Questions like this. And, you know, we tried a lot of them, but there were major problems with all of them. These questions are necessarily very subjective, and also, different respondents have different expectations, and this affects their judgments. For example, if you ask, "Do you think this user is a human or a bot?", then that depends entirely on the respondent's knowledge of bots, or opinion of bots, and what they think bots can do. Another example is that you'd get catastrophic misunderstandings of the question. For example, when we asked, "Was this chatbot engaging?", someone responded saying, "Yup, it was engaging because it always wrote back," which clearly isn't what we meant. We meant, is it an engaging conversation partner, but they took a very literal interpretation of what engaging means. So the problem here is that overall quality depends on many underlying factors, and it's pretty hard to find a single, overall question that captures just overall quality. So we ended up breaking this down into lots more factors of quality. The way we saw it is that you have maybe these overall measures of quality of the chatbot, such as how engaging was it, how enjoyable was it to talk to, and maybe how convincing was it as a human. And then below those, we broke things down into more low-level components of quality, such as: were you interesting? Were you showing that you were listening? Were you asking enough questions, and so on? And then below that, we had these controllable attributes, which were the knobs that we were turning, and the goal was to figure out how these things affected the output. So we had a bunch of findings here, and maybe the ones which I will highlight are these two in the middle. The overall metric engagingness, which means enjoyment, that was really easy to maximize. It turned out our bots managed to get near-human performance in terms of engagingness. But the overall metric humanness, that is, the kind of Turing test metric, that was not at all easy to maximize. All of our bots were way, way below humans in terms of humanness, right? So we were not at all convincing at being human, and this is kind of interesting, right? We were as enjoyable to talk to as humans, but we were clearly not human. So humanness is not the same thing as conversational quality. And one of the interesting things we found in this study, where we not only evaluated our chatbots but also actually got humans to evaluate each other, was that humans are sub-optimal conversationalists.
They scored pretty poorly on interestingness, fluency, and listening. They didn't ask each other enough questions, and this is kind of the reason why we managed to approach human performance in enjoyableness to talk to: because we just, for example, turned up the question-asking knob, asked more questions, and people responded really well to that, because people like talking about themselves. So, yeah, I think this is interesting, right? Because it shows that there is no obvious single question to ask. If you had just assumed, "Oh, the one question to ask is clearly engagingness," or "it's clearly humanness," then we would have gotten completely different reads on how well we were doing, right? Whereas asking these multiple questions gives you more of an overview. I am going to skip this, just because there's not a lot of time. Okay. So, here's the final section. These are my wrap-up thoughts on NLG research, the current trends and where we're going in the future. Here are three exciting current trends I'd identify in NLG, and of course your mileage may vary; you might think that other things are more interesting. So, the ones which I was thinking about are, firstly, incorporating discrete latent variables into NLG. You should go check out the slides I skipped over, because there were some examples of this. But the idea is that with some tasks, such as storytelling, or task-oriented dialogue where you're trying to actually get something done, you probably want a more concrete, hard notion of the things that you're talking about, like entities and people and events and negotiation and so on. So there's been interest in modeling these discrete latent variables inside these continuous NLG methods. The second one is alternatives to strict left-to-right generation. And I'm really sorry [LAUGHTER] I skipped over so many things. There's some interesting recent work on trying to generate text in ways other than left to right. For example, there's some work on parallel generation, or on writing something and then iteratively refining it. There's also the idea of top-down generation, especially for longer pieces of text, like maybe trying to decide the content of each of the sentences separately before writing the words. And then a third one is alternatives to maximum likelihood training with teacher forcing. To remind you, maximum likelihood training with teacher forcing is just the standard method of training a language model that we've been telling you about in the class so far. So there's some interesting work looking at more holistic, sentence-level rather than word-level objectives. Unfortunately, I ran out of time with this slide and didn't have time to put the references in, but I will put them in later and it will be on the course website, so you can go check them out. Okay. So, as a kind of overview: NLG research, where are we and where are we going? My metaphor is that I think about five years ago, NLP and deep learning research was a kind of Wild West, right? Everything was new, and NLP researchers weren't sure what the new research landscape was, because neural methods had changed machine translation a lot and looked like they might change other areas, but it was uncertain how much.
But these days, you know, five years later, it's a lot less wild. I'd say things have settled down a lot and standard practices have emerged, and sure, there are still a lot of things changing, but there are more people in the community, there are more standard practices, and we have things like TensorFlow and PyTorch, so you don't have to work out gradients by hand anymore. So I'd say things are a lot less wild now, but NLG does seem to be one of the wildest parts remaining, and part of the reason for that is the lack of evaluation metrics, which makes it so difficult to tell what we're doing. It's quite difficult to identify what the main methods are that are working when we don't have any metrics that can clearly tell us what's going on. Another thing that I'm really glad to see is that the neural NLG community is rapidly expanding. In the early years, people were mostly transferring successful NMT methods to various NLG tasks, but now I'm seeing increasingly inventive NLG techniques emerging which are specific to non-NMT generation settings. And again, I urge you to go back to the slides that I skipped. I'm also seeing increasingly more neural NLG workshops and competitions, especially focusing on open-ended NLG, like those tasks that we know are not well served by the automatic metrics that work for NMT. So there's a neural generation workshop, a storytelling workshop, and various challenges as well, where people enter, for example, their conversational dialogue agents to be evaluated against each other. I think these community-organizing workshops and competitions are really doing a great job of organizing the community, increasing reproducibility, and standardizing evaluation. So this is great, but I'd say the biggest roadblock to progress is definitely still evaluation. Okay. So, the last thing that I want to share with you is eight things that I've learned from working in NLG. The first one is: the more open-ended the task, the harder everything becomes. Evaluation becomes harder, defining what you're doing becomes harder, telling when you're doing a good job becomes harder. So, for this reason, constraints are sometimes welcome: if you decide to constrain your task, then sometimes it's easier to complete it. The next one is: aiming for a specific improvement can often be more manageable than aiming to improve overall generation quality. For example, if you decide that you want to, say, increase the diversity of your model, to make it say more interesting things, that's an easier thing to achieve and measure than just saying we want to improve overall generation quality, because of the evaluation problem. The next one is: if you're using your language model to do NLG, then improving the language model, that is, getting better perplexity, will probably give you better generation quality, because you've got a stronger language model. But it's not the only way to improve generation quality; as we talked about before, there are other components that can affect generation apart from the language model, and part of the problem is that those aren't in the training objective. My next tip is that you should look at your output a lot, partially because you don't have any single metric that can tell you what's going on.
It's pretty important to look at your output a lot to form your own opinions. It can be time-consuming, but it's probably worth doing. I ended up talking to these chatbots a huge amount during the time that I was working on the project. Okay, almost done. So, five: you need an automatic metric, even if it's imperfect. I know that you already know this, because we wrote it all over the project instructions, but I'd probably amend that to: maybe you need several automatic metrics. I talked earlier about how you might track multiple things to get an overall picture of what's going on; I'd say the more open-ended your NLG task is, the more likely it is that you need several metrics. If you do human eval, you want to make the questions as focused as possible. As I found out the hard way, if you phrase the question as a very vague, overall thing, then you're just opening yourself up to the respondents misunderstanding you, and if they do that, it's actually not their fault, it's your fault, and you need to fix your questions. That's what I learned. The next thing is that reproducibility is a huge problem in today's NLP, and in deep learning in general, and the problem is only bigger in NLG; I guess that's another way in which it's still a wild west. So I'd say that it would be really great if everybody could publicly release all of their generated output when they write NLG papers. I think this is a great practice, because if you release your generated output, then if someone later comes up with a great automatic metric, they can just grab your generated output and compute the metric on it. Whereas if you never released your output, or you only released some imperfect metric number, then future researchers have nothing to compare against. Lastly, my last thought about working in NLG is that it can be very frustrating sometimes, because things can be difficult and it's hard to know when you're making progress. But the upside is it can also be very funny. So this is my last slide; here are some bizarre conversations that I've had with my chatbot. [LAUGHTER] Thanks. [NOISE] [LAUGHTER] All right, thanks.
