Consider a sequence-to-sequence problem, say machine translation. The initially proposed architectures were mostly encoder-decoder models in which both the encoder and the decoder are made of RNNs
Say we have the task of Hindi --> English translation
i/p - Mai ghar ja raha hoo
o/p- I am going home
The encoder reads the sentence only once and encodes it
At each timestep the decoder uses this embedding to produce a new word
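To make this concrete, here is a minimal sketch of the vanilla encoder: a plain RNN reads the source embeddings once, and only its final state is handed to the decoder. All dimensions and weights below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 5 source words ("Mai ghar ja raha hoo"),
# embedding size 4, hidden size 3 (all arbitrary choices).
T_src, d_emb, d_hid = 5, 4, 3
x = rng.standard_normal((T_src, d_emb))   # stand-in source word embeddings

# Encoder: a vanilla RNN; only its FINAL state reaches the decoder.
W_xh = 0.1 * rng.standard_normal((d_hid, d_emb))
W_hh = 0.1 * rng.standard_normal((d_hid, d_hid))
h = np.zeros(d_hid)
for t in range(T_src):
    h = np.tanh(W_xh @ x[t] + W_hh @ h)

s0 = h  # the single sentence encoding fed to the decoder at step 0
```

Note that the whole five-word sentence has been squeezed into one vector `s0`; this bottleneck is exactly what attention will relax.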
Is this how humans translate a sentence? Not really! Humans try to produce each word in the output by focusing only on certain words in the input
Essentially, at each time step we come up with a distribution over the input words. This distribution tells us how much attention to pay to each input word at that time step
Ideally, at each time-step we should feed only this relevant information (i.e. encodings of relevant words) to the decoder
Let us revisit the decoder that we have seen so far. We either feed in the encoder information only once (at s0), or we feed the same encoder information at each time step
Now suppose an oracle told you which words to focus on at a given time-step t
Can you think of a smarter way of feeding information to the decoder?
We could just take a weighted average of the corresponding word representations and feed it to the decoder
For example at timestep 3, we can just take a weighted average of the representations of ‘ja’ and ‘raha’
Intuitively this should work better because we are not overloading the decoder with irrelevant information (about words that do not matter at this time step)
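The weighted-average idea at time step 3 can be sketched directly; the encodings below are random stand-ins, and the weights are the oracle's (hypothetical) focus on 'ja' and 'raha'.

```python
import numpy as np

# Encodings c_1..c_5 of "Mai ghar ja raha hoo" (random stand-ins here).
rng = np.random.default_rng(0)
c = rng.standard_normal((5, 3))   # 5 source words, hidden size 3

# Oracle weights for decoder step 3: focus equally on 'ja' and 'raha'.
alpha = np.array([0.0, 0.0, 0.5, 0.5, 0.0])

# Weighted average of the word encodings: the context fed to the decoder.
context = alpha @ c
```

Words with zero weight contribute nothing, so the decoder sees only the encodings of 'ja' and 'raha' at this step.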
How do we convert this intuition into a model? Of course, in practice we will not have this oracle
The machine will have to learn this from the data
To enable this we define a function
e_jt = f_ATT(s_{t-1}, c_j)
This quantity captures the importance of the j-th input word for decoding the t-th output word (we will see the exact form of f_ATT later)
We can normalize these weights by using the softmax function
α_jt = exp(e_jt) / Σ_{j=1}^{M} exp(e_jt)
α_jt denotes the probability of focusing on the j-th word to produce the t-th output word
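The two equations above can be sketched together. The exact form of f_ATT is deferred in these notes; here we assume one common choice, the additive (Bahdanau-style) form e_jt = v^T tanh(U s_{t-1} + W c_j), with made-up dimensions and random weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
s_prev = rng.standard_normal(d)        # decoder state s_{t-1}
c = rng.standard_normal((5, d))        # encodings c_1..c_M, with M = 5

# Assumed additive form of f_ATT: e_jt = v^T tanh(U @ s_{t-1} + W @ c_j).
U = 0.1 * rng.standard_normal((d, d))
W = 0.1 * rng.standard_normal((d, d))
v = rng.standard_normal(d)

e = np.array([v @ np.tanh(U @ s_prev + W @ cj) for cj in c])

# Softmax normalization turns the scores into a distribution over input words.
alpha = np.exp(e) / np.exp(e).sum()
```

U, W, and v are the extra parameters mentioned next: they are trained jointly with the encoder and decoder.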
We are now trying to learn the α’s instead of an oracle informing us about the α’s
Learning would always involve some parameters. These parameters will be learned along with the other parameters of the encoder and decoder
Wait a minute! This model would make a lot of sense if we were given the true α's at training time
α_t^true = [0, 0, 0.5, 0.5, 0]
α_t^pred = [0.1, 0.1, 0.35, 0.35, 0.1]
We could then minimize L(α_true, α_pred) in addition to L(θ) as defined earlier
But in practice it is very hard to get α true
For example, in our translation example we would want someone to manually annotate the source words which contribute to every target word. It is hard to get such annotated data
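If such annotations did exist, one possible choice for L(α_true, α_pred) is the cross-entropy between the two distributions; this is an illustrative assumption, since the notes do not fix the form of this loss.

```python
import numpy as np

# The hypothetical true and predicted attention distributions from above.
alpha_true = np.array([0.0, 0.0, 0.5, 0.5, 0.0])
alpha_pred = np.array([0.1, 0.1, 0.35, 0.35, 0.1])

# One possible L(alpha_true, alpha_pred): cross-entropy between the two
# distributions (terms where alpha_true is 0 contribute nothing).
eps = 1e-12  # guards against log(0)
loss = -np.sum(alpha_true * np.log(alpha_pred + eps))
```

Minimizing this term would push α_pred toward α_true; in practice the α's are learned implicitly, driven only by the translation loss L(θ).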
Then how would this model work in the absence of such data? It works because it is a better modeling choice
This is a more informed model. We are essentially asking the model to approach the problem in a better (more natural) way
Given enough data it should be able to learn these attention weights just as humans do. That’s the hope (and hope is a good thing)
And in practice indeed these models work better than the vanilla encoder decoder models
The Bicycle Analogy
Say, as a kid, you want to learn to ride a bicycle. Your parents show you the bicycle and explain the pedals and balancing, but never explain the handlebars to you, as if handlebars had never been introduced. You now start learning to ride without using the handlebars. It is of course possible that you will eventually learn to ride after a very, very long time, but it would be a cumbersome and inefficient process (although there are some people who can ride a bicycle without using the handlebars at all).
Now suppose someone introduces you to the handlebars and explains what they are used for. You now have an additional control over the bicycle to help you balance. This is a more natural and better way to learn, and you are likely to learn much faster and ride more efficiently.