In NLP we generally use an LSTM or BiLSTM to encode a sequence, then use self-attention to pool all the hidden outputs into a single vector for classification. But what if we want to run classification on every hidden output of the LSTM or BiLSTM? How do we do that?
There are two things to notice:
(1). If there are M hidden outputs, we have to create M separate attention networks, one to produce the attention output for each LSTM hidden output.
(2). For each step, the input to its attention network is concatenated with that network's attention output to form the final feature vector for classification.
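The two points above can be sketched in PyTorch. This is a minimal illustration, not a definitive implementation: the module name, the per-step linear scoring layers, and the choice of concatenating each step's hidden state with its attention context are my assumptions about how the scheme fits together.

```python
import torch
import torch.nn as nn

class PerStepAttentionClassifier(nn.Module):
    """Hypothetical sketch: a BiLSTM encoder with one attention network
    per time step, classifying every hidden output (points (1) and (2))."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, seq_len, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        out_dim = 2 * hidden_dim  # forward + backward hidden states
        # (1) M separate attention networks, one per time step
        self.attn = nn.ModuleList(
            [nn.Linear(out_dim, 1) for _ in range(seq_len)]
        )
        # (2) classifier sees the step's hidden state concatenated with
        # its attention output, hence 2 * out_dim input features
        self.classifier = nn.Linear(2 * out_dim, num_classes)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids
        h, _ = self.bilstm(self.embed(tokens))        # (batch, seq_len, out_dim)
        logits = []
        for t, attn_t in enumerate(self.attn):
            scores = attn_t(h).squeeze(-1)            # (batch, seq_len)
            weights = torch.softmax(scores, dim=-1)   # attend over all steps
            context = torch.bmm(weights.unsqueeze(1), h).squeeze(1)
            # concatenate step-t hidden state with its attention output
            step_input = torch.cat([h[:, t, :], context], dim=-1)
            logits.append(self.classifier(step_input))
        return torch.stack(logits, dim=1)             # (batch, seq_len, classes)

model = PerStepAttentionClassifier(vocab_size=100, embed_dim=16,
                                   hidden_dim=32, seq_len=5, num_classes=3)
out = model(torch.randint(0, 100, (2, 5)))
```

With a batch of 2 sequences of length 5, `out` has shape `(2, 5, 3)`: one set of class logits per hidden output, as required.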
Here is the full tutorial!