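BERT can usually process at most 512 tokens per input. These are WordPiece subword tokens, not words, so a 512-token limit typically covers fewer than 512 words. How, then, do you handle a sequence longer than that? There are three common truncation strategies, each keeping 510 tokens so that the special [CLS] and [SEP] tokens still fit within the 512-token limit: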
(1) head-only: keep the first 510 tokens.
(2) tail-only: keep the last 510 tokens.
(3) head+tail: keep the first 128 and the last 382 tokens (510 in total).
In the experiments comparing these three strategies, head+tail gives the best performance.
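The three strategies can be sketched in a few lines of plain Python. This is a minimal illustration, not the tutorial's own code: the function name `truncate_tokens` and its parameters `budget` and `head_len` are hypothetical, and it assumes the text has already been tokenized into a list of tokens or token ids without the special [CLS] and [SEP] tokens.

```python
def truncate_tokens(tokens, strategy="head+tail", budget=510, head_len=128):
    """Return at most `budget` tokens from `tokens`.

    Hypothetical helper for illustration. `budget` defaults to 510 so that
    [CLS] and [SEP] can be added afterwards without exceeding 512 tokens.
    """
    tokens = list(tokens)
    # Short sequences need no truncation.
    if len(tokens) <= budget:
        return tokens

    if strategy == "head-only":
        # Keep the first `budget` tokens.
        return tokens[:budget]
    if strategy == "tail-only":
        # Keep the last `budget` tokens.
        return tokens[-budget:]
    if strategy == "head+tail":
        # Keep the first `head_len` tokens and fill the rest from the end.
        if not 0 < head_len < budget:
            raise ValueError("head_len must be between 1 and budget - 1")
        tail_len = budget - head_len  # 382 with the defaults
        return tokens[:head_len] + tokens[-tail_len:]

    raise ValueError(f"unknown strategy: {strategy!r}")


# Example with a dummy sequence of 1000 "token ids".
ids = list(range(1000))
head = truncate_tokens(ids, "head-only")
tail = truncate_tokens(ids, "tail-only")
both = truncate_tokens(ids, "head+tail")
print(len(head), len(tail), len(both))
```

After truncation, the [CLS] and [SEP] tokens would be added around the result before it is passed to the model, which brings the total to at most 512.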
Here is the full tutorial!