This page shows the results on the test set I created by changing temporal phrases; the labels are predicted by bert-base-uncased, using the Hugging Face BERT implementation. The accuracy is 23.75%.
The pretrained BERT model was fine-tuned for 3 epochs on MNLI, and gets 84.16% on the matched dev set and 84.35% on the mismatched dev set.
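For reference, this is roughly how such predictions can be obtained, as a minimal sketch assuming the fine-tuned checkpoint is saved locally; `MODEL_PATH` and the `(premise, hypothesis, gold)` triple format are placeholders, and the label index order depends on how the classifier head was trained:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical local path to a bert-base-uncased checkpoint
# fine-tuned for 3 epochs on MNLI; substitute your own.
MODEL_PATH = "./bert-base-uncased-mnli"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()

def predict_label(premise: str, hypothesis: str) -> int:
    # Encode the pair the way BERT expects: [CLS] premise [SEP] hypothesis [SEP]
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item()

def accuracy(examples):
    # examples: iterable of (premise, hypothesis, gold_label_index) triples
    correct = sum(predict_label(p, h) == gold for p, h, gold in examples)
    return correct / len(examples)
```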
I was a bit surprised that BERT does not do any better than the GLUE baseline models on this test set:
| model | accuracy (%) |
|---|---|
| CBOW | 34.8 |
| BiLSTM | 22.9 |
| ESIM | 19.4 |
I argue this technique should be used as a sanity check for neural network models, to verify that they at least understand basic concepts governing our universe. I cannot imagine an intelligent system that does not understand that time goes forward.
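The sketch below illustrates the kind of perturbation I mean; the swap table is illustrative only, not the actual phrase list used to build the test set, and note that flipping a temporal phrase can also flip the gold entailment label:

```python
# Toy sketch of a temporal-phrase swap; the pairs below are
# hypothetical examples, not the real list behind this test set.
TEMPORAL_SWAPS = {
    "before": "after",
    "after": "before",
    "earlier": "later",
    "later": "earlier",
}

def swap_temporal_phrases(sentence: str) -> str:
    # Replace each temporal cue word with its opposite, token by token.
    tokens = sentence.split()
    return " ".join(TEMPORAL_SWAPS.get(t.lower(), t) for t in tokens)

# e.g. "He left before she arrived" -> "He left after she arrived"
```

A model that has grasped temporal order should change its prediction in a consistent way under such swaps; a model that ignores the temporal phrase will not.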
For more details, see my report.