Josherich's Blog


Breaking NLP system

20 Apr 2018

As more and more research shows, NLP systems are extremely vulnerable to out-of-domain distribution, “human-level” performance on test sets cannot be reproduced on real-world problems. Language is hard and the hard part is sparse in datasets, neural models have never show good result on small real-world dataset. The false belief that neural networks solve these problems directs people to repetitive, trivial, and redundant research. It’s even more dangerous that the media cover these false positive and thus the public’s false perception on related topics.

Things are embedded in language, more or less

One way to represent common sense is triplets, people have tried to mine triplets from raw text, build benchmark datasets, but these attempts are more or less half-done pieces. It is also a part of knowledge base construction. With these pieces, we can construct two styles of systems, one predicts answers directly from text, one predicts using pipelines including KBC and natural language to SQL. The latter seems more promising in this moment.

Maybe it’s still just overfitting

I saw this interesting work on NLP Highlight podcast, which shares a great idea found in many other important works: always assuming neural network are overfitting, as they always tend to do so. The work find a way to prove most models make predictions with hints, even just a single word, even a single question mark.

reducing single word

Hand-picked Diagnostic set

GLUE benchmark constructs a set of test cases on basic common sense concepts, known as Diagnostics Main. It covers many ‘baseline’ features of intelligent system.

  • Lexical Semantics
    • Lexical Entailment
    • Morphological Negation
    • Factivity
    • Symmetry/Collectivity
    • Redundancy
    • Named Entities
    • Quantifiers
  • Predicate-Argument Structure
    • Syntactic Ambiguity
    • Prepositional Phrases
    • Core Arguments
    • Alternations
    • Ellipsis/Implicits
    • Anaphora/Coreference
    • Intersectivity
    • Restrictivity
  • Logic
    • Propositional Structure
    • Quantification
    • Monotonicity
    • Richer Logical Structure
  • Knowledge and Common sense
    • World Knowledge
    • Common Sense

However, they decide not to include these tests in the benchmark score. Somehow this makes a crazy situation that top models exceeds human performance, while get less than 50% with these Diagnostics tests:

diagnostic main

Benchmark by Transforming Sentences

Swap Premise and Hypothesis

One simple but effective way to verify NLI models is to swap the sentence pair. The swapped sentence could be infered as following:

  • entailment => contradiction
  • neutral => neutral or entailment
  • contradiction => contradiction

Time and Quantity Reasoning

These sentences are selected from RepEval20171 shared tasks, under the time/quantity resoning category. Inference on these sentences are currently quite hard for neural models.

  - Like one, two, three, four. 
  - Count from one to four. 

  - Of those beginning prenatal care in the first trimester, Indiana mothers rank in the lower third of individuals receiving such care nationwide. 
  - There are Indiana mothers who do not receive prenatal care in the first trimester.  

  - The value of the coefficient can vary from zero (if demand is exactly the same every week) to numbers much greater than one for wildly fluctuating weekly demand. 
  - A coefficient can indeed rise up to a hundred for wildly fluctuating weekly demand. 

  - A short time later, Nawaf and Salem al Hazmi entered the same checkpoint. 
  - Nawaf and Salem al Hazmi entered the checkpoint ten minutes later.  

  - I think uh-oh 200 pounds of mush going to be laying on the floor, I'll never get him up you know. 
  - There is 100 pounds of mush on the floor and I can't get him up.  

  - At best, experience with different combinations of waist sizes and leg lengths for a given design allows a scheduler to aggregate the units to be made into groups of large and small sizes, which means marker-makers can achieve efficiencies near 90 percent for casual pants. 
  - Makrer-makers can only ever achieve efficiencies of 80 percent for casual pants.  

By transforming time phrases, we can construct sentence pairs and verify models. I list ome examples here

Subtle Aspects of Language

I collect some interesting sentences here for which natural language understanding, if it works, should be able to give a sound representation, and give multiple possible representations in some cases.

I hope people could stop focusing on boring numbers.

他觉得就在刚才他记得昨天下午的聚会她认为自己丑(recursive) 他觉得就在刚才他记得昨天下午的聚会她认为自己丑

领导:你这是什么意思?没什么意思。意思意思。(disambiguate “意思”) 领导:你这是什么意思?没什么意思。意思意思。


地铁里听到一个女孩大概是给男朋友打电话:“我已经到西直门了,你快出来往地铁站走。如果你到了,我还没到,你就等着吧。如果我到了,你还没到,你就等着吧。”(字面意思和语气)2 他觉得就在刚才他记得昨天下午的聚会她认为自己丑


  • 赵元任


  • 隋景芳

江州市长江大桥参加了长江大桥的通车仪式 他觉得就在刚才他记得昨天下午的聚会她认为自己丑

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo 他觉得就在刚才他记得昨天下午的聚会她认为自己丑

我在北京西站南广场东 他觉得就在刚才他记得昨天下午的聚会她认为自己丑


James while John had had had had had had had had had had had a better effect on the teacher

Colorless green ideas sleep furiously. 5



  1. mismatched annotation from