# Recent paper list

Tree embedding

BERT Chinese whole word masking

visualize BERT

language modeling using quantum physics

naacl 2019 https://github.com/wabyking/qnn https://mp.weixin.qq.com/s/NbLnGQD4TjlbMGOXbfa8pg


TREC-CAR task:

query is a Wikipedia title concatenated with section titles; answer is the Wikipedia page text

query_0 -> multi reformulator -> search results -> aggregator

use RL to train reformulator agents (state is documents, reward r is )

aggregator as meta-agent

  • accumulated rank score s_A, relevance score s_R (CNN/LSTM), take top-k answers w.r.t. s = s_R · s_A

polysemy in word embedding

multiple word senses reside in linear superposition within the word embedding, and simple sparse coding can recover vectors that approximately capture the senses.

polysemy vector is linear combination of its monosemous vectors

  • linearity assertion: linear structure emerges from a highly nonlinear embedding

  • Word Sense Induction via sparse coding
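A minimal sketch of the sparse-coding step, using scikit-learn's DictionaryLearning on a placeholder embedding matrix; the atom count, sparsity level, and variable names are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

E = np.random.randn(1000, 300)                # stand-in for a real word embedding matrix
dl = DictionaryLearning(n_components=200,     # number of "discourse atoms"
                        transform_algorithm="omp",
                        transform_n_nonzero_coefs=5,  # each word expressed with ~5 atoms
                        max_iter=5)
codes = dl.fit_transform(E)                   # sparse coefficients, shape (1000, 200)
atoms = dl.components_                        # atom vectors, shape (200, 300)

# the atoms with nonzero coefficients for a word approximate its senses
word_id = 42
sense_atoms = np.flatnonzero(codes[word_id])
print(sense_atoms)
```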

random walk on discourses

random walk across micro-topics (discourses)

there exists a linear relationship between the vector of a word and the vectors of the words in its contexts

Automated WordNet Construction Using Word Embeddings

Sentence Compression as Tree Transduction

tree-to-tree transduction method for sentence compression

memory and locality in NLP

LSTM preserves word-order preference

code2vec: Learning Distributed Representations of Code

transformer 1 head

Snorkel MeTaL basic

knowledge base construction fonduer

multi task NLU

# Recent post list

entity link: https://spaces.ac.cn/archives/6919

# corpus

# Datasets

# Multilingual


# Chinese


DRCD https://arxiv.org/pdf/1806.00920.pdf

People's Daily NER / MSRA NER



FEVER: Fact Extraction and VERification


LAMBADA dataset: long-range dependencies, predict the last word




Mozilla Common Voice

cn words

Chinese rumors (Weibo)

wikitext-2

  • train: https://s3.amazonaws.com/datasets.huggingface.co/wikitext-2/train.txt
  • valid: https://s3.amazonaws.com/datasets.huggingface.co/wikitext-2/valid.txt

wikitext-103

  • train: https://s3.amazonaws.com/datasets.huggingface.co/wikitext-103/wiki.train.tokens
  • valid: https://s3.amazonaws.com/datasets.huggingface.co/wikitext-103/wiki.valid.tokens

simplebooks-2-raw

  • train: https://s3.amazonaws.com/datasets.huggingface.co/simplebooks-2-raw/train.txt
  • valid: https://s3.amazonaws.com/datasets.huggingface.co/simplebooks-2-raw/valid.txt

simplebooks-92-raw

  • train: https://s3.amazonaws.com/datasets.huggingface.co/simplebooks-92-raw/train.txt
  • valid: https://s3.amazonaws.com/datasets.huggingface.co/simplebooks-92-raw/valid.txt

imdb

  • train: https://s3.amazonaws.com/datasets.huggingface.co/aclImdb/train.txt
  • test: https://s3.amazonaws.com/datasets.huggingface.co/aclImdb/test.txt

trec

  • train: https://s3.amazonaws.com/datasets.huggingface.co/trec/train.txt
  • test: https://s3.amazonaws.com/datasets.huggingface.co/trec/test.txt
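A small sketch for fetching one of the splits listed above using only the standard library; the local file names are made up, and these S3 links may no longer be live.

```python
import urllib.request

urls = {
    "train": "https://s3.amazonaws.com/datasets.huggingface.co/wikitext-2/train.txt",
    "valid": "https://s3.amazonaws.com/datasets.huggingface.co/wikitext-2/valid.txt",
}
for split, url in urls.items():
    urllib.request.urlretrieve(url, f"wikitext-2.{split}.txt")  # save each split locally
```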

# text classification


Yelp-2, Yelp-5



Amazon-2, Amazon-5

# Reading comprehension


span based

  • SQuAD 2.0 (unanswerable questions)



exam based

  • [AI2 Elementary School Science Questions] 1080 questions

  • [NTCIR QA Lab] university entrance exams

  • [CLEF QA Track]


# Tools

syntok: sentence segmentation

BERT multi-label classification


Tencent AI Lab word embeddings

language detector




cn words

SentencePiece tokenizer

# Tasks

Pairwise Relations

  • arc classification
  • arc prediction (binary)

Preposition supersense disambiguation

STREUSLE 4.0 corpus

semantic tagging

event factuality

Universal Decompositional Semantics It Happened v2

Grammatical error detection


conjunct identification


Document-level Generation and Translation

2019 Shared Task

Dependency parsing

  • head and dependent

  • the structure must be connected, have a designated root node, and be acyclic or planar.

  • each word has a single head.

  • there is a single root node from which one can follow a unique directed path to each of the words in the sentence.

  • projective, if there is a path from the head to every word that lies between the head and the dependent in the sentence.

  • a tree is projective, if it can be drawn with no crossing edges.

  • non-projective, seen in languages with flexible word order.
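A toy sketch of the projectivity check above: a tree is projective iff no two arcs cross when drawn above the sentence. The input format (`heads[i]` is the head of word i+1, with 0 standing for ROOT) is an assumption for the example.

```python
def is_projective(heads):
    """heads[i] = head index of word i+1; index 0 is ROOT."""
    arcs = [tuple(sorted((h, d))) for d, h in enumerate(heads, start=1)]
    for lo1, hi1 in arcs:
        for lo2, hi2 in arcs:
            if lo1 < lo2 < hi1 < hi2:   # strictly interleaved endpoints = crossing arcs
                return False
    return True

print(is_projective([2, 0, 4, 2]))   # True: a simple nested tree
print(is_projective([3, 0, 2, 2]))   # False: arc (3 -> 1) crosses arc (ROOT -> 2)
```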

Translation, text, audio

Graph based dependency parsing

  • score the tree with edge weights

  • find the maximum spanning tree; a greedy edge selection might contain cycles

  • to eliminate cycles, use the Chu-Liu-Edmonds algorithm (Chu & Liu 1965; Edmonds 1967):

  1. for each non-root vertex, the incoming edge with the maximum weight is chosen.

  2. subtract the maximum incoming weight from each of that vertex's incoming edges.

  3. select a cycle and collapse it into a single new node, then repeat on the contracted graph.
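A compact recursive sketch of these steps (greedy selection, cycle contraction, recursion, expansion). It assumes integer node ids with 0 as the root and a dense-enough score table, and is meant to illustrate the algorithm rather than serve as a production decoder.

```python
from collections import defaultdict

def chu_liu_edmonds(scores, root=0):
    """scores[h][d] = weight of arc h -> d; returns {dependent: head}."""
    nodes = set(scores) | {d for h in scores for d in scores[h]}
    # step 1: every non-root node greedily keeps its highest-scoring incoming arc
    best = {}
    for d in nodes:
        if d == root:
            continue
        best[d] = max((scores[h][d], h) for h in scores if d in scores[h] and h != d)[1]
    cycle = _find_cycle(best)
    if cycle is None:
        return best
    # steps 2-3: contract the cycle into a fresh node; arcs entering it are rescored by
    # subtracting the weight of the cycle arc they would replace
    cyc, new = set(cycle), max(nodes) + 1
    contracted, enter, leave = defaultdict(dict), {}, {}
    for h in scores:
        for d, w in scores[h].items():
            if h == d or (h in cyc and d in cyc):
                continue
            if d in cyc:                                   # arc entering the cycle
                adj = w - scores[best[d]][d]
                if new not in contracted[h] or adj > contracted[h][new]:
                    contracted[h][new] = adj
                    enter[h] = (h, d)
            elif h in cyc:                                 # arc leaving the cycle
                if d not in contracted[new] or w > contracted[new][d]:
                    contracted[new][d] = w
                    leave[d] = (h, d)
            else:
                contracted[h][d] = w
    sub = chu_liu_edmonds(dict(contracted), root)
    # expansion: the chosen entering arc breaks the cycle at its destination
    head_of_new = sub[new]
    broken = enter[head_of_new][1]
    result = {}
    for d, h in sub.items():
        if d == new:
            result[broken] = head_of_new
        elif h == new:
            result[d] = leave[d][0]
        else:
            result[d] = h
    for d in cyc:
        if d != broken:
            result[d] = best[d]
    return result

def _find_cycle(best):
    for start in best:
        path, node = [], start
        while node in best and node not in path:
            path.append(node)
            node = best[node]
        if node in path:
            return path[path.index(node):]
    return None

# toy scores: 0 is ROOT, 1-3 are words
scores = {0: {1: 9, 2: 10, 3: 9},
          1: {2: 20, 3: 3},
          2: {1: 30, 3: 30},
          3: {1: 11, 2: 0}}
print(chu_liu_edmonds(scores))   # {1: 2, 2: 0, 3: 2}, i.e. arcs 0->2, 2->1, 2->3
```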

higher order dependency parsing

Zhang and McDonald 2012

# Technique

# Tree embedding


Pick n vectors completely at random from a unit Gaussian distribution in ℝ^m. If m ≫ n, with high probability the result would be an approximate Pythagorean embedding.

The reason is that in high dimensions, (1) vectors drawn from a unit Gaussian distribution have length very close to 1 with high probability; and (2) when m ≫ n, a set of n unit Gaussian vectors will likely be close to mutually orthogonal.

Initialize with a completely random tree embedding, and in addition pick a special random vector for each vertex; then at each step, move each child node so that it is closer to its parent's location plus the child's special vector

> [A Structural Probe for Finding Syntax in Word Representations](https://nlp.stanford.edu/pubs/hewitt2019structural.pdf)
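A small numpy sketch of the construction above on a toy tree: each node's vector is its parent's vector plus the node's own special random vector (scaled so individual vectors have length about 1), and squared distances then approximate tree distances. The function name and the example tree are illustrative.

```python
import numpy as np

def random_tree_embedding(parent, m=2048, seed=0):
    """parent[v] = parent of node v (None for the root)."""
    rng = np.random.default_rng(seed)
    own = {v: rng.normal(scale=1 / np.sqrt(m), size=m) for v in parent}  # special vectors
    emb = {}
    def build(v):
        if v not in emb:
            emb[v] = own[v] if parent[v] is None else build(parent[v]) + own[v]
        return emb[v]
    for v in parent:
        build(v)
    return emb

parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2}   # tiny example tree rooted at 0
emb = random_tree_embedding(parent)
# nodes 3 and 5 are 4 edges apart, so the squared distance should be close to 4
print(np.sum((emb[3] - emb[5]) ** 2))
```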

# Byte Pair Encoding

# Subword Neural Machine Translation


# weight sharing


# Gibbs sampling for posterior distribution

The idea in Gibbs sampling is to generate posterior samples by sweeping through each variable (or block of variables) to sample from its conditional distribution with the remaining variables fixed to their current values.
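A minimal sketch of such a sweep for a toy target, a bivariate normal with correlation rho, where both conditionals are exact normals; the constants and function name are illustrative.

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_samples=5000, seed=0):
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    cond_sd = np.sqrt(1 - rho ** 2)        # standard deviation of each conditional
    samples = np.empty((n_samples, 2))
    for i in range(n_samples):
        x = rng.normal(rho * y, cond_sd)   # sample x | y ~ N(rho*y, 1 - rho^2)
        y = rng.normal(rho * x, cond_sd)   # sample y | x ~ N(rho*x, 1 - rho^2)
        samples[i] = x, y
    return samples

draws = gibbs_bivariate_normal()
print(np.corrcoef(draws[1000:].T))         # correlation approaches rho after burn-in
```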

# methods, algorithms

# computing attention score

  • dot product

  • bilinear

  • multi-layer perceptron
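Toy numpy versions of the three scoring functions above, for a single query q against a key matrix K; in a real model W, W1, and w2 would be learned parameters, here they are random placeholders.

```python
import numpy as np

def dot_product_score(q, K):
    return K @ q                                 # s_i = k_i . q

def bilinear_score(q, K, W):
    return K @ (W @ q)                           # s_i = k_i^T W q

def mlp_score(q, K, W1, w2):
    qs = np.tile(q, (K.shape[0], 1))             # s_i = w2 . tanh(W1 [k_i; q])
    return np.tanh(np.concatenate([K, qs], axis=1) @ W1) @ w2

d = 8
q, K = np.random.randn(d), np.random.randn(5, d)
W = np.random.randn(d, d)
W1, w2 = np.random.randn(2 * d, d), np.random.randn(d)
for score in (dot_product_score(q, K), bilinear_score(q, K, W), mlp_score(q, K, W1, w2)):
    weights = np.exp(score) / np.exp(score).sum()   # softmax over the 5 keys
    print(weights)
```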

# char-based and subword-based models

namesake (Markov 1906)

Sutskever et al. 2011

Mikolov et al. 2012, Sennrich et al. 2015

# char-aware models

char and word level (Kang et al. 2011) (Ling et al. 2015) (Kim et al. 2016)

unsupervised morphology with log-bilinear model (Botha and Blunsom 2014) (Jozefowicz et al. 2016)

# open-vocabulary hybrid models

Brown et al. 1992

Chung et al. 2016

Luong and Manning 2016

# mixture model generation

combine count based and neural lm (Neubig and Dyer 2016)

char based and word based to translate text and code (Ling et al. 2016)

copy when UNK (Merity et al. 2016)

# Misc theories

interpretable ML NIPS 2017

# level of NLP research

  • Phonetics and phonology

  • Morphology

  • Syntax

  • Lexical semantics

  • Compositional semantics

  • Pragmatics

  • Discourse

# Vision-and-Language Navigation


a discriminator that evaluates how well an instruction explains a given path in the VLN task, using multi-modal alignment

dataset: room to room

# Chomsky hierarchy of languages and grammars

parsing technique

type 0: recursively enumerable (unrestricted grammars)

type 1: context-sensitive grammars

type 2: context-free grammars

type 3: regular grammars

# Linguistics

heavy NP Shift

verb + NP + PP (temporal adjunct): if the NP is heavy enough, the PP is shifted to directly after the verb (i.e., the heavy NP moves to the end)

Phrasal verbs and particle shift

give the habit up / give up the habit

dative alternation

give the woman the object / give the object to the woman

double object / prepositional object

Genitive alternation

the woman's house / the house of the woman

# Transition based dependency parsing

  • configuration

    • stack,
    • input buffer,
    • relations
  • arc standard

    • leftArc
    • rightArc
    • shift
  • high accuracy on shorter dependency relations, but accuracy declines significantly as the distance between head and dependent increases.
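A toy simulation of the arc-standard system above: a stack, an input buffer, and the three transitions, applied to a hand-written transition sequence. The example sentence and sequence are illustrative, not drawn from a treebank.

```python
def parse(words, transitions):
    stack, buffer, arcs = ["ROOT"], list(words), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))          # move next input word onto the stack
        elif t == "LEFTARC":                     # head = top of stack, dependent = second
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif t == "RIGHTARC":                    # head = second, dependent = top of stack
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

words = ["book", "me", "the", "flight"]
seq = ["SHIFT", "SHIFT", "RIGHTARC",             # book -> me
       "SHIFT", "SHIFT", "LEFTARC",              # flight -> the
       "RIGHTARC",                               # book -> flight
       "RIGHTARC"]                               # ROOT -> book
print(parse(words, seq))
```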

# Training oracle

  1. generate training data by simulating parsing against a given reference parse

one change to the rule: choose RIGHTARC only if it produces a correct relation given the reference parse and all dependents of the word on top of the stack have already been assigned

  2. features

combine simple features using feature templates of the form location.property

  • s: stack
  • b: buffer
  • w: word forms
  • r: set of relations
  • l: lemmas
  • t: part-of-speech
  3. learning

# Universal Dependency Set

  • clausal relations, which describe syntactic roles with respect to a predicate (often a verb)

  • modifier relations, which categorize the ways that words can modify their heads.

| Clausal Argument Relations | Description |
|---|---|
| NSUBJ | Nominal subject |
| DOBJ | Direct object |
| IOBJ | Indirect object |
| CCOMP | Clausal complement |
| XCOMP | Open clausal complement |

| Nominal Modifier Relations | Description |
|---|---|
| NMOD | Nominal modifier |
| AMOD | Adjective modifier |
| NUMMOD | Numeric modifier |
| APPOS | Appositional modifier |
| DET | Determiner |
| CASE | Prepositions, postpositions and other case markers |

| Others | Description |
|---|---|
| CONJ | Conjunct |
| CC | Coordinating conjunction |

# NLP Talks

# NLP and Text as Data Speaker Series


# Towards Understanding Deep Learning for Natural Language Processing - Omer levy


# What are Neural Sequence Models Doing - kevin knight


# Structured Embedding Models for Language Variation - Maja Rudolph



# Unsupervised Models of Conversational Dynamics - Justine Zhang


# End-to-end Learning for Broad Coverage Semantics - Luke Zettlemoyer


# What Can Neural Networks Teach us about Language?


# Unsupervised Methods for Extracting Political Positions from Text - Ben Lauderdale


# Future Philologies - Digital Directions in Ancient World Text

Institute for the Study of the Ancient World, the ISAW Library, NYU Center for Humanities, NYU Division of Libraries, NYU Center for Ancient Studies, and NYU Department of Classics


# Kyle P. Johnson (Accenture), on historical text and natural language processing, "The Next 700 Classical Languages"


# Historical text and information retrieval, "Authorship and Translation: Bilingual Modeling of the Patrologia Graeca" - David Mimno and Laure Thompson


# Historical text and machine learning, "Viral Texts and Networked Authors: Computational Models of Information Propagation" - David Smith


# CoLA

Neural Network Acceptability Judgments

includes generative grammars of the type described by Chomsky (1957)

examples isolate a violation, so they tend to be unacceptable for a single identifiable reason.

the artificial learner must not be exposed to any knowledge of language that could not plausibly be part of the input to a human learner

100k most frequent words in the British National Corpus

in-domain: training set (8551 examples), a development set (527), and a test set (530), 17 sources

out-of-domain development set (516) and test set (533), 6 sources

human 86.1%

aggregate rating, yielding an average agreement of 93%

The encoders are trained on 100-200 million tokens, which is within a factor of ten of the number of tokens human learners are exposed to during language acquisition

Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies

subject-verb number agreement

highly sensitive to recent but irrelevant nouns

agreement attractors

one-hot encoded input, LSTM with 50 units: 0.83% errors

nouns-only baseline: 4.5%


distance between nouns and verbs

attractor: last intervening nouns

count of attractors
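A toy illustration of what counts as an attractor here: an intervening noun between the subject and the verb whose number differs from the subject's. The token format (word, tag, number) is made up for the example.

```python
tokens = [("keys", "NOUN", "pl"), ("to", "ADP", None), ("the", "DET", None),
          ("cabinet", "NOUN", "sg"), ("are", "VERB", "pl")]
subject_idx, verb_idx = 0, 4
subject_number = tokens[subject_idx][2]
# attractors: intervening nouns whose number disagrees with the subject
attractors = [w for w, tag, num in tokens[subject_idx + 1:verb_idx]
              if tag == "NOUN" and num is not None and num != subject_number]
print(len(attractors), attractors)   # 1 ['cabinet']
```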

relative clauses

word representation learned the concept of singular and plural from scratch

word-by-word activation visualization

training obj

verb inflection: the verb's singular form is given before the verb (this gives a cue to the syntactic clause boundary)

grammaticality judgment (humans rarely receive ungrammatical signals)


training only on hard dataset (with attractors)

Agreement attraction errors in humans are much more common when the attractor is plural than when it is singular


noun-noun compounds

verbs that look like plural nouns are mistaken for nouns: The ship that the player drives has a very high speed.

limitation of the number prediction task, which jointly evaluates the model’s ability to identify the subject and its ability to assign the correct number to noun phrases

# Structural Supervision Improves Learning of Non-Local Grammatical Dependencies

Negative licensors: not, none

Negative Polarity Items: any, ever

filler-gap dependency


word-synchronous beam search (Stern et al., 2017)

Wh-Licensing Interaction

Poverty of the Stimulus argument.

purely data-driven learning is not powerful enough to explain the richness and uniformity of human grammars



# services, corps


knowledge embedding TransE

masked language model and next sentence prediction as pre-training objectives


write with transformer