# Paper reading notes

## pix2code: Generating Code from a Graphical User Interface Screenshot

similar to an image captioning task

easy to create a lot of training data

uses a DSL as a smaller hypothesis space; DSL tokens are one-hot encoded
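A minimal sketch of the one-hot DSL encoding (the token set below is illustrative, not pix2code's actual vocabulary):

```python
# Sketch: one-hot encoding tokens of a small GUI DSL.
# The token list is a made-up illustration of a DSL vocabulary.
DSL_TOKENS = ["<START>", "<END>", "stack", "row", "btn", "label", "{", "}"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(DSL_TOKENS)}

def one_hot(token):
    """Return a one-hot vector for a DSL token."""
    vec = [0.0] * len(DSL_TOKENS)
    vec[TOKEN_TO_ID[token]] = 1.0
    return vec

# Encode a short token sequence as a list of one-hot vectors.
encoded = [one_hot(t) for t in ["stack", "{", "btn", "}"]]
```

A small vocabulary keeps the output softmax tiny compared to generating raw source code token by token.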

## Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data

teachers trained on disjoint private data

a student learns to predict labels produced by noisy voting among the teachers

models overfit and memorize training data; attackers can hill-climb on output probabilities to reveal real faces from the dataset

strengthen this ensemble by limiting the number of teacher votes released, and pick only the topmost vote after adding random noise
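The noisy aggregation step can be sketched as follows (gamma is the inverse noise scale; the vote data are illustrative):

```python
import numpy as np

def noisy_max_label(teacher_votes, num_classes, gamma, rng):
    """Aggregate teacher predictions: count votes per class, add Laplace
    noise of scale 1/gamma to each count, and release only the top noisy class."""
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(loc=0.0, scale=1.0 / gamma, size=num_classes)
    return int(np.argmax(counts))

rng = np.random.default_rng(0)
votes = np.array([2, 2, 2, 1, 2, 0, 2, 2, 1, 2])  # 10 teachers, 3 classes
label = noisy_max_label(votes, num_classes=3, gamma=0.5, rng=rng)
```

Only the noisy argmax is released, never the raw counts, which is what bounds the privacy loss per query.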

differential privacy

privacy loss

moments accountant

## Making Neural Programming Architectures Generalize via Recursion

$$ s_t = f_{enc}(e_t, a_t) $$

$$ h_t = f_{lstm}(s_t, p_t, h_{t-1}) $$

$$ r_t = f_{end}(h_t), \quad p_{t+1} = f_{prog}(h_t), \quad a_{t+1} = f_{arg}(h_t) $$

- $t$: time step
- $s_t$: state
- $f_{enc}$: domain-specific encoder
- $e_t$: environment slice
- $a_t$: environment arguments
- $f_{lstm}$: core module
- $h_t$: hidden LSTM state
- $r_t$: decoded return probability
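A shape-level sketch of one such time step, with random affine maps standing in for the learned modules (all sizes, and the tanh/sigmoid choices, are illustrative; the real core is an LSTM):

```python
import numpy as np

rng = np.random.default_rng(0)
E, A, S, P, H = 10, 4, 16, 8, 32  # illustrative sizes for e_t, a_t, s_t, p_t, h_t

# Random weights as stand-ins for trained modules (shape checking only).
W_enc = 0.1 * rng.normal(size=(S, E + A))
W_lstm = 0.1 * rng.normal(size=(H, S + P + H))
w_end = 0.1 * rng.normal(size=H)
W_prog = 0.1 * rng.normal(size=(P, H))
W_arg = 0.1 * rng.normal(size=(A, H))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def npi_step(e_t, a_t, p_t, h_prev):
    s_t = np.tanh(W_enc @ np.concatenate([e_t, a_t]))           # f_enc
    h_t = np.tanh(W_lstm @ np.concatenate([s_t, p_t, h_prev]))  # f_lstm stand-in
    r_t = sigmoid(w_end @ h_t)                                  # f_end: return probability
    p_next = W_prog @ h_t                                       # f_prog: next program embedding
    a_next = W_arg @ h_t                                        # f_arg: next arguments
    return r_t, p_next, a_next, h_t

r, p1, a1, h = npi_step(rng.normal(size=E), rng.normal(size=A),
                        rng.normal(size=P), np.zeros(H))
```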

## Understanding Deep Learning Requires Rethinking Generalization

regularization does not stop overfitting data with random labels

SGD has implicit regularization

capacity control

matrix factorization

parameters memorize the datasets

optimization on random labels

models can even fit inputs of pure Gaussian noise
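The memorization point can be made concrete with a linear toy case: with more parameters than samples, even random labels on Gaussian-noise inputs are fit perfectly (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200  # fewer samples than parameters
X = rng.normal(size=(n, d))                  # pure Gaussian-noise inputs
y = rng.integers(0, 2, size=n) * 2.0 - 1.0   # random +/-1 labels

# Minimum-norm least-squares fit: with d > n the system is underdetermined,
# so the random training labels can be interpolated exactly.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_error = np.mean(np.sign(X @ w) != y)
```

Zero training error here says nothing about generalization, which is the paper's point about capacity.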

## Using Morphological Knowledge in Open-Vocabulary Neural Language Models

a mixture model of:

- a word model with a finite vocabulary
- a char-based model, handling UNK tokens and subword structure
- a morph model with max pooling over multiple morpheme vectors, marginalizing over multiple sequences of abstract morphemes

$$ p(w_i \vert h_i) = \sum_{m_i = 1}^M p(w_i, m_i \vert h_i) = \sum_{m_i=1}^M p(m_i \vert h_i)\, p(w_i \vert h_i, m_i) $$

as seen above, the component models are weighted to compute the probability of the next word

three word embeddings are concatenated and used as inputs; the char embedding captures inflection (e.g. the Russian dative Obame from Obama); morpheme embeddings are max-pooled when different analyses are found (e.g. does can be decomposed as do+3rd-person+singular or doe+plural)
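The marginalization above can be sketched numerically; the probabilities below are made up, and the does = do+3sg / doe+PL decomposition is the example from the note:

```python
# Sketch: marginalizing over morphological analyses m of a word w given history h.
# All probability values are illustrative, not learned.
analyses = {
    "do+3sg": {"p_m_given_h": 0.6, "p_w_given_h_m": 0.10},
    "doe+PL": {"p_m_given_h": 0.4, "p_w_given_h_m": 0.05},
}

# p(w | h) = sum_m p(m | h) * p(w | h, m)
p_word = sum(a["p_m_given_h"] * a["p_w_given_h_m"] for a in analyses.values())
```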

### morphological disambiguation

## Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars

constants: map entities, numbers, or functions to a value

logical connectives:

- conjunction
- disjunction
- negation
- implication

quantification

- universal, any
- existential
- count
- argmax
- argmin
- definite (the ι operator): returns the unique entity for which the lambda expression is true

lambda expressions

### CCG (combinatory categorial grammar)

functional application rules: `A/B B => A` (forward) and `B A\B => A` (backward)

functional application rules with semantics
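A minimal sketch of application with paired semantics, representing categories as nested tuples and semantics as Python lambdas (a toy illustration, not the paper's parser):

```python
def fapp(fn, arg):
    """Forward application: A/B : f  with  B : x  =>  A : f(x)."""
    (slash, res, argcat), sem_f = fn
    cat_a, sem_a = arg
    if slash == "/" and argcat == cat_a:
        return res, sem_f(sem_a)
    return None

def bapp(arg, fn):
    """Backward application: B : x  with  A\\B : f  =>  A : f(x)."""
    cat_a, sem_a = arg
    (slash, res, argcat), sem_f = fn
    if slash == "\\" and argcat == cat_a:
        return res, sem_f(sem_a)
    return None

NP, S = "NP", "S"
# "likes" has category (S\NP)/NP; its semantics first takes the object.
likes = (("/", ("\\", S, NP), NP),
         lambda obj: lambda subj: ("likes", subj, obj))

vp = fapp(likes, (NP, "mary"))     # forward: (S\NP, likes(mary))
sent = bapp((NP, "john"), vp)      # backward: (S, likes(mary)(john))
```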

## MEASURING THE INTRINSIC DIMENSION OF OBJECTIVE LANDSCAPES

- $\theta^{(D)} = \theta_0^{(D)} + P\,\theta^{(d)}$
- invariant to model width and depth for fully connected layers
- 90% of baseline performance as the threshold for intrinsic dimension
- for RL tasks, performance is defined as the max attained mean evaluation reward
- intrinsic dimension ~750 for MNIST; invariant for fc layers, varies for convolutional layers (measured on shuffled labels)
- dense / sparse / Fastfood random matrices
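A toy illustration of the subspace trick: optimize only $d$ coordinates of a $D$-parameter objective through a fixed random projection (the quadratic loss and all sizes are made up for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 1000, 20  # full and "intrinsic" dimensions (illustrative)

theta_0 = rng.normal(size=D)                 # frozen random initialization
P = rng.normal(size=(D, d)) / np.sqrt(D)     # fixed random projection matrix
target = rng.normal(size=D)                  # toy optimum in the full space

def loss_grad(theta_d):
    """Quadratic toy loss in the full space; gradient taken in the subspace."""
    theta_D = theta_0 + P @ theta_d          # theta^(D) = theta_0^(D) + P theta^(d)
    resid = theta_D - target
    return 0.5 * resid @ resid, P.T @ resid

theta_d = np.zeros(d)
for _ in range(200):                         # plain gradient descent on d coords
    loss, grad = loss_grad(theta_d)
    theta_d -= 0.05 * grad
```

Only `theta_d` is trained; `theta_0` and `P` never change, which is what makes d a measure of how many directions the task really needs.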

## ORDERED NEURONS

master forget and input gates are ideally binary step sequences (a block of 0s followed by 1s), made trainable via a cumulative softmax (cumax) relaxation

the overlap of the forget and input gates reveals hierarchical structure
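The cumax operation itself is tiny: softmax followed by a cumulative sum, yielding a monotone soft gate in [0, 1]. A minimal sketch (the input logits are arbitrary):

```python
import numpy as np

def cumax(logits):
    """Cumulative softmax: a soft, monotonically non-decreasing gate in [0, 1]
    that relaxes the ideal binary 0...01...1 master-gate pattern."""
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return np.cumsum(probs)

g = cumax(np.array([0.1, 2.0, -1.0, 0.5]))
```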

## FeUdal Networks

1. transition policy gradient
2. directional cosine-similarity rewards
3. goals specified with respect to a learned representation
4. dilated RNN
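The directional reward in (2) can be sketched as a single-step cosine similarity between the change in the (learned) latent state and the manager's goal direction; the vectors below are made up:

```python
import numpy as np

def intrinsic_reward(s_t, s_prev, goal):
    """Worker's directional reward: cosine similarity between the observed
    change in latent state and the manager's goal direction."""
    delta = s_t - s_prev
    denom = np.linalg.norm(delta) * np.linalg.norm(goal)
    if denom == 0.0:
        return 0.0  # no movement or zero goal: no directional signal
    return float(delta @ goal / denom)

r = intrinsic_reward(np.array([1.0, 1.0]), np.array([0.0, 0.0]),
                     np.array([2.0, 2.0]))
```

Because only the direction matters, the worker is rewarded for moving the right way in the latent space regardless of step size.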

## Siamese Network

learn features from a verification task (same or different image pairs), then do one-shot learning on new categories: classify a test image by picking the support example with maximum likelihood under the verification network
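A sketch of the one-shot step, with made-up 2-D embeddings and negative Euclidean distance standing in for the learned verification score:

```python
import numpy as np

def one_shot_classify(test_emb, support_embs):
    """Pick the support class whose embedding scores highest under a
    verification-style similarity (negative Euclidean distance here)."""
    scores = {label: -np.linalg.norm(test_emb - emb)
              for label, emb in support_embs.items()}
    return max(scores, key=scores.get)

# One labeled example per new class (the embeddings are illustrative).
support = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
pred = one_shot_classify(np.array([0.9, 0.1]), support)
```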

## UNDERSTANDING THE ASYMPTOTIC PERFORMANCE OF MODEL-BASED RL METHODS

optimal planning horizon can be over 100 steps

combining model-based and model-free

- augment a model-free method with a model for faster learning
- make up for the asymptotic deficiencies of a model-based method by transitioning to model-free

- action-conditional predictor
- plan-conditional predictor
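A toy sketch of planning with an action-conditional predictor: imagine H-step rollouts through the model and return the first action of the best imagined trajectory (the dynamics, reward, and action set are made up; exhaustive search like this is only feasible for tiny horizons, not the 100+ steps discussed above):

```python
import numpy as np

def model_step(state, action):
    """Toy action-conditional predictor: next state and reward (assumed dynamics:
    1-D state, reward for driving the state toward 0)."""
    next_state = state + action
    return next_state, -abs(next_state)

def plan(state, horizon, actions=(-1.0, 0.0, 1.0)):
    """Exhaustive H-step lookahead with the model; returns the first action
    of the highest-return imagined trajectory."""
    best_ret, best_a0 = -np.inf, actions[0]

    def rollout(s, t, ret, a0):
        nonlocal best_ret, best_a0
        if t == horizon:
            if ret > best_ret:
                best_ret, best_a0 = ret, a0
            return
        for a in actions:
            ns, r = model_step(s, a)
            rollout(ns, t + 1, ret + r, a if a0 is None else a0)

    rollout(state, 0, 0.0, None)
    return best_a0

a = plan(3.0, horizon=3)
```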