similar to an image captioning task
easy to create a lot of training data
use a DSL as a smaller hypothesis space; one-hot encode the DSL tags (see the sketch below)
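A minimal sketch of that tag encoding, assuming a toy DSL; the function list and helper names are made up for illustration, and tags are encoded here as a multi-hot vector over the tag set:

```python
import numpy as np

DSL_FUNCS = ["map", "filter", "sort", "reverse", "head", "sum"]  # toy DSL
IDX = {f: i for i, f in enumerate(DSL_FUNCS)}

def encode_program(funcs):
    """Multi-hot vector marking which DSL functions a program uses."""
    v = np.zeros(len(DSL_FUNCS), dtype=np.float32)
    for f in funcs:
        v[IDX[f]] = 1.0
    return v

# a model maps input/output examples to this tag vector, so search only
# needs to consider programs built from the predicted primitives
print(encode_program(["map", "sum"]))   # [1. 0. 0. 0. 0. 1.]
```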
teachers trained on disjoint private data
the student learns to predict from noisy voting by the teachers
models overfit and memorize training data; attackers can hill-climb on output probabilities to reveal real faces from the dataset
strengthen the ensemble by limiting the number of teacher votes and releasing only the top vote after adding random noise (see the sketch after these notes)
differential privacy
privacy loss
moments accountant
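A minimal sketch of that noisy-max aggregation; `noise_scale` stands in for the paper's privacy parameter, and the vote numbers are illustrative:

```python
import numpy as np

def noisy_max_label(teacher_votes, num_classes, noise_scale=1.0, rng=None):
    """teacher_votes: one predicted label per teacher. Returns only the
    class whose noise-perturbed vote count is largest."""
    rng = rng or np.random.default_rng()
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(scale=noise_scale, size=num_classes)  # random noise
    return int(np.argmax(counts))                               # top vote only

votes = [2, 2, 1, 2, 0, 2, 1, 2]   # labels from 8 teachers on one query
print(noisy_max_label(votes, num_classes=3))
```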
\(s_t = f_{enc}(e_t, a_t)\)
\(h_t = f_{lstm}(s_t, p_t, h_{t-1})\)
\(r_t = f_{end}(h_t),\quad p_{t+1} = f_{prog}(h_t),\quad a_{t+1} = f_{arg}(h_t)\)
$t$: time-step
$s_t$: state
$f_{enc}$: domain-specific encoder
$e_t$: environment slice
$a_t$: environment arguments
$f_{lstm}$: core module
$h_t$: hidden LSTM state
$f_{end}$: decode return probability $r_t$
$p_t$: program embedding
$f_{prog}$: decode next program key $p_{t+1}$
$f_{arg}$: decode next arguments $a_{t+1}$
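A minimal sketch of one time-step wired up from the equations above; the module shapes and the `NPIStep` name are assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class NPIStep(nn.Module):
    def __init__(self, env_dim, arg_dim, state_dim, prog_dim, hidden_dim, n_progs):
        super().__init__()
        self.f_enc = nn.Linear(env_dim + arg_dim, state_dim)         # f_enc
        self.f_lstm = nn.LSTMCell(state_dim + prog_dim, hidden_dim)  # core
        self.f_end = nn.Linear(hidden_dim, 1)                        # -> r_t
        self.f_prog = nn.Linear(hidden_dim, n_progs)                 # -> p_{t+1}
        self.f_arg = nn.Linear(hidden_dim, arg_dim)                  # -> a_{t+1}

    def forward(self, e_t, a_t, p_t, hc):
        s_t = torch.tanh(self.f_enc(torch.cat([e_t, a_t], dim=-1)))
        h_t, c_t = self.f_lstm(torch.cat([s_t, p_t], dim=-1), hc)
        r_t = torch.sigmoid(self.f_end(h_t))   # probability of returning
        p_next = self.f_prog(h_t)              # logits over program keys
        a_next = self.f_arg(h_t)               # next arguments
        return r_t, p_next, a_next, (h_t, c_t)

step = NPIStep(env_dim=8, arg_dim=4, state_dim=16, prog_dim=8,
               hidden_dim=32, n_progs=10)
hc = (torch.zeros(1, 32), torch.zeros(1, 32))   # initial LSTM state
r_t, p_next, a_next, hc = step(torch.zeros(1, 8), torch.zeros(1, 4),
                               torch.zeros(1, 8), hc)
```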
regularization does not stop models from fitting data with random labels
SGD has implicit regularization
capacity control
matrix factorization
the network's parameters memorize the dataset
optimization converges even on random labels
models can fit pure Gaussian noise inputs
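A minimal sketch of the random-label experiment on a toy stand-in (the paper uses image datasets and convnets): shuffle the labels so no signal remains, then watch training accuracy climb anyway.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))          # inputs could also be pure noise
y_rand = rng.integers(0, 2, size=500)   # random labels: nothing to learn

model = MLPClassifier(hidden_layer_sizes=(256,), max_iter=2000)
model.fit(X, y_rand)                    # memorization, not generalization
print("train acc on random labels:", model.score(X, y_rand))  # near 1.0
```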
mixture model, with:
word model with finite vocabulary
character-based model, handling UNK and subword structure
morpheme model with max pooling over multiple morpheme vectors, marginalizing over multiple sequences of abstract morphemes
\(p(w_i \vert h_i) = \sum_{m_i = 1}^M p(w_i, m_i \vert h_i) = \sum_{m_i=1}^M p(m_i \vert h_i)\, p(w_i \vert h_i, m_i)\)
As seen above, the component models are weighted to compute the probability of $w_i$. The model assumes that the selection of component $m_i$ is conditionally independent of all earlier selections, given the sequence of words generated so far.
the three word embeddings are concatenated and used as input; the character embedding captures inflected forms such as the Russian dative "Obame" for "Obama"; morpheme embeddings are max-pooled when multiple decompositions are found (say, "does" can be decomposed as do+3rd-person+singular, or doe+plural); see the sketch below
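A minimal sketch of the mixture computation from the equation above; the weights and per-model distributions are made-up numbers:

```python
import numpy as np

def mixture_word_prob(p_m_given_h, p_w_given_h_m):
    """p_m_given_h: (M,) mixture weights; p_w_given_h_m: (M, V) per-model
    distributions over the vocabulary. Returns (V,) vector p(w | h)."""
    return p_m_given_h @ p_w_given_h_m   # sum_m p(m|h) * p(w|h,m)

weights = np.array([0.5, 0.3, 0.2])      # word / char / morpheme models
dists = np.array([[0.7, 0.2, 0.1],       # each row sums to 1
                  [0.4, 0.4, 0.2],
                  [0.1, 0.6, 0.3]])
print(mixture_word_prob(weights, dists))  # valid distribution, sums to 1
```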
constants: map entities, numbers, or functions to a value
functional application rules: $a/b \;\; b \Rightarrow a$ and $b \;\; a\backslash b \Rightarrow a$
functional application rules with semantics: $a/b{:}f \;\; b{:}g \Rightarrow a{:}f(g)$ and $b{:}g \;\; a\backslash b{:}f \Rightarrow a{:}f(g)$ (see the sketch below)
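A minimal sketch of functional application with semantics; the categories and lambda terms are toy examples, and the naive category splitting only handles simple (non-nested) categories:

```python
def forward_apply(func, arg):
    """func = (a/b, f), arg = (b, g)  ->  (a, f(g))"""
    cat, f = func
    result, expected = cat.split("/", 1)
    assert expected == arg[0]            # argument category must match
    return result, f(arg[1])

def backward_apply(arg, func):
    """arg = (b, g), func = (a\\b, f)  ->  (a, f(g))"""
    cat, f = func
    result, expected = cat.split("\\", 1)
    assert expected == arg[0]
    return result, f(arg[1])

print(forward_apply(("NP/N", lambda x: f"the({x})"), ("N", "dog")))
print(backward_apply(("NP", "john"), ("S\\NP", lambda x: f"sleeps({x})")))
# ('NP', 'the(dog)')   ('S', 'sleeps(john)')
```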
forget gate and input gate take near-binary values determined by the input sequence; counting behavior is trainable (see the counter sketch below)
the overlap of forget-gate and input-gate activations reveals structure
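A minimal sketch of the counting mechanism: the gate values are hand-set to binary here (not learned) just to show how the cell state can act as a counter.

```python
def count_with_gates(tokens):
    c = 0.0                                       # LSTM cell state
    for tok in tokens:
        f_gate = 1.0                              # keep the running count
        i_gate = 1.0 if tok in "()" else 0.0      # write only on brackets
        g = 1.0 if tok == "(" else -1.0 if tok == ")" else 0.0
        c = f_gate * c + i_gate * g               # cell-state update
    return c

print(count_with_gates("((a)(b))"))  # 0.0: brackets are balanced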
(1) transition policy gradient;
(2) directional cosine-similarity rewards (see the sketch after this list);
(3) goals specified with respect to a learned representation;
(4) a dilated RNN.
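A minimal sketch of the directional cosine reward from (2): the low-level policy is rewarded for moving in latent space along the goal direction it was given. The vectors here are made up for illustration.

```python
import numpy as np

def cosine_reward(s_t, s_prev, g_prev, eps=1e-8):
    """Cosine similarity between the realized state change and the goal."""
    delta = s_t - s_prev
    return float(delta @ g_prev /
                 (np.linalg.norm(delta) * np.linalg.norm(g_prev) + eps))

s_prev, s_t = np.array([0.0, 0.0]), np.array([1.0, 0.2])
g = np.array([1.0, 0.0])                 # goal: move along the x direction
print(cosine_reward(s_t, s_prev, g))     # near 1.0 when movement aligns
```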
learn features from a verification task (same or different image pairs), then do one-shot learning on new categories by scoring the test image against one example per class and taking the maximum-likelihood match (see the sketch below)
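A minimal sketch of that one-shot step; `verify` is a placeholder for the trained same/different network, replaced here by a toy pixel-distance score:

```python
import numpy as np

def one_shot_classify(test_img, support_imgs, verify):
    """support_imgs: one example per novel class; verify(a, b) scores how
    likely a and b show the same class. Returns the best-matching class."""
    return int(np.argmax([verify(test_img, s) for s in support_imgs]))

# toy stand-in for the verification network: negative pixel distance
verify = lambda a, b: -np.linalg.norm(a - b)
support = [np.zeros(4), np.ones(4)]                 # one image per class
print(one_shot_classify(np.array([0.9, 1.0, 1.1, 1.0]), support, verify))  # 1
```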
optimal planning horizon can be over 100 steps
combining model-based and model-free
action-conditional predictor vs. plan-conditional predictor