Josherich's Blog


Machine Learning

recent list

Tools

botorch

graphVite: graph embedding system

ml implementation

https://paperswithcode.com

pytorch hub

Colaboratory

cvxpy

evolution tool

apex

disentanglement lib

Examples of Learning Tasks

some broad tasks

General Objective

Topics

Convex Optimization

Convexity

\[\lbrace \alpha x + (1-\alpha)y, 0 \le \alpha \le 1 \rbrace \subseteq \mathrm X.\]

convex form

level set

affine set

convex hull

convex combination

hyperplane

halfspace

feature spaces

Very large feature spaces have two potential issues:

Overfitting, which we handle with regularization.

Kernel methods can (sometimes) help with memory and computational costs.

Empirical Risk Minimizer

regularization

Constrained Empirical Risk Minimization (Ivanov)

Penalized Empirical Risk Minimization (Tikhonov)

Ridge Regression

Lasso Regression

Elastic Net

combine lasso and ridge penalties.

Regularization Paths

KNN

kd tree

Bayesian

Maximum Likelihood Estimation

Bayesian decision

Bayesian network

EM

Decision tree

Loss function for regression: given the partition $\lbrace R_1,\dots,R_M \rbrace$, the prediction function is \(f(x) = \sum_{m=1}^{M}c_m \mathbf{1}(x \in R_m)\)

for $\ell_2$ loss, the optimal $c_m$ is the mean of the targets in $R_m$

for $\ell_1$ loss, the optimal $c_m$ is the median

categorical feature

assign each category a number: the proportion of class 0 within that category

find optimal split as though it were a numeric feature
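A minimal numpy sketch of this trick for binary classification (helper names are mine; Gini impurity is an assumed split criterion):

import numpy as np

def gini(y):
    # Gini impurity of a binary label vector
    if len(y) == 0:
        return 0.0
    p = np.mean(y == 1)
    return 2 * p * (1 - p)

def best_categorical_split(x_cat, y):
    # x_cat: categorical feature values, y: binary labels in {0, 1}
    cats = np.unique(x_cat)
    prop0 = np.array([np.mean(y[x_cat == c] == 0) for c in cats])
    order = cats[np.argsort(prop0)]          # categories ordered by proportion of class 0
    best_subset, best_err = None, np.inf
    for k in range(1, len(order)):           # scan split points as if the feature were numeric
        left = np.isin(x_cat, order[:k])
        err = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
        if err < best_err:
            best_subset, best_err = set(order[:k]), err
    return best_subset, best_err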

entropy gain in tree splitting

empirical entropy

empirical conditional entropy

information gain is defined as:

\[g(D,A) = H(D) - H(D \vert A)\]

maximize g
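A short numpy sketch of these quantities, computing the empirical entropy H(D), the conditional entropy H(D|A), and the gain g(D, A) (helper names are mine):

import numpy as np

def entropy(labels):
    # empirical entropy H(D) from class counts
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    # g(D, A) = H(D) - H(D | A), with H(D | A) a weighted average over feature values
    h_d = entropy(labels)
    h_d_a = 0.0
    for v in np.unique(feature):
        mask = feature == v
        h_d_a += mask.mean() * entropy(labels[mask])
    return h_d - h_d_a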

logistic regression

ensemble learning

Boosting

AdaBoost

Boosting Tree

HMM

Conditional Random Field

Quadratic Problem

Subgradient descent

Kernel Methods

Definition

A method is kernelized if every feature vector ψ(x) only appears inside an inner product with another feature vector ψ(x′). In particular, this applies to both the optimization problem and the prediction function.

optimization

minibatch descent

SGD

Functional margin

\[yf(x)\]

Loss function

turn local minima to non-local deep or trivial, leave it to the reader to judge

SVM

\[\min_{w \in \mathbf{R}^{d},\, b \in \mathbf{R}} \frac{1}{2}\Vert w \Vert^{2} + \frac{c}{n}\sum^{n}_{i=1}\max\left( 0,1-y_i \left[ w^{\mathsf T} x_i + b \right] \right)\]
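A hedged sketch of minimizing this objective with subgradient descent (Pegasos-style; the learning rate and epoch count are arbitrary choices, not from the notes):

import numpy as np

def svm_subgradient(X, y, C=1.0, lr=0.01, epochs=100):
    # y in {-1, +1}; minimizes 0.5*||w||^2 + (C/n) * sum_i max(0, 1 - y_i (w^T x_i + b))
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                               # points with nonzero hinge loss
        grad_w = w - (C / n) * (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -(C / n) * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b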

Representer theorem

\[w^\ast = \sum_{i=1}^n \alpha_i x_i\]

The Kernel Function

The kernel function corresponding to $\psi$ is

\[k\left(x,x'\right) = \langle\psi(x),\psi(x')\rangle\]

where $\langle \cdot, \cdot \rangle$ is the inner product associated with $\mathbf{H}$.
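As an illustrative example (not from the notes), the RBF kernel computes such an inner product implicitly, and a kernelized predictor only ever needs $k(x_i, x)$:

import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2), an inner product <psi(x), psi(x')> in an implicit H
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_predict(alpha, b, X_train, X_new, gamma=1.0):
    # prediction of a kernelized model: f(x) = sum_i alpha_i k(x_i, x) + b
    return rbf_kernel(X_new, X_train, gamma) @ alpha + b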

What are the Benefits of Kernelization?

Kernelization

Kernelized SVM

complementary slackness

Similarity Scores

It is often useful to think of the kernel function as a similarity score. But this is not a mathematically precise statement.

Overfitting

MDPs

VC dimension

Boosting

Weak Learning

NMF

r(n+d) NP-hard reformulated

LDA

Gamma function Dirichlet distribution

structured prediction

HMM

Viterbi algorithm (dp with max)

MEMM

CRF

https://spaces.ac.cn/archives/4695/

Hamming loss

VC dimension

Vapnik–Chervonenkis growth function: the expressive power of a hypothesis space (dichotomies, shattering). Given a set S of examples and a concept class H, we say that S is shattered by H if for every A ⊆ S there exists some h ∈ H that labels all examples in A as positive and all examples in S \ A as negative.

The VC-dimension of H is the size of the largest set shattered by H.

shatter function

Given a set S of examples and a concept class H, let H[S] = {h ∩ S : h ∈ H}. That is, H[S] is the concept class H restricted to the set of points S. For integer n and class H, let $H[n] = \max_{\vert S \vert = n} \vert H[S] \vert$; this is called the growth function of H.
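A standard worked example for concreteness (not in the original notes): for one-dimensional threshold classifiers $H = \lbrace x \mapsto \mathbf{1}(x \geq a) : a \in \mathbb{R} \rbrace$,

\[H[n] = n + 1, \qquad \mathrm{VCdim}(H) = 1,\]

since a single point can be shattered, but for $x_1 < x_2$ the labeling $(+,-)$ is never realized.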

EM

PAC learnable

efficient properly sample complexity m >= poly(,,,)

Definition:

concept class C is weakly PAC-learnable if there exists a (weak) learning algorithm L and γ > 0 such that:

PAC-Bayes Hoeffding Bound

\[\mathbb{E}_u [L(w+u)]\leq \mathbb{E}_u [\hat{L}(w+u)] + \frac{KL(w+u||\pi) + \log \frac{1}{\delta}}{\eta} + \frac{\eta}{2n}\]

Hessian Lipschitz

\[\forall w_1, w_2, \|\nabla^2 f(w_1) - \nabla^2 f(w_2)\|\leq \rho \|w_1-w_2\|\]

identifying-generalization-properties-in-neural-networks

\[\sigma_i \approx \frac{1}{\sqrt{\nabla^2_{i,i} \hat{L}+ \rho N_{\gamma, \epsilon}(w_i)} },\]

$N_{\gamma, \epsilon}(w_i)=\gamma |w_i| + \epsilon$

max entropy

uncertainty should be equally distributed conditional entropy

Hoeffding’s inequality

GAN

cycleGAN

pixel to pixel

WGAN

VAE

ResNet

DenseNet

lambda calculus

group

Bayesian decision

Bayesian action

James-Stein Estimator

shrinkage

\[\left( 1 - \frac{N - 2}{\Vert z \Vert^2}\right)\]
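A minimal numpy sketch of the resulting shrinkage estimator, assuming unit-variance observations $z \sim N(\mu, I_N)$:

import numpy as np

def james_stein(z):
    # shrink the MLE z toward 0 by the factor (1 - (N - 2) / ||z||^2)
    n = z.shape[0]
    factor = 1.0 - (n - 2) / np.sum(z ** 2)
    return factor * z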

Tweedie’s formula

Chi-squared distribution

exponential family

\[h(x) = \exp(\eta x - \psi(\eta))\, h_0(x)\]

$\eta$ : natural parameter

$\psi$ : cumulant generating function

conjugate

maximum a posteriori

frequentist statistics

credible set

evaluation

the training/dev/test split is based on the i.i.d. assumption

\[P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P+R} \;\Leftrightarrow\; \frac{1}{F_1} = \frac 1 2 \left( \frac 1 P + \frac 1 R \right), \qquad \frac{1}{F_\beta} = \frac{1}{1+\beta^2} \left( \frac{1}{P} + \frac{\beta^2}{R} \right)\]
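The same metrics in code (plain-Python sketch taking TP/FP/FN counts as inputs):

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    # harmonic mean: 1/F1 = (1/2)(1/P + 1/R)
    return 2 * p * r / (p + r)

def f_beta(p, r, beta):
    # 1/F_beta = (1/(1+beta^2)) * (1/P + beta^2/R)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)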

ReLU

maxout

sigmoid

tanh

RBF network

AI Conference

ICML

NIPS

COLT

ECML

ACML

correlation and independence

linear independent

a set of m vectors is linearly dependent if there exist m scalar coefficients, not all equal to zero, such that \(\sum_{i=1}^m \alpha_i x_i = 0\); it is linearly independent otherwise
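A quick numeric check (numpy, illustrative): vectors stacked as columns are linearly independent iff the matrix has full column rank.

import numpy as np

def linearly_independent(vectors):
    # vectors: list of 1-D arrays; independent iff rank equals the number of vectors
    A = np.column_stack(vectors)
    return np.linalg.matrix_rank(A) == A.shape[1]

print(linearly_independent([np.array([1., 0.]), np.array([0., 1.])]))   # True
print(linearly_independent([np.array([1., 2.]), np.array([2., 4.])]))   # False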

span

all possible linear combinations of a set of vectors

the span of any set of vectors in $\mathcal V$ is a subspace of $\mathcal V$

basis

inner product

matrix inner product

function inner product

dot product

symmetric

inner product norm

sample variance

sample standard deviation

correlation coefficient

Hölder’s inequality

orthogonality

gram-schmidt

orthogonal projection

closest

rank

dim(col(A)) = dim(row(A))

trace

sum of diagonal elements

linear map

maps vectors from vector space $\mathcal V$ to vector space $\mathcal R$

$A\vec{x}$ is a linear combination of the columns of $A$

a linear map can be represented by a matrix

adjoint

the adjoint of a linear map satisfies

\[\langle f(\vec{x}), \vec{y} \rangle_{\mathcal R} = \langle \vec{x}, f^{\ast}(\vec{y}) \rangle_{\mathcal V}\]

conjugate transpose

\[A^{\ast}_{ij} = \overline{A_{ji}}, \quad 1 \leq i \leq n,\; 1 \leq j \leq m.\]

range

\[range(f) = \lbrace \vec{y} \vert \vec{y} = f(\vec{x}) \text{, for some } \vec{x} \in \mathcal V \rbrace\]

the range of a matrix is the range of its associated linear map

Null space

the set of vectors that are mapped to zero

null space of a linear map is a subspace

null space is perpendicular to row space \(null(A) = row(A)^{\perp}\)

column space

if the column spaces of two matrices are orthogonal, then their inner product is 0

\[\langle A, B \rangle\]

the range is the column space

\[range(A) = col(A)\]

row space

orthogonal matrices

orthogonal matrix is a square matrix s.t.

\[U^T U = U U^T = I\]

the columns form an orthonormal basis

orthogonal matrices change the direction of vectors, not their magnitude
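A quick numeric illustration with a 2D rotation matrix (numpy):

import numpy as np

theta = 0.3
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a rotation, hence orthogonal

x = np.array([3.0, 4.0])
print(np.allclose(U.T @ U, np.eye(2)))            # U^T U = I
print(np.linalg.norm(U @ x), np.linalg.norm(x))   # magnitude preserved: both 5.0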

Variational Autoencoder

sample implementation

reparametrization trick

Mean-field variational inference

Encoder

in the neural net world, the encoder is a neural network that outputs a representation $z$ of data $x$. In probability model terms, the inference network parametrizes the approximate posterior of the latent variables $z$. The inference network outputs parameters to the distribution $q(z \vert x)$. (credit)

Decoder

in deep learning, the decoder is a neural net that learns to reconstruct the data $x$ given a representation $z$. In terms of probability models, the likelihood of the data $x$ given latent variables $z$ is parametrized by a generative network. The generative network outputs parameters to the likelihood distribution $p(x \vert z)$. (credit)

Inference

in neural nets, inference usually means prediction of latent representations given new, never-before-seen datapoints. In probability models, inference refers to inferring the values of latent variables given observed data. (credit)
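A minimal PyTorch sketch of this encoder/decoder pair with the reparametrization trick (layer sizes are arbitrary; this is not the linked sample implementation):

import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)    # parameters of q(z | x)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps     # reparametrization trick
        return self.dec(z), mu, logvar             # dec(z) parametrizes p(x | z)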

Optimal Transportation

Optimal Mass Transportation Map

Brenier potential function

Gradient Projection

Monge–Ampère equation

divergence

homeomorphism

isomorphism

Homotopy

Graphs and Networks

Detailed Syllabi for lectures:

Jan 25: Introduction to graph theory, approximation algorithm, Max-Cut approximation. Chapter 8 on Lecture Notes.

Feb 01: Max-Cut approximation. Lifting / SDP relaxations technique in mathematical signal processing, phase retrieval and k-means SDP.

Feb 08: Unique Games Conjecture, Sum-of-Squares interpretation of SDP relaxation. Chapter 8 of Lecture Notes.

Feb 15: Shannon Capacity, Lovasz Theta Function. Section 7.3.1. on Lecture Notes and “On the Shannon Capacity of a Graph” by Laszlo Lovasz. See also Section 6.5.3.

Feb 22: Stochastic Block Model and Phase Transitions on graphs. Chapter 9 of Lecture Notes

Mar 01: Recovery in the Stochastic Block Model with Semidefinite relaxations. Chapter 9 of Lecture Notes

Detailed Syllabi for labs:

Jan 24: review of linear algebra and probability

Jan 31: discussion of homework 1

Feb 07: graph Laplacian and Cheeger’s inequality

Feb 14: pseudo distribution for maxcut, derivation of primal and dual program for Maxcut, SOS4

Feb 21: introduction to the Grothendieck inequality and a proof of an upper bound on the Grothendieck constant (Krivine’s bound)

Feb 28: calculate the Lovasz theta function for n-cycle and discuss connection with Grothendieck constant on graph

Clique

A clique of a graph G is a subset S of its nodes such that the subgraph corresponding to it is complete. In other words S is a clique if all pairs of vertices in S share an edge. The clique number c(G) of G is the size of the largest clique of G.

Independent set

An independent set of a graph G is a subset S of its nodes such that no two nodes in S share an edge. Equivalently, it is a clique of the complement graph Gc := (V, Ec). The independence number of G is simply the clique number of Gc.

Ramsey number

A natural question is whether it is possible to have arbitrarily large graphs without cliques (and without their complements having cliques). Ramsey answered this question in the negative in 1928 [Ram28].

It is easy to show that R(3) = 6

high probability

We say an event happens with high probability if its probability is ≥ 1 − n^{−Ω(1)}

Spencer 94 about Ramsey number

“Erdos asks us to imagine an alien force, vastly more powerful than us, landing on Earth and demanding the value of R(5) or they will destroy our planet. In that case, he claims, we should marshal all our computers and all our mathematicians and attempt to find the value. But suppose, instead, that they ask for R(6). In that case, he believes, we should attempt to destroy the aliens.”

Erdos-Renyi graph

Given n and p, the random Erdos-Renyi graph G(n,p) is a random graph on n vertices where each possible edge appears, independently, with probability p.

theorem lower bound of R(r)

For every r >= 2, \(R(r) \geq 2^{\frac{r-1}{2}}\)

Probabilistic method on graph

\[\mathbb{E} \lbrack X(S) \rbrack = \Pr \lbrace S \text{ is a clique or independent set} \rbrace = \frac{2}{2^{\binom{\vert S \vert}{2}}}\]
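Summing this expectation over all $\binom{n}{r}$ subsets of size $r$ gives the existence argument behind the lower bound above:

\[\mathbb{E}\big[\#\lbrace S : \vert S \vert = r,\ S \text{ a clique or independent set}\rbrace\big] = \binom{n}{r}\, 2^{\,1-\binom{r}{2}} \leq \frac{2}{r!} < 1 \quad \text{for } n \leq 2^{\frac{r-1}{2}},\ r \geq 3,\]

so some graph on $n$ vertices has no clique or independent set of size $r$, i.e. $R(r) > n$.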

best known lower bound

Erdos-Hajnal Conjecture

For any finite graph H, there exists a constant $\delta_H > 0$ such that any graph on n nodes that does not contain H as a subgraph (i.e., is H-free) must have

\[r(G) \geq n^{\delta_H}\]

max-cut problem

the goal is to design polynomial-time algorithms that, on any instance, produce guaranteed approximate solutions.

Given a graph G = (V, E) with non-negative weights wij on the edges, find a set S ⊂ V for which cut(S) is maximal.

Goemans and Williamson [GW95] introduced an approximation algorithm that runs in polynomial time and has a randomized component to it, and is able to obtain a cut whose expected value is guaranteed to be no smaller than a particular constant $\alpha_{GW}$ times the optimum cut. The constant $\alpha_{GW}$ is referred to as the approximation ratio.
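A hedged sketch of the Goemans–Williamson algorithm using cvxpy (listed under tools above): solve the SDP relaxation, factor the solution, and round with a random hyperplane. In expectation the cut is at least $\alpha_{GW} \approx 0.878$ times the optimum.

import numpy as np
import cvxpy as cp

def goemans_williamson(W):
    # W: symmetric nonnegative weight matrix of the graph
    n = W.shape[0]
    X = cp.Variable((n, n), PSD=True)
    # maximize (1/4) * sum_ij w_ij (1 - X_ij) subject to X_ii = 1
    objective = cp.Maximize(0.25 * cp.sum(cp.multiply(W, 1 - X)))
    prob = cp.Problem(objective, [cp.diag(X) == 1])
    prob.solve()
    # factor X = V V^T and round with a random hyperplane
    vals, vecs = np.linalg.eigh(X.value)
    V = vecs @ np.diag(np.sqrt(np.maximum(vals, 0)))
    r = np.random.randn(n)
    return np.sign(V @ r)   # +1 / -1 labels define the cut S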

cut

a cut is a partition of the vertices of a graph into two disjoint subsets. Any cut determines a cut-set, the set of edges that have one endpoint in each subset of the partition.

Unique Game Problem

Given a graph and a set of k colors, and, for each edge, a matching between the colors, the goal in the unique games problem is to color the vertices as to agree with as high of a fraction of the edge matchings as possible.

conjecture

For any ε > 0, the problem of distinguishing whether an instance of the Unique Games Problem is such that it is possible to agree with a ≥ 1 − ε fraction of the constraints or it is not possible to even agree with a ε fraction of them, is NP-hard.

Probabilistic Graphic Model

two types:

Markov random field

an undirected graphical model, or Markov random field, represents a distribution $P(X_1, \dots, X_n)$ defined by an undirected graph H and a set of positive potential functions $\Psi_c$ associated with the cliques of H, s.t.

\[P(x_1, \dots, x_n) = \frac{1}{\mathbf Z} \prod_{c \in C}\Psi_c(\mathbf X_c)\]

where Z is known as the partition function
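For a tiny discrete MRF, Z can be computed by brute-force enumeration; an illustrative sketch with binary variables and pairwise potentials:

import itertools
import math

def partition_function(n_vars, cliques, potentials):
    # Z = sum over all configurations of the product of clique potentials
    Z = 0.0
    for x in itertools.product([0, 1], repeat=n_vars):
        score = 1.0
        for c, psi in zip(cliques, potentials):
            score *= psi(*[x[i] for i in c])
        Z += score
    return Z

# example: a 3-node chain with an "agreement" potential on each edge
agree = lambda a, b: math.exp(1.0) if a == b else 1.0
Z = partition_function(3, [(0, 1), (1, 2)], [agree, agree])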

Gibbs distribution

potential function

contingency function of its arguments assigning “pre-probabilistic” score of their joint configuration.

Lagrangian multiplier

Neural Network

Elman Network

LSTM

GRU

An Elman network is what is now usually just called an RNN (including LSTM, GRU, etc.): the output of a recurrent layer, after a time delay, becomes part of that layer’s input at the next time step, while the same output is also passed on to the subsequent layers of the network, e.g. the final output layer. A Jordan network instead feeds the network’s final output (i.e. the output of the output layer), after a time delay, back into the network’s input layer.

Clockwork RNN

The hidden layer is divided into g modules, each of size k, and each module is fully connected internally. A recurrent connection from module j to module i exists only when Ti is smaller than Tj. Modules are grouped by increasing period; between modules, information in the hidden layer flows from right to left, i.e. from the slow modules to the fast ones.

The main difference between the CW-RNN and a plain RNN is that at each time step t, a module i is executed and produces output only if t % Ti = 0. The Ti can be arbitrary; this paper uses Ti = 2^(i-1).
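A sketch of just the update schedule (not a full CW-RNN implementation): at step t, module i runs iff t % Ti == 0, with Ti = 2^(i-1).

def active_modules(t, g):
    # periods T_i = 2^(i-1) for modules i = 1..g; module i updates only when t % T_i == 0
    return [i for i in range(1, g + 1) if t % (2 ** (i - 1)) == 0]

for t in range(1, 9):
    print(t, active_modules(t, 4))
# fast modules (small T_i) run every step; slow ones only on steps divisible by their period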

F test and t test

Restricted and unrestricted models: to reject the null hypothesis, use degrees of freedom df1 and df2 and critical level 0.05 to look up the critical value; the F statistic is based on (SSR_r − SSR_ur) / SSR_ur.

Pearson correlation

our world in data

https://ourworldindata.org/

Misc

pdf and pmf

for the negative log-likelihood loss, ERM and MLE are equivalent (convex optimization review)

t distribution, χ² distribution, F distribution

PMI

\[\text{PMI}(w_1,w_2)=\log \frac{P(w_1,w_2)}{P(w_1)P(w_2)}\]
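A small sketch computing PMI from raw co-occurrence counts (helper names are mine):

import math

def pmi(pair_counts, word_counts, total_pairs, total_words):
    # PMI(w1, w2) = log P(w1, w2) / (P(w1) P(w2))
    scores = {}
    for (w1, w2), c in pair_counts.items():
        p_joint = c / total_pairs
        p1, p2 = word_counts[w1] / total_words, word_counts[w2] / total_words
        scores[(w1, w2)] = math.log(p_joint / (p1 * p2))
    return scores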

supervised Imitation learning lower bound

At time t=1 you’re shown a picture of either a zero or a one. You have two possible actions: press a button marked “zero” or press a button marked “one.” The “correct” thing to do at t=1 is to press the button that corresponds to the image you’ve been shown. Pressing the correct button leads to loss ℓ_1 = 0; the incorrect button leads to ℓ_1 = 1. Now, at time t=2 you are shown another image, again of a zero or one. The correct thing to do in this time step is the xor of (a) the number written on the picture you see right now, and (b) the correct answer from the previous time step. This holds in general for t > 1.

single mistake never recover

dagger

data aggregation

do something unusual. We put the human expert in the car and record their actions, but the car behaves not according to the expert’s behavior, but according to f0. That is, f0 is in control of the car, and the expert is trying to steer, but the car is ignoring them and simply recording their actions as training data.

http://ciml.info/dl/v0_99/ciml-v0_99-ch18.pdf
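The aggregation loop itself is short; a pseudocode-style Python sketch where expert, train, and rollout are placeholders:

def dagger(expert, train, rollout, n_iters):
    # DAgger: roll out the current policy, but label the visited states with the expert
    dataset = []
    policy = expert                      # iteration 0 can use pure behavioral cloning
    for _ in range(n_iters):
        states = rollout(policy)         # states visited by the *current* policy
        dataset += [(s, expert(s)) for s in states]   # expert actions recorded, but ignored at run time
        policy = train(dataset)          # retrain on the aggregated dataset
    return policy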

russian tank problem

importance sampling for switching expectation

the classifier is being told to pay more attention to training examples that have high probability under the new distribution, and less attention to training examples that have low probability under the new distribution.
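In practice this shows up as per-example weights in the loss; a sketch where p_new and p_old are assumed density estimates evaluated at the training points:

import numpy as np

def importance_weighted_loss(losses, p_new, p_old):
    # weight each training example by p_new(x) / p_old(x):
    # high weight = pay more attention, low weight = pay less
    w = p_new / p_old
    return np.mean(w * losses)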

supervised adaptation

when the distributions agree on the value of a feature, let them share it, but when they disagree, allow them to learn separately

feature aug

create three versions of every feature: one that’s shared (for words like “awesome”), one that’s old-distribution-specific and one that’s new-distribution-specific
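A sketch of this feature augmentation for dict-valued features (the prefixes are my own naming):

def augment(features, domain):
    # the feature space has three copies of every feature;
    # each example fills the shared copy and its own domain's copy
    out = {}
    for name, value in features.items():
        out["shared:" + name] = value
        out[domain + ":" + name] = value   # domain is "old" or "new"
    return out

augment({"awesome": 1.0}, "new")
# {'shared:awesome': 1.0, 'new:awesome': 1.0}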

fairness

NYAS Machine Learning Day 2019

Emma Brunskill

policy certificate: https://arxiv.org/pdf/1811.03056.pdf

regret minimization, epsilon bound, tighter bound: https://arxiv.org/pdf/1901.00210.pdf

Aaron Roth

differentially private fair learning

even with an unbiased dataset, learning could still be biased

https://www.nowpublishers.com/article/DownloadSummary/TCS-042

subgroup fairness:

Yaqi Duan

Adaptive Low-Nonnegative-Rank Approximation for State Aggregation of Markov Chains

A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation

temporal difference learning

Attribute-Efficient Learning of Monomials over Highly-Correlated Variables

Kiran Vodrahalli

Nadav Cohen, Wei Hu

A Convergence Analysis of Gradient Descent for Deep Linear Neural…

deep linear NN, closed-form solution, provably converges

two conditions: 1. deficiency margin c, 2. delta-balancedness

std of initialization has a sweet spot for convergence

Nadav Cohen

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

Wei Hu

a way to compute data complexity, without training, related to intrinsic dimension

Ruslan

Deep Generative Models with Learnable Knowledge Constraints

teacher-student learning; build a loss objective with external knowledge, used as an extrinsic reward in RL

use Posterior regularization to compute confidence on teacher distillation, reject samples with low confidence

graph convolution

Sham Kakade

Provably Efficient Maximum Entropy Exploration

maximize cross entropy with uniform distribution, to explore

an efficient approximate planning or density estimation algorithm returns a policy that is within epsilon of the optimal one

the result of the estimation is as close as possible to the true distribution, given the uniform distribution as the estimation prior.

algo for unknown MDP

Approximate planning oracle

State distribution estimate oracle

Audio

https://github.com/facebookresearch/wav2letter/blob/master/docs/installation.md https://github.com/erikd/libsndfile#hacking https://github.com/kpu/kenlm#compiling

feature-wise transformation https://distill.pub/2018/feature-wise-transformations/

Pytorch

sparse tensor

pytorch_sparse

docs

use cases

todo funs

states

sparse math operations
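For comparison with the TensorFlow snippets below, a minimal PyTorch sparse COO example (illustrative values):

import torch

indices = torch.tensor([[0, 1], [0, 2]])           # row indices, column indices
values = torch.tensor([1.0, 2.0])
sp = torch.sparse_coo_tensor(indices, values, size=(3, 4))

dense = torch.randn(4, 5)
out = torch.sparse.mm(sp, dense)                    # sparse x dense matmul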

sparse tensor in tensorflow

# build tf.SparseTensor objects from (row, col, value) index lists
tf.SparseTensor(indices=np.array([fm[0], fm[1]]).T, values=fm[2], dense_shape=[weight_basis.size, total_dim])
tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4])

# sparse x dense matmul (TF1 name; tf.sparse.sparse_dense_matmul in TF2)
tf.sparse_tensor_dense_matmul(ww, theta_small_norm, adjoint_a=True, adjoint_b=True)
# (PyTorch) collect a model's immediate submodules
modules = [module for k, module in model._modules.items()]

SparseTensor sparse slice

sparse matrix in Eigen

GAN

birthday paradox

mode collapse

encoder-decoder GAN

https://openreview.net/pdf?id=BJehNfW0-

Birthday paradox test for GANs

the training objective can approach its optimum value even if the generated distribution has very low support — in other words, the training objective is unable to prevent mode collapse.

cannot prevent learning meaningless codes for data

F-Integral Probability Metric, F = {all 1-Lipschitz functions}

https://arxiv.org/pdf/1702.07028.pdf

On the ability of neural nets to express distributions

Barron theorem

if a certain quantity involving the Fourier transform is small, then the function can be approximated by a neural network with one hidden layer and a small number of nodes

if a generative model can be expressed as the composition of n Barron functions, then it can be approximated by an (n+1)-layer neural network

if a distribution is generated by a composition of n Barron functions, then the distribution can be approximately generated by a neural network with n hidden layers.

Bar 93: gave an upper bound for the size of the network required in terms of a Fourier criterion. He showed that a function f can be approximated in L2 up to error ε by a 2-layer neural network with \(O(\frac{C_f^2}{\epsilon})\) units, where C_f depends on Fourier properties of f

the number of parameters required to obtain a fixed error increases linearly

warp fusion reconstruction

dual-quaternion