Bayesian non-parametric modeling using Dirichlet processes

A Bayesian nonparametric model is a Bayesian model on an infinite-dimensional
parameter space. The parameter space is typically chosen as the set of all possible
solutions for a given learning problem. For example, in a regression problem
the parameter space can be the set of continuous functions, and in a density estimation
problem the space can consist of all densities. A Bayesian nonparametric
model uses only a finite subset of the available parameter dimensions to explain
a finite sample of observations, with the set of dimensions chosen depending on
the sample, such that the effective complexity of the model (as measured by the
number of dimensions used) adapts to the data. Classical adaptive problems,
such as nonparametric estimation and model selection, can thus be formulated
as Bayesian inference problems. Popular examples of Bayesian nonparametric
models include Gaussian process regression, in which the correlation structure
is refined with growing sample size, and Dirichlet process mixture models for
clustering, which adapt the number of clusters to the complexity of the data.
Bayesian nonparametric models have recently been applied to a variety of machine
learning problems, including regression, classification, clustering, latent
variable modeling, sequential modeling, image segmentation, source separation
and grammar induction.
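
To make the clustering example concrete, here is a minimal sketch (not from the tutorial) of drawing cluster assignments from the Chinese restaurant process, the predictive rule underlying a Dirichlet process mixture. The concentration parameter alpha and the sample size are arbitrary illustrative choices; the point is that the number of occupied clusters grows with the data rather than being fixed in advance.

```python
import numpy as np

def crp_assignments(n, alpha, seed=None):
    """Sample cluster labels for n points from a Chinese restaurant process."""
    rng = np.random.default_rng(seed)
    counts = []   # counts[k] = number of points already sitting in cluster k
    labels = []
    for i in range(n):
        # An existing cluster k is chosen with probability counts[k] / (i + alpha),
        # a brand-new cluster with probability alpha / (i + alpha).
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)          # open a new cluster
        else:
            counts[k] += 1
        labels.append(k)
    return np.array(labels)

labels = crp_assignments(n=500, alpha=2.0, seed=0)
print("number of clusters used for 500 points:", labels.max() + 1)
```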

In this reading group we will be going over Michael I. Jordan’s tutorial presentation from NIPS focusing on Dirichlet processes.
The original PostScript file can be found at the following link:
http://www.cs.berkeley.edu/~jordan/nips-tutorial05.ps
The presentation follows the same lines as the paper found in:
http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf

Compilation of topics since May

Sent September 9:
This week we will go over the paper:
It talks about primal-dual averaging, one of the methods mentioned in the review we covered that we didn’t quite work out.
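
For reference, here is a rough sketch of a basic dual-averaging update, in the spirit of the primal-dual averaging method mentioned above; this is my own toy illustration on a simple nonsmooth objective, not the paper’s algorithm, and the prox schedule beta_t = gamma * sqrt(t) is an assumed choice.

```python
import numpy as np

def dual_averaging(subgrad, dim, T=20000, gamma=1.0):
    """Minimize a convex function given a subgradient oracle, via dual averaging."""
    z = np.zeros(dim)        # running sum of subgradients (the unscaled "dual average")
    x = np.zeros(dim)
    x_avg = np.zeros(dim)
    for t in range(1, T + 1):
        z += subgrad(x)
        # argmin_x <z, x> + (gamma * sqrt(t) / 2) * ||x||^2
        x = -z / (gamma * np.sqrt(t))
        x_avg += (x - x_avg) / t   # running average of the primal iterates
    return x_avg

# Toy objective: f(x) = ||x - target||_1, whose subgradient is sign(x - target).
target = np.array([1.0, -2.0, 0.5])
subgrad = lambda x: np.sign(x - target)
print(dual_averaging(subgrad, dim=3))   # averaged iterate should be close to `target`
```
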
Sent August 26:
Reminder – As we discussed last week, we will talk about the second part of the optimization lecture on Thursday.

Sent August 13:
Sent August 5:
I want to dedicate a few sessions to the area of optimization in ML. The idea is to cover new results but also try to make a “map” of the area, and make the connections between the fields.
To bring us all up to some level, this week, instead of reading a paper on a specific algorithmic/theoretical result, I thought we should read a review on the subject. I couldn’t find a good written one, but I found a nice NIPS tutorial on the subject (so you don’t even have to read :)).
Sent July 28:
from ICML 2013.
Sent July 14:
This week (Thursday @14:30) we will continue with Gaussian Processes. The subject will be the paper from ICML 2011: http://www.icml-2011.org/papers/323_icmlpaper.pdf, which applies GPs to Reinforcement Learning.
Also, here is a motivational video (learning this task previously demanded hundreds of trials; this algorithm does it in 7):
Sent July 3:
It was suggested that we do a couple of sessions on Gaussian Processes. For next week, please read Chapters 2 and 5 of the book Gaussian Processes for Machine Learning, available at http://www.gaussianprocess.org/gpml/.
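
As a warm-up for Chapter 2, here is a minimal sketch of GP regression with a squared-exponential kernel: a toy rendering of the standard Cholesky-based computation of the posterior mean and covariance. The kernel hyperparameters are simply fixed here, rather than learned by marginal likelihood as discussed in Chapter 5.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of inputs."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

def gp_posterior(X, y, X_star, noise=0.1):
    """Posterior mean and covariance of a zero-mean GP at test inputs X_star."""
    K = rbf_kernel(X, X) + noise ** 2 * np.eye(len(X))
    K_s = rbf_kernel(X, X_star)
    K_ss = rbf_kernel(X_star, X_star)
    L = np.linalg.cholesky(K)                               # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha                                    # posterior mean
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                                    # posterior covariance
    return mean, cov

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(20)
X_star = np.linspace(0, 5, 100)[:, None]
mean, cov = gp_posterior(X, y, X_star)
print(mean[:5], np.sqrt(np.diag(cov))[:5])
```
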
Sent June 16:
This week we will go over the paper:
“A Provably Efficient Algorithm for Training Deep Networks” http://arxiv.org/abs/1304.7045

Sent June 9:

Odalric will lead the discussion on the paper –

“Follow the Leader If You Can, Hedge If You Must”
http://arxiv.org/pdf/1301.0534v2.pdf, by Steven de Rooij, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen.
This paper considers the online learning setting and tries to find a way to optimally
tune the Hedge algorithm so as to obtain a (really) adaptive algorithm.

A quick reference to the Hedge Algorithm: http://onlineprediction.net/n=Main.HedgeAlgorithm

Some motivation for why this setting is useful can be found in http://hal.archives-ouvertes.fr/docs/00/71/51/77/PDF/Devaine-Goude-Stoltz-Gaillard.pdf.
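
For orientation, here is a minimal sketch of the basic Hedge update (exponentially weighted averaging over experts). This is the generic textbook version, not the paper’s algorithm; the fixed learning rate eta is exactly the kind of parameter the paper aims to tune adaptively.

```python
import numpy as np

def hedge(losses, eta=0.5):
    """Run Hedge on a (T, K) array of per-round expert losses in [0, 1]."""
    T, K = losses.shape
    log_w = np.zeros(K)                    # log-weights, initially uniform
    total_loss = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                       # current distribution over experts
        total_loss += p @ losses[t]        # learner's expected loss this round
        log_w -= eta * losses[t]           # exponential weight update
    return total_loss

rng = np.random.default_rng(0)
losses = rng.uniform(size=(1000, 5))
# Compare the learner's cumulative loss with that of the best single expert.
print(hedge(losses), losses.sum(axis=0).min())
```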

Sent April 29:

This week we’ll be reading the paper “Sparse inverse covariance estimation with the graphical lasso” by Friedman et al. (http://www-stat.stanford.edu/~tibs/ftp/graph.pdf). The paper discusses the problem of estimating sparse graphs by applying a lasso penalty to the inverse covariance matrix. The connection to graphs: conditional independence between two variables (given all the others) can be deduced when the corresponding entry of the inverse covariance matrix is zero; for a reminder on the subject, see the attached tutorial.
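
For a quick hands-on feel, here is a small sketch using scikit-learn’s GraphicalLasso estimator (not the authors’ code): the zeros of the estimated precision (inverse covariance) matrix give the missing edges of the Gaussian graphical model. The penalty value alpha below is an arbitrary choice.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# Toy data: a chain of correlated variables (0 -> 1 -> 2) plus two independent ones.
X = rng.standard_normal((200, 5))
X[:, 1] += 0.8 * X[:, 0]
X[:, 2] += 0.8 * X[:, 1]

model = GraphicalLasso(alpha=0.2).fit(X)
precision = model.precision_
# Edges of the estimated graph: nonzero off-diagonal entries of the precision matrix.
edges = [(i, j) for i in range(5) for j in range(i + 1, 5)
         if abs(precision[i, j]) > 1e-6]
print(edges)   # should roughly recover the (0, 1) and (1, 2) edges
```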

Causality

This week will be about Causality. Please read –

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.66.2460 : the article “Causal feature selection”, where we are interested in the causal part: sections 1 to 2.2 inclusive, and sections 5 and 7 (add section 8 if you want a lengthy list of applications).

– A part of a tutorial presentation on causality. I removed unnecessary slides so that you don’t end up printing the whole thing if you don’t want to. The video and the full slides are available on VideoLectures, but they are not necessary for the reading group. That said, the guy who gave the tutorial is really good. http://videolectures.net/ecmlpkdd2011_tsamardinos_discovery/

Probabilistic Graphical models – Intro

Last week we discussed the subject of Probabilistic Graphical Models – an introduction.
It is a good and general introduction that covers the basics and tries to provide intuitions rather than a perfect mathematical formalism. It is slightly outdated, but it makes many links with other areas of machine learning.

Dimension Reduction

In the last few weeks we covered a few interesting subjects in dimensionality reduction. I would like to give a recap of the papers and give some pointers to further reading.

In the first meeting we covered a comparative review, mostly of nonlinear methods for dimensionality reduction (manifold learning) – http://www.iai.uni-bonn.de/~jz/dimensionality_reduction_a_comparative_review.pdf

Next, we dived into the details of one common manifold learning method – Diffusion Maps. Papers: The original paper by Coifman (link) and an extension paper.

Last, we talked about Random Projections. We looked at several different sources:

  • For some general theory see link. It gives the basic theorem and proofs, and an application to fast computation of truncated PCA (which is highly relevant to the global methods of dimensionality reduction, which use it to place points given a distance matrix).
  • For some practice see: “Random projection in dimensionality reduction: Applications to image and text data”. The paper shows how distances between images are better preserved by random projections than by PCA (note that PCA isn’t designed to preserve distances exactly); a small numerical illustration of this distance-preservation property appears right after this list. In particular, the paper gives a feel for how these methods sometimes work at much more reasonable dimensions than the basic theory predicts.
  • A little more of both: “Experiments with Random Projection” by Dasgupta (Sections 3.2, 4.3). It shows a nice extra lemma about the eccentricity of Gaussians under random projections, and an application to simple classification of the MNIST handwritten digits dataset.
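
Here is the small numerical illustration mentioned in the list above (my own, not from the papers): a Gaussian random projection approximately preserves pairwise distances, in the spirit of the Johnson-Lindenstrauss lemma. The dimensions below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 10_000, 500                        # points, original dim, projected dim
X = rng.standard_normal((n, d))
R = rng.standard_normal((d, k)) / np.sqrt(k)      # Gaussian random projection matrix
Y = X @ R

def pairwise_dists(Z):
    """Euclidean distance matrix via the Gram-matrix identity."""
    sq = (Z ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    return np.sqrt(np.maximum(d2, 0.0))

orig, proj = pairwise_dists(X), pairwise_dists(Y)
mask = ~np.eye(n, dtype=bool)
ratios = proj[mask] / orig[mask]
print("projected/original distance ratios: min %.3f, max %.3f"
      % (ratios.min(), ratios.max()))
```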

For some more reading on random projections, suggested by Odalric:

Random projections in ML:

  • Linear regression with Random Projections: http://jmlr.csail.mit.edu/papers/volume13/maillard12a/maillard12a.pdf
  • Random projection trees: Sanjoy Dasgupta and Yoav Freund. Random projection trees and low dimensional manifolds. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC ’08, pages 537–546, New York, NY, USA, 2008. ACM. (+ Dasgupta’s webpage)
  • Spectral clustering: Bin Zhao and Changshui Zhang. Compressed spectral clustering. In Proceedings of the 2009 IEEE International Conference on Data Mining Workshops, ICDMW ’09, pages 344–349, Washington, DC, USA, 2009. IEEE Computer Society.

For applications in signal processing:

  • Sparse recovery with Brownian sensing: http://books.nips.cc/papers/files/nips24/NIPS2011_1005.pdf
  • Compressive sensing: http://www-m15.ma.tum.de/foswiki/pub/M15/Allgemeines/OldNews/CSFornasierRauhut.pdf

More topics on random matrices, some examples:

  • Random projection with sparse matrices: Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, June 2003.
  • The RIP property: R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008.
  • Compressed Sensing and high-dimensional geometry: http://www-personal.umich.edu/~romanv/teaching/2012-13/709/CS-random-matrices.pdf
  • The notion of incoherence: E. Candes and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23:969–985, 2007.
  • Singular values: http://djalil.chafai.net/Docs/sing.pdf
  • Fast Random projections (e.g.): https://web.math.princeton.edu/~amits/publications/LeanWalsh_published.pdf

Temporal-Difference Learning with Linear Function Approximation

The next couple of meetings will be dedicated to Reinforcement Learning (RL).
This week we will cover the paper
Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation by Sutton et al.
This paper presented a solution to a long-standing problem: how to learn an
approximate value function in an *online* (efficient) but *off-policy* setup.
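
As a rough rendering of the kind of update the paper analyzes, here is a sketch of a TDC-style step in my own notation: a secondary weight vector w corrects the TD(0) update so that learning with linear features remains stable off-policy. The step sizes, the importance-sampling ratio, and the random features in the usage snippet are purely illustrative.

```python
import numpy as np

def tdc_update(theta, w, phi, phi_next, reward, rho=1.0,
               gamma=0.9, alpha=0.01, beta=0.05):
    """One TDC-style step; rho is the importance-sampling ratio (1.0 on-policy)."""
    delta = reward + gamma * theta @ phi_next - theta @ phi        # TD error
    theta = theta + alpha * rho * (delta * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * rho * (delta - w @ phi) * phi                   # auxiliary weights
    return theta, w

# Tiny usage example with random features and rewards, just to show the call pattern.
rng = np.random.default_rng(0)
d = 8
theta, w = np.zeros(d), np.zeros(d)
phi = rng.standard_normal(d)
for _ in range(1000):
    phi_next = rng.standard_normal(d)
    reward = rng.standard_normal()
    theta, w = tdc_update(theta, w, phi, phi_next, reward)
    phi = phi_next
print(theta)
```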

If you want to refresh your RL knowledge, I recommend Nahum Shimkin’s
lecture notes. In particular, the last lecture, Value and Policy Approximations, is very relevant to this week’s topic.

Best of NIPS 2012

Next Wednesday, the 9/2, we will have a group covering some of the work presented this year at NIPS. Each of us should pick one paper and share it with the group (about 15 minutes).
To make sure we don’t all pick the same paper, please comment on this post with your selection.
All papers are up for grabs, so start looking for your fav!

Online Passive-Aggressive Algorithms

This week we will cover the paper –
Online Passive-Aggressive Algorithms
by Crammer et al. This group will be especially interesting thanks to Koby, who has kindly agreed to share some insights into this work.

Abstract
We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.
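
To make the update rule concrete, here is a minimal sketch of the binary-classification Passive-Aggressive step (the PA-I variant with an aggressiveness parameter C, in my own notation); the closed-form step size is the solution of the simple constrained optimization problem the abstract refers to.

```python
import numpy as np

def pa_fit(X, y, C=1.0):
    """One online pass of the PA-I update; labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for x_t, y_t in zip(X, y):
        loss = max(0.0, 1.0 - y_t * (w @ x_t))        # hinge loss on this example
        if loss > 0.0:                                # "aggressive" step
            tau = min(C, loss / (x_t @ x_t))          # closed-form step size
            w = w + tau * y_t * x_t
        # else: "passive", keep w unchanged
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
y = np.sign(X @ rng.standard_normal(10))
w = pa_fit(X, y)
print("training accuracy after one pass:", np.mean(np.sign(X @ w) == y))
```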

We have moved the time back to 9:30.

See you on Wednesday (2/1/2012)!

Deep Learning – Introduction

In the next few sessions we’ll be covering Deep Learning, a subject that has been gaining a lot of attention lately.
We will be following the review by Yoshua Bengio –
The first discussion will cover an introduction and motivation to the subject. The relevant chapters are 1-2.
In the next group discussion, on 19/12, we will go “deeper” into the subject and try to cover the rest of the chapters of the review.