This week will be about Causality. Please read:
– An article, “Causal feature selection”, where we are interested in the causal part: sections 1 to 2.2 included, and sections 5 and 7 (add section 8 if you want a lengthy list of applications).

– A part of a tutorial presentation on causality. I removed unnecessary slides to make sure you would not print the whole thing if you don’t want to. The video and the full slides are available as a video lecture online, but they are not necessary for the reading group. That said, the speaker who gave the tutorial is really good.

Probabilistic Graphical models – Intro

Last week we discussed Probabilistic Graphical Models – an introduction.
It is a good, general introduction which covers the basics and tries to provide intuition rather than a perfect mathematical formalism. It is slightly outdated, but it makes many links to other areas of machine learning.

Dimension Reduction

In the last few weeks we covered a few interesting subjects in dimensionality reduction. I would like to recap the papers and give some pointers to further reading.

In the first meeting we covered a comparative review, mostly of non-linear methods for dimension reduction (manifold learning) –

Next, we dived into the details of one common manifold learning method – Diffusion Maps. Papers: The original paper by Coifman (link) and an extension paper.

Last, we talked about Random Projections. We looked at several different sources:

  • For some general theory see link. It gives the basic theorem and proofs, and an application to fast computation of truncated PCA (which is highly relevant to the global methods of dimensionality reduction, which use it to place points given a distance matrix).
  • For some practice see “Random projection in dimensionality reduction: Applications to image and text data”. The paper shows how distances between images are better preserved by random projections than by PCA (note that PCA isn’t designed to preserve distances exactly). In particular, it gives a feel for how these methods sometimes work at much more reasonable dimensions than the basic theory predicts.
  • A little more of both: “Experiments with Random Projection” by Dasgupta (sections 3.2 and 4.3). It shows a nice extra lemma about the eccentricity of Gaussians under random projections, and an application to simple classification of the MNIST handwritten digits data set.
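For a concrete feel of the basic theorem, here is a minimal sketch of a dense Gaussian random projection (assuming NumPy; the dimensions and the construction are illustrative choices, not taken from the papers above):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 500, 10_000, 200          # n points in d dimensions, projected to k
X = rng.normal(size=(n, d))

# Dense Gaussian projection matrix, scaled so that squared norms
# (and hence pairwise distances) are preserved in expectation.
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R

# Compare one pairwise distance before and after projection.
i, j = 0, 1
orig = np.linalg.norm(X[i] - X[j])
proj = np.linalg.norm(Y[i] - Y[j])
ratio = proj / orig   # concentrates near 1 (Johnson-Lindenstrauss)
```

Note that the target dimension k needed by the theory depends only on the number of points and the tolerated distortion, not on the original dimension d.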

For some more reading on random projections, suggested by Odalric:

Random projections in ML:

  • Linear regression with Random Projections:
  • Random Projection trees: Sanjoy Dasgupta and Yoav Freund. Random projection trees and low dimensional manifolds. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC ’08, pages 537–546, New York, NY, USA, 2008. ACM. (+ Dasgupta’s webpage)
  • Spectral clustering: Bin Zhao and Changshui Zhang. Compressed spectral clustering. In Proceedings of the 2009 IEEE International Conference on Data Mining Workshops, ICDMW ’09, pages 344–349, Washington, DC, USA, 2009. IEEE Computer Society.

For applications in signal processing:

  • Sparse recovery with Brownian sensing:
  • Compressive sensing:

More topics on random matrices, some examples:

  • Random projection with sparse matrices: Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, June 2003.
  • The RIP property: R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008.
  • Compressed Sensing and high-dimensional geometry:
  • The notion of incoherence: E. Candes and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23:969–985, 2007.
  • Singular values:
  • Fast Random projections (e.g.):
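As an aside, Achlioptas’ “database-friendly” construction from the list above is easy to sketch: entries drawn from {+√3, 0, −√3} replace the Gaussians, so two thirds of the matrix is zero (the dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 200

# Achlioptas' sparse projection: entries are +sqrt(3), 0, -sqrt(3)
# with probabilities 1/6, 2/3, 1/6 (unit variance, mostly zeros).
vals = rng.choice([np.sqrt(3.0), 0.0, -np.sqrt(3.0)],
                  size=(d, k), p=[1/6, 2/3, 1/6])
R = vals / np.sqrt(k)

x = rng.normal(size=d)
y = x @ R
ratio = np.linalg.norm(y) / np.linalg.norm(x)   # concentrates near 1
```

The same JL-style norm preservation holds, while the projection needs no Gaussian sampling and can exploit sparsity for speed.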

Temporal-Difference Learning with Linear Function Approximation

The next couple of meetings will be dedicated to Reinforcement Learning (RL).
This week we will cover the paper
Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation by Sutton et al.
This paper presents a solution to a long-standing problem: how to learn an
approximate value function in an *online* (efficient) but *off-policy* setting.
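As background, here is a minimal sketch of the standard setting the paper builds on: on-policy semi-gradient TD(0) with linear function approximation on a toy chain (the chain, features, and step size are illustrative; the paper’s gradient-TD methods modify this update to remain stable off-policy):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Markov chain: 5 states on a ring, reward 1 for entering state 0.
n_states, gamma, alpha = 5, 0.9, 0.05
phi = np.eye(n_states)          # tabular features; any feature matrix works
w = np.zeros(n_states)          # linear value estimate: V(s) ≈ phi[s] @ w

s = 0
for _ in range(20_000):
    s_next = (s + rng.integers(1, 3)) % n_states   # move ahead 1 or 2 steps
    r = 1.0 if s_next == 0 else 0.0
    # Semi-gradient TD(0) update of the linear parameters.
    td_error = r + gamma * (phi[s_next] @ w) - (phi[s] @ w)
    w += alpha * td_error * phi[s]
    s = s_next

V = phi @ w                     # estimated state values
```

With off-policy sampling this update can diverge, which is exactly the failure mode the paper’s gradient-descent methods fix.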

If you want to refresh your RL knowledge, I recommend Nahum Shimkin’s
lecture notes. In particular, the last lecture, Value and Policy Approximations, is very relevant to this week’s topic.

Best of NIPS 2012

Next Wednesday, the 9/2, we will have a group meeting covering some of the work presented this year at NIPS. Each participant should pick one paper and present it to the group (about 15 minutes each).
To make sure we don’t all pick the same paper, please comment on this post with your selection.
All papers are up for grabs, so start looking for your fav!

Online Passive-Aggressive Algorithms

This week we will cover the paper –
Online Passive-Aggressive Algorithms
by Crammer et al. This group will be especially interesting thanks to Koby, who has kindly agreed to share some insights into this work.

From the abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.
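To give a flavor of the update, here is a sketch of the PA-I variant for binary classification: each round solves a tiny constrained problem in closed form, staying passive when the margin is satisfied and moving just enough when it is not (the toy data and constants are illustrative, not from the paper):

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One PA-I step: closed-form solution of the per-round problem
    min ||w' - w||^2 subject to a slack-relaxed margin constraint."""
    loss = max(0.0, 1.0 - y * (w @ x))   # hinge loss on this example
    tau = min(C, loss / (x @ x))         # aggressiveness of the step
    return w + tau * y * x

# Online run on separable toy data labeled by a hidden linear rule.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
mistakes = 0
for _ in range(1000):
    x = rng.normal(size=3)
    y = np.sign(w_true @ x)
    if np.sign(w @ x) != y:
        mistakes += 1
    w = pa_update(w, x, y)
```

The closed-form `tau` is what makes the family cheap per round while still admitting the worst-case loss bounds discussed in the paper.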

We have moved the time back to 9:30.

See you on Wednesday (2/1/2012)!

Deep Learning – Introduction

In the next few sessions we’ll be covering Deep Learning, a subject that has been gaining a lot of attention lately.
We will be following the review by Yoshua Bengio –
The first discussion will cover an introduction and motivation to the subject. The relevant chapters are 1-2.
In the next group discussion, on 19/12, we will go “deeper” into the subject and try to cover the rest of the chapters of the review.


This week we will discuss the paper of Shai Shalev-Shwartz et al:
Pegasos: Primal Estimated sub-GrAdient SOlver for SVM

The paper presents an iterative algorithm for solving the optimization problem of Support Vector Machines (SVM). The authors show that the number of iterations needed to converge to a solution with accuracy ε is O(1/ε). In each iteration the algorithm optimizes over a single sample or a subset of samples (the mini-batch setting). The flavor of the algorithm is that of stochastic gradient descent (for the single-sample case it is a stochastic sub-gradient descent algorithm). As a result, the convergence rate is not directly related to the number of samples. This property makes the algorithm attractive for large-scale problems.
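Here is a minimal sketch of the single-sample variant (it omits the paper’s optional projection step and mini-batching; the data and constants are illustrative):

```python
import numpy as np

def pegasos(X, y, lam=0.1, T=2000, seed=0):
    """Single-sample Pegasos sketch: stochastic sub-gradient descent
    on the regularized hinge loss with step size 1/(lam * t)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        eta = 1.0 / (lam * t)
        if y[i] * (X[i] @ w) < 1:   # margin violated: hinge term contributes
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                       # only the regularizer shrinks w
            w = (1 - eta * lam) * w
    return w

# Separable toy data labeled by a hidden linear rule.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([1.0, 1.0]))
w = pegasos(X, y)
train_acc = np.mean(np.sign(X @ w) == y)   # fraction correctly classified
```

Note that each step touches one sample, which is why the runtime to a fixed accuracy does not grow with the training set size.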

If you don’t have much time this week, there is also a shorter ICML version available. I recommend the longer version; it shows a slight generalization of the algorithm and is generally less condensed ;).

If you think you’ll be able to make it, please sign in.
Note that the discussion time slot has moved to Wednesday morning at 9:30.


Hi all,

With the opening of our new ML reading group I’ve decided to open this blog that will keep you up to date with the group’s upcoming discussions and archives.

Before each meetup, hopefully a full week in advance, I will post the subject for the reading group with a link to the reading material. If you think you can make it this week, please register. Walk-ins are also welcome, but registration will allow us to provide better hospitality (room size and snacks for all 🙂 ).

This site is also a place for holding prior/post meeting discussion on the paper, so comment away…
If you have any suggestions for future subjects, please comment on this post.

Maayan Harel

A unified framework for high-dimensional analysis

The subject of this week’s group will be the paper –
 “A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers”

Link to paper

The formulation of an estimation problem as a regularized optimization problem is useful and familiar in ML. The formulation combines two components: a loss function that measures how well the model fits the data, and a regularization function (examples include the Lasso, group Lasso, ridge regression, PCA, matrix completion, etc.).
The authors of this paper present a general approach for establishing consistency and convergence rates for such regularized estimators in the high-dimensional setting –
that is, when the number of parameters is of the same order as the number of samples!
The reading may require a bit of effort, but the diligent reader will be rewarded 🙂
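As a tiny concrete instance of the loss + regularizer template, here is a proximal-gradient (ISTA) sketch of the Lasso (the solver and data are illustrative; the paper’s analysis is far more general):

```python
import numpy as np

def soft_threshold(z, t):
    """Prox operator of the l1 norm: shrink each coordinate toward 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_lasso(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||X w - y||^2 + lam * ||w||_1:
    a gradient step on the loss, then the prox of the regularizer."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz const of gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Sparse ground truth recovered from noisy linear measurements.
rng = np.random.default_rng(0)
n, d = 100, 50
w_true = np.zeros(d)
w_true[[3, 17, 41]] = [3.0, -2.0, 1.5]
X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)
w_hat = ista_lasso(X, y, lam=1.0)
```

The two lines inside the loop mirror the two components above: the gradient step comes from the loss, and the soft-thresholding comes entirely from the (decomposable) l1 regularizer.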

Register to this week’s group