Differential privacy prevents overfitting of adaptive queries
Abstract: As XKCD explains, green jelly beans cause acne
For n hypotheses chosen ahead of time, the solution is fairly clean: Just collect O(log(n)) times more data and you'll be fine. But this isn't how most data research works: You study the data a bit and then decide what new questions you want to ask. In other words, most data analysis is adaptive, constructing subsequent queries based on the results of previous queries.
The standard solutions for testing adaptive queries are either to avoid adaptivity entirely (e.g. by preregistering your experiments ahead of time) or to gather a fresh set of data (or split the data into chunks) for each round of adaptivity. So instead of growing logarithmically, the sample complexity grows linearly with the number of queries.
A new approach, proposed by Dwork, Feldman, Hardt, Pitassi, Reingold, and Roth (2014), potentially solves this problem, restoring the logarithmic sample cost per query for adaptive queries as well, by requiring the researcher to only access the data via algorithms that are differentially private. I will give a brief overview of the basics of differential privacy and then show the connection between privacy and generalization error that makes this approach so promising.
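As a concrete (illustrative, not from the talk) sketch of the basic differentially private primitive involved, here is the Laplace mechanism applied to a clipped mean; the function names and parameters are my own:

```python
import math
import random

def laplace(scale):
    """Sample Laplace(0, scale) noise via the inverse-CDF method."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_mean(data, epsilon, lo=0.0, hi=1.0):
    """epsilon-differentially private mean of values clipped to [lo, hi].

    After clipping, changing one record moves the mean by at most
    (hi - lo) / n, so adding Laplace noise with scale sensitivity/epsilon
    satisfies epsilon-differential privacy.
    """
    n = len(data)
    clipped = [min(max(x, lo), hi) for x in data]
    sensitivity = (hi - lo) / n
    return sum(clipped) / n + laplace(sensitivity / epsilon)
```

The point of the Dwork et al. result is that answering adaptive queries only through mechanisms like this one keeps the answers statistically valid, because the noise prevents the analyst's later queries from overfitting to the particular sample.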
Shortest Common Superstring and friends: approximation hardness
Abstract: The Shortest Common Superstring (SCS) problem is: given strings s_1, ..., s_n, find a shortest string S containing each of the s_i as a substring (e.g. the shortest superstring of "aaabbb" and "bbbccc" is "aaabbbccc"). The approximability of SCS has been studied at great length because of its connections to the de novo assembly problem in computational genomics, where an assembled genome can be thought of as a superstring of the small snippets of DNA that a sequencer can read.
SCS is known to be APX-complete, and several generalizations have also been studied. In particular, previous results include that SCS with Negative (disallowed) Strings (SCSN) is in Log-APX and SCS with Wildcards (SCSW) is Poly-APX-hard. More recently, I proved in my 6.890 final project last semester that SCSN is Log-APX-hard and SCS with Negative Strings and Wildcards (SCSNW) is NPOPB-hard.
In this talk, we will cover the gadgets and reductions for hardness proofs of all four SCS variants described above. No background in complexity theory will be assumed (and a good thing too, as I myself have next to none).
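For intuition (this heuristic is standard folklore, not a result from the talk), here is the classic greedy-merge approach to SCS: repeatedly merge the pair of strings with the largest suffix-prefix overlap.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_superstring(strings):
    """Greedy-merge heuristic for Shortest Common Superstring."""
    # Drop strings already contained in another string.
    strs = [s for s in strings if not any(s != t and s in t for t in strings)]
    while len(strs) > 1:
        best = (-1, 0, 0)
        for i, a in enumerate(strs):
            for j, b in enumerate(strs):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = strs[i] + strs[j][k:]
        strs = [s for idx, s in enumerate(strs) if idx not in (i, j)] + [merged]
    return strs[0] if strs else ""
```

On the example from the abstract, `greedy_superstring(["aaabbb", "bbbccc"])` merges on the shared "bbb" and returns "aaabbbccc".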
Thin Trees and the Travelling Salesman Problem
Abstract: The travelling salesman problem is perhaps the quintessential NP-hard problem. Given a network of cities and costs to travel between them, it asks for the cheapest route that visits all of the cities (in any order). Since there is (probably) no efficient algorithm to solve it exactly, I'll be talking about efficient algorithms to get provably-approximate solutions. I'll start with a simple (but clever) 3/2-approximation algorithm due to Christofides. Then I'll talk about the more difficult "asymmetric" (directed) variant of the problem. A major open problem is to find a constant-factor approximation algorithm for this variant. One possible approach involves a combinatorial conjecture of Goddyn about "thin trees".
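To get a feel for how such approximation guarantees work, here is a sketch of the simpler cousin of Christofides' algorithm, the "double-tree" 2-approximation for metric TSP (build a minimum spanning tree, then shortcut a walk of it); this is my illustration, not the talk's 3/2-approximation:

```python
import math

def mst_parents(dist):
    """Prim's algorithm on a complete graph given a distance matrix."""
    n = len(dist)
    in_tree = [False] * n
    parent = [0] * n
    key = [math.inf] * n
    key[0] = 0.0
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: key[v])
        in_tree[u] = True
        for v in range(n):
            if not in_tree[v] and dist[u][v] < key[v]:
                key[v] = dist[u][v]
                parent[v] = u
    return parent

def double_tree_tour(dist):
    """2-approximation for metric TSP: preorder walk of the MST.

    The MST costs at most OPT, the walk doubles it, and shortcutting
    repeated vertices never increases cost in a metric.
    """
    n = len(dist)
    parent = mst_parents(dist)
    children = [[] for _ in range(n)]
    for v in range(1, n):
        children[parent[v]].append(v)
    tour, stack = [], [0]
    while stack:
        u = stack.pop()
        tour.append(u)
        stack.extend(reversed(children[u]))
    return tour

def tour_cost(dist, tour):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))
```

Christofides improves the factor from 2 to 3/2 by replacing the "double every MST edge" step with a minimum-weight perfect matching on the odd-degree vertices.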
Shalev Ben-David
Interval scheduling using topological fixed-point theorems
Abstract: Interval scheduling is one of the first problems taught in an introductory algorithms course: there is a simple greedy algorithm that makes the problem trivial. In this talk, we consider a slight variant of interval scheduling, in which each interval is replaced by 2 intervals. This variant is NP-hard, but there is a known 4-approximation algorithm. Perhaps surprisingly, this problem is closely related to both LP duality and to some topological fixed-point theorems. Exploiting these relationships results in a 3-approximation algorithm if one has access to a PPAD oracle.
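The "simple greedy algorithm" for the basic (one-interval) problem is the classic earliest-finish-time rule; a minimal sketch:

```python
def max_disjoint_intervals(intervals):
    """Earliest-finish-time greedy for interval scheduling.

    Sort by right endpoint and take each interval that doesn't overlap
    the last one chosen; this returns a maximum-size disjoint subset.
    """
    chosen = []
    last_end = float("-inf")
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if start >= last_end:
            chosen.append((start, end))
            last_end = end
    return chosen
```

It is exactly this easy structure that breaks when each job spans two intervals, which is why the variant in the talk jumps all the way to NP-hardness.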
The Construction of Bridges, or: How I Learned to Stop Worrying and Love the SDP Solver
Abstract: Semidefinite programming is a totally overpowered numerical technique generalizing linear programming. Although its literal definition looks a bit weird (why should I care about optimizing over semidefinite matrices?), I'll try to demystify this sledgehammer by bashing some (probably not all) of the following problems with it:
I'll show off as many demos as I'm able to code up between now and Thursday, so if you want more demos, feed me coffee this week.
Are the hierarchies the best approximation algorithms?
Abstract: Kinda yes. In this talk, I'm going to introduce the Sherali-Adams and Lasserre hierarchies in terms of "pseudo-distributions". Then I will talk about very recent results showing that, among polynomial-size linear programs and semidefinite programs respectively, each hierarchy gives the smallest formulation for constraint satisfaction problems.
Submodular Optimization for (Possible) Fun and (Certain) Profit
Abstract: Everyone who does optimization knows that "convexity" is a magic word, but sometimes it's hard to see just what that means. I'll tell you what submodular functions are, how they're sort of convex, and how to make them the best that they can be. I'll discuss some applications to classical combinatorial optimization problems and also to fancy new questions in machine learning.
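As one illustrative example (mine, not necessarily the talk's), the coverage function |union of chosen sets| is monotone submodular, and the simple greedy algorithm maximizing it under a cardinality constraint achieves the classic (1 - 1/e) guarantee of Nemhauser, Wolsey, and Fisher:

```python
def greedy_max_coverage(sets, k):
    """Greedy (1 - 1/e)-approximation for max coverage.

    Coverage is monotone submodular: the marginal gain of a set
    only shrinks as more sets are chosen ("diminishing returns").
    """
    chosen, covered = [], set()
    for _ in range(k):
        best = max(range(len(sets)), key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break  # no remaining marginal gain
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered
```

The same greedy template underlies many of the machine-learning applications (sensor placement, data summarization) that the abstract alludes to.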
The Wonderful World of Bessel Functions
Memories and magnets
Abstract: Your brain is able to label objects and find memories with astonishing speed and accuracy. How is it possible to store so many memories and keep them all separate? One way you might do this is using a dynamical system called a Hopfield network, which is based on simple rules for updating a graph. The idea comes from the Ising model - a bunch of magnets that flip depending on what their neighbors are doing. We'll see how all these ideas relate together, with ideas from Markov chains, neural networks, and statistical physics.
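The update rule described above is simple enough to sketch in a few lines; this is a minimal illustrative Hopfield network (Hebbian training plus Ising-style sign updates), with names of my own choosing:

```python
def train_hopfield(patterns):
    """Hebbian learning: w[i][j] accumulates pattern[i] * pattern[j],
    with a zero diagonal. Patterns are lists of +1/-1 spins."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / len(patterns)
    return w

def recall(w, state, sweeps=5):
    """Asynchronous updates: each unit flips to the sign of its local
    field, exactly the magnet-flipping rule from the Ising model."""
    state = list(state)
    n = len(state)
    for _ in range(sweeps):
        for i in range(n):
            h = sum(w[i][j] * state[j] for j in range(n))
            state[i] = 1 if h >= 0 else -1
    return state
```

Starting from a corrupted version of a stored pattern, the dynamics roll downhill in energy and (for few enough stored patterns) land back on the original memory.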
How to solve most combinatorial optimization problems in O~(n^3) time (if they are in P)?
Sample Optimal Density Estimation in Nearly-Linear Time
Non-negative matrix factorization in pictures