Focusing on diagonal linear networks as a model for understanding the implicit bias in underdetermined models, we show how the gradient descent step size can have a large qualitative effect on the implicit bias, and thus on generalization ability. In particular, we show how using a large step size for non-centered data can change the implicit bias from a "kernel" type behavior to a "rich" (sparsity-inducing) regime — even when gradient flow, studied in previous works, would not escape the "kernel" regime. We do so by using dynamic stability, proving that convergence to dynamically stable global minima entails a bound on some weighted ℓ1-norm of the linear predictor, i.e. a "rich" regime. We prove this leads to good generalization in a sparse regression setting.
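For concreteness, here is a minimal sketch (my own illustration, not the paper's construction) of the kind of model this abstract refers to: a diagonal linear network with predictor beta = u*u - v*v (elementwise), trained by full-batch gradient descent on a sparse regression problem with non-centered data. All dimensions, the initialization scale alpha, and the step size lr are illustrative; per the abstract, the size of lr is what determines whether the recovered predictor looks kernel-like or sparse.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 20, 50                                  # underdetermined: fewer samples than features
    beta_star = np.zeros(d)
    beta_star[:3] = 1.0                            # sparse ground truth
    X = rng.normal(size=(n, d)) + 1.0              # non-centered features, as in the abstract
    y = X @ beta_star

    alpha, lr = 0.1, 0.005                         # initialization scale and step size (the knob of interest)
    u = np.full(d, alpha)
    v = np.full(d, alpha)
    for _ in range(50_000):
        g = X.T @ (X @ (u * u - v * v) - y) / n    # gradient of the squared loss w.r.t. the predictor
        u, v = u - lr * 2 * u * g, v + lr * 2 * v * g   # chain rule through beta = u*u - v*v
    beta = u * u - v * v
    print("train MSE:", np.mean((X @ beta - y) ** 2), "| l1 norm:", np.abs(beta).sum())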
We study the problem of adversarially robust learning in the transductive setting. For classes H of bounded VC dimension, we propose a simple transductive learner that, when presented with a set of labeled training examples and a set of unlabeled test examples (both sets possibly adversarially perturbed), correctly labels the test examples with a robust error rate that is linear in the VC dimension and adaptive to the complexity of the perturbation set. This result provides an exponential improvement in the dependence on VC dimension over the best known upper bound on the robust error in the inductive setting, at the expense of competing with a more restrictive notion of optimal robust error.
We present and analyze an algorithm for optimizing smooth and convex or strongly convex objectives using minibatch stochastic gradient estimates. The algorithm is optimal with respect to its dependence on both the minibatch size and the minimum expected loss simultaneously. This improves over the optimal method of Lan, which is insensitive to the minimum expected loss; over the optimistic acceleration of Cotter et al., which has suboptimal dependence on the minibatch size; and over the algorithm of Liu and Belkin, which is limited to least squares problems and is similarly suboptimal. Applied to interpolation learning, the improvement over Cotter et al. and Liu and Belkin translates to a linear, rather than square-root, parallelization speedup.
We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class’s Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of Bartlett et al. (2020) for minimum-norm interpolators, and confirms a prediction of Zhou et al. (2020) for near-minimal-norm interpolators in the special case of Gaussian data. We demonstrate the generality of the bound by applying it to the simplex, obtaining a novel consistency result for minimum ℓ1-norm interpolators (basis pursuit). Our results show how norm-based generalization bounds can explain and be used to analyze benign overfitting, at least in some settings.
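As a concrete reference point (my own illustration; the dimensions, sparsity pattern, and noise level are arbitrary), the two interpolators discussed in this abstract can be computed as follows: the minimum ℓ2-norm interpolator via the pseudoinverse, and the minimum ℓ1-norm interpolator (basis pursuit) via a standard linear-programming reformulation solved with scipy.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n, d = 50, 200                                 # overparameterized Gaussian design
    beta_star = np.zeros(d)
    beta_star[:5] = 1.0
    X = rng.normal(size=(n, d))
    y = X @ beta_star + 0.1 * rng.normal(size=n)   # noisy labels, which we nevertheless interpolate

    # Minimum l2-norm interpolator: the least-norm solution of X beta = y.
    beta_l2 = np.linalg.pinv(X) @ y

    # Minimum l1-norm interpolator (basis pursuit): write beta = p - q with p, q >= 0
    # and minimize 1^T (p + q) subject to X (p - q) = y.
    res = linprog(c=np.ones(2 * d), A_eq=np.hstack([X, -X]), b_eq=y, method="highs")
    beta_l1 = res.x[:d] - res.x[d:]

    for name, b in [("min l2 norm", beta_l2), ("min l1 norm", beta_l1)]:
        print(name, "| max train residual:", np.max(np.abs(X @ b - y)),
              "| error ||b - beta*||:", np.linalg.norm(b - beta_star))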
We study the power of learning via mini-batch stochastic gradient descent (SGD) on the loss of a differentiable model or neural network, and ask what learning problems can be learnt using this paradigm. We show that SGD can always simulate learning with statistical queries (SQ), but its ability to go beyond that depends on the precision ρ of the gradients and the minibatch size b. With fine enough precision relative to minibatch size, namely when bρ is small enough, SGD can go beyond SQ learning and simulate any sample-based learning algorithm and thus its learning power is equivalent to that of PAC learning; this extends prior work that achieved this result for b=1. Moreover, with polynomially many bits of precision (i.e. when ρ is exponentially small), SGD can simulate PAC learning regardless of the batch size. On the other hand, when bρ² is large enough, the power of SGD is equivalent to that of SQ learning.
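To make the setting concrete, here is a sketch of minibatch SGD with gradients available only at finite precision, interpreting "precision ρ" as rounding each gradient coordinate to a grid of spacing ρ; this reading, the helper names quantize and sgd_step, and all constants are my own and not taken from the paper.

    import numpy as np

    def quantize(g, rho):
        """Round each gradient coordinate to the nearest multiple of rho (finite precision)."""
        return rho * np.round(g / rho)

    def sgd_step(w, X, y, b, rho, lr, rng):
        """One minibatch SGD step on squared loss, using only rho-precision gradients."""
        idx = rng.choice(len(y), size=b, replace=False)
        g = X[idx].T @ (X[idx] @ w - y[idx]) / b
        return w - lr * quantize(g, rho)

    rng = np.random.default_rng(0)
    n, d = 256, 10
    X = rng.normal(size=(n, d))
    w_star = rng.normal(size=d)
    y = X @ w_star
    w = np.zeros(d)
    for _ in range(2000):
        w = sgd_step(w, X, y, b=8, rho=1e-3, lr=0.05, rng=rng)
    print("distance to the target predictor:", np.linalg.norm(w - w_star))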
We propose and analyze a stochastic Newton algorithm for homogeneous distributed stochastic convex optimization, where each machine can calculate stochastic gradients of the same population objective, as well as stochastic Hessian-vector products (products of an independent unbiased estimator of the Hessian of the population objective with arbitrary vectors), with many such stochastic computations performed between rounds of communication. We show that our method can reduce the number and frequency of required communication rounds compared to existing methods without hurting performance, by proving convergence guarantees for quasi-self-concordant objectives (e.g., logistic regression), alongside empirical evidence.
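For illustration, a minimal sketch of the basic primitive this abstract assumes each machine can compute: an unbiased stochastic Hessian-vector product for logistic regression, estimated from a random minibatch. The function name, data, and batch size are illustrative; this is not the paper's distributed algorithm, only the local oracle it builds on.

    import numpy as np

    def stochastic_hvp(w, v, X, y, batch, rng):
        """Unbiased minibatch estimate of H(w) v for logistic loss.

        X: (n, d) features, y: (n,) labels in {-1, +1}, v: (d,) arbitrary vector.
        The logistic-loss Hessian is (1/n) * sum_i s_i (1 - s_i) x_i x_i^T with
        s_i = sigmoid(-y_i x_i^T w), so averaging over a random minibatch is unbiased.
        """
        idx = rng.choice(len(y), size=batch, replace=False)
        Xb, yb = X[idx], y[idx]
        s = 1.0 / (1.0 + np.exp(yb * (Xb @ w)))    # sigmoid(-y_i x_i^T w)
        return Xb.T @ ((s * (1.0 - s)) * (Xb @ v)) / batch

    rng = np.random.default_rng(0)
    n, d = 1000, 20
    X = rng.normal(size=(n, d))
    y = np.sign(X @ rng.normal(size=d))
    w, v = rng.normal(size=d), rng.normal(size=d)
    print(stochastic_hvp(w, v, X, y, batch=64, rng=rng))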
For different parameterizations (mappings from parameters to predictors), we study the regularization cost in predictor space induced by ℓ2 regularization on the parameters (weights). We focus on linear neural networks as parameterizations of linear predictors. We identify the representation cost of certain sparse linear ConvNets and residual networks. In order to get a better understanding of how the architecture and parameterization affect the representation cost, we also study the reverse problem, identifying which regularizers on linear predictors (e.g., ℓp norms, group norms, the k-support norm, elastic net) can be the representation cost induced by simple ℓ2 regularization, and designing the parameterizations that do so.
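As a small sanity check of the simplest case (my own toy example, not taken from the paper): for the diagonal parameterization beta = u*v (elementwise), the minimal parameter cost ||u||^2 + ||v||^2 over all (u, v) realizing a given beta equals 2*||beta||_1, i.e. ℓ2 regularization on the weights induces an ℓ1 penalty on the predictor. The numerical check below uses scipy's SLSQP solver; the vector beta and the starting point are arbitrary.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    beta = rng.normal(size=5)                      # an arbitrary target linear predictor
    d = beta.size

    def param_cost(z):
        u, v = z[:d], z[d:]
        return np.sum(u ** 2) + np.sum(v ** 2)     # the l2 cost on the weights

    constraint = {"type": "eq", "fun": lambda z: z[:d] * z[d:] - beta}   # enforce u * v = beta
    z0 = np.concatenate([np.sign(beta) * np.sqrt(np.abs(beta)) + 0.1,
                         np.sqrt(np.abs(beta)) + 0.1])                   # a nearby starting point
    res = minimize(param_cost, z0, method="SLSQP", constraints=[constraint])
    print("min ||u||^2 + ||v||^2 :", res.fun)
    print("2 * ||beta||_1        :", 2 * np.abs(beta).sum())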
We investigate the capacity control provided by dropout in various machine learning problems. First, we study dropout for matrix completion, where it induces a distribution-dependent regularizer that equals the weighted trace-norm of the product of the factors. In deep learning, we show that the distribution-dependent regularizer due to dropout directly controls the Rademacher complexity of the underlying class of deep neural networks. These developments enable us to give concrete generalization error bounds for the dropout algorithm both in matrix completion and in training deep neural networks.
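For reference, a generic inverted-dropout step of the kind this abstract analyzes (a standard formulation, not code from the paper; the keep-probability and shapes are illustrative): each unit is kept with probability p and rescaled by 1/p, so the dropped-out activations are unbiased for the original ones.

    import numpy as np

    def dropout(h, p, rng):
        """Keep each activation with probability p and rescale by 1/p (inverted dropout)."""
        mask = rng.random(h.shape) < p
        return np.where(mask, h / p, 0.0)

    rng = np.random.default_rng(0)
    h = rng.normal(size=(4, 8))                    # a batch of hidden activations
    print(dropout(h, p=0.8, rng=rng))
    # Averaging many independent dropout masks recovers h, i.e. the layer output is unbiased.
    avg = np.mean([dropout(h, 0.8, np.random.default_rng(s)) for s in range(5000)], axis=0)
    print("E[dropout(h)] close to h:", np.allclose(avg, h, atol=0.1))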
Recent work has highlighted the role of initialization scale in determining the structure of the solutions that gradient methods converge to. In particular, it was shown that large initialization leads to the neural tangent kernel regime solution, whereas small initialization leads to so-called “rich regimes”. However, the initialization structure is richer than the overall scale alone and involves the relative magnitudes of different weights and layers in the network. Here we show that these relative scales, which we refer to as initialization shape, play an important role in determining the learned model. We develop a novel technique for deriving the inductive bias of gradient flow and use it to obtain closed-form implicit regularizers for multiple cases of interest.