no code implementations • 22 Apr 2024 • Yujin Han, Difan Zou
GIC trains a spurious attribute classifier based on two key properties of spurious correlations: (1) high correlation between spurious attributes and true labels, and (2) variability in this correlation between datasets with different group distributions.
no code implementations • 18 Apr 2024 • Kun Zhai, Yifeng Gao, Xingjun Ma, Difan Zou, Guangnan Ye, Yu-Gang Jiang
In this paper, we study the convergence of FL on non-IID data and propose a novel \emph{Dog Walking Theory} to formulate and identify the missing element in existing research.
no code implementations • 2 Apr 2024 • Xingwu Chen, Difan Zou
Specifically, we design a novel set of sequence learning tasks to systematically evaluate and understand how the depth of a transformer affects its ability to perform memorization, reasoning, generalization, and contextual generalization.
no code implementations • 26 Mar 2024 • Yifan Hao, Yong Lin, Difan Zou, Tong Zhang
We demonstrate that in this scenario, further increasing the model's parameterization can significantly reduce the OOD loss.
no code implementations • 13 Mar 2024 • Junwei Su, Difan Zou, Chuan Wu
In this paper, we study the generalization performance of SGD with preconditioning for the least squares problem.
no code implementations • 10 Mar 2024 • Xunpeng Huang, Hanze Dong, Difan Zou, Tong Zhang
Along this line, Freund et al. (2022) suggest that the modified Langevin algorithm with prior diffusion converges with a dimension-independent rate for strongly log-concave target distributions.
no code implementations • 20 Feb 2024 • Junwei Su, Difan Zou, Zijun Zhang, Chuan Wu
We provide a formal formulation and analysis of the problem, and propose a novel regularization-based technique, Structural-Shift-Risk-Mitigation (SSRM), to mitigate the impact of structural shift on catastrophic forgetting in the inductive NGIL problem.
no code implementations • 6 Feb 2024 • Junwei Su, Difan Zou, Chuan Wu
Memory-based Dynamic Graph Neural Networks (MDGNNs) are a family of dynamic graph neural networks that leverage a memory module to extract, distill, and memorize long-term temporal dependencies, leading to superior performance compared to memory-less counterparts.
no code implementations • 12 Jan 2024 • Xunpeng Huang, Difan Zou, Hanze Dong, Yian Ma, Tong Zhang
Specifically, DMC follows the reverse SDE of a diffusion process that transforms the target distribution into the standard Gaussian, using non-parametric score estimation.
no code implementations • 26 Oct 2023 • Miao Lu, Beining Wu, Xiaodong Yang, Difan Zou
In view of this finding, we call such a phenomenon "benign oscillation".
no code implementations • 12 Oct 2023 • Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, Peter L. Bartlett
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities, enabling them to solve unseen tasks solely based on input contexts without adjusting model parameters.
no code implementations • 5 Oct 2023 • Xu Luo, Difan Zou, Lianli Gao, Zenglin Xu, Jingkuan Song
Transferring a pretrained model to a downstream task can be as easy as conducting linear probing with target data, that is, training a linear classifier upon frozen features extracted from the pretrained model.
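Linear probing as described here has a simple form: freeze the pretrained backbone and train only a linear head on its features. The sketch below is illustrative, not from the paper; the random ReLU projection stands in for a real pretrained backbone, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_features(x, W):
    # stand-in for a pretrained backbone: a fixed random projection + ReLU
    return np.maximum(x @ W, 0.0)

# toy target data: two Gaussian classes in 2-D
n = 200
x = np.vstack([rng.normal(-1, 1, (n, 2)), rng.normal(1, 1, (n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

W = rng.normal(size=(2, 16))        # backbone weights stay frozen
z = frozen_features(x, W)           # "extracted" frozen features

# train only the linear head with gradient descent on the logistic loss
w, b = np.zeros(z.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(z @ w + b)))
    g = p - y
    w -= 0.1 * z.T @ g / len(y)
    b -= 0.1 * g.mean()

acc = ((z @ w + b > 0) == (y == 1)).mean()
```

Only `w` and `b` are updated; the backbone weights `W` never change, which is what makes linear probing cheap relative to full fine-tuning.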
no code implementations • 3 Oct 2023 • Xuran Meng, Difan Zou, Yuan Cao
Modern deep learning models are usually highly over-parameterized so that they can overfit the training data.
no code implementations • 20 Jun 2023 • Yuan Cao, Difan Zou, Yuanzhi Li, Quanquan Gu
We show that when learning a linear model with batch normalization for binary classification, gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate.
no code implementations • 31 Mar 2023 • Xuran Meng, Yuan Cao, Difan Zou
In this paper, we explore the per-example gradient regularization (PEGR) and present a theoretical analysis that demonstrates its effectiveness in improving both test error and robustness against noise perturbations.
no code implementations • 15 Mar 2023 • Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu
We consider a feature-noise data model and show that Mixup training can effectively learn the rare features (appearing in a small fraction of data) from its mixture with the common features (appearing in a large fraction of data).
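For reference, Mixup training replaces each batch with convex combinations of random example pairs, mixing labels with the same coefficient. This is a minimal sketch of the standard Mixup operation (Zhang et al.), not the paper's feature-noise analysis; the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup_batch(x, y, alpha=0.2):
    """Mix random example pairs: x_mix = lam*x_i + (1-lam)*x_j,
    with labels mixed by the same lam drawn from Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    idx = rng.permutation(len(x))
    x_mix = lam * x + (1 - lam) * x[idx]
    y_mix = lam * y + (1 - lam) * y[idx]
    return x_mix, y_mix

x = rng.normal(size=(8, 4))
y = np.eye(2)[rng.integers(0, 2, size=8)]   # one-hot labels
x_mix, y_mix = mixup_batch(x, y)
```

Because each mixed label is a convex combination of one-hot vectors, every row of `y_mix` still sums to one.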
no code implementations • 3 Mar 2023 • Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, Sham M. Kakade
On the other hand, we provide some negative results for stochastic gradient descent (SGD) for ReLU regression with symmetric Bernoulli data: if the model is well-specified, the excess risk of SGD is provably no better than that of GLM-tron ignoring constant factors, for each problem instance; and in the noiseless case, GLM-tron can achieve a small risk while SGD unavoidably suffers from a constant risk in expectation.
no code implementations • 3 Aug 2022 • Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, Sham M. Kakade
Our bounds suggest that for a large class of linear regression instances, transfer learning with $O(N^2)$ source data (and scarce or no target data) is as effective as supervised learning with $N$ target data.
no code implementations • 7 Mar 2022 • Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Sham M. Kakade
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
no code implementations • 12 Oct 2021 • Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, Sham M. Kakade
In this paper, we provide a problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize, for (overparameterized) linear regression problems.
no code implementations • 25 Aug 2021 • Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu
In this paper, we provide a theoretical explanation for this phenomenon: we show that in the nonconvex setting of learning over-parameterized two-layer convolutional neural networks starting from the same random initialization, for a class of data distributions (inspired from image data), Adam and gradient descent (GD) can converge to different global solutions of the training objective with provably different generalization errors, even with weight decay regularization.
no code implementations • NeurIPS 2021 • Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Dean P. Foster, Sham M. Kakade
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which has been hypothesized to play an important role in the generalization of modern machine learning approaches.
no code implementations • 25 Jun 2021 • Spencer Frei, Difan Zou, Zixiang Chen, Quanquan Gu
We show that there exists a universal constant $C_{\mathrm{err}}>0$ such that if a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$, then for any $\varepsilon>0$, an iterative self-training algorithm initialized at $\boldsymbol{\beta}_0 := \boldsymbol{\beta}_{\mathrm{pl}}$ using pseudolabels $\hat y = \mathrm{sgn}(\langle \boldsymbol{\beta}_t, \mathbf{x}\rangle)$ and using at most $\tilde O(d/\varepsilon^2)$ unlabeled examples suffices to learn the Bayes-optimal classifier up to $\varepsilon$ error, where $d$ is the ambient dimension.
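The self-training loop in this result is easy to state concretely: pseudolabel the unlabeled data with the current direction, then refit the direction to the pseudolabels. The sketch below is a toy illustration under assumed Gaussian-mixture data, not the paper's exact algorithm or analysis; the refit step is a simple averaging update rather than the paper's iteration.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy Gaussian mixture: x ~ N(y * mu, I), labels y in {-1, +1}
d, n = 5, 2000
mu = np.ones(d) / np.sqrt(d)                # unit-norm Bayes direction
y = rng.choice([-1.0, 1.0], size=n)
x = y[:, None] * mu + rng.normal(size=(n, d))

# a weak pseudolabeler: the Bayes direction plus a fixed perturbation
beta = mu + np.array([0.8, -0.6, 0.0, 0.0, 0.3])

# iterative self-training: pseudolabel, then refit the direction
for _ in range(10):
    y_hat = np.sign(x @ beta)                    # pseudolabels sgn(<beta_t, x>)
    beta = (y_hat[:, None] * x).mean(axis=0)     # refit to pseudolabels
    beta /= np.linalg.norm(beta)

align = beta @ mu   # alignment with the Bayes-optimal direction
```

In this toy setting the iteration pulls the direction toward `mu`, mirroring the qualitative claim that a moderately accurate pseudolabeler suffices to approach the Bayes-optimal classifier.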
no code implementations • 19 Apr 2021 • Difan Zou, Spencer Frei, Quanquan Gu
To the best of our knowledge, this is the first work to show that adversarial training provably yields robust classifiers in the presence of noise.
no code implementations • 23 Mar 2021 • Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Sham M. Kakade
More specifically, for SGD with iterate averaging, we demonstrate the sharpness of the established excess risk bound by proving a matching lower bound (up to constant factors).
no code implementations • ICLR 2021 • Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu
Understanding the algorithmic bias of \emph{stochastic gradient descent} (SGD) is one of the key challenges in modern machine learning and deep learning theory.
no code implementations • 19 Oct 2020 • Difan Zou, Pan Xu, Quanquan Gu
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
1 code implementation • ICLR 2020 • Yisen Wang, Difan Zou, Jin-Feng Yi, James Bailey, Xingjun Ma, Quanquan Gu
In this paper, we investigate the distinctive influence of misclassified and correctly classified examples on the final robustness of adversarial training.
no code implementations • ICLR 2020 • Difan Zou, Philip M. Long, Quanquan Gu
We further propose modified identity input and output transformations, and show that a $(d+k)$-wide neural network is sufficient to guarantee the global convergence of GD/SGD, where $d, k$ are the input and output dimensions respectively.
1 code implementation • NeurIPS 2019 • Difan Zou, Pan Xu, Quanquan Gu
Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) algorithms have received increasing attention in both theory and practice.
no code implementations • ICLR 2021 • Zixiang Chen, Yuan Cao, Difan Zou, Quanquan Gu
A recent line of research on deep learning focuses on the extremely over-parameterized setting, and shows that when the network width is larger than a high degree polynomial of the training sample size $n$ and the inverse of the target error $\epsilon^{-1}$, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees.
1 code implementation • NeurIPS 2019 • Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, Quanquan Gu
Original full-batch GCN training requires calculating the representation of all the nodes in the graph per GCN layer, which brings in high computation and memory costs.
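The per-layer cost mentioned here comes from the full-batch GCN update $H' = \sigma(\hat{A} H W)$, which touches every node in the graph at every layer. A minimal sketch of one such layer (standard GCN with symmetric normalization, not the paper's sampling method; all sizes are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# small undirected path graph
n, d_in, d_out = 6, 4, 3
A = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    A[i, j] = A[j, i] = 1.0

# symmetric normalization with self-loops: A_hat = D^{-1/2} (A + I) D^{-1/2}
A_tilde = A + np.eye(n)
deg = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(deg, deg))

H = rng.normal(size=(n, d_in))              # features for *all* n nodes
W = rng.normal(size=(d_in, d_out))
H_next = np.maximum(A_hat @ H @ W, 0.0)     # ReLU(A_hat H W), full batch
```

The `A_hat @ H` product is what couples every node to its neighbors, so stacking layers multiplies this whole-graph cost, which motivates sampling-based training.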
1 code implementation • 2 Nov 2019 • Bao Wang, Difan Zou, Quanquan Gu, Stanley Osher
As an important Markov Chain Monte Carlo (MCMC) method, stochastic gradient Langevin dynamics (SGLD) algorithm has achieved great success in Bayesian learning and posterior sampling.
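The SGLD update itself is a single line: a gradient step on the log-density plus injected Gaussian noise scaled by $\sqrt{2\eta}$. A minimal sketch of the standard algorithm (not this paper's variant), targeting a 1-D standard Gaussian where the score is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_p(x):
    # target: standard Gaussian, so grad log p(x) = -x
    return -x

eta, steps = 0.01, 20000
x = 5.0                                  # start far from the mode
trace = []
for _ in range(steps):
    # Langevin step: gradient drift + sqrt(2*eta) Gaussian noise
    x += eta * grad_log_p(x) + np.sqrt(2 * eta) * rng.normal()
    trace.append(x)

samples = np.array(trace[5000:])         # discard burn-in
```

In SGLD proper, `grad_log_p` is replaced by a stochastic minibatch estimate of the gradient, which is the source of the extra analysis the convergence results address.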
no code implementations • NeurIPS 2019 • Difan Zou, Quanquan Gu
A recent line of research has shown that gradient-based algorithms with random initialization can converge to the global minima of the training loss for over-parameterized (i.e., sufficiently wide) deep neural networks.
no code implementations • 21 Nov 2018 • Difan Zou, Yuan Cao, Dongruo Zhou, Quanquan Gu
In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data.
no code implementations • ICML 2018 • Difan Zou, Pan Xu, Quanquan Gu
We propose a fast stochastic Hamiltonian Monte Carlo (HMC) method for sampling from a smooth and strongly log-concave distribution.
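For context, one HMC proposal simulates Hamiltonian dynamics with a leapfrog integrator and then applies a Metropolis accept/reject on the total energy. This sketches plain HMC on a 1-D standard Gaussian (so $U(q) = q^2/2$), not the paper's stochastic variant; step size and trajectory length are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_U(q):              # U(q) = q^2 / 2  =>  grad U(q) = q
    return q

def leapfrog(q, p, eps, L):
    # symmetric half-step / full-step / half-step integration
    p = p - 0.5 * eps * grad_U(q)
    for _ in range(L - 1):
        q = q + eps * p
        p = p - eps * grad_U(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)
    return q, p

trace, q = [], 0.0
for _ in range(5000):
    p0 = rng.normal()                            # resample momentum
    q_new, p_new = leapfrog(q, p0, eps=0.2, L=10)
    # Metropolis correction on total energy H = U(q) + p^2/2
    dH = (q_new**2 + p_new**2 - q**2 - p0**2) / 2
    if rng.random() < np.exp(-dH):
        q = q_new
    trace.append(q)

samples = np.array(trace[1000:])
```

The stochastic variants replace `grad_U` with minibatch gradient estimates, trading per-step cost for extra noise that the convergence analysis must control.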
no code implementations • 11 Dec 2017 • Yaodong Yu, Difan Zou, Quanquan Gu
We propose a family of nonconvex optimization algorithms that are able to save gradient and negative curvature computations to a large extent, and are guaranteed to find an approximate local minimum with improved runtime complexity.
no code implementations • NeurIPS 2018 • Pan Xu, Jinghui Chen, Difan Zou, Quanquan Gu
Furthermore, for the first time we prove the global convergence guarantee for variance reduced stochastic gradient Langevin dynamics (SVRG-LD) to the almost minimizer within $\tilde O\big(\sqrt{n}d^5/(\lambda^4\epsilon^{5/2})\big)$ stochastic gradient evaluations, which outperforms the gradient complexities of GLD and SGLD in a wide regime.