My main research focus is on designing, analyzing and implementing scalable, robust and efficient nonlinear optimization algorithms.
Research Interests
Unconstrained Nonlinear Optimization Algorithms
Aggregated QuasiNewton Methods
A displacement aggregation strategy is proposed for the curvature pairs stored in a limitedmemory BFGS method such that the resulting (inverse) Hessian approximations are equal to those that would be derived from a fullmemory BFGS method. This means that, if a sufficiently large number of pairs are stored, then an optimization algorithm employing the limitedmemory method can achieve the same theoretical convergence properties as when fullmemory (inverse) Hessian approximations are stored and employed, such as a local superlinear rate of convergence under assumptions that are common for attaining such guarantees. To the best of our knowledge, this is the first work in which a local superlinear convergence rate guarantee is offered by a quasiNewton scheme that does not either store all curvature pairs throughout the entire run of the optimization algorithm or store an explicit (inverse) Hessian approximation.
Paper
 (Status: Accepted, Mathematical Programming) Albert S. Berahas, Frank E. Curtis and Baoyu Zhou. LimitedMemory BFGS with Displacement Aggregation. (Download PDF)
Talks
 (postponed due to COVID19) DataX Workshop (September 2020): Old and New Open Questions in Optimization, Princeton University, Princeton, NJ.
 ICCOPT 2019, International Conference on Continuous Optimization, Berlin, Germany. (slides)
 Numerical Analysis and Scientific Computing Seminar, November 2019, Courant Institute of Mathematical Sciences, New York University, New York, NY. (slides)
Global Convergence Rate Analysis of a Generic Line Search Algorithm with Noise
In this work, we develop convergence analysis of a modified line search method for objective functions whose value is computed with noise and whose gradient estimates are inexact and possibly random. The noise is assumed to be bounded in absolute value without any additional assumptions. We extend the framework based on stochastic methods from [Cartis & Scheinberg, 2018] which was developed to provide analysis of a standard line search method with exact function values and random gradients to the case of noisy function. We introduce a condition on the gradient which when satisfied with some sufficiently large probability at each iteration, guarantees convergence properties of the line search method. We derive expected complexity bounds for convex, strongly convex and nonconvex functions.
Paper
 (Status: Submitted) Albert S. Berahas, Liyuan Cao and Katya Scheinberg. Global Convergence Rate Analysis of a Generic Line Search Algorithm with Noise. (Download PDF)
Talks
 (virtual) INFORMS Annual Meeting 2020, Institute for Operations Research and the Management Sciences Conference, National Harbor, MD. (slides)
 (postponed due to COVID19) SIOPT 2020, SIAM Conference on Optimization, Hong Kong.
 Joint Mathematics Meetings (JMM) 2020, Denver, CO. (slides)
 Young Researchers Workshop 2019, ORIE, Cornell University, Ithaca, NY. (slides)
Constrained Stochastic Optimization
Sequential Quadratic Optimization for Nonlinear Equality Constrained Stochastic Optimization
Sequential quadratic optimization algorithms are proposed for solving smooth nonlinear optimization problems with equality constraints. The main focus is an algorithm proposed for the case when the constraint functions are deterministic, and constraint function and derivative values can be computed explicitly, but the objective function is stochastic. It is assumed in this setting that it is intractable to compute objective function and derivative values explicitly, although one can compute stochastic function and gradient estimates. As a starting point for this stochastic setting, an algorithm is proposed for the deterministic setting that is modeled after a stateoftheart linesearch SQP algorithm, but uses a stepsize selection scheme based on Lipschitz constants (or adaptively estimated Lipschitz constants) in place of the line search. This sets the stage for the proposed algorithm for the stochastic setting, for which it is assumed that line searches would be intractable. Under reasonable assumptions, convergence (resp.,~convergence in expectation) from remote starting points is proved for the proposed deterministic (resp.,~stochastic) algorithm. The results of numerical experiments demonstrate the practical performance of our proposed techniques.
Papers
 (Status: Coming soon!) Albert S. Berahas, Frank E. Curtis and Daniel P. Robinson. Sequential Quadratic Optimization for Nonlinear Equality Constrained Stochastic Optimization with Rank Deficient Jacobians.
 (Status: Accepted, SIAM Journal on Optimization) Albert S. Berahas, Frank E. Curtis, Daniel P. Robinson and Baoyu Zhou. Sequential Quadratic Optimization for Nonlinear Equality Constrained Stochastic Optimization. (Download PDF)
Talk
 (upcoming, virtual) SIOPT 2021, SIAM Conference on Optimization, Spokane, WA.
 (upcoming, virtual) 3rd IMA and OR Society Conference on Mathematics of Operational Research.
Optimization Algorithms for Machine Learning
Sampled QuasiNewton Methods
In this work, we present two sampled quasiNewton methods for deep learning: sampled LBFGS (SLBFGS) and sampled LSR1 (SLSR1). Contrary to the classical variants of these methods that sequentially build Hessian or inverse Hessian approximations as the optimization progresses, our proposed methods sample points randomly around the current iterate at every iteration to produce these approximations. As a result, the approximations constructed make use of more reliable (recent and local) information, and do not depend on past iterate information that could be significantly stale. Our proposed algorithms are efficient in terms of accessed data points (epochs) and have enough concurrency to take advantage of parallel/distributed computing environments. We provide convergence guarantees for our proposed methods. Numerical tests on a toy classification problem as well as on popular benchmarking neural network training tasks reveal that the methods outperform their classical variants and are competitive with stateoftheart firstorder methods such as ADAM. We also develop a distributed variant of the SLSR1 method (DSLSR1). Contrary to a naive distributed implementation, DSLSR1 drastically reduces the amount of data communicated at every iteration, has favorable workload balancing across nodes and is matrixfree and inversefree. The method scales well in terms of both the dimension of the problem and the number of data points. We illustrate the performance of DSLSR1 on standard neural network training tasks.
Papers
 (Status: Coming soon!) Albert S. Berahas, Majid Jahani and Martin Takáč. Fast, Stochastic and Adaptive Sampled QuasiNewton Methods.
 (Status: Submitted) Albert S. Berahas, Majid Jahani and Martin Takáč. QuasiNewton Methods for Deep Learning: Forget the Past, Just Sample. (Download PDF)
 Majid Jahani, Mohammadreza Nazari, Sergey Rusakov, Albert S. Berahas and Martin Takáč. Scaling Up QuasiNewton Algorithms: Communication Efficient Distributed SR1. 6th Annual Conference on Machine Learning, Optimization and Data Science (LOD), 2020.(Download PDF)
 Albert S. Berahas, Majid Jahani and Martin Takáč. Sampled QuasiNewton Methods for Deep Learning. OPT 2019: Optimization for Machine Learning Workshop (NeurIPS 2019) (Download PDF)
Talks
 (postponed due to COVID19) MDS 2020: SIAM Conference on Mathematics of Data Science, Cincinnati, OH.
 INFORMS 2019, Institute for Operations Research and the Management Sciences Conference, Seattle, WA. (slides)
Deterministic and Stochastic Hybrid Methods
This work presents a new algorithm for empirical risk minimization. The algorithm bridges the gap between first and secondorder methods by computing a search direction that uses a secondordertype update in one subspace, coupled with a scaled steepest descent step in the orthogonal complement. To this end, partial curvature information is incorporated to help with illconditioning, while simultaneously allowing the algorithm to scale to the large problem dimensions often encountered in machine learning applications. Theoretical results are presented to confirm that the algorithm converges to a stationary point in both the strongly convex and nonconvex cases. A stochastic variant of the algorithm is also presented, along with corresponding theoretical guarantees. Numerical results confirm the strengths of the new approach on standard machine learning problems.
Papers
 (Status: Coming soon!) Majid Jahani, Mohammadreza Nazari, Rachael Tappenden, Albert S. Berahas and Martin Takáč. Deterministic and Stochastic Hybrid Methods.
 (Status: Coming soon!) Majid Jahani, Mohammadreza Nazari, Rachael Tappenden, Albert S. Berahas and Martin Takáč. An Adaptive Stochastic Hybrid Method for Machine Learning.
 (Status: Accepted, 24th International Conference on Artificial Intelligence and Statistics) Majid Jahani, Mohammadreza Nazari, Rachael Tappenden, Albert S. Berahas and Martin Takáč. SONIA: A Symmetric Blockwise Truncated Optimization Algorithm. (Download PDF)
Talk
 (upcoming, virtual) SIAM CSE 2021, SIAM Conference on Computational Science and Engineering, Fort Worth, TX.
NewtonSketch and Subsampled Newton Methods
The concepts of sketching and subsampling have recently received much attention by the optimization and statistics communities. In this work, we study NewtonSketch and Subsampled Newton (SSN) methods for the finitesum optimization problem. We consider practical versions of the two methods in which the Newton equations are solved approximately using the conjugate gradient (CG) method or a stochastic gradient iteration. We establish new complexity results for the SSNCG method that exploit the spectral properties of CG. Controlled numerical experiments compare the relative strengths of NewtonSketch and SSN methods and show that for many finitesum problems, they are far more efficient than SVRG, a popular firstorder method.
Paper
 Albert S. Berahas, Raghu Bollapragada and Jorge Nocedal. An Investigation of NewtonSketch and Subsampled Newton Methods. Optimization Methods and Software, 2020, 35(4), pp. 661–680. (Download PDF, OMS Online, Supplementary Material)
Talk
 SIOPT 2017, SIAM Conference on Optimization, Vancouver, Canada. (slides)
Robust Stochastic QuasiNewton Methods for Machine Learning
The question of how to parallelize the stochastic gradient (SG) method has received much attention in the literature. In this work, we focus instead on batch methods that use a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employ secondorder information. We describe an implementation of the LBFGS method designed to deal with two adversarial situations. The first occurs in distributed computing environments where some of the computational nodes devoted to the evaluation of the function and gradient are unable to return results on time. A similar challenge occurs in a multibatch approach in which the data points used to compute the function and gradient are purposely changed at each iteration to accelerate the learning process. The aforementioned settings inherently give the algorithm a stochastic flavor that can cause instability in LBFGS. These difficulties arise because LBFGS employs gradient differences to update the Hessian approximations; when these gradients are computed using different data points the process can be unstable. This work shows how to perform stable quasiNewton updating in the multibatch setting, illustrates the behavior of the algorithm in a distributed computing platform on binary classification logistic regression and neural network training problems that arise in machine learning, and studies its convergence properties for both the convex and nonconvex cases.
Papers
 Albert S. Berahas and Martin Takáč, A Robust MultiBatch LBFGS Method for Machine Learning. Optimization Methods and Software, 2020, 35(1), pp. 191219. (Download PDF, OMS Online, Supplementary Material)
 Albert S. Berahas, Jorge Nocedal and Martin Takáč, A MultiBatch LBFGS Method for Machine Learning. 2016 Advances In Neural Information Processing Systems (NeurIPS), Barcelona, Spain, Dec 2016, pp. 10551063. (Download PDF, Supplementary Material)
Talk
 CASSC 2017, SIAM Student Conference (Chicago area), Evanston, IL. (slides)
Stochastic QuasiNewton Methods for Training Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are powerful models that achieve exceptional performance on several pattern recognition problems. However, training of RNNs is a computationally difficult task owing to the wellknown “vanishing/exploding” gradient problem. Algorithms proposed for training RNNs either exploit no (or limited) curvature information and have cheap periteration complexity, or attempt to gain significant curvature information at the cost of increased periteration cost. The former set includes diagonallyscaled firstorder methods such as ADAM and ADAGRAD, while the latter consists of secondorder algorithms like HessianFree Newton and KFAC. In this work, we present adaQN, a stochastic quasiNewton algorithm for training RNNs. Our approach retains a low periteration cost while allowing for nondiagonal scaling through a stochastic LBFGS updating scheme. The method uses a novel LBFGS scaling initialization scheme and is judicious in storing and retaining LBFGS curvature pairs. We present numerical experiments on two language modeling tasks and show that adaQN is competitive with popular RNN training algorithms.
Paper
 Nitish Shirish Keskar and Albert S. Berahas, adaQN: An adaptive quasiNewton algorithm for training RNNs. 2016 European Conference Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Riva del Garda, Italy, Sept 2016, pp. 116. (Download PDF)
Talks
 ECML 2016, European Conference on Machine Learning, Riva del Garda, Italy. (slides)
 CASSC 2016, SIAM Student Conference (Chicago area), Chicago, IL. (slides)
DerivativeFree Optimization
DerivativeFree Optimization of Noisy Function via QuasiNewton Methods
In this work, we present a finite difference quasiNewton method for the minimization of noisy functions. The method takes advantage of the scalability and power of BFGS updating, and employs an adaptive procedure for choosing the differencing interval (h) based on the noise estimation techniques of Hamming (2012) and Moré and Wild (2011). This noise estimation procedure and the selection of h are inexpensive but not always accurate, and to prevent failures the algorithm incorporates a recovery mechanism that takes appropriate action in the case when the line search procedure is unable to produce an acceptable point. A novel convergence analysis is presented that considers the effect of a noisy line search procedure. Numerical experiments comparing the method to a model based trust region method are presented.
Paper
 Albert S. Berahas, Richard Byrd and Jorge Nocedal. DerivativeFree Optimization of Noisy Functions via QuasiNewton Methods. SIAM Journal on Optimization, 2019, 29(2), pp. 965993. (Download PDF, SIOPT Online)
Talks
 ISMP 2018, International Symposium on Mathematical Programming, Bordeaux, France. (slides)
 MOPTA 2018, Modeling and Optimization: Theory and Applications, Bethlehem, PA. (slides)
 INFORMS 2017, Institute for Operations Research and the Management Sciences Conference, Houston, TX. (slides)
Gradient Approximations in DerivativeFree Optimization
In this work, we analyze several methods for approximating the gradient of a function using only function values. These methods include finite differences, linear interpolation, Gaussian smoothing and smoothing on a unit sphere. The methods differ in the number of functions sampled, the choice of the sample points and the way in which the gradient approximations are derived. For each method, we derive bounds on the number of samples and the sampling radius which guarantee favorable convergence properties for a line search or fixed step size descent method. To this end, we derive one common condition on the accuracy of gradient approximations which guarantees these convergence properties and then show how each method can satisfy this condition. We analyze the convergence properties even when this condition is only satisfied with some sufficiently large probability at each iteration, as happens to be the case with Gaussian smoothing and smoothing on a unit sphere. Finally, we present numerical results evaluating the quality of the gradient approximations as well as their performance in conjunction with a line search derivativefree optimization algorithm.
Papers
 (Status: Submitted) Albert S. Berahas, Liyuan Cao, Krzysztof Choromanski and Katya Scheinberg. A Theoretical and Empirical Comparison of Gradient Approximations in DerivativeFree Optimization. (Download PDF)
 Albert S. Berahas, Liyuan Cao, Krzysztof Choromanski and Katya Scheinberg. Linear interpolation gives better gradients than Gaussian smoothing in derivativefree optimization. Technical Report, Lehigh University 2019. (Download PDF)
Talk

CSE 2019, SIAM Conference on Computational Science and Engineering, Spokane, WA. (slides)
FullLow Evaluation Methods for DerivativeFree Optimization
We propose a new class of directional methods for DerivativeFree Optimization that considers two types of iterations. The first type is expensive in function evaluations, but exhibits good performance in the smooth, nonnoisy case. The instance considered is BFGS computed over gradients approximated by finite differences. The second type is cheap in function evaluations, more appropriate under the presence of noise or nonsmoothness. The instance considered is probabilistic direct search with 1 or 2 random directions.The resulting FullLow Evaluation method is globally convergent even in the nonsmooth case, and yields the appropriate rates in the smooth case. Preliminary results show that is efficient and robust across problems with different levels of smoothness and noise. Our approach can be extended to linear constraints using approximate tangent cones.
Paper
 (Status: Coming soon!) Albert S. Berahas, Oumaima Sohab and Luis Nunes Vicente. FullLow Evaluation Methods for DerivativeFree Optimization.
Distributed Optimization
Communication and Computation Efficient Methods for Distributed Optimization
A distributed optimization method typically consists of two key components: communication and computation. The standard way of judging an algorithm via only the number of iterations overlooks the complexity associated with each iteration. Furthermore, various applications deploying distributed methods may prefer a different composition of communication and computation. Motivated by this discrepancy, in this work we propose an adaptive cost framework which adjusts the cost measure depending on the features of various applications. Moreover, we present a flexible algorithmic framework, where communication and computation steps are explicitly decomposed to enable algorithm customization for various applications. We apply this framework to the wellknown distributed gradient descent (DGD) method, and develop a class of customized algorithms that compare favorably to their base algorithms, both theoretically and empirically. We test the performance and illustrate the flexibility of the methods on quadratic functions and classification problems that arise in machine learning, in terms of iterations, gradient evaluations, communications as well as the proposed cost framework.
Papers
 (Status: Submitted) Albert S. Berahas, Raghu Bollapragada and Ermin Wei. On the Convergence of Nested Distributed Gradient Methods with Multiple Consensus and Gradient Steps. (Download PDF)
 Albert S. Berahas, Raghu Bollapragada, Nitish Shirish Keskar and Ermin Wei. Balancing Communication and Computation in Distributed Optimization. IEEE Transactions on Automatic Control, 2019, 64(8), pp. 31413155. (Download PDF, IEEE Xplore)
Talk
 IFORS 2017, International Federations on Operations Research Societies Conference, Quebec city, Canada. (slides)
Adaptive Methods with Quantized Communication
We consider the minimization of a sum of local convex objective functions in a distributed setting, where communication can be costly. We propose and analyze a class of nested distributed gradient methods with adaptive quantized communication. We show the effect of performing multiple quantized communication steps on the rate of convergence and on the size of the neighborhood of convergence, and prove RLinear convergence to the exact solution with increasing number of consensus steps and adaptive quantization. We test the performance of the method, as well as some practical variants, on quadratic functions, and show the effects of multiple quantized communication steps in terms of iterations/gradient evaluations, communication and cost.
Paper
 Albert S. Berahas, Charikleia Iakovidou and Ermin Wei. Nested Distributed Gradient Methods with Adaptive Quantized Communication. 58th IEEE Conference on Decision and Control (CDC), Nice, France, 2019, pp. 15191525. (Download PDF, IEEE Xplore)
Applications of Machine Learning
Fast Prediction of Partial Differential Equations
Discovering the underlying behavior of complex systems is an important topic in many science and engineering disciplines. In this paper, we propose a novel neural network framework, finite difference neural networks (FDNet), to learn partial differential equations from data. Specifically, our proposed finite difference inspired network is designed to learn the underlying governing partial differential equations from trajectory data, and to iteratively estimate the future dynamical behavior using only a few trainable parameters. We illustrate the performance (predictive power) of our framework on the heat equation, with and without noise and/or forcing, and compare our results to the Forward Euler method. Moreover, we show the advantages of using a HessianFree Trust Region method to train the network.
Paper
 Zheng Shi, Nur Sila Gulgec, Albert S. Berahas, Shamim N. Pakzad, Martin Takáč. Finite Difference Neural Networks: Fast Prediction of Partial Differential Equations. 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), 2020, pp. 130135. (Download PDF)
Face Recognition
We present work on two fronts: (1) a novel approach to face recognition, and (2) a general framework for robust error estimation in face recognition.
With regards to (1), we propose an adaptation and extension of the stateoftheart methods in face recognition. Effectively, our method combines the sparsitybased approaches with additional leastsquares steps and exhibits robustness to outliers achieving significant performance improvement with little additional cost. With regards to (2), we propose a general framework that allows the simultaneous use of various loss functions for modeling the residual in face images. Our method extends the current literature offering flexibility in the selection of the residual modeling characteristics but, at the same time, considering many existing algorithms as special cases.
Papers
 Michael Iliadis, Leonidas Spinoulas, Albert S. Berahas, Haohong Wang, and Aggelos K. Katsaggelos, Multimodel robust error correction for face recognition. 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, Arizona, Sept 2016, pp. 32293233. (IEEE Xplore)
 Michael Iliadis, Leonidas Spinoulas, Albert S. Berahas, Haohong Wang, and Aggelos K. Katsaggelos, Sparse representation and least squaresbased classification in face recognition. 2014 IEEE 22nd European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, Sept 2014, pp. 526530. (IEEE Xplore)