Publication

On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems

Related concepts (32)

In the mathematical field of analysis, uniform convergence is a mode of convergence of functions stronger than pointwise convergence. A sequence of functions converges uniformly to a limiting function on a set as the function domain if, given any arbitrarily small positive number , a number can be found such that each of the functions differs from by no more than at every point in .

Probability

Probability is the branch of mathematics concerning numerical descriptions of how likely an event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and 1, where, roughly speaking, 0 indicates impossibility of the event and 1 indicates certainty. The higher the probability of an event, the more likely it is that the event will occur. A simple example is the tossing of a fair (unbiased) coin.

Rate of convergence

In numerical analysis, the order of convergence and the rate of convergence of a convergent sequence are quantities that represent how quickly the sequence approaches its limit. A sequence that converges to is said to have order of convergence and rate of convergence if The rate of convergence is also called the asymptotic error constant. Note that this terminology is not standardized and some authors will use rate where this article uses order (e.g., ).

Probability theory

Probability theory or probability calculus is the branch of mathematics concerned with probability. Although there are several different probability interpretations, probability theory treats the concept in a rigorous mathematical manner by expressing it through a set of axioms. Typically these axioms formalise probability in terms of a probability space, which assigns a measure taking values between 0 and 1, termed the probability measure, to a set of outcomes called the sample space.

Probability space

In probability theory, a probability space or a probability triple is a mathematical construct that provides a formal model of a random process or "experiment". For example, one can define a probability space which models the throwing of a die. A probability space consists of three elements: A sample space, , which is the set of all possible outcomes. An event space, which is a set of events, , an event being a set of outcomes in the sample space. A probability function, , which assigns each event in the event space a probability, which is a number between 0 and 1.

Probability distribution

In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomenon in terms of its sample space and the probabilities of events (subsets of the sample space). For instance, if X is used to denote the outcome of a coin toss ("the experiment"), then the probability distribution of X would take the value 0.5 (1 in 2 or 1/2) for X = heads, and 0.

Pointwise convergence

In mathematics, pointwise convergence is one of various senses in which a sequence of functions can converge to a particular function. It is weaker than uniform convergence, to which it is often compared. Suppose that is a set and is a topological space, such as the real or complex numbers or a metric space, for example. A net or sequence of functions all having the same domain and codomain is said to converge pointwise to a given function often written as if (and only if) The function is said to be the pointwise limit function of the Sometimes, authors use the term bounded pointwise convergence when there is a constant such that .

Probability axioms

The Kolmogorov axioms are the foundations of probability theory introduced by Russian mathematician Andrey Kolmogorov in 1933. These axioms remain central and have direct contributions to mathematics, the physical sciences, and real-world probability cases. An alternative approach to formalising probability, favoured by some Bayesians, is given by Cox's theorem. The assumptions as to setting up the axioms can be summarised as follows: Let be a measure space with being the probability of some event , and .

Almost surely

In probability theory, an event is said to happen almost surely (sometimes abbreviated as a.s.) if it happens with probability 1 (or Lebesgue measure 1). In other words, the set of possible exceptions may be non-empty, but it has probability 0. The concept is analogous to the concept of "almost everywhere" in measure theory.

Convergence of random variables

In probability theory, there exist several different notions of convergence of random variables. The convergence of sequences of random variables to some limit random variable is an important concept in probability theory, and its applications to statistics and stochastic processes. The same concepts are known in more general mathematics as stochastic convergence and they formalize the idea that a sequence of essentially random or unpredictable events can sometimes be expected to settle down into a behavior that is essentially unchanging when items far enough into the sequence are studied.

Conditional probability

In probability theory, conditional probability is a measure of the probability of an event occurring, given that another event (by assumption, presumption, assertion or evidence) has already occurred. This particular method relies on event B occurring with some sort of relationship with another event A. In this event, the event B can be analyzed by a conditional probability with respect to A. If the event of interest is A and the event B is known or assumed to have occurred, "the conditional probability of A given B", or "the probability of A under the condition B", is usually written as P(AB) or occasionally P_B(A).

Stochastic gradient descent

Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data).

Compact convergence

In mathematics compact convergence (or uniform convergence on compact sets) is a type of convergence that generalizes the idea of uniform convergence. It is associated with the compact-open topology. Let be a topological space and be a metric space. A sequence of functions is said to converge compactly as to some function if, for every compact set , uniformly on as . This means that for all compact , If and with their usual topologies, with , then converges compactly to the constant function with value 0, but not uniformly.

Learning rate

In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function. Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns". In the adaptive control literature, the learning rate is commonly referred to as gain. In setting a learning rate, there is a trade-off between the rate of convergence and overshooting.

Bayesian probability

Bayesian probability (ˈbeɪziən or ˈbeɪʒən ) is an interpretation of the concept of probability, in which, instead of frequency or propensity of some phenomenon, probability is interpreted as reasonable expectation representing a state of knowledge or as quantification of a personal belief. The Bayesian interpretation of probability can be seen as an extension of propositional logic that enables reasoning with hypotheses; that is, with propositions whose truth or falsity is unknown.

Global optimization

Global optimization is a branch of applied mathematics and numerical analysis that attempts to find the global minima or maxima of a function or a set of functions on a given set. It is usually described as a minimization problem because the maximization of the real-valued function is equivalent to the minimization of the function . Given a possibly nonlinear and non-convex continuous function with the global minima and the set of all global minimizers in , the standard minimization problem can be given as that is, finding and a global minimizer in ; where is a (not necessarily convex) compact set defined by inequalities .

Probability interpretations

The word probability has been used in a variety of ways since it was first applied to the mathematical study of games of chance. Does probability measure the real, physical, tendency of something to occur, or is it a measure of how strongly one believes it will occur, or does it draw on both these elements? In answering such questions, mathematicians interpret the probability values of probability theory. There are two broad categories of probability interpretations which can be called "physical" and "evidential" probabilities.

Gradient descent

In mathematics, gradient descent (also often called steepest descent) is a iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient will lead to a local maximum of that function; the procedure is then known as gradient ascent.

Radius of convergence

In mathematics, the radius of convergence of a power series is the radius of the largest disk at the center of the series in which the series converges. It is either a non-negative real number or . When it is positive, the power series converges absolutely and uniformly on compact sets inside the open disk of radius equal to the radius of convergence, and it is the Taylor series of the analytic function to which it converges.

Definite matrix

In mathematics, a symmetric matrix with real entries is positive-definite if the real number is positive for every nonzero real column vector where is the transpose of . More generally, a Hermitian matrix (that is, a complex matrix equal to its conjugate transpose) is positive-definite if the real number is positive for every nonzero complex column vector where denotes the conjugate transpose of Positive semi-definite matrices are defined similarly, except that the scalars and are required to be positive or zero (that is, nonnegative).