
A Tutorial on Energy-Based Learning

Yann LeCun, Sumit Chopra, Raia Hadsell, Marc’Aurelio Ranzato, and Fu Jie Huang
The Courant Institute of Mathematical Sciences, New York University
{yann,sumit,raia,ranzato,jhuangfu}@cs.nyu.edu
http://yann.lecun.com
v1.0, August 19, 2006
To appear in “Predicting Structured Data”, G. Bakir, T. Hofman, B. Schölkopf, A. Smola, B. Taskar (eds), MIT Press, 2006

Abstract

Energy-Based Models (EBMs) capture dependencies between variables by associating a scalar energy to each configuration of the variables. Inference consists in clamping the value of observed variables and finding configurations of the remaining variables that minimize the energy. Learning consists in finding an energy function in which observed configurations of the variables are given lower energies than unobserved ones. The EBM approach provides a common theoretical framework for many learning models, including traditional discriminative and generative approaches, as well as graph-transformer networks, conditional random fields, maximum margin Markov networks, and several manifold learning methods.

Probabilistic models must be properly normalized, which sometimes requires evaluating intractable integrals over the space of all possible variable configurations. Since EBMs have no requirement for proper normalization, this problem is naturally circumvented. EBMs can be viewed as a form of non-probabilistic factor graphs, and they provide considerably more flexibility in the design of architectures and training criteria than probabilistic approaches.

Introduction: Energy-Based Models

The main purpose of statistical modeling and machine learning is to encode dependencies between variables. By capturing those dependencies, a model can be used to answer questions about the values of unknown variables given the values of known variables.

Energy-Based Models (EBMs) capture dependencies by associating a scalar energy (a measure of compatibility) to each configuration of the variables. Inference, i.e., making a prediction or decision, consists in setting the value of observed variables and finding values of the remaining variables that minimize the energy. Learning consists in finding an energy function that associates low energies to correct values of the remaining variables, and higher energies to incorrect values. A loss functional, minimized during learning, is used to measure the quality of the available energy functions. Within this common inference/learning framework, the wide choice of energy functions and loss functionals allows for the design of many types of statistical models, both probabilistic and non-probabilistic.
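
In symbols, the two operations just described can be written compactly. Denote the energy by E(Y, X), the set of candidate answers by \mathcal{Y}, the trainable parameters by W, and the training set by \mathcal{S}; this notation is chosen here to match the conventions used in the rest of the tutorial, and \mathcal{L} stands for the loss functional mentioned above:

Y^* = \arg\min_{Y \in \mathcal{Y}} E(Y, X)    (inference: pick the lowest-energy answer for the observed X)

W^* = \arg\min_{W \in \mathcal{W}} \mathcal{L}(W, \mathcal{S})    (learning: pick, from the family of energy functions indexed by W, the one with the smallest loss on the training set)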

Energy-based learning provides a unified framework for many probabilistic and non-probabilistic approaches to learning, particularly for non-probabilistic training of graphical models and other structured models. Energy-based learning can be seen as an alternative to probabilistic estimation for prediction, classification, or decision-making tasks. Because there is no requirement for proper normalization, energy-based approaches avoid the problems associated with estimating the normalization constant in probabilistic models. Furthermore, the absence of the normalization condition allows for much more flexibility in the design of learning machines. Most probabilistic models can be viewed as special types of energy-based models in which the energy function satisfies certain normalizability conditions, and in which the loss function, optimized by learning, has a particular form.
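
To make the connection with probabilistic models concrete, one standard construction for turning an energy into a properly normalized conditional distribution is the Gibbs distribution, where β is a positive constant playing the role of an inverse temperature:

P(Y \mid X) = \frac{e^{-\beta E(Y, X)}}{\int_{y \in \mathcal{Y}} e^{-\beta E(y, X)}}

The denominator is the partition function, i.e., the normalization constant referred to above; evaluating this integral (or sum) over all configurations of Y is precisely what can become intractable, and what energy-based learning allows one to avoid.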

This chapter presents a tutorial on energy-based models, with an emphasis on their use for structured output problems and sequence labeling problems. Section 1 introduces energy-based models and describes deterministic inference through energy minimization. Section 2 introduces energy-based learning and the concept of the loss function. A number of standard and non-standard loss functions are described, including the perceptron loss, several margin-based losses, and the negative log-likelihood loss. The negative log-likelihood loss can be used to train a model to produce conditional probability estimates. Section 3 shows how simple regression and classification models can be formulated in the EBM framework. Section 4 concerns models that contain latent variables. Section 5 analyzes the various loss functions in detail and gives sufficient conditions that a loss function must satisfy so that its minimization will cause the model to approach the desired behavior. A list of “good” and “bad” loss functions is given. Section 6 introduces the concept of non-probabilistic factor graphs and informally discusses efficient inference algorithms. Section 7 focuses on sequence labeling and structured output models. Linear models such as max-margin Markov networks and conditional random fields are re-formulated in the EBM framework. The literature on discriminative learning for speech and handwriting recognition, going back to the late 80’s and early 90’s, is reviewed. This includes globally trained systems that integrate non-linear discriminant functions, such as neural networks, and sequence alignment methods, such as dynamic time warping and hidden Markov models. Hierarchical models such as the graph transformer network architecture are also reviewed. Finally, the differences, commonalities, and relative advantages of energy-based approaches, probabilistic approaches, and sampling-based approximate methods such as contrastive divergence are discussed in Section 8.
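
As a minimal, concrete illustration of the inference and learning pattern sketched in this introduction, the following toy Python snippet uses a hypothetical linear energy E(W, Y, X) = -W_Y · X over a small discrete label set, with inference by exhaustive energy minimization and an update based on the generalized perceptron loss E(W, Y^i, X^i) - min_Y E(W, Y, X^i), one of the losses discussed in Section 2. The energy, the data, and all function names here are invented for illustration; this is a sketch, not code from the tutorial.

import numpy as np

# Toy sketch of energy-based inference and perceptron-style learning.
# The linear energy E(W, Y, X) = -W[Y] . X is a hypothetical choice made
# purely for illustration.

def energy(W, y, x):
    """Scalar energy of the configuration (x, y); lower means more compatible."""
    return -np.dot(W[y], x)

def infer(W, x, labels):
    """Inference: return the label with the minimum energy for the observed x."""
    return min(labels, key=lambda y: energy(W, y, x))

def perceptron_step(W, x, y_true, labels, lr=0.1):
    """One gradient step on the generalized perceptron loss
    L = E(W, y_true, x) - min_y E(W, y, x):
    pushes the energy of the correct label down and the energy of the
    current minimizer up."""
    y_pred = infer(W, x, labels)
    if y_pred != y_true:
        W[y_true] += lr * x   # lowers E(W, y_true, x)
        W[y_pred] -= lr * x   # raises E(W, y_pred, x)
    return W

# Usage: three labels, 2-dimensional inputs.
labels = [0, 1, 2]
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
x, y = np.array([1.0, -0.5]), 2
for _ in range(10):
    W = perceptron_step(W, x, y, labels)
print(infer(W, x, labels))   # settles on label 2 for this single example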
