Source: arXiv. Authors: Yang Yuan et al.
This is a simple and fast hyperparameter optimization algorithm, inspired by techniques for analyzing Boolean functions.
1 Introduction
Large scale machine learning and optimization systems usually involve a large number of free parameters for the user to fix according to their application. A timely example is the training of deep neural networks for a signal processing application: the ML specialist needs to decide on an architecture, depth of the network, choice of connectivity per layer (convolutional, fully connected, etc.), choice of optimization algorithm, and recursively the choice of parameters inside the optimization library itself (learning rate, momentum, etc.).
Automatically finding a good setting of these parameters, now referred to as hyperparameter optimization, has become an important problem in machine learning and has received significant attention in recent years. For continuous hyperparameters, gradient descent is usually sufficient [MDA15]. Discrete parameters, however, such as choice of architecture, number of layers, connectivity and so forth, are significantly more challenging. Existing approaches include Bayesian optimization [SLA12], multi-armed bandit algorithms [LJD+16], and random search [BB12].
In this paper we introduce a new spectral approach to hyperparameter optimization based on harmonic analysis of Boolean functions. At a high level, the idea is to fit a sparse polynomial function to the discrete, high-dimensional function mapping hyperparameters to loss, and then optimize the resulting sparse polynomial.
Using ideas from discrete Fourier analysis and compressed sensing, we can give provable guarantees for a sparse-recovery algorithm that admits an efficient, parallelizable implementation. Here we are concerned with the tradeoff between running time and sample complexity for learning Boolean functions f where sampling uniformly from f is very expensive. This approach appears to be new and allows us to give uniform-distribution learning algorithms for Boolean concept classes such as decision trees that match the state of the art in running time and save dramatically in sample complexity.
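To make the approach concrete, here is a minimal end-to-end sketch (our own illustration, not the paper's Algorithm 1): encode each discrete hyperparameter as a ±1 variable, fit a sparse low-degree polynomial to (configuration, loss) samples via ℓ1-regularized regression, and minimize the cheap surrogate instead of the expensive objective. The helper evaluate_config and the use of scikit-learn's Lasso are illustrative assumptions.

# Sketch of the spectral idea: fit a sparse low-degree polynomial to
# (hyperparameter vector -> loss) samples, then minimize the surrogate.
# `evaluate_config` is a stand-in for an expensive training run.
from itertools import combinations
import numpy as np
from sklearn.linear_model import Lasso

def parity_features(X, degree):
    """Monomials chi_S(x) = prod_{i in S} x_i for all nonempty |S| <= degree."""
    n = X.shape[1]
    cols, subsets = [], []
    for d in range(1, degree + 1):
        for S in combinations(range(n), d):
            cols.append(np.prod(X[:, S], axis=1))
            subsets.append(S)
    return np.column_stack(cols), subsets

def evaluate_config(x, rng):
    # Placeholder for an expensive black-box evaluation (e.g., training a net).
    return x[0] * x[1] - 0.5 * x[2] + 0.1 * rng.standard_normal()

rng = np.random.default_rng(0)
n, m, degree = 20, 300, 3
X = rng.choice([-1.0, 1.0], size=(m, n))              # random configurations
y = np.array([evaluate_config(x, rng) for x in X])    # expensive evaluations

Phi, subsets = parity_features(X, degree)
model = Lasso(alpha=0.05).fit(Phi, y)                 # sparse polynomial fit
surrogate = lambda Z: parity_features(Z, degree)[0] @ model.coef_ + model.intercept_

# Optimize the cheap surrogate instead of the expensive function:
candidates = rng.choice([-1.0, 1.0], size=(10000, n))
best = candidates[np.argmin(surrogate(candidates))]
print("surrogate minimizer (first 5 coords):", best[:5])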
Our contributions:
• A new spectral method called Harmonica that has provable guarantees: sample-efficient recovery if the underlying hyperparameter objective is a sparse (noisy) polynomial, and easy implementation on parallel architectures.
• A sample-efficient learning procedure for learning size-s decision trees over n variables under the uniform distribution. We improve the two-decades-old sample complexity bound of n^{O(log(s/ε))} to one quadratic in the size of the tree, Õ(s² log n/ε), while matching the best known quasipolynomial bound in running time.
• We demonstrate significant improvements in accuracy, sample complexity, and running time for deep neural net training experiments. We compare ourselves to state-of-the-art solvers from Bayesian optimization, multi-armed bandit techniques, and random search. Projecting to even higher numbers of hyperparameters, we perform simulations that show several orders of magnitude of speedup versus Bayesian optimization techniques.
2 Setup and definitions
The problem of hyperparameter optimization is that of minimizing a discrete, real-valued function, which we denote by f : {−1, 1}^n → [−1, 1] (we can handle arbitrary inputs; binary is chosen for simplicity of presentation).
In the context of hyperparameter optimization, function evaluation is very expensive, although parallelizable, as it corresponds to training a deep neural net. In contrast, any computation that does not involve function evaluation is considered less expensive, such as computations that require time Ω(n^d) for “somewhat large” d or are subexponential (we still consider runtimes that are exponential in n to be costly).
2.1 Basics of Fourier analysis

2.2 Compressed sensing and sparse recovery
In the problem of sparse recovery, a learner attempts to recover a vector x ∈ R^n which is s-sparse, i.e. ‖x‖₀ ≤ s, from an observation vector y ∈ R^m that is assumed to equal
y = Ax + e,
where e is assumed to be zero-mean, usually Gaussian, noise. The seminal work of [CRT06, Don06] shows how x can be recovered exactly under various conditions on the observation matrix A ∈ R^{m×n} and the noise. The usual method for recovering the signal proceeds by solving a convex optimization problem consisting of ℓ1 minimization, for some parameter λ > 0.
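A standard Lagrangian (Lasso-style) way to write such a program is

    \min_{x \in \mathbb{R}^n} \; \lVert x \rVert_1 + \lambda \, \lVert Ax - y \rVert_2^2 .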
The above formulation comes in many equivalent forms (e.g., Lasso), where one of the objective parts may appear as a hard constraint.
For our work, the most relevant extension of traditional sparse recovery is due to Rauhut [Rau10], who considers the problem of sparse recovery when the measurements are evaluated according to a random orthonormal family. More concretely, fix x ∈ R^n with s non-zero entries. For a K-bounded random orthonormal family F = {ψ_1, . . . , ψ_N}, and m independent
3 Polynomial Recovery Algorithm and Main Theorem
The main component of our spectral algorithm for hyperparameter optimization is given in Algorithm 1. It is essentially an extension of sparse recovery (basis pursuit or Lasso) to the orthogonal basis of polynomials. In the next subsection, we describe how it is called from our main algorithm, Harmonica.
Remark: In this paper we focus on distributions D that are products of discrete univariate distributions. In this case the basis of orthonormal polynomials is the set of all parity functions; see Section 2.1. However, our method encompasses scenarios where μ_i can be continuous (e.g., Gaussian).
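As a small illustration of this remark (our own sketch, not part of the paper's algorithm), the parity functions χ_S(x) = ∏_{i∈S} x_i are orthonormal under the uniform distribution on {−1, 1}^n, which is easy to verify empirically:

# Empirical check that parities chi_S(x) = prod_{i in S} x_i are (approximately)
# orthonormal under the uniform distribution on {-1, 1}^n:
# E[chi_S * chi_T] is ~1 when S == T and ~0 otherwise.
import numpy as np

rng = np.random.default_rng(1)
n, m = 10, 200_000
X = rng.choice([-1.0, 1.0], size=(m, n))

def chi(S, X):
    return np.prod(X[:, S], axis=1) if S else np.ones(X.shape[0])

S, T = (0, 3), (0, 4)
print(np.mean(chi(S, X) * chi(S, X)))  # ~ 1.0  (squared norm of chi_S)
print(np.mean(chi(S, X) * chi(T, X)))  # ~ 0.0  (orthogonality)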
In the rest of this section we proceed to prove the main theorem. As a first step, we handle functions that are computed exactly by a sparse, low-degree polynomial. This case is immediately obtainable as a corollary of Theorem 5:


3.1 Application: Learning Decision Trees in Quasi-polynomial Time and Polynomial Sample Complexity
4 Harmonica: The Full Algorithm
The full algorithm continues to invoke the PSR subroutine until the search space has become sufficiently small, at which point we use a “base” hyperparameter optimizer (in our case either Hyperband or random search).
The space of minimizing assignments to a multivariate polynomial is a highly non-convex set that may contain many distinct points. As such, we take an average of several of the best minimizers, and iteratively apply PSR on this average, as sketched informally below.
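A schematic of this outer loop, sketched under simplifying assumptions (scikit-learn's Lasso with degree-2 monomial features stands in for the PSR subroutine, random search stands in for the base optimizer, and f is a toy objective), might look as follows:

# Schematic of the stage-wise outer loop: at each stage, fit a sparse surrogate
# to fresh samples of the (restricted) objective, average a few of its best
# minimizers to fix the variables that matter, restrict, and repeat; finish
# with a base optimizer (random search here).  A sketch, not the paper's code.
from itertools import combinations
import numpy as np
from sklearn.linear_model import Lasso

def monomial_features(X, degree=2):
    n = X.shape[1]
    subsets = [S for d in range(1, degree + 1) for S in combinations(range(n), d)]
    Phi = np.column_stack([np.prod(X[:, S], axis=1) for S in subsets])
    return Phi, subsets

def harmonica_like(f, n, n_stages=2, m=300, n_select=5, n_best=4, seed=0):
    rng = np.random.default_rng(seed)
    fixed = {}                                          # index -> value in {-1, +1}

    def restricted(x):
        z = np.array(x, dtype=float)
        for i, v in fixed.items():
            z[i] = v                                    # apply current restriction
        return f(z)

    for _ in range(n_stages):
        X = rng.choice([-1.0, 1.0], size=(m, n))
        y = np.array([restricted(x) for x in X])
        Phi, subsets = monomial_features(X)
        lasso = Lasso(alpha=0.05).fit(Phi, y)           # sparse surrogate (PSR stand-in)

        # Variables appearing in the largest-weight monomials are "important".
        top = np.argsort(-np.abs(lasso.coef_))[:n_select]
        important = sorted({i for t in top for i in subsets[t]})

        # Average a few of the best surrogate minimizers and fix those variables.
        cand = rng.choice([-1.0, 1.0], size=(5000, n))
        scores = monomial_features(cand)[0] @ lasso.coef_ + lasso.intercept_
        avg = cand[np.argsort(scores)[:n_best]].mean(axis=0)
        for i in important:
            fixed[i] = 1.0 if avg[i] >= 0 else -1.0

    # Base optimizer on the restricted function (random search as a stand-in).
    cand = rng.choice([-1.0, 1.0], size=(m, n))
    vals = np.array([restricted(x) for x in cand])
    best = np.array(cand[np.argmin(vals)])
    for i, v in fixed.items():
        best[i] = v
    return best, fixed

f = lambda z: z[0] - 0.7 * z[1] + 0.3 * z[2]            # toy objective
x_star, fixed_vars = harmonica_like(f, n=12)
print(fixed_vars, f(x_star))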
In order to describe this formally we need the following definition for restrictions of functions:
5 Algorithm Comparison and Experiments
We compare Harmonica with Spearmint [SLA12], Hyperband [LJD+16] and random search; see Table 1. Both Spearmint and Hyperband are state-of-the-art algorithms, and it has been observed that random search 2x (random search with doubled time resources) is a very competitive benchmark that beats many algorithms. We defer the detailed explanation of these algorithms and comparisons to Section 6.
5.1 Resnet hyperparameter tuning
Our first experiment runs Harmonica for a residual network on the Cifar-10 dataset, using degree-3 features. We included 39 binary hyperparameters, covering initialization, optimization method, learning rate schedule, momentum rate, etc. Please refer to Table 3 (Section 7.1) for a detailed explanation of those hyperparameters. To make the task more challenging, we also included 21 dummy variables. We ran this experiment on 10 NC6 machines on the Azure cloud, each with one Tesla K80 GPU.
As most hyperparameters have a consistent effect as the network becomes deeper, a common hand-tuning strategy is “tune on a small network, then apply the knowledge to the big network”. Harmonica can also exploit this strategy, as it selects important features stage by stage. More specifically, during its first 2 stages of feature selection, we run Harmonica on an 8-layer neural network trained for 30 epochs, with 300 samples, 5 features selected per stage, and restriction size t = 4 (see Algorithm 1). After that, we fix all the important features and run the SH algorithm (a subroutine of Hyperband [LJD+16], 4-stage version) as our base algorithm on the big 56-layer neural network, training for the full 160 epochs. To clarify, “stage” means the stages of the hyperparameter algorithms, while “epoch” means the epochs for training the neural network.
Top Test Error Distributions. See the top 10 test error results of the different algorithms in Figure 1, and the running times in Figure 2. We ran Harmonica twice, with different total running times: Harmonica 1 uses 10.1 days, while Harmonica 2 uses 3.6 days. They share the same feature extraction process; the difference is the time spent running the SH algorithm. We show fewer than 10 test errors for both Harmonica 1 and Harmonica 2 because those are the only final results returned after running the SH algorithm.
Better test error, and faster. Harmonica 2 uses only 1/4 of the time of Hyperband and 1/5 of the time of random search, but obtains results better than all the results obtained by these algorithms. In other words, it beats the random search 5x benchmark (stronger than the random search 2x benchmark of [LJD+16]).
Close to human rate. The best result from Harmonica 1 (7.03%) is only 0.06% away from the hand-tuned rate of 6.97% reported in [HZRS16], using 10 GPU days, i.e., 1 day with 10 GPUs.
We also tried random search as the base algorithm for Harmonica, which gives slightly worse performance compared with the SH version. However, it is still much better than Hyperband, random search and Spearmint: it finds three sets of hyperparameters with less than 8% test error. This indicates that different base algorithms can be used with Harmonica.
Since Spearmint does not have multi-machine support, we had to run it on a single machine. Due to time constraints, we chose to stop it after running for 8.5 days. The results of Spearmint were not competitive. It seems that in the high-dimensional regime, Spearmint behaves similarly to random search.
Average Test Error For Each Stage. We compute the average test error among 300 random samples for the 8-layer network with 30 epochs after each stage. See Figure 3. After selecting 5 features in stage 1, the average test error drops from 60.16 to 33.3, which indicates the top 5 features are very helpful. As we proceed to stage 3, the improvement in test error becomes less significant, as the features selected at stage 3 have mild contributions.
Sensitivities to hyperparameters. As a hyperparameter tuning algorithm, Harmonica itself has a few hyperparameters that one needs to set. Fortunately, its performance is not sensitive to those parameters. For example, Table 2 shows the stable ranges for the regularization term λ and the number of samples in Harmonica. Here a stable range means that as long as the parameters are set within this range, the top 5 features and the signs of their weights do not change. In other words, the feature selection outcome is not affected. See Section 7.3 for discussions of other parameters.
5.2 Synthetic function
Our second experiment considers a synthetic hierarchically bounded function h(x) (see Appendix 7.4 for the detailed definition of h(x)). We run Harmonica with 100 samples, 5 features selected per stage, for 3 stages, using degree-3 features. See Figure 4 for an optimization time comparison. We only plot the running time for Spearmint when n = 60, which takes more than one day for 500 samples. Harmonica is several orders of magnitude faster than Spearmint. In Figure 5 (Section 7.4), we show that Harmonica is able to estimate the hidden function with error proportional to the noise level. For the synthetic experiment, Spearmint is able to find equally good results compared with Harmonica.
6 Algorithm Comparisons
6.1 Bayesian optimization: Spearmint
Bayesian optimization (BO) algorithms tune hyperparameters by assuming a prior distribution on the loss function, and then keep updating this prior distribution based on new observations. Each new observation is selected according to an acquisition function, which balances exploration and exploitation so that the new observation either gives us a better result or helps us gain more information about the loss function. See, e.g., [SLA12, SSA13, SSZA14, GKX+14, WZH+13]. In this paper, we use one of the most popular Bayesian optimization packages, Spearmint (https://github.com/HIPS/Spearmint.git), which combines the results of multiple recent papers [SLA12, SSA13, SSZA14].
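For intuition, here is a minimal sketch of one common acquisition function, expected improvement for minimization (a generic illustration of the exploration/exploitation trade-off, not Spearmint's internals; the Gaussian-process posterior mean mu and standard deviation sigma at a candidate point are assumed given):

# Expected improvement (EI) for minimization: given a Gaussian posterior
# N(mu(x), sigma(x)^2) over the loss at a candidate x and the best loss seen
# so far, EI trades off exploitation (low mu) against exploration (high sigma).
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far, xi=0.01):
    mu, sigma = np.asarray(mu, dtype=float), np.asarray(sigma, dtype=float)
    improve = best_so_far - mu - xi
    z = np.where(sigma > 0, improve / np.maximum(sigma, 1e-12), 0.0)
    ei = improve * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, np.maximum(ei, 0.0), np.maximum(improve, 0.0))

# Two candidates with equal posterior mean: the more uncertain one scores higher.
print(expected_improvement(mu=[0.30, 0.30], sigma=[0.01, 0.20], best_so_far=0.28))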
6.2 Hyperband algorithm
Hyperband [LJD+16] is a general version of the SuccessiveHalving (SH) algorithm [JT16], which is a multi-armed bandit algorithm over randomly selected configurations. The SH algorithm runs in multiple stages, and maintains a set T of hyperparameter configurations. Initially, T has m random configurations. At each stage, SH runs each configuration in T for r iterations (assuming we are training some machine learning task iteratively, and intermediate results are visible). Then, based on the intermediate results, only the top 1/3 of the configurations in T are kept for the next stage. The Hyperband algorithm is based on the SH algorithm, but automatically tunes m as well as the number of stages. We implemented a parallel version of Hyperband in Lua.
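A minimal sketch of SuccessiveHalving as described above (our own illustration; sample_config and train_for are placeholders for drawing a random configuration and resuming its training for r more iterations):

# SuccessiveHalving sketch: start with m random configurations, give each r
# iterations of training per stage, and keep only the best third each time.
import random

def successive_halving(sample_config, train_for, m=27, r=1, n_stages=3):
    T = [sample_config() for _ in range(m)]
    for stage in range(n_stages):
        scored = [(train_for(cfg, r), cfg) for cfg in T]            # intermediate losses
        scored.sort(key=lambda t: t[0])
        T = [cfg for _, cfg in scored[: max(1, len(scored) // 3)]]  # keep top 1/3
    return T[0]

# Toy usage: "configurations" are numbers, "training" just reveals them noisily.
best = successive_halving(
    sample_config=lambda: random.random(),
    train_for=lambda cfg, r: cfg + random.gauss(0, 0.01),
)
print(best)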
6.3 Comparing Harmonica with other algorithms

Scalability. If the hidden function is s-sparse, Harmonica can find this sparse function using Õ(s log s) samples. If at every stage of Harmonica the target function can be approximated by an s-sparse function, we only need Õ(ks log s) samples, where k is the number of stages. For real-world applications like neural network hyperparameter tuning, it seems that the hidden function is indeed sparse at every stage and Harmonica performs well. See Section 5.1.
For Hyperband or random search, even if the function is s-sparse, in order to cover the optimal configuration by random sampling we need Ω(2^s) samples. For Bayesian optimization it might be even worse.
Optimization time. Harmonica runs the Lasso [Tib96] algorithm after each stage to solve (2), which is a well-studied convex optimization problem with very fast implementations. Hyperband is also scalable in n, because it simply runs a quick sort on m samples after every stage, which is negligible. The running time of Bayesian optimization is cubic in the number of evaluations, and it is very slow for large n. See Section 5.2.
Parallelizability. Harmonica, Hyperband and random search are naturally suitable for parallelization, and have straightforward implementations. In every stage of these algorithms, we can simply run m random samples in parallel. In our experiment m = 300, which is enough to gain full speedups with 10+ machines.
By contrast, it is hard to run Bayesian optimization algorithms in parallel, due to their inherently serial nature. While it is possible to pick multiple points at the same time and evaluate them as a batch in parallel [WF16], the speedups do not grow linearly in the number of machines, and the batch size is usually limited to about 30. Spearmint does not have a multi-machine implementation; therefore in this paper we always run it on a single machine.
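For concreteness, the per-stage parallel work of Harmonica, Hyperband, and random search is just an embarrassingly parallel map over configurations; a sketch using Python's multiprocessing (with evaluate_config as an illustrative stand-in for training a network) is:

# Evaluating m independent configurations in parallel: the per-stage work of
# Harmonica / Hyperband / random search is a simple map over configurations.
import multiprocessing as mp
import numpy as np

def evaluate_config(x):
    # Placeholder for an expensive, independent evaluation.
    return float(np.sum(np.asarray(x) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    configs = [rng.choice([-1.0, 1.0], size=10) for _ in range(300)]
    with mp.Pool(processes=10) as pool:          # e.g., one worker per machine/GPU
        losses = pool.map(evaluate_config, configs)
    print(min(losses))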
Feature Extraction. Harmonica is able to extract important features with weights at each stage, which automatically sorts all the features according to their importance. See Section 7.2.1.
7 Experimental details
7.1 Options

7.2 Important features
7.2.1 Important features that Harmonica selected
We show the selected important features and their weights during the first 3 stages in Table 4 (Section 7.2), where each feature is a monomial of variables with degree at most 3. We do not include the 4th stage because in that stage there are no features with nonzero weights.
Smart choices on important options. Based on Table 4, Harmonica will fix the following variables (sorted according to their importance): Batch Norm (Yes), Activation (ReLU), Initial learning rate ([0.001, 0.1]), Optimization method (Adam), Use momentum (Yes), Resblock first activation (Yes), Resblock third activation (No), Weight decay (No if the initial learning rate is comparatively small and Yes otherwise), Batch norm tuning (Yes). Most of these choices match what people do in practice.
A metric for the importance of variables. The features that Harmonica finds can serve as a metric for measuring the importance of different variables. For example, Batch Norm turns out to be the most significant variable, more important than any other variable, and ReLU is the second most important. By contrast, Dropout, when Batch Norm is present, does not make significant contributions. This actually matches the observations in [IS15].
No dummy/irrelevant variables selected. Although 21 of the 60 variables are dummy variables, we never selected any of them. Moreover, irrelevant variables like cudnn, backend and nthreads, which do not affect the test error, were not selected.
7.3 Sensitivities to hyperparameters
#Stages. We observe that having 2, 3 or 4 stages gives similar results for this experiment. If the degree of the features is 1 (no correlations between features), or if at every stage only 3 or fewer features are selected, then 3 or 4 stages will be better.
#Features per stage. We observe that selecting 3-5 features per stage works well and produces similar results. More features per stage brings more noise, and fewer features incurs more stages.
Lasso parameters. For Lasso regression, we need to decide the regularization term λ and the number of samples. See Table 2 for the stable ranges for λ and #samples. Here a stable range means that as long as the parameters are set within this range, the top 5 features and the signs of their weights (which are what we need for computing g(x) in Algorithm 1) do not change. In other words, the feature selection outcome is not affected. When the parameters are outside the stable ranges, usually the top features are still unchanged, and we miss only one or two of the five features.
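The stability check itself is easy to script; a sketch (with a synthetic feature matrix Phi and response y standing in for the real measurements) sweeps λ, records the top-5 features and their weight signs, and verifies that the selection does not change:

# Sketch of the lambda stability check: for each candidate regularization
# strength, record the top-5 features (with the signs of their weights) and
# verify that the selection is identical across the range.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
Phi = rng.choice([-1.0, 1.0], size=(300, 100))          # synthetic stand-in
y = (1.0 * Phi[:, 3] - 0.8 * Phi[:, 17] + 0.6 * Phi[:, 42]
     - 0.5 * Phi[:, 65] + 0.4 * Phi[:, 80] + 0.05 * rng.standard_normal(300))

def top5_signature(lam):
    coef = Lasso(alpha=lam).fit(Phi, y).coef_
    top = np.argsort(-np.abs(coef))[:5]
    return tuple(sorted((int(i), int(np.sign(coef[i]))) for i in top))

signatures = {lam: top5_signature(lam) for lam in [0.01, 0.03, 0.1, 0.3]}
stable = len(set(signatures.values())) == 1
print("selection stable across this range:", stable)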
Degree of features. As we can see in Table 4, although we set the degree to be 3, almost all the important features Harmonica selected have degree at most 2. The same observation holds for degree 4. We also tried using degree-1 features, which are significantly faster to optimize with Lasso, but give slightly worse results (≈ 7.5% best test error with 3 stages and 12 days total running time), because they do not consider the correlations between features. Therefore, we recommend using degree 2 or degree 3 in practice.
8 Acknowledgements
We thank Sanjeev Arora for helpful discussions and encouragement. Elad Hazan is supported by NSF grant 1523815 and a Google research award. This project is supported by a Microsoft Azure research award and an Amazon AWS research award.