
AN OPTIMAL AFFINE INVARIANT SMOOTH MINIMIZATION ALGORITHM

ALEXANDRE D'ASPREMONT AND MARTIN JAGGI

ABSTRACT. We formulate an affine invariant implementation of the algorithm in [Nesterov, 1983]. We show that the complexity bound is then proportional to an affine invariant regularity constant defined with respect to the Minkowski gauge of the feasible set.

Date: December 11, 2013.

1. INTRODUCTION

In this short note, we show how to implement the smooth minimization algorithm described in [Nesterov, 1983, 2005] so that both its iterations and its complexity bound are invariant under a change of coordinates in the problem. We focus on the minimization problem

    minimize    f(x)
    subject to  x \in Q,                                                        (1)

where $f$ is a convex function with Lipschitz continuous gradient and $Q$ is a compact convex set. Without too much loss of generality, we will assume that the interior of $Q$ is nonempty and contains zero. When $Q$ is sufficiently simple, in a sense that will be made precise later, Nesterov [1983] showed that this problem can be solved with a complexity of $O(1/\sqrt{\epsilon})$, where $\epsilon$ is the precision target. Furthermore, it can be shown that this complexity bound is optimal for the class of smooth problems [Nesterov, 2003].

While the dependence in $O(1/\sqrt{\epsilon})$ of the complexity bound in Nesterov [1983] is optimal, the constant in front of that bound still depends on a few parameters which vary with implementation: the choice of norm and prox regularization function. This means in particular that, everything else being equal, this bound is not invariant with respect to an affine change of coordinates, so the complexity bound varies while the intrinsic complexity of problem (1) remains unchanged. Here, we show one possible fix for this inconsistency, by choosing a norm and a prox term for the algorithm in [Nesterov, 1983, 2005] which make its iterations and complexity invariant under a change of coordinates.

2. SMOOTH OPTIMIZATION ALGORITHM

We first recall the basic structure of the algorithm in [Nesterov, 1983]. While many variants of this method have been derived, we use the formulation in [Nesterov, 2005]. We choose a norm $\|\cdot\|$ and assume that the function $f$ in problem (1) is convex with Lipschitz continuous gradient, so

    f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|^2,   x, y \in Q,          (2)

for some $L > 0$. We also choose a prox function $d(x)$ for the set $Q$, i.e. a continuous, strongly convex function on $Q$ with parameter $\sigma$ (see Nesterov [2003] or Hiriart-Urruty and Lemaréchal [1993] for a discussion of regularization techniques using strongly convex functions). We let $x_0$ be the center of $Q$ for the prox function $d(x)$, so that

    x_0 \triangleq \mathop{\rm argmin}_{x \in Q} d(x),

assuming w.l.o.g. that $d(x_0) = 0$; we then get in particular

    d(x) \ge \frac{\sigma}{2} \|x - x_0\|^2.                                    (3)

We write $T_Q(x)$ a solution to the following subproblem

    T_Q(x) \triangleq \mathop{\rm argmin}_{y \in Q} \left\{ \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|^2 \right\}.       (4)

We let $y_0 \triangleq T_Q(x_0)$ where $x_0$ is defined above. We recursively define three sequences of points: the current iterate $x_k$, the corresponding $y_k = T_Q(x_k)$, and the points

    z_k \triangleq \mathop{\rm argmin}_{x \in Q} \left\{ \frac{L}{\sigma} d(x) + \sum_{i=0}^k \alpha_i \left[ f(x_i) + \langle \nabla f(x_i), x - x_i \rangle \right] \right\}        (5)

together with a step size sequence $\alpha_k \ge 0$ with $\alpha_0 \in (0, 1]$, so that

    x_{k+1} = \tau_k z_k + (1 - \tau_k) y_k
    y_{k+1} = T_Q(x_{k+1})                                                      (6)

where $\tau_k = \alpha_{k+1}/A_{k+1}$ with $A_k = \sum_{i=0}^k \alpha_i$. We implicitly assume here that $Q$ is simple enough so that the two subproblems defining $y_k$ and $z_k$ can be solved very efficiently. We have the following convergence result.

Theorem 2.1. Suppose $\alpha_k = (k+1)/2$ with the iterates $x_k$, $y_k$ and $z_k$ defined in (5) and (6); then for any $k \ge 0$ we have

    f(y_k) - f(x^\star) \le \frac{4 L d(x^\star)}{\sigma (k+1)^2}

where $x^\star$ is an optimal solution to problem (1).

Proof. See Nesterov [2005].
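To make the recursion (4)-(6) concrete, here is a minimal numerical sketch, assuming the Euclidean norm, the prox function $d(x) = \|x - x_0\|^2/2$ (so $\sigma = 1$) and $Q$ a Euclidean ball; with these choices both subproblems (4) and (5) reduce to projections onto $Q$. The function names and the example problem are ours, not from the paper.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the l2 ball of the given radius."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

def smooth_minimize(grad_f, L, x0, n_iter=100, proj=project_ball):
    """Sketch of the method (4)-(6) with step sizes alpha_k = (k+1)/2."""
    xk = x0.copy()
    grad_acc = np.zeros_like(x0)  # running sum of alpha_i * grad f(x_i), cf. (5)
    A = 0.0                       # running sum A_k of the alpha_i
    yk = x0
    for k in range(n_iter):
        g = grad_f(xk)
        alpha = (k + 1) / 2.0
        A += alpha
        grad_acc += alpha * g
        yk = proj(xk - g / L)            # y_k = T_Q(x_k), cf. (4)
        zk = proj(x0 - grad_acc / L)     # z_k, cf. (5) with d(x) = ||x - x0||^2 / 2
        alpha_next = (k + 2) / 2.0
        tau = alpha_next / (A + alpha_next)  # tau_k = alpha_{k+1} / A_{k+1}
        xk = tau * zk + (1 - tau) * yk       # cf. (6)
    return yk

# Example: minimize f(x) = ||x - b||^2 / 2 (so L = 1) over the unit ball.
b = np.array([2.0, 0.0, 0.0])
y = smooth_minimize(lambda x: x - b, L=1.0, x0=np.zeros(3), n_iter=200)
print(y)   # approaches the projection of b onto the ball, here (1, 0, 0)
```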
If $\epsilon > 0$ is the target precision, Theorem 2.1 ensures that Algorithm 1 will converge to an $\epsilon$-accurate solution in no more than

    \sqrt{\frac{8 L d(x^\star)}{\sigma \epsilon}}                               (7)

iterations. In practice of course, $d(x^\star)$ needs to be bounded a priori, and $L$ and $\sigma$ are often hard to evaluate.

Algorithm 1 Smooth minimization.
Input: $x_0$, the prox center of the set $Q$.
1: for $k = 0, \ldots, N$ do
2:   Compute $\nabla f(x_k)$.
3:   Compute $y_k = T_Q(x_k)$.
4:   Compute $z_k = \mathop{\rm argmin}_{x \in Q} \left\{ \frac{L}{\sigma} d(x) + \sum_{i=0}^k \alpha_i [f(x_i) + \langle \nabla f(x_i), x - x_i \rangle] \right\}$.
5:   Set $x_{k+1} = \tau_k z_k + (1 - \tau_k) y_k$.
6: end for
Output: $x_N, y_N \in Q$.

While most of the parameters in Algorithm 1 are set explicitly, the norm $\|\cdot\|$ and the prox function $d(x)$ are chosen arbitrarily. In what follows, we will see that a natural choice for both makes the algorithm affine invariant.

3. AFFINE INVARIANT IMPLEMENTATION

We can define an affine change of coordinates $x = Ay$, where $A \in \mathbb{R}^{n \times n}$ is a nonsingular matrix, for which the original optimization problem in (1) is transformed so that

    minimize    f(x)                 minimize    \hat f(y)
    subject to  x \in Q    becomes   subject to  y \in \hat Q,                  (8)

in the variable $y \in \mathbb{R}^n$, where

    \hat f(y) \triangleq f(Ay)   and   \hat Q \triangleq A^{-1} Q.              (9)

Unless $A$ is pathologically ill-conditioned, both problems are equivalent and should have invariant complexity bounds and iterations. In fact, the complexity analysis of Newton's method based on the self-concordance argument developed in [Nesterov and Nemirovskii, 1994] produces affine invariant complexity bounds, and the iterates themselves are invariant. Here we will show how to choose the norm $\|\cdot\|$ and the prox function $d(x)$ to get a similar behavior for Algorithm 1.

3.1. Choosing the norm. We start with a few classical results and definitions. Recall that the Minkowski gauge of a set $Q$ is defined as follows.

Definition 3.1. Let $Q \subset \mathbb{R}^n$ contain zero. We define the Minkowski gauge of $Q$ as

    \gamma_Q(x) \triangleq \inf \{ \lambda \ge 0 : x \in \lambda Q \}

with $\gamma_Q(x) = 0$ when $Q$ is unbounded in the direction $x$.

When $Q$ is a compact convex set, centrally symmetric with respect to the origin and with nonempty interior, the Minkowski gauge defines a norm. We write this norm $\|\cdot\|_Q \triangleq \gamma_Q(\cdot)$. From now on, we will assume that the set $Q$ is centrally symmetric, or use for example $\bar Q = Q - Q$ (in the Minkowski sense) for the gauge when it is not (this can be improved, and extending these results to the nonsymmetric case is a classical topic in functional analysis). Note that any affine transform of a centrally symmetric convex set remains centrally symmetric. The following simple result shows why $\|\cdot\|_Q$ is potentially a good choice of norm for Algorithm 1.

Lemma 3.2. Suppose $f : \mathbb{R}^n \to \mathbb{R}$, $Q$ is a centrally symmetric convex set with nonempty interior, and let $A \in \mathbb{R}^{n \times n}$ be a nonsingular matrix. Then $f$ has Lipschitz continuous gradient with respect to the norm $\|\cdot\|_Q$ with constant $L > 0$, i.e.

    f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|_Q^2,   x, y \in Q,

if and only if the function $f(Aw)$ has Lipschitz continuous gradient with respect to the norm $\|\cdot\|_{A^{-1}Q}$ with the same constant $L$.

Proof. Let $y = Az$ and $x = Aw$; then

    f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|_Q^2,   x, y \in Q,

is equivalent to

    f(Az) \le f(Aw) + \langle A^{-T} \nabla_w f(Aw), Az - Aw \rangle + \frac{L}{2} \|Az - Aw\|_Q^2,   z, w \in A^{-1}Q,

and, using the fact that $\|Ax\|_Q = \|x\|_{A^{-1}Q}$, this is also

    f(Az) \le f(Aw) + \langle \nabla_w f(Aw), A^{-1}(Az - Aw) \rangle + \frac{L}{2} \|z - w\|_{A^{-1}Q}^2,   z, w \in A^{-1}Q,

hence the desired result.
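As a quick sanity check of Lemma 3.2, the following sketch verifies the invariance of the Lipschitz constant numerically, assuming $Q$ is the unit $\ell_1$ ball (so $\|\cdot\|_Q = \|\cdot\|_1$) and $f$ a convex quadratic; the setup and variable names are illustrative choices of ours. For a quadratic $f(x) = x^T H x / 2$, the best constant is the maximum of $u^T H u$ over the unit ball of the norm, attained at an extreme point of that ball.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.normal(size=(n, n))
H = M.T @ M                      # Hessian of the quadratic f(x) = x^T H x / 2
A = rng.normal(size=(n, n))      # generic (hence nonsingular) coordinate change
Ainv = np.linalg.inv(A)

# Q = unit l1 ball, extreme points +-e_i: the Lipschitz constant of grad f
# with respect to ||.||_1 is max_i H_ii.
L_f = max(H[i, i] for i in range(n))

# g(w) = f(Aw) has Hessian A^T H A, and the unit ball of ||.||_{A^{-1}Q}
# has extreme points +-A^{-1} e_i, so the constant is unchanged:
L_g = max(Ainv[:, i] @ (A.T @ H @ A) @ Ainv[:, i] for i in range(n))

print(np.isclose(L_f, L_g))      # True, as Lemma 3.2 predicts
```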
An almost identical argument shows the following analogous result for the property of strong convexity with respect to the norm $\|\cdot\|_Q$ under affine changes of coordinates.

Lemma 3.3. Suppose $f : \mathbb{R}^n \to \mathbb{R}$, $Q$ is a centrally symmetric convex set with nonempty interior, and let $A \in \mathbb{R}^{n \times n}$ be a nonsingular matrix. Then $f$ is strongly convex with respect to the norm $\|\cdot\|_Q$ with parameter $\sigma > 0$, i.e.

    f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma}{2} \|y - x\|_Q^2,   x, y \in Q,

if and only if the function $f(Ax)$ is strongly convex with respect to the norm $\|\cdot\|_{A^{-1}Q}$ with the same parameter $\sigma$.

We now turn our attention to the choice of prox function in Algorithm 1.

3.2. Choosing the prox. Choosing $\|\cdot\|_Q$ as the norm allows us to define a norm without introducing an arbitrary geometry in the algorithm, since the norm is extracted directly from the problem definition. When $Q$ is smooth, a similar reasoning allows us to choose the prox term in Algorithm 1, and we can set $d(x) = \|x\|_Q^2$. The immediate impact of this choice is that the term $d(x^\star)$ in (7) is bounded by one, by construction. This choice has other natural benefits, which are highlighted below. We first recall a result showing that the conjugate of a squared norm is the squared dual norm.

Lemma 3.4. Let $\|\cdot\|$ be a norm and $\|\cdot\|^*$ its dual norm; then

    \frac{1}{2} (\|y\|^*)^2 = \sup_x \; y^T x - \frac{1}{2} \|x\|^2.

Proof. We recall the proof in [Boyd and Vandenberghe, 2004, Example 3.27] as it will prove useful in what follows. By definition, $x^T y \le \|y\|^* \|x\|$, hence

    y^T x - \frac{1}{2} \|x\|^2 \le \|y\|^* \|x\| - \frac{1}{2} \|x\|^2 \le \frac{1}{2} (\|y\|^*)^2

because the second term is a quadratic function of $\|x\|$, with maximum $(\|y\|^*)^2/2$. This maximum is attained by any $x$ such that $x^T y = \|y\|^* \|x\|$ (there must be one by construction of the dual norm), normalized so that $\|x\| = \|y\|^*$, which yields the desired result.

This last result (and its proof) shows that solving the prox mapping is equivalent to finding a vector aligned with the gradient, with respect to the Minkowski norm $\|\cdot\|_Q$. We now recall another simple result showing that the dual of the norm $\|\cdot\|_Q$ is given by $\|\cdot\|_{Q^\circ}$, where $Q^\circ$ is the polar of $Q$.

Lemma 3.5. Let $Q$ be a centrally symmetric convex set with nonempty interior; then $\|\cdot\|_Q^* = \|\cdot\|_{Q^\circ}$.

Proof. We write

    \|x\|_{Q^\circ} = \inf \{ \lambda \ge 0 : x \in \lambda Q^\circ \}
                    = \inf \{ \lambda \ge 0 : x^T y \le \lambda, \text{ for all } y \in Q \}
                    = \inf \{ \lambda \ge 0 : \sup_{y \in Q} x^T y \le \lambda \}
                    = \sup_{y \in Q} x^T y
                    = \|x\|_Q^*

which is the desired result.

The last remaining issue to settle is the strong convexity of the squared Minkowski norm. Fortunately, this too is a classical result in functional analysis, as a squared norm is strongly convex with respect to itself if and only if its dual norm has a smoothness modulus of power 2.

However, this does not cover the case where the norm $\|\cdot\|_Q$ is not smooth. In that scenario, we need to pick the norm based on $Q$ but find a smooth prox function not too different from $\|\cdot\|_Q$. This is exactly the problem studied by Juditsky and Nemirovski [2008], who define the regularity of a Banach space $(E, \|\cdot\|_E)$ in terms of the smoothness of the best smooth approximation of the norm $\|\cdot\|_E$. We first recall a few more definitions, and we will then show that the regularity constant defined by Juditsky and Nemirovski [2008] produces an affine invariant bound on the term $d(x^\star)/\sigma$ in the complexity of the smooth algorithm in [Nesterov, 1983].

Definition 3.6. Suppose $\|\cdot\|_X$ and $\|\cdot\|_Y$ are two norms on a space $E$; the distortion $d(\|\cdot\|_X, \|\cdot\|_Y)$ between these two norms is equal to the smallest product $ab > 0$ such that

    \frac{1}{b} \|x\|_Y \le \|x\|_X \le a \|x\|_Y

over all $x \in E$.

Note that $\log d(\|\cdot\|_X, \|\cdot\|_Y)$ defines a metric on the set of all symmetric convex bodies in $\mathbb{R}^n$, called the Banach-Mazur distance. We then recall the regularity definition in Juditsky and Nemirovski [2008].

Definition 3.7. The regularity constant of a Banach space $(E, \|\cdot\|)$ is the smallest constant $\Delta > 0$ for which there exists a smooth norm $p(x)$ such that

(i) $p(x)^2/2$ has a Lipschitz continuous gradient with constant $\mu$ w.r.t. the norm $p(x)$, with $1 \le \mu \le \Delta$;
(ii) the norm $p(x)$ satisfies

    \|x\|^2 \le p(x)^2 \le \frac{\Delta}{\mu} \|x\|^2,   for all x \in E,        (10)

hence $d(p(x), \|\cdot\|) \le \sqrt{\Delta/\mu}$.
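As a concrete illustration of Definition 3.7 (anticipating the example of Section 4), the following sketch checks numerically that the smooth $\ell_p$ norm with $p = 2 \log n$ sandwiches the nonsmooth $\ell_\infty$ norm with a dimension-independent distortion, a sandwich of the type required in (10); the dimension and random vector below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
p = 2 * np.log(n)                # smooth approximation exponent for l_inf

x = rng.normal(size=n)
p_x = np.sum(np.abs(x) ** p) ** (1 / p)   # p(x) = ||x||_p, a smooth norm
inf_x = np.max(np.abs(x))                  # the target norm ||x||_inf

# ||x||_inf <= ||x||_p <= n^(1/p) ||x||_inf, and n^(1/p) = exp(log(n)/p)
# equals sqrt(e) when p = 2 log n, so the distortion stays bounded as the
# dimension grows.
print(inf_x <= p_x <= np.sqrt(np.e) * inf_x)   # True
print(np.exp(np.log(n) / p))                    # = sqrt(e) ~ 1.6487
```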
Note that in finite dimension, since all norms are equivalent to the Euclidean norm with distortion at most $\sqrt{\dim E}$, we know that all finite dimensional Banach spaces are at least $(\dim E)$-regular. Furthermore, the regularity constant is invariant with respect to an affine change of coordinates, since both the distortion and the smoothness bounds are. We are now ready to prove the main result of this section.

Proposition 3.8. Let $\epsilon > 0$ be the target precision. Suppose that the function $f$ has a Lipschitz continuous gradient with constant $L_Q$ with respect to the norm $\|\cdot\|_Q$ and that the space $(\mathbb{R}^n, \|\cdot\|_Q^*)$ is $D_Q$-regular; then Algorithm 1 will produce an $\epsilon$-solution to problem (1) in at most

    \sqrt{\frac{8 L_Q \min\{D_Q/2, \, n\}}{\epsilon}}                           (11)

iterations. The constants $L_Q$ and $D_Q$ are affine invariant.

Proof. If $(\mathbb{R}^n, \|\cdot\|_Q^*)$ is $D_Q$-regular, then by Definition 3.7 there exists a norm $p(x)$ such that $p(x)^2/2$ has a Lipschitz continuous gradient with constant $\mu$ with respect to the norm $p(x)$, and [Juditsky and Nemirovski, 2008, Prop. 3.2] shows by conjugacy that the prox function $d(x) \triangleq p^*(x)^2/2$ is strongly convex with respect to the norm $p^*(x)$ with constant $1/\mu$. Now (10) means that

    \sqrt{\frac{\mu}{D_Q}} \|x\|_Q \le p^*(x) \le \|x\|_Q,   for all x \in Q,

since $\|\cdot\|^{**} = \|\cdot\|$, hence

    d(x + y) \ge d(x) + \langle \partial d(x), y \rangle + \frac{1}{2\mu} p^*(y)^2
             \ge d(x) + \langle \partial d(x), y \rangle + \frac{1}{2 D_Q} \|y\|_Q^2,

so $d(x)$ is strongly convex with respect to $\|\cdot\|_Q$ with constant $\sigma = 1/D_Q$, and using (10) as above

    \frac{d(x^\star)}{\sigma} = \frac{p^*(x^\star)^2 D_Q}{2} \le \frac{\|x^\star\|_Q^2 D_Q}{2} \le \frac{D_Q}{2}

by definition of $\|\cdot\|_Q$, if $x^\star$ is an optimal (hence feasible) solution of problem (1). The bound in (11) then follows from (7), and its affine invariance follows directly from affine invariance of the distortion and Lemmas 3.2 and 3.3.

4. EXAMPLES

To illustrate our results, consider the problem of minimizing a smooth convex function over the unit simplex, written

    minimize    f(x)
    subject to  \mathbf{1}^T x \le 1,  x \ge 0,                                 (12)

in the variable $x \in \mathbb{R}^n$. As discussed in [Juditsky et al., 2009, §3.3], choosing $\|\cdot\|_1$ as the norm and $d(x) = \log n + \sum_{i=1}^n x_i \log x_i$ as the prox function, we have $\sigma = 1$ and $d(x^\star) \le \log n$, which means the complexity of solving (12) using Algorithm 1 is bounded by

    \sqrt{\frac{8 L_1 \log n}{\epsilon}}                                        (13)

where $L_1$ is the Lipschitz constant of $\nabla f$ with respect to the $\ell_1$ norm. This choice of norm and prox has a double advantage here. First, the prox term $d(x^\star)$ grows only as $\log n$ with the dimension. Second, the $\ell_\infty$ norm being the smallest among all $\ell_p$ norms, the smoothness bound $L_1$ is also minimal among all choices of $\ell_p$ norms.

Let us now follow the construction of Section 3. The simplex $C = \{x \in \mathbb{R}^n : \mathbf{1}^T x \le 1, x \ge 0\}$ is not centrally symmetric, but we can symmetrize it as the $\ell_1$ ball. The Minkowski norm associated with that set is then equal to the $\ell_1$ norm, so $\|\cdot\|_Q = \|\cdot\|_1$ here. The space $(\mathbb{R}^n, \|\cdot\|_\infty)$ is $2 \log n$-regular [Juditsky and Nemirovski, 2008, Example 3.2], with the prox function chosen here as $\|\cdot\|_{(2 \log n)}^2/2$. Proposition 3.8 then shows that the complexity bound we obtain using this procedure is identical to that in (13). A similar result holds in the matrix case.
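To make the simplex example concrete, here is a minimal sketch of one $z_k$ update of (5) under the entropy prox, assuming the standard unit simplex with equality constraint $\mathbf{1}^T x = 1$ as in the usual entropy setup; the closed-form "softmax" solution is classical, while the function name and the numbers below are ours.

```python
import numpy as np

def entropy_prox_step(g, scale=1.0):
    """Solve min over the simplex of <g, x> + scale * sum_i x_i log x_i.
    The minimizer has the closed-form 'softmax' expression
    x_i proportional to exp(-g_i / scale); shifting the exponent by its
    maximum is only for numerical stability."""
    w = -g / scale
    w -= w.max()
    x = np.exp(w)
    return x / x.sum()

# One z_k update of (5): with d(x) = log n + sum_i x_i log x_i and sigma = 1,
# the subproblem is an entropy prox step applied to the accumulated
# linearizations g = sum_i alpha_i * grad f(x_i), scaled by L.
L = 10.0
g_acc = np.array([0.3, -1.2, 0.5, 0.1])   # hypothetical accumulated gradients
z = entropy_prox_step(g_acc, scale=L)
print(z, z.sum())                          # a feasible point: z >= 0, sum = 1
```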
5. CONCLUSION

From a practical point of view, the results above offer guidance in the choice of a prox function depending on the geometry of the feasible set $Q$. On the theoretical side, these results provide affine invariant descriptions of the complexity of the feasible set and of the smoothness of the objective function, written in terms of the regularity constant of the polar of the feasible set and the Lipschitz constant of $\nabla f$ with respect to the Minkowski norm. However, while we show that it is possible to formulate an affine invariant implementation of the optimal algorithm in [Nesterov, 1983], we do not yet show that this is always a good idea...

In particular, given our choice of norm, the constants $L_Q$ and $D_Q$ are both affine invariant, with $L_Q$ optimal by construction and our choice of prox function minimizing $D_Q$ over all smooth square norms, but this does not mean that our choice of norm (the Minkowski gauge) minimizes the product $L_Q \min\{D_Q/2, n\}$, hence that we achieve the best possible bound for the complexity of the smooth algorithm in [Nesterov, 1983].

ACKNOWLEDGMENTS

AA and MJ would like to acknowledge support from the European Research Council (project SIPA). MJ also acknowledges support from the Swiss National Science Foundation (SNSF).

REFERENCES

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms. Springer, 1993.
A. Juditsky and A. S. Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces. arXiv preprint arXiv:0809.0813, 2008.
A. Juditsky, G. Lan, A. Nemirovski, and A. Shapiro. Stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.
Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372-376, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization. Springer, 2003.
Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.
Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics, Philadelphia, 1994.

CMAP, UMR CNRS 7641, École Polytechnique, Palaiseau, France.
E-mail address: [email protected]

CMAP, UMR CNRS 7641, École Polytechnique, Palaiseau, France.
E-mail address: [email protected]
