
AN OPTIMAL AFFINE INVARIANT SMOOTH MINIMIZATION ALGORITHM

ALEXANDRE D'ASPREMONT AND MARTIN JAGGI

ABSTRACT. We formulate an affine invariant implementation of the algorithm in [Nesterov, 1983]. We show that the complexity bound is then proportional to an affine invariant regularity constant defined with respect to the Minkowski gauge of the feasible set.

Date: December 11, 2013.

1. INTRODUCTION

In this short note, we show how to implement the smooth minimization algorithm described in [Nesterov, 1983, 2005] so that both its iterations and its complexity bound are invariant under a change of coordinates in the problem. We focus on the minimization problem

    minimize    f(x)
    subject to  x \in Q,                                                        (1)

where $f$ is a convex function with Lipschitz continuous gradient and $Q$ is a compact convex set. Without too much loss of generality, we will assume that the interior of $Q$ is nonempty and contains zero. When $Q$ is sufficiently simple, in a sense that will be made precise later, Nesterov [1983] showed that this problem can be solved with a complexity of $O(1/\sqrt{\epsilon})$, where $\epsilon$ is the precision target. Furthermore, it can be shown that this complexity bound is optimal for the class of smooth problems [Nesterov, 2003].

While the dependence in $O(1/\sqrt{\epsilon})$ of the complexity bound in Nesterov [1983] is optimal, the constant in front of that bound still depends on a few parameters which vary with implementation: the choice of norm and prox regularization function. This means in particular that, everything else being equal, this bound is not invariant with respect to an affine change of coordinates, so the complexity bound varies while the intrinsic complexity of problem (1) remains unchanged. Here, we show one possible fix for this inconsistency, by choosing a norm and a prox term for the algorithm in [Nesterov, 1983, 2005] which make its iterations and complexity invariant under a change of coordinates.

2. SMOOTH OPTIMIZATION ALGORITHM

We first recall the basic structure of the algorithm in [Nesterov, 1983]. While many variants of this method have been derived, we use the formulation in [Nesterov, 2005]. We choose a norm $\|\cdot\|$ and assume that the function $f$ in problem (1) is convex with Lipschitz continuous gradient, so

    f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|^2,   x, y \in Q,          (2)

for some $L > 0$. We also choose a prox function $d(x)$ for the set $Q$, i.e. a continuous, strongly convex function on $Q$ with parameter $\sigma$ (see Nesterov [2003] or Hiriart-Urruty and Lemaréchal [1993] for a discussion of regularization techniques using strongly convex functions). We let $x_0$ be the center of $Q$ for the prox function $d(x)$, so that

    x_0 \triangleq \mathop{\rm argmin}_{x \in Q} d(x),

assuming w.l.o.g. that $d(x_0) = 0$; we then get in particular

    d(x) \ge \frac{\sigma}{2} \|x - x_0\|^2.                                    (3)

We write $T_Q(x)$ a solution to the following subproblem

    T_Q(x) \triangleq \mathop{\rm argmin}_{y \in Q} \left\{ \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|^2 \right\}.       (4)

We let $y_0 \triangleq T_Q(x_0)$ where $x_0$ is defined above. We recursively define three sequences of points: the current iterate $x_k$, the corresponding $y_k = T_Q(x_k)$, and the points

    z_k \triangleq \mathop{\rm argmin}_{x \in Q} \left\{ \frac{L}{\sigma} d(x) + \sum_{i=0}^k \alpha_i \left[ f(x_i) + \langle \nabla f(x_i), x - x_i \rangle \right] \right\}        (5)

together with a step size sequence $\alpha_k \ge 0$ with $\alpha_0 \in (0, 1]$, so that

    x_{k+1} = \tau_k z_k + (1 - \tau_k) y_k
    y_{k+1} = T_Q(x_{k+1})                                                      (6)

where $\tau_k = \alpha_{k+1}/A_{k+1}$ with $A_k = \sum_{i=0}^k \alpha_i$. We implicitly assume here that $Q$ is simple enough so that the two subproblems defining $y_k$ and $z_k$ can be solved very efficiently. We have the following convergence result.

Theorem 2.1. Suppose $\alpha_k = (k+1)/2$ with the iterates $x_k$, $y_k$ and $z_k$ defined in (5) and (6); then for any $k \ge 0$ we have

    f(y_k) - f(x^\star) \le \frac{4 L d(x^\star)}{\sigma (k+1)^2}

where $x^\star$ is an optimal solution to problem (1).

Proof. See Nesterov [2005].
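To make the recursion (4)-(6) concrete, here is a minimal numerical sketch, assuming the Euclidean norm, the prox function $d(x) = \|x - x_0\|^2/2$ (so $\sigma = 1$) and $Q$ a Euclidean ball; with these choices both subproblems (4) and (5) reduce to projections onto $Q$. The function names and the example problem are ours, not from the paper.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the l2 ball of the given radius."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

def smooth_minimize(grad_f, L, x0, n_iter=100, proj=project_ball):
    """Sketch of the method (4)-(6) with step sizes alpha_k = (k+1)/2."""
    xk = x0.copy()
    grad_acc = np.zeros_like(x0)  # running sum of alpha_i * grad f(x_i), cf. (5)
    A = 0.0                       # running sum A_k of the alpha_i
    yk = x0
    for k in range(n_iter):
        g = grad_f(xk)
        alpha = (k + 1) / 2.0
        A += alpha
        grad_acc += alpha * g
        yk = proj(xk - g / L)            # y_k = T_Q(x_k), cf. (4)
        zk = proj(x0 - grad_acc / L)     # z_k, cf. (5) with d(x) = ||x - x0||^2 / 2
        alpha_next = (k + 2) / 2.0
        tau = alpha_next / (A + alpha_next)  # tau_k = alpha_{k+1} / A_{k+1}
        xk = tau * zk + (1 - tau) * yk       # cf. (6)
    return yk

# Example: minimize f(x) = ||x - b||^2 / 2 (so L = 1) over the unit ball.
b = np.array([2.0, 0.0, 0.0])
y = smooth_minimize(lambda x: x - b, L=1.0, x0=np.zeros(3), n_iter=200)
print(y)   # approaches the projection of b onto the ball, here (1, 0, 0)
```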
If $\epsilon > 0$ is the target precision, Theorem 2.1 ensures that Algorithm 1 will converge to an $\epsilon$-accurate solution in no more than

    \sqrt{\frac{8 L d(x^\star)}{\sigma \epsilon}}                               (7)

iterations. In practice of course, $d(x^\star)$ needs to be bounded a priori, and $L$ and $\sigma$ are often hard to evaluate.

Algorithm 1 Smooth minimization.
Input: $x_0$, the prox center of the set $Q$.
1: for $k = 0, \ldots, N$ do
2:   Compute $\nabla f(x_k)$.
3:   Compute $y_k = T_Q(x_k)$.
4:   Compute $z_k = \mathop{\rm argmin}_{x \in Q} \left\{ \frac{L}{\sigma} d(x) + \sum_{i=0}^k \alpha_i [f(x_i) + \langle \nabla f(x_i), x - x_i \rangle] \right\}$.
5:   Set $x_{k+1} = \tau_k z_k + (1 - \tau_k) y_k$.
6: end for
Output: $x_N, y_N \in Q$.

While most of the parameters in Algorithm 1 are set explicitly, the norm $\|\cdot\|$ and the prox function $d(x)$ are chosen arbitrarily. In what follows, we will see that a natural choice for both makes the algorithm affine invariant.

3. AFFINE INVARIANT IMPLEMENTATION

We can define an affine change of coordinates $x = Ay$, where $A \in \mathbb{R}^{n \times n}$ is a nonsingular matrix, for which the original optimization problem in (1) is transformed so that

    minimize    f(x)                 minimize    \hat f(y)
    subject to  x \in Q    becomes   subject to  y \in \hat Q,                  (8)

in the variable $y \in \mathbb{R}^n$, where

    \hat f(y) \triangleq f(Ay)   and   \hat Q \triangleq A^{-1} Q.              (9)

Unless $A$ is pathologically ill-conditioned, both problems are equivalent and should have invariant complexity bounds and iterations. In fact, the complexity analysis of Newton's method based on the self-concordance argument developed in [Nesterov and Nemirovskii, 1994] produces affine invariant complexity bounds, and the iterates themselves are invariant. Here we will show how to choose the norm $\|\cdot\|$ and the prox function $d(x)$ to get a similar behavior for Algorithm 1.

3.1. Choosing the norm. We start with a few classical results and definitions. Recall that the Minkowski gauge of a set $Q$ is defined as follows.

Definition 3.1. Let $Q \subset \mathbb{R}^n$ contain zero. We define the Minkowski gauge of $Q$ as

    \gamma_Q(x) \triangleq \inf \{ \lambda \ge 0 : x \in \lambda Q \}

with $\gamma_Q(x) = 0$ when $Q$ is unbounded in the direction $x$.

When $Q$ is a compact convex set, centrally symmetric with respect to the origin and with nonempty interior, the Minkowski gauge defines a norm. We write this norm $\|\cdot\|_Q \triangleq \gamma_Q(\cdot)$. From now on, we will assume that the set $Q$ is centrally symmetric, or use for example $\bar Q = Q - Q$ (in the Minkowski sense) for the gauge when it is not (this can be improved, and extending these results to the nonsymmetric case is a classical topic in functional analysis). Note that any affine transform of a centrally symmetric convex set remains centrally symmetric. The following simple result shows why $\|\cdot\|_Q$ is potentially a good choice of norm for Algorithm 1.

Lemma 3.2. Suppose $f : \mathbb{R}^n \to \mathbb{R}$, $Q$ is a centrally symmetric convex set with nonempty interior, and let $A \in \mathbb{R}^{n \times n}$ be a nonsingular matrix. Then $f$ has Lipschitz continuous gradient with respect to the norm $\|\cdot\|_Q$ with constant $L > 0$, i.e.

    f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|_Q^2,   x, y \in Q,

if and only if the function $f(Aw)$ has Lipschitz continuous gradient with respect to the norm $\|\cdot\|_{A^{-1}Q}$ with the same constant $L$.

Proof. Let $y = Az$ and $x = Aw$; then

    f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|_Q^2,   x, y \in Q,

is equivalent to

    f(Az) \le f(Aw) + \langle A^{-T} \nabla_w f(Aw), Az - Aw \rangle + \frac{L}{2} \|Az - Aw\|_Q^2,   z, w \in A^{-1}Q,

and, using the fact that $\|Ax\|_Q = \|x\|_{A^{-1}Q}$, this is also

    f(Az) \le f(Aw) + \langle \nabla_w f(Aw), A^{-1}(Az - Aw) \rangle + \frac{L}{2} \|z - w\|_{A^{-1}Q}^2,   z, w \in A^{-1}Q,

hence the desired result.
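As a quick sanity check of Lemma 3.2, the following sketch verifies the invariance of the Lipschitz constant numerically, assuming $Q$ is the unit $\ell_1$ ball (so $\|\cdot\|_Q = \|\cdot\|_1$) and $f$ a convex quadratic; the setup and variable names are illustrative choices of ours. For a quadratic $f(x) = x^T H x / 2$, the best constant is the maximum of $u^T H u$ over the unit ball of the norm, attained at an extreme point of that ball.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.normal(size=(n, n))
H = M.T @ M                      # Hessian of the quadratic f(x) = x^T H x / 2
A = rng.normal(size=(n, n))      # generic (hence nonsingular) coordinate change
Ainv = np.linalg.inv(A)

# Q = unit l1 ball, extreme points +-e_i: the Lipschitz constant of grad f
# with respect to ||.||_1 is max_i H_ii.
L_f = max(H[i, i] for i in range(n))

# g(w) = f(Aw) has Hessian A^T H A, and the unit ball of ||.||_{A^{-1}Q}
# has extreme points +-A^{-1} e_i, so the constant is unchanged:
L_g = max(Ainv[:, i] @ (A.T @ H @ A) @ Ainv[:, i] for i in range(n))

print(np.isclose(L_f, L_g))      # True, as Lemma 3.2 predicts
```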
An almost identical argument shows the following analogous result for the property of strong convexity with respect to the norm $\|\cdot\|_Q$ under affine changes of coordinates.

Lemma 3.3. Suppose $f : \mathbb{R}^n \to \mathbb{R}$, $Q$ is a centrally symmetric convex set with nonempty interior, and let $A \in \mathbb{R}^{n \times n}$ be a nonsingular matrix. Then $f$ is strongly convex with respect to the norm $\|\cdot\|_Q$ with parameter $\sigma > 0$, i.e.

    f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma}{2} \|y - x\|_Q^2,   x, y \in Q,

if and only if the function $f(Ax)$ is strongly convex with respect to the norm $\|\cdot\|_{A^{-1}Q}$ with the same parameter $\sigma$.

We now turn our attention to the choice of prox function in Algorithm 1.

3.2. Choosing the prox. Choosing $\|\cdot\|_Q$ as the norm allows us to define a norm without introducing an arbitrary geometry in the algorithm, since the norm is extracted directly from the problem definition. When $Q$ is smooth, a similar reasoning allows us to choose the prox term in Algorithm 1, and we can set $d(x) = \|x\|_Q^2$. The immediate impact of this choice is that the term $d(x^\star)$ in (7) is bounded by one, by construction. This choice has other natural benefits, which are highlighted below. We first recall a result showing that the conjugate of a squared norm is the squared dual norm.

Lemma 3.4. Let $\|\cdot\|$ be a norm and $\|\cdot\|^*$ its dual norm; then

    \frac{1}{2} (\|y\|^*)^2 = \sup_x \; y^T x - \frac{1}{2} \|x\|^2.

Proof. We recall the proof in [Boyd and Vandenberghe, 2004, Example 3.27] as it will prove useful in what follows. By definition, $x^T y \le \|y\|^* \|x\|$, hence

    y^T x - \frac{1}{2} \|x\|^2 \le \|y\|^* \|x\| - \frac{1}{2} \|x\|^2 \le \frac{1}{2} (\|y\|^*)^2

because the second term is a quadratic function of $\|x\|$, with maximum $(\|y\|^*)^2/2$. This maximum is attained by any $x$ such that $x^T y = \|y\|^* \|x\|$ (there must be one by construction of the dual norm), normalized so that $\|x\| = \|y\|^*$, which yields the desired result.

This last result (and its proof) shows that solving the prox mapping is equivalent to finding a vector aligned with the gradient, with respect to the Minkowski norm $\|\cdot\|_Q$. We now recall another simple result showing that the dual of the norm $\|\cdot\|_Q$ is given by $\|\cdot\|_{Q^\circ}$, where $Q^\circ$ is the polar of $Q$.

Lemma 3.5. Let $Q$ be a centrally symmetric convex set with nonempty interior; then $\|\cdot\|_Q^* = \|\cdot\|_{Q^\circ}$.

Proof. We write

    \|x\|_{Q^\circ} = \inf \{ \lambda \ge 0 : x \in \lambda Q^\circ \}
                    = \inf \{ \lambda \ge 0 : x^T y \le \lambda, \text{ for all } y \in Q \}
                    = \inf \{ \lambda \ge 0 : \sup_{y \in Q} x^T y \le \lambda \}
                    = \sup_{y \in Q} x^T y
                    = \|x\|_Q^*

which is the desired result.

The last remaining issue to settle is the strong convexity of the squared Minkowski norm. Fortunately, this too is a classical result in functional analysis, as a squared norm is strongly convex with respect to itself if and only if its dual norm has a smoothness modulus of power 2.

However, this does not cover the case where the norm $\|\cdot\|_Q$ is not smooth. In that scenario, we need to pick the norm based on $Q$ but find a smooth prox function not too different from $\|\cdot\|_Q$. This is exactly the problem studied by Juditsky and Nemirovski [2008], who define the regularity of a Banach space $(E, \|\cdot\|_E)$ in terms of the smoothness of the best smooth approximation of the norm $\|\cdot\|_E$. We first recall a few more definitions, and we will then show that the regularity constant defined by Juditsky and Nemirovski [2008] produces an affine invariant bound on the term $d(x^\star)/\sigma$ in the complexity of the smooth algorithm in [Nesterov, 1983].

Definition 3.6. Suppose $\|\cdot\|_X$ and $\|\cdot\|_Y$ are two norms on a space $E$; the distortion $d(\|\cdot\|_X, \|\cdot\|_Y)$ between these two norms is equal to the smallest product $ab > 0$ such that

    \frac{1}{b} \|x\|_Y \le \|x\|_X \le a \|x\|_Y

over all $x \in E$.

Note that $\log d(\|\cdot\|_X, \|\cdot\|_Y)$ defines a metric on the set of all symmetric convex bodies in $\mathbb{R}^n$, called the Banach-Mazur distance. We then recall the regularity definition in Juditsky and Nemirovski [2008].

Definition 3.7. The regularity constant of a Banach space $(E, \|\cdot\|)$ is the smallest constant $\Delta > 0$ for which there exists a smooth norm $p(x)$ such that

(i) $p(x)^2/2$ has a Lipschitz continuous gradient with constant $\mu$ w.r.t. the norm $p(x)$, with $1 \le \mu \le \Delta$;
(ii) the norm $p(x)$ satisfies

    \|x\|^2 \le p(x)^2 \le \frac{\Delta}{\mu} \|x\|^2,   for all x \in E,        (10)

hence $d(p(x), \|\cdot\|) \le \sqrt{\Delta/\mu}$.
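As a concrete illustration of Definition 3.7 (anticipating the example of Section 4), the following sketch checks numerically that the smooth $\ell_p$ norm with $p = 2 \log n$ sandwiches the nonsmooth $\ell_\infty$ norm with a dimension-independent distortion, a sandwich of the type required in (10); the dimension and random vector below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
p = 2 * np.log(n)                # smooth approximation exponent for l_inf

x = rng.normal(size=n)
p_x = np.sum(np.abs(x) ** p) ** (1 / p)   # p(x) = ||x||_p, a smooth norm
inf_x = np.max(np.abs(x))                  # the target norm ||x||_inf

# ||x||_inf <= ||x||_p <= n^(1/p) ||x||_inf, and n^(1/p) = exp(log(n)/p)
# equals sqrt(e) when p = 2 log n, so the distortion stays bounded as the
# dimension grows.
print(inf_x <= p_x <= np.sqrt(np.e) * inf_x)   # True
print(np.exp(np.log(n) / p))                    # = sqrt(e) ~ 1.6487
```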
Note that in finite dimension, since all norms are equivalent to the Euclidean norm with distortion at most $\sqrt{\dim E}$, we know that all finite dimensional Banach spaces are at least $(\dim E)$-regular. Furthermore, the regularity constant is invariant with respect to an affine change of coordinates, since both the distortion and the smoothness bounds are. We are now ready to prove the main result of this section.

Proposition 3.8. Let $\epsilon > 0$ be the target precision. Suppose that the function $f$ has a Lipschitz continuous gradient with constant $L_Q$ with respect to the norm $\|\cdot\|_Q$ and that the space $(\mathbb{R}^n, \|\cdot\|_Q^*)$ is $D_Q$-regular; then Algorithm 1 will produce an $\epsilon$-solution to problem (1) in at most

    \sqrt{\frac{8 L_Q \min\{D_Q/2, \, n\}}{\epsilon}}                           (11)

iterations. The constants $L_Q$ and $D_Q$ are affine invariant.

Proof. If $(\mathbb{R}^n, \|\cdot\|_Q^*)$ is $D_Q$-regular, then by Definition 3.7 there exists a norm $p(x)$ such that $p(x)^2/2$ has a Lipschitz continuous gradient with constant $\mu$ with respect to the norm $p(x)$, and [Juditsky and Nemirovski, 2008, Prop. 3.2] shows by conjugacy that the prox function $d(x) \triangleq p^*(x)^2/2$ is strongly convex with respect to the norm $p^*(x)$ with constant $1/\mu$. Now (10) means that

    \sqrt{\frac{\mu}{D_Q}} \|x\|_Q \le p^*(x) \le \|x\|_Q,   for all x \in Q,

since $\|\cdot\|^{**} = \|\cdot\|$, hence

    d(x + y) \ge d(x) + \langle \partial d(x), y \rangle + \frac{1}{2\mu} p^*(y)^2
             \ge d(x) + \langle \partial d(x), y \rangle + \frac{1}{2 D_Q} \|y\|_Q^2,

so $d(x)$ is strongly convex with respect to $\|\cdot\|_Q$ with constant $\sigma = 1/D_Q$, and using (10) as above

    \frac{d(x^\star)}{\sigma} = \frac{p^*(x^\star)^2 D_Q}{2} \le \frac{\|x^\star\|_Q^2 D_Q}{2} \le \frac{D_Q}{2}

by definition of $\|\cdot\|_Q$, if $x^\star$ is an optimal (hence feasible) solution of problem (1). The bound in (11) then follows from (7), and its affine invariance follows directly from affine invariance of the distortion and Lemmas 3.2 and 3.3.

4. EXAMPLES

To illustrate our results, consider the problem of minimizing a smooth convex function over the unit simplex, written

    minimize    f(x)
    subject to  \mathbf{1}^T x \le 1,  x \ge 0,                                 (12)

in the variable $x \in \mathbb{R}^n$. As discussed in [Juditsky et al., 2009, §3.3], choosing $\|\cdot\|_1$ as the norm and $d(x) = \log n + \sum_{i=1}^n x_i \log x_i$ as the prox function, we have $\sigma = 1$ and $d(x^\star) \le \log n$, which means the complexity of solving (12) using Algorithm 1 is bounded by

    \sqrt{\frac{8 L_1 \log n}{\epsilon}}                                        (13)

where $L_1$ is the Lipschitz constant of $\nabla f$ with respect to the $\ell_1$ norm. This choice of norm and prox has a double advantage here. First, the prox term $d(x^\star)$ grows only as $\log n$ with the dimension. Second, the $\ell_\infty$ norm being the smallest among all $\ell_p$ norms, the smoothness bound $L_1$ is also minimal among all choices of $\ell_p$ norms.

Let us now follow the construction of Section 3. The simplex $C = \{x \in \mathbb{R}^n : \mathbf{1}^T x \le 1, x \ge 0\}$ is not centrally symmetric, but we can symmetrize it as the $\ell_1$ ball. The Minkowski norm associated with that set is then equal to the $\ell_1$ norm, so $\|\cdot\|_Q = \|\cdot\|_1$ here. The space $(\mathbb{R}^n, \|\cdot\|_\infty)$ is $2 \log n$-regular [Juditsky and Nemirovski, 2008, Example 3.2], with the prox function chosen here as $\|\cdot\|_{(2 \log n)}^2/2$. Proposition 3.8 then shows that the complexity bound we obtain using this procedure is identical to that in (13). A similar result holds in the matrix case.
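To make the simplex example concrete, here is a minimal sketch of one $z_k$ update of (5) under the entropy prox, assuming the standard unit simplex with equality constraint $\mathbf{1}^T x = 1$ as in the usual entropy setup; the closed-form "softmax" solution is classical, while the function name and the numbers below are ours.

```python
import numpy as np

def entropy_prox_step(g, scale=1.0):
    """Solve min over the simplex of <g, x> + scale * sum_i x_i log x_i.
    The minimizer has the closed-form 'softmax' expression
    x_i proportional to exp(-g_i / scale); shifting the exponent by its
    maximum is only for numerical stability."""
    w = -g / scale
    w -= w.max()
    x = np.exp(w)
    return x / x.sum()

# One z_k update of (5): with d(x) = log n + sum_i x_i log x_i and sigma = 1,
# the subproblem is an entropy prox step applied to the accumulated
# linearizations g = sum_i alpha_i * grad f(x_i), scaled by L.
L = 10.0
g_acc = np.array([0.3, -1.2, 0.5, 0.1])   # hypothetical accumulated gradients
z = entropy_prox_step(g_acc, scale=L)
print(z, z.sum())                          # a feasible point: z >= 0, sum = 1
```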
5. CONCLUSION

From a practical point of view, the results above offer guidance in the choice of a prox function depending on the geometry of the feasible set $Q$. On the theoretical side, these results provide affine invariant descriptions of the complexity of the feasible set and of the smoothness of the objective function, written in terms of the regularity constant of the polar of the feasible set and the Lipschitz constant of $\nabla f$ with respect to the Minkowski norm. However, while we show that it is possible to formulate an affine invariant implementation of the optimal algorithm in [Nesterov, 1983], we do not yet show that this is always a good idea...

In particular, given our choice of norm, the constants $L_Q$ and $D_Q$ are both affine invariant, with $L_Q$ optimal by construction and our choice of prox function minimizing $D_Q$ over all smooth square norms, but this does not mean that our choice of norm (the Minkowski gauge) minimizes the product $L_Q \min\{D_Q/2, n\}$, hence that we achieve the best possible bound for the complexity of the smooth algorithm in [Nesterov, 1983].

ACKNOWLEDGMENTS

AA and MJ would like to acknowledge support from the European Research Council (project SIPA). MJ also acknowledges support from the Swiss National Science Foundation (SNSF).

REFERENCES

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms. Springer, 1993.
A. Juditsky and A. S. Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces. arXiv preprint arXiv:0809.0813, 2008.
A. Juditsky, G. Lan, A. Nemirovski, and A. Shapiro. Stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.
Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372-376, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization. Springer, 2003.
Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.
Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics, Philadelphia, 1994.

CMAP, UMR CNRS 7641, École Polytechnique, Palaiseau, France.
E-mail address: [email protected]

CMAP, UMR CNRS 7641, École Polytechnique, Palaiseau, France.
E-mail address: [email protected]
