
Stochastic ADMM for Nonsmooth Optimization

Hua Ouyang (Computational Science and Engineering, Georgia Institute of Technology), Niao He (Industrial and Systems Engineering, Georgia Institute of Technology), Alexander Gray (Computational Science and Engineering, Georgia Institute of Technology)

arXiv:1211.0632v2 [cs.LG] 22 Jan 2013

Abstract

We present a stochastic setting for optimization problems with nonsmooth convex separable objective functions over linear equality constraints. To solve such problems, we propose a stochastic Alternating Direction Method of Multipliers (ADMM) algorithm. Our algorithm applies to a more general class of nonsmooth convex functions that need not admit a closed-form solution when the augmented function is minimized directly. We also establish rates of convergence for our algorithm under various structural assumptions on the stochastic functions: O(1/√t) for convex functions and O(log t/t) for strongly convex functions. Compared to previous literature, we establish the convergence rate of the ADMM algorithm, for the first time, in terms of both the objective value and the feasibility violation.

A short version of this paper appears in the 5th NIPS Workshop on Optimization for Machine Learning, Lake Tahoe, Nevada, USA, 2012.

1 Introduction

The Alternating Direction Method of Multipliers (ADMM) [1, 2] is a very simple computational method for constrained optimization proposed in the 1970s. The theoretical aspects of ADMM were studied from the 1980s to the 1990s, and its global convergence was established in the literature [3, 4, 5]. As reviewed in the comprehensive paper [6], with its capacity for dealing with the objective functions separately and synchronously, this method turned out to be a natural fit for the field of large-scale, data-distributed machine learning and big-data optimization, and it has therefore received a significant amount of attention in the last few years. Intensive theoretical and practical advances followed. On the theoretical side, ADMM has recently been shown to have a rate of convergence of O(1/N) [7, 8, 9, 10], where N stands for the number of iterations. On the practical side, ADMM has been applied to a wide range of application domains, such as compressed sensing [11], image restoration [12], video processing, and matrix completion [13]. Besides that, many variations of this classical method have recently been developed, such as linearized [13, 14, 15], accelerated [13], and online [10] ADMM. However, most of these variants, including the classic one, implicitly assume full access to true data values, while in reality one can hardly ignore the existence of noise. A more natural way of handling this issue is to consider unbiased or even biased observations of the true data, which leads us to the stochastic setting.

1.1 Stochastic Setting for ADMM

In this work, we study a family of convex optimization problems in which the objective functions are separable and stochastic. In particular, we are interested in solving the following linear equality-constrained stochastic optimization problem:

  min_{x ∈ X, y ∈ Y}  E_ξ[θ₁(x, ξ)] + θ₂(y)   s.t.  Ax + By = b,        (1)

where x ∈ R^{d₁}, y ∈ R^{d₂}, A ∈ R^{m×d₁}, B ∈ R^{m×d₂}, b ∈ R^m, X is a convex compact set, and Y is a closed convex set. We are able to draw a sequence of independent and identically distributed (i.i.d.) observations of the random vector ξ, which obeys a fixed but unknown distribution P. One can see that when ξ is deterministic, we recover the traditional problem setting of ADMM [6]. Denote the expectation function θ₁(x) ≡ E_ξ[θ₁(x, ξ)]. In our most general setting, the real-valued functions θ₁(·) and θ₂(·) are convex but not necessarily continuously differentiable.
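For concreteness, one instance of template (1) is an ℓ₁-regularized stochastic least-squares problem split into two blocks; this worked example is ours and is not taken from the paper:

\[
\min_{x \in \mathcal{X},\; y \in \mathbb{R}^{d}} \;\; \mathbb{E}_{(a,c)}\!\left[\tfrac{1}{2}\,(a^{\top}x - c)^{2}\right] + \gamma \lVert y \rVert_{1}
\quad \text{s.t.} \quad x - y = 0,
\]

which matches (1) with ξ = (a, c), θ₁(x, ξ) = ½(aᵀx − c)², θ₂(y) = γ‖y‖₁, A = I, B = −I, b = 0, X a Euclidean ball (compact), and Y = R^d.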
Note that our stochastic setting of the problem is quite different from that of the Online ADMM proposed in [10]. In Online ADMM, one does not assume ξ to be i.i.d., nor the objective to be stochastic; instead, a deterministic quantity referred to as the regret is of interest:

  R_{[1:t]}(x) ≡ Σ_{k=1}^{t} [θ₁(x_k, ξ_k) + θ₂(y_k)] − inf_{Ax+By=b} Σ_{k=1}^{t} [θ₁(x, ξ_k) + θ₂(y)].

1.2 Our Contributions

In this work, we propose a stochastic setting for the ADMM problem and design the Stochastic ADMM algorithm. A key algorithmic feature of our Stochastic ADMM that distinguishes it from previous ADMM variants is the first-order approximation of θ₁ that we use to modify the augmented Lagrangian. This simple modification not only guarantees the convergence of our stochastic method, but also extends it to a more general class of convex objective functions for which minimizing the augmented θ₁ directly might not yield a closed-form solution. For example, with Stochastic ADMM we can derive closed-form updates for the nonsmooth hinge loss function (used in support vector machines), whereas with deterministic ADMM one has to call an SVM solver in every iteration [6], which is very time-consuming. One of our main contributions is that we develop the convergence rates of our algorithm under various structural assumptions. For convex θ₁(·), the rate is proved to be O(1/√t); for strongly convex θ₁(·), the rate is proved to be O(log t/t). To the best of our knowledge, this is the first time that convergence rates of ADMM are established for both the objective value and the feasibility violation. By contrast, recent research [8, 10] only shows the convergence of ADMM indirectly, in terms of the satisfaction of variational inequalities.

1.3 Notations

Throughout this paper, we denote the subgradients of θ₁ and θ₂ by θ₁′ and θ₂′. When they are differentiable, we use ∇θ₁ and ∇θ₂ to denote the gradients. We use the notation θ₁ both for the instance function value θ₁(x, ξ) and for its expectation θ₁(x). We denote by θ(u) ≡ θ₁(x) + θ₂(y) the sum of the stochastic and the deterministic functions. For simplicity and clarity, we use the following notations for stacked vectors or tuples:

  u ≡ (x; y),   w ≡ (x; y; λ),   w_k ≡ (x_k; y_k; λ_k),   W ≡ X × Y × R^m,        (2)

  ū_k ≡ ( (1/k) Σ_{i=0}^{k−1} x_i ; (1/k) Σ_{i=1}^{k} y_i ),   w̄_k ≡ ( (1/k) Σ_{i=1}^{k} x_i ; (1/k) Σ_{i=1}^{k} y_i ; (1/k) Σ_{i=1}^{k} λ_i ),   F(w) ≡ ( −A^Tλ ; −B^Tλ ; Ax + By − b ).

For a positive semidefinite matrix G ∈ R^{d₁×d₁}, we define the G-norm of a vector x as ‖x‖_G := ‖G^{1/2}x‖₂ = √(xᵀGx). We use ⟨·,·⟩ to denote the inner product in a finite-dimensional Euclidean space. When there is no ambiguity, we often use ‖·‖ to denote the Euclidean norm ‖·‖₂. We assume that the optimal solution of (1) exists and denote it by u* ≡ (x*ᵀ, y*ᵀ)ᵀ. The following quantities appear frequently in our convergence analysis:

  δ_k ≡ θ₁′(x_{k−1}, ξ_k) − θ₁′(x_{k−1}),   D_X ≡ sup_{x_a, x_b ∈ X} ‖x_a − x_b‖,   D_{y*,B} ≡ ‖B(y₀ − y*)‖₂.        (3)

1.4 Assumptions

Before presenting the algorithm and the convergence results, we list the assumptions that will be used in our statements.

Assumption 1. For all x ∈ X, E[‖θ₁′(x, ξ)‖²] ≤ M².

Assumption 2. For all x ∈ X, E[exp{‖θ₁′(x, ξ)‖²/M²}] ≤ exp{1}.

Assumption 3. For all x ∈ X, E[‖θ₁′(x, ξ) − θ₁′(x)‖²] ≤ σ².

2 Stochastic ADMM Algorithm

Directly solving problem (1) can be nontrivial even if ξ is deterministic and the equality constraint is as simple as x − y = 0. For example, using the augmented Lagrangian method, one has to minimize the augmented Lagrangian:

  min_{x∈X, y∈Y} L_β(x, y, λ) ≡ min_{x∈X, y∈Y} θ₁(x) + θ₂(y) − ⟨λ, Ax + By − b⟩ + (β/2)‖Ax + By − b‖²,

where β is a pre-defined penalty parameter. This problem is at least no easier than solving the original one. The (deterministic) ADMM (Alg. 1) solves this problem in a Gauss-Seidel manner: it minimizes L_β with respect to x and y alternately, each with the other fixed, followed by a penalty update of the Lagrange multiplier λ.

Algorithm 1 Deterministic ADMM
0. Initialize y₀ and λ₀ = 0.
for k = 0, 1, 2, ... do
  1. x_{k+1} ← argmin_{x∈X} θ₁(x) + (β/2)‖(Ax + By_k − b) − λ_k/β‖².
  2. y_{k+1} ← argmin_{y∈Y} θ₂(y) + (β/2)‖(Ax_{k+1} + By − b) − λ_k/β‖².
  3. λ_{k+1} ← λ_k − β(Ax_{k+1} + By_{k+1} − b).
end for
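To make the updates of Alg. 1 concrete, here is a minimal Python sketch (ours, not code from the paper), assuming the simple instance θ₁(x) = ½‖Dx − c‖², θ₂(y) = γ‖y‖₁, A = I, B = −I, b = 0, for which Lines 1 and 2 both have closed forms (a ridge-type linear system and soft-thresholding); all names and default parameters are illustrative.

import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def deterministic_admm_lasso(D, c, gamma, beta=1.0, iters=200):
    """Sketch of Alg. 1 for min 0.5*||D x - c||^2 + gamma*||y||_1  s.t.  x - y = 0,
    i.e. theta1(x) = 0.5*||Dx - c||^2, theta2(y) = gamma*||y||_1, A = I, B = -I, b = 0."""
    d = D.shape[1]
    x = np.zeros(d); y = np.zeros(d); lam = np.zeros(d)
    H = D.T @ D + beta * np.eye(d)    # Line 1 is a ridge-type system; in practice factor it once
    Dtc = D.T @ c
    for _ in range(iters):
        x = np.linalg.solve(H, Dtc + beta * y + lam)       # Line 1
        y = soft_threshold(x - lam / beta, gamma / beta)   # Line 2
        lam = lam - beta * (x - y)                         # Line 3
    return x, y

For a general θ₁, Line 1 typically has no closed form; this is precisely the difficulty addressed by the linearized and stochastic variants discussed next.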
A variant deterministic algorithm named linearized ADMM replaces Line 1 of Alg. 1 by

  x_{k+1} ← argmin_{x∈X} { θ₁(x) + (β/2)‖(Ax + By_k − b) − λ_k/β‖² + (1/2)‖x − x_k‖²_G },        (4)

where G ∈ R^{d₁×d₁} is positive semidefinite. This variant can be regarded as a generalization of the original ADMM. When G = 0, it is the same as Alg. 1. When G = rI_{d₁} − βAᵀA, it is equivalent to the following linearized proximal point method:

  x_{k+1} ← argmin_{x∈X} { θ₁(x) + β(x − x_k)ᵀ[Aᵀ(Ax_k + By_k − b − λ_k/β)] + (r/2)‖x − x_k‖² }.

Note that the linearization is applied only to the quadratic function ‖(Ax + By_k − b) − λ_k/β‖², but not to θ₁. This approximation helps when Line 1 of Alg. 1 does not produce a closed-form solution because of the quadratic term, for example when θ₁(x) = ‖x‖₁ and A is not the identity.

As shown in Alg. 2, we propose a Stochastic Alternating Direction Method of Multipliers (Stochastic ADMM) algorithm. Our algorithm shares some features with the classical and the linearized ADMM. One can see that Lines 2 and 3 are essentially the same as before. However, there are two major differences in Line 1. First, we replace θ₁(x) with a first-order approximation of θ₁(x, ξ_{k+1}) at x_k: θ₁(x_k) + xᵀθ₁′(x_k, ξ_{k+1}). This approximation has the same flavor as the stochastic mirror descent [16] used for solving a one-variable stochastic convex problem. One important benefit of this approximation is that our algorithm can be applied to nonsmooth objective functions, beyond the smooth and separable least-squares loss used in the lasso. Second, similar to the linearized ADMM (4), we add an ℓ₂-norm prox-function ‖x − x_k‖², but scale it by a time-varying step size η_{k+1}. As we will see in Section 3, the choice of this step size is crucial for guaranteeing convergence.

Algorithm 2 Stochastic ADMM
0. Initialize x₀, y₀ and λ₀ = 0.
for k = 0, 1, 2, ... do
  1. x_{k+1} ← argmin_{x∈X} ⟨θ₁′(x_k, ξ_{k+1}), x⟩ + (β/2)‖(Ax + By_k − b) − λ_k/β‖² + ‖x − x_k‖²/(2η_{k+1}).
  2. y_{k+1} ← argmin_{y∈Y} θ₂(y) + (β/2)‖(Ax_{k+1} + By − b) − λ_k/β‖².
  3. λ_{k+1} ← λ_k − β(Ax_{k+1} + By_{k+1} − b).
end for
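The following Python sketch (ours, not the authors' code) instantiates Alg. 2 on the same toy split problem with A = I, B = −I, b = 0, X a Euclidean ball, and the step size η_{k+1} = D_X/(M√(2(k+1))) from Theorem 1 below; the function and parameter names are illustrative assumptions.

import numpy as np

def project_ball(v, radius):
    # Euclidean projection onto X = {x : ||x||_2 <= radius}; X is compact, as the paper assumes.
    n = np.linalg.norm(v)
    return v if n <= radius else v * (radius / n)

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1 (solves Line 2 for theta2(y) = gamma*||y||_1).
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def stochastic_admm(sample_subgrad, d, gamma=0.1, beta=1.0, radius=10.0, M=1.0, iters=1000, rng=None):
    """Sketch of Alg. 2 for min E_xi[theta1(x, xi)] + gamma*||y||_1  s.t.  x - y = 0
    (A = I, B = -I, b = 0).  sample_subgrad(x, rng) returns a noisy subgradient theta1'(x_k, xi_{k+1})."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = np.zeros(d); y = np.zeros(d); lam = np.zeros(d)
    xbar = np.zeros(d); ybar = np.zeros(d)
    D_X = 2.0 * radius                                # diameter of X
    for k in range(iters):
        eta = D_X / (M * np.sqrt(2.0 * (k + 1)))      # eta_{k+1} = D_X / (M*sqrt(2(k+1))), as in Theorem 1
        g = sample_subgrad(x, rng)                    # theta1'(x_k, xi_{k+1})
        xbar += (x - xbar) / (k + 1)                  # ubar_t averages x_0, ..., x_{t-1} (definition (2))
        # Line 1: with A = I the linearized subproblem is an isotropic quadratic, so the
        # constrained minimizer over the ball is the projection of the unconstrained one.
        x = (beta * y + lam + x / eta - g) / (beta + 1.0 / eta)
        x = project_ball(x, radius)
        y = soft_threshold(x - lam / beta, gamma / beta)   # Line 2
        lam = lam - beta * (x - y)                         # Line 3
        ybar += (y - ybar) / (k + 1)                  # ... and y_1, ..., y_t
    return xbar, ybar

As a usage example, sample_subgrad(x, rng) could return a * (a @ x - c) for a freshly drawn pair (a, c), matching the least-squares instance sketched in Section 1.1.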
3 Main Results of Convergence Rates

In this section, we show that our Stochastic ADMM given in Alg. 2 exhibits a rate of convergence of O(1/√t) in terms of both the objective value and the feasibility violation: E[θ(ū_t) − θ(u*) + ρ‖Ax̄_t + Bȳ_t − b‖₂] = O(1/√t). We then extend the main result when more structural information about θ₁ is available.

Before we address the main theorem on convergence rates, we first present an upper bound on the variation of the Lagrangian function and its first-order approximation evaluated at the iterates.

Lemma 1. ∀w ∈ W, k ≥ 1, we have

  θ₁(x_k) + θ₂(y_{k+1}) − θ(u) + (w_{k+1} − w)ᵀF(w_{k+1})
  ≤ (η_{k+1}/2)‖θ₁′(x_k, ξ_{k+1})‖² + (1/(2η_{k+1}))(‖x_k − x‖² − ‖x_{k+1} − x‖²)
   + (β/2)(‖Ax + By_k − b‖² − ‖Ax + By_{k+1} − b‖²)
   + ⟨δ_{k+1}, x − x_k⟩ + (1/(2β))(‖λ − λ_k‖₂² − ‖λ − λ_{k+1}‖₂²).        (5)

Utilizing this lemma, we are able to obtain our main result, stated below in two fashions: in terms of expectation and in terms of a probability guarantee.

Theorem 1. Let η_k = D_X/(M√(2k)), ∀k ≥ 1, and let ρ > 0.

(i) Under Assumption 1, we have ∀t ≥ 1,

  E[θ(ū_t) − θ(u*) + ρ‖Ax̄_t + Bȳ_t − b‖] ≤ M₁(t) + M₂(t) ≡ (√2 D_X M)/√t + (βD²_{y*,B} + ρ²/β)/(2t).        (6)

(ii) Under Assumptions 1 and 2, we have for any Ω > 0,

  Prob{ θ(ū_t) − θ(u*) + ρ‖Ax̄_t + Bȳ_t − b‖₂ > (1 + Ω/2 + 2√(2Ω)) M₁(t) + M₂(t) } ≤ 2 exp{−Ω}.        (7)

Remark 1. Adapting our proof techniques to the deterministic case, where no noise is present, we are able to obtain a similar result for deterministic ADMM:

  ∀ρ > 0, t ≥ 1:  θ(ū_t) − θ(u*) + ρ‖Ax̄_t + Bȳ_t − b‖₂ ≤ (βD²_{y*,B})/(2t) + ρ²/(2βt).        (8)

While this recovers the same O(1/t) convergence rate as the existing literature [8, 9, 10], the above finding is a significant advance in the theoretical understanding of ADMM. For the first time, the convergence of ADMM is proved in terms of the objective value and the feasibility violation. By contrast, the existing literature [8, 9, 10] only shows the convergence of ADMM in terms of the satisfaction of variational inequalities, which is not a direct measure of how fast an algorithm reaches the optimal solution.

3.1 Extension: Strongly Convex θ₁

When the function θ₁(·) is strongly convex, the convergence rate of Stochastic ADMM can be improved to O(log t/t).

Theorem 2. When θ₁ is µ-strongly convex with respect to ‖·‖, taking η_k = 1/(kµ) in Alg. 2, under Assumption 1 we have, ∀ρ > 0, t ≥ 1,

  E[θ(ū_t) − θ(u*) + ρ‖Ax̄_t + Bȳ_t − b‖₂] ≤ (M² log t)/(µt) + (µD_X²)/(2t) + (βD²_{y*,B})/(2t) + ρ²/(2βt).

3.2 Extension: Lipschitz Smooth θ₁

Since the bounds given in Theorem 1 are governed by the magnitude of the subgradients, they do not provide any intuition about the performance in low-noise scenarios. With a Lipschitz smooth function θ₁, we are able to obtain convergence rates in terms of the variation of the gradients, as stated in Assumption 3. Besides, under this assumption we are able to replace the unusual definition of ū_k in (2) with

  ū_k ≡ ( (1/k) Σ_{i=1}^{k} x_i ; (1/k) Σ_{i=1}^{k} y_i ).        (9)

Theorem 3. When θ₁(·) is L-Lipschitz smooth with respect to ‖·‖, taking η_k = 1/(L + σ√(2k)/D_X) in Alg. 2, under Assumption 3 we have, ∀ρ > 0, t ≥ 1,

  E[θ(ū_t) − θ(u*) + ρ‖Ax̄_t + Bȳ_t − b‖₂] ≤ (√2 D_X σ)/√t + (LD_X²)/(2t) + (βD²_{y*,B})/(2t) + ρ²/(2βt).

4 Summary and Future Work

In this paper, we have proposed the stochastic setting for ADMM along with our Stochastic ADMM algorithm. Based on a first-order approximation of the stochastic function, our algorithm is applicable to a very broad class of problems, even with functions that admit no closed-form solution to the subproblem of minimizing the augmented θ₁. We have also established convergence rates under various structural assumptions on θ₁: O(1/√t) for convex functions and O(log t/t) for strongly convex functions. We are working on integrating Nesterov's optimal first-order methods [17] into our algorithm, which will help achieve optimal convergence rates. More interesting and challenging applications will be explored in future work.

5 Appendix

5.1 3-Point Relation

Before proving Lemma 1, we start with the following simple lemma, which is a very useful result obtained by using a Bregman divergence as the prox-function in proximal methods.

Lemma 2. Let l(x): X → R be a convex differentiable function with gradient g, and let the scalar s ≥ 0. For any vectors u and v, denote their Bregman divergence as D(u, v) ≡ ω(u) − ω(v) − ⟨∇ω(v), u − v⟩. If

  x* ≡ argmin_{x∈X} l(x) + sD(x, u),        (10)

then for all x ∈ X,

  ⟨g(x*), x* − x⟩ ≤ s[D(x, u) − D(x, x*) − D(x*, u)].

Proof. Invoking the optimality condition for (10), we have

  ⟨g(x*) + s∇D(x*, u), x − x*⟩ ≥ 0,  ∀x ∈ X,

which is equivalent to

  ⟨g(x*), x* − x⟩ ≤ s⟨∇D(x*, u), x − x*⟩ = s⟨∇ω(x*) − ∇ω(u), x − x*⟩ = s[D(x, u) − D(x, x*) − D(x*, u)].
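To connect Lemma 2 with how it is used in the next subsection (this annotation is ours, not text from the paper), take ω(x) = ½‖x‖², so that D(x, u) = ½‖x − u‖², and apply the lemma to the subproblem in Line 1 of Alg. 2 with u = x_k, x* = x_{k+1}, s = 1/η_{k+1}, and

\[
l(x) = \langle \theta_1'(x_k,\xi_{k+1}),\, x\rangle + \tfrac{\beta}{2}\,\bigl\lVert (Ax + By_k - b) - \lambda_k/\beta \bigr\rVert^2,
\qquad
g(x_{k+1}) = \theta_1'(x_k,\xi_{k+1}) + A^{\top}\!\bigl[\beta(Ax_{k+1}+By_k-b)-\lambda_k\bigr].
\]

The lemma then gives

\[
\bigl\langle g(x_{k+1}),\, x_{k+1}-x \bigr\rangle \;\le\; \tfrac{1}{2\eta_{k+1}}\Bigl( \lVert x_k - x\rVert^2 - \lVert x_{k+1} - x\rVert^2 - \lVert x_{k+1}-x_k\rVert^2 \Bigr),
\]

which is exactly inequality (12) invoked in the proof of Lemma 1 below.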
5.2 Proof of Lemma 1

Proof. Due to the convexity of θ₁ and using the definition of δ_{k+1}, we have

  θ₁(x_k) − θ₁(x) ≤ ⟨θ₁′(x_k), x_k − x⟩ = ⟨θ₁′(x_k, ξ_{k+1}), x_{k+1} − x⟩ + ⟨δ_{k+1}, x − x_k⟩ + ⟨θ₁′(x_k, ξ_{k+1}), x_k − x_{k+1}⟩.        (11)

Applying Lemma 2 to Line 1 of Alg. 2 and taking D(u, v) = (1/2)‖v − u‖², we have

  ⟨θ₁′(x_k, ξ_{k+1}) + Aᵀ[β(Ax_{k+1} + By_k − b) − λ_k], x_{k+1} − x⟩
  ≤ (1/(2η_{k+1}))(‖x_k − x‖² − ‖x_{k+1} − x‖² − ‖x_{k+1} − x_k‖²).        (12)

Combining (11) and (12), we have

  θ₁(x_k) − θ₁(x) + ⟨x_{k+1} − x, −Aᵀλ_{k+1}⟩
  ≤ ⟨θ₁′(x_k, ξ_{k+1}), x_{k+1} − x⟩ + ⟨δ_{k+1}, x − x_k⟩ + ⟨θ₁′(x_k, ξ_{k+1}), x_k − x_{k+1}⟩
   + ⟨x_{k+1} − x, Aᵀ[β(Ax_{k+1} + By_{k+1} − b) − λ_k]⟩   [by (11)]
  = ⟨θ₁′(x_k, ξ_{k+1}) + Aᵀ[β(Ax_{k+1} + By_k − b) − λ_k], x_{k+1} − x⟩ + ⟨δ_{k+1}, x − x_k⟩
   + ⟨x − x_{k+1}, βAᵀB(y_k − y_{k+1})⟩ + ⟨θ₁′(x_k, ξ_{k+1}), x_k − x_{k+1}⟩        (13)
  ≤ (1/(2η_{k+1}))(‖x_k − x‖² − ‖x_{k+1} − x‖² − ‖x_{k+1} − x_k‖²) + ⟨δ_{k+1}, x − x_k⟩
   + ⟨x − x_{k+1}, βAᵀB(y_k − y_{k+1})⟩ + ⟨θ₁′(x_k, ξ_{k+1}), x_k − x_{k+1}⟩.   [by (12)]

We handle the last two terms separately:

  ⟨x − x_{k+1}, βAᵀB(y_k − y_{k+1})⟩ = β⟨Ax − Ax_{k+1}, By_k − By_{k+1}⟩
  = (β/2)[(‖Ax + By_k − b‖² − ‖Ax + By_{k+1} − b‖²) + (‖Ax_{k+1} + By_{k+1} − b‖² − ‖Ax_{k+1} + By_k − b‖²)]
  ≤ (β/2)(‖Ax + By_k − b‖² − ‖Ax + By_{k+1} − b‖²) + (1/(2β))‖λ_k − λ_{k+1}‖²        (14)

and

  ⟨θ₁′(x_k, ξ_{k+1}), x_k − x_{k+1}⟩ ≤ (η_{k+1}/2)‖θ₁′(x_k, ξ_{k+1})‖² + (1/(2η_{k+1}))‖x_k − x_{k+1}‖²,        (15)

where the last step is due to Young's inequality. Inserting (14) and (15) into (13), we have

  θ₁(x_k) − θ₁(x) + ⟨x_{k+1} − x, −Aᵀλ_{k+1}⟩
  ≤ (1/(2η_{k+1}))(‖x_k − x‖² − ‖x_{k+1} − x‖²) + (η_{k+1}/2)‖θ₁′(x_k, ξ_{k+1})‖² + ⟨δ_{k+1}, x − x_k⟩
   + (β/2)(‖Ax + By_k − b‖² − ‖Ax + By_{k+1} − b‖²) + (1/(2β))‖λ_k − λ_{k+1}‖².        (16)

Due to the optimality condition of Line 2 in Alg. 2 and the convexity of θ₂, we have

  θ₂(y_{k+1}) − θ₂(y) + ⟨y_{k+1} − y, −Bᵀλ_{k+1}⟩ ≤ 0.        (17)

Using Line 3 in Alg. 2, we have

  ⟨λ_{k+1} − λ, Ax_{k+1} + By_{k+1} − b⟩ = (1/β)⟨λ_{k+1} − λ, λ_k − λ_{k+1}⟩
  = (1/(2β))(‖λ − λ_k‖² − ‖λ − λ_{k+1}‖² − ‖λ_k − λ_{k+1}‖²).        (18)

Taking the summation of inequalities (16), (17) and (18), we obtain the desired result.

5.3 Proof of Theorem 1

Proof. (i) Invoking the convexity of θ₁(·) and θ₂(·) and the monotonicity of the operator F(·), we have, ∀w ∈ W,

  θ(ū_t) − θ(u) + (w̄_t − w)ᵀF(w̄_t) ≤ (1/t) Σ_{k=1}^{t} [θ₁(x_{k−1}) + θ₂(y_k) − θ(u) + (w_k − w)ᵀF(w_k)]
  = (1/t) Σ_{k=0}^{t−1} [θ₁(x_k) + θ₂(y_{k+1}) − θ(u) + (w_{k+1} − w)ᵀF(w_{k+1})].        (19)

Applying Lemma 1 at the optimal solution (x, y) = (x*, y*), we can derive from (19) that, ∀λ,

  θ(ū_t) − θ(u*) + (x̄_t − x*)ᵀ(−Aᵀλ̄_t) + (ȳ_t − y*)ᵀ(−Bᵀλ̄_t) + (λ̄_t − λ)ᵀ(Ax̄_t + Bȳ_t − b)
  ≤ (1/t) Σ_{k=0}^{t−1} [ (η_{k+1}/2)‖θ₁′(x_k, ξ_{k+1})‖² + (1/(2η_{k+1}))(‖x_k − x*‖² − ‖x_{k+1} − x*‖²) + ⟨δ_{k+1}, x* − x_k⟩ ]
   + (1/t)[ (β/2)‖Ax* + By₀ − b‖² + (1/(2β))‖λ − λ₀‖² ]   [by (5)]
  ≤ (1/t) Σ_{k=0}^{t−1} [ (η_{k+1}/2)‖θ₁′(x_k, ξ_{k+1})‖² + ⟨δ_{k+1}, x* − x_k⟩ ]
   + (1/t)[ D_X²/(2η_t) + (β/2)D²_{y*,B} + (1/(2β))‖λ − λ₀‖₂² ],        (20)

where we used Ax* + By* = b in the last step. The above inequality holds for all λ ∈ R^m; hence it also holds inside the ball B₀ ≡ {λ : ‖λ‖₂ ≤ ρ}.
Combining this with the fact that the optimal solution must also be feasible, it follows that

  max_{λ∈B₀} { θ(ū_t) − θ(u*) + (x̄_t − x*)ᵀ(−Aᵀλ̄_t) + (ȳ_t − y*)ᵀ(−Bᵀλ̄_t) + (λ̄_t − λ)ᵀ(Ax̄_t + Bȳ_t − b) }
  = max_{λ∈B₀} { θ(ū_t) − θ(u*) + λ̄_tᵀ(Ax* + By* − b) − λᵀ(Ax̄_t + Bȳ_t − b) }
  = max_{λ∈B₀} { θ(ū_t) − θ(u*) − λᵀ(Ax̄_t + Bȳ_t − b) }
  = θ(ū_t) − θ(u*) + ρ‖Ax̄_t + Bȳ_t − b‖₂.        (21)

Taking an expectation over (21) and using (20), we have

  E[θ(ū_t) − θ(u*) + ρ‖Ax̄_t + Bȳ_t − b‖₂]
  ≤ E[ (1/t) Σ_{k=0}^{t−1} ( (η_{k+1}/2)‖θ₁′(x_k, ξ_{k+1})‖² + ⟨δ_{k+1}, x* − x_k⟩ ) + (1/t)( D_X²/(2η_t) + (β/2)D²_{y*,B} ) ]
   + E[ max_{λ∈B₀} (1/(2βt))‖λ − λ₀‖₂² ]
  ≤ (1/t)( (M²/2) Σ_{k=1}^{t} η_k + D_X²/(2η_t) ) + (βD²_{y*,B})/(2t) + ρ²/(2βt) + (1/t) Σ_{k=0}^{t−1} E[⟨δ_{k+1}, x* − x_k⟩]
  = (1/t)( (M²/2) Σ_{k=1}^{t} η_k + D_X²/(2η_t) ) + (βD²_{y*,B})/(2t) + ρ²/(2βt)
  ≤ (√2 D_X M)/√t + (βD²_{y*,B})/(2t) + ρ²/(2βt).

In the second-to-last step we use the fact that x_k is independent of ξ_{k+1}; hence E_{ξ_{k+1}|ξ_{[1:k]}}[⟨δ_{k+1}, x* − x_k⟩] = ⟨E_{ξ_{k+1}|ξ_{[1:k]}}[δ_{k+1}], x* − x_k⟩ = 0.

(ii) From the steps in the proof of part (i), it follows that

  θ(ū_t) − θ(u*) + ρ‖Ax̄_t + Bȳ_t − b‖
  ≤ (1/t) Σ_{k=0}^{t−1} (η_{k+1}/2)‖θ₁′(x_k, ξ_{k+1})‖² + (1/t) Σ_{k=0}^{t−1} ⟨δ_{k+1}, x* − x_k⟩ + (1/t)( D_X²/(2η_t) + (β/2)D²_{y*,B} + ρ²/(2β) )        (22)
  ≡ A_t + B_t + C_t.

Note that the random variables A_t and B_t depend on ξ_{[t]}.

Claim 1. For Ω₁ > 0,

  Prob{ A_t ≥ (1 + Ω₁) (M²/(2t)) Σ_{k=1}^{t} η_k } ≤ exp{−Ω₁}.        (23)

Let α_k ≡ η_k / Σ_{j=1}^{t} η_j for k = 1, ..., t; then 0 ≤ α_k ≤ 1 and Σ_{k=1}^{t} α_k = 1. Using the fact that the ξ_k are drawn i.i.d. and applying Assumption 2, one has

  E[ exp{ Σ_{k=1}^{t} α_k ‖θ₁′(x_{k−1}, ξ_k)‖²/M² } ] = Π_{k=1}^{t} E[ exp{ α_k ‖θ₁′(x_{k−1}, ξ_k)‖²/M² } ]
  ≤ Π_{k=1}^{t} ( E[ exp{ ‖θ₁′(x_{k−1}, ξ_k)‖²/M² } ] )^{α_k}   (Jensen's inequality)
  ≤ Π_{k=1}^{t} (exp{1})^{α_k} = exp{ Σ_{k=1}^{t} α_k } = exp{1}.

Hence, by Markov's inequality,

  Prob{ A_t ≥ (1 + Ω₁) (M²/(2t)) Σ_{k=1}^{t} η_k } ≤ exp{−(1 + Ω₁)} E[ exp{ Σ_{k=1}^{t} α_k ‖θ₁′(x_{k−1}, ξ_k)‖²/M² } ] ≤ exp{−Ω₁}.

We have therefore proved Claim 1.

Claim 2. For Ω₂ > 0,

  Prob{ B_t > 2Ω₂ D_X M/√t } ≤ exp{ −Ω₂²/4 }.        (24)

In order to prove this claim, we adopt the following facts from Nemirovski's paper [16].

Lemma 3. Given that for all k = 1, ..., t, ζ_k is a deterministic function of ξ_{[k]} with E[ζ_k | ξ_{[k−1]}] = 0 and E[exp{ζ_k²/σ_k²} | ξ_{[k−1]}] ≤ exp{1}, we have:
  (a) for γ ≥ 0, E[exp{γζ_k} | ξ_{[k−1]}] ≤ exp{γ²σ_k²}, ∀k = 1, ..., t;
  (b) letting S_t = Σ_{k=1}^{t} ζ_k, Prob{ S_t > Ω √(Σ_{k=1}^{t} σ_k²) } ≤ exp{ −Ω²/4 }.

Using this result with ζ_k = ⟨δ_k, x* − x_{k−1}⟩, S_t = Σ_{k=1}^{t} ζ_k, and σ_k = 2D_X M, ∀k, we can verify that E[ζ_k | ξ_{[k−1]}] = 0 and

  E[ exp{ζ_k²/σ_k²} | ξ_{[k−1]} ] ≤ E[ exp{ D_X²‖δ_k‖²/σ_k² } | ξ_{[k−1]} ] ≤ exp{1},

since |ζ_k|² ≤ ‖x* − x_{k−1}‖² ‖δ_k‖² ≤ D_X² ( 2‖θ₁′(x_{k−1}, ξ_k)‖² + 2M² ).

Implementing the above results, it follows that

  Prob{ S_t > 2Ω₂ D_X M √t } ≤ exp{ −Ω₂²/4 }.

Since S_t = tB_t, we have Prob{ B_t > 2Ω₂ D_X M/√t } ≤ exp{ −Ω₂²/4 }, as desired.

Combining (22), (23) and (24), we obtain

  Prob{ Err_ρ(ū_t) > (1 + Ω₁) (M²/(2t)) Σ_{k=1}^{t} η_k + 2Ω₂ D_X M/√t + C_t } ≤ exp{−Ω₁} + exp{ −Ω₂²/4 },

where Err_ρ(ū_t) ≡ θ(ū_t) − θ(u*) + ρ‖Ax̄_t + Bȳ_t − b‖₂. Substituting Ω₁ = Ω and Ω₂ = 2√Ω and plugging in η_k = D_X/(M√(2k)), we obtain (7) as desired.
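The final substitution above is stated without the intermediate arithmetic; the following reconstruction (ours, under the step size of Theorem 1) shows how the constants in (7) arise. With η_k = D_X/(M√(2k)),

\[
\frac{M^2}{2t}\sum_{k=1}^{t}\eta_k \;\le\; \frac{M D_X}{\sqrt{2t}} \;=\; \frac{M_1(t)}{2},
\qquad
\frac{D_X^2}{2\eta_t\, t} \;=\; \frac{M D_X\sqrt{2t}}{2t} \;=\; \frac{M_1(t)}{2},
\]

so C_t = M₁(t)/2 + M₂(t). Substituting Ω₁ = Ω and Ω₂ = 2√Ω, and using 2Ω₂ D_X M/√t = 4√Ω · M₁(t)/√2 = 2√(2Ω) M₁(t), the threshold inside the probability becomes

\[
(1+\Omega)\,\frac{M_1(t)}{2} \;+\; 2\sqrt{2\Omega}\, M_1(t) \;+\; \frac{M_1(t)}{2} + M_2(t)
\;=\; \Bigl(1 + \tfrac{\Omega}{2} + 2\sqrt{2\Omega}\Bigr) M_1(t) + M_2(t),
\]

which is the quantity appearing in (7).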
5.4 Proof of Theorem 2

Proof. By the strong convexity of θ₁ we have, ∀x,

  θ₁(x_k) − θ₁(x) ≤ ⟨θ₁′(x_k), x_k − x⟩ − (µ/2)‖x − x_k‖²
  = ⟨θ₁′(x_k, ξ_{k+1}), x_{k+1} − x⟩ + ⟨δ_{k+1}, x − x_k⟩ + ⟨θ₁′(x_k, ξ_{k+1}), x_k − x_{k+1}⟩ − (µ/2)‖x − x_k‖².

Following the same derivations as in Lemma 1 and Theorem 1(i), we have

  E[θ(ū_t) − θ(u*) + ρ‖Ax̄_t + Bȳ_t − b‖₂]
  ≤ E[ (1/t) Σ_{k=0}^{t−1} ( (η_{k+1}/2)‖θ₁′(x_k, ξ_{k+1})‖² + (1/(2η_{k+1}) − µ/2)‖x_k − x*‖² − (1/(2η_{k+1}))‖x_{k+1} − x*‖² ) ]
   + (βD²_{y*,B})/(2t) + E[ max_{λ∈B₀} (1/(2βt))‖λ − λ₀‖₂² ]
  ≤ (M²/(2t)) Σ_{k=1}^{t} 1/(µk) + (1/t) Σ_{k=0}^{t−1} E[ (µk/2)‖x_k − x*‖² − (µ(k+1)/2)‖x_{k+1} − x*‖² ] + (βD²_{y*,B})/(2t) + ρ²/(2βt)
  ≤ (M² log t)/(µt) + (µD_X²)/(2t) + (βD²_{y*,B})/(2t) + ρ²/(2βt).

5.5 Proof of Theorem 3

Proof. The Lipschitz smoothness of θ₁ implies that, ∀k ≥ 0,

  θ₁(x_{k+1}) ≤ θ₁(x_k) + ⟨∇θ₁(x_k), x_{k+1} − x_k⟩ + (L/2)‖x_{k+1} − x_k‖²
  = θ₁(x_k) + ⟨∇θ₁(x_k, ξ_{k+1}), x_{k+1} − x_k⟩ − ⟨δ_{k+1}, x_{k+1} − x_k⟩ + (L/2)‖x_{k+1} − x_k‖²   [by (3)].

It follows that, ∀x ∈ X,

  θ₁(x_{k+1}) − θ₁(x) + ⟨x_{k+1} − x, −Aᵀλ_{k+1}⟩
  ≤ θ₁(x_k) − θ₁(x) + ⟨∇θ₁(x_k, ξ_{k+1}), x_{k+1} − x_k⟩ − ⟨δ_{k+1}, x_{k+1} − x_k⟩ + (L/2)‖x_{k+1} − x_k‖² + ⟨x_{k+1} − x, −Aᵀλ_{k+1}⟩
  = θ₁(x_k) − θ₁(x) + ⟨∇θ₁(x_k, ξ_{k+1}), x − x_k⟩ − ⟨δ_{k+1}, x_{k+1} − x_k⟩ + (L/2)‖x_{k+1} − x_k‖²
   + [ ⟨∇θ₁(x_k, ξ_{k+1}), x_{k+1} − x⟩ + ⟨x_{k+1} − x, −Aᵀλ_{k+1}⟩ ]
  ≤ ⟨∇θ₁(x_k), x_k − x⟩ + ⟨∇θ₁(x_k, ξ_{k+1}), x − x_k⟩ − ⟨δ_{k+1}, x_{k+1} − x_k⟩ + (L/2)‖x_{k+1} − x_k‖²
   + [ ⟨∇θ₁(x_k, ξ_{k+1}), x_{k+1} − x⟩ + ⟨x_{k+1} − x, −Aᵀλ_{k+1}⟩ ]
  = ⟨δ_{k+1}, x − x_{k+1}⟩ + (L/2)‖x_{k+1} − x_k‖² + [ ⟨∇θ₁(x_k, ξ_{k+1}), x_{k+1} − x⟩ + ⟨x_{k+1} − x, −Aᵀλ_{k+1}⟩ ]
  = ⟨δ_{k+1}, x − x_{k+1}⟩ + (L/2)‖x_{k+1} − x_k‖² + ⟨x − x_{k+1}, βAᵀB(y_k − y_{k+1})⟩
   + ⟨∇θ₁(x_k, ξ_{k+1}) + Aᵀ[β(Ax_{k+1} + By_k − b) − λ_k], x_{k+1} − x⟩
  ≤ (1/(2η_{k+1}))(‖x_k − x‖² − ‖x_{k+1} − x‖²) − ((1/η_{k+1} − L)/2)‖x_{k+1} − x_k‖²
   + ⟨x − x_{k+1}, βAᵀB(y_k − y_{k+1})⟩ + ⟨δ_{k+1}, x − x_{k+1}⟩.   [by (12)]

The last inner product can be bounded as below using Young's inequality, given that η_{k+1} ≤ 1/L:

  ⟨δ_{k+1}, x − x_{k+1}⟩ = ⟨δ_{k+1}, x − x_k⟩ + ⟨δ_{k+1}, x_k − x_{k+1}⟩
  ≤ ⟨δ_{k+1}, x − x_k⟩ + (1/(2(1/η_{k+1} − L)))‖δ_{k+1}‖² + ((1/η_{k+1} − L)/2)‖x_k − x_{k+1}‖².

Combining this with inequalities (14), (17) and (18), we obtain a statement similar to that of Lemma 1:

  θ(u_{k+1}) − θ(u) + (w_{k+1} − w)ᵀF(w_{k+1})
  ≤ ‖δ_{k+1}‖²/(2(1/η_{k+1} − L)) + (1/(2η_{k+1}))(‖x_k − x‖² − ‖x_{k+1} − x‖²)
   + (β/2)(‖Ax + By_k − b‖² − ‖Ax + By_{k+1} − b‖²)
   + ⟨δ_{k+1}, x − x_k⟩ + (1/(2β))(‖λ − λ_k‖₂² − ‖λ − λ_{k+1}‖₂²).

The rest of the proof is essentially the same as that of Theorem 1(i), except that we use the new definition of ū_k in (9).

References

[1] R. Glowinski and A. Marroco, "Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires," Revue Française d'Automatique, Informatique, et Recherche Opérationnelle, vol. 9, no. 2, 1975.
[2] Daniel Gabay and Bertrand Mercier, "A dual algorithm for the solution of nonlinear variational problems via finite element approximation," Computers & Mathematics with Applications, vol. 2, no. 1, 1976.
[3] Daniel Gabay, "Applications of the method of multipliers to variational inequalities," in Augmented Lagrangian Methods: Applications to the Solution of Boundary-Value Problems, M. Fortin and R. Glowinski, Eds. North-Holland: Amsterdam, 1983.
[4] Roland Glowinski and Patrick Le Tallec, Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics, Studies in Applied and Numerical Mathematics. SIAM, 1989.
[5] Jonathan Eckstein and Dimitri P. Bertsekas, "On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators," Mathematical Programming, vol. 55, no. 1-3, pp. 293-318, 1992.
