
The Annals of Statistics
1998, Vol. 26, No. 5, 1733–1759

BACKFITTING IN SMOOTHING SPLINE ANOVA

By Zhen Luo

Pennsylvania State University

A computational scheme for fitting smoothing spline ANOVA models to large data sets with a (near) tensor product design is proposed. Such data sets are common in spatial-temporal analyses. The proposed scheme uses the backfitting algorithm to take advantage of the tensor product design, saving both computational memory and time. Several ways to further speed up the backfitting algorithm, such as collapsing component functions and successive over-relaxation, are discussed. An iterative imputation procedure is used to handle the cases of near tensor product designs. An application to a global historical surface air temperature data set, which motivated this work, is used to illustrate the proposed scheme.

1. Introduction. Smoothing spline ANOVA (SS-ANOVA) is a multivariate function estimation method based on an ANOVA-type decomposition of the function to be estimated. It generalizes both the decomposition of multiple factor effects in an ordinary ANOVA and the smoothing spline method for univariate function estimation. It has been applied to various data sets, for example, an environmental data set [Gu and Wahba (1993a, b)] and an epidemiological data set [Wahba, Wang, Gu, Klein and Klein (1995)]. In this article, an application to the spatial-temporal analysis of a global historical temperature data set will be discussed.

A major difficulty in using SS-ANOVA is its large computational demand. Gu (1989) carefully implemented a computational scheme for generic smoothing spline estimation problems. His scheme works well on data sets of small to moderate size, but since it does not assume any special data structure, it is inevitably slow and memory-demanding when dealing with large data sets.

A subclass of SS-ANOVA models consists of those without any interactions, that is, smoothing spline additive models. Buja, Hastie and Tibshirani (1989) studied these models (and other additive models) and used the backfitting algorithm to transform the problem of multivariate function estimation into many univariate function estimation problems. Because sparse matrix representations of the univariate smoothing (polynomial) spline estimate exist [see, for example, O'Sullivan (1985)], this transformation speeds up computation considerably and saves memory as well.

When interactions are nonnegligible, special structures other than complete additivity may be used to save computational memory and time. In this article, we will make use of the tensor product designs that exist in many large data sets, especially in spatial-temporal analyses. The key to our algorithm is again backfitting, which enables us to fit an SS-ANOVA model by decomposing matrices much smaller than those arising when the tensor product design is ignored. In this way, we can significantly reduce the demand for computational memory. However, unlike the situation for additive models, the "correlation" between component functions, which is a major factor in slowing down the backfitting algorithm, is very often high in SS-ANOVA models. It makes the straightforward version of the backfitting algorithm converge very slowly. Therefore we will also discuss several techniques to speed up the backfitting algorithm.

Received December 1996; revised December 1997.
AMS 1991 subject classifications. Primary 62G07, 65D10, 65F10; secondary 62H11, 65U05, 86A32.
Key words and phrases. Gauss–Seidel algorithm, tensor product design, spatial-temporal analysis, additive model, collapsing, grouping, SOR, global historical temperature data.
In practice, tensor product designs very often hold only approximately; that is, there are no observations at some tensor product grid points. An iterative imputation procedure is used to handle this situation.

The motivation for this work came from the spatial-temporal analysis of a global historical surface air temperature data set. The data set has tens of thousands of observations, which makes the currently available methods for fitting SS-ANOVA models, such as the one implemented in RKPACK [Gu (1989)], unusable. This data set will be used to illustrate our proposed computational scheme.

The paper is organized as follows. The SS-ANOVA approach to multivariate function estimation is introduced in Section 2. Section 3 describes our general computational strategy using the backfitting algorithm. Several techniques used to speed up the backfitting algorithm, such as grouping, collapsing and successive over-relaxation, are discussed in Section 4. Section 5 describes an iterative imputation procedure handling the situation of a near tensor product design. An application to a global historical temperature data set is discussed in Section 6. Finally, in Section 7 some timing comparisons are made between RKPACK and several versions of backfitting.

2. Smoothing spline ANOVA. The SS-ANOVA approach to multivariate function estimation is introduced in this section through its application to the spatial-temporal analysis of a global historical temperature data set. A general formulation of this approach may be found in Wahba (1990), Gu and Wahba (1993a, b) or Wahba, Wang, Gu, Klein and Klein (1995).

Consider the problem of estimating the global winter mean surface air temperature field as a function, denoted by f, of year and geographical location, based on scattered noisy data:

(1)    $y_i = f(t_i, P_i) + \varepsilon_i, \qquad i = 1, 2, \ldots, n,$

where $t_i \in \{1, 2, \ldots, n_t\}$ denotes the year of the ith data point and $P_i$ = (latitude, longitude) denotes the location of the ith data point on the sphere $\mathcal{S}$. Here n is the total number of data points and $n_t$ is the total number of years included in the data set. We also denote the total number of distinct locations in the data set by $n_P$ for later use. The $\varepsilon_i$'s are assumed to be independent noise terms. (We note that in this particular application, $\varepsilon_i$ contains not only the measurement error of the record $y_i$, but also the local variation that is of much smaller scale than the resolution of current modeling interest.)

The function f needs only a minimal condition, that $f(t, \cdot)$ be integrable over $\mathcal{S}$ for any t, to have a unique decomposition,

(2)    $f(t, P) = d_1 + d_2\,\phi(t) + g_1(t) + g_2(P) + g_{\phi,2}(P)\,\phi(t) + g_{12}(t, P),$

where $\phi(t) := t - (n_t + 1)/2$, and $g_1$, $g_2$, $g_{\phi,2}$ and $g_{12}$ satisfy the following side conditions:

(3)    $\sum_{t=1}^{n_t} g_1(t) = \sum_{t=1}^{n_t} g_1(t)\phi(t) = 0,$
       $\sum_{t=1}^{n_t} g_{12}(t, P) = \sum_{t=1}^{n_t} g_{12}(t, P)\phi(t) = 0,$
       $\int_{\mathcal{S}} g_2(P)\,dP = \int_{\mathcal{S}} g_{\phi,2}(P)\,dP = \int_{\mathcal{S}} g_{12}(t, P)\,dP = 0$

for any t and P. The uniqueness of this decomposition is easy to check using these side conditions. To see the existence of such a decomposition, first define three averaging operators on the function f:

$(\mathcal{E}_t f)(t, P) := \frac{\sum_{t=1}^{n_t} f(t, P)}{n_t},$

$(\mathcal{E}_t' f)(t, P) := \frac{\sum_{t=1}^{n_t} f(t, P)\phi(t)}{\sum_{t=1}^{n_t} \phi^2(t)}\,\phi(t),$

$(\mathcal{E}_P f)(t, P) := \frac{\int_{\mathcal{S}} f(t, P)\,dP}{4\pi}.$

Let I denote the identity operator; then f can be decomposed as

$f = [\mathcal{E}_t + \mathcal{E}_t' + (I - \mathcal{E}_t - \mathcal{E}_t')]\,[\mathcal{E}_P + (I - \mathcal{E}_P)]\,f$
$\;\; = \mathcal{E}_t\mathcal{E}_P f + \mathcal{E}_t'\mathcal{E}_P f + (I - \mathcal{E}_t - \mathcal{E}_t')\mathcal{E}_P f + \mathcal{E}_t(I - \mathcal{E}_P)f + \mathcal{E}_t'(I - \mathcal{E}_P)f + (I - \mathcal{E}_t - \mathcal{E}_t')(I - \mathcal{E}_P)f,$

which gives the six components in (2).
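To make the averaging operators concrete, here is a minimal numpy sketch (an illustration added here, not part of the paper): it tabulates f on an $n_t \times n_P$ grid and, as a simplifying assumption, replaces the sphere average $\int_{\mathcal{S}}\cdot\,dP/(4\pi)$ by an equal-weight average over the sampled locations. It then forms the six components of (2) and checks that they sum back to f and satisfy the side conditions (3).

```python
import numpy as np

# Toy grid: n_t years by n_P locations; F[t, j] tabulates f(t, P_j).
# Assumption: the sphere integral is approximated by an equal-weight
# average over the n_P sampled locations (quadrature weights omitted).
n_t, n_P = 20, 15
rng = np.random.default_rng(0)
F = rng.standard_normal((n_t, n_P))
phi = np.arange(1, n_t + 1) - (n_t + 1) / 2        # phi(t) = t - (n_t + 1)/2

E_t  = lambda G: np.tile(G.mean(axis=0), (n_t, 1))           # average over t
E_tp = lambda G: np.outer(phi, phi @ G) / (phi @ phi)        # project onto phi
E_P  = lambda G: np.tile(G.mean(axis=1)[:, None], (1, n_P))  # average over P

# Six components, following f = [E_t + E_t' + (I - E_t - E_t')][E_P + (I - E_P)]f.
d1    = E_t(E_P(F))                   # grand mean
d2phi = E_tp(E_P(F))                  # linear trend in t
g1    = E_P(F) - d1 - d2phi           # nonlinear global history
g2    = E_t(F - E_P(F))               # local adjustment to d1
gphi2 = E_tp(F - E_P(F))              # local adjustment to the trend
g12   = F - E_P(F) - g2 - gphi2       # year-by-location interaction

assert np.allclose(d1 + d2phi + g1 + g2 + gphi2 + g12, F)  # components sum to f
assert np.allclose(g1.sum(axis=0), 0) and np.allclose(phi @ g1, 0)   # (3), g_1
assert np.allclose(g12.sum(axis=0), 0) and np.allclose(g12.mean(axis=1), 0)
```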
Note that averaging over t is defined in a discrete form because "year" is treated as a discrete variable. While we might consider t a continuous time variable, we will see in Section 4.1 why it is a better idea to treat it as a discrete one.

The decomposition of f generalizes the decomposition of ordinary two-way ANOVA models. Note that a direct generalization would have only four components in (2), corresponding to the decomposition using only the operators $\mathcal{E}_t$ and $\mathcal{E}_P$. Here we may call $d_2\phi(t)$, $g_1(t)$ and $g_2(P)$ main effect terms, and $g_{\phi,2}(P)\phi(t)$ and $g_{12}(t, P)$ interaction terms. In this article, we will also call $d_1 + d_2\phi$ the "parametric component" and the other four terms in (2) "nonparametric components." The components in (2) are of interest to us because they have clearly defined climatological meanings: $d_1$ is the grand average winter temperature over both year and location; $d_2$ is the linear trend (coefficient) of global average winter temperature; $d_1 + d_2\phi + g_1$ is the global average winter temperature history; $g_2$, $g_{\phi,2}$ and $g_{12}$ represent local adjustments to the global average terms $d_1$, $d_2$ and $g_1$, respectively. For example, $g_{\phi,2}(P)$ is the local adjustment to the grand linear trend (coefficient) $d_2$ at location P. In other words, $d_2 + g_{\phi,2}(P)$ is the linear trend (coefficient) at location P. A map of $d_2 + g_{\phi,2}(P)$ would show where the warming areas are and where the cooling areas are over the years covered by the data.

To make any inferences about the function f at points other than the data points, some "smoothness" assumptions about f have to be made. First, as is usually done in nonparametric function estimation, we will restrict f to a class of functions with some degree of differentiability. Then, we will estimate f by minimizing a penalized least squares criterion with appropriately chosen penalty terms on its component functions.

Let $\mathcal{H}^{(t)}$ denote the space of the functions of t to which we will restrict our attention. Since t has only a finite number ($n_t$) of possible values, we will add no differentiability restrictions; that is, $\mathcal{H}^{(t)}$ is the Euclidean space of dimension $n_t$. For the functions of P, we will restrict our attention to a function space defined in Wahba (1981), denoted there by $\mathcal{H}(\mathcal{S})$; here we denote it by $\mathcal{H}^{(P)}$ in order to be consistent with the notation $\mathcal{H}^{(t)}$. Roughly speaking, $\mathcal{H}^{(P)}$ consists of functions g defined on the sphere $\mathcal{S}$ that are twice differentiable and such that $\Delta g$ is square integrable. The Laplace–Beltrami operator for the sphere is denoted by $\Delta$; it is a direct analogue of the Laplace operator for a Euclidean space. For a precise definition of $\mathcal{H}^{(P)}$, please see Wahba (1981), page 7. Both $\mathcal{H}^{(t)}$ and $\mathcal{H}^{(P)}$ are reproducing kernel Hilbert spaces (RKHSs), with squared norms

(4)    $\|g\|^2_{\mathcal{H}^{(t)}} := \Bigl[\frac{\sum_{t=1}^{n_t} g(t)}{n_t}\Bigr]^2 + \Bigl[\frac{\sum_{t=1}^{n_t} g(t)\phi(t)}{\sum_{t=1}^{n_t}\phi^2(t)}\Bigr]^2 + \sum_{t=1}^{n_t-2}\,[g(t+2) - 2g(t+1) + g(t)]^2,$

(5)    $\|g\|^2_{\mathcal{H}^{(P)}} := \Bigl[\frac{\int_{\mathcal{S}} g(P)\,dP}{4\pi}\Bigr]^2 + \int_{\mathcal{S}} (\Delta g)^2\,dP,$

respectively. Note that $\sum_{t=1}^{n_t-2}[g(t+2) - 2g(t+1) + g(t)]^2$ is a discrete version of the integrated squared second derivative.
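As a concrete illustration of (4) (a sketch added here, not from the paper), the three terms are simple vector operations; in particular, np.diff(g, n=2) produces the second differences $g(t+2) - 2g(t+1) + g(t)$, and the roughness term vanishes exactly for constant and linear g, so trends themselves are not penalized.

```python
import numpy as np

def sq_norm_t(g):
    """Squared norm (4) on H^(t) for a vector g of length n_t."""
    n_t = len(g)
    phi = np.arange(1, n_t + 1) - (n_t + 1) / 2
    mean_part   = (g.sum() / n_t) ** 2                 # squared norm in {1}
    linear_part = (g @ phi / (phi @ phi)) ** 2         # squared norm in {phi}
    rough_part  = (np.diff(g, n=2) ** 2).sum()         # discrete int (g'')^2
    return mean_part + linear_part + rough_part

g_lin = 3.0 + 0.5 * np.arange(1, 11)     # a purely linear "history"
assert np.isclose((np.diff(g_lin, n=2) ** 2).sum(), 0)  # no roughness penalty
```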
The function f(t, P) in (1) is restricted to the tensor product space of $\mathcal{H}^{(t)}$ and $\mathcal{H}^{(P)}$, denoted by $\mathcal{H}$. Here $\mathcal{H}$ is also an RKHS, since the tensor product space of any two RKHSs is an RKHS. The restriction to RKHSs is barely restrictive, because the only requirement for a Hilbert function space to be an RKHS is that every evaluation functional be continuous. That is, a small change of the function, measured in the function space, should induce only a small change of the value of this function evaluated at any fixed point. For details on reproducing kernel Hilbert spaces and their tensor product spaces, please see Aronszajn (1950). Some introductory material can also be found in Wahba (1990), or Tapia and Thompson [(1978), Appendix I].

Let $\{1^{(t)}\}$, $\{\phi\}$ and $\mathcal{H}_a^{(t)}$ denote the three subspaces of $\mathcal{H}^{(t)}$ defined by $\mathcal{E}_t\mathcal{H}^{(t)}$, $\mathcal{E}_t'\mathcal{H}^{(t)}$ and $(I - \mathcal{E}_t - \mathcal{E}_t')\mathcal{H}^{(t)}$, respectively. They are orthogonal to each other with respect to the inner product corresponding to the norm defined in (4). The three terms on the right side of (4) are the squared norms in these three subspaces. We have

$\mathcal{H}^{(t)} = \{1^{(t)}\} \oplus \{\phi\} \oplus \mathcal{H}_a^{(t)}.$

Let $\{1^{(P)}\}$ and $\mathcal{H}_a^{(P)}$ denote the two subspaces of $\mathcal{H}^{(P)}$ defined by $\mathcal{E}_P\mathcal{H}^{(P)}$ and $(I - \mathcal{E}_P)\mathcal{H}^{(P)}$, respectively. These two subspaces are also orthogonal with respect to the inner product corresponding to the norm defined in (5). The two terms on the right side of (5) are the squared norms in these two subspaces. We have

$\mathcal{H}^{(P)} = \{1^{(P)}\} \oplus \mathcal{H}_a^{(P)}.$

Now the function space to which we restrict f can be written as an orthogonal sum of six subspaces:

(6)    $\mathcal{H} = \mathcal{H}^{(t)} \otimes \mathcal{H}^{(P)}$
       $= [\{1^{(t)}\} \otimes \{1^{(P)}\}] \oplus [\{\phi\} \otimes \{1^{(P)}\}] \oplus [\mathcal{H}_a^{(t)} \otimes \{1^{(P)}\}]$
       $\;\oplus [\{1^{(t)}\} \otimes \mathcal{H}_a^{(P)}] \oplus [\{\phi\} \otimes \mathcal{H}_a^{(P)}] \oplus [\mathcal{H}_a^{(t)} \otimes \mathcal{H}_a^{(P)}].$

These six subspaces correspond to the six components in (2). Let the first two subspaces be combined into a two-dimensional space, denoted by $\mathcal{H}_0$. This is the parametric subspace. The last four subspaces, denoted by $\mathcal{H}_\alpha$ for $\alpha = 1, 2, 3, 4$, respectively, correspond to the four nonparametric components in (2). We note that the subspaces discussed here are all RKHSs.

The smoothing spline (ANOVA) estimate of f is defined as a penalized least squares estimate with the four nonparametric components in (2) penalized by their squared norms in the corresponding subspaces, or as a maximum penalized likelihood estimate if the noise terms ($\varepsilon$'s) in (1) are assumed to be independent and identically distributed Gaussian random variables. We penalize the nonlinear part of a function of t by its squared norm in the subspace $\mathcal{H}_a^{(t)}$, and the nonconstant part of a function of P by its squared norm in the subspace $\mathcal{H}_a^{(P)}$. Specifically, the smoothing spline estimate of f is defined as the minimizer of

(7)    $\sum_{i=1}^n\,[y_i - f(t_i, P_i)]^2 + \frac{1}{\theta_1}J_1(f) + \frac{1}{\theta_2}J_2(f) + \frac{1}{\theta_3}J_3(f) + \frac{1}{\theta_4}J_4(f)$

in $\mathcal{H}$, where the $\theta$'s are positive numbers called smoothing parameters and the J's are penalty terms on the four nonparametric components of f in (2). The expression $J_1(f) = \sum_{t=1}^{n_t-2}[g_1(t+2) - 2g_1(t+1) + g_1(t)]^2$, with $g_1 = (I - \mathcal{E}_t - \mathcal{E}_t')\mathcal{E}_P f$, is a penalty on the component $g_1$; $J_2(f) = \int_{\mathcal{S}}(\Delta g_2)^2\,dP$ is a penalty on the component $g_2 = \mathcal{E}_t(I - \mathcal{E}_P)f$; $J_3(f) = \int_{\mathcal{S}}(\Delta g_{\phi,2})^2\,dP$ is a penalty on the component $g_{\phi,2}$, where $g_{\phi,2}\phi = \mathcal{E}_t'(I - \mathcal{E}_P)f$; and $J_4(f)$ is the squared norm of $g_{12}$ in $\mathcal{H}_4 = \mathcal{H}_a^{(t)} \otimes \mathcal{H}_a^{(P)}$, a penalty on the component $g_{12} = (I - \mathcal{E}_t - \mathcal{E}_t')(I - \mathcal{E}_P)f$.

The choices of J's given here are not the only ones we may make. How to choose the J's and $\theta$'s is an important issue in smoothing spline methods; discussions of this issue can be found in many places, for example, Wahba (1990). For now, suppose that they have been chosen appropriately, and let us move our attention to the computational aspect of this procedure.
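Before turning to computation, a quick numerical check (an added sketch, not in the paper) of the orthogonality claim above: the three terms of (4) are exactly the squared norms of the projections of g onto $\{1^{(t)}\}$, $\{\phi\}$ and $\mathcal{H}_a^{(t)}$, so the norm of g splits additively across the direct sum.

```python
import numpy as np

n_t = 12
rng = np.random.default_rng(1)
g = rng.standard_normal(n_t)
phi = np.arange(1, n_t + 1) - (n_t + 1) / 2

p1   = np.full(n_t, g.mean())             # E_t g, projection onto {1}
pphi = (g @ phi / (phi @ phi)) * phi      # E_t' g, projection onto {phi}
pa   = g - p1 - pphi                      # (I - E_t - E_t') g, in H_a^(t)

def sq_norm_t(v):
    return ((v.sum() / n_t) ** 2 + (v @ phi / (phi @ phi)) ** 2
            + (np.diff(v, n=2) ** 2).sum())

# Pythagoras under the norm (4): the decomposition is orthogonal.
assert np.isclose(sq_norm_t(g),
                  sq_norm_t(p1) + sq_norm_t(pphi) + sq_norm_t(pa))
```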
3. Computing SS-ANOVA estimates using backfitting. Even though the SS-ANOVA estimate of f is defined as the minimizer of (7) in an infinite-dimensional space $\mathcal{H}$, the solution has a finite-dimensional representation [see Wahba (1990), pages 12, 127, 128],

(8)    $f_\theta(t, P) = d_1 + d_2\,\phi(t) + \sum_{\alpha=1}^4 \theta_\alpha \sum_{i=1}^n c_i\,R_\alpha(t_i, P_i;\, t, P),$

where $R_\alpha$ is the reproducing kernel of the RKHS $\mathcal{H}_\alpha$, a known nonnegative definite function, for $\alpha = 1, 2, 3, 4$; $d := (d_1, d_2)^T$ and $c := (c_1, \ldots, c_n)^T$ are the coefficient vectors to be determined.

In general, the reproducing kernel (RK) of an RKHS $\mathcal{F}$ of functions of x is a two-variable function $R(x, x')$ satisfying (i) for every $x'$, $R(\cdot, x')$ belongs to $\mathcal{F}$; (ii) for every $x'$ and every $f \in \mathcal{F}$, $f(x') = \langle f(\cdot), R(\cdot, x')\rangle$, where $\langle\cdot,\cdot\rangle$ denotes the inner product of $\mathcal{F}$. Since $R(\cdot, x')$ is in $\mathcal{F}$, using (ii) we get $R(x, x') = \langle R(\cdot, x), R(\cdot, x')\rangle$. Therefore $R(x, x')$ is symmetric and nonnegative definite. See Aronszajn (1950) for more information on reproducing kernels.

It is not difficult to verify that the RK of the space $\{1^{(t)}\}$ defined in Section 2 is $R(t, t') = 1$, that the RK of the space $\{\phi\}$ is $R(t, t') = \phi(t)\phi(t')$, and that the RK of the space $\mathcal{H}_a^{(t)}$ is $R(t, t') =$ the $(t, t')$ entry of $(L^T L)^\dagger$, where $\dagger$ denotes the Moore–Penrose generalized inverse and L is the $(n_t - 2) \times n_t$ second-difference matrix

(9)    $L = \begin{pmatrix} 1 & -2 & 1 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 1 & -2 & 1 & \cdots & 0 & 0 & 0 \\ \vdots & & & & \ddots & & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & -2 & 1 & 0 \\ 0 & 0 & 0 & 0 & \cdots & 1 & -2 & 1 \end{pmatrix}.$

From now on, let $R_t(t, t')$ denote the reproducing kernel of $\mathcal{H}_a^{(t)}$. Note that this form of RK is only suitable for an equally spaced design in the variable t.

It is easy to verify that the RK of the space $\{1^{(P)}\}$ is $R(P, P') = 1$. The RK of the space $\mathcal{H}_a^{(P)}$, denoted by $R_P(P, P')$, is given in equations (3.3) and (3.4) of Wahba (1981), and reproduced here:

$R_P(P, P') = \frac{1}{2\pi}\Bigl[\frac{1}{2}\,q_2(z) - \frac{1}{6}\Bigr],$

where z is the cosine of the angle between P and $P'$ on the sphere, and

$q_2(z) = \frac{1}{2}\ln\Bigl(1 + \frac{2}{1-z}\Bigr)\Bigl[12\Bigl(\frac{1-z}{2}\Bigr)^{2} - 4\Bigl(\frac{1-z}{2}\Bigr)\Bigr] - 12\Bigl(\frac{1-z}{2}\Bigr)^{3/2} + 6\Bigl(\frac{1-z}{2}\Bigr) + 1.$

We note that this $R_P$, strictly speaking, does not correspond exactly to the squared norm $\int_{\mathcal{S}}(\Delta g)^2\,dP$, but to a norm topologically equivalent to it. In other words, we are not computing the minimizer of (7), but the minimizer of (7) with slightly changed J's. The reason is purely computational, since the $R_P$ corresponding to $\int_{\mathcal{S}}(\Delta g)^2\,dP$ does not have an explicit form and is expensive to compute. On the other hand, as discussed in Stein (1990), this type of change to the penalty terms should not make much difference in the results when sufficient data are available.

The reproducing kernel of the tensor product space of two RKHSs is the product of their individual reproducing kernels [Aronszajn (1950), Theorem I of Section 8]. Therefore the RKs of the $\mathcal{H}_\alpha$, $\alpha = 1, 2, 3, 4$, are the products of the RKs given above; they are listed in Table 1.

Table 1
The reproducing kernels used in the representation (8) of the SS-ANOVA estimate

α   $\mathcal{H}_\alpha$                                    $R_\alpha$
1   $\mathcal{H}_a^{(t)} \otimes \{1^{(P)}\}$               $R_1(t, P;\, t', P') = R_t(t, t')$
2   $\{1^{(t)}\} \otimes \mathcal{H}_a^{(P)}$               $R_2(t, P;\, t', P') = R_P(P, P')$
3   $\{\phi\} \otimes \mathcal{H}_a^{(P)}$                  $R_3(t, P;\, t', P') = \phi(t)\phi(t')\,R_P(P, P')$
4   $\mathcal{H}_a^{(t)} \otimes \mathcal{H}_a^{(P)}$       $R_4(t, P;\, t', P') = R_t(t, t')\,R_P(P, P')$

The $f_\theta$ expressed in the form of (8) has an obvious decomposition corresponding to (2). For example, $g_1(t) = \theta_1\sum_{i=1}^n c_i R_1(t_i, P_i;\, t, P) = \theta_1\sum_{i=1}^n c_i R_t(t_i, t)$.
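The two building-block kernels have direct implementations (a sketch under the formulas above; not part of the paper). Note that $q_2$ is written with a guard at $z = 1$, where the closed form is an indeterminate $0\cdot\infty$ whose limit is $q_2(1) = 1$.

```python
import numpy as np

def make_L(n_t):
    """The (n_t - 2) x n_t second-difference matrix L of (9)."""
    L = np.zeros((n_t - 2, n_t))
    for i in range(n_t - 2):
        L[i, i:i + 3] = [1.0, -2.0, 1.0]
    return L

n_t = 10
Q_t = np.linalg.pinv(make_L(n_t).T @ make_L(n_t))  # R_t(i, j) = [(L^T L)^+]_ij

def q2(z):
    z = np.asarray(z, dtype=float)
    C = (1.0 - z) / 2.0
    with np.errstate(divide="ignore", invalid="ignore"):
        val = (0.5 * np.log1p(2.0 / (1.0 - z)) * (12 * C**2 - 4 * C)
               - 12 * C**1.5 + 6 * C + 1.0)
    return np.where(np.isclose(z, 1.0), 1.0, val)   # limit as P' -> P

def R_P(z):
    """RK of H_a^(P); z is the cosine of the angle between P and P'."""
    return (0.5 * q2(z) - 1.0 / 6.0) / (2.0 * np.pi)

# (L^T L)^+ annihilates constants and linear functions of t, so
# Q_t 1 = Q_t phi = 0 -- the fact used again in Section 4.1.
phi = np.arange(1, n_t + 1) - (n_t + 1) / 2
assert np.allclose(Q_t @ np.ones(n_t), 0) and np.allclose(Q_t @ phi, 0)
```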
Plugging the representation (8) into (7) [note that $J_\alpha(f_\theta) = \theta_\alpha^2\sum_i\sum_j c_i c_j R_\alpha(t_i, P_i;\, t_j, P_j)$ for $\alpha = 1, 2, 3, 4$], we end up with a finite-dimensional quadratic optimization problem,

(10)    $\min_{d,\,c}\ \|y - Sd - Q_\theta c\|^2 + c^T Q_\theta c,$

where $y = (y_1, \ldots, y_n)^T$, S is an $n \times 2$ matrix with ith row $(1, \phi(t_i))$, $Q_\theta$ is the $n \times n$ matrix $\sum_{\alpha=1}^4 \theta_\alpha Q_\alpha$, and $Q_\alpha$ is an $n \times n$ matrix with $(i, j)$ entry $R_\alpha(t_i, P_i;\, t_j, P_j)$, for $\alpha = 1, 2, 3, 4$. Since the objective function in (10) is quadratic and nonnegative ($Q_\theta$ is nonnegative definite), it must have a minimizer. It is easy to show that even though the minimizing c and d may not be unique, the corresponding $f_\theta$ is unique if S is of full rank [see, e.g., Chen, Gu and Wahba (1989), page 517]. That is, the SS-ANOVA estimate $f_\theta$ is uniquely defined. Because the decomposition (2) is unique, all the component functions of $f_\theta$ are also uniquely defined.

One of the minimizers of (10) is the solution of

(11)    $S^T c = 0,$
        $(Q_\theta + I)\,c = y - Sd,$

which is the system of linear equations usually solved for computing smoothing spline estimates [see, e.g., Wahba (1990), pages 12, 13]. When n, the sample size, is not too large, this system can be solved by direct matrix decompositions of S and $Q_\theta$, which is essentially what Gu (1989) did for general SS-ANOVA models. However, this approach has limitations in terms of both computational time and memory: the memory required is $O(n^2)$ and the time is $O(n^3)$.

When the data and model exhibit some special structure, we may be able to find more efficient ways to compute. The following fact will be used later. Let $f_0$ denote the vector of the values of the parametric component, $d_1 + d_2\phi$, evaluated at the data points, and let $f_\alpha$ denote the vector of the values of the αth nonparametric component of $f_\theta$ evaluated at the data points. From the representation (8) we know that $f_0 = Sd$ and $f_\alpha = \theta_\alpha Q_\alpha c$ for $\alpha = 1, 2, 3, 4$. If we can get $f_0$ and the $f_\alpha$'s, then we can compute d by solving $Sd = f_0$, and c through the second equation of (11), that is,

(12)    $c = y - Sd - Q_\theta c = y - \sum_{\alpha=0}^4 f_\alpha.$

With c and d, we can compute the values of $f_\theta$ and its components at any points using (8).

A special structure we have here is the tensor product structure of the $Q_\alpha$'s when the data have a complete tensor product design, that is, one observation at every tensor product grid point $(t_i, P_j)$ for $i = 1, 2, \ldots, n_t$ and $j = 1, 2, \ldots, n_P$. In practice we will more likely have only a near tensor product design, that is, one observation at almost every point $(t_i, P_j)$. For the moment, however, we will assume that we do have a complete tensor product design, and wait until Section 5 to discuss how to deal with near tensor product designs. In the case of a complete tensor product design, the sample size n is $n_t \times n_P$. Assuming that the data are ordered in such a way that all the data of one location are listed before the data of the next location, the S and $Q_\alpha$'s in (11) have the following forms:

(13)    $S = 1 \otimes \tilde{S},$
        $Q_1 = 11^T \otimes Q_t,$
        $Q_2 = Q_P \otimes 11^T,$
        $Q_3 = Q_P \otimes \phi\phi^T,$
        $Q_4 = Q_P \otimes Q_t,$

where 1 is a vector of ones of appropriate length, $\phi = (\phi(1), \ldots, \phi(n_t))^T$, $\tilde{S}$ is an $n_t \times 2$ matrix with ith row $(1, \phi(i))$, $Q_t$ is an $n_t \times n_t$ matrix with $(i, j)$ entry $R_t(i, j)$, and $Q_P$ is an $n_P \times n_P$ matrix with $(i, j)$ entry $R_P(P_i, P_j)$. Therefore $Q_t = (L^T L)^\dagger$ and $Q_t 1 = Q_t\phi = 0$. This fact, which results from the equally spaced design for t and the special form of the penalty terms involving t in (7), will be used in Section 4.1.

Even though every $Q_\alpha$ has such a tensor product structure, their linear combination $Q_\theta$ may not. Therefore, (11) still cannot be solved using this special structure.
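For a small complete tensor product design, the forms in (13) can be assembled with np.kron and the system (11) solved directly by eliminating c. The sketch below continues the previous one (make_L and R_P as defined there); the random sphere locations and the θ values are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_t, n_P = 8, 6
n = n_t * n_P
theta = np.array([1.0, 1.0, 1.0, 1.0])         # smoothing parameters, assumed

P = rng.standard_normal((n_P, 3))
P /= np.linalg.norm(P, axis=1, keepdims=True)  # random points on the sphere
Q_P = R_P(np.clip(P @ P.T, -1.0, 1.0))         # pairwise cosines -> kernel

L = make_L(n_t)
Q_t = np.linalg.pinv(L.T @ L)
phi = np.arange(1, n_t + 1) - (n_t + 1) / 2
one_t, one_P = np.ones(n_t), np.ones(n_P)
S_tilde = np.column_stack([one_t, phi])

# (13): location-major ordering, all years of one location contiguous.
S  = np.kron(one_P[:, None], S_tilde)
Qs = [np.kron(np.outer(one_P, one_P), Q_t),    # Q_1 = 11^T (x) Q_t
      np.kron(Q_P, np.outer(one_t, one_t)),    # Q_2 = Q_P (x) 11^T
      np.kron(Q_P, np.outer(phi, phi)),        # Q_3 = Q_P (x) phi phi^T
      np.kron(Q_P, Q_t)]                       # Q_4 = Q_P (x) Q_t
Q_theta = sum(t * Q for t, Q in zip(theta, Qs))

# Direct solve of (11): eliminate c = (Q_theta + I)^{-1}(y - S d), then
# S^T c = 0 gives (S^T M S) d = S^T M y with M = (Q_theta + I)^{-1}.
y = rng.standard_normal(n)
M = np.linalg.inv(Q_theta + np.eye(n))
d = np.linalg.solve(S.T @ M @ S, S.T @ M @ y)
c = M @ (y - S @ d)
assert np.allclose(S.T @ c, 0, atol=1e-8)      # first equation of (11)
```

The dense inverse here is exactly the $O(n^2)$-memory, $O(n^3)$-time step that the backfitting scheme below is designed to avoid.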
In order to make use of this structure, we consider another representation of the SS-ANOVA estimate,

(14)    $f_\theta(t, P) = d_1 + d_2\,\phi(t) + \sum_{\alpha=1}^4 \theta_\alpha \sum_{i=1}^n c_{i,\alpha}\,R_\alpha(t_i, P_i;\, t, P),$

which is more general than the one in (8). With slightly abused notation, let $c_\alpha$ denote the vector $(c_{1,\alpha}, c_{2,\alpha}, \ldots, c_{n,\alpha})^T$ for $\alpha = 1, 2, 3, 4$. Plugging (14) into (7), we end up with another finite-dimensional quadratic optimization problem,

(15)    $\min_{d,\,c_1, c_2, c_3, c_4}\ \Bigl\|y - Sd - \sum_{\alpha=1}^4 \theta_\alpha Q_\alpha c_\alpha\Bigr\|^2 + \sum_{\alpha=1}^4 \theta_\alpha c_\alpha^T Q_\alpha c_\alpha.$

Any solution of this problem should satisfy

(16)    $(S^T S)\,d = S^T\Bigl(y - \sum_{\alpha} \theta_\alpha Q_\alpha c_\alpha\Bigr),$
        $(\theta_\beta Q_\beta + I)\,Q_\beta c_\beta = Q_\beta\Bigl(y - Sd - \sum_{\alpha\neq\beta} \theta_\alpha Q_\alpha c_\alpha\Bigr) \quad\text{for } \beta = 1, 2, 3, 4.$

Again, it is easy to show that even though the solutions $c_\alpha$ and d may not be unique, their corresponding $f_\theta$, the SS-ANOVA estimate, and all its component functions are uniquely defined if S is of full rank. With the representation (14), the components of $f_\theta$ evaluated at the data points can be written as $f_0 = Sd$, $f_\alpha = \theta_\alpha Q_\alpha c_\alpha$ for $\alpha = 1, 2, 3, 4$. According to the system (16), they must satisfy

(17)    $f_\beta = S_\beta\Bigl(y - \sum_{\alpha\neq\beta} f_\alpha\Bigr) \quad\text{for } \beta = 0, 1, 2, 3, 4,$

where $S_0 := S(S^T S)^{-1}S^T$ and $S_\beta := (Q_\beta + (1/\theta_\beta)I)^{-1}Q_\beta$ for $\beta = 1, 2, 3, 4$. Since (15) must have a minimizer, (16), and hence (17), must have a solution. For any solution of (17), there exists a corresponding solution of (16), which in turn corresponds to the uniquely defined SS-ANOVA estimate $f_\theta$; therefore the solution of (17) is unique.

The system (17) suggests a natural iterative method to compute the $f_\beta$'s, namely

(18)    $f_\beta^{(k)} = S_\beta\Bigl(y - \sum_{\alpha<\beta} f_\alpha^{(k)} - \sum_{\alpha>\beta} f_\alpha^{(k-1)}\Bigr) \quad\text{for } \beta = 0, 1, 2, 3, 4.$

This is exactly the backfitting algorithm studied by Buja, Hastie and Tibshirani (1989), in which additive models are fitted. In the case of additive models, each $S_\beta$ is a one-dimensional smoother and has a sparse matrix representation, which makes computation very efficient. Here each $S_\beta$ has a tensor product structure because S and $Q_\beta$ do. Therefore, the updates in (18) can be done with decompositions of the matrices $Q_P$ and $Q_t$, which are much smaller than the $Q_\beta$'s. In this way, we may save computational memory and time. The tensor product structure plays the same role here as the sparsity structure does in additive models.

Rewrite (17) as

(19)    $\begin{pmatrix} I & S_0 & \cdots & S_0 \\ S_1 & I & \cdots & S_1 \\ \vdots & & \ddots & \vdots \\ S_4 & S_4 & \cdots & I \end{pmatrix}\begin{pmatrix} f_0 \\ f_1 \\ \vdots \\ f_4 \end{pmatrix} = \begin{pmatrix} S_0 y \\ S_1 y \\ \vdots \\ S_4 y \end{pmatrix}.$

It is clear that the backfitting algorithm (18) is a (block) Gauss–Seidel algorithm.

Since all the smoothers $\{S_\alpha\}$ here are symmetric with eigenvalues in [0, 1], Theorem 9 of Buja, Hastie and Tibshirani (1989) shows that the backfitting algorithm (18), starting from any initial vectors $\{f_\alpha^{(0)}\}$, does converge to the unique solution of (17). Chen, Gu and Wahba (1989) pointed out the possibility of applying the backfitting algorithm to SS-ANOVA models, and the above derivation of (18) is basically theirs. Another interesting discussion of the convergence of backfitting is given by Ansley and Kohn (1994).

Note that (17) is the system of stationary equations for the following minimization problem:

(20)    $\min_{f_0\in\mathcal{R}(S),\ f_\alpha\in\mathcal{R}(Q_\alpha)}\ \Bigl\|y - \sum_{\alpha=0}^4 f_\alpha\Bigr\|^2 + \sum_{\alpha=1}^4 \frac{1}{\theta_\alpha}\,f_\alpha^T Q_\alpha^\dagger f_\alpha,$

where $Q_\alpha^\dagger$ is the Moore–Penrose generalized inverse of $Q_\alpha$, and $\mathcal{R}(A)$ denotes the space spanned by the columns of A. The backfitting algorithm (18) is the same as the coordinate descent algorithm for this minimization problem, where each component $f_\alpha$ is treated as one "coordinate." See also the discussion in Buja, Hastie and Tibshirani [(1989), pages 477 and 488]. As a matter of fact, thinking of the backfitting algorithm in this way, there is a direct way to show
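As a concrete illustration of the iteration (18), here is a sketch continuing the code above (S, Qs, Q_theta, theta, y, d and c as built there). The smoothers are formed as dense n × n matrices only to keep the example short; the whole point of the paper is that the Kronecker structure of each $Q_\beta$ lets these updates be carried out with decompositions of $Q_t$ and $Q_P$ alone. Plain backfitting of this kind can converge slowly, which is exactly what Section 4 addresses.

```python
import numpy as np

# Smoothers of (17): S_0 projects onto span(S);
# S_beta = (Q_beta + I/theta_beta)^{-1} Q_beta for beta = 1, ..., 4.
S0 = S @ np.linalg.solve(S.T @ S, S.T)
smoothers = [S0] + [np.linalg.solve(Q + np.eye(n) / t, Q)
                    for t, Q in zip(theta, Qs)]

f = [np.zeros(n) for _ in range(5)]            # f_0, ..., f_4
for sweep in range(500):                       # block Gauss-Seidel sweeps (18)
    for beta, S_b in enumerate(smoothers):
        residual = y - sum(f[a] for a in range(5) if a != beta)
        f[beta] = S_b @ residual

# The converged fit matches the direct solution of (11) at the data points.
f_direct = S @ d + Q_theta @ c
print(np.max(np.abs(sum(f) - f_direct)))       # shrinks toward 0 with sweeps
```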