## Abstract

Fixed effects models have dominated the statistical analysis of genetic crosses between inbred strains. In spite of their popularity, the traditional models ignore polygenic background and must be tailored to each specific cross. We reexamine the role of random effect models in gene mapping with inbred strains. The biggest difficulty in implementing random effect models is the lack of a coherent way of calculating trait covariances between relatives. The standard model for outbred populations is based on premises of genetic equilibrium that simply do not apply to crosses between inbred strains since every animal in a strain is genetically identical and completely homozygous. We fill this theoretical gap by introducing novel combinatorial entities called strain coefficients. With an appropriate theory, it is possible to reformulate QTL mapping and QTL association analysis as an application of mixed models involving both fixed and random effects. After developing this theory, our first example compares the mixed effects model to a standard fixed effects model using simulated advanced intercross line (AIL) data. Our second example deals with hormone data. Here multivariate traits and parameter identifiability questions arise. Our final example involves random mating among eight strains and vividly demonstrates the versatility of our models.

IN analyzing gene mapping data from inbred strains, there is always the temptation to borrow models more pertinent to outbred populations. The vast majority of statisticians are wise enough to resist this temptation and turn to analysis methods tailored to specific breeding designs. Fortunately, the typical backcross or F_{2} design has sufficient symmetry to permit analysis of variance by standard statistical packages. As mammalian geneticists explore more complicated designs involving multiple strains and multiple generations, this analysis paradigm has begun to fracture. It is therefore hardly surprising that the last decade and a half have seen a revival of interest in statistical models for gene mapping with inbred strains. Although we briefly review some of the important contributions to this literature in the next section, it is fair to say that most modern models rely heavily on fixed effects. In contrast, the most successful models for mapping quantitative trait loci (QTL) in outbred populations invoke random effects (Hopper and Mathews 1982; Goldgar 1990; Schork 1993; Amos 1994; Blangero and Almasy 1997).

The premise of this article is that, properly formulated, random effects models hold equal promise for more complicated inbred strain data. If a QTL is segregating between two strains, backcross and F_{2} designs reliably detect it (Valdar*et al*. 2006). Models based on fixed allelic effects play a critical role in this process. Traditional designs have two drawbacks. First, the scarcity of recombination events often gives long mapped intervals. Second, when two founder strains of related ancestry are chosen, there may be no segregating QTL. To increase the number of recombination events and the number of segregating QTL, geneticists are turning to more complex designs involving multiple strains. Although the rationale for more complex designs is compelling, they bring in their wake problems of overparameterization. Random effects models neatly circumvent some of the parameterization issues encountered with fixed effects models. Unfortunately, the standard outbred QTL model does not make sense for inbred strains. All individuals of a particular strain are genetically identical and completely homozygous. These cardinal characteristics have subtle consequences when we calculate trait covariances for the descendants of matings between different strains. A logically correct theory for specifying covariances between pairs of individuals is the key to making random effects models respectable for inbred strains.

In this article, we take two approaches to QTL mapping; both capture polygenic background as a source of random variation. The two approaches differ in how they handle variation caused by the QTL. In association mapping, markers are treated one by one as candidate genes, and observed genotypes or allele counts at a marker serve as fixed predictors of trait means. In linkage mapping, markers in the vicinity of the QTL provide prior information on gene sharing, and the QTL contribution is modeled as a random effect. The greatest defect of our models is the blanket assumption of additivity. The greatest strength of our models is their generality in other regards. Thus, there is no limit to the number of founding strains, the depth and complexity of pedigrees, or the number of traits in a multivariate analysis.

To avoid breaking the flow of our discussion, much of the mathematical detail is relegated to the appendixes. The following sections summarize previous contributions, lay out the model with full attention to computation of strain coefficients and relative covariances, resolve the thorny issue of identifiability, apply the models to real and simulated data, and discuss the broader implications and limitations of the models.

## METHODS

#### A brief survey of previous methods:

Inbred mammalian strains have unique advantages in genetics. All members of a strain are genetically identical and completely homozygous. Simple crosses between strains involve no phase ambiguities, and any genes mapped can be quickly located in humans and other species by synteny. With mice and other small mammals, breeding is reasonably straightforward, generation times are fairly short, and the environment can be exquisitely controlled.

For decades, QTL mapping in inbred strains was considered an exercise in fixed effects modeling. Testing for association between marker genotypes and trait values is readily carried out using several available statistical packages. In the interval method introduced by Lander and Botstein (1989), the QTL is allowed to take any position along a chromosome. This makes QTL genotypes unobservable and requires computation of posterior distributions given observed genotypes at the flanking markers. Although the EM algorithm is applicable in this context, it is often slow to converge, and the regression method of Haley and Knott (1992) provides a quick approximation. The permutation test of Churchill and Doerge (1994) handles multiple testing problems gracefully. The recent program R/qtl (Broman*et al*. 2003), which capitalizes on the R software environment, combines several of these methods with hidden Markov modeling of missing genotypes. Despite these admirable advances, interval mapping is still limited to simple crosses where polygenic background is confounded with random environment. As the field embraces more complex crosses, geneticists no longer have the luxury of ignoring polygenic background, and it seems self-evident that explicitly modeling it will improve statistical inference.

The composite interval mapping method of Zeng (1993, 1994) implemented in QTL Cartographer generalizes interval mapping by including the direct effects of one or more markers unlinked to the QTL. Hence, composite interval mapping can be viewed as an attempt to incorporate polygenic background through fixed effects. If the number of typed markers is large, then it becomes hopeless to include all of them, and some automatic selection of background markers is desirable (Manly and Olson 1999).

Although Xie*et al*. (1989) take important first steps toward including polygenic background as a random effect, they do not derive general covariance expressions. This failure makes it difficult to deal with nonstandard crosses and awkward to combine data from different crosses. In the meantime, the pressure to increase the number of strains per cross has been growing (Rebai and Goffinet 1993). Of 21 cloned mouse genes listed in Tables 1 and 2 of the review by Flint*et al*. (2005), 7 rely on cloning strategies involving multiple strains or outbred mice. These practical concerns are stimulating intense efforts to revamp experimental design and statistical analysis of inbred cross data (Liu and Zeng 2000; Hitzemann*et al*. 2002; Pletcher*et al*. 2004; Li*et al*. 2005; Cervino*et al*. 2007). Other recent models that delve into multiple QTL models and epistasis are both frequentist (Kao*et al*. 1999; Janninka and Jansena 2001; Seaton*et al*. 2002; Broman*et al*. 2003) and Bayesian oriented (Sillanpää and Arjas 1998; Sen and Churchill 2001; Broman*et al*. 2003).

#### Trait means, variances, and covariances:

We begin our theory development with a basic model applicable to any inbred strain design, including F_{2}, advanced intercross lines, and random mating. Suppose that *i* and *j* are two animals generated by a complex cross involving *s* inbred strains. At *t* traits of interest, *i* and *j* exhibit random vectors *X _{i}* and

*X*of trait values. For the sake of simplicity, assume further that

_{j}*X*and

_{i}*X*reflect the contributions of a single gene whose alleles have additive effects. Our immediate goal is to calculate the expected vectors

_{j}*E*(

*X*) and

_{i}*E*(

*X*) and the covariance matrix Cov(

_{j}*X*,

_{i}*X*). When

_{j}*i*=

*j*, we recover variances as well as covariances. Because of our assumption of additivity,

*X*decomposes as the sum

_{i}*Y*+

_{i}*Z*of a maternal contribution

_{i}*Y*plus a paternal contribution

_{i}*Z*. To calculate

_{i}*E*(

*Y*), let

_{i}*M*denote the originating strain of the maternal gene of

_{i}*i*. Although

*M*is unobserved, we can calculate the probability Pr(

_{i}*M*=

_{i}*a*) for any given strain

*a*. In terms of these probabilities and the

*t*× 1 mean vector μ(

*a*) of allelic effects on each trait for strain

*a*, we haveInvoking a similar expression for

*E*(

*Z*), it follows that(1)where γ

_{i}_{i}(

*a*) is the probability that a randomly sampled gene from

*i*originates from strain

*a*. We refer to γ

_{i}as the strain fraction vector for animal

*i*; γ

_{i}has dimension

*s*× 1.

Covariances are derived by the same kind of reasoning. Decompose *X _{j}* into the sum

*V*+

_{j}*W*of a maternal contribution

_{j}*V*plus a paternal contribution

_{j}*W*. In view of the bilinearity of the covariance operator and the symmetry of maternal and paternal alleles, it suffices to find the covariance Cov(

_{j}*Y*,

_{i}*V*). Let

_{j}*N*denote the originating strain of the maternal gene of

_{j}*j*. Conditioning on the joint value of

*M*and

_{i}*N*then yieldswhere the superscript * indicates a vector or matrix transpose. By analogy with kinship coefficients, we define the strain coefficient ψ

_{j}_{ij}(

*a*,

*b*) to be the joint probability that a randomly drawn gene from animal

*i*originates from strain

*a*and a randomly drawn gene from the same locus of animal

*j*originates from strain

*b*. If

*i*and

*j*coincide, then sampling is done with replacement. The

*t*×

*t*covariance matrix between the trait values of

*i*and

*j*becomes(2)where

*C*(

_{ij}*a*,

*b*) = ψ

_{ij}(

*a*,

*b*) − γ

_{i}(

*a*)γ

_{j}(

*b*), which we collect into an

*s*×

*s*matrix, denoted

*C*.

_{ij}For *s* strains and *t* traits, it is convenient to stack the allelic effects into a column vector μ of length *st* with transposeThe positive semidefinite matrix Ω = μμ* can then be split into *t*^{2} blocks Ω_{kl} each of size *s* × *s*. Restricting our attention to the block corresponding to traits *k* and *l*, the covariance matrix (2) has entries given by the trace formula(3)

In polygenic inheritance, many independent loci contribute in an additive manner to the traits under consideration. Since trait means and covariances add in this setting, the mean expression (1) and the covariance expressions (2) and (3) remain valid provided we replace μ by and Ω by . Here μ* ^{l}* denotes the vector contribution corresponding to locus

*l*rather than the

*l*th component of μ. appendix a shows that every pair (μ, Ω) consisting of a vector μ and a positive semidefinite matrix Ω can be represented as two such coordinated linear combinations. Hence, to capture polygenic background, it suffices to estimate arbitrary μ and Ω. We see later that there is an identifiability issue that must be surmounted in estimating Ω.

#### Computation of strain coefficients:

Because the combinatorial coefficients γ_{i}(*a*) and ψ_{ij}(*a*, *b*) are essential in calculating trait means and variances, we need good algorithms to compute these coefficients. Fortunately, we can mimic the logic used in calculating kinship coefficients for outbred populations. Since a pedigree founder *i* is assumed to be strain pure, one entry of the vector γ_{i} = 1, and the remaining entries = 0. Likewise for two founders *i* and *j*, one entry of the matrix ψ_{ij} = 1, and the remaining entries = 0. All other strain fraction vectors γ_{i} and strain coefficient matrices ψ_{ij} are defined recursively starting with the founders.

To avoid circular reasoning, pedigree members are numbered so that parents always precede their children. If animal *i* is not a founder, then it has parents *k* and *l*. Assuming that *k* and *l* have already been visited in filling in the strain fractions, we set(4)If *j* ≠ *i*, then without loss of generality we can assume *j* has been visited already, and we can set(5)(6)This leaves only the case *j* = *i*. There are four equally likely possibilities when we sample two genes of *i*: (a) both genes coincide with the gene passed by *k*, (b) both genes coincide with the gene passed by *l*, (c) the first gene comes from *k* and the second from *l*, and (d) the first gene comes from *l* and the second from *k*. These considerations produce the matrix recurrence(7)where diag(γ) denotes a diagonal matrix whose diagonal entries coincide with the entries of the vector γ.

The initial conditions on founders and the recurrences (4)–(7) completely determine γ_{i} and ψ_{ij}. These in turn determine the *C _{ij}* matrices, which have a richer mathematical structure than the strain coefficient matrices ψ

_{ij}. appendix b describes several fascinating properties of the

*C*matrices. One such property is

_{ij}*C*=

_{ij}**0**between most members of simple crosses, for example, for all F

_{2}animals when

*i*≠

*j*or whenever

*i*is a founder or F

_{1}.

Variance component models for QTL mapping with outbred populations require conditional kinship coefficients in addition to theoretical kinship coefficients. For exactly the same reasons, we also need conditional strain fractions and coefficient matrices. These depend on observed marker genotypes in the vicinity of a putative QTL. On small pedigrees, it is possible to compute conditional strain coefficient matrices exactly by considering all descent graphs (gene flow patterns) at the QTL and neighboring markers (Kruglyak*et al*. 1996). In practice, inbred strain pedigrees are so large that the number of possible descent graphs is astronomical. Stochastic sampling provides a workable substitute for exhaustive enumeration of descent graphs (Sobel and Lange 1996). The Markov chain Monte Carlo (MCMC) method incorporated in the computer program SimWalk samples relevant descent graphs with the appropriate conditional probabilities. Given a descent graph at the QTL, it is trivial to compute strain fractions for all animals and strain coefficient matrices for all pairs of animals in a pedigree. The averages of these quantities over all sampled descent graphs serve as approximations to the conditional strain fractions and strain coefficient matrices.

Strain coefficients convey more information than strain fractions. For instance, it is obvious thatWe can put this extra information to good use in predicting QTL genotypes. At a given genomic location, imagine a marker with a different allele for each strain. Let be the conditional probability that animal *i* has unordered genotype *a*/*b* at the hypothetical marker given the observed data at the ordinary markers. The relationsconnect the conditional genotype probabilities to the conditional strain fractions and coefficients. These relations in turn imply that(8)Thus, we can impute strain genotypes as well as strain fractions.

#### Variance component models:

Variance component models revolve around the multivariate normal distribution or related distributions such as the multivariate *t*. Every multivariate normal distribution is uniquely determined by its mean vector ν and variance matrix Σ. If we decompose trait values into independent, additive contributions, then ν and Σ can be expressed as sums over the various contributions. As long as we are willing to take the leap of faith that all random contributions are Gaussian, then trait vectors will be Gaussian as well. For each random contribution, variance matrices are constructed from a constant part and a parametric part. The genetic covariance formula (3) is typical in this regard. The constant parts *C _{ij}* are forced on us by the nature of the pedigree. The parametric part Ω with blocks Ω

_{kl}requires estimation.

The environmental contribution to the mean is usually modeled as the sum of a grand mean η plus covariate effects such as age or sex. Random environment and cage effects can be modeled by Kronecker products of variance matrices, provided we order trait values so that all values corresponding to a given trait are contained in a single block, and animals are consistently enumerated across blocks. Given these conventions, the variance matrix under random environment reduces to the Kronecker product ϒ ⊗ *I* of the trait variance matrix ϒ and the identity matrix *I*. Obviously, ϒ is the parametric part; it describes the environmental covariation of the traits in a single animal. The matrix *I* reflects the independence of the random environments for the various animals. For a random cage effect, we replace the identity matrix by a cage matrix *H* = (*h _{ij}*), where

*h*= 1 if animals

_{ij}*i*and

*j*belong to the same cage and 0 otherwise. The matrix replacing ϒ describes the environmental covariation of the traits for animals in a single cage (Lange 2002). As an example, heritability analyses generally specify two random effects, additive polygenes and random error/environment,(9)(10)where α

_{ic}is the

*c*th of

*C*covariates measured on animal

*i*and β

_{kc}is the corresponding regression coefficient for trait

*k*.

Once we specify the mean and variance components, the loglikelihood of a pedigree can be written asusing the observed trait values *x*, the mean vector ν such as that of Equation 9, and the variance matrix Σ such as that of Equation 10. Assuming pedigrees behave independently, their loglikelihoods add. Given the overall loglikelihood, parameters can be estimated by maximum likelihood, and statistical inference conducted by standard likelihood ratio tests comparing alternative hypotheses to null hypotheses. Lange (2002) develops this frequentist approach to estimation and inference in detail. Our computer program Mendel relies on a quasi-Newton algorithm for maximum likelihood estimation. Bauman*et al*. (2005) discusses an alternative EM algorithm as well as factor-analytic parameterizations of variance matrices. Given the presence of covariates and heterogenous pedigree structures, permutation testing is rarely possible. To aid the user in judging significance and model fitting, Mendel reports standard errors of parameters, pedigree deviances, outlier individuals, and various goodness-of-fit statistics.

#### Two QTL mapping strategies:

There are two specific strategies, association and linkage, for QTL mapping. Variance component models are pertinent to both. Although the two strategies differ in how they portray QTL effects, each captures polygenic background as a random effect. In addition to the strain effects appearing in Equation 1, most models include a grand mean η and fixed effects tied to plausible predictors. If we specify η, then we must impose the vector constraint on the polygenic mean vector μ. Here the index *a* ranges over all strains. Random effects include the polygenic effect summarized by Equation 3, random environment plus measurement error, and possibly correlated environment such as cage effects. As described in the next section, the polygenic variance matrix Ω is not identifiable, and complicated constraints must be imposed on it to compensate for this fact. Regardless of the nature of these constraints, we must compute theoretical strain fractions and strain coefficients to estimate μ and Ω under the null hypothesis of no QTL effect.

In linkage mapping, markers serve to tag chromosome segments and keep track of recombination events. The genotypes of the causative QTL are unobserved, and the QTL is allowed to assume any position along the genome. Under the alternative hypothesis in linkage mapping, we model the QTL as a random effect in the same way that we modeled the contribution of a single gene with additive effects. The only difference is that we use strain fractions and coefficients calculated conditional on the observed marker data. From here on, we refer to these as conditional strain fractions and coefficients; those calculated unconditionally we call theoretical strain fractions and coefficients. Motivated by Equations 1 and 3, we let ε(*a*) denote the additive effect of the QTL in strain *a*. Then our earlier reasoning shows that the QTL contribution has meanfor animal *i* and covariancefor animals *i* and *j*. Here the circumflexes indicate conditional versions of the strain fractions and coefficients estimated from the marker data. Under the alternative hypothesis, we estimate the entries of ε.

Our basic linkage model therefore specifies the trait means and covariances(11)(12)for two animals *i* and *j*. Here *k* and *l* index two traits, α_{ic} is covariate *c* of animal *i*, and β_{kc} is the corresponding regression coefficient for trait *k*. If we let denote the average , then all QTL models that include a grand mean require the constraint . In the presence of this constraint, the likelihood ratio test of linkage follows asymptotically a χ^{2} distribution with *st* − *t* degrees of freedom.

In association mapping, QTL fixed effects are tied to the current marker. The marker is viewed as a candidate gene whose genotypes or alleles directly influence trait means (Lange*et al*. 2005); random QTL effects are omitted. Hence, in Equation 12 we drop the random effect , and in Equation 11 we amend the fixed effect to represent regression on observed allele counts at the current marker. If the additive model for allelic effects is viewed as too restrictive, then we can regress on observed genotypes. Association testing is again conducted by likelihood ratio statistics.

In the presence of missing genotypes in association testing, we fall back on imputed allele counts or imputed genotype counts. Because genotypes at markers are usually directly observed, little is lost in imputation by ignoring genotypes at flanking markers. In this simpler setting, a fast deterministic algorithm is available for imputation (Lange*et al*. 2005). Flanking marker genotypes occasionally resolve phase ambiguities caused by combining closely spaced single nucleotide polymorphisms (SNPs) into supermarkers. Accordingly, the current version of Mendel also accepts MCMC estimates of conditional strain fractions from SimWalk. When each strain carries a different allele at the marker, the allele counts delivered by SimWalk are computed by doubling the conditional strain fractions at the marker. When two strains share a common allele at the marker, the corresponding strain fractions are added before doubling.

#### Identifiability:

We have seen that the polygenic covariance expression (3) between trait *k* of animal *i* and trait *l* of animal *j* involves the *s* × *s* trait block Ω_{kl} of an *st* × *st* variance matrix Ω. Unfortunately, estimation of Ω collides with an identifiability issue. The crux of the problem is the existence of nontrivial matrices Λ withfor every legitimate choice of *C _{ij}* and every trait pair (

*k*,

*l*). Proposition 2 of appendix b explains this phenomenon by representing

*C*as a convex combination of the matrix

_{ij}**0**and matrices

*E*indexed by unordered strain pairs {

_{mn}*m*,

*n*}. Here all entries of

*E*are 0 except for the diagonal entries

_{mn}*e*=

_{mm}*e*= 1 and the off-diagonal entries

_{nn}*e*=

_{mn}*e*= −1. It follows thatprovided(13)for every strain pair {

_{nm}*m*,

*n*} and every

*s*×

*s*trait block Λ

_{kl}= (λ

_{kl,mn}) of Λ.

We can solve the identifiability problem by subtracting the nonidentifiable part of Ω from Ω. To achieve this end, we view the positive semidefinite matrix Ω as a vector in the Euclidean space R^{st}^{×st}. In this setting the trace function = tr(*AB**) and Frobenius norm ‖*A*‖_{F} = tr(*AA**)^{1/2} reduce to the standard inner product and Euclidean norm. To find the nonidentifiable part of Ω, one projects Ω onto the vector subspace of symmetric matrices satisfying Equation 13 for every strain pair {*m*, *n*} and every trait block Ω_{kl}. Formally, the projection *P*(Ω) is defined to be the matrix *X* giving the minimum of for .

Fortunately, minimization of separates into subproblems corresponding to different trait blocks. First, consider a diagonal block Ω_{kk} of Ω. To simplify notation, denote its entries by *y _{mn}* = Ω

_{kk,mn}and the entries of the corresponding block of the projection by

*x*=

_{mn}*P*(Ω)

_{kk,mn}. To find

*P*(Ω)

_{kk}we must minimize the sum of squaressubject to the constraints

*x*+

_{mm}*x*=

_{nn}*x*+

_{mn}*x*for every pair {

_{nm}*m*,

*n*}. Now consider off-diagonal blocks Ω

_{kl}= . These come in pairs that must be handled together, so we letandand minimize the sum of squaressubject to the constraints

*x*+

_{mm}*x*=

_{nn}*x*+

_{mn}*x*for every pair {

_{nm}*m*,

*n*}. It follows that diagonal blocks and off-diagonal blocks lead to the same constrained minimization problem.

appendix c shows that each of these least-squares problems has solution *X* = (*x _{mn}*) with residualwhere

*Y*= (

*y*),

_{mn}*U*=

*QYQ*, and

*Q*is the

*s*×

*s*projection matrix . In calculating a covariance, we can ignore symmetrization and replace the matrix by

*U*. Indeed, the symmetry of

*C*implies thatThus, tr(

_{ij}*C*Ω

_{ij}Q*) faithfully represents the covariance between trait*

_{kl}Q*k*of animal

*i*and trait

*l*of animal

*j*. By the same reasoning, we can replace the entire residual matrix Ω −

*P*(Ω) by the matrix(14)Here diag(

*Q*) is a diagonal block matrix with all

*t*diagonal blocks equal to

*Q*. One can easily check that diag(

*Q*) is a projection matrix and that

*R*inherits the properties of symmetry and positive semidefiniteness from Ω.

In reparameterizing Ω, it is convenient to define an orthogonal matrix *O* mapping the vector to the standard basis vector *e*_{1}. (See appendix d for one version of *O*.) It follows thatObserve that pre- and postmultiplying any square matrix by zeros out the first row and first column of the matrix. To take advantage of this fact, we express the residual matrix (14) as(15)The matrixis a positive semidefinite replacement for diag(*O*)Ω diag(*O**). By our earlier remark, a block ϒ_{kl} of ϒ equals the corresponding block of diag(*O*)Ω diag(*O**) with its first row and column zeroed out.

We are now close to the desired goal of reparameterizing the residual. The matrix ϒ has entire rows and columns consisting of zeros. Permuting its rows and columns appropriately will move its nontrivial part to an upper-left block, which will be positive definite whenever Ω is positive definite. The Cholesky decomposition of this upper-left block then serves as a good parameterization of *R*. To compute the number of parameters for *s* strains and *t* traits, observe that the matrix ϒ is *st* × *st*. A total of *t* rows and columns are lost in the zeroing-out process. This leaves an (*st* − *t*) × (*st* − *t*) upper-left block with (*st* − *t*)(*st* − *t* + 1)/2 diagonal or subdiagonal entries. For example, with three strains and two traits, there are 10 parameters.

For the sake of clarity, let us summarize how our proposed parameterization leads to trait covariances. It begins with a Cholesky decomposition Δ of an (*st* − *t*) × (*st* − *t*) positive definite matrix. The matrix ΔΔ* is then subdivided into (*s* − 1) × (*s* − 1) trait blocks (ΔΔ*)_{kl}, and each block is promoted to an *s* × *s* trait block ϒ_{kl} by adding a top row and left column of zeros. In matrix notation, ϒ_{kl} = *Z*(ΔΔ*)* _{kl}Z** with

*Z*the

*s*× (

*s*− 1) matrixFinally, we construct the residual matrix

*R*via Equation 15, using the orthogonal matrix

*O*.

With these conventions, the covariance between trait *k* of animal *i* and trait *l* of animal *j* amounts to(16)In computing covariances over large pedigrees, it saves time and storage to precompute and store the (*s* − 1) × (*s* − 1) matrices 4*Z***OC _{ij}O**

*Z*and discard the

*s*×

*s*matrices

*C*. Note that the action

_{ij}*A*

*Z**

*AZ*on an

*s*×

*s*matrix

*A*deletes the first row and first column of

*A*.

This ends our theoretical overview of the model. appendix e shows how to differentiate covariances with respect to parameters, and appendix f supplies a counterexample connecting identifiability and symmetry. We now move on to data analysis.

## APPLICATIONS

#### A simulated advanced intercross line:

An AIL starts with F_{1} offspring from an intercross of two inbred strains. The F_{1} animals are randomly bred to produce the F_{2} animals, the F_{2} animals are randomly bred to produce the F_{3} animals, and so on for a total of *n* generations. An AIL differs from repeated brother–sister mating, because it involves enough animals to preserve genetic diversity. It draws its strength from the steady accumulation of recombination events over many generations (Darvasi and Soller 1995). Simulating data according to an AIL design permits us to compare our mixed effects results with the fixed effects results of the benchmark program QTL Cartographer. This exercise is not meant to be a substitute for an exhaustive study of power and experimental design. Also, the comparison is not entirely fair because QTL Cartographer analyzes the F_{n} data at the last generation ignoring the previous generations. To reconstruct missing marker information, QTL Cartographer applies an inflated recombination fraction scaled to reflect *n*.

To create our simulated AIL data, we mated two inbred founder animals and subjected their descendants in each generation to virtual random mating. Generation 10 contained 175 animals in 140 sibships with 492 animals overall. Placing the QTL locus at the midpoint of markers 5 and 6 of 11 equally spaced marker loci, we simulated genotypes by gene dropping and assigned QTL effects on the basis of the genotypes at the QTL. QTL genotypes were then discarded from further analysis. We modeled a univariate trait with a grand mean η = 4, an environmental variance σ = 1, and a 2 × 2 polygenic variance matrixFor this simulated trait, strain one has a genetic variance comparable to the environmental variance and larger than the genetic variance of strain two. The two strains share a modest genetic correlation. For reasons explained in the next section, a single generation of data in a symmetric cross of this sort does not sustain estimation of strain-specific polygenic means. To circumvent this problem in our comparisons, we set the strain-specific polygenic means equal to 0. We chose small strain-specific QTL effects ε_{1} = 0.2 and ε_{2} = −0.2 centered around 0. In view of our discussion of identifiability, we can estimate only a single parameter *p*_{1} characterizing Ω. The projection technique discussed yields the value *p*_{1} = 0.667. The discussion of the *C _{ij}* matrices in appendix b explains why genotype data on a single generation also prevent estimation of

*p*

_{1}.

To provide the most informative comparisons, we ran three analyses: (1) Mendel on the full pedigree with complete genotype and phenotype data (Mendel Full), (2) Mendel on the full pedigree but with phenotype data on only the final F_{10} generation (Mendel F_{10}), and (3) QTL Cartographer on the final F_{10} generation with complete genotype and phenotype data (Cartographer). Simply comparing cases Mendel Full and Cartographer is hardly fair; the full pedigree contains more than twice the number of animals in the final generation. Mendel F_{10} takes advantage of the full genealogy and all genotype data in computing theoretical and conditional strain coefficients. It limits itself to the phenotype sample in the last generation to enable a better comparison to QTL Cartographer.

Before turning to QTL mapping in the Mendel analyses, we fit a baseline model including the grand mean, the polygenic variance, and the environmental variance. We then estimated conditional strain coefficients at each of the 11 marker loci. This put us in a position to estimate the global parameters and the QTL-specific parameters simultaneously at each locus. The evidence in favor of the QTL is summarized by a likelihood ratio test (LRT) statistic following a distribution; a nonlinear false discovery rate (FDR) correction (Benjamini*et al*. 2001) corrects for multiple testing for all three analyses. Table 1 summarizes the type I error rate, power, and coverage as well as the generating parameters, their estimates, and the standard errors of the estimates at the loci adjacent to the QTL. Successful coverage occurs when the equivalent one-LOD drop interval (4.6 LRT units) includes the QTL. We reject the null hypothesis of no QTL effect when the LRT is significant at the 0.05 level.

The results in Table 1 reflect 100 simulations for a QTL-effect size that yields power >90% for Mendel Full; type I error rates are given as confidence intervals based on 500 simulations under the null hypothesis of no QTL effect. Clearly, the power to detect linkage is drastically reduced when only the F_{10} generation is available for analysis. This absence of data also makes it difficult for Mendel F_{10} to estimate the polygenic parameter *p*_{1} accurately. For Mendel Full all estimates are within one standard error of their true values, and standard errors are small. QTL Cartographer exhibits slightly better power and coverage than Mendel F_{10}, but with a largely inflated type I error rate. Both methods are easily bested by Mendel Full. These trends continue over a range of smaller QTL effects (data not shown). We are pleased with these results. In our view they demonstrate that application of the mixed effects model sacrifices little in simple settings while generalizing readily to complex pedigrees.

#### A multivariate four-way cross:

To illustrate the analysis of multivariate traits, we next consider the hormone data of Burke and colleagues (Harper*et al*. 2003) on aging UM-HET3 mice. Figure 1 shows how the UM-HET3 mice were created from four founder strains: BALB/cJ (C), C57BL/6J (B6), C3H/HeJ (C3), and DBA/2J (D2). CB6F_{1} females crossed with C3D2F_{1} males provided 967 F_{2} full siblings. At markers with four different alleles, all F_{2} mice were heterozygous. Thus compared to a two-way cross, the four-way cross doubles the number of founder strains without sacrificing phase certainty. Hormone levels of insulin-like growth factor I (IGF), leptin (Lep), and thyroxine (T4) were measured at 4 and 15 months on each of the F_{2} mice. Testing maternal and paternal effects separately, Harper *et al*. found several linked markers in these data via ANOVA, including a maternal allele at D3Mit25 linked to IGF at 15 months, a paternal allele at D3Mit127 linked to Lep at 4 months, and both maternal and paternal alleles linked to Lep at 15 months. It is worth pointing out that ANOVA or MANOVA must be carried out at marker loci. Only here do marker genotypes or allele counts unambiguously define factor levels. With complete genotyping, our model collapses in this setting to the classical models.

This multistrain cross highlights identifiability pitfalls inherent in the structure of some crosses and the data collected on them. For example, all F_{2} mice share the strain fraction vector . Hence, the polygenic mean is confounded with the grand mean. Using strain trait averages or phenotyping members of the original strains would allow us to estimate the polygenic means, but this is not an option for the current data.

Although the rigid structure of the four-way cross preserves phase certainty, it reduces uncertainty to the point where the polygenic covariance matrix cannot be estimated. Polygenic covariances depend on the combinatorial matrices *C _{ij}*. We have already noted that

*C*=

_{ij}**0**whenever

*i*is a founder or

*i*and

*j*are F

_{1}mice. Straightforward calculations for F

_{2}mice

*i*and

*j*with

*i*≠

*j*yieldInspection of Equation 3 therefore shows that the polygenic covariance matrix Ω is confounded with the matrix describing the environmental covariances.

Finally, there are identifiability problems with the QTL allelic effects. At the covariance level, the conditional coefficient matrix is identically 0 when typing is full and different alleles are present in each strain. At the mean level, imposition of the constraint ε_{4} = − ε_{1} − ε_{2} − ε_{3} shows that the genotype-specific means in a purely allelic model can be expressed as the vectorBecause the matrix on the right of this equation has less than full rank, some mean vectors are not representable. As a substitute for the additive QTL contributions, we assign a different mean effect to each of the four F_{2} genotypes.

We analyze these data in the same manner as the simulated AIL except for graphing the −log_{10}(*P*-value) instead of LRTs and analyzing multiple map points in the intervals between marker loci. We enjoy two advantages over ANOVA or MANOVA; namely, we can use phenotyped individuals with wholly or partially missing genotypes, and we can estimate both QTL location and effect size.

To carry out a multivariate analysis, one must decide which univariate traits to analyze together. This is not a trivial matter because combining traits exacerbates the multiple testing problem and may add noise and degrade power (Amos*et al*. 2001; Bauman*et al*. 2005). With outbred populations it is intertwined with the issue of ascertainment (Dawson and Elston 1984); it may also be a problem with inbred populations since strains are often chosen for a particular experiment on the basis of their average phenotype. We present here the results of two multivariate analyses making these points. The most interesting results from this example data set are on chromosome 3, and we focus on three traits, leptin measurements at both 4 and 15 months and insulin-like growth factor I at 15 months in this region. In univariate analysis, both IGF-15 and Lep-4 show significant linkage to markers on chromosome 3, while Lep-15 shows suggestive linkage. Multivariate analyses are indicated biologically, spatially, and temporally.

We carried out a number of multivariate analyses; some of the results are summarized in Figures 2 and 3 and Table 2. The graphs of −log_{10}(*P*-value) along chromosome 3 in Figure 2 correspond to the univariate analyses of IGF-15, Lep-4, and Lep-15 and the bivariate analysis of Lep-4 and Lep-15. The univariate graph of IGF-15 peaks over marker D3Mit5. Subjecting the *P*-values for IGF-15 to the nonlinear FDR correction (Benjamini*et al*. 2001) suggests a single location for IGF-15. Both of the univariate leptin graphs as well as the bivariate graph peak over D3Mit127. After FDR correction, at least two significant map points are suggested over D3Mit127 for the bivariate leptin analysis. Table 2 reports estimates and standard errors for the bivariate leptin mean parameters at marker D3Mit127. These estimates are very similar at the two time points. Although likelihood ratios improve over univariate analysis, *P*-values do not because the degrees of freedom of the χ^{2} test double. The estimated environmental covariance matrix(17)is consistent with the raw correlation of the two traits. In the matrix (17), the standard error of each estimate appears in parentheses.

A trivariate analysis of IGF-15, Lep-4, and Lep-15 clearly illustrates that in the case of multivariate traits, more is not always better. Comparing Figure 2 to Figure 3 shows two large peaks: one at marker D3Mit25 and one over marker D3Mit127. After FDR adjustment only the first peak survives, and the evidence for it is compromised. Thus, the trivariate analysis provides no additional linkage information and actually degrades the power to detect linkage. While leptin and IGF share numerous biological interactions, there is no evidence in these data for a common genetic determinant on chromosome 3.

#### An eight-strain simulated cross:

Our first two examples demonstrated the equivalence of the random effects model to the fixed effects model for standard cross designs and hint at the flexibility of our approach. To demonstrate this flexibility, we now present an eight-strain simulated example that (a) documents how correctly accounting for polygenic background can be beneficial and (b) demonstrates how it is possible to test hypotheses with the kind of unbalanced pedigree data encountered in human studies. As with the simulated AIL example, this exercise is not meant to be a substitute for an exhaustive study of power and experimental design.

#### Simulation specifics:

Our simulated cross involves a univariate trait, eight inbred strains, and seven pedigrees of nine generations each. We are motivated in part by the heterogeneous stock (Mott*et al*. 2000) and the collaborative-cross designs (Williams*et al*. 2002). Starting with strain-pure founders, we constructed each pedigree by random mating with a decreasing number of progeny per animal per generation. The average number of animals per pedigree is 366. Random mating ensures substantial diversity in theoretical and conditional strain fractions and coefficients. On the basis of the marker map for chromosome 2 in the UM-HET example of the previous section, we simulated genotypes at six loci using the gene-dropping option of Mendel. Locus 3 serves as the QTL and the remaining loci as markers. Genotypes at the QTL are omitted during linkage analysis.

We generated univariate trait values independently for each pedigree by sampling from a multivariate normal distribution with prescribed means and covariances. If animal *i* has QTL genotype *a*/*b* and trait value *X _{i}*, thenwhere η is the grand mean, μ is the vector of polygenic deviations from the mean, and ε is the vector of QTL deviations from the mean. For animals

*i*and

*j*, the polygenic and random environment contributions entail the covarianceNote the absence here of a QTL variance contribution. Although the data are analyzed conditionally given observed marker genotypes, they are generated unconditionally. Table 3 displays the values of the parameters used for the simulations. These values were chosen randomly subject to constraints such as .

Our simulation choices present both opportunities and challenges. For example, the fact that each strain is assigned a unique QTL allele suggests that even a simple F_{2} cross between two strains would be adequate to map the QTL. This advantage is tempered by the long genetic distances separating the QTL from the flanking markers, by the smallness of the QTL effects, by the similarity of these effects in some strains, and by the discordance of the QTL effects and the polygenic means effects.

In using random effect models for QTL mapping, inclusion of polygenic background is usually a good idea. If polygenic background is present but ignored, then the only way of accounting for relative correlations is through the QTL component. When we analyze the current data omitting polygenic background, every single chromosome location in the linkage analysis achieves a *P*-value <0.00001. Adding polygenic background causes *P*-values to reach more reasonable levels, ranging from 0.0019 to 0.3835. Subjecting the *P*-values to the (FDR) procedure highlights the QTL and one neighboring point as significant (Benjamini*et al*. 2001). Figure 4 plots the function −log_{10}(*P*-value) along the chromosome; as earlier, the *P*-values reflect the likelihood ratio tests of the QTL component. The QTL is located at 30 cM from the origin between marker D2Mit323 at 23 cM and marker D2Mit37 at 42 cM.

We also used these data to illustrate the application of the QTL association model. As in our linkage analysis, omitting polygenic background leads to unrealistically small *P*-values. Figure 4 plots the −log_{10}(*P*-value) for the association analysis with the polygenic background. The association results are similar to the linkage results. The marker with the most significant result is D2Mit323, which is the marker nearest to the QTL. The FDR procedure singles out D2Mit323 as the only significant association.

Comparison of computation times between the two models illustrates the speed of the association analysis. The linkage model requires ∼4 hr for calculation of the coefficient matrices for each pedigree and ∼20 hr to estimate the parameters for each of the 17 points. The association model requires ∼1.5 hr for all calculations at each of the five markers.

## DISCUSSION

In the hope of mapping QTL with small effects, geneticists are undertaking more ambitious crosses with multiple strains, multivariate traits, and dense marker sets. The random effects models developed here will enable a smooth transition to more sophisticated statistical analysis. The greatest strength of the models is their ability to capture polygenic background parsimoniously. A second strength is their versatility in handling large pedigrees, large numbers of contributing strains, and multivariate traits. While we have warned against importing ideas wholesale from the rest of statistical genetics, judicious adaptations are fully warranted. For example, since environment can be exquisitely controlled for inbred strain experiments, models of gene-by-environment interaction can be put to good use on the mean level (Blangero 1993) and on the variance level (Lange 1986; Itoh and Yamada 1990). These techniques apply both to continuous traits (Pletcher 1999; Pletcher and Geyer 1999; Jaffréic and Pletcher 2000; Pletcher and Jaffréic 2002; Purcell 2002; Purcell and Sham 2002; Meyer and Kirkpatrick 2005) and to categorical traits (Towne*et al*. 1997; Viel*et al*. 2005). It is also straightforward to model multiple QTL acting additively (Lange 2002).

Balanced against these strengths is the need for better-conceived study designs. Unless crosses are carefully structured, some parameters will be unidentifiable. One antidote is to scale back the complexity of a model and reparameterize. Our first two examples illustrate this tactic. Another antidote is to avoid monolithic designs and opt for a mixture of designs that individually reveal different features of a model. Our third example does this.

In random effects models, trait values for most animals are correlated. Logically, one should treat all animals as members of a single large pedigree. At some point this requirement becomes unwieldy. The computational demands of the random effects models are fairly high, so tactics such as pedigree splitting, marker thinning, and marker amalgamation should not be dismissed. It will probably take a combination of these tactics to cope with the large-scale mapping projects now under way (Pletcher*et al*. 2004). Fortunately, our experiences with simulated data suggest that a moderate amount of pedigree splitting sacrifices little information.

We have omitted a detailed discussion of how the program SimWalk delivers conditional strain fractions and coefficients. In our experience, SimWalk's MCMC algorithm adequately samples descent graph space. In association analysis, this lengthy process can be dispensed with if information at neighboring markers is ignored. Deterministic algorithms that produce approximate kinship and strain coefficients may ultimately be a better choice than stochastic sampling (Gao*et al*. 2004; Gao and Hoeschele 2005). In maximizing loglikelihoods, it is also worth mentioning that Mendel allows the user to set initial parameter values and bounds. This flexibility is valuable in exploring multimodal likelihood surfaces.

Our QTL parameters enter the model at both the mean and the variance level and are not subject to nonnegativity constraints. Thus, the asymptotic distribution of a likelihood ratio test follows a chi-square distribution with degrees of freedom equal to the difference in the number of independent parameters between the underlying nested models. Model selection can be accomplished by likelihood ratio tests or modified criteria such as the Akaike information criteria (AIC) or the Bayesian information criteria (BIC). Multiple testing is certainly an issue. The FDR correction of Benjamini and Hochberg (Benjamini*et al*. 2001) for dependent tests is often a useful cure and provided us with correct inferences in our simulated examples. Extensions such as Storey's optimal discovery procedure (Storey 2007; Storey*et al*. 2007) can lead to more accurate *P*-values and should be kept in mind.

The assumption of multivariate normality is helpful in maximum likelihood estimation. For univariate traits with excess kurtosis, the multivariate *t* distribution is a workable substitute for the multivariate normal distribution and is an implemented option in Mendel. It is reasonable to conjecture that some version of the central limit theorem should hold for a polygenic trait over a pedigree (Lange 1978; Lange and Boehnke 1983). For simple pedigrees generated *en masse* in a cross, one can check the normality assumption empirically. The impact of departures from normality has been considered by several researchers (Beaty*et al*. 1985; Allison*et al*. 1999; Pratt*et al*. 2000). Blangero*et al*. (2000) and Sham*et al*. (2000) suggest solutions to gross violations. One can object that QTL effects by their discrete nature cannot be normal. Three responses are possible. First, this objection has never stopped ordinary QTL mapping with outbred populations. Second, under the null hypothesis, the discrete effects disappear. Third, in all but the simplest crosses, application of a rigorous model incorporating both polygenes and major genes is very computationally demanding.

The web site (http://www.genetics.ucla.edu/software) offers the current versions of Mendel and SimWalk for several computing platforms. Ample documentation and sample problems are provided. The experimental versions of Mendel and SimWalk featured in this article will be released publicly as soon as it is practical.

## APPENDIX A: REPRESENTATION OF POSITIVE DEFINITE MATRICES

Given a *k* × *k* positive definite matrix Ω and a *k* × 1 vector μ, we now prove that vectors μ_{1}, … , μ_{n} exist such that and . To simplify the proof, we pass to the spectral decomposition Ω = *O***DO* of Ω. Here *O* is an orthogonal matrix, and *D* is a diagonal matrix whose *j*th diagonal entry *d _{j}* is an eigenvalue of Ω. If

*n*vectors ν

_{1}, … , ν

_{n}exist such that and , then taking μ

_{i}=

*O**ν

_{i}for each

*i*completes the proof.

With the transformed problem, we can work on each dimension *j* separately. Suppose we can find scalars *a*_{1}, … , *a _{m}* such that

*a*

_{1}+ … +

*a*= ν

_{m}_{j}and + … + =

*d*. Then we construct

_{j}*m*vectors

*w*

_{1}, … ,

*w*whose entries are 0 except for their

_{m}*j*th entries

*w*=

_{ij}*a*. These

_{i}*m*vectors compose part of the solution set ν

_{1}, … , ν

_{n}and do not impinge on the parts contributed by other dimensions. To show that appropriate scalars

*a*

_{1}, … ,

*a*exist, we consider optimizing the function

_{m}*f*(

*a*) = + … + subject to the affine constraint

*a*

_{1}+ … +

*a*= ν

_{m}_{j}. By introducing a Lagrange multiplier, we can prove that

*f*(

*a*) attains its minimum when all

*a*= ν

_{i}_{j}/

*m*. The maximum of

*f*(

*a*) is infinite in all but the trivial case

*m*= 1. For instance, we can take , , and all other

*a*= 0 and send

_{i}*p*to ∞. Since

*d*must be positive, some positive integer

_{j}*m*exists with <

*d*. This choice of

_{j}*m*puts us in a position to invoke the intermediate value theorem. The set of vectors

*a*= (

*a*

_{1}, … ,

*a*) satisfying the constraint is convex and therefore connected. A continuous function on a connected set attains every value between its minimum and maximum values. Hence, there is some

_{m}*a*with

*f*(

*a*) =

*d*.

_{j}## APPENDIX B: PROPERTIES OF THE *C*_{IJ} MATRICES

_{IJ}

The role of the matrix *C _{ij}* in formula (3) suggests its importance. Mathematically

*C*is better behaved than the strain coefficient matrix ψ

_{ij}_{ij}. Recall that the founder initial conditions and the recurrences (4)–(7) completely determine the strain fraction vectors γ

_{i}and the strain coefficient matrices ψ

_{ij}. If we retain the conventions that

*i*has parents

*k*and

*l*and

*j*is an animal previously considered, then the last three recurrences translate into the similar recurrences(B1)(B2)and(B3)on the

*C*matrices. The next proposition collects some relevant facts.

_{ij}Proposition 1. *In addition to satisfying the recurrences* (*B*1), (*B*2), *and* (*B*3), *the matrix C _{ij}*

*has all entries*0*when either i or j is a founder*,*is symmetric*,*is positive semidefinite*,*has the vector***1***in its null space*,*has entries C*(_{ij}*m*,*n*)*confined to the interval**for n*≠*m and to the interval**for n*=*m*.

*Proof*.

If

*i*is a founder belonging to strain*q*and*j*is a founder belonging to strain*r*, then by definition γ_{i}(*m*) = 1_{{m=q}}, γ_{j}(*n*) = 1_{{n=r}}, and ψ_{ij}(*m*,*n*) = 1_{{m=q}}1_{{n=r}}. Thus, all entries of*C*vanish. If_{ij}*i*or*j*is a founder but the other is not, then induction and the recurrences (B1) and (B2) show that all entries of*C*vanish._{ij}Formula (B3) forces

*C*to be symmetric, and the recurrences (B1) and (B2) preserve symmetry._{ii}Because the recurrences (B1) and (B2) preserve positive semidefiniteness, it suffices to prove that

*C*is positive semidefinite. Inspection of formula (B3) further demonstrates that it suffices to prove that diag(γ_{ii}_{k}) − γ_{k}γ_{k}* is positive semidefinite for all*k*. Accordingly, let*v*be an arbitrary vector. The quadratic formis nonnegative owing to Cauchy's inequalityand the fact that .Again this is a consequence of the recurrences (B1) and (B2) and the validity of the assertion for

*C*. In the latter case, the equalityis obvious._{ii}Because the stated bounds are preserved by recurrences (B1) and (B2), it suffices to consider

*C*. The contribution to a diagonal term in Equation B3 is bounded below by 0 and above by . The contribution to an off-diagonal term is bounded below by and above by 0. ▪_{ii}

The collection of all *C _{ij}* matrices over a pedigree has considerable structure. For example, the symmetry of

*C*entails

_{ij}*C*=

_{ij}*C*. With just

_{ji}*s*= 2 strains, parts b, d, and e of Proposition 1 imply that every

*C*is representable asfor some constant . Furthermore, since

_{ij}*a*

_{ij}= 0 whenever

*i*or

*j*is a founder or an F

_{1}individual, straightforward recursive arguments show that within any strictly linear mating designs like F

_{n},

*a*

_{ij}= 0 for all

*i*≠

*j*. Two-strain systems also produce uninteresting

*conditional*coefficients; straightforward calculations show that for all

*i*and

*j*at markers that differentiate between the strains with complete genotyping.

To generalize this representation to more than two strains, it is helpful to introduce the *s* × *s* matrix *E _{mn}* where all entries of

*E*are 0 except for

_{mn}*e*=

_{mm}*e*= 1 and

_{nn}*e*=

_{mn}*e*= −1. There are such matrices.

_{nm}Proposition 2. *Every matrix C _{ij} from the collection*

*can be represented as a linear combination*(B4)

*Furthermore*,

*the coefficients a*

_{ij}_{,mn}

*are nonnegative dyadic rationals satisfying*

*Proof*. Each matrix *C _{ii}* =

**0**corresponding to a founder

*i*clearly qualifies. The representation (B4) is preserved by the averaging process of the recurrences (B1) and (B2), so it suffices to prove the representation for a matrix

*C*generated by a nonfounder. Again the averaging nature of recurrence (B3) allows us to verify the representation (B4) for a matrix of the form . Because the set of dyadic rationals constitutes an algebraic field, it is clear by induction that all entries of γ

_{ii}_{k}are dyadic rationals. We now claim that(B5)Equality (B5) is certainly true for the off-diagonal entries of the matrices on both sides. For the diagonal entries, it is a consequence of the identityBecause the coefficients are dyadic rationals, all that remains is to check that the sum of the coefficients is properly bounded. This follows from(B6)The upper bound (B6) can be proved by introducing a Lagrange multiplier corresponding to the constraint . Equality is achieved only when all . ▪

## APPENDIX C: COMPUTATION OF THE PROJECTION

Consider minimizing the functionsubject to the constraints *x _{mm}* +

*x*=

_{nn}*x*+

_{mn}*x*for every unordered pair {

_{nm}*m*,

*n*}. We proceed by seeking a stationary point of the LagrangianThis point is characterized by the equations(C1)(C2)with the convention that

*m*≠

*n*. Rearrangement of Equation (C1) gives(C3)If we interchange

*m*and

*n*in Equation C3, add the result to Equation C3, and invoke the constraint (C2), then we get the equationdetermining μ

_{{m,n}}asFrom Equation C3 it follows that(C4)

It is easy to check that the constraint (C2) is implicit in this solution. Furthermore, the solution entails the residualIf we set , our objective function can now be expressed asNeither the off-diagonal entries *x _{mn}* nor the constraints now appear. To solve this unconstrained problem, we center the

*a*by subtracting their average value . This allows us to reparameterize

_{mn}*f*(

*x*) asin more or less obvious notation.

Minimizing the objective function in this form coincides with a classical problem in population genetics. If we assume that *m* and *n* represent two possible alleles from *s* equally frequent alleles and *b _{mn}* represents a trait value determined by the genotype

*m*/

*n*, then minimizing

*f*(

*x*) corresponds to the problem of determining the additive genetic variance of a centered trait. The solution to this problem is known to beIt follows thatA final substitution for

*a*givesand the general formula(C5)based on Equation (C4) and valid for both

_{mn}*m*≠

*n*and

*m*=

*n*.

The projection solution (C5) reduces the residual *r _{mn}* =

*y*−

_{mn}*x*toIf we define a matrix

_{mn}*R*with entries

*r*and a matrix

_{mn}*U*with entriesthen it is clear thatIn other words, the residual matrix is a symmetrized version of

*U*. Fortunately, we can represent

*U*as the matrix productof

*Y*= (

*y*) sandwiched between two copies of the orthogonal projection .

_{mn}## APPENDIX D: CONSTRUCTION OF AN ORTHOGONAL MATRIX

An orthogonal matrix *O* mapping the vector to the standard basis vector *e*_{1} can be explicitly constructed by the Gramm–Schmidt process applied to the basis , where *e _{k}* is the standard basis vector with 1 in position

*k*and zeros elsewhere. The first row of

*O*is just ; the subsequent rows take the formwhere

*k*− 2 zeros precede the entry

*s*−

*k*+ 1. The reader can easily check that the row vectors provide an orthonormal basis.

## APPENDIX E: DIFFERENTIATION OF VARIANCES AND COVARIANCES

Because the fastest maximum likelihood algorithms rely on exact derivatives, there is an obvious need to calculate the partial derivatives of each covariance Cov(*X _{ik}*,

*X*) with respect to the entries of Δ = (δ

_{jl}_{mn}). If we let ∂

_{mn}denote partial differentiation with respect to δ

_{mn}, then formula (16) immediately leads toso it suffices to compute the partial derivatives of ΔΔ* = (

*d*). Sincethe product rule of differentiation yieldsThus, ∂

_{uv}_{mn}(ΔΔ*) consists entirely of zeros except for row

*m*and column

*n*. This fact considerably simplifies computation of derivatives.

## APPENDIX F: A COUNTEREXAMPLE ON IDENTIFIABILITY

Finally, we consider a counterexample that illustrates some of the subtleties of identifiability. We noted that projection replaces each trait block *Y* = Ω_{kl} with a symmetrized block residualFor purposes of computing covariances, we argued that symmetrization is unnecessary and avoiding it simultaneously yields correct covariances and reduces the number of parameters. We have not actually demonstrated that no further reduction is possible. Furthermore, exploiting the symmetrized version may lead to a residual Ω − *P*(Ω) that fails to be positive semidefinite. Consider the matrixStraightforward algebra leads to the positive semidefinite matrixIf we symmetrize each 2 × 2 block of *B*, then we getA tedious computation shows thatand *C* cannot be positive semidefinite.

## Acknowledgments

The authors are grateful to David Burke for access to the UM-HET3 data, to Karl Broman for his editorial interest and guidance, and to the anonymous reviewers for their helpful comments. This investigation was supported by U.S. Public Health Service grants MH59490, GM53275, T32-HG02536, and HL28481.

## Footnotes

Communicating editor: K. W. Broman

- Received May 14, 2008.
- Accepted September 5, 2008.

- Copyright © 2008 by the Genetics Society of America