Search
Author
Title
Vol.
Issue
Year
1st Page

Journal of Animal Science - Animal Genetics

Technical note: A successive over-relaxation preconditioner to solve mixed model equations for genetic evaluation1

 

This article in JAS

  1. Vol. 94 No. 11, p. 4530-4535
     
    Received: May 24, 2016
    Accepted: Aug 18, 2016
    Published: October 13, 2016


    2 Corresponding author(s): kmeyer@une.edu.au
    kmeyer.agbu@gmail.com
 View
 Download
 Share

doi:10.2527/jas.2016-0665
  1. K. Meyer 2a
  1. a Animal Genetics and Breeding Unit,3 University of New England, Armidale NSW 2351, Australia

Abstract

A computationally efficient preconditioned conjugate gradient algorithm with a symmetric successive over-relaxation (SSOR) preconditioner for the iterative solution of set mixed model equations is described. The potential computational savings of this approach are examined for an example of single-step genomic evaluation of Australian sheep. Results show that the SSOR preconditioner can substantially reduce the number of iterates required for solutions to converge compared with simpler preconditioners with marked reductions in overall computing time.



INTRODUCTION

Estimation of breeding values in genetic evaluation schemes for livestock requires solving a set of mixed model equations (MME). Often, the number of equations is too large for solutions to be “directly” obtained, that is, by inversion or triangular decomposition of the coefficient matrix in the MME, and iterative methods need to be used. Early applications tended to rely on Gauss–Seidel–type solution schemes, often with over-relaxation to improve convergence rates. Over the last 2 decades, however, the preconditioned conjugate gradient (PCG) algorithm has become the standard method to solve large systems of MME in animal breeding applications (e.g., Strandén and Lidauer, 1999; Tsuruta et al., 2001).

Convergence rates of the PCG algorithm depend on the distribution of eigenvalues of the coefficient matrix in the MME. This can be “improved”—that is, their spread and the condition number of the matrix can be reduced—by adequate choice of a preconditioning matrix. Loosely speaking, the closer this matrix resembles the coefficient matrix, the fewer iterations are required. In practice, however, this needs to be balanced with computational requirements—both to set up and to store the preconditioner and to apply it for each iterate—and many schemes, therefore, rely on simple, diagonal, or block-diagonal matrices (Strandén et al., 2002).

This paper describes an alternative preconditioner that has the same structure as the coefficient matrix, together with an efficient implementation in a PCG algorithm. We demonstrate, for an applied example, that it can substantially reduce the number of iterations and overall computing time required.

THE SYMMETRIC SUCCESSIVE OVER-RELAXATION PRECONDITIONER

Letdenote the system of MME to be solved, with A the coefficient matrix, x the vector of unknowns, and b the vector of right-hand sides. Equivalently, solutions for x can be obtained by solvinginstead, with M denoting the preconditioner. Typically, the matrix M is chosen so that it and the product M−1r are “easy” to calculate (with r denoting a vector) while reducing the condition number of M−1A (Benzi, 2002).

Saad (2003, Chapter 4, p. 105–132) showed that an iterative solution scheme with over-relaxation is equivalent to fixed point iteration on a preconditioned system. Decompose the coefficient matrix A intowith L = {aij} for j < i the matrix comprising the strictly lower triangle of A and D = Diag(A) = {aii} the matrix of diagonal elements. The preconditioner corresponding to a symmetric successive over-relaxation (SSOR) scheme iswith 0 < ω < 2 (Saad, 2003). The matrix MSSOR has the same structure (i.e., number and position of nonzero elements) as A and can be used as preconditioner with a PCG algorithm. Saad (2003) stated that the choice of ω for a preconditioner is not important and recommended a value of ω = 1. Each PCG iterate then requires evaluating the product of the inverse of M and a vector, MSSOR−1r = h. Fortunately, this can be obtained without even explicitly setting up MSSOR by exploiting the factorization into upper and lower triangular matrices. Moreover, as D + L in MSSOR (for ω = 1) is simply the lower triangle of A, there is no computational overhead to set up M or additional memory required to store it.

We refer to the process of obtaining direct solutions for a system of equations with lower or upper triangular coefficient matrix as a “triangular solve.” Solving MSSORh = r for h then involves 3 steps: 1) solve (D + ωL)t1 = r using a forward triangular solve, which gives t1 = D−1(D + ωL′)h; 2) calculate t2 = Dt1; and 3) apply a second, backward triangular solve to (D + ωL′)h = t2, to solve for h. However, although easy to implement, this implies a substantial number of multiplications that increase calculations required per iterate in a standard PCG algorithm to at least twice those needed for a diagonal preconditioner, that is, M = D.

IMPROVED SYMMETRIC SUCCESSIVE OVER-RELAXATION

Improved versions of a PCG algorithm using a SSOR preconditioner have been described by Han and Zhang (2011) and Li et al. (2013). These substantially reduce computations per iterate by recognizing that calculation of the 2 “expensive” matrix vector products, namely 1) of the coefficient matrix and a vector of directions, A d, and 2) of the inverse of the preconditioning matrix and a vector, MSSOR−1r, can be replaced. Instead, only the solutions for 2 triangular systems of equations are required.

We adopt the procedure of Li et al. (2013), which differs from that of Han and Zhang (2011) by an initial transformation of the system of equations. For D = Diag{aii}, define a transformed system of equationsor

Decomposing A* as above givesandwithand

Defining auxiliary vectors y = W−1r and z = W−Tr then gives the following PCG algorithm (adapted from Li et al., 2013):

  1. For a given vector of starting values for the unknowns, x*0, initializeand

  2. For the kth iterate, compute updated solutions

  3. Check for convergence. If the chosen criterion has not been met, update work vectorsand

  4. At convergence, calculate solutions on the original scale as

The major computations per iterate are the products of W−1 or W−T and a vector (in step 3). As W is triangular, these can again be obtained without inverting W using triangular solves.

Preconditioned conjugate gradient algorithms in animal breeding applications commonly involve a step resetting the search direction at regular intervals (e.g., Tsuruta et al., 2001) to reduce potential problems arising from the accumulation of rounding errors. This can be achieved in the scheme above by replacing the update of work vectors (step 3) for selected iterates with the corresponding calculations from the set-up phase (step 1).

APPLICATION

We demonstrate the utility of the SSOR preconditioner for the example considered by Meyer et al. (2015). In brief, this comprised a set of 11 traits considered in genetic evaluation of Australian meat sheep (generously made available from the Meat and Livestock Australia’s Sheep Genetics [Armidale, NSW Australia] database and the Cooperative Research Centre for Sheep Industry Innovation [Armidale, NSW Australia), with 5.28 million records on 1.77 million animals. Including parents without records, there were 1,995,755 animals, of which 10,944 were genotyped for 48,599 SNP. Genomic relationships were computed following Yang et al. (2010).

The model of analysis included contemporary groups as fixed effects and animals’ additive genetic effects and genetic groups (93 levels) as random effects for all traits. The latter were “explicitly” fitted, assigning proportions of membership for each animal. Genetic groups and animals’ additive genetic effects were fitted assuming the same covariance matrix. In addition, dams’ permanent environmental effects (653,068 levels) were fitted as random effects for 3 traits. This resulted in 24,161,124 equations in the mixed model. Analyses were performed using either the pedigree-based relationship matrix with 6,584,393 elements in its inverse, A−1 or combining pedigree and genomic information in a “single-step” model with 66,455,483 nonzero elements in the inverse of the combined relationship matrix, H−1 = A−1 + Δ (half stored), with Δ the adjustment for genomic relationships. Ordering rows so that genotyped animals are trailing, only the lower diagonal block of Δ is nonzero and dense, G−1 − A−122, with G and A22 representing the genomic and corresponding block of the pedigree based relationship matrix, respectively; see, for instance, the recent review of Legarra et al. (2014) for details. Furthermore, both the standard multivariate (MV) formulation and the equivalent model using a parameterization to principal components (PC; Meyer et al., 2015) were examined.

COMPUTING ENVIRONMENT AND STRATEGY

Mixed model equations were iteratively solved, using double precision computations in a PCG algorithm, as implemented in the single-step module of our mixed model package WOMBAT (Meyer, 2007).

Nonzero elements of the coefficient matrix (half stored) in the MME were held in core using a combination of sparse matrix and dense storage. Dense diagonal blocks were assigned for genetic groups, considering all levels and traits together, and genotyped animals. For the MV parameterization, the latter again used 1 large block for effects for all traits. Ordering equations for genotyped animals within PC, 11 separate diagonal blocks (of size equal to the number of genotyped animals) were used for the PC model, which substantially reduced the memory required. Sparse matrix storage for the remaining parts held diagonal elements in core and used compressed sparse row format otherwise.

Preconditioning schemes compared were a simple diagonal preconditioner (DIAG), a block-diagonal preconditioner (BLOX), and the improved SSOR scheme described above, using ω = 1. Solutions were deemed to have converged when α[d′d/(x*′ x*)]1/2 dropped below 10−7. Computations were implemented using single- and multithreaded versions of routines from the BLAS and sparse BLAS (Blackford et al., 2002) and LAPACK libraries (Anderson et al., 1999). Specifically, each PCG iterate for DIAG and BLOX required the product of the coefficient matrix and a vector, which was formed using routines DSYMV and MKL_DCSRSYMV. Multiple vector “inner” products (also known as “dot products”) needed were evaluated using function DDOT. Triangular solves for SSOR used routines DAXPY and DAXPYI and functions DDOT and DDOTI to parallelize computations within rows. Use of routines DTRSV and MKL_DCSRTRSV for the latter was disregarded, as it appeared to increase computing time required. For BLOX, the inverses of diagonal blocks were used as preconditioners, except for genotyped animals for MV analyses. This comprised a block for genetic group effects for all traits, separate diagonal blocks for genotyped animals (of dimension 10,944) for each PC, and diagonal blocks equal to the number of traits for which they were fitted for the remaining random effects levels. For MV analyses, a Cholesky decomposition of the dense diagonal block for all genotyped animals and traits together was performed and used as a preconditioner in a triangular solve. These calculations were performed using LAPACK routines DPOTRF, DPOTRI, and DPOTRS and BLAS routine DSYMV.

Computations were carried under Linux on a shared machine with 512 GB of random access memory (RAM) and 28 Intel Xeon CPU E5-2697 cores (Intel Corporation, Santa Clara, CA), rated at 2.6 GHz, with a cache size of 35 MB. BLAS and LAPACK routines used were loaded from the Intel Math Kernel Library 11.3.2 (Intel Corporation).


RESULTS

The number of iterates and computing time required for each of the 24 analyses are summarized in Table 1. The results clearly show the impact of the preconditioner used on the number of iterates required. Numbers for BLOX differ somewhat from those reported by Meyer et al. (2015) due to a slightly less stringent convergence criterion used compared with the earlier study and tweaks in the implementation since. Correlations between solutions for corresponding analyses were at least 0.99.


View Full Table | Close Full ViewTable 1.

Characteristics of the mixed model equations and computing requirements for a diagonal preconditioner (DIAG), a block-diagonal preconditioner (BLOX), and symmetric successive over-relaxation (SSOR) preconditioning schemes for single- and multithreaded computations

 
No. of iterates
Elapsed time, h
Threads Rel.1 Param.2 NNZ3 DIAG BLOX SSOR DIAG BLOX SSOR
Single A−1 MV 918 5,550 1,999 1,724 4.73 1.84 1.92
PC 1,377 2,784 2,109 1,095 3.10 2.57 1.81
H−1 MV 8,162 6,691 2,910 1,994 17.58 28.25 11.53
PC 2,035 3,921 2,984 1,320 5.04 4.11 2.36
Multiple A−1 MV 918 5,521 1,991 1,724 1.83 0.99 1.66
PC 1,377 2,803 2,107 1,095 1.57 1.43 1.64
H−1 MV 8,162 6,681 2,889 1,994 3.75 18.30 4.73
PC 2,035 3,929 2,965 1,320 2.24 2.31 2.06
1Rel. = relationship matrix; A−1 = pedigree base relationship matrix; H−1 = = relationship matrix for single step analysis combining pedigree and genomic information.
2Param. = parameterization; MV = standard multivariate; PC = principal components.
3NNZ = number of nonzero elements in 1 triangle of coefficient matrix (in millions).

For the MV parameterization, BLOX dramatically reduced the number of iterates required compared with DIAG. This was less pronounced for the PC model, suggesting that “decorrelating” effects in the transformation to PC scale already achieved a considerable part of the benefits otherwise obtained by simultaneously considering all traits when using BLOX. The SSOR preconditioner further reduced the number of iterates required in all cases, on average to about a third of the corresponding number for DIAG.

Results for computing times required did not quite match the differences in the number of iterates needed. This predominantly reflected differences in efficiency of implementation and, for multithreaded analyses, in scope for parallelization. Although Li et al. (2013) emphasized that computations required per iterate for the improved SSOR PCG algorithm would be similar to those in a standard (nonpreconditioned) conjugate gradient scheme, times per iterate for our example were higher throughout for SSOR than for DIAG. Nevertheless, for single processor analyses, the overall execution time was reduced by 35 to 60%. Parameterizing to PC was advantageous for all preconditioners considered.

With 28 processors available, a different pattern emerged for multithreaded analyses. Each iterate of DIAG and BLOX involved the product of the coefficient matrix and a vector that was amenable to parallel processing (implemented through highly optimized library routines). In contrast, SSOR required 2 triangular solves that were performed “in order” (of rows) and, therefore, offered far less opportunity for parallelization, restricting parallel execution to computations within rows. The latter was most effective for rows with dense parts corresponding to genotyped animals. As these represented only 0.5% of animals in our example, there was limited impact of simultaneously using multiple cores. Conversely, this is expected to increase with the number of genotyped animals in the analysis. Similarly, times for BLOX were substantially higher than for DIAG for the MV single-step model. This was due to the size of the dense diagonal block (120,384) and the fact that using its Cholesky factor (rather than inverse) as the preconditioner again involved a triangular solve. Repeating calculations for MV and H−1 using small diagonal blocks of size equal to the number of traits in the preconditioner for both genotyped and nongenotyped animals reduced execution times to 6.89 h (2,866 iterates) for single-threaded execution and 2.54 h (2,883 iterates) for multithreaded execution, respectively. Because exploiting the large RAM together with efficient subroutines for dense matrix multiplications available seemed sensible, this showed that relationships between genotyped animals taken into account in the preconditioner did not reduce the number of iterates required and, therefore, did not balance the associated large number of additional calculations.


DISCUSSION

A SSOR preconditioner for a PCG algorithm appears not to have been previously considered in the context of genetic evaluation. Applications for engineering problems have been described, for instance, by Chen et al. (2002), Han and Zhang (2011), Li et al. (2013), and Meng et al. (2016), with favorable reports on convergence rates achieved. Results demonstrate that, using the improved version of Li et al. (2013), the SSOR preconditioner can substantially reduce the computational requirements for iterative solution of MME. Moreover, it requires about the same memory as the diagonal preconditioner and is easier to implement than a block-diagonal scheme.

The SSOR PCG scheme described is most useful for applications where the MME can be held in core or where selected parts of the coefficient matrix can be strategically read from out-of-core memory. With large amounts of RAM available for modern hardware, this is feasible for many genetic evaluation schemes for small to moderately large populations. The SSOR preconditioner can also reduce computational requirements for maximum likelihood analyses to estimate variance components that use Monte Carlo techniques and require multiple solutions of the MME for each iterate, for example, Matilainen et al. (2012). The scheme is likely to be less beneficial for extremely large applications using an “iteration on data” type strategy, as multiple passes through the data would be needed in each iterate, which may quickly erode the computational savings afforded by the SSOR preconditioner per se.

However, in the era of multithreaded and highly parallel computing, the choice of algorithm needs to consider the hardware constellation targeted. Using all 28 processors available, there was little advantage of SSOR over DIAG in terms of overall execution time. Competitive performance of the diagonal preconditioner for parallel computing applications has been reported elsewhere (Pini and Gambolati, 1990). As outlined above, this was due to triangular solves required in each iterate, which limited the scope for parallel processing to operations within each row, paired with our fairly basic implementation. Studies examining PCG algorithms for massively parallel processors have, therefore, often take a different route which involves on approximation of the inverse, MSSOR−1, to avoid the need for ordered, triangular solution schemes albeit at the expense of additional memory requirements (e.g., Helfenstein and Koko, 2012). Improving parallelization of sparse triangular solves is an active area of research in a number of fields. Alternative proposals to boost parallel performance of triangular solves include identification of independent “levels” in the triangular matrix and appropriate scheduling of operations (Mayer, 2009) and use of iterative schemes (Anzt et al., 2015). Advances in this area and increasing availability of corresponding high-performance routines are likely to make the SSOR preconditioner more competitive for applications using more than a few cores.

IMPLICATIONS

The improved preconditioned gradient algorithm with the SSOR preconditioner described can substantially reduce computing times for iterative solution of MME in quantitative genetic applications. It is suitable as a “drop-in” replacement for existing methods for schemes explicitly setting up the MME and most advantageous for computing environments with modest amounts of parallelization.

 

References

Footnotes


Comments
Be the first to comment.



Please log in to post a comment.
*Society members, certified professionals, and authors are permitted to comment.