# Predicting hybrid performance in rice using genomic best linear unbiased prediction

See allHide authors and affiliations

Contributed by Qifa Zhang, July 23, 2014 (sent for review May 15, 2014)

## Significance

Genomic prediction is a new field of quantitative genetics. Individual performance can be predicted using genome-wide markers before the phenotype is measured. Genomic prediction for hybrid performance is even more promising because genotype of a hybrid is predetermined by the parents. We propose a genomic best linear unbiased prediction to predict hybrid performance of rice, and incorporate dominance and epistasis into the prediction model. Simulation studies showed that predictability can be further improved after incorporating dominance and epistasis into the model. The new strategy of marker-guided hybrid prediction is called genomic hybrid breeding, and it represents a new technology that may potentially revolutionize hybrid breeding in agriculture.

## Abstract

Genomic selection is an upgrading form of marker-assisted selection for quantitative traits, and it differs from the traditional marker-assisted selection in that markers in the entire genome are used to predict genetic values and the QTL detection step is skipped. Genomic selection holds the promise to be more efficient than the traditional marker-assisted selection for traits controlled by polygenes. Genomic selection for pure breed improvement is based on marker information and thus leads to cost-saving due to early selection before phenotypes are measured. When applied to hybrid breeding, genomic selection is anticipated to be even more efficient because genotypes of hybrids are predetermined by their inbred parents. Hybrid breeding has been an important tool to increase crop productivity. Here we proposed and applied an advanced method to predict hybrid performance, in which a subset of all potential hybrids is used as a training sample to predict trait values of all potential hybrids. The method is called genomic best linear unbiased prediction. The technology applied to hybrids is called genomic hybrid breeding. We used 278 randomly selected hybrids derived from 210 recombinant inbred lines of rice as a training sample and predicted all 21,945 potential hybrids. The average yield of top 100 selection shows a 16% increase compared with the average yield of all potential hybrids. The new strategy of marker-guided prediction of hybrid yields serves as a proof of concept for a new technology that may potentially revolutionize hybrid breeding.

The mission of plant breeding is to develop high-yield varieties to increase crop productivity to meet the need of human population. Hybrid breeding has been proved to be an important tool to improve yield. The most successful examples are hybrid maize and rice, which have greatly increased the global food security. Despite the successes of hybrid breeding programs, selection of desirable hybrids has largely been a practice of trial and error in the past. It takes much luck to find desired matches between selected parents. The biggest challenge in hybrid breeding is how to predict the performance of future crosses based on existing data. Large efforts have been made in the past to develop methods for hybrid prediction, with the goal to facilitate hybrid breeding by obtaining better hybrids with fewer crosses. A common approach in the early days was to find correlations between marker polymorphisms and hybrid performance in crosses involving diverse germplasms. Extensive studies in corn and rice using this approach have produced variable results depending on the germplasms used in the studies (1).

Bernardo (2) applied best linear unbiased prediction technology to predict hybrid corn. He used existing hybrids and the pedigree relationship between them and untested hybrids to make prediction. Recent development in genomic research has greatly increased the availability of molecular makers to easily cover the entire genome, which are used to calculate the relationship matrix, leading to a method called genomic best linear unbiased prediction (GBLUP) (3). This method has been used to predict heterotic traits in maize hybrids (4). However, no attempt has been made to incorporate nonadditive effects into the prediction models.

Genomic selection aims to use whole-genome markers to predict future individuals, and it differs from traditional predictions in that the marker-detection step is skipped; instead, all markers are used to predict genomic values. Theoretically, when the number of markers is larger than the sample size, there is no unique estimation of effects, but the total genomic value remains estimable. Therefore, genomic prediction does not require accurate estimates of effects; it concerns the predictability resulted from the combination of all markers and their collective effects. Such genomic prediction has been applied to many agricultural species, including dairy cattle (5), crops (6), mice (7), and even humans (8). However, this approach has not been used to predict hybrid performance, an unexplored area that has a great potential to significantly improve efficiency of hybrid breeding.

Current methods for genomic prediction include Bayes B (9), empirical Bayes (10), least absolute shrinkage selection operator (LASSO) (11), and GBLUP (3). The first three procedures are classified into a category called selective shrinkage. Although simulation studies repeatedly showed that selective shrinkage is superior over GBLUP (12), experimental studies using cross-validation often showed similar performance (13), and GBLUP can be superior if a trait is controlled largely by polygenes, which implies that GBLUP may be more robust than the selective shrinkage methods. These genomic selection tools are mainly based on the additive model. Incorporating nonadditive variances is the next step in genomic prediction, but little study has been done so far. Incorporating dominance has been shown to be effective (14), but the benefit of incorporating epistasis has never been correctly demonstrated. One goal of this study is to investigate the effect of nonadditive variances on the efficiency of genomic prediction for hybrid performance.

## Results

### Predicting Hybrid Performance Using GBLUP.

Results of the restricted maximum likelihood (REML) analysis under the additive model are summarized in Table 1. The narrow sense heritability, defined as the ratio of the additive variance to the phenotypic variance, ranges from 0.38 for yield to 0.84 for 1,000 grain weight (KGW). The goodness of fit (squared correlation between observed and predicted trait values) varies from 0.51 for yield to 0.89 for KGW, which are all relatively high. The predictability drawn from fivefold cross-validation is much lower than the goodness of fit. Yield has the lowest predictability (0.13), and KGW has the highest predictability (0.68). The other two traits have predictabilities between the two. This analysis shows that genomic selection will be effective for all traits, especially for KGW. The difference between goodness of fit and predictability will be explained elsewhere in the text.

### Comparison of GBLUP with LASSO and SSVS.

The estimated marker effects (including the intercepts) of GBLUP for all traits are given in Dataset S1, where the results of the two competing methods are stored in separate sheets named LASSO and stochastic search variable selection (SSVS), respectively. The predictabilities drawn from fivefold cross-validation are listed in Table 2 along with the predictability from GBLUP. Results of the fivefold cross-validation depend on the partitioning of the sample. We randomly partitioned the sample into five parts of equal size and repeated the partitioning 20 times. Table 2 gives the average predictability for each trait under each method. The LASSO method barely outperformed GBLUP. The SSVS method is the worst one, especially for trait yield, which is 0.0943, compared with 0.1264 for GBLUP and 0.1601 for LASSO. In general, the three methods produced similar results for three of the four traits.

### GBLUP Incorporating Epistasis.

The immortalized F_{2} (IMF2) population is unique in the sense that we can incorporate dominance and all four types of epistatic variances into the model. There are six genetic variance components, which are additive (a), dominance (d), additive by additive (aa), dominance by dominance (dd), additive by dominance (ad), and dominance by additive (da) variance components. The estimated variance components are given in Table 3, where ad and da are combined into a single composite term named (ad). Yield (YIELD) and tiller number (TILLER) are largely controlled by the (ad) interactions. Additive variance plays a major role for grain number (GRAIN) and KGW. None of the traits are controlled by the dominance variance. We also evaluated six different models, designated models 1–6, where the model number also represents the model size (number of genetic variances included in the model). For example, model 4 means that the model contains four genetic variance components, *a*, *d*, *aa*, and *dd*. Model 6 means that all six genetic variances are included (see Table S1 for model definitions).

Fig. 1 shows the goodness of fit (*Upper*) and the predictability (*Lower*) plotted against the model size. Both goodness of fit and predictability were expressed as the squared correlation between observed and predicted phenotypes, but the predicted phenotypes were calculated using different approaches. For the goodness of fit, individuals predicted were also used to estimate parameters. For the predictability, the predicted values were drawn from fivefold cross-validation where individuals predicted did not contribute to parameter estimation. As the model size grows, the goodness of fit also grows until it reaches perfect fit when all six genetic variances are included in the model. To our surprise, the predictability does not show noticeable change as the model grows. The conclusion is that adding dominance and epistatic variances did not help genomic prediction. The estimated variance components for all traits are given in Table S2 for all six models. The lack of improvement is due to the large SEs of the estimated variances (Figs. S1 and S2 and Table S3) and the high correlation between different estimated variance components (Table S4). Large sample sizes are required to demonstrate the benefit of adding epistatic variances.

### Simulation Study on Prediction Under Epistasis.

We performed a simulation study to demonstrate the effects of sample size and model size on model predictability. A hypothetic trait was simulated with equal values for all variance components (six genetic variances and a residual variance). We took the genotypes of *n* randomly selected hybrids (of 21,945 potential hybrids) as the true genotypes, where *n* ranged from 200 to 1,000 incremented by 100. The results are illustrated in Fig. 2. The goodness of fit started at ∼60% (additive variance only) and reached 100% for the full model (all six variance components) under all sample sizes. Small samples sizes tend to have a higher goodness of fit (Fig. 2, *Upper*). The predictability (Fig. 2, *Lower*) shows that adding dominance has increased the predictability under all sample sizes, but no further improvement are observed when the sample sizes are below *n* = 500. When *n* > 500, the predictability has progressively increased as the model size grows. For large sample sizes, there is benefit in prediction by including epistatic variances in the model. Our simulation experiments demonstrated the benefit of adding dominance for all sample sizes, but the real data analysis did not support this because we actually simulated dominance in the experiment. However, in the real experiment, dominance may be absent or very small. Further simulation study showed that it is safe to include dominance and epistatic variances in the model, even if the trait is only controlled by additive variance.

### Prediction of Genomic Values for Future Crosses.

The 278 hybrids analyzed in this experiment are a random sample of all potential crosses, where 210 is the number of recombinant inbred lines (RILs) that initiated the current hybrid population. From the available RILs, we deduced the genotypes of all potential hybrids. We now try to predict the phenotypic values of the 21,667 remaining hybrids using the genetic parameters given in Table 1 under the additive model. The extended kinship matrix for all of the 21,945 hybrids is partitioned as follows:

We then sorted the predicted phenotypic values in descending order and calculated the running average. For example, if we choose the top 100 crosses, the mean predicted breeding value of the top 100 crosses will be 50.5589 ± 0.23034 for yield. The average predicted genomic value of the entire hybrid population for yield is 43.6152. With genomic selection of the top 100 crosses, we expect to gain

Among the 21,667 potential crosses, we field evaluated 105 crosses in year 2012. These crosses were not included in the training sample, but their trait values have been predicted from the training sample. We calculated the squared correlation between the predicted and the observed trait values for the 105 crosses. This squared correlation is the actual predictability of our model under the assumption of no G×E interaction. The predictabilities are given in Table 4 for the four traits using the three competing methods. YIELD and TILLER have lost their predictability due to G×E interaction. The predictabilities for GRAIN and KGW remain relatively high, although both are lower than the cross-validation generated predictabilities due to possible G×E interaction. Recall that our training sample was collected in years 1998 and 1999, but the 105 additional crosses were collected in year 2012, which experienced an unusually high temperature. For traits with strong G×E interaction, such as YIELD and TILLER, our model can fail to predict the genomic values. However, heritable traits with less G×E interaction, such as GRAIN and KGW, are highly predictable.

## Discussion

The study demonstrated the application of advanced technologies, including genotype by sequencing and new statistical models to hybrid prediction. We used 278 existing hybrids derived from 210 RIL parents to predict the genomic values of all 21,945 potential hybrids for yield component traits in rice. The predicted best-performing hybrids can be generated from the original RIL parents. Genotypes of the future hybrids are not measured but determined from their inbred parents. The top crosses can be used immediately and converted to high-performing hybrids. What is the optimal proportion of the top crosses that should be selected for hybrid breeding? Two factors should be considered: one is the estimation error of the average performance of the selected top crosses. From Fig. 3, it is obvious that we should not select only the top five crosses because the average predicted value has a large prediction error. We need to keep at least 10 top crosses to reduce the prediction error. The other point to consider is that the genetic diversity of the top crosses tends to be narrow relative to that of the entire hybrid population. To maintain a high genetic diversity, one should select as many top crosses as possible, but the average predicted genomic value should remain high. For example, if we select the top 100 crosses (of 21,945 hybrids) for yield, the gain would be

One special feature of this experiment was that all of the RILs are from the same two parents, and this will limit the inference space of the results, i.e., the estimated parameters cannot be used to predict crosses of RILs that are not derived from the two parents. However, these types of crosses (IMF2) represent the best scenario to capture genetic variation of the parents because the F_{2}-like genotypes provide the largest possible genetic variation. In addition, the two inbred lines that initiated the IMF2 were carefully selected to represent the best germplasms in hybrid rice breeding; they are the parents of Shanyou 63, the most popular rice hybrid in China and other Asian countries. Moreover, many widely used hybrids have one or the other line as their parents. Furthermore, most of the parental lines now commonly used in hybrid rice breeding programs have parentage of the two lines. Therefore, the result obtained here is important in its own rights.

Our cross-validation analysis showed that incorporating nonadditive variances did not show improvement in prediction, which may give the impression that we are not taking advantage of special combining ability and using only the general combining ability to predict heterosis. In fact, we are predicting hybrid performance, not heterosis, which is defined as the difference of the hybrid performance from the midparent performance. It is also important to emphasize that the six genetic kinship matrices are highly correlated. Therefore, the additive kinship matrix may already capture much information about the other kinship matrices. We need a very large sample size to well-separate the six variance components due to the multicollinearity of the kinship matrices.

For general application to a broader range of germplasms, imagine that if a half-diallel cross is to be conducted from 1,000 varieties, there would be

We also compared the performance of three different statistical methods to ensure that there are no artifacts caused by human errors in any single method. Three methods produced very similar results. Theoretically, selective shrinkage methods (LASSO and SSVS) perform better for traits controlled by a few large QTL, whereas GBLUP performs better for traits controlled under the infinitesimal model. GBLUP is more robust than the other methods because it does not depend on estimated marker effects, and has an additional advantage of being able to incorporate epistatic variances. In real-life experiments, any one of the three methods may be used under the additive and dominance model. If software packages are available, all methods should be tried to cross-confirm the results.

We reported the results of hybrid prediction under the main effect model, models incorporating dominance and epistatic effects were also investigated (see *SI Text* for methods and results incorporating epistasis). We originally hoped to demonstrate some improvement of the epistatic model over the additive model. To our surprise, there was no noticeable improvement; this does not disqualify the epistatic model because our training sample size is not large enough. The fact that the simple additive model performed equally well as the nonadditive models does not mean that nonadditive variances are not important to these traits. These additional variances are mostly captured by the additive variance because of the high correlations among the different types of kinship matrices (Table S4). Increasing sample size may not necessarily help to decrease the correlations of the kinship matrices but will reduce the estimation errors of different variance components, which in turn will improve prediction.

## Materials and Methods

The rice population was constructed by randomly dividing 240 RILs derived from a cross between two *indica* rice, Zhenshan 97 and Minghui 63, into two groups and pairing lines in the two groups at random to create 120 crosses (15). Two additional rounds of crossing resulted in 360 crosses, IMF2. The two inbred lines that initiated the IMF2 were carefully selected to represent the best germplasms in hybrid rice breeding.

### Field Data and Genotyping of the Population.

Field data of yield (YIELD), number of tillers per plant (TILLER), number of grains per panicle (GRAIN), and 1,000 grain weight (KGW) for the IMF2 population and the RILS were collected in the 1998 and 1999 rice-growing seasons from replicated field trials on the experimental farm of Huazhong Agricultural University (15). The RILs were genotyped using next-generation sequencing (16). More than 250,000 high-density SNP markers were obtained to infer recombination breakpoints (crossovers) and then construct bins. The 1,619 bins were treated as markers, and the genotypes of the hybrids in the IMF2 were deduced based on the bin genotypes of the RILs. Only 278 of the 360 crosses were available in both phenotypes and bin genotypes. For each trait, there were two temporal replications (years 1998 and 1999). The phenotypic values of the two replicates were pooled for each cross after removing the year effect using

### Statistical Methods.

Three methods were used to predict hybrid performance: GBLUP (3), LASSO (11), and SSVS (17). The LASSO and SSVS methods are well known in statistics and in the genomic selection community, and the GBLUP method is not as familiar to the plant-breeding community. In addition, we adopted an efficient algorithm to perform variance component analysis for GBLUP. Therefore, GBLUP will be described in more detail than the other two methods.

#### Mixed model.

Let *y* be an *j* at locus *k* by a variable *Z* with *m* markers is

where *n* individuals for marker *k*, *k*, and *y* is

where *U* and *D*, are obtained using marker information only before the REML analysis. With this algorithm, the variance matrix is rewritten as

The model goodness of fit is expressed as the squared correlation coefficient between the observed (*y*) and the predicted (

#### Genomic best linear unbiased prediction.

Let us define **2** by

where

In the future when

Alternatively, estimated polygenic effects can be converted into estimated marker effects using the following equation, *SI Text*.

#### LASSO and SSVS.

Two additional models were compared with the GBLUP method, which are the LASSO method and the SSVS method; the latter is also called Bayes B (9) and is the very first method for genomic selection. Both methods use the model given in Eq. **2**, i.e., they directly estimate marker effects in the training sample and predict the genomic values for individuals in the testing sample. The LASSO method minimizes a penalized sum of squares and was implemented using the GlmNet/R program (18) in this study. The SSVS method is a Markov chain Monte Carlo (MCMC) sampling-based method; it assumes that each marker effect has a mixture of two normal distributions, described as

The three methods (GBLUP, LASSO, and SSVS) were compared using the predictability drawn from fivefold cross-validation, in which four parts of the sample were used to estimate parameters for prediction of the phenotypic values in the remaining part of the sample. Eventually, each individual was predicted once and used four times to estimate parameters. The squared Pearson correlation coefficient between the observed and the predicted phenotypic values is a measure of the predictability.

#### Incorporation of nonadditive variances.

In addition to *k*. The polygenic effect is now partitioned into six polygenic components,

where

Note that *k* and *y* is

is the genetic covariance matrix, in which the

where

## Acknowledgments

This project was supported by US Department of Agriculture National Institute of Food and Agriculture Grant 2007-02784 (to S.X.), National Natural Science Foundation Grant 31330039, and 111 Project of China Grant B07041 (to Q.Z.).

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: qifazh{at}mail.hzau.edu.cn.

Author contributions: S.X. and Q.Z. designed research; S.X., D.Z., and Q.Z. performed research; S.X. analyzed data; and S.X. and Q.Z. wrote the paper.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1413750111/-/DCSupplemental.

## References

- ↵
- Zhang Q,
- et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Legarra A,
- Robert-Granié C,
- Manfredi E,
- Elsen JM

- ↵
- ↵
- Meuwissen THE,
- Hayes BJ,
- Goddard ME

- ↵
- ↵
- Tibshirani R

- ↵
- ↵
- Resende MF Jr.,
- et al.

- ↵
- Vitezica ZG,
- Varona L,
- Legarra A

- ↵
- Hua J,
- et al.

- ↵
- Xie W,
- et al.

- ↵
- Yi N,
- George V,
- Allison DB

- ↵
- ↵
- Xu S

## Citation Manager Formats

## Article Classifications

- Biological Sciences
- Genetics