Westlake News ACADEMICS

Westlake-led Team Developed the World’s First Generalized Association Tool Applicable to Biobank-scale Data with Millions of Individuals


30, 2021

PRESS INQUIRIES Chi ZHANG
Email: zhangchi@westlake.edu.cn
Phone: +86-(0)571-86886861
Office of Public Affairs

On Nov 4, 2021, a research article entitled “A generalized linear mixed model association tool for biobank-scale data” from a team led by Prof Jian Yang at the School of Life Sciences at Westlake University was published online in Nature Genetics. The team developed an ultra-efficient generalized linear mixed model (GLMM)-based association analysis tool, named fastGWA-GLMM, for genome-wide association studies (GWASs) of binary traits (traits with only two possible values, e.g., disease status). fastGWA-GLMM and fastGWA (a tool developed earlier by the same team for continuous traits) are currently the only mixed model-based association tools that apply to large-scale data with millions of individuals.



Most human traits, such as behaviors, physiological characteristics and disease susceptibilities, are influenced by a large number of genetic variants, each with a small effect. GWAS is a widely used experimental design for detecting genetic variants associated with a disease of interest. It compares the genetic information of a large cohort of people, using statistical methods to scan the genome for variants associated with the disease, and thereby helps reveal genes and regulatory mechanisms related to disease occurrence and development.
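
To make the idea of "scanning the genome" concrete, the sketch below runs a simple allelic chi-square test on each variant of a toy case-control data set. It is only an illustration of a per-variant association scan. It is not the mixed model method used by fastGWA-GLMM, and it does not correct for confounding. The variant names and allele counts are hypothetical, and the threshold of 5e-8 is the conventional genome-wide significance level.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic (1 d.f.) for a 2x2 table of allele counts."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_sums)
    # A variant with an empty margin carries no information about association.
    if 0 in row_sums or 0 in col_sums:
        return 0.0
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat


def chi_square_p_value(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def scan(variants, threshold=5e-8):
    """Return (variant_id, p_value) for every variant passing the threshold."""
    hits = []
    for variant_id, counts in variants.items():
        p_value = chi_square_p_value(allelic_chi_square(*counts))
        if p_value < threshold:
            hits.append((variant_id, p_value))
    return hits


# Hypothetical allele counts: (case_alt, case_ref, control_alt, control_ref).
toy_variants = {
    "variant_A": (5200, 4800, 4000, 6000),  # alternate allele enriched in cases
    "variant_B": (3010, 6990, 3000, 7000),  # essentially no difference
}

for variant_id, p_value in scan(toy_variants):
    print(variant_id, p_value)
```

A real GWAS repeats a test of this kind across millions of variants, which is why per-variant cost dominates the running time at biobank scale.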


Mixed model-based association analysis tools are often preferred in GWAS because of their advantage in correcting for confounding factors that could bias the association analysis results. In recent years, the computational efficiency of mixed model-based tools has been challenged by data from cohorts with very large sample sizes, such as the UK Biobank (UKB). For example, a typical GWAS in the UKB involves analyzing data on >400,000 individuals, each with >10 million genetic variants.


Yang’s team proposed a sparse matrix-based model and a series of efficient computational algorithms that circumvent the limitations of mixed models in computing speed and memory usage, and used them to develop an ultra-efficient GLMM-based association analysis tool for binary traits, fastGWA-GLMM. They used real data to demonstrate the extremely high computational efficiency of fastGWA-GLMM, which far exceeds that of existing GLMM-based methods. For example, in the analysis of the UKB data, its computational efficiency can reach 36 times that of the most widely used GLMM-based association tool. In a simulated cohort of 2 million people (each with about 12 million genetic variants), a fastGWA-GLMM analysis can be completed in only 17 hours using 16 CPU cores and 32 GB of memory, whereas such an analysis is not achievable with other mixed model-based tools. fastGWA-GLMM's fast processing of large data sets is of great significance for the upcoming biobank-scale data with millions of samples.
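
A rough back-of-the-envelope sketch shows why a sparse matrix matters at this scale. It assumes, for illustration only, that the matrix in question holds one relatedness value per pair of individuals, that values are 4-byte floats, and that a sparse layout stores the diagonal plus one entry per related pair together with two 8-byte indices. The figure of one million related pairs is hypothetical and is not taken from the article.

```python
GIB = 2 ** 30  # bytes per gibibyte


def dense_matrix_gib(n_individuals, bytes_per_value=4):
    """Memory needed to hold a full n x n matrix of pairwise values."""
    return n_individuals * n_individuals * bytes_per_value / GIB


def sparse_matrix_gib(n_individuals, related_pairs,
                      bytes_per_value=4, bytes_per_index=8):
    """Memory for the diagonal plus one stored entry per related pair.

    Each stored entry keeps its value and a (row, column) index pair.
    """
    stored_entries = n_individuals + related_pairs
    return stored_entries * (bytes_per_value + 2 * bytes_per_index) / GIB


n = 2_000_000                      # cohort size from the simulation above
assumed_related_pairs = 1_000_000  # hypothetical value for illustration

print(f"dense:  {dense_matrix_gib(n):,.0f} GiB")
print(f"sparse: {sparse_matrix_gib(n, assumed_related_pairs):.3f} GiB")
```

Under these assumptions the dense matrix alone would need far more than the 32 GB of memory quoted for the simulated cohort, while the sparse layout fits comfortably.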


As an efficient and robust association analysis tool, fastGWA-GLMM can be applied to identify genetic variants for binary traits in all the large-scale biobanks. The team has used fastGWA-GLMM to analyze 2,989 binary traits in the UKB and shared all the association analysis results on their online data portal (http://fastgwa.info/ukbimpbin). Users can browse, retrieve, query, and download all the results on this platform without restrictions. Moreover, the method has been integrated into GCTA, the open-source software package developed by the team (https://yanglab.westlake.edu.cn/software/gcta). fastGWA and fastGWA-GLMM may become indispensable tools for genetic analysis of data from ultra-large biological repositories in the future, and they hold great potential for deciphering the genetic mysteries of complex human diseases.


Dr. Longda Jiang, a doctoral student from the University of Queensland (UQ) and visiting student at Westlake University (currently a postdoctoral fellow at the New York Genome Center), and Dr. Zhili Zheng from UQ are the co-first authors of this article, and Prof Jian Yang from Westlake University is the last author.