An R package for LD-based analysis of genotyping data
The design and execution of genome-wide association studies (GWASs) depend critically on detailed knowledge of the pattern of linkage disequilibrium (LD) in the human genome, as characterised by the HapMap project. A GWAS generates a list of variants, usually single-nucleotide polymorphisms (SNPs), ranked according to the significance of their association with the trait of interest. Downstream analyses generally focus on the gene(s) physically closest to these SNPs and fail to account for their LD with other SNPs.
Modern array-based platforms allow for the parallel genotyping of millions of SNPs. LD information is often taken into account when selecting the markers present on an array, for example so that regions of high LD can be interrogated with an optimally reduced set of markers. It can therefore be beneficial to account for LD in the subsequent analysis as well.
We have developed a flexible and user-friendly open-source R package, LDsnpR, which efficiently assigns SNPs to genes, or to user-defined bins, based both on their physical position and on their pairwise LD with other SNPs.
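The binning idea can be illustrated with a minimal sketch. It is written in Python for brevity and is not LDsnpR's API: the function name `assign_snps_to_genes`, the input layouts, and the `r2_min` threshold of 0.8 are all assumptions made for this example. It also assumes one plausible reading of LD-based binning, namely that a genotyped SNP outside a gene is added to that gene's bin when it is in LD with a SNP located inside the gene.

```python
from collections import defaultdict


def assign_snps_to_genes(genes, snp_pos, ld_pairs, r2_min=0.8):
    """Assign SNPs to gene bins by position and, optionally, by pairwise LD.

    genes    : {gene_id: (chrom, start, end)}  (hypothetical layout)
    snp_pos  : {snp_id: (chrom, position)}     (the genotyped SNPs)
    ld_pairs : iterable of (snp_a, snp_b, r2)  (pairwise LD records)
    r2_min   : assumed LD threshold; pass ld_pairs=[] for position-only bins
    """
    # Index LD partners that meet the threshold, in both directions.
    partners = defaultdict(set)
    for a, b, r2 in ld_pairs:
        if r2 >= r2_min:
            partners[a].add(b)
            partners[b].add(a)

    bins = {}
    for gene, (chrom, start, end) in genes.items():
        # Step 1: SNPs physically located inside the gene interval.
        # (A linear scan keeps the sketch short; an interval index is faster.)
        inside = {
            snp for snp, (c, pos) in snp_pos.items()
            if c == chrom and start <= pos <= end
        }
        # Step 2: genotyped SNPs in LD with any SNP from step 1.
        by_ld = set()
        for snp in inside:
            by_ld |= {p for p in partners.get(snp, ()) if p in snp_pos}
        bins[gene] = inside | by_ld
    return bins


if __name__ == "__main__":
    genes = {"GENE_A": ("1", 100, 200)}
    snp_pos = {"rs1": ("1", 150), "rs2": ("1", 250), "rs3": ("1", 400)}
    ld_pairs = [("rs1", "rs2", 0.9), ("rs1", "rs3", 0.3)]
    print(assign_snps_to_genes(genes, snp_pos, ld_pairs))
```

In this toy input, rs2 lies outside GENE_A but joins its bin through its LD with rs1, whereas rs3 falls below the threshold and is left out.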
LDsnpR computes gene-wise scores from the combined p-values of all markers assigned to each gene bin, with or without LD binning. It provides numerous summary statistics for computing joint p-values, and these can be extended with user-defined functions. The package accepts the file formats of the widely used PLINK software as input and can also write SNP identifiers in PLINK set format for subsequent analysis. LD calculations are based on the four initial HapMap populations; the population code used for LD binning is a parameter of the analysis.
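As an illustration of a gene-wise score, the sketch below combines the p-values in each bin with Fisher's method, one common choice of joint statistic. It is a Python sketch under stated assumptions, not the package's implementation: the names `fisher_combined` and `gene_scores` and the `combine` argument are hypothetical, and the source does not say which statistics LDsnpR ships. Note that Fisher's method assumes independent tests, so for SNPs in LD the combined p-value tends to be too small.

```python
import math


def fisher_combined(pvalues):
    """Fisher's method: X = -2 * sum(ln p) follows chi-square with 2k df.

    For even degrees of freedom the upper tail has a closed form:
    P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    pvalues = list(pvalues)
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    half_x = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, len(pvalues)):
        term *= half_x / i  # (x/2)^i / i!, built up incrementally
        total += term
    return min(1.0, math.exp(-half_x) * total)


def gene_scores(bins, snp_pvalues, combine=fisher_combined):
    """Score each gene bin with a pluggable combining function.

    bins        : {gene_id: set of snp_ids}
    snp_pvalues : {snp_id: p-value}
    combine     : any function mapping a list of p-values to one score
    Genes without any scored SNP get None.
    """
    scores = {}
    for gene, snps in bins.items():
        ps = [snp_pvalues[s] for s in sorted(snps) if s in snp_pvalues]
        scores[gene] = combine(ps) if ps else None
    return scores


if __name__ == "__main__":
    bins = {"GENE_A": {"rs1", "rs2"}, "GENE_B": set()}
    pvals = {"rs1": 0.01, "rs2": 0.5}
    print(gene_scores(bins, pvals))               # Fisher's method
    print(gene_scores(bins, pvals, combine=min))  # user-defined: smallest p-value
```

Passing a different `combine` function, such as `min` above, mirrors the idea of extending the set of summary statistics with user-defined functions.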
The method is implemented in the R language and uses the IRanges package from Bioconductor for fast interval overlapping. To facilitate efficient storage and retrieval of LD data, we have converted the compressed text files provided by HapMap into a compressed HDF5 binary format. This allows us to provide LD data for all chromosomes and four populations in a single file of only 4.9 GB (about 20% of the original size). We have also derived a version of the hdf5 interface for R that supports partial loading of HDF5 data sets. As a result, LD-based binning of 1 million SNPs into all human Ensembl genes can be carried out with less than 2 GB of memory.
Thanks to its efficiency and maturity, the HDF5 format, maintained by The HDF Group, lends itself to storing LD data on a much larger scale. Once the 1000 Genomes Project has released data on linkage disequilibrium, we hope to convert these to HDF5 format, contribute the converted data, and integrate them into our analysis workflow.