Projects

Biostatistical analyses of population level data for the 14th IHIWS

Glenys Thomson, Richard Single, Diogo Meyer, Alex Lancaster

   We have implemented an analysis package (PyPop) for comprehensive analyses of MHC multi-locus population genetic variation. We applied this package in analyses of data from the 13th IHW for the following projects: Anthropology/Human Diversity, HLA and Disease, and Hematopoietic Cell Transplantation (HCT). The PyPop program is available for analyses of data from the 14th IHIWS.

  PyPop is built on an object-oriented framework, so individual analysis modules can be added or removed without affecting the other modules. The following PyPop modules are in place:
1. population summary data (labcode, typing method, ethnic group, continent of origin, collection site, latitude, longitude, total sample size, number of loci typed),
2. locus specific information (locus name, sample size, number of distinct alleles observed, counts and frequencies of all alleles observed in the sample (ordered by frequency, and also by allele name)),
3. Hardy-Weinberg (HW) testing using:
  (a) the chi-square test statistic, with genotypes whose expected count is less than 5 combined into a single lumped class (observed counts, expected counts under HW, chi-square value, degrees of freedom, and p values are given for the following categories when appropriate: all common genotypes, lumped genotypes, all common genotypes plus lumped, all homozygotes, all heterozygotes, heterozygotes by individual allele, i.e., A1/X where X represents all non-A1 alleles, and each individual homozygote and heterozygote genotype), and
  (b) the exact overall test of Guo and Thompson,
4. Ewens-Watterson homozygosity test of neutrality (the observed homozygosity under HW for the sample, which reflects the allele frequency distribution for the given sample size, n, and observed number of alleles, k; the expected homozygosity under neutrality for the same n and k values; and a p value indicating fit to neutrality, or lack of fit in the direction of balancing selection or directional selection),
5. haplotype frequency estimation using the Expectation-Maximization (EM) algorithm, for all pairwise locus combinations and for specified combinations of three, four, or more loci,
6. individual allele pair and locus measures of linkage disequilibrium (LD) and significance testing of LD using a likelihood approach.
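
As a rough illustration of module 3(a), the following Python sketch computes a chi-square statistic against Hardy-Weinberg expectations and pools genotypes with an expected count below 5 into one lumped class. It is not PyPop's code: the function name `hw_chi_square`, its arguments, and the input layout are assumptions made for this example, and degrees of freedom and p values are deliberately left out.

```python
from collections import Counter
from itertools import combinations_with_replacement


def hw_chi_square(genotype_counts, min_expected=5.0):
    """Chi-square statistic for HW proportions at a single locus.

    genotype_counts maps an unordered allele pair, e.g. ("A1", "A2"),
    to the number of individuals carrying that genotype.
    (Hypothetical helper for illustration; not part of PyPop.)
    """
    # Treat ("A2", "A1") and ("A1", "A2") as the same genotype.
    observed = Counter()
    for (a, b), n in genotype_counts.items():
        observed[tuple(sorted((a, b)))] += n
    n_individuals = sum(observed.values())

    # Allele frequencies by gene counting (a homozygote contributes twice).
    allele_counts = Counter()
    for (a, b), n in observed.items():
        allele_counts[a] += n
        allele_counts[b] += n
    freqs = {a: c / (2 * n_individuals) for a, c in allele_counts.items()}

    chi_sq = 0.0
    n_common = 0
    lumped_obs = 0.0
    lumped_exp = 0.0
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        prob = freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
        expected = prob * n_individuals
        obs = observed.get((a, b), 0)
        if expected < min_expected:
            # Rare genotype: pool it into the lumped class.
            lumped_obs += obs
            lumped_exp += expected
        else:
            chi_sq += (obs - expected) ** 2 / expected
            n_common += 1
    if lumped_exp > 0:
        chi_sq += (lumped_obs - lumped_exp) ** 2 / lumped_exp
    return chi_sq, n_common, lumped_obs, lumped_exp


if __name__ == "__main__":
    counts = {("A1", "A1"): 40, ("A1", "A2"): 20, ("A2", "A2"): 40}
    print(hw_chi_square(counts))
```

The function returns the statistic for all common genotypes plus the lumped class, together with the number of common genotype classes and the observed and expected totals of the lumped class, so the pooling can be inspected.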

There are two “standalone” binary versions of PyPop, which will be released for general use early in 2004:
(a) a Windows version (tested on Windows 98, XP), and
(b) a Linux version (tested on Red Hat 9, Slackware 9.1)
These standalone binary packages install directly on these operating systems, so users do not need to install any third-party packages.

Comprehensive documentation has been written in DocBook, which can be used to generate documentation in HTML, PDF, and plain text formats. The documentation comprises:
(a) PyPop User Guide (including an installation guide for Windows and Linux, a guide to using PyPop, and a guide to interpreting PyPop output)
(b) PyPop Reference Guide (which includes a detailed description of methods used by the program and other background and reference information)

A new package of population genetic analysis programs was needed to handle the high level of HLA polymorphism and, in some cases, large sample sizes. Further, we wanted the code to be open source. A primary feature of the package is that it allows integration of statistics across large numbers of data sets. The output of the analyses is stored in XML (eXtensible Markup Language) format. These output files can then be converted using standard tools into many other data formats suitable for machine consumption (such as input for the statistical package "R"), or into plain text or HTML format.
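
The conversion of XML output into other formats can be sketched with Python's standard library. The XML layout below (the `dataanalysis`, `locus`, and `allele` elements and their attributes) is invented for illustration and should not be taken as PyPop's actual output schema; the function names are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL layout, invented for this example; not PyPop's real schema.
SAMPLE_XML = """<dataanalysis>
  <locus name="A">
    <allele name="0101" count="12"/>
    <allele name="0201" count="28"/>
  </locus>
  <locus name="B">
    <allele name="0702" count="40"/>
  </locus>
</dataanalysis>"""


def allele_table(xml_text):
    """Return (locus, allele, count, frequency) rows from the XML text."""
    root = ET.fromstring(xml_text)
    rows = []
    for locus in root.iter("locus"):
        alleles = locus.findall("allele")
        total = sum(int(a.get("count")) for a in alleles)
        for a in alleles:
            count = int(a.get("count"))
            rows.append((locus.get("name"), a.get("name"), count, count / total))
    return rows


def to_tsv(rows):
    """Format rows as tab-separated text, e.g. for loading into R."""
    lines = ["locus\tallele\tcount\tfrequency"]
    lines += [f"{loc}\t{al}\t{c}\t{f:.4f}" for loc, al, c, f in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_tsv(allele_table(SAMPLE_XML)))
```

A tab-separated table of this kind can be read by most statistical packages; the same rows could equally be written out as HTML or plain text.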
When PyPop analyses are applied to patient data, whether from case/control or family-based studies, some results must be interpreted with care. For example, the haplotype frequency estimation program assumes the data are in Hardy-Weinberg proportions; this assumption may not hold in patient data (nor in control and general population data if there is significant admixture, etc.).
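
To show where the Hardy-Weinberg assumption enters, here is a minimal two-locus EM sketch in Python. It is an illustration under stated assumptions, not PyPop's implementation: the function names, the uniform starting frequencies, and the stopping rule are all choices made for this example. The E-step weights each possible phasing by the product of its two haplotype frequencies, which is only valid if haplotypes pair at random.

```python
from collections import defaultdict


def phasings(genotype):
    """All haplotype pairs consistent with an unphased two-locus genotype."""
    (a1, a2), (b1, b2) = genotype
    pairs = [((a1, b1), (a2, b2))]
    if a1 != a2 and b1 != b2:
        # Double heterozygote: the alternative phase is also possible.
        pairs.append(((a1, b2), (a2, b1)))
    return pairs


def em_haplotype_freqs(genotypes, n_iter=200, tol=1e-10):
    """Estimate two-locus haplotype frequencies by EM (illustrative only)."""
    options = [phasings(g) for g in genotypes]
    haps = {h for opts in options for pair in opts for h in pair}
    freqs = {h: 1.0 / len(haps) for h in haps}  # assumed uniform start
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for opts in options:
            # E-step: under HW, P(pair) is proportional to f(h1) * f(h2).
            weights = [freqs[h1] * freqs[h2] for h1, h2 in opts]
            total = sum(weights)
            for (h1, h2), w in zip(opts, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts become the new frequencies.
        new = {h: counts[h] / n_chrom for h in haps}
        delta = max(abs(new[h] - freqs[h]) for h in haps)
        freqs = new
        if delta < tol:
            break
    return freqs


if __name__ == "__main__":
    data = (
        [(("A1", "A1"), ("B1", "B1"))] * 3
        + [(("A2", "A2"), ("B2", "B2"))]
        + [(("A1", "A2"), ("B1", "B2"))]
    )
    for hap, f in sorted(em_haplotype_freqs(data).items()):
        print(hap, round(f, 4))
```

If genotypes depart from Hardy-Weinberg proportions, for example through admixture or ascertainment of patients, the phasing weights in the E-step are misspecified and the estimated frequencies should be treated with corresponding caution.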

For more information on PyPop, go to:
http://allele5.biol.berkeley.edu/pypop/
For details on the statistical tests, or for help running your data, contact
Glenys Thomson (glenys@berkeley.edu).
