Projects
Biostatistical analyses of population level data for the 14th IHIWS
Glenys Thomson, Richard Single, Diogo Meyer, Alex Lancaster
We have implemented an analysis package (PyPop) for comprehensive
analyses of MHC multi-locus population genetic variation. We applied this
package in analyses of data from the 13th IHW for the following projects:
Anthropology/Human
Diversity, HLA and Disease, and Hematopoietic Cell Transplantation (HCT).
The PyPop program is available for analyses of data from the 14th IHIWS.
PyPop is built on an object-oriented framework in which individual analysis
modules can be added or removed without affecting the other modules.
The following PyPop modules are in place:
1. population summary data (labcode, typing method, ethnic group, continent
of origin, collection site, latitude, longitude, total sample size, number
of loci typed),
2. locus specific information (locus name, sample size, number of distinct
alleles observed, counts and frequencies of all alleles observed in the sample
(ordered by frequency, and also by allele name)),
3. Hardy Weinberg (HW) testing using:
(a) the chi-square test statistic, with genotypes whose expected count is less
than 5 pooled into a single lumped class (observed counts, counts expected under
HW, chi-square value, degrees of freedom, and p values are given for the following
categories when appropriate: all common genotypes, lumped genotypes, all common
genotypes plus lumped, all homozygotes, all heterozygotes, heterozygotes by
individual allele, i.e., A1/X where X represents all non-A1 alleles, and each
individual homozygote and heterozygote genotype), and
(b) the exact overall test of Guo and Thompson,
4. Ewens Watterson homozygosity test of neutrality (observed homozygosity under
HW for the sample, which reflects the allele frequency distribution for the
given sample size, n, and observed number of alleles, k, expected homozygosity
under neutrality for the same n and k values, p value indicating fit to neutrality
or lack of fit in the directions of balancing selection or directional selection),
5. haplotype frequency estimation using the Expectation-Maximization (EM) algorithm,
for all pairwise combinations, and specified 3, 4, etc. locus combinations,
6. individual allele pair and locus measures of linkage disequilibrium (LD)
and significance testing of LD using a likelihood approach.
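The lumping step in item 3(a) can be illustrated with a minimal Python sketch. This is not PyPop's own code: the function name `hw_chi_square`, its arguments, and its return layout are assumptions made for illustration. It estimates allele frequencies by gene counting, computes the genotype counts expected under HW, and pools every genotype class with an expected count below 5 into one lumped class. Degrees of freedom and p values are deliberately left out.

```python
from collections import Counter
from itertools import combinations_with_replacement


def hw_chi_square(genotypes, min_expected=5.0):
    """Chi-square HW statistic with rare genotype classes pooled (illustrative).

    genotypes: list of (allele, allele) tuples, one per individual.
    Returns (chi_square, common, lumped_observed, lumped_expected), where
    `common` maps each unpooled genotype to (observed, expected).
    """
    n = len(genotypes)
    # Allele frequencies by gene counting over the 2n allele copies.
    allele_counts = Counter(a for g in genotypes for a in g)
    freqs = {a: c / (2 * n) for a, c in allele_counts.items()}
    # Observed genotype counts, with alleles stored in sorted order.
    observed = Counter(tuple(sorted(g)) for g in genotypes)

    chi_square = 0.0
    common = {}
    lumped_obs = 0
    lumped_exp = 0.0
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        # HW expectation: n*p^2 for homozygotes, 2*n*p*q for heterozygotes.
        expected = n * freqs[a] * freqs[b] * (1 if a == b else 2)
        obs = observed.get((a, b), 0)
        if expected < min_expected:
            # Rare class: pool it into the single lumped class.
            lumped_obs += obs
            lumped_exp += expected
        else:
            common[(a, b)] = (obs, expected)
            chi_square += (obs - expected) ** 2 / expected
    if lumped_exp > 0:
        chi_square += (lumped_obs - lumped_exp) ** 2 / lumped_exp
    return chi_square, common, lumped_obs, lumped_exp
```

For example, `hw_chi_square([("A", "A")] * 98 + [("A", "C")] * 2)` keeps A/A as the only common genotype and pools A/C and C/C into the lumped class.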
There are two “standalone” binary versions of PyPop, which
will be released for general use early in 2004:
(a) a Windows version (tested on Windows 98, XP), and
(b) a Linux version (tested on Red Hat 9, Slackware 9.1)
These standalone binary packages are directly installable on these operating
systems and will obviate the need for the user to install third-party
packages.
Comprehensive documentation has been written in DocBook, which can be
used to generate documentation in HTML, PDF, and plain text formats.
The documentation comprises:
(a) PyPop User Guide (including an installation guide for Windows and
Linux, a guide to using PyPop, and a guide to interpreting PyPop output)
(b) PyPop Reference Guide (which includes a detailed description of methods
used by the program and other background and reference information)
The development of a new package of population genetic analysis programs
was needed to handle the high level of HLA polymorphism, and in some
cases, large sample sizes. Further, we wanted the code to be open source.
A primary feature of the package is that it allows integration of
statistics across large numbers of data-sets. The output of the analyses
is stored in XML (eXtensible Markup Language) format. These output files
can then be converted using standard tools into many other data formats
suitable for machine consumption (such as input for the statistical package "R"),
or into plain text or HTML format.
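As a rough illustration of this kind of conversion, the Python sketch below flattens allele frequencies from an XML document into tab-delimited text, a format R can read with `read.delim`. The XML layout shown here (the `dataanalysis`, `locus`, `allele`, `frequency`, and `count` names) is hypothetical and is not taken from PyPop's actual output schema. The helper names `allele_table` and `to_tsv` are likewise assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML layout, for illustration only (not PyPop's real schema).
SAMPLE_XML = """
<dataanalysis>
  <locus name="A">
    <allele name="0101"><frequency>0.25</frequency><count>50</count></allele>
    <allele name="0201"><frequency>0.75</frequency><count>150</count></allele>
  </locus>
  <locus name="B">
    <allele name="0702"><frequency>1.0</frequency><count>200</count></allele>
  </locus>
</dataanalysis>
"""


def allele_table(xml_text):
    """Return one (locus, allele, frequency, count) row per allele element."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for locus in root.iter("locus"):
        for allele in locus.iter("allele"):
            rows.append((
                locus.get("name"),
                allele.get("name"),
                float(allele.findtext("frequency")),
                int(allele.findtext("count")),
            ))
    return rows


def to_tsv(rows):
    """Serialize the rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["locus", "allele", "frequency", "count"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_tsv(allele_table(SAMPLE_XML)), end="")
```

Because every population is flattened into the same column layout, tables produced this way from many data-sets could be concatenated for a combined analysis.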
With application of the PyPop analyses to patient data, either from case/control
data or family based data, care must be taken in interpretation of some results.
For example, the haplotype frequency estimation program assumes the data are
in Hardy Weinberg proportions; this assumption may not hold in the patient
data (nor in control and general population data if there is significant admixture,
etc.).
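To show where the Hardy Weinberg assumption enters, here is a minimal two-locus EM sketch in Python. It is a textbook-style illustration, not PyPop's implementation, and the function name `em_haplotypes` and its input layout are assumptions. In the E-step, each possible phase resolution of an individual is weighted by the product of its two haplotype frequencies, and that product is exactly the HW assumption. If the sample departs from HW proportions, those weights, and therefore the estimates, can be biased.

```python
from collections import defaultdict


def em_haplotypes(individuals, iterations=50):
    """Estimate two-locus haplotype frequencies from unphased genotypes.

    individuals: list of ((a1, a2), (b1, b2)) genotypes at two loci.
    Returns a dict mapping (allele_locus1, allele_locus2) to a frequency.
    """
    # List the distinct phase resolutions (haplotype pairs) per individual.
    # Only double heterozygotes have two distinct resolutions.
    resolutions = []
    for (a1, a2), (b1, b2) in individuals:
        options = {
            tuple(sorted([(a1, b1), (a2, b2)])),
            tuple(sorted([(a1, b2), (a2, b1)])),
        }
        resolutions.append(sorted(options))

    haplotypes = sorted({h for opts in resolutions for pair in opts for h in pair})
    # Start from equal frequencies for every candidate haplotype.
    freqs = {h: 1.0 / len(haplotypes) for h in haplotypes}
    total = 2 * len(individuals)

    for _ in range(iterations):
        expected = defaultdict(float)
        for opts in resolutions:
            # E-step: under HW, a haplotype pair's probability is proportional
            # to the product of its two haplotype frequencies.
            weights = [freqs[h1] * freqs[h2] for h1, h2 in opts]
            norm = sum(weights)
            for (h1, h2), w in zip(opts, weights):
                expected[h1] += w / norm
                expected[h2] += w / norm
        # M-step: new frequencies are expected haplotype counts over 2n.
        freqs = {h: expected[h] / total for h in haplotypes}
    return freqs
```

This sketch runs a fixed number of iterations from a single starting point. A practical analysis would normally also check convergence and try several starting values, since EM can stop at a local maximum.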
For more information on PyPop go to:
http://allele5.biol.berkeley.edu/pypop/
and for details on statistical tests etc. or help in running your data contact
Glenys Thomson (glenys@berkeley.edu).