Genetic landscape

The Indian Genome Variation Consortium is carrying out a unique study to provide a comprehensive genetic mapping of India as a whole.

HISTORIANS and anthropologists have over the years provided us with a fairly good understanding of the peopling of India, its evolution over centuries to its current diverse compositional fabric, its population groupings in terms of geography, language, culture and ethnicity as well as its characteristically unique societal stratification and hierarchies. The billion-plus people of India today comprise 4,693 communities, which include several thousands of endogamous groups, speak in 325 functioning languages and write in 25 different scripts. Now, as a result of what is perhaps the largest multi-institutional research effort (at least in biology) in this country, we have a genetic basis to this unparalleled diversity.

This research effort began about five years ago under the name of the Indian Genome Variation Consortium (IGVC). It has involved many Indian anthropologists and over 150 scientists drawn from six laboratories of the Council of Scientific and Industrial Research (CSIR); the Indian Statistical Institute (ISI), Kolkata; and The Centre for Genomic Application (TCGA), an institution in New Delhi set up in the public-private-partnership (PPP) mode by the CSIR and the Chatterjee Group of Kolkata. The six CSIR institutes are the Institute of Genomics and Integrative Biology (IGIB), Delhi, the nodal institution for the consortium; the Central Drug Research Institute (CDRI), Lucknow; the Indian Institute of Toxicology Research, or IITR (formerly the Industrial Toxicology Research Centre), Lucknow; the Institute of Microbial Technology (IMTECH), Chandigarh; the Indian Institute of Chemical Biology (IICB), Kolkata; and the Centre for Cellular and Molecular Biology (CCMB), Hyderabad. (Interestingly, the letters in the acronym of the PPP institute, T, C, G and A, also stand for the molecules called bases in nucleotides, the fundamental structural units of deoxyribonucleic acid, or DNA, whose ordering or sequence in DNA codes for genetic information.)

Many studies in the population genetics of the Indian people have been carried out in the past, primarily from an anthropological perspective, but most of them have been limited to certain identified population groups. This study, however, is unique because the genetic information generated is of biomedical relevance. To obtain population-specific genetic information, genes were selected on the basis of their established, or suggested but not proven, linkages to certain common diseases and disorders. The study thus becomes significant from the perspective of pharmacogenetics, or genetic-information-based medicine.

The IGVC was launched as a response to the International HapMap Consortium launched in 2002 to map global genomic diversity. The HapMap study, which cost $100 million, covered 45 Japanese, 45 Chinese, 90 Caucasian and 90 African individuals. Significantly, it failed to cover India. Even if it had, a population sample of the order of 45 would hardly have captured the diversity that is evident in a population that accounts for one-sixth of the world population. Besides HapMap, there are other genetic databases in the public domain on worldwide populations, such as dbSNP (2001), Celera (2002) and HGVBase (2004), and on specific populations, such as the Japanese JSNP (2002). The Indian subcontinent is not represented in any of these either.

The independent Indian effort has already provided considerable genetic insight into the people(s) of India. Its conceptually different approach focussed on a smaller set of apparently functional genes because of the suggested disease linkages and was carried out at about 1/20th the cost of the HapMap study (Rs.25 crore as of date). The budget was significantly scaled down from the original proposal of Rs.113 crore based on a HapMap-like approach, says Samir K. Brahmachari, Director-General of the CSIR and former Director of the IGIB.

The aim of the IGVC project is to obtain data on about 1,000 such genes in 15,000 individuals drawn from different sub-populations so as to provide a comprehensive genetic mapping of the country as a whole. This could serve as the template for identifying reference sub-populations or groups predisposed to specific diseases. Appropriate medical intervention could then target these populations.

In the first phase, the results derived from a study of 55 populations, involving 75 genes and 405 genetic markers (random mutations that serve as milestones), have been published as a paper authored by 151 scientists in the April issue of Journal of Genetics, a high-impact journal of the Indian Academy of Sciences (IAS). The second phase of this mammoth effort is already on, and data on 4,000 genetic markers from over 1,000 genes from 26 of the 55 populations have been gathered and are undergoing analyses.

The paper points out that earlier studies carried out elsewhere by mapping isolated populations, considered important to the understanding of the genetic underpinnings of diseases, had met with limited success because the lack of genetic homogeneity in outbred populations meant that the results could not be validated across the populations. It becomes difficult to get a proper gene-disease linkage map by sampling in such genetically heterogeneous populations. Notwithstanding the overall genetic diversity that one obtains in India, the existence of large families and the high levels of endogamy and population stratification provide a unique source for dissecting complex disease etiology and pathogenesis, the authors note.

It is this situation that the work has sought to exploit. "The premise that populations in India are more endogamous than our original perception was actually suggested by our work on SCA12 [spinocerebellar ataxia 12, a neurological disorder]," says Brahmachari. "What we found was that the disorder was associated predominantly with a single kind of allele [variant of a gene], and every patient came from the same community," he adds.

Even as the study reiterates well-known historical and anthropological elements about the Indian population, it has revealed a remarkable feature of the genetic profile: that there are clusters of populations showing significant genetic differences between them in the frequencies of disease-associated genetic markers and that there is a strong association between the geographic/linguistic profiles in the country and these clusters or groups. That is, Brahmachari says: "From a linguistic map, we have built a genetic map for disease risk."

The basic approach of the IGVC was to study the genetic variation across population groups on the basis of data on what are called Single Nucleotide Polymorphisms (or SNPs, pronounced snips) in the genes that are known to be markers for certain diseases. SNPs constitute the simplest of mutations that a gene undergoes and involve a change, by the alteration of the base molecule (A, T, C or G) at its free end, in one among the string of nucleotides that make up the gene.

In contrast, earlier population genetics studies on Indian populations generally used mitochondrial DNA (mtDNA), the DNA of certain self-replicating organelles present in the cytoplasm of the cell and inherited from the mother; the Y-chromosome; and, occasionally, limited markers from the genomic DNA. As Mitali Mukerji of the IGIB, one of the lead authors of the paper, points out, in mtDNA, which is uniparentally transmitted, there is no possibility of crossovers and genetic recombinations (the exchange of DNA sequences in the genes between chromosome pairs) and other accidental variations that can occur during chromosome duplication as cells divide (mitosis).

Similarly, unlike autosomes, or non-sex chromosomes, the Y-chromosome does not have a homologous pair (having the same genetic loci) and, therefore, cannot reflect genetic recombinations that can occur by the exchange of segments across identical loci.

"Such [mtDNA-based] studies," she says, "can at best reveal the origins or ancestry of populations by tracing the communities that females have married into and how societies have intermingled and not the relevant fine structure or stratification in the populations. The bulk of our genome is in the autosomes where there is adequate mixing."

A population genetics study that merely looks at the gene flow and infers who is related to whom and ends up with a phylogenetic (evolutionary descent) tree is too expensive to use public money on, feels Brahmachari. "Our objective," he says, "was to see how much is the variation in the genes with important pathways or is suggestive of such associations in populations and whether there is any population stratification revealed on that basis." Databases such as the Online Mendelian Inheritance in Man (OMIM), which is curated at Johns Hopkins University and made available through the National Center for Biotechnology Information (NCBI), were used to identify established gene-disease associations.

As the paper states, the first phase of the IGVC effort set out to answer the following five questions: (i) Are the frequencies of SNPs putatively associated with diseases similar across populations and can clusters of populations that have similar frequencies be identified? (ii) Do these clusters correlate with ethnic, linguistic and geographic divisions? (iii) What is the nature and extent of genetic differentiation within and among such clusters? (iv) How are Indian populations related to the HapMap populations? and (v) Does a subset of all the SNPs considered suffice to distinguish between ethnic groups? By answering the above affirmatively, the first results of the project have shed new and important light on the genetic profile of India, which, as will be described in Part 2 of this report, can have significant public health implications.

The population groups in the study were classified as Isolated (tribal) Populations (IPs), Large Populations (LPs), which were mostly caste groups with a population strength of more than 10 million, and Special Populations (SPs), essentially religious groups apart from the dominant Hindu group, which were also nearly all large populations. Some of the IPs were also large. "This classification ensured that there was maximum genetic variation between the tribals and the large populations," says Brahmachari.

For sampling, given that the country has basically four linguistic groups, namely Austro-Asiatic (AA), Tibeto-Burman (TB), Indo-European (IE) and Dravidian (DR), and six broad geographical regions, namely south, west, central, north, east and north-east, at least two IPs and two LPs were selected from each language-geography grid so that population and ethnic diversity would be adequately represented. (This gave each selected group a three-part label of language-geographic zone-population category.) The 55 population groups selected included 26 LPs, 23 IPs and six SPs (see Table 1).

A representative set of 75 genes spread across nearly all the chromosomes (excluding the Y-chromosome) was identified for the study. The genes are known to be associated with drug response, cancer and aging, eye diseases, allergy and asthma, susceptibility to infections, neuro-psychiatric problems, and metabolic and cardiovascular disorders (see Table 2). Genes on which maximum biochemical and molecular biological information was available were chosen. "This also ensured that nearly the entire gamut of biomolecular functions and biological processes were covered by the genes," says Brahmachari.

A minimum of 50 samples from LPs and 25 samples from IPs were collected. From an initial number of 2,014 unrelated samples, quality control reduced the number to 1,871, comprising 1,240 males and 631 females (see Table 3). The sampling does reveal an apparent bias towards populations of the north, which account for as many as 44 per cent of the samples.

"Yes, the south has not been covered very well," admits Mukerji. "This was basically because the north is fairly homogeneous and the south comparatively heterogeneous. We have now collected all the information on the admixtures in southern populations, particularly in the IE and AA tribals. The fine structuring of the population has to be captured, which is being done in the second phase with more samples," she adds. "The new results are not very different, which indicates that there was not any significant bias due to sampling," Brahmachari points out, however.

Since the study was primarily focussed on disease-related functional polymorphisms, the SNPs suggestive of such associations had to be first identified. For this a discovery panel of 43 samples was drawn up from geographically and ethnically diverse populations. By sequencing the candidate gene loci from these samples, a total of 170 novel and 560 reported SNPs were identified. Application of certain selection criteria brought the number down to 601, and when these were validated on the larger complete sample set, only 405 met the criterion of being from the non-sex chromosomes. A bi-directional sampling on the panel of 43 samples enabled the discovery of the novel SNPs.

The identification of SNPs was done by the automated high throughput sequencing technique (HTST), widely used in post-genomics biological research, which can easily locate these. Such a facility exists at the TCGA. "For sequencing genes, some of the design was done at the IGIB and some here, but we did all the quality controls and high-throughput screening for SNPs and genotyping," says K. Narayanasamy of the TCGA. Genotyping refers to determining the genotype, or the genetic constitution of an individual, in this case, the limited set of SNPs in the sample.

"While the throughput we achieved on the Sequenom platform, about 50,000 genotypes a day, was perhaps a little lower than what one can achieve with the latest equipment, in terms of accuracy it was most accurate," Narayanasamy adds. "For the second phase, for which we have already sequenced over 1,000 genes and screened them at about five to six SNPs per gene, we used this as well as a comparatively higher-throughput platform, Illumina, and the concordance between the two platforms is excellent," he says.

A larger number of genetic markers from a set of populations than most studies have to deal with also meant that a more careful data analysis had to be done. "It meant that the data had to be carefully curated and quality controls applied consistently," says Partha Majumder of the ISI, another lead author of the paper, under whose guidance much of the statistical analysis was carried out. "To make comparisons, the techniques used were fairly standard. Only that they had to be adapted suitably for the larger dimensional problem at hand," he adds.

Data analysis basically involved computation of the frequencies of genetic variation (allele frequencies) at all autosomal loci, and these were systematically compared between pairs of populations to determine how far each population was genetically separated from another, that is, to determine the genetic distance or the extent of genetic differentiation, as it is termed. It was found that the differentiation was statistically significant in most cases, and as the paper notes, in those few cases where it was not statistically significant, it remains to be determined whether it was an artefact of the small sample sizes.
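The two steps described above, computing allele frequencies and then comparing pairs of populations, can be illustrated with a minimal sketch. It is not the consortium's actual pipeline: the article does not say which distance statistic was used, so a simple Wright-style F_ST for biallelic SNPs is assumed here, and the population names and genotype counts are hypothetical.

```python
from itertools import combinations


def allele_freq(n_AA, n_Aa, n_aa):
    """Frequency of allele A at one biallelic SNP, from genotype counts."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)
    return (2 * n_AA + n_Aa) / total_alleles


def pairwise_fst(freqs_1, freqs_2):
    """Simple F_ST between two populations, pooled over loci.

    Assumption: both populations are weighted equally and sample-size
    corrections are ignored, which a real analysis would not do.
    """
    num = den = 0.0
    for p1, p2 in zip(freqs_1, freqs_2):
        p_bar = (p1 + p2) / 2
        h_total = 2 * p_bar * (1 - p_bar)  # heterozygosity if pooled
        h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
        num += h_total - h_within
        den += h_total
    return num / den if den > 0 else 0.0


# Hypothetical genotype counts (AA, Aa, aa) at three SNPs per population.
genotype_counts = {
    "POP_A": [(30, 15, 5), (10, 20, 20), (25, 20, 5)],
    "POP_B": [(5, 15, 30), (12, 18, 20), (24, 21, 5)],
    "POP_C": [(28, 17, 5), (40, 8, 2), (6, 14, 30)],
}

freqs = {
    pop: [allele_freq(*counts) for counts in loci]
    for pop, loci in genotype_counts.items()
}

# One differentiation value for every pair of populations.
distances = {
    (a, b): pairwise_fst(freqs[a], freqs[b])
    for a, b in combinations(sorted(freqs), 2)
}
```

A value near 0 means the two populations have nearly the same allele frequencies; a value near 1 means they are almost fixed for different alleles. Whether a given value is statistically significant needs a separate test, for example by permuting individuals between the two populations.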

However, the comparison on the basis of a few loci alone revealed a very high degree of differentiation, which was comparable in magnitude to that observed among continental populations. This is a clear demonstration of the genetic diversity in the country. Significantly, maximum differentiation was found among tribal populations in the different linguistic regions. A significant observation of the study is that while on a pan-India level the extent of differentiation among linguistic or geographic groups was not statistically significant, grouping by ethnicity (caste and tribe) revealed a higher degree of differentiation between caste groups and tribal groups, which is indicative of the antiquity and isolation of the tribal compared to the caste populations.

Analysis of genetic distances within geographic regions or ethno-linguistic groups has revealed some interesting features. While Dravidian-speaking LPs and IPs did not show statistically significant differentiation, the Indo-European-speaking LP and IP groups are significantly differentiated. Also, there is significant differentiation within both Indo-European-speaking tribes and Dravidian-speaking tribes, pointing to a relatively greater degree of isolation and non-mixing. The same, however, cannot be said of caste groups, which is indicative of a more recent formation of such societal groups. Interestingly, Dravidian-speaking tribal people and the tribal people who speak Austro-Asiatic languages did not show any significant degree of genetic separation. From the above results, observe the authors, it is clear that pooling populations without considering ethnicity and linguistic affiliations that contribute to population stratification can result in false inferences in genetic association studies.

Different statistical methods were employed to obtain a slightly different perspective of the country's genetic profile: namely, to study the genetic affinities of populations and see how the populations clustered genetically across the country and to see how this correlated with the linguistic groupings (see map). The first major cluster comprised Austro-Asiatic IPs and Dravidian IPs, consistent with the earlier observation of lower genetic differentiation between them. The second cluster included Tibeto-Burman-speaking populations, irrespective of their geographical region of habitat. This cluster also consisted of three Indo-European-speaking IPs and two LPs. The majority of these populations reside in the Himalayan belt, the paper notes. There were a larger number of smaller clusters that predominantly consisted of Indo-European-speaking LPs and SPs.

The Dravidian-speaking LPs and IPs formed a separate cluster, predominantly located in the south. The clustering also reveals that there seems to be considerable diversity among Indo-European-speaking populations in different geographical locations. Besides, within the Indo-European-speaking populations, LPs and SPs (largely Muslims) were found to have a strong genetic affinity, which basically implies that Hindus and Muslims of the north are essentially the same people. Similarly, affinity was also observed between Tibeto-Burman-speaking IPs and SPs.

Thus, infer the authors, although there is no clear geographical grouping of populations, ethnicity (tribal/non-tribal) and language seem to be the major determinants of genetic affinities between populations of India. On the basis of the observed stratification of the Indian population, the paper cautions that conventional population genetics studies must correct for this stratification if cases and controls are drawn from populations that belong to different clusters. This observation, therefore, has the implication that careful sampling of the populations is necessary to capture the entire genetic spectrum of the country.

Besides disease-linked genetic mapping of Indian populations, an objective of the study was to look at the ancestry of the diverse populations and their relatedness from that perspective. This, in the parlance of population genetics, is determined by looking at what is called Linkage Disequilibrium (LD). Majumder explains this as follows: "Even if two SNPs are not close to each other, there is a statistical association between them when the ancestry is short. This is because sufficient time has not elapsed for genetic recombinations that would disturb this association to occur. They stay together over a period of time. LD is basically a measure of this." However, according to him, the analysis of SNPs in Indian populations shows that the extent of LD was not high enough. This basically implies that the population groups are of ancient descent and despite that, there are pockets of homogeneity in the Indian population.
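Majumder's description of LD can be made concrete with the standard textbook measures. The sketch below computes the coefficient D and the squared correlation r-squared for two biallelic SNPs from haplotype counts, and shows the expected decay of D under random mating, D_t = D_0 (1 - c)^t, where c is the recombination fraction between the two SNPs. The function names and the example counts are illustrative and are not taken from the paper.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) from counts of the four two-SNP haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n            # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / n    # frequency of allele A at the first SNP
    p_B = (n_AB + n_aB) / n    # frequency of allele B at the second SNP
    d = p_AB - p_A * p_B       # departure from independent association
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared


def ld_after(d0, recomb_fraction, generations):
    """Expected D after some generations of random mating."""
    return d0 * (1 - recomb_fraction) ** generations


# Two SNPs that always travel together: complete LD.
d, r2 = ld_stats(50, 0, 0, 50)

# The same association after 10 and after 100 generations,
# assuming a recombination fraction of 0.01 between the SNPs.
recent = ld_after(d, 0.01, 10)
ancient = ld_after(d, 0.01, 100)
```

Because each generation multiplies D by (1 - c), a population with a long history of random mating retains little association between markers, which is the reasoning behind reading low LD as a sign of ancient descent.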

An interesting corollary outcome of the study is the finding that the IPs are relatively unadmixed, unlike the LPs. An analysis conceived by Majumder found that it was possible to identify a small number of SNPs that can serve as signatures of population ancestry. "If there are two populations and we had a small bunch of alleles, I wanted to see if there is a useful way of differentiating the populations based on these," says Majumder.

The answer turned out to be yes, perhaps not too surprising given the extent of differentiation seen between different sub-populations. A small set of 12 SNP loci, which Majumder has called Keystone SNPs, is enough to identify a population with unknown ethnicity as predominantly tribal or predominantly caste with 100 per cent accuracy! The accuracy dropped when unknown linguistic lineage or geographical lineage had to be determined. "Though there are a few Keystone SNPs that have strong medical linkage, I would not draw any larger conclusion from this until there are clearly proven linkages for the entire set, which we do not have," Majumder says.
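How a handful of markers can assign a population to a group may be easier to see in a toy example. The article does not describe the method Majumder used, so the sketch below simply assumes a nearest-profile rule: the unknown population is assigned to whichever reference group has the closest allele-frequency profile. The four frequencies per group are hypothetical (the study used 12 loci), and `nearest_group` is an illustrative name.

```python
def nearest_group(unknown_freqs, reference_profiles):
    """Assign a population to the reference group whose allele-frequency
    profile is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        reference_profiles,
        key=lambda group: sq_dist(unknown_freqs, reference_profiles[group]),
    )


# Hypothetical mean allele frequencies at four marker SNPs.
reference_profiles = {
    "tribal": [0.80, 0.15, 0.60, 0.10],
    "caste": [0.35, 0.55, 0.20, 0.45],
}

# A population of unknown ethnicity, typed at the same four SNPs.
assigned = nearest_group([0.75, 0.20, 0.55, 0.05], reference_profiles)
```

The rule works well only when the reference profiles are far apart at the chosen markers, which is consistent with the report that accuracy dropped for linguistic and geographical lineage, where the groups are less sharply differentiated.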

The study also investigated what the genetic variations in Indian populations reveal when viewed from a global perspective. This was done by carrying out a comparison with the genetic profiles of the HapMap populations by porting what are called Tag SNPs from the HapMap populations onto the Indian populations.

Interestingly, it was found that all the HapMap populations, except those of African descent (YRI), namely, the Chinese (CHB), the Japanese (JPT) and the Caucasian (CEU), were well represented among the Indian population. The isolated populations of the Himalayan belt are closest to the Chinese and Japanese and separate from the rest of the populations. The YRI, on the other hand, lay statistically farther away from the Indian populations and closest to the single outlying population group (from the 55) labelled as OG-W-IP. As expected, the CEU was found to be closest to the Indo-European-speaking LPs, the majority of whom are in the north. Again, as expected, the Austro-Asiatic-speaking and Dravidian-speaking populations were distinct from the HapMap populations. Interestingly, however, the mean statistical association of these Tag SNPs was higher in Indian populations than in the HapMap populations themselves. This is perhaps a reflection of the inherent urban bias in the HapMap samples, which were drawn from Beijing, Tokyo, and so on.

"A pertinent question is how robust is our genetic map because ideally genetic affinities should be inferred from a random set of neutral loci and not from a biased set based on disease linkage," says Brahmachari.

A statistical analysis, after a proper statistical filtering of the loci, showed that the correlation between the affinities computed with all 405 loci and those computed with the reduced set of 332 was very high, establishing the robustness of the findings, Brahmachari points out.
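A robustness check of this kind can be sketched as a correlation between two sets of pairwise genetic distances, one computed from the full set of loci and one from the filtered set. The article does not name the statistic used, so a plain Pearson correlation is assumed here, and the distance values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical distances for the same six population pairs,
# listed in the same order in both lists.
dist_all_loci = [0.021, 0.054, 0.033, 0.078, 0.046, 0.012]
dist_filtered_loci = [0.019, 0.057, 0.030, 0.081, 0.044, 0.015]

agreement = pearson(dist_all_loci, dist_filtered_loci)
```

A correlation close to 1 indicates that dropping the filtered-out loci barely changes which populations appear close and which appear distant. Because the entries of a distance matrix are not independent, significance is usually judged with a permutation procedure such as the Mantel test rather than the ordinary test for a correlation coefficient.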

Having thus established that the results of the first phase are on terra firma (which the early second phase results also seem to indicate), the IGVC project is perhaps all set to enter the third phase and attempt to answer the next-level question, which could have public health implications: namely, can one identify populations at risk for complex disorders, poor drug response and predisposition to infectious diseases?