Unravelling the rice genome

Two research groups provide the first detailed blueprint of a food crop, opening new vistas in comparative plant biology.

THE April 5 issue of the journal Science carried two research papers that mark a milestone in plant biotechnology. Two research groups, one Chinese and the other from the Swiss agrobiotechnology giant Syngenta, published draft sequences of the complete genetic code, or genome, of two subspecies of rice (Oryza sativa), the staple food of over three billion people, chiefly in Asia and the Asia-Pacific. The two sequences represent works in progress, in the sense that they have errors and are incomplete. But they provide the first detailed genetic blueprint of a food crop. These efforts "will be the first sequencing projects to yield tangible results for humankind from the standpoints of food security and combating malnutrition," wrote Ronald P. Cantrell of the International Rice Research Institute (IRRI) and Timothy G. Reeves of the International Maize and Wheat Improvement Centre (CIMMYT) in a commentary in the same issue of the journal.

The Chinese team, comprising scientists from 11 Chinese institutions and led by Jun Yu of Beijing Genomics Institute (BGI), a publicly funded institute established in 1999, and the University of Washington Genome Centre, unveiled the genomic sequences of the rice subspecies indica, consumed predominantly in China and India. The Syngenta team, which included scientists from Myriad Genetics Inc., another private biotech company, and was led by Stephen A. Goff of Syngenta, unveiled the sequences of the code for the other major subspecies, japonica, which is popular in Japan and other countries with a temperate climate. Syngenta had announced in January 2001 that it had completed the sequencing of the rice genome but did not publish the data then (Frontline, March 2, 2002). The rapid progress made by the BGI, which has taken less than two years to sequence the indica genome, prompting Science to call China a sequencing superpower, may have driven Syngenta to release its sequence data now.

The publication has, in some sense, stolen a march over the ongoing publicly funded International Rice Genome Sequencing Project (IRGSP), a consortium effort by 10 participating countries (which include China) to map precisely and sequence the rice genome completely (see chart). The effort is expected to come out with its highly accurate draft sequence by the end of this year. This draft - called Phase 2 - will have all the sequences, read with an accuracy of 99.99 per cent, genetically aligned as well as correctly oriented. The final finished sequence, with few gaps or errors, is expected to be completed by 2005, three years ahead of the original time-table. The genome sequences of Syngenta and BGI are akin to Phase 1 but without the necessary information to go to the next phases.

The IRGSP was initiated by Japan in 1997 and it got under way in 1998. The project is estimated to cost about $200 million. India joined the IRGSP in June 2000 and chose to sequence a part of chromosome 11. India has invested Rs.48.83 crores in the "Indian Initiative for Rice Genome Sequencing (IIRGS)". The initiative is a joint effort by the Department of Plant Molecular Biology (DPMB), University of Delhi South Campus (UDSC), and the National Research Centre on Plant Biotechnology (NRCPB) at the Indian Agricultural Research Institute (IARI), New Delhi. The Indian share of chromosome 11 has been equally divided between these two centres.

The IIRGS has chosen the japonica variety 'Nipponbare' for sequencing, the same one Syngenta has sequenced. Its technique, known as 'clone-by-clone' sequencing, is slower and about 10 times more expensive, but much more accurate and complete. In contrast, BGI and Syngenta adopted the quicker 'whole genome shotgun' approach. The latter method was pioneered in the sequencing of the fruit-fly Drosophila melanogaster and was later adopted for the sequencing of the human genome in 2001 by Celera Genomics. The international human genome project too has adopted the 'clone-by-clone' sequencing method.

The 'clone-by-clone' approach involves the creation of physical and genetic maps of the genome and the development of a library of clones, or relatively short stretches of DNA, each anchored to a specific location identified from the map, which are then sequenced. In the shotgun approach, the entire genome is randomly broken into pieces and sequenced. These are pieced together into large coherent units of sequence with the aid of high-speed computers and software, which look for overlapping DNA sequences and identify contiguous genomic regions, or contigs. But this method is prone to errors, and gaps remain. This is basically because higher organisms have several repeated sequences in their DNA that prevent assembly of sequenced snippets on the genome scale.
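The overlap-and-merge idea behind shotgun assembly can be illustrated with a toy sketch. This is not the software used by BGI or Syngenta; it is a minimal greedy merger written for illustration, and the function names, the 16-letter "genome" and the three reads are all made up.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if no overlap of at least min_len letters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that share the longest overlap.

    Stops when no pair overlaps by at least min_len letters; whatever is
    left is the list of contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates
    # Drop reads that lie entirely inside another read.
    contigs = [r for r in contigs
               if not any(r != o and r in o for o in contigs)]
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs)
                   if n not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads sampled from the made-up sequence ATGGCCTAAGCTTCAG.
reads = ["ATGGCCTA", "CCTAAGCT", "AGCTTCAG"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these three reads the fragments chain into a single contig. If the underlying sequence contained a repeat longer than the reads, the same overlap could point to more than one location, and a merger of this kind would either join the wrong pieces or stop early and leave several contigs, which is the difficulty described above.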

Knowledge of the complete genetic code of rice will help breeders develop strains of the crop with specific characteristics, such as stress-resistance, disease-resistance or high yield, much quicker than through traditional methods, which may require years of crossing to achieve the desired property. The sequencing of all the rice genes alone provides insufficient information on which to base crop improvements. Sequence information linked to complete physical and genetic maps of the genome is required to exploit the full potential of the sequences. Herein lies the importance of the consortium effort. The IRGSP sequencing will not only be complete and very accurate but will be 'genetically anchored' as well. That is, based on the physical and genetic maps, the precise ordering of the short stretches of sequences and their precise locations along the DNA will be known. The sequences released by the BGI and Syngenta, however, lack this quality of location-specific data and are incomplete in that sense. The sequences are a large number of bits and pieces - tens of thousands - without sufficient information on how they piece together to make up the 12 strands of DNA, the number of chromosomes in rice.

The statements that Syngenta and BGI have both completed sequencing the rice genome should be understood in the correct perspective of what has actually been achieved. The actual status is that in 2001 only 28 per cent of the genome had been fully sequenced, 66 per cent was sequenced with gaps and 6 per cent was unknown. In the case of the rice genome, the sequences released by the two groups cover only 93 per cent and 92 per cent respectively of the total genome. In contrast, the IRGSP has released its more accurate sequences in the public database GenBank which, as of date, cover 68 per cent of the genome.

Syngenta sequenced 5.5 million random clones and eventually assembled the sequenced snippets into 42,109 contigs. Similarly, BGI lined up the large number of sequenced snippets into 127,550 contigs. This means that the two genome sequences have about as many gaps. The number of contigs in the IRGSP released data is only 2,000. In the final finished sequence stage, Phase 3, this number should come down to 12. Further, the accuracy of the sequences in the data of both Syngenta and BGI is lower than the accuracy envisaged in the IRGSP, an error of one in 10,000 base pairs (bp), which means 99.99 per cent accuracy. (All genetic code is made up of different sequences of four fundamental chemicals called 'bases'.) This is the level of accuracy required by the Bermuda Declaration, which had been adopted for the publicly funded human genome project, and the same is being followed by the IRGSP.
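The arithmetic behind these figures can be checked in a few lines. The sketch below is only a back-of-the-envelope calculation: the helper names are made up, the gap count assumes every contig sits on exactly one of the 12 chromosomes, and the 400 Mbp genome size is the IRGSP's initial estimate.

```python
def accuracy_percent(errors: int, bases: int) -> float:
    """Accuracy implied by a given number of errors per number of bases."""
    return 100.0 * (1 - errors / bases)


def min_gaps(contigs: int, chromosomes: int = 12) -> int:
    """Fewest gaps left when contigs are laid end to end on the chromosomes.

    Assumes each contig belongs to exactly one chromosome, so a genome in
    12 contigs (one per chromosome) has no gaps.
    """
    return max(contigs - chromosomes, 0)


# One error in 10,000 base pairs is the IRGSP target.
print(round(accuracy_percent(1, 10_000), 2))  # 99.99

# Contig counts quoted in the text.
for name, n in [("Syngenta", 42_109), ("BGI", 127_550), ("IRGSP", 2_000)]:
    print(name, min_gaps(n))

# Errors still expected at the target rate in a 400 Mbp genome.
print(400_000_000 // 10_000)  # 40000
```

Even at 99.99 per cent accuracy a genome of this size would carry tens of thousands of wrong bases, which is why the error rate and the contig count are quoted together as measures of quality.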

A ten-fold redundancy in the sequence "reads" will achieve the desired accuracy in the IRGSP data. That is, the overlaps in the total clones sequenced amount to covering the genome 10 times over. In contrast, while with its 5.5 million snippets Syngenta has been able to achieve only six-fold coverage, the BGI data cover the genome only four times. Correspondingly, their accuracy too would be less. While Syngenta has claimed that it has anchored most of its sequence contigs onto the physical and genetic maps - for this the researchers have made use of the publicly accessible data released by the IRGSP - the BGI plans to do the same. However, it is clear that since the BGI is working with a different subspecies, it will have to generate its own physical and genetic maps.
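The link between redundancy and completeness can be made concrete with a standard idealised model in which reads land uniformly at random on the genome; under that assumption the expected fraction of bases never read at c-fold coverage is e^(-c). The sketch below applies it to the coverage figures quoted above. The function names are made up, the 420 Mbp genome size is Syngenta's own estimate, and real genomes, with their repeats and cloning biases, behave worse than this model suggests.

```python
import math


def fold_coverage(n_reads: float, mean_read_len_bp: float, genome_bp: float) -> float:
    """Average number of times each base is read."""
    return n_reads * mean_read_len_bp / genome_bp


def implied_read_length(coverage: float, n_reads: float, genome_bp: float) -> float:
    """Mean read length that would give the stated coverage."""
    return coverage * genome_bp / n_reads


def expected_unread_fraction(coverage: float) -> float:
    """Fraction of bases never read if reads fall uniformly at random."""
    return math.exp(-coverage)


# Six-fold coverage from 5.5 million reads of a 420 Mbp genome implies
# reads of a few hundred base pairs each (an inference, not a reported figure).
read_len = implied_read_length(6, 5.5e6, 420e6)
print(round(read_len), "bp per read")

# Idealised unread fraction for the three coverage levels in the text.
for label, c in [("BGI", 4), ("Syngenta", 6), ("IRGSP target", 10)]:
    print(label, f"{100 * expected_unread_fraction(c):.4f} per cent unread")
```

Each extra fold of coverage cuts the unread fraction by a factor of about 2.7 in this model, and every base is also read more times, so individual reading errors are more likely to be outvoted. That is the sense in which lower coverage goes with lower accuracy.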

Nevertheless, the Syngenta and BGI sequences do serve as a complementary guiding tool for the more thorough sequencing and help in speeding it up. The utility of the 'clone-by-clone' approach has been borne out by the draft sequence of part of the rice genome made available by Monsanto to the IRGSP in early 2000. Since Monsanto had also sequenced the Nipponbare strain, the sequenced clones, though of low accuracy (fivefold coverage), could be used by institutions participating in the IRGSP, under some conditions of no commercial exploitation. This has enabled late entrants like France and Brazil to progress rapidly in sequencing the chromosome share allocated to them. China too, for example, has utilised the Monsanto data for its full sequencing of chromosome 4. That experience has also been useful in China's parallel effort at sequencing indica.

While the BGI group has submitted the data to GenBank, Syngenta has refused to do so. Instead, it has entered into an agreement with Science to allow only limited access to its raw data, for non-commercial purposes only, much like what Celera did with its human genome sequence data. The conditions limit searches to less than 15 kbp of the sequence at a time and downloads to 100 kbp of sequence a week. Any academic requiring more than 100 kbp a week must obtain the prior written approval of the company. More can be freely downloaded if the researcher's institution enters into an agreement with Syngenta stating that the data will not be used for commercial ends.

And as before, Syngenta's commercial hold over data has sparked a controversy. Since Syngenta also used Nipponbare, only its data can be of any possible use for the IRGSP effort. While Syngenta has said that it would make its data available to the IRGSP, it has not done so yet. According to the IRGSP, Monsanto clones and sequences underlie 30 per cent of the sequence in public databases. Given that, the Syngenta data, which are probably of better quality than those of Monsanto, would be of even greater use in accelerating the IRGSP's sequencing efforts.

The news of the Syngenta-Science deal drew criticism from several sections of the scientific community. Twenty prominent scientists, including two Nobel laureates, wrote to Science protesting against the publication of genome maps without requiring the authors to place the supporting sequence data in public databases as is the "accepted norm".

Donald Kennedy, the Editor of Science, however, defended the decision to allow this exception to Syngenta. He said: "Some cost, especially to the bioinformatics community, results from allowing an exception to the GenBank rule. We allowed it on the basis that the public benefit of releasing the finding from trade-secret status outweighed that cost. Exceptions of this kind will surely be rare."

As if in response to this criticism, Goff and Yu of BGI have apparently written to the IRGSP proposing a merger of the various sequencing efforts. It is being speculated that in response to this, the two public groups - the IRGSP and the BGI - and two private groups - Syngenta and Monsanto - could come together and merge their efforts into a single joint project. Meanwhile, the IRGSP has said that negotiations are on with Syngenta to have access to its data on terms similar to those with Monsanto earlier.

The importance of knowing the complete rice genome stems also from the fact that rice is a model experimental plant. All flowering plants are divided into two classes: monocots and dicots. In November 2000, the genome sequence of Arabidopsis thaliana, a weed belonging to the cabbage and mustard family and widely used in research, provided the first complete view of the genetic code of a dicot. Rice, on the other hand, is a monocot and belongs to the grass family, which includes the other important cereals like maize, wheat, barley and sorghum. Its genomic sequence now enables the comparison of the genome of a monocot with that of a dicot, and of both with the genomes of other organisms. Unlike other cereals, rice has a relatively small genome (about 400 million base pairs, Mbp). The maize genome - whose sequencing the U.S. is launching in a big way under its National Genome Initiative - is 2,500 Mbp, barley is 4,900 Mbp and wheat is a whopping 16,000 Mbp. Recall that the size of the human genome is about 3,000 Mbp (see table).

WHILE the Syngenta researchers estimate the size of the rice genome to be 420 Mbp, the BGI group gives a substantially higher estimate of 466 Mbp. The IRGSP's initial estimate was 400 Mbp and its more recent estimate based on the just-published detailed physical and genetic map is 403 Mbp. More interestingly, the two draft blueprints reveal that a rice plant contains more genes than a human being. While the number of human genes is estimated to be around 35,000, indica rice contains between 45,000 and 56,000 genes. And the number of genes in japonica could be between 42,000 and 63,000. According to scientists, the reason for a large number of genes in rice as compared to humans is that plants rely on duplication for protein diversity. Protein diversity in humans is believed to be achieved by a process of alternative splicing. In humans a single gene does several things. The genes are constantly broken up and spliced together with a different sequence and function.

A significant observation of both the research groups is that over 80 per cent of Arabidopsis genes have counterparts in rice whereas only about 50 per cent of rice genes are present in Arabidopsis. This, argue Pamela Ronald and Hei Leung in their commentary accompanying the research papers, suggests that rice genes are essentially a superset of Arabidopsis genes. The rice genome perhaps is the result of a massive "gene duplication event" and this difference could shed light on how monocots and dicots evolved and diverged about 200 million years ago. But more important, recent research suggests that while the various cereal genomes differ markedly in the amount of total DNA, they share a large number of genes and have similar genetic content over large blocks of chromosomes. Now that the draft genome is available, this "syntenic" relationship can be exploited by geneticists to investigate other cereals even without complete knowledge of their larger genomes. As Jeffrey Bennetzen has commented in an accompanying piece in Science, the rice genome sequence now opens the door to comparative plant biology.

Access to genome sequence gives a handle to discover all the rice genes and establish their functionality. Gene identification and functional validation is what "functional genomics" is all about. It takes off once the genomic sequence is laid bare.

Says N.K. Singh of the IARI: "Functional genomics is where we should focus. Using the sequence data we should begin to develop appropriate markers to breed Indian rice varieties, isolate genes and study how they can be over-expressed or under-expressed and, of course, with genes identified for desirable traits and functions, create transgenics." One way to determine the functions of genes is to create "deletion mutants", rice varieties with specific genes deleted from their DNA. The IRRI is expected to have over 40,000 deletion mutant varieties by the end of this year. It is in discussion with Indian scientists in developing appropriate mutant lines for the International Rice Functional Genomics Programme that the IRRI has established and sequence data would be useful in this effort.

Knowing the sequence of specific genes will enable tapping into the natural genetic variation in the rice species, of which the IRRI has over 100,000 accessions. In India, the Central Rice Research Institute has 42,000 varieties in its germplasm collection. These rice seeds serve as a pool of "natural variants" whose advantageous traits have not been fully tapped owing to the lack of a genetic handle on them. Now, once the gene associated with a particular trait is known from sequence data, allelic variants of this gene in the germplasm collection can be examined for their relative usefulness. For this purpose, as well as for much of functional genomics, the accurate "gold standard" sequence data being generated by the IRGSP will become invaluable.

Fear has been expressed in the wake of the release of draft sequence data that the funding agencies might believe that the sequence is essentially in hand and stop funding the IRGSP. Rod Wing, a molecular biologist at Clemson University in South Carolina collaborating in the IRGSP, emphasised that though commercial benefits and research insights may already be generated from the data, "drafts are just that - drafts - and not finished sequences."

At a meeting of the IRGSP in February in Tsukuba, where the current status of the consortium effort was discussed, it was resolved that "the IRGSP was committed to go on to complete the genome with finished quality sequence." As per the meeting report, the IRGSP is on track to complete the genome at the Phase 2 level by the end of 2002; three chromosomes (1, 4, 10) are essentially complete; and significant advances have been made in the rice physical map, the size of the genome and the chromosomal structure. The Indian commitment to complete Phase 2 sequencing of its share of chromosome 11 (totalling about 12 Mbp) will be fulfilled, according to the UDSC centre. At the end of the year, in all likelihood, one will see a convergence of the different versions of genomic sequence. And the finished "gold standard" sequence could happen sooner than 2005 if Syngenta submits its data to the IRGSP or the merger becomes a reality.
