Summary
Genetic reassortment plays a vital role in the evolution of the influenza virus and has historically been linked with the emergence of pandemic strains. Reassortment is believed to occur when a single host - typically swine - is simultaneously infected with multiple influenza strains. The reassorted viral strains with novel gene combinations tend to easily evade the immune system in other host species, satisfying the basic requirements of a virus with pandemic potential. Therefore, it is vital to continuously monitor the genetic content of circulating influenza strains and keep an eye out for new reassortants. We present a new approach to identify reassortants from large data sets of influenza whole genome nucleotide sequences and report the results of the first ever comprehensive search for reassortants of all published influenza A genomic data. 35 of the 52 well supported candidate reassortants we found are reported here for the first time while our analysis method offers new insight that enables us to draw a more detailed picture of the origin of some of the previously reported reassortants. A disproportionately high number (13/52) of the candidate reassortants found were the result of the introduction of novel hemagglutinin and/or neuraminidase genes into a previously circulating virus. The method described in this paper may contribute towards automating the task of routinely searching for reassortants among newly sequenced strains.
Introduction
The influenza A virus has caused substantial morbidity and mortality in humans as well as livestock. Many severe influenza epidemics have been documented in the past 300 years of world history, while four pandemics and numerous other epidemics have occurred in the past century (Kilbourne, 2006; Potter, 2001; Zimmer and Burke, 2009). Furthermore, each year seasonal influenza results in a considerable death toll worldwide. Spanish Flu, the first pandemic of the past century, killed around 50 million people and earned the honorary title “Mother of all pandemics” (Taubenberger and Morens, 2006). The evolutionary origin of the virus responsible for this calamity has always been subject to controversy and is still not fully resolved (Antonovics et al., 2006; Reid and Taubenberger, 2003; Taubenberger, 2006).
It has been long established that genetic reassortment of the eight RNA segments that constitute the influenza A virus may produce novel viruses in nature (Desselberger et al., 1978). This happens when two or more viruses coinfect the same cell and plays a very important role in the long term evolution of the virus as well as in the making of global influenza pandemics. In particular, reassortment events in pigs may give birth to novel viruses with pandemic potential in humans (Castrucci et al., 1993; Ma et al., 2008; Ma et al., 2009). In fact, all known pandemics in recent human history have been attributed as results of reassortment of genes between two or more distinct viruses, with concrete evidence in support of this fact for all except the 1918 Spanish Flu pandemic (Kawaoka et al., 1989; Lindstrom et al., 2004; Scholtissek et al., 1978; Smith et al., 2009a). This was only possible due to the availability of genetic information of contemporary and precursor viral isolates. Furthermore, there is independent multiple evidence that the pandemic H1N1/2009 virus has reassorted again in swine, possibly giving it a chance to escape herd immunity in humans (Ducatez et al., 2011; Vijaykrishna et al., 2010). Isolated human infections of such viruses have also been reported recently (Centers for Disease Control and Prevention (CDC), 2011).
There is little argument about the necessity of continuously monitoring circulating influenza viruses for the possible emergence of new reassortant influenza A viruses. With the number of full genome influenza sequences in publicly available databases growing exponentially in recent years (http://www.ncbi.nlm.nih.gov/genomes/FLU/growth.html), there is little concern about the availability of data as well. Furthermore, with astronomic advances in sequencing technology within the last few years (Schuster, 2008), it will not be long before it becomes not only possible, but also the most reasonable thing to do, to sequence each individual viral genome we would encounter or have access to. Computing infrastructure development would be the need of the hour when such a situation unfolds, as a lot of data would be of no use without ways and means to interpret them efficiently and accurately.
However, very few attempts have been made to automate the identification of reassortment in the influenza virus. A recent paper (Nagarajan and Kingsford, 2011) described an interesting method to identify reassortments which employs a graph mining technique to find topological incongruities between collections of Markov chain Monte Carlo-sampled trees from different segments. It seemed to be technically sound and may, by itself, even prove to be sufficiently robust in handling very large data sets. However, the computational cost of phylogeny reconstruction (MCMC-sampling in this case) is a formidable obstacle towards using phylogeny dependent methods for identifying reassortments from very large data sets. We aim to overcome this problem by formulating a phylogeny independent method which uses only the nucleotide distance matrices as input.
Virologists would usually check for reassortment in their full genome sequences by doing a homology search on a public influenza sequence database or by reconstructing the phylogenies with reference sequences for each of the segments (Holmes et al., 2005; Karasin et al., 2000; Lindstrom et al., 2004). If the most homologous strains do not match across all segments, the possibility of the query sequence belonging to a reassorted virus is considered high. Our algorithm is based on the same principle, but we focus on the ‘neighbours’ on the phylogenetic tree of each segment rather than doing a homology search. The neighbourhood is defined as a fixed number of closest neighbours of a given strain on a given segment phylogeny. The number of common elements in the neighbourhoods of the same taxa on two different segments denotes whether the two segments originated from a single parent or not. A very low common neighborhood size is very often a sign of reassortment.
Materials and Methods
Definition of reassortment
While genetic reassortment is a subject of wide interest in influenza research, it has not been clearly defined so far to our knowledge. We do not plan to give a rigourous mathematical definition here, but a solid image of what amounts to reassortment is imperative before setting out to detect them. Any influenza virus that has at least one pair of segments (out of all possible combinations of two out of eight) such that each segment is clearly derived from one of two distinct parents is what we call a reassortant - the direct product of a reassortment event - in this paper. It must be noted that, for all practical purposes, a direct progeny of such a reassortant may not be distinguishable from the original strain under this definition, particularly if the original strain has not been sampled. Thus, it is practically impossible to fully ascertain when or where the reassortment event took place, unless there is sufficient clinical data to establish the viral transmission pathways surrounding the event. Here, we base our results solely on nucleotide sequence data, and refrain from making any distinction between strains with near identical sequences or reassortment patterns.
Overview of the algorithm
For any given flu genome sequence, the only straightforward practical test for reassortment is to find for each segment the ‘closest’ strains out of a reference set and compare the results across segments. For each segment, the set of r closest strains (including self) is called the r-neighbourhood for that segment (Fig. 1). When the r-neighbourhoods of two segments are compared, the common elements in the two sets form the common neighborhood and its number of elements is referred to as the common neighborhood size. If this size is one, the two segments are deemed to have different ancestry; if it is more than one but comparatively smaller than r, there is still a high chance of different ancestry. We test all combinations of segment pairs and determine if a given strain is a reassortant or not. Highest sensitivity would be achieved by our method when all known complete genomic sequences are included in the reference set. We shall make a data set of all available complete genome sequences, and sequentially test each genome sequence using the rest of the set as the reference.
Preparation of nucleotide sequence data
All available full genome influenza A nucleotide sequences (as at 24th June 2011) were downloaded from the FTP servers of Influenza Virus Resource at NCBI (http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html). Duplicate sequences were removed within each segment, and incomplete genomes were further removed to have 9284 sequences for each segment (see supplementary material Table S1 for a complete list). The coding regions for M1, M2, NS1 and NS2 proteins were treated as four individual segments (segment 7–10 respectively) for ease of enumeration. Both coding regions were used for these two segments for added confidence. The coding region of PB1-F2 was left out due to its relatively short length and lack of annotation or presence in some lineages. For each of the 10 segments, multiple alignment was performed by MAFFT and the results were further enhanced by manual editing. Phylogenetic analysis of each major coding region of the eight RNA segments (M2 and NS2 are not included) was performed as a separate study (de Silva et al., 2012).
In order to minimize sampling bias, highly similar genome sequences (>97% in HA and NA; >98% in all other segments simultaneously) were removed, while retaining one genome to represent each class of such genome sequences. This resulted in a final dataset of 1670 representative sequences, on which all further analysis was performed.
Algorithm
The algorithm is described in detail along with definitions of the concepts involved. It is implemented by a Ruby script using Bioruby (Goto et al., 2010), which is available from the authors upon request. All calculations were performed on a Debian Linux server powered by 2 quad-core Intel Xeon X7560 / 2.27 GHz Processors with 264 GB RAM. The run time was approximately 16 hrs for the current data.
Step 0. Calculation of genetic distances
The genetic distances between all pairs of nucleotide sequences are calculated by Phylip dnadist version 3.67 using Jukes Cantor method and the respective distance matrices for each of the segments are consequently used in the analysis as described below.
Step 1. Determination of the common neighbourhoods
For strain t, let its r- neighbourhood in segment i=(1,..,10) be defined as where Ω denotes the complete set of viral strains and d(i) is the distance measure on segment i, and
where
and all the st,* are distinct. In other words, it is the set of r nearest strains to strain t with respect to the genetic distance calculated using the nucleotide sequence data from the i'th segment.
Mt is calculated for all strains, which in turn is examined for signs of reassortment. is the i,j- common neighbourhood size of strain t, which is the number of common elements in its r - neighbourhoods in segments i and j.
Parameter tweaking
The smaller the values of the , the stronger the argument for being a reassortant, but where to draw the line is not trivial. We experimented with different numbers and arrived at two separate search criteria, biased more towards sensitivity than specificity. The neighbourhood size r is 80 (approximately 5% of the number of taxa) in what follows.
Step 2. Reassortment search
Let , and
,
For each strain t, each row of its common neighbourhood matrix is inspected to see if it satisfies one or more of the following conditions.
If , and
1) if , t is a product of single segment reassortment.(S)
2) if +0.8r, t is a product of multiple segment reassortment.(M)
Here, “single segment reassortment” refers to the event where one segment is uniquely derived from a parent virus different from those of the other segments, while “multiple segment reassortment” refers to the event where more than one segment are derived from a common parent different from those of the other segments. A typical reassortment event may be a mixture of one or several reassortments of these two types.
Results and Discussion
Our common neighbourhood approach easily picked up reassortants whose parent sequences were sufficiently distant and where the reassortment had not become fixed in the population. The common neighbourhood matrix of A/Swine/Italy/1521/98(H1N2) is a good example (Table 1). In segment 4(HA), it has only two elements in common between its segment 4 neighbourhood and most of the other neighbourhoods, which is a good example of a clean reassortment event. Segment 6(NA) is very similar in its relationship with the other segments, with their biggest common neighbourhoods having eleven elements. The rest of the segments have a fairly high number of common ‘neighbours’ amongst themselves. Altogether, this virus has three parents: two of them contributing one segment each and the last contributing the remaining segments.
A comprehensive search of all available full genome sequence data (9284 strains represented by 1670 genome sequences) resulted in 280 hits (see supplementary material Table S2 for a complete list). These represented a total of 3086 influenza A strains in the original set, with the pandemic H1N1/2009 virus responsible for 2636 of them. Due to space constraints, we chose to show just 52 of them with the highest confidence by limiting to hits with 12 or more very small elements (size ≤2) in their respective common neighbourhood size matrices (Table 2). Some of the reassortants that we found have already been reported, in which case we have shown the reference or commented about the annotation. In some other cases, the sequences have been published in a journal, but the reassortment has not been explicitly declared. Altogether, 35 of the total of 52 reassortants are reported here for the first time, to the best of our knowledge.
A/California/04/09, the reference strain for the pandemic H1N1/2009 virus, was easily picked up by the algorithm notwithstanding the huge sampling bias, while its reassortment pattern (Smith et al., 2009a) was subsequently determined correctly (Table 3).
Of particular interest was an individual segment's propensity to reassort and acquire genetic information from a parent unique to itself or at most common to one more segment. The 7-1(eg:aaabaaaa) and 6-2(eg:aaababaa) reassortment patterns are the most typical of this kind. Segment 4 and 6, which code the HA and NA genes, tend to reassort in this way very often (13/52 instances).
Using our algorithm, we were able to identify further breakdowns in the ancestry of known reassortants. In A/Swine/Ontario/53518/03 (Karasin et al., 2006), for example, we found that PB2 - as well as the previously reported PB1 - was derived from a unique parent of its own. In the 2005 triple reassortant H3N2 viruses from Canada (Olsen et al., 2006), we found that the PB1 gene was of a lineage distinct from that of the other polymerase genes and close to that of HA. Moreover, as a by-product of our analysis, we found that these swine viral sequences were very similar to A/turkey/Ontario/31232/2005(H3N2), a contemporary avian virus from the same region, strongly suggesting cross species transmission. This finding was only possible owing to our comprehensive dataset spanning all hosts.
Similarly, it is quite trivial to find influenza sequences “frozen” in time. A/USSR/90/1977(H1N1), one of the first H1N1 isolates after its re-emergence in 1977 (Zimmer and Burke, 2009) after a 20 year lapse, happened to possess a genomic sequence very similar to that of A/Roma/1949(H1N1).
Furthermore, it proved to be powerful enough to analyse complex reassortment patterns within closely related sequences, when an appropriate data set is used. For example, the predicted reassortment patterns for Clade B (A/New York/32/2003, A/New York/198/2003 and A/New York/199/2003), Clade C (A/New York/52/2004 and A/New York/59/2004) and A/New York/11/2003 from a comprehensive phylogenetic study of 156 complete genomes of H3N2 influenza A collected between 1999 and 2004 from New York (Karasin et al., 2000) are a perfect match with the patterns that were previously inferred by examining their phylogenies (data not shown).
Sample bias is a major confounding factor in molecular evolutionary analyses, particularly so in our reassortment search. The number of isolates available from the first half of the 20th century is very scarce, making it difficult to determine evolutionary lineages. This is exacerbated by our fixed neighbourhood size of 80, which is too big for sparsely sampled lineages. We actually did not have any hits from that era.
In a preliminary analysis with a smaller data set, the oldest influenza A strain in the database, A/Brevig Mission/1/1918, was picked up by our algorithm, in spite of the fact that, by definition, its ancestry and reassortment history could not be directly determined by the available data. This is a result of our neighbourhoods consisting of both ancestors and descendants, when only ancestors define a given strain's reassortment history. This would potentially pose problems in highly reassortment driven lineages. For example, A/Goose/Guangdong/1/96 (Gs/Gd/1/96), the precursor of the recent HPAI H5N1 lineage in Asia (Li et al., 2004) has passed various combinations of its gene segments to a few generations of multiple reassortants, which did adversely affect our grouping of its own segments by ancestry. Direct descendants could also negatively affect the output when the reassorted genes get fixed in the population. Conversely, such direct descendants of reassortant strains may be wrongly selected as reassortants themselves.
The ideal solution for this problem is to have only ancestors in the neighbourhoods. However, it is not possible to distinguish ancestors from descendants from our distance matrices alone. It would require the construction of all the phylogenies with additional assumptions about the relative rate of evolution on each branch.
Minor topological and distance inconsistencies may occur across segments in phylogenies even without a reassortment event, due to stochastic errors and limitations in distance estimation methods. We need to allow for such minor inconsistencies so that our algorithm does not wrongly pick up strains that are in fact not reassortants. To this end, we must avoid too small a neighbourhood size, thereby allowing movements upto a certain degree to occur without being considered as results of reassortment. Too large a neighbourhood size would, on the other hand, not detect small movements that are actually reassortment driven and may give distorted results when the immediate surroundings are sparsely sampled. After much deliberation, we decided to use a neighbourhood size of approximately 5% of the data set, which seemed to work reasonably well. Perhaps, a neighbourhood size that varies across lineages by sampling density would be a potential improvement to our algorithm.
The property of common ancestry should be transitive over all segments in order to group the segments by ancestry without confusion. (ie. if i and j have common ancestry, and j and k have common ancestry, then i and k should also have common ancestry). Nevertheless, many of our results do not satisfy this criteria, which is no wonder given the fact that we use a common cutoff value for all segment combinations and all lineages. Hence, we have had to assume transitivity in some cases before assigning the ancestry of each segment. (ie. we assume i and k have common ancestry even if the common neighbourhood size falls below the cutoff, provided there exists a segment j that has common ancestry with both of them).
We have tried to reconstruct the phylogenies for our data using MrBayes (Huelsenbeck and Ronquist, 2001) as described in the GiRaF paper (Nagarajan and Kingsford, 2011) and found the computation time till convergence with sufficient mixing to be at least in the order of months per segment on a single processor machine (data not shown). It seems inevitable that we would have to settle for phylogeny independent methods at current processor speeds, when doing comprehensive analyses of influenza genomic data. One such earlier method (Rabadan et al., 2008) seemed to perform well in detecting reassortants within lineages, but no comprehensive study has been undertaken to date using this method.
In this paper, we demonstrate our algorithm using a comprehensive complete genome data set, and strive to find the reassortants within that data set while using the same data for reference. The same algorithm may be used to check any given new influenza A strain with a complete genome sequence for reassortment. If this algorithm is to be used for that purpose, it is imperative that the reference data set is always maintained up to date. We believe this method could be efficiently utilized for rapidly testing high throughput sequence data if the need arises.
Acknowledgements
The authors gratefully acknowledge an open access fee waiver from the publisher for this work. U. C. de S. is thankful for the support from his employer, Immunology Frontier Research Centre, Osaka University, and N. S. is grateful for funding from the Japan Initiative for Global Research Network on Infectious Diseases (J-GRID).
References
Competing Interests
The authors declare no competing interests.