Figure 1 From Seqotron: A User-friendly Sequence Editor For Mac


SEED 2: a user-friendly platform for amplicon high-throughput sequencing data analyses. Sequence viewing and editing tools such as Seqotron (Fourment and Holmes, 2016), UGENE. A user-friendly sequence editor for Mac OS X. Notes, 9, 106. Sequences and detection frequency for each editing event (Figure 1C and 1D). Our laboratory has recently applied CRISPR-based genome editing to lignin biosynthesis perturbations in Populus. A gene-specific guide RNA (gRNA) was designed to target 4-coumarate:CoA ligase 1 (4CL1), but not the paralogous 4CL5 (Zhou et al., 2015).


The Psychrobacter genus is a cosmopolitan and diverse group of aerobic, cold-adapted, Gram-negative bacteria exhibiting biotechnological potential for low-temperature applications including bioremediation. Here, we present the draft genome sequence of a bacterium from the Psychrobacter genus isolated from a sediment sample from King George Island, Antarctica (3,490,622 bp; 18 scaffolds; G + C = 42.76%).
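The summary values quoted for a draft assembly (total length, scaffold count, G + C content) can be derived directly from the scaffold sequences. The following is a minimal sketch, not the authors' pipeline; the function name `assembly_stats` and the toy sequences are hypothetical, and excluding ambiguous bases from the G + C denominator is an assumption of this sketch.

```python
def assembly_stats(scaffolds):
    """Return (total_bp, n_scaffolds, gc_percent) for a list of scaffold sequences."""
    total_bp = sum(len(s) for s in scaffolds)
    gc = 0
    acgt = 0
    for seq in scaffolds:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        # Assumption: only unambiguous bases enter the denominator,
        # so runs of N inside scaffolds do not dilute the G + C value.
        acgt += sum(s.count(base) for base in "ACGT")
    gc_percent = round(100.0 * gc / acgt, 2) if acgt else 0.0
    return total_bp, len(scaffolds), gc_percent


# Toy input (hypothetical sequences, not the BNF20 assembly).
print(assembly_stats(["ATGC", "GGCCNN", "AATT"]))
```

Applied to a real multi-FASTA file, the same function would take one string per scaffold.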

Using phylogenetic analysis, biochemical properties and scanning electron microscopy, the bacterium was identified as Psychrobacter glacincola BNF20, making it the first genome sequence reported for this species. P. glacincola BNF20 showed high tellurite (MIC 2.3 mM) and chromate (MIC 6.0 mM) resistance.

Genome-wide nucleotide identity comparisons revealed that P. glacincola BNF20 is highly similar (90%) to other uncharacterized Psychrobacter spp. such as JCM18903, JCM18902, and P11F6. Bayesian multi-locus phylogenetic analysis showed that P. glacincola BNF20 belongs to a polyphyletic clade with other bacteria isolated from polar regions. A high number of genes related to metal(loid) resistance were found, including tellurite resistance genetic determinants located in two contigs: contig LIQB01000002.1 exhibited five ter genes (terACDEZ), each showing putative promoter sequences, whereas contig LIQB1000003.2 showed a variant of the terZ gene. Finally, investigating the presence and taxonomic distribution of ter genes in the NCBI’s RefSeq bacterial database (5,398 genomes, as of January 2017) revealed that 2,623 (48.59%) genomes showed at least one ter gene.


At the family level, most (68.7%) genomes harbored one ter gene and 15.6% exhibited five (including P. glacincola BNF20). Overall, our results highlight the diverse nature (genetic and geographic diversity) of the Psychrobacter genus, provide insights into potential mechanisms of metal resistance, and exemplify the benefits of sampling remote locations for prospecting new molecular determinants.

Figure 1: Phylogenetic, morphological and genomic characteristics of P. glacincola BNF20. (A) Scanning electron micrograph showing the morphology and dimensions of P. glacincola BNF20. Samples were stained with uranyl acetate (0.5% w/v) and examined using a low-voltage electron microscope (Delong Instruments, LVEM5), with a nominal operating voltage of 5 kV. Bar represents 10 µm. (B) Phylogenetic tree of P. glacincola BNF20 based on the partial 16S rRNA gene sequence (Accession number ). The Psychrobacter ingroup was rooted using Moraxella osloensis DSM 6978T as outgroup. (C) Circular map of the 18-scaffold draft genome with coding sequences colored by COG categories. Inner circles represent GC skew and GC content.


P. glacincola BNF20 tolerates high tellurite and chromate concentrations

Several tests were carried out to determine if BNF20 was resistant to multiple metals. Besides tellurite (used in the initial selection), P. glacincola BNF20 was 4 times more resistant to chromate than the sensitive strain E. coli BW25113 under optimal growth conditions. P. glacincola BNF20 growth was impaired in the presence of all other metal(loid)s tested, including Cu²⁺, Cd²⁺, Hg²⁺, Zn²⁺, AuCl₄⁻, Ni²⁺, AsO₄²⁻, AsO₃⁻, and Ag⁺.

First draft genome of Psychrobacter glacincola BNF20

Previous studies showed that P. glacincola BNF20 was highly resistant to tellurite (MIC ∼2.3 mM). Although tellurite reduction is often accompanied by the formation of black deposits of elemental tellurium in resistant organisms, this phenotype was not observed in P. glacincola BNF20.

To further investigate the mechanism(s) of tellurite resistance in P. glacincola BNF20, we sequenced the whole genome in search of genetic determinants implicated in metal(loid) resistance. The assembled genome of P. glacincola BNF20 consisted of 3,490,622 bp in 18 scaffolds, with an average G + C content of 42.76%. The predicted proteome scored 100% completeness according to the presence of highly conserved ortholog genes in bacteria (BUSCO analysis). A set of 47 tRNA genes and one copy of the rRNA operon were also identified. From a total of 2,968 predicted CDS, 2,872 (96.7%) ORFs matched coding sequences available in public databases, of which 2,515 were assigned (84.7%) or not (352 CDS, 19.31%) to COG categories.

P. glacincola BNF20 is evolutionarily divergent from other Antarctic Psychrobacter isolates

The genome sequence of P. glacincola BNF20 was compared to 34 other available genomic sequences by estimating ANI values and performing a multi-locus phylogenetic analysis. Besides P. glacincola BNF20, the full dataset was composed of 10 named and 24 unnamed Psychrobacter species. P. glacincola BNF20 exhibited an average nucleotide identity of 95% and an alignment fraction of over 80% with 3 isolates designated as Psychrobacter sp. JCM18903 (GCA000586475.1), Psychrobacter sp. JCM 18902 (GCA000586455.1) and Psychrobacter sp. P11F6 (GCA001435295.1), none of which was isolated from Antarctica. We did not find any genome comparison against BNF20 of 96.5% ANI and 60% alignment fraction, which has been suggested as a “genomic boundary” for bacterial species. While some of the available genomes come from Antarctic isolates, none of them showed high ANI values (90%): P. aquaticus (85%; GCA000471625.1); P. alimentarius (85%; GCA001606025.1); P. urativorans (85%; GCA001298525.1); TB15 (84%, GCA000511655.1), G (86%, GCA000418305.1); PAMC 21119 (86%, GCA000247495.2); TB2 (84%, GCA000508345.1); TB47 (86%, GCA000511045.1); TB67 (86%, GCA000511065.1) and AC24 (86%, GCA000511635.1).

Figure 2: Whole genome nucleotide identity and multi-locus phylogenetic analysis.

(A) Average nucleotide identity (ANI) in the 35-genome Psychrobacter dataset. P. glacincola BNF20 forms a cluster with three other Psychrobacter genomes with an alignment fraction over 80%. (B) Bayesian multi-locus phylogenetic analysis of the genomic sequences from the indicated Psychrobacter members.

Taxa are colored by geographic location. Node values correspond to posterior probabilities, and the phylogeny was mid-point rooted.

Supporting our previous results, multi-locus phylogenetic analysis showed that P. glacincola BNF20 is more closely related to P11F6 (isolated from tunicate ascidians from the Arctic) and to JCM 18902 and 18903 (isolated from the frozen porpoise Neophocaena phocaenoides). The Antarctic isolates PAMC21119 and G (from King George Island) belong to a polyphyletic group and do not form a monophyletic clade with P. glacincola BNF20, highlighting the heterogeneous nature of the Psychrobacter genus. All nodes of the phylogeny were well supported (posterior probability 0.99).

P. glacincola BNF20 encodes multiple metal resistance determinants

As P. glacincola BNF20 was isolated from King George Island sediments, a place where heavy metal contamination has not been previously reported, we searched for genes known to be involved in metal resistance that could explain the observed tellurite and chromate resistance of strain BNF20 (BacMet database). The type and copy number distribution of these genes was not uniform in the 35-genome Psychrobacter dataset. Specifically, ∼100 genes possibly conferring metal resistance were identified in the genome of P. glacincola BNF20, of which some are related to chromate resistance, including chrL (BAC0361; regulatory protein, involved in chromate resistance), chrR (BAC0538; chromate reductase), mdrL/yfmO (BAC0209; multidrug efflux protein yfmO) and ruvB (BAC0355; ATP-dependent DNA helicase), and some to tellurite resistance (the so-called ter genes), including terA (BAC0386), terC (BAC0388), terD (BAC0389), terE (BAC0390) and terZ (BAC0392). Two other genes apparently involved in tellurite resistance, ruvB (BAC0355; ATP-dependent DNA helicase) and pitA (BAC0312; low-affinity inorganic phosphate transporter 1), were also identified.

Organization of ter genes in P. glacincola BNF20

Given that (i) tellurite is by far more toxic for bacteria than other metals and (ii) it is scarce in the Earth’s crust, finding tellurite resistance determinants in P. glacincola BNF20 was somewhat unexpected. Since the presence of ter genes in Antarctic microorganisms has not been reported to date, we focused the following analyses on them.

The ter genes were originally described as part of an E. coli operon exhibiting the terZABCDE structure. P. glacincola BNF20 harbors terA, terZ, terE, terC and terD orthologs, but not terB; terA shows the opposite transcriptional orientation to the rest of the ter genes, while terZ is duplicated, with the copies located in different contigs.

In addition, the expression of all ter genes in P. glacincola BNF20 seems to be regulated by individual promoters (PromPredict and BPROM analyses), suggesting that they are organized as a gene cluster rather than as an operon. Three members of the Psychrobacter genus contained one ter gene (P. phenylpyruvicus (terZ, GCA000685805.1), P. lutiphocae (terZ, GCA000382145.1) and P. ENNN9 III (terD, GCA001462175.1)), while the rest had different combinations of them.

In P. glacincola BNF20, the context of the ter gene cluster is similar to that of other isolates such as Psychrobacter sp. JCM 18902 (GCA000586455.1), Psychrobacter sp. G (GCA000418305.1), Psychrobacter sp. TB67 (GCA000511065.1), Psychrobacter sp. AC24 (GCA000511635.1), Psychrobacter sp. TB47 (GCA000511045.1) and P. arcticus 273-4 (GCA000012305.1).

Interestingly, in all analyzed Psychrobacter genomes the ter gene cluster also contains a gene encoding a protein of the TIGR00266 family (unknown function).

Ter genes are distributed over several bacterial phyla

To determine the frequency of ter genes in known bacterial genomes, their taxonomic distribution was evaluated. In general, ter genes are more commonly found in Gram-positive than in Gram-negative bacteria. Using NCBI’s RefSeq database (5,000 genomes; accessed January 2017), we found that 48.59% of them contained ter genes (26 out of 30 bacterial phyla). While, at the genus level, most genomes had one ter gene (67.95%), others harbored two (2.31%), three (0.69%), four (5.24%), six (4.61%) or seven (1.15%) ter genes. Interestingly, the second most abundant number of ter genes per genome was five (18.04%), which could suggest evolutionary constraints.

At the phylum level most Proteobacteria contain one ter gene, with a few exceptions showing up to 7, including Yersiniaceae, Morganellaceae, Enterobacteriaceae and Erwiniaceae. A similar pattern is observed in other phyla, except for Firmicutes, where genomes exhibit a defined array of ter genes. Interestingly, while members of the best represented family in RefSeq, i.e., Streptomycetaceae (149 genomes), exhibit five or six ter genes, in other well-represented families such as Flavobacteriaceae only 26 out of 114 genomes exhibit five ter genes (23%). Within the Moraxellaceae family, nine out of 45 genomes show five ter genes (20%, including BNF20), which agrees with the complete family database distribution (∼18% with 5 ter genes).

Discussion

Here we report for the first time the genome sequence of a P. glacincola strain isolated from Antarctica, which can tolerate high concentrations of tellurite and chromate. P. glacincola BNF20 was 4- and 500-fold more resistant to chromate and to the tellurium oxyanion tellurite, respectively, than E. coli BW25113.

Previous studies showed that defined toxicants can trigger common responses or repair mechanisms, suggesting that tellurite and chromate resistance could be related. Besides tellurite and chromate, the P. glacincola BNF20 genome encodes resistance determinants associated with a number of other heavy metal(loid)s such as arsenic, cadmium, copper and mercury. Interestingly, tellurite resistance in P. glacincola BNF20 did not correlate with strong tellurite reduction, as previously reported, which prompted us to search for genes associated with tellurite resistance in its genome. Identifying these genetic resistance determinants could be useful, as the Psychrobacter genus has been proposed as a good candidate for biotechnological applications including bioremediation. Members of the Psychrobacter genus are versatile and have been isolated from different low-temperature environments, including Antarctica, as well as from animal hosts and tissues including skin, fish gills and guts, and human blood, among others.

However, isolates from similar environments show high genomic variability, as evidenced by ANI analysis. A multi-locus phylogenetic analysis revealed that Antarctic Psychrobacter isolates do not form a monophyletic group. In this context, the presence of ter genes is correlated to some extent with their genomic structure. In fact, higher ANI values reflected a more similar ter gene organization.

P. glacincola BNF20 exhibited a ter gene organization very close to that of its three closest relatives, Psychrobacter sp. P11F6, JCM18902 and JCM18903.
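The ANI comparisons used throughout can be turned into a simple species-boundary check. The sketch below assumes that the "genomic boundary" quoted earlier is read as ANI of at least 96.5% together with an alignment fraction (AF) of at least 60%; the direction of both comparisons, the function names and the AF value given for TB15 are assumptions for illustration only.

```python
def same_species(ani, af, ani_min=96.5, af_min=60.0):
    """True when both values reach the assumed species boundary."""
    return ani >= ani_min and af >= af_min


def rank_relatives(comparisons, ani_min=96.5, af_min=60.0):
    """comparisons maps genome name -> (ANI %, AF %); returns rows sorted by ANI."""
    ranked = sorted(comparisons.items(), key=lambda kv: kv[1][0], reverse=True)
    return [
        (name, ani, af, same_species(ani, af, ani_min, af_min))
        for name, (ani, af) in ranked
    ]


# Illustrative values; the AF for TB15 is hypothetical.
pairs = {"TB15": (84.0, 50.0), "JCM18903": (95.0, 80.0)}
for row in rank_relatives(pairs):
    print(row)
```

Under these assumptions neither genome would be placed in the same species as the query, which matches the text's statement that no comparison reached the boundary.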

Psychrophilic and psychrotolerant microorganisms require several genes to increase their phenotypic flexibility to survive in extreme environments such as cold habitats. Thus, in addition to genes associated with cold shock proteins and membrane fluidity, among others, the presence of metal(loid) resistance genes seems to favor their adaptation. This is also the case for P. glacincola BNF20, which harbors over 100 putative metal resistance genes. Although this high number of genes would in principle predict resistance to a number of metal(loid)s, MIC determinations showed that P. glacincola BNF20 was only resistant to chromate and tellurite (MIC 6 and 2.3 mM, respectively).

Chromate resistance genes included chrI (regulatory protein of Ralstonia metallidurans CH34), chrR (encoding a chromate reductase), mdrL/yfmO (multidrug efflux transporter in Listeria monocytogenes) and ruvB, encoding a DNA helicase involved in both chromate and tellurite resistance in P. aeruginosa PAO1. Genes related to tellurite resistance identified in P. glacincola BNF20 included the phosphate transporter pitA and a cluster of ter genes composed of terA, terZ, terC, terD and terE, which exhibit a different organization compared to other ter gene clusters previously described. Although ter refers to tellurite resistance, the same genes participate in resistance to phages, colicins and other oxidative stress-generating antimicrobials, which could be the result of transcriptional control by a common regulator, OxyR. A number of reasons may explain the observed discordances between MIC values (e.g., for Hg, Cu and As) and the respective resistance genes identified in this bacterium.

For instance, P. glacincola BNF20 sensitivity to mercury could be a result of the absence of some genes (e.g., merT) belonging to the mer operon, which could render it non-functional. Similarly, the absence of the cusS gene (Cu sensor) in the P. glacincola BNF20 genome could be responsible for its copper sensitivity, in spite of the presence of other genes that participate in Cu homeostasis. Tellurite resistance-associated ter genes are grouped in three different families: (i) TerC, encompassing transmembrane proteins, (ii) TerD, which includes the cytoplasmic paralogs TerD, TerA, TerE, TerF and TerZ, and (iii) TerB, representing proteins that are directly associated with the inner surface of the cell membrane, although they also have a cytoplasmic localization. As mentioned, TerC interacts with TerD, TerB and other proteins showing different cell functions.

Most bacteria carrying ter genes display a similar transcriptional organization. Thus, terZABCDEF, terZABCDE and terABD, present in E. coli O157:H7, Proteus sp. and D. radiodurans, respectively, are operons. The Psychrobacter genus represents an exception to this rule, with terA lying in the opposite transcriptional orientation. Transcriptomic and proteomic assays have shown that terB is expressed when E. coli and D. radiodurans are exposed to tellurite. TerB seems to be essential for tellurite resistance and interacts with some cytoplasmic proteins such as the alpha subunit of ATP synthase, the G subunit of the NADH-dependent quinone oxidoreductase and the DnaK chaperone, among others.

Given that P. glacincola BNF20 lacks terB, we hypothesize that another gene product must mediate tellurite resistance.

Based on their genetic background, ter genes have also been classified into different groups (I–IV). In this context, and given its similarity to the ter genes found in Psychrobacter sp., P. glacincola BNF20 would belong to group I, which contains a gene encoding a protein exhibiting the AIM24 domain, also found in the P. glacincola BNF20 TIGR00266 protein. Although no role has been ascribed to it in prokaryotes, in higher organisms it is an internal membrane protein related to mitochondrial biogenesis that is required for yeast respiration.

The AIM24 domain exhibits a double beta-helix fold and is frequently found in genes neighboring TerD, suggesting that both proteins could interact. Deciphering the origins of bacterial operons is not straightforward, and several hypotheses try to explain their formation. An interesting example is the piecewise model, which states that the his operon (hisGDCBHAFIE) was gradually formed. Phylogenetic analyses of the his genes in the phylum Proteobacteria showed their progressive grouping, which suggests that they were located in nearby zones of the chromosome in closely related microorganisms. Subsequent events ended with the formation of the hisBHAF central core and the whole operon. A hypothesis to test in the future is whether the ter operon has a similar evolutionary origin.

To evaluate the taxonomic distribution of ter genes among bacteria, the 5,398 genomes retrieved from the NCBI’s RefSeq bacterial database were screened. About 48.6% of them (2,623 genomes) were found to contain ter genes. At the family level, most (68.7%) harbored one ter gene (chiefly terC) and 15.6% exhibited five (including P. glacincola BNF20). At the class level, the numbers of genomes exhibiting at least one ter gene were 379 for Gammaproteobacteria, 253 for Alphaproteobacteria and 247 for Bacilli. Finally, regarding phyla, Proteobacteria, Actinobacteria and Firmicutes had 867, 854 and 361 genomes containing at least one ter gene, respectively. Within the Proteobacteria phylum, most families had only one ter gene, while others had up to 7 (Morganellaceae, Yersiniaceae), 6 (Chromatiaceae, Budviciaceae), 5 (Moraxellaceae, Burkholderiaceae), 4 (Erythrobacteraceae), etc.
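The two kinds of percentages reported here (the share of genomes carrying any ter gene, and the distribution of ter gene counts among carriers) can be tallied as in the sketch below. The function name `ter_summary` and the toy genome table are hypothetical; only the 2,623 and 5,398 figures come from the text.

```python
from collections import Counter


def ter_summary(ter_counts):
    """ter_counts maps genome -> number of ter genes found in it.

    Returns (percent of genomes with >= 1 ter gene,
             {n_ter_genes: percent of carrier genomes}).
    """
    if not ter_counts:
        return 0.0, {}
    carriers = [n for n in ter_counts.values() if n > 0]
    share = 100.0 * len(carriers) / len(ter_counts)
    tally = Counter(carriers)
    by_count = {n: 100.0 * c / len(carriers) for n, c in sorted(tally.items())}
    return share, by_count


# Hypothetical toy table of five genomes.
toy = {"g1": 0, "g2": 1, "g3": 1, "g4": 1, "g5": 5}
print(ter_summary(toy))

# The headline figure from the text: 2,623 of 5,398 genomes.
print(round(100 * 2623 / 5398, 2))
```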

In this context, it would be interesting to carry out phylogenetic analyses to understand the evolution of these ter genes and how the currently known terZABCDEF operon was formed. Finally, it was found that, in general, Gram-positive microorganisms contain more ter genes than Gram-negative bacteria. This is interesting because it is generally accepted that they also show higher tellurite resistance.

For instance, the Streptomyces and Bacillus genera comprise 137 and 65 genomes, respectively, carrying up to 5–6 ter genes, suggesting that ter gene copy number could be related to the high resistance to tellurite observed in S. coelicolor and Geobacillus stearothermophilus.

Conclusions

A new species of Antarctic bacteria exhibiting high tellurite resistance was isolated and identified as P. glacincola BNF20. Although within the genus the percent of sequence coverage is low, its genomic sequence is similar to other uncharacterized genomes and contains a large number of genes implicated in metal(loid) resistance, especially chromate and tellurite. The transcriptional orientation of the tellurite resistance (ter) genes in P. glacincola BNF20 differs from that described in other microorganisms, and they most likely do not function as an operon. The wide distribution of ter genes in the bacterial world suggests that they play an important physiological role.

Background

Next generation sequencing (NGS) technologies have substantially increased sequence output while dramatically reducing costs. In addition to its use in whole genome sequencing, the 454 GS-FLX platform is becoming a widely used tool for biodiversity surveys based on amplicon sequencing. In order to use NGS for biodiversity surveys, software tools are required that perform quality control, trimming of the sequence reads, removal of PCR primers, and generation of input files for downstream analyses. A user-friendly software utility that carries out these steps is still lacking.

Findings

We developed CANGS (Cleaning and Analyzing Next Generation Sequences), a flexible and user-friendly integrated software utility. CANGS is designed for amplicon-based biodiversity surveys using the 454 sequencing platform.

CANGS filters low quality sequences, removes PCR primers, filters singletons, identifies barcodes, and generates input files for downstream analyses. The downstream analyses rely either on third party software (e.g., rarefaction analyses) or CANGS-specific scripts.

The latter include modules linking 454 sequences with the name of the closest taxonomic reference retrieved from the NCBI database and the sequence divergence between them. Our software can be easily adapted to handle sequencing projects with different amplicon sizes, primer sequences, and quality thresholds, which makes this software especially useful for non-bioinformaticians. Next generation sequencing technologies have dramatically increased the sequence output at a substantially reduced cost. In addition to genome sequencing and transcriptome profiling, ultra-deep sequencing of short amplicons offers an enormous potential in clinical studies and in studies of ecological diversity. PCR amplicons of more than 400 bp can be sequenced in a massively parallel manner which allows building a fine-grained catalog of species abundance patterns in a broad range of habitats. This increase in the amount of sequence data requires efficient software tools for processing the raw data generated by next generation sequencers. We developed CANGS - a flexible and user-friendly utility to trim sequences, filter low quality sequences, and produce input files for further downstream analyses.

CANGS can be used to assign the taxonomic grouping based on similarity with sequences from the NCBI database. CANGS has been developed for Mac OS X but it also works on Linux and any other Unix system. CANGS can be obtained from. See additional file for the source code of CANGS, additional file for test dataset of CANGS and additional file for the CANGS user manual.

Figure 1: The architecture of the CANGS utility. The four major components of CANGS are tsfs.pl (Trimming Sequences and Filtering Sequences), ta.pl (Taxonomy Analysis), ba.pl (Blast Analysis) and ra.pl (Rarefaction Analysis). All four components read their inputs from a single configuration file, 'CANGSOptions.txt'. Required programs are BLAST for the similarity search and MAFFT for pairwise distance calculation. MOTHUR and Analytic Rarefaction are needed for estimation of the number of species (OTUs), and updateblastdb.pl is required for downloading the BLAST database to a local computer.

Schema for processing and analyzing 454 GS-FLX sequences

Figure shows the way in which the CANGS utility processes 454 sequence data sets. The arrows illustrate the path of data flow. As a preparation step for CANGS, the options file CANGSOptions.txt needs to be customized. This file allows the user to specify all parameters needed for the processing of the 454 sequences.

CANGS provides two layers of analysis. The Sequence Processing Layer is the first step, in which tsfs.pl trims the sequences (removal of PCR primers, adapter sequence and sample identifiers) and filters low quality sequences (sequences with Ns, singletons, and sequences with very low average quality score). Using the user-defined parameters in the options file, tsfs.pl creates two high quality processed sequence data sets: 1) redundant sequences and 2) non-redundant sequences. The second step is the Sequence Analysis Layer, in which three different programs are available to assign the newly sequenced reads to a taxonomic group (ta.pl), estimate the change in species composition among different samples (ba.pl), and measure species richness (ra.pl).

CANGS components

CANGS input customization

CANGS configuration file - CANGSOptions.txt. CANGS was designed to allow high flexibility for the user.

In the options file the user defines the parameters that will be used by all CANGS modules. This simplified customization increases the usability and integration of the utility because the multiple programs can reference a single options file.

The parameters include BLAST cutoff values, quality scores, PCR primers, barcodes, size range of PCR products etc.

Sequence trimming and quality filtering

The tsfs.pl (Trim Sequences and Filter low quality Sequences) program automates the processing of raw 454 sequences. The goal of tsfs.pl is to obtain high quality reads from pooled 454 sequences by trimming the raw sequences and filtering low quality reads, which is done in seven steps.

  1. Removal of adapter B: based on the sequence of adapter B, as specified in the CANGSOptions.txt file, the 3' end of each read is trimmed. It is possible to process only sequences with a perfect match to adapter B, but a pattern search that allows for imperfection in adapter B recovers more sequences.
  2. Filtering sequences with ambiguities: tsfs.pl removes reads with one or more Ns (unknown bases).
  3. Removal of singletons: to ameliorate the problem of sequencing errors, tsfs.pl allows the user to remove very low frequency variants from the data set. Note that several data sets could be combined to minimize the removal of true low frequency sequence variants.
  4. Grouping of sequences according to bar codes: tsfs.pl distinguishes different samples based on the bar codes specified in the CANGSOptions.txt file and separates them into different data sets. This step is skipped when only a single sample is processed.
  5. Filtering sequences according to length threshold: the tsfs.pl program removes sequence reads falling outside the size range specified in the options file.
  6. Removal of PCR primers: forward and reverse PCR primers are specified in the CANGSOptions.txt file and removed from the sequence. Only sequences with perfect identity to the specified PCR primers are processed. The 454 sequencing process preferentially generates length variants in homopolymers. As homopolymers can be as short as two bases and the target sequence is frequently not known, we developed a special procedure to recognize such sequencing errors at the end of the PCR primer: for all sequences with the same PCR primers, the tsfs.pl program scans 8 bp of the target sequence immediately adjacent to the PCR primer and identifies the most frequent 8 bp motif. Next, this consensus sequence is compared with the +1 and -1 offset of each sequence. For sequences with no 454 homopolymer mutation, both the +1 and -1 offsets result in many mismatches, but a read with a 454 homopolymer mutation at the end of the PCR primer will be very similar to either the +1 or -1 offset. We empirically determined that filtering reads with.

To demonstrate the utility of CANGS, we used 454 sequences, which have been deposited in the NCBI database NCBI: SRA008706.2. This data set consists of 447,909 reads from the 18S rRNA gene obtained from 10 temporal freshwater samples.
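The offset comparison used to detect homopolymer errors next to the PCR primer can be sketched as follows. This is an illustration written from the description above, not the tsfs.pl code: the function names, the `max_mismatches` cutoff and the rule that a shifted window must also beat the unshifted one are assumptions, and reads are assumed to start with a perfect copy of the primer.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def flag_homopolymer_shifts(reads, primer, k=8, max_mismatches=1):
    """Return (consensus k-mer, reads that match it only at a +1 or -1 offset)."""
    p = len(primer)
    windows = [r[p:p + k] for r in reads if len(r) >= p + k]
    if not windows:
        return None, []
    # Most frequent k bp motif immediately downstream of the primer.
    consensus = Counter(windows).most_common(1)[0][0]
    flagged = []
    for r in reads:
        if p < 1 or len(r) < p + k + 1:
            continue  # too short to evaluate both offsets
        unshifted = hamming(r[p:p + k], consensus)
        shifted = min(
            hamming(r[p + 1:p + 1 + k], consensus),  # +1 offset
            hamming(r[p - 1:p - 1 + k], consensus),  # -1 offset
        )
        # Assumed decision rule: a near-perfect match that appears only after shifting.
        if shifted <= max_mismatches and shifted < unshifted:
            flagged.append(r)
    return consensus, flagged


primer = "ACGT"
good = "ACGT" + "GATTACAGTT"
extra_base = "ACGTT" + "GATTACAGTT"   # one extra T after the primer
missing_base = "ACGT" + "ATTACAGTT"   # first target base lost
consensus, flagged = flag_homopolymer_shifts(
    [good, good, good, extra_base, missing_base], primer
)
print(consensus, flagged)
```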

Applied to our example data set, the tsfs.pl program eliminated approximately 37% of all sequences (Table ). Hence a total of 281,003 (63%) sequences could be used for downstream analyses. On Macintosh OS X version 10.6.2 with a single processor, CANGS (tsfs.pl) takes 6.5 hours to process this data set. If the user skips the removal of singletons, the tsfs.pl program takes only 20 minutes for the same data set.
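The read-level filters that produce these counts (ambiguity filter, singleton removal, barcode grouping and length window) can be sketched in a few lines. This is a simplified illustration rather than the tsfs.pl implementation: the function names, the `min_copies` parameter and the assumption that the barcode sits at the 5' end of the read are all hypothetical.

```python
from collections import Counter


def group_by_barcode(reads, barcodes):
    """Split reads by sample; barcodes maps sample name -> barcode (assumed 5' prefix)."""
    groups = {sample: [] for sample in barcodes}
    for read in reads:
        for sample, tag in barcodes.items():
            if read.startswith(tag):
                groups[sample].append(read[len(tag):])  # strip the barcode
                break
    return groups


def filter_reads(reads, min_len, max_len, min_copies=2):
    """Return (redundant reads, {sequence: copy number}) after the three filters."""
    # 1. Drop reads containing ambiguous bases.
    no_n = [r for r in reads if "N" not in r.upper()]
    # 2. Drop sequences seen fewer than min_copies times (singletons by default).
    counts = Counter(no_n)
    frequent = [r for r in no_n if counts[r] >= min_copies]
    # 3. Keep reads inside the expected amplicon size range.
    redundant = [r for r in frequent if min_len <= len(r) <= max_len]
    non_redundant = dict(Counter(redundant))
    return redundant, non_redundant


reads = ["ACGTAC", "ACGTAC", "ACNTAC", "ACGTACGTACGT", "ACGTACGTACGT", "GGGTTT"]
print(filter_reads(reads, min_len=4, max_len=8))
print(group_by_barcode(["AAGGT", "CCGGT", "AATTT"], {"s1": "AA", "s2": "CC"}))
```

Returning both the redundant list and the copy-number table mirrors the two processed data sets described for tsfs.pl.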

The ta.pl (Taxonomy Analysis) program classifies the processed 454 reads by assessing their similarity to taxonomic entries in the NCBI database. This analysis requires the preformatted nucleotide BLAST database to be installed, which is done using the perl program 'updateblastdb.pl'. The script ta.pl BLASTs the non-redundant sequences against this database. In a second step the best hit(s) from the BLAST search are used to retrieve the taxonomic path, either for all sequences or only for a taxonomic group of interest. In the case of multiple best hits with identical E-value, this program selects the hit with the most detailed taxonomic classification and links it with the non-redundant query sequence. If CANGS identifies a conflict, we provide the option to assign taxonomic status by the majority rule.
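The tie-breaking rule for equally good BLAST hits can be illustrated with a small sketch. It is not the ta.pl code: the hit representation (an E-value paired with a taxonomic path), the reading of "most detailed" as "longest path" and the exact form of the majority rule are assumptions made for illustration.

```python
from collections import Counter


def pick_hit(hits, majority_rule=False):
    """hits is a list of (evalue, taxonomic_path) pairs; returns the chosen path as a tuple."""
    best_e = min(e for e, _ in hits)
    tied = [tuple(path) for e, path in hits if e == best_e]
    if majority_rule:
        # Assumed majority rule: the path reported by most tied hits wins,
        # with path depth as a secondary criterion.
        counts = Counter(tied)
        return max(counts, key=lambda p: (counts[p], len(p)))
    # Default: the most detailed (here, longest) taxonomic path among tied hits.
    return max(tied, key=len)


hits = [
    (1e-50, ["Eukaryota", "Ciliophora"]),
    (1e-50, ["Eukaryota", "Ciliophora", "Oligohymenophorea"]),
    (1e-50, ["Eukaryota", "Ciliophora"]),
    (1e-20, ["Eukaryota", "Chlorophyta", "Chlorophyceae", "Chlamydomonadales"]),
]
print(pick_hit(hits))
print(pick_hit(hits, majority_rule=True))
```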

The partial output of this program is shown in Figure. Studies of species diversity are frequently designed to compare species richness and species composition among different samples. The ba.pl (Blast Analysis) program performs a BLAST analysis of non-redundant sequences in one sample against non-redundant sequences in any number of other samples. For user convenience, the ba.pl software automatically generates the BLAST database(s) required for the analyses. The output of the BLAST search(es) is parsed and a tabular output is created.

As it may be of interest to group sequences with different similarities, the ba.pl program can be customized to group sequences up to a specified similarity. A similarity cutoff of 100 should be used to group only identical sequences (ignoring gaps).

In the tabular output, the number of sequences shared between the two data sets is reported for every species as shown in Figure. The similarities given in the output are calculated as follows. Hence, gaps are not considered.
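One way to compute a similarity that ignores gaps is sketched below. The definition used here (identical positions divided by aligned columns that contain no gap in either sequence, times 100) is an assumption for illustration and should not be read as the formula used by ba.pl.

```python
def similarity_ignoring_gaps(a, b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


print(similarity_ignoring_gaps("ACG-TA", "ACGGTA"))  # gap column is skipped
print(similarity_ignoring_gaps("ACGT", "ACGA"))
```

Under this definition, two sequences that differ only by an insertion or deletion reach a similarity of 100 and would be grouped together at the strictest cutoff.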

Rarefaction Analysis ra.pl

Several software packages exist for performing rarefaction analysis. The ra.pl (Rarefaction Analysis) program links the data processed by CANGS with two popular rarefaction analysis software packages with minimal user intervention: MOTHUR and Analytic Rarefaction. For MOTHUR, ra.pl calculates the pairwise genetic distances using the 'mafft-distance' program of the MAFFT executables. The mafft-distance program takes the non-redundant sequences generated by tsfs.pl as input and gives the corresponding genetic distance table as output. For the Analytic Rarefaction software, ra.pl first calculates the abundance of each sequence in the data set using BLAST, as described above.

Compared to pattern matching, this procedure allows sequences with gaps to be considered jointly.
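Once per-sequence abundances are available, the expected richness of a subsample can be computed with the standard analytic rarefaction formula, E[S_n] = Σ_i (1 − C(N − N_i, n) / C(N, n)), where N_i is the abundance of type i and N the total number of reads. The sketch below implements that textbook formula; it is not taken from MOTHUR, Analytic Rarefaction or ra.pl, and the function name and toy abundances are hypothetical.

```python
from math import comb


def expected_richness(abundances, n):
    """Expected number of distinct types in a random subsample of n reads."""
    total = sum(abundances)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the total number of reads")
    denom = comb(total, n)
    # comb(total - a, n) is 0 whenever the subsample cannot avoid type a.
    return sum(1 - comb(total - a, n) / denom for a in abundances)


abundances = [5, 3, 2]  # toy copy numbers for three sequence types
curve = [(n, round(expected_richness(abundances, n), 3)) for n in range(0, 11, 2)]
print(curve)
```

Plotting such (n, expected richness) pairs gives the rarefaction curve used to compare species richness between samples of different sequencing depth.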