- Transcriptome version 5
- Genome version 0.5
- Transcriptome version 3 unigene set
- Genome version 0.3
- Genome version 0.2
- Transcriptome version 3
- Genome version 0.1
A new transcriptome assembly was generated by combining the output of four de novo assemblers: ABySS/Trans-ABySS, Trinity, SOAPdenovo-Trans and Velvet/Oases. The reads used to generate the version 3 transcriptome were supplemented with 100 nt paired-end and 50 nt single-end RNA-seq reads (non-strand-specific) from whole plants, generated on the Illumina platform.
Each assembler was run multiple times with a range of k-mer sizes and varying input read counts, and the resulting assemblies were combined into a super-set of transcripts (manuscript in preparation). This super-set was then run through the EvidentialGene tr2aacds pipeline to select a best set of transcripts and simultaneously reduce redundancy. The pipeline classifies transcripts as primary (the main, usually longest, CDS) or alternate (possible isoforms).
The super-set contained around 9.9 million transcripts. After EviGene processing, a total of 234,526 transcripts remained, of which 49,818 were classified as primary and 184,708 as alternate. These three sets of transcripts are now available in the BLAST databases and have been mapped against the version 0.5 genome with GMAP for display in GBrowse.
The version 0.5 assembly was generated in two steps.
First, the version 0.3 genome assembly was scaffolded using mate-pair libraries generated with the Illumina Nextera protocol. Three rounds of scaffolding were performed, using 4 kb, 6 kb and 8 kb insert sizes, to generate a version 0.4 assembly.
The version 0.4 assembly was then gap-filled using our original paired-end libraries. Another round of scaffolding was then performed as described above, followed by one more round of gap-filling. This produced our final version 0.5 assembly. Some statistics on the assembly are detailed below.
| # scaffolds | Sum | Min | Max | Mean | Median | N50 | L50 | % N's |
| # contigs | Sum | Min | Max | Mean | Median | N50 | L50 | % N's |
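The column headings above are standard assembly summary statistics. As a rough illustration of how such figures are conventionally computed (not necessarily how the values for this assembly were produced), the sketch below derives them from a list of sequences. The function name `assembly_stats` and the toy input are hypothetical. N50 is taken here as the length of the shortest sequence in the smallest set of longest sequences that covers at least half the total length, and L50 as the number of sequences in that set; some tools use these two labels the other way round.

```python
from statistics import mean, median


def assembly_stats(seqs):
    """Summarise a non-empty list of sequences (strings)."""
    lengths = sorted((len(s) for s in seqs), reverse=True)
    total = sum(lengths)

    # Walk from the longest sequence down until half the total length is covered.
    running, n50, l50 = 0, 0, 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running * 2 >= total:
            n50, l50 = length, count
            break

    # Count ambiguous bases (N) across all sequences.
    n_bases = sum(s.upper().count("N") for s in seqs)
    return {
        "count": len(lengths),
        "sum": total,
        "min": lengths[-1],
        "max": lengths[0],
        "mean": mean(lengths),
        "median": median(lengths),
        "N50": n50,
        "L50": l50,
        "pct_N": 100.0 * n_bases / total,
    }


# Toy example (made-up sequences, not data from this assembly).
toy = ["ACGT" * 25, "ACGTNNNNNN" * 5, "AC" * 10, "A" * 30]
stats = assembly_stats(toy)
print(stats)
```

In practice the sequences would be read from the scaffold or contig FASTA file; contig-level figures are usually obtained by splitting scaffolds at runs of N first.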
A CEGMA analysis shows an improvement in the percentage of complete core proteins detected.
When Trans-ABySS merges the per-k-mer assemblies from ABySS, it subsumes smaller contigs into larger ones only where they are 100% identical. A single-nucleotide difference therefore keeps contigs separate, which preserves possibly distinct transcripts but also introduces some redundancy into the assembly. Contigs from the version 3 transcriptome assembly were clustered into a unigene data set with CD-HIT-EST at an identity threshold of 95%, which reduced the number of contigs from 237,340 to 119,014. This should help researchers select the most relevant transcript in a first-pass BLAST search of the assembly.
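The exact command used for the clustering is not given here, so the following is only a minimal sketch of what a CD-HIT-EST call at 95% identity might look like. The file names are hypothetical; only the identity threshold and the contig counts come from the text. The helper builds the command line without running it, and a second helper works out the size of the reduction.

```python
import shlex


def cd_hit_est_command(fasta_in, fasta_out, identity=0.95):
    """Build (but do not run) a cd-hit-est command line as an argument list."""
    # -i: input FASTA, -o: output FASTA of cluster representatives,
    # -c: sequence identity threshold.
    return ["cd-hit-est", "-i", fasta_in, "-o", fasta_out, "-c", f"{identity:.2f}"]


def percent_reduction(before, after):
    """Percentage by which the contig count dropped."""
    return 100.0 * (before - after) / before


# Hypothetical file names; the threshold matches the 95% quoted above.
cmd = cd_hit_est_command("transcriptome_v3.fasta", "transcriptome_v3_unigenes.fasta")
print(shlex.join(cmd))

# Contig counts before and after clustering, as reported above.
print(f"{percent_reduction(237_340, 119_014):.1f}% fewer contigs")
```

With the reported counts, clustering removed roughly half of the contigs.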
Analysis of some key genes of interest in the v0.2 assembly identified several mis-assembled scaffolds. As a consequence, we re-assembled the existing data together with two new lanes of paired-end reads. We also switched from the Velvet assembler back to SOAPdenovo (as used for v0.1). The assembly selected for v0.3 was produced by SOAPdenovo with a k-mer setting of 65, followed by scaffolding with SSPACE using the available 2 kb long-insert mate-pair data. A single round of gap closure was performed with GapCloser. The genes that were mis-assembled in v0.2 were found to be correctly assembled in v0.3. The Hagfish tool was used to assess individual scaffolds for correctness based on paired-end read data.
To assess the completeness of the assembly, we used the CEGMA method, which maps a set of proteins that are highly conserved across different taxa. Below is the CEGMA report for assembly v0.3.
The 773 million paired-end reads were assembled with the Velvet short-read assembler, using a hash length of 59 and a minimum contig length of 201 bases. The Velvet contigs were then subjected to two iterations of gap closing with GapCloser. An additional mate-pair library with a 2 kb insert length was then used for further scaffolding with SSPACE.
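To make the two Velvet parameters above concrete, here is a hedged sketch of how they map onto the `velveth` and `velvetg` command lines. The directory and read file names are hypothetical, the reads are assumed to be in a single interleaved FASTQ file, and any other options that were used for the real assembly are not known and are left out. The commands are only built, not executed.

```python
def velvet_commands(out_dir, reads_fastq, hash_length=59, min_contig_length=201):
    """Return the velveth and velvetg command lines as argument lists."""
    # velveth builds the k-mer hash table for the given hash length.
    velveth = ["velveth", out_dir, str(hash_length), "-fastq", "-shortPaired", reads_fastq]
    # velvetg builds the graph and writes contigs above the length cutoff.
    velvetg = ["velvetg", out_dir, "-min_contig_lgth", str(min_contig_length)]
    return velveth, velvetg


# Hypothetical directory and file names.
velveth_cmd, velvetg_cmd = velvet_commands("velvet_k59", "reads_interleaved.fastq")
print(" ".join(velveth_cmd))
print(" ".join(velvetg_cmd))
```

Note that a default Velvet build caps the hash length at 31, so a hash length of 59 needs a binary compiled with a larger `MAXKMERLENGTH`.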
To assess the completeness of the assembly, we used the CEGMA method, which maps a set of proteins that are highly conserved across different taxa. Below is the CEGMA report for assembly v0.2.
A total of 193 million RNA-seq reads (145 million paired, 48 million orphaned after filtering and trimming) from nine tissue samples of our lab line (L) were used. Reads were generated on the Illumina HiSeq-2000 platform. Assembly was carried out with ABySS v1.3 and Trans-ABySS v1.1, using k-mer sizes from 58 to 80 in steps of 2. The merged assembly contained 237,340 contigs, with a median contig size of 510, a mean of 795, a maximum of 14,845 and an N50 of 1,350. The transcriptome assembly was annotated with an in-house annotation pipeline.
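As a small worked check of the figures above, the snippet below enumerates the k-mer sizes in the sweep and confirms the read totals. It assumes one ABySS assembly per k-mer size, which the text implies but does not state outright.

```python
# k-mer sizes from 58 to 80 inclusive, in steps of 2.
kmer_sizes = list(range(58, 81, 2))
print(kmer_sizes)
print(len(kmer_sizes), "k-mer sizes")

# Read counts in millions: paired plus orphaned should equal the quoted total.
paired_millions = 145
orphaned_millions = 48
total_millions = paired_millions + orphaned_millions
print(total_millions, "million reads")
```

Under that assumption, Trans-ABySS would have merged twelve per-k assemblies.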
The Illumina HiSeq-2000 platform was used to sequence the N. benthamiana genome. Genomic DNA was extracted from 1-week-old seedlings. The v0.1 draft assembly used 7 lanes totalling 773 million 100 nt paired-end reads (average insert size 415 bp), yielding 154 gigabases. After read trimming, the dataset was assembled with SOAPdenovo v1.05. The table below shows the assembly statistics.
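The quoted yield can be reproduced with simple arithmetic if "773 million" is read as the number of read pairs rather than individual reads. That reading is an assumption, made here only because it matches the stated 154 gigabases.

```python
read_pairs = 773_000_000   # assumption: the quoted figure counts read pairs
reads_per_pair = 2
read_length_nt = 100

total_bases = read_pairs * reads_per_pair * read_length_nt
print(f"{total_bases / 1e9:.1f} Gb")  # 154.6 Gb
```

Counting 773 million individual reads instead would give about 77 Gb, half the quoted figure.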