- Transcriptome version 3 unigene set
- Genome version 0.3
- Genome version 0.2
- Transcriptome version 3
- Genome version 0.1
When Trans-ABySS merges the k-mer assemblies produced by ABySS, it subsumes a smaller contig into a larger one only if the smaller contig is 100% identical to part of it. A single-nucleotide difference therefore yields separate, possibly distinct contigs/transcripts, which also introduces some redundancy into the assembly. To reduce this redundancy, contigs from the current transcriptome assembly were clustered into a unigene data set with CD-HIT-EST at an identity threshold of 95%, which reduced the number of contigs from 237,340 to 119,014 (roughly half). This should help researchers select the most relevant transcript in a first-pass BLAST search of the assembly.
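The effect of exact-match subsumption can be illustrated with a toy Python sketch. It is not the Trans-ABySS implementation: the function name `subsume_contained` and the example sequences are hypothetical, and for simplicity the sketch ignores reverse complements. It only shows why a contig that is an exact substring of a longer one disappears, while a contig with a single mismatch survives as a separate entry.

```python
def subsume_contained(contigs):
    """Drop contigs that occur verbatim inside a longer kept contig.

    Toy illustration only (hypothetical helper, forward strand only).
    """
    kept = []
    # Remove exact duplicates while preserving input order.
    unique = list(dict.fromkeys(contigs))
    # Visit longest contigs first so any potential container is already kept.
    for seq in sorted(unique, key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


contigs = [
    "ACGTACGTGGA",
    "GTACGTG",  # exact substring of the first contig: subsumed
    "GTACCTG",  # differs by one nucleotide: kept as a separate contig
]
result = subsume_contained(contigs)
print(result)
```

In this sketch the one-mismatch contig remains in the output, which is the kind of redundancy that a later clustering step at a lower identity threshold is meant to collapse.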
Analysis of some key genes of interest in the v0.2 assembly identified several mis-assembled scaffolds. We therefore re-assembled the existing data together with two additional lanes of paired-end read data. We also switched from the Velvet assembler back to SOAPdenovo (as used for v0.1). The assembly selected for v0.3 was produced by SOAPdenovo with a k-mer size of 65, followed by scaffolding with SSPACE using the available 2 kb long-insert mate-pair data. A single round of gap closure was performed with GapCloser. The genes that were mis-assembled in v0.2 were found to be correctly assembled in v0.3. The Hagfish tool was used to assess the correctness of individual scaffolds based on paired-end read data.
To assess the completeness of the assembly, we used the CEGMA method, which maps a set of proteins that are highly conserved across different taxa. Below is the CEGMA report for assembly v0.3.
The 773 million paired-end reads were assembled with the Velvet short-read assembler, using a hash length of 59 and a minimum contig length of 201 bases. The Velvet contigs were then subjected to two iterations of gap closing with GapCloser. An additional mate-pair library with a 2 kb insert length was then used for further scaffolding with SSPACE.
To assess the completeness of the assembly, we used the CEGMA method, which maps a set of proteins that are highly conserved across different taxa. Below is the CEGMA report for assembly v0.2.
A total of 193 million RNA-seq reads (145 million paired, 48 million orphaned after filtering and trimming) from 9 tissue samples from our lab line (L) were used. Reads were generated on the Illumina HiSeq-2000 platform. Assembly was carried out with ABySS v1.3 and Trans-ABySS v1.1, with k-mer sizes from 58 to 80 in steps of 2. The merged assembly contained 237,340 contigs, with a median contig size of 510 bases, a mean of 795 bases and a maximum of 14,845 bases. For reference, the N50 was 1,350 bases. The transcriptome assembly was annotated with an in-house annotation pipeline.
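For readers unfamiliar with these summary statistics, the following minimal Python sketch shows how median, mean and N50 are conventionally computed from a list of contig lengths. The function name `n50` and the example lengths are illustrative only; they are not taken from the assembly or from the pipeline used here.

```python
from statistics import mean, median


def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    total = sum(lengths)
    running = 0
    # Accumulate from the longest contig downwards.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")


# Made-up contig lengths, for illustration only.
example_lengths = [100, 200, 300, 400, 500]
print(median(example_lengths), mean(example_lengths), n50(example_lengths))
```

Because N50 weights contigs by the bases they contain, it is typically larger than the median or the mean when an assembly has many short contigs, as is the case for the figures reported above.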
The Illumina HiSeq-2000 platform was used to sequence the N. benthamiana genome. Genomic DNA was extracted from 1-week-old seedlings. The v0.1 draft assembly used 7 lanes totaling 773 million 100 nt paired-end reads (average insert size 415 bp), yielding 154 gigabases. After read trimming, SOAPdenovo v1.05 was used to assemble the dataset. The table below shows the assembly statistics.
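The quoted sequencing yield can be checked with simple arithmetic. The sketch below assumes that the 773 million figure counts read pairs, each contributing two 100 nt reads; this is an interpretation that the text does not state explicitly, but it is the reading under which the total matches the quoted yield.

```python
# Assumption: 773 million is the number of read PAIRS (two reads per pair).
read_pairs = 773_000_000
reads_per_pair = 2
read_length_nt = 100

total_bases = read_pairs * reads_per_pair * read_length_nt
gigabases = total_bases / 1e9

print(f"{gigabases:.1f} Gb")
```

Under this assumption the raw yield is about 154.6 gigabases, consistent with the 154 gigabases given above; counting 773 million single reads instead would give only half that amount.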