Sequencing strategy

Transcriptome version 5 (November 2013)

A new transcriptome assembly was generated by combining the output of four de novo assemblers: ABySS/Trans-ABySS, Trinity, SOAPdenovo-Trans and Velvet/Oases. The reads used to generate the version 3 transcriptome were supplemented with 100 nt paired-end and 50 nt single-end RNA-seq reads (non-strand-specific) from whole plants, generated on the Illumina platform.

Each assembler was run multiple times with a range of k-mer sizes and varying input read counts, and the resulting assemblies were combined to create a super-set of transcripts (manuscript in preparation). This super-set was then run through the EvidentialGene tr2aacds pipeline to select a best set of transcripts and simultaneously reduce redundancy. The pipeline classifies each transcript as either primary (the main transcript, usually the one with the longest CDS) or alternate (a possible isoform).

The super-set contained around 9.9 million transcripts. After EviGene processing, a total of 234,526 transcripts were obtained, of which 49,818 were classified as primary and 184,708 as alternate. These three sets of transcripts are now available in the BLAST databases and have been mapped with GMAP against the version 0.5 genome, where they can be viewed in GBrowse.

Genome version 0.5 (November 2013)

The version 0.5 assembly was generated in two steps.
First, the version 0.3 genome assembly was scaffolded using mate-pair libraries generated with the Illumina Nextera protocol. Three rounds of scaffolding were performed, using 4 kb, 6 kb and 8 kb insert sizes, to generate a version 0.4 assembly.

Second, the version 0.4 assembly was gap-filled using our original paired-end libraries. Another round of scaffolding was then performed as described above, followed by one more round of gap-filling. This generated our final version 0.5 assembly. Some statistics on the assembly are detailed below.

Sequences  Count   Sum         Min  Max      Mean   Median  N50     L50   % N's
Scaffolds  147949  2549549119  201  2628400  17233  517     396153  1936  1.56
Contigs    179402  2509981408  73   965197   13991  613     107227  6960  0.01
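
For readers unfamiliar with the N50 and L50 columns, the following is a minimal sketch of one common way to compute them from a list of sequence lengths. The function name `n50_l50` and the toy lengths are illustrative assumptions; this is not the script that produced the statistics above.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of sequence lengths.

    N50 is the length of the sequence at which the running total,
    taken from the longest sequence downwards, first reaches half of
    the summed length. L50 is how many sequences that took.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("no sequence lengths given")


# Toy example with hypothetical lengths (not data from the assembly)
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
n50, l50 = n50_l50(toy_lengths)
print(n50, l50)
```

In the toy example the lengths sum to 54, and the three longest sequences (10, 9 and 8) are enough to reach half of that, so N50 is 8 and L50 is 3.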


A CEGMA analysis shows an improvement in the percentage of complete core proteins detected, from 72.58% in v0.3 to 85.08% in v0.5.

Statistics of the completeness of the genome based on 248 CEGs. Prots = number of 248 ultra-conserved CEGs present in genome; %Completeness = percentage of 248 ultra-conserved CEGs present; Total = total number of CEGs present including putative orthologs; Average = average number of orthologs per CEG; %Ortho = percentage of detected CEGs that have more than 1 ortholog

Category  #Prots  %Completeness  #Total  Average  %Ortho
Complete  211     85.08          586     2.78     79.15
Group 1   54      81.82          131     2.43     74.07
Group 2   45      80.36          114     2.53     75.56
Group 3   53      86.89          156     2.94     84.91
Group 4   59      90.77          185     3.14     81.36

Partial   238     95.97          799     3.36     92.02
Group 1   64      96.97          178     2.78     85.94
Group 2   54      96.43          166     3.07     90.74
Group 3   59      96.72          211     3.58     93.22
Group 4   61      93.85          244     4.00     98.36
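
As a worked check of the %Completeness column, the sketch below recomputes the overall percentages from the #Prots counts and the 248 CEGs stated in the caption. The helper name `completeness` is an illustrative assumption and is not part of CEGMA.

```python
TOTAL_CEGS = 248  # number of ultra-conserved CEGs stated in the caption


def completeness(prots_found, total=TOTAL_CEGS):
    """Percentage of CEGs detected, rounded to two decimals."""
    return round(100 * prots_found / total, 2)


# Counts taken from the "Complete" and "Partial" rows of the v0.5 report
complete_pct = completeness(211)
partial_pct = completeness(238)
print(complete_pct, partial_pct)
```

The same calculation applied to the 180 complete CEGs reported for v0.3 and v0.2 gives 72.58.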

Transcriptome version 3 unigene set (September 2012)

When Trans-ABySS merges the k-mer assemblies from ABySS, smaller contigs are subsumed into larger contigs only if they are 100% identical. A single nucleotide difference therefore produces separate, and possibly genuinely distinct, contigs/transcripts, which introduces some redundancy into the assembly. Contigs from the current transcriptome assembly were clustered into a unigene data set with CD-HIT-EST at an identity threshold of 95%, which reduced the number of contigs from 237,340 to 119,014. This should help researchers select the most relevant transcript in a first-pass BLAST search of the assembly.
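
The clustering step can be sketched as follows. This is an assumption-laden illustration: the file names are hypothetical, only the `-i`, `-o` and `-c` options of `cd-hit-est` are used, and any other settings of the original run are not recorded here. The snippet only builds the command line and computes the reduction from the contig counts quoted above; it does not run CD-HIT-EST.

```python
def cd_hit_est_command(contigs_fasta, unigenes_fasta, identity=0.95):
    """Build an argument list for cd-hit-est.

    -i: input FASTA, -o: output FASTA of cluster representatives,
    -c: sequence identity threshold (0.95 corresponds to 95%).
    """
    return ["cd-hit-est", "-i", contigs_fasta, "-o", unigenes_fasta,
            "-c", str(identity)]


# Hypothetical file names, for illustration only
command = cd_hit_est_command("transcriptome_v3.fa", "unigenes_v3.fa")
print(" ".join(command))

# Reduction in contig number reported in the text
contigs_before = 237_340
contigs_after = 119_014
reduction_pct = 100 * (1 - contigs_after / contigs_before)
print(f"{reduction_pct:.1f}% fewer contigs")
```

The list returned by `cd_hit_est_command` could be passed to `subprocess.run` on a machine where CD-HIT is installed.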

Genome version 0.3 (August 2012)

Following analysis of some key genes of interest within the v0.2 assembly, several scaffolds were identified as mis-assembled. As a consequence, we re-assembled the existing data together with two new lanes of paired-end read data. We also switched from the Velvet assembler back to SOAPdenovo (as used for v0.1). The assembly selected for v0.3 was produced by SOAPdenovo with a k-mer setting of 65, followed by scaffolding with SSPACE using the available 2 kb insert mate-pair data. A single round of gap closure was performed using GapCloser. Analysis of the genes that were mis-assembled in v0.2 showed that they were correctly assembled in v0.3. The Hagfish tool was used to assess individual scaffolds for correctness based on paired-end read data.

Kx   N       SUM         MIN  1st-Quartile  Median  3rd-Quartile  Max     Mean       N50
K65  275036  2443539138  201  477           1385    9637          447128  8884.4338  31834

To assess the completeness of the assembly, we used the CEGMA method, which maps a set of proteins that are highly conserved across different taxa. Below is the CEGMA report for assembly v0.3.

Statistics of the completeness of the genome based on 248 CEGs. Prots = number of 248 ultra-conserved CEGs present in genome; %Completeness = percentage of 248 ultra-conserved CEGs present; Total = total number of CEGs present including putative orthologs; Average = average number of orthologs per CEG; %Ortho = percentage of detected CEGs that have more than 1 ortholog

Category  #Prots  %Completeness  #Total  Average  %Ortho
Complete  180     72.58          481     2.67     75.00
Group 1   49      74.24          111     2.27     63.27
Group 2   35      62.50          83      2.37     82.86
Group 3   44      72.13          126     2.86     75.00
Group 4   52      80.00          161     3.10     80.77

Partial   237     95.56          788     3.32     86.92
Group 1   63      95.45          185     2.94     79.37
Group 2   52      92.86          157     3.02     82.69
Group 3   61      100.00         205     3.36     88.52
Group 4   61      93.85          241     3.95     96.72

Genome version 0.2 (May 2012)

The 773 million paired-end reads were assembled with the Velvet short-read assembler, using a hash length of 59 and a minimum contig length of 201 bases. The Velvet contigs were then subjected to two iterations of gap closing using GapCloser. An additional 2 kb insert mate-pair library was then used for further scaffolding with SSPACE.

Kx   N       SUM         MIN  1st-Quartile  Median  3rd-Quartile  Max     Mean       N50
K59  339779  2426666590  201  443           1078    7442          292195  7141.8969  26227

To assess the completeness of the assembly, we used the CEGMA method, which maps a set of proteins that are highly conserved across different taxa. Below is the CEGMA report for assembly v0.2.

Statistics of the completeness of the genome based on 248 CEGs. Prots = number of 248 ultra-conserved CEGs present in genome; %Completeness = percentage of 248 ultra-conserved CEGs present; Total = total number of CEGs present including putative orthologs; Average = average number of orthologs per CEG; %Ortho = percentage of detected CEGs that have more than 1 ortholog

Category  #Prots  %Completeness  #Total  Average  %Ortho
Complete  180     72.58          467     2.59     70.00
Group 1   46      69.70          107     2.33     65.22
Group 2   35      62.50          84      2.40     68.57
Group 3   46      75.41          125     2.72     71.74
Group 4   53      81.54          151     2.85     73.58

Partial   229     92.34          765     3.34     88.65
Group 1   61      92.42          181     2.97     83.61
Group 2   49      87.50          147     3.00     85.71
Group 3   57      93.44          204     3.58     94.74
Group 4   62      95.38          233     3.76     90.32

Transcriptome version 3 (May 2012)

A total of 193 million RNA-seq reads (145 million paired, 48 million orphaned after filtering and trimming) from 9 tissue samples from our lab line (L) were used. Reads were generated on the Illumina HiSeq-2000 platform. Assembly was carried out with ABySS v1.3 and Trans-ABySS v1.1, using k-mer sizes from 58 to 80 in steps of 2. The merged assembly contained 237,340 contigs, with a median contig size of 510, a mean of 795 and a maximum of 14,845. For reference, the N50 was 1,350. The transcriptome assembly was annotated with an in-house annotation pipeline.
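
The k-mer sweep described above can be enumerated as in the short sketch below. It assumes that both end points (58 and 80) were included, which the text implies but does not state outright; the variable name `KMER_SIZES` is illustrative.

```python
# k-mer sizes from 58 to 80 in steps of 2, assuming both ends are included
KMER_SIZES = list(range(58, 81, 2))

print(KMER_SIZES)
print(len(KMER_SIZES), "assemblies to merge")
```

Under that assumption the sweep covers 12 k-mer sizes, one ABySS assembly per size, which Trans-ABySS then merges.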

Genome version 0.1 (March 2012)

The Illumina HiSeq-2000 platform was used to sequence the N. benthamiana genome. Genomic DNA was extracted from 1-week-old seedlings. The v0.1 draft assembly used 7 lanes totaling 773 million 100 nt paired-end reads (average insert size 415 bp), yielding 154 gigabases. After read trimming, SOAPdenovo v1.05 was used to assemble the dataset. The table below shows the assembly statistics for each k-mer setting.
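
The quoted yield can be checked with simple arithmetic. The sketch below assumes that "773 million" counts read pairs, so that each pair contributes two 100 nt reads; that interpretation is an assumption, chosen because it reproduces the stated 154 gigabases.

```python
read_pairs = 773_000_000   # assumed to be the number of read pairs
reads_per_pair = 2
read_length_nt = 100

total_bases = read_pairs * reads_per_pair * read_length_nt
total_gigabases = total_bases / 1e9
print(f"{total_gigabases:.1f} Gb")
```

If 773 million instead counted individual reads, the yield would be roughly half that figure.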

Kx               N        SUM         MIN  1st-Quartile  Median  3rd-Quartile  Max     Mean   N50
K35              3570278  2224442951  100  111           127     196           66604   623    4019
K45              1742715  2498984875  100  126           175     477           170760  1434   9576
K55              2712856  2538330488  100  111           111     223           158154  935.7  9820
K65              3199039  2523668725  100  131           172     438           58562   788.9  3820
K65+1gapclosure  3199039  2600857227  100  131           172     441           60316   813    4021
K75              2315049  938411051   100  151           151     515           51164   405.4  642

Software and tools references