Sequencing strategy

Transcriptome version 3 unigene set (September 2012)

When Trans-Abyss merges the k-mer assemblies produced by Abyss, it subsumes smaller contigs into larger ones only when they are 100% identical. A single-nucleotide difference therefore yields separate contigs, which preserves possibly distinct transcripts but also introduces some redundancy into the assembly. To reduce this redundancy, contigs from the current transcriptome assembly were clustered into a unigene data set with CD-HIT-EST at a 95% identity threshold, which reduced the number of contigs from 237,340 to 119,014. This should help researchers select the most relevant transcript in a first-pass BLAST search of the assembly.
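As a rough illustration of the clustering step, the sketch below builds a CD-HIT-EST command line at 95% identity and recomputes the reduction in contig count quoted above. The file names and helper functions are hypothetical. The word length (`-n 10`) is an assumption based on the CD-HIT user guide's recommendation for thresholds of 0.95 and above, since the options actually used for the unigene set are not recorded here.

```python
import shlex


def cdhit_est_command(fasta_in, fasta_out, identity=0.95, word_length=10):
    """Build (but do not run) a cd-hit-est command line.

    -i, -o, -c and -n are standard cd-hit-est options. Apart from the 95%
    identity threshold, the values are assumptions for illustration.
    """
    return [
        "cd-hit-est",
        "-i", fasta_in,
        "-o", fasta_out,
        "-c", f"{identity:.2f}",
        "-n", str(word_length),
    ]


def percent_reduction(before, after):
    """Percentage of contigs removed by clustering."""
    return 100.0 * (before - after) / before


# Hypothetical file names; the contig counts are the ones quoted in the text.
cmd = cdhit_est_command("transcriptome_v3.fa", "unigenes_v3.fa")
print(shlex.join(cmd))
print(f"{percent_reduction(237340, 119014):.1f}% fewer contigs")  # 49.9% fewer contigs
```

In other words, clustering at 95% identity roughly halves the number of sequences a researcher has to consider in a first search.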

Genome version: 0.3 (August 2012)

Analysis of some key genes of interest in the v0.2 assembly identified several scaffolds as mis-assembled. We therefore re-assembled the existing data together with two new lanes of paired-end reads, and switched from the Velvet assembler back to SOAPDenovo (as used for v0.1). The assembly selected for v0.3 was produced by SOAPDenovo with a k-mer size of 65, followed by scaffolding with SSPACE using the available 2 kb long-insert mate-pair data. A single round of gap closure was performed with GapCloser. The genes that were mis-assembled in v0.2 were found to be correctly assembled in v0.3. The Hagfish tool was used to assess the correctness of individual scaffolds based on paired-end read data.

Kx   N       SUM         MIN  1st-Quartile  Median  3rd-Quartile  Max     Mean       N50
K65  275036  2443539138  201  477           1385    9637          447128  8884.4338  31834
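For readers unfamiliar with the columns, the following minimal sketch shows how such summary statistics can be computed from a list of sequence lengths. It assumes the conventional definition of N50 (the length L such that sequences of length L or longer contain at least half of all bases). The tool that produced the table is not named here, so its exact conventions may differ, and the input lengths below are toy values rather than real data.

```python
from statistics import median


def n50(lengths):
    """Length L such that sequences >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequence lengths given")


def assembly_stats(lengths):
    """Summary statistics named after the table columns."""
    return {
        "N": len(lengths),
        "SUM": sum(lengths),
        "MIN": min(lengths),
        "Median": median(lengths),
        "Max": max(lengths),
        "Mean": sum(lengths) / len(lengths),
        "N50": n50(lengths),
    }


# Toy scaffold lengths, not taken from any assembly.
toy = [2, 3, 4, 5, 6, 10]
print(assembly_stats(toy))
```

As a consistency check on the table itself, Mean is SUM divided by N: 2,443,539,138 / 275,036 is approximately 8884.43, matching the Mean column.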

To assess the completeness of the assembly, we used the CEGMA method, which maps a set of proteins that are highly conserved across different taxa. Below is the CEGMA report for assembly v0.3.

Statistics of the completeness of the genome based on 248 CEGs. #Prots = number of the 248 ultra-conserved CEGs present in the genome; %Completeness = percentage of the 248 ultra-conserved CEGs present; #Total = total number of CEGs present, including putative orthologs; Average = average number of orthologs per CEG; %Ortho = percentage of detected CEGs that have more than one ortholog.
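The derived columns in this legend are simple ratios, and a short sketch makes the relationship explicit. The function names are hypothetical, and the inputs are the Complete and Partial totals from the v0.3 report.

```python
TOTAL_CEGS = 248  # size of the CEG set named in the legend


def completeness(prots, total_cegs=TOTAL_CEGS):
    """%Completeness: share of the CEG set found in the assembly."""
    return 100.0 * prots / total_cegs


def average_orthologs(total_hits, prots):
    """Average: hits (including putative orthologs) per detected CEG."""
    return total_hits / prots


# Values from the Complete and Partial rows of the v0.3 report.
print(f"{completeness(180):.2f}")            # 72.58
print(f"{completeness(237):.2f}")            # 95.56
print(f"{average_orthologs(481, 180):.2f}")  # 2.67
```

The per-group percentages use each group's own CEG count as the denominator rather than 248, which is why they differ from the overall figures.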
 

Category  #Prots  %Completeness  #Total  Average  %Ortho
Complete  180     72.58          481     2.67     75.00
Group 1   49      74.24          111     2.27     63.27
Group 2   35      62.50          83      2.37     82.86
Group 3   44      72.13          126     2.86     75.00
Group 4   52      80.00          161     3.10     80.77

Partial   237     95.56          788     3.32     86.92
Group 1   63      95.45          185     2.94     79.37
Group 2   52      92.86          157     3.02     82.69
Group 3   61      100.00         205     3.36     88.52
Group 4   61      93.85          241     3.95     96.72

Genome version: 0.2 (May 2012)

The 773 million paired-end reads were assembled with the Velvet short-read assembler, using a hash length of 59 and a minimum contig length of 201 bases. The Velvet contigs were then subjected to two iterations of gap closing with GapCloser. An additional mate-pair library with a 2 kb insert length was then used for further scaffolding with SSPACE.

Kx   N       SUM         MIN  1st-Quartile  Median  3rd-Quartile  Max     Mean       N50
K59  339779  2426666590  201  443           1078    7442          292195  7141.8969  26227

To assess the completeness of the assembly, we used the CEGMA method, which maps a set of proteins that are highly conserved across different taxa. Below is the CEGMA report for assembly v0.2.

Statistics of the completeness of the genome based on 248 CEGs. #Prots = number of the 248 ultra-conserved CEGs present in the genome; %Completeness = percentage of the 248 ultra-conserved CEGs present; #Total = total number of CEGs present, including putative orthologs; Average = average number of orthologs per CEG; %Ortho = percentage of detected CEGs that have more than one ortholog.
 

Category  #Prots  %Completeness  #Total  Average  %Ortho
Complete  180     72.58          467     2.59     70.00
Group 1   46      69.70          107     2.33     65.22
Group 2   35      62.50          84      2.40     68.57
Group 3   46      75.41          125     2.72     71.74
Group 4   53      81.54          151     2.85     73.58

Partial   229     92.34          765     3.34     88.65
Group 1   61      92.42          181     2.97     83.61
Group 2   49      87.50          147     3.00     85.71
Group 3   57      93.44          204     3.58     94.74
Group 4   62      95.38          233     3.76     90.32

Transcriptome version: 3 (May 2012)

A total of 193 million RNA-seq reads (145 million paired and 48 million orphaned after filtering and trimming) from 9 tissue samples of our lab line (L) were used. Reads were generated on the Illumina HiSeq-2000 platform. Assembly was carried out with Abyss v1.3 and Trans-Abyss v1.1, using k-mer sizes from 58 to 80 in steps of 2. The merged assembly contained 237,340 contigs, with a median contig size of 510, a mean of 795 and a maximum of 14,845. For reference, the N50 was 1,350. The transcriptome assembly was annotated with an in-house annotation pipeline.
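To make the k-mer sweep concrete, the sketch below enumerates the k values and prints one `abyss-pe` invocation per value. It assumes the range 58 to 80 is inclusive at both ends. The `k=`, `name=` and `in=` parameters are standard `abyss-pe` arguments, but the read file names, the output prefix and the helper functions are hypothetical, and the options actually used (including the handling of orphaned reads and the Trans-Abyss merge) are not shown here.

```python
def kmer_sweep(start=58, stop=80, step=2):
    """k-mer sizes from start to stop, inclusive (an assumption)."""
    return list(range(start, stop + 1, step))


def abyss_commands(kmers, reads_1, reads_2, prefix="tx"):
    """One abyss-pe invocation per k; file names and prefix are hypothetical."""
    return [
        f"abyss-pe k={k} name={prefix}_k{k} in='{reads_1} {reads_2}'"
        for k in kmers
    ]


kmers = kmer_sweep()
print(len(kmers), kmers[0], kmers[-1])  # 12 58 80
for command in abyss_commands(kmers, "reads_1.fq", "reads_2.fq")[:2]:
    print(command)
```

Under that assumption the sweep produces 12 separate assemblies, which is why the merged result needs the redundancy handling described for the unigene set.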

Genome version: 0.1 (March 2012)

The Illumina HiSeq-2000 platform was used to sequence the N. benthamiana genome. Genomic DNA was extracted from 1-week-old seedlings. The v0.1 draft assembly used 7 lanes totaling 773 million 100 nt paired-end reads (average insert size 415 bp), yielding 154 gigabases. After read trimming, the dataset was assembled with SOAPDenovo v1.05. The table below shows the assembly statistics.

Kx               N        SUM         MIN  1st-Quartile  Median  3rd-Quartile  Max     Mean   N50
K35              3570278  2224442951  100  111           127     196           66604   623    4019
K45              1742715  2498984875  100  126           175     477           170760  1434   9576
K55              2712856  2538330488  100  111           111     223           158154  935.7  9820
K65              3199039  2523668725  100  131           172     438           58562   788.9  3820
K65+1gapclosure  3199039  2600857227  100  131           172     441           60316   813    4021
K75              2315049  938411051   100  151           151     515           51164   405.4  642
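The read yield quoted for v0.1 can be checked with a short calculation. This is a sketch under one explicit assumption: that "773 million" counts read pairs, so each contributes two 100 nt reads. The depth figure is relative to the assembled span of the K65+1gapclosure row, not to an independent genome size estimate, and it uses raw rather than trimmed reads.

```python
READ_PAIRS = 773_000_000  # assumption: "773 million" counts read pairs
READ_LENGTH = 100         # nt per read, from the text
READS_PER_PAIR = 2

total_bases = READ_PAIRS * READS_PER_PAIR * READ_LENGTH
print(f"{total_bases / 1e9:.1f} Gb")  # 154.6 Gb

# Raw depth relative to the assembled span (SUM of the K65+1gapclosure row).
assembled_span = 2_600_857_227
print(f"{total_bases / assembled_span:.0f}x")  # 59x
```

Under that assumption the total is about 154.6 Gb, in line with the 154 gigabases quoted above.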

Software and tools references