Bioinformatics Tools

Trans-NanoSim

Trans-NanoSim is a tool that simulates reads with technical and transcriptome-specific features learnt from nanopore RNA-sequencing data. It offers a cost-effective alternative to sequencing transcriptomes in the lab. Through comparison against other nanopore read simulators, we show the unique advantage and robustness of Trans-NanoSim in capturing the characteristics of nanopore complementary DNA and direct RNA reads. More detailed information can be found in Hafezqorani et al 2020. To access NanoSim: https://github.com/bcgsc/NanoSim

ORCA

The ORCA bioinformatics environment is a Docker image that contains hundreds of bioinformatics tools and their dependencies. The ORCA image and accompanying server infrastructure provide a comprehensive bioinformatics environment for education and research. The ORCA environment on a server is implemented using Docker containers, but users do not need to interact directly with Docker, which makes it suitable for novices who may not yet be familiar with managing containers. ORCA has been used successfully to provide a private bioinformatics environment to external collaborators at a large genome institute, to teach an undergraduate class on bioinformatics targeted at biologists, and to provide a ready-to-go bioinformatics suite for a hackathon. Using ORCA eliminates time that would be spent debugging software installation issues, so that time may be better spent on education and research. More detailed information can be found in Jackman et al 2019. Access ORCA here: https://hub.docker.com/r/bcgsc/orca/

RapidACi

RapidACi is an R package for the batch treatment of Rapid carbon dioxide response curves (A-Ci) generated by the LI-COR® portable systems. It is a tool to accelerate photosynthesis phenotyping measurements. More detailed information can be found in Coursolle et al 2019. Access RapidACi here: https://github.com/ManuelLamothe/RapidACi

ntEdit

In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. ntEdit is a scalable genomics application for polishing genome assembly drafts. ntEdit simplifies polishing and “haploidization” of gene and genome sequences with its re-usable Bloom filter design. We expect ntEdit to have additional applications in fast mapping of simple nucleotide variations between any two individuals or species’ genomes. We generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophytes), and used it to edit our pseudo-haploid assemblies of the 20 Gbp interior and white spruce genomes in <4 h and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024. More detailed information can be found in Warren et al 2019. Access ntEdit here: https://github.com/bcgsc/ntEdit.

Tigmint

Tigmint is a software tool that addresses assembly errors using large-molecule reads, such as those generated by the 10X Genomics Chromium platform. Tigmint is useful for correcting the assemblies produced by multiple assembly tools, as well as for using Chromium reads to correct and scaffold assemblies of long single-molecule sequencing. More detailed information can be found in Jackman et al 2018. Access Tigmint here: https://github.com/bcgsc/tigmint.

ARKS

The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a k-mer-based mapping approach. The k-mer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it removes the need for input sequence formatting and draft sequence assembly indexing. The reliance on k-mers instead of read alignments for pairing sequences relaxes the workflow requirements, and drastically reduces the run time. More detailed information can be found in Coombe et al 2018. To access ARKS: https://github.com/bcgsc/arks

ARCS

ARCS is an application that utilizes the barcoding information contained in linked reads to further organize draft genomes into highly contiguous assemblies. Using this tool, we have shown how the contiguity of an ABySS Homo sapiens genome assembly can be increased over six-fold, using moderate coverage (25-fold) Chromium data. We expect ARCS to have broad utility in harnessing the barcoding information contained in linked read data for connecting high-quality sequences in genome assembly drafts. More detailed information can be found in Yeo et al 2018. To access ARCS: https://github.com/bcgsc/ARCS/.

ChopStitch

ChopStitch is a new algorithm for finding putative exons de novo and constructing splice graphs using an assembled transcriptome and whole genome shotgun sequencing (WGSS) data. ChopStitch identifies exon-exon boundaries in de novo assembled RNA-Seq data with the help of a Bloom filter that represents the k-mer spectrum of WGSS reads. The algorithm also accounts for base substitutions in transcript sequences that may be derived from sequencing or assembly errors, haplotype variations, or putative RNA editing events. More detailed information can be found in Khan et al 2018. To access ChopStitch: https://github.com/bcgsc/ChopStitch.
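
The idea of using a Bloom filter of genomic k-mers to locate exon-exon boundaries can be illustrated with a minimal sketch. This is not ChopStitch's implementation: the BloomFilter class, build_filter, putative_boundaries and all parameter values are hypothetical names chosen for illustration, and the sketch ignores reverse complements, base substitutions and Bloom filter false positives, which a real tool has to handle. It relies on one observation: a transcript k-mer that spans a junction does not occur in the genome, so a junction shows up as a run of absent k-mers.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter for strings (illustrative sizes, not tuned)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def build_filter(sequences, k):
    """Insert every k-mer of every input sequence into a Bloom filter."""
    bf = BloomFilter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            bf.add(seq[i:i + k])
    return bf


def putative_boundaries(transcript, genome_filter, k):
    """Return transcript offsets where a run of absent k-mers ends.

    If the first absent k-mer starts at offset s, the junction lies
    between offsets s + k - 2 and s + k - 1, so s + k - 1 is reported
    as the first base of the downstream exon.
    """
    boundaries = []
    run_start = None
    for i in range(len(transcript) - k + 1):
        present = transcript[i:i + k] in genome_filter
        if not present and run_start is None:
            run_start = i
        elif present and run_start is not None:
            boundaries.append(run_start + k - 1)
            run_start = None
    return boundaries
```

For a toy genome made of a 12-base exon, an intron and a second exon, calling putative_boundaries on the spliced transcript with k = 6 reports offset 12, the first base of the second exon.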

ABySS 2.0

ABySS 2.0 is the second version of our flagship sequence assembly algorithm. It improves on the resource efficiency of ABySS, and provides support for the emerging sequencing technologies, including those from 10x Genomics (Pleasanton, CA), Pacific Biosciences (PacBio, Menlo Park, CA), and Oxford Nanopore Technologies (ONT, Oxford, UK). We have demonstrated that ABySS 2.0 and its associated algorithms can assemble human genomes to chromosome-scale scaffolds, using computational resources readily available in modern servers. More detailed information can be found in Jackman, Vandevalk et al 2017. To access ABySS 2.0: https://github.com/bcgsc/abyss.

Kollector

Kollector is an alignment-free targeted assembly approach for performing local assembly of sequences of interest. A typical use case for the algorithm is the assembly of genic loci of non-model organisms using a set of transcript sequences. The resulting sequences can be readily utilized for more focused biological research, for example to study cis-regulatory elements. More detailed information can be found in Kucuk et al 2017. To access Kollector: https://github.com/bcgsc/kollector.

ntCard

ntCard performs a fundamental bioinformatics function: analyzing the sequence content of large volumes of raw sequencing data. It provides statistics for estimating the sequencing error frequency, genome size, and repeat content by profiling the k-mer spectrum of the input data. ntCard implements a computationally efficient algorithm that can process 90x coverage of the spruce mega-genome in 30 min using 500 MB of RAM. More detailed information can be found in Mohamadi, Khan, and Birol 2017. To access ntCard: https://github.com/bcgsc/ntCard.
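
To make the term "k-mer spectrum" concrete, the sketch below computes one exactly for a handful of reads. It only illustrates the quantity being profiled; it is not ntCard's algorithm, which is designed to estimate the spectrum efficiently rather than count every k-mer in memory. The function name kmer_spectrum is hypothetical, and the sketch treats each strand separately (no canonical k-mers).

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    # Histogram of the per-k-mer counts: multiplicity -> distinct k-mers.
    return dict(Counter(counts.values()))
```

For the single read "ACGTACGT" with k = 4, three k-mers occur once and one (ACGT) occurs twice. In real data, k-mers seen only once are often sequencing errors, while the position of the main peak reflects sequencing coverage.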

ntJoin

ntJoin is a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of applications, including improving a draft assembly with a reference-grade genome, a short-read assembly with a draft long-read assembly, and a draft assembly with an assembly from a closely related species. More detailed information can be found in Coombe et al 2020 (Bioinformatics 36(12): 3885-3887).
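
An ordered minimizer sketch can be illustrated as follows. This is a simplified sketch under stated assumptions, not ntJoin's code: the function names are hypothetical, minimizers are chosen by lexicographic order rather than by hash value, reverse complements are ignored, and the graph that ntJoin builds from the sketches is left out. The second function only shows how minimizers shared by two sequences give corresponding positions without any alignment.

```python
from collections import Counter


def minimizer_sketch(seq, k, w):
    """Return an ordered list of (position, k-mer) minimizers.

    Each window of w consecutive k-mers contributes its lexicographically
    smallest k-mer (leftmost on ties); repeated picks of the same position
    by overlapping windows are collapsed.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch = []
    for start in range(len(kmers) - w + 1):
        pos = min(range(start, start + w), key=lambda i: (kmers[i], i))
        if not sketch or sketch[-1][0] != pos:
            sketch.append((pos, kmers[pos]))
    return sketch


def shared_minimizers(sketch_a, sketch_b):
    """Pair up minimizers that occur exactly once in each sketch.

    Returns (pos_a, pos_b, k-mer) tuples ordered by position in sketch_a.
    """
    def unique(sketch):
        counts = Counter(kmer for _, kmer in sketch)
        return {kmer: pos for pos, kmer in sketch if counts[kmer] == 1}

    ua, ub = unique(sketch_a), unique(sketch_b)
    return sorted((ua[m], ub[m], m) for m in ua.keys() & ub.keys())
```

With k = 3 and w = 2, the sketch of "GATTACA" is [(1, "ATT"), (3, "TAC"), (4, "ACA")]. Comparing it with the sketch of "CCGATTACA" pairs the same three minimizers at positions shifted by two, which is the kind of ordered correspondence a minimizer-based mapping can use to place draft sequences along a reference.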

Physlr

Physlr constructs a de novo physical map using linked reads from 10X Genomics or stLFR. This physical map can then be used to scaffold an existing assembly to yield chromosome-level contiguity. More detailed information can be found in Afshinfard et al 2022. Access Physlr here: https://github.com/bcgsc/physlr

ntHash

Hashing has been widely used for indexing, querying and rapid similarity search in many bioinformatics applications, including sequence alignment, genome and transcriptome assembly, k-mer counting and error correction. Hence, expediting hashing operations would have a substantial impact in the field, making bioinformatics applications faster and more efficient. ntHash is a hashing algorithm tuned for processing DNA/RNA sequences. It performs best when calculating hash values for adjacent k-mers in an input sequence, operating an order of magnitude faster than the best performing alternatives in typical use cases. More detailed information can be found in Mohamadi et al 2016. Access ntHash here: https://github.com/bcgsc/nthash
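
The reason hashing adjacent k-mers can be so fast is that a rolling hash updates the previous value in constant time instead of rehashing all k bases. The sketch below shows a rotate-and-XOR rolling hash of that general kind. It is an illustration only: the per-base seed values are arbitrary placeholders rather than ntHash's constants, only the forward strand is hashed, and the function names are hypothetical.

```python
MASK = (1 << 64) - 1

# Arbitrary 64-bit placeholder seeds, one per base (assumption for illustration).
SEED = {
    "A": 0x0123456789ABCDEF,
    "C": 0x0FEDCBA987654321,
    "G": 0x1357924680ACEBDF,
    "T": 0x2468ACE013579BDF,
}


def rol(x, r):
    """Rotate a 64-bit integer left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK


def hash_kmer(kmer):
    """Hash one k-mer from scratch in O(k)."""
    h = 0
    for base in kmer:
        h = rol(h, 1) ^ SEED[base]
    return h


def rolling_hashes(seq, k):
    """Yield the hash of every k-mer in seq, updating in O(1) per step."""
    if len(seq) < k:
        return
    h = hash_kmer(seq[:k])
    yield h
    for i in range(k, len(seq)):
        # Rotate the old hash, cancel the base leaving the window,
        # and mix in the base entering it.
        h = rol(h, 1) ^ rol(SEED[seq[i - k]], k) ^ SEED[seq[i]]
        yield h
```

Because rotation distributes over XOR, each rolled value equals the hash computed from scratch for the same k-mer, so the two functions can be checked against each other on any A/C/G/T sequence.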

LongStitch

Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. More detailed information can be found in Coombe et al 2021. Access LongStitch here: https://github.com/bcgsc/longstitch

RNA-Bloom

Detection and discovery of isoforms in single cells are difficult because of the inherent technical shortcomings of scRNA-seq data, and existing transcriptome assembly methods are mainly designed for bulk RNA samples. To address this challenge, we developed RNA-Bloom, an assembly algorithm that leverages the rich information content aggregated from multiple single-cell transcriptomes to reconstruct cell-specific isoforms. Assembly with RNA-Bloom can be either reference-guided or reference-free, thus enabling unbiased discovery of novel isoforms or foreign transcripts. More detailed information can be found in Nip et al 2020. Access RNA-Bloom here: https://github.com/bcgsc/rnabloom

XMatchView

In genomics research, the visual representation of DNA sequences is of prime importance. When displayed with additional information, or tracks, showing the position of annotated genes, alignments of sequences of interest, etc., these displays facilitate our understanding of genome and gene structure, and become powerful tools to assess the relationship between various sequence data. XMatchView and XMatchView-conifer are two Python applications for comparing genomes visually and assessing their synteny. The software is a robust implementation of the sensitive Smith-Waterman algorithm for DNA alignments. More detailed information can be found in Warren et al 2018. Access XMatchView here: https://github.com/bcgsc/xmatchview