vdb

A SARS-CoV-2 Mutation Pattern Query Tool

1. Purpose

The vdb program is designed to query the SARS-CoV-2 mutational landscape. It runs as a command shell in a terminal, and it allows customized searches for mutation patterns over the entire SARS-CoV-2 genome dataset or subsets thereof. These pattern searches can be for spike protein mutations or nucleotide mutations over the whole genome.

The vdb tool uses a natural syntax, permitting quick searches over various subsets of the data. The two main types of objects in vdb are groups of viruses (“clusters”) and groups of mutations (“patterns”). Clusters can be obtained by searching for patterns, and patterns can be obtained by examining clusters. The program does NOT automatically scan for some pre-defined pattern. Instead, the goal of the program is to make it very easy to look around the spike mutational landscape and see what’s there. The vdb program can be thought of as a “viewer” (a device for looking), even though it's entirely text-based.

The default cluster to search is the collection of all sequenced SARS-CoV-2 viruses (“world”). Alternatively, a country or a US state can be specified. To search for all viruses from the United States, enter from US or just us as part of the search command. A cluster or pattern can be assigned to a variable using an equal sign, =.

Clusters can be filtered by date, number of mutations, country, and Pango lineage. For example, to find all viruses collected in the US containing both mutations E484K and D614G, and then to see which mutation patterns this set has, use the following two commands:

        VDB> a = us w/ E484K D614G

        VDB> patterns a
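The relationship between clusters and patterns can be pictured with plain sets. The sketch below is not vdb's implementation. It is a minimal Python illustration in which each virus is a hypothetical record holding a set of spike mutation strings. The isolate names and mutation sets are made up, the regular expression only covers simple substitutions such as E484K, and counting identical mutation sets is just one simple way to summarize a cluster.

```python
import re
from collections import Counter

# Substitution strings such as "E484K": reference residue, position, new residue.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def parse_mutation(text):
    """Split a substitution like 'E484K' into ('E', 484, 'K')."""
    match = MUTATION_RE.match(text)
    if match is None:
        raise ValueError(f"not a substitution string: {text!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt

def containing(cluster, pattern):
    """Return the viruses whose mutation set includes every mutation in pattern."""
    wanted = set(pattern)
    return {name: muts for name, muts in cluster.items() if wanted <= muts}

def pattern_counts(cluster):
    """Count how often each complete mutation set occurs in a cluster."""
    return Counter(frozenset(muts) for muts in cluster.values())

# Hypothetical toy cluster; real data would come from a vdb mutation list file.
us = {
    "virus1": {"D614G"},
    "virus2": {"E484K", "D614G", "T95I"},
    "virus3": {"E484K", "D614G", "T95I"},
    "virus4": {"E484K", "D614G"},
}

# Rough analogue of: a = us w/ E484K D614G
a = containing(us, ["E484K", "D614G"])

# Rough analogue of: patterns a
for pattern, count in pattern_counts(a).most_common():
    print(count, sorted(pattern, key=lambda m: parse_mutation(m)[1]))
```

Representing each virus as a set makes the "contains all of these mutations" filter a single subset test, which is why a cluster can be derived from a pattern and a pattern from a cluster with so little code.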

2. Documentation and reference

A full description of commands is given here.
A quick reference listing all commands is here. This information can also be listed by entering help or ? in vdb.

vdb is described in the bioRxiv manuscript Detection and characterization of the SARS-CoV-2 lineage B.1.526 in New York.

Questions about vdb can be sent to [email protected].

3. Installation

There are two programs:

vdbCreate - this converts multiple sequence alignments (MSA) of SARS-CoV-2 genomes into a file listing spike mutations

vdb - this is the query tool

These programs are written in Swift and are run in a terminal. Swift is available at https://swift.org/download/ or as part of Xcode. The programs can be compiled with Swift version 5.3 and higher. To simplify installation each program is distributed as a single, stand-alone source file. If vdb is run with nucleotide mutation data, then the file "nuclref.wiv04" should be in the working directory. To compile the programs, first check that the Swift compiler (swiftc) is part of your path. On an Ubuntu system, a command similar to the following (adjusting the path as necessary) is appropriate for a bash shell:

        export PATH=/data/username/swift-5.3.3-RELEASE-ubuntu16.04/usr/bin:$PATH

Next, download the vdb repository ("Download ZIP" under the "Code" button on the top level vdb page). Unzip the file. Then to compile the programs, run these commands (these take < 1 minute):

        swiftc -O vdbCreate.swift
        swiftc -O vdb.swift

4. Data files

The sequence alignment of viral genomes can be downloaded from GISAID. This requires registration with GISAID, agreeing to GISAID terms of use, and an account. Note that among these terms of use are the following requirements: (1) to not share or re-distribute the data to any third party, (2) to make best efforts to collaborate with the originating laboratories who provided the data to GISAID, and (3) to acknowledge the originating and submitting laboratories in any publication with results obtained by analyzing this data.

On the GISAID EpiCoV “Downloads” window, select “MSA full0405 (64MB)” or the latest version in the "Alignment and proteins" section. Also download the “metadata” file in the "Download packages" section or in the "Genomic epidemiology" section. Uncompress the files and place the FASTA file and the metadata file in the same directory that will be used to run vdb. One can also download selected sequences from GISAID, add the WIV04 reference sequence, and align these with MAFFT. It is possible to load both the large dataset from the main MSA and a local, manually aligned set. The FASTA sequence identifier lines must have the same format as used by GISAID:

        >hCoV-19/Wuhan/WIV04/2019|EPI_ISL_402124|2019-12-30|China

Manually added sequences without GISAID-assigned accession numbers should use a provisional number slightly greater than the highest accession number in the current dataset.
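When preparing a manually aligned set, it can help to check identifier lines before running vdbCreate. The following Python sketch assumes that every identifier line has exactly the four "|"-separated fields shown in the example above (virus name, accession, collection date, country). The function names are hypothetical, and choosing "highest number plus one" for a provisional accession is only one reading of "slightly greater".

```python
ACCESSION_PREFIX = "EPI_ISL_"

def parse_header(line):
    """Split a GISAID-style FASTA identifier line into its four fields."""
    if not line.startswith(">"):
        raise ValueError("FASTA identifier lines start with '>'")
    fields = line[1:].strip().split("|")
    if len(fields) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(fields)}")
    name, accession, collected, country = fields
    if not accession.startswith(ACCESSION_PREFIX):
        raise ValueError(f"unexpected accession format: {accession!r}")
    return {
        "name": name,
        "accession": accession,
        # Numeric part of the accession, useful for finding the highest one.
        "number": int(accession[len(ACCESSION_PREFIX):]),
        # Kept as text because collection dates can be partial.
        "date": collected,
        "country": country,
    }

def provisional_accession(headers):
    """Suggest an accession one above the highest number seen (an assumption)."""
    highest = max(parse_header(h)["number"] for h in headers)
    return f"{ACCESSION_PREFIX}{highest + 1}"

record = parse_header(">hCoV-19/Wuhan/WIV04/2019|EPI_ISL_402124|2019-12-30|China")
print(record["name"], record["number"], record["date"], record["country"])
```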

Other files included in this repository are:

nuclref.wiv04    This is the SARS-CoV-2 genomic sequence reference, which is used when vdb is run in nucleotide mode

ref_wiv04      This is the same reference in fasta format, to be used for manual alignments of GISAID sequences

5. Running the programs

To run vdbCreate to create the mutations list (this takes about 10 minutes for a million sequences):

        ./vdbCreate msa_0405.fasta

For the vdb program, you can either tell the program what file(s) to load on the command line, or if you do not give a file on the command line, the program will load the most recently modified file with the name vdb_mmddyy.txt:

        ./vdb vdb_040521.txt
        ./vdb
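The "most recently modified file" rule can be sketched in a few lines of Python. This is an illustration of the idea, not the code vdb uses: the function name is hypothetical, and the pattern only matches the plain vdb_mmddyy.txt form mentioned above (six digits, no suffix), so nucleotide files such as vdb_040521_nucl.txt are ignored here.

```python
import glob
import os
import re

# Matches names like vdb_040521.txt (six digits, no extra suffix).
NAME_RE = re.compile(r"^vdb_\d{6}\.txt$")

def newest_vdb_file(directory="."):
    """Return the most recently modified vdb_mmddyy.txt in directory, or None."""
    candidates = [
        path
        for path in glob.glob(os.path.join(directory, "vdb_*.txt"))
        if NAME_RE.match(os.path.basename(path))
    ]
    if not candidates:
        return None
    # Modification time decides, not the date encoded in the file name.
    return max(candidates, key=os.path.getmtime)

print(newest_vdb_file("."))
```

Note that the choice depends on modification time, so copying or touching an older file can change which one is picked up. Naming the file explicitly on the command line avoids that ambiguity.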

The vdb programs can also be used to examine nucleotide mutations. To produce the nucleotide mutation list file, use the -n or -N flag:

        ./vdbCreate -N msa_0405.fasta

The -n flag excludes ambiguous bases, while the -N flag includes them. The -N flag is necessary to have protein mutations match what is listed in GISAID. The file produced by -N is much larger. This can be useful if one wants to check whether a certain region was not resolved in a particular strain, but it is also slower because of the much larger file. Probably the best option is to generate the mutation list file with the -N flag, and then trim this file using vdb, which keeps a very small subset of the Ns. This prevents mutation calls at codons such as NNC, which could happen if these Ns are dropped. The trim command takes about 30 seconds on a million sequences, and this only needs to be done once since the results can be saved. The suggested workflow is:

        ./vdbCreate -N msa_0405.fasta  
        ./vdb vdb_040521_nucl.txt  
        VDB> trim  
        VDB> save world vdb_040521_trimmed_nucl.txt  
        VDB> quit  
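The reason for keeping some Ns can be seen with a single codon. The Python sketch below is a toy model, not vdb's algorithm: the reference codon GAA, the two-entry codon table, and the use of "X" for an unresolved codon are all choices made for this example. It shows that if an isolate reads NNC and the two N positions are dropped from the mutation list, rebuilding the codon from the reference gives GAC and therefore an apparent amino acid change that the sequencing data do not support.

```python
# Minimal codon table: only the codons needed for this example.
CODONS = {"GAA": "E", "GAC": "D"}

def apply_mutations(reference, mutations):
    """Rebuild a sequence from a reference and (1-based position, base) changes."""
    bases = list(reference)
    for position, base in mutations:
        bases[position - 1] = base
    return "".join(bases)

def translate(codon):
    """Translate one codon; a codon containing N is reported as unresolved ('X')."""
    if "N" in codon:
        return "X"
    return CODONS[codon]

reference = "GAA"                           # hypothetical reference codon (E)
observed = [(1, "N"), (2, "N"), (3, "C")]   # the isolate reads NNC here

# Keeping the Ns: the codon stays NNC and no amino acid change is called.
with_ns = apply_mutations(reference, observed)

# Dropping the Ns: only the C survives, so the codon is rebuilt as GAC.
without_ns = apply_mutations(reference, [(p, b) for p, b in observed if b != "N"])

print(with_ns, translate(with_ns))
print(without_ns, translate(without_ns))
```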

To read the resulting file into vdb and thereby analyze mutations in nucleotide mode:

        ./vdb vdb_040521_trimmed_nucl.txt  

or if the trimmed file has not been generated:

        ./vdb vdb_040521_nucl.txt 

6. Usage notes

One should be aware that the SARS-CoV-2 genome dataset has some artefacts in the sequences and some errors in the metadata. Obvious examples include viruses with incorrect or partial collection date information. Anomalies in the sequences are less obvious, but there is a way to guard against this problem. Unusual sequences are less likely to be an artefact if they have been deposited by multiple laboratories. A virus name often gives an indication of the organization which deposited the sequence.

Comments
  • v1.4 vdbCreate handles larger MSA files

    vdbCreate - change blockBufferSize to handle larger MSA files
    vdbCreate - add check for updates
    vdb - modified functions getDateFor() and loadMutationDBTSV() to be compatible with Swift 5.1
    vdb - improved response if nucleotide reference file is missing - tries to download file from Github, better error handling
    vdb - disallow variable names that have the form of a mutation string
    vdb - allow automatic recognition and opening of vdb nucleotide files when no file is specified

    opened by anthonypwest 0
  • Current bufferSize too small causing out of array bounds access error

    When using vdb with the newest MSA from GISAID I got an illegal hardware instruction error. After turning off the -O flag during compilation, I identified this as a Swift/ContiguousArrayBuffer.swift:575: Fatal error: Index out of range error. A quick bug hunt led me to the issue, which occurred on this line: https://github.com/variant-database/vdb/blob/ab73f5ee8f161ce998fa83825326db133e80ee59/vdbCreate.swift#L379

    The bufferSize is set to 50,000 (thus limiting the size of refBuffer array), while the refBufferPosition gets the value >57,000.

    I fixed this in my local repo by changing the following line: https://github.com/variant-database/vdb/blob/ab73f5ee8f161ce998fa83825326db133e80ee59/vdbCreate.swift#L209

    and setting the bufferSize to 100,000.

    Please let me know if you would like me to create a PR to fix this or whether this will be addressed in a future version.

    opened by nsapoval 0
Releases(v3.0)
  • v3.0(Aug 4, 2022)

    vdb: Much reduced memory usage by changing mutation position data type. Reduced memory usage for vdb data files in compressed format (initial load only). Reduced memory requirement for loading/aligning FASTA sequences. Compress option ("z") added for trim mode and for save cluster command. NCBI mode for reading vdb data files (w/ lineages) created from NCBI GenBank sequences. Accession numbers encoded for NCBI GenBank data files. 2nd level Pango lineage aliases now supported. For FASTA sequence output, N regions at start and end are now removed. "Greater than" and "less than" commands now can filter for completion fraction. vdb data files now have optional markers for NCBI mode and compressed format. Missing country name now handled as "Unknown". Full (de-aliased) lineage name added to characteristic command output. Processing of command input now allows floating point values. VDBNumber struct added to support filtering by completeness. Fixed memory leak in Swift Edlib library. Support for aligning and updating NCBI GenBank data files. Support for vdb2 server with GenBank nucleotide data. delScores file now made using data from all isolates. Alignment timeout added.

  • v2.9(Jul 7, 2022)

    vdb: Added option to load cluster from FASTA file with automatic alignment. Improved option to save cluster as reconstructed sequences in FASTA format. Added double subscript command to print a range of a virus sequence. Reduced memory required for loading metadata. Initial work on memory optimization of mutation storage. Fixed several printing issues.

  • v2.8(Jun 15, 2022)

    vdb: Added support for Python API. Improved handling of insertions. Unsequenced ranges stored as 'N regions' in nucleotide mode. Trim mode saves unsequenced regions in '_Nregions' file. Trim mode corrects MSA artificial deletions. Added cluster array subscripting. New arrayBase option for zero- or one-base array numbering. Added isolate info command via subscript. Improved range options for subscripts. Removed age from Isolate class. Added MutationStruct to wrap mutation for lists. Modified mutation printing to support insertions. Mutation frequency lists extended to include all with freq. >= 50%. Improved performance of 'variants' command. Improved performance of cluster addition command. Improved performance related to multithreaded operations on non-Swift strings. Fixed country-name bug related to 'variants' command. Fixed sublineages bug related to lineage B.1.1.529. Added option to save clusters in fasta format. Added support for importing in embedded mode.

    vdbCreate: Accepts "msaCodon_" files. Cleaned up output to stdout (-p option). Improved handling of insertions.

  • v2.7(Apr 23, 2022)

    vdb:

    Major performance improvements to all filter and list commands via concurrency. Isolate struct changed to class to improve performance. Array extensions added for concurrent filter and reduce. Removal of unnecessary exit() calls to support multiple vdb instances in server and embedded modes. Case matching options added for virus name search. Fixed performance bug for named command due to non-contiguous strings. Fixed bugs revealed by fuzz testing - in before/after command parsing, isPattern, and group/trends. Added "Unassigned" to "None" as special case lineage names. Increased database size limits. Refactoring: changed to allIsolatesKeyword and allIsolates token to clarify code. Preparation for future command lineageFrequenciesByLocation and possibly binCluster. Improved format of abstract syntax tree printing for debug mode. Improved performance of minus command. Fixed page printing option for help. Code cleanup mostly related to user settings. Allow faster quit for command line mode.

  • v2.6(Apr 8, 2022)

    vdb: Unified codebase for command line tool, web server, and macOS/iOS apps, with version determined by compilation flags. Major internal improvements to support versions of vdb with a graphical interface. Removed static variables so that multiple vdb instances can be run in the same process. MULTI mode added so that new vdb instances can be quickly started from a server instance. Improved methods to obtain the current terminal size. Changed printing architecture to support multiple vdb instances. Changed vdb.isolates to be an atomic dictionary.

  • v2.5(Mar 24, 2022)

    vdb:
    - Various modifications to enable future graphical interfaces
    - Added 'sample' command to generate a random subset of a cluster
    - Renamed enum Protein to VDBProtein
    - Renamed struct Terminal in LinenoiseTerminal
    - Added zeroData, weekMax
    - Added faster distance metric
    - Added weekNumber() func to Isolate
    - VDB.additionalSublineages has includePlainSublineages option
    - Changed to Swift.print for trimming nucleotide logging
    - Added quiet option for mutationFrequenciesInCluster
    - Added week info to listCountries countryCounts - should this be optional?
    - Added week info to listStates countryCounts - should this be optional?
    - Added quiet option for listLineages
    - Added week info to listLineages lineageCounts - should this be optional?
    - Added quiet option for isolatesFromCountry
    - Added quiet option for proteinMutationsForIsolate
    - Changes vdb.lists to atomic
    - Added better atomics
    - Added option to save a cluster's metadata
    - Added 'group variants' command

  • v2.4(Mar 1, 2022)

    vdbCreate: Updated to work on Windows. Fixed bug affecting processing of small input files.

    vdb: Updated to work on Windows. Fixed bug in load cluster command. Improved speed of loading metadata file.

  • v2.3(Jan 12, 2022)

    vdbCreate:
    - added parallel processing to improve speed
    - added option -s to use standard input (i.e., a pipe) for fasta file
    - added option -p to use standard output for processed data
    - added option -o to overwrite existing output file

    vdb:
    - increased database capacity
    - added command line option -t for trim-only mode
    - for trim-only mode, added option -s to use standard input (i.e., a pipe) for vdb database file

  • v2.2(Dec 10, 2021)

    - Added Omicron to WHO variants
    - Improved handling of large metadata files
    - New commands for lineage assignment:
        prepare - prepare vdb to assign lineages based on consensus mutation sets
        assign - assigns Pango lineages to viruses in a cluster
        identical - searches for viruses in different lineages with identical mutation patterns
        compare - compares the lineage assignments of viruses in two clusters

  • v2.1(Aug 21, 2021)

    - sublineages command added
    - diff command added
    - save command extended to history and patterns
    - mutation frequency command can display specificity info
    - consensusPercentage setting added
    - listSpecificity setting added
    - paging of output improved
    - group lineage command improved for WHO variants
    - small adjustments to vdbCreate

  • v2.0(Jul 22, 2021)

    - Improved list functionality
    - Results of most list commands can be assigned to a variable
    - These lists can be saved to a file
    - Some commands can be applied to lists
    - Ability to use an item from a list (e.g. a pattern) in certain commands
    - quiet mode added for commands acting on lists
    - Dataset loading is now multithreaded
    - maxMutationsInFreqList setting added
    - Ability to load clusters from Pango definition file
    - Updated WHO variants

  • v1.8(Jul 9, 2021)

  • v1.7(Jun 29, 2021)

  • v1.6(Jun 15, 2021)

    - added WHO variant designations
    - added date range command
    - added demo command
    - added trends number command alias
    - added paging setting
    - improved handling of Pango lineage aliases and sublineages
    - improved number formatting
    - improved table printing
    - improved group lineages for lineages/trends
    - improved trends graph x axis labeling
    - improved prompt with color
    - fixed hint display
    - fixed delete/back space issue

  • v1.5(May 19, 2021)

  • v1.4(May 8, 2021)

    vdbCreate has been updated to handle larger MSA files. The previous version crashed with files over 40 GB. vdbCreate now checks for updates at GitHub. vdb has better error handling if the nucleotide reference file is missing. An attempt is made to download this file from GitHub if it's missing.

  • v1.3(May 4, 2021)

    - Added trends command displaying tables and graphs
    - Added tab completions and hints
    - Added trim command to remove extraneous N nucleotides
    - Added check for updates
