BpForms: a toolkit for concretely describing non-canonical DNA, RNA, and proteins

BpForms is a toolkit for unambiguously describing the molecular structure (atoms and bonds) of DNA, RNA, and proteins, including non-canonical monomeric forms (subunits which compose polymers), crosslinks, nicks, and circular topologies. By concretely describing the molecular structure of biopolymers, BpForms aims to help epigenomics, transcriptomics, proteomics, systems biology, and synthetic biology researchers share and integrate information about DNA modification, post-transcriptional modification, post-translational modification, expanded genetic codes, and synthetic parts. In particular, BpForms was developed to help researchers collaboratively develop whole-cell computational models. See the use cases below for more information.

BpForms includes a grammar for describing biopolymer forms and three consensus alphabets of non-canonical monomeric forms of DNA nucleotide monophosphates, RNA nucleotide monophosphates, and protein amino acids. BpForms also includes four software tools for verifying descriptions of biopolymer forms and calculating properties such as their molecular structure, formula, molecular weight, and charge: this website, a JSON REST API, a command line interface, and a Python API. BpForms is available open-source under the MIT license.

BpForms can be combined with BcForms to concretely describe the primary structure of complexes.

BpForms verifier/calculator

Enter a biopolymer form

Calculated properties of the biopolymer form

Features

BpForms has the following features:

Concrete: To help researchers communicate and integrate data about macromolecules, the grammar can capture the primary structures of polymers, including non-canonical (NC) residues, caps, crosslinks, and nicks.
Abstract: To facilitate network research, BpForms uses alphabets of residues and an ontology of crosslinks to abstract the structures of polymers.
Extensible: To capture any polymer, users can define residues and crosslinks inline or define custom alphabets and ontologies.
Structured coordinates: To compose residues and crosslinks into polymers, each residue and atom has a unique coordinate relative to its parent.
Context-free: To help integrate information about the processes which synthesize and modify macromolecules, the grammar captures the structures of macromolecules separately from the processes which generate them.
User-friendly: To ensure BpForms is easy to use, the grammar is human-readable, and BpForms includes a web application and a command-line program.
Machine-readable: The grammar is machine-readable to enable analyses of macromolecules.
Composable: To facilitate network research, BpForms includes protocols for composing the grammar with formats such as BioPAX, CellML, SBML, and SBOL.
Backward-compatible: BpForms is backward compatible with the IUPAC/IUBMB format to maximize compatibility with existing formats, software, and knowledge.

Grammar for polymers

Overview

The BpForms grammar extends the IUPAC/IUBMB notation commonly used to represent unmodified DNA, RNA, and proteins to describe non-canonical forms of DNA, RNA, and proteins:

BpForms can represent a wider range of monomeric forms, including monomeric forms that are not described in pre-defined alphabets. See the Alphabets section below for more information.
BpForms can capture left and right caps such as 5' caps.
BpForms can capture intrastrand crosslinks (additional bonds between non-adjacent monomeric forms).
BpForms can capture nicks (absence of a bond between adjacent monomeric forms).
BpForms can capture linear and circular topologies of polymers.
BpForms has concrete semantics for generating molecular structures from its compressed representation of sequences of monomeric forms.

BpForms descriptions of biopolymers consist of up to three parts separated by pipes ("|") (e.g., A{pSer}Y | circular | x-link: [...]); nicks are written within the sequence itself:

A sequence of strings that indicate monomeric forms of an alphabet bonded together (e.g., A{pSer}Y). As described below, BpForms includes six alphabets of hundreds of monomeric forms of DNA, RNA, and proteins, and users can define additional alphabets.
An optional set of crosslinks which describe bonds between non-adjacent monomeric forms (e.g., x-link: [...]). See below for additional information.
An optional set of nicks which describe adjacent monomeric forms that are not covalently bonded (e.g., A:C). See below for additional information.
An optional attribute for indicating the circularity of the polymer (e.g., circular). See below for additional information.
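
The pipe-separated layout described above can be illustrated with a minimal sketch. This is not the BpForms parser (which is generated from the Lark grammar); the helper names split_top_level and split_form are hypothetical, and the sketch assumes that quoted strings contain no escaped quotes. It only shows that pipes inside brackets or quotes belong to the enclosed definition rather than to the polymer.

```python
def split_top_level(text, sep="|"):
    """Split text on sep, ignoring separators inside brackets or quotes."""
    parts, current = [], []
    depth, in_quote = 0, False
    for ch in text:
        if ch == '"':
            in_quote = not in_quote
        elif not in_quote and ch in "[{":
            depth += 1
        elif not in_quote and ch in "]}":
            depth -= 1
        if ch == sep and depth == 0 and not in_quote:
            # A top-level pipe ends the current part.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def split_form(form):
    """Return the sequence and the list of polymer-level attributes."""
    sequence, *attributes = split_top_level(form)
    return sequence, attributes


sequence, attributes = split_form(
    'CAC | circular | x-link: [type: "disulfide" | l: 1 | r: 3]'
)
print(sequence)    # CAC
print(attributes)  # ['circular', 'x-link: [type: "disulfide" | l: 1 | r: 3]']
```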

The monomeric forms in the sequence of monomeric forms can be described in three ways:

Individual characters (e.g., A) can be used to indicate monomeric forms that have single character codes in the defined alphabets.
Multiple characters delimited by curly brackets (e.g., {pSer}) can be used to indicate monomeric forms that have multi-character codes in the defined alphabets.
Monomeric forms which are not defined in the alphabets can be defined "inline" using multiple attributes delimited by square brackets (e.g., [id: "m2A" | ...]). See below for additional information.
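
As a rough illustration of these three notations, the sketch below tokenizes the sequence part of a form into single-character codes, curly-bracket codes, and inline definitions, and also recognizes the ":" nick marker. The function tokenize_sequence is a hypothetical helper written for this page, not part of the BpForms API, and it assumes well-formed input without escaped quotes.

```python
def tokenize_sequence(seq):
    """Split the sequence part of a form into (kind, text) tokens."""
    tokens = []
    i = 0
    while i < len(seq):
        ch = seq[i]
        if ch.isspace():
            i += 1
        elif ch == ":":
            # Nick between the neighboring monomeric forms.
            tokens.append(("nick", ch))
            i += 1
        elif ch == "{":
            # Multi-character code from an alphabet, e.g. {pSer}.
            end = seq.index("}", i)
            tokens.append(("multi", seq[i + 1:end]))
            i = end + 1
        elif ch == "[":
            # Inline definition; brackets inside quoted strings (such as
            # SMILES atoms) must not close the definition.
            depth, in_quote, j = 0, False, i
            while True:
                c = seq[j]
                if c == '"':
                    in_quote = not in_quote
                elif not in_quote and c == "[":
                    depth += 1
                elif not in_quote and c == "]":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            tokens.append(("inline", seq[i + 1:j]))
            i = j + 1
        else:
            # Single-character code from an alphabet, e.g. A.
            tokens.append(("single", ch))
            i += 1
    return tokens


print(tokenize_sequence("A{pSer}Y"))
# [('single', 'A'), ('multi', 'pSer'), ('single', 'Y')]
```

The number of non-nick tokens corresponds to the length of the polymer in this sketch.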

The BpForms grammar is defined in Lark syntax, which is based on EBNF syntax.

Examples

DNA

{dI}ACGC

Length: 5
Formula: C₅₃H₆₄N₂₀O₃₃P₅
Molecular weight: 1664.1
Charge: -5

Deoxyinosine at the first position

RNA

AC{9A}GC

Length: 5
Formula: C₄₈H₅₅N₂₀O₃₅P₅
Molecular weight: 1626.9
Charge: -6

Inosine at the third position

Protein

AC[id: "U" | structure: "N[C@H](C(=O)O)C[SeH]" | ... ]C

Length: 4
Formula: C₁₂H₂₃N₄O₅S₂Se₁
Molecular weight: 446.4
Charge: 1

L-selenocysteine at the third position

Alphabets

BpForms includes several alphabets. Each alphabet describes hundreds of monomeric forms.

As described above, monomeric forms with single-character codes can be indicated by their codes (e.g., C), and monomeric forms with multi-character codes can be indicated by their codes delimited by curly brackets (e.g., {m2A}).

Examples

DNA: 422 nucleotide monophosphates and 3' and 5' caps

m2A: 2-methyladenine

Formula: C₁₀H₁₃N₆O₆P₁
Molecular weight: 344.224
Charge: -2

RNA: 378 nucleotide monophosphates and 3' and 5' caps

21C: 2-lysidine

Formula: C₁₅H₂₇N₅O₉P₁
Molecular weight: 452.381
Charge: -1

Protein: 1,435 amino acids and carboxy and amine termini

AA0037: phosphoserine

Formula: C₃H₇N₁O₆P₁
Molecular weight: 184.064
Charge: -1

"Inline" monomeric forms

Monomeric forms which are not defined in the alphabet can be defined "inline" within the sequence of monomeric forms.

Inline monomeric forms can be defined by enclosing multiple attributes separated by pipes ("|") in square brackets (e.g., [id: "a-short-name" | name: "a long name " | ...]).

Inline monomeric forms support six types of attributes. See below for more information about each attribute.

id, name, synonym: These attributes indicate human-readable names for the monomeric form.
structure: This attribute indicates the molecular structure of the monomeric form.
l-bond-atom, l-displaced-atom, r-bond-atom, r-displaced-atom: These attributes indicate the atoms in the structure that participate in bonds with the preceding and following monomeric forms in the sequence and the atoms that are displaced by the formation of these bonds.
delta-mass, delta-charge, position: These attributes can be used to describe uncertainty in the structure and location of the monomeric form.
identifier: This attribute indicates an entry in a database or ontology which is equivalent to the monomeric form.
comments: This attribute can be used to store comments about monomeric forms.

All of these attributes are optional. However, the structure, l-bond-atom, l-displaced-atom, r-bond-atom, and r-displaced-atom attributes are required to calculate the molecular structure, chemical formula, molecular weight, and charge of the polymer.

Examples

[id: "dI"
    | name: "hypoxanthine"
    | structure: "OC[C@H]1O[C@H](C[C@@H]1O)[N+]1(C=Nc2c1nc[nH]c2=O)
                  C1CC(C(O1)COP(=O)([O-])[O-])O"
    ]

[id: "AA0305"
    | name: "N5-methyl-L-arginine"
    | structure: "OC(=O)[C@H](CCCN(C(=[NH2])N)C)[NH3+]"
    | l-bond-atom: N16-1
    | l-displaced-atom: H16+1
    | l-displaced-atom: H16
    | r-bond-atom: C2
    | r-displaced-atom: O1
    | r-displaced-atom: H1
    | comments: "Methylated form of L-arginine"
    ]

Molecular structure

The structure attribute describes the molecular structure of the monomeric form. This attribute must be a SMILES-encoded string, and the atoms should be canonically ordered (i.e., Open Babel canonical SMILES format). Each monomeric form can only have one structure attribute. This attribute is required to calculate the molecular structure of the polymer.

Example

The text below illustrates how to describe the modified DNA nucleotide monophosphate 2'-deoxy-2-O-methylcytosine-5'-monophosphate, and the image below illustrates the molecule that the text specifies. The atom labels indicate the numbers of the atoms within the molecule. These numbers can be generated with Open Babel.

[id: "m2C"
    | name: "2-O-methylcytosine"
    | structure: "COC1=NC(=CCN1C1CC(C(O1)COP(=O)
                  ([O-])[O-])O)N"
    ]

Bonds with adjacent monomers

The l-bond-atom and l-displaced-atom attributes describe bonds with preceding monomeric forms; the r-bond-atom and r-displaced-atom attributes describe bonds with succeeding monomeric forms.

The values of these attributes are the element of the atom, the position of the atom within the monomeric form, and the charge of the atom (e.g., N3+1). Open Babel can be used to display the numbers of the atoms within monomeric forms.

Each monomeric form can have one or more bond atoms and displaced atoms for its bonds with the preceding and following monomeric forms. In addition, the number of left bond atoms of each monomeric form must equal the number of right bond atoms of the preceding monomeric form, and the number of right bond atoms of each monomeric form must equal the number of left bond atoms of the following monomeric form. The BpForms software verifies these constraints. These attributes are required to calculate the molecular structure of the polymer.
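
The atom notation and the counting constraint above can be sketched as follows. This is an illustrative sketch rather than the verification code used by BpForms; parse_atom, check_adjacent_bonds, and the dictionary layout of the monomeric forms are assumptions made for this example.

```python
import re

# Element, position of the atom within the monomeric form, optional charge.
ATOM_PATTERN = re.compile(r"^([A-Z][a-z]?)(\d+)([+-]\d+)?$")


def parse_atom(value):
    """Parse a value such as 'N16-1' into (element, position, charge)."""
    match = ATOM_PATTERN.match(value)
    if match is None:
        raise ValueError(f"invalid atom: {value!r}")
    element, position, charge = match.groups()
    return element, int(position), int(charge or 0)


def check_adjacent_bonds(monomers):
    """Report neighbors whose right and left bond atom counts differ.

    Each monomer is a dict with 'l-bond-atom' and 'r-bond-atom' lists
    (a hypothetical layout chosen for this sketch).
    """
    errors = []
    for i in range(len(monomers) - 1):
        n_right = len(monomers[i]["r-bond-atom"])
        n_left = len(monomers[i + 1]["l-bond-atom"])
        if n_right != n_left:
            errors.append(
                f"positions {i + 1} and {i + 2}: "
                f"{n_right} right bond atom(s) vs {n_left} left bond atom(s)"
            )
    return errors


amino_acid = {"l-bond-atom": ["N16-1"], "r-bond-atom": ["C2"]}
print(parse_atom("N16-1"))                             # ('N', 16, -1)
print(check_adjacent_bonds([amino_acid, amino_acid]))  # []
```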

Example

The example below illustrates how to describe the modified amino acid N⁵-methyl-L-arginine.

The blue atoms indicate atoms (N terminus) involved in left bonds to preceding monomeric forms; the dark blue N atom indicates the atom which bonds with preceding monomeric forms; the light blue H atom indicates atoms displaced by the formation of these bonds.

The green atoms indicate the atoms (C terminus) involved in right bonds to succeeding monomeric forms. The dark green C atom indicates the atom which bonds with succeeding monomeric forms; the light green H atom (not shown) indicates atoms which are displaced by the formation of these bonds.

[id: "AA0305"
  | name: "N5-methyl-L-arginine"
  | structure: "OC(=O)[C@H](CCCN(C(=[NH2])N)C)
                [NH3+]"
  | l-bond-atom: N16-1
  | l-displaced-atom: H16+1
  | l-displaced-atom: H16
  | r-bond-atom: C2
  | r-displaced-atom: O1
  | r-displaced-atom: H1
  ]

Structure and positional uncertainty

Through the inline monomeric forms, BpForms can represent two types of uncertainty in the molecular structure of forms of biopolymers:

The delta-mass and delta-charge attributes can describe additional mass and charge beyond that of the canonical monomeric form which have been observed, but which cannot be resolved to a specific molecular structure. The value of the delta-mass attribute must be a float. The value of the delta-charge attribute must be an integer. Such mass and charge uncertainty can arise from mass spectrometry studies, which often cannot resolve the exact molecular structure of each monomeric form.
The position attribute can be used to indicate a range of positions within the sequence in which the monomeric form is believed to be located, rather than a single position within the sequence. The value of the position should be two integers separated by a dash (e.g., 2-3), optionally followed by a pipe-delimited list, enclosed in square brackets, of the possible monomeric forms from which the non-canonical monomeric form is derived (e.g., 2-3 [A | C | G]). Such positional uncertainty can arise from enzymatic assays which do not have single-nucleotide resolution.

Examples

[id: "dAMP" | delta-mass: -18 | delta-charge: 0]: indicates the dehydration of dAMP whose exact structure is not known.
[id: "dI" | position: 2-3]: indicates that deoxyinosine may occur at either the second or the third position.
[id: "dI" | position: 4-8 [A | C]]: indicates that deoxyinosine may occur at any A or C between the fourth and eighth positions.
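
The format of the position attribute can be made concrete with a small parser. This is a sketch under the assumptions stated in the text above (two integers separated by a dash, optionally followed by a bracketed, pipe-delimited list); parse_position is a hypothetical helper, not a BpForms function.

```python
import re

POSITION_PATTERN = re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s*(?:\[(.*)\])?\s*$")


def parse_position(value):
    """Parse '4-8 [A | C]' into (start, end, candidate monomeric forms)."""
    match = POSITION_PATTERN.match(value)
    if match is None:
        raise ValueError(f"invalid position: {value!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if start > end:
        raise ValueError("start position must not exceed end position")
    bases = match.group(3)
    candidates = [b.strip() for b in bases.split("|")] if bases else []
    return start, end, candidates


print(parse_position("2-3"))          # (2, 3, [])
print(parse_position("4-8 [A | C]"))  # (4, 8, ['A', 'C'])
```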

Metadata

BpForms can represent several types of metadata about inline monomeric forms:

The id and name attributes are human-readable labels for monomeric forms. The id attribute should be used to represent a short label. Only one id and one name are allowed per monomeric form.
The synonym attribute is an additional human-readable label. Monomeric forms can have multiple synonyms.
The identifier attribute indicates entries in namespaces (e.g., databases and ontologies) which are equivalent to the monomeric form. The value of this attribute should be an id within a namespace and the prefix of the namespace as registered at Identifiers.org separated by "@" (e.g., "65058" @ "pubchem.compound"). Monomeric forms can have multiple identifiers.
The comments attribute can describe additional information about the monomeric form. Each monomeric form is limited to one comments attribute.

Examples

[id: "dI" | name: "deoxyinosine"]: represents the id and name of deoxyinosine.
[id: "dI" | synonym: "deoxyinosine" | synonym: "2'-deoxyinosine"]: represents multiple synonyms of deoxyinosine.
[id: "dI" | identifier: "65058" @ "pubchem.compound" | identifier: "65058" @ "pubchem.compound"]: represents equivalent entries in ChEBI and PubChem to deoxyinosine.
[id: "dI" | comments: "A purine 2'-deoxyribonucleotide monophosphate that is inosine ..."]: represents comments about deoxyinosine.
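
Because identifiers pair an id with an Identifiers.org namespace prefix, they can be turned into resolvable links. The sketch below assumes the common Identifiers.org URL pattern of prefix, colon, id; identifier_to_url is a hypothetical helper, and namespaces whose ids already embed a prefix may need different handling.

```python
import re

# Matches values of the form "id" @ "namespace".
IDENTIFIER_PATTERN = re.compile(r'^\s*"([^"]+)"\s*@\s*"([^"]+)"\s*$')


def identifier_to_url(value):
    """Convert '"65058" @ "pubchem.compound"' into an Identifiers.org URL."""
    match = IDENTIFIER_PATTERN.match(value)
    if match is None:
        raise ValueError(f"invalid identifier: {value!r}")
    entry_id, namespace = match.groups()
    # Assumed URL pattern: https://identifiers.org/<namespace>:<id>
    return f"https://identifiers.org/{namespace}:{entry_id}"


print(identifier_to_url('"65058" @ "pubchem.compound"'))
# https://identifiers.org/pubchem.compound:65058
```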

Crosslinks between non-adjacent monomers

The x-link polymer attribute can be used to indicate a bond between non-adjacent monomeric forms. For example, this attribute can be used to describe intrastrand crosslinks in DNA and disulfide bonds between cysteines in proteins.

Crosslinks can be described using our ontology of crosslinks or defined inline.

Polymers can have zero, one, or more crosslinks.

Ontology-defined crosslinks

Crosslinks defined using our ontology can be described by enclosing attributes which indicate the monomeric forms involved in the crosslink within square brackets and delimiting the attributes with pipes (e.g., CAC | x-link: [type: "disulfide" | l: 1 | r: 3]).

The value of the type attribute must be the id of a crosslink in our ontology. See the crosslinks browser for a list of the defined crosslinks.

The values of the l and r attributes should be integers which indicate the positions of the monomeric forms involved in the crosslink. The left/right orientation of the monomeric forms must match the definition of the crosslink in the ontology.

User-defined crosslinks

Users can also define crosslinks "inline" by enclosing attributes which indicate the atoms involved in the bond within square brackets and delimiting the attributes with pipes (e.g., | x-link: [l-bond-atom: 1C1 | r-bond-atom: 3C2 | ...]).

Each user-defined crosslink can be described with the following attributes:

l-bond-atom and r-bond-atom: These attributes indicate the atoms involved in the bond. The values of these attributes are the position of the monomeric form within the sequence, the element of the atom, the position of the atom within the monomeric form, and the charge of the atom (e.g., 8N3+1). Open Babel can be used to display the numbers of the atoms within monomeric forms.
l-displaced-atom and r-displaced-atom: These attributes indicate the atoms displaced by the formation of the bond. The values of these attributes are also the position of the monomeric form within the sequence, the element of the atom, the position of the atom within the monomeric form, and the charge of the atom.
order: This attribute can indicate the order (single, double, triple, aromatic) of the bond.
stereo: This attribute can indicate the stereochemistry of the bond (wedge, hash, up, down).
comments: This attribute can indicate comments about the crosslink, such as uncertainty about its location or structure.

Each user-defined crosslink can have one or more left and right bond atoms and zero or more left and right displaced atoms. Furthermore, each user-defined crosslink must have the same number of left and right bond atoms. This constraint is verified by the BpForms software.
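
The crosslink atom notation and the counting constraint can be sketched as below. This is an illustration, not the BpForms implementation: parse_xlink_atom, check_xlink, and the dictionary layout are assumptions for this example. The values come from the disulfide example in this section.

```python
import re

# Position of the monomeric form, element, position of the atom, optional charge.
XLINK_ATOM_PATTERN = re.compile(r"^(\d+)([A-Z][a-z]?)(\d+)([+-]\d+)?$")


def parse_xlink_atom(value):
    """Parse '8N3+1' into (monomer position, element, atom position, charge)."""
    match = XLINK_ATOM_PATTERN.match(value)
    if match is None:
        raise ValueError(f"invalid crosslink atom: {value!r}")
    monomer, element, atom, charge = match.groups()
    return int(monomer), element, int(atom), int(charge or 0)


def check_xlink(xlink):
    """Return a list of problems with a user-defined crosslink.

    xlink maps attribute names to lists of atom strings (hypothetical layout).
    """
    errors = []
    left = [parse_xlink_atom(a) for a in xlink.get("l-bond-atom", [])]
    right = [parse_xlink_atom(a) for a in xlink.get("r-bond-atom", [])]
    if not left or not right:
        errors.append("at least one left and one right bond atom are required")
    if len(left) != len(right):
        errors.append("numbers of left and right bond atoms differ")
    # Displaced atoms are optional but must still be well-formed.
    for key in ("l-displaced-atom", "r-displaced-atom"):
        for atom in xlink.get(key, []):
            parse_xlink_atom(atom)
    return errors


disulfide = {
    "l-bond-atom": ["1S11"],
    "l-displaced-atom": ["1H11"],
    "r-bond-atom": ["3S11"],
    "r-displaced-atom": ["3H11"],
}
print(parse_xlink_atom("8N3+1"))  # (8, 'N', 3, 1)
print(check_xlink(disulfide))     # []
```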

Example

The example below illustrates how to describe a tripeptide with a disulfide bond. The blue line indicates the disulfide bond (crosslink). The green lines indicate the bonds between the successive amino acids. The black labels indicate the positions of monomeric forms within the sequence.

Ontology-defined crosslink

CAC | x-link: [type: "disulfide"
    | l: 1
    | r: 3
]

User-defined crosslink

CAC | x-link: [
      l-bond-atom: 1S11
    | l-displaced-atom: 1H11
    | r-bond-atom: 3S11
    | r-displaced-atom: 3H11
    | comments: "Disulfide bond"
]

Nicks between adjacent monomers

The : notation can be used to indicate a nick between adjacent monomeric forms.

Example

The example below illustrates how to describe a tripeptide with a nick between the first and second residues and a disulfide bond between the first and third residues. Such a peptide could be generated by nicking the form of the peptide that doesn't contain the nick. The blue line indicates the disulfide bond (crosslink). The green lines indicate the bonds between the successive amino acids. The black labels indicate the positions of monomeric forms within the sequence.

Ontology-defined crosslink

C:AC | x-link: [type: "disulfide"
    | l: 1
    | r: 3
]

Circular and linear topologies

By default, BpForms describes linear polymers. The circular polymer attribute can be used to indicate that the polymer is circular (there is a bond between the last and first monomeric forms).

Example

The example below illustrates how to describe the circular DNA dimer of the DNA nucleotides deoxyadenosine monophosphate and deoxycytidine monophosphate. The green lines indicate the bonds between successive nucleotides. The black labels indicate the positions of monomeric forms within the sequence.

AC | circular
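
One consequence of the definitions of nicks and circularity is the number of bonds between successive monomeric forms. The sketch below counts them under the assumption that a linear polymer of n monomeric forms has n - 1 such bonds, a circular polymer has one more (between the last and first monomeric forms), and each nick removes one. The function name is hypothetical.

```python
def count_backbone_bonds(n_monomers, n_nicks=0, circular=False):
    """Count bonds between successive monomeric forms (crosslinks excluded)."""
    if n_monomers < 1:
        raise ValueError("a polymer needs at least one monomeric form")
    # A circular polymer has an extra bond joining its last and first forms.
    possible = n_monomers if circular else n_monomers - 1
    if not 0 <= n_nicks <= possible:
        raise ValueError("more nicks than bonds that could be nicked")
    return possible - n_nicks


print(count_backbone_bonds(3))                 # CAC: 2
print(count_backbone_bonds(3, n_nicks=1))      # C:AC: 1
print(count_backbone_bonds(2, circular=True))  # AC | circular: 2
```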

Coordinate system

Each residue and atom represented by BpForms has a unique coordinate. The coordinate of each residue is its position within the residue sequence of its parent polymer. The coordinate of each atom is a tuple of the coordinate of its parent residue and its position within the canonical SMILES ordering of the atoms in its parent residue prior to incorporation into polymers (which can be displayed by Open Babel).

Example

The example below illustrates the atom coordinates for the modified amino acid N⁵-methyl-L-arginine.

[id: "AA0305"
  | name: "N5-methyl-L-arginine"
  | structure: "OC(=O)[C@H](CCCN(C(=[NH2])N)C)
                [NH3+]"
  | l-bond-atom: N16-1
  | l-displaced-atom: H16+1
  | l-displaced-atom: H16
  | r-bond-atom: C2
  | r-displaced-atom: O1
  | r-displaced-atom: H1
  ]

Syntactic and semantic verification of descriptions of biopolymers

To help quality-control information about macromolecules, the BpForms user interfaces include methods for verifying the syntactic and semantic correctness of descriptions of biopolymers:

Check that each residue has a defined structure, each atom that bonds an adjacent residue has a defined element and position which is consistent with the structure of its parent residue, and each pair of consecutive residues can form a bond. For example, this can identify invalid proteins that contain consecutive residues that cannot bond because the first residue lacks a carboxyl terminus or the second residue lacks an amino terminus.
Check that the element and position of each atom in each crosslink are consistent with the structure of its parent residue.
Check that each subunit is semantically concrete.

User interfaces

BpForms includes four software interfaces for verifying descriptions of biopolymers and calculating properties such as their molecular structures, formulae, molecular weights, and charges.

Webform

The webform above can be used to validate BpForms and calculate their properties.

JSON REST API

A JSON REST API is available at https://bpforms.org/api. Documentation is available by opening this URL in your browser.

Command line interface

A command line interface is available from PyPI. Installation instructions and documentation are available at docs.karrlab.org. Documentation is available inline by running bpforms --help.

Python library

A Python library is available from PyPI. Installation instructions are available at docs.karrlab.org. Interactive tutorials are available as Jupyter notebooks at sandbox.karrlab.org. Detailed documentation is also available.

Use cases: Epigenomics, transcriptomics, proteomics, systems biology, and synthetic biology

By concretely capturing the molecular structure of biopolymers, BpForms can facilitate a wide range of epigenomics, transcriptomics, proteomics, systems biology, and synthetic biology research.

Epigenomics

BpForms can help researchers precisely communicate the structures of modified DNA, such as methylations that bacteria use to distinguish self from non-self DNA. We anticipate that this will be increasingly important as researchers continue to discover new types of modifications and begin to investigate their impact on the interactions of proteins with DNA.

Example

Several chemotherapeutics, such as cisplatin, cause toxic side effects by damaging the DNA of healthy cells. Cells have several pathways with overlapping functions to repair DNA damage. These include direct repair, base excision repair, nucleotide excision repair, and homologous recombination. Because chemotherapeutics cause a wide range of damage, and because cells have several pathways to repair DNA damage, it is challenging to assemble an integrated understanding of the repair of DNA damage caused by chemotherapeutics. BpForms can help researchers develop an integrated understanding of DNA repair by helping them concretely communicate the damage caused by each chemotherapeutic and the types of damage repaired by each pathway.

Transcriptomics

BpForms can help researchers precisely communicate the sequences of rRNA, tRNA, and other non-coding RNA; analyze RNA modifications; and improve the quality of reported sequences by identifying errors in the descriptions of modified RNA such as undefined monomeric forms and inconsistent bonds (e.g., 3' caps that are not located at the 3' position).

Example: Analysis of the metabolic load of rRNA and tRNA modification in E. coli

MODOMICS contains 732 curated sequences of rRNA and tRNA. We used BpForms together with MODOMICS to assess the metabolic cost of RNA modification in Escherichia coli. We found that E. coli tRNA have 7.8 ± 2.2 modifications per transcript that increase their mass by 166.2 ± 103.7 Da and charge by 0.40 ± 0.63 per transcript. This analysis also led us to add missing information about the origin of several of the monomeric forms derived from MODOMICS and correct three types of errors in the MODOMICS RNA sequences. The code is available at GitHub.

Proteomics

BpForms can help researchers precisely communicate the sequences of proteoforms, analyze modifications, and improve the quality of reported sequences by identifying errors in the descriptions of proteoforms, such as monomeric forms that are inconsistent with the unmodified sequence (e.g., selenocysteine modification of a non-cysteine amino acid) and inconsistent bonds (e.g., N,N,N-trimethyl-L-alanine (which has no N-terminus) located in the middle of a peptide).

Example: Analysis of the metabolic load of protein modification in humans

The PRO database contains curated modifications of 2,312 human proteins. We used BpForms together with PRO to assess the metabolic cost of protein modification in humans. We found that human proteins have 1.7 ± 1.6 modifications per protein that increase their mass by 146.4 ± 154.3 Da per protein and decrease their charge by 3.05 ± 3.30 per protein. This analysis also led us to improve PRO by correcting four types of errors in the curated modifications. The code is available at GitHub.

Systems biology

BpForms can help modelers describe the semantic meaning of models by helping modelers precisely describe the species in models. Importantly, this precision makes it easier for other researchers to understand, reuse, extend, and compose models for other studies. BpForms can also help modelers build more comprehensive models by helping researchers identify gaps in models such as missing intermediate modification states of proteins and missing interactions between modification states. In particular, BpForms can help modelers identify the full combinatorial complexity of biochemistry that should be modeled. In addition, BpForms can help researchers increase the quality of models by helping identify errors such as element imbalances.

Example

The Kholodenko model of the eukaryotic MAPK signaling cascade (DOI: 10.1046/j.1432-1327.2000.01197.x, BioModels: BIOMD0000000010) represents the biphosphorylation of Mek1/MAPKK by Mos/MAPKKK and the biphosphorylation of Erk2/MAPK by Mek1/MAPKK. Annotating the structures of the species in the model with BpForms enabled us to identify two gaps in the model: two additional intermediate phosphorylation forms of Mek1 and Erk2 and the reactions involving these species. BpForms also enabled us to identify several unbalanced reactions that do not capture phosphate donors.

Synthetic biology

BpForms can help engineers precisely represent and communicate the structures of parts for synthetic organisms. In addition, BpForms could help engineers identify the dependencies and interfaces of parts which, in turn, could help engineers use parts in alternative hosts, compose parts, and share parts more reliably.

Example

E. coli pyruvate dehydrogenase requires lipoate ligation at L43 of the active site of the E1 subunit (UniProt: P96104). By representing lipoate ligation of the E1 subunit, BpForms can help capture the dependence of E. coli pyruvate dehydrogenase on lipoate ligase. In turn, this could help engineers recognize that E. coli pyruvate dehydrogenase can only be used in other hosts that have a lipoate ligase, or that a lipoate ligase, such as LplA (UniProt: P32099), must be co-transformed with E. coli pyruvate dehydrogenase.

Integrating BpForms into the BioPAX, CellML, FASTA, PDB, SBML, and SBOL standards

BpForms can be used in conjunction with several commonly used standards in genomics, transcriptomics, proteomics, systems biology, and synthetic biology. In addition, BpForms can easily be embedded within other documents such as Excel workbooks and comma-separated tables.

Macromolecule 3D structures: PDB

BpForms can be used to provide human-readable annotations of protein structures encoded in PDB files. BpForms can be embedded within REMARK records.

Example

Protein: Bos taurus glutathione peroxidase Gpx1

...
REMARK   1
REMARK   1 >1GP1:A
REMARK   1 AAALAAAAPRTVYAFSARPLAGGEPFNLSSLRGKVLLIENVASL{SE7}GTTVRDYTQMNDLQRRLGPRGLVVLGFPCNQFGHQENAKNEEIL
REMARK   1 NCLKYVRPGGGFEPNFMLFEKCEVNGEKAHPLFAFLREVLPTPSDDATALMTDPKFITWSPVCRNDVSWNFEKFLVGPDGVPVRRYSRRFLTI
REMARK   1 DIEPDIETLLSQGASA
REMARK   1 >1GP1:B
REMARK   1 AAALAAAAPRTVYAFSARPLAGGEPFNLSSLRGKVLLIENVASL{SE7}GTTVRDYTQMNDLQRRLGPRGLVVLGFPCNQFGHQENAKNEEIV
REMARK   1 VLGFPCNQFGHQENAKNEEILNCLKYVRPGGGFEPNFMLFEKCEVNGEKAHPLFAFLREVLPTPSDDATALMTDPKFITWSPVCRNDVSWNFE
REMARK   1 KFLVGPDGVPVRRYSRRFLTIDIEPDIETLLSQGASA
...

Sets of sequences: FASTA

Sets of BpForms can be encoded in FASTA files. Such files can be written with the bpforms.util.write_to_fasta function or packages such as BioPython.

Example

Protein: multiple phosphorylated forms of H. sapiens MAPK

> y | MEK | Q02750
MPKKKPTPIQLNPAPDGSAVNGTSSAETNLEALQKKLEELELDEQQRKRLEAFLTQKQKVGELKDDDFEKISELGAGNGGVVFKVSHKPSGLVMARKLIH
LEIKPAIRNQIIRELQVLHECNSPYIVGFYGAFYSDGEISICMEHMDGGSLDQVLKKAGRIPEQILGKVSIAVIKGLTYLREKHKIMHRDVKPSNILVNS
RGEIKLCDFGVSGQLIDSMANSFVGTRSYMSPERLQGTHYSVQSDIWSMGLSLVEMAVGRYPIPPPDAKELELMFGCQVEGDAAETPPRPRTPGRPLSSY
GMDSRPPMAIFELLDYIVNEPPPKLPSGVFSLEFQDFVNKCLIKNPAERADLKQLMVHAFIKRSDAEEVDFAGWLCSTIGLNQPSTPTHAAGV

> yp | phosphorylated MEK | Q02750 | pS218
MPKKKPTPIQLNPAPDGSAVNGTSSAETNLEALQKKLEELELDEQQRKRLEAFLTQKQKVGELKDDDFEKISELGAGNGGVVFKVSHKPSGLVMARKLIH
LEIKPAIRNQIIRELQVLHECNSPYIVGFYGAFYSDGEISICMEHMDGGSLDQVLKKAGRIPEQILGKVSIAVIKGLTYLREKHKIMHRDVKPSNILVNS
RGEIKLCDFGVSGQLID{AA0037}MANSFVGTRSYMSPERLQGTHYSVQSDIWSMGLSLVEMAVGRYPIPPPDAKELELMFGCQVEGDAAETPPRPRTP
GRPLSSYGMDSRPPMAIFELLDYIVNEPPPKLPSGVFSLEFQDFVNKCLIKNPAERADLKQLMVHAFIKRSDAEEVDFAGWLCSTIGLNQPSTPTHAAGV
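
For readers who want to produce files like the one above without additional dependencies, the following is a minimal standard-library sketch. It is not the implementation of bpforms.util.write_to_fasta; the to_fasta helper is hypothetical, handles only single-character and curly-bracket codes (not inline definitions), and wraps lines without splitting a curly-bracket code across lines.

```python
import re

# A token is either a curly-bracket code such as {AA0037} or one character.
TOKEN_PATTERN = re.compile(r"\{[^}]*\}|[^{}]")


def to_fasta(records, width=100):
    """Format (header, sequence) pairs as FASTA text."""
    lines = []
    for header, seq in records:
        lines.append("> " + header)
        line = ""
        for token in TOKEN_PATTERN.findall(seq):
            # Start a new line rather than split a multi-character code.
            if line and len(line) + len(token) > width:
                lines.append(line)
                line = ""
            line += token
        if line:
            lines.append(line)
    return "\n".join(lines) + "\n"


fasta_text = to_fasta([("yp | phosphorylated form", "AC{AA0037}GT")], width=4)
print(fasta_text)
```

With the narrow width used here, the code {AA0037} is placed on its own line instead of being broken in the middle.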

Knowledge of pathways: BioPAX

BpForms can be used to concretely describe all of the DNA, RNA, and proteins involved in pathways encoded in BioPAX. BpForms can be used with the sequence child of DNAReference, RNAReference, and ProteinReference objects.

Examples

DNA: E. coli K-12 MG1655 Dam 6-methyladenine sites (701..914) involved in host recognition

...
  <bp:DNA>
    <bp:entityReference>
      <bp:DNAReference>
        <bp:sequence
          rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
          rdf:about="http://edamontology.org/format_3909#dna">
          ...
          TGATTTGCCGTGGCGAGAAAATGTCG{a}TCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAAC
          GTTACTGTTATCG{a}TCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATAT
          TGCTGAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTG{a}TCACATGGTGCTGATGGCAGGTT
          ...
        </bp:sequence>
      </bp:DNAReference>
    </bp:entityReference>
  </bp:DNA>
...

RNA: Modifications of B. subtilis tRNA^UGC involved in stability

...
  <bp:RNA>
    <bp:entityReference>
      <bp:RNAReference>
        <bp:sequence
          rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
          rdf:about="http://edamontology.org/format_3909#rna">
          GGAGCCUUAGCUCAGC{8U}GGGAGAGCGCCUGCUU{501U}GC{6A}CGCAGGAG{7G}UCAGCGG{5U}{9U}CGAUCCCGCUAGGCUCCA
          CCA
        </bp:sequence>
      </bp:RNAReference>
    </bp:entityReference>
  </bp:RNA>
...

Protein: Modifications of H. sapiens MAPK3 involved in signaling

...
  <bp:Protein>
    <bp:entityReference>
      <bp:ProteinReference>
        <bp:sequence
          rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
          rdf:about="http://edamontology.org/format_3909#protein">
          M{AA0041}AAAAQGGGGGEPRRTEGVGPGVPGEVEMVKGQPFDVGPRYTQLQYIGEGAYGMVSSAYDHVRKTRVAIKKISPFEHQTYCQRTL
          REIQILLRFRHENVIGIRDILRASTLEAMRDVYIVQDLMETDLYKLLKSQQLSNDHICYFLYQILRGLKYIHSANVLHRDLKPSNLLINTTCD
          LKICDFGLARIADPEHDH{AA0038}GFL{AA0038}E{AA0039}VA{AA0038}RWYRAPEIMLNSKGYTKSIDIWSVGCILAEMLSNRPI
          FPGKHYLDQLNHILGILGSPSQEDLNCIINMKARNYLQSLPSKTKVAWAKLFPKSDSKALDLLDRMLTFNPNKRITVEEALAHPYLEQYYDPT
          DEPVAEEPFTFAMELDDLPKERLKELIFQETARFQPGVLEAP
        </bp:sequence>
      </bp:ProteinReference>
    </bp:entityReference>
  </bp:Protein>
...

Kinetic models: CellML

BpForms can be used to concretely describe the meaning of each component of a model encoded in CellML. BpForms can be used with the RDF element of component objects.

Example

Protein: Phosphorylated Erk and Mek in signal transduction (DOI: 10.1038/msb.2009.4, Physiome Model Repository). See complete CellML file.

...
  <component cmeta:id="ypp" name="ypp">
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about="#ypp">
          <bpforms:ProteinForm xmlns:bpforms="https://bpforms.org">
            MPKKKPTPIQLNPAPDGSAVNGTSSAETNLEALQKKLEELELDEQQRKRLEAFLTQKQKVGELKDDDFEKISELGAGNGGVVFKVSHKPSG
            LVMARKLIHLEIKPAIRNQIIRELQVLHECNSPYIVGFYGAFYSDGEISICMEHMDGGSLDQVLKKAGRIPEQILGKVSIAVIKGLTYLRE
            KHKIMHRDVKPSNILVNSRGEIKLCDFGVSGQLID{AA0037}MAN{AA0037}FVGTRSYMSPERLQGTHYSVQSDIWSMGLSLVEMAVG
            RYPIPPPDAKELELMFGCQVEGDAAETPPRPRTPGRPLSSYGMDSRPPMAIFELLDYIVNEPPPKLPSGVFSLEFQDFVNKCLIKNPAERA
            DLKQLMVHAFIKRSDAEEVDFAGWLCSTIGLNQPSTPTHAAGV
          </bpforms:ProteinForm>
        </rdf:Description>
      </rdf:RDF>
  </component>
...
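As a rough sketch of how a tool might read such an annotation back out of a CellML file, the following uses only the Python standard library. The rdf and bpforms namespace URIs are copied from the example above. The enclosing model element, the cmeta namespace URI, the shortened sequence, and the function name extract_protein_forms are assumptions made for illustration. The sketch also assumes that the line breaks inside the ProteinForm element are only for display and can be removed.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BPFORMS_NS = "https://bpforms.org"

# Shortened, hypothetical CellML fragment modelled on the example above.
# The <model> wrapper and the cmeta namespace URI are assumptions.
CELLML = """
<model xmlns:cmeta="http://www.cellml.org/metadata/1.0#">
  <component cmeta:id="ypp" name="ypp">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about="#ypp">
        <bpforms:ProteinForm xmlns:bpforms="https://bpforms.org">
          GQLID{AA0037}MAN
          {AA0037}FVGTR
        </bpforms:ProteinForm>
      </rdf:Description>
    </rdf:RDF>
  </component>
</model>
"""


def extract_protein_forms(cellml_text):
    """Map each rdf:about target to its BpForms string with wrapping removed."""
    root = ET.fromstring(cellml_text.strip())
    forms = {}
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        form = description.find(f"{{{BPFORMS_NS}}}ProteinForm")
        if form is not None and form.text:
            about = description.get(f"{{{RDF_NS}}}about", "")
            # Join the wrapped lines into one string without whitespace.
            forms[about.lstrip("#")] = "".join(form.text.split())
    return forms


forms = extract_protein_forms(CELLML)
print(forms)
```

The resulting strings could then be passed to any of the BpForms interfaces for verification.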

Kinetic models: SBML

BpForms can concretely describe the meaning of each species in a model encoded in the Systems Biology Markup Language (SBML). The BpForms description is embedded in the annotation element of the species object.

Examples

Protein: Phosphorylated Cdc2 and Cdc12 in the yeast cell cycle (DOI: 10.1073/pnas.88.16.7328, BioModels: BIOMD0000000005). See complete SBML file.

...
  <species name="cdc2k-p" metaid="cdc2k">
    <annotation>
      <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about="#cdc2k">
          <bpforms:ProteinForm xmlns:bpforms="https://bpforms.org">
            MENYQKVEKIGEG{AA0038}{AA0039}GVVYKARHKLSGRIVAMKKIRLEDESEGVPSTAIREISLLKEVNDENNRSNCVRLLDI
            LHAESKLYLVFEFLDMDLKKYMDRISETGATSLDPRLVQKFTYQLVNGVNFCHSRRIIHRDLKPQNLLIDKEGNLKLADFGLARSFGVPLRN
            Y{AA0038}HEIVTLWYRAPEVLLGSRHYSTGVDIWSVGCIFAEMIRRSPLFPGDSEIDEIFKIFQVLGTPNEEVWPGVTLLQDYKSTFPRW
            KRMDLHKVVPNGEEDAIELLSAMLVYDPAHRISAKRALQQNYLRDFH
          </bpforms:ProteinForm>
        </rdf:Description>
      </rdf:RDF>
    </annotation>
  </species>
...

Protein: Phosphorylated Mos/Raf1, Mek1, and Erk2 in the eukaryote MAPK cascade (DOI: 10.1046/j.1432-1327.2000.01197.x, BioModels: BIOMD0000000010). See complete SBML file.

...
  <species name="Erk2-PP" metaid="_584615">
    <annotation>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about="#_584615">
          <bpforms:ProteinForm xmlns:bpforms="https://bpforms.org">
            MAAAGAASNPGGGPEMVRGQAFDVGPRYINLAYIGEGAYGMVCSAHDNVNKVRVAIKKISPFEHQTYCQRTLREIKILLRFKHENIIGINDI
            IRAPTIEQMKDVYIVQDLMETDLYKLLKTQHLSNDHICYFLYQILRGLKYIHSANVLHRDLKPSNLLLNTTCDLKICDFGLARVADPDHDHT
            GFL{AA0038}E{AA0039}VATRWYRAPEIMLNSKGYTKSIDIWSVGCILAEMLSNRPIFPGKHYLDQLNHILGILGSPSQEDLNCIINLK
            ARNYLLSLPHKNKVPWNRLFPNADPKALDLLDKMLTFNPHKRIEVEAALAHPYLEQYYDPSDEPVAEAPFKFEMELDDLPKETLKELIFEET
            ARFQPGY
          </bpforms:ProteinForm>
        </rdf:Description>
      </rdf:RDF>
    </annotation>
  </species>
...
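Once a form has been read from an annotation, a downstream tool may want to know which monomers are written as braced alphabet codes and where they occur. The sketch below is a simplified tokenizer, not the BpForms grammar: it assumes the form consists only of single upper-case letters and braced codes, as in the examples above, and it does not handle inline monomer definitions, crosslinks, or other grammar features. The function name braced_code_positions is hypothetical, and positions are counted from 1 per monomer.

```python
import re

# One monomer per match: either a braced alphabet code such as {AA0038},
# or a single upper-case letter. Assumption: the form contains nothing else.
MONOMER = re.compile(r"\{([^{}]+)\}|[A-Z]")


def braced_code_positions(form):
    """Return (position, code) for each braced code, counting monomers from 1."""
    compact = "".join(form.split())  # drop the line wrapping used in the XML
    positions = []
    for index, match in enumerate(MONOMER.finditer(compact), start=1):
        if match.group(1) is not None:
            positions.append((index, match.group(1)))
    return positions


# Excerpt of the Erk2-PP form above, wrapped as it is inside the annotation.
erk2_excerpt = """
    DHT
    GFL{AA0038}E{AA0039}VATRWY
"""
positions = braced_code_positions(erk2_excerpt)
print(positions)
```

Because each braced code counts as a single monomer, the positions refer to residues in the polymer rather than to character offsets in the string.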

Genetic designs: SBOL

BpForms can be used to describe the meaning of each DNA, RNA, and protein molecule in genetic designs encoded in the Synthetic Biology Open Language (SBOL). BpForms can be used with the elements attribute of Sequence objects.

The following URIs should be used to indicate the encodings for the sequences of DNA, RNA, and protein molecules.

DNA: http://edamontology.org/format_3909#dna
RNA: http://edamontology.org/format_3909#rna
Protein: http://edamontology.org/format_3909#protein

See SBOL SEP 033 and issue 77 for more information.

Examples

RNA: Modified B. subtilis tRNA^ILE 69 (SynBioHub: BO_28687). See complete SBOL file.

...
  <sbol:Sequence>
    <sbol:elements>
      GGGCCUGUAGCUCAGC{8U}GG{8U}{8U}AGAGCGCACGCCUGAU{62A}AGCGUGAG{7G}UCGAUGG{5U}{9U}CGAGUCCAUUCAGGCCCACCA
    </sbol:elements>
    <sbol:encoding rdf:resource="http://edamontology.org/format_3909#rna"/>
  </sbol:Sequence>
...

Protein: Lipoate-ligated acetyltransferase component PdhC of B. subtilis pyruvate dehydrogenase complex (SynBioHub: BO_32431). See complete SBOL file.

...
  <sbol:Sequence>
    <sbol:elements>
      MAFEFKLPDIGEGIHEGEIVKWFVKPNDEVDEDDVLAEVQND{AA0118}AVVEIPSPVKGKVLELKVEEGTVATVGQTIITFDAPGYEDLQFKGSDE
      SDDAKTEAQVQSTAEAGQDVAKEEQAQEPAKATGAGQQDQAEVDPNKRVIAMPSVRKYAREKGVDIRKVTGSGNNGRVVKEDIDSFVNGGAQEAAPQE
      TAAPQETAAKPAAAPAPEGEFPETREKMSGIRKAIAKAMVNSKHTAPHVTLMDEVDVTNLVAHRKQFKQVAADQGIKLTYLPYVVKALTSALKKFPVL
      NTSIDDKTDEVIQKHYFNIGIAADTEKGLLVPVVKNADRKSVFEISDEINGLATKAREGKLAPAEMKGASCTITNIGSAGGQWFTPVINHPEVAILGI
      GRIAEKAIVRDGEIVAAPVLALSLSFDHRMIDGATAQNALNHIKRLLNDPQLILMEA
    </sbol:elements>
    <sbol:encoding rdf:resource="http://edamontology.org/format_3909#protein"/>
  </sbol:Sequence>
...
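The encoding URIs listed above let a reader of an SBOL file decide which Sequence objects carry BpForms descriptions and of which polymer type. The following is a minimal sketch of that lookup using the Python standard library. The three encoding URIs are taken from the list above. The sbol namespace URI (the SBOL 2 namespace), the enclosing rdf:RDF element, the shortened sequence, and the function name read_bpforms_sequences are assumptions, since the fragments above omit their namespace declarations.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SBOL_NS = "http://sbols.org/v2#"  # assumption: not shown in the fragments above

# Encoding URIs from the list above, mapped to the polymer type they indicate.
ENCODINGS = {
    "http://edamontology.org/format_3909#dna": "DNA",
    "http://edamontology.org/format_3909#rna": "RNA",
    "http://edamontology.org/format_3909#protein": "Protein",
}

# Shortened, hypothetical SBOL fragment modelled on the RNA example above.
SBOL = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:sbol="http://sbols.org/v2#">
  <sbol:Sequence>
    <sbol:elements>
      GGGCCUGUAGCUCAGC{8U}GG{8U}{8U}AGAGCG
    </sbol:elements>
    <sbol:encoding rdf:resource="http://edamontology.org/format_3909#rna"/>
  </sbol:Sequence>
</rdf:RDF>
"""


def read_bpforms_sequences(sbol_text):
    """Return (polymer type, form) for each Sequence with a BpForms encoding."""
    root = ET.fromstring(sbol_text.strip())
    results = []
    for sequence in root.iter(f"{{{SBOL_NS}}}Sequence"):
        elements = sequence.find(f"{{{SBOL_NS}}}elements")
        encoding = sequence.find(f"{{{SBOL_NS}}}encoding")
        if elements is None or encoding is None:
            continue
        polymer_type = ENCODINGS.get(encoding.get(f"{{{RDF_NS}}}resource"))
        if polymer_type is None:
            continue  # some other encoding, so not a BpForms description
        results.append((polymer_type, "".join((elements.text or "").split())))
    return results


sequences = read_bpforms_sequences(SBOL)
print(sequences)
```

Sequences with any other encoding URI are skipped, so the same function can be run over designs that mix BpForms and conventional sequence encodings.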

Resources for determining the sequences of biopolymer forms

Below are several resources which can be helpful for determining the sequences of natural biopolymers and designing the sequences of synthetic biopolymers.

DNA

DNAMod: Database of non-canonical DNA nucleobases
MethDB: Database of non-canonical DNA
MethSMRT: Database of non-canonical DNA
PDB Chemical Component Dictionary: Database of modified deoxyribonucleic acids
REPAIRtoire: Database of DNA damages

RNA

MODOMICS: Database of non-canonical RNA nucleosides
PDB Chemical Component Dictionary: Database of modified ribonucleic acids
RMBase: Database of modified RNA
RNA Modification Database: Database of modified RNA

Drawing structures of monomeric forms

ChemAxon Marvin: Software for drawing structures of monomeric forms
Open Babel: Software for calculating the numbers of the atoms in monomeric forms

Proteins

dbPTM: Database of non-canonical amino acids
Delta Mass: Database of modified amino acids
FindMod: Database of post-translational modifications
iPTMnet: Database of post-translational modifications
ProForma: Notation for protein forms. Note that this notation is ambiguous, which limits its ability to facilitate data integration and the calculation of properties of protein forms.
PDB Chemical Component Dictionary: Database of modified amino acids
PDB in Europe Chemical Components: Database of modified amino acids
PhosphoSitePlus: Database of protein phosphorylations
Protein Ontology: Database of modified proteins
PSIMOD: Ontology of non-canonical amino acids
RESID: Database of non-canonical protein residues
UniMod: Database of non-canonical amino acids
UniProt: Database of modified amino acids in proteins
UniProt Controlled Vocabulary of Posttranslational Modifications: Database of modified amino acids

Tutorials, documentation, and help

Documentation for the grammar

Documentation for the grammar is available above. The definition of the grammar is available at GitHub.

Definitions of the monomeric forms in the alphabets

Documentation for the alphabets is available above. This includes images and detailed information about each monomeric form in each alphabet.

Query builder for the REST API

A visual interface for building REST queries is available at bpforms.org/api.

Documentation for the REST API

Documentation for the REST API is available at bpforms.org/api.

Installation instructions for the CLI and Python API

Installation instructions are available at docs.karrlab.org. A minimal Dockerfile is also available from the Git repository.

Documentation for the command line program

Documentation for the command line program is available inline by running bpforms --help.

Tutorial for the Python API

A Jupyter notebook with an interactive tutorial is available at sandbox.karrlab.org.

Documentation for the Python API

Documentation for the Python API is available at docs.karrlab.org.

Comparison with other resources

Please see the documentation for a detailed comparison of BpForms with other formats and alphabet-like resources.

Questions

Please contact the Karr Lab with any questions.

Contributing to BpForms

Contributing to the alphabets

To suggest new residues or modifications to the existing residues, please use this GitHub issue template, submit a GitHub pull request, or contact us by email.

Please provide as much information as possible about each residue using the YAML alphabet format. Please see the Git repository for examples.

Contributing an additional alphabet

To contribute an additional alphabet, please use this GitHub issue template, submit a GitHub pull request, or contact us by email.

Please provide as much information as possible about each residue using the YAML alphabet format. Please see the Git repository for examples.

Contributing to the software

To contribute to the software, please submit a GitHub pull request or contact us by email.

About BpForms

Source code

BpForms is available open-source from GitHub.

License

BpForms is released under the MIT license.

Citing BpForms

Lang PF, Chebaro Y & Karr JR. BpForms: a toolkit for concretely describing modified DNA, RNA and proteins. arXiv:1903.10042.

Team

BpForms was developed by Yassmine Chebaro, Jonathan Karr, Paul Lang, and John Sekar in the Karr Lab at the Icahn School of Medicine at Mount Sinai in New York, USA.

Acknowledgements

BpForms was supported by a National Institutes of Health P41 award, a National Institutes of Health MIRA R35 award, and a National Science Foundation INSPIRE award.

Questions/comments

Please contact the Karr Lab with any questions or comments.
