PDBFixer

Copyright 2013-2017 by Peter Eastman and Stanford University

1. Introduction

Protein Data Bank (PDB or PDBx/mmCIF) files often have a number of problems that must be fixed before they can be used in a molecular dynamics simulation. The details vary depending on how the file was generated. Here are some of the most common ones:
  1. If the structure was generated by X-ray crystallography, most or all of the hydrogen atoms will usually be missing.
  2. There may also be missing heavy atoms in flexible regions that could not be clearly resolved from the electron density. This may include anything from a few atoms at the end of a sidechain to entire loops.
  3. Many PDB files are also missing terminal atoms that should be present at the ends of chains.
  4. The file may include nonstandard residues that were added for crystallography purposes, but are not present in the naturally occurring molecule you want to simulate.
  5. The file may include more than what you want to simulate. For example, there may be salts, ligands, or other molecules that were added for experimental purposes. Or the crystallographic unit cell may contain multiple copies of a protein, but you only want to simulate a single copy.
  6. There may be multiple locations listed for some atoms.
  7. If you want to simulate the structure in explicit solvent, you will need to add a water box surrounding it.
  8. For membrane proteins, you may also need to add a lipid membrane.
PDBFixer can fix all of these problems for you in a fully automated way. You simply select a file, tell it which problems to fix, and it does everything else.

PDBFixer can be used in three different ways: as a desktop application with a graphical user interface; as a command line application; or as a Python API. This allows you to use it in whatever way best matches your own needs for flexibility, ease of use, and scriptability. The following sections describe how to use it in each of these ways.

2. Installation

To install PDBFixer, navigate to the root directory of the source distribution you've download and type

python setup.py install

This will install the PDBFixer python package as well as the command line program pdbfixer.

Before running PDBFixer, you must first install OpenMM 6.3 or later. Follow the installation instructions in the OpenMM manual. It is also recommended that you install CUDA or OpenCL, since the performance will usually be faster than when running on the CPU platform. PDBFixer requires that NumPy be installed.

Alternatively, PDBFixer is included as part of the Omnia suite for molecular simulation. If you install the suite, PDBFixer and its dependencies will be included.

3. PDBFixer as a Desktop Application

To run PDBFixer as a desktop application, type

pdbfixer

on the command line. PDBFixer displays its user interface through a web browser, but it is still a single user desktop application. It should automatically launch a web browser and open a new window displaying the user interface. If for any reason this does not happen, you can launch a web browser yourself and point it to http://localhost:8000.

The user interface consists of a series of pages for selecting a PDB or PDBx/mmCIF file and choosing what changes to make to it. Depending on the details of a particular file, some of these pages may be skipped.

Load File

On this page you select the PDB file to process. You can load a file from your local disk, or enter the identifier of a PDB structure to download from RCSB.

Select Chains

This is the first place where you can choose to remove parts of the structure. It lists all chains contained in the file. If you want to remove some of them, just uncheck them before clicking "Continue".

This page also gives options for removing heterogens. A heterogen is any residue other than a standard amino acid or nucleotide. The possible options are to keep all heterogens; keep only water while deleting all other heterogens; or delete all heterogens.

Add Residues

If the file has missing resides (based on the sequence given in the SEQRES records), this page will list them. Select which blocks of missing residues to add, then click "Continue".

Convert Residues

If the file contains nonstandard residues, this page will allow you to replace them with standard ones. It suggests a reasonable replacement for each residue, but you can choose a different one if you prefer.

Add Heavy Atoms

This page lists all heavy atoms that are missing from the file. They will be added automatically.

Add Hydrogens, Water, and Membrane

This page gives you the chance to make other optional changes:

Save File

You're all done! Click "Save File" to save the processed PDB file to disk. Then click "Process Another File" if you have more files to process, or "Quit" (in the top right corner of the page) if you are finished.

4. PDBFixer as a Command Line Application

PDBFixer provides a simple command line interface that is especially useful if you want to script it to process many files at once. This interface is significantly less flexible than either the desktop interface or the Python API, but it is still powerful enough for many purposes.

To get usage instructions for the command line interface, type

pdbfixer --help

This displays the following information:

Usage: pdbfixer
       pdbfixer [options] filename

When run with no arguments, it launches the user interface.  If any arguments are specified, it runs in command line mode.

Options:
  -h, --help            show this help message and exit
  --pdbid=PDBID         PDB id to retrieve from RCSB [default: None]
  --url=URL             URL to retrieve PDB from [default: None]
  --output=FILENAME     output pdb file [default: output.pdb]
  --add-atoms=ATOMS     which missing atoms to add: all, heavy, hydrogen, or
                        none [default: all]
  --keep-heterogens=OPTION
                        which heterogens to keep: all, water, or none
                        [default: all]
  --replace-nonstandard
                        replace nonstandard residues with standard equivalents
  --add-residues        add missing residues
  --water-box=X Y Z     add a water box. The value is the box dimensions in nm
                        [example: --water-box=2.5 2.4 3.0]
  --ph=PH               the pH to use for adding missing hydrogens [default:
                        7.0]
  --positive-ion=ION    positive ion to include in the water box: Cs+, K+,
                        Li+, Na+, or Rb+ [default: Na+]
  --negative-ion=ION    negative ion to include in the water box: Cl-, Br-,
                        F-, or I- [default: Cl-]
  --ionic-strength=STRENGTH
                        molar concentration of ions to add to the water box
                        [default: 0.0]
For example, consider the following command line:

pdbfixer --keep-heterogens=water --replace-nonstandard --water-box=4.0 4.0 3.0 myfile.pdb

This will load the file "myfile.pdb". It will add any missing atoms to existing residues, but not add any missing residues (because we did not specify --add-residues). Hydrogens will be added that are appropriate at pH 7.0 (the default). If the file contains any nonstandard amino acids or nucleotides, they will be replaced with the closest equivalent standard ones. Any water molecules present in the file will be kept, but all other heterogens will be deleted. Finally a water box of size 4 by 4 by 3 nanometers will be added surrounding the structure. If necessary, counterions will be added to neutralize it (Na+ or Cl-), but no other ions will be added (because we accepted the default ionic strength of 0.0). After making all these changes, the result will be written to a file called "output.pdb".

5. PDBFixer as a Python API

This is the most powerful way to use PDBFixer. It allows you to script the processing of PDB files while maintaining precise programmatic control of every part of the process.

PDBFixer is based on OpenMM, and to use its Python API you should first be familiar with the OpenMM API. Consult the OpenMM documentation for details. In everything that follows, I will assume you are already familiar with OpenMM.

To use PDBFixer create a PDBFixer object, passing to its constructor the file to process. You then call a series of methods on it to perform various transformations. When all the transformations are done, you can get the new structure from its topology and positions fields. The overall outline of your code will look something like this:

fixer = PDBFixer(filename='myfile.pdb')
# ...
# Call various methods on the PDBFixer
# ...
PDBFile.writeFile(fixer.topology, fixer.positions, open('output.pdb', 'w'))
There are many different methods you might call in the middle, depending on what you want to do. These methods are described below. Almost all of them are optional, but they must be called in exactly the order listed. You may choose not to call some of the optional methods, but you may not call them out of order.

Remove Chains

To remove unwanted chains from the structure, call

fixer.removeChains(indices)

where indices is a list of the indices of the chains to remove.

Identify Missing Residues

To identify missing residues, call

fixer.findMissingResidues()

This method identifies any residues in the SEQRES records for which no atom data is present. The residues are not actually added yet: it just stores the results into the missingResidues field. This is a dictionary. Each key is a tuple consisting of the index of a chain, and the residue index within that chain at which new residues should be inserted. The corresponding value is a list of the names of residues to insert there. After calling this method, you can modify the content of that dictionary to specify what residues should actually be inserted. For example, you can remove an entry to prevent that set of residues from being inserted, or change the identities of the residues to insert. If you do not want any residues at all to be added, you can just write

fixer.missingResidues = {}

The residues actually get added when you call addMissingAtoms(), described below.

Replace Nonstandard Residues

If you want to replace nonstandard residues with their standard versions, call
fixer.findNonstandardResidues()
fixer.replaceNonstandardResidues()
findNonstandardResidues() stores the results into the nonstandardResidues field, which is a list. Each entry is a tuple whose first element is a Residue object and whose second element is the name of the suggested replacement residue. Before calling replaceNonstandardResidues() you can modify the contents of this list. For example, you can remove an entry to prevent a particular residue from being replaced, or you can change what it will be replaced with. You can even add new entries if you want to mutate other residues.

Remove Heterogens

If you want to remove heterogens, call

fixer.removeHeterogens(False)

The argument specifies whether to keep water molecules. False removes all heterogens including water. True keeps water molecules while removing all other heterogens.

Add Missing Heavy Atoms

To add missing heavy atoms, call
fixer.findMissingAtoms()
fixer.addMissingAtoms()
findMissingAtoms() identifies all missing heavy atoms and stores them into two fields called missingAtoms and missingTerminals. Each of these is a dictionary whose keys are Residue objects and whose values are lists of atom names. missingAtoms contains standard atoms that should be present in any residue of that type, while missingTerminals contains missing terminal atoms that should be present at the start or end of a chain. You are free to remove atoms from these dictionaries before continuing, if you want to prevent certain atoms from being added.

addMissingAtoms() is the point at which all heavy atoms get added. This includes the ones identified by findMissingAtoms() as well as the missing residues identified by findMissingResidues(). Also, if you used replaceNonstandardResidues() to modify any residues, that will have removed any atoms that do not belong in the replacement residue, but it will not have added ones that are missing from the original residue. addMissingAtoms() is the point when those get added.

Add Missing Hydrogens

If you want to add missing hydrogen atoms, call

fixer.addMissingHydrogens(7.0)

The argument is the pH based on which to select protonation states.

Add Water

If you want to add a water box, call addSolvent(). This method has several optional arguments. Its full definition is

addSolvent(self, boxSize, positiveIon='Na+', negativeIon='Cl-', ionicStrength=0*molar)

boxSize is a Vec3 object specifying the dimensions of the water box to add. The other options specify the types of ions to add and the desired ionic strength. Allowed values for positiveIon are Cs+, K+, Li+, Na+, and Rb+. Allowed values for negativeIon are Cl-, Br-, F-, and I-. For example, the following line builds a 5 nm cube of 0.1 molar potassium chloride:

fixer.addSolvent(Vec3(5, 5, 5)*nanometer, positiveIon='K+', ionicStrength=0.1*molar)

Add Membrane

If you want to add a lipid membrane, call addMembrane(). This method also adds water, so you should not also call addSolvent() when using it. There are several optional arguments. The full definition is

addMembrane(self, lipidType='POPC', membraneCenterZ=0*nanometer, minimumPadding=1*nanometer, positiveIon='Na+', negativeIon='Cl-', ionicStrength=0*molar)

lipidType should be either 'POPC' or 'POPE'. membraneCenterZ is the position along the Z axis at which the center of the membrane will be located. minimumPadding is the minimum acceptable distance between the protein and the edges of the periodic box. All other arguments are the same as for addSolvent(). For example, the following line adds a POPE membrane centered at Z=0 with at least 1 nm of padding on all sides:

fixer.addMembrane('POPE')

Examples

Here is a complete example that ties this together. It adds all missing atoms including heavy atoms, missing residues, and hydrogens. It replaces all nonstandard residues by their standard equivalents then deletes all remaining heterogens except water. Finally, it fills in all gaps with water; that is, it adds a water box whose dimensions match the crystallographic unit cell. It saves the result to a file called output.pdb.
from pdbfixer import PDBFixer
from simtk.openmm.app import PDBFile
fixer = PDBFixer(filename='myfile.pdb')
fixer.findMissingResidues()
fixer.findNonstandardResidues()
fixer.replaceNonstandardResidues()
fixer.removeHeterogens(True)
fixer.findMissingAtoms()
fixer.addMissingAtoms()
fixer.addMissingHydrogens(7.0)
fixer.addSolvent(fixer.topology.getUnitCellDimensions())
PDBFile.writeFile(fixer.topology, fixer.positions, open('output.pdb', 'w'))
Suppose you want to keep only the first chain in the file and remove all the others. To do this, call removeChains():
fixer = PDBFixer(filename='myfile.pdb')
numChains = len(list(fixer.topology.chains()))
fixer.removeChains(range(1, numChains))
fixer.findMissingResidues()
Suppose you only want to add missing residues in the middle of a chain, not ones at the start or end of the chain. This can be done by modifying the missingResidues field immediately after calling findMissingResidues():
fixer.findMissingResidues()
chains = list(fixer.topology.chains())
keys = fixer.missingResidues.keys()
for key in keys:
    chain = chains[key[0]]
    if key[1] == 0 or key[1] == len(list(chain.residues())):
        del fixer.missingResidues[key]
fixer.findNonstandardResidues()
Suppose that instead of matching the size of the crystallographic unit cell, you want to add a cubic water box large enough to hold the entire system. That is, you want to compute the range of atom positions along each axis, then set the size of the water box to match the largest axis. This can be done as follows:
fixer.addMissingHydrogens(7.0)
maxSize = max(max((pos[i] for pos in fixer.positions))-min((pos[i] for pos in fixer.positions)) for i in range(3))
boxSize = maxSize*Vec3(1, 1, 1)
fixer.addSolvent(boxSize)
PDBFile.writeFile(fixer.topology, fixer.positions, open('output.pdb', 'w'))
Suppose you want to replace all nonstandard residues by alanine. findNonstandardResidues() selects a suggested replacement for each residue, but you can modify that before calling replaceNonstandardResidues():
fixer.findNonstandardResidues()
fixer.nonstandardResidues = [(residue, 'ALA') for residue, replacement in fixer.nonstandardResidues]
fixer.replaceNonstandardResidues()