Bad chemistry in old protein-ligand binding complex data set

The Astex Diverse set [1] is a dataset containing the crystallographic poses of 85 protein-ligand complexes. It was introduced in 2007 to address problems in earlier datasets, such as incorrect ligand representations.

Loading the 85 ligand files with a current version of the cheminformatics toolkit RDKit [2] is, however, not as straightforward as you might expect.

Of the 85 files, 29 fail RDKit’s sanitization checks because each contains a neutral nitrogen atom bonded to two carbon atoms and two hydrogen atoms. That gives the nitrogen four bonds, one more than the valence of three that is allowed for a neutral nitrogen.

[Figure: ligand loaded from the dedicated ligand SDF file obtained from the PDB.]
[Figure: ligand LI9 of complex 1YWR.]
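
The valence problem described above can be spotted without any toolkit by counting bond orders in the mol block of a ligand file. The following is a minimal sketch, not RDKit's actual sanitization logic. It assumes V2000 mol blocks with Kekulé bond orders (1 to 3) and formal charges stored in `M  CHG` lines. The helper names `toy_mol_block` and `overvalent_nitrogens` and the five-atom toy fragment are hypothetical and only serve as illustration.

```python
def toy_mol_block(atoms, bonds):
    """Build a minimal V2000 mol block from (symbol, x, y, z) and (a, b, order) tuples."""
    counts = f"{len(atoms):3d}{len(bonds):3d}  0  0  0  0  0  0  0  0999 V2000"
    atom_lines = [
        f"{x:10.4f}{y:10.4f}{z:10.4f} {sym:<3} 0  0  0  0  0  0  0  0  0  0  0  0"
        for sym, x, y, z in atoms
    ]
    bond_lines = [f"{a:3d}{b:3d}{order:3d}  0" for a, b, order in bonds]
    return "\n".join(["toy", "", "", counts, *atom_lines, *bond_lines, "M  END"])


def overvalent_nitrogens(mol_block):
    """Return 1-based indices of uncharged nitrogens whose bond orders sum to more than 3."""
    lines = mol_block.splitlines()
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    atom_lines = lines[4:4 + n_atoms]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    tail = lines[4 + n_atoms + n_bonds:]

    # Element symbols sit in a fixed-width column of each atom line.
    symbols = [line[31:34].strip() for line in atom_lines]

    # Collect atoms that carry a non-zero formal charge in "M  CHG" lines.
    charged = set()
    for line in tail:
        if line.startswith("M  CHG"):
            fields = line.split()[3:]  # alternating atom index, charge
            for idx, chg in zip(fields[0::2], fields[1::2]):
                if int(chg) != 0:
                    charged.add(int(idx))

    # Sum the bond orders on every atom.
    valence = [0] * n_atoms
    for line in bond_lines:
        a, b, order = int(line[0:3]), int(line[3:6]), int(line[6:9])
        valence[a - 1] += order
        valence[b - 1] += order

    return [
        i + 1
        for i, (sym, val) in enumerate(zip(symbols, valence))
        if sym == "N" and (i + 1) not in charged and val > 3
    ]


# Toy fragment: one nitrogen bonded to two carbons and two hydrogens.
atoms = [
    ("N", 0.0, 0.0, 0.0),
    ("C", 1.4, 0.0, 0.0),
    ("C", -0.7, 1.2, 0.0),
    ("H", -0.4, -0.6, 0.8),
    ("H", -0.4, -0.6, -0.8),
]
bonds = [(1, 2, 1), (1, 3, 1), (1, 4, 1), (1, 5, 1)]

broken = toy_mol_block(atoms, bonds)
fixed = toy_mol_block(atoms[:4], bonds[:3])  # one hydrogen removed
charged = broken.replace("M  END", "M  CHG  1   1   1\nM  END")  # N gets +1

print(overvalent_nitrogens(broken))   # the nitrogen (atom 1) is flagged
print(overvalent_nitrogens(fixed))    # nothing flagged
print(overvalent_nitrogens(charged))  # nothing flagged
```

The last three lines show the same nitrogen in three states: as found in the failing files, with one hydrogen removed, and with a positive formal charge. Only the first is flagged.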

Luckily, there are ways to rectify this. First, we can download a chemically valid ligand file with the same coordinates from the PDB [3], at the cost of losing the protonation state that was chosen for the Astex Diverse set.

Alternatively, we can either add a positive charge to the nitrogen or delete one hydrogen atom to satisfy the sanitization checks. Adding a positive charge to ligand LI9, shown above, yields a protonated secondary amine. With a pKa of 0.8, this protonated form would lose its extra proton almost completely near physiological pH, so it is unlikely to be the intended structure of the ligand.
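
To put a number on this argument, the Henderson-Hasselbalch relation gives the fraction of molecules that would carry the extra proton at a given pH. This is a rough sketch: the pKa of 0.8 is the value quoted above, while the pH of 7.4 is an assumed stand-in for "physiological pH" and the function name `protonated_fraction` is hypothetical.

```python
def protonated_fraction(pka, ph):
    """Fraction of a base present in its protonated form (Henderson-Hasselbalch)."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# pKa from the text above; pH 7.4 is an assumption for physiological conditions.
fraction = protonated_fraction(pka=0.8, ph=7.4)
print(f"{fraction:.1e}")  # 2.5e-07
```

Under these assumptions only about one molecule in four million would be protonated, which supports discarding the charged form.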

So the best option in this case is to remove one of the two hydrogens. We only have to be careful to follow the protocol intended for the Astex Diverse set and re-optimize the positions of the remaining hydrogen atoms with a force field.
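
The hydrogen removal itself is mostly index bookkeeping, which the following toolkit-free sketch illustrates on a toy fragment. It assumes a V2000 mol block with no property lines that refer to atom numbers, and the names `toy_mol_block` and `drop_one_hydrogen` are hypothetical. The force field re-optimization of the remaining hydrogens is not shown; it would need a cheminformatics toolkit.

```python
def toy_mol_block(atoms, bonds):
    """Build a minimal V2000 mol block from (symbol, x, y, z) and (a, b, order) tuples."""
    counts = f"{len(atoms):3d}{len(bonds):3d}  0  0  0  0  0  0  0  0999 V2000"
    atom_lines = [
        f"{x:10.4f}{y:10.4f}{z:10.4f} {sym:<3} 0  0  0  0  0  0  0  0  0  0  0  0"
        for sym, x, y, z in atoms
    ]
    bond_lines = [f"{a:3d}{b:3d}{order:3d}  0" for a, b, order in bonds]
    return "\n".join(["toy", "", "", counts, *atom_lines, *bond_lines, "M  END"])


def drop_one_hydrogen(mol_block, center):
    """Delete the first hydrogen bonded to the 1-based atom `center`."""
    lines = mol_block.splitlines()
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    atoms = lines[4:4 + n_atoms]
    bonds = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    tail = lines[4 + n_atoms + n_bonds:]

    # Find the hydrogens attached to the chosen atom.
    pairs = [(int(b[0:3]), int(b[3:6])) for b in bonds]
    neighbours = [y if x == center else x for x, y in pairs if center in (x, y)]
    hydrogens = [n for n in neighbours if atoms[n - 1][31:34].strip() == "H"]
    if not hydrogens:
        raise ValueError("no hydrogen bonded to the chosen atom")
    victim = hydrogens[0]

    # Drop the atom line, drop its bond, and shift every higher atom index down by one.
    kept_atoms = atoms[:victim - 1] + atoms[victim:]
    kept_bonds = []
    for (a, b), line in zip(pairs, bonds):
        if victim in (a, b):
            continue
        a = a - 1 if a > victim else a
        b = b - 1 if b > victim else b
        kept_bonds.append(f"{a:3d}{b:3d}{line[6:]}")

    counts = f"{len(kept_atoms):3d}{len(kept_bonds):3d}{lines[3][6:]}"
    return "\n".join(lines[:3] + [counts] + kept_atoms + kept_bonds + tail)


# Toy fragment: nitrogen (atom 1) bonded to two carbons and two hydrogens.
broken = toy_mol_block(
    [
        ("N", 0.0, 0.0, 0.0),
        ("C", 1.4, 0.0, 0.0),
        ("H", -0.4, -0.6, 0.8),
        ("H", -0.4, -0.6, -0.8),
        ("C", -0.7, 1.2, 0.0),
    ],
    [(1, 2, 1), (1, 3, 1), (1, 4, 1), (1, 5, 1)],
)
fixed = drop_one_hydrogen(broken, center=1)
print(fixed)
```

The remaining hydrogen keeps the coordinates it had while the nitrogen still carried four bonds, which is presumably why its position should be re-optimized afterwards.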

References

  1. Hartshorn, M. J. et al. Diverse, high-quality test set for the validation of protein-ligand docking performance. J. Med. Chem. 50, 726–741 (2007).

  2. RDKit: Open-source cheminformatics. https://www.rdkit.org

  3. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).