Bad chemistry in old protein-ligand binding complex data set

Published: December 13, 2022

The Astex Diverse set¹ is a dataset containing the crystallized poses of 85 protein-ligand complexes. It was introduced in 2007 to address problems in previous datasets such as incorrect ligand representation.

Loading the 85 ligand files with today’s version of the cheminformatics toolkit RDKit² is, however, not as straightforward as you might expect.

Of the 85 files, 29 fail RDKit’s sanitization checks because each contains a neutral nitrogen atom bonded to two carbon atoms and two hydrogen atoms. That gives the nitrogen four bonds, one more than the three a neutral nitrogen can form.

Figure: Ligand LI9 of complex 1YWR, loaded from the dedicated ligand SDF file obtained from the PDB.
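The valence problem can be reproduced without the dataset. The following is a minimal sketch, not the procedure used for the post: `build_four_valent_nitrogen` and `valence_problems` are hypothetical helper names, and the toy molecule (a neutral nitrogen bonded to two carbons and two hydrogens) only stands in for the affected ligands. It relies on RDKit's `Chem.DetectChemistryProblems`, which reports sanitization problems instead of raising on the first one.

```python
from rdkit import Chem


def build_four_valent_nitrogen():
    """Toy stand-in for a failing ligand: C-N(H)(H)-C with a neutral nitrogen."""
    mol = Chem.RWMol()
    for atomic_num in (6, 7, 6, 1, 1):  # C, N, C, H, H -> atom indices 0..4
        mol.AddAtom(Chem.Atom(atomic_num))
    for neighbor in (0, 2, 3, 4):  # four single bonds on the nitrogen (index 1)
        mol.AddBond(1, neighbor, Chem.BondType.SINGLE)
    return mol.GetMol()


def valence_problems(mol):
    """Return (atom index, element symbol) for every atom with a valence error."""
    return [
        (problem.GetAtomIdx(), mol.GetAtomWithIdx(problem.GetAtomIdx()).GetSymbol())
        for problem in Chem.DetectChemistryProblems(mol)
        if problem.GetType() == "AtomValenceException"
    ]


toy_ligand = build_four_valent_nitrogen()
print(valence_problems(toy_ligand))
```

For a real ligand file, the unsanitized molecule could be loaded with `Chem.MolFromMolFile(path, sanitize=False, removeHs=False)` and passed to the same helper, where `path` is a placeholder for one of the 85 files.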

Luckily, we have ways to rectify this situation. First, we can download a chemically valid ligand file with the same coordinates from the PDB³, at the cost of losing the protonation state that was chosen for the Astex Diverse set.

Alternatively, we can either add a positive charge to the nitrogen or delete one hydrogen atom to satisfy the sanitization checks. For ligand LI9 shown above, adding a positive charge yields a secondary amine with a positive charge on the nitrogen atom. With a pKa of 0.8, this protonated form is extremely acidic and would not persist near physiological pH, so it is unlikely to be the intended structure of the ligand.
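As a rough illustration of the charge-based fix, the sketch below sets a formal charge of +1 on the offending nitrogen and then sanitizes the molecule. The helper names `build_four_valent_nitrogen` and `add_positive_charge` are hypothetical, and the toy molecule is an assumption standing in for a real ligand. The sketch only shows that the checks pass; it says nothing about whether the resulting protonation state is chemically sensible.

```python
from rdkit import Chem


def build_four_valent_nitrogen():
    """Toy stand-in for a failing ligand: C-N(H)(H)-C with a neutral nitrogen."""
    mol = Chem.RWMol()
    for atomic_num in (6, 7, 6, 1, 1):  # C, N, C, H, H -> atom indices 0..4
        mol.AddAtom(Chem.Atom(atomic_num))
    for neighbor in (0, 2, 3, 4):  # four single bonds on the nitrogen (index 1)
        mol.AddBond(1, neighbor, Chem.BondType.SINGLE)
    return mol.GetMol()


def add_positive_charge(mol, atom_idx):
    """Return a sanitized copy of mol with a +1 formal charge on atom_idx."""
    fixed = Chem.RWMol(mol)  # work on a copy, leave the input untouched
    fixed.GetAtomWithIdx(atom_idx).SetFormalCharge(1)
    Chem.SanitizeMol(fixed)  # raises if the molecule is still invalid
    return fixed.GetMol()


charged = add_positive_charge(build_four_valent_nitrogen(), atom_idx=1)
print(Chem.MolToSmiles(Chem.RemoveHs(charged)))
```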

So the best option in this case is to remove one of the two hydrogen atoms. We only have to take care to follow the protocol intended for the Astex Diverse set and optimize the positions of the remaining hydrogen atoms with a force field.
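The hydrogen-removal fix could look roughly like the sketch below. Several parts are assumptions: the helper names are hypothetical, UFF is used only because it ships with RDKit (the post does not say which force field is meant), and the toy molecule gets embedded coordinates as a stand-in for the crystal coordinates that a real ligand file already contains. Heavy atoms are held fixed so that only hydrogen positions change.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def build_four_valent_nitrogen():
    """Toy stand-in for a failing ligand: C-N(H)(H)-C with a neutral nitrogen."""
    mol = Chem.RWMol()
    for atomic_num in (6, 7, 6, 1, 1):  # C, N, C, H, H -> atom indices 0..4
        mol.AddAtom(Chem.Atom(atomic_num))
    for neighbor in (0, 2, 3, 4):  # four single bonds on the nitrogen (index 1)
        mol.AddBond(1, neighbor, Chem.BondType.SINGLE)
    return mol.GetMol()


def drop_one_hydrogen(mol, atom_idx):
    """Return a sanitized copy of mol with one hydrogen removed from atom_idx."""
    fixed = Chem.RWMol(mol)
    hydrogens = [
        nbr.GetIdx()
        for nbr in fixed.GetAtomWithIdx(atom_idx).GetNeighbors()
        if nbr.GetAtomicNum() == 1
    ]
    fixed.RemoveAtom(max(hydrogens))  # which of the two is removed is arbitrary
    Chem.SanitizeMol(fixed)
    return fixed.GetMol()


def relax_hydrogens(mol, max_iterations=500):
    """Minimize hydrogen positions in place with UFF, keeping heavy atoms fixed."""
    force_field = AllChem.UFFGetMoleculeForceField(mol)
    for atom in mol.GetAtoms():
        if atom.GetAtomicNum() > 1:
            force_field.AddFixedPoint(atom.GetIdx())
    force_field.Minimize(maxIts=max_iterations)
    return mol


fixed = drop_one_hydrogen(build_four_valent_nitrogen(), atom_idx=1)

# Toy-only step: add the carbon hydrogens and generate 3D coordinates.
# A ligand read from the dataset would keep its own coordinates instead.
fixed = Chem.AddHs(fixed)
AllChem.EmbedMolecule(fixed, randomSeed=42)

heavy = [a.GetIdx() for a in fixed.GetAtoms() if a.GetAtomicNum() > 1]
before = [tuple(fixed.GetConformer().GetAtomPosition(i)) for i in heavy]
relax_hydrogens(fixed)
after = [tuple(fixed.GetConformer().GetAtomPosition(i)) for i in heavy]
```

Comparing `before` and `after` is a quick way to confirm that the heavy atoms, which carry the experimental information, did not move during the minimization.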

References

1. Hartshorn, M. J. et al. Diverse, high-quality test set for the validation of protein−ligand docking performance. J. Med. Chem. 50, 726–741 (2007).
2. RDKit: Open-source cheminformatics. https://www.rdkit.org.
3. Berman, H. M. The Protein Data Bank. Nucleic Acids Research 28, 235–242 (2000).

Martin Buttenschoen