PubChem is a powerful tool for chemists and researchers alike, providing access to a vast array of chemical information at the click of a button. This online database, maintained by the National Institutes of Health, is an invaluable resource for anyone working in the field of chemistry.
At its core, PubChem is a repository of chemical substances and their associated properties. It contains information on over 100 million unique chemical structures, mostly relatively small molecules with fewer than 1000 atoms and fewer than 1000 bonds, including data on their physical and chemical properties, biological activities, and safety profiles. The database is divided into three sections and can be used freely on the web. This collection of data is constantly being updated and expanded, making it an essential tool for chemists looking to explore new areas of research.
One of the most innovative features of PubChem is its ability to connect chemical substances with biological activities. This is done through the integration of data from various sources, including high-throughput screening assays, literature references, and curated databases. By linking chemical compounds with their biological targets and activity profiles, PubChem allows researchers to identify potential drug candidates and gain insight into the mechanisms of action of existing drugs.
Another key feature of PubChem is its user-friendly interface, which allows users to search for chemical substances and access relevant data with ease. The database is organized into three main sections: substances, bioassays, and compounds. Users can search for information on specific substances by entering their chemical names or structures, or explore the database using broader search terms such as disease targets or chemical classes.
In addition to its vast collection of chemical data, PubChem also offers a range of tools and resources to help researchers analyze and visualize their results. These include chemical structure drawing tools, molecular modeling software, and data analysis and visualization tools.
PubChemPy is a Python module that provides a programmatic interface to the PubChem database, a large-scale repository of chemical information maintained by the National Center for Biotechnology Information (NCBI). PubChemPy enables us to access and analyze PubChem data through a user-friendly and flexible programming interface.
Looking up compounds one at a time through the web interface does not scale well when many records are needed. PubChemPy addresses this by offering a Python interface to the PubChem database. Python is a popular programming language known for its simplicity, flexibility, and extensive ecosystem of scientific libraries. With PubChemPy, we can retrieve and manipulate PubChem data from Python scripts, allowing for more efficient and automated analysis of chemical information.
PubChemPy's official documentation recommends installing it with pip (pip install pubchempy).
The code below assumes that pubchempy is imported as pcp (import pubchempy as pcp).
From the Compound database, we can search using any of the following:
compound name
molecular formula
SMILES
InChI
SDF
CID (Compound ID)
pubchempy.get_compounds(identifier, namespace='cid', searchtype=None, as_dataframe=False, **kwargs)
Parameters:
identifier – The compound identifier to use as a search query.
namespace – (optional) The identifier type, one of cid, name, smiles, sdf, inchi, inchikey or formula.
as_dataframe – (optional) Automatically extract the Compound properties into a pandas DataFrame and return that.
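As a rough illustration of how the namespace argument selects the identifier type, the sketch below runs the same kind of search with several namespaces. The query values and the helper name search_all are assumptions chosen for this example, and the function needs pubchempy installed plus network access to PubChem.

```python
# Example queries for alanine, keyed by PubChemPy namespace.
# These values are illustrative and were chosen for this sketch.
ALANINE_QUERIES = {
    "name": "alanine",
    "formula": "C3H7NO2",      # formula searches can match many isomers and be slow
    "smiles": "CC(N)C(=O)O",
    "cid": 602,
}


def search_all(queries):
    """Run one get_compounds() call per namespace and collect the CIDs found."""
    # Imported lazily: only this function needs the library and network access.
    import pubchempy as pcp

    found = {}
    for namespace, identifier in queries.items():
        compounds = pcp.get_compounds(identifier, namespace)
        found[namespace] = [c.cid for c in compounds]
    return found
```

Calling search_all(ALANINE_QUERIES) would return a dictionary mapping each namespace to the list of CIDs that PubChem reports for that query.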
This method returns the matching compounds as a list of Compound objects. The example below searches for the name 'alanine' and displays each result's CID and IUPAC name. In this case, the racemic, L- and D-forms are returned.
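A minimal sketch of that search, assuming pubchempy is installed and PubChem is reachable. The helper names format_match and print_matches are hypothetical and exist only for this example.

```python
def format_match(cid, iupac_name):
    """Format one search hit as a single line of text."""
    return f"CID: {cid} Name: {iupac_name}"


def print_matches(name):
    """Search PubChem by compound name and print the CID and IUPAC name of each hit."""
    # Imported lazily: only this function needs the library and network access.
    import pubchempy as pcp

    for compound in pcp.get_compounds(name, "name"):
        print(format_match(compound.cid, compound.iupac_name))
```

Calling print_matches('alanine') prints one line per matching compound.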
Output:
CID: 602 Name: 2-aminopropanoic acid
CID: 5950 Name: (2S)-2-aminopropanoic acid
CID: 71080 Name: (2R)-2-aminopropanoic acid
pubchempy.get_properties(properties, identifier, namespace='cid', searchtype=None, as_dataframe=False, **kwargs)
Parameters:
properties – The properties to retrieve, as a list or a comma-separated string.
identifier – The compound, substance or assay identifier to use as a search query.
namespace – (optional) The identifier type.
as_dataframe – (optional) Automatically extract the properties into a pandas DataFrame and return that.
get_properties() is another way to retrieve calculated properties for a specific compound, without fetching the full compound record.
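A minimal sketch of such a call, under the assumption that the compound in question is caffeine (CID 2519), whose molecular formula is C8H10N4O2. The names CAFFEINE_CID, fetch_properties and print_properties are hypothetical and used only for this example.

```python
CAFFEINE_CID = 2519  # assumption: the compound used for this illustration

PROPERTIES = ["MolecularFormula", "MolecularWeight", "XLogP"]


def fetch_properties(cid=CAFFEINE_CID, properties=PROPERTIES):
    """Return the property record (a dict) for a single CID."""
    # Imported lazily: only this function needs the library and network access.
    import pubchempy as pcp

    records = pcp.get_properties(properties, cid)  # namespace defaults to 'cid'
    return records[0]


def print_properties(record, properties=PROPERTIES):
    """Print the requested property values, one per line."""
    for key in properties:
        print(record.get(key))
```

Calling print_properties(fetch_properties()) prints the three requested values, one per line. get_properties() returns a list with one dictionary per compound, keyed by property name.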
Output:
C8H10N4O2
194.19
-0.1
Multiple properties may be specified in a list, or in a comma-separated string.
The available properties are: MolecularFormula, MolecularWeight, CanonicalSMILES, IsomericSMILES, InChI, InChIKey, IUPACName, XLogP, ExactMass, MonoisotopicMass, TPSA, Complexity, Charge, HBondDonorCount, HBondAcceptorCount, RotatableBondCount, HeavyAtomCount, IsotopeAtomCount, AtomStereoCount, DefinedAtomStereoCount, UndefinedAtomStereoCount, BondStereoCount, DefinedBondStereoCount, UndefinedBondStereoCount, CovalentUnitCount, Volume3D, XStericQuadrupole3D, YStericQuadrupole3D, ZStericQuadrupole3D, FeatureCount3D, FeatureAcceptorCount3D, FeatureDonorCount3D, FeatureAnionCount3D, FeatureCationCount3D, FeatureRingCount3D, FeatureHydrophobeCount3D, ConformerModelRMSD3D, EffectiveRotorCount3D, ConformerCount3D.
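Because a property list can be given either as a list or as a comma-separated string, it can help to normalize and check the names before sending a request. The helper below is a hypothetical sketch and not part of PubChemPy; AVAILABLE holds only a small subset of the properties listed above.

```python
# A small subset of the property names listed above, for illustration only.
AVAILABLE = {"MolecularFormula", "MolecularWeight", "XLogP", "TPSA", "IUPACName"}


def normalize_properties(properties, available=AVAILABLE):
    """Accept a list or a comma-separated string and return a validated list."""
    if isinstance(properties, str):
        properties = [p.strip() for p in properties.split(",")]
    unknown = [p for p in properties if p not in available]
    if unknown:
        raise ValueError(f"Unknown PubChem properties: {unknown}")
    return list(properties)
```

The returned list can then be passed as the first argument of pcp.get_properties(); a misspelled name is caught locally instead of producing a failed request.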
Subsequent data processing is often easier if the data is retrieved as a pandas data frame instead of as a list of dictionaries or objects. This is supported by:
get_compounds()
get_properties()
By specifying the as_dataframe=True option in the methods listed above, you can get the results as a pandas data frame instead of as a list of Compound objects or dictionaries. This makes later analysis easier.
For example, the 'alanine' search above can be returned as a data frame indexed by CID.
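A minimal sketch of that call, assuming pubchempy and pandas are installed and PubChem is reachable. The column names are assumed to follow the Compound attribute names, and alanine_frame is a hypothetical helper for this example.

```python
# Columns kept for display; assumed to match Compound attribute names.
COLUMNS = ["iupac_name", "molecular_formula", "molecular_weight"]


def alanine_frame(columns=COLUMNS):
    """Return the 'alanine' name search as a pandas DataFrame with selected columns."""
    # Imported lazily: only this function needs the library and network access.
    import pubchempy as pcp

    # as_dataframe=True returns a DataFrame instead of a list of Compound objects.
    df = pcp.get_compounds("alanine", "name", as_dataframe=True)
    return df[columns]
```

Calling alanine_frame() gives one row per matching compound, which can then be filtered, sorted, or merged like any other pandas data frame.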
If you want to dive deeper into PubChemPy, I suggest reading the official documentation.