The chemical structural formulas that we usually draw are difficult for computers to interpret. Therefore, structural information must first be converted into a computer-friendly format before it can be handled on a computer.
Chemical graph theory is a branch of mathematics that combines graph theory and chemistry. A molecule can be thought of as a graph with atoms as nodes and bonds as edges, so a graph can represent how each atom is connected to other atoms. Hydrogen atoms are often omitted when expressing molecules on a computer, because they can be added back later if the number of bonds between the remaining atoms is known. For example, a neutral carbon atom forms four chemical bonds and a hydrogen atom forms one, so any bonds of a carbon atom that are not accounted for in the graph must go to hydrogen atoms. The hydrogen atoms can therefore be removed without losing information about the molecule.
For example, ethane should be expressed as:
In the graph structure, the positions of atoms are not taken into consideration; only the connections between atoms are important.
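The graph view described above can be sketched in a few lines of Python. This is only an illustration: the helper name `implicit_hydrogens` and the small `TYPICAL_VALENCE` table are assumptions made for this example, and the sketch covers neutral atoms only.

```python
# Typical number of bonds formed by a few neutral atoms (assumed for this sketch).
TYPICAL_VALENCE = {"C": 4, "N": 3, "O": 2}

def implicit_hydrogens(atoms, bonds):
    """Return the number of hydrogens each heavy atom needs to fill its valence.

    atoms: list of element symbols (the graph nodes, hydrogens omitted)
    bonds: list of (index_a, index_b, bond_order) tuples (the graph edges)
    """
    used = [0] * len(atoms)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    return [TYPICAL_VALENCE[symbol] - u for symbol, u in zip(atoms, used)]

# Ethane: two carbons joined by one single bond.
ethane = implicit_hydrogens(["C", "C"], [(0, 1, 1)])
# Ethanol: C-C-O, all single bonds.
ethanol = implicit_hydrogens(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])

print(ethane)   # [3, 3]
print(ethanol)  # [3, 2, 1]
```

Only the connection list is stored; no coordinates are needed to recover the hydrogen counts.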
In this way, the method of notating the molecular structure in one line according to a certain rule is called "line notation".
Line notations are highly desirable structure representation methods. They offer a compact representation of the constitution and connectivity of a chemical structure, but tend to lack additional information, such as protonation and geometry, that is necessary for many modelling techniques. Here, three line notations will be introduced: the Simplified Molecular-Input Line-Entry System (SMILES), SMARTS, and the IUPAC International Chemical Identifier (InChI).
SMILES, an acronym for "Simplified Molecular Input Line Entry System", is arguably the most commonly used line notation. A SMILES string uses alphanumeric characters that closely mimic atoms and bonds as drawn in two-dimensional chemical structures.
Atoms in a SMILES string are represented by their element symbol from the periodic table, enclosed in square brackets. However, the square brackets can be omitted for the organic subset of elements: ‘B’, ‘C’, ‘N’, ‘O’, ‘S’, ‘P’, ‘Br’, ‘Cl’, ‘F’, and ‘I’. Hydrogens are typically implicit, but can be stated explicitly when needed. Therefore, methane is simply ‘C’ and water is ‘O’. An atom that carries one or more charges must be written in square brackets: the element symbol comes first, followed by the ‘H’ symbol and the number of hydrogens bonded to the atom (the number may be omitted if there is only one), and then the charge. A plus sign represents a positive charge and a minus sign represents a negative charge. The number of charges can be given after the sign, with a single charge being implicit, or it can be written out by repeating the sign. For example, the ammonium ion is ‘[NH4+]’ and the hydroxide ion is ‘[OH-]’.
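The bracket-atom rules above can be illustrated with a small regular expression. This is a simplified sketch, not a full SMILES grammar: the names `BRACKET_ATOM` and `parse_bracket_atom` are made up for this example, and features such as atom classes are ignored.

```python
import re

# One bracket atom: optional isotope, element symbol, optional chirality mark,
# optional hydrogen count, optional charge. Simplified from the full grammar.
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|[a-z])"
    r"(?P<chiral>@@?)?"
    r"(?:H(?P<hcount>\d*))?"
    r"(?P<charge>\++\d*|-+\d*)?\]"
)

def parse_bracket_atom(text):
    """Return the symbol, hydrogen count and charge of one bracket atom."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"not a bracket atom this sketch understands: {text!r}")

    hcount = match.group("hcount")
    if hcount is None:
        hydrogens = 0                          # no 'H' written at all
    else:
        hydrogens = int(hcount) if hcount else 1   # bare 'H' means one hydrogen

    charge = 0
    charge_text = match.group("charge")
    if charge_text:
        sign = 1 if charge_text[0] == "+" else -1
        digits = charge_text.lstrip("+-")
        # '+2' gives the number explicitly; '++' repeats the sign instead.
        magnitude = int(digits) if digits else len(charge_text)
        charge = sign * magnitude

    return {"symbol": match.group("symbol"), "hydrogens": hydrogens, "charge": charge}

print(parse_bracket_atom("[NH4+]"))  # {'symbol': 'N', 'hydrogens': 4, 'charge': 1}
print(parse_bracket_atom("[Fe++]"))  # {'symbol': 'Fe', 'hydrogens': 0, 'charge': 2}
```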
Bonds in a SMILES string are represented by symbols that mimic the chemical structure diagram representations: a single bond is ‘-’; a double bond is ‘=’; a triple bond is ‘#’.
However, bonds in a SMILES string are implied in a large number of cases. Bonds between aliphatic atoms are implicitly assumed to be single bonds, so the single bond symbol is not required. Therefore ethanol, starting the SMILES string from the methyl carbon, is written as ‘CCO’, and ‘C-C-O’ is equally valid. Bonds between aromatic atoms are implicitly assumed to be aromatic.
Branching in a SMILES string is defined by round brackets. Therefore, isobutyl alcohol is ‘CC(C)CO’.
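To make the rules for implicit single bonds and branches concrete, here is a minimal sketch of a parser that turns a restricted SMILES string into a list of atoms and bonds. It is an illustration under stated assumptions, not a complete parser: the function name `parse_chain` is hypothetical, and rings, bracket atoms, aromatic atoms, and two-letter elements are deliberately left out.

```python
BOND_ORDER = {"-": 1, "=": 2, "#": 3}

def parse_chain(smiles):
    """Parse a SMILES made of single-letter atoms, bond symbols and branches.

    Returns (atoms, bonds), where bonds are (index_a, index_b, order) tuples.
    """
    atoms, bonds = [], []
    branch_points = []   # stack of atoms to return to when a branch closes
    previous = None      # index of the atom the next atom will bond to
    pending_order = 1    # single bond unless a bond symbol says otherwise

    for char in smiles:
        if char in BOND_ORDER:
            pending_order = BOND_ORDER[char]
        elif char == "(":
            branch_points.append(previous)      # remember where the branch starts
        elif char == ")":
            previous = branch_points.pop()      # go back to the branching atom
        elif char in "BCNOSPFI":
            atoms.append(char)
            index = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, index, pending_order))
            previous = index
            pending_order = 1
        else:
            raise ValueError(f"unsupported character: {char!r}")
    return atoms, bonds

atoms, bonds = parse_chain("CC(C)CO")  # isobutyl alcohol
print(atoms)  # ['C', 'C', 'C', 'C', 'O']
print(bonds)  # [(0, 1, 1), (1, 2, 1), (1, 3, 1), (3, 4, 1)]
```

With this sketch, ‘CCO’ and ‘C-C-O’ produce the same atoms and bonds, which is the sense in which the single bond symbol is optional.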
Ring systems in a SMILES string are encoded by ring closure tags, which indicate that two atoms in the string are connected and therefore form a ring system. So, hexane would be ‘CCCCCC’, whereas cyclohexane would be ‘C1CCCCC1’. For a second ring, the ring closure tag would be ‘2’, and so on. If the ring closure number exceeds 9, a percent sign must be placed in front of the two-digit number, as in ‘%10’. This is important because a single atom may carry two different ring closures: ‘C12’ means ring closures 1 and 2 on the same atom, whereas ‘C%12’ means ring closure 12.
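The ring closure rules above can be checked with a short sketch that extracts the tags from a SMILES string. The helper names `ring_closure_tags` and `count_ring_closures` are assumptions for this example, and the sketch assumes a valid SMILES in which every tag is opened and closed exactly once.

```python
import re

# A ring closure tag is either '%' followed by two digits, or a single digit.
# The two-digit form is listed first so that '%12' is read as one tag.
RING_TAG = re.compile(r"%\d\d|\d")

def ring_closure_tags(smiles):
    """Return the ring closure tags in order of appearance, as integers."""
    # Digits inside [...] are isotopes, hydrogen counts or charges, not ring tags.
    without_brackets = re.sub(r"\[[^\]]*\]", "", smiles)
    return [int(tag.lstrip("%")) for tag in RING_TAG.findall(without_brackets)]

def count_ring_closures(smiles):
    """Each ring closure uses its tag twice: once to open and once to close."""
    return len(ring_closure_tags(smiles)) // 2

print(count_ring_closures("CCCCCC"))    # 0 (hexane)
print(count_ring_closures("C1CCCCC1"))  # 1 (cyclohexane)
print(ring_closure_tags("C12"))         # [1, 2]
print(ring_closure_tags("C%12"))        # [12]
```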
Aromaticity in a SMILES string is encoded by using lowercase characters for carbon, nitrogen, oxygen, and sulphur: ‘c’, ‘n’, ‘o’, ‘s’, respectively. Therefore, cyclohexane, as we have already seen, is ‘C1CCCCC1’, whereas benzene is ‘c1ccccc1’. Aromatic bonds are implied between aromatic atoms, but may be explicitly defined using the ‘:’ symbol. An aromatic nitrogen bonded to a hydrogen must be explicitly defined as ‘[nH]’: pyrrole is ‘c1cc[nH]c1’ and imidazole is ‘c1cnc[nH]1’.
Stereochemistry in a SMILES string is encoded by the special characters ‘\’, ‘/’, ‘@’, and ‘@@’. Around double bonds, ‘/’ and ‘\’ specify the cis and trans configurations. Therefore, valid SMILES strings of cis- and trans-2-butene are ‘C\C=C/C’ and ‘C\C=C\C’, respectively.
For example, E- and Z-1,2-difluoroethene can be represented by the following isomeric SMILES:
F/C=C/F or F\C=C\F (E)-1,2-difluoroethene (trans isomer)
F/C=C\F or F\C=C/F (Z)-1,2-difluoroethene (cis isomer)
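The pattern in these examples (same slash direction on both sides for trans, opposite directions for cis) can be written as a small check. This is a sketch that covers only strings written in the X/C=C/Y form used above; the function name `double_bond_geometry` is hypothetical, and substituents written inside branches are rejected because the slash then has to be read differently.

```python
def double_bond_geometry(smiles):
    """Classify a SMILES written as X/C=C/Y (one double bond) as 'trans' or 'cis'."""
    if "(" in smiles or smiles.count("=") != 1:
        raise ValueError("this sketch handles only the simple X/C=C/Y form")

    left, _, right = smiles.partition("=")
    left_marks = [c for c in left if c in "/\\"]
    right_marks = [c for c in right if c in "/\\"]
    if len(left_marks) != 1 or len(right_marks) != 1:
        raise ValueError("expected one directional bond on each side of '='")

    # Same direction on both sides: substituents on opposite sides (trans, E).
    # Different directions: substituents on the same side (cis, Z).
    return "trans" if left_marks[0] == right_marks[0] else "cis"

print(double_bond_geometry("F/C=C/F"))    # trans
print(double_bond_geometry("F/C=C\\F"))   # cis
```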
Configuration around tetrahedral centers is indicated by the symbols ‘@’ or ‘@@’:
C[C@@H](C(=O)O)N L-Alanine
C[C@H](C(=O)O)N D-Alanine
SMARTS is a straightforward extension of SMILES. The name is an acronym for SMILES ARbitrary Target Specification. It is a language for describing molecular patterns, developed especially for expressing substructures, and it allows us to search certain databases (like PubChem) for generic structures.
Some representative symbols and examples are summarized below.
For details on SMARTS notation, see "Daylight Theory Manual: SMARTS – A Language for Describing Molecular Patterns" by Daylight.
The IUPAC International Chemical Identifier (InChI™) is an international standard in structure representation. InChI encodes molecular information as a text string. Since every compound gives a different InChI, it can be thought of as analogous to the IUPAC name of the compound. As mentioned earlier in the development history, the difference from canonical SMILES is that the generation algorithm can be used freely for non-commercial purposes.
The InChI identifier provides a layered representation of a molecule, which allows for differing levels of resolution depending on the application in mind. The layers defined by InChI are as follows:
Main Layer
  Chemical Formula Layer
  Connections: the bonds between atoms; may have sublayers, the last of which deals with mobile hydrogens
Charge Layer
  Component Charge
  Protons
Stereochemical Layer
  Double Bond sp2 (Z/E) Stereochemistry
  Tetrahedral Stereochemistry
Isotopic Layer
Fixed Hydrogen Layer
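The layered structure listed above is visible in the string itself, because the layers are separated by ‘/’ and most begin with a one-letter prefix. The following sketch splits an InChI into its layers. The function name `split_inchi` and the labels in `LAYER_NAMES` are choices made for this example, and the sketch assumes a standard InChI that starts with a chemical formula and uses each prefix at most once.

```python
# One-letter layer prefixes, mapped to descriptive labels chosen for this sketch.
LAYER_NAMES = {
    "c": "connections",
    "h": "hydrogens",
    "q": "charge",
    "p": "protons",
    "b": "double bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopic",
    "f": "fixed hydrogens",
}

def split_inchi(inchi):
    """Split an InChI string into a dict of layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    # The first two segments carry no prefix letter: version and formula.
    version, formula, *rest = inchi[len(prefix):].split("/")
    layers = {"version": version, "formula": formula}
    for segment in rest:
        name = LAYER_NAMES.get(segment[0], f"unknown ({segment[0]})")
        layers[name] = segment[1:]
    return layers

ethanol_layers = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
for name, value in ethanol_layers.items():
    print(f"{name}: {value}")
```

For ethanol only the main layer is present; a chiral compound such as L-alanine would also show the stereochemical segments.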
InChIs are difficult for humans to decipher: unlike simpler representations such as SMILES, they cannot be easily read by people. They are designed primarily for computer processing, and their layers carry detailed molecular data that software can interpret reliably. Researchers therefore typically rely on specialized software and databases to decode InChIs, and can obtain the essential details of a molecule without having to read the underlying code directly.
From Databases
SMILES and other chemical representations of a molecule are supported by compound databases such as ChemSpider, PubChem, ChEMBL, and DrugBank, so you can retrieve them from each compound's entry in the database.
The upcoming tutorials will show how to access ChemSpider and PubChem programmatically with Python to obtain chemical information.
Since SMILES is a general-purpose format for compound information, many software packages support exporting structures in it. In ChemDraw, which is familiar to experimental chemists, you can select the drawn structure and choose "Edit -> Copy As" to export it in SMILES or InChI format.
Another option is to use a website. For example, the SMILES generator/checker website generates SMILES in real time as you enter a structural formula. It can also generate InChI.
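As a small preview, the sketch below builds a PubChem PUG REST request for a compound name using only the Python standard library. It is a minimal illustration, not the approach of the later tutorials: the function names are made up for this example, the default property list is an assumption, and `fetch_properties` needs network access, so it is defined here but not called.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def pubchem_property_url(name, properties=("InChI", "InChIKey")):
    """Build a PUG REST URL that asks for properties of a compound by name."""
    # quote() escapes spaces and other characters that are not URL-safe.
    return (
        f"{PUG_REST}/compound/name/{quote(name)}"
        f"/property/{','.join(properties)}/JSON"
    )

def fetch_properties(name, properties=("InChI", "InChIKey")):
    """Download the requested properties (requires an internet connection)."""
    with urlopen(pubchem_property_url(name, properties), timeout=30) as response:
        return json.load(response)["PropertyTable"]["Properties"]

print(pubchem_property_url("ethanol"))
```

Calling `fetch_properties("ethanol")` should return a list with one dictionary per matching compound, but the exact content depends on the live database.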