To handle compounds by computer in cheminformatics, compound information must be conveyed in a format that computers can readily process. Visual depictions of molecular structures work well for humans because we are good at recognizing patterns by sight, but these graphical representations are difficult for computers to interpret and manipulate. Molecules therefore need alternative representations that are better suited to computation.
In the previous article, Structure Representation and Line Notations in Chemistry, we discussed formats based on line notation.
As the name suggests, these formats express a chemical structure as a one-line character string. They have some disadvantages, however: information such as atomic coordinates and bond lengths is lost.
Our eyes are very good at reading shapes and patterns. Structural formulas used by chemists are representations of molecules designed to be easily understood by the human eye. For example, just by looking at the structural formula of caffeine below, a chemist can instantly understand a lot of information.
However, it is very difficult for computers to understand image files of such structural formulas. In this article, we will learn about some representations of chemical compounds that are easy for computers to understand.
The MOL file is a connection table format that was originally an input/output format for databases. It is also called the "MDL MOL file" or "molfile".
The structure of a MOL file is basically a combination of an "atom list" and a "bond list", with a header and a footer added.
Most connection table formats contain one or more of the following:
A list of atoms, specifying the elemental identity of each atom
A list of bonds, specifying the atoms that it connects and the bond multiplicity (single, double, triple)
2D or 3D spatial coordinates for each atom (sometimes measured, sometimes calculated; often, it’s not clear which)
Counts of the number of atoms and bonds in the molecule
Attributes associated with atoms or bonds (e.g. R/S configuration of a stereocenter; dashed/wedged bond)
Attributes associated with an entire structure (e.g. net charge)
The MOL file, a widely-used chemical structure file format, contains all of these.
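The items in the list above can be pictured as a small data structure. The following is a minimal sketch, not any library's actual API: the class names Atom, Bond, and ConnectionTable are hypothetical, and the coordinates are left at zero as placeholders rather than real geometry. It describes acetone with explicit hydrogens.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    element: str      # elemental identity, e.g. "C"
    x: float = 0.0    # coordinates; the zeros are placeholders, not real geometry
    y: float = 0.0
    z: float = 0.0
    charge: int = 0   # an attribute attached to a single atom


@dataclass
class Bond:
    begin: int        # 1-based index of the first atom
    end: int          # 1-based index of the second atom
    order: int        # 1 = single, 2 = double, 3 = triple


@dataclass
class ConnectionTable:
    name: str
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)

    @property
    def counts(self):
        """Number of atoms and number of bonds."""
        return len(self.atoms), len(self.bonds)

    @property
    def net_charge(self):
        """A whole-structure attribute derived from the per-atom charges."""
        return sum(atom.charge for atom in self.atoms)


# Acetone, CH3-C(=O)-CH3: atoms 1-3 are carbons, 4 is oxygen, 5-10 are hydrogens.
acetone = ConnectionTable(
    name="acetone",
    atoms=[Atom(e) for e in ["C", "C", "C", "O"] + ["H"] * 6],
    bonds=[
        Bond(1, 2, 1), Bond(2, 3, 1), Bond(2, 4, 2),    # carbon skeleton and C=O
        Bond(1, 5, 1), Bond(1, 6, 1), Bond(1, 7, 1),    # hydrogens on the first methyl
        Bond(3, 8, 1), Bond(3, 9, 1), Bond(3, 10, 1),   # hydrogens on the second methyl
    ],
)

print(acetone.counts)      # (10, 9)
print(acetone.net_charge)  # 0
```

Real connection table formats store the same information as text, usually in fixed-width columns, rather than as objects.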
The following is a MOL file for acetone. All molfiles have a header and a connection table that has two blocks, the Atom Block and the Bond Block.
The header block has three lines: the first gives the name or formula of the molecule (if known) and is free-format; the second identifies the program that wrote the file, the date and time it was written, and whether 2D or 3D coordinates are given; the third is a comment line and is often blank.
The counts line tells us that acetone has 10 atoms and 9 bonds; it also gives the version of the MOL file format.
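As a rough illustration of how a program reads these numbers, the sketch below parses a counts line in the V2000 flavor of the format, where each count occupies a three-character column and the version label sits at the end of the line. The function name parse_counts_line is hypothetical, and the sample line is constructed by hand to match the acetone counts quoted above rather than copied from a real file.

```python
def parse_counts_line(line):
    """Read atom count, bond count, and version from a V2000 counts line.

    The line is fixed-width: columns 1-3 hold the number of atoms,
    columns 4-6 the number of bonds, and columns 34-39 the version.
    """
    return {
        "n_atoms": int(line[0:3]),
        "n_bonds": int(line[3:6]),
        "version": line[33:39].strip(),
    }


# Hand-built sample: 10 atoms, 9 bonds, eight further three-character
# fields set to 0, the conventional "999", and the version label.
COUNTS_LINE = " 10  9" + "  0" * 8 + "999 V2000"

counts = parse_counts_line(COUNTS_LINE)
print(counts)
```

Slicing by column position, instead of splitting on whitespace, matters because adjacent fields can run together when a count has three digits.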
An SD File is a chemical structure-data file format that can associate data with one or more chemical structures.
An SDF is composed of one or more MOL records, each followed by its associated data fields. It is essentially a collection of MOL files, so a single file can hold many molecules. A line containing "$$$$" is used as the delimiter between molecules.
One of the features of SDF is that it can store properties and attributes of each molecule separately from the structure itself.
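The record layout described above can be sketched in a few lines of Python. This is an illustrative reader under simplifying assumptions, not a complete SDF parser: it expects every data header to have the form "> <FIELD>", and the helper names (one_atom_molblock, split_sdf, parse_record) and the two toy records are made up for the example. The toy MOL blocks contain a single heavy atom with placeholder coordinates.

```python
def one_atom_molblock(name, element):
    """Build a toy V2000 MOL block with one atom at the origin (placeholder)."""
    return [
        name,                                    # header line 1: name
        "  illustrative example",                # header line 2: program/timestamp
        "",                                      # header line 3: comment
        "  1  0" + "  0" * 8 + "999 V2000",      # counts line: 1 atom, 0 bonds
        "    0.0000    0.0000    0.0000 " + f"{element:<3}" + " 0" + "  0" * 11,
        "M  END",
    ]


def split_sdf(text):
    """Split SDF text into records; each record ends at a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def parse_record(lines):
    """Return (molblock_text, data_fields) for one record."""
    end = lines.index("M  END")
    molblock = "\n".join(lines[: end + 1])
    data, key = {}, None
    for line in lines[end + 1:]:
        if line.startswith(">"):
            # The field name sits between '<' and the final '>'.
            key = line[line.index("<") + 1 : line.rindex(">")]
            data[key] = []
        elif line.strip() == "":
            key = None                           # a blank line ends the value
        elif key is not None:
            data[key].append(line)
    return molblock, {k: "\n".join(v) for k, v in data.items()}


sdf_lines = []
for name, element, formula in [("methane", "C", "CH4"), ("water", "O", "H2O")]:
    sdf_lines += one_atom_molblock(name, element)
    sdf_lines += ["> <NAME>", name, "", "> <FORMULA>", formula, "", "$$$$"]
SDF_TEXT = "\n".join(sdf_lines) + "\n"

records = split_sdf(SDF_TEXT)
for record in records:
    molblock, fields = parse_record(record)
    print(fields)
```

Keeping the data fields in a dictionary beside the untouched MOL block mirrors the way the format itself separates properties from structure.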
The XYZ format, which uses the extension ".xyz", represents the structure of a molecule by listing the coordinates of each atom in three-dimensional space in angstrom units. These are Cartesian coordinates: they give the position of each atom along the x, y, and z axes of an external coordinate system, rather than relative to the other atoms.
It is often used as an input format, especially in the field of computational chemistry.
Given the coordinates of the atoms, the distances between them can be calculated, so it is possible to estimate which bonds exist between the atoms. In that sense, the XYZ format alone can describe a molecular structure, although the bonds are inferred rather than stated explicitly.
Normally, XYZ coordinates alone are sufficient as input for molecular orbital methods and DFT calculations. In molecular mechanics calculations, on the other hand, bond type information is needed to apply force field parameters to each atom, so a bond list is required.
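To make the idea of estimating bonds from distances concrete, here is a minimal sketch. It assumes a common heuristic: two atoms are treated as bonded when their distance is no more than the sum of their covalent radii plus a tolerance. The radii are approximate literature values, the 0.4 angstrom tolerance is an arbitrary choice for illustration, and the water geometry is an approximate example rather than data from the article. The function names are hypothetical.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (assumed values for illustration).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}

WATER_XYZ = """3
water (approximate geometry, for illustration)
O   0.000   0.000   0.000
H   0.757   0.586   0.000
H  -0.757   0.586   0.000
"""


def parse_xyz(text):
    """Return a list of (element, x, y, z) tuples from XYZ-format text."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])             # line 1: number of atoms
    atoms = []                          # line 2 is a comment and is skipped
    for line in lines[2 : 2 + n_atoms]:
        element, x, y, z = line.split()[:4]
        atoms.append((element, float(x), float(y), float(z)))
    return atoms


def guess_bonds(atoms, tolerance=0.4):
    """Guess bonded pairs (0-based indices) from interatomic distances."""
    bonds = []
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        distance = math.dist(a[1:], b[1:])
        cutoff = COVALENT_RADII[a[0]] + COVALENT_RADII[b[0]] + tolerance
        if distance <= cutoff:
            bonds.append((i, j))
    return bonds


atoms = parse_xyz(WATER_XYZ)
print(guess_bonds(atoms))  # [(0, 1), (0, 2)]
```

The heuristic recovers which atoms are connected but says nothing about bond order, which is one reason force-field programs ask for an explicit bond list.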
CIF stands for Crystallographic Information File.
A CIF contains information about the crystal structure (such as unit cell values, atom names and their coordinates and any structural model quality indicators, e.g., R Factor) as well as any details of the diffraction experiment (such as temperature, pressure, experimental wavelength and the type and name of equipment used) and any data processing undertaken (such as the programs used to process the data).
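A CIF stores most of this information as data items, each a tag beginning with an underscore followed by its value. The sketch below reads only those simple one-line items; it deliberately skips loops (such as the atom site table) and multi-line text values, so it is far from a full CIF parser. The tag names follow the CIF core dictionary, but the numerical values are made up for illustration, and the function names are hypothetical.

```python
import re

# Illustrative fragment; the numbers are invented for the example.
CIF_TEXT = """data_example
_cell_length_a                 5.6402(3)
_cell_length_b                 5.6402(3)
_cell_length_c                 5.6402(3)
_cell_angle_alpha              90
_diffrn_ambient_temperature    293(2)
_diffrn_radiation_wavelength   0.71073
loop_
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
Na1 0.0 0.0 0.0
Cl1 0.5 0.5 0.5
"""


def read_simple_items(text):
    """Collect one-line '_tag value' items; loop headers and rows are skipped."""
    items = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("_"):
            continue                      # data_ header, loop_ keyword, loop rows
        parts = line.split(None, 1)
        if len(parts) == 2:               # a tag alone on a line belongs to a loop
            tag, value = parts
            items[tag] = value.strip().strip("'\"")
    return items


def cif_number(value):
    """Drop a trailing standard uncertainty, e.g. '5.6402(3)' becomes 5.6402."""
    return float(re.sub(r"\(\d+\)$", "", value))


items = read_simple_items(CIF_TEXT)
print(cif_number(items["_cell_length_a"]))
print(cif_number(items["_diffrn_ambient_temperature"]))
```

For real work, a dedicated crystallography library is a safer choice, since the full CIF syntax has many cases this sketch ignores.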
The Protein Data Bank (PDB) file format is a textual file format describing the three-dimensional structures of molecules held in the Protein Data Bank.
The PDB format provides description and annotation of protein and nucleic acid structures, including atomic coordinates, observed side chain rotamers, secondary structure assignments, and atomic connectivity. Structures that are deposited together with other molecules, such as water, ions, nucleic acids, or ligands, can also be described in the PDB format. A typical PDB file includes a large "header" section of text that summarizes the protein, citation information, and the details of the structure solution, followed by the sequence and a long list of the atoms and their coordinates.
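The atom list in a PDB file is written in fixed-width columns, so a reader slices each line by position instead of splitting on spaces. The following sketch shows this for a single ATOM record. The sample line is assembled field by field so the column layout is visible; its values are invented for illustration, and the function name parse_atom_line is hypothetical.

```python
# One ATOM record, built piece by piece (values are illustrative only).
SAMPLE_LINE = "".join([
    "ATOM  ",     # columns  1-6   record name
    "    1",      # columns  7-11  atom serial number
    " ",          # column  12     blank
    " N  ",       # columns 13-16  atom name
    " ",          # column  17     alternate location indicator
    "GLY",        # columns 18-20  residue name
    " ",          # column  21     blank
    "A",          # column  22     chain identifier
    "   1",       # columns 23-26  residue sequence number
    " ",          # column  27     insertion code
    "   ",        # columns 28-30  blank
    "   1.000",   # columns 31-38  x coordinate (angstroms)
    "   2.000",   # columns 39-46  y coordinate
    "   3.000",   # columns 47-54  z coordinate
    "  1.00",     # columns 55-60  occupancy
    " 10.00",     # columns 61-66  temperature factor
    " " * 10,     # columns 67-76  blank
    " N",         # columns 77-78  element symbol
])


def parse_atom_line(line):
    """Extract the main fields of an ATOM or HETATM record by column position."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "element": line[76:78].strip(),
    }


atom = parse_atom_line(SAMPLE_LINE)
print(atom)
```

In a full file, the same function would be applied to every line that starts with ATOM or HETATM, while the header records are handled separately.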
Chemoinformatics deals with chemical structures on computers. Structural formulas that we usually handle are not computer-friendly, so they are inevitably converted to other formats for processing.
Computational chemistry and chemoinformatics use a vast array of file formats, both generic and software-specific. Here we introduce "Open Babel", a widely used open-source software library that serves as a primary tool for converting between chemical information formats.
In the areas of computational chemistry and cheminformatics, multiple chemical information formats exist based on their intended use. Open Babel allows for the conversion of these file formats to one another. A notable advantage of Open Babel is its extensive support for over 100 different file formats.
Also, because it is open source, it is used in many software packages and websites. For example, the molecular modeling software "Avogadro" incorporates Open Babel, so the user can convert file formats without having to know what happens behind the scenes.
The C6H6 laboratory notebook, a tool for controlling the flow of your data, also has Open Babel integrated into its GUI at C6H6.org.
Open Babel isn't solely a tool for converting files; it also has additional capabilities such as structural calculations utilizing the molecular mechanics method. However, this article will concentrate on the file conversion aspect of Open Babel.
Install Open Babel: If you haven't already, you can download and install Open Babel from the official website (http://openbabel.org/wiki/Main_Page) or using a package manager. Open Babel is available for Windows, macOS, and Linux.
Launch Open Babel GUI: After installing Open Babel, launch the Open Babel GUI from the start menu or desktop shortcut.
Open the input file: Click on the "File" menu and select "Open" to open the input file that you want to convert. Alternatively, you can drag and drop the input file into the Open Babel GUI window.
Select the input and output formats: In the "Input" section, select the format of the input file from the drop-down menu. In the "Output" section, select the output format that you want to convert the file to.
Configure the conversion options: Depending on the input and output formats, you may need to configure some conversion options. For example, if you are converting a file from SMILES to SDF format, you may want to enable the generation of 3D coordinates, because a SMILES string carries no coordinate information.
Choose the output file name and location: In the "Output" section, click on the "Save As" button and choose the output file name and location.
Convert the file: Click on the "Convert" button to start the conversion process.
Check the output file: Once the conversion is complete, you can check the output file to make sure it was converted correctly.
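The same conversion can also be scripted with Open Babel's command-line program, obabel, instead of the GUI. The sketch below only assembles the command and runs it if asked; it assumes obabel is installed and on the PATH, and the file names acetone.smi and acetone.sdf are hypothetical. The helper names build_obabel_command and run_conversion are made up for this example.

```python
import shutil
import subprocess


def build_obabel_command(input_path, output_path,
                         in_format=None, out_format=None, gen3d=False):
    """Assemble an obabel command line as a list of arguments.

    If the formats are omitted, obabel guesses them from the file extensions.
    """
    cmd = ["obabel"]
    if in_format:
        cmd.append(f"-i{in_format}")      # e.g. -ismi for SMILES input
    cmd.append(input_path)
    if out_format:
        cmd.append(f"-o{out_format}")     # e.g. -osdf for SDF output
    cmd += ["-O", output_path]            # output file name
    if gen3d:
        cmd.append("--gen3d")             # ask Open Babel to generate 3D coordinates
    return cmd


def run_conversion(cmd):
    """Run the command, failing early if obabel is not available."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("obabel was not found on PATH; is Open Babel installed?")
    subprocess.run(cmd, check=True)


# Hypothetical file names; nothing is executed until run_conversion(command) is called.
command = build_obabel_command("acetone.smi", "acetone.sdf",
                               in_format="smi", out_format="sdf", gen3d=True)
print(" ".join(command))
```

Building the argument list separately from running it makes the command easy to inspect, and looping over many input files turns the sketch into a simple batch converter.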