DO Patent output consists of several columns for recognition quality assessment and subsequent data processing:
- Extracted Image: The original image in the PDF document as the algorithm extracted it.
- Predicted structure: 2D rendering of the chemical structure encoded in the SMILES string (see the SMILES column below).
- Confidence: Confidence score indicating accuracy of recognition and the need for manual data review. We recommend sorting results by the confidence score.
  - >0.98 confidence score: high likelihood of accurate recognition
  - 0.92-0.98 confidence score: manual review is needed
  - <0.92 confidence score: poor recognition, consider discarding result
- Confidence details: The individual recognition tokens, corresponding to elements of the molecular structure, that make up the confidence score.
- SMILES: Line-notation (1D) representation of the molecule predicted by the algorithm. SMILES is a standard format that is widely supported for data import by cheminformatics software.
- Source: Name of the original PDF document.
- Page: Page number in the source PDF on which the recognized molecule image appears.
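
The confidence thresholds above can be applied programmatically once the results are available as tabular data. The following is a minimal sketch, not part of DO Patent itself: it assumes the output has been exported to CSV with column headers named `SMILES`, `Confidence`, `Source`, and `Page` (the export format and exact header names are assumptions), and the sample rows are made up for illustration. A score of exactly 0.98 or 0.92 is treated here as falling in the manual-review band.

```python
import csv
import io

# Thresholds taken from the confidence bands described above.
HIGH = 0.98  # above this: high likelihood of accurate recognition
LOW = 0.92   # below this: poor recognition, consider discarding


def triage(confidence: float) -> str:
    """Map a confidence score to one of three review buckets."""
    if confidence > HIGH:
        return "accept"
    if confidence >= LOW:
        return "review"
    return "discard"


def triage_rows(rows):
    """Sort rows by confidence (highest first) and group them by bucket."""
    ranked = sorted(rows, key=lambda r: float(r["Confidence"]), reverse=True)
    buckets = {"accept": [], "review": [], "discard": []}
    for row in ranked:
        buckets[triage(float(row["Confidence"]))].append(row)
    return buckets


# Hypothetical sample data; column names are assumed, not confirmed.
SAMPLE_CSV = """SMILES,Confidence,Source,Page
CC(=O)O,0.80,example.pdf,12
CCO,0.995,example.pdf,3
c1ccccc1,0.95,example.pdf,7
"""

buckets = triage_rows(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for name, rows in buckets.items():
    print(name, [r["SMILES"] for r in rows])
```

Sorting before grouping keeps each bucket ordered by confidence, so the manual-review queue starts with the most reliable candidates.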