Document Extractor is an API class that searches documents for compound names (IUPAC or traditional) and converts them to chemical structures. Supported document types include HTML, TXT, XML and PDF. The class can also be called from the command line. In that case it expects the name of a plain text file as the first argument; if no argument is given, it reads the text from standard input. The list of hits is printed to standard output.
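The command-line convention described above (file name as the first argument, standard input as the fallback, hits on standard output) can be illustrated with a small stand-alone sketch. This is not the Document Extractor API and does not reproduce its name recognition: `TOY_NAMES`, `Hit`, `find_hits`, `read_input` and `main` are hypothetical names invented for this example, and the three-entry lookup table merely stands in for real name-to-structure conversion.

```python
import re
import sys
from typing import Iterator, NamedTuple, Sequence

# Hypothetical lookup table: a stand-in for real name-to-structure conversion.
TOY_NAMES = {
    "ethanol": "CCO",
    "benzene": "c1ccccc1",
    "acetic acid": "CC(O)=O",
}


class Hit(NamedTuple):
    position: int   # character offset of the name in the input text
    text: str       # the name exactly as it appears in the text
    structure: str  # SMILES string taken from the toy table


def find_hits(text: str) -> Iterator[Hit]:
    """Yield every toy-table name found in the text, in reading order."""
    # Try longer names first so multi-word names are matched as a whole.
    names = sorted(TOY_NAMES, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    for match in pattern.finditer(text):
        yield Hit(match.start(), match.group(0), TOY_NAMES[match.group(0).lower()])


def read_input(argv: Sequence[str]) -> str:
    """Read the file named by the first argument, or standard input if absent."""
    if len(argv) > 1:
        with open(argv[1], encoding="utf-8") as handle:
            return handle.read()
    return sys.stdin.read()


def main(argv: Sequence[str]) -> None:
    """Print one tab-separated line per hit on standard output."""
    for hit in find_hits(read_input(argv)):
        print(f"{hit.position}\t{hit.text}\t{hit.structure}")
```

Calling `main(sys.argv)` from a script entry point would give the same two invocation styles: pass a plain text file name as the first argument, or pipe the text in on standard input.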