Abstract

Information extraction is the process of extracting relevant data, in a specified structured format, from semi-structured and unstructured data sources. Extracting information from a collection of unstructured documents while allowing a reasonable degree of fault tolerance is a challenging problem. Existing approaches include statistical training methods, which require enormous training time and yield trained models that are biased by erroneous data. To avoid these weaknesses, we propose a similarity-maximization methodology that requires only a very small amount of human coding and uses an integer-programming (IP) framework to extract the appropriate information.
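
As a rough illustration of the general idea, and not the formulation used in this work (which the abstract does not spell out), similarity maximization can be read as an assignment-style integer program: a binary variable x[f][c] is 1 when candidate text segment c is chosen for field f, each field receives exactly one segment, each segment is used at most once, and the objective is the total similarity between the chosen segments and a few hand-coded example values. The sketch below is a minimal toy under those assumptions. The names `similarity` and `extract_fields`, the choice of `difflib.SequenceMatcher` as the similarity measure, and the exhaustive search that stands in for an IP solver are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a: str, b: str) -> float:
    """Hypothetical similarity measure in [0, 1]; any other measure could be swapped in."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def extract_fields(examples: dict[str, str], candidates: list[str]) -> dict[str, str]:
    """Choose one distinct candidate per field so that total similarity is maximal.

    `examples` maps each field name to one hand-coded example value (the
    "small amount of human coding"). Enumerating every feasible assignment
    is equivalent to solving the toy 0-1 program exactly, but it is only
    practical for a handful of fields and candidates.
    """
    fields = list(examples)
    if len(candidates) < len(fields):
        raise ValueError("need at least one candidate per field")

    best_score = -1.0
    best_choice: tuple[int, ...] = ()
    # Each permutation is one feasible solution: exactly one candidate per
    # field, and no candidate assigned to two fields.
    for choice in permutations(range(len(candidates)), len(fields)):
        score = sum(
            similarity(examples[field], candidates[idx])
            for field, idx in zip(fields, choice)
        )
        if score > best_score:
            best_score, best_choice = score, choice

    return {field: candidates[idx] for field, idx in zip(fields, best_choice)}


if __name__ == "__main__":
    examples = {"date": "12 March 2004", "email": "name@example.com"}
    segments = ["Contact: jdoe@example.org", "Filed on 3 April 2011", "Page 2 of 7"]
    print(extract_fields(examples, segments))
```

For realistic document collections the enumeration would be replaced by a proper IP solver, and the similarity measure would need to be chosen to tolerate the kinds of errors expected in the source text.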
