Adaptive Information Extraction – A review

Tuesday, June 30th, 2009

Paper title: Adaptive Information Extraction

Authors: JORDI TURMO, ALICIA AGENO, NEUS CATALÀ (TALP Research Center, Universitat Politècnica de Catalunya, Spain)

Link to Paper: ACM Computing Portal.

The abstract notes that the last decade has seen an increase in information extraction (IE) research. It also identifies domain dependence as the major drawback of IE systems. To address this, researchers began exploring adaptive systems that leverage machine learning techniques.

The introduction gives an overview of how textual information extraction began and how it moved from a manual process to a computer-driven one. The paper explains that linguistic challenges play a crucial role in automating it. The introduction also contrasts Information Retrieval (IR) with IE: NLP plays a minor or marginal role in IR, whereas IE requires an in-depth understanding of the text. It also mentions that emerging technology tends to fuse IE and IR for additional benefits, as in question answering and summarization.

Section 2 of the paper describes the goal of IE, which is stated as follows: “The objective of IE is to extract certain pieces of information from text that are related to a prescribed set of related concepts, namely, an extraction scenario”. A well laid out example is given as well.

Section 3 covers the historical framework of information extraction. The authors state that the early days of IE were closely associated with the Message Understanding Conferences (MUC), and they also discuss the TIPSTER text program from DARPA. The general goal of MUC was to evaluate IE systems.

MUC 1 was exploratory. MUC 2 consisted of a template-filling task, but no proper evaluation procedures were in place. MUC 3 was well organized, with a task, detailed templates and good evaluation procedures. MUC 4 used the same task as MUC 3 but established a global evaluation criterion, the F score, for it. The later conferences (MUC 5, 6 and 7) added more domains or expanded existing ones, which led IE to grow in scale.
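For readers unfamiliar with the F score, the sketch below shows how it combines precision and recall. It uses the standard beta-weighted F-measure rather than the exact MUC scoring rules, and the function name and slot counts are assumptions made up for illustration.

```python
def f_score(correct: int, extracted: int, expected: int, beta: float = 1.0) -> float:
    """Standard F-measure computed from slot counts (illustrative helper).

    correct   -- slots the system filled correctly
    extracted -- slots the system filled at all
    expected  -- slots present in the answer key
    beta      -- weight of recall relative to precision (1.0 = balanced)
    """
    if extracted == 0 or expected == 0:
        return 0.0
    precision = correct / extracted
    recall = correct / expected
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)


# Hypothetical run: 8 slots filled, 6 of them correct, 12 slots in the key.
# Precision is 0.75, recall is 0.5, so the balanced F score is 0.6.
example = f_score(correct=6, extracted=8, expected=12)
```

A single number like this makes systems comparable, which is presumably why it was adopted as a global criterion.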

Section 4 describes the architecture of information extraction systems. The paper defines it as follows: “Within the historical framework, the general architecture of an IE system was defined by Hobbs [1993] in MUC-5 as a cascade of transducers or modules that, at each step, add structure to the documents and, sometimes, filter relevant information, by means of applying rules”. Current systems have two components: a general core and, on top of it, modules for specific purposes. Modules can include document preprocessing, output template filling and so forth.
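The "cascade of transducers" idea can be pictured as a list of functions, each of which takes a document and returns it with more structure attached. The sketch below is only an illustration of that shape: the stage names, the dictionary layout and the capitalization heuristic are assumptions, not anything taken from the paper.

```python
import re
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Stage = Callable[[Document], Document]


def tokenize(doc: Document) -> Document:
    """Add a token list: words and single punctuation marks."""
    doc = dict(doc)
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])
    return doc


def tag_capitalized(doc: Document) -> Document:
    """Filter step: keep capitalized tokens as crude name candidates."""
    doc = dict(doc)
    doc["candidates"] = [t for t in doc["tokens"] if t[:1].isupper()]
    return doc


def fill_template(doc: Document) -> Document:
    """Map the candidates into a (hypothetical) output template."""
    doc = dict(doc)
    doc["template"] = {"names": doc["candidates"]}
    return doc


def run_cascade(text: str, stages: List[Stage]) -> Document:
    """Apply each stage in order; every stage sees the output of the previous one."""
    doc: Document = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_cascade("Acme hired Jane Doe.", [tokenize, tag_capitalized, fill_template])
# result["template"] is {"names": ["Acme", "Jane", "Doe"]}
```

Because each stage only depends on the keys added before it, a domain-specific module can be swapped in without touching the rest of the cascade.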

Section 4.1 provides insight into the document preprocessing module. Preprocessing is done using text zoners, segmenters, splitters, tokenizers, language guessers, lexical analyzers, engines dealing with unknown words, disambiguators, stemmers and lemmatizers, etc.

Section 4.2 introduces syntactic parsing and semantic interpretation. The earlier practice was to process an entire document and then perform an in-depth syntactic and semantic analysis, but it was later found that the best F scores were obtained with a selective process. Full parsing involves a large and relatively unconstrained search space, and is consequently expensive. The paper lists the following problems: “Full parsing is not a robust process because a global parse tree is not always achieved. In order to correct such incompleteness, the parse covering the largest substring of the sentence is attempted. Sometimes, however, this global goal leads to incorrect local choices of analyses. Full parsing may produce ambiguous results. More than one syntactic interpretation is usually achieved. In this situation, the most correct interpretation must be selected. Broad-coverage grammars, needed for full parsing, are difficult to be consistently tuned. Dealing with new domains, new specialized syntactic constructs could occur in texts and be unrecognized by broad-coverage grammars. Adding new grammar rules could produce an inconsistent final grammar. A full parsing system cannot manage off-vocabulary situations”. Thus most systems today are based on partial parsing models. Chunking is described as the process of grouping adjacent words into units such as noun phrases. Pattern matching and grammatical relations play a role as well.
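To make chunking concrete, here is a minimal sketch of a noun-phrase chunker over tokens that already carry part-of-speech tags. The pattern (optional determiner, any adjectives, one or more nouns), the Penn Treebank style tags and the function name are assumptions chosen for illustration; the systems surveyed in the paper use far richer patterns.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (word, part-of-speech tag)


def chunk_noun_phrases(tagged: List[Tagged]) -> List[List[str]]:
    """Group maximal runs matching DT? JJ* NN+ into noun-phrase chunks."""
    chunks: List[List[str]] = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":            # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":  # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):  # one or more nouns
            k += 1
        if k > j:
            # At least one noun was found, so tokens i..k-1 form a chunk.
            chunks.append([word for word, _ in tagged[i:k]])
            i = k
        else:
            # No noun here; skip one token and try again.
            i += 1
    return chunks


sentence = [
    ("The", "DT"), ("big", "JJ"), ("company", "NN"), ("hired", "VBD"),
    ("a", "DT"), ("new", "JJ"), ("sales", "NN"), ("manager", "NN"), (".", "."),
]
noun_phrases = chunk_noun_phrases(sentence)
# noun_phrases is [["The", "big", "company"], ["a", "new", "sales", "manager"]]
```

Unlike a full parse, this never fails on a sentence: anything that does not match the pattern is simply left outside the chunks.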

Section 4.3 covers discourse analysis, which helps to improve the partially filled templates or logical forms. The authors present merging procedures for this task.

Section 4.4 presents output template generation, which aims at mapping the extracted information to the desired output format.

Section 5 focuses on machine learning techniques that can be applied to information extraction. The two main problem areas presented here are error propagation and adaptation to new domains. To deal with these problems, adaptive IE technology was researched, and it mainly consisted of empirical methods from NLP. Most of these are rule-based methods. Apart from rule-based methods, classification models were also researched, including propositional learning, statistical models, and relational learning.

Section 5.1 focuses on learning rules and then discusses the supervised and unsupervised approaches used. The supervised approaches consist of propositional and relational learning techniques, which are widely known. The illustrations are well laid out, helping the reader to follow the material easily.

In the unsupervised section the authors discuss AutoSlog-TS. They then introduce work done with various converging algorithms. They also mention that WordNet is being leveraged.

Section 5.2 focuses on learning based on statistical techniques and dives into Markov models, maximum entropy models, dynamic Bayesian networks and conditional random fields; other statistical models are also presented.
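As a rough illustration of how a Markov model can be used for extraction, the sketch below runs generic Viterbi decoding over a two-state hidden Markov model that labels each token as either background ("O") or part of a speaker name ("SPEAKER"). This is not the method of any particular system in the survey; the states, vocabulary and probabilities are all invented for the example, and unseen words get a small floor probability as a simplifying assumption.

```python
import math
from typing import Dict, List


def viterbi(obs: List[str], states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]],
            floor: float = 1e-6) -> List[str]:
    """Return the most probable state sequence for obs (log-space Viterbi)."""
    if not obs:
        return []

    def lg(p: float) -> float:
        return math.log(max(p, floor))  # floor avoids log(0) for unseen events

    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: lg(start_p[s]) + lg(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] + lg(trans_p[p][s]))
            best[t][s] = (best[t - 1][prev] + lg(trans_p[prev][s])
                          + lg(emit_p[s].get(obs[t], 0.0)))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy model; every number below is made up for illustration.
states = ["O", "SPEAKER"]
start_p = {"O": 0.9, "SPEAKER": 0.1}
trans_p = {"O": {"O": 0.7, "SPEAKER": 0.3},
           "SPEAKER": {"O": 0.4, "SPEAKER": 0.6}}
emit_p = {"O": {"speaker": 0.45, ":": 0.45, "jane": 0.05, "doe": 0.05},
          "SPEAKER": {"jane": 0.5, "doe": 0.5}}

labels = viterbi(["speaker", ":", "jane", "doe"], states, start_p, trans_p, emit_p)
# labels is ["O", "O", "SPEAKER", "SPEAKER"]
```

The tokens labelled SPEAKER would then fill the corresponding slot of the output template. In a real system the probabilities are estimated from annotated documents rather than written by hand.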

Sections 5.3 and 5.4 cover multi-stage approaches and wrapper generation as special cases of IE. Section 5.5 finally compares some prominent methods and clearly states that, because the techniques are domain specific, ranking them in any clear way is nearly impossible.

Section 6 introduces methodologies and the use of knowledge in IE systems. In general, this section introduces various IE architectures.

The IE2 system, the PROTEUS system and the SIFT system are presented.

Section 7 presents the conclusion, which is as follows:

  • IE is a major research area, because many applications need domain-specific knowledge and the availability of online documents keeps increasing.
  • One major drawback of IE is that it is difficult to adapt to new domains.
  • The survey concludes by noting how difficult it is to determine which technique is best for IE.