Within xLiMe, one of our main focus points at JSI is producing a high-quality stream of aggregated event information on the Event Registry platform, built from a stream of multilingual online news. For each event, represented as a cluster of articles, we extract a time and a location, which do not necessarily correspond to the time and location of article publication. Lately we have been working hard on expanding these extraction capabilities to obtain an infobox-like structured representation of events.
We are developing a machine learning system that extracts structured event data from events, filling pre-defined templates for each event type. The system computes features aggregated over all the articles that make up an event, which provides data redundancy and greater model stability than extraction from a single article. Furthermore, we use a methodology based on canonical correlation analysis to project contextual language features into a language-agnostic space, which makes the extraction process cross-lingual.
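To make the two ideas above more concrete, here is a minimal sketch in Python with NumPy. It is not the system described here: the function names (`fit_cca`, `project`, `aggregate_event`), the regularisation constant, the use of a simple mean for aggregation, the order of the steps (project each article, then average), and the synthetic "English" and "German" feature matrices are all assumptions made purely for illustration.

```python
import numpy as np


def inv_sqrt(C):
    # Inverse square root of a symmetric positive-definite matrix.
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T


def fit_cca(X, Y, k, reg=1e-3):
    # Regularised CCA on paired rows of X and Y (e.g. aligned contexts
    # in two languages). Returns one (mean, weights) model per language
    # and the top-k canonical correlations.
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    return (mx, Kx @ U[:, :k]), (my, Ky @ Vt[:k].T), s[:k]


def project(features, model):
    # Map language-specific features into the shared k-dimensional space.
    mean, W = model
    return (features - mean) @ W


def aggregate_event(article_vectors):
    # Hypothetical aggregation: average the vectors of all articles in the event.
    return np.mean(article_vectors, axis=0)


# Synthetic stand-in for aligned bilingual training data: both "languages"
# are noisy linear views of the same 3-dimensional latent signal.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 20))   # latent -> "English" features
B = rng.normal(size=(3, 30))   # latent -> "German" features
Z = rng.normal(size=(500, 3))
X = Z @ A + 0.1 * rng.normal(size=(500, 20))
Y = Z @ B + 0.1 * rng.normal(size=(500, 30))
en_model, de_model, corrs = fit_cca(X, Y, k=3)

# One event reported by 5 "English" and 4 "German" articles.
z = np.array([1.0, -2.0, 0.5])
en_articles = z @ A + 0.1 * rng.normal(size=(5, 20))
de_articles = z @ B + 0.1 * rng.normal(size=(4, 30))

en_proj = project(en_articles, en_model)
de_proj = project(de_articles, de_model)

# Aggregate over all articles of the event, regardless of language.
event_vec = aggregate_event(np.vstack([en_proj, de_proj]))
```

In this sketch, articles in either language end up close to each other in the shared space, so a single extractor could in principle be trained on the aggregated event vector rather than on per-language features.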
The results obtained so far on simple templates for event types such as company acquisition (pictured) and earthquake are encouraging, and are on par with the top results obtained in the extraction track of the Knowledge Base Population competition at the Text Analysis Conference. In the future we aim to extend the experiments to more event types and to develop an active learning component for building extractors for new event types.