Wikipedia Mining: Obtaining Geographical, Temporal and Ontological Information
Wikipedia represents a vast amount of human knowledge and judgement, but its content remains largely unstructured. Although the content is marked up for display, the markup provides very little structure that would allow direct machine understanding. More complex operations are therefore required to extract information from the text for meaningful machine processing. This project seeks to extract geospatial and temporal information from Wikipedia articles.
We started with a simple approach that used part-of-speech (POS) tagging and regular expressions. This work was done by Suzette Stoutenburg as an independent study project and was later published at a conference in the Czech Republic. Jeremy Witmer, who received his MS in May 2009, developed a system that can extract all spatial named entities (i.e., names of places) from Wikipedia articles about wars and battles (mostly concerning the Civil War) and then geocode them to specific locations on the globe. We published two papers based on his work in 2009, one at the AAAI Spring Symposium at Stanford and one at the IEEE Semantic Computing Conference.
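The extract-then-geocode idea described above can be illustrated with a minimal Python sketch. It is not the system developed in this project: the tiny `GAZETTEER` dictionary, its approximate coordinates, the `CANDIDATE` pattern, and the `extract_places` function are all hypothetical names introduced here for illustration, and a regular expression over capitalized words stands in for proper POS tagging and named-entity recognition.

```python
import re

# Hypothetical toy gazetteer with approximate (latitude, longitude) pairs.
# A real system would consult a far larger gazetteer or geocoding service.
GAZETTEER = {
    "Gettysburg": (39.83, -77.23),
    "Richmond": (37.54, -77.44),
    "Atlanta": (33.75, -84.39),
}

# Assumption: a run of capitalized words is a candidate name.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def extract_places(text):
    """Return (name, (lat, lon)) pairs for gazetteer names found in text."""
    found = []
    seen = set()
    for match in CANDIDATE.finditer(text):
        words = match.group(0).split()
        # Check every contiguous sub-run of words, longest first, so that a
        # sentence-initial run such as "In Richmond" still yields "Richmond".
        for size in range(len(words), 0, -1):
            for start in range(len(words) - size + 1):
                name = " ".join(words[start:start + size])
                if name in GAZETTEER and name not in seen:
                    seen.add(name)
                    found.append((name, GAZETTEER[name]))
    return found


if __name__ == "__main__":
    sample = "In Richmond, the army regrouped after Gettysburg."
    for name, (lat, lon) in extract_places(sample):
        print(name, lat, lon)
```

A dictionary lookup like this cannot tell apart places that share a name, which is the disambiguation problem a real geocoder has to solve.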
Our future work will extend this system in novel ways. Here are some initial ideas for extensions.
- Creating a map from Wikipedia articles: Once we have unambiguously determined the names of the places discussed in an article, we would like to display them on a map, possibly a Google map. We will annotate the places with important events or other things we know happened at each location.
- Creating a timeline of events from Wikipedia articles: We need to learn to find events in Wikipedia or other types of articles, find the temporal relations among those events, and draw a graph showing the relationships. Identifying events and temporal relationships will be challenging, but we are confident we can make a good start. Two papers to get started on this topic are Mei and Zhai 2005, and Chklovski and Pantel 2004. These are not our papers; we cite them as exemplars of work on extracting temporal and other relationships from text, and they will help us think about how to go about extracting relations from text.
- Learning to identify causality relations among events: This is also a hard task. Look at Girju and Moldovan 2002, and Gordon and Swanson 2009 to get started. Once again, these are not our papers, but they are good starting points.
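As a very rough starting point for the timeline idea above, the following Python sketch pulls out sentences that mention a four-digit year and orders them chronologically. It is only an illustrative baseline under strong assumptions, not a method from the cited papers: the `YEAR` pattern, the naive sentence splitter, and the `build_timeline` function are hypothetical, and real event and temporal-relation extraction needs much more than year matching.

```python
import re

# Assumption: an event is any sentence that mentions a year from 1000 to 2099.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def build_timeline(text):
    """Return (year, sentence) pairs sorted by year."""
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    events = []
    for sentence in sentences:
        for year in YEAR.findall(sentence):
            events.append((int(year), sentence))
    # sorted() is stable, so events in the same year keep their text order.
    return sorted(events, key=lambda event: event[0])


if __name__ == "__main__":
    sample = (
        "Atlanta fell in 1864. The war began in 1861. "
        "The armies were large."
    )
    for year, sentence in build_timeline(sample):
        print(year, sentence)
```

Sentences without an explicit year are dropped here; placing them on the timeline is exactly where learned temporal relations between events would be needed.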