The purpose of this lab session is to extract structured information from a natural language text corpus.
~senellar/inf396/text on the computers used during the labs. The format of this file is as follows: The first line is the title of the first article, while the following lines (up to the first blank line) form the content of this article, in plain text format. The second article comes after the next blank line, and so on. There are 50,441 articles in total.
~senellar/inf396/pos on the computers used during the labs. The format of this file is as above, but each word is followed by "_XXX", where XXX identifies the part of speech (POS), as explained in the lecture.
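The "_XXX" suffix can be separated from the word by cutting at the last underscore. The following is only a minimal sketch: the class name PosToken, its fields, and the assumption that tagged words are separated by whitespace are illustrative and not part of the lab material. The tags themselves are those explained in the lecture.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: one word of the POS-tagged corpus, e.g. "city_NN". */
final class PosToken {
    final String word;
    final String tag;

    PosToken(String word, String tag) {
        this.word = word;
        this.tag = tag;
    }

    /** Splits "word_TAG" at the LAST underscore, so words that contain '_' survive. */
    static PosToken parse(String token) {
        int cut = token.lastIndexOf('_');
        if (cut < 0) {
            return new PosToken(token, ""); // untagged token: keep the word, empty tag
        }
        return new PosToken(token.substring(0, cut), token.substring(cut + 1));
    }

    /** Parses one line of tagged words (assumed to be whitespace-separated). */
    static List<PosToken> parseLine(String line) {
        List<PosToken> tokens = new ArrayList<>();
        for (String raw : line.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                tokens.add(parse(raw));
            }
        }
        return tokens;
    }
}
```

Cutting at the last underscore rather than the first is a design choice that keeps a word such as "a_b" intact when it is tagged as "a_b_NN".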
~senellar/inf396/full on the computers used during the labs. The format of the file is as explained in the lecture.
Page, which has two String fields: the page title and the page content. In this class, as well as in all other classes, remember to implement the standard methods
compareTo(). Write a class Parser, which has a constructor that takes as argument the filename of the corpus. The class should have a method next():Page, which delivers the next page in the corpus. Remember to close the file at the end. If you want, the class Parser can implement the interfaces
Iterator<Page>. Implement this class for the preprocessed text-only corpus. Create a small sample corpus of two or three articles and test your class.
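One possible shape for Page and Parser on the text-only corpus is sketched below. It is not a reference solution: the second constructor taking a Reader, the UTF-8 encoding, and the convention of returning null at the end of the corpus are assumptions made here, and the standard methods of Page other than toString() are left out to keep the sketch short.

```java
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** A page of the corpus: title plus plain-text content. */
final class Page {
    final String title;
    final String content;

    Page(String title, String content) {
        this.title = title;
        this.content = content;
    }

    @Override
    public String toString() {
        return title + ": " + content;
    }
}

/** Reads pages one by one; articles are separated by blank lines. */
final class Parser implements Closeable {
    private final BufferedReader in;

    /** Opens the corpus file (UTF-8 is an assumption about its encoding). */
    Parser(String filename) throws IOException {
        this(Files.newBufferedReader(Paths.get(filename), StandardCharsets.UTF_8));
    }

    /** Extra constructor (an assumption) so the parser can be tested on a string. */
    Parser(Reader reader) {
        this.in = new BufferedReader(reader);
    }

    /** Returns the next page, or null when the corpus is exhausted. */
    Page next() {
        try {
            String title = in.readLine();
            while (title != null && title.trim().isEmpty()) {
                title = in.readLine(); // skip blank separator lines
            }
            if (title == null) {
                return null;
            }
            StringBuilder content = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null && !line.trim().isEmpty()) {
                if (content.length() > 0) {
                    content.append('\n');
                }
                content.append(line);
            }
            return new Page(title, content.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

Because Parser implements Closeable, it can be used in a try-with-resources statement, which closes the file even if an exception occurs while reading.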
Triple, which has three String fields:
object. Create an abstract class Extractor, which has a method extract(Page):Collection<Triple>. Create a simple extractor, NameExtractor, which produces triples of the form <PageTitle, hasName, "PageTitle">. Create a class InformationExtractor, which has only one method: run. This method takes as input a corpus file, a target file and a list of extractors. It iterates over all pages in the corpus, calls all extractors and writes the triples, TAB-separated, into the target file. Test this method with the NameExtractor on the small sample corpus.
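The classes described above could fit together roughly as follows. This is a sketch under several assumptions: the field names subject, predicate and object, the helper toTsv(), and the minimal Page class are illustrative, and run() here takes already parsed pages and a PrintWriter instead of two file names so that the example stays self-contained.

```java
import java.io.PrintWriter;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

/** Minimal stand-in for the Page class of the previous task. */
final class Page {
    final String title;
    final String content;

    Page(String title, String content) {
        this.title = title;
        this.content = content;
    }
}

/** One extracted fact; the field names are assumptions. */
final class Triple {
    final String subject;
    final String predicate;
    final String object;

    Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    /** Hypothetical helper: the TAB-separated form written to the target file. */
    String toTsv() {
        return subject + "\t" + predicate + "\t" + object;
    }
}

abstract class Extractor {
    abstract Collection<Triple> extract(Page page);
}

/** Produces <PageTitle, hasName, "PageTitle"> for every page. */
final class NameExtractor extends Extractor {
    @Override
    Collection<Triple> extract(Page page) {
        return Collections.singletonList(
            new Triple(page.title, "hasName", "\"" + page.title + "\""));
    }
}

final class InformationExtractor {
    /** Calls every extractor on every page and writes one triple per line. */
    static void run(Iterable<Page> pages, PrintWriter target, List<Extractor> extractors) {
        for (Page page : pages) {
            for (Extractor extractor : extractors) {
                for (Triple triple : extractor.extract(page)) {
                    target.print(triple.toTsv());
                    target.print('\n');
                }
            }
        }
        target.flush();
    }
}
```

In the version asked for in the lab, run() would open the corpus file with the Parser and the target file with a writer, then execute the same three nested loops.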
DateExtractor that uses a regular expression to find the first date mentioned in the article. Let it return a triple of the form <PageTitle, hasDate, Date>. Try the extractor with the pages of Elvis Presley and Alan Turing. If you are adventurous, try normalizing the dates you extract to the form [-]YYYY-MM-DD. Regular expressions work as follows in Java:
// matcher.group(N) holds the N-th group of the match
// matcher.group() holds the entire match
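The two comments above describe java.util.regex.Matcher. As a minimal sketch of how Pattern and Matcher fit together, the class below finds the first date written like "January 8, 1935" and normalizes it to YYYY-MM-DD. The class name DateRegexDemo, the method firstDate and the single supported date format are assumptions made for illustration; real articles use more formats, and negative years are not handled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateRegexDemo {
    private static final List<String> MONTHS = Arrays.asList(
        "January", "February", "March", "April", "May", "June", "July",
        "August", "September", "October", "November", "December");

    // Group 1: month name, group 2: day, group 3: four-digit year.
    private static final Pattern DATE = Pattern.compile(
        "(" + String.join("|", MONTHS) + ") (\\d{1,2}), (\\d{4})");

    /** Returns the first date in the text as YYYY-MM-DD, or null if none is found. */
    static String firstDate(String text) {
        Matcher matcher = DATE.matcher(text);
        if (!matcher.find()) { // find() moves to the next match, if there is one
            return null;
        }
        // matcher.group() would return the entire match, e.g. "January 8, 1935".
        int month = MONTHS.indexOf(matcher.group(1)) + 1;
        int day = Integer.parseInt(matcher.group(2));
        String year = matcher.group(3);
        return String.format(Locale.ROOT, "%s-%02d-%02d", year, month, day);
    }
}
```

A DateExtractor would call such a method on the page content and wrap a non-null result in a <PageTitle, hasDate, Date> triple.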
TypeExtractor that extracts the type of the article entity ("Leicester is a city"). Manually exclude terms that are too abstract to be an ontological type ("member of...", "way of..."). If you are adventurous, improve your type extractor by running it on the POS-tagged corpus. For this purpose, extend the class Page and write a new
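For the text-only corpus, the "X is a Y" idea can be approximated with a regular expression. The sketch below is only one possible heuristic: the class name TypeRegexDemo, the method typeOf, the list of words that end the noun phrase, and the rule of taking the last word of the phrase as the type are all assumptions. The exclusion list contains just the two examples mentioned above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TypeRegexDemo {
    // Terms too abstract to be an ontological type; extend manually as needed.
    private static final Set<String> TOO_ABSTRACT =
        new HashSet<>(Arrays.asList("member", "way"));

    // "is a"/"was an" followed by a phrase that ends before a preposition,
    // a relative pronoun, or punctuation. Group 1 is the phrase.
    private static final Pattern IS_A = Pattern.compile(
        "\\b(?:is|was) an? ((?:\\w+ )*?\\w+)"
        + "(?= (?:in|of|on|at|from|that|which|who)\\b|[.,;])");

    /** Returns the guessed type of the article entity, or null if none is found. */
    static String typeOf(String content) {
        Matcher matcher = IS_A.matcher(content);
        if (!matcher.find()) {
            return null;
        }
        String phrase = matcher.group(1);
        // Heuristic: the last word of the phrase is taken as the head noun.
        String head = phrase.substring(phrase.lastIndexOf(' ') + 1)
                            .toLowerCase(Locale.ROOT);
        return TOO_ABSTRACT.contains(head) ? null : head;
    }
}
```

On the POS-tagged corpus the same decision could rely on the tags instead of a hand-written list of stop words, which is the improvement suggested for the adventurous.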
LocationExtractor that extracts the location of a place ("Hollywood is a district in Los Angeles"). Alternatively or additionally: Write a TypeAndLocationExtractor, which first calls the TypeExtractor, checks if the article entity is a city, district, etc., and, if so, extracts the location.
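The combined check can be sketched with a single pattern that accepts a location only when the type is a kind of place. The names LocationRegexDemo and locationOf are hypothetical, and so is the list of place types beyond "city" and "district": "town" and "village" were added here as examples. The location is assumed to be a run of capitalized words after "in".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LocationRegexDemo {
    // Group 1: the place type, group 2: the location (capitalized words after "in").
    private static final Pattern PLACE_IN = Pattern.compile(
        "\\b(?:is|was) an? (?:\\w+ )*?(city|district|town|village) in "
        + "([A-Z]\\w*(?: [A-Z]\\w*)*)");

    /** Returns the location of a place, or null if the text does not describe one. */
    static String locationOf(String content) {
        Matcher matcher = PLACE_IN.matcher(content);
        return matcher.find() ? matcher.group(2) : null;
    }
}
```

A TypeAndLocationExtractor built from separate classes would make the same two decisions in sequence: first obtain the type from the TypeExtractor, then look for the location only if that type is in the list of place types.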