Computational Linguistics Summary
(c) 2002-07-06 Fabian M. Suchanek
http://suchanek.name
This is a summary of the lecture "Computational Linguistics" held
by Prof. Peter Bosch and Dr. Sabine Reinhard at the University of
Osnabrueck in SS 2002. In spite of claiming to be a formal
discipline, Computational Linguistics (as presented here) often
seems to lack clear and consistent definitions...
By reading the following text, you accept that the author does not
accept any responsibility for the correctness or completeness of
this text. If you have any corrections or remarks, please
send me a mail. This is the only way to make the publication of this
summary useful for me, too.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Intro
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
See also lingu.txt for linguistic terms
See also infod.txt for information on theoretical languages
See also ai.txt for trees and First Order Predicate calculus
See also statistik.txt for probabilities
See also nn.txt for detailed information on vectors
See also algebra.txt for mathematical backgrounds
Set: An unordered collection of different elements.
S = {s1,...sn}
S = { x : p(x) } is the set of all those x for which p(x) holds
The number of elements of a set S is referred to as |S|.
ordered set: An ordered collection of different elements.
// Contrary to a set, this collection is ordered
// We need this expression in order to correctly identify sets with
// vectors
Size: The size of a set is the number of its elements.
|{s1,s2,...sn}| = n
Intersection: The intersection of two sets is the set of those elements
which occur in both of them.
Notation: X /\ Y
Union: The union of two sets is the set of all of their elements.
Notation: X \/ Y
Difference: The difference of two sets is the set of those elements
which occur in the first one, but not in the second one.
Notation: X \ Y
Tuple: An ordered collection of elements.
Example: t=(1,'A',abc) is a tuple
The entries are referred to as t[1]...t[n]
// Contrary to an ordered set, this collection may contain
// the same value in two positions
List, Sequence: An ordered collection of elements, combined with
the idea of a ranking.
Graph: A structure consisting of dots ("nodes") connected by lines
("arcs").
Character, char: A symbol as listed in Unicode, beginning at code point 32.
String: A possibly empty sequence of symbols.
Concatenation: The concatenation of two strings is the string
consisting of all symbols of the first one followed by all symbols
of the second one.
Substring: A substring of a string is a sequence of symbols occurring
in it.
epsilon, eps: A special symbol with the property that a string
containing epsilon is considered the same as this string without
epsilon.
// Take care, in InfoD, epsilon is used for the empty string!
Probability: The likelihood of an event A, written as P(A). P(A)
can be estimated by dividing the number of situations in which A
happened in the past by the total number of situations.
Example: P(it's raining today) =
Number of days it rained last year / All days last year =
351 / 365 = 0.96 = 96 %
Combined probability: The probability of two independent events A
and B to happen at the same time, written as P(A & B). P(A & B) can
be calculated by multiplying the single probabilities:
P(A & B) = P(A) * P(B)
Conditional probability: The probability of an event A, given that an
event B happened before, written as P(A | B).
P(A | B) = P(A & B) / P(B)
Bayes's theorem: P(B|A) = P(A|B) * P(B) / P(A).
This formula allows us to calculate the probability of "B if A" by
knowing the probability of "A if B".
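The formulas above can be checked against each other numerically; the probabilities below are made up for illustration:

```python
p_b = 0.3          # P(B)
p_a_given_b = 0.5  # P(A|B)
p_a = 0.25         # P(A)

# conditional probability rearranged: P(A & B) = P(A|B) * P(B)
p_ab = p_a_given_b * p_b
# Bayes's theorem: P(B|A) = P(A|B) * P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a

# Bayes agrees with the conditional-probability definition P(B|A) = P(A & B) / P(A)
assert abs(p_b_given_a - p_ab / p_a) < 1e-12
```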
Function: A mapping from elements of one set X to elements of another
set Y. Each element of X is mapped to exactly one element
of Y.
Notation: f:X->Y
f(x)=y f maps x of X to y of Y
f = (LAMBDA x f(x)) another way of writing it
(LAMBDA x f(x))(a) = f(a)
x is called the "argument" of function f.
Example: double := (LAMBDA x (x*2)) define a function double
double(12) apply it to 12
= double (12)
= (LAMBDA x (x*2))(12)
= 12*2
= 24
"LAMBDA" is used to make clear that we speak of the function itself.
Saying "(LAMBDA x square_root(x))" is like talking about "Square root"
without specifying an argument for it, i.e. without talking of a
specific result or value.
Please note that the elements of the sets X or Y may also be
functions!
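The LAMBDA notation corresponds directly to the lambda construct of, e.g., Python; a small sketch of the double example above:

```python
# (LAMBDA x (x*2)), written as a Python lambda:
double = lambda x: x * 2
assert double(12) == 24            # applying it to 12 gives 12*2

# elements of X or Y may themselves be functions:
twice = lambda f: (lambda x: f(f(x)))
assert twice(double)(3) == 12      # double(double(3)) = (3*2)*2
```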
Constant function: A function which always maps to the same element.
In this case, the argument may be omitted, since it does not play
a role.
Example: f(x)=3 for all x ~~~~> f = 3.
Binary function: A function mapping to the set {0,1}.
A binary function f:X->{0,1} can be seen as a set which contains all
those elements of X, which are mapped to 1.
Example: odd(x)= "1 if x is odd, 0 else"
odd(17)=1
odd(22)=0
~~~> odd = {1,3,5,7,... }
max: A function which, given a set of real numbers, returns the greatest
among them.
Example: max {1,2,3,1024} = 1024
min: A function which, given a set of real numbers, returns the smallest
among them.
Example: min {1,2,3,1024} = 1
argmax: A function which, given a function and an argument variable,
returns the value of the variable with the greatest function value.
Example: argmax x -(x-3)^2 = 3, because -(x-3)^2 is greatest for x=3
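In Python, max and min are built in, and argmax can be expressed with a key function; searching over a finite candidate range here is an assumption, since argmax as defined above ranges over all reals:

```python
assert max({1, 2, 3, 1024}) == 1024
assert min({1, 2, 3, 1024}) == 1

# argmax x -(x-3)^2, searched over the integer candidates -10..9:
assert max(range(-10, 10), key=lambda x: -(x - 3) ** 2) == 3
```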
NL: Natural language.
Matrix: A tuple of tuples of the same length.
M = ( (v[1][1], v[1][2], ... v[1][n]),
...
(v[m][1], v[m][2], ... v[m][n]))
Pair: A tuple with two elements.
Cartesian Product: The cartesian product of two sets A and B is the set
of all possible pairs, whose first element is in A and whose second
element is in B.
Example: {1,2} x {a,b,c} = { (1,a), (2,a), (1,b), (2,b), (1,c), (2,c) }
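The same product can be built with Python's itertools (a sketch):

```python
from itertools import product

# Cartesian product {1,2} x {a,b,c} as a set of pairs
pairs = set(product({1, 2}, {"a", "b", "c"}))
assert pairs == {(1, "a"), (2, "a"), (1, "b"), (2, "b"), (1, "c"), (2, "c")}
assert len(pairs) == 2 * 3        # |A x B| = |A| * |B|
```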
Relation: A subset of a cartesian product of two sets.
Instead of "(a,b) is an element of the relation r", one also writes
"r(a,b)" or "a r b" or "(a,b) is in r".
Example: The relation ">" is the set of all pairs of real numbers,
in which the first number is greater than the second.
It holds: ">" = { (1,0), (27.3,12.5), (17,-12), ... }
One writes: >(3,2) or 3 > 2 or (3,2) is in ">".
Vector: A tuple of real numbers. A vector may be seen as a point in a
coordinate system or as an "arrow" pointing from the origin to that
point.
Example: v = (13,8.2,-17)
// Contrary to a tuple, a vector contains real numbers, such that
// algebraic operations can be defined on vectors
Sum of vectors: The vector consisting of the sums of all corresponding
entries of the two vectors.
Example: (1,2,3) + (4,5,6) = (5,7,9)
Difference of vectors: The vector consisting of the differences of all
corresponding entries of the two vectors.
Example: (1,2,3) - (1,2,3) = (0,0,0)
Dimension of a vector: The number of its entries.
correlation coefficient of vectors, dot product of vectors, similarity of
vectors: A real value calculated from two vectors x and y as follows:
sim(x,y) = x * y = x[1]*y[1] + x[2]*y[2] + ... + x[n]*y[n].
Length of a vector: The distance between the origin and the point
indicated by the vector in a coordinate system. It may be calculated
by |v| = SQRT(v*v)
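Both definitions in a few lines of Python (a sketch; function names are made up):

```python
import math

def sim(x, y):
    # x * y = x[1]*y[1] + x[2]*y[2] + ... + x[n]*y[n]
    return sum(a * b for a, b in zip(x, y))

def length(v):
    # |v| = SQRT(v*v)
    return math.sqrt(sim(v, v))

assert sim((1, 2, 3), (4, 5, 6)) == 32    # 4 + 10 + 18
assert length((3, 4)) == 5.0              # SQRT(9 + 16)
```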
Binary vector: A vector of 0's and 1's.
Binary vectors and ordered sets: A subset of a given ordered set can be
described by a vector: Suppose the given set has n elements. Then any
subset of this set can be described by a binary vector with n elements,
which has a '1' for every element which is in the subset and a 0 else.
On the other hand, every binary vector can be seen as a subset of an
ordered set. This allows us to use set operations on binary vectors,
which then work like operations on binary numbers in a programming
language. Please note that a simple ordered set cannot be converted
to a vector, but only subsets of a given ordered set can be expressed
by vectors.
Example: set ={1,2,3,4,5}
subset={1, 3, 5}
vector=(1,0,1,0,1)
// The distinction between vectors, sets, ordered sets, tuples,
// binary vectors and subsets does not become clear from the lecture
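The correspondence can be sketched in Python; set operations on subsets then become bitwise operations on the vectors (the helper names are made up):

```python
ordered_set = [1, 2, 3, 4, 5]

def to_vector(subset):
    # '1' for every element which is in the subset, 0 else
    return [1 if e in subset else 0 for e in ordered_set]

def to_subset(vector):
    return {e for e, bit in zip(ordered_set, vector) if bit}

assert to_vector({1, 3, 5}) == [1, 0, 1, 0, 1]

# set intersection = bitwise AND on the binary vectors:
v = [a & b for a, b in zip(to_vector({1, 3, 5}), to_vector({3, 4}))]
assert to_subset(v) == {3}
```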
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Grammar
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Alphabet: A set of symbols.
Word: A word of an alphabet is a string of its symbols.
Language: A language of an alphabet S is a subset of the set of all
words of S.
If you consider a natural language, the "alphabet" is its set of
words, the "symbols" are its words and the "words" are its sentences!
FOPC, First Order Predicate Calculus: A language of variables,
truth-operators and quantifiers.
// For a detailed definition see ai.txt
Formula: A word of FOPC or a similar formal language.
Language concatenation: The concatenation of two languages is the language
of words consisting of a word from the first language concatenated with
a word from the second language.
L1L2 = { uv : u is from L1 and v is from L2 }
"Self-concatenation": The n-fold self-concatenation of a language is
the language of all those words which consist of n words of the language.
L^0 = {eps}, the language containing just the empty word
L^n = L L^(n-1)
"Infinite self-concatenation", Kleene-Star: The Kleene-Star of a language
is the language of all those words which consist of 0 to INF words of
that language.
L* = union of all L^n for n=0..INF
Complementation: The complementation of a language is the language of
those words of the underlying alphabet, which are not in this
language.
Reversal: The reversal of a language is the set of all words contained
in it, turned around.
Definition of a language: A formalism which can both detect and
generate all and only the words of this language.
Production: A rule of substituting one substring by another.
Example: N -> "house" and DET -> "the" applied to "DET N"
yield "the house"
Terminal, lexical constituent: A symbol of the alphabet of the language.
Example: "cat"
Non-terminal: A symbol classifying a terminal or a sequence of
non-terminals (i.e. all those symbols in the productions which do not
belong to the language).
Examples: DET classifies "the" and S classifies NP VP
Sentence: A "word" of a natural language.
Example: The sentence "The cat runs" is a "word" of the symbols
"the", "cat" and "runs".
Sentence form: A sequence of non-terminals and maybe also terminals.
Example: "DET cat runs"
Grammar: A formalism of productions which can generate all words of a
language. Since a grammar can also be used to check whether a word could
be generated, it functions as a definition of that language. A grammar is
a "declarative" way of defining a a language. Formally,
a grammar consists of:
* An alphabet of terminals
* An alphabet of non-terminals
* A set of productions
* A starting symbol, which is non-terminal
Example: S -> NP VP, NP -> DET N, VP -> V, DET -> "the", N -> "cat",
V -> "runs" with starting symbol S and alphabets as usual can tranform
"S" to "The cat runs"
Overgeneration: The phenomenon that a grammar produces words which are not
wanted.
Derivation: A sequence of productions which transforms the string of
a non-terminal (by default: the starting symbol) to a word of the
language.
Parse tree: The graphical representation of a derivation in form of
a tree. The root is the starting symbol. The set of children of a node
is the sequence of the symbols resulting from the parent by one
production.
Recursion: The phenomenon that the name of a symbol occurs in the
definition of the symbol.
Center-embedded recursive production: A production of the form
A -> xAy where A is a non-terminal and x and y are arbitrary
non-empty sentence forms.
Immediately left-recursive production: A production of the form
A -> Ax where x is a sentence form.
Left-recursive grammar: A grammar with a non-terminal A which has a
derivation that leads to a sentence form with A itself at the
leftmost position.
Constituency: The phenomenon that a sequence of symbols may act as a
single unit.
Constituent, phrase, clause: A sequence of terminals which can be
derived from a non-terminal.
Example: "The big cat" is a constituent since it can be derived from
NP.
strongly equivalent grammars: Grammars which generate the same
language and assign the same parse tree to all words.
weakly equivalent grammars: Grammars which generate the same
language but do not assign the same parse tree to all words.
Chomsky Hierarchy: The following list of sets of grammars:
Type 0: Unrestricted Grammars/Turing machines (most powerful)
Productions are not restricted
Type 1: Context-sensitive Grammars
Productions replace one non-terminal surrounded by a context
by a sentence form
Type 2: Context-free Grammars
Productions replace one non-terminal by a sentence form
independent of the context
Type 3: Regular/Right linear Grammars (least powerful)
Productions replace one non-terminal by a terminal or by
a terminal followed by a non-terminal -- independent of
the context.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Regular Languages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Regular Expression, RE: Algebraic notation for characterizing a set
of strings.
A RE is
* a symbol
This RE describes all strings which consist of just this symbol
* a RE followed by a RE
This RE describes strings consisting of one substring matching
the first RE followed by another substring matching the second RE
* a "."
This RE matches any string of one arbitrary symbol except for
epsilon
* a "^"
This RE matches the beginning of a line
* a "$"
This RE matches the end of a line
// What is the difference here (?)
* a "\b"
This RE matches word boundaries
* a "\B"
This RE matches non-boundaries
* a "[s]" where s is a string
This RE matches strings consisting of one of the
symbols of s
* a "[x-y]" where x and y are symbols with an order
This RE matches strings consisting of a symbol in between
and including x and y
* a "[^x-y]" where x and y are symbols with an order
This RE matches strings of one symbol, which is not in between
x and y
* a RE followed by a '?'
This RE matches empty strings or strings matching the
previous RE
* a RE followed by a '*'
This RE matches strings consisting of a possibly empty
sequence of substrings which each match the previous RE
* a RE followed by a '+'
This RE matches strings consisting of a non-empty
sequence of substrings which each match the previous RE
* a "(x)" where x is a RE
* a "(x | y)" where x and y are REs
This RE matches strings which either match x or y or both
* a RE followed by "{n}" where n is a positive integer
This RE matches strings consisting of n substrings matching
the previous RE
* a RE followed by "{n,m}" where n and m are positive integers
This RE matches strings consisting of n to m substrings
matching the previous RE
* a RE followed by "{n,}" where n is a positive integer
This RE matches strings consisting of at least n substrings
matching the previous RE
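Most of the operators above exist with the same notation in Python's re module; a few examples:

```python
import re

assert re.fullmatch("colou?r", "color")        # '?': optional part
assert re.fullmatch("ab*a", "abbba")           # '*': 0..INF repetitions
assert re.fullmatch("[a-c]+", "cab")           # range plus non-empty repetition
assert re.fullmatch("(cat|dog)", "dog")        # alternation
assert re.search(r"\bcat\b", "the cat runs")   # word boundaries
assert not re.fullmatch("a{2,3}", "a")         # counted repetition
```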
RE Substitution string: A string of the form "s/<re>/<string>/"
where <re> is a RE and <string> is a string. This
construction says "Replace every substring matching <re> by
<string>". (?)
Deterministic: Always responding the same way upon an input.
Transition: The act of changing one state to another.
Epsilon-transition: A spontaneous transition which occurs without
input.
Transition function: A function d:S x I -> S which, given a state and
a symbol, returns a successor state.
Transition relation: A relation r:S x I x S between states.
A transition relation can be seen as a function which, given a symbol
and a state, returns a set of successor states.
Automaton: A formalism which can detect whether a string belongs to
a language.
// Unfortunately, the definitions of automata, FSAs, deterministic
// FSAs and indeterministic FSAs are not hierarchical -- although
// this would be possible. For a complete overview see infod.txt
FSA, Finite State Automaton: A formalism which performs transitions on
a finite set of states while reading an input string.
A FSA "reads" a string symbol by symbol. At each symbol read,
the state is changed according to internal rules. If the string is over
and the FSA remains in a state marked as "final", then the FSA is
said to "accept" the string. A FSA may be depicted as a graph where
every node stands for a state and the edges show a transition from
one state to another, given a particular input symbol. A FSA may also
be depicted by a transition table, which shows the new state(s) for
a particular current state and a particular input. A FSA can also be
used vice versa to generate the language. Hence it works as a
definition of a language.
Deterministic FSA: A FSA consisting of
* a finite set of states S={s0,s1,...sn}
* a finite input alphabet I
* a start state s0 E S
* a set of final states F < S
* a transition function d(q,i)
A deterministic FSA thus has only 1 successor state for a current
state and a given input.
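A deterministic FSA is a few lines of Python; the example automaton below is made up and accepts the language of the RE "ab*a":

```python
def dfa_accepts(string, start, finals, delta):
    state = start
    for symbol in string:
        if (state, symbol) not in delta:   # no transition defined: reject
            return False
        state = delta[(state, symbol)]
    return state in finals                 # accept iff we end in a final state

# transition function as a table: (state, symbol) -> successor state
delta = {(0, "a"): 1, (1, "b"): 1, (1, "a"): 2}
assert dfa_accepts("abba", 0, {2}, delta)
assert not dfa_accepts("ab", 0, {2}, delta)
```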
Non-deterministic FSA: A FSA consisting of
* a finite set of states S={s0,s1,...sn}
* a finite input alphabet I
* a set of begin states B < S
// In InfoD, B is just a single beginning state!
* a set of final states F < S
* a transition relation r
// Note that with this definition, the non-deterministic
// FSA can act deterministically -- if there is always 1 transition
// for 1 input and 1 state, which is not prohibited.
A non-deterministic FSA thus can have choice points where for a
current state and a given input, one of multiple possible successor
states results. If _any_ of the possible ways in which the FSA can
process an input results in a final state, the FSA is said to accept
the string.
Converting non-deterministic FSAs to deterministic FSAs:
Introducing new states for each set of states reached with one input
symbol.
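The construction can be sketched by simulating the non-deterministic FSA directly: track the set of states it could be in; each such set is one state of the resulting deterministic FSA (the example automaton is made up):

```python
def nfa_accepts(string, begin_states, finals, rel):
    current = set(begin_states)            # all states the FSA could be in
    for symbol in string:
        # one deterministic step over the whole set of states
        current = {s2 for s in current for s2 in rel.get((s, symbol), ())}
    return bool(current & finals)          # does any possible way end final?

# (0,"a") has two successor states: a genuine choice point
rel = {(0, "a"): {0, 1}, (1, "b"): {2}}
assert nfa_accepts("aab", {0}, {2}, rel)
assert not nfa_accepts("b", {0}, {2}, rel)
```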
Regular language, RL: One of the following:
* The empty set
* The set of one symbol (including epsilon)
* L1 L2 = {xy : x is from L1, y is from L2}, the concatenation of two regular
languages L1 and L2
* L1 \/ L2, the union of two regular languages L1 and L2
* L*, the result of the Kleene operation on a regular language L
Closure properties of regular languages:
IF L1 and L2 are regular languages, then so are
* their intersection
* their difference
* their complementations
* their reversals
Left-linear grammar: A grammar which just has productions of the form
Non-Terminal -> Terminal or
Non-Terminal -> Non-Terminal Terminal
Right-linear grammar: A grammar which just has productions of the form
Non-Terminal -> Terminal or
Non-Terminal -> Terminal Non-Terminal
Any right-linear grammar can be transformed to a left-linear one and
vice versa.
Regular grammar, RG: A grammar which is either left-linear or right-linear.
Regular equivalences: For every RL, there is a
* RE which generates it
and all RE just generate RLs
* FSA which accepts it
and FSA just accept RLs
* regular grammar which produces it
and regular grammars just produce RLs
Problems with RLs:
* RLs cannot represent constituency
* RLs cannot represent structural ambiguity
* RLs cannot deal with recursion
Finite state transducer, FST: A formalism which gets an input string
and produces an output string by performing transitions on a finite
set of states. It consists of
* a finite set of states S={s0,s1,...sn}
* a finite set A of pairs of an input symbol and an output symbol,
where both of them may also be epsilon
* a start state s E S
* a set of final states F < S
* a transition function d: S x A -> S
A FST works like a FSA, but produces one output symbol with every
transition. A FST can thus also be depicted as a table or graph.
Regular relation: A set of pairs of strings, where each pair has a
corresponding FST.
// I'd prefer to define "regular relation" without FSTs and
// afterwards show equivalence,
// but I just don't know what a "regular relation" is (?)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PoSs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Word: When talking about natural languages, a word is what it usually is.
Linguistic expression: A word or a constellation of words.
Preterminal, PoS, Part of Speech: A non-terminal which can derive a
terminal in one step. A PoS is thus a symbol classifying the role a
linguistic expression will take in a sentence. A PoS carries
syntactical, morphological and semantic information. The PoS is
written with a slash behind the linguistic expression.
Examples: "House/Noun", "goes/verb", "you/pronoun", "to/preposition",
"well/adverb", "and/conjunction", "up/participle", "the/article"
// These POS base on syntax trees, see lingu.txt
Tag: An additional information for an item.
Here mostly: The PoS of a linguistic expression.
Tagging: Assigning PoS to words.
Subject: The noun of the noun phrase of a sentence.
Object: The noun of the noun phrase of the verb phrase of a sentence.
Corpus: Collection of text data.
Annotated corpus: A corpus with additional information such as
sentences being split to tokens, words being tagged with their PoS
and words being tagged with their lemmata.
CQP: Corpus Query Processor, a tool for searching annotated corpora.
The user types a query in form of an extended RE, CQP searches the
corpus and stores the result in a variable or prints it on the screen.
Various display modes (such as length of context or display of tags)
may be set. Results may be grouped according to a feature.
// some CQP commands would be nice here
Word class: A set of words with the same PoS.
Closed word class: A word class which cannot be extended by inventing
new words. Closed word classes are classes of function words.
Examples: prepositions, determiners, numerals, ...
Open word class: A word class which can be extended by inventing new
words.
Examples: Noun, verbs, adjectives
Lexicon-only tagging: Assigning tags by looking the PoS up in a lexicon.
Disambiguation rule: An algorithm to determine the most plausible
PoS out of a set of choices for a linguistic expression, taking into
account the whole sentence.
Lexicon plus rules tagging: Assigning tags by looking up the PoS in a
lexicon. When two or more PoS can be assigned to a linguistic
expression, disambiguation rules are used to pick out one of them.
Lexicon plus absolute probability tagging: Assigning tags by looking
up the PoS in a lexicon. When two or more PoS can be assigned, choose
the one which occurs most frequently in the language.
Lexicon plus relative probability tagging: Assigning tags by looking
up the PoS in a lexicon. When two or more PoS can be assigned, choose
the one which occurs most frequently in the language, given the
preceding PoSs. This strategy thus does not only take into account
how often a word occurs in the language, but also how often
a word of this word class occurs after this sequence of PoSs.
One maximizes P(word|tag) * P(tag|previous n tags).
Example: "run" is much more likely to be a noun if it is preceeded
by a determiner: P(run|Noun) * P(noun|det) > P(run|verb) * P(verb|det)
Gold Standard: The human expert tagging of a corpus, roughly defining
what is "correct" tagging. Machine tagging today reaches 95-97% of
correct tags.
Word-Tag probability: The probability that, given a tag T, the word W
occurs, written as P(W | T).
Example: P(run | verb) > P(run | determiner)
Tag-Tag probability: The probability that a tag T1 is preceded by a
tag T2, written as P(T1 | T2).
Example: P(noun | determiner) will be large because nouns are often
preceded by a determiner.
Dynamic programming: Writing algorithms which store partial results
in a table instead of recalculating them each time they need them.
This reduces the exponential time effort needed by recalculation to
a polynomial time effort.
Example: You all know the fibonacci function
fib(0) = 1; fib(1) = 1;
fib(n) = fib(n-1) + fib(n-2);
When calculating fib(4), you first calculate fib(3), which requires
fib(2) and fib(1). Then you calculate fib(2) and add it to fib(3).
Obviously, fib(2) was calculated (at least) two times! Instead of
recalculating it the second time, you look up your old result of
fib(2). You might as well start from fib(0) and make a list fib(0) to
fib(n), such that you can easily look up any previously calculated
value. PROLOG supports dynamic programming by "assert": It does
not matter whether a query is calculated from rules or whether it has
been dynamically added as a constant fact.
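In Python, the lookup table can be sketched with functools.lru_cache, which stores each partial result the first time it is computed:

```python
from functools import lru_cache

@lru_cache(maxsize=None)        # the "table" of partial results
def fib(n):
    # fib as defined above: fib(0) = fib(1) = 1
    return 1 if n < 2 else fib(n - 1) + fib(n - 2)

assert fib(4) == 5              # 1, 1, 2, 3, 5
assert fib(100) == fib(99) + fib(98)   # fast thanks to the table
```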
Viterbi Algorithm: A dynamic programming algorithm for efficiently
calculating the tags of a sentence. It uses a table in which for all
words (rows) and for all tags (columns) the probability of that word
being assigned that tag is stored. The first row stands for "beginning
of the sentence" and has all cells set to 0 -- except for the column
with the tag "PERIOD", set to 1. All rows are now filled top-down.
For calculating the probability of a word-tag-pair, three factors are
taken into account:
1. The word-tag probability of this pair P(W[i] | T)
This probability can be calculated from a corpus
2. The tag-tag probability of this tag and a preceding tag X
P(T|X), can be calculated from a corpus
3. The word-tag probability of the preceding word and the tag X
P(W[i-1] | X), can be found in the preceding row of the table
In order to find the tag X, one runs through row i-1 and chooses the
tag X which maximizes P(W[i]|T)*P(T|X)*P(W[i-1]|X).
The resulting probability (and the tag X) is stored in the cell.
After having filled the table, the most probable sequence of
tags for the sentence can be found by going bottom-up in the table:
Choose the largest value in the last row. The corresponding tag is
the most probable tag for the last word. The tag X stored in the
same cell is the most probable tag for the preceding word, because
we chose X as preceding tag to maximize the current word-tag value.
Go to this tag X in the preceding row. There will be another tag
X stored in this cell, giving you the most probable PoS of the last
but two word. Continue up to the first row.
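A minimal sketch of the table filling and the bottom-up read-out; all probabilities below are made up for illustration, and the start-of-sentence tag is called "START" here instead of "PERIOD":

```python
TAGS = ["det", "noun", "verb"]
EMIT = {("the", "det"): 0.5, ("cat", "noun"): 0.1,     # P(word | tag)
        ("runs", "verb"): 0.2, ("runs", "noun"): 0.01}
TRANS = {("det", "START"): 0.6, ("noun", "det"): 0.7,  # P(tag | previous tag)
         ("verb", "noun"): 0.5, ("noun", "noun"): 0.1}

def viterbi(words):
    # table[i][tag] = (probability of the best path, best preceding tag)
    table = [{"START": (1.0, None)}]
    for w in words:                       # fill the rows top-down
        row = {}
        for t in TAGS:
            cands = [(p * TRANS.get((t, prev), 0) * EMIT.get((w, t), 0), prev)
                     for prev, (p, _) in table[-1].items()]
            best = max(cands)
            if best[0] > 0:
                row[t] = best
        table.append(row)
    # read the most probable tag sequence bottom-up
    tag = max(table[-1], key=lambda t: table[-1][t][0])
    seq = []
    for row in reversed(table[1:]):
        seq.append(tag)
        tag = row[tag][1]                 # the stored preceding tag X
    return list(reversed(seq))

assert viterbi(["the", "cat", "runs"]) == ["det", "noun", "verb"]
```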
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Context-free Languages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Context-free grammar, CFG, phrase structure grammar: A grammar which
just has productions of the form
Non-Terminal -> Sentence form
Context-free language: A language generated by a context-free grammar.
epsilon-free context-free grammar: A grammar which has no production
containing epsilon. Only the production S -> epsilon is allowed, to
ensure the grammar can generate the empty word if it belongs to the
language.
Chomsky normal form context-free grammar: An epsilon-free context-free
grammar with only productions of the form
non-terminal -> non-terminal non-terminal
non-terminal -> terminal
Every context-free grammar can be converted into a weakly equivalent
Chomsky normal form grammar.
Agreement: The phenomenon that two linguistic expressions match
according to a certain criterion (such as number or person).
Head: The constituent of a phrase which provides features for the whole
phrase.
Example: In "my sister", "sister" is the head since it makes the
whole noun phrase "singular" and "female".
Complement: The constituent of a phrase which is not the head.
Subcategorization frame: A set of non-terminals describing the possible
complements for a given non-terminal.
Rule proliferation: An immense increase in the number of rules.
Advantages of CFGs: CFGs can represent
* Declarative sentences
* Imperative sentences
* Wh-questions
* Yes/No-questions
Problems with CFGs:
* CFGs don't recognize agreement
Example: "I sees the tree"
* CFGs don't recognize subcategorization frames
Example: "I sleep you"
CFGs overgenerate and produce ungrammatical sentences. More precise
rules would help out but would lead to rule proliferation. Furthermore,
redundancy and a lack of generalization result.
CFGs and RGs: CFGs are more powerful than RGs and include RGs. But a RG can
generate the same language as a CFG if and only if the CFG does not
contain center-embedded recursion.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Parsing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Parsing: Assigning structure to an input.
Morphological parsing: Assigning structure to a word.
Syntactical parsing: Assigning structure to a sentence. Syntactical
parsing can be implemented as search in the parse tree.
Top-down, goal-driven: Starting with the goal and going to what we
have. Here: A goal-driven parsing algorithm starts with S and expands
the tree top-down to a structure of preterminals which matches the
sentence.
Advantage: Only trees resulting in S are built (since we start with it)
Disadvantage: Much time is spent on building trees which do not match
the sentence.
Bottom-up, data-driven: Starting with what we have and going to the
goal. Here: A data-driven parsing algorithm starts with the sentence
and tries to find a way up to S.
Advantage: Only trees which match the sentence are built
(since we start with it)
Disadvantage: Much time is spent on building trees which do not lead
to S.
Left-corner: The left-corner of (a subtree of) a parse-tree is its
left-most preterminal.
Bottom-up filtering: A technique for improving the performance of a
top-down parser. It uses a left-corner table, where for each non-terminal
all left-corners of this non-terminal's parse trees are listed. Thus,
the parser already knows whether it is worth expanding this non-terminal,
given a part of a sentence which is to be matched.
Left-corner parser: A top-down, depth-first left-to-right parser which
uses bottom-up filtering.
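A left-corner table for the example grammar used throughout can be sketched as follows (grammar and preterminal set are assumed from the earlier examples):

```python
GRAMMAR = {"S": [["NP", "VP"]], "NP": [["DET", "N"]], "VP": [["V"]]}
PRETERMINALS = {"DET", "N", "V"}

def left_corners(symbol, seen=()):
    # collect every preterminal that can appear as the left-corner
    # of a parse tree for this symbol
    if symbol in PRETERMINALS:
        return {symbol}
    corners = set()
    for production in GRAMMAR.get(symbol, []):
        if production[0] not in seen:          # guard against left-recursion
            corners |= left_corners(production[0], seen + (symbol,))
    return corners

# an S-tree can only start with a DET, a VP-tree only with a V:
assert left_corners("S") == {"DET"}
assert left_corners("VP") == {"V"}
```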
Ambiguity: The phenomenon that a linguistic expression has multiple
meanings. Ambiguity is a problem to all simple parsers.
Attachment ambiguity: The ambiguity resulting from multiple possibilities
of attaching a constituent to a parse tree.
Coordination ambiguity: The ambiguity resulting from an unclear range
of a conjunction.
Noun-phrase bracketing ambiguity: The ambiguity resulting from an unclear
pairing within a sequence of nouns. (?)
Problems of a left-corner parser:
* Left-recursion
A left-recursive grammar leads the parser into an infinite loop
A solution might be to create a weakly equivalent grammar with the
left-recursion eliminated
* Ambiguity
* Repeated parsing of subtrees
Since the left-corner parser is depth-first, it may happen that a
non-terminal's subtree is parsed more than once.
Dynamic programming parser: A top-down parser which stores subtrees in
a table so that they do not have to be reparsed when needed again.
Since all possible parses are always stored in the table, ambiguity
can be identified. The problem of left-recursion is also solved.
Earley algorithm: A dynamic programming parser. It parses the input
left to right and has a "chart" for the beginning of the sentence
and one more for each word of the sentence. Each chart is an ordered
set of "states". A state is a production with a dot in it and two
zero-based integer values behind it:
A -> B C D * E F [i;j]
The dot tells us, where the current position in the input sentence
has its corresponding position in the production. i is the position
of the word in the input sentence that corresponds to B and j is
the position of the current word, i.e. of the dot. The initial
state is y -> * S [0;0]. Now, 3 operators are successively applied
to the states, yielding new states and filling the charts until
y -> S * [0;n] is reached (n = number of words), indicating that the
sentence has been successfully parsed from the first to the last
position. The three operators are the predictor, the scanner and
the completer. By tracking back those states which triggered a
completer, the complete parse tree can be read out of the charts.
Predictor: An operator of the Earley algorithm. It is applied to
every state that has a non-terminal which is not a PoS to the right
of the dot. It expands the non-terminal according to the grammar and
adds new states to the same chart, which contain:
* a production of the form N -> * A B C, where A B C is
the derivation of the found non-terminal N.
* the sentence position index set to the dot position index
* the same dot position index
Scanner: An operator of the Earley algorithm. It is applied to every
state that has a preterminal to the right of the dot. When the
current word fits that preterminal, a new state is added in the next
chart with:
* a production of the form Preterminal -> word *
* the sentence position index set to the dot position index of the
triggering state
* the dot position index increased by 1
Completer: An operator of the Earley algorithm. It is applied to every
state where the dot has reached the right end of the production. For
all previous states in the charts, it picks those which have the
newly finished non-terminal to the immediate right of the dot. For
all of them, new states are created in the current chart with:
* the same production, but the dot advanced by 1
* the same sentence position index
* the same dot position index, but increased by the length of
the newly finished state, i.e. j-i
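The three operators can be sketched in Python as a recognizer. The toy grammar, lexicon and the state representation (lhs, rhs, dot, origin) are assumptions made for illustration; a full parser would also keep back-pointers to read out the parse tree.

```python
# Minimal Earley recognizer sketch. States are (lhs, rhs, dot, origin)
# tuples; there is one chart (a set of states) per input position.
GRAMMAR = {                      # toy grammar, assumed for illustration
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {"the": "Det", "dog": "N", "man": "N", "bites": "V"}

def earley(words):
    charts = [set() for _ in range(len(words) + 1)]
    charts[0].add(("y", ("S",), 0, 0))        # initial state y -> * S [0;0]
    for j in range(len(words) + 1):
        agenda = list(charts[j])
        while agenda:
            lhs, rhs, dot, i = agenda.pop()
            if dot < len(rhs) and rhs[dot] in GRAMMAR:        # predictor
                for prod in GRAMMAR[rhs[dot]]:
                    new = (rhs[dot], tuple(prod), 0, j)
                    if new not in charts[j]:
                        charts[j].add(new)
                        agenda.append(new)
            elif dot < len(rhs):                              # scanner
                if j < len(words) and LEXICON.get(words[j]) == rhs[dot]:
                    charts[j + 1].add((rhs[dot], (words[j],), 1, j))
            else:                                             # completer
                for (l2, r2, d2, i2) in list(charts[i]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, i2)
                        if new not in charts[j]:
                            charts[j].add(new)
                            agenda.append(new)
    return ("y", ("S",), 1, 0) in charts[len(words)]
```

A word has a single PoS tag here for simplicity; ambiguity in the grammar is handled by the charts themselves.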
Chunk-pattern: A sequence of preterminals.
A chunk-pattern is thus non-recursive.
Chunk: A sequence of terminals which matches a chunk-pattern.
Chunk-parser: An algorithm which picks out chunks from a sentence.
It matches chunk-patterns to the PoS-tags. Since it is implemented
as a FSA, it works extremely efficiently. The result of a chunk parser
is less informative than a syntactic parse, but the chunk parser
does not have the problems a left-corner parser has.
Cascade chunk parsing: Feeding one chunk parser with the results
from another in order to find higher-order structures.
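A chunk parser for a hypothetical NP chunk-pattern "Det? Adj* N+" can be sketched as a small finite-state scan over the PoS tags (the tag set and the pattern are assumptions for illustration):

```python
def np_chunks(tagged):
    """Pick out NP chunks matching Det? Adj* N+ from a list of
    (word, pos) pairs; returns each chunk as a list of words."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "Det":   # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "Adj":  # any adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] == "N":    # one or more nouns
            k += 1
        if k > j:                       # at least one noun: a chunk
            chunks.append([w for w, _ in tagged[i:k]])
            i = k
        else:
            i += 1
    return chunks
```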
Contingency matrix: A matrix showing the performance of a chunk
parser on a set of strings. It looks like this:
                        is a chunk   is not a chunk
in the parser's output      TP             FP
not in the output           FN             TN
TP: true positives, the number of those chunks which were found
FP: false positives, the number of non-chunks reported
by the parser
// Take care: In Neural Networks (nn.txt), these expressions
// are used vice versa!
FN: false negatives, the number of those chunks which were not
found by the parser
TN: true negatives, the number of those non-chunks which were
not reported. Note that this number will be extremely large;
it is never used and never calculated.
Precision, accuracy: The proportion of the correct chunk-patterns
in the parser's output, calculated as TP/(TP+FP). Thus, the fewer
non-chunks the parser reported, the better its precision.
See also ai.txt
Recall, coverage: The proportion of those chunk-patterns which were
found, calculated as TP/(TP+FN). Thus, the fewer chunks the parser
forgets to report, the better its recall.
See also ai.txt
F-measure: A measure combining precision P and recall R. It is
calculated as 1/(b/P+(1-b)/R). b is a parameter between 0 and 1
which weighs precision against recall; b=0.5 yields the balanced
F-measure 2PR/(P+R).
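In code, the three measures are direct transcriptions of the formulas above:

```python
def precision(tp, fp):
    return tp / (tp + fp)               # TP/(TP+FP)

def recall(tp, fn):
    return tp / (tp + fn)               # TP/(TP+FN)

def f_measure(p, r, b=0.5):
    """Weighted harmonic mean 1/(b/P + (1-b)/R); b=0.5 gives
    the balanced F-measure 2PR/(P+R)."""
    return 1 / (b / p + (1 - b) / r)
```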
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Feature structures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Feature structure: A set of attribute-value pairs. The attributes are
atomic, but the values may be nominals, integers or again whole
feature structures. Values may also be "[]", i.e. unspecified. Each
attribute is only contained once.
Underspecification of a feature structure: The phenomenon that a feature
structure does not contain all possible attributes.
Attribute-value matrix, AVM: A matrix consisting of lines of attribute-
value pairs, thus representing a feature structure.
Feature graph: A directed acyclic graph representing a feature
structure. Its arcs are labelled by the attributes and its values
appear as nodes.
Reentrancy, structure sharing: The phenomenon that several attributes
have a precisely identical value. The shared value is notated by
coindexing boxes in the AVM.
Path equation: A string of the form (A B C) = (D E F) or
(A B C) = b, where (A B C) and (D E F) denote paths in the feature
graph by the names of attributes and b denotes a node of the graph.
(A B C) = (D E F) means: You reach the same node when travelling via
A,B,C and D,E,F. (A B C) = b means: You reach a node with value b
when you travel along A,B,C.
Subsumption: The relation of a more general structure including a more
specific one.
Subsumption of feature structures: The relation between a feature
structure and a more specific feature structure.
Subsumption is written as A < B, where "<" is a square-shaped
less-equal sign and A is the less specific, more general, subsuming
one. Subsumption is a partial ordering relation, which does not hold
in any direction with conflicting feature structures (same attribute,
different value) or completely different feature structures (no
equal attributes).
Feature structure semilattice: A graph of feature structures as nodes
and subsumption relations as arcs.
Unification: The act of combining two information structures to a
new one. When the information structures are incompatible, unification
fails.
Unification of feature structures: The act of combining two feature
structures to a new one, which contains the information of both
feature structures, but not more (the most general unifier). Unification
is written as A |_| B = C. It works like this: Go through all
attribute-value-pairs of the first feature structure.
* If the attribute is not contained in the second one, keep it
* If the attribute is contained in the second one and has an
unspecified value, keep it
* If the attribute is contained in the second one and has a
different specified value than the first one, the unification
fails.
* If the attribute is contained in the second one and has a feature
structure as value, unify the first value and the second value and
keep the resulting structure as value to the current attribute.
Repeat vice versa with the second feature structure.
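A sketch of this algorithm over feature structures represented as nested Python dicts (None stands for the unspecified value []; reentrancy/structure sharing is not modelled):

```python
def unify(a, b):
    """Unify two feature structures (nested dicts). Raises ValueError
    when the structures are incompatible."""
    out = dict(a)
    for attr, bval in b.items():
        if attr not in out or out[attr] is None:
            out[attr] = bval              # new, or unspecified in a: take b's
        elif bval is None:
            pass                          # unspecified in b: keep a's value
        elif isinstance(out[attr], dict) and isinstance(bval, dict):
            out[attr] = unify(out[attr], bval)  # recurse on sub-structures
        elif out[attr] != bval:
            raise ValueError("unification fails at " + attr)
    return out
```

The result contains the information of both structures but not more, i.e. the most general unifier.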
Feature structures in grammar: Feature structures may be added to a
grammar in order to model agreement, subcategorization and grammatical
heads. Feature structure constraints are added to the grammar in the
form
<X feature-path> = atomic-value or
<X feature-path> = <Y feature-path>
and mean the same as do path equations. Lexical constituents get their
features from the lexicon, whereas non-lexical constituents get their
attributes and values from their heads of phrases.
Example: <NP agreement> = <VP agreement>, where "agreement" is
an attribute whose value is maybe a feature structure of "person" and
"number" or "gender".
Head feature: A feature passed from the head to its phrase.
Head features are grouped as value of an attribute called "HEAD".
Subcategorization feature: A feature specifying the complements of
a non-terminal. Subcategorization features are grouped as value of
an attribute called "SUBCAT".
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Psycholinguistics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Thematic role: The function the referent of a linguistic expression
has in the situation represented by the sentence.
Example: In "Peter kicked Franzerl", "Peter" is the agent and
"Franzerl" is the patient.
Children find sentences difficult in which an entity is both an
agent and a patient.
Right-branching language, head initial language: A language whose
complements stand to the right of their heads. May be parsed top-down.
English is an example of a right-branching language.
Left-branching language, head final language: A language whose complements
stand to the left of their heads. Cannot be parsed top-down, but rather
bottom-up. Japanese is an example of a left-branching language.
Different types of parsing:
Here is a table of different types of parsing and their properties.
Classical:   done by humans on sentences; yields an informal grammatical
             description; is a skill; not innate; done consciously;
             pedagogical goal.
Formal:      done by an abstract device on abstract representations;
             yields a formal representation; algorithmic.
ComputerL's: done by a computer on symbolic input; yields a change in
             the machine state; algorithmic.
NL:          done by a computer on representations of NL; yields a
             change in the machine state; algorithmic/heuristic.
PsychoLing:  done by humans on bits of NL; yields a mental
             representation; heuristic; innate; done unconsciously.
Efficiency principles for human parsing:
* Use any information immediately
* Avoid backtracking and reanalysis
Human parsing:
According to the standard view, human parsing takes place left-to-right
and top-down. But this would cause problems with left-branching
languages. The innate parsing mechanism may be parameterized with "right" or
"left" for top-down or bottom-up parsing. Apart from a parse tree, human
parsing also establishes thematic roles and meaning.
Center-embedded relative clause: A complement for a noun which is
surrounded by other terminals in the sentence and starts with a
determiner. Center-embedded relative clauses put a considerable burden
on the working memory because unfinished clauses must be kept in memory
until they are completed by the corresponding verb. Center-embedded
structures can be removed in English.
Minimal attachment parsing, immediate syntactic processing: The act of
constructing a parse tree by immediately inserting every new word from the
input left-to-right. A constituent is thus completed as soon as possible
and no new constituents are added beyond necessity. This leads to
difficulties with sentences like "Sam loaded the boxes on
the cart onto the van", where "on the cart" first seems to complete the
verb phrase but then is a complement to "the boxes".
Immediate semantic processing: The act of anticipating sentence
constituents by semantic properties of the preceding constituents.
This accounts for the anticipation of "are" in the sentence
"When you live near a runway, landing planes ..." and for the
anticipation of "is" in the sentence "If you've been trained as a pilot,
landing planes ...".
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Semantics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Syntax: Rules on the arrangement of symbols.
Semantics: The study of meaning.
Computation: Purely syntactic manipulation of symbols. Semantic
properties of the symbols are external to computation. Syntactic
symbol manipulation is semantically invariant, i.e. if a string
of symbols is transduced to another, the meaning should have
remained the same.
Analyzing a sentence:
* Parsing it
~~~~> yields a parse tree (syntactic constituent structure)
* doing a semantic analysis
~~~~> yields a semantic representation
(an unambiguous logical structure in a canonical form
which can serve for drawing inferences)
world: A constellation of entities with attributes.
Example: Our universe is a world. But we may also imagine a world
where Peter Bosch is not a lecturer but the German Chancellor.
necessary: true in all possible worlds.
Example: Sentences which are true by definition, mathematical statements
contingent: not necessarily true.
Example: "I am studying Cognitive Science" -- I could also study
something else, say "Agrarpsychologie"
Declarative sentence: A sentence of the form NP VP.
Truth: A relation between a FOPC formula without variables and a world.
Example: likes_ice_cream(Fabian) is true in our world.
Proposition, FOPC sentence: A FOPC formula without variables.
A proposition can be seen as a binary function which, given a world
w, returns 1 if the proposition would be true in this world or 0 if
the proposition would be false in this world. Since binary functions
can be seen as sets, a proposition is the set of those worlds in
which the proposition would return 1.
Example: sleeps(Peter) = the set of all worlds in which
Peter is sleeping
= { w : Peter sleeps in world w }
Sense, content: The content of a declarative sentence is its proposition.
Example: content("Peter is jogging") = jogging(Peter)
= the set of all worlds in which
Peter is jogging
= { w : Peter is jogging in w }
Content of a lexical item: The content of a lexical item is the concept
it refers to. We will see lots of ways how to formalize these concepts.
Rule-to-Rule-hypothesis, Rule-by-Rule-hypothesis: The assumption that
each syntactic rule has a corresponding rule of semantic composition.
Indexicality: The phenomenon that the truth of a sentence depends on the
circumstances of the utterance (e.g. by involving pronouns, by
involving relative time expressions or by being in a specific place).
Meaning of a sentence: A function which, given information on the
circumstances of the utterance, returns a proposition. Meaning thus
yields a canonical form of a sentence; it erases indexicality. Meaning
is a level which intermediates between the linguistic expression of a
sentence and the knowledge representation in form of propositions. Since
we will talk of sentences with no indexicality, meaning becomes a
constant function and thus equal to content.
Example: Assume M[r] to be the meaning of "It's raining". Then
we may apply M[r] to certain circumstances:
M[r](Osnabrueck,2002-06-01,2:35)
= eventInAt(rain,Osnabrueck, 01-06-2002_2:35)
Example: Assume M[c] to be the meaning of "All students are clever".
This is a constant function which always returns the same
result on no matter which circumstances:
M[c](Bonn,2002-07-09,2:30) = All x: student(x) => clever(x)
M[c](Wien,1999-02-03,1:02) = All x: student(x) => clever(x)
~~~~~~> M[c] = All x: student(x) => clever(x)
~~~~~~> M[c] = {w : all students are clever in w }
~~~~~~> M[c] = content("All students are clever")
Meaning becomes equal to the proposition.
Annotation: In the following text, meaning will be treated as equal to
the proposition.
Notation: || "All students are clever" || is the meaning of
this sentence
Calculation: The meaning of a sentence can be calculated by applying the
meaning of its verb phrase to its Noun Phrase:
|| S || = || VP || ( || NP || )
Meaning of an intransitive verb: A function mapping the subject of the verb
to a proposition.
Example: The function (LAMBDA x sleeps(x)) maps "Peter" to the set of
all worlds where Peter sleeps:
(LAMBDA x sleeps(x))(Peter) = sleeps(Peter)
= { w: Peter sleeps in w }
Annotation: Note that we still cannot tell what "to sleep" really means.
We have just found a way to express that "to sleep" is a concept which
is "not yet complete": It needs a noun given to become a proper
meaningful sentence (i.e. a proposition). Just like "square root" needs
an argument to become a proper value. This is what we express by the
argument following "LAMBDA".
Meaning of a transitive verb: A function mapping the object of the
verb to the meaning of an intransitive verb.
Example: In "Peter kicks Franzerl", "kicks" is a function which returns
a function:
(LAMBDA y (LAMBDA x (kicks(x,y))))
~~~~~~~~~~~~~~~~~~~~~~~ return value: a function
Here, it is applied to the object of the verb, "Franzerl":
(LAMBDA y (LAMBDA x (kicks(x,y ))))(Franzerl)
= (LAMBDA x (kicks(x,Franzerl)))
It yields the meaning of an intransitive verb, namely a function
from entities x to propositions. This intransitive verb can be
seen as "kicking_franzerl". It is now applied to "Peter", the
subject of this intransitive verb:
(LAMBDA x (kicks(x, Franzerl)))(Peter)
= kicks(Peter,Franzerl)
= { w : Peter kicks Franzerl in w }
If we have a dictionary which gives us the meanings of single
words as functions (such as the above (LAMBDA y (LAMBDA x...))),
we can get a first order predicate calculus formula for a
sentence by applying the meanings of the constituents to each
other.
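This composition can be imitated with Python lambdas, modelling a proposition simply as an FOPC string (a sketch; a real system would build logical terms, not strings):

```python
# The transitive verb "kicks" is a function returning a function,
# exactly as in (LAMBDA y (LAMBDA x (kicks(x,y)))).
kicks = lambda y: (lambda x: "kicks(%s,%s)" % (x, y))

kicks_franzerl = kicks("Franzerl")      # meaning of the VP "kicks Franzerl"
proposition = kicks_franzerl("Peter")   # apply the VP meaning to the subject
# proposition == "kicks(Peter,Franzerl)"
```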
Meaning of a proper name: The entity this proper name refers to.
Quantifier subject: A subject which does not denote a named entity.
Examples: "Nobody", "Everybody", "Somebody"
Meaning of a quantifier subject: The meaning of a quantifier subject
is a function which maps a verb (i.e. a function) to a proposition.
Example: In "Nobody sleeps", "nobody" is a function which needs
another function as its argument:
LAMBDA P (~ Ex. x: person(x) & P(x))
To calculate the meaning of "Nobody sleeps", we use
"sleeps" as an argument for "nobody":
(LAMBDA P (~ Ex. x: person(x) & P(x)))(sleeps)
= ~ Ex. x: person(x) & sleeps(x)
= { w : Nobody sleeps in w }
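The same trick works for the quantifier subject, a function taking a verb meaning (itself a function) and returning a proposition, again modelled as an FOPC string:

```python
# "nobody" needs another function P as its argument, as in
# LAMBDA P (~ Ex. x: person(x) & P(x)).
nobody = lambda P: "~Ex x: person(x) & " + P("x")
sleeps = lambda x: "sleeps(%s)" % x

formula = nobody(sleeps)
# formula == "~Ex x: person(x) & sleeps(x)"
```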
Events in First Order Predicate Calculus: Events can be expressed as
entities, which have the action of the sentence as a property and
which stand in a "patient" relation to the object of the sentence
and in an "agent" relation to its subject:
"Peter sees Anke"
~~~> Ex. e: seeing(e) & agent(e,Peter) & patient(e,Anke)
Meaning relation: A semantic relation between words.
Example: A "mother" is a "female parent"
Meaning postulate: A constraint on a meaning relation.
Meaning postulate for a noun: A logical formula of the form
Necessarily All x: W(x) <=> P(x) & Q(x) & ...
where W is a noun and P, Q, ... are properties. Several
variables may occur.
Example:
Necessarily All x, y: mother(x,y) <=> parent(x,y) & female(x).
Meaning postulate for a verb: A logical formula of the form
Necessarily All x1,x2,...: V(x1,x2,...) <=> <formula>
where V is a verb and the <formula> may contain the usual
logical operators and also the special predicates
CAUSE(cause,effect) and
BECOME(state)
Example:
Necessarily All x,y: kill(x,y) <=> CAUSE(x,BECOME(dead(y)))
Meaning representation: A logical formula which fixes the meaning
of a word. We use the lambda-calculus in combination with first order
predicate calculus.
Example: The meaning of the transitive verb "to open" is the concept
it refers to, which can be expressed as follows:
LAMBDA obj (LAMBDA actor
(Ex. e : opening(e) & theme(e,obj) & agent(e,actor)
) )
This means: "to open" is a function which needs two arguments given:
An actor and an object. Saying "A opens B" is equivalent to stating:
There is an event e, which has the property of being an "opening"-event,
which stands in a "theme"-relation to the object and in an "agent"-
relation to the actor. Meaning representations allow for inference
drawing with the help of a knowledge base.
Meaning postulates and ambiguities: Meaning postulates must
take into account semantic ambiguity of words. Implications can be used
to differentiate between two meanings: Certain properties of the argument
of the word imply certain meanings.
Example:
A meaning postulate for the adjective "open" when applied to a container
(e.g. a bottle):
All x Ex y : ((container(x) & open(x) & content(x,y)) => accessible(y))
A meaning postulate for the adjective "open" when applied to a shop:
All x Ex y : ((shop(x) & open(x) & goods(x,y)) => forSale(y))
// In the slides, meaning postulates and meaning representations seem
// to mix. (?)
Understanding: Understanding a sentence means to know what is the case
when the sentence is true. (Ludwig Wittgenstein)
The truth of a sentence here depends on
* its meaning
* indexicality
* contingency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Discourse analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Discourse: A linguistic unit larger than a sentence.
Examples: A conversation or a story.
Segment: An utterance.
Computational discourse analysis: The modelling of the knowledge processes
taking place in the course of understanding a discourse.
Center, Discourse entity: An entity a linguistic expression (mostly a
noun phrase) in a discourse refers to.
Example: In "Bob eats bananas", Bob and bananas would be discourse
entities.
Center theory: A framework for modelling attention to discourse entities.
It states constraints on the discourse entities pronouns can refer to
and declares preferred shift-of-attention patterns.
It assumes that discourses consist of separate utterances (mostly
sentences).
Forward-looking center set: An ordered set of all discourse entities of
an utterance. The set is ranked according to some empirically determined
metric which measures the "importance" of a discourse entity in an
utterance. The usual metric is Subject > Object > others.
Preferred center: The first entry of the forward-looking center set.
The preferred center is thus the (grammatically) most important entity
in the current utterance -- whereas this does not necessarily mean that
it is the topic of the current part of discourse. But the preferred
center is likely to predict the backward looking center of the next
utterance.
Example: In "Bob sits on a banana", the forward-looking center set
would be Cf = (Bob, banana) and the preferred center would be Cp = Bob.
Backward looking center: The highest-ranking entry of the forward-looking
center set of the last utterance, which occurs in the current utterance.
(If there is no such center, the backward looking center remains
undefined.) The backward looking center is thus the entity which connects
the current and the last utterance. It can be seen as the topic of the
current part of the discourse, since it "survives" two utterances.
Example: "Bob hits Tim" Cf = (Bob, Tim)
"That's why Tom likes Bob" Cf = (Tom, Bob), Cb = Bob
Pronoun rule: The following rule, as stated by the Center Theory: "If any
pronoun occurs in an utterance and refers to a discourse entity, one of
the pronouns in the sentence refers to the backward looking center."
Example: "Bob hits Tim with a banana" Cf = (Bob, Tim, banana)
"Bob likes hitting" Cf = ( Bob ), Cb = Bob
OK, no pronouns
"He likes hitting" Cf = ( Bob ), Cb = Bob
OK, 1 pronoun, refers to Cb
"He (Bob) likes it" Cf = ( Bob, hitting ), Cb = Bob
OK, 2 pronouns, one refers to Cb
"He (Tim) likes Bob" Cf = ( Tim, Bob ), Cb = Bob
Not OK, 1 pronoun, not referring to Cb
Center transition: The process of changing or keeping the backward
looking center and the preferred center in the course of the discourse.
Continue transition: A center transition, keeping the same backward looking
center and also having the backward looking center as preferred center.
This transition corresponds to cases where the speaker has been
talking about a particular entity (a "hero") and indicates an intention
to continue talking about that entity.
Preferred centers: X. Y. A.
Backward centers: ^--A-- ^--A--
Retain transition: A center transition, keeping the same backward looking
center but introducing a new preferred center. Retain seems to correspond
to a situation where the speaker is intending to shift onto a new entity
in the next utterance and is signaling this by realizing the current
topic (the backward looking center) in a lower-ranked position on the
forward-looking center set.
Preferred centers: X. Y. B.
Backward centers: ^--A-- ^--A--
Smooth shift: A center transition, having changed the backward looking center,
but keeping the current backward looking center as preferred center. A Smooth
shift tends to occur when the speaker has started talking about a new entity
and is doing so in a way that indicates that s/he will continue talking about
that entity.
Preferred centers: X. Y. A.
Backward centers: ^--B-- ^--A--
Rough shift: A center transition, having changed the backward looking
center and having another preferred center than the backward looking center. A
Rough shift tends to occur when the speaker has started talking about a new
entity but is only doing so momentarily.
Preferred centers: X. Y. A.
Backward centers: ^--B-- ^--C--
Centering theory and Center transitions: The centering theory states that
center transitions are ranked by preference as follows:
1. Continue transition (Cb = last Cb, Cp = Cb)
2. Retain transition (Cb = last Cb, Cp != Cb, --> announce new topic)
3. Smooth shift (Cb != last Cb, Cp = Cb, --> topic changed, is ctd.)
4. Rough shift (Cb != last Cb, Cp != Cb, --> topic changed, not ctd.)
This ranking reflects the hearer's inference load, which is minimal in continue
transitions. When pronouns are ambiguous, the reading assuming the easiest
transition is chosen.
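The four transitions follow directly from the two tests above (a sketch over plain entity names; the labels are the lecture's):

```python
def transition(cb_prev, cb, cp):
    """Classify a center transition from the previous backward-looking
    center cb_prev, the current one cb, and the preferred center cp."""
    if cb == cb_prev:                        # topic kept
        return "continue" if cp == cb else "retain"
    return "smooth shift" if cp == cb else "rough shift"
```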
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Word Sense Disambiguation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Document: A text.
Word Sense Disambiguation: The task of finding out which meaning an
ambiguous word has in a context. In CL, this is often done by examining
the words in the context, following the idea of "Meaning is use".
Collocational feature: A syntactic or morphological cue for a word.
Collocational features of a word in a text may for instance include
immediately preceding or following words, PoS's of these words or the
root of the word.
Context window: The context window of a word in a text is a set of
a fixed number of words to the left and to the right of the word, optionally
together with their properties. Properties may for instance be
the pos's of the words. Context windows span sentence boundaries, but not
document boundaries.
Example: A context window of the word "span" in the above sentence
is {the,words,Context,windows,sentence,boundaries,but,not}
// In the lecture, the term "context vector" is also used to refer to
// "context window"
Distribution: The distribution of a word is the union of all its context
windows.
Association set: An association set of a word w is an ordered set of words
(of open word classes) which tend to occur with one meaning of w.
Example: An association set of "bank" in the meaning of "financial
institution" could be {money, account, debt, skyscraper}.
// In the lecture, this set has no name
High-frequency set: A high-frequency set of a word w is an ordered set of
all words (of open word classes) which tend to occur with w in whatever
meaning. The high-frequency set is thus the union of all association sets.
Example: For "bank", this could be {money, account, debt, skyscraper,
river, water, sand}
// In the lecture, this set has no name
Feature set: A feature set for a word w is an ordered set of words whose
occurrence is likely to give a hint on the sense of w. Often, the
association set of w or the high-frequency set of w is used as a feature
set for w.
Co-occurrence feature set: The ordered set given by the intersection of the
feature set and the context window. The co-occurrence feature set thus
filters the context window for relevant words.
Co-occurrence feature vector, Co-occurrence vector, Context vector: The
binary vector which identifies the co-occurrence feature set as a subset
of the feature set.
Example: Word: "bank" in the meaning of "financial institution"
Sentence: "I went to the bank to get money from my account"
Context window: {I,went,to,the, to,get,money,from,my,account}
Feature set: {money, account, debt, skyscraper, river, water, sand}
Co-occ. feat. set: {money }
context vector: ( 1, 0, 0, 0, 0, 0, 0)
// Do all of the above terms really mean the same (?)
// In the lecture, "context vector" is also used for "context window"
Most probable meaning: The most likely meaning of a word w, given a
particular context. If S is the set of all possible meanings of the word,
the most probable meaning can be estimated as follows, given a context
vector V:
argmax (s out of S) P(s|V),
i.e. the meaning which is most probable, given the context V. To
calculate P(s|V), we use Bayes' rule and get:
argmax (s out of S) P(V|s)*P(s)/P(V)
Since P(V) is constant, it won't change the result of argmax:
argmax (s out of S) P(V|s)*P(s)
P(V|s) can be split up to the dimensions of V:
argmax (s out of S) P( v[1], v[2], ... | s ) * P(s)
Supposing that the dimensions of the vector V are independent of
each other, we may rewrite this as:
argmax (s out of S) P(v[1]|s) * P(v[2]|s) * ... * P(s)
or, written with the product symbol:
argmax (s out of S) ( PROD P(v[i]|s) ) * P(s)
Due to the assumption that the dimensions of v are independent from each
other, the above expression is also called a "Naive Bayes classifier".
To evaluate this formula, the absolute probabilities for each meaning of w
need to be known. They can be calculated by simply counting the overall
occurrences of w with meaning s. P(v[i]|s) can also be calculated
from corpus data: It is the number of times where the word v[i] occurs
with w in sense s, divided by the total number of occurrences of w in
sense s.
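A sketch of such a Naive Bayes classifier over binary context vectors. The data layout (sense labels paired with binary tuples over a fixed feature set) is an assumption, and zero counts are not smoothed:

```python
from collections import Counter

def train(samples):
    """samples: (sense, context_vector) pairs; returns the priors P(s)
    and the conditional probabilities P(v[i]=1|s), counted from data."""
    counts, cond = Counter(), {}
    for sense, v in samples:
        counts[sense] += 1
        c = cond.setdefault(sense, [0] * len(v))
        for i, bit in enumerate(v):
            c[i] += bit
    total = sum(counts.values())
    priors = {s: n / total for s, n in counts.items()}
    cond = {s: [x / counts[s] for x in c] for s, c in cond.items()}
    return priors, cond

def classify(priors, cond, v):
    """argmax over senses of P(s) * PROD P(v[i]|s); a 0-bit
    contributes 1 - P(v[i]=1|s)."""
    def score(s):
        p = priors[s]
        for i, bit in enumerate(v):
            p *= cond[s][i] if bit else 1 - cond[s][i]
        return p
    return max(priors, key=score)
```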
Training corpus: A corpus used to calculate the values for a Naive
Bayes classifier.
Test corpus: A corpus used to evaluate the performance of a Naive Bayes
classifier, which was calculated from the training corpus. Ideally,
training corpus and test corpus are different portions of the same
corpus.
Document space: The cartesian product of a set of words and a set of
documents, represented by a document matrix.
Document matrix: A matrix whose columns represent different words and
whose rows represent different documents. A cell contains a '1' if the
corresponding word appears in the corresponding document, else a '0'.
Document vector: A row of a document matrix. A document vector thus
represents an ordered set of all words occurring in this document as a
subset to the ordered set of all words in the matrix.
Word vector: A column of a document matrix. A word vector thus represents
an ordered set of all those documents in which the word occurs as a
subset to the ordered set of all documents in the matrix.
Word space: The cartesian product of a set of words with itself,
represented by a word matrix.
Word matrix: A square matrix whose columns and rows both represent the
same words. A cell contains the number of times one word co-occurs with
the other in a particular part of a document.
Word matrices are useful for showing topical or thematic relatedness of
words.
Modifier space: The cartesian product of a set of nouns and a set of
adjectives, represented by a modifier matrix.
Modifier matrix: A matrix whose rows represent adjectives and whose
columns represent nouns. A cell contains a '1', if a noun is modified
by (used with) the corresponding adjective in a particular document.
The modifier matrix thus takes syntactic structure into account, while
word and document spaces only count bare co-occurrences. Finer-grained
modifier spaces show similarity between properties of concepts.
Modifier vector: A column of a modifier matrix. A modifier vector thus
represents an ordered set of adjectives used with a specific noun as a
subset to the ordered set of all adjectives in the matrix.
// The naming is a little bit incoherent here:
// Document vectors consist of words
// Word vectors consist of documents
// Modifier vectors consist of modifiers
Set-theoretic similarity measure: A function (i.e. a formula) which gets
two vectors and returns a real value which ranks their similarity. To
calculate the value, both vectors are interpreted as sets and set
operations are used.
Matching coefficient: A set-theoretic similarity measure whose value
for two binary vectors X and Y is: | X /\ Y |
This is just the number of '1'-entries which occur in the same place
in both vectors, i.e. simply the dot product of X and Y, the number of
elements which both sets share.
Dice coefficient: A set-theoretic similarity measure whose value
for two binary vectors X and Y is: 2*| X /\ Y | / ( |X| + |Y| )
This is just the number of '1'-entries, which occur in the same place
in both vectors, normalized by dividing by the total number of 1-entries,
i.e. the number of elements shared by both sets divided by the total
number of elements. The range of the dice-coefficient is 0 to 1.
Jaccard coefficient: A set-theoretic similarity measure whose value
for two binary vectors X and Y is: | X /\ Y | / | X \/ Y |
The Jaccard coefficient is similar to the dice coefficient.
Overlap coefficient: A set-theoretic similarity measure whose value
for two binary vectors X and Y is: | X /\ Y | / min(|X|,|Y|)
This coefficient yields 1 if X is a subset of Y or Y is a subset of X.
Cosine coefficient: A set-theoretic similarity measure whose value
for two binary vectors X and Y is: | X /\ Y | / SQRT(|X|*|Y|).
The cosine coefficient is similar to the dice coefficient.
Normalized correlation coefficient: A value for two (also non-binary)
vectors x and y, which is calculated as follows:
cos(x,y) = x * y / |x| / |y|
= (x[1]*y[1] + x[2] * y[2] + ... x[n]*y[n])
/
SQRT(x[1]^2 + x[2]^2 + ... x[n]^2)
/
SQRT(y[1]^2 + y[2]^2 + ... y[n]^2)
This value corresponds to the angle between the two vectors, if they are
interpreted as points in a coordinate system. cos(x,y) may thus also be
used as a similarity measure, which is independent of the length of the
vectors.
Euclidean distance: The Euclidean distance between two vectors is the length
of the difference of the two vectors. It may be calculated as:
|X-Y| = SQRT( (x-y)*(x-y) )
= SQRT( (x[1]-y[1])^2 + ... (x[n]-y[n])^2 )
The Euclidean distance is the distance between the two points indicated by
the vectors in a coordinate system. It may thus also be used as a similarity
measure; for vectors of equal length it yields the same ranking as cos(x,y).
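The measures above can be sketched in Python; the set-based measures
represent a binary vector as the set of positions (or words) where it
has a '1'. The function names and the toy data are my own:

```python
import math

# Set-based measures: a binary vector is represented as the set of
# positions (or words) where the vector has a '1'.
def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))       # 2*|X /\ Y| / (|X|+|Y|)

def jaccard(x, y):
    return len(x & y) / len(x | y)                  # |X /\ Y| / |X \/ Y|

def overlap(x, y):
    return len(x & y) / min(len(x), len(y))         # |X /\ Y| / min(|X|,|Y|)

# Vector-based measures for (also non-binary) vectors.
def cos_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x)) / math.sqrt(sum(b * b for b in y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

X = {"saw", "the", "man"}
Y = {"saw", "the", "dog"}
print(dice(X, Y))                    # 2*2/(3+3) = 0.666...
print(jaccard(X, Y))                 # 2/4 = 0.5
print(overlap({"a"}, {"a", "b"}))    # subset -> 1.0
print(euclidean([0, 0], [3, 4]))     # 3-4-5 triangle -> 5.0
```

Note that the set operators &, | directly mirror the /\ and \/ notation
used in the definitions above.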
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Machine Translation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Translation: A transfer of a text from a source language to a target
language which preserves meaning.
World knowledge model: Data on reality.
MT, Machine translation: A translation done by a computer. It involves
* Source language analysis
* Target language generation
* World knowledge modelling
FAHQMT, Fully-Automatic High-Quality Machine Translation: A translation
which is done by a computer and whose quality equals that of a human
translation. To date, no machine translation achieves this without
human intervention or assistance.
CAT, Computer aided translation: A translation which is done by a
computer and supported by a human or vice versa.
Bilingual translation: A translation between two languages.
Multilingual translation: A translation between more than two languages.
Direct translation: Basically translating a sentence word by word. This
consists of
* morphological analysis of the source language sentence
* look-up in a dictionary
* local reordering
* Outputting the result in the target language
Adding a new language means writing a new system. Direct translation
involves only very little effort in analysis and generation.
Transfer translation: Using an analysis module and a generation module for
each language and a transfer module for each pair of languages. Adding a
new language means adding transfer modules from and to all existing
languages. Transfer translation involves more effort in analysis and
generation than direct translation.
Interlingua translation: Translating first to an abstract universal
language and then to the target language. As a consequence, there are
only analysis and generation modules necessary for each language. Adding
a new language thus means adding two more modules. Interlingua
translation involves the highest effort in analysis and generation.
Example: For a small demo program, go here:
http://www.mpi-inf.mpg.de/~suchanek/personal/txt_e.htm#summaries
You may translate the sentence "x sees y", where x and y
may be pronouns, "the tree" or "the big tree" and the sentence
may also be negated. Works with German, French, Italian and
English.
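The module counts implied by the three architectures can be made
explicit: for n languages, a transfer system needs n analysis modules,
n generation modules and one transfer module per ordered language pair,
while an interlingua system needs only the analysis and generation
modules. A quick sketch (function names are my own):

```python
def transfer_modules(n):
    # n analysis + n generation + n*(n-1) transfer modules
    # (one per ordered pair of languages)
    return 2 * n + n * (n - 1)

def interlingua_modules(n):
    # only one analysis and one generation module per language
    return 2 * n

for n in (2, 5, 10):
    print(n, transfer_modules(n), interlingua_modules(n))
# for n=10: 110 modules with transfer, but only 20 with interlingua
```

This is why adding a language to a transfer system becomes ever more
expensive, while for an interlingua system the cost stays constant at
two modules.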
Example-based translation: A direct translation which works by looking up
sequences of words of the source language sentence in a translation
corpus. The corpus is bilingual and is aligned on word, phrase and
sentence level.
Statistical machine translation: A direct translation which calculates
the most probable target language sentence from an aligned bilingual
corpus.
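Statistical machine translation is commonly formulated as a
noisy-channel search: choose the target sentence t which maximizes
P(t) * P(s|t), where the language model P(t) and the translation model
P(s|t) are estimated from the aligned corpus. A toy sketch of the
decoding step; the probability tables below are invented for
illustration only:

```python
# Toy noisy-channel decoder: argmax over candidate target sentences
# of P(t) * P(s|t). All probabilities are made up for illustration.
lm = {"I have seen him": 0.4,        # target language model P(t)
      "I have him seen": 0.01}
tm = {("Ich habe ihn gesehen", "I have seen him"): 0.3,   # P(s|t)
      ("Ich habe ihn gesehen", "I have him seen"): 0.5}

def translate(source, candidates):
    return max(candidates, key=lambda t: lm[t] * tm[(source, t)])

print(translate("Ich habe ihn gesehen",
                ["I have seen him", "I have him seen"]))
# 0.4*0.3 = 0.12 beats 0.01*0.5 = 0.005 -> "I have seen him"
```

The language model thus filters out word-by-word translations which are
improbable in the target language, even if the translation model
prefers them.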
Predicate: Verb phrase.
Null-Subject: The word which can be seen as the subject of a sentence
but does not occur in the sentence.
Example: "Cosa avete fatto?"
"Was habt [ihr] gemacht?"
("What have [you] done?")
Passive voice: The property of a sentence of containing certain "passive
verb" constructions (which, in English, are "to be + participle").
// A semantic definition of "passive voice" will probably fail
Active voice: The property of a sentence of not having passive voice.
Lexical ambiguity: The ambiguity resulting from an unclear meaning of
a word.
Structural ambiguity, syntactical ambiguity: The ambiguity resulting
from an unclear grammatical structure of a sentence.
Real structural ambiguity, global structural ambiguity: The ambiguity
resulting from an unclear grammatical structure of the whole sentence.
Example: "Flying planes can be dangerous"
can be "It can be dangerous to fly planes" or "Planes which
are flying can be dangerous"
Accidental structural ambiguity, local structural ambiguity: The
ambiguity resulting from an unclear structure of a sequence of words,
which is usually solved when reading on.
Example: "The mathematics students sat their examinations"
"The mathematics students study today is very complex"
Homonym: A word which has different, unrelated meanings.
Example: "bank"
Polyseme: A word which has different, though related meanings.
Example: "work" can either be a verb or a noun
Anaphora: The use of a pronoun to refer to a previously mentioned
entity.
Translation problem: A phenomenon which causes difficulties for
translations. Translation problems include
* Divergences
* Mismatches (Lexical ambiguity problems)
* Structural ambiguity problems
Divergence: The phenomenon that a predicate is realized by different
syntactical structures in two languages. Divergence includes
* Lexical divergence
* Non-lexical divergence
Lexical divergence: Divergence resulting from the difference
between the verb itself in one language and in the other
language. Lexical divergence includes:
* Structural divergence
* Categorial divergence
* Thematic divergence
* Head switching
* one to multi-word translation
* Support verb construction divergence
Structural divergence: The lexical divergence of a predicate
being translated to a different pattern of constituents.
Example: "Wir treffen uns am Mittwoch"
"We meet --- on Wednesday"
Categorial divergence: The lexical divergence of a word being
translated to a word with another POS.
Example: "Ich habe Hunger/N"
"I am hungry/ADJ"
Thematic divergence: The change of subject in a translation.
Example: "Mir faellt der Termin/SUBJECT ein"
"I/SUBJECT remember the date"
Head switching: The lexical divergence of a verb v being
translated to a verb phrase where v occurs as a complement.
Example: "Ich esse gern"
"I like to eat"
One to multi-word translation: The divergence of the verb
being translated to multiple words.
Example: "Lassen Sie uns das festhalten"
"Let us make a note of it"
Support verb construction divergence: (?)
Solutions to lexical divergence: Lexical Divergence may be dealt
with by encoding the appropriate information in the lexical
entry.
Non-lexical divergence: The divergence not resulting from the
verb itself. Non-lexical divergence includes:
* Constituent order divergence
* Preposition stranding
* Null-Subject divergence
* Dative divergence
* Voice divergence
Constituent order divergence: The non-lexical divergence of
constituents being placed in another order in a translation
Example: "Ich habe ihn gesehen"
"I have seen him"
Preposition stranding: Constituent order divergence with a
preposition. (?)
Example: "What store did John go to?"
v-----------------------´
"In welchen Laden ging John?"
Null-Subject divergence: The non-lexical divergence resulting
from a Null-Subject which is explicit in the other language.
Example: " Ho visto il libro"
"I saw the book"
Dative divergence: The non-lexical divergence resulting from
the occurrence of a dative construction instead of a
verb+particle construction.
Example: "I gave the book to him"
"Ich gab ihm das Buch"
Voice divergence: The non-lexical divergence consisting in a
switch from active voice in one language to passive voice in
another language.
Example: "Es wurde getanzt"
"There was dancing"
Solutions to non-lexical divergence: Non-lexical Divergence may
be dealt with by providing appropriate grammar rules.
Mismatch: The phenomenon that the meaning of a word in one language
cannot be expressed adequately in another language. This is due to
lexical ambiguity in a wider sense. Mismatches include
* Lexical gaps
* Homonymy mismatches
* Polysemy mismatches
Lexical gap: The mismatch which occurs when a word with a general
meaning has to be translated to a language which has different
words for the specific meanings. "General meaning" and "specific
meaning" are here to be interpreted in a set-theoretic sense
as an inclusion of the extensions of the words.
Example: "I go home"
"Ich (fahre|gehe) nach Hause"
Homonymy mismatch: The mismatch resulting from a homonymy of a
word.
Example: "ausmachen" can mean "to arrange" or "to switch off"
Polysemy mismatch: The mismatch resulting from a polysemy of a
word.
Example: "window" can be "Fenster" or "Schalter"
// What is the difference to lexical gap here (?)
Solutions to mismatches: Mismatches may be dealt with by examining
the local syntactic context, the local semantic context or the
global context (anaphora resolution). Word sense disambiguation
might be needed.
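Resolving a mismatch via the local context can be sketched as a simple
rule: choose the translation whose typical context words overlap most
with the words surrounding the ambiguous word. The context word lists
below are invented for illustration:

```python
# Disambiguate the German translation of "window" ("Fenster" vs.
# "Schalter") by context-word overlap. All word sets are toy data.
contexts = {"Fenster":  {"open", "glass", "house", "close"},
            "Schalter": {"ticket", "counter", "queue", "clerk"}}

def disambiguate(translation_contexts, sentence_words):
    # Pick the translation with the largest context overlap
    return max(translation_contexts,
               key=lambda t: len(translation_contexts[t] & sentence_words))

sentence = {"please", "open", "the", "window", "of", "the", "house"}
print(disambiguate(contexts, sentence))  # overlap: Fenster=2, Schalter=0
```

Real systems would use far richer contexts (and probabilities instead
of raw overlap counts), but the principle is the same: the mismatch is
resolved by the words around the ambiguous item, not by the item alone.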
Structural ambiguity problem: The problem of choosing a meaning
in order to translate a sentence with structural ambiguity. This
includes
* Global structural ambiguity problems
* Local structural ambiguity problems
Solutions to structural ambiguity problems: These problems may be
dealt with by linguistic knowledge (in order to resolve local
structural ambiguities), contextual knowledge or world knowledge
(in order to resolve global structural ambiguities).