Computational Linguistics Summary (c) 2002-07-06 Fabian M. Suchanek http://www.mpi-inf.mpg.de/~suchanek/personal/texts/summaries/cl.txt This is a summary of the lecture "Computational Linguistics" held by Prof. Peter Bosch and Dr. Sabine Reinhard at the University of Osnabrueck in SS 2002. In spite of claiming to be a formal discipline, Computational Linguistics (as presented here) often seems to lack clear and consistent definitions... By reading the following text, you accept that the author does not accept any responsibility for the correctness or completeness of this text. If you have any corrections or remarks, please send me a mail. This is the only way to make the publication of this summary useful for me, too. My e-mail address is f.m.suchanek@zweb.de, but the letter 'z' has to be removed from the address. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Intro ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ See also lingu.txt for linguistic terms See also infod.txt for information on theoretical languages See also ai.txt for trees and First Order Predicate calculus See also statistik.txt for probabilities See also nn.txt for detailed information on vectors See also algebra.txt for mathematical backgrounds Set: An unordered collection of different elements. S = {s1,...sn} S = { a : p(x) } is the set of all those x for which p(x) holds The number of elements of a set S is referred to as |S|. ordered set: An ordered collection of different elements. // Contrary to a set, this collection is ordered // We need this expression in order to correctly identify sets with // vectors Size: The size of a set is the number of its elements. |{s1,s2,...sn}| = n Intersection: The intersection of two sets is the set of those elements which occur in both of them. Notation: X /\ Y Union: The union of two sets is the set of all of their elements. Notation: X \/ Y Difference: The difference of two sets is the set of those elements which occur in the first one, but not in the second one. Notation: X \ Y Tupel: An ordered collection of elements. Example: t=(1,'A',abc) is a tupel The entries are referred to as t[1]...t[n] // Contrary to an ordered set, this collection may contain // the same value in two positions List, Sequence: An ordered collection of elements, combined with the idea of a ranking. Graph: A structure consisting of dots ("nodes") connected by lines ("arcs"). Character, char: A symbol as listed in the Unicode beginning at 32. String: A possibly empty sequence of symbols. Concatenation: The concatenation of two strings is the string consisting of all symbols of the first one followed by all symbols of the second one. Substring: A substring of a string is a sequence of symbols occurring in it. epsilon, eps: A special symbol with the property that a string containing epsilon is considered the same as this string without epsilon. // Take care, in InfoD, epsilon is used for the empty string! Probability: The likelyhood of an event A, written as P(A). P(A) can be estimated by dividing the number of situations in which A happened in the past by the total number of situations. Example: P(it's raining today) = Number of days it rained last year / All days last year = 351 / 356 = 0.985 = 98.5 % Combined probability: The probability of two independent events A and B to happen at the same time, written as P(A & B). P(A & B) can be calculated by multiplying the single probabilities: P(A & B) = P(A) * P(B) Conditional probability: The probability of an event A, given that an event B happened before, written as P(A | B). P(A | B) = P(A & B) / P(B) Bayes's theorem: P(B|A) = P(A|B) * P(B) / P(A). This formula allows us to calculate the probability of "B if A" by knowing the probability of "A if B". Function: A mapping from elements of one set X to elements of another set Y. Each element of X is mapped to exactly one element of Y. Notation: f:X->Y f(x)=y f maps x of X to y of Y f = (LAMBDA x f(x)) another way of writing it (LAMBDA x f(x))(a) = f(a) x is called the "argument" of function f. Example: double := (LAMBDA x (x*2)) define a function double double(12) apply it to 12 = double (12) = (LAMBDA x (x*2))(12) = 12*2 = 24 "LAMBDA" is used to make clear that we speak of the function itself. Saying "(LAMBDA x square_root(x))" is like talking about "Square root" without specifying an argument for it, i.e. without talking of a specific result or value. Please note that the elements of the sets X or Y may also be functions! Constant function: A function which always maps to the same element. In this case, the argument may be omitted, since it does not play a role. Example: f(x)=3 for all x ~~~~> f = 3. Binary function: A function mapping to the set of 0 and 1. A binary function f:X->{0,1} can be seen as a set which contains all those elements of X, which are mapped to 1. Example: odd(x)= "1 if x is odd, 0 else" odd(17)=1 odd(22)=0 ~~~> odd = {1,3,5,7,... } max: A function which, given a set of real numbers, returns the greatest among them. Example: max {1,2,3,1024} = 1024 min: A function which, given a set of real numbers, returns the smallest among them. Example: min {1,2,3,1024} = 1 argmax: A function which, given a function and an argument variable, returns the value of the variable with the greatest function value. Example: argmax x -(x-3)^2 = 3, because -(x-3)^2 is greatest for x=3 NL: Natural language. Matrix: A tupel of tupels of the same length. M = ( (v[1][1], v[1][2], ... v[1][n]), ... (v[m][1], v[m][2], ... v[m][n])) Pair: A tupel with two elements. Cartesian Product: The cartesian product of two sets A and B is the set of all possible pairs, whose first element is in A and whose second element is in B. Example: {1,2} x {a,b,c} = { (1,a), (2,a), (1,b), (2,b), (1,c), (2,c) } Relation: A subset of a cartesian product of two sets. Instead of "(a,b) is an element of the relation r", one also writes "r(a,b)" or "a r b" or "(a,b) is in r". Example: The relation ">" is the set of all pairs of real numbers, in which the first number is greater than the second. It holds: ">" = { (1,0), (27.3,12.5), (17,-12), ... } One writes: >(3,2) or 3 > 2 or (3,2) is in ">". Vector: A tupel of real numbers. A vector may be seen as point in an coordinate system or as an "arrow" pointing from the origin to that point. Example: v = (13,8.2,-17) // Contrary to a tupel, a vector contains real numbers, such that // algebraic operations can be defined on vectors Sum of vectors: The vector consisting of the sums of all corresponding entries of the two vectors. Example: (1,2,3) + (4,5,6) = (5,7,9) Difference of vectors: The vector consisting of the differences of all corresponding entries of the two vectors. Example: (1,2,3) - (1,2,3) = (0,0,0) Dimension of a vector: The number of its entries. correlation coefficient of vectors, dot product of vectors, similarity of vectors: A real value calculated from two vectors x and y as follows: sim(x,y) = x * y = x[1]*y[1] + x[2]*y[2] + ... x[n]*y[n]. Length of a vector: The distance between the origin and the point indicated by the vector in a coordinate system. It may be calculated by |v| = SQRT(v*v) Binary vector: A vector of 0's and 1's. Binary vectors and ordered sets: A subset to a given ordered set can be described by a vector: Suppose the given set has n elements. Then any subset of this set can be described by a binary vector with n elements, which has a '1' for every element which is in the subset and a 0 else. On the other hand, every binary vector can be seen as a subset of an ordered set. This allows us to use set operations on binary vectors, which then work like operations on binary numbers in a programming language. Please note that a simple ordered set cannot be converted to a vector, but only subsets of a given ordered set can be expressed by vectors. Example: set ={1,2,3,4,5} subset={1, 3, 5} vector=(1,0,1,0,1) // The distinction between vectors, sets, ordered sets, tupels, // binary vectors and subsets does not become clear from the lecture ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Grammar ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Alphabet: A set of symbols. Word: A word of an alphabet is a string of its symbols. Language: A language of an alphabet S is a subset of the set of all words of S. If you consider natural languages, the "alphabet" is its set of words, the "symbols" are its words and the "words" are its sentences! FOPC, First Order Predicate Calculus: A language of variables, truth-operators and quantifiers. // For a detailed definition see ai.txt Formula: A word of FOPC or a similar formal language. Language concatenation: The concatenation of two languages is the language of words consisting of a word from the first language concatenated with a word from the second language. L1L2 = { uv : u is from L1 and v is from E L2} "Self-concatenation": The n-fold self-concatenation of a language is the language of all those words which consist of n words of the language. L^0 = {} L^n = L L^(n-1) "Infinite self-concatenation", Kleene-Star: The Kleene-Star of a language is the language of all those words which consist of 0 to INF words of that language. L* = union of all L^n for n=0..INF Complementation: The complementation of a language is the language of those words of the underlying alphabet, which are not in this language. Reversal: The reversal of a language is the set of all words contained in it, turned around. Definition of a language: A formalism which can both detect and generate all and only the words of this language. Production: A rule of substituting one substring by another. Example: N -> "house" and DET -> "the" applied to "DET N" yield "the house" Terminal, lexical constituent: A symbol of the alphabet of the language. Example: "cat" Non-terminal: A symbol classifying a terminal or a sequence of non-terminals (i.e all those symbols in the productions which do not belong to the language). Examples: DET classifies "the" and S classifies NP VP Sentence: A "word" of a natural language. Example: The sentence "The cat runs" is a "word" of the symbols "the", "cat" and "runs". Sentence form: A sequence of non-terminals and maybe also terminals. Example: "DET cat runs" Grammar: A formalism of productions which can generate all words of a language. Since a grammar can also be used to check whether a word could be generated, it functions as a definition of that language. A grammar is a "declarative" way of defining a a language. Formally, a grammar consists of: * An alphabet of terminals * An alphabet of non-terminals * A set of productions * A starting symbol, which is non-terminal Example: S -> NP VP, NP -> DET N, VP -> V, DET -> "the", N -> "cat", V -> "runs" with starting symbol S and alphabets as usual can tranform "S" to "The cat runs" Overgeneration: The phenomenon that a grammar produces words which are not wanted. Derivation: A sequence of productions which transforms the string of a non-terminal (by default: the starting symbol) to a word of the language. Parse tree: The graphical representation of a derivation in form of a tree. The root is the starting symbol. The set of children of a node is the sequence of the symbols resulting from the parent by one production. Recursion: The phenomenon that the name of a symbol occurs in the definition of the symbol. Center-embedded recursive production: A production of the form A -> xAy where A is a non-terminal and x and y are arbitrary non-empty sentence forms. Immediately left-recursive production: A production of the form A -> Ax where x is a sentence form. Left-recursive grammar: A grammar with a non-terminal A which has a derivation that leads to a sentence form with A itself at the leftmost position. Constituency: The phenomenon that a sequence of symbols may act as a single unit. Constituent, phrase, clause: A sequence of terminals which can be derived from a non-terminal. Example: "The big cat" is a constituent since it can be derived from NP. strongly equivalent grammars: Grammars which generate the same language and assign the same parse tree to all words. weakly equivalent grammars: Grammars which generate the same language but do not assign the same parse tree to all words. Chomsky Hierarchy: The following list of sets of grammars: 1. Type 0 Grammars/Turing machines (most powerful) Productions are not restricted 2. Context-sensitive Grammars Productions replace one non-terminal surrounded by a context by a sentence form 3. Context-free Grammars Productions replace one non-terminal by a sentence form independent of the context 4. Regular/Right linear Grammars (least powerful) Productions replace one non-terminal by a terminal or by a terminal followed by a non-terminal -- independent of the context. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Regular Languages ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Regular Expression, RE: Algebraic notation for characterizing a set of strings. A RE is * a symbol This RE describes all strings which consist of just this symbol * a RE followed by a RE This RE describes strings consisting of one substring matching the first RE followed by another substring matching the second RE * a "." This RE matches any string of one arbitrary symbol except for epsilon * a "^" This RE matches the beginning of a line * a "$" This RE matches the end of a line // What is the difference here (?) * a "\b" This RE matches word boundaries * a "\B" This RE matches non-boundaries * a "[s]" where s is a string This RE matches strings consisting of one of the symbols of s * a "[x-y]" where x and y are symbols with an order This RE matches strings consisting of a symbol in between and including x and y * a "[^x-y]" where x and y are symbols with an order This RE matches strings of one symbol, which is not in between x and y * a RE followed by a '?' This RE matches empty strings or strings matching the previous RE * a RE followed by a '*' This RE matches strings consisting of a possibly empty sequence of substrings which each match the previous RE * a RE followed by a '+' This RE matches strings consisting of a non-empty sequence of substrings which each match the previous RE * a "(x)" where x is a RE * a "(x | y)" where x and y are REs This RE matches strings which either match x or y or both * a RE followed by "{n}" where n is a positive integer This RE matches strings consisting of n substrings matching the previous RE * a RE followed by "{n,m}" where n and m are positive integers This RE matches strings consisting of n to m substrings matching the previous RE * a RE followed by "{n,}" where n is a positive integer This RE matches strings consisting of at least n substrings matching the previous RE RE Substitution string: A string of the form "s///" where is a RE and is a string. This construction says "Replace every substring matching by ". (?) Deterministic: Always responding the same way upon an input. Transition: The act of changing one state to another. Epsilon-transition: A spontaneous transition which occurs without input. Transition function: A function d:S x I -> S which, given a state and a symbol, returns a sucessor state. Transition relation: A relation r:S x I x S between states. A transition relation can be seen as a function which, given a symbol and a state, returns a set of sucessor states. Automaton: A formalism which can detect whether a string belongs to a language. // Unfortunately, the definitions of automata, FSAs, deterministic // FSAs and indeterministic FSAs are not hierarchical -- although // this would be possible. For a complete overview see infod.txt FSA, Finite State Automaton: A formalism which performs transitions on a finite set of states while reading an input string. A FSA "reads" a string symbol by symbol. At each symbol read, the state is changed according internal rules. If the string is over and the FSA remains in a state marked as "final", then the FSA is said to "accept" the string. A FSA may be depicted as a graph where every node stands for a state and the edges show a transition from one state to another, given a particular input symbol. A FSA may also be depicted by a transition table, which shows the new state(s) for a particular current state and a particular input. A FSA can also be used vice versa to generate the language. Hence it works as a definition of a language. Deterministic FSA: A FSA consisting of * a finite set of states S={s0,s1,...sn} * a finite input alphabet I * a start state s0 E S * a set of final states F < S * a transition function d(q,i) A deterministic FSA thus has only 1 sucessor state for a current state and a given input. Non-deterministic FSA: A FSA consisting of * a finite set of states S={s0,s1,...sn} * a finite input alphabet I * a set of begin states B < S // In InfoD, B is just a single beginning state! * a set of final states F < S * a transition relation r // Note that with this definition, the non-deterministic // FSA can act deterministically -- if there is always 1 transition // for 1 input and 1 state, which is not prohibited. A non-deterministic FSA thus can have choice points where for a current state and a given input, one of multiple possible successor states results. If _any_ of the possible ways in which the FSA can process an input results in a final state, the FSA is said to accept the string. Converting non-deterministic FSAs to deterministic FSAs: Introducing new states for each set of states reached with one input symbol. Regular language, RL: * The empty set * The set of one symbol (including epsilon) * L1 L2 = {xy : x is from L1, y is from L2}, the concatenation of two regular languages L1 and L2 * L1 \/ L2, the union of two regular languages L1 and L2 * L*, the result of the Kleene operation on a regular language L Closure properties of regular languages: IF L1 and L2 are regular languages, then so are * their intersection * their difference * their complementations * their reversals Left-linear grammar: A grammar which just has productions of the form Non-Terminal -> Terminal or Non-Terminal -> Non-Terminal Terminal Right-linear grammar: A grammar which just has productions of the form Non-Terminal -> Terminal or Non-Terminal -> Terminal Non-Terminal Any right-linear grammar can be transformed to a left-linear one and vice versa. Regular grammar, RG: A grammar which is either left-linear or right-linear. Regular equivalences: For every RL, there is a * RE which generates it and all RE just generate RLs * FSA which accepts it and FSA just accept RLs * regular grammar which produces it and regular grammars just produce RLs Problems with RLs: * RLs cannot represent constituency * RLs cannot represent structural ambiguity * RLs cannot deal with recursion Finite state transducer, FST: A formalism which gets an input string and produces an output string by performing transitions on a finite set of states. It consists of * a finite set of states S={s0,s1,...sn} * a finite set A of pairs of an input symbol and an output symbol, where both of them may also be epsilon * a start state s E S * a set of final states F < S * a transition function d: S x A -> S A FST works like a FSA, but produces one output symbol with every transition. A FST can thus also be depicted as a table or graph. Regular relation: A set of pairs of strings, where each pair has a corresponding FST. // I'd prefer to define "regular relation" without FSTs and // afterwards show equivalence, // but I just don't know what a "regular relation" is (?) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PoSs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Word: When talking about natural languages, a word is what it uses to be. Linguistic expression: A word or a constellation of words. Preterminal, PoS, Part of Speech: A non-terminal which can derive a terminal in one step. A PoS is thus a symbol classifying the role a linguistic expression will take in a sentence. A PoS carries syntactical, morphological and semantic information. The PoS is written with a slash behind the linguistic expression. Examples: "House/Noun", "goes/verb", "you/pronoun", "to/preposition", "well/adverb", "and/conjunction", "up/participle", "the/article" // These POS base on syntax trees, see lingu.txt Tag: An additional information for an item. Here mostly: The PoS of a linguistic expression. Tagging: Assigning PoS to words. Subject: The noun of the noun phrase of a sentence. Object: The noun of the noun phrase of the verb phrase of a sentence. Corpus: Collection of text data. Annotated corpus: A corpus with additional information such as sentences being split to tokens, words being tagged with their PoS and words being tagged with their lemmata. CQP: Corpus Query control, a tool for searching annotated corpora. The user types a query in form of an extended RE, CQP searches the corpus and stores the result in a variable or prints it on the screen. Various display modes (such as length of context or display of tags) may be set. Results may be grouped according to a feature. // some CQP commands would be nice here Word class: A set of words with the same PoS. Closed word class: A word class which cannot be extended by inventing new words. Closed word classes are classes of function words. Examples: prepositions, determiners, numerals, ... Open word class: A word class which can be extended by inventing new words. Examples: Noun, verbs, adjectives Lexicon-only tagging: Assigning tags by looking the PoS up in a lexicon. Disambiguition rule: An algorithm to determine the most plausible PoS out of a set of choices for a linguistic expression, taking into account the whole sentence. Lexicon plus rules tagging: Assigning tags by looking up the PoS in a lexicon. When two or more PoS can be assigned to a linguistic expression, disambiguition rules are used to pick out one of them. Lexicon plus absolute probability tagging: Assigning tags by looking up the PoS in a lexicon. When two or more PoS can be assigned, choose the one which occurs most frequently in the language. Lexicon plus relative probability tagging: Assigning tags by looking up the PoS in a lexicon. When two or more PoS can be assigned, choose the one which occurs most frequently in the language, given the preceding PoSs. This strategy thus does not only take into account how often a word occurs in the language, but also how often a word of this word class occurs after this sequence of PoSs. One maximizes P(word|tag) * P(tag|previous n tags). Example: "run" is much more likely to be a noun if it is preceeded by a determiner: P(run|Noun) * P(noun|det) > P(run|verb) * P(verb|det) Gold Standard: The human expert tagging of a corpus, roughly defining what is "correct" tagging. Machine tagging today reaches 95-97% of correct tags. Word-Tag probability: The probability that, given a tag T, the word W occurs, written as P(W | T). Example: P(run | verb) > P(run | determiner) Tag-Tag probability: The probability that a tag T1 is preceeded by a tag T2, written as P(T1 | T2). Example: P(noun | determiner) will be large because nouns are often preceeded by a determiner. Dynamic programming: Writing algorithms which store partial results in a table instead of recalculating them each time they need them. This reduces the exponential time effort needed by recalulation to a polynomial time effort. Example: You all know the fibonacci function fib(0) = 1; fib(1) = 1; fib(n) = fib(n-1) + fib(n-2); When calculating fib(4), you first calculate fib(3), which requires fib(2) and fib(1). Then you calculate fib(2) and add it to fib(3). Obviously, fib(2) was calculated (at least) two times! Instead of recalculating it the second time, you look up your old result of fib(2). You might as well start from fib(0) and make a list fib(0) to fib(n), such that you can easily look up any previously calculated value. PROLOG supports dynamic programming by "assert"': It does not matter whether a query is calculated from rules or whether it has been dynamically added as a constant fact. Viterbi Algorithm: A dynamic programming algorithm for efficiently calculating the tags of a sentence. It uses a table in which for all words (rows) and for all tags (columns) the probability of that word being assigned that tag is stored. The first row stands for "beginning of the sentence" and has all cells set to 0 -- except for the column with the tag "PERIOD", set to 1. All rows are now filled top-down. For calculating the probability of a word-tag-pair, three factors are taken into account: 1. The word-tag probability of this pair P(W[i] | T) This probability can be calculated from a corpus 2. The tag-tag probability of this tag and a preceeding tag X P(T|X), can be calculated from a corpus 3. The word-tag probability of the preceeding word and the tag X P(W[i-1] | X), can be found in the preceeding row of the table In order to find the tag X, one runs through row i-1 and choses the tag X which maximizes P(W[i]|T)*P(T|X)*P(W[i-1]|X). The resulting probability (and the tag X) is stored in the cell. After having filled the table, the most probable sequence of tags for the sentence can be found by going bottom-up in the table: Choose the largest value in the last row. The corresponding tag is the most probable tag for the last word. The tag X stored in the same cell is the most probable tag for the preceeding word, because we chose X as preceeding tag to maximize the current word-tag value. Go to this tag X in the preceeding row. There will be another tag X stored in this cell, giving you the most probable PoS of the last but two word. Continue up to the first row. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Contextfree Languages ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Contextfree grammar, CFG, phrase structure grammar: A grammar which just has productions of the form Non-Terminal -> Sentence form Contextfree language: A language generated by a contextfree grammar. epsilon-free contextfree grammar: A grammar which has no production containing epsilon. Only the production S -> epsilon is allowed to ensure the grammar can generate the language of epsilon. Chomsky normalized contextfree grammar: A epsilon-free contextfree grammar with only productions of the form non-terminal -> nonterminal non-terminal non-terminal -> terminal Every contextfree grammar can be converted into a weakly equivalent Chomsky normalized grammar. Agreement: The phenomenon that two linguistic expressions match according to a certain criterion (such as number or person). Head: The constituent of a phrase which provides features for the whole phrase. Example: In "my sister", "sister" is the head since it makes the whole noun phrase "singular" and "female". Complement: The constituent of a phrase which is not the head. Subcategorization frame: A set of non-terminals describing the possible complements for given non-terminal. Rule proliferation: An immense increase in the number of rules. Advantages of CFGs: CFGs can represent * Declarative sentences * Imparative sentences * Wh-questions * Yes/No-questions Problems with CFGs: * CFGs don't recognize agreement Example: "I sees the tree" * CFGs don't recognize subcategorization frames Example: "I sleep you" CFGs overgenerate and produce ungrammatical sentences. More precise rules would help out but would lead to rule proliferation. Furthermore, redundancy and a lack of generalization result. CFGs and RGs: CFGs are more powerful than RGs and include RGs. But a RG can generate the same language as a CFG if and only if the CFG does not contain center-embedded recursion. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Parsing ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Parsing: Assigning structure to an input. Morphological parsing: Assigning structure to a word. Syntactical parsing: Assigning structure to a sentence. Syntactical parsing can be implemented as search in the parse tree. Top-down, goal-driven: Starting with the goal and going to what we have. Here: A goal-driven parsing algorithm starts with S and expands the tree top-down to a structure of preterminals which matches the sentence. Advantage: Only trees resulting in S are built (since we start with it) Disadvantage: Much time is spent on building trees which do not match the sentence. Bottom-up, data-driven: Starting with what we have and going to the goal. Here: A data-driven parsing algorithm starts with the sentence and tries to find a way up to S. Advantage: Only trees which match the sentence are built (since we start with it) Disadvantage: Much time is spent on building trees which do not lead to S. Left-corner: The left-corner of (a subtree of) a parse-tree is its left-most preterminal. Bottom-up filtering: A technique for improving the performance of a top-down parser. It uses a left-corner table, where for each non-terminal all left-corners of this non-terminal's parse trees are listed. Thus, the parser already knows whether it is worth expanding this non-terminal, given a part of a sentence which is to be matched. Left-corner parser: A top-down, depth-first left-to-right parser which uses bottom-up filtering. Ambiguity: The phenomenon that a linguistic expression has multiple meanings. Ambiguity is a problem to all simple parsers. Attachment ambiguity: The ambiguity resulting from multiple possibilities of attaching a constituent to a parse tree. Coordination ambiguity: The ambiguity resulting from an unclear range of a conjunction. Noun-phrase backeting ambiguity: The ambiguity resulting from an unclear pairing within a sequence of nouns. (?) Problems of a left-corner parser: * Left-recursion A left-recursive grammar leads the parser into an infinite loop A solution might be to create a weakly equivalent grammar with the left-recursion eliminated * Ambiguity * Repeated parsing of subtrees Since the left-corner parser is depth-first, it may happen that a non-terminal's subtree is parsed more than once. Dynamic programming parser: A top-down parser which stores subtrees in a table such that it does not have to be reparsed when needed again. Since always all possible parses are stored in the table, ambiguity can be identified. The problem of left-recursion is also solved. Earley algorithm: A dynamic programming parser. It parses the input left to right and has a "chart" for the beginning of the sentence and one more for each word of the sentence. Each chart is an ordered set of "states". A state is a production with a dot in it and two zero-based integer values behind it: A -> B C D * E F [i;j] The dot tells us, where the current position in the input sentence has its corresponding position in the production. i is the position of the word in the input sentence that corresponds to B and j is the position of the current word, i.e. of the dot. The initial state is y -> * S [0;0]. Now, 3 operators are successively applied to the states, yielding new states and filling the charts until S -> ... * [0;] is reached, indicating that the sentence has been successfully parsed from the first to the last position. The three operators are the predictor, the scanner and the completer. By tracking back those states which triggered a completer, the complete parse tree can be read out of the charts. Predictor: An operator of the Earley algorithm. It is applied to every state that has a non-terminal which is not a PoS to the right of the dot. It expands the non-terminal according to the grammar and adds new states to the same chart, which contain: * a production of the form N -> * A B C, where A B C is the derivation of the found non-terminal N. * the sentence position index set to the dot position index * the same dot position index Scanner: An operator of the Earley algorithm. It is applied to every state that has a preterminal to the right of the dot. When the current word fits that preterminal, a new state is added in the next chart with: * a production of the form Preterminal -> word * * the same sentence position index * the dot position index increased by 1 Completer: An operator of the Earley algorithm. It is applied to every state where the dot has reached the right end of the production. For all previous states in the charts, it picks those which have the newly finished non-terminal to the immediate right of the dot. For all of them, new states are created in the current chart with: * the same production, but the dot advanced by 1 * the same sentence position index * the same dot position index, but increased by the length of the newly finished state, i.e. j-i Chunk-pattern: A sequence of preterminals. A chunk-pattern is thus non-recursive. Chunk: A sequence of terminals which matches a chunk-pattern. Chunk-parser: An algorithm which picks out chunks from a sentence. It matches chunk-patterns to the PoS-tags. Since it is implemented as a FSA, it works extremely efficient. The result of a chunk parser is less informative than a syntactic parse, but the chunk parser does not have the problems a left-corner parser has. Cascade chunk parsing: Feeding one chunk parser with the results from another in order to find higher-order structures. Contingency matrix: A matrix showing the performance of a chunk parser on a set of strings. It looks like this: is a chunk is not a chunk in the parser's output TP FP not in the output FN TN TP: true positives, the number of those chunks which were found FP: false positives, the number of non-chunks reported by the parser // Take care: In Neural Networks (nn.txt), these expressions // are used vice versa! FN: false negatives, the number of those chunks which were not found by the parser TN: true negatives, the number of those non-chunks which were not reported. Note that this number will be extremely large, it is never used and never calculated. Precision, accuracy: The proportion of the correct chunk-patterns in the parser's output, calculated as TP/(TP+FP). Thus, the less non-chunks the parser reported, the better is its precision. See also ai.txt Recall, coverage: The proportion of those chunk-patterns which were found, calculated as TP/(TP+FN). Thus, the less chunks the parser forgets to report, the better is its recall. See also ai.txt F-measure: A measure combining precision P and recall R. It is calculated as 1/(b/P+(1-b)/R). b is a parameter between 0 and 1 which either weights precision or recall. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Feature structures ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Feature structure: A set of attribute-value pairs. The attributes are atomic, but the values may be nominals, integers or again whole feature structures. Values may also be "[]", i.e. unspecified. Each attribute is only contained once. Underspecification of a feature structure: The phenomenon that a feature structure does not contain all possible attributes. Attribute-value matrix, AVM: A matrix consisting of lines of attribute- value pairs, thus representing a feature structure. Feature graph: A directed acyclic graph representing a feature structure. Its arcs are labelled by the attributes and its values appear as nodes. Reentrancy, structure sharing: The phenomenon that several attributes have a precisely identical value. The shared value is notated by coindexing boxes in the AVM. Path equation: A string of the form (A B C) = (D E F) or (A B C) = b, where (A B C) and (D E F) denote paths in the feature graph by the names of attributes and b denotes a node of the graph. (A B C) = (D E F) means: You reach the same node when travelling via A,B,C and D,E,F. (A B C) = b means: You reach a node with value b when you travel along A,B,C. Subsumption: The relation of something including something. Subsumption of feature structures: The relation between a feature structure and a more specific feature structure. Subsumption is written as A < B, where "<" is a square-shaped less-equal sign and A is the less specific, more general, subsuming one. Subsumption is a partial ordering relation, which does not hold in any direction with conflicting feature structures (same attribute, different value) or completely different feature structures (no equal attributes). Feature structure semilattice: A graph of feature structures as nodes and subsumption relations as arcs. Unification: The act of combining two information structures to a new one. When the information structures are incompatible, unification fails. Unification of feature structures: The act of combining two feature structures to a new one, which contains the information of both feature structures, but not more (the most general unifier). Unification is written as A |_| B = C. It works like this: Go through all attribute-value-pairs of the first feature structure. * If the attribute is not contained in the second one, keep it * If the attribute is contained in the second one and has an unspecified value, keep it * If the attribute is contained in the second one and has another specified value than the first specified value, the unification fails. * If the attribute is contained in the second one and has a feature structure as value, unify the first value and the second value and keep the resulting structure as value to the current attribute. Repeat vice versa with the second feature structure. Feature structures in grammar: Feature structures may be added to a grammar in order to model agreement, subcategorization and gramatical heads. Feature structure contraints are added to the grammar in the form = atomic-value or = and mean the same as do path equations. Lexical constituents get their features from the lexicon, whereas non-lexical constituents get their attributes and values from their heads of phrases. Example: = , where "agreement" is an attribute whose value is maybe a feature structure of "person" and "number" or "gender". Head feature: A feature passed from the head to its phrase. Head features are grouped as value of an attribute called "HEAD". Subcategorization feature: A feature specifying the complements of a non-terminal. Subcategorization features are grouped as value of an attribute called "SUBCAT". ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Psycholinguistics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Thematic role: The function the referent of a linguistic expression has in the situation represented by the sentence. Example: In "Peter kicked Franzerl", "Peter" is the agent and "Franzerl" is the patient. Children find sentences difficult in which an entity is both an agent and a patient. Right-branching language, head initial language: A language whose complements stand right to the heads. Maybe parsed top-down. English is an example of a right-branching language. Left-branching language, head final language: A language whose complements stand left to the heads. Can not be parsed top-down, but rather bottom up. Japanese is an example of a left-branching language. Different types of parsing: Here is a table of different types of parsing and their properties. Classical Formal ComputerL's NL PsychoLing Done by: humans Abstract device computer computer humans Done on: sentences abstract representation symbolic input representations of NL bits of NL Result: inf.gramm.descr. formal representation change in mach.state change in mach.state mental rep. Method: is a skill algorithmic algorithmic algorithmic/heuristic heuristic Innate: No Yes Aware: Yes No Goal: Pedagogical Efficiency principles for human parsing: * Use any information immediatly * Avoid backtracking and reanalysis Human parsing: According to the standard view, human parsing takes place left-to-right and top-down. But this would cause problems with left-branching languages. The innate parsing maybe parameterized with "right" or "left" for top-down or bottom-up parsing. Apart from a parse tree, human parsing also establishes thematic roles and meaning. Center-embedded relative clause: A complement for a noun which is surrounded by other terminals in the sentence and starts with a determiner. Center-embedded relative clauses put a considerable burden on the working memory because unfinished clauses must be kept in memory until they are completed by the corresponding verb. Center-embedded structures can be removed in English. Minimal attachment parsing, immediate syntactic processing: The act of constructing a parse tree by immediately inserting every new word from the input left-to-right. A constituent is thus completed as soon as possible and no new constituents are added beyond necessity. This leads to difficulties with sentences like "Sam loaded the boxes on the cart onto the van", where "on the cart" first seems to complete the verb phrase but then is a complement to "the boxes". Immediate semantic processing: The act of anticipating sentence constituents by semantic properties of the preceeding constituents. This accounts for the anticipation of "are" in the sentence "When you live near a runway, landing planes ..." and for the anticipation of "is" in the sentence "If you've been trained as a pilot, landing planes ...". ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Semantics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Syntax: Rules on the arrangement of symbols. Semantics: The study of meaning. Computation: Purely syntactic manipulation of Symbols. Semantic properties of the symbols are external to computation. Syntactic symbol manipulation is semantically invariant, i.e. if a string of symbols is transduced to another, the meaning should have remained the same. Analyzing a sentence: * Parsing it ~~~~> yields a parse tree (syntactic constituent structure) * doing a sementic analysis ~~~~> yields a semantic representation (an unambiguous logical structure in a canonical form which can serve for drawing inferences) world: A constellation of entities with attributes. Example: Our universe is a world. But we may also imagine a world where Peter Bosch is not a lecturer but the German Chancellor. necessary: true in all possible worlds. Example: Sentences which are true by definition, mathematical statements contingent: not necessarily true. Example: "I am studying Cognitive Science" -- I could also study something else, say "Agrarpsychologie" Declarative sentence: A sentence of the form NP VP. Truth: A relation between a FOPC formula without variables and a world. Example: likes_ice_cream(Fabian) is true in our world. Proposition, FOPC sentence: A FOPC formula without variables. A proposition can be seen as a binary function which, given a world w, returns 1 if the proposition would be true in this world or 0 if the proposition would be false in this world. Since binary functions can be seens as sets, a proposition is the set of those worlds in which the proposition would return 1. Example: sleeps(Peter) = the set of all worlds in which Peter is sleeping = { w : Peter sleeps in world w } Sense, content: The content of a declarative sentence is its proposition. Example: content("Peter is jogging") = jogging(Peter) = the set of all worlds in which Peter is jogging = { w : Peter is jogging in w } Content of a lexical item: The content of a lexical item is the concept it refers to. We will see lots of ways how to formalize these concepts. Rule-to-Rule-hypothesis, Rule-by-Rule-hypothesis: The assumption that each syntactic rule has a corresponding rule of semantic composition. Indexicality: The phenomenon that the truth of a sentence depends on the circumstances of the utterance (e.g. by involving pronouns, by involving relative time expressions or by being in a specific place). Meaning of a sentence: A function which, given information on the circumstances of the utterance, returns a proposition. Meaning thus yields a canonical form of a sentence, it erases indexicality. Meaning is a level which intermediates between the linguistic expression of a sentence and the knowledge representation in form of propositions. Since we will talk of sentences with no indexicality, meaning becomes a constant function and thus equal to content. Example: Assume M[r] to be the meaning of "It's raining". Then we may apply M[r] to certain circumstances: M[r](Osnabrueck,2002-06-01,2:35) = eventInAt(rain,Osnabrueck, 01-06-2002_2:35) Example: Assume M[c] to be the meaning of "All students are clever". This is a constant function which always returns the same result on no matter which circumstances: M[c](Bonn,2002-07-09,2:30) = All x: student(x) => clever(x) M[c](Wien,1999-02-03,1:02) = All x: student(x) => clever(x) ~~~~~~> M[c] = All x: student(x) => clever(x) ~~~~~~> M[c] = {w : all students are clever in w } ~~~~~~> M[c] = content("All students are clever") Meaning becomes equal to the proposition. Annotation: In the following text, meaning will be treated as equal to the proposition. Notation: || "All students are clever" || is the meaning of this sentence Calculation: The meaning of a sentence can be calculated by applying the meaning of its verb phrase to its Noun Phrase: || S || = || VP || ( || NP || ) Meaning of an intransitive verb: A function mapping the subject of the verb to a proposition. Example: The function (LAMBDA x sleeps(x)) maps "Peter" to the set of all worlds where Peter sleeps: (LAMBDA x sleeps(x))(Peter) = sleeps(Peter) = { w: Peter sleeps in w } Annotation: Note that we still cannot tell what "to sleep" really means. We have just found a way to express that "to sleep" is a concept which is "not yet complete": It needs a noun given to become a proper meaningful sentence (i.e. a proposition). Just like "square root" needs an argument to become a proper value. This is what we express by the argument following "LAMBDA". Meaning of a transitive verb: A function mapping the object of the verb to the meaning of an intransitive verb. Example: In "Peter kicks Franzerl", "kicks" is a function which returns a function: (LAMBDA y (LAMBDA x (kicks(x,y)))) ~~~~~~~~~~~~~~~~~~~~~~~ return value: a function Here, it is applied to the object of the verb, "Franzerl": (LAMBDA y (LAMBDA x (kicks(x,y ))))(Franzerl) = (LAMBDA x (kicks(x,Franzerl))) It yields the meaning of an intransive verb, namely a function from entities x to propositions. This intransitive verb can be seen as "kicking_franzerl". It is now applied to "Peter", the subject of this intransitive verb: (LAMBDA x (kicks(x, Franzerl)))(Peter) = kicks(Peter,Franzerl) = { w : Peter kicks Franzerl in w } If we have a dictionary which gives us the meanings of single words as functions (such as the above (LAMBDA y (LAMBDA x...))), we can get a first order predicate calculus formula for a sentence by applying the meanings of the constituents to each other. Meaning of a proper name: The entity this proper name refers to. Quantifier subject: A subject which does not denote a named entity. Examples: "Nobody", "Everyody", "Somebody" Meaning of a quantifier subject: The meaning of a quantifier subject is a function which maps a verb (i.e. a function) to a proposition. Example: In "Nobody sleeps", "nobody" is a function which needs another function as its argument: LAMBDA P (~ Ex. x: person(x) & P(x)) To calculate the meaning of "Nobody sleeps", we use "sleeps" as an argument for "nobody": (LAMBDA P (~ Ex. x: person(x) & P(x)))(sleeps) = ~ Ex. x: person(x) & sleeps(x) = { w : Nobody sleeps in w } Events in First Order Predicate Calculus: Events can be expressed as entities, which have the action of the sentence as a property and which stand in a "patient" relation to the object of the sentence and in an "agent" relation to its subject: "Peter sees Anke" ~~~> Ex. e: seeing(e) & agent(e,Peter) & patient(e,Anke) Meaning relation: A semantic relation between words. Example: A "mother" is a "female parent" Meaning postulate: A constraint on a meaning relation. Meaning postulate for a noun: A logical formula of the form Necessarily All x: W(x) <=> P(x) & Q(x) & ... where W is a noun and P, Q, ... are properties. Several variables may occur. Example: Necessarily All x, y: mother(x,y) <=> parent(x,y) & female(y). Meaning postulate for a verb: A logical formula of the form Necessarily All x1,x2,...: V(x1,x2,...) <=> where V is a verb and the may contain the usual logical operators and also the special predicates CAUSE(cause,effect) and BECOME(state) Example: Necessarily All x,y: kill(x,y) <=> CAUSE(x,BECOME(dead(y)) Meaning representation: A logical formula which fixes the meaning of a word. We use the lambda-calculus in combination with first order predicate calculus. Example: The meaning of the transitive verb "to open" is the concept it refers to, which can be expressed as follows: LAMBDA obj (LAMBDA actor (Ex. e : opening(e) & theme(e,obj) & agent(e,actor) ) ) This means: "to open" is a function which needs two arguments given: An actor and an object. Saying "A opens B" is equivalent to stating: There is an event e, which has the property of being an "opening"-event, which stands in a "theme"-relation to the object and in an "agent"- relation to the actor. Meaning representations allow for inference drawing with the help of a knowledge base. Meaning postulates and ambiguities: Meaning postulates must take into account semantic ambiguity of words. Implications can be used to differentiate between two meanings: Certain properties of the argument of the word imply certain meanings. Example: A meaning postulate for the adjective "open" when applied to a container (e.g. a bottle): All x Ex y : ((container(x) & open(x) & content(x,y)) => accessible(y)) A meaning postulate for the adjective "open" when applied to a shop: All x Ex y : ((shop(x) & open(x) & goods(x,y)) => forSale(y)) // In the slides, meaning postulates and meaning representations seem // to mix. (?) Understanding: Understanding a sentence means to know what is the case when the sentence is true. (Ludwig Wittgenstein) The truth of a sentence here depends on * its meaning * indexicality * contingency ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Discourse analysis ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Discourse: A linguistic unit larger than a sentence. Examples: A conversation or a story. Segment: An utterance. Computational discourse analysis: The modelling of the knowledge processes taking place in the course of understanding a discourse. Center, Discourse entity: An entity a linguistic expression (mostly a noun phrase) in a discourse refers to. Example: In "Bob eats bananas", Bob and bananas would be discourse entities. Center theory: A framework for modelling attention to discourse entities. It states constraints on the discourse entities pronouns can refer to and declares preferred shift-of-attention patterns. It assumes that discourses consist of separate utterances (mostly sentences). Forward-looking center set: An ordered set of all discourse entities of an utterance. The set is ranked according to some emprically determined metric which measures the "importance" of a discourse entity in an utterance. The usual metric is Subject > Object > others. Preferred center: The first entry of the forward-looking center set. The preferred center is thus the (grammatically) most important entity in the current utterance -- whereas this does not necessarily mean that it is the topic of the current part of discourse. But the preferred center is likely to predict the backward looking center of the next utterance. Example: In "Bob sits on a banana", the forward-looking center set would be Cf = (Bob, banana) and the preferred center would be Cp = Bob. Backward looking center: The highest-ranking entry of the forward-looking center set of the last utterance, which occurs in the current utterance. (If there is no such center, the backward looking center remains undefined.) The backward looking center is thus the entity which connects the current and the last utterance. It can be seen as the topic of the current part of the discourse, since it "survives" two utterances. Example: "Bob hits Tim" Cf = (Bob, Tim) "That's why Tom likes Bob" Cf = (Tom, Bob), Cb = Bob Pronoun rule: The following rule, as stated by the Center Theory: "If any pronoun occurs in an utterance and refers to a discourse entity, one of the pronouns in the sentence refers to the backward looking center.". Example: "Bob hits Tim with a banana" Cf = (Bob, Tim, banana) "Bob likes hitting" Cf = ( Bob ), Cb = Bob OK, no pronouns "He likes hitting" Cf = ( Bob ), Cb = Bob OK, 1 pronoun, refers to Cb "He (Bob) likes it" Cf = ( Bob, hitting ), Cb = Bob OK, 2 pronouns, one refers to Cb "He (Tim) likes Bob" Cf = ( Tim, Bob ), Cb = Bob Not OK, 1 pronoun, not referring to Cb Center transition: The process of changing or keeping the backward looking center and the preferred center in the course of the discourse. Continue transition: A center transition, keeping the same backward looking center and also having the backward looking center as preferred center. This transition corresponds to cases where the speaker has been talking about a particular entity (a "hero") and indicates an intention to continue talking about that entity. Preferred centers: X. Y. A. Backward centers: ^--A-- ^--A-- Retain transition: A center transition, keeping the same backward looking center but introducing a new preferred center. Retain seems to correspond to a situation where the speaker is intending to shift onto a new entity in the next utterance and is signaling this by realizing the current topic (the backward looking center) in a lower-ranked position on the forward-looking center set. Preferred centers: X. Y. B. Backward centers: ^--A-- ^--A-- Smooth shift: A center transition, having changed the backward looking center, but keeping the current backward looking center as preferred center. A Smooth shift tends to occur when the speaker has started talking about a new entity and is doing so in a way that indicates that s/he will continue talking about that entity. Preferred centers: X. Y. A. Backward centers: ^--B-- ^--A-- Rough shift: A center transition, having changed the backward looking center and having another preferred center than the backward looking center. A Rough shift tends to occur when the speaker has started talking about a new entity but is only doing so momentarily. Preferred centers: X. Y. A. Backward centers: ^--B-- ^--C-- Centering theory and Center transitions: The centering theory states that center transitions are ranked by preference as follows: 1. Continue transition (Cb = last Cb, Cp = Cb) 2. Retain transition (Cb = last Cb, Cp != Cb, --> announce new topic) 3. Smooth shift (Cb != last Cb, Cp = Cb, --> topic changed, is ctd.) 4. Rough shift (Cb != last Cb, Cp != Cb, --> topic changed, not ctd.) This ranking reflects the hearer's inference load, which is minimal in continue transitions. When pronouns are ambiguos, the reading assuming the easiest transition is chosen. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Word Sense Disambiguation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Document: A text. Word Sense Disambiguation: The task of finding out which meaning an ambiguous word has in a context. In CL, this is often done by examining the words in the context, following the idea of "Meaning is use". Collocational feature: A syntactical or morphological cue for a word. Collocational features of a word in a text may for instance include immediate preceeding or following words, pos's of these words or the root of the word. context window: The context window of a word in a text is a set of a fixed number of words to the left and to the right of the word, optionally together with their properties. Properties may for instance be the pos's of the words. Context windows span sentence boundaries, but not document boundaries. Example: A context window of the word "span" in the above sentence is {the,words,Context,windows,sentence,boundaries,but,not} // In the lecture, the term "context vector" is also used to refer to // "context window" Distribution: The distribution of a word is the union of all its context windows. Association set: An association set of a word w is an ordered set of words (of open word classes) which tend to occur with one meaning of w. Example: An association set of "bank" in the meaning of "financial institution" could be {money, account, debt, skyscraper}. // In the lecture, this set has no name High-frequency set: A high-frequency set of a word w is an ordered set of all words (of open word classes) which tend to occur with w in whatever meaning. The high-frequency set is thus the union of all association sets. Example: For "bank", this could be {money, account, debt, skyscraper, river, water, sand} // In the lecture, this set has no name Feature set: A feature set for a word w is an ordered set of words whose occurance is likely to give a hint on the sense of w. Often, the association set of w or the high-frequency set of w are used as a feature set for w. Co-occurance feature set: The ordered set given by the intersection of the feature set and the context window. The co-occurance feature set thus filters the context window for relevant words. Co-occurance feature vector, Co-occurance vector, Context vector: The binary vector which identifies the co-occurance feature set as a subset of the feature set. Example: Word: "bank" in the meaning of "financial institution" Sentence: "I went to the bank to get money from my account" Context window: {I,went,to,the, to,get,money,from,my,account} Feature set: {money, account, debt, skyscraper, river, water, sand} Co-occ. feat. set: {money } context vector: ( 1, 0, 0, 0, 0, 0, 0) // Do all of the above terms really mean the same (?) // In the lecture, "context vector" is also used for "context window" Most probable meaning: The most likely meaning of a word w, given a particular context. If S is the set of all possible meanings of the word, the most probable meaning can be estimated as follows, given a context vector V: argmax (s out of S) P(s|V), i.e. the meaning which is most probable, given the context V. To calculate P(s|V), we use Bayes' rule and get: argmax (s out of S) P(V|s)*P(s)/P(V) Since P(V) is constant, it won't change the result of argmax: argmax (s out of S) P(V|s)*P(s) P(V|s) can be split up to the dimensions of V: argmax (s out of S) P( (v[1]|s) & (v[2]|s) & ... ) )*P(s) Supposing that the dimensions of the vector V are independent of each other, we may rewrite this as: argmax (s out of S) P(v[1]|s) * P(v[2]|s) * ... * P(s) or, written with the product symbol: argmax (s out of S) ( PROD P(v[i]|s) ) * P(s) Due to the assumption that the dimensions of v are independent from each other, the above expression is also called "Naive Bayes classifyer". To evaluate this formula, the absolute probabilities for each meaning of w need to be known. They can be calculated by simply counting the overall occurances of w with meaning s. P(v[i]|s) can also be calculated from corpus data: It is the number of times where the word v[i] occurs with w in sense s, divided by the total number of occurances of w in sense s. Training corpus: A corpus used to calculate the values for a Naive Bayes classifyer. Test corpus: A corpus used to evaluate the performance of a Naive Bayes classifyer, which was calculated from the training corpus. Ideally, training corpus and test corpus are different portions of the same corpus. Document space: The cartesian product of a set of words and a set of documents, represented by a document matrix. Document matrix: A matrix whose columns represent different words and whose rows represent different documents. A cell contains a '1' if the corresponding word appears in the corresponding document, else a '0'. Document vector: A row of a document matrix. A document vector thus represents an ordered set of all words occuring in this document as a subset to the ordered set of all words in the matrix. Word vector: A column of a document matrix. A word vector thus represents an ordered set of all those documents in which the word occurs as a subset to the ordered set of all documents in the matrix. Word space: The cartesian product of a set of words with itself, represented by a word matrix. Word matrix: A square matrix whose columns and rows both represent the same words. A cell contains the number of times one word co-occurs with the other in a particular part of a document. Word matrices are useful for showing topical or thematical relatedness of words. Modifier space: The cartesian product of a set of nouns and a set of adjectives, represented by a modifier matrix. Modifier matrix: A matrix whose rows represent adjectives and whose columns represent nouns. A cell contains a '1', if a noun is modified by (used with) the corresponding adjective in a particular document. The modifier matrix thus takes syntactic structure into account, while word and document spaces only calculate bare co-occurances. Finer-grained modifier spaces show similarity between properties of concepts. Modifier vector: A column of a modifier matrix. A modifier vector thus represents an ordered set of adjectives used with a specific noun as a subset to the ordered set of all adjectives in the matrix. // The naming is a little bit incoherent here: // Document vectors consist of words // Word vectors consist of documents // Modifyer vectors consist of modifyers Set-theoretic similarity measure: A function (i.e. a formula) which gets two vectors and returns a real value which ranks their similarity. To calculate the value, both vectors are interpretated as sets and set operations are used. Matching coefficient: A set-theoretic similarity measure whose value for two binary vectors X and Y is: | X /\ Y | This is just the number of '1'-entries, which occur in the same place in both vectors. It is the similarity of x and y in the sense of sim(x,y), i.e. simply the dot product of x and y, the number of elements which both sets share. Dice coefficient: A set-theoretic similarity measure whose value for two binary vectors X and Y is: 2*| X /\ Y | / ( |X| + |Y| ) This is just the number of '1'-entries, which occur in the same place in both vectors, normalized by dividing by the total number of 1-entries, i.e. the number of elements shared by both sets divided by the total number of elements. The range of the dice-coefficient is 0 to 1. Jacquard-coefficient: A set-theoretic similarity measure whose value for two binary vectors X and Y is: | X /\ Y | / | X \/ Y | The Jacquard-coefficient is similar to the dice-coefficient. Overlap coefficient: A set-theoretic similarity measure whose value for two binary vectors X and Y is: | X /\ Y | / min(|X|,|Y|) This coefficient yields 1 if X is a subset of Y or Y is a subset of X. Cosine coefficient: A set-theoretic similarity measure whose value for two binary vectors X and Y is: | X /\ Y | / SQRT(|X|+|Y|). The cosine coefficient is similar to the dice coefficient. Normalized correlation coefficient: A value for two (also non-binary) vectors x and y, which is calculated as follows: cos(x,y) = x * y / |x| / |y| = (x[1]*y[1] + x[2] * y[2] + ... x[n]*y[n]) / SQRT(x[1]^2 + x[2]^2 + ... x[n]^2) / SQRT(y[1]^2 + y[2]^2 + ... y[n]^2) This value corresponds to the angle between the two vectors, if they are interpretated as points in a coordinate system. cos(x,y) may thus also be used as a similarity measure, which is independet of the length of the vectors. Euclidean distance: The Euclidean distance between two vectors is the length of the difference of the two vectors. It may be calculated as: |X-Y| = SQRT( (x-y)*(x-y) ) = SQRT( (x[1]-y[1])^2 + ... (x[n]-y[n])^2 ) The Euclidean distance is the distance between the two points indicated by the vectors in a coordinate system. It may thus also be used as a similarity measure and yields the same ranking as does cos(x,y). ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Machine Translation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Translation: A transfer of a text from a source language to a target language which preserves meaning. World knowledge model: Data on reality. MT, Machine translation: A translation done by a computer. It involves * Source language analysis * Target language generation * World knowledge modelling FAHQMT, Fully-Automatic High-Quality Machine Translation: A translation which is done by a computer and whose quality equals a human translation. Up to now, there is no perfect machine translation without human intervention or assistance. CAT, Computer aided translation: A translation which is done by a computer and supported by a human or vice versa. Bilingual translation: A translation between two languages. Multilingual translation: A translation between more than two languages. Direct translation: Basically translating a sentence word by word. This consists of * morphological analysis of the source language sentence * look-up in a dictionary * local reordering * Outputting the result in the target language Adding a new language means writing a new system. Direct translation involves only very little effort in analysis and generation. Transfer translation: Using an analysis module and a generation module for each language and a transfer module for each pair of languages. Adding a new language means adding transfer modules from and to all existing languages. Transfer translation involves more effort in analysis and generation than direct translation. Interlingua translation: Translating first to an abstract universal language and then to the target language. As a consequence, there are only analysis and generation modules necessary for each language. Adding a new language thus means adding two more modules. Interlingua translation involves the highest effort in analysis and generation. Example: For a small demo program, go here: http://www.mpi-inf.mpg.de/~suchanek/personal/txt_e.htm#summaries You may translate the sentence "x sees y", where x and y may be pronouns, "the tree" or "the big tree" and the sentence may also be negated. Works with German, French, Italian and English. Example-based translation: A direct translation which works by looking up sequences of words of the source language sentence in a translation corpus. The corpus is bilingual and is aligned on word, phrase and sentence level. Statistical machine translation: A direct translation which calculates the most probable target language sentence from an aligned bilingual corpus. Predicate: Verb phrase. Null-Subject: The word which can be seen as the subject of a sentence but does not occur in the sentence. Example: "Cosa avete fatto?" "Was habt [ihr] gemacht?" Passive voice: The property of a sentence of containing certain "passive verb" constructions (which, in English, are "to be + participle"). // A semantic definition of "passive voice" will probably fail Active voice: The property of a sentence of not having passive voice. Lexical ambiguity: The ambiguity resulting from an unclear meaning of a word. Structural ambiguity, syntactical ambiguity: The ambiguity resulting from an unclear grammatical structure of a sentence. Real structural ambiguity, global structural ambiguity: The ambiguity resulting from an unclear grammatical structure of the whole sentence. Example: "Flying planes can be dangerous" can be "It can be dangerous to fly planes" or "Planes which are flying can be dangerous" Accidental structural ambiguity, local structural ambiguity: The ambiguity resulting from an unclear structure of a sequence of words, which is usually solved when reading on. Example: "The mathematics students sat their examinations" "The mathematics students study today is very complex" Homonym: A word which has different, unrelated meanings. Example: "bank" Polysem: A word which has different, though related meanings. Example: "work" can either be a verb or a noun Anaphora: The use of a pronoun to refer to a previously mentioned entity. Translation problem: A phenomenon which causes difficulties for translations. Translation problems include * Divergences * Mismatches (Lexical ambiguity problems) * Structural ambiguity problems Divergence: The phenomenon that a predicate is realized by different syntactical structures in two languages. Divergence includes * Lexical divergence * Non-lexical divergence Lexical divergence: Divergence resulting from the difference between the verb itself in one language and in the other language. Lexical divergence includes: * Structural divergence * Categorial divergence * Thematic divergence * Head switching * one to multi-word translation * Support verb construction divergence Structural divergence: The lexical divergence of a predicate being translated to a different pattern of constituents. Example: "Wir treffen uns am Mittwoch" "We meet --- on Wednesday" Categorial divergence: The lexical divergence of a word being translated to a word with another POS. Example: "Ich habe Hunger/N" "I am hungry/ADJ" Thematic divergence: The change of subject in a translation. Example: "Mir faellt der Termin/SUBJECT ein" "I/SUBJECT remember the date" Head switching: The lexical divergence of a verb v being translated to a verb phrase where v occurs as a complement. Example: "Ich esse gern" "I like to eat" One to multi-word translation: The divergence of the verb being translated to multiple words. Example: "Lassen Sie uns das festhalten" "Let us make a note of it" Support verb construction divergency: (?) Solutions to lexical divergence: Lexical Divergence may be dealt with by encoding the appropriate information in the lexical entry. Non-lexical divergence: The divergence not resulting from the verb itself. Non-lexical divergence includes: * Constituent order divergence * Preposition stranding * Null-Subject divergence * Dative divergence * Voice divergence Constituent order divergence: The non-lexical divergence of constituents being placed in another order in a translation Example: "Ich habe ihn gesehen" "I have seen him" Preposition stranding: Constituent order divergence with a preposition. (?) Example: "What store did John go to?" v-----------------------´ "In welchen Laden ging John?" Null-Subject divergence: The non-lexical divergence resulting from a Null-Subject which is explicit in the other language. Example: " Ho visto il libro" "I saw the book" Dative divergence: The non-lexical divergence resulting from the occurance of a dative construction instead of a verb+particle construction. Example: "I gave the book to him" "Ich gab ihm das Buch" Voice divergence: The non-lexical divergence consisting in a switch from active voice in one language to passive voice in another language. Example: "Es wurde getanzt" "There was dancing" Solutions to non-lexical divergence: Non-lexical Divergence may be dealt with by providing appropriate grammar rules. Mismatch: The phenomenon that the meaning of a word in one language cannot be expressed adequately in another language. This is due to lexical ambiguity in a wider sense. Mismatches include * Lexical gaps * Homonymy mismatches * Polysemy mismatches Lexical gap: The mismatch which occurs when a word with a general meaning has to be translated to a language which has different words for the specific meanings. "General meaning" and "specific meaning" are here to be interpretated in a set-theoretic sense as an inclusion of the extensions of the words. Example: "I go home" "Ich (fahre|gehe) nach Hause" Homonymy mismatch: The mismatch resulting from a homonymy of a word. Example: "ausmachen" can mean "to arrange" or "to switch off" Polysemy mismatch: The mismatch resulting from a polysemy of a word. Example: "window" can be "Fenster" or "Schalter" // What is the difference to lexical gap here (?) Solutions to mismatches: Mismatches may be dealt with by examining the local syntactic context, the local semantic context or the global context (anaphora resolution). Word sense disambiguation might be needed. Structural ambiguity problem: The problem of choosing a meaning in order to translate a sentence with structural ambiguity. This includes * Global structural ambiguity problems * Local structural ambiguity problems Solutions to structural ambiguity problems: These problems may be dealt with by linguistic knowledge (in order to resolve local strucural ambiguities), contextual knowledge or world knowledge (in order to resolve global strucural ambiguities).