Fabian M. Suchanek
Probability Theory
A type-safe introduction to probability theory, starring Shrek and Puss.

In all of the following, if a definition is followed by a "special case", it is fully sufficient to know and understand the special case for the purpose of most applications in Information Extraction, including Markov Random Fields, Hidden Markov Models, Markov Chains, and Conditional Random Fields.
Overview

                                      only visible variables   visible and invisible variables
  chains                              Markov Chains            Hidden Markov Models
  complex dependencies and/or
  feature functions                   Markov Random Fields     Conditional Random Fields

Introduction to Probabilities
Def: Universe
A universe Ω (also: sample space) is the set of all possible worlds (also: possible states of the world, outcomes of a random trial).
Special case: Universe
Given n named sets X1, ..., Xn, we use as universe their Cartesian product: Ω = X1 × ... × Xn. Each possible world is a tuple.

Example: X1 = {Shrek, Puss}, X2 = {shouts, purrs}

  world   X1      X2
  w1      Shrek   shouts
  w2      Puss    purrs
  w3      Shrek   purrs
  w4      Puss    shouts
Def: Set of events / Sigma algebra
A sigma algebra Σ (also: set of events) over a universe is a set of subsets of the universe that fulfills certain conditions. It is the set of "allowable events".
Special case: Sigma algebra
We use as sigma algebra the set of all subsets of the universe (the power set): Σ = 2^Ω.
Def: Probability
A probability P (also: probability measure, probability distribution) is a function P: Σ → [0,1] such that
  P(Ω) = 1
  P(A1 ∪ ... ∪ An) = P(A1) + ... + P(An)   for a finite number of disjoint sets A1, ..., An ∈ Σ.
For general (not necessarily disjoint) sets A1, ..., An, we have the "union bound":
  P(A1 ∪ ... ∪ An) ≤ P(A1) + ... + P(An)
...because the Ai may overlap, in which case P(A ∪ B) = P(A) + P(B) - P(A ∩ B) < P(A) + P(B).
Special case: Probability
We use as probability a function P that maps every possible world to a value in [0,1], such that the values sum to 1:

  world   X1      X2       P
  w1      Shrek   shouts   P(w1) = 0.4
  w2      Puss    purrs    P(w2) = 0.3
  w3      Shrek   purrs    P(w3) = 0.1
  w4      Puss    shouts   P(w4) = 0.2
Special case: Probability
For our probability function, we define P(A) = Σ_{w ∈ A} P(w) for A ⊆ Ω.

Example: P({w1, w2}) = P(w1) + P(w2) = 0.7
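To make the special case concrete, here is a minimal Python sketch (not part of the original slides; the world labels w1..w4 and their probabilities are those of the running example):

  # The probability function of the special case: every world maps to [0,1].
  P_world = {"w1": 0.4, "w2": 0.3, "w3": 0.1, "w4": 0.2}

  def prob(event):
      # P(A) for an event A (a set of possible worlds): sum the worlds' probabilities.
      return sum(P_world[w] for w in event)

  assert abs(prob({"w1", "w2"}) - 0.7) < 1e-9   # P({w1,w2}) = 0.4 + 0.3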
Def: Probability space
A probability space is a triple (Ω, Σ, P) of a universe, a sigma algebra, and a probability measure. We assume a fixed probability space from now on.
Def: Random variable
A random variable is a function that takes a possible world and returns a value (also: state).
(The random variable "extracts" a feature from the state of the world.)
Example: Random variable
The random variable A describes the atmosphere of a world:
  A(w1) = scary    A(w2) = cosy    A(w3) = cosy    A(w4) = cosy
(w1 is the only world in which Shrek shouts.)
Are random variables random?
Random variables are
  • neither random
  • nor variables.
                                   [Wikipedia: Random variable]
They take a possible world and return a characteristic of that world.
In our special case, they are just the components of the universe.
Special case: Random variable
In our universe, the named sets X1, ..., Xn can be considered random variables: Xi maps a world to its i-th component. We consider only these as random variables.

Example: X1(w1) = Shrek, X1(w2) = Puss, X1(w3) = Shrek, X1(w4) = Puss
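Continuing the hypothetical Python sketch, worlds can be represented as tuples and the random variables X1 and X2 as functions that extract the components:

  # Worlds as tuples (X1-value, X2-value), as in the example universe.
  worlds = {
      "w1": ("Shrek", "shouts"),
      "w2": ("Puss",  "purrs"),
      "w3": ("Shrek", "purrs"),
      "w4": ("Puss",  "shouts"),
  }

  def X1(w): return worlds[w][0]   # first component of the world
  def X2(w): return worlds[w][1]   # second component of the world

  assert X1("w1") == "Shrek" and X2("w2") == "purrs"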
Def: Events
An event is a subset of the universe.

Example: the event where someone purrs is {w2, w3}.
Special case: Events
We define an event as "X = x" := {w : X(w) = x}.

Example: "X2 = purrs" = {w2, w3}

Note that X = x is not a statement, but a shorthand notation for a set of possible worlds.
Events
An event has a probability: P(X = x) = P({w : X(w) = x}).

Example: P(X1=Shrek) = P({w1, w3}) = 0.4 + 0.1 = 0.5
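In the sketch, the event "X = x" is literally a set of worlds, so its probability is prob() applied to that set (reusing prob, worlds, and X1 from the earlier snippets):

  def event(X, x):
      # The event "X = x": all worlds w with X(w) = x.
      return {w for w in worlds if X(w) == x}

  assert event(X1, "Shrek") == {"w1", "w3"}
  assert abs(prob(event(X1, "Shrek")) - 0.5) < 1e-9   # P(X1=Shrek) = 0.5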
Set operations on events
X = x is a set, and can hence undergo set operations such as intersection (X1=Shrek ∩ X2=purrs), union (X1=Shrek ∪ X2=purrs), and complement.
Syntax: Intersections
For the intersection P(X1=x1 ∩ ... ∩ Xn=xn), we write
  • a comma-separated list: P(X1=x1, ..., Xn=xn)
  • vectors: P(X = x), with X = (X1, ..., Xn) and x = (x1, ..., xn)
  • single values: P(x1, ..., xn)
Def: Conditional probability
The conditional probability of an event A given an event B is
  P(A | B) = P(A ∩ B) / P(B)
With random variables:
  P(X=x | Y=y) = P(X=x, Y=y) / P(Y=y)
This entails: P(A ∩ B) = P(A | B) · P(B).
Intuition: look only at the worlds where B happens; among these, compute the share of probability where A also happens.
Example: Conditional probability
What is P(X1=Shrek | X2=purrs)?
Look only at the worlds where someone purrs: {w2, w3}, with P(w2) = 0.3 and P(w3) = 0.1.
  P(X1=Shrek | X2=purrs) = P(X1=Shrek, X2=purrs) / P(X2=purrs)
                         = P({w3}) / P({w2, w3})
                         = 0.1 / 0.4 = 0.25
If someone purrs, it is unlikely that it's Shrek.
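Continuing the sketch, conditional probability follows directly from the definition (events are Python sets, so the intersection is the "&" operator):

  def cond_prob(A, B):
      # P(A | B) = P(A ∩ B) / P(B)
      return prob(A & B) / prob(B)

  shrek = event(X1, "Shrek")
  purrs = event(X2, "purrs")
  # P(X1=Shrek | X2=purrs) = P({w3}) / P({w2,w3}) = 0.1 / 0.4
  assert abs(cond_prob(shrek, purrs) - 0.25) < 1e-9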
Def: Independence
Two events A and B are independent if
  P(A ∩ B) = P(A) · P(B)
This entails P(A | B) = P(A) and P(B | A) = P(B):
knowing B does not influence the probability of A, and vice versa.
Example: Independence
The events X1=Shrek and X2=shouts are not independent (shouting makes it more likely to be Shrek):
  P(X1=Shrek ∩ X2=shouts) = P({w1}) = 0.4
  P(X1=Shrek) · P(X2=shouts) = 0.5 · 0.6 = 0.3 ≠ 0.4
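The same check in the sketch (reusing shrek and event from above):

  shouts = event(X2, "shouts")
  lhs = prob(shrek & shouts)          # P(X1=Shrek ∩ X2=shouts) = P({w1}) = 0.4
  rhs = prob(shrek) * prob(shouts)    # P(X1=Shrek) * P(X2=shouts) = 0.5 * 0.6 = 0.3
  assert abs(lhs - rhs) > 1e-9        # not equal, hence not independent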
Def: Conditional independence
Two events A and B are conditionally independent given C if
  P(A ∩ B | C) = P(A | C) · P(B | C)
This means: P(A | B ∩ C) = P(A | C).
We now extend the running example with a third variable X3:

  world   X1      X2       X3        P
  w1      Shrek   shouts   loudly    P(w1) = 0.4
  w2      Puss    purrs    quietly   P(w2) = 0.3
  w3      Shrek   purrs    quietly   P(w3) = 0.1
  w4      Puss    shouts   loudly    P(w4) = 0.2

In this example, all events over X1 and X2 are independent given X3.
Chain Rule
From the definition of conditional probability, we can prove the chain rule:
  P(A1 ∩ ... ∩ An) = P(A1) · P(A2 | A1) · P(A3 | A1 ∩ A2) · ... · P(An | A1 ∩ ... ∩ A(n-1))
In the example:
  P(X1=Shrek, X2=purrs, X3=quietly)
    = P(X3=quietly) · P(X2=purrs | X3=quietly) · P(X1=Shrek | X2=purrs, X3=quietly)
    = 0.4 · 1 · 0.25 = 0.1 = P(w3)
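The chain rule can be verified numerically in the sketch, once X3 from the extended table is added (the dictionary below mirrors that table):

  def X3(w):
      # Third variable of the extended example.
      return {"w1": "loudly", "w2": "quietly", "w3": "quietly", "w4": "loudly"}[w]

  A = event(X1, "Shrek")
  B = event(X2, "purrs")
  C = event(X3, "quietly")
  # Chain rule: P(A ∩ B ∩ C) = P(A | B ∩ C) * P(B | C) * P(C)
  lhs = prob(A & B & C)                                  # 0.1
  rhs = cond_prob(A, B & C) * cond_prob(B, C) * prob(C)  # 0.25 * 1 * 0.4
  assert abs(lhs - rhs) < 1e-9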
Chain Rule + Independence
If A and B are independent given C, then
  P(A ∩ B ∩ C) = P(A | C) · P(B | C) · P(C)
In the example, the events over X1 and X2 are independent given X3:
  P(X1=Shrek, X2=purrs, X3=quietly)
    = P(X1=Shrek | X3=quietly) · P(X2=purrs | X3=quietly) · P(X3=quietly)
    = 0.25 · 1 · 0.4 = 0.1
Probabilistic Classification
Classification is the task of predicting the most likely value for one variable (the class) given the values for the others (the features):
  argmax_{x3} P(X3=x3 | X1=x1, X2=x2)
In the example, X3 is the class and X1, X2 are the features; choose the X3 that maximizes the probability:

  world   X1      X2       X3 (class)   P
  w1      Shrek   shouts   loudly       0.4
  w2      Puss    purrs    quietly      0.3
  w3      Shrek   purrs    quietly      0.1
  w4      Puss    shouts   ?            ?
Probabilistic Classification (continued)
How do we compute argmax_{x3} P(X3=x3 | X1=x1, X2=x2)?
The expression is monolithic; we need an independence assumption to break it up.
Can we assume that all variables are independent?
No, because then no given variable would be allowed to correlate with (= be able to predict) the unknown variable.
Probabilistic Classification (continued)
Let's assume instead that the features are independent given the class:
  P(X1=x1, X2=x2 | X3=x3) = P(X1=x1 | X3=x3) · P(X2=x2 | X3=x3)
By the chain rule, the classifier then becomes
  argmax_{x3} P(X1=x1 | X3=x3) · P(X2=x2 | X3=x3) · P(X3=x3)
This is the Naive Bayes assumption.
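Putting the pieces of the sketch together gives a tiny Naive Bayes classifier (naive_bayes is a hypothetical helper; for the demonstration we treat the full table of the earlier slides, including X3(w4)=loudly and P(w4)=0.2, as known):

  def naive_bayes(x1, x2):
      # Predict the class: argmax over x3 of P(X3=x3) * P(X1=x1|X3=x3) * P(X2=x2|X3=x3)
      candidates = {X3(w) for w in worlds}
      def score(x3):
          c = event(X3, x3)
          return prob(c) * cond_prob(event(X1, x1), c) * cond_prob(event(X2, x2), c)
      return max(candidates, key=score)

  print(naive_bayes("Puss", "shouts"))   # -> "loudly" under these assumptions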