Discrete Probability, Part 2

Random Variables

Definition

A random variable (r.v.) X is a function from the sample space of an experiment to the set of real numbers $\mathbb{R}$, i.e.,
$\forall \omega \in \Omega, X(\omega) = x \in \mathbb{R}$

$X$ is a function from the sample space to the real numbers: every $\omega \in \Omega$ is mapped by $X$ to a value $x$, and that $x$ is unique.

  • Note that a random variable is a function. It is not a variable, and it is not random!
    • A random variable is neither random nor a variable; it is a function that maps each $\omega$ in the sample space to a real number.
    • Why is it called a random variable, then? Because before the experiment is carried out, whether a given $\omega$ occurs is uncertain.
    • In other words, the value assigned to each outcome is fixed, but which outcome occurs is not.
  • We usually use notation X,Y , etc. to represent a r.v., and x,y to represent the numerical values. For example, X = x means that r.v. X has value x.
    • Note that uppercase letters denote random variables, while lowercase letters denote their values.
  • The range of the function can be countable or uncountable. If it is countable, the random variable is a discrete r.v.; otherwise it is a continuous r.v.
    • If $x$ takes values in a continuous space, we call $X$ a continuous random variable.
      • For example, the time spent waiting for a bus is a continuous random variable.
    • Otherwise $x$ takes values in a discrete space, and we call $X$ a discrete random variable.
      • For example, when queuing at the canteen, the number of people ahead of me is discrete, while the time I will spend queuing is continuous.

Examples

  • A coin is tossed. If X is the r.v. whose value is the number of heads obtained, then
    • $X(H) = 1, X(T) = 0$
  • Now the coin is tossed twice. We define the sample space $\Omega = \{HH, HT, TH, TT\}$. If $X$ is the r.v. whose value is the number of heads obtained, then
    • $X(HH) = 2, X(HT) = X(TH) = 1, X(TT) = 0$
  • When a player rolls a die, he wins $1 if the outcome is 1, 2, or 3, and otherwise loses $1. Let $\Omega = \{1, 2, 3, 4, 5, 6\}$ and define X as follows:
    • $X(1) = X(2) = X(3) = 1; X(4) = X(5) = X(6) = -1$
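
Because a random variable is just a function on the sample space, it can be written out explicitly; a minimal Python sketch of the two-toss example above:

```python
# A random variable is a function from the sample space to the reals.
# Here: X = number of heads in two coin tosses.
X = {"HH": 2, "HT": 1, "TH": 1, "TT": 0}

# With equally likely outcomes, P(X = 1) is the fraction of outcomes mapped to 1.
p_one_head = sum(1 for omega in X if X[omega] == 1) / len(X)
print(p_one_head)  # 0.5
```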

Why define a random variable at all?

For example, if we toss a coin 100 times, the sample space has $2^{100}$ outcomes. But if we define the random variable to be the number of heads, it takes only 101 possible values, so the problem is greatly simplified.

Beyond simplification, consider the two outcomes {HTTT….T} and {THTT…T}. Without a random variable there is no way to do arithmetic on them; once they are mapped to numbers we can apply the four arithmetic operations. This is the original motivation for defining random variables.

Random variables vs. events

Suppose now that a sample space $\Omega = \{\omega_1, \omega_2, \dots, \omega_n\}$ is given, and the r.v. $X$ on $\Omega$
is defined as the number of heads obtained when we toss a coin twice.

  • Event E1 represents exactly one head obtained. Hence,
    $E_1 = \{\omega : X(\omega) = 1\}$
  • Event E2 represents an even number of heads obtained. Hence,
    $E_2 = \{\omega : X(\omega) \bmod 2 = 0\}$
  • Event E3 represents at least one head obtained. Hence,
    $E_3 = \{\omega : X(\omega) \geq 1\}$

These indicate that we can also define probabilities about r.v.s.
In other words, defining a random variable is equivalent to defining events: we can compute either the probability of an event or the probability that a random variable takes a value, so computationally the two views are equivalent.

Distribution

Definition: The distribution of a r.v. X on a sample space is the set of pairs $(r, P(X = r))$ for all $r \in X(\Omega)$, where $P(X = r)$ is the probability that r.v. $X$ takes value $r$. That is, the set of pairs in this distribution is determined by the probabilities $P(X = r)$ for $r \in X(\Omega)$.

$r$ ranges over the set of values of the random variable, and $P(X = r)$ is the probability that $X$ takes the value $r$; a distribution is essentially a lookup table.

For example, for 100 coin tosses we can list the probability of 0 heads, the probability of 1 head, and so on.

Notes:

  • A distribution is in effect a function.
  • If we define $f(x) = P(X = x)$ for each value $x$ in the range of $X$, then
  • we can say $f(x)$ is a probability distribution (function) if
    • $f(x)\geq0$
    • $\sum_xf(x)=1$

Example

For instance, toss a fair coin twice and let $X$ be the number of heads; the distribution of $X$ is the set of pairs $\{(0, 1/4), (1, 1/2), (2, 1/4)\}$.

Joint and marginal probability distributions

Definition

Let X and Y be two r.v.s,

  • $f(x, y) = P(X = x \land Y = y)$ is the joint probability distribution;
  • $f_X(x)$ is the marginal probability distribution for r.v. $X$.
  • The marginal probability is obtained by summing the joint probability: to get $f_X(x)$, sum over all values $y_i$ of $Y$, i.e., $f_X(x) = \sum_y f(x, y)$.
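
A minimal Python sketch of this relationship, using a made-up joint table purely for illustration:

```python
# Hypothetical joint distribution f(x, y) over X in {0, 1}, Y in {0, 1}.
joint = {
    (0, 0): 0.1, (0, 1): 0.2,
    (1, 0): 0.3, (1, 1): 0.4,
}

# Marginal f_X(x): sum the joint probabilities over all values of y.
f_X = {}
for (x, y), p in joint.items():
    f_X[x] = f_X.get(x, 0.0) + p

print(f_X)  # {0: 0.3, 1: 0.7}
```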

Independence of r.v.s

  • If random variables are pairwise or mutually independent, their joint probability equals the product of the marginals: $f(x, y) = f_X(x) \cdot f_Y(y)$.

Independence of r.v. Cont’d

If X and Y are independent, the conditional probability equals the unconditional one: $P(X = x \mid Y = y) = P(X = x)$.

Example

Bernoulli Trials and Binomial distribution

Definition

  • Each performance of an experiment with two possible outcomes is called a Bernoulli trial.

    • 凡是只有两个结果的,都叫做贝努力实验
  • In general, a possible outcome of a Bernoulli trial is called a success or a failure.

  • If p is the probability of a success and q is the probability of a failure, it follows that p + q = 1.
    • Knowing the probability of success immediately gives the probability of failure.
  • Many problems can be solved by determining the probability of k successes when an experiment consists of n mutually independent Bernoulli trials.

Mutually independent Bernoulli trials

Question: A coin is biased so that the probability of heads is 2/3. What is the probability that exactly four heads come up when the coin is flipped seven times, assuming that the flips are independent?

Solution: Choose which 4 of the 7 flips are heads in $C(7,4)$ ways; each such outcome has probability $(2/3)^4(1/3)^3$. Hence the answer is $C(7,4)(2/3)^4(1/3)^3 = \frac{35\cdot 16}{3^7} = \frac{560}{2187}\approx 0.256$.

Theorem

The probability of exactly k successes in n independent Bernoulli trials, with probability of success p and probability of failure q = 1 - p, is $C(n, k)\,p^k q^{n-k}$.

We call this the binomial distribution.

Binomial distribution

$X$ ~ $Bin(n,p)$

$P(X=k) = C(n,k)p^k(1-p)^{n-k}$

What do the distribution plots look like for different values of p?

We can see that when p = 0.5 the distribution is symmetric about its center.

When p is small and n is fairly large, the peak shifts toward the y-axis (small k), as the expression itself suggests.

Example: roll a die, counting an outcome of 3 as a success and anything else as a failure, so the number of successes in n rolls follows $Bin(n, 1/6)$; see the sketch below.
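
A quick numerical check of the two examples above, using nothing but the PMF $P(X=k) = C(n,k)p^k(1-p)^{n-k}$:

```python
from math import comb

def binom_pmf(n: int, k: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Biased coin: P(exactly 4 heads in 7 flips), p = 2/3.
print(binom_pmf(7, 4, 2/3))   # 0.2560... = 560/2187

# Die example: P(exactly 2 threes in 10 rolls), p = 1/6.
print(binom_pmf(10, 2, 1/6))  # ~0.2907
```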

Bernoulli distribution

The Bernoulli distribution is the special case of the binomial distribution where only one trial is performed: toss a coin once, roll a die once. A binomial distribution is thus built from many independent Bernoulli trials.

Geometric distribution

The binomial and geometric distributions focus on different things:

  • The binomial distribution focuses on how many of n trials succeed (or fail).
  • The geometric distribution keeps running the Bernoulli trial and asks when the first success occurs.
    The geometric distribution can sometimes describe the lifetime of an object, e.g., the number of times a light bulb is switched on before it burns out.

Let r.v. Y be the number of coin flips until the first head is obtained.

Let G(k; p) denote the probability of exactly k - 1 failures before the first success in independent Bernoulli trials with probability of success p and probability of failure q = 1 - p.

We call this function the geometric distribution, i.e.,

$G(k; p) = P(Y = k) = (1-p)^{k-1}\,p, \qquad k = 1, 2, 3, \dots$
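
A small simulation to sanity-check the formula; the empirical frequencies should approach $(1-p)^{k-1}p$:

```python
import random

def first_success(p: float) -> int:
    """Run Bernoulli(p) trials until the first success; return the trial index."""
    k = 1
    while random.random() >= p:  # failure: keep going
        k += 1
    return k

p, trials = 0.3, 100_000
samples = [first_success(p) for _ in range(trials)]
for k in range(1, 5):
    empirical = samples.count(k) / trials
    exact = (1 - p) ** (k - 1) * p
    print(k, round(empirical, 4), round(exact, 4))
```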

Collision in hashing

Question:

Hashing functions map a large universe of keys (such as the approximately 300 million Social Security numbers in the United States) to a much smaller set of storage locations. A good hashing function yields few collisions, which are mappings of two different keys to the same memory location. What is the probability that no two keys are mapped to the same location by a hashing function, or, in other words, that there are no collisions?

Solution:
To calculate this probability, we assume that a randomly and uniformly selected key is mapped to any given location with probability 1/m, where m is the number of available locations. Suppose that the keys are $k_1, k_2, \dots, k_n$. When we add a new record $k_i$, the probability that it is mapped to a location different from the locations of the already hashed records, i.e., that $h(k_i) \neq h(k_j)$ for $1 \leq j < i$, is $(m - i + 1)/m$.

When I insert the i-th key, there are $m - (i - 1)$ empty locations left, so the probability of no hash collision at this step is $(m - i + 1)/m$.

This can also be viewed as a geometric-distribution-style process: keep inserting keys until the first hash collision occurs.

Because the keys are independent, the probability that all n keys are mapped to different locations is

$H(n, m) = \frac{m-1}{m}\cdot\frac{m-2}{m}\cdots\frac{m-n+1}{m} = \prod_{i=1}^{n}\frac{m-i+1}{m}$

H(n, m) is the probability that no collision occurs, so 1 - H(n, m) is the probability that a collision occurs.

For instance, given a threshold of 0.5: for what n and m does the collision probability exceed 0.5? A small search is sketched below.
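
A minimal sketch that, for a given m, searches for the smallest n making the collision probability exceed the threshold (for m = 365 this is the classic birthday problem):

```python
def no_collision_prob(n: int, m: int) -> float:
    """H(n, m): probability that n keys hash to n distinct locations out of m."""
    prob = 1.0
    for i in range(1, n + 1):
        prob *= (m - i + 1) / m
    return prob

def smallest_colliding_n(m: int, threshold: float = 0.5) -> int:
    n = 1
    while 1 - no_collision_prob(n, m) <= threshold:
        n += 1
    return n

print(smallest_colliding_n(365))        # 23, the classic birthday problem
print(smallest_colliding_n(1_000_000))  # ~1178
```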

Example 2

Question: Let X be the r.v. of the first success in independent Bernoulli trials with probability of success p. Compute the value of

$\sum_{k=1}^{n} P(X=k)$

Solution: $\sum_{k=1}^{n}(1-p)^{k-1}p = p\cdot\frac{1-(1-p)^n}{1-(1-p)} = 1-(1-p)^n$

Monte Carlo algorithms

Monte Carlo methods in the setting of the geometric distribution

A Monte Carlo method is a randomized algorithm.

A Monte Carlo algorithm is a randomized or probabilistic algorithm whose output may be inaccurate with a certain (typically small) probability.

Probabilistic algorithms make random choices at one or more steps and may produce different outputs even on the same input, unlike deterministic algorithms.

Monte Carlo algorithm for a decision problem: The probability that the algorithm answers the decision problem correctly increases as more tests are carried out.

We will show that the possibility of making a mistake becomes extremely unlikely as the number of tests increases.

That is, as soon as a single iteration answers "true", the algorithm outputs "true" and is correct; it answers "false" only when every iteration is inconclusive.

Suppose that p is the probability that the response of a test is “true” given that the answer is “true”.

Let us look at the probability that the algorithm is correct as the number of iterations grows.

Because the algorithm answers "false" when all n iterations yield the answer "unknown" and the iterations perform independent tests, the probability of error is $(1 - p)^n$.

When the probability of success is nonzero: if the first n tests all return "unknown", that event has probability $(1 - p)^n$, so as n grows this probability shrinks and eventually tends to 0.

When $p \neq 0$, this probability approaches 0 as the number of tests increases. Consequently, the probability that the algorithm answers "true" when the answer is "true" approaches 1.

In other words, by iterating we can push the probability that the algorithm answers correctly up toward 1. A tiny numerical illustration follows.
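
A minimal sketch of this pattern; `one_test` here is a hypothetical stand-in for a real probabilistic test (e.g., one round of a primality test) that detects a true instance with probability p and otherwise reports "unknown":

```python
import random

def one_test(p: float) -> bool:
    # Hypothetical test: returns True with probability p when the real
    # answer is "true", otherwise returns False ("unknown").
    return random.random() < p

def monte_carlo_answer(p: float, n: int) -> bool:
    """Answer "true" if any of n independent tests succeeds."""
    return any(one_test(p) for _ in range(n))

# Error probability (answering "false" on a true instance) is (1 - p)^n.
p = 0.5
for n in (1, 10, 20):
    print(n, (1 - p) ** n)  # 0.5, ~0.00098, ~9.5e-7
```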

Second application

Computing $\pi$

If $P_i$ lies in S, the indicator takes the value 1; otherwise it takes 0.

So $1_S(P_i)$ is a 0-1 random variable, i.e., a Bernoulli trial, and so is the other indicator; a sketch of the resulting estimator follows.
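
A minimal sketch, assuming the standard setup: points drawn uniformly in the unit square and S the quarter disk of radius 1, so the hit fraction estimates $\pi/4$:

```python
import random

def estimate_pi(n: int) -> float:
    """Monte Carlo estimate of pi: fraction of uniform points in the
    unit square that fall inside the quarter disk x^2 + y^2 <= 1."""
    hits = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1          # 1_S(P_i) = 1
    return 4 * hits / n        # E[1_S] = pi / 4

print(estimate_pi(1_000_000))  # ~3.14, typically within a few thousandths
```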

So how do we judge whether the Monte Carlo estimate is actually accurate?

Sampling from a discrete distribution

We simulate the randomly generated points by sampling on a computer, letting the computer "toss the coin". There are two sampling methods.

Suppose the discrete distribution is $(0.1, 0.2, 0.3, 0.4)$.

CDF sampling:

This method lays all the probabilities end to end on a single bar, as in the accompanying figure. When we generate a random number between 0 and 1, the segment of the bar it falls into tells us which value the random variable takes; that is, we must find the random number's "home" on the bar.

How do we find that "home"? We can build a binary tree over the cumulative boundaries and decide by comparisons which node's interval the random number belongs to.

But the time cost grows: the figure shows only 4 values, and with n possible values every sample must locate the random number's interval, which takes $O(\log_2 n)$ time.
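
A minimal sketch of CDF sampling with binary search over the cumulative sums:

```python
import bisect
import random
from itertools import accumulate

probs = [0.1, 0.2, 0.3, 0.4]           # the example distribution
cdf = list(accumulate(probs))          # [0.1, 0.3, 0.6, 1.0]

def cdf_sample() -> int:
    """O(log n) per sample: binary-search the CDF for a uniform draw."""
    u = random.random()
    return bisect.bisect_right(cdf, u)  # index of the segment u falls into

# Empirical check: frequencies should approach 0.1, 0.2, 0.3, 0.4.
counts = [0, 0, 0, 0]
for _ in range(100_000):
    counts[cdf_sample()] += 1
print([c / 100_000 for c in counts])
```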

This is why the alias sampling method was proposed.

Alias sampling:

The alias method works as follows:

1. Lay the discrete distribution out as a row of columns, as in the first panel of the figure. If we generate a random point in the bounding rectangle, a large part of it lands in the blank area, which is wasted; when the maximum probability (here 0.4) is large, this is hopelessly inefficient, so we redesign the region.
2. How? Count the number of possible outcomes, then rescale each column to (number of outcomes × its probability), so the average column height becomes 1, as in the second panel.
3. After this rescaling the probabilities are unchanged, but we can now cut and repack: splitting at height 1, the tall columns' excess is stacked onto the short ones, turning the second panel into the third. Exactly how to cut is up to you, as long as each column ends up holding at most two values.
4. With alias sampling the time cost drops dramatically: for each sample we only need to decide which value the landing region corresponds to, so sampling is $O(1)$; see the sketch below.
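
A compact sketch of the alias-table construction (Vose's variant) and the resulting O(1) sampler:

```python
import random

def build_alias_table(probs):
    """Vose's alias method: O(n) preprocessing."""
    n = len(probs)
    scaled = [p * n for p in probs]            # step 2: average height becomes 1
    prob_table, alias = [0.0] * n, [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob_table[s], alias[s] = scaled[s], l  # top up column s with value l
        scaled[l] -= 1.0 - scaled[s]            # step 3: move the excess of l
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                     # leftovers have height exactly 1
        prob_table[i] = 1.0
    return prob_table, alias

def alias_sample(prob_table, alias):
    """O(1) per sample: pick a column, then one of its (at most two) values."""
    i = random.randrange(len(prob_table))
    return i if random.random() < prob_table[i] else alias[i]

prob_table, alias = build_alias_table([0.1, 0.2, 0.3, 0.4])
counts = [0] * 4
for _ in range(100_000):
    counts[alias_sample(prob_table, alias)] += 1
print([c / 100_000 for c in counts])  # ~[0.1, 0.2, 0.3, 0.4]
```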

Comprehensive exercise

Question: We have two boxes. The first contains two green balls and seven red balls; the second contains four green balls and three red balls. Bob selects a ball by first choosing one of the two boxes at random. He then selects one of the balls in this box at random.

If Bob has selected a red ball, what is the probability that he selected a red ball from the first box?

Solution: Let E be the event that Bob has chosen a red ball. Let F and $\overline F$ be the events that Bob has chosen a ball from the first box and the second box, respectively.

We want to find P(F|E), the probability that the ball Bob selected came from the first box, given that it is red.

Bob first chooses one of the two boxes at random, then picks a ball at random from the chosen box. Given that he picked a red ball, we are asked for the probability that he chose the first box.

By definition, $P(F|E) = \frac{P(F\cap E)}{P(E)}$

Event F is choosing the first box; event E is drawing a red ball.

Our goal is to compute $P(F\cap E)$ and $P(E)$.

Since F is the box choice and both boxes are equally likely, $P(F) = P(\overline F) = 1/2$.

To find P(E) we use the law of total probability: $P(E) = \sum_{i=1}^{n}P(F_i)\cdot P(E|F_i)$

Then $P(E|F) = 7/9$ (7 of the 9 balls in box 1 are red) and $P(E|\overline F) = 3/7$ (3 of the 7 balls in box 2 are red), so $P(F\cap E) = P(F)\,P(E|F) = \frac{1}{2}\cdot\frac{7}{9} = \frac{7}{18}$.

Hence $P(E) = \frac{7}{18}+\frac{3}{14} = \frac{49+27}{126} = \frac{38}{63}$

Therefore $P(F|E) = \frac{P(F\cap E)}{P(E)} = \frac{7/18}{38/63} = \frac{49}{76}\approx 0.645$.
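
A quick verification with exact rational arithmetic:

```python
from fractions import Fraction as F

P_F, P_E_given_F = F(1, 2), F(7, 9)        # box 1: 7 red of 9
P_notF, P_E_given_notF = F(1, 2), F(3, 7)  # box 2: 3 red of 7

P_E = P_F * P_E_given_F + P_notF * P_E_given_notF
print(P_F * P_E_given_F / P_E)  # 49/76
```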

Note that the problem above essentially walks through the proof of Bayes' theorem, which we now introduce.

Bayes' Theorem

If E and F are events with $P(E) > 0$ and $P(F) > 0$, then

$P(F|E) = \frac{P(E|F)P(F)}{P(E|F)P(F)+P(E|\overline F)P(\overline F)}$

Proof

By definition $P(F|E) = \frac{P(E\cap F)}{P(E)}$ and $P(E|F) = \frac{P(E\cap F)}{P(F)}$, so $P(E\cap F) = P(E|F)P(F)$. By the law of total probability, $P(E) = P(E|F)P(F)+P(E|\overline F)P(\overline F)$; substituting both gives the theorem.

Generalized Bayes' theorem

If $F_1, F_2, \dots, F_n$ partition the sample space and $P(E) > 0$, then

$P(F_j|E) = \frac{P(E|F_j)P(F_j)}{\sum_{i=1}^{n}P(E|F_i)P(F_i)}$

Applications:

First type of application:

Example 1:

Question:

Suppose that one of 100,000 persons has a particular rare disease for which there is a fairly accurate diagnostic test. This test is correct 99.0% of the time when given to a person selected at random who has the disease; it is correct 99.5% of the time when given to a person selected at random who does not have the disease. Given this information, can we find

  • the probability that a person who tests positive for the disease has the disease?

  • the probability that a person who tests negative for the disease does not have the disease?

    Should a person who tests positive be very concerned that he or she has the disease?

Solution:

Let

F be the event that a person selected at random has the disease

E be the event that a person selected at random tests positive
for the disease. Hence, we have $P(F) = 1/100000 = 10^{-5}$, $P(E|F) = 0.99$, and $P(E|\overline F) = 1 - 0.995 = 0.005$. Bayes' theorem then gives $P(F|E) \approx 0.002$ and $P(\overline F|\overline E) \approx 0.9999999$, so a person who tests positive should not yet be overly concerned: the probability of actually having the disease is only about 0.2%.

Method for Bayes word problems:

First determine what the events $F$, $E$, and $\overline F$ each represent.

Then find $P(F)$, $P(E|F)$, and $P(E|\overline F)$.

Finally substitute into Bayes' theorem; a quick check is sketched below.
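
A minimal sketch of that recipe applied to Example 1:

```python
def bayes(p_F: float, p_E_given_F: float, p_E_given_notF: float) -> float:
    """P(F | E) via Bayes' theorem."""
    num = p_E_given_F * p_F
    return num / (num + p_E_given_notF * (1 - p_F))

# Rare-disease test: P(F) = 1e-5, P(E|F) = 0.99, P(E|notF) = 0.005.
print(bayes(1e-5, 0.99, 0.005))      # ~0.002: positive test, still likely healthy
# P(notF | notE): apply the same formula to the complementary events.
print(bayes(1 - 1e-5, 0.995, 0.01))  # ~0.9999999
```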

Example 2:

Question:

When a test for steroids is given to soccer players, 98% of the players taking steroids test positive and 12% of the players not taking steroids test positive. Suppose that 5% of soccer players take steroids. What is the probability that a soccer player who tests positive takes steroids?

Solution:

Let

E: a soccer player who tests positive

F: a soccer player who takes steroids

$\overline F$: a soccer player who doesn’t take steroids

Thus,

$E|F$ : a soccer player who takes steroids tests positive

$E|\overline F$ : a soccer player who doesn't take steroids tests positive.

$P(F) = 5\%$

$P(\overline F) = 95\%$

$P(E|F)=98\%$

$P(E|\overline F) = 12\%$

Substituting, $P(F|E) = \frac{P(E|F)P(F)}{P(E|F)P(F)+P(E|\overline F)P(\overline F)} = \frac{0.98\times 0.05}{0.98\times 0.05+0.12\times 0.95} = \frac{0.049}{0.163} = \frac{49}{163}$

Second type of application:

Question: How to detect spam email?
Solution: Bayesian spam filters look for occurrences of particular words in messages. For a particular word w, the probability that w appears in a spam e-mail message is estimated by counting the number of times w appears in a large set of messages known to be spam and the number of times it appears in a large set of messages known not to be spam.

We pick some "sensitive" words and analyze, for each, the probability that it appears in spam messages and the probability that it appears in non-spam messages.

Step 1: Collect ground truth. Suppose we have a set B of messages known to be spam and a set G of messages known not to be spam.

Ground truth: we label some messages as spam and tell the computer, which then learns the model from them.

So we need to collect examples of what is spam and what is not. For instance, messages that were replied to or forwarded are mostly not spam, while messages already in the junk folder are mostly spam.

In Step 2, look at how often the word w appears in B and in G; that is, compute the two conditional probabilities.

Then we start computing with Bayes' theorem. Since $P(\overline F|E)+P(F|E) = 1$, it suffices to compute one of them.

We have no way to count all spam and non-spam messages or their overall probabilities, so we assume each kind occurs with probability 0.5. We then obtain $r(w) = \frac{p(w\mid\text{spam})}{p(w\mid\text{spam})+p(w\mid\text{not spam})}$; if $r(w)$ is much greater than 1/2, the message is very likely spam.

What if there are many, many words? We can segment the message with a tokenizer (such as jieba) and then obtain the Bayes formula below: assuming equal priors and conditional independence,

$P(S\mid w_1,\dots,w_k) = \frac{\prod_{i=1}^{k}p(w_i\mid S)}{\prod_{i=1}^{k}p(w_i\mid S)+\prod_{i=1}^{k}p(w_i\mid G)}$

  • The left-hand side is the probability that the message is spam given that the words $w_1, w_2, \dots, w_k$ all appear in it.

  • The right-hand side:

    • The numerator is a joint probability: given that the message really is spam, the probability that all these spam words appear. These quantities are learned from the ground truth.
    • The denominator is the probability that these words appear in spam messages plus the probability that they appear in non-spam messages.

Because the $E_i$ are assumed conditionally independent given S, the joint probability can be computed as a product: the probability that all these spam words appear together in a spam message.

This method is called naive Bayes; the "naive" part is the strong assumption that the words are independent of one another. There is also a numerical pitfall: when a message is very long, say 1000 words, the numerator is a product of 1000 numbers smaller than 1, which can fall below the computer's floating-point range and evaluate to 0. The usual fix is to take logarithms, sum them, and exponentiate at the end, as sketched below.
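
A minimal sketch of log-space scoring; the word probabilities here are made up purely for illustration (in practice they would be estimated from B and G):

```python
from math import exp, log

# Hypothetical per-word estimates learned from B (spam) and G (non-spam).
p_spam = {"free": 0.8, "winner": 0.6, "meeting": 0.1}
p_ham  = {"free": 0.1, "winner": 0.05, "meeting": 0.4}

def spam_probability(words) -> float:
    """P(spam | words) under equal priors, computed in log space
    to avoid underflow when the product has many factors."""
    log_s = sum(log(p_spam[w]) for w in words)
    log_h = sum(log(p_ham[w]) for w in words)
    # r = prod_s / (prod_s + prod_h), rewritten stably via logs.
    return 1.0 / (1.0 + exp(log_h - log_s))

print(spam_probability(["free", "winner"]))  # ~0.99: very likely spam
```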
