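Introduction
An example: suppose two independent hypotheses are to be tested on the same data set, with the significance level set to the usual 0.05. Each of the two tests should then use the stricter threshold of 0.025, that is, 0.05 * (1/2). The method was developed by Carlo Emilio Bonferroni, hence the name Bonferroni correction.
The rationale is as follows: when many hypotheses are tested on the same data set, roughly one in every 20 tests can be expected to reach the 0.05 significance level purely by chance, even if none of the effects is real.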
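The arithmetic behind this can be sketched in a few lines of Python. The function names `bonferroni_threshold` and `prob_any_false_positive` are illustrative and do not come from any library, and the second function assumes that the tests are independent and that every null hypothesis is true.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


def prob_any_false_positive(per_test_alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests tests.

    Assumption: the tests are independent and all null hypotheses are true.
    """
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


if __name__ == "__main__":
    # Two tests at an overall level of 0.05 -> 0.025 per test.
    print(bonferroni_threshold(0.05, 2))
    # 20 uncorrected tests at 0.05 each.
    print(prob_any_false_positive(0.05, 20))
    # 20 tests at the Bonferroni-corrected per-test level.
    print(prob_any_false_positive(bonferroni_threshold(0.05, 20), 20))
```

Under these assumptions, 20 uncorrected tests give roughly a 64% chance of at least one spurious "significant" result, while the corrected per-test level keeps that chance just under 5%.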
Original Wikipedia text
Bonferroni correction
Bonferroni correction states that if an experimenter is testing n independent hypotheses on a set of data, then the statistical significance level that should be used for each hypothesis separately is 1/n times what it would be if only one hypothesis were tested.
For example, to test two independent hypotheses on the same data at 0.05 significance level, instead of using a p value threshold of 0.05, one would use a stricter threshold of 0.025.
The Bonferroni correction is a safeguard against multiple tests of statistical significance on the same data, where 1 out of every 20 hypothesis-tests will appear to be significant at the α = 0.05 level purely due to chance. It was developed by Carlo Emilio Bonferroni.
A less restrictive criterion is the rough false discovery rate giving (3/4)0.05 = 0.0375 for n = 2 and (21/40)0.05 = 0.02625 for n = 20.
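The two values quoted above are consistent with the pattern (n + 1)/(2n) × α. That pattern is inferred here from the two examples and is not stated in the source, so the sketch below (including the hypothetical name `rough_fdr_threshold`) should be read as an assumption. Exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction


def rough_fdr_threshold(alpha: Fraction, n: int) -> Fraction:
    # Assumed pattern, inferred from the two quoted values:
    # (n + 1) / (2n) * alpha
    return Fraction(n + 1, 2 * n) * alpha


if __name__ == "__main__":
    alpha = Fraction(5, 100)
    for n in (2, 20):
        bonferroni = alpha / n  # Bonferroni per-test threshold, for comparison
        rough = rough_fdr_threshold(alpha, n)
        print(n, float(bonferroni), float(rough))
```

For both n = 2 and n = 20 the resulting threshold is larger than the Bonferroni threshold α/n, which is what "less restrictive" means here.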
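Multiple testing problems come up often in data analysis. In 1995 Benjamini proposed a method that addresses them by controlling the false discovery rate (FDR), the expected proportion of false positives among the selected results; a typical requirement is that the FDR not exceed 5%.
According to the theorem Benjamini proved in his paper, the procedure for controlling the FDR is actually very simple.
Suppose there are m candidate genes in total, and the p-values of the genes, sorted in ascending order, are p(1), p(2), ..., p(m).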
The False Discovery Rate (FDR) of a set of predictions is the expected proportion of false predictions in that set. For example, if the algorithm returns 100 genes with a false discovery rate of 0.3, then we should expect 70 of them to be correct.
The FDR is very different from a p-value, and as such a much higher FDR can be tolerated than with a p-value. In the example above, a set of 100 predictions of which 70 are correct might be very useful, especially if there are thousands of genes on the array, most of which are not differentially expressed. In contrast, a p-value of 0.3 is generally unacceptable in any circumstance. Meanwhile, an FDR as high as 0.5 or even higher might be quite meaningful.
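The 100-gene example can be written out as a small Python sketch. The helper name `expected_counts` is hypothetical, and the result is an expectation, not a guarantee about any single gene list.

```python
from typing import Tuple


def expected_counts(n_predictions: int, fdr: float) -> Tuple[float, float]:
    """Return (expected false predictions, expected correct predictions)."""
    if not 0.0 <= fdr <= 1.0:
        raise ValueError("fdr must lie in [0, 1]")
    expected_false = n_predictions * fdr
    return expected_false, n_predictions - expected_false


if __name__ == "__main__":
    false_hits, true_hits = expected_counts(100, 0.3)
    print(f"expected false: {false_hits:.0f}, expected correct: {true_hits:.0f}")
```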
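The FDR error-control method was proposed by Benjamini in 1995; it determines the p-value threshold by controlling the FDR (False Discovery Rate). Suppose you select R genes as differentially expressed, of which S are truly differentially expressed and the remaining V are not, i.e. they are false positives. In practice one wants the error proportion Q = V/R not to exceed, on average, some preset value (for example 0.05); in statistical terms this is equivalent to controlling the FDR at no more than 5%.
Sort the p-values of all candidate genes in ascending order. To keep the FDR at or below q, find the largest positive integer i such that p(i) <= (i*q)/m. Then select the genes corresponding to p(1), p(2), ..., p(i) as differentially expressed; this statistically guarantees that the FDR does not exceed q. Accordingly, the FDR-adjusted p-value is computed as follows:
adjusted p(i) = p(i) * m / i, which in R notation is p * length(p) / rank(p)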
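The selection rule (find the largest i with p(i) <= i*q/m, then keep the i smallest p-values) can be sketched in Python as follows. This is a minimal illustration under the definitions above; the function name `bh_select` is hypothetical.

```python
def bh_select(pvalues, q):
    """Indices (into pvalues) of the hypotheses selected at FDR level q."""
    m = len(pvalues)
    # Positions of the p-values in ascending order: order[0] is the smallest.
    order = sorted(range(m), key=lambda k: pvalues[k])
    largest_i = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            largest_i = rank  # keep the LARGEST rank that satisfies the bound
    # Everything ranked at or below largest_i is selected.
    return sorted(order[:largest_i])


if __name__ == "__main__":
    print(bh_select([0.0003, 0.0001, 0.02], q=0.05))
    print(bh_select([0.0003, 0.0001, 0.02], q=0.01))
    print(bh_select([0.9, 0.02, 0.03], q=0.05))
```

In the last call the p-value 0.02 is above its own bound (1 * 0.05 / 3) but is still selected, because 0.03 at rank 2 is below its bound (2 * 0.05 / 3). The rule looks for the largest qualifying rank and does not stop at the first failure.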
References
1. Audic, S. and J. M. Claverie (1997). The significance of digital gene expression profiles. Genome Res 7(10): 986-95.
2. Benjamini, Y. and D. Yekutieli (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics 29: 1165-1188.
Calculation method: see the p.adjust function in the R statistical software:
> p<-c(0.0003,0.0001,0.02)
> p
[1] 3e-04 1e-04 2e-02
>
> p.adjust(p,method="fdr",length(p))
[1] 0.00045 0.00030 0.02000
>
> p*length(p)/rank(p)
[1] 0.00045 0.00030 0.02000
> length(p)
[1] 3
> rank(p)
[1] 2 1 3
> sort(p)
[1] 1e-04 3e-04 2e-02
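For readers without R at hand, the same adjustment can be sketched in Python. In the session above the plain formula p * length(p) / rank(p) happens to agree with p.adjust, but R's p.adjust with method "fdr" additionally forces the adjusted values to be non-decreasing in p (a running minimum taken from the largest p-value downwards) and caps them at 1. The function name `bh_adjust` below is hypothetical, and the sketch does not handle missing values.

```python
def bh_adjust(pvalues):
    """FDR-adjusted p-values, returned in the original input order."""
    m = len(pvalues)
    # Visit p-values from largest to smallest so a running minimum can be kept.
    order = sorted(range(m), key=lambda k: pvalues[k], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0  # also acts as the cap at 1
    for position, idx in enumerate(order):
        rank = m - position  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Same input as the R session above.
    print(bh_adjust([0.0003, 0.0001, 0.02]))
    # A case where the running minimum matters: the plain formula would
    # give 0.03 * 3 / 2 = 0.045 for the middle value.
    print(bh_adjust([0.01, 0.04, 0.03]))
```

For the first input the result matches the R output up to floating-point rounding. For the second input the value for 0.03 is pulled down to 0.04, which the plain formula would not do.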