基于INLA算法的贝叶斯推断及应用

摘要 iv
ABSTRACT（英文摘要） iv
第一章序言 1
§1.1 研究背景 1
§1.2 国内外研究现状 2
§1.3 本文主要内容 2
第二章预备内容 4
§2.1 贝叶斯方法 4
2.1.1贝叶斯方法定义 4
2.1.2先验分布的选取 5
§2.2 Logistic 回归模型 6
2.2.1广义线性模型 6
2.2.2logit 链接函数 6
§2.3 数据介绍 6
§2.4 数据可视化与预处理 7
第三章马尔科夫链蒙特卡洛方法 (MCMC) 9
§3.1 蒙特卡洛积分 9
§3.2 马尔科夫链 10
§3.3 Metropolis-Hastings 算法 10
第四章集成嵌套拉普拉斯近似 (INLA) 方法 12
§4.1 拉普拉斯近似 12
§4.2 隐高斯模型 13
§4.3 高斯马尔科夫随机场 (GMRF) 14
4.3.1有向图和无向图 14
4.3.2高斯马尔科夫随机场 14
§4.4 集成嵌套拉普拉斯近似 15
第五章实证分析与预测 19
§5.1 Logistic 回归建模 19
§5.2 基于 MCMC 方法的贝叶斯推断 19
§5.3 基于 INLA 方法的贝叶斯推断 21
第六章 INLA 方法与 MCMC 方法对比 23
§6.1 INLA 方法与 MCMC 方法对比 23
6.1.1参数估计结果对比 23
6.1.2DIC 值对比 24
6.1.3运行时间对比 24
§6.2 预测结果对比 25
6.2.1各指标值对比 26
6.2.2各曲线图对比 27
第七章总结与展望 28
参考文献 28
附录 A 附录 33
§A.1 R 代码 33
§A.2 SAS 代码 36
致谢 38
本文的研究安排为：

第一章介绍贝叶斯理论研究背景以及 MCMC 方法的发展，与 MCMC、INLA 相关的国内外研究现状以及论文全篇的结构安排。
第二章介绍预备知识，包括贝叶斯方法的内容与发展、Logistic 回归模型以及在实证分析中所使用的数据集。本文使用一个预测学生能否被研究生院校录取的数据集，响应变量为可被录取的概率，解释变量为该学生的各项成绩和材料评分等。本章最后对此数据集进行可视化和预处理。
第三章主要介绍马尔科夫链蒙特卡洛方法，包含基本内容、蒙特卡洛积分和马尔科夫链，最后介绍使用 MCMC 时所采用的抽样方法。
第四章主要介绍集成嵌套的拉普拉斯近似方法。首先介绍拉普拉斯近似，其次介绍 INLA 方法适用的模型：隐高斯模型，并介绍高斯马尔科夫随机场，最后总结 INLA 方法，基于前面内容提出处理参数后验的新计算方式，使用一系列近似与数值积分得到参数后验。
第五章进行实证分析，对实证所用的数据进行 Logistic 回归模型建模，并分别基于 MCMC 方法和 INLA 方法进行参数的贝叶斯估计，得到参数的后验均值、标准差和各分位数估计，以及近似的后验分布。本文使用 SAS 软件实现 MCMC 方法，使用 R 软件的 INLA 包实现 INLA 方法。
第六章将第五章得到的结果进行对比，主要从参数估计结果、DIC 值和运行时间三个方面进行对比，观察 INLA 的估计结果是否精确且高效。并将 INLA 方法估计的参数结果代回模型进行预测，并与 KNN、随机森林和 XGboost 三种机器学习方法进行混淆矩阵及衍生指标和曲线的对比。

library(INLA)
data(Seeds)
?Seeds
head(Seeds)
# - r is the number of seed germinated (successes)
# - n is the number of seeds attempted (trials)
# - x1 is the type of seed
# - x2 is the type of root extract
# - plate is the numbering of the plates/experiments

# All the covariates are 0/1 factors, and the numbering of the plates are arbitrary. We do not re-scale any covariates. The observations are integers, so we do not re-scale these either.

df = data.frame(y = Seeds$r, Ntrials = Seeds$n, Seeds[, 3:5])
# Explore data
summary(df)
table(df$x1)
table(df$x2)
plot(df)

##########################
##### Likehood
##########################
family1 = "binomial"
control.family1 = list(control.link=list(model="logit"))
# number of trials is df$Ntrials

# This specifies which likelihood we are going to use for the observations. The binomial distribution is defined with a certain number of trials, in INLA known as Ntrials. If there were hyper-parameters in our likelihood, we would specify the priors on these in control.family
# 这指定了我们将使用哪种可能性进行观测。二项分布由一定数量的试验定义，在INLA中称为Ntrials。如果在我们的可能性中有超参数，我们会在control.family中指定这些先验


##########################
##   
#########################
fisher.dat <- readRDS(system.file("demodata/data_fisher2.rds", package
                                  = "INLA"))
fisher.dat$id1 <- fisher.dat$id
fisher.dat$dist_cent <- scale(fisher.dat$dist_cent)
formula.inla <- y ~ sex + landuse + dist_cent +
  f(stratum,model="iid",hyper=list(theta = list(initial=log(1e-6),fixed=T))) +
  f(id1,dist_cent, model="iid")
r.inla <- inla(formula.inla, family ="Poisson", data=fisher.dat)
summary(r.inla)


formula.inla <- y ~ sex + landuse + dist_cent + stratum
r.inla <- inla(formula.inla, family ="binomial", data=fisher.dat)
summary(r.inla)


###########################
####
###########################
n=100
a = 1
b = 1
z = rnorm(n)
eta = a + b*z
Ntrials = sample(c(1,5,10,15), size=n, replace=TRUE)
prob = exp(eta)/(1 + exp(eta))
y = rbinom(n, size=Ntrials, prob = prob)
data = list(y=y,z=z)
formula = y ~ 1+z
result = inla(formula, family = "binomial", data = data, Ntrials=Ntrials)
summary(result)





hyper = list(
  prec = list(
    prior = "normal",
    param = c(0,0.00001)
  )
)

在这里插入图片描述