
Bayesian Statistics

1 Introduction

This blog is based on the YouTube videos of ‘Ox edu’. Bayesian statistics is a way of modelling in which we use prior knowledge, also called domain knowledge, to arrive at better predictions. As a formula, Bayesian modelling/statistics is as follows:

\[P(\theta|data) = \frac{P(data|\theta).P(\theta)}{P(data)}\]

Where \(P(\theta)\) is called the prior, which comes from the domain knowledge, \(P(data|\theta)\) is called the likelihood, and \(P(data)\) is the evidence, a normalizing constant that does not depend on \(\theta\).

Let’s take the example of flipping a coin to explain the differences between the frequentist and the Bayesian view.

1.1 Frequentist vs. Bayesian View

Imagine a coin-flipping experiment where a coin is held a distance ‘d’ above a table and is flipped at an angle \(\theta\) with the horizontal. In the frequentist view the probability of heads treats ‘d’ and \(\theta\) as fixed, and the estimate depends only on the number of samples: the data varies but the parameters are fixed. In the Bayesian view we let the probability of heads depend on the parameters ‘d’ and \(\theta\), so varying the parameters varies the probability of getting a head.

1.2 Probability Distribution

Assume a basket full of objects, each carrying a unique number, so there is only one of each element present. Now let’s assume we pick one object from this basket: what is the probability P(X) of picking ‘X’, for any given ‘X’?

Assuming ‘n’ objects, it is 1/n. The function that assigns a probability to each discrete outcome is called the ‘probability mass function’ (PMF).

Notice that the sum of all probabilities is 1, as it should be.

However, if X is a continuous variable, a PMF does not work. Why? Let’s take an example.

Assume you are measuring the height of people in a country; what is the probability that someone is exactly 1.70 meters tall? It is zero. How? The probability of someone being exactly ‘X’ is zero, but the probability of someone being in the range between 1.69 and 1.71 is non-zero. Thus for the continuous case we use something called the ‘probability density function’ (PDF).
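As a small illustration (assuming, purely for this example, that heights follow a Normal(1.75, 0.10^2) distribution in meters), the probability of a height falling in a narrow range is non-zero, while at an exact value we only get a density, not a probability:

# Hypothetical height model: N(mean = 1.75, sd = 0.10), in meters.
# Probability of a height between 1.69 and 1.71:
pnorm(1.71, mean = 1.75, sd = 0.10) - pnorm(1.69, mean = 1.75, sd = 0.10)
# dnorm() returns the density at exactly 1.70, which is not a probability:
dnorm(1.70, mean = 1.75, sd = 0.10)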

1.3 Marginal Probability

Let’s take an example where ‘X’ represents whether or not a person has some disease (X=1 means the person has the disease), and let ‘Y’ represent whether someone is showing the symptoms of the disease. Now let’s create a table with the following scenario.

Now the probability of having both the ‘symptoms’ and the ‘disease’ is the probability that X=Y=1, which from the table is 0.3. This way of combining two random variables into one probability is called a ‘joint probability’. Mathematically we represent this as follows.

\[P(X=1,Y=1) = 0.3\] Notice that, as always with probabilities, the sum over all outcomes should equal 1, that is \[\sum_{x,y} P(X=x,Y=y) =1 \]

What if we want the probability that someone has the symptoms? Intuitively it is nothing but the sum of all the probabilities in the table where Y=1, thus 0.3+0.1=0.4. Whenever, in a joint probability distribution, we want the probability of one variable without considering another dimension, such a probability is called a ‘marginal probability’.

In general the marginal distribution is calculated as follows:

\[P(X=x) = \sum_{y} P(X=x,Y=y) ~~~Marginal~probability~of~x\]

In the continuous case, the summation is replaced by an integral.

1.4 Conditional Probability

Now imagine we ask: what is the probability that someone has the disease given that they have symptoms? Observe that this is not the same as asking for the probability of having the disease and the symptoms; what we are interested in is called the ‘conditional probability’. But why can’t we simply say that this probability is nothing but P(X=1,Y=1)=0.3?

Remember that probabilities must always sum up to 1, so we perform a normalization here:

\[P(X=1|Y=1) + P(X=0|Y=1) = 1\] Thus

\[P(X=1|Y=1) = \frac{P(X=1,Y=1)}{P(X=1,Y=1) + P(X=0,Y=1)} = \frac{0.3}{0.3+0.1} = \frac{3}{4}\]

In general this is given by \[P(X=x|Y=y) = \frac{P(X=x,Y=y)}{P(Y=y)}~~~~~~Where~P(Y=y)~is~the~marginal~probability~of~y\]
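A minimal sketch of these calculations in R, using the two cells given above (P(X=1,Y=1)=0.3 and P(X=0,Y=1)=0.1) and assumed values for the Y=0 cells so that everything sums to 1:

# Joint probability table; the Y = 0 column is assumed purely for illustration.
joint <- matrix(c(0.5, 0.1,    # X = 0: P(X=0,Y=0), P(X=0,Y=1)
                  0.1, 0.3),   # X = 1: P(X=1,Y=0), P(X=1,Y=1)
                nrow = 2, byrow = TRUE,
                dimnames = list(X = c("0", "1"), Y = c("0", "1")))

# Marginal probability of symptoms, P(Y=1): sum over all values of X.
p_y1 <- sum(joint[, "1"])        # 0.4

# Conditional probability of disease given symptoms, P(X=1|Y=1).
joint["1", "1"] / p_y1           # 0.75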

1.5 Bayes Rule

Bayes rule is given by the following expression:

\[P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}\]

1.5.1 Intuitive Explanation

Let’s take the example of ‘Rain’ and ‘Forecasted Rain’.

Plugging into the formula above we get

\[P(Rain|Forecasted) = \frac{P(Forecasted|Rain) \cdot P(Rain)}{P(Forecasted)}\] Multiplying both sides by P(Forecasted) we get:

\[P(Rain|Forecasted) \cdot P(Forecasted) = P(Forecasted|Rain) \cdot P(Rain)\] Using a Venn diagram it is not hard to conclude that \[P(Rain|Forecasted) \cdot P(Forecasted) = P(Forecasted|Rain) \cdot P(Rain) = P(Rain \cap Forecasted)\]
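As a quick numerical check (with made-up numbers: P(Rain)=0.2, P(Forecasted|Rain)=0.9 and P(Forecasted|No Rain)=0.2), Bayes rule can be evaluated directly in R:

# Hypothetical numbers, only to illustrate Bayes rule.
p_rain          <- 0.2   # prior P(Rain)
p_fc_given_rain <- 0.9   # likelihood P(Forecasted | Rain)
p_fc_given_dry  <- 0.2   # P(Forecasted | No Rain)

# Marginal probability of a rain forecast (law of total probability).
p_fc <- p_fc_given_rain * p_rain + p_fc_given_dry * (1 - p_rain)

# Posterior P(Rain | Forecasted).
p_fc_given_rain * p_rain / p_fc   # approximately 0.53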

2 Cheatbook

3 Lab1 Question 1: Bernoulli … again.

Let \(y_1,y_2,...,y_n|\theta \sim Bern(\theta)\), and assume that you have obtained a sample with s = 14 successes in n = 20 trials. Assume a Beta(\(\alpha_0\),\(\beta_0\)) prior for \(\theta\) and let \(\alpha_0\) = \(\beta_0\) = 2.

3.1 a) Draw random numbers from the posterior \[\theta|y \sim Beta(\alpha_0 + s, \beta_0 + f), \quad y = (y_1,y_2,y_3,....,y_n)\] and verify graphically that the posterior mean and standard deviation converge to the true values as the number of random draws grows large.

Let us verify the result for better understanding

  • The probability of s successes in n trials (the binomial likelihood of unordered Bernoulli trials) is given by:

\[ p(s|\theta) = \binom{n}{s} \theta^{s} \cdot (1-\theta)^{n-s} ~~~where~s~is~the~number~of~successes\]

  • PDF of Beta distribution is given by:

\[ f(\theta|\alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \cdot\Gamma(\beta)} \theta^{\alpha-1} \cdot (1-\theta)^{\beta-1}\] \[ Where~\Gamma(n+1)=(n)!\]

\[Mean: \frac{\alpha}{\alpha+\beta}\] \[Standard~Deviation: \sqrt{\frac{\alpha \cdot \beta}{(\alpha+\beta)^2(\alpha+\beta+1)}}\]

  • Verification:

\[ \begin{split} Likelihood = \binom{n}{s} \theta^{s} \cdot (1-\theta)^{n-s} \\ Prior = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \cdot\Gamma(\beta)} \theta^{\alpha-1} \cdot (1-\theta)^{\beta-1} \\ Prior \propto \theta^{\alpha-1} \cdot (1-\theta)^{\beta-1} \\\\ Posterior \propto Likelihood \times Prior \propto \theta^{\alpha_0+s-1} \cdot (1-\theta)^{\beta_0+f-1} \\ Posterior \sim Beta(\alpha_0 + s, \beta_0 + f), ~~~where~f=n-s \\ \end{split} \]
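The code chunks below assume that the following packages are attached (the original post does not show its setup chunk); tidyr, purrr, mvtnorm and Hmisc are only needed in the later labs:

library(ggplot2)   # plotting
library(dplyr)     # data manipulation and the %>% pipe
library(tidyr)     # unnest()
library(purrr)     # cross_df(), pmap()
library(mvtnorm)   # rmvnorm(), dmvnorm()
library(Hmisc)     # cut2()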

#' Title A beta distribution generator to generate N values
#'
#' @param N Number of iterations
#' @param alpha Alpha value for beta distribution
#' @param beta  Beta value for beta distribution
#'
#' @return a list containing the mean and std dev of the generated vector
#' @export
#'
#' @examples
gen_beta_sample <- function(N, alpha, beta){
  temp <- rbeta(N, shape1=alpha, shape2=beta)
  mean_value <- mean(temp)
  std_dev <- sd(temp)
  answer <- list(mean_value=mean_value, std_dev=std_dev)
  return(answer)
}

# required input
N = 3000
alpha = 16
beta = 8

# Running the loop inside a function to get the values for varying N
final <- NULL
for(i in 2:N){
  temp <- gen_beta_sample(N=i, alpha = alpha, beta=beta)
  df <- cbind(N = i, Mean = temp$mean_value, stand_dev = temp$std_dev)
  final <- rbind(df, final)
}

# Making final as a dataframe to make it easy to plot
final <- final %>% data.frame()
final$true_mean <- ((alpha)/(alpha+beta))
final$true_std_dev <- sqrt((alpha*beta)/((alpha+beta)^2 *(alpha+beta+1)))

#plots
ggplot(data=final, aes(x=N, y=Mean)) + 
  geom_point(color= "#E69F00") + 
  geom_line(aes(y=true_mean), color = "#000000") + 
  ggtitle("Predicted(line) and Actual Mean(dots) vs. Iterations") 

ggplot(data=final, aes(x=N, y=stand_dev)) + 
  geom_point(color= "#E69F00") + 
  geom_line(aes(y=true_std_dev), color = "#000000") + 
  ggtitle("Predicted(line) and Actual Standard Deviation(dots) vs. Iterations") 

3.2 b) Use simulation (nDraws = 10000) to compute the posterior probability Pr(\(\theta\) < 0.4|y) and compare with the exact value [Hint: pbeta()].

#' Title A beta distribution generator to generate N values without mean
#'
#' @param N Number of iterations
#' @param alpha Alpha value for beta distribution
#' @param beta  Beta value for beta distribution
#'
#' @return a numeric vector of N draws from the specified beta distribution
#' @export
#'
#' @examples
gen_beta_sample_2 <- function(N, alpha, beta){
  answer <- rbeta(N, shape1=alpha, shape2=beta)
  return(answer)
}

# required input
N = 10000

# Drawing N values from the posterior
beta_values <- gen_beta_sample_2(N=N, alpha = alpha, beta = beta)
actual_prop <- sum(beta_values < 0.4)/N
true_prop <- pbeta(q=0.4, shape1=alpha, shape2=beta, lower.tail = TRUE)

cat("The true value of the probability is: ", true_prop, ", 
    while the calculated value is: ", actual_prop)
## The true value of the probability is:  0.003972681 , 
##     while the calculated value is:  0.0039

3.3 c) Compute the posterior distribution of the log-odds, \(\phi = \log\frac{\theta}{1-\theta}\), by simulation (nDraws = 10000).

log_odds = log(beta_values/(1-beta_values))
log_odds <- as.data.frame(log_odds)

ggplot(data=log_odds, aes(x=log_odds)) + 
  geom_histogram(bins = 30) +
  ggtitle("Histogram of log-odds")

4 Lab1 Question 2: Log-normal distribution and the Gini coefficient

Assume that you have asked 10 randomly selected persons about their monthly income (in thousands Swedish Krona) and obtained the following ten observations: 14, 25, 45, 25, 30, 33, 19, 50, 34 and 67. A common model for non-negative continuous variables is the log-normal distribution. The log-normal distribution logN(\(\mu, \sigma^{2}\)) has density function

\[ p(y|\mu,\sigma^2) = \frac{1}{y \cdot \sqrt{2\pi\sigma^2}} \exp\left[- \frac{(\log y-\mu)^2}{2\sigma^2}\right]\]

for y > 0, \(\mu\) > 0 and \(\sigma^2\) > 0. The log-normal distribution is related to the normal distribution as follows: if y ~ logN(\(\mu\), \(\sigma^2\)) then log y ~ N(\(\mu\), \(\sigma^2\)). Let \(y_1, y_2, ..., y_n |\mu, \sigma^2\) iid ~ logN(\(\mu\), \(\sigma^2\)), where \(\mu\) = 3.5 is assumed to be known but \(\sigma^2\) is unknown with non-informative prior p(\(\sigma^2\)) \(\propto\) 1/\(\sigma^2\). The posterior for \(\sigma^2\) is the Inv-\(\chi^2\) (n, \(\tau^2\)) distribution, where

\[ \tau^2 = \frac{\sum_{i=1}^{n} (logy_{i} - \mu)^2}{n}\]

4.1 a) Simulate 10,000 draws from the posterior of \(\sigma^2\) (assuming \(\mu\) = 3.5) and compare with the theoretical Inv - \(\chi^2 (n, \tau^2)\) posterior distribution.

In the following, \(\tau^2\) is used to simulate from the posterior \(\sigma^2|(y_1, ..., y_n), \mu \sim Inv-\chi^2(n,\tau^2)\). This is done by first drawing \(X \sim \chi^2(n)\). This drawn value is then plugged into the formula \(\sigma^2 = \frac{n\tau^2}{X}\), which yields a draw from \(Inv-\chi^2(n,\tau^2)\). This process is repeated 10000 times and the obtained values for \(\sigma^2\) are stored.

y = c(14, 25, 45, 25, 30, 33, 19, 50, 34, 67)
n = length(y)
mu = 3.5
tau_sq = sum((log(y)-mu)^2)/n

sigma_sq = c()
for (i in 1:10000) {
  # Drawing x.
  x = rchisq(n = 1, df = n)
  # Calculating and storing sigma_sq.
  sigma_sq[i] = (n*tau_sq)/x
}
# Plotting simulated posterior distribution.
ggplot() +
  geom_density(aes(x = sigma_sq)) +
  labs(title = "Posterior distribution for sigma_sq") +
  theme_bw()

After the simulated posterior distribution has been plotted, it will be compared to the theoretical distribution. This will be done by a comparison of the mean and the standard deviation. The theoretical mean and standard deviation are obtained with \(\frac{n\tau^2}{n-2}\text{for }n>2\) and \(\sqrt{\frac{2n^2\tau^4}{(n-2)^2(n-4)}}\text{for }n>4\), respectively.

# Printing statistics of simulated distribution compared to theoretical values.
knitr::kable(
  as.data.frame(
    rbind(
      cbind(Posterior = "Simulation", Mean = mean(sigma_sq), Sd = sd(sigma_sq)),
      cbind(Posterior = "Theory", Mean = n*tau_sq/(n-2), Sd = sqrt((2*n^2*tau_sq^2)/(((n-2)^2)*(n-4))))
    )
  )
)
Posterior Mean Sd
Simulation 0.24563314411 0.13955371581103
Theory 0.247349686559943 0.142807408119353

It can be seen that the statistics obtained with the simulation are very close to the theoretical values. Thus, we assume that the simulation of the posterior distribution for \(\sigma^2\) has been successful.

4.2 b) The most common measure of income inequality is the Gini coefficient, G, where 0 \(\leq\) G \(\leq\) 1. G = 0 means a completely equal income distribution, whereas G = 1 means complete income inequality. See Wikipedia for more information. It can be shown that G = 2\(\Phi (\frac{\sigma}{\sqrt{2}})\) - 1 when incomes follow a logN(\(\mu, \sigma^2\)) distribution. \(\Phi\)(z) is the cumulative distribution function (CDF) for the standard normal distribution with mean zero and unit variance. Use the posterior draws in a) to compute the posterior distribution of the Gini coefficient G for the current data set.

In 2a), we simulated the posterior distribution for \(\sigma^2\) using the current data set. These obtained values for \(\sigma^2\) can now be used within \(G=2\Phi(\sigma/\sqrt{2})-1\) to compute the posterior distribution of the Gini coefficient G. Since \(\Phi(z)\) refers to the CDF of the standard normal distribution, we can make use of the pnorm(q, mean = 0, sd = 1) function, where q equals the different values of \(\sigma/\sqrt{2}\). The computed distribution will be plotted.

# Computing g values. 
g = 2 * pnorm(q = sqrt(sigma_sq)/sqrt(2), mean = 0, sd = 1) - 1
# Plotting distribution.
ggplot() +
  geom_density(aes(x = g)) +
  ggtitle("Posterior distribution for G")

4.3 c) Use the posterior draws from b) to compute a 95% equal tail credible interval for G. A 95% equal tail interval (a,b) cuts off 2.5% of the posterior probability mass to the left of a, and 2.5% to the right of b. Also, do a kernel density estimate of the posterior of G using the density function in R with default settings, and use that kernel density estimate to compute a 95% Highest Posterior Density interval for G. Compare the two intervals.

temp <- density(g)
values_df <- data.frame(x = temp$x, Density = temp$y)
total <- sum(values_df$Density)

values_df_unsorted <- values_df %>% 
  arrange(x) %>% 
  mutate(running_per_unsort = 100 * cumsum(Density)/total) %>%  
  mutate(flag_unsort = ifelse(running_per_unsort < 2.50, "Drop",
                       (ifelse(running_per_unsort < 97.50, "Accept", "Drop"))))

values_df_sorted <- values_df %>%  
  arrange(desc(Density)) %>% 
  mutate(running_per_sort = 100 * cumsum(Density)/total) %>%  
  mutate(flag_sort = ifelse(running_per_sort < 95.00, "Accept", "Drop"))

ggplot(data=values_df_unsorted, aes(x=x, y=Density)) +
  geom_line() +
  geom_point(aes(x=x,y=0,color=flag_unsort)) +
  ggtitle("Posterior distribution for G using tail method")

ggplot(data=values_df_sorted, aes(x=x, y=Density)) +
  geom_line() +
  geom_point(aes(x=x,y=0,color=flag_sort)) +
  ggtitle("Posterior distribution for G using Highest Posterior Density")

Analysis: We see that the Highest Posterior Density interval covers the region where the density is highest, while the usual equal-tail method can clip high-density regions (and include lower-density ones) when the posterior is skewed.
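For a direct numerical comparison, a small sketch (reusing g and the flagged density grid from above) that prints the bounds of both intervals:

# 95% equal-tail interval, directly from the posterior draws of G.
quantile(g, probs = c(0.025, 0.975))

# Approximate 95% HPD interval: the range of the grid points kept
# by the highest-density ('Accept') rule above.
hpd_points <- values_df_sorted %>% filter(flag_sort == "Accept")
c(lower = min(hpd_points$x), upper = max(hpd_points$x))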

5 Lab1 Question 3: Bayesian Inference

Bayesian inference for the concentration parameter in the von Mises distribution. This exercise is concerned with directional data. The point is to show you that the posterior distribution for somewhat weird models can be obtained by plotting it over a grid of values. The data points are observed wind directions at a given location on ten different days. The data are recorded in degrees and converted into radians: (2.44, 2.14, 2.54, 1.83, 2.02, 2.33, 2.79, 2.23, 2.07, 2.02)

Assume that these data points are independent observations following the von Mises distribution:

\[ p(y|\mu,\kappa) = \frac{\exp[\kappa \cdot \cos(y-\mu)]}{2\pi I_{0}(\kappa)},~~~-\pi\leq y\leq \pi\]

where \(I_0(\kappa)\) is the modified Bessel function of the first kind of order zero [see ?besselI in R]. The parameter \(\mu\) (\(-\pi\leq \mu\leq \pi\)) is the mean direction and \(\kappa\) > 0 is called the concentration parameter. Large \(\kappa\) gives a small variance around \(\mu\), and vice versa. Assume that \(\mu\) is known to be 2.39. Let \(\kappa \sim\) Exponential(\(\lambda\) = 1) a priori, where \(\lambda\) is the rate parameter of the exponential distribution (so that the mean is 1/\(\lambda\)).

5.1 a) Plot the posterior distribution of \(\kappa\) for the wind direction data over a fine grid of \(\kappa\) values.

y <- c(-2.44, 2.14, 2.54, 1.83, 2.02, 2.33, -2.79, 2.23, 2.07, 2.02)
mean_y <- 2.39
k <- seq(0,10,0.001)

calc_prob = function(k, y){
  prob = exp(k * cos(y - mean_y)) / (2*pi*besselI(k, 0))
  return (prob)
}

calc_post = function(k){
  probabilities = sapply(y, calc_prob, k=k)
  prior = dexp(k)
  posterior = prod(probabilities) * prior
  return (posterior)
}

posterior = sapply(k, calc_post)

result_df <- data.frame(k = k, posterior = posterior)

ggplot(data=result_df, aes(x=k, y=posterior)) +
  geom_point() +
  ggtitle("Posterior over varying k")

5.2 b) Find the (approximate) posterior mode of \(\kappa\) from the information in a).

# Find the k value which maximizes posterior
mode = result_df$k[which.max(result_df$posterior)]

ggplot(data=result_df, aes(x=k, y=posterior)) +
  geom_point() +
  geom_vline(aes(xintercept = mode, color = "red")) +
  ggtitle(paste0("Posterior over varying k with max value shown as redline = ",mode))

Analysis: The value of \(\kappa\) at which the gridded posterior is maximal is the (approximate) posterior mode of \(\kappa\).

6 Lab2 Question 1: Linear and polynomial regression

The dataset TempLinkoping.txt contains daily temperatures (in degrees Celsius) at Malmslatt, Linkoping over the course of the year 2016 (366 days, since 2016 was a leap year). The response variable is temp and the covariate is

time = the number of days since beginning of year/366

The task is to perform a Bayesian analysis of a quadratic regression

\[ temp = \beta_0 + \beta_1 \cdot time + \beta_2 \cdot time^2 + \epsilon, \quad \epsilon \sim N(0,\sigma^2) \]

6.1 a) Determining the prior distribution of the model parameters

The joint prior is given by the following: \[ \begin{split} \beta|\sigma^2 \sim N(\mu_0, \sigma^2 \cdot \Omega_0^{-1}) \\ \sigma^2 \sim Inv-\chi^2(\nu_o, \sigma^2_0) \end{split} \]

Posterior

\[ \begin{split} \beta|\sigma^2,y \sim N(\mu_n, \sigma^2 \Omega_n^{-1}) \\ \sigma^2|y \sim Inv-\chi^2(\nu_n, \sigma^2_n) \\\\ \Omega_0 = \lambda I_n \\ \mu_n = (X^T X+\Omega_0)^{-1} (X^T X \hat{\beta} + \Omega_0 \mu_0) \\ \Omega_n = X^T X + \Omega_0 \\ \hat{\beta} = (X^TX)^{-1} X^T Y \\ \nu_n = \nu_0 + n \\ \sigma^2_n = \frac{\nu_0 \sigma^2_0 + (Y^T Y + \mu_0^T \Omega_0 \mu_0 - \mu_n^T \Omega_n \mu_n)}{\nu_n} \end{split} \] Finally

\[ \begin{split} Y = X \beta + \epsilon \\ \epsilon \sim N(0,\sigma^2) \end{split} \]

set.seed(12345)

temp_data = read.delim("../../static/data/TempLinkoping2016.txt")
temp_data$time_square = (temp_data$time)^2
temp_data$intercept = 1

# Initializing hyper-parameters
mu_0 = t(c(-10,100,-100))
omega_0 = diag(x=0.01,nrow=3,ncol=3)
nu_0 = 4
sigma_sq_0 = 1 

# calculating X and other vectors
X = temp_data[,c("intercept", "time", "time_square")] %>% as.matrix()
X_T = t(X)
Y = temp_data[,c("temp")] %>% as.matrix()
Y_T = t(Y)

# defining a function to sample and predict temperature
quad_regression <- function(mu_0, omega_0, nu_0, sigma_sq_0,X,Y){
# sampling sigma squares using inverse chi-square  
x = rchisq(n = 1, df = nu_0)
# Calculating sigma_sq.
sigma_sq = (nu_0*sigma_sq_0)/x
beta <- mvtnorm::rmvnorm(n = 1 , mean=mu_0, sigma=sigma_sq * solve(omega_0))
final <-  X %*% t(beta) + rnorm(n=1,mean = 0, sd=sqrt(sigma_sq))
colnames(final) <- c("Predicted_temperature")
return(final)
}

# Define a named list of parameter values
gs = list(nu_0 = seq(1,10,1),
           sigma_sq_0 = seq(0,5,0.5)) %>% 
  cross_df() # Convert to data frame grid

final = gs %>% mutate(predicted_temperature = pmap(gs, mu_0 = mu_0, 
                                                   omega_0 = omega_0, X=X, Y=Y, 
                                                   quad_regression),
                      actual_temperature = list(temp_data$temp),
                      time = list(temp_data$time)) %>% as_tibble()

final = unnest(final, predicted_temperature, actual_temperature, time)

# Using cut
final$sigma_sq_0_cut <- cut2(final$sigma_sq_0, m=5)
final$nu_0_cut <- cut2(final$nu_0, m=5)

# plotting
final %>% filter(predicted_temperature <30 & predicted_temperature >-25)  %>%  
  ggplot(aes(x=time, y = predicted_temperature, color=nu_0)) +
  geom_point() +
  geom_line(aes(y=actual_temperature), color="red") +
  facet_wrap(~nu_0_cut) +
  ggtitle("redicted Temperature vs. Time By Varying Nu_0, with true curve in red")

final %>% filter(predicted_temperature <50 & predicted_temperature >-50)  %>%  
  ggplot(aes(x=time, y = predicted_temperature, color=sigma_sq_0)) +
  geom_point() +
  geom_line(aes(y=actual_temperature), color="red") +
    facet_wrap(~sigma_sq_0_cut) +
  ggtitle("Predicted Temperature vs. Time By Varying Sigma square, with true curve in red")

Analysis: It is apparent that low values of sigma_sq_0 work well (0-0.5), so we set it to ~0.5 and nu_0 to ~10, which leads to the following curve. The impact of omega_0 and mu_0 was not large, so we keep them the same.

6.1.1 Best fit

set.seed(12345)

# Initializing final hyper-parameters
mu_0 = t(c(-10,100,-100))
omega_0 = diag(x=50,nrow=3,ncol=3)
nu_0 = 10
sigma_sq_0 = 0.25 

best_fit <- NULL
for(i in 1:100){
temp <- quad_regression(mu_0=mu_0, omega_0=omega_0, nu_0=nu_0, 
                        sigma_sq_0=sigma_sq_0,X=X,Y=Y)
temp <- temp %>% as.data.frame() %>% mutate(Actual_temperature = temp_data$temp, 
                                            Time = temp_data$time) 
best_fit <- rbind(temp, best_fit)
}

# plotting
best_fit %>% 
  ggplot(aes(x=Time, y = Predicted_temperature)) +
  geom_point() +
  geom_line(aes(y=Actual_temperature), color="red") +
  ggtitle("Predicted and Actual Temperature(shown in red) vs. Time")

6.2 b) Simulating from the joint posterior distribution

set.seed(12345)
# initialized values from above
mu_0 = c(-10,100,-100)
mu_0_T = t(mu_0)
omega_0 = diag(x=50,nrow=3,ncol=3)
nu_0 = 10
sigma_sq_0 = 0.25 

# calculate beta hat given by beta_hat =  (X^{T} X)^{-1} X^T Y
beta_hat = solve(X_T %*% X) %*% X_T %*% Y

# calculate mu_n
mu_n = solve(X_T %*% X + omega_0) %*% (X_T %*% X %*% beta_hat + omega_0 %*% mu_0)

# calculate omega_n given by Omega_n = X^T X + \Omega_0 
omega_n = X_T %*% X + omega_0

# calculate nu_n given by nu_0 + n
nu_n = NROW(X) + nu_0

# calculate sigma_sq_n given by (nu_0 * sigma_sq_0 + (Y %*% Y_T + mu_0^T %*% omega_0 %*% mu_0 - mu_n^T %*% omega_n %*% mu_n))/nu_n
sigma_sq_n = (nu_0 * sigma_sq_0 + Y_T %*% Y + mu_0_T %*% omega_0 %*% mu_0 - 
                t(mu_n) %*% omega_n %*% mu_n)/nu_n

# posterior distribution given by Beta = N(\mu_n, \sigma^2 \Omega_n^{-1}) 

final <- NULL
temp <- NULL
temp2 <- NULL
predicted_data <- NULL
for(i in 1:1000){
#\sigma^2 =  Inv-\chi^2(nu_n, \sigma^2_n)
x = rchisq(n = 1, df = nu_n)
# Calculating sigma_sq.
sigma_sq = as.vector((nu_n*sigma_sq_n)/x) 
betas <- mvtnorm::rmvnorm(n = 1 , mean=mu_n, sigma=sigma_sq * solve(omega_n))
temp2 <-  X %*% t(betas) + rnorm(n=1,mean = 0, sd=sqrt(sigma_sq)) # add the error term you get prediction band
temp2 <- cbind(temp2, X, Y)
predicted_data <- rbind(temp2, predicted_data)
temp <- cbind(betas,sigma_sq)
final <- rbind(temp,final)
}

colnames(final) <- c("Beta_0", "Beta_1", "Beta_2", "Sigma_squared")
colnames(predicted_data) <- c("Predicted_Temperature", "Intercept", "Time", 
                              "Time_square", "Actual_Temperature")
final = final %>% as.data.frame()

# calculation of the 95% confidence interval
predicted_data = predicted_data %>% as.data.frame() %>% 
group_by(Actual_Temperature, Time) %>%
summarise(temp_hat_median = median(Predicted_Temperature),
temp_hat_l_limit = quantile(x = Predicted_Temperature, probs = 0.025),
temp_hat_u_limit = quantile(x = Predicted_Temperature, probs = 0.975))

ggplot(data=final,aes(Beta_0)) + 
  geom_histogram(bins=50) + 
  ggtitle("Histogram of Beta_0")

ggplot(data=final,aes(Beta_1)) + 
  geom_histogram(bins=50) + 
  ggtitle("Histogram of Beta_1")

ggplot(data=final,aes(Beta_2)) + 
  geom_histogram(bins=50) + 
  ggtitle("Histogram of Beta_2")

ggplot(data=final,aes(Sigma_squared)) + 
  geom_histogram(bins=50) + 
  ggtitle("Histogram of Sigma_squared")

ggplot(data= predicted_data, aes(x = Time)) + geom_point(aes(y = Actual_Temperature)) +
geom_line(aes(y = temp_hat_median, color = "Posterior median")) +
geom_line(aes(y = temp_hat_l_limit, color = "Posterior credible interval")) +
geom_line(aes(y = temp_hat_u_limit, color = "Posterior credible interval")) +
ggtitle("Plot of regression with confidence intervals")

Analysis: No, we do not expect 95% of the data points to lie inside this band, because it is a credible band for the regression curve rather than a prediction band (which would also include the error term).

6.3 c) Simulating from the posterior distribution of \(\widetilde{x}\)

Since the given expression is quadratic, the time with the highest expected temperature is found by setting the first derivative to zero: \[ \begin{split} y = \beta_0 + \beta_1 time + \beta_2 time^2 \\ 0 = \beta_1 + 2 \beta_2 time \\ time = -0.5 \cdot \frac{\beta_1}{\beta_2} \\ \end{split} \]

As evident from the graph, this will be close to ~0.50. Calculating from the posterior draws we get:

cat("The highest expected temperature is:", - 0.5 * mean(final$Beta_1/final$Beta_2))
## The highest expected temperature is: 0.521753
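To look at the whole posterior of \(\widetilde{x}\) rather than a single summary, a minimal sketch using the stored draws in final:

# Posterior draws of the time with the highest expected temperature.
x_tilde <- -0.5 * final$Beta_1 / final$Beta_2

ggplot(data = data.frame(x_tilde = x_tilde), aes(x = x_tilde)) +
  geom_histogram(bins = 50) +
  ggtitle("Posterior distribution of the time with the highest expected temperature")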

6.4 d) Suggesting a prior to estimate a high-order polynomial model

This is equivalent to using regularisation; the priors on the \(\beta_i\) are given by

Lasso: \[\beta_i|\sigma^2 \sim Laplace(0,\frac{\sigma^2}{\lambda})\] or simple ridge regression, given by: \[\beta_i|\sigma^2 \sim N(0,\frac{\sigma^2}{\lambda})\]
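In the conjugate setup used above, the ridge-type prior corresponds to choosing \(\mu_0 = 0\) and \(\Omega_0 = \lambda I\), so that \(\beta|\sigma^2 \sim N(0, \sigma^2 \lambda^{-1} I)\). A minimal sketch (lambda and the number of polynomial terms are assumed values, for illustration only):

# Ridge-type shrinkage prior in the notation of the conjugate regression above.
lambda  <- 10                 # assumed shrinkage strength
p       <- 8                  # assumed number of polynomial terms (degree 7 plus intercept)
mu_0    <- rep(0, p)          # prior mean: shrink all coefficients towards 0
omega_0 <- lambda * diag(p)   # beta | sigma^2 ~ N(0, sigma^2 * solve(omega_0))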

7 Lab2 Question 2: Posterior approximation for classification with logistic regression

7.1 a) Implementing logistic regression model

set.seed(12345)
data <- read.csv("../../static/data/WomenWork.csv")
model <- glm(Work~0 +., data = data, family = binomial)
summary(model)
## 
## Call:
## glm(formula = Work ~ 0 + ., family = binomial, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1666  -0.9298   0.4391   0.9491   2.0579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## Constant     0.64433    1.52306   0.423 0.672258    
## HusbandInc  -0.01982    0.01591  -1.246 0.212743    
## EducYears    0.18000    0.07914   2.274 0.022938 *  
## ExpYears     0.16737    0.06593   2.539 0.011130 *  
## ExpYears2   -0.14380    0.23551  -0.611 0.541474    
## Age         -0.08234    0.02699  -3.050 0.002286 ** 
## NSmallChild -1.36248    0.38996  -3.494 0.000476 ***
## NBigChild   -0.02541    0.14171  -0.179 0.857707    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 277.26  on 200  degrees of freedom
## Residual deviance: 222.73  on 192  degrees of freedom
## AIC: 238.73
## 
## Number of Fisher Scoring iterations: 4

7.2 b) Approximating the posterior distribution of \(\beta\) with a multivariate normal distribution

Let’s derive the formula for the log-likelihood.

In our data, ‘Work’ is a binary variable, thus

\[ \begin{split} P(Work=1|x_i) = \frac{exp(X^T \cdot \beta)}{1+exp(X^T \cdot \beta)} \\ P(Work=0|x_i) = \frac{1}{1+exp(X^T \cdot \beta)} \\\\ Since~ P(Work=1|x_i) + P(Work=0|x_i) = 1 \\\\ The~log~likelihood~contribution~of~observation~i~(the~negative~cross~entropy~loss)~is~~~~~~~~~~ \\ \ell_i = y_i \cdot\log P(Work=1|x_i) + (1-y_i) \cdot \log P(Work=0|x_i) \end{split} \]

Log-likelihood:

\[ \begin{split} \log p(Y|x,\beta) = \sum_{i=1}^{N} y_i \cdot\log(P(Work=1|x_i)) + (1-y_i) \cdot \log(P(Work=0|x_i))\\ \log p(Y|x,\beta) = \sum_{i=1}^{N} y_i \cdot \log(\frac{exp(X^T \cdot \beta)}{1+exp(X^T \cdot \beta)})+ (1-y_i) \cdot \log(\frac{1}{1+exp(X^T \cdot \beta)}) \\ \log p(Y|x,\beta) = \sum_{i=1}^{N} y_i \cdot \log(\frac{1}{1+exp(- X^T \cdot \beta)})+ (1-y_i) \cdot \log(\frac{1}{1+exp(X^T \cdot \beta)}) \\ \log p(Y|x,\beta) = \sum_{i=1}^{N} y_i \cdot [\log(\frac{1}{1+exp(- X^T \cdot \beta)}) - \log(\frac{1}{1+exp(X^T \cdot \beta)})] + \log(\frac{1}{1+exp(X^T \cdot \beta)}) \\ \log p(Y|x,\beta) = \sum_{i=1}^{N} y_i \cdot \log(\frac{1+exp(X^T \cdot \beta)}{1+exp(- X^T \cdot \beta)}) + \log(\frac{1}{1+exp(X^T \cdot \beta)}) \\ \log p(Y|x,\beta) = \sum_{i=1}^{N} y_i \cdot \log(exp(X^T \cdot \beta)) - \log({1+exp(X^T \cdot \beta)}) \\ \log p(Y|x,\beta) = \sum_{i=1}^{N} y_i \cdot X^T \cdot \beta - \log({1+exp(X^T \cdot \beta)}) \\ \end{split} \]

Furthermore

In the next step, the goal is to approximate the posterior distribution of \(\beta\) with a multivariate normal distribution \[\beta|y,X\sim N \bigg(\widetilde{\beta},J^{-1}_y(\widetilde{\beta})\bigg)\] where \(\widetilde{\beta}\) is the posterior mode and \(J(\widetilde{\beta}) = - \frac{\partial^2lnp(\beta|y)}{\partial\beta\partial\beta^T}|_{\beta=\widetilde{\beta}}\) is the observed Hessian evaluated at the posterior mode. Both \(\widetilde{\beta}\) and \(J(\widetilde{\beta})\) are computed by the optim-function in R. We use the prior \(\beta \sim N(0, \tau^2I)\) with \(\tau = 10\).

set.seed(12345)
# creating matrix for calculation
X <- data[,c("Constant", "HusbandInc", "EducYears", "ExpYears", 
             "ExpYears2", "Age", "NSmallChild", "NBigChild" )] %>% as.matrix()
Y <- data[,c("Work")] %>% as.matrix()

# Prior.
tau = 10
mu_0 = rep(0, ncol(X))
sigma_0 = tau^2*diag(ncol(X))
beta_init = rep(0, ncol(X))

log_posterior <- function(betaVect,Y,X,mu,sigma){
  
  nPara <- length(betaVect);
  linPred <- X %*% betaVect;
  
  # evaluating the log-likelihood                                    
  logLik <- sum( linPred*Y -log(1 + exp(linPred)));
  if (abs(logLik) == Inf) logLik = -20000; # Likelihood is not finite, steer the optimizer away from here!
  
  # evaluating the prior
  logPrior <- dmvnorm(betaVect, matrix(0,nPara,1), sigma, log=TRUE);
  
  # add the log prior and log-likelihood together to get log posterior
  return(logLik + logPrior)
}

results_optim = optim(par = beta_init,
                      fn = log_posterior,
                      Y = Y,
                      X = X,
                      mu = mu_0,
                      sigma = sigma_0,
                      method=c("BFGS"),
                      # Multiplying objective function by -1 to find maximum instead of minimum.
                      control=list(fnscale=-1), 
                      hessian=TRUE)

The optim-function returns the posterior mode \(\widetilde{\beta}\). Since it also returns \(-J_y(\widetilde{\beta})\), we still need to negate and invert this matrix to get the desired \(J^{-1}_y(\widetilde{\beta})\) before printing.

# Printing results.
  # Beta_mode.
  knitr::kable(
    data.frame(
      beta_ = seq(from = 0,to = length(results_optim$par)-1), 
      posterior_mode = results_optim$par), 
    caption = "Numerical values for beta_mode")
Table 7.1: Numerical values for beta_mode
beta_ posterior_mode
0 0.6267583
1 -0.0198388
2 0.1803346
3 0.1674284
4 -0.1440383
5 -0.0820616
6 -1.3591145
7 -0.0246623
# Adjusted hessian. # Posterior covariance (J^-1(beta hat))
print("Adjusted hessian:")
## [1] "Adjusted hessian:"
hessian =  -solve(results_optim$hessian)
hessian
##              [,1]           [,2]            [,3]           [,4]
## [1,]  2.265999213  0.00333682417 -0.065449232806 -0.01177765557
## [2,]  0.003336824  0.00025290866 -0.000561152701 -0.00003146443
## [3,] -0.065449233 -0.00056115270  0.006218404726 -0.00035435284
## [4,] -0.011777656 -0.00003146443 -0.000354352838  0.00434298028
## [5,]  0.045717513  0.00014228726  0.001889764740 -0.01421143177
## [6,] -0.030294134 -0.00003586266 -0.000003319259 -0.00013398170
## [7,] -0.188769159  0.00050695295 -0.006135548472 -0.00146499283
## [8,] -0.098022329 -0.00014422783  0.001752241397  0.00054447141
##               [,5]            [,6]         [,7]          [,8]
## [1,]  0.0457175134 -0.030294134435 -0.188769159 -0.0980223288
## [2,]  0.0001422873 -0.000035862665  0.000506953 -0.0001442278
## [3,]  0.0018897647 -0.000003319259 -0.006135548  0.0017522414
## [4,] -0.0142114318 -0.000133981701 -0.001464993  0.0005444714
## [5,]  0.0554187693 -0.000330219635  0.003192093  0.0005086101
## [6,] -0.0003302196  0.000718477547  0.005184373  0.0010952774
## [7,]  0.0031920930  0.005184372595  0.151262310  0.0067689298
## [8,]  0.0005086101  0.001095277394  0.006768930  0.0199714842
approx_post_std_dev = sqrt(diag(hessian))

  knitr::kable(approx_post_std_dev, caption="Approximate posterior standard deviations")
Table 7.1: Approximate posterior standard deviations
x
1.5053236
0.0159031
0.0788569
0.0659013
0.2354119
0.0268044
0.3889246
0.1413205

To compute an approximate 95% credible interval for the coefficients of the variable NSmallChild, we first need to simulate from the posterior. We do this by simulating from the posterior \[\beta|y,X\sim N \bigg(\widetilde{\beta},J^{-1}_y(\widetilde{\beta})\bigg)\]

by usage of our obtained parameters \(\widetilde{\beta}\) and \(J^{-1}_y(\widetilde{\beta})\).

set.seed(12345)
# Simulating 1000 times from approximated posterior.
posterior_sim_beta_all = rmvnorm(n = 1000,
                                 mean = results_optim$par,
                                 sigma = -solve(results_optim$hessian))

# Extracting simulated posterior beta values for variable NSmallChild.
posterior_sim_beta_nsc = posterior_sim_beta_all[, 7]

# Computing 95% credible interval bounds.
interval_bounds = quantile(x = posterior_sim_beta_nsc,
                           probs = c(0.025, 0.975))

# Plotting simulated beta values with 95% credible interval for variable NSmallChild.
ggplot() +
  geom_histogram(aes(x = posterior_sim_beta_nsc),
                 binwidth = 0.1) +
  geom_vline(xintercept = interval_bounds[1],
             color = "red") +
  geom_vline(xintercept = interval_bounds[2],
             color = "red") +
  ggtitle("Histogram of simulated posterior distribution of beta for Nsmallchild")

# Printing interval bounds.
knitr::kable(as.data.frame(interval_bounds), caption = "Computed credible interval bounds")
Table 7.2: Computed credible interval bounds
interval_bounds
2.5% -2.1296565
97.5% -0.5759335

7.3 c) Simulating from the predictive distribution of the response variable in a logistic regression

To simulate from the predictive distribution of the response variable in a logistic regression, we use the drawn posterior \(\beta\)-coefficients (posterior_sim_beta_all) from 2b) to predict the Work-value. For every drawn \(\beta\)-vector, we calculate \(Pr(y=1|x) = \frac{exp(x^T\beta)}{1+exp(x^T\beta)}\). Since the goal is to predict the Work-value for a 40-year-old woman with two children (3 and 9 years old), 8 years of education, 10 years of experience and a husband with an income of 10, we use this data vector with every drawn \(\beta\)-vector.

set.seed(12345)
# Implementing specified data.
X = matrix(c(Constant = 1, 
             HusbandInc = 10, 
             EducYears = 8, 
             ExpYears = 10, 
             ExpYears2 = 1.00,
             Age = 40, 
             NSmallChild = 1, 
             NBigChild = 1),
           nrow = 1)

X_T <- t(X)

colnames(posterior_sim_beta_all) <- c("Constant", "HusbandInc", "EducYears", 
                                      "ExpYears", "ExpYears2", "Age", 
                                      "NSmallChild", "NBigChild")

# Using all drawn posterior beta coeffiecients and specified data vector 
# to compute distribution for Pr(y=1|x).
predictive_sim_work = c()
for(sim in 1:nrow(posterior_sim_beta_all)) {
    num <- exp(posterior_sim_beta_all[sim,] %*% X_T)
  predictive_sim_work[sim] <- num/(1+num)
}

predictive_sim_work = predictive_sim_work %>% as.data.frame()
colnames(predictive_sim_work) <- c("Probability_of_Work")

# Plotting histogram of predictive distribution.
ggplot(data=predictive_sim_work, aes(Probability_of_Work)) +
geom_histogram(bins = 30) +
ggtitle("Histogram of the Porbability of Women Working given the parameters")  

8 Lab3 Assignment 1: Normal model, mixture of normal model with semi-conjugate prior

First, the provided data rainfall.dat will be read into the R environment.

# Storing provided data in R environment as a numeric vector.
data = read.table("../../static/data/rainfall.dat", header = FALSE, sep = "", dec = ".")[,1]

8.1 1a. Normal model

8.1.1 Given information

For the specified Normal model with the semi-conjugate prior \[\mu \sim N(\mu_0, \tau_0^2)\] and \[\sigma^2 \sim Inv-\chi^2(\nu_0, \sigma^2_0),\] we know the full-conditional posteriors from lecture 7: \[\mu|\sigma^2,x \sim N(\mu_n, \tau^2_n)\] and \[\sigma^2|\mu,x\sim Inv-\chi^2\bigg(\nu_n, \frac{\nu_0 \sigma^2_0 + \sum_{i=1}^n(x_i-\mu)^2}{n+\nu_0}\bigg).\] Also, from lecture 2, we know that \[w = \frac{\frac{n}{\sigma^2}}{\frac{n}{\sigma^2}+\frac{1}{\tau^2_0}},\] \[\mu_n = w\bar{x}+(1-w)\mu_0\] and \[\frac{1}{\tau^2_n} = \frac{n}{\sigma^2} + \frac{1}{\tau^2_0} \Rightarrow \tau^2_n = \frac{1}{\frac{n}{\sigma^2}+\frac{1}{\tau^2_0}}.\]

8.1.2 Setup

To be able to sample from the full conditional posteriors, we have to set up the prior parameters \(\mu_0\), \(\tau^2_0\), \(\nu_0\) and \(\sigma^2_0\). For \(\mu_0\), our best guess is the mean of the provided data. For the other parameters, we do not have any information. Thus, we use initial values of 1 for each of these parameters.

# Initializing prior parameters.
mu_0 = mean(data)
tau_0_sq = 1
nu_0 = 1
sigma_0_sq = 1

Furthermore, we have to implement \(n\), \(\bar{x}\) by usage of the provided data. \(\nu_n\) is given by \(\nu_0 + n\).

# Implementing further variables.
n = length(data)
x_bar = mean(data)
nu_n = nu_0 + n

In the following, functions which are used to sample from the full conditional posteriors will be implemented following the presented formulas. To draw from the full conditional posterior for \(\sigma^2\), we need to draw from the scaled inverse-chi-squared distribution. We do this using the same principle as our simulators from the previous labs. Precisely, we first draw \(X\sim\chi^2(\nu_n)\) and then compute \(\sigma^2 = \frac{\nu_n\bigg(\frac{\nu_0\sigma^2_0+\sum_{i=1}^n(x_i-\mu)^2}{n+\nu_0}\bigg)}{X}\)

# Implementing functions to enable sampling from full conditional posteriors.
  # mu.
  posterior_mu = function(sigma_sq) {
    # Calculating w.
    w = (n/sigma_sq)/((n/sigma_sq) + (1/tau_0_sq))
    # Calculating mu_n.
    mu_n = w*x_bar+(1-w)*mu_0
    # Calculating tau_n_sq.
    tau_n_sq = 1/((n/sigma_sq)+(1/tau_0_sq))
    # Drawing mu from N(mu_n, tau_n_sq); rnorm() takes a standard deviation.
    mu = rnorm(n = 1, mean = mu_n, sd = sqrt(tau_n_sq))
    # Returning draw.
    return(mu)
  }
  # sigma_sq.
  posterior_sigma_sq = function(mu) {
    # Drawing X.
    X = rchisq(n = 1, df = nu_n)
    # Computing sigma_sq.
    sigma_sq = nu_n*((nu_0*sigma_0_sq+sum((data-mu)^2))/(n+nu_0))/X
    # Returning sigma_sq.
    return(sigma_sq)
  }

To perform Gibbs sampling, we first need to initialize the \(\theta_2, ..., \theta_k\) parameters. Since we only have two parameters in this case, we only need to initialize our \(\theta_2\), which is \(\sigma^2\). To do so, it would be possible to simply select a random value for \(\sigma^2\); instead, we draw from our prior distribution for \(\sigma^2\). Again, to sample from the scaled inverse-chi-squared distribution, we follow the same principle as described above.

# Initializing sigma_sq.
  # Drawing X.
  X = rchisq(n = 1, df = nu_0)
  # Computing sigma_sq.
  sigma_sq = (nu_0*sigma_0_sq)/X

8.1.3 Run

Using both posterior functions and the initialized \(\sigma^2\), we are now able to perform Gibbs sampling.

# Defining number of iterations.
niter = 5000
# Creating objects to store results.
mu_draws = c()
sigma_sq_draws = c()
# Performing gibbs sampling.
for (i in 1:niter) {
  # Sampling and storing mu.
  mu = posterior_mu(sigma_sq)
  mu_draws[i] = mu
  # Sampling and storing sigma_sq.
  sigma_sq = posterior_sigma_sq(mu)
  sigma_sq_draws[i] = sigma_sq
}

8.1.4 Results

The sampling procedure leads to the following results. For each drawn parameter (\(\mu\) and \(\sigma^2\)), both the chain and the histogram of the samples are plotted.

# Plotting results of Gibbs sampling.
  # mu.
    # chain.
    ggplot() +
    geom_line(aes(x = seq(from = 1, to = niter), y = mu_draws)) +
    theme_bw() +
    labs(title = "Chain of sampled mu",
         x = "iteration",
         y = "mu")

    # histogram.
    ggplot() +
    geom_histogram(aes(x = mu_draws),
                   fill = "black",
                   colour = "white",
                   bins = 30) +
    theme_bw() +
    labs(title = "Histogram of sampled mu",
         x = "mu")

  # sigma_sq.
    # chain.
    ggplot() +
    geom_line(aes(x = seq(from = 1, to = niter), y = sigma_sq_draws)) +
    theme_bw() +
    labs(title = "Chain of sampled sigma_sq",
         x = "iteration",
         y = "sigma_sq_draws")

    # histogram.
    ggplot() +
    geom_histogram(aes(x = sigma_sq_draws),
                   fill = "black",
                   colour = "white",
                   bins = 30) +
    theme_bw() +
    labs(title = "Histogram of sampled sigma_sq",
         x = "sigma_sq_draws")

The results show that for both parameters \(\mu\) and \(\sigma^2\) there is essentially no burn-in: the chains reach their stationary region almost immediately.
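A quick way to back this up is to look at the autocorrelation of the chains (a minimal sketch using base R's acf()); low autocorrelation indicates that the Gibbs sampler mixes well:

# Autocorrelation of the Gibbs draws; near-zero values after lag 0 indicate good mixing.
acf(mu_draws, main = "ACF of sampled mu")
acf(sigma_sq_draws, main = "ACF of sampled sigma_sq")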

8.2 1b. Mixture normal model.

8.2.1 Given information

The specified mixture normal model is \[p(y_i|\mu,\sigma^2,\pi)=\pi N(y_i|\mu_1,\sigma^2_1)+(1-\pi)N(y_i|\mu_2,\sigma^2_2),\] where \[\mu = (\mu_1,\mu_2) \text{ and } \sigma^2=(\sigma^2_1, \sigma^2_2).\] Instead of using the provided code in the file NormalMixtureGibbs.R, we create our own programme to perform Gibbs sampling. Within the provided file, another prior is assumed (the prior from lecture 2 where \(\mu_j\) does not depend on \(\sigma^2_j\)). Therefore, the full-conditional posteriors also differ from the ones provided in lecture 7.

Here, we use the prior \[\pi \sim Beta(\alpha_1, \alpha_2)\] and the conjugate prior \[\mu_j|\sigma^2_j \sim N\big(\mu_{0j}, \frac{\sigma^2_j}{\kappa_{0j}}\big).\] \[\sigma^2_j \sim Inv-\chi^2\big(\nu_{0j}, \sigma^2_{0j}\big)\]

We also define \(n_1 = \sum_{i=1}^n(I_i=1)\) and \(n_2 = n - n_1.\)

According to lecture 7, usage of this prior leads to the following full conditional posteriors:

  • \((\pi_1, ..., \pi_k)|I, x\sim Dirichlet(\alpha_1 + n_1, ..., \alpha_k + n_k)\)
  • \(\mu_j|I, \sigma^2_j \sim N(\mu_{nj}, \frac{\sigma^2_j}{\kappa_{nj}})\)
  • \(\sigma^2_j|I, x \sim Inv-\chi^2(\nu_{nj}, \sigma^2_{nj})\)

where \[\mu_{nj} = \frac{\kappa_{0j}}{\kappa_{0j}+n_j}\mu_{0j}+\frac{n_j}{\kappa_{0j}+n_j}\bar{x}_j,\] \[\kappa_{nj} = \kappa_{0j} + n_j,\] \[\nu_{nj} = \nu_{0j} + n_j\] and \[\sigma^2_{nj} = \frac{ \nu_{0j}\sigma^2_{0j}+(n_j-1)s_j^2+\frac{\kappa_{0j}n_j}{\kappa_{0j}+n_j}(\bar{x}_j-\mu_{0j})^2}{\nu_{nj}}\] (see lecture 5), where \(\bar{x}_j\) and \(s^2_j\) are the sample mean and variance of the observations currently allocated to component j.

Furthermore, for our case k = 2 (since we have two components in our model), we can sample \(I_i\) as follows:

\[I_i |\pi, \mu_1, \sigma^2_1, \mu_2, \sigma^2_2, x \sim Bern(\theta_i), i = 1, ..., n\] where \[\theta_{ij} = \frac{\pi_j\phi(x_i; \mu_j, \sigma^2_j)}{\sum_{r=1}^k \pi_r \phi(x_i; \mu_r, \sigma^2_r)}\]

and \(\phi(x_i; \mu_j, \sigma^2_j)\) denotes the probability of \(x_i\) given \(X \sim N(\mu_j, \sigma^2_j).\)

8.2.2 Setup

First, we need to implement our prior beliefs.

# Implementing number of components.
n_comp = 2
# Implementing prior parameters.
alpha_j = 10*rep(1, n_comp)
mu_0_j = rep(0, n_comp)
kappa_0_j = rep(3500, n_comp)
nu_0_j = rep(4, n_comp) # degrees of freedom for prior on sigma2
sigma_sq_0_j = rep(var(data), n_comp) # best guess of sigma2

Furthermore, functions to use within the sampling process will be implemented.

# Implementing function that simulates from Dirichlet distribution
rDirichlet = function(alpha_j) {
  n_alpha = length(alpha_j)
  pi_draws = matrix(NA, n_alpha, 1)
  for (j in 1:n_alpha){
    pi_draws[j] = rgamma(1, alpha_j[j], 1)
  }
  # Dividing every column of pi_draws by the sum of the elements in that column.
  pi_draws = pi_draws/sum(pi_draws)
  return(pi_draws)
}
# Implementing function that simulates from Scaled Inverse Chi Sqaured distribution.
rScaledInvChi2 <- function(n, df, scale){
  return((df*scale)/rchisq(n,df=df))
}

To be able to draw from all full conditional posteriors, we need to initialize the parameters \(I_i\) (the component allocation of observation \(i\)), \(\mu_j\) and \(\sigma^2_j\).

# Initializing parameters for the MCMC.
  # I_i. Random allocation of observations with same probability.
  I = c()
  for (i in 1:length(data)) {
    prob_j = c()
    # Reallocating observation.
    I[i] = which(t(rmultinom(1, size = 1 , prob = rep(1/n_comp, n_comp))) == 1)
  }
  # mu_j.
  mu_j = quantile(data, probs = seq(0,1,length = n_comp))
  # sigma_sq_j.
  sigma_sq_j = rep(var(data), n_comp)

Also, at the end, a plot will be created. The plot has to be set up before as well.

# Setting up the plot.
x = as.matrix(data)
xGrid <- seq(min(x)-1*apply(x,2,sd),max(x)+1*apply(x,2,sd),length = 100)
xGridMin <- min(xGrid)
xGridMax <- max(xGrid)
mixDensMean <- rep(0,length(xGrid))
effIterCount <- 0
ylim <- c(0,2*max(hist(x)$density))
plotFit <- TRUE
lineColors <- c("blue", "green", "magenta", 'yellow')
sleepTime <- 0.1 # Adding sleep time between iterations for plotting

Lastly we still have to setup the number of iterations and we have to initialize the matrices to store the results.

# Defining number of iterations.
niter = 200
# Initializing matrices to store results for every iteration.
mu_all = matrix(0, niter, n_comp)
sigma_sq_all = matrix(0, niter, n_comp)

8.2.3 Run

With the implemented prior parameters, sampling functions and the initialized parameter values, we are able to perform Gibbs sampling.

# Performing gibbs sampling.
for (i in 1:niter) {
  
  # Printing iteration number.
  # message(paste('Iteration number:', i))
  
  # Calculating and printing n_j.
  n_j = c()
  for (j in 1:n_comp) {
    n_j[j] = sum(I == j)
  }
  # print(n_j)
  
  # Updating parameters.
    # pi.
    pi_j = rDirichlet(alpha_j + n_j)
    # mu.
    for (j in 1:n_comp) {
      # Calculating mu_n_j for component j.
      mu_n_j = (kappa_0_j[j]/(kappa_0_j[j] + n_j[j]))*mu_0_j[j] + n_j[j]/(kappa_0_j[j] + n_j[j])*mean(data[I == j])
      # Calculating kappa_n_j for component j.
      kappa_n_j = kappa_0_j[j] + n_j[j]
      # Sampling from N(mu_n_j, sigma_sq_j / kappa_n_j); rnorm() takes a standard deviation.
      mu_j[j] = rnorm(n = 1, mean = mu_n_j, sd = sqrt(sigma_sq_j[j] / kappa_n_j))
    }
    mu_all[i, ] = mu_j
    # sigma_sq.
    for (j in 1:n_comp) {
      # Calculating nu_n_j.
      nu_n_j = nu_0_j[j] + n_j[j]
      # Calculating sigma_sq_n_j.
      sigma_sq_n_j = 
        (nu_0_j[j] * sigma_sq_0_j[j] +
        (n_j[j]-1) * var(data[I == j]) + 
        (kappa_0_j[j] * n_j[j]) / (kappa_0_j[j] + n_j[j]) * (mean(data[I == j]) - mu_0_j[j])^2) /
        nu_n_j
      # Sampling from Scaled-Inv-Chi-Sq(nu_n_j, sigma_sq_n_j).
      sigma_sq_j[j] = rScaledInvChi2(n = 1, df = nu_n_j, scale = sigma_sq_n_j)
    }
    sigma_sq_all[i, ] = sigma_sq_j
    
  # Updating allocation.
  for (obs in 1:length(data)) {
    prob_j = c()
    for (j in 1:n_comp) {
      # Extracting value of current observation.
      x = data[obs]
      # Calculating probability of x belonging to component j.
      prob_j[j] = pi_j[j]*dnorm(x, mean = mu_j[j], sd = sqrt(sigma_sq_j[j]))
    }
    # Reallocating observation.
    I[obs] = which(t(rmultinom(1, size = 1 , prob = prob_j/sum(prob_j))) == 1)
  }
  
  # Storing data for fitted density against data histogram.
  # Plotting every iteration is uncommented.
  if (plotFit && (i%%1 == 0)){
    effIterCount <- effIterCount + 1
    # hist(data, breaks = 20, freq = FALSE, xlim = c(xGridMin,xGridMax), 
    #     main = paste("Iteration number", i), ylim = ylim)
    mixDens <- rep(0,length(xGrid))
    components <- c()
    for (j in 1:n_comp){
      compDens <- dnorm(xGrid,mu_j[j],sd = sqrt(sigma_sq_j[j]))
      mixDens <- mixDens + pi_j[j]*compDens
    #  lines(xGrid, compDens, type = "l", lwd = 2, col = lineColors[j])
      components[j] <- paste("Component ",j)
    }
    mixDensMean <- ((effIterCount-1)*mixDensMean + mixDens)/effIterCount
    
    #lines(xGrid, mixDens, type = "l", lty = 2, lwd = 3, col = 'red')
    #legend("topleft", box.lty = 1, legend = c("Data histogram",components, 'Mixture'), 
    #       col = c("black",lineColors[1:n_comp], 'red'), lwd = 2)
    #Sys.sleep(sleepTime)
  }
}

8.2.4 Results

# Mu.
ggplot() +
  geom_line(aes(x = seq(1:nrow(mu_all)),
                y = mu_all[,1],
                color = "Component 1")) +
  geom_line(aes(x = seq(1:nrow(mu_all)),
                y = mu_all[,2],
                color = "Component 2")) +
  scale_color_manual(values = c("Component 1" = "black",
                                "Component 2" = "red")) +
  theme_bw() +
  labs(title = "Convergence of the sampler for mu",
       subtitle = paste0("Last sample: ", 
                         "Comp. 1: ",
                         round(mu_all[niter, 1], 2),
                         " ; ",
                         "Comp. 2: ",
                         round(mu_all[niter, 2], 2)),
       x = "sample",
       y = "mu",
       colour = NULL)

# Sigma_sq.
ggplot() +
  geom_line(aes(x = seq(1:nrow(sigma_sq_all)),
                y = sigma_sq_all[,1],
                color = "Component 1")) +
  geom_line(aes(x = seq(1:nrow(sigma_sq_all)),
                y = sigma_sq_all[,2],
                color = "Component 2")) +
  scale_color_manual(values = c("Component 1" = "black",
                                "Component 2" = "red")) +
  theme_bw() +
  labs(title = "Convergence of the sampler for sigma_sq",
       subtitle = paste0("Last sample: ", 
                         "Comp. 1: ",
                         round(sigma_sq_all[niter, 1], 2),
                         " ; ",
                         "Comp. 2: ",
                         round(sigma_sq_all[niter, 2], 2)),
       x = "sample",
       y = "sigma_sq",
       colour = NULL)

8.2.5 Alternative solution

set.seed(12345)
data = read.table("../../static/data/rainfall.txt", header=FALSE)[,1]
x = as.matrix(data)

# Model options
nComp <- 2 # Number of mixture components
# Prior options
alpha <- 10*rep(1,nComp) # Dirichlet(alpha)
muPrior <- rep(0,nComp) # Prior mean of theta
tau2Prior <- rep(10,nComp) # Prior std theta
sigma2_0 <- rep(var(x),nComp) # s20 (best guess of sigma2)
nu0 <- rep(4,nComp) # degrees of freedom for prior on sigma2
# MCMC options
nIter <- 100 # Number of Gibbs sampling draws

# Plotting options
plotFit <- FALSE
lineColors <- c("blue", "green", "magenta", 'yellow')
sleepTime <- 0.1 # Adding sleep time between iterations for plotting

################ END USER INPUT ###############
###### Defining a function that simulates from the scaled inverse chi-squared distribution
rScaledInvChi2 <- function(n, df, scale){
return((df*scale)/rchisq(n,df=df))
}

####### Defining a function that simulates from a Dirichlet distribution
rDirichlet <- function(param){
nCat <- length(param)
thetaDraws <- matrix(NA,nCat,1)
for (j in 1:nCat){
thetaDraws[j] <- rgamma(1,param[j],1)
} # Dividing every column of thetaDraws by the sum of the elements in that column.
thetaDraws = thetaDraws/sum(thetaDraws)
return(thetaDraws)
}

# Simple function that converts between two different
# representations of the mixture allocation
S2alloc <- function(S){
n <- dim(S)[1]
alloc <- rep(0,n)
for (i in 1:n){
alloc[i] <- which(S[i,] == 1)
}
return(alloc)
}

# Initial value for the MCMC
nObs <- length(x)
# nObs-by-nComp matrix with component allocations.
S <- t(rmultinom(nObs, size = 1 , prob = rep(1/nComp,nComp)))
theta <- quantile(x, probs = seq(0,1,length = nComp))
sigma2 <- rep(var(x),nComp)
probObsInComp <- rep(NA, nComp)
# Setting up the plot
xGrid <- seq(min(x)-1*apply(x,2,sd),max(x)+1*apply(x,2,sd),length = 100)
xGridMin <- min(xGrid)
xGridMax <- max(xGrid)
mixDensMean <- rep(0,length(xGrid))
effIterCount <- 0
ylim <- c(0,2*max(hist(x)$density))

gibbs_thetas = matrix(0,nIter,2)
gibbs_sigmas = matrix(0,nIter,2)

for (k in 1:nIter){
# Just a function that converts between different representations
# of the group allocations
alloc <- S2alloc(S)
nAlloc <- colSums(S)
if(k == nIter){
message(paste('Iteration number:',k))
print(nAlloc)
}

# Update components probabilities
w <- rDirichlet(alpha + nAlloc)

# Update theta's
for (j in 1:nComp){
precPrior <- 1/tau2Prior[j]
precData <- nAlloc[j]/sigma2[j]
precPost <- precPrior + precData
wPrior <- precPrior/precPost
muPost <- wPrior*muPrior + (1-wPrior)*mean(x[alloc == j])
tau2Post <- 1/precPost
theta[j] <- rnorm(1, mean = muPost, sd = sqrt(tau2Post))
}

gibbs_thetas[k, ] = theta
# Update sigma2's
for (j in 1:nComp){
sigma2[j] <- rScaledInvChi2(1, df = nu0[j] + nAlloc[j],
scale = (nu0[j]*sigma2_0[j] +
sum((x[alloc == j] - theta[j])^2))/(nu0[j] + nAlloc[j]))
}

gibbs_sigmas[k,] = sigma2
# Update allocation
for (i in 1:nObs){
for (j in 1:nComp){
probObsInComp[j] <- w[j]*dnorm(x[i], mean = theta[j], sd = sqrt(sigma2[j]))
}
S[i,] <- t(rmultinom(1, size = 1 , prob = probObsInComp/sum(probObsInComp)))
}

# Printing the fitted density against data histogram
if ((k==nIter) && (k%%1 ==0)){
effIterCount <- effIterCount + 1
hist(x, breaks = 20, freq = FALSE, xlim = c(xGridMin,xGridMax),
main = paste("Iteration number",k), ylim = ylim)
mixDens <- rep(0,length(xGrid))
components <- c()

for (j in 1:nComp){
compDens <- dnorm(xGrid,theta[j],sd = sqrt(sigma2[j]))
mixDens <- mixDens + w[j]*compDens
lines(xGrid, compDens, type = "l", lwd = 2, col = lineColors[j])
components[j] <- paste("Component ",j)
}

mixDensMean <- ((effIterCount-1)*mixDensMean + mixDens)/effIterCount
lines(xGrid, mixDens, type = "l", lty = 2, lwd = 3, col = 'red')
legend("topleft", box.lty = 1, legend = c("Data histogram",components, 'Mixture'),
col = c("black",lineColors[1:nComp], 'red'), lwd = 2)
Sys.sleep(sleepTime)
}
}
## Iteration number: 100
## [1] 3791 3129

# Calculate mean of batches of 2 draws to visualize the
# auto correlation between sequential draws
t1 = c()
t2 = c()
s1 = c()
s2 = c()
for (i in 1:nIter){
if(i%%2 == 0){
t1 = c(t1, mean(gibbs_thetas[,1][(i-1):i]))
t2 = c(t2, mean(gibbs_thetas[,2][(i-1):i]))
s1 = c(s1, mean(gibbs_sigmas[,1][(i-1):i]))
s2 = c(s2, mean(gibbs_sigmas[,2][(i-1):i]))
}
}

# Plots displaying convergence of the Normal hyper
# parameters during sampling
pdf("3_1_2_gibbs_mixt.pdf")
par(mfrow=c(3,1))

# Plot comparison between kernel density, mixture of normals and a normal
# approximation
hist(x, breaks = 20,
cex=.1,
border="lightgray",
freq = FALSE,
xlim = c(xGridMin,xGridMax),
xlab="Precipitation",
ylab="Density",
main = "Rainfall: Mixture of Normals")
lines(xGrid,
mixDensMean,
type = "l",
lwd = 2,
lty = 4,
col = "black")
lines(xGrid,dnorm(xGrid, mean = mean(x), sd = apply(x,2,sd)),type = "l",lwd = 2,col = "gray")
legend("topright",box.lty = 1,legend = c("Data histogram","Mixture density","Normal density"),col=c("lightgray","black","gray"),lwd = 2)
# Plot the auto correlation (convergence) between draws of mu
min_t = min(c(min(t1), min(t2)))
max_t = max(c(max(t1), max(t2)))

plot(t1,type="l",ylim=c(min_t, max_t),cex=.1,lwd=2,
     main=expression(paste("Convergence of Gibbs Sampling ", "(", theta, ")", sep=" ")),
xlab="Batches of sequential draws",
ylab=expression(paste("Mean of seq. draws of ", theta, sep=" ")))
lines(t2, lwd=2, col="gray")
legend("topright",box.lty = 1,legend = c(expression(paste(theta, " (1)", sep=" ")),
                                         expression(paste(theta, " (2)", sep=" "))),
       col=c("black","gray"),lwd = 2)
# Plot the auto correlation (convergence) between draws of sigma
min_s = min(c(min(s1), min(s2)))
max_s = max(c(max(s1), max(s2)))

plot(s1,type="l",ylim=c(min_s, max_s),cex=.1,lwd=2,main=expression(
paste("Convergence of Gibbs Sampling ", "(", sigma^2, ")", sep=" ")),
xlab="Batches of sequential draws",
ylab=expression(paste("Mean of seq. draws of ", sigma^2, sep=" ")))
lines(s2, lwd=2, col="gray")
legend("topright",box.lty = 1,legend = c(expression(paste(sigma^2, " (1)", sep=" ")),
                                         expression(paste(sigma^2, " (2)", sep=" "))),
       col=c("black","gray"),lwd = 2)

dev.off()
## png 
##   2

8.3 1c. Graphical comparison.

par(mfrow = c(1, 1))
hist(data, freq=FALSE, breaks=20)
lines(xGrid, mixDensMean, type = "l", lwd = 2, lty = 4, col = "red")
mu = mean(mu_draws)
sigma2 = mean(sigma_sq_draws)
lines(xGrid, dnorm(xGrid, mean = mu, sd=sqrt(sigma2)), type = "l", lwd = 2, col = "blue")
legend("topright", box.lty = 1, legend = c("Data histogram", "Mixture density", "Normal density with mean parameters obtained in a)"), col = c("black", "red", "blue"), lwd = 2)

It becomes obvious that the mixture model fits the data much better.

9 Lab3 Assignment 2: Metropolis Random Walk for Poisson regression.

First, the provided data eBayNumberOfBidderData.dat will be read into the R environment.

# Storing provided data in R environment as a numeric vector.
data = read.table("../../static/data/eBayNumberOfBidderData.dat", header = TRUE, sep = "", dec = ".")

9.1 2a. Obtaining the maximum likelihood estimator of beta.

To obtain the maximum likelihood estimator of \(\beta\) in the Poisson regression model, we use the glm()-function and specify the family as poisson.

# Fitting poisson regression using maximum likelihood estimation.
glm_model = glm(nBids ~ ., data = data[, -which(colnames(data) == "Const")], family = poisson)

Afterwards, we can assess the significance of the covariates using the summary()-function, assuming a significance level of \(\alpha = 0.05\).

## Covariate significant?
summary(glm_model)$coeff[, 4] < 0.05
## (Intercept) PowerSeller    VerifyID      Sealed     Minblem     MajBlem 
##        TRUE       FALSE        TRUE        TRUE       FALSE        TRUE 
##     LargNeg     LogBook MinBidShare 
##       FALSE        TRUE        TRUE

The beta-coefficients of the significant covariates are printed below.

# Printing beta-coefficients for significant covariates.
knitr::kable(summary(glm_model)$coeff[which(summary(glm_model)$coeff[, 4] < 0.05), ])
              Estimate   Std. Error     z value    Pr(>|z|)
(Intercept)   1.0724421   0.0307749   34.847987   0.0000000
VerifyID     -0.3945165   0.0924258   -4.268468   0.0000197
Sealed        0.4438426   0.0505627    8.778057   0.0000000
MajBlem      -0.2208712   0.0914364   -2.415571   0.0157106
LogBook      -0.1206776   0.0289646   -4.166387   0.0000309
MinBidShare  -1.8940966   0.0712396  -26.587685   0.0000000
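
Because of the log link, exponentiating a coefficient gives its multiplicative effect on the expected number of bids; for instance, \(\exp(0.4438) \approx 1.56\) for Sealed, i.e. roughly 56 percent more expected bidders, all else equal. A quick way to compute this for all significant covariates, reusing glm_model from above:

# Multiplicative effects (exp(beta)) of the significant covariates on the expected number of bids.
exp(coef(glm_model)[summary(glm_model)$coeff[, 4] < 0.05])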

9.2 2b. Bayesian analysis using an approximated multivariate normal posterior.

The goal is to implement the approximated posterior multivariate normal. To do so, we need to implement a way to obtain the posterior mode \(\widetilde{\beta}\) and the negative Hessian at the posterior mode \(J_y(\widetilde{\beta})\) by numerical optimization.
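
In other words, the approximation we are after is the multivariate normal centered at the posterior mode,

\[ \beta \mid y \sim N\big(\widetilde{\beta},\, J_y^{-1}(\widetilde{\beta})\big) \quad \text{(approximately)}, \]

which is what the code below computes with optim().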

The optim()-function is used to find the posterior mode \(\widetilde{\beta}\) by numerical optimization. With hessian = TRUE it also returns the Hessian of the log posterior at the mode, which equals \(-J_y(\widetilde{\beta})\); negating and inverting this matrix gives the desired \(J_y^{-1}(\widetilde{\beta})\).

# Defining the function to optimize over (log-posterior of the Poisson regression).
# dmvnorm() comes from the mvtnorm package.
log_posterior_poi = function(beta_vector, y, X, mu, sigma) {
  # Evaluating the Poisson log-likelihood (the constant log(y!) term is omitted).
  log_likelihood = sum(y * X %*% beta_vector - exp(X %*% beta_vector))
  # Guarding against numerical overflow of exp() for extreme values of beta.
  if (abs(log_likelihood) == Inf) log_likelihood = -20000
  # Evaluating the log of the multivariate normal prior.
  log_prior = dmvnorm(x = beta_vector, 
                      mean = mu, 
                      sigma = sigma, 
                      log = TRUE)
  # Returning log posterior.
  return(log_likelihood + log_prior)
}
# Setting up parameters.
  # Response.
  y = data$nBids
  # Features.
  X = as.matrix(data[, -which(colnames(data) == "nBids")])
  # Prior.
  mu_0 = rep(0, ncol(X))
  sigma_0 = 100*solve(t(X)%*%X)
  # Initial beta_vector.
  beta_init = rep(0, ncol(X))
# Computing beta_mode and hessian.
results_optim = optim(par = beta_init,
                      fn = log_posterior_poi,
                      y = y,
                      X = X,
                      mu = mu_0,
                      sigma = sigma_0,
                      method=c("BFGS"),
                      # Multiplying objective function by -1 to find maximum instead of minimum.
                      control=list(fnscale=-1),
                      hessian=TRUE)
# Storing results.
beta_mode = results_optim$par
beta_inv_hessian= -solve(results_optim$hessian)

# The posterior is approximately beta ~ N(results_optim$par, -solve(results_optim$hessian)),
# i.e. N(beta_mode, beta_inv_hessian).


# Printing results.
  # Beta_mode.
  knitr::kable(data.frame(beta_ = seq(from = 0,to = length(results_optim$par)-1),
                          posterior_mode = results_optim$par),
               caption = "Numerical values for beta_mode")
Table 9.1: Numerical values for beta_mode

beta_   posterior_mode
    0        1.0698412
    1       -0.0205125
    2       -0.3930060
    3        0.4435555
    4       -0.0524663
    5       -0.2212384
    6        0.0706968
    7       -0.1202177
    8       -1.8919850
# Adjusted Hessian, i.e. the approximate posterior covariance matrix.
print("Adjusted hessian:")
## [1] "Adjusted hessian:"
beta_inv_hessian
##                 [,1]           [,2]           [,3]           [,4]
##  [1,]  0.00094546251 -0.00071389716 -0.00027415169 -0.00027090164
##  [2,] -0.00071389716  0.00135307595  0.00004024623 -0.00029489678
##  [3,] -0.00027415169  0.00004024623  0.00851535996 -0.00078248858
##  [4,] -0.00027090164 -0.00029489678 -0.00078248858  0.00255777811
##  [5,] -0.00044545544  0.00011429600 -0.00010136130  0.00035771579
##  [6,] -0.00027722385 -0.00020826682  0.00022825391  0.00045323084
##  [7,] -0.00051283511  0.00028017771  0.00033135679  0.00033764674
##  [8,]  0.00006436765  0.00011818516 -0.00031918692 -0.00013110254
##  [9,]  0.00110993485 -0.00056857063 -0.00042928275 -0.00005759169
##                 [,5]           [,6]           [,7]           [,8]
##  [1,] -0.00044545544 -0.00027722385 -0.00051283511  0.00006436765
##  [2,]  0.00011429600 -0.00020826682  0.00028017771  0.00011818516
##  [3,] -0.00010136130  0.00022825391  0.00033135679 -0.00031918692
##  [4,]  0.00035771579  0.00045323084  0.00033764674 -0.00013110254
##  [5,]  0.00362460567  0.00034923530  0.00005844006  0.00005854011
##  [6,]  0.00034923530  0.00836505907  0.00040486444 -0.00008975843
##  [7,]  0.00005844006  0.00040486444  0.00317506008 -0.00025417515
##  [8,]  0.00005854011 -0.00008975843 -0.00025417515  0.00083847026
##  [9,] -0.00006437066  0.00026222644 -0.00010631687  0.00103742837
##                 [,9]
##  [1,]  0.00110993485
##  [2,] -0.00056857063
##  [3,] -0.00042928275
##  [4,] -0.00005759169
##  [5,] -0.00006437066
##  [6,]  0.00026222644
##  [7,] -0.00010631687
##  [8,]  0.00103742837
##  [9,]  0.00505475741

9.3 2c. Bayesian analysis of the actual posterior using the Metropolis algorithm.

First we implement the function metropolis_sampler() which will be used to perform the sampling procedure.

Within the implementation, the acceptance probability \(\alpha\) will be calculated as \[ \alpha = \min\bigg(1, \frac{p(\theta_p|y)}{p(\theta^{(i-1)}|y)}\bigg) = \min\bigg(1,\exp\Big[\log p(\theta_p|y)-\log p(\theta^{(i-1)}|y) \Big]\bigg).\] Also, instead of working with the posterior density values directly, we use the log posterior density, since logs are numerically more stable and avoid problems with very small or very large numbers (under- and overflow).

As input, the user has to specify a (log) posterior density function logPostFunc and its parameters, a starting value/vector theta, the number of iterations n, and the tuning parameter c, which scales the covariance of the multivariate normal random-walk proposal density.

# Implementing the Metropolis random walk sampler.
metropolis_sampler = function(logPostFunc, theta, n, c, ...) {
  
  # Setup.
  theta_prev = theta
  n_accepted = 0
  p = length(theta)
  theta_samples = matrix(NA, n, p)
  
  # Run the chain for n iterations.
  for (i in 1:n) {
    # Sample from proposal distribution
    theta_cand = as.vector(rmvnorm(n = 1, 
                                   mean = theta_prev, 
                                   sigma = c * beta_inv_hessian))
    # Calculate log posteriors
    log_post_cand = logPostFunc(theta_cand, ...)
    log_post_prev = logPostFunc(theta_prev, ...)
    
    # Calculate alpha
    alpha = min(1, exp(log_post_cand - log_post_prev))
    
    # Select sample with probability alpha
    u = runif(1)
    if (u <= alpha){
      theta_samples[i,] = theta_cand
      theta_prev = theta_cand
      n_accepted = n_accepted + 1
    } else {
      theta_samples[i,] = theta_prev
    }
  }
  cat("Sampled", n, "samples with an acceptance rate of", n_accepted/n)
  return(theta_samples)
}

# Sampling from posterior using metropolis function.
res = metropolis_sampler(logPostFunc = log_posterior_poi, 
                         theta = beta_mode,
                         n = 1000,
                         c = 0.6,
                         mu = mu_0,
                         sigma = sigma_0,
                         X = X,
                         y = y)
## Sampled 1000 samples with an acceptance rate of 0.255

Using a value of 0.6 for c, an acceptance rate between 25 and 30 percent is obtained.
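
As a rough check of this choice, the sampler can be rerun over a small grid of c values and the reported acceptance rates compared; a minimal sketch using the functions defined above:

# Comparing acceptance rates for a few values of the tuning parameter c.
for (c_try in c(0.3, 0.6, 1.0, 2.0)) {
  tmp = metropolis_sampler(logPostFunc = log_posterior_poi,
                           theta = beta_mode, n = 1000, c = c_try,
                           mu = mu_0, sigma = sigma_0, X = X, y = y)
  cat(" (c =", c_try, ")\n")
}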

The convergence of each beta can be shown as follows:

# Plotting the convergence (trace plot) of each beta coefficient.
# The y-axis labels are taken from colnames(X) so that they correspond to the
# sampled coefficients (Const, PowerSeller, ..., MinBidShare).
conv_plots = lapply(1:ncol(res), function(j) {
  ggplot() +
    geom_line(aes(x = seq_len(nrow(res)),
                  y = res[, j])) +
    geom_hline(yintercept = mean(res[, j]),
               color = "red") +
    labs(x = "sample",
         y = colnames(X)[j]) +
    theme_bw()
})

grid.arrange(grobs = conv_plots, nrow = 3)

10 Appendix

10.1 Conjugate priors

10.2 Common Distribution

Common_Distributions

10.3 Mattias' Code for 1b)

# Estimating a simple mixture of normals
# Author: Mattias Villani, IDA, Linkoping University. http://mattiasvillani.com

##########    BEGIN USER INPUT #################
# Data options
data = read.table("rainfall.dat", header = FALSE, sep = "", dec = ".")[,1]
x = as.matrix(data)
# Model options
nComp <- 2    # Number of mixture components

# Prior options
alpha <- 10*rep(1,nComp) # Dirichlet(alpha)
muPrior <- rep(0,nComp) # Prior mean of mu
tau2Prior <- rep(10,nComp) # Prior variance of mu
sigma2_0 <- rep(var(x),nComp) # s20 (best guess of sigma2)
nu0 <- rep(4,nComp) # degrees of freedom for prior on sigma2

# MCMC options
nIter <- 100 # Number of Gibbs sampling draws

# Plotting options
plotFit <- TRUE
lineColors <- c("blue", "green", "magenta", 'yellow')
sleepTime <- 0.1 # Adding sleep time between iterations for plotting
################   END USER INPUT ###############

###### Defining a function that simulates from the scaled inverse chi-squared distribution
rScaledInvChi2 <- function(n, df, scale){
  return((df*scale)/rchisq(n,df=df))
}

####### Defining a function that simulates from a Dirichlet distribution
rDirichlet <- function(param){
  nCat <- length(param)
  piDraws <- matrix(NA,nCat,1)
  for (j in 1:nCat){
    piDraws[j] <- rgamma(1,param[j],1)
  }
  piDraws = piDraws/sum(piDraws) # Dividing every column of piDraws by the sum of the elements in that column.
  return(piDraws)
}

# Simple function that converts between two different representations of the mixture allocation
S2alloc <- function(S){
  n <- dim(S)[1]
  alloc <- rep(0,n)
  for (i in 1:n){
    alloc[i] <- which(S[i,] == 1)
  }
  return(alloc)
}

# Initial value for the MCMC
nObs <- length(x)
S <- t(rmultinom(nObs, size = 1 , prob = rep(1/nComp,nComp))) # nObs-by-nComp matrix with component allocations.
mu <- quantile(x, probs = seq(0,1,length = nComp))
sigma2 <- rep(var(x),nComp)
probObsInComp <- rep(NA, nComp)

# Setting up the plot
xGrid <- seq(min(x)-1*apply(x,2,sd),max(x)+1*apply(x,2,sd),length = 100)
xGridMin <- min(xGrid)
xGridMax <- max(xGrid)
mixDensMean <- rep(0,length(xGrid))
effIterCount <- 0
ylim <- c(0,2*max(hist(x)$density))


for (k in 1:nIter){
  message(paste('Iteration number:',k))
  alloc <- S2alloc(S) # Just a function that converts between different representations of the group allocations
  nAlloc <- colSums(S)
  print(nAlloc)
  # Update components probabilities
  pi <- rDirichlet(alpha + nAlloc)
  
  # Update mu's
  for (j in 1:nComp){
    precPrior <- 1/tau2Prior[j]
    precData <- nAlloc[j]/sigma2[j]
    precPost <- precPrior + precData
    wPrior <- precPrior/precPost
    muPost <- wPrior*muPrior + (1-wPrior)*mean(x[alloc == j])
    tau2Post <- 1/precPost
    mu[j] <- rnorm(1, mean = muPost, sd = sqrt(tau2Post))
  }
  
  # Update sigma2's
  for (j in 1:nComp){
    sigma2[j] <- rScaledInvChi2(1, df = nu0[j] + nAlloc[j], 
                                scale = (nu0[j]*sigma2_0[j] + 
                                           sum((x[alloc == j] - mu[j])^2))/(nu0[j] + nAlloc[j]))
  }
  
  # Update allocation
  for (i in 1:nObs){
    for (j in 1:nComp){
      probObsInComp[j] <- pi[j]*dnorm(x[i], mean = mu[j], sd = sqrt(sigma2[j]))
    }
    S[i,] <- t(rmultinom(1, size = 1 , prob = probObsInComp/sum(probObsInComp)))
  }
  
  # Printing the fitted density against data histogram
  if (plotFit && (k%%1 ==0)){
    effIterCount <- effIterCount + 1
    hist(x, breaks = 20, freq = FALSE, xlim = c(xGridMin,xGridMax), 
         main = paste("Iteration number",k), ylim = ylim)
    mixDens <- rep(0,length(xGrid))
    components <- c()
    for (j in 1:nComp){
      compDens <- dnorm(xGrid,mu[j],sd = sqrt(sigma2[j]))
      mixDens <- mixDens + pi[j]*compDens
      lines(xGrid, compDens, type = "l", lwd = 2, col = lineColors[j])
      components[j] <- paste("Component ",j)
    }
    mixDensMean <- ((effIterCount-1)*mixDensMean + mixDens)/effIterCount
    
    lines(xGrid, mixDens, type = "l", lty = 2, lwd = 3, col = 'red')
    legend("topleft", box.lty = 1, legend = c("Data histogram",components, 'Mixture'), 
           col = c("black",lineColors[1:nComp], 'red'), lwd = 2)
    Sys.sleep(sleepTime)
  }
  
}

hist(x, breaks = 20, freq = FALSE, xlim = c(xGridMin,xGridMax), 
     main = "Final fitted density")
lines(xGrid, mixDensMean, type = "l", lwd = 2, lty = 4, col = "red")
lines(xGrid, dnorm(xGrid, mean = mean(x), sd = apply(x,2,sd)), type = "l", lwd = 2, col = "blue")
legend("topright", box.lty = 1, legend = c("Data histogram","Mixture density",
                                           "Normal density"), col=c("black","red","blue"), lwd = 2)

#########################    Helper functions    ##############################################