BayPass is a genome–environment association (GEA) method developed by
Gautier (2015). Its models explicitly account for (and can estimate) the
covariance structure between population allele frequencies that arises
from the shared history of populations. BayPass provides three
models:
- Core model: an FST-scan model with no climatic or environmental
covariates; it is also used to estimate the population covariance
(omega) matrix.
- Auxiliary covariate model (AUX): a GEA model that extends the standard
model by attaching a binary auxiliary variable to each SNP-covariate
association.
- Standard covariate model (STD): a GEA model that accounts for
population structure through the covariance matrix.
We will use the standard covariate model to perform outlier detection. This model uses a scaled covariance matrix of population allele frequencies, estimated from the genomic data, to account for population structure, and assesses the extent to which each marker is linearly associated with the climatic/environmental covariates.
The regression coefficient for each SNP with the covariates is estimated using either MCMC (not detailed here) or the IS (importance sampling) approximation. The IS approach also allows estimation of the Bayes factor, which evaluates support for the association of each SNP i with covariate k, i.e., comparing the model with association (βᵢₖ ≠ 0) to the null model (βᵢₖ = 0) (BayPass manual).
Steps to perform outlier detection with BayPass:
- Prepare genomic data (with and without MAF filtering).
- Prepare climate/environmental data in the required format.
- Run the Standard Covariate Model with IS on Ubuntu.
- Analyze the results following the BayPass manual.
#packages used throughout this section
library(dplyr)    # case_when(), mutate(), group_by(), summarize()
library(tibble)   # rownames_to_column()
library(corrplot) # correlation plots
#population metadata
meta_data_pop <- read.csv("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Populations/taxus_sample_29pop.csv", header = TRUE, sep = ";", dec = ",")
#individual metadata
meta_data_vcf <- read.csv("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Samples/samples_taxus_baccata_adapcon_gentree.csv", header = TRUE, sep = ";", dec = ",")
The genomic data must be formatted with one row per SNP and two columns
per population (one per allele), so the number of columns is twice the
number of populations. The genomic information is encoded as allele
counts at the population level, as shown below:
POP1 POP2 POP3 POP4 POP5 POP6
SNP 1: 71 8 115 0 61 36 51 39 10 91 69 58
SNP 2: 82 0 91 0 84 14 24 57 28 80 18 80
We need to create two genomic files in this format: one for the GEA and one for estimating the covariance matrix used to account for population structure. The covariance-matrix file is not filtered on minor allele count (MAC), because low-frequency alleles are informative about population structure.
The genomic matrix is either the imputed, MAC-filtered dataset (475 individuals, 8,616 SNPs) used for the GEA, or the non-MAC-filtered dataset (452 individuals, 8,252 SNPs) used for the omega matrix.
#genomic data for outlier detection (MAC-filtered, imputed)
load("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/genetic_new/Gen_matrix_imp_T_Adapcon_Gentree_475_8616.Rdata")
genomic_data_maf_c <- Gen_matrix_imp_T_Adapcon_Gentree_475_8616
#genomic data without MAC filtering, for the covariance matrix
load("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/genetic_new/GEA/Data_geno_no_maf_c_8252SNP_452IND.Rdata")
genomic_data_no_maf_c <- Data_geno_no_maf_c_8252SNP_452IND
genomic_data_maf_c_ID <- rownames_to_column(genomic_data_maf_c)
#name of VCF_ID to merge
names(genomic_data_maf_c_ID)[names(genomic_data_maf_c_ID) == 'rowname'] <- 'VCF_ID'
#add the population information
genomic_data_MAF_pop <- data.frame(merge(meta_data_vcf[,c(1,3)],genomic_data_maf_c_ID, "VCF_ID"))
#This function computes the count of the first allele of each SNP from genotypes coded 0, 1, 2.
reformat_genotype <- function(allele) {
  case_when(
    trimws(allele) == "0" ~ 2,
    trimws(allele) == "1" ~ 1,
    trimws(allele) == "2" ~ 0,
    TRUE ~ NA_real_
  )
}
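The recoding can be checked in isolation. Here is a base-R sketch of the same logic (the function name reformat_genotype_base is illustrative, not part of the pipeline):

```r
# Base-R equivalent of the recoding: genotype g (count of the alternate
# allele) maps to 2 - g copies of the first allele; anything else becomes NA.
reformat_genotype_base <- function(allele) {
  x <- suppressWarnings(as.numeric(trimws(allele)))
  ifelse(x %in% c(0, 1, 2), 2 - x, NA_real_)
}

reformat_genotype_base(c("0", " 1", "2", "missing"))
# 2 1 0 NA
```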
#Apply the reformat_genotype function to the SNP columns and compute allele 2. Allele 1 and allele 2 are kept in two separate dataframes so they can be combined more easily afterwards.
#allele1
df_allele1 <- genomic_data_MAF_pop %>%
mutate(across(starts_with("Tbac"), ~ reformat_genotype(.)))
#allele 2
df_allele2 <- df_allele1 %>%
mutate(across(starts_with("Tbac"), ~ 2 - .x)) #allele 2 count = 2 - allele 1 count, for each individual and SNP
# Group by Population and summarize allele counts
#allele1
allele_counts_allele1 <- df_allele1 %>%
group_by(Population) %>%
summarize(
across(starts_with("Tbac"), ~ sum(.)),
)
#allele2
allele_counts_allele2 <- df_allele2 %>%
group_by(Population) %>%
summarize(
across(starts_with("Tbac"), ~ sum(.)),
)
#rename allele_counts_2 to merge them with allele1
allele_counts_allele2$Population <- paste0(allele_counts_allele2$Population,"_Allele_2")
#final dataset with both allele 1 and 2
final_dtf <- rbind(allele_counts_allele1,allele_counts_allele2);row.names(final_dtf) <- final_dtf$Population
#order rows so that allele 1 and allele 2 of each population end up side by side
final_r_order <- final_dtf[order(row.names(final_dtf)), ]
#transpose: SNPs in rows, two columns per population
data_allele_count_BAYPASS_MAF_c <- data.frame(t(final_r_order))
#export the data in txt
write.table(x=data_allele_count_BAYPASS_MAF_c,
file = "C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/GEA/BAYPASS/data_allele_count_BAYPASS_MAF_c.txt",
sep = " ",
row.names = F,
col.names = F)
The data_allele_count_BAYPASS_MAF_c file contains the genotypic data to run the BAYPASS standard covariate model with importance sampling (IS).
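The ordering step relies on alphabetical sorting placing each population's "_Allele_2" row directly after its allele-1 row. A minimal illustration (the population names here are made up):

```r
# Sorting interleaves the allele-1 and allele-2 rows per population,
# which yields the two-columns-per-population layout after transposing.
pops <- c("POP1", "POP2", "POP1_Allele_2", "POP2_Allele_2")
sort(pops)
# "POP1" "POP1_Allele_2" "POP2" "POP2_Allele_2"
```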
genomic_data_maf_no_c_ID <- rownames_to_column(genomic_data_no_maf_c)
#name of VCF_ID to merge
names(genomic_data_maf_no_c_ID)[names(genomic_data_maf_no_c_ID) == 'rowname'] <- 'VCF_ID'
#add the population information
genomic_data_no_MAF_pop <- data.frame(merge(meta_data_vcf[,c(1,3)],genomic_data_maf_no_c_ID, "VCF_ID"))
# Apply the reformat_genotype function to the SNP columns and compute allele 2, again keeping allele 1 and allele 2 in separate dataframes.
#allele1
df_allele1_no_mac <- genomic_data_no_MAF_pop %>%
mutate(across(starts_with("Tbac"), ~ reformat_genotype(.)))
#allele 2
df_allele2_no_mac <- df_allele1_no_mac %>%
mutate(across(starts_with("Tbac"), ~ 2 - .x)) #allele 2 count = 2 - allele 1 count, for each individual and SNP
# Group by Population and summarize allele counts
#allele1
allele_counts_allele1_no_mac <- df_allele1_no_mac %>%
group_by(Population) %>%
summarize(
across(starts_with("Tbac"), ~ sum(.,na.rm = TRUE)),
)
#allele2
allele_counts_allele2_no_mac <- df_allele2_no_mac %>%
group_by(Population) %>%
summarize(
across(starts_with("Tbac"), ~ sum(.,na.rm = TRUE)),
)
#rename allele_counts_2 to merge them with allele1
allele_counts_allele2_no_mac$Population <- paste0(allele_counts_allele2_no_mac$Population,"_Allele_2")
#final dataset with both allele 1 and 2
final_dtf_no_mac <- rbind(allele_counts_allele1_no_mac,allele_counts_allele2_no_mac);row.names(final_dtf_no_mac) <- final_dtf_no_mac$Population
#order rows so that allele 1 and allele 2 of each population end up side by side
final_r_order_no_mac <- final_dtf_no_mac[order(row.names(final_dtf_no_mac)), ]
#transpose: SNPs in rows, two columns per population
data_allele_count_BAYPASS_MAF_no_c <- data.frame(t(final_r_order_no_mac))
#export the data in txt
write.table(x=data_allele_count_BAYPASS_MAF_no_c,
file = "C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/GEA/BAYPASS/data_allele_count_BAYPASS_MAF_no_c.txt",
sep = " ",
row.names = F,
col.names = F)
data_allele_count_BAYPASS_MAF_no_c contains the genomic data used to run the core model of BAYPASS to calculate the omega matrix.
We load the non-scaled climatic data:
#climatic data
climatic_data <- read.csv("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/Climatic_data/new_selection/Past_new_6_Climatic_data_scale_df.csv",sep=";",dec=",")
climatic_data_BAYPASS <- data.frame(t(climatic_data))
The units are:
- Bio1: annual mean temperature (°C)
- Bio2: mean diurnal range (mean of monthly maximum minus minimum temperature, °C)
- Bio4: temperature seasonality (standard deviation × 100)
- Bio9: mean temperature of the driest quarter (°C)
- Bio12: annual precipitation (mm)
- Bio15: precipitation seasonality (coefficient of variation)
We need to format the data for BAYPASS, which can be done in two ways: either by creating a separate text file for each climate variable with populations in columns, or by creating a single text file containing all climate variables in rows and populations in columns.
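The first option (one file per climatic variable) can be sketched as follows; the toy values, variable names, and output location are illustrative, and this document itself uses the second, single-file option:

```r
# Hedged sketch: write one whitespace-separated file per climatic variable,
# with the populations' values on a single row (illustrative data).
clim <- data.frame(Bio1 = c(10.2, 8.7, 12.1), Bio12 = c(800, 950, 640))
for (v in names(clim)) {
  out_file <- file.path(tempdir(), paste0(v, "_BAYPASS.txt"))
  write(clim[[v]], file = out_file, ncolumns = nrow(clim))
}
```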
write.table(x=climatic_data_BAYPASS[-c(1,2),],
file = "C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/GEA_new_var/BAYPASS/climatic_data_BAYPASS.txt",
sep = " ",
row.names = F,
col.names = F)
climatic_data_BAYPASS contains the climatic data used to run the standard covariate model with IS in BAYPASS.
We ran the core model to estimate the covariance matrix used to
correct for population structure. We used genomic data without MAC
correction: 8,252 SNPs, 29 populations, with allele counts at the
population level (data_allele_count_BAYPASS_MAF_no_c). The core model
was run with the following parameters:
i) 20 pilot runs of 500 iterations (to adjust proposal
distributions)
ii) a burn-in period of 5000 iterations
iii) final MCMC sampling of 1000 parameter values sampled every 20
iterations (i.e., a total of 20 000 iterations)
The covariance matrix used to correct for population structure is scaled directly within the model, so no additional scaling is required.
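On the Linux side, the corresponding core-model call looks roughly like this (a sketch built from the option names in the BayPass 2.x manual; the binary name g_baypass and the file paths are assumptions):

```r
# Hedged sketch of the core-model command, assembled in R and intended to be
# run (e.g. via system(cmd)) on the machine where BayPass is installed.
cmd <- paste("g_baypass",
             "-npop 29",
             "-gfile data_allele_count_BAYPASS_MAF_no_c.txt",
             "-outprefix core_model",
             "-npilot 20 -pilotlength 500 -burnin 5000",
             "-nval 1000 -thin 20")
# system(cmd)
```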
The output results are here:
omega <- as.matrix(read.table("C:/Users/tfrancisco/Documents/Thesis/Data/Species/Taxus_baccata/GEA/BAYPASS/covari_matrix_core_model/out_mat_omega.out"))
country_names <- climatic_data$Country
dimnames(omega)=list(country_names,country_names)
Now, we can visualize the covariance matrix using plots, corrplot, and heatmaps, as proposed by Gautier (BAYPASS Manual) and following Archambeau et al. (2024).
plot.omega <- function(omega, PC = c(1,2), country, main = expression(Omega*" matrix"),
group_palette, pch = 16, expand_factor = 1.5) {
om.svd <- svd(omega)
eig <- om.svd$d
pcent.var <- 100 * eig / sum(eig)
# Assign colors based on country
col <- group_palette[country]
# Get min and max values for PC axes
x_range <- range(om.svd$u[, PC[1]])
y_range <- range(om.svd$u[, PC[2]])
# Expand range symmetrically
x_expand <- diff(x_range) * (expand_factor - 1) / 2
y_expand <- diff(y_range) * (expand_factor - 1) / 2
xlim <- x_range + c(-x_expand, x_expand)
ylim <- y_range + c(-y_expand, y_expand)
# Adjust margins for legend space
par(mar = c(5, 5, 4, 6)) # Right margin adjusted for legend
plot(om.svd$u[, PC], main = main, pch = pch, col = col,
xlim = xlim, ylim = ylim,
xlab = paste0("PC", PC[1], " (", round(pcent.var[PC[1]], 2), "%)"),
ylab = paste0("PC", PC[2], " (", round(pcent.var[PC[2]], 2), "%)"),
cex.lab = 1.5, cex.axis = 1.3, cex.main = 1.8
)
# Fix legend position
legend("topright", inset = c(-0.15, 0), legend = names(group_palette),
col = group_palette, pch = 16, title = "Country",
cex = 1, bty = "n", xpd = TRUE, ncol = 2)
}
# Increase plotting area
dev.new(width = 10, height = 8)
group_palette <- c("Bosnia"="orangered3", "France"="gold2","Germany"= "darkorchid3",
"Greece"="navyblue", "Italy"="turquoise2", "Norway"="green3",
"Slovakia"="blue", "Slovenia"="red", "Spain"="black",
"Sweden"="gray", "Switzerland"="orange", "UK"="darkgreen")
#order country in the matrix
country_vector <- c("Italy", "Switzerland","Slovenia","Norway","Spain","Spain","Spain","UK","Italy","Italy","Slovenia","Slovakia","Switzerland","Switzerland","Greece","Greece","Greece","Germany","Spain","UK","France","Sweden","France","Spain","Slovenia","Italy","Bosnia","Switzerland","UK")
# Run the function
plot.omega(omega, PC = c(1,2), country = country_vector, group_palette = group_palette)
# as a correlation plot
cor_mat <- cov2cor(omega)
corrplot::corrplot(cor_mat)
# corrplot(cor_mat,method="color",mar=c(2,1,2,2)+0.1,
# main=expression("Correlation map based on"~hat(Omega)))
# as a heatmap and hierarchical clustering tree (using the average agglomeration method)
##we use the population names
pop_names <- climatic_data$Population
dimnames(omega)=list(pop_names,pop_names)
#cor matrix
cor_mat <- cov2cor(omega)
hclust_ave <- function(x) hclust(x, method="average")
Heatmap_omega_matrix<-heatmap(1-cor_mat,hclustfun = hclust_ave,
main=expression("Heatmap of "~hat(Omega)~"("*d[ij]*"=1-"*rho[ij]*")"))
The plot of the omega (covariance) matrix closely resembles the population distribution obtained from a PCA using the first two axes. This indicates that the correction for population structure in BAYPASS is comparable to the corrections achieved with pRDA, LFMM, and Gradient Forest analyses.
We need to run the IS models multiple times to ensure that the outputs are consistent across runs. The models are executed using the IS method via the Linux command-line interface of BAYPASS. We will interpret the results using the Bayes factor (BF) and the empirical Bayesian p-values (eBPis).
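The five IS runs can be launched with commands of the following shape (a sketch assembled from the BayPass 2.x manual's option names; the binary name g_baypass and the file paths are assumptions, and the seeds 12 to 16 match the runs analysed in this section):

```r
# Hedged sketch: build one STD-model (IS) command per seed; each would be
# run with system() on the Linux machine where BayPass is installed.
cmds <- sapply(12:16, function(seed) {
  paste("g_baypass",
        "-npop 29",
        "-gfile data_allele_count_BAYPASS_MAF_c.txt",
        "-efile climatic_data_BAYPASS.txt",
        "-omegafile out_mat_omega.out",
        "-seed", seed,
        "-outprefix", paste0("out_BAYPASS_outliers_ide_", seed))
})
# for (cmd in cmds) system(cmd)
```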
First, we extracted the BF and eBPis from the BAYPASS output of each run (out_BAYPASS_outliers_ide_{seed}_summary_betai_reg.out).
#names snp
names_snps <- colnames(genomic_data_maf_c)
#name climatic data
climatic_variables <- colnames(climatic_data)
COVARIABLE <- c(1,2,3,4,5,6) #to merge with output results of BAYPASS and have the name of the climatic variables
climatic_variables_merge <- data.frame(climatic_variables[-c(1,2)], COVARIABLE)
#only the name of the variables
names_climatic_variables <- climatic_variables[-c(1,2)]
for(x in 1:5){
seed <- x + 11 #the selected seeds of the runs are 12 to 16
output_BAYPASS_run <- read.table(paste0("C:/Users/tfrancisco/Documents/Thesis/Results/species/taxus/GEA_new_var/BAYPASS/runs/out_BAYPASS_outliers_ide_",seed,"_summary_betai_reg.out"),h=T)
#add the name of the climatic variables
BAYPASS_clim <- merge(climatic_variables_merge,output_BAYPASS_run,"COVARIABLE")
BAYPASS_clim$climatic_variables..c.1..2.. <- as.factor(BAYPASS_clim$climatic_variables..c.1..2..)
#add names snps
subset_name <- paste0("RUN_",x, "_BAYPASS_results") # Name the output dataframe of the loop
Data_results <- BAYPASS_clim %>%
group_by(climatic_variables..c.1..2..) %>%
mutate(names_snps = names_snps) %>%
ungroup()
assign(subset_name, Data_results)
}
names_subset_data_frame <- c("RUN_1_BAYPASS_results","RUN_2_BAYPASS_results","RUN_3_BAYPASS_results","RUN_4_BAYPASS_results","RUN_5_BAYPASS_results")
We also added the run information to the BF and eBPis column names.
## we also add the RUN info
for (x in 1:5) {
var <- paste0("RUN_",x,"_BAYPASS_results")
# Create the new dataframe name
names_final <- paste0(var, "_final")
# Extract the dataframe using get
data <- get(var)
name_BF.dB. <- paste0("BF_RUN",x)
nam_eBPis <- paste0("eBPis_RUN",x)
# Mutate the data
names(data)[names(data)== "BF.dB."] <- name_BF.dB.
names(data)[names(data)== "eBPis"] <- nam_eBPis
# Assign the mutated data to a new dataframe
assign(names_final, data)
}
Finally, we merged all the runs into one dataframe.
#merge all runs
data_tot_results_allRUNs <- cbind(RUN_1_BAYPASS_results_final[,c(12,2,8,11)],RUN_2_BAYPASS_results_final[,c(8,11)],RUN_3_BAYPASS_results_final[,c(8,11)],RUN_4_BAYPASS_results_final[,c(8,11)],RUN_5_BAYPASS_results_final[,c(8,11)])
Next, we assessed whether the runs produced consistent results. To do this, we calculated the correlation of the Bayes factors (BF) and empirical Bayesian p-values (eBPis) across runs for each climatic variable.
First, we subset the dataset of all runs by climatic variable to create one dataset per variable.
#subset the results at the climatic variable scale to compare values between runs
for(var in names_climatic_variables){
subset_name <- paste0(var, "_BAYPASS_all_runs") #name the output dataframe of the loop
assign(subset_name, subset(data_tot_results_allRUNs, climatic_variables..c.1..2.. == var))#assign the name to the dataframe in a loop
}
#list of dataset with data of all run for each climatic data
final_names_dataframe <- c("Annual_Tc_BAYPASS_all_runs","Diurnal_range_Tc_BAYPASS_all_runs","Tc_Seasonality_BAYPASS_all_runs","Tc_driest_quarter_BAYPASS_all_runs","Annual_P_BAYPASS_all_runs","P_Seasonality_BAYPASS_all_runs")
#name climatic data
names_climatic_variables <- colnames(climatic_data[,-c(1,2)])
Then, we wanted to see if these values were consistent across runs by testing the correlation of the values across runs:
#correlation between runs for BF
list_title <- c("BIO 1","BIO 2","BIO 4","BIO 9","BIO 12", "BIO 15")
for(x in 1:6){
var <- final_names_dataframe[x]
title_name <- list_title[x]
#names of the corr matrix for each biovariable
names_corr <- paste0("correlation_",var,"_BF")
#title of corrplot
title <- (paste0(title_name," all runs"))
#group for each bio var with the values of Bayes factor only
corr_bio <- cor(get(var)[, grepl("BF", names(get(var)))])
#name the corr_bio with names_corr
assign(names_corr,corr_bio)
#plot corrplot
corr_plot <- corrplot(get(names_corr), method = "number", addrect = 2, col = c("red", "white", "red"), type = "lower", tl.col = "black", tl.cex = 1, number.cex = 1, title = title,mar=c(0,0,1,0) )
#save
#save
pdf(paste0("C:/Users/tfrancisco/Documents/Thesis/Results/species/taxus/GEA_new_var/BAYPASS/figures/correlation_BF_values_across_runs_",names_climatic_variables[x],".pdf"))
corrplot(get(names_corr), method = "number", addrect = 2, col = c("red", "white", "red"), type = "lower", tl.col = "black", tl.cex = 1, number.cex = 1, title = title, mar = c(0,0,1,0))
dev.off()
}
Interpretation: Bayes factor (BF) values show high consistency across runs, with correlations for all climatic variables ≥ 0.84.
Conclusion: Candidate SNPs can be identified based on the mean BF across runs, indicating robust associations with the climatic variables.
for(var in final_names_dataframe){
#names of the corr matrix for each biovariable
names_corr <- paste0("correlation_",var,"_eBPis")
#title of corrplot
title <- (paste0(var,"_eBPis"))
#group for each bio var with the values of empirical bayesian pvalues only
corr_bio <- cor(get(var)[, grepl("eBPis", names(get(var)))])
#name the corr_bio with names_corr
assign(names_corr,corr_bio)
#plot corrplot
#corr_plot <- corrplot(get(names_corr), method = "number", addrect = 2, col = c("red", "white", "red"), type = "lower", tl.col = "black", tl.cex = 0.6, number.cex = 0.6, title = title,mar=c(0,0,1,0) )
}
Interpretation: Empirical Bayesian p-values (eBPis) show high consistency across runs, with correlations for all climatic variables exceeding 0.80. Conclusion: We can also identify candidate SNPs based on the mean eBPis across runs.
To select candidate SNPs, we will set a threshold based on the Bayes
factor (BF) and empirical Bayesian p-values (eBPis). Following Gautier
(2015), we will apply Jeffreys’ rule (Jeffreys, 1961) to quantify the
strength of evidence in favour of the association between the SNP and
the covariate, using the following dB unit scale:
- 10 dB < BF < 15 dB: strong evidence
- 15 dB < BF < 20 dB: very strong evidence
- BF > 20 dB: decisive evidence
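BayPass reports BF.dB., the Bayes factor expressed in deciban units, BF(dB) = 10 log10(BF). A quick sanity check of where raw Bayes factors fall on this scale:

```r
# Convert a raw Bayes factor to deciban (dB) units, the scale used above
bf_db <- function(bf) 10 * log10(bf)

bf_db(c(1, 10, 100))
# 0 10 20
```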
In addition, we considered eBPis, with values greater than 3 indicating candidates, as suggested by Ruiz Daniels et al. (2019). We also considered the top 1%, top 0.5% or top 100 SNPs with the highest correlation with the climatic variables.
To begin this selection process, we calculate the mean BF values across runs for each SNP.
final_names_dataframe <- c("Annual_Tc_BAYPASS_all_runs","Diurnal_range_Tc_BAYPASS_all_runs","Tc_Seasonality_BAYPASS_all_runs","Tc_driest_quarter_BAYPASS_all_runs","Annual_P_BAYPASS_all_runs","P_Seasonality_BAYPASS_all_runs")
#mean BF values
for(data in final_names_dataframe){
#names of the corr matrix for each biovariable
names_corr <- paste0("mean_",data,"_BF_values")
#select only the BF values
dataset <- get(data)[, grepl("BF", names(get(data)))]
#mean
mean_BF_values <- data.frame(names_snps,rowMeans(dataset)); colnames(mean_BF_values)=c("names_snps","BF_values")
#store the mean-BF dataframe under names_corr
assign(names_corr,mean_BF_values)
}
list_mean_BF_clim <- c("mean_Annual_Tc_BAYPASS_all_runs_BF_values","mean_Diurnal_range_Tc_BAYPASS_all_runs_BF_values","mean_Tc_Seasonality_BAYPASS_all_runs_BF_values","mean_Tc_driest_quarter_BAYPASS_all_runs_BF_values","mean_Annual_P_BAYPASS_all_runs_BF_values","mean_P_Seasonality_BAYPASS_all_runs_BF_values")
Next, we will identify candidates by applying a Bayes Factor (BF) threshold of 10 for each climatic variable.
thres_BF <- 10 #threshold of BF
for(x in 1:6){
#names of the corr matrix for each biovariable
names_climatic_var <- names_climatic_variables[x]
data <- get(list_mean_BF_clim[x])
selection_outliers <- data.frame(Loci=data$names_snps[which(data$BF_values>thres_BF)],BF = data$BF_values[which(data$BF_values>thres_BF)], climatic_variables=names_climatic_var)
assign(paste0("names_outliers_",names_climatic_var),selection_outliers)
# Count the number of candidates for each climatic variable
count <- data.frame(Climatic_variable=names_climatic_var,Number_outliers=nrow(selection_outliers))
#store the candidate count for this climatic variable
assign(paste0("outliers_",names_climatic_var),count)
}
# Combine the results for each climatic variable to see the number of candidates
outliers_summary_BF10 <- rbind(outliers_Annual_Tc,outliers_Diurnal_range_Tc,outliers_Tc_Seasonality,outliers_Tc_driest_quarter,outliers_Annual_P,outliers_P_Seasonality)
#name of the candidates
outliers_names_summary_BF_10 <- rbind(names_outliers_Annual_Tc,names_outliers_Diurnal_range_Tc,names_outliers_Tc_Seasonality,names_outliers_Tc_driest_quarter,names_outliers_Annual_P,names_outliers_P_Seasonality)
#finally we can search for same candidates across climatic variables
duplicated_loci_BF10 <- duplicated(outliers_names_summary_BF_10$Loci) | duplicated(outliers_names_summary_BF_10$Loci, fromLast = TRUE)
# Subset the data frame to show only the duplicated loci
duplicated_outliers_BF10 <- outliers_names_summary_BF_10[duplicated_loci_BF10, ]
#Number of candidates:
nrow(outliers_names_summary_BF_10)
## [1] 67
#at the climatic var scale
outliers_summary_BF10
## Climatic_variable Number_outliers
## 1 Annual_Tc 9
## 2 Diurnal_range_Tc 7
## 3 Tc_Seasonality 23
## 4 Tc_driest_quarter 8
## 5 Annual_P 3
## 6 P_Seasonality 17
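The duplicated() idiom used above flags every occurrence of a repeated locus, not only the second and later copies. A minimal illustration:

```r
# duplicated() alone marks only later copies; OR-ing with fromLast = TRUE
# marks every copy of a locus found for more than one climatic variable.
loci <- c("snp1", "snp2", "snp1", "snp3")
dup <- duplicated(loci) | duplicated(loci, fromLast = TRUE)
loci[dup]
# "snp1" "snp1"
```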
We can see that 67 candidates are identified using BAYPASS:
- Bio 1: 9
- Bio 2: 7
- Bio 4: 23
- Bio 9: 8
- Bio 12: 3
- Bio 15: 17
In addition, 4 candidates are shared between two climatic variables.
Finally, we saved the set of candidates for downstream analysis:
outliers_T_adapcon_gentree_BAYPASS_BF_10 <- outliers_names_summary_BF_10[,-2]
#write_xlsx(outliers_T_adapcon_gentree_BAYPASS_BF_10,"C:/Users/tfrancisco/Documents/Thèse/Results/species/taxus/GEA_new_var/outliers/outliers_T_adapcon_gentree_BAYPASS_BF_10.xlsx")
save(outliers_T_adapcon_gentree_BAYPASS_BF_10, file="C:/Users/tfrancisco/Documents/Thesis/Results/species/taxus/GEA_new_var/outliers/outliers_T_adapcon_gentree_BAYPASS_BF_10.Rdata")
We can also create a less conservative dataset with a threshold of BF>8:
thres_BF <- 8 #relax threshold
for(x in 1:6){
#names of the corr matrix for each biovariable
names_climatic_var <- names_climatic_variables[x]
data <- get(list_mean_BF_clim[x])
selection_outliers <- data.frame(Loci=data$names_snps[which(data$BF_values>thres_BF)],BF = data$BF_values[which(data$BF_values>thres_BF)], climatic_variables=names_climatic_var)
assign(paste0("names_outliers_",names_climatic_var),selection_outliers)
# Count the number of candidates for each climatic variable
count <- data.frame(Climatic_variable=names_climatic_var,Number_outliers=nrow(selection_outliers))
#store the candidate count for this climatic variable
assign(paste0("outliers_",names_climatic_var),count)
}
# Combine the results for each climatic variable to see the number of candidates
outliers_summary_BF_8 <- rbind(outliers_Annual_Tc,outliers_Diurnal_range_Tc,outliers_Tc_Seasonality,outliers_Tc_driest_quarter,outliers_Annual_P,outliers_P_Seasonality)
#name of the candidates
outliers_names_summary_BF8 <- rbind(names_outliers_Annual_Tc,names_outliers_Diurnal_range_Tc,names_outliers_Tc_Seasonality,names_outliers_Tc_driest_quarter,names_outliers_Annual_P,names_outliers_P_Seasonality)
#finally we can search for same candidates across climatic variables
duplicated_loci_BF_8 <- duplicated(outliers_names_summary_BF8$Loci) | duplicated(outliers_names_summary_BF8$Loci, fromLast = TRUE)
# Subset the data frame to show only the duplicated loci
duplicated_outliers_BF_8 <- outliers_names_summary_BF8[duplicated_loci_BF_8, ]
#Number of candidates:
outliers_summary_BF_8
## Climatic_variable Number_outliers
## 1 Annual_Tc 13
## 2 Diurnal_range_Tc 17
## 3 Tc_Seasonality 35
## 4 Tc_driest_quarter 20
## 5 Annual_P 13
## 6 P_Seasonality 35
#at the climatic var scale
nrow(outliers_names_summary_BF8)
## [1] 133
We can see that 133 candidates are identified using BAYPASS:
- Bio 1: 13
- Bio 2: 17
- Bio 4: 35
- Bio 9: 20
- Bio 12: 13
- Bio 15: 35
In addition, 8 candidates are shared between two climatic variables.
Finally, we saved the set of candidates for downstream analysis:
outliers_T_adapcon_gentree_BAYPASS_BF_8 <- outliers_names_summary_BF8[,-2]
#write_xlsx(outliers_T_adapcon_gentree_BAYPASS_BF_8,"C:/Users/tfrancisco/Documents/Thèse/Results/species/taxus/GEA/outliers/outliers_T_adapcon_gentree_BAYPASS_BF_8.xlsx")
save(outliers_T_adapcon_gentree_BAYPASS_BF_8, file="C:/Users/tfrancisco/Documents/Thesis/Results/species/taxus/GEA_new_var/outliers/outliers_T_adapcon_gentree_BAYPASS_BF_8.Rdata")
Another method of identifying outliers is to look for overlapping SNPs across 5 runs, similar to the approach used in GF, rather than relying on the Bayes factor (BF) means.
list_clim <- c("Annual_Tc","Diurnal_range_Tc","Tc_Seasonality","Tc_driest_quarter","Annual_P","P_Seasonality")
list_runs <- c("RUN_1_BAYPASS_results","RUN_2_BAYPASS_results","RUN_3_BAYPASS_results","RUN_4_BAYPASS_results","RUN_5_BAYPASS_results")
for(x in 1:length(list_runs)){
data <- get(list_runs[x])
for(i in 1:length(list_clim)){
clim_name <- list_clim[i]
data_clim <- data %>% filter(climatic_variables..c.1..2.. == clim_name)
#BF values > 8
outliers_BF_8 <- data_clim[,c(12,2,8)] %>% filter(BF.dB.> 8) %>% pull(names_snps)
assign(paste0("Run",x,"_outliers_BF_8_",clim_name),outliers_BF_8)
#BF values > 10
outliers_BF_10 <- data_clim[,c(12,2,8)] %>% filter(BF.dB.> 10) %>% pull(names_snps)
assign(paste0("Run",x,"_outliers_BF_10_",clim_name),outliers_BF_10)
}
}
list_threshold <- c(8,10)
for(i in 1:length(list_threshold)){
threshold <- list_threshold[i]
for(x in 1:length(list_clim)){
clim_var <- list_clim[x]
data1 <- get(paste0("Run",1,"_outliers_BF_",threshold,"_",clim_var))
data2 <- get(paste0("Run",2,"_outliers_BF_",threshold,"_",clim_var))
data3 <- get(paste0("Run",3,"_outliers_BF_",threshold,"_",clim_var))
data4 <- get(paste0("Run",4,"_outliers_BF_",threshold,"_",clim_var))
data5 <- get(paste0("Run",5,"_outliers_BF_",threshold,"_",clim_var))
#Select only the candidates identified in all 5 runs
outliers_set <- Reduce(intersect, list(data1,data2,data3,data4,data5))
assign(paste0("outliers_",clim_var,"_",threshold),outliers_set)
}
}
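Reduce(intersect, ...) collapses the per-run candidate lists to the loci present in all of them. A minimal illustration:

```r
# Intersect an arbitrary number of candidate lists pairwise, left to right:
# only loci found in every run survive.
runs <- list(c("a", "b", "c"), c("b", "c", "d"), c("b", "c"))
Reduce(intersect, runs)
# "b" "c"
```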
See the final set of outliers:
list_threshold <- c(8,10)
clim_vars <- c("Annual_Tc", "Diurnal_range_Tc", "Tc_Seasonality", "Tc_driest_quarter", "Annual_P", "P_Seasonality")
df_tot_outliers_list <- list()
for (x in 1:length(list_threshold)) {
threshold <- list_threshold[x]
combined_data <- data.frame()
for (var in clim_vars) {
data <- data.frame(set = get(paste0("outliers_", var, "_", threshold)))
data$var <- var
combined_data <- rbind(combined_data, data)
}
unique_data <- combined_data[!duplicated(combined_data$set),] # Ensuring unique rows
df_tot_outliers_list[[as.character(threshold)]] <- unique_data # character key avoids a sparse list indexed up to 10
assign(paste0("df_tot_outliers_", threshold), unique_data)
}
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.3.2 (2023-10-31 ucrt)
## os Windows 11 x64 (build 26100)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate French_France.utf8
## ctype French_France.utf8
## tz Europe/Paris
## date 2025-09-22
## pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-8 2024-09-12 [1] CRAN (R 4.3.3)
## backports 1.4.1 2021-12-13 [1] CRAN (R 4.3.1)
## base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.3.1)
## broom 1.0.8 2025-03-28 [1] CRAN (R 4.3.3)
## bslib 0.6.1 2023-11-28 [1] CRAN (R 4.3.2)
## cachem 1.0.8 2023-05-01 [1] CRAN (R 4.3.1)
## car 3.1-2 2023-03-30 [1] CRAN (R 4.3.2)
## carData 3.0-5 2022-01-06 [1] CRAN (R 4.3.2)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.3.2)
## cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.1)
## colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.1)
## corrplot * 0.92 2021-11-18 [1] CRAN (R 4.3.2)
## curl 5.2.0 2023-12-08 [1] CRAN (R 4.3.2)
## data.table * 1.15.0 2024-01-30 [1] CRAN (R 4.3.2)
## devtools 2.4.5 2022-10-11 [1] CRAN (R 4.3.3)
## digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.1)
## dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.2)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.3.1)
## evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.2)
## fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.1)
## fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.1)
## geigen * 2.3 2019-05-30 [1] CRAN (R 4.3.1)
## generics 0.1.4 2025-05-09 [1] CRAN (R 4.3.2)
## ggplot2 * 3.5.2 2025-04-09 [1] CRAN (R 4.3.2)
## glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.2)
## gtable 0.3.6 2024-10-25 [1] CRAN (R 4.3.3)
## here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.3)
## highr 0.10 2022-12-22 [1] CRAN (R 4.3.1)
## hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.2)
## htmltools 0.5.7 2023-11-03 [1] CRAN (R 4.3.2)
## htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.3.2)
## httpuv 1.6.13 2023-12-06 [1] CRAN (R 4.3.2)
## httr 1.4.7 2023-08-15 [1] CRAN (R 4.3.2)
## import 1.3.2 2024-01-21 [1] CRAN (R 4.3.3)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.3.1)
## jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.2)
## knitr 1.45 2023-10-30 [1] CRAN (R 4.3.2)
## later 1.3.2 2023-12-06 [1] CRAN (R 4.3.2)
## lattice 0.22-5 2023-10-24 [2] CRAN (R 4.3.2)
## lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.3.2)
## lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.2)
## lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.3.2)
## magrittr * 2.0.3 2022-03-30 [1] CRAN (R 4.3.1)
## markdown 1.12 2023-12-06 [1] CRAN (R 4.3.2)
## MASS 7.3-60.0.1 2024-01-13 [2] CRAN (R 4.3.2)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.3.1)
## mime 0.12 2021-09-28 [1] CRAN (R 4.3.1)
## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.2)
## mnormt 2.1.1 2022-09-26 [1] CRAN (R 4.3.1)
## munsell 0.5.1 2024-04-01 [1] CRAN (R 4.3.3)
## mvtnorm * 1.3-1 2024-09-03 [1] CRAN (R 4.3.3)
## nlme 3.1-164 2023-11-27 [2] CRAN (R 4.3.2)
## patchwork 1.2.0 2024-01-08 [1] CRAN (R 4.3.2)
## pillar 1.10.2 2025-04-05 [1] CRAN (R 4.3.3)
## pkgbuild 1.4.3 2023-12-10 [1] CRAN (R 4.3.2)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.2)
## pkgload 1.3.4 2024-01-16 [1] CRAN (R 4.3.2)
## plotly 4.10.4 2024-01-13 [1] CRAN (R 4.3.2)
## png 0.1-8 2022-11-29 [1] CRAN (R 4.3.1)
## profvis 0.3.8 2023-05-02 [1] CRAN (R 4.3.2)
## promises 1.2.1 2023-08-10 [1] CRAN (R 4.3.2)
## psych 2.4.1 2024-01-18 [1] CRAN (R 4.3.2)
## purrr 1.0.2 2023-08-10 [1] CRAN (R 4.3.2)
## R6 2.6.1 2025-02-15 [1] CRAN (R 4.3.3)
## radiant.data * 1.6.3 2023-12-17 [1] CRAN (R 4.3.3)
## randomizr 1.0.0 2023-08-10 [1] CRAN (R 4.3.3)
## Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.3.1)
## readr 2.1.5 2024-01-10 [1] CRAN (R 4.3.2)
## readxl 1.4.3 2023-07-06 [1] CRAN (R 4.3.2)
## remotes 2.5.0 2024-03-17 [1] CRAN (R 4.3.3)
## rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.2)
## rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.1)
## rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.3.2)
## rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.2)
## sass 0.4.8 2023-12-06 [1] CRAN (R 4.3.2)
## scales 1.3.0 2023-11-28 [1] CRAN (R 4.3.2)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.2)
## shiny 1.8.0 2023-11-17 [1] CRAN (R 4.3.2)
## shinyAce 0.4.2 2022-05-06 [1] CRAN (R 4.3.3)
## shinyFiles 0.9.3 2022-08-19 [1] CRAN (R 4.3.3)
## stringi 1.8.3 2023-12-11 [1] CRAN (R 4.3.2)
## stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.3.2)
## tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.2)
## tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.3.2)
## tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.3.3)
## timechange 0.3.0 2024-01-18 [1] CRAN (R 4.3.2)
## tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.2)
## urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.2)
## usethis 2.2.3 2024-02-19 [1] CRAN (R 4.3.2)
## vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.2)
## viridisLite 0.4.2 2023-05-02 [1] CRAN (R 4.3.2)
## withr 3.0.2 2024-10-28 [1] CRAN (R 4.3.3)
## writexl * 1.5.0 2024-02-09 [1] CRAN (R 4.3.2)
## xfun 0.41 2023-11-01 [1] CRAN (R 4.3.2)
## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.2)
## yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.2)
##
## [1] C:/Users/tfrancisco/AppData/Local/R/win-library/4.3
## [2] C:/Users/tfrancisco/AppData/Local/Programs/R/R-4.3.2/library
##
## ──────────────────────────────────────────────────────────────────────────────